Hacker news

Top
New
Past
Ask
Show
Jobs

A 10 year old Xeon is all you need (https://point.free)

735 points by cafkafk 5 days ago | 288 comments | View on ycombinator

cafkafk 5 days ago |

Hi HN. I wrote this post after getting frustrated by the lack of ways to run the new Gemma 4 Drafter models, and mainstream tools not prioritizing this, and hiding all the performance levers.

I ended up getting a modern 26B MoE model (Gemma 4) running at reading speed on an old recycled server with a single Xeon E5-2620 v4 and 128GB of DDR3 RAM (and no GPU). It took a lot of work, but it actually worked out somehow.

I've also linked the quants at the end, but they're not gonna run unless you use the ik_llama-cpp fork I mention, see other posts for more details.

I'm not an ML engineer, so I'm by no means an expert, and the server is busy acting as a Nix cache, but if you have any question, I can try to answer, but best effort.

cmiles8 5 days ago |

We’re not there yet, but the obvious endgame of the present bubble insanity is open models running on local hardware and devices are “good enough” for most use cases. That will completely implode what’s going on at the moment in tech.

deng 5 days ago |

Nice post and technically impressive work. I agree we need to understand the build pipeline and be able to do things locally. However, depending on your electricity cost, it might not make sense financially. These old servers are not energy efficient at all (I'm guessing that old Xeon server will easily pull 200W on load), and that model is currently at 0.1$/0.3$ per 1M tokens (with 76 tps and 262k context) in Openrouter (also, these servers are LOUD).

EDIT: I stand corrected, 200W is apparently way too high of an estimate. I used to run a bunch of old Xeon servers and they slurped watts like crazy, but I can't remember which ones exactly those were.

throwaway2027 5 days ago |

Glad to see other people realizing this. I've been running Gemma 26B-A4B Q4 on a 2012 Xeon with 16GB to 24GB of RAM in a container. It's getting around 8 to 12 tokens per second. Obviously it's not comparable to huge contexts and running it on a GPU and the image decoder in llama.cpp is super slow compared to a GPU but for some small automation tasks and general trivia questions it's decent. The speed is just enough to not have to wait for it to finish so you can read along.

Here's my setup. You may want to figure out what the best optimizations are for your specific CPU like AVX2 because mine didn't have most of them. I did try MTP briefly but I wasn't getting performance improvements. You could play around with the batch sizes for cache or context or go even lower for Q2 and don't overcommit on threads either, but I would suggest either defaults or trying out llama-bench. This isn't by any means the best I assume but it worked decently for me and I sometimes swap out Gemma for Qwen. You could also lower q8_0 to q4_0 for more context but it could hurt quality some say, altough I have noticed it too on some models.

# Building

cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_NATIVE=ON -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS -DGGML_OPENMP=ON

# Running

export OPENBLAS_NUM_THREADS=4

export OMP_NUM_THREADS=4

OPENBLAS_NUM_THREADS=4 OMP_NUM_THREADS=4 \

llama.cpp/build/bin/llama-server -hf unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_XL --temp 1.0 --top-p 0.95 --top-k 64 --min-p 0.00 --jinja --host 0.0.0.0 --port 8080 --cache-type-k q8_0 --cache-type-v q8_0 --threads 4 --threads-batch 4 --ctx-size 8192 -n 8192 --batch-size 2048 --ubatch-size 512 --no-mmap --mlock --chat-template-kwargs '{"enable_thinking":false}' --no-mmproj -np 1 -fa 1

phaser 5 days ago |

What intrigues me the most about AI progress, is not AGI or the model du jour by $AI_UNICORN, but rather what can be run locally. I remember having an amusing, but rather useless model in a beefy gaming PC that I had 6 years ago; and now, something that’s a hundred times better on my M5 laptop.

Should the market react to the memory shortage, the progress of the Apple silicon continue at the same pace, and what we’ll be able to run locally in 6 years will be very exciting. or frightening.

Also I don’t know what this means for the valuation of the AI companies. I remember asking about this very idea to one of their employees at an event and instead of answering he bailed out to grab a cocktail.

montroser 5 days ago |

Result is ~12 tokens per second, as reported by OP down in these comments here.

An impressive effort, and better than I would have thought possible on this hardware -- but still pretty far short of what one needs for an satisfactory interactive session.

jansommer 5 days ago |

The E5-2620 v4 is great. Have been using it for 10 years now. Wanted to upgrade until I saw current prices. I have 64 GB ddr4. Paired it with rx 9060 xt 16 GB and games run as fast as ever. Perhaps the cpu is a slight bottleneck in DOOM The Dark Ages, but i'm at 60 fps, so no problem. Light llm on the gpu is a nobrainer, and it's cool to see that things can be tuned to run ok on the cpu. I bought 2667 v4 a month ago for 30$. I'd expect it to give a decent performance boost but I just haven't had the need for it yet, but pushing into llm like in the article I'd probably upgrade because 2667 can handle slightly faster ram.

hualapais 5 days ago |

Went this route after hemming and hawing over a Mac Studio Pro for some time. Eventually bought and configured a headless HP Z620 with 192 GB of ECC RAM and dual Xeon E5-2680 v2 processors, an Optane AIC, two P102-100s with 10 GB VRAM each, and a minimal bootable SDD running Debian 12.6 with an older, locked version of CUDA that supports the Pascal cards. Run it remotely from the basement via AMT/meshcommander. Just fire up llama.cpp and its front end and connect over the local network. Currently playing with Talkie, Qwen 3.6 27b, and medgemma, but have had good luck with GGUF performance in general after selecting an appropriate quant. Total cost was under $500, but I bought the server via eBay last year; things may be different now.

Details aside, the hope is that ternary LLMs blossom in the coming months and this old hardware can eventually host some very dense models full of factual information, perhaps even larger than the GPU RAM and spilling over to the Optane for IO. Speed would be less important than general factual knowledge. The plan would be to configure then mothball the machine in a Faraday trashcan in the basement, retaining it as a possible "rebuild civilization" oracle should the world fall apart. Of course, power would be an issue in such a scenario, but for how cheap this hardware is and how often AI seems to be practically useful in its latest iterations, why not...

RobotToaster 5 days ago |

Apparently Itanium works quite well for LLMs https://medium.com/@tglozar/running-llama-inference-on-intel...

Which makes sense I suppose.

car 5 days ago |

Similar recent posting with optimizations for older Xeon:

High-Performance AI on a Budget: Optimizing llama.cpp for Qwen3.5 Inference on a Dual-GPU HP Z440

https://news.ycombinator.com/item?id=47320244

vhaudiquet 5 days ago |

The E5 2620-v4 only supports DDR4.

rldjbpin 1 day ago |

the surge of articles on using decommissioned datacentre hw to run LLMs lately, is more of a symptom of the times than their viability. back when intel had a monopoly on cpu and would refuse to give consumers more than four cores, the old xeon route was popular for a different reason.

memory is the bottleneck here (capacity, or rather speed). before you run out to set up your own, try to rather squeeze out the most of your existing hardware. if you are a lucky owner of a lot of cheap memory, you are already in luck. otherwise LM studio allows you to split memory between your gpu and system memory. avoid MoE models or even consider tensor parallelism between the onboard gpu and dedicated one before going for more hardware.

there is little to no benefit for using a specific quantization for your models, so go crazy and test out whatever can easily run for you.

FartyMcFarter 5 days ago |

I may have missed this in the article, but:

What was the net effect of the optimisations? How much faster did it get?

andai 5 days ago |

I want to share something strange. I found a typo or two in the post and this absolutely delighted me, because it implies a human wrote the words. (Or was at least heavily involved in the editing.)

Guess I am a species-ist after all ;)

ryandrake 5 days ago |

I've got an old HP Z-620 workstation with dual E5-2697 v2 CPUs (24 cores total, 48 threads @ 2.7GHz) and 128GB of DDR3 RAM. The docs say it supports up to 192GB, but I wasn't able to get it to POST with all the RAM slots full.

It's still a "homelab" beast and does great with development and GIS/Mapping applications. I was not able to figure out how to run AI workloads on it with decent performance, however, so I finally broke down and got a dedicated GPU for it. It's pretty great what can still be done with older hardware.

NSUserDefaults 5 days ago |

How about the iMac Pro? Would that work? I was able to put 128gb in it (not as easy as the regular iMac but possible).

rvba 5 days ago |

As someone doing this for fun on a windows 11 machine (96gb ram, 5090 24gb) I wonder if I need any flags to keep the model in memory and avoid swapping to ssd?

I use LM studio and qwen3.5 35B - but never figured out if it is swapping or not.

Om am unrelated note, does anyone know a model that can help with this use case:

https://news.ycombinator.com/item?id=48301635

christkv 5 days ago |

Makes you wonder if its possible to squeeze more tps out of a strix halo system using the 16 zen5 cores as well as the gpu.

potus_kushner 5 days ago |

@cafkafk got a recommendation for a good model that fits into 64GB and leaves a couple GB free for other tasks ?

cykros 5 days ago |

Does this mean my 15 year old Phenom is too old? But it has 16 gb of DDR3 RAM!

Admittedly web browsers and it don't get along that well. Literally the only thing that drags though on my Slackware 15 system, and even then usually only when it gets to around 15 or so open tabs.

cbdevidal 5 days ago |

Old hardware is surprisingly effective. I've been considering a side hustle selling offline AI to local businesses who are privacy-sensitive. Medical, legal, places like that.

At the low end, I'd use old Xeons with gobs of DDR3, install some V100s, run a smaller agent for general chat inquiries, and a frontier model for the deeper stuff, with a router that passes between them depending on the complexity.

The frontier model would perform very slowly, but if it's a deep task the user can submit it in a batch in the evening e.g. "Correlate all of these cases and look for patterns" then receive the output with morning coffee.

Of course, AI helped me work out a plan for this. Haha

lreeves 5 days ago |

Doesn't accepting 100% of the MTP draft tokens mean you should just be using the smaller model? Usually the acceptance rate in Qwen36 at least is around 60-70% and the "wrong" tokens are still filled in entirely by the base model, but when you just accept 100% of the draft tokens it seems kind of self defeating unless I'm wrong.

Also I feel like everyone leaves off prompt processing/prefill speeds in these articles. If you are using a very small prompt and asking for mostly generated tokens, sure but I'd love to know the time-to-response of asking for an analysis of an image or a few hundred lines of code.

anon-3988 5 days ago |

I tried to run gemma 4 on this CPU and it did not go well

https://www.techpowerup.com/cpu-specs/ryzen-7-4800u.c2281

It is way too slow

ForOldHack 5 days ago |

Well, lets get started. I have 4 of those machines, and they are Two dual processor. They all had 32GB of ram, so now I have two with 64GB, and two with zero. They all hand stock K5000s, now how two have two cards. I stripped the uni processors ram and video cards, and put those into the dual procs. They have 256Gb SSDs, and two 1TB disk drives. One machine has 8Gb of VRam across two cards. Dual processors are 8Cx2 and 32 Threads. They can easily play 16 videos at once. For AI, I have not found a model that I can get above 3 tokens a second. Not a one.

gigatexal 5 days ago |

What kind of tokens per second did the op get I saw nothing of this written.

Hasan121212 5 days ago |

I think one overlooked advantage of older Xeon systems is their availability. Many people can experiment with local AI deployments at a fraction of the cost of building a brand-new setup.

bitwize 5 days ago |

Successfully ran Gemma4-26B-A4B on my 8yo first-gen Ryzen with a GeForce GTX 1070. It actually ran acceptably well; I was surprised. I even did some coding with it, but the wheels fell abruptly off when it tried several times to use a constant I told it doesn't exist. I only have 32 GiB of RAM in this old bucket, and these results are not worth the RAM consumption, so I put it aside. Maybe if I finish that build with more memory...

rbanffy 4 days ago |

The other day I was considering the adoption of a POWER7+ box. Sadly, Linux hasn't supported POWER7 in quite some time. The machine looked pretty nice, with 4 CPUs with 8 cores each, a total of 128 threads and 512 GB of RAM. I'm not sure it'd run AIX without a license though, which is unfortunate - it's a gorgeous box.

npn 5 days ago |

I bought one AMD MI50 32GB back then when they were sold rather cheap (around $150-$170). it can easily generate over 70 tokens per second for gemma 4 26B moe model (q4).

I have no doubt that we will have another wave of cheap retired server gpus just like before. And that is the time when everyone will have their own models at their home.

Or we can just buy the newest medusa halo mini pc. they will be pretty decent, too, albeit pricey.

asimovDev 5 days ago |

I have an ancient DDR3 Xeon that doesn't support any AVX (dual x5690 and 96GB 1333 MHz RAM). You reckon it would even build / run at all?

bombcar 5 days ago |

Is this John Siracusa? It sounds like it could be something he’d say…

(He has a fully maxed out “last Intel” Mac Pro and laments the lack of replacement).

nurettin 5 days ago |

I also run a Qwen 3.6 moe A4B on old hardware. I set it up with

numactl --membind=1

so it is constrained to one of the memory sticks which speeds up token generation a little.

tomega2134 5 days ago |

I wish this were somehow tagged with AI, so I would know that it's not about say, general computing or cost-efficiency (e.g. using an old xeon machine from ebay instead of new, in these cost-conscious times.)

As it is, the title is click-bait for me, as 1) it says I need at least a Xeon somehow and 2) as it doesn't say what I actually need it for.

kristjansson 5 days ago |

Noting for reference that Gemma4 MTP work is in progress[0] on llama.cpp; similar work for Qwen3.6 landed recently and has been great thus far.

[0]: https://github.com/ggml-org/llama.cpp/pull/23398

Eonexus 5 days ago |

I wonder what the tokens per second actually are. Yes, it does say "reading speed" but that varies for everyone, no?

Aurornis 5 days ago |

llama.cpp includes a benchmarking tool called llama-bench https://github.com/ggml-org/llama.cpp/blob/master/tools/llam...

ik_llama includes llama-sweep-bench https://github.com/ikawrakow/ik_llama.cpp/blob/main/examples...

When comparing hardware, the output of these tools is very helpful to let others put it into context. The post says the output is "reading speed" but knowing the prefill and token generation speeds would be a lot more helpful.

haunter 5 days ago |

And this is one of those CPUs which had dual slot motherboards so you can have double the fun (and power bill)

https://pcpartpicker.com/products/motherboard/#s=20028,20029...

alimbada 5 days ago |

What's the best way to apply this to slightly more modern hardware - i.e. 5800XT 32GB DDR4, 9060XT 16GB?

shevy-java 5 days ago |

The webpage's layout is just horrible. Scrolling is also non-default - and thus rather annoying; I had to stop after two scroll events. Why do people think they need so much fancy effects or non-standard behaviour, if their alleged goal is to get information across to other people?

Liftyee 5 days ago |

Very intriguing. This might be the use for my e5-2430 V2 X2 server that's been lying around. DDR3 is (relatively) cheap now too. Could fit 192GB of RAM in it and play around for much cheaper than a new GPU.

danbruc 5 days ago |

Did some try to estimates what it would take to bake interference for a capable large language model into silicon so that one can pipeline inputs through it and produce outputs at one token per clock cycle?

robotswantdata 5 days ago |

Granite or sapphire rapids are very under rated for MoE inference loads. But you need a GPU for the KV cache.

Plus many boards also support CXL for RAM expansion over PCI 5!

Source: building a hybrid inference business for regulated industry workloads.

fortran77 5 days ago |

My current desktop machine is a 24-core Xeon-3345 with 256GB of RAM and an Nvidia 5090. It still feels extremely fast, even though it's about 8 year old technology with a newer video card.

shovas 5 days ago |

I have run llama.cpp on an i7-2600 with a 1050. It's too slow for everyday usage but it's not too slow to make it obvious AI is going to be everywhere and in everything. It's too easy to run.

ezconnect 5 days ago |

When you use page up and page down key when reading that blog the first line on the screen is obscured by the floating bar or what ever it is. It is not even needed for reading.

api 5 days ago |

Have to point out one boring thing though: this will use a lot more electricity than newer things. So it'll work, but it'll run up your electric bill.

egorfine 5 days ago |

This and the previous one are insanely good articles. Thank you!

mv4 5 days ago |

I have an old 192GB DDR4 Dell Precision with dual Intel Xeon Gold 6130 that I've considered spinning up. What's giving me pause is 250W at idle.

coldcity_again 5 days ago |

This is great work.

I'd love if anyone knows how I might fare with an old Dell R710 with 2 x Xeon 5600 (12 cores total) and 96Gb of DDR3.

dzonga 5 days ago |

for solo operators that run saas (targeting business customers) & if you do a lot of data processing - old servers are the best bang for the buck.

remember if you serve real customers as a bootstrapped business - you can afford the whole serve down for maintenance. no need for 99.999%.

better than hetzner.

SirMaster 5 days ago |

Either they have a E5-2620 V2 from 13 years ago, or they have DDR4, not DDR3. The V3 and V4 only support DDR4.

bflesch 5 days ago |

Might consider going for even older CPUs which don't have the Intel ME ring -3 thing which is full of backdoors

qingcharles 5 days ago |

Would there be any advantage of running this as dual Xeon? The CPUs are $5 and a dual mobo is $50...

Floppyrom 2 days ago |

Famous last words...

hparadiz 5 days ago |

I'm now staring at a 10 year old 4U with 256 GB of DDR4 and thinking hmmmmm

sperandeo 5 days ago |

ive been doing the same thing. i refactored a old newtek stream machine . its my new favorite thing to do! adding old PCs to my "starcraft" fleet xD

1970-01-01 5 days ago |

Hah. My Xeon turns 20 this year. No issues.

undefined 5 days ago |

undefined

b65e8bee43c2ed0 5 days ago |

so how many tokens/s do you get, pp and tg? did I miss it in the article?

SXX 5 days ago |

Now we need someone try run Kimi K2.6 on old Xeon and DDR3. After all these platforms do support up to 768GB RAM.

maxothex 5 days ago |

[flagged]

6_7 5 days ago |

[dead]

hypfer 5 days ago |

> The argument for speculative decoding is stronger on CPU than on GPU.

Uh. Uuuh.

No?

___

Also

> While a GPU has a massive pool of ultra-fast High-Bandwidth Memory (HBM), a CPU relies on small, lightning-fast “caches” (L1, L2, L3) built directly onto the processor chip.

What purpose does the quoting of "caches" serve there? Is this AI writing written by that model running on that host?