735 points by cafkafk 5 days ago | 288 comments
cafkafk 5 days ago |
cmiles8 5 days ago |
deng 5 days ago |
EDIT: I stand corrected, 200W is apparently way too high of an estimate. I used to run a bunch of old Xeon servers and they slurped watts like crazy, but I can't remember which ones exactly those were.
throwaway2027 5 days ago |
Here's my setup. You may want to figure out which optimizations your specific CPU supports, like AVX2, because mine didn't have most of them. I did try MTP briefly but wasn't getting any performance improvement. You could play around with the batch sizes, cache, or context, or go even lower to Q2, and don't overcommit on threads either, but I would suggest either sticking with the defaults or trying out llama-bench. I assume this isn't the best setup by any means, but it worked decently for me, and I sometimes swap out Gemma for Qwen. You could also lower q8_0 to q4_0 for more context; some say it can hurt quality, and I have noticed that too on some models.
# Building
cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_NATIVE=ON -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS -DGGML_OPENMP=ON
cmake --build build --config Release -j
# Running
export OPENBLAS_NUM_THREADS=4
export OMP_NUM_THREADS=4
llama.cpp/build/bin/llama-server -hf unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_XL \
  --temp 1.0 --top-p 0.95 --top-k 64 --min-p 0.00 --jinja --host 0.0.0.0 --port 8080 \
  --cache-type-k q8_0 --cache-type-v q8_0 --threads 4 --threads-batch 4 \
  --ctx-size 8192 -n 8192 --batch-size 2048 --ubatch-size 512 --no-mmap --mlock \
  --chat-template-kwargs '{"enable_thinking":false}' --no-mmproj -np 1 -fa 1
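For a rough sense of what dropping the KV cache from q8_0 to q4_0 buys, here is a back-of-envelope sketch. The layer count, KV head count, and head dimension are hypothetical placeholders, not the real architecture of any model in this thread, and the formula ignores sliding-window attention and similar per-model savings, so treat the numbers as an illustration only. The bits-per-element figures assume ggml's block formats (q8_0 stores 32 values in 34 bytes, q4_0 in 18 bytes).

```shell
# Rough KV cache size: 2 (keys + values) * layers * kv_heads * head_dim * context
# elements, each stored at the given number of bits.
kv_cache_mib() {
  # $1 = layers, $2 = KV heads, $3 = head dim, $4 = context tokens, $5 = bits per element
  awk -v l="$1" -v h="$2" -v d="$3" -v ctx="$4" -v bits="$5" 'BEGIN {
    bytes = 2 * l * h * d * ctx * bits / 8
    printf "%.0f\n", bytes / (1024 * 1024)
  }'
}

# Hypothetical model: 32 layers, 8 KV heads, head dim 128, 8192-token context.
kv_cache_mib 32 8 128 8192 16    # f16  -> prints 1024
kv_cache_mib 32 8 128 8192 8.5   # q8_0 -> prints 544
kv_cache_mib 32 8 128 8192 4.5   # q4_0 -> prints 288
```

Under these assumptions, q4_0 roughly halves the cache compared with q8_0, which is where the extra context would come from.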
phaser 5 days ago |
Should the market react to the memory shortage and the progress of Apple silicon continue at the same pace, what we’ll be able to run locally in 6 years will be very exciting. Or frightening.
Also I don’t know what this means for the valuation of the AI companies. I remember asking one of their employees about this very idea at an event, and instead of answering he bailed out to grab a cocktail.
montroser 5 days ago |
An impressive effort, and better than I would have thought possible on this hardware -- but still pretty far short of what one needs for a satisfactory interactive session.
jansommer 5 days ago |
hualapais 5 days ago |
Details aside, the hope is that ternary LLMs blossom in the coming months and this old hardware can eventually host some very dense models full of factual information, perhaps even larger than the GPU RAM and spilling over to the Optane for IO. Speed would be less important than general factual knowledge. The plan would be to configure then mothball the machine in a Faraday trashcan in the basement, retaining it as a possible "rebuild civilization" oracle should the world fall apart. Of course, power would be an issue in such a scenario, but for how cheap this hardware is and how often AI seems to be practically useful in its latest iterations, why not...
RobotToaster 5 days ago |
Which makes sense I suppose.
car 5 days ago |
High-Performance AI on a Budget: Optimizing llama.cpp for Qwen3.5 Inference on a Dual-GPU HP Z440
vhaudiquet 5 days ago |
rldjbpin 1 day ago |
memory is the bottleneck here (capacity, or rather speed). before you run out to set up your own, rather try to squeeze the most out of your existing hardware. if you already own a lot of cheap memory, you are in luck. otherwise LM studio allows you to split the model between your gpu memory and system memory. avoid MoE models, or even consider tensor parallelism between the onboard gpu and the dedicated one, before going for more hardware.
there is little to no benefit to using one specific quantization for your models, so go crazy and test out whatever runs easily for you.
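To put a number on the memory-speed bottleneck, a common rule of thumb is that each generated token has to stream the active weights from RAM once, so generation speed is capped at bandwidth divided by bytes read per token. The sketch below applies that rule; the 60 GB/s bandwidth, the parameter counts, and the 4.5 bits per weight are illustrative assumptions, not measurements from anyone's machine.

```shell
# Upper bound on generation speed when memory bandwidth is the limit:
#   tokens/s <= bandwidth / (active parameters * bits per weight / 8)
est_tok_per_s() {
  # $1 = memory bandwidth in GB/s, $2 = active parameters in billions,
  # $3 = average bits per weight after quantization
  awk -v bw="$1" -v params="$2" -v bits="$3" 'BEGIN {
    gb_per_token = params * bits / 8   # GB streamed per generated token
    printf "%.1f\n", bw / gb_per_token
  }'
}

# Hypothetical: 60 GB/s usable bandwidth, 4B active parameters, 4.5 bits/weight.
est_tok_per_s 60 4 4.5    # prints 26.7
# Same hypothetical machine reading 26B parameters per token.
est_tok_per_s 60 26 4.5   # prints 4.1
```

Real throughput lands below this ceiling, since compute, cache misses, and KV cache reads all take their share.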
FartyMcFarter 5 days ago |
What was the net effect of the optimisations? How much faster did it get?
andai 5 days ago |
Guess I am a species-ist after all ;)
ryandrake 5 days ago |
It's still a "homelab" beast and does great with development and GIS/Mapping applications. I was not able to figure out how to run AI workloads on it with decent performance, however, so I finally broke down and got a dedicated GPU for it. It's pretty great what can still be done with older hardware.
NSUserDefaults 5 days ago |
rvba 5 days ago |
I use LM studio and qwen3.5 35B - but never figured out if it is swapping or not.
On an unrelated note, does anyone know a model that can help with this use case:
christkv 5 days ago |
potus_kushner 5 days ago |
cykros 5 days ago |
Admittedly web browsers and it don't get along that well. Literally the only thing that drags though on my Slackware 15 system, and even then usually only when it gets to around 15 or so open tabs.
cbdevidal 5 days ago |
At the low end, I'd use old Xeons with gobs of DDR3, install some V100s, run a smaller agent for general chat inquiries, and a frontier model for the deeper stuff, with a router that passes between them depending on the complexity.
The frontier model would perform very slowly, but if it's a deep task the user can submit it in a batch in the evening e.g. "Correlate all of these cases and look for patterns" then receive the output with morning coffee.
Of course, AI helped me work out a plan for this. Haha
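The router idea above could be sketched roughly like this. Everything here is hypothetical: the keyword list, the 200-word threshold, and the "chat" and "batch" labels are placeholders for whatever a real deployment would use, and a production router would more likely use a small classifier than string matching.

```shell
# Decide whether a prompt goes to the fast small model ("chat") or is queued
# for the slow large model ("batch"). Heuristics are illustrative only.
route() {
  prompt="$1"
  # Keywords that hint at a deep, long-running task (hypothetical list).
  case "$prompt" in
    *[Cc]orrelate*|*[Aa]nalyze*|*[Pp]atterns*) echo "batch"; return ;;
  esac
  # Very long prompts are also treated as deep tasks (hypothetical threshold).
  words=$(printf '%s' "$prompt" | wc -w | tr -d ' ')
  if [ "$words" -gt 200 ]; then
    echo "batch"
  else
    echo "chat"
  fi
}

route "What is the capital of France?"
route "Correlate all of these cases and look for patterns"
```

The first call would be answered right away by the small model; the second would sit in the overnight queue.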
lreeves 5 days ago |
Also I feel like everyone leaves off prompt processing/prefill speeds in these articles. If you are using a very small prompt and asking mostly for generated tokens, sure, but I'd love to know the time-to-response of asking for an analysis of an image or a few hundred lines of code.
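A quick sketch of why prefill speed matters for time-to-response: the wait before the first token is the prompt length divided by the prefill speed, and generation time is added on top. The speeds and token counts below are made-up inputs for illustration, not figures from the article.

```shell
# Time to first token and total response time from prefill and generation speeds.
response_time() {
  # $1 = prompt tokens, $2 = prefill tok/s, $3 = output tokens, $4 = generation tok/s
  awk -v p="$1" -v pp="$2" -v n="$3" -v tg="$4" 'BEGIN {
    ttft = p / pp                      # seconds until the first token appears
    printf "ttft=%.0fs total=%.0fs\n", ttft, ttft + n / tg
  }'
}

# Hypothetical: 4000-token prompt at 20 tok/s prefill, 400 tokens out at 8 tok/s.
response_time 4000 20 400 8   # prints ttft=200s total=250s
```

With these assumed numbers, most of the wait is prefill, which a generation-only "reading speed" figure would hide.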
anon-3988 5 days ago |
https://www.techpowerup.com/cpu-specs/ryzen-7-4800u.c2281
It is way too slow
ForOldHack 5 days ago |
gigatexal 5 days ago |
Hasan121212 5 days ago |
bitwize 5 days ago |
rbanffy 4 days ago |
npn 5 days ago |
I have no doubt that we will have another wave of cheap retired server gpus just like before. And that is the time when everyone will have their own models at their home.
Or we can just buy the newest Medusa Halo mini PC. They will be pretty decent too, albeit pricey.
asimovDev 5 days ago |
bombcar 5 days ago |
(He has a fully maxed out “last Intel” Mac Pro and laments the lack of replacement).
nurettin 5 days ago |
numactl --membind=1
so it is constrained to the memory of one NUMA node, which speeds up token generation a little.
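A slightly fuller version of that idea might look like the sketch below. Pairing `--membind` with `--cpunodebind` is an assumption on my part, not what the comment above ran, and the node number and server path are placeholders; `numactl --hardware` lists the nodes that actually exist on a given machine. The helper only prints the command so it can be inspected first.

```shell
# Build a numactl command that keeps both the threads and their memory
# allocations on the same NUMA node. Prints the command instead of running it.
numa_cmd() {
  # $1 = NUMA node number, remaining arguments = the command to wrap
  node="$1"
  shift
  echo numactl --cpunodebind="$node" --membind="$node" "$@"
}

# Hypothetical invocation; adjust the node and the binary path to your system.
numa_cmd 1 llama.cpp/build/bin/llama-server --threads 4
```

To actually launch the server, run the printed line, or drop the `echo` inside the helper.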
tomega2134 5 days ago |
As it is, the title is click-bait for me, as 1) it says I need at least a Xeon somehow and 2) it doesn't say what I actually need it for.
kristjansson 5 days ago |
Eonexus 5 days ago |
Aurornis 5 days ago |
ik_llama includes llama-sweep-bench https://github.com/ikawrakow/ik_llama.cpp/blob/main/examples...
When comparing hardware, the output of these tools is very helpful to let others put it into context. The post says the output is "reading speed" but knowing the prefill and token generation speeds would be a lot more helpful.
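For anyone wanting to report both numbers, a minimal sketch with mainline llama.cpp's `llama-bench` could look like this. The model path and binary location are placeholders, and the thread counts are arbitrary; `-p` sets the prompt length for the prefill test and `-n` the number of generated tokens. The loop only prints the commands so they can be checked before running.

```shell
# Placeholder paths; point these at your own build and GGUF file.
BENCH="llama.cpp/build/bin/llama-bench"
MODEL="model.gguf"

bench_cmd() {
  # $1 = thread count; 512 prompt tokens (prefill) and 128 generated tokens
  echo "$BENCH" -m "$MODEL" -p 512 -n 128 -t "$1"
}

# Print one benchmark command per thread count.
for t in 2 4 8; do
  bench_cmd "$t"
done
```

Piping the printed lines to `sh` would run them; the resulting table reports prefill and generation throughput as separate rows, which is what makes results comparable across machines.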
haunter 5 days ago |
https://pcpartpicker.com/products/motherboard/#s=20028,20029...
alimbada 5 days ago |
shevy-java 5 days ago |
Liftyee 5 days ago |
danbruc 5 days ago |
robotswantdata 5 days ago |
Plus many boards also support CXL for RAM expansion over PCIe 5!
Source: building a hybrid inference business for regulated industry workloads.
fortran77 5 days ago |
shovas 5 days ago |
ezconnect 5 days ago |
api 5 days ago |
egorfine 5 days ago |
mv4 5 days ago |
coldcity_again 5 days ago |
I'd love to hear if anyone knows how I might fare with an old Dell R710 with 2 x Xeon 5600 (12 cores total) and 96GB of DDR3.
dzonga 5 days ago |
remember, if you serve real customers as a bootstrapped business, you can afford to take the whole server down for maintenance. no need for 99.999%.
better than hetzner.
SirMaster 5 days ago |
bflesch 5 days ago |
qingcharles 5 days ago |
Floppyrom 2 days ago |
hparadiz 5 days ago |
sperandeo 5 days ago |
1970-01-01 5 days ago |
undefined 5 days ago |
b65e8bee43c2ed0 5 days ago |
SXX 5 days ago |
maxothex 5 days ago |
6_7 5 days ago |
hypfer 5 days ago |
Uh. Uuuh.
No?
___
Also
> While a GPU has a massive pool of ultra-fast High-Bandwidth Memory (HBM), a CPU relies on small, lightning-fast “caches” (L1, L2, L3) built directly onto the processor chip.
What purpose does the quoting of "caches" serve there? Is this AI writing written by that model running on that host?
I ended up getting a modern 26B MoE model (Gemma 4) running at reading speed on an old recycled server with a single Xeon E5-2620 v4 and 128GB of DDR3 RAM (and no GPU). It took a lot of work, but it actually worked out somehow.
I've also linked the quants at the end, but they're not gonna run unless you use the ik_llama.cpp fork I mention, see other posts for more details.
I'm not an ML engineer, so I'm by no means an expert, and the server is busy acting as a Nix cache, but if you have any questions, I'll try to answer on a best-effort basis.