274 points by mmastrac 4 days ago | 55 comments | View on ycombinator
lastdong 26 minutes ago |
Author’s announcement thread on the NVIDIA developer forums: https://forums.developer.nvidia.com/t/nvidia-greenboost-kern...
nl about 8 hours ago |
The ExLlamaV3 EXL3 2bpw (8 GB, full VRAM) row is an order of magnitude faster than the baseline - but the baseline seems to be the 32GB model running with the KV cache shared to system memory only (I think?)
But if an 8GB model gives sufficient quality, then it seems like that would have worked without the shared-memory trick anyway?
I think the useful apples-to-apples benchmark is currently the Ollama + GreenBoost shim (baseline) (2–5 tps) vs ExLlamaV3 + GreenBoost cache (8–20 tps) comparison.
It would be really useful to see this compared with the existing llama CPU/memory offload. There is a note at the start ("Offload layers to CPU — works, but drops token/s by 5–10× because CPU RAM has no CUDA coherence") - but it is unclear if that 5–10× token speed drop is compared to running a model completely in GPU or compared to the greenboost approach.
I think it is vs GPU, in which case it seems likely the performance is similar to what greenboost is giving but probably much more stable.
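For anyone trying to square the "32GB model" vs "8 GB, full VRAM" figures above: this is just bits-per-weight arithmetic. A rough sketch (the 32B parameter count is my assumption to make the numbers line up; real quantized files add overhead for scales, embeddings, and metadata):

```python
# Back-of-the-envelope: weight storage at different quantization levels.
# params_b is billions of parameters; result ignores format overhead.

def weight_gb(params_b: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB."""
    return params_b * 1e9 * bits_per_weight / 8 / 1e9

for bpw in (16, 8, 4, 2):
    print(f"{bpw:>2} bpw -> {weight_gb(32, bpw):.0f} GB")
```

So a ~32B-parameter model that needs 32 GB at 8 bpw shrinks to ~8 GB at EXL3's 2 bpw, which is why it fits entirely in VRAM.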
daneel_w about 9 hours ago |
"I turned a $95 AMD APU into a 16GB VRAM GPU and it can run stable diffusion!"
Havoc about 8 hours ago |
Does this make sense? I'd have thought the KV is guaranteed to be used 100% of the time while say in a MoE the same can't be said of the weights.
Though I suppose if you're shooting for huge context, then having that allocation go into RAM makes sense, especially when it's allocated but not used yet.
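To put numbers on why long-context KV allocations dominate: here's a rough sizing sketch. The model shape is hypothetical (Llama-style: 32 layers, 8 KV heads, head dim 128, fp16 cache) — plug in the actual model you run:

```python
# Rough KV-cache sizing: bytes = 2 (K and V) x layers x kv_heads
# x head_dim x bytes_per_element x context_length.

def kv_cache_gb(ctx, layers=32, kv_heads=8, head_dim=128, bytes_per=2):
    return 2 * layers * kv_heads * head_dim * bytes_per * ctx / 1e9

for ctx in (4_096, 32_768, 131_072):
    print(f"{ctx:>7} tokens -> {kv_cache_gb(ctx):.1f} GB")
```

At 128k context the cache alone is ~17 GB for this shape — larger than a 2bpw quant of the weights — and most of it sits allocated-but-untouched until the context actually fills, which is the case where spilling it to coherent system RAM hurts least.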
ma2kx about 8 hours ago |
I would prefer to use system memory to cache different models, focusing on things like embedding models, rerankers, and TTS. This is sufficient to run a more complex RAG locally, for example via Mem0, and then use a larger LLM via the cloud.
yjftsjthsd-h about 10 hours ago |
(Still cool, still would benefit from better benchmarks)
NooneAtAll3 about 3 hours ago |
and instead of improving the actual product, it decided to "solve the problem in software"
I expect this greenboost to crash and burn, honestly...
holoduke about 10 hours ago |
In any case, loading a gigantic model just to use system RAM is absurdly slow (due to memory bandwidth), like 1–5 t/s, so it's not practical. It'd take a whole day to process one 86k token request. Just pay a cloud provider $0.01 to do it in 10 seconds.
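The "whole day" figure checks out if you treat the whole request as generation-rate-bound (prompt prefill is usually faster, so this is an upper bound):

```python
# Wall-clock time to churn through 86k tokens at
# memory-bandwidth-bound generation speeds.

def hours(tokens: int, tps: float) -> float:
    return tokens / tps / 3600

for tps in (1, 5):
    print(f"{tps} t/s -> {hours(86_000, tps):.1f} h")
```

At 1 t/s that's ~24 hours; even the optimistic 5 t/s end is still nearly 5 hours.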