1051 points by rvz 4 days ago | 392 comments
senko 4 days ago |
minimaxir 4 days ago |
> Vision: We replaced Gemma 4’s vision encoder with a lightweight embedding module consisting of a single matrix multiplication, positional embedding and normalizations.
That's technically still encoding, just without a dedicated model like SigLIP? The Developer's Guide elaborates: it's still a 35M layer, and I'm curious whether that is robust enough. https://developers.googleblog.com/gemma-4-12b-the-developer-...
> Small enough to run locally on consumer laptops with 16GB of RAM, it unlocks powerful multimodal and agentic experiences right on your machine.
I am assuming that involves quantization, which due to the quality loss makes that statement somewhat misleading IMO.
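For anyone trying to picture the quoted vision module ("a single matrix multiplication, positional embedding and normalizations"), here is a minimal sketch of one way such an embedding could look. Everything specific is an assumption for illustration: the patch size, the use of RMS normalization, where the normalizations sit, and the names `embed_patches`, `weight` and `pos_emb`. It is not Gemma 4's actual implementation.

```python
import math


def rms_norm(vec, eps=1e-6):
    """Scale a vector so its root-mean-square is about 1 (assumed norm type)."""
    rms = math.sqrt(sum(x * x for x in vec) / len(vec) + eps)
    return [x / rms for x in vec]


def embed_patches(image, patch, weight, pos_emb):
    """Turn a single-channel image into one token vector per patch.

    image:   H x W nested list of floats, H and W divisible by `patch`
    weight:  (patch*patch) x dim projection matrix (the single matmul)
    pos_emb: one dim-sized vector per patch position
    """
    height, width = len(image), len(image[0])
    dim = len(weight[0])
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            # Flatten the patch pixels into one vector and normalize it.
            flat = [image[r][c]
                    for r in range(top, top + patch)
                    for c in range(left, left + patch)]
            flat = rms_norm(flat)
            # Single matrix multiplication: (patch*patch) -> dim.
            proj = [sum(f * weight[i][j] for i, f in enumerate(flat))
                    for j in range(dim)]
            # Add the positional embedding for this patch index, then normalize.
            pos = pos_emb[len(tokens)]
            tokens.append(rms_norm([p + e for p, e in zip(proj, pos)]))
    return tokens


# Toy 4x4 image, 2x2 patches, 3-dimensional tokens (all values made up).
image = [[float(r * 4 + c + 1) for c in range(4)] for r in range(4)]
weight = [[0.1 * (i + 1) * (j + 1) for j in range(3)] for i in range(4)]
pos_emb = [[0.01 * (k + 1)] * 3 for k in range(4)]
tokens = embed_patches(image, 2, weight, pos_emb)
```

The point of the sketch is only that there is no attention or multi-layer encoder here: each patch becomes a token through one projection, so any "understanding" of the image has to happen inside the language model itself.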
asim 4 days ago |
I'm shocked, but also not surprised, that they continue to develop such efficiencies. Honestly it's like silicon and CPU architecture advancement. We kept shrinking it and shrinking it and it kept getting more and more powerful, and here we are with AI, and it's only going to be 100x more efficient with time. Maybe there's some point of decay, but essentially the next 30 years will be more advanced than the last 30, and we're going to be living in some sort of futurist Blade Runner scenario where gene editing is repairing ageing cells and organs and curing all sorts of cancers that haven't even appeared yet. Beyond our lifetimes people will live to 125 quite steadily and with great mobility, and then obviously people will look at how we get to living 1000 years, which anyone who is religious knows Noah and others lived to in a totally different era.
Anyway I'm going off on some tangent but look back 30 years. Now look forward 30 years. It's going to be insane. May God protect us.
ethanpil 4 days ago |
Is it simply goodwill and/or marketing? Or am I missing something strategic?
petercooper 4 days ago |
It may be that the Q6 quant I got is borked (or my LM Studio is), but either way, the 0.8b's performance is mind boggling in comparison.
ComputerGuru 4 days ago |
A model that comfortably fits in 16GB of VRAM (allowing room for context) is a welcome upgrade.
djyde 4 days ago |
nickandbro 4 days ago |
dwa3592 4 days ago |
julianlam 4 days ago |
Qwen 3.6 on the other hand barely uses any memory at all for its KV cache.
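A rough way to reason about KV cache size, assuming plain full attention in every layer: each layer stores one key and one value vector per KV head per token. The sketch below uses made-up model dimensions, not the real configuration of Gemma 4 or Qwen 3.6, and the function name is hypothetical.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_value=2):
    """Estimate KV cache size for a model using full attention in every layer.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_value=2 corresponds to a 16-bit cache.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value


# Hypothetical model: 32 layers, 8 KV heads of dimension 128, 32k context.
size = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, context_len=32768)
print(size / 1024**3, "GiB")  # 4.0 GiB
```

Architectures that use fewer KV heads, sliding-window layers, or non-attention layers cache less than this formula suggests, which is one possible reason two models of similar size can differ a lot here. Whether that explains the difference described above is not something this sketch can confirm.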
kristianp 3 days ago |
I guess we have to wait for someone to produce perplexity curves at different Q's.
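For reference, perplexity is just the exponential of the average negative log-likelihood per token, so a curve across quants means computing this on the same text for each quantized file. A minimal sketch, assuming you already have per-token natural-log probabilities from whatever runtime you use (the `perplexity` helper and the numbers are illustrative, not measured):

```python
import math


def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities of the reference text."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# Sanity check: a model assigning probability 1/4 to every token has perplexity 4.
uniform = [math.log(0.25)] * 8
print(round(perplexity(uniform), 6))  # 4.0
```

Lower is better, and the interesting part of such a curve is usually where it starts bending upward as the bits per weight go down.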
accrual 3 days ago |
My system has a 4080 Super (16GB) installed and using llama.cpp (b9333-35c9b1f39) I got these results on a test prompt:
* Qwen3.5-9B-Q6_K.gguf - Prompt: 1492.0 t/s | Generation: 81.0 t/s
* gemma-4-12b-it-Q4_K_M.gguf - Prompt: 1329.2 t/s | Generation: 72.3 t/s
* gemma-4-12b-it-Q8_0.gguf - Prompt: 504.4 t/s | Generation: 25.2 t/s
powera 4 days ago |
It is getting questions like "David has 18 apples and Ivan has 7 apples. How many apples do they have together?" wrong half the time, while Gemma3 12B could very consistently answer that. Other smoke tests (like Chinese translation, and the infamous "Rs in Strawberry" test) also show poor results.
I don't know if it is a quantization/release issue, if the parameters needed for accurate responses have changed (i.e. it needs "thinking" tokens to handle its base error rate), or if the model has been so focused on audio/video that the text processing is bad.
scirob 3 days ago |
Gemma 4 26B (a4b MoE) 0.647
Qwen 3 14B 0.621
Gemma 4 12B 0.618
Ministral 14B 2512 0.604
Gemma 3 12B 0.547
The Qwen 3 14B vs Gemma 4 12B difference is within random variance; in some repeat runs they actually got the exact same score. The next step up, Gemma 4 31B, gets 0.676 on this. Or, letting in some reasoning, Qwen 3 14B (reasoning) gets 0.676. I'll run some cheat-proof benchmarks tomorrow to see if Qwen is still on top.
thomasjb 4 days ago |
outageroom 3 days ago |
RandyOrion 4 days ago |
Wait, *Excluding Chinese language.
This is ... curious.
P.S. Where is gemma 4 124b?
fluffyspork 2 days ago |
4-bit Quantized (Q4_K_M GGUF): You need at least 25 GB of total RAM. This is the most practical configuration for consumer hardware.
8-bit Quantized (Q8_0 / SFP8): You need at least 32 GB to 36 GB of RAM
Uncompressed 16-bit (BF16): You will need upwards of 45 GB to 50 GB of RAM to account for both the 26.7 GB base model and the massive KV cache.
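A quick sanity check on figures like these: weight memory is roughly parameter count times bits per weight divided by 8. The sketch below back-derives a parameter count from the 26.7 GB BF16 figure quoted above (an inference, not an official number) and uses 8.5 bits per weight for Q8_0, which follows from its block layout of 32 int8 weights plus one 16-bit scale. K-quants such as Q4_K_M mix tensor types, so their effective bits per weight have to be looked up per file. The helper name is hypothetical.

```python
def weight_gb(n_params, bits_per_weight):
    """Approximate weight storage in decimal gigabytes (weights only)."""
    return n_params * bits_per_weight / 8 / 1e9


# Parameter count implied by 26.7 GB at 2 bytes per weight (an inference).
n_params = 26.7e9 / 2  # about 13.35 billion

print(round(weight_gb(n_params, 16), 2))   # BF16
print(round(weight_gb(n_params, 8.5), 2))  # Q8_0: 34 bytes per block of 32 weights
```

This covers weights only. KV cache, activations and runtime overhead come on top, and those depend on context length and the inference engine, so the totals above are worth checking against an actual run.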
dgacmu 3 days ago |
It went into a crash loop on a British Columbia 1 dollar coin. This happened with both Q4_1 and Q8. Maybe I'm holding it wrong or it's just really bad for this task.
In contrast, gemma4 gets the British Columbia coin right, though it also misidentifies the quarter. gemini 3.1-flash-lite nails them both.
Was getting about 50 t/s output on a 3090 with Q8 which seems ok.
jamwise 3 days ago |
With many laptops dropping back down to 8GB because of the memory shortage, there are some interesting pressures building in the industry.
lxgr 4 days ago |
Havoc 4 days ago |
LarsKrimi 2 days ago |
Interesting for my 8GB VRAM system, but the system RAM requirement seems to balloon quickly, and it starts misspelling words. Tokens/s also seems to drop off quickly.
Zambyte 4 days ago |
[0] https://ollama.com/library/gemma4/tags
Edit: MLX being Mac-only is independent of the model being MLX (and therefore Mac) only. The latter is what I am asking about.
__natty__ 4 days ago |
zuminator 4 days ago |
macwhisperer 2 days ago |
https://huggingface.co/macwhisperer/Gemma4-12B-SuperDense
Should run perfectly on 12-16GB with maybe 10-20k context.
It seems intelligent enough that I would recommend it as a daily driver for friends who just want a local AI that can do most things relatively quickly (getting 10 tps on my M2 Air).
briansm 3 days ago |
comma_at 4 days ago |
baalimago 3 days ago |
Scorched earth tactics to make anthropic and openai IPO fail?
christina97 3 days ago |
Is the entire point of this model then that it runs if you don’t have enough GPU memory to load the 26B? That one runs faster anyway due to lower active params.
randomNumber7 4 days ago |
I would be interested in how this actually works. I couldn't find a description of the model architecture (and I did check the links in the Google blog)
wuyunhuo 3 days ago |
anonova 4 days ago |
dyauspitr 3 days ago |
benbojangles 3 days ago |
spott 4 days ago |
I'm curious how they pre-trained it... I feel like it must have had audio/image output that they chopped off.
I wonder how hard it would be to add it back on.
4k4 3 days ago |
zkmon 4 days ago |
SubiculumCode 3 days ago |
BiraIgnacio 4 days ago |
t0lo 3 days ago |
foota 3 days ago |
semiinfinitely 4 days ago |
benbojangles 3 days ago |
zkmon 4 days ago |
SuperV1234 4 days ago |
adt 3 days ago |
claysmithr 4 days ago |
easygenes 3 days ago |
alienjesus 3 days ago |
jdelman 4 days ago |
mlmonkey 4 days ago |
keyle 3 days ago |
mmmkay.
synergy20 3 days ago |
kordlessagain 4 days ago |
tmuhlestein 2 days ago |
Miles_Stone 3 days ago |
Lapsa 4 days ago |
digdugdirk 4 days ago |
The result is decent, but it had a few bizarre/trivial syntax errors I had to fix manually: it would add an extra closing bracket or paren a few times, and wanted to separate function definitions with commas. Not sure what that was about, but otherwise the output ran just fine.
So, with those qualifiers, I think it's a decent local coding model. It roughly compares with GPT-4.1 (!!), released 14 months ago, on the output: https://senko.net/vibecode-bench/2025/minesweeper-gpt-4.1.ht... (actually I'd call it better, but those syntax errors...)
I ran the quantized version (4-bit GGUF) on my consumer-grade card with 12GB of VRAM and got 5 t/s for output. Not fast enough for interactive coding, but a fairly capable model.
To me, it's fascinating how much progress we've made in just over a year. GPT-4.1 was considered an extremely capable coding model. Now we have something with 12B params performing roughly the same (in this specific benchmark, disclaimers, etc).
List of the various models I tested: https://senko.net/vibecode-bench/