Hacker news

Gemma 4 12B: A unified, encoder-free multimodal model (https://blog.google)

1051 points by rvz 4 days ago | 392 comments

senko 4 days ago |

I ran the Q4 quant (used with llama.cpp) through my "minesweeper" vibe-coding benchmark: https://senko.net/vibecode-bench/2026/minesweeper-gamma-4-12...

The result is decent, but it had a few bizarre/trivial syntax errors I had to fix manually: it would add an extra closing bracket or paren a few times, and wanted to separate function definitions with commas. Not sure what that was about, but otherwise the output ran just fine.

So, with those qualifiers, I think it's a decent local coding model. It roughly compares with GPT-4.1 (!!), released 14 months ago, on the output: https://senko.net/vibecode-bench/2025/minesweeper-gpt-4.1.ht... (actually I'd call it better, but those syntax errors...)

I ran the quantized version (4-bit GGUF) on my consumer-grade card with 12 GB of VRAM and got 5 t/s for output. Not fast enough for interactive coding, but a fairly capable model.

To me, it's fascinating how much progress we've made in just over a year. GPT-4.1 was considered an extremely capable coding model. Now we have something with 12B params performing roughly the same (in this specific benchmark, disclaimers, etc).

List of the various models I tested: https://senko.net/vibecode-bench/

minimaxir 4 days ago |

The big story here is the encoder-free part, which I still don't fully understand.

> Vision: We replaced Gemma 4’s vision encoder with a lightweight embedding module consisting of a single matrix multiplication, positional embedding and normalizations.

That's technically still encoding, just without using a dedicated model for it like SigLIP? The Developer's Guide elaborates: it's still a 35M layer, and I am curious whether it is robust enough. https://developers.googleblog.com/gemma-4-12b-the-developer-...
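To make the quoted description concrete, here is a minimal sketch of what "a single matrix multiplication, positional embedding and normalizations" could look like for image patches. It is not Google's implementation: the patch size, embedding width, random weights, and the order of the steps are all assumptions chosen for illustration, and the helper names (`patchify`, `embed_patches`) are hypothetical.

```python
import math
import random


def patchify(image, patch):
    """Split an H x W grayscale image (list of rows) into flattened patches."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([image[top + r][left + c]
                            for r in range(patch) for c in range(patch)])
    return patches


def layer_norm(vec, eps=1e-6):
    """Normalize a vector to zero mean and (roughly) unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]


def embed_patches(image, patch, weight, pos_emb):
    """Project each patch with one matmul, add a positional embedding, normalize."""
    tokens = []
    for i, pixels in enumerate(patchify(image, patch)):
        # The single matrix multiplication: (patch*patch) pixels -> DIM features.
        projected = [sum(x * weight[k][j] for k, x in enumerate(pixels))
                     for j in range(len(weight[0]))]
        with_position = [a + b for a, b in zip(projected, pos_emb[i])]
        tokens.append(layer_norm(with_position))
    return tokens


random.seed(0)
PATCH, DIM = 4, 8  # hypothetical sizes, far smaller than any real model
image = [[random.random() for _ in range(8)] for _ in range(8)]
weight = [[random.gauss(0, 0.1) for _ in range(DIM)] for _ in range(PATCH * PATCH)]
pos_emb = [[random.gauss(0, 0.1) for _ in range(DIM)] for _ in range(4)]

tokens = embed_patches(image, PATCH, weight, pos_emb)
print(len(tokens), len(tokens[0]))  # 4 8
```

The point of the sketch is that each patch becomes one token via a linear projection, with no attention layers or separate vision transformer in front of the LLM.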

> Small enough to run locally on consumer laptops with 16GB of RAM, it unlocks powerful multimodal and agentic experiences right on your machine.

I am assuming that involves quantization, which due to the quality loss makes that statement somewhat misleading IMO.

asim 4 days ago |

We are now entering the closed loop game. Google doesn't need anyone else to accelerate their models. This is their bread and butter.

I'm both shocked and not surprised that they continue to develop such efficiencies. Honestly it's like silicon and CPU architecture advancement. We kept shrinking it and shrinking it and it kept getting more and more powerful, and here we are with AI, and it's only going to be 100x more efficient with time. Maybe there's some point of decay, but essentially the next 30 years will be more advanced than the last 30, and we're going to be living in some sort of futurist Blade Runner scenario where gene editing is repairing ageing cells and organs and curing all sorts of cancers that haven't even appeared yet. Beyond our lifetimes people will live to 125 quite steadily and with great mobility, and then obviously people will look at how we get to living 1000 years, an age which anyone who is religious knows Noah and others lived to in a totally different era.

Anyway I'm going off on some tangent but look back 30 years. Now look forward 30 years. It's going to be insane. May God protect us.

ethanpil 4 days ago |

What's Google's business case for releasing open models? Don't get me wrong, I am grateful and appreciative of these releases. I'm trying to understand how it fits into their bigger picture as a for profit company? Are they not helping competitors build on the novel technology they have developed?

Is it simply goodwill and/or marketing? Or am I missing something strategic?

petercooper 4 days ago |

Its image processing is terrible. I ran several tests comparing it against Qwen 3.5 0.8b (yes, 7% the size) and Qwen beat it every time, with Gemma often getting things entirely wrong. I even gave it a plain image saying "This is a test" and it thought for 6 minutes trying to analyze it and failed. Qwen 3.5 0.8b confidently got it in under a second.

It may be that the Q6 quant I got is borked (or my LM Studio is), but either way, the 0.8b's performance is mind boggling in comparison.

ComputerGuru 4 days ago |

Quite aside from the architectural changes, I suppose this is the answer to why Google had such a glaring hole in the (pretrained) Gemma4 model lineup between the Gemma4 4b and Gemma4 26b models!

A model that comfortably fits in 16GB of VRAM (allowing room for context) is a welcome upgrade.

djyde 4 days ago |

What are the use cases for these small models? Is there anyone using models of this scale in their daily life who could share their experience?

nickandbro 4 days ago |

Wow, Google is becoming the new pre-Llama 4 Meta when it comes to releasing open-weights models.

dwa3592 4 days ago |

This is a pretty good update. The demo video is a bit funny though: the tester asks it to turn the release into bullet points. Okay, the model obliges. Then the tester says to draft an email with this content. BAM! The LLM turns the content from bullets into passages even though it was not asked to, undoing the last good thing it did. I am not sure if it's email etiquette to not put bullets in an email.

julianlam 4 days ago |

Last time I tried Gemma 4 (26B-A4B) its memory usage would balloon and consume all of my swap until my machine died.

Qwen 3.6 on the other hand barely uses any memory at all for its KV cache.

kristianp 3 days ago |

What quantisation do the creators intend this to be run at? They talk about 16GB of RAM, so should it be run at 8 bit? People here are talking about using Q4, but I would have thought a smaller model like this wouldn't perform well at such low bits per parameter. Edit: it looks like their benchmarks would have been done at 16-bit float, as the Hugging Face release is that size: https://huggingface.co/google/gemma-4-12B . Which is a little deceptive: they're advertising that an 8-bit size will fit on 16GB laptops, while releasing a 16-bit size.

I guess we have to wait for someone to produce perplexity curves at different Q's.
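For a rough sense of the sizes being discussed, here is a back-of-the-envelope sketch of weight memory at different bit widths. It assumes a nominal 12 billion parameters (the real checkpoint may differ) and counts weights only, ignoring the KV cache and runtime overhead. Real quant formats also store scales alongside the weights, so their effective bits per weight are somewhat higher than the nominal figure.

```python
def weight_gigabytes(params, bits_per_weight):
    """Weight storage in decimal gigabytes for a given bit width."""
    return params * bits_per_weight / 8 / 1e9


NOMINAL_PARAMS = 12e9  # assumption: taken from the "12B" in the model name

for bits in (16, 8, 4):
    print(f"{bits}-bit: {weight_gigabytes(NOMINAL_PARAMS, bits):.1f} GB")
# 16-bit: 24.0 GB
# 8-bit: 12.0 GB
# 4-bit: 6.0 GB
```

Under these assumptions, 16-bit weights alone would not fit in 16GB, while an 8-bit version would leave only a few gigabytes for context and everything else.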

accrual 3 days ago |

Splendid model, it reminds me of Gemma3 27B which was my favorite local model last year. Gemma always had a bit more warmth/empathy compared to Qwen and Mistral in my experience and I found it more useful for personal questions.

My system has a 4080 Super (16GB) installed and using llama.cpp (b9333-35c9b1f39) I got these results on a test prompt:

* Qwen3.5-9B-Q6_K.gguf - Prompt: 1492.0 t/s | Generation: 81.0 t/s

* gemma-4-12b-it-Q4_K_M.gguf - Prompt: 1329.2 t/s | Generation: 72.3 t/s

* gemma-4-12b-it-Q8_0.gguf - Prompt: 504.4 t/s | Generation: 25.2 t/s
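To translate those throughput figures into wall-clock time, here is a small sketch that estimates latency for one request. The workload (a 2000-token prompt and a 500-token reply) is a hypothetical example, and the estimate assumes the reported rates stay constant over the whole request, which may not hold at longer contexts.

```python
# Prompt and generation rates in tokens/second, copied from the results above.
RESULTS = {
    "Qwen3.5-9B-Q6_K.gguf": (1492.0, 81.0),
    "gemma-4-12b-it-Q4_K_M.gguf": (1329.2, 72.3),
    "gemma-4-12b-it-Q8_0.gguf": (504.4, 25.2),
}


def request_seconds(prompt_tokens, output_tokens, prompt_rate, gen_rate):
    """Estimated time to process the prompt plus time to generate the reply."""
    return prompt_tokens / prompt_rate + output_tokens / gen_rate


PROMPT_TOKENS, OUTPUT_TOKENS = 2000, 500  # hypothetical workload

for name, (prompt_rate, gen_rate) in RESULTS.items():
    seconds = request_seconds(PROMPT_TOKENS, OUTPUT_TOKENS, prompt_rate, gen_rate)
    print(f"{name}: {seconds:.1f} s")
```

For this workload generation time dominates, so the gap between the Q4_K_M and Q8_0 results shows up almost entirely in how long the reply takes to stream.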

powera 4 days ago |

I'm seeing very low quality results on LMStudio with this model. Worse than Gemma 3 12B.

It is getting questions like "David has 18 apples and Ivan has 7 apples. How many apples do they have together?" wrong half the time, while Gemma3 12B could very consistently answer that. Other smoke tests (like Chinese translation, and the infamous "Rs in Strawberry" test) also show poor results.

I don't know if it is a quantization/release issue, if the parameters needed for accurate responses have changed (i.e. it needs "thinking" tokens to handle its base error rate), or if the model has been so focused on audio/video that the text processing is bad.

scirob 3 days ago |

Quickly deployed it to check some benchmarks relevant for the German language. These are results for CohereLabs/include-base-44, German only: Gemma 4 12B 61.9%

  Gemma 4 26B (a4b MoE)    0.647
  Qwen 3 14B               0.621 
  Gemma 4 12B              0.618
  Ministral 14B 2512       0.604 
  Gemma 3 12B              0.547

The Qwen 3 14B vs Gemma 4 12B difference is within random variance; they are effectively the same, and in some repeat runs they actually got the exact same score. The next step up, Gemma 4 31B, gets 0.676 on this. Or, allowing some reasoning, Qwen 3 14B (reasoning) gets 0.676.

I'll run some cheat-proof benchmarks tomorrow to see if Qwen is still on top.
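As a rough illustration of the "within random variance" point, here is a sketch of the binomial standard error of an accuracy score. The number of questions (500) is a hypothetical placeholder, since the size of the German subset is not stated here, and treating the two models as independent samples is a simplification: they answered the same questions, so a paired test would be more appropriate.

```python
import math


def standard_error(accuracy, n_questions):
    """Binomial standard error of an accuracy measured on n_questions items."""
    return math.sqrt(accuracy * (1 - accuracy) / n_questions)


N_QUESTIONS = 500  # assumption: the real benchmark size is not given above
qwen, gemma = 0.621, 0.618  # scores from the table above

# Standard error of the difference, treating the two runs as independent.
se_diff = math.sqrt(standard_error(qwen, N_QUESTIONS) ** 2
                    + standard_error(gemma, N_QUESTIONS) ** 2)
z_score = (qwen - gemma) / se_diff

print(f"difference: {qwen - gemma:.3f}, standard error: {se_diff:.3f}")
print(f"z-score: {z_score:.2f}")
```

With a benchmark of that assumed size, a 0.003 gap is far smaller than one standard error, which is consistent with the two models swapping places between repeat runs.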

thomasjb 4 days ago |

Unfortunately there's no gguf quants of the assistant model yet: https://huggingface.co/models?other=base_model:quantized:goo...

outageroom 3 days ago |

I really like the idea of small models that you can get the most out of. If I weren't a programmer, I wouldn't even know what I would use Opus 4.8 or GPT 5.5 models for.

RandyOrion 4 days ago |

A small dense multimodal model with audio support, interesting.

Wait, *Excluding Chinese language.

This is ... curious.

P.S. Where is gemma 4 124b?

fluffyspork 2 days ago |

According to Gemini, to run the full 256k context window with unified RAM:

4-bit Quantized (Q4_K_M GGUF): You need at least 25 GB of total RAM. This is the most practical configuration for consumer hardware.

8-bit Quantized (Q8_0 / SFP8): You need at least 32 GB to 36 GB of RAM

Uncompressed 16-bit (BF16): You will need upwards of 45 GB to 50 GB of RAM to account for both the 26.7 GB base model and the massive KV cache.
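For anyone who wants to sanity-check figures like these, the usual KV cache estimate is 2 (keys and values) x layers x KV heads x head dimension x context length x bytes per element. The sketch below uses entirely hypothetical architecture numbers, not the real Gemma 4 12B config, and assumes full attention in every layer. Architectures with sliding-window or shared KV layers would need less.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, context_tokens, bytes_per_element):
    """Size of the key and value caches for one sequence, in bytes."""
    return 2 * layers * kv_heads * head_dim * context_tokens * bytes_per_element


# Hypothetical architecture values, for illustration only.
LAYERS, KV_HEADS, HEAD_DIM = 48, 8, 128
CONTEXT = 256 * 1024    # "256k" context window
BYTES_PER_ELEMENT = 2   # 16-bit cache entries

size = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, CONTEXT, BYTES_PER_ELEMENT)
print(f"{size / 2**30:.1f} GiB")  # 48.0 GiB
```

The cache grows linearly with context length, so halving the context window halves this figure regardless of how the weights are quantized.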

dgacmu 3 days ago |

I was excited about this until I fed it one of my local test problems: coin identification. I then spent 10 minutes arguing with it that a photo of a 1998 Washington quarter was not, in fact, a Morgan Silver Dollar. I mean, I wish it was.

It went into a crash loop on a British Columbia 1 dollar coin. This happened with both Q4_1 and Q8. Maybe I'm holding it wrong or it's just really bad for this task.

In contrast, gemma4 gets the British Columbia coin right, though it also mis-identifies the quarter. gemini 3.1-flash-lite nails them both.

Was getting about 50 t/s output on a 3090 with Q8 which seems ok.

jamwise 3 days ago |

"Small enough to run locally with just 16GB of VRAM or unified memory"

With many laptops dropping back down to 8GB because of the memory shortage, there are some interesting pressures building in the industry.

lxgr 4 days ago |

Am I missing something or are the Ollama versions of this (https://ollama.com/library/gemma4/tags) text-only for now?

Havoc 4 days ago |

Quite a niche release. The MoE outperforms it on score and will likely be faster thanks to lower active weights. So this really only makes sense for specific ram constrained applications that can’t fit a quantized MoE

LarsKrimi 2 days ago |

It seems very good at understanding human language clues even in a 4-bit (Q4_K_S) model, similar in feel to E4B but a great incremental improvement.

Interesting for my 8GB VRAM system, but the system RAM requirement seems to balloon quickly, and it starts misspelling words. Also token/s drops off quickly it seems

Zambyte 4 days ago |

Is this Mac only? Or is that an Ollama issue that it only supports this release of models on Mac? It seems like every tag with the MLX badge is only supported on Mac[0], and that includes all of the tags in this release.

[0] https://ollama.com/library/gemma4/tags

Edit: MLX being Mac-only is independent of the model being MLX (and therefore Mac) only. The latter is what I am asking about.

__natty__ 4 days ago |

It’s fascinating for me to see how much small language models have grown in capability recently while staying small enough for consumers to run on their own machines.

zuminator 4 days ago |

How does it compare with e4b, aside from being larger?

macwhisperer 2 days ago |

check out a custom 4-bit quant I made today

https://huggingface.co/macwhisperer/Gemma4-12B-SuperDense

should run perfect for 12-16gb with maybe 10-20k context

seems intelligent enough that I would recommend this as a daily driver for friends who just want a local ai that can do most things relatively quickly (getting 10 tps on my m2 air)

briansm 3 days ago |

Strange that they are feeding raw audio in. Even in humans, there is a hardware transform to the frequency domain (the cochlea) before data is fed to the brain. Effectively doing this part in the LLM seems inefficient.

comma_at 4 days ago |

Are there Qwen or MiniMax or other open-weight models with the same hardware requirements that outperform this?

baalimago 3 days ago |

I don't understand why Google does this. If I can run this locally, why would I need a subscription or use any inference provider, including Google..?

Scorched earth tactics to make anthropic and openai IPO fail?

christina97 3 days ago |

It seems worse in all aspects than the 26B A4B? I would have thought dense models still beat MoE on many benchmarks?

Is the entire point of this model then that it runs if you don’t have enough GPU memory to load the 26B? That one runs faster anyway due to lower active params.

randomNumber7 4 days ago |

> Novel unified architecture: No multimodal encoders. The vision and audio inputs flow directly into the LLM backbone.

I would be interested in how this actually works. I couldn't find a description of the model architecture (and I did check the links in the Google blog)

wuyunhuo 3 days ago |

The optimal small-model solution, delivering multimodal, reasoning, and coding experiences on affordable hardware that were remarkably close to those of mid-to-large models at the time.

anonova 4 days ago |

Do Gemma 4 models compete with Gemini 3.1 Flash-Lite? I would assume even the smallest Gemini model would outperform even Gemma 4 31B, but I can't really get a sense of performance or output quality difference.

dyauspitr 3 days ago |

Just tried this out. Jesus Christ. Google does some things so well.

benbojangles 3 days ago |

I run gemma-4-26b-bf16 in mtp mode and it runs very smooth, spitting out answers in seconds and outputting text 30x faster than i can read.

spott 4 days ago |

Is there a paper on this?

I'm curious how they pre-trained it... I feel like it must have had audio/image output that they chopped off.

I wonder how hard it would be to add it back on.

4k4 3 days ago |

I'm actually wondering how much better this is (besides multimedia) than prismml's 1.5-bit model based on qwen2.5 or something.

zkmon 4 days ago |

It's quite interesting to see the quants pour into the HF page. I keep refreshing it and see many new quants every few mins.

SubiculumCode 3 days ago |

"Laptop ready: Small enough to run locally with just 16GB of VRAM or unified memory." I wish. I just have 12.

BiraIgnacio 4 days ago |

Using an embedder instead of an encoder is quite clever. Not sure who came up with that first, but it's a cool idea.

t0lo 3 days ago |

Asked it to name the director who wears a Rolex and likes submarines. It said Christopher Nolan.

foota 3 days ago |

It feels like this would be beneficial to give the model more of a deep understanding of visual knowledge.

semiinfinitely 4 days ago |

Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away

benbojangles 3 days ago |

why combine audio & image analysis into an llm though, why not allow the user to choose their own audio & image analysis alongside their own llm choice?

zkmon 4 days ago |

I'm waiting for FP8 quant, preferably from Google.

SuperV1234 4 days ago |

How does this compare to frontier models?

adt 3 days ago |

https://lifearchitect.ai/models-table/

claysmithr 4 days ago |

I don’t see the download in lm studio

easygenes 3 days ago |

I want to like the vision capabilities of the model. However, when I gave it an image which Gemma 26B A4B and Qwen 3.6 35B A3B have no problem correctly describing in detail, including identifying the Taj Mahal in the background, it utterly failed. Its sense of the image was that it was a "distorted wide panorama", and even when I asked directly if it was the Taj Mahal it said no. The reference models saw it correctly as a normal square image taken from a fairly rectilinear lens (iPhone main camera).

alienjesus 3 days ago |

good one, wanna try on Cerebras inference in Agentic Browsing

jdelman 4 days ago |

I can’t help but wonder if this is the basis of the model they’ve helped tune for Apple.

mlmonkey 4 days ago |

Is there some place where we can try it before downloading the gigabytes of weights?

keyle 3 days ago |

Not terribly impressed with this one. I asked it for a recommendation between Paris and Berlin and option 3 was Rome... and option 4 was Tokyo.

mmmkay.

synergy20 3 days ago |

ollama does not support this yet, what else can I try

kordlessagain 4 days ago |

Cool!

tmuhlestein 2 days ago |

[flagged]

Miles_Stone 3 days ago |

[flagged]

Lapsa 4 days ago |

[dead]

digdugdirk 4 days ago |

I do enjoy the immediate out of touch signaling with the "runs on your 16gb vram laptop" line. Because everyone has a laptop with 16gb vram, or can just pop out and buy a new one, right?