Hacker News

Speculative KV coding: losslessly compressing KV cache by up to ~4× (https://fergusfinn.com)

126 points by kkm 3 days ago | 21 comments

oceanplexian 21 minutes ago |

A lot of this is over my head but why would you do compression when GPU time is the most expensive thing in the world right now?

KV can be trivially stored in RAM or even on a spinning disk and retrieved in milliseconds. See LMCache for vLLM, for example. In fact, it's so easy that it kinda shocks me when Claude Code sits and recomputes my entire KV cache in a new session after a couple of hours. I guess Anthropic's infra is not as optimized as it would seem.

Think about the problem from first principles:

Storing a few GB per user at scale isn't that hard and was solved years ago. Say I have 20 chat sessions open and each session persists for a day or two: that seems negligible to me as a systems design problem.

zozbot234 about 10 hours ago |

The problem with this approach is that even recomputing a "draft" of the KV cache is still quadratic in context length. Maybe you can get some constant savings by always recomputing the earliest tokens, but it's not a good tradeoff as context sizes grow.
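To make the quadratic point concrete, here is a minimal sketch that counts query-key pairs for full causal attention. It assumes the draft model uses standard attention, and it ignores the per-token MLP cost, which grows only linearly. The function name is made up for illustration.

```python
def causal_attention_pairs(n_tokens: int) -> int:
    """Number of (query, key) pairs when token i attends to tokens 0..i."""
    return n_tokens * (n_tokens + 1) // 2


# Doubling the context roughly quadruples the attention work,
# no matter how small the model doing the recompute is.
for n in (32_000, 64_000, 128_000):
    print(f"{n:>7} tokens -> {causal_attention_pairs(n):,} pairs")

ratio = causal_attention_pairs(128_000) / causal_attention_pairs(64_000)
print(f"128k vs 64k: {ratio:.3f}x")
```

A smaller draft model shrinks the constant factor, but the growth with context length stays the same.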

hypfer about 11 hours ago |

TL;DR (and please correct me if I got it wrong):

A tiny deterministic model predicts the K/V cache, the prediction is compared with reality, and the delta is stored in VRAM. Decoding then just runs the predictor again and applies the delta, so you get the full, correct values while storing only the delta.

And this works because you're never looking at the whole K/V cache at once, only a slice. So you just need a memory buffer the size of that slice.
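A toy sketch of the predict, diff, store, reconstruct loop described above. This is not the article's implementation: `draft_predict` is a hypothetical stand-in for the small deterministic model, the "KV cache" is a flat list of byte values rather than floating-point tensors, and `zlib` stands in for whatever entropy coder the real scheme uses.

```python
import random
import zlib


def draft_predict(token_ids):
    """Hypothetical deterministic predictor: one byte-sized guess per token."""
    return [(31 * t + 7) % 251 for t in token_ids]


def encode(true_kv, token_ids):
    """Store only the compressed residual between the real values and the guess."""
    pred = draft_predict(token_ids)
    residual = bytes((x - p) % 256 for x, p in zip(true_kv, pred))
    return zlib.compress(residual, 9)


def decode(blob, token_ids):
    """Re-run the predictor and add the residual back to recover exact values."""
    pred = draft_predict(token_ids)
    residual = zlib.decompress(blob)
    return [(p + r) % 256 for p, r in zip(pred, residual)]


rng = random.Random(0)
token_ids = [rng.randrange(50_000) for _ in range(4096)]

# Toy "real" cache: the prediction plus a small error (assumption: the
# predictor is usually right or off by one).
true_kv = [(p + rng.choice([-1, 0, 0, 0, 1])) % 256
           for p in draft_predict(token_ids)]

blob = encode(true_kv, token_ids)
assert decode(blob, token_ids) == true_kv  # lossless round trip

print("raw bytes:       ", len(true_kv))
print("zlib on raw:     ", len(zlib.compress(bytes(true_kv), 9)))
print("zlib on residual:", len(blob))
```

The raw values look close to random, so compressing them directly gains little. The residuals cluster around zero, which is what makes them cheap to store. The better the predictor, the smaller the residual, at the price of running the predictor again on every decode.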

___

If this works out and I've understood correctly, that would, _I think_, mean that a 24GB RTX 4090 could fit 256k of q8 context next to Qwen3.6-27B at IQ4_NL.

Or, alternatively, something like 208k context (matching the Claude API limit of 200k on some plans) with a slightly larger quant like UD-Q4_K_XL.

That would be massive. Especially since the thing has so much compute to spare.

Though that all depends on the size of the predictor model, I guess?
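For anyone who wants to redo this kind of estimate, a back-of-the-envelope sketch of the standard KV cache size formula. The layer count, KV head count, and head dimension below are hypothetical placeholders, not the real config of any Qwen model. It also assumes every layer uses ordinary attention, that q8 costs exactly 1 byte per element (ignoring block-scale overhead), and that the headline ~4× best case actually applies.

```python
GIB = 1024 ** 3


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem):
    """Keys plus values (hence the factor of 2) for every layer and token."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem


# Hypothetical model shape, for illustration only.
raw = kv_cache_bytes(
    n_layers=48,
    n_kv_heads=8,
    head_dim=128,
    context_len=256 * 1024,
    bytes_per_elem=1,  # assumption: q8 treated as 1 byte per element
)

raw_gib = raw / GIB
compressed_gib = raw_gib / 4  # assumption: the "up to ~4x" best case

print(f"uncompressed: {raw_gib:.1f} GiB")        # 24.0 GiB
print(f"at 4x:        {compressed_gib:.1f} GiB")  # 6.0 GiB
```

Swap in the real model's numbers to see what would be left over next to the weights. The predictor model's own weights and working buffers are not counted here.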

syllogistic about 5 hours ago |

How do these results compare with the Engram-based approach from DeepSeek?

0-_-0 about 10 hours ago |

You can use the original model to compress the kv cache and get ∞x compression, since the prediction is perfect. The cost is time, and I don't see how this could be worth it.

ssivark about 8 hours ago |

Note that any cache (e.g. LRU eviction) is just a specific speculative model for future usage :-)

The cache can be backed by hardware/lookup, or by a cheap computation. The line between functions and data is really blurry.

monster_truck about 9 hours ago |

There is no compression taking place here.

mirekrusin about 11 hours ago |

If the “speculative” approach works so well in different contexts, why not make it first-class and use it everywhere, possibly recursively?

porridgeraisin about 11 hours ago |

I have yet to do a "deep dive" into the results, but what a well-written article. An LLM could _never_ write so crisply.