Hacker News

TurboQuant: Redefining AI efficiency with extreme compression (https://research.google)

180 points by ray__ about 6 hours ago | 37 comments

amitport about 3 hours ago |

This is a great development for KV cache compression. I did notice a missing citation in the related works regarding the core mathematical mechanism, though. The foundational technique of applying a geometric rotation prior to extreme quantization, specifically for managing the high-dimensional geometry and enabling proper bias correction, was introduced in our NeurIPS 2021 paper, "DRIVE" (https://proceedings.neurips.cc/paper/2021/hash/0397758f8990c...). We used this exact rotational approach and a similar bias correction mechanism to achieve optimal distributed mean estimation. I also presented this work and subsequent papers in a private invited talk at Google shortly after publication. Given the strong theoretical overlap with the mechanisms in TurboQuant and PolarQuant, I hope to see this prior art acknowledged in the upcoming camera-ready versions.
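For readers unfamiliar with the trick, here is a minimal NumPy sketch of rotate-then-quantize with a least-squares bias correction. This is illustrative only, not the code from DRIVE, TurboQuant, or PolarQuant; DRIVE in particular uses a structured randomized Hadamard transform rather than the dense rotation shown here.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
x = rng.normal(size=d)

# Random orthogonal rotation. DRIVE uses a randomized Hadamard transform
# for O(d log d) cost; a dense QR-based rotation is enough for a demo.
R, _ = np.linalg.qr(rng.normal(size=(d, d)))
y = R @ x  # after a random rotation, coordinates behave like i.i.d. Gaussians

# Extreme (1-bit) quantization: transmit only the signs plus one scalar.
signs = np.sign(y)
scale = np.dot(y, signs) / d  # least-squares scale, i.e. mean(|y|)

# Decode: rescale the signs and rotate back.
x_hat = R.T @ (scale * signs)
cos_sim = np.dot(x, x_hat) / (np.linalg.norm(x) * np.linalg.norm(x_hat))
# For near-Gaussian coordinates, the cosine similarity between x and its
# 1-bit reconstruction concentrates near sqrt(2/pi) ~ 0.80 in any dimension.
```

The rotation is what makes the scalar bias correction well behaved: it spreads the vector's energy evenly across coordinates, so a single scale works for all of them.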

benob about 4 hours ago |

This is the worst lay-people explanation of an AI component I have seen in a long time. It doesn't even seem AI generated.

zeeshana07x about 1 hour ago |

The gap between how this is described in the paper vs. the blog post is pretty wide. Would be nice to see more accessible writing from research teams — not everyone reading is an ML engineer.

bluequbit about 4 hours ago |

I did not understand what PolarQuant is.

Is it something like pattern-based compression, where the algorithm finds repeating patterns and creates an index of those common symbols or numbers?

moktonar about 3 hours ago |

Aren’t polar coordinates for an n-dim vector still n-1 angles plus 1 for the radius? If so, I understand that the angles can be quantized better, but when the radius r is big, the error is large for highly quantized angles, right? What am I missing?
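A toy 2-D check of that intuition (hypothetical code, not from the PolarQuant paper): the reconstruction error from quantizing the angle grows linearly with the radius r, so the radius has to be handled separately, e.g. normalized away or given its own bits.

```python
import numpy as np

def quantize_angle(theta, bits):
    """Uniformly quantize an angle in [0, 2*pi) to 2**bits levels."""
    step = 2 * np.pi / (2 ** bits)
    return np.round(theta / step) * step

theta = 1.2345
errs = {}
for r in [1.0, 10.0, 100.0]:
    x = r * np.array([np.cos(theta), np.sin(theta)])
    theta_q = quantize_angle(theta, bits=8)
    x_q = r * np.array([np.cos(theta_q), np.sin(theta_q)])
    # Chord length: ||x - x_q|| = 2 * r * sin(|theta - theta_q| / 2),
    # i.e. exactly proportional to the radius r.
    errs[r] = np.linalg.norm(x - x_q)
```

So yes: with a fixed angular bit budget, the absolute error scales with r; only the relative error stays constant.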

maurelius2 about 3 hours ago |

I'm somewhat at a loss here beyond understanding the fundamentals. Can someone tell me how the compression impacts performance?

lucrbvi about 2 hours ago |

Sounds like Multi-Head Latent Attention (MLA) from DeepSeek

mskkm about 2 hours ago |

Pied Piper vibes. As far as I can tell, this algorithm is hardly compatible with modern GPU architectures. My guess is that’s why the paper reports accuracy-vs-space but conveniently avoids reporting inference wall-clock time. The baseline numbers also look seriously underreported. “Several orders of magnitude” speedups for vector search? Really? Has anyone actually reproduced these results?