
LLMs are memory-hungry. How has Google’s TurboQuant compression changed that?

What if the memory demands of modern AI aren’t as fixed as they appear? TurboQuant makes a compelling case.

When Google Research published TurboQuant last week, engineers were quick to make the comparison: this is what Pied Piper spent six seasons trying to build. For anyone unfamiliar with HBO’s Silicon Valley, the show’s plot revolves around a startup chasing a breakthrough compression algorithm, one that could shrink data dramatically without losing any of its integrity. Last week, Google turned that fiction into reality.

TurboQuant is a new family of quantisation algorithms that compress large language models down to as little as 3-bit precision – without any loss of model accuracy, and without requiring the model to be retrained from scratch.

In practical terms, that means dramatically reducing the cost, memory, and hardware required to run modern AI systems – one of the biggest constraints on scaling them today.

The longstanding assumption in AI compression has been that squeezing a model’s memory footprint inevitably squeezes its intelligence along with it. TurboQuant challenges that assumption directly.

At its core, quantisation refers to reducing the numerical precision used to store a model’s weights – a technique that shrinks memory footprint, but has historically come with a trade-off in accuracy.
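
To make the trade-off concrete, here is a minimal sketch of the simplest form of the idea: plain round-to-nearest uniform quantisation in NumPy. This is a toy for illustration, not TurboQuant’s method; the tensor shape and bit-widths are arbitrary assumptions.

```python
import numpy as np

# Toy round-to-nearest uniform quantisation - far simpler than anything
# TurboQuant does, but it shows the basic precision/memory trade-off.
rng = np.random.default_rng(42)
w = rng.normal(size=(512, 512)).astype(np.float32)  # stand-in "trained" weights

def quantise(x, bits):
    levels = 2**bits                         # e.g. 8 representable values at 3 bits
    scale = (x.max() - x.min()) / (levels - 1)
    codes = np.round((x - x.min()) / scale)  # integer codes 0 .. levels-1
    return codes.astype(np.uint8), scale, x.min()

def dequantise(codes, scale, zero_point):
    return codes * scale + zero_point

for bits in (8, 4, 3):
    codes, scale, zero = quantise(w, bits)
    err = np.abs(w - dequantise(codes, scale, zero)).mean()
    print(f"{bits}-bit: {32 / bits:.1f}x smaller than FP32, "
          f"mean abs error {err:.4f}")
```

Each halving of precision halves the memory footprint, but with naive rounding like this the reconstruction error grows in step – which is exactly the trade-off TurboQuant is claimed to sidestep.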

AI’s memory problem, and why compression hasn’t solved it until now

Today’s large language models are extraordinarily memory-hungry. The way they store and retrieve information – through the key-value (KV) cache, essentially a high-speed working memory that the model consults during inference – consumes vast amounts of GPU memory. As models have grown larger, that demand has scaled with them, becoming one of the most significant bottlenecks in deploying AI at scale. In practice, serving a model under standard FP16 precision – a widely used format that balances accuracy with memory efficiency – can require three to four times more VRAM than the task itself actually warrants, a level of overhead that quickly becomes prohibitive.
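
The scale of the problem is easy to see with some back-of-the-envelope arithmetic. In the sketch below, the layer counts and dimensions are assumptions chosen to resemble a 7B-class decoder, not figures from the paper:

```python
# Back-of-the-envelope KV cache sizing. The architecture numbers are
# illustrative assumptions (roughly 7B-class), not TurboQuant's own.
n_layers, n_kv_heads, head_dim = 32, 32, 128
context_len, batch = 4096, 8  # tokens cached per sequence, sequences served

def kv_cache_gib(bits_per_value: float) -> float:
    # One key and one value entry per layer, head, token, and dimension.
    values = 2 * n_layers * n_kv_heads * head_dim * context_len * batch
    return values * bits_per_value / 8 / 2**30

print(f"FP16 cache:    {kv_cache_gib(16):.1f} GiB")      # 16.0 GiB
print(f"6x compressed: {kv_cache_gib(16 / 6):.1f} GiB")  # ~2.7 GiB
```

Sixteen gigabytes for the cache alone, before the model weights are even loaded – and it grows linearly with both context length and batch size.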

To date, conventional approaches have tackled this bluntly – pushing compression to the point where you are not just reducing memory but stripping out what the model has learned. That has consistently meant one of two outcomes: accepting accuracy trade-offs or absorbing the enormous cost of retraining from scratch.

How TurboQuant breaks the accuracy trade-off

TurboQuant trims the fat without cutting into the muscle, shedding memory overhead without compromising the model’s knowledge and reasoning. It pairs two novel algorithms – PolarQuant, which restructures how the data’s geometry is represented before compression, and QJL, a 1-bit error-correction layer that eliminates the bias introduced during that process – to achieve the compression with zero accuracy loss, including no meaningful degradation in perplexity, a standard measure of how well a model predicts text.
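
The article does not spell out either algorithm’s internals, but the flavour of a two-stage scheme can be sketched. The toy below substitutes a random orthonormal rotation for PolarQuant’s geometric preprocessing and stochastic rounding for QJL’s bias elimination – both are common, well-understood techniques used here as stand-ins, not claims about what the real algorithms do:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rotation(d: int) -> np.ndarray:
    # Orthonormal matrix via QR: rotating before quantising spreads a
    # vector's energy evenly across coordinates (a stand-in for the
    # "geometry restructuring" role the article describes).
    q, _ = np.linalg.qr(rng.normal(size=(d, d)))
    return q

def quantise_3bit(x: np.ndarray):
    # Uniform 3-bit grid with stochastic rounding: rounding up with
    # probability equal to the fractional part keeps the estimate
    # unbiased in expectation (a stand-in for the bias-elimination
    # role the article ascribes to QJL).
    scale = np.abs(x).max() / 3.5  # 8 signed levels: -4 .. 3
    y = x / scale
    lo = np.floor(y)
    codes = lo + (rng.random(x.shape) < (y - lo))
    return np.clip(codes, -4, 3).astype(np.int8), scale

d = 64
R = random_rotation(d)
v = rng.normal(size=d)               # a vector to compress

codes, scale = quantise_3bit(R @ v)  # rotate, then round to 3 bits
v_hat = R.T @ (codes * scale)        # dequantise and de-rotate

print("relative error:", np.linalg.norm(v - v_hat) / np.linalg.norm(v))
```

The point of the unbiased rounding step is that the reconstruction is correct on average, so quantisation errors wash out rather than accumulating as systematic drift across the cache.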

In benchmarks, that translates to a 6x reduction in KV cache memory, while inference performance can improve by up to eight times. And because TurboQuant is a post-training quantisation method, it can be applied directly to an existing model without the enormous computational cost of retraining from scratch. That makes it fast to adopt and broadly applicable – the technique has shown consistent results across a wide range of model families, including Llama, Mistral, and Gemma, at scales ranging from 7 billion to over 70 billion parameters.

Why the market reacted immediately

When TurboQuant launched, the memory market felt the impact immediately. Samsung and Micron – two of the world’s largest producers of the high-bandwidth memory that AI infrastructure depends on – had been riding a wave of surging demand. Both fell within days of the announcement: Samsung by nearly 5%, Micron by around 4%.

This market reaction reflects a growing understanding that the constraints around AI deployment are not purely about raw compute, and that how much intelligence can be extracted from a given hardware footprint is becoming just as strategically important as the hardware beneath it.

What this means at scale: optimising capability, cost, and architecture

At scale, the ability to run a model at 3-bit precision without accuracy loss changes what is physically possible on existing hardware. Significantly more capability can be extracted from the same footprint. And the longer horizon points further still: towards edge devices, lower-cost cloud instances, and deployment environments that today would be considered completely unsuitable for frontier-class AI.

The same logic extends to cloud economics. When more models can be packed onto existing server infrastructure, the cost per inference drops – and with it, one of the most persistent barriers to deploying production-grade AI at scale. For organisations that have so far found frontier AI prohibitively expensive to operate, that is a meaningful change.

But perhaps the more immediate consequence is architectural. Compression at this level makes it increasingly viable to run multiple smaller, specialised models simultaneously, each optimised for a discrete task, rather than relying on a single generalised model to handle everything regardless of fit. For organisations integrating AI into real-world operational environments, the ability to deploy the right model for the right job – served efficiently and at a fraction of the memory cost – is not just a performance consideration but a structural shift towards AI systems better matched to the complexity and variety that production environments actually demand.

From research to real-world deployment

At Stelia, staying close to advances like TurboQuant is part of how we ensure our AI Operating System remains calibrated to what the best available AI capabilities can actually deliver. We evaluate developments like this not as isolated announcements but as part of a broader commitment to making sure the organisations we work with are never constrained by yesterday’s assumptions. Across the full AI stack, our architecture is built to absorb meaningful advances as they arrive and pass those benefits directly on to our customers.

Find out more about Stelia AI OS here.
