3.5× Faster Inference with Smarter Quantisation: The QServe Playbook

QServe challenges brute-force scaling with a co-designed quantisation stack, showing how thoughtful engineering can unlock 3.5× inference gains.

We asked our VP Engineering what last made him stop and think: “that’s genuinely clever.” Here’s what he had to say:

“I tend to be pretty selective about what catches my attention,” says David Hughes, Stelia VP Engineering. “Everyone these days is obsessed with ‘We don’t get enough tokens out of this GPU, let’s just buy a bigger, more powerful GPU.’ That’s not the answer. There’s an interesting bit of software from an MIT HAN Lab project at the minute called QServe that flips the whole problem on its head [1].”

Challenging Scale-Up Thinking

That frustration with brute-force scaling is exactly why QServe stood out to Dave. QServe is an inference library developed by researchers at MIT’s HAN Lab, focused on reducing bottlenecks and increasing throughput in LLM serving – a clear rejection of the “just throw power at it” mentality.

In real-world terms, quantisation can meaningfully accelerate LLM inference.
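
A rough back-of-the-envelope sketch makes the point (the model size and bandwidth figures below are illustrative assumptions, not numbers from the QServe paper): decoding is usually memory-bandwidth bound, so per-token latency is bounded by how many bytes of weights the GPU has to stream, and cutting weight precision cuts that traffic proportionally.

```python
# Rough back-of-the-envelope sketch: why lower-precision weights speed up decoding.
# Autoregressive decoding is usually memory-bandwidth bound, so per-token latency is
# bounded below by (bytes of weights read) / (GPU memory bandwidth).
# The parameter count and bandwidth below are illustrative assumptions.

def decode_ms_per_token(n_params: float, bits_per_weight: int, bandwidth_gb_s: float) -> float:
    """Lower bound on per-token decode time, assuming every weight is read once per token."""
    bytes_moved = n_params * bits_per_weight / 8
    return bytes_moved / (bandwidth_gb_s * 1e9) * 1e3  # milliseconds

N_PARAMS = 8e9        # e.g. an 8B-parameter model (assumption)
BANDWIDTH = 864.0     # GB/s, roughly L40S-class memory bandwidth (assumption)

for label, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{label}: >= {decode_ms_per_token(N_PARAMS, bits, BANDWIDTH):.1f} ms/token")
```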

Quantisation as a Runtime Strategy

Traditionally, deep learning relies on precision formats like FP16, FP32, FP64, and INT8. As research has advanced, there has been a clear push beyond INT8, exploring even lower precision such as INT4. However, INT4 quantisation techniques have mostly been effective in low-batch, edge-style LLM inference, while struggling to deliver gains for large-batch cloud-based serving. That shortfall is primarily due to runtime penalties introduced during dequantisation of either weights or partial sums on GPUs.
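
To make that overhead concrete, here is a minimal sketch of a naive W4A16-style path (our illustration, not QServe’s kernels, with an assumed group size of 128): 4-bit weights have to be dequantised back to floating point before every matmul, and on a GPU that extra step lands on general-purpose CUDA cores.

```python
import numpy as np

# Minimal illustrative sketch (not QServe's kernels) of a naive W4A16-style path:
# weights are stored as 4-bit integers with per-group FP16 scales and must be
# dequantised back to floating point before every matmul. On a GPU, that
# dequantisation runs on general-purpose CUDA cores, which is the runtime
# penalty described above.

GROUP = 128  # per-group quantisation granularity (a common choice, assumed here)

def quantise_int4(w: np.ndarray):
    """Symmetric per-group 4-bit quantisation: returns int codes and FP16 scales."""
    groups = w.reshape(-1, GROUP)
    scale = np.abs(groups).max(axis=1, keepdims=True) / 7.0       # int4 range is [-8, 7]
    q = np.clip(np.round(groups / scale), -8, 7).astype(np.int8)  # codes held in int8 containers
    return q, scale.astype(np.float16)

def dequantise(q: np.ndarray, scale: np.ndarray, shape) -> np.ndarray:
    """The extra step a W4A16 kernel pays at inference time."""
    return (q.astype(np.float16) * scale).reshape(shape)

rng = np.random.default_rng(0)
w = rng.standard_normal((4096, 4096)).astype(np.float16)
x = rng.standard_normal((1, 4096)).astype(np.float16)

q, s = quantise_int4(w)
w_hat = dequantise(q, s, w.shape)   # dequantisation overhead paid before the matmul
y = x @ w_hat.T                     # the actual useful compute
print("max abs weight error:", float(np.abs(w - w_hat).max()))
```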

To address this, QServe implements a W4A8KV4 quantisation algorithm (4-bit weights, 8-bit activations, 4-bit KV cache), following a 4-8-4 pattern known as QoQ. This approach achieves its speedup by building on a key insight: the efficiency of LLM serving is often critically bottlenecked by operations running on low-throughput CUDA cores. QoQ tackles this with progressive quantisation to lower dequantisation overhead, and pairs it with SmoothAttention to mitigate the accuracy losses typical of 4-bit KV quantisation. Additional strategies include compute-aware weight reordering, register-level parallelism to reduce dequantisation latency, and making fused attention memory-bound to fully harness the performance gains of KV4 quantisation.
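
The sketch below illustrates the progressive-quantisation idea in simplified form (our approximation of the concept, not the actual QoQ implementation; the group size and zero-point scheme are assumptions): weights are quantised to INT8 with per-channel floating-point scales, then re-quantised to 4 bits with small per-group integer scales, so the serving-time INT4-to-INT8 step needs only integer arithmetic and the single floating-point rescale happens after the INT8 matmul.

```python
import numpy as np

# Simplified sketch of the progressive-quantisation idea (our approximation of the
# concept, not the actual QServe/QoQ kernels): weights are first quantised to INT8
# with per-channel floating-point scales, then re-quantised to 4 bits with small
# per-group *integer* scales. The runtime INT4 -> INT8 step therefore needs only
# cheap integer arithmetic, and the expensive floating-point rescale happens once
# per output channel after the INT8 matmul.

GROUP = 128  # group size for the second-level 4-bit quantisation (assumed)

def progressive_quantise(w: np.ndarray):
    # Level 1: per-output-channel symmetric INT8 quantisation.
    s1 = np.abs(w).max(axis=1, keepdims=True) / 127.0
    w8 = np.clip(np.round(w / s1), -127, 127).astype(np.int8)

    # Level 2: re-quantise the INT8 codes to unsigned 4-bit per group, with an
    # integer scale and zero point so the reverse step stays in the integer domain.
    g = w8.reshape(w8.shape[0], -1, GROUP).astype(np.int16)
    zero = g.min(axis=2, keepdims=True)
    s2 = np.maximum(np.ceil((g.max(axis=2, keepdims=True) - zero) / 15), 1).astype(np.int16)
    w4 = np.clip(np.round((g - zero) / s2), 0, 15).astype(np.uint8)
    return w4, s2, zero, s1

def dequantise_to_int8(w4, s2, zero):
    """Runtime-critical step: integer-only INT4 -> INT8 reconstruction."""
    r = w4.astype(np.int16) * s2 + zero          # integer multiply-add only
    r = np.clip(r, -127, 127)                    # guard rounding overshoot in this toy version
    return r.reshape(r.shape[0], -1).astype(np.int8)

rng = np.random.default_rng(0)
w = rng.standard_normal((512, 1024)).astype(np.float32)
w4, s2, zero, s1 = progressive_quantise(w)
w8 = dequantise_to_int8(w4, s2, zero)            # cheap integer ops at serving time
w_hat = w8.astype(np.float32) * s1               # one float rescale per output channel
print("max abs reconstruction error:", float(np.abs(w - w_hat).max()))
```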

The Performance Picture

Benchmarks from the QServe team [1] show throughput gains of 1.2–1.4× on Llama-3-8B and up to 3.5× on Qwen1.5-72B, with L40S GPUs even outperforming A100-class setups in several scenarios. These results tie back to QServe’s efforts to minimise dequantisation bottlenecks on GPU cores, which are a well-known source of INT4 inefficiency. Reported token costs were up to three times lower, thanks to reduced kernel overhead and higher hardware utilisation. That said, the paper notes remaining challenges around runtime stability under high concurrency and large context windows. It’s promising but not guaranteed to hold under every production scenario.

The Implementation Trade-Offs

Beyond those open questions about consistency, this kind of adaptive quantisation isn’t trivial to roll out. The entire inference stack – from tokenisers to kernel-level scheduling – needs to tolerate non-uniform precision at runtime. That demands deeper model introspection, more robust calibration workflows and a willingness to invest in kernel-level engineering, especially for workloads beyond the benchmarked standard models. It’s not a drop-in fix, but for anyone willing to trade a bit of complexity for massive efficiency gains, QServe offers a solid blueprint.
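
For a flavour of what those calibration workflows involve, here is a hypothetical, minimal example (not QServe’s actual tooling; the function and percentile choice are ours): feed a small calibration set through a layer, record per-channel activation magnitudes, and derive symmetric INT8 activation scales from them.

```python
import numpy as np

# Hypothetical, minimal calibration loop (not QServe's actual tooling): run a small
# calibration set through a layer, record per-channel activation magnitudes, and
# derive symmetric INT8 activation scales. Production workflows add outlier
# handling (e.g. SmoothQuant/SmoothAttention-style rescaling) and per-layer
# accuracy validation on top of this.

def calibrate_activation_scales(activation_batches, percentile: float = 99.9):
    """Per-channel INT8 scales from observed activation magnitudes."""
    max_abs = None
    for acts in activation_batches:                           # acts: (tokens, channels)
        batch_max = np.percentile(np.abs(acts), percentile, axis=0)
        max_abs = batch_max if max_abs is None else np.maximum(max_abs, batch_max)
    return max_abs / 127.0                                    # symmetric INT8 scale per channel

# Usage with synthetic stand-in activations; a real run would hook a layer's inputs.
rng = np.random.default_rng(0)
batches = [rng.standard_normal((256, 4096)).astype(np.float32) * 3.0 for _ in range(8)]
scales = calibrate_activation_scales(batches)
print("per-channel scales:", scales.shape, "mean:", float(scales.mean()))
```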

What History Teaches Us

Dave noted the similarities between QServe’s solution and earlier industry breakthroughs:
“In a throwback to ye olden days of the storage world, we were presented with (seemingly) hard limits of, for example, 1TB of physical storage – and yet we found ways to work around this in software. Enter: on-the-fly compression, which became more and more efficient as algorithms developed, with different compression algorithms for different use cases – LZ4 vs GZIP, for example. The innovation continued in later years, with ZSTD becoming even more efficient and delivering the sweet spot of both of the aforementioned.

We will undoubtedly reach the same point in AI workloads, where we are presented with (again, seemingly) hard limits of the physical infrastructure involved and we will have a need to get clever. QServe is a glimpse of what getting clever might actually look like.”

Ultimately, QServe is a wake-up call about breaking habits. While the industry has been conditioned to worship bigger GPUs on reference architectures – driving up costs and complexity in the name of vendor roadmaps – QServe is a reminder that smart software still comes out on top.

As Dave puts it:
“We’ve accepted that hardware has limitations, but there’s no reason to just lie down. You can be clever and work around it with software. Ingenuity always wins in the end.”

Why This Matters for Stelia

This is just one example of what happens when ingenuity reclaims centre stage, refusing to be boxed in by hardware limitations. That principle resonates strongly with Stelia’s own purpose. At Stelia, we are committed to advancing artificial intelligence that is not just more powerful, but more purposeful – adaptive, transparent, and human-centred. While we continue to explore advanced quantisation, co-designed inference stacks, and systems-level optimisations like those demonstrated by QServe, our broader goal is to ensure these technologies serve humanity with responsibility, fairness and long-term impact. It is systems-led ideas like QServe that illustrate a path toward this goal – by proving that thoughtful engineering can break barriers and expand what AI is capable of.

References

[1] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving, MIT HAN Lab (MLSys’25).
GitHub: mit-han-lab/omniserve
Benchmarks dataset: mit-han-lab/QServe-benchmarks on Hugging Face
Lead Researcher: Song Han

Lab: MIT HAN Lab
