TurboQuant: What 3-Bit KV Caches Actually Mean for Your Inference Stack

If you need a refresher on quantization, check out my other article about it.

If you have been running long-context models in production, you already know the feeling. You scale context to 32K, 64K, 128K tokens and the KV cache quietly eats your VRAM alive. On Llama-3.1-8B at 128K context, the cache alone can outweigh the model. Double the context, double the memory. There is no trick around it: the growth is linear.
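To make the linear growth concrete, here is a back-of-the-envelope calculation. It is a sketch, not a profiler: the helper name `kv_cache_bytes` is made up for illustration, and the configuration values (32 layers, 8 KV heads, head dimension 128) are my assumption for Llama-3.1-8B with grouped-query attention, stored in 16-bit precision.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   bits_per_value=16, batch=1):
    """Bytes needed to hold keys and values for every layer, head and token."""
    # Factor 2: one key vector and one value vector per token.
    n_values = 2 * n_layers * n_kv_heads * head_dim * seq_len * batch
    return n_values * bits_per_value / 8

# Assumed Llama-3.1-8B shape (hypothetical config dict, not read from a checkpoint).
cfg = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for ctx in (32_768, 65_536, 131_072):
    gib = kv_cache_bytes(seq_len=ctx, **cfg) / 2**30
    print(f"{ctx:>7} tokens: {gib:.1f} GiB")
```

Under these assumptions a single 128K-token sequence needs 16 GiB of cache, about the same as the roughly 16 GB of FP16 weights, and every additional concurrent request adds its own cache on top.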

TurboQuant (ICLR 2026, Google Research + NYU) compresses KV vectors down to 3-4 bits per coordinate with near-zero quality loss and no calibration step. The algorithm is a rotation and a table lookup. It runs online, meaning you quantize each vector the moment it is produced. No preprocessing pass over your dataset, no learned codebooks, no second-order statistics. For KV cache, that is the only kind of quantizer that makes practical sense, since vectors arrive one at a time during autoregressive decode and you cannot afford to wait.

The paper backs this with formal distortion bounds within ~2.7x of the information-theoretic limit, strong benchmark results, and a clean two-stage design for unbiased inner products. It also comes with a narrow evaluation, a publicly contested relationship with prior work, and a gap between the paper and production reality.

Let’s walk through what matters.

What Changes in Your System

Before getting into the math, it is worth understanding what 3-bit KV cache quantization actually buys you at the systems level.

The KV cache lives in HBM. During decode, the attention kernel reads it every single step. For each new token, you compute dot products between the query and every stored key, then weight the values. This is a memory-bound operation. The compute is cheap (just matrix-vector products), but you are bottlenecked by how fast you can stream key/value vectors from HBM into the compute units.

Compressing the cache from 16 bits to 3-4 bits shrinks the memory footprint by 4-5x. That alone is useful: longer contexts fit in the same GPU, or you can serve more concurrent requests. But the real win is bandwidth. Fewer bits per vector means less data to move per attention step, which translates directly to lower decode latency. Google reports up to 8x speedup in attention logit computation on H100, though that number is against an FP32 baseline that nobody uses in production. Against realistic FP16 serving, expect something meaningfully smaller, but still worth having.
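A rough way to sanity-check both numbers is to count bits and divide by bandwidth. The sketch below assumes each quantized vector stores one 16-bit norm next to its indices, and it treats decode as purely bandwidth-bound. The function names are hypothetical, and the 3.35 TB/s figure is an assumed nominal HBM bandwidth for an H100-class part, not a measurement.

```python
def compression_ratio(head_dim, bits, norm_bits=16, baseline_bits=16):
    """Ratio of 16-bit storage to quantized storage for one vector."""
    stored_bits = head_dim * bits + norm_bits   # indices plus one stored norm
    return head_dim * baseline_bits / stored_bits

def min_decode_read_ms(cache_bytes, hbm_bytes_per_s):
    """Lower bound on per-token attention time if the cache is read once."""
    return cache_bytes / hbm_bytes_per_s * 1e3

ASSUMED_HBM_BYTES_PER_S = 3.35e12   # assumption: nominal, not achieved, bandwidth
fp16_cache = 16 * 2**30             # example: a 16 GiB FP16 cache

for bits in (4, 3):
    ratio = compression_ratio(128, bits)
    ms = min_decode_read_ms(fp16_cache / ratio, ASSUMED_HBM_BYTES_PER_S)
    print(f"{bits}-bit: {ratio:.2f}x smaller, >= {ms:.2f} ms per decode step")
```

The bound ignores weight reads, dequantization cost, and kernel efficiency, so treat it as a floor that shows the direction of the effect rather than a latency prediction.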

The other systems-level benefit is simplicity. Methods like GPTQ and AWQ require offline calibration passes, Hessian computation, and per-model tuning. That works for weight quantization, where you quantize once and serve forever. For KV cache, the data is ephemeral and unique to each request. You need something that works on any vector, instantly, with no state. TurboQuant fits this constraint by design. Its codebook is precomputed from probability theory, not learned from data. Same codebook for every model, every layer, every head.

How It Works

Link to the original blogpost

The algorithm has one key insight and one clever extension.

The insight: make every vector look the same before quantizing it. Arbitrary KV vectors have unpredictable coordinate distributions. Some channels spike (outliers), others carry little energy, and the pattern varies across layers and heads. This is terrible for scalar quantization since a fixed codebook cannot handle every distribution well.

TurboQuant fixes this by multiplying each vector by a random orthogonal matrix (implemented as a fast Walsh-Hadamard transform, O(d log d), fully vectorized). Orthogonal transforms preserve norms and inner products, so no information is lost. But the coordinate distribution changes completely. After rotation, every coordinate of a unit-norm vector follows a Beta distribution that depends only on the dimension d, not on the input. In typical head dimensions (d=128), this looks like a tight Gaussian centered at zero with variance about 1/d.
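The effect of the rotation is easy to see on a toy vector. The sketch below uses a randomized Hadamard transform (random sign flips followed by an orthonormal FWHT), which is a common structured stand-in for a dense random orthogonal matrix; whether it matches the paper's exact construction is an assumption on my part. The names `fwht` and `rotate`, and the outlier channels, are made up for illustration.

```python
import numpy as np

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform over the last axis."""
    x = np.asarray(x, dtype=np.float64)
    shape, d = x.shape, x.shape[-1]
    assert d & (d - 1) == 0, "length must be a power of two"
    y = x.reshape(-1, d).copy()
    h = 1
    while h < d:
        # Butterfly step: combine blocks of size h pairwise.
        y = y.reshape(-1, d // (2 * h), 2, h)
        a = y[:, :, 0, :] + y[:, :, 1, :]
        b = y[:, :, 0, :] - y[:, :, 1, :]
        y = np.stack([a, b], axis=2).reshape(-1, d)
        h *= 2
    return (y / np.sqrt(d)).reshape(shape)

rng = np.random.default_rng(0)
d = 128
signs = rng.choice([-1.0, 1.0], size=d)   # random diagonal sign matrix

def rotate(v):
    """Randomized Hadamard rotation: flip signs, then transform."""
    return fwht(signs * v)

# A vector with two large outlier channels and small noise elsewhere.
x = 0.05 * rng.standard_normal(d)
x[[3, 40]] = [8.0, -6.0]
q = rng.standard_normal(d)

y = rotate(x)
print("norm before/after:", np.linalg.norm(x), np.linalg.norm(y))
print("inner product before/after:", x @ q, y @ rotate(q))
print("largest |coordinate| before/after:", np.abs(x).max(), np.abs(y).max())
```

The norm and inner product come out unchanged, while the energy of the two outlier channels is spread across all 128 coordinates, which is what lets a single fixed codebook work.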

Now you have a known, fixed distribution. You can precompute the optimal scalar codebook for it (Lloyd-Max centroids, solved once per bit-width via 1D k-means on the Beta density). At runtime, you rotate the vector, find the nearest centroid for each coordinate, and store the indices plus the original norm. Dequantization reverses it by looking up the centroids and applying the inverse rotation. Fully vectorized, both directions.
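The whole MSE path fits in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the paper's implementation: the codebook is fitted by Lloyd's algorithm on sampled coordinates of random unit vectors rather than solved against the analytic Beta density, the rotation is a dense QR-based orthogonal matrix instead of a fast transform, and all function names are hypothetical.

```python
import numpy as np

def lloyd_max_codebook(samples, bits, iters=40):
    """1D k-means (Lloyd's algorithm) on scalar samples; returns sorted centroids."""
    k = 2 ** bits
    # Start from evenly spaced quantiles so every cell is non-empty.
    centroids = np.quantile(samples, (np.arange(k) + 0.5) / k)
    for _ in range(iters):
        edges = (centroids[:-1] + centroids[1:]) / 2   # nearest-centroid boundaries
        cell = np.searchsorted(edges, samples)
        sums = np.bincount(cell, weights=samples, minlength=k)
        counts = np.bincount(cell, minlength=k)
        centroids = np.where(counts > 0, sums / np.maximum(counts, 1), centroids)
    return centroids

def random_rotation(d, rng):
    """Dense random orthogonal matrix from the QR of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))

def quantize(x, rotation, centroids):
    """Return per-coordinate centroid indices and the original norm (x must be non-zero)."""
    norm = np.linalg.norm(x)
    y = rotation @ (x / norm)
    edges = (centroids[:-1] + centroids[1:]) / 2
    return np.searchsorted(edges, y).astype(np.uint8), norm

def dequantize(idx, norm, rotation, centroids):
    """Look up centroids, undo the rotation, restore the norm."""
    return norm * (rotation.T @ centroids[idx])

rng = np.random.default_rng(0)
d = 128
rotation = random_rotation(d, rng)

# Training samples: coordinates of uniformly random unit vectors.
g = rng.standard_normal((2000, d))
samples = (g / np.linalg.norm(g, axis=1, keepdims=True)).ravel()

# Test vectors with four large outlier channels, unlike the training samples.
test_vectors = rng.standard_normal((200, d))
test_vectors[:, :4] *= 20.0

mse = {}
for bits in (1, 2, 3, 4):
    codebook = lloyd_max_codebook(samples, bits)
    errs = []
    for x in test_vectors:
        idx, norm = quantize(x, rotation, codebook)
        x_hat = dequantize(idx, norm, rotation, codebook)
        errs.append(np.sum((x - x_hat) ** 2) / norm ** 2)
    mse[bits] = float(np.mean(errs))
    print(f"{bits}-bit relative MSE: {mse[bits]:.4f}")
```

The codebook never sees the test vectors, yet the relative error drops by roughly a factor of three to four per extra bit. A real kernel would pack the indices into bit fields rather than storing one `uint8` per coordinate.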

Random rotation as a preprocessing step for quantization is not new to TurboQuant. RaBitQ, QuaRot, and QJL all use variants, and the relationship between TurboQuant and RaBitQ in particular has become a point of public contention (more on this below). TurboQuant’s specific contribution is pairing the rotation with provably optimal Lloyd-Max codebooks and establishing formal lower bounds showing this combination is about as good as any data-oblivious quantizer can get, within ~2.7x of the information-theoretic floor across all bit-widths and within ~1.45x at 1-bit.

For unit-norm vectors at bit-widths 1, 2, 3, 4, the concrete MSE values are approximately 0.36, 0.117, 0.03, and 0.009. Independent community implementations confirm these numbers.

The extension: fixing inner product bias. MSE-optimal quantizers systematically underestimate inner products. At 1-bit, the bias is a factor of 2/π (~0.64). You could rescale, but that amplifies variance quadratically, making the estimator noisier overall.
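The 2/π shrinkage can be checked numerically. In the sketch below, uniformly random unit vectors stand in for rotated keys (an assumption that holds when the rotation is truly random), the 1-bit codebook is taken as plus or minus the mean absolute coordinate, and the queries are synthetic, built to be partly aligned with their keys so the true inner products are not tiny.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 128, 20000

# Unit-norm "keys": random directions stand in for rotated vectors.
x = rng.standard_normal((n, d))
x /= np.linalg.norm(x, axis=1, keepdims=True)

# Synthetic queries, partly aligned with their key.
q = 0.5 * x + rng.standard_normal((n, d)) / np.sqrt(d)

# 1-bit MSE-optimal quantizer for a symmetric density: +/- E|coordinate|.
c = np.abs(x).mean()
x_hat = c * np.sign(x)

true_ip = np.sum(q * x, axis=1)
est_ip = np.sum(q * x_hat, axis=1)

# Least-squares slope through the origin: how much the estimate shrinks.
shrink = float(est_ip @ true_ip / (true_ip @ true_ip))
print("measured shrink factor:", shrink)
print("2/pi:", 2 / np.pi)
```

Dividing the estimate by this factor would remove the bias but scale the noise by the same amount, which is the variance penalty described above.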

TurboQuant_prod instead allocates (b-1) bits for the MSE quantizer and spends the last bit on a QJL (Quantized Johnson-Lindenstrauss) transform of the residual, projecting through a random Gaussian matrix and keeping only the signs. QJL gives unbiased inner product estimates, so the combined estimator is unbiased at total bit-width b, with inner product distortion scaling as ~‖y‖²/(d · 4^b).
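The QJL stage on its own is small enough to test in isolation. The sketch below checks that the sign-based estimator is unbiased by averaging over many random Gaussian projections. It is an illustration under assumptions: the function names are hypothetical, the projection is square (d by d), and a random unit vector stands in for the residual left by the MSE quantizer. In the full two-stage scheme this estimate would be added to the inner product with the MSE reconstruction.

```python
import numpy as np

def qjl_encode(r, S):
    """Keep one sign bit per projected coordinate, plus the residual norm."""
    return np.sign(S @ r), np.linalg.norm(r)

def qjl_inner_product(q, signs, r_norm, S):
    """Estimate <q, r> from the sign bits and the stored norm."""
    m = S.shape[0]
    return np.sqrt(np.pi / 2) / m * r_norm * float((S @ q) @ signs)

rng = np.random.default_rng(0)
d = 128

# Stand-in residual and a query that is partly aligned with it.
r = rng.standard_normal(d)
r /= np.linalg.norm(r)
q = 0.5 * r + rng.standard_normal(d) / np.sqrt(d)
true_ip = float(q @ r)

estimates = []
for _ in range(2000):
    S = rng.standard_normal((d, d))   # fresh random Gaussian projection
    signs, r_norm = qjl_encode(r, S)
    estimates.append(qjl_inner_product(q, signs, r_norm, S))

mean_est = float(np.mean(estimates))
print("true inner product:", true_ip)
print("mean QJL estimate: ", mean_est)
print("std of one estimate:", float(np.std(estimates)))
```

The mean matches the true value, but the spread of a single estimate is visibly large at d=128. Note that the estimator needs the projected query `S @ q`, which is one reason the correction fits poorly into an attention path that only accepts reconstructed key vectors.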

In practice, though, this second stage is where trouble starts. Multiple independent implementations (Triton kernels, llama.cpp prototypes) found that naively adding QJL correction back into reconstructed vectors hurts quality. It works when you own the attention kernel and can feed in the two-part representation directly. Through a standard attention path, the noise from reconstruction outweighs the bias correction. For most real deployments, TurboQuant_mse alone is the pragmatic choice. At 3+ bits, the inner product bias is small enough that attention barely notices.

What Holds Up, What Doesn’t

The distortion theory is solid and empirically confirmed. On Needle-in-a-Haystack, TurboQuant scores 0.997 at 4x compression, identical to full-precision on Llama-3.1-8B up to 104K context. KIVI manages 0.981, PyramidKV 0.895, SnapKV 0.858. On LongBench at 3.5 bits, TurboQuant matches the 16-bit baseline exactly (50.06 average). For nearest neighbor search, it outperforms PQ, NestQuant, LSQ++, and OPQ in recall with essentially zero indexing time.

The picture gets murkier at lower bit-widths. LongBench at 2.5 bits shows marginal degradation, and the score shifted from 49.44 in the arXiv version to 49.74 in the camera-ready without explanation. GSM8K at 3-bit on Qwen2-7B drops 1.4 points (84.3% vs 85.7%), which is a more honest picture of real-world impact than the near-perfect long-context numbers.

Several things remain unresolved.

The evaluation is narrow in ways that matter. LongBench and NIAH run prefill in full BF16, then use the quantized cache only during decode. These benchmarks are prefill-heavy, so the quantized cache barely participates. A community researcher publicly pointed this out and suggested generation-heavy tasks (reasoning on GPQA/AIME, or chunked prefill) would reveal more. The GSM8K addition helps, but one extra benchmark does not close the gap.

The long-context experiments run on Llama 3.1 8B or Ministral 7B, and the GSM8K run uses Qwen2-7B. No hybrid architectures (Qwen 3.5), nothing above 8B. We do not know how TurboQuant behaves when head dimensions, layer counts, or attention patterns change significantly.

The “8x speedup” from Google’s blog is measured against FP32 unquantized keys on H100. The paper names a “PyTorch einsum” baseline, while the blog says “JAX”. The methodology (head dimension, matrix shapes) remains underspecified despite public requests for clarification.

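The “2.5-bit” and “3.5-bit” numbers come from outlier-aware channel splitting over 128-dim heads, a standard technique from prior work, not TurboQuant-specific. The split is described as 32 channels at 3 bits + 96 at 2 bits = 2.5 average, but that arithmetic gives (32·3 + 96·2)/128 = 2.25. A true 2.5 average needs an even 64/64 split, so the exact configuration behind the headline numbers is unclear.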
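Fractional bit-widths are just weighted averages over channel groups, so they are easy to check. The helper below is hypothetical, and the 64/64 splits are my own examples of configurations that would average to 2.5 and 3.5 bits, not settings confirmed by the paper.

```python
def average_bits(groups):
    """Average bits per channel for a list of (n_channels, bits) groups."""
    total_channels = sum(n for n, _ in groups)
    return sum(n * b for n, b in groups) / total_channels

print(average_bits([(32, 3), (96, 2)]))   # 2.25
print(average_bits([(64, 3), (64, 2)]))   # 2.5
print(average_bits([(64, 4), (64, 3)]))   # 3.5
```

Any per-group metadata, such as which channels count as outliers, adds a little on top of these averages.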

Google has released no official code, and the supplementary material is insufficient for reproduction. The community has built working implementations in PyTorch, Triton, and llama.cpp. An official release is expected in Q2 2026.

The RaBitQ Dispute

This deserves a separate mention because it affects how you should read the paper’s novelty claims.

RaBitQ is a prior quantization method that also uses random rotation as a core component. TurboQuant’s paper describes RaBitQ as a “grid-based PQ method” with “suboptimal” guarantees due to “loose analysis,” while omitting the shared rotation mechanism.

In March 2026, the RaBitQ authors posted a public comment on OpenReview documenting three issues. I suggest reading it carefully if the topic interests you.

My take is that TurboQuant does contribute genuine novelty. The Lloyd-Max codebooks for the Beta distribution, the two-stage QJL residual approach, and the formal lower bounds are real contributions. But if you are evaluating this method, read the RaBitQ line of work too. The relationship is closer than TurboQuant’s paper suggests.

Where It Is Going

Over 100 KV cache compression methods exist in the literature, yet the mainstream engines expose very few of them. vLLM supports only 8-bit KV caches. llama.cpp has block-based quantizers down to 4-bit but nothing KV-specific.

TurboQuant has unusual momentum for a research method. There are open feature requests in vLLM, prototype branches in llama.cpp, a pip-installable package, and a community project (turboquant_plus) running end-to-end on Apple Silicon with 3-bit KV at q8_0 speed parity and 4.6x compression.

The algorithm’s strongest asset for adoption is its simplicity. No learned parameters, no calibration, no per-model tuning. A rotation matrix, a lookup table, a norm. If you are building or maintaining an inference stack, this is a method you can implement, understand, and debug. Whether it gets merged into the engines that matter depends on fused kernel support and wider model validation, but the barrier is engineering effort, not algorithmic complexity.

The math is settled. The production story is still being written.