TurboQuantDC
From-scratch KV cache compression library that found why TurboQuant catastrophically fails on Qwen models, proved the fix in production C code, and wrote 3 publishable findings. One person. One week.
Per-head key mean carries 57% of variance but is invisible to attention. Standard quantization wastes its entire codebook encoding something the model never uses.
| Config | PPL | vs FP16 |
|---|---|---|
| FP16 baseline | 7.52 | -- |
| WHT 3-bit | 9,410 | +9,403 |
| WHT 3-bit + mean-removal | 7.90 | +0.38 |
| WHT 4-bit | 1,049 | +1,041 |
| WHT 4-bit + mean-removal | 7.76 | +0.24 |
| Config | PPL | vs FP16 |
|---|---|---|
| FP16 baseline | 10.72 | -- |
| WHT 3-bit | 60.20 | +49.48 |
| WHT 3-bit + mean-removal | 11.02 | +0.30 |
| WHT 4-bit | 13.22 | +2.51 |
| WHT 4-bit + mean-removal | 10.83 | +0.11 |
| Config | PPL (wikitext-2) | vs FP16 |
|---|---|---|
| Q4_0 weights + FP16 KV (baseline) | 7.50 | -- |
| Q4_0 weights + turbo3 KV | 7.55 | +0.67% |
| Q4_0 weights + turbo4 KV | 7.53 | +0.36% |
| Q4_0 weights + turbo3 KV + mean-removal | 7.37 | -0.13 PPL (beats FP16) |
Turbo3 with mean-removal produces lower perplexity than FP16 KV on Llama. The quantization noise from mean-centered coordinates acts as a mild regularizer.
Needle: "The secret code is PINEAPPLE-77." -- tested at 10%, 50%, 90% depth.
Without mean-removal
With mean-removal
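Softmax is shift-invariant: softmax(x + c) = softmax(x). Subtracting the per-head key mean μ from every key changes each score q·k_i by the same amount q·μ, a constant across positions for a given query, so attention ignores it completely.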
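The shift-invariance argument can be checked numerically. The snippet below is a minimal sketch on toy data, not TurboQuantDC code: `attention_weights` and `remove_mean` are hypothetical helper names, and the keys are made-up values with a large shared offset per channel.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_weights(query, keys):
    """Softmax over the query-key dot products for one head."""
    return softmax([dot(query, k) for k in keys])

def remove_mean(keys):
    """Subtract the per-channel mean (taken over tokens) from every key."""
    n = len(keys)
    dim = len(keys[0])
    mean = [sum(k[j] for k in keys) / n for j in range(dim)]
    centered = [[k[j] - mean[j] for j in range(dim)] for k in keys]
    return centered, mean

# Toy keys (assumption): every channel carries a large offset shared by all tokens.
keys = [[10.2, -7.9, 3.1], [9.8, -8.3, 2.7], [10.5, -8.0, 3.4], [9.9, -7.6, 2.9]]
query = [0.4, -0.2, 0.7]

centered, mean = remove_mean(keys)
w_raw = attention_weights(query, keys)
w_centered = attention_weights(query, centered)

# Both sets of weights agree up to floating-point rounding.
max_diff = max(abs(a - b) for a, b in zip(w_raw, w_centered))
print(f"max |difference| in attention weights: {max_diff:.2e}")
```

Centering changes every score by the same `dot(query, mean)`, so the weights match and the mean never needs to be reconstructed for the key side of attention.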
57% of total KV variance is in the per-head channel mean. Standard quantization burns codebook range encoding this signal. Attention never looks at it. Remove it before quantizing and every bit counts.
Mean-removed 3-bit beats standard 4-bit on every metric. Removing the mean effectively gives you one extra bit of precision for free, because the codebook no longer wastes resolution on the constant offset.
Qwen models have unusually high per-head key bias. The channel means are large relative to the signal. At 3-bit, the codebook cannot represent both the mean and the variation -- it clips the signal entirely. The model sees near-uniform attention and generates random tokens.
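A toy experiment illustrates the failure mode. This is a sketch under stated assumptions, not the TurboQuant codebook: it uses a plain symmetric uniform quantizer (`quantize_symmetric`, a hypothetical helper) on one synthetic channel whose offset is 8x its standard deviation. Real Qwen statistics will differ.

```python
import random

def quantize_symmetric(values, bits):
    """Uniform quantizer with 2**bits levels spanning [-max|v|, +max|v|]."""
    levels = 2 ** bits
    scale = max(abs(v) for v in values) or 1.0
    step = 2 * scale / (levels - 1)
    return [round((v + scale) / step) * step - scale for v in values]

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

random.seed(0)
offset = 8.0  # assumed channel mean, large relative to the unit-variance signal
signal = [random.gauss(0.0, 1.0) for _ in range(4096)]
channel = [offset + s for s in signal]

# Standard path: the 3-bit grid must span the offset, so few levels land on the signal.
plain = quantize_symmetric(channel, bits=3)
err_plain = mse(channel, plain)

# Mean-removal path: subtract the mean, quantize, add the mean back.
mean = sum(channel) / len(channel)
centered = [c - mean for c in channel]
recon = [q + mean for q in quantize_symmetric(centered, bits=3)]
err_centered = mse(channel, recon)

print(f"3-bit MSE without mean-removal: {err_plain:.3f}")
print(f"3-bit MSE with mean-removal:    {err_centered:.3f}")
```

In this toy setup the uncentered error is on the order of the signal variance itself, which is the sense in which the codebook "clips the signal entirely".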
Tom Turney (TurboQuant author) independently observed +62.95 PPL on Qwen2.5-3B. Same root cause.
NSNQuant (May 2025) published channel centering for KV cache quantization. The technique is not new. What IS new: connecting it to why TurboQuant specifically fails on Qwen models, and proving it is the single root cause of catastrophic PPL collapse, not just an optimization.
Gemma 4, Google's latest model, at its full native context window. Single RTX 4090. FP16 KV OOMs at 262K.
FP16 KV cache needs ~28 GB for 262K context. The RTX 4090 has 24 GB. TurboQuantDC 3-bit needs ~5.4 GB. Fits comfortably with room to spare.
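The memory arithmetic follows the usual KV cache formula. The sketch below is illustrative only: `kv_cache_bytes` is a hypothetical helper, and the layer count, KV head count, and head dimension are placeholder values for a uniform GQA model, not Gemma 4's configuration (which mixes head dimensions and is not fully specified here). Only the 5.12x ratio comes from the measurements in this document.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bits_per_value):
    """Bytes for keys plus values across all layers, ignoring per-block metadata."""
    n_values = 2 * n_layers * n_kv_heads * head_dim * n_tokens  # 2 = K and V
    return n_values * bits_per_value / 8

GIB = 1024 ** 3

# Placeholder config (assumption): 32 layers, 8 KV heads, d=128.
cfg = dict(n_layers=32, n_kv_heads=8, head_dim=128)

fp16 = kv_cache_bytes(**cfg, n_tokens=65536, bits_per_value=16)
packed = fp16 / 5.12  # measured 3-bit compression ratio, including overhead

print(f"FP16 KV at 64K tokens:  {fp16 / GIB:.2f} GiB")
print(f"3-bit KV at 64K tokens: {packed / GIB:.2f} GiB")
```

The cache grows linearly with context length, so the same ratio decides whether a given context fits in a fixed VRAM budget.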
| Context Length | FP16 KV (tok/s) | TurboQuantDC 3-bit (tok/s) | Notes |
|---|---|---|---|
| 4,096 | 152 | 154 | Equal |
| 65,536 | 159 | 161 | TurboQuantDC faster |
| 131,072 | 161 | 166 | Less VRAM pressure = faster |
| 196,608 | 158 | 158 | Equal |
| 262,144 | OOM (needs 28 GB) | 150 | Only TurboQuantDC runs |
| Bits | Cosine Sim | Top-5 Match | Compression |
|---|---|---|---|
| 3-bit | 0.999994 | 100% | 5.12x |
| 4-bit | 0.999999 | 100% | 3.88x |
Gemma 4 uses mixed head dimensions across layer types:
| Layer Type | Count | Head Dim |
|---|---|---|
| Sliding Window | 20 layers | d=256 |
| Global Anchor | 4 layers | d=512 |
CUDA WHT kernel handles both. 29x faster than Triton at d=256.
Every experiment published, including the failures. Each ran on real models with real KV caches on RTX 4090.
Removes per-head channel mean before quantization. 57% of variance is invisible to attention. Fixes catastrophic failure on Qwen models.
Store sign(residual) directly in rotated space. No random projection. Same 1-bit budget, lower variance. QJL hurts generation; ResidualQuant fixes it. Confirmed by TurboQuant author.
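One way to read the sign-of-residual idea is sketched below. This is an assumption-laden illustration, not the library's implementation: the base quantizer is a plain uniform grid, the input stands in for an already-rotated vector, and the single per-vector scale (mean absolute residual) is a choice made here because it minimizes squared error for a sign-only correction.

```python
import random

def quantize_uniform(values, bits):
    """Uniform quantizer with 2**bits levels spanning [-max|v|, +max|v|]."""
    levels = 2 ** bits
    scale = max(abs(v) for v in values) or 1.0
    step = 2 * scale / (levels - 1)
    return [round((v + scale) / step) * step - scale for v in values]

def residual_sign_encode(rotated, base_bits):
    """Return (base, signs, scale): coarse values, 1-bit residual signs, one scalar."""
    base = quantize_uniform(rotated, base_bits)
    residual = [x - b for x, b in zip(rotated, base)]
    signs = [1 if r >= 0 else -1 for r in residual]
    scale = sum(abs(r) for r in residual) / len(residual)  # assumed scale choice
    return base, signs, scale

def residual_sign_decode(base, signs, scale):
    """Add the signed correction back onto the coarse reconstruction."""
    return [b + scale * s for b, s in zip(base, signs)]

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

random.seed(0)
rotated = [random.gauss(0.0, 1.0) for _ in range(256)]  # stand-in for a WHT-rotated key

base, signs, scale = residual_sign_encode(rotated, base_bits=2)
err_base = mse(rotated, base)
err_corrected = mse(rotated, residual_sign_decode(base, signs, scale))
print(f"MSE, 2-bit base only:        {err_base:.4f}")
print(f"MSE, 2-bit base + sign bit:  {err_corrected:.4f}")
```

With this scale the corrected error equals the base error minus `scale**2`, so the extra bit can only help in the MSE sense. No random projection is involved.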
Longer context = better compression. Min bits/token is O(1/n). At 2K, only 0.3% of tokens get meaningful attention. Inverts "more context = more memory."
Block-diagonal WHT plus mean-removal outperforms RotorQuant's full learned rotation. Simpler, faster, no calibration data needed.
Eviction + distillation + quantization stacked. At 37.9x compression, 93% cosine similarity preserved. Pipeline: EA evict -> KVSculpt distill -> TQ 3-bit.
Synthesize fewer tokens that preserve attention patterns. Distillation quality improves with context length (more redundancy). 50 gradient steps at lr=0.01.
Analytically predict future token importance without seeing future queries. Works well on steady-state text. Caveat: fails on topic shifts.
Learn rotation to minimize attention divergence, not just MSE. Orthogonality via Cayley parameterization. Modest practical gain (+0.002-0.006 on typical layers).
Storing pre-RoPE activations X to rematerialize K. Problem: X is 4x larger than K+V combined in GQA models, and reconstruction reaches only 0.815 cosine similarity. More memory, worse quality.
EA predicts importance well for steady text but fails when the topic changes. Future queries diverge from past patterns. Needs a shift detector guard.
Rotation calibrated on 3B does not help 14B. Wins on 3/5 layers by <0.003, loses on 2. Must recalibrate per model. Not worth the cost.
FAISS-based CPU/GPU hybrid KV cache. Works at short context, but the index corrupts above 2K tokens. Architecture sound, implementation needs rebuild.
48.7% of KV variance in 10% of coordinates. PCA exploits this structure. Top-5 match: 94.4% -> 100% at 3-bit. Calibrate on 128 tokens.
Keep last 128 tokens at FP16. At 32K context this is 0.4% of tokens. Eliminates error accumulation at the most-attended positions.
First 2 + last 2 layers at FP16. PPL increase drops from +159% to +15.5%. 4 layers out of 36-80 is roughly 5-11% of vectors.
High-entropy heads need +1 bit. ~15% of heads drive most quality loss. Route K3/K4 by entropy threshold. Complementary to channel-level work.
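Entropy-based routing can be sketched in a few lines. The function names and the threshold value are hypothetical, and the two attention rows are made-up examples. How the library aggregates entropy over queries and layers is not specified here.

```python
import math

def attention_entropy(weights):
    """Shannon entropy (in nats) of one attention distribution."""
    return -sum(w * math.log(w) for w in weights if w > 0.0)

def route_bits(head_entropies, threshold, low_bits=3, high_bits=4):
    """Give heads whose entropy exceeds the threshold one extra bit."""
    return [high_bits if h > threshold else low_bits for h in head_entropies]

# Two toy heads (assumption): one sharply peaked, one spread evenly over 4 tokens.
heads = {
    "peaked": [0.97, 0.01, 0.01, 0.01],
    "diffuse": [0.25, 0.25, 0.25, 0.25],
}

entropies = [attention_entropy(w) for w in heads.values()]
bits = route_bits(entropies, threshold=1.0)  # threshold is an illustrative value

for name, h, b in zip(heads, entropies, bits):
    print(f"{name}: entropy={h:.3f} nats -> K{b}")
```

A diffuse head spreads weight across many keys, so quantization error on any of them reaches the output, which is one intuition for spending the extra bit there.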
Fused Walsh-Hadamard transform in CUDA. O(d log d) butterfly. Triton hits register pressure cliff at d=256; CUDA does not. Handles d=512 (Gemma anchors).
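For reference, the O(d log d) butterfly looks like this in plain Python. It is a readable sketch of the standard fast Walsh-Hadamard transform with orthonormal scaling, not the fused CUDA kernel. `fwht` is a name chosen here for illustration.

```python
import math

def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform of a power-of-two-length list."""
    n = len(vec)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(vec)
    h = 1
    while h < n:
        # Each pass combines pairs that sit h apart: log2(n) passes of n/2 butterflies.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    norm = 1.0 / math.sqrt(n)
    return [v * norm for v in out]

# With orthonormal scaling the transform is its own inverse and preserves norms.
x = [float((7 * i) % 5 - 2) for i in range(256)]  # arbitrary d=256 test vector
y = fwht(x)
roundtrip_err = max(abs(a - b) for a, b in zip(x, fwht(y)))
print(f"round-trip max error at d=256: {roundtrip_err:.2e}")
```

Because the transform is orthogonal, it spreads outlier coordinates across the vector without changing dot products between vectors rotated the same way.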
Learned rotation angles via straight-through estimator. Novel attention-KL objective vs standard MSE. Works but Cayley improvement is modest.
Outlier-aware mixed bit-width per coordinate. Smooth quality-compression tradeoff between integer bit-widths.
Hot/warm/cold tiers with increasing compression as tokens age. Recent tokens at FP16, old tokens at 2-bit. Smooth degradation curve.
The longer the context, the better the compression. This inverts the standard assumption.
[Figure: attention Gini coefficient (higher = more concentrated attention) vs context length, measured on Qwen2.5-3B-Instruct across 128-2,094 tokens.]
At 128 tokens: min 0.189 bits/token. At 2,094 tokens: min 0.015 bits/token.
Longer context does not mean more memory. Under optimal adaptive compression, it means less.
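The Gini coefficient used as the concentration measure can be computed from a single attention row as follows. This is a minimal sketch using the standard sorted-rank formula. How the measurements here aggregate over heads, layers, and queries is not specified, so the helper `gini` and the toy rows are assumptions.

```python
def gini(weights):
    """Gini coefficient of non-negative weights: 0 = uniform, (n-1)/n = one-hot."""
    xs = sorted(weights)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        raise ValueError("need at least one positive weight")
    ranked = sum((i + 1) * x for i, x in enumerate(xs))  # ranks start at 1
    return 2.0 * ranked / (n * total) - (n + 1.0) / n

uniform = [0.25, 0.25, 0.25, 0.25]  # attention spread evenly
one_hot = [0.0, 0.0, 0.0, 1.0]      # all attention on one token

print(f"uniform row: {gini(uniform):.3f}")
print(f"one-hot row: {gini(one_hot):.3f}")
```

A Gini value that rises with context length means a shrinking fraction of tokens carries most of the attention mass, which is what lets an adaptive scheme spend fewer bits per token on average.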
Every number measured on RTX 4090 with real LLM KV caches. Not synthetic benchmarks.
| Model | Params | d | Bits | Cosine Sim | Top-5 Match | Compression | Generation |
|---|---|---|---|---|---|---|---|
| Qwen2.5-3B | 3B | 128 | 3 | 0.9969 | 94.4% | 5.0x | 5/5 match |
| Gemma 4 E4B | 4B MoE | 256/512 | 3 | 0.999994 | 100% | 5.12x | 150 tok/s @ 262K |
| Qwen2.5-7B | 7B | 128 | 3 | -- | -- | 5.0x | PPL 7.90 (+0.38) |
| Llama 3.1 8B | 8B | 128 | 3 | -- | -- | 5.0x | PPL 7.37 (beats FP16) |
| Qwen2.5-14B | 14B | 128 | 3 | 0.9964 | 95.3% | 5.0x | 5/5 match |
| Qwen3.5-27B | 27B | 256 | 3 | 0.9932 | 100% | 5.2x | -- |
| Qwen2.5-32B | 32B | 128 | 3 | -- | -- | 5.0x | 5/5 match |
| Qwen3.5-35B | 35B MoE | 128 | 3 | -- | -- | 5.0x | 28 tok/s @ 2K |
| Llama 3.1 70B | 70B | 128 | 3 | -- | -- | 5.0x | 4x context (4K->16K) |
| Qwen2.5-72B | 72B | 128 | 3 | -- | -- | 5.0x | Ollama tested |
| Model | FP16 KV Max Context | Turbo3 KV Max Context | Extension |
|---|---|---|---|
| Llama 3.1 8B (TQ4_1S weights) | ~48K (OOM at 56K) | ~100K | 2.1x |
| Llama 3.1 70B (Q2_K weights) | ~4K (OOM at 8K) | ~16K | 4x |
| Gemma 4 26B | ~196K (OOM at 262K) | 262K (full native) | Full window |
External validation from the TurboQuant paper author.
"We've actually been working on removing QJL for a while now and focusing on getting the stacking story right [...] your finding on the stacking numbers is very interesting"
TurboQuant paper author (ICLR 2026), TheTom/llama-cpp-turboquant
While we built TurboQuantDC from scratch, the open-source community was converging on the same findings independently. Tom was already removing QJL. NSNQuant published channel centering. The boundary layer strategy appeared in multiple codebases. When multiple teams arrive at the same conclusions from different starting points, the conclusions are probably right.