The fix the paper missed
PPL 9,410 -> 7.90
Qwen2.5-7B. WHT 3-bit. One line of code.

TurboQuantDC

From-scratch KV cache compression library that identified why TurboQuant fails catastrophically on Qwen models, validated the fix in production C code (llama.cpp), and produced 3 publishable findings. One person. One week.

1,190x PPL improvement from one line
262K Gemma 4 full context on a single RTX 4090
5.2x KV cache compression
1,796 tests passing
MIT license
pip install turboquantdc
123 commits, 20 experiments, 8 breakthroughs, 4 dead ends

The Fix

Per-head key mean carries 57% of variance but is invisible to attention. Standard quantization wastes its entire codebook encoding something the model never uses.

The 6-line C patch for llama.cpp TurboQuant

// Before quantizing keys, remove per-head channel mean.
// Softmax is shift-invariant: softmax(x+c) == softmax(x).
// The mean is invisible to attention but consumes codebook range.
for (int h = 0; h < n_head; h++) {
    float mean = ggml_vec_mean(head_dim, keys + h * head_dim);
    ggml_vec_sub1(head_dim, keys + h * head_dim, mean);
}

Qwen2.5-7B-Instruct

| Config | PPL | vs FP16 |
|---|---|---|
| FP16 baseline | 7.52 | -- |
| WHT 3-bit | 9,410 | +9,403 |
| WHT 3-bit + mean-removal | 7.90 | +0.38 |
| WHT 4-bit | 1,049 | +1,041 |
| WHT 4-bit + mean-removal | 7.76 | +0.24 |

Qwen2.5-3B-Instruct

| Config | PPL | vs FP16 |
|---|---|---|
| FP16 baseline | 10.72 | -- |
| WHT 3-bit | 60.20 | +49.48 |
| WHT 3-bit + mean-removal | 11.02 | +0.30 |
| WHT 4-bit | 13.22 | +2.51 |
| WHT 4-bit + mean-removal | 10.83 | +0.11 |

Llama 3.1 8B in llama.cpp (production C code)

| Config | PPL (wikitext-2) | vs FP16 KV |
|---|---|---|
| Q4_0 weights + FP16 KV (baseline) | 7.50 | -- |
| Q4_0 weights + turbo3 KV | 7.55 | +0.67% |
| Q4_0 weights + turbo4 KV | 7.53 | +0.36% |
| Q4_0 weights + turbo3 KV + mean-removal | 7.37 | -0.13 PPL, about -1.7% (beats FP16) |

Turbo3 with mean-removal produces lower perplexity than FP16 KV on Llama. One possible explanation is that the quantization noise from mean-centered coordinates acts as a mild regularizer.

Needle-in-a-Haystack: 8K context, Qwen2.5-7B

Needle: "The secret code is PINEAPPLE-77." -- tested at 10%, 50%, 90% depth.

Without mean-removal

10% depth: FAIL. Output: 0000., . numberWith); .0..0..
50% depth: FAIL. Output: 001 9111 2 mathematics 9 of mathematics 0.
90% depth: FAIL. Output: 1 0 0000000 strugg0 mathematics. 00.52

With mean-removal

10% depth: PASS. Output: PINEAPPLE-77
50% depth: PASS. Output: PINEAPPLE-77
90% depth: PASS. Output: PINEAPPLE-77

Why It Works

Softmax is shift-invariant. softmax(x + c) = softmax(x). The per-head key mean is a constant shift that attention ignores completely.
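The invariance can be checked without computing a softmax at all, because softmax depends only on differences between logits. The sketch below assumes "per-head channel mean" means the mean of each channel taken across the cached keys of one head; the function names and the row-major layout are illustrative, not taken from TurboQuantDC or llama.cpp. It centers the keys and measures whether every logit q . k moved by the same constant q . mean.

```c
#include <assert.h>
#include <stddef.h>

/* Dot product of two length-d vectors. */
static float dot(const float *a, const float *b, size_t d) {
    float s = 0.0f;
    for (size_t i = 0; i < d; i++) s += a[i] * b[i];
    return s;
}

/* Subtract the per-channel mean (taken across the n_tok cached keys of one
 * head) from every key in place, and write the mean to mean_out.
 * keys is row-major: keys[t * d + c] is channel c of token t. */
static void center_keys(float *keys, size_t n_tok, size_t d, float *mean_out) {
    assert(n_tok > 0 && d > 0);
    for (size_t c = 0; c < d; c++) {
        float sum = 0.0f;
        for (size_t t = 0; t < n_tok; t++) sum += keys[t * d + c];
        mean_out[c] = sum / (float)n_tok;
        for (size_t t = 0; t < n_tok; t++) keys[t * d + c] -= mean_out[c];
    }
}

/* Largest deviation, over all tokens, between (raw logit - centered logit)
 * and the constant q . mean. A value near zero means centering shifted
 * every logit by the same amount, which softmax cannot see. */
static float max_shift_deviation(const float *q, const float *raw,
                                 const float *centered, const float *mean,
                                 size_t n_tok, size_t d) {
    float c = dot(q, mean, d);
    float worst = 0.0f;
    for (size_t t = 0; t < n_tok; t++) {
        float diff = dot(q, raw + t * d, d) - dot(q, centered + t * d, d) - c;
        if (diff < 0.0f) diff = -diff;
        if (diff > worst) worst = diff;
    }
    return worst;
}
```

Copy the raw keys, call center_keys on the copy, then pass both to max_shift_deviation with any query vector; the result should be zero up to floating-point rounding.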

The Wasted Bits

57% of total KV variance is in the per-head channel mean. Standard quantization burns codebook range encoding this signal. Attention never looks at it. Remove it before quantizing and every bit counts.
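One way to measure a figure like this on your own cache is sketched below. It assumes the statistic is the share of the keys' mean squared magnitude carried by the per-channel mean, which follows from E[k^2] = mean^2 + variance; the source does not spell out its exact definition, so treat the function name and the metric as assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Fraction of the keys' mean squared magnitude that is carried by the
 * per-channel mean. keys is row-major: keys[t * d + c]. */
static double mean_energy_fraction(const float *keys, size_t n_tok, size_t d) {
    assert(n_tok > 0 && d > 0);
    double mean_sq = 0.0, total_sq = 0.0;
    for (size_t c = 0; c < d; c++) {
        double sum = 0.0;
        for (size_t t = 0; t < n_tok; t++) {
            double v = keys[t * d + c];
            sum += v;
            total_sq += v * v;
        }
        double mu = sum / (double)n_tok;
        mean_sq += mu * mu;          /* energy of the constant offset */
    }
    total_sq /= (double)n_tok;       /* average energy per key */
    return total_sq > 0.0 ? mean_sq / total_sq : 0.0;
}
```

For a single channel holding the values 4 and 2, the mean is 3 and the function returns 9 / 10 = 0.9: most of the magnitude is offset, not signal.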

The Free Precision

+1 bit: mean-removed 3-bit beats standard 4-bit in both Qwen perplexity tables above (7.90 vs 1,049 on Qwen2.5-7B, 11.02 vs 13.22 on Qwen2.5-3B). Removing the mean effectively gives you one extra bit of precision for free, because the codebook no longer wastes resolution on the constant offset.
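A toy illustration of the effect, using a plain uniform quantizer as a stand-in (TurboQuant's actual codebook is not reproduced here, and the function name is hypothetical). The quantizer spans [-r, r] where r is the largest magnitude in the input, so a large constant offset stretches the range and coarsens every step.

```c
#include <assert.h>
#include <stddef.h>

/* Mean squared error of a uniform quantizer with 2^bits levels spanning
 * [-r, r], where r is the largest magnitude in x. */
static double uniform_quant_mse(const double *x, size_t n, int bits) {
    assert(n > 0 && bits >= 1 && bits <= 16);
    double r = 0.0;
    for (size_t i = 0; i < n; i++) {
        double a = x[i] < 0.0 ? -x[i] : x[i];
        if (a > r) r = a;
    }
    if (r == 0.0) return 0.0;
    long levels = 1L << bits;
    double step = 2.0 * r / (double)(levels - 1);
    double err = 0.0;
    for (size_t i = 0; i < n; i++) {
        /* x[i] + r is non-negative, so truncation after +0.5 rounds to nearest. */
        long idx = (long)((x[i] + r) / step + 0.5);
        if (idx > levels - 1) idx = levels - 1;
        double e = x[i] - (-r + (double)idx * step);
        err += e * e;
    }
    return err / (double)n;
}
```

With the values {9, 9.5, 10, 10.5, 11} and their centered version {-1, -0.5, 0, 0.5, 1}, the centered data at 3 bits has lower error than the raw data at 4 bits. The toy exaggerates the gap because its offset is ten times the signal amplitude.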

Why Qwen Fails Worst

Qwen models have unusually high per-head key bias. The channel means are large relative to the signal. At 3-bit, the codebook cannot represent both the mean and the variation -- it clips the signal entirely. The model sees near-uniform attention and generates random tokens.

Tom Turney (TurboQuant author) independently observed +62.95 PPL on Qwen2.5-3B. Same root cause.

Prior Art

NSNQuant (May 2025) published channel centering for KV cache quantization. The technique is not new. What IS new: connecting it to why TurboQuant specifically fails on Qwen models, and proving it is the single root cause of catastrophic PPL collapse, not just an optimization.

Gemma 4: 262K Context on One GPU

Google's latest model at its full native context window. Single RTX 4090. FP16 KV OOMs at 262K.

Gemma 4 26B MoE -- Full Native 262K Window

FP16 KV cache needs ~28 GB for 262K context. The RTX 4090 has 24 GB. TurboQuantDC 3-bit needs ~5.4 GB. Fits comfortably with room to spare.

| Context Length | FP16 KV (tok/s) | TurboQuantDC 3-bit (tok/s) | Notes |
|---|---|---|---|
| 4,096 | 152 | 154 | Equal |
| 65,536 | 159 | 161 | TurboQuantDC faster |
| 131,072 | 161 | 166 | Less VRAM pressure = faster |
| 196,608 | 158 | 158 | Equal |
| 262,144 | OOM (needs 28 GB) | 150 | Only TurboQuantDC runs |

Compression Quality

| Bits | Cosine Sim | Top-5 Match | Compression |
|---|---|---|---|
| 3-bit | 0.999994 | 100% | 5.12x |
| 4-bit | 0.999999 | 100% | 3.88x |
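The compression column can be reproduced with simple bit accounting. The sketch below assumes 16 bits of side information (for example one FP16 scale) per 128 coordinates; that grouping is a guess that happens to match the 5.12x and 3.88x figures, not a documented detail of TurboQuantDC.

```c
#include <assert.h>

/* Compression ratio versus FP16 when each coordinate costs `bits` bits and
 * every group of `group` coordinates carries `side_bits` of side
 * information (for example a scale or norm). */
static double compression_ratio(double bits, double side_bits, double group) {
    assert(bits > 0.0 && side_bits >= 0.0 && group > 0.0);
    return 16.0 / (bits + side_bits / group);
}
```

Under that assumption, compression_ratio(3, 16, 128) is 16 / 3.125 = 5.12 and compression_ratio(4, 16, 128) is 16 / 4.125, about 3.88. Dividing the ~28 GB FP16 cache by 5.12 gives roughly 5.5 GB, in line with the ~5.4 GB quoted above.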

Architecture Discovery

Gemma 4 uses mixed head dimensions across layer types:

| Layer Type | Count | Head Dim |
|---|---|---|
| Sliding Window | 20 layers | d=256 |
| Global Anchor | 4 layers | d=512 |

CUDA WHT kernel handles both. 29x faster than Triton at d=256.

20 Experiments. 8 Breakthroughs. 4 Dead Ends.

Every experiment published, including the failures. Each ran on real models with real KV caches on RTX 4090.

Breakthrough

Mean-Removal Quantization

PPL 9,410 -> 7.90

Removes per-head channel mean before quantization. 57% of variance is invisible to attention. Fixes catastrophic failure on Qwen models.

Breakthrough

ResidualQuant > QJL

Matches FP16 generation

Store sign(residual) directly in rotated space. No random projection. Same 1-bit budget, lower variance. QJL hurts generation; ResidualQuant fixes it. Confirmed by TurboQuant author.
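A minimal sketch of the idea, assuming the stored scale is the mean absolute residual (the scalar that minimizes squared error for a pure sign correction). The function names, the byte-per-sign layout, and the choice of scale are illustrative; the source only states that sign(residual) is stored directly.

```c
#include <assert.h>
#include <stddef.h>

/* Given x and a coarse reconstruction xhat, store one sign bit per
 * coordinate of the residual and return a single scale (mean |residual|).
 * sign_out[i] is 1 for a non-negative residual, 0 otherwise. */
static float residual_sign_encode(const float *x, const float *xhat, size_t d,
                                  unsigned char *sign_out) {
    assert(d > 0);
    float abs_sum = 0.0f;
    for (size_t i = 0; i < d; i++) {
        float r = x[i] - xhat[i];
        sign_out[i] = (unsigned char)(r >= 0.0f);
        abs_sum += r >= 0.0f ? r : -r;
    }
    return abs_sum / (float)d;
}

/* Apply the 1-bit correction in place: xhat[i] += +scale or -scale. */
static void residual_sign_decode(float *xhat, const unsigned char *sign,
                                 float scale, size_t d) {
    for (size_t i = 0; i < d; i++) xhat[i] += sign[i] ? scale : -scale;
}

/* Sum of squared differences, used to compare error before and after. */
static float sq_err(const float *a, const float *b, size_t d) {
    float s = 0.0f;
    for (size_t i = 0; i < d; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
    return s;
}
```

With residuals {0.2, -0.4, 0.1, -0.3} the scale is 0.25 and the squared error drops from 0.30 to 0.05 after the correction. A production version would pack the signs into bits instead of bytes.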

Breakthrough

Asymptotic Compression Law

Gini ~ 0.08*ln(n), R^2 = 0.989

Longer context = better compression. Min bits/token is O(1/n). At 2K, only 0.3% of tokens get meaningful attention. Inverts "more context = more memory."

Breakthrough

Block Rotation + Mean-Removal

Beats RotorQuant at 3-bit

Block-diagonal WHT plus mean-removal outperforms RotorQuant's full learned rotation. Simpler, faster, no calibration data needed.

Breakthrough

Triple-Stack Compression

37.9x at 0.93 quality

Eviction + distillation + quantization stacked. At 37.9x compression, 93% cosine similarity preserved. Pipeline: EA evict -> KVSculpt distill -> TQ 3-bit.

Breakthrough

KVSculpt Cache Distillation

19.7x near-lossless

Synthesize fewer tokens that preserve attention patterns. Distillation quality improves with context length (more redundancy). 50 gradient steps at lr=0.01.

Breakthrough

Expected Attention Pruning

10x compression, 0.978 cosine

Analytically predict future token importance without seeing future queries. Works well on steady-state text. Caveat: fails on topic shifts.

Breakthrough

Cayley Learned Rotation

Novel attention-KL objective

Learn rotation to minimize attention divergence, not just MSE. Orthogonality via Cayley parameterization. Modest practical gain (+0.002-0.006 on typical layers).
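The Cayley parameterization maps any skew-symmetric matrix A to an orthogonal matrix, so gradient steps on A never leave the set of rotations. The 2x2 closed form below is a sketch under one common sign convention, Q = (I - A)(I + A)^-1; whether the experiment used this convention is not stated in the source.

```c
#include <assert.h>
#include <stddef.h>

/* Cayley transform of the 2x2 skew-symmetric matrix A = [[0, a], [-a, 0]]:
 * Q = (I - A)(I + A)^{-1}, written out in closed form. q is row-major. */
static void cayley_2x2(double a, double q[4]) {
    assert(q != NULL);
    double s = 1.0 / (1.0 + a * a);   /* 1 / det(I + A) */
    q[0] = (1.0 - a * a) * s;
    q[1] = -2.0 * a * s;
    q[2] = 2.0 * a * s;
    q[3] = (1.0 - a * a) * s;
}
```

For a = 0.5 this gives Q = [[0.6, -0.8], [0.8, 0.6]], whose rows have unit length and are mutually orthogonal. Any real value of a yields a valid rotation, which is what makes the parameter safe to optimize freely.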

Dead End

XQuant Rematerialization

NOT viable for GQA

Storing pre-RoPE activations to rematerialize K. Problem: X is 4x larger than K+V combined in GQA models. 0.815 cosine. More memory, worse quality.

Dead End

Expected Attention on Topic Shifts

-0.035 Spearman (anti-correlated)

EA predicts importance well for steady text but fails when the topic changes. Future queries diverge from past patterns. Needs a shift detector guard.

Dead End

Cross-Model Cayley Transfer

No transfer across models

Rotation calibrated on 3B does not help 14B. Wins on 3/5 layers by <0.003, loses on 2. Must recalibrate per model. Not worth the cost.

Dead End

TurboRetrievalCache at Scale

Broken > 2K tokens

FAISS-based CPU/GPU hybrid KV cache. Works at short context, but the index becomes corrupted above 2K tokens. Architecture sound, implementation needs rebuild.

Technique

PCA-Adaptive Rotation

13x lower MSE

48.7% of KV variance in 10% of coordinates. PCA exploits this structure. Top-5 match: 94.4% -> 100% at 3-bit. Calibrate on 128 tokens.

Technique

FP16 Hot Window

0.4% cost, fixes accumulation

Keep last 128 tokens at FP16. At 32K context this is 0.4% of tokens. Eliminates error accumulation at the most-attended positions.
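The cost of the hot window is easy to work out. The helper below is a sketch with a hypothetical name; it ignores any per-vector side information and only averages the per-value bit widths.

```c
#include <assert.h>

/* Average bits per cached value when the newest `hot` tokens stay at FP16
 * and the remaining tokens use `cold_bits`. */
static double avg_bits_with_hot_window(long n_ctx, long hot, double cold_bits) {
    assert(n_ctx > 0 && hot >= 0);
    if (hot > n_ctx) hot = n_ctx;    /* short contexts are entirely hot */
    return ((double)hot * 16.0 + (double)(n_ctx - hot) * cold_bits)
           / (double)n_ctx;
}
```

At 32,768 tokens with a 128-token window over a 3-bit cache, the window covers 128 / 32,768 = 0.39% of tokens and the average rises from 3.0 to about 3.05 bits per value.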

Technique

Boundary Layer Anchoring

Recovers 90% of quality gap

First 2 + last 2 layers at FP16. PPL increase drops from +159% to +15.5%. 4 layers out of 36-80 is roughly 5-11% of vectors.

Technique

Per-Head Bit Allocation

Same avg bits, better quality

High-entropy heads need +1 bit. ~15% of heads drive most quality loss. Route K3/K4 by entropy threshold. Complementary to channel-level work.

Technique

CUDA WHT Kernel

29x speedup at d=256

Fused Walsh-Hadamard transform in CUDA. O(d log d) butterfly. Triton hits register pressure cliff at d=256; CUDA does not. Handles d=512 (Gemma anchors).
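For reference, the butterfly structure looks like this in scalar C. This is a plain CPU sketch of the textbook fast Walsh-Hadamard transform, not the fused CUDA kernel described above, and it leaves the result unnormalized.

```c
#include <assert.h>
#include <stddef.h>

/* In-place, unnormalized fast Walsh-Hadamard transform.
 * d must be a power of two. Runs log2(d) butterfly passes of d/2 add/sub
 * pairs each. Applying it twice returns d times the input. */
static void fwht(float *x, size_t d) {
    assert(d > 0 && (d & (d - 1)) == 0);
    for (size_t h = 1; h < d; h <<= 1) {
        for (size_t i = 0; i < d; i += 2 * h) {
            for (size_t j = i; j < i + h; j++) {
                float a = x[j], b = x[j + h];
                x[j] = a + b;        /* sum half of the butterfly */
                x[j + h] = a - b;    /* difference half */
            }
        }
    }
}
```

Applied to {1, 2, 3, 4} it produces {10, -2, -4, 0}; a second application gives {4, 8, 12, 16}, which is 4 times the input. Scaling by 1/sqrt(d) after each pass through the transform would make it orthonormal and self-inverse.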

Technique

Differentiable Learned Quantization

Gradient-optimized codebook

Learned rotation angles via straight-through estimator. Novel attention-KL objective vs standard MSE. Works but Cayley improvement is modest.

Technique

Fractional Bit Rates

2.5-bit at 5.56x, 3.5-bit at 4.13x

Outlier-aware mixed bit-width per coordinate. Smooth quality-compression tradeoff between integer bit-widths.

Technique

Temporal Decay Compression

3-tier progressive

Hot/warm/cold tiers with increasing compression as tokens age. Recent tokens at FP16, old tokens at 2-bit. Smooth degradation curve.
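A sketch of how a token's age could map to a bit width. The tier lengths are hypothetical parameters and the 4-bit warm tier is an assumption; the source only names FP16 for recent tokens and 2-bit for old ones.

```c
#include <assert.h>

/* Bits per value for a token that is `age` positions behind the newest one.
 * hot_len and warm_len are hypothetical tier sizes; the 4-bit warm tier is
 * an assumption, not a documented setting. */
static int tier_bits(long age, long hot_len, long warm_len) {
    assert(age >= 0 && hot_len >= 0 && warm_len >= 0);
    if (age < hot_len) return 16;             /* hot: FP16 */
    if (age < hot_len + warm_len) return 4;   /* warm: assumed 4-bit */
    return 2;                                 /* cold: 2-bit */
}

/* Average bits per value across a context of n_ctx tokens. */
static double avg_tier_bits(long n_ctx, long hot_len, long warm_len) {
    assert(n_ctx > 0);
    long total = 0;
    for (long age = 0; age < n_ctx; age++)
        total += tier_bits(age, hot_len, warm_len);
    return (double)total / (double)n_ctx;
}
```

Because the hot and warm tiers have fixed length, their share of the cache shrinks as the context grows and the average approaches the cold-tier width.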

The Asymptotic Compression Law

The longer the context, the better the compression. This inverts the standard assumption.

G(n) = 0.08 * ln(n) + beta,  R^2 = 0.989

Attention Gini coefficient vs context length. Measured on Qwen2.5-3B-Instruct across 128-2094 tokens.

Gini coefficient (higher = more concentrated attention)

| Context length | Gini |
|---|---|
| 128 tok | 0.60 |
| 256 tok | 0.66 |
| 512 tok | 0.73 |
| 1024 tok | 0.79 |
| 2094 tok | 0.85 |
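For readers who want to compute the same statistics on their own attention rows, here is a sketch using the standard sorted-rank formula for the Gini coefficient. The function names are hypothetical, and how the source aggregates across heads and layers is not stated.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Ascending comparison for qsort. */
static int cmp_double(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Gini coefficient of n non-negative weights. Sorts w in place.
 * 0 means perfectly uniform; (n - 1) / n means all mass on one entry. */
static double gini(double *w, size_t n) {
    assert(n > 0);
    qsort(w, n, sizeof w[0], cmp_double);
    double sum = 0.0, ranked = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += w[i];
        ranked += (double)(i + 1) * w[i];   /* 1-based rank times weight */
    }
    if (sum == 0.0) return 0.0;
    return 2.0 * ranked / ((double)n * sum) - ((double)n + 1.0) / (double)n;
}

/* Fraction of tokens whose attention weight exceeds a threshold. */
static double fraction_above(const double *w, size_t n, double thresh) {
    assert(n > 0);
    size_t count = 0;
    for (size_t i = 0; i < n; i++) count += (w[i] > thresh);
    return (double)count / (double)n;
}
```

Four equal weights give a Gini of 0, while a one-hot row of length four gives 0.75. Calling fraction_above with a threshold of 0.01 yields the "tokens >1% attn" style of figure.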
Tokens receiving >1% attention: 12.8% at 128 tokens -> 0.3% at 2,094 tokens.
Theoretical min bits/token: O(1/n).

At 128 tokens: min 0.189 bits/token. At 2,094 tokens: min 0.015 bits/token.

Longer context does not mean more memory. Under optimal adaptive compression, it means less.

Validated 3B to 72B

Every number measured on RTX 4090 with real LLM KV caches. Not synthetic benchmarks.

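| Model | Params | d | Bits | Cosine Sim | Top-5 Match | Compression | Generation |
|---|---|---|---|---|---|---|---|
| Qwen2.5-3B | 3B | 128 | 3 | 0.9969 | 94.4% | 5.0x | 5/5 match |
| Gemma 4 E4B | 4B MoE | 256/512 | 3 | 0.999994 | 100% | 5.12x | 150 tok/s @ 262K |
| Qwen2.5-7B | 7B | 128 | 3 | -- | -- | 5.0x | PPL 7.90 (+0.38) |
| Llama 3.1 8B | 8B | 128 | 3 | -- | -- | 5.0x | PPL 7.37 (beats FP16) |
| Qwen2.5-14B | 14B | 128 | 3 | 0.9964 | 95.3% | 5.0x | 5/5 match |
| Qwen3.5-27B | 27B | 256 | 3 | 0.9932 | 100% | 5.2x | -- |
| Qwen2.5-32B | 32B | 128 | 3 | -- | -- | 5.0x | 5/5 match |
| Qwen3.5-35B | 35B MoE | 128 | 3 | -- | -- | 5.0x | 28 tok/s @ 2K |
| Llama 3.1 70B | 70B | 128 | 3 | -- | -- | 5.0x | 4x context (4K->16K) |
| Qwen2.5-72B | 72B | 128 | 3 | -- | -- | 5.0x | Ollama tested |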

Context Extension via KV Compression (llama.cpp, RTX 4090)

| Model | FP16 KV Max Context | Turbo3 KV Max Context | Extension |
|---|---|---|---|
| Llama 3.1 8B (TQ4_1S weights) | ~48K (OOM at 56K) | ~100K | 2.1x |
| Llama 3.1 70B (Q2_K weights) | ~4K (OOM at 8K) | ~16K | 4x |
| Gemma 4 26B | ~196K (OOM at 262K) | 262K (full native) | Full window |

Validated By

External validation from the TurboQuant paper author.

"We've actually been working on removing QJL for a while now and focusing on getting the stacking story right [...] your finding on the stacking numbers is very interesting"

Tom Turney

TurboQuant paper author (ICLR 2026), TheTom/llama-cpp-turboquant

Independently confirmed

  • ResidualQuant > QJL (Tom removing QJL)
  • Boundary layer protection (matches Tom's Boundary V)
  • Mean-removal fixes Qwen failure (Tom saw +62.95 PPL)
  • Weight + KV stacking extends context 2-4x

The convergence

While we built TurboQuantDC from scratch, the open-source community was converging on the same findings independently. Tom was already removing QJL. NSNQuant published channel centering. The boundary layer strategy appeared in multiple codebases. When multiple teams arrive at the same conclusions from different starting points, the conclusions are probably right.