Crush your KV cache to 3 bits. Run 27B models on a single GPU. Lose nothing.
Every token stores key-value vectors in FP16. At long context, the KV cache devours your VRAM.
A two-stage vector quantization pipeline. Each step is mathematically principled.
Random orthogonal rotation via QR decomposition. After rotation, each coordinate of a unit-norm vector approximately follows a concentrated N(0, 1/d) distribution: the key insight that enables scalar quantization.
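A minimal NumPy sketch of this stage. The dimensions are hypothetical, and the sign correction on R's diagonal is a standard trick for a uniformly random rotation, not necessarily TurboQuant's exact recipe:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 256  # hypothetical dimension

# QR of a Gaussian matrix gives a random orthogonal Q; multiplying each
# column by the sign of R's diagonal makes the rotation Haar-distributed.
G = rng.standard_normal((d, d))
Q, R = np.linalg.qr(G)
Q *= np.sign(np.diag(R))

x = rng.standard_normal(d)
x /= np.linalg.norm(x)   # unit-norm input vector
y = Q @ x                # rotated vector

# Rotation preserves the norm, and the coordinates of y spread out with
# standard deviation close to sqrt(1/d), as an N(0, 1/d) coordinate would.
print(np.linalg.norm(y), y.std())
```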
Per-coordinate optimal scalar quantization. Continuous values snap to 8 centroids (3-bit). MSE-optimal for the Gaussian distribution from Stage 1.
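A standard route to MSE-optimal centroids for a known distribution is Lloyd's algorithm; here is a sketch fit to N(0, 1) samples, not TurboQuant's actual fitting code:

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.standard_normal(200_000)  # stand-in for rotated coordinates

centroids = np.linspace(-2, 2, 8)  # 8 levels = 3 bits
for _ in range(50):
    # Assign each sample to its nearest centroid, then recenter each bin.
    idx = np.abs(samples[:, None] - centroids[None, :]).argmin(axis=1)
    centroids = np.array([samples[idx == j].mean() for j in range(8)])

def quantize(x):
    # Snap values to the nearest of the 8 learned centroids.
    return centroids[np.abs(x[:, None] - centroids[None, :]).argmin(axis=1)]

test_vals = rng.standard_normal(10_000)
mse = np.mean((quantize(test_vals) - test_vals) ** 2)
print(mse)  # close to the known 3-bit Lloyd-Max optimum of about 0.0345
```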
Compute the residual error. Project it through a random Gaussian matrix. Store only the signs: 1 bit per dimension. This makes the inner-product estimator mathematically unbiased.
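One way to realize a sign stage with unbiased decoding, sketched in NumPy. The decoding scale relies on the identity E[g · sign(gᵀr)] = sqrt(2/π) · r/||r|| for Gaussian g and assumes the residual norm is stored alongside the bits; the real implementation may pack bits and pick scales differently. Far more projections than dimensions are used here purely to make the unbiasedness visible:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64                        # hypothetical residual dimension
m = 20_000                    # many projections, to show convergence
r = rng.standard_normal(d)    # residual left after scalar quantization

G = rng.standard_normal((m, d))
signs = np.sign(G @ r)        # stored: 1 bit per projection

# Divide out the sqrt(2/pi) shrinkage so the reconstruction is unbiased.
scale = np.linalg.norm(r) * np.sqrt(np.pi / 2) / m
r_hat = scale * (G.T @ signs)

rel_err = np.linalg.norm(r_hat - r) / np.linalg.norm(r)
print(rel_err)  # shrinks as m grows, since E[r_hat] = r
```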
Combine both stages: E[<q, k̂>] = <q, k>. The estimator is unbiased with variance O(1/d). Individual vectors can carry 23-44% reconstruction error; what matters is that attention scores stay accurate.
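That distinction is worth checking numerically. A sketch using a plain 1-bit sign quantizer as a stand-in for the full pipeline (the real method stacks it on the 3-bit stage): any single estimate of <q, k> is noisy, but the estimates center on the true score.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128  # hypothetical head dimension
q = rng.standard_normal(d)
k = rng.standard_normal(d)

def one_bit_sketch(x):
    # Sign sketch at 1 bit per dimension, decoded with an unbiased scale.
    G = rng.standard_normal((d, d))
    s = np.sign(G @ x)
    return np.linalg.norm(x) * np.sqrt(np.pi / 2) / d * (G.T @ s)

true_score = q @ k
estimates = np.array([q @ one_bit_sketch(k) for _ in range(2000)])

# Each single estimate is noisy (large std), but the mean converges to the
# true score: the estimator is unbiased, so the noise averages out.
print(true_score, estimates.mean(), estimates.std())
```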
Interactive quality explorer. Pick a bit-width and see the numbers.
KV cache at 262K context. Red bars show where FP16 goes OOM. Green bars show where TurboQuant fits.
Running models that shouldn't fit. Finding needles that shouldn't be found.
| Context | KV Cache (TQ-2) | FP16 Would Need | Needle Found |
|---|---|---|---|
| 128K | 0.62 GB | 4.50 GB | 100% |
| 512K | 2.46 GB | 18.00 GB | 100% |
| 1M | 4.92 GB | 36.00 GB | 100% |
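The FP16 column follows from simple arithmetic. A sketch where the per-token KV width (9216 values per tensor, i.e. layers × KV heads × head dim) is an assumption chosen to match the table, not a confirmed model config; the TQ-2 column is larger than a pure 2-bit count would give because it also carries per-vector metadata (scales, norms), which this sketch ignores:

```python
def kv_cache_gib(tokens: int, bits_per_value: float,
                 per_token_dims: int = 9216) -> float:
    # 2 tensors (K and V) x per-token width x bits per value, in GiB.
    # per_token_dims = 9216 is a hypothetical config chosen to fit the table.
    return 2 * tokens * per_token_dims * bits_per_value / 8 / 2**30

print(kv_cache_gib(2**20, 16))  # 36.0  (FP16 at 1M tokens)
print(kv_cache_gib(2**17, 16))  # 4.5   (FP16 at 128K tokens)
```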
Five extensions we built on top of the core TurboQuant algorithm.
Honest findings from 511 tests and 25,000 lines of implementation. Not all results were positive.