TurboQuantDC

Crush your KV cache to 3 bits. Run 27B models on a single GPU. Lose nothing.


The Problem

Every token stores key-value vectors in FP16. At long context, the KV cache devours your VRAM.

Figure: an NVIDIA RTX 4090's 24 GB VRAM budget. Model weights occupy 14 GB; the FP16 KV cache grows with context length until it overflows the GPU (OOM). At 1K tokens the cache is only 0.06 GB (14.06 GB total).
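The growth is easy to quantify: the cache stores one K and one V vector per token, per layer, per KV head. A minimal sketch with a hypothetical GQA configuration (32 layers, 4 KV heads of dimension 128, FP16 — illustrative numbers, not necessarily the model shown above):

```python
# KV cache size in bytes: 2 tensors (K and V) per layer, each of shape
# [tokens, kv_heads, head_dim], stored at bytes_per_elt precision.
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int,
                   head_dim: int, bytes_per_elt: int = 2) -> int:
    return 2 * tokens * layers * kv_heads * head_dim * bytes_per_elt

# Hypothetical config: 32 layers, 4 KV heads of dim 128, FP16.
gb = kv_cache_bytes(131_072, 32, 4, 128) / 2**30
print(f"{gb:.1f} GB at 128K tokens")  # → 8.0 GB at 128K tokens
```

The size is linear in context length, which is why long contexts are what break the VRAM budget.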

How TurboQuant Works

A two-stage vector quantization pipeline. Each step is mathematically principled.

1
Random Rotation

Random orthogonal rotation via QR decomposition. After rotation, coordinates follow a concentrated N(0, 1/d) distribution — the key insight that enables scalar quantization.
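A sketch of this step, using the standard NumPy QR construction of a Haar-random orthogonal matrix (function names are ours, for illustration):

```python
import numpy as np

# Haar-random orthogonal matrix via QR of a Gaussian matrix.
# Sign-fixing with the diagonal of R makes the distribution uniform over O(d).
def random_rotation(d: int, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))

d = 256
Q = random_rotation(d)
x = np.zeros(d); x[0] = 1.0   # worst case: all energy in one coordinate
y = Q @ x                     # after rotation: coordinates ~ N(0, 1/d)
print(np.allclose(Q @ Q.T, np.eye(d)))  # orthogonal, so norms are preserved
print(y.var() * d)                      # empirical variance close to 1/d
```

Even a maximally spiky input ends up with its energy spread evenly across coordinates, which is exactly what a per-coordinate scalar quantizer needs.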

2
Lloyd-Max Quantization

Per-coordinate optimal scalar quantization. Continuous values snap to 8 centroids (3-bit). MSE-optimal for the Gaussian distribution from Stage 1.
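The optimal centroids can be found with plain Lloyd iteration in one dimension: alternate nearest-centroid assignment (cell boundaries at centroid midpoints) with centroid re-estimation. A sketch, with helper names of our own choosing:

```python
import numpy as np

def lloyd_max(samples: np.ndarray, bits: int = 3, iters: int = 100) -> np.ndarray:
    k = 2 ** bits
    c = np.quantile(samples, (np.arange(k) + 0.5) / k)  # spread initial centroids
    for _ in range(iters):
        edges = (c[:-1] + c[1:]) / 2                    # decision boundaries
        idx = np.searchsorted(edges, samples)           # nearest-centroid assignment
        for j in range(k):
            mask = idx == j
            if mask.any():
                c[j] = samples[mask].mean()             # MSE-optimal centroid update
    return np.sort(c)

def quantize(x: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    edges = (centroids[:-1] + centroids[1:]) / 2
    return centroids[np.searchsorted(edges, x)]

rng = np.random.default_rng(0)
c = lloyd_max(rng.standard_normal(200_000), bits=3)
x = rng.standard_normal(100_000)
mse = np.mean((x - quantize(x, c)) ** 2)
print(mse)  # near the optimal 3-bit distortion for a unit Gaussian (~0.035)
```

Because Stage 1 guarantees near-Gaussian coordinates, one centroid table works for every dimension.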

3
QJL Correction

Compute residual error. Project through random Gaussian matrix. Store only the signs — 1 bit per dimension. This makes inner products mathematically unbiased.
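A sketch of the sign-projection idea, with hypothetical helper names. For a Gaussian row s, E[sign(⟨s,r⟩)·⟨s,q⟩] = √(2/π)·⟨q,r⟩/‖r‖, so rescaling the sign average by ‖r‖·√(π/2) recovers ⟨q,r⟩ in expectation:

```python
import numpy as np

# Encode the residual r as the signs of m Gaussian projections plus ||r||.
def qjl_encode(r, S):
    return np.sign(S @ r), np.linalg.norm(r)      # m bits + one scalar

# Unbiased estimate of <q, r> from the signs alone.
def qjl_inner(q, signs, r_norm, S):
    return r_norm * np.sqrt(np.pi / 2) * np.mean(signs * (S @ q))

rng = np.random.default_rng(0)
d, m = 64, 50_000                  # m is this large only to show convergence
q = rng.standard_normal(d); q /= np.linalg.norm(q)
r = rng.standard_normal(d); r /= np.linalg.norm(r)
S = rng.standard_normal((m, d))
signs, r_norm = qjl_encode(r, S)
print(qjl_inner(q, signs, r_norm, S), q @ r)  # estimate converges to <q, r>
```

In practice the projection dimension matches the 1-bit-per-dimension budget; the large m here just makes the convergence visible in one run.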

4
Unbiased Inner Product

Combine both stages: E[<q, k̂>] = <q, k>. The estimator is unbiased with variance O(1/d). Individual vectors can have 23-44% reconstruction error — what matters is accurate attention scores.
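The unbiasedness claim follows from linearity of expectation once stages 2 and 3 are combined. Write k = k_q + r, where k_q is the Lloyd-Max output and r the residual, and let r̂ denote the QJL estimate with E[⟨q, r̂⟩] = ⟨q, r⟩:

```latex
\begin{aligned}
\mathbb{E}\,\langle q, \hat k \rangle
  &= \langle q, k_q \rangle + \mathbb{E}\,\langle q, \hat r \rangle \\
  &= \langle q, k_q \rangle + \langle q, k - k_q \rangle \\
  &= \langle q, k \rangle .
\end{aligned}
```

The deterministic quantization error and the random correction cancel exactly in expectation, which is why per-vector reconstruction error can be large while attention scores stay accurate.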

The Proof

Interactive quality explorer. Pick a bit-width and see the numbers.

Figure: per-head cosine similarity at 3-bit across 72 attention heads. Overall cosine similarity: 0.9788. Every head preserves >99% quality — worst head 0.9902, median 0.9973.

The Money Shot

KV cache memory at 262K context. Red bars show where FP16 OOMs; green bars show TurboQuant fitting on the same GPU.

Impossible Inference

Running models that shouldn't fit. Finding needles that shouldn't be found.

14B model on 24 GB GPU
Qwen2.5-14B (29.5 GB) running at 8.3 GB peak VRAM
Streaming engine: weights load from CPU RAM one layer at a time, so any model size fits.
Needle-in-haystack: NEPTUNE-4422 found in 4K context
1M Tokens on a Single RTX 4090
TQ-2 cache at 4.92 GB for 1M tokens. FP16 would need 36 GB. 100% needle retrieval at all depths.
Context   KV Cache (TQ-2)   FP16 Would Need   Needle Found
128K      0.62 GB           4.50 GB           100%
512K      2.46 GB           18.00 GB          100%
1M        4.92 GB           36.00 GB          100%
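One sanity check the table supports: dividing each TQ-2 size by its FP16 counterpart gives the effective bits per element, since FP16 stores 16 bits.

```python
# Effective bits/element implied by the table: 16 * (TQ-2 size / FP16 size).
rows = [(0.62, 4.50), (2.46, 18.00), (4.92, 36.00)]  # GB, from the table above
for tq, fp16 in rows:
    print(f"{16 * tq / fp16:.2f} bits/element")  # ~2.2 in every row
```

The consistent ~2.2 bits/element matches 2-bit codes plus a small fixed overhead (scales, norms, QJL signs).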

Beyond the Paper

Five extensions we built on top of the core TurboQuant algorithm.

What We Discovered

Honest findings from 511 tests and 25,000 lines of implementation. Not all results were positive.