Crush your KV cache to 3 bits. Run 27B models on a single GPU. Lose nothing.
Every token stores key-value vectors in FP16. At long context, the KV cache devours your VRAM.
A two-stage vector quantization pipeline. Each step is mathematically principled.
Random orthogonal rotation via QR decomposition. After rotation, each coordinate of a unit-norm vector approximately follows a concentrated N(0, 1/d) distribution: the key insight that enables scalar quantization.
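A minimal NumPy sketch of this stage. The dimensions are hypothetical, and the sign correction on R's diagonal is a standard trick for a uniformly random rotation, not necessarily TurboQuant's exact recipe:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 256  # hypothetical dimension

# QR of a Gaussian matrix gives a random orthogonal Q; multiplying each
# column by the sign of R's diagonal makes the rotation Haar-distributed.
G = rng.standard_normal((d, d))
Q, R = np.linalg.qr(G)
Q *= np.sign(np.diag(R))

x = rng.standard_normal(d)
x /= np.linalg.norm(x)   # unit-norm input vector
y = Q @ x                # rotated vector

# Rotation preserves the norm, and the coordinates of y spread out with
# standard deviation close to sqrt(1/d), as an N(0, 1/d) coordinate would.
print(np.linalg.norm(y), y.std())
```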
Per-coordinate optimal scalar quantization. Continuous values snap to 8 centroids (3-bit). MSE-optimal for the Gaussian distribution from Stage 1.
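A standard route to MSE-optimal centroids for a known distribution is Lloyd's algorithm; here is a sketch fit to N(0, 1) samples, not TurboQuant's actual fitting code:

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.standard_normal(200_000)  # stand-in for rotated coordinates

centroids = np.linspace(-2, 2, 8)  # 8 levels = 3 bits
for _ in range(50):
    # Assign each sample to its nearest centroid, then recenter each bin.
    idx = np.abs(samples[:, None] - centroids[None, :]).argmin(axis=1)
    centroids = np.array([samples[idx == j].mean() for j in range(8)])

def quantize(x):
    # Snap values to the nearest of the 8 learned centroids.
    return centroids[np.abs(x[:, None] - centroids[None, :]).argmin(axis=1)]

test_vals = rng.standard_normal(10_000)
mse = np.mean((quantize(test_vals) - test_vals) ** 2)
print(mse)  # close to the known 3-bit Lloyd-Max optimum of about 0.0345
```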
Compute the residual error. Project it through a random Gaussian matrix. Store only the signs: 1 bit per dimension. This makes the inner-product estimator mathematically unbiased.
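One way to realize a sign stage with unbiased decoding, sketched in NumPy. The decoding scale relies on the identity E[g · sign(gᵀr)] = sqrt(2/π) · r/||r|| for Gaussian g and assumes the residual norm is stored alongside the bits; the real implementation may pack bits and pick scales differently. Far more projections than dimensions are used here purely to make the unbiasedness visible:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64                        # hypothetical residual dimension
m = 20_000                    # many projections, to show convergence
r = rng.standard_normal(d)    # residual left after scalar quantization

G = rng.standard_normal((m, d))
signs = np.sign(G @ r)        # stored: 1 bit per projection

# Divide out the sqrt(2/pi) shrinkage so the reconstruction is unbiased.
scale = np.linalg.norm(r) * np.sqrt(np.pi / 2) / m
r_hat = scale * (G.T @ signs)

rel_err = np.linalg.norm(r_hat - r) / np.linalg.norm(r)
print(rel_err)  # shrinks as m grows, since E[r_hat] = r
```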
Combine both stages: E[<q, k̂>] = <q, k>. The estimator is unbiased with variance O(1/d). Individual vectors can carry 23-44% reconstruction error; what matters is that attention scores stay accurate.
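That distinction is worth checking numerically. A sketch using a plain 1-bit sign quantizer as a stand-in for the full pipeline (the real method stacks it on the 3-bit stage): any single estimate of <q, k> is noisy, but the estimates center on the true score.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128  # hypothetical head dimension
q = rng.standard_normal(d)
k = rng.standard_normal(d)

def one_bit_sketch(x):
    # Sign sketch at 1 bit per dimension, decoded with an unbiased scale.
    G = rng.standard_normal((d, d))
    s = np.sign(G @ x)
    return np.linalg.norm(x) * np.sqrt(np.pi / 2) / d * (G.T @ s)

true_score = q @ k
estimates = np.array([q @ one_bit_sketch(k) for _ in range(2000)])

# Each single estimate is noisy (large std), but the mean converges to the
# true score: the estimator is unbiased, so the noise averages out.
print(true_score, estimates.mean(), estimates.std())
```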
Interactive quality explorer. Pick a bit-width and see the numbers.
KV cache at 262K context. Red bars show where FP16 goes OOM. Green bars show where TurboQuant fits.
Running models that shouldn't fit. Finding needles that shouldn't be found.
| Context | KV Cache (TQ-2) | FP16 Would Need | Needle Found |
|---|---|---|---|
| 128K | 0.62 GB | 4.50 GB | 100% |
| 512K | 2.46 GB | 18.00 GB | 100% |
| 1M | 4.92 GB | 36.00 GB | 100% |
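The FP16 column follows from simple arithmetic. A sketch where the per-token KV width (9216 values per tensor, i.e. layers × KV heads × head dim) is an assumption chosen to match the table, not a confirmed model config; the TQ-2 column is larger than a pure 2-bit count would give because it also carries per-vector metadata (scales, norms), which this sketch ignores:

```python
def kv_cache_gib(tokens: int, bits_per_value: float,
                 per_token_dims: int = 9216) -> float:
    # 2 tensors (K and V) x per-token width x bits per value, in GiB.
    # per_token_dims = 9216 is a hypothetical config chosen to fit the table.
    return 2 * tokens * per_token_dims * bits_per_value / 8 / 2**30

print(kv_cache_gib(2**20, 16))  # 36.0  (FP16 at 1M tokens)
print(kv_cache_gib(2**17, 16))  # 4.5   (FP16 at 128K tokens)
```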
Five extensions we built on top of the core TurboQuant algorithm.
Honest findings from 511 tests and 25,000 lines of implementation. Not all results were positive.