Reiner Pope — LLM Inference Economics (7 Equations)

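Blackboard-lecture format on the Dwarkesh Podcast. Reiner Pope (MatX CEO, formerly head of Google TPU v5e) starts from two hardware parameters (HBM bandwidth and FLOPS) and uses 7 equations to derive the AI industry's pricing, architecture, and physical limits.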

The 7 Equations

#	Equation	What It Tells You
①	`T ≥ max(T_compute, T_memory)`	Roofline: every step is bounded by the slower of compute or memory
②	`T_compute = batch × active_params / FLOPS`	Compute time grows linearly with users and active parameters
③	`T_memory = total_params/bandwidth + batch × ctx_len × bytes_per_token/bandwidth`	Memory time: weight read (shared) + KV cache read (per-user)
④	`Cost = T / batch`	Per-token cost: weight read is amortized, KV cache is not
⑤	`Critical batch ≈ FLOPS/bandwidth × total_params/active_params ≈ 300 × sparsity`	Where memory and compute bottlenecks balance; independent of model size
⑥	`bytes_per_token = active_params / (FLOPS/bandwidth × ctx_len)`	Reverse-engineer KV cache size from API pricing cliffs
⑦	`D_pretrain ≈ D_RL ≈ D_inference`	Training/inference equilibrium: three cost pools equalize at optimum
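
Equations ① to ④ can be turned into a small calculator. The sketch below is illustrative only: the `Hardware` and `Model` classes and every number in them are hypothetical, chosen so that FLOPS/bandwidth = 300 and sparsity = 8, and it follows the table's simplification of one FLOP and one byte per parameter.

```python
from dataclasses import dataclass


@dataclass
class Hardware:
    flops: float      # peak FLOP/s
    bandwidth: float  # HBM bytes/s


@dataclass
class Model:
    total_params: float        # all weights, read once per step
    active_params: float       # weights touched per token
    kv_bytes_per_token: float  # KV cache bytes per token of context


def step_time(hw, m, batch, ctx_len):
    """Equations 1-3: return (T, T_compute, T_memory) in seconds."""
    t_compute = batch * m.active_params / hw.flops
    weight_read = m.total_params / hw.bandwidth
    kv_read = batch * ctx_len * m.kv_bytes_per_token / hw.bandwidth
    t_memory = weight_read + kv_read
    return max(t_compute, t_memory), t_compute, t_memory


def cost_per_token(hw, m, batch, ctx_len):
    """Equation 4: step time shared across the batch."""
    return step_time(hw, m, batch, ctx_len)[0] / batch


# Hypothetical numbers, not measurements of any real chip or model.
hw = Hardware(flops=6e15, bandwidth=20e12)
model = Model(total_params=8e11, active_params=1e11, kv_bytes_per_token=2000)

for batch in (1, 256, 2400, 8000):
    t, tc, tm = step_time(hw, model, batch, ctx_len=8192)
    bound = "compute" if tc >= tm else "memory"
    cost_us = cost_per_token(hw, model, batch, 8192) * 1e6
    print(f"batch={batch:5d}  step={t * 1e3:8.2f} ms  "
          f"cost/token={cost_us:10.2f} us  ({bound}-bound)")
```

Sweeping `batch` shows the pattern the equations predict: per-token cost falls steeply while the step is memory-bound, then flattens once compute takes over.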

Two Bottlenecks Shape Everything

Every token generation step is bounded by the slower of:

Compute: doing the matrix multiplications (batch × active_params / FLOPS)
Memory: reading weights + KV cache from HBM (total_params/bandwidth + per-user KV)

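GPU scheduling: one "train" departs every ~20ms, roughly the time to read all weights from HBM (on Nvidia Rubin: 288 GB / 20 TB/s ≈ 14.4ms). Requests board the next train. Worst case: ~40ms latency, when a request just misses a departure, waits ~20ms for the next one, then rides a ~20ms step.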
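
The train arithmetic is simple enough to check directly. This is a minimal sketch; the function names are made up for illustration, and the worst case assumes a request arrives just after a departure.

```python
def weight_read_ms(hbm_bytes, bandwidth_bytes_per_s):
    """Time to stream the full HBM contents once, in milliseconds."""
    return hbm_bytes / bandwidth_bytes_per_s * 1e3


def worst_case_latency_ms(step_ms):
    """Just missed a train: wait one full step, then ride one full step."""
    return 2 * step_ms


rubin_ms = weight_read_ms(288e9, 20e12)   # 288 GB over 20 TB/s
print(f"weight read: {rubin_ms:.1f} ms")
print(f"worst case at a 20 ms cadence: {worst_case_latency_ms(20):.0f} ms")
```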

Batch Economics

Weight read is the fixed cost. 1 user or 2000 users — the GPU reads all weights from HBM either way. This is why:

No batching = 1000× worse economics than full batching
Fast mode costs 6× more: uses smaller batch for lower latency, weight read less amortized
"Slow mode" saves almost nothing: once batch is large enough that compute becomes the bottleneck, cost hits a floor. Bigger batch doesn't help — per-user compute and KV cache are irreducible.

Critical batch size ≈ 300 × sparsity. For DeepSeek V3 (sparsity ≈ 8), critical batch ≈ 2400 concurrent sequences. Independent of model size — depends only on hardware FLOPS/bandwidth ratio and model sparsity.
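
A rough way to see both the floor and the cost of small batches is to normalize per-token cost to the compute-bound floor. The sketch below ignores KV cache entirely and uses hypothetical helper names; under those assumptions the cost multiplier is just `max(1, critical_batch / batch)`.

```python
def critical_batch(flops_per_byte, sparsity):
    """Equation 5: batch size where weight read time equals compute time."""
    return flops_per_byte * sparsity


def relative_cost_per_token(batch, crit):
    """Cost relative to the compute-bound floor (KV cache ignored)."""
    return max(1.0, crit / batch)


crit = critical_batch(300, 8)  # FLOPS/bandwidth ~ 300, sparsity ~ 8
for batch in (1, 400, 2400, 4800):
    print(batch, relative_cost_per_token(batch, crit))
```

In this simplified model, a batch at one sixth of critical costs 6× the floor, and doubling the batch past critical changes nothing.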

KV Cache: The Irreducible Cost

KV cache is the only cost that cannot be amortized by any parallelism strategy:

Batch dimension: per-user conversation history, can't share
Pipeline dimension: more stages → more micro-batches in flight → same total KV cache
This is why DeepSeek's inference uses expert parallelism across the full scale-up domain, with almost zero pipeline parallelism

Context length ~200K wall: speaking as a hardware architect, Reiner says, "I actually don't see a very good path to solving that." This is not engineering laziness but a physics wall: HBM bandwidth grows far slower than compute. Sparse attention helps (DeepSeek: KV read ∝ √ctx_len) but degrades quality past a point.
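
To make the wall concrete, the sketch below computes the KV read time per step and compares how it grows under dense and √ctx_len sparse attention. The batch size, 2 KB per token, and 20 TB/s bandwidth are illustrative assumptions, and `kv_growth` models only the scaling law, not any particular attention implementation.

```python
import math


def kv_read_seconds(batch, ctx_len, kv_bytes_per_token, bandwidth):
    """KV cache term of equation 3: every user's history is read each step."""
    return batch * ctx_len * kv_bytes_per_token / bandwidth


def kv_growth(ctx_ratio, sparse):
    """Factor by which KV reads grow when context grows by ctx_ratio."""
    return math.sqrt(ctx_ratio) if sparse else ctx_ratio


total = kv_read_seconds(2400, 200_000, 2000, 20e12)
per_user = total / 2400
print(f"KV read per step: {total * 1e3:.1f} ms, per user: {per_user * 1e6:.1f} us")
print("4x longer context, dense: ", kv_growth(4, sparse=False), "x")
print("4x longer context, sparse:", kv_growth(4, sparse=True), "x")
```

The per-user figure is the same at any batch size, which is the sense in which batching cannot amortize this term.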

API Pricing as Architecture Reverse-Engineering

Observation	What It Reveals
Gemini 3.1 +50% price at >200K ctx	KV cache per token ≈ 2 KB (solve equation ⑥ at the crossover point)
Output 5× more expensive than input	GPU is memory-bound during decode (reading weights+KV for just 1 token)
5min vs 1hr cache retention tiers	Flash vs HDD storage tiers (drain time = capacity/bandwidth maps to physical media)
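
The first row can be reproduced by solving equation ⑥ at the pricing crossover. In the sketch below the active parameter count is a hypothetical placeholder, not a published figure for any model; it is chosen so the result lands near 2 KB with FLOPS/bandwidth ≈ 300.

```python
def kv_bytes_per_token(active_params, flops_per_byte, ctx_len):
    """Equation 6: KV bytes per token at which KV read time matches compute time."""
    return active_params / (flops_per_byte * ctx_len)


# Assumed inputs: 120B active params (hypothetical), ratio 300, 200K crossover.
estimate = kv_bytes_per_token(1.2e11, 300, 200_000)
print(f"estimated KV cache: {estimate:.0f} bytes per token")
```

The estimate scales linearly with the assumed active parameter count, so it is only as good as that guess.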

Overtraining 100× Beyond Chinchilla

Chinchilla optimal: training tokens ≈ 20 × active parameters (~2T tokens for 100B active). Actual frontier models: 100–200T tokens, roughly 50–100× beyond.

This is not waste. It's training-inference joint optimization:

Smaller active parameters → cheaper inference
More training data → compensates for quality loss from smaller model
When inference volume is large enough, inference savings dwarf extra training cost

Model obsolescence risk tilts the balance: if your model has only a 50% chance of being best-in-class, you should train less (equation ⑦'s D_inference gets discounted).
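
The arithmetic behind these claims is short. The sketch below uses hypothetical helper names; treating the obsolescence discount as a simple probability multiplier on D_inference is one plausible reading of equation ⑦, not a formula stated in the talk, and the 150T figure is made up for illustration.

```python
def chinchilla_tokens(active_params, tokens_per_param=20):
    """Chinchilla-optimal training tokens for a given active parameter count."""
    return tokens_per_param * active_params


def overtraining_factor(actual_tokens, active_params):
    """How far past Chinchilla-optimal a training run goes."""
    return actual_tokens / chinchilla_tokens(active_params)


def expected_inference_tokens(inference_tokens, p_best_in_class):
    """Assumed discount: inference volume weighted by the chance of being used."""
    return p_best_in_class * inference_tokens


print(chinchilla_tokens(100e9))              # tokens for 100B active params
print(overtraining_factor(100e12, 100e9))    # low end of the 100-200T range
print(overtraining_factor(200e12, 100e9))    # high end
print(expected_inference_tokens(150e12, 0.5))
```

For 100B active parameters, the 100–200T range works out to roughly 50× to 100× the Chinchilla-optimal token count.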

MoE Rack Layout: One Rack = One MoE Layer Boundary

Expert parallelism: each expert on a different GPU, all-to-all communication
Scale-up network (NV Switch): 72 GPUs, 2-hop any-to-any — perfect match
Scale-out (inter-rack): ~8× less bandwidth — crossing racks with MoE hurts
Scale-up size solves bandwidth, pipeline solves capacity — two different problems
GPT-4 was rumored at ~1T params in 2023, but parameter counts haven't grown much since: growing them needs bigger scale-up domains for bandwidth

Pipeline parallelism in inference: throughput is free (no training-style bubbles), but latency penalty ~50% with 4 stages. Main benefit is reducing per-rack weight storage, but Blackwell NVL72 has 13.5 TB HBM vs ~1 TB needed for 1T-param model — capacity is not the bottleneck.
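
The capacity argument can be checked with a few lines. This sketch assumes one byte per parameter (which is what "~1 TB for a 1T-param model" implies) and ignores the HBM that KV cache also needs; the function names are illustrative.

```python
import math


def weight_bytes(total_params, bytes_per_param=1.0):
    """HBM needed for weights alone, assuming a fixed width per parameter."""
    return total_params * bytes_per_param


def min_stages_for_capacity(total_bytes, rack_hbm_bytes):
    """Pipeline stages (racks) needed if capacity were the only constraint."""
    return math.ceil(total_bytes / rack_hbm_bytes)


weights = weight_bytes(1e12)   # 1T params at 1 byte each
rack_hbm = 13.5e12             # NVL72 HBM figure quoted above
print(f"weights use {weights / rack_hbm:.1%} of one rack's HBM")
print("stages needed for capacity:", min_stages_for_capacity(weights, rack_hbm))
```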

Feistel Networks → RevNets: Cryptography Meets Neural Networks

A classic Feistel network builds an invertible map out of any function f, even a non-invertible one: split the input into (x, y) and output (y, z) with z = x + f(y). To reverse it, read y directly from the output, then recover x = z - f(y).

RevNets (2017-18) ported the same construction into residual networks, and later work applied it to Transformers. The entire network becomes invertible, so backprop can recompute activations on the fly instead of storing them.
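
A minimal sketch of both constructions, using integers so the round trip is exact. The functions `f` and `g` are arbitrary stand-ins for the sub-networks and are deliberately non-invertible; the reversible block follows the usual two-step coupling (y1 = x1 + F(x2), y2 = x2 + G(y1)).

```python
def feistel_forward(x, y, f):
    """(x, y) -> (y, z) with z = x + f(y)."""
    return y, x + f(y)


def feistel_inverse(y, z, f):
    """Recover (x, y) from (y, z); f is only ever evaluated forward."""
    return z - f(y), y


def rev_block_forward(x1, x2, F, G):
    """Two coupled Feistel-style steps, as in a reversible residual block."""
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2


def rev_block_inverse(y1, y2, F, G):
    """Undo the block by recomputing F and G instead of storing x1, x2."""
    x2 = y2 - G(y1)
    x1 = y1 - F(x2)
    return x1, x2


# Non-invertible toy functions: many inputs map to the same output.
f = lambda v: (v * v) % 7
g = lambda v: abs(v) // 3

assert feistel_inverse(*feistel_forward(5, -3, f), f) == (5, -3)
assert rev_block_inverse(*rev_block_forward(10, 4, f, g), f, g) == (10, 4)
```

The inverse never needs to invert `f` or `g`, which is why the inputs of each layer can be recomputed during backprop rather than kept in memory.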

The symmetry running through the whole talk: KV cache = "spend memory to save compute" (store intermediates). RevNets = "spend compute to save memory" (recompute instead of store). Under current hardware (memory expensive, compute relatively abundant), the former usually wins.

Deep connection: backdoors. Cryptography spent decades studying how to hide structured information in seemingly random systems; neural network backdoor defense is at an early stage.

Related

ai-ecosystem/ai-inference-rationing-2026 — inference rationing at the industry level
ai-ecosystem/pretraining-parallelisms-dwarkesh — pipeline parallelism in training + 6ND formula
ai-ecosystem/deepseek-v4 — sparse attention as KV cache mitigation
product-trends/token-optimization-economics — cost optimization levers
