Reiner Pope — LLM Inference Economics (7 Equations)
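A blackboard-style lecture on the Dwarkesh Podcast. Reiner Pope (MatX CEO, formerly lead of Google's TPU v5e) starts from two hardware parameters (HBM bandwidth and FLOPS) and uses 7 equations to derive the pricing, architecture, and physical limits of the AI industry.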
The 7 Equations
| # | Equation | What It Tells You |
|---|---|---|
| ① | T ≥ max(T_compute, T_memory) | Roofline: every step is bounded by the slower of compute or memory |
| ② | T_compute = batch × active_params / FLOPS | Compute time grows linearly with users and active parameters |
| ③ | T_memory = total_params/bandwidth + batch × ctx_len × bytes_per_token/bandwidth | Memory time: weight read (shared) + KV cache read (per-user) |
| ④ | Cost = T / batch | Per-token cost: weight read is amortized, KV cache is not |
| ⑤ | Critical batch ≈ FLOPS/bandwidth × total_params/active_params ≈ 300 × sparsity | Where memory and compute bottlenecks balance; independent of model size |
| ⑥ | bytes_per_token = active_params / (FLOPS/bandwidth × ctx_len) | Reverse-engineer KV cache size from API pricing cliffs |
| ⑦ | D_pretrain ≈ D_RL ≈ D_inference | Training/inference equilibrium: three cost pools equalize at optimum |
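
Equations ① to ④ can be sketched as a few small Python functions. This is a minimal illustration, not something from the lecture: the class names, the 1-byte-per-parameter convention, and the example numbers (a FLOPS figure chosen so that FLOPS/bandwidth = 300, a 288 GB model with sparsity 8, 2 KB of KV cache per token) are all assumptions picked to line up with figures quoted elsewhere in these notes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hardware:
    flops: float      # peak operations per second (hypothetical value below)
    bandwidth: float  # HBM bytes per second


@dataclass(frozen=True)
class Model:
    total_params: float     # all stored weights, assumed 1 byte each
    active_params: float    # weights used per token (smaller for MoE)
    bytes_per_token: float  # KV cache bytes per token of context


def t_compute(hw, m, batch):
    # Equation 2: compute time grows with batch and active parameters.
    return batch * m.active_params / hw.flops


def t_memory(hw, m, batch, ctx_len):
    # Equation 3: shared weight read plus per-user KV cache read.
    weights = m.total_params / hw.bandwidth
    kv = batch * ctx_len * m.bytes_per_token / hw.bandwidth
    return weights + kv


def step_time(hw, m, batch, ctx_len):
    # Equation 1: the slower of the two bottlenecks sets the step time.
    return max(t_compute(hw, m, batch), t_memory(hw, m, batch, ctx_len))


def cost_per_token(hw, m, batch, ctx_len):
    # Equation 4: one step yields one token per sequence in the batch.
    return step_time(hw, m, batch, ctx_len) / batch


# Illustrative numbers only (assumptions, see the paragraph above).
hw = Hardware(flops=6e15, bandwidth=20e12)
model = Model(total_params=288e9, active_params=36e9, bytes_per_token=2e3)
for batch in (1, 256, 2400):
    print(batch, cost_per_token(hw, model, batch, ctx_len=8_000))
```

The cost is expressed in accelerator-seconds per token; multiplying by an hourly hardware price would turn it into money, which the sketch leaves out.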
Two Bottlenecks Shape Everything
Every token generation step is bounded by the slower of:
- Compute: doing the matrix multiplications (batch × active_params / FLOPS)
- Memory: reading weights + KV cache from HBM (total_params/bandwidth + per-user KV)
GPU scheduling: one "train" departs every ~20ms (time to read all weights from HBM on Nvidia Rubin: 288 GB / 20 TB/s ≈ 15ms). Requests board the next train, so the worst case is ~40ms of latency: up to ~20ms waiting for the next departure plus ~20ms for the step itself.
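The train arithmetic can be checked in a few lines of Python. This is a sketch under the assumption that the whole 288 GB of HBM is read once per step and that a late request waits at most one full interval; the function names are hypothetical.

```python
def weight_read_time_s(hbm_bytes, bandwidth_bytes_per_s):
    # Time to stream every byte of HBM once at full bandwidth.
    return hbm_bytes / bandwidth_bytes_per_s


def worst_case_latency_s(interval_s):
    # A request that just missed a departure waits one interval,
    # then spends one more interval inside the next step.
    return 2 * interval_s


read_s = weight_read_time_s(288e9, 20e12)  # about 14.4 ms, rounded to ~15 ms in the notes
print(f"weight read: {read_s * 1e3:.1f} ms")
print(f"worst case at a 20 ms interval: {worst_case_latency_s(0.020) * 1e3:.0f} ms")
```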
Batch Economics
Weight read is the fixed cost. 1 user or 2000 users — the GPU reads all weights from HBM either way. This is why:
- No batching = 1000× worse economics than full batching
- Fast mode costs 6× more: uses smaller batch for lower latency, weight read less amortized
- "Slow mode" saves almost nothing: once batch is large enough that compute becomes the bottleneck, cost hits a floor. Bigger batch doesn't help — per-user compute and KV cache are irreducible.
Critical batch size ≈ 300 × sparsity. For DeepSeek V3 (sparsity ≈ 8), critical batch ≈ 2400 concurrent sequences. Independent of model size — depends only on hardware FLOPS/bandwidth ratio and model sparsity.
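The shape of the batch-cost curve follows directly from equations ②, ③ and ④ once the KV cache term is set aside. Dividing the per-token cost by its compute-bound floor leaves max(1, critical_batch / batch). The Python sketch below uses that simplification; the function names are hypothetical and the KV cache term is deliberately ignored, so real costs sit somewhat above these ratios.

```python
def critical_batch(flops_per_byte, sparsity):
    # Equation 5: hardware FLOPS/bandwidth ratio times total/active params.
    return flops_per_byte * sparsity


def relative_cost(batch, flops_per_byte, sparsity):
    """Per-token cost relative to the compute-bound floor (KV cache ignored)."""
    return max(1.0, critical_batch(flops_per_byte, sparsity) / batch)


# Hardware ratio of 300 and sparsity of 8, as quoted in the notes.
for batch in (1, 400, 2400, 10_000):
    print(batch, relative_cost(batch, 300, 8))
```

Below the critical batch the cost falls in proportion to batch size; above it the ratio stays at 1, which is why a larger batch stops helping.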
KV Cache: The Irreducible Cost
KV cache is the only cost that cannot be amortized by any parallelism strategy:
- Batch dimension: per-user conversation history, can't share
- Pipeline dimension: more stages → more micro-batches in flight → same total KV cache
- This is why DeepSeek's inference uses expert parallelism across the full scale-up domain, almost zero pipeline parallelism
Context length ~200K wall: speaking as a hardware architect, Reiner says, "I actually don't see a very good path to solving that." This is not engineering laziness but a physics wall: HBM bandwidth grows far slower than compute. Sparse attention helps (DeepSeek: KV read ∝ √ctx_len) but degrades quality past a point.
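A small Python sketch of the KV term in equation ③ shows why batching does not help here. It assumes dense attention (every cached token is read on every step) and reuses illustrative values from these notes (2 KB per token, 20 TB/s, a 288 GB model); the function names and the batch of 2400 are assumptions for the example.

```python
def kv_read_time_s(batch, ctx_len, bytes_per_token, bandwidth):
    # KV part of equation 3: every sequence reads its own full history.
    return batch * ctx_len * bytes_per_token / bandwidth


def ctx_where_kv_matches_weights(total_param_bytes, batch, bytes_per_token):
    # Context length at which the KV read equals the shared weight read.
    return total_param_bytes / (batch * bytes_per_token)


BANDWIDTH = 20e12  # bytes per second
BPT = 2e3          # KV bytes per token (illustrative)

# Per-user KV time is the same at any batch size: nothing is amortized.
per_user_small = kv_read_time_s(1, 200_000, BPT, BANDWIDTH) / 1
per_user_large = kv_read_time_s(2400, 200_000, BPT, BANDWIDTH) / 2400
print(per_user_small, per_user_large)

# With 2400 sequences, KV traffic overtakes weight traffic at this context.
print(ctx_where_kv_matches_weights(288e9, 2400, BPT))
```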
API Pricing as Architecture Reverse-Engineering
| Observation | What It Reveals |
|---|---|
| Gemini 3.1 +50% price at >200K ctx | KV cache per token ≈ 2 KB (solve equation ⑥ at the crossover point) |
| Output 5× more expensive than input | GPU is memory-bound during decode (reading weights+KV for just 1 token) |
| 5min vs 1hr cache retention tiers | Flash vs HDD storage tiers (drain time = capacity/bandwidth maps to physical media) |
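
Equation ⑥ can be applied mechanically once a crossover context length is read off a price list. In the Python sketch below, the hardware ratio of 300 and the 200K crossover come from these notes, while the active parameter count is a hypothetical value chosen only to show how a figure near 2 KB can arise; it is not a known property of any model.

```python
def kv_bytes_per_token(active_params, flops_per_byte, crossover_ctx_len):
    # Equation 6: KV read time equals compute time at the pricing cliff.
    return active_params / (flops_per_byte * crossover_ctx_len)


# Hypothetical 120B active parameters, ratio 300, cliff at 200K tokens.
estimate = kv_bytes_per_token(120e9, 300, 200_000)
print(f"{estimate:.0f} bytes per token")
```

The estimate scales linearly with the assumed active parameter count, so the result is only as good as that guess.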
Overtraining 100× Beyond Chinchilla
Chinchilla optimal: training data ≈ 20 × active parameters (~2T tokens for 100B active). Actual frontier models: 100–200T tokens, up to ~100× beyond that.
This is not waste. It's training-inference joint optimization:
- Smaller active parameters → cheaper inference
- More training data → compensates for quality loss from smaller model
- When inference volume is large enough, inference savings dwarf extra training cost
Model obsolescence risk tilts the balance: if your model has only 50% chance of being best-in-class, you should train less (equation ⑦'s D_inference gets discounted).
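The arithmetic behind these figures is short enough to write out in Python. The sketch assumes the 20 tokens-per-parameter rule quoted above and models the obsolescence discount as a simple multiplication by the probability of being best-in-class; that discount form and the function names are assumptions, not something the notes specify.

```python
def chinchilla_tokens(active_params, tokens_per_param=20):
    # Rule of thumb quoted above: about 20 training tokens per active parameter.
    return tokens_per_param * active_params


def overtraining_factor(train_tokens, active_params):
    # How far past the Chinchilla-optimal token count a training run goes.
    return train_tokens / chinchilla_tokens(active_params)


def expected_inference_tokens(d_inference, p_best_in_class):
    # Assumed discount: inference volume only materializes if the model wins.
    return p_best_in_class * d_inference


print(chinchilla_tokens(100e9))            # 2T tokens for 100B active params
print(overtraining_factor(100e12, 100e9))  # lower end of the quoted range
print(overtraining_factor(200e12, 100e9))  # upper end of the quoted range
print(expected_inference_tokens(1e15, 0.5))
```

For the quoted 100–200T range the factor comes out between 50 and 100, and halving the expected inference volume shrinks the pool that justifies the extra training.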
MoE Rack Layout: One Rack = One MoE Layer Boundary
- Expert parallelism: each expert on a different GPU, all-to-all communication
- Scale-up network (NV Switch): 72 GPUs, 2-hop any-to-any — perfect match
- Scale-out (inter-rack): ~8× less bandwidth — crossing racks with MoE hurts
- Scale-up size solves bandwidth, pipeline solves capacity — two different problems
- GPT-4 rumored ~1T params in 2023, but params haven't grown much since: need bigger scale-up domains for bandwidth
Pipeline parallelism in inference: throughput is free (no training-style bubbles), but latency penalty ~50% with 4 stages. Main benefit is reducing per-rack weight storage, but Blackwell NVL72 has 13.5 TB HBM vs ~1 TB needed for 1T-param model — capacity is not the bottleneck.
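The capacity argument reduces to a ceiling division, sketched here in Python. It counts weights only and ignores KV cache and activations, which also occupy HBM; the function name and the 30 TB counterexample are hypothetical.

```python
import math


def min_stages_for_capacity(weight_bytes, hbm_bytes_per_rack):
    # Pipeline stages needed purely so that the weights fit in HBM.
    return max(1, math.ceil(weight_bytes / hbm_bytes_per_rack))


# ~1 TB of weights against 13.5 TB of rack HBM: a single stage suffices.
print(min_stages_for_capacity(1e12, 13.5e12))

# A hypothetical 30 TB model would need several racks in a pipeline.
print(min_stages_for_capacity(30e12, 13.5e12))
```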
Feistel Networks → RevNets: Cryptography Meets Neural Networks
A classic Feistel network builds an invertible map out of any function f, even a non-invertible one: split the input into (x, y) and output (y, z) with z = x + f(y). To reverse, read y directly from the output, then compute x = z - f(y).
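One round of this construction can be sketched in Python as follows. The example uses integer addition, matching the notation above, whereas block ciphers typically use XOR; the round function `f` is an arbitrary non-invertible choice made for illustration.

```python
def feistel_forward(x, y, f):
    # Pass y through untouched and mix f(y) into x.
    return y, x + f(y)


def feistel_inverse(a, z, f):
    # The first output is y itself, so f(y) can be recomputed and removed.
    y = a
    x = z - f(y)
    return x, y


def f(v):
    # Squaring is not invertible: f(2) == f(-2).
    return v * v


out = feistel_forward(3, -4, f)
print(out)
print(feistel_inverse(*out, f))
```

The inverse never needs to invert `f`; it only re-evaluates it, which is the property reversible networks rely on to recompute activations during backprop.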
RevNets (2017-18) ported the exact same construction into neural networks (residual networks first; reversible Transformer layers followed). The entire network becomes invertible, so backprop can recompute activations on the fly instead of storing them.
The symmetry that runs through the whole lecture: KV cache = "spend memory to save compute" (store intermediates). RevNets = "spend compute to save memory" (recompute instead of store). Under current hardware (memory expensive, compute relatively abundant), the former usually wins.
Deep connection: backdoors. Cryptography spent decades studying how to hide structured information in seemingly random systems; neural network backdoor defense is at an early stage.
Related
- ai-ecosystem/ai-inference-rationing-2026 — inference rationing at the industry level
- ai-ecosystem/pretraining-parallelisms-dwarkesh — pipeline parallelism in training + 6ND formula
- ai-ecosystem/deepseek-v4 — sparse attention as KV cache mitigation
- product-trends/token-optimization-economics — cost optimization levers