Reiner Pope — LLM Inference Economics (7 Equations)

Updated 2026-05-04

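A Dwarkesh Podcast episode in blackboard-lecture format. Reiner Pope (MatX CEO, formerly lead of Google TPU v5e) starts from two hardware parameters (HBM bandwidth and FLOPS) and uses 7 equations to derive the pricing, architecture, and physical limits of the AI industry.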

The 7 Equations

| # | Equation | What It Tells You |
|---|---|---|
| ① | T ≥ max(T_compute, T_memory) | Roofline: every step is bounded by the slower of compute or memory |
| ② | T_compute = batch × active_params / FLOPS | Compute time grows linearly with users and active parameters |
| ③ | T_memory = total_params/bandwidth + batch × ctx_len × bytes_per_token/bandwidth | Memory time: weight read (shared) + KV cache read (per-user) |
| ④ | Cost = T / batch | Per-token cost: weight read is amortized, KV cache is not |
| ⑤ | Critical batch ≈ FLOPS/bandwidth × total_params/active_params ≈ 300 × sparsity | Where memory and compute bottlenecks balance; independent of model size |
| ⑥ | bytes_per_token = active_params / (FLOPS/bandwidth × ctx_len) | Reverse-engineer KV cache size from API pricing cliffs |
| ⑦ | D_pretrain ≈ D_RL ≈ D_inference | Training/inference equilibrium: three cost pools equalize at optimum |
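
The first four equations can be read as a small calculator. The sketch below is only an illustration: the class names, the function names, and every number are hypothetical, and it assumes one byte per parameter so that parameter counts and bytes can be used interchangeably, as the equations above do.

```python
from dataclasses import dataclass


@dataclass
class Hardware:
    flops: float      # operations per second, in the simplified units of the equations
    bandwidth: float  # HBM bytes per second


@dataclass
class Model:
    total_params: float     # all weights (assumed 1 byte each)
    active_params: float    # weights actually used per token
    bytes_per_token: float  # KV cache bytes per token of context


def step_time(hw, m, batch, ctx_len):
    """Equations 1-3: return (seconds per decode step, which resource binds)."""
    t_compute = batch * m.active_params / hw.flops
    t_memory = (m.total_params + batch * ctx_len * m.bytes_per_token) / hw.bandwidth
    if t_compute >= t_memory:
        return t_compute, "compute"
    return t_memory, "memory"


def cost_per_token(hw, m, batch, ctx_len):
    """Equation 4: every sequence in the batch receives one token per step."""
    return step_time(hw, m, batch, ctx_len)[0] / batch


# Hypothetical values, chosen only so that FLOPS/bandwidth = 300 and sparsity = 8.
hw = Hardware(flops=6e15, bandwidth=20e12)
m = Model(total_params=8e11, active_params=1e11, bytes_per_token=2e3)
for batch in (1, 2400, 10_000):
    seconds, bound = step_time(hw, m, batch, ctx_len=8192)
    print(batch, f"{seconds * 1e3:.1f} ms", bound)
```

With these made-up inputs the small batches come out memory-bound and the largest one compute-bound, which is the transition the roofline equation describes.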

Two Bottlenecks Shape Everything

Every token generation step is bounded by the slower of:

  • Compute: doing the matrix multiplications (batch × active_params / FLOPS)
  • Memory: reading weights + KV cache from HBM (total_params/bandwidth + per-user KV)

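GPU scheduling: one "train" departs every ~20ms, roughly the time to read all weights from HBM once (on Nvidia Rubin: 288 GB / 20 TB/s ≈ 15ms). Requests board the next train. Worst case: ~40ms latency, for a request that just misses a train, waits ~20ms for the next one, and then spends ~20ms riding it.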
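
The train arithmetic can be checked in a few lines. This is a sketch using only the figures quoted above; the assumption that the worst case equals two train intervals (one spent waiting, one spent riding) is an interpretation of the 40ms figure, not something the source spells out.

```python
hbm_bytes = 288e9        # HBM capacity quoted in the text
bandwidth = 20e12        # HBM bandwidth quoted in the text, bytes per second
train_interval = 0.020   # ~20 ms between departures, as quoted in the text

# Time to stream every byte of HBM through the chip once.
weight_read_time = hbm_bytes / bandwidth

# Assumption: a request that just missed a train waits one interval,
# then its own step takes about one more interval.
worst_case_latency = 2 * train_interval

print(f"full HBM read: {weight_read_time * 1e3:.1f} ms")
print(f"worst-case per-token latency: {worst_case_latency * 1e3:.0f} ms")
```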

Batch Economics

Weight read is the fixed cost. 1 user or 2000 users — the GPU reads all weights from HBM either way. This is why:

  • No batching = 1000× worse economics than full batching
  • Fast mode costs 6× more: uses smaller batch for lower latency, weight read less amortized
  • "Slow mode" saves almost nothing: once batch is large enough that compute becomes the bottleneck, cost hits a floor. Bigger batch doesn't help — per-user compute and KV cache are irreducible.
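
A rough way to see both effects (the large penalty for no batching and the floor for very large batches) is to tabulate per-token cost relative to its floor. The model and hardware numbers below are hypothetical, KV cache is ignored, and the batch sizes are illustrative rather than what any provider actually runs.

```python
def cost_per_token(batch, total_params, active_params, flops, bandwidth):
    """Per-token time with KV cache ignored: max(compute, weight read) / batch."""
    t_compute = batch * active_params / flops
    t_weights = total_params / bandwidth  # same whether batch is 1 or 2000
    return max(t_compute, t_weights) / batch


# Hypothetical setup: sparsity of 8 and FLOPS/bandwidth of 300 (1 byte per param).
args = dict(total_params=8e11, active_params=1e11, flops=6e15, bandwidth=20e12)
floor = cost_per_token(100_000, **args)  # deep in the compute-bound regime

for batch in (1, 400, 2400, 24_000):
    relative = cost_per_token(batch, **args) / floor
    print(f"batch {batch:>6}: {relative:7.1f}x the floor")
```

Under these assumptions a single user pays about 2400× the floor, a batch at one sixth of the critical size pays about 6×, and everything at or beyond the critical batch pays the same.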

Critical batch size ≈ 300 × sparsity. For DeepSeek V3 (sparsity ≈ 8), critical batch ≈ 2400 concurrent sequences. Independent of model size — depends only on hardware FLOPS/bandwidth ratio and model sparsity.
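
The critical batch follows from setting the compute time equal to the weight-read time and solving for the batch size. The derivation below is a sketch that ignores the KV cache term and assumes one byte per parameter; B* is a symbol introduced here for the critical batch.

```latex
\begin{aligned}
\frac{B^{*} \cdot \text{active\_params}}{\text{FLOPS}}
  &= \frac{\text{total\_params}}{\text{bandwidth}} \\[4pt]
B^{*}
  &= \frac{\text{FLOPS}}{\text{bandwidth}} \cdot
     \frac{\text{total\_params}}{\text{active\_params}}
   \approx 300 \times \text{sparsity} \\[4pt]
B^{*}_{\text{sparsity}=8}
  &\approx 300 \times 8 = 2400
\end{aligned}
```

Absolute model size cancels out: only the ratio total_params/active_params survives, which is why the result depends on sparsity and the hardware ratio alone.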

KV Cache: The Irreducible Cost

KV cache is the only cost that cannot be amortized by any parallelism strategy:

  • Batch dimension: per-user conversation history, can't share
  • Pipeline dimension: more stages → more micro-batches in flight → same total KV cache
  • This is why DeepSeek's inference uses expert parallelism across the full scale-up domain, almost zero pipeline parallelism
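
Splitting the per-token cost into its per-user shares makes the asymmetry visible: only the weight-read share carries a 1/batch factor. This is a sketch with hypothetical numbers (2 KB of KV cache per token, 200K context, one byte per parameter); the function name is made up for illustration.

```python
def per_user_shares(batch, ctx_len, total_params, active_params,
                    bytes_per_token, flops, bandwidth):
    """Per-user seconds per token for each term of equations 2 and 3."""
    weights = total_params / bandwidth / batch   # shared read, shrinks with batch
    kv = ctx_len * bytes_per_token / bandwidth   # private history, no 1/batch factor
    compute = active_params / flops              # per-user math, no 1/batch factor
    return weights, kv, compute


# Hypothetical model and hardware, matching the ratios used in the text.
cfg = dict(ctx_len=200_000, total_params=8e11, active_params=1e11,
           bytes_per_token=2e3, flops=6e15, bandwidth=20e12)

for batch in (1, 2400):
    weights, kv, compute = per_user_shares(batch, **cfg)
    print(f"batch {batch:>4}: weights {weights:.2e}s  kv {kv:.2e}s  compute {compute:.2e}s")
```

Growing the batch from 1 to 2400 cuts the weight share by 2400× while the KV share stays exactly where it was.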

Context length wall at ~200K: Reiner, speaking as a hardware architect, says "I actually don't see a very good path to solving that." This is a physics wall rather than engineering laziness, because HBM bandwidth grows far slower than compute. Sparse attention helps (DeepSeek: KV read ∝ √ctx_len) but degrades quality past a point.
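
To get a feel for what the square-root scaling buys, the sketch below compares the relative growth in KV read when context grows from 200K to 1M tokens. It models only the proportionalities stated above; the constant factors are unknown, and the function name is hypothetical.

```python
import math


def kv_read_growth(ctx_from, ctx_to, sparse):
    """Relative growth in per-token KV read when context grows ctx_from -> ctx_to."""
    ratio = ctx_to / ctx_from
    # Dense attention reads the whole history; the sparse case assumes sqrt scaling.
    return math.sqrt(ratio) if sparse else ratio


dense = kv_read_growth(200_000, 1_000_000, sparse=False)
sparse = kv_read_growth(200_000, 1_000_000, sparse=True)
print(f"dense: {dense:.2f}x, sqrt-sparse: {sparse:.2f}x")
```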

API Pricing as Architecture Reverse-Engineering

| Observation | What It Reveals |
|---|---|
| Gemini 3.1 +50% price at >200K ctx | KV cache per token ≈ 2 KB (solve equation ⑥ at the crossover point) |
| Output 5× more expensive than input | GPU is memory-bound during decode (reading weights+KV for just 1 token) |
| 5min vs 1hr cache retention tiers | Flash vs HDD storage tiers (drain time = capacity/bandwidth maps to physical media) |
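
The KV-cache estimate comes from plugging the price-cliff context length into equation ⑥. The sketch below assumes 100B active parameters, which is a placeholder rather than a known figure for any model, together with the FLOPS/bandwidth ratio of 300 used elsewhere on this page.

```python
def kv_bytes_per_token(active_params, flops_per_byte, crossover_ctx_len):
    """Equation 6: KV bytes per token at which KV read time equals compute time."""
    return active_params / (flops_per_byte * crossover_ctx_len)


# Assumptions: 100B active parameters (hypothetical), FLOPS/bandwidth = 300,
# and a pricing cliff at 200K tokens of context.
estimate = kv_bytes_per_token(active_params=100e9,
                              flops_per_byte=300,
                              crossover_ctx_len=200_000)
print(f"{estimate:.0f} bytes per token")
```

This lands in the low kilobytes, the same order of magnitude as the ≈ 2 KB figure; a different assumed active-parameter count shifts the estimate proportionally.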

Overtraining 100× Beyond Chinchilla

Chinchilla optimal: training data ≈ 20 × active parameters (~2T tokens for 100B active). Actual frontier models: 100–200T tokens, roughly 50–100× beyond that.

This is not waste. It's training-inference joint optimization:

  • Smaller active parameters → cheaper inference
  • More training data → compensates for quality loss from smaller model
  • When inference volume is large enough, inference savings dwarf extra training cost
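
The trade-off in the list above can be made concrete with back-of-the-envelope FLOP counts. Everything here is a labeled assumption: the 6·N·D training and 2·N-per-token inference approximations are common rules of thumb and do not come from the source, the two model configurations and the serving volume are invented, and the sketch simply assumes the smaller overtrained model reaches comparable quality.

```python
def train_flops(params, tokens):
    """Rule-of-thumb training cost: about 6 FLOPs per parameter per token."""
    return 6 * params * tokens


def inference_flops(params, tokens_served):
    """Rule-of-thumb decode cost: about 2 FLOPs per active parameter per token."""
    return 2 * params * tokens_served


chinchilla_tokens = 20 * 100e9  # 20 tokens per parameter at 100B active: ~2T tokens

# Hypothetical comparison at a hypothetical lifetime serving volume.
tokens_served = 1e15
big = dict(params=200e9, tokens=20 * 200e9)   # Chinchilla-style: larger, less data
small = dict(params=100e9, tokens=100e12)     # overtrained: smaller, 50x the data

total_big = train_flops(**big) + inference_flops(big["params"], tokens_served)
total_small = train_flops(**small) + inference_flops(small["params"], tokens_served)
print(f"big model total:   {total_big:.2e} FLOPs")
print(f"small model total: {total_small:.2e} FLOPs")
```

In this made-up scenario the overtrained model spends over ten times more on training yet comes out cheaper overall, because serving dominates the bill.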

Model obsolescence risk tilts the balance: if your model has only 50% chance of being best-in-class, you should train less (equation ⑦'s D_inference gets discounted).

MoE Rack Layout: One Rack = One MoE Layer Boundary

  • Expert parallelism: each expert on a different GPU, all-to-all communication
  • Scale-up network (NVSwitch): 72 GPUs, 2-hop any-to-any, a perfect match for all-to-all traffic
  • Scale-out (inter-rack): ~8× less bandwidth — crossing racks with MoE hurts
  • Scale-up size solves bandwidth, pipeline solves capacity — two different problems
  • GPT-4 rumored ~1T params in 2023, but params haven't grown much since: need bigger scale-up domains for bandwidth

Pipeline parallelism in inference: throughput is free (no training-style bubbles), but latency penalty ~50% with 4 stages. Main benefit is reducing per-rack weight storage, but Blackwell NVL72 has 13.5 TB HBM vs ~1 TB needed for 1T-param model — capacity is not the bottleneck.
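
The capacity claim is simple arithmetic. The sketch below uses the figures quoted above and assumes one byte per parameter (8-bit weights), which is what makes a 1T-parameter model come to about 1 TB.

```python
rack_hbm_bytes = 13.5e12   # NVL72 HBM capacity quoted in the text
params = 1e12              # 1T-parameter model
bytes_per_param = 1        # assumption: 8-bit weights, giving ~1 TB

weight_bytes = params * bytes_per_param
weight_fraction = weight_bytes / rack_hbm_bytes
left_for_kv = rack_hbm_bytes - weight_bytes

print(f"weights use {weight_fraction:.1%} of rack HBM")
print(f"{left_for_kv / 1e12:.1f} TB left for KV cache and activations")
```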

Feistel Networks → RevNets: Cryptography Meets Neural Networks

A classic Feistel network builds an invertible map out of any function f, even a non-invertible one: split the input into (x, y) and output (y, z) with z = x + f(y). To reverse it, read y directly from the output and recover x = z - f(y).
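
A single Feistel round can be sketched in a few lines. The round function f below is an arbitrary, deliberately non-invertible example, and the arithmetic is done modulo 2^32, which is an assumption added here so the values stay in a fixed range; inputs are expected to be integers in [0, 2^32).

```python
MOD = 2**32


def f(v):
    """Deliberately non-invertible round function: many inputs share one output."""
    return (v * v) % 1000


def feistel_forward(x, y):
    """Map (x, y) to (y, x + f(y))."""
    return y, (x + f(y)) % MOD


def feistel_inverse(y, z):
    """y passed through untouched, so f(y) can be recomputed and subtracted."""
    return (z - f(y)) % MOD, y


x, y = 123456, 789
out = feistel_forward(x, y)
print(out, feistel_inverse(*out))
```

The inverse never needs to invert f; it only needs to evaluate f again on a value that the forward pass left intact.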

RevNets (2017-18) ported the same construction into neural networks, first residual networks and later Transformers. The entire network becomes invertible, so backprop can recompute activations on the fly instead of storing them.
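
The recompute-instead-of-store idea can be illustrated with scalar stand-ins for the sub-layers. This is a toy sketch of a reversible residual coupling, not an actual training loop: F and G are arbitrary placeholder functions, and the inputs are plain floats rather than tensors.

```python
import math


def F(v):
    return math.tanh(0.5 * v)  # placeholder for one sub-layer


def G(v):
    return math.sin(v)         # placeholder for the other sub-layer


def rev_forward(x1, x2):
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2


def rev_inverse(y1, y2):
    x2 = y2 - G(y1)  # undo the second update first
    x1 = y1 - F(x2)
    return x1, x2


depth = 8
start = (0.3, -1.2)

# Forward pass: keep only the final state, discard every intermediate activation.
state = start
for _ in range(depth):
    state = rev_forward(*state)

# Backward walk: recompute each layer's input from its output.
for _ in range(depth):
    state = rev_inverse(*state)

print(start, state)
```

The recovered state matches the starting point up to floating-point rounding, so memory for activations stays constant in depth at the price of running each sub-layer a second time.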

The symmetry that runs through the whole piece: KV cache = "spend memory to save compute" (store intermediates). RevNets = "spend compute to save memory" (recompute instead of store). Under current hardware (memory expensive, compute relatively abundant), the former usually wins.

Deep connection: backdoors. Cryptography spent decades studying how to hide structured information in seemingly random systems; neural network backdoor defense is at an early stage.

Sources

Synthesized from 1 source
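  • 高飞 (Gao Fei): "GPT、Claude和Gemini到底是怎么训练和推理的:7个方程和API报价单背后的推导" (roughly, "How GPT, Claude, and Gemini Are Actually Trained and Served: The Derivation Behind 7 Equations and an API Price Sheet"). Primary source for this page.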
