LLM Inference Mechanics
What it is
An overview of how large language models execute inference efficiently, covering the full stack from hardware to software optimization. Based on a systematic technical summary of AI infrastructure for beginners.
Key concepts
1. Model Parallelism Strategies
| Strategy | Mechanism | When to use |
|---|---|---|
| Tensor Parallelism (TP) | Split individual layers across multiple GPUs | Single node, large models that don't fit on one GPU |
| Pipeline Parallelism (PP) | Split model into stages, each on a different GPU | Very deep models, cross-node deployment |
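| Sequence Parallelism (SP) | Split activations along the sequence dimension, typically in the layers that tensor parallelism leaves replicated (e.g., LayerNorm) | Long-context inference, usually combined with TP |
| Expert Parallelism (EP) | Place experts on different GPUs and route each token to the GPUs holding its selected experts | Mixture-of-Experts (MoE) models |
| Context Parallelism (CP) | Split the sequence across GPUs for the attention computation itself, exchanging keys and values between GPUs | Ultra-long context |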
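
To make the tensor-parallel row concrete, here is a minimal pure-Python sketch. It assumes a column-wise split of a single linear layer's weight matrix across two simulated devices. The helper names (`matmul`, `split_columns`, `tensor_parallel_linear`) are illustrative and do not come from any framework; a real system would run the shards on separate GPUs and use a collective operation such as all-gather to combine the outputs.

```python
def matmul(x, w):
    """Multiply a (rows x inner) matrix by an (inner x cols) matrix."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def split_columns(w, num_shards):
    """Split a weight matrix column-wise into equally sized shards."""
    cols = len(w[0])
    assert cols % num_shards == 0, "columns must divide evenly across shards"
    width = cols // num_shards
    return [[row[i * width:(i + 1) * width] for row in w] for i in range(num_shards)]


def tensor_parallel_linear(x, w, num_shards):
    """Each simulated device computes x @ W_shard; outputs are concatenated."""
    partials = [matmul(x, shard) for shard in split_columns(w, num_shards)]
    # Concatenating along the column axis plays the role of an all-gather.
    return [sum((p[r] for p in partials), []) for r in range(len(x))]


x = [[1.0, 2.0], [3.0, 4.0]]
w = [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 1.0, 3.0]]
full = matmul(x, w)
sharded = tensor_parallel_linear(x, w, num_shards=2)
```

Here `sharded` equals `full`: each device only ever holds half of the weight columns, which is what lets a layer that is too large for one GPU be served by several.
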
2. Memory Optimization
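- Weight quantization: FP16 → INT8/INT4 cuts weight memory by roughly 2× (INT8) to 4× (INT4), usually with a small, task-dependent accuracy loss
- KV-cache management: The KV cache grows linearly with batch size and sequence length and is often the main memory bottleneck in autoregressive generation. Optimizations include:
  - PagedAttention (vLLM): Store the KV cache in fixed-size blocks that need not be contiguous in memory, which nearly eliminates fragmentation
  - Dynamic allocation: Allocate blocks as a sequence grows rather than pre-allocating for the maximum context size
- Offloading: Move less-active layers to CPU memory or SSD when GPU memory is constrained, at the cost of transfer latency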
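
The memory figures above follow from simple arithmetic, sketched below. The model shape (7 billion parameters, 32 layers, 32 KV heads, head dimension 128) is an assumed configuration chosen for illustration, and the function names are hypothetical. The weight estimate ignores the extra storage for quantization scales and zero-points.

```python
def weight_bytes(num_params, bits_per_weight):
    """Memory for model weights alone, ignoring quantization metadata."""
    return num_params * bits_per_weight // 8


def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Keys and values (the factor of 2) stored for every layer and KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


# Assumed 7B-parameter configuration, for illustration only.
params = 7_000_000_000
fp16 = weight_bytes(params, 16)
int8 = weight_bytes(params, 8)
int4 = weight_bytes(params, 4)

per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)
per_request_4k = per_token * 4096  # one 4096-token sequence with an FP16 cache
```

Under these assumptions the weights take 14 GB in FP16, 7 GB in INT8, and 3.5 GB in INT4, while each token adds 512 KiB of KV cache, so a single 4096-token request holds 2 GiB. That per-request cost is why pre-allocating for the maximum context length wastes memory so quickly.
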
3. Serving System Architecture
- Continuous batching (in-flight batching): Unlike static batching, new requests join the running batch as soon as a slot frees up, which substantially improves GPU utilization
- Speculative decoding: A smaller draft model proposes several candidate tokens; the large model verifies them in a single parallel forward pass. Speedups of roughly 2-3× are possible when the acceptance rate is high
- Prefix caching: Cache the KV vectors of common prompt prefixes (system prompts, few-shot examples) to avoid recomputation
- Disaggregated serving: Separate prefill (compute-intensive, parallelizable) and decode (memory-bandwidth-bound, sequential) phases onto different GPU pools
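
The benefit of continuous batching can be seen in a toy simulation. The sketch below counts decode iterations for the same workload under static and continuous batching. It assumes each active request produces exactly one token per iteration and ignores prefill cost; the function names and the request lengths are made up for illustration.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Refill free slots from the queue at every decode iteration."""
    queue = deque(lengths)
    running = []  # remaining tokens to generate for each active request
    steps = 0
    while queue or running:
        while queue and len(running) < batch_size:
            running.append(queue.popleft())
        steps += 1  # one decode iteration: one token per active request
        running = [r - 1 for r in running if r > 1]
    return steps


# Hypothetical output lengths (tokens to generate) for six requests.
lengths = [2, 8, 2, 2, 8, 2]
static_steps = static_batching_steps(lengths, batch_size=2)
continuous_steps = continuous_batching_steps(lengths, batch_size=2)
```

With this workload the static schedule needs 18 iterations and the continuous one needs 14, because short requests no longer leave a slot idle while a long neighbor finishes.
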
4. Request Scheduling
- Shortest-job-first with preemption: Prioritize requests expected to finish quickly (output length is not known in advance, so this relies on an estimate); preempt long-running requests if needed
- Chunked prefill: Break long prefills into chunks to avoid starving short requests
- Load balancing: Route requests based on current queue depth and KV-cache pressure
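
Chunked prefill can be sketched as a per-iteration token budget. The version below assumes that each iteration first reserves one token for every in-flight decode request and spends whatever budget is left on the next chunk of a long prompt. The function name, the fixed number of decoding requests, and the budget value are all assumptions for illustration, not the policy of any particular scheduler.

```python
def plan_iterations(prompt_len, num_decoding, token_budget):
    """Return the prefill chunk size scheduled in each iteration.

    Each iteration reserves one token per in-flight decode request, then
    spends the remaining budget on the next chunk of the long prompt.
    """
    if token_budget <= num_decoding:
        raise ValueError("token budget must exceed the number of decode requests")
    chunks = []
    remaining = prompt_len
    while remaining > 0:
        chunk = min(remaining, token_budget - num_decoding)
        chunks.append(chunk)
        remaining -= chunk
    return chunks


# Hypothetical case: a 1000-token prompt arrives while 12 requests are decoding.
chunks = plan_iterations(prompt_len=1000, num_decoding=12, token_budget=256)
```

The prompt is processed over five iterations (four chunks of 244 tokens and one of 24), and the 12 decoding requests keep producing a token in every one of them instead of stalling behind a single 1000-token prefill.
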
5. vLLM Architecture
vLLM represents a generational shift in inference serving:
- PagedAttention: Borrowed from OS virtual memory; the KV cache is divided into pages (blocks) that are allocated and freed dynamically
- Copy-on-write: When a request forks (e.g., beam search), share KV cache pages until one branch modifies them
- Continuous batching: Dynamic batching with iteration-level scheduling
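
The paging and copy-on-write ideas can be illustrated with a toy block manager. This is a simplified sketch and not vLLM's actual implementation: the `BlockManager` class and its methods are hypothetical, blocks are plain integers, and no KV data is stored or copied.

```python
class BlockManager:
    """Toy KV-cache block manager with reference-counted copy-on-write."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.refcount = {}
        self.tables = {}  # sequence id -> list of physical block ids

    def _alloc(self):
        if not self.free:
            raise MemoryError("out of KV-cache blocks")
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def allocate(self, seq_id, num_blocks):
        self.tables[seq_id] = [self._alloc() for _ in range(num_blocks)]

    def fork(self, parent_id, child_id):
        """Child shares every parent block; only reference counts change."""
        self.tables[child_id] = list(self.tables[parent_id])
        for block in self.tables[child_id]:
            self.refcount[block] += 1

    def write_last_block(self, seq_id):
        """Before writing, give the sequence a private copy of a shared block."""
        table = self.tables[seq_id]
        block = table[-1]
        if self.refcount[block] > 1:
            self.refcount[block] -= 1
            table[-1] = self._alloc()  # a real system would also copy the data
        return table[-1]

    def release(self, seq_id):
        for block in self.tables.pop(seq_id):
            self.refcount[block] -= 1
            if self.refcount[block] == 0:
                del self.refcount[block]
                self.free.append(block)


manager = BlockManager(num_blocks=8)
manager.allocate("parent", 3)
manager.fork("parent", "child")
shared_before = manager.tables["parent"] == manager.tables["child"]
manager.write_last_block("child")
```

After the fork both sequences point at the same three blocks. The write gives the child a private last block while the first two stay shared, so forking costs one extra block rather than three.
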
Open questions
- At what context length does disaggregated prefill/decode become necessary?
- How do quantization-aware training and post-training quantization compare for production accuracy?
- Will hardware specialization (TPU, Groq) change these software optimizations, or do the principles transfer?