LLM Inference Mechanics

What it is

An overview of how large language models execute inference efficiently, covering parallelism, memory optimization, serving architecture, and request scheduling. Based on a systematic technical summary of AI infrastructure for beginners.

Key concepts

1. Model Parallelism Strategies

Strategy	Mechanism	When to use
Tensor Parallelism (TP)	Split each layer's weight matrices across multiple GPUs, so every GPU works on every token	Single node with a fast interconnect, large models that don't fit on one GPU
Pipeline Parallelism (PP)	Split model into stages, each on a different GPU	Very deep models, cross-node deployment
Sequence Parallelism (SP)	Split sequence dimension across GPUs	Long-context inference
Expert Parallelism (EP)	Place different experts on different GPUs and route each token to the GPUs holding its selected experts	MoE models
Context Parallelism (CP)	Distribute attention computation across sequence length	Ultra-long context
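
To make the tensor-parallel row concrete, here is a minimal pure-Python sketch of a column-split linear layer. The "GPUs" are simulated as list slices, and the helper names (split_columns, tensor_parallel_matmul) are hypothetical, not taken from any framework. Real implementations also use row-split layers and collective communication, which this sketch leaves out.

```python
def matmul(x, w):
    """Multiply a vector x (length d_in) by a matrix w (d_in rows, d_out columns)."""
    d_out = len(w[0])
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(d_out)]


def split_columns(w, num_shards):
    """Split the columns of w into equal slices, one per simulated GPU."""
    d_out = len(w[0])
    assert d_out % num_shards == 0, "d_out must divide evenly across shards"
    width = d_out // num_shards
    return [[row[s * width:(s + 1) * width] for row in w] for s in range(num_shards)]


def tensor_parallel_matmul(x, w, num_shards):
    """Each shard computes its slice of the output; the slices are then concatenated."""
    shards = split_columns(w, num_shards)
    # On real hardware these products run concurrently, one per GPU,
    # and the concatenation is an all-gather over the interconnect.
    partial_outputs = [matmul(x, shard) for shard in shards]
    return [value for part in partial_outputs for value in part]


x = [1.0, 2.0, 3.0]
w = [
    [1.0, 0.0, 2.0, 1.0],
    [0.0, 1.0, 1.0, 3.0],
    [2.0, 1.0, 0.0, 1.0],
]
full = matmul(x, w)
sharded = tensor_parallel_matmul(x, w, num_shards=2)
print(full == sharded)
```

Each shard holds only half of the weight matrix, which is why this strategy helps when a layer does not fit on one device. The cost is the gather step after every split layer, which is why a fast interconnect matters.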

2. Memory Optimization

Weight quantization: FP16 → INT8/INT4 cuts weight memory by roughly 2-4×. The accuracy loss depends on the model and the quantization method, so it should be measured on the target task
KV-cache management: The key bottleneck in autoregressive generation. Optimizations include:
- PagedAttention (vLLM): Store the KV cache in fixed-size blocks that need not be contiguous, which removes external fragmentation and limits internal waste to the last block of each sequence
- Dynamic memory allocation rather than pre-allocating max context size
Offloading: Move less-active layers to CPU/SSD when GPU memory is constrained
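
The memory figures above can be checked with back-of-the-envelope arithmetic. The sketch below assumes a hypothetical 7B-parameter model with 32 layers, 32 KV heads, and a head dimension of 128; these numbers are illustrative and are not taken from the source. It also ignores the scales and zero-points that real quantized formats store, so actual savings are slightly below the ideal ratio.

```python
GIB = 1024 ** 3


def weight_bytes(num_params, bits_per_weight):
    """Bytes needed to store the weights at a given precision (metadata ignored)."""
    return num_params * bits_per_weight // 8


def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache added by one token: a key and a value vector in every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


# Hypothetical model configuration, chosen only for illustration.
NUM_PARAMS = 7_000_000_000
NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM = 32, 32, 128

fp16 = weight_bytes(NUM_PARAMS, 16)
int8 = weight_bytes(NUM_PARAMS, 8)
int4 = weight_bytes(NUM_PARAMS, 4)

per_token = kv_cache_bytes_per_token(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM)  # 524288 bytes
kv_4k = per_token * 4096  # one sequence at 4096 tokens of context

print(f"weights: FP16 {fp16 / GIB:.1f} GiB, INT8 {int8 / GIB:.1f} GiB, INT4 {int4 / GIB:.1f} GiB")
print(f"KV cache: {per_token / 1024:.0f} KiB per token, {kv_4k / GIB:.1f} GiB per 4096-token sequence")
```

Under these assumptions a single 4096-token sequence needs 2 GiB of KV cache, so a handful of concurrent long requests can rival the weights in memory use. That is the reason pre-allocating the maximum context size for every request is so wasteful.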

3. Serving System Architecture

Continuous batching (in-flight batching): Unlike static batching, new requests join an ongoing batch as soon as a slot frees up, so the GPU does not sit idle waiting for the slowest request in the batch to finish
Speculative decoding: A smaller draft model generates candidate tokens; the large model verifies them in parallel. Speedups of around 2-3× are reported when the acceptance rate is high
Prefix caching: Cache the KV vectors of common prompt prefixes (system prompts, few-shot examples) to avoid recomputation
Disaggregated serving: Separate prefill (compute-intensive, parallelizable) and decode (memory-bandwidth-bound, sequential) phases onto different GPU pools
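
The draft-and-verify loop of speculative decoding can be sketched with two toy "models". Both functions below are hypothetical stand-ins (simple arithmetic rules, not neural networks), and the sketch uses greedy verification only; sampling-based acceptance rules are more involved and are not shown.

```python
def target_next(context):
    """Hypothetical large model: next token is the sum of the last two tokens mod 10."""
    return (context[-1] + context[-2]) % 10


def draft_next(context):
    """Hypothetical small model: agrees with the target except right after a 5."""
    if context[-1] == 5:
        return 0
    return (context[-1] + context[-2]) % 10


def speculative_step(context, k):
    """Draft k tokens, verify them against the target, return the tokens to emit."""
    proposal = []
    for _ in range(k):
        proposal.append(draft_next(context + proposal))
    # The target scores all k + 1 positions; on real hardware this is one batched pass.
    target_tokens = [target_next(context + proposal[:i]) for i in range(k + 1)]
    accepted = []
    for i in range(k):
        if proposal[i] != target_tokens[i]:
            break
        accepted.append(proposal[i])
    # At the first mismatch use the target's token; if all matched, this is a bonus token.
    accepted.append(target_tokens[len(accepted)])
    return accepted


def generate_speculative(prompt, num_tokens, k=4):
    context, target_passes = list(prompt), 0
    while len(context) - len(prompt) < num_tokens:
        context += speculative_step(context, k)
        target_passes += 1
    return context[len(prompt):len(prompt) + num_tokens], target_passes


def generate_baseline(prompt, num_tokens):
    context = list(prompt)
    for _ in range(num_tokens):
        context.append(target_next(context))
    return context[len(prompt):]


spec_tokens, target_passes = generate_speculative([1, 1], num_tokens=12)
baseline_tokens = generate_baseline([1, 1], num_tokens=12)
print(spec_tokens == baseline_tokens, target_passes)
```

The output matches what the target model would have produced on its own, but with fewer target passes than tokens. The saving shrinks as the draft model disagrees more often, which is why the speedup depends on the acceptance rate.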

4. Request Scheduling

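Shortest-job-first with preemption: Prioritize requests that will finish quickly; preempt long-running requests if needed. Output length is not known in advance, so this relies on an estimate of how long each request will run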
Chunked prefill: Break long prefills into chunks to avoid starving short requests
Load balancing: Route requests based on current queue depth and KV-cache pressure
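
Chunked prefill can be illustrated with a per-iteration token budget. The policy below is a hypothetical sketch, not the scheduler of any particular serving system: running decode requests get one token each, and the leftover budget is filled with a slice of the pending long prompt.

```python
def schedule_iterations(prefill_tokens, num_decoding, token_budget):
    """Return per-iteration plans as (decode_tokens, prefill_chunk) tuples.

    Decode requests are served first, one token each; whatever budget
    remains in the iteration is spent on a chunk of the pending prefill.
    """
    assert token_budget > num_decoding, "budget must leave room for prefill"
    plans = []
    remaining = prefill_tokens
    while remaining > 0:
        chunk = min(remaining, token_budget - num_decoding)
        plans.append((num_decoding, chunk))
        remaining -= chunk
    return plans


# A 1000-token prompt arrives while 8 requests are already decoding.
plans = schedule_iterations(prefill_tokens=1000, num_decoding=8, token_budget=264)
for step, (decode_tokens, prefill_chunk) in enumerate(plans):
    print(f"iteration {step}: {decode_tokens} decode tokens + {prefill_chunk} prefill tokens")
```

Without chunking, the 1000-token prefill would occupy a whole iteration and the 8 decoding requests would stall until it finished. With chunking they keep producing one token per iteration while the prompt is processed over four iterations.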

5. vLLM Architecture

vLLM represents a generational shift in inference serving:

PagedAttention: Borrowed from OS virtual memory. The KV cache is divided into fixed-size pages that are allocated and freed on demand
Copy-on-write: When a request forks (e.g., beam search), share KV cache pages until one branch modifies them
Continuous batching: Dynamic batching with iteration-level scheduling
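
The paging and copy-on-write ideas can be sketched as a small block manager. This is a toy model of the bookkeeping only, assuming a hypothetical BlockManager class. It is not vLLM's actual API, and it tracks block ownership without storing or copying any tensor data.

```python
class BlockManager:
    """Toy KV-cache allocator: fixed-size blocks, per-sequence block tables, ref counts."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.ref_count = {}     # physical block id -> number of sequences using it
        self.block_tables = {}  # sequence id -> list of physical block ids
        self.num_tokens = {}    # sequence id -> tokens stored so far

    def _allocate(self):
        if not self.free_blocks:
            raise MemoryError("out of KV-cache blocks")
        block = self.free_blocks.pop()
        self.ref_count[block] = 1
        return block

    def add_sequence(self, seq_id):
        self.block_tables[seq_id] = []
        self.num_tokens[seq_id] = 0

    def append_token(self, seq_id):
        table = self.block_tables[seq_id]
        if self.num_tokens[seq_id] % self.block_size == 0:
            table.append(self._allocate())  # no block yet, or the last one is full
        elif self.ref_count[table[-1]] > 1:
            # Copy-on-write: the last block is shared, so switch to a private block.
            # A real system would also copy the block's contents here.
            self.ref_count[table[-1]] -= 1
            table[-1] = self._allocate()
        self.num_tokens[seq_id] += 1

    def fork(self, parent_id, child_id):
        """Share all of the parent's blocks with the child instead of copying them."""
        self.block_tables[child_id] = list(self.block_tables[parent_id])
        self.num_tokens[child_id] = self.num_tokens[parent_id]
        for block in self.block_tables[child_id]:
            self.ref_count[block] += 1

    def free(self, seq_id):
        for block in self.block_tables.pop(seq_id):
            self.ref_count[block] -= 1
            if self.ref_count[block] == 0:
                del self.ref_count[block]
                self.free_blocks.append(block)
        del self.num_tokens[seq_id]


manager = BlockManager(num_blocks=8, block_size=4)
manager.add_sequence("a")
for _ in range(6):
    manager.append_token("a")   # 6 tokens occupy 2 blocks
manager.fork("a", "b")          # "b" shares both blocks with "a"
manager.append_token("b")       # writing to the shared last block triggers copy-on-write
table_a = list(manager.block_tables["a"])
table_b = list(manager.block_tables["b"])
blocks_in_use = 8 - len(manager.free_blocks)
print(table_a, table_b, blocks_in_use)
```

After the fork and one extra token, the two sequences still share their first block and differ only in the last one, so three blocks are in use rather than four. Blocks return to the free list as soon as no sequence references them, which is what allows memory to be allocated on demand rather than reserved for the maximum context length.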

Open questions

At what context length does disaggregated prefill/decode become necessary?
How do quantization-aware training and post-training quantization compare for production accuracy?
Will hardware specialization (TPU, Groq) change these software optimizations, or do the principles transfer?
