LLM Inference Mechanics

Updated 2026-05-28

What it is

An overview of how large language models execute inference efficiently, covering the full stack from hardware to software optimization. Based on a systematic technical summary of AI infrastructure for beginners.

Key concepts

1. Model Parallelism Strategies

| Strategy | Mechanism | When to use |
| --- | --- | --- |
| Tensor Parallelism (TP) | Split individual layers across multiple GPUs | Single node, large models that don't fit on one GPU |
| Pipeline Parallelism (PP) | Split model into stages, each on a different GPU | Very deep models, cross-node deployment |
| Sequence Parallelism (SP) | Split sequence dimension across GPUs | Long-context inference |
| Expert Parallelism (EP) | Route tokens to different expert GPUs | MoE models |
| Context Parallelism (CP) | Distribute attention computation across sequence length | Ultra-long context |
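
To make the tensor parallelism row concrete, here is a minimal sketch that splits a weight matrix by columns across simulated "GPUs" and checks that the concatenated shard outputs match the unsharded result. It uses plain Python lists in place of real devices; the helper names (`split_columns`, `tensor_parallel_matmul`) are hypothetical and not taken from any framework.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix multiply: (m x k) @ (k x n) -> (m x n)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def split_columns(w: Matrix, parts: int) -> List[Matrix]:
    """Split a weight matrix into `parts` column shards, one per simulated GPU."""
    n = len(w[0])
    assert n % parts == 0, "column count must divide evenly across shards"
    step = n // parts
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(parts)]


def tensor_parallel_matmul(x: Matrix, w: Matrix, parts: int) -> Matrix:
    """Each shard computes x @ w_shard; joining the outputs by column
    (the role an all-gather plays on real hardware) reproduces x @ w."""
    partials = [matmul(x, shard) for shard in split_columns(w, parts)]
    return [sum((p[r] for p in partials), []) for r in range(len(x))]


x = [[1.0, 2.0], [3.0, 4.0]]
w = [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 1.0, 3.0]]

full = matmul(x, w)
sharded = tensor_parallel_matmul(x, w, parts=2)
print(full == sharded)
```

Each shard holds only half of the weight columns, which is why this strategy helps when a single layer's weights are too large for one device. The cost is the communication step needed to reassemble the output.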

2. Memory Optimization

  • Weight quantization: Converting weights from FP16 to INT8 or INT4 cuts weight memory by roughly 2× or 4×, respectively, usually with acceptable accuracy loss
  • KV-cache management: The key bottleneck in autoregressive generation. Optimizations include:
    • PagedAttention (vLLM): Store the KV cache in fixed-size blocks that need not be contiguous in GPU memory, which largely eliminates fragmentation
    • Dynamic memory allocation rather than pre-allocating max context size
  • Offloading: Move less-active layers to CPU/SSD when GPU memory is constrained
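
A back-of-the-envelope calculation shows why weights and the KV cache dominate GPU memory. The sketch below assumes a hypothetical 7B-parameter model with 32 layers, 32 KV heads, and a head dimension of 128, with KV values stored in FP16; these numbers are illustrative assumptions, not figures from the source.

```python
GIB = 1024 ** 3


def weight_bytes(num_params: int, bits_per_weight: int) -> int:
    """Memory needed to store the weights at a given precision."""
    return num_params * bits_per_weight // 8


def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   num_tokens: int, bytes_per_value: int = 2) -> int:
    """KV cache size: one key and one value vector per layer, head, and token."""
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_value


# Illustrative (assumed) model shape.
params = 7_000_000_000
layers, kv_heads, head_dim = 32, 32, 128

for bits in (16, 8, 4):
    print(f"{bits}-bit weights: {weight_bytes(params, bits) / GIB:.2f} GiB")

per_token = kv_cache_bytes(layers, kv_heads, head_dim, num_tokens=1)
per_request = kv_cache_bytes(layers, kv_heads, head_dim, num_tokens=4096)
print(f"KV cache per token: {per_token // 1024} KiB")          # 512 KiB
print(f"KV cache for 4096 tokens: {per_request / GIB:.0f} GiB")  # 2 GiB
```

Under these assumptions a single 4096-token request holds 2 GiB of KV cache. Pre-allocating the maximum context for every request therefore wastes memory quickly, while allocating on demand does not.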

3. Serving System Architecture

  • Continuous batching (in-flight batching): Unlike static batching, new requests join the running batch as soon as a slot frees up, so the GPU does not sit partly idle waiting for the longest request in the batch to finish
  • Speculative decoding: A smaller draft model proposes several candidate tokens; the large model verifies them in a single parallel forward pass and keeps the accepted prefix. The source cites a 2-3× speedup when the acceptance rate is high
  • Prefix caching: Cache the KV vectors of common prompt prefixes (system prompts, few-shot examples) to avoid recomputation
  • Disaggregated serving: Separate prefill (compute-intensive, parallelizable) and decode (memory-bandwidth-bound, sequential) phases onto different GPU pools
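
The benefit of continuous batching can be seen in a toy simulation that counts decode iterations under both policies. This is a simplified sketch: it ignores prefill cost, assumes every request's output length is known, and treats each iteration as generating one token per running request. The function names are hypothetical.

```python
from collections import deque
from typing import List


def static_batching_steps(lengths: List[int], slots: int) -> int:
    """Static batching: each group of `slots` requests runs until its longest member finishes."""
    total = 0
    for i in range(0, len(lengths), slots):
        total += max(lengths[i:i + slots])
    return total


def continuous_batching_steps(lengths: List[int], slots: int) -> int:
    """Continuous batching: refill free slots from the queue before every iteration."""
    waiting = deque(lengths)
    running: List[int] = []  # remaining tokens for each in-flight request
    steps = 0
    while waiting or running:
        while waiting and len(running) < slots:
            running.append(waiting.popleft())
        # One decode iteration: every running request emits one token;
        # requests with nothing left to generate leave the batch.
        running = [r - 1 for r in running if r - 1 > 0]
        steps += 1
    return steps


lengths = [2, 8, 2, 2, 8, 2]  # tokens each request still has to generate
slots = 2

static_steps = static_batching_steps(lengths, slots)
continuous_steps = continuous_batching_steps(lengths, slots)
print(static_steps, continuous_steps)  # 18 14
```

With mixed short and long requests, the static policy leaves a slot empty while the long request finishes. The continuous policy backfills that slot immediately.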

4. Request Scheduling

  • Shortest-job-first with preemption: Prioritize requests that will finish quickly; preempt long-running requests if needed
  • Chunked prefill: Break long prefills into chunks to avoid starving short requests
  • Load balancing: Route requests based on current queue depth and KV-cache pressure
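
Chunked prefill can be sketched as a per-iteration token budget: requests that are already decoding get their one token each, and whatever budget remains goes to the next slice of a long prompt. The sketch below assumes the number of decoding requests stays constant while the prompt is prefilled. `plan_iterations` and its parameters are hypothetical names, not a real scheduler API.

```python
from typing import List, Tuple


def plan_iterations(prompt_len: int, num_decoding: int,
                    token_budget: int) -> List[Tuple[int, int]]:
    """Return (decode_tokens, prefill_tokens) for each iteration
    until the whole prompt has been prefilled."""
    assert token_budget > num_decoding, "budget must leave room for prefill"
    plan = []
    remaining = prompt_len
    while remaining > 0:
        # Decode requests are served first; prefill takes the leftover budget.
        chunk = min(remaining, token_budget - num_decoding)
        plan.append((num_decoding, chunk))
        remaining -= chunk
    return plan


plan = plan_iterations(prompt_len=1000, num_decoding=12, token_budget=512)
for decode_tokens, prefill_tokens in plan:
    print(f"decode={decode_tokens} prefill={prefill_tokens}")
```

Here the 1000-token prompt is spread over two iterations of 500 prefill tokens each, and the 12 decoding requests keep producing a token in every iteration instead of stalling behind one large prefill.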

5. vLLM Architecture

vLLM represents a generational shift in inference serving:

  • PagedAttention: Borrowed from OS virtual memory—treat KV cache as pages that can be allocated/freed dynamically
  • Copy-on-write: When a request forks (e.g., beam search), share KV cache pages until one branch modifies them
  • Continuous batching: Dynamic batching with iteration-level scheduling
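
The paging and copy-on-write ideas above can be illustrated with a toy block allocator that tracks only block ids and reference counts. This is a hypothetical sketch and not vLLM's actual classes or API. A real implementation would also copy the block's KV data on a copy-on-write and manage GPU memory.

```python
from typing import Dict, List


class BlockAllocator:
    """Toy pool of fixed-size physical KV blocks with reference counts."""

    def __init__(self, num_blocks: int, block_size: int) -> None:
        self.block_size = block_size
        self.free: List[int] = list(range(num_blocks))
        self.ref_count: Dict[int, int] = {}

    def allocate(self) -> int:
        block = self.free.pop()
        self.ref_count[block] = 1
        return block

    def share(self, block: int) -> None:
        self.ref_count[block] += 1

    def release(self, block: int) -> None:
        self.ref_count[block] -= 1
        if self.ref_count[block] == 0:
            del self.ref_count[block]
            self.free.append(block)


class Sequence:
    """Maps a sequence's logical blocks to physical block ids (its block table)."""

    def __init__(self, allocator: BlockAllocator) -> None:
        self.allocator = allocator
        self.block_table: List[int] = []
        self.num_tokens = 0

    def append_token(self) -> None:
        if self.num_tokens % self.allocator.block_size == 0:
            # No block yet, or the last one is full: allocate on demand.
            self.block_table.append(self.allocator.allocate())
        else:
            last = self.block_table[-1]
            if self.allocator.ref_count[last] > 1:
                # Copy-on-write: take a private block before writing to a shared one.
                self.allocator.release(last)
                self.block_table[-1] = self.allocator.allocate()
        self.num_tokens += 1

    def fork(self) -> "Sequence":
        """Create a child that shares every existing block with the parent."""
        child = Sequence(self.allocator)
        child.block_table = list(self.block_table)
        child.num_tokens = self.num_tokens
        for block in self.block_table:
            self.allocator.share(block)
        return child

    def free(self) -> None:
        for block in self.block_table:
            self.allocator.release(block)
        self.block_table = []
        self.num_tokens = 0


alloc = BlockAllocator(num_blocks=8, block_size=4)
parent = Sequence(alloc)
for _ in range(6):          # 6 tokens occupy 2 blocks; the second is half full
    parent.append_token()
child = parent.fork()       # child shares both blocks
child.append_token()        # writing to the shared, partly filled block triggers a copy
print(parent.block_table, child.block_table, len(alloc.free))
```

After the fork, the first (full) block stays shared and only the partially filled last block is duplicated. Three physical blocks therefore serve two sequences that would otherwise need four.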

Open questions

  • At what context length does disaggregated prefill/decode become necessary?
  • How do quantization-aware training and post-training quantization compare for production accuracy?
  • Will hardware specialization (TPU, Groq) change these software optimizations, or do the principles transfer?

Sources

Synthesized from 1 source
  • AI Infra入门干货总结:大模型是如何高效推理的 ("AI Infra Primer: How Large Models Run Inference Efficiently"). Primary source for this page.
