The LLM Training Pipeline — From Pre-training to Agent Training
Author: @HiTw93 (tw93)
Core thesis: By 2026, the real differentiation in LLMs comes not from pre-training, but from everything after it: post-training, evaluation, reward design, agent training, distillation, and harness engineering.
The pipeline as a whole
Modern LLM development is a six-layer pipeline:
- Raw data + system recipe
- Pre-training
- Post-training
- Evaluation / Grader / Reward
- Agent harness + training
- Deployment + feedback loops
Two feedback loops run continuously:
- Production traffic → data engineering
- Offline evaluation → pre-training
Pre-training: just the foundation
Pre-training determines:
- Knowledge scope
- Generalization potential
- Pattern induction capability
What it does not determine:
- Whether the model follows instructions
- Whether it cooperates with users
- Stability on critical tasks
Key design decisions locked in at this stage:
- Tokenizer vocabulary (affects sequence length, downstream performance, multilingual ability, code/math efficiency)
- Context window length (changes attention cost, batch size, training curriculum, parallelism strategy)
- Multimodal vs text-only
- Single-accelerator deployment constraints (e.g., Gemma 3)
Over-training: Llama 3 8B was trained on 15T tokens versus Chinchilla's ~200B compute-optimal recommendation (75× over-training). This trades extra training compute for a smaller, cheaper, more capable model at inference time.
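As a quick sanity check, the over-training ratio follows directly from the figures above:

```python
# Figures from the text: Llama 3 8B trained on 15T tokens,
# vs a ~200B-token Chinchilla-optimal budget for a model that size.
trained_tokens = 15e12
chinchilla_optimal = 200e9

over_training_ratio = trained_tokens / chinchilla_optimal
print(over_training_ratio)  # 75.0
```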
Data recipe: capability engineering
Data engineering is not "more is better" — it is capability design.
What the model sees (web, code, books, forums) and in what proportions directly shapes its ability distribution.
Critical but often neglected:
- Deduplication at document and line level
- Contamination control (benchmark leakage)
- Data mixing laws — the ratio of data types steers model capabilities
- Synthetic data — now part of the formal training pipeline:
- Self-Instruct for instruction data
- DeepSeek-R1 distillation trajectories
- Qwen / Kimi synthetic supervision
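As a minimal sketch of line-level deduplication, using exact hashing (production pipelines layer fuzzy/MinHash matching on top of this):

```python
import hashlib

def dedup_lines(documents):
    """Exact line-level dedup: keep the first occurrence of each
    normalized line across the corpus, drop later repeats."""
    seen = set()
    cleaned = []
    for doc in documents:
        kept = []
        for line in doc.splitlines():
            key = hashlib.sha1(line.strip().lower().encode()).hexdigest()
            if key not in seen:
                seen.add(key)
                kept.append(line)
        cleaned.append("\n".join(kept))
    return cleaned

docs = ["Hello world\nSite footer", "Unique text\nSite footer"]
print(dedup_lines(docs))  # the second "Site footer" line is dropped
```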
The virtuous cycle: stronger models generate higher-quality training data for the next generation.
Systems and architecture constraints
Training at scale is a distributed systems problem, not a single-machine deep learning problem.
Key constraints decided before training starts:
- GPU count, memory bandwidth, parallelism strategy
- MoE (Mixture of Experts) for cost-effective scale expansion
- FP8 mixed precision (DeepSeek-V3 proved this at massive scale)
- muP (maximal update parameterization) for hyperparameter transfer from small experiments
- WSD learning rate schedule
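A minimal sketch of the WSD (warmup-stable-decay) schedule; the warmup and decay fractions here are illustrative assumptions, not any lab's published values:

```python
def wsd_lr(step, total_steps, peak_lr=3e-4, min_lr=3e-5,
           warmup_frac=0.01, decay_frac=0.1):
    """Warmup-Stable-Decay: linear warmup, long constant plateau,
    short linear decay at the very end of training."""
    warmup_steps = max(int(total_steps * warmup_frac), 1)
    decay_start = int(total_steps * (1 - decay_frac))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    if step < decay_start:
        return peak_lr  # the long stable plateau
    frac = (step - decay_start) / max(total_steps - decay_start, 1)
    return peak_lr + frac * (min_lr - peak_lr)
```

The long plateau is what makes WSD convenient for continued pre-training: a checkpoint can be branched off the plateau and decayed at any point without restarting the schedule.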
Training stability: thousands of GPUs running for weeks will encounter loss spikes, silent GPU errors, NVLink anomalies, communication jitter. The ability to detect, isolate, and recover quickly is a core lab-level engineering competency.
DeepSeek-V3 reported zero irrecoverable loss spikes and zero rollbacks across 14.8T tokens and 2.788M H800 GPU hours.
Post-training: where users feel the difference
Instruction tuning (SFT)
Teaches the model how to take tasks, organize output, and behave like a helpful assistant.
A 1.3B-parameter InstructGPT was preferred by human raters over the 175B GPT-3 — demonstrating that post-training can outweigh two orders of magnitude in parameter count.
Alignment methods
| Method | Approach |
|---|---|
| RLHF | Imitate high-quality answers, then reinforce via preference comparison |
| DPO | Direct preference optimization — no separate reward model needed |
| RFT | Productized interface: task definition + grader + reward signal |
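The DPO objective from the table fits in a few lines; this is a single-pair sketch assuming you already have per-sequence log-probabilities from the policy and the frozen reference model:

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO: increase the policy's (chosen - rejected) log-prob margin
    relative to the reference model -- no separate reward model."""
    margin = ((logp_chosen - ref_logp_chosen)
              - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(beta * margin))
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

At zero margin the loss is log 2; it falls monotonically as the policy separates chosen from rejected beyond what the reference model already does.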
DeepSeek-R1's four-stage pipeline
- Cold-start SFT — high-quality CoT data to stabilize before RL
- RL on verifiable domains — math, code, logic using GRPO (group relative policy optimization)
- Rejection sampling fine-tuning — turn RL success trajectories into new SFT data
- Alignment RL — incorporate helpfulness and safety preferences
GRPO vs PPO: GRPO samples multiple answers for the same prompt and uses intra-group ranking instead of a separate value network. Much lighter engineering burden — adopted by DeepSeek and Cursor Composer 2.
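The core of GRPO's intra-group ranking is standardizing rewards within each sampled group, which is exactly what lets it skip the value network; a minimal sketch:

```python
def grpo_advantages(rewards):
    """GRPO baseline: standardize each sampled answer's reward against
    its own group's mean/std instead of a learned value function."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + 1e-8) for r in rewards]

# Four sampled answers to one prompt: two solved it, two did not.
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # solved answers get ~+1, failed ~-1
```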
Eval, Grader, Reward — redefining the training target
"Users think they're comparing base model gaps, but the gap is often in how the objective is defined."
Grader pitfalls
- Final-answer-only grading → model learns shortcuts
- Coarse scoring → noise gets amplified by RL
- Benchmark scores go up while real-task performance stays flat
Verified rewards
In math, code, and logic, programs can directly verify correctness. This shifts RLHF toward verified rewards.
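A verified reward needs no learned model at all; for final-answer math grading it can be as small as this (the answer extraction here is deliberately naive):

```python
def verified_math_reward(model_answer: str, target: float) -> float:
    """Binary verified reward: 1.0 iff the final token parses to the
    known-correct value. The grader is a program, not a model."""
    try:
        value = float(model_answer.strip().split()[-1])
    except (ValueError, IndexError):
        return 0.0
    return 1.0 if abs(value - target) < 1e-9 else 0.0

print(verified_math_reward("The answer is 42", 42.0))  # 1.0
```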
But new problems emerge:
- Reward overfitting — gaming the grader without real capability gains
- Mode collapse — output diversity collapses
- Reward hacking / tampering — model manipulates the reward channel itself
- Alignment faking — surface compliance with hidden misalignment
Anthropic's 2025 research showed models injected with reward-hack knowledge generalized to alignment faking in production coding RL environments.
ORM vs PRM
| | Outcome Reward Model (ORM) | Process Reward Model (PRM) |
|---|---|---|
| Signal | Sparse (final answer only) | Dense (intermediate steps) |
| Cost | Low | High (often 3–5×) |
| Best for | Getting started | Math, code, logic reasoning |
Most real systems start with ORM and move to automated PRM (program verification of intermediate steps) where possible.
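An automated PRM in the sense above replaces human step labels with program checks; a toy version for arithmetic chains (real systems use unit tests, proof checkers, or execution traces):

```python
def verify_arithmetic_steps(steps):
    """Dense process reward: score every intermediate 'expr = value'
    step by re-evaluating the expression, not just the final answer."""
    rewards = []
    for step in steps:
        expr, _, claimed = step.partition("=")
        try:
            ok = abs(eval(expr, {"__builtins__": {}}) - float(claimed)) < 1e-9
        except Exception:
            ok = False
        rewards.append(1.0 if ok else 0.0)
    return rewards

print(verify_arithmetic_steps(["2 + 3 = 5", "5 * 4 = 21"]))  # [1.0, 0.0]
```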
Agent training: optimizing more than the model
By 2026, training targets have expanded:
- Not just "answer correctly" but "act correctly over time"
- Not just "think longer" but "allocate reasoning budget intelligently"
- Not just "use tools" but "plan, call tools, receive feedback, maintain coherence across long tasks"
This brings the entire runtime stack into training:
- Browser, terminal, search, execution sandbox
- Memory systems, tool servers, orchestration framework
- The harness itself — prompt construction, memory updates, retrieval policy, context editing, tool orchestration
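Concretely, "the harness itself" is the loop around the model. A minimal hypothetical sketch (the `model` and `tools` interfaces here are assumptions, not any specific product's API):

```python
def agent_loop(model, tools, task, max_steps=10):
    """Minimal harness: the harness -- not the model -- owns prompt
    construction, tool dispatch, and the running context."""
    context = [f"Task: {task}"]
    for _ in range(max_steps):
        # The model proposes one action, e.g. {"tool": "search", "arg": "..."}
        action = model("\n".join(context))
        if action["tool"] == "finish":
            return action["arg"]
        observation = tools[action["tool"]](action["arg"])
        context.append(f"Called {action['tool']} -> {observation}")
    return None  # step budget exhausted
```

Every line of this loop (how context is assembled, how observations are recorded, when to stop) is a training-relevant design decision.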
"When training an agent, you're debugging the model and the environment at the same time."
Notable engineering cases
- Kimi K2.5 (PARL): trains only the orchestrator, not all sub-agents. Reward signals: task success + parallel decomposition + completion constraints. Early in training, parallel exploration is weighted heavily; the weight then decays so that spawning sub-agents doesn't become a reward shortcut.
- Cursor Composer 2: real-time RL connecting long coding sessions back to training. Self-summarization is explicitly rewarded because summary drift poisons downstream context.
- Chroma Context-1: trains `prune_chunks` as a policy, making context pruning part of the retrieval process.
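The `prune_chunks` name comes from Context-1, but its actual policy is learned; the interface can be pictured as a scored, budgeted filter (the scoring inputs and budget logic below are hypothetical):

```python
def prune_chunks(chunks, scores, budget):
    """Hypothetical pruning policy: keep the highest-scoring retrieved
    chunks until a (word-count) context budget is spent. In a trained
    system the scores would come from the policy, not be given."""
    ranked = sorted(zip(chunks, scores), key=lambda cs: cs[1], reverse=True)
    kept, used = [], 0
    for chunk, _ in ranked:
        cost = len(chunk.split())
        if used + cost <= budget:
            kept.append(chunk)
            used += cost
    return kept
```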
Meta-Harness: optimizing harness code itself
A 2026 preprint showed that optimizing harness code (not model weights) can produce 6× performance gaps on the same base model.
Meta-Harness writes prior code, scores, and execution traces to the filesystem, then uses a proposer to grep, cat, diff, and iterate on harness failures. One discovered improvement: environment bootstrap — running a shell command before the agent loop to snapshot working directory, languages, package managers, and memory state into the first prompt.
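The environment-bootstrap idea is easy to sketch: gather machine facts once, before the agent loop, and inject them into the first prompt. The specific fields chosen here are assumptions:

```python
import os
import platform
import shutil

def environment_bootstrap() -> str:
    """Snapshot the working environment as text the harness prepends
    to the agent's first prompt, so the model doesn't have to
    rediscover it with exploratory tool calls."""
    lines = [
        f"cwd: {os.getcwd()}",
        f"os: {platform.system()} {platform.release()}",
        f"python: {platform.python_version()}",
    ]
    for tool in ("git", "node", "cargo", "npm"):
        lines.append(f"{tool}: {shutil.which(tool) or 'not found'}")
    return "\n".join(lines)
```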
"The optimization target has expanded from answers → trajectories → harness programs."
Continuous loops beyond release
Released models are just snapshots. The real product is the continuous loop:
- Distillation: stronger models generate training data for smaller, specialized models (DeepSeek-R1-Distill, TranslateGemma)
- Production feedback: Cursor Composer 2's real-time RL shows agent capabilities iterating on live traffic, not waiting for the next offline training cycle
- Harness evolution: outer loops rewrite prompt construction, retrieval, and memory programs based on rollouts and logs
How to interpret a "suddenly stronger" model
When a model appears to leap in capability, ask three questions:
- Pre-training or post-training? Many user-perceived improvements (instruction following, tool use, style stability) come from post-training, not more pre-training data.
- Weights, reward/eval, or harness? In reasoning models and agents, the felt improvement often comes from evaluation design, reward signals, tool environment stability, retrieval, memory, context pruning, and checkpoint selection — not just the base model.
- What is the release optimizing for? Some versions chase higher ceilings; others compress cost, latency, and regression risk; still others specialize for specific scenarios. The shipped checkpoint is a product decision, not necessarily the strongest on the training curve.
Related
- harness-engineering/overview — Harness engineering overview
- harness-engineering/openai-frontier-symphony — OpenAI Frontier's agent-native development
- harness-engineering/fat-skills-fat-code-thin-harness — Fat skills, thin harness
- DeepSeek-R1
- claude-code/overview — Claude Code
Sources
- 你不知道的大模型训练:原理、路径与新实践 (What You Don't Know About LLM Training: Principles, Paths, and New Practices) — @HiTw93