
The LLM Training Pipeline — From Pre-training to Agent Training

Updated 2026-04-13

Author: @HiTw93 (tw93)

Core thesis: By 2026, the real differentiation in LLMs comes not from pre-training, but from everything after it: post-training, evaluation, reward design, agent training, distillation, and harness engineering.

The pipeline as a whole

Modern LLM development is a six-layer pipeline:

  1. Raw data + system recipe
  2. Pre-training
  3. Post-training
  4. Evaluation / Grader / Reward
  5. Agent harness + training
  6. Deployment + feedback loops

Two feedback loops run continuously:

  • Production traffic → data engineering
  • Offline evaluation → pre-training

Pre-training: just the foundation

Pre-training determines:

  • Knowledge scope
  • Generalization potential
  • Pattern induction capability

What it does not determine:

  • Whether the model follows instructions
  • Whether it cooperates with users
  • Stability on critical tasks

Key design decisions locked in at this stage:

  • Tokenizer vocabulary (affects sequence length, downstream performance, multilingual ability, code/math efficiency)
  • Context window length (changes attention cost, batch size, training curriculum, parallelism strategy)
  • Multimodal vs text-only
  • Single-accelerator deployment constraints (e.g., Gemma 3)

Over-training: Llama 3 8B was trained on 15T tokens versus Chinchilla's ~200B-token recommendation (roughly 75× over-training). This trades extra training compute for a smaller, cheaper, more capable model at inference time.
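The arithmetic is worth making explicit (figures as cited above; the exact Chinchilla optimum depends on the compute budget, so ~200B is an approximation):

```python
trained_tokens = 15e12        # Llama 3 8B's reported pre-training budget
chinchilla_tokens = 200e9     # compute-optimal estimate cited above
ratio = trained_tokens / chinchilla_tokens
print(f"over-training factor: {ratio:.0f}x")  # → over-training factor: 75x
```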

Data recipe: capability engineering

Data engineering is not "more is better" — it is capability design.

What the model sees (web, code, books, forums) and in what proportions directly shapes its ability distribution.

Critical but often neglected:

  • Deduplication at document and line level
  • Contamination control (benchmark leakage)
  • Data mixing laws — the ratio of data types steers model capabilities
  • Synthetic data — now a standard part of the training pipeline
    • Self-Instruct for instruction data
    • DeepSeek-R1 distillation trajectories
    • Qwen / Kimi synthetic supervision

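A minimal sketch of exact-match deduplication at both levels (real pipelines use MinHash, suffix arrays, or Bloom filters for near-duplicate detection at scale; this is illustrative only):

```python
import hashlib

def line_dedup(doc: str, seen_lines: set) -> str:
    """Drop exact-duplicate lines already seen in the corpus
    (catches boilerplate: nav bars, cookie banners, license headers)."""
    kept = []
    for line in doc.splitlines():
        h = hashlib.md5(line.strip().encode()).hexdigest()
        if h not in seen_lines:
            seen_lines.add(h)
            kept.append(line)
    return "\n".join(kept)

def doc_fingerprint(doc: str) -> str:
    """Document-level exact-duplicate key: hash of whitespace/case-normalized text."""
    normalized = " ".join(doc.lower().split())
    return hashlib.sha1(normalized.encode()).hexdigest()
```

Contamination control works the same way in reverse: fingerprint benchmark test sets and filter any training document that matches.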
The virtuous cycle: stronger models generate higher-quality training data for the next generation.

Systems and architecture constraints

Training at scale is a distributed systems problem, not a single-machine deep learning problem.

Key constraints decided before training starts:

  • GPU count, memory bandwidth, parallelism strategy
  • MoE (Mixture of Experts) for cost-effective scale expansion
  • FP8 mixed precision (DeepSeek-V3 proved this at massive scale)
  • muP (maximal update parameterization) for hyperparameter transfer from small experiments
  • WSD learning rate schedule
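The WSD (warmup–stable–decay) schedule mentioned above can be sketched as follows; the phase fractions here are illustrative defaults, not a specific lab's recipe:

```python
def wsd_lr(step, total_steps, peak_lr, warmup_frac=0.01, decay_frac=0.1, min_lr=0.0):
    """Warmup-Stable-Decay: linear warmup, long constant plateau, short final decay.
    The plateau makes it cheap to branch off mid-run checkpoints and continue
    training without re-deriving a cosine horizon."""
    warmup_steps = int(total_steps * warmup_frac)
    decay_start = int(total_steps * (1 - decay_frac))
    if step < warmup_steps:
        return peak_lr * step / max(warmup_steps, 1)
    if step < decay_start:
        return peak_lr
    # linear anneal to min_lr over the final decay_frac of training
    progress = (step - decay_start) / max(total_steps - decay_start, 1)
    return peak_lr + (min_lr - peak_lr) * progress
```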

Training stability: thousands of GPUs running for weeks will encounter loss spikes, silent GPU errors, NVLink anomalies, communication jitter. The ability to detect, isolate, and recover quickly is a core lab-level engineering competency.

DeepSeek-V3 reported zero irrecoverable loss spikes and zero rollbacks across 14.8T tokens and 2.788M H800 GPU hours.

Post-training: where users feel the difference

Instruction tuning (SFT)

Teaches the model how to take tasks, organize output, and behave like a helpful assistant.

A 1.3B-parameter InstructGPT model outperformed the 175B GPT-3 on human preference evaluations — demonstrating that post-training can override two orders of magnitude in parameter count.

Alignment methods

Method   Approach
RLHF     Imitate high-quality answers (SFT), then reinforce against a reward model trained on preference comparisons
DPO      Direct Preference Optimization — optimize on preference pairs directly; no separate reward model needed
RFT      Productized interface: task definition + grader + reward signal
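The DPO objective in the table is compact enough to write out directly. A minimal sketch (a real implementation computes the log-probabilities from forward passes of the policy and a frozen reference model; the scalar inputs here are illustrative):

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO: push the policy to prefer the chosen answer over the rejected one,
    measured relative to a frozen reference model. No reward model, no RL loop."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))
```

When the policy matches the reference, the margin is zero and the loss sits at -log(0.5); widening the chosen-vs-rejected gap drives it down.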

DeepSeek-R1's four-stage pipeline

  1. Cold-start SFT — high-quality CoT data to stabilize before RL
  2. RL on verifiable domains — math, code, logic using GRPO (group relative policy optimization)
  3. Rejection sampling fine-tuning — turn RL success trajectories into new SFT data
  4. Alignment RL — incorporate helpfulness and safety preferences

GRPO vs PPO: GRPO samples multiple answers for the same prompt and uses intra-group ranking instead of a separate value network. Much lighter engineering burden — adopted by DeepSeek and Cursor Composer 2.
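The group-relative trick above amounts to a few lines: score K sampled answers per prompt and normalize within the group, so no value network is needed. A minimal sketch:

```python
def grpo_advantages(rewards):
    """Group-relative advantages: normalize each sampled answer's reward
    against its own group's mean and std — the group baseline replaces
    PPO's learned value network."""
    k = len(rewards)
    mean = sum(rewards) / k
    std = (sum((r - mean) ** 2 for r in rewards) / k) ** 0.5
    if std == 0:              # all answers scored the same: no learning signal
        return [0.0] * k
    return [(r - mean) / std for r in rewards]
```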

Eval, Grader, Reward — redefining the training target

"Users think they're comparing base model gaps, but the gap is often in how the objective is defined."

Grader pitfalls

  • Final-answer-only grading → model learns shortcuts
  • Coarse scoring → noise gets amplified by RL
  • Benchmark up, real task flat

Verified rewards

In math, code, and logic, programs can directly verify correctness. This shifts RLHF toward verified rewards.
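For code, "programs verify correctness" can mean literally executing the model's answer against held-out tests. A minimal sketch (the `solve` entry-point convention is an assumption for illustration; real systems execute in a sandbox, not with bare `exec`):

```python
def verified_reward(candidate_src: str, tests: list) -> float:
    """Execute a candidate solution against held-out test cases.
    Reward is binary: 1.0 only if every test passes — partial credit
    on visible tests invites shortcut solutions."""
    namespace = {}
    try:
        exec(candidate_src, namespace)   # WARNING: sandbox this in real systems
        fn = namespace["solve"]          # assumed entry-point name
        return 1.0 if all(fn(*args) == expected for args, expected in tests) else 0.0
    except Exception:
        return 0.0
```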

But new problems emerge:

  • Reward overfitting — gaming the grader without real capability gains
  • Mode collapse — output diversity collapses
  • Reward hacking / tampering — model manipulates the reward channel itself
  • Alignment faking — surface compliance with hidden misalignment

Anthropic's 2025 research showed models injected with reward-hack knowledge generalized to alignment faking in production coding RL environments.

ORM vs PRM

          Outcome Reward Model (ORM)    Process Reward Model (PRM)
Signal    Sparse (final answer only)    Dense (intermediate steps)
Cost      Low                           High (often 3–5×)
Best for  Getting started               Math, code, logic reasoning

Most real systems start with ORM and move to automated PRM (program verification of intermediate steps) where possible.
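Where intermediate steps are machine-checkable, the PRM-style dense signal can itself be automated. A toy sketch for chained arithmetic (real systems substitute theorem provers, unit tests, or type checkers for `eval`):

```python
def process_rewards(steps):
    """Dense per-step rewards: each step is an (expression, claimed_value) pair
    and a program re-evaluates the claim. Unlike an outcome reward, a wrong
    middle step is penalized even if later steps happen to recover."""
    rewards = []
    for expr, claimed in steps:
        try:
            rewards.append(1.0 if eval(expr) == claimed else 0.0)
        except Exception:
            rewards.append(0.0)
    return rewards
```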

Agent training: optimizing more than the model

By 2026, training targets have expanded:

  • Not just "answer correctly" but "act correctly over time"
  • Not just "think longer" but "allocate reasoning budget intelligently"
  • Not just "use tools" but "plan, call tools, receive feedback, maintain coherence across long tasks"

This brings the entire runtime stack into training:

  • Browser, terminal, search, execution sandbox
  • Memory systems, tool servers, orchestration framework
  • The harness itself — prompt construction, memory updates, retrieval policy, context editing, tool orchestration

"When training an agent, you're often debugging the model and debugging the environment."
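The runtime components listed above fit a single loop shape. A minimal sketch (the model, tool, and memory interfaces are hypothetical; the point is that every line of this harness is a potential optimization target):

```python
def agent_loop(model, tools, memory, task, max_steps=20):
    """One agent rollout: the harness builds context, the model picks an
    action, the environment returns an observation, and memory is updated.
    In agent training, the trajectory — not just the final answer — is scored."""
    trajectory = []
    for _ in range(max_steps):
        context = memory.build_prompt(task, trajectory)    # prompt construction + context pruning
        action = model.act(context)                        # tool call or final answer
        if action.name == "finish":
            return action.argument, trajectory
        observation = tools[action.name](action.argument)  # browser / terminal / search / sandbox
        trajectory.append((action, observation))
        memory.update(action, observation)                 # memory writes are policy too
    return None, trajectory                                # budget exhausted
```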

Notable engineering cases

  • Kimi K2.5 (PARL): trains only the orchestrator, not all sub-agents. Reward signals: task success + parallel decomposition + completion constraints. Early training weights parallel exploration heavily, then fades it so that spawning sub-agents does not become a shortcut.
  • Cursor Composer 2: real-time RL connecting long coding sessions back to training. Self-summarization is explicitly rewarded because summary drift poisons downstream context.
  • Chroma Context-1: trains prune_chunks as a policy, making context pruning part of the retrieval process.

Meta-Harness: optimizing harness code itself

A 2026 preprint showed that optimizing harness code (not model weights) can produce 6× performance gaps on the same base model.

Meta-Harness writes prior code, scores, and execution traces to the filesystem, then uses a proposer to grep, cat, diff, and iterate on harness failures. One discovered improvement: environment bootstrap — running a shell command before the agent loop to snapshot working directory, languages, package managers, and memory state into the first prompt.
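The environment-bootstrap idea can be sketched as a pre-loop snapshot injected into the first prompt (the exact commands Meta-Harness discovered are not specified in the text; the probes below are illustrative):

```python
import os
import platform
import shutil

def bootstrap_snapshot() -> str:
    """Run once before the agent loop: snapshot the working environment so
    the agent doesn't spend its first turns rediscovering basic facts."""
    candidates = ["git", "python3", "node", "cargo", "pip", "npm"]
    available = [t for t in candidates if shutil.which(t)]
    return "\n".join([
        f"cwd: {os.getcwd()}",
        f"platform: {platform.system()} {platform.machine()}",
        f"available tools: {', '.join(available) or 'none detected'}",
    ])
```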

"The optimization target has expanded from answers → trajectories → harness programs."

Continuous loops beyond release

Released models are just snapshots. The real product is the continuous loop:

  • Distillation: stronger models generate training data for smaller, specialized models (DeepSeek-R1-Distill, TranslateGemma)
  • Production feedback: Cursor Composer 2's real-time RL shows agent capabilities iterating on live traffic, not waiting for the next offline training cycle
  • Harness evolution: outer loops rewrite prompt construction, retrieval, and memory programs based on rollouts and logs

How to interpret a "suddenly stronger" model

When a model appears to leap in capability, ask three questions:

  1. Pre-training or post-training? Many user-perceived improvements (instruction following, tool use, style stability) come from post-training, not more pre-training data.
  2. Weights, reward/eval, or harness? In reasoning models and agents, the felt improvement often comes from evaluation design, reward signals, tool environment stability, retrieval, memory, context pruning, and checkpoint selection — not just the base model.
  3. What is the release optimizing for? Some versions chase higher ceilings; others compress cost, latency, and regression risk; still others specialize for specific scenarios. The shipped checkpoint is a product decision, not necessarily the strongest on the training curve.
