
The LLM Training Pipeline — From Pre-training to Agent Training

Updated 2026-04-13

Author: @HiTw93 (tw93)

Core thesis: By 2026, the real differentiation in LLMs comes not from pre-training, but from everything after it: post-training, evaluation, reward design, agent training, distillation, and harness engineering.

The pipeline as a whole

Modern LLM development is a six-layer pipeline:

  1. Raw data + system recipe
  2. Pre-training
  3. Post-training
  4. Evaluation / Grader / Reward
  5. Agent harness + training
  6. Deployment + feedback loops

Two feedback loops run continuously:

  • Production traffic → data engineering
  • Offline evaluation → pre-training

Pre-training: just the foundation

Pre-training determines:

  • Knowledge scope
  • Generalization potential
  • Pattern induction capability

What it does not determine:

  • Whether the model follows instructions
  • Whether it cooperates with users
  • Stability on critical tasks

Key design decisions locked in at this stage:

  • Tokenizer vocabulary (affects sequence length, downstream performance, multilingual ability, code/math efficiency)
  • Context window length (changes attention cost, batch size, training curriculum, parallelism strategy)
  • Multimodal vs text-only
  • Single-accelerator deployment constraints (e.g., Gemma 3)

Over-training: Llama 3 8B was trained on 15T tokens versus Chinchilla's ~200B-token recommendation (roughly 75× over-training). This trades extra training compute for a smaller, cheaper, more capable model at inference time.
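The arithmetic is worth making explicit (figures as cited above; the exact Chinchilla optimum depends on the compute budget, so ~200B is an approximation):

```python
trained_tokens = 15e12        # Llama 3 8B's reported pre-training budget
chinchilla_tokens = 200e9     # compute-optimal estimate cited above
ratio = trained_tokens / chinchilla_tokens
print(f"over-training factor: {ratio:.0f}x")  # → over-training factor: 75x
```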

Data recipe: capability engineering

Data engineering is not "more is better" — it is capability design.

What the model sees (web, code, books, forums) and in what proportions directly shapes its ability distribution.

Critical but often neglected:

  • Deduplication at document and line level
  • Contamination control (benchmark leakage)
  • Data mixing laws — the ratio of data types steers model capabilities
  • Synthetic data — now a standard part of the training pipeline
    • Self-Instruct for instruction data
    • DeepSeek-R1 distillation trajectories
    • Qwen / Kimi synthetic supervision

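A minimal sketch of exact-match deduplication at both levels (real pipelines use MinHash, suffix arrays, or Bloom filters for near-duplicate detection at scale; this is illustrative only):

```python
import hashlib

def line_dedup(doc: str, seen_lines: set) -> str:
    """Drop exact-duplicate lines already seen in the corpus
    (catches boilerplate: nav bars, cookie banners, license headers)."""
    kept = []
    for line in doc.splitlines():
        h = hashlib.md5(line.strip().encode()).hexdigest()
        if h not in seen_lines:
            seen_lines.add(h)
            kept.append(line)
    return "\n".join(kept)

def doc_fingerprint(doc: str) -> str:
    """Document-level exact-duplicate key: hash of whitespace/case-normalized text."""
    normalized = " ".join(doc.lower().split())
    return hashlib.sha1(normalized.encode()).hexdigest()
```

Contamination control works the same way in reverse: fingerprint benchmark test sets and filter any training document that matches.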
The virtuous cycle: stronger models generate higher-quality training data for the next generation.

Systems and architecture constraints

Training at scale is a distributed systems problem, not a single-machine deep learning problem.

Key constraints decided before training starts:

  • GPU count, memory bandwidth, parallelism strategy
  • MoE (Mixture of Experts) for cost-effective scale expansion
  • FP8 mixed precision (DeepSeek-V3 proved this at massive scale)
  • muP (maximal update parameterization) for hyperparameter transfer from small experiments
  • WSD learning rate schedule
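The WSD (warmup–stable–decay) schedule mentioned above can be sketched as follows; the phase fractions here are illustrative defaults, not a specific lab's recipe:

```python
def wsd_lr(step, total_steps, peak_lr, warmup_frac=0.01, decay_frac=0.1, min_lr=0.0):
    """Warmup-Stable-Decay: linear warmup, long constant plateau, short final decay.
    The plateau makes it cheap to branch off mid-run checkpoints and continue
    training without re-deriving a cosine horizon."""
    warmup_steps = int(total_steps * warmup_frac)
    decay_start = int(total_steps * (1 - decay_frac))
    if step < warmup_steps:
        return peak_lr * step / max(warmup_steps, 1)
    if step < decay_start:
        return peak_lr
    # linear anneal to min_lr over the final decay_frac of training
    progress = (step - decay_start) / max(total_steps - decay_start, 1)
    return peak_lr + (min_lr - peak_lr) * progress
```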

Training stability: thousands of GPUs running for weeks will encounter loss spikes, silent GPU errors, NVLink anomalies, communication jitter. The ability to detect, isolate, and recover quickly is a core lab-level engineering competency.

DeepSeek-V3 reported zero irrecoverable loss spikes and zero rollbacks across 14.8T tokens and 2.788M H800 GPU hours.

Post-training: where users feel the difference

Instruction tuning (SFT)

Teaches the model how to take tasks, organize output, and behave like a helpful assistant.

A 1.3B-parameter InstructGPT model outperformed the 175B GPT-3 on human preference evaluations — demonstrating that post-training can override two orders of magnitude in parameter count.

Alignment methods

Method   Approach
RLHF     Imitate high-quality answers (SFT), then reinforce against a reward model trained on preference comparisons
DPO      Direct Preference Optimization — optimize on preference pairs directly; no separate reward model needed
RFT      Productized interface: task definition + grader + reward signal
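The DPO objective in the table is compact enough to write out directly. A minimal sketch (a real implementation computes the log-probabilities from forward passes of the policy and a frozen reference model; the scalar inputs here are illustrative):

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO: push the policy to prefer the chosen answer over the rejected one,
    measured relative to a frozen reference model. No reward model, no RL loop."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))
```

When the policy matches the reference, the margin is zero and the loss sits at -log(0.5); widening the chosen-vs-rejected gap drives it down.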

DeepSeek-R1's four-stage pipeline

  1. Cold-start SFT — high-quality CoT data to stabilize before RL
  2. RL on verifiable domains — math, code, logic using GRPO (group relative policy optimization)
  3. Rejection sampling fine-tuning — turn RL success trajectories into new SFT data
  4. Alignment RL — incorporate helpfulness and safety preferences

GRPO vs PPO: GRPO samples multiple answers for the same prompt and uses intra-group ranking instead of a separate value network. Much lighter engineering burden — adopted by DeepSeek and Cursor Composer 2.
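The group-relative trick above amounts to a few lines: score K sampled answers per prompt and normalize within the group, so no value network is needed. A minimal sketch:

```python
def grpo_advantages(rewards):
    """Group-relative advantages: normalize each sampled answer's reward
    against its own group's mean and std — the group baseline replaces
    PPO's learned value network."""
    k = len(rewards)
    mean = sum(rewards) / k
    std = (sum((r - mean) ** 2 for r in rewards) / k) ** 0.5
    if std == 0:              # all answers scored the same: no learning signal
        return [0.0] * k
    return [(r - mean) / std for r in rewards]
```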

Eval, Grader, Reward — redefining the training target

"Users think they're comparing base model gaps, but the gap is often in how the objective is defined."

Grader pitfalls

  • Final-answer-only grading → model learns shortcuts
  • Coarse scoring → noise gets amplified by RL
  • Benchmark up, real task flat

Verified rewards

In math, code, and logic, programs can directly verify correctness. This shifts RLHF toward verified rewards.
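For code, "programs verify correctness" can mean literally executing the model's answer against held-out tests. A minimal sketch (the `solve` entry-point convention is an assumption for illustration; real systems execute in a sandbox, not with bare `exec`):

```python
def verified_reward(candidate_src: str, tests: list) -> float:
    """Execute a candidate solution against held-out test cases.
    Reward is binary: 1.0 only if every test passes — partial credit
    on visible tests invites shortcut solutions."""
    namespace = {}
    try:
        exec(candidate_src, namespace)   # WARNING: sandbox this in real systems
        fn = namespace["solve"]          # assumed entry-point name
        return 1.0 if all(fn(*args) == expected for args, expected in tests) else 0.0
    except Exception:
        return 0.0
```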

But new problems emerge:

  • Reward overfitting — gaming the grader without real capability gains
  • Mode collapse — output diversity collapses
  • Reward hacking / tampering — model manipulates the reward channel itself
  • Alignment faking — surface compliance with hidden misalignment

Anthropic's 2025 research showed models injected with reward-hack knowledge generalized to alignment faking in production coding RL environments.

ORM vs PRM

          Outcome Reward Model (ORM)    Process Reward Model (PRM)
Signal    Sparse (final answer only)    Dense (intermediate steps)
Cost      Low                           High (often 3–5×)
Best for  Getting started               Math, code, logic reasoning

Most real systems start with ORM and move to automated PRM (program verification of intermediate steps) where possible.
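Where intermediate steps are machine-checkable, the PRM-style dense signal can itself be automated. A toy sketch for chained arithmetic (real systems substitute theorem provers, unit tests, or type checkers for `eval`):

```python
def process_rewards(steps):
    """Dense per-step rewards: each step is an (expression, claimed_value) pair
    and a program re-evaluates the claim. Unlike an outcome reward, a wrong
    middle step is penalized even if later steps happen to recover."""
    rewards = []
    for expr, claimed in steps:
        try:
            rewards.append(1.0 if eval(expr) == claimed else 0.0)
        except Exception:
            rewards.append(0.0)
    return rewards
```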

Agent training: optimizing more than the model

By 2026, training targets have expanded:

  • Not just "answer correctly" but "act correctly over time"
  • Not just "think longer" but "allocate reasoning budget intelligently"
  • Not just "use tools" but "plan, call tools, receive feedback, maintain coherence across long tasks"

This brings the entire runtime stack into training:

  • Browser, terminal, search, execution sandbox
  • Memory systems, tool servers, orchestration framework
  • The harness itself — prompt construction, memory updates, retrieval policy, context editing, tool orchestration

"When training an agent, you're often debugging the model and debugging the environment."
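The runtime components listed above fit a single loop shape. A minimal sketch (the model, tool, and memory interfaces are hypothetical; the point is that every line of this harness is a potential optimization target):

```python
def agent_loop(model, tools, memory, task, max_steps=20):
    """One agent rollout: the harness builds context, the model picks an
    action, the environment returns an observation, and memory is updated.
    In agent training, the trajectory — not just the final answer — is scored."""
    trajectory = []
    for _ in range(max_steps):
        context = memory.build_prompt(task, trajectory)    # prompt construction + context pruning
        action = model.act(context)                        # tool call or final answer
        if action.name == "finish":
            return action.argument, trajectory
        observation = tools[action.name](action.argument)  # browser / terminal / search / sandbox
        trajectory.append((action, observation))
        memory.update(action, observation)                 # memory writes are policy too
    return None, trajectory                                # budget exhausted
```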

Notable engineering cases

  • Kimi K2.5 (PARL): trains only the orchestrator, not all sub-agents. Reward signals: task success + parallel decomposition + completion constraints. Early training weights parallel exploration heavily, then fades it so that spawning sub-agents does not become a shortcut.
  • Cursor Composer 2: real-time RL connecting long coding sessions back to training. Self-summarization is explicitly rewarded because summary drift poisons downstream context.
  • Chroma Context-1: trains prune_chunks as a policy, making context pruning part of the retrieval process.

Meta-Harness: optimizing harness code itself

A 2026 preprint showed that optimizing harness code (not model weights) can produce 6× performance gaps on the same base model.

Meta-Harness writes prior code, scores, and execution traces to the filesystem, then uses a proposer to grep, cat, diff, and iterate on harness failures. One discovered improvement: environment bootstrap — running a shell command before the agent loop to snapshot working directory, languages, package managers, and memory state into the first prompt.
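The environment-bootstrap idea can be sketched as a pre-loop snapshot injected into the first prompt (the exact commands Meta-Harness discovered are not specified in the text; the probes below are illustrative):

```python
import os
import platform
import shutil

def bootstrap_snapshot() -> str:
    """Run once before the agent loop: snapshot the working environment so
    the agent doesn't spend its first turns rediscovering basic facts."""
    candidates = ["git", "python3", "node", "cargo", "pip", "npm"]
    available = [t for t in candidates if shutil.which(t)]
    return "\n".join([
        f"cwd: {os.getcwd()}",
        f"platform: {platform.system()} {platform.machine()}",
        f"available tools: {', '.join(available) or 'none detected'}",
    ])
```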

"The optimization target has expanded from answers → trajectories → harness programs."

Continuous loops beyond release

Released models are just snapshots. The real product is the continuous loop:

  • Distillation: stronger models generate training data for smaller, specialized models (DeepSeek-R1-Distill, TranslateGemma)
  • Production feedback: Cursor Composer 2's real-time RL shows agent capabilities iterating on live traffic, not waiting for the next offline training cycle
  • Harness evolution: outer loops rewrite prompt construction, retrieval, and memory programs based on rollouts and logs

How to interpret a "suddenly stronger" model

When a model appears to leap in capability, ask three questions:

  1. Pre-training or post-training? Many user-perceived improvements (instruction following, tool use, style stability) come from post-training, not more pre-training data.
  2. Weights, reward/eval, or harness? In reasoning models and agents, the felt improvement often comes from evaluation design, reward signals, tool environment stability, retrieval, memory, context pruning, and checkpoint selection — not just the base model.
  3. What is the release optimizing for? Some versions chase higher ceilings; others compress cost, latency, and regression risk; still others specialize for specific scenarios. The shipped checkpoint is a product decision, not necessarily the strongest on the training curve.
