Heuristic Learning
What it is
Heuristic Learning is a proposed learning loop where the object being updated is not neural-network weights, but a maintainable software system: rules, detectors, controllers, tests, logs, replays, memory, and regression cases. A coding agent consumes feedback and edits the heuristic system directly.
Jiayi Weng calls the maintained object a Heuristic System. It is more than a single policy.py; it is a program system with feedback channels, experiment records, replay artifacts, and mechanisms for compression or regression protection.
Why it matters
The key claim is that coding agents change the maintenance curve for heuristics. Expert systems and rule systems were historically brittle because humans could not afford to keep repairing them. If coding agents can continuously read failures, edit code, add tests, inspect replays, and simplify accumulated patches, some heuristic systems may become worth owning again.
This reframes continual learning for agents: it does not have to mean updating weights every time feedback arrives. Some online learning can happen by turning feedback into readable, testable, refactorable software artifacts.
Evidence across sources
| Source | Key Claim | Relevance |
|---|---|---|
| Jiayi Weng (original paper) | Codex-maintained heuristic policies reach high scores in Breakout, Ant, HalfCheetah, and Atari57 without training new neural-network weights | Foundation: the learning object is code, not weights |
| AI Briefing 2026-05-09 | GPT-5.4 writes Python policies for Atari Breakout, iterates from 387 to 864 (perfect score); MuJoCo Ant exceeds 6000; full Atari57 approaches PPO baseline | Second-source validation with specific scores and explicit boundary conditions |
Specific results from the second source:
- Atari Breakout: 387 → 864 (perfect) through iterative code refinement by GPT-5.4
- MuJoCo Ant: exceeds 6000 points (performance on the level of deep RL)
- Atari57 full suite: approaches PPO baseline
- Knowledge is stored in readable code with ball-path predictors, stuck-ball detectors, regression tests, and experiment logs
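To make "knowledge stored in readable code" concrete, here is a minimal sketch of what a ball-path predictor and a stuck-ball detector might look like. It is not taken from the source policies. The function and class names, the coordinate convention (y grows toward the paddle, side walls at 0 and `width`), and the "no reward for `patience` steps" definition of a stuck ball are all assumptions made for illustration.

```python
from typing import Optional


def predict_landing_x(x: float, y: float, vx: float, vy: float,
                      paddle_y: float, width: float) -> Optional[float]:
    """Predict where the ball crosses the paddle row (hypothetical helper).

    Assumes straight-line motion, perfectly elastic side walls at x=0 and
    x=width, and y increasing toward the paddle. Returns None while the
    ball is moving away from the paddle.
    """
    if vy <= 0:
        return None
    time_to_paddle = (paddle_y - y) / vy
    unfolded_x = x + vx * time_to_paddle
    # Reflections off two parallel walls are equivalent to folding the
    # unbounded trajectory into a window of period 2 * width.
    folded = unfolded_x % (2 * width)
    return folded if folded <= width else 2 * width - folded


class StuckBallDetector:
    """Flags a possible stuck ball when no reward arrives for `patience` steps."""

    def __init__(self, patience: int) -> None:
        self.patience = patience
        self.steps_without_reward = 0

    def update(self, reward: float) -> bool:
        if reward > 0:
            self.steps_without_reward = 0
        else:
            self.steps_without_reward += 1
        return self.steps_without_reward >= self.patience
```

Each piece is small enough to read, test, and replace on its own, which is the property the approach relies on.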
Preservation mechanisms: Old capabilities can be preserved as regression tests, fixed-seed replays, golden traces, failure videos, version diffs, and written-down failed directions.
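A fixed-seed replay with a golden trace could be sketched as follows. The toy environment, the policy, and every name here (`run_episode`, `record_golden`, `check_against_golden`) are hypothetical stand-ins, not the harness used in the sources. The point is only the pattern: record the behavior once under a fixed seed, then fail loudly if a later edit changes it.

```python
import random
from typing import Callable, Dict, List, Tuple

Policy = Callable[[float], int]


def toy_policy(observation: float) -> int:
    """Hypothetical heuristic: act (1) on low observations, otherwise wait (0)."""
    return 1 if observation < 0.5 else 0


def run_episode(policy: Policy, seed: int, steps: int) -> Tuple[int, List[int]]:
    """Run a toy episode whose randomness is fully determined by `seed`."""
    rng = random.Random(seed)
    score, trace = 0, []
    for _ in range(steps):
        observation = rng.random()
        action = policy(observation)
        correct_action = 1 if observation < 0.5 else 0
        score += 1 if action == correct_action else 0
        trace.append(action)
    return score, trace


def record_golden(policy: Policy, seed: int, steps: int = 20) -> Dict:
    """Capture the current behavior as a golden artifact."""
    score, trace = run_episode(policy, seed, steps)
    return {"seed": seed, "steps": steps, "min_score": score, "trace": trace}


def check_against_golden(policy: Policy, golden: Dict) -> List[str]:
    """Replay the golden seed and list any regressions."""
    score, trace = run_episode(policy, golden["seed"], golden["steps"])
    problems = []
    if score < golden["min_score"]:
        problems.append(f"score dropped: {score} < {golden['min_score']}")
    if trace != golden["trace"]:
        problems.append("action trace diverged from golden trace")
    return problems
```

An exact trace comparison is strict and will also flag harmless behavior changes. A real harness might keep the score floor as a hard check and treat trace divergence as a prompt for review.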
Failure mode: A heuristic system that only grows and never compresses becomes a large, coupled system that neither humans nor agents can maintain.
Core loop
environment feedback / test failure / log anomaly
-> coding agent reads context
-> edits policy / test / memory
-> reruns
-> writes results back into trials and summaries
-> compresses local patches into maintainable structure
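The loop above can be sketched in a few lines, under heavy simplification. Here the "heuristic system" is a single threshold, the test cases stand in for environment feedback, and `stub_agent` is a hypothetical placeholder for a real coding agent. All names are invented for illustration, and the final compression step is not modeled.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Cases = Dict[str, Tuple[float, int]]  # name -> (observation, expected action)


@dataclass
class Trial:
    iteration: int
    threshold: float
    failures: List[str]
    accepted: bool


def evaluate(threshold: float, cases: Cases) -> List[str]:
    """Feedback step: return the names of failing cases."""
    failures = []
    for name, (observation, expected) in cases.items():
        action = 1 if observation < threshold else 0
        if action != expected:
            failures.append(name)
    return failures


def stub_agent(threshold: float, failures: List[str], cases: Cases) -> float:
    """Placeholder for a coding agent: patch the rule to fix the first failure."""
    observation, expected = cases[failures[0]]
    return observation + 0.01 if expected == 1 else observation


def heuristic_learning_loop(threshold: float, cases: Cases,
                            agent: Callable[[float, List[str], Cases], float],
                            max_iterations: int = 10) -> Tuple[float, List[Trial]]:
    trials: List[Trial] = []
    failures = evaluate(threshold, cases)
    for iteration in range(max_iterations):
        if not failures:
            break
        candidate = agent(threshold, failures, cases)   # agent edits the policy
        candidate_failures = evaluate(candidate, cases)  # rerun
        accepted = len(candidate_failures) < len(failures)
        # Write the result back into the trial log, accepted or not.
        trials.append(Trial(iteration, candidate, candidate_failures, accepted))
        if not accepted:
            break  # the stub has no other idea to try
        threshold, failures = candidate, candidate_failures
    return threshold, trials
```

An edit is accepted only if it strictly reduces the number of failing cases, and rejected edits are still logged, which mirrors the idea of writing down failed directions.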
Boundary conditions
- Heuristic Learning is strongest where feedback is clear, states are inspectable, and failures can be reproduced.
- It is weaker in domains where perception and long-horizon generalization cannot be expressed cleanly in code.
- Explicit boundary from practitioner evidence: pure code cannot handle complex perception tasks (for example, image recognition cannot be written as if-else rules). The proposed endgame is a hybrid architecture: lightweight neural networks for perception, heuristic learning for real-time logic and safety rules, and large models for log review and periodic updates.
- It does not remove forgetting; it turns forgetting into an engineering problem around regression tests, replay coverage, state reproducibility, coupling, and refactoring.
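The hybrid split described above might be layered roughly as follows. This is a sketch under stated assumptions, not an architecture from the sources. The `Percept` fields, the `decide` rule, the 1.0 distance threshold, and the log format are all hypothetical. The perception function is passed in as a plain callable standing in for a lightweight neural network.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Percept:
    obstacle_distance: float  # hypothetical output of the perception model


def decide(percept: Percept, min_safe_distance: float = 1.0) -> str:
    """Readable heuristic layer: the safety rule runs before task logic."""
    if percept.obstacle_distance < min_safe_distance:
        return "stop"
    return "advance"


def control_step(raw_input: object,
                 perceive: Callable[[object], Percept],
                 log: List[Dict]) -> str:
    """One real-time step: learned perception, coded decision, logged outcome."""
    percept = perceive(raw_input)
    action = decide(percept)
    # The log is what a large model would later review to propose rule edits.
    log.append({"distance": percept.obstacle_distance, "action": action})
    return action
```

Keeping `decide` free of any model call means the safety rule stays inspectable and can be covered by ordinary regression tests.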
Open questions
- When should a heuristic remain code, and when should its experience become training data for a neural network?
- How can a coding agent measure whether a heuristic system has become too coupled to maintain?
- Can Heuristic Learning become a general layer inside robotics, or is it mostly useful for environments with strong simulators and reproducible tests?
- What is the minimum harness needed to run Heuristic Learning loops safely without supervision?