title: Self-Improving Skills — Eval Loops and Memory
section: claude-code
page_kind: concept
sources: 1
status: candidate
knowledge_status: ai_draft
source_type: article
judgment_owner: ai
updated: 2026-06-03
Self-Improving Skills
What it is
A methodology for building Claude skills that catch their own mistakes and improve over time through structured feedback loops. The approach, developed by Peter Yang, combines three mechanisms: (1) example-driven skill drafting, (2) pass/fail evaluation loops, and (3) conversational memory logs. The goal is to encode personal taste and quality standards into reusable agent instructions that get better with each use.
Why it matters
Most skills are static: they execute the same instructions regardless of past performance. Self-improving skills add a feedback layer that mirrors how people learn: recognizing errors, applying corrections, and remembering what worked. This turns skills from one-off scripts into compounding knowledge assets.
Key points
Five-step construction process
- Give the agent personal context and best-in-class examples: Separate example files (e.g., example-tutorial.md, example-personal.md) keep the main skill.md lean while giving the agent grounded reference material. The agent loads only the relevant example based on the draft type.
- Explicit trigger conditions in the description: The skill name and description are what the agent reads first to decide whether to load the full skill. "Use when..." instructions must be precise so the agent triggers the skill automatically at the right moments.
- Evals.md with pass/fail checks: After manual testing, the agent generates an evals.md with 10 pass/fail checks across categories like Introduction (does it hook?), Voice (is there AI slop?), Substance (practical insight?), and CTA (clear next steps). Pass/fail is preferred over scoring (4/5 vs 3/5) because agents cannot reliably distinguish adjacent scores.
- Memory.md for ongoing learning: A reverse-chronological log of lessons from past chats. This captures feedback that does not fit clean pass/fail checks, such as "make the voice more authentic." The skill.md references both evals.md and memory.md.
- Skill-editor skill for maintenance: A meta-skill that reviews all other skills for conciseness, strips AI slop (em dashes, "X, not Y" phrasing, duplicate instructions), and keeps the skill library clean.
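Some of the Voice checks are mechanical enough to run without an agent. The sketch below is a minimal illustration, not the author's implementation: the `Check` dataclass, the `run_checks` helper, and the regular expression used to spot "X, not Y" phrasing are all assumptions made for this example. Subjective checks, such as whether the introduction hooks the reader, would still need an agent grader.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Check:
    """One pass/fail eval. Hypothetical structure, not a format from the source."""
    name: str
    category: str
    passes: Callable[[str], bool]


EM_DASH = "\u2014"
# Assumed heuristic: a word, a comma, then "not" followed by another word.
X_NOT_Y = re.compile(r"\b\w+, not \w+", re.IGNORECASE)

CHECKS = [
    Check("no_em_dash", "Voice", lambda text: EM_DASH not in text),
    Check("no_x_not_y", "Voice", lambda text: X_NOT_Y.search(text) is None),
]


def run_checks(text: str) -> dict[str, bool]:
    """Return a pass/fail verdict per check; there are no partial scores."""
    return {check.name: check.passes(text) for check in CHECKS}
```

Returning one boolean per check keeps the grader's job binary, which matches the stated reason for avoiding 4/5 versus 3/5 scores.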
Eval loop mechanics
- One agent edits the post using the skill.
- A second agent, spawned with a clean context window, grades the output against evals.md.
- The clean context prevents the grader from being biased by the first agent's work.
- If any eval fails, the two agents iterate until all pass.
- In the author's experience, a newsletter draft took five rounds between the two agents to pass all evals.
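The control flow of this loop can be sketched in a few lines. This is an illustration under stated assumptions: `eval_loop`, the `Editor` and `Grader` callables, and the `max_rounds` cap are hypothetical names, and the stub editor and grader stand in for real agent calls. The source only says the agents iterate until all evals pass; the round cap is added here as a safeguard against a loop that never converges.

```python
from typing import Callable

# (draft, failed check names) -> revised draft
Editor = Callable[[str, list[str]], str]
# draft -> names of failed checks; receives only the draft, which stands in
# for the grader's clean context window
Grader = Callable[[str], list[str]]


def eval_loop(draft: str, edit: Editor, grade: Grader,
              max_rounds: int = 10) -> tuple[str, int]:
    """Alternate editing and grading until no check fails."""
    failures: list[str] = []
    for round_number in range(1, max_rounds + 1):
        draft = edit(draft, failures)
        failures = grade(draft)
        if not failures:
            return draft, round_number
    raise RuntimeError(f"still failing after {max_rounds} rounds: {failures}")


def stub_edit(draft: str, failures: list[str]) -> str:
    """Toy editor: only fixes what the grader reported last round."""
    if "no_em_dash" in failures:
        return draft.replace(" \u2014 ", ", ")
    return draft


def stub_grade(draft: str) -> list[str]:
    """Toy grader: a single mechanical check."""
    return ["no_em_dash"] if "\u2014" in draft else []


final_draft, rounds = eval_loop("Fast \u2014 and simple.", stub_edit, stub_grade)
```

Passing the grader nothing but the draft is the point of the design: it cannot inherit the editor's reasoning, only judge the result.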
Skill folder structure
edit-post/
├ skill.md
├ example-tutorial.md
├ example-personal.md
├ example-product.md
├ evals.md
└ memory.md
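A layout like this is easy to scaffold and check with a short script. The sketch below assumes the file names shown above; the helper names `scaffold_skill` and `missing_files` are hypothetical and not part of any Claude tooling.

```python
from pathlib import Path

# File names taken from the folder structure above.
SKILL_FILES = [
    "skill.md",
    "example-tutorial.md",
    "example-personal.md",
    "example-product.md",
    "evals.md",
    "memory.md",
]


def scaffold_skill(root: Path, name: str = "edit-post") -> Path:
    """Create the skill folder with empty files, never overwriting content."""
    skill_dir = root / name
    skill_dir.mkdir(parents=True, exist_ok=True)
    for filename in SKILL_FILES:
        path = skill_dir / filename
        if not path.exists():
            path.touch()
    return skill_dir


def missing_files(skill_dir: Path) -> list[str]:
    """List expected files that are absent from an existing skill folder."""
    return [name for name in SKILL_FILES if not (skill_dir / name).is_file()]
```

Running `missing_files` over every folder in a skill library is one cheap way to spot skills that have not yet been given evals or a memory log.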
Why this works
- Evals improve output quality: The pass/fail loop stripped every em dash and "X, not Y" phrase from the author's drafts.
- Memory improves the skill itself: Feedback that is too nuanced for evals ("more authentic voice") accumulates in memory.md and shapes future skill behavior.
- Separation of concerns: Examples, evals, and memory are separate files so the skill.md stays focused on what to do, not how it was learned.
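The reverse-chronological memory.md log could be maintained with a small helper that always writes the newest lesson first. This is a minimal sketch: the function name `add_lesson` and the dated bullet format are assumptions, since the source does not specify how entries are laid out.

```python
from datetime import date
from pathlib import Path


def add_lesson(memory_path: Path, lesson: str, on: date) -> None:
    """Prepend a dated lesson so the newest entry is always at the top."""
    # Assumed entry format: "- YYYY-MM-DD: lesson text"
    entry = f"- {on.isoformat()}: {lesson.strip()}\n"
    if memory_path.exists():
        existing = memory_path.read_text(encoding="utf-8")
    else:
        existing = ""
    memory_path.write_text(entry + existing, encoding="utf-8")
```

Keeping the newest entries at the top means that an agent reading only the start of the file still sees the most recent feedback.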
Evidence across sources
| Source | Key Claim | Relevance |
|---|---|---|
| Full Tutorial — Peter Yang | 5-step skill construction with eval loop and memory.md | Primary methodology |
| Building Skills Best Practices | Anthropic official skill design patterns | Foundation for skill structure |
| Skill Engineering as Algorithm Design | Deterministic layer + probabilistic decision engine | Architectural theory for eval loops |
Open questions
- How does the eval loop scale to skills with subjective output (design, creative writing) versus objective output (code, data validation)?
- At what point does memory.md become too large and start degrading the skill's performance?
- Can the eval loop be generalized to non-Claude agents (Codex, OpenClaw, Hermes) or is it Claude-specific?
- Does the five-round iteration cost (agent time + tokens) justify the quality improvement for all skill types, or only high-stakes ones?
Prompts for witness
- Which of your current skills would benefit most from an eval loop? What would the 3-5 pass/fail checks be?
- What feedback do you give agents repeatedly that could be captured in a memory.md instead?
- The article warns that "relying on AI to write skills could lead to a mess of slop." How do you balance skill automation with quality control?
Related
- claude-code/building-skills-best-practices — Official Anthropic skill design patterns
- harness-engineering/skill-engineering-as-algorithm — Algorithmic approach to skill architecture
- harness-engineering/compound-engineering — The compound loop that includes skill improvement
- claude-code/skills-collection — Community skill examples and patterns