title: Self-Improving Skills — Eval Loops and Memory
title: Self-Improving Skills
section: claude-code
page_kind: concept
sources: 1
status: candidate
knowledge_status: ai_draft
source_type: article
judgment_owner: ai
updated: 2026-06-03

Self-Improving Skills

What it is

A methodology for building Claude skills that catch their own mistakes and improve over time through structured feedback loops. The approach, developed by Peter Yang, combines three mechanisms: (1) example-driven skill drafting, (2) pass/fail evaluation loops, and (3) conversational memory logs. The goal is to encode personal taste and quality standards into reusable agent instructions that get better with each use.

Why it matters

Most skills are static: they execute the same instructions regardless of past performance. Self-improving skills add a feedback layer that mirrors how humans actually learn — by recognizing errors, applying corrections, and remembering what worked. This turns skills from one-off scripts into compounding knowledge assets.

Key points

Five-step construction process

1. Give the agent personal context and best-in-class examples: Separate example files (e.g., example-tutorial.md, example-personal.md) keep the main skill.md lean while giving the agent grounded reference material. The agent loads only the relevant example based on the draft type.
2. Explicit trigger conditions in the description: The skill name and description are what the agent reads first to decide whether to load the full skill. "Use when..." instructions must be precise so the agent triggers the skill automatically at the right moments.
3. Evals.md with pass/fail checks: After manual testing, the agent generates an evals.md with 10 pass/fail checks across categories like Introduction (does it hook?), Voice (is there AI slop?), Substance (practical insight?), and CTA (clear next steps). Pass/fail is preferred over scoring (4/5 vs 3/5) because agents cannot reliably distinguish adjacent scores.
4. Memory.md for ongoing learning: A reverse-chronological log of lessons from past chats. This captures feedback that does not fit clean pass/fail checks, such as "make the voice more authentic." The skill.md references both evals.md and memory.md.
5. Skill-editor skill for maintenance: A meta-skill that reviews all other skills for conciseness, strips AI slop (em dashes, "X, not Y" phrasing, duplicate instructions), and keeps the skill library clean.
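Steps 1 and 4 are simple enough to sketch in code. The Python below is a minimal illustration, not the author's implementation: the draft-type keys, the function names `pick_example` and `prepend_lesson`, the one-line entry format, and the dates are all assumptions made for this example. It shows loading only the example file that matches the draft type, and adding each new lesson at the top of the log so memory.md stays reverse-chronological.

```python
from datetime import date

# Hypothetical mapping from draft type to example file. The file names follow
# the skill folder described on this page; the draft-type keys are assumed.
EXAMPLES = {
    "tutorial": "example-tutorial.md",
    "personal": "example-personal.md",
    "product": "example-product.md",
}


def pick_example(draft_type: str) -> str:
    """Return the single example file relevant to this draft type."""
    try:
        return EXAMPLES[draft_type]
    except KeyError:
        raise ValueError(f"no example file for draft type {draft_type!r}") from None


def prepend_lesson(memory_text: str, day: date, lesson: str) -> str:
    """Add a lesson at the top so the log stays reverse-chronological."""
    entry = f"{day.isoformat()}: {lesson.strip()}"
    if not memory_text.strip():
        return entry + "\n"
    return entry + "\n" + memory_text


if __name__ == "__main__":
    # Illustrative entries only; the dates and the first lesson are made up.
    log = prepend_lesson("", date(2026, 1, 5), "Cut the intro to two sentences.")
    log = prepend_lesson(log, date(2026, 1, 9), "Make the voice more authentic.")
    print(pick_example("tutorial"))
    print(log, end="")
```

Keeping the newest lesson first means an agent that reads only the top of memory.md still sees the most recent feedback.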

Eval loop mechanics

One agent edits the post using the skill.
A second agent, spawned with a clean context window, grades the output against evals.md.
Because the grader starts from a clean context, it judges only the output against evals.md and is not biased by the first agent's reasoning or edit history.
If any eval fails, the two agents iterate until all pass.
In the author's experience, a newsletter draft took five rounds between the two agents to pass all evals.
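The loop above can be sketched as a small control structure. This is an illustration under stated assumptions, not the author's setup: the two agents are replaced by plain Python callables, the names `run_eval_loop`, `toy_edit`, and `toy_grade` are hypothetical, and the `max_rounds` cap is an added safeguard that the source does not mention. The grader receives only the draft, which stands in for the clean context window.

```python
from typing import Callable, Dict, Tuple

# Stand-ins for the two agents. In practice each would be a separate agent
# call; these signatures are assumptions made for this sketch.
Editor = Callable[[str, Dict[str, bool]], str]  # (draft, last results) -> new draft
Grader = Callable[[str], Dict[str, bool]]       # draft -> {check name: passed}


def run_eval_loop(
    draft: str, edit: Editor, grade: Grader, max_rounds: int = 10
) -> Tuple[str, int, Dict[str, bool]]:
    """Alternate editing and grading until every check passes or rounds run out."""
    results: Dict[str, bool] = {}
    for round_number in range(1, max_rounds + 1):
        draft = edit(draft, results)
        # The grader sees only the draft, mimicking a clean context window.
        results = grade(draft)
        if all(results.values()):
            return draft, round_number, results
    return draft, max_rounds, results


# Toy grader: two pass/fail checks based on slop patterns named on this page.
def toy_grade(draft: str) -> Dict[str, bool]:
    return {
        "no_em_dash": "\u2014" not in draft,
        "no_x_not_y": ", not " not in draft,
    }


# Toy editor: fixes one failing check per round, so the loop takes several passes.
def toy_edit(draft: str, results: Dict[str, bool]) -> str:
    if results.get("no_em_dash") is False:
        return draft.replace(" \u2014 ", ", ")
    if results.get("no_x_not_y") is False:
        return draft.split(", not ")[0] + "."
    return draft


if __name__ == "__main__":
    start = "Skills compound \u2014 they are assets, not scripts."
    final, rounds, results = run_eval_loop(start, toy_edit, toy_grade)
    print(rounds, final, results)
```

Each check returns a boolean, so the stopping condition is a single `all(...)` over the results, with no threshold to tune as there would be with numeric scores.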

Skill folder structure

edit-post/
├── skill.md
├── example-tutorial.md
├── example-personal.md
├── example-product.md
├── evals.md
└── memory.md
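For anyone who wants to start from this layout, a short script can create the empty files. This is a convenience sketch, not part of the source methodology: the function name `scaffold_skill` is hypothetical, the files are created empty, and where the skill folder should live depends on your agent setup, so the example writes to a temporary directory.

```python
from pathlib import Path
import tempfile

# File names taken from the folder structure shown on this page.
SKILL_FILES = [
    "skill.md",
    "example-tutorial.md",
    "example-personal.md",
    "example-product.md",
    "evals.md",
    "memory.md",
]


def scaffold_skill(root: Path, name: str = "edit-post") -> Path:
    """Create the skill folder with empty placeholder files; keep existing ones."""
    folder = root / name
    folder.mkdir(parents=True, exist_ok=True)
    for filename in SKILL_FILES:
        # touch() leaves the contents of an existing file untouched.
        (folder / filename).touch(exist_ok=True)
    return folder


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        folder = scaffold_skill(Path(tmp))
        print(sorted(p.name for p in folder.iterdir()))
```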

Why this works

Evals improve output quality: The pass/fail loop stripped every em dash and "X, not Y" phrase from the author's drafts.
Memory improves the skill itself: Feedback that is too nuanced for evals ("more authentic voice") accumulates in memory.md and shapes future skill behavior.
Separation of concerns: Examples, evals, and memory are separate files so the skill.md stays focused on what to do, not how it was learned.

Evidence across sources

Source	Key Claim	Relevance
Full Tutorial — Peter Yang	5-step skill construction with eval loop and memory.md	Primary methodology
Building Skills Best Practices	Anthropic official skill design patterns	Foundation for skill structure
Skill Engineering as Algorithm Design	Deterministic layer + probabilistic decision engine	Architectural theory for eval loops

Open questions

How does the eval loop scale to skills with subjective output (design, creative writing) versus objective output (code, data validation)?
At what point does memory.md become too large and start degrading the skill's performance?
Can the eval loop be generalized to non-Claude agents (Codex, OpenClaw, Hermes) or is it Claude-specific?
Does the five-round iteration cost (agent time + tokens) justify the quality improvement for all skill types, or only high-stakes ones?

Prompts for witness

Which of your current skills would benefit most from an eval loop? What would the 3-5 pass/fail checks be?
What feedback do you give agents repeatedly that could be captured in a memory.md instead?
The article warns that "relying on AI to write skills could lead to a mess of slop." How do you balance skill automation with quality control?

Related

claude-code/building-skills-best-practices — Official Anthropic skill design patterns
harness-engineering/skill-engineering-as-algorithm — Algorithmic approach to skill architecture
harness-engineering/compound-engineering — The compound loop that includes skill improvement
claude-code/skills-collection — Community skill examples and patterns

Sources

Evolution

Derived from source material