SkillOpt: Stop Writing Your skill.md by Hand — Microsoft Says You Should Be Training It

I’m going to be direct: there’s a number buried in a recent Microsoft Research paper that should change how your team builds agent skills — and most of the coverage is treating it as an academic curiosity instead of the operational shift it really is.

Here’s the number. On GPT-5.5, a skill document optimized by SkillOpt added 19.1 points when running inside the Claude Code harness and 24.8 points inside Codex, compared with running the same frozen model without any skill. Same model. Same harness. Same inference cost. The only thing that changed was the Markdown file telling the agent how to work.

If you’re currently maintaining a CLAUDE.md, an AGENTS.md, or any skill.md by hand, that’s the part to sit with for a while.

The premise: your skill doc is trainable state, not documentation

For two years, the entire conversation around AI and code revolved around models. GPT-5.5 vs Opus 4.8. Which writes better code, which hallucinates less, which holds more context. SkillOpt comes from a completely different angle: keep the model frozen and treat the skill document as the thing you optimize.

The framing is deliberately borrowed from deep learning. You have epochs, minibatches, a learning rate and validation gates — except all of that applies to a Markdown file instead of model weights. The skill is “external state” of a frozen agent, and you train it the same way you’d train anything else: you run, score, adjust, keep what improves.
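
To make the analogy concrete, here is a minimal sketch of what the "hyperparameters" of a skill-training run could look like. The class name, field names, and default values are my own illustration, not SkillOpt's actual configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SkillTrainingConfig:
    """Hypothetical knobs for training a skill document (names are assumptions)."""
    epochs: int = 3              # full passes over the training tasks
    minibatch_size: int = 8      # tasks rolled out per optimization step
    edit_budget: int = 5         # max edit operations per step (the textual "learning rate")
    validation_tasks: int = 20   # held-out tasks used only by the validation gate


def steps_per_epoch(num_train_tasks: int, cfg: SkillTrainingConfig) -> int:
    """Number of optimization steps in one epoch (ceiling division)."""
    return -(-num_train_tasks // cfg.minibatch_size)


cfg = SkillTrainingConfig()
print(steps_per_epoch(50, cfg) * cfg.epochs)  # 21 steps for 50 training tasks
```

The point of writing it down this way is that every knob has a deep-learning counterpart, but the thing being updated is a text file.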

This isn’t prompt engineering with extra steps. The discipline of the loop (measure, bound each change, validate before keeping it) is exactly the point.

How the loop actually works

Four stages, repeated:

Rollout. The frozen target model runs the tasks using the current skill and records scored trajectories — every tool call, code generation, compiler output and verifier result. Think of it as the forward pass.
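
As a rough sketch of the rollout stage, assuming a trajectory is just a list of logged events plus a verifier score: the `Trajectory` type, `rollout` function, and the toy agent and verifier below are hypothetical stand-ins, not the framework's API.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Trajectory:
    task_id: str
    events: list[str] = field(default_factory=list)  # tool calls, compiler output, etc.
    score: float = 0.0                               # verifier result in [0, 1]


def rollout(skill: str,
            tasks: dict[str, str],
            run_agent: Callable[[str, str], list[str]],
            verify: Callable[[str, list[str]], float]) -> list[Trajectory]:
    """Run the frozen agent on every task with the current skill and score each run."""
    trajectories = []
    for task_id, prompt in tasks.items():
        events = run_agent(skill, prompt)
        trajectories.append(Trajectory(task_id, events, verify(task_id, events)))
    return trajectories


def mean_score(trajectories: list[Trajectory]) -> float:
    return sum(t.score for t in trajectories) / len(trajectories)


# Toy stand-ins: the "agent" only runs tests if the skill tells it to.
def fake_agent(skill: str, prompt: str) -> list[str]:
    tool = "tool: pytest" if "run tests" in skill else "tool: none"
    return [f"edit: {prompt}", tool]


def fake_verify(task_id: str, events: list[str]) -> float:
    return 1.0 if "tool: pytest" in events else 0.0


tasks = {"t1": "fix off-by-one", "t2": "add retry"}
before = rollout("Write code.", tasks, fake_agent, fake_verify)
after = rollout("Write code. Always run tests.", tasks, fake_agent, fake_verify)
print(mean_score(before), mean_score(after))
```

Note that the agent function never changes between the two calls. Only the skill string does, which is the whole premise.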

Reflect. A separate optimizer model (a frontier model, different from the one doing the work) analyzes batches of successes and failures separately, looking for reusable procedures. This is the backward pass at the language level.
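
A hedged sketch of the reflect stage: split a scored batch into successes and failures, then ask the optimizer about each group separately. The 0.5 threshold, the prompt wording, and the `optimizer` callable (standing in for a call to a frontier model) are all assumptions for illustration.

```python
from typing import Callable, NamedTuple


class Scored(NamedTuple):
    task_id: str
    transcript: str
    score: float


def split_batch(batch: list[Scored], threshold: float = 0.5):
    """Separate a scored batch into successes and failures (threshold is an assumption)."""
    successes = [t for t in batch if t.score >= threshold]
    failures = [t for t in batch if t.score < threshold]
    return successes, failures


def build_prompt(skill: str, group: list[Scored], kind: str) -> str:
    """Assemble one reflection prompt for a single group of runs."""
    question = {
        "successes": "Which reusable procedures explain why these runs worked?",
        "failures": "Which missing or misleading instructions explain these failures?",
    }[kind]
    runs = "\n".join(f"- [{t.task_id}] score={t.score:.1f}: {t.transcript}" for t in group)
    return f"Current skill:\n{skill}\n\n{kind.capitalize()}:\n{runs}\n\n{question}"


def reflect(skill: str, batch: list[Scored],
            optimizer: Callable[[str], str]) -> dict[str, str]:
    """Query the optimizer once per non-empty group and collect its notes."""
    successes, failures = split_batch(batch)
    notes = {}
    if successes:
        notes["successes"] = optimizer(build_prompt(skill, successes, "successes"))
    if failures:
        notes["failures"] = optimizer(build_prompt(skill, failures, "failures"))
    return notes


# Placeholder optimizer: echoes the question instead of calling a real model.
def echo_optimizer(prompt: str) -> str:
    return prompt.splitlines()[-1]


batch = [Scored("t1", "ran pytest, all green", 1.0),
         Scored("t2", "skipped tests, build broke", 0.0)]
notes = reflect("Write code.", batch, echo_optimizer)
print(notes)
```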

Edit. The optimizer proposes bounded add / delete / replace operations on the skill document. And here’s the cleverest part: there’s an edit budget that works like a textual learning rate. It limits how much the doc can change in a single step. Without it, self-editing becomes erratic: the agent overwrites rules that were working and loses its place. The budget is what keeps evolution gradual and reproducible instead of being a coin flip.
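
Here is one way the bounded edit step could be sketched, treating the skill as a list of lines and the budget as a cap on the number of operations per step. The `EditOp` type, the line-index addressing, and the choice to reject an over-budget proposal outright (rather than truncate it) are my assumptions, not details confirmed by the paper.

```python
from dataclasses import dataclass
from typing import Literal, Optional


@dataclass(frozen=True)
class EditOp:
    kind: Literal["add", "delete", "replace"]
    index: int                   # line position in the ORIGINAL skill document
    text: Optional[str] = None   # new line content for add / replace


def apply_edits(skill_lines: list[str], ops: list[EditOp], edit_budget: int) -> list[str]:
    """Apply a proposal to a copy of the skill, refusing proposals over budget."""
    if len(ops) > edit_budget:
        raise ValueError(f"{len(ops)} edits exceed the budget of {edit_budget}")
    lines = list(skill_lines)
    # Apply from the bottom up so indices of earlier lines stay valid.
    for op in sorted(ops, key=lambda o: o.index, reverse=True):
        if op.kind == "add":
            lines.insert(op.index, op.text or "")
        elif op.kind == "delete":
            del lines[op.index]
        elif op.kind == "replace":
            lines[op.index] = op.text or ""
        else:
            raise ValueError(f"unknown edit kind: {op.kind}")
    return lines


skill = ["# Skill", "Write code.", "Commit often."]
ops = [
    EditOp("replace", 1, "Write code, then run the tests."),
    EditOp("add", 3, "Never edit generated files."),
]
updated = apply_edits(skill, ops, edit_budget=2)
print("\n".join(updated))
```

A small budget forces the optimizer to prioritize. A proposal that wants to rewrite half the document simply doesn't get applied.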

Gate. A candidate edit is accepted only if it strictly improves a held-out validation score. If it doesn’t improve, it’s rejected, and rejected edits become negative feedback so the optimizer doesn’t go down the same dead end again. This turns reflection into propose-and-test optimization, instead of the unconditional “let the agent rewrite its own instructions” approach that tends to drift.
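
The gate itself is small enough to sketch in a few lines. This assumes an `evaluate` function that scores a skill on held-out tasks; that function, the `gate` signature, and the toy scorer below are hypothetical.

```python
from typing import Callable


def gate(current_skill: str,
         candidate_skill: str,
         edit_summary: str,
         evaluate: Callable[[str], float],
         rejected_log: list[str]) -> tuple[str, bool]:
    """Keep the candidate only if it strictly beats the current skill on validation."""
    current_score = evaluate(current_skill)
    candidate_score = evaluate(candidate_skill)
    if candidate_score > current_score:  # strict: a tie is a rejection
        return candidate_skill, True
    # Rejected edits are recorded so they can be fed back to the optimizer.
    rejected_log.append(
        f"{edit_summary} (val {candidate_score:.2f} vs {current_score:.2f})"
    )
    return current_skill, False


# Toy validation score: fraction of required phrases present in the skill.
REQUIRED = ("run the tests", "validate inputs")


def toy_validation_score(skill: str) -> float:
    return sum(phrase in skill.lower() for phrase in REQUIRED) / len(REQUIRED)


rejected: list[str] = []
skill = "Write code."
skill, ok1 = gate(skill, "Write code. Use tabs.", "prefer tabs",
                  toy_validation_score, rejected)
skill, ok2 = gate(skill, "Write code. Run the tests.", "add test rule",
                  toy_validation_score, rejected)
print(ok1, ok2, rejected)
```

The strict inequality matters: an edit that merely ties adds length to the document without evidence that it helps, so it stays out.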

The model itself never changes. You end up with a deployable best_skill.md artifact that has zero extra inference cost.
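
Put together, the outer loop might look something like the sketch below: epochs over minibatches, a proposal per step, a strict validation gate, and a final write of best_skill.md. Everything here (`train_skill`, the toy proposer, the toy validator, the rule table) is a deterministic stand-in so the example runs without any model calls.

```python
import tempfile
from pathlib import Path
from typing import Callable, Optional


def train_skill(skill: str,
                train_tasks: list[str],
                propose: Callable[[str, list[str], list[str]], Optional[str]],
                validate: Callable[[str], float],
                epochs: int = 2,
                minibatch_size: int = 2) -> tuple[str, float]:
    """Propose-and-test loop over a skill document; the agent itself is never modified."""
    best, best_score = skill, validate(skill)
    rejected: list[str] = []
    for _ in range(epochs):
        for start in range(0, len(train_tasks), minibatch_size):
            batch = train_tasks[start:start + minibatch_size]
            candidate = propose(best, batch, rejected)
            if candidate is None:        # nothing to propose for this batch
                continue
            score = validate(candidate)
            if score > best_score:       # strict improvement on held-out score
                best, best_score = candidate, score
            else:
                rejected.append(candidate)
    return best, best_score


# Toy stand-ins for the optimizer and the validation set.
RULES = {"parse": "Validate inputs before parsing.",
         "deploy": "Run the tests before deploying."}


def toy_propose(skill: str, batch: list[str], rejected: list[str]) -> Optional[str]:
    for task in batch:
        rule = RULES.get(task)
        if rule and rule not in skill and skill + "\n" + rule not in rejected:
            return skill + "\n" + rule
    return None


def toy_validate(skill: str) -> float:
    return sum(rule in skill for rule in RULES.values()) / len(RULES)


best, score = train_skill("# Skill", ["parse", "deploy", "lint"], toy_propose, toy_validate)
out = Path(tempfile.mkdtemp()) / "best_skill.md"
out.write_text(best + "\n", encoding="utf-8")
print(score, out.name)
```

The only artifact that leaves the loop is a text file, which is why deploying it costs nothing extra at inference time.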

Why the cross-harness result is the headline

The benchmark sweep is broad: 7 target models, 6 benchmarks, and two real execution harnesses, Codex and Claude Code. SkillOpt was best or tied for best in all 52 evaluated cells (model × benchmark × harness combinations). It beat hand-written skills, and it beat previous automated approaches (TextGrad, GEPA, EvoSkill).

Two things matter more than raw scores for anyone running a team:

First, it works inside the harnesses you already use. It’s not a benchmark that only lives in a lab notebook. The Codex and Claude Code numbers show the optimized skill transfers to the real tools your devs open every morning.

Second, learned skills transfer between models and harnesses. A skill trained against one setup carries gains to others. For a team standardizing agentic workflows, that’s the difference between maintaining a single trained artifact and rewriting instructions for each tool and each model bump.

The honest caveats — because this isn’t a plugin

This is where I’d pump the brakes a bit before someone on your team clones the repo expecting a /install-skill command.

SkillOpt is a research framework, MIT-licensed and public at github.com/microsoft/SkillOpt (~3.2K stars and growing). It’s not a one-click plugin for Claude Code. To run it you need an LLM API or an Azure OpenAI endpoint, a separate optimizer model, and real compute for rollout batches. The benchmark datasets aren’t included: you bring your own data, formatted the way each environment expects. And the gains, while consistent, vary wildly by benchmark: some cells show single-digit improvements, others jump 50+ points. Your domain decides where you land.

There’s also a naming trap worth flagging: there are unrelated “skill factory” repos floating around that bridge Claude Code and Codex. Those aren’t this. SkillOpt is the Microsoft Research optimizer, and the distinction matters when you’re searching.

What I’d do with this

Don’t wait for the polished plugin. The idea is the asset here, and you can adopt it before the tooling catches up.

If your team has a recurring agentic task — a tight plan→dev→test→deploy cycle, a documented review flow, anything you’ve already tried to codify in a hand-written skill — that’s a candidate. Run your frozen agent against a representative set of specs, score the trajectories honestly, and let a stronger model propose bounded edits to the doc, gated on whether they actually improve results. Even a manual version of that loop beats the status quo, which for most teams is: write the skill once, hope it generalizes, and never touch it again.

The hand-written skill.md was always a placeholder. SkillOpt is the argument for what replaces it.