Create the “Best Agent Skill” with SkillOpt from Microsoft Research

Hans Scharler — Fri, 19 Jun 2026 01:15:49 +0000

For two years the move was simple: pick a better model. Then the models got good, and stayed good, and the gaps between them got boring. So the interesting lever now is the other thing you hand the agent: the skill file.

Microsoft Research just shipped a tool called SkillOpt that takes that idea literally.

SkillOpt is a text-space optimizer that trains reusable natural-language skills for frozen LLM agents through trajectory-driven edits, validation-gated updates, and deployable best_skill.md artifacts.

It treats the skill markdown you hand an agent (the instructions, the system prompt, the SKILL.md) as trainable state. It runs epochs. It has a batch size. It has a learning rate. It just never touches the model weights. The weights stay frozen behind the API. The thing that gets trained is the text.

The loop is short. Run the agent on a batch of tasks. Score each one. A second model reads the failures and proposes small edits to the skill file. Keep an edit only if it improves a held-out score. Repeat. What you get at the end is a best_skill.md you drop in front of the same unchanged model.
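
To make the loop concrete, here is a minimal sketch in plain Python. It is not SkillOpt's code: `optimize_skill`, `run_and_score`, and `propose_edit` are hypothetical names, and the toy agent and toy optimizer are stand-ins for the frozen model and the second model. The only thing it tries to show is the shape: run a batch, collect failures, propose an edit, keep it only if the held-out score goes up.

```python
import random

def optimize_skill(seed_skill, train_tasks, heldout_tasks,
                   run_and_score, propose_edit,
                   num_epochs=4, batch_size=2, seed=0):
    """Validation-gated skill editing (illustrative, not SkillOpt's API).

    run_and_score(skill, task) -> (score in [0, 1], feedback string)
    propose_edit(skill, failures) -> candidate skill text
    """
    rng = random.Random(seed)

    def heldout_score(skill):
        scores = [run_and_score(skill, t)[0] for t in heldout_tasks]
        return sum(scores) / len(scores)

    best_skill, best_score = seed_skill, heldout_score(seed_skill)
    for _ in range(num_epochs):
        tasks = list(train_tasks)
        rng.shuffle(tasks)
        for i in range(0, len(tasks), batch_size):
            batch = tasks[i:i + batch_size]
            # Run the agent on the batch and keep only the failures.
            failures = []
            for task in batch:
                score, feedback = run_and_score(best_skill, task)
                if score < 1.0:
                    failures.append((task, feedback))
            if not failures:
                continue  # nothing to learn from this batch
            candidate = propose_edit(best_skill, failures)
            candidate_score = heldout_score(candidate)
            # The gate: keep the edit only if held-out score improves.
            if candidate_score > best_score:
                best_skill, best_score = candidate, candidate_score
    return best_skill, best_score

# --- Toy stand-ins so the sketch runs without any model calls ---
TELLS = ["furthermore", "in conclusion"]

def toy_run_and_score(skill, passage):
    # Toy "agent": it only deletes phrases the skill names explicitly.
    output = passage
    for tell in TELLS:
        if tell in skill:
            output = output.replace(tell, "")
    leaked = [t for t in TELLS if t in output]
    return (0.0 if leaked else 1.0), ", ".join(leaked)

def toy_propose_edit(skill, failures):
    # Toy "optimizer model": add one rule for the first leaked phrase.
    first_leak = failures[0][1].split(", ")[0]
    return skill + f"\nRemove the phrase '{first_leak}'."

if __name__ == "__main__":
    train = ["furthermore, it works", "in conclusion, it works",
             "furthermore, in conclusion, done"]
    heldout = ["furthermore, we ship", "in conclusion, we ship"]
    skill, score = optimize_skill(
        "Improve the passage so it reads a little better.",
        train, heldout, toy_run_and_score, toy_propose_edit)
    print(score)
    print(skill)
```

In the real tool the two callables would be model calls. The gate is the part that carries over unchanged.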

I wanted to see it actually work, so I gave it a job…

The setup

I had OpenAI Codex do the writing. The task: take a paragraph stuffed with AI slop and clean it up. The skill it started from was short:

Improve the passage so it reads a little better. Keep the meaning and roughly the length.

No mention of slop. No list of banned words. Nothing to go on.

The verifier was the part that mattered. I wrote a dumb little counter that scans the output for the tells (the canned phrases, the em-dashes) and returns how many are left. Fewer is better. That is the whole eval. Objective, cheap, no judgment calls.
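
A counter like that fits in a few lines. The sketch below is a guess at the shape, not the actual script: the function name `count_tells` is made up, and the phrase list is borrowed from the phrases this post names further down.

```python
# Hypothetical slop counter; the phrase list is an assumption.
SLOP_PHRASES = [
    "let's dive in",
    "have you ever wondered",
    "it's worth noting",
    "furthermore",
    "in conclusion",
    "great question",
]
EM_DASH = "\u2014"

def count_tells(text: str) -> int:
    """Return how many tells are left in the text. Fewer is better."""
    # Lowercase and normalize curly apostrophes so "It’s" matches "it's".
    lowered = text.lower().replace("\u2019", "'")
    phrase_hits = sum(lowered.count(phrase) for phrase in SLOP_PHRASES)
    return phrase_hits + text.count(EM_DASH)

if __name__ == "__main__":
    print(count_tells("Great question! Furthermore, it's worth noting this."))
    print(count_tells("The cache is cold on first load."))
```

Plain substring counting is crude (it would also flag "furthermore" inside a quotation), but crude and deterministic is the point of this kind of eval.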

Then I let SkillOpt run.

What it wrote

The seed turned into a thirty-line deslop skill. SkillOpt wrote it.

It worked out, on its own, that “remove AI slop” is a removal constraint and not a tone nudge. It named the exact phrases that kept leaking through (“let’s dive in,” “have you ever wondered,” “it’s worth noting,” “furthermore,” “in conclusion,” “great question”). It flagged em-dashes. And it caught the sneaky failure mode I never told it about: the model dodging a banned word by swapping in a synonym. That one became its own rule, the last line of the file:

Before answering, do a quick scan to ensure none of the original flagged words remain, including close synonyms you may have introduced.

The score went from passing half the held-out passages to passing all of them. The skill it wrote is one I would actually keep.

The part worth keeping

The optimizer is the easy part.

I went in assuming the clever bit was the model proposing edits. It isn’t. The clever bit is the verifier. Give SkillOpt a score and nothing else and it does nothing. I tried that first. It saw the failures, shrugged, and changed not a single line, because “you got a 0.6” tells it nothing about what to fix. The run that worked was the one where the verifier also said which words leaked. Same model, same loop. The difference was the signal.
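
The difference between the two runs is easy to see in code. This is a sketch under assumptions, not the verifier from the experiment: `verify`, the phrase list, and the way the score is normalized are all illustrative. What matters is the `feedback` field, which names the leaks instead of only reporting a number.

```python
# Illustrative verifier that returns a score AND a reason.
TELLS = [
    "let's dive in",
    "have you ever wondered",
    "it's worth noting",
    "furthermore",
    "in conclusion",
    "great question",
]
EM_DASH = "\u2014"

def verify(output: str) -> dict:
    normalized = output.lower().replace("\u2019", "'")
    leaked = [phrase for phrase in TELLS if phrase in normalized]
    if EM_DASH in output:
        leaked.append("em-dash")
    # Assumed scoring: share of checks (phrases plus em-dash) that came back clean.
    score = 1.0 - len(leaked) / (len(TELLS) + 1)
    feedback = "clean" if not leaked else "leaked: " + ", ".join(leaked)
    return {"score": score, "feedback": feedback, "leaked": leaked}

if __name__ == "__main__":
    result = verify("Furthermore, great question.")
    print(result["score"])     # the bare number says how bad
    print(result["feedback"])  # the feedback says what to fix
```

An optimizer that only sees `score` has to guess. One that sees `feedback` can write the rule directly.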

So the rule of thumb is less “use the self-improving optimizer” and more: can you score this cheaply, and can you tell it why it failed. If yes, the skill mostly writes itself. If no, there is nothing to train and the fancy loop sits there idle.

That is also the catch, and it is the same catch hiding under every self-improving-agent demo. A slop counter is a clean verifier. Most of what you do in a day is not. “Was this brief any good.” “Did this post land.” No cheap score, no training. The optimizer was never the bottleneck. The verifier is.

So build the verifier first. The optimizer can wait.

Bonus: a taste of running it

The repo is github.com/microsoft/SkillOpt. MIT licensed, Python.

Get it:

pip install skillopt
# or, to poke at the internals:
git clone https://github.com/microsoft/SkillOpt && cd SkillOpt && pip install -e .

The mental model is one small folder per task. A loader that hands over your examples, a rollout that runs your agent and scores it, and a seed skill to start from. SkillOpt’s whole job is to grow that seed.

The config reads like a training run, on purpose:

train:
  num_epochs: 4
  batch_size: 40
optimizer:
  learning_rate: 4        # max edits to the skill per step
  lr_scheduler: cosine
evaluation:
  use_gate: true          # keep an edit only if it beats the held-out score
model:
  optimizer: gpt-5.4      # the model that proposes the edits

Then it is one command:

python scripts/train.py --config configs/yourtask/default.yaml

The only part that is really on you is the scoring. Your rollout returns, per task, a pass/fail (hard) and a number between 0 and 1 (soft):

return {"id": task_id, "hard": passed, "soft": fraction_correct}

That is the whole contract. Give it tasks, a way to score them, and a skill to start from. It runs the epochs.
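
Here is a minimal sketch of a rollout that satisfies that contract. Only the three keys in the returned dict come from the line above. The function signature, the task fields (`passage`, `banned`), and the `agent` callable are assumptions for illustration, and how SkillOpt actually calls your rollout may differ.

```python
from typing import Callable

def rollout(task: dict, skill: str, agent: Callable[[str, str], str]) -> dict:
    """Run the agent on one task and score the output (illustrative)."""
    output = agent(skill, task["passage"]).lower()
    banned = task["banned"]
    survivors = [p for p in banned if p.lower() in output]
    # soft: share of banned phrases that are gone; hard: all of them gone.
    fraction_correct = 1.0 - len(survivors) / len(banned) if banned else 1.0
    passed = not survivors
    return {"id": task["id"], "hard": passed, "soft": fraction_correct}

def echo_agent(skill: str, passage: str) -> str:
    # Stand-in for the real agent call: returns the passage unchanged.
    return passage

if __name__ == "__main__":
    task = {
        "id": "t1",
        "passage": "Furthermore, it ships. In conclusion, fine.",
        "banned": ["furthermore", "in conclusion",
                   "great question", "let's dive in"],
    }
    print(rollout(task, "seed skill", echo_agent))
```

Swap `echo_agent` for whatever actually calls your model and the rest stays the same.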

One side note: keep the tasks hard enough that the agent fails some of them. If it aces everything on the seed skill, there is nothing to learn and SkillOpt politely does nothing. Ask me how I know.
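
A cheap way to catch this before burning a run is to score the seed skill once and look at the pass rate. The helper below is a sketch: `has_headroom` is a made-up name, and it assumes you already have a list of rollout results with a `hard` field.

```python
def has_headroom(results: list) -> tuple:
    """Return (pass_rate, True if at least one task failed)."""
    if not results:
        raise ValueError("no results to check")
    pass_rate = sum(1 for r in results if r["hard"]) / len(results)
    return pass_rate, pass_rate < 1.0

if __name__ == "__main__":
    seed_results = [
        {"id": "t1", "hard": True, "soft": 1.0},
        {"id": "t2", "hard": False, "soft": 0.5},
    ]
    rate, ok = has_headroom(seed_results)
    print(rate, ok)
```

If the second value comes back false, make the tasks harder before starting a run.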
