Microsoft Introduces SkillOpt for Self-Evolving AI Agent Skills

• Microsoft Research unveiled SkillOpt, a text-space optimizer that improves AI agent skills without modifying model weights.
• SkillOpt achieved the best or tied-best result in all 52 evaluated model-benchmark-harness settings, using validation-gated edits for stable performance gains.
• On GPT-5.5, the framework raised average accuracy over the no-skill baseline by up to 24.8 points (in the Codex agentic loop), and the optimized skills transferred across models and environments.

Microsoft Research introduced SkillOpt, a text-space optimizer that improves agent skills by treating a natural-language skill document as external, trainable state for a frozen model. Unlike approaches that rely on hand-crafted skills or one-shot generation, SkillOpt uses a separate optimizer model to convert scored rollouts into precise add, delete, or replace edits on a single skill document. An edit is accepted only when it measurably improves a held-out validation score. Stability mechanisms, including a textual learning-rate budget, a rejected-edit buffer, and epoch-wise updates, keep progress consistent without requiring additional inference-time model calls.
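The validation-gated loop described above can be illustrated with a minimal Python sketch. This is not SkillOpt's actual implementation: the `Edit` type, the function names, the line-level granularity, and the choice to accept or reject each epoch's edits as one batch are all assumptions made for illustration. In the real system an optimizer model proposes edits from scored rollouts; here `toy_propose` and `toy_validate` are hypothetical stand-ins so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Edit:
    """One line-level edit to the skill document (hypothetical representation)."""
    op: str         # "add", "delete", or "replace"
    index: int      # line position the edit targets
    text: str = ""  # new text for "add" and "replace"


def apply_edit(lines: List[str], edit: Edit) -> List[str]:
    """Return a copy of the document with one edit applied."""
    out = list(lines)
    if edit.op == "add":
        out.insert(edit.index, edit.text)
    elif edit.op == "delete":
        del out[edit.index]
    elif edit.op == "replace":
        out[edit.index] = edit.text
    else:
        raise ValueError(f"unknown op: {edit.op}")
    return out


def optimize_skill(
    skill: List[str],
    propose: Callable[[List[str], List[Edit]], List[Edit]],
    validate: Callable[[List[str]], float],
    epochs: int,
    edit_budget: int,
) -> Tuple[List[str], float, List[Edit]]:
    """Keep a batch of edits only if it strictly raises the validation score."""
    best = list(skill)
    best_score = validate(best)
    rejected: List[Edit] = []  # buffer shown to the proposer on later epochs
    for _ in range(epochs):
        # The budget caps how much of the document may change per epoch.
        proposals = propose(best, rejected)[:edit_budget]
        if not proposals:
            continue
        candidate = best
        for edit in proposals:
            candidate = apply_edit(candidate, edit)
        score = validate(candidate)
        if score > best_score:          # validation gate
            best, best_score = candidate, score
        else:
            rejected.extend(proposals)  # remember what did not help
    return best, best_score, rejected


# Toy stand-ins (assumptions): a real setup would score rollouts on held-out tasks
# and ask an optimizer model for edits.
REQUIRED = ("check tool output", "retry on failure")


def toy_validate(lines: List[str]) -> float:
    """Fraction of required phrases that appear somewhere in the document."""
    return sum(any(k in line for line in lines) for k in REQUIRED) / len(REQUIRED)


def toy_propose(lines: List[str], rejected: List[Edit]) -> List[Edit]:
    """Offer fixed candidate edits, skipping rejected or already-applied ones."""
    candidates = [
        Edit("replace", 0, "Be nice."),
        Edit("add", len(lines), "Always check tool output."),
        Edit("add", len(lines), "Use retry on failure."),
    ]
    return [e for e in candidates if e not in rejected and e.text not in lines]


if __name__ == "__main__":
    final, score, rejected = optimize_skill(
        ["Solve the task."], toy_propose, toy_validate, epochs=4, edit_budget=1
    )
    print("\n".join(final))
    print("validation score:", score)
    print("rejected edits:", rejected)
```

In this sketch the unhelpful `replace` edit is tried once, fails the gate, and lands in the rejected buffer so it is not proposed again, while the two helpful additions are accepted in later epochs. Because all optimization happens offline against the validation score, the finished skill document is simply supplied to the frozen model at inference time.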

Evaluations spanned 6 benchmarks, 7 target models, and 3 execution harnesses (direct chat, Codex, and Claude Code). SkillOpt achieved the best or tied-best performance in all 52 evaluated model-benchmark-harness settings, matching or outperforming competing methods including Trace2Skill, TextGrad, GEPA, and EvoSkill. On GPT-5.5, the optimized skills raised average accuracy over the no-skill baseline by 23.5 points in direct chat, 24.8 points within the Codex agentic loop, and 19.1 points in Claude Code.

Transfer experiments conducted by the research team indicate that the optimized skill artifacts remain useful when moved across model scales and execution environments. The skills also maintained performance on secondary math benchmarks without further optimization. The research underscores a shift toward prioritizing procedural capability, such as tool use and failure recovery, as AI agents evolve from simple assistants into autonomous workers.
