If you’re betting on “agent skills” to level up your coding agent, you’re mostly buying ceremony, and sometimes negative ROI. SWE-Skills-Bench tested 49 popular skills against 565 real GitHub tasks and found that skill injection is a narrow intervention: usually inert, occasionally useful, and sometimes actively harmful. Independent research on a much larger dataset tells us why, and the answer isn’t what you’d expect.
The headline result is blunt. Across those 565 requirement-driven tasks (real repos pinned to commits, acceptance criteria enforced by tests), 39 of 49 skills produced zero pass-rate improvement. The average gain across all skills was +1.2%. That’s not “skills are the future.” That’s skills as a rounding error.
This should change how practitioners think about skill libraries. Most teams treat skills like reusable best practices: drop in a “React skill” or “debugging skill” and expect a consistent bump. SWE-Skills-Bench suggests the opposite. Generic skills don’t generalize because end-to-end software work is dominated by local context: repo conventions, dependency versions, build tooling, test harnesses, and weird historical decisions that never make it into a tidy procedure.
Skills can fight the repo. The paper reports three skills that degrade performance (up to -10%) due to version-mismatched guidance conflicting with project context. I see this as creating a failure mode that looks a lot like configuration drift. You’re not just prompting a model; you’re installing a policy layer that can go stale relative to the codebase.
Token overhead makes the story worse. Skills can increase token usage by up to +451% while pass rates stay flat. You pay latency and context budget to carry instructions that don’t move outcomes. If you’re running agents in CI or paying per token at scale, “skills everywhere” becomes an infrastructure tax.
The failure modes go deeper than version mismatch #
SWE-Skills-Bench identifies version-mismatched guidance as the culprit for skills that actively hurt. That’s real, but it’s one mechanism among several. A separate analysis of 673 skills across 41 repositories, with behavioral evaluation of 19 of them under controlled conditions, found six distinct interference mechanisms, and the most surprising finding was what didn’t predict them.
The structural indicators that seem like they should matter (how many languages does a skill mix? how contaminated are its reference files?) showed no correlation with actual behavioral degradation (r = 0.077). A skill with near-zero structural risk produced one of the largest behavioral drops in the dataset. A skill with the highest structural contamination score (0.93 out of 1.0) failed for a completely different reason than expected. Structural analysis can flag potential risks, but behavioral testing is the only way to find actual ones.
API hallucination. The upgrade-stripe skill had the highest structural contamination score in the 673-skill dataset, mixing code examples across Python, Ruby, JavaScript, and Go. The expected failure was cross-language confusion. That didn’t happen. Instead, the model started inventing plausible Stripe API calls: params.SetStripeVersion() in Go (doesn’t exist), stripe-go/v81 (wrong version), nonexistent error constants. The skill taught API vocabulary without grounding it in API facts, and the model filled the gap with confident fabrication. Worse, adding realistic agentic context (system prompts, conversation history) amplified the problem. The degradation tripled from -0.117 to -0.383 under realistic conditions, the opposite of the mitigation pattern seen in most other skills.
Template propagation. One skill’s output template contained // comments in JSON (syntactically invalid). The model reproduced this 100% of the time with the skill loaded, 0% at baseline. Not stochastic. Deterministic. The model knows JSON doesn’t support comments but follows the template anyway.
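This class of failure is cheap to catch mechanically. A minimal Python sketch (the template string here is invented for illustration, not the actual skill’s output) shows that a strict JSON parser rejects the contaminated template outright, so a validation gate on skill outputs would have caught it:

```python
import json

# Hypothetical skill output template that embeds // comments inside JSON.
# JSON has no comment syntax (RFC 8259), so this is syntactically invalid.
templated_output = """{
    "status": "ok",  // indicates success
    "items": []
}"""

def is_valid_json(text: str) -> bool:
    """Return True only if the text parses as strict JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

print(is_valid_json(templated_output))    # False: the comments break parsing
print(is_valid_json('{"status": "ok"}'))  # True
```

Because the propagation is deterministic (100% with the skill loaded), a check like this fails on every run, which makes it an easy regression test to add alongside any skill that ships an output template.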
Token budget competition. Some skills cause the model to allocate 2.6x more tokens to prose explanations, leaving fewer tokens for code. Under a 4,096-token output ceiling, that means truncated implementations. The skill doesn’t make the code wrong; it makes the code incomplete.
Textual frame leakage. A React Native skill’s identity bled into explanations for completely unrelated tasks. A Swift task got an introduction about “React Native best practices for native iOS.” The code was fine. The framing was contaminated.
Architectural pattern bleed. Go patterns transferred to Python output without syntax errors but caused over-engineering: a single-file task ballooned into a seven-file project structure. Syntactically correct, architecturally contaminated.
Each of these mechanisms would show up in SWE-Skills-Bench’s numbers as “skill didn’t help” or “skill made things worse.” But the remediation for each is different, and none of them are addressed by the generic advice to “write better skills.”
Scale makes everything worse #
Seven of the 49 skills in SWE-Skills-Bench delivered meaningful gains (up to +30%). They were all specialized and domain-specific. The lesson is clear: narrow skills matched to the task subdomain can be real leverage. Generic “best practices” skills are overhead.
But the ecosystem is moving in exactly the wrong direction. One popular mega-repo ships 1,200+ skills installable with a single command and has 23,700 GitHub stars. A structural analysis of that collection reveals what happens when you take “skills everywhere” to its logical conclusion.
The skill catalog alone (just the names and descriptions the agent loads to know what’s available) consumes 47,000 tokens, or 37% of a 128k context window, before a single skill activates and before the user types their first message. Activate five skills during a session and you’re at 43% of your context window in skill overhead. The token cost that SWE-Skills-Bench flags at the individual skill level compounds into a context window crisis at collection scale.
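The arithmetic is worth making explicit. A back-of-envelope sketch using the article’s figures; the ~1,500-token per-activated-skill body size is my assumption, chosen only because it roughly reproduces the 43% figure, not a measured number:

```python
def context_overhead(catalog_tokens: int, active_skill_tokens: list,
                     window: int = 128_000) -> float:
    """Fraction of the context window consumed by skill machinery."""
    return (catalog_tokens + sum(active_skill_tokens)) / window

# Catalog alone, per the mega-repo analysis: 47k tokens of names/descriptions.
print(f"{context_overhead(47_000, []):.0%}")           # 37%

# Five activated skills at ~1,500 tokens each (assumed body size).
print(f"{context_overhead(47_000, [1_500] * 5):.0%}")  # 43%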
Then there’s trigger ambiguity. With 1,200+ skills loaded, 84 mention security, 74 mention documentation, and 60 mention React. When a user says “create API documentation,” 20 skills compete to activate. The agent disambiguates based on description text that, in many cases, uses near-identical language. Thirteen groups of skills in the collection are 85-100% content-identical duplicates filed under different category prefixes, and 58.5% of all skills lack any trigger guidance (“use when…”) in their descriptions. The agent is choosing from a noisy catalog with bad labels.
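You can see why overlap explodes with a toy version of trigger matching. The catalog entries below are invented for illustration (real agents do something more sophisticated than bag-of-words, but the collision dynamic is the same when descriptions share vocabulary):

```python
# Hypothetical catalog entries; names and descriptions are made up here,
# not taken from the actual mega-repo.
catalog = {
    "api-docs-writer":    "Create API documentation from source code",
    "docs-api-generator": "Generate API documentation for REST endpoints",
    "security-docs":      "Write security documentation and API guides",
    "react-components":   "Build React components with best practices",
}

def candidate_skills(user_message: str) -> list:
    """Naive trigger matching: any skill whose description shares a
    keyword with the user's message competes for activation."""
    words = set(user_message.lower().split())
    return [name for name, desc in catalog.items()
            if words & set(desc.lower().split())]

print(candidate_skills("create API documentation"))
# Three of four skills collide on "api"/"documentation"; only the
# React skill stays out of the race.
```

With four entries the collision is visible; with 1,200 sharing the same high-frequency nouns, the agent is effectively guessing.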
This is the gap between SWE-Skills-Bench’s controlled experiment (one skill at a time, hand-selected for relevance) and how skills actually get deployed. In practice, teams install collections, not individual skills. The interference is multiplicative.
What practitioners should do instead #
The combined picture from SWE-Skills-Bench and the broader ecosystem research points to a different operational model than today’s skill-pack approach.
Treat skills like dependencies, not prompts. Version them, test them, and gate their deployment. If a skill can regress performance due to version mismatch or API hallucination, it belongs in the same mental bucket as a library upgrade. A skill that taught Stripe API concepts without grounding them in current API facts caused the model to invent plausible but wrong SDK calls. That’s a dependency that shipped without pinning its version.
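What “treat skills like dependencies” could look like in practice: a minimal sketch of a skill manifest that pins the dependency versions its guidance assumes, and a gate that refuses to load the skill against a mismatched repo. The manifest format, names, and version strings are all hypothetical:

```python
from dataclasses import dataclass

@dataclass
class Skill:
    """Hypothetical skill manifest: the guidance is pinned to the
    dependency versions it was written against, like a lockfile."""
    name: str
    guidance_for: dict  # dependency name -> version the guidance assumes

def compatible(skill: Skill, repo_versions: dict) -> bool:
    """Gate loading: refuse a skill whose guidance targets a version
    the repo doesn't actually use (or a dependency it doesn't have)."""
    return all(repo_versions.get(dep) == ver
               for dep, ver in skill.guidance_for.items())

# Illustrative versions only, not the real upgrade-stripe pinning.
stripe_skill = Skill("upgrade-stripe", {"stripe-go": "v79"})

print(compatible(stripe_skill, {"stripe-go": "v79"}))  # True: safe to load
print(compatible(stripe_skill, {"stripe-go": "v81"}))  # False: stale guidance
```

The point isn’t this exact schema; it’s that a skill with no declared compatibility surface is the prompt-layer equivalent of an unpinned dependency.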
Test with behavioral evals, not structural analysis. The skill with the highest structural contamination score didn’t fail the way anyone predicted. The skill with near-zero structural risk produced one of the worst behavioral outcomes. Structural analysis is a useful first pass, but the only way to know if a skill helps is to run your actual tasks with and without it, under realistic agentic conditions (not just the skill in isolation), and compare the results.
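The with/without comparison is simple to harness. A sketch of the A/B loop, with a deterministic stub standing in for the real agent run (in production, `run_task` would check out the repo, inject the skill, run the agent, and score against the task’s acceptance tests; everything below is a toy):

```python
def ab_eval(tasks, run_task, skill, trials=3):
    """Pass rate without and with the skill injected, on your own tasks.
    run_task(task, skill) -> bool is your agent harness (stubbed here)."""
    def pass_rate(active_skill):
        runs = [run_task(t, active_skill) for t in tasks for _ in range(trials)]
        return sum(runs) / len(runs)
    return pass_rate(None), pass_rate(skill)

# Stub harness: pretend the skill only helps on tasks in its subdomain,
# mirroring the benchmark's finding that narrow, matched skills win.
def fake_run(task, skill):
    return skill == task["domain"] or task["easy"]

tasks = [
    {"domain": "stripe", "easy": False},
    {"domain": "react",  "easy": True},
]
baseline, with_skill = ab_eval(tasks, fake_run, "stripe")
print(baseline, with_skill)  # 0.5 1.0
```

Swap the stub for your real agent loop and this is the entire methodology: same tasks, same conditions, skill on versus skill off, compare pass rates. If the delta isn’t positive, the skill is overhead.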
Keep your collection small. Five well-chosen skills cost ~125 tokens of catalog overhead; a 1,200-skill collection costs 47,000. Smaller collections also avoid trigger ambiguity entirely. Your agent doesn’t need to disambiguate between 20 documentation skills if you’ve only installed the one you actually use.
There’s also a structural argument for grounding. In the upgrade-stripe evaluation, the one task that was completely immune to the skill’s interference was the grounded task, where the model had existing code to modify rather than generating from scratch. The code anchored the model against fabrication. Skills that orient agents toward modifying existing code rather than generating from nothing are inherently safer.
We wrote yesterday about an agent that ground through a test suite to build a JavaScript engine, and the question of whether conformance testing is enough to trust agent output. Skills are the flip side. JSSE showed what happens when the verification loop is tight and the acceptance criteria are external. SWE-Skills-Bench shows what happens when you inject “helpful” guidance without any verification at all. Constraint and verification beat breadth and vibes.
If you want agents that improve over time, you probably need less skill injection and more boring engineering discipline around evaluation, compatibility, and drift. The hard part of software isn’t knowing procedures. It’s fitting them to the system you have.