Composer 2.5: Why the Harness Matters More Than the Model

There’s a buried data point in a recent security benchmark from Endor Labs that should change how you evaluate AI coding tools — and most of the coverage of Composer 2.5 is overlooking it.

Same model. Same week. Different runtime. GPT-5.5 scored 61.5% on functionality running inside OpenAI Codex’s native harness. Put that same model inside Cursor’s harness and the number jumps to 87.2%. That’s a gap of roughly 26 percentage points (25.7, to be exact) without touching the model. The security numbers tell the same story: in Cursor’s harness, GPT-5.5 hit 23.5% and Claude Opus 4.7 hit 22.9%. Both outperformed what either model achieved in its own native environment.
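To make the size of that gap concrete, here is the arithmetic as a minimal Python sketch. The two scores are the ones quoted above; the function name `harness_gap` is just an illustrative choice.

```python
def harness_gap(native_score: float, other_score: float) -> tuple[float, float]:
    """Return (absolute gap in percentage points, relative gain in percent)."""
    absolute = other_score - native_score
    relative = absolute / native_score * 100
    return absolute, relative


# GPT-5.5 functionality scores quoted above: native harness vs. Cursor's harness.
absolute, relative = harness_gap(native_score=61.5, other_score=87.2)
print(f"{absolute:.1f} points absolute, {relative:.1f}% relative gain")
```

Read this way, the harness swap is worth 25.7 points in absolute terms, or a relative gain of about 42% over the native score.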

That data point is the real context for understanding Composer 2.5 — released on May 18, 2026.

What Composer 2.5 actually is

Cursor’s proprietary model is built on the same open-source foundation as Composer 2: Moonshot AI’s Kimi K2.5, a mixture-of-experts architecture with roughly 1 trillion total parameters, of which about 32 billion are active per token. The foundation didn’t change. Everything built on top of it did.

Eighty-five percent of the total compute budget went into Cursor’s own post-training pipeline. That covers three things: 25 times more synthetic training tasks than Composer 2; a new reinforcement learning technique that gives the model localized textual feedback at the exact moment it makes a bad tool call, instead of a single reward signal at the end of a long execution; and infrastructure improvements, including fragmented Muon optimizers for MoE-scale training.
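The contrast between the two feedback styles can be illustrated with a toy sketch. This is not Cursor’s training code: the `Step` type, the `validate` rule, and the feedback string are all hypothetical. Only the idea of a per-step textual note versus a single end-of-run signal comes from the description above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Step:
    tool: str
    args: dict


def validate(step: Step) -> Optional[str]:
    """Hypothetical checker: return a textual note if the tool call looks bad."""
    if step.tool == "read_file" and not step.args.get("path"):
        return "read_file was called without a path"
    return None


def terminal_reward(steps: list[Step]) -> list[float]:
    """One scalar computed at the end: every step shares the same signal."""
    reward = 0.0 if any(validate(s) for s in steps) else 1.0
    return [reward] * len(steps)


def localized_feedback(steps: list[Step]) -> list[Optional[str]]:
    """Per-step notes: only the offending call receives feedback."""
    return [validate(s) for s in steps]


trajectory = [
    Step("read_file", {"path": "app.py"}),
    Step("read_file", {}),  # the bad tool call
    Step("run_tests", {"target": "unit"}),
]
print(terminal_reward(trajectory))
print(localized_feedback(trajectory))
```

In the toy trajectory, the end-of-run signal penalizes all three steps equally, while the localized version attaches a note only to the second call. That is the credit-assignment problem the per-step approach is meant to address.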

Benchmark results:

  • SWE-Bench Multilingual: 79.8% (up from 73.7% in Composer 2)
  • Terminal-Bench 2.0: 69.3% (up from 61.7%), within 0.1 points of Opus 4.7’s 69.4%
  • CursorBench v3.1 at default effort: 63.2% — ahead of Opus 4.7 (61.6%) and GPT-5.5 (59.2%)

Pricing: $0.50/M input tokens and $2.50/M output tokens on the standard tier. The fast tier (default for interactive use) is $3.00/M input and $15.00/M output. At roughly one-tenth the cost of frontier models on comparable tasks, Composer 2.5 fundamentally shifts the economics of long agentic sessions.
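A quick way to see what these prices mean for a long agentic session is to plug in token counts. The sketch below uses the per-million-token prices quoted above. The session size (2 million input tokens, 200,000 output tokens) and the $100 budget are assumptions chosen for illustration, not measurements.

```python
# Prices in USD per million tokens, as quoted above.
PRICES = {
    "standard": {"input": 0.50, "output": 2.50},
    "fast": {"input": 3.00, "output": 15.00},
}


def session_cost(tier: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one session on the given tier."""
    price = PRICES[tier]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


def sessions_within_budget(tier: str, budget_usd: float,
                           input_tokens: int, output_tokens: int) -> int:
    """How many whole sessions of this size fit in the budget."""
    return int(budget_usd // session_cost(tier, input_tokens, output_tokens))


# Hypothetical session size and budget, for illustration only.
IN_TOKENS, OUT_TOKENS, BUDGET = 2_000_000, 200_000, 100.0
for tier in PRICES:
    cost = session_cost(tier, IN_TOKENS, OUT_TOKENS)
    count = sessions_within_budget(tier, BUDGET, IN_TOKENS, OUT_TOKENS)
    print(f"{tier}: ${cost:.2f} per session, {count} sessions per ${BUDGET:.0f}")
```

Under these assumed token counts, a session costs $1.50 on the standard tier and $9.00 on the fast tier, so the same $100 covers 66 standard sessions or 11 fast ones. Real sessions will differ, but the 6x ratio between tiers holds for any token mix because both prices scale by the same factor.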

The thesis benchmarks don’t capture

Cursor is transparent about something in their launch post: the behavioral dimensions that matter most to developers working day-to-day — effort calibration, communication style, knowing when to stop and ask versus when to push forward — aren’t well captured by existing benchmarks. They built and trained for them anyway.

This is where years of product investment become visible. The retrieval layer, the tool-calling patterns, how context is managed across a 200-file refactor, the signals the agent uses to decide if a failed test is noise or a real problem: none of that lives in the model weights. It lives in the scaffolding Cursor has been building since 2023.

The Endor Labs data is the clearest external validation of this thesis I’ve seen. The harness is the product. The model is a component of the harness.

What this means for teams evaluating tools

If you’re making tooling decisions based on which foundation lab is shipping the hottest model this month, you’re optimizing the wrong variable. Cursor isn’t winning because they have a better base model; they’re running Kimi K2.5, the same open-source checkpoint anyone can download. They’re winning because they built the best software engineering agent runtime on the market, and they keep improving it independently of what foundation labs release.

Two practical implications:

First, the cost argument for Cursor is now legitimate at scale. Frontier model inference in long agentic sessions is genuinely expensive. At standard tier pricing, Composer 2.5 changes the math on how many parallel agent sessions a team can run, and how often.

Second, Cursor announced they’re training a significantly larger model from scratch with SpaceXAI — on Colossus 2, with 10 times more compute than Composer 2.5. That model has no release date. But the implication is clear: Cursor isn’t positioning themselves as an IDE that wraps other people’s models. They’re building a vertical AI stack for software engineering, and they’re moving fast.

Composer 2.5 is available now. Cursor is offering double usage through approximately May 25. If you’re evaluating it for your team, this week is the right time to run real workloads.