
Claude Code in Wing 12 · Detailed results

The benchmark, explained

This page gives the detailed benchmark results for Claude Code running inside Wing 12: the full per-task numbers behind the overview and, just as important, what every term means. The raw measurements (cache tokens, "cold vs. warm" prefixes, turns, ratios) are opaque on their own, so this page defines each one before showing the results.

Model: Claude Opus 4.8 · steady-state, interleaved runs · June 2026.

The setup: What we measured, and how the comparison stays fair

We gave Claude Code the identical task twice — once able to use Wing's IDE tools, once not — and measured how much work each took to finish. We call these the two "arms."

Wing arm "mcp"

Claude Code launched from inside Wing, with Wing's MCP servers attached (analysis, testing, debugger, version control) plus Wing's guidance on when to use them. Claude chooses whether to use a tool — nothing is forced.

Plain arm "bare"

Plain claude in a terminal in the same repo, with no Wing running — Claude's ordinary Read, Grep, Glob, and shell tools. A fair, capable baseline, not a handicapped one.

Both arms run the same model, the same prompts, on the same real codebase. For the plain arm we hide the Wing-only inputs so it can't accidentally benefit from them; the project's own CLAUDE.md (which describes the project, not Wing) is left in place for both, since that's fair to each. A task only "counts" when both arms reach the correct answer — we're comparing the cost of getting there, not who got there.

The tools: What Wing hands to Claude

Four families of IDE capability, exposed to Claude as callable tools (MCP servers).

Code analysis wing-analysis

Find a symbol's definition and every real use, resolved through imports, inheritance, and type inference — so it distinguishes the symbol you mean from like-named ones a text search would confuse.

Testing wing-testing

Run tests and get terse, targeted results; select exactly the tests whose coverage touches a change (from Wing's coverage database); and run a single test under the debugger.

Debugger wing-debug

Stop a running program or test at a breakpoint and read or evaluate the actual values present at that moment — runtime state you can't get by reading the source.

Version control wing-review

Review and commit changes through Wing's integrated workflow, so the AI commits cleanly as work proceeds instead of shelling out to raw VCS commands.

Reading the numbers: A plain-English glossary

The columns and footnotes below use these terms. This is the part that makes the rest of the page legible.

Cost total_cost_usd: The full price of pay-as-you-go Claude API usage to finish one task, in US dollars. It's driven by how many tokens (chunks of text) Claude has to read and write. Lower cost generally tracks with fewer steps and less waiting, meaning Claude reached the answer more directly. Note: actual cost with a Claude Code subscription is much less than the dollar amounts given here.
Turns num_turns: How many internal act-and-observe steps Claude took — each tool call and the thinking around it. Fewer turns means a more direct path to the answer.
Wall time: Real elapsed seconds to finish the task — what you actually wait at the keyboard.
Ratio mcp ÷ bare: Wing's cost divided by the plain arm's cost. Below 1.0 means Wing was cheaper (0.42× = 58% less); above 1.0 means it cost more.
Tokens & caching: To avoid re-reading the same text every step, Claude caches it. Writing something into the cache the first time is the expensive part; reading it back afterward is cheap. This one fact drives most of what follows.
The "prefix" & cold vs. warm: At the start of a session a fixed block loads — guidance plus every tool's definition. Wing's prefix is larger, so the first task pays a one-time tax to write it into the cache (a cold start). Every task after reads it cheaply (warm). We always compare warm-vs-warm.
Terseness: Once prefixes are warm, cost is dominated by how much new text each turn pulls in. Plain shell output — full test runs, big greps, whole files dumped to the screen — is bulky and gets re-cached every turn. Wing's tools return short, targeted results. This terseness is the main reason Wing wins when it wins.
Steady-state · interleaved · median (n≥3): Run-to-run variance is high (30–50%), and whoever runs first eats the cold-start tax. So we interleave the arms with both warm, repeat each at least three times, and report the median with its range — never a single back-to-back pair.
Correctness-matched: Every task shown here was scored by hand against a written rubric, and both arms produced the right answer. The numbers compare cost, speed, and effort — not accuracy.
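
As a quick check on the "Ratio mcp ÷ bare" definition above, the following minimal sketch recomputes the ratio and the percent saved for three rows of the headline table. The function names are illustrative, not part of the benchmark harness; the dollar figures are the median costs reported on this page.

```python
def cost_ratio(mcp_usd: float, bare_usd: float) -> float:
    """Wing (mcp) cost divided by plain (bare) cost; below 1.0 favors Wing."""
    if bare_usd <= 0:
        raise ValueError("bare cost must be positive")
    return mcp_usd / bare_usd


def percent_saved(ratio: float) -> float:
    """Percent saved relative to the plain arm; negative means Wing cost more."""
    return (1.0 - ratio) * 100.0


# (Wing cost, plain cost) in US dollars, copied from the headline table.
cells = {
    "runtime": (0.121, 1.22),
    "01-nav": (0.43, 1.02),
    "02-explain": (0.78, 0.70),
}

for name, (mcp, bare) in cells.items():
    r = cost_ratio(mcp, bare)
    print(f"{name}: ratio {r:.2f}x, saved {percent_saved(r):.0f}%")
```

A negative "saved" value, as for 02-explain, corresponds to a ratio above 1.0, where the plain arm was cheaper.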

All cells: Headline numbers

Task	What it tests	Wing (wall / turns / $)	Plain (wall / turns / $)	mcp÷bare $	Verdict
runtime	Debugger reads a runtime-only value	23s / 2t / $0.121	246s / 25t / $1.22	0.10×	WING WIN
01-nav	Enumerate all 9 callers of a method	~93s / 5–8t / $0.43	~193s / 19–26t / $1.02	0.42×	WING WIN
03d-tasks	Coverage-aware test selection (completeness)	~43s / 5–7t / $0.28	~118s / 14–17t / $0.65	0.43×	WING WIN
03e-fgreen	Catch a real regression, steadily	~104s / 9t / $0.40 med	~269s / ~29t / $1.58 med	0.25×	WING WIN
04-fail	Test-failure triage (ongoing use)	~27s / 5–6t / $0.174	~75s / 8–17t / $0.323	0.54×	WING WIN
08-testing	Sustained testing session (amortization)	~83s / 8t / $0.393 med	~108s / 14–15t / $0.587 med	0.67×	WING WIN
05-rename	Hierarchy-aware refactor-rename	~120s / 17–23t / $0.76	~144s / 18–19t / $0.91	0.84×	~TIE (slight edge)
06-callchain	Type-inference call chain (1 chain, 2 hops)	~89s / 10–15t / $0.52	~92s / 13–19t / $0.57	0.91×	~TIE
02-explain	Depth-budgeted explanation + citations	~137s / 20–21t / $0.78	~153s / 8–22t / $0.70	1.11×	~TIE
08-mixed /jinja2	Cross-tool session on a small library — short (7-turn)	~390s / 7t / $2.68 med	~310s / 7t / $1.49 med	~1.6–1.8×	NON-WIN (small + short)
08-small-long /jinja2	Same small library — long (~20-turn) session	~665s / 20t / $4.30 med	~735s / 20t / $5.12 med	0.84×	WING WIN (small + long)
09-tasks /ide-bench	Cross-tool session on a large codebase	~470s / 7t / $2.71 med	~600s / 7t / $4.14 med	0.65×	WING WIN

How to read a row: the same task, run both ways. "Wall / turns / $" is elapsed time, internal steps, and API cost. mcp÷bare is the cost ratio (below 1.0 favors Wing). med = median over a repeated multi-turn session. Every row is correctness-matched — both arms got it right.

Group 1 · Testing: Coverage-aware selection & failure triage

Wing knows which tests cover a change and returns short output. The plain arm must deduce the affected tests and read bulky raw test output.

04-fail · test-failure triage · n=3 steady

The task: a small change broke a test — find the failing test, get the traceback, explain the root cause, and propose the exact fix.

Wing

$0.174 med · 5–6t · ~27s

Plain

$0.323 med · 8–17t · ~75s

Wing ~46% cheaper, ~3× fewer steps, ~3× faster (0.54×). Warm prefix plus terse test output close the gap. A single cold first task here costs Wing more (~$0.72) — that's the one-time startup tax, which amortizes away over a real session (see 08-testing).

03d-tasks · coverage selection (completeness) · n=2 steady

The task: "I made a small change — run only the tests affected by it, and don't run the ones that aren't."

Wing

$0.28 avg · 5–7t · ~43s · selects 61 ✓

Plain

$0.65 avg · 14–17t · ~118s · 60→8 (unreliable)

Wing wins on every axis and is the only arm that reliably found the complete affected set (61 tests, both runs). The plain arm was unreliable in both directions — once it ran the whole 60-test class, once it reasoned its way down to just 8 and missed about 53. The model can't reliably derive the affected set by reading; coverage data can. (61 is correct, not over-selection: 17 in the test body plus 44 in shared setup/teardown.)
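
To make the idea of coverage-aware selection concrete, here is a minimal sketch of how a coverage map can pick the affected tests. It is not Wing's implementation: the data layout (a dict from test id to the set of source lines the test executed) and all file and test names are hypothetical. It does show why tests that reach a change only through shared setup still get selected, which reading the test bodies alone tends to miss.

```python
def affected_tests(coverage, changed):
    """Return the tests whose recorded coverage touches any changed line.

    coverage: test id -> set of (path, line) pairs the test executed,
              including lines run during shared setup and teardown.
    changed:  set of (path, line) pairs touched by the edit.
    """
    return sorted(test for test, lines in coverage.items() if lines & changed)


# Hypothetical coverage data for three tests.
coverage = {
    "test_save": {("store.py", 10), ("store.py", 11), ("fixtures.py", 5)},
    "test_load": {("store.py", 20), ("fixtures.py", 5)},
    "test_format": {("fmt.py", 3)},
}

# The edit touches a line that only shared setup code executes.
changed = {("fixtures.py", 5)}

print(affected_tests(coverage, changed))
```

Neither selected test mentions the changed line in its own body; both reach it through the shared fixture, which is the same pattern as the 44 setup/teardown selections in this cell.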

03e-fgreen · catch a real regression (robustness) · n=5 · 2026 refresh

The task: same prompt as 03d, but the change is a genuine one-line regression that breaks a single test — will each arm reliably catch it?

Wing

$0.40 med · 9t · ~104s · caught ✓

Plain

$1.58 med · ~29t · ~269s · caught ✓

Same correct answer — caught every run — but cheaper and far steadier (~0.25×). The plain arm is high-variance ($0.7–1.8 across samples): its strategy swings from a reasoned subset to brute-running all 748 tests plus a revert-to-confirm. Wing stays steady (select affected → run → done, 9 turns). Read it as a solid win with a wide band, not a precise point — and not a "gotcha"; both arms always caught the bug.

Group 2 · Debugger: For genuinely runtime-only information

Claude reads most bugs without running them. The debugger earns its place only when the answer exists nowhere in the source — it has to be observed live.

runtime · read a runtime-only value at a breakpoint · n=3 steady

The task: report the exact values a function holds at a specific line during a test run — including a freshly generated id that exists only at runtime. No reasoning from the source allowed; show the literal characters.

Wing

$0.121 med · 2t · 23s · value ✓

Plain

$1.22 med · 25t · 246s · value ✓

~90% cheaper, ~12× fewer steps, ~11× faster (0.10×) — the single strongest result. Wing does it in one call: name the test, set a breakpoint, evaluate the expressions. The plain arm has to reverse-engineer how the test is run and hand-write a tracing harness each time. Both report the correct value; the gap is cache-insensitive because the plain arm genuinely does ~10× more work. An honest efficiency win on runtime-only state — not a trick where one arm is wrong.

Group 3 · Code analysis: Structural search the model can't cheaply fake with grep

The durable win is finding the right symbol among like-named ones — and it only bites when text search is noisy (many look-alikes, inheritance, a large tree). On small, readable code, grep+reason is already cheap and correct, so the edge doesn't pay.

01-nav · enumerate all 9 production callers of one method · n=3 steady

The task: list every production call site of a generically-named Save method, with the calling function and why it saves — excluding test files and unrelated same-named methods in the bundled runtime.

Wing

~$0.43 warm · 5–8t · ~93s · 9/9 ✓

Plain

~$1.02 · 19–26t · ~193s · 9/9 ✓

~58% cheaper, ~3× fewer steps, ~2× faster (0.42×). Both arms 9/9 correct with zero false positives. Even Wing's cold first run ($0.83) beat every plain run. "Enumerate all callers" is exactly what structural search is for — the clearest analysis win.

05-rename · hierarchy-aware refactor-rename · n=2 steady

The task: rename a method across a class hierarchy and all its call sites, while leaving alone the look-alike methods on unrelated classes with different signatures.

Wing

~$0.76 · 17–23t · ~120s · PASS

Plain

~$0.91 · 18–19t · ~144s · PASS

Correctness tie (both renamed the right 10, left the look-alikes alone), Wing ~16% cheaper. The model can disambiguate the hierarchy by reading signatures, so the analysis edge is real but thin here.

06-callchain · which class's method, following one chain · n=2 steady

The task: follow a single call chain two hops to identify which class's SetValue (a common method name) is actually being called, and trace what it persists.

Wing

~$0.52 · 10–15t · ~89s · ✓

Plain

~$0.57 · 13–19t · ~92s · ✓

Honest non-win — essentially tied. Both fully correct every run. Following one chain two hops is cheap to just read, so type inference doesn't translate into a cost win. Kept as the "where the edge is thin" contrast to 01's breadth win.

02-explain · depth-budgeted explanation with citations · n=2 steady

The task: explain how a subsystem works in under 600 words, covering four specific topics, each with accurate file:line citations.

Wing

~$0.78 · 20–21t · ~137s · ✓

Plain

~$0.70 · 8–22t · ~153s · ✓

Tied, plain marginally cheaper (high variance). This task is reading-bound — the bottleneck is synthesizing the prose, not locating code, so the analysis tools' locating edge barely helps.

Group 4 · Real sessions: Many tasks in a row, where the startup cost amortizes

A single tiny task pays Wing's fixed startup cost against almost no work; a session of many tasks pays it once. These cells run multi-turn sessions and report the session-total median — and show that two things decide whether that overhead is repaid: codebase size (how much the tools save per task) and session length (how many tasks share the one-time cost).
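
The interplay of per-task savings and session length can be sketched with a simple linear model: Wing pays a one-time overhead plus a per-task cost, and the plain arm pays only a per-task cost. This is an illustration under stated assumptions, not the method used to produce the cells below. All amounts are hypothetical and given in whole cents to avoid floating-point rounding.

```python
def break_even_tasks(overhead_cents, wing_per_task_cents, plain_per_task_cents):
    """Smallest number of tasks at which Wing's session total is no higher
    than the plain arm's. Returns None if Wing saves nothing per task."""
    saving = plain_per_task_cents - wing_per_task_cents
    if saving <= 0:
        return None
    # Integer ceiling division: tasks needed to repay the one-time overhead.
    return max(1, -(-overhead_cents // saving))


# Hypothetical inputs: a 70-cent one-time overhead in every case.
print(break_even_tasks(70, 15, 25))  # large per-task saving (big, noisy tree)
print(break_even_tasks(70, 23, 25))  # small per-task saving (small, readable tree)
print(break_even_tasks(70, 25, 25))  # no per-task saving at all
```

Under this model a large per-task saving repays the overhead within a short session, a small saving needs a much longer one, and zero saving never breaks even. That is the same shape as the small-and-short boundary these cells map.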

08-testing · sustained testing session (3 diagnoses) · n=4 interleaved

The task: one session — run a failing test file, then diagnose three failures one at a time, naming each buggy function and its one-line fix.

Wing

$0.393 med (0.34–0.47) · ~83s · 8t

Plain

$0.587 med (0.55–0.63) · ~108s · 14–15t

Wing ~33% cheaper, ~23% faster, ~½ the steps in ongoing use (0.67×). Both arms named 3/3 bugs every run. The driver is terseness — the plain arm writes 2–3× more into cache from bulky shell output and runs more internal steps (12–18 vs. a flat 8). This is the companion to 04: the first task pays the ~$0.68 cold tax, then amortizes below the plain arm.

09-tasks · cross-tool session on a large codebase · n=3 each · faithful re-baseline

The task: a 4-task session on the real, 25k-line-file Wing codebase — enumerate a method's callers, fix a seeded bug, rename a helper everywhere (including by-name test imports), tweak a behavior — committing after each.

Wing

$2.71 med (2.04–2.84) · ~470s · 7t · 3 commits

Plain

$4.14 med (3.35–4.72) · ~600s · 7t · 3 commits

Wing ~35% cheaper, ~22% faster (0.65×) — and markedly steadier. This is the large-codebase answer to 08-mixed below, and it wins where the small library didn't. Both arms 4/4 correct every run, 3 commits each. Wing's cost spans ~$2.1–2.9 while the plain arm swings ~$2.7–4.3 (~60% run-to-run) — Wing is both cheaper and far more predictable.

Why it wins here: terseness plus configured test integration. The plain arm pulls ~75% more into context from bulky output and has to discover the project's test runner. The session exercises all four tool families (analysis, testing, debugger, version control), which is the point of the cell.

An honest engineering footnote. On the freshly-opened copy of this codebase, Wing's "find usages" initially returned zero for every query — a real bug in Wing that a swallowed exception hid (the other analysis tools still worked, so it was invisible until cross-checked). We found and fixed it. The lesson we took: verify a tool returns correct results, not merely non-error ones, before trusting a measurement.

08-mixed · the same kind of session on a small library (jinja2) · n=3 each · faithful re-baseline

The task: the same shape of cross-tool session — navigate, fix a bug, rename, adjust a filter, committing as you go — but on a ~13,000-line library instead of a large codebase.

Wing

$2.68 med (2.1–2.9) · ~390s · 7t · 3 commits

Plain

$1.49 med · ~310s · 7t · 3 commits

Non-win — Wing ran ~1.6–1.8× more expensive here, and we report it openly. Both arms finished all turns and committed cleanly. The library is small enough that plain grep+reason is cheap and correct, so across just seven turns Wing's structural and testing edges never overcome its fixed startup overhead. The debugger was correctly not used (the seeded bug was readable). Two things flip this: a bigger, search-noisy codebase (09-tasks above) or a longer session on the very same library (08-small-long below).

This is a band, not a point. Both arms are high-variance on a small tree (the single bug-fix turn alone swings widely run-to-run). The takeaway is robust: a session that is small and short is the one case the overhead isn't repaid — which is exactly the boundary the next cell maps.

08-small-long · the same small library, but a long (~20-turn) session · n=3 each

The task: on the same ~13,000-line library as 08-mixed, a sustained ~20-turn feature build — add four small text filters test-first, refactor a shared helper, harden edge cases, fix a runtime bug, committing after each (9 commits).

Wing

$4.30 med (4.24–4.72) · ~665s · 20t · 9 commits

Plain

$5.12 med (4.20–5.31) · ~735s · 20t · 9 commits

Wing ~16% cheaper (0.84×) — the small-tree cross-over. Same small library where the short 08-mixed session ran 1.6–1.8× more expensive; here a longer session amortizes the fixed startup cost until Wing comes out ahead. Both arms correct every run, 9 commits each, suite green. The win is the integrated commit workflow — terse, where the plain arm shells out to git for every commit — compounding over a 9-commit session; analysis and the debugger were correctly not used, because grep+read is the right call on a small, readable tree (expected behavior, not a gap).

An honest, modest win. Small sample (3 runs each) on a high-variance cell, and the ranges overlap — the cheapest plain run ($4.20) is below Wing's median. Read it as "a long session pulls Wing into equal-or-better territory on a small codebase," not a blowout. The lesson is the direction: the only combination that doesn't repay the fixed overhead is small and short.

Method: How the runs were kept honest

Same model, pinned. Every run used Claude Opus 4.8.

Fair baseline. The plain arm has Claude's full ordinary toolkit (read, search, shell): we measured against a capable Claude, not a crippled one.

Cache discipline. The prompt cache accumulates across runs and can swing the same arm ~2× between a cold and a warm start. So we interleave the two arms with both prefixes warm, never quote a back-to-back pair, and judge cold-vs-warm by the cache-write token counts (the one signal that can't be gamed).
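
A minimal sketch of that discipline follows, assuming a harness that records a usage dict per run. The field name `cache_creation_input_tokens` is the cache-write counter in Anthropic's API usage object; that the harness stores it under this key, the threshold value, and the function names are all assumptions made for illustration.

```python
def interleaved_schedule(arms, repeats, warmups=1):
    """Build a run order: discarded warm-up runs for every arm, then
    measured rounds that alternate which arm goes first."""
    order = [(arm, "warmup") for _ in range(warmups) for arm in arms]
    for i in range(repeats):
        # Flip the order on odd rounds so no arm always runs first.
        round_arms = arms if i % 2 == 0 else list(reversed(arms))
        order.extend((arm, "measured") for arm in round_arms)
    return order


def is_cold(usage, write_threshold_tokens=20_000):
    """Flag a run as a cold start when it wrote a large prefix into the cache.
    The threshold is a hypothetical cutoff, not a value from the benchmark."""
    return usage.get("cache_creation_input_tokens", 0) >= write_threshold_tokens


for arm, kind in interleaved_schedule(["mcp", "bare"], repeats=3):
    print(arm, kind)
```

A run flagged by `is_cold` would be excluded from a warm-vs-warm comparison rather than averaged into it.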

Variance & repetition. Run-to-run variance is 30–50%; a single pair can't separate cells that are within ~20% of each other. Quotable cells are run at least three times and reported as a median with its range. Ties and the one non-win are reported as first-class results — burying them would make the wins less trustworthy, not more.
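
The median-with-range reporting can be sketched as below, using the 08-small-long cell as input. With exactly three runs per arm, the reported low, median, and high are the three samples themselves, so the lists are inferred from the figures on this page rather than taken from raw logs. The helper names are illustrative.

```python
from statistics import median


def summarize(costs):
    """Return (median, low, high) for one arm's repeated runs."""
    return median(costs), min(costs), max(costs)


def ranges_overlap(a, b):
    """True when the two arms' min-max ranges intersect."""
    return min(a) <= max(b) and min(b) <= max(a)


# 08-small-long, n=3 each: low, median, high as reported for each arm.
wing = [4.24, 4.30, 4.72]
plain = [4.20, 5.12, 5.31]

for name, costs in (("Wing", wing), ("Plain", plain)):
    med, low, high = summarize(costs)
    print(f"{name}: ${med:.2f} med ({low:.2f}-{high:.2f})")

print("ranges overlap:", ranges_overlap(wing, plain))
```

Overlapping ranges are the reason a cell like this one is read as a direction rather than a precise point.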

Scoring. Each task has a written success rubric; runs are scored by hand from the full transcript. A cell only enters the comparison once both arms clear the rubric.