How One Dev Improved 15 LLMs Without Changing the Model

I’m going to be direct: nearly all the conversation about AI and code right now revolves around models. GPT-5.4 vs. Claude Opus 4.6. Gemini 3 vs. whatever came out this week. Which model writes better code? Which understands your codebase faster? Which hallucinates less?

Can Bölük just proved that’s the wrong question.

Bölük, a developer with a background in video game security, maintains oh-my-pi, an open-source code agent forked from Mario Zechner’s Pi. It has over 1,300 commits on top of the original, most of them improving the machinery between the model and your files. The part nobody talks about: the harness.

His thesis is simple and backed by data: most code agent failures aren’t model failures. They’re harness failures. The model knows what to change. It just can’t express the change reliably through the editing tool it’s been given.

He built a technique called Hashline to prove it. Then he benchmarked it across 16 models, with 180 tasks and 3 runs per task. The results should change how you think about AI code tools.


The Editing Tool Problem Nobody Talks About

Every code agent needs to modify files. Sounds trivial. It’s not. The three dominant approaches have fundamental problems:

OpenAI Codex uses apply_patch — a proprietary diff format. The model generates something that looks like a diff, and the harness applies it. The problem? This format is essentially baked into Codex models. Give that format to any other model and patch failures explode. Bölük measured a 50.7% failure rate for Grok 4 and 46.2% for GLM-4.7. These aren’t bad models — they just don’t speak the format.

Claude Code uses str_replace — find exact text, replace it with new text. Conceptually simple, but the model has to reproduce every character perfectly: spaces, indentation, quotes. Multiple matches in the file? Rejected. The “String to replace not found in file” error is so common it has its own megathread on GitHub Issues with over 27 related issues.
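To see why this is brittle, here is a minimal sketch of a str_replace-style tool. The function name and the second error message are my own assumptions, and this is not Claude Code's actual implementation, just the "exactly one exact match" rule described above.

```python
def str_replace(content: str, old: str, new: str) -> str:
    """Replace `old` with `new` only if `old` occurs exactly once."""
    count = content.count(old)
    if count == 0:
        raise ValueError("String to replace not found in file")
    if count > 1:
        # Hypothetical wording; the point is that ambiguity is rejected.
        raise ValueError(f"Found {count} matches; the edit is ambiguous")
    return content.replace(old, new, 1)


source = 'function hello() {\n  return "world";\n}\n'

# Exact text, including the two-space indent: the edit applies.
print(str_replace(source, '  return "world";', '  return "there";'))

# One wrong whitespace character (a tab instead of two spaces): rejected.
try:
    str_replace(source, '\treturn "world";', '\treturn "there";')
except ValueError as err:
    print(err)
```

The model's intent in the second call is perfectly clear. The edit still fails, because the tool compares characters, not intent.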

Cursor trained a separate 70B neural network whose only job is to take a draft edit and merge it correctly into the file. The harness problem is so hard that one of the best-funded AI companies decided to throw another model at it.

Aider’s own benchmarks showed that the choice of format alone took GPT-4 Turbo from a 26% to a 59% success rate. JetBrains’ Diff-XYZ paper confirmed it systematically: no edit format dominates across all models. The through line is that all these approaches force the model to reproduce content it already saw. When it can’t, and often it can’t, we blame the model.


Hashline: Reference Lines by Hash, Not by Text

Bölük’s idea is elegant. When the model reads a file, each line comes tagged with a 2-character content hash:

1:a3|function hello() {
2:f1|  return "world";
3:0e|}

When the model edits, it references those labels: “replace line 2:f1”, “replace range 1:a3 to 3:0e”, “insert after 3:0e.” The model never needs to reproduce previous content. No whitespace reproduction, no “string not found”, no ambiguous matches.
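The read side of this is easy to picture in code. Below is a rough sketch of how a harness could tag lines before showing them to the model. The names line_tag and render_for_model are hypothetical, the hex alphabet is an assumption, and zlib.crc32 is a standard-library stand-in for whatever hash the real tool uses.

```python
import zlib

# Assumption: a hex-style 16-character alphabet; the real one may differ.
ALPHABET = "0123456789abcdef"


def line_tag(text: str) -> str:
    """Return a 2-character tag derived from the line's content.

    zlib.crc32 is a stand-in hash chosen because it ships with Python.
    """
    digest = zlib.crc32(text.encode("utf-8"))
    # Two 4-bit slices of the digest, each mapped to one alphabet character.
    return ALPHABET[(digest >> 4) & 0xF] + ALPHABET[digest & 0xF]


def render_for_model(source: str) -> str:
    """Prefix every line with `number:tag|`, as in the example above."""
    lines = source.splitlines()
    return "\n".join(
        f"{number}:{line_tag(text)}|{text}"
        for number, text in enumerate(lines, start=1)
    )


print(render_for_model('function hello() {\n  return "world";\n}'))
```

The tags this prints won't match a3, f1, and 0e from the example, since the hash function is different. What matters is that the same line content always yields the same tag.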

And it has a built-in safety mechanism: if the file changed since the last read, the hashes won’t match and the edit gets rejected before anything gets corrupted. The model gets a clear error telling it to reread the file, not a cryptic failure.
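Here is what that check could look like on the edit side. This is a sketch under assumptions: replace_line and StaleAnchorError are names I made up, the error wording is illustrative, and CRC32 again stands in for the real hash.

```python
import zlib

ALPHABET = "0123456789abcdef"  # assumed alphabet, for illustration only


def line_tag(text: str) -> str:
    """2-character content tag; CRC32 stands in for the real hash."""
    digest = zlib.crc32(text.encode("utf-8"))
    return ALPHABET[(digest >> 4) & 0xF] + ALPHABET[digest & 0xF]


class StaleAnchorError(Exception):
    """The anchor no longer matches the current file content."""


def replace_line(lines: list[str], anchor: str, new_text: str) -> list[str]:
    """Replace the line named by an anchor like "2:f1", or refuse."""
    number_part, tag = anchor.split(":")
    index = int(number_part) - 1
    if not 0 <= index < len(lines):
        raise StaleAnchorError(f"{anchor}: line no longer exists, reread the file")
    if line_tag(lines[index]) != tag:
        raise StaleAnchorError(f"{anchor}: content changed, reread the file")
    # Return a new list so a rejected edit can never leave a half-written file.
    return lines[:index] + [new_text] + lines[index + 1:]


lines = ["function hello() {", '  return "world";', "}"]
anchor = f"2:{line_tag(lines[1])}"  # what the model saw when it read the file

# The file is unchanged since the read, so the edit applies.
print(replace_line(lines, anchor, '  return "there";'))

# Line 2 changes behind the model's back. Barring a tag collision,
# the old anchor no longer matches and the edit is refused.
lines[1] = '  return "moon";'
try:
    replace_line(lines, anchor, '  return "there";')
except StaleAnchorError as err:
    print(err)
```

One caveat with this sketch: two characters from a 16-character alphabet give only 256 possible tags, so a changed line can occasionally hash to the same tag. Pairing the tag with the line number narrows that window but doesn't close it.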

The technical implementation uses xxHash32 mapped to a 16-character alphabet, producing short and memorable anchors. It’s roughly 200 lines of core code.


The Numbers That Should Change the Conversation

The benchmark: 180 tasks generated from real files in the React codebase, with mechanical mutations (operator swaps, boolean flips, off-by-one errors, removed guard clauses). Each task gets 3 runs with a fresh agent session each time and a small tool set (read, edit, write). Three edit formats were tested: apply_patch, str_replace, and hashline.
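If "mechanical mutations" sounds abstract, here is a toy sketch of the idea. The mutation table and the mutate function are hypothetical and far cruder than a real benchmark generator, which would work on parsed code rather than regexes. The point is just that each task is one small, well-defined bug planted in otherwise correct code.

```python
import re
from typing import Optional

# Hypothetical mutation table: (name, pattern, replacement).
MUTATIONS = [
    ("boolean flip", re.compile(r"\btrue\b"), "false"),
    ("operator swap", re.compile(r"==="), "!=="),
    ("off-by-one", re.compile(r"<="), "<"),
]


def mutate(source: str) -> Optional[tuple[str, str]]:
    """Apply the first mutation that matches, once.

    Returns (mutation name, mutated source), or None if nothing matched.
    """
    for name, pattern, replacement in MUTATIONS:
        mutated, count = pattern.subn(replacement, source, count=1)
        if count:
            return name, mutated
    return None


original = "if (i <= limit && ready === true) {"
print(mutate(original))
# → ('boolean flip', 'if (i <= limit && ready === false) {')
```

A task like this needs almost no reasoning to fix, which is what makes it a useful probe: when an agent fails it, the edit mechanics are the likelier culprit than the model's understanding of the code.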

The key results:

Grok Code Fast 1: 6.7% → 68.3%. A ten-fold improvement. The model’s actual coding capacity was almost completely hidden behind mechanical edit failures.

Gemini 3 Flash: +5 percentage points over str_replace — beating Google’s best attempt at solving this problem.

Grok 4 Fast: 61% reduction in output tokens. The model stopped burning context in retry loops from failed edits.

MiniMax: More than doubled its success rate.

The pattern is consistent: hashline matches or exceeds str_replace for nearly all tested models, and weaker models benefit most. The models that looked worst on paper weren’t bad at coding. They were bad at reproducing exact text for editing tools that demanded it.

Bölük’s takeaway is worth internalizing: “+8% improvement in Gemini’s success rate is bigger than what most model updates deliver, and it cost zero training compute.” Just a different editing interface and ~$300 in benchmarking costs.


The Vendor Lock-In Angle

This is where the story gets sharp. While Bölük was running these benchmarks, two things happened:

Anthropic blocked OpenCode — a very popular open-source code agent — from accessing Claude through Claude Code subscriptions. Their position: “OpenCode reverse-engineered a private API.” Technically fair. But the signal it sends is clear: don’t build alternative harnesses. Use ours.

Google completely disabled Bölük’s Gemini account. They didn’t rate-limit him. They didn’t warn him. Disabled. For running a benchmark: the same one that showed Google’s own model improving 5 percentage points with his technique.

Bölük’s argument against this stance is convincing: no vendor is going to optimize their harness for competitors’ models. Anthropic won’t tune for Grok. xAI won’t tune for Gemini. OpenAI won’t tune for Claude. But an open-source harness tunes for all of them, because contributors use different models and fix the failures they personally encounter.

The model is the moat. The harness is the bridge. Burning bridges means fewer people bother crossing.


What This Means for Developers

If you’re using any AI code agent — Claude Code, Codex, Cursor, Windsurf, or an open-source alternative — you need to understand that a significant percentage of the failures you attribute to “the model is dumb” are actually the editing tool failing silently.

Three takeaways:

The edit format matters as much as the model. Aider proved it, JetBrains confirmed it, and Bölük quantified it. When you see your agent struggling with edits, the bottleneck might not be intelligence — it might be the interface between intelligence and your files.

Open-source harnesses are where the innovation is. Vendors have strong incentives to keep you in their harness. The community has strong incentives for all models to work better. oh-my-pi, OpenCode, Aider — these projects are improving the entire ecosystem, not just one model.

“Which model is best?” is increasingly the wrong question. The better question is: which system — model plus harness plus tools — produces the best results for your specific work? Can Bölük improved 15 models simultaneously by changing a single variable. That variable wasn’t the model.

We’re blaming the pilot for the landing gear. Time to look at the harness.


Links:

* Blog post: I Improved 15 LLMs at Coding in One Afternoon. Only the Harness Changed. (Can.ac)