Harness Engineering: Coding Agent Performance Is an Environment Problem
March 25, 2026 · 8 min read
Most teams troubleshoot agent output by writing better prompts: more instructions, more examples, more constraints packed into AGENTS.md or CLAUDE.md. It rarely works. More instructions don't fix the problem. They just add more signal for the model to weigh against everything it learned from public codebases.
The fix isn't a better prompt. It's a better environment.
Harness engineering is the practice of building infrastructure (types, tests, lint rules, observability, feedback loops) that constrains and guides coding agents from the outside. Instead of telling the agent what to do, you design the environment so it self-corrects through its own toolchain. OpenAI built a million-line product this way: zero human-written code, shipped to real users. Their lesson was immediate: "Early progress was slower than we expected, not because Codex was incapable, but because the environment was underspecified." The primary job of their engineering team became building the infrastructure that made good output the default.
You've already solved this problem
Every CTO and engineering lead has onboarded a junior developer. Nobody writes them a 10-page doc and walks away.
You build an environment where the junior can ship safely. Clear tickets with well-defined scope. Automated checks that catch mistakes before code review. Access to the tools they need to verify their own work. A codebase structured so that doing the right thing is easier than doing the wrong thing.
That's harness engineering. The playbook is identical. The agent is just faster, cheaper, and tireless. What it shares with the junior is the core limitation: competence without judgment.
The junior developer metaphor maps directly to infrastructure decisions:
- Strongly typed system: the code won't compile if the agent gets the shape wrong
- Tests: a definition of done the agent can verify against
- Lint rules with remediation messages: guardrails the agent can read and act on
- Observability and logs: the agent can debug itself
- Browser access (Playwright, Chrome DevTools): the agent can see what it built. Without this, frontend work is blind
- Small, detailed tickets: less scope, less judgment required
- Enforced architecture layers: dependency direction the agent can't violate
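The tests item is the easiest to make concrete. A "definition of done" can simply be an executable spec the agent runs after every change. A minimal sketch, assuming a hypothetical `slugify` ticket (the function and its acceptance criteria are illustrative, not from any real codebase):

```python
# A hypothetical "definition of done" for a small ticket: the agent
# iterates until this executable spec passes. No human judgment needed.
import re

def slugify(title: str) -> str:
    """Lowercase, collapse runs of non-alphanumerics into single hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")

def test_slugify() -> None:
    # Each assertion is one acceptance criterion from the ticket.
    assert slugify("Hello, World!") == "hello-world"
    assert slugify("  Already--slugged  ") == "already-slugged"
    assert slugify("Agents & Harnesses") == "agents-harnesses"

test_slugify()
```

The agent runs the spec, reads the failure, fixes the code, and repeats until green. Done means passing, not "looks ok."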
Nobody manages a junior by handing them the entire backlog and hoping for the best. You scope the work, automate the checks, and make sure they have the tools to verify their output before it reaches you. Agents need the same structure. Not because they're incompetent, but because they lack the judgment to know when they're drifting.
The boundaries that matter
Harness engineering isn't a single tool. It's a set of boundaries, each one closing a failure mode.
They fall into three categories: boundaries that prevent bad output, boundaries that let the agent verify its own work, and boundaries that scope the work so mistakes stay contained.
Preventing bad output
The cost of missing these boundaries shows up in the data: agents produce more bugs, and every team I've worked with has the numbers to prove it. CodeRabbit quantified it across 470 GitHub repos: 1.7x more bugs than humans overall, 75% more logic and correctness errors, and excessive I/O operations appearing 8x more often in AI-generated code.
CodeScene's research narrows the threshold: AI performs best in codebases with a Code Health score of 9.5 or above. Below that, "AI operates in a self-harm mode, often writing code it cannot reliably maintain later." These aren't model failures. They're environment failures.
So what does prevention look like? The agent gets a data shape wrong. The code doesn't compile. Done. No ambiguity, no prompt required. The type checker rejects it before anything runs. This is why OpenAI enforces boundary parsing on every data shape entering their system: the agent can choose how to validate, but validation is non-negotiable.
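OpenAI's actual validation stack isn't public, but the boundary-parsing idea is easy to sketch by hand in Python. The `Invoice` shape is hypothetical; a real codebase would likely reach for a schema library instead:

```python
# Boundary parsing sketch: every external payload is parsed into a
# typed shape at the edge, or rejected loudly. Hand-rolled for
# illustration; the Invoice shape is hypothetical.
from dataclasses import dataclass
from typing import Any

@dataclass(frozen=True)
class Invoice:
    id: str
    amount_cents: int

def parse_invoice(raw: dict[str, Any]) -> Invoice:
    """Reject anything that isn't exactly the shape we expect."""
    if not isinstance(raw.get("id"), str):
        raise ValueError("invoice.id must be a string")
    amount = raw.get("amount_cents")
    # bool is a subclass of int in Python, so rule it out explicitly.
    if not isinstance(amount, int) or isinstance(amount, bool):
        raise ValueError("invoice.amount_cents must be an integer")
    return Invoice(id=raw["id"], amount_cents=amount)

# Inside the boundary, code can rely on the shape unconditionally.
inv = parse_invoice({"id": "inv_42", "amount_cents": 1999})
```

The agent can choose how to satisfy the parser; that it must parse is the non-negotiable part.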
Types catch shape errors. Lint rules catch everything else that can be statically analyzed: a database query in the wrong layer, a route missing validation middleware, an environment variable accessed outside the config file. The critical design choice: the error message is written for the agent. It's surgical, scoped, and actionable. The agent reads it, self-corrects, and moves on. Zero context window cost until the rule fires. I covered this pattern in depth in a previous article.
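A real rule would live in ESLint, Ruff, or a custom linter plugin. A dependency-free sketch of the shape, assuming a hypothetical convention that environment variables may only be read in `config.py`:

```python
# Sketch of a custom lint check with an agent-readable remediation
# message. The rule and file layout are illustrative, not a real plugin.
import re

REMEDIATION = (
    "{path}:{line}: os.environ accessed outside config.py. "
    "Fix: add the variable to config.py and import it from there."
)

def check_env_access(path: str, source: str) -> list[str]:
    if path.endswith("config.py"):
        return []  # the one sanctioned location
    errors = []
    for lineno, text in enumerate(source.splitlines(), start=1):
        if re.search(r"\bos\.environ\b", text):
            errors.append(REMEDIATION.format(path=path, line=lineno))
    return errors

# The agent reads this exact message, moves the access, and re-runs.
errs = check_env_access("app/service.py", "import os\ntoken = os.environ['TOKEN']\n")
```

The message names the file, the line, and the fix. Nothing for the agent to interpret, nothing to put in the prompt.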
Enforced architecture layers (strict dependency directions validated by linters and structural tests) limit what the agent can reach. OpenAI's codebase enforces a rigid layered model per business domain (Types → Config → Repo → Service → Runtime → UI) with only explicit cross-cutting interfaces. The constraint is what allows speed without architectural drift.
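A structural test for this can be small. A sketch that checks dependency direction between layers (the layer names follow the article; the import-scanning approach is illustrative):

```python
# Structural test sketch: lower layers must never import from higher
# ones. Layer names follow the article's example; a real test would
# derive the imports map by scanning source files.
LAYERS = ["types", "config", "repo", "service", "runtime", "ui"]  # low to high
RANK = {name: i for i, name in enumerate(LAYERS)}

def violations(imports: dict[str, list[str]]) -> list[str]:
    """imports maps a module's layer to the layers it imports from."""
    bad = []
    for layer, deps in imports.items():
        for dep in deps:
            if RANK[dep] > RANK[layer]:  # importing "upward" is forbidden
                bad.append(f"{layer} may not import from {dep}")
    return bad

# repo reaching into service breaks the dependency direction:
assert violations({"repo": ["types", "service"]}) == ["repo may not import from service"]
```

Run this in CI and the agent physically cannot merge a layering violation; the error message tells it which direction to invert.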
Letting the agent verify its own work
An agent that can check its own output doesn't need a human in the loop for every iteration.
The most common failure pattern is deceptively simple. LangChain found that agents would write a solution, re-read their own code, confirm it "looks ok," and stop. Adding verification guidance (build, run tests, compare against the spec, fix) changed the outcome dramatically. Models are strong self-improvement machines, but they don't have a natural tendency to enter the build-verify loop. The harness pushes them into it.
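The loop itself is mechanically simple. A sketch of the harness-driven build-verify cycle, where `agent_fix` stands in for a model call and the toolchain command is a placeholder for something like `pytest -q`:

```python
# Sketch of the build-verify loop: the agent isn't trusted to re-read
# its code and declare it "looks ok"; it must run the checks and act
# on real failures. agent_fix is a hypothetical stand-in for a model call.
import subprocess
import sys

def verify() -> tuple[bool, str]:
    """Run the project's own toolchain; return (ok, combined output)."""
    result = subprocess.run(
        [sys.executable, "-c", "print('tests passed')"],  # stand-in for `pytest -q`
        capture_output=True, text=True,
    )
    return result.returncode == 0, result.stdout + result.stderr

def build_verify_loop(agent_fix, max_iters: int = 5) -> bool:
    for _ in range(max_iters):
        ok, output = verify()
        if ok:
            return True       # verified done, not "looks ok"
        agent_fix(output)     # feed the real failures back to the agent
    return False
```

The harness owns the loop; the model only ever sees ground truth from the toolchain, never its own optimistic self-assessment.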
Spotify's Honk system validates this at scale. Running background agents across thousands of software components, they built deterministic verifiers plus an LLM judge layer. The judge vetoes roughly 25% of agent sessions. When flagged, agents self-correct about half the time. Their assessment is blunt: "Without these feedback loops, the agents often produce code that simply doesn't work."
OpenAI wired logs, metrics, and traces into their agent runtime through a local observability stack, ephemeral per worktree, queryable via LogQL and PromQL. With that context available, prompts like "ensure service startup completes in under 800ms" become tractable. Without it, the agent is blind to everything that happens after compilation.
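What the agent does with that stack can be as plain as parsing a query response. A sketch that extracts log lines from a Loki `query_range` JSON payload (the response shape follows Loki's HTTP API; the service label and log message are assumptions):

```python
# Hypothetical sketch: an agent checking its own startup logs by
# parsing a Loki query_range response. The canned payload's service
# label and message text are assumptions, not from a real system.
import json

def startup_lines(loki_response: str) -> list[str]:
    """Extract raw log lines from a Loki query_range JSON response."""
    data = json.loads(loki_response)
    return [line
            for stream in data["data"]["result"]
            for _ts, line in stream["values"]]

canned = json.dumps({"data": {"result": [
    {"stream": {"service": "api"},
     "values": [["1710000000000000000", "startup completed in 612ms"]]}]}})

assert startup_lines(canned) == ["startup completed in 612ms"]
```

With the numbers in hand, the agent can compare 612ms against the 800ms budget and decide for itself whether it's done.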
The same logic applies to the visual layer. Wire in Playwright or Chrome DevTools Protocol, and the agent can launch the application, take screenshots, compare against design assets, and iterate. A full QA loop without human eyes.
Scoping the work
A randomized trial found that experienced open-source maintainers were actually 19% slower with AI, while believing they were 20% faster. A 39-point perception gap. A separate study saw only 8% of agentic invocations result in a merged pull request.
A UCSD/Cornell study synthesized these results. Its core finding: the professional developers who succeed deploy explicit control strategies. They plan, they supervise, they validate. Agents work for "small, straightforward, well-defined tasks" and fail at "complex tasks requiring domain knowledge."
This isn't a limitation to work around. It's a constraint to design for. Small, detailed tickets limit blast radius. The more precisely scoped the ticket, the less judgment the agent needs. A monorepo helps too: everything discoverable in one place, no guessing where code lives or how packages relate.
But some work can't be scoped small enough. Architectural decisions that cut across the whole system, novel problem-solving where the right answer isn't clear yet, cross-cutting refactors that touch everything. These still need a human driving. The harness doesn't eliminate the need for judgment. It concentrates it where it actually matters.
First the boundaries, then the delegation
OpenAI's Symphony, their open-source agent orchestrator, states it plainly in the README: "works best in codebases that have adopted harness engineering." Symphony polls an issue tracker, creates isolated per-issue workspaces, and runs coding agents autonomously. Without the harness, it would just be dispatching agents into chaos.
OpenAI reports single Codex runs working on a single task for over six hours, often while the humans are sleeping. The agent validates the codebase, reproduces a bug, implements a fix, drives the application to verify it, opens a PR, responds to feedback, and merges the change. That level of autonomy only works because the harness catches what the agent can't.
Before they systematized it, OpenAI spent 20% of every engineering week cleaning up "AI slop." Once they encoded their standards into linters, structural tests, and recurring agent-driven refactoring, that cleanup became automated. The harness didn't just improve output quality. It reclaimed the human time that was being burned on supervision.
The discipline moved
Building software still demands discipline. But the discipline has moved.
The discipline isn't in writing code. It's in the scaffolding, the feedback loops, the constraints that keep the codebase coherent as agents generate thousands of lines per day. The best managers don't micromanage output. They build systems where good output is the default. They scope the work clearly, automate the checks, give their team the tools to verify their own results, and enforce the boundaries that matter while allowing autonomy within them.
Coding agents need exactly the same thing. Not better prompts. Not better models. A better environment, one where doing the right thing is easier than doing the wrong thing.
Don't mistake a good harness for autopilot. You're still the one deciding what gets built and why. But engineer the environment, and the agent will follow.
If you're building these patterns into your team's workflow, I write about this regularly. Follow me @AxelVincent_ and DM me if you want to talk shop.
References
- Improving Deep Agents with Harness Engineering — LangChain
- Harness Engineering: Leveraging Codex in an Agent-First World — OpenAI
- Background Coding Agents: Predictable Results Through Strong Feedback Loops — Spotify Engineering
- Professional Software Developers Don't Vibe, They Control — UCSD / Cornell
- Are Bugs and Incidents Inevitable with AI Coding Agents? — StackOverflow / CodeRabbit
- Agentic AI Coding: Best Practice Patterns for Speed with Quality — CodeScene
- Symphony — OpenAI
- Lint Rules Are the Best Context Engineering Nobody's Talking About — Axel Vincent