What Harness Engineering Actually Is
My conclusion after reading these articles side by side:
Harness Engineering is not just about writing better prompts. It is about engineering all the capabilities around the model into an iterative system, so an agent can produce stable and verifiable outcomes during long-running tasks.
One-line summary:
Agent = Model + Harness
Harness = State management + Tooling + Constraints + Feedback loops + Execution orchestration
The model provides intelligence. The harness makes that intelligence usable, controllable, and repeatable.
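To make the formula concrete, here is a minimal sketch of the feedback-loop part of a harness. It is my own illustration, not code from any of the articles: `run_with_harness`, `Model`, and `Check` are hypothetical names, and the "model" is any callable that takes the task plus the previous round's feedback.

```python
from typing import Callable, List, Optional, Tuple

# A check inspects the model output and returns an error message, or None on success.
Check = Callable[[str], Optional[str]]
# Stand-in for a model call: receives the task and the feedback from the last round.
Model = Callable[[str, str], str]


def run_with_harness(
    model: Model, checks: List[Check], task: str, max_rounds: int = 3
) -> Tuple[str, int]:
    """Call the model until every check passes or the round budget runs out."""
    feedback = ""
    for round_no in range(1, max_rounds + 1):
        output = model(task, feedback)
        # Collect the message of every failing check for this round.
        failures = [msg for msg in (check(output) for check in checks) if msg is not None]
        if not failures:
            return output, round_no
        # Feed the concrete failures back instead of retrying blindly.
        feedback = "\n".join(failures)
    raise RuntimeError(f"no passing output after {max_rounds} rounds: {feedback}")
```

The model stays a black box here. Everything that makes the result repeatable (the checks, the round budget, the feedback text) lives in the harness.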
Shared Takeaways Across the Articles
| Theme | Common Ground |
|---|---|
| Definition of harness | Not the model itself, but surrounding code, configuration, process, tools, and validation mechanisms |
| Goal | Reduce supervision cost, improve first-pass correctness, and support long-running execution |
| Core method | Turn repeated failure modes into engineered assets: rules, tools, tests, and loops |
| Main long-task challenge | Limited context windows, session interruption, state drift, and premature “done” claims |
| Solution direction | Incremental task decomposition, state handoff, automated checks, observability, and continuous correction |
5 Core Components (My Practical View)
- Task scaffolding
  - Clear decomposition strategy (one feature at a time)
  - Clear Definition of Done (DoD) to avoid “looks finished” outputs
- State and memory
  - Recoverable state: progress files, commit notes, change logs
  - Reliable handoff between sessions instead of relying on model guessing
- Tools and environment
  - Fast deterministic tools for agents (tests, lint, screenshots, logs)
  - Self-serve context access instead of manual copy/paste
- Feedback and sensors
  - Computational sensors: lint/typecheck/unit/e2e (fast, deterministic)
  - Reasoning sensors: LLM review/semantic QA (slower, costlier, but useful for semantics)
- Scheduling and governance
  - After failure, do not only retry; improve capability
  - Accumulate reusable rules in templates (AGENTS.md, docs, checklists)
Practical Harness Workflow for Normal WebCoding Users
This is my compressed version for individual developers. You do not need multi-agent orchestration to start.
Step 0: Define “Done” First
Create a one-page SPEC.md for each feature:
- User scenario
- Input and output
- Acceptance criteria
- Failure scenarios
Without this, agents tend to produce “confident but misaligned” output.
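One way to enforce this is a small check that refuses to start work until the spec has all four parts. This is a sketch under an assumption of mine: that SPEC.md uses one markdown heading per section, named exactly as in the list above. `missing_spec_sections` is a hypothetical helper.

```python
from typing import List

# Section names taken from the list above; the heading-per-section layout is an assumption.
REQUIRED_SECTIONS = (
    "User scenario",
    "Input and output",
    "Acceptance criteria",
    "Failure scenarios",
)


def missing_spec_sections(spec_text: str) -> List[str]:
    """Return the required sections that have no matching markdown heading."""
    headings = {
        line.lstrip("#").strip().lower()
        for line in spec_text.splitlines()
        if line.startswith("#")
    }
    return [name for name in REQUIRED_SECTIONS if name.lower() not in headings]
```

An empty result means the spec is structurally complete. It says nothing about whether the acceptance criteria are any good, which still needs a human read.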
Step 1: Create Minimal Harness Files
At least these 4 files:
- AGENTS.md: repository rules (commands, directory conventions, no-touch zones, commit style)
- TASKS.md: feature backlog with todo/doing/done
- PROGRESS.md: what was done, what is unfinished, next step
- CHECKLIST.md: unified acceptance checks (build, test, UI, performance, security)
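If you want to bootstrap these files in one go, a sketch like the following works. The template contents are placeholders I made up from the descriptions above, and `scaffold_harness` is a hypothetical helper. It never overwrites a file that already exists.

```python
from pathlib import Path
from typing import List

# Placeholder templates; the headings are assumptions derived from the file descriptions.
HARNESS_FILES = {
    "AGENTS.md": "# Agent rules\n\n## Commands\n\n## Directory conventions\n\n## No-touch zones\n\n## Commit style\n",
    "TASKS.md": "# Tasks\n\n## todo\n\n## doing\n\n## done\n",
    "PROGRESS.md": "# Progress\n\n## Done\n\n## Unfinished\n\n## Next step\n",
    "CHECKLIST.md": "# Checklist\n\n- Build\n- Test\n- UI\n- Performance\n- Security\n",
}


def scaffold_harness(root: Path) -> List[str]:
    """Create any missing harness file under root and return the names created."""
    root.mkdir(parents=True, exist_ok=True)
    created = []
    for name, template in HARNESS_FILES.items():
        path = root / name
        if not path.exists():  # keep existing rules and progress untouched
            path.write_text(template, encoding="utf-8")
            created.append(name)
    return created
```

Calling `scaffold_harness(Path("."))` in a repository root is safe to repeat, since a second run creates nothing.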
Step 2: One Feature Per Iteration
Execution pattern:
- Pick one item from TASKS.md
- Give the agent a bounded task
- Avoid “build the entire site in one go” requests
Bounded tasks keep the working context small and make regressions easier to trace to a single change.
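Picking the next item can be mechanical. The sketch below assumes a TASKS.md line format of my own choosing, `- [todo] Title`, with the three statuses from Step 1. `next_task` is a hypothetical helper that resumes an item already in progress before starting a new one.

```python
import re
from typing import Optional

# Assumed line format: "- [todo] Add login form" (statuses: todo, doing, done).
TASK_LINE = re.compile(r"^- \[(todo|doing|done)\]\s+(.+)$")


def next_task(tasks_text: str) -> Optional[str]:
    """Return the task marked doing if there is one, otherwise the first todo."""
    first_todo = None
    for line in tasks_text.splitlines():
        match = TASK_LINE.match(line.strip())
        if match is None:
            continue
        status, title = match.groups()
        if status == "doing":
            return title  # finish what is in flight before picking new work
        if status == "todo" and first_todo is None:
            first_todo = title
    return first_todo
```

The returned title is what goes into the agent prompt, together with the matching SPEC.md, and nothing else from the backlog.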
Step 3: Let the Agent Change, Then Prove
Require the agent to report the following every round:
- Files changed
- Why each change was made
- Commands executed
- Passed/failed checks
- Risk and rollback points
This converts hidden reasoning into auditable execution traces.
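If the agent emits this report as structured data, the harness can reject incomplete ones automatically. The following is a sketch with an invented schema: the key names `files_changed`, `reasons`, `commands`, `checks`, and `risks` are my assumptions, mapped one-to-one from the list above.

```python
from typing import Any, Dict, List

# Hypothetical report keys, one per bullet in the list above.
REQUIRED_KEYS = ("files_changed", "reasons", "commands", "checks", "risks")


def report_problems(report: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the report is complete."""
    problems = [f"missing or empty: {key}" for key in REQUIRED_KEYS if not report.get(key)]
    reasons = report.get("reasons") or {}
    # Every changed file must come with a stated reason.
    for path in report.get("files_changed") or []:
        if path not in reasons:
            problems.append(f"no reason given for {path}")
    return problems
```

Treating an empty `risks` entry as a problem is deliberate in this sketch: it forces the agent to write "none identified" rather than silently skip the question.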
Step 4: Two-Layer Validation (Computational First)
Run at least:
npm run lint
npm run test
npm run build
For frontend UI changes, also add:
- Key path screenshot checks
- Manual critical interaction checklist
- Responsive checks on main breakpoints
Rule: pass deterministic checks first, then do semantic review.
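The ordering rule is easy to script. This sketch runs the three commands above in sequence and stops at the first failure. `first_failure` and `run_command` are hypothetical names, and the injectable `runner` parameter exists only so the logic can be exercised without npm installed.

```python
import subprocess
from typing import Callable, List, Optional, Sequence, Tuple

# The deterministic checks from this step, in the order they should run.
CHECKS: List[Tuple[str, List[str]]] = [
    ("lint", ["npm", "run", "lint"]),
    ("test", ["npm", "run", "test"]),
    ("build", ["npm", "run", "build"]),
]


def run_command(cmd: Sequence[str]) -> bool:
    """Run one command and report whether it exited with status 0."""
    return subprocess.run(list(cmd), check=False).returncode == 0


def first_failure(
    checks: Sequence[Tuple[str, Sequence[str]]] = CHECKS,
    runner: Callable[[Sequence[str]], bool] = run_command,
) -> Optional[str]:
    """Return the name of the first failing check, or None if all pass."""
    for name, cmd in checks:
        if not runner(cmd):
            return name  # stop early: later checks are skipped
    return None
```

Only when `first_failure()` returns `None` is it worth spending a slower, costlier semantic review on the change.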
Step 5: Convert Every Failure into Harness Assets
When agent output fails, do not only patch the immediate bug:
- If it is a rule issue, add it to AGENTS.md
- If it is repeated execution, script it
- If it is quality drift, add it to CHECKLIST.md
Goal: prevent the same class of errors from recurring.
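The routing above can be written down as a lookup plus an idempotent append. This is only a sketch: the failure labels and the `scripts/` directory are names I invented, and `add_rule` assumes rules are stored as `- ` bullets.

```python
# Hypothetical failure labels mapped to the asset that should absorb them.
FAILURE_TARGETS = {
    "rule": "AGENTS.md",
    "repeated_execution": "scripts/",  # assumed location for helper scripts
    "quality_drift": "CHECKLIST.md",
}


def target_for(failure_kind: str) -> str:
    """Return the harness asset that should be updated for this kind of failure."""
    return FAILURE_TARGETS[failure_kind]


def add_rule(doc_text: str, rule: str) -> str:
    """Append the rule as a bullet unless an identical bullet already exists."""
    bullet = f"- {rule.strip()}"
    if bullet in (line.strip() for line in doc_text.splitlines()):
        return doc_text  # already recorded; keep the file unchanged
    prefix = doc_text.rstrip("\n")
    return (prefix + "\n" if prefix else "") + bullet + "\n"
```

Because `add_rule` is idempotent, it is safe to call after every failed round without bloating the file with repeats.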
Step 6: Force Handoff for Long Tasks
If work spans more than one context window, generate a handoff containing:
- Current goal
- Completed work
- Remaining work
- Blockers
- First step for next round
Store it in PROGRESS.md or planning files, not only in chat history.
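A fixed template keeps handoffs from degrading into free-form notes. The sketch below renders the five fields above as a markdown section. The heading layout and the `render_handoff` name are my assumptions, and the function refuses to produce a handoff without a first step.

```python
from typing import List


def render_handoff(
    goal: str,
    completed: List[str],
    remaining: List[str],
    blockers: List[str],
    first_step: str,
) -> str:
    """Render a handoff section that can be appended to PROGRESS.md."""
    if not first_step.strip():
        raise ValueError("a handoff needs a concrete first step for the next round")

    def bullets(items: List[str]) -> str:
        # An explicit "(none)" is clearer to the next session than an empty section.
        return "\n".join(f"- {item}" for item in items) or "- (none)"

    return (
        "## Handoff\n\n"
        f"### Current goal\n{goal}\n\n"
        f"### Completed work\n{bullets(completed)}\n\n"
        f"### Remaining work\n{bullets(remaining)}\n\n"
        f"### Blockers\n{bullets(blockers)}\n\n"
        f"### First step for next round\n{first_step}\n"
    )
```

The next session then starts by reading this section rather than by reconstructing state from chat history.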
Step 7: Run a Release-Grade Loop Before Merge
Before merge, run one unified cycle:
- Regression checks
- Critical user-path smoke tests
- Quick performance and error-log scan
- Agent self-review plus human spot-check
This prevents “local pass, system-level failure.”
Step 8: Weekly Harness Cleanup
Weekly maintenance:
- Remove stale rules
- Fix broken scripts
- Merge duplicate constraints
- Refresh docs index
The harness is also code. Without maintenance, it decays.
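Part of this cleanup can be automated. As one example, the sketch below flags duplicate rules in a file such as AGENTS.md. It assumes rules are written as `- ` bullets and only catches near-identical wording (case, spacing, trailing period), so rules that overlap in meaning still need a human eye. `duplicate_rules` is a hypothetical helper.

```python
import re
from collections import defaultdict
from typing import Dict, List


def normalize(rule: str) -> str:
    """Collapse whitespace, drop a trailing period, and lowercase the rule."""
    return re.sub(r"\s+", " ", rule).strip().rstrip(".").lower()


def duplicate_rules(text: str) -> Dict[str, List[int]]:
    """Map each repeated rule to the 1-based line numbers where it appears."""
    seen: Dict[str, List[int]] = defaultdict(list)
    for line_no, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        if stripped.startswith("- "):
            seen[normalize(stripped[2:])].append(line_no)
    return {rule: lines for rule, lines in seen.items() if len(lines) > 1}
```

The line numbers make it quick to open the file and merge the repeated constraints by hand.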
Minimum Viable Harness for Individuals
If you want the fastest starting point, do this:
- Write 20-50 lines of hard rules in AGENTS.md
- Ask the agent to do only one feature per iteration
- Run lint/test/build every round
- Update PROGRESS.md each round
- Convert repeated failures into rules or scripts
These five actions are usually enough to move from “using agents by feel” to “compounding engineering productivity.”
My Practical Conclusion
Harness Engineering answers one core question:
When an agent fails, do you supervise it repeatedly, or convert that failure into system capability?
The first consumes human time. The second compounds.
For normal webcoding users, the key is not having the fanciest model, but answering three questions:
- Do you have executable rules?
- Do you have automated feedback?
- Do you convert failures into deterministic advantages for the next run?
That is the real value of harness engineering.
References
- OpenAI: Harness engineering: leveraging Codex in an agent-first world
- Anthropic: Effective harnesses for long-running agents
- Anthropic: Harness design for long-running application development
- LangChain: The Anatomy of an Agent Harness
- Mitchell Hashimoto: My AI Adoption Journey
- Martin Fowler: Harness Engineering - first thoughts
- Martin Fowler: Harness engineering for coding agent users