
True Alpha Lives in the Agent Harness

Everyone is optimizing the model. True alpha lives in the harness: the infrastructure that wraps the LLM and makes it actually work.

Faizan Ahmed

March 3, 2026

Everyone is optimizing the model. Better prompts, bigger context windows, newer checkpoints. That's not where the alpha is.

We built a coding agent at Headstarter that can read your repository, understand it, make multi-file edits, validate its own work, and self-correct when it makes mistakes. The core LLM loop is about 200 lines of code. The agent harness, the infrastructure that wraps that loop and makes it actually ship code, is 10x that.

The harness is everything the model doesn't do: choosing what context to feed it, stopping it from spiraling, validating its output, managing the diffs it produces, and orchestrating the sandbox it runs in. This post is about what we learned building each of those layers.

Harness Layer 1: Context Curation

The first job of the harness is deciding what the model sees. The biggest mistake in agent design is dumping the entire repository into the context window. A typical repo has 500+ files. Sending all of them means the model drowns in irrelevant code, burns tokens on files it will never touch, and slows down every iteration.

Our agent starts with exactly four categories of files:

  1. The file currently open in the editor
  2. Files explicitly @mentioned in the prompt
  3. Files with pending code changes from the current session
  4. package.json for dependency context

That's it. Everything else? The agent fetches on demand using read_file tool calls. It starts lean and pulls in what it needs. This means the model's first iteration runs fast, with high signal-to-noise context.
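A minimal sketch of assembling that starting set, assuming a simple session-state shape (the `SessionState` interface and its field names are illustrative, not Headstarter's actual types):

```typescript
// Illustrative sketch: build the four starting context categories.
// SessionState and its fields are assumptions, not the real implementation.
interface SessionState {
  openFile: string | null;   // file currently open in the editor
  mentionedFiles: string[];  // files @mentioned in the prompt
  pendingChanges: string[];  // files edited earlier this session
}

function initialContextFiles(state: SessionState): string[] {
  const files = new Set<string>(); // dedupe across categories
  if (state.openFile) files.add(state.openFile);
  state.mentionedFiles.forEach((f) => files.add(f));
  state.pendingChanges.forEach((f) => files.add(f));
  files.add("package.json"); // always include dependency context
  return [...files];
}
```

Everything outside this set is left for on-demand read_file calls.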

There's a subtlety here too. When grep results exceed the token budget, we don't just truncate from the bottom. We group results by file path and keep a proportional number of matches per file, ensuring the model sees matches from many files rather than 50 matches from one file. We call it diversity-preserving truncation.
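One way to sketch that truncation (the names, the flat match shape, and the exact quota rule are assumptions; the idea is per-file quotas proportional to match count, with a floor of one match per file):

```typescript
// Sketch of diversity-preserving truncation: instead of cutting from the
// bottom, give each file a proportional share of the match budget.
interface GrepMatch { file: string; line: number; text: string; }

function truncateDiverse(matches: GrepMatch[], budget: number): GrepMatch[] {
  if (matches.length <= budget) return matches;
  // Group matches by file path.
  const byFile = new Map<string, GrepMatch[]>();
  for (const m of matches) {
    const group = byFile.get(m.file) ?? [];
    group.push(m);
    byFile.set(m.file, group);
  }
  // Each file keeps a share proportional to its match count, never zero.
  const kept: GrepMatch[] = [];
  for (const group of byFile.values()) {
    const quota = Math.max(1, Math.round((group.length / matches.length) * budget));
    kept.push(...group.slice(0, quota));
  }
  return kept.slice(0, budget); // rounding can overshoot slightly
}
```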

Context filter: 5 of the 22 files in the repo are sent to the LLM.

  • Currently open: src/app/api/users/route.ts
  • @Mentioned: src/lib/db.ts, src/lib/redis.ts
  • Pending changes: src/components/navbar.tsx
  • Dependencies: package.json

The agent can read_file anything else on demand. It just doesn't start with everything.

The takeaway: the alpha isn't in bigger context windows. It's in the harness layer that curates what goes into them.

Harness Layer 2: Behavioral Guardrails

Left unchecked, a coding agent will read files forever. It will grep, then read the results, then grep for something related, then read more files. It will happily burn through all 15 iterations without writing a single line of code. This is the single most common failure mode we observed.

The fix isn't better prompting. It's system-level interventions: messages injected into the conversation at specific iteration thresholds that the model never sees coming:

  • Iteration 8, Progress check: "You've used 8 of 15 iterations and haven't made any edits yet. Start editing now. You can always fix issues in later iterations."
  • Iteration 12, Urgent wrap-up: "You have 3 iterations remaining. Finish your current edits and provide a summary. Do NOT start new file reads."
  • Stall detection: If 4 consecutive iterations pass with zero edits (after iteration 6), inject a warning to stop reading and start writing.
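Those three interventions can be sketched as a single check the harness runs each iteration (the thresholds and the stall rule follow the post; the message text and the `LoopState` shape are illustrative):

```typescript
// Sketch: harness-side guardrails injected as system messages.
// LoopState fields are assumptions about what the loop tracks.
interface LoopState {
  iteration: number;         // current iteration (1-based)
  editCount: number;         // edits made so far this run
  lastEditIteration: number; // iteration of the most recent edit (0 if none)
}

function guardrailMessage(s: LoopState, maxIterations = 15): string | null {
  if (s.iteration === 8 && s.editCount === 0) {
    return `You've used 8 of ${maxIterations} iterations and haven't made any edits yet. Start editing now.`;
  }
  if (s.iteration === 12) {
    return "You have 3 iterations remaining. Finish your current edits and provide a summary. Do NOT start new file reads.";
  }
  // Stall detection: 4 consecutive no-edit iterations, active after iteration 6.
  if (s.iteration > 6 && s.iteration - s.lastEditIteration >= 4) {
    return "You are stuck in read-only exploration. Stop reading and start writing.";
  }
  return null;
}
```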

These are system messages. The LLM doesn't know they're coming. They're part of the harness, not the model. The hardest part of building agents isn't giving them tools. It's shaping their temporal behavior: knowing when to stop exploring and start acting. That's a harness problem, not a model problem.

Iteration 1 (reading):

  grep "authentication": searching for authentication patterns
  read_file src/lib/auth.ts: reading auth module (234 lines)

"Found the auth module. Let me check how sessions are managed."

Harness Layer 3: Self-Healing Validation

The harness does not trust LLM output. After every edit, it runs two deterministic validation passes:

  1. TypeScript compiler API which catches syntax errors, missing brackets, and type issues
  2. Import path validator which checks every import path against the actual file tree (augmented with newly created files from the current session)

If errors are found, the system enters a self-healing loop. It sends the broken code plus the compiler diagnostics to the LLM and asks for a corrected version. The fix is then re-validated (syntax and imports checked again) before being accepted. If the fix still has errors, it's rejected.

This is the harness at its most powerful: the LLM is sandwiched between deterministic validators. It produces code, the compiler validates, if it fails the LLM corrects, and the compiler re-validates. Only clean code makes it through. The user never sees the broken version.

src/components/card.tsx, agent output (note the typo in the first import path):

import { cn } from "@/lib/utls"; // typo: "utls" should be "utils"
import { Button } from "@/components/ui/button";

export function Card({ className }: { className?: string }) {
  return (
    <div className={cn("rounded-[4px] border", className)}>
      <h2>Card Title</h2>
    </div>
  );
}
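The loop that catches an edit like this can be sketched as a pure function. The validator and fixer are injected here; in the real system they would be the TypeScript compiler API and an LLM call, which is async. Everything in this sketch is illustrative:

```typescript
// Sketch of the validate -> fix -> re-validate sandwich.
type Validator = (code: string) => string[];             // returns diagnostics
type Fixer = (code: string, errors: string[]) => string; // returns a candidate fix

function selfHeal(code: string, validate: Validator, fix: Fixer): string | null {
  const errors = validate(code);
  if (errors.length === 0) return code;              // clean on the first pass
  const healed = fix(code, errors);                  // ask the LLM for a correction
  if (validate(healed).length === 0) return healed;  // accept only if re-validation passes
  return null;                                       // still broken: reject the edit
}
```

The caller only ever sees the return value: clean code or a rejected edit, never the intermediate broken version.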

This creates a non-obvious opportunity: companies building fast, embeddable language analysis tools like tree-sitter, Biome, and oxc become critical infrastructure. Self-healing loops create 100x more demand for sub-second syntax validation than build systems ever did.

Harness Layer 4: Diff Management

The agent doesn't produce code. The harness produces diffs. The entire user experience is built around reviewing, accepting, and rejecting file-level changes, not editing code directly.

Each edit is tracked as a change object with three versions: originalContent (from the repo), modifiedContent (the AI's version), and previousContent (for undo when the user rejects). Every AI edit batch gets a prompt trace that records whether the user accepted, rejected, or partially accepted the changes.
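As a sketch, those shapes might look like this (the three content fields come from the post; everything else, including the trace fields, is illustrative):

```typescript
// Sketch of the diff-management data model.
interface FileChange {
  path: string;
  originalContent: string; // the file as it exists in the repo
  modifiedContent: string; // the AI's proposed version
  previousContent: string; // restore target when the user rejects
}

type Outcome = "accepted" | "rejected" | "partial";

interface PromptTrace {
  prompt: string;          // the user request that produced this batch
  changes: FileChange[];   // every file edit in the batch
  outcome: Outcome | null; // recorded once the user reviews the diff
}

// Rejecting a change is just restoring previousContent.
function reject(change: FileChange): string {
  return change.previousContent;
}
```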

The harness hides all the agent's internal complexity (15 iterations, tool calls, self-healing) behind a familiar PR-review interface. The user sees a diff. They accept or reject. That's it.

The best agent interface is one where the user forgets there's an agent at all. They just see diffs.

Harness Layer 5: Tool Orchestration

The harness classifies every tool call as either read-only or mutating. Read-only tools like read_file, validate_file, grep run in parallel via Promise.all. Mutating tools like edit_file, create_file run sequentially because edit order matters.

It's a three-line classification that saves ~40% wall-clock time per iteration. In a typical iteration, the agent reads 3 files and runs a grep simultaneously, finishing in the time of the slowest call instead of the sum of all calls.
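A sketch of that dispatch (the tool names are from the post; the `ToolCall` shape and the dispatch function are assumptions):

```typescript
// Sketch: read-only tools run in parallel, mutating tools run in order.
const READ_ONLY = new Set(["read_file", "validate_file", "grep"]);

interface ToolCall { name: string; run: () => Promise<string>; }

async function dispatch(calls: ToolCall[]): Promise<string[]> {
  const reads = calls.filter((c) => READ_ONLY.has(c.name));
  const writes = calls.filter((c) => !READ_ONLY.has(c.name));
  // Reads are independent: total latency is the slowest call, not the sum.
  const readResults = await Promise.all(reads.map((c) => c.run()));
  // Writes are order-dependent: run them one at a time.
  const writeResults: string[] = [];
  for (const c of writes) writeResults.push(await c.run());
  return [...readResults, ...writeResults];
}
```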

Tool Execution Timeline

  • Sequential (naive), ~100% time: read_file, read_file, grep, edit_file, create_file run one after another
  • Parallel reads + sequential writes, ~64% time: the three reads run concurrently, then the two writes run in order

Result: ~36% wall-clock time saved.

The Full Harness

The agent loop itself (stream an OpenAI response, dispatch tool calls, accumulate edits, repeat) is roughly 200 lines of TypeScript. The harness around it:

  • Context curation covers choosing which files to include, smart grep truncation, diversity-preserving search results
  • Behavioral guardrails with two-tier budget warnings, stall detection, read-only loop prevention
  • Self-healing using TypeScript compiler + import validation + LLM fix loop with re-verification
  • Diff management with Redis persistence, per-file undo, turn-level trace tracking, outcome recording
  • Sandbox orchestration handling E2B auto-creation, file syncing, command execution, sandbox lifecycle management
  • Observability with AI call logging, prompt traces, activity audit trails, all fire-and-forget to avoid blocking the UX

Each of these harness layers has its own module, its own edge cases, and its own failure modes. They interact in non-obvious ways. Context curation affects behavioral guardrails (less context = fewer wasted iterations), self-healing affects diff management (healed files need to be re-tracked), and sandbox orchestration affects tool execution (the sandbox is auto-created on the first run_command).


The model is the simplest thing in the system. The harness is what makes it work.

If you're building in this space, don't start with the model. Start with the harness. Build your context curation. Build your validation layer. Build your diff management. Then plug in the LLM. It's the easy part. The real alpha was always in the infrastructure.