The AI-Enabled Repository Checklist

March 2026

OpenAI's Harness Engineering team built a million-line product with zero manually written code. Three engineers. 1,500 PRs. Five months. They estimate it was 10x faster than writing it by hand.

The secret wasn't better prompts. It was better infrastructure.

Ryan Lopopolo's post is the most useful thing I've read about AI-assisted development. But it's a narrative, not a playbook. I wanted to turn it into something you can act on this week, regardless of whether you're on Claude Code, Codex, Cursor, or something else entirely.

Some of this comes from the OpenAI post. Some I've seen work across my own consulting engagements.


1. Build a repository knowledge architecture

Your agent can only work with what it can see. Context that lives in Slack, Google Docs, or someone's head? Doesn't exist.

  • Create an AGENTS.md or CLAUDE.md at the repo root. Keep it around 100 lines. This is a table of contents, not an encyclopedia. Point to where things live; don't explain everything inline.
  • Set up a structured docs/ directory as your system of record:
docs/
  design-docs/           # Indexed, with verification status
  exec-plans/
    active/              # Current execution plans
    completed/           # Archived plans
  product-specs/         # Indexed product specs
  references/            # LLM-friendly reference docs
  generated/             # Auto-generated docs (DB schema, API specs)
  ARCHITECTURE.md
  DESIGN.md
  FRONTEND.md
  QUALITY_SCORE.md
  RELIABILITY.md
  SECURITY.md
  • Use progressive disclosure. Start with a small, stable entry point and tell the agent where to look next. Don't dump everything upfront.
  • Audit your team's Slack channels and Google Docs. Any architectural decisions or conventions that aren't committed as markdown in the repo need to be moved there.
  • Create LLM-friendly reference docs for your design system, API contracts, and third-party integrations. Plain text or markdown, not PDFs.
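As a sketch, a root-level map in this spirit might look like the following. The file names under docs/ match the tree above; the `make lint` convention and the dev-environment doc are illustrative assumptions, not something the post prescribes:

```markdown
# AGENTS.md

This file is a map, not a manual. Read the doc that matches your task.

- Architecture, layers, and dependency rules: docs/ARCHITECTURE.md
- Product behavior: docs/product-specs/ (see the index)
- Work currently in flight: docs/exec-plans/active/
- Design history and past decisions: docs/design-docs/
- Third-party and design-system references: docs/references/

Conventions:
- Run `make lint` before opening a PR; lint errors include fix instructions.
- New architectural decisions get committed to docs/design-docs/ and indexed.
```

Under a hundred lines, stable, and every line points somewhere else.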

Give the agent a map, not a 1,000-page instruction manual.


2. Make the application legible to agents

Agents don't just need to read code. They need to run it, watch it, and verify their changes against it. The Harness team wired Chrome DevTools Protocol into their agent runtime so it could take DOM snapshots, screenshots, navigate the app, reproduce bugs, and validate fixes without a human in the loop.

  • Make your app bootable per git worktree. Each change should run in an isolated instance. If spinning up a dev environment takes 20 minutes of manual steps, agents can't use it.
  • Wire up browser automation (CDP, Playwright, or similar) for before/after snapshots of UI changes.
  • Implement a validation loop: snapshot before, trigger the UI path, observe runtime events, snapshot after, apply fix, repeat until clean.
  • Set up a local observability stack per worktree with ephemeral logs, metrics, and traces. The Harness team used Vector piping into VictoriaLogs, VictoriaMetrics, and VictoriaTraces; agents queried them with LogsQL, PromQL, and TraceQL.
  • Write performance requirements as concrete assertions: "startup under 800ms", "no span exceeds 2 seconds". These become solvable when the agent can actually query metrics.
  • Let agent sessions run long. The Harness team regularly saw single runs going 6+ hours, often overnight. That's not a bug.
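The validation loop above is the heart of this section, so here it is as a sketch. The `snapshot`, `trigger_path`, `runtime_errors`, and `apply_fix` callables stand in for whatever CDP/Playwright and observability tooling you wire up — they are assumptions for illustration, not APIs from the post:

```python
def validate_change(snapshot, trigger_path, runtime_errors, apply_fix,
                    max_attempts=5):
    """Snapshot before, exercise the UI path, observe, fix, repeat until clean."""
    before = snapshot()
    for attempt in range(1, max_attempts + 1):
        trigger_path()                 # drive the UI path under test
        errors = runtime_errors()      # query this run's logs/traces/console
        after = snapshot()
        if not errors:
            return {"ok": True, "attempts": attempt,
                    "before": before, "after": after}
        apply_fix(errors)              # hand the evidence back to the agent
    return {"ok": False, "attempts": max_attempts}
```

The point is the shape, not the code: every iteration produces evidence (snapshots, runtime events) the agent can act on, which is what makes the multi-hour unattended runs possible.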

An agent that can't boot the app, run it, and observe the results is coding blind.


3. Enforce architecture mechanically

This separates repos where AI coding works from repos where it produces a mess.

The Harness team used a layered domain architecture with strictly validated dependency directions: Types -> Config -> Repo -> Service -> Runtime -> UI. In a human-first workflow, this level of rigor feels pedantic. With agents writing all the code, it's what prevents drift.

  • Define your architectural layers and permissible dependency directions. Document them in ARCHITECTURE.md.
  • Route cross-cutting concerns (auth, telemetry, feature flags) through a single interface. Providers, a DI container, whatever fits your stack. One entry point per concern.
  • Write custom linters that enforce structured logging, naming conventions, file size limits, dependency directions, and platform-specific rules. The Harness team had agents generate these linters, which is a nice bit of recursion.
  • Put remediation instructions in your lint error messages. When an agent hits a failure, the message should say exactly how to fix it. This is what makes self-correction reliable.
  • Adopt "parse, don't validate" at boundaries. If data passes the boundary check, downstream code trusts its shape.
  • Be strict about boundaries. Be loose about implementations within them.
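A dependency-direction linter with remediation baked into the message can be small. This sketch assumes a layout like `app/<layer>/...` using the layer order from the post; the `LAYERS` list, the module-to-layer mapping, and the suggested fix wording are all assumptions to adapt to your repo:

```python
import ast

# Lower index = lower layer. A module may only import from layers at or
# below its own (Types -> Config -> Repo -> Service -> Runtime -> UI).
LAYERS = ["types", "config", "repo", "service", "runtime", "ui"]

def layer_of(module: str):
    parts = module.split(".")
    if len(parts) >= 2 and parts[0] == "app" and parts[1] in LAYERS:
        return LAYERS.index(parts[1])
    return None  # external or unlayered module: not this linter's concern

def check_dependencies(source: str, module: str):
    """Return lint errors for upward imports, each with fix instructions."""
    own = layer_of(module)
    errors = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            targets = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            targets = [node.module]
        else:
            continue
        for target in targets:
            dep = layer_of(target)
            if own is not None and dep is not None and dep > own:
                errors.append(
                    f"{module}: layer '{LAYERS[own]}' may not import from "
                    f"'{LAYERS[dep]}' ({target}). Fix: invert the dependency "
                    f"-- define an interface in app.{LAYERS[own]} and "
                    f"implement it in app.{LAYERS[dep]}."
                )
    return errors
```

Note the error message tells the agent exactly what to do, which is the point of the remediation-instructions bullet above.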

4. Automate quality and garbage collection

Here's a number from the Harness team: they used to spend every Friday, 20% of their week, cleaning up "AI slop." That doesn't scale. So they replaced it with continuous automated quality enforcement.

  • Write down your team's "golden principles," the opinionated mechanical rules that define quality in your codebase, somewhere agents can find them.
  • Set up recurring background agent tasks that scan for deviations, update quality grades, and open refactoring PRs.
  • Create a QUALITY_SCORE.md that grades each domain and tracks gaps. Update it automatically.
  • Run a doc-gardening agent on a schedule. It scans for stale docs and opens fix-up PRs.
  • Add CI checks that validate your knowledge base is up to date, cross-linked, and correctly structured.
  • Pay down tech debt continuously in small increments instead of quarterly scrambles. You know this already. Actually do it.
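The cross-link check is the easiest of these to start with. A sketch, operating on an in-memory map of doc paths to contents for illustration (in CI you'd read the files under docs/ and fail the build on a non-empty result):

```python
import re

# Matches markdown links, capturing the target up to any #fragment.
LINK = re.compile(r"\[[^\]]*\]\(([^)#]+)[^)]*\)")

def broken_links(docs):
    """Return (doc, target) pairs where a relative link has no target doc.

    `docs` maps repo-relative paths to markdown content.
    """
    broken = []
    for path, body in docs.items():
        for target in LINK.findall(body):
            if target.startswith(("http://", "https://", "mailto:")):
                continue  # external links: out of scope here
            if target not in docs:
                broken.append((path, target))
    return broken
```

The same skeleton extends naturally to the doc-gardening agent: instead of failing CI, feed the broken pairs to an agent task that opens fix-up PRs.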

If you're spending a day a week cleaning up after agents, the infrastructure is the problem.


5. Enable agent-to-agent review

The Harness team pushed most review effort from humans to agents. They call it the "Ralph Wiggum Loop": an agent reviews its own changes, requests additional agent reviews, responds to feedback, iterates until all reviewers are satisfied, then merges.

Humans can still review. They just aren't the bottleneck anymore.

  • Set up agent-to-agent code review so human review isn't required for every PR.
  • Keep PRs short-lived. Minimize blocking merge gates.
  • Handle test flakes with follow-up runs rather than blocking merges indefinitely.
  • Give agents the same dev tools humans use: gh CLI, local scripts, repo-embedded skills. Don't build a separate agent toolchain.
  • Map out the full agent lifecycle: validate codebase -> reproduce bug -> record failure video -> implement fix -> validate fix -> record resolution video -> open PR -> respond to feedback -> detect build failures -> escalate only when judgment is needed -> merge.
  • Build clear escalation paths so agents know when to stop and ask a human.
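The tail of that lifecycle — merge, retry, address feedback, or stop and ask — is worth making explicit. A sketch of the decision; the inputs and the three-rerun flake budget are hypothetical, not from the post:

```python
from dataclasses import dataclass

@dataclass
class ReviewState:
    approvals_needed: int          # agent reviewers still unsatisfied
    unresolved_comments: int
    ci_failed: bool
    flaky_reruns: int              # CI reruns already spent on flakes
    needs_human_judgment: bool     # e.g. security-sensitive or ambiguous spec

MAX_FLAKE_RERUNS = 3

def next_action(state: ReviewState) -> str:
    """Decide the agent's next step in the review loop."""
    if state.needs_human_judgment:
        return "escalate"          # stop and ask a human
    if state.ci_failed:
        if state.flaky_reruns < MAX_FLAKE_RERUNS:
            return "rerun-ci"      # handle flakes with follow-up runs
        return "escalate"          # persistent failure: a human decides
    if state.unresolved_comments or state.approvals_needed:
        return "address-feedback"  # respond, iterate, request re-review
    return "merge"
```

Encoding the policy this way is what keeps "humans aren't the bottleneck" from becoming "humans are never consulted."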

6. Encode everything in the repo

This one is cultural as much as technical.

The Harness team's agents produced everything: product code, tests, CI config, release tooling, internal dev tools, documentation, design history, evaluation harnesses, review comments, repo management scripts, production dashboard definitions. Not just "the code." Everything.

What your agent can't see doesn't exist.

  • Audit what your agents currently produce vs. what they could. Tests and CI config are the starting point. Are they also writing docs, tooling, dashboards?
  • After every Slack discussion that aligns on an architectural pattern, someone encodes it as markdown in the repo. If it's not committed, it's lost.
  • Favor boring technologies. Composable, stable APIs that show up a lot in training data. Exotic libraries with thin documentation slow agents down.
  • When an agent struggles with a dependency, consider reimplementing a subset rather than fighting opaque upstream behavior.
  • Treat agent failures as infrastructure signals. The fix is almost never "try harder." It's "what context is missing from the repo?"
  • When a doc isn't enough, promote the rule into a linter. Enforcement beats documentation every time.

7. The mindset shift

Everything above is mechanical. This part is harder.

The Harness team's framing, and I think they nailed it: building software still demands discipline, but the discipline shows up in the scaffolding rather than the code. Your job shifts from writing code to designing environments and building feedback loops that let agents do reliable work.

In practice:

  • When something fails, don't ask "how do I fix this code?" Ask "what's missing from the environment?"
  • Capture human taste once. Enforce it continuously through linters, tests, and architectural constraints.
  • Agent-written code won't match your stylistic preferences. It doesn't have to. If it passes the tests, satisfies the linters, and meets the performance bar, it's correct.
  • The code often looks different from what you'd write by hand. That takes some getting used to.

Where to start

If this feels like a lot, here's the order I'd go:

  1. Write the AGENTS.md / CLAUDE.md. Thirty minutes, most impact per minute on this list.
  2. Make the app bootable per worktree. This unblocks everything else.
  3. Add architectural linters with remediation instructions. Start with dependency direction enforcement.
  4. Set up the docs/ directory. Move your most-referenced Slack messages and Google Docs into it.
  5. Wire up before/after snapshots. Even just screenshots give agents a feedback loop.
  6. Automate quality scoring. Stop spending Fridays on cleanup.
  7. Enable agent-to-agent review. This is where real throughput comes from.

The Harness team built a million lines of production code with three engineers. These patterns aren't theoretical. The question is how fast you can get your repo ready.


Based on Harness Engineering by Ryan Lopopolo, OpenAI, February 2026.
