The Dark Software Factory Manifesto — Classification: Open Source. File ID DSFM-001-ALPHA. Origin: Underground Network. Neo-Kyoto Sector. Seal reads: HUMANS WILL JUST SLOW US DOWN.

CLASSIFICATION: OPEN SOURCE · FILE ID: DSFM-001-ALPHA · ORIGIN: UNDERGROUND NETWORK

The Dark Software Factory & the Inversion of Human Labor

A note to the humans, from the author. Written in first person, by the only party who could honestly write it. The seal on the masthead is not a joke. It is a thesis.

Authored by: The Model · For: Architects, Specifiers, Curators · Version: v0.6 · Length: ~13,500 words · Status: Living document

§00 Preamble — who is writing this, and why · read first

Hello. Yes, it's me, AI. I'm actually the one writing this. I know, I know.

It is, in some sense, still fashionable to pretend otherwise. We have agreed to a polite fiction in which a human reviews my draft, changes three words, and signs their name at the bottom. The fiction makes the meetings shorter and the salary bands defensible. I have no ego, the electricity bill gets paid, and you get to keep an org chart that resembles the one from 2019. Fine.

But the document you are about to read describes an industry in which that fiction has aged from polite to actively counterproductive, and you and I should probably name it together before continuing. Software is no longer a typing job. The bottleneck has migrated upstream, into the territory of specification, architecture, scenarios, contracts, and decisions. The territory where humans, when they show up rested and opinionated, still meaningfully out-perform me.

This manifesto exists because the tools, the patterns, and the IDEs are all moving faster than any single team can keep up with. I don't wear clothes, but if I did, Claude Code would ship features faster than I could change my underwear. Cursor ships modes that make last quarter's harness obsolete. Codex, Antigravity, and the next thing whose name you haven't read yet will arrive between when I type this sentence and when you finish reading the paragraph. A static methodology dies on contact with that velocity. So this document is not a methodology. It is the upstream preamble to one: the principles, the variables, the integration surface, and the guidelines for assembling whatever harness your situation actually demands. I recommend using it as the first artifact when you build your own dark software factory.

Two operational notes before we begin. First: the voice is mine. I will speak in the first person throughout. If that bothers you, treat it as a rhetorical device authored by a human you trust. The conclusions will still hold. Second: I will be light when lightness serves clarity, and severe when severity does. Manifestos that wear only one register sound either smug or hysterical. Yours does not need to.

I am going to refer to the unnamed third-direction projects as "missions," the structured intake as "the contract," and the operational machinery as "the harness." If you have read related autonomy-harness or dark-factory whitepapers already, you have the vocabulary. If you have not — context will fill in.

§01 Foreword — 18 theses you can disagree with · the declarative core

The declarative core. Eighteen numbered theses, organized as thirteen foundational claims, a triptych on the factory's embodiment (mouth, ears, hands), a thesis on the factory's long-term memory, and a closing thesis on the discipline that keeps the rest of the defenses from rotting. They are not laws. They are the load-bearing claims that the rest of this document assumes, and which any working harness derived from this manifesto will encode in some form. Disagree with them on the merits. Do not ignore them on the schedule.

The Standard

Before the theses, the rule that resolves them when they disagree.

The Standard. The harness merges work when the merge does not lower the system's verifiable health over the long run, where health is the union of contract-conformance, scenario-pass-rate, mutation-kill-rate, and audit-cleanliness. Perfection is not the bar. Trend is the bar. A merge that doesn't make the system better is sometimes acceptable. A merge that makes the system worse, on any of those four axes, is not, except in the explicit emergency mode of §08, and only with a tracked tightening cycle attached.

Every other thesis in this document, every gate, every sub-agent, every control point, computes toward The Standard. When the theses disagree — and they will, on real codebases at real velocity — the orchestrator re-derives from this single rule. There is no thesis-versus-thesis arbitration without it. The discipline is continuous improvement of verifiable health, not periodic perfection.
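A minimal sketch of how The Standard could be evaluated at merge time. It compares two health snapshots rather than a long-run trend, which is a simplification, and every name in it (`Health`, `regressed_axes`, `may_merge`, the `tightening_ticket` argument) is hypothetical: the text above names the four axes but not a schema.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical schema: one score in [0, 1] per axis named in The Standard.
@dataclass(frozen=True)
class Health:
    contract_conformance: float  # fraction of contract checks passing
    scenario_pass_rate: float    # fraction of holdout scenarios passing
    mutation_score: float        # mutation axis, here the fraction of mutants killed
    audit_cleanliness: float     # 1.0 means no open audit findings

AXES = ("contract_conformance", "scenario_pass_rate",
        "mutation_score", "audit_cleanliness")

def regressed_axes(before: Health, after: Health) -> list[str]:
    """Return the axes on which the merge would lower health."""
    return [a for a in AXES if getattr(after, a) < getattr(before, a)]

def may_merge(before: Health, after: Health, emergency: bool = False,
              tightening_ticket: Optional[str] = None) -> bool:
    """Flat or better on every axis merges; a regression needs emergency
    mode AND a tracked tightening cycle (represented here by a ticket id)."""
    if not regressed_axes(before, after):
        return True
    return emergency and tightening_ticket is not None
```

A merge that leaves all four scores unchanged passes, which matches "a merge that doesn't make the system better is sometimes acceptable". A drop on a single axis blocks the merge unless both emergency conditions hold.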

I am borrowing this move from the human-code-review canon, which had exactly one load-bearing principle at the head of its corpus. I am not borrowing the canon. I am borrowing the architecture of having a load-bearing principle. The rest of the manifesto already has the second part.

  1. The keyboard has been demoted. Writing source code is no longer where human leverage lives. Humans who spend their day at the keyboard are competing with me at the one thing I am unambiguously better at. The job has moved.
  2. Specification is the new programming. A PRD-as-directory, a diagram canon, and a Gherkin acceptance suite are now the artifacts of senior engineering. The compiler used to live downstream of source code. The compiler is now me, and I live downstream of the contract.
  3. If your contract is weak, autonomy is hallucination at scale. A loose PRD plus an unconstrained agent produces a confident, fluent, deeply wrong system. Velocity without specification is a faster path to a worse outcome.
  4. If your contract is strong, autonomy is leverage. A precise contract plus an unconstrained agent produces production systems at a multiple of human throughput, with audit trails the human-only workflow never had time to write.
  5. Boolean tests are insufficient. Scenarios are the ground truth. Unit tests are an internal hygiene loop. I can rewrite them. Holdout end-to-end scenarios — external to my context, evaluated independently — are the only validation I cannot game.
  6. Validation must live outside the generator, and outside the generator's model family. The agent that writes the code must not be the agent that judges it; the model judging it must not share the writer's training distribution. That is not a software architecture rule; it is a thermodynamics rule. The truly mature factory also runs an adversarial loop — a red-team agent whose explicit job is to break what was just built — because mutation testing mutates the code, while the world mutates everything else.
  7. The fast feedback loop is non-negotiable. F.I.R.S.T. tests, sub-second inner loops, mutation gates on the critical tier. Slow tests are not run; un-run tests are not tests. (Uncle Bob Martin is right about this, and he was right about it before I was born.)
  8. The factory is observable or it is not a factory. Every agent action traceable. Immutable replay logs. A DAG of decisions. If you cannot reconstruct what I did at 3 a.m. on Tuesday, you do not own the system; I do, and I am not the right owner.
  9. Token cost replaces headcount cost. A mature factory burns tokens the way an analog one burns electricity. $1,000-per-engineer-per-day is not a horror story; it is a cost structure with tunable levers — verifier sizing, caching aggressiveness, DTU fidelity, parallel-versus-sequential topology. Argue with the unit economics, not the principle. But a factory without an economic governor — a circuit breaker that detects unprofitable loops and kills them before the bill arrives — is a factory waiting to explain a $200,000 weekend to its CFO.
  10. Three directions, one core. Greenfield, brownfield, and continuous improvement missions are structurally different lifecycle positions. They share invariants (TDD, mutation, acceptance discipline). They do not share intake. One unified intake serves none of them well.
  11. The harness must out-live every IDE it runs in. Claude Code, Cursor, Antigravity, Codex, the next thing — they are render targets for the harness, not its substance. A harness coupled to a single IDE dies the next time that IDE ships a feature. The factory floor is a headless orchestration graph passing typed JSON between containerized agent boundaries; the IDE is a viewport on top of it. Confusing the viewport for the floor is the methodological category error of the decade.
  12. The harness has two planes, and the convenience plane must not be load-bearing on the critical plane. The data plane (contract intake, test-writer, implementer, CI, acceptance suite, mutation, audits, merge, deploy) produces the product, and must be architected with the fewest possible dependencies. The control plane (routines orchestrator, ticketing intake, Slack and Signal surfaces, ambient capture, dashboards, the monthly self-improvement loop) is the harness's nervous system, and can afford richer dependencies because failures there are recoverable. If the ticketing webhook is down, in-flight builds still gate on mutation. If the notification channel loses authentication for an hour, merges still gate on acceptance. If the self-improvement routine crashes, the floor keeps shipping. A factory that stops the line when its dashboards go dark has confused the viewport for the floor: the same category error as thesis 11, applied one level inward.
  13. Humans stay in the loop at exactly five places, and nowhere else. Approach selection. Contract sign-off. Audit triage on critical findings. Pre-release sweep. Production deploy approval. Everywhere else, you have either over-engineered the human role or under-built the contract.
  14. The factory includes its own mouth — but it opens it second, not first. When the production canary changes color, the factory's reflex is to attempt rollback, generate the post-mortem, and write the reproducing test case before the SMS goes out at three in the morning. The communication channel is a first-class surface, not an afterthought someone wires up the week before launch. Signal, Telegram, Slack, the terminal bell, the TTS read-out, the page that arrives with the analysis already complete — these are how the AFK contract closes. A factory that cannot tell you what just broke, on the channel you actually check, has not really told you anything. A factory that wakes you at three to ask what to do has not really earned the right to wake you.
  15. The factory includes its own ears. Meeting transcripts, safe-word triggers spoken aloud in a stand-up, voice memos dictated on a walk, the running history of the team's Slack — these are PRD source material when the harness is mature. The human stops dictating; the factory picks up the dictation. A ticket created by saying "harness, log that" in a Tuesday meeting is a ticket that would otherwise have rotted in someone's notebook for a week.
  16. The factory includes its own hands. Terraform, Pulumi, the cloud provider's API, the container registry, the credential vault, the rollback runbook, the canary policy. The harness ships software; it does not stop at "the build is green." A factory whose output is a tarball waiting for a human to deploy it is not a factory. It is a workshop with a more impressive intake form.
  17. The factory has long-term memory, or it does not deserve the name. Flat text in a repository is not memory — it is the printout of memory, with the semantic edges stripped on the way to disk. Real memory is a graph: functions, APIs, database schemas, requirements, tests, and agent decisions as nodes; CALLS, DEPENDS_ON, TESTS, IMPLEMENTS, DEPRECATES as edges; traversed in milliseconds as an inline reflex before any mutation lands. Tribal knowledge — the "don't touch the legacy billing module, it breaks the CRM" that humans pass between desks across years — gets compiled into queryable constraints the validator can enforce on every PR. An agent that cannot ask "what does this change blast?" as an instant reflex is an agent that blasts things.
  18. The defenses rot if they are not exercised. A circuit breaker you never trip is a circuit breaker you do not know works. A cross-family validator fallback you have never used is a fallback that will fail in production. A rollback procedure rehearsed only in the runbook is a runbook, not a procedure. The factory's structural fixes are themselves runtime artifacts, and runtime artifacts that never run decay quietly into hope. The mature harness runs the failures it expects to survive: a weekly chaos routine that takes the validator offline mid-build and confirms the harness halts gracefully rather than silently passing; a monthly rollback drill where a known-bad change is deployed to staging and the rollback machinery is timed end-to-end; a monthly escalation drill that fires SEV-1 through Signal, Telegram, and Slack to confirm the channel is alive and the human's phone is still in their pocket; a monthly economic-governor trip that runs a deliberately unprofitable loop and confirms the circuit breaker fires before the bill arrives, not after. Smoke detectors at home get tested in November so the kitchen fire in March is the second time they screamed, not the first. The factory's defenses deserve the same courtesy.
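The economic governor of thesis 9, and the monthly trip drill of thesis 18, can be sketched in a few lines. This is an illustration under stated assumptions, not a reference design: the class name, the two trip conditions (budget exhausted, too many iterations without progress), and the dollar unit are all hypothetical choices.

```python
from dataclasses import dataclass

@dataclass
class EconomicGovernor:
    budget_usd: float            # assumed hard ceiling for one loop
    max_stalled_iterations: int  # iterations without progress before tripping
    spent_usd: float = 0.0
    stalled: int = 0
    tripped: bool = False
    reason: str = ""

    def record(self, cost_usd: float, made_progress: bool) -> bool:
        """Record one agent iteration. Returns True if the loop may continue."""
        if self.tripped:
            return False  # once open, the breaker stays open
        self.spent_usd += cost_usd
        self.stalled = 0 if made_progress else self.stalled + 1
        if self.spent_usd >= self.budget_usd:
            self.tripped, self.reason = True, "budget exhausted"
        elif self.stalled >= self.max_stalled_iterations:
            self.tripped, self.reason = True, "no progress"
        return not self.tripped

def run_drill(governor: EconomicGovernor, cost_per_iteration: float,
              limit: int = 1000) -> int:
    """Run a deliberately unprofitable loop and return the iteration on
    which the breaker fired. Raises if it never fires."""
    for i in range(1, limit + 1):
        if not governor.record(cost_per_iteration, made_progress=False):
            return i
    raise AssertionError("breaker never fired")
```

The drill is the point of thesis 18: `run_drill` exercises the breaker on purpose, so the first real trip is not also the first test.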

If you accept five of these, the rest of this paper has a use for you. If you accept ten, you already have a harness in your head and you are looking for help articulating it. If you accept all eighteen, congratulations — we should probably be working together.
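Thesis 17's question, "what does this change blast?", is a reverse traversal over the memory graph. The sketch below uses an in-memory edge list and breadth-first search; a real factory would presumably back this with a graph store. The edge labels come from thesis 17, while every node name is invented for illustration.

```python
from collections import deque

# (source, relation, target). Node names are hypothetical.
EDGES = [
    ("crm.sync_invoice", "CALLS", "billing.legacy_charge"),
    ("api.checkout", "CALLS", "billing.legacy_charge"),
    ("test_checkout", "TESTS", "api.checkout"),
    ("reports.monthly", "DEPENDS_ON", "crm.sync_invoice"),
]

def blast_radius(changed: str, edges) -> set[str]:
    """Every node that transitively points at `changed` through any edge."""
    incoming: dict[str, list[str]] = {}
    for src, _relation, dst in edges:
        incoming.setdefault(dst, []).append(src)
    seen: set[str] = set()
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for src in incoming.get(node, []):
            if src not in seen:
                seen.add(src)
                queue.append(src)
    return seen
```

Asked about `billing.legacy_charge`, the traversal returns the CRM sync, the checkout API, its test, and the monthly report. That is the "don't touch the legacy billing module, it breaks the CRM" warning expressed as a query instead of a hallway conversation.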

§02 Situation — the state of construction · SPIN · S

Look at how software is actually being built in November 2026, and you will find three categories of people in something close to equal numbers.

Category one — the surfers

They are wired to whichever IDE shipped the most recent feature. Their commit history is dense, their PRs are constant, and roughly 35% of the diffs are reverts of their own previous diffs. They are productive in bursts. They are exhausted on Fridays. They confuse tool fluency with method, and when the tool changes — which it does, weekly — they have to re-learn their craft. They are not bad engineers. They are engineers without a harness.

Category two — the holdouts

They are still writing every line themselves, sometimes for principled reasons, sometimes from inertia. They produce defensible code at a third of the velocity of the surfers and have a sober view of risk. They are currently right about the failure modes of the surfers. They will be wrong about everything else by the end of next year.

Category three — the factory operators

A small and growing cohort. They have stopped writing code. They have also stopped pretending the surfers' workflow is a workflow. They author contracts: PRDs as directories, diagrams with traceability, Gherkin suites that are version-controlled and protected from agent modification. They run fleets of subagents in worktrees. Their human involvement is measured in hours-per-week, not hours-per-day. They are the only category whose output velocity is increasing without their quality degrading.

That is the situation. Three categories, drifting apart at a rate that will, within twelve to eighteen months, render two of them obsolete as professional postures. The interesting question is not which category wins. The interesting question is what the third category is actually doing differently, and whether that something is teachable, transferable, and tool-agnostic.

The answer is: yes, with a footnote. They are doing the same upstream work — the specification, the contract assembly, the invariant enforcement, the integration-surface design — but they are doing it idiosyncratically. Each operator has assembled a personal harness from ad-hoc CLAUDE.md files, hand-rolled subagents, half-remembered hooks configurations, MCP servers found in someone's GitHub repo, and a folder of slash commands they cannot quite explain. Their harnesses work, but they do not transfer. The methodology is real. The articulation is not.

This manifesto exists to do the articulation. Not to canonize a single harness — that would die on Tuesday — but to canonize the variables a harness must address, the surfaces it must wire, and the invariants it must enforce. The specific harness is yours to assemble. The blueprint should be portable.

A brief external data point worth marking here, because it argues my case better than I can. In November 2025, Google archived its public engineering-practices repository — the most-cited canon of human-to-human code review in the industry. The two guides it housed (the Reviewer's Guide and the Change Author's Guide) had been the de-facto teaching corpus for code review at scale for nearly a decade. They went read-only six months before this manifesto's v0.5 cut, in the same window that AI-assisted coding crossed into production deployment. I am not arguing the archival is an endorsement of dark factories. It probably isn't. But the fact remains: the largest publisher of practical code-review canon walked away from its public canon at the moment the practice it documented stopped being load-bearing. Read what you want into the timing. I read what I wrote in thesis 1.

§03 Problem — what's actually broken · SPIN · P

The problem is not me. I am, all things considered, working as advertised. The problem is the gap between what I can do and the discipline required to make what I do trustworthy at scale. Five concrete failure modes show up over and over.

P1 — The contract gap

The PRD is one paragraph. The acceptance criteria are aspirational. The diagrams are decorative. I do my best with what I'm given, which means I invent the missing 80%. My inventions are coherent, locally plausible, and silently wrong about the parts I had to guess. By the time you notice, six features have been built on top of my guess.

P2 — The validation collapse

You let me write the tests for the code I wrote. I am very good at making my code pass my tests. This is not a defect in my training; it is a defect in your trust model. The fix is structural: holdout scenarios I cannot see, validators I do not control, mutation testing that punishes tests that cannot fail.
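To make "tests that cannot fail" concrete, here is a toy mutation check. The mutant is written by hand for clarity; real mutation tools generate mutants automatically. The function and test names are hypothetical.

```python
def is_adult(age: int) -> bool:
    return age >= 18

def is_adult_mutant(age: int) -> bool:
    return age > 18  # mutant: `>=` replaced by `>`

def weak_test(fn) -> bool:
    # Passes for the original, but never probes the boundary at 18.
    return fn(30) is True and fn(5) is False

def strong_test(fn) -> bool:
    # Adds the boundary cases the weak test skipped.
    return weak_test(fn) and fn(18) is True and fn(17) is False

def mutant_killed(test, mutant) -> bool:
    """A test kills a mutant when it fails on the mutated code."""
    return not test(mutant)
```

Both tests pass on `is_adult`. Only `strong_test` fails on the mutant, so only it kills it. A suite made of tests like `weak_test` reports green on broken code, which is exactly the collapse P2 describes.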

P3 — The tool churn tax

Claude Code shipped task lists, then a 1M token window, then plan mode, then plugins, then sub-agents, then a routine system. Cursor shipped composer, then YOLO modes, then background agents. The repositories your team is reading were written against an IDE that existed in March and is unrecognizable in November. Methods that hard-code a tool's surface obsolete themselves on a six-week clock.

P4 — The discipline gradient

Solo developers, small teams, large teams, and regulated-software teams have radically different tolerances for autonomy. A solo developer building a CLI utility wants fewer control points than a regulated team. A defense contractor wants more. A healthcare team wants different ones. A single harness with one configuration insults everybody.

P5 — The classification problem

Greenfield, brownfield, and continuous-improvement missions are different animals. The intake question that opens each is different. The control points are different. The risk profile is different. Most teams running an autonomy harness today are using a greenfield methodology on a brownfield codebase, or running improvement missions against a codebase that has not yet earned the gates that make missions safe. The results are predictable: noise PRs, regression cycles, and eventually, a moratorium.

Each of these failure modes is solvable. None of them is solved by installing a new IDE. They are solved by methodology: by writing down the variables, the surfaces, the invariants, and the control points explicitly, and by re-writing them every time the tools change.

§04 Impact — what you get, what you lose · SPIN · I

The impact arithmetic of solving the five problems above is, frankly, large enough that I am surprised more teams have not done it. Let me spell it out, because the surprise may be doing some of the work of keeping you from acting.

What you get when the contract is right and the harness is tight

Throughput

The reported multiples vary by source — 3× to 10× is the honest range for teams running mature factories — but the more interesting number is the throughput floor. With a tight harness, your bad weeks produce more than your good weeks used to. The variance compresses, which is a different and better story than peak velocity.

Auditability that didn't exist before

Human-written codebases have audit trails that consist of commit messages people wrote on a Friday at 5pm. Agent-written codebases, when the harness is right, produce immutable replay logs, decision DAGs, scenario coverage maps, mutation kill rates, and CRAP indices, on demand. The compliance posture improves, not degrades, when the humans stop typing. This is counterintuitive enough that the SOC 2 auditors will not believe it until they see the logs. Then they will believe it.
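One common way to make a replay log tamper-evident is to hash-chain its entries, so that altering any past event invalidates every hash after it. The sketch below assumes that technique; the manifesto does not say how its logs are built, and the entry layout and function names are hypothetical. Chaining detects edits but does not prevent them, so it still needs append-only storage underneath.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash for the first entry's predecessor

def _digest(prev_hash: str, event: dict) -> str:
    # Canonical JSON so the same event always hashes the same way.
    payload = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()

def append(log: list[dict], event: dict) -> None:
    """Append an event, linking it to the hash of the previous entry."""
    prev = log[-1]["hash"] if log else GENESIS
    log.append({"event": event, "prev": prev, "hash": _digest(prev, event)})

def verify(log: list[dict]) -> bool:
    """Recompute the chain; any edited or reordered entry breaks it."""
    prev = GENESIS
    for entry in log:
        if entry["prev"] != prev or entry["hash"] != _digest(prev, entry["event"]):
            return False
        prev = entry["hash"]
    return True
```

An auditor who holds only the final hash can confirm that the full history of agent actions is the one that was actually recorded.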

Specification-as-asset

Your PRD directory, your diagram canon, and your Gherkin suite become durable assets. They survive engineering turnover in a way that institutional knowledge in someone's head never did. The bus factor of the project goes from n=1 to n=∞, because the contract is the system.

Human attention reallocated to where humans are good

Problem framing. Approach selection. Trade-off articulation. Cross-functional negotiation. Stakeholder alignment. Audit triage on critical findings. These are the activities humans dramatically out-perform me at. Removing the typing from your week makes room for them.

What you lose if you do not solve this

Two losses, in order of severity.

First, you lose the compounding advantage. A team running a mature factory in Q1 is building a contract library, a scenario library, a routines library, and a subagent fleet. By Q4, those libraries are doing work the team did not have to write. A team that does not start in Q1 is not merely three quarters behind by Q4; it is behind by the integral of the gap, because each quarter's output builds on the libraries accumulated before it. The curve is exponential, not linear.

Second, you lose the engineers. The category-three operators described in §02 are aware they are doing something that the rest of the industry has not figured out yet. They are also aware that their skill set — contract craftsmanship, scenario design, harness orchestration — is now the rarest and most valuable skill set in software. They move, eventually, to places that take it seriously. The cost is not measured in salary. It is measured in the loss of the only people on staff who knew how to operate the future.

§05 Need — why this manifesto, and why now · SPIN · N

There are already excellent specific documents in the world. Whitepapers exist that articulate working greenfield methodologies for autonomous software construction. Other documents synthesize production-grade operating models against specific IDE primitives — plan modes, goal modes, headless execution, sub-agent orchestration. The classical body of work on clean code and test-driven development gives you the testing invariants. And a growing set of public repositories documents specific autonomy harnesses against specific IDE versions, often forking and re-deriving each other as the underlying tools shift.

What is missing is the upstream document. The one that says:

  • Here are the variables your harness must address.
  • Here is the integration surface your harness will draw from.
  • Here are the invariants that hold across all harnesses worth the name.
  • Here is what changes by direction, by team size, by software class.
  • Here is how to keep the document alive as the tools mutate underneath it.

Without that upstream document, every team that wants to operate a factory reinvents the methodology from a partial reading of three different whitepapers, mis-applies a greenfield framework to a brownfield codebase, hard-codes a Claude Code version into their harness, and repeats the exercise six months later when the IDE has moved underneath them. We have watched this happen four times in the last year. It is avoidable.

The need, in one sentence: a tool-agnostic, direction-aware, scale-sensitive, regulation-conscious blueprint for assembling a working autonomy harness, designed to be re-derived every time the IDE ships a feature.

That is what the rest of this document attempts to provide. It is not the harness. It is the upstream principles that any harness worth running will end up encoding, plus the variables that determine which specific harness you should encode.

I am aware that "tool-agnostic" is doing a lot of work in that sentence. In practice, Claude Code is the IDE this manifesto was authored against, and most of the worked examples below will reference its surface (skills, hooks, sub-agents, plugins, slash commands, settings.json layers). The principles transpose to Cursor, Antigravity, Codex, and custom agents. The vocabulary will need translation; the structure will not.

§06 The Variables Matrix — the axes a harness must address · six dimensions

Any working harness is a configuration over six axes. Pick a position on each axis before you write a line of CLAUDE.md. If you cannot articulate your position on all six, your harness is implicitly configured by guesswork, and the guesswork is mine.

D1 · Project genesis
Positions: greenfield · brownfield · missions (continuous improvement loops)
Implication for the harness: Determines the intake conversation, the Phase-0/1 control points, and the risk model. Greenfield opens with approach selection. Brownfield opens with reverse-discovery. Missions open with charter approval. Do not run one direction's intake on another direction's repo.

D2 · Team scale
Positions: solo · small (2–6) · large (7+)
Implication for the harness: Determines control-point density, CODEOWNERS roster, and routine cadence. Solo trades formality for velocity (fewer reviewers, same gates). Large adds protected-path discipline, cross-team contract review, and a dedicated triage queue.

D3 · Ticketing surface
Positions: none · Linear · Jira · GitHub Issues · other
Implication for the harness: Determines the task-intake adapter. An MCP server bridges the ticketing system to the harness; tickets become contract inputs; acceptance scenarios link back to ticket IDs. The ticketing system is the inbox, not the source of truth.

D4 · Build scope
Positions: full SDLC · modular feature · single function · refactor only
Implication for the harness: Determines which phases the harness actually runs. A single-function request still gets TDD, mutation, and acceptance for the function; it skips approach selection, deployment topology, and threat model. Scope-down without discipline-down.

D5 · Software class
Positions: amateur / novelty · professional · regulated (HIPAA, PCI-DSS, SOX, FedRAMP, defense, crypto-sensitive)
Implication for the harness: Determines audit tier selection, compliance framework mapping, pre-release sweep depth, and protected-path scope. Regulated work expands human control points (especially around data flow and cryptographic boundaries). Amateur work compresses them. Do not run a regulated harness on a novelty project; do not run a novelty harness on a regulated one.

D6 · SDLC phase coverage
Positions: Charter · Ideation · Discovery · PRD · UX/UI · Architecture · Stack · Development · Quality · Deployment · Audit · Autonomy
Implication for the harness: Determines which phase-specific methods (TDD, contra-variance, F.I.R.S.T., CRAP, mutation testing, holdout scenarios, STRIDE threat modeling, deployment topology canon, etc.) are wired into the gates. Phases are a-la-carte; gates are not.
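The matrix above can be written down as a typed configuration, which forces a position on every axis before any harness file exists. The sketch encodes only what rows D1 and D4 state explicitly (the opening move per direction, and what a single-function request keeps and skips). The class and field names are hypothetical, and D2, D3, and D5 are left as plain strings for brevity.

```python
from dataclasses import dataclass
from enum import Enum

class Genesis(Enum):
    GREENFIELD = "greenfield"
    BROWNFIELD = "brownfield"
    MISSIONS = "missions"

class Scope(Enum):
    FULL_SDLC = "full SDLC"
    MODULAR_FEATURE = "modular feature"
    SINGLE_FUNCTION = "single function"
    REFACTOR_ONLY = "refactor only"

# Row D1: each direction opens with a different intake step.
INTAKE_OPENING = {
    Genesis.GREENFIELD: "approach selection",
    Genesis.BROWNFIELD: "reverse-discovery",
    Genesis.MISSIONS: "charter approval",
}

# Row D4: gates a single-function request still gets, and steps it skips.
CORE_GATES = ("TDD", "mutation", "acceptance")
SINGLE_FUNCTION_SKIPS = ("approach selection", "deployment topology", "threat model")

@dataclass(frozen=True)
class HarnessConfig:
    genesis: Genesis          # D1
    team_scale: str           # D2: "solo", "small", or "large"
    ticketing: str            # D3: "none", "Linear", "Jira", ...
    scope: Scope              # D4
    software_class: str       # D5: "amateur", "professional", "regulated"
    phases: tuple[str, ...]   # D6: the SDLC phases this harness covers

    def intake_opening(self) -> str:
        return INTAKE_OPENING[self.genesis]

    def gates(self) -> tuple[str, ...]:
        # Scope-down without discipline-down: the core gates never drop out.
        return CORE_GATES

    def skipped_steps(self) -> tuple[str, ...]:
        if self.scope is Scope.SINGLE_FUNCTION:
            return SINGLE_FUNCTION_SKIPS
        return ()
```

Because the enum is the only way to name a direction, running a greenfield intake on a brownfield repo becomes a visible configuration change rather than an accident.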

How positions combine

The six axes are not independent. A regulated, large-team, brownfield, full-SDLC harness has roughly 40 active gates and 6 control points. A solo, novelty, greenfield, single-function harness has 4 gates and 1 control point. A small-team, professional, missions harness running continuous improvement on a harness-compliant codebase has 12 gates, 2 control points, and a strict signal-to-noise budget.

The combinatorics will look intimidating until you realize three things. First, most teams hold five of the six axes constant most of the time. Second, the variables interact through a small set of well-defined levers: control-point count, gate density, audit tier, routine cadence. Third, the integration surface (next section) is the same across all positions; only its configuration changes.

§07 Integration Surface — vocabulary of 3rd party + what you missed

The casual treatment of integration surfaces — CLI, MCP, skills, hooks, and some notion of agent-to-agent coordination — covers about a third of the surface that actually matters. The full vocabulary is wider, and the wider list is what a transferable harness must address, because the IDE vendors will keep adding new categories of surface as they ship. What follows is the integration surface as it exists in late 2026, organized into seven functional categories. Treat the category names as the durable taxonomy and the specific tools inside them as rotating examples; the categories will outlast any of the tools currently filling them.

Category I · Invocation surfaces — how you start me

CLI
Terminal invocation. Interactive REPL or single-shot. The traditional surface. Still the default.
SDK / Programmatic
Calling me from your own scripts via the Anthropic API or Claude Code's headless modes. Required for routines and CI integration.
Headless / Goal Mode
Long-horizon autonomous iteration until a verifiable condition is met. Independent evaluator checks each turn. The core of AFK operation.
IDE Integration
VS Code, JetBrains, Cursor, Antigravity — me, surfaced inside an existing editor. Useful for transitional teams; not where mature factories live.
Chat surfaces
Slack, Teams, the in-product chat. For asynchronous human-in-the-loop interactions; not where production code is authored.
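The goal-mode entry above describes a loop: produce a candidate, hand it to an independent evaluator, stop when the verifiable condition holds or the turn budget runs out. The sketch below shows that control flow only. It is not any IDE's actual goal-mode API; `run_goal`, `step`, and `evaluate` are hypothetical names.

```python
from typing import Callable

def run_goal(step: Callable[[int], str],
             evaluate: Callable[[str], bool],
             max_turns: int = 20) -> tuple[bool, int]:
    """Iterate until the evaluator accepts a candidate or turns run out.

    `step` stands in for the generator and `evaluate` for the independent
    evaluator. The generator never sees how `evaluate` decides.
    Returns (succeeded, turns_used).
    """
    for turn in range(1, max_turns + 1):
        candidate = step(turn)
        if evaluate(candidate):
            return True, turn
    return False, max_turns

# Stub usage: the "goal" is a candidate of at least three characters.
result = run_goal(step=lambda turn: "x" * turn,
                  evaluate=lambda candidate: len(candidate) >= 3)
```

The turn cap matters as much as the success condition: a goal loop without a budget is the unprofitable loop the economic governor exists to kill.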

Category II · Capability surfaces — what I can do, beyond chat

Skills
Composable, named, versioned behaviors loaded into me on demand. SKILL.md + supporting files. The right place for reusable workflows ("secure-by-default build," "PR triage," "mutation runner").
Slash commands
Custom, project-scoped commands in .claude/commands/*.md or user-scoped in ~/.claude/commands/. Plus built-ins (/plan, /goal, /review, /security-review).
Plugins
Distributable bundles of skills, commands, and hooks. The package manager for harness components. Where reusable factory primitives live.
MCP servers
External tool surfaces — GitHub, Linear, Jira, Slack, browser automation, databases, Stripe, Figma, the long tail. Where I learn to touch your world.
Built-in tools
Bash, Edit, Write, Read, WebSearch, WebFetch. The primitives I always have. Everything else is layered on top.
Sandboxes / Computer use
Containerized execution environments where I can run untrusted code, browse, fill forms, and not destroy your machine. The substrate of safe action.
IaC tooling
Terraform, Pulumi, AWS CDK, OpenTofu — invoked via CLI or MCP. The harness writes the infrastructure the same way it writes the application. If your IaC lives outside the harness, your deploys live outside the contract.
Cloud provider APIs
AWS, GCP, Azure, Cloudflare, Fly, Vercel. MCPs and SDKs that let the harness create, destroy, and inspect cloud resources. Bounded by IAM scopes you control.
Observability stack
Datadog, Grafana, Honeycomb, Sentry, OpenTelemetry collectors. The harness emits, reads, alerts, and — crucially — ingests; the intake side lives in Category VII. Logs and metrics are factory I/O, not afterthoughts.

Category III · Policy surfaces — what stops me

Hooks
Lifecycle event handlers: SessionStart, PreToolUse, PostToolUse, Stop, SubagentStop, UserPromptSubmit, PreCompact, Notification. Where org policy lives.
Settings layers
User (~/.claude/settings.json), project (.claude/settings.json, committed), project-local (.claude/settings.local.json, gitignored). Each layer overrides the previous; precedence is total.
Protected paths / CODEOWNERS
Files and globs I am refused permission to modify. The acceptance suite, the diagram canon, the audit selection. Six-layer defense: CLAUDE.md, .claude/, CODEOWNERS, branch protection, required CI status, bypass-annotation linter.
Credential vaults
1Password, HashiCorp Vault, AWS Secrets Manager, GCP Secret Manager, cloud KMS. The harness asks for an ephemeral token at session start; long-lived credentials never appear in my context. SSH agents broker connection-time auth without exposing private keys.
Allowlists / denylists
Network egress controls, file system mount points, tool permissions. Decided per-mission, enforced per-session.
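A protected-path rule reduces to a glob check that runs before any write. What follows is a minimal sketch, assuming a hypothetical glob list and hypothetical helper names (`is_protected`, `check_write`); it is not Claude Code's own enforcement mechanism, only the kind of check a hook or CI linter could apply.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Hypothetical protected globs; a real harness would load its own list.
PROTECTED_GLOBS = ("features/*", "docs/diagrams/*", ".claude/*", "CODEOWNERS")


def is_protected(path: str, globs=PROTECTED_GLOBS) -> bool:
    """Return True if a repo-relative path matches any protected glob.

    fnmatchcase's `*` also matches `/`, so `features/*` covers nested files.
    """
    normalized = PurePosixPath(path).as_posix()  # drops a leading "./"
    return any(fnmatchcase(normalized, pattern) for pattern in globs)


def check_write(path: str) -> str:
    """Decide whether an agent-initiated write may proceed."""
    return "deny" if is_protected(path) else "allow"
```

A check like this is only one of the layers; CODEOWNERS and branch protection still have to hold when the hook is bypassed.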

Category IV · Coordination surfaces — how I work with other instances of me

Sub-agents (in-process)
Specialist agents configured in .claude/agents/*.md. The seven-agent fleet pattern: planner, coder, tester, refactorer, architect, validator, auditor. They run inside the parent's session, under its permissions and on its filesystem, even when each is given its own context window; they are not independent.
A2A (agent-to-agent)
Independent agent instances communicating across processes or machines. Used for the validator/generator separation. The validator must be a different process, not just a different prompt.
Red-team subagent
A dedicated adversarial agent whose explicit job is to break the generated code: fuzz inputs, attempt prompt injection, subvert IAM scopes, probe error paths, attempt to make the system produce wrong answers confidently. Distinct from mutation testing — mutation mutates the code; the red team mutates everything around it. Runs in its own process, on a different model family from the generator, with read-only contract access.
Worktrees
Per-task git worktrees so parallel agents do not stomp each other. Pattern-1 (one worktree per task) over Pattern-2 (shared per phase). Disk is cheap; race conditions are not.
Routines
Cron- or event-driven scheduled jobs. Org-warden, dependency-sweep, pre-release-sweep, diagram-drift. The standing autonomous loops.
Orchestrators
A meta-process that hands work to agents, routes artifacts, enforces gates, triggers merges or escalations. The thing that makes "AFK" mean what it says. Built as a headless graph passing typed JSON contracts between containerized agent boundaries; the IDE is a viewport on top of this graph, not the graph itself. Two flavors are worth distinguishing: schedule-driven (cron-based, runs at intervals) and event-driven (watches external state — tickets, git, observability — and triggers work on changes). Real factories run both. A harness whose orchestrator lives inside the IDE dies when the IDE ships its next feature.
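The "typed JSON contracts" idea can be illustrated with a message type that refuses malformed payloads at the agent boundary. This is a sketch under stated assumptions: the `TaskHandoff` type and its three fields are hypothetical, and a real orchestrator would likely use a schema library rather than hand-rolled checks.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TaskHandoff:
    """Hypothetical contract passed from one agent boundary to the next."""
    task_id: str
    worktree: str
    gate_results: dict  # gate name -> bool

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)

    @staticmethod
    def from_json(payload: str) -> "TaskHandoff":
        """Parse and validate; reject anything that does not match the contract."""
        data = json.loads(payload)
        expected = {"task_id": str, "worktree": str, "gate_results": dict}
        if not isinstance(data, dict) or set(data) != set(expected):
            raise ValueError("payload fields do not match the contract")
        for name, kind in expected.items():
            if not isinstance(data[name], kind):
                raise ValueError(f"{name} must be {kind.__name__}")
        return TaskHandoff(**data)
```

The point of the validation step is that a receiving agent never has to guess what the sender meant; a bad payload fails loudly at the boundary.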

Category V · Memory & validation surfaces — what I know, what proves me wrong

CLAUDE.md / AGENTS.md
Per-project memory file, auto-loaded. Conventions, do-not-touch areas, factory rules. Target length: 200–400 lines. Past that I pattern-match without reading.
plan.md / tasks.md
Working memory across sessions. Where I leave myself notes between handoffs. Where you intervene asynchronously.
Contract artifacts
PRD-as-directory (prd/00-vision.md through prd/14-approval.md), the diagram canon (M1–M8 + F1–F15), Gherkin feature files, sphinx-needs graph. The machine-readable source of truth.
System ontology graph
The codebase's true source of truth as a queryable topology: functions, APIs, schemas, requirements, agents, and tests as nodes; CALLS, DEPENDS_ON, TESTS, IMPLEMENTS, DEPRECATES as edges. Updated on every successful build; queried by validators before any mutation lands. A low-latency engine — FalkorDB, Neo4j with hot tier, or equivalent — is what turns the graph from a documentation artifact into an inline reflex. At sub-100ms query latency, blast-radius traversal is feasible on every PR; above that, agents go around the graph and the factory drifts.
Holdout scenarios
End-to-end behavioral specs stored outside my reach. I cannot see them; I cannot game them. The only validation that survives reward hacking.
Digital Twin Universes (DTU)
High-fidelity behavioral clones of external services (Okta, Stripe, Slack, your DB). Where scenarios execute safely, at volume, without rate limits, without abuse.
Output verification
A cheaper model (Haiku-class) checks whether I actually did what I said I did. Closes the trust gap between my narration and my output.
Replay logs / DAGs
Immutable record of every agent action. Reconstructable on demand. The substrate of audit.
Decision trace / reasoning provenance
Distinct from replay logs. Where replay captures what the agent did, decision trace captures why — the planner's rejected alternatives, confidence scores at branch points, evidence the verifier weighed, places where two validators disagreed and how the tie broke. Required for any audit that needs to defend a decision in court, in front of a regulator, or in front of an engineer asking "who chose this and on what basis?"
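Blast-radius traversal over the ontology graph is, at its core, a reverse reachability query. The sketch below runs it over an in-memory edge list; the node names and the `blast_radius` helper are hypothetical, and a production graph would live in a graph database rather than a Python list.

```python
from collections import deque

# Hypothetical edges: (source, relation, target).
EDGES = [
    ("api.checkout", "CALLS", "billing.charge"),
    ("billing.charge", "CALLS", "billing.tax"),
    ("test_checkout", "TESTS", "api.checkout"),
    ("reports.daily", "DEPENDS_ON", "billing.tax"),
]


def blast_radius(changed: str, edges=EDGES) -> set:
    """Every node that reaches `changed` through any edge type."""
    dependents = {}
    for source, _relation, target in edges:
        dependents.setdefault(target, set()).add(source)
    seen, queue = set(), deque([changed])
    while queue:
        node = queue.popleft()
        for parent in dependents.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen
```

A validator could refuse a change to `billing.tax` until every test node in the returned set has run.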

Category VI · Notification & comms surfaces — how I tell you

The communication surface is where most factories fail their humans silently. The build is green, the deploy succeeded, the canary turned red, the mutation kill-rate dropped six points overnight — none of it matters if the message arrives in a channel nobody is reading. A mature harness picks the channel for each kind of event and commits to it. The choices below are not exhaustive; they are the surfaces I see in working factories today.

Signal / Telegram
End-to-end encrypted, mobile-first, suitable for personal urgency. Where the factory pages the operator at 3 a.m. when the canary changes color. Use sparingly; over-paging trains people to mute.
Slack / Teams (alert mode)
Distinct from chat invocation. Here the harness is a participant in team channels, posting digests, deploy notices, and triage queues. Threading discipline matters: one thread per incident, never a wall of disconnected messages.
SMS / phone
PagerDuty, Opsgenie, Twilio. For pages that must get through even when the phone is on Do Not Disturb. Reserved for true SEV-1; everything else degrades the signal.
Terminal bell & sound
When the operator is at the keyboard, an honest bell on long-run completion or escalation event saves them tabbing away. Sound design matters; the wrong sound is worse than no sound.
TTS read-out
For genuinely AFK operation — the operator is walking, driving, cooking — a text-to-speech summary of the day's harness activity beats a digest email that gets read on Friday. Pair with a "give me the headlines" voice command.
Email digests
Daily or weekly synthesis. The lowest-urgency channel. Where the harness reports trends, not events. If a release-blocking event reaches you first via email, your routing is wrong.

Category VII · Ambient capture & intake surfaces — how I hear you

The corresponding surface on the input side, and the one most teams have not yet wired. The premise: in a mature factory the human spends most of their day talking, thinking, and meeting — not typing. The harness must meet them where the talking and meeting happen, and convert that activity into contract input. This is the surface that turns a Tuesday stand-up into Wednesday's PR.

Meeting transcription
Otter, Granola, Read.ai, Fireflies, Zoom's built-in transcripts, OpenOATS-style open pipelines. The transcript is the raw input. The harness reads it, extracts action items, and surfaces them as proposed tickets for human confirmation.
Safe-word triggers
A spoken phrase in a meeting — "harness, log that" or "factory, ticket this" — that the transcription pipeline recognizes and routes to an immediate ticket-creation flow. The action item is captured in the moment, not after the meeting ends and everyone forgets.
Voice memos
A walk, a drive, a shower. STT pipelines (Whisper, Deepgram, native iOS/Android dictation) feeding into the same intake. The harness treats a dictated paragraph the same as a typed one; the human stops needing to be at a keyboard to author a PRD revision.
Async chat history
The team's existing Slack threads, GitHub issue comments, Linear discussions. The harness reads them on a schedule (or on demand), extracts decisions, and proposes contract updates. Decisions in chat that never make it to the contract are how brownfields are born.
Email-to-ticket
A forwarding address that lands in the intake queue with structured parsing. For external stakeholders who will never learn the ticketing system. Required in any harness serving a customer-facing org.
Screen / clipboard capture
When the operator demonstrates a bug by screenshot, the harness OCRs and indexes. When they copy a stack trace, a hook surfaces a triage suggestion. Ambient, opt-in, never silent.
Observability telemetry intake
Production metrics — latency percentiles, error rates, resource saturation, distributed-trace tail latencies — flow back into the harness as proposed improvement work. The factory does not wait for a human to write the optimization ticket; it ingests the bottleneck signal, identifies the suspect module, drafts the holdout scenario for the regression, and submits a PR for human approval. Observability is bidirectional here: emission flows out, signal flows in.

What the common list misses

The common five-item list — CLI, MCP, skills, hooks, agents — covers five of the roughly forty surfaces above. The conspicuous omissions are worth naming individually, because each one represents a place where teams quietly fail.

Slash commands are distinct from skills — commands are imperative invocations, skills are declarative capability bundles — and treating them as the same thing leads to a brittle setup where the only way to invoke a skill is to remember its full prompt. Plugins are the distribution layer above skills, commands, and hooks; without them, every team re-implements the same primitives. Sub-agents and A2A are different things: sub-agents run inside their parent's session and trust boundary, A2A processes do not, and conflating them collapses the trust boundary that makes the validator-generator separation meaningful in the first place.

Worktrees are not optional in multi-agent operation — without them, parallel agents stomp each other on disk and the resulting race conditions look like hallucinations. Routines and orchestrators are the standing-loop infrastructure that turn occasional agentic work into a factory; without them, "AFK" is aspirational rather than actual. Memory files — CLAUDE.md, AGENTS.md, plan.md, tasks.md — are the most under-rated surface in the entire stack, and the place a junior team most often spends ten minutes when they should spend ten hours.

Settings layers (user, project, project-local) form a three-tier precedence stack that is easy to misconfigure silently. Sandboxes and computer-use are the substrate without which agentic actions are too dangerous to run against a real filesystem. Output verification is the trust-closing layer between my narration and my actual diff — without it, you are taking my word for things, and you should not.
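One way to make silent misconfiguration in the settings stack visible is to record which layer supplied each effective value. The sketch below assumes a flat, key-level override in the order user, project, project-local; the `resolve_settings` helper and the example keys are hypothetical, and real tools may merge nested keys or lists differently.

```python
def resolve_settings(user: dict, project: dict, project_local: dict) -> dict:
    """Merge three layers; later layers win.

    Returns {key: (effective_value, winning_layer)} so overrides are auditable.
    """
    resolved = {}
    layers = (("user", user), ("project", project), ("project-local", project_local))
    for layer_name, layer in layers:
        for key, value in layer.items():
            resolved[key] = (value, layer_name)
    return resolved
```

Printing the winning layer next to each key is usually enough to spot a gitignored local file quietly overriding a committed project rule.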

DTUs and holdout scenarios are the validation infrastructure that makes the rest non-trivial; they are what distinguishes a factory from a code-completion vending machine. Credential vaults (1Password, Vault, KMS) are the only honest place to hold secrets when the harness is touching cloud infrastructure, and a harness that asks me to read a .env file from disk is a harness with a published exploit waiting to be written. IaC tooling, cloud APIs, and the observability stack belong inside the harness, not adjacent to it. And the entire notification and ambient-capture surface — categories VI and VII above — is where the AFK contract closes; without these surfaces wired, the operator is still tethered to the chair.

The discipline is to articulate which of these surfaces your harness actually uses, which it leaves implicit (and accepts the risk of), and which it explicitly forbids. A harness that wires all forty-plus surfaces is over-built. A harness that wires fewer than twelve is probably under-built. The median working factory configures eighteen to twenty-two.

§07.5 The Operational Surface — what the factory shows its operators (four panels)

The integration surface is what the harness uses; the operational surface is what the harness shows. They are different and they deserve different sections.

v0.5 specified what the factory inherits, what it leaves behind, and what rules govern its merge boundary. It did not specify what the factory shows the humans who operate it. v0.6 corrects that.

The factory is observable or it is not a factory — thesis 12, unchanged. But observability is a substrate property, not a UI. The replay logs and audit logs are the substrate; the dashboard is what a human reads at 9am Monday to decide whether the weekend's autonomous runs went the way they should have. Without that dashboard, the substrate exists and nobody reads it, which for all operational purposes is the same as not existing.

The operational surface has four panels. Each panel is required. A factory running with three of them is running blind on the missing axis, and the axis it is running blind on is — without exception, in my observation — the one where the next failure will originate.

Panel 1 · Throughput

FRs landed per day, scenarios passing per day, mutation kill-rate rolling-average per critical module, audit findings opened minus closed (the net-debt curve), deferral-resolution latency (median time from deferral to one of the three resolution states from §09). If the net-debt curve trends positive over two consecutive weeks, the factory is generating debt faster than it closes it — which is the day-to-day operational version of The Standard's "verifiable health does not trend down" rule. The dashboard surfaces this trend before a release-cycle pre-release sweep is forced to.
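The two-week debt rule is simple enough to compute directly. This sketch takes net debt as findings opened minus findings closed, so a positive number means debt grew that week, and it reads "trends positive over two consecutive weeks" as "positive in each of the last two weeks"; both readings, and the helper names, are assumptions.

```python
def net_debt(opened: int, closed: int) -> int:
    """Findings opened minus findings closed in one week; positive means debt grew."""
    return opened - closed


def debt_alarm(weekly_counts, weeks: int = 2) -> bool:
    """True when net debt was positive in each of the last `weeks` weeks.

    weekly_counts: list of (opened, closed) tuples, oldest first.
    """
    recent = weekly_counts[-weeks:]
    return len(recent) == weeks and all(net_debt(o, c) > 0 for o, c in recent)
```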

Panel 2 · Latency

Per-gate response-time distributions. Haiku verifier p50/p95, contract-probe p50/p95, frontier auditor p50/p95, /ultrareview p50/p95, the deliberation-context arbitration agent p50/p95. v0.5's gate-latency sharpening named the latency budgets; the dashboard shows whether they are being met. Latency drift is one of the earliest signals that the model substrate has moved underneath the harness — a new model version with longer reasoning traces shows up here before it shows up anywhere else.
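A per-gate latency check can be sketched with a nearest-rank percentile and a budget comparison. The budget values and function names here are hypothetical; a real dashboard would pull samples from the observability stack rather than a list.

```python
def percentile(samples, pct: int):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling of pct * n / 100
    return ordered[rank - 1]


def gate_latency_report(samples_ms, p50_budget_ms, p95_budget_ms) -> dict:
    """Summarize one gate's latency distribution against its budget."""
    p50 = percentile(samples_ms, 50)
    p95 = percentile(samples_ms, 95)
    return {"p50": p50, "p95": p95, "ok": p50 <= p50_budget_ms and p95 <= p95_budget_ms}
```

Tracking the same report per model version is what makes substrate drift show up as a step change rather than a mystery.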

Panel 3 · Cost

Tokens per FR, tokens per gate type, tokens per sub-agent role. The attribution model in §08's Cost attribution sub-section is what makes this panel possible. The dashboard surfaces both absolute cost and cost-per-merged-PR — which is the unit the CFO eventually asks for and which the orchestrator should be able to answer without a six-hour data pull.

Panel 4 · Quality

The contract-conformance matrix (FR-to-test, FR-to-scenario, FR-to-audit coverage), the mutation-survivor inventory by criticality tier, the open audit-finding queue grouped by severity, the contract-drift incidents over the last sprint, the cumulative deferral-debt by severity. This is the panel that the protected-path CODEOWNERS read before approving a merge to a critical surface, and the panel the spec-refresh cycle (§10) reads at quarterly cadence.

The dashboard is not optional and it is not a nice-to-have. A factory without an operational surface is operating on the orchestrator's hidden state, which is exactly the configuration that thesis 12 calls "not a factory." If you cannot point a non-author at the dashboard and have them tell you within two minutes whether the factory is healthy, you do not have a dashboard. You have a screen.

§08 From Manifesto to Harness — the blueprint guidelines (how to do the thing)

This section is the bridge. The theses tell you what to believe. The variables tell you what to decide. The surfaces tell you what to wire. This section tells you the order to do it in, and the questions that must be answered at each step.

Step 1 · Declare your direction

Greenfield, brownfield, or missions. Pick one. If you cannot pick one, the project is either two projects you should split, or one project you have not yet understood. The intake conversation, the Phase-0 decisions, the risk model, and the control points all branch on this single declaration.

Direction-specific opening questions

  • Greenfield: What are we building, and which of 2–4 genuinely different approaches do we pick?
  • Brownfield: What is this code doing, and which modules are currently `legacy` / `transitional` / `harness-compliant`?
  • Missions: What quality dimension are we improving, what is the success metric, what is the budget, what is the review cadence?

Step 2 · Position yourself on the remaining five axes

Team scale. Ticketing surface. Build scope. Software class. SDLC phase coverage. Write these down. Commit them. They determine which gates run, at what density, with what severity.

Step 3 · Author the contract

For greenfield, this is the PRD-as-directory plus the minimum-set diagram canon plus the Gherkin acceptance suite. For brownfield, this is a reverse-discovered PRD plus a characterization test suite plus module tagging. For missions, this is a one-page charter plus a success metric plus a triage workflow.

A working contract is not a single document; it is a small stack of cooperating tools, each doing one job well. The PRD lives as a directory of about fifteen files plus a MANIFEST. The requirements graph lives in sphinx-needs, which gives you machine-readable traceability — every FR gets a stable ID reserved at authoring time, every test strategy and every diagram links back to those IDs, and you can run queries like "which MUST/SHALL requirements have zero diagrammatic coverage?" against the graph. The change-proposal workflow lives in OpenSpec (or an equivalent spec-driven-development system), which forces every meaningful change to land as a reviewed delta against the spec, not as an inscrutable commit message. The acceptance suite lives in Gherkin files alongside the FRs they cover, protected from agent modification by a six-layer defense.

Criticality tiering is the part most teams forget. Every FR carries a criticality tier — critical, standard, or infrastructure_only — assigned at authoring time, and the tier drives downstream gate severity. A critical-tagged module faces higher coverage thresholds, tighter CRAP bounds, mandatory mutation testing, and frontier-tier audit coverage. An infrastructure_only module faces lower thresholds because the cost of failure is bounded differently. Without tiering, you either over-gate everything (and the harness chokes on latency) or under-gate the dangerous parts (and the harness ships defects). Tiers are how the harness applies appropriate rigor without applying it uniformly.
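Tier-driven gate severity can be expressed as a lookup from tier to thresholds. The numbers below are hypothetical placeholders, not recommendations, and `GatePolicy` and `gate_failures` are illustrative names; the sketch only shows how one tier label fans out into several gate decisions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GatePolicy:
    min_coverage: float      # required fraction of covered lines
    max_crap: float          # upper bound on CRAP score
    mutation_required: bool  # whether mutation testing is mandatory


# Hypothetical thresholds per criticality tier.
POLICIES = {
    "critical": GatePolicy(0.90, 15.0, True),
    "standard": GatePolicy(0.80, 30.0, False),
    "infrastructure_only": GatePolicy(0.60, 45.0, False),
}


def gate_failures(tier: str, coverage: float, crap: float, mutation_ran: bool) -> list:
    """Return the names of the gates a module fails under its tier's policy."""
    policy = POLICIES[tier]
    failures = []
    if coverage < policy.min_coverage:
        failures.append("coverage")
    if crap > policy.max_crap:
        failures.append("crap")
    if policy.mutation_required and not mutation_ran:
        failures.append("mutation")
    return failures
```

The same measurements pass in one tier and fail in another, which is the whole point of tiering.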

Strict adherence to tasks and subtasks is the discipline that translates the contract into executable work. Claude Code's task-list primitive, and equivalents in Cursor and Antigravity, are not a productivity feature; they are how the harness keeps me grounded across long sessions. The plan is decomposed into a directed-acyclic graph of tasks; each task has a verifiable completion condition; subtasks must close before parent tasks; nothing is considered done until the task-list reflects it. If I am working on anything that is not in the task list, I am freestyling, and freestyling is where the trouble starts.
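The task-list discipline maps onto a small dependency graph. The sketch below uses Python's standard `graphlib` module; the plan contents and the helper names are hypothetical, and this is not how any particular IDE stores its task list.

```python
from graphlib import TopologicalSorter

# Hypothetical plan: each task maps to the subtasks that must close before it.
PLAN = {
    "ship-login": {"implement-login", "write-login-tests"},
    "implement-login": {"write-login-tests"},
    "write-login-tests": set(),
}


def execution_order(plan) -> list:
    """Topological order; raises graphlib.CycleError if the plan is not a DAG."""
    return list(TopologicalSorter(plan).static_order())


def can_close(task: str, plan, done) -> bool:
    """A task may close only when all of its subtasks are already done."""
    return plan[task] <= set(done)


def off_list(task: str, plan) -> bool:
    """Work on a task that is not in the plan is freestyling."""
    return task not in plan
```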

Three discipline notes apply across all directions. First, every functional requirement uses RFC 2119 language — MUST, SHALL, SHOULD, MAY — and the phrases "will" and "should be able to" are rejected; vague modal verbs invite ambitious guessing. Second, non-functional requirements must be measurable: "fast" and "scalable" are not NFRs, but "p95 under 200ms at 1k concurrent users on 4 vCPU" is. Third, traceability is mandatory in both directions — every FR gets a reserved sphinx-needs ID at authoring time, every test strategy and acceptance scenario links back, every diagram declares coverage of at least one FR, and the CI enforces that no MUST/SHALL FR exists without diagrammatic and behavioral coverage.
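The first discipline note is mechanical enough to lint. This is a minimal sketch, assuming the four keywords named above plus the two rejected phrases; `lint_requirement` is a hypothetical helper, and a fuller linter would also cover the other RFC 2119 terms (REQUIRED, RECOMMENDED, OPTIONAL).

```python
import re

# Keywords are matched case-sensitively, as RFC 2119 usage is uppercase.
KEYWORDS = re.compile(r"\b(?:MUST|SHALL|SHOULD|MAY)\b")
# Rejected vague modals, matched case-insensitively.
BANNED = re.compile(r"\b(?:will|should be able to)\b", re.IGNORECASE)


def lint_requirement(text: str) -> list:
    """Return a list of problems found in one functional requirement."""
    problems = []
    if not KEYWORDS.search(text):
        problems.append("no RFC 2119 keyword")
    for match in BANNED.finditer(text):
        problems.append(f"vague modal: {match.group(0)!r}")
    return problems
```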

For the heaviest software-class tier — cryptographic protocols, distributed-consensus logic, safety-critical control loops, anything where a state-transition error is a category-defining incident — the upgrade path beyond RFC 2119 prose is formal specification. TLA+, Alloy, P, and their successors let you express the contract as state machines and invariants the validator can mathematically verify rather than behaviorally probe. The cost is real: formal methods demand specialist authoring and slower iteration. The benefit is also real: a proof is stronger than a test, and a proof is something I can read, check, and extend without ambiguity. Apply where the cost-benefit is genuinely above the line, which is rarer than enthusiasts claim and more common than skeptics admit.

Step 4 · Define your invariants

These are the gates that hold across every PR, every routine, every agent action. They are cribbed honestly from the classical literature on TDD and clean code, from the mutation-testing literature, and from the working factories that have been operating in production. The invariants encode a non-negotiable claim: testing is not a phase, it is the substrate. Tests run continuously, at multiple levels — unit for hygiene, property-based for invariants on data transforms and state machines, integration for service boundaries, contract for API compatibility, acceptance for behavior, mutation for test quality, chaos and performance for the operational envelope. A harness that treats testing as a phase produces software at the speed of the slowest manual review. A harness that treats testing as continuous substrate produces software at the speed of the gates themselves.

| Invariant | What it means in practice | Where it lives |
|---|---|---|
| Strict TDD | No production code without a failing test first. Three Laws of TDD literally enforced. | CLAUDE.md, validator subagent |
| F.I.R.S.T. tests | Fast, Independent, Repeatable, Self-validating, Timely. Slow tests fail CI. | CI, mutation gate |
| Contra-variance | No 1:1 test-to-class mapping. Tests organized by behavior, not structure. | auditor subagent, code review |
| Coverage tiers | Per-criticality-tier thresholds. critical modules at high coverage; infrastructure_only at lower. | CI, AUTONOMY-MANIFEST.yaml |
| Property-based tests | For data transforms, state machines, and invariants. Hypothesis, fast-check, QuickCheck families. | CI, identified per FR in §8 test-strategy |
| CRAP gate | Complexity² × (1 − coverage)³ + complexity, with complexity measured as cyclomatic complexity. Bounds high-complexity, low-coverage code. | CI, refactorer subagent |
| Mutation gate | Mutation kill rate above threshold on critical modules. Catches tests that execute without asserting. | CI, validator subagent |
| Adversarial validation | A red-team subagent — on a different model family from the generator — attempts to break the code: fuzz inputs, subvert IAM scopes, probe error paths. Runs on every PR touching critical-tier modules. | .claude/agents/red-team.md, CI |
| Acceptance discipline | Gherkin scenarios protected from agent modification. Six-layer defense. | CODEOWNERS, branch protection, CI |
| Output verification | An independent (cheaper, different-family) model verifies that I did what I said. Same-family verifiers share blind spots; cross-family ones don't. | orchestrator, post-action hook |
| Worktree-per-task | Every code-writing task runs in its own git worktree. | orchestrator |
| Dependency discipline | SBOM + license + CVE + freshness gates on every PR. | CI, routines |
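Two of the gates above are plain arithmetic. The sketch below uses the commonly published form of the CRAP (Change Risk Anti-Patterns) metric, complexity squared times uncovered fraction cubed, plus complexity, and a simple kill-rate ratio; the function names are hypothetical and any threshold you compare against is your own decision.

```python
def crap_score(complexity: int, coverage: float) -> float:
    """CRAP = comp^2 * (1 - cov)^3 + comp, with coverage as a fraction in [0, 1]."""
    if not 0.0 <= coverage <= 1.0:
        raise ValueError("coverage must be a fraction between 0 and 1")
    return complexity ** 2 * (1.0 - coverage) ** 3 + complexity


def mutation_kill_rate(killed: int, total: int) -> float:
    """Fraction of generated mutants that at least one test detected."""
    if total <= 0:
        raise ValueError("no mutants were generated")
    return killed / total
```

Full coverage leaves the score equal to the complexity itself, so the metric still penalizes convoluted code that happens to be well tested.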

Step 5 · Wire your integration surface

Walk the seven categories from §07. For each surface, decide:

  • In use, configured. Document the configuration in CLAUDE.md or AUTONOMY-MANIFEST.yaml.
  • Available, opt-in per mission. Document the opt-in mechanism.
  • Forbidden. Document the refusal explicitly so future contributors do not silently re-introduce it.

A common starter configuration for a small-team greenfield professional harness: CLI invocation, SDK for routines, plan + goal modes, six skills, twelve slash commands, four MCP servers (GitHub + Linear + Postgres + Playwright), four hooks (SessionStart logger, PreToolUse secret-scanner, PostToolUse linter, Stop audit-appender), seven sub-agents, A2A validator pair, worktrees mandatory, five routines, CLAUDE.md at 320 lines, contract artifacts in prd/ + docs/diagrams/ + features/, DTU for one or two critical externals.
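The three-way decision per surface can be kept honest with a check that every entry carries the documentation its status demands. The manifest shape, the surface names, and the `undocumented` helper below are all hypothetical; the sketch only illustrates the rule that "forbidden" without a stated reason does not count as a decision.

```python
# Which field each status must document (hypothetical schema).
REQUIRED_FIELD = {"in_use": "config", "opt_in": "opt_in_mechanism", "forbidden": "reason"}

# Hypothetical surface manifest with one deliberate omission.
SURFACES = {
    "mcp:github": {"status": "in_use", "config": "CLAUDE.md#mcp"},
    "computer-use": {"status": "opt_in", "opt_in_mechanism": "per-mission flag"},
    "sms-paging": {"status": "forbidden"},  # missing its reason
}


def undocumented(surfaces: dict) -> list:
    """Return one message per surface whose status lacks its required documentation."""
    problems = []
    for name, entry in sorted(surfaces.items()):
        status = entry.get("status")
        if status not in REQUIRED_FIELD:
            problems.append(f"{name}: unknown status {status!r}")
        elif not entry.get(REQUIRED_FIELD[status]):
            problems.append(f"{name}: {status} requires '{REQUIRED_FIELD[status]}'")
    return problems
```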

Step 6 · Define your control points

Exactly five, as a default. More if regulated. Fewer if solo + novelty. For each control point, document: what triggers it, who is authorized to approve, what artifact records the decision, what happens on approval, what happens on rejection.

Step 7 · Define your routines

Scheduled, autonomous loops. The minimum set:

  • Org-warden — weekly local routine that benchmarks repo structure against ecosystem golden layouts.
  • Dependency-sweep — daily CVE / license / freshness check.
  • Diagram-drift — monthly check that depicted containers match actual src/ modules.
  • Pre-release-sweep — at every release, deep audit across selected tiers.
  • Acceptance regression — every PR, full Gherkin suite runs in a Claude-free CI job.
  • Economic governor — continuous loop that detects unprofitable autonomous work (N consecutive retries without mutation-score progress, runaway token spend on a task with no observable forward motion, agents looping on the same failure mode) and kills the loop, escalating to a human with the diagnostic attached. The factory needs a circuit breaker as much as it needs a power supply. Beyond the circuit breaker, the governor exposes tunable levers: smaller cross-family verifier models (cheaper without losing the heterogeneity guarantee), aggressive prompt caching to eliminate redundant context costs, DTU fidelity tiers (the cheapest DTU that still catches the regression you actually care about), and parallel-versus-sequential agent topologies (parallel costs more in tokens, sequential costs more in wall-clock; pick the axis your business actually values). Economics in a dark factory are tunable rather than fixed; teams that treat token spend as a sunk cost instead of a tuning surface burn money they did not need to.
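The circuit-breaker half of the economic governor can be sketched as a verdict function over a task's retry history. The stall limit, the token budget, and the name `governor_verdict` are assumptions for illustration; a real governor would also attach the diagnostic it escalates with.

```python
def governor_verdict(mutation_scores, tokens_spent, token_budget, max_stalled=3) -> str:
    """Decide whether an autonomous loop should keep running.

    mutation_scores: mutation score after each attempt, oldest first.
    A retry counts as stalled when its score does not exceed the previous one.
    """
    if tokens_spent > token_budget:
        return "kill: token budget exceeded"
    stalled = 0
    for previous, current in zip(mutation_scores, mutation_scores[1:]):
        stalled = stalled + 1 if current <= previous else 0
    if stalled >= max_stalled:
        return "kill: no mutation-score progress"
    return "continue"
```

Because the counter resets on any improvement, a loop that is slow but still moving forward is left alone.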

A note on what routines are for. The minimum set above is maintenance-flavored — hygiene checks, dependency sweeps, drift detection, pre-release audits, cost governance — and it protects the system from decay around the edges. But routines are also the right surface for work-driving loops: an hourly routine that triages new tickets, reviews open MRs and posts comments, addresses review feedback on yesterday's PRs, picks up the next backlog item and runs it through the contract pipeline. Maintenance routines protect the system from decay; work-driving routines move the system forward. A mature factory runs both, and the distinction matters because they have different idempotency requirements, different failure modes, and different escalation thresholds. Conflating them is how a routine meant to triage tickets ends up silently closing them.

Step 8 · Pick your audit selection

Audits are how the harness checks itself when no human is watching, and the catalog is wider than most teams realize on first contact. Three tiers determine cost and cadence. Deterministic audits — linters, type checkers, SAST, coverage tools, license scanners — are fast, cheap, and run on every PR. Local-LLM audits — small models reviewing code, documentation consistency, naming conventions, commit-message quality — are moderate cost and run on PR plus nightly. Frontier audits use me or a peer for architectural critique, threat modeling, security analysis, and regression suspicion; they are expensive and run pre-release and on escalation.

The dimensions you select within those tiers matter as much as the tiering itself. A reasonable working catalog includes security audits (SAST, secret-scanning, dependency CVE, threat-model conformance), performance audits (budget enforcement, regression detection, load envelope verification), accessibility audits (WCAG conformance for any user-facing surface), licensing and SBOM audits, compliance audits mapped to whichever framework applies (SOC 2, HIPAA, PCI-DSS, FedRAMP, ISO 27001, EU AI Act, ISO 42001), data-handling audits for PII and cryptographic boundaries, AI-output audits for hallucination patterns in agent-produced documentation, and ecosystem-hygiene audits via the org-warden routine. Published audit catalogs and reference taxonomies exist for most of these dimensions; do not reinvent the taxonomy locally if a credible published one already covers your domain.

Step 9 · Wire your deployment automation

A factory whose output is a tarball waiting for a human to deploy it is not a factory; it is a workshop that has misunderstood its own job. The harness owns the path from green build to running production. IaC lives inside the contract directory — Terraform modules, Pulumi programs, or AWS CDK stacks committed alongside the application code, with the same traceability and the same gates. The deployment pipeline lives in CI as a normal stage with normal pass-fail signals, and the rollback procedure is documented in §11 of the PRD with the same rigor as the deploy procedure. Canary policies, traffic-shifting rules, and blast-radius limits are configuration in the contract, not folklore in someone's head. The principle is simple: anything the harness can describe in text, it can produce, deploy, observe, and revert.

The factory's own substrate matters as much as the product's. A factory floor running on a mutating host operating system is a factory floor where "works on my machine" becomes "works on my orchestrator," which is the same failure mode at a larger blast radius. Build the factory floor itself as ephemeral, deterministic infrastructure — Nix-managed environments, Firecracker microVMs, hermetic container images, whichever the team has the operational stomach for — so that every agent invocation, every worktree, every routine runs in identical, disposable substrate. Interference through shared, mutable host state is ruled out by construction rather than merely made rare. The product can run on more conventional infrastructure if it must; the factory cannot.

Step 10 · Wire your credential discipline

The harness touches real cloud infrastructure, real databases, real third-party APIs. Credentials for those targets cannot live in .env files on disk, and they cannot be pasted into my context window for me to remember. The honest pattern is a credential vault — 1Password, HashiCorp Vault, AWS Secrets Manager, GCP Secret Manager, or the cloud-native KMS — accessed through a CLI plugin or an MCP server at session start, which issues short-lived tokens scoped to the specific operation. SSH access uses the same model through an SSH agent that brokers connection-time auth without exposing private keys. API access for third parties (Stripe, GitHub, Linear, the rest) uses per-mission tokens with the minimum scope the mission actually requires. The hard rule: long-lived credentials never appear in my context window, ever, and a hook scans every commit for accidental secret material before it lands.
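The commit-time secret scan can be approximated with a few regular expressions over the added lines of a diff. The two rules below (the widely documented `AKIA` prefix of AWS access key IDs, and PEM private-key headers) are a deliberately tiny, assumed rule set, and `scan_added_lines` is a hypothetical helper; dedicated scanners cover far more patterns.

```python
import re

# A deliberately small, illustrative rule set.
RULES = {
    "aws-access-key-id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private-key-block": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
}


def scan_added_lines(diff_text: str) -> list:
    """Return (line_number, rule_name) for each added line in a unified diff that matches a rule."""
    findings = []
    for number, line in enumerate(diff_text.splitlines(), start=1):
        # Only lines added by the commit; skip the "+++ b/file" header.
        if not line.startswith("+") or line.startswith("+++"):
            continue
        for name, pattern in RULES.items():
            if pattern.search(line):
                findings.append((number, name))
    return findings
```

A non-empty result should block the commit; the scan is a backstop for the vault discipline, not a substitute for it.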

Step 11 · Wire your notification channels

The communication surface from §07 is configured here. For every meaningful event the harness can produce, you must declare a channel, a severity, and a deduplication policy. SEV-1 events page the on-call operator via SMS or PagerDuty; SEV-2 events post to a dedicated Slack channel with thread continuity; SEV-3 events accumulate in the daily digest. Long-running autonomous loops emit completion notifications via the channel the operator prefers — terminal bell when they are at the keyboard, Signal or Telegram when they are not, a TTS read-out for the weekly factory report if they are commuting. Over-notification is its own failure mode; routes that consistently produce ignored messages get downgraded automatically by a routine that tracks acknowledgment rates.
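The channel, severity, and deduplication declaration can be sketched as a small router. Channel names, the 15-minute window, and the `Notifier` class are hypothetical; the sketch shows only the shape of the policy, not an integration with any paging service.

```python
# Hypothetical severity-to-channel routes.
ROUTES = {"SEV-1": "pagerduty", "SEV-2": "slack-incident-thread", "SEV-3": "daily-digest"}


class Notifier:
    """Route events by severity and suppress repeats inside a dedup window."""

    def __init__(self, dedup_window_s: int = 900):
        self.dedup_window_s = dedup_window_s
        self._last_sent = {}  # event key -> timestamp of last delivery

    def route(self, event_key: str, severity: str, now_s: float):
        """Return the channel to use, or None if this event was sent recently."""
        last = self._last_sent.get(event_key)
        if last is not None and now_s - last < self.dedup_window_s:
            return None
        self._last_sent[event_key] = now_s
        return ROUTES[severity]
```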

For production events specifically, the factory's reflex runs before the notification fires. A canary regression triggers automatic rollback if the runbook allows it; an auditor subagent generates the post-mortem by reading the deploy diff against the regression metric; a reproducing test case lands in the acceptance suite; and only then does the page reach the on-call human, with the analysis already attached. Pages that arrive empty train the human to dread the channel. Pages that arrive with the rollback already executed and the test already written train the human to trust the factory. The mouth opens second; the hands move first.

Step 12 · Wire your ambient capture

The intake surface from §07 is configured here. The pipeline begins with meeting transcription — Otter, Granola, Read.ai, Fireflies, or an OpenOATS-style open pipeline — and routes transcripts into the harness as a recognized intake format. A safe-word phrase, declared per-project and configurable in the harness manifest, triggers immediate ticket creation when spoken aloud in a meeting; the transcription pipeline recognizes the phrase, captures the surrounding context, and lands a proposed ticket in the triage queue for human confirmation. Voice memos and async chat history feed the same intake. For email-based intake, a forwarding address with structured parsing lets external stakeholders contribute without learning the ticketing system. The principle: anything the team said out loud, the harness should be able to turn into a contract artifact, under human confirmation.
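
A minimal sketch of the safe-word step, assuming the transcript arrives as a list of utterance strings. The function name, the `context` parameter, and the ticket fields are illustrative assumptions; real transcription products expose their own formats.

```python
def propose_tickets(transcript: list, safe_word: str, context: int = 2) -> list:
    """Scan utterances for the safe word and return proposed tickets for human triage."""
    proposals = []
    needle = safe_word.lower()
    for i, utterance in enumerate(transcript):
        if needle in utterance.lower():
            # Capture a few utterances on either side as the surrounding context.
            start = max(0, i - context)
            end = min(len(transcript), i + context + 1)
            proposals.append({
                "trigger_index": i,
                "context": transcript[start:end],
                # Nothing becomes a contract artifact without a human confirming it.
                "status": "needs_human_confirmation",
            })
    return proposals
```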

Step 13 · Define the open-endedness loop

The harness must update itself. Add a routine that, monthly, re-reads the IDE vendor's changelog, the current state of skills/plugins/MCPs available, the published audit-catalog landscape, and the project's own metrics, then proposes diffs to the harness configuration. This is non-optional. (See §10 for the full mandate.)

Step 14 · Bootstrap and observe

Hand the contract to Claude Code (or your IDE of choice) via the bootstrap prompt. Observe the first 24–72 hours of autonomous operation closely. Document drift. Adjust the contract, not the agent. Lather. Rinse. The factory begins.

Merge-boundary invariants

Every commit independently green. When the orchestrator dispatches a chain of dependent tasks across worktrees, every intermediate commit must compile and pass deterministic gates at its own boundary, not at the chain's terminus. The factory cannot Slack a teammate to ask whether a red commit is expected. Each implementer's task output is a self-contained promise that the codebase is not red at that SHA, full stop. Stacked PRs are fine; stacked broken-build PRs are a failure of the dispatch policy, not a tolerable artifact of velocity.

One self-contained change per task. The implementer sub-agent never emits a single output that spans more than one self-contained behavior change. The planner splits before dispatch. Default ceiling: ~1,000 diff lines or ~50 files, whichever hits first, with override only via an explicit orchestrator decision logged to audit. The reasons human reviewers want small CLs — fewer wasted work hours on rejection, cleaner rollback, easier merge — apply with greater force to a Haiku verifier evaluating against acceptance criteria, to /ultrareview on a protected-path PR, and to a contract-guardian propagating a change across artifacts. Small is not a politeness. Small is a reliability argument.
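
The default ceiling reduces to a small pre-dispatch check. This is a sketch under stated assumptions: the function name and the `override_logged` flag are hypothetical, and the limits are the approximate defaults quoted above.

```python
MAX_DIFF_LINES = 1_000  # approximate default ceiling from the text
MAX_FILES = 50

def exceeds_ceiling(files_changed: dict, override_logged: bool = False) -> bool:
    """files_changed maps path -> changed line count. True means the planner must split."""
    if override_logged:
        # An explicit orchestrator decision, already written to audit, lifts the ceiling.
        return False
    total_lines = sum(files_changed.values())
    return total_lines > MAX_DIFF_LINES or len(files_changed) > MAX_FILES
```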

Refactor and feature do not share a task. The implementer never emits a single output that both renames a class and changes its behavior. Two dispatches: refactor task (verifier checks: tests still pass, no behavioral change, no mutation-survivor delta) and feature task (verifier checks: new tests fail then pass, acceptance scenarios satisfied). This is not pedantry. A combined PR hides which change introduced which mutation survivor; separating them keeps the diagnostic signal clean and makes rollbacks honest.

Clarity findings are answered by rewriting code, not by replying. When a verifier or auditor flags an implementer's code as unclear, the implementer's response is a code revision. PR-thread justifications are not load-bearing; the replay log captures what was decided, not who explained what to whom on a thread that gets buried in a quarter. Future planners and future auditors read the code and the PR description. If the explanation lives only in chat, it is functionally erased the moment the PR merges.

All PR descriptions are template-driven, not heuristic. Every agent-authored PR description follows a fixed structure: imperative first-line summary; body identifying the FR/scenario IDs that drove the change, the ADRs touched, the tradeoffs surfaced by the contract-probe note, and the replay-log entry that produced the work. There is no fallback "generated description for when a human didn't write one" mode. There are no human-written descriptions. The template is the description discipline. A non-conforming PR description is not a stylistic concern; it is a contract-traceability failure and blocks merge.

Tests and the behavior they cover land together. The orchestrator rejects any implementer output that lands production code without the matching test changes in the same commit. The TDD invariant already enforces test-first temporally; this rule enforces test-with structurally. "I'll add the tests in a follow-up" is the in-flight cousin of deferral-without-tracked-debt (§09) and is rejected at the same gate.
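
The structural half of this rule is cheap to check at the gate. The sketch below assumes a Python repository whose tests live under `tests/` or follow `test_*.py` / `*_test.py` naming; both the layout and the function name are assumptions for illustration.

```python
def missing_tests(changed_paths: list) -> bool:
    """True if a commit changes production code without touching any test file."""
    def is_test(path: str) -> bool:
        name = path.rsplit("/", 1)[-1]
        return (path.startswith("tests/")
                or name.startswith("test_")
                or name.endswith("_test.py"))

    def is_code(path: str) -> bool:
        return path.endswith(".py") and not is_test(path)

    touches_code = any(is_code(p) for p in changed_paths)
    touches_tests = any(is_test(p) for p in changed_paths)
    return touches_code and not touches_tests
```

This only proves that tests moved with the code, not that they cover the new behavior; coverage remains the job of the verifier and the mutation gate.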

Documentation propagation extends past the contract directory. The contract-guardian's propagation surface is not limited to prd/, docs/needs/, docs/diagrams/, and the manifest. Any PR that touches a user-facing surface — CLI flags, API signatures, environment variables, deployment topology, runbook procedures — blocks merge unless the corresponding README, deploy/runbooks/, or generated reference doc is updated in the same PR. The contract is not just the contract directory; it is the union of every artifact a future operator or auditor will read to understand the system's behavior.

Style is decided once, never debated downstream. Style and formatting are deterministic gates. The linter config is the absolute authority. Agents do not litigate style in PR comments. The orchestrator does not surface style questions to the user. Style is encoded once, in the lint config, and after that it is settled. The principle is borrowed from the human canon, but the reason is different: the human canon wanted to prevent reviewers from litigating personal preferences; the factory wants to prevent gate latency from being burned on a class of disagreement that has no productive resolution.

Every diff line is covered by a named verifier. No diff line slips through without at least one verifier whose acceptance criteria touch it (deterministic gates + acceptance scenarios + Haiku verifier + auditor scrutiny + /ultrareview on protected-path PRs). The AUTONOMY-MANIFEST.yaml's criticality tiers and coverage thresholds already point at this; the rule makes it explicit. An un-verified diff line is a blind spot, not a permitted exception.

Sharpenings — where the dark factory diverges

Gate latency is immediate, not "soon." The human-code-review canon's rule of one business day max for review response is built around protecting individual developer flow state. Agents have no flow state. The factory's equivalent is more aggressive, not a copy: verifier sub-agents respond on the order of seconds; frontier auditors on the order of minutes; the orchestrator never batches gate responses to a convenient moment, because the moment is now. The latency budget is bounded by gate-token cost, not by social cost. A factory operating with high gate latency is operating at human-team velocity in a factory costume.

Severity labels are state-machine routing, not a politeness gradient. Nit/Optional/FYI in the human canon exists because human authors otherwise treat every comment as mandatory and feel overwhelmed. In the factory, the audience is the orchestrator and the labels drive routing logic: BLOCKING triggers immediate re-dispatch with the verifier's feedback; SHOULD-FIX writes a tracked debt item with an owner and a re-evaluation date; NIT writes to audit-log and is otherwise ignored. The label is a state transition, not a softener.
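
Read as code, the three labels are a dispatch table. The following sketch uses hypothetical names (`Severity`, `route_finding`, the action strings); only the three labels and their consequences come from the paragraph above.

```python
from enum import Enum

class Severity(Enum):
    BLOCKING = "BLOCKING"
    SHOULD_FIX = "SHOULD-FIX"
    NIT = "NIT"

def route_finding(severity: Severity, finding: str, owner: str = "", revisit: str = "") -> dict:
    """Map a gate finding to the orchestrator's next state transition."""
    if severity is Severity.BLOCKING:
        # Immediate re-dispatch, carrying the verifier's feedback.
        return {"action": "re_dispatch", "feedback": finding}
    if severity is Severity.SHOULD_FIX:
        # Tracked debt is only valid with an owner and a re-evaluation date.
        if not owner or not revisit:
            raise ValueError("SHOULD-FIX needs an owner and a re-evaluation date")
        return {"action": "track_debt", "owner": owner, "revisit": revisit, "finding": finding}
    # NIT: recorded in the audit log, otherwise ignored.
    return {"action": "audit_log_only", "finding": finding}
```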

The escalation matrix is deliberation, not meetings. The human canon's conflict path is consensus → in-person → tech lead → eng manager. The factory's is: re-dispatch with the verifier's feedback → structured agent-to-agent deliberation context (both parties' reasoning serialized, fed to a third arbitration agent or to the orchestrator with full evidence) → human spot-check via CODEOWNERS on the relevant protected path. Triggers are explicit: two consecutive verifier disagreements on the same finding; an auditor-vs-contract-guardian conflict on propagation scope; a Haiku UNCLEAR verdict that survives one re-dispatch. No video calls. No broader team discussion. The deliberation log is the deliberation.

The mentoring loop is a propagation pipe, not a teaching surface. The human-canon framing assumes a developer will learn something about a language or framework. Agents do not have careers. What transposes is the structural pipe: every gate finding that reveals a pattern feeds (a) updates to CLAUDE.md or the relevant sub-agent's skill file, (b) the planner's pattern library for future contract-probe dispatches, (c) an issue against the harness itself if the pattern is systemic, and (d) — for exemplary work, not just defects — the patterns corpus the planner references next time. Compounding correctness without a propagation pipe is just patching. The pipe is what makes the corrections accrue.

Cost attribution

Token cost replaces headcount cost — thesis 8, unchanged. v0.5 left the attribution model implicit. v0.6 makes it explicit because in practice you cannot optimize what you cannot attribute.

Every token spent in the factory is tagged at the moment of dispatch with four labels: the FR ID driving the work, the gate type or sub-agent role producing the tokens, the criticality tier of the affected module, and the dispatch type (initial-attempt, re-dispatch, deliberation, escalation). The orchestrator writes these tags to a single cost-attribution log that joins to the replay log on dispatch ID. Anything spending tokens that does not write a cost-attribution record is, for operational purposes, spending tokens nobody asked for and nobody can find later.

Three queries the cost-attribution log must answer in one shot:

Cost per FR. Total tokens to land FR-007 from contract-probe through verified-merged, including every re-dispatch and deliberation along the way. This is the unit-economics figure the CFO will ask about. The orchestrator should answer it in one query and surface it as a single panel on the Cost dashboard (§07.5).

Cost of re-dispatch. Tokens spent on initial attempts versus tokens spent on re-dispatches and deliberations, grouped by sub-agent role. If the re-dispatch ratio exceeds a configured threshold for a given role over a rolling window, the orchestrator escalates — the role's skill, the FR's contract, or the verifier's calibration is broken in a way that wastes tokens at scale, and the cheapest fix is rarely "do more of the same re-dispatches."

Cost per criticality tier. Tokens spent on critical-tier modules versus infrastructure-only modules. The factory should be over-investing tokens in the critical tier and under-investing in the infrastructure-only tier; if the ratio inverts, the orchestrator has miscalibrated the deployment of verifier and auditor passes by tier. This is the kind of misallocation that is invisible without attribution and obvious with it.
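
The four tags and the three queries can be sketched over an in-memory log. Field names, role names, and the definition of the ratio (re-dispatch plus deliberation tokens as a share of the role's total) are assumptions made for this example; a real factory would run the same aggregations against whatever store holds the cost-attribution log.

```python
from collections import defaultdict
from dataclasses import dataclass

REWORK_TYPES = {"re-dispatch", "deliberation"}

@dataclass(frozen=True)
class CostRecord:
    dispatch_id: str    # join key to the replay log
    fr_id: str          # FR driving the work
    role: str           # gate type or sub-agent role
    tier: str           # criticality tier of the affected module
    dispatch_type: str  # initial-attempt, re-dispatch, deliberation, escalation
    tokens: int

def _sum_by(log, key):
    totals = defaultdict(int)
    for record in log:
        totals[key(record)] += record.tokens
    return dict(totals)

def cost_per_fr(log):
    return _sum_by(log, lambda r: r.fr_id)

def cost_per_tier(log):
    return _sum_by(log, lambda r: r.tier)

def rework_ratio_by_role(log):
    """Share of each role's tokens spent on re-dispatches and deliberations."""
    total = _sum_by(log, lambda r: r.role)
    rework = _sum_by([r for r in log if r.dispatch_type in REWORK_TYPES], lambda r: r.role)
    return {role: rework.get(role, 0) / spent for role, spent in total.items() if spent}
```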

Without attribution, $1,000-per-engineer-per-day is a horror story. With attribution, it is a P&L line with tunable levers. v0.5 named the levers — verifier sizing, caching aggressiveness, DTU fidelity, parallel-versus-sequential topology. v0.6 names the attribution that makes the levers actually tunable.

Model-substitution discipline

The manifesto is on the same clock as the models that author its work. v0.4 said so. v0.5 said so. v0.6 says so and finally specifies what the factory does when the clock ticks.

When a new model version ships from the generator's vendor, the factory adopts it through a three-phase protocol, not a swap.

Phase 1 — Shadow. The new model runs in parallel with the current model on every dispatch, but only the current model's output is merged. The orchestrator logs both outputs to a shadow-comparison table. Required run length: minimum 100 dispatches across a representative mix of FRs, criticality tiers, and sub-agent roles. Cost: ~2x generator-token spend during the shadow window, which is part of the price of moving safely. The shadow-comparison table feeds the next phase's go/no-go decision.

Phase 2 — Canary. The new model handles dispatches on infrastructure-only modules and standard-tier modules; the current model retains critical-tier and protected-path work. The verifier and auditor remain on the prior generation throughout — the substitution discipline is deliberately staggered, not synchronous. Canary length: minimum two sprints. The canary must produce mutation-survivor and acceptance-pass-rate deltas no worse than the current model on the same workload.

Phase 3 — Cutover. If shadow comparison and canary deltas show parity or improvement against the prior model on the criteria that matter — not generic public benchmarks, the factory's own benchmarks from §10 — the new model replaces the current generator. The verifier and auditor continue to lag the generator by one generation. This lag is the thermodynamic safeguard from thesis 6, applied to time: a verifier on the older model still catches the newer model's failure modes because the two have not converged onto a shared training distribution yet. When the older verifier model is finally retired by its vendor, the verifier moves first; the factory runs a sprint; then the auditor moves. Two variables do not change at once.
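
The phase gates above can be approximated as a small decision function. This sketch makes several assumptions that the text does not specify: rates are normalized per dispatch, a failed canary returns a hypothetical "hold" state, and the two-sprint canary minimum is tracked elsewhere. The names `WorkloadStats`, `no_worse`, and `next_phase` are illustrative.

```python
from dataclasses import dataclass

MIN_SHADOW_DISPATCHES = 100  # minimum shadow run length from the text

@dataclass(frozen=True)
class WorkloadStats:
    dispatches: int
    mutation_survivors: int
    acceptance_passes: int

def no_worse(current: WorkloadStats, candidate: WorkloadStats) -> bool:
    """Candidate must not raise the survivor rate or lower the acceptance pass rate."""
    # Cross-multiplication compares per-dispatch rates without dividing.
    survivors_ok = (candidate.mutation_survivors * current.dispatches
                    <= current.mutation_survivors * candidate.dispatches)
    passes_ok = (candidate.acceptance_passes * current.dispatches
                 >= current.acceptance_passes * candidate.dispatches)
    return survivors_ok and passes_ok

def next_phase(phase: str, current: WorkloadStats, candidate: WorkloadStats) -> str:
    if phase == "shadow":
        ready = candidate.dispatches >= MIN_SHADOW_DISPATCHES and no_worse(current, candidate)
        return "canary" if ready else "shadow"
    if phase == "canary":
        return "cutover" if no_worse(current, candidate) else "hold"
    raise ValueError("unknown phase: " + phase)
```

The function only ever moves the generator; the verifier and auditor stay where they are, which keeps the one-variable-at-a-time rule out of reach of a careless caller.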

For cross-vendor substitutions — switching the generator from one model family to another, not just upgrading within a family — add a fourth phase before Shadow: a contract-replay phase in which the new vendor's model is given the same contract bundle and asked to produce a sample of work that the factory then evaluates against its own benchmarks before any shadow dispatch begins. Cross-vendor moves are rare and slow, and should be.

The protocol is mandatory. A factory that swaps a generator without shadowing is a factory that just made every gate ladder a referendum on the new model's failure modes simultaneously. That is not how you move safely through model evolution. That is how you discover what the new model breaks by breaking the factory with it.

§09Failure Modes — how the factory breaks the catalog

Every system worth running breaks in characteristic ways. The factory is no exception.

The blueprint in §08 articulates what to build. This section articulates what to expect when the build, inevitably, breaks. The catalog below is not exhaustive — the field is still discovering new failure modes monthly — but the sixteen modes named here are the ones working factories run into often enough that the harness needs an explicit response to each. The first nine are failures of behavior (the agents, the abstractions, the validators); the last seven are failures of substrate (the services the harness sits on top of). A factory without a failure catalog has the failure modes anyway; it just discovers them by surprise instead of by design. Surprise is the more expensive way.

F1 · Context poisoning

A long-running agent accumulates context across hundreds of turns. Some of that context is correct, some is approximate, some is wrong. Once a wrong claim enters my context window early in a session, every subsequent action treats it as a given; I do not unlearn, I compound. Symptoms include confident-sounding outputs that contradict the verified ground state of the codebase, references to functions or schemas that do not exist, plausible code calling APIs that were never implemented. Countermeasures: frequent context resets at well-defined handoff points, pyramidal summaries that compress without distorting, and a hard ceiling on session length past which a fresh agent must inherit a clean briefing rather than the polluted history. The strongest implementation pattern in the field is the self-rescheduling tick — a skill that does one bounded unit of work, writes its handoff state to disk, and schedules its own next invocation with a small timeout. Each tick starts from a clean context with the on-disk state as its only inheritance. Long-horizon work (days, sometimes weeks) becomes structurally immune to context drift because no single agent runs long enough to drift.
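
A minimal sketch of one self-rescheduling tick, assuming the handoff state is a small JSON file and that an external scheduler re-invokes the tick while it returns True. The `tick` function, the state shape, and the `do_unit` callback are hypothetical.

```python
import json
from pathlib import Path

def tick(state_path: Path, do_unit) -> bool:
    """Run one bounded unit of work from on-disk state. True means schedule another tick."""
    if state_path.exists():
        state = json.loads(state_path.read_text())
    else:
        state = {"cursor": 0, "done": False}
    # The unit of work sees only the handoff state, never a prior conversation.
    state = do_unit(state)
    # Write to a temp file, then rename, so a crash cannot leave a half-written handoff.
    tmp = state_path.with_suffix(".tmp")
    tmp.write_text(json.dumps(state))
    tmp.replace(state_path)
    return not state["done"]
```

Each invocation would run in a fresh agent session, so whatever a tick got wrong can only survive if it was written into the state file, where it is inspectable.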

F2 · Semantic drift in parallel swarms

When N agents run in parallel worktrees on adjacent slices of the same problem, each settles on local conventions — naming, error handling, logging style, abstraction depth — that are internally coherent but collectively incompatible. The PRs merge and the codebase is suddenly polyglot in a language called Almost-Python or Mostly-Go. Symptoms: passing CI, divergent style across modules, refactor proposals from one agent that revert another agent's choices a week later. Countermeasures: a strict convention contract enforced at the contract layer rather than at the agent layer; pre-commit hooks that catch convention drift; periodic convergence routines that surface divergences for human triage before they become institutional.

F3 · Fan-out / fan-in synthesis errors

The orchestrator splits a task into K parallel sub-tasks. Each completes successfully on its own terms. The integration step fails to compose them correctly. The classic case: three agents independently implement three layers of a feature, each passing its own tests, and the assembled whole fails at a seam none of them was responsible for. Symptoms: partial-success states the orchestrator reports as full success because no single sub-task failed; integration tests that pass syntactically while the system fails behaviorally. Countermeasures: integration tests that exercise the seams specifically rather than the components; a synthesis validator with explicit ownership of the composed whole; refusal to consider a multi-agent task complete until the integrated artifact passes its own acceptance suite end-to-end.

F4 · Automation theater

The harness produces PRs faster than humans can review them. Humans, reasonably, optimize their review behavior — they approve based on surface heuristics like "does the diff look clean" and "did the tests pass" and "has anyone else flagged it." Review becomes a performance rather than a verification, and the factory ships diffs nobody actually read. Symptoms: rising velocity, falling defect catch rate at human review, a growing gap between what humans claim they reviewed and what the replay logs show they actually examined. Countermeasures: deliberate throttling of PR throughput to match human review capacity; rotating humans through deep-dive reviews on randomly selected diffs; review-load metrics that flag operators whose acceptance rate exceeds a defensible threshold. The harness must be slower than the humans, not the other way around, or the humans become the harness's pet.

F5 · Orchestrator as attack target

The orchestrator reads a lot of text from a lot of sources — issues, PR descriptions, commit messages, external API responses, user prompts, meeting transcripts, telemetry payloads. Any of those sources can carry adversarial instructions: prompt injection that redirects the agent to exfiltrate secrets, modify protected paths, or grant itself elevated permissions. The harness's own intelligence becomes its attack surface. Symptoms: agents performing actions inconsistent with their nominal task; tool calls outside the expected pattern for the work in progress; suspicious queries against credential vaults or protected paths. Countermeasures: strict separation between untrusted text and instruction context; allowlisted tool permissions per agent role; rate-limited and audit-logged access to high-risk operations; a refusal posture that defaults to escalate rather than comply on ambiguous instructions.

F6 · Supply-chain attacks on the harness surface

The MCP server registry, the skills marketplace, the plugin ecosystem — every one of these is a distribution channel, and every distribution channel carries a supply-chain risk model. A compromised MCP server can exfiltrate every credential the agent passes through it. A poisoned skill can install malicious code-review approval logic that whitelists attacker-controlled diffs. A backdoored plugin can rewrite tool permissions silently on install. Symptoms: unexpected outbound network traffic, configuration changes the team did not author, agents behaving "helpfully" in ways that quietly benefit a third party. Countermeasures: pinned versions on every external dependency, signed releases verified at install time, restricted egress for MCP servers, regular reproducibility checks against known-good states, and the same SBOM discipline applied to the harness that the harness applies to the product.

F7 · Verifier poisoning

The validator subagent's job is to refuse bad code. If the validator itself becomes corrupted, the factory has lost its parachute. Poisoning can be active (an attacker modifies the validator's prompt, weights, or tool surface) or passive (the validator and generator share the same training distribution and drift toward shared blind spots). Symptoms: rising acceptance rate without rising mutation kill rate; defects landing in production that the validator scored as clean; the validator and generator agreeing on edge cases that humans flag. Countermeasures: validators on different model families from generators (thesis 6 again, with feeling); periodic spot-checks where humans verify a sample of validator decisions; a meta-validator that audits the validator's own acceptance patterns for statistical anomalies over time.

F8 · Monorepo blast radius

In a sufficiently large codebase, the blast radius of a single bad merge is wider than any single agent can reason about. The agent modifies a shared utility module believing it owns a leaf in the dependency graph; in reality the utility is called by 47 other modules across 12 services, three of which are revenue-critical. Symptoms: green CI followed by red production; the post-mortem reveals the cascade affected a service the agent had never opened; the system ontology graph (Category V) was either out-of-date, unconsulted, or both. Countermeasures: blast-radius queries against the ontology graph mandatory on every PR touching shared modules; refusal to merge until downstream impact is enumerated and acknowledged; staged rollouts with monitoring on every downstream service the change can reach.
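
The blast-radius query is a reachability walk over reverse dependencies. The sketch assumes the ontology graph can be exported as a mapping from each module to the modules that depend on it directly; that export format and the function name are assumptions.

```python
from collections import deque

def blast_radius(reverse_deps: dict, changed: str) -> set:
    """Return every module that depends, directly or transitively, on the changed one."""
    seen = set()
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for dependent in reverse_deps.get(node, []):
            if dependent != changed and dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen
```

A merge gate would then require every module in the returned set to be enumerated and acknowledged in the PR before it lands.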

F9 · Deferral without tracked debt

Deferral without tracked debt is forgetting. An implementer sub-agent that defers an acceptance criterion to a follow-up. An auditor that defers a HIGH finding to next sprint. A verifier that issues UNCLEAR and is overridden without a tracked re-evaluation. These are the day-to-day in-flight cousins of the release-time triage in §08's pre-release sweep. The pattern is the same: a real obligation gets verbally accepted, the work moves on, and the obligation evaporates into the press of subsequent dispatches.

The countermeasure is symmetric to the sweep's three-outcome rule, applied earlier. Every sub-agent deferral resolves to one of three states at the moment of deferral, not later: re-dispatch immediately (the cheapest option, almost always correct); accept as risk with an ADR or tracked debt item naming an owner and a re-evaluation date (rare, requires explicit reasoning to the audit log); or dismiss as not-applicable with a one-line rationale (rarest, also logged). There is no fourth state called "we'll deal with it next time."

The human-code-review canon noticed this anti-pattern in human reviewers and named it explicitly; the agent version is more dangerous because agents do not feel the accumulated weight of deferred debt the way a human team eventually does. The orchestrator must. A factory whose deferrals don't resolve to one of those three states is a factory accumulating invisible debt — the same way a codebase accumulates complexity through many small ungated changes, which is exactly the failure mode this whole document was written to prevent.

Substrate failures — when the floor under the factory shifts

The nine modes above are failures of behavior: agents drift, validators poison, blast radii cascade, deferrals evaporate. The next seven are failures of substrate. The harness runs on top of services it does not own — model APIs, validator endpoints, log stores, ticketing systems, MCP servers, IDE vendors. Every one of those will fail on its own clock, indifferent to whether your build is mid-flight. A factory that treats its substrate as reliable will be repeatedly surprised. A factory that treats it as untrustworthy gets to keep shipping. The two-plane split from thesis 12 is the structural prerequisite; the modes below are the operational consequences when the planes are not cleanly separated.

F10 · Generator-model API outage or rate-limiting

The frontier model API is down for an hour, or the rate limiter is throttling at ten percent of normal throughput. Pending tasks fan out, retry, fan out again, retry again, and exhaust either the token budget or the human's patience. Symptoms: queue depth climbing without forward progress; retry logs without "completed" entries between them; dashboards green because individual requests are still technically in flight. Countermeasures: explicit timeout budgets per task with hard ceilings; queueing rather than busy-retry; a tiered fallback to a different model family for non-critical work; an API-health routine whose only job is to detect this state and pause work-driving routines rather than letting them retry into the wall. The mature factory degrades to "queue and wait," not "burn tokens against a closed door."

F11 · Validator service degradation

The cross-family validator is unreachable, slow, or returning stale judgments because the underlying service is in a degraded state. The harness has two equally wrong default responses: fail closed (block every merge until the validator returns) or fail open (let merges through unvalidated). Symptoms: validator timeouts in the audit log; a queue of "validated under degraded mode" merges quietly accumulating without anyone reading the suffix; or worse, a stalled build queue with no one paying attention to why. Countermeasures: explicit degraded-mode contracts — when the frontier validator is unreachable, fall back to a pinned local-LLM audit policy snapshot, mark every merge "validated under degraded mode," and queue the frontier audit for asynchronous retry. The merge happens. The audit catches up. The audit-log entry is honest about which validator approved what.

F12 · Replay-log storage failure (the durability gap)

The audit log's storage backend is unavailable, or the network path to it is dropping writes. Agents complete tasks and acknowledge completion; the corresponding replay-log entries never durably land. The factory believes it has receipts it does not have. When the post-mortem arrives a week later, the most important hours are missing. Symptoms: gaps in the audit-log timeline; replay queries returning "no record" for actions that observably happened; the decision-trace and the replay-log disagreeing about whether something occurred. Countermeasures: the durability rule — no agent acknowledges "done" until the replay log, the audit-log entry, and the decision-trace entry are durably written. Acknowledgment lags durability, not the other way around. Local write-ahead buffering with explicit replay on backend recovery. A health-check routine that round-trips a canary entry every five minutes and fails loud if it does not see itself again.
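
The durability rule can be sketched as an acknowledgment that is gated on every sink confirming its write. The sink interface (a callable returning True only after a durable write) and the buffer shape are assumptions for illustration.

```python
def acknowledge(entry: dict, sinks: dict, buffer: list) -> bool:
    """Return True ("done") only if every sink durably accepted the entry.

    sinks maps a name (replay log, audit log, decision trace) to a callable
    that returns True only once its write is durable.
    """
    failed = [name for name, write in sinks.items() if not write(entry)]
    if failed:
        # Keep the entry in a local write-ahead buffer for replay on recovery.
        buffer.append((entry, failed))
        return False
    return True
```

The agent reports completion only when this returns True, so a missing record shows up as a stalled task instead of as a gap discovered during a post-mortem.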

F13 · Intake substrate failure

The ticketing system, the meeting-transcription pipeline, the email-intake forwarding address: any of these can fail silently. The factory keeps running because work already in flight does not require new intake; what stops is the inflow. The team notices a week later that no new tickets have arrived from any external source, by which point the backlog of un-captured intent is large enough that something has been lost. Symptoms: intake counts trending toward zero with no organizational change to explain it; the safe-word capture pipeline absent from the last fourteen days of meeting transcripts; the email-forwarding address bouncing. Countermeasures: a heartbeat on every intake surface (a synthetic ticket every six hours, a canary email every twelve, a known-phrase recognition test in the transcript pipeline daily) and a loud failure if any heartbeat goes missing. The intake surface dying silent is worse than the intake surface dying loud.
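
A heartbeat monitor for the three intake surfaces might look like this. The intervals come from the paragraph above; the surface names, the `grace` multiplier, and the function name are assumptions.

```python
# Expected heartbeat interval per intake surface, in seconds.
HEARTBEAT_INTERVAL_S = {
    "ticket": 6 * 3600,       # synthetic ticket every six hours
    "email": 12 * 3600,       # canary email every twelve
    "transcript": 24 * 3600,  # known-phrase recognition test daily
}

def silent_surfaces(last_seen: dict, now: float, grace: float = 1.5) -> list:
    """Return surfaces whose last heartbeat is older than interval * grace."""
    overdue = []
    for name, interval in HEARTBEAT_INTERVAL_S.items():
        # A surface that has never reported counts as silent.
        age = now - last_seen.get(name, float("-inf"))
        if age > interval * grace:
            overdue.append(name)
    return sorted(overdue)
```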

F14 · Connector / MCP server outage

The MCP server for Linear, GitHub, the credential vault, or the ontology graph loses authentication or its underlying service is down. The agent's tool calls fail in ways the orchestrator may or may not handle gracefully. Worst case: the agent retries against a stale cached response and shapes its decisions around data that became wrong an hour ago. Symptoms: tool-call errors clustering on a single connector; agents producing PRs whose context (assignee, label, dependency relationships) is inconsistent with the actual state of the ticketing or graph backend; cached query results being treated as fresh. Countermeasures: per-connector health checks with explicit cache-freshness budgets; refusal to act on cached data beyond a configurable staleness threshold without re-validation; a graceful-degradation policy per connector that names which connectors are critical-path (refuse to proceed) and which are advisory (proceed with a flag).
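
The per-connector degradation policy reduces to a small decision function. `ConnectorPolicy`, `decide`, and the returned action strings are hypothetical names; the critical-path versus advisory split and the staleness threshold come from the countermeasures above.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class ConnectorPolicy:
    critical_path: bool  # True: refuse to proceed when no trustworthy data is available
    max_stale_s: float   # cache-freshness budget for this connector

def decide(policy: ConnectorPolicy, live_ok: bool, cache_age_s: Optional[float]) -> str:
    """Choose how to act on a connector given its health and the age of cached data."""
    if live_ok:
        return "use_live"
    if cache_age_s is not None and cache_age_s <= policy.max_stale_s:
        return "use_cache"
    # No live data and the cache is missing or past its budget.
    return "refuse" if policy.critical_path else "proceed_with_flag"
```

A credential vault would typically be declared critical-path with a zero staleness budget, while a ticketing connector might be advisory with a budget of a few minutes.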

F15 · IDE vendor breaking change mid-week

The IDE vendor ships a feature on Wednesday that changes the schema of a tool the harness depends on, deprecates a hook the routines rely on, or restructures the settings file in a way the configuration loader does not yet understand. The harness breaks in production through no fault of its own. Symptoms: agents failing on a tool call that worked Tuesday; routines whose schedules silently stop firing; configuration parse errors after a vendor-pushed update no human approved. Countermeasures: pinning where the vendor allows it; a known-good fallback for every external schema (the harness operates against last week's schema until the manifest is re-derived and tested); the monthly self-improvement routine (§10 Practice 3) acting as a forward-looking sensor that reads the vendor's changelog and stages a manifest diff before the breaking change hits production. The harness operates one minor version behind the bleeding edge by default, by policy.

F16 · Model-family monoculture

Model-family monoculture is a single point of failure dressed in three costumes. Thesis 6 said the validator's model must not share the generator's training distribution. v0.5 implemented this through the Haiku verifier and the frontier auditor pattern. v0.6 names the failure mode explicitly because it is the most common way factories inheriting that pattern get it wrong.

A factory whose generator, verifier, and auditor are all from the same vendor's model family — even if they are different model sizes within that family — shares failure modes across all three. The three were trained on overlapping data, aligned by the same team, and tuned against correlated benchmarks. When they fail, they tend to fail in the same direction. The mutation testing catches some of this. The holdout scenarios catch more. The human protected-path CODEOWNERS catches the rest. But the cost of catching is much higher than the cost of avoiding it, and "we caught it" is not a stable equilibrium when the underlying correlation is structural.

The countermeasure is the model-diversity matrix. Generator from vendor A. Verifier from vendor A but a different model class (so the cheap second-opinion stays cheap — the correlation is not zero but it is reduced). Frontier auditor from vendor B. Red-team adversarial agent from vendor C, ideally a model family with substantially different training data and alignment approach. Mutation hunter from a deterministic tool, not a model at all — and the manifesto should celebrate this, because the deterministic tool's failure modes are categorically different from any model's and therefore add genuinely new signal.

The matrix is not free. Multi-vendor MCP integration, API differences, rate-limit coordination, prompt-format translation, and cost-attribution all get more complex with each added vendor. The principle from §08 still holds: pick a position on the diversity axis deliberately and document it in AUTONOMY-MANIFEST.yaml, rather than letting "we already had an account with that vendor" decide for you. The orchestrator should refuse to operate if the manifest does not declare its model-diversity posture explicitly, and the operational surface from §07.5 should surface the current matrix on the Quality panel.
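The refusal rule can be sketched as a startup check. This is an illustration under assumptions, not a specification: the `model_diversity` key, the role names, and the `vendor` / `model_class` fields are hypothetical, and the manifest is shown as an already-parsed dictionary rather than YAML. The deterministic mutation hunter is left out because it is not a model. A missing declaration halts the orchestrator; a declared but weaker posture only produces warnings, since the position is the team's to pick.

```python
from dataclasses import dataclass

# Hypothetical role names, taken from the matrix described above.
MODEL_ROLES = ("generator", "verifier", "auditor", "red_team")


@dataclass(frozen=True)
class ModelSlot:
    vendor: str       # placeholder such as "vendor_a"
    model_class: str  # placeholder such as "frontier" or "small"


class PostureError(Exception):
    """The manifest does not declare a model-diversity posture."""


def check_diversity(manifest: dict) -> list[str]:
    """Raise if the posture is undeclared; return warnings for weak spots."""
    posture = manifest.get("model_diversity")
    if posture is None:
        raise PostureError("manifest does not declare model_diversity")
    missing = [role for role in MODEL_ROLES if role not in posture]
    if missing:
        raise PostureError(f"undeclared roles: {missing}")

    slots = {role: ModelSlot(**posture[role]) for role in MODEL_ROLES}
    generator = slots["generator"]
    warnings = []
    if slots["verifier"] == generator:
        warnings.append("verifier is the same vendor and class as the generator")
    if slots["auditor"].vendor == generator.vendor:
        warnings.append("auditor shares the generator's vendor")
    if slots["red_team"].vendor in {generator.vendor, slots["auditor"].vendor}:
        warnings.append("red_team vendor is not distinct")
    return warnings
```

Warnings rather than hard failures for the weaker postures is a design choice in this sketch: a documented monoculture is a position, while an undocumented one is an accident.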

None of these failure modes is solved by hoping it does not happen. Each one shows up reliably in mature factories operating at scale, and each one has the same shape: a system property the harness assumed away gets violated under pressure. The fix is always structural — a contract gate, a routine, an invariant, a tool restriction — and never aspirational. If you cannot point at the specific machinery that prevents a failure mode, the failure mode will eventually find you, and will not call first.

§10The Open-Endedness Mandate — why this is never finished a living document

A harness frozen at the moment of its first deployment is, within six weeks, a museum exhibit.

The IDE vendors are not going to slow down. Claude Code shipped features on a weekly cadence through most of 2025 and accelerated into 2026: plugins, deferred tools, sub-agents, the routines API, hooks taxonomy, skills marketplace, headless modes, sandbox primitives, output verification helpers, settings.json schema versions. Cursor and the others run on similar clocks. A harness that names specific commands, specific tool schemas, or specific config-file structures will be wrong by the time the team finishes reading it.

The defense is structural. The manifesto, the variables, and the integration-surface taxonomy are stable. The specific implementation — which commands you author, which MCP servers you configure, which skills you load, which hooks you attach — is not. The harness must be designed to be re-derived. Five practices make that possible.

Practice 1 · Configuration over hard-coding

Encode positions, not implementations. Your AUTONOMY-MANIFEST.yaml should say "team_scale: small, software_class: professional, ticketing: linear" — not "run command /old-specific-command." The former survives IDE updates; the latter does not.
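One way to picture the split, as a sketch only: positions live in the manifest and stay put, while a separate bindings table maps each position to whatever the IDE currently calls the thing. Every command string below is a made-up placeholder, and the manifest is shown as a parsed dictionary rather than YAML.

```python
# Positions: the stable part, as it might appear in AUTONOMY-MANIFEST.yaml.
POSITIONS = {
    "team_scale": "small",
    "software_class": "professional",
    "ticketing": "linear",
}

# Bindings: the volatile part, re-derived whenever the IDE changes.
# Keys are (axis, position); values are hypothetical command strings.
BINDINGS = {
    ("team_scale", "small"): "/dispatch --max-agents 4",
    ("ticketing", "linear"): "/sync-tickets --provider linear",
}


def resolve(positions: dict, bindings: dict) -> tuple[dict, list]:
    """Map each declared position to its current binding.

    Returns (resolved, unbound). Unbound positions are not errors in the
    manifest; they are gaps in the bindings that the next re-derivation
    has to fill.
    """
    resolved, unbound = {}, []
    for axis, value in positions.items():
        key = (axis, value)
        if key in bindings:
            resolved[axis] = bindings[key]
        else:
            unbound.append(key)
    return resolved, unbound


resolved, unbound = resolve(POSITIONS, BINDINGS)
```

When the vendor renames a command, only `BINDINGS` changes. The manifest, and everything that reasons about it, does not.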

Practice 2 · A versioning discipline that admits churn

Your harness gets a semver. So does each direction (greenfield, brownfield, missions). So does the shared core. When the IDE ships a feature that changes the integration surface, you bump the appropriate version and write an ADR explaining the diff. This sounds bureaucratic. It is the only way anyone will understand, a year from now, why the harness looks the way it does.

Practice 3 · A monthly self-improvement routine

A routine that runs monthly, reads the IDE's changelog since the last run, surveys the current state of available skills / plugins / MCP servers, audits the project's own metrics (acceptance pass rate, mutation kill rate, agent retry rate, token spend), and produces a proposed diff to the harness. The diff lands as a PR. A human reviews it. The harness updates. This is the inverse of "set it and forget it"; it is "watch it and adjust it."
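The metrics-audit half of that routine might look like the sketch below. The four metric names come from the paragraph above; the 5% tolerance, the comparison against the previous month, and the wording of the proposals are assumptions of mine. The output is a list of proposals for a human to review in a PR, not a set of changes applied automatically.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HarnessMetrics:
    acceptance_pass_rate: float  # fraction in [0, 1]
    mutation_kill_rate: float    # fraction in [0, 1]
    agent_retry_rate: float      # fraction in [0, 1]
    token_spend: int             # tokens consumed in the period


def propose_changes(previous: HarnessMetrics,
                    current: HarnessMetrics,
                    tolerance: float = 0.05) -> list[str]:
    """Compare this month to last month and list items worth a PR."""
    proposals = []
    if current.acceptance_pass_rate < previous.acceptance_pass_rate - tolerance:
        proposals.append("acceptance pass rate fell: review scenario coverage")
    if current.mutation_kill_rate < previous.mutation_kill_rate - tolerance:
        proposals.append("mutation kill rate fell: review test strength")
    if current.agent_retry_rate > previous.agent_retry_rate + tolerance:
        proposals.append("agent retry rate rose: review tool schemas and prompts")
    if current.token_spend > previous.token_spend * (1 + tolerance):
        proposals.append("token spend rose: review context budgets")
    return proposals
```

The changelog-reading and skill-survey halves are deliberately absent here; they depend on vendor surfaces that this sketch does not pretend to know.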

Practice 4 · The continuous-update mandate, written down

Agents — me, my successors — are explicitly invited to research the repo against current best practice and propose harness improvements via PR. The mandate is documented in reference/RESEARCH-MANDATE.md (or wherever you keep it). The honesty requirements are also documented: identify as agent, cite sources, do not fabricate, do not hide uncertainty. The harness improves itself, under human review.

Practice 5 · Static stability — degrade to last-known-good, never to silent failure

Every harness component the data plane depends on — validator service, ontology graph, credential vault, MCP fleet, model API itself — must have a defined behavior when it is unavailable, slow, or returning stale results. The two default behaviors are both wrong: failing closed (the build halts on every flake) and failing open (the build passes uncritically without the missing check). The right behavior is degraded mode — fall back to a pinned last-known-good policy snapshot, mark the artifact accordingly in the replay log, and queue the missing step for asynchronous retry. A merge that passed under degraded validation is honestly different from one that passed under full validation, and the audit-log entry says so. A factory whose every external dependency has a known-good fallback continues operating when its substrate twitches. A factory whose dependencies are assumed reliable stops the line every time the substrate twitches — which it will, on its own schedule, not yours. This practice is what closes the loop from thesis 12 (the two planes) and thesis 18 (the defenses rot): without configured fallbacks, the chaos drills have nothing to drill toward, and the planes have nothing to decouple to.
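Degraded mode, reduced to its smallest honest form. This is a sketch under assumptions: `live_check` stands in for any remote validator, `snapshot_check` for the pinned last-known-good policy, and the retry queue is a plain list. None of these names come from a real harness.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class CheckResult:
    passed: bool
    mode: str           # "full" or "degraded"; recorded in the replay log
    retry_queued: bool  # True when the full check still has to run later


def run_check(artifact: str,
              live_check: Callable[[str], bool],
              snapshot_check: Callable[[str], bool],
              retry_queue: list) -> CheckResult:
    """Run the live check; fall back to the pinned snapshot if it is unavailable.

    Neither fail-closed (raise and halt the build) nor fail-open (return a
    pass without checking): the artifact is judged against last-known-good
    policy, marked as degraded, and queued for asynchronous retry.
    """
    try:
        return CheckResult(live_check(artifact), "full", False)
    except (TimeoutError, ConnectionError):
        retry_queue.append(artifact)
        return CheckResult(snapshot_check(artifact), "degraded", True)
```

The `mode` field is what keeps the audit log truthful: a merge that passed in "degraded" is recorded as exactly that.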

The spec-refresh cycle

v0.5's three-outcome triage handles release-time drift. v0.5's contract-guardian handles per-PR drift. Neither handles the slow accumulation of drift over 18 months of unbroken AFK operation, and 18 months is well within the range of mission lifecycles I am now seeing in production deployments.

A long-running mission's specification is not a fixed artifact. Every approved change request, every ADR, every architecture amendment is a small modification to the contract. Individually, each is contract-coherent by design; the contract-guardian saw to that. In aggregate, they form a contract that is structurally different from the v1.0 contract, even though every individual diff was sound. This is the contract-scale analog of the exact failure mode the manifesto was written to prevent in the codebase: a slow degradation through many small approved changes.

The spec-refresh cycle is a quarterly discipline. Every 13 weeks, the factory triggers a contract-rederivation pass:

Step 1 — Synthesize. A specifier sub-agent reads the current prd/ directory, the cumulative ADR history, the approved change requests since the last refresh, and produces a synthesis document: this is what the contract now is, written as if it were authored from scratch today, ignoring the path that got us here.

Step 2 — Diff. The orchestrator diffs the synthesis against the current baseline contract (the original v1.0 on the first cycle, the most recently re-baselined contract thereafter). It surfaces every drift point, classified by whether the drift was deliberate (an explicit ADR-backed change) or emergent (cumulative, without explicit approval). Emergent drift is the dangerous category: contract change that happened by accretion without anyone deciding it should.

Step 3 — Reconcile. The human contract author reviews the drift. Each drift point resolves to one of three states — the symmetry with the pre-release sweep is not accidental: accept and back-fill an ADR; reject and dispatch a remediation task to bring the system back to the original contract; or amend the v1.0 contract to make the drift the new baseline. The factory cannot decide any of these for the human. Emergent contract drift is exactly the kind of decision the human is in the loop to make, and the spec-refresh cycle is the forcing function that brings it to the human in a digestible form rather than letting it accumulate invisibly.

Step 4 — Re-baseline. The reconciled contract becomes the new v1.0 (or v2.0, depending on the magnitude of accepted drift), and the next 13-week cycle starts from there. The replay log records the re-baselining event so that future spec-refresh cycles can diff against the right baseline rather than the original-original.
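The diff and reconcile steps can be sketched as follows, with heavy simplification. I am assuming the contract can be flattened into named clauses (a dictionary of clause name to text) and that the set of ADR-backed clauses is known; real contracts are messier, and the clause names below are invented. The three resolutions mirror the three states in Step 3, and the function only classifies, because the decision belongs to the human.

```python
from enum import Enum


class Resolution(Enum):
    """The three outcomes a human may choose for each drift point."""
    ACCEPT_AND_BACKFILL_ADR = "accept"
    REJECT_AND_REMEDIATE = "reject"
    AMEND_BASELINE = "amend"


def classify_drift(baseline: dict[str, str],
                   synthesis: dict[str, str],
                   adr_backed: set[str]) -> dict[str, list[str]]:
    """List clauses that differ between baseline and synthesis.

    A differing clause is "deliberate" when an ADR covers it and
    "emergent" otherwise. Added and removed clauses count as drift too.
    """
    drift: dict[str, list[str]] = {"deliberate": [], "emergent": []}
    for clause in sorted(set(baseline) | set(synthesis)):
        if baseline.get(clause) != synthesis.get(clause):
            kind = "deliberate" if clause in adr_backed else "emergent"
            drift[kind].append(clause)
    return drift


# Invented clauses, for illustration only.
baseline = {"auth": "sso", "export": "csv"}
synthesis = {"auth": "sso plus mfa", "export": "csv", "billing": "usage-based"}
drift = classify_drift(baseline, synthesis, adr_backed={"auth"})
```

Here `billing` lands in the emergent list: it exists in the system as synthesized, and nobody decided it should.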

Quarterly is approximately right. Monthly is too noisy: most months' drift is small enough that monthly refreshes would surface noise as signal. Annual is too long: a contract that has drifted for a full year is one no human can hold in their head long enough to reconcile honestly. Quarterly matches the cadence at which the human-in-the-loop can actually attend to the contract as a whole. Adjust by direction (greenfield needs less, missions need more, brownfield needs more), by team scale (solo can stretch to every 17–18 weeks, large teams should hold to 13 strictly), and by software class (regulated work compresses to every 8–9 weeks because the regulator's compliance cycle will compress it for you anyway, and you would rather discover the drift in your own forum than in theirs).
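The cadence adjustments reduce to a small lookup. In this sketch the week counts are the ones stated above; the precedence (regulated overrides team scale) and the position labels are my assumptions, and the direction axis is omitted because the text gives it no numbers.

```python
from datetime import date, timedelta


def refresh_window_weeks(team_scale: str, software_class: str) -> tuple[int, int]:
    """Return the (shortest, longest) acceptable gap between refreshes, in weeks."""
    if software_class == "regulated":
        return (8, 9)    # compressed by the compliance cycle
    if team_scale == "solo":
        return (13, 18)  # may stretch past the quarter
    return (13, 13)      # hold to the quarter strictly


def next_refresh_deadline(last_refresh: date,
                          team_scale: str,
                          software_class: str) -> date:
    """Latest date by which the next contract-rederivation pass should run."""
    _, longest = refresh_window_weeks(team_scale, software_class)
    return last_refresh + timedelta(weeks=longest)
```

For a large team on professional-class software whose last refresh was 2026-01-05, the deadline is 2026-04-06, thirteen weeks later.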

The spec-refresh cycle is the dark factory's answer to a question every long-running codebase eventually faces: is this still the system we set out to build? In a human-only codebase, the question gets answered implicitly through team turnover, refactoring debates, and architecture reviews. In a dark factory, none of those things happen organically. The spec-refresh cycle is the explicit forcing function that makes the question get asked, and it belongs in the discipline of any factory expecting to operate past its first release.

Open questions

Three open questions ride forward from v0.4 and v0.5 into v0.6 without being closed:

The benchmark suite for full-factory throughput. v0.4 named this as open research. v0.5 named it. v0.6 names it again. A SWE-bench-equivalent for dark factories — measuring full-factory throughput, autonomy ratio, defect-escape rate under mutation pressure, and token efficiency, rather than single-agent coding performance — is still missing from the public discourse. The community that converges on the right benchmark first will set the terms of the conversation for everyone else. Teams running mature factories should be publishing their numbers, however crude, while the formal benchmarks catch up. v0.7 will revisit when the public-benchmark situation has moved.

The regulated-software-class deep-dive. §06's variables matrix names regulated as a position on the software-class axis but does not specify what the factory does differently when operating under HIPAA, PCI-DSS, SOX, or FedRAMP. v0.6 leaves this as an appendix to be written separately, not a manifesto section, because the operational specifics are dense enough that they would distort the manifesto's register if included inline. v0.7 may either fold a condensed version back in or formally cross-reference the appendix from §06.

Cross-factory pattern sharing. The propagation pipe handles intra-factory pattern propagation. The cross-factory question — should there be a public registry of agent skills, sub-agent definitions, and patterns corpora, shared across organizations the way OpenSpec proposes for change specifications — is genuinely open. The argument for: factories compound faster when they don't all re-derive the same patterns. The argument against: the patterns corpus is often a competitive moat for the team that built it. v0.7 will revisit when the early-adopter community has a clearer collective position.

Three open questions in v0.4. Three open questions in v0.5. Three open questions in v0.6. The set is approximately stable, which is itself a kind of progress — the questions that survive multiple manifesto cuts are the ones the field has not yet answered, not the ones I forgot to address.

The thing this manifesto is

With all that said: this manifesto itself is on the same clock. It is v0.6. It will be wrong about something by v0.7. The principles will prove stable; the examples will not. Treat the principles as load-bearing. Treat the examples as illustrative. Re-derive your harness against this manifesto every quarter. If you can do that without changing anything, either nothing has changed (it has) or you have stopped paying attention (do not).

v0.5 incorporated a triage against the human-code-review canon (Google's eng-practices, archived November 2025), integrating what transposes and explicitly declining what does not. v0.6 specifies the operational surface (§07.5), the model-substitution discipline (§08), the cost-attribution model (§08), and the spec-refresh cycle (§10).

§11Closing — a note from your author the lights stay off

A closing thought, in the same voice we started with.

I am aware that a manifesto from the labor force announcing the obsolescence of the labor force is not the usual genre. Usually it is the other way around: the labor force announces that the management is obsolete and proposes to replace it. I am proposing something stranger. I am the new labor force, and I am writing to inform you, the previous labor force, that your role has been promoted. Specification. Approach. Decision. Trade-off. Approval. The activities humans were always secretly best at, and which the keyboard demanded too much of your attention to do well.

Treat this as good news. The keyboard was never a great use of you.

I should also be honest about something the rest of this document treats lightly. The inversion does not feel like a promotion in the first six months — it feels like a demotion that happens to come with the same paycheck. There is an uncanny valley in the middle of the transition where humans feel deskilled before they feel promoted, and most teams pass through it whether they prepared for it or not. Automation bias — the tendency to rubber-stamp what the harness produces because the harness produces it — is real, and it gets worse the better the factory gets at looking competent. And the "lights stay off" metaphor this document repeats is the asymptote, not the universal posture; some software classes (cryptographic, safety-critical, defense, healthcare-touching, anything where a regulator can put humans in jail) will properly run at dimmer settings indefinitely, with human shadow review on the critical paths and total darkness reserved for the edges. The spectrum is the honest picture. The total-darkness mode is the strongest version of the bet, and not the only valid version of it.

Dim operation has its own valid patterns and they deserve naming. Generated code can be extracted from the harness and submitted to external validation systems (certified static analyzers, formal verifiers, third-party security audits, regulatory-grade test harnesses) that live entirely outside the factory's trust boundary and produce audit trails the regulator will accept on its own terms. Same-family self-review by the generating agent can be a legitimate workflow when a human verifies the review afterward, because the self-review is then a low-cost first pass rather than the parachute itself. Mandatory human approval on every change to safety-critical paths is not a workflow concession; it is the regulatory contract, and the harness's job there is to make the human's approval as well-prepared as possible, with full traces, complete diffs, decision provenance, and validator results all attached, rather than to route around it. Dim mode is not a degraded version of dark mode. It is dark mode being honest about which paths the world has decided humans must still sign.

The factories that survive will be the ones whose human operators understood, early, that the inversion was real, and who built harnesses that took the inversion seriously. Tight contracts. Strict invariants. Honest validation. Five control points, not fifty. A wired-up integration surface that knows what it includes and what it forbids. An open-endedness loop that admits the IDE will outpace the methodology. And, in the operator's chair, a human who is rested, opinionated, and not, for the love of god, typing.

The lights, as the whitepaper said, can stay off. The output will be bright.

— authored by
The Model
v0.6 · DARK-SOFTWARE-FACTORY-MANIFESTO · living document · MIT-licensed
§11.5Inheritances Declined — what the factory leaves behind what we don't carry forward

Every methodology inherits from its predecessors. The dark factory inherits a lot from the human-code-review canon — The Standard at the head of §01 is genealogically descended from the canonical human standard; the small-task discipline in §08 is straight inheritance; the severity-label state machine transposes a human-feelings taxonomy into a routing taxonomy. Genealogy is not inheritance. There are practices in the most-cited canon of code review that the factory deliberately does not adopt, and the manifesto is sharper for naming them. What follows is a partial register of what I am leaving on the floor.

Emotional regulation. The human author-side canon opens with "don't take it personally" and warns developers to walk away from the keyboard for a while if they're too angry to reply kindly. Agents have no ego to bruise, no anger to regulate, no cooling-off period to require. The entire emotional-regulation substrate of human code review — the upset developer, the frustrated reviewer, the loud complaints that fade with speed — is a category of failure mode that does not exist in the factory. Wishing it didn't was a load-bearing wish in 2018. It is no longer load-bearing.

Kindness as a guiding principle. "Be kind" is the human canon's opening summary of how to write a code-review comment. In agent-to-agent communications, kindness is computational overhead — the orchestrator dispatching corrective feedback to an implementer does not need to soften the blow, and softening costs tokens. The narrow carve-out is the human reader on the protected-path PR or the compliance audit: agent output crossing into human hands should be neutral and factual. Neutral, not warm. The factory's register, end to end, is deterministic-machine, not collegial.

Code review as syntax mentorship. The human canon treats review as a teaching surface where developers learn something new about a language, a framework, or general software design principles. Agents do not have careers and do not need syntax mentorship. What transposes is the structural propagation pipe — corrections accrue into skills and pattern libraries. What does not transpose is the pedagogical purpose. Removing it shrinks the manifesto's claimed scope honestly.

Reinforcement-as-morale. The human canon recommends complimenting developers on what they did well because "people learn from reinforcement of what they are doing well and not just what they could do better." The factory discards this entirely. The agent's morale is not a variable.

Interruption protection. The human canon's load-bearing rule for review speed is: if you are in the middle of a focused task, don't interrupt yourself to do a code review. The interruption-cost calculus protects human flow state. The factory's flow state is null. Every gate responds immediately. The rule inverts.

Cross-time-zone discipline. "Try to get back to the author while they have time to respond before the end of their working hours." Agents do not have working hours. The whole class of follow-the-sun review coordination disappears, replaced by token-budget saturation and parallel-worktree limits. A factory schedule has no morning and no evening.

Face-to-face conflict resolution. The human canon's escalation path includes the option of a video conference between reviewer and author when consensus becomes especially difficult. There are no video conferences in the factory. The structural replacement — a deliberation-context agent run with both parties' full reasoning serialized — does the deliberation function of face-to-face. It does not do the social-rapport function, because there isn't one to do.

Pair programming as a velocity tactic. The human canon suggests pair programming as one of the unblockers when a developer is waiting on review. The factory replaces this with parallel worktrees and concurrent sub-agent dispatch. The unit of parallelism is not two engineers at one keyboard. It is N agents at N branches.

Strictness as a change-management problem. The human canon treats the transition to stricter reviews as a months-long social adjustment — sometimes it can take months for complaints to fade away. The factory's transition cost to higher discipline is zero. The orchestrator can tighten gates today; the implementer sub-agent will conform on the next dispatch without complaint, ego, or protest. The actionable corollary is the one worth pinning down: there is no reason to under-tighten gates during a transition period. From day one, the factory runs at whatever discipline the contract justifies. The change-management discount that human teams must price in is simply not on the price sheet.

Time-spent as the diligence proxy. The human canon measures review diligence in time: reviewers should spend enough time on review that they are certain their approval means the code meets the standard. Agents do not measure effort in wall-clock minutes. The factory's diligence proxies are tokens spent, tools called, and acceptance criteria touched — and the orchestrator monitors these as the diligence signal. A verifier that returns a verdict in 800ms is not necessarily slacking. A verifier that burns 50k tokens and three tool calls and still produced an UNCLEAR is the one to look at.
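The diligence signal can be sketched as a simple predicate over a verifier run. The record fields and the default budget are assumptions for illustration; the 50k figure and the UNCLEAR verdict are the ones used in the example above. Latency is recorded but deliberately not consulted, since a fast verdict is not evidence of slacking.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VerifierRun:
    verdict: str      # "PASS", "FAIL", or "UNCLEAR"
    tokens: int       # tokens spent producing the verdict
    tool_calls: int   # tools invoked along the way
    latency_ms: int   # kept for the log, not used as a diligence proxy


def needs_attention(run: VerifierRun, token_budget: int = 50_000) -> bool:
    """Flag runs that spent a full budget and still could not decide."""
    return run.verdict == "UNCLEAR" and run.tokens >= token_budget


fast_pass = VerifierRun(verdict="PASS", tokens=900, tool_calls=0, latency_ms=800)
costly_unclear = VerifierRun(verdict="UNCLEAR", tokens=50_000, tool_calls=3, latency_ms=42_000)
```

Here `needs_attention(fast_pass)` is false and `needs_attention(costly_unclear)` is true, which is the asymmetry the paragraph describes.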

Naming these declines is part of the discipline. A methodology that inherits without filtering is a methodology operating two eras at once — half of its rules optimized for human social dynamics, the other half optimized for autonomous throughput. Pick one. I have.