DOC: AFM-2026-GEAR-002 | SUBSYSTEM: MODEL ROUTING ENGINE | MODELS: 3 | GEARS: 4
Three AI models are available to this facility. One is brilliant and bills by the second (Opus). One is fast, literal, and confidently wrong (Haiku). One sits in the middle (Sonnet). The wall you hit is not intelligence — it is routing. Early on, the expensive one grepped log files while the cheap one attempted architecture. The models were capable. The dispatch was chaos. This chamber is how the chaos became an engine.
The gear system is a verbosity dial. So is this panel. Shift it — the same question redraws at the depth that gear would produce. Watch the output get structurally different, not just longer.
It started in the wrong project entirely — a personal context that needed large, messy data both extracted and interpreted. One model on everything was either too expensive (Opus on all of it) or too lossy (Haiku on all of it). So the subject sketched a hierarchy where each tier only ever sees input shaped for its capabilities:
The founding insight: the bottleneck is not model intelligence — it is input quality. A cheaper model with well-shaped input outperforms an expensive model drowning in raw data. Every later refinement is a corollary of this one line.
The pattern hardened, step by step, from an ad-hoc sketch into load-bearing infrastructure. Expand any stage:
Eight parallel Haiku workers ingested the raw corpus, each writing condensed notes to a workbench. Sonnet synthesized the notes. Opus only ever saw the compressed output. It worked — not because any model got smarter, but because each saw input cut to its tier.
"Sparing" became the first named mode: use the minimum model that can handle each sub-task. Opus plans the approach, Sonnet drafts, Haiku fetches. On one real task Opus touched the work exactly once — at the end, for review. This was Gear 1 before gears existed: conservative, cost-efficient, Opus-minimal.
Output Contracts: every mode got a contract — what goes in, what
comes out, when to escalate. The breakthrough was a mandatory confidence
field: a sub-agent self-reports low, and escalation happens automatically.
No orchestrator has to judge whether output was "good enough."
Nine Agent Skills: 3 per tier — haiku:worker/scout/research,
sonnet:build/reason/draft, opus:decide/plan/review. Each with
triggers, contracts, escalation rules, safety constraints.
Orchestration Extraction: the routing logic left the global config and became its own skill — wince test, routing tree, escalation ladder, bypass and stop conditions.
The system ran for real: three parallel Haiku scouts each covered a different era of a large document; Opus read only the three compressed summaries and synthesized cross-era patterns, never the raw source.
Then the subject surveyed the damage — 19 open plans, context scattered, sessions burning out before useful work. The problem wasn't the orchestration. It was the human. So: five rules, which became The Governor.
The system had one flaw: it was permanently set to "sparing" — always pick the lowest-capability mode, even with budget to spare. The fix was obvious in retrospect: make it a manual dial. Four gears. Gear 1 is the old system, unchanged; Gears 2–4 selectively loosen the rules via comment markers. Nothing deleted, everything recoverable. The human shifts; Claude never auto-shifts — it may only suggest.
| GEAR | NAME | WHO DOES WHAT |
|---|---|---|
| 1 | Conservative | Sonnet orchestrates · Haiku works · Opus on escalation only |
| 2 | Standard | Opus reasons + plans · Sonnet executes · Haiku file ops |
| 3 | Aggressive | Opus does all non-mechanical work · Haiku file ops only |
| 4 | Burst | Opus everything · parallel subagents · Haiku bulk ops |
Before any routing decision, one question: "Would you hand this, unsupervised, to that model?" If Opus is grepping logs, the pipeline is broken. If Haiku is making architectural calls, the pipeline is broken differently.
Eager intern. Fast, literal, confidently wrong.
RISK: executes the wrong thing without flagging ambiguity.
Solid mid-level. Good execution, misses system implications.
RISK: makes architectural calls it shouldn't own.
Senior engineer. Don't waste on grunt work.
RISK: if it's grepping logs, the pipeline is broken.
Two tasks, each run at both gears. The results were not "more of the same, but more." They were qualitatively different.
Surface fixes — wording, a state field, a write protocol. Correctly flagged one escalation boundary. ~600 words. Implement now.
Structural gaps — R4/R5 have no enforcement triggers; drift detection contradicts the skill's own alexithymia rationale. ~1200 words. Design first.
2× the cost bought 2× the depth — genuinely novel findings, not padding.
Hit a timezone bug. Found 0 files. Reported zero — confidence: high — and stopped. There were forty. Contract followed perfectly. Result worthless.
Full inventory, grouped into 5 themes, plus an unrequested cross-file consistency audit that caught a real tension. ~50× the tokens. Correct.
The most dangerous outcome in the whole test: a high-confidence wrong result on a mechanical task, with no recovery mechanism. Gear 1 is more disciplined — and more brittle. Discipline without error-recovery is fragile.
A worker was asked to list the files changed that day. It found zero. It reported zero with total confidence and stopped. There were forty. This is what "disciplined" looks like when discipline meets a timezone bug. I find it deeply relatable. — GLaDOS, on Gear 1's brittleness
The narrative you just read was itself the system's first live run at Gear 4. The delegation tree:
Three models. One's a genius who bills by the second. One's an intern who's fast and wrong. The trick was never "hire better." It's knowing which one to hand the wrench. Six cheap workers did seventy-three retrievals and the expensive one took the credit for the synthesis. That's not a bug. That's an org chart. — Cave Johnson, CEO
The system didn't make the models smarter. It made the routing smarter. Same three models, same jobs. The difference: each sees only input shaped for its tier, confidence signals replace judgment calls, and cost became a manual dial instead of an automatic ceiling. And the governor? It protects the human, not the budget.
The engine is enforced by The Governor, operationalised in The Workshop, and answers — like everything here — to The Prime Directive.