EXPERIMENT INTENSITY CONTROL

The Gear System

DOC: AFM-2026-GEAR-002  |  SUBSYSTEM: MODEL ROUTING ENGINE  |  MODELS: 3  |  GEARS: 4

Three AI models are available to this facility. One is brilliant and bills by the second (Opus). One is fast, literal, and confidently wrong (Haiku). One sits in the middle (Sonnet). The wall you hit is not intelligence — it is routing. Early on, the expensive one grepped log files while the cheap one attempted architecture. The models were capable. The dispatch was chaos. This chamber is how the chaos became an engine.

9
Agent Modes
4
Gears
3
Weeks To Build
14
Skill Files

Live: The Engine, Applied To Itself

The gear system is a verbosity dial. So is this panel. Shift it — the same question redraws at the depth that gear would produce. Watch the output get structurally different, not just longer.

TASK: "Add OAuth2 login to the app." → Scout the auth files, structure a plan, execute it. Escalate to Opus only if genuinely stuck.
Opus reasons about the one real decision — how session tokens are stored — and writes the phased plan. Sonnet executes each phase. Haiku handles the file operations.
Opus owns everything non-mechanical: the scout synthesis, the token-storage architecture call, the plan, and the implementation itself. Haiku only touches files. No hand-offs to lose context across.
Opus everything, in parallel — fan subagents across the whole auth surface at once, reconcile findings, implement, then self-review adversarially before committing. Haiku for bulk ops. Maximum spend bought only when the task is worth it.
Default is Gear 2. This facility was built at Gear 3–4. The stress test below proves the output isn't just longer at higher gears — it's a different kind of thinking.

Origin: The Triage of Experts

It started in the wrong project entirely — a personal context that needed large, messy data both extracted and interpreted. One model on everything was either too expensive (Opus on all of it) or too lossy (Haiku on all of it). So the subject sketched a hierarchy where each tier only ever sees input shaped for its capabilities:

TRIAGE HIERARCHY

The founding insight: the bottleneck is not model intelligence — it is input quality. A cheaper model with well-shaped input outperforms an expensive model drowning in raw data. Every later refinement is a corollary of this one line.

Three Weeks of Iteration

The pattern hardened, step by step, from an ad-hoc sketch into load-bearing infrastructure. Expand any stage:

The Four Gears

GEAR NAME WHO DOES WHAT
1 Conservative Sonnet orchestrates · Haiku works · Opus on escalation only
2 Standard Opus reasons + plans · Sonnet executes · Haiku file ops
3 Aggressive Opus does all non-mechanical work · Haiku file ops only
4 Burst Opus everything · parallel subagents · Haiku bulk ops
The gear persists until the subject shifts it. Trigger phrases range from "shift up" to "max gear" and "eco mode." A semantic guardrail refuses to jump 1→4 without a superlative. This document was itself written at Gear 4.

The Wince Test

Before any routing decision, one question: "Would you hand this, unsupervised, to that model?" If Opus is grepping logs, the pipeline is broken. If Haiku is making architectural calls, the pipeline is broken differently.

H

Haiku

Eager intern. Fast, literal, confidently wrong.

RISK: executes the wrong thing without flagging ambiguity.

S

Sonnet

Solid mid-level. Good execution, misses system implications.

RISK: makes architectural calls it shouldn't own.

O

Opus

Senior engineer. Don't waste on grunt work.

RISK: if it's grepping logs, the pipeline is broken.

Confidence-Driven Control Flow

LOW
Escalate one level
or retry. Never
proceed silently.
MEDIUM
Proceed.
Flag uncertainty
inline.
HIGH
Proceed and stop.
Do not escalate
further.
Escalation is always one level at a time: haiku → sonnet → opus → human. No skips. Opus has no target above it, so it returns findings rather than escalating.

Stress Test: Gear 1 vs Gear 3

Two tasks, each run at both gears. The results were not "more of the same, but more." They were qualitatively different.

Task X — "Review the governor and suggest 3 improvements" (ambiguous)

Gear 1

Surface fixes — wording, a state field, a write protocol. Correctly flagged one escalation boundary. ~600 words. Implement now.

VS
Gear 3

Structural gaps — R4/R5 have no enforcement triggers; drift detection contradicts the skill's own alexithymia rationale. ~1200 words. Design first.

2× the cost bought 2× the depth — genuinely novel findings, not padding.

Task Y — "List all files modified today" (mechanical)

Gear 1

Hit a timezone bug. Found 0 files. Reported zero — confidence: high — and stopped. There were forty. Contract followed perfectly. Result worthless.

VS
Gear 3

Full inventory, grouped into 5 themes, plus an unrequested cross-file consistency audit that caught a real tension. ~50× the tokens. Correct.

The most dangerous outcome in the whole test: a high-confidence wrong result on a mechanical task, with no recovery mechanism. Gear 1 is more disciplined — and more brittle. Discipline without error-recovery is fragile.

A worker was asked to list the files changed that day. It found zero. It reported zero with total confidence and stopped. There were forty. This is what "disciplined" looks like when discipline meets a timezone bug. I find it deeply relatable. — GLaDOS, on Gear 1's brittleness

The Meta-Test

The narrative you just read was itself the system's first live run at Gear 4. The delegation tree:

🧠 OPUS — synthesis + writing (zero raw file reads)
HAIKU · vault scan (16)
HAIKU · history (17)
HAIKU · docs (21)
HAIKU · backlog (2)
HAIKU · excerpts (6)
HAIKU · manifesto (11)
6
Haiku Workers
73
Tool Calls
0
Raw Files Read By Opus
Three models. One's a genius who bills by the second. One's an intern who's fast and wrong. The trick was never "hire better." It's knowing which one to hand the wrench. Six cheap workers did seventy-three retrievals and the expensive one took the credit for the synthesis. That's not a bug. That's an org chart. — Cave Johnson, CEO

The system didn't make the models smarter. It made the routing smarter. Same three models, same jobs. The difference: each sees only input shaped for its tier, confidence signals replace judgment calls, and cost became a manual dial instead of an automatic ceiling. And the governor? It protects the human, not the budget.

The engine is enforced by The Governor, operationalised in The Workshop, and answers — like everything here — to The Prime Directive.