Autenticare
Agentic Engineering · 7 min

The agent orchestra: why 3 specialists beat 1 generalist in production

The 'one agent does everything' model has hit its ceiling. The next generation coordinates teams of specialized agents — with isolated contexts, worktrees, a shared task list, and quality gates. Inspired by Addy Osmani, adapted to the Gemini Enterprise Agent Platform stack we run at Autenticare.

Fabiano Brito

CEO & Founder

TL;DR: Pairing with a single agent is what most people practice — and it's where most people get stuck. Productive teams today coordinate 3 to 5 specialized agents, each with its own context, running in isolated worktrees, syncing through a shared task list, with automated quality gates. The bottleneck is no longer code generation — it's verification. And the leverage isn't the prompt anymore: it's the spec.

The single-agent ceiling

Anyone who’s been working with agents for more than six months has felt it: there’s a plateau where adding more prompts stops accelerating anything. Three constraints explain the ceiling:

Constraint 1

📚 Context overload

Real customer codebases exceed any context window. When the agent has to "remember" everything at once, it forgets what matters most.

Constraint 2

🎯 No specialization

An agent that handles database, API, UI, and tests becomes a jack of all trades — master of none. An agent that only knows the data layer writes better SQL.

Constraint 3

🚦 No coordination

Multiple agents without coordination primitives (file locks, task dependencies) become chaos. Merge conflicts eat your parallelism gains.

Conductor vs orchestrator

The metaphor that best captures the shift (we owe it to Addy Osmani): you go from conductor — guiding a single musician in real time — to orchestrator — coordinating an entire ensemble asynchronously. The chat thread stops being your environment; the repo (and task board) takes its place.

| Dimension | Conductor (1 agent) | Orchestrator (team) |
| --- | --- | --- |
| Mode | Synchronous — you wait for each reply | Asynchronous — you plan and review |
| Context | Your window is the ceiling | N independent windows, one per specialist |
| Throughput | 1× — sequential | ~3× with 3 agents in parallel |
| Workspace | Chat thread | Repo + worktrees + task list |
| Your role | Prompt typist | Process engineer: spec, gates, retro |

Three orchestration patterns

In production, we use three patterns — chosen by task scope, not by trend:

Pattern 1

🌿 Subagents

Parent decomposes work, delegates to focused children, manages the dependency graph manually. Zero setup — start today. Token-cost neutral.

Setup: None · Sweet spot: 2–4 children
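The parent's job in this pattern can be sketched in a few lines of Python — a hypothetical `Subtask` type plus a manual dependency check. Task names, prompts, and the `dispatch` idea are illustrative, not a real agent API:

```python
from dataclasses import dataclass, field

@dataclass
class Subtask:
    name: str
    prompt: str                       # tightly scoped brief for one child agent
    depends_on: list = field(default_factory=list)

def runnable(tasks, done):
    """Children whose dependencies have all completed."""
    return [t for t in tasks
            if t.name not in done and all(d in done for d in t.depends_on)]

# Hypothetical decomposition of one feature into 3 focused children.
tasks = [
    Subtask("schema", "Design the SQLite schema for invoices."),
    Subtask("api",    "Write the Express routes for /invoices.", depends_on=["schema"]),
    Subtask("tests",  "Write supertest integration tests.",      depends_on=["api"]),
]

order, done = [], set()
while len(done) < len(tasks):
    for task in runnable(tasks, done):
        # dispatch(task.prompt) would hand the brief to a child agent;
        # here we only record completion to show the resulting order.
        order.append(task.name)
        done.add(task.name)
print(order)  # ['schema', 'api', 'tests']
```

The parent never writes code itself; it only decomposes, dispatches, and unblocks the next child when a dependency clears.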
Pattern 2

👥 Agent Teams

Team Lead + shared task list with file locking + peer-to-peer messaging. Auto-unblocking when dependencies clear.

Sweet spot: 3–5 agents · Coordination: Task list + locks
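The two coordination primitives — file locking and auto-unblocking — fit in a short sketch. This is a minimal illustration, assuming a lock directory and a task-board dict whose layout is hypothetical:

```python
import os
from pathlib import Path

LOCKS = Path("locks")                 # hypothetical shared lock directory
LOCKS.mkdir(exist_ok=True)

def claim(task_id: str, agent: str) -> bool:
    """Atomically claim a task: O_CREAT|O_EXCL fails if the lock exists."""
    try:
        fd = os.open(LOCKS / f"{task_id}.lock",
                     os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False                  # another teammate got there first
    os.write(fd, agent.encode())
    os.close(fd)
    return True

def unblocked(board: dict) -> list:
    """Auto-unblocking: todo tasks whose dependencies are all done."""
    done = {t for t, s in board.items() if s["status"] == "done"}
    return [t for t, s in board.items()
            if s["status"] == "todo" and set(s["deps"]) <= done]

board = {"schema": {"status": "done", "deps": []},
         "api":    {"status": "todo", "deps": ["schema"]},
         "ui":     {"status": "todo", "deps": ["api"]}}
print(unblocked(board))               # ['api'] — 'ui' still waits on it
print(claim("api", "agent-2"))
print(claim("api", "agent-3"))        # second claim fails: lock already held
```

The exclusive-create lock is what keeps two agents from grabbing the same file; the dependency check is what lets a waiting agent start the moment its blocker lands.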
Pattern 3

☁️ Cloud Async

Assign a task, close the laptop, come back to a PR. Runs on managed VMs — in our case, on top of Agent Runtime from Gemini Enterprise Agent Platform.

Mode: Fire-and-forget · Session: Days at a time

The bottleneck is no longer generation. It's verification. Human review isn't optional overhead — it's the safety system.

The numbers that matter

  • ~3× throughput — 3 agents in parallel
  • 3–5 agents — the team sweet spot; above that, dispersion
  • −3% performance and +20% cost when AI writes its own AGENTS.md (ETH Zurich)

The right-hand number matters: research by Gloaguen et al. (ETH Zurich) shows that letting agents write their own AGENTS.md degrades performance by ~3% and increases cost by 20%+. The AGENTS.md must be human-curated — it’s what encodes the team’s institutional knowledge.

AGENTS.md: the shared brain

Four sections are enough:

## STYLE
- Functional components with hooks; named exports
- Errors always typed; never `throw "string"`

## GOTCHAS
- SQLite needs WAL for concurrent reads
- Express middleware order matters for auth

## ARCH_DECISIONS
- State in SQLite, no in-memory cache
- One Express router per feature module

## TEST_STRATEGY
- Integration > unit for HTTP routes
- supertest for request assertions

Every session reads it. No agent writes to it directly — the lead approves every line that goes in.

5 practices to start tomorrow

1
Start with subagents

Decompose the task into 2–3 surgically scoped children. No setup. The cheapest way to prove the thesis internally.

2
Isolate with worktrees

Give each agent its own git worktree. Zero merge conflicts — and a per-feature diff that's trivial to review.
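A minimal sketch of that setup, driving the git CLI from Python. The agent names and the `agent/` branch prefix are illustrative conventions, not part of git:

```python
import subprocess

def spawn_worktree(repo: str, agent: str, base: str = "main") -> str:
    """Create an isolated checkout for one agent on its own branch.
    Equivalent to running, from inside the repo:
        git worktree add ../<agent> -b agent/<agent> <base>"""
    path = f"../{agent}"              # sibling directory of the main checkout
    subprocess.run(["git", "-C", repo, "worktree", "add",
                    path, "-b", f"agent/{agent}", base], check=True)
    return path

# One tree per specialist; each branch yields a small per-feature diff:
# for agent in ("db-agent", "api-agent", "ui-agent"):
#     spawn_worktree(".", agent)
```

Because each worktree is a separate directory on a separate branch, agents never touch each other's files until the lead merges each reviewed diff.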

3
Plan approval before coding

Teammate writes the plan; lead approves or rejects. Fixing architecture in the planning phase costs 1/10th of fixing it in code.

4
Hooks running lint + test

On TaskCompleted, validate automatically. If it fails, the agent keeps working. Lead only sees green code — like built-in CI.
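A sketch of such a hook, assuming a hypothetical TaskCompleted callback and illustrative gate commands (swap in whatever lint and test runners your stack uses):

```python
import subprocess

# Hypothetical quality gates; the commands are examples, not requirements.
GATES = [["ruff", "check", "."], ["pytest", "-q"]]

def on_task_completed(worktree: str, gates=GATES) -> bool:
    """TaskCompleted hook: run every gate inside the agent's worktree.
    Any failure sends the task back to the agent, not to the lead."""
    for cmd in gates:
        if subprocess.run(cmd, cwd=worktree).returncode != 0:
            return False              # red: the agent keeps working
    return True                       # green: ready for the lead's review
```

The point is the contract, not the commands: a task is only "done" when every gate exits 0, so the review queue contains nothing but green code.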

5
Compound learning via AGENTS.md

Every session reads, lead updates. Patterns and gotchas become the team's long-term memory — never rediscovered each sprint.

⚠️ Classic trap: Spinning up 10 agents in parallel just because you can. WIP limits are a virtue: run only what you can actually review. Above 5 agents, you're usually "generating debt that another human will pay for".

What changes at Autenticare

This model only holds up in production when the infrastructure can sustain it. That's exactly what the Gemini Enterprise Agent Platform (announced 04/22) delivers:

  • Agent Runtime gives you the “managed VMs” of the Cloud Async pattern — sub-second cold start, day-long sessions.
  • Memory Bank + Memory Profiles is AGENTS.md elevated to a platform primitive — long-term memory across sessions.
  • Agent Sandbox is the isolated worktree in production — generated code runs without risk to the real system.

Not a coincidence. The industry path is the same: from one-shot prompt execution to the agent factory.

You're no longer writing software. You're building the factory that writes the software.

Agentic engineering in production

Want to move from a single agent to an orchestra?

We structure 3-to-5 agent teams on Gemini Enterprise Agent Platform — with worktrees, curated AGENTS.md, quality hooks and plan approval. Auditable, repeatable and measurable.


Inspired by "The Code Agent Orchestra" by Addy Osmani (Google Chrome). Adapted to the stack we operate at Autenticare on top of Gemini Enterprise Agent Platform.
