Multi-step agents: orchestration architecture that survives in production
An agent executing 8 sequential steps fails in new ways. Orchestration patterns with Vertex AI Agent Builder — planning, retries, compensation and observability — tested in production.
Fabiano Brito
CEO & Founder
Full employee onboarding, financial reconciliation, regulatory process filing — these are cases where the agent executes a chain of 5–15 steps with intermediate decisions. This is where the highest ROI and the most treacherous complexity live.
Why multi-step is different
- Each step has latency → total flow gets long (10s–2min is common).
- Each external call can fail → explicit handling required.
- Intermediate state must persist → checkpoint is mandatory.
- Failure at step 4 may require undoing steps 1–3 → Saga pattern.
- "Next step" decision is dynamic → real planning, not a fixed DAG.
- Simple observability is not enough → structured tracing.
Explicit planning pattern
Before executing, the agent declares the plan. In pseudocode:
1. Input: "open regulatory process X"
2. Plan generation: Gemini 2.5 Pro produces JSON with expected steps
3. Validation: schema + dependency + permission per step
4. Execution loop: for each step, execute → validate → checkpoint
5. On failure: replan (decide next step with failure context)
6. On success: report + audit log
Why it matters: explicit planning enables (a) human review before execution in high-risk cases, (b) replanning instead of infinite loops, (c) audit of what the agent tried to do.
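Step 3 (validation) can be sketched in a few lines. A minimal sketch, assuming a plan already parsed from the model's JSON — `PlanStep`, `ALLOWED_TOOLS`, and the error format are illustrative names, not a Vertex AI API:

```python
from dataclasses import dataclass

# Illustrative allow-list; in practice this comes from the agent's tool registry.
ALLOWED_TOOLS = {"create_user", "provision_workspace", "grant_sap_access"}

@dataclass
class PlanStep:
    id: str
    tool: str
    depends_on: list  # ids of earlier steps this step requires

def validate_plan(steps: list[PlanStep]) -> list[str]:
    """Return a list of validation errors; empty means the plan is executable."""
    errors = []
    seen = set()
    for step in steps:
        if step.tool not in ALLOWED_TOOLS:
            errors.append(f"{step.id}: tool '{step.tool}' not permitted")
        for dep in step.depends_on:
            if dep not in seen:  # dependencies may only point at earlier steps
                errors.append(f"{step.id}: depends on unknown/later step '{dep}'")
        seen.add(step.id)
    return errors
```

Rejecting a plan before execution is what makes human review in high-risk cases possible: the reviewer reads the validated plan, not a stream of tool calls.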
Idempotency by default
Each tool and each step must be idempotent — re-executing with the same parameters must not create a double effect.
How to guarantee it:
- Idempotency keys in resource creation (UUID v5 derived deterministically from intent + step — v4 is random and cannot be "derived").
- Check before creating: "if a ticket already exists for this case, link instead of creating".
- PUT/upsert operations instead of POST when possible.
- Retry with same idempotency key.
Without this, retry after timeout duplicates records — a nightmare in ERP/CRM.
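The deterministic key from the first bullet can be sketched with the standard library's name-based UUID v5 — the same execution + step always yields the same key, so a retry reuses it instead of minting a new one. The namespace constant here is illustrative:

```python
import uuid

# Illustrative fixed namespace for this agent; any constant UUID works.
AGENT_NAMESPACE = uuid.UUID("6ba7b810-9dad-11d1-80b4-00c04fd430c8")

def idempotency_key(execution_id: str, step_id: str) -> str:
    """Deterministic key: identical (execution, step) pairs always map to the
    same UUID, so a retry after timeout reuses the key and the downstream
    system can deduplicate instead of creating a second record."""
    return str(uuid.uuid5(AGENT_NAMESPACE, f"{execution_id}:{step_id}"))
```

Pass this key in the create call's idempotency header (or store it on the record and check before creating) and the "duplicate ticket in the CRM" failure mode disappears.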
Saga pattern (compensation)
When step N fails after steps 1..N-1 have had real effect, the agent must undo in reverse order. Each step declares its corresponding compensation.
Employee onboarding example:
| Step | Action | Compensation |
|---|---|---|
| 1 | Create user in AD | Disable user |
| 2 | Provision Workspace | Suspend license |
| 3 | Create SAP access | Revoke access |
| 4 | Add to groups | Remove from groups |
| 5 | Notify manager | Notify reversal |
If step 4 fails, the agent runs compensation 3 → 2 → 1 and reports. Consistent state, no "orphan user".
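The table above can be sketched as a generic saga runner. A minimal sketch — `run_saga` and the `(action, compensation)` pairing are illustrative, not a library API:

```python
def run_saga(steps):
    """steps: list of (action, compensation) callables, in execution order.
    If step N raises, run the compensations for steps N-1..1 in reverse
    order, then re-raise so the caller can report the failure."""
    done = []
    for action, compensation in steps:
        try:
            action()
            done.append(compensation)
        except Exception:
            for comp in reversed(done):
                comp()  # best-effort undo; in production, log comp failures
            raise
```

In a real agent each `action` is a tool call and each `compensation` is declared alongside it in the plan, so the undo path is auditable before anything runs.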
State checkpoint
Persisting state after each completed step allows (a) resuming after failure without redoing everything, (b) continuing after process restart, (c) auditing progress in real time.
Pattern: agent_executions table in Firestore/Spanner with:
- execution_id, agent_name, user_id, started_at.
- plan_json (versioned).
- steps[] with status, input, output, timestamp, latency.
- compensations[] already executed.
- final_status, final_output.
Cloud Workflows also works for explicit orchestration; for flows with dynamic logic, keep the orchestration in code with its own state store.
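A minimal in-memory sketch of the checkpoint/resume pattern — the `store` dict stands in for the Firestore/Spanner table, and the field names mirror the schema above:

```python
import time

def checkpoint(store: dict, execution_id: str, step: dict) -> None:
    """Append a completed step to the execution record after it finishes.
    `store` stands in for a Firestore collection keyed by execution_id."""
    record = store.setdefault(execution_id, {"steps": [], "final_status": "running"})
    record["steps"].append({**step, "checkpointed_at": time.time()})

def resume_point(store: dict, execution_id: str) -> int:
    """Index of the first step not yet completed, so a restarted process
    skips finished work instead of re-executing the whole plan."""
    return len(store.get(execution_id, {}).get("steps", []))
```

After an infra failure, the orchestrator reads `resume_point` and continues from there — which only works because every step is idempotent.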
Observability: trace, not just logs
In multi-step, linear logs are not enough. Use:
- Distributed tracing (OpenTelemetry → Cloud Trace): each step is a span, with parent = execution.
- Attributes: tool called, model, tokens in/out, latency, cost.
- Events (not loose logs): "plan_generated", "step_started", "step_failed", "compensation_started".
- Dashboards per agent: p50/p95/p99 latency per step, failure rate per tool, cost per execution.
- Alerts: latency > X, failure rate > Y, average cost above baseline.
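In production these would be OpenTelemetry spans exported to Cloud Trace; as a self-contained sketch, here is the event shape one step span would carry — `step_span` and the field names are illustrative:

```python
import time
import uuid

def step_span(events: list, execution_id: str, name: str, attrs: dict):
    """Emit start/end events for one step, tagged with the parent execution_id
    so the trace can be reassembled (a stand-in for an OTel span)."""
    span_id = uuid.uuid4().hex[:16]
    start = time.time()
    events.append({"event": "step_started", "span": span_id,
                   "parent": execution_id, "name": name, **attrs})
    def end(status: str = "ok") -> None:
        events.append({"event": "step_ended", "span": span_id,
                       "parent": execution_id, "name": name,
                       "status": status, "latency_s": time.time() - start})
    return end
```

The point is the structure: every event carries the span and parent ids, so "failure rate per tool" and "p95 latency per step" are one query, not log archaeology.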
Dynamic decision: ReAct vs DAG
Fixed DAG: steps determined at design time. More predictable, more auditable, less flexible. Good for stable processes (KYC, standard onboarding).
ReAct (reason-act loop): agent decides next step at each iteration, based on previous result. More flexible, less predictable, requires loop and cost guardrails.
Autenticare pattern: start with DAG. Migrate to ReAct only where cases justify high variability. Hybrid works — DAG skeleton with ReAct sub-agent at defined points.
Without max_iterations, max_tool_calls, a token budget, and loop detection (same tool with the same parameters twice), the agent can spin in a loop and burn a fortune in tokens before anyone notices. Hard caps belong in code, not in the prompt.
Loop limits
ReAct without limits becomes an infinite loop burning tokens. Always enforce:
- Max iterations hard cap (e.g.: 12 steps).
- Max tool calls per execution.
- Max cost per execution (token budget).
- Loop detection: if the same tool with the same parameter runs 2x, alert + escalation.
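The first and last bullets can be sketched together as a guard that runs before every tool call — `guarded_call` and the `(tool, params)` loop key are illustrative, and the caps live in code, not in the prompt:

```python
MAX_ITERATIONS = 12  # hard cap from the list above; enforced in code

def guarded_call(history: set, iteration: int, tool: str, params: dict) -> None:
    """Raise before a call that would exceed the iteration cap, or that
    repeats a previous (tool, params) pair — the simplest loop signal.
    In production the RuntimeError triggers alert + escalation."""
    if iteration >= MAX_ITERATIONS:
        raise RuntimeError("max_iterations exceeded")
    key = (tool, tuple(sorted(params.items())))
    if key in history:
        raise RuntimeError(f"loop detected: {tool} repeated with same params")
    history.add(key)
```

A token-budget check fits the same shape: accumulate cost per execution and raise when it crosses the cap.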
Human hand-off
At any step, if confidence is low or ambiguity is high, the agent must be able to pause and request human input:
- Persist full state.
- Notify human (Slack, e-mail, queue).
- Wait for response with timeout.
- Resume or cancel based on decision.
This is what separates "agent that tries everything" from "agent that is mature enough to ask for help".
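A minimal sketch of the pause/notify/resume cycle — `hand_off` is an illustrative name, and an in-memory queue stands in for the Slack/Pub/Sub channel:

```python
import queue

def hand_off(state_store: dict, execution_id: str, state: dict,
             responses: queue.Queue, timeout_s: float = 5.0) -> str:
    """Persist full state, then block until a human decision arrives
    ("resume" or "cancel") or the timeout expires. On timeout we fail
    safe: cancel, but keep the persisted state for audit and retry."""
    state_store[execution_id] = {"status": "awaiting_human", **state}
    try:
        decision = responses.get(timeout=timeout_s)
    except queue.Empty:
        decision = "cancel"
    state_store[execution_id]["status"] = (
        "resumed" if decision == "resume" else "cancelled")
    return decision
```

Because the full state was checkpointed before pausing, "resume" picks up at the exact step that triggered the question, not at the beginning.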
Stack we use in Autenticare projects
- Vertex AI Agent Builder: agent definition + tools.
- Cloud Run jobs: stable, long-running orchestration.
- Firestore: execution state.
- Pub/Sub: async hand-off and notification.
- Cloud Trace + Looker: observability.
- BigQuery: historical analysis and audit.
Common mistakes
- No idempotency → retry duplicates record.
- No compensation → inconsistent state.
- No max iterations → infinite loop.
- No checkpoint → re-executes everything after infra failure.
- No hand-off → agent "gets stuck" on out-of-scope decision.
- No tracing → debugging becomes archaeology in logs.
An agent that asks for help is more mature than an agent that tries everything. Human hand-off on low confidence is not weakness — it is architecture.
Is your case multi-step (5+ steps, intermediate decisions)?
Feasibility diagnostic: current flow, compensation points, loop risk, observability. You leave with architecture and a 60–90 day plan.
