Multi-step agents: orchestration architecture that survives in production
An agent executing 8 sequential steps fails in new ways. Orchestration patterns with Vertex AI Agent Builder — planning, retries, compensation and observability — tested in production.
Fabiano Brito
CEO & Founder
Full employee onboarding, financial reconciliation, regulatory process filing — these are cases where the agent executes a chain of 5–15 steps with intermediate decisions. This is where the highest ROI and the most treacherous complexity live.
Why multi-step is different
- Each step has latency → total flow gets long (10s–2min is common).
- Each external call can fail → explicit handling required.
- Intermediate state must persist → checkpoint is mandatory.
- Failure at step 4 may require undoing steps 1–3 → Saga pattern.
- "Next step" decision is dynamic → real planning, not a fixed DAG.
- Simple observability is not enough → structured tracing.
Explicit planning pattern
Before executing, the agent declares the plan. In pseudocode:
1. Input: "open regulatory process X"
2. Plan generation: Gemini 2.5 Pro produces JSON with expected steps
3. Validation: schema + dependency + permission per step
4. Execution loop: for each step, execute → validate → checkpoint
5. On failure: replan (decide next step with failure context)
6. On success: report + audit log
Why it matters: explicit planning enables (a) human review before execution in high-risk cases, (b) replanning instead of infinite loops, (c) audit of what the agent tried to do.
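Step 3 (validation) can be sketched in a few lines. A minimal sketch, assuming a plan already parsed from the model's JSON — `PlanStep`, `ALLOWED_TOOLS`, and the error format are illustrative names, not a Vertex AI API:

```python
from dataclasses import dataclass

# Illustrative allow-list; in practice this comes from the agent's tool registry.
ALLOWED_TOOLS = {"create_user", "provision_workspace", "grant_sap_access"}

@dataclass
class PlanStep:
    id: str
    tool: str
    depends_on: list  # ids of earlier steps this step requires

def validate_plan(steps: list[PlanStep]) -> list[str]:
    """Return a list of validation errors; empty means the plan is executable."""
    errors = []
    seen = set()
    for step in steps:
        if step.tool not in ALLOWED_TOOLS:
            errors.append(f"{step.id}: tool '{step.tool}' not permitted")
        for dep in step.depends_on:
            if dep not in seen:  # dependencies may only point at earlier steps
                errors.append(f"{step.id}: depends on unknown/later step '{dep}'")
        seen.add(step.id)
    return errors
```

Rejecting a plan before execution is what makes human review in high-risk cases possible: the reviewer reads the validated plan, not a stream of tool calls.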
Idempotency by default
Each tool and each step must be idempotent — re-executing with the same parameters must not create a double effect.
How to guarantee it:
- Idempotency keys in resource creation (UUID v5 derived deterministically from intent + step — v4 is random and cannot be "derived").
- Check before creating: "if a ticket already exists for this case, link instead of creating".
- PUT/upsert operations instead of POST when possible.
- Retry with same idempotency key.
Without this, retry after timeout duplicates records — a nightmare in ERP/CRM.
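The deterministic key from the first bullet can be sketched with the standard library's name-based UUID v5 — the same execution + step always yields the same key, so a retry reuses it instead of minting a new one. The namespace constant here is illustrative:

```python
import uuid

# Illustrative fixed namespace for this agent; any constant UUID works.
AGENT_NAMESPACE = uuid.UUID("6ba7b810-9dad-11d1-80b4-00c04fd430c8")

def idempotency_key(execution_id: str, step_id: str) -> str:
    """Deterministic key: identical (execution, step) pairs always map to the
    same UUID, so a retry after timeout reuses the key and the downstream
    system can deduplicate instead of creating a second record."""
    return str(uuid.uuid5(AGENT_NAMESPACE, f"{execution_id}:{step_id}"))
```

Pass this key in the create call's idempotency header (or store it on the record and check before creating) and the "duplicate ticket in the CRM" failure mode disappears.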
Saga pattern (compensation)
When step N fails after steps 1..N-1 have had real effect, the agent must undo in reverse order. Each step declares its corresponding compensation.
Employee onboarding example:
| Step | Action | Compensation |
|---|---|---|
| 1 | Create user in AD | Disable user |
| 2 | Provision Workspace | Suspend license |
| 3 | Create SAP access | Revoke access |
| 4 | Add to groups | Remove from groups |
| 5 | Notify manager | Notify reversal |
If step 4 fails, the agent runs compensation 3 → 2 → 1 and reports. Consistent state, no "orphan user".
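The table above can be sketched as a generic saga runner. A minimal sketch — `run_saga` and the `(action, compensation)` pairing are illustrative, not a library API:

```python
def run_saga(steps):
    """steps: list of (action, compensation) callables, in execution order.
    If step N raises, run the compensations for steps N-1..1 in reverse
    order, then re-raise so the caller can report the failure."""
    done = []
    for action, compensation in steps:
        try:
            action()
            done.append(compensation)
        except Exception:
            for comp in reversed(done):
                comp()  # best-effort undo; in production, log comp failures
            raise
```

In a real agent each `action` is a tool call and each `compensation` is declared alongside it in the plan, so the undo path is auditable before anything runs.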
State checkpoint
Persisting state after each completed step allows (a) resuming after failure without redoing everything, (b) continuing after process restart, (c) auditing progress in real time.
Pattern: agent_executions table in Firestore/Spanner with:
- execution_id, agent_name, user_id, started_at.
- plan_json (versioned).
- steps[] with status, input, output, timestamp, latency.
- compensations[] already executed.
- final_status, final_output.
Cloud Workflows also works for explicit orchestration; for flows with dynamic logic, keep the orchestration in code with its own state store.
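A minimal in-memory sketch of the checkpoint/resume pattern — the `store` dict stands in for the Firestore/Spanner table, and the field names mirror the schema above:

```python
import time

def checkpoint(store: dict, execution_id: str, step: dict) -> None:
    """Append a completed step to the execution record after it finishes.
    `store` stands in for a Firestore collection keyed by execution_id."""
    record = store.setdefault(execution_id, {"steps": [], "final_status": "running"})
    record["steps"].append({**step, "checkpointed_at": time.time()})

def resume_point(store: dict, execution_id: str) -> int:
    """Index of the first step not yet completed, so a restarted process
    skips finished work instead of re-executing the whole plan."""
    return len(store.get(execution_id, {}).get("steps", []))
```

After an infra failure, the orchestrator reads `resume_point` and continues from there — which only works because every step is idempotent.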
Observability: trace, not just logs
In multi-step, linear logs are not enough. Use:
- Distributed tracing (OpenTelemetry → Cloud Trace): each step is a span, with parent = execution.
- Attributes: tool called, model, tokens in/out, latency, cost.
- Events (not loose logs): "plan_generated", "step_started", "step_failed", "compensation_started".
- Dashboards per agent: p50/p95/p99 latency per step, failure rate per tool, cost per execution.
- Alerts: latency > X, failure rate > Y, average cost above baseline.
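In production these would be OpenTelemetry spans exported to Cloud Trace; as a self-contained sketch, here is the event shape one step span would carry — `step_span` and the field names are illustrative:

```python
import time
import uuid

def step_span(events: list, execution_id: str, name: str, attrs: dict):
    """Emit start/end events for one step, tagged with the parent execution_id
    so the trace can be reassembled (a stand-in for an OTel span)."""
    span_id = uuid.uuid4().hex[:16]
    start = time.time()
    events.append({"event": "step_started", "span": span_id,
                   "parent": execution_id, "name": name, **attrs})
    def end(status: str = "ok") -> None:
        events.append({"event": "step_ended", "span": span_id,
                       "parent": execution_id, "name": name,
                       "status": status, "latency_s": time.time() - start})
    return end
```

The point is the structure: every event carries the span and parent ids, so "failure rate per tool" and "p95 latency per step" are one query, not log archaeology.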
Dynamic decision: ReAct vs DAG
Fixed DAG: steps determined at design time. More predictable, more auditable, less flexible. Good for stable processes (KYC, standard onboarding).
ReAct (reason-act loop): agent decides next step at each iteration, based on previous result. More flexible, less predictable, requires loop and cost guardrails.
Autenticare pattern: start with DAG. Migrate to ReAct only where cases justify high variability. Hybrid works — DAG skeleton with ReAct sub-agent at defined points.
Without max_iterations, max_tool_calls, a token budget, and loop detection (same tool with the same parameters twice), the agent can spin in a loop and burn a fortune in tokens before anyone notices. Hard caps belong in code, not in the prompt.
Loop limits
ReAct without limits becomes an infinite loop burning tokens. Always enforce:
- Max iterations hard cap (e.g.: 12 steps).
- Max tool calls per execution.
- Max cost per execution (token budget).
- Loop detection: if the same tool with the same parameter runs 2x, alert + escalation.
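The first and last bullets can be sketched together as a guard that runs before every tool call — `guarded_call` and the `(tool, params)` loop key are illustrative, and the caps live in code, not in the prompt:

```python
MAX_ITERATIONS = 12  # hard cap from the list above; enforced in code

def guarded_call(history: set, iteration: int, tool: str, params: dict) -> None:
    """Raise before a call that would exceed the iteration cap, or that
    repeats a previous (tool, params) pair — the simplest loop signal.
    In production the RuntimeError triggers alert + escalation."""
    if iteration >= MAX_ITERATIONS:
        raise RuntimeError("max_iterations exceeded")
    key = (tool, tuple(sorted(params.items())))
    if key in history:
        raise RuntimeError(f"loop detected: {tool} repeated with same params")
    history.add(key)
```

A token-budget check fits the same shape: accumulate cost per execution and raise when it crosses the cap.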
Human hand-off
At any step, if confidence is low or ambiguity is high, the agent must be able to pause and request human input:
- Persist full state.
- Notify human (Slack, e-mail, queue).
- Wait for response with timeout.
- Resume or cancel based on decision.
This is what separates "agent that tries everything" from "agent that is mature enough to ask for help".
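A minimal sketch of the pause/notify/resume cycle — `hand_off` is an illustrative name, and an in-memory queue stands in for the Slack/Pub/Sub channel:

```python
import queue

def hand_off(state_store: dict, execution_id: str, state: dict,
             responses: queue.Queue, timeout_s: float = 5.0) -> str:
    """Persist full state, then block until a human decision arrives
    ("resume" or "cancel") or the timeout expires. On timeout we fail
    safe: cancel, but keep the persisted state for audit and retry."""
    state_store[execution_id] = {"status": "awaiting_human", **state}
    try:
        decision = responses.get(timeout=timeout_s)
    except queue.Empty:
        decision = "cancel"
    state_store[execution_id]["status"] = (
        "resumed" if decision == "resume" else "cancelled")
    return decision
```

Because the full state was checkpointed before pausing, "resume" picks up at the exact step that triggered the question, not at the beginning.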
Stack we use in Autenticare projects
- Vertex AI Agent Builder: agent definition + tools.
- Cloud Run jobs: stable, long-running orchestration.
- Firestore: execution state.
- Pub/Sub: async hand-off and notification.
- Cloud Trace + Looker: observability.
- BigQuery: historical analysis and audit.
Common mistakes
- No idempotency → retry duplicates record.
- No compensation → inconsistent state.
- No max iterations → infinite loop.
- No checkpoint → re-executes everything after infra failure.
- No hand-off → agent "gets stuck" on out-of-scope decision.
- No tracing → debugging becomes archaeology in logs.
An agent that asks for help is more mature than an agent that tries everything. Human hand-off on low confidence is not weakness — it is architecture.
Is your case multi-step (5+ steps, intermediate decisions)?
Feasibility diagnostic: current flow, compensation points, loop risk, observability. You leave with architecture and a 60–90 day plan.
