AI agent over structured data: text-to-SQL with Gemini + BigQuery in practice
RAG handles documents. But what about questions over structured data in BigQuery, Snowflake, or PostgreSQL? The text-to-SQL pattern with Gemini that actually works — and what still requires a human.
Fabiano Brito
CEO & Founder
Half of the questions executives ask the "company ChatGPT" are analytical: comparisons, totals, trends, segmentation. Without text-to-SQL, the agent replies "check the BI tool." With a well-built text-to-SQL, the agent returns the right number with a chart.
Standard architecture (7 steps)
1. Input: raw user question, with session context.
2. Retrieval: the agent searches the catalog (semantic layer) for tables that cover the topic.
3. Generation: Gemini 2.5 Pro produces a parameterized query in the warehouse dialect.
4. Validation: SQL parser + operation whitelist + ACL + dry run.
5. Execution: BigQuery/Snowflake/Postgres with the user's own identity, not a service account.
6. Formatting: the agent formats the result and suggests a visualization.
7. Response: number + table + optional chart + the query used (for auditing).
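Concretely, the flow looks like the sketch below: a minimal skeleton of the seven steps. Every helper here is a placeholder we are assuming for illustration, not a real SDK.

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class Answer:
    value: Any                 # the number or table the user asked for
    sql: str                   # the exact query executed, kept for auditing
    chart_hint: str | None = None

def find_tables(question: str) -> list[str]:
    # 2. Semantic-layer lookup (placeholder).
    return ["sales.orders"]

def generate_sql(question: str, tables: list[str]) -> str:
    # 3. Gemini 2.5 Pro call (placeholder).
    return "SELECT SUM(value) FROM sales.orders WHERE status = 'confirmed' LIMIT 1000"

def validate(sql: str, tables: list[str], user: str) -> None:
    # 4. Parser + whitelists + ACL + dry run (see the guardrail sketch further down).
    pass

def execute_as(user: str, sql: str) -> Any:
    # 5. Runs with the user's own identity (placeholder result).
    return 1_234_567.89

def answer(question: str, user: str) -> Answer:
    # 1. Raw input with session context arrives here.
    tables = find_tables(question)
    sql = generate_sql(question, tables)
    validate(sql, tables, user)
    rows = execute_as(user, sql)
    # 6-7. Format: number + table + optional chart + the query used.
    return Answer(value=rows, sql=sql, chart_hint="bar")
```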
The key component: semantic layer
The model doesn't memorize your data warehouse schema. Without a semantic layer, it guesses table and column names.
A semantic layer is a curated catalog:
- Tables and columns with descriptions in PT/EN.
- Synonyms ("receita" = "revenue" = "faturamento").
- Relationships between tables (explicit foreign keys).
- Pre-defined metrics ("average ticket = SUM(value)/COUNT(order)").
- Default filters ("confirmed orders only").
- Temporal and geographic granularity.
Tools: dbt + Looker semantic layer, Cube.js, or your own YAML definition. In Autenticare projects, we standardize on versioned YAML.
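As a rough sketch, one table entry in that YAML might look like the snippet below, loaded in Python. Field names such as synonyms, metrics, and default_filters are an illustrative convention, not a fixed schema.

```python
import yaml  # pip install pyyaml

ORDERS_YAML = """
table: sales.orders
description: Customer orders, one row per order, values in BRL.
synonyms: [pedidos, vendas, orders]
columns:
  value: {description: Order total, synonyms: [receita, faturamento, revenue]}
  status: {description: Order lifecycle state}
  order_date: {description: Date the order was placed, grain: day}
joins:
  - {to: sales.stores, condition: orders.store_id = stores.id}
metrics:
  average_ticket: SUM(value) / COUNT(order_id)
default_filters:
  - status = 'confirmed'
"""

orders = yaml.safe_load(ORDERS_YAML)
print(orders["metrics"]["average_ticket"])  # SUM(value) / COUNT(order_id)
```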
Prompt patterns for text-to-SQL
Always include in the prompt (assembled in the sketch after this list):
- Schema of the relevant tables (full DDL).
- 3–5 examples of question → well-formed SQL.
- Explicit dialect ("PostgreSQL 15", "BigQuery Standard SQL").
- Constraints: "always use LIMIT 1000", "never DELETE/UPDATE/DROP", "use named parameters".
- Output format: raw SQL inside a code fence, no extra comments.
- Uncertainty rule: "if there is no data to answer, return null + explanation".
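A hedged sketch of how those ingredients assemble into one prompt; the wording and rules below are illustrative, not a tested production prompt.

```python
from datetime import date

def build_prompt(question: str, ddl: str, examples: list[tuple[str, str]]) -> str:
    # Few-shot pairs: (question, well-formed SQL).
    shots = "\n\n".join(f"Q: {q}\nSQL: {sql}" for q, sql in examples)
    today = date.today().isoformat()
    return (
        "You write BigQuery Standard SQL only.\n\n"
        f"Schema (DDL):\n{ddl}\n\n"
        f"Examples:\n{shots}\n\n"
        "Rules:\n"
        "- SELECT only; never DELETE/UPDATE/DROP.\n"
        "- Always end with LIMIT 1000.\n"
        "- Use named parameters (@param) for user-supplied literals.\n"
        f"- Today is {today}; if no year is given, assume the current year.\n"
        "- If the schema cannot answer the question, return null plus a one-line explanation.\n"
        "- Output raw SQL only, no commentary.\n\n"
        f"Q: {question}\nSQL:"
    )
```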
See broader patterns in corporate prompt engineering.
Mandatory validation before execution
- SQL parser: ensures it is valid SQL (not a comment, not loose text).
- Operation whitelist: SELECT only.
- Table whitelist: agent may only touch tables marked "agent-accessible".
- ACL check: does the user have permission on the table?
- Dry run in BigQuery (zero cost, validates syntax and estimates bytes).
- Bytes-processed cap (cost ceiling).
- Execution timeout.
One SELECT * FROM events on a 50 TB table and the DBA becomes famous for the wrong reasons. Dry run, table whitelist, and byte cap are not optional: they are the difference between text-to-SQL in production and a BigQuery billing incident.
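Here is a minimal sketch of those guardrails, assuming sqlglot for parsing and the official google-cloud-bigquery client for the dry run. The whitelist contents and the 10 GB cap are placeholders.

```python
import sqlglot
from sqlglot import exp
from google.cloud import bigquery

ALLOWED_TABLES = {"sales.orders", "sales.stores"}  # tables marked agent-accessible
MAX_BYTES = 10 * 1024**3                           # per-query cost ceiling (placeholder)

def validate(sql: str, client: bigquery.Client) -> int:
    # 1. Parse: raises sqlglot.errors.ParseError on invalid SQL.
    tree = sqlglot.parse_one(sql, read="bigquery")
    # 2. Operation whitelist: SELECT only.
    if not isinstance(tree, exp.Select):
        raise ValueError("operation whitelist: SELECT only")
    # 3. Table whitelist (the per-user ACL check would go here too).
    for table in tree.find_all(exp.Table):
        name = ".".join(p for p in (table.db, table.name) if p)
        if name not in ALLOWED_TABLES:
            raise PermissionError(f"table not agent-accessible: {name}")
    # 4. Dry run: free, validates against the live schema, estimates bytes.
    job = client.query(
        sql,
        job_config=bigquery.QueryJobConfig(dry_run=True, use_query_cache=False),
    )
    if job.total_bytes_processed > MAX_BYTES:
        raise ValueError(f"query would scan {job.total_bytes_processed} bytes")
    # For the real execution, also set maximum_bytes_billed and a job timeout.
    return job.total_bytes_processed
```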
Hybrid "analytical agent" pattern
In real deployments, it is better to combine three modes (routed as in the sketch after this list):
- Pre-defined metrics (semantic layer): for common questions, runs a ready query with parameters. Faster and more reliable.
- Free text-to-SQL: for ad-hoc questions outside the catalog. Extra validation required.
- RAG over a dictionary: agent consults business documentation (policies, definitions) before generating SQL.
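A sketch of that routing, assuming a small catalog of parameterized metric queries. Matching by metric name is a simplification; in practice the router is itself a cheap classification call.

```python
METRIC_QUERIES = {
    # Curated, already validated, runs with parameters only.
    "average_ticket": (
        "SELECT SUM(value) / COUNT(order_id) FROM sales.orders "
        "WHERE status = 'confirmed' AND order_date BETWEEN @start AND @end"
    ),
}

def lookup_dictionary(question: str) -> str:
    return ""  # placeholder for the RAG step over business definitions

def generate_sql(question: str, context: str) -> str:
    return "SELECT 1"  # placeholder for the Gemini call

def route(question: str, metric: str | None) -> str:
    if metric in METRIC_QUERIES:
        return METRIC_QUERIES[metric]       # fast path: pre-defined metric
    context = lookup_dictionary(question)   # consult the dictionary first
    return generate_sql(question, context)  # slow path: full validation applies
```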
What goes wrong with naive text-to-SQL
Forgotten joins
"How much did we sell by region?" — agent forgets the JOIN with the store table and returns the wrong number. Solution: recommended joins in the semantic layer.
Undeclared filters
"Monthly revenue" — agent doesn't filter status="confirmed". Always include a default filter in the semantic layer.
Temporal granularity
"March sales" — March of which year? Without context the agent assumes wrongly. Include the current date in the prompt + rule: "if the year is not specified, use the current year".
Currency conversion
Multinational with BRL, USD, ARS in the same table: agent sums everything without converting. A derived metric in the semantic layer solves this.
Confusion between similar metrics
"Net revenue" vs. "gross revenue" vs. "GMV" — without a clear definition the agent chooses badly. A glossary with canonical definitions solves this.
Specific evaluation
A gold set for text-to-SQL differs from one for RAG. Each entry contains:
- Natural-language question.
- Expected SQL (canonical) or a set of equivalent SQLs.
- Expected result (numbers).
- Primary metric: execution match (result matches, even if SQL is written differently).
- Secondary metric: logical form match (correct SQL structure).
An execution match of 90%+ is a realistic target for a corpus of 100–300 questions, computed as in the sketch below. Details in agent evaluation in production.
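A sketch of the execution-match computation; the float rounding and order-insensitive comparison are assumptions to adapt to your result shapes.

```python
def execution_match(expected: list[tuple], got: list[tuple]) -> bool:
    # Order-insensitive comparison with float rounding.
    def norm(rows):
        return sorted(
            tuple(round(v, 6) if isinstance(v, float) else v for v in row)
            for row in rows
        )
    return norm(expected) == norm(got)

def score(gold: list[dict], agent, run) -> float:
    # `agent` maps question -> SQL; `run` executes SQL and returns rows.
    hits = 0
    for item in gold:
        generated = agent(item["question"])
        hits += execution_match(run(item["expected_sql"]), run(generated))
    return hits / len(gold)
```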
Costs to control
- BigQuery bytes: limit per query and per user/day.
- Gemini tokens: a large schema in the prompt consumes tokens. Select only relevant tables via a first router call.
- Query cache: identical question within a short window = result from cache, no re-execution.
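A sketch of that cache. Keying on user identity matters: two users with different ACLs must never share a cached result. The 5-minute window is an assumption.

```python
import hashlib
import time

_CACHE: dict[str, tuple[float, object]] = {}
TTL_SECONDS = 300  # "short window" (placeholder)

def cached_run(user: str, question: str, run) -> object:
    # Cache key: user identity + normalized question.
    key = hashlib.sha256(f"{user}:{question.strip().lower()}".encode()).hexdigest()
    hit = _CACHE.get(key)
    if hit and time.time() - hit[0] < TTL_SECONDS:
        return hit[1]  # cache hit: no re-execution, no bytes billed
    result = run(question)
    _CACHE[key] = (time.time(), result)
    return result
```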
Governance
- Audit log: question + generated SQL + bytes processed + result + user (one record sketched after this list).
- Real identity: query runs with the user's own permission, not a service account.
- PII handling: never expose CPF (the Brazilian taxpayer ID) or e-mail addresses in answers; aggregation is mandatory on sensitive columns.
- DPIA for analysis over personal data — see DPIA for Gemini Enterprise projects.
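As a sketch, one audit record per query, appended as a JSON line. Field names are an assumption mirroring the list above; note the row count is logged, not the raw rows, to avoid persisting PII.

```python
import json
import time

def audit(user: str, question: str, sql: str, bytes_processed: int,
          row_count: int, path: str = "audit.jsonl") -> None:
    record = {
        "ts": time.time(),
        "user": user,                       # real identity, not a service account
        "question": question,
        "sql": sql,                         # the exact query executed
        "bytes_processed": bytes_processed,
        "row_count": row_count,             # size only, never the raw rows
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```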
When NOT to use text-to-SQL
- Questions requiring judgment ("what should we do?"): the agent supplies the numbers; the recommendation needs human analysis.
- Critical decisions (price, customer offer): agent suggests, human decides.
- Complex multi-table analysis with deep business logic: better to use traditional BI + agent to summarize.
Text-to-SQL works when the agent knows what it cannot touch. Curated semantic layer + whitelist + dry-run >> giant prompt with 200 tables dumped inside.
Text-to-SQL in production over your BigQuery/Snowflake/Postgres?
Autenticare delivers a POC in 4–6 weeks: semantic layer + agent + dashboard + evaluation gold set. You leave with a measured execution match and a rollout plan.
