Ask a large language model to write SQL and it will happily oblige. It will also invent table names, hallucinate column names, ignore your access controls, and return results that look plausible but are wrong. The gap between “generates SQL” and “generates correct, safe, auditable SQL against your actual data warehouse” is where most natural-language-to-SQL projects die.

I built the data warehouse agent inside Voltaire, the internal AI assistant at Parallel Learning. It handles nearly 700 tables across 30-plus source systems, enforces PII restrictions at the schema level, and repairs its own broken queries. I wrote previously about how the orchestrator routes questions through a reasoning loop. This one goes deeper: what happens inside the data warehouse agent when the orchestrator hands it a question.

The model never sees your full schema. It never sees PII columns. It never runs unvalidated SQL. When it gets a query wrong, it reads the error message and tries again.

When the orchestrator calls the data warehouse agent, the question passes through six stages before any SQL touches the database:

1. A table planner selects relevant tables from a compact catalog of nearly 700.
2. The selected tables’ column schemas get serialized and injected into the SQL generation prompt, with PII columns stripped out.
3. A separate model generates SQL under strict constraints.
4. The generated SQL passes through multi-layer validation: AST parsing, column existence checks against the loaded schema, deprecated table blocking, PII column guards.
5. The validated SQL executes against the data warehouse with a 45-second timeout. If it fails with a repairable error, a repair loop generates corrected SQL using the error message and valid column hints.
6. Results get serialized into a token-efficient format and returned to the orchestrator.

Two model calls (planner and generator). One optional repair call. Multiple deterministic gates between each.
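The six stages can be sketched end to end. Everything below is a toy stand-in — function names, catalog contents, the trivial “generator” — meant only to show the shape of the pipeline, not Voltaire’s actual code:

```python
# Toy sketch of the six-stage pipeline; every function here is an
# illustrative stand-in, not Voltaire's real implementation.

class RepairableError(Exception):
    """Errors worth one repair attempt (bad column, syntax, type mismatch)."""

CATALOG = {"dim_students": "One row per enrolled student"}           # hypothetical
SCHEMAS = {"dim_students": ["student_id", "grade", "student_email"]} # hypothetical
PII_REGISTRY = {("dim_students", "student_email")}                   # hypothetical

def plan_tables(question):
    # Stage 1: the planner picks 1-15 tables from the compact catalog.
    return [t for t in CATALOG if "student" in question.lower()]

def build_schema_context(tables):
    # Stage 2: schemas for the selected tables only, PII columns stripped.
    return {t: [c for c in SCHEMAS[t] if (t, c) not in PII_REGISTRY]
            for t in tables}

def generate_sql(question, schema):
    # Stage 3: in the real system, a constrained model call.
    table, cols = next(iter(schema.items()))
    return f"SELECT {', '.join(cols)} FROM {table}"

def validate(sql):
    # Stage 4: deterministic gates (two of several shown here).
    assert sql.upper().startswith(("SELECT", "WITH")), "read-only queries only"
    assert ";" not in sql.rstrip(";"), "exactly one statement"

def execute(sql, timeout_s=45):
    # Stage 5: run against the warehouse; canned rows for the sketch.
    return [{"student_id": 1, "grade": 5}]

def repair(question, sql, err, schema):
    # One repair attempt: regenerate with the error message and column hints.
    return generate_sql(question, schema)

def answer_question(question):
    schema = build_schema_context(plan_tables(question))
    sql = generate_sql(question, schema)
    validate(sql)
    try:
        rows = execute(sql)
    except RepairableError as err:
        sql = repair(question, sql, err, schema)  # the whole repair budget
        validate(sql)
        rows = execute(sql)
    return rows  # Stage 6 serializes these before returning
```

Note the shape: model calls only at stages 1, 3, and (optionally) the repair; everything between them is deterministic.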
The whole pipeline runs inside a single tool call from the orchestrator’s perspective. It asked a question and got an answer.

The most common failure mode in natural-language-to-SQL is hallucinated schema. The model confidently references tables and columns that don’t exist. Better prompting doesn’t fix this. Restricting what the model can see does.

Voltaire maintains a structured catalog of every table in the data warehouse: name, dataset, type (dbt model vs raw source), and a human-written description. This catalog uses TOON, a compact serialization format designed for tabular data. Where JSON repeats key names for every row, TOON declares the schema once and uses positional values. For 700 tables with four fields each, that roughly halves the token count compared to JSON.

The table planner sees only this catalog. No columns, no full schemas. Just table names and descriptions. Given the user’s question, it selects one to fifteen relevant tables and produces a query plan. Every selected table name must be copied verbatim from the catalog. No invention, no abbreviation.

Models still hallucinate table names despite this constraint. So after the planner returns, a deterministic check validates every selection against the known set. Hallucinated tables trigger a rerun with an explicit correction: “Your previous attempt selected tables that do NOT exist in the catalog. Revise your plan using only real tables.” If it hallucinates again, those tables get dropped.

After table selection, the SQL generator receives full column schemas for only the selected tables: name, type, and description for each column. The generator never sees the full catalog. It works with exactly the tables the planner chose, with exactly the columns that exist.

The dbt layer underneath matters. The warehouse has two tiers: dbt models (cleaned, documented, tested) and raw source tables (as-ingested from 30-plus systems).
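The deterministic check on the planner’s output reduces to set membership plus one corrective rerun. A minimal sketch — table names and the rerun hook are hypothetical:

```python
# Toy version of the deterministic check on the planner's table picks.
# Table names and the correction-message wording are illustrative.

KNOWN_TABLES = {"dim_students", "fct_sessions", "dim_providers"}  # hypothetical

def check_plan(selected):
    """Split the planner's picks into real tables and hallucinations."""
    valid = [t for t in selected if t in KNOWN_TABLES]
    bogus = [t for t in selected if t not in KNOWN_TABLES]
    return valid, bogus

def validate_plan(selected, rerun_planner):
    """One rerun with an explicit correction; after that, drop the fakes."""
    valid, bogus = check_plan(selected)
    if not bogus:
        return valid
    correction = (
        "Your previous attempt selected tables that do NOT exist in the "
        f"catalog: {', '.join(bogus)}. Revise your plan using only real tables."
    )
    retry = rerun_planner(correction)
    valid, bogus = check_plan(retry)
    # Second hallucination: silently drop the bad tables rather than loop.
    return valid
```

`rerun_planner` stands in for the second model call with the correction appended to the prompt; the key point is that the gate itself never calls a model.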
The planner explicitly prefers dbt models over raw sources when both cover the same data. Raw sources are the fallback. This means the SQL generator usually works with well-documented columns that have clear descriptions, consistent types, and known relationships. Clean metadata in, clean SQL out.

Business logic gets injected too. A YAML configuration describes domain-specific conventions: school year runs July through June, pay periods are semi-monthly, timestamps are UTC and need timezone conversion. Matching strategies are codified: exact equality for structured values like state codes, case-insensitive LIKE for free-text identifiers like account names. Anti-patterns are explicitly forbidden: no INFORMATION_SCHEMA queries, no Oracle/Postgres functions, always SAFE_DIVIDE instead of raw division.

The SQL generator doesn’t need to be creative. It needs to be constrained. Narrow, accurate context plus explicit rules equals reliable SQL.

Most PII protection in AI systems works as a post-processing filter: generate the output, scan for sensitive data, redact. By the time you’re redacting, the model has already seen the PII. It might reference it in its explanation. It might use it as a join condition that leaks information through cardinality.

Voltaire’s approach: remove PII columns from the schema before the model ever sees them. The model cannot reference columns it doesn’t know exist. A PII registry maps every sensitive column across every table to a tag: employee_name, employee_email, student_email, ssn, phone, address, birth_date. When the SQL generator’s schema context is built, columns with sensitive tags get filtered out. The model receives a complete, functional schema minus the dangerous fields.

Not all tagged columns disappear. Employee names and emails are whitelisted for internal queries. You need to look up your colleagues. Provider-specific columns in provider tables are whitelisted. Student PII is never whitelisted.
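The registry-driven filtering comes down to a lookup keyed on (table, column) with the whitelist applied by tag. Tags, table names, and whitelist rules below are illustrative, not the actual registry:

```python
# Sketch of schema-level PII filtering. The registry contents, tag names,
# and whitelist are hypothetical examples, not Voltaire's real registry.

PII_REGISTRY = {
    ("employees", "full_name"): "employee_name",
    ("employees", "email"):     "employee_email",
    ("students", "full_name"):  "student_name",
    ("students", "email"):      "student_email",
    ("students", "dob"):        "birth_date",
}

# Employee name/email are allowed for internal queries; student PII never is.
WHITELISTED_TAGS = {"employee_name", "employee_email"}

def filter_schema(table, columns):
    """Drop PII columns before the model ever sees the schema."""
    safe = []
    for col in columns:
        tag = PII_REGISTRY.get((table, col))
        if tag is None or tag in WHITELISTED_TAGS:
            safe.append(col)
    return safe
```

The same column name gets different treatment depending on the table it lives in: `full_name` survives filtering on an employee table (tagged `employee_name`, whitelisted) but disappears from a student table (tagged `student_name`, never whitelisted).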
The filtering considers both the PII tag and the table context: a name column in an employee table is safe; the same tag on a student record is not.

After the model generates SQL, a second layer kicks in. The SQL gets parsed into an abstract syntax tree using sqlglot, a SQL parser with BigQuery dialect support. The validator walks SELECT expressions, resolves table aliases, checks every selected column against the PII registry. SELECT * is blocked by default, forcing explicit column selection. Any PII column in the output means rejection before execution.

Then a third check on the actual results. After execution, returned column names are verified against the PII registry again. Defense in depth: if a PII column somehow slipped through schema filtering and AST validation, it gets caught here.

When a query gets blocked, the system proposes alternatives. “I can’t look up students by name. Please provide the Pathway ID instead.” Never just block. Refuse the PII while offering a safe path to the same insight using anonymized identifiers.

For authorized users, a privileged mode includes PII columns in the schema. Every privileged query is logged for compliance audit. Your department and title determine whether PII columns appear in the schema at all.

Even with a perfect schema and explicit constraints, generated SQL fails. The model picks a function that doesn’t exist in BigQuery. It writes a correlated subquery the engine can’t handle. It divides by zero.

The first defense is pre-execution validation. Before SQL touches the database: Does it start with SELECT or WITH? Exactly one statement? No named parameters? No deprecated tables? Do all referenced columns exist in the schema? Each check is deterministic. No model calls. Parsing and comparing.

When SQL passes validation but fails at execution, the repair loop takes over. The system reads the error message and classifies it.
Unrecognized column names, syntax mistakes, type mismatches, unsupported function signatures, division by zero: all repairable. Network timeouts, permission denials on the underlying table: not repairable, returned to the orchestrator immediately.

The repair request includes the original question, the failed SQL, the exact error message, and valid column hints for relevant tables. When the error says “column X not found in table Y,” the repair prompt includes every valid column for table Y. The model sees the options instead of guessing.

Prior errors accumulate across tool calls. Up to three recent failures get injected into subsequent instructions as “PRIOR SQL ERRORS (do NOT repeat these mistakes).” The model doesn’t make the same mistake twice in the same conversation.

One repair attempt. That’s the budget. If the repaired SQL also fails, the system returns the error to the orchestrator. One attempt catches most fixable errors. More attempts add latency without improving success rates, and the orchestrator can always reformulate and try a different approach.

All the schema injection and repair context described above competes for space in the model’s context window. A query returning 500 rows with 12 columns produces a lot of data, and every tool result feeds back into that same window. Token efficiency isn’t a cost optimization. It’s a reliability constraint. Bloated context means less room for the model to reason about what the data actually says.

Voltaire uses TOON for query results. TOON is a compact serialization format backed by a Rust library. For tabular data, it declares column names once in a header and uses positional values per row. A 500-row JSON result repeats the same key names 500 times. TOON states them once. The savings scale linearly with row count.

TOON isn’t just for results. The table catalog is TOON. Column schemas are TOON. The format runs through the entire data pipeline within the agent. The planner reads the catalog in TOON.
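The header-once idea is easy to demonstrate with a toy serializer. This is not the real TOON grammar — just the structural saving it exploits, measured against JSON on a 500-row result:

```python
# Toy header-once tabular format. Illustrates the saving TOON exploits;
# the syntax here is invented and is NOT TOON's actual grammar.
import json

def header_once(rows):
    """Column names stated once in a header, positional values per row."""
    if not rows:
        return ""
    cols = list(rows[0])
    lines = ["|".join(cols)]
    lines += ["|".join(str(row[c]) for c in cols) for row in rows]
    return "\n".join(lines)

# A 500-row result: JSON repeats both key names 500 times each.
rows = [{"grade": g % 12, "sessions": g * 3} for g in range(500)]
compact = header_once(rows)
bloated = json.dumps(rows)
# The key-name overhead scales linearly with row count, so the gap
# widens as results grow.
```

On this shape of data the positional form comes in well under half the JSON size, and unlike CSV the real format keeps type information and metadata alongside the rows.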
The SQL generator reads column schemas in TOON. The orchestrator reads query results in TOON. At every stage, the context window stays as dense with useful information and as free of structural noise as possible. CSV would be denser, but it loses type information, can’t represent nested structures, and has no metadata support. TOON keeps the structure while cutting the overhead.

None of this works without a foundation. The 700 tables didn’t describe themselves. Each has a human-written description, documented columns with types and descriptions, PII classifications, deprecation flags, business logic annotations. That metadata comes from the dbt layer: 19 schema domains, 30-plus source systems, maintained and tested as part of the regular data engineering workflow.

The data warehouse agent is where the investment in data infrastructure pays off concretely. Every documented column description makes the table planner more accurate. Every PII annotation makes the filtering more precise. Every dbt model that cleans a raw source table means better inputs for the SQL generator.

A 2026 benchmark from dbt Labs put numbers on this. They compared deterministic semantic layers (pre-modeled metrics and dimensions) against raw text-to-SQL across multiple LLMs. The semantic layer hit 98–100% accuracy on modeled questions. Text-to-SQL topped out around 90%, and its failure mode was worse: plausible but wrong answers instead of clean refusals. The benchmark’s sharpest insight: “With text-to-SQL, failure looks like a plausible but incorrect answer. With the Semantic Layer, failure looks like an error message.”

Voltaire’s data layer was built from the start as a semantic layer for the agent. The table catalog with human-written descriptions, the business logic YAML, the PII registry, the preference for dbt models over raw sources: the model never sees raw DDL. It sees curated metadata with domain context.
When our head of data shared the dbt benchmark internally, the architecture already matched what they were advocating. We chose to generate SQL from that semantic metadata rather than compile deterministic queries from a constrained metric spec. The tradeoff is intentional: deterministic compilation is safer for recurring KPIs, but it locks you into a rigid coverage surface. Every new question type requires a new metric definition. Text-to-SQL with strong validation handles anything the schema covers, and it gets better every time the underlying model improves. We don’t have to update our scaffolding when the next model generation ships.

There’s a genuine tension here over how much scaffolding to build around a model that keeps getting smarter. Every deterministic guardrail you add is maintenance you carry forward, and some of it becomes dead weight when the next model generation lands. The dbt benchmark measured models from early 2026. By the time you read this, text-to-SQL accuracy has probably moved again. Our bet is that the right investment is in the semantic metadata itself (descriptions, PII tags, business logic) rather than in deterministic query compilation on top of it. The metadata makes every model better. A rigid compilation layer only helps the current one.

If you want an AI that can query your data, start with the data. Clean schemas, documented columns, classified sensitivity. The AI layer is the last 10% of effort and the first thing users see. Everything underneath is what makes it work.