Most people imagine an AI assistant as a single prompt: question in, answer out. That works for a chatbot riffing on general knowledge. It falls apart the moment you need the AI to pull real data from real systems, respect who’s asking, and get the answer right.

I built Voltaire, the internal AI assistant at Parallel Learning. The hardest problem wasn’t getting an LLM to generate text. It was everything around it: identity, context selection, tool orchestration, safety checks. I wrote previously about what Voltaire is and what it connects. This one goes under the hood: the path a question takes from a Slack message to a grounded answer.

Here’s what actually happens when someone mentions Voltaire in Slack. The request passes through six distinct stages before a single word of the response gets written.

1. The Slack handler receives the event and runs a sensitive-request gate: a lightweight model call that screens for queries that should be refused outright (requests for personal contact info, clinical advice, or data the requester shouldn’t access).
2. The system resolves the user’s identity. Who is this person? What department? What role? What timezone?
3. A pre-orchestrator planner decides what context the AI needs and which tools to make available.
4. The orchestrator itself runs a reasoning loop: thinking, planning tool calls, observing results, and iterating until it has enough evidence to answer.
5. PII redaction runs inside each tool at the point of data retrieval, not as a post-processing pass.
6. The formatted answer lands back in the Slack thread.

Six stages. Multiple model invocations before a user ever sees a word. A hard budget on tool calls. And a wall-clock deadline of four minutes to get it all done (longer for complex queries). Most of the intelligence lives before the orchestrator even starts. Get identity and context routing right, and the reasoning loop becomes straightforward.

Voltaire’s first move isn’t answering.
It’s figuring out who asked. From the user’s email, the system queries the employee directory and builds a UserContext: display name, job title, department, timezone, and a computed role. That role comes from an RBAC policy file mapping departments to access tiers (executive, clinical, operations, default).

The same question from different people should produce different answers. When someone from the clinical team asks about session metrics, the system steers the AI toward patient-facing context and clinical glossary terms. Sales asks about a school district? Competitive intelligence and pipeline data get loaded instead. An executive gets broader access; a contractor sees a restricted set of knowledge sections.

Identity also feeds into PII enforcement, and it works at two points: at the beginning, by controlling what knowledge and tools become available, and at the tool level, where each data-retrieval function applies its own redaction rules before returning results. PII never reaches the reasoning loop in the first place.

Most AI systems treat identity as a binary gate: authenticated or not. That’s a missed opportunity. Identity is a spectrum, and it should shape every downstream decision.

Voltaire maintains a YAML knowledge base organized into topic-specific sections: competitive positioning, compliance policies, organizational contacts, product glossary, state-level market intelligence. At any given time, fewer than a dozen of these are relevant to a question. Loading them all would waste tokens, dilute the model’s attention, and slow things down.

So a pre-orchestrator planner runs first. It’s a separate, cheaper model call using the smallest model in the Gemini family. Given the user’s question and catalogs of available knowledge sections, agents, and skills, it returns three decisions: which knowledge sections to inject, whether a skill or a set of agents should handle the request, and whether the orchestrator needs tools at all.
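The planner’s output can be modeled as one small structured decision. Here is a minimal sketch of that shape, not Voltaire’s actual code; every field and value name is hypothetical:

```python
from dataclasses import dataclass, field
from typing import Literal

@dataclass
class PlannerDecision:
    # Which knowledge sections to inject into the orchestrator prompt.
    knowledge_sections: list[str] = field(default_factory=list)
    # Either matched skills or matched agents, never both.
    matched_skills: list[str] = field(default_factory=list)
    matched_agents: list[str] = field(default_factory=list)
    # "short_circuit" means: answer from knowledge context alone, no tools.
    routing_mode: Literal["short_circuit", "agents", "skill"] = "agents"

    def __post_init__(self) -> None:
        # Routing contract: a matched skill clears the agent list.
        if self.matched_skills and self.matched_agents:
            self.matched_agents = []

# A question already covered by the knowledge base needs no tools at all.
decision = PlannerDecision(
    knowledge_sections=["district_prioritization"],
    routing_mode="short_circuit",
)
```

The point of making this a typed object rather than free-form model output is that downstream code can branch on it deterministically.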
That third decision is the short-circuit. If someone asks “how do you define a high-priority district?” and the knowledge base already has a section on district prioritization criteria, the planner returns routing_mode: short_circuit. The orchestrator gets the knowledge context but no tools. Single pass, no tool-calling machinery, answer from context alone. A meaningful fraction of questions get short-circuited this way. It saves seconds per request and eliminates the risk of unnecessary tool calls producing confusing results.

For questions that do need tools, the planner selects one to four agents from the registry. Ask about enrollment trends? The data warehouse agent. A support ticket? The ticket search agent. Both data and document context needed? Both agents get selected. The orchestrator only sees the tools it actually needs, which keeps the surface minimal and cuts down on hallucinated tool calls.

The orchestrator is a PydanticAI agent. Default model: Gemini Flash, with a fallback chain that drops to lighter models if the primary fails or times out. Try the best model first, fall back to cheaper ones, never let a single model failure crash the request. It receives a system prompt assembled from Jinja2 templates (more on that shortly), dynamic instructions injected at runtime (the user’s identity, temporal context, loaded knowledge, guidance about which agents were pre-selected), and the user’s question.

Then it reasons. Read the prompt, decide which tool to call, observe the result, decide whether to call another tool or write the final answer. Each tool call maps to a domain-specific agent or a deterministic function: query the data warehouse, search Google Drive, look up a support ticket, search the web, read a document.

None of this works without a tool-call budget. Default: 10 calls. With --think (for harder questions): 14. With --ultrathink: 20. Hard limits, enforced at the runtime level.
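Enforcement at the runtime level can be as simple as a counter wrapped around tool dispatch. A minimal sketch of the idea, assuming nothing about Voltaire’s internals; the class and budget names are hypothetical, only the per-mode numbers come from above:

```python
class ToolBudgetExceeded(Exception):
    """Raised when the orchestrator tries to spend more tool calls than allowed."""

# Per-mode budgets mirroring the defaults described above.
BUDGETS = {"default": 10, "think": 14, "ultrathink": 20}

class ToolDispatcher:
    def __init__(self, mode: str = "default"):
        self.limit = BUDGETS[mode]
        self.calls = 0

    def call(self, tool, *args, **kwargs):
        # Hard limit lives here in code, not just as a request in the prompt.
        if self.calls >= self.limit:
            raise ToolBudgetExceeded(f"budget of {self.limit} tool calls spent")
        self.calls += 1
        return tool(*args, **kwargs)

dispatcher = ToolDispatcher(mode="think")
result = dispatcher.call(lambda q: f"rows for {q}", "enrollment by state")
```

Because the counter sits outside the model, no amount of creative reasoning can talk its way past it.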
The prompt actually advertises a lower number to preserve headroom for the final response. Without hard limits, an LLM will call tools in a loop until you stop it. I’ve watched orchestrators burn through 40 tool calls on a question that needed two. Each unnecessary call adds latency and cost, and the user is sitting in Slack waiting. The budget forces the model to plan: which tools give the highest-signal evidence in the fewest calls?

Then there’s the timeout. The entire run has a wall-clock budget that varies by execution mode. Each tool call has its own deadline inside that. If the data warehouse takes too long, the orchestrator gets a partial result and works with what it has. If the whole run times out, a recovery path synthesizes whatever tool outputs completed into a best-effort response rather than returning nothing.

Lumping every external call into a single “tool” abstraction is a common first pass. It breaks down fast. A SQL executor, a multi-step data warehouse researcher, and a prescribed session-tracing workflow have different failure modes, different levels of autonomy, and different needs for oversight. Voltaire separates them into three runtime primitives.

Tools are deterministic functions. A tool runs a SQL query against the data warehouse. A tool exports data to a spreadsheet. A tool renders a chart. Pure inputs, pure outputs, no LLM reasoning involved. They’re the atoms of the system, and the design intent is to keep them as isolated as possible so each one can be tested and trusted independently.

Agents are domain-specific reasoning engines that wrap tools. The orchestrator calls an agent entry point, and the agent handles multi-step logic internally. Ask the data warehouse agent a question, and it runs its own loop: plan which tables to query, generate SQL, validate it against PII rules, execute it, check for errors, repair if needed, then return a synthesized result. The orchestrator never sees those internals.
It asked a question and got an answer.

Skills are prescribed playbooks. A skill definition specifies the exact tools available, a fixed step-by-step sequence, custom PII rules for that workflow, pre-built SQL queries, and a separate tool-call budget. Unlike agents, skills expose their tools directly to the orchestrator. The orchestrator follows the script rather than deciding its own approach.

What actually separates these isn’t complexity. It’s visibility. In agent mode, the orchestrator asks “what do I need to know?” and delegates the “how” entirely. In skill mode, the orchestrator operates at the level of individual tool calls, following a prescribed sequence. Agents hide complexity; skills make it explicit.

The planner routes between these two modes. When someone asks “what’s happening with enrollment in California?” the planner selects one to four agents, the orchestrator calls them in whatever order its reasoning dictates, and the final answer emerges from synthesis. Each agent internally decides how to use its tools. When someone asks “trace session ABC-123,” the planner matches a skill instead. The skill declares two tools (a log query function and a SQL executor), prescribes a sequence (fetch session events, cross-reference with audit logs, synthesize a timeline), and includes its own PII policy. Same playbook every time.

The routing contract is strict: the planner returns either matched skills or matched agents, never both. If a skill matches, agents get cleared. The orchestrator can’t access tools outside the skill’s declared surface.

We tried a single mode first. It doesn’t work. Exploratory questions need the flexibility of agent reasoning; operational workflows need the predictability of a deterministic playbook. One mode can’t do both well.

Nearly every system prompt, agent description, and tool instruction in Voltaire lives in a Jinja2 template file, not in Python code. The first thing this buys you: prompts become version-controlled artifacts.
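In practice this is ordinary template composition: a main prompt built from shared fragments, with tagged blocks marking section boundaries. A minimal sketch of the pattern, not Voltaire’s actual templates; every template name, tag, and field here is hypothetical:

```python
from jinja2 import DictLoader, Environment

# A shared fragment (e.g. the PII policy) included by the main prompt.
# Updating it in one place propagates to every prompt that includes it.
templates = {
    "fragments/pii_policy.j2": (
        "<pii_policy>Redact personal data at the tool layer.</pii_policy>"
    ),
    "orchestrator.j2": (
        "<identity>{{ user.display_name }} ({{ user.role }})</identity>\n"
        "{% include 'fragments/pii_policy.j2' %}\n"
        "<knowledge>{% for s in sections %}{{ s }} {% endfor %}</knowledge>"
    ),
}

env = Environment(loader=DictLoader(templates))
# The dynamic layer: identity and planner-selected knowledge, filled at runtime.
prompt = env.get_template("orchestrator.j2").render(
    user={"display_name": "Ada", "role": "clinical"},
    sections=["product_glossary"],
)
```

The static templates define the shape; the render call fills in the per-request specifics.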
When the AI misbehaves, you can git blame the prompt template and see exactly what changed. You can review prompt changes in pull requests with the same rigor as code changes. No more mystery regressions because someone edited an inline string.

The template system also enables composition. The main orchestrator prompt includes shared fragments for company context, temporal context, PII policy, and Slack formatting rules. Skills inject their own instructions into the same template structure. Update the PII policy in one place, and it propagates everywhere.

And XML-structured sections give the model clear boundaries. The orchestrator prompt is organized into tagged blocks: identity, company_context, temporal_context, tone, tools, pii_policy, knowledge, formatting, guardrails. The model can attend to the relevant section for any given decision rather than parsing a wall of unstructured text.

On top of this static structure sits a dynamic instruction layer. At runtime, functions inject the user’s identity, the current date and timezone, the planner’s selected knowledge sections, response budget constraints, and guidance about pre-selected agents. Static template defines the shape. Dynamic layer fills in the specifics.

When we added a new agent for Google Workspace search, the core wiring was modular: a registry entry, an agent description, routing rules. New capabilities plug in without rewriting the orchestrator.

The orchestration brain itself accounts for roughly 12% of Voltaire’s codebase. Everything around it (the identity layer, the planner, the skill system, the PII enforcement, the audit logging) exists to make that 12% reliable. That’s the real work of building an AI assistant: not the model, but the pipeline that lets you trust it. The model gets better every quarter. The orchestration layer is what lets you actually ship it.