Beyond ETL: Building Data Pipelines for LLMs and AI Agents
The enterprise data pipeline has a new job description. For decades, it moved batched records from operational systems into warehouses where analysts could query them. That model is functionally broken for LLMs and AI agents, which need live context, semantic meaning, and validated outputs—not nightly snapshots loaded into fact tables.
The failure isn’t theoretical. Only 16% of AI-generated answers to open-ended enterprise questions are accurate enough for decision-making, and Gartner predicts that through 2026 organizations will abandon 60% of AI projects that are not supported by AI-ready data. The root cause isn’t the models; it’s the architecture underneath them.
This guide examines what a data pipeline for LLMs actually requires: how context injection and RAG differ, what MCP-based data access changes about pipeline design, why semantic enrichment is a prerequisite for accuracy, and what governance controls prevent agentic systems from becoming liabilities.
Why Batch ETL Breaks for AI Workloads
Traditional ETL is optimized for human analysts, stable schemas, and tolerable multi-hour latency. LLMs invert each of these assumptions.
AI agents must answer open-ended natural language questions spanning multiple domains, often requiring awareness of events that occurred minutes ago. Because LLMs are probabilistic and prone to hallucination without authoritative grounding, they need context injected at inference time—not precomputed tables built the night before.
The result: the pipeline’s job shifts from producing stable analytical datasets to delivering live, query-specific context to models and agents at sub-second latency. Nightly batch loads cannot meet that requirement.
Context Injection vs. RAG: A Necessary Distinction
These terms are often used interchangeably, but the architectural difference matters.
Context injection is the broader practice of adding task-relevant information to an LLM’s prompt—regardless of how that information was obtained. Static system instructions, user profiles from a feature store, or recent conversation history are all context injection.
Retrieval-augmented generation (RAG) is a specific pattern where the model’s prompt is augmented with content retrieved at query time from an external index, typically via vector search. All RAG involves context injection, but not all context injection requires RAG.
Early RAG prototypes—embed a document corpus, retrieve top-k chunks, concatenate into a prompt—work acceptably for small, homogeneous corpora with minimal security requirements. At enterprise scale, context injection must orchestrate multiple retrieval mechanisms simultaneously:
- Entity lookups into feature stores keyed by customer or session ID
- Semantic vector search over unstructured documents
- Relational queries via governed SQL endpoints
- Policy metadata from governance systems
The pipeline assembles a coherent context window from these heterogeneous sources while respecting token limits, security policies, and latency budgets. That’s qualitatively different from batch ETL.
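As a rough illustration of assembling context under a latency budget, the sketch below fans out to several sources concurrently and keeps only what returns in time. The three fetch functions, their return shapes, and the 200 ms budget are all hypothetical stand-ins, not part of any specific product.

```python
import asyncio

async def fetch_profile(customer_id: str) -> dict:
    # Stand-in for a feature-store lookup keyed by customer ID (hypothetical).
    await asyncio.sleep(0.01)
    return {"source": "feature_store", "text": f"customer {customer_id}: tier=gold"}

async def search_documents(question: str) -> dict:
    # Stand-in for semantic vector search (hypothetical).
    await asyncio.sleep(0.01)
    return {"source": "vector_search", "text": f"top passages for: {question}"}

async def slow_sql_lookup(question: str) -> dict:
    # Simulates a governed SQL endpoint that misses the latency budget.
    await asyncio.sleep(1.0)
    return {"source": "sql", "text": "not reached in this demo"}

async def gather_context(customer_id: str, question: str, budget_s: float = 0.2) -> list:
    tasks = {
        asyncio.create_task(fetch_profile(customer_id)),
        asyncio.create_task(search_documents(question)),
        asyncio.create_task(slow_sql_lookup(question)),
    }
    # Wait up to the budget, then cancel whatever has not finished.
    done, pending = await asyncio.wait(tasks, timeout=budget_s)
    for task in pending:
        task.cancel()
    return sorted((t.result() for t in done), key=lambda r: r["source"])

results = asyncio.run(gather_context("c-42", "order status?"))
```

In this sketch the slow source is simply dropped; a real pipeline would have to decide, per source, whether a missing signal should degrade the answer or block it.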
Critically, research shows that excessive or poorly curated context can degrade LLM output quality—a phenomenon sometimes called “context overload.” More data in the prompt is not always better. Effective pipelines require context engineering: summarizing, filtering, and ranking signals before injection.
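One simple form of context engineering is to rank candidate snippets and pack them greedily into a token budget. The sketch below assumes relevance scores arrive from an upstream ranker and uses a whitespace word count as a crude token estimate; both are illustrative assumptions, and a real pipeline would use the target model's tokenizer.

```python
from dataclasses import dataclass

@dataclass
class Snippet:
    text: str
    score: float  # relevance score from an upstream ranker (assumed)

def estimate_tokens(text: str) -> int:
    # Crude proxy: whitespace-separated words, not real model tokens.
    return len(text.split())

def pack_context(snippets, max_tokens, min_score=0.3):
    """Keep the highest-scoring snippets that fit in the token budget."""
    kept, used = [], 0
    for s in sorted(snippets, key=lambda s: s.score, reverse=True):
        if s.score < min_score:
            break  # everything after this point is below the relevance floor
        cost = estimate_tokens(s.text)
        if used + cost > max_tokens:
            continue  # skip snippets that would overflow the budget
        kept.append(s)
        used += cost
    return kept, used

candidates = [
    Snippet("a b c", 0.9),
    Snippet("d e f g h", 0.8),
    Snippet("i j", 0.5),
    Snippet("k", 0.1),
]
kept, used = pack_context(candidates, max_tokens=6)
```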
Enterprise RAG Infrastructure: What Breaks at Scale
The Multi-System Problem
When RAG moves from a single document corpus to 20+ heterogeneous enterprise systems, several failure modes surface simultaneously:
Retrieval quality degrades. Embeddings computed over sources with different structures and vocabularies lose coherence. A question like “Has this customer ever disputed a charge?” requires correlating CRM notes, case management tickets, and billing records. No cosine similarity search over isolated document chunks captures that relationship. Hybrid retrieval—combining keyword search, structured filters, and semantic vectors—becomes mandatory.
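One common way to combine ranked lists from different retrievers is reciprocal rank fusion (RRF). The article does not prescribe a fusion method, so treat this as one possible choice; the document IDs are invented and `k=60` is a conventional default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for the documents it returned.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

keyword_hits = ["billing-17", "crm-3", "ticket-8"]   # hypothetical keyword results
vector_hits = ["ticket-8", "billing-17", "kb-2"]     # hypothetical vector results
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
```

Documents that appear in both lists rise to the top, which is the behavior hybrid retrieval is after when no single retriever sees the whole relationship.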
Data freshness becomes critical. Stale order status or outdated pricing in an AI response is worse than no answer. Change Data Capture (CDC) has emerged as the key technique for continuously propagating updates from operational systems into downstream stores, including vector databases and context layers, without overloading source systems.
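To make the CDC idea concrete, here is a minimal sketch of a consumer applying change events to a downstream store. The event shape (`op`, `key`, `version`, `value`) is an assumption for illustration and does not match any particular CDC tool's format; a plain dict stands in for the vector database or context layer.

```python
def apply_change(index: dict, event: dict) -> None:
    """Apply one change event to an in-memory stand-in for a downstream store."""
    op, key = event["op"], event["key"]
    if op in ("insert", "update"):
        current = index.get(key)
        # Ignore out-of-order events older than what is already stored.
        if current is None or event["version"] > current["version"]:
            index[key] = {"version": event["version"], "value": event["value"]}
    elif op == "delete":
        # A production consumer would likely keep a tombstone instead.
        index.pop(key, None)
    else:
        raise ValueError(f"unknown op: {op}")

events = [
    {"op": "insert", "key": "order-1", "version": 1, "value": "pending"},
    {"op": "update", "key": "order-1", "version": 3, "value": "shipped"},
    {"op": "update", "key": "order-1", "version": 2, "value": "packed"},  # stale
    {"op": "insert", "key": "order-2", "version": 1, "value": "pending"},
    {"op": "delete", "key": "order-2"},
]
index = {}
for event in events:
    apply_change(index, event)
```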
Semantic inconsistency degrades outputs. Atlan identifies “context failure” as a primary upstream cause of LLM hallucinations—missing metadata, conflicting definitions, untrusted sources. If “customer_id” in one system is “party_key” in another, or “risk_score” uses different scales across business units, the model synthesizes answers from incompatible signals and produces nonsense with confidence.
Security complexity explodes. Directly connecting LLM agents to operational systems creates brittle, difficult-to-govern architectures. Each connection must handle authentication, authorization, field masking, and injection protection independently. Without a centralized governance layer, row-level and column-level policies are inconsistently applied—if applied at all.
The Architectural Response: Semantic Layers and Context Graphs
Enterprise RAG architectures are converging on a layered approach combining vector search, semantic layers, context graphs, and governed APIs.
A semantic layer maps physical schemas to business concepts such as “Customer,” “Monthly Recurring Revenue,” and “Order,” with metric definitions and relationships expressed in business-friendly names. In AI architectures, it serves two functions: it tells LLMs what fields mean, and it acts as a deterministic query generator, translating semantic intent into validated SQL rather than letting models generate arbitrary queries against raw tables.
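A toy version of the "deterministic query generator" role might look like the following. The metric and dimension registries, the table, and the column names are all invented for illustration, and SQLite stands in for the warehouse; the point is only that the SQL text comes from reviewed definitions while user-supplied values travel as bound parameters.

```python
import sqlite3

# Hypothetical semantic layer: business names mapped to physical definitions.
METRICS = {"average_order_value": {"table": "orders", "expr": "AVG(order_total)"}}
DIMENSIONS = {"region": "region_code"}  # business name -> physical column

def compile_metric(metric: str, filters: dict):
    spec = METRICS[metric]  # KeyError for metrics the layer does not define
    clauses, params = [], []
    for name, value in filters.items():
        column = DIMENSIONS[name]  # KeyError for unmapped dimensions
        clauses.append(f"{column} = ?")
        params.append(value)  # values are bound, never interpolated
    where = f" WHERE {' AND '.join(clauses)}" if clauses else ""
    return f"SELECT {spec['expr']} FROM {spec['table']}{where}", params

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region_code TEXT, order_total REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("EMEA", 100.0), ("EMEA", 300.0), ("APAC", 50.0)],
)
sql, params = compile_metric("average_order_value", {"region": "EMEA"})
result = conn.execute(sql, params).fetchone()[0]
```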
Context graphs add relationship-awareness. TrustGraph describes context graphs as encoding not only knowledge but also agentic behavior—how knowledge was created, queried, and used, including model parameters and timestamps. This allows retrieval to consider usage patterns and provenance, not just static facts.
The combination constitutes a context fabric: a unified layer presenting AI systems with a coherent view of enterprise knowledge, relationships, and policies—backed by heterogeneous storage and processing technologies. Building and maintaining this fabric is what “data pipeline for AI” actually means in practice.
MCP Data Access: A New Pipeline Model
The Model Context Protocol standardizes how LLMs and agents connect to external systems. Built on JSON-RPC 2.0, MCP exposes tools, resources, and prompts as RPC-style capabilities rather than raw database endpoints.
This design change is significant. Instead of letting models emit arbitrary SQL or HTTP, MCP encourages a capability-based interface: the MCP server advertises specific tools—”lookup_customer_profile,” “search_knowledge_base”—with typed schemas. The agent chooses among well-defined capabilities. Each tool call is a data access event that the MCP server can authenticate, authorize, and audit.
An MCP tool call for a governed semantic query might look like this:
{
  "jsonrpc": "2.0",
  "id": "99",
  "method": "tools/call",
  "params": {
    "name": "run_metric_query",
    "arguments": {
      "metric_name": "average_order_value",
      "filters": { "region": "EMEA" },
      "time_grain": "month",
      "start_date": "2026-01-01",
      "end_date": "2026-03-31"
    }
  }
}
The tool implementation maps the metric name to the semantic layer, constructs the query with appropriate filters, enforces row-level access based on the authenticated identity, and returns structured results. The agent never touches raw SQL or table schemas.
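The server side of such a call can be sketched as a small JSON-RPC dispatcher. This is not the MCP SDK and does not reproduce the result shape the MCP specification defines; it only illustrates the flow of routing a `tools/call` request to a named tool and enforcing access from the caller's identity. The identity dict and its `regions` entitlement are hypothetical.

```python
import json

def run_metric_query(arguments: dict, identity: dict) -> dict:
    # Placeholder tool body: a real tool would call the semantic layer here.
    region = arguments["filters"]["region"]
    if region not in identity["regions"]:
        raise PermissionError(f"{identity['user']} may not query region {region}")
    return {"metric": arguments["metric_name"], "region": region, "rows": []}

TOOLS = {"run_metric_query": run_metric_query}

def handle_request(raw: str, identity: dict) -> dict:
    request = json.loads(raw)
    reply = {"jsonrpc": "2.0", "id": request.get("id")}
    if request.get("method") != "tools/call":
        reply["error"] = {"code": -32601, "message": "Method not found"}
        return reply
    params = request.get("params", {})
    tool = TOOLS.get(params.get("name"))
    if tool is None:
        reply["error"] = {"code": -32602, "message": "Unknown tool"}
        return reply
    try:
        reply["result"] = tool(params.get("arguments", {}), identity)
    except PermissionError as exc:
        # -32000 falls in the JSON-RPC range reserved for server-defined errors.
        reply["error"] = {"code": -32000, "message": str(exc)}
    return reply

call = json.dumps({
    "jsonrpc": "2.0", "id": "99", "method": "tools/call",
    "params": {"name": "run_metric_query",
               "arguments": {"metric_name": "average_order_value",
                             "filters": {"region": "EMEA"}}},
})
allowed = handle_request(call, {"user": "ana", "regions": ["EMEA"]})
denied = handle_request(call, {"user": "bo", "regions": ["APAC"]})
```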
MCP doesn’t replace data pipelines; it relocates pipeline logic into tool implementations. Data engineers must design these tools as governed data products, each encapsulating a well-defined slice of the enterprise context graph. The traditional ETL work of cleaning, transforming, and modeling data remains essential; its outputs are now consumed by AI tools under MCP rather than by BI dashboards.
The Agent2Agent (A2A) protocol, now under the Linux Foundation, extends this further. A2A enables agents from different vendors to collaborate—a planning agent can delegate data retrieval to a specialized “data agent” that communicates with MCP servers, caches results, and constructs context bundles. This decoupling of data access logic from reasoning creates hardened, shared services—but amplifies governance requirements accordingly.
Semantic Enrichment for AI: Why Metadata Is Now a First-Class Concern
In BI environments, a human analyst can recognize when “flag” or “code” is ambiguous and ask for clarification. An LLM fills gaps with confident fabrication. Missing or inconsistent business metadata is therefore a more severe failure mode in AI systems than in traditional analytics.
Semantic enrichment transforms raw schema into AI-ready context:
| Aspect | Raw Schema | Semantically Enriched |
|---|---|---|
| Column name | cust_id | customer_id |
| Description | (none) | “Identifier for Customer entity; joins to dim_customer” |
| Business definition | (none) | “Excludes prospects and closed accounts” |
| Sensitivity tag | (none) | PII – Customer Identifier |
| Quality rule | (none) | “Must not be null; must exist in dim_customer” |
This enrichment pays dividends across the entire RAG pipeline. Metadata tags like entity IDs and sensitivity labels enable filtered vector search, ensuring retrieved documents are both relevant and accessible. Semantic layer mappings enable automatic joins. Lineage annotations let agents reason about data provenance and reliability.
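As a sketch of how sensitivity tags can drive filtered vector search, the example below drops documents the caller may not see before ranking the rest by cosine similarity. The two-dimensional vectors, tag names, and document IDs are toy values chosen for illustration; a production system would push the filter into the vector store itself.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def filtered_search(query_vec, docs, allowed_tags, top_k=2):
    # Filter on metadata first so restricted documents never reach ranking.
    visible = [d for d in docs if d["sensitivity"] in allowed_tags]
    ranked = sorted(visible, key=lambda d: cosine(query_vec, d["vector"]), reverse=True)
    return [d["id"] for d in ranked[:top_k]]

docs = [
    {"id": "a", "vector": [1.0, 0.0], "sensitivity": "public"},
    {"id": "b", "vector": [0.9, 0.1], "sensitivity": "pii"},
    {"id": "c", "vector": [0.0, 1.0], "sensitivity": "public"},
    {"id": "d", "vector": [0.7, 0.7], "sensitivity": "internal"},
]
hits = filtered_search([1.0, 0.0], docs, allowed_tags={"public", "internal"})
```

Document `b` is the second-closest match by similarity but is excluded because the caller lacks the `pii` tag.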
Atlan’s framework for AI-ready data identifies metadata management, data quality, lineage, and governance as the four fundamental factors for grounding large models reliably. Investing in metadata infrastructure isn’t optional—it’s the mechanism by which hallucination rates fall. Platforms like Promethium’s Insights Context Graph operationalize this by unifying five levels of context—from raw technical metadata to tribal knowledge and usage patterns—into a single graph that agents can query at inference time.
AI Data Governance: The Line Between Useful and Unsafe
Why Naive SQL Generation Is an Architecture Mistake
Allowing LLMs to generate and execute arbitrary SQL against production systems is a serious engineering and security error. LLM-generated queries have no inherent awareness of row-level security, column masking, or user entitlements. They can produce inefficient plans that strain infrastructure, expose sensitive fields, or be manipulated via prompt injection to exfiltrate data.
Research on LLM-generated SQL indicates that queries vary widely in execution efficiency for the same intent, with naive generation sometimes producing plans that cost orders of magnitude more than human-written equivalents. The correctness risk compounds the cost risk.
The Intent-Not-SQL Pattern
The solution is architectural separation: LLMs generate structured intent, not executable code. A JSON object specifying requested entities, filters, and metrics gets validated against policies and translated into parameterized calls against a governed semantic layer. The model never accesses raw tables.
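A minimal sketch of the validation step, assuming a very small intent schema with just `metric` and `filters`: anything outside the allow-lists is rejected before any query is built. The allow-lists and key names are hypothetical.

```python
ALLOWED_KEYS = {"metric", "filters"}
ALLOWED_METRICS = {"average_order_value", "order_count"}
ALLOWED_FILTERS = {"region", "product_line"}

def validate_intent(intent: dict) -> list:
    """Return a list of policy violations; an empty list means the intent is acceptable."""
    errors = []
    extra = set(intent) - ALLOWED_KEYS
    if extra:
        errors.append(f"unexpected keys: {sorted(extra)}")
    if intent.get("metric") not in ALLOWED_METRICS:
        errors.append(f"unknown metric: {intent.get('metric')!r}")
    filters = intent.get("filters", {})
    if not isinstance(filters, dict):
        errors.append("filters must be an object")
        return errors
    for name, value in filters.items():
        if name not in ALLOWED_FILTERS:
            errors.append(f"filter not allowed: {name}")
        elif not isinstance(value, str):
            errors.append(f"filter {name} must be a string")
    return errors

ok = validate_intent({"metric": "average_order_value", "filters": {"region": "EMEA"}})
rejected = validate_intent({"metric": "average_order_value", "sql": "DROP TABLE orders"})
```

Because the model can only fill in fields the validator recognizes, a prompt-injected attempt to smuggle SQL into the intent shows up as an unexpected key rather than as an executed statement.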
Row-level security filters and column masks must be enforced at the data access layer—not assumed at the application layer. In Unity Catalog, for example, row filters are implemented as SQL UDFs that evaluate each row at query time; column masks return original or masked values depending on the requester’s role. These controls apply regardless of how the query was generated.
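The same idea can be emulated in plain Python to show where the controls sit. This is not Unity Catalog syntax; the user shape, the `pii_reader` role, and the masking format are invented for illustration. What matters is that every read passes through the filter and mask, whoever or whatever generated the request.

```python
def row_filter(row: dict, user: dict) -> bool:
    # Evaluated per row at read time: only rows in the user's regions pass.
    return row["region"] in user["regions"]

def mask_email(value: str, user: dict) -> str:
    # Return the original value only for roles entitled to see PII.
    if "pii_reader" in user["roles"]:
        return value
    local, _, domain = value.partition("@")
    return f"{local[:1]}***@{domain}"

def read_customers(rows: list, user: dict) -> list:
    """Single governed read path that applies the row filter and column mask."""
    return [
        {**row, "email": mask_email(row["email"], user)}
        for row in rows
        if row_filter(row, user)
    ]

rows = [
    {"id": 1, "region": "EMEA", "email": "ana@example.com"},
    {"id": 2, "region": "APAC", "email": "bo@example.com"},
]
analyst_view = read_customers(rows, {"regions": ["EMEA"], "roles": []})
steward_view = read_customers(rows, {"regions": ["EMEA", "APAC"], "roles": ["pii_reader"]})
```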
MCP’s security requirements reinforce this pattern. Per the MCP security specification, tokens must be validated on every request—including verifying the audience claim against the server’s identifier to prevent confused deputy attacks. Session-based authentication is explicitly prohibited. Per-client consent mapping ensures that a user’s authentication doesn’t grant blanket authorization to any connected agent.
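The audience check can be sketched on already-decoded token claims. This example assumes signature verification has happened upstream with a proper JWT library; it only shows the per-request comparison of the `aud` claim against the server's own identifier, plus an expiry check. The server identifier and claim values are made up.

```python
import time

def validate_claims(claims: dict, expected_audience: str, now=None) -> str:
    """Check audience and expiry on decoded claims; return the subject if valid."""
    now = time.time() if now is None else now
    aud = claims.get("aud")
    # The JWT "aud" claim may be a single string or a list of strings.
    audiences = [aud] if isinstance(aud, str) else list(aud or [])
    if expected_audience not in audiences:
        raise PermissionError("token audience does not match this server")
    if claims.get("exp", 0) <= now:
        raise PermissionError("token expired")
    return claims["sub"]

SERVER_ID = "https://mcp.example.com"  # hypothetical server identifier
subject = validate_claims(
    {"sub": "user-1", "aud": [SERVER_ID], "exp": 2_000},
    expected_audience=SERVER_ID,
    now=1_000,
)
```

A token minted for a different service fails the audience comparison even if its signature is valid, which is the property that blocks it from being replayed against this server.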
Audit Trails and Observability
Audit trail quality is a strong predictor of AI governance maturity. Organizations that log AI agent interactions with regulated data at the data layer—capturing which tools were called, what inputs were passed, how policies were applied, and what data was returned—are substantially better positioned for compliance and incident response.
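One way to capture these events at the data layer is to wrap each tool in an audit decorator, sketched below. The in-memory list stands in for an append-only audit sink, and the tool, its arguments, and the identity shape are hypothetical.

```python
import functools
import json
import time

AUDIT_LOG = []  # stand-in for an append-only audit sink

def audited(tool_name: str):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(arguments: dict, identity: dict):
            record = {"tool": tool_name, "user": identity["user"],
                      "arguments": arguments, "ts": time.time()}
            try:
                result = func(arguments, identity)
                record.update(outcome="allowed", rows_returned=len(result))
                return result
            except PermissionError as exc:
                record.update(outcome="denied", reason=str(exc))
                raise
            finally:
                # Log both allowed and denied calls as structured JSON.
                AUDIT_LOG.append(json.dumps(record, sort_keys=True))
        return wrapper
    return decorator

@audited("lookup_customer_profile")
def lookup_customer_profile(arguments: dict, identity: dict) -> list:
    if arguments["customer_id"] not in identity["customers"]:
        raise PermissionError("customer not visible to caller")
    return [{"customer_id": arguments["customer_id"], "tier": "gold"}]

caller = {"user": "ana", "customers": ["c-1"]}
lookup_customer_profile({"customer_id": "c-1"}, caller)
try:
    lookup_customer_profile({"customer_id": "c-2"}, caller)
except PermissionError:
    pass
```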
AI agent observability requires the same discipline as microservice monitoring: distributed tracing for each agent action and tool call, metrics for latency and token usage, structured logs for debugging, and automated evaluations scoring output quality. Promethium’s Trust Harness embeds this validation loop directly—accuracy scoring, lineage for every SQL query, and anti-hallucination safeguards validated against actual data sources, not just prompt engineering.
What Data Engineers Actually Need to Build
Translating these principles into action means treating the LLM data pipeline as five distinct but connected responsibilities:
- Continuous ingestion: CDC-based synchronization from operational systems into context layers, vector databases, and online stores—not nightly batch loads
- Semantic modeling: Maintaining a semantic layer and context graph with enriched metadata, entity relationships, and business definitions
- Governed MCP tool implementations: Designing tools as reusable data products with embedded access control, masking, and validation logic
- Context engineering: Summarizing and compressing retrieved content to fit model context windows without degrading quality
- Observability pipelines: Capturing retrieval metrics (hit rate, NDCG), tool call telemetry, and output quality scores to drive continuous improvement
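For the retrieval metrics named in the list above, a minimal sketch of hit rate and NDCG follows. It uses the linear-gain form of DCG (gain divided by log2 of rank plus one); other variants exist, and the graded relevance labels here are invented.

```python
import math

def hit_rate(retrieved: list, relevant: set, k: int) -> float:
    """1.0 if any relevant document appears in the top k, else 0.0."""
    return 1.0 if any(doc in relevant for doc in retrieved[:k]) else 0.0

def dcg(gains: list) -> float:
    # Position i (0-based) is discounted by log2(i + 2).
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg(retrieved: list, relevance: dict, k: int) -> float:
    """DCG of the actual ranking divided by DCG of the ideal ranking."""
    gains = [relevance.get(doc, 0) for doc in retrieved[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    best = dcg(ideal)
    return dcg(gains) / best if best else 0.0

relevance = {"d1": 3, "d2": 2, "d3": 1}  # hypothetical graded judgments
perfect = ndcg(["d1", "d2", "d3"], relevance, k=3)
reversed_order = ndcg(["d3", "d2", "d1"], relevance, k=3)
```

Averaging these per-query values over a labeled evaluation set gives the trend lines an observability pipeline would track over time.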
The traditional ETL/ELT work—cleaning, transforming, and modeling data—remains essential. What changes is the consumer: instead of BI dashboards waiting for the nightly refresh, the consumer is an AI agent that needs the right slice of enterprise reality, semantically enriched, policy-enforced, and delivered in milliseconds.
That’s not a pipeline. It’s a context fabric. And building it is the core work of AI-native data engineering.