
April 24, 2026

Zero Copy Data Integration: 7 Questions to Ask Before You Buy

Vendors claim zero copy, but few deliver it. Use these 7 technical questions to validate genuine federated query execution before you buy.

“Zero copy” has become one of the most abused terms in enterprise data marketing. Vendors apply it to everything from genuine federated query execution—where queries run against source data without moving it—to API wrappers that materialize data in transit, caching layers that move data on a schedule, and virtual access patterns that still push data through intermediary systems before delivery.

For data architects and CDOs with real budget on the line, the distinction matters enormously. The wrong choice means paying for infrastructure that replicates the exact data movement costs you were trying to eliminate. This guide gives you seven precise technical questions—and the answers you should demand—to separate genuine zero copy data integration from expensive marketing language.

Why “Zero Copy” Claims Are Hard to Verify

The core promise of zero copy data integration is straightforward: consumers query data where it lives, through a governed access layer, without creating physical copies. As Conduktor explains, a company with a 500GB customer database that allows Marketing, Analytics, and Sales to each copy it now stores 2TB total—1.5TB of pure duplication that becomes stale the moment the source updates.

True zero copy eliminates that problem. But vendors have stretched the term to cover patterns that retain significant data movement:

  • Cached zero copy: Data is retrieved and temporarily stored as objects within the vendor’s platform on every query cycle
  • Virtual layer access: Metadata pointers avoid permanent copies, but data still passes through intermediary transformation layers
  • API wrapping: External sources are exposed through APIs, but data is materialized in transit stages before reaching end users
  • Time-window federation: Systems marketed as “real-time” that actually sync on 15-minute refresh cycles

Each pattern fails differently. The questions below expose which category a vendor actually falls into.


Question 1: Where Does Query Execution Actually Happen?

What to ask: “Show me a query execution plan for a filtered query against a remote source. Does filtering happen at the source, or after retrieval?”

What you should hear: In a genuine federated architecture, the source system’s query engine understands and executes the query, retrieves the filtered data from its storage layer, and returns only the results. Systems without true pushdown retrieve entire tables and filter locally—generating massive network traffic.

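How to verify it: Request access to the EXPLAIN or EXPLAIN ANALYZE tool during your evaluation, and read the plan for the boundary between remote and local work. Operator names vary by engine: IBM Db2 federation plans, for example, use a SHIP operator to mark where data arrives from the remote source, with a RETURN operator at the top of the plan. Amazon Athena also supports EXPLAIN and EXPLAIN ANALYZE, with its own operator names. In any engine, a plan that shows all filtering occurring locally after data retrieval indicates missing pushdown optimization, which is a red flag.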

A genuine federated query engine treats query pushdown as table stakes: predicates, aggregations, and joins should all execute at the source where possible, with only result sets transmitted back.
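The difference between pushdown and local filtering can be made concrete with a small sketch. It uses an in-memory SQLite table as a stand-in for a remote source; the table name `orders`, the helper names, and the row counts are assumptions made for illustration, not any vendor's behavior.

```python
import sqlite3

def make_source() -> sqlite3.Connection:
    """Build an in-memory table that stands in for a remote source (hypothetical data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")
    rows = [(i, "EU" if i % 10 == 0 else "US", float(i)) for i in range(1, 1001)]
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    return conn

def fetch_with_pushdown(conn, region):
    """The predicate travels to the source; only matching rows come back."""
    rows = conn.execute(
        "SELECT id, amount FROM orders WHERE region = ?", (region,)
    ).fetchall()
    return rows, len(rows)  # (result, rows transferred)

def fetch_without_pushdown(conn, region):
    """The whole table is transferred, then filtered in the consuming layer."""
    transferred = conn.execute("SELECT id, region, amount FROM orders").fetchall()
    rows = [(i, amt) for (i, reg, amt) in transferred if reg == region]
    return rows, len(transferred)

source = make_source()
pushed, pushed_moved = fetch_with_pushdown(source, "EU")
local, local_moved = fetch_without_pushdown(source, "EU")
print(f"rows moved with pushdown: {pushed_moved}, without: {local_moved}")
```

Both functions return the same answer, but the second one moves ten times as many rows to get it. That transfer gap is what an execution plan should let you see before you buy.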


Question 2: Does Your Platform Support Heterogeneous Cross-Source Queries?

What to ask: “Can I write a single SQL query that joins a table in Snowflake with a table in PostgreSQL and a Salesforce object? Show me the execution plan.”

What you should hear: Cross-source query capabilities are where most “zero copy” claims break down. Vendors with genuine federated execution coordinate distributed SQL across heterogeneous systems—cloud warehouses, SaaS applications, on-premises databases—within a single query with built-in optimization for each source’s dialect and capabilities.

The dialect problem is real. Different database systems support different SQL syntax variations. A query written for PostgreSQL may not execute against Oracle without modification. Vendors must either support multiple SQL dialects natively or implement query translation that preserves semantics across source types.

Ask vendors to demonstrate cross-source joins against your specific source combination—not a curated demo environment. The answer reveals whether cross-source SQL is production-ready or a roadmap item.
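To illustrate the dialect problem, here is a minimal sketch of query translation for one small case: a "top N" query rendered for PostgreSQL and for Oracle. The `TopNQuery` type and `render` function are hypothetical names invented for this example; a real translation layer covers far more than row limiting.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TopNQuery:
    """Dialect-neutral description of a simple top-N query (hypothetical type)."""
    table: str
    columns: tuple
    order_by: str
    limit: int

def render(query: TopNQuery, dialect: str) -> str:
    """Render the same logical query in the syntax each source expects."""
    cols = ", ".join(query.columns)
    base = f"SELECT {cols} FROM {query.table} ORDER BY {query.order_by}"
    if dialect == "postgresql":
        return f"{base} LIMIT {query.limit}"
    if dialect == "oracle":
        # Row-limiting clause available from Oracle Database 12c onward.
        return f"{base} FETCH FIRST {query.limit} ROWS ONLY"
    raise ValueError(f"unsupported dialect: {dialect}")

q = TopNQuery("customers", ("id", "name"), "created_at DESC", 10)
print(render(q, "postgresql"))
print(render(q, "oracle"))
```

The point of the sketch is the shape of the problem: one logical query, several physical spellings. Ask the vendor how many such differences their translation layer handles for your specific sources.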


Question 3: How Fresh Is the Data, and How Can I Prove It?

What to ask: “If I update a record in the source system right now and immediately run a query through your platform, will I see the updated value?”

What you should hear: Data freshness refers to how current data is relative to when it was generated or should be available for consumption. In a genuinely live data access architecture, the answer is yes—every query reflects the current state of the source.

How to test it: Run the same query against both the federated platform and the source system simultaneously. Any divergence indicates caching or refresh cycles. Then test under concurrent load—if multiple users’ results diverge from source data by different amounts, the system is serving stale cached responses.
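The update-then-query test can be sketched as a small harness. Everything here is a simulation under stated assumptions: `Source`, `LivePlatform`, and `CachingPlatform` are hypothetical stand-ins, and the clock is passed in as a plain number so the example runs without waiting.

```python
class Source:
    """Stand-in for a source system holding one versioned record."""
    def __init__(self):
        self.version = 1
    def update(self):
        self.version += 1
    def query(self):
        return self.version

class LivePlatform:
    """Hypothetical platform that forwards every query to the source."""
    def __init__(self, source):
        self.source = source
    def query(self, now):
        return self.source.query()

class CachingPlatform:
    """Hypothetical platform that serves a cached result until the TTL expires."""
    def __init__(self, source, ttl_seconds):
        self.source = source
        self.ttl = ttl_seconds
        self.cached = None
        self.cached_at = None
    def query(self, now):
        if self.cached_at is None or now - self.cached_at >= self.ttl:
            self.cached, self.cached_at = self.source.query(), now
        return self.cached

def is_fresh(platform, source, now):
    """Warm the platform, update the source, then compare one second later."""
    platform.query(now)
    source.update()
    return platform.query(now + 1) == source.query()

src = Source()
live_ok = is_fresh(LivePlatform(src), src, now=0)
cached_ok = is_fresh(CachingPlatform(src, ttl_seconds=900), src, now=0)
print(live_ok, cached_ok)  # True False
```

In a real evaluation the two `query` calls would hit the vendor platform and the source system directly. The comparison logic stays the same.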

This matters acutely for fraud detection, dynamic pricing, and campaign analytics. A 15-minute refresh cycle and a 24-hour batch sync can both be sold under identical marketing language (“zero copy”), yet they are completely different operational realities.


Question 4: How Is Access Control Enforced—and at What Layer?

What to ask: “If I have row-level security policies, where are they enforced: at the source, at your query layer, or after data is retrieved?”

What you should hear: In federated architectures, access policies must be enforced at query time, before results are returned to end users. This is fundamentally different from traditional approaches where policies are applied at the storage layer after data lands in a central repository.

Query-level RBAC enforcement is non-negotiable for enterprise deployments. Acceptable answers include: policies enforced by the federated query engine before execution, dynamic data masking applied at the query layer (not post-retrieval), and row-level filters pushed down as query predicates to source systems.
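A row-level filter pushed down as a query predicate might look like the following sketch. The policy table `ROW_POLICIES`, the role names, and the `secure_select` helper are assumptions for illustration; SQLite stands in for the source system.

```python
import sqlite3

# Hypothetical policy table: which regions each role may read.
ROW_POLICIES = {"emea_analyst": ("EU",), "global_admin": ("EU", "US")}

def secure_select(conn, role, min_amount):
    """Append the caller's row-level filter to the query before it runs at the source."""
    regions = ROW_POLICIES.get(role)
    if not regions:
        raise PermissionError(f"no row policy for role {role!r}")
    placeholders = ", ".join("?" for _ in regions)
    sql = (
        "SELECT id, region, amount FROM orders "
        f"WHERE amount >= ? AND region IN ({placeholders}) ORDER BY id"
    )
    # The source only ever returns rows the role is allowed to see.
    return conn.execute(sql, (min_amount, *regions)).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "EU", 50.0), (2, "US", 75.0), (3, "EU", 5.0)],
)
print(secure_select(conn, "emea_analyst", 10))  # [(1, 'EU', 50.0)]
```

Because the filter is part of the SQL sent to the source, restricted rows never leave it. That is the property to look for, as opposed to filtering a full result set after retrieval.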

Also ask specifically about audit trails. Audit logs should capture workflow status, which users modified data, and what each user did, recorded in real time rather than delivered as batch exports. Without query-level audit trails, GDPR and HIPAA compliance attestation becomes unreliable.


Question 5: What Happens When a Source System Is Unavailable?

What to ask: “If one of my source systems goes down during query execution, what happens? What’s the degradation path?”

What you should hear: This question exposes the architectural honesty of vendor claims. Pure federated approaches have a real limitation: if a source system experiences downtime, queries that depend on it cannot complete. Vendors who claim this never happens are either describing a caching architecture (which is not pure zero copy) or overstating their resilience.

Acceptable answers: transparent error handling with clear source-level failure attribution, optional caching for specific high-availability datasets with explicit freshness trade-offs disclosed, and circuit-breaker patterns that degrade gracefully rather than silently returning stale data.
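A circuit breaker with source-level failure attribution can be sketched in a few lines. This is a simplified illustration, not a production design: the class names, thresholds, and the source name `postgres_crm` are hypothetical, and time is passed in explicitly so the behavior is easy to follow.

```python
class SourceUnavailable(Exception):
    """Raised with the failing source named, instead of returning stale data."""

class CircuitBreaker:
    def __init__(self, source_name, max_failures=3, reset_after=30.0):
        self.source_name = source_name
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, now):
        # While open, fail fast until the cool-down has elapsed.
        if self.opened_at is not None:
            if now - self.opened_at < self.reset_after:
                raise SourceUnavailable(f"{self.source_name}: circuit open")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn()
        except ConnectionError as exc:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = now
            raise SourceUnavailable(f"{self.source_name}: {exc}") from exc
        self.failures = 0
        return result

def down():
    raise ConnectionError("connection refused")

breaker = CircuitBreaker("postgres_crm", max_failures=2, reset_after=30.0)
for t in (0, 1, 2):
    try:
        breaker.call(down, now=t)
    except SourceUnavailable as err:
        print(err)
```

The first two calls report the underlying connection error; the third fails fast with "circuit open" without touching the source. In every case the caller learns which source failed and never receives silently stale data.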

Vendors who don’t have a clear answer to this question haven’t designed for production workloads.


Question 6: How Do You Handle Schema Changes in Source Systems?

What to ask: “If a source system adds a column or renames a field tonight, what breaks, and how quickly does your platform detect and handle it?”

What you should hear: Schema evolution is an operational reality. AWS Glue crawlers automatically discover new data, extract schema definitions, detect schema changes, and version tables. A production-ready federated platform needs comparable capability: automated schema change detection, alerting to data teams when upstream schemas change, and tools to update downstream queries accordingly.

The failure mode to watch for: Vendors that discover schema changes only when a query fails in production. This means business users encounter unexplained errors while engineers scramble to identify which upstream system changed and what broke downstream. Enterprise data integration RFP processes should make automated schema detection a scored evaluation criterion, not a nice-to-have.
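Automated schema change detection comes down to comparing snapshots of source metadata on a schedule. A minimal sketch, assuming each snapshot is a plain `{column: type}` dictionary; the function name and the sample columns are invented for illustration:

```python
def diff_schema(previous: dict, current: dict) -> dict:
    """Compare two {column: type} snapshots of the same source table."""
    added = sorted(set(current) - set(previous))
    removed = sorted(set(previous) - set(current))
    retyped = sorted(
        col for col in set(previous) & set(current) if previous[col] != current[col]
    )
    return {"added": added, "removed": removed, "retyped": retyped}

yesterday = {"id": "INTEGER", "email": "TEXT", "signup_ts": "TIMESTAMP"}
today = {"id": "BIGINT", "email_address": "TEXT", "signup_ts": "TIMESTAMP", "plan": "TEXT"}

changes = diff_schema(yesterday, today)
print(changes)
# {'added': ['email_address', 'plan'], 'removed': ['email'], 'retyped': ['id']}
```

Note that a rename shows up as one removed column plus one added column. Ask vendors whether their detection distinguishes renames from drops, and who gets alerted when `changes` is non-empty.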


Question 7: What Is the True Total Cost of Ownership?

What to ask: “Walk me through compute costs for query execution. Who pays when queries run—us, the source system, or you? What happens to costs at 10x our current query volume?”

What you should hear: Zero copy does not automatically mean zero cost. Three distinct cost models exist in federated architectures:

  • Live query federation: Compute costs delegated to source systems; organizations pay source-system query fees per execution
  • File-based federation: Consuming system runs its own compute against source storage; costs scale with query complexity, not data volume
  • Cached acceleration: Vendor stores result sets in their platform; lower per-query latency but introduces caching costs and freshness trade-offs

Demand a written cost model across all three scenarios at your expected query volume. Also model network egress costs for cross-region or cross-cloud queries—these accumulate rapidly in cloud environments and are rarely surfaced in initial pricing conversations.
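A written cost model can start as a few lines of arithmetic. The sketch below compares the three models at current and 10x query volume. Every number in it is a placeholder assumption, not a real price; replace them with the vendor's quoted rates and your own egress estimates.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CostModel:
    name: str
    fixed_monthly: float        # e.g. cache storage or reserved compute
    cost_per_query: float       # compute billed per execution
    egress_gb_per_query: float  # average data leaving the source region or cloud
    egress_cost_per_gb: float

    def monthly_cost(self, queries_per_month: int) -> float:
        per_query = self.cost_per_query + self.egress_gb_per_query * self.egress_cost_per_gb
        return self.fixed_monthly + per_query * queries_per_month

# Placeholder figures only, chosen to show the shape of the comparison.
MODELS = [
    CostModel("live query federation", 0.0, 0.020, 0.05, 0.09),
    CostModel("file-based federation", 500.0, 0.008, 0.20, 0.09),
    CostModel("cached acceleration", 2000.0, 0.002, 0.01, 0.09),
]

CURRENT_VOLUME = 100_000  # hypothetical queries per month
for model in MODELS:
    at_1x = model.monthly_cost(CURRENT_VOLUME)
    at_10x = model.monthly_cost(CURRENT_VOLUME * 10)
    print(f"{model.name:24s} 1x: ${at_1x:>10,.2f}   10x: ${at_10x:>12,.2f}")
```

Even with invented inputs, the structure shows how the cheapest option at current volume may not stay cheapest at 10x. Fixed and per-query components scale differently, and egress is a separate line item.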


The Data Fabric Evaluation Criteria Vendors Hope You Skip

Beyond the seven questions above, vendor presentations systematically underdisclose three technical areas:

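Performance at scale. Network latency creates hard physical limits: light in a vacuum covers 100 miles in roughly 0.54 milliseconds one way, and signals in optical fiber travel at about two-thirds of that speed, so a realistic floor is closer to 0.8 milliseconds per 100 miles each way before any routing or processing delay. Federated queries spanning geographic regions carry latency floors no software optimization can eliminate. Test your actual geographic source distribution against your SLA requirements before committing.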
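The latency floor is straightforward to estimate for your own geography. This sketch assumes a straight-line path and a fiber propagation speed of about two-thirds the speed of light; real routes are longer and add switching delay, so treat the result as a lower bound. The 2,500-mile distance is a hypothetical example.

```python
SPEED_OF_LIGHT_MILES_PER_S = 186_282  # in a vacuum
FIBER_FRACTION = 2 / 3                # assumed: light in glass travels at roughly 2/3 of c

def one_way_ms(miles, medium_fraction=1.0):
    """Minimum one-way propagation delay in milliseconds."""
    return miles / (SPEED_OF_LIGHT_MILES_PER_S * medium_fraction) * 1000

def round_trip_ms(miles, medium_fraction=FIBER_FRACTION):
    """Minimum request/response delay; a query needs at least one round trip."""
    return 2 * one_way_ms(miles, medium_fraction)

print(f"100 miles, vacuum, one way: {one_way_ms(100):.2f} ms")
print(f"100 miles, fiber, one way:  {one_way_ms(100, FIBER_FRACTION):.2f} ms")
print(f"2,500 miles, fiber, round trip: {round_trip_ms(2500):.1f} ms")
```

A cross-source query that needs several sequential round trips multiplies this floor, which is why the number of round trips matters as much as the distance.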

Data type compatibility. Different systems represent dates, timestamps, and numeric precision differently. Vendors should document which type conversions occur automatically and which require manual handling. Test your highest-risk data types—financial figures, timestamps with timezone offsets, multi-byte character fields—in the proof of concept.

Multi-source join performance. When a query joins data from multiple heterogeneous sources, the join must run in the consuming system, because no source system can read the other’s data. In engines that build the in-memory side of the join from the right-hand table (Amazon Athena’s tuning guidance describes this), put the larger table on the left and the smaller on the right so that the smaller table is the one distributed and held in memory. Benchmark this pattern specifically, because it is where performance cliffs appear.
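Why table order matters can be seen in a minimal hash join. This sketch assumes an engine that loads the right-hand input into memory and streams the left-hand input past it; the function name and sample rows are hypothetical.

```python
def hash_join(probe_rows, build_rows, key):
    """Inner-join two iterables of dicts; build_rows (the right side) is held in memory."""
    index = {}
    for row in build_rows:  # memory use grows with the size of this side
        index.setdefault(row[key], []).append(row)
    for left in probe_rows:  # streamed one row at a time
        for right in index.get(left[key], []):
            yield {**left, **right}

# Larger table (left) from one source, smaller table (right) from another.
orders = [
    {"order_id": 1, "customer_id": 10, "amount": 50.0},
    {"order_id": 2, "customer_id": 11, "amount": 75.0},
    {"order_id": 3, "customer_id": 10, "amount": 20.0},
    {"order_id": 4, "customer_id": 99, "amount": 5.0},
]
customers = [
    {"customer_id": 10, "name": "Acme"},
    {"customer_id": 11, "name": "Globex"},
]

joined = list(hash_join(orders, customers, "customer_id"))
print(len(joined))  # 3
```

Swapping the arguments gives the same matches but forces the larger table into memory. At production scale that is the difference between a join that completes and one that spills or fails.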

Beyond the technical questions, there’s a strategic layer to this evaluation: whether the platform you’re buying is an open fabric you control or a closed ecosystem that locks your data in. Open vs. Closed Data Fabric: A Strategic Guide for Enterprise Data Leaders lays out the architectural trade-offs CDOs should weigh alongside the technical criteria above.


Running a Proof of Concept That Validates the Claims

Marketing claims are easy. Production behavior under realistic conditions is what matters. Structure your PoC around four validation checkpoints:

Week 1 — Source connectivity and schema discovery. Connect your three most complex source systems. Verify that metadata is discovered automatically, data types are mapped correctly, and cross-source schema relationships are visible without manual configuration.

Week 2 — Query pushdown and freshness verification. Run your most selective queries and inspect execution plans. Confirm that filtering happens at sources. Update records in source systems and verify that federated queries reflect changes immediately.

Week 3 — Governance and access control validation. Implement your row-level security policies and column masking requirements. Confirm enforcement occurs at the query layer, not post-retrieval. Verify audit trails capture sufficient detail for compliance attestation.

Week 4 — Performance and cost modeling under realistic load. Run concurrent queries at projected production volume. Measure response times for cross-source joins. Generate actual cost data from compute consumption and model what that means at 5x scale.

Define your go/no-go criteria before the PoC starts—not after you’ve seen the vendor demo. The questions in this guide translate directly into measurable pass/fail criteria: pushdown verified in execution plans, freshness confirmed by direct source comparison, governance enforced at query time, and costs documented at scale.
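One way to hold yourself to criteria set in advance is to write them down as data before the PoC begins. A minimal sketch, with hypothetical names and an example outcome invented for illustration:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Criterion:
    name: str
    passed: bool
    blocking: bool = True  # a failed blocking criterion forces a no-go

def decide(criteria):
    """Return the decision and the list of blocking criteria that failed."""
    failed = [c.name for c in criteria if c.blocking and not c.passed]
    return ("go" if not failed else "no-go", failed)

# Example results recorded at the end of a hypothetical PoC.
results = [
    Criterion("pushdown verified in execution plans", True),
    Criterion("freshness confirmed by direct source comparison", True),
    Criterion("governance enforced at query time", False),
    Criterion("costs documented at scale", True),
]

decision, failed = decide(results)
print(decision, failed)  # no-go ['governance enforced at query time']
```

Agreeing on which criteria are blocking before the first demo removes the temptation to reweight them afterward.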

Vendors with genuine zero copy data integration capabilities will welcome this structure. Vendors whose claims rest on marketing language will find reasons to avoid it.


Curious what genuine zero copy looks like from the analyst’s seat — not the architect’s? 5 Ways to “Talk” to All Your Data With Promethium walks through the day-to-day outcomes of a federated architecture that actually passes the seven questions above: cross-source queries with no pipelines, answers in plain English grounded in live data, and full lineage from question to source.