The Data Fusion Secret & Why Custom Query Engines Fail with Nikita Lapkov

In this episode of The Data Engineering Show, host Benjamin Wagner sits down with Nikita Lapkov, Senior Software Engineer at Cloudflare, to explore the architecture, design decisions, and future roadmap of R2 SQL- Cloudflare's new R2-based distributed query engine launched in September 2024.

What You'll Learn:

How to leverage existing query engines strategically: Why Cloudflare chose Apache Data Fusion for single-node query processing rather than building an analytical engine from scratch, freeing engineering resources for distributed orchestration challenges.
The stateless architecture pattern for global infrastructure: How to design compute nodes that hold zero persistent state by storing all metadata in a distributed catalog (Iceberg), enabling per-query worker provisioning across 300+ geographically dispersed data centers.
Why filter pushdown and metadata-driven pruning are non-negotiable optimizations: How to reduce data scanned from object storage before query execution begins by leveraging catalog statistics and range filtering - the foundation of R2 SQL's performance gains.
How to solve version compatibility at infrastructure scale: Why backward compatibility matters more than cross-version support when you can't control individual node upgrade timing, and how this constraint drives architectural decisions.
The shuffle strategy for point-to-point distributed joins: How to implement in-memory and disk-based shuffles within ephemeral worker clusters using network-addressable worker IDs, allowing stateless workers to forget completely after query completion.
Why adaptive query execution is the next frontier for petabyte-scale analytics: How collecting runtime data distribution statistics mid-query execution enables mid-flight plan reconfiguration - a technique worth the overhead investment when queries run for minutes or hours rather than milliseconds.

If you enjoyed this episode, make sure to subscribe, rate, and review it on Apple Podcasts, Spotify, and YouTube Podcasts. Instructions on how to do this are here: https://www.fame.so/follow-rate-review

About the Guest(s)

Nikita is a Senior Software Engineer at Cloudflare, specializing in distributed query engines and data platform architecture. With extensive experience in database internals gained through roles at ClickHouse, Yandex, and MongoDB, Nikita has developed deep expertise in query optimization and system design at scale. At Cloudflare, he leads the development of R2 SQL, a distributed analytical query engine built on Apache Data Fusion, serving as a critical component of Cloudflare's data platform. In this episode, Nikita discusses the architecture, design decisions, and technical challenges of building a stateless, distributed SQL engine across Cloudflare's unique 300-location infrastructure, offering valuable insights for engineers working on large-scale data systems. Their work demonstrates how thoughtful architectural choices and infrastructure constraints drive innovation in distributed database systems.

Quotes

"It was my crash course into OS engineering. We encouraged every possible bug in this project. It was very painful and very hard." - Nikita Lapkov

"Collecting a stack trace is very hidden, especially if you're not writing in C or C++. It is actually a very complicated and involved process." - Nikita Lapkov

"What excites me is that it has free egress. Usually, you would pay per gigabyte to load your data. You don't have that with R2." - Nikita Lapkov

"What we explicitly wanted to avoid when building R2 SQL is building an analytical query engine again. We would much rather use something off the shelf and work on the interesting distributed parts." - Nikita Lapkov

"No matter how complex the query is, you can make a case that, with extreme cases, the throughput for a single load operation is relatively constant, no matter how complex the query is." - Nikita Lapkov

"We try to be as stateless as possible. All our state lives in the catalog itself, so we only need what's in the catalog and the query that comes from the request." - Nikita Lapkov

"The shuffles cannot really be reused unless you do some very fancy heuristics. Once we have picked the workers for a particular query, we can think of them as our little cluster." - Nikita Lapkov

"Joins consume your entire roadmap, and this is pretty much what will be happening with us at some point. We need to make sure that distributed joins work really well, no matter what your data distribution is like." - Nikita Lapkov

"We have potentially minutes to spare, and optimizing some even subparts of the query is worthy investigation because it could shave hours or something like that." - Nikita Lapkov

"Finding the safe points for replanning and doing this distributed coordination while we have 50 different workers working on different parts of the query is definitely the area we want to look at in the coming year." - Nikita Lapkov

Resources

Connect on LinkedIn:

Nikita Lapkov - https://www.linkedin.com/in/nikitalapkov
Benjamin Wagner - https://www.linkedin.com/in/wagjamin

Websites:

Firebolt – firebolt.io
Cloudflare – cloudflare.com
Apache Arrow DataFusion – datafusion.apache.org

Tools & Platforms:

R2 SQL – Cloudflare's R2-based query engine for analytical queries
Apache Arrow DataFusion – Analytical query engine used for single-node number crunching
Arroyo – Rust-based streaming solution built on DataFusion
R2 – S3-compatible object storage with free egress
Apache Iceberg – Catalog system for state management