The Data Engineering Show
The DataFusion Secret & Why Custom Query Engines Fail with Nikita Lapkov
March 24, 2026
What if building a distributed SQL engine meant rethinking everything about how query execution works at scale? In this episode, Benjamin sits down with Nikita, Senior Software Engineer at Cloudflare, to explore how R2 SQL leverages object storage and distributed computing to power analytics across 300 global locations, why backward compatibility becomes critical when you can't control infrastructure rollouts, and the key strategies for handling joins and adaptive query execution in a stateless, point-to-point network architecture. Whether you're designing distributed systems or curious about how Cloudflare processes petabytes of data, this conversation reveals the real-world engineering challenges and innovations shaping the future of cloud data platforms.
In this episode of The Data Engineering Show, host Benjamin Wagner sits down with Nikita Lapkov, Senior Software Engineer at Cloudflare, to explore the architecture, design decisions, and future roadmap of R2 SQL, Cloudflare's new R2-based distributed query engine, launched in September 2025.


What You'll Learn:


If you enjoyed this episode, make sure to subscribe, rate, and review it on Apple Podcasts, Spotify, and YouTube Podcasts. Instructions on how to do this are here: https://www.fame.so/follow-rate-review


About the Guest(s)


Nikita is a Senior Software Engineer at Cloudflare, specializing in distributed query engines and data platform architecture. With extensive experience in database internals gained through roles at ClickHouse, Yandex, and MongoDB, Nikita has developed deep expertise in query optimization and system design at scale. At Cloudflare, he leads the development of R2 SQL, a distributed analytical query engine built on Apache DataFusion, serving as a critical component of Cloudflare's data platform. In this episode, Nikita discusses the architecture, design decisions, and technical challenges of building a stateless, distributed SQL engine across Cloudflare's unique 300-location infrastructure, offering valuable insights for engineers working on large-scale data systems. His work demonstrates how thoughtful architectural choices and infrastructure constraints drive innovation in distributed database systems.


Quotes


"It was my crash course into OS engineering. We encountered every possible bug in this project. It was very painful and very hard." - Nikita Lapkov

"Collecting a stack trace is very hidden, especially if you're not writing in C or C++. It is actually a very complicated and involved process." - Nikita Lapkov

"What excites me is that it has free egress. Usually, you would pay per gigabyte to load your data. You don't have that with R2." - Nikita Lapkov

"What we explicitly wanted to avoid when building R2 SQL is building an analytical query engine again. We would much rather use something off the shelf and work on the interesting distributed parts." - Nikita Lapkov

"No matter how complex the query is, you can make a case that, with extreme cases, the throughput for a single load operation is relatively constant, no matter how complex the query is." - Nikita Lapkov

"We try to be as stateless as possible. All our state lives in the catalog itself, so we only need what's in the catalog and the query that comes from the request." - Nikita Lapkov

"The shuffles cannot really be reused unless you do some very fancy heuristics. Once we have picked the workers for a particular query, we can think of them as our little cluster." - Nikita Lapkov

"Joins consume your entire roadmap, and this is pretty much what will be happening with us at some point. We need to make sure that distributed joins work really well, no matter what your data distribution is like." - Nikita Lapkov

"We have potentially minutes to spare, and optimizing some even subparts of the query is worthy investigation because it could shave hours or something like that." - Nikita Lapkov

"Finding the safe points for replanning and doing this distributed coordination while we have 50 different workers working on different parts of the query is definitely the area we want to look at in the coming year." - Nikita Lapkov


Resources


Connect on LinkedIn:

Websites:


Tools & Platforms:


The Data Engineering Show is brought to you by firebolt.io and handcrafted by our friends over at: fame.so

Previous guests include: Joseph Machado of LinkedIn, Matthew Weingarten of Disney, Joe Reis and Matt Housley, authors of Fundamentals of Data Engineering, Zach Wilson of Eczachly Inc, Megan Lieu of Deepnote, Erik Heintare of Bolt, Lior Solomon of Vimeo, Krishna Naidu of Canva, Mike Cohen of Substack, Jens Larsson of Ark, Gunnar Tangring of Klarna, Yoav Shmaria of Similarweb and Xiaoxu Gao of Adyen.

Check out our three most downloaded episodes: