What happens when a team of seven engineers spends a year trying to build a production-ready CDC connector and fails? For Artie CTO and co-founder Robin Tang, it was the spark he needed to build a platform that makes data streaming accessible. In this episode, Robin joins Benjamin to discuss the "DFS" (Depth-First Search) approach to data sources, the engineering hurdles of real-time Postgres-to-Snowflake pipelines, and why "theoretically correct" architectures often fail in practice.
In this episode of The Data Engineering Show, Benjamin sits down with Artie CTO and co-founder Robin Tang to explore the complexities of high-performance data movement. Robin shares his journey from building Maxwell at Zendesk to scaling data systems at Opendoor, highlighting the gap between business-oriented SaaS connectors and the rigorous demands of production database replication.
Robin dives deep into Artie’s architecture, explaining how they leverage a split-plane model (Control Plane and Data Plane) to provide a "Bring Your Own Cloud" (BYOC) experience that engineering teams actually trust. You’ll hear about the technical nuances of CDC, from handling Postgres TOAST columns to the "economy of scale" challenges of processing billions of rows for Substack, Artie’s first customer. Whether you're struggling with real-time ingestion costs or curious about the future of platform-agnostic partitioning, this conversation provides a masterclass in modern data movement.
What You'll Learn:
- Why the data movement market is bifurcating: Managed vendors like Fivetran excel at SaaS integrations (hundreds of connectors), while specialized vendors like Artie focus on production databases at high volume - a fundamentally different job to be done, requiring expertise in failure recovery, observability, and advanced use cases.
- How to design CDC architecture that doesn't break production databases: Use online backfill strategies (the DBLog framework) instead of long-running transactions that hold write locks; implement table-level parallelism so a single table error doesn't halt the entire pipeline (sketched in the first example after this list).
- The split-plane architecture pattern for flexible deployment models: Build control plane and data plane separation from day one, allowing customers to choose between fully managed cloud deployments or bring-your-own-cloud (BYOC) without compromising UX or architecture.
- Why database-specific expertise matters more than breadth: SQL Server CDC requires reverse engineering undocumented code; Postgres has TOAST columns; MongoDB allows invalid timestamp values - each data source has hidden complexity that justifies deep specialization over connector sprawl.
- How to build trust with early-stage customers on mission-critical workloads: Walk prospects through architecture and failure modes before implementation; encourage them to stress-test with real data volumes; establish deep engineering partnerships where both teams debug problems together (not sales-driven relationships).
- The platform-specific optimization trap and how to solve it: Instead of requiring customers to understand the nuances of BigQuery time partitioning vs. Snowflake's lack thereof, build platform-agnostic features (like soft partitioning) that work consistently across destinations while handling platform-specific optimizations under the hood (see the second sketch after this list).
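To make the table-level parallelism point concrete, here is a minimal Go sketch. It is not Artie's implementation - all names (ChangeEvent, processTable, the demo tables) are hypothetical, and the real work of decoding, batching, and flushing is elided. What it shows is the isolation property: each table gets its own worker, so one table's failure is logged and contained rather than halting its siblings.

```go
// Hypothetical sketch of table-level parallelism in a CDC pipeline:
// one worker per table, so a single table's error never stops the rest.
package main

import (
	"context"
	"fmt"
	"log"
	"sync"
	"time"
)

// ChangeEvent is a stand-in for a decoded CDC record.
type ChangeEvent struct {
	Table string
	Data  map[string]any
}

// processTable drains one table's event stream. Errors are returned to
// the caller, never propagated to sibling workers.
func processTable(ctx context.Context, table string, events <-chan ChangeEvent) error {
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case ev, ok := <-events:
			if !ok {
				return nil // stream closed cleanly
			}
			// Placeholder for the real work: type coercion,
			// batching, and flushing to the destination.
			_ = ev
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	// One channel per table, fed by a single log reader (not shown).
	streams := map[string]chan ChangeEvent{
		"orders": make(chan ChangeEvent),
		"users":  make(chan ChangeEvent),
	}

	var wg sync.WaitGroup
	for table, ch := range streams {
		wg.Add(1)
		go func(table string, ch <-chan ChangeEvent) {
			defer wg.Done()
			if err := processTable(ctx, table, ch); err != nil {
				// Isolate the failure: log and alert, while
				// the other tables keep replicating.
				log.Printf("table %s halted: %v", table, err)
			}
		}(table, ch)
	}

	for _, ch := range streams {
		close(ch) // demo only: no events are produced
	}
	wg.Wait()
	fmt.Println("all table workers finished")
}
```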
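And a second sketch of what platform-agnostic soft partitioning could look like, assuming it means deriving a partition value inside the pipeline and writing it as a plain column (the column name __event_date is invented here, not Artie's). A destination with native time partitioning (BigQuery) can map onto such a column; one without it (Snowflake) can cluster or filter on it, so the customer-facing behavior stays the same.

```go
// Hypothetical "soft partitioning": bucket each row's event time into a
// daily partition key carried as an ordinary column, independent of any
// destination-native partitioning feature.
package main

import (
	"fmt"
	"time"
)

// softPartition derives a consistent, destination-agnostic daily bucket.
func softPartition(eventTime time.Time) string {
	return eventTime.UTC().Format("2006-01-02")
}

func main() {
	row := map[string]any{"id": 42, "updated_at": time.Now()}
	// The pipeline stamps the partition key before writing the row.
	row["__event_date"] = softPartition(row["updated_at"].(time.Time))
	fmt.Println(row["__event_date"]) // e.g. "2025-01-15"
}
```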
About the Guest
Robin is the CTO and co-founder of Artie, a data movement platform built for high-volume, low-latency production database replication. With over a decade of experience building large-scale data systems, including early work on Maxwell (an open-source CDC framework at Zendesk) and database architecture at venture-backed startups, Robin identified a critical gap: existing tools optimize for SaaS integrations, not production databases at scale. In this episode, Robin shares hard-won lessons from building mission-critical infrastructure, including architectural innovations that prevent data loss and failure modes that only surface under real-world production load. His work at Artie has powered reliable data replication for companies like Substack, making this conversation essential for engineering teams building or evaluating real-time data movement solutions.
Quotes
"Artie helps companies make data streaming accessible." - Robin
"I didn't want to make any sort of compromises and it just turned out to be a really hard problem, so then we started a company around this." - Robin
"The complexity is not just at the destination level, the complexity is also at the source level." - Robin
"Every pipeline that we touch is mission critical for customers, or else they would just use either their existing pipeline or a managed vendor that's out there." - Robin
"We handle the whole thing, whereas other vendors more or less provide a component and expect engineers to either build or attach additional pieces." - Robin
"I think the biggest bottleneck for real time right now is accessibility. When people think about real time, they immediately think it's not worth it because they implicitly have a cost associated with it." - Robin
"We use Kafka transactions, so we do not commit offsets until the destination tells us the data has actually been flushed." - Robin
"There's so much nuance with every single data source that it becomes a whack-a-mole problem." - Robin
"When there's sufficient pain on the other side and they buy into your vision, it's easier to overcome obstacles during technical implementation." - Robin
"We're spending more time developing platform-agnostic solutions so customers don't have to understand platform nuances." - Robin
Resources
Tools & Platforms:
- Maxwell – Open-source CDC daemon that reads the MySQL binlog into Kafka
- Kafka – Distributed event streaming platform for data movement
- WarpStream – Cost-optimized Kafka alternative using object storage
- Strimzi – Kubernetes-native Kafka deployment operator
- Apache Iceberg – Open table format for data lakehouse architecture
- Delta Live Tables – Databricks' data movement and transformation tool
- ClickPipes – ClickHouse's native data ingestion platform
- Snowpipe Streaming – Snowflake's real-time data ingestion service
- Google Datastream – Google Cloud's CDC and data movement service
- AWS MSK Tiered Storage – Amazon managed Kafka with tiered storage capabilities