Building Data Products For Data Engineers

What does a tech stack that always needs to be at the forefront of technology look like? Roy Miara from Explorium talks about building data products for the audience that can’t be fooled – Data Engineers.

Boaz: Hello, everybody. Welcome to another episode of the data engineering show. So happy to see you with us again.

Eldad: Yes, I came back from abroad.

Boaz: I missed you.

Eldad: I missed you so much.

Boaz: So, welcome to another episode where we host data practitioners from all around. Today with us is Roy Miara. Did I pronounce it correctly, Roy?

Roy: Yep.

Boaz: Great. So, Roy is an engineering manager of data and ML at a super interesting company called Explorium, and he's been in a variety of data engineering and machine learning focused roles at a variety of startups in his career. And typically, we've talked to big brand names before, but it's important to also sometimes talk to smaller, interesting companies...

Eldad: Emerging brands.

Boaz: Like Explorium – companies that actually build stuff for data engineers and data scientists, which makes Explorium interesting. So, Explorium, if you haven't heard about them, recently landed $75 million in investment. What they do is help with data enrichment – they help you use external and public data sources to simplify your training data procedures and things like that. We'll have Roy expand on that a little bit, but before that, Roy actually ran a little late. Roy, why did you run late to this webinar? I think you have a story for us.

Roy: Changing a flat tire.

Boaz: For whom?

Roy: For a pregnant woman.

Boaz: This is so nice. Sometimes we have people who are also amazing humans outside of the data engineering show. Thank you for helping...

Eldad: And your wife is not expecting.

Roy: No, it's not my wife. I had a feeling, because I was texting Boaz and then I said, oh, it's going to pop. So, I texted him that I was going to be late.

Boaz: So, take an example from Roy. If you see pregnant people who need to change flat tires in the heat help them out. Okay, so Roy, tell us a bit about Explorium in your own words, please.

Roy: Among data scientists today, we hear a lot that it's all about the data. It's no longer about the models; this is the era of data-centric AI in general. And big data analytics has always been about data – who has the best data, who has the most accurate, most relevant data to answer some business question. At Explorium we're looking at this whole world of machine learning and big data analytics and we say it's all about having the right data. So, what Explorium does is enable organizations – large organizations, small organizations – to have immediate access to the most relevant data for their business problems. And we do it by obviously aggregating a lot of data sources in a lot of fields, modeling the data correctly, and enabling this kind of search engine over our datasets in accordance with what the user already has as internal data from his company. But also in cases where the user doesn't even have data and only has some business question – he wants to know all the companies that have more than 10 employees and more than $5 million in capital on the East Coast, for example.

Boaz: So, who typically are the end users? Is it more for data engineers, data scientists? Is it sometimes for business users as well?

Roy: Yeah, our users span from data scientists and deep ML engineers on one hand, to people from the business and analytics side on the other, who use the platform to do all their exploratory data analysis. So, we see everything in between, obviously.

Boaz: Okay. So, we're talking about a company that manages a lot of data and makes it easily accessible, both for engineering and for the business. Super interesting. So, tell us a little bit, just so we can understand the data challenges you guys have – what data volumes, more or less, are you dealing with?

Roy: Data volume here is tricky because we're processing somewhere around two terabytes a day, but when I say processing, I'm talking about already structured tabular data. So, we're not looking at raw data, we're looking at fine-grained, high-quality data from a variety of sources. And I think one of our major challenges is having this variety of sources and different schemas constantly evolving. So, around a couple of terabytes a day is what we process, and in total volume we're talking a couple of hundred terabytes, constantly updating.

Eldad: And is that one data source? Is that one global schema with many tables in it?

Roy: No, there are hundreds of different sources, most of them structured, but we're talking hundreds of data sources and thousands of features, or thousands of different schemas, because from every source we generate and aggregate data into many, many different use cases.

Boaz: Because Explorium is a company where the data is the product, essentially. What does the split between engineering, data teams, and data engineering look like? Maybe give us an overview of how many people are in engineering, how many people deal with data, how many people have a data-related title versus an engineering-related title. Do people without a data title still work on data stuff?

Roy: That's a good question, actually, because I'd claim that everybody here is a data engineer, scientist, and analyst – kind of a mix. We have an infrastructure organization, data infrastructure, which is the team that I'm leading. We're about nine data engineers, a data ops engineer, and ML engineers, so we kind of work in between. We have another team that's building Explorium's feature store, so to speak.

Eldad: So, a user can just add one of many data sources that are completely structured into their schema, into their model, enrich it, and query it, basically.

Roy: We have these two flows. One flow we call the auto ML flow, the ML engine. In this flow, a user is coming with his dataset – internal data that he collects inside his organization – and he has some target that he wishes to predict, like the classical ML flow. What happens at that point is he can connect to our platform and basically the platform will enrich automatically. So, it adds sources automatically, according to the context and the analysis of the original core dataset that the user uploaded. The platform will automatically analyze the data, understand which datasets are the most relevant, connect them, train the model, and evaluate the model based on how much gain we see from those external features. So, it's connecting the data, it's running feature extraction, it's running feature selection, it closes the loop with a model and then finds the best features according to the target.

This is what we call the auto ML, ML engine flow. And we have a flow that is more towards general analysis. Meaning that you can upload your data, we analyze the data, but we will present our internal catalog of features. So, you'll be able to see the sources, you'll be able to see the coverage, you'll be able to see the features with all the descriptions and everything, and the user can add features. So, he can add enrichments, he can run transformations, and basically build what we call the recipe. He can build the recipe of the features that he wants to add with all the transformations, and then he can query this recipe in production, he can use this recipe to schedule batch jobs that will update this data. And now we're kind of connecting all of these flows together to create one unified flow where a model is just a part of the recipe.
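As a rough illustration of the loop Roy describes – enrich with external sources, select features, measure the gain against the target – here is a minimal sketch in Python. The function names, the greedy keep-it-if-it-lifts strategy, and the scoring logic are hypothetical simplifications for illustration, not Explorium's actual engine.

```python
# Hypothetical sketch, not Explorium's engine: join candidate external sources
# onto the user's core dataset and keep only enrichments that improve a
# cross-validated score against the target. Assumes numeric feature columns.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score


def evaluate(df: pd.DataFrame, target: str, key: str) -> float:
    """Cross-validated score of a simple model on the current feature set."""
    X = df.drop(columns=[target, key]).select_dtypes("number").fillna(0)
    y = df[target]
    return cross_val_score(GradientBoostingClassifier(), X, y, cv=3).mean()


def enrich_and_score(core: pd.DataFrame, target: str, key: str, sources: dict):
    """Greedily add external sources (joined on `key`) that lift the score."""
    baseline = evaluate(core, target, key)
    current, kept = core, []
    for name, source in sources.items():
        candidate = current.merge(source, on=key, how="left")
        score = evaluate(candidate, target, key)
        if score > baseline:  # keep the enrichment only if it adds signal
            current, baseline = candidate, score
            kept.append(name)
    return current, kept, baseline
```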

Boaz: And what does your data stack look like?

Roy: Our data stack is quite varied. We have a couple of internally built tools – our feature store is internally built – but in terms of what my team manages across the entire data stack, we're looking at databases: Postgres, DynamoDB, and Elasticsearch. Those are the main databases that we have. We have a cloud data warehouse – we are using Firebolt for a lot of our internal processing and also for exposing some of the data directly to the users. We're using Spark a lot, we're using DBT, and we're trying to stay on the edge of the technology because we are facing new use cases all the time. We kind of have this challenge of trying to be good with any data, so we have to be familiar with any type of warehouse, any type of lake architecture – you have to keep track of what's happening. A big part of what we do is trying to understand where the data engineering world is going, because this is where our users and customers are going too.

Eldad: By the way, did you try the latest DBT–Spark integration? Any feedback on that?

Roy: We have tried it. We're actually trying it as we speak, so this is something we're building.

Eldad: Nice

Roy: Yeah. It's working nicely. We also ran it on Presto, which worked nicely, and I'm hearing integrations are coming from your end as well, so we will be trying that too. As feedback, I love this tool, I think this is the way to go. For us it's been very fitting because of the variety, the constant need to create more pipelines, the other complexities, and the need to orchestrate everything – and to do it in a way where we kind of democratize it among data scientists and analysts, both internally and externally. So, you kind of have to have a unified layer where engineers can sleep quietly at night without having to worry about waking up to pipelines breaking over schema changes and stuff like that.

Boaz: Let's do a quick switch to a fun and blitz round, in which we will ask you a variety of questions where you're not supposed to think too much, just answer quickly. There are no wrong answers, only yes or no. Well, there are no wrong answers except the ones that are wrong.

Roy: Okay.

Boaz: Okay. So, are you ready?

Roy: Yes.

Boaz: Commercial or open source?

Roy: Open source and commercial.

Boaz: Batch or streaming?

Roy: Batch. Streaming is mini batching.

Boaz: Write your own SQL or use a drag-and-drop vis tool?

Roy: I like the SQL, come on.

Boaz: Work from home or from the office?

Roy: Office.

Boaz: AWS, GCP, or Azure?

Roy: Oh, wow. Hopefully all of them, but AWS, GCP.

Boaz: To DBT or not to DBT? Although, I think you hinted...

Eldad: Not to DBT, always DBT.

Roy: To DBT.

Boaz: To Delta Lake or not to Delta Lake?

Roy: Delta Lake.

Boaz: Okay. Thank you. I think Roy answered differently than people typically do.

Eldad: Yes.

Boaz: I think you're the first one who said the office, when asked home or office.

Eldad: Yes. Everyone was confused.

Boaz: People typically say either home or both.

Eldad: Yeah.

Boaz: Nobody just says office, except Roy.

Boaz: How many kids do you have?

Roy: I have one. Right now, we're living in a very small apartment so working from home is super hard, but also, I like the fact that office is where you work and home is where you live.

Boaz: They're separate.

Eldad: Old school.

Roy: Yeah, old school.

Boaz: Nice.

Eldad: Nice.

Boaz: So, what were some of the bigger data challenges or bigger projects you guys had at Explorium in the last year?

Eldad: Traumas, huge success, huge surprise.

Roy: As I said before, one of the biggest challenges I think we have is that we kind of have to be good with any data, so the platform and the infrastructure that we build have to be generalized from inception, which is hard. Most engineering organizations – and I'm guessing yours too – try to do the right thing and not generalize too early, and not build the wrong abstractions over things. But we kind of have to, because if we're not abstracting in the right way, or if we're not generalizing enough, then with any new source that we find we have to tweak everything around it. So, this is one big challenge I think we have. Another challenge that I think is interesting is that you have to understand your user. If you look at the auto-ML flow, for example, when the user uploads his data, you have the challenge of understanding exactly what his data means, and also what he's actually searching for.

Because sometimes you'd be surprised – it's not always the features that bring the most correlation; sometimes it's about bringing more knowledge. Sometimes knowledge is important when you're doing that analysis. We have users coming to us saying, this is an interesting feature. It's not always the one with the highest correlation to the target, it's not always the one with the best statistics, but it's interesting because it tells me something about my business that I didn't know. So, this is another challenge, and democratizing this data platform that we built is also a big challenge, because you have to enable data scientists, both in Explorium and outside, and data analysts who actually work with high volumes of data and complex pipelines, to build their own processing pipelines and features – you have to enable them and abstract away the engineering underneath. You asked before about DBT on top of Spark; I think the cool thing is that you're just writing SQL queries and underneath it runs on hundreds of machines on Spark.
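The "just write SQL, let Spark run it on hundreds of machines" idea can be illustrated with plain PySpark; a dbt model ultimately materializes a statement of roughly this shape. The paths, table, and column names below are hypothetical, not Explorium's.

```python
# Minimal sketch: declarative SQL, with Spark handling the distributed execution.
# Paths, table, and column names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-over-spark").getOrCreate()

# Expose a lake table to SQL.
spark.read.parquet("s3://my-lake/companies/").createOrReplaceTempView("companies")

# The "model": a plain SELECT; Spark plans and distributes it.
enriched = spark.sql("""
    SELECT region,
           COUNT(*)            AS company_count,
           AVG(employee_count) AS avg_employees
    FROM companies
    WHERE employee_count > 10
    GROUP BY region
""")

# Materialize the result back to the lake, like a dbt table materialization.
enriched.write.mode("overwrite").parquet("s3://my-lake/marts/companies_by_region/")
```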

Eldad: This is amazing. I remember the days – it wasn't long ago – when people were so excited about Scala and Java and writing those Spark jobs and owning the threads and the machines and the hardware and the wiring. And then poof, it was all gone. I think part of that, or maybe a big part of that, is cloud native data warehouses like Snowflake and BigQuery that actually taught many of us that it's okay to abstract, to simplify, and it's okay to have that decoupled from your day to day. So thank you Snowflake and BigQuery for teaching Databricks that SQL is good. Now many of the people we talk to are using SQL over Spark. That's the biggest change we're seeing – moving from developing it to actually declaring it using SQL – and as you said, you love it and most people do love it, and it's a good change. SQL is back. We were confused for a few years. We had NoSQL, we had NewSQL, we had side SQL.

Roy: And now SQL is back.

Boaz: This is the longest SQL rant I've heard.

Roy: Exactly. And it's important to rant about it.

Boaz: Okay, this is all exciting stuff. Let's talk about it from a more personal perspective, maybe. Tell us about something that didn't go well. Tell us about what we call an epic failure that you guys ran into. Maybe an approach that didn't work well, lessons learned and such.

Roy: I have a personal failure, so I'm going to talk about my personal one, to own it. You asked me about batch versus streaming. When I started at Explorium, one of the things that was clear to me was that we needed this kind of complex stream ingest, where we would have to report some events, because we're working with such a variety of sources and some of them are more dynamic by nature – APIs, for example. We have many of those, and we needed some way to dynamically enrich using those APIs and then propagate the data back to the data lake, re-ingest it, and reprocess it. So, I was building this wonderful Kafka-based streaming connector for everything, with the best abstraction in the world, and it flunked, essentially.

Eldad: Abstraction is slow.

Roy: Yeah, but you got to learn, you got to learn the hard way.

Boaz: Why didn't it work?

Roy: It was a premature abstraction, in a way. And it didn't match the way we now look at processing, because we have so many processes running offline – we have to do this complex modeling and connect different sources into one body of knowledge, and that happens offline in batch jobs. And it's very important to be able to support quality at scale. Data quality is one of our team's priorities and challenges: how do you maintain quality over such a variety of data sources? And there isn't really a good way to provide super blazing-fast latency along with quality, because there is some processing you have to do behind the scenes, so that streaming approach was a bit premature. Now when we look at streaming, we look at it as an additional entry point to those periodic or more batch-type jobs – streaming is just another way to get data into our lakes and warehouses, and from there into processing.
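A rough sketch of the approach Roy lands on, with hypothetical topic names and paths: stream events continuously into the lake as an entry point, and leave the heavy modeling to periodic batch jobs that read what has accumulated.

```python
# Sketch of "streaming is just another entry point": events land in the lake
# continuously; the expensive modeling/matching still runs as scheduled batch
# jobs over the accumulated files. Topic names and paths are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stream-to-lake").getOrCreate()

# Read raw events from Kafka.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "enrichment-events")
          .load())

# Append raw events to the lake; no transformation beyond casting the payload.
(events.selectExpr("CAST(value AS STRING) AS payload", "timestamp")
 .writeStream
 .format("parquet")
 .option("path", "s3://my-lake/raw/enrichment-events/")
 .option("checkpointLocation", "s3://my-lake/checkpoints/enrichment-events/")
 .trigger(processingTime="1 minute")
 .start())

# A separate, scheduled batch job later reads s3://my-lake/raw/enrichment-events/
# and does the heavy modeling work.
```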

Eldad: So, what you're saying is, I might stream data in – stream on write – but it's always batch on read, because the batchy part of the schema will make all the streaming ones batchy. So streaming is a one-way ticket, and then you start needing to analyze.

Boaz: We see it often. Often we'll see streaming as just the way I put data into my lake or whatever, but it doesn't mean that end users really enjoy that streaming and that low latency of the data coming in. Sometimes we see that, but it's actually rare.

Roy: For a lot of cases, we have live, updated data. For example, take weather. Weather is something where, first of all, how much history do you keep and how do you do it? So, with weather we work with APIs, for example – when users have to have the data for now, or today, or one hour ago, you have to deliver it through APIs, live, in real time. Every recipe and every auto ML pipeline that we build internally also has this real-time phase where the user wants to consume it, because he has his own user or customer waiting at checkout or whatever. So, we have real time, but the complex processing and understanding exactly what the data is and modeling it – that happens in batch.

Boaz: Let's now move to something positive. Tell us about something you're proud of, a big win in your data work.

Roy: We have a big win. I think it's combining modeling with the right serving infrastructure – it's a tool that we built internally that gives us higher-quality matching capabilities over a variety and complexity of data. I think the win was when we started onboarding more and more and more data sources, and you have this feeling when you take this new product to production and it works.

Eldad: A machine, a working machine.

Roy: Yeah, so you kind of reflect and you say, okay, it was worth investing all of this time really understanding modeling. Data is always talking about something in the real world. There's always something in the real world that generated this data, and when you're actually able to model the real world correctly, then somehow the data behind the scenes kind of falls into place and everything makes sense. I think this is what we saw, and this was really exciting.

Boaz: This is a project. Initially we started talking about data engineers versus software engineers, and we often talk about the boundaries blurring between the two. So, for such a project where you build that super interesting flow, do you consider this a data engineering challenge, a software engineering challenge, or both? And which skill set did you need internally to deliver it?

Roy: Wow. That's a question I actually discuss with a lot of people – what is a data engineer, and how is it different from being a software engineer or a big data developer? So, I look at the developer world: you have software developers and software engineers, and you have data developers and data engineers. I look at development as writing logic, versus engineering, which is construction, actually – in a way it's construction. Foundations and understanding things at lower levels.

Eldad: So, you're saying that in many ways you're dividing engineers into two groups: those who are responsible for generating the data and delivering it, and those who build something on top of it.

Roy: I'm honestly not sure where the line is, because I'm also looking at the line between where a data engineer meets an ML engineer, or a software engineer – are they the same person? But I think the problem I was talking about was essentially a data engineering problem, because it had this element of data models and schemas, and indexing – how do you index data correctly? It has a lot of those elements that I think make it very data engineering. So having these schemas and the model on one hand, but understanding which is the right infrastructure to hold everything in place so it can scale natively – I think that's more of a data engineering problem.

Boaz: You mentioned quality a lot – data quality and the importance of data being high quality. How do you treat quality internally within your data pipelines, flows, and data stack?

Roy: When you're thinking about finding the right data – when you try to think, okay, my user wants to use the platform to actually find the right data – sometimes he doesn't even know what the right data is. It has two elements to it, I think, and all of them fall under quality in some form. One is the matching problem: when a user is coming and he's talking about a certain company or a certain place, how do I know how to match it? How do I find the geography, the place that the user is talking about, or the organization, or the combination of the two? So matching is one thing, and our approach there was enabling experimentation, because it's very hard to get ground truth – almost every aspect of data that you look at has those amazing challenges and complexities that you only understand once you start getting your hands dirty.

So, matching really affects quality. We treat matching as an experiment – as understanding how we tune the system exactly in every case. This is where having a generalized system that works well with metadata and enables the data owners and the domain experts to change it, tweak it, and play with it – to see that it fits the real world or their expectations – matters, so we rely heavily on domain experts here. So, this is one thing; the other aspect is correctness. Even if I found the right organization, the right place that the user was talking about, and I want to return points of interest, for example, I need to make sure that if I said there's a coffee shop there, then there is a coffee shop – that the data is correct.

So, I probably found the right place, which is one thing, but now I need to make sure that the data I retrieved is correct. And then after you've done these two, which are super complex, you have to ask yourself, is this relevant? If my user is trying to predict the sales of his products or optimize routes for shipping or whatever, is having a coffee place nearby something that he needs to know? Maybe if there's a coffee place in Tel Aviv, people are stopping their cars to get coffee and there are always cars tailing back, it takes five minutes more or whatever – so those small nuances. Combining all of those, I think, is how we treat data quality.
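As a toy illustration of the matching problem Roy describes – real entity resolution is far more involved – fuzzy-matching a user-supplied company name against a catalog might start like this. The catalog, names, and cutoff are invented for the example.

```python
# Toy illustration of the matching problem: fuzzy-match a user-provided company
# name against a reference catalog. Real entity resolution is far more involved;
# the catalog and threshold here are made up.
from difflib import get_close_matches

CATALOG = ["Acme Corporation", "Globex LLC", "Initech Ltd"]


def normalize(name: str) -> str:
    """Cheap normalization: lowercase, strip punctuation and whitespace."""
    return name.lower().replace(",", "").replace(".", "").strip()


def match_company(user_name: str, cutoff: float = 0.6):
    """Return the best catalog match above the similarity cutoff, or None."""
    normalized = {normalize(c): c for c in CATALOG}
    hits = get_close_matches(normalize(user_name), list(normalized), n=1, cutoff=cutoff)
    return normalized[hits[0]] if hits else None


print(match_company("ACME Corp."))   # matches "Acme Corporation"
print(match_company("Unknown Inc"))  # None – no close match in the catalog
```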

Boaz: Interesting. Thank you for that. What gets on your nerves the most in your daily work with data? What has contributed the most to your hair growing whiter?

Eldad: Frustration.

Roy: Inconsistencies. We work with a lot of data vendors, and you can actually see the vendors that own their data – they provide a consistent stream or consistent delivery, a consistent schema or a consistent file format. You know what, I'm changing my answer: file formats. It's 2021 – use compressed Parquet, use something with a schema. It's not that hard. We were working with textual data delimited with pipes, weird stuff compressed with weird compression, and you ask yourself, why?

Boaz: I think that's a good new corner we need in the podcast.

Eldad: Yes, data format.

Boaz: Not just data formats – let people let off steam about the stuff they deal with, because let's face it, engineering has some pretty nasty parts in our day to day.

Roy: Data counseling with the Data Bros.

Boaz: So, the "rant about the stuff you hate" corner.

Eldad: Do you support XML as a data source?

Roy: We have to.

Eldad: Wow. That's amazing. Nice.

Boaz: Wait, what are you saying? If you hear me, then...?

Roy: Then use compressed Parquet, with Snappy or whatever.

Boaz: So, data engineers worldwide, let's agree to use only that from this point forward to make our lives better.

Eldad: Please use faster decompression with Snappy. That's the ask here.

Boaz: Standardized boom.

Eldad: Everyone that hears that. No Zip, no nothing else. Just Snappy.
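For anyone who wants the concrete version of "use compressed Parquet": with PySpark (the paths here are hypothetical), converting a pipe-delimited vendor file into snappy-compressed Parquet, which carries its schema with it, is only a few lines.

```python
# Write snappy-compressed Parquet instead of pipe-delimited text files.
# Parquet carries the schema with the data; Snappy keeps decompression fast.
# Paths are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-delivery").getOrCreate()

# Read the vendor's pipe-delimited text file.
raw = (spark.read
       .option("sep", "|")
       .option("header", True)
       .csv("s3://vendor-dropbox/companies.txt"))

# Deliver it as snappy-compressed Parquet (snappy is also Parquet's default).
(raw.write
 .option("compression", "snappy")
 .mode("overwrite")
 .parquet("s3://my-lake/vendor/companies/"))
```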

Boaz: How do you stay on your toes in terms of keeping up with what's going on in the data world? Any tips on who to follow?

Eldad: Aside from this podcast.

Roy: I'm an avid Googler when it comes to maintaining my knowledge. I'm following several... I'll try to find it, but I'm using Daily Dev – I don't know if you know them, I think they're Israeli as well, I'm not sure. So, I use Daily Dev, which is kind of a Chrome home page that connects me to really interesting sources.

Boaz: Daily Dev, okay.

Roy: I'm following several. You probably know all the "awesome" lists – awesome data engineering, awesome MLOps, awesome ML engineering – so I'm following those. And I'm trying to find projects to follow and kind of lead from there, in a way. One example, and people make fun of me for this, but I'll say it anyway – now I have it on record. You know Spark, everybody knows Spark. I was following Spark as a project, and then I started following the people behind Spark and the laboratory in Berkeley behind Spark, and then they emerged with this new laboratory that eventually brought us Ray, which is a project I've been following for two or three years. No one was talking about it, no one was using it, and a while ago I kind of iterated back and got into the project again, and I said, they've really made progress – they released a general version, the first full production version – and that was really exciting, and we started talking to the team behind it.

So, I think there's a lot there. So, finding those projects and then trying to find a way to actually talk to the people. Everything is virtual today, but talking to the people is great – using Slack channels, for example. I think those are the places where you're actually able to, as you said, stay on your toes and stay on the edge.

Boaz: Yeah. I think like the community and people aspect in data engineering is huge.

Eldad: Huge.

Boaz: Much more than in general software engineering, because the space is growing so fast and changing so quickly, and unless you keep track and keep your eyes open to the open-source projects and to people talking about this and that, you lose a lot of great things. And I think there's a lot of community power, even without it being formalized, that impacts the space we're in, which makes it exciting.

Roy: I have this WhatsApp group of data engineers, and yesterday someone asked a question like, how do I process this? I want to process this and that data, and the data is partitioned this way, and so on. And I was tagging Boaz – I think it's a relevant case.

Boaz: It's an interesting case, let's share it with our listeners. Typically, when we talk about communities, we talk about all the famous and open...

Eldad: Big broadcasts

Boaz: Forums and sites we're on. Here we have something very local, to Tel Aviv maybe, but there is a local data engineering group on WhatsApp – which is very popular here for instant messaging – a very local but very effective group where local practitioners get advice on their daily challenges with data. So maybe the tip from us would be: localize more communities – it doesn't necessarily have to be worldwide. Sometimes if you're in the Valley, if you're on the East Coast, if you're on the West Coast, if you're here or there, talk to people around you; they're always in the same time zone to hop on a quick call, and it makes it a little more personal. So, it's an interesting approach – it definitely worked well here for us.

Roy: I think this makes the difference. I'm not sure it has to be local, but it has to be a group where people feel comfortable enough to ask stupid questions sometimes.

Boaz: Exactly.

Roy: And get very straightforward, StackOverflow-top-result-style answers. Once you have that, you know you have the right group, because people feel comfortable, they will ask questions, and the discussion will move from there.

Boaz: Yeah. Good point. Thank you. Okay, I think we're almost done. Any last famous words for our listeners?

Roy: Listen to the data. In a way...

Eldad: Nice, nice.

Roy: For data engineers – it's very obvious for data scientists and data analysts – I came from data. I did a couple of years of data science, pure data science, and I actually got into data engineering by accident, twice: at the beginning of my career, and then I deviated into data science and got back to data engineering again, somehow.

Boaz: Keeps pulling you in.

Roy: Yeah, it found me, I didn't find data engineering. But listen to the data. If you're able to have the understanding of a data scientist as an engineer, I think it makes your life a lot easier.

Eldad: So, listen to the data and if you don't like what you hear cleanse it.

Boaz: "Listen to the data" is great. It's t-shirt material.

Eldad: Yes.

Boaz: Abstract enough to be debatable for hours, and it's catchy. Well done, Roy. Okay, good. So, thank you so much everybody for joining us for another episode, we'll see you next time. Thanks again, Roy. Bye-bye.

Eldad: Bye-bye, everyone.

Brought to you with love from Firebolt