How Bolt Engineers Are Designing Its Next-Gen Data Platform

Bolt's ride-hailing app serves over 75M users in Europe and Africa, and its BI tools handle close to 500K queries every day. Erik Heintare and Bolt's engineering team are in the midst of designing a next-gen data platform, and he shares how it is going to solve their biggest data challenges.

Guest: Erik Heintare - Senior Analytics Engineer at Bolt
Hosts: Eldad and Boaz Farkash, AKA The Data Bros

Boaz: Hello!

Eldad: Hello!

Boaz: Welcome to another episode of the Data Engineering Show with myself, Boaz and Eldad.

Eldad: Hi, everyone!

Boaz: How have you been, Eldad?

Eldad: I am good.

Boaz: Are you ready? Hanukkah is around the corner.

Eldad: Yeah, absolutely!

Boaz: Yeah. Do you want to come over to light some candles?

Eldad: Light some candles!

Boaz: Yeah, do it!

Boaz: Okay. With us today, getting ready for Christmas time, I guess, is Erik Heintare. How are you?

Erik: I am good. How are you? Hi everyone!

Boaz: Thank you so much for joining us, Erik. Erik is joining us from Bolt. He is a Senior Analytics Engineer/Lead Data Analyst at Bolt. He has been there for over 4 years. Before we let Erik introduce himself a little bit further, a few fun facts about Estonia in general, where Erik is from. Estonia is a small country, only around 1.3 million people, but is actually ranked number one in the world in unicorns per capita. That's very interesting.

Eldad: Woohoo!

Boaz: Also regarding Bolt. If you have not heard about Bolt, it means that you probably have not been spending enough time in Europe or outside the US. Bolt is actually a big competitor to Uber. Next time you hop off a plane in a European country, try not defaulting to Uber and see if you can get a ride, a scooter or bike, even food delivery from Bolt. Bolt is active in over 300 cities across more than 40 countries by now. Mostly in Europe, but also Asia, Africa, Latin America, and that is exciting stuff. Bolt was founded in 2014 in San Francisco. But especially in recent years, I think, it saw tremendous growth.

Eldad: Amazing intro.

Boaz: Am I wrong? Am I wrong?

Erik: You were wrong about the San Francisco part. This was actually founded in Tallinn in Estonia.

Boaz: Got it.

Eldad: Other than that, amazing intro.

Boaz: In the PR, once they open the San Francisco office, it takes over everything in the PR. But thanks for the correction. It makes a lot of sense. So far, the company is still privately held and has raised over $600 million to date, serving more than 75 million end-users. So, Erik, thanks. Did I miss anything? Anything about Bolt that I left out?

Erik: No. Actually, maybe a couple of things to add about Bolt: we are not focused purely on Europe, and I think we are the leading ride-hailing and delivery app not only in Europe but also in Africa. So, we should be the biggest player in Africa as well. Europe and Africa together are like 2 billion people. Quite a big market, actually.

Boaz: Tell us, what do you do at Bolt? Let us hear it from you.

Erik: Just one more thing, coming back to the intro part. When you mentioned Estonia and the facts you gave, I had actually prepared one of the fun facts about Estonia too, and it also included that we have the most unicorns per capita. But when I usually do this intro to foreigners, I also say we have the most supermodels per capita.

Boaz: Interesting, we should investigate a correlation between startups and supermodels. Interesting!

Erik: But yeah. Great intro! I have been at Bolt for more than 4 years. I started off as the first data team member ever. Before me, there was no one working purely on data. I started off as a data analyst, basically trying to get started with our data warehouse together with engineers, trying to automate things and move some queries away from the production or pre-live environment to the data warehouse. I also wanted to get rid of Google Sheets, all of those things, and tried to optimize everything we did as a business. Within those 4 years, as we hired more analysts, I gradually moved to an analytics manager/lead analyst role, but after some time I still wanted to be more hands-on. So now I am mostly focusing on our analytics platform, while still helping everyone in analytics around all of those topics. So, yeah, this is what I do.

Boaz: Awesome! Let us talk about your title for a second, because it is interesting. You are a senior analytics engineer. We all know data engineers; analytics engineering is something you hear less about. What do you think that means?

Erik: I think it is about the tooling and the things that we actually do in the analyst world. Analytics engineer, in general, does not mean there is a big difference from analyst work or data engineer work. It is just that I am focusing more on helping to build the infrastructure for our analysts. I think the biggest proponent of analytics engineering is the company dbt Labs, and they are promoting it heavily. We are not using dbt, but still, the work that we need to do is basically to build out a platform for analysts, for business users, for product users to do their work efficiently and smoothly, with high quality and high speed. So, I am somewhere in between data engineers and data analysts. This is how we combine those two together.

Boaz: How big is Bolt in terms of headcount worldwide?

Erik: Worldwide, I think, we are more than 3000 employees and the data team just reached over 100 members.

Boaz: Wow!

Boaz: Tell us a little bit about that. What kind of data roles exist at Bolt? And how are they spread out in terms of departments, groups, teams, etc.? And where do you fit in?

Erik: Yeah, we have 3 different roles. I think it is pretty common. We have data scientists, data engineers, and data analysts. Data scientists and data engineers belong to the engineering organization and are like a centralized group. They report to higher management in engineering. Data analysts mostly belong to the product organization, and they report to product managers. So, they are way closer to the business and have the impact and the knowledge about everything that is going on in the business. This is why we went with this hybrid approach. Initially, with, I think, 3 data analysts in the company, we thought of going centralized for analysts as well, but it did not make any sense, and we are quite happy with the setup at the moment.

Boaz: You report to product as well, and not to engineering?

Erik: No. Actually, I am part of data engineering.

Boaz: Okay.

Erik: So we have 4 sub-teams in data engineering. First is data lake, then data transformation, third is model life cycle and experimentation platform, and fourth is my team, which is analytics engineering.

Boaz: Got it. Let us talk about the data stack and dive a bit deeper. Before that, in terms of data volumes, what kind of data volumes does Bolt deal with?

Erik: If you want to be fancy, and fancy starts, I think, from petabytes, then we can say that our Redshift cluster can handle more than 4 petabytes of data, but actually we are not there yet. I think our data lake in total is somewhere between half a petabyte and one petabyte.

Boaz: And in terms of the daily number of events or daily data volume?

Erik: Yeah. Talking from an analyst perspective, we do, I think, a bit less than half a million queries a day in our BI tools. And since we also serve all of the models as a platform, I think we do 100 million model life cycles daily.

Boaz: Wow!

Erik: So, quite a lot. And, yeah, I think more than half of those 3000 employees use our BI tool on a weekly basis.

Boaz: Okay. So, let's break that down. Tell us about the data stack a little bit, from bottom to top. What does it look like?

Erik: Yeah. So, we stream our data from live databases and services with Kafka to S3. Then, we have, of course, as I mentioned, Redshift as a data warehouse. On S3 and Redshift, we do some transformations in Apache Spark or with Apache Airflow. Then together with Redshift, of course, we use Spectrum, which is basically a layer to query data from S3 directly without storing it in Redshift. Then, we use Looker as our BI tool. We use SageMaker, and all of the data team members use Jupyter notebooks. So, this is just a brief, really high-level overview of the tools we use. I think there is nothing really epically different from what other companies are doing. So, I think it is a pretty common stack.

Boaz: How long has this stack been active? I mean, if you think back 4 years, how did it look back then? And how has the journey been?

Erik: So 4 years ago

Boaz: Google Sheet.

Erik: There was almost nothing. When I joined, the cool thing was that, since we were already on AWS back then, the guys were like, “Hey, what is this Redshift? Let's spin it up and see, maybe you can use it.” Of course, we still use S3, so S3 was there. We did not use Kafka, so it was a bit different then. And I think in Redshift, when I joined, we initially had like the top 10 tables, maybe, from the live database, just to get orders and to understand where our drivers are or how active they are, just to get the first initial overview. I really remember when I joined, I needed to get some data, and the engineers were telling me, “We do it from the live database. We query it from the live database. We do not have time for a data warehouse. Why should I bother?” And now it is like thousands of tables. I do not even know how many we have in Redshift together with Spectrum. It is constantly improving. So yeah, that is what's going on there.

Boaz: And is Looker the only BI tool, or do you have other methods of visualizing data?

Erik: For front-end analyses, we also use Mixpanel. But for everything where we as a data team want to have better control and better structure, and to use backend data as well, Looker is by far the biggest BI tool that we use. Yeah!

Boaz: I think I saw on your profile somewhere that you mentioned the use of Amundsen as a data catalog. Is that in use?

Erik: Yes

Boaz: Can you, maybe, tell us how you guys are using it?

Erik: Yeah, I would say it is in the alpha stage for us. But basically, the company is growing so fast. There are more people coming in. There are more tables popping up all the time, and we wanted a better way to scalably share the information that we have around our data. Data discovery has been one of the bottlenecks of scaling. So, we started off trying out two different tools. Firstly, Knowledge Repo, which is basically a repo filled with Jupyter notebooks where you can add some metadata on top, and also Amundsen. We currently use it in a quite limited scope, so we have all of the Redshift tables there. We have key tables, key columns, and key schemas all commented, and we have attributed owners to them. So if you want to know what is going on with food delivery couriers, and you are doing it for the first time from the marketing perspective, then at least you will easily be able, with one search, to understand where the tables are, how they are structured, whether they are key tables, and who the owner of those tables is from a data team perspective.

Boaz: Can you maybe share how Bolt approaches the data engineer versus analyst relationship? How do you guys make sure that things are delivered quickly, and that the analysts and the other business departments can stay self-sufficient and move fast? Has that typically been a friction point, or are there any insights you can share on how you guys do that?

Erik: Yeah. I think it is also pretty common, but of course, we struggle all the time between building the platform to be better and scalable for the future and supporting the current needs of analysts or data scientists, so they can move on with their projects. So it is a constant battle of prioritization. What we have done is that periodically we change our focus. For some time, we focus more on the product requests, making sure that everyone gets their stuff quickly and efficiently. And then we communicate to our users: for the next two weeks, we are heavily focused on building our data platform, so of all the requests that you have, we will maybe only solve the most critical ones, so please mark them accordingly. It is never easy. Product always tries to push its own needs first, because why would you bother optimizing something in data engineering? For them, it does not give any return on investment, but for their feature, which they want to launch, of course, it is super easy to put a number on it. So it is a constant battle. Of course, we communicate a lot together as a team, so this helps to understand the priorities, and this is the only way to go.

Boaz: Let us talk about some of the use cases. So, you know, there is the big Redshift at the center, according to what you say. So what are the use cases that run on top of it?

Erik: Yeah, I think Redshift's biggest user is Looker. As I said, we do quite a lot of queries from our business users and from our analysts. They are the main users. Of course, we have our experimentation platform using it. Then, we have our model life cycle, which is basically also using it to re-train models, and things like profiling services fetch some data from it. So basically everyone fights for a spot in the Redshift query queue.

Boaz: Can that get ugly sometimes?

Erik: Oh, yeah! I think I will leave it for a later question, potentially the one about the glorious failure.

Boaz: Okay. So, let's do a quick stop and move to something fun and quick - the Blitz Question Round. Are you ready?

Erik: Yes.

Boaz: Do not overthink. Let us see.

Boaz: Write your own SQL or use a drag and drop visualization tool?

Erik: Write your own SQL.

Boaz: Looker or Tableau?

Erik: Looker.

Boaz: Commercial or open-source?

Erik: Heart says, open-source, head says commercial.

Boaz: Batch or streaming?

Erik: Streaming.

Boaz: Work from home or from the office?

Erik: Hybrid.

Boaz: AWS, GCP or Azure?

Erik: AWS.

Boaz: To let people self-serve for analytics or not bother?

Erik: Let self-serve.

Boaz: Bolt or Uber?

Erik: Bolt.

Boaz: Boom! I like that.

Erik: I did not even leave you time to finish your question. I already said, Bolt.

Boaz: Very Good!

Boaz: What are the biggest challenges, though, with the current stack? What are your top priorities for next year given what you guys have today?

Erik: The current stack has been around for more than 4 years, and we are currently in the phase of finalizing POCs for the next-generation data warehouse. So we are potentially either getting rid of Redshift and replacing it with something else, or adding something on top of the current stack, or just playing around with all the different things that are available right now, to make sure that we are enabling our users for the next 100x growth as well. This has been a big focus for us in the last couple of months, and we will try to finalize it during this year. Next year will be where we basically work hard on making sure that we have that next-generation data warehouse ready.

Boaz: So you are going through a traditional evaluation process?

Erik: No, we are using some of the methods that have been out there. A lot of companies have been doing their POCs, so of course we take ideas from there, but we also apply a lot of our internal knowledge. Let's say, we gathered a lot of internal queries which are heavily used at the moment, and we wanted to see how they would perform. Because there are some tools that are really, really good when you just need to query data from one table, and there are other tools that are really good at joining together 20 tables, so what is the best for us? We try to cover all of those things, and, yeah, it has been a heavy job to do those POCs, and kudos to all of the team members who are doing this. It is not an easy decision to either switch out or keep the current state. So you are only responsible for the most used and valuable asset, but then subsequently will become familiar.
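
The replay approach Erik describes, running heavily used internal queries against each candidate and comparing how they perform, could be sketched roughly as below. This is a minimal illustration, not Bolt's actual harness: `replay_queries`, the `run_query` callable, and the injectable `clock` are all hypothetical names, and a real POC would also look at concurrency, cost, and data freshness.

```python
import time
from statistics import median
from typing import Callable, Dict, List


def replay_queries(
    queries: List[str],
    run_query: Callable[[str], None],
    repeats: int = 3,
    clock: Callable[[], float] = time.perf_counter,
) -> Dict[str, float]:
    """Run each query `repeats` times and return its median duration in seconds.

    `run_query` is a hypothetical callable that executes one SQL string
    against a single candidate warehouse and blocks until it completes.
    """
    results: Dict[str, float] = {}
    for sql in queries:
        timings = []
        for _ in range(repeats):
            start = clock()
            run_query(sql)
            timings.append(clock() - start)
        # The median is less sensitive to one cold-cache run than the mean.
        results[sql] = median(timings)
    return results
```

Calling it once per candidate with the same query list (for example, a mix of single-table lookups and wide joins) gives per-query timings that can be compared side by side.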

Boaz: Is there any particular technology or feature out there that you are really upset you do not have access to today, or that you would like to have in your next-gen platform?

Erik: I think one of the things that pushed us to move maybe a bit faster with this evaluation process was that we are currently hitting some of the limits when we talk about concurrent queries, basically at peak hours. So this is where we struggle the most at the moment, and this is what we want to solve as quickly as possible, because if business users cannot see the data or need to wait, I do not know, 5 minutes for it, then it is not worth it.

Eldad: So you have product's blessing.

Erik: Yeah! Definitely!

Boaz: So I am assuming decoupling storage and compute will be a big deal.

Erik: Yes, yes! Most likely.

Boaz: You had mentioned the glorious failure prior. I really do not want to pry, but you know, since you brought it up, tell us about some glorious failure you remember.

Erik: Maybe, I hyped it up too much.

Eldad: Too late.

Boaz: Too late, you have to.

Erik: But yeah, one of the things we had, basically it is a combination of multiple things. Looker quite recently introduced native integration with Google Sheets and Google Drive, which means it is easy to set up a schedule from Looker to those places. We did it. We enabled it, and we thought it was a good thing to have. Of course, with business users, you will never ever solve all of the cases. People will always try to copy your data to Google Sheets and add some manual stuff on top of it.

Eldad: Sort the data.

Erik: So we thought doing this was a great idea. It reduces manual workload and all of those things. But what we did not realize was that, due to some limitations of the Google Sheets and Google Drive APIs, the scheduled delivery takes a lot longer than, like, scheduling a Slack message or an email. So what happened was, people were too happy about this and they set up a lot of things, of course, for Monday morning, together with all the other reports that are running on Monday morning. So, basically, we had, I don't know, hundreds of new schedules randomly popping up on Monday morning, and all of a sudden none of our overnight schedules or jobs finished. We did not know what was going on. We saw that there was a load. It was really hard to estimate what was causing it. Is Redshift working slower? Or is it because Looker is doing something wrong? We could not find it out on the first Monday. It took us 3 Mondays to actually solve it. And it was a combination of many things, as I said: concurrency scaling on the data warehouse side was not performing as we expected, and Looker was not working as expected. So basically, for 3 Mondays in a row, our company users could not query the data in the first 4 hours.
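
To see why slower deliveries hurt so much when everything fires at the same time, here is a back-of-the-envelope sketch. All numbers and function names are hypothetical and chosen only for illustration; it assumes a fixed number of concurrent slots and equal-length jobs, which is much simpler than how Looker or Redshift actually schedule work.

```python
import math
from typing import List


def drain_time_minutes(num_jobs: int, slots: int, minutes_per_job: float) -> float:
    """Minutes until the last job finishes when every job is triggered at once
    but only `slots` of them can run concurrently (jobs run in waves)."""
    if num_jobs == 0:
        return 0.0
    waves = math.ceil(num_jobs / slots)
    return waves * minutes_per_job


def staggered_offsets(num_jobs: int, window_minutes: int) -> List[int]:
    """Start offsets (in minutes) that spread jobs evenly across a window
    instead of firing all of them at minute 0."""
    return [(i * window_minutes) // num_jobs for i in range(num_jobs)]


# Hypothetical numbers, for illustration only.
fast = drain_time_minutes(num_jobs=300, slots=15, minutes_per_job=1)   # quick delivery, e.g. email
slow = drain_time_minutes(num_jobs=300, slots=15, minutes_per_job=10)  # slower, API-bound delivery
print(fast, slow)
print(staggered_offsets(num_jobs=4, window_minutes=60))
```

Under these assumptions the same 300 schedules that drain in 20 minutes with fast deliveries occupy the slots for 200 minutes with slow ones, which is roughly the shape of the Monday-morning outage described above. Spreading start times across a window is one possible mitigation.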

Eldad: How did it affect your weekends?

Boaz: It fails on Monday, and a little bit on Tuesday and Wednesday. By Thursday you got it right, had a weekend, it was okay, and then Monday, all over again.

Erik: Yeah! This is exactly it. By lunch on Monday, we were like, okay, we killed the top queries, we did a couple of adjustments, it looks like it is working, let's see how it works next Monday. And, yeah, it didn't work then either, of course. And then it was easy to get the priority from product as well to speed up all the POCs for the next-generation data warehouse.

Boaz: This goes back to the blitz question – Do not let people self-serve because they will over-schedule stuff on their own and ruin your Mondays.

Erik: Well, I still say self-serve. Yeah.

Boaz: But with care, with care.

Eldad: Gradually.

Boaz: Another takeaway, you know, do the crazy stuff on Mondays and not on Fridays.

Eldad: Exactly, respect the weekend.

Boaz: Okay. But, you know, let us not be so negative. What about positive stories? Tell us about a great win.

Erik: So, since I have heard a couple of episodes before, I immediately started to think about it. Well, there are a lot of wins that should be mentioned. But one of the things that I thought about was when COVID first hit and we basically lost 80% of our company's revenue. We got a request from top management, like, “Hey guys, can you scale down a bit, maybe 60% or so, with all of your data infrastructure?” And what we actually did was we managed to reduce all of our infra costs by more than 50% within a couple of weeks, and it did not affect the end-users that much. Everything kept working as is, so it gave us a huge boost moving forward. We did quite interesting things there, and yeah, we learned a lot from that, and thanks to that we are now at a way lower cost level than we would have been otherwise.

Boaz: Can you share some of the details? How did you go about reducing costs?

Erik: I do not think there is only one thing to point out. Just on a high level, I think one of the quick switches was that we moved many of the tables that were not used that heavily from Redshift to Spectrum. So we could easily have a smaller Redshift cluster, and everything that needed to be fetched from S3 could still be used easily through Spectrum. But I think there were more than 20 items that helped us to get this low.
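
As a rough sketch of the first step in that kind of cleanup, the snippet below picks "cold" tables by how often they were queried. The table names, the 30-day window, and the threshold are all made up for illustration; the episode does not say how Bolt chose which tables to move.

```python
from typing import Dict, List


def pick_cold_tables(queries_last_30d: Dict[str, int], max_queries: int) -> List[str]:
    """Return tables queried at most `max_queries` times in the period.

    These are candidates to keep only in S3 and read through an external
    (Spectrum-style) table instead of storing them on the cluster.
    """
    return sorted(
        table for table, hits in queries_last_30d.items() if hits <= max_queries
    )


# Hypothetical usage counts, e.g. aggregated from query logs.
usage = {
    "orders": 12000,
    "driver_activity": 8000,
    "legacy_promo_codes": 3,
    "archived_events_2017": 0,
}
print(pick_cold_tables(usage, max_queries=10))
```

The hot tables stay in the warehouse for fast access, while the rarely used ones remain queryable from S3, which is what allows the cluster itself to shrink.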

Boaz: Wow! Impressive!

Boaz: Good win, good win!

Eldad: This is how you scale. This is time well spent.

Erik: Yeah!

Boaz: Okay. So I think we are almost reaching the end. Maybe before we end, we would love to get a few recommendations from you, maybe recent technologies that you have had a chance to play around with and that got you interested or excited, even if you have not adopted them fully. So any tips or things that you ran into recently that you want to share as advice with our listeners?

Erik: I do not want to point out any specific technology or tool, but during those POCs what we found was: be ready to change your perspective on some of the tools. Maybe you checked out something 2 years ago. Maybe you checked out something 6 months ago. Nowadays, everything moves so fast that six months can make a huge difference to a product. So, what I would highlight is, if you rejected something one or two years ago, and you still have this issue on your hands, take another look at those candidates which you rejected back then, because so many things are happening in this space, with huge, huge improvements all across the board. But, yeah, we needed to change our perspective on some tools.

Boaz: I do want to also close the loop on the topic we touched on prior, going back to that big Redshift cluster and how ugly it can get when so many people want to get access to the resource. How do you manage that? How are decisions being made in terms of prioritizing, who gets what?

Erik: You mean who gets what? like query prioritization or access to that?

Boaz: Query prioritization, because there are so many workloads running.

Erik: For us, since Looker is the main user and the majority of the business users use it, we have prioritized quite a bit toward the business user side. It is fine if some, I don't know, experimentation platform job runs 7 minutes instead of 5 minutes; this is fine. But for Looker, whether it runs a couple of seconds or 2 minutes is a huge, huge difference. So, we prioritize our business users first. And also, coming back to the previous point about changing perspective, we tried to solve a lot of this prioritization manually for quite a long time. But what we see is that a lot of tools have internal auto-prioritization and auto-scaling, all of those basically built into the product, and in the long run, I see that the auto settings might win. The same goes for, I don't know, Google or Facebook ads, from a different industry. You can fine-tune it yourself however you want, but in the long run, they will have more data on how to optimize all of this. So you can trust them in the long run. This is also what helped us. And also, you do not need to have the manpower for it. You do not need to, I don't know, monitor it so actively to see whether they are screwing up or not.
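
The "business users first" rule can be pictured as a simple priority queue. The sketch below is only a toy model: the workload names and their relative ranks are assumptions for illustration (the episode only says that Looker comes first and that experimentation jobs can wait), and it is not how Redshift or any other warehouse implements workload management internally.

```python
import heapq
from itertools import count
from typing import List, Tuple

# Lower number = served first. The ranking is a hypothetical example.
PRIORITY = {"looker": 0, "model_retraining": 1, "experimentation": 2}


class QueryQueue:
    """Serve queries by workload priority, first-in-first-out within a workload."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, str, str]] = []
        self._seq = count()  # tie-breaker that preserves submission order

    def submit(self, source: str, sql: str) -> None:
        heapq.heappush(self._heap, (PRIORITY[source], next(self._seq), source, sql))

    def next_query(self) -> Tuple[str, str]:
        _, _, source, sql = heapq.heappop(self._heap)
        return source, sql

    def __len__(self) -> int:
        return len(self._heap)


queue = QueryQueue()
queue.submit("experimentation", "SELECT 1")
queue.submit("looker", "SELECT 2")
queue.submit("model_retraining", "SELECT 3")
queue.submit("looker", "SELECT 4")
while len(queue):
    print(queue.next_query())
```

Hand-tuning a table like `PRIORITY` is the manual approach Erik mentions; his point is that built-in automatic prioritization may do better over time because the vendor sees far more workloads.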

Boaz: Also regarding Kafka. Sorry for suddenly reminding myself of so many questions when we are so close to finishing, but with Kafka, are any of the reports actually closer to real-time, or is everything more batch-oriented even though you are using Kafka? What is looked at in a more real-timeish fashion?

Erik: Together with the next-generation data warehouse that we are planning, there are actually two side projects as well, what we call the next-generation reporting system, for batch and live. Here, we are building out infra to serve not only internal use cases but also external use cases, so that engineers can, I do not know, show some numbers in the client apps or restaurant apps or courier apps, and also batch reporting, if you are a restaurant and you want to get your weekly results. So we are also working on that, to get it live in the first half of next year. We are not doing that much live reporting at the moment, but we have a dedicated team working on it. And also, maybe not necessarily Kafka live, but one of the next-generation data warehouse goals was to reduce the ingestion lag as well, to get it closer to real-time.

Boaz: Got it. Awesome! Thank you! Erik, this has been super, super interesting. Thank you so much and yeah, it has been great having you.

Erik: Thank you. It was great.

Eldad: Yes, it is.

Boaz: See you around the data world, and when we visit Estonia, we will make sure to stop by for a coffee.

Erik: Sure! You're welcome!

Boaz: Take care!

Eldad: Take Care!

Brought to you with love from Firebolt