Getting rid of raw data with Jens Larsson

Why would you create ugly data? According to Jens Larsson, don’t even go near raw data. Jens started off at Google, continued to manage data science at Spotify, caught the startup bug at Tink, and recently joined an exciting new company called Ark Kapital, together with Spotify’s former VP Analytics. Jens explains how he and his team killed the notion of raw data at Tink and walks us through the Google, Spotify and Ark Kapital data stacks.

Jens Larsson
Head of Analytics at Ark Kapital

Boaz: Hello Everybody! Welcome! Eldad, How are you?

Eldad: I am good.

Boaz: Welcome to another episode of the Data Engineering Show. Before we start, I do not know how much time will pass before this is published, so if you are listening to this, a few weeks have probably passed since we recorded it. But I did want to send out our support to our Ukrainian friends at Firebolt and to the whole of the Ukrainian people, given the situation out there. Now we can dive in.

Today with us is Jens Larsson, Head of Analytics at Ark Kapital. Jens has quite a background in analytics. He spent his initial years doing analytics at Google. Google was too big for him, so he moved to Spotify, which is a little bit smaller, and spent a few years there doing analytics, then moved to FinTech in Stockholm, Sweden, which is where he stayed until recently moving to a startup called Ark Kapital, doing exciting stuff with analytics. How are you, Jens?

Jens: Hey guys, I am doing really, really good.

Boaz: Did I miss anything about your story?

Jens: If by the story you mean where I have worked, then I do not think you missed anything. I have not moved around that much. But yeah, that is about my story. I am from an industrial town in western Sweden and studied engineering and business. I got my first job at Google in 2011 and moved to Dublin, which was exciting, then moved back to Sweden and did Spotify, just like you said. Then Tink, which was recently acquired by Visa, and now Ark Kapital. I think I have gone down more or less 10x in size every time I have changed companies. So the next company I join will have like 0.2 employees or something like that.

Boaz: And on top of that there is a pandemic and you are working from home. It seems like you are getting more secluded from society over time, so I started to worry: what's wrong?

Jens: No, that is very much true, though I have to say, since I joined Ark in November, I have actually spent most of my days in the office. It is just so nice to be around good fun people and I have truly missed that over the last two years.

Boaz: How did you get into analytics, to begin with?

Jens: I think it started at Google. We were working with customers doing sales, support, blogging, and stuff like that, essentially to help the AdWords business. I guess I just had a knack for analytics and somehow got transferred into a local analytics team, analyzing the performance of our sales teams, the performance of our ads customers, and so on, trying to optimize the way we sold ads at Google. It was, of course, super exciting, maybe not so much the problems we were solving, but the people I got to work with and that data stack. It did not occur to me at the time, since I was fairly junior and it was my first job, but it must have taken me six or seven years before I got to experience anything like what Google already had back then in 2011, 2012, anywhere else.

Boaz: Let us talk about that. How was Google as a school for becoming a data person? Tell us about that data stack a little bit more.

Jens: Yeah, absolutely. So we are talking about 2012, roughly, right? Google already had MapReduce and all of that ten years behind it or something. Basically, I started out with a bunch of these patterns that people are only talking about now, in terms of, let us do ELT, let us load all our data into a big data warehouse and do transformations in SQL, and I never realized how revolutionary that was, because that is the way I have always worked with data. The data stack at Google back then was all loaded into some distributed file systems and queried through a tool, I think it was called Tenzing, which was roughly the equivalent of Hive in the open-source world. While I was there, they were starting to roll out Dremel internally, which would eventually become BigQuery, I guess, and I just remember being blown away by the speed. Everyone has this story about how you would start a query across gigs or terabytes of data, then go get a cup of coffee and come back to see if it was done. We made that transition back in 2012 or so. It was fun transitioning; rewriting queries from the Tenzing dialect to BigQuery took a lot of time and was a lot of headache. I remember when Dremel came, it did not support joins. Then it eventually started supporting joins, but you could only have one join in each query. It was quite painful, but the reward was the speed you got in return.

Boaz: Interesting to think about. Back then, did data engineering even exist? What kind of positions were there around analytics that helped the end-to-end data flow?

Jens: I had never heard anyone call themselves a data engineer back then. We were part of the Google sales and marketing organization, and there were not many engineers on the payroll in that part of the company. To me, raw data magically appeared in a bucket somewhere in a file system. Then I would write all my queries in SQL, schedule them through the SQL UI, and power some internal dashboarding tool off the back of that as well, all end to end, essentially.

Boaz: On the other end, there is somebody sitting and saying, how ungrateful are these people? They get the data into a bucket.

Eldad: He had a mustache.

Jens: It just magically appears.

Eldad: Did he have a mustache?

Boaz: Probably.

Boaz: How many years did you spend at Google?

Jens: I think it turned out to be like three and a half or something, four maybe.

Boaz: And then you moved to Spotify. Tell us about that. What did you do there?

Jens: Yeah. So I moved to Spotify and, side note, the VP of Analytics at Spotify when I started is actually my current manager. So, if we skip ahead a little bit in the story, that is why I am at Ark Kapital today. But going back to 2014, I joined Spotify, into their still relatively small and unproven analytics team. We were responsible for everything from calculating the royalty payouts that went out to artists and labels, all the way to product analytics, really understanding how people were interacting with the application, to what I spent most of my time on: understanding how people were using the free product and eventually upgrading to the premium tier, seeing what we could do to get more people to upgrade and what different product offerings we needed in order to do that. Because this is back in 2014, Spotify had just gone live in the US, I believe, or had recently gone live in the US, and there was really only one product, the classic 9.99 Spotify Premium, but we fairly soon launched a student tier, a family plan, and so on. So there was a lot of work around whether those different products would cannibalize or complement our own subscriber base, essentially: will we eat our own lunch or will someone else do it for us?

Eldad: So, you switched the metadata to playlists, songs.

Jens: Yes, absolutely.

Eldad: Stopped listening after x seconds, listening time becomes like 2 minutes, 3 minutes, and when Queen is being played, maybe 10 minutes.

Jens: Yeah, we did a lot of those fun little experiments because we had all this metadata about when songs were being streamed. We could basically detect when a local football team had won a game, because "We Are the Champions" would spike in that region when that happened. There were a lot of these kinds of fun things you could do. On the other hand, we were quite limited by our capabilities. I think we used to at least tell ourselves that we had the world's largest Hadoop cluster at the time to process all this data. That does not mean it was very fast for the kinds of analysis we tried to do. That is why I said it was like going back in time: I had just helped our team at Google migrate off of Tenzing, and now I come back to a Hadoop cluster where people are still writing MapReduce jobs in Python.

Eldad: Progress.

Boaz: The years you spent at Spotify were years of crazy growth, right? How did the increase in scale feel like from your end?

Jens: Yeah, it was pretty intense. We must have gone from 600 or 700 to 5,000 or 6,000 employees over those four or five years. Scaling was pretty intense, particularly around the analytics team, and I think analytics and data engineering were some of the areas that really had to scale, and scale pretty fast. In that time span, we also moved from this massive Hadoop cluster into Google Cloud and the world of managed data services; basically every service we ran ended up on GCP eventually.

Boaz: Did you guys move completely to GCP, or did you have things running on AWS side by side?

Jens: I know there were a few things still running on AWS. It took a while to properly kill off the Hadoop cluster. But yeah, I think the end goal was that everything moved into GCP.

Boaz: So from 1 to 10, how much do you miss the Hadoop days?

Jens: I have this kind of romantic view of the MapReduce jobs we used to write. I loved just going really deep, optimizing the combiners and figuring out how to avoid an unnecessary shuffle step between jobs and stuff like that. I kind of miss that. I also do not miss it, because it was taking way too much of my time. But it gives you this opportunity to feel really smart about the work you are doing.
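For readers who never wrote Hadoop jobs, here is a minimal, self-contained sketch of what a combiner buys you: it pre-aggregates each mapper's output locally so far fewer key-value pairs have to be shuffled across the network to the reducers. This is an illustration of the concept only, not Spotify's actual jobs; the data, names, and single-process phase simulation are made up.

```python
# Simulate the map -> combine -> shuffle -> reduce phases of a word-count-style
# job in plain Python to show where a combiner cuts down shuffle volume.
from collections import defaultdict

def mapper(line):
    # Emit (track_id, 1) for every play event in a log line (hypothetical format).
    track_id = line.split(",")[0]
    yield track_id, 1

def combiner(pairs):
    # In real Hadoop this runs per mapper, before anything crosses the network;
    # here we simulate it once over all mapped pairs. It collapses (key, 1)
    # pairs into partial sums.
    partial = defaultdict(int)
    for key, value in pairs:
        partial[key] += value
    return partial.items()

def reducer(key, values):
    # Receives the already-reduced partial sums from all mappers.
    return key, sum(values)

if __name__ == "__main__":
    log_lines = ["trackA,user1", "trackA,user2", "trackB,user1"]
    mapped = [pair for line in log_lines for pair in mapper(line)]
    combined = list(combiner(mapped))          # 3 pairs -> 2 pairs shuffled
    shuffled = defaultdict(list)
    for key, value in combined:
        shuffled[key].append(value)
    print([reducer(k, v) for k, v in shuffled.items()])  # [('trackA', 2), ('trackB', 1)]
```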

Boaz: Yeah and now you look at the younglings of today who do not have to worry about these things.

Jens: Exactly. It is all just drag and drop and clicking boxes.

Boaz: Okay, you spent all that time at Google and Spotify learning how to work with some of the world's most complex systems and data sets, and then you went on to the startup world, right? Moved to FinTech.

Jens: Moved to FinTech, to Tink. This was back in 2019 or early 2020. They had just done a pivot from being this business-to-consumer personal finance management application. They had kind of been a driving force in creating the whole world of open banking, forcing banks to open up APIs. Long story short, Tink started fetching data from banks' closed APIs before there was any mandate that banks had to open up, and the banks deemed it was probably illegal. Then the courts realized, no, it is not illegal, it is perfectly legal, and we are actually going to create mandates that force banks to open up APIs. Tink was really early in that journey, but was building mostly its own applications. When I joined, the company had completely pivoted into becoming this platform as a service, or API as a service, that unbundled the app's functionality and features and sold them off to other FinTechs and other banks. So we did things like connecting to bank accounts and fetching all the data so that you can do various analyses, risk analyses, or actually build personal finance management. We were categorizing transactions using various AI and machine learning models to figure out what all these line items on your transaction sheet actually are, which is a surprisingly hard problem because the banks do not include much data in those transaction lines. It is usually just mumbo-jumbo when you try to read it; sometimes you can spot something like "MCD" and figure out it is probably a McDonald's transaction, but that is basically all you get.
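To make the descriptor problem concrete, here is a toy sketch of the kind of string matching involved in turning terse bank descriptors into merchants and categories. The rules and merchant names are made up purely for illustration; the ML models Jens describes go far beyond this.

```python
# Naive keyword lookup over transaction descriptors. Hypothetical hints only.
MERCHANT_HINTS = {
    "MCD": ("McDonald's", "Restaurants"),
    "SPOTIFY": ("Spotify", "Subscriptions"),
    "SL ": ("Storstockholms Lokaltrafik", "Transport"),
}

def categorize(descriptor: str) -> tuple[str, str]:
    """Return (merchant, category) for a raw descriptor string."""
    text = descriptor.upper()
    for hint, (merchant, category) in MERCHANT_HINTS.items():
        if hint in text:
            return merchant, category
    return ("Unknown", "Uncategorized")

print(categorize("MCD*4417 STOCKHOLM"))  # ("McDonald's", "Restaurants")
print(categorize("K*293847 PAYMENT"))    # ("Unknown", "Uncategorized"), the common case
```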

Boaz: And so the analytics stack there was part of the product; the services that the clients were consuming went through your stack?

Jens: Yes. The entire backend of Tink is a data platform in a way, and it is completely homegrown. It uses things like Kafka and S3 and Cassandra and various databases, but it is all focused on productionizing data access for customers. What I was in charge of building was really this data warehouse where we could learn for ourselves how our products are actually being used: instrumenting everything from event collection, so that we know what our systems are doing and how our customers are using them, to some batch processing to get the data back out of Cassandra, because Cassandra is not a very nice place to run massive analytical queries.

Boaz: Can you walk us through the different teams in charge of the different data deliverables? You ran the analytics department, and there is data engineering and engineering around it. How did all of you interact in terms of responsibilities around the data platform and stack?

Jens: Yeah, so when I was at Tink, I was heading up both the data engineering side and the product analytics side. Most of the analytics we did was product-focused, and most of the data engineering we did was focused on getting metric data out of the platform, data that allowed us to do analysis, create KPIs and metrics, and so on. We were, of course, leveraging a lot of the infrastructure that other teams were building at the company. The data engineers were heavily dependent on our infrastructure team, which managed not one but, I think, 11 Kubernetes clusters across AWS, on-prem, and so on, to create the environments that we provided to our customers. It was an extremely complex setup in a complex environment, but the data engineering team had to figure out ways to standardize how we source data from all these different systems and how we put it into a unified data warehouse model.

Boaz: And what did the data stack look like? What data warehouse did you guys use and what was around it?

Jens: Yes, for much of the data we were kind of bound to AWS tooling. So we were using S3 and Athena, and we tried a little bit of Redshift and so on. But for the data that we had anonymized, taken out any sensitive bits of information, we moved a lot of that over to Google Cloud and used it to power things like interactive dashboards and metric collection. We even used it to power the developer console, feeding some of these metrics back to the developers who were building stuff on the platform.

Boaz: In retrospect, you spent a fair amount of time there building that. What would you have done differently if you were to restart that entire journey? What lessons were learned? What could have been avoided?

Jens: What could have been avoided? So much pain could have been avoided. I think what took the most time was figuring out which data we were allowed to work with. Because you have to realize, Tink is a data sub-processor under the GDPR. It does not really own the data that flows through the platform; it is processed there on behalf of someone else, and at the end of the day on behalf of the end-user whose actual financial data it is. So we needed to make sure that the data we look at and analyze is only metadata about whether or not someone has aggregated their data, not the actual aggregated data. I think if I were to start this over again, we should have created a much, much clearer vocabulary around this really early on in the process, because we could have avoided so much back and forth discussing security and anonymization and legal. If we had just said, this data is metadata about how Tink services are being used, we would not even have needed to bring it into that discussion, and we could have limited the scope drastically. We could probably also have been more proactive in how we set up contracts with customers to allow that. But yeah, at the end of the day, I think we would have made better progress if we had been more upfront about figuring out the different classes of data and the different use cases we have for that data, and realized that we do not have to enforce the strictest restrictions on all of it. Because what do you do when there is uncertainty and unclarity like this? You apply the strictest rules across all the data, and that limits what you can do with it. It creates a lot of headaches for people trying to do things that, by design, they cannot do, and for good reason. Then you try to approximate; not necessarily sidestep the limits and barriers, but you create proxies and try to estimate a metric that you could probably just have queried directly if there had been a clearer delineation between which data is sensitive and which data is not.

Boaz: Yeah. I guess, as you noted, it is not the kind of thing you thought would be critical to your work when you first started off at Google, that one day you would regret not having planned enough, not having thought about what data you are allowed to keep and what not. But it is true, at the end of the day these things make the difference between a project with a lot of headache and one with less.

Jens: Yeah, absolutely. Then again, there were a lot of good things we did there. One thing we did was more or less kill off the notion of raw data, with the idea that there is no such thing as raw data. We are not pumping raw, crude oil out of the ground that we then have to refine. Data is something that we control and that we create. We could have created crude oil and then built a process to refine it, but instead we decided to create nice tabular data with strictly enforced schemas and contracts from the get-go. A big chunk of our data engineering work we just never had to do, because the data streaming in from all these other services was already pretty clean when it arrived.

Boaz: That is an interesting philosophy that I have never heard anybody articulate that well: saying that data, essentially, is not born out of a vacuum.

Jens: It is not. It is our systems, our source code, that create the data. Why would we create ugly data? I think one of the reasons you end up creating ugly data is that you created it for a different purpose, like you created this data because it is a log record that you want to feed into Logstash for debugging. We had this choice when we set up the system: either we try to build something brand new, or we go with what we already have and try to retrofit a data platform on top of the Logstash data that was available to us. And thanks to our lead data engineer, who had also spent many years at Spotify and other companies in the past, he advised heavily against trying to retrofit and repurpose that log data, and said, let us instead build a service that we speak to through protobuf schemas and contractually sound data. Then we can stay in control of that data throughout the whole chain.
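To illustrate the schema-first idea: below is a minimal sketch of an event that is validated against a declared contract the moment it is created, rather than cleaned up in a downstream job. Tink did this with protobuf contracts between services; this sketch uses a plain Python dataclass to show the same principle, and the event and field names are hypothetical.

```python
# A schema-enforced event: typed, required fields, validated at creation time.
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
import json

@dataclass(frozen=True)
class AccountRefreshed:
    """Every field is typed and required; no free-form log lines."""
    event_id: str
    customer_id: str
    provider: str
    occurred_at: datetime
    success: bool

    def __post_init__(self) -> None:
        # Enforce the contract where the data is produced, not in a cleanup job.
        if not self.event_id or not self.customer_id:
            raise ValueError("event_id and customer_id are required")

    def to_json(self) -> str:
        record = asdict(self)
        record["occurred_at"] = self.occurred_at.isoformat()
        return json.dumps(record)

# Producing a clean, tabular-friendly record from the get-go:
event = AccountRefreshed(
    event_id="evt-123",
    customer_id="cust-42",
    provider="some-bank",
    occurred_at=datetime.now(timezone.utc),
    success=True,
)
print(event.to_json())
```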

Boaz: So the message may be: let us not be victims of our raw data, let us take control. You can actually change it.

Jens: Yeah. Do not even go near the raw data. If you have the opportunity to say no, let us build a parallel system that creates nice data. Do that instead, I think.

Boaz: It is time to question your raw data. So, tell us about Ark Kapital. You decided to join your former boss from Spotify, like you said. What do you guys do at Ark Kapital? What is the mission there?

Jens: Yeah. So it is really two-fold, and I am obviously most excited about the data part of this, so I might not give the most flamboyant or elaborate description of the other part. But I will start with the "boring" part, which is that we lend money to tech companies and other modern companies to finance their growth. We do this as an alternative to bringing in, say, venture capital. The idea is that your current options as a company looking for growth capital are either you have assets or some other security you can use to get a loan, or you go to venture capital and give away part of your company in exchange for the funding you need to grow, and both of those have a place. If you own property or buildings or whatever, go to the bank and get a loan on those. If your business is not proven, you may not have found product-market fit and so on, go to venture capital; they are perfect for handling that risk, that is exactly what they are experts at. But then we see all these other companies that might be too new or not established enough to really go to the bank and get money, but that might also have very predictable growth. They know that for every $10 they spend on marketing, they get $25 back in sales. If you are in that position, you do not necessarily want to give away your company to a venture capitalist either. That is kind of where we come in. The reason they do not go to the bank is that the bank does not really have access to the information. The bank can look at your annual reports from a couple of years ago, and if you are a fast-growing company, that is not going to work out. It is really hard to model this in Excel in a way that allows you to really understand what is going on. That is where our data platform comes in. Most of these modern companies are building their organizations on SaaS platforms. Modern e-commerce companies are built on Shopify or Instacart or other such services. They do their advertising on Facebook with Facebook Ads or Google AdWords. They do their bookkeeping and CRM and so on there too. We connect to all of these systems, using Fivetran, Airbyte, and other services, to go straight to the source systems and get the source data. Then we apply models. We have years and years of experience in business analysis, some of us coming from a VC firm before joining Ark, and the rest of us are quite seasoned analysts. We can analyze business performance using all this data from all these different sources, and we apply machine learning and forecasting to really get an idea of where a company is heading. Based on that, we can tailor the financing option to that company. We call it precision financing: we learn so much about the company from these sources that we can tailor financing solutions to them. But in order for us to digest all this data and all this information, we are also building really cool dashboards for ourselves that we then provide back to our customers. So we are really building a turnkey solution where you connect your data and we give you a best-in-class business dashboard with KPIs and forecasts and LTV models and all that nice stuff that the big players already have. I think of it as a bit of an analogy to how you buy data tools these days.
You buy a data warehouse, you spin it up, and it is empty; or you buy Looker and you get this dashboarding solution, open it for the first time, and are greeted with a blank, empty page, which I think is quite a boring way to buy data products. Ideally, more and more products in the future will be turnkey: you buy the product, you authenticate to get your data, and then you open it and there is an actual dashboard there already.

Boaz: Yeah, beautiful. So in essence you are entering Ark Kapital with a blank slate. Please share: how do you go about implementing the analytics stack, with all your experience but now with complete freedom of choice?

Jens: Yeah, with complete freedom of choice. Many of us in this company have a lot of experience with GCP, so we have decided we are going with GCP. Then we have all this low-hanging fruit: there are all these companies that help make our lives easier. One of them is dbt. I have built tooling that works pretty much like dbt several times before, but it is quite nice to just install it and get it off the shelf. We have Fivetran and Airbyte, which allow us to connect to all these different APIs. So we are basically collecting the different software that helps us do our job. But we are also taking quite a lot of time to figure out exactly what that platform is going to look like. There are so many things that we still have not decided on. For instance, orchestration is one of our big headaches. I have heard it on this podcast too, people talking about how much time they spend just managing their Airflow instance and trying to upgrade it, and so on. We are trying to figure out what the modern way of orchestrating all these different data pipelines is, because our complexity does not really lie in the volumes of data; it is the diversity of data. We are fetching it from hundreds of platforms for hundreds, or potentially thousands, of customers, and, as a colleague who had worked on similar problems at Spotify told me yesterday, we cannot really take a representative of Google Analytics and put them in a room with one from Mixpanel and tell them to start aligning their data models: "Hey guys, can you please start defining MAU the same way?"

Eldad: Let us join everything together.

Jens: Let us join everything together. Can't you guys just agree on what a daily active user is, so I do not have to show diverging definitions of the same metric? There is a lot of complexity that stems simply from the fact that all the data coming from these various sources is different, and yeah, that complexity is hard to grasp, actually.
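As a rough illustration of the alignment problem, here is a minimal sketch of mapping per-source activity metrics onto one definition before comparing them. In practice this would likely live in a dbt/SQL model in the warehouse; the source field names and sample values below are hypothetical.

```python
# Normalize "active users" from two differently shaped sources into one schema.
from datetime import date

# Per-source rows as they might arrive from Fivetran/Airbyte syncs (made up).
google_analytics = [{"date": date(2022, 3, 1), "activeUsers": 120}]
mixpanel = [{"day": "2022-03-01", "dau": 95}]

def unify_daily_active_users(ga_rows, mixpanel_rows):
    """Map both sources onto (metric_date, source, daily_active_users)."""
    unified = []
    for row in ga_rows:
        unified.append({"metric_date": row["date"], "source": "google_analytics",
                        "daily_active_users": row["activeUsers"]})
    for row in mixpanel_rows:
        unified.append({"metric_date": date.fromisoformat(row["day"]), "source": "mixpanel",
                        "daily_active_users": row["dau"]})
    return unified

for record in unify_daily_active_users(google_analytics, mixpanel):
    print(record)  # one row per source and day, same column names and units
```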

Boaz: How do you imagine things looking a year from now? What do you want to look back on with pride having built?

Jens: A year from now, I really, really want as much as possible to be automated. We have this idea that a customer who comes to us and opens their dashboard for the first time and sees their metrics is kind of blown away by how rich it is and by the insights they are getting. We are currently achieving that, but with quite a bit of manual labor in between, and creating good visualizations of data, or standardizing it, is a craft that is quite hard to automate and do at scale. So I am really hoping that a year from now, someone just connects their data to our platform and is more or less immediately blown away by the insight they are able to get, and hopefully recognizes the numbers they see in our platform from what they have in their own spreadsheets and so on.

Boaz: Awesome! This is great Jens! I really appreciate it.

Eldad: Yes.

Jens: Thank you, guys!

Boaz: It was great talking to you, a super interesting, amazing journey, amazing challenges in the past, and amazing challenges you are working on right now. Thanks again!

Jens: Yeah, thank you too.

Eldad: It was great to connect. Thanks.

Boaz: All the best Jens.

Brought to you with love from Firebolt