The Data Engineering Show | Transcript: How Klarna Designed a New Data Platform in the Cloud

June 9, 2022 • 41 Minutes

How Klarna Designed a New Data Platform in the Cloud

Klarna is one of the leading fintech companies in the world, valued at $45B. While many corporations are “stuck” on-prem, Klarna made the move and today is a cloud-only company. Gunnar Tangring, Klarna’s Lead Data Engineer, tells Boaz what this new modernized stack looks like.

Gunnar Tangring - Klarna

Boaz: Welcome everybody to another episode of the Data Engineering Show. Welcome, Welcome! With me today is Gunnar Tangring from Klarna in Sweden. Hi, Gunnar! How are you?

Gunnar: Hi, I am good. How are you?

Boaz: I am very good. Thanks for joining us. For all of you out there who do not know Klarna, it is a super, super interesting company. Klarna is a FinTech from Sweden, one of the most valuable privately held FinTech companies in the world. Last year, the last round of investment came from SoftBank and valued the company at $45 billion, unbelievable. Klarna has been around since 2005 and has had an amazing journey. It started out just making online payments smoother, with a team that kept evolving and pushing throughout the years. They are rather famous in recent years for this kind of, I do not know if you have seen this, shop-now-pay-later experience. Throughout the years, Klarna has also evolved into being licensed as an actual bank and having its own credit card, which is relatively recent, with already more than a million consumers using it, I think. So, a really exciting and interesting FinTech company. Gunnar, is there anything I missed? You have been there for six years.

Gunnar: Yes.

Boaz: Gunnar is a lead data engineer, architect and more. Tell us a little bit about it, as somebody who has been there for the last six years, and then we will dive into what you do there and beyond.

Gunnar: It is an incredibly good description, so thank you for that. From my personal perspective, more from the data angle, I remember that I walked into the building thinking it was kind of a growing startup, and it turned out to be a slightly larger company than I thought. But today it is something very different, and that is incredibly noticeable on the data side. I remember when I joined, there was this BI team, which consisted of a handful of people who handled all the different data requests across the company. The scale we are at now means that we have an entire domain, as we call it, of around 60 people doing tasks similar to what just five people used to do. So it is a completely different ball game, of course, with the type of volumes, and there has been a lot of growth and pivoting into new areas all the time. So, it has always been good fun.

Boaz: Amazing! Tell us about your roles throughout the years?

Gunnar: I started with reporting BI work, traditional work in Cognos at the time. The majority was merchant reporting, but really anything people would want to have covered, like "I want to know this, I want to know that," and that was not exactly what I wanted to do. So, I worked a couple of years on our Hadoop infrastructure, basically building fairly large-scale data flows at the time in various forms, a lot of Hive SQL and building frameworks around it. We built a tool called HiveRunner, as an example, to do unit testing of the data, which is kind of cool. And then, we moved to the cloud. Then you know what you are doing for a couple of days.

Boaz: Cognos is not around anymore.

Gunnar: No, it is not. But the journey of "Hey, let us move everything to the cloud," I think, is when I understood what Klarna was all about. Because it was not some kind of "Let us do this. It will take time." It was more like, "Hey, let us do this now, fast, and get it all done." We did that. We pulled it off within reasonable timelines. So, that was nice. We shifted our entire infrastructure to cloud only, and data is obviously one of the parts where that becomes tricky, because you have to migrate the data itself, but you also need to migrate your tooling. We did not go for a Hadoop-on-cloud type of setup, so we shifted quite a lot.

Boaz: Today, Klarna is cloud only when it comes to the data stack.

Gunnar: Yes.

Boaz: When was the final switch-off for on-prem? We know that for companies throughout the years, the length of time it takes to complete is somewhere between years and indefinitely. So, just reaching the state where it is completely off is an achievement on its own sometimes. So, how long ago was that?

Gunnar: I do not remember 100%, but I think it was probably like three years ago or something like that. But the nice thing was that we were quite happy with our work because we were not the last ones out. So, that meant that someone else was still running something on some server somewhere. But when we managed to pull the plug there was obviously an announcement video where someone went into the computer hall, unplugging the final computer, and then we entered a new era.

Boaz: That is fine. So many people would not be able to ever experience that. People nowadays are born in the cloud. So, you deserve a medal for being there when on-prem was unplugged for Klarna.

Gunnar: We are a fairly modern company, but I still think kids these days do not understand why you have things under your desk, "oh, it is to keep your computer." The computer was the server, which was giving people what they needed. It is a different era, but it has been kind of a fast transition in the industry. The debate of cloud or not is fairly done to me.

Boaz: This debate is definitely over, I think, even though there are a lot of on-prem activities still around, more than we typically think. Every time a survey comes out, it turns out there is so much workload still happening on-prem. Some portions of the industry are much slower to move than we, on the more advanced side of data and tech, tend to realize. Because we live in this modern data stack world and sometimes we forget that we are just a segment and that there is so much workload out there on-prem, but yeah, it is all coming to the cloud for sure. Klarna is over 6,000 people worldwide according to what I see on LinkedIn, more or less, but how many people are spread across the different data teams? How many people deal with data, and how is the variety of teams structured at Klarna, if you could walk us through that?

Gunnar: I think there are two different stories to tell. One would be that I work in the data platform domain at Klarna, where I am the domain architect. Our structure internally is basically that roughly half of our domain of around 60 people is infrastructure, people building the platforms. So, providing tooling for other people to do data processing, in a sense building frameworks, making sure the databases are running, making sure that you have the correct setup for access to data on block storage, and building the data catalog as well. The other half is doing what we call core models, basically being end-to-end BI teams. So, they work with more central data that would not be possible to own within a specific domain of the company. But that is the next step. A lot of our data teams are either something like people working in the finance department with data, or a team that is explicitly working on data. And we have teams doing big data processing in various phases, everything from risk positioning to fraud detection to anomaly detection for problems that can occur with our merchants. So, I see more and more of this, "Hey, we need to spin up a data team for it," for something somewhere in the organization, and less and less, "Hey, can you please do this centrally at the company because we need this to be done by you," and that is very much in line with how we are trying to build the company. We want to have the domain knowledge close to what is going on, and I think for data in particular that is extremely important, because I see a lot of cases where someone is being asked to do something but they do not really know why, and they are deemed to be the data people who know exactly how to handle the data. But how to handle data will be very much a product of what the data is: why is this data generated this way? What does this field mean? All of these things are impossible to know if you are working in some kind of decoupled function, central at the company. You might be extremely good at sparse indexing or really good at performance tuning, but it does not really help if you do not know what you are doing.

Boaz: You guys are looking into sort of a data mesh implementation?

Gunnar: I would say so. We have looked a bit at data mesh, but the thing for us with data mesh is actually that it was not something that came from the sky, like, "Hey, let us do data mesh." We looked at what we were doing. And then we looked at data mesh and we realized that this is very close to what we are doing and we can learn some things. We can tie some things into the terminology. But for me, the key takeaway with data mesh is the ownership aspect. I want to have strong ownership.

Boaz: You are saying it is not about picking up a data mesh guide and implementing it by the book, if the book even exists. As you know, data mesh in itself, at the end of the day, speaks to something that the industry has always gone back and forth on: centralization versus decentralization, in essence. And you are saying you are leaning now towards decentralization and domain ownership, and that at the stage where Klarna is, this makes more sense for you guys?

Gunnar: Yes, but I think we have always leaned a bit towards that angle; I just do not think we have had the vocabulary for it. But, as you are saying, there is a data mesh book now, and I have not read all of it, but I will read it at some point and I think it is good inspiration. But I also think it is kind of unclear exactly how you would implement it. Then I see a lot of flame wars about "is this actual data mesh?" And then I just realized that that is not what I want to discuss. I want to discuss what drives our company forward in a good way. I think ownership and how to set the boundaries, those types of discussions are needed. But I think data mesh in itself does not really give the answers. It just gives you a framework for how to talk about it, in a sense.

Boaz: Yes, I completely agree. It is a framework, and it looks different in every company because it depends on people as well as practices and the details of the organization. To some extent, it feels like being agile. How do you become agile? There was this decade where everybody was talking about becoming agile in software. There is no one way of becoming agile, and the way to implement it changes from company to company. So that absolutely makes sense. But what about data engineering, though? There is BI, there is data engineering, how are they split? And is data engineering not more centralized compared to the BI teams, or is that also spread across the different domains?

Gunnar: I think there are two answers. One thing is that the terminology matters. We have been discussing whether we should introduce the title analytics engineer or similar. Because the people we have, who are working with what a lot of people would call BI, would potentially be labeled as data engineers. And that could be a wide scale: you have everything from someone building automation of pipelines to someone implementing the use cases. So, I think there is a space where titles could make a difference. But I also think the type of data warehousing work, or whatever you would label it, that we are doing might also be different, in the sense that you quickly run into the scale of things. So, maybe you are not really writing SQL all day, but still, if you do not know how to performance-tune SQL, then you will have problems. I think it is a title question in a sense, but I agree that the things we are doing that I would label as data engineering and not analytics engineering are more centralized. If I take the clear example of building, implementing and adopting frameworks for making sure that we can build analytical pipelines, that is something that we consider to be something we should centralize and offer as a standardized component. If you talk about data mesh again, I think this is building the sidecars that you need to run your analytics products in a company. For us, it makes sense to centralize. It makes sense to standardize, because you get so many things out of the box. The way we work is: if you want to build your own thing, go ahead; but if you want to integrate with 10 different available tools, if you want to be compliant, then it is probably a whole lot easier for you to use the framework that you already have. But it is a give and take; if you have a super-specific use case, then you can always build what you need for that, basically.

Boaz: So you are saying that for everything above data pipelines, to some extent you prefer an analytics engineer, somebody that can do full stack data essentially, a mix of data engineering and analytics. And I think in general, that is a trend we see in the market. The term analytics engineering is half picked up, and people are starting to like it, but actually, I would like to call it full stack data developer or something like that. It is a kind of combination, meshing BI and data engineering together, because you cannot do much today if you are not able to go further down the stack, roll up your sleeves and do some kind of data engineering to some extent. And I think more and more data professionals are finding that out, and that mixture, bridging the tools, is picking up. Interesting to hear that you are going about that at Klarna. But do you guys already use the title analytics engineer, or not yet?

Gunnar: No, we are discussing it. But to comment on the full stack: I do not think it is unique to data as a field, but in general I think it is more and more important nowadays to be a bit T-shaped, actually having one technology or something that you are good at and then having something else to combine it with. That might be that you are great at building analytics pipelines and you are good at domain knowledge of finance or something, or it might be that you have DevOps capabilities to help with other things, and I agree that a wider stack is a strength that you need to showcase. I think it is incredibly hard for anyone to cover the entire scale. So, from that perspective, I am sometimes thinking, "Hey, it does not matter what title you have, because I will still have to probe and understand what you are doing." That is something I particularly realized after being an architect for a while, because when you talk to other architects, it is just such a scale of everything from "I do not know any details" to "I know all the details," to hydro houses, like literally of course, but I have that confusion with some neighbors that they thought I was growing abscess, but I just asked, do you have a database? So, then maybe I can help. But I do think about what my core things are. What would be my two or three things to pick up on, some kind of strengths? That is how I view profiles in general, and I think I am more inclined to want to hire someone who is really good at two different things, as opposed to someone who is more of a jack-of-all-trades type of person. That is where I might push back on full stack, but I do hear what you are saying.

Boaz: Got it. Walk us through the current data stack. Now that everything is in the cloud, what does the data stack at Klarna look like?

Gunnar: We have what we call a data lakehouse. We thought we had coined the term, and we printed hats and things, but then it turned out that someone else had used the same term before us. We were happily unaware and just thought we had invented something new. In essence, we are running on the AWS analytics stack. So, we are quite heavy users of...

Boaz: By the way, you guys should make some noise about it. Go online and shout from the roofs. Hey, we coined the term data lakehouse.

Gunnar: I have told Databricks but then I Googled it and I actually found some reference where I think they were using the term before us, but...

Boaz: Never mind the facts. Let us rewrite history.

Gunnar: But I mean, in that sense, we did coin the term. It is not something super complicated. If you think of it from more of an API perspective, it is a platform where you can publish data as a producer and consume it as a consumer, and you do not necessarily have to worry about the exact location of the data or exactly how things are working within this box. It is basically a combination of S3, Redshift, and EMR clusters with Spark jobs running, and we manage that centrally, with configuration possibilities for the users, and we are iterating on it every day, of course.
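To make the "publish as a producer, consume as a consumer" idea concrete, here is a minimal sketch in Python. It is purely illustrative and not Klarna's actual API: the `Lakehouse` class, its `publish` and `consume` methods, the dataset name, and the S3 path are all hypothetical. The point it shows is that a consumer asks for data by a logical name and never sees where the data physically lives.

```python
from dataclasses import dataclass, field


@dataclass
class Lakehouse:
    """Toy publish/consume facade (hypothetical, not Klarna's platform)."""

    # Maps a logical dataset name to its physical location and rows.
    _datasets: dict = field(default_factory=dict)

    def publish(self, name: str, location: str, rows: list) -> None:
        # The producer registers data under a logical name.
        self._datasets[name] = {"location": location, "rows": list(rows)}

    def consume(self, name: str) -> list:
        # The consumer asks by logical name only; the location stays internal.
        if name not in self._datasets:
            raise KeyError(f"dataset not published: {name}")
        return list(self._datasets[name]["rows"])


lakehouse = Lakehouse()
# Hypothetical dataset name and bucket, for illustration only.
lakehouse.publish("payments.orders", "s3://example-bucket/orders/", [{"order_id": 1}])
orders = lakehouse.consume("payments.orders")
```

In a real platform the lookup would resolve to S3 paths or Redshift tables rather than in-memory rows; the sketch only captures the separation between the logical name and the physical location.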

Boaz: In your transition to the cloud, Redshift was selected as sort of the enterprise data warehouse, right?

Gunnar: Exactly. If you look at where we were coming from, it was a fairly interesting scenario where we had a mix of PostgreSQL and Hadoop, and the actual data warehouse was implemented in Microsoft SQL. And when I say PostgreSQL, that was not just one machine somewhere, and it was not like 10 machines having the same structure either; it was just a wild mix of a lot of different databases all over the place. So, we went for Redshift as the main data warehouse engine, with capabilities of also processing less refined data. That is what the lakehouse term really refers to: being able to access data through one logical environment, where you do not have to go to a different place because you need data from a different domain. You can go to the lakehouse. You have the lake and the house, both order and disorder in the same place.

Boaz: There is also, like, Athena used around it?

Gunnar: We are not using Athena heavily. It is one of the things we are looking at how to leverage more, potentially, but it is expanding in usage, I would say. We did use it extensively during the implementation; it was like the go-to tool, because it is extremely powerful for getting interactive results on fairly large data volumes. I was surprised when using Athena, because I expected things to be slower than I was used to, but a lot of things were quite snappy. But then you run into some limitations, of course, as with any tool. It is built quite differently from Hive. Hive is very much the opposite, where you are focused on resilience and jobs just running until they finish, whereas Athena is more "this did not work, I am not going to tell you why, and you do not get a response." So, it is a different experience, but it worked well when we were in the implementation phase and needed to quickly look at the data and draw some conclusions. It has just been very helpful. It is also very good for doing sanity checks: if you have a data gap for whatever reason, it is easier to just spot that.
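A sanity check for data gaps like the one mentioned above can be sketched in a few lines of Python. This is an illustrative assumption of what such a check might look like, not the query Klarna runs: the function name `missing_days` and the sample dates are made up, and in practice the set of loaded days would come from an interactive query (for example, distinct partition dates) rather than a hard-coded set.

```python
from datetime import date, timedelta


def missing_days(loaded: set, start: date, end: date) -> list:
    """Return every day in [start, end] that has no loaded partition."""
    days = (end - start).days + 1
    expected = [start + timedelta(days=i) for i in range(days)]
    return [d for d in expected if d not in loaded]


# Hypothetical partition dates; June 3 was never loaded.
loaded_partitions = {date(2022, 6, 1), date(2022, 6, 2), date(2022, 6, 4)}
gaps = missing_days(loaded_partitions, date(2022, 6, 1), date(2022, 6, 4))
```

Here `gaps` contains only the one missing day, which is the kind of quick answer an interactive engine is well suited to give.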

Boaz: What else is in the stack, besides Redshift?

Gunnar: We use Airflow for orchestration, and we have built our own framework around it. We have a team that is focusing on building frameworks. So, instead of exposing Airflow to end-users, we are leveraging our own CI/CD setup for it. You are kind of forced to come into our setup, where we guarantee that you have version control for transformations and you have some support beyond what you can get from Airflow, and we do not get very skilled people messing up too much, because you have to go through our Jenkins to get things running, basically. One other thing that I should have mentioned is our data catalog, which is a thing we built recently, based on DataHub from LinkedIn. This is one of the things we realized was a big gap when we rolled out the first iteration: we did not have a good way to let the user discover all the data in a sensible manner. So, we decided to adopt a solution that would be flexible enough for our needs. We rely on being able to ingest metadata from other services by pushing the metadata to DataHub. So, that was a conscious decision to go for a flexible open-source project that seems to have some traction.

Boaz: Got it. What data volumes are you guys dealing with?

Gunnar: Well, it is petabytes at least. It varies depending on the domain. Our growth of data is quite big, and a lot of it comes from things that have gone live later. When you talk about how much data we have and how much it is growing, we do have hockey stick curves for a lot of it, and that is a challenge, of course, but one we always have to deal with. My experience has been that when you go live with something new, you tend to be in a state where you are producing a bit more data than you need, because you are going for a fairly naive setup. So ironically, even though the volumes are growing, you tend to be able to manage them better over time and it becomes more predictable, because you can determine if this is useful or not. But the overall influx, I do not have the number.

Boaz: How much of the data ends up in Redshift? All of it, or do you do it year to year?

Gunnar: No. Not all of it. We obviously want to keep it sensible. But this is always a friction point, because it is tricky to work with a situation where you need to send people to different places depending on what data they need. I would say we have more than we would wish for.

Boaz: What do you guys do for ETL or ELT batch processing? Spark or other things?

Gunnar: We run Spark and some Hive as well, and we run a combination of Glue and EMR. The main use case for Glue is really that we get access to a serverless Spark runtime, and it is quite convenient because that means we have an API that is well known. If we would want to run things on a different computer that would be next to impossible. SQL is a big thing in terms of standard languages, but Spark is gaining some traction and it is interesting to see that. I would predict that if Spark is replaced with something, the replacement would try to keep compatibility with the API to help people migrate if possible, because it is becoming the standard for heavy lifting.

Boaz: Tell us about your day-to-day. So, what are you working on now? What does your job look like?

Gunnar: My main focus is iterating on our architecture. So basically, on the more technical side, reworking the topology and how we size the different components in the data stack is my main thing. Obviously, I am doing various other things, but that is the big thing.

Boaz: Where do you want to be two years from now? Where do you want to see Klarna in terms of data capabilities?

Gunnar: My dream would be to have zero concurrency concerns, basically. I would want to have a situation where everyone who just wants to do some data processing would be able to do it, and just pay for what they need, and not have to worry at all about where the data is. I think that is the main point: even though we have a fairly consistent environment, we still have the situation where you might be sent to a specific place because of having specific data needs. I would want to break that down entirely and just let people choose the capacity they want and ideally the tools they want. But I think that is a utopia, and I do not think you would be able to offer everything, but at least, "do you want SQL or Spark?", that type of decision, and have frameworks that just support you in the work you need to do.

Boaz: With the data organization that is so big, how do you guys even go about making these decisions? Making decisions that would affect the long run? Can you tell us a little bit about the culture of decision-making around data at Klarna?

Gunnar: Historically, it has been a bit unclear, but what we are doing now is going full-blown with RFC and ADR processes for everything, to make sure that people have an opportunity to raise their voices about the different decisions. Typically these things also take some time, so it is a process of, like, "Hey, we are doing this." Maybe I would propose an ADR for it, and then I would get some pushback, and then we would move forward with it. But it is becoming more and more formalized, and I think some people love that and some people hate that, but it becomes a necessity when you are growing. You just realize that other people have solved this problem of having growth. It feels like some kind of administration, but you can no longer tell people at the coffee machine, "We are going to make a change," and then it is like, "Oh, but I am in the Toronto office. I did not hear what you said at the coffee machine."

Boaz: Absolutely, those are the challenges of how you need to adapt the way you work for scale. We feel it also at Firebolt. We have grown tremendously, we are spread worldwide, and we need to invest more time in writing things properly, sharing them properly, and encouraging people to access them.

Gunnar: Exactly. But, how big are you at the moment?

Boaz: It is like a drop in the sea compared to you guys, 200, doubled within a year. So it feels huge for us.

Gunnar: Yeah, but you need to be prepared for the growth.

Boaz: Looking at that journey now, imagine you would do everything from scratch. If our listeners are going through the same journey, modernizing their Hadoop and leftovers from on-prem and going all in on AWS, what would you have done differently, knowing what you know today, that would have saved you some time or headache?

Gunnar: I think it is a boring, standard kind of thing, but I would not necessarily focus more on testing and documentation, but on strictness from the get-go. I think this is something that will always bite you when you end up making decisions like, "Hey, we can do this a bit faster if we cut down a bit on the strictness that we want to have." I think going forward with more strictness means saying, "This is how we expect data to be produced. This is exactly how things should work," as opposed to, "Let us try to solve this local problem for now, and later we will implement some chemistry." That is a very tough transition to make. So, I think that is what I would change: try to be stricter from the start, and maybe also the way you handle the exceptions, because you will always have someone breathing down your neck and forcing you to take some shortcuts. I would probably have some kind of exception process and just gather the poor technical decisions that were made with a clear ambition to move faster. Because I think that would be helpful, not to solve those cases, but to get an overview of what we need to learn for future cases. So, that would be what I would change, I think.

Boaz: Awesome, thanks. I am thinking about what else we have not covered. At Klarna, you are dealing a lot with architecture today. When was the point in time where being an architect for data became a full-time job, something people decided "we need architects" for, rather than people doing this as a side job?

Gunnar: That is a good question. I think it probably must have been like four years ago or something. And at the time, it was a product manager who stepped into the first architecture role. I thought it was a bit weird then, again because of titles and how we think about them. But now, it makes a lot of sense to me. I come from a data development background entirely. I realized that the type of things I need to adapt to and pick up is a lot of actual product management, because you are building a product and you need to think of it that way. We have also been looking at the way we work with architecture. I act as the architect of the domain on the actual technological platform. But then, looking at how we drive architecture on data modeling, that is more of a team responsibility in a way, and I think that is a scenario where it is a bit tricky to find the exact sweet spot for how to do that. Like, how much you should lean, again, towards centralization versus federation. How do you get everyone to build data models that are consistent with each other, without having someone centrally telling them exactly what to do? I think that is one of the things that is a bit of a challenge. I do not think there is a standard solution for doing it, but there is surely a need for having syncs on how to do that type of architecture, and we have some things that come centrally, in terms of how you should name your fields, or things like country codes. It is a trivial, funny example, but it is easy to standardize exactly how you do it. But then it is just a never-ending list.
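As an illustration of the kind of central convention mentioned here (field naming and country codes), the following Python sketch checks a record against two assumed rules. Both rules are assumptions made for the example, not Klarna's actual standards: field names are snake_case, and a field called `country_code` holds two uppercase letters (the shape of an ISO 3166-1 alpha-2 code; the sketch does not check that the code is a real country). The function name `check_record` is hypothetical.

```python
import re

# Assumed convention: snake_case field names.
FIELD_NAME = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")
# Assumed convention: two uppercase letters (ISO 3166-1 alpha-2 shape only).
COUNTRY_CODE = re.compile(r"^[A-Z]{2}$")


def check_record(record: dict) -> list:
    """Return a list of human-readable convention violations for one record."""
    problems = []
    for name, value in record.items():
        if not FIELD_NAME.match(name):
            problems.append(f"bad field name: {name}")
        if name == "country_code" and not COUNTRY_CODE.match(str(value)):
            problems.append(f"bad country code: {value}")
    return problems


ok = check_record({"country_code": "SE", "order_id": 1})
bad = check_record({"CountryCode": "SE", "country_code": "swe"})
```

A check like this is easy to run in a shared pipeline framework, which is one way a central team can standardize without dictating each team's data model.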

Boaz: Interesting, the product person moving into data, or like you said, thinking more and more about data as a product, which is, by the way, a natural link back to the data mesh story, because the concept of data as a product is also heavily included in that story. And it is true. At the end of the day, I think more and more companies are doing that without noticing, just as you guys came to notice that you were doing something that feels like a data mesh. Many companies have realized that treating data as if it were a product, and understanding who the end-users are, internally, whether analysts or whoever, will definitely make life easier down the road, and it has just picked up like crazy. What are the top data use cases and workloads that run today, that the company is very reliant on and that are interesting?

Gunnar: I think that is a hard one to answer. Obviously there are processes that are more important and central, and they might have a smaller scale but be extremely valuable. But then you also have the scenario where almost all the different product development teams are doing A/B testing of their features, and then following up and training models for how we make our decisions. Every company is doing bookkeeping and financial reporting, but I think as a company, we are more skewed towards both the product development and the underwriting aspects. I think those are what set us apart a bit, and it is also partly why we probably have a different load profile from a lot of other companies. I would predict that other banks are not doing the same type of large data processing in the same sense. Because if I look at their web pages, it does not quite look like they are doing A/B testing of every feature; it looks more like someone thought something very sensible through and then built it. And that is okay. But for us, it is more core: if we release this feature in our app, what will the implication be? Those types of things. I think that is the type of data use case where an analyst in a team that is working with a specific module would want to know: does this work better if we do this little tweak, will more people realize that they need to look at this thing now?
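The question an analyst asks in an A/B test, "does this tweak work better?", can be sketched with a standard two-proportion z-test. This is a generic textbook method shown for illustration; the transcript does not say which statistical test Klarna's teams use, and the function name and the conversion numbers below are made up.

```python
import math


def two_proportion_z(conv_a: int, n_a: int, conv_b: int, n_b: int) -> tuple:
    """Two-sided z-test for a difference in conversion rate between A and B."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the null hypothesis that A and B convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical numbers: 100 of 1000 users convert on A, 150 of 1000 on B.
z, p_value = two_proportion_z(100, 1000, 150, 1000)
```

A small p-value suggests the difference between the variants is unlikely to be noise; with many features tested continuously, this kind of calculation runs at a scale that shapes the load profile of the data platform.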

Boaz: Klarna in essence is very modern when it comes to everything data, maybe not what we are used to seeing in the traditional finance world, but definitely representative of the new age, the modern FinTech companies, where everything has to be data-driven and data is entrenched across the departments, in their decisions and how they work.

Gunnar: I think there is never a scenario where we take decisions and it is okay to not have any form of data. We really need to back up the company on a lot of these things, to make sure that people do not have to make gut-based calls.

Boaz: For our listeners, try to look it up: Klarna has a very cool commercial that went viral, with the fish moving down, how do you call it, a slide, and then smoothly moving across the floor, and then the message is just "smooth" or "smooth payments" or something like that. So, I wonder how many fish were part of the A/B test, which types of fish, maybe tuna, salmon, and whether there was an A/B test and the right fish was selected for the commercial.

Gunnar: Unfortunately, I was not part of that, but A/B testing is such a central thing when the core of what you are doing is just removing friction, and the fish had no friction. That is probably part of the message.

Boaz: Very creative piece of commercial. Okay, Gunnar, this has been super, super, super interesting. Thank you so much for sharing those stories with us.

Gunnar: Thank you.

Boaz: I hope you had a good time as well.

Gunnar: I did. Thank you very much!

Boaz: Thank you, folks. See you next time. Bye-bye. Have a great day.

Brought to you with love from Firebolt