How Vimeo Keeps Data Intact with 85B Events Per Month
Boaz: We recently had the pleasure of speaking with Lior Solomon, VP of Engineering at Vimeo.
Of course, Vimeo doesn't really need any introductions. We all have watched a video at one point or another on the platform. Lior has been with Vimeo for almost three years. He joined as the Head of Data, then moved his way up to VP of Engineering. He is an industry veteran, having been in data leadership roles and engineering leadership roles for quite some time.
Anything I missed?
Lior: The only thing I'll elaborate on is what Vimeo is. It's a known brand and video platform for a lot of people. But you know, for the past five or six years we've been focused mostly on helping small to medium businesses scale their business and make an impact with video.
So, in the context of data, that's where it becomes interesting. If we're trying to provide these businesses insight about potential competitors or other businesses that are driving an impact, the data should be intact before we try and do that. So, I joined Vimeo with that objective.
Boaz: “Data should be intact.” That's a good tagline. We should make t-shirts. That could be a big seller.
Okay. Thanks. So, let's get to it. Let's talk about data at Vimeo.
Let's get started with the sheer numbers. What kind of data volumes are you dealing with?
Lior: Last time I looked at it, we collect about 85 billion events per month. We have a couple of data warehouses, unfortunately. We have about 1.5 petabytes of data in BigQuery, and about half a petabyte in Snowflake.
Most of the data is streaming in as viewership: you know, the analytics around who's watching what and how frequently, the quality of experience, whether it's buffering, which regions people are watching from, and whether the videos are distributing properly. That accounts for most of the data coming in.
Beyond that, just like any SaaS platform, we're gathering user experience analytics and events and trying to make sense of them.
Boaz: Okay, we'll spend more time diving into the stack, because it's interesting how you've spread your eggs across more than one basket with all these technologies, Snowflake and BigQuery.
But, people-wise, I think Vimeo is a little bit over 10,000 people strong, more or less. How many people are in data-related roles? How are those teams structured?
Lior: We're about 35 data engineers. We have a data platform team that focuses on real-time processing and the availability of the pipelines, mostly the Kafka stack. We used to manage our own Kafka clusters, but we moved one of them to a managed cluster on Confluent.
The data platform team is also working continuously on building a framework to make it easier for other engineering groups to consume data from Kafka, like having the data dropped into their own data sinks or whatever.
There's another team called the video analytics team, which focuses on the consumer-facing video analytics, basically serving the consumer on the website and on the mobile apps. They work on getting the data closer to real-time with all the aggregation and all the challenges around that.
There's another team for enterprise analytics, which focuses mostly on the CDN aggregation. We get a lot of data from vendors on a daily basis that we need to aggregate to understand how much bandwidth is being consumed by each one of the accounts. There, the stack is mostly Dataproc using BigQuery, and some of that data eventually lands in Snowflake for the BI team. The enterprise analytics team is kind of the gatekeeper of the data warehouse and Snowflake. They're the ones that actually build the ETL integrations, the DAGs, Airflow and so on.
The data ops team focuses on time to analysis. They help product teams onboard their event payloads and work with them on the event and data modeling. They're the team that actually provides the internal framework, Big Picture. We like to think of it as a structured stream where any PM or engineering team can go and create their own schema. And once that schema has been created, we provide the SDKs for whatever platform they use. And that framework basically validates the schema and lets those teams define where exactly they want to drop the data later.
Boaz: Has data ops always been around at Vimeo or is it a newer addition to the team? Tell us about its evolution.
Lior: It's pretty new. We launched it maybe eight months ago.
Boaz: And what drove you to launch it?
Lior: That'll take us into the legacy and the history behind Vimeo.
Boaz: So, let's roll up the sleeves. Let’s talk about the history. How has the stack evolved?
Lior: So originally, we had a homegrown kind of pipeline for collecting data from different sources, called Fatal Attraction. Don't ask me why. That's the name.
It's basically just unstructured data. You can push whatever you want, but you have, like, basically three different columns to push into. The original idea was: send me the component where the event actually happens, send me the page, and the actual payload. Unsurprisingly, after many years, all the massaging of data happened downstream in the ETLs, by the BI team. You see ETLs of thousands of lines of code that basically extract the data; in some cases it's JSONs someone decided to send.
Boaz: Were these Spark-based?
Lior: No, it used to come in from basically backend logs. We'd aggregate all the backend logs, push them into Kafka, and drop them into Snowflake. And then on top of the raw schema in Snowflake, we'd build ETLs that make sense of it.
Eldad: One big VARIANT field with all the JSON in it. Thousands of ELT processes to parse and structure it.
Lior: Absolutely. Yeah. So, you know, there are a couple of challenges here. First, it is so easy to break this pipeline. You know, an analyst works for quite a while to find his way to the data, and then adds a couple of lines of code to those thousands of lines of code. And then, a couple of days later, someone changes it upstream. The analyst is not even aware of it, and we have a broken pipeline.
That's the world we're trying to shy away from as much as possible. Today we're focused on improving analyst efficiency, letting them move as fast as possible and letting them focus on driving insights. We're probably not the first company to do that.
But going back to Fatal Attraction and the unstructured data. We got to the point where we said, okay, let's build a different framework. Let's make sure there's a contract with the client sending the data, so we can validate the data. That way we can trace ownership to whoever is submitting and emitting the data. And basically we're using a schema registry to validate against all those contracts, all those schemas.
For each event, we have a valid topic and an invalid topic. So, once an event lands in the invalid topic, we know, oh, something's wrong. We can go and ping the team that's actually responsible for that pipeline and tell them, "Hey guys, something's wrong." They can go fix the issue. And we can rerun the events, because we don't lose them. We still have them.
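To make the valid/invalid topic idea concrete, here is a minimal Python sketch of the routing logic described above. The event names, required fields, and the tiny validator are all hypothetical illustrations, not Vimeo's actual framework; a real pipeline would validate against a Confluent Schema Registry and produce to real Kafka topics.

```python
# Hypothetical schema registry: each event type declares the fields
# its producers are contractually required to send.
SCHEMA_REGISTRY = {
    "video.play": {"required": {"user_id", "video_id", "timestamp"}},
}

def route_event(event_type: str, payload: dict, topics: dict) -> str:
    """Validate a payload against its registered schema and route it
    to a valid or invalid topic (modeled here as lists in a dict)."""
    schema = SCHEMA_REGISTRY.get(event_type)
    if schema and schema["required"].issubset(payload):
        topics.setdefault(f"{event_type}.valid", []).append(payload)
        return "valid"
    # Keep the broken event so it can be rerun once the producer is fixed.
    topics.setdefault(f"{event_type}.invalid", []).append(payload)
    return "invalid"

topics: dict = {}
route_event("video.play", {"user_id": 1, "video_id": 9, "timestamp": 0}, topics)
route_event("video.play", {"user_id": 1}, topics)  # missing fields, goes to invalid
```

The key property is the one Lior highlights: failed events are retained, not dropped, so the owning team can fix the producer and replay everything in the invalid topic.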
Problem solved, right? It's awesome. Now everything's like intact. If anything breaks, we know immediately.
Eldad: There's always someone to blame.
Lior: We don't do that.
But, you know, we created a new problem that goes back to your question about data ops.
From the perspective of the engineers upstream, the ones building the applications and working on the user experience, they're like, "Oh, that's awesome. I can go into the Big Picture UI. I can set up a new schema. I can focus on the questions that I have, because I don't care about the other teams. I have a question I want to answer, and I personally just want the data in Amplitude." Because, you know, that's how their PM wants their analysis, and they don't really think about the needs of the analyst or the marketing person and so on and so forth.
So really quickly you end up with hundreds, if not thousands, of different schemas. The life of an analyst is like: they come in, someone knocks on the door like, "Hey guys, can you test this hypothesis? I want you to analyze this user behavior." And they spend weeks understanding which events they should cherry-pick from the list of events that exist.
Yes, they are more reliable, but still, it's not quite clear what the context is, what you should use, and so on.
So, data ops came about for two reasons. First of all, to stop that behavior and say, "Guys, before you go and create a schema, let's sit down. What are you trying to do? What are you trying to achieve?"
Say you're trying to track whether someone paused the video; that probably already exists, so don't go and create another one. Or if you're just missing some data, let's think about how we can extend the existing event, or coordinate with the other teams, and make sure the event data modeling stays consistent.
That's one objective. The other objective is to create the automation we need for that specific framework, because downstream, it doesn't make sense for the analysts to be deeply familiar and savvy with each one of the schemas. At the end of the day, they want a clickstream.
There's a set of properties they're looking into; they're looking for some sort of standardization. So, the data ops team was working on creating what we call internally "global properties."
Basically, we have roughly three engineering organizations within Vimeo. It's about 500 engineers. Each one is responsible for a certain application. We signed a contract with all those engineers saying they have to provide us with that information. That's like the basic bread and butter for any analysis. That's how we create that.
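The "global properties" contract can be sketched in a few lines: a base set of fields every engineering org agrees to send with every event, merged into each team's own schema so analysts get a standardized clickstream. The property names below are invented for illustration; the source doesn't say which fields Vimeo standardized on.

```python
# Hypothetical shared contract: fields every event must carry,
# regardless of which engineering org produces it.
GLOBAL_PROPERTIES = {"user_id", "session_id", "platform", "timestamp"}

def full_schema(team_fields: set) -> set:
    """A team's event schema is its own fields plus the shared contract."""
    return set(team_fields) | GLOBAL_PROPERTIES

# e.g. the video player team only declares its domain-specific fields:
player_schema = full_schema({"video_id", "playback_position"})
```

The point of the design is that downstream analysts can rely on the shared fields being present in every event, no matter which of the ~500 engineers emitted it.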
So, a lot of the automation is done by the data automation team. And that team is actually a cross-functional team. In that team, you have a front-end developer just for the UI and all the “making fancy,” and a data engineer that builds the ETLs.
Boaz: How big is that team?
Lior: Four right now.
Boaz: But you mentioned that they’re essentially in charge of standardization and consolidation of models.
Eldad: It's basically going full self-service and then turning that into kind of a fully managed self-service. Removing tons of friction, trying to find common data sets, common metadata.
This uncontrolled self-service pretty much goes back to the Excel spreadsheet nightmare we all grew up on, just at a bigger scale, with data engineers instead of information consumers.
Boaz: I wonder, putting that team in place and suddenly telling the org "let's stop for a second," like you said, how do you enforce that? How did you manage the process around that? Because, you know, I might guess that an analyst can just go ahead and do what they did before. Sometimes it would be faster to not even ask data ops. So how do you manage that?
Lior: Yeah, that's a great question. I won't be able to say, "Hey guys, stop for a month, we're moving." We actually ran both pipelines, so we still have the legacy Fatal Attraction, which is slowly draining, because as new products launch, everyone's using the new framework. So, we are slowly draining that pipeline. I'm having active discussions with leadership about being more aggressive. But that's it: it needs to be completely deprecated, even in the legacy parts of the website, because we're moving toward using the new framework for experimentation and A/B testing, which we do internally.
One of the ways I got the engineering teams to actually bear with me and work with me on the new framework was to find the carrot at the end of the stick and say, "Hey guys, how about you get some self-serve analytics in Amplitude?" And they were like, "That's awesome."
The analytics team is more focused on the business. You know, KPIs, correlating retention with FTS and subscriptions. More of the advanced analytics, data science work.
And when it comes to, "Hey, tell me how many people clicked the button," or "Show me a funnel," Amplitude is just amazing at that. So I say, "Go create that event." It immediately appears in Amplitude, and that created a self-serve world for a lot of the PMs and engineers.
That also created a problem, because now, going back to the original issue, there are multiple schemas coming out of every team and it's really hard to control. So how do you get out of this mess? Because it seems like you created a great framework where anyone can create whatever they want and they're happy about it, but the collective is not. So, in the coming quarter, we're working on an enrichment layer, and we're looking into ksqlDB on Confluent.
The idea is that we want to dumb things down as much as possible for the client sending the data. I don't need you to tell me what subscription the customer has, because I already know it, and I have those dimension tables. I don't need you to provide all that information. I don't need you to tell me which team ID they belong to.
So, the reason we can't do it so far is that a lot of the aggregations and the business logic are maintained in the data warehouse, in Snowflake. And when I'm serving the data from Kafka into Amplitude or any of the endpoints (you know, it could be Prometheus, whatever you're using), I don't have access to those dimension tables.
So, using ksqlDB, we're basically planning to have those tables closer to Kafka; ksqlDB runs under the hood with persistent data. And once that happens, we're going to create basically no more than 20 or 30 generic events, asking just for the facts and whatever needs to be enriched.
That's going to be done through ksqlDB. It's a framework you can use to drop the data into whatever data sink you need. So you will be able to self-serve. The events are already designed; the data model is all in place.
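The enrichment layer described here amounts to a stream-table join: bare fact events from clients get joined against dimension tables (subscription plan, team ID) before landing in a sink. In ksqlDB that would be a persistent stream-table join; below is a rough Python simulation of the same idea. The field names, plans, and user IDs are all made up for illustration.

```python
# Hypothetical dimension table kept close to Kafka (in ksqlDB this
# would be a TABLE materialized from a changelog topic).
subscriptions = {
    101: {"plan": "pro", "team_id": 7},
    102: {"plan": "free", "team_id": 8},
}

def enrich(event: dict) -> dict:
    """Join a bare fact event against the dimension table, so the
    client only has to send the facts it actually knows."""
    dims = subscriptions.get(event["user_id"], {})
    return {**event, **dims}

# The client sends just the fact; plan and team_id are filled in server-side.
enriched = enrich({"user_id": 101, "event": "video.play"})
```

This is what "dumbing down the client" buys: producers emit minimal events, and the shared enrichment layer attaches the business context once, consistently, for every downstream sink.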
We're dumbing down the client. It will speed up the deployment of new products.
And we're keeping the data modeling for data engineering and analysts.
Boaz: Is this already launched in production or is this an active project?
Lior: No, we just started working on it.
Boaz: You mentioned before transitioning from self-managed Kafka to Confluent. What was the story behind that transition?
Lior: I think overall, you need to measure where you want to put your efforts: maintaining your infrastructure versus putting it on a managed service, which, even though you pay for it, is not free. Right now, we are putting our CPUs into driving data insights and helping BI, data scientists, and machine learning teams actually move forward with their initiatives.
So right now, as a rule of thumb: if there's a service I can run as a managed service and offload that work from my team, I do.
Boaz: You mentioned a huge variety of things at the end of the day that you guys do with data, and it's pretty impressive to see Vimeo being so data driven. Can you share what sort of workloads or use cases you enjoyed seeing that brought a lot of value?
Lior: Yeah. I would say almost every department is actually consuming data from the data warehouse for their own initiatives, all the way from marketing to return on targeted ad spend.
A lot of the data science projects also utilize Snowflake. I would also say the machine learning team uses some of those payloads. For example, we use Kubeflow together with Airflow to ingest and pull data from Snowflake. An important team is also 'Search and Recommendation', which runs Elasticsearch.
They're basically running our personalization and recommendation algorithms, and they're also utilizing data. The data comes from Slope. And the personalization data is actually stored in Bigtable and Snowflake. As we talked about at the beginning, the reason we originally had both was just that we moved to GCP a couple of years ago. When I joined, we had just moved from Vertica to Snowflake, and Snowflake had only its AWS implementation.
So, we kind of run both and I don't see us moving to Snowflake over GCP.
Boaz: But what's your strategy there? Is the intent to stay on both AWS and GCP? I mean, you're running both Snowflake and BigQuery. Do you consider that sort of legacy tech debt, or is it a strategy of using the right tool for the right purpose?
Lior: The way I'm thinking about it is: Snowflake, to me, is where I want to make sure there's high-quality data used by the different business units. BigQuery and the stack on GCP are data stores that are more engineering-oriented.
The easiest way to think about it is: I won't put any report that's coming from BigQuery in front of leadership. I'm not owning it; it's not my problem. Anything that goes into Snowflake, that's what we own.
Boaz: So what are the big plans for the next year within your data initiatives?
Lior: We are doubling down. For me personally, I'm really, really focused on data availability and building trust with the overall organization.
We are actually expanding our machine learning teams and really trying to go in that direction. We are trying really hard to advocate for hiring more and taking more risks as a business. We spent a lot of last year on the data availability aspect: creating SLAs for some of the data stack, making sure teams and the business have clear expectations about what's supposed to be done, and what's done in response to any data outage. The more you can create that world, the better you set up expectations with the stakeholders, because the easiest statement is saying "the data is wrong." It's not easy to draw those lines, like, where does data engineering start? In a lot of cases, I'm facilitating sessions that are not necessarily related to data engineering; for example, we're trying to help a team by being their ambassador, to find where the data problem starts. It could be a transaction that didn't close, or payments not getting in on time. It could be a third party, one of the vendors, and not an engineering issue. We're still trying to find this engineering language to help them get there. It's the world of human engineering required to be successful in data engineering.
Boaz: We don't talk about it often enough, the politics of data.
Lior: Yeah, it doesn't matter how good of a technology we build. It’s a combination of people and processes and technology. We implemented Monte Carlo like September last year, which was super successful. I'm really, really happy with it.
Boaz: Yeah. So, Monte Carlo are trending like crazy. Tell us a bit about the value you got out of it.
Lior: We literally jumped to the future, because we were actually trying to build some of those things ourselves, and I'll give you an example. So, let's talk a little bit about data validation. Upstream, data comes into Kafka. Now we have a schema registry and a fancy framework that knows how to tell if the data is right or wrong. What we don't know how to do yet is tell you if there are any unknowns or anomalies in the data. So, you go downstream a little bit, into the ETLs and Airflow. We put Great Expectations there, and maybe the expectations are great, but it takes time to implement for the whole pipeline, all the ETLs. You have to prioritize. And unfortunately, usually you prioritize by who's shouting at you more.
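The gap Lior is describing sits between schema validation (is the payload well-formed?) and anomaly detection (are the values plausible?). The kind of check Great Expectations encodes can be reduced to a stdlib sketch like the one below; the column name, threshold, and result shape are invented here, loosely mimicking Great Expectations' result dicts rather than using its real API.

```python
def expect_values_between(rows, column, low, high):
    """Evaluate a range expectation over a batch of rows and return a
    Great Expectations-style summary of how many values failed."""
    failures = [row for row in rows if not (low <= row[column] <= high)]
    return {"success": not failures, "unexpected_count": len(failures)}

# A toy batch: a negative watch time is schema-valid but clearly anomalous.
batch = [{"watch_seconds": 120}, {"watch_seconds": 4}, {"watch_seconds": -3}]
result = expect_values_between(batch, "watch_seconds", 0, 86_400)
```

The operational problem is exactly the one in the transcript: each such expectation has to be written and maintained per table, which is why wiring them across every ETL is slow, and why an observability tool that infers baselines automatically was a jump forward.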
So we experimented with it a lot. We found it is good for specific anomaly detection, specific business logic. What Monte Carlo did, which wasn't the main thing on my mind, was hook up to Snowflake and Looker.
And basically without setting up anything, they started listening and building those metrics for all the tables, the data sets we have. And suddenly, I'm starting to be aware of problems I wasn't aware of at all.
My first reaction was that it sounded pretty risky, because now we're going to spend our time jumping on, like, a thousand alerts. How's that going to go?
So, what we did was actually start focusing on specific schemas or specific data sets and understand who owns them. Because, you know, sometimes you cannot keep the issue to yourself; you need to go talk to the engineering team and understand what's going on. So we started building those relationships: if I know the dataset, I know which team is actually driving it. I can set up a Slack channel where any alert goes. I'll always be there as a data engineer, so I'll be aware of those alerts, but I'll also make sure the stakeholders and the publishers are in that channel, so we have the whole team informed.
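The alert-routing idea reduces to a small ownership map: each dataset maps to the team that produces it, and every anomaly alert for that dataset lands in a shared channel where data engineers, the owning team, and stakeholders all sit. The dataset and channel names below are hypothetical, and a real setup would post via the Slack API rather than append to a dict.

```python
# Hypothetical dataset -> owning-team channel map.
DATASET_OWNERS = {
    "viewership_daily": "#data-viewership-alerts",
    "billing_events": "#data-billing-alerts",
}

def route_alert(dataset: str, message: str, channels: dict) -> str:
    """Send an anomaly alert to the channel owned by the dataset's team,
    falling back to a data-engineering triage channel."""
    channel = DATASET_OWNERS.get(dataset, "#data-eng-triage")
    channels.setdefault(channel, []).append(message)
    return channel

channels: dict = {}
route_alert("billing_events", "row count dropped 40%", channels)
unknown = route_alert("mystery_table", "null spike", channels)
```

The fallback channel matters: datasets without a known owner still surface somewhere, which is how the responsiveness reports Lior mentions next become possible.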
Now, in reality, we need to build those relationships. Some teams are more data driven and are excited about solving those problems. That's not the biggest problem. The biggest problem is we need to just start building those reports.
I get a report and it tells me, across all the challenges you have right now, across all the teams you're working with, what the responsiveness SLA is. And for me, that's kind of a get-out-of-jail card, because whenever someone says, "Hey, data is bad," I'm like, "Data is bad where?"
Oh, it's bad on the CRM? That's the CRM team. Well, you guys have not been responsive to any of the alerts for the past month; obviously it's bad, you know? So that starts the conversation in the boardroom where you're setting up expectations with your stakeholders.
Eldad: That's the amazing thing about how Monte Carlo takes your existing modern stack, your older mess, all the confusion, and kind of builds a picture up backwards and shows you where the problems are.
That's a new way of driving data. A few years ago, we would have needed a dedicated team just for that, and it doesn't work, especially when there's so much self-service going on. It's just amazing to see, and we're super happy to hear about Monte Carlo. We hear that a lot. So, we expect great things from this company.
Boaz: You shared a lot of great things that you do at Vimeo but we cannot let you brag forever. Now it's time for you to share an epic failure with us. Doesn't have to be from Vimeo, can be from your data career. Let's hear about some things that didn't work.
Lior: When we built this Big Picture framework, we were so focused on the self-serve aspect. We thought that would be amazing: they can go and create their own schemas, it's so great, and if it fails, we'll find it. That was the mindset. But then we said, hold on. Who's going to set the mandate in regard to why we're even creating that? And at what point do we bring in the customers of the data? The analysts and the data scientists should be part of the discussion about what we need.
Eldad: It will be interesting to talk to you in a year and see how that experiment went. I mean, in many cases, companies just jump between two extremes. They either go full self-service, then get burned out, and then go fully managed, with four people trying to route every request in the company.
I think that the modern data stack is here to help. So, will be amazing to see kind of how that plays out. Always, as you said, manage expectations, manage people. And prioritize.
Boaz: Okay, maybe a weird question, but what would happen to the business if the budget for all data engineering and data-related initiatives were cut in half?
Eldad: No ads on videos, for example.
Lior: Great point, because we don’t do ads on videos. Do your research.
Eldad: Ahh, plugged on purpose. Of course I know that.
Lior: So, going back to the question about managed, not managed, I think right now we're focused on insights.
We strongly believe that data is on an exponential curve, because we are a company that has existed for 16 years in the video space. We believe that if we can land the processes and teams and infrastructure properly, we can start extracting insights about videos and help businesses thrive.
And really, we need that. We won't be just the video platform; we'll be providing video insights for customers. So that's the main focus, and how fast we can get there.
We just had the IPO and are in that sweet spot. It's not that money doesn't count, but right now, anything we can delegate and not focus on, so that we can focus on building those insights, is going to get promoted. But when my CFO looks at the Snowflake budget, he's really worried, because we did grow it: I think in the first year we grew it by 55%, and in the second year, I think, 70%.
Boaz: Just by the number of times you mentioned Snowflake, I'm sure your CFO is super worried looking at that bill.
These are amazing times to be in data, because in the past, data budget talks were "how can we, with the least amount of spend, get these annoying reports out of our way?" Whereas today, modern companies understand that data is an investment. It's value. It's suddenly an X factor for companies' business models.
And it's okay to invest more than we used to, not less, to try and drive value out of it. So, it is super interesting to see how Vimeo does that.
Okay, good. So, we want to lighten things up a little bit as we near the end. We'll do a quick blitz round of short, quick questions. Don't overthink. Just answer quickly.
Boaz: Commercial or open source?
Boaz: Batch or streaming?
Boaz: Write your own SQL or use a drag and drop visualization tool?
Lior: Write your own.
Boaz: Work from home or from the office?
Lior: For me personally, at home, but I think it's more feasible to work from the office.
Boaz: AWS GCP or Azure?
Lior: Well, right now, absolutely GCP, because we're all in on it. But it depends.
Boaz: To DBT or not to DBT?
Lior: To DBT. Absolutely, all the way.
Boaz: Delta lake or not Delta lake?
Boaz: YouTube or Vimeo?
Eldad: YouTube Premium or Vimeo?
Lior: Oh, that's a good question. It depends. If you're trying to monetize with ads, definitely YouTube. If you're a business that cares about your brand mission and making sure you have clean video with no ads, then that's Vimeo!
Boaz: That's the difference when we have executives on the podcast versus the hands-on people. If we interview an engineer at Vimeo, “of course Vimeo!”
Eldad: Just true or false. There's no in between.
Boaz: Okay. Very good. I think we're about to end. Anything else you want to dive into?
Eldad: One quick thing. So, you've mentioned that you want to drive insight from voice. People communicate, and then you get tons of value out of kind of analyzing what multiple voices say on a call. Are you planning to make videos more interactive and embed more? Are you planning on making it less about broadcasting and more about getting feedback from viewers, so that it can actually drive some kind of video analytics insight in the future? Is that planned? Anything you can share?
Lior: Short answer, yes. But I can’t share too much.
Eldad: Nice! So, there's a big thing planned. Okay.
Boaz: No comment, no comment. Lior, this has been super interesting. Thank you so much for joining our podcast. As we said, we need to talk to you in a year. Please do not disappoint us; we want you to have achieved all these milestones with excellence. Thank you so much.