Season 2 · Episode 9
Abundance, Metadata, and Automation with Mark Grover
How can we make data 10X more accessible for data-driven people within data-driven companies? Tune in to Mark and Sam discussing probabilistic product management, and the emerging metadata ecosystem.
Founder at Stemma, co-creator of Amundsen
Hi, this is Sam Ramji and you're listening to the Open Source Data podcast. Today I'm here with Mark Grover. Mark is a co-creator of the Amundsen open source data catalog, and now co-founder and CEO of stemma.ai, which is bringing the power of Amundsen to organizations like Lyft, ING, and more. Previously he was a product manager at Lyft, as well as a Spark developer at Cloudera. And in his spare time, he likes to dance and be in the great outdoors. Mark, welcome to the show.
Thank you for having me. I have no idea where you got the dancing and the great outdoors thing. I think it's true, I just don't publicize it as much. So whoever did the digging, I think Audra did a phenomenal job there.
So we like to start each conversation, asking our guests what open source data means to them. So what does open source data mean to you?
Yeah, it's a combination of two things, open source and data. But to dig in a little more deeply on that: I think the data stack, for reasons that maybe we can dig into, has historically been an open source friendly stack.
So when you think about ways of storing data, Oracle is a good example; Oracle, not open. But then came the move to disrupt Oracle through Hadoop, HDFS, and Impala, based on Dremel. And then of course you have Snowflake, which is not open source.
So there seems to be a cycle, but at least for a part of the cycle, there is a movement that you can start with open source software, which allows you to easily integrate, deploy, manage, control your own destiny that has been in the blood of data warriors for a while.
So, to me, it has that legacy in a good way of being able to control the data destiny within your organization, enable more people to do data-driven decisions, build ML models to move the company further by means of using open source data software. That's what it means to me.
It's fascinating also to think that you're one of a set of Lyft alumni that have come out with pretty interesting open source data technologies.
And it's interesting for me over the last decade to see how many really good computer scientists have moved into data, rather than working on things like containers and compute and platform. So data obviously is a super interesting new challenge. There's almost a migration towards it, and these are all playing with each other in this open source means of development, means of integration, right? We're starting to see the rocks all smash against each other and get softer and find their fit, their field of use.
So I'm curious to see what open source data projects really light up for you, because obviously Amundsen is an important one, and yet it's in an ecosystem of other open source data technologies.
So I've worked both at data companies that sell data software, Cloudera was an example of this, and my own company today, Stemma, is an example of this. And I've worked at companies that end up using data software for their own internal use and decision-making. Lyft, where I was right before Stemma, is an example of this. And there was a cycle in my career around this.
And when I look back, I find that having the experience of working at a data company, selling software as a vendor, you see different kinds of companies that have their problems, across various different industries and various different sizes. But then, coming internally to one company and being able to iterate so quickly, I felt that within the company you can find product market fit so much sooner, because your users are right there and you all have the same incentive structure, which is: make my employer, my company, do better with data, right? So getting a meeting with a user has so little overhead; it's like a 30-second thing. And I think I was really lucky to have both of those experiences, specifically having the exposure to the broader ecosystem and then coming in and being like, I want to better understand the problems of my peers and then use my past exposure to solve those problems. I was super grateful for that option.
And especially being a Spark developer at Cloudera, right, which is the Hadoop company, with HDFS and all the technologies you mentioned. And then Spark ultimately ended up eclipsing the way that we process data, and now we see that in Databricks. So there's an active sort of embrace and evolution process, right?
In an older era, you might be like, well, you'll never see competitive technologies under one roof, but it seems like that's changed.

Yeah, honestly, seeing competing technologies under the same roof is often better than seeing them under different roofs.
That's a really good point. One of the things that you put a ton of focus into is metadata.
So I'd love to get you to take our listeners to the future, like five years from now. What do you think the metadata ecosystem looks like?
I think a great place to start is: what is broken today? What is not working, such that we need to reinvent a better future?
Over the last 10 years or so, there's been a lot of innovation in the data space, through open source and proprietary means. There are new technologies like Snowflake, BigQuery, Hadoop, and S3 that let you store massive amounts of data. Then we've had innovation in things like Fivetran and Stitch that let you bring in more data. It's really easy to take one of these hundred connectors that Fivetran has and be like, "Oh yeah, bring them all in." Then we have Airflow, dbt, Prefect, Dagster, and the list goes on: tools that make it really easy for you to process data and derive more information, more insights from it.
And then we've had products like Tableau, Looker, and Mode that give data access to people who weren't previously "data people", who weren't data analysts or data scientists. Now product managers can use this data to get information, draw insights, make decisions, and so on and so forth. So there's a lot of technology innovation.
Then as an industry, we have said, "Data-driven companies are better companies." There's no one disputing that. And so there's a lot of hunger for using data in your storytelling, using data in your decision-making. Right?

So both personnel-wise and technology-wise, these two things coupled together have meant that organizations have a lot of data, and there are a lot of people who want to use this data.
And the problem is no longer that you don't have this data, or that you lack the skills, say SQL, which you could pick up, to use it. The problem is that there's so much data that no one has any clue what data exists. What's trustworthy? What are the gotchas that I need to be aware of, and how do I use it? Right? And that's the thing that's broken.
Now, mind you, there have been companies that have tried to solve this problem in the past, and the historical way of solving this has been the concept of data stewardship. What you do is you usually have a full-time person, but sometimes a part-time volunteer, who goes around documenting these things about data. So the activities they do are: they will say this particular data set is certified; this particular data set or dashboard is the one we report on the street; these are the data quality checks we should be doing on this data set; this data set gets updated daily; this data set has these foreign keys into this other data set; this data set has this ER diagram that gets used, right? The list goes on and on.
But the problem is, with the amount of data organizations have, and the rate at which both the people in these organizations and the data in these organizations are growing, you can't hire enough data stewards to do this job.
And that is exactly the problem I landed myself in at Lyft: the company was growing so fast, it was doubling every year, and it's not a 2-to-4-person company, right? It's going from 1,000 to 2,000 to 4,000 people every year. And to date there's so much data. You know, you use your Lyft app, that sends data. We have all these services, the things that show you ETAs and prices, and there are shadow models that run in the background. There's a crazy amount of data, and you can't possibly hire enough data stewards to cover it in a modern organization. Lyft had nobody called a data steward, and it wasn't anyone's job.
By the way, these terms like “metadata management system”, “data catalog”, “data discovery system” - I use them interchangeably. I think there are some nuances, but for the purposes of this conversation, they're all the same. I'll stick to the term “data catalog” for no particular reason.
But can we build an automated data catalog that can get maybe 80% of the information through automation? That this data set was last updated by this job; that there are 15 dashboards built on top of it; and that there are these other queries, most common filter conditions, most common join conditions, Slack conversations. These are the things that are happening.
So maybe you can't get a 100% guarantee that it's the right data set for you to use, but you can get to a point where maybe, out of the hundred columns that are related, you've nailed it down to the five that you should really be poking at. Right? So that was the problem at Lyft.
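The kind of automation Mark describes can be sketched very roughly: scan a warehouse query log and count which join and filter conditions actually get used. This is a toy illustration only; the function name, regexes, and record shapes are invented here, and a real catalog like Amundsen derives usage stats from warehouse access logs with a proper SQL parser rather than regexes.

```python
import re
from collections import Counter

def mine_query_log(queries):
    """Toy sketch: tally join and equality-filter conditions in a SQL log.

    Returns (joins, filters), where joins counts (joined_table, condition)
    pairs and filters counts columns used in WHERE ... = clauses.
    """
    joins = Counter()
    filters = Counter()
    for q in queries:
        q = q.lower()
        # e.g. "join drivers d on d.id = r.driver_id"
        joins.update(re.findall(r"join\s+(\w+)\s+\w*\s*on\s+([\w.]+\s*=\s*[\w.]+)", q))
        # e.g. "where r.status = 'completed'"
        filters.update(re.findall(r"where\s+([\w.]+)\s*=", q))
    return joins, filters
```

Run over thousands of logged queries, counts like these are what let a catalog surface "most common join conditions" for a table without any steward writing them down.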
The part that's broken is, I believe you can no longer rely on an army of data stewards to document your data. And that's the thing that needs to change.
So five years from now, we're in a world where truth is Bayesian, and we're taking a predictive model, a probability: these are probably the data sets that you want. And all that stuff is automated, because it has to be, given the embarrassing abundance of all of these different data sources and ongoing data streams.
What's that going to enable? What do you think the practitioners look like in that world? One thing that we often see is that work transforms.
As you pointed out, at Lyft you didn't have any data stewards. So there are entire job families that go away once you start to deal with things at 10X or 100X scale. For the required job transformation, in the DevOps world we saw a shift from operators working at a command line to site reliability engineers, right?

Doing higher-level automation. And over time, the operations jobs go away and the main jobs are SRE, which even then might be distributed into developers, right? All the developers on a microservices team are each responsible for the SLO. Everybody's writing a little bit of automation and telemetry scripts, and they all carry pagers. They know what's going on.
What do you see transforming five years from now, when this problem is solved, right? When Amundsen is everywhere and everyone's using Stemma, what changes?
Yeah. So there are multiple levels to this. The first level is that more people can make data-driven decisions.
So today it's not uncommon, I would say, in the Bay Area to see a product manager use some SQL to make a data-driven decision. But I would say it's uncommon outside of there, right? Would you agree?
Yeah. And when you think about the data-rich environments you have in financial services or in telecommunications, it seems like a SQL statement should be part of any business definition, right? Here's what we're going to do; what part of that needs to be measured, and how do you measure it? So it seems like this is a change that is burgeoning and certainly ought to be everywhere.
Yeah, totally. So, a few different use cases on one level, and then we'll go to the second level in a moment. On the very first level, you see these non-traditional data-X roles being able to use data to make decisions relatively quickly.
Then you see another use case around ML. I think we talk a lot about it now, but my personal perspective, and perhaps some disagree with me, is that there's a six-step process of building an ML model and deploying it, and that process is still very, very slow. It takes a very high skill set, and even with that skill set, it takes a long time.
So this vision of being able to understand and automatically catalog your data won't necessarily help with the skill set; I do think we need to build better tools, outside the purview of Amundsen or Stemma, to actually lower that skill set. But I do think it would take out this time and make it shorter, because step one is ingestion, step two is feature prep, and that step is a nightmare. That step itself can benefit hugely from an automated catalog.
Yeah, it's going to be really interesting to see how the ecosystem changes. There are early-stage startups now, like MindsDB, which will give you some AutoML in response to any query, supporting a wide range of databases.
It's kind of coming up with several different models on the fly and determining which one's the best fit. And so you can almost become a citizen data scientist in the future, right? Some of those jobs can change. Whereas on the other end, you've got companies like Observable that are giving you the ability to do a lot more data visualization, if you can just give them the right source.
So this next-generation ecosystem that gets enabled by making it much more obvious which data sets could be of interest would be a game changer.
Absolutely. And so at one level, you make these people in the organization more skilled, more productive, more effective at data.

At the second level, what happens? I find that organizations that continuously disrupt themselves or their competition are data-driven, right? So you look at a company like Lyft or Uber disrupting the taxi industry: they could not have done that without being able to control the app experience and using that data to power better prices, better ETAs. You look at a company like Airbnb: it could not have disrupted the hotel industry without having done that.
So what this leads to is a new era of innovation and disruption that previously wasn't possible, because incumbents don't use data; they have static institutional knowledge, which was true for decades. But now you can power better experiences for your users that you couldn't with those incumbents.
Yeah. The design of the business probably changes, because you start thinking about people who are customers or partners of yours, and you look at them now as telemetry, right? What's the maximum number of data points that I can draw from them in real time, and how do I interpret that to have a better understanding of the market?
It'll make its way into business design. It was very interesting for me to see Uber make a major acquisition, I think it was Postmates, a food delivery startup that had a similar thesis: Uber is a delivery business. Uber was able to really effectively price the value of the acquisition based on the fact that they were in the market, they had a bunch of telemetry, they had good data processing, and they knew what the growth rates and trends looked like. So their ability to run their business more efficiently, based on that wide-scale telemetry and the data platform structure of the company, might tell us a little bit about the future.
It's going to be really interesting to see what happens with widespread adoption of metadata catalogs, because as you said, there's a crisis of abundance, which is actually a crisis. If you have too much water, too much air, too much oxygen in your air, that's potentially fatal. You could get fascinated with all these different data sets, but not be able to do anything.

So being able to have this better view of what matters is probably going to be definitional for the next generation of companies.
Yeah, if I may add onto some things you just said that struck a chord with me: this crisis of abundance has occurred in a few different places. I'm reminded of the web prior to Google, right?
People had lots of websites, people would bookmark those websites, and you'd have these hundred bookmarks in your browser or AOL or whatever we used at that time to go to these things. And then Google made it really easy for you to search, and kind of made those discoverable and easy for people to use.
And in many ways, when I think of Stemma, when I think of Amundsen, I also think about one thing that Google had working for them. They did a fantastic job, but there's a standard for websites, and that's HTML, right?
In the data space, there is no standard. There's your data in Snowflake and your data in BigQuery, and those two are separate companies. You have your data in Tableau, and that's owned by yet another separate company. And that makes the problem even worse, because any product or tool that you may have seen help solve this problem at one company may not work at your new company, because the ecosystem that you integrate with is completely different.
That's an outstanding point. And I think part of it is that everyone knows that data is the new X, whether that's oil or plutonium or gold. So there's a vendor rush to make sure that you can kind of build up your data island.
Right. So you're trying to create your mega silo. There's an economic incentive to make data less open, or to make it open but only according to your designs, and not let it transit between islands particularly well. But that may be because we don't have a systemic view of the data, right?
As we start to integrate and practice metadata at scale, using metadata platforms, that might give people a top-level view that says: we need to standardize a bunch of this stuff to include in the platform.
There may be a point at which we end up kind of going, well, everything's got to be self-describing, right? Your first step is, you have to write a definition, or something like that. Do you see anything like that changing in terms of data definitions that would solve the problem of openness and benefit metadata users?
Yeah. I think this is a good question. So let's go to the next level of detail.
Okay, so in five years, the way it's going to be is: your data is going to be auto-documented. Maybe we add a little more color: 80% of it will be auto-documented. I do think there's a remaining 20% that some person is doing that's specific to them, in their context, and you have to have that information.
There are two ways for you to get there, and I believe we need to invest in both of those ways. The first way is you integrate, right? So you say: I read your Snowflake metadata, your Snowflake logs; I read your BigQuery logs; I hit the Tableau API; and I build a view of this world, connecting the data sets and assets that are in the organization. That's already happening; that's how Stemma works. That's going to happen, right?
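The "integrate" path can be sketched as stitching metadata pulled from several systems into one lineage view. Everything below is invented for illustration, the node shape, key formats, and function names; Amundsen's real extractor framework (databuilder) differs in detail.

```python
from dataclasses import dataclass, field

@dataclass
class CatalogNode:
    key: str                      # e.g. "snowflake://prod/orders" (made-up key format)
    kind: str                     # "table" or "dashboard"
    downstream: set = field(default_factory=set)  # keys of assets built on this one

def build_catalog(table_keys, dashboard_sources):
    """table_keys: table identifiers read from warehouse metadata/logs.
    dashboard_sources: {dashboard_key: [table_key, ...]} as reported by a BI API.
    Returns a dict of CatalogNode keyed by asset key, with lineage edges filled in."""
    nodes = {k: CatalogNode(k, "table") for k in table_keys}
    for dash, tables in dashboard_sources.items():
        nodes[dash] = CatalogNode(dash, "dashboard")
        for t in tables:
            # A dashboard referencing an unseen table still creates a table node,
            # so lineage survives incomplete warehouse scans.
            nodes.setdefault(t, CatalogNode(t, "table")).downstream.add(dash)
    return nodes
```

The point of the unified graph is exactly what Mark describes: once warehouse logs and BI APIs land in one structure, "which 15 dashboards are built on this table" becomes a lookup instead of a steward's chore.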
And the second way is that you embed in people's workflows. What I mean by that: Samantha, a data engineer, is doing a migration from HubSpot, the CRM, to Salesforce, the CRM. Everybody who's built sales analytics dashboards on top of Samantha's tables needs to migrate to the newly created Salesforce tables, and Samantha needs to tell this information to those users, right? That means there probably needs to be a migration tool that Samantha uses as she is migrating, saying, "Oh, I'm doing this, and as of now, I'm just about to migrate table three of four." Right?
That means even if it takes Samantha three weeks to do the whole migration, all her users know at any given time what the state of the migration is in this automated data catalog. And you can't always get all this information from integration. What you have to do is integrate on the production side, getting this metadata from workflows, like embedding in Samantha's migration workflow, but also go put this information in the users' workflows. If a dashboard hasn't been migrated, there should be a glaring banner on it, right? Like: hey Sam, you are using a dashboard that's powered by the HubSpot CRM data, and Samantha migrated it three days ago; please do not use this dashboard. We need to do both of these in the second category.
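The banner logic Mark sketches is just a join between migration state and dashboard lineage. A minimal sketch, with invented names and shapes (this is not Stemma's actual implementation):

```python
def stale_dashboards(dashboard_tables, migrated_tables):
    """dashboard_tables: {dashboard: [source tables it reads]}.
    migrated_tables: set of tables whose source system has been migrated away.
    Returns {dashboard: [stale tables]} for dashboards that should show a banner."""
    flagged = {}
    for dash, tables in dashboard_tables.items():
        stale = sorted(t for t in tables if t in migrated_tables)
        if stale:
            flagged[dash] = stale
    return flagged
```

As Samantha's migration tool adds tables to `migrated_tables`, each affected dashboard shows up here and gets its warning banner, without Samantha messaging anyone by hand.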
That's a really, really insightful point because these end up being effectively probabilistic solutions that you have to deliver when you're looking at, you know, a company with 10,000 people.
Human behavior changes really slowly. We hate to change. It's expensive to change. I have to think a new thought. It’s much easier to just do the thing that I did before. So it's nobody's fault. It's just how the human mind works.
So getting people to start to migrate: like you said, maybe I looked at the data and maybe I got the update and even knew it, but boy, I just didn't want to think about it again, so I'm still using the old thing. You have to have a sweeper, almost, kind of bringing people along probabilistically, to the point where you can then do the turn-down on the old system. Because you're like, yeah, we haven't had anybody look at the old thing for five days; now we can be relatively certain, right? There could still be a failure, but like you said, 80% before, right? When you think about your Bayesian curve, you're like, "Oh, now we're going to call this done." So I think your point is tremendously insightful, because we have so many data users who are trying to make decisions out of it.
And how do we change their behaviors at scale? What rewards do we put in? And also what are the sticks? How do we know when it's time to turn that thing off, knowing that it's going to generate some hate mail?
One of my favorite stories of, well, failure. A favorite in retrospect; I think it was a big failure on my part when I was actually in the middle of it.

The failures are so much more interesting as insight, actually.
Yeah. And this is around changing human behavior. So this happened at Lyft, and every Stemma customer has this problem too, but the Lyft version is super recognizable.
There's a channel called #analytics. This channel is where a data scientist goes to get ready to quit their job. The reason is that there are so many questions on this channel, and the data science team puts someone on-call every week to answer them. The questions range anywhere from "What is the source of truth for revenue data, sales data, ETA data?" to "Is this still the right thing for me to use?"
And there's absolutely no way one on-call person, who works on one small team in the company, has any context about all of this, right? So you're stretching their knowledge, and they are interrupted the entire week. The reason I bring this up is that I created Amundsen at Lyft, and I was like, we'll put all this information in Amundsen. Problem solved. People will stop asking questions on Slack. Right.
Let me share what happened after Amundsen was launched, because I was still at Lyft. Amundsen launches, and the fundamental questions, "What is the source of this? Who uses this? When was this last updated?", those actually get reduced, because that information is there.
We had to build some technology. Do you remember "Let me Google that for you"? So, a thing like that, right? Humans can do that, but you can build bots to do that too. At Lyft, mostly the humans did that.
But what I realized is that the questions being asked in that channel became more nuanced. There was an incident yesterday because of which drivers in a particular region weren't able to give rides to passengers, and it's our responsibility to reimburse those drivers. I need to figure out: who were all the drivers impacted by this incident? How many rides were requested of them? How much money would they have made that they didn't make? And then go issue that money. That is a nebulous question, and you can't start that question on a table page. That's got to be started on Slack. Right.
And so what I have learned is that in places where you can't change human behavior, which I am learning is more and more often the case, you need to inform the behavior and embed the new products and tools in it. Right?
So one thing that we built at Stemma, for example, is a Slack bot. We have a choice, right? I can ask people to have their conversations in Stemma, or I can say, hey, continue on the Slack path, and we will have a bot that actually links these conversations into the data catalog, so they aren't lost and they get embedded.
But when you start this conversation, you don't know which table it even relates to; that comes up 20 replies later, right? And that feature is the most loved feature at Stemma. That's a thing I learned from a failure at Lyft: lean on the human behavior and inform it, and embed your product in it, instead of changing it.
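The core of that Slack-bot idea can be sketched simply: scan a whole thread, since the table reference may only appear many replies in, and attach the conversation to any catalog table it mentions. The matching below is a naive `schema.table` regex invented for illustration; Stemma's actual product logic is not public.

```python
import re

def link_thread_to_tables(thread_replies, known_tables):
    """thread_replies: list of message strings from one Slack thread.
    known_tables: set of catalog table keys like "core.ride_payments".
    Returns the set of known tables mentioned anywhere in the thread."""
    linked = set()
    for reply in thread_replies:
        # Look for schema.table-shaped tokens anywhere in the message text.
        for token in re.findall(r"\w+\.\w+", reply):
            if token in known_tables:
                linked.add(token)
    return linked
```

Because the whole thread is scanned, a question asked in reply one still gets linked to the table named in reply twenty, which is exactly the "inform the behavior, don't change it" pattern.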
That is awesome. "Lean into the human behavior" is, I think, a super deep insight for product managers anywhere. And if you remember one thing from this conversation, I hope it's that, right? Because we do have this large-scale behavior to change. That's fascinating.
You also elevated the class of discussion that we're having, which is such a better use of human heartbeats.
So right now, everybody's actually in conversation about the fundamentals of the domain of the business, as opposed to the fundamentals of the domain of the infrastructure, which is really cool.
So it's been a privilege to get so much of your time, and I'm going to ask you for just a little bit more. You've done a lot, you've seen a lot, you've grown a lot. You've seen a bunch of things changing the industry. And our audience is coming along with you on the journey.
What is one resource or one piece of advice that you would give them as they move forward into the wild world of data?
Yeah, there are probably different types of audiences for your podcast, and maybe I'll share one piece of advice for these different kinds of people. The first, top of mind for me, are perhaps those that are beginning their data journey. And I remember when I was beginning my career journey, I would look at people who had been in the industry.
People like you, Sam, someone I admire and have followed for a long time, and think: oh, they've got it all figured out. Meanwhile, I don't know if I should work on the platform team or a product team; that's a common dilemma I hear. Should I be a data scientist or a product manager? Should I continue down the IC path, or become an EM?
A: you've got time, it's okay, chill a little bit. And B: the best way to find out is to actually do these things. So make a change. There's a chance that it's not going to pan out, but you will have concrete evidence on how you feel once you've made that change.

And so my story about becoming a product manager: I was an engineer at Cloudera, and I was like, I want to try something different. I don't quite know what it will be; it's either going to be an EM, an engineering manager, or a product manager. And I found that with my background, it was very easy for a company to say no to me being an engineering manager there, because I hadn't been an EM in the past. But it was much easier for them to open the door to a product manager responsibility. I'd written a book in the past, and done a lot of conference presentations, and was able to show my product thinking in ways that weren't intentional on my part.
And that opened the door for me to become a product manager, and I learned that I enjoyed it. So I think the piece of advice I'd have for those starting their careers is to take a chance.
That is awesome. Mark.
Thank you so much. We're wishing you great success with Stemma, with Amundsen, and with yourself in these super hard problems.
Thank you for having me. I've thoroughly enjoyed this conversation.