Season 2 · Episode 8
Metadata, Communities, and Architecture with Shirshanka Das
How can we evolve an expanding ecosystem of data technologies while making sense of the whole? Tune in to LinkedIn DataHub, and Acryl Data founder, Shirshanka Das, as he and Sam have a discussion on metadata at the center and specialization at the edge to sustainably scale data governance.
Founder of LinkedIn DataHub, Apache Gobblin, Acryl Data
Hi, this is Sam Ramji and you're listening to Open||Source||Data. Welcome Shirshanka.
Hi Sam. Really glad to be here. And I've been a fan of the open source data podcast, and I'm super excited to share what I'm building with all the listeners.
Well, thank you so much. You've done so much in open source data, and I'm really looking forward to hearing your answer to this question that we ask all our guests, which is, what does open source data mean to you?
Oh, that's a great question. Yeah. You know, having spent the last decade pretty much in the data ecosystem, I've been amazed by how much the open source ecosystem has affected and influenced the modern data stack.
I remember back in the day when we were just thinking about starting a new database project at LinkedIn, which ended up being Espresso. You know, MongoDB had just come out and Cassandra was just getting started. This was 2010, 2011, that kind of timeframe. And fast forward a decade, right, and you look around you and it's just all over the place, from Spark, Kafka, dbt, PyTorch, TensorFlow, Gobblin... it's just everywhere. That's not to say that non-open-source tools like Snowflake and Looker haven't made an impact. They've actually made a huge impact.
But when I look at the ecosystem as a whole, I think open source is important, but really what stands out is a community-based approach to building technology that has really made a difference to all of these individual projects. The projects that have actually succeeded.
And we're seeing this play out with DataHub, a project that is aimed at getting to the heart of data: metadata. Over the last six months that I've really dived into the community, I've seen how the community has developed a lot of connectors and code contributions that have made DataHub work excellently not just at LinkedIn, where it started, but at every single company it's now deployed at.
And it also gives us high-quality feedback for the things that we create. Every single product feature we create, we get instant feedback, and I think I've seen the same happen for all these different projects that have made an impact in open source. So this community-based approach to building open source is something that I'm seeing repeatedly as kind of the new way that has emerged in all of these different projects.
We recently crossed a thousand members in our Slack community. And we do monthly town halls that are attended by over 60, 70 people. But honestly, these are just numbers, right? And numbers don't mean anything. What really stands out is the engagement and the quality of discourse that ends up happening in the community.
You're debugging stack traces and troubleshooting ingestion issues and connectivity. And that's one end of the spectrum. And the other end of the spectrum is you're talking about the ideal way to model a data product in a data mesh implementation, and people are asking, what does data observability mean in the context of DataHub? And what does ML Ops mean for the kind of use cases you're trying to enable here.
So in a nutshell, I think you can build a product that stands the test of time if you really collaborate with the best practitioners in the field, and open source really gives you the ability to do that at a scale that is not possible when you're building closed source.
We've seen this play out with Cassandra and Kafka. I believe we are seeing this play out with projects like DataHub, and I'm super excited.
And you're one of the leaders in a central part of this sort of community of communities in data. You put together Metadata Day among other things. And as a community of communities, I find it fascinating to see how many people are contributing to multiple open source projects as you have. And the connectedness between them, not just like, oh, I got tired of this. I put it down and moved on to the next thing. It's like, I'm still working on this thing, but this other thing is adjacent - I need to work in it there. There seems to be kind of a warm osmosis between multiple projects in the open source data world right now.
Absolutely agree. In fact, my decade at LinkedIn actually split up into, I would say, three different segments. In the first segment I worked on stream data infrastructure. I worked on Databus, which was a change data capture system, kind of co-developed around the same time as Kafka. And then, you know, Kafka evolved and actually took on that whole space. It's been doing amazingly well.
And then I moved on to online data infrastructure. So I built a database. You don't build that twice or thrice. I mean, I built it once. I'm probably not going to build another database. But it was fun, and it runs most stuff at LinkedIn right now.
And then for the last six years, I spent my time as the tech lead of the big data platform team. And that really exposed me to this breadth, where I was responsible for data ingestion, data management, metadata management, the reporting platform, and the ML infrastructure pieces. And it led me to pause and realize that there are some central problems that the data community has created for itself to some extent, and I've been part of that: building, building, building. I think for the last decade we've been innovating like crazy, and it's natural. This tension between specialization and generalization is here to stay; it's not going to go away.
But I'm seeing that there's an interesting correlation between innovation and venture capital, and the two kind of combine to create what I would consider hyper-specialization in some cases. So you've got incumbents that are probably offering slightly dated solutions to solve X or Y for data. X could be data management, or Y could be governance.
And startups are racing to find a wedge to establish a foothold in enterprises. So obviously, in six months you're not going to rebuild data management and then say, "Oh, I have better data management." Instead you end up asking what part of data management isn't working. And so what you end up seeing is that these problem spaces are getting split up into multiple parts. You see this happening for ingestion, data prep, data compute, reverse ETL, data observability. For pretty much any buzzword that you've heard in the last few years, there is a segment that has been created.
And the combination of innovation and venture capital is flowing in to kind of establish that segment. And that's not necessarily bad. Innovation, I think, can only happen in pieces and can often only happen using seeds. But the persona that I find is really hurting at the end of all of this innovation is the customer, who earlier would just go and buy an Informatica and pair it up with their Oracle or Teradata or, you know, Netezza. And now they're scratching their heads, thinking: "I've got about a bazillion tools that have been either bought or adopted by my teams, and I still have to create a coherent data strategy out of all of this." Data is just falling through the cracks between one tool and the next, right?
And I think that's one of the fundamental problems that I've seen facing our community as we've built lots of tools: how do you actually connect them all together and, you know, make them sing? How do you build this integrated experience across all of the tools that are forming the new stack? There isn't even really one new stack. There's like a few different stacks. So how do you do that? It's something that I think is the central challenge for the next wave of technology that is emerging.
Yeah. The combination of open source as permissionless innovation, right, the ability to just start something that solves a problem that you understand maybe really well, or maybe just a little bit, times venture capital, certainly is creating a perfect storm. And, as you said, hyper-specialization. As you were speaking, you reminded me of Tim O'Reilly's great statement about the internet architecture, which is, you know, many pieces loosely joined.
The challenge, of course, is that when they're all loosely joined, there's no formalized architecture, right? One of the other things that he always pointed out about good open source projects is that there's an architecture of participation, but at this point we need a meta-architecture, right? How do we architect the architecture of all these pieces of data? Because there is so little prior art, right? So much of this used to be just cozy in a monolith, and just forget about the data. And so much of what we've done in pioneering distributed systems has really been about compute.
So what I'm fascinated by is what the next decade looks like as we start to create this art around data processing, and how do we do many pieces loosely joined? Or is that dead on arrival?
So I think this is a problem that you've been focusing on pretty heavily, because the generic term for data about data is metadata. You can define it many different ways, but I would love for you to share with our listeners: what does an ideal metadata platform look like to you, let's say five years from now?
That's a great question. And I have thought about it a lot, for sure. In fact, as you were talking, it reminded me of another famous paper, the End-to-End Argument, that kind of talks about how you build things that last for centuries, right? The internet has lasted us for quite a while.
And a lot of it is built on the principle that the pipes have to be dumb and the intelligence has to be at the edges. And you can take it too far sometimes and make the pipes so dumb that you've pushed too much intelligence to the edges, and everyone is reinventing what a dataset is and reinventing what a schema looks like.
And so there are other kinds of metadata platforms that are basically nothing but a stream of bytes that's just flowing from one tool to the other. And I don't think that quite works. I think there is a certain set of capabilities and concepts that do have to be modeled at the metadata fabric layer, if you will, that connects these different tools together. But it still needs to allow for extensibility, so that dbt can basically explain itself to maybe a Looker without the metadata platform getting in the way.
So that's from a data modeling, or metadata modeling, perspective. From a platform perspective, I think there are a few things that I have learned over my years of trying to build data infra and partially succeeding. One is that no one actually talks about metadata infrastructure. No one talks a lot about what it takes to build an amazing metadata store.
We have the Hive metastore. I mean, Oracle has a catalog, Teradata has a catalog. And now there are a few other data lake implementations emerging that are trying to solve the lots-of-partitions and lots-of-blobs problems.
But I think what has been sorely missing is the acknowledgement that metadata is a big data problem, and not just big data in the slow-moving big data sense, but in the high-velocity, lots-of-bits-and-bytes sense. And the fact that metadata is real-time. It's not yesterday's data, it's actually what changed right now. And so I think that the metadata platform of the future, and in fact the metadata platform I'm building, will provide free-flowing metadata for all of these systems to produce to and consume from. So you can then specialize at the edges and build a really compelling whatever you want to build. Whether you want to build a compelling data monitoring tool or a compelling data governance tool, you can build all of those specialized experiences, but the metadata fabric has to be so fast and so real-time that you don't face any loss of consistency. And looking back at that problem that I talked about, about data modeling, I think it has to be strongly typed, but extensible. So we cannot be doing schema on read and saying, "Oh, here's the property bag, let me see what it's got. Oh, it's got a dbt dot metadata dot schema in there. So let me just parse it and hope that dbt continues putting those bits in that property bag for the rest of its evolution." Right?
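To make "strongly typed, but extensible" a little more concrete, here is a minimal Python sketch. It is not DataHub's actual model or API; the registry, the `aspect` decorator, and the aspect names are all hypothetical and exist only to contrast typed, validate-on-write metadata with an untyped property bag.

```python
from dataclasses import dataclass
from typing import Dict, Type

# Hypothetical registry mapping an aspect name to its typed model.
ASPECT_REGISTRY: Dict[str, Type] = {}


def aspect(name: str):
    """Class decorator that registers a typed aspect under a name."""
    def register(cls):
        ASPECT_REGISTRY[name] = cls
        return cls
    return register


@aspect("schema")  # a core aspect owned by the platform
@dataclass(frozen=True)
class SchemaAspect:
    fields: Dict[str, str]  # column name -> type name


@aspect("dbt.model")  # an extension contributed by a tool, not the core
@dataclass(frozen=True)
class DbtModelAspect:
    materialization: str
    owner: str


def parse_aspect(name: str, payload: dict):
    """Validate a payload at write time instead of guessing at read time."""
    if name not in ASPECT_REGISTRY:
        raise KeyError(f"unknown aspect: {name}")
    # The dataclass constructor raises TypeError when a field is
    # missing or unexpected, so malformed metadata is rejected early.
    return ASPECT_REGISTRY[name](**payload)
```

A tool extends the model by registering a new typed aspect rather than stuffing keys into a shared dictionary, so a consumer never has to hope that a producer keeps writing the same keys.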
In terms of storage, metadata used to be a small data problem. Schemas, and how many schemas can you have? But now, when you start adding changes to schemas, when you start saying partitions are part of this, and then profiles of data attached to partitions are part of it, the number of nodes and edges that you need to store grows quite a lot. So you end up needing a metadata platform that can scale to the same kind of scale that your data platforms can scale to. So having pluggable storage and indexing is important.
And the last thing that I feel a metadata platform needs to do, and that I don't hear people talking about much, is appeal to both humans and systems. What do I mean by that? There are a lot of consumers in the market for a data discovery tool. And you go, okay, what does that look like? Well, it's a UI, it's an app, it's got pretty tags, it's got colors, I can talk to people over there, and I can find things. And is that important? Of course. But it's not just about you as a data scientist finding a data asset. It's also about the data compliance machinery and the classification machinery, all of those systems, also finding that same data set.
So your metadata platform needs to be consistent, not just for human consumption. It should also be consistent and delightful for system consumption, which means, you know, you need to have delightful APIs. You need to have all sorts of APIs. You should be able to produce to this thing and consume from this thing. You need to be able to GraphQL your way into the platform and GraphQL your way out of the platform.
At LinkedIn over the last six years, we started with exactly that search and discovery use case, and then ended up building compliance infrastructure sitting on top of this, data management infrastructure sitting on top of this, MLOps infrastructure sitting on top of this, feature registries, model registries. All of these use cases sit on top of a single metadata fabric. And the only way you can do that is by not treating it as a one-tool-at-a-time problem, but treating it as: I need the right metadata fabric, and then I can build tools on top.
So, platform first. And of course it's not enough. Don't get me wrong, a platform by itself, with just open APIs and standards, and just saying, "Hey, I've got an API, you know, have fun with it," doesn't work. It's the beginning. We are definitely focused on building these amazing tight-knit integrations between, you know, the favorite tools in the stack, so that when you deploy one of these systems, like DataHub along with maybe the Acryl Data bits, you can get these magical experiences when working with data. But it first starts with the foundations and then building the experiences on top.
There's so much to explore in that. The nature of a metadata fabric itself is fascinating. One of the things that I think we often forget as technologists is the fourth dimension, delta T, right? And what is data? What are we trying to do with it, really? We're trying to do cognition with data, right? So we're trying to create knowledge. Knowledge is contextual, it's highly localized. So there's a need to federate our metadata and to be able to have, much like maybe an effectively functioning human government, some decentralization, right? Some local control, some locality of reference. This thing might mean something a little bit more nuanced among this community of users and something different in another.
And then companies: one thing that LinkedIn clearly didn't really have to worry about was getting split up. But when we start to bring this to the enterprise, we have to think about delta T being, you know, corporate entities merging and separating. So those things also start to tear at the metadata fabric.
So there's a lot of fascinating work ahead to think about for companies beyond Silicon Valley giants. How do these things maintain their practicability over years, right, over decades? Because whatever we're installing right now is tomorrow's legacy.
Absolutely. And I see this kind of challenge and opportunity around federated data governance to be very aligned with what we're seeing in complex organizations.
LinkedIn actually did acquire companies. We had SlideShare, and we acquired a bunch of other companies as well over the years, Bright and a few others. And when I was tasked with leading GDPR, as in being the GDPR architect for LinkedIn, I actually found out about a lot of the consequences of those acquisitions only then.
We initially started out by saying, "We just have a member ID. What do you mean, are there any other IDs? We have just the LinkedIn member ID, and that's the identifier you get when you sign up for a LinkedIn account." And so our compliance taxonomy had member IDs. And then I found out there was a company called SlideShare that we had acquired, and they were like, "Oh, we have SlideShare IDs also." And so of course the taxonomy had to grow.
And while LinkedIn, in fact to this day, continues to be a very homogeneous culture on the same stack - there's not a ton of diversity - I do see that even in modern software practices, right? You will always refactor and you will always do team-based organization. And the microservices architecture, at least in terms of the domain construct, is here to stay. Right?
So there are teams. Teams own services, they own pages. They also own the exhaust from their services, which is really data. The services are producing events. The services are writing to databases. And all of that together is really a data product. And that's why I'm so excited about the whole data mesh movement that we're seeing, because in many ways it is getting to the heart of the question, which is: application teams have traditionally just said data is a downstream problem. And we're telling them, "No, you need to be aware, not just about your service interfaces. You know, you're proud of your service interface and you're proud of your API and you're watching them. Why wouldn't you do the same thing with your data?"
And the typical application developer's response to that is always, "Well, but my data has schemas, and so as long as I have schemas, I should be good, right?" And the reality is that that's at least a good starting point. In fact, I find frequently in our community that a lot of people are still producing JSON and saying, "Let the downstream handle it." In cases where people are not doing that, they're already one step ahead.
But what we're starting to see is the ability, in tools like DataHub, to actually take not just schemas from the source systems, but also metadata associated with the schemas. What that allows you to do is things like the famous business glossary. I actually did not know what a business glossary was until I left LinkedIn. I just knew about genomes and taxonomies, but I didn't quite know what a business glossary was. But it's a thing, and all the catalogs need to have business glossaries. So I went and read up about business glossaries and realized, okay, it's not terribly complex, it's just another way of structuring data and the understanding of data, and then linking it with data assets and columns and things like that.
And the interesting thing that happens when you take something like a business glossary is that you can finally talk about high-level concepts in a way that doesn't leak implementation details, like "that Postgres table over there has this data," right? You can just say the customer account has a balance, and the balance has certain privileges associated with it, and it has certain data handling rules that I need to ascribe to it. And then you say, okay, "How should we maintain this business glossary?" And a lot of times in the old-school catalogs, you go into the UI and you've got roles and permissions, and you're clicking and clicking and, you know, adding a few terms and adding a few nodes and constructing that tree sort of by hand inside the UI.
And we thought, why don't we rework the model and say this is just another schema repository? Why don't we manage it just like code, just like our schemas, and just check it into source control? And that really liberates the domain teams, because the domain teams now are not only getting a schema language in their IDE, they're also getting the ability to link to a business glossary in that same IDE. We don't have to teach them a new tool in which to go and click buttons. They can just treat business glossary terms as auto-complete in the IDE itself. And that's really good, because the developers can stay where they are and where they're comfortable. And the governance team gets what they want, which is all schemas coming in with compliance terms assigned, with business glossary terms assigned, and things like that.
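As a rough illustration of managing a business glossary like code, the Python sketch below lints a schema against a glossary before it is merged. The glossary terms, the schema layout, and the `lint_schema` helper are all invented for this example; they are not DataHub's file formats or tooling.

```python
from typing import List

# Hypothetical glossary, as it might be checked into source control.
GLOSSARY = {
    "CustomerAccount.Balance": {"handling": "confidential"},
    "CustomerAccount.Email": {"handling": "pii"},
}

# Hypothetical schema annotated with glossary terms by the owning team.
SCHEMA = {
    "name": "account_events",
    "fields": [
        {"name": "balance", "type": "double", "term": "CustomerAccount.Balance"},
        {"name": "email", "type": "string", "term": "CustomerAccount.Email"},
        {"name": "note", "type": "string"},
    ],
}


def lint_schema(schema: dict, glossary: dict) -> List[str]:
    """Return problems a CI check could report before the schema merges."""
    problems = []
    for f in schema["fields"]:
        term = f.get("term")
        if term is None:
            problems.append(f"{f['name']}: no glossary term assigned")
        elif term not in glossary:
            problems.append(f"{f['name']}: unknown glossary term {term!r}")
    return problems


print(lint_schema(SCHEMA, GLOSSARY))
```

Run as a pre-merge check, something like this lets developers stay in their editor and repository while the governance team still sees every field arrive with a term assigned, or a clear report of which ones are missing.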
And I think that leads to this ability to have federated data governance, which is a fascinating movement for metadata. We're able to shift metadata quality problems left, um, metadata that can be assigned at source. Of course it will be wrong. Like data owners frequently don't know exactly what they're putting into their data sets, so they will get it wrong, but they will get it less wrong than they would if they had no idea what they were supposed to do or what their responsibilities were.
Yeah, it's all about being able to create traceability so that people can see what they've done, walk back and then iterate. I think that's so much of what we're learning in the world of future automation. Right? We're gonna get it wrong because we're building hard things about problems that we only barely understand.
It's by trying it out that we start to understand the problem, and we say, "Oh, next time we're going to get it right." But this idea of iterability, treating everything as code, right, and shifting left, that whole movement is really about being able to turn this all back into a practice of software development.
And it's been fascinating to me to see so many software development practices creep into data. And some of those that you've focused on, both at LinkedIn and now, are around data quality and data observability. When I think about software quality, that's kind of obvious. We have a lot of test suites. We have fuzz tests; there's a whole set of priorities there. When I think about software observability, we think about, you know, X-Ray or Jaeger or, you know, any number of ways of understanding what's actually happening as built, not as designed.
So how does what you're doing in metadata and metadata catalogs intersect with these sorts of software approaches, as we look at data quality and data observability?
Well, it's, it's actually interesting, again, similarly like this data catalog word, it means some things to some people and it kind of brings in certain connotations. So most people think that a data catalog should only have schemas and storage. That's it, that's where it stops.
But what we're seeing is that the community is pulling us toward saying, "No, no, no, that's not where a data catalog stops. I need to get more signal into whether this data set that I'm looking at is even alive. Is it a living, breathing data set, or did it die two years ago and I don't even know about it?"
So it starts, even just from a discovery perspective, with being able to feed operational signals and live data health and quality scores into that system, to be able to say, "Is this data set something I should be interacting with? Can I depend on this data set for real?" And I often think about how Google has evolved over the years. Google started with just being 10 blue links, and just being amazing at giving you those 10 blue links.
But if you look now and you search for a restaurant, you're not only finding the restaurant, you're getting a knowledge card that tells you what the restaurant is about, but you're also seeing if it's busy at this time, and whether you should go to that restaurant now or not, and you know what people are saying about it.
So all of that live context that you get is really improving your ability to interact with that restaurant and decide whether it's the right time to visit it or not. And I think we're seeing the same thing happen for data and metadata as well. So data catalogs are moving from being just schemas and tables to being, in some sense, both wider and deeper.
When I talk about wider, it's about having everything about everything. It's not just data sets, it's features, models, dashboards, people, teams, and even code, right? It's getting that whole nexus together and saying, "How can I get the most comprehensive, widest view of my data graph?"
And then deeper, because on each one of these, it's the delta. It's not just about what it looked like at some point in time, but about how it is changing over time. And when you do that, metadata itself starts looking like a stream, and it becomes voluminous in nature. And so the past approaches of treating metadata as a small data problem are just not working. So you need the ability to ingest data freshness signals and completeness signals, and the ability to deal with versioned metadata.
In fact, at Metadata Day we were having a discussion about it. Phil Bernstein has done this, and he's moved on to other things. He was like, "Yeah, you need to do versions, and you need to have a graph of versions, and then be able to have time dimensions, and time is also a version, and then temporal data points."
And then at run time, or maybe at read time, you have to resolve it all. When someone asks a question about a certain Hive dataset, you have to literally walk that graph of lots of different partitions and lots of different evolutions of that same Hive dataset over time, across different clusters, and then say, "Okay, here's the resolved entity. And here's what you need to know about it."
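One way to picture resolving versioned metadata when a question is asked is to fold a log of timestamped changes into a single view of an entity. The Python sketch below does only that. The `MetadataChange` record and `resolve` function are hypothetical, and a real platform would also need to handle partitions, clusters, and indexing, which this ignores.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass(frozen=True)
class MetadataChange:
    entity: str      # e.g. a dataset identifier
    aspect: str      # e.g. "schema" or "freshness"
    timestamp: int   # event time, in seconds
    value: Any


def resolve(log: List[MetadataChange], entity: str,
            as_of: Optional[int] = None) -> Dict[str, Any]:
    """Fold the change log into the latest value per aspect at `as_of`."""
    resolved: Dict[str, Any] = {}
    for change in sorted(log, key=lambda c: c.timestamp):
        if change.entity != entity:
            continue
        if as_of is not None and change.timestamp > as_of:
            break  # the log is sorted, so later changes are also too new
        resolved[change.aspect] = change.value
    return resolved


log = [
    MetadataChange("hive.orders", "schema", 100, {"id": "int"}),
    MetadataChange("hive.orders", "freshness", 150, "2h"),
    MetadataChange("hive.orders", "schema", 200, {"id": "int", "total": "double"}),
]
print(resolve(log, "hive.orders", as_of=160))
print(resolve(log, "hive.orders"))
```

Because every change is kept, the same log answers both "what does this dataset look like now" and "what did it look like at time T", which is the time dimension being described.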
So the ability to deal with time series data, to detect patterns and anomalies, all the things that we've got experience with in service and data infra, we have to apply to data catalog infra, or metadata platforms, if you want to think of it that way.
And so I think the intersection is bound to happen. You will basically want to push down these concerns of storage and indexing and representation of all kinds of metadata into a single system. And then, like I said, build differentiated experiences on top. So you don't have to pack all the features into one tool; you can build multiple specialized interfaces that allow different personas to interact with the metadata graph in the way that they want to. But the platform has to be all-singing, all-dancing in some sense.
Yeah. I love the term "metadata graph", because it gives a sense of what the architecture might be that allows you to do your inferential reasoning about what's there. Because all of this is about meaning creation, right? We're trying to structure an ontology of the thoughts we think about the real world. So that's going to be powered by some level of inferential computation. So where do I go to find that stuff out? That's all going to be traversing this powerful metadata graph.
I didn't use the famous "knowledge graph" phrase yet. But honestly, it is important, and it shows up in very simple places. "I am Airflow, and I want to publish the lineage between dataset A and dataset B."
My understanding of that dataset might be different in subtle ways from the Hive catalog's understanding of that dataset, or the Kafka schema registry's understanding of that dataset. Just in very simple ways. Maybe the IP address is just a little bit different for the hostname that I'm connecting to. Or maybe I have a connection name that is registered in Airflow that defines my connectivity to the source, and that's different in different systems. And so entity resolution is one of the foundational pieces that needs to be built as part of the metadata graph.
You can pretend that everyone will get the IDs right, and, you know, you can only go so far with it. People don't get it right. Systems cannot get the IDs right most of the time, right?
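The entity resolution problem described here can be sketched in a few lines of Python: different systems refer to the same dataset through an IP address, a hostname, or a connection name, and the metadata graph has to map all of them onto one key. The alias table and `canonical_id` function below are purely illustrative assumptions, and lowercasing names is itself an assumption that would be wrong for case-sensitive systems.

```python
from typing import Dict, Tuple

# Hypothetical alias table: each system's name for a host or connection,
# mapped to one canonical instance name.
INSTANCE_ALIASES: Dict[str, str] = {
    "10.0.0.12": "warehouse-prod",
    "hive.prod.example.com": "warehouse-prod",
    "airflow_conn_hive_default": "warehouse-prod",
}


def canonical_id(platform: str, instance: str, name: str) -> Tuple[str, str, str]:
    """Map a system-specific reference to one canonical dataset key."""
    # Unknown instances pass through unchanged rather than being guessed.
    resolved_instance = INSTANCE_ALIASES.get(instance, instance)
    return (platform.lower(), resolved_instance, name.strip().lower())


# Airflow and the Hive catalog describe the same table differently.
from_airflow = canonical_id("hive", "airflow_conn_hive_default", "sales.orders")
from_catalog = canonical_id("Hive", "10.0.0.12", "Sales.Orders")
print(from_airflow == from_catalog)
```

Without a shared key like this, lineage published by one system and schemas published by another would land on two different nodes in the graph, even though they describe the same table.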
It would be interesting to see how the Kubernetes architecture and APIs and control plane end up influencing our ability to do this kind of architecture at scale, because many of these problems really should be left as configuration, and they should be resolved by a more competent and unified control plane. So it's neat to see computation crashing into distributed data in these interesting ways, even in where we can locate metadata. If you can count on Kubernetes as your common environment, you could say, "Well, you know, you've got etcd, and that's available in every cluster. And you can communicate between clusters around that." So we can probably store copies of parts of the metadata there, right? Much like you might have a CRD, right, for a Kube custom resource. How do we understand these things? It's an amazing world to be stepping into, for sure.
And I think that puts a whole different point on the phrase "data discovery", because when we talk about service discovery, we don't talk about people going to tools and searching for services. When we say service discovery, we're basically saying a service needs to do a DNS lookup of a service ID and just be able to talk to the service.
And funnily enough, I think we're missing that for data today. That exact parallel we are actually missing. We're talking about humans discovering data, and this is important, but I think compute discovery of data in a standardized way is actually an unsolved problem. We have lots of different metastores inside different systems that support it. But I think the industry is still missing the metastore to rule them all, in some sense.
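To illustrate the parallel with service discovery, here is a small Python sketch in which a job asks a registry for a dataset by logical name and gets back where it physically lives. The `DatasetRegistry` class, its methods, and the example URI are hypothetical stand-ins and do not describe any existing metastore.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class DatasetLocation:
    platform: str
    uri: str
    schema_version: int


class DatasetRegistry:
    """Hypothetical in-memory stand-in for a shared metastore."""

    def __init__(self) -> None:
        self._locations: Dict[str, DatasetLocation] = {}

    def register(self, logical_name: str, location: DatasetLocation) -> None:
        self._locations[logical_name] = location

    def lookup(self, logical_name: str) -> DatasetLocation:
        # Analogous to a DNS lookup: callers know the name, not the address.
        try:
            return self._locations[logical_name]
        except KeyError:
            raise LookupError(f"no dataset registered as {logical_name!r}") from None


registry = DatasetRegistry()
registry.register(
    "sales.orders",
    DatasetLocation("hive", "hdfs://warehouse/sales/orders", schema_version=3),
)
print(registry.lookup("sales.orders").uri)
```

The point of the sketch is that the consuming job never hard-codes a cluster or path, so the dataset can move or evolve behind a stable name, much as a service can move behind a stable DNS name.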
Yeah. And that is certainly going to have to be developed in open source.
Shirshanka, it's been an absolute blast speaking with you. And one of the things we like to do is to end our episodes asking for one resource, or one piece of advice that you'd like to leave with our audience.
Well, I'll take your offer and give two things. The first is a resource, which is slack.datahubproject.io - that's where our community lives. So if you want to learn more about the DataHub community and participate and have fun conversations about metadata, whether it should be free flowing or should be locked up in silos, that's the place to be.
And the piece of advice. If you haven't figured it out already, I firmly believe that hyperspecialization is in the end, hurting the customer. And so as everyone makes their choices around tools, it's important to look behind the tool a little bit, and make sure that you are choosing the right platform. Especially when it comes to choosing something like a data catalog. The platform has to be good enough that it can generalize well to multiple use cases. Otherwise it won't really stand the test of time. And one year or two years later, you'll be back in the market looking for another tool.
You don't talk about Kafka as a tool. You don't talk about Cassandra as a tool. You don't talk about Snowflake as a tool. And so similarly, you should choose your data catalog wisely. Features, you know, can always be built on top of a good architecture. Fundamental principles, I think, are much harder to change.
So I'm a technologist first. So when I'm doing due diligence on other technologies, when I was doing that in the past, I would always look beyond, you know, the demo and that, that shiny integration and look at how is this thing built and will it actually scale with us? Will it actually grow with us as we push its limits to do different things?
So I would just say from an advice perspective, look beyond that glass pane and try to understand how the thing you're thinking of adopting is built before making that decision. I think it's really important.
That's outstanding. Shirshanka, thank you so much for your, uh, for your time and sharing your awesome experience with us.
Thank you so much, Sam. It's been great and I'm actually looking forward to figuring out what Kubernetes can do for data. It's a fascinating conversation. But maybe next time.
I can't wait.