Data Meshes: Big Data Architecture Becoming Distributed, Declarative and Domain Oriented
Beyond the Data Lake, the 2017 paper by Zhamak Dehghani, Director of Emerging Technologies at ThoughtWorks, was a guiding light for Sam Ramji at another point in his career. Listen to how a data mesh allows the composition of multi-model data across an organization and beyond.
Director of Emerging Technologies at ThoughtWorks North America
Zhamak Dehghani: One of the pillars, or the four principles, I have for data mesh, well, the last principle is this: we need a new model for governing data at scale when it's decentralized.
Sam Ramji: I'm Sam Ramji and this is Open Source Data. I'm here with Zhamak Dehghani, who works with ThoughtWorks as the Director of Emerging Technologies in North America. Now her focus is on distributed systems and data architecture and she has a deep passion for decentralized technology solutions, data mesh, decentralized trust and identity networking. She's a distributed computing nerd so this is going to be a lot of fun.
Sam Ramji: Zhamak introduced the concept of data mesh in 2018 in a paper called, Beyond the Data Lake, and since has been evangelizing the concept with the wider industry. She's a member of the ThoughtWorks technology advisory board and contributes to the creation of ThoughtWorks Technology Radar. Her background is over 20 years as a software engineer and architect and she's contributed to multiple patents in distributed computing communications as well as embedded device technologies. Zhamak, welcome.
Zhamak Dehghani: Hi Sam. Thank you for having me.
Sam Ramji: I'm going to start with sort of a standard question to explore the meaning of our show. What does open source data mean to you?
Zhamak Dehghani: I guess I can kind of decouple the words a little bit and see how I see it. Open source, well, it's openly available or openly sourced and data I suppose here could mean the data itself. An open source data can take us down wonderful conversations that might get us to self sovereign data and data that individuals own and share. And we could also, I guess, look at it as open source data related technologies and tooling and how can we provide kind of openly sourced and shared tooling that would then enable sharing of the data at scale. We can, I guess, look at it from multiple angles. I don't have one smart answer for you.
Sam Ramji: Yeah. And I think the purpose of that question is really to explore the space of meaning. The ontology of these three words, because we're exploring this area and we're looking into the semantics and the semiotics and how do we create language around all of these new practices? Which was really what got me intrigued by what you were doing. I ran across your keynote on data meshes at ThoughtWorks Virtual Conference last month. I loved it. It's completely coherent with what we built at Google and what I got to learn from the technical infrastructure team who built the capability for 44,000 Google engineers to constantly build new capabilities. But so few of those kinds of architectures are really carefully described and very few of them are transmitted outside.
Sam Ramji: You came up with it independently and I ran across your first work in the paper, Beyond the Data Lake when I was at Autodesk, which was a job I had between Google and DataStax, a leading cloud platform. I would love to get a sense of your inspiration for our audience on data meshes and the kind of journey that you took to go from distributed computing into talking about larger-scale data architectures in the next generation.
Zhamak Dehghani: Sure. I'm really excited that you came across it and we crossed paths and we're here today talking about it. I'll try to contain my excitement as we go through this conversation. I think the origin of this thought or hypothesis that ended up being data mesh as an approach was a point in time when I felt sheer frustration and a sense of crisis. And I guess these are the points in time when you start thinking out of the box and question the status quo. And it was a moment when I had the privilege of, I guess, working with multiple clients on the West Coast of the US, particularly in the Bay Area. They were all large clients as well as very rich clients in terms of domains. They were in retail, they were in tech, they were in healthcare. And they had this challenge and question for us: what is your point of view on data architecture?
Zhamak Dehghani: Because we are struggling to scale. I guess to scale at two different points. Scale using our data in a variety of use cases, in a diverse set of use cases that are going to fuel our innovation and becoming data-driven and ultimately compete using data. And we are failing to scale getting our arms around, I guess, the diversity of the different domains the data is coming from. They were not struggling necessarily with the volume of the data, with how many petabytes they had, but more with the diversity of these domains that were generating the data and being able to get value out of that. And at that point, I started working with our head of data architecture, Ken Collier, a colleague of mine who's been in the industry for decades in that particular big data analytics space and BI and warehousing, and started looking under the hood to see what's going on here.
Zhamak Dehghani: And to my surprise, I felt the technology and approach and solutioning and mindset were really probably a decade behind what I had seen in an operational world. With microservices, with Kubernetes, with all of the technology and advancement that came to exist to decentralize and democratize creating capabilities through their services. I had this moment of crisis. And I love, I guess the narrative of Thomas Kuhn, who is a philosopher of science, and his observation that scientific revolution happens when scientists start to see things that don't fit their paradigms anymore. And they start seeing anomalies and going to the mode of crisis, and I thought there has to be a better way. And from that, the thinking around data mesh started and convergence of everything I had learned and we had learned as an industry, I think in the decentralization of operational systems, towards microservices and everything that evolved to support that. And let's bring that to the world of big data and analytics and data warehousing. Let's break that apart and see what else this world could look like.
Sam Ramji: It's a powerful way to come at things, realizing that the current paradigm is not working. And I think that was what attracted us all in the cleverly named Data Architects Working Group, or DAWG, at Autodesk to your paper, because we had adopted a lot of Hadoop for analytical purposes. But then of course you have the slow path of data engineering and making sure that you've got effectively your daily builds and a pipeline of data for analytics that decreasingly few people can actually use and contribute to. And so the team had built out this multi-petabyte data environment for application rate data on Cassandra. And that was one of the things that brought Cassandra to mind as well. As we were trying to balance the force and how to answer all the questions that you have about all of this operational data and then all of this sort of reporting time data ... your paper was like a light in the dark.
Sam Ramji: One of the big challenges that you point at both in your response and in your writing is that it's really not just about the technology. It has to do with the organization, who's doing what. You talk about domain-driven design, which is really important. And one of the things that I think many people forget is when Martin Fowler and the ThoughtWorks team established the term microservices, it was an organizational construct first and a technical construct second. A lot of these overlaps that you're bringing to bear, how do we scale? All kind of come down to this idea that it's our people and our communications and our technology together that have to shift to solve new problems. Is that a fair way to understand what you're saying?
Zhamak Dehghani: Absolutely. Absolutely. I think for a really long time, we all have been talking about Conway's law. That the way we structure our architectures and technology is representative of how the organizations are structured and the line of communication established. And I think the more and more I work with organizations at that level, at the level of enterprise architecture, the more I realize architecture and organizational structure and even value systems, the cultures are so interdependent and one influences the other.
Zhamak Dehghani: As you said, I think the very first pillar or principle under data mesh is the decentralization of the data ownership as well as architecture around the boundary of domains, and giving that responsibility and accountability to the people who are most intimately familiar with that data. This is the domain of, I don't know, order management or customer management, the people who are day to day dealing with that data, dealing with the orders, dealing with the people. Why, when it comes to both operational and analytical data ownership, are we shying away from giving that autonomy and responsibility to them and creating an architecture and platform that would enable this decentralization around the boundary of domains?
Sam Ramji: Well, it's a big shift because the way we've thought about our domain architecture has been more process-oriented and more imperative, whereas what you're bringing with domain-driven design and Kubernetes is really data-oriented and declarative. It's a very different way to look at things. It's a much more powerful way because you can express goals to a system that it can seek, if you're going at this from a declarative perspective. If you're coming at it in an imperative way, then you kind of have to tell it every step in the process and when those steps break, the system doesn't have a chance to recover.
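To make the declarative idea concrete, here is a minimal Python sketch of a goal-seeking loop. It is only an illustration under stated assumptions: the `Cluster` class and the `reconcile` function are hypothetical names invented for this example, not the API of Kubernetes or of any system mentioned in the conversation. The caller states a goal (a replica count) and the loop works out the steps, so the same declaration also repairs the system after a failure.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Cluster:
    """Toy system (hypothetical) whose only state is a count of running replicas."""
    running: int = 0


def reconcile(desired_replicas: int, cluster: Cluster) -> List[str]:
    """Compare the declared goal with the observed state and return the actions taken."""
    actions = []
    while cluster.running < desired_replicas:
        cluster.running += 1
        actions.append("start")
    while cluster.running > desired_replicas:
        cluster.running -= 1
        actions.append("stop")
    return actions


cluster = Cluster()
first = reconcile(3, cluster)    # converge from nothing to the declared goal
cluster.running -= 1             # simulate one replica crashing
second = reconcile(3, cluster)   # the same declaration repairs the drift
```

An imperative script would have issued three "start" commands once and had no way to notice the crash; here the goal is re-evaluated against the observed state each time.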
Zhamak Dehghani: Yeah, absolutely. And then there are places, I think within the domain that both models complement each other, but definitely, in the kind of analytical as we become more intelligently augmented and introduce more ML and AI, we would need to have that declarative, the data-centric view and the data-centric access models to the data, rather than kind of, as you mentioned, the imperative capability driven access models. They might still exist, they will exist, but there's more and more emphasis and need for modern data-oriented or data-centric and particularly analytical data-centric as a temporal view of the world that continuously changes and provides access to that data.
Sam Ramji: Yeah, as the arrow of time passes my cursor at any point in time, it starts to look different. It's got a different value to look back a week or a month or a year, depending on the time of year I'm in. One of the challenges that I've run across is microservices 1.0, where we kind of told teams, "Hey, go for broke. The only thing that matters is velocity. You can do anything you want. All the tools you want. Oh and anybody who wants to tell you how to do it, just ignore them. You've got a two-pizza team, you can do anything that you want." But then you kind of advance that a year or two down the road and you've got a team that's got three different databases that they're managing and they've got a couple of APIs and they've got this microservice. And then a couple of people leave the team and it goes into maintenance mode.
Sam Ramji: Nobody knows quite exactly how to operate it. And then the big question comes: hey, we have this data product that we want to launch and it needs access to some of the data that's in the microservice, but we don't want it represented the way that you have it in the microservice. And so then you're stuck in kind of the tyranny of microservices 1.0. The way that I've been coming at data meshes is to look at that problem of data product velocity, and to say, "How do you fix it?" Well, again, how Google fixed it was to turn that problem on its side and say, "All the data is open access."
Sam Ramji: The microservices team shouldn't expect to manage state. And in fact, if you started writing your own custom state handling at Google as an engineer, you were in very deep trouble because a very small number of engineers were supposed to manage state for the entire corporation. Thinking about data platforms is a very different approach, which allows you to think about this architecture that you're describing as a data mesh. Very curious about how you see the advancement of application design and development in microservices meeting some new analytically oriented control architecture that you have in a data mesh.
Zhamak Dehghani: Yeah, actually, I'm super curious to go deeper, perhaps in a different conversation, to see how that implementation was done at Google. But I guess I've kind of been establishing the details of implementation under the principle of progressive disclosure of complexity, going at it top down rather than bottom up. That's where we are with the implementation right now. And it may not be the best implementation because, as you can imagine, we are still stuck using the tooling and the technology and the platforms that exist. We're not a product company. We use other people's products to build solutions. It may not be the best way, but the way we've been building it and imagining it is that microservices basically satisfy the needs of the business and the operations that they perform through the APIs.
Zhamak Dehghani: And they will provide the state changes and all of the facts that they would normally probably discard because they just keep the latest state of the system. They provide those facts as an input to this adjacent thing that now we need to build, we call it a data product, which is a node on the data mesh. And that thing, that node on the mesh, then is responsible for kind of getting that data. From a microservice perspective it has a very small retention policy or short retention policy, but the data product on this side is responsible for long retention, maybe infinite retention of the data, depending on what the definition of infinite within that domain is. And storing that data and providing this multimodal access to that now temporal long-term data that is accumulated and received from microservices, and providing that multimodal access to a variety of use cases, because that analytical or temporal or longstanding data has many diverse use cases.
Zhamak Dehghani: And reporting is one use case. And people who write reports would love to use SQL. It also feeds kind of the feature store or the feature design, or feature engineering, for machine learning folks, and they would want to access that very differently, based on columnar kind of access, and so on and so on. There are different ways of getting to that multimodal. That thing, we call it, I guess, the quantum of architecture, the smallest unit of architecture on this data mesh, sits adjacent to a microservice and its job is slightly different. It's not necessarily to run the business and run the API, it's to provide now this analytical view of the world with multimodal access. And I think there is a ton of white space in terms of innovating how to build that, because now all of those microservices teams or domains that are building the microservices are not only responsible for building their microservices, but they're responsible for building their data products, because they know that domain or that data best. And I think that's where innovation can happen.
Sam Ramji: Yeah. There's a really nice language construction that you're performing by mirroring the term service mesh, which kind of sits above microservices and has the responsibility of distributing access and also enforcing aspect-oriented policy across all of them. And with the data mesh, you can easily imagine that being a symmetric reflection below the microservice where it has the same job, but the temporal assignment is a little bit different, but it is providing consistent access to the data that is left behind by the microservice while also enforcing aspect-oriented policy across that whole estate. Is that a reasonable reconstruction of what you're saying?
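The data product described above can be sketched in a few lines of Python. This is a toy under stated assumptions, not the implementation discussed in the conversation: the `OrdersDataProduct` class, its method names, and the `order_events` table are all hypothetical, and an in-memory SQLite database stands in for long-retention storage. It shows one node that appends every fact it receives from a microservice and then serves the accumulated history two ways: SQL for report writers and a column-oriented view for feature engineering.

```python
import sqlite3
from typing import Dict, List


class OrdersDataProduct:
    """Hypothetical data product: keeps every fact it receives, serves it two ways."""

    def __init__(self) -> None:
        # In-memory SQLite stands in for long-retention storage.
        self._db = sqlite3.connect(":memory:")
        self._db.execute(
            "CREATE TABLE order_events "
            "(order_id TEXT, status TEXT, amount REAL, ts INTEGER)"
        )

    def ingest(self, order_id: str, status: str, amount: float, ts: int) -> None:
        """Append a state-change fact from the microservice; nothing is overwritten."""
        self._db.execute(
            "INSERT INTO order_events VALUES (?, ?, ?, ?)",
            (order_id, status, amount, ts),
        )

    def sql(self, query: str) -> List[tuple]:
        """Row-oriented access for report writers."""
        return self._db.execute(query).fetchall()

    def columns(self) -> Dict[str, list]:
        """Column-oriented access, for example for feature engineering."""
        cursor = self._db.execute(
            "SELECT order_id, status, amount, ts FROM order_events ORDER BY ts"
        )
        names = [description[0] for description in cursor.description]
        rows = cursor.fetchall()
        return {name: [row[i] for row in rows] for i, name in enumerate(names)}


product = OrdersDataProduct()
product.ingest("o-1", "placed", 40.0, 1)
product.ingest("o-1", "shipped", 40.0, 2)   # the earlier "placed" fact is retained
product.ingest("o-2", "placed", 15.5, 3)

placed_total = product.sql(
    "SELECT SUM(amount) FROM order_events WHERE status = 'placed'"
)
feature_columns = product.columns()
```

A microservice that keeps only the latest state would hold just "shipped" for order o-1; the data product keeps both facts, which is what makes the temporal view possible.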
Zhamak Dehghani: Oh, that was just so wonderfully put. All that rambling I did, you just put it in one beautiful sentence.
Sam Ramji: Well you did the hard work, I just summarized it.
Zhamak Dehghani: I must say I didn't have a name for this for quite a while. It was called Beyond the Lake and then 2017, one of the, I guess, most exciting moments in my life in 2017 was when, I think, folks open-sourced the very first revisions of Istio. I fell in love with the concept of service mesh. It's just such a beautiful concept, abstracting away a lot of cross-cutting concerns into a layer that is, in a way, invisible to the application developer. I was in that world for quite a while and that inspired the thinking behind the data mesh, but it's just, how can we map that to the world? To the big data world that is slightly different, it has slightly different concerns. And those, I suppose, affordances, the interfaces, and capabilities as cross-cutting concerns that now we need to implement for each of these nodes, for these kinds of data products, are slightly different.
Zhamak Dehghani: There definitely is overlap between the services, but there are other concerns that are very specific to data, for example, being able to correlate and mesh data across different domains that now are being provided by different groups of people. Be able to explore and understand and discover data across the mesh. There are a set of affordances that I think the data mesh, that mesh layer can provide as a platform capability, invisible to the folks that are really creating those data sets as part of the platform.
Sam Ramji: And that's a great summary of the positive space, kind of creative elements that get solved by the data mesh. There's an, I'd argue, equally important sort of governance, sort of negative space that it also takes care of, because the policy has got to include economic policy. How much is it worth to hold which data for how long? Retention policy, which gets driven by regulatory requirements. Fuzzing policy, security access, who can see all of the data? Who can see part of the data? And how does that get carried around through the system? What's the provenance, in fact, of any piece of data that you end up bringing into your ML teams for your feature modeling and putting in your feature store?
Sam Ramji: And that is such a vast space of complexity, it can't possibly be done by hand. We still don't have great tooling and great ways to talk about it so the language and the paradigm aren't quite there yet, but over the next decade, it's embarrassingly obvious that ML is going to transform every organization in the world. This is a key piece that we need to do to reduce all of these problems down to a computational policy that we can apply to data. And it seems to me that your data mesh construct is the one that we can use to run it.
Zhamak Dehghani: Yeah. I think governance is one of those scary, somewhat boring words, but it's so essential and it needs, I guess, fresh thinking. One of the pillars, or the four principles, I have for data mesh, well, the last principle is this: we need a new model for governing data at scale when it's decentralized. The past paradigms around governance were very much centralized: you have a centralized team that, with a lot of manual effort, tries to enforce and control the policies that you mentioned. Because the solution was fairly centralized, maybe they could do it okay, but there were always points of friction and points of frustration for people that actually want to use the data. But they had a really good, important job to do. The security concerns that you mentioned, the levels of encryption that they have to apply. In the new world with decentralization, we have no choice but to automate, automate, automate.
Zhamak Dehghani: A lot of those concerns were just manually being controlled or manually being put in place and pushing that into that kind of the platform layer of the mesh. And then there are certain decisions that need to be made and the policies that you mentioned, some of those decisions can be pushed locally to those domains because they are local concerns. For example, the modeling of the data for a particular domain can be left to that domain to decide. And then there are decisions that we need to all globally agree upon and they're mostly around interoperability. When the data crosses one boundary to another, what are those standards that we need to agree upon so that we can meaningfully use this data that comes from different places?
Zhamak Dehghani: And those globally governed, I suppose, policies can be made in a federated fashion rather than a centralized fashion. I think from the organizational structure to automation of the policy, as you mentioned, and the value system to embrace change as a fundamental piece, rather than try to enforce the static structures are very important pieces. And that makes me excited. I wouldn't normally get excited about the word governance.
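The split between local decisions and globally agreed interoperability standards can be illustrated with a small automated check. This Python sketch rests on assumptions: the field names (`owner`, `schema_format`, `customer_id_type`, `model`) and the allowed values are invented for illustration and do not come from any published data mesh specification. The check enforces only the global rules and deliberately ignores fields a domain decides for itself, such as how it models its data.

```python
from typing import Dict, List

# Hypothetical, globally agreed (federated) interoperability rules that apply
# whenever data crosses a domain boundary.
GLOBAL_REQUIRED_FIELDS = {"owner", "schema_format", "customer_id_type"}
GLOBAL_ALLOWED_VALUES = {
    "schema_format": {"avro", "json-schema"},
    "customer_id_type": {"uuid"},
}


def check_global_policy(descriptor: Dict[str, str]) -> List[str]:
    """Return violations of the global rules; locally decided fields are ignored."""
    missing = sorted(GLOBAL_REQUIRED_FIELDS - set(descriptor))
    violations = [f"missing field: {name}" for name in missing]
    for name, allowed in sorted(GLOBAL_ALLOWED_VALUES.items()):
        if name in descriptor and descriptor[name] not in allowed:
            violations.append(f"{name}={descriptor[name]!r} is not an agreed standard")
    return violations


# "model" is a local decision, so the check never looks at it.
orders = {
    "owner": "orders-team",
    "schema_format": "avro",
    "customer_id_type": "uuid",
    "model": "event-sourced",
}
billing = {"owner": "billing-team", "schema_format": "csv", "model": "star-schema"}

orders_violations = check_global_policy(orders)
billing_violations = check_global_policy(billing)
```

Because the rules are plain data, a federated group of domain representatives could change them in one place and every data product would be re-checked automatically.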
Sam Ramji: But it's going to save so much time and so much headache. I remember at Google when we had to go through GDPR compliance prep, it didn't actually take each team that long because Google had a beautiful model for declarative documentation. Effectively you can imagine a YAML file of sorts, but you could declare the information about the data that your application managed and that declaration could then be computationally assessed to determine its risk and remediation requirements for GDPR. That's a really important thing because governments around the world are going to continue to understand that data is the most important asset that any industry, any company, any government, any organization has and therefore it needs to be regulated. And we can assume that the regulations will keep changing over time. We're in decade N of N times N times Q decades worth ahead of managing data.
Sam Ramji: We'll be in this world forever and what we think about it will change over time. We don't want to be in a position where we have to go back and manually relitigate or recategorize anything in the past. That's why I kind of look at governance as being the essential solution that has to give us a computational exit path out of all of the torture that we go through, making sure that we've done all the appropriate things with data, because we never want to do the wrong thing on purpose. And we also never want to do the wrong thing by accident.
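A declaration that can be "computationally assessed" might look like the following Python sketch. Everything in it is an assumption made for illustration: the declaration format, the field categories, and the 730-day threshold are hypothetical. They do not describe Google's internal system and they are not a statement of what GDPR actually requires. A plain dictionary stands in for the YAML-like file mentioned above.

```python
from typing import Dict, List

# Hypothetical declaration an application team might write about the data it manages.
DECLARATION = {
    "application": "checkout",
    "fields": [
        {"name": "email", "category": "personal", "encrypted": True, "retention_days": 365},
        {"name": "card_last4", "category": "personal", "encrypted": False, "retention_days": 3650},
        {"name": "cart_size", "category": "operational", "encrypted": False, "retention_days": 90},
    ],
}

MAX_PERSONAL_RETENTION_DAYS = 730  # illustrative threshold, not a legal figure


def assess(declaration: Dict) -> List[str]:
    """Turn a declaration into a list of remediation findings for personal data."""
    findings = []
    for field in declaration["fields"]:
        if field["category"] != "personal":
            continue  # only personal data is assessed in this sketch
        if not field["encrypted"]:
            findings.append(f"{field['name']}: encrypt at rest")
        if field["retention_days"] > MAX_PERSONAL_RETENTION_DAYS:
            findings.append(f"{field['name']}: shorten retention")
    return findings


findings = assess(DECLARATION)
```

When a regulation changes, only the assessment rules need to change; the existing declarations can be re-run through them without anyone recategorizing data by hand.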
Zhamak Dehghani: I agree, 100%. There will be many people that sleep comfortably at night if we did that.
Sam Ramji: Yeah. And for good reason. They could do so with confidence. We're kind of coming up on the end of the time that you have so let me ask just two last questions. One is maybe a big question. What's next for technology tooling and platforms in this area from your point of view?
Zhamak Dehghani: I have many hopes and many dreams. I think there is, as we just discovered, I don't know how long we've been talking. It went so quickly. But just this short conversation was so generative. If folks were listening, there were so many topics that just bubbled up as the potential for innovation. I think at a base level, we need a new language to even describe this architecture. In my mind, the reason we all got together and agreed upon microservices is that we had a set of open standards. We first had a language to describe that world and then we had a set of standards that enabled interoperability that we kind of agreed upon, as simple as HTTP REST. That was so key to bringing that idea to life. I think in the space of data mesh, it's the same thing.
Zhamak Dehghani: We need to establish that new language for this new vision of the world. We need to establish a set of open standards that then encourage ecosystem building and tool building. As an example, if we think about this node on the mesh, the data product, as now the quantum of our architecture, maybe that is what we want to standardize when we talk about it. There are interfaces or specifications that we can define.
Zhamak Dehghani: For example, that declarative file that you just described that Google had, as a way of describing this node, as a way of describing the data products. And then once you have that declarative description as a spec, it's a very generative concept, because many toolmakers can go and bring that declaration to life with provisioning engines and with operational engines and build all of the bells and whistles that would be required to have security built into it. I think the next piece of this evolution would require our agreement, in an open fashion with open source data technology, around a few key specifications that then we can use to seed implementations behind the scenes.
Sam Ramji: That's an awesome answer. It's going to be a really exciting time to work on this. I'm excited to work on it with you. Let me close with a question, what is one resource that you would point the audience to if they are interested in chasing this down, exploring it, and maybe getting on board and joining us in defining the data mesh landscape?
Zhamak Dehghani: It's a resource that's coming. I think the resource right now that exists is really that seminal paper, that's the one that started it all and people point to that or a recent talk, but I'm working on a follow-up paper on that to put a little bit more framework around the architecture. It would be on Martin's site. It will be published. It's a blueprint around data mesh and I will publish it over at Martin's site. And that would point to hopefully a very basic repo of this initial state of these specifications that we can build together. It's a data mesh blueprint article, maybe by the time we publish this, it's already on Martin Fowler's website and I will share the link to have in the show notes.
Sam Ramji: That is awesome. Zhamak, thank you so much for what you're bringing to the industry and thank you so much for spending time with me today.
Zhamak Dehghani: Thank you for the conversation. I really enjoyed it.
Narrator: Thank you so much for tuning in to today's episode of the Open||Source||Data podcast, hosted by DataStax's Chief Strategy Officer, Sam Ramji. We're privileged and excited to feature many more guests who will share their perspectives on the future of software so please stay tuned. If you haven't already done so, subscribe to this series to be notified when a new conversation is released, and feel free to drop us any questions or feedback at firstname.lastname@example.org.