Season 1 · Episode 12

Fabrics, Meshes, and Graphs with Deloitte Principal Dave Thomas

Join Dave and Sam as they discuss data sets evolving from finite to infinite, and finding the needle in the haystack with math. Listen to Dave talk about cutting edge data problems and the essential need for curious people.

Published February 25th, 2021  |  27:06 Runtime

Episode Guest

Dave Thomas

Principal at Deloitte

Episode Transcript

Sam Ramji:
Hi, this is Sam Ramji, and you're listening to Open Source Data. This week on Open Source Data we have Dave Thomas. He's at the cutting edge of building large-scale distributed systems and data analytics platforms, building fabrics, meshes, and graphs. With his focus on the Federal Government, he works with clients to improve all aspects of their analytical systems, from data modernization to advanced machine learning algorithms. As a principal in Deloitte's analytics and cognitive practice, he helps bring innovations from the technology sector to his clients' tough missions and decisions. Dave, thank you so much for being with us.

Dave Thomas:
Thanks for having me, Sam.

Sam Ramji:
You and I have had a lot of conversations over the last several months. I've always taken away several things and learned new things every time. We like to start each conversation by asking: what does open source data mean to you?

Dave Thomas:
I love that formulation. When I start to think about the technology landscape around data these days, it seems to get more complicated every day. And I think the open source nature of many of the new products is critical. So I think of open source as a great way of distributing technical solutions across the broad community, and of starting to determine some standardization around the APIs that these systems need to connect with. And so I think some of the toughest challenges we're facing in data now literally couldn't be solved without the open source community coming together on some of those standardization questions.

Sam Ramji:
I love the concept of open source driving standardization. So often, early in the open source era, it was just described as mere commoditization. But as you start to inspect it, commodities drive standards, and then you have new technologies. Hadoop was a great example in data of something that you couldn't do in proprietary software. Nobody was doing it. So it was no longer considered just a copycat. It was like, no, this is real innovation. And that's back in 2007, 2008. So open source data as a standards driver is super exciting to hear about from you as somebody who's at the cutting edge of practice.

Dave Thomas:
And I love the commoditization and standards point. If you go back to railroads, getting the gauge standardized allowed for the transportation revolution we saw. It's the same sort of thing as we start to standardize: we have more interoperability. Maybe a car can go from the East Coast to the West Coast faster, or, as we'll talk a little bit about with some of the data challenges, I can have a piece of data move all throughout my enterprise and still have the same protections around it as it moves.

Sam Ramji:
Well, that's a really powerful framework for us to talk about data today, because you've got the railroad gauges that got standardized, which created interoperability for the trains and transcontinental railroads. And then it got taken up an entirely new level with the invention of containers, which could move from ships, to trains, and to trucks. So there's some of that work that you've been doing to take really awe-inspiring quantities of data in your projects for the missions that you support, both starting with interoperability, which I think you had looked at as a fabric class of problem, and now starting to look at almost a containerization of data as we start to look at data meshes. And you have created a new frame of powerful algorithms based on graph to solve a bunch of the problems in the domains you care about. I would love to have you talk with us: how did you end up landing on fabrics? How has that changed your work? And how are you now moving from fabrics to meshes?

Dave Thomas:
I love that question. So I spent a lot of time building knowledge graphs for my clients. And when we think of knowledge graphs, we're pulling together data that exists throughout the entire enterprise, often in different siloed systems. What we're trying to do is provide context for each piece of information as that data moves throughout the system. And as all of it comes into the knowledge graph, each piece of information gets exponentially more valuable. Everything winds up getting pushed in. What we found, though, is that there's a challenge at one point where, if I have a piece of data that is protected, PII, for example, coming from a transactional system, when it moves into the graph, I need to have the same protections on that piece of data. And so I want to make sure that I can have all those protections travel with the data throughout the system.

Dave Thomas:
The way I think of a fabric is very monolithic. And so it's easy to push all those protections into that fabric layer as the data comes up. I say easy; it's still a lot of work, but it's doable. What we're starting to see is that it's no longer just a transactional system feeding a knowledge graph or an analytics system. A graph is just one way of thinking about an analytics system. But now my clients want to take the data that was put into the graph and push it back down into their transactional systems for additional value. And so the data, as it moves, needs to continue to carry those protections: the metadata needs to travel around with that data, the policies, all the protections. It's becoming much more of a distributed problem. So I think that would be the distinction I would make between a fabric and a mesh. It's no longer just pushing everything into a monolithic data fabric, where you have a tightly knit data system, but: how do we allow loosely coupled data systems to exchange information with all that metadata going with it?
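One way to picture metadata traveling with data between loosely coupled systems is an envelope that pairs each record with its policy tags. This is a hypothetical sketch for illustration, not Deloitte's implementation; the `Envelope` type and its fields are invented names:

```python
# Hypothetical sketch: an "envelope" that keeps policy metadata
# attached to a record as it moves between loosely coupled systems.
from dataclasses import dataclass


@dataclass(frozen=True)
class Envelope:
    payload: dict                    # the record itself
    tags: frozenset = frozenset()    # e.g. {"PII"}
    source: str = "unknown"          # originating system

    def project(self, fields):
        """Hand a subset of fields to another system; tags travel too."""
        subset = {k: v for k, v in self.payload.items() if k in fields}
        return Envelope(payload=subset, tags=self.tags, source=self.source)


# A record leaves a transactional system for the knowledge graph:
record = Envelope(
    payload={"name": "Ada", "account": "12-34"},
    tags=frozenset({"PII"}),
    source="billing",
)
graph_copy = record.project({"name"})
# graph_copy still carries the PII tag, so the receiving system can
# enforce the same protections the source system required.
```

In a real mesh the payload and its metadata would be serialized together on the wire; the point is only that a projection never sheds its tags.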

Sam Ramji:
Yeah, that's a huge challenge. Loose coupling is a great way to build resiliency and isolation, so that you can have different pieces move at different speeds. But it's tough on standardization of some really key things around governance and policy, which obviously is a huge concern for you. So as you think about that security context moving around with the data, what have you had to learn, or what have your teams had to invent, to make sure that not just the provenance but the actual governance of the data gets carried throughout this distributed data system?

Dave Thomas:
Yeah. I think in some ways we're lucky that Kubernetes and a lot of the work done around distributing software has gone ahead of us, so we get to look at and borrow a little bit from some of the patterns we're seeing there. And security is a great example. As we think of some of the service meshes that have been stood up to live on top of distributed software, we have some of the same types of paradigms we need to build into these data meshes. So I need to know identity across all the different data sets: who is accessing a piece of data? I need to make sure that this piece of data can move from one place to another. So I may have GDPR policies that restrict movement, or if I have data locality challenges, I may want to build a caching system that pushes data out close to an end user.

Dave Thomas:
But if that end user leaves the country, maybe I no longer can leverage that caching system that's closer to them. That consolidation of policy and then distributed enforcement that we see using like a sidecar model in the software realm is very similar to what we're trying to do when we build these data meshes. So we have the policies and the protocols separated from the data itself, and then we push the enforcement down to the individual data elements or the systems that are holding those individual data elements.
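That "consolidate the policy, distribute the enforcement" pattern can be sketched very simply. This is a hypothetical illustration, with an invented `POLICY` table and `may_serve` check, not any particular policy engine:

```python
# Hypothetical sketch of "consolidate the policy, distribute the
# enforcement": one centrally managed policy table, evaluated locally
# wherever a copy of the data is served. Names here are invented.

# Which user locations may read data carrying each tag.
# None means no restriction.
POLICY = {
    "PII": {"allowed_locations": {"DE", "FR"}},  # e.g. a GDPR-style rule
    "PUBLIC": {"allowed_locations": None},
}


def may_serve(tags, user_location):
    """Local enforcement point: every tag on the data must permit
    the requesting user's current location."""
    for tag in tags:
        allowed = POLICY.get(tag, {}).get("allowed_locations")
        if allowed is not None and user_location not in allowed:
            return False
    return True


# A cache near the user can serve while they are in-region...
print(may_serve({"PII"}, "DE"))   # True
# ...but if that user travels out of region, the same cache refuses.
print(may_serve({"PII"}, "US"))   # False
```

This mirrors the sidecar idea: the policy table is managed in one place, while each cache or data system runs the same small check locally.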

Sam Ramji:
That's both extraordinarily important as the scale of data increases, as we get to every company holding exabytes of data and needing to compute in any given project on many petabytes. It's also embarrassingly difficult to think about passing regulations or passing audit if you have to apply lawyers and teams of technologists with log files and doing log analytics. So clearly there's got to be a computational solution to all of this. And that's something that you have been doing a bunch of and you've taught me about around policy engines and policy agents. You introduced me to some concepts that Steve Touw and Immuta have been practicing. I'd love to have you talk a little bit more about, how do you go from the intent to have policy, like you've described for sovereignty or data locality, and actually make it a reality and then have it provably auditable?

Dave Thomas:
It's a great question. And part of my job is spending a lot of time with those lawyers that made the agreements around data sharing. And if you think of the Federal Government, where I spend most of my time, there's all these memorandums of understanding between different agencies and there's all these different title authorities that allow people to use data differently, which have all been negotiated and agreed upon in these MOUs. And we do literally sit down with lawyers and go from the plain text that they've agreed upon and map that to the physical data that exists. So if we say, this piece of information is PII, and PII is protected, we'll actually add metadata to the physical model that we're persisting. And so we can get agreement that, yes, a name is PII, and yes, the lawyer said PII can or cannot be shared, and then we have that traceability all the way from the domain into the physical representation.

Dave Thomas:
And that mapping is something we ensure we can, to your point, compute on. If we move that record, that column, or that file somewhere else, we move the metadata with it as well, so that as we access that piece of information somewhere else, we can take that same policy, look at the accesses or the privileges of the user accessing it, and then apply those at compute time: do I let this person get access to this data? So if you think about it, it's a distributed attribute-based access control system that we're building here, where the attributes live separately from the data. And I think one of the challenges we have is that there aren't that many standards around how to enforce this interoperably, back to the open source question at the beginning. We're starting to see that some of the identity information is transferable, things like OAuth or OpenID, allowing us to have some identity of the end user move around.
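A compute-time, attribute-based check of this kind can be sketched as follows. The column tags and role grants here are invented examples, not a real client's model:

```python
# Hypothetical sketch of a compute-time, attribute-based check:
# the attributes live separately from the data and are joined only
# at access time. Tags and roles here are invented examples.

# Metadata attached to the physical model ("name is PII"):
COLUMN_TAGS = {
    "name": {"PII"},
    "diagnosis": {"PHI"},
    "zip": set(),          # untagged, visible to everyone
}

# Policy mapped from the legal agreement: role -> tags it may read.
ROLE_GRANTS = {
    "billing": {"PII"},
    "clinician": {"PHI"},
    "analyst": set(),
}


def visible_columns(role):
    """A column is visible only if the role has been granted
    every tag attached to that column."""
    allowed = ROLE_GRANTS.get(role, set())
    return {col for col, tags in COLUMN_TAGS.items() if tags <= allowed}


# Evaluated at query time, wherever the data has traveled:
print(sorted(visible_columns("billing")))    # ['name', 'zip']
print(sorted(visible_columns("clinician")))  # ['diagnosis', 'zip']
```

Because the decision uses only the tags and grants, the same check can run against any system the data has moved to, which is the point Dave makes about metadata traveling with the record.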

Dave Thomas:
We don't yet have great systems for capturing, one, that domain in a way that is transferable. So what does PII mean in all of these different contexts? And then the policy as well: we don't yet have standards. There are certainly companies adopting and innovating in that space, but interoperability between all the systems is still something that I think the industry is really going to move forward on over the next five years.

Sam Ramji:
Yeah. And you described the problem very eloquently. One of the great challenges of our time is that lawyers don't have computer science backgrounds, and computer scientists mostly don't have law backgrounds either. I had an opportunity to talk with John Clippinger, who's been at MIT and worked on identity and distributed trust for a long time. And he was describing some of the work he's curious about now around computational law. And once you start chasing a certain curiosity, you start to see it everywhere. I noticed that Intuit recently came out with a description of how they had applied graph and ML to understand the PPP and other types of loan programs and grants to large-scale communities of small business owners.

Sam Ramji:
And it's about being able to effectively, just like we used to shred XML into actual binary bits and data, how do you shred law into a computable graph? Do you think about it that way? When you describe your challenge, you're talking to lawyers and deciding this and that, is that more of a design time problem where you're coming up with rendering the policy as you imagine it to be into some programmable normal form? Or is there more of a knowledge graph approach to updating the system's understanding of what PII means or what particular policies imply?

Dave Thomas:
I would say we've actually been doing sort of the shredding and pushing it down into computation for a long time; we just didn't really think about it explicitly. I've been writing code long enough to think back to when we did object-relational mapping, and that had all the context between what is physical and how do I map that to my domain. And then we would have a bunch of developers who would say, well, I know this person is accessing this data, and maybe it's a medical system and I can see it's the payer, so I'm going to give them all the details. Or maybe it's a clinician, so they can only see the medical information. Or maybe it's the billing department, so they only see the address. All that mapping was done by a developer somewhere in an object that was removed from the view of the lawyers, which was effective, except for when you had to change those policies.

Dave Thomas:
A great example would be: maybe the doctor shouldn't be able to see an individual's residence, unless the diagnosis has something to do with an environmental factor. In that case, the place of residence might be very, very important. So those exceptions to the laws and to the policies are fine, but you probably don't want every developer across all your different systems making those independent mappings. And so what we're really trying to do is consolidate that mapping from the legal speak into a domain that everyone can understand, and then allow the domain to be mapped explicitly to the underlying physical representation of that data. So it's something we've done for a long time; I think we're just putting more discipline and transparency around it now. And I think the standards will allow us to consolidate all those rules in a place that allows us to quickly update them, so that we have universal enforcement as well.

Sam Ramji:
I can see why you focus so much on meshes and graphs, because this whole system needs to be so loosely distributed, because the field of practice is changing. Law implements human beliefs and principles about how things ought to be. And as we keep changing our practice, those principles shift. So there's a lot more ahead here, as you think about that 10-year future of some of the projects that you're building. One of the things that I've been inspired by is the class of complex problems that you're able to solve in your systems with an application of graph in a matter of seconds, where other approaches to computing that information would take weeks or longer, which wouldn't work for a lot of the projects that you're supporting. Can you talk a little about your journey from working on sort of standard query approaches to large-scale data, where clearly there's plenty of opportunity for you to throw a lot of iron at analytics? What made you start working on graphs? And then maybe I'll ask you a bit more about what you can teach us about graphs in practice.

Dave Thomas:
I was actually driven toward graph many years ago, when it became clear that the context of a fact was as important as the fact itself. So what were the relationships between an individual entity and other entities? And the classic example in graph is fraud. So I may have two different account numbers or credit cards, but all the purchases are going to the same address. Well, that's an unusual pattern, and it might be indicative that those two credit cards are stolen, particularly if they are coming from the same IP address, or maybe if they started making a lot of unusual purchases at the same time. So while any individual purchase isn't especially interesting, all the connections around those purchases are something that's very quick and easy to find in graph and hard to find with traditional computational methods. So that sort of drove me toward graph in the first place, and then I just got fascinated by the different applications.
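The shared-address pattern Dave describes becomes easy to spot once the purchases are treated as edges in a graph. A hypothetical sketch, with made-up card and address values:

```python
# Hypothetical sketch of the shared-address fraud pattern: treat each
# purchase as an edge from a card to a shipping address, then look for
# addresses connected to more than one card. Values are made up.
from collections import defaultdict

purchases = [
    ("card-A", "12 Oak St"),
    ("card-A", "12 Oak St"),
    ("card-B", "12 Oak St"),  # a different card, same address
    ("card-C", "9 Elm Ave"),
]

# Invert the edges: address -> set of cards shipping there.
cards_by_address = defaultdict(set)
for card, address in purchases:
    cards_by_address[address].add(card)

# Addresses linked to multiple cards are candidates for review.
suspicious = {a for a, cards in cards_by_address.items() if len(cards) > 1}
print(suspicious)  # {'12 Oak St'}
```

No single row in this list looks unusual on its own; the signal only appears in the connections, which is why the graph representation pays off at scale.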

Dave Thomas:
Actually, Matthias Broecheler from DataStax was one of my early inspirations, in the work that he was doing on Titan, and of course now at DataStax. And for some of the at-scale problems, this combination of graph and the distributed nature of compute was allowing me to answer questions my clients had that were literally computationally impossible until we could represent the problem as a graph. Often it is that connection: finding a highly connected community within a graph, or finding unusual corporate structures, for example, if we're looking at tax evasion. All sorts of physical representations of the data, and then doing the pattern detection in that space, that you couldn't do in a more traditional table form.

Sam Ramji:
Yeah. You have a lot of needle-in-a-haystack classes of problems in the work that you're doing, which I think are actually representative of more and more of the things that we are being asked to solve in data projects in big companies and big organizations, because the simple problems we already know how to solve.

Dave Thomas:
Yeah, absolutely.

Sam Ramji:
And they're simple because we know how to solve them. It's these needle in haystack issues that graph is helping you tease out.

Dave Thomas:
The data mesh becomes increasingly important as we start to build these massive graphs. There are types of data that may not be the best fit in a graph database. For example, if I have a long-form text document or a lot of images, there are a lot of graph systems where I could put that data in, but I'm probably not going to get a whole lot of information out of the data in that structure within the graph. So what I'd rather do, most likely, is link to document storage somewhere and retrieve that information when I'm presenting it. Or transactional data: often, very high-volume transactional data or time series data might not make sense to push natively into the graph.

Dave Thomas:
And so that kept driving us toward this mesh concept. I didn't want to have everything stored there; back to the fabric construct, it didn't make sense to have it be a fabric where all the data lived in one place. But I wanted to ensure that as one of my end users was navigating across that graph, all the enforcement was taking place, and that if they were navigating out of the graph to these other subsystems, the enforcement traveled with them. And so the data mesh has become increasingly important in this polyglot world.

Sam Ramji:
Yeah. The problems start to look more like orchestration and choreography than the old monolithic control.

Dave Thomas:
Absolutely.

Sam Ramji:
Some people think of graph as simply a graph database. And so you bring up the concept and someone's like, oh yeah, Neo4j. And then they just stop thinking about it. They're like, I kind of understand that. But what you're doing is going from graph database to graph data, and applying graph as a concept for how you model, understand, and build algorithms against data. And as you distribute your data sources more, and have both an input and output layer across this distributed data system, the way that you're applying graph is ending up being really powerful, I think. I'd love for you to talk a little bit about the changes that you've had to implement, if any, between graph on a fabric, or a more monolithic system, versus graph against a mesh and a more distributed data architecture.

Dave Thomas:
That's a really interesting challenge. I think a lot of the graph people, if you will, either think of it as a database or as a math problem. And sometimes those overlap, but not always. We certainly have a lot of folks who ask, can you solve this problem with graph? Or is this a graph problem? What we find is that the answer is usually, it depends, a classic consulting answer. But thinking of it as: do you need additional context? Do you need additional information around the set of facts? That's typically how we engage with those types of questions. And on the data mesh versus data fabric piece: for the graph math people over here, if they need the numbers in the graph to do calculations, the data probably needs to get pulled into the graph itself.

Dave Thomas:
The graph database people maybe spend a little bit more time in the mesh, where not all the data is necessarily located. Or often, even on the math side, you'll see we may not take the underlying data, but we'll represent it in a matrix somehow and push that matrix into the graph, extending the graph in that sense. So moving back and forth between the layers often allows us to attack problems in more abstract ways that we wouldn't be able to without that interconnection between the two.

Sam Ramji:
So they become more computable, and they don't require you to do as much of the data loading that you would otherwise, because you're loading in, as you said, matrices: effectively mementos that let you understand enough about the data to compute, but not with the full fidelity of all the underlying data.

Dave Thomas:
Yep, exactly.

Sam Ramji:
Interesting. Hey, so you are solving, in some cases, safety-related, sort of life-critical problems, and at the other end of the spectrum, some of these things around finance, and tax, and fraud, and other risks, so you've built some pretty awesome systems. Many of us will probably end up reaching where you're at now over the next few years. From your point of view, what's unsolved? Where are you seeing white space in the industry? What are the things that you wish open source projects, or vendors, or anyone would provide that would make the future show up faster, or just make your next job easier?

Dave Thomas:
It's really all around the standards and interoperability. There are so many different ways that we express, persist, and query data nowadays. And the standards around, or I should say maybe tangential to, the explicit APIs are the challenges we often face. So how do I pass identity uniformly into a system? How do I extract metadata out of the persistence capabilities so that I can enforce that metadata across all of the different systems? And critically, how do I make it such that we can adopt a standard, so that if I have a search API or a GraphQL API, regardless of how I'm querying the data and how that data is represented, I can make sure that the metadata around it is consistent, enforced, and expressed at a domain level that I can map policy to, that the identity information can be passed into that same query, and that I can trust the optimization engines of the various persistence layers to respond in the fastest way. I think that's the area where I would love to see the industry start to come to some agreement.

Sam Ramji:
Yeah. You make a fascinating point, because there are two different dimensions to what you pointed out. One is a set of functional APIs, like a search API, where you could say, look, I've got a standardized way to talk to this thing. I don't really care how it's implemented. But then you've also got a structural, or an infrastructural, API, which, as we learned from the Kubernetes breakthrough, followed by Terraform, goes from an imperative API to a declarative API: one that tells the system, I would like you to assume the following state. I don't care how you do it, but when you've figured it out, come back and tell me that you're ready for action. And I think I heard both of those things in what you are asking for.

Dave Thomas:
I love the way you phrased that. Right now it's very much the imperative approach. We're saying we have to take this type of information, and we have to understand how the data is persisted. And so we're taking the metadata information, we're taking the identity information and knowledge of the persistence layer, and we're building the systems now to combine all those and give the answer back. I'd love to not have to do that, to allow the system to tell me the best way to take that information, and to build optimizers so that it's a lot faster than anything we can build ourselves. Having to build one of those optimization patterns for every single possible database, or form of database, our clients have is cost prohibitive, and it really limits some of the main challenges we get to tackle, because we're spending a lot of time in this infrastructure space.

Sam Ramji:
That's super helpful. And I look forward to working with you on some of these standardization and interop challenges. I've got two last questions for you. I know we're short on time. The first is: you lead large teams and complex projects. A lot of the people in the open source data community are in the earlier parts of their journey on distributed systems and distributed data. What advice do you give to people who are not necessarily starting out, and what do you look for when you're hiring into your team? Because one thing that you do many times a year is build new teams for new projects. So help us see through your lens as an architect, a technology leader, and a practitioner.

Dave Thomas:
I think the most important thing for me is curiosity, just an intellectual curiosity that manifests itself in a drive to answer problems. At the end of the day, what we spend all our time doing is listening to our client's hardest challenges and figuring out a new way to solve it. And if there were an easy roadmap, they wouldn't be coming to us for help. So the people who have that sort of inherent drive to solve problems.

Dave Thomas:
I think the other thing, a little bit more tactically, is that I'll often look at people who have experience in a number of different languages, or domains, or areas. And what I've found is that moving from space to space sort of forces the brain to abstract the problems, so you're not just thinking, how do I write this in JavaScript or Java, but how do I frame the problem, and then how do I map that to the infrastructure that I have? At the end of the day, problem solving is a search problem, and if you have a bigger search area, you'll come up with better solutions. So I think those are the two: sort of the abstract, and then the tactical representation of that.

Sam Ramji:
That's really good advice. We used to talk about this going from being I-shaped to being T-shaped. If you can do more things, then you have a better opportunity to collaborate with others. But you're pointing out that they're also compacting their intelligence and creating more capability to cross domains, which seems like all the interesting problems these days are multi domain.

Dave Thomas:
Absolutely.

Sam Ramji:
My last question is, what is one resource, one link or one concept that you want to leave with our audience? What could people most benefit from if they want to follow your path, connect with fabrics, meshes and graphs?

Dave Thomas:
I think there are actually a couple of really good resources. A couple of the previous podcasts would be relevant: Zhamak's discussion around data meshes, or Paco's discussion on graphs and knowledge graphs. But another very tactical one would be NIST's paper on attribute-based access control. They do a very good job of thinking about the problem in the abstract: what are the functional components that need to be implemented in order to enable ABAC? And as we're thinking about data meshes, to me, one of the hardest challenges is this attribute-based access control. So thinking about that, and starting to put some structure on the moving pieces that we need to implement, would be a great first step.

Sam Ramji:
That was awesome. Dave, thank you so much for your generosity of time and thought today.

Dave Thomas:
No problem, Sam. Thank you. Appreciate the conversation.

Narrator:
Thank you so much for tuning in to today's episode of the Open||Source||Data podcast, hosted by DataStax's Chief Strategy Officer, Sam Ramji. We're privileged and excited to feature many more guests who will share their perspectives on the future of software, so please stay tuned. If you haven't already done so, subscribe to this series to be notified when a new conversation is released, and feel free to drop us any questions or feedback at opensourcedata@datastax.com.