Behind the Innovator: Dave Mattox, Senior Product Architect, Deloitte
Welcome to our Q&A series: Behind the Innovator.
Behind the Innovator takes a peek behind the scenes with learnings and best practices from leading architects, operators, and developers building cloud-native, data-driven applications with Apache Cassandra™ and open-source technologies in unprecedented times.
This week, we spoke with Dave Mattox, a senior product architect at Deloitte who’s focused on the MissionGraph™ implementation at a state-level law enforcement fusion center. MissionGraph is a data integration and exploration platform built on DataStax Enterprise (Cassandra, Solr, Spark, and Graph) and deployed in AWS GovCloud that integrates 12 terabytes of data from 11 different sources.
Here’s what he had to say.
1. Please share a little bit about your background.
Functionally, I'm the lead architect on a Deloitte asset called MissionGraph. My background is in databases and artificial intelligence (AI). Over the course of my career, I've moved between storing data and analyzing data and back again, and now I'm more on the storage side again. So, I've spent a lot of time with databases.
I also spent quite a bit of time in the intelligence community doing both systems architecture and targeting for counterterrorism. So, I've worked both sides of the fence. I actually had to use programs that I wrote, which was very humbling. But it worked. All of the people I worked with had to use what I designed too, so every day I had to listen to everybody complaining about my mistakes! [Laughing]
I've jumped around from data science analytics to more hardcore data engineering to data architectures. Now, I'm back to being a data architect.
I've been with Deloitte now for three years. I started in October 2017, and overall, I've been in the business for about 30 years or so now. I got a PhD from the University of Illinois in a combination of data, AI, and databases. So, I've been there and done that in a lot of different places.
2. What are some of your proudest accomplishments and achievements in your experience?
The project that I'm working on now, MissionGraph, is quite an interesting accomplishment.
Before that, in 2003, I was working in the counterterrorism area and we built a massive system to do analytics in support of national defense. It was the kind of thing that hadn't been built before when we started; it's a lot more common these days. So, we had a fun time building it and using it.
3. What is the mission of your current team? In addition to being used to monitor the spread of opioids across the country, what else does MissionGraph do?
MissionGraph itself is a platform that supports investigative analytics. Building on top of DataStax, what we've done is build a platform that makes it easy to ingest multiple data sets, de-conflict everything, integrate them, do entity resolution, and then store it in tabular, graph, and geospatial formats. MissionGraph makes it easy to present this information to the user, cutting down the time it takes to go from ingesting 30 different data sets to gathering insights from the data.
MissionGraph is mainly used to support investigative work.
For a very large state-level law enforcement organization, we deployed a version of it to their fusion center that integrated 12 or 13 different datasets. Essentially, it reduced the amount of time they spent looking for data from maybe a day down to a couple of minutes, just because we're able to integrate all of the data inside of DataStax and then put the MissionGraph application on top of it.
We're also using DataStax now in another large government agency. One of the things we ended up building that was important for the analytics was the data ingestion pipeline. It turns out you can't really do a good job of analyzing data if it's all dirty and disconnected. So three-quarters of the effort is ingesting that disparate data and cleaning it up so it is usable by the platform and its users. But we're also applying that concept to the idea of data commons and other data assets. Right now, everyone is talking about data lakes, where everybody kind of throws all their data into one big pile and then people can use it.
What we've done is adapt that data pipeline to sort of industrialize the ingestion of data in a way that lets us store it in multiple models, whether that's tabular or graph or in file systems, or whatnot, very quickly. And then we allow people on the other side to do analytics on it, build data marts, catalog it, and essentially get more value out of the data much more quickly. We also provide a lot of services, such as tracking the provenance of the data as it moves through the pipeline. It makes it easy to combine things.
4. What were some of the challenges that you have run across and some of the learnings that you've gained from the MissionGraph project?
The biggest learning was the amount of effort it takes to get the data into the system in the right way. So a couple of things: One is that everybody gets focused on the end state, right? We've got this really pretty graph that we want to use, but the amount of effort to get to the graph is actually fairly large. So my saying has always been, "It's great that everybody wants to connect the dots, but the real hard part is actually building the dots." And that means taking information from all these different data sets and integrating them together.
One of the big things that we do is entity resolution. This means that there might be 10 references to you across five different data sets. The trick is determining which of those records belong together as a single entity. That's part of what the ingestion pipeline we built does.
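The idea can be sketched in a few lines. This is a minimal, illustrative toy, not MissionGraph's actual logic: the field names and the matching rule (a normalized name plus date of birth as a blocking key) are assumptions, and real entity resolution uses much fuzzier matching and scoring.

```python
# Minimal entity-resolution sketch: records that share a strong identifier
# (here, normalized name + date of birth) are merged into one entity.
# Field names and the matching rule are hypothetical, for illustration only.

from collections import defaultdict

def resolve_entities(records):
    """Group records into entities by a simple blocking key, then merge."""
    buckets = defaultdict(list)
    for rec in records:
        # Blocking key: lowercased, trimmed name + DOB.
        # Real systems layer fuzzy matching and scoring on top of this.
        key = (rec["name"].strip().lower(), rec["dob"])
        buckets[key].append(rec)

    entities = []
    for (name, dob), recs in buckets.items():
        entities.append({
            "name": name,
            "dob": dob,
            "sources": sorted({r["source"] for r in recs}),  # provenance
        })
    return entities

records = [
    {"name": "Dave Mattox", "dob": "1970-01-01", "source": "dmv"},
    {"name": "dave mattox ", "dob": "1970-01-01", "source": "court"},
    {"name": "Jane Doe", "dob": "1985-05-05", "source": "dmv"},
]
entities = resolve_entities(records)
```

Here the two "Dave Mattox" records from different sources collapse into one entity that remembers both of its sources, which is the "building the dots" step before any graph can be drawn.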
Now once you do that, the analysis is easy. Personally, just in working with DataStax, there's always a little bit of learning around understanding how to model and put data into DataStax, and also realizing some of the advantages you get out of that.
5. We've recently launched DataStax Fast 100, which is our new partner enablement program. The whole goal of the program is to help close the skills gap on Cassandra in order to certify and prepare consultants so they can start delivering on those engagements within 30 days. Do you think a program like that would have been beneficial for you coming into the world of Cassandra and DataStax? Or how might it benefit you even now?
This new partner enablement program will be a great addition to both existing projects, such as our Fusion Center Step Up, and new projects. This type of program will enable us to optimize the DataStax implementation.
For our Fusion Center Step Up, we got it running really fast. Right at the beginning, the team crawled a bit, but after a lot of time working with some of DataStax’s technical folks, we were able to optimize it. I’m glad that for future projects, our teams will be able to take part in the DataStax Fast 100 training to help our people be knowledgeable out of the gate.
The training is a big thing. Even now, in some of the stuff that we're doing, people understand it at more of a superficial level but not at the level of expertise you need to really work with big data. I mean, everything's fine when you're dealing with a million records. But when you start getting into multiple billions, then you really need to understand what you're doing.
6. How has your experience been working with Cassandra and with DataStax?
Once we developed the expertise in-house, it worked great. DataStax has some advantages over other databases for certain applications. I've found that the flexibility of DataStax is very handy when you're bringing in lots of data from lots of different places that is messy and not well-formed.
In the real world, sometimes you're going to get duplicate records, and there's a whole lot of stuff that happens that's not ideal. But DataStax is a little bit more flexible and lets us work with that. For example, you can do a database insertion and not have a key collision when duplicate records come in. That makes it easier to adapt to those sorts of things. And at the other end, DataStax is very fast at retrieving the data.
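The behavior being described is Cassandra's upsert semantics: an INSERT on an existing primary key overwrites the row rather than raising a uniqueness error. A toy sketch of that semantic, with the table and row names invented for illustration (this is not the DataStax driver API):

```python
# Sketch of Cassandra-style upsert semantics: inserting a row with an
# existing primary key silently overwrites it ("last write wins") instead
# of raising a key-collision error, as a unique-key RDBMS insert would.
# The class and field names here are hypothetical stand-ins for a real table.

class UpsertTable:
    def __init__(self):
        self.rows = {}  # primary key -> row dict

    def insert(self, pk, row):
        # No uniqueness check and no exception on duplicates:
        # the new row simply replaces the old one.
        self.rows[pk] = row

table = UpsertTable()
table.insert("person-1", {"name": "Dave", "city": "Austin"})
table.insert("person-1", {"name": "Dave", "city": "Arlington"})  # duplicate key: overwrite
```

For messy ingestion, this means duplicate source records can be written without pre-checking for existence, which is the flexibility the answer above points to.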
7. What advice would you give to any other enterprises or developers that are trying to navigate today's software landscape?
I guess the advice I still always give to people is try not to focus on the bright shiny object first. Everybody wants to do machine learning and build cool models right now. But again, none of that works unless you've actually done a good job of bringing the data in because you can't build good machine learning models unless you have clean data. You can't build a nice graph unless you've done a good job of integrating data from lots of different sources. You don't get a complete graph if you don't do that.
So, in most projects that I've seen, people tend to focus on the cool new technology but not pay much attention to the underlying architecture and infrastructure needed to support it.
Everybody wants to build the roof and not lay the foundation.
8. What's your vision for MissionGraph moving forward? There are a lot of really cool use cases currently, but what is the one big thing you would like to achieve with MissionGraph with your team? What is that pie in the sky goal that you have for it?
It's actually integrating all the new machine learning and analytics models. What I've found with customers, both at Deloitte and in the past, is that a lot of times they end up focusing on machine learning and all the other cool features, but if they focus on properly integrating their data, about 60% of the analytics is generated by the platform. So, once you've got all the data on an investigative topic together in one spot, a human operator can glean a lot of value out of it. Now that we’ve hit that mark, what we really want to do is take it to the next stage, right?
Since we've got all this nice clean data, we can actually start building cool, higher-end analytics on it. At Deloitte, we are working on a new umbrella project where we're building out very large-scale AI environments. It's not just that there are these cool techniques to build these cool models; it's really more the management aspect underneath. Let's say I have 20 data scientists, all building models. I want a scalable environment in which they can run them. If they run on a lot of data, it'll horizontally scale using Kubernetes and elastic-type stuff.
But now that I've got all this data that I'm using to build models, I need to keep track of which data set trained which model, which was then applied to which other data sets to produce these types of results, and all that sort of stuff.
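That bookkeeping amounts to a model-lineage registry. The following is a minimal sketch under assumed names (the model and data set identifiers are invented); a production system would back this with a metadata store rather than an in-memory dict:

```python
# Toy model-lineage registry: records which data set trained which model,
# and which data sets the model was later applied to. All identifiers
# ("risk-model-v1", "arrests_2019", ...) are hypothetical examples.

from dataclasses import dataclass, field

@dataclass
class ModelLineage:
    model_id: str
    trained_on: str                               # training data set
    applied_to: list = field(default_factory=list)  # data sets scored so far

class LineageRegistry:
    def __init__(self):
        self.models = {}  # model_id -> ModelLineage

    def register_training(self, model_id, dataset):
        self.models[model_id] = ModelLineage(model_id, dataset)

    def record_application(self, model_id, dataset):
        self.models[model_id].applied_to.append(dataset)

    def provenance(self, model_id):
        # Answer "which data trained this model, and where was it applied?"
        m = self.models[model_id]
        return {"trained_on": m.trained_on, "applied_to": list(m.applied_to)}

reg = LineageRegistry()
reg.register_training("risk-model-v1", "arrests_2019")
reg.record_application("risk-model-v1", "arrests_2020")
```

With every training run and scoring run recorded, the "which data set trained which model, applied to which results" question becomes a lookup instead of archaeology.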
Sometimes data science can be a hero's journey, where a single practitioner gathers the data, cleans the data, and then builds the models. But what we're really trying to do is industrialize that. So, we've got a whole environment where you can build models at scale that also keeps track of the training sets and outputs, and builds libraries of models that everyone can find and use.
Deloitte refers to one or more of Deloitte Touche Tohmatsu Limited, a UK private company limited by guarantee (“DTTL”), its network of member firms, and their related entities. DTTL and each of its member firms are legally separate and independent entities. DTTL (also referred to as “Deloitte Global”) does not provide services to clients. In the United States, Deloitte refers to one or more of the US member firms of DTTL, their related entities that operate using the “Deloitte” name in the United States and their respective affiliates. Certain services may not be available to attest clients under the rules and regulations of public accounting. Please see www.deloitte.com/about to learn more about our global network of member firms.