DataOps, MLOps, and Self Service: How Data Teams are Changing with Jesse Anderson
Join Big Data Institute's Managing Director, Jesse Anderson, to learn how data teams are changing in response to the overwhelming demand for data products. Tune in as he and Sam discuss bringing software engineering into the domain of data - and why he wrote Data Teams.
Managing Director at Big Data Institute
Sam Ramji: Welcome to Open Source Data. I'm Sam Ramji and I'm here today with Jesse Anderson, author of the recently published book Data Teams. He's also the managing director of the Big Data Institute. Jesse has also published on O'Reilly and Pragmatic Programmers and has been covered in everything from the Wall Street Journal to the BBC to NPR.
He's spoken extensively at top big data conferences like the Strata Data Conference, taught and consulted extensively at hundreds of organizations, and has had thousands of students in his classes. Now he's mentoring organizations, management teams, as well as their technical teams. Jesse, thank you for having me.
I like to start these conversations by asking what does open source data mean to you?
Jesse Anderson: It means a few things to me. I think it means different things to different people, depending on if they're at a vendor or if they, as in my case, are consultants or if they work at a company.
So I'll put on my former Cloudera hat and open source data means that you're using open source technologies to deal with your data. But I also put on my consultant hat, where I'm used to dealing with the end customers trying to take these tools and make some value out of them.
And they're thinking how do I take these tools, and make data products? But I would go a step further and say, open source data may mean self-service data. How do we actually get our data into the tools of the people who have a business problem - the analysts, the product owners, all these people, business intelligence, data scientists to actually make a decision with their data?
And we, we need to be getting away from these, these systems where we're saying, no, you can't do that and get them into this open data where we're saying, Oh, go knock yourself out. You want to run that analytic over 10 years? Cool. Let me know how it goes.
Sam Ramji: Yeah. So even if the data is hosted in an open source environment, it still might be effectively closed off from you
if you can't access it, don't know where it is, and your tools can't talk to it.
Jesse Anderson: Exactly. And I see that all the time. There's a particular thing that I call the no team. They're there to say, no, you want to access your data? No, sorry. Can't do that. You want to actually come full circle, as data teams and start saying, let's figure out ways to say yes.
Cause we have the tools. We have the technology. We can do this.
Sam Ramji: One of the things that I had fun with last year was we wrote a set of the 10 questions you should never ask about data in an enterprise. And the first question was how often should the data committee meet? So you must be seeing a lot of challenges like that around doing the data science model properly.
You're teaching a lot of folks. You're seeing these things in real life: data that's accessible and inaccessible, tons of excitement, maybe over-excitement about data science and artificial intelligence. What do you think are the things that people need to know and what are you seeing as the ways to overcome the big hurdles?
Jesse Anderson: The biggest one I've seen managers not understand is that it's not technology, it's people. So coming back to that experience at a vendor: we were saying, "Oh, technology is the solution. You just put some Hadoop in place. You put some Spark in place and that's all you need."
And so that's carried on to the management level where they're saying, Oh, we just need this technology in place. And what I've tried to get people to do is take a step back. It's not the technology. It's actually the people. You put the right people in place, give them the right tools, the right time, sane product schedules, they will choose the right things.
And I think that's really important. Start with the right people on the bus and in the right places on the bus. There's no such thing as waving a magic wand and saying "I now dub thee data engineer and I dub thee data scientist." Sorry, but changing somebody's title is not actually going to change their skills.
Set these people up for success, please.
Sam Ramji: So was that part of the inspiration for writing the book?
Jesse Anderson: It definitely was, we have this disconnect between, oftentimes what the vendors are saying "Hey, our product is so easy. They don't need any new skills."
And that drives me nuts because I'm there in the trenches. So I give this talk called Foundations of Data Teams. And I bring up this concept called disclosure. When you sell a house, people have to tell you the problems of the house.
Now, metaphorically, whose responsibility is it to tell you the problems of your data teams, the problems of that technology adoption? And so we kind of go through this thought experiment of, "Okay, whose job is it? Is it management's? Is it a book's? Is it the vendors'? Individual contributors'?" And I ask people, what do they think? And they say, "It's everybody's job." And when something's everybody's job, it means it's nobody's job.
Sam Ramji: That's well said.
Jesse Anderson: That's why I wrote the book. I wanted to disclose and tell everybody, Hey, this is the problem.
These are some of the solutions to it, but know that there is this problem.
Sam Ramji: And as you say, the understanding of who to get on the bus, get them in the right seats on the bus. That doesn't seem super well understood. So just being able to have almost a canonical clarity about what are the ideal roles, or perhaps what are the jobs to be done?
What are the competencies and qualifications or experiences that might lead someone to be good at any of those roles would be an amazingly powerful piece of teaching for most people at most companies who are trying to solve problems in data. Can you give us a quick picture of how you define the seats on the bus?
Jesse Anderson: That is a good question. It's something I talk in depth about in the book, and frankly, it wasn't something that was talked about. So in Data Teams, I talk about the three teams, those three teams are data science, data engineering, and operations. So data science is mostly filled with data scientists, somebody who's taken mathematical background, statistical background, learned some programming and applies that to creating models. And that some programming is a key part of that definition. Then we move on to our data engineering team.
They're the creators of data products, where the data scientists are the consumers of data products, and that's an important distinction. My definition of a data engineer is a software engineer who has specialized their skills in big data. And it is important that they have a software engineering background, otherwise they will not be as successful with these tools. The data engineering team is mostly filled with data engineers; there may be some cross titles there, but it is mostly filled with data engineers of that type.
We have an operations engineer who has taken those operational skills and then specialized in these big data tools. How do they operate the frameworks? But there's also this other key part, and that's your code. They need to understand your company's code to be able to operate how that works with the framework.
So the issue that kind of prompted me to write the book is the message came across that you only need data scientists. Data scientists were this unicorn that could do data engineering and operations. And they just weren't that person. What you're better off doing is getting these teams, working together, getting symbiotic relationships, good, strong connections with each other, and having them work together. And you're so much better off than trying to find that unicorn.
Sam Ramji: That's a really powerful framing. I would love to hear you talk more about the nature of software engineering in the data engineering discipline, and then maybe look at a couple of other constituencies, like application engineering, and how a business or data analysts fit into the picture. Let's take the software engineering tour through data engineering. Can you talk about that a bit?
Jesse Anderson: Sure. So let me start by saying, how is a software engineer different from a data engineer and what I found, that's key is there's this love or interest in data. So some software engineers like to write code, but they don't care about the data.
Their interface for data has been, we throw it in the database and that's it. But then you have software engineers that are more interested in data. So personally I call myself a data engineer. I did the analysis. I was curious about what this data means. And so I would take that data and do cohort analysis, and it takes a decent amount of software engineering skills to do cohort analysis. So I was able to take the data, apply software engineering, and get some analysis out of it. Now your data engineers may not be doing tons of analysis, but that's an example of what that interest looks like.
We also have the nature of our big data tools right now. Not every tool or framework has SQL as an interface. SQL definitely lowers the bar technically, but you have other technical things like Flink, Kafka, Pulsar. You need to be programming in order to do this.
And that's a key thing that I think is missing. Sometimes people will think that their data warehouse team can hang with this. And that's not really the case; the data warehouse team lacks that software engineering ability to do this right. So we really do need that software engineering background to do this successfully, because the complexity increases.
Sam Ramji: Yeah. That makes perfect sense. We see a lot of software engineers using Cassandra, and Cassandra is mostly deployed into the operational plane, supporting applications. So it's often a big chunk of what supports a particular microservice, and out of your eight or 10 folks on the microservices team, there'll be one or two software engineers. And this was a surprise to me: that's the title of the people who are managing Cassandra, which I had always thought of as a database. So your point is super well taken.
Jesse Anderson: I strongly believe that one of the goals of data engineering should be to give self-service abilities to the data analysts, to the BI team so that they are doing the infrastructure, the right way, those data products, the right way.
For example, I wouldn't want an application engineer writing a Spark job, but what we would want is to get the data engineering team to give a good interface, a good, easy way for that application team to get a result back.
Sam Ramji: It seems to be that there's a lot of inbound pressure that's only increasing on data teams. On the one hand, you've got your business analysts who are domain experts in the business often relied on to explain to the product manager or to the executives, what they ought to do and help support big decisions in the business. And they are relying pretty heavily on data teams. They want more analysis, more features, better modeling.
On the other hand, you have application teams who are trying to build adaptive real-time applications that get smarter as more people use them, right? That have better recommenders, better auto-complete. They also want to be using models, but they want to be using them in real-time. And they want them to update and interact in some meaningful way with operational data. So those seem to be two dimensions that are independently putting a lot of extra demand on data teams above and beyond perhaps data science, providing modeling and decision support for their customers. Does that seem right? And how do you see that pressure changing?
Jesse Anderson: I completely agree with you. Demand is outstripping supply by far. You have high demands on the data teams. Then you have the demand from data science on the data engineering teams. At the vast majority of companies, demand is far outstripping supply. At some companies, especially big companies, what will happen is the business units will feel that pressure. And so they'll start to hire their own data scientists, their own data engineers, with the thought that it is just a lack of people, rather than a lack of people, a lack of infrastructure, and a lack of organizational design.
You heard me talk about self-service and the reason I talk about self-service so much is it's critical for these teams to be able to get at the data themselves. We don't want a, "Hey data engineering team, can you run this query for them?" No, it's, "What's the link for me to do this myself so I can self-serve?" And that way we take out some of this low-lying stuff that the people can do themselves. And as we look at that, we can look at maybe some of those queries, let's say we're on Cassandra. Maybe some of those queries are too complex for them. But what we could do is start to say, here's the baseline of that query, parameterize it. Here, I've written this query for you. Just pop in the dates. There you go. You can do this yourself. Take the data engineers out of the tasks that others can do themselves. Take the data scientists out of the tasks that other people can do. You'll find yourself with a more productive team.
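The "here, I've written this query for you, just pop in the dates" idea can be sketched in a few lines. This is a minimal illustration, not anything shown in the conversation: it uses Python's built-in sqlite3 as a stand-in for whatever store a team actually runs (Jesse's example was Cassandra), and the `orders` table, its columns, and the function names are all hypothetical.

```python
import sqlite3

# Query written once by the data engineering team.
# The table and column names are hypothetical, for illustration only.
ORDERS_BY_DATE = """
    SELECT order_date, COUNT(*) AS orders, SUM(amount) AS revenue
    FROM orders
    WHERE order_date BETWEEN :start AND :end
    GROUP BY order_date
    ORDER BY order_date
"""


def orders_between(conn, start, end):
    """Run the pre-written query; the caller only supplies the two dates."""
    params = {"start": start, "end": end}
    return conn.execute(ORDERS_BY_DATE, params).fetchall()


def demo_connection():
    """Build a small in-memory database so the sketch runs on its own."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (order_date TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [
            ("2020-01-01", 10.0),
            ("2020-01-01", 5.0),
            ("2020-01-02", 7.5),
            ("2020-02-01", 20.0),
        ],
    )
    return conn


if __name__ == "__main__":
    conn = demo_connection()
    for row in orders_between(conn, "2020-01-01", "2020-01-31"):
        print(row)
```

The analyst never edits the query text, only the two date values, and because the dates are bound as parameters rather than pasted into the string, a malformed input cannot change what the query does.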
Sam Ramji: That's a great insight. Many of the data executives I've gotten to talk with over the last several months have had two major issues. One is, they find that the data engineering team has an average backlog of about six months of different data sources that they're supposed to hook up. They're waiting for permission from other teams in order to get a data set that the business has asked for that then they can give to the data analysts or to the data scientists.
So your point of self-service brings us closer to the idea of cutting the backlog down by eliminating a bunch of work that maybe they shouldn't be doing, through parameterization and other smart software engineering techniques, to your prior point. You're not, as a software engineer, trying to solve one thing a million times in a row. You're trying to solve a million things once, because we are creatively lazy, as many people have identified about software engineers.
The other problem that they talk about is the growth of demand for the models to be served in real-time. And the organizations that they've built, their data teams, are keeping up with the demand right now, but they can see that six months, 12 months from now, they're going to have to distribute the servers that are serving the models. And now that starts to take them beyond their remit, beyond their team competency, beyond their team size, as the application path now wants the machine learning models to run as part of the application serving. What are you seeing there? Is there a way to change and simplify that job or is that just going to be challenging for the next few years to figure out?
Jesse Anderson: ML ops, I see, is one of the bigger growth parts of data teams. And I would say right now, some companies have done it well and for a lot of companies, it's a second or third-order problem, kind of a TBD of "How do we do this?" But it's going to be a pretty key issue. So I think what I would be doing, if I were sitting in a manager's shoes right now, is looking at ML ops automation software, getting the human out of the loop. And, to that point we were just talking about, we automate whatever is possible. So look at the steps. Is the model training manual steps, or can I push a button and kick off that model training? That's a key thing. Is it a push of a button to deploy a new model?
We need to be looking at this, it kind of circles back to what I was talking about with the data scientists, the data scientists with all respect to them are not usually good enough software engineers to automate these sorts of processes. They're thinking, how can I make this into an RNN rather than how can I make this easier to deliver and put this into production.
So we'll need this team effort of perhaps the data scientists working with the data engineers. But there's this overall question of where ML ops sits. And my hunch is that ML ops should actually sit in operations, but have been automated properly and well by the data science and data engineering team.
And a big difference that is important for the listeners to know about is the difference between software and a model. Software, as long as you don't make any changes, can sit there forever. Now with models, you have model drift. So there's this part where the operations person will need to understand what drift is okay. At what point do we do this? So there has to be some level of documentation, some level of help, but I think the operations team, once they have that help, once they have those tools, is the one to do this the right way.
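One way to give operations the "what drift is okay" answer is to write the tolerances down as code. The following is a minimal sketch under stated assumptions, not a method described in the conversation: it treats drift as a drop in accuracy on recently labelled data relative to the accuracy measured at deployment, and the function names and the two threshold values are hypothetical placeholders that a data science team would choose and document.

```python
from statistics import mean


def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    return mean(1.0 if p == y else 0.0 for p, y in zip(predictions, labels))


def drift_action(baseline_accuracy, recent_accuracy,
                 warn_drop=0.02, retrain_drop=0.05):
    """Map an accuracy drop to an action the operations team can follow.

    The thresholds are illustrative assumptions; in practice they would be
    agreed on and documented by data science and data engineering.
    """
    drop = baseline_accuracy - recent_accuracy
    if drop >= retrain_drop:
        return "retrain"
    if drop >= warn_drop:
        return "warn"
    return "ok"


if __name__ == "__main__":
    recent = accuracy([1, 0, 1, 1], [1, 1, 1, 1])
    print(recent, drift_action(0.90, recent))
```

With the rule written this way, the operations person does not have to judge the model itself, only run the check on a schedule and follow the documented action.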
Sam Ramji: So you've pointed to data ops as kind of a better system that has maybe an outcome of self-service data, and ML ops as something that enables you to have better management of ML. It sounds like data engineering is changing as a job. What do you see in terms of the necessary skills that data engineers and data teams need to be focusing on for the next couple of years to accelerate and improve automation?
Jesse Anderson: One of the biggest trends we're seeing is the march toward real-time. So, what we're seeing is technology such as Kafka and Pulsar really pushing that forward. So if you are a data engineer and you're still doing batch, hey, batch is still okay. But what I would be doing is looking around, not just as an individual contributor, but as a manager, saying, what could we make real-time? Is there a business that needs to be in that real-time space? So definitely looking at that. We also have other interesting projects that are coming out. There are big trade-offs that a data engineer should know about. Let's say these newer-generation projects, Druid for example: they're doing roll-ups. Well, what's the difference between doing a roll-up versus, let's say, Cassandra?
You more than likely will need both. Or what are the differences between Pulsar and Kafka? It's an important distinction for you to know. Is there some feature, is there some specific thing that really pushes you over the edge on one of those technologies over the other?
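The roll-up trade-off mentioned above can be made concrete with a small sketch. This is plain Python showing the general idea of pre-aggregating events by time bucket and dimension; it is not Druid's ingestion API, and the event layout (timestamp, page, bytes) is a hypothetical example.

```python
from collections import defaultdict
from datetime import datetime


def roll_up(events):
    """Aggregate raw events into one row per (hour, page).

    events: iterable of (ISO-8601 timestamp, page, bytes) tuples.
    Returns a dict keyed by (hour, page) holding a count and a byte total.
    """
    totals = defaultdict(lambda: {"count": 0, "bytes": 0})
    for ts, page, nbytes in events:
        # Truncate the timestamp to the start of its hour.
        hour = datetime.fromisoformat(ts).replace(
            minute=0, second=0, microsecond=0
        )
        row = totals[(hour.isoformat(), page)]
        row["count"] += 1
        row["bytes"] += nbytes
    return dict(totals)


if __name__ == "__main__":
    raw = [
        ("2020-06-01T10:05:00", "/home", 100),
        ("2020-06-01T10:45:00", "/home", 50),
        ("2020-06-01T11:00:00", "/home", 10),
    ]
    for key, row in roll_up(raw).items():
        print(key, row)
```

Three raw events become two stored rows, which is why rolled-up data is smaller and quick to aggregate. The cost is that the individual events are gone, whereas a row-oriented store keeps each record available for lookup, which is one reason a team may end up needing both.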
One of the questions I keep on seeing is, how do people keep up? Do you feel like you need to know a bunch of technologies? The answer is yes, this is why you're paid the big bucks quite frankly, this is part of your job. So if you don't want to do this, Hey, maybe there are other things you could do, but data engineers need to maintain at least a cursory understanding of these technologies. Hey, if I'm going to do something in real-time, I'm actually going to look at Kafka and Pulsar. And that there's a reason why I would use one or the other. It's not just, you know, put my finger to the wind and say, okay, Kafka is the most popular. No, there are fundamental differences there.
Sam Ramji: Yeah. And this is something of a challenge for people who are coming from the sort of well structured, singularly designed, business intelligence monoliths that so many companies have run on. We keep seeing that modern infrastructure always has a data stack, right? You see Pulsar or you see Kafka, you see Cassandra, you see Hadoop. You now start to see Druid. All of these things need to work together harmoniously, especially as you see more and more choices for development teams to decide what backends they want to use.
One of the things that you and I talked about, earlier on was about ethics and really bringing a sense of honesty and ethics and transparency to the transformation that we see for the industry.
That there's a set of new skills that are moving and changing and becoming more valuable. And at the same time, old skills are not changing and they're becoming less valuable. So there's a risk of being left behind. I think you had a really thoughtful and sensitive point of view on this. I'd love to hear you talking about that.
Jesse Anderson: Some people won't be able to make this journey. Some titles won't be able to make this journey. And so I think it's incumbent upon companies to realize this. I believe it's incumbent upon individual contributors to realize this as well. So one of the key teams I'm thinking of is data warehousing. Data warehousing is changing dramatically, and data warehouse engineers, if you're listening to this, this isn't a blip on the radar. This is a gradual diminishing of the size of that data warehouse team. I've had people write in to me and say, hey, I was just downsized, and now I'm looking around and there are not many people hiring data warehouse engineers right now. And it's because of that 10-person team that you were using for that data warehouse: now you have better tools, you have the Cassandras, you have the Sparks. And now I don't need a 10-person deep team. I can use a five-person deep team. And so what's happening is you have the data engineers taking a piece of what the data warehouse was. And it's not like it's coming back.
This isn't a fad or something. This is not coming back. So to that end, data warehouse engineers, please know that learning to program is going to take longer than you think. This is not a, hey, I can learn this on a weekend. Programming is not API memorization. The issue for data warehouse people is not just a lack of programming. It's also a lack of complex systems creation. That's a whole problem unto itself, and just glomming onto, let's say, a WYSIWYG or, you know, a drag-and-drop system is not going to get you by. That's a whole problem. You're going to have to invest a lot of time, a lot of effort, and this is going to be difficult.
It's worth it. I've seen some data warehouse people who were able to make this change. They were able to future-proof their skills.
And now switching over to the company side from an HR perspective, from a management perspective, frankly there will be layoffs. We're already seeing these layoffs. You should know that there will be downsizing of these teams.
So what you'll want to be looking at is how can these people be re-skilled, how can you find them other jobs within the company, perhaps more aligned with their skills, but you should know this, that this is not a technology change without HR changes as well. If you were to use the euphemism right-sizing, there will be some right-sizing there.
So know this. Please do come to this understanding. What I really want to reiterate is that it is unfair to both sides to not know this. I think it's really open and honest to say, hey, yes, there's going to be this skills change, acquire these skills. Likewise, some people won't be able to acquire those skills.
Sam Ramji: Yeah. So there's a strategic change in how we use data, and that's dragging in a technology change so that we can get different velocity and different unit economics. And that brings in a workforce skills change. It would be undisciplined and unethical to think it's all going to work, and it would also be undisciplined and unethical to leave anyone behind, because the people who've been at your company, who understand the domain well, who've been modeling it, are probably the ideal people to help it transform, as long as you help them transform. So if you're a manager, if you're a leader, you should be actively investing in re-skilling your folks and understanding that it's not a language, it's a systems orientation.
So yes, you might want to learn Java or Go. And you want to bring people along with a sense of what the distributed system looks like and how each piece functions.
Jesse Anderson: Exactly.
Sam Ramji: So, Jesse, I really appreciate your time. It's been an awesome conversation. We like to end our podcasts with one link, a resource or a word of advice you'd like to leave with our audience of, uh, data professionals, practitioners, and aspiring data team members.
Jesse Anderson: As self-serving as it sounds, hey, read my book. If you're a data warehouse engineer, I actually have a section just for you. I took a lot of my learnings and put them there. Avail yourself of that, because this isn't just me hypothesizing. This is actually me having spent eight-plus years in the field, and I try to share as openly and honestly as possible. Please do avail yourself of that.
If you are a software engineer and this sounds interesting to you, also know this isn't something that you switch overnight. This isn't a weekend thing. This is something that is a concerted effort. This is probably going to take you six-plus months.
For managers, understanding this brings about organizational problems. The biggest one is sometimes people will say technical issues are the manifestation of an organizational problem. I'm just going to pick on Cassandra, since we're here.
Sometimes people will say that Cassandra didn't work. And I hear that. And I think, well, Cassandra actually worked; what you had was an organizational problem that manifested as a technology problem. So as you hear that, as you think about that, think about where it came from. What is the real source, rather than here's what I was told?
Sam Ramji: Jesse, thank you for your time. That was really thoughtful wisdom, and I hope everybody listening takes it, applies it, reads your book, Data Teams, and passes it on. Buy one for a friend.
Jesse Anderson: Datateams.io. Thank you again, Sam. I really appreciate it.