Season 2 · Episode 6
ModelOps, ML Monitoring, and Busy Humans with Elena Samuylova
It’s 2 AM - do you know what your models are doing? Listen to Elena Samuylova as she talks with Sam about how to bridge the critical gaps between data scientists, engineers, and business managers using tooling and empathy. Learn how to design for diverse expectations, and more!
Co-Founder and CEO at Evidently AI
Hi, this is Sam Ramji and you're listening to Open Source Data. I'm here today with Elena Samuylova, who is the CEO and co-founder of Evidently AI. Elena, it's a pleasure to have you on the show. Welcome.
Thank you for having me.
We'd like to start each episode asking our guests what open source data means to them. So, what does open source data mean to you?
Since I'm a founder of an open source company that has to do with data, it has a very literal meaning to me. That's the space I'm operating in, but to say it a bit more broadly for me personally, it's all about the community and the opportunity to co-create value with other actors in the ecosystem and to solve a lot of problems that actually affect pretty much all of us.
And I think now the most interesting things are related to how you scale using data in a company. How do you actually get value out of it and not just pretend to be data-driven, and how do you operate all these data products reliably and safely, and maintain them long term? These are all pertinent problems that I feel, as a community, we're looking to solve. And for me, this is what this idea of open source data embodies.
It's really neat that you pointed at data products because that's a term that is starting to be used more and more, but it's still kind of novel. And there is something thoughtful and respectful and innovative about stating that the data is not merely data, but that you are going to specify a domain.
And in that domain, you're going to create a data product. I think that's super cool.
Yeah, absolutely. I agree with you. And for me particularly, machine learning products are a subset of data products, which I think should also be treated as such, because it's not just that you're creating a model; you're actually integrating it to solve some particular business problem within a domain. So you should treat it as a product as well.
So when you think about a service product, you think about SLOs, right? You think about service level objectives. And one of the ways you meet your SLO is that you measure it, and you're like, all right, we're going to do some monitoring to make sure the service is up.
That has a really different meaning in ML than it does in traditional system monitoring. Right? Because maybe uptime is a given, but maybe there are other things that we ought to be monitoring in order to make sure the SLO is working.
How is ML monitoring different from traditional software monitoring?
Well, we still have to take care of the service operations. It is still a software service. But there is this extra layer on top of it, which relates to the machine learning system itself. And I think every time you talk about monitoring, you basically talk about things that can go wrong, right? What can fail, what can break. And with machine learning systems, there are particular types of failures that might happen beyond just, you know, not giving a response when you send an API request. You can still return a response, but it can be something you should not really trust, because maybe your input data was wrong.
Or maybe the data was within the expected boundaries, but there's this concept of data drift, which is one particular example: when the distributions change and the model starts operating in a domain that is not really familiar to it. So you get the prediction, but you should not trust it or act on it. You should be able to detect these situations.
Another type of problem is called concept drift, which is basically when the patterns themselves change and evolve. We've all observed that recently with the pandemic. For example, you would have people shopping with a completely different pattern, right? So your demand forecasting models probably would not pick that up.
So you should be able to detect these things and proactively, ideally be able to resolve them, right. And these are specific aspects of the machine learning system that we should monitor, and we should know how to resolve. And this makes it a particular domain on top of the fact that you still have to monitor the software system.
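The data drift detection Elena describes can be sketched in code. One common heuristic is the Population Stability Index, which compares how a feature's production distribution differs from a training-time reference. This is an illustrative sketch, not Evidently's actual implementation; the 10-bin bucketing and the 0.2 alert level are conventional but assumed here.

```python
import math
from collections import Counter

def psi(reference, current, bins=10):
    """Population Stability Index between two numeric samples.

    Buckets both samples using bin edges derived from the reference
    data, then compares the share of points falling in each bucket.
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0
    def bucket(x):
        # Clamp out-of-range production values into the edge buckets.
        return max(0, min(int((x - lo) / width), bins - 1))
    ref_counts = Counter(bucket(x) for x in reference)
    cur_counts = Counter(bucket(x) for x in current)
    score = 0.0
    for b in range(bins):
        # A small floor avoids log(0) for empty buckets.
        p = max(ref_counts[b] / len(reference), 1e-6)
        q = max(cur_counts[b] / len(current), 1e-6)
        score += (q - p) * math.log(q / p)
    return score

# Same distribution -> PSI near 0; shifted distribution -> large PSI.
reference = [i / 100 for i in range(1000)]      # uniform on [0, 10)
shifted = [5 + i / 100 for i in range(1000)]    # uniform on [5, 15)
assert psi(reference, reference) < 0.1
assert psi(reference, shifted) > 0.2            # 0.2 is a common alert level
```

In practice the reference window is usually the training data and the current window a recent slice of production inputs; where exactly to set the alert threshold is a judgment call for the team.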
That's pretty complex, right? Compared to traditional software development, which is a little bit more like building a recipe, where everything flows procedurally from what you've written. And maybe in a distributed system it becomes more complex to understand how all these procedural calls or event-driven calls end up responding to each other. But that's not necessarily about conceptual correctness. So when you start pointing at the problems that you can have with drift, it's almost like where philosophy meets mathematics.
How do you think about the underlying problem of drift? What is drift? What causes it, and how much domain knowledge versus data science knowledge do you need to have in order to know what is happening?
Well, it's a very interesting angle to take, because some models are maybe more robust to drift, if you have, for example, domain expertise built into them. But when you talk about machine learning models, very often they're just deriving statistical relationships. So they don't really know what the causal relationship is behind the problem they're modeling.
So you have to ship them together with some guardrails to know when they're operating outside of this domain. There are other approaches to modeling that might be a bit more robust. But when we're talking particularly about machine learning, this is a very special thing: the model can be very right, but it operates only within the domain that it knows. It doesn't know that it's wrong. That sounds very philosophical, right? It doesn't know that it doesn't know.
It is an important component in AI, right? To have that philosophical substrate. Does the program have introspection ability? Can it tell you why it's doing what it's doing right now? What did it do previously? What's it going to do next? And can it self-diagnose errors? So this is kind of the emerging space of practice. When I went to school and got my degree in AI and neuroscience, it was mostly theoretical, because we weren't doing things of huge complexity with enormous amounts of data on an enormous amount of compute; it was too expensive. Now it's all cheap. But I imagine with that abundance, you have a lot more classes of error and a lot more subtle errors.
Most of them are plainly silent. That's the thing, right? So you don't really know there's something wrong. Unlike software, which can just glitch and send you an error message, the machine learning model does not send you this error message.
So you have to define appropriate tests and appropriate behavior yourself, and check both the inputs and outputs of the model to make sure you can rely on what's going on.
And how do you tackle that? Is it a set of tools that you need to give the data scientists, or is it a set of conversations that the business owner and the data scientists need to have? Take us down there for a minute or so.
There are probably two parts to it. First, when you actually define what you want to build, and this is where you need the involvement of domain experts, right? Because you need to understand what can go wrong and maybe prepare for some errors in advance.
So, one example I can give you is from the manufacturing domain where I used to work. You have a lot of sensor data, and this is typically what you use your machine learning model on. And your sensors might just break. So this physical sensor might start showing, I don't know, the same number for a certain period of time, and it can only be replaced during maintenance.
So your model should be prepared for this particular type of error. The sensor will still show you a number that is within expectations, so you cannot just invalidate it. But if you see that it stays the same for a period of time, you should probably dismiss it. This sort of thing doesn't come built into the model, and the data scientist doesn't know that either. You have to have this conversation with your domain expert to be ready for it.
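The stuck-sensor guardrail Elena describes can be sketched as a simple check over the raw readings. This is a minimal illustration; the five-reading window and the tolerance are assumed values that a domain expert would supply.

```python
from typing import List

def stuck_readings(values: List[float], window: int = 5,
                   tol: float = 1e-9) -> List[bool]:
    """Flag readings that are part of a run of `window` or more
    near-identical consecutive values, a common symptom of a broken
    physical sensor that keeps reporting its last measurement."""
    flags = [False] * len(values)
    run_start = 0
    for i in range(1, len(values) + 1):
        # Close the current run when the value changes or the series ends.
        if i == len(values) or abs(values[i] - values[run_start]) > tol:
            if i - run_start >= window:
                for j in range(run_start, i):
                    flags[j] = True
            run_start = i
    return flags

# Five identical 22.0 readings in a row get flagged as suspect.
readings = [21.4, 21.5, 22.0, 22.0, 22.0, 22.0, 22.0, 21.9]
flags = stuck_readings(readings, window=5)
assert flags == [False, False, True, True, True, True, True, False]
```

The flagged values are still within the expected numeric range, which is exactly why this check has to come from the domain conversation rather than from the model itself.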
And the second part comes when you deploy the model, right? You need to be able to alert on the things that might be going wrong. But when you're interpreting what's happening, you again might need to have this conversation: Is this appropriate behavior? Should I expect this model to be able to handle this or that input?
This is a kind of teamwork, right? You cannot just designate it and say, "Hey, it's the data scientist's job, you deal with it. You need to make sure that this model delivers." You need to figure out first what you hope it will deliver.
Yeah, it sounds both conceptually and procedurally complex. Right. And it's 2021, and we all know that in the future pretty much every business on the planet is going to pass through a machine learning gateway. Company announcements, major CEO shifts in company strategy, Fortune 500s, Fortune 50s are all talking about the pathway through ML to create the next few years of business.
So between the demand and the complexity that we have right now, something has got to give. Five years from now, like 2026, what do you think the state of machine learning modeling and monitoring is going to look like? How are we going to change the shape of practice so that it's more coherent, more obvious, and just kind of better practice?
That's exactly my expectation: we're going to have it take shape. Because right now, when we talk about machine learning operations in general, it's the wild west. A lot of companies are reinventing the wheel. There are a lot of tools popping up, but there is no real standard practice where you would say, "Hey, this is how I'm going to monitor my models when they're in production."
So if you look at the DevOps domain that we already referenced, there are companies like New Relic and Datadog that helped shape the practices of how we look at a software system once it's deployed. But there is no such practice for machine learning models. And I hope that five years from now, we'll be able to say, "Hey, now I'm shipping a model into production. That's how I'm going to monitor it. These are the metrics that are used for it. These are the people who are going to be responsible. That's how to resolve and escalate incidents."
And this sort of readiness is not there yet. Most companies are still struggling with putting these models into production, and they have not yet had the occurrences to learn from. So I hope that in five years we're going to have it a little bit more streamlined.
In most of the industries that start to modernize and standardize, as you pointed out with New Relic and Datadog, the tool and the user inform each other. Right. And then the tool and the user kind of inform the team around them on what is normal: what's the workflow, what's the obvious set of steps that you take, what's the sequence.
When we get to 2026, do you have any categories of tools you think are going to be radically standardized? Like everybody just kind of says, "Well, obviously you do X." Right? "Obviously you pick up your logs in a system in context." What do you think that's going to look like in machine learning?
Now, I expect that we're all going to have this visibility into the models, and we will take it as something expected. Right now, when you have a website, you would have Google Analytics attached to it. When you have a product, you have something like Mixpanel or Amplitude, so you have some product analytics, right?
You have a software system. You look at these dashboards by Datadog and New Relic. And then you have a machine learning system, and you look at nothing.
I don't believe that's going to continue. So I expect that we will assume the system must be shipped with some kind of pane that you can look at and understand what's going on.
So this is something that we'll definitely take as expected. But attached to it, there are so many adjacent problems. How, for example, do you manage the experiments? How do you log all the results? How do you understand how the model was created, and on which data, so you're able to reproduce it? So I'm sure there will be a set of components that we believe are parts of a reliable system that we want to have there, just like we have with software. Ten years ago is probably a good reference if you want to think about where machine learning is now, right? It's software 10 years ago.
It sounds like there's going to be a lot of, uh, metadata and metadata analysis involved.
One of the things that you're doing with your time is you're putting your whole effort into creating Evidently AI, which is currently part of Y Combinator. So congratulations on that; it's an outstanding accelerator.
I'd love to hear you talk about Evidently AI, what the system that you're building looks like, and how folks can think about the tool and get a chance to use it.
Everyone can use it, because it's open source; you just go and download it. That's the beauty of it. And this is a tool that helps you visualize your model performance really quickly. So if you have the model running and you have the logs, which are usually stored, you know what was predicted and what the response was.
Sometimes you would also have to wait until you know the ground truth, meaning the actual values, so you know whether what you predicted was right or wrong. And then we help you spin up dashboards that are very visual and that help you calculate the metrics and basically understand what's going on with the model.
For example, did the data change significantly? What are the segments where the model is failing? It gives you peace of mind if everything is fine, or it helps you debug and understand where the issues are if you're trying to solve them.
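The segment-level check Elena mentions can be sketched with a few lines over logged predictions once ground truth arrives. This is a minimal illustration of the idea, not Evidently's actual API; the `(segment, predicted, actual)` log shape is an assumption for the example.

```python
from collections import defaultdict

def error_rate_by_segment(logs):
    """Given prediction logs as (segment, predicted, actual) tuples,
    return each segment's error rate so failing segments stand out."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for segment, predicted, actual in logs:
        totals[segment] += 1
        if predicted != actual:
            errors[segment] += 1
    return {seg: errors[seg] / totals[seg] for seg in totals}

# Hypothetical classification logs after ground truth has arrived.
logs = [
    ("new_users", 1, 1), ("new_users", 0, 1), ("new_users", 1, 0),
    ("returning", 1, 1), ("returning", 0, 0), ("returning", 1, 1),
]
rates = error_rate_by_segment(logs)
assert rates["returning"] == 0.0
assert abs(rates["new_users"] - 2 / 3) < 1e-9
```

An aggregate accuracy number would hide exactly this kind of failure: here the model looks fine overall while one segment is badly off.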
Yeah, process visualization tools start to create standards of practice. Once I can look at a software development CI/CD pipeline, now I have a metaphor: oh, it's a pipeline. And then in that metaphor you have additional information, which is that you have states, and there are good and bad states. And we can all look at the same visual metaphor and agree on this shared imaginary reality, and then we can work in that imaginary reality together. As you lay out that set of visual metaphors in Evidently, you're going to be shaping how people do the work and think about the work.
What did you learn in the last decade that kind of stands out as the, sort of the core of your visual metaphor? How do you think people should think about what they're doing in ML monitoring?
I think there is a particular aspect of it, which is you have a lot of people who might look at it. So you have different roles and different people who might need to understand what's going on with the model. And even among data scientists or the people who are creating the model, this is not like a uniform bunch of people who will have the same skill set.
Some of them are coming more from software engineering backgrounds, so they understand all these notions that you were just mentioning. Some are coming more from business-analytical backgrounds, or maybe they're PhDs, domain experts in something, and then they learn a bit of statistics on top of it to be able to create models.
So all these people have very different expectations and understanding of what they want to see and even some sort of different knowledge. Some of them are better with statistics. Some are worse.
So it's very interesting how you can create a tool that would be helpful for all these different audiences. On one side, it should be easier to use for those who are maybe a little bit less rigorous with statistics and understanding how it works, and at the same time give all the flexibility to the experts who might need to tweak everything under the hood. That is what we are striving to do: to enable all these people to use the tool to the best of their knowledge.
And at the same time, there are also the business people, the domain experts, who might look at it and who are not data scientists at all. So we want to give them the opportunity to interpret what's happening.
So it's a very layered approach, right, in how you communicate what's happening. And that's probably one of the biggest challenges for us to solve, but I find it very exciting. How do you manage to present all the needed complexity, to not make it simpler than it is, and help people really understand what's going on? That's a very interesting design problem to solve.
It's a design problem of awesome scope. And it's something that we've encountered in the software industry for decades. I remember using 4GLs, right, fourth-generation languages, early in the '90s, which were supposed to be, you know, very efficient. But then of course you had this problem: you felt like you were in a sandbox for novices, and you couldn't break out of the sandbox and use your expertise.
Later on, right, we started to see that evolve. Visual Basic grew a COM layer, right, the Component Object Model. And now you could kind of flip back and forth: did you want to use a quick builder and a quick visual demonstration of how the code was going to work? And then for your complexity, you could kind of escape and write your own C++.
That said, it's an incredibly challenging problem. At least in coding, we had a lot of prior art about procedural development, IDEs, and all that. It seems like the wild west comment that you made about machine learning is evidenced everywhere.
So being able to constrain that and say, okay, here are the obvious views for your three different audiences: your business user, your data scientist, and perhaps your software engineer.
What's the biggest surprise that you've seen in that so far, as you've developed Evidently?
I was very pleasantly surprised by what it's like to create an open source company, because so many people reach out to you, and they're very generous with sharing what they want and what they see, and with giving feedback.
This is above expectations, I would say, in how people react when you are sharing openly what you're doing; they are all so kind in giving you feedback. And that's incredibly helpful when you want to prioritize features and understand what's actually needed, right? Because they come and literally ask you to do this, or suggest that they contribute. That has been truly amazing.
But another aspect is that there are also so many people coming and basically validating the idea we started this company with, saying that their models are not monitored. They share stories about failures or model issues that went unnoticed.
And it is still surprising, because if you come from a software background, it feels like a given, right? You have a system; you need to know how it's working. Apparently, with machine learning, we are not there yet. There are, of course, big tech companies that probably have that more streamlined. More enterprise businesses and traditional industries are still in the earlier stages of adoption. So that's yet to be figured out.
It's that opportunity to practice some really powerful, positive ethics, right? When you create visual metaphors that reduce confusion, then people get less unhappy. Often when we're confused, we're fearful and angry, and then the team starts to get into a negative mood, and that is probably the biggest damage that you can do to efficiency. When you've got a data science and machine learning team, you really need that team to be incredibly effective. Coming from DevOps, there's a saying, "no grumpy humans": make sure that we're being kind to each other and that we can agree on the state of the world, and now we're all diagnosing the state of the world together.
So as you create these multi-layered metaphors, somebody modifies the code as an expert, and then somebody else, in their visual interface, realizes that things are fixed and now they can relax, without having to have an email or a Slack thread or any of the other things that go around, right?
There are really big breakthroughs, I think, ahead of you for what you're bringing with Evidently AI to a really highly constrained set of users, right? The teams that you're dealing with are under a lot of pressure to deliver, and a lot of things are very, very custom and hard to diagnose.
Where do you get your inspiration for the business model that you're building the company on? My take is you have a very inspiringly pure sense of how an open source business can be built, and I'd love to hear you talk a little bit about your inspiration for that and how you're practicing it.
You know, there are both rational and emotional arguments for that, I would say. Rationally, it's a very efficient go-to-market strategy these days. If you're creating a tool that will be used by technical audiences, it should be easy for them to adopt, very fast, so they can actually start using it. And open source enables this even better than, maybe, just a freemium approach. It also gives you this feedback loop that any startup needs in the early days, right? You want the users to come and tell you what they want, and in open source it's kind of built in. Once people learn and try it, they start sharing this stuff. That just sounds to me like almost a no-brainer. Why build an infrastructure company and do it closed source, right?
And there's the angle of the community contributing too. It helps you, let's say, keep up with all the changes, all the integrations, all the data formats. There are all these things that you need to maintain in your product, and you want to make it easy for anyone to add something and make the tool work for them, right?
And then there is sort of an emotional aspect, in that I think it's just the right way to build software and to build products for the end user. Not just, you know, top-down, big monolithic things that you're going to have handed to you, where you're like, "Hmm, how am I going to use it?" You actually want to co-create with your users.
And I think that's also a very important part for me as a founder, right? You choose the market, you choose the users to work with. You want to know that they're really happy, that I'm doing something they really want.
Yeah. I love the ethical and the aesthetic elements of what you've said. Super cool.
What do you hope to see from the open source community, both in the area that you're focused on specifically with Evidently AI and more broadly in the space of model operations?
I think we're going to have an interesting sort of fight, right? But in a positive way: to create this standard stack, because it does not yet exist in machine learning. For example, when you want to create a model, you want to process the data, experiment, deploy, monitor it, and then close the loop again. There are probably going to be a few companies on each part of the stack, offering something like the ELK stack or similar things that already exist in many other software domains.
We still don't have that in machine learning, and machine learning operations specifically. So we're probably going to see it take shape, with a lot of companies trying to create standards. And I really hope that we will be able to create this as something that the community wants, responding to its needs, so the standards are not just imposed because someone made a choice in the beginning, but actually co-created. This is my hope and expectation of how it's going to work.
Yeah, we really want users to lead with standards of practice so that we can follow them with standards of software. And if we're doing it in open source, we want to enable the whole industry, and we can do that around certain standards of sharing.
I think that standardization over time of practice, software, and sharing is kind of the core of how we'd hope to grow a really healthy industry.
Absolutely. We want to build these best practices, however corny it sounds, together with the industry, right?
And bringing in both a diverse set of users and a diverse set of contributors. Those are, I know, really important elements of what has brought you to build the software that you're building and to solve the problems that you're dealing with.
Exactly, that's how we hope it to be.
We like to end our episodes with one recommendation that you have, like a resource, a piece of advice, or something that inspires you, so that the folks who are listening to Open Source Data can get a bit of your wisdom and a springboard to get a little bit closer to what you've learned and what you're practicing.
Oh, if you're into machine learning monitoring, check out our blog. We really strive to distill some concepts there, and we really welcome contributions and comments on that too.
But in the broader machine learning and data field, there is one piece of advice I think is very important. I've seen so many machine learning and data projects fail because there were two sides that don't really communicate well, usually the data team and the domain experts, and no one feels that it's their job to help bridge this communication, to translate between these two domains. I feel we would be able to create so many amazing things if we actually saw this as part of our job description on either side of the fence. I would definitely encourage people to look at it this way: business, explain the priorities; data, explain what exactly is happening with the data and the technology; and try to bridge this gap. That's something I really hope to see happening.
Do you see an emerging role that stands alone and has a title there? Or do you think it's going to be a bit of everybody's job?
I think it's a bit of everybody's job, but in certain circumstances we've seen this analytics translator role, right? People who actually bridge the communication and streamline it between different groups.
That is awesome advice.
Elena, thank you so much for your time. It's been really great to speak with you, and I appreciate your generosity of spirit and of mind.
Thank you for having me.