Season 2 · Episode 3
Git-Like Branch and Merge for Data with Einat Orr
What if you could version object storage just like code? Tune in to Einat Orr as she explains how CI/CD and data lineage are being transformed through versioning data, enabling sandboxes, safe rollbacks, and coherent history.
CEO and Co-Founder at Treeverse
Welcome Einat, we're delighted to have you join today.
I'm very happy to be here.
I like to start out each episode, asking our guests, what does open source data mean to you?
A lot of things. First, it's possibility. I think most of the innovation in data has happened in open source over the last 15 years. So as a CTO and as a user of technology, I've always seen the opportunity in adopting open source and the new capabilities that open source had to offer.
And of course, community. The ability to work together and join forces in order to bring the world better technology. And now, as an entrepreneur, I'm able to give back to the community, because we are really based on so much open source, and with tools that we believe will really help data scientists, data engineers, and analysts who are working over data and improve their lives. So it's very pleasant to be contributing ourselves.
And of course also a business opportunity.
Yeah, that's fantastic. I think about the gift that we're giving everyone by not hiding value behind things that they can't see, right? That transparency is like, hey, come take a look, we put our opinions in software.
If you like it, use it. If you want more help with it, then you know where to find us. And that's the birth of many open source businesses.
It simply helps build the right product, right? Because if you build the product together with your users, it will be the right product. If you build it behind walls, it's going to take much longer, and it might not converge to the right product.
Yeah, convergence is a great point, and there's an element of co-creation there as well. One of the big realizations I had when speaking with you over the last couple of months is that we are getting faster and faster in DevOps iteration cycles between what we're building and what our users want, but data isn't keeping up. And I think that's a problem that you've put a lot of math and science and attention behind. I'd love to hear you talk about that.
Yes. So actually it's very fascinating from a lot of angles why data-intensive applications did not adopt the org structure, the processes, or the technologies that are used when you are developing an application. And I'm not sure why that is, but I'm pretty sure that the problem is very wide. It starts with not having cross-skilled teams that work together to solve the problem end to end, and not having someone responsible for providing requirements for the data application. It is this orphan that needs to somehow work. The amount of data grows, and the amount of money the organization puts into getting insights from the data dramatically grows, but the attention to whether we are managing this properly is not necessarily given.
So this was an example of not adopting processes, but it is important, because within those processes of actually working in the common, agile way, there is the need to work fast, with very short cycles that allow you quick feedback. How do you work fast in the world of data as we see it today? It's really impossible, right?
Managing tons of distributed systems that need to be configured somehow and optimized for your problems, coding in many languages, synchronizing all of that effort, and putting your data, large and constantly growing amounts of it, in the right places to be consumed properly. None of that helps you get short cycles.
So we have very long cycles, not the right processes and, we're struggling.
Yeah. And so it's often the rate of change of data that is holding back the pace of change of the application or of the business intent. So some of the things that I think we can learn from fast moving open source projects, like Linux, right.
Created a new approach to sharing code, right, with Git. And the insight with Git was not that it would be easy to fork, which was kind of the CVS way of versioning, but that it would be easy to merge. And so from there you have GitHub, and then you have DevOps, and sort of this merge-centric architecture: you can freely fork because you know you're going to be able to merge back later. But applying that to data has remained kind of a remote art, right? Those practices that we assume work within the code of an operating system, or that deploy something to production on containers, we think would just flow. But applying that to data, how do you do a Git-like merge on data? That's super hard, but that's a problem that you have started tackling in lakeFS, right, which is the open source project that you and your co-founder have brought to life.
Yes, exactly. So if we are looking at those processes that are not being implemented, a key to running very quickly with code is being able to have version control, to manage collaboration, and to reduce dramatically the cost of error, right? If you have reduced the cost of error, you can run very quickly, because you can always revert, and that doesn't exist in the world of data. So yes, merges are a very interesting technical problem in our system specifically, but before we go there, there's just the ability to branch and get isolation.
In a data environment, the only way to get isolation is a copy, and a copy can mean copying petabytes of data. Instead of that, you can replace it with one simple, very quick, atomic action that provides you with isolation. Now you don't need samples of data, or working very small, or trying to somehow overcome the fact that the data is not accessible to you. You just open a branch, the data is available to you, and you work in isolation. You can cause as much damage as you want, and at the end of the damage you just discard; main was not touched, and other people can consume the data safely while you work. This capability just brings so much relief to people who develop over data, and this allows you to shorten cycles.
Right? And the other thing is actually committing, because you have the main branch where the data is consumed from, and you can run commits to that branch. And then if there has been an error with the data, you can revert. Again, these are atomic actions; they don't take time. Now you can run very quickly, and you might make mistakes, and you might expose the wrong data, but you can always fix it with one revert, right? If you have put a bug into production in an application, worst comes to worst, you revert. You can work very quickly and take the risk. The same happens with data: if you use lakeFS, you can run really quickly, and worst comes to worst, you revert. So you can just experiment way more and take more risks, because the price of error was just reduced dramatically. So branching and committing are already very, very valuable, before we even talk about the merge.
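The branch, commit, and revert operations Einat describes can be sketched in a few lines of Python. This is an illustrative toy, not the lakeFS implementation: the key idea is that a commit is an immutable snapshot mapping paths to object IDs, and a branch is just a named pointer to a commit, so branching and reverting never copy the underlying data.

```python
# Toy model of Git-like versioning over an object store.
# Assumption: objects themselves are immutable; only pointers move.

class Repo:
    def __init__(self):
        self.commits = {"c0": {}}          # commit id -> {path: object id}
        self.parents = {"c0": None}        # commit id -> parent commit id
        self.branches = {"main": "c0"}     # branch name -> commit id
        self._next = 1

    def branch(self, name, source="main"):
        # O(1): copy a pointer, not petabytes of objects.
        self.branches[name] = self.branches[source]

    def commit(self, branch, changes):
        # New snapshot = parent snapshot overlaid with changed paths.
        parent = self.branches[branch]
        snapshot = {**self.commits[parent], **changes}
        cid = f"c{self._next}"
        self._next += 1
        self.commits[cid], self.parents[cid] = snapshot, parent
        self.branches[branch] = cid
        return cid

    def revert(self, branch):
        # Move the branch pointer back one commit; again O(1).
        self.branches[branch] = self.parents[self.branches[branch]]

    def read(self, branch):
        return self.commits[self.branches[branch]]

repo = Repo()
repo.commit("main", {"events/day=1": "obj-a"})
repo.branch("experiment")                       # instant isolation
repo.commit("experiment", {"events/day=1": "obj-bad"})
assert repo.read("main") == {"events/day=1": "obj-a"}   # main untouched
repo.revert("experiment")                       # one action undoes the damage
assert repo.read("experiment") == repo.read("main")
```

The point of the sketch is the cost model: because isolation and rollback are pointer moves, the "price of error" Einat mentions becomes near zero.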
And then merge comes into place, because what merge enables you to do is actually run validations, run continuous integration and continuous deployment of data, by ingesting your data or running the job that creates new data on a branch, testing the results, and using a pre-merge hook to make sure that if the test passes, the data is automatically merged.
According to a merge that you have implemented, or chosen out of the automatic merges in the system. So the merge is done into main and the data is exposed to your consumers. But if the test fails, the wrong data would not be exposed, and you would have a snapshot of your data lake at the time of the failure, to debug very, very efficiently. Because we all know that one of the problems with a bug in a big data environment is that you don't know what the data was at the time of the failure. It's very frustrating.
But if you create this layer of logic, where merges are done only if tests and validations have passed, then you can actually get a snapshot of the data at the time of the failure. That is pretty awesome.
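The pre-merge-hook pattern described above can be sketched as follows. The names (`merge_with_hook`, `validate_not_empty`) are hypothetical and purely illustrative, not a real lakeFS API: new data lands on a staging branch, validation hooks run against it, and only a passing snapshot is promoted to main.

```python
# Sketch of CI for data via a pre-merge hook. A branch snapshot is a
# dict of {table name: row count}; a hook is any predicate on it.

def validate_not_empty(snapshot):
    # Example data-quality check: no table may be empty.
    return all(rows > 0 for rows in snapshot.values())

def merge_with_hook(branches, source, target, hooks):
    # Run every hook against the source snapshot before exposing it.
    candidate = branches[source]
    if all(hook(candidate) for hook in hooks):
        branches[target] = dict(candidate)   # promote to consumers
        return True
    return False   # target untouched; failed snapshot kept for debugging

branches = {
    "main":    {"events": 100},
    "staging": {"events": 100, "new_batch": 0},   # bad ingest: empty table
}
ok = merge_with_hook(branches, "staging", "main", [validate_not_empty])
assert not ok and "new_batch" not in branches["main"]

branches["staging"]["new_batch"] = 5000           # fixed ingest
assert merge_with_hook(branches, "staging", "main", [validate_not_empty])
assert branches["main"]["new_batch"] == 5000
```

Note how the failed staging snapshot remains intact, which is exactly the "data at the time of the failure" that makes debugging efficient.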
It's like a time machine for a data lake.
Yes. And then you have the tools to start working in the logic that we're used to working with applications.
To do CI, to do CD, to have a test environment. Also things that are hard to do now, and we're struggling with are becoming very natural.
It's really exciting, bringing coherence to all the practices that you have in the entire line of bringing applications to bear. You must be seeing tremendous power in it as those practices start getting borrowed for data. Is there a lot of extra storage required as you take more versions and branches, as you use this capability to make application delivery so much more effective?
So the metadata itself is very small and it is also kept on the object storage, which is a cheap option.
The data itself really depends on your needs. No one uses Git just because it versions, right? You use Git because it allows you to work safely, streamline your processes properly, and collaborate properly. So you use lakeFS for what you need it for. If you do need it for reproducibility, and you want to be able to go back to a version you had a year ago, then that would require you to save more versions of data, but it's a business need that you have, so you are saving the data.
But if you're only using it because you need to revert to the version before your latest you'll save only that, and then the overhead of saving additional data would not be significant at all. So the amount of versions you would have would simply reflect your business needs.
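The reason extra versions are cheap can be illustrated with a small content-addressing sketch (a common design for versioned object stores; this is a general illustration, not a description of lakeFS internals): if snapshots reference objects by content hash, two versions share every unchanged object, so the storage overhead of a new version is only the data that actually changed.

```python
# Content-addressed store: each unique object is stored exactly once,
# keyed by its hash; a version is just a mapping of paths to hashes.
import hashlib

store = {}                       # object id -> bytes (stored once)

def oid(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def snapshot(files):
    snap = {}
    for path, data in files.items():
        h = oid(data)
        store[h] = data          # idempotent: unchanged data is not duplicated
        snap[path] = h
    return snap

v1 = snapshot({"a.parquet": b"x" * 1000, "b.parquet": b"y" * 1000})
v2 = snapshot({"a.parquet": b"x" * 1000, "b.parquet": b"z" * 1000})  # only b changed
assert v1["a.parquet"] == v2["a.parquet"]    # shared object across versions
assert len(store) == 3                       # 2 versions, but only 3 objects stored
```

Two full versions, three stored objects: the marginal cost of keeping history tracks what changed, which is why the number of versions you keep can simply reflect your business needs.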
That makes perfect sense. As you go into the future of this ability to have branched, merged, isolated data that can be continuously delivered, I think it will be incredibly powerful for SRE teams, as you think about the ability to always safely roll back, have no grumpy humans, and get a lot more observability on the root cause of failure, because you may have some sense that it was a data change, and you can revert the data change independently from the application change. That's all super powerful. What if we jump forward to 2026 or 2030, right, when this stuff has become mainstream? What do you think the world looks like for data then?
From the lakeFS perspective, you would see us as the cross-organizational tool. So we would probably be running not only over the object storage, but also over the other data sources that you have, because your data pipelines would probably be moving data between sources. So lakeFS would be a platform that covers them all and provides you with Git-like operations throughout your pipeline.
I think that would go with quite a few trends that we see right now. One is that discovery tools are definitely going to become something that every organization has, because I believe the variety of data sources used will remain high; you would also use object storage and all kinds of databases. But the ability to have a cockpit where you can manage the data of the organization in a reasonable way would be there. And that would include discovery tools, the Git-like operations, very good orchestration engines, and quality platforms. So those four I see becoming more and more powerful and playing a very strong role, together, of course, with the databases and the object storage, which keep on improving, becoming more cost-effective, showing higher performance, and becoming as serverless as possible, because you can't do without that.
Right now, there is the overhead of managing those systems; even if you buy a managed service, you buy it according to the loads that you are putting on it, all sorts of things that require too much understanding and too much time. Going serverless solves that problem. So I see that as something that is getting to be very strong and compatible with this metadata layer that I've just described going over all those data sources. And specifically within the data sources, I see object storage becoming stronger and stronger, providing much higher performance, and maybe gaining the ability to allow selection over data and other things that it doesn't support today, offering a few more capabilities that would make it dramatically more useful for the people who use data, together with the compute engines, which have made tremendous strides from 2006 to today, except for becoming serverless.
They are also becoming even better in performance than they are now. So definitely Spark and Presto becoming much stronger and much faster compute engines.
So a big shift of all these things towards closer to real time performance. Right because of the underlying infrastructure improvement.
The other thing that surfaced in our conversation is that just being able to map and visualize all of the Git operations on data in your organization will be a tremendous piece of telemetry to inform you where the value is in your data. The opportunity to value data on a per-byte basis might never be there, right? All bytes are just bytes. But to be able to measure the value based on the number of interactions is important, right? We can look at that as end-user interactions, but maybe there's a really powerful indicator in the developer interactions. How many times are these things changing? When a data science team is trying to play with the data, as you described, putting it in a sandbox, that's probably a pretty important organizational signal.
Yeah, absolutely. And this is why I see this as part of the cockpit, right? Working with discovery tools, providing this information together with the lineage that we see.
If you work with lakeFS, you're actually exposed to the lineage between the different datasets: what creates what, and how, and in what version. And this would become extremely useful for reproducibility in very complex systems.
Yeah, just moving from copying, as you described, to branching is going to be a boon for so many things, because now you can remember lineage, and you can also do different things with governance.
You can learn when and why these things changed, and whether they were supposed to change, right? A whole different field of play, which is really important, because so many of the problems we have in how we think about privacy and security of data are changing with regulatory requirements. And mostly we're chasing this down with lawyers and engineers doing hand work.
And that just can't work 10 years from now. We have to have a computational basis with which we can view, specify, observe, and control.
Yeah, absolutely. Governance is definitely something this would help with.
So this is tremendously inspiring stuff. I'm very curious to know what was your inspiration to do this right? Because creating new technology is exciting, but creating a new company is the craziest thing you can possibly do. I've done it a few times and it's one of those things that you have to have a lot of inspiration because it's always easier to join somebody else's thing, than to start your own. But here you are right with lakeFS and with Treeverse - I'd love to step a little bit farther back into history and kind of hear you explore what your motivations were and why you got so inspired.
So, first of all, I definitely blame my co-founder for inspiring me. I'll never forgive him. So.
So, I worked at SimilarWeb as the CTO; it's a data company. We managed seven petabytes of data over S3, and I thought we were doing a pretty decent job. As I talked about earlier with open source, we were adopting technologies early on because we really needed them. We used Airflow very early. We used Spark very early.
We really needed the stuff that was coming up. But no matter how cutting-edge we were and how talented the team was, we were struggling. We were constantly struggling, and it wasn't a failure; we were succeeding. We built the product, relying on the data.
We brought high accuracy. We brought very advanced algorithms. We had a very good data pipeline, managed, as I said, with cutting-edge technologies. But the day-to-day work was frustrating, again, because the price of error was so high, and because managing ourselves properly required so much overhead on top of the actual work. It was frustrating, and it was simply the nature of the beast. And then you look at this and you say to yourself, "There has to be a better way to do this." And I remember specifically the day that a retention script with a bug ran and deleted half of my production data. Which took us a very long time.
Yes, yes. Yes. Took us a very long time to recover from, and I was just walking in the corridor saying "Revert!". I mean, that is a thing you can do. Why can't we do it here? So the pain is really big and it's big, even if you're doing everything right. And it's frustrating.
So when my co-founder came and said, you know, "I had this idea, why don't we do Git for object storage?", I said, "Yeah, but it won't work, right? The performance, it just doesn't work." So he said, "No, no, no. We will do our own Git." It was a no-brainer. It was obvious that that was the right thing to do. I immediately saw how all those pains that we had are all solved by the same conceptual idea of just using Git-like operations.
So. That's it. That was game over.
It's so impressive. And fortunately, you have a lot of math in your background. In my last job before DataStax, I worked at Autodesk, and we had acquired a company with some super, super sharp folks, Thiago Da Costa and Deuv Aldehyde, and they had built out a Git-based system for real-time CAD programming.
And so Autodesk used them to effectively do this kind of versioned, fine-grained object store for things coming from a mechanical engineering design interface, so that you could end up doing a few things. First, you know, a complicated thing like a motorcycle has a lot of different assemblies, and each of them actually needs to have its own version.
And then you need to be able to say, okay, the current version of all of these things is a version of the motorcycle. And then how do you enable all these different experts to work together? And how does that get rendered in real time through web-based CAD? So a super interesting problem, right? And it ends up kind of reducing to, you know, an immutable log and then vector clocks.
To be able to determine, how do you do this dynamic merging? And then the biggest challenge then of course, is that you've created this very custom way of thinking about data. And then you end up in this world of materialized views. And materialized views are always hard because they're always wrong. You're translating from your first language to your second language and the nuance or the detail is always lost. Right? So that's the hard part. But it sounds like with object storage as a basis, you're using the standard interface, standard protocols, standard format, and the innovation is adding the vector clocks, right. And adding this sense of how can you make these things work?
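The vector clocks mentioned above can be illustrated with a minimal sketch (purely illustrative, not the Autodesk or lakeFS implementation): each replica increments its own entry on a local edit, and comparing two clocks tells you whether one edit happened before the other or whether the two are concurrent and need an explicit merge.

```python
# Minimal vector clocks: a clock is a dict {replica name: edit count}.

def tick(clock, replica):
    # Record a local edit on one replica.
    c = dict(clock)
    c[replica] = c.get(replica, 0) + 1
    return c

def merge(a, b):
    # Combine knowledge from two replicas: element-wise maximum.
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}

def happened_before(a, b):
    # a < b iff every entry of a is <= b and at least one is strictly less.
    keys = a.keys() | b.keys()
    return (all(a.get(k, 0) <= b.get(k, 0) for k in keys)
            and any(a.get(k, 0) < b.get(k, 0) for k in keys))

alice = tick({}, "alice")              # alice edits locally
bob = tick({}, "bob")                  # bob edits concurrently
# Neither edit precedes the other: they are concurrent, so a merge is needed.
assert not happened_before(alice, bob)
assert not happened_before(bob, alice)
merged = tick(merge(alice, bob), "alice")   # reconcile, then edit again
assert happened_before(alice, merged) and happened_before(bob, merged)
```

This ordering test is what lets a system over an immutable log decide dynamically which edits can be applied in sequence and which require a merge rule.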
Yeah. And still, of course, you have to build a very good piece of machinery to run at the rates that you expect object storage to be running at. But absolutely, the logic is there, because the base is immutable, and that really helps.
So Einat, you have another thing in common with some of the most interesting people I've met in the field of data in the last year, which is that you have a PhD in graph theory. And you've moved from doing the hard mathematical work to doing something that looks, on the surface, quite different from it. And this is consistent with a bunch of the other people who have PhDs in graph theory and are now doing some other thing. So this must be an amazing proving ground, or something else is happening for people who think about this.
I'd love to hear you talk a bit about your PhD. What got you into it and, and kind of where it led you intellectually.
Yeah. So I always liked problems that actually had applications. My Master's was actually in applied probability, and then for the PhD I went into optimization problems in graphs that are definitely theoretical. But the inspiration for the theoretical problems that I provided theoretical algorithms for was actually real-life problems that I had seen.
So I worked for a company that developed technology for testing molecules. I won't go into the details, but that's where the idea for the capacitated vertex cover problem came from, for example. So the inspiration was always an actual problem in the world. And I really loved being a PhD student. You know, lying down, looking up at the ceiling, and claiming that I was working. A very happy part of my life.
But then when I started working as an algorithms developer, of course those were applied problems. Everything I had learned in university helped me do a good job as an algorithms developer. But I realized, as I progressed in my career, that some of the problems around me were problems of managing people or streamlining inefficiencies in the organization. And I kind of felt that I had something to say about that.
The optimization problems now involved people and software, and not just mathematics and the algorithm itself. Stuff wasn't optimal around me, and it bothered me. So I went into managing people. And a lot of people have commented throughout my career that I manage in a way that looks at problems as optimization problems. It's interesting that it is so clear to others that this is the way I think. This is how I got to being a manager.
I'm looking at problems that are now way more complex and larger, because they don't include only algorithms. They include algorithms, and software, and people, and data that is always wrong. The data is always wrong. What's up with that? I mean the input data, not what we create; it's all so dirty. So it's a very complex machine that you need to build in order to make all of that work together.
And now, in the last stage, as an entrepreneur, I'm just trying to help that machine that needs optimization to work properly. So I'd say this is my story of how I shifted from mathematics to what I do today.
That's fantastic. There's so much deep, moral and intellectual satisfaction in solving increasingly complex problems. As you bring in all these extra dimensions, I can hear you like adding dimensions to the algorithm and all of a sudden it's like a hundred dimensional problem.
And it turns out like 20 of those dimensions are emotional, and people are way more complicated. So Einat, we're getting close to the end of our time, and I would love to have you offer some advice to our audience. Right? Many folks are fascinated by a data career. They're interested in the data industry. Of course, we're coming out of a year of COVID, and a lot of folks are graduating and trying to figure out what to make of the future.
So if someone is fascinated by what we've talked about today, or is getting started on the road of their data career, what advice might you have for them, or what resource might you point them to?
So, my favorite is Designing Data-Intensive Applications. I really think it's a must-read, especially the first third. I'm sometimes asked, because technology is moving so quickly, whether it's still up to date, and I don't think that's a problem at all, because the basics, the way you think about data problems and data applications, are there. And that is the tool that you need. Just like in mathematics, if you have the basics you can grow from there. Here, he provides the basics in an incredibly professional and deep way. So I think this is a must-read for anyone who's working with data.
That is awesome. Einat thank you so much for taking the time with us and your incredibly thoughtful explanation of how the world of data is changing.
All right. Well, we wish you much success with lakeFS and Treeverse, and we're excited to see where it all goes.
Thank you very much.