Season 2 · Episode 2
Data Discoverability, Products, and User Diversity with Shinji Kim
Learn how an accelerating abundance of data can be harnessed through telemetry. Tune in while Shinji Kim and Sam explore opening data to more users, PageRank for tables, and the pragmatic use of data lineage to find value.
Founder and CEO at Select Star
Hi, this is Sam Ramji and you're listening to Open||Source||Data. This week we're talking with Shinji Kim. Shinji is the founder and CEO of Select Star, an intelligent data discovery platform that automatically analyzes and documents your data.
Welcome Shinji. We are delighted to have you join us today.
Thanks Sam, excited to be here.
We like to start out each episode asking our guests, what does open source data mean to you?
Usually we call it open source, or open data, or maybe data source. But after I saw the || signs, what it means to me is really sharing, and how sharing makes all of us much better. So it's not closed. You can see what it is, or maybe you can use it, or maybe you can also contribute and shape it in a different way. And I think that makes all of us better. I create something really cool and put it out there; other people see it and use it, they get benefit out of it, they say thank you, they contribute to it, and then I get a lot of joy out of it. And I think that's one of the cool things about open source that has really led technology to flourish for, you know, the last 10 or 20 years. I feel like the internet is all about being open. Traditionally, I think data has been more closed, because it's private, because it's proprietary; it's still considered the IP of companies. But governments are also starting to open up their data, like the Census, and cities are starting to open up their data around traffic or real estate. I think that is helping research, and also helping other companies learn. I think it's really about sharing, and how sharing makes all of us better.
Such an awesome answer. And it starts so many conversations. One thing I took away from it was open and data brings a sense of abundance to something that was previously scarce, right.
That maybe there was a lot of data locked up, hidden away, but now it's becoming abundant. We know from economics, though, that every abundance creates a new scarcity, and the scarcity now is how I discover all that stuff. Right? So PageRank in the Google search engine helped us start to find the things that we actually cared about in all of the open internet.
And that's a problem that you have thrown yourself into solving with Select Star too, right? So as you look at creating something valuable out of all of that, you know, sheer abundance, that's a pretty interesting way to structure your thinking and the work that you're doing in the world.
I'd love to hear you talk a little bit about the motivations, sort of the technical elements of the problem, and how you've thrown yourself at that with Select Star.
And then I'd love to talk a little bit about the journey that brought you here, 'cause there's a lot of courage and curiosity in your life story.
I have been a software engineer, data scientist, and product manager in the past, working with a lot of data: transforming or ETL-ing the data, analyzing it, and trying to figure out what the trends in the data looked like. I started my first company, Concord Systems, in 2014. We decided to start the company because at the other company I was working at, Yieldmo, a mobile ad network startup in New York, we were growing super fast and processing about 10 billion events a day. Our data pipeline of Kafka, Storm, and HDFS was breaking, and we wanted a better, more stable solution. So that's why we created Concord and spun it out as a company. We worked with a lot of financial services customers, and then sold it to Akamai, and at Akamai it became an IoT data platform called IoT Edge Connect. It is now deployed and operated as a cloud service around the world, processing sensor data coming from billions of devices. That experience opened my eyes to how large companies use and compute data.
I've been more or less in that line of work, doing quote, unquote big data processing in a real-time or streaming way. And processing large volumes of data is starting to get solved by cloud providers that will give you unlimited amounts of compute power and storage, and ways to run those computations.
Also, there have been a lot of advancements in machine learning and data science, where you can derive very advanced recommendations, detect anomalies, and get information out of lots of data. But what I noticed was that there are a lot of workflows in between that are more focused on data analysis. What is our revenue? How do we segment our customers? What does our funnel look like? What's our conversion rate for signups? Questions like that will always trigger a lot of conversations and further questions: Okay, so what data do we actually have? Where is this data? What is it called? And even if you find that data set, you will ask, "So which field should I use?" "Do we use activation date or last login?" That really got me thinking about how there are a lot of issues around data discovery: just finding the data that you have, and also understanding what you have. And the problems around data discovery are growing in the market. One reason is that more, and I would say more diverse, data is entering the data warehouse as we move towards the ELT phenomenon and the quote, unquote modern data stack. You can now load data directly from Salesforce, Google Analytics, Marketo; it doesn't have to hit your Postgres first. With that, there are a lot of sources of data, and you end up with basically hundreds, if not thousands, of database tables inside your data warehouse. On top of that, because each data source brings its own raw form, you have to transform the data so that you can join it together, and you will probably aggregate it in some way with your own transformation jobs, whether that's with dbt or something else. That creates more tables, and it creates more confusion.
So that's another big trend that I've seen, and I feel like it's contributing to this problem of data discovery, which is starting to get really painful for a lot of people who are touching the data.
The last part is that as more application data enters the data warehouse, the people touching the data are changing. Traditionally it used to be mostly software engineers. Now there are a lot of data analysts who also commit SQL queries and build their own transformation logic. And with that, we're also seeing more business users starting to either learn SQL or use graphical user interfaces to run their data queries, even if they are not writing the SQL themselves. So that creates a more diverse range of data consumers who will all start thinking, okay, I want to figure out this business question: “Which data should I use?” “Do we actually have this data?” And that's how I arrived at Select Star.
It's so interesting. The diversity of consumers of data is exactly what I've been seeing lately, as I've been asking questions and listening. And the opportunity to build a tool, right? Or a tool and platform, like you're doing, allows you to standardize work, right?
You're taking some opinions you have about how to make this simpler about how to be able to deliver it to these very different sets of users. And you're starting to deal with that new abundance and scarcity cycle.
What's the most surprising thing that you've found about this part of the journey so far? You're reaching some company milestones around customers and interactions and productization.
So without getting into any details about the customers individually, has anything shown up as really surprising or delightful in the process?
Yeah, I mean both.
So initially we wanted to build a really good platform where you can easily search to find and understand the data that you have.
What that means for our customers is that every time they find a table, they can see where the data came from, who the top users of this data inside the company are, what dashboards were created out of this dataset, what SQL queries other people have run against it, and which other tables they could join it with, along with examples of those queries.
So you can get the context of what the data is and where it came from, and it tells you, "Oh, this is worthwhile data that I should use. How can I use it?" We felt that notion was useful. And as we started seeing how our customers use it, we realized that a lot of this context data that we provide, combined together, can really help our customers organize and govern their data better. So now, when customers first onboard, we walk their data team through how their data looks in their data warehouse or their BI tool. What are the most used tables that you should document? What are the tables that nobody has touched in the last 30 days that you should maybe deprecate? Who should be the owners of these data sets and maintain the descriptions? And last but not least, how would you categorize this data, so you can add tags and labels per your business lines or product lines, and also mark whether this is a gold, silver, or bronze table, whether it is analyst approved, things like that. Once that governance structure and organization is set up, they can really open it up to the rest of the company so that everyone, whether they are a PM, in customer support, or a new analyst, can easily understand how the company data is structured and where to go to find data sets they've never used before.
For us that was more eye-opening and really interesting to see. We are really excited to help more customers start the data governance journey, so they can set up these structures really well and then have their data team involved in completing any documentation or organization. And then their extended teams of other data consumers can also benefit from this type of single-source-of-truth documentation on data.
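The onboarding questions above (which tables are most used, and which have gone untouched for 30 days) can be sketched in a few lines of Python. This is only an illustration, not Select Star's implementation: the `triage_tables` function, the `(table, timestamp)` log shape, and all table names are assumptions made up for the example.

```python
from collections import Counter
from datetime import datetime, timedelta


def triage_tables(all_tables, access_log, now, stale_after_days=30, top_n=2):
    """Split tables into 'document first' and 'maybe deprecate' lists.

    access_log is a hypothetical list of (table_name, accessed_at) pairs,
    e.g. derived from a warehouse query history.
    """
    cutoff = now - timedelta(days=stale_after_days)
    # Count only accesses that fall inside the recent window.
    recent = Counter(table for table, ts in access_log if ts >= cutoff)
    # Most used tables are the first candidates for documentation.
    to_document = [table for table, _ in recent.most_common(top_n)]
    # Tables with no recent access are candidates for deprecation.
    to_deprecate = sorted(t for t in all_tables if recent[t] == 0)
    return to_document, to_deprecate


now = datetime(2024, 1, 31)
all_tables = ["orders", "users", "events", "legacy_export", "tmp_scratch"]
access_log = [
    ("orders", datetime(2024, 1, 30)),
    ("orders", datetime(2024, 1, 29)),
    ("orders", datetime(2024, 1, 15)),
    ("users", datetime(2024, 1, 28)),
    ("users", datetime(2024, 1, 20)),
    ("events", datetime(2024, 1, 25)),
    ("legacy_export", datetime(2023, 10, 1)),  # last touched months ago
]

to_document, to_deprecate = triage_tables(all_tables, access_log, now)
print("Document first:", to_document)
print("Maybe deprecate:", to_deprecate)
```

A list like `to_deprecate` is only a starting point for a conversation with owners; a table that is read once a quarter would look stale under a 30-day window.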
That is amazingly cool. So there are a lot of neat elements there, like telemetry, being able to see what's happening. And then you use the word governance in a really special way, which I think is not just limited to, like, what's the policy, what are the restrictions, but actually elevating - "Hey, these things are products. They need owners. They already have customers." And now with your telemetry and your analysis, you're showing people kind of the way through their own data forest. It's super exciting.
Yeah. I think traditionally, a lot of companies in the data governance space have been focused on the security and access part of it. And that is important. But, at the same time, I feel like being able to define, what data do we really have? Hence which data should be available for who? I think that's where the governance should start.
What we are doing with our customers today is allowing them to first define those up front in Select Star, and then utilize and replicate that in their access control model, their policies, and so on.
Because before you guard the access, you should first ask, "Is this actually data that we should guard? How important is this data for us to actually document?" And that is really the part where we want to shine the light. Once that's clear, it's a much easier journey for the security team and the data platform team to have full governance of their data.
Yeah, I've been in conversations that didn't have that light shining and it's amazing how much theory and complexity and politics comes in, because it's all opinion driven. Which is kind of the opposite of how we ought to solve the problem. Right. So you're bringing a tool of science for people to be able to have a more measured conversation, I think.
Yeah. We are trying [laughs].
One of the things that I find fascinating about you and what you're building is that you represent a new wave of people who are deeply trained in software engineering and have moved their focus from apps and compute, and all the classical software stuff, to really focusing on data, right. And there seems to be this set of very fascinating data-centric startups. There is the emergence of software engineering methodologies in data, like DevOps kind of influencing DataOps. And I think it would be really fascinating to hear a little bit about your journey from software engineer early in your career, a little bit of your research, and how you ended up as a practitioner, bringing your engineering background to data science.
I always went from one place to another, following what was very interesting to me at that point of time, without having a grand plan of, “Oh, I'm going to start a company eventually.”
Yeah. I guess if I were to go back, I started as just a software engineer. Well, I actually got into computer science because I started building websites when I was in high school, and I built some websites for other small businesses, which became my own small business. That's why I went to study computer science, or software engineering, at the University of Waterloo. One of the things that's great about Waterloo is the co-op program, where I got to work with actual companies.
One of my first internships was at the Sun Microsystems research lab, working as a statistical analyst. Actually, prior to being a statistical analyst, I was a software engineer building a data visualization model to display the ten years' worth of sales data that the lab had built. My manager, who had a PhD in computer science and an MBA, had his model, where we displayed the numbers for actuals, plans, and our forecast. Building that archive of how our team had been building that model was my initial project. Then my manager asked me if I wanted to come back for my next internship to actually work on the model itself. So that's kind of how I got into data. Since then, everything has been very much related to data.
After I worked at Barclays, they hired me as a contractor to manage four interns, because the first project that I built was almost like a dashboard, a web application that the IT teams could use to decommission their old databases. It was very useful for them, so they wanted to carry on, and they got me some headcount to manage the interns so that they could continue maintaining it. In the following internship, I was at Facebook working on the Growth team, where I was primarily writing a lot of ETL jobs to put together our analysis for our SEM campaigns. That was mostly the event data coming from Google Ads, the ROI of the keywords that we were bidding on, things like that.
Which has to be a tremendous stream of data just in terms of raw volume and velocity.
Right, right. And I wondered about a lot, actually, more than just the processing of the data. During that time I was also very interested in many other things that were happening at Facebook. By that time Facebook had 1,000 employees, and it was the smallest company I had worked at, compared to all the other internships that I'd done.
And it was really fun to be involved in many different parts of the company, 'cause I also started doing a bunch of side projects on top of the Growth work that I was doing. But the benefit that I got from working on the Growth team is that it really opened my eyes. I moved from building software, where I was really focused on figuring out the how, to looking at the data to make decisions about the what. That got me very interested in questions like, why are we building this software or this product, or how do we decide to invest more in this campaign versus another; in short, making decisions driven by data, and how to build a better business with data. That actually led me to go into management consulting, to learn more about the business side, how they make the business decisions to build software, and so on.
And I think all of those experiences combined now feed into Select Star, because I've been in the seat of the data engineer, data scientist, data analyst, and also data consumer.
Part of the work that I did at Yieldmo was running an ad format lab, where we would come up with our own ad formats and run multivariate or A/B tests against different publishers. I was also doing a lot of customer analysis, because it was a marketplace. We were running an ad exchange, looking at both sides of the market, advertisers as well as publishers.
So in working on Select Star, and also working with Select Star's customers, there are many different parts of these experiences that I'm reminded of whenever I talk to these customers.
It's super cool. It's such a conjunction of so many of the different elements of your story and where things are today, which is that everybody who's doing technology needs to be better at business, and everybody who's doing business needs to be better at technology. And there's this conjunction in the first manager you mentioned, who had a PhD in computer science and an MBA. Right. That's almost like the model that things are building around, and you've taken a stand to make that easier, right. More people are going to be able to have that kind of crossover knowledge because of the tool that you're developing.
So with that in the past, what do you see happening in the future? In the world of data? Looking out five or 10 years?
I think it's going to be very interesting because more data is going to be embedded into business workflows, just even in our case of Select Star. Once the customers define how they are using the data, we can utilize that so that customers can either propagate the documentation throughout the lineage or the descriptions that they need or notify somebody that has been the user of that data if the data is going to change. That's really just on the side of the data platform. But I think there's also a lot of workflow that can be built on top of it, depending on what's happening underneath the data. And I think that is definitely one of the next things that's coming up in the next five to 10 years in the data industry.
How do you think a novel generation of data discovery could change the landscape?
One thing that was provoked for me when you were speaking earlier is, you know, could there be a PageRank for tables?
In a way we have it, and sometimes we really describe our solution as if it's Google for internal data. Mainly because you go to Google to find something and then you go to an external page, a website that you'll find information about.
And that is how we see our role in the data ecosystem. Today, you start from Select Star and you find what you need, whether that's a table, column, dashboard, report, workbook, metric definition, and so on. And we direct you back to the BI tool, or the database, and so forth.
But the way we compose this information, so that we know where best to direct you, is by looking at our own popularity model, which is driven by the SQL queries that have been executed against the data warehouse. Whether you're using Looker, Mode, Tableau, or an ad hoc query directly against the database, it all becomes a query that gets executed inside the data warehouse.
So we basically collect that data and see which data sets are being used, and accessed by what type of people. Based on that, we rank the search results so that you see the most important and popular information at the top. So for anything that you're finding inside Select Star, you will see that most of the top results are trustworthy, because a lot of other people have used them, or they have been tagged, or things like that. We give that information to you. And that's why, when you mentioned PageRank, there are definitely some similarities in how we define the rank in Select Star as well.
And on every single page inside Select Star, whether you're looking at a database, a table, or a list of dashboards, we always have this list view, and it's always ordered by popularity. So even if you're looking at just one table, if it has 50 columns, you will see them ordered from the most used on top to the least.
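As a rough illustration of ranking by usage, here is a minimal Python sketch that counts table references in a query log and orders search hits by a popularity score. It is not Select Star's actual popularity model, which is not described in this conversation: the regex-based table extraction, the "query count times distinct users" score, and the sample log are all assumptions made for the example, and a real system would use a proper SQL parser.

```python
import re
from collections import defaultdict

# Naive pattern: the identifier that follows FROM or JOIN.
TABLE_REF = re.compile(r"\b(?:from|join)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)


def popularity(query_log):
    """Score each table from a hypothetical list of (user, sql) pairs."""
    query_counts = defaultdict(int)
    users = defaultdict(set)
    for user, sql in query_log:
        # Count each table at most once per query.
        for table in {name.lower() for name in TABLE_REF.findall(sql)}:
            query_counts[table] += 1
            users[table].add(user)
    # Assumed score: number of queries weighted by number of distinct users.
    return {t: query_counts[t] * len(users[t]) for t in query_counts}


def search(term, scores):
    """Return matching tables, most popular first (name breaks ties)."""
    hits = [t for t in scores if term.lower() in t]
    return sorted(hits, key=lambda t: (-scores[t], t))


query_log = [
    ("ana", "SELECT * FROM analytics.orders o JOIN analytics.users u ON o.user_id = u.id"),
    ("bo", "select count(*) from analytics.orders"),
    ("ana", "SELECT * FROM analytics.orders_archive"),
    ("cy", "SELECT id FROM analytics.orders WHERE total > 10"),
]

scores = popularity(query_log)
results = search("orders", scores)
print(results)
```

The same ordering idea applies at any level of the catalog: sort columns within a table, or dashboards within a folder, by the same kind of usage score.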
That's really cool. So maybe we can anticipate a future that's a little bit like the evolution of the Google search engine, where they added relevance, and then you started to see a Knowledge Graph. Kind of building some sense of what these things meant and then, you know, auto completing what you ought to be like, "what you meant to type was ... ."
Yeah. Yeah, exactly. I feel like data lineage kind of gives you a lot of information like that. So you can see end to end from the raw table, all the way to the dashboard that is getting affected.
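To make the end-to-end lineage idea concrete, here is a small Python sketch that walks a lineage graph from a raw table to everything downstream of it, including the dashboards that would be affected by a change. The graph, the table and dashboard names, and the `downstream` helper are hypothetical; how Select Star stores or traverses lineage is not covered in this conversation.

```python
from collections import deque

# Hypothetical lineage: each key feeds the assets listed as its value.
LINEAGE = {
    "raw.salesforce_accounts": ["staging.accounts"],
    "staging.accounts": ["marts.revenue", "marts.customer_segments"],
    "marts.revenue": ["dashboard.weekly_revenue"],
    "marts.customer_segments": ["dashboard.segment_overview"],
}


def downstream(node, lineage):
    """Breadth-first walk returning every asset downstream of `node`."""
    seen, order = set(), []
    queue = deque(lineage.get(node, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # an asset may be reachable through several paths
        seen.add(current)
        order.append(current)
        queue.extend(lineage.get(current, []))
    return order


affected = downstream("raw.salesforce_accounts", LINEAGE)
affected_dashboards = [a for a in affected if a.startswith("dashboard.")]
print(affected)
print(affected_dashboards)
```

The same traversal could drive the workflows mentioned earlier, such as propagating a description along the lineage or notifying the users of downstream assets before a change.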
Well, we're getting close to the end of our time. So I want to ask you the question that I think many folks will have on their mind, which is, if I'm really excited about data and the field of data lineage, governance, or discovery, what's a resource or a piece of advice that you'd like to leave the Open||Source||Data audience with?
I would definitely encourage people to check our blog posts on selectstar.com. We talk about the general industry of data catalog and why data discovery is important. Examples of how other companies have been developing and utilizing data discovery in their data team. So I think that's a really good intro blog post.
There are also a lot of links to that blog post that you can go to other places to check out. Obviously, there are a lot of blog posts out there regarding data discovery, but I feel like our blog post is also a good place to start.
And if there's any advice around data, it would be: share context. Data context, whether you are technical or on the business side, I think is very important to share. On the technical side, explain how the data is actually organized, right. Even if you don't have the tools, I think this is something that can be improved through communication, so that the other party can also understand how to utilize the data better. Whereas on the business side, we don't necessarily always understand how the data is structured. If you focus on providing more context, like why you want to do this analysis, why you think this type of data is important, what the real problem is that we are trying to solve, I think that's always helpful.
Because data is just a point in time. It's the context of how you're looking at the data, and how you're connecting multiple data sets together, that really drives the insights and better decisions for the business.
That is fantastic advice. Well, Shinji, thank you so much for your time. You've been incredibly generous. I wish you outstanding success and continued curiosity in Select Star and the work that you're doing.
Thanks so much for having me here.