Data discovery has never been more challenging than it is today. Companies still struggle with organizing the vast amount of data they’re collecting through different systems and with identifying the data that has the most value to their business. That’s why I invited Shinji Kim, founder and CEO of Select Star, to the Open Source Data podcast.
Kim has worked with data in a number of different roles and knows all too well the difficulties associated with analyzing data collected from multiple sources. Eventually, she decided to change that and founded Select Star – a platform designed to make data discovery easy.
Kim felt the amount of time and money spent collecting data from different points was unnecessary and inefficient. Select Star, which became available at the end of 2021, now helps organizations eliminate data discovery issues by providing a centralized query engine for easier management of distributed data. The platform gathers and controls data, putting it all together, so users can access what they need, when they need it.
With the spread and growth of data in all forms, Kim found that companies need a data analysis system that adds value to their processes. A reliable data cataloging system is essential to any data analytics pipeline, from tuning applications (such as batch jobs) to better understanding user behavior and improving quality.
More data, more discovery problems
Before founding Select Star, Kim worked as a software engineer, data scientist, and project manager in several organizations. This is where she saw the need for a better way to find data – and gained the unique perspective she now uses to build Select Star.
Kim said there are a lot of issues around data discovery. Just finding and understanding the data you have is a big enough challenge. But then there’s also a huge diversity in the kinds of data that flow into data warehouses today, especially as more companies move towards “modern data stacks” that use extract, load, and transform (ELT) processes. With ELT, data is loaded directly into the data warehouse without any transformation, which results in a myriad of raw data formats that must be transformed and aggregated before the data can be used.
With so many potential raw data sources and the additional processing needed to use the data, it’s easy to see how a company can end up with hundreds, if not thousands, of database tables inside its data warehouse. The problem here is not so much the number of tables that end up in the warehouse. Rather, it’s the lack of clear provenance and value of the data they contain – a problem Select Star solves by using a PageRank-like approach for tables.
If your data warehouse is a mess today, it only gets worse once you start to analyze the data and add tables that need managing. Select Star automatically finds and documents your data from sources such as Oracle and MySQL, from extract, transform, and load (ETL) processes, and even from your business intelligence dashboards, so you can easily find the source of your data, who uses it, which dashboards are built on top of it, and more.
One of the most amazing things about Select Star’s service is that it provides visibility into how other people have queried the data and can show other tables that can be used to query – essentially equivalent to the information Google’s PageRank algorithm provides. “Every time they find the table, they’ll be able to see where the data came from, who the top users inside the company are, what dashboards were created out of this dataset,” Kim said.
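To make the PageRank comparison concrete, here is a minimal sketch of how tables could be ranked by how often other assets read from them. It is not Select Star’s actual implementation, which the conversation does not describe. The function name `rank_tables`, the table and dashboard names, and the damping value of 0.85 are all assumptions chosen for illustration.

```python
def rank_tables(references, damping=0.85, iterations=50):
    """Score assets with a basic PageRank iteration.

    `references` is a list of (consumer, source) pairs: a dashboard,
    query, or derived table that reads from a source table.
    All names here are hypothetical, not a Select Star API.
    """
    nodes = sorted({name for pair in references for name in pair})
    reads_from = {node: [] for node in nodes}
    for consumer, source in references:
        reads_from[consumer].append(source)

    count = len(nodes)
    scores = {node: 1.0 / count for node in nodes}
    for _ in range(iterations):
        # Assets that read from nothing spread their score evenly.
        dangling = sum(scores[n] for n in nodes if not reads_from[n])
        base = (1 - damping) / count + damping * dangling / count
        new_scores = {node: base for node in nodes}
        for node in nodes:
            for source in reads_from[node]:
                # Each consumer passes a share of its score to what it reads.
                new_scores[source] += damping * scores[node] / len(reads_from[node])
        scores = new_scores
    return scores


# Hypothetical usage data: who reads from which table.
references = [
    ("revenue_dashboard", "orders_clean"),
    ("churn_dashboard", "orders_clean"),
    ("orders_clean", "orders_raw"),
    ("tmp_scratch", "orders_raw"),
]
scores = rank_tables(references)
for name in sorted(scores, key=scores.get, reverse=True):
    print(f"{name}: {scores[name]:.3f}")
```

In this toy example, tables that many dashboards and derived tables depend on rise to the top, while a scratch table that nothing reads from stays at the baseline score.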
One Platform For Shared Data
Another benefit to Select Star is that it can work with data from all the data sources you use, funneling the data into its warehouse to make it accessible and searchable in one place. So you no longer need to go to Salesforce for one data table, Marketo for another, and Google Analytics for a third, or figure out how to put them all together for a single report. Select Star does that for you.
Data has become so important that companies protect and guard it now more than ever. However, Kim wonders if it makes sense to protect every bit of data companies collect. With Select Star, users can define and govern their data from the beginning. They can also measure what data is worth protecting, and what is just taking up space in their data warehouse.
Before you guard the access, you should say, ‘Is this data that we should guard? How important is this data for us to document?’ So once that's clear, then it's a much easier journey for the security team and the data platform team to have full governance on their data. – Shinji Kim
Streamline Data and Workflows
Streamlining the data that enters a data warehouse revolutionizes the workflow for all users involved. They’re all aware of what’s happening to it and who’s accountable for the changes.
“Once customers define how they use data, we can use that so that customers can either propagate the documentation throughout the lineage or the descriptions that they need or notify somebody that has been the user of that data if the data is going to change,” Kim said.
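The two ideas in that quote, propagating documentation along the lineage and notifying users before a change, can both be expressed as a walk over a lineage graph. The sketch below is an assumption about how this might look, not a description of Select Star’s internals. The `lineage`, `users`, and `descriptions` dictionaries and both function names are hypothetical.

```python
from collections import deque


def downstream_of(lineage, start):
    """Return every asset derived from `start`, in breadth-first order."""
    seen = {start}
    queue = deque([start])
    found = []
    while queue:
        current = queue.popleft()
        for child in lineage.get(current, []):
            if child not in seen:
                seen.add(child)
                found.append(child)
                queue.append(child)
    return found


def propagate_description(lineage, descriptions, table):
    """Copy a table's description to downstream assets that lack one."""
    updated = dict(descriptions)
    for child in downstream_of(lineage, table):
        updated.setdefault(child, f"Derived from {table}: {descriptions[table]}")
    return updated


# Hypothetical lineage: each key feeds the assets listed as its values.
lineage = {
    "orders_raw": ["orders_clean"],
    "orders_clean": ["revenue_dashboard", "churn_dashboard"],
}
users = {
    "orders_clean": ["ana"],
    "revenue_dashboard": ["ben"],
    "churn_dashboard": ["ana", "chen"],
}
descriptions = {"orders_raw": "One row per order."}

affected = downstream_of(lineage, "orders_raw")
to_notify = sorted({user for asset in affected for user in users.get(asset, [])})
documented = propagate_description(lineage, descriptions, "orders_raw")
print("Affected assets:", affected)
print("Users to notify:", to_notify)
```

Here a planned change to `orders_raw` surfaces every dependent table and dashboard along with the people who use them, and the same traversal fills in missing descriptions downstream.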
At the end of our conversation, Kim gave this piece of advice: Share the context in your data. Whether you are working as a member of a technical team or on the business side of the company, explaining how the data is actually organized is key to making the most of it.
Learn more about the Select Star platform by going to the company website and checking out their use cases for data discovery or data governance. If you found this post interesting, lend your ears to other episodes of the Open Source Data podcast.
About Shinji Kim
Shinji Kim has a bachelor's degree in software engineering and has worked with data as a software engineer, product manager, data scientist and consultant. And now, she’s well on her way to becoming a serial entrepreneur. She founded, designed and developed the social game ShufflePix, co-founded Concord Systems, which was acquired by Akamai, and went on to found Select Star in March of 2020.