Get your copy of the O’Reilly Cassandra eBook: The Definitive Guide - Download FREE Today
Alex Petrov is an Infrastructure Engineer, Apache Cassandra Committer, working on building data infrastructure and processing pipelines. He’s interested in CS Theory, algorithms, Distributed Systems, understanding how things work and sharing it with others through blog posts, articles and conference talks. He is also the author of Database Internals: A Deep Dive Into How Distributed Data Systems Work.
Database Research in 2019: The Year in Review
In the same way that our database management systems have to keep up with the growing sizes of our data sets, as database enthusiasts, we have to keep up with everything that is going on in this already vast yet ever-growing field. In this post, I’ll try to cover the highlights of the major 2019 conferences and draw your attention to some publications from your favorite research groups. I’ve attempted to coarsely group papers into categories where they best fit, but some papers discuss more than just one concept, so they may deserve mention in multiple sections.
We’ll start with papers that cover what’s going on in database nodes locally: after a request is received from the client, but before it has been processed by the storage engine. In other words, everything that might be related to query processing.
Most of the database management systems today focus on processing OLAP queries using some structured query language (not always SQL!) and giving responses in the form of a table. This means that user inputs have to be interpreted and encoded into the query, and query results have to be presented in a format that best suits the user by the client software. In “A Holistic Approach for Query Evaluation and Result Vocalization in Voice-Based OLAP” [TRUMMER19], authors describe a system specialized in responding to voice-based queries by both providing a concise query language, a query engine built specifically for this purpose, and a result processor that formats the output as concisely and simply as possible, i.e., by summarizing the results. For example, a query that requests financial statements of some institutions can be described by giving some baseline and outputting how other results are clustered relative to this baseline. Such a system can have a wide application for users whose primary interaction mechanism is voice, which is rather popular today in mobile devices and home assistants. While old systems can try to retrofit existing stores for the same purpose, having a database specifically built for this use case may yield much better results – both in terms of processing performance and clarity of outputs.
Heuristics that used to work well might change as hardware and underlying software layers evolve. Just like asymmetric I/O operations on SSDs made the database community reconsider access patterns, the recent trend of cheaper main memory and CPUs with more cores suggests that now might be a good time to revisit some of the rules used for index selection. When performing a query on a column that has a secondary index, database systems often use cardinality estimation to determine the selectivity of the given query and make a decision to either perform an index scan (probing) or a sequential scan on the table. “Access Path Selection in Main-Memory Optimized Data Systems: Should I Scan or Should I Probe?” [KESTER19] suggests that it is not enough to take into consideration only data layout, selectivity, and hardware characteristics. Since the efficiency of scans on modern architectures has much better performance characteristics, it is time to consider query concurrency as a measure of access pattern selection as well. In other words, we have to use dynamic runtime properties of the system as well as its static properties to achieve the best result.
Another continuing trend in 2019 is database-as-a-service offerings from public cloud vendors and database vendors building out their cloud offerings. Previously, database administrators could know schemas, data distributions, queries, and directly influence access patterns. In a cloud setting, this might be unfeasible due to a sheer number of database instances, schemas, and different workloads; or even impossible due to legal constraints and privacy. As a consequence, people using databases may lack expertise or insight to be able to identify and implement potential improvements. “Automatically Indexing Millions of Databases in Microsoft Azure SQL Database” [DAS19] describes a system that can analyze access patterns and workloads without interfering with production instances to infer, implement, and validate improvements from the possible suggestions (index recommender component of the system), ensuring that an “improvement” doesn’t cause performance degradation elsewhere (validator component).
“Self-driving” database systems remain a promising area of research (see “Self-Driving Database Management Systems” [PAVLO17], and “Query-based Workload Forecasting for Self-Driving Database Management Systems” [MA18]), and the trend is likely to continue since specialized solutions generally outperform one-size-fits-all solutions. Tuning a database for a specific workload is often a good idea that bears fruit rather quickly. One more example of an auto-tuning system is Querc, a database-agnostic workload management system described in “Database-Agnostic Workload Management” [JAIN19]. To allocate resources optimally, improve query routing, and enchance index selection, the auto-tuning component must have a complete picture of what’s going on in the database system. For that, Querc uses an offline ETL to create a workload summary from recent queries and uses this model to identify and recommend best-fitting indexes. The idea is rather interesting, but given that the paper itself is rather short and only gives an overview. If you’re interested in the subject, you may have to dig deep into the references.
If last year you’ve read and liked “The Case for Learned Index Structures” [KRASKA18], a paper that considers indexes as prediction models, you’re likely to also enjoy follow-up paper from the same research group, “SageDB: A Learned Database System” [KRASKA19]. This paper describes SageDB, a system that builds models describing data and workload distributions, and uses them to pick the best data structures and algorithms for each sub-component of the database system. This approach helps to select the best access method, sorting algorithm, query optimization, and workload scheduling strategies. The paper closes with the words “if successful, we believe this approach will result in a new generation of big data processing tools,” which is a high aspiration but is still a rather forward-looking statement. Having that said, dynamic, hybrid, and learned approaches seem to yield great results almost everywhere else, so there’s a good reason to be a believer.
Some other papers on query processing that deserve a honorary mention are:
- Immanuel Trummer, Junxiong Wang, Deepak Maram, Samuel Moseley, Saehan Jo, and Joseph Antonakakis. 2019. SkinnerDB: Regret-Bounded Query Evaluation via Reinforcement Learning. In Proceedings of the 2019 International Conference on Management of Data (SIGMOD ’19). Association for Computing Machinery, New York, NY, USA, 1153–1170.
- Dimitrije Jankov, Shangyu Luo, Binhang Yuan, Zhuhua Cai, Jia Zou, Chris Jermaine, and Zekai J. Gao. 2019. Declarative recursive computation on an RDBMS: or, why you should use a database for distributed machine learning. Proc. VLDB Endow. 12, 7 (March 2019), 822–835.
- Ryan Marcus and Olga Papaemmanouil. Towards a Hands-Free Query Optimizer through Deep Learning.
Indexes and Access Methods
After the query has been interpreted and indexes have been picked, data eventually has to be read from disk. Hence, no matter how well your query is optimized, the final performance often comes down to the access method.
B+-Trees are to disk-based indexes what the S&P 500 is to funds on the stock market: people who come up with new access methods, tend to benchmark them against B+-Trees and attempt to show where their new methods perform better and by how much. In the previous several decades, we’ve seen indexes differing mainly in layout, composition, and maintenance; today, it’s much more challenging to come up with an entirely novel data structure.
One of the approaches commonly taken today is using workload or data properties to compose optimized indexes. For example, FITing-Trees described in “FITing-Tree: A Data-aware Index Structure” [GALAKATOS19] are an approximate tree-based structure that partitions data in variable-size pages and creates an index over the boundaries of these pages. Alongside with the index boundary keys, FITing-Trees store the slope (in other words, rate of change of keys within a segment) of the linear function that can be used to interpolate the approximate position of the searched key to locate the entry quickly. The value associated with a searched key is then guaranteed to be within a user-specified margin of error from its approximated location.
Just as with B-Tree indexes, there are so many flavors of LSM Trees that it’s getting hard to count. In “Dostoevsky: Better Space-Time Trade-Offs for LSM-Tree Based Key-Value Stores via Adaptive Removal of Superfluous Merging” [DAYAN18], authors describe Dostoevsky, an LSM-based key-value store with better space/time trade-offs. The paper introduces a fluid LSM-Tree and lazy leveling. A fluid LSM-Tree uses dynamically pluggable compaction strategies and controls the frequency of merge operations separately for all levels. Lazy leveling is a combination of tiering (allowing multiple tables on the same level and merging tables within the level only when their number reaches a threshold) and leveling (merging new runs within the level as they come in). Lazy leveling eliminates merging on all levels but the largest one (in other words, the bottommost), which reduces write amplification while maintaining the same complexity for lookups and space amplification. Depending on the workload, it is possible to adjust and choose the compaction strategy dynamically and switch between leveling, tiering, and lazy leveling by controlling the number of tables on the bottommost level and numbers of tables on the other levels.
[DAYAN18] was published in 2018 and is mentioned here as an intro for “The Log-Structured Merge-Bush & the Wacky Continuum” [DAYAN19], a 2019 follow-up by the same research group. Here, authors introduce LSM-Bush, a compaction strategy for the highest levels, that avoids non-impactful compactions: levels holding the smallest tables collect more tables before merging them. In LSM-Bush, capacity ratios between adjacent levels grow exponentially. To leverage different strategies depending on the workload, authors introduce Wacky continuum (Wacky: Amorphous Calculable Key-Value Store). “Calculable” in the title is just a fancy way of saying that the compaction strategy will be chosen optimally for the given workload. This fits in well with the trend of adaptive and dynamic data structures. Most of the existing LSM solutions allow pluggable compaction strategies, but you rarely see them change at runtime. However, decision rules described in the paper may help to pick a compaction strategy even for a static system, given that the workload is known upfront.
LSM Trees seem to continue growing in popularity. However, it seems that most ideas in the field have already been explored: storing keys separately from values, choosing whether or not to keep the runs sorted, using different structures for primary key lookup in an LSM-Tree. Even though the idea of using immutable B+-Trees is as old as LSM Trees themselves (“The log-structured merge-tree” [ONEIL96] describes single-component LSM-Trees explore this idea), “Jungle: towards dynamically adjustable key-value store by combining LSM-tree and copy-on-write B+-tree” [AHN19] takes it a notch further. Authors introduce an LSM-Tree/Copy-On-Write B+-Tree fusion called Jungle that splits immutable B+-Trees into levels and stores keys separately from the values. Jungle’s design is somewhat similar to tiering mentioned above, but here multiple sorted runs form an immutable B+-Tree. When immutable B+-Trees are merged, a new write batch (usually, the higher level table) is appended towards the end of the target tree (compaction counterpart on the next level), and the newly created tree structure references remaining tuples from the old batch. Major compaction rewrites the immutable B-Tree, discarding outdated tuples.
As we approach a “there’s nothing new under the sun” moment, it is the best time for a survey paper. “LSM-Based Storage Techniques: a Survey” [LUO19] presents an excellent overview of the LSM-Based storage techniques, starting with history, a brief summary, common implementation details, optimizations, and a motivation for immutable storage structures. Authors group LSM-related research projects into several categories, depending on the area the project introduces improvements for, such as:
- write amplification, compaction/merge operations
- new hardware opportunities (SSD/NVM storage, new multi-core architectures, filesystem-less stores)
- special workloads (such as time-series, append-only, or mostly-sorted datasets)
- auto-tuning (as we’ve also seen in the scope of this post)
- maintaining secondary indexes in LSM-based stores
Some other papers that didn’t make it into the post, but deserve a honorary mention are:
- Stratos Idreos, Kostas Zoumpatianos, Brian Hentschel, Michael S. Kester, and Demi Guo. 2018. The Data Calculator: Data Structure Design and Cost Synthesis from First Principles and Learned Cost Models. In Proceedings of the 2018 International Conference on Management of Data (SIGMOD ’18). Association for Computing Machinery, New York, NY, USA, 535–550.
- Pennino, Diego, Maurizio Pizzonia, and Alessio Papi. “Overlay Indexes: Efficiently Supporting Aggregate Range Queries and Authenticated Data Structures in Off-the-Shelf Databases.” IEEE Access 7 (2019): 175642–175670. Crossref. Web.
- Joy Arulraj, Ran Xian, Lin Ma, and Andrew Pavlo. “Predictive Indexing”.
- Atul Adya, Robert Grandl, Daniel Myers, and Henry Qin. 2019. Fast key-value stores: An idea whose time has come and gone. In Proceedings of the Workshop on Hot Topics in Operating Systems (HotOS ’19). Association for Computing Machinery, New York, NY, USA, 113–119.
- Stratos Idreos, Niv Dayan, Wilson Qin, Mali Akmanalp, Sophie Hilgard, Andrew Ross, James Lennon, Varun Jain, Harshita Gupta, David Li, and Zichen Zhu. “Design Continuums and the Path Toward Self-Designing Key-Value Stores that Know and Learn”. Biennial Conference on Innovative Data Systems Research (CIDR). 2019.
- Yingjun Wu, Jia Yu, Yuanyuan Tian, Richard Sidle, and Ronald Barber. 2019. Designing Succinct Secondary Indexing Mechanism by Exploiting Column Correlations. In Proceedings of the 2019 International Conference on Management of Data (SIGMOD ’19). Association for Computing Machinery, New York, NY, USA, 1223–1240.
- Daniel Kocher and Nikolaus Augsten. 2019. A Scalable Index for Top-k Subtree Similarity Queries. In Proceedings of the 2019 International Conference on Management of Data (SIGMOD ’19). Association for Computing Machinery, New York, NY, USA, 1624–1641.
- Dong Xie, Badrish Chandramouli, Yinan Li, and Donald Kossmann. 2019. FishStore: Faster Ingestion with Subset Hashing. In Proceedings of the 2019 International Conference on Management of Data (SIGMOD ’19). Association for Computing Machinery, New York, NY, USA, 1711–1728.
Anyone who’s ever used a general-purpose database knows that it’s hard to be a Jack of all trades and easy to be a master of none. Many advancements in the field are now coming from the desire to improve the performance of some specific workload. This is a worthy endeavor since oftentimes workloads are indeed easy to categorize and optimize for, so there’s no reason not to exploit this path.
An alternative to optimization for a specific workload is to optimize for every conceivable workload by analyzing it and providing a set of tunable parameters or multiple algorithms, each optimized for one particular setting. While specializing on some workload can be implemented in a database system that only has a set of static parameters that require database restart or even rebuild to be applied, optimizing for a workload at hand calls for dynamic reconfigurability.
At the same time, it’s great to see more papers that apply machine learning techniques to improve database behavior and optimize its performance. This is only the beginning, and we’re yet to see more smart, self-driving database management systems that can analyze a workload, come up with more optimal configuration, and apply it while running. While this idea is nothing new in the research literature (for example, even “SEDA: an architecture for well-conditioned, scalable internet services” [WELSH01] paper describes dynamic resource controllers), industry applications of these ideas are still sparse. Let’s see what 2020 brings since we might see some changes in this area, too.
- [TRUMMER19]: Immanuel Trummer, Yicheng Wang, and Saketh Mahankali. 2019. A Holistic Approach for Query Evaluation and Result Vocalization in Voice-Based OLAP.
- [KESTER19]: Michael S. Kester, Manos Athanassoulis, and Stratos Idreos. Access Path Selection in Main-Memory Optimized Data Systems: Should I Scan or Should I Probe?
- [KRASKA18]: Tim Kraska, Alex Beutel, Ed H. Chi, Jeffrey Dean, and Neoklis Polyzotis. 2018. The Case for Learned Index Structures. In Proceedings of the 2018 International Conference on Management of Data (SIGMOD ’18). Association for Computing Machinery, New York, NY, USA, 489–504.
- [KRASKA19]: Tim Kraska, Mohammad Alizadeh, Alex Beutel, Ed H. Chi, Jialin Ding, Ani Kristo, Guillaume Leclerc, Samuel Madden, Hongzi Mao, Vikram Nathan. 2018. SageDB: A Learned Database System.
- [DAS19]: Sudipto Das, Miroslav Grbic, Igor Ilic, Isidora Jovandic, Andrija Jovanovic, Vivek R. Narasayya, Miodrag Radulovic, Maja Stikic, Gaoxiang Xu, and Surajit Chaudhuri. 2019. Automatically Indexing Millions of Databases in Microsoft Azure SQL Database. In Proceedings of the 2019 International Conference on Management of Data (SIGMOD ’19). Association for Computing Machinery, New York, NY, USA, 666–679.
- [PAVLO17]: Pavlo, G. Angulo, J. Arulraj, H. Lin, J. Lin, L. Ma, P. Menon, T. Mowry, M. Perron, I. Quah, S. Santurkar, A. Tomasic, S. Toor, D. V. Aken, Z. Wang, Y. Wu, R. Xian, and T. Zhang, "Self-Driving Database Management Systems," in CIDR 2017, Conference on Innovative Data Systems Research, 2017.
- [MA18]: L. Ma, D. Van Aken, A. Hefny, G. Mezerhane, A. Pavlo, and G. J. Gordon, "Query-based Workload Forecasting for Self-Driving Database Management Systems," in Proceedings of the 2018 International Conference on Management of Data, 2018, pp. 631-645.
- [JAIN19]: Shrainik Jain and Jiaqi Yan and Thierry Cruane and Bill Howe. Database-Agnostic Workload Management.
- [GALAKATOS19]: Alex Galakatos, Michael Markovitch, Carsten Binnig, Rodrigo Fonseca, and Tim Kraska. 2019. FITing-Tree: A Data-aware Index Structure.
- [DAYAN19]: Niv Dayan and Stratos Idreos. 2018. Dostoevsky: Better Space-Time Trade-Offs for LSM-Tree Based Key-Value Stores via Adaptive Removal of Superfluous Merging. In Proceedings of the 2018 International Conference on Management of Data (SIGMOD ’18). Association for Computing Machinery, New York, NY, USA, 505–520.
- [DAYAN19]: Niv Dayan and Stratos Idreos. 2019. The Log-Structured Merge-Bush & the Wacky Continuum. In Proceedings of the 2019 International Conference on Management of Data (SIGMOD ’19). Association for Computing Machinery, New York, NY, USA, 449–466.
- [CHEN19]: Luo, Chen, and Michael J. Carey. “LSM-Based Storage Techniques: a Survey.” The VLDB Journal (2019): n. pag. Crossref. Web.
- [ONEIL96] Patrick O’Neil, Edward Cheng, Dieter Gawlick, and Elizabeth O’Neil. 1996. The log-structured merge-tree (LSM-tree). Acta Inf. 33, 4 (June 1996), 351–385.
- [AHN19]: Jung-Sanga Ahn, Mohiuddin Abdul Qader, Woon-Hak Kang, Hieu Nguyen, Guogen Zhang, and Sami Ben-Romdhane. 2019. Jungle: towards dynamically adjustable key-value store by combining LSM-tree and copy-on-write B+-tree. In Proceedings of the 11th USENIX Conference on Hot Topics in Storage and File Systems (HotStorage’19). USENIX Association, USA, 9.
- [WELSH01] Matt Welsh, David Culler, and Eric Brewer. 2001. SEDA: an architecture for well-conditioned, scalable internet services. SIGOPS Oper. Syst. Rev. 35, 5 (October 2001), 230–243.
Check out Alex's book, Database Internals: A Deep Dive Into How Distributed Data Systems Work, for more on how cutting edge techniques are implemented in real databases today.