Now it’s your turn — become awesome at data modeling for Cassandra.

Note: This blog series coincides with the short course Data Model Meets World. This content originally appeared on Jeff's personal blog and is reproduced here by permission.

In this series of articles, I’ve been responding to questions about data modeling for Apache Cassandra. A reader of my book Cassandra: The Definitive Guide, 2nd Edition (O’Reilly) asked several excellent questions about the hotel data model presented in the book. Over the course of several articles, we’ve looked at many aspects of data modeling and architecture of cloud-based applications.

Now we’ve reached the end of the series. Like a great hike, we’ve made a long ascent and are now at the summit. From this viewpoint, let’s enjoy a quick look back at where we’ve been and see where we want to go next.

Questions revisited

Looking back, this series started with a set of questions from a reader of my book who was wrestling with how the sample hotel data model would work in a real application:

1) In Chapter 5 a data model is presented for hotels. A hotel_id is part of the key for many tables. From the book, I learned that you need the partition key beforehand to be able to query the tables presented. How would you get the hotel_id, given that there might be hundreds of hotels in a large corporation? A sample key like AZ123 (presented in Chapter 9) would be hard to know ‘by heart’. Would you use a separate Key/value store to manage those ID’s and the hotel names? Would Cassandra fit the bill for managing the keys? If so, how?

2) In the same data model, points-of-interest would be managed by ‘names’. For the book, I gather that poi_names have to be unique. How would uniqueness be guaranteed across the cluster, given that new hotels might be acquired weekly and POIs may change over time on a global scale? From within an application to manage the ‘points-of-interest’, I would be tempted to read all poi-names before trying to add a new one, but wouldn’t that result in a full cluster scan? Or would you hard-code the POI names in the application, which goes against the principle of separating data layer and application layer. Is there any better solution given a choice of Cassandra? How would users get a list of POI names (e.g. within a certain geographical region)? Would you use Cassandra for that?

Answers distilled

This was such a good list of questions that it’s taken me several articles to unpack all of the implications. Let me try to highlight some of the key points that I’ve offered up in response throughout this series:

  • Query-first design is an important data modeling approach. This is especially true when you have a non-trivial number of data types and your application needs to support the ability to navigate relationships between those data types. It’s a good idea to try to anticipate the access patterns that will be required to support features on your roadmap, so you can make sure to leave room for them in your design.
  • The unique identity of rows is important so that you don’t accidentally overwrite data. While natural keys may be acceptable for small and/or fixed data sets like the points of interest in the example, I recommend using UUIDs in most cases because they facilitate low coupling between data types.
  • The microservice architectural style is a great fit for cloud-scale applications. Identifying bounded contexts and candidate microservices as you are developing your data models helps keep the boundaries between services clean and makes sure your data model and architecture are in agreement.
  • Each microservice is in charge of its own data persistence, which gives you the flexibility to use multi-model approaches. That way you avoid forcing an inappropriate style of database onto your problem, or falling back on approaches such as hard-coding data, which limit the scalability and extensibility of your application. You can select from different database models such as tabular, key-value, document, and graph, or even cache data in-memory where needed to meet service level agreements (SLAs).
  • Extending your data model to support additional queries doesn’t have to mean adding new tables. When used carefully, indexes and materialized views can be used to support additional queries such as those mentioned above: text-based POI searching and geospatial hotel searching. Spark (DSE Analytics) can be used for more complex investigations.
  • Overconfidence can trip up even experienced data modelers, leading to designs that don’t perform well under production-level loads and to costly rework. It’s worth the effort to use estimation and load testing to ensure the scalability of your data models well before they make it into production.
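
To make the estimation point a little more concrete, here is a minimal Python sketch of a partition-size check. It assumes the cell-count formula from the Chapter 5 estimation material, Nv = Nr(Nc - Npk - Ns) + Ns, where Nr is the number of rows in a partition, Nc the number of columns, Npk the number of primary key columns, and Ns the number of static columns. The function name, the table shape, and the "100 rooms for 730 days" figures are illustrative assumptions, so check them against the published chapter and your own schema before relying on the result.

```python
def values_per_partition(rows, columns, primary_key_columns, static_columns=0):
    """Estimate the number of values (cells) stored in a single partition.

    Assumed formula: Nv = Nr * (Nc - Npk - Ns) + Ns
    """
    if min(rows, columns, primary_key_columns, static_columns) < 0:
        raise ValueError("all counts must be non-negative")

    # Columns that are neither part of the primary key nor static
    regular_columns = columns - primary_key_columns - static_columns
    if regular_columns < 0:
        raise ValueError("primary key and static columns exceed total columns")

    return rows * regular_columns + static_columns


# Commonly cited hard limit of roughly 2 billion cells per partition;
# practical targets are usually far lower. Treat this as a rough guide.
MAX_CELLS_PER_PARTITION = 2_000_000_000

# Hypothetical availability table: one partition per hotel, one row per
# room per day, 4 columns in total, 3 of them in the primary key.
rows_in_partition = 100 * 730  # assumed: 100 rooms, 2 years of dates
estimate = values_per_partition(rows_in_partition, columns=4,
                                primary_key_columns=3)

print(f"Estimated values per partition: {estimate}")
print(f"Within hard limit: {estimate < MAX_CELLS_PER_PARTITION}")
```

An estimate like this is only a first pass. It tells you whether a partition design is in the right order of magnitude, and load testing against realistic data is still what confirms it.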

I hope these observations will be useful to you as you create data models for your own applications.

Need more data modeling resources?

If you’d like to continue learning about Cassandra data modeling, there are several great resources available:

  • “Chapter 5: Data Modeling” of my book Cassandra: The Definitive Guide, 2nd Edition is the basis of this series of articles. This chapter is available for free on the O’Reilly Ideas website as the article: Designing Data Models for Cassandra. The digital copy has recently been revised to include updated estimation formulas from Artem Chebotko (@ArtemChebotko) for Cassandra 3.X.
  • DS220: Data Modeling — this free course available here on DataStax Academy takes you all the way through the process of producing conceptual, logical and physical data models, providing lots of practical guidance. It even provides example data models for use cases from several domains. One of the best parts of this course is the interactive exercises that help you master the principles you’re learning.
  • I’ve developed a live training course for O’Reilly called Building Applications with Apache Cassandra. The next opportunity for you to take this will be in September 2017. Data modeling is a key part of this course.
  • KillrVideo — this is a reference application for Cassandra and DataStax Enterprise that is developed and maintained by the evangelist team at DataStax (of which I am a part). The reason I recommend this as a data modeling resource is that you get to observe several of the decisions and tradeoffs that go into designing schema for a real working application, particularly if you check out this related SlideShare by Luke Tillman (@LukeTillman).

Reaching expert level

My main goal in writing this series is to help you become an awesome Cassandra data modeler by looking at real-world examples and tradeoffs. However much benefit you may gain from absorbing training material such as this series and the resources I’ve listed above, it’s probably not enough by itself.

Research has shown that to become an expert in something, you need a lot of practice. And not just any practice: it needs to be deliberate practice based on watching and emulating experts. After all, there’s no point in cranking out mediocre data model after mediocre data model. A great video from Kathy Sierra at O’Reilly’s 2015 Fluent conference summarizes a lot of that research. She distills all of it down to what it takes to continually be awesome at multiple skills, which is what it takes to succeed as a modern software developer.

My encouragement to you is to keep working at your data modeling skills by continuing to create models and test them out.

Also, don’t be afraid to have others look at your data models. You can see evidence of the value of this in venues such as the Cassandra user mailing list (user@cassandra.apache.org), where developers frequently post portions of their data models looking for feedback. I typically learn a lot from these exchanges, especially when considering data models from domains I haven’t worked in as much, such as IoT applications with time-series data.

The next challenge — KillrVideo

I’ve started on my next challenge: my fellow evangelist David Gilardi (@SonicDMG) and I are in the process of updating the KillrVideo application to take advantage of recent Cassandra features such as user-defined types (UDTs), SASI indexes, materialized views, and user-defined functions (UDFs). We’ll also be leveraging new features of DSE 5.1 including Search, Analytics and Graph. Hopefully this will be the source of many future articles.
