Technology | June 11, 2019

Why It’s Time to Rethink Data Modeling

Louise Westoby

Data modeling provides a means of planning and blueprinting the complex relationship between an application and its data. It has become increasingly important as the “three Vs” of data—volume, variety, and velocity—continue to explode.

But as the importance of data modeling, in general, has grown, data modeling with a relational database has rapidly become less relevant and effective.

Relational databases simply aren't designed to handle all the different types of data that are integral to modern applications. The mismatch between relational database capabilities and modern needs prompted former federal CIO Vivek Kundra, back in 2009, to declare: “This notion of thinking about data in a structured, relational database is dead.”

Ten years beyond that statement, it has become apparent that Kundra was correct. With every passing day, it seems, more forms of unstructured and semi-structured data must be stored, manipulated, and utilized in making business decisions—and those decisions must often be delivered with lightning-fast speed.

The Complexities of Relational Data Modeling

Data modeling with relational databases can be challenging. It’s a slow and cumbersome process. Even the smallest, simplest change to a relational database, such as altering one field in one table, can set off a cascade of additional changes that are labor-intensive to enact and expensive to complete.

Although relational data modeling has remained a challenge, data modeling itself is very much alive and well. Data modeling is essential because it helps define both the data structure and the business requirements of an application, which brings us to our next topic: Apache Cassandra® data modeling.

Relational Data Modeling vs. Cassandra Data Modeling

Where relational databases base their data models on the data itself, Cassandra bases its data model on what you want to do with the data (i.e., the application and the queries it runs).

This is a key distinction.

Cassandra data modeling is far more applicable and useful to today’s business environment. Data modeling with Cassandra, however, is quite different from relational data modeling and requires a mindset adjustment that leaves behind many of the restrictions of relational data modeling.

When transitioning from relational modeling rules to Cassandra modeling rules, two of the most important relational modeling restrictions that may be discarded are:

  • Minimizing writes
  • Minimizing data duplication

Since writes are expensive in relational databases, relational data modelers typically seek to restrict writes as much as possible.

That restriction is not applicable to Cassandra; writes can be quite inexpensive, and Cassandra is optimized to perform virtually all writes efficiently. And since Cassandra is typically architected around low-cost and abundant data storage, denormalization and duplication of data are common. Read efficiency, in fact, can often be maximized in Cassandra by intentionally using duplicate data.
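To make the duplication idea concrete, here is a minimal sketch in plain Python. It does not use Cassandra at all: two dictionaries stand in for two query-specific tables, and every name in it (videos_by_user, videos_by_tag, add_video) is hypothetical. The point is only to show the shape of the pattern, where each write lands in every table that serves a query, so that each read becomes a single lookup by key.

```python
from collections import defaultdict

# Two dictionaries stand in for two query-specific tables.
# All table and field names here are hypothetical.
videos_by_user = defaultdict(list)  # answers "which videos did this user upload?"
videos_by_tag = defaultdict(list)   # answers "which videos carry this tag?"


def add_video(video_id, user_id, title, tags):
    """Write one video into every table that serves a query (duplicated data)."""
    row = {"video_id": video_id, "user_id": user_id, "title": title}
    videos_by_user[user_id].append(dict(row))  # one copy per user lookup
    for tag in tags:
        videos_by_tag[tag].append(dict(row))   # one more copy per tag


add_video("v1", "alice", "Intro to CQL", ["cassandra", "tutorial"])
add_video("v2", "bob", "Partition keys", ["cassandra"])

# Each read is a single lookup by key; no join is needed.
alice_titles = [r["title"] for r in videos_by_user["alice"]]
cassandra_titles = [r["title"] for r in videos_by_tag["cassandra"]]
print(alice_titles)
print(cassandra_titles)
```

The trade-off is visible in add_video: one logical write becomes several physical writes, which is acceptable only because writes and storage are assumed to be cheap.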

With Cassandra data modeling, there are two primary goals that are quite different from relational database modeling. These goals are:

  • To spread data evenly around the cluster: Rows are spread around the cluster based on the partition key, which is the first element of the primary key. So, in order to spread data evenly, you need to pick a good partition key.
  • To minimize the number of partitions read: Partitions are groups of rows that share the same partition key. When you issue a read query, you want to read rows from as few partitions as possible.
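
The two goals above can pull against each other, which a small simulation may help illustrate. The sketch below is plain Python, not Cassandra code: it hashes partition keys onto a hypothetical four-node cluster using MD5 as a stand-in for Cassandra's partitioner (the default in Cassandra is Murmur3Partitioner), and the data set, column names, and helper functions are all invented for illustration.

```python
import hashlib
from collections import Counter

NODES = 4  # hypothetical cluster size


def node_for(partition_key: str) -> int:
    """Map a partition key to a node by hashing it (MD5 as a stand-in)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NODES


# 10,000 made-up rows: user_id is unique, while 90% of rows share one country.
rows = [
    {"user_id": f"user-{i}", "country": "US" if i % 10 else f"C{i % 7}"}
    for i in range(10_000)
]


def rows_per_node(key_column):
    """Count how many rows land on each node for a given partition key column."""
    counts = Counter(node_for(row[key_column]) for row in rows)
    return [counts.get(n, 0) for n in range(NODES)]


def partitions_read(key_column, predicate):
    """Count the distinct partitions a query must touch to find matching rows."""
    return len({row[key_column] for row in rows if predicate(row)})


def is_us(row):
    return row["country"] == "US"


by_user = rows_per_node("user_id")     # high-cardinality key: roughly even
by_country = rows_per_node("country")  # skewed key: one node holds most rows
print("rows per node, keyed by user_id:", by_user)
print("rows per node, keyed by country:", by_country)
print("partitions read for US rows, keyed by country:", partitions_read("country", is_us))
print("partitions read for US rows, keyed by user_id:", partitions_read("user_id", is_us))
```

Keying by user_id spreads rows evenly but forces a "all US users" query to touch thousands of partitions, while keying by country answers that query from one partition at the cost of a hot node. Which key is "good" therefore depends on the queries the application actually runs.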

And unlike with relational databases, any type of data can be stored in a Cassandra database. Cassandra also differs from relational databases in its ability to handle:

  • Massive volume: Multiple petabytes of data? Trillions of data entities? Not a problem in Cassandra.
  • Virtually unlimited velocity: Cassandra can handle millions of transactions per second, including real time and streams.
  • Infinite variety: Cassandra can accommodate all forms of data, including structured, unstructured, and semi-structured.

Why It’s Important to Get the Data Model Right

Done correctly, data modeling can provide benefits throughout the development lifecycle. In addition to improving performance, it accelerates application development. It can ensure a more structured, less haphazard development process that helps maximize the quality of the end product and lower long-term maintenance costs.

Conversely, a flawed data model, or a data model that simply isn’t a proper fit for the application, can lead to a cascade of problems:

  • Complicating and slowing the development effort
  • Over-complicating data access
  • Indecision and uncertainty about how application data will be stored and accessed
  • Inflexibility in responding to evolving requirements
  • Excessively complicated code that makes ongoing maintenance more time-consuming and expensive

Each of these problems is likely to result in busted budgets and blown deadlines.

Transitioning from Relational Data Modeling to Cassandra Data Modeling

Transitioning from a relational database to Cassandra may seem a daunting challenge, but that’s a common misconception. Companies that manage some of the largest databases on the planet, like Netflix, have made the transition, and there’s plenty of guidance available for making it. It helps, of course, that Cassandra uses the SQL-like Cassandra Query Language (CQL).

Similarly, transitioning from relational data modeling to Cassandra data modeling can be easy. So easy, in fact, that you can complete the process in just five simple steps.

5 Steps to an Awesome Apache Cassandra Data Model (free webinar)

