Best Practices for Migrating from a Relational Data Platform to Apache Cassandra™
One of the most frequently asked questions we receive is “how do I migrate my application that was designed for a relational database to Cassandra?” This is a great question with all kinds of practical implications, so we’ve gathered some collective wisdom and best practices to share with you.
Knowing when it’s time to migrate
Any migration to a new platform or database involves work. Before starting a migration project, you’ll want to understand how you’ll benefit and when it’s the right time to move the key workloads that run your enterprise from a relational database to a NoSQL database like Cassandra. Here are some key signs that your current relational database isn’t getting the job done anymore:
- Your database queries are getting slower and more difficult to debug and maintain
- You’re struggling to scale beyond a single database node, or paying high license costs for a fancy multi-node solution
- Hot backups required by your disaster recovery plan waste resources and don’t guarantee high availability
- You want to deploy applications in hybrid or multi-cloud architectures
If you’ve concluded it’s time for a change, identify the use cases causing the most performance and scalability challenges, and prioritize those for migration. Where possible, migrating functionality a bit at a time is generally lower risk than a “big bang” or “flip the switch” migration.
Once you’ve identified a use case or two to migrate, the migration process includes the following:
- Adapting your data model
- Adapting your application
- Planning your deployment
- Moving your data
Let’s examine what’s involved in each of these steps.
Adapting your data model
It’s vital to understand that Cassandra data modeling is not the same as relational data modeling. While Cassandra uses familiar concepts like tables, rows, and columns, and the Cassandra Query Language (CQL) is quite similar to SQL, there are some important differences that you need to be aware of.
Relational data modelers are accustomed to creating normalized schemas to minimize data duplication, and to using joins to assemble data from multiple tables. On a relational platform, you might speed up some slow queries by adding indexes, or by selectively denormalizing, duplicating columns across tables to avoid joins.
Cassandra data modeling is different: denormalization is the rule, not the exception. You start by analyzing your application’s workflows to identify the queries you’ll need, then design tables so that each query can be answered by reading from a single table. The DataStax Academy course “DS220 Practical Application Data Modeling with Apache Cassandra” is a great way to develop your expertise in designing Cassandra tables.
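To make the query-first approach concrete, here’s a small sketch. The use case (“comments for a video, newest first”) and all table and column names are hypothetical examples, not from any particular schema: the point is that the Cassandra table is designed around the query, with the partition key and clustering order chosen so a single read returns exactly what the application needs.

```python
# Query-first modeling sketch: one table per query, denormalized.
# The use case and every name below are hypothetical illustrations.

# Relational style: normalized tables, answered with a JOIN at read time.
relational_ddl = """
CREATE TABLE videos   (video_id UUID PRIMARY KEY, title TEXT);
CREATE TABLE comments (comment_id UUID PRIMARY KEY,
                       video_id UUID REFERENCES videos(video_id),
                       author TEXT, body TEXT, created_at TIMESTAMP);
"""

# Cassandra style: one denormalized table that serves the query
# "comments for a video, newest first" from a single partition.
# video_id is the partition key; created_at is a clustering column,
# so rows come back already sorted and no join is needed.
cassandra_ddl = """
CREATE TABLE comments_by_video (
    video_id    UUID,
    created_at  TIMESTAMP,
    comment_id  UUID,
    video_title TEXT,   -- duplicated from videos: denormalization
    author      TEXT,
    body        TEXT,
    PRIMARY KEY ((video_id), created_at, comment_id)
) WITH CLUSTERING ORDER BY (created_at DESC);
"""

# The query the table was designed for: a single-partition read.
query = "SELECT * FROM comments_by_video WHERE video_id = ? LIMIT 20;"
```

Notice that the video title is stored in every comment row; in exchange, the application never joins at read time.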
Don’t underestimate the importance of good Cassandra data models: they are the number one key to a successful migration. It’s a good idea to load test your data models to see how your write and read queries perform with non-trivial amounts of data, and to get a concrete idea of the cluster size and configuration that will meet your performance goals. You can use DSBench to put load on a target cluster for performance and scale testing, data model validation, and more.
Adapting your application
The next step is updating your application code to read from and write to the Cassandra tables you’ve designed. Whether you’re updating an existing monolith or creating entirely new microservices, DataStax Drivers are available in the most popular languages for connecting to your DataStax Enterprise or Cassandra clusters.
New Cassandra developers need to become accustomed to the idea that Cassandra is a distributed database, and that there are tradeoffs to consider when storing multiple copies of data across multiple nodes or even multiple data centers or clouds. You’ll want to learn about Cassandra’s tunable consistency and the tools Cassandra gives you for managing the tradeoffs between consistency and performance, including consistency levels, lightweight transactions, and batches. The DataStax Academy course “DS201 Foundations of Apache Cassandra” is a great introduction to these concepts.
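The consistency/performance tradeoff can be made concrete with a little replica arithmetic. This is an illustrative sketch, not driver code: with a replication factor RF, a quorum is a majority of replicas, and a read is guaranteed to see the latest write when the number of replicas written plus the number read exceeds RF.

```python
# Illustrative sketch of Cassandra's tunable consistency arithmetic.
# Not driver code: just the replica-counting rule behind consistency levels.

def quorum(replication_factor: int) -> int:
    """A quorum is a majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1

def is_strongly_consistent(write_replicas: int, read_replicas: int,
                           replication_factor: int) -> bool:
    """Reads see the latest write when W + R > RF (replica sets overlap)."""
    return write_replicas + read_replicas > replication_factor

rf = 3
# QUORUM writes + QUORUM reads always overlap in at least one replica.
print(quorum(rf))                                          # 2
print(is_strongly_consistent(quorum(rf), quorum(rf), rf))  # True
# ONE/ONE is faster, but a read may miss the most recent write.
print(is_strongly_consistent(1, 1, rf))                    # False
```

This is why QUORUM reads and writes are a common default for workloads that need strong consistency, while ONE trades that guarantee for lower latency.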
Planning your deployment
Before you deploy your updated application, it’s important to plan out your Cassandra cluster. You’ll want to consider questions such as:
- What are the performance metrics or service level agreements (SLAs) that will be required for your queries?
- What kind of hardware and network is available within your platform?
- How many data centers will the application be deployed to?
These questions should be considered both in terms of your initial deployment as well as how you plan to expand as your application proves successful.
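One piece of that planning is a back-of-the-envelope sizing estimate. The sketch below illustrates the arithmetic only; the figures are placeholders, not recommendations, and real sizing should be validated with load testing.

```python
import math

# Back-of-the-envelope cluster sizing sketch. The numbers below are
# placeholders to illustrate the arithmetic, not sizing recommendations.

def nodes_needed(raw_data_tb: float, replication_factor: int,
                 usable_tb_per_node: float) -> int:
    """Total stored data is raw data x RF; divide by per-node capacity."""
    total_tb = raw_data_tb * replication_factor
    return math.ceil(total_tb / usable_tb_per_node)

# e.g. 4 TB of raw data, RF = 3, ~2 TB usable per node after overhead:
print(nodes_needed(4.0, 3, 2.0))  # 6
```

Running the same arithmetic for your projected growth numbers gives you a rough expansion plan as well as an initial cluster size.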
Moving your data
After doing the hard work of designing Cassandra tables and writing the application code that uses them, the time will come to deploy your updated application to production. In most cases you’ll have data to move from your legacy database. Tools such as the DataStax Bulk Loader are great for one-time data migrations, one of several cases described in Brian Hess’s blog series.
If you have requirements for a zero-downtime migration, you’ll also want to investigate Apache Kafka®, the Kafka Connect framework, and the DataStax Kafka Connector to capture changes from your legacy database and write them into Cassandra tables. To validate the results of your data migration, consider using Apache Spark to compare records from the source system against those in your Cassandra cluster.
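The validation step boils down to comparing records by primary key between the two systems. Here’s the idea in plain Python, a toy stand-in for a Spark job with entirely hypothetical data: report keys that never arrived and keys whose values differ.

```python
# Toy stand-in for a Spark validation job: compare records by primary
# key between the legacy source and the migrated Cassandra tables.
# The data and field values below are hypothetical.

def diff_by_key(source: dict, target: dict) -> dict:
    """Return source keys missing from target, and keys whose values differ."""
    missing = [k for k in source if k not in target]
    mismatched = [k for k in source
                  if k in target and source[k] != target[k]]
    return {"missing": missing, "mismatched": mismatched}

legacy = {1: ("alice", "gold"), 2: ("bob", "silver"), 3: ("carol", "gold")}
migrated = {1: ("alice", "gold"), 2: ("bob", "bronze")}  # 3 missing, 2 differs

report = diff_by_key(legacy, migrated)
print(report)  # {'missing': [3], 'mismatched': [2]}
```

At production scale you’d run the same comparison in Spark, joining the source extract against the Cassandra table on the primary key and filtering for mismatches.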
Let’s do this!
If you’re ready to take the plunge, we have plenty of help available, including DataStax Docs and our community site where you can get quick responses to your questions from our experts. Also make sure to check out our webinar.