What is data persistence and why does it matter?
For more information on persistence in Cassandra and other NoSQL databases, check out What is NoSQL?
Understanding the meaning of persistence is important for evaluating different data store systems. Given the importance of the data store in most modern applications, making a poorly informed choice could mean substantial downtime or loss of data. In this post, we'll discuss persistence and data store design approaches and provide some background on these in the context of Cassandra.
Persistence is "the continuance of an effect after its cause is removed". In the context of storing data in a computer system, this means that the data survives after the process with which it was created has ended. In other words, for a data store to be considered persistent, it must write to non-volatile storage.
If you need persistence in your data store, then you need to also understand the four man design approaches that a data store can take and how (or if) these designs provide persistence:
Pure in-memory, no persistence at all, such as memcaches or Scalaris
In-memory with periodic snapshots, such as Oracle Coherence or Redis
Disk-based with update-in-place writes, such as MySQL ISAM or MongoDB
Commitlog-based, such as all traditional OLTP databases (Oracle, SQL Server, etc.)
In-memory approaches can achieve blazing speed, but at the cost of being limited to a relatively small data set. Most workloads have relatively small "hot" (active) subset of their total data; systems that require the whole datset to fit in memory rather than just the active part are fine for caches but a bad fit for most other applications. Because the data is in memory only, it will not survive process termination. Therefor these types of data stores are not considered persistent.
The easiest way to add persistence to an in-memory system is with periodic snapshots to disk at a configurable interval. Thus, you can lose up to that interval's worth of updates.
Update-in-place and commitlog-based systems store to non-volatile memory immediately, but only commitlog-based persistence provides Durability -- the D in ACID -- with every write persisted before success is returned to the client.
Cassandra implements a commit-log based persistence design, but at the same time provides for tunable levels of durability. This allows you do decide what the right trade off is between safety and performance. You can choose, for each write operation, to wait for that update to be buffered to memory, written to disk on a single machine, written to disk on multiple machines, or even written to disk on multiple machines in different data centers. Or, you can choose to accept writes a quickly as possible, acknowledging their receipt immediately before they have even been fully deserialized from the network.
At the end of the day, you're the only one who knows what the right performance/durability trade off is for your data. Making an informed decision on data store technologies is critical to addressing this tradeoff on your terms. Because Cassandra provides such tuneability, it is a logical choice for systems with a need for a durable, performant data store.