Understanding the Guarantees, Limitations, and Tradeoffs of Cassandra and Materialized Views
The new Materialized Views feature in Cassandra 3.0 offers an easy way to accurately denormalize data so it can be efficiently queried. It's meant to be used on high cardinality columns where the use of secondary indexes is not efficient due to fan-out across all nodes. An example would be creating a secondary index on a user_id. As the number of users in the system grows the longer it would take a secondary index to locate the data since secondary indexes store data locally. With a materialized view you can partition the data on user_id so finding a specific user becomes a direct lookup with the added benefit of holding other denormalized data from the base table along with it, similar to a DynamoDB global secondary index.
Materialized views are a very useful feature to have in Cassandra but before you go jumping in head first, it helps to understand how this feature was designed and what the guarantees are.
Primarily, since materialized views live in Cassandra they can offer at most what Cassandra offers, namely a highly available, eventually consistent version of materialized views.
A quick refresher of the Cassandra guarantees and tradeoffs:
- Writes to a single table are guaranteed to be eventually consistent across replicas - meaning divergent versions of a row will be reconciled and reach the same end state.
- Lightweight transactions are guaranteed to be linearizable for table writes within a data center or globally depending on the use of LOCAL_SERIAL vs SERIAL consistency level respectively.
- Batched writes across multiple tables are guaranteed to succeed completely or not at all (by using a durable log).
- Secondary indexes (once built) are guaranteed to be consistent with their local replicas data.
- Cassandra provides read uncommitted isolation by default. (Lightweight transactions provide linearizable isolation)
- Using lower consistency levels yield higher availability and better latency at the price of weaker consistency.
- Using higher consistency levels yield lower availability and higher request latency with the benefit of stronger consistency.
Another tradeoff to consider is how Cassandra deals with data safety in the face of hardware failures. Say your disk dies or your datacenter has a fire and you lose machines; how safe is your data? Well, it depends on a few factors, mainly replication factor and consistency level used for the write. With consistency level QUORUM and RF=3 your data is safe on at least two nodes so if you lose one node you still have a copy. However, if you only have RF=1 and lose a node forever you've lost data forever.
An extreme example of this is if you have RF=3 but write at CL.ONE and the write only succeeds on a single node, followed directly by the death of that node. Unless the coordinator was a different node you probably just lost data.
Given Cassandra's system properties, the implication of maintaining Materialized Views manually in your application is likely to create permanent inconsistencies between views. Since your application will need to read the existing state from Cassandra then modify the views to clean-up any updates existing rows. Besides the added latency, if there are other updates going to the same rows your reads will end up in a race condition and fail to clean up all the state changes. This is the scenario the mvbench tool compares against.
The Materialized Views feature in Cassandra 3.0 was written to address these and other complexities surrounding manual denormalization, but that is not to say it's not without its own set of guarantees and tradeoffs to consider. To understand the internal design of Materialized Views please read the design document. At a high level though we chose correctness over raw performance for writes, but did our best to avoid needless write amplification. A simple way to think about this write amplification problem is: if I have a base table with RF=3 and a view table with RF=3 a naive approach would send a write to each base replica and each base replica would send a view update to each view replica; RF+RF^2 writes per-mutation! C* Materialized Views instead pairs each base replica with a single view replica. This simplifies to be RF+RF writes per mutation while still guaranteeing convergence.
Materialized View Guarantees:
- All changes to the base table will be eventually reflected in the view tables unless there is a total data loss in the base table (as described in the previous section)
Materialized View Limitations:
- All updates to the view happen asynchronously unless corresponding view replica is the same node. We must do this to ensure availability is not compromised. It's easy to imagine a worst case scenario of 10 Materialized Views for which each update to the base table requires writing to 10 separate nodes. Under normal operation views will see the data quickly and there are new metrics to track it (ViewWriteMetricss).
- There is no read repair between the views and the base table. Meaning a read repair on the view will only correct that view's data not the base table's data. If you are reading from the base table though, read repair will send updates to the base and the view.
- Mutations on a base table partition must happen sequentially per replica if the mutation touches a column in a view (this will improve after ticket CASSANDRA-10307)
Materialized View Tradeoffs:
- With materialized views you are trading performance for correctness. It takes more work to ensure the views will see all the state changes to a given row. Local locks and local reads required. If you don't need consistency or never update/delete data you can bypass materialized views and simply write to many tables from your client. There is also a ticket CASSANDRA-9779 that will offer a way to bypass the performance hit in the case of insert only workloads.
- The data loss scenario described in the section above (there exists only a single copy on a single node that dies) has different effects depending on if the base or view was affected. If view data was lost from all replicas you would need to drop and re-create the view. If the base table lost data through, there would be an inconsistency between the base and the view with the view having data the base doesn't. Currently, there is no way to fix the base from the view; ticket CASSANDRA-10346 was added to address this.
One final point on repair. As described in the design document, repairs mean different things depending on if you are repairing the base or the view. If you repair only the view you will see a consistent state across the view replicas (not the base). If you repair the base you will repair both the base and the view. This is accomplished by passing streamed base data through the regular write path, which in turn updates the views. This mode is also how bootstrapping new nodes and SSTable loading works as well to provide consistent materialized views.