October 14, 2020

Get Your Head In the Clouds (Part 1 of 3): Build Cloud-Native Apps with DataStax Astra DBaaS now on AWS, GCP and Azure

Matt Kennedy, Product Strategy, DataStax

New features in Astra include Storage-Attached Indexing, VPC Peering, and support for multi-region databases

Today, we announced that DataStax Astra is now available on Amazon Web Services, Google Cloud Platform, and Microsoft Azure. Astra enables users to rapidly build, deploy, and manage multi-cloud, multi-region applications with a massively scalable Database-as-a-Service (DBaaS). Enterprises and developers now have the freedom to run any Apache Cassandra® workload, anywhere, at global scale.

We also launched new features in Astra: Storage-Attached Indexing, VPC Peering, and support for multi-region databases. We’ll dive into all the new features in a blog series, but today let’s focus on Storage-Attached Indexing.

Storage-Attached Indexing (SAI) Revolutionizes Cassandra Data Modeling

If you're a technologist who works with data at any sort of large volume, there's a good chance that you've evaluated or researched Cassandra at some point in the last decade. There's also a good chance that you learned that Cassandra data modeling can be difficult. If that resonates with you, now is the time to take a second hard look at Cassandra. The reason is the introduction of Storage-Attached Indexing, or SAI.

Back in 2010, when Cassandra introduced what would come to be known as C*2i, short for Cassandra Secondary Indexes, I was personally super excited. I'd been a Cassandra user for a few short months at that point, and I saw this huge potential in the database. Nothing else scaled like Cassandra with an elegant peer-to-peer architecture that I could wrap my head around. I appreciated the practicality of the consistency mechanisms and what that meant for geographically distributed data centers (more on that in a later blog post) that could function in an active-active capacity. 

But the data modeling challenges at the time were daunting. This was well before CQL had taken off, and most data modeling was done entirely in your application code, as Thrift really had no schema management to speak of. We had wide partitions to group related records together for efficient retrieval, and you could embed JSON objects in cell values to get a little extra structure. Then there were Supercolumns, but those came to be widely regarded as broken and an anti-pattern right around the time that I really grasped them, so I was out of luck there. I thought that C*2i would be the magic bullet that really made Cassandra data modeling manageable. Alas, shortly after their introduction, and after a lot of painful mailing list messages about what C*2i was and wasn't good for, I decided they weren't something I could rely on.

Fast forward a few years to 2014, when I joined DataStax. By then, the DataStax Enterprise (DSE) Search functionality based on Solr, combined with the advancements made in CQL, had finally made a serious dent in how challenging Cassandra data modeling had to be. But what struck me, and a lot of others as well, was how often DSE Search was being used as a simple indexing mechanism, not as the full-fledged search engine that the Lucene/Solr technology had initially been built to be. Solr was over-engineered for the use case of core database indexing. So, inspired by our enterprise users and by innovations from the community like SASI, DataStax set out to solve, once and for all, the challenge of true secondary indexes for Cassandra. That solution is called SAI. It's now available in both DSE and Astra, and it is in the process of being evaluated by the community for inclusion in open source Cassandra.

In the few brief months that Astra has had SAI, I've seen it transform Cassandra data modeling. Developers working on new applications can now start with an intuitive table model and add indexes as new query requirements are discovered, without having to maintain a second, denormalized representation of the same data. Existing applications are far easier to adapt to new requirements in the same way: instead of adding a materialized view, or code to support a custom query table, just add an index.
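
As a rough illustration of "just add an index," the sketch below builds the CQL statements for a simple table and for an SAI index on one of its non-key columns. The keyspace, table, column, and index names are hypothetical and chosen only for this example. The `USING 'StorageAttachedIndex'` clause reflects the index syntax as I understand it for Astra and DSE, so check the current CQL reference before relying on it. The code only assembles and prints statements; it does not connect to a database.

```python
# Hypothetical keyspace, table, and column names, used only for illustration.
KEYSPACE = "demo"
TABLE = "users"

# An intuitive base table keyed by user id.
CREATE_TABLE = (
    f"CREATE TABLE IF NOT EXISTS {KEYSPACE}.{TABLE} ("
    "user_id uuid PRIMARY KEY, email text, country text, signup_date date)"
)


def sai_index_cql(keyspace: str, table: str, column: str) -> str:
    """Build a CREATE CUSTOM INDEX statement requesting an SAI index.

    The index name is derived from the table and column (an arbitrary
    naming convention assumed for this sketch).
    """
    index_name = f"{table}_{column}_sai"
    return (
        f"CREATE CUSTOM INDEX IF NOT EXISTS {index_name} "
        f"ON {keyspace}.{table} ({column}) USING 'StorageAttachedIndex'"
    )


def query_by_column_cql(keyspace: str, table: str, column: str) -> str:
    """Build a parameterized query that filters on the indexed column."""
    return f"SELECT * FROM {keyspace}.{table} WHERE {column} = ?"


# A new requirement arrives: look users up by country.
# Rather than creating a second, denormalized table, add an index.
print(CREATE_TABLE)
print(sai_index_cql(KEYSPACE, TABLE, "country"))
print(query_by_column_cql(KEYSPACE, TABLE, "country"))
```

The statements can then be run in the CQL console or through any Cassandra driver. The base table definition does not change when the new query requirement appears.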

SAI can even help solve other data modeling challenges like excessive tombstones, which can be a problem in Cassandra when a query has to scan over tombstones for deleted data to arrive at the desired live records. Now, you can choose a physical data model that minimizes tombstone scans and handle the remaining query requirements via SAI. It's also a game changer for node density, as fewer Materialized Views and custom query tables mean less redundancy. SAI takes up a fraction of the storage of either of those mechanisms.
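
To make the redundancy point concrete, here is a toy in-memory model, not a representation of SAI's actual on-disk format or of how Cassandra stores materialized views. It contrasts a denormalized query table, which copies every row under a new key, with an index that stores only references back to the base table. The sample rows and all function names are invented for illustration, and the counts say nothing about real storage ratios.

```python
from collections import defaultdict

# Base table: primary key -> full row (hypothetical sample data).
users = {
    1: {"email": "a@example.com", "country": "US", "plan": "free"},
    2: {"email": "b@example.com", "country": "DE", "plan": "pro"},
    3: {"email": "c@example.com", "country": "US", "plan": "pro"},
}


def build_query_table(rows, column):
    """Denormalized copy: column value -> list of full row copies."""
    table = defaultdict(list)
    for key, row in rows.items():
        table[row[column]].append({"id": key, **row})
    return dict(table)


def build_index(rows, column):
    """Index: column value -> set of primary keys (no row data copied)."""
    index = defaultdict(set)
    for key, row in rows.items():
        index[row[column]].add(key)
    return dict(index)


def lookup_via_index(rows, index, value):
    """Resolve matching keys from the index, then read the base table."""
    return [rows[key] for key in sorted(index.get(value, ()))]


query_table = build_query_table(users, "country")
country_index = build_index(users, "country")

# Cells duplicated by the query table versus entries held by the index.
copied_cells = sum(len(row) for copies in query_table.values() for row in copies)
index_entries = sum(len(keys) for keys in country_index.values())

print("cells copied into query table:", copied_cells)
print("entries stored in index:", index_entries)
print("US users via index:", lookup_via_index(users, country_index, "US"))
```

In this toy model, each additional query pattern served by a query table duplicates whole rows, while each additional index adds only one entry per row.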

Now, for your most latency-sensitive or highest-throughput queries, you may still need an MV or a custom query table. But you no longer have to do that for EVERY query. That has the potential to be a huge productivity boost for existing Cassandra developers. More importantly, it means that everyone who has passed over Cassandra because of data modeling complexity owes it to themselves to take a second look.

Sign up for Astra and give SAI a spin at no cost.

For a deeper dive, join me and Patrick McFadin for an upcoming webinar: “Everything You Need to Know About Storage Attached Indexing in Apache Cassandra.” Register here

