Jeff Carpenter shares the top ten things you should know if you’re an Apache Cassandra user just starting out with DataStax Enterprise.
I’ve been an avid user of Apache Cassandra for several years, helping to build cloud-based applications with Cassandra as the primary data store. I even wrote a book about Cassandra for O’Reilly (Cassandra: The Definitive Guide, 2nd Edition) before joining DataStax as a technical evangelist. I love Cassandra because of the unique way it scales compared to other database technologies, while at the same time providing continuous availability. I am convinced that Cassandra’s tunable consistency approach provides the most flexible foundation for managing the consistency and availability tradeoffs you need for each database operation.
However, part of my experience with using Cassandra was that massive scalability and flexibility came with some tradeoffs. Some of these were by design, such as CQL query limitations that Cassandra enforces in order to help avoid performance issues. And others had to do with developer experience and ease of operations, areas the community has worked diligently to improve over time. All of these issues proved to be ones that my teams could work around, but at the cost of additional effort that could have been spent on other areas.
For example, I was aware that others had successfully integrated complementary open source technologies like Lucene and Spark to enhance Cassandra for search and analytics applications. I figured it would just be a matter of finding the time and resources to duplicate these integrations for my own projects. But as you may have experienced yourself, it’s often hard to justify spending IT development hours on infrastructure when your business is asking for new application features. Plus, it’s tricky to ensure consistent security approaches across all of your applications and infrastructure.
When I started exploring DataStax Enterprise, I quickly realized that the additional functionality I had been looking for was already present in DSE, along with other features I didn’t even realize I needed yet. At the same time, the majority of the development and operations pain points my teammates and I had experienced were addressed.
You may find yourself in a similar situation if you’re already familiar with Apache Cassandra but are just getting started with DataStax Enterprise (DSE). If your organization has adopted DSE and you’re merely thinking of it as a productized version of Cassandra, there’s a lot that you could be missing out on.
So, from one Apache Cassandra user to another, I’d like to share the top ten things you should know about DataStax Enterprise. Unlike Letterman’s Top Ten lists, which were funny because they were so random, I hope you’ll find these points to be relevant to whatever application domain you find yourself working in.
#10. Certified Apache Cassandra releases give you bug fixes with confidence
I once observed a dev team that identified a Cassandra bug that was impeding their progress. Unfortunately, the infrastructure team in that organization wasn’t ready to move their officially supported version forward to the fixed version. (You can debate the merits of DevOps culture and allowing teams to pick their own infrastructure, but the situation I describe is the reality in many organizations.) The team backported the fix themselves to produce their own patched build of the “officially supported” release. This got the team unstuck, but it also led to them taking ownership of their own cluster. Ironically, once the infrastructure team had caught up, the dev team resisted upgrading because they had come to trust their own patched version.
You may have heard the DataStax statement that we provide the best distribution of Apache Cassandra. One of the things this means is that we provide production-certified releases: we take the latest bug fixes and test them at scale (1,000 nodes) before rolling them out in a DSE release. Being a DataStax customer also means you get expert customer support, including backporting of bug fixes, hot fixes and bug escalation. So if you run into an issue, make sure to ask for help.
#9. Ongoing performance innovation
One of the other things that it means to have the “best distribution of Cassandra” is that DataStax is continually innovating on performance. While we have all developed a healthy skepticism of database benchmarks, it’s fair to say that there is ongoing competition between database vendors on performance, and we’re not standing still. A couple of examples from the most recent DSE release (5.1):
- DSE Core features a 2x improvement in compaction throughput vs. Apache Cassandra. This reduces the amount of background processing that nodes perform to clean up and declutter SSTables, which improves the performance of your reads and writes.
- DSE Analytics features 3x better performance on operational analytics queries vs. open source Cassandra + Spark.
We continue to innovate on performance with more exciting developments to come in future releases. If you’re like me, when you see a new release of a key piece of your infrastructure, you want to test how it compares to the previous version, especially at higher scale. If you haven’t got the capability to test how your applications perform at scale, it’s usually worth the investment to prove for yourself how DSE performance improvements impact your overall application performance from release to release.
#8. Flexible deployment options for multi-datacenter and multi-workload
A key feature of Apache Cassandra that differentiates it from many of the other databases and public cloud managed services is how Cassandra seamlessly synchronizes data across multiple datacenters, whether private, public, or a mix of both. With a hybrid deployment, you can span a single cluster across on-premises and cloud.
For cloud applications, having more flexibility in network and computing choices allows you to manage tradeoffs between cost and performance. DataStax provides multiple features that allow you to adapt to the unique infrastructure requirements of your application. As I’ve learned, this is particularly important when you use public cloud vendors and your development organization is held to account for the growth in your monthly bill.
Multi-workload clusters: Cassandra’s flexible topology of “datacenters” and “racks” allows you to create clusters supporting both operational and analytics workloads. One common deployment pattern is to deploy one datacenter with DSE Core and DSE Search to service operational queries and establish a separate datacenter with DSE Analytics and DSE Graph enabled for analytic queries. This architecture has the advantage of physically separating these workloads without requiring a separate ETL processing step.
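To make the multi-workload pattern a little more concrete, here is a minimal sketch of how a keyspace might be replicated to both datacenters using Cassandra's standard NetworkTopologyStrategy. The keyspace name, the datacenter names ("operational" and "analytics") and the replication factors are assumptions for illustration only; your datacenter names must match what your cluster actually reports.

```python
def keyspace_cql(name, replicas_per_dc):
    """Build a CREATE KEYSPACE statement that replicates to each named datacenter."""
    if not replicas_per_dc:
        raise ValueError("at least one datacenter is required")
    # NetworkTopologyStrategy takes one replication factor per datacenter.
    options = ["'class': 'NetworkTopologyStrategy'"]
    for dc, rf in replicas_per_dc.items():
        options.append(f"'{dc}': {int(rf)}")
    return (
        f"CREATE KEYSPACE IF NOT EXISTS {name} "
        f"WITH replication = {{{', '.join(options)}}};"
    )


# Hypothetical names: three replicas for operational queries, two for analytics.
statement = keyspace_cql("shop", {"operational": 3, "analytics": 2})
print(statement)
```

Because both datacenters hold replicas of the same keyspace, analytic queries can run against their own nodes while the cluster keeps the data in sync, which is why no separate ETL step is needed.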
Advanced replication: If your multi-datacenter deployment has issues with intermittent connectivity or fluctuating bandwidth between locations, have a look at DSE’s advanced replication feature. Instead of a standard multi-datacenter cluster configuration, advanced replication uses multiple edge clusters which are synchronized with a central hub cluster as network connectivity allows.
Multi-instance: DSE’s multi-instance feature allows you to run multiple nodes on a single machine to fully leverage the capabilities of large servers. This is useful for cases where you need to make use of existing hardware in on-premises datacenters, or in public cloud datacenters when using larger instance sizes is more cost effective.
Tiered storage: For applications where most reads are on recently written data and older data is accessed less frequently, DSE provides a tiered storage feature that maintains recently written data in a faster SSD tier and ages out older data, moving it to a more cost-effective storage option.
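The idea behind tiered storage can be illustrated with a small sketch. This is not how DSE implements or configures the feature; it is just a toy model of age-based placement, and the seven-day cutoff and tier names are assumptions I picked for the example.

```python
from datetime import datetime, timedelta

# Hypothetical policy: data newer than this stays on the fast tier.
MAX_AGE_ON_FAST_TIER = timedelta(days=7)


def tier_for(written_at, now):
    """Return the storage tier for data written at `written_at`, judged at `now`."""
    age = now - written_at
    return "ssd" if age <= MAX_AGE_ON_FAST_TIER else "cost_effective"


now = datetime(2017, 6, 1)
recent = tier_for(datetime(2017, 5, 30), now)  # two days old
older = tier_for(datetime(2017, 3, 1), now)    # about three months old
print(recent, older)
```

The tradeoff is the usual one: the shorter the window on the fast tier, the lower the storage cost, at the price of slower reads for anything that has aged out.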
#7. Advanced security features for your unique environment
We’ve all seen the recent press regarding high profile website breaches, including a few involving prominent NoSQL databases. Hopefully you’re as sobered as I am by these breaches. If you’re in a regulated industry, you already have security requirements, and if not, you still should be concerned about security because it’s the right thing to do for your customers.
DataStax Enterprise Unified Authentication: Apache Cassandra supports pluggable authentication and authorization, and offers encryption of data as it moves between clients and nodes, with encryption of data at rest in progress. Cassandra’s built-in authentication and authorization implementations allow you to manage users, roles and permissions within the Cassandra environment.
Unified Authentication builds on Cassandra’s pluggable security to allow you to integrate your existing authentication mechanisms such as Kerberos or LDAP in addition to the built-in Cassandra authenticator. Unified Authentication works across all elements of DataStax Enterprise, so you don’t have to worry about separate security integrations for core database, search, analytics and graph.
Leveraging multiple authentication mechanisms could be useful in situations where you are transitioning between providers, or if you have a mixture of identities representing both people and applications. For example, users representing individual microservices registered with the Cassandra built-in provider might have read/write access to their own tables (SELECT / INSERT / UPDATE / DELETE), while specific employees that are part of a database administrator LDAP group might be given schema permissions (CREATE / ALTER / DROP / TRUNCATE).
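As a rough sketch of the split described above, the snippet below generates CQL GRANT statements for two roles. The role names ("orders_service", "dba") and the keyspace name are hypothetical, and how an LDAP group gets associated with a role depends on your DSE configuration, which is not shown here. One detail worth knowing: in Cassandra's permission model, INSERT, UPDATE, DELETE and TRUNCATE are all covered by the single MODIFY permission, so a role with MODIFY on a table can also truncate it.

```python
# Standard Cassandra permission names, grouped the way the text describes.
DATA_PERMISSIONS = ("SELECT", "MODIFY")           # read/write rows
SCHEMA_PERMISSIONS = ("CREATE", "ALTER", "DROP")  # change schema


def grants(role, permissions, resource):
    """Return one GRANT statement per permission for the given role and resource."""
    return [f"GRANT {p} ON {resource} TO {role};" for p in permissions]


# Hypothetical microservice role limited to its own keyspace.
service_grants = grants("orders_service", DATA_PERMISSIONS, "KEYSPACE orders")
# Hypothetical administrator role with schema permissions everywhere.
dba_grants = grants("dba", SCHEMA_PERMISSIONS, "ALL KEYSPACES")

for statement in service_grants + dba_grants:
    print(statement)
```

Generating the statements from one small table of roles makes it easier to review who can do what before anything is applied to a cluster.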
Row-Level Access Control (RLAC): If your particular application is a SaaS targeting small to medium size customers, you’ll eventually run into resource allocation challenges as you add customers, because deploying a dedicated DSE cluster per customer isn’t cost effective.
Having personally done some investigation into what would be required to implement multi-tenancy on top of Cassandra, I’m very excited about the Row-Level Access Control feature added in DSE 5.1, which you can use to restrict access to individual rows by the value of a specified primary key column.
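To show why this matters for multi-tenancy, here is a toy model of the concept in plain Python. It is not the DSE syntax or implementation; the table shape, the "tenant_id" column and the role names are all made up for illustration. The point is simply that access is decided by the value of one primary key column.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    tenant_id: str  # hypothetical primary key column used for filtering
    order_id: int
    total: float


# Hypothetical grants: role -> tenant_id values that role may read.
ROW_GRANTS = {
    "acme_app": {"acme"},
    "globex_app": {"globex"},
}


def visible_rows(role, rows):
    """Return only the rows whose tenant_id the role has been granted."""
    allowed = ROW_GRANTS.get(role, set())
    return [r for r in rows if r.tenant_id in allowed]


rows = [Row("acme", 1, 9.5), Row("globex", 2, 20.0), Row("acme", 3, 4.25)]
acme_view = visible_rows("acme_app", rows)
print([r.order_id for r in acme_view])
```

With the database enforcing this kind of filter, tenants can share tables in one cluster without each application having to remember to add its own tenant check to every query.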
#6. Reduce your operational burden with DataStax OpsCenter
One of the criticisms I hear from time to time about Cassandra is that it can be difficult to operate. There is some merit to the idea that maintaining Cassandra requires teams to take the time to learn how to do it effectively. I have seen teams struggle to come up to speed on the variety of options available in the cassandra.yaml configuration file. Manual administration of nodes can be a chore, but rolling your own automation for operations such as repair, upgrades or adding and removing nodes can be counterproductive without a good understanding of how Cassandra works.
The Cassandra community has been working hard to simplify operations with features in the Apache distribution and utilities such as cassandra-reaper. These are great tools to have, but I recommend DataStax OpsCenter as a comprehensive solution that takes your operational maturity to the next level and effectively counters the criticism I referenced above.
OpsCenter is a browser-based, visual management and monitoring solution for DSE clusters. The visualization changes the game so that you are thinking in terms of managing clusters instead of individual nodes. Common tasks like adding and removing nodes, software updates, backups, and repairs are automated according to best practices. This allows your operations teams to focus on overall application availability and performance, instead of devoting disproportionate attention to the data platform.
Well, that's quite a lot for one blog post. You didn't think I was going to reveal the top five already, did you? Don't worry, the top things to know are right around the corner. And yes, we'll be talking about things like DSE Graph, DataStax Studio, and DataStax Managed Cloud.