FAQ

Here are answers to the most commonly asked questions on DataStax Enterprise (DSE).

If you have any questions about DataStax product licensing, or related matters, please refer to our separate Licensing FAQs here.

DataStax Enterprise

DataStax Enterprise Graph

Apache Cassandra™


DataStax Enterprise

What is DataStax Enterprise?

DataStax Enterprise (DSE) is DataStax's secure, operationally simple data platform built on Apache Cassandra™. It helps enterprises, government agencies, and systems integrators power the growing number of cloud applications that require data distribution across datacenters and clouds.

What’s new in DataStax Enterprise 5.0?

For details on release enhancements and features, please reference the press release.

How can I run analytics on Apache Cassandra™ data in DSE?

DSE provides two options for running analytics:

  • Stream and run near real-time analytics using DSE Analytics with Apache Spark.
  • Link together data from an external Hadoop installation (e.g. Cloudera or Hortonworks) with data stored in DSE via Spark.

What benefits do I get by running Spark Analytics within DataStax Enterprise?

DataStax Enterprise (DSE) Analytics provides a much easier deployment and integration experience with Spark. Workload isolation capabilities have been extended to Spark analytics, so that real-time, search, and analytic workloads/nodes do not compete for data or compute resources. Everything is managed automatically by DSE without user intervention. It also comes with built-in security and high availability for the Spark Master.

Another great benefit of DataStax Enterprise is that it eliminates the need for complex extract-transform-load (ETL) operations normally needed to move data from real-time systems to analytic databases or data warehouses. Instead, data is transparently and automatically replicated among real-time and analytic nodes; no developer or administrator work is necessary.

In-memory Spark analytics can be combined with DSE’s in-memory OLTP option, providing a full in-memory solution for OLTP + analytics.

We also test and certify Spark + Apache Cassandra™ versions (DSE only) for up to 1,000 nodes in a cluster.

Lastly, having one integrated database for transactional work, analytics, and enterprise search makes for a much more productive environment for operations personnel and an easier development experience for developers.

Which use cases benefit from DSE Analytics?

Time-sensitive applications such as click prediction, spam filters, sensor data processing and fraud detection benefit from DSE Analytics.

Does DataStax have a formal partnership with Databricks?

Yes. Please refer to this press release for details.

Will I get support for DSE-Spark integration?

Yes, with a DSE Max subscription, you will receive 24x7x365 expert support.

Can we connect DSE Analytics to BI tools like Tableau?

DSE 4.7 and higher ships with ODBC and JDBC drivers for SparkSQL. These drivers can be used to connect to BI tools like Tableau.

What specific version of Spark is supported in DataStax Enterprise Analytics?

Please check the release notes for the DSE version you are deploying.

Is DataStax Enterprise a Hadoop platform?

No, DSE is focused on providing support for online / transactional applications whereas Hadoop targets data warehouse / lake use cases.

Does DataStax have formal partnerships with Cloudera and Hortonworks?

Yes. DSE is formally certified on both vendors’ platforms.

What external Hadoop platforms and versions are supported?

Please check the release notes for the DSE version you are deploying.

How does DSE’s built-in analytics work?

When creating a new cluster or modifying an existing one, an administrator specifies that nodes in a certain datacenter are devoted to running analytic workloads. When these nodes are started, the Spark management processes start automatically, and data can then be analyzed with Spark, including batch jobs in Scala or Python, streaming jobs, or SQL. DSE provides fault tolerance for the Spark management processes so they remain continuously available to accept new jobs and manage running jobs. An administrator can also start additional Spark-based services, such as the Spark SQL Thriftserver, for programmatic access from JDBC and ODBC clients and tools.

Does DSE integrate with third party analytics vendors?

Yes, DSE integrates with third party analytics vendors (e.g. Cloudera, Hortonworks). For more information on how the integration works, please refer here.

Does the security feature set of DSE work with external Spark and Hadoop deployments?

Yes, components like Kerberos can be used in DSE to provide security for Spark and Hadoop components.

How does DataStax Enterprise provide support for enterprise search operations?

DataStax Enterprise Search extends Apache Solr, the most popular open source search software, with new, unique features such as Live Indexing to provide more robust, enterprise level search capabilities.

What benefits do I get with DSE Search?

There are five key benefits of DSE Search:

  1. Continuous Availability – No single point of failure, thanks to the masterless architecture that DSE provides, plus automatic and transparent redundancy for all search components and operations.
  2. Full Data Durability for Incoming Search Data – Unlike open source Solr, which can lose data if a node goes down before new data is flushed to disk, DataStax Enterprise guarantees that no data is ever lost through the use of Cassandra’s write ahead log.
  3. Scalable Design for Write Operations – DataStax Enterprise allows writing to all search nodes and automatic sharding – even across multiple data centers – and ensures everything stays in sync.
  4. Live Indexing – Incoming data is almost instantly available for search tasks through in-memory structures.
  5. Transparent and Automatic Data Replication – The complex ETL operations normally required to move data from real-time systems to search databases are eliminated. Instead, data is transparently and automatically replicated among real-time and search nodes, so no developer or administrator work is necessary.

How does DataStax Enterprise handle both OLTP and search data in the same database?

DataStax Enterprise uses Cassandra’s replication to replicate data between nodes designated for real-time data and nodes specified for search operations. Any node may be written to, with changes being propagated across all nodes. All nodes may also be read. Such a configuration eliminates write bottlenecks and read/write hotspots.

Isn’t it a bad idea to have both OLTP and search tasks running in the same database?

Not with DataStax Enterprise. DataStax Enterprise uses smart workload isolation so that OLTP and search nodes do not compete for either the underlying data or compute resources. All search tasks execute on nodes marked out for enterprise search and all OLTP, online operations take place on nodes designated for real-time data tasks.

Can I access data in Solr/search nodes with CQL?

Yes. DataStax Enterprise extends Cassandra’s CQL to include Solr syntax. See the online documentation for more detail about how to construct Solr CQL queries.
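
As a minimal sketch (the keyspace, table, and column names below are hypothetical, and the table is assumed to have a Solr core created on it), a Solr query is embedded in CQL through the solr_query pseudo-column that DSE Search adds to search-enabled tables:

  SELECT id, title, body
  FROM demo.articles
  WHERE solr_query = 'title:cassandra AND body:replication'
  LIMIT 10;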

Does DataStax Enterprise offer any type of workload management reprovisioning?

Yes. OLTP (Apache Cassandra™), search and analytic nodes can be easily reprovisioned by stopping/starting nodes in a different mode. This allows you to easily adjust the performance and capacity for various workloads.

What type of security does DataStax Enterprise offer?

DataStax Enterprise provides the following built-in security features: (1) internally-managed authentication (login IDs and passwords are managed within Apache Cassandra™); (2) external authentication option that supports Kerberos, LDAP, and Active Directory; (3) internal authorization / object permission management via GRANT/REVOKE; (4) client-to-node and node-to-node encryption via SSL; (5) transparent data encryption at the table / column family level with both on and off-server key management; (6) encryption of external database files like the Apache Cassandra™ commitlog and DSE Search indexes; and (7) data auditing. None of these security features is enabled by default.
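
For example, once an authenticator and authorizer are enabled, internal authentication and object permissions are managed with standard CQL statements (the role and keyspace names below are hypothetical):

  CREATE ROLE analyst WITH PASSWORD = 'ChangeMe123!' AND LOGIN = true;
  GRANT SELECT ON KEYSPACE sales TO analyst;
  REVOKE MODIFY ON KEYSPACE sales FROM analyst;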

What benefits do I get by running in-memory in DataStax Enterprise?

DSE In-Memory allows users to create memory-only database objects with faster performance than standard disk- or SSD-based objects. Typical scenarios that benefit from DSE's in-memory option are primarily read-only workloads with slowly changing data and/or semi-static datasets.

Data protection and durability ensure that no data is lost due to power failure or other problems. From a developer perspective, there is no difference in working with an in-memory database object versus a traditional disk-based table.

Can in-memory and disk tables exist together in DataStax Enterprise?

Yes. With DataStax Enterprise, in-memory and standard disk tables can be created easily to co-exist inside the same keyspace. Any object can be switched from in-memory to disk and vice-versa.
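
As a sketch of how this looks in CQL (the keyspace and table are hypothetical; the in-memory option is assumed to be selected through the MemoryOnlyStrategy compaction class, per the DSE documentation of this era), a table can be created in memory and later switched back to standard disk storage:

  CREATE TABLE demo.rates (
      symbol text PRIMARY KEY,
      rate   double
  ) WITH compaction = { 'class': 'MemoryOnlyStrategy' };

  -- switch the same table back to standard on-disk storage
  ALTER TABLE demo.rates
    WITH compaction = { 'class': 'SizeTieredCompactionStrategy' };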

Are there any limitations and recommendations on in-memory tables?

DataStax recommends that in-memory objects consume no more than 60% of free memory on a database server.

How can I move data from an RDBMS to DataStax Enterprise?

The most efficient way to move data from any RDBMS to DSE is to leverage Spark. Spark uses the RDBMS's JDBC driver to create a DataFrame (see here). Once the data is in a Spark DataFrame, you can use the Spark Cassandra Connector to save the DataFrame to Apache Cassandra™.

DataStax Enterprise also includes Sqoop to move data from any RDBMS with a JDBC driver (e.g., Oracle, MySQL) to the DataStax Enterprise server. One or more tables are simply mapped to new Apache Cassandra™ column families and the Sqoop interface takes care of the rest.

You can also use third-party tools such as Pentaho’s Kettle, which has full ETL capabilities and is free to download and use. Lastly, you can unload RDBMS data into delimited (e.g. comma, tab, etc.) flat files and load them into Apache Cassandra™ with the COPY load utility.

Can I move application log data to DataStax Enterprise?

Yes. Using log4j, application log data can be moved easily into the DataStax Enterprise server and then indexed and searched via the Solr support in the server.

What is DSE Management Services?

DSE Management Services transparently automate many operational tasks for a database cluster. Current services include the Repair Service, Performance Service, Capacity Service, Best Practice Service, Backup Service, and Deploy Service. For details, see the online DSE documentation.    

What is DSE Advanced Replication?

Certain applications, especially those in the retail and energy markets, need specialized forms of data distribution that rely on a hub-and-spoke topology (also referred to as an “edge of the internet” model). These systems have to constantly update central data collection sites with information collected and stored at numerous locations around the world. While Apache Cassandra™ sets the standard for modern data replication and distribution, it falls short of easily supporting this type of design.

DSE Advanced Replication builds on Cassandra's replication by providing multi-cluster replication from numerous endpoints to a centralized location used for data aggregation and analysis, thus supporting hub-and-spoke data distribution use cases. DSE Advanced Replication also includes the ability to prioritize which data must be sent first, as well as to store data for later forwarding in the event of a network disruption between the edge and the hub.

What is DSE Tiered Storage?

DSE Tiered Storage transparently shifts older, infrequently accessed data from high-performance SSDs to slower, more economical HDDs based on your criteria, and does so in a performant and efficient manner. DSE Tiered Storage solves the challenge of smartly utilizing the right storage for the right data “temperature”. For data that doesn't require high-speed access (e.g. data that ages and is no longer ‘hot’), a recurring move of that data to less expensive HDDs can help reduce overall hardware spend.

What is DSE Multi-Instance?

DSE Multi-Instance lets you run multiple DSE instances (database processes) on a single physical host without the need for a virtualization or container layer. This allows each instance to consume a share of a large host's physical resources, thereby increasing system utilization and, by extension, data center efficiency. Furthermore, DSE Multi-Instance maintains replica placement safety so that a catastrophic failure on one physical host won't impact more than one replica of any given partition.

What are the benefits of DataStax Enterprise supporting JSON?

JSON/Document model support in DSE improves ease-of-use when working with other technologies that natively support JSON. In particular, developers will have to write a lot less code to translate to and from JSON when interacting with their database. Expect a significant boost to developer productivity as a result.

There are several resources with additional information on DSE support for JSON; a good starting point is the blog post that initially introduced the feature in Apache Cassandra™ 2.2.

When should I use JSON in DSE?

DSE is often used as the state store for web applications written with Javascript UI frameworks that are designed to exchange JSON between the browser and the server. With DSE support for JSON, the server-side code required to do that is dramatically simplified. Prior to JSON support, the application developer would have to convert CQL ResultSets to JSON before sending query results back to the client. Conversely, incoming data would have to be transformed from JSON into CQL statements. None of that code is required anymore. The CQL statement itself can output JSON and can accept JSON as input.
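
A minimal sketch of this syntax (the table and columns are hypothetical):

  INSERT INTO demo.users JSON
    '{ "id": 42, "name": "Sam", "email": "sam@example.com" }';

  SELECT JSON id, name, email FROM demo.users WHERE id = 42;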

Consider a web application constructed with a DSE database backend, and an AngularJS/ExpressJS application tier that connects to the database cluster via the DataStax Node.js driver. The entire application tier is optimized to work with JSON as a serialized format. With DSE, the DataStax Node.js driver does not have to format the contents of the result object as JSON in the callback. Instead, the CQL query specifies that the values returned from the query should be formatted in JSON. These can be passed directly to the application tier.

How is JSON support in DSE different from JSON support in other NoSQL databases like MongoDB?

The major difference between DSE and other databases that support JSON (MongoDB in particular) is that DSE requires JSON to adhere to a database schema. In MongoDB, JSON is used for storage of schemaless “document” records: one record might vary significantly from the next, and there is no expectation that any record structure is enforced by the database. DSE, however, retains the requirement that the CQL table is defined with a schema that enforces the columns and column types in each row. Nothing else has changed architecturally, so DSE and MongoDB remain very different databases with very different capabilities.

Can I store Document style records in DSE?

Yes. CQL supports collection types (maps, sets & lists) and user-defined types that all map into JSON list and map types. This allows DSE to store records that have similar structural complexity to document-style records without being completely free-form. Sparse storage allocation in the underlying storage engine means that column values can be left null with no penalty, further adding to the flexible structure of the record. It is possible to define columns that might only be used with a handful of rows in the table which matches the ability for a JSON document to leave values out of a record. It is also possible to add columns to a table at any time without a performance penalty. So, if the requirement is to deal with an incoming data feed that might change as time progresses, this can be handled with a no-cost change to the table schema. In this respect, DSE has many of the Document storage capabilities of pure Document databases save for the ability to handle truly schemaless data without prior knowledge of the structure of incoming records.
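
For example (the type, table, and column names below are illustrative), a user-defined type combined with collection columns gives a row document-like nesting while still being governed by a schema:

  CREATE TYPE demo.address (
      street text,
      city   text,
      zip    text
  );

  CREATE TABLE demo.customers (
      id        uuid PRIMARY KEY,
      name      text,
      emails    set<text>,
      phones    map<text, text>,
      addresses list<frozen<address>>
  );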

Does support for JSON mean that DSE is a Document Database?

DSE has the flexibility to store data with complex nested schemas and can easily move data to and from application web tiers, which is a primary use case of Document-oriented databases. The major difference between DSE’s support for the JSON/Document data model and other NoSQL document databases is that DSE requires a formal schema.  

What data models are supported in DataStax Enterprise?

DSE natively supports the key-value, tabular, JSON/Document, and graph data models. SQL/RDBMS data model support is handled via a partner.

DataStax Enterprise Graph

What is DataStax Enterprise Graph?

DataStax Enterprise Graph is the first graph database fast enough to power customer-facing applications, capable of scaling to massive datasets, and integrated with advanced tools that power deep analytical queries. Because DataStax Enterprise (DSE) is built on the core architecture of Apache Cassandra™, DSE Graph can scale to billions of objects, spanning hundreds of machines across multiple datacenters with no single point of failure. DSE Graph is built on proven open source technologies including Apache TinkerPop™, Apache Cassandra™, Apache Spark™, and others.

What is TinkerPop?

Apache TinkerPop™ is an open source graph computing framework that enables database and data analytic systems to offer graph computing capabilities to their users. The Gremlin graph traversal language is the primary means by which users interface with graph databases that use TinkerPop.

Along with the Gremlin language and virtual machine, TinkerPop provides various supporting tools such as Gremlin Server, bulk data loaders, graph algorithms, visualization tool connectors, and more. Being Apache TinkerPop-enabled, DataStax Enterprise Graph can append sophisticated, standardized graph computing features to its core foundation and avoids proprietary vendor lock-in.

What is Titan?

DataStax Enterprise Graph is inspired by the open source Titan graph database, which is used by many well-known enterprises such as Amazon, IBM, and others. Titan is a scale-out graph database that has a pluggable storage back end option that allows it to persist data to a variety of databases including Apache Cassandra™, HBase, and others.

While DataStax Enterprise Graph uses Titan as a model, it is a completely different set of software that goes much further than Titan’s basic scale-out capabilities by both deeply integrating with Apache Cassandra™ and including additional commercial software functionality.

Can existing Titan users migrate to DataStax Enterprise Graph?

Yes. Because of its reliance on TinkerPop, DataStax Enterprise Graph is compatible with later versions of Titan, which means that existing Titan users can migrate their application code after migrating their data to DataStax Enterprise Graph.

What business problems does graph solve?

DataStax Enterprise Graph is a graph database built for Cloud Applications that need to manage, analyze, and search highly connected data. A graph database like DataStax Enterprise Graph will almost always be better than a relational database management system (RDBMS) when it comes to identifying commonalities and anomalies in large, complex and highly connected datasets. While DataStax Enterprise Graph can be used for a variety of application use cases, the following are the most common:

Master Data Management/Customer 360

A company must understand the data relationships across its multiple business units to create a holistic view of its customers or products. A graph model is the best way to consolidate the disparate data for use by both business intelligence (BI) tools and other business applications. Examples of this use case include using graph to understand the various ways a customer interacts with your company, what types of accounts they have, what services they are using, and the various identities they use across your separate properties, both virtual and physical. Other examples include product catalogs and product lifecycle management (PLM), which often have complex hierarchical structures overlaid by taxonomies to capture the composition of relationships. Understanding these relationships, being able to immediately grasp the impact of a change in a supplier relationship, or being able to quantify the impact of a recall are all capabilities made easy through the use of graph technology.

Recommendation & Personalization

Almost all enterprises need to understand how they can quickly and most effectively influence customers to buy their products and recommend them to others using components in a cloud application such as recommendation, personalization, and network (people or machines) analysis engines. A graph is well suited to these and similar analytical use cases where recommending products, next actions, or advertising based on a user’s data, past behavior, and interactions are important.

Security & Fraud Detection

In a complex and highly interrelated network of users, entities, transactions, events, and interactions, a graph database can help determine which entity, transaction, or interaction is fraudulent, poses a security risk, or is a compliance concern. In short, a graph database assists in finding the bad needle in a haystack of relationships and events that involve countless financial interactions.

Internet of Things (IoT)

This use case most commonly involves devices or machines that generate time-series information such as event and status data. A graph works well here because the streams from individual points create a high degree of complexity when blended together. Further, the analytics required in tasks such as root-cause analysis involve numerous relationships that form among the data elements, and these relationships tend to be of much greater interest when examined collectively than when reviewed in isolation.

How does DataStax Enterprise Graph differ from a relational database?

A relational database management system (RDBMS) and a graph database are similar in that both involve data that contains connections or relationships between data elements. From a data model perspective, their components have the following surface-level similarities:

  • An identifiable “something” or object to keep track of: Entity (RDBMS), Vertex (graph database)
  • A connection or reference between two objects: Relationship (RDBMS), Edge (graph database)
  • A characteristic of an object: Attribute (RDBMS), Property (graph database)

A key difference between a graph database and an RDBMS is how relationships between entities/vertexes are prioritized and managed. While an RDBMS uses mechanisms like foreign keys to connect entities in a secondary fashion, edges (the relationships) in a graph database are of first order importance.

In other words, relationships are explicitly embedded in a graph data model. Essentially, a graph-shaped business problem is one in which the concern is more with the relationships (edges) among entities (vertexes) than with the entities in isolation.

When should DataStax Enterprise Graph be used instead of a Relational Database?

Review the table below to determine whether to use an RDBMS or a graph database like DataStax Enterprise Graph.

  • RDBMS: simple to moderate data complexity. DSE Graph: heavy data complexity.
  • RDBMS: hundreds of potential relationships. DSE Graph: hundreds of thousands to millions or billions of potential relationships.
  • RDBMS: moderate JOIN operations with good performance. DSE Graph: heavy to extreme JOIN operations required.
  • RDBMS: infrequent to no data model changes. DSE Graph: constantly changing and evolving data model.
  • RDBMS: static to semi-static data. DSE Graph: dynamic and constantly changing data.
  • RDBMS: primarily structured data. DSE Graph: structured and unstructured data.
  • RDBMS: nested or complex transactions. DSE Graph: simple transactions.
  • RDBMS: always strongly consistent. DSE Graph: tunable consistency (eventual to strong).
  • RDBMS: moderate incoming data velocity. DSE Graph: high incoming data velocity (e.g. sensors).
  • RDBMS: high availability (handled with failover). DSE Graph: continuous availability (no downtime).
  • RDBMS: centralized, location-dependent application (e.g. a single location), especially for write operations. DSE Graph: distributed, location-independent application (multiple locations involving multiple data centers and/or clouds) for both write and read operations.
  • RDBMS: scale up for increased performance. DSE Graph: scale out for increased performance.

How does DataStax Enterprise Graph differ from other NoSQL databases?

The primary difference between a graph data model and those used by other NoSQL databases is that a graph model is purpose-built to handle high data complexity and connectedness, whereas other NoSQL models are designed to manage data that is simpler in its format and relationships.

When should I use the DataStax Enterprise Graph data model over Cassandra’s tabular model?  

Review the table below to determine when Cassandra's tabular data model should be used vs. the graph data model found in DataStax Enterprise Graph.

  • Apache Cassandra™ tabular model: little to no value in data object relationships. DSE Graph model: great value in data object relationships.
  • Apache Cassandra™ tabular model: manual data denormalization is easy. DSE Graph model: manual data denormalization is too complex.
  • Apache Cassandra™ tabular model: data is rarely joined together, and when joins occur (e.g. with Spark), performance is acceptable. DSE Graph model: data is constantly connected and combined to produce the end result in a performant manner.
  • Apache Cassandra™ tabular model: write/read heavy. DSE Graph model: read heavy, write moderate.

How does DataStax Enterprise Graph work with Apache Cassandra™?

DataStax Enterprise Graph utilizes an enterprise-certified version of Apache Cassandra™ for its persistent datastore. Because of its deep integration with Apache Cassandra™, DataStax Enterprise Graph inherits all of Cassandra's key benefits, including continuous availability, geographic distribution, linear scalability, and low latency.

To that foundation, DataStax Enterprise Graph adds other performance-enhancing capabilities that include an adaptive query optimizer, locality-driven graph data partitioner, distributed query execution engine, and various graph-specific index structures.

Can DataStax Enterprise Graph run analytic operations on graph data?

Yes. DataStax Enterprise Graph is integrated with DataStax Enterprise Analytics, which allows analytical tasks to run on data stored in a graph. The recommended deployment model for running both online transaction processing (OLTP) and online analytical processing (OLAP) graph workloads in the same DataStax Enterprise (DSE) cluster is the same as for Apache Cassandra™ OLTP and OLAP: separate the workloads into different datacenters (which can be physical datacenters or logical datacenters whose nodes are in the same location). The data is automatically replicated between them. Note: the same graph schema is used in both.

Can DataStax Enterprise Graph run search operations on graph data?

Yes, DataStax Enterprise Graph integrates with DataStax Enterprise Search to handle search tasks on graph data. DataStax Enterprise Graph maintains the Solr core definitions through the graph schema automatically and registers those on the respective Apache Cassandra™ tables. DataStax Enterprise Search then indexes graph data with a special graph adapter directly against the Apache Cassandra™ tables.

If the query optimizer for DataStax Enterprise Graph determines that a particular query can be best answered by an existing search index, it creates a Lucene query corresponding to the Gremlin query fragment and sends it to DataStax Enterprise Search. Graph then extracts the vertex IDs from the result set and does further query processing on them.

Can data stored in DataStax Enterprise Graph be secured?

Yes. The same advanced security features available for Apache Cassandra™ data in DataStax Enterprise are available to DataStax Enterprise Graph. See the online documentation for more details.

How should a cluster be sized and configured for DataStax Enterprise Graph?

The standard guidelines for configuring a DataStax Enterprise (DSE) cluster apply to DataStax Enterprise Graph. DSE Graph can be run with or without DSE Analytics and DSE Search. Nodes in a cluster using DSE Graph without analytics or search can be deployed with the same set of recommendations as a standard DSE Apache Cassandra™ node, while nodes using DSE Graph with analytics or search should be configured with the parameters specified for DSE Analytics and DSE Search nodes.

How is DataStax Enterprise Graph sold?

DataStax Enterprise Graph is sold as an option to either DataStax Enterprise Standard or Max subscriptions. DataStax Enterprise Graph is not available for the Basic subscription offering.

Apache Cassandra™

What is Apache Cassandra™?

Apache Cassandra™, an Apache Software Foundation project, is a massively scalable NoSQL database. Apache Cassandra™ is designed to handle big data workloads across multiple data centers with no single point of failure, providing enterprises with extremely high database performance and availability.

What are the benefits of using Apache Cassandra™?

There are many technical benefits that come from using Apache Cassandra™. See our white papers for more detail.

How do I install Apache Cassandra™?

Downloading and installing Apache Cassandra™ is very easy. Downloads of Apache Cassandra™ are available via the DataStax website at: http://www.datastax.com/download.

For installation guidance, please see our online documentation.

How do I start/stop Apache Cassandra™ on a machine?

Starting Apache Cassandra™ involves connecting to the machine where it is installed with the proper security credentials, and invoking the Apache Cassandra™ executable from the installation’s binary directory.

How do I log into Cassandra?

The basic interface for logging in to Apache Cassandra™ is the cqlsh utility, which uses CQL (the Apache Cassandra™ Query Language).

The cqlsh utility (found in the installation directory's bin subdirectory) connects by default to a local Apache Cassandra™ instance. Older, Thrift-based versions such as the one shown below connect on port 9160; newer versions connect on the native protocol port 9042.


  $ ./cqlsh
  Connected to Test Cluster at localhost:9160.
  [cqlsh 2.0.0 | Cassandra 1.0.7 | CQL spec 2.0.0 | Thrift protocol 19.20.0]
  Use HELP for help.
  cqlsh>

When is Apache Cassandra™ required for an application?

Cassandra is perfect for cloud applications, and can be used in many different data management situations. Some of the most common use cases for Apache Cassandra™ include:

  • Time series data management
  • High-velocity device data ingestion and analysis (e.g. IoT)
  • Telecommunications and messaging
  • Media streaming (e.g. music, movies)
  • Social media input and analysis
  • Online web retail (e.g. shopping carts, user transactions)
  • Web log management / analysis
  • Web click-stream analysis
  • Real-time data analytics
  • Online gaming (e.g. real-time messaging)
  • Write-intensive transaction systems
  • Buyer event analytics
  • Risk analysis and management

When should I not use Apache Cassandra™?

Cassandra is typically not the choice for transactional data that needs per-transaction commit/rollback capabilities. Note that Apache Cassandra™ does have atomic transactional abilities on a per row/insert basis (but with no rollback capabilities).

How does Apache Cassandra™ differ from Hadoop?

The primary difference between Apache Cassandra™ and Hadoop is that Apache Cassandra™ targets real-time/operational data, while Hadoop has been designed for data warehouse and data lake use cases.

There are also many technical differences between Apache Cassandra™ and Hadoop, including Cassandra's underlying data structure (based on Google's BigTable); its fault-tolerant, peer-to-peer architecture; multi-datacenter capabilities; tunable data consistency; and the fact that all nodes in Apache Cassandra™ are the same (e.g., no concept of a namenode).

How does Apache Cassandra™ differ from HBase?

HBase is an open source, column-oriented datastore modeled after Google BigTable, and is designed to offer BigTable-like capabilities on top of data stored in Hadoop. However, while HBase shares the BigTable design with Apache Cassandra™, its foundational architecture is very different and targets different use cases – Apache Cassandra™ is aimed at operational database management tasks and HBase at data warehouse/lake scenarios.

An Apache Cassandra™ cluster is much easier to set up and configure than a comparable HBase cluster. HBase's reliance on the Hadoop namenode means there is a single point of failure in HBase, whereas with Apache Cassandra™, because all nodes are the same, there is no such issue.

How does Apache Cassandra™ differ from MongoDB?

MongoDB is a document-oriented database that is built upon a master-slave/sharding architecture. MongoDB is designed to store/manage collections of JSON-styled documents.

By contrast, Apache Cassandra™ uses a masterless, write/read-anywhere styled architecture that is based on a combination of Google BigTable and Amazon Dynamo. This allows Cassandra to avoid the various complications and pitfalls of master/slave and sharding architectures. Moreover, Apache Cassandra™ offers linear performance increases as new nodes are added to a cluster, scales to terabyte-petabyte data volumes, and has no single point of failure.

Note: DataStax Enterprise supports the JSON/Document data model and persists all JSON data to Cassandra, thereby supplying a more scalable and fault tolerant data platform than MongoDB.  

What is DataStax Distribution of Apache Cassandra™?

DataStax Distribution of Apache Cassandra™ is a free software bundle from DataStax that combines Apache Cassandra™ with a number of developer and management tools provided by DataStax, designed to get someone up and productive with Cassandra in very little time. DataStax Distribution of Apache Cassandra™ is provided for open source enthusiasts and is not recommended for production use, as it is not formally supported by DataStax's production support staff.

How does Apache Cassandra™ protect me against downtime?

Cassandra has been built from the ground up to be a fault-tolerant, masterless database with no single point of failure, providing the strongest possible uptime and disaster recovery capabilities. Specifically, Apache Cassandra™:

  • Automatically replicates data between nodes to offer data redundancy
  • Offers built-in intelligence to replicate data between different physical server racks (so that if one rack goes down the data on other racks is safe)
  • Easily replicates between geographically dispersed data centers
  • Leverages any combination of cloud and on-premise resources

Does Apache Cassandra™ use a master/slave architecture or something else?

Apache Cassandra™ does not use a master/slave architecture, but instead uses a masterless implementation, which avoids the pitfalls, latency problems, single point of failure issues, and performance headaches associated with master/slave setups.

How do I replicate data across Apache Cassandra™ nodes?

Replication is the process of storing copies of data on multiple nodes to ensure reliability and fault tolerance. When you create a keyspace in Apache Cassandra™, you must decide the replica placement strategy: the number of replicas and how those replicas are distributed across nodes in the cluster. The replication strategy relies on the cluster-configured snitch (see FAQ “What is a snitch”) to help determine the physical location of nodes and their proximity to each other.

The total number of replicas across the cluster is often referred to as the replication factor. A replication factor of 1 means that there is only one copy of each row. A replication factor of 2 means two copies of each row. All replicas are equally important; there is no primary or master replica in terms of how read and write requests are handled.

Replication options are defined when you create a keyspace in Apache Cassandra™. The snitch is configured per node.
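
For example (the keyspace name is hypothetical), the replica placement strategy and replication factor are declared when the keyspace is created:

  CREATE KEYSPACE demo
    WITH replication = { 'class': 'SimpleStrategy', 'replication_factor': 3 };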

How is my data partitioned in Apache Cassandra™ across nodes in a cluster?

Apache Cassandra™ provides a number of options to partition your data across nodes in a cluster.

The Murmur3Partitioner (default) uniformly distributes data across the cluster based on MurmurHash hash values.

The RandomPartitioner was the default partitioning strategy prior to Apache Cassandra™ 1.2. It uses consistent hashing based on MD5 hashes to determine which node will store a particular row.

The ByteOrderedPartitioner ensures that row keys are stored in sorted order. It is not recommended for most use cases and can result in uneven distribution of data across a cluster.

What are ‘seed nodes’ in Apache Cassandra™?

A seed node in Apache Cassandra™ is a node that is contacted by other nodes when they first start up and join the cluster. A cluster can have multiple seed nodes. Apache Cassandra™ uses a protocol called gossip to discover location and state information about the other nodes participating in an Apache Cassandra™ cluster. When a node first starts, it contacts a seed node to bootstrap the gossip communication process. The seed node designation has no purpose other than bootstrapping new nodes joining the cluster. Seed nodes are not a single point of failure.

What is a “snitch”?

The snitch is a configurable component of an Apache Cassandra™ cluster used to define how the nodes are grouped together within the overall network topology (such as rack and data center groupings). Apache Cassandra™ uses this information to route inter-node requests as efficiently as possible within the confines of the replica placement strategy. The snitch does not affect requests between the client application and Apache Cassandra™ (it does not control which node a client connects to).

How do I add new nodes to a cluster?

Apache Cassandra™ is capable of offering linear performance benefits when new nodes are added to a cluster.

A new machine can be added to an existing cluster by installing the Apache Cassandra™ software on the server and configuring the new node so that it knows (1) the name of the Apache Cassandra™ cluster it is joining and (2) the seed node(s) it should contact to discover the other nodes in the cluster.

How do I remove nodes from an existing cluster?

Nodes can be removed from an Apache Cassandra™ cluster by using the nodetool utility and issuing a decommission command. This can be done without affecting the overall operations or uptime of the cluster.

What happens when a node fails in Apache Cassandra™?

Apache Cassandra™ uses gossip state information to locally determine if another node in the system is up or down. This failure detection information is used by Apache Cassandra™ to avoid routing client requests to unreachable nodes whenever possible.

The gossip inter-node communication process tracks “heartbeats” from other nodes both directly (nodes gossiping directly to it) and indirectly (nodes heard about secondhand, thirdhand, and so on). Rather than have a fixed threshold for marking nodes without a heartbeat as down, Apache Cassandra™ uses an accrual detection mechanism to calculate a per-node threshold that takes into account network conditions, workload, or other conditions that might affect the perceived heartbeat rate.

Node failures can result from various causes such as hardware failures, network outages, and so on. Node outages are often transient but can last for extended intervals. A node outage rarely signifies a permanent departure from the cluster, and therefore does not automatically result in permanent removal of the failed node from the cluster. Other nodes will still try to periodically initiate gossip contact with failed nodes to see if they are back up.

When a node comes back online after an outage, it may have missed writes for the replica data it maintains. Writes missed due to short, transient outages are saved for a period of time on other replicas and, using Cassandra's built-in hinted handoff feature, replayed on the failed host once it recovers. If a node is down for an extended period, an administrator can run the nodetool repair utility after the node is back online to ‘catch it up’ with its corresponding replicas.

To permanently change a node's membership in a cluster, administrators must explicitly remove the node from the Apache Cassandra™ cluster using the nodetool management utility.

How can I use the same Apache Cassandra™ cluster across multiple datacenters?

Apache Cassandra™ can easily replicate data between different physical datacenters by creating a keyspace that uses the NetworkTopologyStrategy replication strategy. This strategy allows you to configure Apache Cassandra™ to automatically replicate data to different data centers, and even to different racks within datacenters, to protect against specific rack or hardware failures causing a cluster to go down. It can also replicate data between public clouds and on-premise machines.
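
As a sketch (the keyspace and datacenter names are hypothetical; datacenter names must match those reported by the cluster's snitch), a keyspace can keep three replicas in one datacenter and two in another:

  CREATE KEYSPACE geo_demo
    WITH replication = {
      'class': 'NetworkTopologyStrategy',
      'DC1': 3,
      'DC2': 2
    };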

What configuration files does Apache Cassandra™ use?

The main Apache Cassandra™ configuration file is the cassandra.yaml file, which houses all the main options that control how Apache Cassandra™ operates.

How can I use Apache Cassandra™ in the cloud?

Cassandra’s architecture make it perfect for full cloud deployments as well as hybrid implementations that store some data in the cloud and other data on-premises.

DataStax provides an Amazon AMI that allows you to quickly deploy an Apache Cassandra™ cluster on EC2. See the online documentation for a step-by-step guide to installing an Apache Cassandra™ cluster on Amazon.

Do I need to use a caching layer (like memcached) with Apache Cassandra™?

Apache Cassandra™ negates the need for extra software caching layers like memcached through its distributed architecture, fast write throughput capabilities, and internal memory caching structures.

Why is Apache Cassandra™ so fast for write activity/data loads?

Apache Cassandra™ has been architected to consume large amounts of data as fast as possible. To accomplish this, Apache Cassandra™ first writes new data to a commit log to ensure it is safe. After that, the data is written to an in-memory structure called a memtable. Apache Cassandra™ deems the write successful once it is stored in both the commit log and a memtable, which provides the durability required for mission-critical systems.

Once a memtable's memory limit is reached, its contents are flushed to disk in the form of an SSTable (sorted strings table). An SSTable is immutable, meaning it is never written to again. If the data contained in an SSTable is modified, the new data is written to Apache Cassandra™ in an upsert fashion and the obsolete data is later removed during compaction.

Because SSTables are immutable and only written once the corresponding memtable is full, Apache Cassandra™ avoids random seeks and instead only performs sequential IO in large batches, resulting in high write throughput.

A related factor is that Apache Cassandra™ doesn’t have to do a read as part of a write (i.e. check index to see where current data is). This means that insert performance remains high as data size grows, while with b-tree based engines (e.g. MongoDB) it deteriorates.

How does Apache Cassandra™ communicate across nodes in a cluster?

Apache Cassandra™ is architected in a masterless fashion and uses a protocol called “gossip” to communicate with other nodes in a cluster. The gossip process runs every second to exchange information across the cluster.

Gossip only includes information about the cluster itself (e.g., up/down, joining, leaving, version, schema) and does not manage the data. Data is transferred node-to-node using a message-passing style protocol on a port distinct from the one client applications connect to. The Apache Cassandra™ partitioner turns a column family key into a token, the replication strategy picks the set of nodes responsible for that token (using information from the snitch), and Apache Cassandra™ sends messages to those replicas with the request (read or write).

How does Apache Cassandra™ detect that a node is down?

The gossip protocol is used to determine the state of all nodes in a cluster and if a particular node has gone down. The gossip process tracks heartbeats from other nodes and uses an accrual detection mechanism to calculate a per-node threshold that takes into account network conditions, workload, or other conditions that might affect perceived heartbeat rate before a node is actually marked as down.

The configuration parameter phi_convict_threshold in the cassandra.yaml file is used to control Cassandra's sensitivity to node failure detection. The default value is appropriate for most situations. However, in cloud environments such as Amazon EC2, the value should be increased to 12 in order to account for network issues that sometimes occur on such platforms.

Does Apache Cassandra™ compress data on disk?

Yes, data compression is available with Apache Cassandra™ 1.0 and above. Various compression algorithms are used to deliver impressive storage savings, in some cases compressing raw data by 80 percent or more with no performance penalty for read/write operations. In fact, because of the reduction in physical I/O, compression actually increases performance in some use cases.

How do I backup data in Apache Cassandra™?

Currently, the most common method for backing up data in Apache Cassandra™ is using the snapshot function in the nodetool utility. This is an online operation and does not require any downtime or block any operations on the server.

Snapshots are sent by default to a snapshots directory located in the Apache Cassandra™ data directory (controlled via the data_file_directories in the cassandra.yaml file). Once taken, snapshots can be moved off-site to be protected.

Incremental backups (i.e., data backed up since the last full snapshot) can be performed by setting the incremental_backups parameter in the cassandra.yaml file to “true.” When incremental backup is enabled, Apache Cassandra™ copies every flushed SSTable for each keyspace to a backup directory located under the Apache Cassandra™ data directory. Restoring from an incremental backup involves first restoring from the last full snapshot and then copying each incremental file back into the Apache Cassandra™ data directory.

How do I restore data in Apache Cassandra™?

In general, restoring an Apache Cassandra™ node is done by following these procedures:

  1. Shut down the node that is to be restored
  2. Clear the commit log by removing all the files in the commit log directory (e.g., rm /var/lib/cassandra/commitlog/*)
  3. Remove the database files for all keyspaces (e.g., rm /var/lib/cassandra/data/keyspace1/*.db). Take care so as not to remove the snapshot directory for the keyspace
  4. Copy the latest snapshot directory contents for each keyspace to the keyspace’s data directory (e.g., cp -p /var/lib/cassandra/data/keyspace1/snapshots/56046198758643-snapshotkeyspace1/* /var/lib/cassandra/data/keyspace1)
  5. Copy any incremental backups taken for each keyspace into the keyspace’s data directory
  6. Repeat steps 3-5 for each keyspace
  7. Restart the node

How do I uninstall Apache Cassandra™?

It depends on the installation method (e.g. Linux yum, tar). Linux packaged installs can be uninstalled using those utilities while tar installs require the manual deletion of the Apache Cassandra™ software, data, and log files.

Is my data safe in Apache Cassandra™?

Yes. First, data durability is fully supported in Apache Cassandra™ so that any data written to a database cluster is first written to a commit log in the same fashion as nearly every popular RDBMS does.

Second, Apache Cassandra™ offers tunable data consistency so that a developer or administrator can choose how strong they wish consistency across nodes to be. The strongest form of consistency is to mandate that any data modifications be made to all nodes, with any unsuccessful attempt on a node resulting in a failed data operation. Apache Cassandra™ provides consistency in the CAP sense in that all readers will see the same values.

Other forms of tunable consistency involve having a quorum of nodes written to or just one node for the loosest form of consistency. Apache Cassandra™ is very flexible and allows data consistency to be chosen on a per operation basis if needed so that very strong consistency can be used when desired, or very loose consistency can be utilized when the use case permits.

What type of security does Apache Cassandra™ offer?

Apache Cassandra™ 1.2 and higher provides the following built-in security features: (1) internally-managed authentication (login IDs and passwords are managed within Apache Cassandra™); (2) internal authorization / object permission management via GRANT/REVOKE; (3) client-to-node encryption via SSL. None of these security features is enabled by default; they must be configured in the cassandra.yaml file.

What options do I have to make sure my data is consistent across nodes?

In Apache Cassandra™, consistency refers to how up to date and synchronized a row of data is on all of its replicas. Apache Cassandra™ offers a number of built-in features to ensure data consistency:

  • Hinted Handoff Writes – Writes are always sent to all replicas for the specified row regardless of the consistency level specified by the client. If a node happens to be down at the time of a write, its corresponding replicas will save hints about the missed writes, and then hand off the affected rows once the node comes back online again. Hinted handoff ensures data consistency in spite of short, transient node outages.
  • Read Repair – Read operations trigger consistency checks across all replicas for a requested row using a process called read repair. For reads, there are two types of read requests that a coordinator node can send to a replica: a direct read request and a background read repair request. The number of replicas contacted by a direct read request is determined by the read consistency level specified by the client. Background read repair requests are sent to any additional replicas that did not receive a direct request. Read repair requests ensure that the requested row is made consistent on all replicas.
  • Anti-Entropy Node Repair – For data that is not read frequently, or to update data on a node that has been down for an extended period, the node repair process (also referred to as anti-entropy repair) ensures that all data on a replica is made consistent. Node repair (using the nodetool utility) should be run routinely as part of regular cluster maintenance operations.

What is ‘tunable consistency’ in Apache Cassandra™?

Apache Cassandra™ extends the concept of ‘eventual consistency’ by offering ‘tunable consistency’. For any given read or write operation, the client application decides how consistent the requested data should be.

Consistency levels in Apache Cassandra™ can be set on any read or write query. This allows application developers to tune consistency on a per-query basis depending on their requirements for response time versus data accuracy. Apache Cassandra™ offers a number of consistency levels for both reads and writes.

Choosing a consistency level for reads and writes involves determining your requirements for consistent results (always reading the most recently written data) versus read or write latency (the time it takes for the requested data to be returned or for the write to succeed).

If latency is a top priority, consider a consistency level of ONE (only one replica node must successfully respond to the read or write request). There is a higher probability of stale data being read with this consistency level (as the replicas contacted for reads may not always have the most recent write). For some applications, this may be an acceptable trade-off.

If consistency is top priority, you can ensure that a read will always reflect the most recent write by using the following formula:

(nodes_written + nodes_read) > replication_factor

For example, if your application is using the QUORUM consistency level for both write and read operations and you are using a replication factor of 3, then this ensures that 2 nodes are always written and 2 nodes are always read. The combination of nodes written and read (4) being greater than the replication factor (3) ensures strong read consistency.
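
In cqlsh this looks like the following (the keyspace and table are hypothetical); the drivers expose the same per-statement consistency setting:

  cqlsh> CONSISTENCY QUORUM;
  cqlsh> INSERT INTO demo.accounts (id, balance) VALUES (1001, 250.00);
  cqlsh> SELECT balance FROM demo.accounts WHERE id = 1001;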

How do I load data into Apache Cassandra™?

With respect to loading external data, Apache Cassandra™ supplies a COPY utility that easily loads external data that exists in delimited format. Apache Cassandra™ also has a load utility called sstableloader, which is able to load flat files into Apache Cassandra™; however, the files must first be converted into SSTable format.
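
A minimal cqlsh sketch of the COPY utility (the table, columns, and file path are hypothetical):

  cqlsh> COPY demo.users (id, name, email) FROM 'users.csv' WITH HEADER = true;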

How can I move data from another database to Apache Cassandra™?

Most RDBMSs have an unload utility that allows data to be unloaded to flat files. Once in flat file format, the COPY command can be used to load the data into Apache Cassandra™ column families.

In addition, DataStax has partnered with various data integration vendors to provide a powerful extract-transform-load (ETL) framework that allows easy migration of various source systems (e.g., Oracle, MySQL) into Apache Cassandra™.

What is read repair in Apache Cassandra™?

Read operations trigger consistency checks across all replicas for a requested row using a process called read repair. For reads, there are two types of read requests that a coordinator node can send to a replica: a direct read request and a background read repair request. The number of replicas contacted by a direct read request is determined by the read consistency level specified by the client. Background read repair requests are sent to any additional replicas that did not receive a direct request. Read repair requests ensure that the requested row is made consistent on all replicas. Read repair is an optional feature and can be configured per column family.
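
In the Apache Cassandra™ versions contemporary with this FAQ, the background read repair probability is a per-table option; as a sketch (the table name is hypothetical):

  ALTER TABLE demo.users
    WITH read_repair_chance = 0.1
    AND dclocal_read_repair_chance = 0.05;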

How can I move data from other databases/sources to Apache Cassandra™?

There are a number of internal utilities and external tools that allow data to be easily moved into and out of Apache Cassandra™. See this blog post that describes the most commonly used methods.

What client libraries/drivers can I use with Apache Cassandra™?

There are a number of CQL (Apache Cassandra™ Query Language) drivers and native client libraries available for most popular development languages (e.g. Java, Python, Ruby). All drivers and client libraries can be downloaded from: http://www.datastax.com/download/clientdrivers.

What type of data model does Apache Cassandra™ use?

The Apache Cassandra™ data model supports both key-value and tabular data models.

Although it is natural to want to compare the Apache Cassandra™ tabular data model to a relational database, they are really quite different. In a relational database, data is stored in tables and the tables comprising an application are typically related to each other. Data is usually normalized to reduce redundant entries, and tables are joined on common keys to satisfy a given query.

In Apache Cassandra™, the keyspace is the container for your application data, similar to a database or schema in a relational database. Inside the keyspace are one or more flexible tables.

Apache Cassandra™ does not enforce relationships between column families the way that relational databases do between tables: there are no formal foreign keys in Apache Cassandra™, and joining tables at query time is not supported (although integration with Spark supports JOINs). Each table has a self-contained set of columns that are intended to be accessed together to satisfy specific queries from your application.
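
For instance (a hypothetical schema), a table is typically designed around a single query, with the partition key and clustering columns chosen so that the query reads one partition in order:

  CREATE TABLE demo.user_logins (
      user_id    uuid,
      login_time timestamp,
      ip_address inet,
      PRIMARY KEY (user_id, login_time)
  ) WITH CLUSTERING ORDER BY (login_time DESC);

  -- "most recent logins for a user" is answered from a single partition
  SELECT * FROM demo.user_logins
  WHERE user_id = 62c36092-82a1-3a00-93d1-46196ee77204
  LIMIT 10;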

What datatypes does Apache Cassandra™ support?

Apache Cassandra™ supports a full range of modern datatypes (for example, text and varchar, numeric types, timestamps, UUIDs, blobs, collections, and user-defined types) to handle structured, semi-structured, and unstructured data.

What is a keyspace in Apache Cassandra™?

In Apache Cassandra™, the keyspace is the container for your application data, similar to a schema in a relational database. Keyspaces are used to group tables together. Typically, a cluster has one keyspace per application.

Replication is controlled on a per-keyspace basis, so data that has different replication requirements should reside in different keyspaces. Keyspaces are not designed to be used as a significant map layer within the data model, only as a way to control data replication for a set of column families.

Does Apache Cassandra™ support transactions?

Apache Cassandra™ transactions are atomic, isolated, and durable, however the consistency is tunable in that a developer can decide how strong or eventual they want the consistency of each transaction to be. Apache Cassandra™ also supports lightweight transactions (see documentation for more information).
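
A minimal sketch of a lightweight transaction (the table is hypothetical); the IF clause makes the write conditional and is executed with linearizable consistency:

  INSERT INTO demo.users (id, name) VALUES (42, 'Sam') IF NOT EXISTS;

  UPDATE demo.users SET name = 'Samuel'
    WHERE id = 42 IF name = 'Sam';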

What is the CQL language?

The Apache Cassandra™ Query Language (CQL) is the formal API for interacting with Apache Cassandra™, and is based on SQL (Structured Query Language), the standard for relational database manipulation. Although CQL has many similarities to SQL, it does not change the underlying Apache Cassandra™ data model. There is no support for JOINs, for example.

What is a compaction in Apache Cassandra™?

Apache Cassandra™ is optimized for write throughput. Apache Cassandra™ writes are first written to a commit log (for durability), and then to an in-memory table structure called a memtable. Writes are batched in memory and periodically written to disk to a persistent table structure called an SSTable (Sorted String table).

The “Sorted” part means SSTables are sorted by row token (as determined by the partitioner), which is what makes merges for compaction efficient (i.e., don’t have to read entire SSTables into memory). Row contents are also sorted by column comparator, so Apache Cassandra™ can support larger-than-memory rows too. SSTables are immutable (i.e., they are not written to again after they have been flushed). This means that a row is typically stored across multiple SSTable files.

In the background, Apache Cassandra™ periodically merges SSTables together into larger SSTables using a process called compaction. Compaction merges row fragments together, removes expired tombstones (deleted columns), and rebuilds primary and secondary indexes. Since the SSTable files are sorted by row key, this merge is efficient (no random disk I/O). Once a newly merged SSTable is complete, the smaller input SSTables are marked as obsolete and eventually deleted by the Java Virtual Machine (JVM) garbage collection (GC) process. However, during compaction, there is a temporary spike in disk space usage and disk I/O on the node.
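
The compaction strategy itself is a per-table CQL setting; for example (the table is hypothetical), a table can be switched from the default size-tiered strategy to leveled compaction:

  ALTER TABLE demo.events
    WITH compaction = { 'class': 'LeveledCompactionStrategy', 'sstable_size_in_mb': 160 };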

What platforms does Apache Cassandra™ run on?

Apache Cassandra™ is a Java application, meaning that a compiled binary distribution of Apache Cassandra™ can run on any platform that has a Java Runtime Environment (JRE), also referred to as a Java Virtual Machine (JVM).

DataStax makes available packaged releases for Red Hat, CentOS, Debian, and Ubuntu Linux, as well as Microsoft Windows and Mac OS X.



DataStax is a registered trademark of DataStax, Inc. and its subsidiaries in the United States and/or other countries.
Apache Cassandra, Apache, Tomcat, Lucene, Solr, Hadoop, Spark, TinkerPop, and Cassandra are trademarks of the Apache Software Foundation or its subsidiaries in Canada, the United States and/or other countries.