In the previous post, I talked briefly about the journey we've taken to find the best way to get SSTable data loaded onto the GPU for data analytics. We looked at the existing methods of pulling data from the driver and manually converting it into a cuDF, then at using the existing Cassandra driver, and finally at working on a custom parser implementation, sstable-to-arrow, which will be the focus of this post. We’ll go over its capabilities, limitations, and how to get started.
sstable-to-arrow is written in C++17. It uses the Kaitai Struct library to declaratively specify the layout of the SSTable files using YAML, a data serialization language. The Kaitai Struct compiler then compiles these Kaitai declarations into C++ classes, which can be included in the source code to actually parse the SSTables into data in memory. Then, it takes the data and converts each column in the table into an Arrow Vector. sstable-to-arrow can then ship the Arrow data to any client, where the data can be turned into a cuDF and be available for GPU analysis.
- sstable-to-arrow can only read one SSTable at a time. To handle multiple SSTables, the user must configure a cuDF for each SSTable and use the GPU to merge them based on last-write-wins semantics. sstable-to-arrow exposes internal Cassandra timestamps and tombstone markers so that merging can be done at the cuDF layer.
- Some data, including the names of the partition key and clustering columns, can't actually be deduced from the SSTable files since they require the schema to be stored in the system tables.
- Cassandra stores data in memtables and commitlogs before flushing to SSTables, so analytics performed via only sstable-to-arrow will potentially be stale / not real-time.
- Currently, the parser only supports files written by Cassandra OSS 3.11.
- The system is set up to scan entire SSTables (not read specific partitions). More work will be needed if we ever do predicate pushdown.
- The following CQL types are not supported: counter, frozen, and user-defined types.
- varints can only store up to 8 bytes. Attempting to read a table with larger varints will crash.
- The parser can only read tables with up to 64 columns.
- The parser loads each SSTable into memory, so it is currently unable to handle large SSTables that exceed the machine’s memory capacity.
- decimals are converted into an 8-byte floating point value because neither C++ nor Arrow has native support for arbitrary-precision integers or decimals like the Java BigInteger or BigDecimal classes. This means that operations on decimal columns will use floating point arithmetic, which may be inexact.
- sets are treated as lists because Arrow has no equivalent of a set.
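To make the first limitation concrete, here is a minimal pure-Python sketch of what a last-write-wins merge across several SSTables involves. It is not the project's code: the function name and the cell fields (`key`, `column`, `value`, `timestamp`, `tombstone`) are hypothetical stand-ins for the timestamps and tombstone markers that sstable-to-arrow exposes, and breaking timestamp ties in favor of deletions is a simplifying assumption.

```python
def merge_last_write_wins(*sstables):
    """Merge cells from several SSTables, keeping the newest write per cell.

    Each SSTable is an iterable of dicts with the hypothetical fields
    key, column, value, timestamp, and tombstone (True for a deletion).
    """
    newest = {}
    for cells in sstables:
        for cell in cells:
            cell_id = (cell["key"], cell["column"])
            current = newest.get(cell_id)
            # Compare by timestamp first; on a tie, a tombstone (True) outranks
            # a live cell (False). This tie-break is an assumption.
            rank = (cell["timestamp"], cell["tombstone"])
            if current is None or rank > (current["timestamp"], current["tombstone"]):
                newest[cell_id] = cell
    # Cells whose newest write is a deletion are dropped from the result.
    return {cid: c["value"] for cid, c in newest.items() if not c["tombstone"]}


older = [
    {"key": 1, "column": "temp", "value": 20, "timestamp": 100, "tombstone": False},
    {"key": 2, "column": "temp", "value": 30, "timestamp": 100, "tombstone": False},
]
newer = [
    {"key": 1, "column": "temp", "value": 25, "timestamp": 200, "tombstone": False},
    {"key": 2, "column": "temp", "value": None, "timestamp": 150, "tombstone": True},
]
merged = merge_last_write_wins(older, newer)
```

Here the newer write for key 1 replaces the older one, and key 2 disappears because its most recent write is a tombstone. On the GPU, the same idea would be expressed with DataFrame operations rather than a Python loop.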
Roadmap and future developments
The ultimate goal for this project is to have some form of a read_sstable function included in the RAPIDS ecosystem, similar to how cudf.read_csv currently works. Performance is also going to be a continuous area of development, and I'm currently looking into ways in which reading the SSTables can be further parallelized to make the most use of the GPU. I’m also working on solving or improving the limitations described above, especially broadening support for different CQL types and enabling the program to handle large datasets.
How to run
You can run sstable-to-arrow using Docker.
# -s flag loads and sends a sample Internet Of Things (IOT) database
docker run --rm -itp 9143:9143 --name sstable-to-arrow datastaxlabs/sstable-to-arrow -s
This will listen for a connection on port 9143. It expects the client to send a message first, and then it will send data in the following format:
1. The number of Arrow Tables being transferred as an 8-byte big-endian unsigned integer
2. For each table:
   1. Its size in bytes as an 8-byte big-endian unsigned integer
   2. The contents of the table in Arrow IPC Stream Format
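The framing described above can be read with nothing but the Python standard library. The following is a rough sketch rather than the sample client from the repository: the function names are hypothetical, and the content of the initial message (`b"hello\n"`) is an assumption, since the server only needs the client to send something first.

```python
import socket
import struct


def recv_exactly(sock, n):
    """Read exactly n bytes from sock, raising if the peer closes early."""
    chunks = []
    remaining = n
    while remaining > 0:
        chunk = sock.recv(min(remaining, 65536))
        if not chunk:
            raise ConnectionError("socket closed before the message was complete")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def read_u64_be(sock):
    """Read one 8-byte big-endian unsigned integer."""
    return struct.unpack(">Q", recv_exactly(sock, 8))[0]


def fetch_table_buffers(sock, hello=b"hello\n"):
    """Return one bytes object per Arrow table sent by the server."""
    sock.sendall(hello)  # assumed content; the server waits for any first message
    num_tables = read_u64_be(sock)
    buffers = []
    for _ in range(num_tables):
        size = read_u64_be(sock)
        buffers.append(recv_exactly(sock, size))
    return buffers


def fetch_from_server(host="127.0.0.1", port=9143):
    """Connect to a running sstable-to-arrow server and fetch all tables."""
    with socket.create_connection((host, port)) as sock:
        return fetch_table_buffers(sock)
```

With the Docker container from above running, `fetch_from_server()` should return the raw table buffers. Since each buffer is in Arrow IPC Stream Format, it can then be decoded with pyarrow, for example `pyarrow.ipc.open_stream(buf).read_all()`.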
You can then fetch the data from your port using any client. To get started with the sample Python client, follow these steps if your system does not support CUDA:
# 1. Clone the repository:
git clone https://github.com/datastax/sstable-to-arrow.git
# 2. Navigate to the client directory:
cd sstable-to-arrow/client
# 3. Create a new Python virtual environment:
python3 -m venv ./sstable_to_arrow_venv
# 4. Activate the virtual environment:
source ./sstable_to_arrow_venv/bin/activate
# 5. Make sure you have the latest version of pip:
pip3 install --upgrade pip
# 6. Install the requirements:
pip3 install -r requirements.txt
# 7. Start the sstable-to-arrow server in another terminal (see above)
# 8. Run the sample client
python3 no_cuda.py
If your system does support CUDA, we recommend creating a conda environment with the following commands. You will also need to pass the -x flag when starting the sstable-to-arrow server above to turn all non-cudf-supported types into hex strings.
# 1. Clone the repository:
git clone https://github.com/datastax/sstable-to-arrow.git
# 2. Navigate to the client directory:
cd sstable-to-arrow/client
# 3. Create a new conda environment:
conda create -y -n sstable-to-arrow
# 4. Activate the virtual environment:
conda activate sstable-to-arrow
# 5. Install the requirements:
conda install -y -c blazingsql -c rapidsai -c nvidia -c conda-forge -c defaults blazingsql cudf pyarrow
# 6. Start the sstable-to-arrow server with the -x flag in another terminal (see above)
# 7. Run the sample client
python3 with_cuda.py
To experiment with other datasets, you will need raw SSTable files on your machine. You can download sample IOT data at this Google Drive folder. You can also generate IOT data using the generate-data script in the repository, or you can manually create a table using CQL and the Cassandra Docker image (see the Cassandra quickstart for more info). Make sure to use Docker volumes to share the SSTable files with the container:
docker run --rm -itp 9143:9143 -v /absolute/path/to/sstable/directory:/mnt/sstables --name sstable-to-arrow datastaxlabs/sstable-to-arrow /mnt/sstables
You can also pass the -h flag to get information about other options.
If you would like to build the project from source, follow the steps in the GitHub repository.
SSTable to Parquet
sstable-to-arrow is also able to save the SSTable data as a Parquet file, a common format for storing columnar data. Again, it does not yet support deduplication, so it will simply output the SSTable and all metadata to the given Parquet file.
You can run this by passing the -p flag followed by the path where you would like to store the Parquet file:
# will save the parquet file into the same directory as the sstable files
docker run --rm -it -v /absolute/path/to/sstable/directory:/mnt/sstables --name sstable-to-arrow datastaxlabs/sstable-to-arrow -p /mnt/sstables/data.parquet /mnt/sstables
We will be holding a free online workshop in mid-August that will go deeper into this project with hands-on examples! You can sign up here if you’re interested.
If you’re interested in trying out sstable-to-arrow, take a look at the second blog post in this two-part series and feel free to reach out to firstname.lastname@example.org with any feedback or questions.