Debugging SSTables in 3.0 with sstabledump

By Andy Tolbert - March 9, 2016

Cassandra 3.0.4 and 3.4 introduce sstabledump, a new utility for exploring SSTables. sstabledump is the spiritual successor to, and a replacement for, sstable2json, which was removed from Cassandra in version 3.0. Examining SSTable data is still a useful diagnostic technique, and sstabledump can export SSTable content to a human-readable JSON format.

How SSTable data is stored on disk has changed in Cassandra 3.0, as previously covered in ‘Putting some structure in the storage engine’. Previously, SSTables were composed of partition keys and their cells; now SSTables are composed of partitions and their rows.

This eliminates quite a bit of overhead present in prior versions of Cassandra. Metadata such as clustering key values, timestamps, and TTLs is now defined at the row level rather than repeated for each individual cell within a row. The new layout also matches how data is represented in CQL, making it easier to understand.

A nice enhancement of sstabledump over sstable2json is that the utility runs in 'client mode', so it does not need to read the system tables to determine the schema. sstabledump can be executed outside of a Cassandra environment, and cassandra.yaml is not required on the classpath for the tool to work.
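
Because no schema or cassandra.yaml is needed, an SSTable can be copied off a node (together with its companion component files) and inspected anywhere the Cassandra tools are installed. A hypothetical example, with illustrative keyspace, table, and path names:

# Copy an SSTable's component files off a node (paths are illustrative)...
$ scp node1:/var/lib/cassandra/data/myks/mytable-*/ma-1-big-* /tmp/sstables/

# ...then dump it locally; no running Cassandra or cassandra.yaml is required.
$ tools/bin/sstabledump /tmp/sstables/ma-1-big-Data.db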

Note that sstabledump only supports Cassandra 3.X SSTables.

Visualizing the Storage Engine Changes in 3.0

To demonstrate sstabledump and the changes in SSTable layout in 3.0, we’ll use sstable2json and sstabledump to contrast the SSTables created by a Cassandra 2.2 node and those created by a Cassandra 3.0 node.

First, let’s generate a small SSTable for a table that represents stock ticker data. This should be done within a cqlsh session on each Cassandra cluster:

-- Create the schema

CREATE KEYSPACE IF NOT EXISTS ticker WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 };

USE ticker;

CREATE TABLE IF NOT EXISTS symbol_history (	
  symbol    text,
  year      int,
  month     int,
  day       int,
  volume    bigint,
  close     double,
  open      double,
  low       double,
  high      double,
  idx       text static,
  PRIMARY KEY ((symbol, year), month, day)
) WITH CLUSTERING ORDER BY (month DESC, day DESC);

-- Insert some records

INSERT INTO symbol_history (symbol, year, month, day, volume, close, open, low, high, idx) 
VALUES ('CORP', 2015, 12, 31, 1054342, 9.33, 9.55, 9.21, 9.57, 'NYSE') USING TTL 604800;

INSERT INTO symbol_history (symbol, year, month, day, volume, close, open, low, high, idx) 
VALUES ('CORP', 2016, 1, 1, 1055334, 8.2, 9.33, 8.02, 9.35, 'NASDAQ') USING TTL 604800;

INSERT INTO symbol_history (symbol, year, month, day, volume, close, open, low, high) 
VALUES ('CORP', 2016, 1, 4, 1054342, 8.54, 8.2, 8.2, 8.65) USING TTL 604800;

INSERT INTO symbol_history (symbol, year, month, day, volume, close, open, low, high) 
VALUES ('CORP', 2016, 1, 5, 1054772, 8.73, 8.54, 8.44, 8.75) USING TTL 604800;

-- Update a column value

UPDATE symbol_history USING TTL 604800 set close = 8.55 where symbol = 'CORP' and year = 2016 and month = 1 and day = 4;

Next, let’s flush memtables to disk as SSTables using nodetool:

$ bin/nodetool flush

Then in a cqlsh session we will set a column value to null and delete an entire row to generate some tombstones:

-- Set column value to null

USE ticker;
UPDATE symbol_history SET high = null WHERE symbol = 'CORP' and year = 2016 and month = 1 and day = 1;

-- Delete an entire row

DELETE FROM symbol_history WHERE symbol = 'CORP' and year = 2016 and month = 1 and day = 5;

We flush again to generate a new SSTable, then perform a major compaction to yield a single SSTable:

$ bin/nodetool flush; bin/nodetool compact ticker

Now that we have a single SSTable representing the operations on our CQL table, we can use the appropriate tool to examine its contents.
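
The compacted SSTable lives under the table's data directory, whose name includes a generated table ID, so the exact path will differ on each system. Something like the following locates the Data.db component to pass to the tool:

$ find data/data/ticker -name '*-Data.db'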

C* 2.2 sstable2json Output

$ tools/bin/sstable2json data/data/ticker/symbol_history-d7197900e5aa11e590210b5b92b49507/la-3-big-Data.db

[
{"key": "CORP:2016",
 "cells": [["::idx","NASDAQ",1457495762169139,"e",604800,1458100562],
           ["1:5:_","1:5:!",1457495781073797,"t",1457495781],
           ["1:4:","",1457495762172733,"e",604800,1458100562],
           ["1:4:close","8.55",1457495767496569,"e",604800,1458100567],
           ["1:4:high","8.65",1457495762172733,"e",604800,1458100562],
           ["1:4:low","8.2",1457495762172733,"e",604800,1458100562],
           ["1:4:open","8.2",1457495762172733,"e",604800,1458100562],
           ["1:4:volume","1054342",1457495762172733,"e",604800,1458100562],
           ["1:1:","",1457495762169139,"e",604800,1458100562],
           ["1:1:close","8.2",1457495762169139,"e",604800,1458100562],
           ["1:1:high",1457495780,1457495780541716,"d"],
           ["1:1:low","8.02",1457495762169139,"e",604800,1458100562],
           ["1:1:open","9.33",1457495762169139,"e",604800,1458100562],
           ["1:1:volume","1055334",1457495762169139,"e",604800,1458100562]]},
{"key": "CORP:2015",
 "cells": [["::idx","NYSE",1457495762164052,"e",604800,1458100562],
           ["12:31:","",1457495762164052,"e",604800,1458100562],
           ["12:31:close","9.33",1457495762164052,"e",604800,1458100562],
           ["12:31:high","9.57",1457495762164052,"e",604800,1458100562],
           ["12:31:low","9.21",1457495762164052,"e",604800,1458100562],
           ["12:31:open","9.55",1457495762164052,"e",604800,1458100562],
           ["12:31:volume","1054342",1457495762164052,"e",604800,1458100562]]}
]

As previously stated, the sstable2json output shows that the storage engine prior to Cassandra 3.0 represents data as partition keys and their cells.

A large portion of the data presented in these cells is redundant. For example, each cell written by one of our INSERT statements shares the same timestamp and TTL, yet both are repeated per cell. Additionally, each cell contains not only the full name of its column, but also the values of the clustering keys it belongs to. This overhead accounts for a large portion of the SSTable's size.

C* 3.0 sstabledump Output

$ tools/bin/sstabledump data/data/ticker/symbol_history-6d6bfc70e5ab11e5aeae7b4a82a62e48/ma-3-big-Data.db 

[
  {
    "partition" : {
      "key" : [ "CORP", "2016" ],
      "position" : 0
    },
    "rows" : [
      {
        "type" : "static_block",
        "position" : 48,
        "cells" : [
          { "name" : "idx", "value" : "NASDAQ", "tstamp" : 1457484225583260, "ttl" : 604800, "expires_at" : 1458089025, "expired" : false }
        ]
      },
      {
        "type" : "row",
        "position" : 48,
        "clustering" : [ "1", "5" ],
        "deletion_info" : { "deletion_time" : 1457484273784615, "tstamp" : 1457484273 }
      },
      {
        "type" : "row",
        "position" : 66,
        "clustering" : [ "1", "4" ],
        "liveness_info" : { "tstamp" : 1457484225586933, "ttl" : 604800, "expires_at" : 1458089025, "expired" : false },
        "cells" : [
          { "name" : "close", "value" : "8.54" },
          { "name" : "high", "value" : "8.65" },
          { "name" : "low", "value" : "8.2" },
          { "name" : "open", "value" : "8.2" },
          { "name" : "volume", "value" : "1054342" }
        ]
      },
      {
        "type" : "row",
        "position" : 131,
        "clustering" : [ "1", "1" ],
        "liveness_info" : { "tstamp" : 1457484225583260, "ttl" : 604800, "expires_at" : 1458089025, "expired" : false },
        "cells" : [
          { "name" : "close", "value" : "8.2" },
          { "name" : "high", "deletion_time" : 1457484267, "tstamp" : 1457484267368678 },
          { "name" : "low", "value" : "8.02" },
          { "name" : "open", "value" : "9.33" },
          { "name" : "volume", "value" : "1055334" }
        ]
      }
    ]
  },
  {
    "partition" : {
      "key" : [ "CORP", "2015" ],
      "position" : 194
    },
    "rows" : [
      {
        "type" : "static_block",
        "position" : 239,
        "cells" : [
          { "name" : "idx", "value" : "NYSE", "tstamp" : 1457484225578370, "ttl" : 604800, "expires_at" : 1458089025, "expired" : false }
        ]
      },
      {
        "type" : "row",
        "position" : 239,
        "clustering" : [ "12", "31" ],
        "liveness_info" : { "tstamp" : 1457484225578370, "ttl" : 604800, "expires_at" : 1458089025, "expired" : false },
        "cells" : [
          { "name" : "close", "value" : "9.33" },
          { "name" : "high", "value" : "9.57" },
          { "name" : "low", "value" : "9.21" },
          { "name" : "open", "value" : "9.55" },
          { "name" : "volume", "value" : "1054342" }
        ]
      }
    ]
  }
]

Because the new tool's output is more verbose, its payload is less compact than sstable2json's. However, it displays the enriched structure of the 3.0 storage engine, and it is apparent that far less data is repeated, which leads to a dramatically reduced SSTable storage footprint.

Looking at the output, note that clustering, timestamp, and TTL information is now presented at the row level instead of being repeated in each individual cell. This change is a large factor in reducing disk space. And while column names still appear for each cell in the output, the full composite column name is no longer stored per cell as it was previously. You can read more about these and other optimizations in the aforementioned blog post.

Internal Representation Format

As previously mentioned, sstabledump's JSON representation is more verbose than sstable2json's. sstabledump also provides an alternative 'debug' output format that is far more concise than its JSON counterpart. While it takes a little getting used to, it is a compact and convenient way for advanced users to grok the contents of an SSTable. To view data in this format, simply pass the -d flag to sstabledump:

$ tools/bin/sstabledump data/data/ticker/symbol_history-6d6bfc70e5ab11e5aeae7b4a82a62e48/ma-3-big-Data.db -d

[CORP:2016]@0 Row[info=[ts=-9223372036854775808] ]: STATIC | [idx=NASDAQ ts=1457496014384090 ttl=604800 ldt=1458100814]
[CORP:2016]@0 Row[info=[ts=-9223372036854775808] del=deletedAt=1457496035375251, localDeletion=1457496035 ]: 1, 5 | 
[CORP:2016]@66 Row[info=[ts=1457496014387922 ttl=604800, let=1458100814] ]: 1, 4 | [close=8.55 ts=1457496020899876 ttl=604800 ldt=1458100820], [high=8.65 ts=1457496014387922 ttl=604800 ldt=1458100814], [low=8.2 ts=1457496014387922 ttl=604800 ldt=1458100814], [open=8.2 ts=1457496014387922 ttl=604800 ldt=1458100814], [volume=1054342 ts=1457496014387922 ttl=604800 ldt=1458100814]
[CORP:2016]@141 Row[info=[ts=1457496014384090 ttl=604800, let=1458100814] ]: 1, 1 | [close=8.2 ts=1457496014384090 ttl=604800 ldt=1458100814], [high=<tombstone> ts=1457496034857652 ldt=1457496034], [low=8.02 ts=1457496014384090 ttl=604800 ldt=1458100814], [open=9.33 ts=1457496014384090 ttl=604800 ldt=1458100814], [volume=1055334 ts=1457496014384090 ttl=604800 ldt=1458100814]
[CORP:2015]@204 Row[info=[ts=-9223372036854775808] ]: STATIC | [idx=NYSE ts=1457496014379236 ttl=604800 ldt=1458100814]
[CORP:2015]@204 Row[info=[ts=1457496014379236 ttl=604800, let=1458100814] ]: 12, 31 | [close=9.33 ts=1457496014379236 ttl=604800 ldt=1458100814], [high=9.57 ts=1457496014379236 ttl=604800 ldt=1458100814], [low=9.21 ts=1457496014379236 ttl=604800 ldt=1458100814], [open=9.55 ts=1457496014379236 ttl=604800 ldt=1458100814], [volume=1054342 ts=1457496014379236 ttl=604800 ldt=1458100814]

Apart from the addition of this internal representation format, sstabledump is used in exactly the same way as sstable2json.
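
For example, both tools accept flags to narrow the output. The flags below reflect the tool's usage help as I understand it and may vary by version, so run sstabledump with no arguments to see the options available in your build:

# List only the partition keys present in the SSTable
$ tools/bin/sstabledump data/data/ticker/symbol_history-6d6bfc70e5ab11e5aeae7b4a82a62e48/ma-3-big-Data.db -e

# Restrict the dump to one or more partition keys
$ tools/bin/sstabledump data/data/ticker/symbol_history-6d6bfc70e5ab11e5aeae7b4a82a62e48/ma-3-big-Data.db -k <partition key>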

Comments

  1. sood says:

    Thanks for the informative article. Really appreciate the effort.

    I have a question about this row:

    {
      "type" : "row",
      "position" : 48,
      "clustering" : [ "1", "5" ],
      "deletion_info" : { "deletion_time" : 1457484273784615, "tstamp" : 1457484273 }
    },

    The only data the SSTable seems to contain for it is the deletion marker. I am assuming this is what corresponds to a tombstone, and the reason there is only deletion info and no actual data is that we have carried out compaction.

    Can you explain what would happen to this record when repair runs? Will this row ever get removed from the SSTable?

    Thanks

    1. Andy Tolbert says:

      Hi sood,

      Excellent question! You are correct, this corresponds to a tombstone. In Cassandra, a tombstone is purged when a compaction occurs on an SSTable containing it, once gc_grace_seconds have elapsed since the tombstone was created. See: http://cassandra.apache.org/doc/latest/operating/compaction.html?highlight=gc_grace_seconds#the-gc-grace-seconds-parameter-and-tombstone-removal.
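
      For reference, gc_grace_seconds can be inspected or changed per table in cqlsh; a minimal sketch using the schema from this post:

      -- Check the current grace period (the default is 864000 seconds, i.e. 10 days)
      SELECT gc_grace_seconds FROM system_schema.tables
      WHERE keyspace_name = 'ticker' AND table_name = 'symbol_history';

      -- Shorten it for this table so tombstones become purgeable sooner
      ALTER TABLE ticker.symbol_history WITH gc_grace_seconds = 86400;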

      Come to think of it, I wonder if we can improve sstabledump to include when a tombstone is old enough for removal in this output. I’ll pursue this!

      1. Andy Tolbert says:

        Unfortunately it is not currently possible to access the gc_grace_seconds setting without accessing the table schema, which sstabledump does not do by design since it aims to require only the sstable files. https://issues.apache.org/jira/browse/CASSANDRA-9587 will allow us to access it though. Created https://issues.apache.org/jira/browse/CASSANDRA-12434 for this.

  2. Ian says:

    I can’t find sstabledump on my system.

    I installed 3.7 from the DataStax distribution, i.e.

    # cat /etc/yum.repos.d/datastax.repo
    [datastax-ddc]
    name = DataStax Repo for Apache Cassandra
    baseurl = http://rpm.datastax.com/datastax-ddc/3.7
    enabled = 1
    gpgcheck = 0

    but sstabledump is not anywhere on my system. I have other sstable executables in /usr/bin…

    # ll /usr/bin/sstable*
    -rwxr-xr-x. 1 root root 2046 Jun 10 17:16 /usr/bin/sstableexpiredblockers
    -rwxr-xr-x. 1 root root 2044 Jun 10 17:16 /usr/bin/sstablelevelreset
    -rwxr-xr-x. 1 root root 2034 Jun 10 17:16 /usr/bin/sstableloader
    -rwxr-xr-x. 1 root root 2045 Jun 10 17:16 /usr/bin/sstablemetadata
    -rwxr-xr-x. 1 root root 2045 Jun 10 17:16 /usr/bin/sstableofflinerelevel
    -rwxr-xr-x. 1 root root 2047 Jun 10 17:16 /usr/bin/sstablerepairedset
    -rwxr-xr-x. 1 root root 2042 Jun 10 17:16 /usr/bin/sstablescrub
    -rwxr-xr-x. 1 root root 2042 Jun 10 17:16 /usr/bin/sstablesplit
    -rwxr-xr-x. 1 root root 2042 Jun 10 17:16 /usr/bin/sstableupgrade
    -rwxr-xr-x. 1 root root 2045 Jun 10 17:16 /usr/bin/sstableutil
    -rwxr-xr-x. 1 root root 2042 Jun 10 17:16 /usr/bin/sstableverify

    Is there anything else I need to install to get this?

    1. Andy Tolbert says:

      Hi Ian, looks like this may be a bug. I see https://issues.apache.org/jira/browse/CASSANDRA-11642 indicates that sstabledump is not included in the 3.7 debian package, but it was in 3.6. For the yum package I don’t see it in 3.6 or 3.7. I’ll look into this and log a ticket.

      1. Andreas Wederbrand says:

        As a side-note it’s also missing from the docker image for 3.7.
