TechnologyMarch 9, 2016

Debugging SSTables in 3.0 with sstabledump

Andrew Tolbert
Andrew Tolbert
Debugging SSTables in 3.0 with sstabledump

Cassandra 3.0.4 and 3.4 introduces sstabledump, a new utility for exploring SSTablessstabledump is the spiritual successor to and a replacement for sstable2jsonsstable2json was removed from Cassandra in version 3.0, but examining SSTable data is still a useful diagnostic tool. sstabledump can export SSTable content to the human readable JSON format.

How SSTable data is stored on disk has changed in Cassandra 3.0, as previously covered in ‘Putting some structure in the storage engine’. Previously, SSTables were composed of partition keys and their cells; now SSTables are composed of partitions and their rows.

This eliminates quite a bit of overhead present in prior versions of Cassandra. Metadata such as clustering key values, timestamps and TTLs are now defined at the row level, rather than repeated for each individual cell within a row. This new layout now matches how data is represented in CQL, and is more understandable.

A nice enhancement of sstabledump over sstable2json is that the utility can be run in ‘client mode’, so the system data does not have to be read to determine schema. sstabledump can be executed outside of the Cassandra environment, and cassandra.yaml is not required in the classpath for the tool to work.

Note that sstabledump only supports Cassandra 3.X SSTables.

Visualizing the Storage Engine changes in 3.0

To demonstrate sstabledump and the changes in SSTable layout in 3.0, we’ll use sstable2json and sstabledump to contrast the SSTables created by a Cassandra 2.2 node and those created by a Cassandra 3.0 node.

First, let’s generate a small SSTable for a table that represents stock ticker data. This should be done within a cqlsh session on each Cassandra cluster:


-- Create the schema


CREATE KEYSPACE IF NOT EXISTS ticker WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 };

USE ticker;


CREATE TABLE IF NOT EXISTS symbol_history (

  symbol    text,

  year      int,

  month     int,

  day       int,

  volume    bigint,

  close     double,

  open      double,

  low       double,

  high      double,

  idx       text static,

  PRIMARY KEY ((symbol, year), month, day)

) with CLUSTERING ORDER BY (month desc, day desc);


-- Insert some records


INSERT INTO symbol_history (symbol, year, month, day, volume, close, open, low, high, idx)

VALUES ('CORP', 2015, 12, 31, 1054342, 9.33, 9.55, 9.21, 9.57, 'NYSE') USING TTL 604800;


INSERT INTO symbol_history (symbol, year, month, day, volume, close, open, low, high, idx)

VALUES ('CORP', 2016, 1, 1, 1055334, 8.2, 9.33, 8.02, 9.35, 'NASDAQ') USING TTL 604800;


INSERT INTO symbol_history (symbol, year, month, day, volume, close, open, low, high)

VALUES ('CORP', 2016, 1, 4, 1054342, 8.54, 8.2, 8.2, 8.65) USING TTL 604800;


INSERT INTO symbol_history (symbol, year, month, day, volume, close, open, low, high)

VALUES ('CORP', 2016, 1, 5, 1054772, 8.73, 8.54, 8.44, 8.75) USING TTL 604800;


-- Update a column value


UPDATE symbol_history USING TTL 604800 set close = 8.55 where symbol = 'CORP' and year = 2016 and month = 1 and day = 4;

Next, let’s flush memtables to disk as SSTables using nodetool:


$ bin/nodetool flush

Then in a cqlsh session we will set a column value to null and delete an entire row to generate some tombstones:


-- Set column value to null

USE ticker;

UPDATE symbol_history SET high = null WHERE symbol = 'CORP' and year = 2016 and month = 1 and day = 1;


-- Delete an entire row DELETE FROM symbol_history WHERE symbol = 'CORP' and year = 2016 and month = 1 and day = 5;

We proceed to flush again to generate a new SSTable, and then perform a major compaction yielding a single SSTable.


$ bin/nodetool flush; bin/nodetool compact ticker

Now that we have a single SSTable representing operations on our CQL table we can use the appropriate tool to examine its contents.

C* 2.2 sstable2json Output


$ tools/bin/sstable2json data/data/ticker/symbol_history-d7197900e5aa11e590210b5b92b49507/la-3-big-Data.db


[

{"key": "CORP:2016",

 "cells": [["::idx","NASDAQ",1457495762169139,"e",604800,1458100562],

           ["1:5:_","1:5:!",1457495781073797,"t",1457495781],

           ["1:4:","",1457495762172733,"e",604800,1458100562],

           ["1:4:close","8.55",1457495767496569,"e",604800,1458100567],

           ["1:4:high","8.65",1457495762172733,"e",604800,1458100562],

           ["1:4:low","8.2",1457495762172733,"e",604800,1458100562],

           ["1:4:open","8.2",1457495762172733,"e",604800,1458100562],

           ["1:4:volume","1054342",1457495762172733,"e",604800,1458100562],

           ["1:1:","",1457495762169139,"e",604800,1458100562],

           ["1:1:close","8.2",1457495762169139,"e",604800,1458100562],

           ["1:1:high",1457495780,1457495780541716,"d"],

           ["1:1:low","8.02",1457495762169139,"e",604800,1458100562],

           ["1:1:open","9.33",1457495762169139,"e",604800,1458100562],

           ["1:1:volume","1055334",1457495762169139,"e",604800,1458100562]]},

{"key": "CORP:2015",

 "cells": [["::idx","NYSE",1457495762164052,"e",604800,1458100562],

           ["12:31:","",1457495762164052,"e",604800,1458100562],

           ["12:31:close","9.33",1457495762164052,"e",604800,1458100562],

           ["12:31:high","9.57",1457495762164052,"e",604800,1458100562],

           ["12:31:low","9.21",1457495762164052,"e",604800,1458100562],

           ["12:31:open","9.55",1457495762164052,"e",604800,1458100562],

           ["12:31:volume","1054342",1457495762164052,"e",604800,1458100562]]}

]

As previously stated, the sstable2json output demonstrates that the storage engine prior to Cassandra 2.2 represents partition keys and their cells.

A large portion of the presented data in cells is redundant. For example, when we executed INSERT queries, each cell representing a column value shares the same timestamp and TTL. Additionally, each cell contains not only the full name of the column, but also the values of the clustering keys that cell belongs to. This overhead contributes a large portion to the size of the SSTable.

C* 3.0 sstabledump output


$ tools/bin/sstabledump data/data/ticker/symbol_history-6d6bfc70e5ab11e5aeae7b4a82a62e48/ma-3-big-Data.db


[

  {

    "partition" : {

      "key" : [ "CORP", "2016" ],

      "position" : 0

    },

    "rows" : [

      {

        "type" : "static_block",

        "position" : 48,

        "cells" : [

          { "name" : "idx", "value" : "NASDAQ", "tstamp" : 1457484225583260, "ttl" : 604800, "expires_at" : 1458089025, "expired" : false }

        ]

      },

      {

        "type" : "row",

        "position" : 48,

        "clustering" : [ "1", "5" ],

        "deletion_info" : { "deletion_time" : 1457484273784615, "tstamp" : 1457484273 }

      },

      {

        "type" : "row",

        "position" : 66,

        "clustering" : [ "1", "4" ],

        "liveness_info" : { "tstamp" : 1457484225586933, "ttl" : 604800, "expires_at" : 1458089025, "expired" : false },

        "cells" : [

          { "name" : "close", "value" : "8.54" },

          { "name" : "high", "value" : "8.65" },

          { "name" : "low", "value" : "8.2" },

          { "name" : "open", "value" : "8.2" },

          { "name" : "volume", "value" : "1054342" }

        ]

      },

      {

        "type" : "row",

        "position" : 131,

        "clustering" : [ "1", "1" ],

        "liveness_info" : { "tstamp" : 1457484225583260, "ttl" : 604800, "expires_at" : 1458089025, "expired" : false },

        "cells" : [

          { "name" : "close", "value" : "8.2" },

          { "name" : "high", "deletion_time" : 1457484267, "tstamp" : 1457484267368678 },

          { "name" : "low", "value" : "8.02" },

          { "name" : "open", "value" : "9.33" },

          { "name" : "volume", "value" : "1055334" }

        ]

      }

    ]

  },

  {

    "partition" : {

      "key" : [ "CORP", "2015" ],

      "position" : 194

    },

    "rows" : [

      {

        "type" : "static_block",

        "position" : 239,

        "cells" : [

          { "name" : "idx", "value" : "NYSE", "tstamp" : 1457484225578370, "ttl" : 604800, "expires_at" : 1458089025, "expired" : false }

        ]

      },

      {

        "type" : "row",

        "position" : 239,

        "clustering" : [ "12", "31" ],

        "liveness_info" : { "tstamp" : 1457484225578370, "ttl" : 604800, "expires_at" : 1458089025, "expired" : false },

        "cells" : [

          { "name" : "close", "value" : "9.33" },

          { "name" : "high", "value" : "9.57" },

          { "name" : "low", "value" : "9.21" },

          { "name" : "open", "value" : "9.55" },

          { "name" : "volume", "value" : "1054342" }

        ]

      }

    ]

  }

]

As a consequence of the new tool's verbose output, the output payload is less compact than sstable2json. However, the enriched structure of the 3.0 storage engine, is displayed. What is apparent is that there is less repeated data, which leads to a dramatically reduced SSTable storage footprint.

Looking at the output, note that clustering, timestamp and ttl information are now presented at the row level, instead of repeating in individual cells. This change is a large factor in optimizing disk space. While column names are present in each cell, the full column names are not stored for each cell as previously. You can read more about these optimizations and others in the aforementioned blog post.

Internal Representation Format

As previously mentioned, sstabledump’s JSON representation is more verbose than sstable2jsonsstabledump also provides an alternative ‘debug’ output format that is more concise than its json counterpart. While initially difficult to understand, it is a more compact and convenient format for advanced users to grok the contents of an SSTable. To view data in this format, simply pass the -d parameter to sstabledump:

$ tools/bin/sstabledump data/data/ticker/symbol_history-6d6bfc70e5ab11e5aeae7b4a82a62e48/ma-3-big-Data.db -d

[CORP:2016]@0 Row[info=[ts=-9223372036854775808] ]: STATIC | [idx=NASDAQ ts=1457496014384090 ttl=604800 ldt=1458100814]

[CORP:2016]@0 Row[info=[ts=-9223372036854775808] del=deletedAt=1457496035375251, localDeletion=1457496035 ]: 1, 5 |

[CORP:2016]@66 Row[info=[ts=1457496014387922 ttl=604800, let=1458100814] ]: 1, 4 | [close=8.55 ts=1457496020899876 ttl=604800 ldt=1458100820], [high=8.65 ts=1457496014387922 ttl=604800 ldt=1458100814], [low=8.2 ts=1457496014387922 ttl=604800 ldt=1458100814], [open=8.2 ts=1457496014387922 ttl=604800 ldt=1458100814], [volume=1054342 ts=1457496014387922 ttl=604800 ldt=1458100814]

[CORP:2016]@141 Row[info=[ts=1457496014384090 ttl=604800, let=1458100814] ]: 1, 1 | [close=8.2 ts=1457496014384090 ttl=604800 ldt=1458100814], [high= ts=1457496034857652 ldt=1457496034], [low=8.02 ts=1457496014384090 ttl=604800 ldt=1458100814], [open=9.33 ts=1457496014384090 ttl=604800 ldt=1458100814], [volume=1055334 ts=1457496014384090 ttl=604800 ldt=1458100814]

[CORP:2015]@204 Row[info=[ts=-9223372036854775808] ]: STATIC | [idx=NYSE ts=1457496014379236 ttl=604800 ldt=1458100814]

[CORP:2015]@204 Row[info=[ts=1457496014379236 ttl=604800, let=1458100814] ]: 12, 31 | [close=9.33 ts=1457496014379236 ttl=604800 ldt=1458100814], [high=9.57 ts=1457496014379236 ttl=604800 ldt=1458100814], [low=9.21 ts=1457496014379236 ttl=604800 ldt=1458100814], [open=9.55 ts=1457496014379236 ttl=604800 ldt=1458100814], [volume=1054342 ts=1457496014379236 ttl=604800 ldt=1458100814]

Other than the inclusion of this internal representation format, the usage between sstabledump and sstable2json is exactly the same.

Share

One-stop Data API for Production GenAI

Astra DB gives JavaScript developers a complete data API and out-of-the-box integrations that make it easier to build production RAG apps with high relevancy and low latency.