DataStax Developer Blog

Cassandra Column TTL Support in DSE Search

By Alex Liu -  December 19, 2012 | 5 Comments

Cassandra has supported column TTL (time to live) since version 0.7. You can specify a TTL value when you insert a column. After the column expires, its data can no longer be retrieved.

How Column TTL Works

When a TTL column is inserted, it is stored together with its TTL value. Once the column has expired, Cassandra returns an empty result for any request that reads it. The expired column still stays on disk until compaction runs. The first compaction that touches the expired column converts it into a tombstone, which frees the disk space that the column's value occupied. The tombstone is finally removed by a compaction that runs after GCGraceSeconds has passed.
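The lifecycle above can be sketched as a small toy model in Python. This is not Cassandra's implementation: the `Column` class and the `read` and `compact` functions are hypothetical names made up for illustration, and the grace period uses 864000 seconds, which is Cassandra's default gc_grace_seconds.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: Cassandra's default gc_grace_seconds (10 days).
GC_GRACE_SECONDS = 864000


@dataclass
class Column:
    """Hypothetical stand-in for a stored column; not a Cassandra class."""
    value: Optional[bytes]
    written_at: int          # write time, in seconds
    ttl: Optional[int]       # None means the column never expires
    tombstone: bool = False

    def expires_at(self):
        return None if self.ttl is None else self.written_at + self.ttl

    def is_expired(self, now):
        return self.ttl is not None and now >= self.written_at + self.ttl


def read(col, now):
    """Reads treat an expired or tombstoned column as absent."""
    if col is None or col.tombstone or col.is_expired(now):
        return None
    return col.value


def compact(col, now):
    """One compaction pass over a single column (toy model)."""
    if col is None or not col.is_expired(now):
        return col  # nothing to do for live or never-expiring columns
    if now >= col.expires_at() + GC_GRACE_SECONDS:
        return None  # grace period is over: the column is purged
    if not col.tombstone:
        # First compaction after expiry: drop the value, keep a tombstone.
        return Column(None, col.written_at, col.ttl, tombstone=True)
    return col
```

Note that `read` stops returning the value as soon as the TTL has passed, while the stored column only changes when `compact` runs. That gap between expiry and compaction is what matters for the search index discussed next.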

The Issue of DSE Search with Column TTL

DSE Search integrates Solr with Cassandra by implementing the Cassandra secondary index API. A Cassandra secondary index triggers a re-index only when column data changes. When a column expires, its data does not change until compaction runs on the column family. So between the time a column expires and the time it is compacted, DSE Search still returns the expired column. Some search applications are very sensitive to the correctness of search results, so for those applications we need to make sure that the results do not contain expired data.

How DSE Search Supports Column TTL

A PUSH approach is hard for two reasons. First, the Cassandra secondary index does not support re-indexing when a column expires. Second, the Solr secondary index is a per-row index that re-indexes the whole row, so we may need to re-index in batches. I therefore decided to poll from the Solr side. The idea is to periodically poll the column family and re-index in Solr whenever there are expired columns.

I added an expiration time field to the index document, so we can search on that field to find the expired documents and re-index them. A scheduler per Solr core periodically runs a task that searches for the expired documents and re-indexes them. The following configuration parameters are available.

  • ttl_index_rebuild_fix_rate_period: The re-indexing period, in seconds. Set it to a value smaller than the interval between compactions. To effectively disable re-indexing, set it to a very large number.
  • ttl_index_rebuild_initial_delay: The initial wait, in seconds, before the re-indexing thread starts. Set it to a value that lets the node start up quickly, so that re-indexing does not slow down startup.
  • ttl_index_rebuild_thread_pool_initial_size: The initial size of the re-indexing thread pool.
  • ttl_index_rebuild_thread_pool_max_size: The maximum size of the re-indexing thread pool, which limits the resources the pool consumes.
  • ttl_index_rebuild_thread_pool_keep_alive_time: How long an idle thread is kept alive in the pool.
  • ttl_index_rebuild_thread_pool_blocking_queue_size: It’s max concurrent running tasks to
  • max_solr_search_result_count_per_page: The number of rows fetched for each page of the search, so that we can page through all rows. It limits the resources consumed during TTL re-indexing.
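The poll-based approach described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the DSE Search code: the in-memory `store` and `index` dictionaries stand in for a column family and a Solr core, and the names `find_expired`, `build_doc`, `reindex_expired` and the `_expires_at` field are all hypothetical. The `page_size` argument plays the role of max_solr_search_result_count_per_page.

```python
def find_expired(index, now, page_size):
    """Yield pages of row keys whose expiration field is at or before `now`."""
    expired = sorted(
        key for key, doc in index.items()
        if doc.get("_expires_at") is not None and doc["_expires_at"] <= now
    )
    for start in range(0, len(expired), page_size):
        yield expired[start:start + page_size]


def build_doc(row, now):
    """Rebuild the whole index document from the live columns of one row.

    `row` maps column name -> (value, expires_at or None).
    Returns None when no live column is left.
    """
    live = {name: (value, exp) for name, (value, exp) in row.items()
            if exp is None or exp > now}
    if not live:
        return None
    doc = {name: value for name, (value, _) in live.items()}
    expiries = [exp for _, exp in live.values() if exp is not None]
    # Store the earliest upcoming expiry so the next poll can find this doc.
    doc["_expires_at"] = min(expiries) if expiries else None
    return doc


def reindex_expired(store, index, now, page_size=2):
    """One polling pass: re-index every document that has an expired column."""
    reindexed = 0
    for page in find_expired(index, now, page_size):
        for key in page:
            doc = build_doc(store.get(key, {}), now)
            if doc is None:
                del index[key]      # assumption: drop docs with no live data
            else:
                index[key] = doc    # per-row index: replace the whole document
            reindexed += 1
    return reindexed
```

A scheduler would call `reindex_expired` once every ttl_index_rebuild_fix_rate_period seconds after waiting ttl_index_rebuild_initial_delay seconds. Storing the earliest expiry per document is one possible design choice here. It keeps the polling query a simple range search on a single field.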

Good Practice to Set TTL Index Rebuilding Parameters

TTL index rebuilding consumes resources such as CPU and memory, and it reads the Cassandra column families. If there is not much TTL data in your column families, you can use a longer re-indexing period. For the common use case, check how often your compactions run, set the re-indexing period shorter than the interval between compactions, and match it to your business requirements.
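As a rough illustration of that rule of thumb, the helper below picks a period that is shorter than the compaction interval and no longer than the staleness your application can tolerate. The function name `pick_rebuild_period` and both of its parameters are hypothetical, and the numbers in the usage note are made-up examples, not recommendations.

```python
def pick_rebuild_period(compaction_interval_s, max_staleness_s):
    """Return a re-indexing period in whole seconds.

    compaction_interval_s: typical time between compactions (assumed known).
    max_staleness_s: how long expired data may stay visible in search results.
    """
    if compaction_interval_s <= 1 or max_staleness_s < 1:
        raise ValueError("intervals must be positive and allow a shorter period")
    # Strictly shorter than the compaction interval, capped by the
    # business requirement.
    return min(compaction_interval_s - 1, max_staleness_s)
```

For example, with hourly compactions and a tolerance of five minutes, `pick_rebuild_period(3600, 300)` gives 300 seconds. The result would then be used as the value of ttl_index_rebuild_fix_rate_period.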

Future Work

We may make TTL re-indexing configurable per column family and expose it on the administration page, so that it can be tuned for each column family. We may also add a per-index TTL that can be set through the Solr RESTful API. I believe those enhancements will be available in the future.

We may also make Cassandra support PUSH, so that it automatically re-indexes the data when a column expires. Another thought is to have a row-based TTL.



Comments

  1. Fredrik E says:

    Is this implemented in Datastax? If so, how to activate it?

  2. Alex Liu says:

    It’s implemented in DSE2.1.X, it’s active by default. You can change the settings in dse.yaml

  3. Fredrik E says:

    I found the settings. But no matter what I do, I cannot get rid of the keys in the solr index. Columns (fields) expire and disappear from both cassandra and solr, as do keys from cassandra after flushing, waiting for gc_grace_seconds, and compacting. But the keys in solr are stubborn. What am I missing here?

  4. Alex Liu says:

    It looks like a bug. Solr index should delete the keys if the underneath Cassandra row is deleted.

    We may fix this issue in the next DSE release. Thanks for the finding.

  5. Fredrik E says:

    After a few days servers increase their activity significantly for some reason; perhaps because of ttl expiration. There is nothing evident in the logs. How can I turn on debug logging for the ttl routines?
