Cassandra Column TTL Support in DSE Search
Cassandra has supported column TTL (time to live) since version 0.7. You can specify a TTL value when you insert a column; after the column expires, its data can no longer be retrieved.
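For example, in CQL the TTL is given at write time and the remaining lifetime can be read back later (a hypothetical users table; TTL values are in seconds):

```sql
-- Hypothetical table; USING TTL sets a per-insert TTL of one day.
INSERT INTO users (id, session_token) VALUES (42, 'abc123') USING TTL 86400;

-- TTL() returns the remaining seconds to live of a column, or null if no TTL is set.
SELECT TTL(session_token) FROM users WHERE id = 42;
```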
How Column TTL Works
When a TTL column is inserted, it is stored together with its TTL value. Once the column expires, Cassandra returns an empty result for any request that reads it. The expired column stays on disk until compaction runs: the first compaction that touches it transforms it into a tombstone, which frees the disk space the expired column occupied. The tombstone itself is finally removed by a later compaction after gc_grace_seconds.
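Concretely, the lifecycle timings work out as below (a minimal sketch with made-up timestamps, using the default gc_grace_seconds of 864000, i.e. ten days):

```java
public class TtlTimeline {
    public static void main(String[] args) {
        long writeTimeSec = 1_000_000L;  // hypothetical write timestamp, in seconds
        long ttlSec = 3_600L;            // column inserted USING TTL 3600
        long gcGraceSec = 864_000L;      // default gc_grace_seconds (10 days)

        // Reads return empty after this point, even though the data is still on disk.
        long expiresAt = writeTimeSec + ttlSec;
        // After this point, compaction may drop the tombstone entirely.
        long tombstonePurgeableAt = expiresAt + gcGraceSec;

        System.out.println("expires at: " + expiresAt);               // 1003600
        System.out.println("purgeable at: " + tombstonePurgeableAt);  // 1867600
    }
}
```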
The Issue of DSE Search with Column TTL
DSE Search integrates Solr with Cassandra by implementing the Cassandra secondary index API. A Cassandra secondary index triggers re-indexing only when column data changes. When a column expires, its data does not change until compaction runs on the column family, so between expiration and the next compaction DSE Search still returns expired columns. Some search applications are very sensitive to result correctness, so for those applications we need to make sure the search results contain no expired data.
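The staleness window can be seen in a toy model (my own illustration, not DSE code): the index is maintained only on writes, so a path that consults the index alone keeps returning data the storage layer already treats as expired.

```java
import java.util.HashMap;
import java.util.Map;

// Toy illustration: a store that honors TTL on reads,
// next to an index that is only maintained on writes.
public class StaleIndexDemo {
    static Map<String, String> storedValue = new HashMap<>(); // storage: key -> value
    static Map<String, Long> expiry = new HashMap<>();        // storage: key -> expiry time
    static Map<String, String> index = new HashMap<>();       // "secondary index"

    static void insert(String key, String value, long now, long ttl) {
        storedValue.put(key, value);
        expiry.put(key, now + ttl);
        index.put(key, value); // the index is updated only here, on write
    }

    static String readFromStore(String key, long now) {
        Long exp = expiry.get(key);
        return (exp == null || now >= exp) ? null : storedValue.get(key);
    }

    static String readFromIndex(String key) {
        return index.get(key); // no TTL check: returns stale data after expiry
    }

    public static void main(String[] args) {
        insert("row1", "value1", 100, 50);              // expires at t = 150
        System.out.println(readFromStore("row1", 200)); // null: the store honors TTL
        System.out.println(readFromIndex("row1"));      // value1: the index is stale
    }
}
```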
How DSE Search Supports Column TTL
A push model is hard to implement here: the Cassandra secondary index API does not notify the index when a column expires, and because the Solr secondary index is a per-row secondary index that re-indexes the whole row, expiration would also require batch re-indexing. I therefore decided to poll from the Solr side. The idea is to periodically poll the column family and re-index in Solr whenever there are expired columns.
I added an expiring-time field to each index document, so we can query on that field to find and re-index the expired documents. A scheduler per Solr core periodically runs a task that searches for expired documents and re-indexes them. The following configuration parameters are available:
- ttl_index_rebuild_fix_rate_period — The re-indexing frequency, in seconds. Set it to a value smaller than your compaction frequency; to effectively disable re-indexing, set it to a very large number.
- ttl_index_rebuild_initial_delay — The delay, in seconds, before the re-indexing thread starts. Set it high enough that re-indexing does not slow down node startup.
- ttl_index_rebuild_thread_pool_initial_size — The initial size of the re-indexing thread pool.
- ttl_index_rebuild_thread_pool_max_size — The maximum size of the re-indexing thread pool, which limits the resources the pool can consume.
- ttl_index_rebuild_thread_pool_keep_alive_time — How long an idle thread is kept alive in the pool.
- ttl_index_rebuild_thread_pool_blocking_queue_size — The size of the pool's blocking queue, which caps the number of re-indexing tasks that can be queued to run concurrently.
- max_solr_search_result_count_per_page — The number of rows to fetch per page while paging through all expired documents. It effectively limits the resources consumed during TTL re-indexing.
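These parameters map naturally onto the standard Java executor APIs. The sketch below is my own illustration of the scheme, not the DSE implementation; the constants mirror the configuration parameters above, with made-up values:

```java
import java.util.concurrent.*;

// Sketch only: shows how the configuration parameters map onto
// java.util.concurrent. Not the actual DSE implementation.
public class TtlRebuildSketch {
    static final long FIX_RATE_PERIOD = 300;  // ttl_index_rebuild_fix_rate_period (seconds)
    static final long INITIAL_DELAY = 60;     // ttl_index_rebuild_initial_delay (seconds)
    static final int  POOL_INITIAL_SIZE = 1;  // ttl_index_rebuild_thread_pool_initial_size
    static final int  POOL_MAX_SIZE = 4;      // ttl_index_rebuild_thread_pool_max_size
    static final long KEEP_ALIVE = 60;        // ttl_index_rebuild_thread_pool_keep_alive_time (seconds)
    static final int  QUEUE_SIZE = 16;        // ttl_index_rebuild_thread_pool_blocking_queue_size
    static final int  PAGE_SIZE = 500;        // max_solr_search_result_count_per_page

    public static void main(String[] args) {
        // Bounded pool that performs the actual re-indexing work.
        ThreadPoolExecutor rebuildPool = new ThreadPoolExecutor(
                POOL_INITIAL_SIZE, POOL_MAX_SIZE, KEEP_ALIVE, TimeUnit.SECONDS,
                new ArrayBlockingQueue<>(QUEUE_SIZE));

        // One scheduler per Solr core; polls for expired documents at a fixed rate.
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(
                () -> rebuildPool.submit(TtlRebuildSketch::reindexExpiredDocuments),
                INITIAL_DELAY, FIX_RATE_PERIOD, TimeUnit.SECONDS);

        // In a server these executors would run for the node's lifetime;
        // shut them down here so the sketch terminates.
        scheduler.shutdownNow();
        rebuildPool.shutdownNow();
    }

    static void reindexExpiredDocuments() {
        // Query the expiring-time field for documents already past their TTL,
        // paging PAGE_SIZE rows at a time, and re-index each page (details elided).
    }
}
```

The bounded queue plus the maximum pool size is what keeps a large backlog of expired documents from overwhelming the node.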
Good Practices for Setting the TTL Index Rebuild Parameters
TTL index rebuilding does consume resources: CPU, memory, and reads against Cassandra column families. If your column families hold little TTL data, you can stretch the re-indexing period to a longer interval. For the common case, check how frequently your compaction runs, set the re-indexing period shorter than the compaction interval, and match the re-indexing frequency to your business requirements.
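Putting it together, a configuration for a workload whose compaction runs roughly hourly might look like this (illustrative values and a flat layout; check the exact format of your DSE version's configuration file):

```yaml
# Illustrative values only; tune to your compaction schedule and TTL data volume.
ttl_index_rebuild_fix_rate_period: 1800   # every 30 min, shorter than the ~1 h compaction interval
ttl_index_rebuild_initial_delay: 60       # let the node finish starting up first
ttl_index_rebuild_thread_pool_initial_size: 1
ttl_index_rebuild_thread_pool_max_size: 4
ttl_index_rebuild_thread_pool_keep_alive_time: 60
ttl_index_rebuild_thread_pool_blocking_queue_size: 16
max_solr_search_result_count_per_page: 500
```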
We may make TTL re-indexing configurable per column family and expose it on the administration page, so it can be tuned for each column family. We may also add a per-index TTL that can be set through the Solr REST API. I believe these enhancements will be available in the future.
We may also change Cassandra to support a push model, so that data is automatically re-indexed when a column expires. Another thought is a row-based TTL.