Apache Cassandra 0.6 Documentation

Example: Lucandra

This document corresponds to an earlier product version. Make sure you are using the documentation that corresponds to your version.

Lucandra is a combination of Cassandra and Lucene, a high-performance, full-featured text search engine.

Originally, the Lucandra data model used one super column per search term. However, this scheme ran into the limitations of super columns. The current data model is as follows:

TermInfo Super Column Family

row keys: index_name/field/term

Each super column looks like:

docID:
{
    "Frequencies":
    {
        byte[] of list of numbers
    },
    "Positions":
    {
        byte[] of list of numbers
    },
    "Offsets":
    {
        byte[] of list of numbers
    },
    "Norms":
    {
        byte[] of list of numbers
    }
}
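
The TermInfo layout above can be modeled in a few lines of Python. This is a minimal sketch, not Lucandra's actual code: the helper names (term_key, pack_ints, unpack_ints), the sample index, field, and docID values, and the big-endian 32-bit encoding of each "list of numbers" are all assumptions made for illustration.

```python
import struct


def term_key(index_name, field, term):
    """Build a TermInfo row key in the index_name/field/term form."""
    return "/".join((index_name, field, term))


def pack_ints(numbers):
    """Pack a list of ints into bytes (big-endian 32-bit: an assumed encoding)."""
    return struct.pack(">%di" % len(numbers), *numbers)


def unpack_ints(data):
    """Reverse pack_ints: turn the stored bytes back into a list of ints."""
    return list(struct.unpack(">%di" % (len(data) // 4), data))


# One TermInfo row, modeled as nested dicts:
# row key -> super column name (docID) -> subcolumn name -> bytes
term_info = {
    term_key("wiki", "body", "cassandra"): {
        "doc42": {  # hypothetical docID
            "Frequencies": pack_ints([2]),
            "Positions": pack_ints([5, 17]),
        },
    },
}
```

Reading the positions of the term in one document is then a lookup by row key, docID, and subcolumn name, followed by unpack_ints on the stored bytes.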

DocInfo Column Family

row keys: index_name/docID

Each row represents one document and looks like:

field1:
{
    binary content of this field
},

field2:
{
    binary content of this field
},

etc ...

The choice of how keys are formed is important for two reasons. First, using a different key for every term and document spreads the workload around the cluster. Second, if an OrderPreservingPartitioner is used, you can efficiently search keys using wildcards, sort keys, and perform range queries. One alternative to including the index name in row keys would be to use a separate keyspace for each index (depending on the total number of indexes and how often they are created). This might make it easier to manage the distribution of tokens to deal with hotspots caused by the OrderPreservingPartitioner.
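
To illustrate why ordered keys make wildcard searches cheap, the sketch below treats a sorted Python list as a stand-in for rows stored in key order. It is not Cassandra's API: the prefix_range helper and the sample keys are hypothetical, and the "\uffff" upper bound assumes keys contain only characters below U+FFFF.

```python
import bisect


def prefix_range(sorted_keys, prefix):
    """Return every key that starts with prefix, using two binary searches."""
    lo = bisect.bisect_left(sorted_keys, prefix)
    # "\uffff" sorts after any character we assume can follow the prefix.
    hi = bisect.bisect_left(sorted_keys, prefix + "\uffff")
    return sorted_keys[lo:hi]


# Hypothetical TermInfo row keys, kept in sorted order the way an
# order-preserving partitioner would lay them out.
keys = sorted([
    "wiki/body/cassandra",
    "wiki/body/cassette",
    "wiki/body/lucene",
    "wiki/title/cassandra",
    "blog/body/cassandra",
])

# Wildcard query "cass*" on the body field of the wiki index.
matches = prefix_range(keys, "wiki/body/cass")
```

Because all matching keys are contiguous in sorted order, the wildcard becomes a single range scan rather than a lookup per candidate term. With a hash-based partitioner the same keys would be scattered, which is why this kind of query depends on the OrderPreservingPartitioner.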

The docID used for super column names and row keys is randomly generated. A UUID would also work well here.
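
As a small sketch of the UUID option, the following uses Python's standard uuid module to generate a random docID and form a DocInfo row key. The function names new_doc_id and doc_key and the hex string form of the ID are assumptions for illustration; Lucandra's own ID generation may differ.

```python
import uuid


def new_doc_id():
    """Return a random (version 4) UUID as a 32-character hex string."""
    return uuid.uuid4().hex


def doc_key(index_name, doc_id):
    """Build a DocInfo row key in the index_name/docID form."""
    return "%s/%s" % (index_name, doc_id)


# Hypothetical usage: a fresh key for a new document in the "wiki" index.
key = doc_key("wiki", new_doc_id())
```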

As you can see, the data has been organized to minimize the number of queries needed for a search. Only one read is needed for each term in a search query. When the results of the term queries have been processed, only one query is needed to retrieve each matching document (or even just the matching field).
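
The read pattern described above can be sketched with an in-memory model that counts reads. Everything here is a stand-in rather than Lucandra's implementation: CountingStore, search, the sample rows, and the choice of an AND query over the terms are assumptions made to show the "one read per term, one read per matching document" behavior.

```python
class CountingStore:
    """Dict-backed stand-in for a column family that counts row reads."""

    def __init__(self, rows):
        self.rows = rows
        self.reads = 0

    def get(self, key):
        self.reads += 1
        return self.rows.get(key, {})


def search(term_info, doc_info, index_name, field, terms):
    """AND query: one TermInfo read per term, one DocInfo read per match."""
    doc_ids = None
    for term in terms:
        row = term_info.get("%s/%s/%s" % (index_name, field, term))
        ids = set(row)  # super column names in a TermInfo row are docIDs
        doc_ids = ids if doc_ids is None else doc_ids & ids
    return {
        doc_id: doc_info.get("%s/%s" % (index_name, doc_id))
        for doc_id in sorted(doc_ids or ())
    }


# Hypothetical sample data; term vectors are left empty for brevity.
term_info = CountingStore({
    "wiki/body/cassandra": {"d1": {}, "d2": {}},
    "wiki/body/lucene": {"d2": {}, "d3": {}},
})
doc_info = CountingStore({
    "wiki/d1": {"title": b"Cassandra"},
    "wiki/d2": {"title": b"Lucandra"},
    "wiki/d3": {"title": b"Lucene"},
})

results = search(term_info, doc_info, "wiki", "body", ["cassandra", "lucene"])
```

After this call, term_info.reads equals the number of query terms and doc_info.reads equals the number of matching documents, which mirrors the access pattern the data model was designed for.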