DataStax Developer Blog

Introduction to Composite Columns

By January 18, 2012 | 18 Comments

Data modeling in Apache Cassandra is probably one of the most difficult concepts for new users to grasp, particularly those with a lot of experience in traditional RDBMS systems. The elusive sweet spot of “sorted, wide rows” can be difficult to find with some models, particularly those where the column family currently relies on super columns or is “static” (similar in design to a table in an RDBMS modeling, say, the attributes of a user per row). Composite columns, the subject of this entry, help adapt some of these models and also provide new indexing functionality for workloads, such as time series data, that are already known to perform well.

Sorted, wide rows are useful because they take excellent advantage of comparator ordering to provide efficient access into data by minimizing disk seeks. As data volume increases, they further cut down on the overhead associated with large numbers of skinny rows, which make optimizations like key indexes and bloom filters less effective. Composites can help adapt some models to take full advantage of these efficiencies by facilitating ordering of nested components.

This entry will go through some practical applications of the composite comparator type in an attempt to demystify their usage and present the usefulness of their application to your data model.

At a high level, a composite comparator can be thought of simply as a comparator composed of several other comparators. Composite comparators provide the following major benefits for data modeling:

  • custom inverted search indexes: when you want more control over the CF layout than a secondary index
  • a replacement for super columns: both a means to offset some of the worst performance penalties associated with them and a way to extend the model to provide an arbitrary level of nesting
  • grouping otherwise static skinny rows into wider rows for greater efficiency

The current composite comparator implementations come in two forms: CompositeType and DynamicCompositeType. This entry will discuss the former.

If you want to understand some of the history of how composite comparators came about, you can take a look back to see how and why they were added to Apache Cassandra:

https://issues.apache.org/jira/browse/CASSANDRA-2231

Though long, this issue thread shows some good discussions on why certain choices were made. It is worth a read if you ever want to explore composites at a code level. I also recommend the following presentation as background on indexing techniques in general with Apache Cassandra: http://www.slideshare.net/edanuff/indexing-in-cassandra (note: Ed Anuff was the original contributor of the CompositeType comparator).

To see this functionality in action, we are going to experiment with some publicly accessible timezone data as our test set. In this case, we are storing the timezone for major cities in the United States. The format of this data is pretty simple and in raw form contains the following: two letter country code, two letter state/province code, city name, and timezone.

Where previously, we would have potentially relied on super columns or a static column family to model this data, in our composite-oriented model, we will combine the first three fields for the composite column name and the timezone as the column value. This has the benefit of being able to collapse the data into a single column. This will make a column of data look something like the following:

US:TX:Austin=America/Chicago

Note that for larger data sets, you would want to spread the columns out among rows in order to avoid hotspots on any one node.

When we talk about composites, we can refer to the individual members as components. So, in this model, we have three components for the composite comparator name: The two letter country code, two letter state code, and city name. The value for the column is the timezone in which the city is located.
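As a rough mental model (plain Java collections, not the Hector or Cassandra API), the single wide row can be pictured as a sorted map whose keys are three-part names compared component by component. The class name, the helper and the two California entries are made up for illustration; only the Austin entry comes from the example above.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.TreeMap;

// Toy model only: a wide row as a sorted map keyed by (country, state, city).
class WideRowSketch {

    // Compare names component by component, left to right,
    // much like a composite of three UTF8Type parts would.
    static final Comparator<List<String>> BY_COMPONENT = (a, b) -> {
        int n = Math.min(a.size(), b.size());
        for (int i = 0; i < n; i++) {
            int cmp = a.get(i).compareTo(b.get(i));
            if (cmp != 0) return cmp;
        }
        return Integer.compare(a.size(), b.size()); // a shorter prefix sorts first
    };

    static TreeMap<List<String>, String> sampleRow() {
        TreeMap<List<String>, String> row = new TreeMap<>(BY_COMPONENT);
        row.put(Arrays.asList("US", "TX", "Austin"), "America/Chicago");
        row.put(Arrays.asList("US", "CA", "San Diego"), "America/Los_Angeles");
        row.put(Arrays.asList("US", "CA", "Fresno"), "America/Los_Angeles");
        return row;
    }

    public static void main(String[] args) {
        // Iteration follows the comparator order, not the insertion order.
        sampleRow().forEach((name, tz) ->
            System.out.println(String.join(":", name) + "=" + tz));
    }
}
```

Because the columns are kept in this order, all the names that share a leading component sit next to each other, which is what the slice queries later in this entry rely on.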

With this particular data model, we can explore some of the features of using composite comparators as an inverted search index to take full advantage of Apache Cassandra’s storage format. We will use the Java client Hector for examples to see how to search broadly within a row, initially returning a few thousand results then increasingly narrow the search criteria to just a few records as we add clauses to the composite column range used in the slice query.

You can download, run and experiment with this code via the following project on github: http://github.com/zznate/cassandra-tutorial

Particularly, we will be looking at CompositeQuery and CompositeDataLoader (though new users to the Hector API or Apache Cassandra in general may find the rest of the project contents helpful as well).

So first, we’ll need to create the keyspace and column family:


create keyspace Tutorial
with strategy_options = {replication_factor:1}
and placement_strategy = 'org.apache.cassandra.locator.SimpleStrategy';
use Tutorial;

create column family CountryStateCity
with comparator = 'CompositeType(UTF8Type,UTF8Type,UTF8Type)'
and key_validation_class = 'UTF8Type'
and default_validation_class = 'UTF8Type';

Note how in the comparator declaration, we combine the types we are going to use according to the position of the relative component. The order is important and must be maintained once declared or operations will fail with InvalidRequestException, much as they would if you used the wrong type on any other non-composite column.

Now, with the column family in-place, we can insert some data using CompositeDataLoader* with the following invocation of maven:


mvn -e exec:java -Dexec.mainClass="com.datastax.tutorial.composite.CompositeDataLoader"

This class reads a CSV file from the data directory and inserts a few thousand columns of data under a single “ALL” key.
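To make the mapping from a CSV record to a column concrete, here is a small stand-alone sketch. It is not the tutorial's CompositeDataLoader: the class name is hypothetical, and it assumes each record is comma-separated in the order country, state, city, timezone.

```java
// Hypothetical sketch of turning one CSV record into a column;
// this is not the tutorial's CompositeDataLoader.
class CsvLineSketch {

    // Splits "country,state,city,timezone" into its four trimmed fields.
    // The field order is an assumption made for this example.
    static String[] parse(String line) {
        String[] fields = line.split(",", -1);
        if (fields.length != 4) {
            throw new IllegalArgumentException("expected 4 fields but got " + fields.length);
        }
        for (int i = 0; i < fields.length; i++) {
            fields[i] = fields[i].trim();
        }
        return fields;
    }

    // Renders the column the way this entry writes it:
    // the three name components joined by ':' and then '=' plus the value.
    static String describe(String[] f) {
        return f[0] + ":" + f[1] + ":" + f[2] + "=" + f[3];
    }

    public static void main(String[] args) {
        System.out.println(describe(parse("US,TX,Austin,America/Chicago")));
    }
}
```

The first three fields become the components of the composite column name and the fourth becomes the column value.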

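Now execute the CompositeQuery** class. The first set of results we are looking for is all columns located in the United States (prefixed with “US”). CompositeQuery builds the start and finish composites for this slice as follows, where startArg is the prefix being searched (“US” here):

Composite start = compositeFrom(startArg, Composite.ComponentEquality.EQUAL);
Composite end = compositeFrom(startArg, Composite.ComponentEquality.GREATER_THAN_EQUAL);

Run the class with: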


mvn -e exec:java -Dexec.mainClass="com.datastax.tutorial.composite.CompositeQuery"

You may have noticed the GREATER_THAN_EQUAL clause on the finish Composite column. If you are wondering why this is not an EQUAL clause, you are not alone. This is a common mistake for most users new to composites. The reason behind this has to do with how each component is encoded.

The encoding of a component is made up of three parts: the length of the value, the value itself, and an “end of component” byte. It is this last part, the e-o-c byte, that controls slicing operations. In our case, GREATER_THAN_EQUAL sets this byte to 1 on the finish composite. When applied to the finish composite of the slice operation, in conjunction with EQUAL on the start composite, it means “give me all the columns whose first component is ‘US’.” We’ll explore the other cases as we continue through the example.
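The three-part layout can be sketched in a few lines of plain Java. This is an illustration rather than Hector's or Cassandra's own code: the class and method names are made up, and it assumes a 2-byte big-endian length and e-o-c values of 0 for EQUAL, 1 for GREATER_THAN_EQUAL and -1 for LESS_THAN_EQUAL.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Sketch of the per-component layout: length, value, end-of-component byte.
class CompositeEncodingSketch {

    // Encodes one UTF-8 component as
    // <2-byte big-endian length><value bytes><e-o-c byte>.
    static byte[] encodeComponent(String value, int eoc) {
        byte[] raw = value.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write((raw.length >> 8) & 0xFF); // length, high byte
        out.write(raw.length & 0xFF);        // length, low byte
        out.write(raw, 0, raw.length);       // the value itself
        out.write(eoc);                      // assumed: 0 EQUAL, 1 GREATER_THAN_EQUAL, -1 LESS_THAN_EQUAL
        return out.toByteArray();
    }

    // Formats bytes as space-separated two-digit hex values.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x ", b));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println("start : " + hex(encodeComponent("US", 0)));
        System.out.println("finish: " + hex(encodeComponent("US", 1)));
    }
}
```

The two encodings differ only in the final byte (00 versus 01), and that single byte is what turns the finish composite into an upper bound for everything prefixed with “US”.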

So, with the current structure of the query, this example is not terribly interesting: it just returns all the columns prefixed with “US” which, in our subset of data, is the whole row.

Let’s narrow the search range down to California (abbreviated as “CA”) in our second component. Like our first example, the start clause contains an EQUAL expression and the finish clause a GREATER_THAN_EQUAL, but both now apply to the second component. This gives us all the columns for the state of California. Note that the first component of the finish composite must now change to EQUAL, since we are now comparing on the second component: this sets the e-o-c byte of the first component back to zero so the composite comparator will move on to examining the next component. Not doing so will result in an InvalidRequestException.


Composite start = compositeFrom(startArg, Composite.ComponentEquality.EQUAL);
Composite end = compositeFrom(startArg, Composite.ComponentEquality.EQUAL);
start.addComponent(1,"CA",Composite.ComponentEquality.EQUAL);
end.addComponent(1,"CA",Composite.ComponentEquality.GREATER_THAN_EQUAL);
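To see why this combination selects exactly the California columns, here is a simplified stand-alone model of the comparison. It is a sketch under stated assumptions, not Cassandra's or Hector's code: all names are hypothetical, stored column names are assumed to carry an e-o-c of 0 on every component, and a slice bound may only carry a non-zero e-o-c on its last component.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified model of composite slicing; not Cassandra's or Hector's actual code.
class CompositeSliceSketch {

    // A composite name or slice bound: values plus one e-o-c flag per value.
    static final class Name {
        final String[] values;
        final int[] eoc;
        Name(String[] values, int[] eoc) { this.values = values; this.eoc = eoc; }
    }

    // Stored column names carry e-o-c 0 on every component (assumption).
    static Name column(String... values) {
        return new Name(values, new int[values.length]);
    }

    // A slice bound: every component but the last is EQUAL (0);
    // only the last one carries lastEoc.
    static Name bound(int lastEoc, String... values) {
        int[] eoc = new int[values.length];
        eoc[values.length - 1] = lastEoc;
        return new Name(values, eoc);
    }

    // Component-wise comparison; when two values tie, the e-o-c flags break the tie.
    static int compare(Name a, Name b) {
        int n = Math.min(a.values.length, b.values.length);
        for (int i = 0; i < n; i++) {
            int cmp = a.values[i].compareTo(b.values[i]);
            if (cmp != 0) return cmp;
            if (a.eoc[i] != b.eoc[i]) return Integer.compare(a.eoc[i], b.eoc[i]);
        }
        return Integer.compare(a.values.length, b.values.length);
    }

    // Returns the columns that fall between start and end, inclusive.
    static List<String> slice(List<Name> row, Name start, Name end) {
        List<String> hits = new ArrayList<>();
        for (Name col : row) {
            if (compare(start, col) <= 0 && compare(col, end) <= 0) {
                hits.add(String.join(":", col.values));
            }
        }
        return hits;
    }

    static List<Name> sampleRow() { // listed in comparator order
        return Arrays.asList(
            column("US", "AZ", "Phoenix"),
            column("US", "CA", "Fresno"),
            column("US", "CA", "San Diego"),
            column("US", "CO", "Denver"));
    }

    public static void main(String[] args) {
        Name start = bound(0, "US", "CA"); // EQUAL on "US", EQUAL on "CA"
        Name end = bound(1, "US", "CA");   // EQUAL on "US", GREATER_THAN_EQUAL on "CA"
        System.out.println(slice(sampleRow(), start, end));
    }
}
```

In this model the start bound sorts just before every column prefixed with US:CA and the finish bound sorts just after them. If the finish bound used EQUAL on “CA” as well, it would sort before those columns and the slice would come back empty.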

Running CompositeQuery again will produce a result set limited to California. To further narrow down the search to cities beginning with the prefix “San ”, we add the following for the third component:


start.addComponent(2,"San ",Composite.ComponentEquality.EQUAL);
end.addComponent(2, "San " + Character.MAX_VALUE, Composite.ComponentEquality.GREATER_THAN_EQUAL);

This gives us a list of all columns with a city name starting with “San ”. Note that Character.MAX_VALUE is appended to the finish value to take advantage of the comparator ordering.
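The same prefix trick can be tried out on an ordinary sorted set. The sketch below uses only java.util, with made-up class and method names; note that it compares Java chars, whereas Cassandra's UTF8Type compares the encoded bytes, so treat it as an analogy rather than the exact server-side behavior.

```java
import java.util.Arrays;
import java.util.NavigableSet;
import java.util.TreeSet;

// Illustration with a plain sorted set of city names.
class PrefixRangeSketch {

    // Returns every entry that sorts between prefix and prefix + '\uffff', inclusive.
    static NavigableSet<String> withPrefix(NavigableSet<String> sorted, String prefix) {
        return sorted.subSet(prefix, true, prefix + Character.MAX_VALUE, true);
    }

    public static void main(String[] args) {
        NavigableSet<String> cities = new TreeSet<>(Arrays.asList(
            "Sacramento", "San Diego", "San Francisco", "San Jose", "Santa Ana", "Stockton"));
        // "Santa Ana" is excluded because the prefix includes the trailing space.
        System.out.println(withPrefix(cities, "San "));
    }
}
```

Appending the largest possible character gives an upper bound that sorts after any realistic city name sharing the prefix, so the range covers the whole prefix without having to know the last matching name in advance.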

A similar query making use of the equality operations, say to select all the cities for Wyoming and West Virginia (“WY” and “WV” respectively), could be constructed as follows:


start.addComponent(1,"WV",Composite.ComponentEquality.EQUAL);
end.addComponent(1,"WY",Composite.ComponentEquality.GREATER_THAN_EQUAL);

Null values are also allowed on insertion. For example, if we wanted a “state level” column with no city name, we could insert with only two components (or one!) of the composite populated. Obviously, in doing so, you would want to check for null when retrieving the right-most components of the composite from a slice.

Hopefully that is enough of an overview to give you an idea of how powerful composite comparators can be for some use cases. The examples above are all MIT licensed, so make whatever use of them you can.

*Though it deals with a trivial amount of data in a simple format, CompositeDataLoader can be used as a model for application-level parallelized bulk loading with the Hector API. Feel free to experiment with this approach for your application’s bulk loading needs.

** The CompositeQuery class makes use of an auto-paging feature built into Hector via the ColumnSliceIterator class. CompositeQuery uses this class in conjunction with an inner java.lang.Iterable implementation to provide clean iteration semantics back up to the caller. Use this as an example of how to retrieve a moderate to large number of columns from a row.
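The general idea behind this kind of auto-paging can be sketched with plain Java collections. This is not Hector's ColumnSliceIterator, whose internals may differ: the class below is hypothetical, and a sorted set stands in for the row, with each fetchPage call playing the role of one slice query.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NavigableSet;
import java.util.NoSuchElementException;
import java.util.TreeSet;

// Generic paging sketch over a sorted set of column names.
class PagedSliceSketch implements Iterable<String> {
    private final NavigableSet<String> row;
    private final int pageSize;

    PagedSliceSketch(NavigableSet<String> row, int pageSize) {
        this.row = row;
        this.pageSize = pageSize;
    }

    // Stand-in for one slice query: up to pageSize names strictly after
    // 'after', or from the beginning of the row when 'after' is null.
    private List<String> fetchPage(String after) {
        List<String> page = new ArrayList<>();
        Iterator<String> it = (after == null ? row : row.tailSet(after, false)).iterator();
        while (it.hasNext() && page.size() < pageSize) {
            page.add(it.next());
        }
        return page;
    }

    @Override
    public Iterator<String> iterator() {
        return new Iterator<String>() {
            private List<String> page = fetchPage(null);
            private int pos = 0;

            @Override
            public boolean hasNext() {
                if (pos == page.size() && !page.isEmpty()) {
                    // Page used up: fetch the next one, starting after the last name seen.
                    page = fetchPage(page.get(page.size() - 1));
                    pos = 0;
                }
                return pos < page.size();
            }

            @Override
            public String next() {
                if (!hasNext()) throw new NoSuchElementException();
                return page.get(pos++);
            }
        };
    }

    public static void main(String[] args) {
        NavigableSet<String> row = new TreeSet<>(Arrays.asList("a", "b", "c", "d", "e"));
        for (String name : new PagedSliceSketch(row, 2)) {
            System.out.println(name);
        }
    }
}
```

The caller sees a single iteration while only one page of column names is held in memory at a time.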



Comments

  1. Does CQL support Composite columns?

    Thanks,
    Charlie

  2. Nate McCall says:

    Composite support via CQL will be available in Apache Cassandra 1.1. See https://issues.apache.org/jira/browse/CASSANDRA-2474 for details.

  3. Cool, Thanks. We may use this feature to do composite index later in some new projects.

    I’m an Oracle database developer, it’s not easy for me to understand Hector API.

  4. Edward Capriolo says:

    Hector really does a great job with composite types. Even an old school thrift user like myself can appreciate stuff like: start.addComponent(2,”San”,Composite.ComponentEquality.EQUAL

    This is much more slick than building byte arrays together and trying to remember whether Composite.ComponentEquality.EQUAL is 1, 0, or -1 (or whatever it is)

  5. Luke Collins says:

    Hi Nate,

    I have been following the above tutorial, but I am getting this error:
    Printing all columns starting with US
    Exception in thread "main" me.prettyprint.hector.api.exceptions.HInvalidRequestException: InvalidRequestException(why:Invalid bytes remaining after an end-of-component at component0)

    when I do the following search for "CA"

    Composite start = compositeFrom(startArg, Composite.ComponentEquality.EQUAL);
    Composite end = compositeFrom(startArg, Composite.ComponentEquality.GREATER_THAN_EQUAL);
    start.addComponent(1,"CA",Composite.ComponentEquality.EQUAL);
    end.addComponent(1,"CA",Composite.ComponentEquality.GREATER_THAN_EQUAL);

    Not sure what I am doing wrong?

    Thanks Luke

  6. Chad Kienle says:

    Luke,

    I ran into the same error. I was able to get it to work when I changed the following line:

    Composite end = compositeFrom(startArg, Composite.ComponentEquality.GREATER_THAN_EQUAL);

    to:

    Composite end = compositeFrom(startArg, Composite.ComponentEquality.EQUAL);

    I have found that in order to query on the range of a specific component, all the components to the left of it (with lower indexes) need to be specifically set (with ComponentEquality.EQUAL).

    Not sure if there is a way to query on ranges of more than one component.

    Hope this helps.

    -Chad

  7. Nate McCall says:

    @Luke – the second invocation of “compositeFrom” for the end component in your example needs to have ComponentEquality.EQUAL. Ie.
    Composite end = compositeFrom(startArg, Composite.ComponentEquality.EQUAL);

  8. John Liberty says:

    Can I use this to get all columns where the city name begins with “San “, over a range of states (or all)?
    What setup/configuration for that?

  9. Yanis Biziuk says:

    Hello Nate McCall,

    I just ran CompositeQuery from source (before I ran scripts and CompositeDataLoader)
    So, in method ‘main’ I add two lines
    start.addComponent(1,"CA",Composite.ComponentEquality.EQUAL);
    end.addComponent(1,"CA",Composite.ComponentEquality.GREATER_THAN_EQUAL);

    and I got the error "Invalid bytes remaining after an end-of-component at component0"

    What’s wrong?

    final code ‘main’

    public static void main(String []args) {
    init();

    CompositeQuery compositeQuery = new CompositeQuery();

    // Note the use of ‘equal’ and ‘greater-than-equal’ for the start and end.
    // this has to be the case when we want all
    Composite start = compositeFrom(startArg, Composite.ComponentEquality.EQUAL);
    Composite end = compositeFrom(startArg, Composite.ComponentEquality.GREATER_THAN_EQUAL);

    start.addComponent(1,"CA",Composite.ComponentEquality.EQUAL);
    end.addComponent(1,"CA",Composite.ComponentEquality.GREATER_THAN_EQUAL);

    compositeQuery.printColumnsFor(start,end);

    }

  10. Yim says:

    This is a great intro, Nate, and I am looking forward to Part 2.

    However, I have to admit I got confused trying to understand the reasons behind EQUAL vs GREATER_THAN_EQUAL. It seems all the code snippets in the article use GREATER_THAN_EQUAL for the “end” Composite; yet quite a few commenters (myself included) got an error and the remedy is to use EQUAL instead.

    This reinforces exactly what you said in the article that “this is a common mistake for most users new to composites”. I read the paragraph on the e-o-c byte encoding several times, but it didn’t help me much in trying to understand EQUAL vs G_T_E. So I am just wondering if you could elaborate on this a bit more. Thanks!

    – Y.

  11. John Liberty says:

    >>Can I use this to get all columns where the city name begins with “San “, over a range of states (or all)?

    Well, I figured out the answer to my question… Basically, no… Here’s my explanation…
    I was trying to treat the composite searching as if it produced nested for loops for each component. In reality, since the columns are sorted, you are specifying the overall start and end terms for a contiguous block of entries.
    so, if columns look like:
    A:A:A
    A:B:B
    A:B:C
    A:C:B
    B:A:A
    B:B:A
    B:B:B
    C:A:B

    For a search from A:A:A to B:B:B:
    My original thinking was that I would get A:A:A, A:B:B, B:A:A, B:B:A, B:B:B (skipping A:B:C, A:C:B)..
    But that’s not possible, the results would be the contiguous columns from A:A:A to B:B:B

    Now, if I can just figure out when/if I should use LESS_THAN_EQUAL :\

  12. Dominique De Vito says:

    Hi,

    It does not work for Composite used as row keys.

    As far as I have tested:
    start.addComponent(2,"San ",Composite.ComponentEquality.EQUAL);
    end.addComponent(2, "San " + Character.MAX_VALUE, Composite.ComponentEquality.GREATER_THAN_EQUAL);

    doesn’t work for Composite used as row keys, in order to get all rows with 2nd part of the key beginning with “San ” !

    More precisely, it doesn’t work for my tests with such CF definition:

    create column family CF_ROW with
    comparator=UTF8Type and
    default_validation_class=UTF8Type and
    key_validation_class='CompositeType(UTF8Type, UTF8Type)';

    and Hector + Cassandra 1.0.7 + ByteOrderedPartitioner

    See my more complete report about such problem: http://mail-archives.apache.org/mod_mbox/cassandra-user/201203.mbox/%3C20082_1332426679_4F6B37B7_20082_4077_1_AEE40020481AB74EAADC798FC2BC7C4A01B0AED8783E%40THSONEA01CMS03P.one.grp%3E

    In order to get all rows with 2nd part of the Composite key beginning with “San “, is there another solution, different than the one for Composite used as column names ?

    Thanks.

    Dominique

    PS: but your solution works, as described/expected in your post/article, for Composite used as column names.

  13. cyril says:

    @John Liberty

    from this sample:
    lat: A lon: A Name: wfow
    lat: A lon: B Name: wofw
    lat: A lon: C Name: wofw
    lat: A lon: F Name: wodw
    lat: B lon: A Name: wofw
    lat: B lon: B Name: wgreatgreatow
    lat: B lon: C Name: grfeat
    lat: C lon: A Name: grebat
    lat: C lon: D Name: great

    Composite start = new Composite();
    start.addComponent("A", StringSerializer.get(),"", Composite.ComponentEquality.EQUAL);
    start.addComponent("A", StringSerializer.get(),"", Composite.ComponentEquality.EQUAL);

    Composite end = new Composite();
    end.addComponent("B", StringSerializer.get(),"", Composite.ComponentEquality.EQUAL);
    end.addComponent("B", StringSerializer.get(),"", Composite.ComponentEquality.GREATER_THAN_EQUAL);

    SliceQuery sliceQuery =
    HFactory.createSliceQuery(tutorialKeyspace, StringSerializer.get(), new CompositeSerializer(), StringSerializer.get());

    will return:

    lat: A lon: A Name: wfow
    lat: A lon: B Name: wofw
    lat: A lon: C Name: wofw
    lat: A lon: F Name: wodw
    lat: B lon: A Name: wofw
    lat: B lon: B Name: wgreatgreatow

    So are the second arguments ignored?
    What methods should be used then? createIndexedSlicesQuery instead of sliceQuery, maybe?

  14. I have a sandbox running on windows 7 and am unable to achieve the right sliceQuery based on the example above. If I do a search on “CA”, then I get all the states names including and after “C” and am making sure to use below:

    start.addComponent(0,"US",Composite.ComponentEquality.EQUAL);
    start.addComponent(1,"CA",Composite.ComponentEquality.EQUAL);

    and

    end.addComponent(1,"CA",Composite.ComponentEquality.EQUAL);
    end.addComponent(1,"CA",Composite.ComponentEquality.GREATER_THAN_EQUAL);

    However, when I do a search on "WV", it does not return states before "W", which is good.

  15. Typo in my above code. It should be:

    start.addComponent(0,"US",Composite.ComponentEquality.EQUAL);
    start.addComponent(1,"CA",Composite.ComponentEquality.EQUAL);

    and

    end.addComponent(0,"US",Composite.ComponentEquality.EQUAL);
    end.addComponent(1,"CA",Composite.ComponentEquality.GREATER_THAN_EQUAL);

  16. Shouvanik Haldar says:

    When I run the code snippet
    start.addComponent(2,"San ",Composite.ComponentEquality.EQUAL);
    end.addComponent(2, "San " + Character.MAX_VALUE, Composite.ComponentEquality.GREATER_THAN_EQUAL);

    I get an index out of bounds exception which says
    Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 2, Size: 1

    Please help.

    Regards,
    Shouvanik

  17. Madhu says:

    Can we have multiple composite columns in a column family?
