Hello:
A major aspect of my application is that, in addition to tenant-provided content, there is a quantity of 'canonical content' that gets updated once a quarter. I make it available for both search and retrieval, and I'm looking for a more production-worthy way to load each update.
Prior to the release of DSE 2.0, this content existed in both Cassandra (non-search) and Solr. For the former, the content build process generates SSTables using org.apache.cassandra.io.sstable.SSTableSimpleUnsortedWriter; for the latter, it generates native Solr XML formatted files. Being pre-production at the time, and only using a single Cassandra node, it was pretty straightforward to shut down Cassandra and replace the SSTables with the new ones. Likewise for Solr, I could remove the old index and ingest the new Solr XML files.
Regardless of DSE 2, moving to multiple Cassandra nodes requires more finesse than my procedure above. It seems what I need to do in that case is described here:
http://www.datastax.com/dev/blog/bulk-loading
However, Sylvain's blog post is dated August 1, 2011 and speaks of Cassandra 0.8.1. Is using sstableloader still the way to go in the Cassandra 1.0.x, and soon to be 1.1, world? I'm already generating the SSTables, but they need to be uploaded to the cluster rather than just a single node.
With DSE 2, it sounds like I can either continue to index the existing Solr XML files I generate, or upload the new content directly into Cassandra, also using sstableloader. The search-related keyspace and column family (Solr core) definition may or may not change as well for a given quarterly update.
With Solr today, I can issue a deletion query that removes all documents, then process a series of files to define the new content, and at the end of that, issue a commit. Once the commit has completed, new queries use the new data. Until then, current searches continue to refer to the old content. Is there a way to accomplish something similar going through Cassandra via sstableloader? I'm guessing the content will be updated as it goes, becoming eventually consistent across the cluster.
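For reference, the delete / add / commit sequence described above can be sketched as the ordered list of Solr XML update messages. This is only an illustration: the function name and the sample document are made up, and actually posting each payload to the core's update handler is left out because the URL depends on the deployment.

```python
from xml.sax.saxutils import escape, quoteattr


def build_update_messages(docs):
    """Return the ordered Solr XML update payloads for a full content swap.

    `docs` is a list of dicts mapping field name -> value or list of values.
    Nothing becomes visible to searches until the final commit is processed.
    """
    # 1. Delete every existing document (match-all query).
    messages = ["<delete><query>*:*</query></delete>"]

    # 2. One <add> message per document.
    for doc in docs:
        fields = []
        for name, values in doc.items():
            # Multi-valued fields repeat the <field> element once per value.
            if not isinstance(values, (list, tuple)):
                values = [values]
            for value in values:
                fields.append(
                    "<field name=%s>%s</field>" % (quoteattr(name), escape(str(value)))
                )
        messages.append("<add><doc>%s</doc></add>" % "".join(fields))

    # 3. A single commit at the end makes the new content searchable.
    messages.append("<commit/>")
    return messages


if __name__ == "__main__":
    sample = [{"id": "1", "n_macromolecule_species": ["Human", "Mouse"]}]
    for message in build_update_messages(sample):
        print(message)
```

The point of the sketch is the ordering: searches keep seeing the old documents until the last message is processed, which is the behavior I would like to reproduce when loading through Cassandra.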
I suppose another approach would be to create a new version of the core via the HTTP API, bulk load the content into it, and then instruct my app to refer to the new core. The name of the current content core could be stored in Cassandra itself, where the app would read it and start using it. At some point, the old core can be deleted. I kind of like this approach since I'll only have to deal with SSTables and Cassandra, and not Solr XML etc.
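A minimal sketch of that versioned-core idea follows, with the caveat that everything in it is hypothetical: a plain dict stands in for the Cassandra row holding the current core name, and `create_core` / `bulk_load` are placeholder callables for the HTTP core creation and the sstableloader run.

```python
class CorePointer:
    """Holds the name of the core the app should query.

    `store` is any dict-like object; in a real deployment it would be a
    row in Cassandra (an assumption here, not something tested).
    """

    def __init__(self, store, key="current_content_core"):
        self.store = store
        self.key = key

    def current(self):
        return self.store[self.key]

    def publish(self, new_core):
        """Switch readers to `new_core`; return the previous core name, if any."""
        previous = self.store.get(self.key)
        self.store[self.key] = new_core
        return previous


def quarterly_update(pointer, new_core, create_core, bulk_load):
    """Create and fill the new core, then flip the pointer as the last step.

    Readers keep using the old core until publish() runs, so a failed load
    never exposes partial content. Returns the old core name so the caller
    can delete it later.
    """
    create_core(new_core)
    bulk_load(new_core)
    return pointer.publish(new_core)


if __name__ == "__main__":
    store = {"current_content_core": "content_2012q1"}
    pointer = CorePointer(store)
    old = quarterly_update(pointer, "content_2012q2", print, print)
    print("now serving", pointer.current(), "- old core to delete:", old)
```

The flip of the pointer is the only step searches observe, which plays the role the Solr commit plays today.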
Finally, the canonical content consists of numerous multi-valued fields. If I pull up such a document field in the CLI, I see:
(column=n_macromolecule_species, value=solrjson:["Human","Mouse","Rat","Cow"], timestamp=1333552307242000)
Is that column value a standard I can make use of to define multi-valued fields, or is it some secret DSE Solr integration thing?
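To make the question concrete, this is roughly how I would read and write such a value if the format is simply the literal prefix `solrjson:` followed by a JSON array. That format is an assumption inferred only from the single CLI output above, and the helper names are made up; whether DSE accepts values written this way is exactly what I am asking.

```python
import json

# Assumed prefix, taken from the CLI output shown above.
PREFIX = "solrjson:"


def decode_multivalued(raw):
    """Turn a stored column value into a list of field values."""
    if raw.startswith(PREFIX):
        return json.loads(raw[len(PREFIX):])
    # No prefix: treat it as an ordinary single-valued column.
    return [raw]


def encode_multivalued(values):
    """Build a column value in the same shape as the one seen in the CLI."""
    # Compact separators reproduce the spacing of the observed value.
    return PREFIX + json.dumps(list(values), separators=(",", ":"))


if __name__ == "__main__":
    stored = 'solrjson:["Human","Mouse","Rat","Cow"]'
    print(decode_multivalued(stored))
    print(encode_multivalued(["Human", "Mouse", "Rat", "Cow"]) == stored)
```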
Thanks!
Jeff
