I'm admittedly still just experimenting with DSE 2.0 (with Solr: dse cassandra -s) as a single node, and I'm trying to figure out how best to optimize its performance. It seems somewhat slower than our prior implementation, which used Solr 3.5.0 running under Tomcat 6 and Cassandra 1.0.6. In our pre-DSE world, Cassandra held content separate from Solr. Now we're using DSE 2.0 both for search and for the content we had kept in Cassandra. I'm trying to get a handle on this initial disparity before firing up two or more DSE 2.0 nodes.
We are using Amazon EC2 m1.xlarge instances (8 ECUs, 4 cores, 15GB of RAM). I used the DSE 2.0 tarball to install on OpenSUSE 11.3, which seems to be the OS/version of choice for my client. With regard to the DSE 2.0 documentation, all I've done so far is set up JNA (p. 20). That did not help at all, but on the other hand the m1.xlarge instance still has about 8GB free, so it's unlikely the DSE JVM was getting swapped out in the first place. It was easy to do, though, and I want it in place moving forward.
Regarding EC2, p. 10 mentions "virtual environment" requirements. The third point recommends 8GB as the more common configuration for production clusters. Given that an m1.large has 7.5GB of memory, that tips us into the m1.xlarge (15GB) category. Also, having just watched both Solr 3.5.0 and DSE 2.0 via top, the 2x number of CPU cores on the m1.xlarge is quite beneficial as well.
However, the last bullet point under "Memory" says "For Solr and Hadoop nodes, use 32GB or more of total RAM". That would indicate an m2.xlarge, which has 34GB of RAM. We'd like to avoid that, at least in the early stages. Besides, with 8GB of free memory on the m1.xlarge, it does not seem we have a big memory problem yet. Nothing else is running on the instance, and I understand that leaving Linux a healthy amount of memory for the file system cache is great for read-intensive scenarios. While there is a fair amount of writing going on, reads are the predominant workload for us, particularly with Solr within DSE 2.0.
I know I'm kind of throwing generalities at you rather than hard usage data, CPU utilization graphs, etc. I'll get to more of that as we move forward. Also, I'm not testing DSE directly, but rather via a front-end application (Java 6) that uses SolrJ and Hector to work with DSE 2.0. That app runs on an m1.large and its memory and CPU utilization are pretty low, while DSE 2.0 (or Solr 3.5.0 in the old instance) works the CPUs of the m1.xlarge pretty hard (80-90% with 20 concurrent users).
The project is still relatively early in its release cycle, and so far I've performed a fairly naive integration with DSE 2.0 from an application standpoint as well. That is, I defined two Solr cores and indexed the content in my standard ways: Solr-compatible XML POSTed directly to Solr for our content, and tenant content sent through our own API and on to DSE 2.0 using SolrJ. In other words, I did not first load this data into Cassandra and index it in place. I do want to move to that way of doing things, but for this first cut, doing what we've been doing was the most direct way to kick the tires.
Finally, Solr itself struggles with one particular search scenario. While our main content index has only about 150K documents, there are a large number of multi-valued fields which are faceted, highlighted, boosted, etc. There are 54 separate fields configured in the edismax request handler (qf), each with its own boost and being highlighted, and there are also about 12 fields specified in pf and pf2. Only ten fields are returned (fl), though. I need to work with the client on reducing that complexity. The goal is to tweak relevance as well as provide detailed explanations for why a given document matched the way it did, but it's quite a lot of processing. Other searches are much less strenuous. Anyway, I realize this is not a Solr forum per se; I just wanted to point out that DSE 2.0 (or Solr) is not being used lightly. I'm sure there are some application-side changes that can improve things.
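For context, the handler is shaped roughly like the sketch below. The field names here are invented placeholders, and the real config carries 54 qf entries and about 12 pf/pf2 entries, each with its own boost:

```xml
<!-- Illustrative only: field names are made up; the real handler has 54 qf
     fields and ~12 pf/pf2 fields, each individually boosted -->
<requestHandler name="/search" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="qf">title^5.0 subject^3.0 body^1.0 author^2.0</str>
    <str name="pf">title^10.0 body^2.0</str>
    <str name="pf2">body^1.5</str>
    <str name="fl">id,title,author</str>
    <str name="hl">true</str>
    <str name="hl.fl">title,body</str>
    <str name="facet">true</str>
    <str name="facet.field">subject</str>
  </lst>
</requestHandler>
```

Multiply that boosting, highlighting, and faceting across 54 fields per query and you get a sense of the per-request work involved.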
While it'd be great for you to tell me to set -Drunreallyfast=true and double the performance, what's the most sensible direction to proceed from here? I see the DSE startup scripts decide their own heap size based on the available memory and CPU cores. On the m1.xlarge, this came out to -Xms3840M -Xmx3840M when I look at the running process. Maybe that's too small on a 15GB machine running nothing else? I'd think setting the heap to 8GB and leaving the rest for the file system cache would be an improvement.
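For what it's worth, here's my reading of how the 3840M figure falls out of the Cassandra-era heap heuristic (max of "half RAM capped at 1GB" and "quarter RAM capped at 8GB"); this is a sketch of the cassandra-env.sh logic as I understand it, and the DSE copy of the script may differ:

```shell
# Hypothetical sketch of the cassandra-env.sh heap calculation (my reading
# of the Cassandra 1.0-era script; DSE's version may vary).
compute_max_heap_mb() {
    system_memory_in_mb=$1
    half=$((system_memory_in_mb / 2))
    quarter=$((system_memory_in_mb / 4))
    [ "$half" -gt 1024 ] && half=1024        # cap the "half of RAM" term at 1GB
    [ "$quarter" -gt 8192 ] && quarter=8192  # cap the "quarter of RAM" term at 8GB
    if [ "$half" -gt "$quarter" ]; then      # take the larger of the two
        echo "$half"
    else
        echo "$quarter"
    fi
}

compute_max_heap_mb 15360   # m1.xlarge (15GB): prints 3840, matching -Xmx3840M
```

If I do bump it to 8GB, I assume the standard route is to override MAX_HEAP_SIZE (and HEAP_NEWSIZE) in cassandra-env.sh rather than patch the calculation.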
Since Solr is my main workload, are there Cassandra configurations that would favor Solr? That is, is there some relationship between configuring the Solr caches (solrconfig.xml) and the Cassandra row and key caches that should be coordinated? And is there still value in optimizing a Solr index in DSE 2.0?
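On the Solr side, I'm thinking of the usual solrconfig.xml cache knobs; the sizes below are illustrative placeholders, not recommendations, and I'd appreciate guidance on how (or whether) these should be balanced against Cassandra's caches in DSE:

```xml
<!-- Illustrative sizes only; the question is how these should be tuned
     alongside (or instead of) Cassandra's row/key caches under DSE -->
<filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="128"/>
<queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="128"/>
<documentCache class="solr.LRUCache" size="512" initialSize="512"/>
```

Given how facet- and filter-heavy our queries are, I'd guess the filterCache matters most for us, but I don't know how DSE's integration changes that picture.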
I've yet to play with OpsCenter. Perhaps it can shed some light on what's happening with the DSE 2.0 instance while under load.
Thanks for any advice.