New cloud storage offering combines the performance advantages of direct-attached SSD with the safety and reliability of persistent storage.
Microsoft’s new Premium Storage offering is a compelling blend of network-attached persistent SSD with a locally attached SSD cache, making it an interesting storage option for DSE and Cassandra deployments in Azure that need to rely on persistent rather than ephemeral storage. The starter config is:
- Use a P10 disk for the commitlog (and turn off caching on it!)
- Use a P30 disk for the data directory and mount it with barriers off.
- Attach these to a DS4, DS13 or DS14 instance.
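As a sketch of that starter config, assuming the classic `azure` CLI that is used elsewhere in this post (the VM name is a placeholder, and 128GB/1023GB are the P10 and P30 tier sizes — verify flags with `azure vm disk attach-new --help` for your CLI version):

```shell
# Provisioning sketch only -- names are placeholders.
# Commitlog disk at the P10 size, with host caching turned off:
azure vm disk attach-new -h None MyDseVM 128
# Data disk at the P30 size, leaving the default read caching enabled:
azure vm disk attach-new MyDseVM 1023
```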
Have a look at the links in the Premium Storage section below for more info.
If you’re not quite at the point where you need to worry about storage considerations and are just looking to get started with DataStax Enterprise (DSE) on Azure, please refer to my last blog entry on using the DSE Sandbox from the Azure Marketplace.
Storage technology is a constant topic of conversation with our customers, largely because storage systems often cost more than they need to, or they’re the source of crippling performance and availability issues (I’m looking at you, SAN). The DataStax dogma on this subject is to always use directly attached SSDs. For a deeper explanation, please refer to this DataStax Academy video.
The local-SSD advice makes perfect sense when deploying in on-premises data centers, but cloud deployments have an extra wrinkle to consider: the SSDs that are directly attached to Virtual Machine (VM) instances are usually ephemeral, meaning that if the VM is lost, the data on the SSD is lost with it. Now, Cassandra’s architecture does a lot to mitigate the risk of overall data loss when the data on an ephemeral SSD is lost. Clusters will typically be configured to store three copies of everything, and that can be extended with a multi-data center strategy. So a cluster deployed with, say, a data center on each coast of the U.S. and one in Europe will typically retain nine copies of every piece of data, and achieve geographic resiliency to boot. But that comes at the cost of at least nine nodes, and probably more than that. My point is that getting to a state where having your data on ephemeral SSDs doesn’t stress you out can take a fair amount more infrastructure than some projects can justify. Nothing will ever match the performance and availability of locally attached SSDs (provided you have enough of them), but Microsoft has come up with an extremely interesting cloud storage offering called Premium Storage that brings some of the benefits of direct-attached SSD to remotely attached, SSD-backed storage volumes that are persistent.
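To make the nine-copies arithmetic concrete: that multi-data center layout is just a keyspace replication setting. The keyspace and data center names below are hypothetical and must match your snitch configuration:

```sql
-- Three replicas in each of three data centers = nine copies of every row.
CREATE KEYSPACE myapp WITH replication = {
  'class': 'NetworkTopologyStrategy',
  'us_east': 3,
  'us_west': 3,
  'eu_west': 3
};
```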
Azure Premium Storage (PS)
To save you from having to Google, excuse me, Bing for more data on PS, these articles: [1 & 2] are well worth reading. Before you get too wrapped around the axle on the details though, the rest of this post will help distill the bits that are important to DataStax deployments.
DSE & Azure PS
Let’s take a minute to remind ourselves how DSE does IO:
- The commitlog writes sequentially and by default fsync()s every 10 seconds.
- SSTables are flushed from memory in large sequential writes.
- SSTables are compacted with each other in a background process that involves both reads & writes of relatively large size.
- SSTables are randomly accessed by things like CQL SELECT statements.
- If you’re using DSE Search, the index files will also be randomly accessed.
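For reference, the 10-second commitlog fsync mentioned above comes from these cassandra.yaml defaults:

```yaml
# cassandra.yaml defaults: the commitlog is fsync()ed on a timer, so up to
# 10 seconds of acknowledged writes sit between syncs.
commitlog_sync: periodic
commitlog_sync_period_in_ms: 10000
```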
OK, stash that in the back of your mind for a minute while we cover some critical details about PS.
PS uses SSD-backed volumes, and their performance is related to the size of the volume. Performance is broken up into two core aspects: IOPS and throughput. The largest disk is called a P30, and it is rated for 5000 IOPS and 200MB/s of throughput. Now at this point, you might be saying: 5000 IOPS? That’s a pittance! I’ve heard the guy that wrote this article state publicly that 20,000 IOPS is the minimum to consider for Cassandra. That’s true. I have said that, and I stand by it. So, upon reading the PS docs, one might come to the conclusion that the answer is to attach 10 P30s to a DS14 to get up to 50,000 IOPS on that instance. But that would result in a lot of wasted capacity and would probably cost almost as much as those nine or more geo-diverse instances that you were hoping to avoid spinning up. (Bear in mind that unlike standard storage, PS is not sparse allocated, so you’re charged for the full capacity of the volume, not just the used portion.)
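To see why striping P30s purely for IOPS gets expensive, here is the back-of-the-envelope math, assuming the rated 5000 IOPS per P30 and its roughly 1TB (1023GB) provisioned size:

```shell
disks=10           # P30 volumes attached to one DS14
iops_per_p30=5000  # rated IOPS per P30
gb_per_p30=1023    # provisioned (and billed) capacity per P30, in GB
echo $((disks * iops_per_p30))  # aggregate rated IOPS: 50000
echo $((disks * gb_per_p30))    # GB you pay for, used or not: 10230
```

Fifty thousand IOPS, but also ten-plus terabytes of billed capacity — likely far more than the dataset needs.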
So what’s the trick then? Why does the TL;DR section above say to start with a single P30 disk?
The DS-series VMs that we use with PS are very closely related to the D-series VMs that you might use to deploy DataStax with ephemeral storage, except that part of the ephemeral storage is blocked off to serve as a block read cache in front of the network-attached P30 volume. Every time we read a block from the P30, we copy it to the local SSD; future reads of that block come from the cache and don’t have to go over the network. A DS14 instance has 576GB of read cache allocated to it! The upshot is that there is the potential to have 100% of the database hot in the cache. If you stretch your imagination a bit, you can see how this kind of storage setup could be incredibly powerful when combined with something like DTCS (https://www.datastax.com/dev/blog/datetieredcompactionstrategy). The really critical part, though, is that the cache is not subject to the performance throttles of the P30 disk. That little detail can be hard to catch as you’re reading up on Premium Storage, but if you have a look at article #1 above, you’ll note that it includes this NOTE:
So, that block cache itself is capable of about 64,000 IOPS on a DS14. The good news is that most of today’s workloads exhibit cache-friendly characteristics. [citation welcomed!] Cassandra and other big data technologies have been taking advantage of this for years by efficiently using unused DRAM in the Linux page cache, which is why everybody that has ever used YCSB now knows what a Zipfian distribution is.
Premium Storage has extended that cache capability with an additional layer backed by SSD rather than DRAM. The more hits that the SSD services, the less stress there will be on the P30 backing store. It’s also likely that your compaction throughput requirements fall within the limits of a single P30, so it makes sense to start with one as your /var/lib/cassandra/data mount point. Similarly, commitlog IO performance and capacity requirements are lower, but persistence is still important, so a P10 is a sensible starting point here. It’s important to make a couple of config changes here to make sure that everything is operating optimally.
- Turn caching off on the commitlog disk: 'azure vm disk attach -h None <VM-Name> <Disk-Name>'
- Mount the data disk with barriers off. See: http://azure.microsoft.com/en-us/documentation/articles/storage-premium-storage-preview-portal/#using-linux-vms-with-premium-storage
- Consider increasing the chunk_length_kb setting on your Cassandra tables. The default is 64KB of uncompressed data going into each SSTable chunk. Given that the PS volumes can’t really do IOs smaller than 256KB, you may as well start with a chunk_length_kb setting of 256 or higher if you’re using compression. There may be additional IO tuning adjustments to make depending on the workload, but we don’t have concrete results to share yet.
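Pulling those three changes together as a sketch — the device, VM, disk, and table names are placeholders; the mount option is barrier=0 for ext4 (nobarrier for xfs); and the ALTER TABLE syntax shown is the Cassandra 2.x form:

```shell
# 1) Host caching off on the commitlog disk (classic azure CLI):
azure vm disk attach -h None MyDseVM MyCommitlogDisk
# 2) Mount the ext4 data volume with write barriers disabled:
mount -o barrier=0 /dev/sdc1 /var/lib/cassandra/data
# 3) Match the compression chunk size to PS's ~256KB minimum IO:
cqlsh -e "ALTER TABLE myks.mytable WITH compression =
  {'sstable_compression': 'LZ4Compressor', 'chunk_length_kb': 256};"
```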
Azure Premium Storage is still pretty new, and there are a lot of knobs to tweak here. We’ll have a detailed usage guide out in the future once we’ve had a chance to really put it through its paces; until then, use this guide as a starting point.