Distributed Data Show Episode 46
Brian Hess joins the show to talk about what’s new with Analytics in DSE 6, especially Always-on SQL and Spark Streaming.
0:15 - Introducing Brian Hess and his sweater vest
0:30 - The top things you should know about Analytics in DSE 6 are: 1) Update to Spark 2.2, 2) enhanced Spark DSE connector, and 3) Always-on SQL
2:07 - DSE Analytics demonstrates enhanced performance due to factors like: 1) leveraging the continuous paging capability (introduced in DSE 5.1) - great for scans, 3x improvement.
2:58 - 2) A new kind of direct join, used automatically under the hood for batch and streaming.
4:25 - 3) Leveraging DSE Search indices when available.
6:29 - Analytics will not use search indices for queries that scan more than a few percent of a table
8:30 - Why Spark SQL is cool: 1) familiarity for SQL users
10:53 - 2) usability - Spark SQL is great for analytic queries that use aggregates such as COUNT with GROUP BY
12:08 - 3) Spark SQL is helpful for use cases like working around a poor choice of partition key in Cassandra tables
13:48 - The Always-on SQL feature is new in DSE 6 - we’re trying to achieve the same level of availability for Spark SQL that Cassandra provides. A few interesting points:
14:53 - 1) Always-on SQL is an improvement over the Spark SQL Thrift Server, which has to be restarted manually if it fails and executes all queries as a single user (more on this later)
16:47 - 2) Always-on SQL leverages DSEFS under the hood to store snapshots of cached data, which enables quick recovery of the cache if the server is restarted on another node.
19:22 - 3) Always-on SQL leverages DSE Analytics workpools so that dedicated resources are still allocated to the service on restart. Running separate workpools is a way of managing resources, and Always-on SQL gets its own workpool.
21:08 - 4) Always-on SQL is off by default in DSE 6. To configure it you simply specify the resources that should be allocated.
24:08 - 5) Always-on SQL supports “proxy execution” - a security feature to execute queries as the authenticated user
25:59 - Structured streaming was first introduced in Spark 2.0, and is now officially supported in DSE 6. Structured streaming represents the same kind of improvement over DStreams that DataFrames represent vs. RDDs.
27:46 - We added a structured streaming sink for DSE. This enables a usage pattern we see used frequently - connect to one or more incoming streams, transform/join, write to DSE.
29:07 - Wrapping up - DSE 6 Analytics represents major improvements in performance, ease of use, and capability.
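Sketch for the 6:29 note: the rule that search indices are skipped once a query would scan more than a few percent of a table can be illustrated with a small selectivity check. This is only a hedged illustration in plain Python, not DSE's actual implementation; the function name, its parameters, and the 3% cutoff are assumptions chosen for the example.

```python
def should_use_search_index(estimated_matches: int,
                            table_rows: int,
                            max_ratio: float = 0.03) -> bool:
    """Decide whether a query is selective enough to use a search index.

    max_ratio is a hypothetical cutoff standing in for "a few percent of
    a table"; the real threshold and how matches are estimated are not
    described in the episode notes.
    """
    if table_rows <= 0:
        # Nothing to compare against, so fall back to a plain scan.
        return False
    return estimated_matches / table_rows <= max_ratio


# A query matching 0.1% of the rows would use the index;
# one matching 20% of the rows would fall back to a full scan.
selective = should_use_search_index(1_000, 1_000_000)
broad = should_use_search_index(200_000, 1_000_000)
print(selective, broad)
```

The idea is that an index lookup pays off only for selective queries; once a large fraction of the table is touched, a scan (helped by continuous paging) is the cheaper path.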