Distributed Data Show Episode 84
Patrick and Holden talk about the highlights of Spark 2.4, what's coming in Spark 3, and why code reviewers are vital to open source projects.
0:15 - Welcoming Holden back to the show
0:50 - Spark 2.4 is out - highlights include Apache Arrow integration for better interoperability between the JVM, Python, and R runtimes.
2:05 - Python is becoming a first class citizen in the Spark world
2:50 - Projects including Arrow and Spark have a real need for code reviewers that know both Python and Java
4:15 - Livestreaming code reviews
5:52 - The types of changes that need review tend to be the gnarly issues; even a first-pass, high-level review helps.
7:10 - Spark 3 highlights (note it's not backward compatible) - new Spark SQL engine
8:00 - Python 2.7 support will be deprecated in Spark 3
9:03 - Spark MLlib will also be deprecated in favor of SparkML
10:10 - Spark Streaming data source APIs are changing
11:17 - Kubernetes integration is improving, especially scaling down
13:20 - This helps with the #1 cloud concern - cost control
14:50 - Deep learning pipeline support is being added, the approach is pluggable (bring your own DL libraries)
17:05 - Why OSS releases are late - code reviewers, feature creep, agreeing on priorities
18:40 - Wrapping up
ABOUT DATASTAX ENTERPRISE 6
DataStax powers the Right-Now Enterprise with the always-on, distributed cloud database built on Apache Cassandra™ and designed for hybrid cloud. DataStax Enterprise 6 (DSE 6) includes industry-leading performance, self-driving operational simplicity, and robust analytics
Developer Relations at DataStax
Developer Advocate at Google