Video

Spark 3 Preview with Holden Karau

Patrick and Holden talk about the highlights of Spark 2.4, what's coming in Spark 3, and why code reviewers are vital to open source projects.

Highlights!

0:15 - Welcoming Holden back to the show

0:50 - Spark 2.4 is out - highlights include Apache Arrow integration for better interoperability between the JVM, Python and R runtimes.

2:05 - Python is becoming a first-class citizen in the Spark world

2:50 - Projects including Arrow and Spark have a real need for code reviewers that know both Python and Java

4:15 - Livestreaming code reviews

5:52 - The changes that need review tend to be the gnarly ones; even a first-pass, high-level review helps.

7:10 - Spark 3 highlights (note it's not backward compatible) - new Spark SQL engine

8:00 - Python 2.7 support will be deprecated in Spark 3

9:03 - Spark MLlib (the RDD-based API) will also be deprecated in favor of Spark ML (the DataFrame-based API)

10:10 - Spark Streaming data source APIs are changing

11:17 - Kubernetes integration is improving, especially scaling down

13:20 - This helps with the #1 cloud concern - cost control

14:50 - Deep learning pipeline support is being added; the approach is pluggable (bring your own DL libraries)

17:05 - Why OSS releases are late - code reviewers, feature creep, agreeing on priorities

18:40 - Wrapping up