
Spark 3 Preview with Holden Karau

Distributed Data Show Episode 84

Patrick and Holden talk about the highlights of Spark 2.4, what's coming in Spark 3, and why code reviewers are vital to open source projects.


0:15 - Welcoming Holden back to the show

0:50 - Spark 2.4 is out - highlights include Apache Arrow integration for better interoperability between the JVM, Python and R runtimes.

2:05 - Python is becoming a first-class citizen in the Spark world

2:50 - Projects including Arrow and Spark have a real need for code reviewers who know both Python and Java

4:15 - Livestreaming code reviews

5:52 - The changes that need review tend to be the gnarly issues; even a first-pass, high-level review helps.

7:10 - Spark 3 highlights (note it's not backward compatible) - new Spark SQL engine

8:00 - Python 2.7 support will be deprecated in Spark 3

9:03 - Spark MLlib (the RDD-based API) will also be deprecated in favor of Spark ML (the DataFrame-based API)

10:10 - Spark Streaming data source APIs are changing

11:17 - Kubernetes integration is improving, especially scaling down

13:20 - This helps with the #1 cloud concern - cost control

14:50 - Deep learning pipeline support is being added, the approach is pluggable (bring your own DL libraries)

17:05 - Why OSS releases are late - code reviewers, feature creep, agreeing on priorities

18:40 - Wrapping up

ABOUT DATASTAX ENTERPRISE 6

DataStax powers the Right-Now Enterprise with the always-on, distributed cloud database built on Apache Cassandra™ and designed for hybrid cloud. DataStax Enterprise 6 (DSE 6) includes industry-leading performance, self-driving operational simplicity, and robust analytics



Patrick McFadin


Developer Relations at DataStax

Holden Karau

Developer Advocate at Google