Distributed Data Show Episode 34
Patrick McFadin catches up with Holden Karau of Google to learn about new features of Spark 2.3, including Vectorized UDFs, Microbatch improvements, and Kubernetes support. Along the way, they explore whether API stability is an indicator that it’s time to make a career move.
0:15 - Welcoming Holden
0:35 - Patrick asks Holden why Spark APIs keep changing and whether API stability and boring infrastructure are a good thing
2:43 - Big changes in Spark 2.3 include Vectorized UDFs powered by Apache Arrow, which give a big performance boost when transferring data between Python and the JVM (for those who aren’t fans of Scala)
6:44 - In the new Spark Microbatch API, sources and sinks are no longer tied to batches. This gives you the flexibility to process data as quickly as possible when you can tolerate some data loss, as in some IoT and machine learning use cases
11:38 - Kubernetes support is finally in Spark 2.3 after a few competing approaches were resolved, simplifying deployment of complex Spark apps that leverage non-JVM libraries
14:44 - Holden explains why Spark struggles with scaling down and how Kubernetes support may be part of the solution
18:34 - Patrick and Holden discuss when it will be safe to deploy Spark on K8S in production (hint: it should be before Spark 3)
Patrick McFadin, Developer Relations at DataStax
Holden Karau, Developer Advocate at Google