A conference talk transcript by Holden Karau (Spark contributor at Snowflake) covering Apache Spark performance optimization in lakehouse environments. The core problem: adding type information to Spark Datasets causes 10–400% slowdowns because Spark must validate full records instead of reading only needed columns. The solution being explored is transpilation — converting Python ASTs and JVM bytecode into Spark's Catalyst intermediate representation, allowing predicate and column pushdowns to work even with typed UDFs. A proof-of-concept exists with two passing tests, using recursive descent conversion of Python ASTs in Scala. Key challenges include silent correctness bugs (null handling, date/timezone semantics), avoiding broken pipelining when inserting select statements, and licensing issues with useful bytecode analysis libraries. Future work includes handling maps/flatMaps, opt-in relaxed typing, and targeting GPU operations (NVIDIA-style) beyond Catalyst.
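To make the recursive descent idea concrete, here is a minimal sketch in Python. It is not the talk's proof-of-concept, which is described as written in Scala and targeting Catalyst. This sketch walks a Python lambda's AST and emits a SQL-style expression string plus the set of columns it touches. The names `to_expr` and `transpile_lambda`, the operator tables, and the string output format are assumptions made for illustration.

```python
import ast

# Hypothetical operator tables for this sketch (not taken from the talk).
_BIN_OPS = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*"}
_CMP_OPS = {ast.Gt: ">", ast.GtE: ">=", ast.Lt: "<", ast.LtE: "<=",
            ast.Eq: "=", ast.NotEq: "!="}
_BOOL_OPS = {ast.And: "AND", ast.Or: "OR"}


def to_expr(node, row_name, columns):
    """Recursively convert one AST node into a SQL-style expression string.

    Column names read from the row argument are added to `columns`.
    """
    # r.age -> column reference "age"
    if (isinstance(node, ast.Attribute)
            and isinstance(node.value, ast.Name)
            and node.value.id == row_name):
        columns.add(node.attr)
        return node.attr
    # Numeric literals only; bool is excluded because it subclasses int.
    if (isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)):
        return repr(node.value)
    if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
        left = to_expr(node.left, row_name, columns)
        right = to_expr(node.right, row_name, columns)
        return f"({left} {_BIN_OPS[type(node.op)]} {right})"
    # Single comparisons only; chained ones like a < b < c are rejected.
    if (isinstance(node, ast.Compare) and len(node.ops) == 1
            and type(node.ops[0]) in _CMP_OPS):
        left = to_expr(node.left, row_name, columns)
        right = to_expr(node.comparators[0], row_name, columns)
        return f"({left} {_CMP_OPS[type(node.ops[0])]} {right})"
    if isinstance(node, ast.BoolOp) and type(node.op) in _BOOL_OPS:
        parts = [to_expr(v, row_name, columns) for v in node.values]
        return "(" + f" {_BOOL_OPS[type(node.op)]} ".join(parts) + ")"
    # Anything else is left for the caller to run as an opaque UDF.
    raise NotImplementedError(f"unsupported node: {type(node).__name__}")


def transpile_lambda(source):
    """Parse a one-argument lambda and return (expression, referenced columns)."""
    lam = ast.parse(source, mode="eval").body
    if not isinstance(lam, ast.Lambda) or len(lam.args.args) != 1:
        raise ValueError("expected a lambda with exactly one argument")
    columns = set()
    expr = to_expr(lam.body, lam.args.args[0].arg, columns)
    return expr, columns


expr, cols = transpile_lambda("lambda r: r.age > 18 and r.country_id == 3")
print(expr)          # ((age > 18) AND (country_id = 3))
print(sorted(cols))  # ['age', 'country_id']
```

Once the predicate is an expression over named columns rather than an opaque function, an optimizer can see which columns are needed and where the filter can be pushed. The sketch deliberately raises on anything it does not recognize, so a caller could fall back to running the original function unchanged. It also makes no attempt to reconcile Python and SQL semantics: a comparison against `None` raises in Python but yields null in SQL, which is the kind of silent correctness gap the talk warns about.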

46m watch time
