Apache Spark for Data Engineers - Advanced Optimizations
Apache Spark is a general-purpose computing engine that provides a unified framework for big data processing, ad hoc analytics, machine learning, graph processing, and streaming. Over the past few years, Spark has become the standard for handling these workloads, and not only within the big data ecosystem. Its popularity is also driven by the high-level DataFrame API, which allows business logic to be expressed in a very concise and expressive way.
This training focuses on advanced Spark SQL topics that affect the performance of Spark jobs, such as optimizing execution plans, eliminating shuffles, choosing an optimal data distribution, reusing computation efficiently, and more. The goal of the training is to teach techniques for achieving maximal performance.
The training is taught in the Python programming language, using Spark in local mode (Spark 3.2 with Jupyter notebooks), as sketched below.
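For illustration, a minimal local-mode session similar to the training environment might look like the following (the application name is a made-up example, not part of the course material):

```python
from pyspark.sql import SparkSession

# Start Spark in local mode, using all available cores.
# The application name is illustrative only.
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("spark-advanced-optimizations")
    .getOrCreate()
)

# Quick sanity check: a tiny DataFrame built from a range of numbers.
spark.range(10).show()
```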
Audience
Data engineers, data scientists, and other Spark users who already have some prior experience with the engine and want to deepen their knowledge, learn how to optimize Spark jobs, and achieve maximal performance.
Goals
Participants will learn to:
- Understand and interpret execution plans in Spark SQL
- Rewrite queries to achieve better performance
- Use internal configuration settings efficiently
- Prepare data for analytical purposes using Spark
- Find the bottleneck of a Spark job
Guarantor of the Training
David Vrba Ph.D.
David Vrba, Ph.D., works at Socialbakers as a data scientist and Spark consultant. On a daily basis he optimizes ETL pipelines built in Spark and develops jobs that process data at scales of up to tens of terabytes. David also lectures Spark trainings and workshops; over the last two years he has trained several teams in Spark, including data engineers, data analysts, and researchers. He also contributes to the Spark source code and is active in the community, giving public talks at conferences and meetups such as Spark + AI Summit, MLPrague, and the Spark + AI Prague Meetup.
Outline
Spark SQL internals (Query Execution); see the plan-inspection sketch after this block
- Logical planning (Catalog, Analyzer, Cache Management, Optimizer)
- Catalyst API
- Extending the optimizer
- Limiting the optimizer
- Physical planning
- Query planner, strategies
- Spark plan
- Executed plan
- Understanding operators in the physical plan
- Cost-based optimizer
- How cost-based optimizations work
- Statistics collection
- Statistics usage
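As a taste of the topics in this block, execution plans can be inspected with explain(), and the cost-based optimizer relies on statistics collected with ANALYZE TABLE. The table and column names below are illustrative assumptions, not part of the course material:

```python
from pyspark.sql import functions as F

# Assumes an existing SparkSession `spark` and a registered table `events`.
df = spark.table("events").where(F.col("event_type") == "click")

# Print the parsed, analyzed, and optimized logical plans plus the physical plan.
df.explain(mode="extended")

# Enable the cost-based optimizer and collect the statistics it relies on.
spark.conf.set("spark.sql.cbo.enabled", "true")
spark.sql("ANALYZE TABLE events COMPUTE STATISTICS FOR COLUMNS event_type")
```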
Query optimization; see the join and caching sketch after this block
- Shuffle elimination
- Bucketing
- Data repartition (when and how)
- Optimizing joins
- Shuffle-free join
- One-side shuffle-free join
- Broadcast join vs sort-merge join
- Data reuse
- Caching
- Checkpointing
- Exchange reuse
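A few of the techniques listed above, sketched with made-up table and column names (orders, countries, user_id, and so on are assumptions for the example only):

```python
from pyspark.sql import functions as F
from pyspark.sql.functions import broadcast

orders = spark.table("orders")          # large fact table (illustrative)
countries = spark.table("countries")    # small dimension table (illustrative)

# Broadcast join: ships the small side to every executor,
# avoiding a shuffle of the large side.
joined = orders.join(broadcast(countries), "country_id")

# Repartition once by the grouping key so that several wide
# operations downstream reuse the same data distribution.
by_user = orders.repartition("user_id")

# Cache a DataFrame that is reused by more than one query.
by_user.cache()
orders_per_user = by_user.groupBy("user_id").count()
amount_per_user = by_user.groupBy("user_id").agg(F.sum("amount").alias("total"))
```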
Optimization tips; see the configuration sketch after this block
- Choose the appropriate number of shuffle partitions
- Nondeterministic expressions
- Configuration settings
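For instance, the number of shuffle partitions and a few related settings are plain Spark SQL configurations; the concrete values below are placeholders rather than recommendations:

```python
# Number of partitions created by shuffles in joins and aggregations (default 200).
spark.conf.set("spark.sql.shuffle.partitions", "64")

# Adaptive Query Execution (on by default since Spark 3.2) can coalesce
# small shuffle partitions at runtime.
spark.conf.set("spark.sql.adaptive.enabled", "true")

# Maximum size of a table that is automatically broadcast in joins (~50 MB here).
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(50 * 1024 * 1024))
```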
Data layout; see the sketch after this block
- Different file formats
- Parquet vs JSON
- Partitioning and bucketing
- How bucketing works
- How to ensure the proper number of files
- Tables management
- Working with the Catalog API
- Delta Lake (delta.io)
- Open-source storage layer with ACID transactions
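A compact sketch of the data-layout topics above: a partitioned and bucketed Parquet table, the Catalog API, and a Delta write (the last line assumes the delta-spark package is installed and configured in the session). All table, column, and path names are illustrative:

```python
events = spark.table("events")   # illustrative source table

# Write a Parquet table partitioned by date and bucketed by user_id.
# Bucketing requires saveAsTable(); DataFrameWriter.save() does not support bucketBy.
(
    events.write
    .partitionBy("event_date")
    .bucketBy(16, "user_id")
    .sortBy("user_id")
    .format("parquet")
    .saveAsTable("events_bucketed")
)

# Manage and inspect tables through the Catalog API.
spark.catalog.listTables()
spark.catalog.listColumns("events_bucketed")

# Delta Lake: an open-source storage layer with ACID transactions.
events.write.format("delta").mode("overwrite").save("/tmp/events_delta")
```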
Prerequisites
This is a follow-up training to the course Apache Spark - From Simple Transformations to Highly Efficient Jobs, in which participants gain (among other things) a solid understanding of the DataFrame API and basic knowledge of Spark's internal processes.
To get the most out of this training, it is recommended to have some previous experience with Spark (for example at the level of the aforementioned course), to know the DataFrame API, and to understand the basic principles of distributed computing.