
Apache Spark - From Simple Transformations to Highly Efficient Jobs

Price (without VAT): 25,900 CZK

Length: 2 days

Dates:

  • 7. 11. – 8. 11. 2024, virtual, CZ
  • 5. 12. – 6. 12. 2024, virtual, CZ

Apache Spark is a general-purpose computing engine that provides a unified framework for big data processing, ad hoc analytics, machine learning, graph processing, and streaming. Over the past few years, Spark has become the standard for handling these workloads, and not just in the big data ecosystem. Its popularity is also growing because of the high-level DataFrame API, which allows business logic to be expressed in a very concise and expressive way.

This training covers Spark from three perspectives. The first part is devoted to the programming interface of the DataFrame API, which allows you to quickly build Spark applications. The second part focuses on Spark architecture: we explain how things work under the cover in Spark SQL and the execution layer, and use that understanding to achieve high-performance queries. In the last part we explore the machine learning and graph processing techniques that Spark provides for advanced data analysis.


Audience

  • Data scientists with little or no experience of Apache Spark who want to quickly learn the technology for ad hoc analysis or for building machine learning applications
  • Data engineers with some experience of Apache Spark who want to better understand Spark's internal processes and use that knowledge to write high-performance queries and ETL jobs

Goals

Participants will learn:

  • Basic concepts of Apache Spark and distributed computing
  • How to use the DataFrame API in Spark for ETL jobs and ad hoc data analysis
  • How the DataFrame API works under the hood
  • How the optimization engine works in Spark
  • How a Spark application is executed
  • How to understand query plans and use that information to optimize queries
  • Basic concepts of the ML Pipelines library for machine learning
  • Basic concepts of the GraphFrames library for graph processing
  • How to process data in (nearly) real time in Spark (Structured Streaming)
  • What's new in Spark 2.3, 2.4, and 3.0

Guarantor of the Training

DAVID VRBA

David Vrba, Ph.D., works at Socialbakers as a data scientist and Spark consultant. On a daily basis he optimizes ETL pipelines built in Spark and develops jobs that process data at scales of up to tens of TBs. David also lectures Spark trainings and workshops; over the last two years he has trained several teams in Spark, including data engineers, data analysts, and researchers. He also contributes to the Spark source code and is active in the community, giving public talks at conferences and meetups such as Spark + AI Summit, ML Prague, and the Spark + AI Prague Meetup.

 

Outline

Introduction to Apache Spark

  • High level introduction to Spark
  • Introduction to Spark architecture
  • Spark APIs: high level vs low level vs internal APIs

Structured APIs in Spark

  • Basic concepts of DataFrame API
  • DataFrame, Row, Column
  • Operations in SparkSQL: transformations, actions
  • Working with DataFrame: creating a DataFrame and basic transformations
  • Working with different data types (Integer, String, Date, Timestamp, Boolean)
  • Filtering conditions
  • Dealing with null values
  • Joins

Lab I

  • Simple ETL

Advanced transformations with DataFrames

  • Aggregations and Window functions
  • User Defined Functions
  • Higher-order functions and complex data types (new in Spark 2.4)

Lab II

  • Analyzing data using DataFrame API

Metastore and tables

  • Catalog API
  • Table creation
  • Saving data
  • Caveats to be careful about

Lab III

  • Saving data and working with tables

Internal processes in Spark SQL

  • Catalyst - Optimization engine in Spark
  • Logical Planning
  • Physical Planning

Execution Layer

  • Introduction to low level APIs: RDDs
  • Structure of Spark job (Stages, Tasks, Shuffle)
  • DAG Scheduler
  • Lifecycle of a Spark application

Lab IV

  • Spark UI

Performance Tuning

  • Data persistence: caching, checkpointing
  • Bucketing & Partitioning
  • Most common bottlenecks in Spark applications
  • Optimization tips

Introduction to advanced analytics in Spark

  • Machine learning: basic concepts of ML Pipelines
  • Graph processing: basic concepts of GraphFrames library

Lab V

  • Machine learning and Graph processing

Structured Streaming

  • Basic concepts of streaming in Spark
  • Stateful vs stateless transformations
  • Event time processing
  • What a watermark is and how to use it to close state
  • Real time vs near real time processing

 

Prerequisites

No prior knowledge of Spark is required for this training. Very basic knowledge of Python and SQL is an advantage, but it is not a prerequisite. The training is taught in a Jupyter notebook environment using the Python programming language.

 



Reviews

5. 12. 2023
"High-quality course with a professional lecturer." — Igor Kováč, ČSOB