Post 1 Get started with Apache Spark, leveraging Databricks and PySpark.
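As a taste of what the post covers, here is a minimal sketch of reading data into a DataFrame with PySpark. The file path and schema options are illustrative assumptions; in a Databricks notebook the `spark` session already exists and `getOrCreate()` simply returns it.

```python
from pyspark.sql import SparkSession

# On Databricks a SparkSession named `spark` is pre-created;
# getOrCreate() returns it (or builds a local one elsewhere).
spark = SparkSession.builder.appName("getting-started").getOrCreate()

# Hypothetical CSV path; header and schema inference shown for illustration.
df = spark.read.csv("/tmp/sample_data.csv", header=True, inferSchema=True)

df.printSchema()  # inspect inferred column types
df.show(5)        # preview the first few rows
```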
Post 2 Easy DataFrame cleaning techniques, ranging from dropping problematic rows to selecting important columns.
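A quick sketch of the kind of cleaning the post walks through, using an assumed toy DataFrame and column names:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Illustrative data; the columns are assumptions, not from the post.
df = spark.createDataFrame(
    [("Alice", 34, "NYC"), ("Bob", None, "LA"), ("Alice", 34, "NYC")],
    ["name", "age", "city"],
)

cleaned = (
    df.dropDuplicates()          # remove exact duplicate rows
      .dropna(subset=["age"])    # drop rows missing an age
      .select("name", "age")     # keep only the columns we care about
)
cleaned.show()
```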
Post 3 Using PySpark to apply transformations to real datasets.
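A small example of the transformation style covered here, with a hypothetical sales dataset standing in for the real one:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical sales rows for illustration.
sales = spark.createDataFrame(
    [("2023-01-05", "books", 12.50), ("2023-01-06", "games", 59.99)],
    ["order_date", "category", "price"],
)

transformed = (
    sales.withColumn("order_date", F.to_date("order_date"))     # string -> date
         .withColumn("price_with_tax", F.col("price") * 1.08)   # derived column
         .filter(F.col("price") > 20)                           # keep only pricier orders
)
transformed.show()
```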
Post 4 Continuing to apply transformations to Spark DataFrames using PySpark.
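In the same spirit, a sketch of a few more common DataFrame transformations; the column names and segmentation rules are assumptions for illustration:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Assumed rider data for illustration.
riders = spark.createDataFrame(
    [("u1", "25", 3), ("u2", "41", 0), ("u3", "17", 12)],
    ["user_id", "age", "trips"],
)

result = (
    riders.withColumn("age", F.col("age").cast("int"))       # fix a string-typed column
          .withColumn(
              "segment",
              F.when(F.col("trips") >= 10, "frequent")
               .when(F.col("trips") > 0, "occasional")
               .otherwise("inactive"))                        # conditional column
          .withColumnRenamed("trips", "trip_count")           # clearer column name
)
result.show()
```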
Post 5 Become familiar with building a Structured Streaming query in PySpark using the Databricks interface.
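For a flavor of Structured Streaming, here is a self-contained sketch using Spark's built-in `rate` source (so no external data is needed); the window and watermark values are arbitrary choices, not from the post:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# The built-in "rate" source emits (timestamp, value) rows continuously.
stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

counts = (
    stream.withWatermark("timestamp", "10 seconds")
          .groupBy(F.window("timestamp", "5 seconds"))
          .count()
)

# In a Databricks notebook, display(counts) renders the live result;
# writing to the console works anywhere for a quick look.
query = (
    counts.writeStream
          .outputMode("update")
          .format("console")
          .start()
)
# query.awaitTermination()  # block until the stream is stopped
```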
Post 6 Working with Spark's original data structure API: Resilient Distributed Datasets.
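A minimal RDD sketch showing the lazy transformation / eager action split the post explores:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext   # the RDD API lives on the SparkContext

# Build an RDD from a local Python collection.
numbers = sc.parallelize(range(1, 11))

squares = numbers.map(lambda x: x * x)         # transformation (lazy)
evens = squares.filter(lambda x: x % 2 == 0)   # another transformation
total = evens.reduce(lambda a, b: a + b)       # action: triggers computation

print(evens.collect(), total)
```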
Post 7 Perform SQL-like joins and aggregations on your PySpark DataFrames.
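A short sketch of the join-then-aggregate pattern; the `orders` and `customers` tables and their columns are hypothetical stand-ins:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical tables for illustration.
orders = spark.createDataFrame(
    [(1, "c1", 30.0), (2, "c2", 12.5), (3, "c1", 99.0)],
    ["order_id", "customer_id", "amount"],
)
customers = spark.createDataFrame(
    [("c1", "US"), ("c2", "DE")],
    ["customer_id", "country"],
)

# Inner join on the shared key, then aggregate per country.
summary = (
    orders.join(customers, on="customer_id", how="inner")
          .groupBy("country")
          .agg(F.count("order_id").alias("orders"),
               F.sum("amount").alias("revenue"))
)
summary.show()
```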