Data Engineering

Build pipelines to extract and prepare data on a massive scale. Learn "big data" technologies such as Apache Spark and other horizontally scalable data tools.

Simplify BigQuery ETL jobs using SQLAlchemy

Simplify BigQuery ETL jobs using SQLAlchemy

Extract and move data between BigQuery and relational databases using a plugin for SQLAlchemy.
Using Amazon Redshift as your Data Warehouse

Using Amazon Redshift as your Data Warehouse

Get the most out of Redshift by performance tuning your cluster and learning how to query your data optimally.
Join and Aggregate PySpark DataFrames

Join and Aggregate PySpark DataFrames

Perform SQL-like joins and aggregations on your PySpark DataFrames.
Working with PySpark RDDs

Working with PySpark RDDs

Working with Spark's original data structure API: Resilient Distributed Datasets.
Manage Data Pipelines with Apache Airflow

Manage Data Pipelines with Apache Airflow

Use Apache Airflow to build and monitor better data pipelines.
Structured Streaming in PySpark

Structured Streaming in PySpark

Become familiar with building a structured stream in PySpark using the Databricks interface.
Becoming Familiar with Apache Kafka and Message Queues

Becoming Familiar with Apache Kafka and Message Queues

An overview of how Kafka works, as well as equivalent message brokers.
Cleaning PySpark DataFrames

Cleaning PySpark DataFrames

Easy DataFrame cleaning techniques ranging from dropping rows to selecting important data.