Collect and transform data on a large scale. Build data pipelines, work with a horizontally scalable architecture, or simply scrape and collect data.
Supercharge your scraper to extract quality page metadata by parsing JSON-LD data via Python's extruct library.
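A minimal sketch of that parsing step, assuming the `requests` and `extruct` packages are installed; the URL below is a placeholder:

```python
import extruct
import requests

URL = "https://example.com"  # stand-in for the page you want metadata from

# Fetch the raw HTML of the target page.
html = requests.get(URL).text

# extruct.extract() parses embedded metadata; restricting `syntaxes`
# to JSON-LD skips the other formats (microdata, RDFa, Open Graph, etc.).
metadata = extruct.extract(html, base_url=URL, syntaxes=["json-ld"])

# Each JSON-LD block comes back as a plain dict.
for item in metadata["json-ld"]:
    print(item.get("@type"), item)
```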
Extract and move data between BigQuery and relational databases using PyBigQuery: a BigQuery dialect for SQLAlchemy.
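Roughly what that round trip can look like with pandas in the middle; the project, dataset, and connection strings below are placeholders:

```python
import pandas as pd
from sqlalchemy import create_engine

# PyBigQuery registers the `bigquery://` dialect with SQLAlchemy.
# `my-project` and `my_dataset` are placeholder names.
bq_engine = create_engine("bigquery://my-project/my_dataset")
pg_engine = create_engine("postgresql://user:password@localhost:5432/mydb")

# Pull a table out of BigQuery into a DataFrame...
df = pd.read_sql("SELECT * FROM my_table LIMIT 1000", bq_engine)

# ...and land it in a relational database (Postgres here).
df.to_sql("my_table", pg_engine, if_exists="replace", index=False)
```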
Get the most out of Redshift by performance tuning your cluster and learning how to query your data optimally.
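One common starting point for tuning is Redshift's `SVV_TABLE_INFO` system view, which flags skewed, unsorted, or stale tables. A sketch using psycopg2 with placeholder credentials:

```python
import psycopg2

# Placeholder connection details for your cluster.
conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="mydb",
    user="admin",
    password="password",
)

# SVV_TABLE_INFO surfaces per-table health stats: row skew across
# slices, percentage of unsorted rows, and stale statistics -- the
# usual suspects when queries slow down.
QUERY = """
    SELECT "table", diststyle, skew_rows, unsorted, stats_off
    FROM svv_table_info
    ORDER BY unsorted DESC;
"""

with conn.cursor() as cur:
    cur.execute(QUERY)
    for row in cur.fetchall():
        print(row)
```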
Perform SQL-like joins and aggregations on your PySpark DataFrames.
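A toy example of both operations on in-memory DataFrames; the tables and column names are made up:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("joins-and-aggs").getOrCreate()

# Two small DataFrames standing in for real tables.
orders = spark.createDataFrame(
    [(1, "alice", 20.0), (2, "bob", 35.5), (3, "alice", 12.25)],
    ["order_id", "customer", "total"],
)
customers = spark.createDataFrame(
    [("alice", "US"), ("bob", "CA")],
    ["customer", "country"],
)

# A SQL-style inner join followed by a grouped aggregation.
(orders
    .join(customers, on="customer", how="inner")
    .groupBy("country")
    .agg(F.sum("total").alias("revenue"), F.count("*").alias("orders"))
    .show())
```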
Work with Spark's original data structure API: Resilient Distributed Datasets.
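A classic word count as a quick RDD sketch:

```python
from pyspark import SparkContext

sc = SparkContext(appName="rdd-basics")

# parallelize() turns a local collection into a distributed RDD.
words = sc.parallelize(["spark", "rdd", "spark", "pipeline", "rdd", "spark"])

# Map each word to a (word, 1) pair, then reduce by key to count them.
counts = (words
    .map(lambda word: (word, 1))
    .reduceByKey(lambda a, b: a + b))

print(counts.collect())  # e.g. [('spark', 3), ('rdd', 2), ('pipeline', 1)]
```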
Use Apache Airflow to build and monitor better data pipelines.
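A minimal two-task DAG sketch, assuming an Airflow 2.x install; the DAG and task names are placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator  # Airflow 2.x import path


def extract():
    print("pulling data from the source...")


def load():
    print("loading data into the warehouse...")


# A two-task DAG that runs daily; names are placeholders.
with DAG(
    dag_id="example_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)

    # >> sets the dependency: extract runs before load.
    extract_task >> load_task
```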
Become familiar with building a structured stream in PySpark using the Databricks interface.
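A self-contained sketch using Spark's built-in `rate` source, so it runs with or without Databricks (where a `spark` session is already provided):

```python
from pyspark.sql import SparkSession

# On Databricks, `spark` already exists; building it here keeps the
# sketch runnable locally too.
spark = SparkSession.builder.appName("structured-stream").getOrCreate()

# The `rate` source emits (timestamp, value) rows on a timer, which
# makes it handy for testing a stream without real infrastructure.
stream = (spark.readStream
    .format("rate")
    .option("rowsPerSecond", 5)
    .load())

# Write to the console sink; a real pipeline would point at a sink
# like Delta, Kafka, or a file path instead.
query = (stream.writeStream
    .format("console")
    .outputMode("append")
    .start())

query.awaitTermination(30)  # run for ~30 seconds, then return
```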
Get to know Apache Kafka: a horizontally scalable event-streaming platform. Learn what makes Kafka critical to high-volume, low-latency data pipelines.
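A round-trip sketch using the kafka-python client, assuming a broker at localhost:9092; the topic name is a placeholder:

```python
from kafka import KafkaConsumer, KafkaProducer

BROKER = "localhost:9092"   # placeholder broker address
TOPIC = "events"            # placeholder topic name

# Produce a few messages; kafka-python handles batching and retries.
producer = KafkaProducer(bootstrap_servers=BROKER)
for i in range(3):
    producer.send(TOPIC, value=f"event-{i}".encode("utf-8"))
producer.flush()

# Consume them back from the beginning of the topic.
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=BROKER,
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,  # stop iterating after 5s of silence
)
for message in consumer:
    print(message.value.decode("utf-8"))
```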