Tutorials for Apache Big Data technologies, including Apache Spark, Apache Kafka, Apache Airflow, and other critical tools for data engineers.
Perform SQL-like joins and aggregations on your PySpark DataFrames.
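As a taste of what that tutorial covers, here is a minimal sketch of a join plus aggregation in PySpark; the DataFrames, column names, and values below are hypothetical examples, not taken from the post:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("joins-and-aggregations").getOrCreate()

# Hypothetical order and customer data for illustration.
orders = spark.createDataFrame(
    [(1, "alice", 20.0), (2, "bob", 35.5), (3, "alice", 10.0)],
    ["order_id", "customer", "total"],
)
customers = spark.createDataFrame(
    [("alice", "US"), ("bob", "CA")],
    ["customer", "country"],
)

# SQL-like inner join on the shared "customer" column.
joined = orders.join(customers, on="customer", how="inner")

# Aggregate: total revenue and order count per country.
summary = joined.groupBy("country").agg(
    F.sum("total").alias("revenue"),
    F.count("order_id").alias("orders"),
)
summary.show()
```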
Working with Spark's original data structure API: Resilient Distributed Datasets.
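For a sense of the RDD API's style, here is a minimal sketch of the classic word-count pattern built from transformations and an action; the input strings are illustrative only:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-basics").getOrCreate()
sc = spark.sparkContext

# Build an RDD from a local collection (values are illustrative).
lines = sc.parallelize(["spark makes rdds", "rdds are resilient", "spark is fast"])

# Chain RDD transformations, then trigger them with an action.
counts = (
    lines.flatMap(lambda line: line.split())   # split lines into words
    .map(lambda word: (word, 1))               # pair each word with a count of 1
    .reduceByKey(lambda a, b: a + b)           # sum counts per word
)
print(counts.collect())
```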
Use Apache Airflow to build and monitor better data pipelines.
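To illustrate the shape of an Airflow pipeline, here is a minimal sketch of a two-task DAG; the dag_id, task names, schedule, and task bodies are hypothetical, and the operator import path assumes Airflow 2.x:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    # Placeholder task body; a real pipeline would pull from a source system.
    print("extracting data")


def load():
    # Placeholder task body; a real pipeline would write to a destination.
    print("loading data")


with DAG(
    dag_id="example_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)

    # Run extract before load.
    extract_task >> load_task
```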
Become familiar with building a structured stream in PySpark using the Databricks interface.
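For orientation, here is a minimal Structured Streaming sketch in PySpark; the input path, schema, and sink choice are assumptions for illustration, and on Databricks you might instead inspect results with display() or write to a Delta sink:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("structured-stream").getOrCreate()

# Read a stream of JSON files as they land in a directory (path is hypothetical).
events = (
    spark.readStream.format("json")
    .schema("user STRING, action STRING, ts TIMESTAMP")
    .load("/mnt/demo/events/")
)

# Simple streaming aggregation: count actions per user.
counts = events.groupBy("user").count()

# Write the running counts to an in-memory table for inspection.
query = (
    counts.writeStream.outputMode("complete")
    .format("memory")
    .queryName("user_counts")
    .start()
)
```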
Getting to know Apache Kafka: a horizontally scalable event streaming platform. Learn what makes Kafka critical to high-volume, low-latency data pipelines.
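To make the producer/consumer model concrete, here is a minimal sketch using the kafka-python client; the broker address, topic name, and message contents are assumptions for illustration, not details from the tutorial:

```python
from kafka import KafkaProducer, KafkaConsumer

# Publish an event to a hypothetical "events" topic on a local broker.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("events", key=b"user-1", value=b'{"action": "click"}')
producer.flush()

# A consumer (typically a separate process) reads the same topic from the start.
consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=1000,
)
for message in consumer:
    print(message.key, message.value)
```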
Easy DataFrame cleaning techniques ranging from dropping rows to selecting important data.
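Here is a minimal sketch of that kind of cleaning pass; the raw rows, column names, and fill values are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dataframe-cleaning").getOrCreate()

# Hypothetical raw data with duplicates and missing values.
raw = spark.createDataFrame(
    [("alice", 34, "US"), ("alice", 34, "US"), ("bob", None, "CA"), (None, 19, None)],
    ["name", "age", "country"],
)

cleaned = (
    raw.dropDuplicates()                  # remove exact duplicate rows
    .dropna(subset=["name"])              # drop rows missing a name
    .fillna({"country": "unknown"})       # fill missing countries with a default
    .select("name", "age", "country")     # keep only the columns we care about
)
cleaned.show()
```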
Apply transformations to PySpark DataFrames, such as creating new columns, filtering rows, or modifying string and number values.
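As a small sketch of those transformations, the snippet below derives a column, modifies a string column, and filters rows; the product data and tax rate are made up for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dataframe-transformations").getOrCreate()

# Hypothetical product data for illustration.
products = spark.createDataFrame(
    [("Widget", 19.99, 3), ("gadget", 5.50, 0), ("Doohickey", 42.00, 12)],
    ["name", "price", "stock"],
)

transformed = (
    products.withColumn("name", F.upper(F.col("name")))               # modify a string column
    .withColumn("price_with_tax", F.round(F.col("price") * 1.2, 2))   # derive a new numeric column
    .filter(F.col("stock") > 0)                                        # filter out rows with no stock
)
transformed.show()
```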
Get started with Apache Spark in part 1 of our series, where we leverage Databricks and PySpark.
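For a first look at what that entails, here is a minimal sketch of creating a session and reading a file; the CSV path is hypothetical, and on Databricks a ready-made SparkSession named spark is already provided in the notebook:

```python
from pyspark.sql import SparkSession

# Locally you build the session yourself; on Databricks `spark` already exists.
spark = SparkSession.builder.appName("getting-started").getOrCreate()

# Load a CSV into a DataFrame and take a first look (path is hypothetical).
df = spark.read.csv("/mnt/demo/sample.csv", header=True, inferSchema=True)
df.printSchema()
df.show(5)
```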