Post 1 Get started with Apache Spark, leveraging Databricks and PySpark.
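As a taste of what the post covers, here is a minimal sketch of reading data into a DataFrame with PySpark. The file path and schema options are illustrative assumptions; in a Databricks notebook the `spark` session already exists and `getOrCreate()` simply returns it.

```python
from pyspark.sql import SparkSession

# On Databricks a SparkSession named `spark` is pre-created;
# getOrCreate() returns it (or builds a local one elsewhere).
spark = SparkSession.builder.appName("getting-started").getOrCreate()

# Hypothetical CSV path; header and schema inference shown for illustration.
df = spark.read.csv("/tmp/sample_data.csv", header=True, inferSchema=True)

df.printSchema()  # inspect inferred column types
df.show(5)        # preview the first few rows
```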
Post 2 Easy DataFrame cleaning techniques, ranging from dropping problematic rows to selecting important columns.
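A quick sketch of the kind of cleaning the post walks through, using an assumed toy DataFrame and column names:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Illustrative data; the columns are assumptions, not from the post.
df = spark.createDataFrame(
    [("Alice", 34, "NYC"), ("Bob", None, "LA"), ("Alice", 34, "NYC")],
    ["name", "age", "city"],
)

cleaned = (
    df.dropDuplicates()          # remove exact duplicate rows
      .dropna(subset=["age"])    # drop rows missing an age
      .select("name", "age")     # keep only the columns we care about
)
cleaned.show()
```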
Post 3 Using PySpark to apply transformations to real datasets.
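A small example of the transformation style covered here, with a hypothetical sales dataset standing in for the real one:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical sales rows for illustration.
sales = spark.createDataFrame(
    [("2023-01-05", "books", 12.50), ("2023-01-06", "games", 59.99)],
    ["order_date", "category", "price"],
)

transformed = (
    sales.withColumn("order_date", F.to_date("order_date"))     # string -> date
         .withColumn("price_with_tax", F.col("price") * 1.08)   # derived column
         .filter(F.col("price") > 20)                           # keep only pricier orders
)
transformed.show()
```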
Post 4 Continuing to apply transformations to Spark DataFrames using PySpark.
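In the same spirit, a sketch of a few more common DataFrame transformations; the column names and segmentation rules are assumptions for illustration:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Assumed rider data for illustration.
riders = spark.createDataFrame(
    [("u1", "25", 3), ("u2", "41", 0), ("u3", "17", 12)],
    ["user_id", "age", "trips"],
)

result = (
    riders.withColumn("age", F.col("age").cast("int"))       # fix a string-typed column
          .withColumn(
              "segment",
              F.when(F.col("trips") >= 10, "frequent")
               .when(F.col("trips") > 0, "occasional")
               .otherwise("inactive"))                        # conditional column
          .withColumnRenamed("trips", "trip_count")           # clearer column name
)
result.show()
```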
Post 5 Become familiar with building a Structured Streaming query in PySpark using the Databricks interface.
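For a flavor of Structured Streaming, here is a self-contained sketch using Spark's built-in `rate` source (so no external data is needed); the window and watermark values are arbitrary choices, not from the post:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# The built-in "rate" source emits (timestamp, value) rows continuously.
stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

counts = (
    stream.withWatermark("timestamp", "10 seconds")
          .groupBy(F.window("timestamp", "5 seconds"))
          .count()
)

# In a Databricks notebook, display(counts) renders the live result;
# writing to the console works anywhere for a quick look.
query = (
    counts.writeStream
          .outputMode("update")
          .format("console")
          .start()
)
# query.awaitTermination()  # block until the stream is stopped
```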
Post 6 Working with Spark's original data structure API: Resilient Distributed Datasets.
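A minimal RDD sketch showing the lazy transformation / eager action split the post explores:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext   # the RDD API lives on the SparkContext

# Build an RDD from a local Python collection.
numbers = sc.parallelize(range(1, 11))

squares = numbers.map(lambda x: x * x)         # transformation (lazy)
evens = squares.filter(lambda x: x % 2 == 0)   # another transformation
total = evens.reduce(lambda a, b: a + b)       # action: triggers computation

print(evens.collect(), total)
```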
Post 7 Perform SQL-like joins and aggregations on your PySpark DataFrames.
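A short sketch of the join-then-aggregate pattern; the `orders` and `customers` tables and their columns are hypothetical stand-ins:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical tables for illustration.
orders = spark.createDataFrame(
    [(1, "c1", 30.0), (2, "c2", 12.5), (3, "c1", 99.0)],
    ["order_id", "customer_id", "amount"],
)
customers = spark.createDataFrame(
    [("c1", "US"), ("c2", "DE")],
    ["customer_id", "country"],
)

# Inner join on the shared key, then aggregate per country.
summary = (
    orders.join(customers, on="customer_id", how="inner")
          .groupBy("country")
          .agg(F.count("order_id").alias("orders"),
               F.sum("amount").alias("revenue"))
)
summary.show()
```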