Apache

Tutorials for Apache’s suite of big data products. Includes Apache Hadoop, Apache Spark, Apache Kafka, and other critical technologies for any data engineer.
Performing Macro Operations on PySpark DataFrames

Perform SQL-like joins and aggregations on your PySpark DataFrames.

We've had quite a journey exploring the magical world of PySpark together. After covering DataFrame transformations, structured streams, and RDDs, there are only so many things left to cross off the list before we've gone too deep.

To round things out for this series, we're going to take a look back at some powerful DataFrame operations we missed. In particular, we'll be focusing on operations which modify DataFrames as a whole, such as

Joining DataFrames in PySpark

I'm going to assume you're already familiar with the concept of SQL-like joins. To demonstrate these in PySpark, I'll create two simple DataFrames:
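As a preview of where we're headed, here's a minimal sketch of a join plus an aggregation. The column names and sample rows below are invented purely for illustration, not pulled from the post's dataset:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("joins-demo").getOrCreate()

# Two toy DataFrames sharing a "customer_id" key.
customers = spark.createDataFrame(
    [(1, "Ann"), (2, "Bob"), (3, "Cal")],
    ["customer_id", "name"],
)
orders = spark.createDataFrame(
    [(1, 120.50), (1, 45.00), (3, 99.99)],
    ["customer_id", "order_total"],
)

# SQL-like inner join on the shared key.
joined = customers.join(orders, on="customer_id", how="inner")

# A simple aggregation on top of the join.
joined.groupBy("name").agg(F.sum("order_total").alias("total_spent")).show()
```

Swapping the `how` argument is all it takes to move between inner, left, right, and full outer joins.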

Working with PySpark RDDs

Working with Spark's original data structure API: Resilient Distributed Datasets.

For being the lifeblood of Spark, there's surprisingly little documentation on how to actually work with RDDs. If I had to guess, most of the world has been too spoiled by DataFrames to be bothered with non-tabular data. Strange world we live in when using the core data API of Spark is considered a "pro move."

We've already spent an awful lot of time in this series speaking about DataFrames, which are only one of the three data structure APIs we can work with in Spark (or one of two in PySpark, if you're keeping score).
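To give a taste of what's ahead, here's a minimal hedged sketch of the basic RDD workflow; the word list is made up for the sake of demonstration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
sc = spark.sparkContext  # RDD operations hang off the SparkContext

# parallelize() distributes a local Python collection across the cluster.
rdd = sc.parallelize(["spark", "kafka", "airflow", "spark"])

# Classic map -> reduceByKey word count.
counts = rdd.map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)
print(counts.collect())  # [('spark', 2), ('kafka', 1), ('airflow', 1)] (order may vary)
```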

Manage Data Pipelines with Apache Airflow

Use Apache Airflow to build and monitor better data pipelines.

It seems like almost every data-heavy Python shop is using Airflow in some way these days. It shouldn't take much time in Airflow's interface to figure out why: Airflow is the missing piece data engineers need to standardize the creation of ETL pipelines. The best part of Airflow, of course, is that it's one of the rare projects donated to the Apache foundation which is written in Python. Hooray!
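To give a rough idea of what that standardization looks like, here's a minimal sketch of a DAG with a single Python task. The DAG id, schedule, and callable are placeholders, and import paths vary a bit between Airflow versions:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator  # airflow.operators.python_operator in older releases


def extract():
    # Placeholder "extract" step for the sketch.
    print("pulling data from a source system")


with DAG(
    dag_id="example_etl",          # hypothetical pipeline name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
```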

If you happen to be a data engineer who isn't using Airflow (or equivalent) yet, you're in for a treat. It won't take much time using Airflow before you wonder how

Structured Streaming in PySpark

Become familiar with building a structured stream in PySpark using the Databricks interface.

Now that we're comfortable with Spark DataFrames, we're going to apply this newfound knowledge to help us implement a streaming data pipeline in PySpark. As it turns out, real-time data streaming is one of Spark's greatest strengths.

For this go-around, we'll touch on the basics of how to build a structured stream in Spark. Databricks has a few sweet features which help us visualize streaming data: we'll be using these features to validate whether or not our stream worked. If you're looking to hook Spark into a message broker or create a production-ready pipeline, we'll be covering this in a
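As a preview, here's a minimal sketch of what a structured stream looks like in a Databricks notebook (where the `spark` session is provided for you). The input path and schema are placeholders:

```python
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Streaming reads require an explicit schema.
schema = StructType([
    StructField("event", StringType(), True),
    StructField("count", IntegerType(), True),
])

stream_df = (
    spark.readStream                  # "spark" is provided by Databricks notebooks
         .schema(schema)
         .json("/mnt/data/events/")   # hypothetical directory of incoming JSON files
)

query = (
    stream_df.writeStream
             .format("memory")        # in-memory sink, handy for validating the stream
             .queryName("events")     # query it with: spark.sql("SELECT * FROM events")
             .outputMode("append")
             .start()
)
```

The in-memory sink is only meant for testing; a production pipeline would swap in a durable sink instead.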

DataFrame Transformations in PySpark (Continued)

Continuing to apply transformations to Spark DataFrames using PySpark.

We've covered a fair amount of ground when it comes to Spark DataFrame transformations in this series. In part 1, we touched on filter(), select(), dropna(), fillna(), and isNull(). Then, we moved on to dropDuplicates() and user-defined functions (udf) in part 2. This time around, we'll be building on these concepts and introduce some new ways to transform data so you can officially be awarded your PySpark Guru Certification, awarded by us here at Hackers & Slackers.*

*Hackers & Slackers is not an accredited institution and is respected by virtually nobody in general.
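For anyone who skipped ahead, here's a quick hedged recap of those earlier transformations on an invented toy DataFrame (none of this data comes from the posts themselves):

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("transforms-recap").getOrCreate()

# Toy data purely for illustration.
df = spark.createDataFrame(
    [(1, "books", 12.99), (1, "books", 12.99), (2, None, 7.50)],
    ["id", "category", "amount"],
)

# Part 1: filtering, selecting, and handling nulls.
cleaned = (
    df.filter(F.col("amount").isNotNull())
      .select("id", "category", "amount")
      .fillna({"category": "unknown"})
)

# Part 2: dropping duplicates and applying a user-defined function.
shout = F.udf(lambda s: s.upper() if s else None, StringType())
cleaned.dropDuplicates(["id", "category"]).withColumn("category", shout("category")).show()
```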

Of course, we need to get things started

Becoming Familiar with Apache Kafka and Message Queues

An overview of how Kafka works, as well as equivalent message brokers.

Data engineering technology stacks have a relatively sizeable amount of variance across companies. Depending on the skills and languages preferred by a company's developers, a data stack might be anything from a heavily Java-based shop to a Python shop relying on PySpark. Despite the lack of a prescribed "industry-standard" stack, it's becoming clear that one thing will likely be shared by all high-throughput, next-generation data organizations: Apache Kafka.

Kafka is the go-to centerpiece for organizations dealing with massive amounts of data in real-time. Kafka is designed to process billions (or even trillions) of data events per day; a feat
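To make that a little more concrete, here's a minimal hedged sketch of producing and consuming JSON events with the kafka-python client; the broker address and topic name are placeholders:

```python
import json

from kafka import KafkaConsumer, KafkaProducer

# Produce a single JSON-encoded event to a hypothetical "events" topic.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("events", {"user": "anna", "action": "click"})
producer.flush()

# Consume from the same topic, starting at the earliest available offset.
consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for message in consumer:
    print(message.value)
    break  # one message is plenty for the demo
```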

Executing Basic DataFrame Transformations in PySpark

Using PySpark to apply transformations to real datasets.

If you joined us last time, you should have some working knowledge of how to get started with PySpark by using a Databricks notebook. Armed with that knowledge, we can now start playing with real data.

For most of the time we spend in PySpark, we'll likely be working with Spark DataFrames: this is our bread and butter for data manipulation in Spark. For this exercise, we'll attempt to execute an elementary string of transformations to get a feel for what the middle portion of an ETL pipeline looks like (also known as the "transform" part 😁).

Loading Up Some Data
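If you want to follow along, loading a CSV into a DataFrame looks roughly like this; the path is a placeholder for wherever your file lands in Databricks:

```python
# "spark" is the SparkSession Databricks provides in every notebook.
df = (
    spark.read
         .format("csv")
         .option("header", "true")       # first row holds column names
         .option("inferSchema", "true")  # let Spark guess column types
         .load("/FileStore/tables/sample_data.csv")  # hypothetical upload path
)

df.printSchema()
df.show(5)
```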

Cleaning PySpark DataFrames

Easy DataFrame cleaning techniques, ranging from dropping problematic rows to selecting important columns.

There's something about being a data engineer that makes it impossible to clearly convey thoughts in an articulate manner. It seems inevitable that every well-meaning Spark tutorial is destined to devolve into walls of incomprehensible code with minimal explanation. This is even apparent on StackOverflow, where simple questions are regularly met with absurdly unnecessary solutions (stop making UDFs for everything!). Anyway, what I'm trying to say is it takes a lot of guts to click into these things, and here you are. I appreciate you.

In our last episode, we covered some Spark basics, played with Databricks, and started loading

Learning Apache Spark with PySpark & Databricks

Get started with Apache Spark in part 1 of our series, where we leverage Databricks and PySpark.

Something we've only begun to touch on so far is the benefit of utilizing Apache Spark in larger-scale data pipelines. Spark is a quintessential part of the Apache data stack: built atop Hadoop, Spark is intended to handle resource-intensive jobs such as data streaming and graph processing.

Much of Spark's allure comes from the fact that it is written in Scala & Java. Java and its offshoot languages are notorious for being extremely memory-heavy at runtime, which we can use to our advantage. Because everything is stored in memory, our jobs become predictably resource-intensive, which allows us
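As a point of reference (not something from the post itself), getting a SparkSession locally looks something like this; in Databricks one is already provided as `spark`:

```python
from pyspark.sql import SparkSession

# In Databricks this object already exists as "spark"; locally we build our own.
spark = (
    SparkSession.builder
    .appName("getting-started")
    .getOrCreate()
)

print(spark.version)  # quick sanity check that the session is alive
```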

From CSVs to Tables: Infer Data Types From Raw Spreadsheets

The quest to never explicitly set a table schema ever again.

Back in August of last year (roughly 8 months ago), I hunched over my desk at 4 am desperate to fire off a post before boarding a flight the next morning. The article was titled Creating Database Schemas: a Job for Robots, or Perhaps Pandas. It was my intent at the time to solve a common annoyance: creating database tables out of raw data, without the obnoxious process of explicitly setting each column's datatype. I had a few leads that led me to believe I had the answer... boy was I wrong.

The task seems somewhat reasonable on the surface.
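To sketch the general idea (and only as an illustration, not necessarily the approach the article lands on): let pandas infer each column's dtype from the raw CSV, then translate those dtypes into SQL column types. The file name and type mapping below are made up:

```python
import pandas as pd

df = pd.read_csv("raw_data.csv")  # hypothetical raw spreadsheet

# Rough mapping from pandas dtypes to SQL column types.
dtype_to_sql = {
    "int64": "INTEGER",
    "float64": "FLOAT",
    "bool": "BOOLEAN",
    "datetime64[ns]": "TIMESTAMP",
    "object": "TEXT",
}

for column, dtype in df.dtypes.items():
    print(f"{column}: {dtype_to_sql.get(str(dtype), 'TEXT')}")
```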