Spark

Use Apache Spark to build data pipelines
Spark
24 Jun 2019

Performing Macro Operations on PySpark DataFrames

Perform SQL-like joins and aggregations on your PySpark DataFrames.

We've had quite a journey exploring the magical world of PySpark together. After covering DataFrame transformations, structured streams, and RDDs, there are only so many things left to cross off the list before we've gone too deep.

To round things out for this series, we're going to take a look back at some powerful DataFrame operations we missed. In particular, we'll be focusing on operations which modify DataFrames as a whole, such as joins and aggregations.

Joining DataFrames in PySpark

I'm going to assume you're already familiar with the concept of SQL-like joins. To demonstrate these in PySpark, I'll create two simple DataFrames:
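The snippet itself falls outside this excerpt, but a rough stand-in would look something like the following: two toy DataFrames, an inner join on a shared key, and a follow-up aggregation. The column names and sample values here are mine, not the post's.

```python
# A minimal sketch of joining two PySpark DataFrames; data and names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("joins-example").getOrCreate()

customers = spark.createDataFrame(
    [(1, "Alice"), (2, "Bob"), (3, "Carol")],
    ["customer_id", "name"],
)
orders = spark.createDataFrame(
    [(101, 1, 20.0), (102, 1, 35.5), (103, 3, 12.25)],
    ["order_id", "customer_id", "total"],
)

# Inner join on the shared key, then a simple per-customer aggregation.
joined = customers.join(orders, on="customer_id", how="inner")
joined.groupBy("name").sum("total").show()
```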

Continue Reading
Spark
06 Jun 2019

Working with PySpark RDDs

Working with Spark's original data structure API: Resilient Distributed Datasets.

For being the lifeblood of Spark, there’s surprisingly little documentation on how to actually work with RDDs. If I had to guess, most of the world has been too spoiled by DataFrames to be bothered with non-tabular data. Strange world we live in when using the core data API of Spark is considered a “pro move.”

We've already spent an awful lot of time in this series speaking about DataFrames, which are only one of the 3 data structure APIs we can work with in Spark (or one of two data structure APIs in PySpark, if you're keeping score).
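As a minimal sketch of what the RDD API feels like in practice (the numbers below are made up, and a plain local SparkSession is assumed, not the post's Databricks setup):

```python
# A quick RDD sketch: parallelize a local collection, then map/filter/reduce.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-example").getOrCreate()
sc = spark.sparkContext  # the SparkContext is the entry point for RDDs

rdd = sc.parallelize([1, 2, 3, 4, 5])
squared = rdd.map(lambda x: x * x).filter(lambda x: x > 4)

print(squared.collect())                    # [9, 16, 25]
print(squared.reduce(lambda a, b: a + b))   # 50
```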

Continue Reading
Spark
13 May 2019

Structured Streaming in PySpark

Become familiar with building a structured stream in PySpark using the Databricks interface.

Now that we're comfortable with Spark DataFrames, we're going to apply this newfound knowledge to help us implement a streaming data pipeline in PySpark. As it turns out, real-time data streaming is one of Spark's greatest strengths.

For this go-around, we'll touch on the basics of how to build a structured stream in Spark. Databricks has a few sweet features which help us visualize streaming data: we'll be using these features to validate whether or not our stream worked. If you're looking to hook Spark into a message broker or create a production-ready pipeline, we'll be covering this in a later post.
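The skeleton of a structured stream looks roughly like this. This is a hedged sketch rather than the post's notebook code: the schema, the watched directory, and the in-memory sink are all assumptions for illustration.

```python
# A minimal structured-streaming sketch; schema, path, and sink are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("streaming-example").getOrCreate()

schema = StructType([
    StructField("user", StringType()),
    StructField("action", StringType()),
    StructField("count", IntegerType()),
])

# readStream watches a directory for new JSON files and treats them as an unbounded table.
events = spark.readStream.schema(schema).json("/tmp/incoming-events/")

# The in-memory sink makes results easy to inspect from a notebook cell.
query = (events.groupBy("action").count()
         .writeStream.outputMode("complete")
         .format("memory").queryName("action_counts")
         .start())

# Once the stream has picked up some files, the running counts are queryable as a table.
spark.sql("SELECT * FROM action_counts").show()
```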

Continue Reading
Spark
07 May 2019

DataFrame Transformations in PySpark (Continued)

Continuing to apply transformations to Spark DataFrames using PySpark.

We've covered a fair amount of ground when it comes to Spark DataFrame transformations in this series. In part 1, we touched on filter(), select(), dropna(), fillna(), and isNull(). Then, we moved on to dropDuplicates() and user-defined functions (udf) in part 2. This time around, we'll be building on these concepts and introduce some new ways to transform data so you can officially be awarded your PySpark Guru Certification, awarded by us here at Hackers & Slackers.*

*Hackers & Slackers is not an accredited institution and is respected by virtually nobody in general.
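As a quick refresher on those earlier operations before the new material, here's a hedged sketch (the DataFrame contents and the udf are mine, purely for illustration):

```python
# Recap sketch: dropDuplicates(), fillna(), and a simple udf; data is made up.
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("transforms-recap").getOrCreate()

df = spark.createDataFrame(
    [("alice", "NY"), ("alice", "NY"), ("bob", None)],
    ["name", "state"],
)

# dropDuplicates() removes exact duplicate rows; fillna() patches the missing state.
deduped = df.dropDuplicates().fillna({"state": "unknown"})

# A user-defined function applied column-wise (use sparingly -- built-ins are faster).
shout = udf(lambda s: s.upper(), StringType())
deduped.withColumn("name_upper", shout(deduped.name)).show()
```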

Of course, we need to get things started

Continue Reading
Spark
28 Apr 2019

Executing Basic DataFrame Transformations in PySpark

Using PySpark to apply transformations to real datasets.

If you joined us last time, you should have some working knowledge of how to get started with PySpark by using a Databricks notebook. Armed with that knowledge, we can now start playing with real data.

For most of the time we spend in PySpark, we'll likely be working with Spark DataFrames: this is our bread and butter for data manipulation in Spark. For this exercise, we'll attempt to execute an elementary string of transformations to get a feel for what the middle portion of an ETL pipeline looks like (also known as the "transform" part 😁).

Loading Up Some Data
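The post's dataset isn't shown in this excerpt, so here's a hedged stand-in: load a CSV and chain a few elementary transformations onto it. The file path and column names below are hypothetical.

```python
# A small "transform" step: load a CSV, then select/filter/rename/sort.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("basic-transforms").getOrCreate()

# In Databricks this file would typically live under /FileStore; the path here is made up.
df = spark.read.csv("/tmp/sample-data.csv", header=True, inferSchema=True)

# An elementary chain of transformations.
transformed = (df.select("city", "population")
                 .filter(col("population") > 100000)
                 .withColumnRenamed("population", "pop")
                 .orderBy(col("pop").desc()))
transformed.show(5)
```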

Continue Reading
Spark
27 Apr 2019

Cleaning PySpark DataFrames

Easy DataFrame cleaning techniques, ranging from dropping problematic rows to selecting important columns.

There's something about being a data engineer that makes it impossible to clearly convey thoughts in an articulate manner. It seems inevitable that every well-meaning Spark tutorial is destined to devolve into walls of incomprehensible code with minimal explanation. This is even apparent on StackOverflow, where simple questions are regularly met with absurdly unnecessary solutions (stop making UDFs for everything!). Anyway, what I'm trying to say is that it takes a lot of guts to click into these things, and here you are. I appreciate you.

In our last episode, we covered some Spark basics, played with Databricks, and started loading data.
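In the spirit of the post's subtitle (dropping problematic rows, selecting important columns), a minimal cleaning sketch might look like this; the data below is made up, and no UDFs are required:

```python
# Basic cleaning: drop rows missing a key field, fill remaining gaps, keep useful columns.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cleaning-example").getOrCreate()

df = spark.createDataFrame(
    [("alice", 34, "NY"), (None, 29, "CA"), ("carol", None, None)],
    ["name", "age", "state"],
)

cleaned = (df.dropna(subset=["name"])                 # drop rows without a name
             .fillna({"age": 0, "state": "unknown"})  # fill the remaining nulls
             .select("name", "age"))                  # keep only the columns we care about
cleaned.show()
```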

Continue Reading
Spark
26 Apr 2019

Learning Apache Spark with PySpark & Databricks

Get started with Apache Spark in part 1 of our series, where we leverage Databricks and PySpark.

Something we've only begun to touch on so far is the benefit of utilizing Apache Spark in larger-scale data pipelines. Spark is a quintessential part of the Apache data stack: built atop Hadoop, Spark is intended to handle resource-intensive jobs such as data streaming and graph processing.

Much of Spark's allure comes from the fact that it is written in Scala & Java. Java and its offshoot languages are notorious for running extremely memory-heavy at run time, which can be used to our advantage. Because everything is stored in memory, our jobs become predictably resource-intensive, which allows us
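That in-memory model is easy to see in code. As a rough illustration (not from the post; the file path is hypothetical), caching a DataFrame keeps it in executor memory after the first action, so repeated actions don't re-read the source:

```python
# Rough illustration of Spark's in-memory model via cache(); the path is made up.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("getting-started").getOrCreate()

df = spark.read.csv("/tmp/some-large-file.csv", header=True, inferSchema=True)

# cache() keeps the DataFrame in executor memory after the first action,
# so the second action reuses it instead of re-reading the file.
df.cache()
print(df.count())
print(df.filter(df[df.columns[0]].isNotNull()).count())
```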

Continue Reading