Data Engineering

The systematic collection and transformation of data via the creation of tools and pipelines.

Using Amazon Redshift as your Data Warehouse

Get the most out of Redshift by performance tuning your cluster and learning how to query your data optimally.

Redshift is quickly taking its place as the world's most popular solution for dumping obscene amounts of data into storage. It's nice to see good services flourish while clunky Hadoop-based stacks of yesterdecade suffer a long, painful death. Regardless of whether you're in data science, data engineering, or analysis, it's only a matter of time before all of us work with the world's most popular data warehouse.

While Redshift's rise to power has been deserved, the unanimous popularity of any service can cause problems... namely, the knowledge gaps that come with defaulting to any de facto industry solution. Most of

Performing Macro Operations on PySpark DataFrames

Perform SQL-like joins and aggregations on your PySpark DataFrames.

We've had quite a journey exploring the magical world of PySpark together. After covering DataFrame transformations, structured streams, and RDDs, there are only so many things left to cross off the list before we've gone too deep.

To round things out for this series, we're going to take a look back at some powerful DataFrame operations we missed. In particular, we'll be focusing on operations which modify DataFrames as a whole, such as

Joining DataFrames in PySpark

I'm going to assume you're already familiar with the concept of SQL-like joins. To demonstrate these in PySpark, I'll create two simple DataFrames:
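The original snippet isn't reproduced here, so below is a minimal sketch of two such DataFrames and an inner join between them; the column names and sample rows are made up for illustration.

```python
from pyspark.sql import SparkSession

# In Databricks the `spark` session already exists; locally we build one.
spark = SparkSession.builder.appName("joins-demo").getOrCreate()

# Two illustrative DataFrames sharing a key column (made-up data).
customers = spark.createDataFrame(
    [(1, "Todd"), (2, "Matt"), (3, "Max")],
    ["customer_id", "name"],
)
orders = spark.createDataFrame(
    [(100, 1, 29.99), (101, 1, 9.99), (102, 3, 74.50)],
    ["order_id", "customer_id", "total"],
)

# A SQL-style inner join on the shared key, followed by an aggregation.
joined = customers.join(orders, on="customer_id", how="inner")
joined.groupBy("name").sum("total").show()
```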

Working with PySpark RDDs

Working with Spark's original data structure API: Resilient Distributed Datasets.

For being the lifeblood of Spark, there’s surprisingly little documentation on how to actually work with them. If I had to guess, most of the world has been too spoiled by DataFrames to be bothered with non-tabular data. Strange world we live in when using the core data API of Spark is considered a “pro move.”

We've already spent an awful lot of time in this series speaking about DataFrames, which are only one of the three data structure APIs we can work with in Spark (or one of two in PySpark, if you're keeping score).
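As a minimal sketch of that original API (an illustration, not code from the post), creating and transforming an RDD looks roughly like this:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
sc = spark.sparkContext  # the SparkContext exposes the RDD API

# Build an RDD from a plain Python list, then chain lazy transformations.
rdd = sc.parallelize([1, 2, 3, 4, 5])
squared = rdd.map(lambda x: x * x)
evens = squared.filter(lambda x: x % 2 == 0)

# Actions trigger the actual computation.
print(evens.collect())                     # [4, 16]
print(squared.reduce(lambda a, b: a + b))  # 55
```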

Manage Data Pipelines with Apache Airflow

Use Apache Airflow to build and monitor better data pipelines.

It seems like almost every data-heavy Python shop is using Airflow in some way these days. It shouldn't take much time in Airflow's interface to figure out why: Airflow is the missing piece data engineers need to standardize the creation of ETL pipelines. The best part of Airflow, of course, is that it's one of the rare projects donated to the Apache Foundation that's written in Python. Hooray!
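To give a rough sense of that standardization (this is a generic sketch, not the pipeline from the article; the DAG name, schedule, and tasks are placeholders), a minimal Airflow DAG chaining two Python tasks looks something like this:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator  # Airflow 2.x import path

def extract():
    # Placeholder: pull data from a source system.
    return "raw data"

def load():
    # Placeholder: write transformed data to a destination.
    pass

with DAG(
    dag_id="example_etl",              # hypothetical DAG name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> load_task  # extract runs before load
```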

If you happen to be a data engineer who isn't using Airflow (or equivalent) yet, you're in for a treat. It won't take much time using Airflow before you wonder how

Structured Streaming in PySpark

Become familiar with building a structured stream in PySpark using the Databricks interface.

Now that we're comfortable with Spark DataFrames, we're going to apply this newfound knowledge to help us build a streaming data pipeline in PySpark. As it turns out, real-time data streaming is one of Spark's greatest strengths.
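As a hedged preview of the shape of things to come (the schema, path, and sink here are assumptions, not the article's actual example), a structured stream reading JSON files and writing to an in-memory table looks roughly like this:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("stream-demo").getOrCreate()

# Streaming sources require an explicit schema (illustrative fields only).
schema = StructType([
    StructField("event", StringType()),
    StructField("time", TimestampType()),
])

# Treat a directory of JSON files as a stream; the path is a placeholder.
stream = spark.readStream.schema(schema).json("/mnt/streaming-input/")

# Write the running stream to an in-memory table we can query and visualize.
query = (
    stream.writeStream
    .format("memory")
    .queryName("events")
    .outputMode("append")
    .start()
)
```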

For this go-around, we'll touch on the basics of how to build a structured stream in Spark. Databricks has a few sweet features which help us visualize streaming data: we'll be using these features to validate whether or not our stream worked. If you're looking to hook Spark into a message broker or create a production-ready pipeline, we'll be covering this in a

DataFrame Transformations in PySpark (Continued)

Continuing to apply transformations to Spark DataFrames using PySpark.

We've covered a fair amount of ground when it comes to Spark DataFrame transformations in this series. In part 1, we touched on filter(), select(), dropna(), fillna(), and isNull(). Then, we moved on to dropDuplicates() and user-defined functions (udf) in part 2. This time around, we'll be building on these concepts and introducing some new ways to transform data so you can officially be awarded your PySpark Guru Certification, issued by us here at Hackers & Slackers.*

*Hackers & Slackers is not an accredited institution and is respected by virtually nobody in general.
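For anyone who skipped ahead, here's a quick, hedged refresher on two of the part-2 concepts mentioned above; the sample data and column names are invented for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("transform-recap").getOrCreate()

df = spark.createDataFrame(
    [("todd", "NY"), ("todd", "NY"), ("max", "CA")],
    ["name", "state"],
)

# dropDuplicates() removes exact duplicate rows.
deduped = df.dropDuplicates()

# A user-defined function (udf) applies arbitrary Python logic to a column.
capitalize = udf(lambda s: s.capitalize(), StringType())
deduped.withColumn("name", capitalize("name")).show()
```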

Of course, we need to get things started

Becoming Familiar with Apache Kafka and Message Queues

An overview of how Kafka works, as well as equivalent message brokers.

Data engineering technology stacks vary quite a bit from company to company. Depending on the skills and languages preferred by a company's developers, a data stack might be anything from a heavily Java-based shop to a Python shop relying on PySpark. Despite the lack of a prescribed "industry-standard" stack, it's becoming clear that one thing will likely be shared by all high-throughput, next-generation data organizations: Apache Kafka.
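Kafka itself runs on the JVM, but from the Python side a producer and consumer look roughly like the sketch below, using the third-party kafka-python package; the broker address and topic name are placeholders.

```python
from kafka import KafkaProducer, KafkaConsumer  # pip install kafka-python

# Produce an event to a topic (placeholder broker and topic).
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("page-views", b'{"user": 42, "path": "/pricing"}')
producer.flush()

# Consume events from the same topic, starting from the earliest offset.
consumer = KafkaConsumer(
    "page-views",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
)
for message in consumer:
    print(message.value)
```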

Kafka is the go-to centerpiece for organizations dealing with massive amounts of data in real-time. Kafka is designed to process billions (or even trillions) of data events per day; a feat

Executing Basic DataFrame Transformations in PySpark

Using PySpark to apply transformations to real datasets.

If you joined us last time, you should have some working knowledge of how to get started with PySpark by using a Databricks notebook. Armed with that knowledge, we can now start playing with real data.

For most of the time we spend in PySpark, we'll likely be working with Spark DataFrames: this is our bread and butter for data manipulation in Spark. For this exercise, we'll attempt to execute an elementary string of transformations to get a feel for what the middle portion of an ETL pipeline looks like (also known as the "transform" part 😁).

Loading Up Some Data
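The original notebook cells aren't reproduced here, but loading a CSV into a Spark DataFrame generally looks something like the sketch below; the file path is a placeholder.

```python
from pyspark.sql import SparkSession

# In a Databricks notebook `spark` already exists; locally we create it.
spark = SparkSession.builder.appName("load-demo").getOrCreate()

# The path is a placeholder for wherever your dataset lives (DBFS, S3 mount, etc.).
df = (
    spark.read
    .format("csv")
    .option("header", "true")       # first row contains column names
    .option("inferSchema", "true")  # let Spark guess column types
    .load("/FileStore/tables/example_dataset.csv")
)

df.printSchema()
df.show(5)
```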

Cleaning PySpark DataFrames

Easy DataFrame cleaning techniques, ranging from dropping problematic rows to selecting important columns.

There's something about being a data engineer that makes it impossible to clearly convey thoughts in an articulate manner. It seems inevitable that every well-meaning Spark tutorial is destined to devolve into walls of incomprehensible code with minimal explanation. This is even apparent on Stack Overflow, where simple questions are regularly met with absurdly unnecessary solutions (stop making UDFs for everything!). Anyway, what I'm trying to say is it takes a lot of guts to click into these things, and here you are. I appreciate you.
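In that spirit of keeping things simple (and UDF-free), the kinds of cleaning steps mentioned in the description above can usually be handled with built-in DataFrame methods; a minimal sketch with made-up columns:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("cleaning-demo").getOrCreate()

# Made-up data with the usual problems: nulls and nonsense values.
df = spark.createDataFrame(
    [(1, "US", 34), (2, None, 28), (None, "CA", 41), (4, "US", -3)],
    ["user_id", "country", "age"],
)

cleaned = (
    df
    .dropna(subset=["user_id"])           # drop rows missing a required field
    .fillna({"country": "unknown"})       # fill gaps in an optional field
    .filter(col("age") >= 0)              # drop obviously bad rows
    .select("user_id", "country", "age")  # keep only the columns we care about
)
cleaned.show()
```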

In our last episode, we covered some Spark basics, played with Databricks, and started loading

Learning Apache Spark with PySpark & Databricks

Get started with Apache Spark in part 1 of our series, where we leverage Databricks and PySpark.

Something we've only begun to touch on so far is the benefit of utilizing Apache Spark in larger-scale data pipelines. Spark is a quintessential part of the Apache data stack: built atop Hadoop, Spark is intended to handle resource-intensive jobs such as data streaming and graph processing.
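For anyone following along outside of Databricks (this setup detail is an assumption, not part of the original walkthrough), the entry point is a SparkSession; Databricks notebooks expose one automatically as `spark`.

```python
from pyspark.sql import SparkSession

# Locally we build our own session; in a Databricks notebook this object
# already exists as the `spark` variable.
spark = (
    SparkSession.builder
    .appName("getting-started")
    .master("local[*]")  # use every local core while experimenting
    .getOrCreate()
)

print(spark.version)
```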

Much of Spark's allure comes from the fact that it is written in Scala and Java. JVM languages are notorious for running memory-heavy at runtime, which can work to our advantage. Because everything is stored in memory, our jobs become predictably resource-intensive, which allows us

Building an ETL Pipeline: From JIRA to SQL

An example data pipeline which extracts data from the JIRA Cloud API and loads it to a SQL database.

Something we haven't done just yet on this site is walk through the humble process of creating data pipelines: the art of taking a bunch of data, changing said data, and putting it somewhere else. It's kind of a weird thing to be into, which is why the MoMA keeps rejecting my submissions of GitHub repositories. Don't worry; I'll keep at it.
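As a heavily simplified sketch of that extract-and-load shape (the JIRA domain, credentials, JQL, table name, and connection string below are all placeholders; the full post walks through a more complete version):

```python
import requests
import pandas as pd
from sqlalchemy import create_engine

# Extract: pull issues from the JIRA Cloud REST API (placeholder domain/creds).
response = requests.get(
    "https://your-domain.atlassian.net/rest/api/2/search",
    params={"jql": "project = EXAMPLE", "maxResults": 50},
    auth=("user@example.com", "api-token"),
)
issues = response.json()["issues"]

# Transform: flatten the fields we care about into rows.
rows = [
    {
        "key": issue["key"],
        "summary": issue["fields"]["summary"],
        "status": issue["fields"]["status"]["name"],
    }
    for issue in issues
]

# Load: write the rows to a SQL table (placeholder connection string).
engine = create_engine("postgresql://user:password@localhost:5432/jira")
pd.DataFrame(rows).to_sql("issues", engine, if_exists="replace", index=False)
```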

Something you don't see every day is people sharing their pipelines, which is understandable. Presumably, the other people who do this kind of stuff do it for work; nobody is happily building stupid pipelines in their free time

Working With GraphQL Fragments and Mutations

Make your GraphQL queries more dynamic with Fragments, plus get started with Mutations.

Last week we encountered a genuine scenario when working with GraphQL clients. When building real applications that consume data via GraphQL, we usually don't know precisely which query we'll want to run at runtime. Imagine a user cruising through your application, setting preferences, and arriving at core pieces of functionality in a context which is specific only to them. Say we're building a GrubHub knockoff (we hate profits and love entering impenetrable parts of the market; it's not that uncommon, really). At its core, the information we're serving will always be restaurants; we'll always want to return things like

Building a Client For Your GraphQL API

Now that we have an understanding of GraphQL queries and API setup, it's time to get that data.

If you had the pleasure of joining us last time, we had just completed a crash course in structuring GraphQL queries. As much as we all love studying abstract queries within the confines of a playground environment, the only real way to learn anything is to overzealously attempt to build something way out of our skill level. Thus, we're going to shift gears and actually make something with all the dry technical knowledge we've accumulated so far. Hooray!
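Before we get into the specifics, here's a generic sketch of what "getting that data" can look like from Python: any GraphQL endpoint accepts a POST with a JSON body containing the query. The URL and fields below are placeholders, not the article's schema.

```python
import requests

# Placeholder endpoint; substitute the URL your GraphQL service exposes.
ENDPOINT = "https://example.com/graphql"

# A GraphQL request is just an HTTP POST with the query in the JSON body.
query = """
{
  users {
    id
    name
  }
}
"""

response = requests.post(ENDPOINT, json={"query": query})
response.raise_for_status()
print(response.json()["data"])
```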

Data Gone Wild: Exposing Your GraphQL Endpoint

If you're following along with Prisma as your GraphQL service, the endpoint for your API defaults to

Writing Your First GraphQL Query

Begin to structure complex queries against your GraphQL API.

In our last run-in with GraphQL, we used Prisma to assist in setting up a GraphQL server. This effectively gave us an endpoint to work with for making GraphQL requests against the database we specified when getting started. If you're still in the business of setting up a GraphQL server, there are plenty of alternative services to Prisma you could explore. Apollo is perhaps the most popular. A different approach could be to use GraphCMS: a headless CMS for building GraphQL models with a beautiful interface.

With our first models created and deployed, we're now able to explore