ETL

Hone your data engineering skills by creating data pipelines to extract, transform, and load data. Use tools like Apache Spark and Kafka to handle big data.
Apache
03 Jun 2019

Manage Data Pipelines with Apache Airflow

Use Apache Airflow to build and monitor better data pipelines.

It seems like almost every data-heavy Python shop is using Airflow in some way these days. It shouldn't take much time in Airflow's interface to figure out why: Airflow is the missing piece data engineers need to standardize the creation of ETL pipelines. The best part of Airflow, of course, is that it's one of the rare projects donated to the Apache Foundation that's written in Python. Hooray!
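To give a sense of what that standardization looks like, here's a minimal sketch of an Airflow DAG wiring the three classic ETL stages together. The dag_id, schedule, and task bodies are made-up placeholders for illustration, not code from the post:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator  # Airflow 1.x import path


def extract():
    """Pull raw records from a source system (stubbed here)."""
    print("extracting...")


def transform():
    """Clean and reshape the extracted records (stubbed here)."""
    print("transforming...")


def load():
    """Write the transformed records to their destination (stubbed here)."""
    print("loading...")


default_args = {
    "owner": "airflow",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="example_etl",
    default_args=default_args,
    start_date=datetime(2019, 6, 1),
    schedule_interval="@daily",
) as dag:
    # Each task wraps one stage of the pipeline; >> sets the run order.
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> transform_task >> load_task
```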

If you happen to be a data engineer who isn't using Airflow (or equivalent) yet, you're in for a treat. It won't take much time using Airflow before you wonder how

Continue Reading
Spark
07 May 2019

DataFrame Transformations in PySpark (Continued)

Continuing to apply transformations to Spark DataFrames using PySpark.

We've covered a fair amount of ground when it comes to Spark DataFrame transformations in this series. In part 1, we touched on filter(), select(), dropna(), fillna(), and isNull(). Then, we moved on to dropDuplicates() and user-defined functions (udf) in part 2. This time around, we'll be building on these concepts and introducing some new ways to transform data so you can officially be awarded your PySpark Guru Certification, awarded by us here at Hackers & Slackers.*

*Hackers & Slackers is not an accredited institution and is respected by virtually nobody in general.
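As a refresher, here's a rough sketch of those part 1 and part 2 methods chained together on a throwaway DataFrame. The data and column names are invented for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("transform-recap").getOrCreate()

# A tiny stand-in DataFrame with a duplicate row and a null value.
df = spark.createDataFrame(
    [
        ("Bakersfield", "CA", 383579),
        ("Bakersfield", "CA", 383579),
        ("Reno", None, 250998),
    ],
    ["city", "state", "population"],
)

# A user-defined function for custom per-row logic.
shout = udf(lambda s: s.upper() if s else None, StringType())

result = (
    df.dropDuplicates()                      # drop the repeated Bakersfield row
      .fillna({"state": "unknown"})          # replace nulls in the state column
      .filter(col("population") > 200000)    # keep only larger cities
      .withColumn("city_caps", shout(col("city")))
      .select("city_caps", "state", "population")
)

result.show()
```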

Of course, we need to get things started

Continue Reading
Apache
04 May 2019

Becoming Familiar with Apache Kafka and Message Queues

An overview of how Kafka works, as well as equivalent message brokers.

Data engineering technology stacks vary considerably across companies. Depending on the skills and languages preferred by a company's developers, a data stack might be anything from a heavily Java-based shop to a Python shop relying on PySpark. Despite the lack of a prescribed "industry-standard" stack, it's becoming clear that one thing will likely be shared by all high-throughput, next-generation data organizations: Apache Kafka.
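To make that concrete, here's a rough sketch of publishing and consuming events with the kafka-python client. The broker address, topic name, and payload are placeholders, and a real deployment would tune this considerably:

```python
import json

from kafka import KafkaConsumer, KafkaProducer

# Producer: publish a JSON event to a topic (topic name is made up here).
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("page_views", {"user_id": 42, "url": "/pricing"})
producer.flush()

# Consumer: read events from the same topic as they arrive.
consumer = KafkaConsumer(
    "page_views",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for message in consumer:
    print(message.value)
```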

Kafka is the go-to centerpiece for organizations dealing with massive amounts of data in real time. It's designed to process billions (or even trillions) of data events per day; a feat

Continue Reading
Spark
28 Apr 2019

Executing Basic DataFrame Transformations in PySpark

Using PySpark to apply transformations to real datasets.

If you joined us last time, you should have some working knowledge of how to get started with PySpark by using a Databricks notebook. Armed with that knowledge, we can now start playing with real data.

For most of the time we spend in PySpark, we'll likely be working with Spark DataFrames: this is our bread and butter for data manipulation in Spark. For this exercise, we'll attempt to execute an elementary string of transformations to get a feel for what the middle portion of an ETL pipeline looks like (also known as the "transform" part 😁).
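As a sketch of what that "transform" portion might look like once data is loaded, here's an elementary chain of DataFrame operations. The file path and column names are invented stand-ins, not the dataset used in the post:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("basic-transforms").getOrCreate()

# Load a CSV into a DataFrame (path and schema are hypothetical).
df = spark.read.csv("/tmp/listings.csv", header=True, inferSchema=True)

# Chain a few elementary transformations: project, clean, filter.
transformed = (
    df.select("neighbourhood", "room_type", "price")
      .dropna(subset=["price"])        # drop rows missing a price
      .filter(col("price") < 500)      # keep reasonably priced listings
)

transformed.show(5)
```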

Loading Up Some Data

Continue Reading
Spark
26 Apr 2019

Learning Apache Spark with PySpark & Databricks

Get started with Apache Spark in part 1 of our series, where we leverage Databricks and PySpark.

Something we've only begun to touch on so far is the benefit of utilizing Apache Spark in larger-scale data pipelines. Spark is a quintessential part of the Apache data stack: built atop Hadoop, Spark is intended to handle resource-intensive jobs such as data streaming and graph processing.
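For anyone following along locally rather than in Databricks (where a ready-made spark session is provided in every notebook), a session can be spun up in a few lines. The app name below is arbitrary:

```python
from pyspark.sql import SparkSession

# In a Databricks notebook, `spark` already exists; locally we build our own.
spark = (
    SparkSession.builder
    .appName("learning-spark")
    .master("local[*]")   # use all local cores; a cluster URL would go here instead
    .getOrCreate()
)

print(spark.version)
```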

Much of Spark's allure comes from the fact that it is written in Scala & Java. Java and its offshoot languages are notorious for running extremely memory-heavy at run time, which can be used to our advantage. Because everything is stored in memory, our jobs become predictably resource-intensive, which allows us

Continue Reading
Data Engineering
28 Mar 2019

Building an ETL Pipeline: From JIRA to SQL

An example data pipeline which extracts data from the JIRA Cloud API and loads it to a SQL database.

Something we haven't done just yet on this site is walk through the humble process of creating data pipelines: the art of taking a bunch of data, changing said data, and putting it somewhere else. It's kind of a weird thing to be into, hence why the MoMA has been rejecting my submissions of GitHub repositories. Don't worry; I'll keep at it.
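The shape of such a pipeline is roughly this sketch: extract issues from JIRA's Cloud REST API, flatten the JSON, and load the rows into SQL. The URL, credentials, fields, and table name are placeholders, not the configuration used in the post:

```python
import requests
import pandas as pd
from sqlalchemy import create_engine

# Extract: pull issues from the JIRA Cloud search endpoint (URL and creds are placeholders).
response = requests.get(
    "https://yourcompany.atlassian.net/rest/api/2/search",
    params={"jql": "project = DATA", "maxResults": 100},
    auth=("you@yourcompany.com", "api-token"),
)
issues = response.json()["issues"]

# Transform: flatten the nested JSON into the handful of fields we care about.
rows = [
    {
        "key": issue["key"],
        "summary": issue["fields"]["summary"],
        "status": issue["fields"]["status"]["name"],
    }
    for issue in issues
]

# Load: write the rows to a SQL table.
engine = create_engine("postgresql://user:password@localhost:5432/jira")
pd.DataFrame(rows).to_sql("issues", engine, if_exists="replace", index=False)
```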

Something you don't see every day is people sharing their pipelines, which is understandable. Presumably, the other people who do this kind of stuff do it for work; nobody is happily building stupid pipelines in their free time

Continue Reading