Big Data

Work with vast amounts of unstructured data across file types and schemas, using tools such as data warehouses, Spark, BigQuery, Redshift, and Hadoop.
DataFrame Transformations in PySpark (Continued)

Continuing to apply transformations to Spark DataFrames using PySpark.

We've covered a fair amount of ground when it comes to Spark DataFrame transformations in this series. In part 1, we touched on filter(), select(), dropna(), fillna(), and isNull(). Then, we moved on to dropDuplicates() and user-defined functions (udf) in part 2. This time around, we'll build on those concepts and introduce some new ways to transform data, so you can officially be awarded your PySpark Guru Certification, awarded by us here at Hackers & Slackers.*

*Hackers & Slackers is not an accredited institution and is respected by virtually nobody in general.

Of course, we need to get things started…
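If you'd like a quick refresher before moving on, here's a minimal sketch of the part 1 and part 2 transformations recapped above. It assumes a Databricks notebook where spark is already available, and the tiny DataFrame of names and cities is entirely made up:

```python
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# Made-up sample data; `spark` is the SparkSession Databricks provides for us.
df = spark.createDataFrame(
    [("Todd", "New York"), ("Max", None), ("Max", None)],
    ["name", "city"],
)

df.filter(df.city.isNull()).show()            # rows with a missing city
df.select("name").show()                      # keep a subset of columns
df.dropna(subset=["city"]).show()             # drop rows with null cities
df.fillna("Unknown", subset=["city"]).show()  # ...or fill them instead
df.dropDuplicates().show()                    # remove identical rows

# A trivial user-defined function (udf)
shout = F.udf(lambda name: name.upper(), StringType())
df.withColumn("name_upper", shout(df.name)).show()
```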

Executing Basic DataFrame Transformations in PySpark

Using PySpark to apply transformations to real datasets.

If you joined us last time, you should have some working knowledge of how to get started with PySpark by using a Databricks notebook. Armed with that knowledge, we can now start playing with real data.

Most of our time in PySpark will be spent working with Spark DataFrames: they're our bread and butter for data manipulation in Spark. For this exercise, we'll execute an elementary string of transformations to get a feel for what the middle portion of an ETL pipeline looks like (also known as the "transform" part 😁).
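To make that concrete, here's a rough sketch of what such a string of transformations might look like. The source path, column names, and business logic are all hypothetical; the point is the shape of the chained calls sitting between extract and load:

```python
from pyspark.sql import functions as F

# Hypothetical raw file extracted earlier in the pipeline.
raw = spark.read.csv("/mnt/raw/bookings.csv", header=True, inferSchema=True)

# The "transform" middle: a chain of DataFrame transformations.
cleaned = (
    raw.dropna(subset=["customer_id"])                          # discard unusable rows
       .withColumn("booked_at", F.to_date("booked_at"))         # normalize types
       .withColumn("revenue", F.col("price") * F.col("quantity"))
       .filter(F.col("revenue") > 0)
       .select("customer_id", "booked_at", "revenue")
)

# The "load" step would write `cleaned` out to a table or warehouse from here.
```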

Loading Up Some Data

Learning Apache Spark with PySpark & Databricks

Get started with Apache Spark in part 1 of our series, where we leverage Databricks and PySpark.

Something we've only begun to touch on so far is the benefit of utilizing Apache Spark in larger-scale data pipelines. Spark is a quintessential part of the Apache data stack: built atop Hadoop, Spark is intended to handle resource-intensive jobs such as data streaming and graph processing.

Much of Spark's allure comes from the fact that it is written in Scala and Java. Java and its offshoot languages are notorious for being extremely memory-heavy at run time, which we can use to our advantage. Because everything is stored in memory, our jobs become predictably resource-intensive, which allows us…

Google BigQuery's Python SDK: Creating Tables Programmatically

Explore the benefits of Google BigQuery and use the Python SDK to programmatically create tables.

GCP is on the rise, and it's getting harder and harder to have conversations around data warehousing without addressing the new 500-pound gorilla on the block: Google BigQuery. By this point, most enterprises have comfortably settled into their choice of "big data" storage, whether that be Amazon Redshift, Hadoop, or what-have-you. BigQuery is quickly disrupting the way we think about big data stacks by redefining how we use and ultimately pay for such services.

The benefits of BigQuery likely aren't enough to force enterprises to throw the baby out with the bathwater. That said, companies building their infrastructure from the…
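As a taste of what creating a table programmatically looks like, here's a minimal sketch using the google-cloud-bigquery client library. The project, dataset, table, and schema are placeholders, and it assumes your GCP credentials are already configured:

```python
from google.cloud import bigquery

# Placeholder project/dataset/table names; credentials come from the environment
# (e.g., GOOGLE_APPLICATION_CREDENTIALS).
client = bigquery.Client(project="my-project")

schema = [
    bigquery.SchemaField("full_name", "STRING", mode="REQUIRED"),
    bigquery.SchemaField("age", "INTEGER", mode="NULLABLE"),
    bigquery.SchemaField("signed_up", "TIMESTAMP", mode="NULLABLE"),
]

table = bigquery.Table("my-project.my_dataset.users", schema=schema)
table = client.create_table(table)  # makes the API call; errors if the table exists
print(f"Created {table.full_table_id}")
```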

From CSVs to Tables: Infer Data Types From Raw Spreadsheets

The quest to never explicitly set a table schema ever again.

Back in August of last year (roughly 8 months ago), I was hunched over my desk at 4 am, desperate to fire off a post before boarding a flight the next morning. The article was titled Creating Database Schemas: a Job for Robots, or Perhaps Pandas. My intent at the time was to solve a common annoyance: creating database tables out of raw data without the obnoxious process of explicitly setting each column's datatype. I had a few leads that led me to believe I had the answer... boy, was I wrong.

The task seems reasonable enough on the surface.
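The general idea, sketched here with a hypothetical CSV and connection string: let Pandas infer each column's dtype on read, then let to_sql() translate those dtypes into SQL column types so the table gets created without a hand-written schema.

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical CSV; Pandas infers each column's dtype as it reads.
df = pd.read_csv("raw_export.csv")
print(df.dtypes)

# Hypothetical connection string; to_sql() maps the inferred dtypes to SQL
# column types and creates the table for us.
engine = create_engine("postgresql://user:password@localhost:5432/analytics")
df.to_sql("raw_export", engine, if_exists="replace", index=False)
```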

Data Could Save Humanity if it Weren't for Humanity

A compelling case for robot overlords.

A decade has passed since I stumbled into technical product development. Looking back, I've spent that time almost exclusively in the niche of data-driven products and engineering. While it seems obvious now, I realized in the 2000s that you could generally create two types of product: you could either build a (likely uninspired) UI for existing data, or you could build products that produced new data or interpreted existing data in a new, useful way. Betting on the latter seemed like an obvious choice. The late 2000s felt like building apps for the sake of apps most of the…