Building Java Projects with Gradle

Automate your Java project's dependency resolution, testing, and more with Gradle.

I've had a few strongly worded opinions about Java as a language in the past. Be that as it may, choosing a programming language is a luxury that many people don’t have; as long as enterprises exist, there will always be a need for Java developers. According to the 2019 StackOverflow developer survey, about 40% of developers are actively using Java in some way.

For those who work mostly in more "modern" programming languages, coming to Java in 2019 involves numerous pain points. Installing dependencies by manually downloading and dropping jar files into your Java path is a humbling experience.

Using Amazon Redshift as your Data Warehouse

Get the most out of Redshift by performance tuning your cluster and learning how to query your data optimally.

Redshift is quickly taking its place as the world's most popular solution for dumping obscene amounts of data into storage. It's nice to see good services flourish while clunky Hadoop-based stacks of yesterdecade suffer a long, painful death. Regardless of whether we're in data science, data engineering, or analysis, it's only a matter of time before all of us work with the world's most popular data warehouse.

While Redshift's rise to power has been deserved, the unanimous popularity of any service can cause problems... namely, the knowledge gaps that come with defaulting to any de facto industry solution. Most of

Constructing Database Queries with SQLAlchemy

Query your data models using SQLAlchemy's query API.

So far in our SQLAlchemy journey, we've covered managing database connections and model creation. So... how do we actually extract the data we want from our database?

SQLAlchemy's ORM query API simplifies the way we write database queries. Instead of writing raw SQL queries, we can construct queries on our SQLAlchemy session by chaining together methods to retrieve data. We're going to dive into SQLAlchemy's extensive query API to get an idea of all the ways we can query our data.
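
To give a flavor of what that chaining looks like, here's a minimal sketch; the Customer model, its columns, and the connection string are hypothetical stand-ins for your own models and session, not anything from the original post:

```python
from sqlalchemy import Column, DateTime, Integer, String, create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()


# A hypothetical model standing in for your own mapped class
class Customer(Base):
    __tablename__ = "customer"
    id = Column(Integer, primary_key=True)
    city = Column(String)
    created_at = Column(DateTime)


# Hypothetical connection setup; swap in your own connection string
engine = create_engine("sqlite:///example.db")
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

# Chain methods on session.query() instead of writing raw SQL
recent_seattle_customers = (
    session.query(Customer)
    .filter(Customer.city == "Seattle")
    .order_by(Customer.created_at.desc())
    .limit(10)
    .all()
)
```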

This tutorial will assume you know how to create an SQLAlchemy session. We'll also assume you have some data

Managing Relationships in SQLAlchemy Data Models

Using the SQLAlchemy ORM to build data models with meaningful relationships.

There are plenty of good reasons to use SQLAlchemy, from managing database connections to easy integrations with libraries such as Pandas. If you're in the app-building business, I'd be willing to bet that managing your app's data via an ORM is at the top of your list of use cases for SQLAlchemy.

Most software engineers likely find database model management to be easier than SQL queries. For people with heavy data backgrounds (like us), the added abstractions can be a bit off-putting: why do we need foreign keys to execute JOINs between two tables? Why do we need to distinguish

Performing Macro Operations on PySpark DataFrames

Perform SQL-like joins and aggregations on your PySpark DataFrames.

We've had quite a journey exploring the magical world of PySpark together. After covering DataFrame transformations, structured streams, and RDDs, we have only so many things left to cross off the list before we've gone too deep.

To round things out for this series, we're going to take a look back at some powerful DataFrame operations we missed. In particular, we'll be focusing on operations that modify DataFrames as a whole, such as joins and aggregations.

Joining DataFrames in PySpark

I'm going to assume you're already familiar with the concept of SQL-like joins. To demonstrate these in PySpark, I'll create two simple DataFrames:
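
A minimal sketch of what that setup and a basic join might look like; the schemas and values below are invented for illustration rather than taken from the post:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("join-demo").getOrCreate()

# Two invented DataFrames sharing a "customer_id" column
customers = spark.createDataFrame(
    [(1, "Alice"), (2, "Bob"), (3, "Carol")],
    ["customer_id", "name"],
)
orders = spark.createDataFrame(
    [(101, 1, 20.00), (102, 1, 3.50), (103, 2, 12.25)],
    ["order_id", "customer_id", "total"],
)

# An inner join, SQL-style: keep only customers who have orders
customers.join(orders, on="customer_id", how="inner").show()
```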

Side Projects Are A Good Idea

Side projects are a good idea but make sure to do the day job.

The Digital Society is changing everything, and apparently new technologies are going to disrupt our lives. Social media is in a frenzy over the ‘Internet of Things’, ‘Cognitive’, ‘Big Data’, ‘AI’, ‘Blockchain’, ‘Cloud computing’, and heaven knows how many other buzzwords. So how does an ‘Ordinary Guy’ handle all of this while avoiding a bad case of ‘F.O.M.O.’, or ‘Fear of Missing Out’? Well, stick your toe in the water and see how you feel. Gain some experience and see how you get on. That bad case of ‘F.O.M.O.’ could become ‘J.O.M.O.’, or the ‘Joy of Missing Out’.

Manage Files in Google Cloud Storage With Python

Manage files in your Google Cloud Storage bucket using the google-cloud-storage Python library.

I recently worked on a project which combined two of my life's greatest passions: coding and memes. The project was, of course, a chatbot: a fun imaginary friend who sits in your chatroom of choice and loyally waits at your beck and call, delivering memes whenever you might request them. In some cases, the bot would scrape the internet for freshly baked memes, but there were also plenty of instances where the desired memes needed to be more predictable: namely, a predetermined subset of memes hosted in the cloud which could be updated dynamically. This is where Google Cloud Storage comes in.
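
As a rough sketch of the kind of file management involved, here's what uploading and listing objects looks like with the google-cloud-storage library; the bucket name and paths are placeholders, not the project's real ones:

```python
from google.cloud import storage

# Assumes GOOGLE_APPLICATION_CREDENTIALS points at a service account key;
# the bucket name and object paths below are placeholders.
client = storage.Client()
bucket = client.get_bucket("my-meme-bucket")

# Upload a local file into the bucket
blob = bucket.blob("memes/fresh-meme.png")
blob.upload_from_filename("fresh-meme.png")

# List everything currently stored under the "memes/" prefix
for obj in bucket.list_blobs(prefix="memes/"):
    print(obj.name)
```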

Working with PySpark RDDs

Working with Spark's original data structure API: Resilient Distributed Datasets.

For being the lifeblood of Spark, RDDs have surprisingly little documentation on how to actually work with them. If I had to guess, most of the world has been too spoiled by DataFrames to be bothered with non-tabular data. Strange world we live in when using the core data API of Spark is considered a “pro move.”

We've already spent an awful lot of time in this series speaking about DataFrames, which are only one of the three data structure APIs we can work with in Spark (or one of two data structure APIs in PySpark, if you're keeping score).
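
For a quick taste of what working with RDDs directly feels like, here's a tiny sketch using invented data:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

# Parallelize a plain Python list into an RDD, then transform it with
# functional-style operations instead of DataFrame methods
numbers = sc.parallelize([1, 2, 3, 4, 5, 6])
squares_of_evens = numbers.filter(lambda n: n % 2 == 0).map(lambda n: n * n)

print(squares_of_evens.collect())  # [4, 16, 36]
```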

PowerPivot 3: Managing the Data Model

Analyzing ginormous files with Microsoft PowerPivot.

How's it going, readers? If you've been paying attention and/or have the basic ability to count, you'll notice that this is the third post in a series about using Excel's secret Weapon of Math Destruction: PowerPivot. I highly suggest that if you haven't read the previous two posts, you go ahead and do that. If you have, you'll already:

  1. Have enabled PowerPivot through the COM add-ins.
  2. Know what PowerPivot is.
  3. Understand why PowerPivot can make your life as an analyst a great deal easier.
  4. Be ready to power up some ginormous flat files.

I'm going to assume that

Manage Data Pipelines with Apache Airflow

Use Apache Airflow to build and monitor better data pipelines.

It seems like almost every data-heavy Python shop is using Airflow in some way these days. It shouldn't take much time in Airflow's interface to figure out why: Airflow is the missing piece data engineers need to standardize the creation of ETL pipelines. The best part of Airflow, of course, is that it's one of the rare projects donated to the Apache Foundation that is written in Python. Hooray!
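
To make that concrete, here's a minimal sketch of a DAG definition; the dag_id, schedule, and commands are invented, and the imports assume an Airflow 1.x-era install rather than anything from the original post:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# A toy two-step pipeline; every name and command here is a placeholder
dag = DAG(
    dag_id="example_etl",
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
)

extract = BashOperator(task_id="extract", bash_command="echo extracting", dag=dag)
load = BashOperator(task_id="load", bash_command="echo loading", dag=dag)

# Run the extract task before the load task
extract >> load
```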

If you happen to be a data engineer who isn't using Airflow (or equivalent) yet, you're in for a treat. It won't take much time using Airflow before you wonder how

Recasting Low-Cardinality Columns as Categoricals

Downcast strings in Pandas to their proper data-types using HDF5.

The other day, I was grabbing a way-too-big DB query for local exploration. It was literally over 2GB as a CSV, which is a pain for a number of reasons! Not the least of which is that, while you're doing Exploratory Data Analysis, everything takes way too long, and it doesn't take long for Cognitive Drift to break your rhythm!

Numerical columns can be taken down to size with the downcasting functions from a previous post. But what about Object/String columns? One of the best ways to reduce the size of a file like this is to recast low-cardinality string columns as categoricals.
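
Here's roughly what that recast looks like in Pandas; the column and values are invented for illustration:

```python
import pandas as pd

# An invented DataFrame with a low-cardinality string column
df = pd.DataFrame({"state": ["WA", "OR", "WA", "CA", "OR"] * 100_000})
print(df.memory_usage(deep=True))  # the object column eats a lot of bytes

# Recast the low-cardinality strings as a categorical dtype
df["state"] = df["state"].astype("category")
print(df.memory_usage(deep=True))  # same data, a fraction of the memory
```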

PowerPivot 2: What's the Deal with Delimiters?

Working with large flat files in PowerPivot.

Hey there, budding Excel wizards! This post picks up RIGHT where my previous post left off, so if you haven't enabled PowerPivot yet, I highly recommend that you read the previous post and enable the add-in before moving forward. Don't worry, we'll wait...

Good to see you again! If you've gotten this far, you've already won half the battle by enabling PowerPivot. To recap: technically, PowerPivot is the desktop version of one of the cornerstones of Microsoft's Business Intelligence cloud platform, Power BI. Functionally, however, PowerPivot is the answer to dealing with enormous files in a way that they're still

Removing Duplicate Columns in Pandas

Dealing with duplicate column names in your Pandas DataFrame.

Sometimes you wind up with duplicate column names in your Pandas DataFrame. This isn't necessarily a huge deal if we're just messing with a smallish file in Jupyter. But if we want to do something like load it into a database, that'll be a problem. It can also interfere with our other cleaning functions; I ran into this the other day when reducing the size of a giant data file by downcasting it (as per this previous post). The cleaning functions required a 1D input (so, a Series or List), but calling the name of a duplicate column gave back a two-dimensional DataFrame instead.
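
The fix itself is short. Here's a minimal sketch of dropping the repeated columns; the toy column names are mine, not the post's:

```python
import pandas as pd

# A toy DataFrame with a repeated column name
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["id", "value", "value"])
print(df["value"])  # returns a two-column DataFrame, not a Series

# Keep only the first occurrence of each column name
deduped = df.loc[:, ~df.columns.duplicated()]
print(deduped["value"])  # now a proper Series
```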

Using Hierarchical Indexes With Pandas

Use Pandas' MultiIndex to make your data work harder for you.

I've been wandering into a lot of awkward conversations lately, most of them being about how I spend my free time. Apparently "rifling through Python library documentation in hopes of finding dope features" isn't considered a relatable hobby by most people. At least you're here, so I must be doing something right occasionally.

Today we'll be venturing off into the world of Pandas indexes. Not just any old indexes... hierarchical indexes. Hierarchical indexes take the idea of having identifiers for rows and extend it by allowing us to set multiple identifiers, with a twist: these indexes hold parent/child relationships with one another.
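
As a quick sketch of the idea, using invented sales data, a hierarchical index might look like this:

```python
import pandas as pd

# Invented data with a natural parent/child structure: region -> city
df = pd.DataFrame({
    "region": ["West", "West", "East", "East"],
    "city": ["Seattle", "Portland", "Boston", "NYC"],
    "sales": [120, 80, 95, 210],
})

# Promote both columns to a hierarchical (multi-level) index
indexed = df.set_index(["region", "city"])

print(indexed.loc["West"])              # all child rows under the "West" parent
print(indexed.loc[("East", "Boston")])  # a single row addressed by both levels
```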