Python

Let us feed your endless Python addiction! Regardless of where you stand as a Pythonista, our team of pros is constantly teaching and sharing pythonic gold.
Python
14 Jul 2019

Constructing Database Queries with SQLAlchemy

Query your data models using SQLAlchemy's query API.

So far in our SQLAlchemy journey, we've covered managing database connections and model creation. So... how do we actually extract the data we want from our database?

SQLAlchemy's ORM query API simplifies the way we write database queries. Instead of writing raw SQL queries, we can construct queries on our SQLAlchemy session by chaining together methods to retrieve data. We're going to dive into SQLAlchemy's extensive query API to get an idea of all the ways we can query our data.
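As a taste of what that chaining looks like, here's a minimal sketch against a hypothetical `User` model backed by an in-memory SQLite database (the model and the data are invented for illustration):

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class User(Base):
    __tablename__ = "users"
    id = Column(Integer, primary_key=True)
    name = Column(String)
    role = Column(String)

# In-memory SQLite keeps the sketch self-contained.
engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

session.add_all([User(name="ann", role="admin"), User(name="bob", role="user")])
session.commit()

# Chain filter/order_by instead of writing raw SQL.
admins = (
    session.query(User)
    .filter(User.role == "admin")
    .order_by(User.name)
    .all()
)
print([u.name for u in admins])  # -> ['ann']
```

Each chained method returns a new query object, so the query isn't actually executed until a method like `.all()` or `.first()` asks for results.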

This tutorial will assume you know how to create an SQLAlchemy session. We'll also assume you have some data

Continue Reading
Python
11 Jul 2019

Managing Relationships in SQLAlchemy Data Models

Using the SQLAlchemy ORM to build data models with meaningful relationships.

There are plenty of good reasons to use SQLAlchemy, from managing database connections to easy integrations with libraries such as Pandas. If you're in the app-building business, I'd be willing to bet that managing your app's data via an ORM is at the top of your list of use cases for SQLAlchemy.

Most software engineers likely find database model management to be easier than SQL queries. For people with heavy data backgrounds (like us), the added abstractions can be a bit off-putting: why do we need foreign keys to execute JOINs between two tables? Why do we need to distinguish
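To make the foreign-key question concrete, here's a hedged sketch of a one-to-many relationship between two hypothetical models. The foreign key lives on the "many" side, and `relationship()` gives us Python-level navigation in both directions:

```python
from sqlalchemy import Column, ForeignKey, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, relationship, sessionmaker

Base = declarative_base()

class Author(Base):
    __tablename__ = "authors"
    id = Column(Integer, primary_key=True)
    name = Column(String)
    posts = relationship("Post", back_populates="author")

class Post(Base):
    __tablename__ = "posts"
    id = Column(Integer, primary_key=True)
    title = Column(String)
    # The JOIN column: each post points back at its author.
    author_id = Column(Integer, ForeignKey("authors.id"))
    author = relationship("Author", back_populates="posts")

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

author = Author(name="ann", posts=[Post(title="Hello"), Post(title="World")])
session.add(author)
session.commit()

# Navigate the relationship without writing a JOIN by hand.
titles = [p.title for p in session.query(Author).first().posts]
```

The foreign key tells the database how the tables connect; `relationship()` tells the ORM how to turn that connection into plain attribute access.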

Continue Reading
RaspberryPi
19 Jun 2019

Side Projects Are A Good Idea

Side projects are a good idea but make sure to do the day job.

The Digital Society is changing everything, and apparently new technologies are going to disrupt our lives. Social media is in a frenzy over the ‘Internet of Things’, ‘Cognitive’, ‘Big Data’, ‘AI’, ‘Blockchain’, ‘Cloud Computing’, and heaven knows how many other buzzwords. So how does an ‘Ordinary Guy’ handle all of this while avoiding a bad case of ‘F.O.M.O’, or ‘Fear of Missing Out’? Well, stick your toe in the water and see how you feel. Gain some experience and see how you get on. That bad case of ‘F.O.M.O’ could become ‘J.O.M.

Continue Reading
Google Cloud
18 Jun 2019

Manage Files in Google Cloud Storage With Python

Manage files in your Google Cloud Storage bucket using the google-cloud-storage Python library.

I recently worked on a project which combined two of my life's greatest passions: coding, and memes. The project was, of course, a chatbot: a fun imaginary friend who sits in your chatroom of choice and loyally waits on your beck and call, delivering memes whenever you might request them. In some cases, the bot would scrape the internet for freshly baked memes, but there were also plenty of instances where the desired memes should be more predictable, namely from a predetermined subset of memes hosted on the cloud which could be updated dynamically. This is where Google Cloud Storage

Continue Reading
Spark
06 Jun 2019

Working with PySpark RDDs

Working with Spark's original data structure API: Resilient Distributed Datasets.

For being the lifeblood of Spark, there’s surprisingly little documentation on how to actually work with RDDs. If I had to guess, most of the world has been too spoiled by DataFrames to be bothered with non-tabular data. Strange world we live in when using the core data API of Spark is considered a “pro move.”

We've already spent an awful lot of time in this series speaking about DataFrames, which are only one of the 3 data structure APIs we can work with in Spark (or one of two data structure APIs in PySpark, if you're keeping score)

Continue Reading
Apache
03 Jun 2019

Manage Data Pipelines with Apache Airflow

Use Apache Airflow to build and monitor better data pipelines.

It seems like almost every data-heavy Python shop is using Airflow in some way these days. It shouldn't take much time in Airflow's interface to figure out why: Airflow is the missing piece data engineers need to standardize the creation of ETL pipelines. The best part of Airflow, of course, is that it's one of the rare Apache Foundation projects written in Python. Hooray!

If you happen to be a data engineer who isn't using Airflow (or equivalent) yet, you're in for a treat. It won't take much time using Airflow before you wonder how

Continue Reading
Code Snippet Corner
03 Jun 2019

Recasting Low-Cardinality Columns as Categoricals

Downcast strings in Pandas to their proper data-types using HDF5.

The other day, I was grabbing a way-too-big DB query for local exploration.  It was literally over 2GB as a CSV - which is a pain for a number of reasons!  Not the least of which being that, while you're doing Exploratory Data Analysis, things take way too long - it doesn't take long for Cognitive Drift to break your rhythm!

Numerical columns can be taken down to size with the downcasting functions from a previous post.  But what about Object/String columns?  One of the best ways to reduce the size of a file like this is to recast
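As a rough illustration of the payoff (the column name and data are invented), recasting a repetitive string column as a categorical stores each unique string only once:

```python
import pandas as pd

# A low-cardinality string column: thousands of rows, only 3 unique values.
df = pd.DataFrame({"state": ["NY", "CA", "NY", "TX", "CA"] * 1000})

before = df["state"].memory_usage(deep=True)
df["state"] = df["state"].astype("category")
after = df["state"].memory_usage(deep=True)

# Each unique string is stored once; rows hold small integer codes.
print(before > after)
```

The fewer unique values relative to the row count, the bigger the savings.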

Continue Reading
Code Snippet Corner
28 May 2019

Removing Duplicate Columns in Pandas

Dealing with duplicate column names in your Pandas DataFrame.

Sometimes you wind up with duplicate column names in your Pandas DataFrame. This isn't necessarily a huge deal if we're just messing with a smallish file in Jupyter.  But, if we wanna do something like load it into a database, that'll be a problem.  It can also interfere with our other cleaning functions - I ran into this the other day when reducing the size of a giant data file by downcasting it (as per this previous post).  The cleaning functions required a 1D input (so, a Series or List) - but calling the name of a duplicate column gave
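One common fix, sketched here with an invented DataFrame, is to keep only the first occurrence of each column name:

```python
import pandas as pd

# A DataFrame with a duplicated column name.
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["a", "b", "a"])

# columns.duplicated() flags every repeat after the first occurrence.
deduped = df.loc[:, ~df.columns.duplicated()]
print(list(deduped.columns))  # -> ['a', 'b']
```

After this, selecting `deduped["a"]` returns a 1D Series again instead of a two-column DataFrame.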

Continue Reading
Pandas
28 May 2019

Using Hierarchical Indexes With Pandas

Use Pandas' MultiIndex to make your data work harder for you.

I've been wandering into a lot of awkward conversations lately, most of them being about how I spend my free time. Apparently "rifling through Python library documentation in hopes of finding dope features" isn't considered a relatable hobby by most people. At least you're here, so I must be doing something right occasionally.

Today we'll be venturing off into the world of Pandas indexes. Not just any old indexes... hierarchical indexes. Hierarchical indexes take the idea of having identifiers for rows and extend this concept by allowing us to set multiple identifiers with a twist: these indexes hold parent/child
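As a quick sketch of those parent/child levels (sample data invented), `set_index()` with multiple columns builds the hierarchy:

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["US", "US", "CA", "CA"],
    "city": ["NYC", "LA", "Toronto", "Vancouver"],
    "pop": [8.4, 4.0, 2.9, 0.7],
})

# Two index levels: country is the parent, city the child.
indexed = df.set_index(["country", "city"])

print(indexed.loc["US"])                      # every row under the "US" parent
print(indexed.loc[("CA", "Toronto"), "pop"])  # -> 2.9
```

Selecting by the parent level alone returns the whole sub-frame of its children, which is where hierarchical indexes start to pay off.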

Continue Reading
Flask
27 May 2019

Managing Flask Session Variables

Using Flask-Session and Flask-Redis to store user session variables.

When we build applications that handle users, a lot of functionality depends on storing session variables. Consider a typical checkout cart: on most e-commerce sites, an abandoned cart retains its contents long after the user leaves. Carts sometimes even have their contents persist across devices! To build this kind of functionality, we can't rely on Flask's default method of storing session variables via locally stored browser cookies. Instead, we can use a cloud key/value store such as Redis and leverage a plugin called Flask-Session.
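As a configuration sketch (the Redis URL is a placeholder, and this assumes the Flask-Session and redis packages plus a reachable Redis instance), the setup looks roughly like this:

```python
import redis
from flask import Flask
from flask_session import Session

app = Flask(__name__)

# Tell Flask-Session to keep session data in Redis instead of cookies.
app.config["SESSION_TYPE"] = "redis"
app.config["SESSION_REDIS"] = redis.from_url("redis://localhost:6379")

Session(app)
# From here, writes like session["cart"] = [...] inside a request
# land in Redis, so they survive the browser and can be shared.
```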

Flask-Session is a Flask plugin which enables

Continue Reading
Pandas
20 May 2019

Reshaping Pandas DataFrames

A guide to DataFrame manipulation using groupby, melt, pivot tables, pivot, transpose, and stack.

Summer is just around the corner and everybody seems to be asking the same question: “does my data look... out of shape?” Whether you’re a scientist or an engineer, data-image dysmorphia can lead to serious negative thoughts which leave you second-guessing your data.

Much has already been said about modifying DataFrames on a “micro” level, such as column-wise operations. But what about modifying entire DataFrames at once? When considering Numpy’s role in general mathematics, it should come as no surprise that Pandas DataFrames have a lot of similarities to the matrices we learned about in high school pre-calc; namely, they
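As a small taste of that macro-level reshaping (sample data invented), `melt()` and `pivot()` move a DataFrame between wide and long form:

```python
import pandas as pd

wide = pd.DataFrame({
    "city": ["NYC", "LA"],
    "jan": [30, 60],
    "feb": [32, 62],
})

# Wide -> long: one row per (city, month) pair.
long = wide.melt(id_vars="city", var_name="month", value_name="temp")

# Long -> wide again: months back out into columns.
back = long.pivot(index="city", columns="month", values="temp")
```

The same round trip works for groupby aggregations, pivot tables, transposes, and stacking, which is what the full post walks through.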

Continue Reading
Spark
13 May 2019

Structured Streaming in PySpark

Become familiar with building a structured stream in PySpark using the Databricks interface.

Now that we're comfortable with Spark DataFrames, we're going to apply this newfound knowledge to implement a streaming data pipeline in PySpark. As it turns out, real-time data streaming is one of Spark's greatest strengths.

For this go-around, we'll touch on the basics of how to build a structured stream in Spark. Databricks has a few sweet features which help us visualize streaming data: we'll be using these features to validate whether or not our stream worked. If you're looking to hook Spark into a message broker or create a production-ready pipeline, we'll be covering this in a

Continue Reading
Data Science
29 Apr 2019

Plotting Data With Seaborn and Pandas

Create beautiful data visualizations out-of-the-box with Python’s Seaborn.

There are plenty of good libraries for charting data in Python, perhaps too many. Plotly is great, but a limit of 25 free charts is hardly a starting point. Sure, there's Matplotlib, but surely we can find something a little less... well, lame. Where are all the simple-yet-powerful chart libraries at?

As you’ve probably guessed, this is where Seaborn comes in. Seaborn isn’t a hosted service, so you can get started without creating user accounts or worrying about API limits, etc. Seaborn is also built on top of Matplotlib, making it the logical next step up for anybody wanting
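As a minimal sketch (invented data, and using the headless Agg backend so it runs anywhere), a Seaborn chart is a one-liner on top of a DataFrame:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; no display required

import pandas as pd
import seaborn as sns

df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [1, 4, 9, 16]})

# Seaborn picks up axis labels straight from the column names.
ax = sns.scatterplot(data=df, x="x", y="y")
ax.figure.savefig("scatter.png")
```

Because Seaborn functions return ordinary Matplotlib axes, anything you already know about Matplotlib styling still applies.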

Continue Reading
Spark
26 Apr 2019

Learning Apache Spark with PySpark & Databricks

Get started with Apache Spark in part 1 of our series, where we leverage Databricks and PySpark.

Something we've only begun to touch on so far is the benefit of utilizing Apache Spark in larger-scale data pipelines. Spark is a quintessential part of the Apache data stack: built atop Hadoop, Spark is intended to handle resource-intensive jobs such as data streaming and graph processing.

Much of Spark's allure comes from the fact that it is written in Scala & Java. Java and its offshoot languages are notorious for running extremely memory-heavy at run time, which can be used to our advantage. Because our jobs become predictably resource-intensive as everything is stored in memory, this allows us

Continue Reading