Data Engineering

The systematic collection and transformation of data via the creation of tools and pipelines.
Welcome to SQL: Modifying Databases and Tables

Brush up on SQL fundamentals such as creating tables, schemas, and views.

SQL: we all pretend to be experts at it, and mostly get away with it thanks to StackOverflow. Paired with our vast experience learning to code in the 90s, our fieldwork with phpMyAdmin and LAMP stacks basically makes us experts. Go ahead and chalk up a win for your resume.

SQL has been around longer than our careers have, so why start a series on it now? Surely there’s sufficient documentation that we can Google the specifics whenever the time comes to write a query? That, my friends, is precisely the problem. Regardless
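As a taste of the fundamentals the series covers, here's a minimal sketch of creating a table and a view, run through Python's built-in sqlite3 module so it's self-contained (the table and column names are invented for illustration):

```python
import sqlite3

# In-memory database, so the sketch runs anywhere with no setup.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Create a table with explicit column types and constraints.
cur.execute("""
    CREATE TABLE users (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        signup_date TEXT
    )
""")

# Create a view: a saved query we can select from like a table.
cur.execute("""
    CREATE VIEW recent_users AS
    SELECT name FROM users WHERE signup_date >= '2019-01-01'
""")

cur.execute("INSERT INTO users (name, signup_date) VALUES ('todd', '2019-03-01')")
rows = cur.execute("SELECT name FROM recent_users").fetchall()
print(rows)  # [('todd',)]
```

The same CREATE TABLE and CREATE VIEW statements translate almost verbatim to Postgres or MySQL; only the type names and date handling differ.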

Google BigQuery's Python SDK: Creating Tables Programmatically

Explore the benefits of Google BigQuery and use the Python SDK to programmatically create tables.

GCP is on the rise, and it's getting harder and harder to have conversations around data warehousing without addressing the new 500-pound gorilla on the block: Google BigQuery. By this point, most enterprises have comfortably settled into their choice of "big data" storage, whether that be Amazon Redshift, Hadoop, or what-have-you. BigQuery is quickly disrupting the way we think about big data stacks by redefining how we use and ultimately pay for such services.

The benefits of BigQuery likely aren't enough to force enterprises to throw the baby out with the bathwater. That said, companies building their infrastructure from the

Downcast Numerical Data Types with Pandas

Reduce a DataFrame's memory footprint by downcasting numerical columns.

Recently, I had to find a way to reduce the memory footprint of a Pandas DataFrame in order to actually do operations on it.  Here's a trick that came in handy!

By default, if you read a DataFrame from a file, Pandas casts numerical columns to the widest types: integers become int64, and any column containing fractions or missing values becomes float64.  This is in keeping with the philosophy behind Pandas and NumPy - by using strict types (instead of normal Python "duck typing"), you can do things a lot faster.  The float64 is the most flexible numerical type - it can handle fractions, as well as turning missing values into NaN.
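For illustration, here's roughly what the downcasting trick looks like with `pd.to_numeric` (the column names and values are invented for the example):

```python
import pandas as pd

# Both columns land as float64 by default.
df = pd.DataFrame({"price": [1.5, 2.25, 3.0], "qty": [10.0, 20.0, 30.0]})

# Downcast each column to the smallest type that holds its values losslessly.
df["price"] = pd.to_numeric(df["price"], downcast="float")    # float64 -> float32
df["qty"] = pd.to_numeric(df["qty"], downcast="integer")      # whole floats -> int8

print(df.dtypes)
```

On a toy frame the savings are trivial, but across millions of rows, halving (or better) the bytes per cell is often the difference between fitting in memory and not.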

From CSVs to Tables: Infer Data Types From Raw Spreadsheets

The quest to never explicitly set a table schema ever again.

Back in August of last year (roughly 8 months ago), I hunched over my desk at 4 am desperate to fire off a post before boarding a flight the next morning. The article was titled Creating Database Schemas: a Job for Robots, or Perhaps Pandas. It was my intent at the time to solve a common annoyance: creating database tables out of raw data, without the obnoxious process of explicitly setting each column's datatype. I had a few leads that led me to believe I had the answer... boy was I wrong.

The task seems somewhat reasonable on the surface.

Psycopg2: PostgreSQL & Python the Old Fashioned Way

Manage PostgreSQL database interactions in Python with the Psycopg2 library.

Last time we met, we joyfully shared a little tirade about missing out on functionality provided to us by libraries such as SQLAlchemy, and the advantages of interacting with databases where ORMs are involved. I stand by that sentiment, but I’ll now directly contradict myself by sharing some tips on using vanilla Psycopg2 to interact with databases.

We never know when we’ll be stranded on a desert island without access to SQLAlchemy, with only a lonesome Psycopg2 washing up onshore. Either that or perhaps you’re part of a development team stuck in a certain way of doing things

Pythonic Database Management with SQLAlchemy

The iconic Python library for handling any conceivable database interaction.

Something we've taken for granted thus far on Hackers and Slackers is a library most data professionals have accepted as standard: SQLAlchemy.

In the past, we've covered database connection management and querying using libraries such as PyMySQL and Psycopg2, both of which do an excellent job of interacting with databases just as we'd expect them to. The nature of opening/closing DB connections and working with cursors hasn't changed much in the past few decades. While boilerplate is boring, at least it has remained consistent, one might figure. That may have been the case, but the philosophical boom of MVC
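As a taste of what SQLAlchemy handles for us, here's a minimal sketch against an in-memory SQLite engine; the table and values are invented for illustration, and the engine URL would point at Postgres or MySQL in practice:

```python
from sqlalchemy import create_engine, text

# One engine manages the connection pool; no manual cursor juggling.
engine = create_engine("sqlite:///:memory:")

with engine.connect() as conn:
    conn.execute(text("CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT)"))
    # Bound parameters instead of string interpolation.
    conn.execute(text("INSERT INTO posts (title) VALUES (:t)"),
                 {"t": "Hello SQLAlchemy"})
    titles = [row[0] for row in conn.execute(text("SELECT title FROM posts"))]

print(titles)
```

Swapping the backend means changing the engine URL, not the surrounding code - which is most of the argument for reaching past PyMySQL or Psycopg2 in the first place.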

Using Redis to Store Information in Python Applications

A temporary data store for everything from session variables to chat queues.

We’re hacking into the new year here at Hackers and Slackers, and in the process, we’ve received plenty of new gifts to play with. Nevermind how Santa manages to fit physically non-existent SaaS products under the Christmas tree. We ask for abstract enterprise software every year, and this time we happened to get a little red box.

If you've never personally used Redis, the name probably sounds familiar as you've been bombarded with obscure technology brand names in places like the Heroku marketplace, or your unacceptably nerdy Twitter account (I assure you, mine is worse). So what is

MongoDB Stitch Serverless Functions

A crash course in MongoDB Stitch serverless functions: the bread and butter of MongoDB Cloud.

At times, I've found my opinion of MongoDB Atlas and MongoDB Stitch to waver between two extremes. Sometimes I'm struck by the allure of a cloud which fundamentally disregards schemas (wooo no schema party!). Other times, such as when Mongo decides to upgrade to a new version and you find all your production instances broken, I like the ecosystem a bit less.

My biggest qualm with MongoDB is its poor documentation. The "tutorials" and sample code seem hacked-together, unmaintained, and worst of all, inconsistent with one another. Reading through the docs always seems to end up with Mongo forcing Twilio down my

Scraping Data on the Web with BeautifulSoup

The honest act of systematically stealing data without permission.

There are plenty of reliable and open sources of data on the web. Datasets are freely released to the public domain by the likes of Kaggle, Google Cloud, and of course local & federal government. Like most things free and open, however, following the rules to obtain public data can be a bit... boring. I'm not suggesting we go and blatantly break some grey-area laws by stealing data, but this blog isn't exactly called People Who Play It Safe And Slackers, either.

My personal Python roots can actually be traced back to an ambitious side-project: to aggregate all new music
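For a flavor of what BeautifulSoup does, here's a minimal sketch that parses an inline HTML snippet rather than a live page (the markup and class names are invented for illustration; a real scraper would fetch the HTML over HTTP first):

```python
from bs4 import BeautifulSoup

html = """
<html><body>
  <div class="album"><h2>Album One</h2><span class="artist">Artist A</span></div>
  <div class="album"><h2>Album Two</h2><span class="artist">Artist B</span></div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
# Pull (title, artist) pairs out of every matching div.
albums = [
    (div.h2.get_text(), div.find("span", class_="artist").get_text())
    for div in soup.find_all("div", class_="album")
]
print(albums)  # [('Album One', 'Artist A'), ('Album Two', 'Artist B')]
```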

Create a REST API Endpoint Using AWS Lambda

Create an AWS Lambda function to pull records from a database.

Now that you know your way around API Gateway, you have the power to create vast collections of endpoints. If only we could get those endpoints to actually receive and return some stuff.

We'll create a GET function which will solve the common task of retrieving data from a database. The sequence will look something like:

  • Connect to the database
  • Execute the relevant SQL query
  • Map values returned by the query to a key/value dictionary
  • Return a response body containing the prepared response
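The steps above can be sketched as a single handler function. To keep the sketch runnable, sqlite3 stands in for a production database here, and the table and records are invented; in a real Lambda you'd connect to something like RDS with the appropriate driver:

```python
import json
import sqlite3

# Stand-in database so the example is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO records (name) VALUES ('first'), ('second')")

def handler(event, context):
    # 1. Connect to the database (here: reuse the in-memory connection).
    cur = conn.cursor()
    # 2. Execute the relevant SQL query.
    cur.execute("SELECT id, name FROM records")
    # 3. Map values returned by the query to key/value dictionaries.
    rows = [{"id": rid, "name": name} for rid, name in cur.fetchall()]
    # 4. Return a response body containing the prepared response.
    return {"statusCode": 200, "body": json.dumps(rows)}

response = handler({}, None)
print(response["statusCode"])  # 200
```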

To get started, create a project on your local machine (this is necessary as we'll need

MySQL, Google Cloud, and a REST API that Generates Itself

Deploy a MySQL database that auto-creates endpoints for itself.

It wasn’t too long ago that I haphazardly forced us down a journey of exploring Google Cloud’s Cloud SQL service. The focus of this exploration was Google’s accompanying REST API for all of its Cloud SQL instances. It turned out to be a relatively disappointing administrative API which did little to extend the features you’d expect from the CLI or console.

You see, I’ve had a dream stuck in my head for a while now. Like most of my utopian dreams, this dream is related to data, or more specifically simplifying the manner in

Working With Google Cloud Functions

GCP scores a victory by trivializing serverless functions.

The more I explore Google Cloud's endless catalog of cloud services, the more I like Google Cloud. This is why before moving forward, I'd like to be transparent that this blog has become little more than thinly veiled Google propaganda, where I will henceforth bombard you with persuasive and subtle messaging to sell your soul to Google. Let's be honest; they've probably simulated it anyway.

It should be safe to assume that you're familiar with AWS Lambda Functions by now, which have served as the backbone of what we refer to as "serverless." These cloud code snippets have restructured entire

Extract Nested Data From Complex JSON

Never manually walk through complex JSON objects again by using this function.

We're all data people here, so you already know the scenario: it happens perhaps once a day, perhaps 5, or even more. There's an API you're working with, and it's great. It contains all the information you're looking for, but there's just one problem: the complexity of nested JSON objects is endless, and suddenly the job you love needs to be put on hold to painstakingly retrieve the data you actually want, and it's 5 levels deep in a nested JSON hell. Nobody feels like much of a "scientist" or an "engineer" when half their day becomes dealing with key
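One possible implementation of such a helper (a sketch, not necessarily the exact function from the article) recursively walks dicts and lists, collecting every value stored under a given key, however deep it's buried:

```python
def json_extract(obj, key):
    """Recursively pull every value for `key` out of nested dicts/lists."""
    results = []

    def _extract(node):
        if isinstance(node, dict):
            for k, v in node.items():
                if k == key:
                    results.append(v)
                _extract(v)  # the value may itself hide more matches
        elif isinstance(node, list):
            for item in node:
                _extract(item)

    _extract(obj)
    return results

# Invented payload for illustration: "id" appears at several depths.
payload = {
    "data": {"items": [{"id": 1, "meta": {"id": 2}}, {"id": 3}]},
    "id": 0,
}
print(json_extract(payload, "id"))  # [1, 2, 3, 0]
```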

Reading and Writing to CSVs in Python

Playing with tabular data the native Python way.

Tables. Cells. Two-dimensional data. We here at Hackers & Slackers know how to talk dirty, but there's one word we'll be missing from our vocabulary today: Pandas. Before the remaining audience closes their browser windows in fury, hear me out. We love Pandas; so much so that we tend to recklessly gunsling this 30MB library to perform simple tasks. This isn't always a wise choice. I get it: you're here for data, not software engineering best practices. We all are, but in a landscape where engineers and scientists already produce polarizing code quality, we're all just a single bloated lambda function
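For a taste of the native approach, writing and reading tabular data with nothing but the standard library's csv module might look like this (the data is invented for illustration; `io.StringIO` stands in for a file on disk):

```python
import csv
import io

# Write a header row plus data rows.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["name", "score"])
writer.writerows([["ada", 10], ["grace", 12]])

# Read it back; DictReader keys each row by the header.
buffer.seek(0)
rows = list(csv.DictReader(buffer))
print(rows)  # [{'name': 'ada', 'score': '10'}, {'name': 'grace', 'score': '12'}]
```

Note that csv hands every value back as a string - type inference is exactly the convenience you give up by leaving Pandas at home.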
