Humanity will forever be in debt to Silicon Valley for its innovative history of recklessly monetizing human life, spewing arrogant claims of bettering the world, and disrupting the oppressive intuition of face-to-face interaction.
Despite our love for these pioneers of change, it's hard not to enjoy the show when the titans of FAANG attempt to destroy each other. 10 years ago, Microsoft nearly decimated Google with a sophisticated secret weapon called Bing: a shitty search engine that the majority of the world would assume was google.com by virtue of opening Internet Explorer. Google struck back hard with Chrome: a product so successful at not-being-Internet-Explorer that it has given Google power to shape the internet itself. Given their history, it's hard to imagine these war machines putting down their guns and settling their differences. Or at least it was until AWS happened.
The term multi-cloud is yet another tech buzzword most of us have grown numb to, as everything on earth slowly becomes engulfed in AI Blockchain VR Cloud As-A-Services. If we pay attention, though, it's clear "multi-cloud" is a term used exclusively by two of the three aforementioned corporations (read: the only two with a significant non-AWS cloud offering). It's no coincidence that Amazon banned the usage of the term "multi-cloud" from all internal communications.
"Multi-cloud" implies that there is a benefit to paying bills to not one, but multiple cloud providers; presumably insinuating that there are products on Google Cloud and Azure which are better than AWS' offerings. It's a narrative intended to bore its way into the skulls of technical decision-makers, but occasionally these companies have a point. For Google, one such point is BigQuery: a data warehouse with cost efficiency and usability that surpasses Redshift by a margin so absurdly large that getting gouged by double the number of cloud providers is actually a better deal.
Under The Hood
BigQuery evolved from one of Google's internal tools called Dremel: a query engine able to execute queries against data from all of Google's products (YouTube, Gmail, Google Docs, and so forth). Dremel/BigQuery's architecture is a network of tens of thousands of Google servers accessible by all GCP users, essentially serving as a single shared service on a massive scale. Instead of worrying about how to scale a data warehouse cluster horizontally, you let BigQuery leverage the Google Cloud network to determine how your data is partitioned across nodes presumably shared by any number of other cloud customers. This provides us with two major benefits:
- Price: BigQuery charges by usage, determined by the size of each query. This makes BigQuery a significantly cheaper data warehouse option for smaller shops which don't utilize their clusters 24/7.
- Time: It's hard to express what a pain in the ass it is to maintain, optimize, and "fine tune" a Redshift instance. We scratch the surface of the details in another post, but I'd recommend avoiding it unless you're a masochist. Not only are Redshift instances more expensive than the hardware they run on, but it's up to you to ensure data is being partitioned correctly, sweating over "distribution styles" in the endless battle for passable performance. As a general rule of thumb, if "fine tuning" is a primary attribute of your job description, your life might suck.
Time and money happen to be near the top of my life's list of priorities.
BigQuery Python SDK
GCP provides powerful developer SDKs for each of its services, and BigQuery is no exception. We'll be using BigQuery's Python SDK in tandem with the Google Cloud Storage Python SDK to import raw CSV into a new BigQuery table. To follow along, it's recommended that you have a basic knowledge of the following:
- Enabling services in Google Cloud (BigQuery here and Cloud Storage here).
- Generating Service Account keys for Google Clouds APIs (reference here)
- The concept of environment variables and how to use them in Python.
If you find yourself getting lost at any point, the source code for this tutorial is available on GitHub.
Setting Up
As mentioned, we need to generate a service account key in the Google Cloud Console with access to BigQuery and Cloud Storage. Download this JSON file and store it in your project's root directory. Don't commit this file to GitHub or share it with anybody.
We also need some data. I'll be using a dataset of fake employees; part of this exercise is to demonstrate BigQuery's ability to infer datatypes, so I intentionally picked data with a fairly diverse collection of column datatypes:
id | initiated_date | hire_date | email | first_name | last_name | title |
---|---|---|---|---|---|---|
100035435 | 2015-12-11T09:16:20.722-08:00 | 3/22/67 | GretchenRMorrow@jourrapide.com | Gretchen | Morrow | Power plant operator |
100056435 | 2015-12-15T10:11:24.604-08:00 | 6/22/99 | ElizabethLSnow@armyspy.com | Elizabeth | Snow | Oxygen therapist |
100037955 | 2015-12-16T14:31:32.765-08:00 | 5/31/74 | AlbertMPeterson@einrot.com | Albert | Peterson | Psychologist |
100035435 | 2016-01-20T11:15:47.249-08:00 | 9/9/69 | JohnMLynch@dayrep.com | John | Lynch | Environmental hydrologist |
100057657 | 2016-01-21T12:45:38.261-08:00 | 4/9/83 | TheresaJCahoon@teleworm.us | Theresa | Cahoon | Personal chef |
100056747 | 2016-02-01T11:25:39.317-08:00 | 6/26/98 | KennethHPayne@dayrep.com | Kenneth | Payne | Central office operator |
100035435 | 2016-02-01T11:28:11.953-08:00 | 4/16/82 | LeifTSpeights@fleckens.hu | Leif | Speights | Staff development director |
100035435 | 2016-02-01T12:21:01.756-08:00 | 8/6/80 | JamesSRobinson@teleworm.us | James | Robinson | Scheduling clerk |
100074688 | 2016-02-01T13:29:19.147-08:00 | 12/14/74 | AnnaDMoberly@jourrapide.com | Anna | Moberly | Playwright |
100665778 | 2016-02-04T14:40:05.223-08:00 | 9/13/66 | MarjorieBCrawford@armyspy.com | Marjorie | Crawford | Court, municipal, and license clerk |
100876876 | 2016-02-24T12:39:25.872-08:00 | 12/19/67 | LyleCHackett@fleckens.hu | Lyle | Hackett | Airframe mechanic |
100658565 | 2016-02-29T15:52:12.933-08:00 | 11/17/83 | MaryJDensmore@jourrapide.com | Mary | Densmore | Employer relations representative |
While there are now a number of ways to get data into a BigQuery table, the preferred method we'll use is importing data through Google Cloud Storage. With all that in mind, spin up a Python environment and install the following packages:
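The exact package list isn't reproduced here, but at minimum you'll need the Cloud Storage and BigQuery client libraries:

```shell
$ pip install google-cloud-storage google-cloud-bigquery
```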
Configuration
With your service key JSON in your project folder, you'll also need to set an environment variable called GOOGLE_APPLICATION_CREDENTIALS, which should equal the file path to your key. If you have trouble getting set up, check out the Python SDK setup instructions.
Our script will need to be configured to work with our Google Cloud account, which we'll do in config.py. We need to configure the two GCP services we're utilizing in our script: Google Cloud Storage and Google BigQuery.
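The original config file isn't shown here, but a minimal config.py sketch might look something like the following (the table, dataset, and file path values are placeholders; swap in your own or pull them from environment variables):

```python
"""config.py: GCP + BigQuery configuration (placeholder values)."""
from os import environ

# Google Cloud Storage
GCP_PROJECT_ID = environ.get("GCP_PROJECT_ID")
GCP_BUCKET_NAME = environ.get("GCP_BUCKET_NAME")

# Google BigQuery
GCP_BIGQUERY_TABLE_ID = "employees"
GCP_BIGQUERY_DATASET_ID = "employee_data"
GCP_BIGQUERY_FULL_TABLE_ID = (
    f"{GCP_PROJECT_ID}.{GCP_BIGQUERY_DATASET_ID}.{GCP_BIGQUERY_TABLE_ID}"
)

# Data locations
LOCAL_CSV_FILEPATH = "data/employees.csv"          # Local CSV to upload
REMOTE_CSV_DESTINATION = "datasets/employees.csv"  # Destination path in the GCS bucket
```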
A quick breakdown of what we're working with here:
- GCP_PROJECT_ID: The ID of your Google Cloud project (found basically anywhere, including the URL of your console).
- GCP_BUCKET_NAME: The name of a Google Cloud Storage bucket which will serve as the destination for our CSV upload.
- GCP_BIGQUERY_TABLE_ID: Desired table name of our Google BigQuery table.
- GCP_BIGQUERY_DATASET_ID: BigQuery dataset destination to store our table under.
- GCP_BIGQUERY_FULL_TABLE_ID: Similarly to how SQL tables can be referenced across databases by following the dot notation of DATABASE_NAME.TABLE_NAME, BigQuery follows the convention of identifying tables by drilling down from the top (project ID) to the bottom (table ID). We'll refer to this as the "full" table ID, as it gives us a convenient path to our table, but it should not be confused with the actual table ID, which is the simple table name we set earlier.
- LOCAL_CSV_FILEPATH: File path to a local CSV containing data we'd like to upload.
- REMOTE_CSV_DESTINATION: Remote file path our CSV will be saved to in our Google Cloud Storage bucket.
Upload Data to Google Cloud Storage
The most effortless way to create a new BigQuery table from raw data is via a file hosted on Google Cloud Storage. We've covered working in GCS in detail in a past tutorial, so we'll blow through this part quickly.
Starting a GCS client is easy:
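Here's a minimal sketch, assuming GOOGLE_APPLICATION_CREDENTIALS is set and GCP_PROJECT_ID comes from the config.py above:

```python
from google.cloud import storage

from config import GCP_PROJECT_ID

# Credentials are picked up automatically from GOOGLE_APPLICATION_CREDENTIALS
gcs_client = storage.Client(project=GCP_PROJECT_ID)
```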
Our first function is going to take a CSV we've stored locally and upload it to a bucket:
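A sketch of that upload function, reusing the gcs_client and the hypothetical config values from above (the function name here is my own):

```python
from config import GCP_BUCKET_NAME


def upload_csv_to_gcs(local_filepath: str, remote_filepath: str) -> str:
    """Upload a local CSV to our Google Cloud Storage bucket."""
    bucket = gcs_client.get_bucket(GCP_BUCKET_NAME)  # Destination bucket
    blob = bucket.blob(remote_filepath)              # Blob (object) at the remote path
    blob.upload_from_filename(local_filepath)        # Push the local file's contents
    return f"Uploaded {local_filepath} to gs://{GCP_BUCKET_NAME}/{remote_filepath}."
```

Calling upload_csv_to_gcs(LOCAL_CSV_FILEPATH, REMOTE_CSV_DESTINATION) kicks off the upload and returns a confirmation string we can print.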
Successful execution of the above should result in a local file being uploaded to your specified Google Cloud Storage bucket. If all went well, you should see a similar success message:
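With the hypothetical paths from the sketch above, that message would look something like:

```
Uploaded data/employees.csv to gs://your-bucket-name/datasets/employees.csv.
```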
Inserting Data from Cloud Storage to BigQuery
With our data uploaded to Google Cloud Storage, we can now import our data into BigQuery. Similar to how we created a "client" for Google Cloud Storage, we're going to do the same for BigQuery:
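Again a minimal sketch, reusing GCP_PROJECT_ID from config.py:

```python
from google.cloud import bigquery

from config import GCP_PROJECT_ID

# Same credential setup as the storage client
bigquery_client = bigquery.Client(project=GCP_PROJECT_ID)
```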
Create Tables from CSVs
We're going to define a function that generates a new BigQuery table by inserting the data parsed from the CSV we uploaded to our bucket. The only parameters we need to pass are GCP_BIGQUERY_FULL_TABLE_ID (to tell BigQuery where to save this table) and REMOTE_CSV_DESTINATION (to find the CSV we uploaded in our bucket).

Our function will be called gcs_csv_to_table(), and it accepts the values above as parameters named full_table_id and remote_csv_path.
The first line of our function defines a variable called gcs_csv_uri, built using Google's simple URI syntax for referring to objects in a bucket.
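That URI is just gs:// followed by the bucket name and the object's path inside it; with our hypothetical config values it could be built like so:

```python
# gs://<bucket>/<path-to-object> points BigQuery at our uploaded CSV
gcs_csv_uri = f"gs://{GCP_BUCKET_NAME}/{remote_csv_path}"
```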
BigQuery likes to handle tasks like big data imports or inserts as "jobs," as it should; these methods are designed to potentially manipulate massive amounts of data, which may (or may not, in our case) take a very long time. We start by creating a LoadJobConfig, which isn't the job itself, but rather the configuration our insert job will be created from. There's a lot we can pass here, but we go with the following:
- autodetect: When set to True, the datatype of each column in our CSV will be inferred from its contents. This is generally a very accurate operation, given that you've cleaned your data to make sense.
- skip_leading_rows: Skips the first row of our CSV when loading, as this is our row of column headers. BigQuery is able to discern this simply by passing 1.
- source_format: BigQuery can accept multiple formats besides CSV files, such as JSON. We need to pass objects provided by BigQuery to specify the filetype, hence our import from google.cloud.bigquery import SourceFormat.
With our job configured, we call load_table_from_uri() from our BigQuery client and save the output to a variable named load_job. Here we pass our two URIs from earlier (the CSV URI and the BigQuery table ID), as well as the job config we just created. Calling this method immediately starts the job; the job itself runs on Google's side, and it's calling .result() on it that blocks our script until the job is done. This is fine for our purposes.
We can see this in action by checking the output of load_job.result(), which won't fire until our job has a result (as advertised).
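Here's a sketch of the whole function put together. It assumes the bigquery_client from earlier and a get_table() helper we'll define in the next section; this is my reconstruction rather than the post's exact code:

```python
from google.cloud.bigquery import LoadJobConfig, SourceFormat

from config import GCP_BUCKET_NAME


def gcs_csv_to_table(full_table_id: str, remote_csv_path: str):
    """Create a BigQuery table from a CSV sitting in our GCS bucket."""
    gcs_csv_uri = f"gs://{GCP_BUCKET_NAME}/{remote_csv_path}"

    # Configuration for the load job (not the job itself)
    job_config = LoadJobConfig()
    job_config.autodetect = True           # Infer column datatypes from the CSV
    job_config.skip_leading_rows = 1       # First row is our header row
    job_config.source_format = SourceFormat.CSV

    # Kick off the load job; .result() blocks until it finishes
    load_job = bigquery_client.load_table_from_uri(
        gcs_csv_uri, full_table_id, job_config=job_config
    )
    print(load_job.result())

    return get_table(full_table_id)  # Defined below
```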
Let's see what running that job with our fake data looks like in the BigQuery UI:
Getting The Inferred Table Schema
Our last function ended by returning the result of a function called get_table(), which we hadn't defined until now:
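A one-line sketch will do the trick, again assuming the shared bigquery_client:

```python
from google.cloud import bigquery


def get_table(full_table_id: str) -> bigquery.Table:
    """Fetch the newly created table as a bigquery.Table object."""
    return bigquery_client.get_table(full_table_id)
```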
get_table() returns a Table object representing the table we just created. Among the attributes of that class is .schema, which gives us precisely what we want: a beautiful representation of a table schema, generated from raw CSV information, where there previously was none.
Behold the fruits of your labor:
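A quick way to inspect it (the types in the comment are what autodetect would plausibly infer from our sample data, not guaranteed output):

```python
table = get_table(GCP_BIGQUERY_FULL_TABLE_ID)
for field in table.schema:
    # e.g. "id: INTEGER", "initiated_date: TIMESTAMP", "hire_date: DATE", "email: STRING"
    print(f"{field.name}: {field.field_type}")
```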
There you have it: a correctly inferred schema, from data which wasn't entirely clean in the first place (our dates are in MM/DD/YY format as opposed to MM/DD/YYYY, but Google still gets it right. How? Because Google).
It Doesn't End Here
Auto-generating schemas via Google BigQuery's API is only a small, obscure use case of what BigQuery is intended to do. Now, if you'll excuse me, I need to stop this fanboy post before anybody realizes I'll promote Google's products for free forever (I think I may have passed that point).
In case you're interested, the full source code for this script has been uploaded to GitHub.