GCP is on the rise, and it's getting harder and harder to have conversations around data warehousing without addressing the new 500-pound gorilla on the block: Google BigQuery. By this point, most enterprises have comfortably settled into their choice of "big data" storage, whether that be Amazon Redshift, Hadoop, or what-have-you. BigQuery is quickly disrupting the way we think about big data stacks by redefining how we use and ultimately pay for such services.

The benefits of BigQuery likely aren't enough to force enterprises to throw the baby out with the bathwater. That said, companies building their infrastructure from the ground-up have an opportunity to ask a Big™ question: "how much easier would life be with BigQuery?"

Big Data? BigQuery.

Cloud-based data warehouses all follow the same blueprint: clusters of machines working together to partition data across nodes. Most of the time, the specifics of this underlying architecture are painfully apparent to whoever is architecting the solution: cloud architects must set the number of nodes in their cluster, the size of each node, and so forth. BigQuery changes this.

BigQuery is a data warehouse which is fully managed by Google Cloud. This means the specifics of how many nodes our data is partitioned across are completely hidden from customers! Instead of making us worry about how to scale our warehouse horizontally, BigQuery leverages the Google Cloud network to determine how our data is partitioned across nodes presumably shared by any number of other cloud customers. This gives us two major benefits:

  • Price: BigQuery charges by usage, determined by the amount of data each query scans. This makes BigQuery a significantly cheaper data warehouse option for smaller shops that don't utilize their clusters 24/7.
  • Time: A non-trivial amount of time is saved by those who would otherwise be worrying about how to scale their data warehouse.

If I had to list my priorities in life, you best believe time and money are near the top of that list. I'd be willing to bet your company agrees.

Under The Hood

BigQuery understands SQL queries by extending an internal Google querying tool called Dremel. Google has used Dremel to power massive queries across products, including YouTube, Gmail, Google Docs, and so forth. BigQuery is essentially a public-facing implementation of Dremel, which we're able to interact with via BigQuery's Web UI.

When we query data from BigQuery, Dremel executes parallel queries across up to tens of thousands of servers in Google's network. Regardless of company size, every customer on BigQuery benefits from economies of scale, as node resources are essentially shared by all customers. BigQuery utilizes both columnar storage and a tree architecture to determine how data should be partitioned and accessed, which results in a blazing-fast product.
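
To make that concrete, here's a minimal sketch of firing a standard SQL query at one of Google's public sample datasets with the Python client (credentials setup is covered later in this post; the dataset referenced is just an example):

from google.cloud import bigquery

# Assumes GOOGLE_APPLICATION_CREDENTIALS is already set (see the setup section below).
client = bigquery.Client()

query = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    GROUP BY name
    ORDER BY total DESC
    LIMIT 5
"""

# client.query() kicks off the job; iterating the result waits for and streams the rows.
for row in client.query(query):
    print(row.name, row.total)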

Comparisons to Other Solutions

Let's see how BigQuery stacks up against other solutions:

  • Hadoop & MapReduce: BigQuery prioritizes the execution time of queries, while MapReduce is focused mostly on time-consuming batch processing. Queries in BigQuery will almost always outperform MapReduce jobs, as MapReduce is primarily concerned with aggregating and modifying data after it has been collected. Using MapReduce is very much a batch-first workflow rather than an interactive one.
  • Redshift: Redshift queries clock in at significantly better performance than BigQuery, with the downside of huge associated costs. Redshift charges by the uptime of a cluster, which means you're paying by the hour regardless of whether you're actually using it (Redshift will almost always cost companies more than BigQuery).

Working With the BigQuery Python SDK

A huge upside of any Google Cloud product is GCP's powerful developer SDKs. Today we'll be interacting with BigQuery using the Python SDK. For this tutorial, we're assuming that you have a basic knowledge of Google Cloud, Google Cloud Storage, and how to download a JSON Service Account key to store locally (hint: click the link).

Here's the data I'll be using to demonstrate. The ultimate goal of this demo is to take a CSV of this table data and create a BigQuery table from it:

id | initiated | hiredate | email | firstname | lastname | title
100035435 | 2015-12-11T09:16:20.722-08:00 | 3/22/67 | GretchenRMorrow@jourrapide.com | Gretchen | Morrow | Power plant operator
100056435 | 2015-12-15T10:11:24.604-08:00 | 6/22/99 | ElizabethLSnow@armyspy.com | Elizabeth | Snow | Oxygen therapist
100037955 | 2015-12-16T14:31:32.765-08:00 | 5/31/74 | AlbertMPeterson@einrot.com | Albert | Peterson | Psychologist
100035435 | 2016-01-20T11:15:47.249-08:00 | 9/9/69 | JohnMLynch@dayrep.com | John | Lynch | Environmental hydrologist
100057657 | 2016-01-21T12:45:38.261-08:00 | 4/9/83 | TheresaJCahoon@teleworm.us | Theresa | Cahoon | Personal chef
100056747 | 2016-02-01T11:25:39.317-08:00 | 6/26/98 | KennethHPayne@dayrep.com | Kenneth | Payne | Central office operator
100035435 | 2016-02-01T11:28:11.953-08:00 | 4/16/82 | LeifTSpeights@fleckens.hu | Leif | Speights | Staff development director
100035435 | 2016-02-01T12:21:01.756-08:00 | 8/6/80 | JamesSRobinson@teleworm.us | James | Robinson | Scheduling clerk
100074688 | 2016-02-01T13:29:19.147-08:00 | 12/14/74 | AnnaDMoberly@jourrapide.com | Anna | Moberly | Playwright
100665778 | 2016-02-04T14:40:05.223-08:00 | 9/13/66 | MarjorieBCrawford@armyspy.com | Marjorie | Crawford | Court, municipal, and license clerk
100876876 | 2016-02-24T12:39:25.872-08:00 | 12/19/67 | LyleCHackett@fleckens.hu | Lyle | Hackett | Airframe mechanic
100658565 | 2016-02-29T15:52:12.933-08:00 | 11/17/83 | MaryJDensmore@jourrapide.com | Mary | Densmore | Employer relations representative
100766547 | 2016-03-01T12:32:53.357-08:00 | 10/1/87 | CindyRDiaz@armyspy.com | Cindy | Diaz | Student affairs administrator
100045677 | 2016-03-02T12:07:44.264-08:00 | 8/16/65 | AndreaTLigon@einrot.com | Andrea | Ligon | Railroad engineer

Like Amazon Redshift, BigQuery requires us to upload data to a bucket before importing said data. We'll have to go through Google Cloud Storage as a buffer before inputting data into tables... no big deal, we'll write a script! To get started, we need to install a couple of Python packages:

pip3 install google-cloud-storage google-cloud-bigquery

Configuration and Set up

As with all GCP scripts, we need to download our service account key as a JSON file and store it in the root directory where we'll be running our script. Download that now, and don't expose it to anybody. With your service key JSON in your project folder, you'll also need to set an environment variable called GOOGLE_APPLICATION_CREDENTIALS, which should be the file path to your key. If you have trouble getting set up, check out the Python SDK setup instructions.
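
If shell configuration is giving you grief, one alternative (a minimal sketch, assuming your key is saved as service_account.json in your project root) is to set the variable from Python itself before any client is created:

from os import environ

# Assumption: the downloaded key is named 'service_account.json' and sits in the project root.
# This must run before any google.cloud client is instantiated.
environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'service_account.json'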

Next, we need to set some simple configuration variables like your bucket name, bucket URI, BigQuery table, etc. I'm storing these variables in a file named config.py:

"""BigQuery Upload Configuration."""
from os import environ


# Google Cloud Storage
bucketURI = environ.get('GCP_BUCKET_URI')
bucketName = environ.get('GCP_BUCKET_NAME')

# Google BigQuery
bigqueryDataset = environ.get('GCP_BIGQUERY_DATASET')
bigqueryTable = environ.get('GCP_BIGQUERY_TABLE')

# Data
localDataFile = environ.get('LOCAL_DATA_TARGET')
destinationBlobName = environ.get('DESTINATION_BLOB_NAME')
  • bucketURI is the bucket's gs:// URI, found by inspecting any bucket's information on Google Cloud.
  • bucketName is our bucket's given name.
  • bucketTarget is the full gs:// path of the CSV once it lands in our bucket (bucket plus blob name); this is the URI we'll hand to BigQuery.
  • bigqueryDataset is the name of the "dataset" which our new BigQuery table will be saved under (a similar concept to a Postgres schema).
  • bigqueryTable: name of the table we'll be creating.
  • localDataset is the path to a CSV we've stored locally: we can assume we've grabbed some data from somewhere, like an API, and tossed it into a local file temporarily.
  • destinationBlobName is the resulting path (folder and filename) of our upload within the bucket.
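
For reference, here's roughly what those environment variables might look like for this demo (the bucket and dataset names are made up; the local path and blob name match the upload we'll run below):

GCP_BUCKET_NAME=employee-data-bucket
GCP_BUCKET_URI=gs://employee-data-bucket
GCP_BUCKET_TARGET=gs://employee-data-bucket/datasets/data_upload.csv
GCP_BIGQUERY_DATASET=employees
GCP_BIGQUERY_TABLE=employee_data
LOCAL_DATA_TARGET=data/test.csv
DESTINATION_BLOB_NAME=datasets/data_upload.csv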

Upload Data To Google Cloud Storage

Our first function is going to take a CSV we've stored locally and upload it to a bucket:

from google.cloud import storage
from config import bucketName, bucketTarget, bigqueryDataset, bigqueryTable, localDataset, destinationBlobName


def storage_upload_blob(bucketName, source_file_name, destinationBlobName):
    """Upload a CSV to Google Cloud Storage."""
    storage_client = storage.Client()
    bucket = storage_client.get_bucket(bucketName)
    blob = bucket.blob(destinationBlobName)
    blob.upload_from_filename(source_file_name)  # Commence upload
    print('File {} uploaded to {}.'.format(source_file_name,
                                           destinationBlobName))


storage_upload_blob(bucketName,
                    localDataset,
                    destinationBlobName)

Successful execution of the above should result in a local file being uploaded to your specified Google Cloud Storage bucket. This success message should be visible:

File data/test.csv uploaded to datasets/data_upload.csv.

Inserting Data from Cloud Storage to BigQuery

With our data uploaded to Google Cloud Storage, we can now import it into BigQuery. We're going to add a function called bigquery_insert_data(), which accepts the URI of the data we're inserting, a BigQuery dataset ID, and a BigQuery table ID:

from google.cloud import storage
from google.cloud import bigquery
from config import bucketName, bucketTarget, bigqueryDataset, bigqueryTable, localDataset, destinationBlobName
import pprint


...


def bigquery_insert_data(target_uri, dataset_id, table_id):
    """Insert CSV from Google Storage to BigQuery Table."""
    bigquery_client = bigquery.Client()
    dataset_ref = bigquery_client.dataset(dataset_id)
    job_config = bigquery.LoadJobConfig()
    job_config.autodetect = True
    job_config.skip_leading_rows = 1
    job_config.source_format = bigquery.SourceFormat.CSV
    uri = target_uri
    load_job = bigquery_client.load_table_from_uri(uri,
                                                   dataset_ref.table(table_id),
                                                   job_config=job_config)
    print('Starting job {}'.format(load_job.job_id))
    load_job.result()  # Waits for table load to complete.
    return 'Job finished.'


storage_upload_blob(bucketName,
                    localDataset,
                    destinationBlobName)
bigquery_insert_data(bucketTarget,
                     bigqueryDataset,
                     bigqueryTable)

First, we create a 'client' as our means of interacting with BigQuery, via the line bigquery_client = bigquery.Client().

Next, we create a dataset reference. In BigQuery, tables can belong to a 'dataset,' which is a grouping of tables (compare this concept to MongoDB's collections or PostgreSQL's schemas). This process is made much easier by the fact that we stored our service key locally... otherwise, we'd have to specify which Google Cloud project we're looking for, etc.
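
One assumption worth flagging: the load will fail if the dataset doesn't exist yet. A minimal sketch for creating it on the fly, reusing the same client and reference style (NotFound comes from google.cloud.exceptions), could look like this:

from google.cloud.exceptions import NotFound


def bigquery_create_dataset(dataset_id):
    """Create the BigQuery dataset if it doesn't already exist."""
    bigquery_client = bigquery.Client()
    dataset_ref = bigquery_client.dataset(dataset_id)
    try:
        bigquery_client.get_dataset(dataset_ref)  # Raises NotFound if the dataset is missing
    except NotFound:
        bigquery_client.create_dataset(bigquery.Dataset(dataset_ref))
        print('Created dataset {}.'.format(dataset_id))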

With the dataset specified, we begin to build our "job" object with LoadJobConfig. This is like loading a gun before unleashing a shotgun blast into the face of our problems. A more relevant comparison might be the Python requests library, where we prep an API request before executing it.

To demonstrate BigQuery's powerful schema auto-detection, I've set job_config.autodetect to True.
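
If you'd rather not lean on auto-detection, the alternative is to declare the schema yourself on the same job_config before starting the load. Here's a sketch; the types mirror what we expect auto-detection to infer from our CSV:

# Alternative to autodetect: spell the schema out explicitly.
job_config.autodetect = False
job_config.schema = [
    bigquery.SchemaField('id', 'INTEGER'),
    bigquery.SchemaField('initiated', 'TIMESTAMP'),
    bigquery.SchemaField('hiredate', 'DATE'),
    bigquery.SchemaField('email', 'STRING'),
    bigquery.SchemaField('firstname', 'STRING'),
    bigquery.SchemaField('lastname', 'STRING'),
    bigquery.SchemaField('title', 'STRING'),
]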

job_config.skip_leading_rows tells BigQuery to skip the first row of the CSV, which keeps our header row from screwing things up.

load_job puts our request together, and load_job.result() executes said job. The .result() method graciously puts the rest of our script on hold until the specified job is completed. In our case, we want this to happen: it simplifies our script so that we don't need to verify the load manually before moving on.
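
If 'Job finished.' feels a little thin, you can ask BigQuery how many rows actually landed once .result() returns. Here's a small sketch you could tack onto the end of bigquery_insert_data():

    # Optional: confirm the load by fetching the destination table's row count.
    destination_table = bigquery_client.get_table(dataset_ref.table(table_id))
    print('Loaded {} rows into {}.{}.'.format(destination_table.num_rows,
                                              dataset_id,
                                              table_id))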

Let's see what running that job with our fake data looks like in the BigQuery UI:

All my fake friends are here!

Getting The Inferred Table Schema

The last function we'll add will return the schema of the table we just created:

...

def get_table_schema(dataset_id, table_id):
    """Get BigQuery Table Schema."""
    bigquery_client = bigquery.Client()
    dataset_ref = bigquery_client.dataset(dataset_id)
    bg_tableref = bigquery.table.TableReference(dataset_ref, table_id)
    bg_table = bigquery_client.get_table(bg_tableref)
    # Print Schema to Console
    pp = pprint.PrettyPrinter(indent=4)
    pp.pprint(bg_table.schema)
    return bg_table.schema


bigqueryTableSchema = get_table_schema(bigqueryDataset,
                                       bigqueryTable)

TableReference() is similar to the dataset reference we went over earlier, only for tables (duh). This allows us to call upon get_table(), which returns a Table class representing the table we just created. Amongst the attributes of that class is .schema, which gives us precisely what we want: a beautiful representation of a table schema, generated from raw CSV information, where there previously was none.
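
Each entry in that schema is a SchemaField, so you can also pick the result apart column by column. For example, using the bigqueryTableSchema variable from above:

# Each SchemaField exposes the column's name, type, and mode.
for field in bigqueryTableSchema:
    print('{}: {} ({})'.format(field.name, field.field_type, field.mode))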

Behold the fruits of your labor:

[   SchemaField('id', 'INTEGER', 'NULLABLE', None, ()),
    SchemaField('initiated', 'TIMESTAMP', 'NULLABLE', None, ()),
    SchemaField('hiredate', 'DATE', 'NULLABLE', None, ()),
    SchemaField('email', 'STRING', 'NULLABLE', None, ()),
    SchemaField('firstname', 'STRING', 'NULLABLE', None, ()),
    SchemaField('lastname', 'STRING', 'NULLABLE', None, ()),
    SchemaField('title', 'STRING', 'NULLABLE', None, ()),
    SchemaField('department', 'STRING', 'NULLABLE', None, ()),
    SchemaField('location', 'STRING', 'NULLABLE', None, ()),
    SchemaField('country', 'STRING', 'NULLABLE', None, ()),
    SchemaField('type', 'STRING', 'NULLABLE', None, ())]

There you have it: a correctly inferred schema, from data which wasn't entirely clean in the first place (our dates are in M/D/YY format as opposed to MM/DD/YYYY, but Google still gets it right. How? Because Google).

It Doesn't End Here

I hope it goes without saying that abusing Google BigQuery's API to generate schemas for you is only a small, obscure use case of what Google BigQuery is intended to do, and what it can do for you. That said, I need to stop this fanboying post before anybody realizes I'll promote their products for free forever (I think I may have passed that point).

In case you're interested, the source code for this script has been uploaded to GitHub here. Have at it, and remember to think Big™*.

*Not a real trademark, I'm making things up again.