GCP is on the rise, and it's getting harder and harder to have conversations around data warehousing without addressing the new 500-pound gorilla on the block: Google BigQuery. With most enterprises comfortably settled into their Apache-based Big Data stacks, BigQuery rattles the cages of convention for many. Luckily, Hackers And Slackers is no such enterprise. Thus, we aren't afraid to ask the Big question: how much easier would life be with BigQuery?

Big Data, BigQuery

In short, BigQuery trivializes the act of querying across multiple, unpredictable data sources. To understand when this is useful, it helps to identify the types of questions BigQuery can answer, such as:

  • What are our users doing across our multiple systems? How do we leverage log files output by multiple systems to find out?
  • How can we consolidate employee information, payroll, and benefits when these all live in isolated systems?
  • What the hell am I supposed to do with all these spreadsheets?

Unlike previous solutions, BigQuery solves these problems in a single product and does so with SQL-like query syntax, a web interface, and 7 native client libraries. There are plenty of reasons to love BigQuery, but let's start with one we've talked about recently: the auto-generation of table schemas.

Matt has demonstrated how to approach this problem manually with the help of Pandas. I provided a more gimmicky approach by leveraging the Python table-schema library. With BigQuery, we find yet another alternative that is neither manual nor gimmicky: perfect for those who are lazy, rich, and demand perfection (AKA: your clients, probably).

First, we'll need to get our data into BigQuery

Uploading Data into Google Cloud Storage via the Python SDK

BigQuery requires us to go through Google Cloud Storage as a buffer before loading data into tables. No big deal, we'll write a script!

We're assuming that you have a basic knowledge of Google Cloud, Google Cloud Storage, and how to download a JSON Service Account key to store locally (hint: click the link).
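If you've downloaded a service account key, one common way to point the Google Cloud SDK at it is the GOOGLE_APPLICATION_CREDENTIALS environment variable, which the client libraries check automatically. The path below is a placeholder, not a real file:

```python
import os

# Placeholder path: point this at your downloaded service account key.
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/path/to/service_account.json'

# Any google.cloud client created after this point picks up the key
# without needing credentials passed explicitly.
print(os.environ['GOOGLE_APPLICATION_CREDENTIALS'])
```

Alternatively, you can export the variable in your shell before running the script; either way, the clients below will find the key on their own.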

from google.cloud import storage

bucket_uri = 'gs://your-bucket/'
bucket_name = 'your-bucket'
bucket_target = 'datasets/data_upload.csv'
local_dataset = 'data/test.csv'


def upload_blob(bucket_name, source_file_name, destination_blob_name):
    """Upload a CSV to Google Cloud Storage.

    1. Retrieve the target bucket.
    2. Set destination of data to be uploaded.
    3. Upload local CSV.
    """
    storage_client = storage.Client()
    bucket = storage_client.get_bucket(bucket_name)
    blob = bucket.blob(destination_blob_name)
    # Commence Upload
    blob.upload_from_filename(source_file_name)
    print('File {} uploaded to {}.'.format(
        source_file_name,
        destination_blob_name))
        
        
upload_blob(bucket_name, local_dataset, bucket_target)

The above is nearly a copy + paste of Google Cloud's sample code for the Google Cloud Storage Python SDK:

  • bucket_uri is found by inspecting any bucket's information on Google Cloud.
  • bucket_name is... well, you know.
  • bucket_target is the destination path (the blob name) the CSV will have inside the bucket once the upload completes.
  • local_dataset is the path to a CSV we've stored locally: we can assume that we've grabbed some data from somewhere, like an API, and tossed it into a local file temporarily.
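For what it's worth, the fully qualified GCS URI of the uploaded blob (which we'll need when loading into BigQuery in the next section) is just these pieces glued together. A quick sanity check:

```python
bucket_uri = 'gs://your-bucket/'
bucket_target = 'datasets/data_upload.csv'

# The gs:// URI BigQuery will read from later on.
bucket_target_uri = bucket_uri + bucket_target
print(bucket_target_uri)  # gs://your-bucket/datasets/data_upload.csv
```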

Successfully executing the above results in the following message:

File data/test.csv uploaded to datasets/data_upload.csv.

Inserting Data from Cloud Storage to BigQuery

That was the easy part. Let's move on to the good stuff:

from google.cloud import storage
from google.cloud import bigquery

bucket_uri = 'gs://your-bucket/'
bucket_name = 'your-bucket'
bucket_target = 'datasets/data_upload.csv'
local_dataset = 'data/test.csv'
bucket_target_uri = bucket_uri + bucket_target
bigquery_dataset = 'uploadtest'
bigquery_table = 'my_table'


def upload_blob(bucket_name, source_file_name, destination_blob_name):
    """Upload a CSV to Google Cloud Storage.

    1. Retrieve the target bucket.
    2. Set destination of data to be uploaded.
    3. Upload local CSV.
    """
    storage_client = storage.Client()
    bucket = storage_client.get_bucket(bucket_name)
    blob = bucket.blob(destination_blob_name)
    # Commence Upload
    blob.upload_from_filename(source_file_name)
    print('File {} uploaded to {}.'.format(
        source_file_name,
        destination_blob_name))


def insert_bigquery(target_uri, dataset_id, table_id):
    """Insert CSV from Google Storage to BigQuery Table.

    1. Specify target dataset within BigQuery.
    2. Create a Job configuration.
    3. Specify that we are autodetecting datatypes.
    4. Reserve row #1 for headers.
    5. Specify the source format of the file (defaults to CSV).
    6. Pass the URI of the CSV stored on Google Cloud Storage.
    7. Load BigQuery Job.
    8. Execute BigQuery Job.
    """
    bigquery_client = bigquery.Client()
    dataset_ref = bigquery_client.dataset(dataset_id)
    job_config = bigquery.LoadJobConfig()
    job_config.autodetect = True
    job_config.skip_leading_rows = 1
    job_config.source_format = bigquery.SourceFormat.CSV
    uri = target_uri
    load_job = bigquery_client.load_table_from_uri(
        uri,
        dataset_ref.table(table_id),
        job_config=job_config)  # API request
    print('Starting job {}'.format(load_job.job_id))
    # Waits for table load to complete.
    load_job.result()
    print('Job finished.')


upload_blob(bucket_name, local_dataset, bucket_target)
insert_bigquery(bucket_target_uri, bigquery_dataset, bigquery_table)

We've added the function insert_bigquery() to handle creating a BigQuery table out of a CSV.

After we set our client, we create a dataset reference. In BigQuery, tables can belong to a 'dataset,' which is a grouping of tables. Compare this concept to MongoDB's databases, or PostgreSQL's schemas. Note that this process is made much easier by the fact that we stored our project key locally: otherwise, we'd have to specify which Google Cloud project we're looking for, etc.

With the dataset specified, we begin to build our "job" object with LoadJobConfig. This is like loading a gun before unleashing a shotgun blast into the face of our problems. Alternatively, a more relevant comparison could be with the Python requests library and the act of prepping an API request before execution.

We set job_config.autodetect to be True, obviously. job_config.skip_leading_rows keeps our header row from screwing things up.
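To demystify autodetect a little: BigQuery samples rows from the file and promotes each column to the narrowest type that fits every sampled value. We don't know Google's exact implementation, but a toy sketch of that kind of inference (with a made-up three-type ladder of INTEGER < FLOAT < STRING) might look like this:

```python
import csv
import io

def infer_column_type(values):
    """Guess a BigQuery-ish type: widen INTEGER -> FLOAT -> STRING as needed."""
    inferred = 'INTEGER'
    for value in values:
        try:
            int(value)
            continue  # still fits the current (or a narrower) type
        except ValueError:
            pass
        try:
            float(value)
            if inferred != 'STRING':
                inferred = 'FLOAT'
        except ValueError:
            inferred = 'STRING'  # widest type; nothing can narrow it back
    return inferred

# A tiny in-memory CSV standing in for our uploaded file.
sample_csv = "id,salary,department\n1,55000.50,Engineering\n2,61000,Sales\n"
rows = list(csv.DictReader(io.StringIO(sample_csv)))
schema = {col: infer_column_type([row[col] for row in rows]) for col in rows[0]}
print(schema)  # {'id': 'INTEGER', 'salary': 'FLOAT', 'department': 'STRING'}
```

The real thing handles far more types (TIMESTAMP, DATE, BOOLEAN, and so on), but the widening principle is the same.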

load_job puts our request together, and load_job.result() executes said job. The .result() method graciously puts the rest of our script on hold until the specified job is completed. In our case, we want this to happen: it simplifies our script so that we don't need to verify this manually before moving on.

Let's see what running that job with our fake data looks like in the BigQuery UI:

All my fake friends are here!

Getting Our Flawlessly Inferred Table Schema

BigQuery surely gets table schemas wrong some of the time. That said, I have yet to see it happen. Let's wrap this script up:

from google.cloud import storage
from google.cloud import bigquery
import pprint

bucket_uri = 'gs://your-bucket/'
bucket_name = 'your-bucket'
bucket_target = 'datasets/data_upload.csv'
local_dataset = 'data/test.csv'
bucket_target_uri = bucket_uri + bucket_target
bigquery_dataset = 'uploadtest'
bigquery_table = 'my_table'


def upload_blob(bucket_name, source_file_name, destination_blob_name):
    """Upload a CSV to Google Cloud Storage.

    1. Retrieve the target bucket.
    2. Set destination of data to be uploaded.
    3. Upload local CSV.
    """
    storage_client = storage.Client()
    bucket = storage_client.get_bucket(bucket_name)
    blob = bucket.blob(destination_blob_name)
    # Commence Upload
    blob.upload_from_filename(source_file_name)
    print('File {} uploaded to {}.'.format(
        source_file_name,
        destination_blob_name))


def insert_bigquery(target_uri, dataset_id, table_id):
    """Insert CSV from Google Storage to BigQuery Table.

    1. Specify target dataset within BigQuery.
    2. Create a Job configuration.
    3. Specify that we are autodetecting datatypes.
    4. Reserve row #1 for headers.
    5. Specify the source format of the file (defaults to CSV).
    6. Pass the URI of the CSV stored on Google Cloud Storage.
    7. Load BigQuery Job.
    8. Execute BigQuery Job.
    """
    bigquery_client = bigquery.Client()
    dataset_ref = bigquery_client.dataset(dataset_id)
    job_config = bigquery.LoadJobConfig()
    job_config.autodetect = True
    job_config.skip_leading_rows = 1
    job_config.source_format = bigquery.SourceFormat.CSV
    uri = target_uri
    load_job = bigquery_client.load_table_from_uri(
        uri,
        dataset_ref.table(table_id),
        job_config=job_config)  # API request
    print('Starting job {}'.format(load_job.job_id))
    # Waits for table load to complete.
    load_job.result()
    print('Job finished.')


def get_schema(dataset_id, table_id):
    """Get BigQuery Table Schema.

    1. Specify target dataset within BigQuery.
    2. Specify target table within given dataset.
    3. Create Table class instance from existing BigQuery Table.
    4. Print results to console.
    5. Return the schema dict.
    """
    bigquery_client = bigquery.Client()
    dataset_ref = bigquery_client.dataset(dataset_id)
    bg_tableref = bigquery.table.TableReference(dataset_ref, table_id)
    bg_table = bigquery_client.get_table(bg_tableref)
    # Print Schema to Console
    pp = pprint.PrettyPrinter(indent=4)
    pp.pprint(bg_table.schema)
    return bg_table.schema


upload_blob(bucket_name, local_dataset, bucket_target)
insert_bigquery(bucket_target_uri, bigquery_dataset, bigquery_table)
bigquery_table_schema = get_schema(bigquery_dataset, bigquery_table)

With the addition of get_schema(), our script is complete!

TableReference() is similar to the dataset reference we went over earlier, only for tables (duh). This allows us to call upon get_table(), which returns a Table class representing the table we just created. Among that class's attributes is .schema, which gives us precisely what we want: a beautiful representation of a table schema, generated from raw CSV information, where there previously was none.

Behold the fruits of your labor:

[   SchemaField('id', 'INTEGER', 'NULLABLE', None, ()),
    SchemaField('initiated', 'TIMESTAMP', 'NULLABLE', None, ()),
    SchemaField('hiredate', 'DATE', 'NULLABLE', None, ()),
    SchemaField('email', 'STRING', 'NULLABLE', None, ()),
    SchemaField('firstname', 'STRING', 'NULLABLE', None, ()),
    SchemaField('lastname', 'STRING', 'NULLABLE', None, ()),
    SchemaField('title', 'STRING', 'NULLABLE', None, ()),
    SchemaField('department', 'STRING', 'NULLABLE', None, ()),
    SchemaField('location', 'STRING', 'NULLABLE', None, ()),
    SchemaField('country', 'STRING', 'NULLABLE', None, ()),
    SchemaField('type', 'STRING', 'NULLABLE', None, ())]

There you have it: a correctly inferred schema from data that wasn't entirely clean in the first place (our dates are in MM/DD/YY format as opposed to MM/DD/YYYY, but Google still gets it right. How? Because Google).
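Once you have the schema in hand, one handy next step is serializing it into BigQuery's JSON schema format (the same shape the bq command-line tool accepts), so it can be reused elsewhere. A sketch, using plain tuples as stand-ins for the first few SchemaField objects printed above:

```python
import json

# Stand-ins for SchemaField objects returned by get_schema():
# (name, field_type, mode) triples mirroring the output above.
schema_fields = [
    ('id', 'INTEGER', 'NULLABLE'),
    ('hiredate', 'DATE', 'NULLABLE'),
    ('email', 'STRING', 'NULLABLE'),
]

# Build the [{"name": ..., "type": ..., "mode": ...}] structure
# that BigQuery's JSON schema files use.
schema_json = json.dumps(
    [{'name': n, 'type': t, 'mode': m} for n, t, m in schema_fields],
    indent=2)
print(schema_json)
```

With the real SchemaField objects, you'd read the same three values off each field's name, field_type, and mode attributes instead of tuples.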

It Doesn't End Here

I hope it goes without saying that abusing Google BigQuery's API to generate schemas for you is only a small, obscure use case of what Google BigQuery is intended to do, and what it can do for you. That said, I need to stop this fanboying post before anybody realizes I'll promote their products for free forever (I think I may have passed that point).

In case you're interested, the source code for this script has been uploaded to GitHub here. Have at it, and remember to think Big™*.

*Not a real trademark, I'm making things up again.