Google BigQuery's Python SDK: Creating Tables Programmatically

Explore the benefits of Google BigQuery and use the Python SDK to programmatically create tables.

    GCP is on the rise, and it's getting harder and harder to have conversations around data warehousing without addressing the new 500-pound gorilla on the block: Google BigQuery. By this point, most enterprises have comfortably settled into their choice of "big data" storage, whether that be Amazon Redshift, Hadoop, or what-have-you. BigQuery is quickly disrupting the way we think about big data stacks by redefining how we use and ultimately pay for such services.

    The benefits of BigQuery likely aren't enough to force enterprises to throw the baby out with the bathwater. That said, companies building their infrastructure from the ground-up have an opportunity to ask a Big™ question: "how much easier would life be with BigQuery?"

    Big Data? BigQuery.

    Cloud-based data warehouses all follow the same blueprint: clusters of machines working together to partition data across nodes. Most of the time, the specifics of this underlying architecture are painfully apparent to whoever is architecting the solution: cloud architects must set the number of nodes in their cluster, the size of each node, and so forth. BigQuery changes this.

    BigQuery is a data warehouse which is fully managed by Google Cloud. This means the specifics of how many nodes our data is partitioned across are completely hidden from customers! Instead of worrying about how to scale our warehouse horizontally, BigQuery leverages the Google Cloud network to determine how our data is partitioned across nodes presumably shared by any number of other cloud customers. This provides us with two major benefits:

    • Price: BigQuery charges by usage, determined by the amount of data each query scans. This makes BigQuery a significantly cheaper data warehouse option for smaller shops which don't utilize their clusters 24/7 (we can even estimate what a query will cost before running it, as sketched below).
    • Time: A non-trivial amount of time is saved by those who would otherwise be worrying about how to scale their data warehouse.
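
    Because billing is tied to the bytes a query scans, we can even ask BigQuery what a query would cost before running it. Here's a minimal sketch using a dry-run job; the table name is made up for illustration:

    from google.cloud import bigquery

    client = bigquery.Client()

    # A dry-run job estimates bytes processed without executing (or billing for) the query.
    job_config = bigquery.QueryJobConfig()
    job_config.dry_run = True
    job_config.use_query_cache = False

    query_job = client.query(
        'SELECT department, COUNT(*) FROM `my-project.hr.employees` GROUP BY department',  # Hypothetical table
        job_config=job_config
    )
    print('This query would process {} bytes.'.format(query_job.total_bytes_processed))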

    If I had to list my priorities in life, you best believe time and money are near the top of that list. I'd be willing to bet your company agrees.

    Under The Hood

    BigQuery understands SQL queries by extending an internal Google querying tool called Dremel. Google has used Dremel to power massive queries across products, including YouTube, Gmail, Google Docs, and so forth. BigQuery is essentially a public-facing implementation of Dremel, which we're able to interact with using BigQuery's Web UI.

    When we query data from BigQuery, Dremel executes parallel queries across up to tens of thousands of servers in Google's network. Regardless of company size, every customer on BigQuery benefits from economies of scale, as node resources are essentially shared by all customers. BigQuery utilizes both columnar storage and a tree architecture to determine how data should be partitioned and accessed, which results in a blazing-fast product.

    Comparisons to Other Solutions

    Let's see how BigQuery stacks up against other solutions:

    • Hadoop & MapReduce: BigQuery prioritizes execution time of queries, while MapReduce is focused mostly on time-consuming batch processing. Queries in BigQuery will almost always outperform MapReduce queries, as MapReduce is primarily concerned with aggregating and modifying data after it has been collected. Using MapReduce is very much
    • RedShift: Redshift queries clock in at significantly better performance than BigQuery, with the downside of huge associated costs. Redshift charges by uptime of a cluster, which means you're paying dollars on the hour regardless of what you're using Redshift for (Redshift will almost always cost companies more than BigQuery).

    Working With the BigQuery Python SDK

    A huge upside of any Google Cloud product is GCP's powerful developer SDKs. Today we'll be interacting with BigQuery using the Python SDK. For this tutorial, we're assuming that you have a basic knowledge of Google Cloud, Google Cloud Storage, and how to download a JSON Service Account key to store locally (hint: click the link).

    Here's the data I'll be using to demonstrate. The ultimate goal of this demo is to take a CSV of this table data and create a BigQuery table from it:

    id | initiated | hiredate | email | firstname | lastname | title | department | location | country | type
    100035435 | 2015-12-11T09:16:20.722-08:00 | 3/22/67 | [email protected] | Gretchen | Morrow | Power plant operator | Physical Product | Britling Cafeterias | United Kingdom | Employee
    100056435 | 2015-12-15T10:11:24.604-08:00 | 6/22/99 | [email protected] | Elizabeth | Snow | Oxygen therapist | Physical Product | Grade A Investment | United States of America | Employee
    100037955 | 2015-12-16T14:31:32.765-08:00 | 5/31/74 | [email protected] | Albert | Peterson | Psychologist | Physical Product | Grass Roots Yard Services | United States of America | Employee
    100035435 | 2016-01-20T11:15:47.249-08:00 | 9/9/69 | [email protected] | John | Lynch | Environmental hydrologist | Physical Product | Waccamaw's Homeplace | United States of America | Employee
    100057657 | 2016-01-21T12:45:38.261-08:00 | 4/9/83 | [email protected] | Theresa | Cahoon | Personal chef | Physical Product | Cala Foods | United States of America | Employee
    100056747 | 2016-02-01T11:25:39.317-08:00 | 6/26/98 | [email protected] | Kenneth | Payne | Central office operator | Frontline | Magna Consulting | United States of America | Employee
    100035435 | 2016-02-01T11:28:11.953-08:00 | 4/16/82 | [email protected] | Leif | Speights | Staff development director | Frontline | Rivera Property Maintenance | United States of America | Employee
    100035435 | 2016-02-01T12:21:01.756-08:00 | 8/6/80 | [email protected] | James | Robinson | Scheduling clerk | Frontline | Discount Furniture Showcase | United States of America | Employee
    100074688 | 2016-02-01T13:29:19.147-08:00 | 12/14/74 | [email protected] | Anna | Moberly | Playwright | Physical Product | The Wiz | United States of America | Employee
    100665778 | 2016-02-04T14:40:05.223-08:00 | 9/13/66 | [email protected] | Marjorie | Crawford | Court, municipal, and license clerk | Physical Product | The Serendipity Dip | United Kingdom | Employee
    100876876 | 2016-02-24T12:39:25.872-08:00 | 12/19/67 | [email protected] | Lyle | Hackett | Airframe mechanic | Physical Product | Infinity Investment Plan | United States of America | Employee
    100658565 | 2016-02-29T15:52:12.933-08:00 | 11/17/83 | [email protected] | Mary | Densmore | Employer relations representative | Frontline | One-Up Realtors | United States of America | Employee
    100766547 | 2016-03-01T12:32:53.357-08:00 | 10/1/87 | [email protected] | Cindy | Diaz | Student affairs administrator | Physical Product | Mr. AG's | United States of America | Employee
    100045677 | 2016-03-02T12:07:44.264-08:00 | 8/16/65 | [email protected] | Andrea | Ligon | Railroad engineer | Central Growth | Robinson Furniture | United States of America | Employee

    Like Amazon Redshift, BigQuery requires us to upload data to a bucket before importing said data. We'll have to go through Google Cloud Storage as a buffer before inputting data into tables... no big deal, we'll write a script! To get started, we need to install a couple of Python packages:

    pip3 install google-cloud-storage google-cloud-bigquery

    Configuration and Set up

    As with all GCP scripts, we need to download our service account key as a JSON file and store it in the root directory where we'll be running our script. Download that now, and don't expose it to anybody. With your service key JSON in your project folder, you'll also need to set an environment variable called GOOGLE_APPLICATION_CREDENTIALS which should equal the file path to your key. If you have trouble getting set up, check out the Python SDK set up instructions.
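
    If you'd rather not fiddle with your shell profile, you can also point the SDK at your key from inside the script itself, before any client is created. A quick sketch (the filename is whatever you named your downloaded key):

    from os import environ

    # Equivalent to `export GOOGLE_APPLICATION_CREDENTIALS=...` in the shell;
    # must run before any google.cloud client is instantiated.
    environ['GOOGLE_APPLICATION_CREDENTIALS'] = './service_account_key.json'  # Hypothetical filename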

    Next, we need to set some simple configuration variables like your bucket name, bucket URI, BigQuery table, etc. I'm storing these variables in a file named config.py:

    """BigQuery Upload Configuration."""
    from os import environ
    
    
    # Google Cloud Storage
    bucketURI = environ.get('GCP_BUCKET_URI')
    bucketName = environ.get('GCP_BUCKET_NAME')
    
    # Google BigQuery
    bigqueryDataset = environ.get('GCP_BIGQUERY_DATASET')
    bigqueryTable = environ.get('GCP_BIGQUERY_TABLE')
    
    # Data
    localDataFile = environ.get('LOCAL_DATA_TARGET')
    destinationBlobName = environ.get('DESTINATION_BLOB_NAME')
    
    • bucketURI is found by inspecting any bucket's information on Google Cloud.
    • bucketName is our bucket's given name.
    • bucketTarget is the full gs:// URI our uploaded CSV will live at once the upload completes (built from the bucket URI and the blob name).
    • bigqueryDataset is the name of the "dataset" our new BigQuery table will be saved under (a similar concept to a Postgres schema).
    • bigqueryTable is the name of the table we'll be creating.
    • localDataset is the path to a CSV we've stored locally: we can assume we've grabbed some data from somewhere, like an API, and tossed it into a local file temporarily.
    • destinationBlobName is the resulting location of our upload within the bucket.
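
    To make those variables concrete, here's roughly what they might resolve to for this demo (the bucket and dataset names below are made up; substitute your own):

    # Hypothetical values for this demo -- yours will differ.
    bucketName = 'hackers-data'
    bucketURI = 'gs://hackers-data'
    destinationBlobName = 'datasets/data_upload.csv'
    bucketTarget = 'gs://hackers-data/datasets/data_upload.csv'
    bigqueryDataset = 'employee_data'
    bigqueryTable = 'employees'
    localDataset = 'data/test.csv'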

    Upload Data To Google Cloud Storage

    Our first function is going to take a CSV we've stored locally and upload it to a bucket:

    from google.cloud import storage
    from config import bucketName, bucketTarget, bigqueryDataset, bigqueryTable, localDataset, destinationBlobName


    def storage_upload_blob(bucketName, source_file_name, destinationBlobName):
        """Upload a CSV to Google Cloud Storage."""
        storage_client = storage.Client()
        bucket = storage_client.get_bucket(bucketName)  # Fetch our target bucket
        blob = bucket.blob(destinationBlobName)  # Create a blob to hold our data
        blob.upload_from_filename(source_file_name)  # Commence upload
        return 'File {} uploaded to {}.'.format(source_file_name,
                                                destinationBlobName)


    print(storage_upload_blob(bucketName,
                              localDataset,
                              destinationBlobName))
    

    Successful execution of the above should result in a local file being uploaded to your specified Google Cloud Storage bucket. This success message should be visible:

    File data/test.csv uploaded to datasets/data_upload.csv.
    

    Inserting Data from Cloud Storage to BigQuery

    With our data uploaded to Google Cloud Storage, we can now import our data into BigQuery. We're going to add a function called bigquery_insert_data(), which accepts the URI of the data we're inserting, a BigQuery dataset ID, and a BigQuery table ID:

    from google.cloud import storage
    from google.cloud import bigquery
    from config import bucketName, bucketTarget, bigqueryDataset, bigqueryTable, localDataset, destinationBlobName
    import pprint


    ...


    def bigquery_insert_data(target_uri, dataset_id, table_id):
        """Insert CSV from Google Storage to BigQuery Table."""
        bigquery_client = bigquery.Client()
        dataset_ref = bigquery_client.dataset(dataset_id)  # Reference to our dataset
        job_config = bigquery.LoadJobConfig()
        job_config.autodetect = True  # Infer the schema from the CSV itself
        job_config.skip_leading_rows = 1  # Don't treat the header row as data
        job_config.source_format = bigquery.SourceFormat.CSV
        load_job = bigquery_client.load_table_from_uri(target_uri,
                                                       dataset_ref.table(table_id),
                                                       job_config=job_config)
        print('Starting job {}'.format(load_job.job_id))
        load_job.result()  # Waits for table load to complete.
        return 'Job finished.'


    storage_upload_blob(bucketName,
                        localDataset,
                        destinationBlobName)
    print(bigquery_insert_data(bucketTarget,
                               bigqueryDataset,
                               bigqueryTable))
    

    First, we create a 'client' as our means of interacting with BigQuery via the line bigquery_client = bigquery.Client().

    Next, we create a dataset reference. In BigQuery, tables can belong to a 'dataset,' which is a grouping of tables (compare this concept to MongoDB's collections or PostgreSQL's schemas). This process is made much easier by the fact that we stored our project key locally... otherwise, we'd have to specify which Google Cloud project we're looking for, etc.
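
    One assumption baked in here is that the dataset already exists in our project. If it doesn't, the client can create it first; a minimal sketch using the same client (the helper name is mine):

    from google.cloud import bigquery
    from google.cloud.exceptions import NotFound


    def ensure_dataset_exists(dataset_id):
        """Create the BigQuery dataset if it doesn't already exist."""
        bigquery_client = bigquery.Client()
        dataset_ref = bigquery_client.dataset(dataset_id)
        try:
            bigquery_client.get_dataset(dataset_ref)  # Raises NotFound if the dataset is missing
        except NotFound:
            bigquery_client.create_dataset(bigquery.Dataset(dataset_ref))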

    With the dataset specified, we begin to build our "job" object with LoadJobConfig. This is like loading a gun before unleashing a shotgun blast into the face of our problems. Alternatively, a more relevant comparison could be with the Python requests library and the act of prepping an API request before execution.

    To demonstrate BigQuery's powerful schema auto-detection, I've set job_config.autodetect to be true.
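
    If you'd rather not lean on auto-detection, the same job config accepts an explicit schema instead. A sketch of what that might look like for a few of our columns:

    # Alternative to autodetect: spell out the schema ourselves.
    job_config.autodetect = False
    job_config.schema = [
        bigquery.SchemaField('id', 'INTEGER'),
        bigquery.SchemaField('initiated', 'TIMESTAMP'),
        bigquery.SchemaField('hiredate', 'DATE'),
        bigquery.SchemaField('email', 'STRING'),
        # ...and so on for the remaining columns.
    ]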

    job_config.skip_leading_rows keeps our header row from screwing things up.

    load_job puts our request together, and load_job.result() executes said job. The .result() method graciously puts the rest of our script on hold until the specified job is completed. In our case, we want this to happen: it simplifies our script so that we don't need to verify this manually before moving on.
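
    The finished job object also carries some useful metadata. As a sketch, if bigquery_insert_data() were tweaked to return load_job itself instead of a string, we could report on the load like so (output_rows and errors are attributes of the job object):

    # Assumes bigquery_insert_data() was modified to return the load_job object.
    load_job = bigquery_insert_data(bucketTarget, bigqueryDataset, bigqueryTable)
    if load_job.errors:
        print('Load finished with errors: {}'.format(load_job.errors))
    else:
        print('Loaded {} rows into {}.{}.'.format(load_job.output_rows,
                                                  bigqueryDataset,
                                                  bigqueryTable))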

    Let's see what running that job with our fake data looks like in the BigQuery UI:

    All my fake friends are here!

    Getting The Inferred Table Schema

    The last function we'll add will return the schema of the table we just created:

    ...
    
    def get_table_schema(dataset_id, table_id):
        """Get BigQuery Table Schema."""
        bigquery_client = bigquery.Client()
        dataset_ref = bigquery_client.dataset(dataset_id)
        bg_tableref = bigquery.table.TableReference(dataset_ref, table_id)
        bg_table = bigquery_client.get_table(bg_tableref)
        # Print Schema to Console
        pp = pprint.PrettyPrinter(indent=4)
        pp.pprint(bg_table.schema)
        return bg_table.schema
    
    
    bigqueryTableSchema = get_table_schema(bigqueryDataset,
                                           bigqueryTable)
    

    TableReference() is similar to the dataset reference we went over earlier, only for tables (duh). This allows us to call upon get_table(), which returns a Table class representing the table we just created. Amongst the attributes of that class is .schema, which gives us precisely what we want: a beautiful representation of a table schema, generated from raw CSV information, where there previously was none.
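
    Each entry in that schema is a SchemaField, so we can also walk it programmatically instead of just pretty-printing it. Continuing the script above:

    # Walk the inferred schema field by field.
    for field in bigqueryTableSchema:
        print('{} ({}, {})'.format(field.name, field.field_type, field.mode))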

    Behold the fruits of your labor:

    [   SchemaField('id', 'INTEGER', 'NULLABLE', None, ()),
        SchemaField('initiated', 'TIMESTAMP', 'NULLABLE', None, ()),
        SchemaField('hiredate', 'DATE', 'NULLABLE', None, ()),
        SchemaField('email', 'STRING', 'NULLABLE', None, ()),
        SchemaField('firstname', 'STRING', 'NULLABLE', None, ()),
        SchemaField('lastname', 'STRING', 'NULLABLE', None, ()),
        SchemaField('title', 'STRING', 'NULLABLE', None, ()),
        SchemaField('department', 'STRING', 'NULLABLE', None, ()),
        SchemaField('location', 'STRING', 'NULLABLE', None, ()),
        SchemaField('country', 'STRING', 'NULLABLE', None, ()),
        SchemaField('type', 'STRING', 'NULLABLE', None, ())]
    

    There you have it: a correctly inferred schema, from data which wasn't entirely clean in the first place (our dates are in MM/DD/YY format as opposed to MM/DD/YYYY, but Google still gets it right. How? Because Google).
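
    If you want to double-check the result beyond eyeballing the schema, a quick query against the new table does the trick. A sketch, assuming the dataset and table names from our config:

    # Sanity-check the new table with a small query.
    query = 'SELECT firstname, lastname, hiredate FROM `{}.{}` LIMIT 5'.format(bigqueryDataset,
                                                                               bigqueryTable)
    for row in bigquery.Client().query(query).result():
        print(row.firstname, row.lastname, row.hiredate)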

    It Doesn't End Here

    I hope it goes without saying that abusing Google BigQuery's API to generate schemas for you is only a small, obscure use case of what Google BigQuery is intended to do, and what it can do for you. That said, I need to stop this fanboying post before anybody realizes I'll promote their products for free forever (I think I may have passed that point).

    In case you're interested, the source code for this script has been uploaded to GitHub here. Have at it, and remember to think Big™*.

    *Not a real trademark, I'm making things up again.