S3 File Management With The Boto3 Python SDK

Modify and manipulate thousands of files in your S3 (or DigitalOcean) Bucket.


    It's incredible the things human beings can adapt to in life-or-death circumstances, isn't it? In this particular case it wasn't my personal life in danger, but rather the life of this very blog. I will allow for a brief pause while the audience shares gasps of disbelief. We must stay strong and collect ourselves from such distress.

    Like most things I despise, the source of this unnecessary headache was a SaaS product. I won't name any names here, but it was Cloudinary. Yep, totally them. We'd been using their (supposedly) free service for hosting our blog's images for about a month. This may be a lazy alternative to a true CDN, sure, but there's only so much we can do when well over half of Ghost's 'officially recommended' storage adapters are deprecated or broken. That's a whole other thing.

    I'll spare the details, but at some point we reached one of the 5 or 6 rate limits on our account which had conveniently gone unmentioned (official violations include storage, bandwidth, lack of galactic credits, and a refusal to give up Park Place from the previously famous McDonald's Monopoly game. Seriously though, why not ask for Boardwalk?). The terms were simple: pay 100 dollars of protection money to the sharks within a matter of days. Or, ya know, don't.

    Weapons Of Mass Content Delivery

    Hostage situations aside, the challenge was on: how could we move thousands of images to a new CDN within a matter of hours, without losing all of our data or experiencing significant downtime? Some further complications:

    • There’s no real “export” button on Cloudinary. Yes, I know, they’ve just recently released a REST API that may or may not generate a zip file of a percentage of your files at a time. Great.
    • We’re left with 4-5 duplicates of every image. Every time a transform is applied to an image, it leaves behind unused duplicates.
    • We need to revert to the traditional YYYY/MM folder structure, which was destroyed.

    This is gonna be good. You'd be surprised what can be MacGyvered out of a single Python library and a few SQL queries. Let's focus on Boto3 for now.

    Boto3: It's Not Just for AWS Anymore

    DigitalOcean offers a dead-simple CDN service which just so happens to be fully compatible with Boto3. Let's not linger on that fact too long before we consider the possibility that DO is just another AWS reseller. Moving on.

    Initial Configuration

    Setting up Boto3 is simple, just as long as you can manage to find your API key and secret:

    import os
    import json
    import boto3
    from botocore.client import Config
    
    # Initialize a session using DigitalOcean Spaces.
    session = boto3.session.Session()
    client = session.client('s3',
                            region_name='nyc3',
                            endpoint_url='https://nyc3.digitaloceanspaces.com',
                            aws_access_key_id=os.environ.get('KEY'),
                            aws_secret_access_key=os.environ.get('SECRET'))
    

    From here forward, whenever we need to reference our 'bucket', we do so via client.
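
    Before doing anything destructive, it's worth a quick sanity check that the client is actually wired up. A minimal example (on Spaces, list_buckets() just returns the Spaces available to your key):

    # Sanity check: confirm our credentials work by listing available buckets.
    spaces = client.list_buckets()
    for bucket in spaces['Buckets']:
        print(bucket['Name'])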

    Fast Cut Back To Our Dramatic Storyline

    In our little scenario, I took a first rough pass at populating our bucket: I created our desired folder structure and hastily tossed everything we owned into said folders, mostly by rough guesses gauged from the publish dates of posts. So we've got the folder structure we want, but the contents are a mess.

    CDN
    ├── posts
    │   ├── 2017
    │   │   └── 11
    │   ├── 2018
    │   │   ├── 03
    │   │   ├── 04
    │   │   ├── 05
    │   │   ├── 06
    │   │   ├── 07
    │   │   ├── 08
    │   │   ├── 09
    │   │   ├── 10
    │   │   ├── 11
    │   │   └── 12
    │   ├── 2019
    │   │   ├── 01
    │   │   └── 02
    │   └── lynx
    ├── bunch
    ├── of
    ├── other
    └── shit
    

    So we're dealing with a three-tiered folder hierarchy here. You're probably thinking "oh great, this is where we recap some basics about recursion for the 100th time..." but you're wrong! Boto3 deals with the pains of recursion for us if we so please. If we were to run client.list_objects_v2() on the root of our bucket, Boto3 would return the file path of every single file in that bucket, regardless of where it lives.

    Letting an untested script run wild and make transformations to your production data sounds like fun and games, but I'm not willing to risk losing the hundreds of god damned Lynx pictures I draw every night for a mild sense of amusement. Instead, we're going to have Boto3 loop through each folder one at a time, so when our script does break, it'll happen in a predictable way we can pick back up from. I guess that means... we're pretty much opting into recursion. Fine, you were right.

    The Art of Retrieving Objects

    Running client.list_objects_v2() sure sounded straightforward when I omitted all the details, but this method can achieve some surprisingly powerful things for its size. list_objects_v2 is essentially the bread and butter of this script. "But why list_objects_v2 instead of list_objects," you may ask? I don't know, because AWS is a bloated shit show? Does Amazon even know? Why don't we ask their documentation?

    Well that explains... Nothing.

    Well, I'm sure list_objects had a vulnerability or something. Surely it's been sunsetted by now. Anything else just wouldn't make any sense.

    ...Oh. It's right there. Next to version 2.

    That's the last time I'll mention that AWS sucks in this post... I promise.

    Getting All Folders in a Subdirectory

    To humor you, let's see what getting all objects in a bucket would look like:

    def get_everything_ever():
        """Retrieve every object in the bucket, regardless of where it lives."""
        get_folder_objects = client.list_objects_v2(
            Bucket='hackers',
            Delimiter='',
            EncodingType='url',
            MaxKeys=1000,
            Prefix='',
            FetchOwner=False,
            StartAfter=''
            )
        return get_folder_objects
    

    We've passed pretty much nothing meaningful to list_objects_v2(), so it will come back to us with every file, folder, woman and child it can find in your poor bucket with great vengeance and furious anger:

    oh god oh god oh god

    Here, I'll even be fair and only return the file names/paths instead of each object:
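
    Something like this, assuming get_everything_ever() hands back the raw response as above:

    # Pull just the key (path) out of each object in the response.
    response = get_everything_ever()
    file_paths = [item['Key'] for item in response['Contents']]
    print(json.dumps(file_paths, indent=4))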

    Ah yes, totally reasonable for thousands of files.

    Instead, we'll solve this like Gentlemen. Oh, but first, let's clean those god-awful strings being returned as keys. That simply won't do, so build yourself a function. We'll need it.

    from urllib.parse import unquote # omg new import
    
    def sanitize_object_key(obj):
        """Replace character encodings with actual characters."""
        new_key = unquote(unquote(obj))
        return new_key
    

    That's better.
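
    In case you're wondering why we unquote() twice: keys can come back double-encoded when EncodingType='url' is set and the original filename already contained encoded characters. A made-up example:

    # '%2520' decodes to '%20' on the first pass, then to a space on the second.
    key = 'posts/2018/09/lynx%2520roundup.jpg'
    print(sanitize_object_key(key))  # posts/2018/09/lynx roundup.jpg

    With that in place, here's everything we've got so far in one piece: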

    import os
    import json
    import boto3
    from botocore.client import Config
    from urllib.parse import unquote
    
    # Initialize a session using DigitalOcean Spaces.
    session = boto3.session.Session()
    client = session.client('s3',
                            region_name='nyc3',
                            endpoint_url='https://nyc3.digitaloceanspaces.com',
                            aws_access_key_id=os.environ.get('KEY'),
                            aws_secret_access_key=os.environ.get('SECRET'))
    
    
    def get_folders():
        """Retrieve all folders within a specified directory.
    
        1. Set bucket name.
        2. Set delimiter to '/' so only folder prefixes come back.
        3. Set folder path to objects using "Prefix" attribute.
        4. Build the list of folder names from the response's "CommonPrefixes".
        5. Return list of folders.
        """
        get_folder_objects = client.list_objects_v2(
            Bucket='hackers',
            Delimiter='/',
            EncodingType='url',
            MaxKeys=1000,
            Prefix='posts/',
            FetchOwner=False,
            StartAfter=''
            )
        folders = [item['Prefix'] for item in get_folder_objects['CommonPrefixes']]
        return folders

    Check out list_objects_v2() this time. We restrict listing to the directory we want: posts/. By further specifying Delimiter='/', we're asking for folders only, which come back under the response's CommonPrefixes key rather than Contents. This gives us a nice list of folders to walk through, one by one.
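
    One caveat: list_objects_v2() returns at most 1,000 keys per call (hence MaxKeys above). Our folders stay under that, but if yours don't, boto3's built-in paginator will handle the continuation-token plumbing for you. A quick sketch:

    # A paginator transparently follows ContinuationTokens across pages.
    paginator = client.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket='hackers', Prefix='posts/'):
        for obj in page.get('Contents', []):
            print(obj['Key'])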

    Shit's About to Go Down

    We're about to get complex here and we haven't even created an entry point yet. Here's the deal below:

    • get_folders() gets us all folders within the base directory we're interested in.
    • For each folder, we loop through its contents via the get_objects_in_folder() function.
    • Because Boto3 can be janky, we need to clean up the strings coming back to us as "keys", also known as the "absolute paths to each object". We use unquote in sanitize_object_key() quite often to fix this and return workable file paths.

    import os
    import json
    import boto3
    from botocore.client import Config
    from urllib.parse import unquote
    
    # Initialize a session using DigitalOcean Spaces.
    session = boto3.session.Session()
    client = session.client('s3',
                            region_name='nyc3',
                            endpoint_url='https://nyc3.digitaloceanspaces.com',
                            aws_access_key_id=os.environ.get('KEY'),
                            aws_secret_access_key=os.environ.get('SECRET'))
    
    
    def sanitize_object_key(obj):
        """Replace character encodings with actual characters."""
        new_key = unquote(unquote(obj))
        return new_key
    
    
    def get_folders():
        """Retrieve all folders within a specified directory.
    
        1. Set bucket name.
        2. Set delimiter to '/' so only folder prefixes come back.
        3. Set folder path to objects using "Prefix" attribute.
        4. Build the list of folder names from the response's "CommonPrefixes".
        5. Return list of folders.
        """
        get_folder_objects = client.list_objects_v2(
            Bucket='hackers',
            Delimiter='/',
            EncodingType='url',
            MaxKeys=1000,
            Prefix='posts/',
            FetchOwner=False,
            StartAfter=''
            )
        folders = [item['Prefix'] for item in get_folder_objects['CommonPrefixes']]
        return folders
    
    
    def get_objects_in_folder(folderpath):
        """List all objects in the provided directory.
    
        1. Set bucket name.
        2. Leave delimiter blank to fetch all files.
        3. Set folder path to "folderpath" parameter.
        4. Return list of objects in folder.
        """
        objects = client.list_objects_v2(
            Bucket='hackers',
            EncodingType='url',
            MaxKeys=1000,
            Prefix=folderpath,
            FetchOwner=False,
            StartAfter=''
            )
        return objects

    Recap

    All of this until now has been neatly assembled groundwork. Now that we have the power to quickly and predictably loop through every file we want, we can finally start to fuck some shit up.

    Our Script's Core Logic

    Not every transformation I chose to apply to my images will be relevant to everybody; instead, let's take a look at our completed script, and I'll let you decide which snippets you'd like to drop in for yourself!

    Here's our core script that successfully touches every desired object in our bucket, without applying any logic just yet:

    import os
    import json
    import boto3
    from botocore.client import Config
    from urllib.parse import unquote
    
    # Initialize a session using DigitalOcean Spaces.
    session = boto3.session.Session()
    client = session.client('s3',
                            region_name='nyc3',
                            endpoint_url='https://nyc3.digitaloceanspaces.com',
                            aws_access_key_id=os.environ.get('KEY'),
                            aws_secret_access_key=os.environ.get('SECRET'))
    
    
    def get_folders():
        """Retrieve all folders within a specified directory.
    
        1. Set bucket name.
        2. Set delimiter to '/' so only folder prefixes come back.
        3. Set folder path to objects using "Prefix" attribute.
        4. Build the list of folder names from the response's "CommonPrefixes".
        5. Return list of folders.
        """
        get_folder_objects = client.list_objects_v2(
            Bucket='hackers',
            Delimiter='/',
            EncodingType='url',
            MaxKeys=1000,
            Prefix='posts/',
            FetchOwner=False,
            StartAfter=''
            )
        folders = [item['Prefix'] for item in get_folder_objects['CommonPrefixes']]
        return folders
    
    
    def get_objects_in_folder(folderpath):
        """List all objects in the provided directory.
    
        1. Set bucket name.
        2. Leave delimiter blank to fetch all files.
        3. Set folder path to "folderpath" parameter.
        4. Return list of objects in folder.
        """
        objects = client.list_objects_v2(
            Bucket='hackers',
            EncodingType='url',
            MaxKeys=1000,
            Prefix=folderpath,
            FetchOwner=False,
            StartAfter=''
            )
        return objects
    
    
    def sanitize_object_key(obj):
        """Replace character encodings with actual characters."""
        new_key = unquote(unquote(obj))
        return new_key
    
    
    def purge_unwanted_objects(item):
        """Placeholder for now; we'll flesh this out in a moment."""
        return False
    
    
    def optimize_cdn_objects():
        """Perform tasks on objects in our CDN.
    
        1. Loop through folders in subdirectory.
        2. In each folder, loop through all objects.
        3. Sanitize object key name.
        4. Remove 'garbage' files by recognizing which substrings they contain.
        5. If file not deleted, check to see if file is an image (search for '.').
        6. Logic TBD.
        """
        for folder in get_folders():
            folderpath = sanitize_object_key(folder)
            objects = get_objects_in_folder(folderpath)
            for obj in objects['Contents']:
                item = sanitize_object_key(obj['Key'])
                purged = purge_unwanted_objects(item)
                if not purged:
                    if '.' in item:
                        # OUR APP LOGIC WILL GO HERE
                        pass
    
    
    optimize_cdn_objects()
    

    There we have it: the heart of our script. Now let's look at a brief catalog of what we could potentially do here.

    Choose Your Own Adventure

    Purge Files We Know Are Trash

    This is an easy one. Surely your buckets get bloated with unused garbage over time... in my example, I somehow managed to upload a bunch of duplicate images from my Dropbox, all with the suffix (Todds-MacBook-Pro.local's conflicted copy YYYY-MM-DD). Things like that can be purged easily:

    def purge_unwanted_objects(item):
        """Delete item from bucket if name meets criteria."""
        banned = ['Todds-iMac', 'conflicted', 'Lynx']
        if any(x in item for x in banned):
            client.delete_object(Bucket="hackers", Key=item)
            return True
        return False
    

    Download CDN Locally

    If we want to apply certain image transformations, it could be a good idea to back up everything in our CDN locally. This will save all objects in our CDN to a relative path which matches the folder hierarchy of our CDN; the only catch is we need to make sure those folders exist prior to running the script:

    ...
    import botocore
    
    def save_images_locally(obj):
        """Download target object.
    
        1. Try downloading the target object.
        2. If image doesn't exist, throw error.
        """
        try:
            client.download_file(Key=obj, Filename=obj, Bucket='hackers')
        except botocore.exceptions.ClientError as e:
            if e.response['Error']['Code'] == "404":
                print("The object does not exist.")
            else:
                raise
    
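    If you'd rather not create that folder hierarchy by hand, a small guard before each download will build it for you. A sketch, assuming keys look like posts/2018/09/image.jpg:

    ...
    import os
    
    def ensure_local_folder(obj):
        """Create the local folder structure for a key if it doesn't exist yet."""
        folder = os.path.dirname(obj)
        if folder:
            os.makedirs(folder, exist_ok=True)

    Call that at the top of save_images_locally() and the "make sure those folders exist" caveat goes away.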

    Create Retina Images

    With the Retina.js plugin, serving any image of filename x.jpg will also look for a corresponding file named x@2x.jpg to serve on Retina devices. Because our images are exported as high-res, all we need to do is write a function to copy each image and modify the file name:

    def create_retina_image(item):
        """Rename our file to specify that it is a Retina image.
    
        1. Insert "@2x" at end of filename.
        2. Copy original image with new filename.
        3. Keep both files as per retina.js.
        """
        indx = item.index('.')
        newname = item[:indx] + '@2x' + item[indx:]
        newname = sanitize_object_key(newname)
        client.copy_object(Bucket='hackers',
                           CopySource='hackers/' + item,
                           Key=newname)
        print("created: ", newname)
    

    Create Standard Resolution Images

    Because we started with high-res images and copied them, we can now scale the original images down to a normal size. Pillow's thumbnail() method shrinks an image to fit within maximum bounds while keeping the height-to-width aspect ratio intact. We work on the local copy we saved a moment ago, then write the resized version over it:

    ...
    from PIL import Image
    
    def create_standard_res_image(obj):
        """Resize large images to an appropriate size.
    
        1. Set maximum bounds for standard-def image.
        2. Get the image file type.
        3. Open the local image.
        4. Resize image.
        5. Save the file locally.
        """
        size = 780, 2000
        filename = obj.split('/')[-1]
        filetype = filename.split('.')[-1].upper()
        filetype = filetype.replace('JPG', 'JPEG')
        outfile = obj.replace('@2x', '')
        # Use Pillow to resize the image while preserving aspect ratio.
        img = Image.open(obj)
        img.thumbnail(size, Image.LANCZOS)
        img.save(outfile, filetype, optimize=True, quality=100)
        print('created ', outfile, ' locally with filetype ', filetype)
    
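    As an aside, if you'd rather skip the local disk round-trip entirely, the same resize can happen in memory. This is a sketch rather than what I actually ran, but every call here is standard boto3/Pillow:

    ...
    import io
    from PIL import Image
    
    def resize_in_memory(obj):
        """Hypothetical variant: resize an image straight from the bucket."""
        size = 780, 2000
        # Read the object into memory instead of downloading it to disk.
        raw = client.get_object(Bucket='hackers', Key=obj)['Body'].read()
        img = Image.open(io.BytesIO(raw))
        img.thumbnail(size, Image.LANCZOS)
        # Write the resized image to an in-memory buffer and upload it.
        buffer = io.BytesIO()
        filetype = obj.split('.')[-1].upper().replace('JPG', 'JPEG')
        img.save(buffer, filetype, optimize=True, quality=100)
        client.put_object(Bucket='hackers',
                          Key=obj.replace('@2x', ''),
                          Body=buffer.getvalue())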

    Upload Local Images

    After modifying our images locally, we'll need to upload the new images to our CDN:

    def upload_local_images(obj):
        """Upload the standard-res images created locally."""
        outfile = obj.replace('@2x', '')
        client.upload_file(Filename=outfile,
                           Bucket='hackers',
                           Key=outfile)
        print('uploaded: ', outfile)
    

    Put It All Together

    That should be enough to get your imagination running wild. What does all of this look like together?

    import os
    import json
    import boto3
    from botocore.client import Config
    import botocore
    from urllib.parse import unquote
    from PIL import Image
    
    
    # Initialize a session using DigitalOcean Spaces.
    session = boto3.session.Session()
    client = session.client('s3',
                            region_name='nyc3',
                            endpoint_url='https://nyc3.digitaloceanspaces.com',
                            aws_access_key_id=os.environ.get('KEY'),
                            aws_secret_access_key=os.environ.get('SECRET'))
    
    
    def get_folders():
        """Retrieve all folders within a specified directory.
    
        1. Set bucket name.
        2. Set delimiter to '/' so only folder prefixes come back.
        3. Set folder path to objects using "Prefix" attribute.
        4. Build the list of folder names from the response's "CommonPrefixes".
        5. Return list of folders.
        """
        get_folder_objects = client.list_objects_v2(
            Bucket='hackers',
            Delimiter='/',
            EncodingType='url',
            MaxKeys=1000,
            Prefix='posts/',
            FetchOwner=False,
            StartAfter=''
            )
        folders = [item['Prefix'] for item in get_folder_objects['CommonPrefixes']]
        return folders
    
    
    def get_objects_in_folder(folderpath):
        """List all objects in the provided directory.
    
        1. Set bucket name.
        2. Leave delimiter blank to fetch all files.
        3. Set folder path to "folderpath" parameter.
        4. Return list of objects in folder.
        """
        objects = client.list_objects_v2(
            Bucket='hackers',
            EncodingType='url',
            MaxKeys=1000,
            Prefix=folderpath,
            FetchOwner=False,
            StartAfter=''
            )
        return objects
    
    
    def sanitize_object_key(obj):
        """Replace character encodings with actual characters."""
        new_key = unquote(unquote(obj))
        return new_key
    
    
    def save_images_locally(obj):
        """Download target object.
    
        1. Try downloading the target object.
        2. If image doesn't exist, throw error.
        """
        try:
            client.download_file(Key=obj, Filename=obj, Bucket='hackers')
        except botocore.exceptions.ClientError as e:
            if e.response['Error']['Code'] == "404":
                print("The object does not exist.")
            else:
                raise
    
    
    def create_retina_image(item):
        """Rename our file to specify that it is a Retina image.
    
        1. Insert "@2x" at end of filename.
        2. Copy original image with new filename.
        3. Keep both files as per retina.js.
        """
        indx = item.index('.')
        newname = item[:indx] + '@2x' + item[indx:]
        newname = sanitize_object_key(newname)
        client.copy_object(Bucket='hackers',
                           CopySource='hackers/' + item,
                           Key=newname)
        print("created: ", newname)
    
    
    def create_standard_res_image(obj):
        """Resize large images to an appropriate size.
    
        1. Set maximum bounds for standard-def image.
        2. Get the image file type.
        3. Open the local image.
        4. Resize image.
        5. Save the file locally.
        """
        size = 780, 2000
        filename = obj.split('/')[-1]
        filetype = filename.split('.')[-1].upper()
        filetype = filetype.replace('JPG', 'JPEG')
        outfile = obj.replace('@2x', '')
        # Use Pillow to resize the image while preserving aspect ratio.
        img = Image.open(obj)
        img.thumbnail(size, Image.LANCZOS)
        img.save(outfile, filetype, optimize=True, quality=100)
        print('created ', outfile, ' locally with filetype ', filetype)
    
    
    def upload_local_images(obj):
        """Upload the standard-res images created locally."""
        outfile = obj.replace('@2x', '')
        client.upload_file(Filename=outfile,
                           Bucket='hackers',
                           Key=outfile)
        print('uploaded: ', outfile)
    
    
    def purge_unwanted_objects(obj):
        """Delete item from bucket if name meets criteria."""
        banned = ['Todds-iMac', 'conflicted', 'Lynx', 'psd', 'lynx']
        if any(x in obj for x in banned):
            client.delete_object(Bucket="hackers", Key=obj)
            return True
        # Flag images belonging to Lynx posts (short filenames starting with digits).
        filename = obj.split('/')[-1]
        if len(filename) < 7 and filename[:2].isdigit():
            print(filename[:2])
        return False
    
    
    def optimize_cdn_objects():
        """Perform tasks on objects in our CDN.
    
        1. Loop through folders in subdirectory.
        2. In each folder, loop through all objects.
        3. Sanitize object key name.
        4. Remove 'garbage' files by recognizing which substrings they contain.
        5. If file not deleted, check to see if file is an image (search for '.').
        6. Rename image to be retina compatible.
        7. Save image locally.
        8. Create standard resolution version of image locally.
        9. Upload standard resolution images to CDN.
        """
        for folder in get_folders():
            folderpath = sanitize_object_key(folder)
            objects = get_objects_in_folder(folderpath)
            for obj in objects['Contents']:
                item = sanitize_object_key(obj['Key'])
                purged = purge_unwanted_objects(item)
                if not purged:
                    if '.' in item:
                        create_retina_image(item)
                        save_images_locally(item)
                        create_standard_res_image(item)
                        upload_local_images(item)
    
    
    optimize_cdn_objects()
    
    

    Well that's a doozy.

    If you feel like getting creative, there's even more you can do to optimize the assets in your bucket or CDN. For example: grabbing each image and rewriting the file in WebP format. I'll let you figure that one out on your own.
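
    Okay, fine, one hint. This is a rough sketch rather than anything I ran myself, and it assumes your Pillow install has WebP support and that the local copies from earlier are still hanging around:

    def convert_to_webp(obj):
        """Hypothetical helper: re-encode a local image as WebP and upload it."""
        webp_key = obj.rsplit('.', 1)[0] + '.webp'
        img = Image.open(obj)
        img.save(webp_key, 'WEBP', quality=90)
        client.upload_file(Filename=webp_key, Bucket='hackers', Key=webp_key)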

    As always, the source for this can be found on Github.
