The Hostile Extraction of Tableau Server Data

Say no to proprietary software constraints. Say no to vendor lock. Say yes to freedom.

I try my best not to hate on Tableau. It was the software’s combination of power and ease-of-use that drove me to purchase a license in the first place. Ever since then, I’m finding new and exciting ways Tableau intentionally locks users out of their data.

I gave the Tableau Server Client Python library a spin recently in hopes of finding something useful. I decided to (sigh, once more) allow Tableau the benefit of the doubt: after pushing four updates in a single month, maybe things had changed. On the contrary, the Tableau business strategy stands strong: to be a raging, flaming turd pile. A perfect example of this is the View object Tableau allows you to interact with on your server. Those familiar know that views are slang for sheets of workbooks stored on Tableau server.

Connecting to your Tableau instance via Python to retrieve your view objects is a piece of cake:

import tableauserverclient as TSC
tableau_auth = TSC.TableauAuth('username', 'password')
server = TSC.Server('http://servername')

with server.auth.sign_in(tableau_auth):
  all_views, pagination_item = server.views.get()
  print([view.name for view in all_views])

This simple snippet lists every view object on your server. Wow! Think of what we can do with all that tabular data we worked so hard to transform, rig- WRONG. Look at what Tableau's Python 'View Object' actually contains:

  • id The identifier of the view item.
  • name The name of the view.
  • owner_id The id for the owner of the view.
  • preview_image The thumbnail image for the view.
  • total_views The usage statistics for the view. Indicates the total number of times the view has been accessed.
  • workbook_id The id of the workbook associated with the view.

HOLY MOSES STOP THE PRESSES, we can get a thumbnail image of our data?! THANK YOU GENEROUS TABLEAU OVERLORDS!

Notice how there's no mention of, you know, the actual data.

We're going to play a game. In the wake of my time has been wasted, I feel that warm tickling feeling which seems to say "Viciously dismantle the ambitions of an establishment!" May I remind you, we're talking about the kind of establishment that bills customer licenses based on the number of CPUs being utilized by their server infrastructure. This is effectively recognizing the horrifying and inefficient codebase behind Tableau server, and leveraging this flaw for monetization. Yes, you're paying more money to incentivize worst practices.

Let's Make a Flask App. An Angry One.

In our last post I shared a little script to help you get started stealing data off your own Tableau Server. That doesn't quite scratch my itch anymore. I'm going to build an interface. I want to make it easy as possible for anybody to systemically rob Tableau Server of every penny its got. That's a lot of pennies when we consider the equation: data = oil + new.

Before I bore you, here's a quick demo of the MVP we're building:

Each table is a view being pulled from Tableau Server.

This POC demonstrates that it is very possible to automate the extraction of Tableau views from Tableau Server. The success message is signaling that we've successfully taken a Tableau view and created a corresponding table in an external database. Any data we manipulate in Tableau is now truly ours: we can now leverage the transforms we've applied in workbooks, use this data in other applications, and utilize an extract scheduler to keep the data coming. We've turned a BI tool into an ETL tool. In other words, you can kindly take those thumbnail previews and shove it.

I'll be open sourcing all of this, as is my civic duty. Let us be clear to enterprises: withholding freedom to one's own data is an act of war. Pricing models which reward poor craftsmanship are an insult to our intellect. For every arrogant atrocity committed against consumers, the war will wage twice as hard. I should probably mention these opinions are my own.

The Proletariat Strikes Back

Get a feel for where we're heading with the obligatory project-file-structure tree:

tableau-exporter
├── application
│   ├── __init__.py
│   ├── database.py
│   ├── tableau.py
│   ├── routes.py
│   ├── static
│   │   ├── data
│   │   │   └── view.csv
│   │   ├── dist
│   │   │   ├── all.css
│   │   │   ├── packed.js
│   │   ├── img
│   │   │   └── tableaugithub.jpg
│   │   ├── js
│   │   │   └── main.js
│   │   └── scss
│   │       └── main.scss
│   └── templates
│       ├── export.html
│       ├── index.html
│       ├── layout.html
│       └── view.html
├── config.ini
├── config.py
├── app.yaml
├── start.sh
├── wsgi.py
├── Pipfile
├── README.md
└── requirements.txt

As usual, we're using a classic Flask application factory set up here.

Weapons Of Choice

Let's have a look at our core arsenal:

  • requests: We're achieving our goal by exploiting some loopholes exposed in the Tableau REST API.
  • pandas: Will handle everything from extracting comma-separated data into a CSV, render HTML tables, and output SQL.
  • flask_sqlalchemy: Used in tandem with pandas to handle shipping our data off elsewhere.
  • flask_redis: To handle session variables.

Initiating our Application

Here's how we construct our app:

from flask import Flask
from flask_sqlalchemy import SQLAlchemy
from flask_redis import FlaskRedis

# Set global entities
db = SQLAlchemy()
r = FlaskRedis()


def create_app():
    """Construct the core application."""
    app = Flask(__name__, instance_relative_config=False)
    app.config.from_object('config.Config')

    with app.app_context():
        # Initiate globals
        db.init_app(app)
        r.init_app(app, charset="utf-8", decode_responses=True)

        # Set global contexts
        r.set('uri', app.config['SQLALCHEMY_DATABASE_URI'])
        r.set('baseurl',  app.config['BASE_URL'])
        r.set('username',  app.config['USERNAME'])
        r.set('password', app.config['PASSWORD'])

        # Import our modules
        from . import routes
        from . import tableau
        app.register_blueprint(routes.home_blueprint)

        return app

This should all feel like business-as-usual. The core of our application is split between routes.py, which handles views, and tableau.py, which handles the anti-establishment logic. Let's begin with the latter.

Life, Liberty, and The Pursuit of Sick Data Pipelines

Our good friend tableau.py might look familiar to those who joined us last time. tableau.py has been busy hitting the gym since then and is looking sharp for primetime:

import requests
import xml.etree.ElementTree as ET
from . import r
import pandas as pd
import io


class ExtractTableauView:
    """Class for working in a Tableau instance."""

    __baseurl = r.get('baseurl')
    __username = r.get('username')
    __password = r.get('password')
    __database = r.get('uri')
    __contenturl = r.get('contenturl')

    @classmethod
    def get_view(cls, site, xml, view, token):
        """Extract contents of a single view."""
        headers = {'X-Tableau-Auth': token,
                   'Content-Type': 'text/csv'
                   }
        req = requests.get(cls.__baseurl + '/api/3.2/sites/' + str(site) +'/views/' + str(view) + '/data', headers=headers, stream=True)
        csv_text = req.text
        view_df = pd.read_csv(io.StringIO(csv_text), header=0)
        return view_df

    @classmethod
    def list_views(cls, site, xml, token):
        """List all views belonging to a Tableau Site."""
        headers = {'X-Tableau-Auth': token}
        req = requests.get(cls.__baseurl + '/api/3.2/sites/' + site + '/views', auth=(cls.__username, cls.__password), headers=headers)
        root = ET.fromstring(req.content)
        views_arr = []
        for child in root.iter('*'):
            if child.tag == '{http://tableau.com/api}views':
                for view in child:
                    view_dict = {
                        'name': view.attrib.get('name'),
                        'id': view.attrib.get('id'),
                        'url': cls.__baseurl + '/' + view.attrib.get('contentUrl'),
                        'created': view.attrib.get('createdAt'),
                        'updated': view.attrib.get('updatedAt')
                    }
                    views_arr.append(view_dict)
        return views_arr

    @classmethod
    def get_token(cls, xml):
        """Receive Auth token to perform API requests."""
        for child in xml.iter('*'):
            if child.tag == '{http://tableau.com/api}credentials':
                token = child.attrib.get('token')
                return token

    @classmethod
    def get_site(cls, xml):
        """Retrieve ID of Tableau 'site' instance."""
        root = xml
        for child in root.iter('*'):
            if child.tag == '{http://tableau.com/api}site':
                site = child.attrib.get('id')
                return site

    @classmethod
    def initialize_tableau_request(cls):
        """Retrieve core XML for interacting with Tableau."""
        headers = {'Content-Type': 'application/xml'}
        body = '<tsRequest><credentials name="' + cls.__username + '" password="' + cls.__password + '" ><site contentUrl="' + cls.__contenturl + '" /></credentials></tsRequest>'
        req = requests.post(cls.__baseurl + '/api/3.2/auth/signin', auth=(cls.__username, cls.__password), headers=headers, data=body)
        root = ET.fromstring(req.content)
        return root

I wish I could take full credit for what a shit show this class appears to be at first glance, but I assure you we've been left with no choice. For example: have I mentioned that Tableau's REST API returns XML so malformed that it breaks XML parsers? I can't tell incompetence from malicious intent at this point.

Here's a method breakdown of our class:

  • initialize_tableau_request(): Handles initial auth and returns valuable information such as site ID and API Token to be used thereafter.
  • get_site(): Extracts the site ID from XML returned by the above.
  • get_token(): Similarly extracts our token.
  • list_views(): Compiles a list of all views within a Tableau site, giving us a chance to select ones for extraction.
  • get_view(): Takes a view of our choice and creates a DataFrame, which is to be shipped off to a foreign database.

Our Routing Logic

Moving on we have routes.py building the views and associated logic for our app:

from flask import current_app as app
from flask import render_template, Blueprint, request, Markup
from flask_assets import Bundle, Environment
from . import tableau
from . import database
import pandas as pd

home_blueprint = Blueprint('home', __name__, template_folder='templates', static_folder='static')

assets = Environment(app)
js = Bundle('js/*.js', filters='jsmin', output='dist/packed.js')
scss = Bundle('scss/*.scss', filters='libsass', output='dist/all.css')
assets.register('scss_all', scss)
assets.register('js_all', js)
scss.build()
js.build()


@home_blueprint.route('/', methods=['GET', 'POST'])
def entry():
    """Homepage which lists all available views."""
    tableau_view_extractor = tableau.ExtractTableauView()
    xml = tableau_view_extractor.initialize_tableau_request()
    token = tableau_view_extractor.get_token(xml)
    site = tableau_view_extractor.get_site(xml)
    views = tableau_view_extractor.list_views(site, xml, token)
    return render_template(
        'index.html',
        title="Here are your views.",
        template="home-template",
        views=views,
        token=token,
        xml=xml,
        site=site
    )


@home_blueprint.route('/view', methods=['GET', 'POST'])
def view():
    """Displays a preview of a selected view."""
    site = request.args.get('site')
    xml = request.args.get('xml')
    view = request.args.get('view')
    token = request.args.get('token')
    tableau_view_extractor = tableau.ExtractTableauView()
    view_df = tableau_view_extractor.get_view(site, xml, view, token)
    view_df.to_csv('application/static/data/view.csv')
    return render_template(
        'view.html',
        title='Your View',
        template="home-template",
        view=view,
        token=token,
        xml=xml,
        site=site,
        view_df=Markup(view_df.to_html(index=False))
    )


@home_blueprint.route('/export', methods=['GET', 'POST'])
def export():
    """Exports view to external database."""
    view_df = pd.read_csv('application/static/data/view.csv')
    view_df.to_sql(name='temp', con=database.engine, if_exists='replace', chunksize=50, index=True)
    return render_template(
        'export.html',
        title='Success!',
        template="success-template",
    )

We only have 3 pages to our application. They include our list of views, a preview of a single view, and a success page for when said view is exported. This is all core Flask logic.

Putting it On Display

We build our pages dynamically based on the values we pass our Jinja templates. The homepage utilizes some nested loops to list the views we returned from tableau.py, and also makes use of query strings to pass values on to other templates.

{% extends "layout.html" %}

{% block content %}
<div class="extended-container {{template}}">
  <div class="container">
    <div class="row">
      <div class="col s12">
        <h1>{{title}}</h1>
      </div>
      <div class="col s12 flex-container">
        {% for view in views %}
        <div class="download">
          <a href="{{ url_for('home.view') }}?token={{token}}&site={{site}}&view={{view.id}}&xml={{xml}}">
            <ul>
              {% for key, value in view.items() %}
              <li><span class="key {{key}}">{{key}}</span> {{ value }}</li>
              {% endfor %}
            </ul>
          </a>
        </div>
        {% endfor %}
      </div>
    </div>
  </div>
  {% endblock %}

Moving on: our humble view.html page has two purposes: display the selected view, and export it in the name of justice.

{% extends "layout.html" %}

{% block content %}
<div class="extended-container {{template}}">
  <div class="container">
    <div class="row">
      <div class="col s12">
        <h1>{{title}}</h1>
        <a href="{{ url_for('home.export') }}" class="export"><i class="far fa-file-export"></i></a>
        {{view_df}}
      </div>
    </div>
  </div>
  {% endblock %}

The War is Not Over

This repository is open to the public and can be found here. There are still crusades left ahead of us: for instance, building out this interface to accept credentials via login as opposed to a config file, and the scheduling of view exports, as opposed to on-demand.

Where we go from here depends on what we the people decide. For all I know, I could be shouting to an empty room here (I'm almost positive anybody who pays for enterprise software prefers the blind eye of denial). If the opposite holds true, I dare say the revolution is only getting started.

Todd Birchard's' avatar
New York City Website
Product manager turned engineer with an ongoing identity crisis. Breaks everything before learning best practices. Completely normal and emotionally stable.