Dealing with Dirty Data

Adventures in Excel

In my last post, we discussed what separates a true analyst (read: technical) from a project manager wearing the mask of an analyst like some Scott Snyder era Joker (I figure that there's a solid overlap between fans of comic books and fans of the real world application of data. Note that this is a study with an N = 1 so it bares no statistical significance, but I have a funny feeling...call it spidey sense).

Full disclosure, this post comes mostly out of my inability to sleep in my hotel room in Chicago following a grueling day of doing the very things I discuss in this blog, and preceding a day where I'll have to literally explain my last post to the suits, but perhaps this is the best mindset to begin discussing the myriad ways in which you may encounter dirty data in the wild, and how a savvy analyst may pivot and match their way around it. However, if my prose isn't as on point as you have grown accustomed...blame it on the 4AM haze.

Alas, let's begin by discussing the organizational structure of the majority of corporate entities that leverage data to some degree (note, this isn't all corporations...and what does that say about the state of business?) and how, at each step of abstraction in this process that you are from the data, the data gets dirtier and dirtier.

Essentially, there's always going to be a group of about 5-10 fewer-than-necessary legitimately skilled data scientists and/or computer programmers/DBAs who are really solid at building and maintaining a database as well as coding in some sort of compiling language (nowadays, that's probably python, but not exclusively, nor does this matter). However, depending on your industry (unless of course your industry IS data), it's nearly impossible to recruit people who have these skills to the level necessary AND have some familiarity with why this data is needed, and/or the ability to explain how the internal products that they build can be used by an end user. As such, this team has their own project manager(s) who's only job is to keep these guys from developing a sentient AI that's sole goal is the annihilation of unfolded laundry...when your industry is healthcare. This team should also have at least one analyst who will take the raw code base and do the first step of translation to a more user friendly form. This generally takes shape as either dashboards in a system like Tableau, or if your company has a group of particularly strong data/business analysts (or particularly weak programmers) an interface written in plain(enough) English on a Business Intelligence platform such as Microsoft BI/SAP Business Objects or whatever other system your company utilizes. As a fun little note, this team ALMOST ALWAYS is referred to by some sort of acronym such as QDAR! (Quality data and reporting!) or KMnR! (Knowledge management and reporting!) or Those Fucking Guys (who have something to do with reporting) (TFG(whstdwr)). On a less fun little note...neither you, nor seemingly ANYONE ELSE will have contact with this team. In light of this information, how do the reports that they build get chosen and who decides how these databases are built? The world may never know.

So let's assume the first type of reporting: the Dataratti  (which is how I will refer to the acronym defined team described above moving forward) produces dashboards utilizing a tool such as Tableau or Crystal Reports. You may be thinking to yourself: "hey, isn't my job taking the data and putting it in a form where people who are scared by more than two nested groups of parenthesis, and thus this renders my job unnecessary?" The answer to the question is twofold: Yes, and of course not! As mentioned previously, the decision to create these dashboards, the data contained therein, and how you want them to look is decided upon by a mythical creature who has full access and understanding of the data warehouse, AND has full access to and understanding of the stakeholders (AKA, Those Who Sit Above in Shadow; that's a reference from a famous run of Thor comics that refers to to a mysterious cabal of gods who perpetuate the cycle of Ragnarok in order to subsist upon the energies created by this strife...which as I write this, is an almost disgustingly on-the-nose metaphor for upper management). Now, if you believe that you may be this mythical creature (as I do), I DARE you to apply for a job with this job description, and once you clinch it with the advice from this blog,  rapidly realize that your job will involve either one of these job duties or the other.

With that digression, even if somehow a useful dashboard for YOU is created, the limitations inherent in these dashboarding tools make one CRUCIAL issue omnipresent: one can only effectively illustrate up to 16 different variables at a time before the system breaks down (for example, Tableau's documentation specifically warns against this). So even if you have the nicest, most illustrative dashboards on the planet from the Dataratti, there is a nearly 100% chance that the information that you actually need will be scattered across 2-3 different dashboards...rendering the nice looking dashboards essentially useless for your purposes, and as previously stated, you have no contact with the Dataratti, nor do you have access to the underlying data from which these dashboards are created. So pop quiz hot shot, what DO you do?

Well, mercifully, all of these dashboard tools allow an end user to download a "data dump" (our parlance for "a buncha numbers with headings"). Using Tableau as an example, one can download either a "crosstab" or a text file of the data represented by the dashboard (in both "summary" and "full data" format). Now, just to get the truly gifted in Tableau off my back, yes, the functionality does exist to build in the ability to download the data in the exact format necessary for your needs through a specific combination of custom web server views and Javascript, but...

  1. If the users of the dash are exclusively using this function, why do the dashboard at all? And...
  2. This forces the developers in the Dataratti to have decent web design skills on top of really high level Tableau skills, and it requires someone to anticipate exactly how the data will be used by the end user by the Dataratti (which is incredibly hard as it's impossible to speak to this department directly, and as previously stated, the lack of this knowledge on their end is the entire reason why my department exists).

A few things to note before downloading data from Tableau:

  • You must highlight at least one element of the dashboard before downloading a crosstab.
  • Depending on what kind of dashboard you're working with, you may need to highlight the entirety of one column in order to capture the entirety of your data (click the first element in any column and then scroll down to the bottom of the report...which may be enormously long, hit shift and click the last element in the report) before downloading either the data or the crosstab.
  • If you are downloading a crosstab, be wary, Tableau web server caps how many rows you can download in this method at a time, this can be avoided by downloading the text version of the data (by clicking data as opposed to crosstab). HOWEVER...
  • If you are going the data route, it defaults to summary view. Look over all the headings, and ensure that this covers everything you need, otherwise click "full data". Interestingly, this still isn't actually the entirety of your data, and continue to check to make sure all of your headings are covered, otherwise, click the display all columns box, and then download all the rows as a text file.

Now, repeat these steps until all of the data that you need in your report is contained across these text files (.csv, AKA the Comma Separated Value file type). With all that lunacy completed, you now have several sheets with some common columns, but all with different information; only some of which you need, so what do you think you do?

Simple, you use the tools given to you in the previous posts: you lookup on the common factors across the sheets and return the data that you want until you have all the data you need, in the correct order, on one sheet, and then depending on the ask, you may want to pivot that data out in order to summarize the whole mess of data. THIS IS YOUR FINAL PRODUCT well done. Another protip: if you want to reposition data that you've obtained via a lookup, highlight the whole column, hit control+C to copy the data and then hit control+V pause a second (press NOTHING else) and then press control FOLLOWED by V. This takes the values generated by a formula and replaces them with the values obtained. Functionally, this looks exactly the same, but now you can move the data around without affecting or being affected by other data.

As explaining only one possible dirty data scenario took over 1500 words, next time, we'll discuss the other most common form of taking the dirty data from the Dataratti and making it useful to you: using business intelligence portals as opposed to dashboards in order to grab the data that you need. Also, if I don't get roasted on a spit for being half asleep for tomorrow's (today's?) meeting, I'll try and write up a companion post with an example of how this works out in practice.

In summary, in this post we've learned:

  1. How data is generally siloed and sequestered within the corporate environment, leading to a bevy of unnecessary steps on behalf of the analyst in order to distill a functional report for the powers-that-be
  2. Two major methods in which data comes from the data team (henceforth known as the Dataratti) to your team: Dashboards and Business Intelligence interfaces, and...
  3. Assuming you get data in the form of dashboards, how to take these dashboards, download the underlying data, recombine and manipulate the data, and package it in a way acceptable for your needs.

Congrats, you've just learned the crucial skill of the Slice n' Dice!

Quite sleepily,

-Snacks

Author image
Down the block from a literal battleship Website
One of three therapists in the known universe who knows how to use a computer. Finds meaning in highly protected data, in a cave, utilizing nothing but a box of scraps.
Author image
Down the block from a literal battleship

One of three therapists in the known universe who knows how to use a computer. Finds meaning in highly protected data, in a cave, utilizing nothing but a box of scraps.