Taking Out the Trash: Dirty Data in Excel

In my last post, we explored the organizational structure of many large companies and how this pertains to one's duties as a fledgling data analyst. I highly recommend you go back and read the first post on "dirty" data, but just in case you're one of those rebels who thinks that they're too cool to read part 1, here's a quick refresher to put you back in the analytical mindset (which is the perfect combination tactical laziness, ADHD, and a complete disregard for downside risk).

Essentially, the reason why most data analysts exist within a company (remember, NOT a tech company, hence the usage of Excel) is because for whatever reason, most companies do not give analysts access to the data on demand, and instead it is tightly controlled by a mysterious cabal of DBAs or developers which I have christened the dataratti.

While this setup is workable in theory, the dataratti often do not have a strong understanding of how the data is used in practice, and there are generally not open lines of communication between the dataratti and the analysts. Thus it falls upon the analyst to do three crucial jobs:

Obtain raw data from the dataratti (remember, raw means that the data is just a bunch of numbers, words, and headings...meaningless on their own).
manipulate the data so that it's in a readable format and does not have unnecessary data.
Present this data in narrative form to the business people.

Now, you may be asking why they can't just hire someone who knows both sides of this equation, the answer is sadly simple: manipulating data in a tool like Excel is antithetical to the general idea of how databases are created (we'll talk about that some more next time) and thus someone who has both skill-sets, as well as the ability to present to stakeholders (remember, Those Who Sit Above in Shadow?) is extremely rare, and trying to become that person may not be the best idea because it may actually be cheaper to hire two employees than hire someone with chops in both skills (and you can't just know how to manipulate/present and code, you have to know both very well).

With all that being said, last time we discussed how one might grab data from the dataratti from a tool like Tableau (which is essentially a data visualization toolkit that's really good at creating dashboards, but can only handle limited variables...a.k.a. columns, and exports the actual data in a particularly messy way). Now we'll talk about one of the other possibilities for the enterprising analyst to grab raw data: Business Intelligence (BI) tools.

Despite how cool it sounds and despite it immediately conjuring thoughts of engaging in corporate espionage as part of the Business Intelligence Agency, BI is at it's core really just an interface where someone who didn't design a database can still query (that literally means "ask") the database for the data that they want.

The way this works is pretty straightforward: there will be an interface with a series of folders (just like you'd see on a desktop) and within each, there will be a list of variables that you can drag-and-drop into your query. After putting in all of the variables that you want in your report, you just hit "go" and it'll show you all of the data associated with the variables...essentially letting you grab data from the database without touching it, and without knowing SQL. To sweeten the already sweet deal, you can even generally manipulate the data after the query is done but before downloading (using a very similar format to Excel, although, why fix what ain't broken?), and you can set it up to re-run whatever query you decided upon on a regular schedule.

Well then, doesn't that just sound ideal, why didn't I lead with this clearly superior strategy? Obviously, there's a catch: not only are BI toolkits VERY expensive, the implementation of a BI tool requires an entire OTHER level of employee between the coders and the analysts. Remember way back when, in a previous post, I stated that databases only build on top of themselves and grow increasingly unwieldy the longer your company is in existence? Also, remember when I stated that the developers rarely know why their users need the data within the database? Well, that leads to the developers coding their databases in a way that only they truly understand (if we were talking about anything other than a database, I'd even say that their code is...inelegant). Now imagine having having to take this mess of code on top of code that's starting to look like Tetsuo at the end of Akira, and translating all of it to plain English that someone with no knowledge of database structures would be able to understand, and you're starting to get an idea of the struggle that a BI developer has ahead of them. To make matters worse, each BI tool requires this to be done in a specific way, so the skills are only moderately transferable. Of note, in a company that relies on their data quite a bit, and when their database is considerably large, this quickly becomes a full-time job, if not the job of a small team.

Now, despite the considerable front-end investment in a good BI team, the flexibility of being able to grab data on-demand ends up being significantly better in the long-term, especially when an analyst knows how to use tableau themselves, and can make super slick graphic representations of the data...and everyone knows that the way to a business person's heart is with sick graphics.

In summary, today we explored the following:

A refresher of the counter-intuitive model of data siloing within companies.
An explanation of the role of an analyst within this company hierarchy.
How BI tools work and what some of the benefits and challenges of implementation may be.

Now that you know why the data you get comes out dirty, next time, we'll explore some examples of how your data may arrive, and how to sculpt your data so that you can make it work for you.

Paz,

-Snacks

Related Posts