If you've been following along with the posts in this series, and you've been putting some of the scenarios into practice, you've now crossed a crucial juncture in your never-ending quest to master the bloated beast that is Excel. To put things in perspective, if you've conquered all the quests that I've set you upon, you should now be able to:
- Open up and add macro code into an Excel Worksheet
- Utilize lookup functions (particularly Index(Match)) to compare and validate data sets
- Build a complex formula
- Utilize IF/AND/and OR statements to manipulate data and finally...
- Understand your role in the data underground
If there was any running theme among the posts in the past, it's that if you're in the position in which you're forced to utilize Excel to analyze data, you're probably grabbing a dirty version of the data-set from an intermediary platform (such as Tableau or Business Objects) built by a developer who isn't informed as to how the end user is going to use the data.
Your mission, whether you accept it or not is essential to take data from people who know data and don't know the business, and turn it into data for people who know the business and don't understand numbers.
If you're wondering if perhaps you're in this role and don't know it yet, there's a simple two-question litmus test
=IF(AND(COLORCODING,OR(SLICE,DICE)),"Analyst","Something Else"
For those of you who are just joining the party, that essentially says that if:
- If a superior of yours has ever asked you to "slice" and or "dice" the data
- If you're asked to color code things arbitrarily. Then you might be an analyst.
Now, while sitting at your desk (in what's most likely an Aeron chair) and reading the above qualifiers, the first thought that pops into your mind should be: "WTF does slice and dice mean when it comes to data analysis?" The answer may surprise you in that it's not just meaningless business lingo, it's a pretty good explanation of what you're going to be doing A LOT of (but fear not, in all likelihood, your boss didn't mean it like that due to what I'd like to call the leveraging of the synergy paradigm).
Essentially, when one attempts to slap-chop a "data dump" (in the computer patois, that's how you say data that's pulled straight from the source), it often involves as the first step "normalizing" the data so that the format is logical, and is in a position in which you can start employing some of the wizardries that we've been speaking about.
The normalization process often involves only a few steps, all of which are quite easy, but EXTREMELY un-intuitive. If you follow these steps, you'll either be done with the normalization process, or you'll be able to figure out how (all data dumps are dirty, but not all shit is created equal).
- Ensure that your data has headings!
a. Headings are essential not just because they're a logical way to organize your variables, but they are also REQUIRED to build a pivot table down the road, if even one of your columns doesn't have a heading, Excel won't play nice! - Ensure that your headings don't repeat themselves
a. While this may seem obvious, if a heading repeats itself, Excel won't pivot properly, won't graph correctly, AND won't upload properly in external tools (such as Microsoft Access, which we'll explore in a later post!) - Ensure that multiple sources are all located in the same Excel workbook (basically, if you're pulling data from multiple places, you paste them all into the same Excel file).
a. Note that since Excel 2007, this step isn't 100% necessary, but it makes things much faster in both ease of use, and the computational sense. - And finally, ensure that your variables are in the correct format...which is what we'll focus on for the rest of this post.
Often, data dumps combine multiple variables into one "string" (a set of characters with no other format) most often last names, first names, and ID numbers which is fine when you're trying to figure out how manySmith,Bob,8675309
s exist in your data...so useless on its own. This brings us to the title of this post: how we're going to slice up (or "Parse") this string so that it's useful to us.
There are a few ways of going about this, but in my experience, the most useful is some combination of the following formulae:
=LEFT
: returns a specified number of characters from the start of a string.=RIGHT
: returns a specified number of characters from the end of the string.=MID
: returns a specified number of characters from a defined point in the string.=LEN
: returns the number of characters in the string, and...=FIND
: returns the location of a specified character or string within a string. Just a word to the wise, starting from the beginning is always 0, not 1 in any of these examples.
Assuming you've got some android in you, you notice that =left
and =right
are useless if you need to break up a string (but can be useful if you wind up with a data dump that gives you a preset amount of garbage to throw away each time, or as a second step in this process). Instead, we're going to leverage the fact that Data Dumps are generally exported in a Comma Separated Format (CSV) which means that there will be a lot of commas for you to set as "flags" where you want to break things up. Take the example I gave you before: Smith,Bob,8675309
s.
This is a particularly mean example because you'll need to break it into three steps for which there are multiple solutions, simply because there's more than one comma to "put your flag down". One way you can go about it is as such:
First use the "mid" function alongside the "find" command to separate the last name from the first name and ID, like so: =MID(A1,FIND(",",A1)+1,LEN(A1))
The result of this will be Bob,8675309, and the reason why the +1 is included is so that we don't include the comma (which would start this whole circus again)
You can then repeat the steps on the newly created Bob,8675309 to obtain the ID.
Once you have the ID number, if you return to the original string and perform =left(A1(find(",",A1)-1)
(the -1 removes the first comma as opposed to the second) you'll get "Smith"
Finally, you can go to the Bob,8675309 string (which we'll say is located in B1) and perform the same steps to get "Bob" all by itself.
With all that being said until we get a bit more advanced (yes, you can do this in fewer steps), assume you need one formula per comma to "parse" each section, and you'll find your way. The benefit of breaking it down this way is that you can drag the formula down and it'll perform the same steps regardless of how long the initial string is, as long as it has 3 commas.
So, what have we learned?
- We've learned to slice up data dumps
- We've learned to dice up the resultant strings and finally
- We've learned how to take a small bit of useless garbage and turn it into variables we can use in further analysis via slicing and dicing.
So you see? Perhaps your boss knew something after all when they asked you if you knew how to slice and dice data... juuuuuust kidding.
Now that you have a basic fund of knowledge (and if you don't I implore you to read the rest of the Adventures in Excel series), the next few lessons are going to go into some of the crazier stuff that I've personally been asked to do (which my other friends on this blog could do in 5 minutes using something other than Excel). You'll learn to start a fire with some flint and a bunch of sticks.
-Snacks out.