Hey there budding Excel wizards, this post picks up RIGHT where my previous post left off, so if you haven't enabled Powerpivot yet, I highly recommend that you read the previous post and enable the add-in before moving forward. Don't worry, we'll wait...
Good to see you again! If you've gotten this far, you've already won half the battle by enabling Powerpivot. To recap, technically PowerPivot is the desktop version of one of the cornerstones of Microsoft's Business Intelligence cloud platform: Power BI. Functionally however, PowerPivot is the answer to dealing with enormous files in a way that they're still useful, and can be used to grab data from a variety of sources, and summarize them under one roof.
Now, Powerpivot is extremely flexible and can grab data from a variety of sources, including but not limited to Databases, Data warehouses, Cloud services (like Tableau), Excel files, and Text Files (A.K.A. Flat files). As my posts tend to focus on those of you who rely almost solely on Excel for your analysis needs, we're going to focus mainly on the last two file types, and specifically the final type: the flat file.
A "Flat File" also known colloquially as a "Data Dump" is, quite literally, the output of a particular table (or query) from a database or database-like system. These files tend to be in .txt format, and if you try to open them as you would a normal file, it'll look like a bunch of gibberish in notepad with a LOT of some sort of punctuation.
You might also notice that these files, being essentially a chunk of a big database, can range from "uncomfortably large" to "you literally can't use it unless you know what you're doing." The reason for this is fairly complex, but a good rule of thumb is that anything larger than 1,048,576 rows needs either a specialized program (if you're interested, I recommend EM editor at www.emeditor.com) or a little ingenuity in order to extract meaning. Also, just from anecdotal evidence, even if the file loads up (and this is dependent somewhat on the power of your own machine) the chance of the file being unworkable increases rapidly when the file exceeds a gigabyte and a half in size.
In order to understand why these files are so big, we need to take a detour into understanding a bit about how DBAs interact with databases. For one thing, if you're working directly on a database (a real database...not something you threw together in Access), your computer isn't actually doing very much interacting with the database at all, and that's good because the processing power it takes to manage data sources of that size is simply out of the range of even the most custom of high-end professional PCs. Instead, these databases rely on the processing power of MANY computers working in tandem, and even then, a DBA is only ever going to look at a snapshot of the database (via a query).
That's why you may have encountered significant resistance from the IT department to release these flat files...what you're asking them to do is essentially take this product that's meant to be consumed by a monster network, and spit it out so that it can be manipulated by one analyst and their mid-range pro-sumer laptop (and real-talk, if you're sitting at your desk at work, next to the rest of the people on your team, look to your left, now to your right...can you honestly say those two people won't submit an IT ticket that says they can't open the big file...it's usually not worth the headache on IT's end). This also explains why the files are so large, usually, most database management software will simply spit out a file with the results of a query, it doesn't actually have to process that query, that's how you get files that exist but are too big to actually open.
Now that you understand a little more about these flat files, let's first figure out how to open them...first in "regular" Excel, and then using PowerPivot.
- Unlike opening a regular file, you need to have an instance of Excel opened up as opposed to opening the text file directly.
- Once Excel is opened, click File-->Open-->Browse...now things get a little weird.
- In the browse window, click the dropdown box above "open" and select "all files"
4. Select the text file that you wish to open.
5. At this point, if the file is straight up too big to play around with, you'll get an error message... and if you do, you'll have to wait till the next post to figure out what to do. If not, this will bring up the screen seen below. You'll want to generally check the "my data has headers" box, and click the "Delimited" radio button:
6. Remember that time I said there was a bunch of punctuation in the text file along with some gibberish? The actual punctuation mark that is used is important here, so either check yourself by opening the text file, or ask your IT department "what's the delimiter on the flat file?" The delimiter is that mark within the text file that stipulates the values that will be contained within a cell in the file that you're ultimately going to be playing with.
7. Hopefully the IT department has responded (or you've figured it out yourself), so on the next screen, click the checkbox next to the delimiter that they've told you to choose (*note* if they say the file is "Bar Delimited" check off "other" and write "|" in the box..."|" is what you get with Ctrl+\). In the example here, a comma separates the values, so the file is "Comma-Delimited".
8. This is an optional step, but the "treat consecutive delimiters as one" checkbox concerns whether your data is supposed to have blank cells or not. Checking the box gets rid of blanks, leaving it blank...leaves in the blanks. In my experience, that box is never checked. Either way, click finish.
9. Now, depending on the size of the flat file you're trying to open and the power of your computer, this will take anywhere from one second to 15 minutes (hopefully not...that's a good sign that you have too many rows). After the wait though, you'll be greeted with an Excel worksheet just like any other. I recommend IMMEDIATELY saving as the file in .XLSX format (there's nothing worse than borking a flat file). Alternatively, you'll still get a worksheet, but you'll get an error stating that some information couldn't be recovered, and a message asking if you want to try and recover the data...it doesn't really matter what you say here, the problem is you hit the row limit in Excel, and it's not going to be able to recover the data. Now, if you said yes, and waited for that whole thing to finish, you'll still get the first million-or-so rows, but there's no telling what was left off...and incomplete data is often worse than no data at all!
Now that you understand flat files and where they come from, I hope you have a better understanding of the struggles that the data team goes through to bring you files that you can manipulate, and some knowledge regarding what "vanilla" Excel can...and can't process.
In the next post, we're going to go beyond the impossible and finally get into using PowerPivot to manage files that are too big for Excel to handle, and maybe, if you're looking to get really nuts, MULTIPLE files.
Yours is the drill that will pierce the heavens,
- Snacks