Pandas 1.0 came out recently. To celebrate, here's a little tour of what brought us here.
There are a lot of places where this story could start, but let's start in 1954. 1954 had two events that are pretty important to the birth of Pandas.
On March 1st, 1954, the US tested a high-yield thermonuclear bomb on Bikini Atoll, in a test code-named Castle Bravo. Lithium deuteride (LiD) was the fuel. Natural lithium includes two different isotopes - lithium-6 and lithium-7. Lithium-6 deuteride was understood to be Good H-Bomb Fuel, and the lithium-7 was expected to act as a moderating substance in the reaction, on the assumption that it was totally inert. Turns out that this assumption was extremely wrong - the substance that was supposed to moderate the reaction wound up turning into fuel. What was supposed to be a 6 megaton blast turned out to be a 15 megaton blast.
The consequences of this test are far-reaching, both in time and space. The crew of a Japanese fishing boat suffered acute radiation poisoning, and many people in the Marshall Islands suffered long-term harm from fallout. It led to an understanding of what fallout actually was, and that you weren't safe from a nuclear blast just from being outside of it - it affected people hundreds of miles away. Castle Bravo also echoes across culture - it helped inspire Godzilla, and it's why SpongeBob & friends live in "Bikini Bottom" (a reference to Bikini Atoll). Perhaps most saliently, and most relevant to our story, it was a large motivator behind the 1963 Limited Test Ban Treaty, which ended nuclear testing in the atmosphere, underwater, and in space (though there could still be underground tests). Castle Bravo made it clear just how dangerous these experiments could be, how widespread the damage could be, and that it was basically impossible to guarantee a lack of major mistakes.
Naturally, this meant a major investment in computer technology to simulate nuclear blasts instead of relying so much on tests. The technology was still in its infancy, but the mandate for simulating nukes would drive scientific computing until the present day. This drive for computers and software is a big part of where Silicon Valley comes from. The Department of Defense and Department of Energy's hunger for computers, no matter the cost, led them to buy up every batch of semiconductors until they were cheap enough to be cost-effective for a more general market. Which brings us to the other important event of 1954, this one on the software side: Fortran.
“Much of my work has come from being lazy. I didn't like writing programs, and so, when I was working on the IBM 701, writing programs for computing missile trajectories, I started work on a programming system to make it easier to write programs.”
- John W. Backus
That thought process should seem pretty familiar to any Pandas user - it is essentially what's motivated Pandas, R, SQL, and basically any other framework that tries to let you focus on declarative programming for math & data manipulation. Fortran (a portmanteau of "Formula Translation"), which Backus's team built at IBM, is a tool for writing scientific programs - it may look verbose compared to equivalent Python code, but it's certainly a lot more expressive than assembly. You can probably follow a Fortran program just by reading it, which wouldn't necessarily be the case with a bunch of opcodes. It also has the distinction of being one of the oldest high-level programming languages still in use. Backus proposed it in late 1953 and a draft specification was finished in 1954, though the first compiler didn't ship until 1957.
Fortran is more than just a spiritual antecedent to modern scientific computing. Fortran packages for doing matrix operations, such as BLAS and LAPACK, are "under the hood" of NumPy, and therefore of Pandas. The copy on your computer isn't necessarily written in Fortran - the default is a C translation. But the Fortran version is where the C one came from, it's still an option, and it's generally what you want if you really need performance.
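To make the "under the hood" part a little more concrete, here's a minimal sketch, assuming NumPy is installed. `numpy.linalg.solve` is documented as using LAPACK's `gesv` routine, so the Python below is mostly a thin wrapper around that lineage of code. The matrix and vector are made-up values for illustration.

```python
import numpy as np

# A small made-up linear system: A @ x = b.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])

# The elimination isn't done in Python - NumPy hands the arrays to a
# LAPACK routine (gesv) and returns the result.
x = np.linalg.solve(A, b)

# Sanity check: plugging x back in should reproduce b (up to rounding).
residual = A @ x - b
print(x, residual)

# Uncomment to see which BLAS/LAPACK build this NumPy install is linked
# against. The output varies by platform and by how NumPy was installed.
# np.show_config()
```

Which library actually does the work (reference LAPACK, OpenBLAS, MKL, or something else) depends on your install, which is why the `show_config` call is left as an optional extra rather than part of the example.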
Note that BLAS, LAPACK, and much of the rest of the scientific Fortran ecosystem remain products of the military-industrial complex. Or, at least, the open-source ones do. Organizations like the Department of Energy, DARPA, and the NSF (whose founding mission includes national defense) provide the funding and work hours that keep these packages updated to this day. And this is part of why Fortran is still used - it's hard to beat those decades of optimizations.
Fortran is only half of our story, however. Pandas isn't just about the fast operations - it's also about the syntax.
Okay, Fortran is a step up from assembly, but it's not really what NumPy or Pandas code looks like. For that, we need array-based languages with nice vectorized syntax. There were a few of these kicking around in the 60s and 70s as an example of convergent evolution (including S, the predecessor to R) - but I'm going to talk about APL because it's the weirdest, the most theoretically grounded, and I'm pretty sure Wes McKinney has gone on record saying it was an inspiration for Pandas.
There's a saying that there are two types of programming languages - those that start "from the computer upwards", and those that start "from mathematics downward". Fortran, for all its relative user-friendliness, still makes you think about things like pre-allocating memory. APL came from a mathematician named Kenneth E. Iverson, who came up with a notation for manipulating arrays. Eventually someone wrote up an implementation of that notation in Fortran, and it became a for-real programming language.
APL was terse, expressive, and made matrix operations a first-class citizen. Sure, you had to learn a bunch of weird symbols, needed a custom keyboard, and had to internalize a syntax that includes a concept of "adverbs". But, if all you were doing was manipulating data, then that wasn't so bad - it certainly matched the thought process more cleanly than repeated assignment statements or explicitly writing loops. If you like the Tidyverse, method-chaining in Pandas, or even UNIX-style piping, you like this programming paradigm. It's also closer to mathematical notation, which is nice if you're wired a certain way and/or have that background.
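Here's a rough sketch of that contrast in Pandas terms, assuming pandas is installed. The `trades` table, its column names, and the 6.0 price cutoff are all invented for illustration. Both versions compute the same thing: total dollar value per ticker, ignoring cheap trades.

```python
import pandas as pd

# Made-up data, purely for illustration.
trades = pd.DataFrame({
    "ticker": ["AAA", "BBB", "AAA", "CCC", "BBB"],
    "price": [10.0, 20.0, 11.0, 5.0, 19.0],
    "shares": [100, 50, 200, 400, 150],
})

# Imperative style: an explicit loop with repeated assignment.
totals_loop = {}
for _, row in trades.iterrows():
    if row["price"] > 6.0:
        value = row["price"] * row["shares"]
        totals_loop[row["ticker"]] = totals_loop.get(row["ticker"], 0.0) + value

# Array style: whole-column operations, chained one after another.
totals_chained = (
    trades
    .query("price > 6.0")                                   # filter rows
    .assign(value=lambda df: df["price"] * df["shares"])    # new column
    .groupby("ticker")["value"]                             # group
    .sum()                                                  # aggregate
)

print(totals_loop)
print(totals_chained)
```

The chained version reads top to bottom as "filter, compute, group, sum", which is about as close as Python gets to the APL way of describing what you want done to the whole array at once.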
APL was huge, particularly in finance. Wrapping your mind around imperative code is weird if your training was in mathematical modeling - APL let bankers focus on their models. And this is still the case - today APL itself is a novelty for the most part, but it survives in the form of J, q, and most importantly kdb+, mostly used in finance. And of course, Pandas itself was developed while McKinney was working at the hedge fund AQR Capital Management.
Man, the world's weird, right?