Liberating Data from PDFs with Tabula and Pandas

Making 'open' data more open

Check out the accompanying GitHub repo for this article here.

Technically, the School District of Philadelphia's budget data for the 2019 fiscal year is "open". It is, after all, made available through the district's Open Data portal and is freely available to download.

But just because data is freely available, doesn't mean it's easy to work with. That's what found out when I downloaded the zipped folder, opened it up, and found a heap of PDFs. Joy.

As a member of Code for Philly, I thought of my compatriots who might want to use school district data in their projects. I knew with a bit of data munging, I could provide a data set that would be more easily usable.

Data Liberation

The first hurdle was to find a way to get the data from the PDFs. After a bit Googling, I came across tabula-py, a Python wrapper for Tabula.

Each budget is composed of 5 tables:

  • General information about the school
  • Enrollment information
  • Operating Funded budget allotments
  • Grant Funded budget allotments
  • A summary table of allotment totals

Extracting these tables from a budget with Tabula was as simple as:

tabula.read_pdf(path_to_budget, multiple_tables=True)

Which returned a list of DataFrames, one for each table mentioned above. Perfect!
So, I iterated over all of the files in folder and appended them to a list:

import os
import pandas as pd
import tabula

def read_budgets(directory):
    budgets = []
    for filename in os.listdir(directory):
        budget_tables = tabula.read_pdf(
            f"{directory}/{filename}", 
            multiple_tables=True
        )
        budgets.append(budget_tables)

    return budgets


# this takes a while
budgets = read_budgets("SY1819_School_Budgets")

Initial Cleaning

While this gave me a good start, I knew it wouldn't be that easy to liberate the data from the PDFs. I took a look at each of the DataFrames to see what I'd be working with.

# an example list of budgets
sample_budget = budgets[0]
sample_budget

[    0                  1
     0    Basic Information                NaN
     1     Council District                2nd
     2    Organization Code               1380
     3         School Level  Elementary School
     4         Economically                NaN
     5  Disadvantaged Rate*                NaN
     6                  NaN             83.44%,
                   0     1     2               3
     0           NaN  FY14  FY18  FY19 Projected
     1  Enrollment**   842   640             602,
                                                       0            1            2  \
     0                              Position/Expenditure  FY14 Budget  FY18 Budget   
     1                   Principals/Assistant Principals          2.0          1.0   
     2                      Teachers ‐ Regular Education         30.2         25.0   
     3                      Teachers ‐ Special Education          6.0          2.8   
     4      Counselors/Student Adv./ Soc. Serv. Liaisons          1.2          0.8   
     5                            Nurses/Health Services          0.6          1.0   
     6           Classroom Assistants/Teacher Assistants         11.0          8.0   
     7                                       Secretaries          1.0          1.0   
     8                       Support Services Assistants          0.0          2.0   
     9                             Student Climate Staff          8.0          1.0   
     10                                            Other          0.0          1.2   
     11                                  Total Positions         60.0         43.8   
     12  Supplies/Equipment/Non Full‐Time Salaries/Other      $32,272     $100,159   
     
                   3  
     0   FY19 Budget  
     1           1.0  
     2          24.0  
     3           5.0  
     4           0.1  
     5           1.0  
     6           9.0  
     7           1.0  
     8           5.0  
     9           3.0  
     10          1.0  
     11         50.1  
     12      $97,553  ,
                                                       0            1            2  \
     0                              Position/Expenditure  FY14 Budget  FY18 Budget   
     1                   Principals/Assistant Principals          0.0          0.0   
     2                      Teachers ‐ Regular Education          8.1          8.6   
     3                      Teachers ‐ Special Education          0.0          0.2   
     4      Counselors/Student Adv./ Soc. Serv. Liaisons          0.0          0.2   
     5                            Nurses/Health Services          0.0          0.0   
     6           Classroom Assistants/Teacher Assistants          0.0          0.0   
     7                                       Secretaries          0.0          0.0   
     8                       Support Services Assistants          7.0          5.0   
     9                             Student Climate Staff          0.0          7.0   
     10                                            Other          1.0          0.0   
     11                                  Total Positions         16.1         21.0   
     12  Supplies/Equipment/Non Full‐Time Salaries/Other     $198,454      $19,977   
     
                   3  
     0   FY19 Budget  
     1           0.0  
     2           9.6  
     3           0.0  
     4           1.1  
     5           0.0  
     6           0.0  
     7           0.0  
     8           3.0  
     9           4.0  
     10          0.0  
     11         17.7  
     12      $15,166  ,
                                                        0                     1  \
     0                                                NaN  Position/Expenditure   
     1                                    Total Positions                   NaN   
     2  Total Supplies/Equipment/Non Full‐Time Salarie...                   NaN   
     
                  2            3            4  
     0  FY14 Budget  FY18 Budget  FY19 Budget  
     1         76.1         64.8         67.8  
     2     $230,726     $120,136     $112,719  ]     

After I saw the output, I wrote a function to perform the same cleaning operation for each table in each budget.

For each table below, first I'll introduce the "raw" output that Tabula returned, then I'll show the function that I wrote to fix that output.

Basic Information

Raw Output

basic_information = sample_budget[0] #basic information
basic_information
0 1
0 Basic Information NaN
1 Council District 2nd
2 Organization Code 1380
3 School Level Elementary School
4 Economically NaN
5 Disadvantaged Rate* NaN
6 NaN 83.44%

Cleanup Function

def generate_basic_information_table(df):
    '''Series representing the "basic information" table.'''

    # budgets with a comment near the basic information table, e.g. 2050
    if df.shape[1] == 3:
        df = df.iloc[1:, 1:]
        df = df.reset_index(drop=True)
        df = df.T.reset_index(drop=True).T

    # After that, Tabula did pretty well for this table, but didn't get the
    # Economically Disadvanted Rate quite right.

    df.loc[4] = ["Economically Disadvantaged Rate", df.loc[6, 1]]
    df = df.loc[1:4, :]
    return pd.Series(list(df[1]), index=list(df[0]), name='basic_information')

Cleaned

basic_information = generate_basic_information_table(basic_information)
basic_information
# Basic information output
Council District                                 2nd
Organization Code                               1380
School Level                       Elementary School
Economically Disadvantaged Rate               83.44%
Name: basic_information, dtype: object

Enrollment

Raw Output

# Getting the enrollment output
enrollment = sample_budget[1]
enrollment
0 1 2 3
0 NaN FY14 FY18 FY19 Projected
1 Enrollment** 842 640 602

Cleanup Function

def generate_enrollment_table(df):
    '''returns a series representing the "enrollment" table'''
    # nothing too crazy here
    df = df.T.loc[1:, :]
    df_to_series = pd.Series(list(df[1]), index=list(df[0]), name="enrollment")
    return df_to_series.str.replace(',', '').astype(float)

generate_enrollment_table(enrollment)

Cleaned

# Enrollment table
FY14              842.0
FY18              640.0
FY19 Projected    602.0
Name: enrollment, dtype: float64

Allotments

Luckily, both allotment tables were identical, so I could apply to the same cleanup steps to both.

Raw Output

operating_funded_allotments = sample_budget[2]
operating_funded_allotments
0 1 2 3
0 Position/Expenditure FY14 Budget FY18 Budget FY19 Budget
1 Principals/Assistant Principals 2.0 1.0 1.0
2 Teachers ‐ Regular Education 30.2 25.0 24.0
3 Teachers ‐ Special Education 6.0 2.8 5.0
4 Counselors/Student Adv./ Soc. Serv. Liaisons 1.2 0.8 0.1
5 Nurses/Health Services 0.6 1.0 1.0
6 Classroom Assistants/Teacher Assistants 11.0 8.0 9.0
7 Secretaries 1.0 1.0 1.0
8 Support Services Assistants 0.0 2.0 5.0
9 Student Climate Staff 8.0 1.0 3.0
10 Other 0.0 1.2 1.0
11 Total Positions 60.0 43.8 50.1
12 Supplies/Equipment/Non Full‐Time Salaries/Other $32,272 $100,159 $97,553
grant_funded_allotments = sample_budget[3]
grant_funded_allotments

Cleanup Function

I decided to merge the two allotment tables into one DataFrame while building a MultiIndex to keep things in order. This would allow me to ask some more interesting questions further on down the road.

def generate_allotments_table(df, code, fund):
    '''Multiindex DF of org code, fund, and budget category by budget year'''
    df.columns = df.iloc[0]
    df = df.drop(0)
    df = df.set_index(['Position/Expenditure'])
    df = (df.apply(lambda x: x.str.replace('$', '').str.replace(',', ''))
            .astype(float)
          )
    df.name = fund + "ed_allotments"

    df_index_arrays = [
        [code] * len(df),
        [fund] * len(df),
        list(df.index),
    ]

    df.index = pd.MultiIndex.from_arrays(
        df_index_arrays,
        names=("org_code", "fund", "allotment")
    )
    df.columns = [column[:4] for column in df.columns]

    return df

Cleaned

pd.concat([
    generate_allotments_table(
        operating_funded_allotments, "1410", "operating_fund"
    ),
    generate_allotments_table(
        grant_funded_allotments, "1410", "grant_fund"
    )
])
FY14 FY18 FY19 org_code fund allotment
1410 operating_fund Principals/Assistant Principals 2.0 1.0 1.0
Teachers ‐ Regular Education 30.2 25.0 24.0
Teachers ‐ Special Education 6.0 2.8 5.0
Counselors/Student Adv./ Soc. Serv. Liaisons 1.2 0.8 0.1
Nurses/Health Services 0.6 1.0 1.0
Classroom Assistants/Teacher Assistants 11.0 8.0 9.0
Secretaries 1.0 1.0 1.0
Support Services Assistants 0.0 2.0 5.0
Student Climate Staff 8.0 1.0 3.0
Other 0.0 1.2 1.0
Total Positions 60.0 43.8 50.1
Supplies/Equipment/Non Full‐Time Salaries/Other 32272.0 100159.0 97553.0
grant_fund Principals/Assistant Principals 0.0 0.0 0.0
Teachers ‐ Regular Education 8.1 8.6 9.6
Teachers ‐ Special Education 0.0 0.2 0.0
Counselors/Student Adv./ Soc. Serv. Liaisons 0.0 0.2 1.1
Nurses/Health Services 0.0 0.0 0.0
Classroom Assistants/Teacher Assistants 0.0 0.0 0.0
Secretaries 0.0 0.0 0.0
Support Services Assistants 7.0 5.0 3.0
Student Climate Staff 0.0 7.0 4.0
Other 1.0 0.0 0.0
Total Positions 16.1 21.0 17.7
Supplies/Equipment/Non Full‐Time Salaries/Other 198454.0 19977.0 15166.0

Totals

Since the final "totals" table could be calculated from the data already in the new allotment table, I didn't bother transforming it in any way.

# same data can be derived from the allotments table directly
sample_budget[4]
0 1 2 3 4
0 NaN Position/Expenditure FY14 Budget FY18 Budget FY19 Budget
1 Total Positions NaN 76.1 64.8 67.8
2 Total Supplies/Equipment/Non Full‐Time Salarie... NaN $230,726 $120,136 $112,719

Once I figured out what transformations I needed for each table, I combined them into a function so that, given a list of DataFames from Tabula, I'd get those same tables back neatly formatted.

def generate_all_tables(list_of_df):
    basic_information = generate_basic_information_table(list_of_df[0])
    enrollment = generate_enrollment_table(list_of_df[1])

    operating_funded_allotments = generate_allotments_table(
        list_of_df[2],
        basic_information['Organization Code'],
        'operating_fund'
    )
    grant_funded_allotments = generate_allotments_table(
        list_of_df[3],
        basic_information['Organization Code'],
        'grant_fund'
    )
    operating_and_grant_funded_allotments = pd.concat(
        [operating_funded_allotments, grant_funded_allotments]
    )

    return basic_information, enrollment, operating_and_grant_funded_allotments

basic_information, enrollment, operating_and_grant_funded_allotments = 
generate_all_tables(sample_budget)

Aggregation Time

Now that I had cleaned the tables that Tabula produced, it was time to combine them into some aggregated tables.

First I wrote a function that would output a Series (representing one row) of information from all tables for a given school in a given fiscal year.

def generate_row(budget_year, basic_information, allotments, enrollment):
    '''School budget series for fiscal year.'''
 	# budget_year should be FY14, FY18, or FY19
    
    flattened_allotments = pd.DataFrame(allotments.to_records())
    flattened_allotments.index = flattened_allotments['fund'] +": " + flattened_allotments['allotment']
    flattened_allotments = flattened_allotments.drop(
        ['fund','allotment'], axis=1
    )
    budget_allotments = flattened_allotments[budget_year]
    
    enrollment_label = budget_year + ' Projected' if budget_year == "FY19" else budget_year
    enrollment_index = 'projected_enrollment' if budget_year == "FY19" else 'enrollment'
    enrollment_row = pd.Series(
        enrollment[enrollment_label], index=[enrollment_index]
    )
    
    return pd.concat(
            [basic_information,budget_allotments,enrollment_row],
            axis=0
           )

generate_row("FY18", basic_information,
             operating_and_grant_funded_allotments, enrollment)
# Output
Council District 2 nd
Organization Code 1380
School Level Elementary School
Economically Disadvantaged Rate 83.44 %
operating_fund: Principals / Assistant Principal.1
operating_fund: Teachers‐ Regular Education 25
operating_fund: Teachers‐ Special Education 2.8
operating_fund: Counselors / Student Adv. / Soc.Serv.Liaisons 0.8
operating_fund: Nurses / Health Services 1
operating_fund: Classroom Assistants / Teacher Assistants 8
operating_fund: Secretaries 1
operating_fund: Support Services Assistants 2
operating_fund: Student Climate Staff 1
operating_fund: Other 1.2
operating_fund: Total Positions 43.8
operating_fund: Supplies / Equipment / Non Full‐ Time Salaries / Other 100159
grant_fund: Principals / Assistant Principals 0
grant_fund: Teachers‐ Regular Education 8.6
grant_fund: Teachers‐ Special Education 0.2
grant_fund: Counselors / Student Adv. / Soc.Serv.Liaisons 0.2
grant_fund: Nurses / Health Services 0
grant_fund: Classroom Assistants / Teacher Assistants 0
grant_fund: Secretaries 0
grant_fund: Support Services Assistants 5
grant_fund: Student Climate Staff 7
grant_fund: Other 0
grant_fund: Total Positions 21
grant_fund: Supplies / Equipment / Non Full‐ Time Salaries / Other 19977
enrollment 640
dtype: object

Then, I applied this function to each list of budgets in the collection and compiled them into a DataFrame.

def generate_tabular_budget(budget_year, budgets):
    '''generate a tabular budget summary for a budget year. Budget year must be FY14,
    FY18, or FY19. Enrollemnt values for budget year 2019 are projected.'''
    school_budget_series = []
    for budget_tables in budgets:
        basic_information, enrollment, operating_and_grant_funded_allotments = generate_all_tables(
            budget_tables
        )
        budget_row = generate_row(
            budget_year, basic_information, operating_and_grant_funded_allotments, enrollment
        )
        budget_row = budget_row
        school_budget_series.append(budget_row)

    return pd.DataFrame(school_budget_series)


fy14 = generate_tabular_budget('FY14', budgets)
fy14['budget_year'] = "FY14"
fy14.to_csv("output/combined_fy14.csv")

fy18 = generate_tabular_budget('FY18', budgets)
fy18['budget_year'] = "FY18"
fy18.to_csv("output/combined_fy18.csv")

fy19 = generate_tabular_budget('FY19', budgets)
fy19['budget_year'] = "FY19"
fy19.to_csv("output/combined_fy19.csv")


combined_tabular_budgets = pd.concat([fy14, fy18, fy19])
combined_tabular_budgets.to_csv("output/all_budgets_tabular.csv")

Finally, I wanted to output a CSV that would preserve some of the multi-indexed nature of the allotment tables. Here's what I wrote for that.

def generate_hierarchical_budget(budgets):
    school_budgets_dfs = []
    for budget_tables in budgets:
        school_budgets_dfs.append(operating_and_grant_funded_allotments)
    return pd.concat(school_budgets_dfs)

hierarchical_budget = generate_hierarchical_budget(budgets)
hierarchical_budget.to_csv("output/all_budgets_hierarchical.csv")

hierarchical_budget
FY14 FY18 FY19 org_code fund allotment
1380 operating_fund Principals/Assistant Principals 2.0 1.0 1.0
Teachers ‐ Regular Education 30.2 25.0 24.0
Teachers ‐ Special Education 6.0 2.8 5.0
Counselors/Student Adv./ Soc. Serv. Liaisons 1.2 0.8 0.1
Nurses/Health Services 0.6 1.0 1.0
Classroom Assistants/Teacher Assistants 11.0 8.0 9.0
Secretaries 1.0 1.0 1.0
Support Services Assistants 0.0 2.0 5.0
Student Climate Staff 8.0 1.0 3.0
Other 0.0 1.2 1.0
Total Positions 60.0 43.8 50.1
Supplies/Equipment/Non Full‐Time Salaries/Other 32272.0 100159.0 97553.0
grant_fund Principals/Assistant Principals 0.0 0.0 0.0
Teachers ‐ Regular Education 8.1 8.6 9.6
Teachers ‐ Special Education 0.0 0.2 0.0
Counselors/Student Adv./ Soc. Serv. Liaisons 0.0 0.2 1.1
Nurses/Health Services 0.0 0.0 0.0
Classroom Assistants/Teacher Assistants 0.0 0.0 0.0
Secretaries 0.0 0.0 0.0
Support Services Assistants 7.0 5.0 3.0
Student Climate Staff 0.0 7.0 4.0
Other 1.0 0.0 0.0
Total Positions 16.1 21.0 17.7
Supplies/Equipment/Non Full‐Time Salaries/Other 198454.0 19977.0 15166.0
operating_fund Principals/Assistant Principals 2.0 1.0 1.0
Teachers ‐ Regular Education 30.2 25.0 24.0
Teachers ‐ Special Education 6.0 2.8 5.0
Counselors/Student Adv./ Soc. Serv. Liaisons 1.2 0.8 0.1
Nurses/Health Services 0.6 1.0 1.0
Classroom Assistants/Teacher Assistants 11.0 8.0 9.0
grant_fund Secretaries 0.0 0.0 0.0
Support Services Assistants 7.0 5.0 3.0
Student Climate Staff 0.0 7.0 4.0
Other 1.0 0.0 0.0
Total Positions 16.1 21.0 17.7
Supplies/Equipment/Non Full‐Time Salaries/Other 198454.0 19977.0 15166.0
operating_fund Principals/Assistant Principals 2.0 1.0 1.0
Teachers ‐ Regular Education 30.2 25.0 24.0
Teachers ‐ Special Education 6.0 2.8 5.0
Counselors/Student Adv./ Soc. Serv. Liaisons 1.2 0.8 0.1
Nurses/Health Services 0.6 1.0 1.0
Classroom Assistants/Teacher Assistants 11.0 8.0 9.0
Secretaries 1.0 1.0 1.0
Support Services Assistants 0.0 2.0 5.0
Student Climate Staff 8.0 1.0 3.0
Other 0.0 1.2 1.0
Total Positions 60.0 43.8 50.1
Supplies/Equipment/Non Full‐Time Salaries/Other 32272.0 100159.0 97553.0
grant_fund Principals/Assistant Principals 0.0 0.0 0.0
Teachers ‐ Regular Education 8.1 8.6 9.6
Teachers ‐ Special Education 0.0 0.2 0.0
Counselors/Student Adv./ Soc. Serv. Liaisons 0.0 0.2 1.1
Nurses/Health Services 0.0 0.0 0.0
Classroom Assistants/Teacher Assistants 0.0 0.0 0.0
Secretaries 0.0 0.0 0.0
Support Services Assistants 7.0 5.0 3.0
Student Climate Staff 0.0 7.0 4.0
Other 1.0 0.0 0.0
Total Positions 16.1 21.0 17.7
Supplies/Equipment/Non Full‐Time Salaries/Other 198454.0 19977.0 15166.0
5160 rows × 3 columns

This makes it easier to aggregate in interesting ways

hierarchical_budget.groupby('allotment').sum()
FY14 FY18 FY19 allotment
Classroom Assistants/Teacher Assistants 2365.0 1720.0 1935.0
Counselors/Student Adv./ Soc. Serv. Liaisons 258.0 215.0 258.0
Nurses/Health Services 129.0 215.0 215.0
Other 215.0 258.0 215.0
Principals/Assistant Principals 430.0 215.0 215.0
Secretaries 215.0 215.0 215.0
Student Climate Staff 1720.0 1720.0 1505.0
Supplies/Equipment/Non Full‐Time Salaries/Other 49606090.0 25829240.0 24234585.0
Support Services Assistants 1505.0 1505.0 1720.0
Teachers ‐ Regular Education 8234.5 7224.0 7224.0
Teachers ‐ Special Education 1290.0 645.0 1075.0
Total Positions 16361.5 13932.0 14577.0

More Cleaning to be Done

My work here is done. I saved the data from their not-so-accessible PDF prisons. But now it's time for someone with some domain-specific knowledge to make it actionable.

The biggest weakness with the data in its current form is that there is some amount of ambiguity as to what the different allotments numbers represent in real-dollar amounts. Only the Supplies/Equipment/Non Full‐Time Salaries/Other allotment category came in currency notation – the rest of the allotments were represented as simple decimal amounts with no context to help interpret what they mean. Do they represent FTE? Dollar amounts in scientific notation? I'm not sure, but I hope by handing this work off to the right people, these questions and more can be answered more easily thanks to a cleaner, more accessible data set.

Author image
Philly Website
Loves Python; loves pandas; leaves every project more Pythonic than he found it. Shares exceptionally attractive D3 charts for #WCW.

Loves Python; loves pandas; leaves every project more Pythonic than he found it. Shares exceptionally attractive D3 charts for #WCW.