Parse Data from PDFs with Tabula and Pandas

Check out the accompanying GitHub repo for this article here.

Technically, the School District of Philadelphia's budget data for the 2019 fiscal year is "open". It is, after all, made available through the district's Open Data portal and is freely available to download.

But just because data is freely available, doesn't mean it's easy to work with. That's what found out when I downloaded the zipped folder, opened it up, and found a heap of PDFs. Joy.

As a member of Code for Philly, I thought of my compatriots who might want to use school district data in their projects. I knew with a bit of data munging, I could provide a data set that would be more easily usable.

Data Liberation

The first hurdle was to find a way to get the data from the PDFs. After a bit Googling, I came across tabula-py, a Python wrapper for Tabula.

Each budget is composed of 5 tables:

General information about the school
Enrollment information
Operating Funded budget allotments
Grant Funded budget allotments
A summary table of allotment totals

Extracting these tables from a budget with Tabula was as simple as:

import tabula


tabula.read_pdf(
    path/to/budget.pdf,
    multiple_tables=True
)

Parse PDF data with Tabula

Which returned a list of DataFrames, one for each table mentioned above. Perfect!
So, I iterated over all of the files in folder and appended them to a list:

import os
import pandas as pd
import tabula


def read_budgets(directory):
    budgets = []
    for filename in os.listdir(directory):
        budget_tables = tabula.read_pdf(
            f"{directory}/{filename}", 
            multiple_tables=True
        )
        budgets.append(budget_tables)
    return budgets


# this takes a while
budgets = read_budgets("SY1819_School_Budgets")

Parse all PDFs in a given directory

Initial Cleaning

While this gave me a good start, I knew it wouldn't be that easy to liberate the data from the PDFs. I took a look at each of the DataFrames to see what I'd be working with.

# an example list of budgets
sample_budget = budgets[0]
sample_budget

Output:

[    0                  1
     0    Basic Information                NaN
     1     Council District                2nd
     2    Organization Code               1380
     3         School Level  Elementary School
     4         Economically                NaN
     5  Disadvantaged Rate*                NaN
     6                  NaN             83.44%,
                   0     1     2               3
     0           NaN  FY14  FY18  FY19 Projected
     1  Enrollment**   842   640             602,
                                                       0            1            2  \
     0                              Position/Expenditure  FY14 Budget  FY18 Budget   
     1                   Principals/Assistant Principals          2.0          1.0   
     2                      Teachers ‐ Regular Education         30.2         25.0   
     3                      Teachers ‐ Special Education          6.0          2.8   
     4      Counselors/Student Adv./ Soc. Serv. Liaisons          1.2          0.8   
     5                            Nurses/Health Services          0.6          1.0   
     6           Classroom Assistants/Teacher Assistants         11.0          8.0   
     7                                       Secretaries          1.0          1.0   
     8                       Support Services Assistants          0.0          2.0   
     9                             Student Climate Staff          8.0          1.0   
     10                                            Other          0.0          1.2   
     11                                  Total Positions         60.0         43.8   
     12  Supplies/Equipment/Non Full‐Time Salaries/Other      $32,272     $100,159   
     
                   3  
     0   FY19 Budget  
     1           1.0  
     2          24.0  
     3           5.0  
     4           0.1  
     5           1.0  
     6           9.0  
     7           1.0  
     8           5.0  
     9           3.0  
     10          1.0  
     11         50.1  
     12      $97,553  ,
                                                       0            1            2  \
     0                              Position/Expenditure  FY14 Budget  FY18 Budget   
     1                   Principals/Assistant Principals          0.0          0.0   
     2                      Teachers ‐ Regular Education          8.1          8.6   
     3                      Teachers ‐ Special Education          0.0          0.2   
     4      Counselors/Student Adv./ Soc. Serv. Liaisons          0.0          0.2   
     5                            Nurses/Health Services          0.0          0.0   
     6           Classroom Assistants/Teacher Assistants          0.0          0.0   
     7                                       Secretaries          0.0          0.0   
     8                       Support Services Assistants          7.0          5.0   
     9                             Student Climate Staff          0.0          7.0   
     10                                            Other          1.0          0.0   
     11                                  Total Positions         16.1         21.0   
     12  Supplies/Equipment/Non Full‐Time Salaries/Other     $198,454      $19,977   
     
                   3  
     0   FY19 Budget  
     1           0.0  
     2           9.6  
     3           0.0  
     4           1.1  
     5           0.0  
     6           0.0  
     7           0.0  
     8           3.0  
     9           4.0  
     10          0.0  
     11         17.7  
     12      $15,166  ,
                                                        0                     1  \
     0                                                NaN  Position/Expenditure   
     1                                    Total Positions                   NaN   
     2  Total Supplies/Equipment/Non Full‐Time Salarie...                   NaN   
     
                  2            3            4  
     0  FY14 Budget  FY18 Budget  FY19 Budget  
     1         76.1         64.8         67.8  
     2     $230,726     $120,136     $112,719  ]

After I saw the output, I wrote a function to perform the same cleaning operation for each table in each budget.

For each table below, first I'll introduce the "raw" output that Tabula returned, then I'll show the function that I wrote to fix that output.

Basic Information

basic_information = sample_budget[0] #basic information
basic_information

Output:

	0	1
0	Basic Information	NaN
1	Council District	2nd
2	Organization Code	1380
3	School Level	Elementary School
4	Economically	NaN
5	Disadvantaged Rate*	NaN
6	NaN	83.44%

Cleanup Function

...

def generate_basic_information_table(df):
    '''Series representing the "basic information" table.'''

    # budgets with a comment near the basic information table, e.g. 2050
    if df.shape[1] == 3:
        df = df.iloc[1:, 1:]
        df = df.reset_index(drop=True)
        df = df.T.reset_index(drop=True).T

    # Tabula did pretty well for this table, but didn't get the
    # Economically Disadvanted Rate quite right.

    df.loc[4] = ["Economically Disadvantaged Rate", df.loc[6, 1]]
    df = df.loc[1:4, :]
    return pd.Series(
        list(df[1]),
        index=list(df[0]),
        name='basic_information'
    )

Cleaned:

basic_information = generate_basic_information_table(basic_information)
basic_information

Output:

# Basic information output
Council District                                 2nd
Organization Code                               1380
School Level                       Elementary School
Economically Disadvantaged Rate               83.44%
Name: basic_information, dtype: object

Enrollment

# Getting the enrollment output
enrollment = sample_budget[1]
enrollment

Output:

	0	1	2	3
0	NaN	FY14	FY18	FY19 Projected
1	Enrollment**	842	640	602

Cleanup Function

...


def generate_enrollment_table(df):
    '''returns a series representing the "enrollment" table'''
    df = df.T.loc[1:, :]
    df_to_series = pd.Series(
        list(df[1]),
        index=list(df[0]),
        name="enrollment"
    )
    return df_to_series.str.replace(',', '').astype(float)

generate_enrollment_table(enrollment)

Cleaned:

# Enrollment table
FY14              842.0
FY18              640.0
FY19 Projected    602.0
Name: enrollment, dtype: float64

Allotments

Luckily, both allotment tables were identical, so I could apply to the same cleanup steps to both.

operating_funded_allotments = sample_budget[2]
operating_funded_allotments

Output:

	0	1	2	3
0	Position/Expenditure	FY14 Budget	FY18 Budget	FY19 Budget
1	Principals/Assistant Principals	2.0	1.0	1.0
2	Teachers ‐ Regular Education	30.2	25.0	24.0
3	Teachers ‐ Special Education	6.0	2.8	5.0
4	Counselors/Student Adv./ Soc. Serv. Liaisons	1.2	0.8	0.1
5	Nurses/Health Services	0.6	1.0	1.0
6	Classroom Assistants/Teacher Assistants	11.0	8.0	9.0
7	Secretaries	1.0	1.0	1.0
8	Support Services Assistants	0.0	2.0	5.0
9	Student Climate Staff	8.0	1.0	3.0
10	Other	0.0	1.2	1.0
11	Total Positions	60.0	43.8	50.1
12	Supplies/Equipment/Non Full‐Time Salaries/Other	$32,272	$100,159	$97,553

grant_funded_allotments = sample_budget[3]
grant_funded_allotments

Cleanup Function

I decided to merge the two allotment tables into one DataFrame while building a MultiIndex to keep things in order. This would allow me to ask some more interesting questions further on down the road.

...


def generate_allotments_table(df, code, fund):
    """Multiindex of org, fund, and budget category by year."""
    df.columns = df.iloc[0]
    df = df.drop(0)
    df = df.set_index(['Position/Expenditure'])
    df = (df.apply(lambda x: x.str.replace('$', '').str.replace(',', ''))
            .astype(float)
          )
    df.name = fund + "ed_allotments"

    df_index_arrays = [
        [code] * len(df),
        [fund] * len(df),
        list(df.index),
    ]

    df.index = pd.MultiIndex.from_arrays(
        df_index_arrays,
        names=("org_code", "fund", "allotment")
    )
    df.columns = [column[:4] for column in df.columns]

    return df

Cleaned:

pd.concat([
    generate_allotments_table(
        operating_funded_allotments, "1410", "operating_fund"
    ),
    generate_allotments_table(
        grant_funded_allotments, "1410", "grant_fund"
    )
])

FY14	FY18	FY19	org_code	fund	allotment
1410	operating_fund	Principals/Assistant Principals	2.0	1.0	1.0
		Teachers ‐ Regular Education	30.2	25.0	24.0
		Teachers ‐ Special Education	6.0	2.8	5.0
		Counselors/Student Adv./ Soc. Serv. Liaisons	1.2	0.8	0.1
		Nurses/Health Services	0.6	1.0	1.0
		Classroom Assistants/Teacher Assistants	11.0	8.0	9.0
		Secretaries	1.0	1.0	1.0
		Support Services Assistants	0.0	2.0	5.0
		Student Climate Staff	8.0	1.0	3.0
		Other	0.0	1.2	1.0
		Total Positions	60.0	43.8	50.1
		Supplies/Equipment/Non Full‐Time Salaries/Other	32272.0	100159.0	97553.0
	grant_fund	Principals/Assistant Principals	0.0	0.0	0.0
		Teachers ‐ Regular Education	8.1	8.6	9.6
		Teachers ‐ Special Education	0.0	0.2	0.0
		Counselors/Student Adv./ Soc. Serv. Liaisons	0.0	0.2	1.1
		Nurses/Health Services	0.0	0.0	0.0
		Classroom Assistants/Teacher Assistants	0.0	0.0	0.0
		Secretaries	0.0	0.0	0.0
		Support Services Assistants	7.0	5.0	3.0
		Student Climate Staff	0.0	7.0	4.0
		Other	1.0	0.0	0.0
		Total Positions	16.1	21.0	17.7
		Supplies/Equipment/Non Full‐Time Salaries/Other	198454.0	19977.0	15166.0

Totals

Since the final "totals" table could be calculated from the data already in the new allotment table, I didn't bother transforming it in any way.

# same data can be derived from the allotments table directly
sample_budget[4]

	0	1	2	3	4
0	NaN	Position/Expenditure	FY14 Budget	FY18 Budget	FY19 Budget
1	Total Positions	NaN	76.1	64.8	67.8
2	Total Supplies/Equipment/Non Full‐Time Salarie...	NaN	$230,726	$120,136	$112,719

Once I figured out what transformations I needed for each table, I combined them into a function so that, given a list of DataFames from Tabula, I'd get those same tables back neatly formatted.

...

def generate_all_tables(list_of_df):
    basic_information = generate_basic_information_table(list_of_df[0])
    enrollment = generate_enrollment_table(list_of_df[1])

    operating_funded_allotments = generate_allotments_table(
        list_of_df[2],
        basic_information['Organization Code'],
        'operating_fund'
    )
    grant_funded_allotments = generate_allotments_table(
        list_of_df[3],
        basic_information['Organization Code'],
        'grant_fund'
    )
    allotments = pd.concat(
        [operating_funded_allotments, grant_funded_allotments]
    )

    return basic_information, enrollment, operating_and_grant_funded_allotments

basic_information, enrollment, allotments = generate_all_tables(sample_budget)

Aggregation Time

Now that I had cleaned the tables that Tabula produced, it was time to combine them into some aggregated tables.

First I wrote a function that would output a Series (representing one row) of information from all tables for a given school in a given fiscal year.

...


def generate_row(budget_year, basic_information, allotments, enrollment):
    """
    School budget series for fiscal year.
    
    budget_year should be FY14, FY18, or FY19
    """    
    flattened_allotments = pd.DataFrame(allotments.to_records())
    flattened_allotments.index = flattened_allotments['fund'] +": " + flattened_allotments['allotment']
    flattened_allotments = flattened_allotments.drop(
        ['fund','allotment'], axis=1
    )
    budget_allotments = flattened_allotments[budget_year]
    
    enrollment_label = budget_year + ' Projected' if budget_year == "FY19" else budget_year
    enrollment_index = 'projected_enrollment' if budget_year == "FY19" else 'enrollment'
    enrollment_row = pd.Series(
        enrollment[enrollment_label], index=[enrollment_index]
    )
    
    return pd.concat(
        [basic_information,budget_allotments,enrollment_row],
        axis=0
    )

generate_row(
    "FY18", 
    basic_information,
    operating_and_grant_funded_allotments,
    enrollment
)

Output:

Council District 2 nd
Organization Code 1380
School Level Elementary School
Economically Disadvantaged Rate 83.44 %
operating_fund: Principals / Assistant Principal.1
operating_fund: Teachers‐ Regular Education 25
operating_fund: Teachers‐ Special Education 2.8
operating_fund: Counselors / Student Adv. / Soc.Serv.Liaisons 0.8
operating_fund: Nurses / Health Services 1
operating_fund: Classroom Assistants / Teacher Assistants 8
operating_fund: Secretaries 1
operating_fund: Support Services Assistants 2
operating_fund: Student Climate Staff 1
operating_fund: Other 1.2
operating_fund: Total Positions 43.8
operating_fund: Supplies / Equipment / Non Full‐ Time Salaries / Other 100159
grant_fund: Principals / Assistant Principals 0
grant_fund: Teachers‐ Regular Education 8.6
grant_fund: Teachers‐ Special Education 0.2
grant_fund: Counselors / Student Adv. / Soc.Serv.Liaisons 0.2
grant_fund: Nurses / Health Services 0
grant_fund: Classroom Assistants / Teacher Assistants 0
grant_fund: Secretaries 0
grant_fund: Support Services Assistants 5
grant_fund: Student Climate Staff 7
grant_fund: Other 0
grant_fund: Total Positions 21
grant_fund: Supplies / Equipment / Non Full‐ Time Salaries / Other 19977
enrollment 640
dtype: object

Then, I applied this function to each list of budgets in the collection and compiled them into a DataFrame.

...

def generate_tabular_budget(budget_year, budgets):
    """
    Generate a tabular budget summary for a budget year.
    
    Budget year must be FY14, FY18, or FY19.
    Enrollemnt values for budget year 2019 are projected.
    """
    school_budget_series = []
    for budget_tables in budgets:
        basic_information, enrollment, operating_and_grant_funded_allotments = generate_all_tables(
            budget_tables
        )
        budget_row = generate_row(
            budget_year, basic_information, operating_and_grant_funded_allotments, enrollment
        )
        budget_row = budget_row
        school_budget_series.append(budget_row)

    return pd.DataFrame(school_budget_series)


fy14 = generate_tabular_budget('FY14', budgets)
fy14['budget_year'] = "FY14"
fy14.to_csv("output/combined_fy14.csv")

fy18 = generate_tabular_budget('FY18', budgets)
fy18['budget_year'] = "FY18"
fy18.to_csv("output/combined_fy18.csv")

fy19 = generate_tabular_budget('FY19', budgets)
fy19['budget_year'] = "FY19"
fy19.to_csv("output/combined_fy19.csv")


combined_tabular_budgets = pd.concat([fy14, fy18, fy19])
combined_tabular_budgets.to_csv("output/all_budgets_tabular.csv")

Finally, I wanted to output a CSV that would preserve some of the multi-indexed nature of the allotment tables. Here's what I wrote for that.

...

def generate_hierarchical_budget(budgets):
    school_budgets_dfs = []
    for budget_tables in budgets:
        school_budgets_dfs.append(operating_and_grant_funded_allotments)
    return pd.concat(school_budgets_dfs)

hierarchical_budget = generate_hierarchical_budget(budgets)
hierarchical_budget.to_csv("output/all_budgets_hierarchical.csv")

hierarchical_budget

Output:

FY14	FY18	FY19	org_code	fund	allotment
1380	operating_fund	Principals/Assistant Principals	2.0	1.0	1.0
		Teachers ‐ Regular Education	30.2	25.0	24.0
		Teachers ‐ Special Education	6.0	2.8	5.0
		Counselors/Student Adv./ Soc. Serv. Liaisons	1.2	0.8	0.1
		Nurses/Health Services	0.6	1.0	1.0
		Classroom Assistants/Teacher Assistants	11.0	8.0	9.0
		Secretaries	1.0	1.0	1.0
		Support Services Assistants	0.0	2.0	5.0
		Student Climate Staff	8.0	1.0	3.0
		Other	0.0	1.2	1.0
		Total Positions	60.0	43.8	50.1
		Supplies/Equipment/Non Full‐Time Salaries/Other	32272.0	100159.0	97553.0
	grant_fund	Principals/Assistant Principals	0.0	0.0	0.0
		Teachers ‐ Regular Education	8.1	8.6	9.6
		Teachers ‐ Special Education	0.0	0.2	0.0
		Counselors/Student Adv./ Soc. Serv. Liaisons	0.0	0.2	1.1
		Nurses/Health Services	0.0	0.0	0.0
		Classroom Assistants/Teacher Assistants	0.0	0.0	0.0
		Secretaries	0.0	0.0	0.0
		Support Services Assistants	7.0	5.0	3.0
		Student Climate Staff	0.0	7.0	4.0
		Other	1.0	0.0	0.0
		Total Positions	16.1	21.0	17.7
		Supplies/Equipment/Non Full‐Time Salaries/Other	198454.0	19977.0	15166.0
	operating_fund	Principals/Assistant Principals	2.0	1.0	1.0
		Teachers ‐ Regular Education	30.2	25.0	24.0
		Teachers ‐ Special Education	6.0	2.8	5.0
		Counselors/Student Adv./ Soc. Serv. Liaisons	1.2	0.8	0.1
		Nurses/Health Services	0.6	1.0	1.0
		Classroom Assistants/Teacher Assistants	11.0	8.0	9.0
	grant_fund	Secretaries	0.0	0.0	0.0
		Support Services Assistants	7.0	5.0	3.0
		Student Climate Staff	0.0	7.0	4.0
		Other	1.0	0.0	0.0
		Total Positions	16.1	21.0	17.7
		Supplies/Equipment/Non Full‐Time Salaries/Other	198454.0	19977.0	15166.0
	operating_fund	Principals/Assistant Principals	2.0	1.0	1.0
		Teachers ‐ Regular Education	30.2	25.0	24.0
		Teachers ‐ Special Education	6.0	2.8	5.0
		Counselors/Student Adv./ Soc. Serv. Liaisons	1.2	0.8	0.1
		Nurses/Health Services	0.6	1.0	1.0
		Classroom Assistants/Teacher Assistants	11.0	8.0	9.0
		Secretaries	1.0	1.0	1.0
		Support Services Assistants	0.0	2.0	5.0
		Student Climate Staff	8.0	1.0	3.0
		Other	0.0	1.2	1.0
		Total Positions	60.0	43.8	50.1
		Supplies/Equipment/Non Full‐Time Salaries/Other	32272.0	100159.0	97553.0
	grant_fund	Principals/Assistant Principals	0.0	0.0	0.0
		Teachers ‐ Regular Education	8.1	8.6	9.6
		Teachers ‐ Special Education	0.0	0.2	0.0
		Counselors/Student Adv./ Soc. Serv. Liaisons	0.0	0.2	1.1
		Nurses/Health Services	0.0	0.0	0.0
		Classroom Assistants/Teacher Assistants	0.0	0.0	0.0
		Secretaries	0.0	0.0	0.0
		Support Services Assistants	7.0	5.0	3.0
		Student Climate Staff	0.0	7.0	4.0
		Other	1.0	0.0	0.0
		Total Positions	16.1	21.0	17.7
		Supplies/Equipment/Non Full‐Time Salaries/Other	198454.0	19977.0	15166.0

5160 rows × 3 columns

This makes it easier to aggregate in interesting ways:

hierarchical_budget.groupby('allotment').sum()

Output:

FY14	FY18	FY19	allotment
Classroom Assistants/Teacher Assistants	2365.0	1720.0	1935.0
Counselors/Student Adv./ Soc. Serv. Liaisons	258.0	215.0	258.0
Nurses/Health Services	129.0	215.0	215.0
Other	215.0	258.0	215.0
Principals/Assistant Principals	430.0	215.0	215.0
Secretaries	215.0	215.0	215.0
Student Climate Staff	1720.0	1720.0	1505.0
Supplies/Equipment/Non Full‐Time Salaries/Other	49606090.0	25829240.0	24234585.0
Support Services Assistants	1505.0	1505.0	1720.0
Teachers ‐ Regular Education	8234.5	7224.0	7224.0
Teachers ‐ Special Education	1290.0	645.0	1075.0
Total Positions	16361.5	13932.0	14577.0

More Cleaning to be Done

My work here is done. I saved the data from their not-so-accessible PDF prisons. But now it's time for someone with some domain-specific knowledge to make it actionable.

The biggest weakness with the data in its current form is that there is some amount of ambiguity as to what the different allotments numbers represent in real-dollar amounts. Only the Supplies/Equipment/Non Full‐Time Salaries/Other allotment category came in currency notation – the rest of the allotments were represented as simple decimal amounts with no context to help interpret what they mean. Do they represent FTE? Dollar amounts in scientific notation? I'm not sure, but I hope by handing this work off to the right people, these questions and more can be answered more easily thanks to a cleaner, more accessible data set.

Data Liberation

Initial Cleaning

Basic Information

Cleanup Function

Enrollment

Cleanup Function

Allotments

Cleanup Function

Totals

Aggregation Time

More Cleaning to be Done

Related Posts