Liberating Data from PDFs with Tabula and Pandas

Making 'open' data more open.

Liberating Data from PDFs with Tabula and Pandas

    Check out the accompanying GitHub repo for this article here.

    Technically, the School District of Philadelphia's budget data for the 2019 fiscal year is "open". It is, after all, made available through the district's Open Data portal and is freely available to download.

    But just because data is freely available, doesn't mean it's easy to work with. That's what found out when I downloaded the zipped folder, opened it up, and found a heap of PDFs. Joy.

    As a member of Code for Philly, I thought of my compatriots who might want to use school district data in their projects. I knew with a bit of data munging, I could provide a data set that would be more easily usable.

    Data Liberation

    The first hurdle was to find a way to get the data from the PDFs. After a bit Googling, I came across tabula-py, a Python wrapper for Tabula.

    Each budget is composed of 5 tables:

    • General information about the school
    • Enrollment information
    • Operating Funded budget allotments
    • Grant Funded budget allotments
    • A summary table of allotment totals

    Extracting these tables from a budget with Tabula was as simple as:

    tabula.read_pdf(path_to_budget, multiple_tables=True)
    

    Which returned a list of DataFrames, one for each table mentioned above. Perfect!
    So, I iterated over all of the files in folder and appended them to a list:

    import os
    import pandas as pd
    import tabula
    
    def read_budgets(directory):
        budgets = []
        for filename in os.listdir(directory):
            budget_tables = tabula.read_pdf(
                f"{directory}/{filename}", 
                multiple_tables=True
            )
            budgets.append(budget_tables)
    
        return budgets
    
    
    # this takes a while
    budgets = read_budgets("SY1819_School_Budgets")
    

    Initial Cleaning

    While this gave me a good start, I knew it wouldn't be that easy to liberate the data from the PDFs. I took a look at each of the DataFrames to see what I'd be working with.

    # an example list of budgets
    sample_budget = budgets[0]
    sample_budget  
    
    [    0                  1
         0    Basic Information                NaN
         1     Council District                2nd
         2    Organization Code               1380
         3         School Level  Elementary School
         4         Economically                NaN
         5  Disadvantaged Rate*                NaN
         6                  NaN             83.44%,
                       0     1     2               3
         0           NaN  FY14  FY18  FY19 Projected
         1  Enrollment**   842   640             602,
                                                           0            1            2  \
         0                              Position/Expenditure  FY14 Budget  FY18 Budget   
         1                   Principals/Assistant Principals          2.0          1.0   
         2                      Teachers ‐ Regular Education         30.2         25.0   
         3                      Teachers ‐ Special Education          6.0          2.8   
         4      Counselors/Student Adv./ Soc. Serv. Liaisons          1.2          0.8   
         5                            Nurses/Health Services          0.6          1.0   
         6           Classroom Assistants/Teacher Assistants         11.0          8.0   
         7                                       Secretaries          1.0          1.0   
         8                       Support Services Assistants          0.0          2.0   
         9                             Student Climate Staff          8.0          1.0   
         10                                            Other          0.0          1.2   
         11                                  Total Positions         60.0         43.8   
         12  Supplies/Equipment/Non Full‐Time Salaries/Other      $32,272     $100,159   
         
                       3  
         0   FY19 Budget  
         1           1.0  
         2          24.0  
         3           5.0  
         4           0.1  
         5           1.0  
         6           9.0  
         7           1.0  
         8           5.0  
         9           3.0  
         10          1.0  
         11         50.1  
         12      $97,553  ,
                                                           0            1            2  \
         0                              Position/Expenditure  FY14 Budget  FY18 Budget   
         1                   Principals/Assistant Principals          0.0          0.0   
         2                      Teachers ‐ Regular Education          8.1          8.6   
         3                      Teachers ‐ Special Education          0.0          0.2   
         4      Counselors/Student Adv./ Soc. Serv. Liaisons          0.0          0.2   
         5                            Nurses/Health Services          0.0          0.0   
         6           Classroom Assistants/Teacher Assistants          0.0          0.0   
         7                                       Secretaries          0.0          0.0   
         8                       Support Services Assistants          7.0          5.0   
         9                             Student Climate Staff          0.0          7.0   
         10                                            Other          1.0          0.0   
         11                                  Total Positions         16.1         21.0   
         12  Supplies/Equipment/Non Full‐Time Salaries/Other     $198,454      $19,977   
         
                       3  
         0   FY19 Budget  
         1           0.0  
         2           9.6  
         3           0.0  
         4           1.1  
         5           0.0  
         6           0.0  
         7           0.0  
         8           3.0  
         9           4.0  
         10          0.0  
         11         17.7  
         12      $15,166  ,
                                                            0                     1  \
         0                                                NaN  Position/Expenditure   
         1                                    Total Positions                   NaN   
         2  Total Supplies/Equipment/Non Full‐Time Salarie...                   NaN   
         
                      2            3            4  
         0  FY14 Budget  FY18 Budget  FY19 Budget  
         1         76.1         64.8         67.8  
         2     $230,726     $120,136     $112,719  ]   
    

    After I saw the output, I wrote a function to perform the same cleaning operation for each table in each budget.

    For each table below, first I'll introduce the "raw" output that Tabula returned, then I'll show the function that I wrote to fix that output.

    Basic Information

    Raw Output:

    basic_information = sample_budget[0] #basic information
    basic_information
    
    0 1
    0 Basic Information NaN
    1 Council District 2nd
    2 Organization Code 1380
    3 School Level Elementary School
    4 Economically NaN
    5 Disadvantaged Rate* NaN
    6 NaN 83.44%

    Cleanup Function:

    def generate_basic_information_table(df):
        '''Series representing the "basic information" table.'''
    
        # budgets with a comment near the basic information table, e.g. 2050
        if df.shape[1] == 3:
            df = df.iloc[1:, 1:]
            df = df.reset_index(drop=True)
            df = df.T.reset_index(drop=True).T
    
        # After that, Tabula did pretty well for this table, but didn't get the
        # Economically Disadvanted Rate quite right.
    
        df.loc[4] = ["Economically Disadvantaged Rate", df.loc[6, 1]]
        df = df.loc[1:4, :]
        return pd.Series(list(df[1]), index=list(df[0]), name='basic_information')
    

    Cleaned:

    basic_information = generate_basic_information_table(basic_information)
    basic_information
    
    # Basic information output
    Council District                                 2nd
    Organization Code                               1380
    School Level                       Elementary School
    Economically Disadvantaged Rate               83.44%
    Name: basic_information, dtype: object
    

    Enrollment

    Raw Output:

    # Getting the enrollment output
    enrollment = sample_budget[1]
    enrollment
    
    0 1 2 3
    0 NaN FY14 FY18 FY19 Projected
    1 Enrollment** 842 640 602

    Cleanup Function:

    def generate_enrollment_table(df):
        '''returns a series representing the "enrollment" table'''
        # nothing too crazy here
        df = df.T.loc[1:, :]
        df_to_series = pd.Series(list(df[1]), index=list(df[0]), name="enrollment")
        return df_to_series.str.replace(',', '').astype(float)
    
    generate_enrollment_table(enrollment)
    

    Cleaned:

    # Enrollment table
    FY14              842.0
    FY18              640.0
    FY19 Projected    602.0
    Name: enrollment, dtype: float64
    

    Allotments

    Luckily, both allotment tables were identical, so I could apply to the same cleanup steps to both.

    Raw Output:

    operating_funded_allotments = sample_budget[2]
    operating_funded_allotments
    
    0 1 2 3
    0 Position/Expenditure FY14 Budget FY18 Budget FY19 Budget
    1 Principals/Assistant Principals 2.0 1.0 1.0
    2 Teachers ‐ Regular Education 30.2 25.0 24.0
    3 Teachers ‐ Special Education 6.0 2.8 5.0
    4 Counselors/Student Adv./ Soc. Serv. Liaisons 1.2 0.8 0.1
    5 Nurses/Health Services 0.6 1.0 1.0
    6 Classroom Assistants/Teacher Assistants 11.0 8.0 9.0
    7 Secretaries 1.0 1.0 1.0
    8 Support Services Assistants 0.0 2.0 5.0
    9 Student Climate Staff 8.0 1.0 3.0
    10 Other 0.0 1.2 1.0
    11 Total Positions 60.0 43.8 50.1
    12 Supplies/Equipment/Non Full‐Time Salaries/Other $32,272 $100,159 $97,553
    grant_funded_allotments = sample_budget[3]
    grant_funded_allotments
    

    Cleanup Function

    I decided to merge the two allotment tables into one DataFrame while building a MultiIndex to keep things in order. This would allow me to ask some more interesting questions further on down the road.

    def generate_allotments_table(df, code, fund):
        '''Multiindex DF of org code, fund, and budget category by budget year'''
        df.columns = df.iloc[0]
        df = df.drop(0)
        df = df.set_index(['Position/Expenditure'])
        df = (df.apply(lambda x: x.str.replace('$', '').str.replace(',', ''))
                .astype(float)
              )
        df.name = fund + "ed_allotments"
    
        df_index_arrays = [
            [code] * len(df),
            [fund] * len(df),
            list(df.index),
        ]
    
        df.index = pd.MultiIndex.from_arrays(
            df_index_arrays,
            names=("org_code", "fund", "allotment")
        )
        df.columns = [column[:4] for column in df.columns]
    
        return df
    

    Cleaned:

    pd.concat([
        generate_allotments_table(
            operating_funded_allotments, "1410", "operating_fund"
        ),
        generate_allotments_table(
            grant_funded_allotments, "1410", "grant_fund"
        )
    ])
    
    FY14 FY18 FY19 org_code fund allotment
    1410 operating_fund Principals/Assistant Principals 2.0 1.0 1.0
    Teachers ‐ Regular Education 30.2 25.0 24.0
    Teachers ‐ Special Education 6.0 2.8 5.0
    Counselors/Student Adv./ Soc. Serv. Liaisons 1.2 0.8 0.1
    Nurses/Health Services 0.6 1.0 1.0
    Classroom Assistants/Teacher Assistants 11.0 8.0 9.0
    Secretaries 1.0 1.0 1.0
    Support Services Assistants 0.0 2.0 5.0
    Student Climate Staff 8.0 1.0 3.0
    Other 0.0 1.2 1.0
    Total Positions 60.0 43.8 50.1
    Supplies/Equipment/Non Full‐Time Salaries/Other 32272.0 100159.0 97553.0
    grant_fund Principals/Assistant Principals 0.0 0.0 0.0
    Teachers ‐ Regular Education 8.1 8.6 9.6
    Teachers ‐ Special Education 0.0 0.2 0.0
    Counselors/Student Adv./ Soc. Serv. Liaisons 0.0 0.2 1.1
    Nurses/Health Services 0.0 0.0 0.0
    Classroom Assistants/Teacher Assistants 0.0 0.0 0.0
    Secretaries 0.0 0.0 0.0
    Support Services Assistants 7.0 5.0 3.0
    Student Climate Staff 0.0 7.0 4.0
    Other 1.0 0.0 0.0
    Total Positions 16.1 21.0 17.7
    Supplies/Equipment/Non Full‐Time Salaries/Other 198454.0 19977.0 15166.0

    Totals

    Since the final "totals" table could be calculated from the data already in the new allotment table, I didn't bother transforming it in any way.

    # same data can be derived from the allotments table directly
    sample_budget[4]
    
    0 1 2 3 4
    0 NaN Position/Expenditure FY14 Budget FY18 Budget FY19 Budget
    1 Total Positions NaN 76.1 64.8 67.8
    2 Total Supplies/Equipment/Non Full‐Time Salarie... NaN $230,726 $120,136 $112,719

    Once I figured out what transformations I needed for each table, I combined them into a function so that, given a list of DataFames from Tabula, I'd get those same tables back neatly formatted.

    def generate_all_tables(list_of_df):
        basic_information = generate_basic_information_table(list_of_df[0])
        enrollment = generate_enrollment_table(list_of_df[1])
    
        operating_funded_allotments = generate_allotments_table(
            list_of_df[2],
            basic_information['Organization Code'],
            'operating_fund'
        )
        grant_funded_allotments = generate_allotments_table(
            list_of_df[3],
            basic_information['Organization Code'],
            'grant_fund'
        )
        operating_and_grant_funded_allotments = pd.concat(
            [operating_funded_allotments, grant_funded_allotments]
        )
    
        return basic_information, enrollment, operating_and_grant_funded_allotments
    
    basic_information, enrollment, operating_and_grant_funded_allotments = 
    generate_all_tables(sample_budget)
    

    Aggregation Time

    Now that I had cleaned the tables that Tabula produced, it was time to combine them into some aggregated tables.

    First I wrote a function that would output a Series (representing one row) of information from all tables for a given school in a given fiscal year.

    def generate_row(budget_year, basic_information, allotments, enrollment):
        '''School budget series for fiscal year.'''
     	# budget_year should be FY14, FY18, or FY19
        
        flattened_allotments = pd.DataFrame(allotments.to_records())
        flattened_allotments.index = flattened_allotments['fund'] +": " + flattened_allotments['allotment']
        flattened_allotments = flattened_allotments.drop(
            ['fund','allotment'], axis=1
        )
        budget_allotments = flattened_allotments[budget_year]
        
        enrollment_label = budget_year + ' Projected' if budget_year == "FY19" else budget_year
        enrollment_index = 'projected_enrollment' if budget_year == "FY19" else 'enrollment'
        enrollment_row = pd.Series(
            enrollment[enrollment_label], index=[enrollment_index]
        )
        
        return pd.concat(
                [basic_information,budget_allotments,enrollment_row],
                axis=0
               )
    
    generate_row("FY18", basic_information,
                 operating_and_grant_funded_allotments, enrollment)
    
    # Output
    Council District 2 nd
    Organization Code 1380
    School Level Elementary School
    Economically Disadvantaged Rate 83.44 %
    operating_fund: Principals / Assistant Principal.1
    operating_fund: Teachers‐ Regular Education 25
    operating_fund: Teachers‐ Special Education 2.8
    operating_fund: Counselors / Student Adv. / Soc.Serv.Liaisons 0.8
    operating_fund: Nurses / Health Services 1
    operating_fund: Classroom Assistants / Teacher Assistants 8
    operating_fund: Secretaries 1
    operating_fund: Support Services Assistants 2
    operating_fund: Student Climate Staff 1
    operating_fund: Other 1.2
    operating_fund: Total Positions 43.8
    operating_fund: Supplies / Equipment / Non Full‐ Time Salaries / Other 100159
    grant_fund: Principals / Assistant Principals 0
    grant_fund: Teachers‐ Regular Education 8.6
    grant_fund: Teachers‐ Special Education 0.2
    grant_fund: Counselors / Student Adv. / Soc.Serv.Liaisons 0.2
    grant_fund: Nurses / Health Services 0
    grant_fund: Classroom Assistants / Teacher Assistants 0
    grant_fund: Secretaries 0
    grant_fund: Support Services Assistants 5
    grant_fund: Student Climate Staff 7
    grant_fund: Other 0
    grant_fund: Total Positions 21
    grant_fund: Supplies / Equipment / Non Full‐ Time Salaries / Other 19977
    enrollment 640
    dtype: object
    

    Then, I applied this function to each list of budgets in the collection and compiled them into a DataFrame.

    def generate_tabular_budget(budget_year, budgets):
        '''generate a tabular budget summary for a budget year. Budget year must be FY14,
        FY18, or FY19. Enrollemnt values for budget year 2019 are projected.'''
        school_budget_series = []
        for budget_tables in budgets:
            basic_information, enrollment, operating_and_grant_funded_allotments = generate_all_tables(
                budget_tables
            )
            budget_row = generate_row(
                budget_year, basic_information, operating_and_grant_funded_allotments, enrollment
            )
            budget_row = budget_row
            school_budget_series.append(budget_row)
    
        return pd.DataFrame(school_budget_series)
    
    
    fy14 = generate_tabular_budget('FY14', budgets)
    fy14['budget_year'] = "FY14"
    fy14.to_csv("output/combined_fy14.csv")
    
    fy18 = generate_tabular_budget('FY18', budgets)
    fy18['budget_year'] = "FY18"
    fy18.to_csv("output/combined_fy18.csv")
    
    fy19 = generate_tabular_budget('FY19', budgets)
    fy19['budget_year'] = "FY19"
    fy19.to_csv("output/combined_fy19.csv")
    
    
    combined_tabular_budgets = pd.concat([fy14, fy18, fy19])
    combined_tabular_budgets.to_csv("output/all_budgets_tabular.csv")
    

    Finally, I wanted to output a CSV that would preserve some of the multi-indexed nature of the allotment tables. Here's what I wrote for that.

    def generate_hierarchical_budget(budgets):
        school_budgets_dfs = []
        for budget_tables in budgets:
            school_budgets_dfs.append(operating_and_grant_funded_allotments)
        return pd.concat(school_budgets_dfs)
    
    hierarchical_budget = generate_hierarchical_budget(budgets)
    hierarchical_budget.to_csv("output/all_budgets_hierarchical.csv")
    
    hierarchical_budget
    
    FY14 FY18 FY19 org_code fund allotment
    1380 operating_fund Principals/Assistant Principals 2.0 1.0 1.0
    Teachers ‐ Regular Education 30.2 25.0 24.0
    Teachers ‐ Special Education 6.0 2.8 5.0
    Counselors/Student Adv./ Soc. Serv. Liaisons 1.2 0.8 0.1
    Nurses/Health Services 0.6 1.0 1.0
    Classroom Assistants/Teacher Assistants 11.0 8.0 9.0
    Secretaries 1.0 1.0 1.0
    Support Services Assistants 0.0 2.0 5.0
    Student Climate Staff 8.0 1.0 3.0
    Other 0.0 1.2 1.0
    Total Positions 60.0 43.8 50.1
    Supplies/Equipment/Non Full‐Time Salaries/Other 32272.0 100159.0 97553.0
    grant_fund Principals/Assistant Principals 0.0 0.0 0.0
    Teachers ‐ Regular Education 8.1 8.6 9.6
    Teachers ‐ Special Education 0.0 0.2 0.0
    Counselors/Student Adv./ Soc. Serv. Liaisons 0.0 0.2 1.1
    Nurses/Health Services 0.0 0.0 0.0
    Classroom Assistants/Teacher Assistants 0.0 0.0 0.0
    Secretaries 0.0 0.0 0.0
    Support Services Assistants 7.0 5.0 3.0
    Student Climate Staff 0.0 7.0 4.0
    Other 1.0 0.0 0.0
    Total Positions 16.1 21.0 17.7
    Supplies/Equipment/Non Full‐Time Salaries/Other 198454.0 19977.0 15166.0
    operating_fund Principals/Assistant Principals 2.0 1.0 1.0
    Teachers ‐ Regular Education 30.2 25.0 24.0
    Teachers ‐ Special Education 6.0 2.8 5.0
    Counselors/Student Adv./ Soc. Serv. Liaisons 1.2 0.8 0.1
    Nurses/Health Services 0.6 1.0 1.0
    Classroom Assistants/Teacher Assistants 11.0 8.0 9.0
    grant_fund Secretaries 0.0 0.0 0.0
    Support Services Assistants 7.0 5.0 3.0
    Student Climate Staff 0.0 7.0 4.0
    Other 1.0 0.0 0.0
    Total Positions 16.1 21.0 17.7
    Supplies/Equipment/Non Full‐Time Salaries/Other 198454.0 19977.0 15166.0
    operating_fund Principals/Assistant Principals 2.0 1.0 1.0
    Teachers ‐ Regular Education 30.2 25.0 24.0
    Teachers ‐ Special Education 6.0 2.8 5.0
    Counselors/Student Adv./ Soc. Serv. Liaisons 1.2 0.8 0.1
    Nurses/Health Services 0.6 1.0 1.0
    Classroom Assistants/Teacher Assistants 11.0 8.0 9.0
    Secretaries 1.0 1.0 1.0
    Support Services Assistants 0.0 2.0 5.0
    Student Climate Staff 8.0 1.0 3.0
    Other 0.0 1.2 1.0
    Total Positions 60.0 43.8 50.1
    Supplies/Equipment/Non Full‐Time Salaries/Other 32272.0 100159.0 97553.0
    grant_fund Principals/Assistant Principals 0.0 0.0 0.0
    Teachers ‐ Regular Education 8.1 8.6 9.6
    Teachers ‐ Special Education 0.0 0.2 0.0
    Counselors/Student Adv./ Soc. Serv. Liaisons 0.0 0.2 1.1
    Nurses/Health Services 0.0 0.0 0.0
    Classroom Assistants/Teacher Assistants 0.0 0.0 0.0
    Secretaries 0.0 0.0 0.0
    Support Services Assistants 7.0 5.0 3.0
    Student Climate Staff 0.0 7.0 4.0
    Other 1.0 0.0 0.0
    Total Positions 16.1 21.0 17.7
    Supplies/Equipment/Non Full‐Time Salaries/Other 198454.0 19977.0 15166.0
    5160 rows × 3 columns

    This makes it easier to aggregate in interesting ways:

    hierarchical_budget.groupby('allotment').sum()
    
    FY14 FY18 FY19 allotment
    Classroom Assistants/Teacher Assistants 2365.0 1720.0 1935.0
    Counselors/Student Adv./ Soc. Serv. Liaisons 258.0 215.0 258.0
    Nurses/Health Services 129.0 215.0 215.0
    Other 215.0 258.0 215.0
    Principals/Assistant Principals 430.0 215.0 215.0
    Secretaries 215.0 215.0 215.0
    Student Climate Staff 1720.0 1720.0 1505.0
    Supplies/Equipment/Non Full‐Time Salaries/Other 49606090.0 25829240.0 24234585.0
    Support Services Assistants 1505.0 1505.0 1720.0
    Teachers ‐ Regular Education 8234.5 7224.0 7224.0
    Teachers ‐ Special Education 1290.0 645.0 1075.0
    Total Positions 16361.5 13932.0 14577.0

    More Cleaning to be Done

    My work here is done. I saved the data from their not-so-accessible PDF prisons. But now it's time for someone with some domain-specific knowledge to make it actionable.

    The biggest weakness with the data in its current form is that there is some amount of ambiguity as to what the different allotments numbers represent in real-dollar amounts. Only the Supplies/Equipment/Non Full‐Time Salaries/Other allotment category came in currency notation – the rest of the allotments were represented as simple decimal amounts with no context to help interpret what they mean. Do they represent FTE? Dollar amounts in scientific notation? I'm not sure, but I hope by handing this work off to the right people, these questions and more can be answered more easily thanks to a cleaner, more accessible data set.

    Graham Beckley's' avatar
    Philly
    Loves Python; loves pandas; leaves every project more Pythonic than he found it.

    Loves Python; loves pandas; leaves every project more Pythonic than he found it.