Check out the accompanying GitHub repo for this article here.
Technically, the School District of Philadelphia's budget data for the 2019 fiscal year is "open". It is, after all, made available through the district's Open Data portal and is freely available to download.
But just because data is freely available, doesn't mean it's easy to work with. That's what found out when I downloaded the zipped folder, opened it up, and found a heap of PDFs. Joy.
As a member of Code for Philly, I thought of my compatriots who might want to use school district data in their projects. I knew with a bit of data munging, I could provide a data set that would be more easily usable.
Data Liberation
The first hurdle was to find a way to get the data from the PDFs. After a bit Googling, I came across tabula-py, a Python wrapper for Tabula.
Each budget is composed of 5 tables:
- General information about the school
- Enrollment information
- Operating Funded budget allotments
- Grant Funded budget allotments
- A summary table of allotment totals
Extracting these tables from a budget with Tabula was as simple as:
Which returned a list of DataFrames, one for each table mentioned above. Perfect!
So, I iterated over all of the files in folder and appended them to a list:
Initial Cleaning
While this gave me a good start, I knew it wouldn't be that easy to liberate the data from the PDFs. I took a look at each of the DataFrames to see what I'd be working with.
# an example list of budgets
sample_budget = budgets[0]
sample_budget
Output:
[ 0 1
0 Basic Information NaN
1 Council District 2nd
2 Organization Code 1380
3 School Level Elementary School
4 Economically NaN
5 Disadvantaged Rate* NaN
6 NaN 83.44%,
0 1 2 3
0 NaN FY14 FY18 FY19 Projected
1 Enrollment** 842 640 602,
0 1 2 \
0 Position/Expenditure FY14 Budget FY18 Budget
1 Principals/Assistant Principals 2.0 1.0
2 Teachers ‐ Regular Education 30.2 25.0
3 Teachers ‐ Special Education 6.0 2.8
4 Counselors/Student Adv./ Soc. Serv. Liaisons 1.2 0.8
5 Nurses/Health Services 0.6 1.0
6 Classroom Assistants/Teacher Assistants 11.0 8.0
7 Secretaries 1.0 1.0
8 Support Services Assistants 0.0 2.0
9 Student Climate Staff 8.0 1.0
10 Other 0.0 1.2
11 Total Positions 60.0 43.8
12 Supplies/Equipment/Non Full‐Time Salaries/Other $32,272 $100,159
3
0 FY19 Budget
1 1.0
2 24.0
3 5.0
4 0.1
5 1.0
6 9.0
7 1.0
8 5.0
9 3.0
10 1.0
11 50.1
12 $97,553 ,
0 1 2 \
0 Position/Expenditure FY14 Budget FY18 Budget
1 Principals/Assistant Principals 0.0 0.0
2 Teachers ‐ Regular Education 8.1 8.6
3 Teachers ‐ Special Education 0.0 0.2
4 Counselors/Student Adv./ Soc. Serv. Liaisons 0.0 0.2
5 Nurses/Health Services 0.0 0.0
6 Classroom Assistants/Teacher Assistants 0.0 0.0
7 Secretaries 0.0 0.0
8 Support Services Assistants 7.0 5.0
9 Student Climate Staff 0.0 7.0
10 Other 1.0 0.0
11 Total Positions 16.1 21.0
12 Supplies/Equipment/Non Full‐Time Salaries/Other $198,454 $19,977
3
0 FY19 Budget
1 0.0
2 9.6
3 0.0
4 1.1
5 0.0
6 0.0
7 0.0
8 3.0
9 4.0
10 0.0
11 17.7
12 $15,166 ,
0 1 \
0 NaN Position/Expenditure
1 Total Positions NaN
2 Total Supplies/Equipment/Non Full‐Time Salarie... NaN
2 3 4
0 FY14 Budget FY18 Budget FY19 Budget
1 76.1 64.8 67.8
2 $230,726 $120,136 $112,719 ]
After I saw the output, I wrote a function to perform the same cleaning operation for each table in each budget.
For each table below, first I'll introduce the "raw" output that Tabula returned, then I'll show the function that I wrote to fix that output.
Basic Information
basic_information = sample_budget[0] #basic information
basic_information
Output:
0 | 1 | |
---|---|---|
0 | Basic Information | NaN |
1 | Council District | 2nd |
2 | Organization Code | 1380 |
3 | School Level | Elementary School |
4 | Economically | NaN |
5 | Disadvantaged Rate* | NaN |
6 | NaN | 83.44% |
Cleanup Function
...
def generate_basic_information_table(df):
'''Series representing the "basic information" table.'''
# budgets with a comment near the basic information table, e.g. 2050
if df.shape[1] == 3:
df = df.iloc[1:, 1:]
df = df.reset_index(drop=True)
df = df.T.reset_index(drop=True).T
# Tabula did pretty well for this table, but didn't get the
# Economically Disadvanted Rate quite right.
df.loc[4] = ["Economically Disadvantaged Rate", df.loc[6, 1]]
df = df.loc[1:4, :]
return pd.Series(
list(df[1]),
index=list(df[0]),
name='basic_information'
)
Cleaned:
basic_information = generate_basic_information_table(basic_information)
basic_information
Output:
# Basic information output
Council District 2nd
Organization Code 1380
School Level Elementary School
Economically Disadvantaged Rate 83.44%
Name: basic_information, dtype: object
Enrollment
# Getting the enrollment output
enrollment = sample_budget[1]
enrollment
Output:
0 | 1 | 2 | 3 | |
---|---|---|---|---|
0 | NaN | FY14 | FY18 | FY19 Projected |
1 | Enrollment** | 842 | 640 | 602 |
Cleanup Function
...
def generate_enrollment_table(df):
'''returns a series representing the "enrollment" table'''
df = df.T.loc[1:, :]
df_to_series = pd.Series(
list(df[1]),
index=list(df[0]),
name="enrollment"
)
return df_to_series.str.replace(',', '').astype(float)
generate_enrollment_table(enrollment)
Cleaned:
# Enrollment table
FY14 842.0
FY18 640.0
FY19 Projected 602.0
Name: enrollment, dtype: float64
Allotments
Luckily, both allotment tables were identical, so I could apply to the same cleanup steps to both.
operating_funded_allotments = sample_budget[2]
operating_funded_allotments
Output:
0 | 1 | 2 | 3 | |
---|---|---|---|---|
0 | Position/Expenditure | FY14 Budget | FY18 Budget | FY19 Budget |
1 | Principals/Assistant Principals | 2.0 | 1.0 | 1.0 |
2 | Teachers ‐ Regular Education | 30.2 | 25.0 | 24.0 |
3 | Teachers ‐ Special Education | 6.0 | 2.8 | 5.0 |
4 | Counselors/Student Adv./ Soc. Serv. Liaisons | 1.2 | 0.8 | 0.1 |
5 | Nurses/Health Services | 0.6 | 1.0 | 1.0 |
6 | Classroom Assistants/Teacher Assistants | 11.0 | 8.0 | 9.0 |
7 | Secretaries | 1.0 | 1.0 | 1.0 |
8 | Support Services Assistants | 0.0 | 2.0 | 5.0 |
9 | Student Climate Staff | 8.0 | 1.0 | 3.0 |
10 | Other | 0.0 | 1.2 | 1.0 |
11 | Total Positions | 60.0 | 43.8 | 50.1 |
12 | Supplies/Equipment/Non Full‐Time Salaries/Other | $32,272 | $100,159 | $97,553 |
grant_funded_allotments = sample_budget[3]
grant_funded_allotments
Cleanup Function
I decided to merge the two allotment tables into one DataFrame while building a MultiIndex to keep things in order. This would allow me to ask some more interesting questions further on down the road.
...
def generate_allotments_table(df, code, fund):
"""Multiindex of org, fund, and budget category by year."""
df.columns = df.iloc[0]
df = df.drop(0)
df = df.set_index(['Position/Expenditure'])
df = (df.apply(lambda x: x.str.replace('$', '').str.replace(',', ''))
.astype(float)
)
df.name = fund + "ed_allotments"
df_index_arrays = [
[code] * len(df),
[fund] * len(df),
list(df.index),
]
df.index = pd.MultiIndex.from_arrays(
df_index_arrays,
names=("org_code", "fund", "allotment")
)
df.columns = [column[:4] for column in df.columns]
return df
Cleaned:
pd.concat([
generate_allotments_table(
operating_funded_allotments, "1410", "operating_fund"
),
generate_allotments_table(
grant_funded_allotments, "1410", "grant_fund"
)
])
FY14 | FY18 | FY19 | org_code | fund | allotment |
---|---|---|---|---|---|
1410 | operating_fund | Principals/Assistant Principals | 2.0 | 1.0 | 1.0 |
Teachers ‐ Regular Education | 30.2 | 25.0 | 24.0 | ||
Teachers ‐ Special Education | 6.0 | 2.8 | 5.0 | ||
Counselors/Student Adv./ Soc. Serv. Liaisons | 1.2 | 0.8 | 0.1 | ||
Nurses/Health Services | 0.6 | 1.0 | 1.0 | ||
Classroom Assistants/Teacher Assistants | 11.0 | 8.0 | 9.0 | ||
Secretaries | 1.0 | 1.0 | 1.0 | ||
Support Services Assistants | 0.0 | 2.0 | 5.0 | ||
Student Climate Staff | 8.0 | 1.0 | 3.0 | ||
Other | 0.0 | 1.2 | 1.0 | ||
Total Positions | 60.0 | 43.8 | 50.1 | ||
Supplies/Equipment/Non Full‐Time Salaries/Other | 32272.0 | 100159.0 | 97553.0 | ||
grant_fund | Principals/Assistant Principals | 0.0 | 0.0 | 0.0 | |
Teachers ‐ Regular Education | 8.1 | 8.6 | 9.6 | ||
Teachers ‐ Special Education | 0.0 | 0.2 | 0.0 | ||
Counselors/Student Adv./ Soc. Serv. Liaisons | 0.0 | 0.2 | 1.1 | ||
Nurses/Health Services | 0.0 | 0.0 | 0.0 | ||
Classroom Assistants/Teacher Assistants | 0.0 | 0.0 | 0.0 | ||
Secretaries | 0.0 | 0.0 | 0.0 | ||
Support Services Assistants | 7.0 | 5.0 | 3.0 | ||
Student Climate Staff | 0.0 | 7.0 | 4.0 | ||
Other | 1.0 | 0.0 | 0.0 | ||
Total Positions | 16.1 | 21.0 | 17.7 | ||
Supplies/Equipment/Non Full‐Time Salaries/Other | 198454.0 | 19977.0 | 15166.0 |
Totals
Since the final "totals" table could be calculated from the data already in the new allotment table, I didn't bother transforming it in any way.
# same data can be derived from the allotments table directly
sample_budget[4]
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
0 | NaN | Position/Expenditure | FY14 Budget | FY18 Budget | FY19 Budget |
1 | Total Positions | NaN | 76.1 | 64.8 | 67.8 |
2 | Total Supplies/Equipment/Non Full‐Time Salarie... | NaN | $230,726 | $120,136 | $112,719 |
Once I figured out what transformations I needed for each table, I combined them into a function so that, given a list of DataFames from Tabula, I'd get those same tables back neatly formatted.
...
def generate_all_tables(list_of_df):
basic_information = generate_basic_information_table(list_of_df[0])
enrollment = generate_enrollment_table(list_of_df[1])
operating_funded_allotments = generate_allotments_table(
list_of_df[2],
basic_information['Organization Code'],
'operating_fund'
)
grant_funded_allotments = generate_allotments_table(
list_of_df[3],
basic_information['Organization Code'],
'grant_fund'
)
allotments = pd.concat(
[operating_funded_allotments, grant_funded_allotments]
)
return basic_information, enrollment, operating_and_grant_funded_allotments
basic_information, enrollment, allotments = generate_all_tables(sample_budget)
Aggregation Time
Now that I had cleaned the tables that Tabula produced, it was time to combine them into some aggregated tables.
First I wrote a function that would output a Series (representing one row) of information from all tables for a given school in a given fiscal year.
...
def generate_row(budget_year, basic_information, allotments, enrollment):
"""
School budget series for fiscal year.
budget_year should be FY14, FY18, or FY19
"""
flattened_allotments = pd.DataFrame(allotments.to_records())
flattened_allotments.index = flattened_allotments['fund'] +": " + flattened_allotments['allotment']
flattened_allotments = flattened_allotments.drop(
['fund','allotment'], axis=1
)
budget_allotments = flattened_allotments[budget_year]
enrollment_label = budget_year + ' Projected' if budget_year == "FY19" else budget_year
enrollment_index = 'projected_enrollment' if budget_year == "FY19" else 'enrollment'
enrollment_row = pd.Series(
enrollment[enrollment_label], index=[enrollment_index]
)
return pd.concat(
[basic_information,budget_allotments,enrollment_row],
axis=0
)
generate_row(
"FY18",
basic_information,
operating_and_grant_funded_allotments,
enrollment
)
Output:
Council District 2 nd
Organization Code 1380
School Level Elementary School
Economically Disadvantaged Rate 83.44 %
operating_fund: Principals / Assistant Principal.1
operating_fund: Teachers‐ Regular Education 25
operating_fund: Teachers‐ Special Education 2.8
operating_fund: Counselors / Student Adv. / Soc.Serv.Liaisons 0.8
operating_fund: Nurses / Health Services 1
operating_fund: Classroom Assistants / Teacher Assistants 8
operating_fund: Secretaries 1
operating_fund: Support Services Assistants 2
operating_fund: Student Climate Staff 1
operating_fund: Other 1.2
operating_fund: Total Positions 43.8
operating_fund: Supplies / Equipment / Non Full‐ Time Salaries / Other 100159
grant_fund: Principals / Assistant Principals 0
grant_fund: Teachers‐ Regular Education 8.6
grant_fund: Teachers‐ Special Education 0.2
grant_fund: Counselors / Student Adv. / Soc.Serv.Liaisons 0.2
grant_fund: Nurses / Health Services 0
grant_fund: Classroom Assistants / Teacher Assistants 0
grant_fund: Secretaries 0
grant_fund: Support Services Assistants 5
grant_fund: Student Climate Staff 7
grant_fund: Other 0
grant_fund: Total Positions 21
grant_fund: Supplies / Equipment / Non Full‐ Time Salaries / Other 19977
enrollment 640
dtype: object
Then, I applied this function to each list of budgets in the collection and compiled them into a DataFrame.
...
def generate_tabular_budget(budget_year, budgets):
"""
Generate a tabular budget summary for a budget year.
Budget year must be FY14, FY18, or FY19.
Enrollemnt values for budget year 2019 are projected.
"""
school_budget_series = []
for budget_tables in budgets:
basic_information, enrollment, operating_and_grant_funded_allotments = generate_all_tables(
budget_tables
)
budget_row = generate_row(
budget_year, basic_information, operating_and_grant_funded_allotments, enrollment
)
budget_row = budget_row
school_budget_series.append(budget_row)
return pd.DataFrame(school_budget_series)
fy14 = generate_tabular_budget('FY14', budgets)
fy14['budget_year'] = "FY14"
fy14.to_csv("output/combined_fy14.csv")
fy18 = generate_tabular_budget('FY18', budgets)
fy18['budget_year'] = "FY18"
fy18.to_csv("output/combined_fy18.csv")
fy19 = generate_tabular_budget('FY19', budgets)
fy19['budget_year'] = "FY19"
fy19.to_csv("output/combined_fy19.csv")
combined_tabular_budgets = pd.concat([fy14, fy18, fy19])
combined_tabular_budgets.to_csv("output/all_budgets_tabular.csv")
Finally, I wanted to output a CSV that would preserve some of the multi-indexed nature of the allotment tables. Here's what I wrote for that.
...
def generate_hierarchical_budget(budgets):
school_budgets_dfs = []
for budget_tables in budgets:
school_budgets_dfs.append(operating_and_grant_funded_allotments)
return pd.concat(school_budgets_dfs)
hierarchical_budget = generate_hierarchical_budget(budgets)
hierarchical_budget.to_csv("output/all_budgets_hierarchical.csv")
hierarchical_budget
Output:
FY14 | FY18 | FY19 | org_code | fund | allotment |
---|---|---|---|---|---|
1380 | operating_fund | Principals/Assistant Principals | 2.0 | 1.0 | 1.0 |
Teachers ‐ Regular Education | 30.2 | 25.0 | 24.0 | ||
Teachers ‐ Special Education | 6.0 | 2.8 | 5.0 | ||
Counselors/Student Adv./ Soc. Serv. Liaisons | 1.2 | 0.8 | 0.1 | ||
Nurses/Health Services | 0.6 | 1.0 | 1.0 | ||
Classroom Assistants/Teacher Assistants | 11.0 | 8.0 | 9.0 | ||
Secretaries | 1.0 | 1.0 | 1.0 | ||
Support Services Assistants | 0.0 | 2.0 | 5.0 | ||
Student Climate Staff | 8.0 | 1.0 | 3.0 | ||
Other | 0.0 | 1.2 | 1.0 | ||
Total Positions | 60.0 | 43.8 | 50.1 | ||
Supplies/Equipment/Non Full‐Time Salaries/Other | 32272.0 | 100159.0 | 97553.0 | ||
grant_fund | Principals/Assistant Principals | 0.0 | 0.0 | 0.0 | |
Teachers ‐ Regular Education | 8.1 | 8.6 | 9.6 | ||
Teachers ‐ Special Education | 0.0 | 0.2 | 0.0 | ||
Counselors/Student Adv./ Soc. Serv. Liaisons | 0.0 | 0.2 | 1.1 | ||
Nurses/Health Services | 0.0 | 0.0 | 0.0 | ||
Classroom Assistants/Teacher Assistants | 0.0 | 0.0 | 0.0 | ||
Secretaries | 0.0 | 0.0 | 0.0 | ||
Support Services Assistants | 7.0 | 5.0 | 3.0 | ||
Student Climate Staff | 0.0 | 7.0 | 4.0 | ||
Other | 1.0 | 0.0 | 0.0 | ||
Total Positions | 16.1 | 21.0 | 17.7 | ||
Supplies/Equipment/Non Full‐Time Salaries/Other | 198454.0 | 19977.0 | 15166.0 | ||
operating_fund | Principals/Assistant Principals | 2.0 | 1.0 | 1.0 | |
Teachers ‐ Regular Education | 30.2 | 25.0 | 24.0 | ||
Teachers ‐ Special Education | 6.0 | 2.8 | 5.0 | ||
Counselors/Student Adv./ Soc. Serv. Liaisons | 1.2 | 0.8 | 0.1 | ||
Nurses/Health Services | 0.6 | 1.0 | 1.0 | ||
Classroom Assistants/Teacher Assistants | 11.0 | 8.0 | 9.0 | ||
grant_fund | Secretaries | 0.0 | 0.0 | 0.0 | |
Support Services Assistants | 7.0 | 5.0 | 3.0 | ||
Student Climate Staff | 0.0 | 7.0 | 4.0 | ||
Other | 1.0 | 0.0 | 0.0 | ||
Total Positions | 16.1 | 21.0 | 17.7 | ||
Supplies/Equipment/Non Full‐Time Salaries/Other | 198454.0 | 19977.0 | 15166.0 | ||
operating_fund | Principals/Assistant Principals | 2.0 | 1.0 | 1.0 | |
Teachers ‐ Regular Education | 30.2 | 25.0 | 24.0 | ||
Teachers ‐ Special Education | 6.0 | 2.8 | 5.0 | ||
Counselors/Student Adv./ Soc. Serv. Liaisons | 1.2 | 0.8 | 0.1 | ||
Nurses/Health Services | 0.6 | 1.0 | 1.0 | ||
Classroom Assistants/Teacher Assistants | 11.0 | 8.0 | 9.0 | ||
Secretaries | 1.0 | 1.0 | 1.0 | ||
Support Services Assistants | 0.0 | 2.0 | 5.0 | ||
Student Climate Staff | 8.0 | 1.0 | 3.0 | ||
Other | 0.0 | 1.2 | 1.0 | ||
Total Positions | 60.0 | 43.8 | 50.1 | ||
Supplies/Equipment/Non Full‐Time Salaries/Other | 32272.0 | 100159.0 | 97553.0 | ||
grant_fund | Principals/Assistant Principals | 0.0 | 0.0 | 0.0 | |
Teachers ‐ Regular Education | 8.1 | 8.6 | 9.6 | ||
Teachers ‐ Special Education | 0.0 | 0.2 | 0.0 | ||
Counselors/Student Adv./ Soc. Serv. Liaisons | 0.0 | 0.2 | 1.1 | ||
Nurses/Health Services | 0.0 | 0.0 | 0.0 | ||
Classroom Assistants/Teacher Assistants | 0.0 | 0.0 | 0.0 | ||
Secretaries | 0.0 | 0.0 | 0.0 | ||
Support Services Assistants | 7.0 | 5.0 | 3.0 | ||
Student Climate Staff | 0.0 | 7.0 | 4.0 | ||
Other | 1.0 | 0.0 | 0.0 | ||
Total Positions | 16.1 | 21.0 | 17.7 | ||
Supplies/Equipment/Non Full‐Time Salaries/Other | 198454.0 | 19977.0 | 15166.0 |
This makes it easier to aggregate in interesting ways:
hierarchical_budget.groupby('allotment').sum()
Output:
FY14 | FY18 | FY19 | allotment |
---|---|---|---|
Classroom Assistants/Teacher Assistants | 2365.0 | 1720.0 | 1935.0 |
Counselors/Student Adv./ Soc. Serv. Liaisons | 258.0 | 215.0 | 258.0 |
Nurses/Health Services | 129.0 | 215.0 | 215.0 |
Other | 215.0 | 258.0 | 215.0 |
Principals/Assistant Principals | 430.0 | 215.0 | 215.0 |
Secretaries | 215.0 | 215.0 | 215.0 |
Student Climate Staff | 1720.0 | 1720.0 | 1505.0 |
Supplies/Equipment/Non Full‐Time Salaries/Other | 49606090.0 | 25829240.0 | 24234585.0 |
Support Services Assistants | 1505.0 | 1505.0 | 1720.0 |
Teachers ‐ Regular Education | 8234.5 | 7224.0 | 7224.0 |
Teachers ‐ Special Education | 1290.0 | 645.0 | 1075.0 |
Total Positions | 16361.5 | 13932.0 | 14577.0 |
More Cleaning to be Done
My work here is done. I saved the data from their not-so-accessible PDF prisons. But now it's time for someone with some domain-specific knowledge to make it actionable.
The biggest weakness with the data in its current form is that there is some amount of ambiguity as to what the different allotments numbers represent in real-dollar amounts. Only the Supplies/Equipment/Non Full‐Time Salaries/Other allotment category came in currency notation – the rest of the allotments were represented as simple decimal amounts with no context to help interpret what they mean. Do they represent FTE? Dollar amounts in scientific notation? I'm not sure, but I hope by handing this work off to the right people, these questions and more can be answered more easily thanks to a cleaner, more accessible data set.