Comparing Rows Between Two Pandas DataFrames

It's a tale as old as time: your company has multiple instances of the same dataset which should match, but for whatever reason, don't. An obvious example that comes to mind might be discrepancies between environments: your staging environment probably falls out of sync with production often. This is an easy problem to solve because it's clear why these types of systems get out of sync. Solving the problem is simple as syncing one system to the other, usually where one system serves as the "source of truth." But what about discrepancies between systems without a source of truth? What do we do when parallel systems should have comparable data, but don't? Lastly, how can we figure out what went wrong?

Companies are prone to tons of scenarios for this exact problem, and they're usually the kind of situations that are deadly when unresolved. For example, let's say your company tracks new hires by adding them to an HR system (such as Bamboo or Workday). These new hires need access to internal company systems (like Confluence, or anything behind your company's VPN). Your company prides itself on being "scrappy," thus depends on an IT department to provision these users across systems manually. Assuming nobody has made a user error, these systems will have matching user information... until those hires are terminated.

No matter the situation, you're nearly guaranteed to run into this issue at some point. Luckily for us, the solution isn't too complicated.

Setting the Stage

The assumption here is that we’re comparing the rows in our data. We need two datasets which have matching columns, but different entries. Sticking to our employee example, I'm going to use two fake datasets containing employee information as such:

id	first_name	last_name	email	gender	department
424	Regine	Aberkirder	raberkirderbr@shop-pro.jp	Female	Marketing
989	Jessa	Acome	jacomerg@ustream.tv	Female	Business Development
318	Marci	Acres	macres8t@paypal.com	Female	Marketing
996	Olympe	Acutt	oacuttrn@flickr.com	Female	Support
516	Baxy	Adamides	badamideseb@google.pl	Male	Research and Development
743	Melba	Addey	maddeykm@vimeo.com	Female	Accounting
207	Willie	Addionisio	waddionisio5q@hhs.gov	Male	Human Resources
764	Noby	Agutter	nagutterl7@ow.ly	Male	Accounting
950	Avery	Ahlin	aahlinqd@ihg.com	Male	Research and Development
434	Richart	Akid	rakidc1@ow.ly	Male	Engineering
863	Kalindi	Alebrooke	kalebrookeny@biglobe.ne.jp	Female	Product Management
250	Marley	Alen	malen6x@fema.gov	Female	Legal
481	Wilfrid	Algie	walgiedc@paypal.com	Male	Product Management

We'll make two Pandas DataFrames from these similar data sets:

df1 = pd.read_csv('data/employees1.csv')
df2 = pd.read_csv('data/employees2.csv')

Create two DataFrames

Now let's get to work.

Pandas Merge With Indicators

The first piece of magic is as simple as adding a keyword argument to a Pandas "merge." When merging two DataFrames in Pandas, setting indicator=True adds a column to the merged DataFame where the value of each row can be one of three possible values: left_only, right_only, or both:

id	first_name	last_name	email	gender	department	_merge
990	Yanaton	Tumioto	ytumiotol@ehow.com	Male	Training	right_only
962	Lara	Minichillo	lminichillom@photobucket.com	Female	Accounting	right_only
976	Reine	Haldon	rhaldonn@bbc.co.uk	Female	Services	right_only
961	Lyman	Marcq	lmarcqp@merriam-webster.com	Male	Engineering	right_only
221	Findlay	Danett	fdanettr@ibm.com	Male	Human Resources	both
675	Muhammad	Rimbault	mrimbaults@feedburner.com	Male	Research and Development	left_only
692	Mindy	Ruggen	mruggent@unc.edu	Female	Marketing	left_only
669	Isidoro	Reedie	ireedieu@so-net.ne.jp	Male	Business Development	left_only
783	Niels	Strase	nstrasev@utexas.edu	Male	Legal	both
120	Jock	Bucktharp	jbucktharpw@stanford.edu	Male	Research and Development	both
734	Leanora	Sigert	lsigertx@so-net.ne.jp	Female	Sales	both
52	Bea	Behneke	bbehnekey@163.com	Female	Support	both
563	Cindee	Moorcroft	cmoorcroftz@hao123.com	Female	Support	both
172	Kelvin	Clubb	kclubb10@gmpg.org	Male	Product Management	both
362	Stella	Hirtz	shirtz11@hibu.com	Female	Legal	both
273	Archibold	Eager	aeager12@macromedia.com	Male	Services	both

As you might imagine, rows marked with a value of "both" in the merge column denotes rows which are common to both DataFrames. left_only and right_only mark rows which were present in either the left or right DataFrame, respectively.

We already have a big piece of the puzzle solved. Let's build off of this to create a reusable function which returns exactly what we're looking for each time.

Creating a Code Snippet

Let's build a function called dataframe_difference() which answers any of 4 questions:

Which rows were only present in the first DataFrame?
Which rows were only present in the second DataFrame?
Which rows were present in both DataFrames?
Which rows were not present in both DataFrames, but present in one of them?

For starters, our function dataframe_difference() will need to be passed two DataFrames to compare. To make things interesting, let's add an optional keyword argument which allows us to return rows for each of the four scenarios above:

def dataframe_difference(df1: DataFrame, df2: DataFrame, which=None):
    """Find rows which are different between two DataFrames."""
    comparison_df = df1.merge(
        df2,
        indicator=True,
        how='outer'
    )
    if which is None:
        diff_df = comparison_df[comparison_df['_merge'] != 'both']
    else:
        diff_df = comparison_df[comparison_df['_merge'] == which]
    diff_df.to_csv('data/diff.csv')
    return diff_df

Compare rows between two DataFrames & output the difference

If we call our function without a keyword argument (ie: dataframe_difference(df1, df2) ), our function will answer question #4:

id	first_name	last_name	email	gender	department	_merge
990	Yanaton	Tumioto	ytumiotol@ehow.com	Male	Training	right_only
962	Lara	Minichillo	lminichillom@photobucket.com	Female	Accounting	right_only
976	Reine	Haldon	rhaldonn@bbc.co.uk	Female	Services	right_only
961	Lyman	Marcq	lmarcqp@merriam-webster.com	Male	Engineering	right_only
675	Muhammad	Rimbault	mrimbaults@feedburner.com	Male	Research and Development	left_only
692	Mindy	Ruggen	mruggent@unc.edu	Female	Marketing	left_only
669	Isidoro	Reedie	ireedieu@so-net.ne.jp	Male	Business Development	left_only
746	Rhianon	Sloan	rsloan16@ovh.net	Female	Sales	left_only
952	Jennilee	Palle	jpalle19@state.gov	Female	Marketing	right_only
970	Nedda	Grieger	ngrieger1c@quantcast.com	Female	Marketing	right_only

As expected, values in the merge column contain left_only and right_only, but not both. The opposite of this would be answering question #3: which rows were present in both DataFrames? To accomplish this, we can pass which='both' to our function. dataframe_difference(df1, df2, which='both') would look something like this:

id	first_name	last_name	email	gender	department	_merge
863	Marcia	Whalebelly	mwhalebelly1f@jugem.jp	Female	Accounting	both
470	Averell	Letteresse	aletteresse1g@slashdot.org	Male	Accounting	both
125	Berri	Bullough	bbullough1h@comsenz.com	Female	Product Management	both
258	Vassili	Drinkwater	vdrinkwater1i@posterous.com	Male	Legal	both
701	Laetitia	Sarah	lsarah1j@nationalgeographic.com	Female	Human Resources	both
507	Hendrika	Malec	hmalec1k@ifeng.com	Female	Support	both
163	Haskell	Chiplen	hchiplen1l@house.gov	Male	Engineering	both
90	Gardiner	Bonhome	gbonhome1m@163.com	Male	Human Resources	both
705	Arley	Scapens	ascapens1n@sitemeter.com	Male	Product Management	both
127	Dacey	Burrett	dburrett1o@nydailynews.com	Female	Training	both

Passing which='left_only' and which='right_only' do exactly what you'd expect.

Feel free to take this snippet and use it as your own. I've added it to Github as a Gist here.

BONUS ROUND: Uploading Missing Rows to SQL Table

We can easily apply the function we just created to help us sync rows between two database tables. If we want to make sure rows between two SQL tables match, we can do something like this:

# Create DataFrames from two SQL tables
df1 = pd.read_sql_table(
    'table1',
    con=engine,
    index_col='id'
)
df2 = pd.read_sql_table(
    'table2',
    con=engine,
    index_col='id'
)

# Upload missing rows
diff_df = dataframe_difference(df1, df2)
diff_df.to_sql(
    'table2',
    con=self.engine,
    index=True,
    index_label='id',
    if_exists='append'
)

Insert missing rows into SQL table

Pandas' to_sql() method has a nifty keyword argument called if_exists. If we set if_exists='append' when using to_sql(), Pandas will append rows from a DataFrame to an existing SQL table. Since we used our dataframe_difference() function to find which rows were different, we were able to ensure that we only uploaded rows that were different.

Setting the Stage

Pandas Merge With Indicators

Creating a Code Snippet

BONUS ROUND: Uploading Missing Rows to SQL Table

Related Posts