Let's say you have two obscenely large sets of data. These sets of data might report on a similar topic, such as the personal details of employees, but for some reason set 1 contains email address where set 2 does not, and set 2 contains addresses which are similarly missing in set 1.
Alternatively, these sets of data may contain the same data points, but employee entries are missing from one or the other Dataframe.
This tutorial covers the case of handling both of these use cases simultaneously.
This crash course in data science was inspired by a 1am Google hangout with Snkia.
We'll be assuming you have your data exported as a CSV for the purpose of this exercise. Before going in to detail, know some of the terminology associated with merging Dataframes.
Sets of data can be merged in a number of ways. Merges can either be used to find similarities in two Dataframes and merge associated information, or may be entirely non-destructive in the way that two sets of data are merged.
In many cases (such as the one in this tutorial) you'd likely want to merge two Dataframes based on the value of a key. A key is the authoritative column by which the Dataframes will be merged. When merging Dataframes in this way, keys will stay in tact as an identifier while the values of columns in the same row associated to that key.
This type of merge can be used when two Dataframes hold differing fields for similar rows. If Dataframe 1 contains the phone numbers of people by name, and Dataframe 2 contains emails of a similar grouping of people, these two may be merged to create a single collection of data with all of this information.
A parameter of pandas functions which determines whether the function should be run against a Dataframe's columns or rows. An axis of 0 determines that the action will be taken on a per-row basis, where an axis of 1 denotes column.
For example, performing a drop with axis 0 on key X will drop the row where value of a cell is equal to X.
An example of a left/right merge can be seen below:
The two data frames above hold similar keys with different associated information per axis, thus the result is a combination of these two Dataframes where the keys remain intact.
"Left" or "right" refer to the left or right Dataframes above. If the keys from both Dataframes do not match 1-to-1, specifying a left/right merge determines which Dataframe's keys will be considered the authority to be preserved in the merge.
An inner merge will merge two Dataframes based on overlap of values between keys in both Dataframes:
An outer merge will preserve the most data by not dropping keys which are uncommon to both Dataframes. Keys which exist in a single Dataframe will be added to the resulting Dataframe, with empty values populated for any columns brought in by the other Dataframe:
Merging Two Dataframes via Left Merge
Let's get it going. Enter the iPython shell.
Import Pandas and read both of your CSV files.
import pandas as pd df = pd.read_csv("csv1.csv") df2 = pd.read_csv("csv2.csv")
The above opens the CSVs as Dataframes recognizable by pandas.
Next, we'll merge the two CSV files.
How specifies the type of merge, and on specifies the column to merge by (key). The key must be present in both Dataframes.
For the purpose of this exercise we'll be merging left, as that is the CSV which contains the keys we'd like to maintain.
mergedDF = df2.merge(df, how=“left”, on="email") print(mergedDF)
This should return a dataset of all common rows, with columns from both CSVs included in the merge.
For more on merging, check out the official Pandas documentation here.