Dropping Rows of Data Using Pandas

Square one of cleaning your Pandas Dataframes: dropping empty or problematic data.

One of the many things you'll want to do when working with large datasets is rid yourself of filthy, filthy data. This can include empty, poorly formatted, or simply irrelevant data entries. While 'bad' data can occasionally be fixed or salvaged via transforms, in many cases it's best to do away with rows entirely to ensure that only the fittest survive.

Drop Empty Rows or Columns

If you're looking to drop rows (or columns) containing empty data, you're in luck: Pandas' dropna() method is specifically for this.

Using dropna() is a simple one-liner which accepts a number of useful arguments:

import pandas as pd

# Create a Dataframe from CSV
my_dataframe = pd.read_csv('example.csv')

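# Drop rows with any empty cells and keep the result
# (thresh is left out: recent pandas versions raise a TypeError if how and thresh are both passed)
my_dataframe = my_dataframe.dropna(axis=0, how='any', subset=None, inplace=False)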

Technically you could run my_dataframe.dropna() without any parameters, and this would default to dropping all rows which contain at least one empty cell. If that's all you needed, well, I guess you're done already. Otherwise, here are the parameters you can include:

  • Axis: Specifies to drop by row or column. 0 means row, 1 means column.
  • How: Accepts one of two possible values: any or all. This will either drop an axis which is completely empty (all), or an axis with even just a single empty cell (any).
  • Thresh: Here's an interesting one: thresh accepts an integer, and keeps an axis only if it contains at least that many non-empty cells. Anything with fewer is dropped.
  • Subset: Accepts a list of labels along the other axis to consider (for example, column names when dropping rows), as opposed to considering all by default.
  • Inplace: If you haven't come across inplace yet, learn this now: changes will NOT be made to the Dataframe you're touching unless this is set to True. It's False by default.
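To make those parameters a little less abstract, here is a minimal sketch of each one in action. The sample Dataframe is made up for illustration (it stands in for whatever lives in example.csv), and the variable names are only placeholders.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data standing in for example.csv
df = pd.DataFrame({
    'name': ['Ann', 'Bob', None, 'Dee'],
    'age': [31, np.nan, np.nan, 45],
    'city': ['NYC', None, None, None],
})

# how='any' (the default): drop rows with at least one empty cell
any_dropped = df.dropna()

# how='all': drop only the rows where every cell is empty
all_dropped = df.dropna(how='all')

# thresh=2: keep rows with at least 2 non-empty cells
thresh_dropped = df.dropna(thresh=2)

# subset: only look at the 'age' column when deciding which rows to drop
subset_dropped = df.dropna(subset=['age'])

# axis=1: drop columns (rather than rows) containing any empty cell
cols_dropped = df.dropna(axis=1)
```

Note that df itself is untouched after all of this, since inplace was left at its default of False and each call hands back a new Dataframe.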

Drop by Label

The pandas .drop() method is used to remove entire rows or columns based on their name. If we can see that our Dataframe contains extraneous information (perhaps for example, the HR team is storing a preferred_icecream_flavor in their master records), we can destroy the column (or row) outright.

Using drop() looks something like this:

import pandas as pd

# Create a Dataframe from CSV
my_dataframe = pd.read_csv('example.csv')

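# Full signature with its default values (at least one of labels, index, or columns must be given):
# my_dataframe.drop(labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')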

We'll attempt to cover the usage of these parameters in plain English before inevitably falling into useless lingo which you have not yet learned.

  • Axis: Similar to the above, setting the axis specifies if you're trying to drop rows or columns.
  • Labels: The index label or column name to drop: a string column name, for instance, or an integer row label. Whether this refers to columns or rows in the Dataframe depends on the value of the axis parameter (rows by default). Labels may also be passed as a list when dropping multiple rows/columns at once.

Drop by index:

import pandas as pd

# Create a Dataframe from CSV
my_dataframe = pd.read_csv('example.csv')

# Drop the rows whose index labels are 0 and 1
my_dataframe.drop([0, 1])

Drop by label:

import pandas as pd

# Create a Dataframe from CSV
my_dataframe = pd.read_csv('example.csv')

# Drop by column name (axis=1 targets columns; without it, drop() looks for rows labeled 'B' and 'C')
my_dataframe.drop(['B', 'C'], axis=1)

  • Index, Columns: An alternative method for specifying the same as the above. Accepts single or multiple values. Setting columns=labels is equivalent to labels, axis=1. Likewise, index=0 is equivalent to labels=0.
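  • Level: Used in sets of data which contain multiple hierarchical levels (a MultiIndex), similar to that of nested arrays. It specifies which level the labels should be removed from. A high-level view of hierarchical indexing can be found in the pandas documentation on MultiIndex.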
  • Inplace: Again, drop methods are not carried out on the target Dataframe unless explicitly stated. By default, drop() returns a modified copy instead. The purpose of this is presumably to preserve the original set of data during ad hoc manipulation. Here is a video of some guy describing this for some reason.
  • Errors: Accepts either ignore or raise, with 'raise' set as default. When errors='ignore' is set, labels which don't exist are silently skipped instead of raising an error, and only the labels which do exist are dropped.
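As a rough sketch of how these parameters fit together, here is the ice cream example from earlier. The Dataframe and its column names (other than preferred_icecream_flavor) are invented for illustration.

```python
import pandas as pd

# Hypothetical HR records standing in for example.csv
df = pd.DataFrame({
    'employee_name': ['Ann', 'Bob', 'Chad'],
    'department': ['Sales', 'IT', 'IT'],
    'preferred_icecream_flavor': ['mint', 'vanilla', 'chocolate'],
})

# Drop a column by name; these two calls are equivalent
by_axis = df.drop('preferred_icecream_flavor', axis=1)
by_keyword = df.drop(columns='preferred_icecream_flavor')

# Drop rows by their index labels (here the default labels 0, 1, 2)
without_first_two = df.drop(index=[0, 1])

# errors='ignore' skips labels that don't exist instead of raising a KeyError
lenient = df.drop(columns=['department', 'not_a_column'], errors='ignore')
```

None of these calls change df, because inplace is left at False. Each one returns a new Dataframe which you can assign to whatever name you like.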

Drop by Criteria

We can also remove rows or columns based on whichever criteria your little heart desires. For example, if you really hate people named Chad, you can drop all rows in your Customer database who have the name Chad. Screw Chad.

Unlike previous methods, the popular way of handling this is simply by saving your Dataframe over itself, filtered by a given condition. Here's how we'd get rid of Chad:

import pandas as pd

# Create a Dataframe from CSV
my_dataframe = pd.read_csv('example.csv')

# Drop via logic: similar to SQL 'WHERE' clause (keeps only the rows where the name is not 'chad')
my_dataframe = my_dataframe[my_dataframe.employee_name != 'chad']

The syntax may seem a bit off-putting to newcomers (note the repetition of my_dataframe 3 times). The format of my_dataframe[CONDITION] simply returns a filtered version of my_dataframe, containing only the rows where the given condition is true. Our condition is "the name is not chad", so every row that fails it gets left behind.

Since we're purging this data altogether, stating my_dataframe = my_dataframe[CONDITION] is an easy (albeit destructive) method for shedding data and moving on with our lives.
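If it helps to see the moving parts separately, here is a small sketch which pulls the condition out into its own variable. The sample data, the salary column, and the variable names are all assumptions made up for the example.

```python
import pandas as pd

# Hypothetical customer data standing in for example.csv
df = pd.DataFrame({
    'employee_name': ['Ann', 'Chad', 'Bob', 'chad'],
    'salary': [50000, 60000, 55000, 40000],
})

# The condition on its own is a boolean Series: one True/False per row
is_chad = df['employee_name'].str.lower() == 'chad'

# ~ negates the mask, so this keeps everyone who is not a Chad
no_chads = df[~is_chad]

# Conditions combine with & (and) and | (or); wrap each one in parentheses
well_paid_non_chads = df[(~is_chad) & (df['salary'] > 52000)]

# Equivalent route through drop(): pass the index labels of the matching rows
also_no_chads = df.drop(df[is_chad].index)
```

Lowercasing the names before comparing is a design choice here, since string comparison is case-sensitive and a plain != 'chad' would let 'Chad' slip through.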

Product manager turned engineer with an ongoing identity crisis. Breaks everything before learning best practices. Completely normal and emotionally stable.
