There's something about being a data engineer that makes it impossible to clearly convey thoughts in an articulate manner. It seems inevitable that every well-meaning Spark tutorial is destined to devolve into walls of incomprehensible code with minimal explanation. This is even apparent in StackOverflow, where simple questions are regularly met with absurdly unnecessary solutions (stop making UDFs for everything!) Anyway, what I'm trying to say is it takes a lot of guts to click into these things, and here you are. I appreciate you.

In our last episode, we covered some Spark basics, played with Databricks, and started loading data into DataFrames. Now we're diving deeper into DataFrames by touching on every data enthusiast's favorite topic: cleaning data. To keep it interesting, I picked the filthiest data set I could find: FDA drug enforcement!

Dropping Rows

We don't have to look at this data for long to start noticing some holes:

classification center_classification_date report_date postal_code termination_date recall_initiation_date recall_number city more_code_info event_id distribution_pattern openfda_application_number openfda_brand_name openfda_dosage_form openfda_generic_name openfda_manufacturer_name openfda_product_ndc openfda_product_type openfda_route openfda_substance_name openfda_spl_id openfda_spl_set_id openfda_pharm_class_moa openfda_pharm_class_cs openfda_pharm_class_pe openfda_pharm_class_epc openfda_upc openfda_unii openfda_rxcui recalling_firm voluntary_mandated state reason_for_recall initial_firm_notification status product_type country product_description code_info address_1 address_2 product_quantity
Class II 2014-04-22 2014-04-30 2017-04-24 2014-04-08 D-1253-2014 67870 US Nationwide including Puerto Rico and US Virgin Islands. 65 federal government accounts received the affected product directly from Pfizer Inc Pfizer Inc. Voluntary: Firm Initiated Presence of Foriegn Tablets/Capsules: A single Tikosyn¿ (dofetilide) 250 mcg capsule was discovered in one bottle of Tikosyn¿ (dofetilide) 500 mcg (0.5 mg). Letter Terminated Drugs Tikosyn¿ (dofetilide) 500 mcg (0.5mg) capsules, 60-count bottle, Rx only Pfizer Labs, Division of Pfizer Inc. NY, NY 10017, NDC 00069-5820-60 H48496, H48497; Exp 05/15 23,064 bottles
Class I 2013-01-22 2013-01-30 94080-4990 2013-03-25 2012-07-30 D-135-2013 South San Francisco 63258 Nationwide Genentech Inc Voluntary: Firm Initiated CA Presence of Particulate Matter: One lot of Bacteriostatic Water for Injection, USP diluent vials that were packed with Trastuzumab Kits for investigational use has the potential to contain glass particulates. E-Mail Terminated Drugs United States Trastuzumab Kit containing 1 vial Trastuzumab (nominal content: 440 mg); 1 20 mL multiple dose vial Bacteriostatic Water for Injection, USP (1.1% benzyl alcohol), Manufactured by Hospira, Inc., for Genentech, Inc., South San Francisco, CA 94080-4990, NDC 50242-903-69; and 1 Reconstitution Instructions per kit; Investigation Use only, Genentech, Inc., 1 DNA Way, South San Francisco, CA 94080-4990. Lot #: 454138, Exp 07/14 (containing Trastuzumab Lot #: 886618, Exp 12/14; and Bacteriostatic Water for Injection, USP Lot #: 08-368-DK-01, Exp 07/14) 1 DNA Way 2,140 kits
Class II 2012-12-04 2012-12-12 94080-4918 2013-12-24 2012-09-20 D-088-2013 South San Francisco 63243 Nationwide Genentech Inc Voluntary: Firm Initiated CA Miscalibrated and/or Defective Delivery System: Genentech has received complaints for Nutropin AQ NuSpin 10 & 20 reporting that the dose knob spun slowly and the injection took longer than usual (slow dose delivery). Letter Terminated Drugs United States Nutropin AQ NuSpin 10 (somatropin (rDNA origin) injection} 10 mg/ 2mL Genentech, South San Francisco, CA 94080, NDC#50242-0074-01 Lot # 936670 Exp. 09/30/13 1 DNA Way 12,791 syringes
Class II 2012-12-04 2012-12-12 94080-4918 2013-12-24 2012-09-20 D-087-2013 South San Francisco 63243 Nationwide Genentech Inc Voluntary: Firm Initiated CA Miscalibrated and/or Defective Delivery System: Genentech has received complaints for Nutropin AQ NuSpin 10 & 20 reporting that the dose knob spun slowly and the injection took longer than usual (slow dose delivery). Letter Terminated Drugs United States Nutropin AQ NuSpin 20 (somatropin (rDNA origin) injection} 20 mg/ 2mL Genentech, South San Francisco, CA 94080, NDC#50242-0076-01 Lot# 936674 Exp. 09/30/13 1 DNA Way 10,306 syringes
Class I 2017-12-01 2017-09-27 94080-4990 2019-01-11 2017-09-05 D-1175-2017 South San Francisco 78088 Nationwide in the USA Genentech, Inc. Voluntary: Firm Initiated CA Non-Sterility:presence of cracked or chipped glass at the neck of Sterile Water for Injection vials. Letter Terminated Drugs United States Activase (alteplase), 100 mg (58 million IU), Lyophilized vial, packaged with 1 single-dose 100ml vial of sterile water for Injection USP and one transfer device. Manufactured by: Genentech, Inc. 1 DNA Way, South San Francisco, CA 84080-4990. NDC: 50242-0085-27 Lot# 3128243, 3141239, EXP. 9/30/2018; 3166728, EXP. 2/28/2019. 1 Dna Way 16,882
Class II 2017-06-21 2017-06-28 37167-2099 2017-06-12 D-0922-2017 Smyrna 77517 Nationwide U.S.A. Vi-Jon, Inc. Voluntary: Firm Initiated TN Presence of foreign substance: glass particle Letter Completed Drugs United States Magnesium Citrate Saline Laxative Cherry Flavored, packaged in a) 10 fl. oz. (296 mL) bottle UPC: 072785104556 (Rexall's label), b) 10 fl. oz. (296 mL) bottle UPC: 041163254138 (Equaline's label), OTC, Distributed by a) Dolgencorp, LL, Goodlettsville, TN 37072 b)Supervalu, Inc. Eden Prairie, MN 55344 Lot: 0313246 1 Swan Dr 52,212 bottles
Class III 2017-10-05 2017-10-11 37167-2099 2017-09-13 D-0006-2018 Smyrna 78099 United States and Canada Vi-Jon, Inc. Voluntary: Firm Initiated TN Failed Impurities/Degradation Specifications. Letter Completed Drugs United States Magnesium Citrate Oral Solution Cherry Flavor, 1.745 g per fl oz, packaged in 10 fl. oz ( 296 mL) bottles, Labeled as a) CVS Health, Distributed by: CVS Pharmacy, Inc. One CVS Drive, Woonsocket, RI 02895; NDC 6984238038, UPC 050428418321, b) Meijer, Dist. By Meijer Distribution, Inc. Grand Rapids, MI 49544, NDC 4125038038, UPC 708820824294, c) Life Brand, Manufactured for: Shoppers Drug Mark/Pharmaprix Toronto, UPC 057800856405 Lot #: 0323439 Exp. 07/2018 1 Swan Dr 87,972 bottles
Class III 2017-11-10 2017-11-01 37167-2099 2017-09-13 D-0073-2018 Smyrna 78101 Nationwide. Vi-Jon, Inc. Voluntary: Firm Initiated TN Failed Stability Specifications Letter Completed Drugs United States Equate Beauty Clarifying Face Daily Moisturizer 4 fl. oz. (118 mL) Distributed by: Wal-Mart Stores, Inc. Bentonville, AR 72716, NDC # 4903594126 UPC # 681131150774. Studio 35 Beauty Daily Clarifying Face Daily Moisturizer 4 fl. oz. (118 mL) Distributed By: Walgreen Co. 200 Wilmot Rd., Deerfield, IL 60015, NDC # 0363001726, UPC # 049022889125 Lot #: 0342373, 0342656; Exp. 12/18 Lot #: 0307223 Exp. 12/17 Lot #: 0345361 Exp. 02/19 1 Swan Dr 67,532 oval plastic bottles
Class II 2017-11-19 2017-11-29 11232-1012 2017-08-22 D-0084-2018 Brooklyn 78257 NY Shata Trading, Inc. Voluntary: Firm Initiated NY Marketed Without an Approved NDA/ANDA Letter Ongoing Drugs United States No. 1 Faiza Beauty Cream Manufactured by: Poonia Brothers (Pak), Gujranwala, Distributed by Shata Traders Inc., 4, 19th Street, Brooklyn, NY 11232. UPC 5842109854239 Batch # 16L63 Lot # 223190 4 19th St 96 foil packages

Look at all those empty cells. Shame. Let's deal with these trouble makers.

Dropping Rows With Empty Values

If you're a Pandas fan, you're probably thinking "this is a job for .dropna()!" As it turns out, you may be more spot-on than you think - PySpark DataFrames also have a method for dropping N/A values, and it happens to be called .dropna()!

df = df.dropna()
display(df)

The keyword arguments will make you feel right at home:

  • how: accepts ‘any’ or ‘all’. If ‘any’, rows containing any null values will be dropped entirely (default). If ‘all’, only rows which are entirely empty will be dropped.
  • thresh: accepts an integer representing the "threshold" for how many empty cells a row must have before being dropped. Tresh is a middle ground between how=any and how=all. As a result, the presence of thresh will override how.
  • subset: accepts a list of column names. When a subset is present, N/A values will only be checked against the columns whose names are provided.

PySpark has no concept of inplace, so any methods we run against our DataFrames will only be applied if we set a DataFrame equal to the value of the affected DataFrame ( df = df.dropna() ).

My dataset is so dirty that running dropna() actually dropped all 500 rows! Yes, there is an empty cell in literally every row. Here's where we benefit from passing column names to subset:

df = df.dropna(subset=['postal_code', 'city', 'country', 'address_1'])
display(df)

Things are looking cleaner already:

classification center_classification_date report_date postal_code termination_date recall_initiation_date recall_number city more_code_info event_id distribution_pattern openfda_application_number openfda_brand_name openfda_dosage_form openfda_generic_name openfda_manufacturer_name openfda_product_ndc openfda_product_type openfda_route openfda_substance_name openfda_spl_id openfda_spl_set_id openfda_pharm_class_moa openfda_pharm_class_cs openfda_pharm_class_pe openfda_pharm_class_epc openfda_upc openfda_unii openfda_rxcui recalling_firm voluntary_mandated state reason_for_recall initial_firm_notification status product_type country product_description code_info address_1 address_2 product_quantity
Class I 2013-01-22T00:00:00.000+0000 2013-01-30T00:00:00.000+0000 94080-4990 2013-03-25T00:00:00.000+0000 2012-07-30T00:00:00.000+0000 D-135-2013 South San Francisco null 63258 Nationwide null null null null null null null null null null null null null null null null null null Genentech Inc Voluntary: Firm Initiated CA Presence of Particulate Matter: One lot of Bacteriostatic Water for Injection, USP diluent vials that were packed with Trastuzumab Kits for investigational use has the potential to contain glass particulates. E-Mail Terminated Drugs United States Trastuzumab Kit containing 1 vial Trastuzumab (nominal content: 440 mg); 1 20 mL multiple dose vial Bacteriostatic Water for Injection, USP (1.1% benzyl alcohol), Manufactured by Hospira, Inc., for Genentech, Inc., South San Francisco, CA 94080-4990, NDC 50242-903-69; and 1 Reconstitution Instructions per kit; Investigation Use only, Genentech, Inc., 1 DNA Way, South San Francisco, CA 94080-4990. Lot #: 454138, Exp 07/14 (containing Trastuzumab Lot #: 886618, Exp 12/14; and Bacteriostatic Water for Injection, USP Lot #: 08-368-DK-01, Exp 07/14) 1 DNA Way null 2,140 kits
Class II 2012-12-04T00:00:00.000+0000 2012-12-12T00:00:00.000+0000 94080-4918 2013-12-24T00:00:00.000+0000 2012-09-20T00:00:00.000+0000 D-088-2013 South San Francisco null 63243 Nationwide null null null null null null null null null null null null null null null null null null Genentech Inc Voluntary: Firm Initiated CA Miscalibrated and/or Defective Delivery System: Genentech has received complaints for Nutropin AQ NuSpin 10 & 20 reporting that the dose knob spun slowly and the injection took longer than usual (slow dose delivery). Letter Terminated Drugs United States Nutropin AQ NuSpin 10 (somatropin (rDNA origin) injection} 10 mg/ 2mL Genentech, South San Francisco, CA 94080, NDC#50242-0074-01 Lot # 936670 Exp. 09/30/13 1 DNA Way null 12,791 syringes
Class II 2012-12-04T00:00:00.000+0000 2012-12-12T00:00:00.000+0000 94080-4918 2013-12-24T00:00:00.000+0000 2012-09-20T00:00:00.000+0000 D-087-2013 South San Francisco null 63243 Nationwide null null null null null null null null null null null null null null null null null null Genentech Inc Voluntary: Firm Initiated CA Miscalibrated and/or Defective Delivery System: Genentech has received complaints for Nutropin AQ NuSpin 10 & 20 reporting that the dose knob spun slowly and the injection took longer than usual (slow dose delivery). Letter Terminated Drugs United States Nutropin AQ NuSpin 20 (somatropin (rDNA origin) injection} 20 mg/ 2mL Genentech, South San Francisco, CA 94080, NDC#50242-0076-01 Lot# 936674 Exp. 09/30/13 1 DNA Way null 10,306 syringes
Class I 2017-12-01T00:00:00.000+0000 2017-09-27T00:00:00.000+0000 94080-4990 2019-01-11T00:00:00.000+0000 2017-09-05T00:00:00.000+0000 D-1175-2017 South San Francisco null 78088 Nationwide in the USA null null null null null null null null null null null null null null null null null null Genentech, Inc. Voluntary: Firm Initiated CA Non-Sterility:presence of cracked or chipped glass at the neck of Sterile Water for Injection vials. Letter Terminated Drugs United States Activase (alteplase), 100 mg (58 million IU), Lyophilized vial, packaged with 1 single-dose 100ml vial of sterile water for Injection USP and one transfer device. Manufactured by: Genentech, Inc. 1 DNA Way, South San Francisco, CA 84080-4990. NDC: 50242-0085-27 Lot# 3128243, 3141239, EXP. 9/30/2018; 3166728, EXP. 2/28/2019. 1 Dna Way null 16,882
Class III 2014-05-06T00:00:00.000+0000 2014-05-14T00:00:00.000+0000 08889-3400 2015-09-02T00:00:00.000+0000 2014-01-13T00:00:00.000+0000 D-1311-2014 Whitehouse Station null 67344 Nationwide and Puerto Rico null null null null null null null null null null null null null null null null null null Merck & Co Inc Voluntary: Firm Initiated NJ This recall of LIPTRUZET is being initiated due to packaging defects. Some of the outer laminate foil pouches allowed in air and moisture, which could potentially decrease the effectiveness or change the characteristics of the product. Two or more of the following: Email, Fax, Letter, Press Release, Telephone, Visit Terminated Drugs United States Liptruzet (ezetimibe and atorvastin) tablets,10 mg/40 mg, a) 30 count blister (NDC 66582-322-30), b) 90 count blister (NDC 66582-322-54), Rx only, Merck Sharp & Dohme Corp., a subsidiary of Merck & Co., Inc, Whitehouse Station, NJ 08889 a) Liptruzet 10/40mg 30 count, 346304, 346342, 350256, 353107, 353108, 353109, Exp. 06/14; 353110, Exp.12/14 b) Liptruzet 10/40mg 90 count 346382, Exp. 06/14; 353186, Exp.12/14. 1 Merck Dr P.O. Box 100 36,493 cartons
Class III 2014-05-06T00:00:00.000+0000 2014-05-14T00:00:00.000+0000 08889-3400 2015-09-02T00:00:00.000+0000 2014-01-13T00:00:00.000+0000 D-1310-2014 Whitehouse Station null 67344 Nationwide and Puerto Rico null null null null null null null null null null null null null null null null null null Merck & Co Inc Voluntary: Firm Initiated NJ This recall of LIPTRUZET is being initiated due to packaging defects. Some of the outer laminate foil pouches allowed in air and moisture, which could potentially decrease the effectiveness or change the characteristics of the product. Two or more of the following: Email, Fax, Letter, Press Release, Telephone, Visit Terminated Drugs United States Liptruzet (ezetimibe and atorvastin) tablets,10 mg/20 mg, a) 30 count blister (NDC 66582-321-30), b) 90 count blister (NDC 66582-321-54), Rx only, Merck Sharp & Dohme Corp., a subsidiary of Merck & Co., Inc, Whitehouse Station, NJ 08889 a) Liptruzet 10/20mg 30 count, 346353, 346354, Exp. 06/14; 351809, Exp. 07/14, 360530, Exp. 12/14. b) Liptruzet 10/20mg, 90 count, 346387, Exp. 06/14; 353185, Exp.12/14. 1 Merck Dr P.O. Box 100 20,259 cartons
Class III 2014-05-06T00:00:00.000+0000 2014-05-14T00:00:00.000+0000 08889-3400 2015-09-02T00:00:00.000+0000 2014-01-13T00:00:00.000+0000 D-1312-2014 Whitehouse Station null 67344 Nationwide and Puerto Rico null null null null null null null null null null null null null null null null null null Merck & Co Inc Voluntary: Firm Initiated NJ Defective Container; some of the outer laminate foil pouches allowed in air and moisture, which could potentially decrease the effectiveness or change the characteristics of the product. Two or more of the following: Email, Fax, Letter, Press Release, Telephone, Visit Terminated Drugs United States Liptruzet (ezetimibe and atorvastin) tablets,10 mg/80 mg, a) 30 count blister (NDC 66582-323-30), b) 90 count blister (NDC 66582-323-54), Rx only, Merck Sharp & Dohme Corp., a subsidiary of Merck & Co., Inc, Whitehouse Station, NJ 08889 a) Liptruzet 10/80mg 30 count, 346379, 350257, Exp.07/14; 353591, Exp. 06/15. b) Liptruzet 10/80mg 90 count, 346383, 371072, Exp. 06/15. 1 Merck Dr P.O. Box 100 11,462 cartons
Class III 2014-05-06T00:00:00.000+0000 2014-05-14T00:00:00.000+0000 08889-3400 2015-09-02T00:00:00.000+0000 2014-01-13T00:00:00.000+0000 D-1309-2014 Whitehouse Station null 67344 Nationwide and Puerto Rico null null null null null null null null null null null null null null null null null null Merck & Co Inc Voluntary: Firm Initiated NJ This recall of LIPTRUZET is being initiated due to packaging defects. Some of the outer laminate foil pouches allowed in air and moisture, which could potentially decrease the effectiveness or change the characteristics of the product. Two or more of the following: Email, Fax, Letter, Press Release, Telephone, Visit Terminated Drugs United States Liptruzet (ezetimibe and atorvastin) tablets,10 mg/10 mg, a) 30 count blister (NDC 66582-320-30), b) 90 count blister (NDC 66582-320-54), Rx only, Merck Sharp & Dohme Corp., a subsidiary of Merck & Co., Inc, Whitehouse Station, NJ 08889 a) Liptruzet 10/10mg, 30 count, 346380,353183, Exp. 07/14 b) Liptruzet 10/10mg, 90 count 346381, 350264, Exp 07/14; 357235, Exp. 12/14. 1 Merck Dr P.O. Box 100 11,175 cartons
Class III 2013-07-22T00:00:00.000+0000 2013-07-31T00:00:00.000+0000 37167-2099 2014-06-17T00:00:00.000+0000 2013-07-18T00:00:00.000+0000 D-812-2013 Smyrna null 65761 Nationwide null null null null null null null null null null null null null null null null null null Vi-Jon, Inc. Voluntary: Firm Initiated TN Microbial Contamination of Non-Sterile Products: This product is being recalled because a stability sample was found to be contaminated with Burkholderia contaminans. Letter Terminated Drugs United States ALCOHOL FREE ANTISEPTIC (cetylpyridinium chloride) Mouth Rinse, 0.07%, mint, 33.8 FL OZ (1 L) bottle, Distributed By The Kroger Co., Cincinnati, OH 45202, UPC 0 41260 33606 5. Lot #: 0185828, Exp 12/14 1 Swan Dr. null 18,408 bottles

Replacing N/A Values

While N/A values can hurt our analysis, sometimes dropping these rows altogether is even more problematic. Consider the case where we want to gain insights to aggregated data: dropping entire rows will easily skew aggregate stats by removing records from the total pool and removing records which should have been counted. In these cases, fillna() is here to help.

fillna() accepts a value, and will replace any empty cells it finds with that value instead of dropping rows:

df = df.fillna(0)
display(df)

fillna() also accepts an optional subset argument, much like dropna().

Dropping Duplicate Rows

Another top-10 method for cleaning data is the dropduplicates() method. By itself, calling dropduplicates() on a DataFrame drops rows where all values in a row are duplicated by another row. Like the other two methods we've covered so far, dropduplicates() also accepts the subset argument:

df = df.dropduplicates(subset="recall_number")
display(df)

Selecting Data From a DataFrame

There's another problem: it has a lot of useless columns. No, seriously, check out what happens when I run df.printSchema():

root
 |-- classification: string (nullable = true)
 |-- center_classification_date: timestamp (nullable = true)
 |-- report_date: timestamp (nullable = true)
 |-- postal_code: string (nullable = true)
 |-- termination_date: timestamp (nullable = true)
 |-- recall_initiation_date: timestamp (nullable = true)
 |-- recall_number: string (nullable = true)
 |-- city: string (nullable = true)
 |-- more_code_info: string (nullable = true)
 |-- event_id: integer (nullable = true)
 |-- distribution_pattern: string (nullable = true)
 |-- openfda_application_number: string (nullable = true)
 |-- openfda_brand_name: string (nullable = true)
 |-- openfda_dosage_form: string (nullable = true)
 |-- openfda_generic_name: string (nullable = true)
 |-- openfda_manufacturer_name: string (nullable = true)
 |-- openfda_product_ndc: string (nullable = true)
 |-- openfda_product_type: string (nullable = true)
 |-- openfda_route: string (nullable = true)
 |-- openfda_substance_name: string (nullable = true)
 |-- openfda_spl_id: string (nullable = true)
 |-- openfda_spl_set_id: string (nullable = true)
 |-- openfda_pharm_class_moa: string (nullable = true)
 |-- openfda_pharm_class_cs: string (nullable = true)
 |-- openfda_pharm_class_pe: string (nullable = true)
 |-- openfda_pharm_class_epc: string (nullable = true)
 |-- openfda_upc: string (nullable = true)
 |-- openfda_unii: string (nullable = true)
 |-- openfda_rxcui: string (nullable = true)
 |-- recalling_firm: string (nullable = true)
 |-- voluntary_mandated: string (nullable = true)
 |-- state: string (nullable = true)
 |-- reason_for_recall: string (nullable = true)
 |-- initial_firm_notification: string (nullable = true)
 |-- status: string (nullable = true)
 |-- product_type: string (nullable = true)
 |-- country: string (nullable = true)
 |-- product_description: string (nullable = true)
 |-- code_info: string (nullable = true)
 |-- address_1: string (nullable = true)
 |-- address_2: string (nullable = true)
 |-- product_quantity: string (nullable = true)

I don't even know what some of these columns are. Even if I were to drop every column I didn't recognize, I'd still be wasting a ton of time going down this list. Luckily, we have select(). When we call select() on a DataFrame, we can explicitly call out which columns to keep:

df = df.select('classification', 'report_date', 'termination_date', 'recall_number', 'city', 'distribution_pattern', 'status', 'product_type', 'country', 'product_description', 'address_1', 'address_2', 'product_quantity')
display(df)

Our data is starting to look digestible by humans (unlike the presumably horrible drugs listed in our table):

classification report_date termination_date city distribution_pattern status product_type country product_description address_1 address_2 product_quantity
Class I 2013-01-30T00:00:00.000+0000 2013-03-25T00:00:00.000+0000 South San Francisco Nationwide Terminated Drugs United States Trastuzumab Kit containing 1 vial Trastuzumab (nominal content: 440 mg); 1 20 mL multiple dose vial Bacteriostatic Water for Injection, USP (1.1% benzyl alcohol), Manufactured by Hospira, Inc., for Genentech, Inc., South San Francisco, CA 94080-4990, NDC 50242-903-69; and 1 Reconstitution Instructions per kit; Investigation Use only, Genentech, Inc., 1 DNA Way, South San Francisco, CA 94080-4990. 1 DNA Way null 2,140 kits
Class II 2012-12-12T00:00:00.000+0000 2013-12-24T00:00:00.000+0000 South San Francisco Nationwide Terminated Drugs United States Nutropin AQ NuSpin 10 (somatropin (rDNA origin) injection} 10 mg/ 2mL Genentech, South San Francisco, CA 94080, NDC#50242-0074-01 1 DNA Way null 12,791 syringes
Class II 2012-12-12T00:00:00.000+0000 2013-12-24T00:00:00.000+0000 South San Francisco Nationwide Terminated Drugs United States Nutropin AQ NuSpin 20 (somatropin (rDNA origin) injection} 20 mg/ 2mL Genentech, South San Francisco, CA 94080, NDC#50242-0076-01 1 DNA Way null 10,306 syringes
Class I 2017-09-27T00:00:00.000+0000 2019-01-11T00:00:00.000+0000 South San Francisco Nationwide in the USA Terminated Drugs United States Activase (alteplase), 100 mg (58 million IU), Lyophilized vial, packaged with 1 single-dose 100ml vial of sterile water for Injection USP and one transfer device. Manufactured by: Genentech, Inc. 1 DNA Way, South San Francisco, CA 84080-4990. NDC: 50242-0085-27 1 Dna Way null 16,882
Class III 2014-05-14T00:00:00.000+0000 2015-09-02T00:00:00.000+0000 Whitehouse Station Nationwide and Puerto Rico Terminated Drugs United States Liptruzet (ezetimibe and atorvastin) tablets,10 mg/40 mg, a) 30 count blister (NDC 66582-322-30), b) 90 count blister (NDC 66582-322-54), Rx only, Merck Sharp & Dohme Corp., a subsidiary of Merck & Co., Inc, Whitehouse Station, NJ 08889 1 Merck Dr P.O. Box 100 36,493 cartons
Class III 2014-05-14T00:00:00.000+0000 2015-09-02T00:00:00.000+0000 Whitehouse Station Nationwide and Puerto Rico Terminated Drugs United States Liptruzet (ezetimibe and atorvastin) tablets,10 mg/20 mg, a) 30 count blister (NDC 66582-321-30), b) 90 count blister (NDC 66582-321-54), Rx only, Merck Sharp & Dohme Corp., a subsidiary of Merck & Co., Inc, Whitehouse Station, NJ 08889 1 Merck Dr P.O. Box 100 20,259 cartons
Class III 2014-05-14T00:00:00.000+0000 2015-09-02T00:00:00.000+0000 Whitehouse Station Nationwide and Puerto Rico Terminated Drugs United States Liptruzet (ezetimibe and atorvastin) tablets,10 mg/80 mg, a) 30 count blister (NDC 66582-323-30), b) 90 count blister (NDC 66582-323-54), Rx only, Merck Sharp & Dohme Corp., a subsidiary of Merck & Co., Inc, Whitehouse Station, NJ 08889 1 Merck Dr P.O. Box 100 11,462 cartons
Class III 2014-05-14T00:00:00.000+0000 2015-09-02T00:00:00.000+0000 Whitehouse Station Nationwide and Puerto Rico Terminated Drugs United States Liptruzet (ezetimibe and atorvastin) tablets,10 mg/10 mg, a) 30 count blister (NDC 66582-320-30), b) 90 count blister (NDC 66582-320-54), Rx only, Merck Sharp & Dohme Corp., a subsidiary of Merck & Co., Inc, Whitehouse Station, NJ 08889 1 Merck Dr P.O. Box 100 11,175 cartons
Class III 2013-07-31T00:00:00.000+0000 2014-06-17T00:00:00.000+0000 Smyrna Nationwide Terminated Drugs United States ALCOHOL FREE ANTISEPTIC (cetylpyridinium chloride) Mouth Rinse, 0.07%, mint, 33.8 FL OZ (1 L) bottle, Distributed By The Kroger Co., Cincinnati, OH 45202, UPC 0 41260 33606 5. 1 Swan Dr. null 18,408 bottles

Filtering Data

Sometimes we'll work with data which has a number of outliers which skew our results. Other times, we'll only want to work with rows which belong to a particular subset of information. Here's how we'd clean our data to only contain cases which occurred in South San Francisco:

df = df.filter(df.city == "South San Francisco")

The contents of filter() will always be a conditional where we compare the values in a certain column against an intended value. The easiest way to access a DataFrame's column is by using the df.column_name syntax.

In our case, we're comparing a column holding strings against a provided string, South San Francisco (for numerical values, we could use the greater-than and less-than operators as well).

Filtering by String Values

Aside from filtering by a perfect match, there are plenty of other powerful ways to filter by strings in PySpark. Take a look:

  • df.filter(df.city.contains('San Francisco'): Returns rows where strings of a column contain a provided substring. In our example, filtering by rows which contain the substring "San Francisco" would be a good way to get all rows in San Francisco, instead of just "South San Francisco".
  • df.filter(df.city.startswith('San')): Returns rows where a string starts with a provided substring.
  • df.filter(df.city.endswith('ice')): Returns rows where a string starts with a provided substring.
  • df.filter(df.city.isNull()): Returns rows where values in a provided column are null.
  • df.filter(df.city.isNotNull()): Opposite of the above.
  • df.filter(df.city.like('San%')): Performs a SQL-like query containing the LIKE clause.
  • df.filter(df.city.rlike('[A-Z]*ice$')): Performs a regexp filter.
  • df.filter(df.city.isin('San Francisco', 'Los Angeles')): Looks for rows where the string value of a column matches any of the provided strings exactly.

Filtering by Date Values

In addition to filtering by strings, we can also filter by columns where the values are stored as dates or datetimes. Perhaps the most useful way to filter dates is by using the between() method, which allows us to find results within a certain date range. Here we find all the results which were reported in the years 2013 and 2014:

df = df.filter(df.report_date.between('2013-01-01 00:00:00','2015-01-11 00:00:00'))

Filtering via where()

.where() is another way of achieving the same effect that we accomplished with filter():

df = df.where((df.city == "South San Francisco"))

Sorting Our DataFrame

Finally, there are a few ways we can sort the data in our DataFrame to our liking. My preferred method is by using orderBy():

df = df.orderBy('report_date', ascending=False)

orderBy() sorts results in the way we'd expect: string columns are sorted alphabetically, numerical columns are sorted chronologically, etc. The ascending keyword parameter allows us to display these results descending order when ascending is equal to False.

Notice how we passed 'report_date' as a string, as opposed to df.report_date? PySpark allows us to do this for some reason. This won't work when filtering, however, because df = df.filter("city" == "South San Francisco") looks like we're trying to evaluate a string against a string.

A shittier way of achieving the same effect is by using the sort() method. sort() is shittier than orderBy() because the syntax is uglier, and because it requires us to import something just to list our results in descending order:

from pyspark.sql.functions import desc

df = df.sort(desc("published_at"))

Renaming Columns

We have just one more item on our list of spring cleaning items: naming columns! An easy way to rename one column at a time is with the withColumnRenamed() method:

df = df.withColumnRenamed('recall_number', 'id')

We can also change multiple columns at once:

df = df.selectExpr("product_type as type", "product_description as product_description")

That's all the time we have for today folks. Join us next time when we explore the magical world of transforming DataFrames in PySpark.