There's something about being a data engineer that makes it impossible to clearly convey thoughts in an articulate manner. It seems inevitable that every well-meaning Spark tutorial is destined to devolve into walls of incomprehensible code with minimal explanation. This is even apparent in StackOverflow, where simple questions are regularly met with absurdly unnecessary solutions (stop making UDFs for everything!) Anyway, what I'm trying to say is it takes a lot of guts to click into these things, and here you are. I appreciate you.
In our last episode, we covered some Spark basics, played with Databricks, and started loading data into DataFrames. Now we're diving deeper into DataFrames by touching on every data enthusiast's favorite topic: cleaning data. To keep it interesting, I picked the filthiest data set I could find: FDA drug enforcement!
Go ahead, download the dataset(JSON) and try loading it onto Databricks!
Dropping Rows
We don't have to look at this data for long to start noticing some holes:
classification | center_classification_date | report_date | postal_code | termination_date | recall_initiation_date | recall_number | city | more_code_info | event_id | distribution_pattern | openfda_application_number | openfda_brand_name | openfda_dosage_form | openfda_generic_name | openfda_manufacturer_name | openfda_product_ndc | openfda_product_type | openfda_route | openfda_substance_name | openfda_spl_id | openfda_spl_set_id | openfda_pharm_class_moa | openfda_pharm_class_cs | openfda_pharm_class_pe | openfda_pharm_class_epc | openfda_upc | openfda_unii | openfda_rxcui | recalling_firm | voluntary_mandated | state | reason_for_recall | initial_firm_notification | status | product_type | country | product_description | code_info | address_1 | address_2 | product_quantity |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Class II | 2014-04-22 | 2014-04-30 | 2017-04-24 | 2014-04-08 | D-1253-2014 | 67870 | US Nationwide including Puerto Rico and US Virgin Islands. 65 federal government accounts received the affected product directly from Pfizer Inc | Pfizer Inc. | Voluntary: Firm Initiated | Presence of Foriegn Tablets/Capsules: A single Tikosyn¿ (dofetilide) 250 mcg capsule was discovered in one bottle of Tikosyn¿ (dofetilide) 500 mcg (0.5 mg). | Letter | Terminated | Drugs | Tikosyn¿ (dofetilide) 500 mcg (0.5mg) capsules, 60-count bottle, Rx only Pfizer Labs, Division of Pfizer Inc. NY, NY 10017, NDC 00069-5820-60 | H48496, H48497; Exp 05/15 | 23,064 bottles | |||||||||||||||||||||||||
Class I | 2013-01-22 | 2013-01-30 | 94080-4990 | 2013-03-25 | 2012-07-30 | D-135-2013 | South San Francisco | 63258 | Nationwide | Genentech Inc | Voluntary: Firm Initiated | CA | Presence of Particulate Matter: One lot of Bacteriostatic Water for Injection, USP diluent vials that were packed with Trastuzumab Kits for investigational use has the potential to contain glass particulates. | Terminated | Drugs | United States | Trastuzumab Kit containing 1 vial Trastuzumab (nominal content: 440 mg); 1 20 mL multiple dose vial Bacteriostatic Water for Injection, USP (1.1% benzyl alcohol), Manufactured by Hospira, Inc., for Genentech, Inc., South San Francisco, CA 94080-4990, NDC 50242-903-69; and 1 Reconstitution Instructions per kit; Investigation Use only, Genentech, Inc., 1 DNA Way, South San Francisco, CA 94080-4990. | Lot #: 454138, Exp 07/14 (containing Trastuzumab Lot #: 886618, Exp 12/14; and Bacteriostatic Water for Injection, USP Lot #: 08-368-DK-01, Exp 07/14) | 1 DNA Way | 2,140 kits | |||||||||||||||||||||
Class II | 2012-12-04 | 2012-12-12 | 94080-4918 | 2013-12-24 | 2012-09-20 | D-088-2013 | South San Francisco | 63243 | Nationwide | Genentech Inc | Voluntary: Firm Initiated | CA | Miscalibrated and/or Defective Delivery System: Genentech has received complaints for Nutropin AQ NuSpin 10 & 20 reporting that the dose knob spun slowly and the injection took longer than usual (slow dose delivery). | Letter | Terminated | Drugs | United States | Nutropin AQ NuSpin 10 (somatropin (rDNA origin) injection} 10 mg/ 2mL Genentech, South San Francisco, CA 94080, NDC#50242-0074-01 | Lot # 936670 Exp. 09/30/13 | 1 DNA Way | 12,791 syringes | ||||||||||||||||||||
Class II | 2012-12-04 | 2012-12-12 | 94080-4918 | 2013-12-24 | 2012-09-20 | D-087-2013 | South San Francisco | 63243 | Nationwide | Genentech Inc | Voluntary: Firm Initiated | CA | Miscalibrated and/or Defective Delivery System: Genentech has received complaints for Nutropin AQ NuSpin 10 & 20 reporting that the dose knob spun slowly and the injection took longer than usual (slow dose delivery). | Letter | Terminated | Drugs | United States | Nutropin AQ NuSpin 20 (somatropin (rDNA origin) injection} 20 mg/ 2mL Genentech, South San Francisco, CA 94080, NDC#50242-0076-01 | Lot# 936674 Exp. 09/30/13 | 1 DNA Way | 10,306 syringes | ||||||||||||||||||||
Class I | 2017-12-01 | 2017-09-27 | 94080-4990 | 2019-01-11 | 2017-09-05 | D-1175-2017 | South San Francisco | 78088 | Nationwide in the USA | Genentech, Inc. | Voluntary: Firm Initiated | CA | Non-Sterility:presence of cracked or chipped glass at the neck of Sterile Water for Injection vials. | Letter | Terminated | Drugs | United States | Activase (alteplase), 100 mg (58 million IU), Lyophilized vial, packaged with 1 single-dose 100ml vial of sterile water for Injection USP and one transfer device. Manufactured by: Genentech, Inc. 1 DNA Way, South San Francisco, CA 84080-4990. NDC: 50242-0085-27 | Lot# 3128243, 3141239, EXP. 9/30/2018; 3166728, EXP. 2/28/2019. | 1 Dna Way | 16,882 | ||||||||||||||||||||
Class II | 2017-06-21 | 2017-06-28 | 37167-2099 | 2017-06-12 | D-0922-2017 | Smyrna | 77517 | Nationwide U.S.A. | Vi-Jon, Inc. | Voluntary: Firm Initiated | TN | Presence of foreign substance: glass particle | Letter | Completed | Drugs | United States | Magnesium Citrate Saline Laxative Cherry Flavored, packaged in a) 10 fl. oz. (296 mL) bottle UPC: 072785104556 (Rexall's label), b) 10 fl. oz. (296 mL) bottle UPC: 041163254138 (Equaline's label), OTC, Distributed by a) Dolgencorp, LL, Goodlettsville, TN 37072 b)Supervalu, Inc. Eden Prairie, MN 55344 | Lot: 0313246 | 1 Swan Dr | 52,212 bottles | |||||||||||||||||||||
Class III | 2017-10-05 | 2017-10-11 | 37167-2099 | 2017-09-13 | D-0006-2018 | Smyrna | 78099 | United States and Canada | Vi-Jon, Inc. | Voluntary: Firm Initiated | TN | Failed Impurities/Degradation Specifications. | Letter | Completed | Drugs | United States | Magnesium Citrate Oral Solution Cherry Flavor, 1.745 g per fl oz, packaged in 10 fl. oz ( 296 mL) bottles, Labeled as a) CVS Health, Distributed by: CVS Pharmacy, Inc. One CVS Drive, Woonsocket, RI 02895; NDC 6984238038, UPC 050428418321, b) Meijer, Dist. By Meijer Distribution, Inc. Grand Rapids, MI 49544, NDC 4125038038, UPC 708820824294, c) Life Brand, Manufactured for: Shoppers Drug Mark/Pharmaprix Toronto, UPC 057800856405 | Lot #: 0323439 Exp. 07/2018 | 1 Swan Dr | 87,972 bottles | |||||||||||||||||||||
Class III | 2017-11-10 | 2017-11-01 | 37167-2099 | 2017-09-13 | D-0073-2018 | Smyrna | 78101 | Nationwide. | Vi-Jon, Inc. | Voluntary: Firm Initiated | TN | Failed Stability Specifications | Letter | Completed | Drugs | United States | Equate Beauty Clarifying Face Daily Moisturizer 4 fl. oz. (118 mL) Distributed by: Wal-Mart Stores, Inc. Bentonville, AR 72716, NDC # 4903594126 UPC # 681131150774. Studio 35 Beauty Daily Clarifying Face Daily Moisturizer 4 fl. oz. (118 mL) Distributed By: Walgreen Co. 200 Wilmot Rd., Deerfield, IL 60015, NDC # 0363001726, UPC # 049022889125 | Lot #: 0342373, 0342656; Exp. 12/18 Lot #: 0307223 Exp. 12/17 Lot #: 0345361 Exp. 02/19 | 1 Swan Dr | 67,532 oval plastic bottles | |||||||||||||||||||||
Class II | 2017-11-19 | 2017-11-29 | 11232-1012 | 2017-08-22 | D-0084-2018 | Brooklyn | 78257 | NY | Shata Trading, Inc. | Voluntary: Firm Initiated | NY | Marketed Without an Approved NDA/ANDA | Letter | Ongoing | Drugs | United States | No. 1 Faiza Beauty Cream Manufactured by: Poonia Brothers (Pak), Gujranwala, Distributed by Shata Traders Inc., 4, 19th Street, Brooklyn, NY 11232. UPC 5842109854239 | Batch # 16L63 Lot # 223190 | 4 19th St | 96 foil packages |
Look at all those empty cells. Shame. Let's deal with these troublemakers.
Dropping Rows With Empty Values
If you're a Pandas fan, you're probably thinking "this is a job for .dropna()
!" As it turns out, you may be more spot-on than you think - PySpark DataFrames also have a method for dropping N/A values, and it happens to be called .dropna()
!
df = df.dropna()
display(df)
The keyword arguments will make you feel right at home:
- how: accepts ‘any’ or ‘all’. If ‘any’, rows containing any null values will be dropped entirely (default). If ‘all’, only rows that are entirely empty will be dropped.
- thresh: accepts an integer representing the "threshold" for how many empty cells a row must have before being dropped. Tresh is a middle ground between
how=any
andhow=all
. As a result, the presence of thresh will override how. - subset: accepts a list of column names. When a subset is present, N/A values will only be checked against the columns whose names are provided.
PySpark has no concept of inplace, so any methods we run against our DataFrames will only be applied if we set a DataFrame equal to the value of the affected DataFrame ( df = df.dropna()
).
My dataset is so dirty that running dropna()
actually dropped all 500 rows! Yes, there is an empty cell in literally every row. Here's where we benefit from passing column names to subset:
Things are looking cleaner already:
classification | center_classification_date | report_date | postal_code | termination_date | recall_initiation_date | recall_number | city | more_code_info | event_id | distribution_pattern | openfda_application_number | openfda_brand_name | openfda_dosage_form | openfda_generic_name | openfda_manufacturer_name | openfda_product_ndc | openfda_product_type | openfda_route | openfda_substance_name | openfda_spl_id | openfda_spl_set_id | openfda_pharm_class_moa | openfda_pharm_class_cs | openfda_pharm_class_pe | openfda_pharm_class_epc | openfda_upc | openfda_unii | openfda_rxcui | recalling_firm | voluntary_mandated | state | reason_for_recall | initial_firm_notification | status | product_type | country | product_description | code_info | address_1 | address_2 | product_quantity |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Class I | 2013-01-22T00:00:00.000+0000 | 2013-01-30T00:00:00.000+0000 | 94080-4990 | 2013-03-25T00:00:00.000+0000 | 2012-07-30T00:00:00.000+0000 | D-135-2013 | South San Francisco | null | 63258 | Nationwide | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | Genentech Inc | Voluntary: Firm Initiated | CA | Presence of Particulate Matter: One lot of Bacteriostatic Water for Injection, USP diluent vials that were packed with Trastuzumab Kits for investigational use has the potential to contain glass particulates. | Terminated | Drugs | United States | Trastuzumab Kit containing 1 vial Trastuzumab (nominal content: 440 mg); 1 20 mL multiple dose vial Bacteriostatic Water for Injection, USP (1.1% benzyl alcohol), Manufactured by Hospira, Inc., for Genentech, Inc., South San Francisco, CA 94080-4990, NDC 50242-903-69; and 1 Reconstitution Instructions per kit; Investigation Use only, Genentech, Inc., 1 DNA Way, South San Francisco, CA 94080-4990. | Lot #: 454138, Exp 07/14 (containing Trastuzumab Lot #: 886618, Exp 12/14; and Bacteriostatic Water for Injection, USP Lot #: 08-368-DK-01, Exp 07/14) | 1 DNA Way | null | 2,140 kits | |
Class II | 2012-12-04T00:00:00.000+0000 | 2012-12-12T00:00:00.000+0000 | 94080-4918 | 2013-12-24T00:00:00.000+0000 | 2012-09-20T00:00:00.000+0000 | D-088-2013 | South San Francisco | null | 63243 | Nationwide | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | Genentech Inc | Voluntary: Firm Initiated | CA | Miscalibrated and/or Defective Delivery System: Genentech has received complaints for Nutropin AQ NuSpin 10 & 20 reporting that the dose knob spun slowly and the injection took longer than usual (slow dose delivery). | Letter | Terminated | Drugs | United States | Nutropin AQ NuSpin 10 (somatropin (rDNA origin) injection} 10 mg/ 2mL Genentech, South San Francisco, CA 94080, NDC#50242-0074-01 | Lot # 936670 Exp. 09/30/13 | 1 DNA Way | null | 12,791 syringes |
Class II | 2012-12-04T00:00:00.000+0000 | 2012-12-12T00:00:00.000+0000 | 94080-4918 | 2013-12-24T00:00:00.000+0000 | 2012-09-20T00:00:00.000+0000 | D-087-2013 | South San Francisco | null | 63243 | Nationwide | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | Genentech Inc | Voluntary: Firm Initiated | CA | Miscalibrated and/or Defective Delivery System: Genentech has received complaints for Nutropin AQ NuSpin 10 & 20 reporting that the dose knob spun slowly and the injection took longer than usual (slow dose delivery). | Letter | Terminated | Drugs | United States | Nutropin AQ NuSpin 20 (somatropin (rDNA origin) injection} 20 mg/ 2mL Genentech, South San Francisco, CA 94080, NDC#50242-0076-01 | Lot# 936674 Exp. 09/30/13 | 1 DNA Way | null | 10,306 syringes |
Class I | 2017-12-01T00:00:00.000+0000 | 2017-09-27T00:00:00.000+0000 | 94080-4990 | 2019-01-11T00:00:00.000+0000 | 2017-09-05T00:00:00.000+0000 | D-1175-2017 | South San Francisco | null | 78088 | Nationwide in the USA | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | Genentech, Inc. | Voluntary: Firm Initiated | CA | Non-Sterility:presence of cracked or chipped glass at the neck of Sterile Water for Injection vials. | Letter | Terminated | Drugs | United States | Activase (alteplase), 100 mg (58 million IU), Lyophilized vial, packaged with 1 single-dose 100ml vial of sterile water for Injection USP and one transfer device. Manufactured by: Genentech, Inc. 1 DNA Way, South San Francisco, CA 84080-4990. NDC: 50242-0085-27 | Lot# 3128243, 3141239, EXP. 9/30/2018; 3166728, EXP. 2/28/2019. | 1 Dna Way | null | 16,882 |
Class III | 2014-05-06T00:00:00.000+0000 | 2014-05-14T00:00:00.000+0000 | 08889-3400 | 2015-09-02T00:00:00.000+0000 | 2014-01-13T00:00:00.000+0000 | D-1311-2014 | Whitehouse Station | null | 67344 | Nationwide and Puerto Rico | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | Merck & Co Inc | Voluntary: Firm Initiated | NJ | This recall of LIPTRUZET is being initiated due to packaging defects. Some of the outer laminate foil pouches allowed in air and moisture, which could potentially decrease the effectiveness or change the characteristics of the product. | Two or more of the following: Email, Fax, Letter, Press Release, Telephone, Visit | Terminated | Drugs | United States | Liptruzet (ezetimibe and atorvastin) tablets,10 mg/40 mg, a) 30 count blister (NDC 66582-322-30), b) 90 count blister (NDC 66582-322-54), Rx only, Merck Sharp & Dohme Corp., a subsidiary of Merck & Co., Inc, Whitehouse Station, NJ 08889 | a) Liptruzet 10/40mg 30 count, 346304, 346342, 350256, 353107, 353108, 353109, Exp. 06/14; 353110, Exp.12/14 b) Liptruzet 10/40mg 90 count 346382, Exp. 06/14; 353186, Exp.12/14. | 1 Merck Dr | P.O. Box 100 | 36,493 cartons |
Class III | 2014-05-06T00:00:00.000+0000 | 2014-05-14T00:00:00.000+0000 | 08889-3400 | 2015-09-02T00:00:00.000+0000 | 2014-01-13T00:00:00.000+0000 | D-1310-2014 | Whitehouse Station | null | 67344 | Nationwide and Puerto Rico | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | Merck & Co Inc | Voluntary: Firm Initiated | NJ | This recall of LIPTRUZET is being initiated due to packaging defects. Some of the outer laminate foil pouches allowed in air and moisture, which could potentially decrease the effectiveness or change the characteristics of the product. | Two or more of the following: Email, Fax, Letter, Press Release, Telephone, Visit | Terminated | Drugs | United States | Liptruzet (ezetimibe and atorvastin) tablets,10 mg/20 mg, a) 30 count blister (NDC 66582-321-30), b) 90 count blister (NDC 66582-321-54), Rx only, Merck Sharp & Dohme Corp., a subsidiary of Merck & Co., Inc, Whitehouse Station, NJ 08889 | a) Liptruzet 10/20mg 30 count, 346353, 346354, Exp. 06/14; 351809, Exp. 07/14, 360530, Exp. 12/14. b) Liptruzet 10/20mg, 90 count, 346387, Exp. 06/14; 353185, Exp.12/14. | 1 Merck Dr | P.O. Box 100 | 20,259 cartons |
Class III | 2014-05-06T00:00:00.000+0000 | 2014-05-14T00:00:00.000+0000 | 08889-3400 | 2015-09-02T00:00:00.000+0000 | 2014-01-13T00:00:00.000+0000 | D-1312-2014 | Whitehouse Station | null | 67344 | Nationwide and Puerto Rico | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | Merck & Co Inc | Voluntary: Firm Initiated | NJ | Defective Container; some of the outer laminate foil pouches allowed in air and moisture, which could potentially decrease the effectiveness or change the characteristics of the product. | Two or more of the following: Email, Fax, Letter, Press Release, Telephone, Visit | Terminated | Drugs | United States | Liptruzet (ezetimibe and atorvastin) tablets,10 mg/80 mg, a) 30 count blister (NDC 66582-323-30), b) 90 count blister (NDC 66582-323-54), Rx only, Merck Sharp & Dohme Corp., a subsidiary of Merck & Co., Inc, Whitehouse Station, NJ 08889 | a) Liptruzet 10/80mg 30 count, 346379, 350257, Exp.07/14; 353591, Exp. 06/15. b) Liptruzet 10/80mg 90 count, 346383, 371072, Exp. 06/15. | 1 Merck Dr | P.O. Box 100 | 11,462 cartons |
Class III | 2014-05-06T00:00:00.000+0000 | 2014-05-14T00:00:00.000+0000 | 08889-3400 | 2015-09-02T00:00:00.000+0000 | 2014-01-13T00:00:00.000+0000 | D-1309-2014 | Whitehouse Station | null | 67344 | Nationwide and Puerto Rico | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | Merck & Co Inc | Voluntary: Firm Initiated | NJ | This recall of LIPTRUZET is being initiated due to packaging defects. Some of the outer laminate foil pouches allowed in air and moisture, which could potentially decrease the effectiveness or change the characteristics of the product. | Two or more of the following: Email, Fax, Letter, Press Release, Telephone, Visit | Terminated | Drugs | United States | Liptruzet (ezetimibe and atorvastin) tablets,10 mg/10 mg, a) 30 count blister (NDC 66582-320-30), b) 90 count blister (NDC 66582-320-54), Rx only, Merck Sharp & Dohme Corp., a subsidiary of Merck & Co., Inc, Whitehouse Station, NJ 08889 | a) Liptruzet 10/10mg, 30 count, 346380,353183, Exp. 07/14 b) Liptruzet 10/10mg, 90 count 346381, 350264, Exp 07/14; 357235, Exp. 12/14. | 1 Merck Dr | P.O. Box 100 | 11,175 cartons |
Class III | 2013-07-22T00:00:00.000+0000 | 2013-07-31T00:00:00.000+0000 | 37167-2099 | 2014-06-17T00:00:00.000+0000 | 2013-07-18T00:00:00.000+0000 | D-812-2013 | Smyrna | null | 65761 | Nationwide | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | Vi-Jon, Inc. | Voluntary: Firm Initiated | TN | Microbial Contamination of Non-Sterile Products: This product is being recalled because a stability sample was found to be contaminated with Burkholderia contaminans. | Letter | Terminated | Drugs | United States | ALCOHOL FREE ANTISEPTIC (cetylpyridinium chloride) Mouth Rinse, 0.07%, mint, 33.8 FL OZ (1 L) bottle, Distributed By The Kroger Co., Cincinnati, OH 45202, UPC 0 41260 33606 5. | Lot #: 0185828, Exp 12/14 | 1 Swan Dr. | null | 18,408 bottles |
Replacing N/A Values
While N/A values can hurt our analysis, sometimes dropping these rows altogether is even more problematic. Consider the case where we want to gain insights into aggregated data: dropping entire rows will easily skew aggregate stats by removing records from the total pool and removing records that should have been counted. In these cases, fillna()
is here to help.
fillna()
accepts a value, and will replace any empty cells it finds with that value instead of dropping rows:
fillna()
also accepts an optional subset argument, much like dropna()
.
Dropping Duplicate Rows
Another top-10 method for cleaning data is the dropduplicates()
method. By itself, calling dropduplicates()
on a DataFrame drops rows where all values in a row are duplicated by another row. Like the other two methods we've covered so far, dropduplicates()
also accepts the subset argument:
Selecting Data From a DataFrame
There's another problem: it has a lot of useless columns. No, seriously, check out what happens when I run df.printSchema()
:
root
|-- classification: string (nullable = true)
|-- center_classification_date: timestamp (nullable = true)
|-- report_date: timestamp (nullable = true)
|-- postal_code: string (nullable = true)
|-- termination_date: timestamp (nullable = true)
|-- recall_initiation_date: timestamp (nullable = true)
|-- recall_number: string (nullable = true)
|-- city: string (nullable = true)
|-- more_code_info: string (nullable = true)
|-- event_id: integer (nullable = true)
|-- distribution_pattern: string (nullable = true)
|-- openfda_application_number: string (nullable = true)
|-- openfda_brand_name: string (nullable = true)
|-- openfda_dosage_form: string (nullable = true)
|-- openfda_generic_name: string (nullable = true)
|-- openfda_manufacturer_name: string (nullable = true)
|-- openfda_product_ndc: string (nullable = true)
|-- openfda_product_type: string (nullable = true)
|-- openfda_route: string (nullable = true)
|-- openfda_substance_name: string (nullable = true)
|-- openfda_spl_id: string (nullable = true)
|-- openfda_spl_set_id: string (nullable = true)
|-- openfda_pharm_class_moa: string (nullable = true)
|-- openfda_pharm_class_cs: string (nullable = true)
|-- openfda_pharm_class_pe: string (nullable = true)
|-- openfda_pharm_class_epc: string (nullable = true)
|-- openfda_upc: string (nullable = true)
|-- openfda_unii: string (nullable = true)
|-- openfda_rxcui: string (nullable = true)
|-- recalling_firm: string (nullable = true)
|-- voluntary_mandated: string (nullable = true)
|-- state: string (nullable = true)
|-- reason_for_recall: string (nullable = true)
|-- initial_firm_notification: string (nullable = true)
|-- status: string (nullable = true)
|-- product_type: string (nullable = true)
|-- country: string (nullable = true)
|-- product_description: string (nullable = true)
|-- code_info: string (nullable = true)
|-- address_1: string (nullable = true)
|-- address_2: string (nullable = true)
|-- product_quantity: string (nullable = true)
I don't even know what some of these columns are. Even if I were to drop every column I didn't recognize, I'd still be wasting a ton of time going down this list. Luckily, we have select()
. When we call select()
on a DataFrame, we can explicitly call out which columns to keep:
Our data is starting to look digestible by humans (unlike the presumably horrible drugs listed in our table):
classification | report_date | termination_date | city | distribution_pattern | status | product_type | country | product_description | address_1 | address_2 | product_quantity |
---|---|---|---|---|---|---|---|---|---|---|---|
Class I | 2013-01-30T00:00:00.000+0000 | 2013-03-25T00:00:00.000+0000 | South San Francisco | Nationwide | Terminated | Drugs | United States | Trastuzumab Kit containing 1 vial Trastuzumab (nominal content: 440 mg); 1 20 mL multiple dose vial Bacteriostatic Water for Injection, USP (1.1% benzyl alcohol), Manufactured by Hospira, Inc., for Genentech, Inc., South San Francisco, CA 94080-4990, NDC 50242-903-69; and 1 Reconstitution Instructions per kit; Investigation Use only, Genentech, Inc., 1 DNA Way, South San Francisco, CA 94080-4990. | 1 DNA Way | null | 2,140 kits |
Class II | 2012-12-12T00:00:00.000+0000 | 2013-12-24T00:00:00.000+0000 | South San Francisco | Nationwide | Terminated | Drugs | United States | Nutropin AQ NuSpin 10 (somatropin (rDNA origin) injection} 10 mg/ 2mL Genentech, South San Francisco, CA 94080, NDC#50242-0074-01 | 1 DNA Way | null | 12,791 syringes |
Class II | 2012-12-12T00:00:00.000+0000 | 2013-12-24T00:00:00.000+0000 | South San Francisco | Nationwide | Terminated | Drugs | United States | Nutropin AQ NuSpin 20 (somatropin (rDNA origin) injection} 20 mg/ 2mL Genentech, South San Francisco, CA 94080, NDC#50242-0076-01 | 1 DNA Way | null | 10,306 syringes |
Class I | 2017-09-27T00:00:00.000+0000 | 2019-01-11T00:00:00.000+0000 | South San Francisco | Nationwide in the USA | Terminated | Drugs | United States | Activase (alteplase), 100 mg (58 million IU), Lyophilized vial, packaged with 1 single-dose 100ml vial of sterile water for Injection USP and one transfer device. Manufactured by: Genentech, Inc. 1 DNA Way, South San Francisco, CA 84080-4990. NDC: 50242-0085-27 | 1 Dna Way | null | 16,882 |
Class III | 2014-05-14T00:00:00.000+0000 | 2015-09-02T00:00:00.000+0000 | Whitehouse Station | Nationwide and Puerto Rico | Terminated | Drugs | United States | Liptruzet (ezetimibe and atorvastin) tablets,10 mg/40 mg, a) 30 count blister (NDC 66582-322-30), b) 90 count blister (NDC 66582-322-54), Rx only, Merck Sharp & Dohme Corp., a subsidiary of Merck & Co., Inc, Whitehouse Station, NJ 08889 | 1 Merck Dr | P.O. Box 100 | 36,493 cartons |
Class III | 2014-05-14T00:00:00.000+0000 | 2015-09-02T00:00:00.000+0000 | Whitehouse Station | Nationwide and Puerto Rico | Terminated | Drugs | United States | Liptruzet (ezetimibe and atorvastin) tablets,10 mg/20 mg, a) 30 count blister (NDC 66582-321-30), b) 90 count blister (NDC 66582-321-54), Rx only, Merck Sharp & Dohme Corp., a subsidiary of Merck & Co., Inc, Whitehouse Station, NJ 08889 | 1 Merck Dr | P.O. Box 100 | 20,259 cartons |
Class III | 2014-05-14T00:00:00.000+0000 | 2015-09-02T00:00:00.000+0000 | Whitehouse Station | Nationwide and Puerto Rico | Terminated | Drugs | United States | Liptruzet (ezetimibe and atorvastin) tablets,10 mg/80 mg, a) 30 count blister (NDC 66582-323-30), b) 90 count blister (NDC 66582-323-54), Rx only, Merck Sharp & Dohme Corp., a subsidiary of Merck & Co., Inc, Whitehouse Station, NJ 08889 | 1 Merck Dr | P.O. Box 100 | 11,462 cartons |
Class III | 2014-05-14T00:00:00.000+0000 | 2015-09-02T00:00:00.000+0000 | Whitehouse Station | Nationwide and Puerto Rico | Terminated | Drugs | United States | Liptruzet (ezetimibe and atorvastin) tablets,10 mg/10 mg, a) 30 count blister (NDC 66582-320-30), b) 90 count blister (NDC 66582-320-54), Rx only, Merck Sharp & Dohme Corp., a subsidiary of Merck & Co., Inc, Whitehouse Station, NJ 08889 | 1 Merck Dr | P.O. Box 100 | 11,175 cartons |
Class III | 2013-07-31T00:00:00.000+0000 | 2014-06-17T00:00:00.000+0000 | Smyrna | Nationwide | Terminated | Drugs | United States | ALCOHOL FREE ANTISEPTIC (cetylpyridinium chloride) Mouth Rinse, 0.07%, mint, 33.8 FL OZ (1 L) bottle, Distributed By The Kroger Co., Cincinnati, OH 45202, UPC 0 41260 33606 5. | 1 Swan Dr. | null | 18,408 bottles |
Filtering Data
Sometimes we'll work with data that has a number of outliers which skew our results. Other times, we'll only want to work with rows that belong to a particular subset of information. Here's how we'd clean our data to only contain cases that occurred in South San Francisco:
The contents of filter()
will always be a conditional where we compare the values in a certain column against an intended value. The easiest way to access a DataFrame's column is by using the df.column_name
syntax.
In our case, we're comparing a column holding strings against a provided string, South San Francisco (for numerical values, we could use the greater-than and less-than operators as well).
Filtering by String Values
Aside from filtering by a perfect match, there are plenty of other powerful ways to filter by strings in PySpark. Take a look:
df.filter(df.city.contains('San Francisco')
: Returns rows where strings of a column contain a provided substring. In our example, filtering by rows that contain the substring "San Francisco" would be a good way to get all rows in San Francisco, instead of just "South San Francisco".df.filter(df.city.startswith('San'))
: Returns rows where a string starts with a provided substring.df.filter(df.city.endswith('ice'))
: Returns rows where a string starts with a provided substring.df.filter(df.city.isNull())
: Returns rows where values in a provided column are null.df.filter(df.city.isNotNull())
: Opposite of the above.df.filter(df.city.like('San%'))
: Performs a SQL-like query containing the LIKE clause.df.filter(df.city.rlike('[A-Z]*ice$'))
: Performs a regexp filter.df.filter(df.city.isin('San Francisco', 'Los Angeles'))
: Looks for rows where the string value of a column matches any of the provided strings exactly.
Filtering by Date Values
In addition to filtering by strings, we can also filter by columns where the values are stored as dates or datetimes. Perhaps the most useful way to filter dates is by using the between()
method, which allows us to find results within a certain date range. Here we find all the results which were reported in the years 2013 and 2014:
Filtering via where()
.where()
is another way of achieving the same effect that we accomplished with filter()
:
Sorting Our DataFrame
Finally, there are a few ways we can sort the data in our DataFrame to our liking. My preferred method is by using orderBy()
:
orderBy()
sorts results in the way we'd expect: string columns are sorted alphabetically, numerical columns are sorted chronologically, etc. The ascending keyword parameter allows us to display these results in descending order when ascending is equal to False.
df = df.filter("city" == "South San Francisco")
looks like we're trying to evaluate a string against a string.A shittier way of achieving the same effect is by using the sort()
method. sort()
is shittier than orderBy()
because the syntax is uglier, and because it requires us to import something just to list our results in descending order:
Renaming Columns
We have just one more item on our list of spring cleaning items: naming columns! An easy way to rename one column at a time is with the withColumnRenamed()
method:
We can also change multiple columns at once:
That's all the time we have for today folks. Join us next time when we explore the magical world of transforming DataFrames in PySpark.