Assignment 2: Part 2. Data Cleaning
The purpose of this assignment is to clean a data set so that it can be analyzed. The dataset is a modified version of the Craft Beer Bar Sales Kaggle dataset. The original dataset is available here: https://www.kaggle.com/datasets/podsyp/sales-in-craft-beer-bar.
Directions
- rename this file as A2_W23_DataCleaning_CUID.ipynb where CUID Is your Carleton University Identification
- pay attention to where your file is saved as you will need to upload it to Brightspace
- follow the instructions below
- when completed, upload your .ipynb file to Brightspace.
The data set should contain the following attributes
- Product_code: unique product identifier
- Vendor_code: Manufacturer's name
- Name: SKU
- Retail_price: Catalog price
- Country_of_Origin: Manufacturer country
- Size: sale item size
- Vendor: Manufacturer's name
- ABV: alcohol by volume
- Product_type: the type of product
Please type your name here:
Please type your student number here:
import pandas as pd import numpy as np
Step 1a. (1 pt) Read the CraftBeerV1.csv file into a dataframe named df and list its columns. This can be done in the same cell.
Step 1b. (2 pts) Remove column(s) from your dataframe that do not exist in the data dictionary above. Use as many cells as you need and ensure you save the results back to your dataframe df. Then, display the data frame. This can be done in the same cell.
Step 1c. (2 pts) Determine the number of unique values for each column in the dataframe. (1 pt) If an attribute has only a single unique value, remove it from the dataframe (1 pt). Show all steps. Use as many cells as you need and ensure you save the results back to your dataframe df.
Step 1d. (2 pts) Compare the Vendor_code column with the Vendor column (1 pt). If they contain the same data, delete one of the columns (1 pt). Show all steps (ie. prove that the two columns do/do not contain the same data). Use as many cells as you need and ensure you save the results back to your dataframe df.
Step 1.e. (3 pts) Determine whether there are any duplicate rows (1 pt). If there are duplicates, display them to the screen (1 pt). If duplicate rows exist, remove them from the dataset and reset the index of the dataframe (1 pt). Show all steps. Use as many cells as you need and ensure you save the results (if there any) back to your dataframe df.
Step 2a. (2 pts) Display basic statistical information for the dataframe's numeric attributes. Ensure that the numeric attributes are shown vertically on the left and the various statistical attributes are columns.
Step 2b. (4 pts) Look through the min column to determine if there are any 0s. If there are, think about whether a 0 is a valid value for the attribute in question. If you think it's an error, first determine how many occurrences of 0 exist for the attribute in the dataframe (1 pt). Next display only the rows that have a 0 for the attribute in question (1 pt). Assign Nan using the numpy library to all occurrences of 0 for the attribute under question (1 pt). Finally, display value counts for every unique value of the attribute in question ensuring that NaN gets counted (1 pt).
Step 2c. (3 pts) Create box plots for Size, and ABV.
Step 2d. (14 pts) Visually inspect each boxplot and determine if there are any statistical outliers for each of the two attributes. You can assume a statistical outlier is any data point that exceeds the upper or lower whisker. For each attribute that contains outlier(s):
- calculate and display the IQR
- calculate and display the necessary min and/or max values that are the thresholds for determining outliers (use the 75th + 1.5 IQR and 25th - 1.5 IQR formula discussed in class)
- use the above information to display the row(s) that contain the outlier
Step 2e. (6 pts). For each of the attribute(s) that had outliers, would it make sense to impute values to replace the outliers? Why or why not? Answer in the cell below. Be sure to identify which attribute you are referring to in answering the questions.
Answer 2e here:
Step 3a. (3 pts) Determine how many missing values there are in each column and how many rows have missing values. Show all your steps.
Step 3b. (2 pts) Remove rows from the dataframe using the .dropNA() method that have more than 3 missing values. Use as many cells as you need and ensure you save the results back to your dataframe df.