SS2864 Assignment 2
===============================================================
Note 1: All assignments are to be completed using R markdown and submitted online. Please submit both the rmd file and the output. Make sure that your R markdown file can be compiled; otherwise 50\% of the marks will be deducted.
Note 2: Practicing is the best way to learn. So please complete the assignment questions independently.
Note 3: Please submit before the deadline. Late submissions will not be accepted. We do not mean to be strict; the fact is that we do not have enough manpower to deal with issues related to late submissions.
Note 4: The assignment is due on Wednesday, Feb 15, 2023. Please submit your .rmd and output (html or pdf or word) to OWL.
===============================================================
Question 1. This question deals with cleaning up datasets. Load the dataset file "auto-mpg-messy.csv" into R and answer the following questions. (20 points)
- Take a first look at the dataset using the functions str, summary, head, tail, and View.
- Change the variables that should be numeric (but show up as character) to numeric.
- There will be NAs in the data frame you obtained in step 2. Print out all observations with NAs and then delete them from the dataset.
- Some variables have outliers (due to typos or coding conventions). Identify all observations with outliers and then remove them.
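
As a minimal sketch of the cleaning steps above, the following uses a small invented data frame in place of "auto-mpg-messy.csv". The column names (mpg, horsepower, car_name), the messy entries ("?", "n/a", 1000), and the plausibility range for mpg are all assumptions made for illustration; the real file may look different.

```r
# Toy stand-in for the messy file; columns and values are invented.
messy <- data.frame(
  mpg        = c("18", "?", "26.5", "1000", "31"),
  horsepower = c("130", "95", "n/a", "88", "65"),
  car_name   = c("ford torino", "honda civic", "ford pinto",
                 "honda accord", "fiat 128"),
  stringsAsFactors = FALSE
)

str(messy)  # mpg and horsepower show up as character

# Convert the character columns to numeric; entries that are not numbers become NA.
num_cols <- c("mpg", "horsepower")
messy[num_cols] <- lapply(messy[num_cols],
                          function(x) suppressWarnings(as.numeric(x)))

# Print the rows that contain at least one NA, then drop them.
na_rows <- messy[!complete.cases(messy), ]
print(na_rows)
no_na <- na.omit(messy)

# Flag implausible values using an assumed plausible range for mpg.
is_outlier <- no_na$mpg < 5 | no_na$mpg > 60
print(no_na[is_outlier, ])
clean <- no_na[!is_outlier, ]
```

On real data, summary() and a boxplot of each numeric variable are a reasonable way to choose the outlier rule rather than fixing a range in advance.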
Question 2. Now you should have a clean dataset to work with from Q1. Do the following. (25 points)
- Apply the summary function to the clean data you obtained in Question 1 and state the differences between this result and the summary result in Question 1.
- Extract the observations for "honda" and "ford" cars in the dataset. Count how many cars of each of the two types there are in the dataset.
- Find the observations with the top 5 (highest) and bottom 5 (lowest) mpg.
- Find the mean mpg of all autos and calculate the proportion of observations that have mpg higher than the mean.
- Change the variable cylinder to a factor. Then find the mean mpg of autos with 4, 6 and 8 cylinders.
...
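
The base R operations asked for in Question 2 might be sketched as follows. The data frame here is invented, the column names (mpg, cylinders, car_name) are assumptions, and the make is assumed to be the first word of car_name; with only eight rows the sketch takes the top and bottom 3 rather than 5.

```r
# Invented clean data; the real column names may differ.
autos <- data.frame(
  mpg       = c(18, 31, 26, 15, 36, 22, 14, 29),
  cylinders = c(8, 4, 4, 8, 4, 6, 8, 4),
  car_name  = c("ford torino", "honda civic", "ford pinto",
                "chevrolet impala", "honda accord", "ford maverick",
                "buick estate", "fiat 128"),
  stringsAsFactors = FALSE
)

# Assume the make is the first word of the car name.
make <- sub(" .*", "", autos$car_name)
honda_ford <- autos[make %in% c("honda", "ford"), ]
counts <- table(make[make %in% c("honda", "ford")])

# Highest and lowest mpg via order().
top3    <- head(autos[order(autos$mpg, decreasing = TRUE), ], 3)
bottom3 <- head(autos[order(autos$mpg), ], 3)

# Mean mpg and the proportion of rows above it
# (the mean of a logical vector is the proportion of TRUEs).
mean_mpg   <- mean(autos$mpg)
prop_above <- mean(autos$mpg > mean_mpg)

# Convert to factor, then compute the mean mpg within each level.
autos$cylinders <- factor(autos$cylinders)
mpg_by_cyl <- tapply(autos$mpg, autos$cylinders, mean)
```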
Question 3. Repeat Question 2 using Tidyverse functions. (25 points)
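
One possible shape for the Tidyverse version, assuming the dplyr package is installed, is sketched below. The toy data, the column names, and the rule that the make is the first word of car_name are all assumptions for illustration, and the sketch takes the top 3 rather than 5 because the toy data is small.

```r
library(dplyr)

# Invented clean data; the real column names may differ.
autos <- data.frame(
  mpg       = c(18, 31, 26, 15, 36, 22, 14, 29),
  cylinders = c(8, 4, 4, 8, 4, 6, 8, 4),
  car_name  = c("ford torino", "honda civic", "ford pinto",
                "chevrolet impala", "honda accord", "ford maverick",
                "buick estate", "fiat 128"),
  stringsAsFactors = FALSE
)

# Count honda and ford cars (make assumed to be the first word of the name).
counts <- autos %>%
  mutate(make = sub(" .*", "", car_name)) %>%
  filter(make %in% c("honda", "ford")) %>%
  count(make)

# Rows with the highest and lowest mpg.
top3    <- autos %>% slice_max(mpg, n = 3)
bottom3 <- autos %>% slice_min(mpg, n = 3)

# Mean mpg and the proportion of rows above the mean.
overall <- autos %>%
  summarise(mean_mpg = mean(mpg), prop_above = mean(mpg > mean(mpg)))

# Mean mpg per cylinder count, with cylinders stored as a factor.
by_cyl <- autos %>%
  mutate(cylinders = factor(cylinders)) %>%
  group_by(cylinders) %>%
  summarise(mean_mpg = mean(mpg))
```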
Question 4. Apply the functions filter, arrange, mutate, select, group_by, and summarize to the dataset you used in Q6 of Assignment 1 and observe the results. Note: if your dataset has fewer than 5 variables, please select another one with more variables for this question. (20 points)
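
For illustration only, here is how the six verbs could be chained on R's built-in mtcars data, which stands in for your own dataset. It assumes dplyr is installed; the derived column kpl (kilometres per litre) is a made-up example of a mutate step.

```r
library(dplyr)

result <- mtcars %>%
  filter(cyl %in% c(4, 6)) %>%      # keep only 4- and 6-cylinder cars
  mutate(kpl = mpg * 0.4251) %>%    # miles per US gallon to km per litre
  select(mpg, kpl, cyl, wt) %>%     # keep a subset of the columns
  arrange(desc(mpg))                # sort from highest to lowest mpg

by_cyl <- result %>%
  group_by(cyl) %>%
  summarize(n = n(), mean_kpl = mean(kpl))

print(head(result))
print(by_cyl)
```

Running each step on its own, before adding the next verb to the pipe, makes it easier to see what that verb changes.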
Question 5. SS2864 has classes on MWF. Suppose that the term starts on Jan 9, 2023 and ends on Apr 10, 2023, inclusive. Use date/time-related functions in R to calculate how many classes there are in the term. (10 points)
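
One way to approach this in base R is to generate every date in the term and keep only the class days. The helper name count_weekdays below is hypothetical, and the sketch assumes that every Monday, Wednesday, and Friday in the range is a class day, ignoring holidays and reading week.

```r
# Count the dates between start and end (inclusive) that fall on the given
# weekdays. Weekdays use the POSIXlt convention: 0 = Sunday, ..., 6 = Saturday.
count_weekdays <- function(start, end, wdays) {
  days <- seq(as.Date(start), as.Date(end), by = "day")
  sum(as.POSIXlt(days)$wday %in% wdays)
}

# Monday = 1, Wednesday = 3, Friday = 5.
n_classes <- count_weekdays("2023-01-09", "2023-04-10", c(1, 3, 5))
print(n_classes)
```

The numeric wday field is used instead of weekdays() because the day names returned by weekdays() depend on the locale.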