CS 422 - Data Mining Homework 1: Imputation and PCA

Assigned: February 19, 2023

Homework 1

Due: March 05, 2023

Please complete the assigned problems to the best of your abilities. Ensure that the work you do is entirely your own, external resources are only used as permitted by the instructor, and all allowed sources are given proper credit for non-original content.

1 Recitation Exercises

These exercises are found in: Introduction to Data Mining, 2nd Edition, by Pang-Ning Tan, Michael Steinbach, Anuj Karpatne, and Vipin Kumar.

1.1 Chapter 1

Exercises: 1

1.2 Chapter 2

Exercises: 2, 7, 15, 16, 17, 18, 19

2 Practicum Problems

These problems will primarily reference the lecture materials and the examples given in class using Python. It is suggested that a Jupyter/IPython notebook be used for the programmatic components.

2.1 Problem 1

Load the titanic sample dataset from the Seaborn library into Python using a Pandas dataframe, and visualize the dataset. Create a distribution plot (histogram) of survival conditional on age and gender. Using visual inspection alone, what is the basic relationship between these variables? Do the results make sense? Why?

2.2 Problem 2

Load the auto-mpg sample dataset from the UCI Machine Learning Repository (auto-mpg.data) into Python using a Pandas dataframe. The horsepower feature has a few missing values marked with a ? - replace these with a NaN from NumPy, and calculate summary statistics for each numerical column (Hint: Use an Imputer from Scikit). Replace the missing values with the overall mean, median, and mode (Hint: Pandas makes this easy) - and calculate the variance of the feature after each replacement. Which imputation results in the lowest variance? Why? Is there a different method of imputing values that would match the distribution more accurately? Describe your method.
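The mechanics of this problem can be sketched as follows, using Pandas alone. The snake_case column names, the helper names `load_auto_mpg` and `impute_and_compare`, and the five-row `SAMPLE` text are all illustrative assumptions; the sample rows are made up so the sketch runs offline and are not taken from the real file.

```python
import io

import numpy as np
import pandas as pd

# Column order follows the UCI description; the snake_case spelling is our choice.
COLUMNS = ["mpg", "cylinders", "displacement", "horsepower", "weight",
           "acceleration", "model_year", "origin", "car_name"]


def load_auto_mpg(source) -> pd.DataFrame:
    """Read a whitespace-delimited auto-mpg.data file (path, URL, or file-like)."""
    df = pd.read_csv(source, sep=r"\s+", names=COLUMNS, header=None)
    # Missing horsepower is recorded as "?"; swap in NaN and make the column numeric.
    df["horsepower"] = pd.to_numeric(df["horsepower"].replace("?", np.nan))
    return df


def impute_and_compare(series: pd.Series) -> dict:
    """Return the variance of the series after mean, median, and mode imputation."""
    fills = {
        "mean": series.mean(),
        "median": series.median(),
        "mode": series.mode().iloc[0],  # first mode if there are ties
    }
    return {name: series.fillna(value).var() for name, value in fills.items()}


# Made-up rows in the same layout as auto-mpg.data (NOT real records).
SAMPLE = '''\
18.0 8 307.0 130.0 3504. 12.0 70 1 "car a"
15.0 8 350.0 165.0 3693. 11.5 70 1 "car b"
25.0 4 98.0 ? 2046. 19.0 71 1 "car c"
24.0 4 113.0 95.0 2372. 15.0 70 3 "car d"
22.0 6 198.0 95.0 2833. 15.5 70 1 "car e"
'''

df = load_auto_mpg(io.StringIO(SAMPLE))
print(df.describe())                       # summary statistics, numeric columns only
print(impute_and_compare(df["horsepower"]))
```

For the actual assignment, pass the path of the downloaded auto-mpg.data file to `load_auto_mpg` instead of the in-memory sample.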

2.3 Problem 3

Load the iris sample dataset into Python using a Pandas dataframe. Perform a PCA using the Scikit-learn decomposition module, and provide the percentage of variance explained by each of the Principal Components. Compare this to the percentage of variance explained by each of the original features. What do you observe?
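A minimal sketch of the two quantities being compared, assuming the copy of iris bundled with scikit-learn and assuming the features are left unstandardized (the problem does not say whether to scale them first). The variable names are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Load iris as a Pandas dataframe (four numeric feature columns).
X = load_iris(as_frame=True).data

# Fit PCA keeping all four components.
pca = PCA(n_components=4).fit(X)

# Share of total variance carried by each principal component.
pc_share = pd.Series(
    pca.explained_variance_ratio_,
    index=[f"PC{i + 1}" for i in range(pca.n_components_)],
)

# Share of total variance carried by each original feature.
feature_var = X.var()  # sample variance (ddof=1), matching scikit-learn's PCA
feature_share = feature_var / feature_var.sum()

print((100 * pc_share).round(2))
print((100 * feature_share).round(2))
```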

2.4 Problem 4

Use Matplotlib to plot a projection of each feature onto the 1st Principal Component from the above problem against the original feature itself. Which pair of features shows a closer relationship to PC1 than the others? Why? (Hint: Think in terms of cosine distance/the angle θ). Calculate the correlation coefficient between the pair of features you have selected and their projections onto PC1 - do the results agree with the visual inspection?
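One possible reading of this problem is sketched below: the projection is taken to be the PC1 score of each sample (the centered data multiplied by the first component vector), plotted against each original feature in turn. That interpretation is an assumption, as are the variable names. Because the feature axes and PC1 are unit vectors, each entry of the PC1 vector is the cosine of the angle between that feature axis and PC1.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris(as_frame=True).data
pca = PCA(n_components=4).fit(X)

w1 = pca.components_[0]           # unit-length direction of PC1
Xc = X - X.mean()                 # center each feature, as PCA does internally
scores = Xc.to_numpy() @ w1       # PC1 coordinate of every sample

# Correlation of each original feature with the PC1 scores.
correlations = {
    name: float(np.corrcoef(X[name], scores)[0, 1]) for name in X.columns
}

fig, axes = plt.subplots(1, 4, figsize=(16, 4))
for ax, name, cos_theta in zip(axes, X.columns, w1):
    ax.scatter(X[name], scores, s=10)
    ax.set_xlabel(name)
    ax.set_ylabel("PC1 score")
    ax.set_title(f"cos(theta)={cos_theta:.2f}, r={correlations[name]:.2f}")
fig.tight_layout()
```

In a notebook the figure renders inline; the `correlations` dictionary holds the coefficients for comparison with the scatter plots.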

2.5 Problem 5

Calculate the total variance of the original features and the total variance of the four eigenvectors from the above problem. What can you say about the corresponding values? If we wished to capture > 95% of the variance of the original data, how many principal components would we be selecting? How does this number correspond to the number of dimensions we are reducing our features to?
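The quantities asked for here can be computed along the following lines. This is a sketch under the same assumptions as before (scikit-learn's bundled iris data, unstandardized features); "variance of the eigenvectors" is read as the variance along each principal component, which scikit-learn exposes as `explained_variance_`.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris(as_frame=True).data
pca = PCA().fit(X)  # keep every component

# Total variance of the original features (sum of per-column sample variances).
total_feature_var = X.var(ddof=1).sum()

# Total variance along the principal components (sum of covariance eigenvalues).
total_pc_var = pca.explained_variance_.sum()

# Smallest number of leading components whose cumulative share exceeds 95%.
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_components_95 = int(np.argmax(cumulative > 0.95)) + 1

print(total_feature_var, total_pc_var)
print(cumulative)
print(n_components_95)
```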
