CS109A Introduction to Data Science - Homework 2: kNN and Linear Regression

CS109A Introduction to Data Science

Homework 2: kNN and Linear Regression

Instructions

To submit your assignment follow the instructions given in Canvas.
Plots should be legible and interpretable without having to refer to the code that generated them, including labels for the $x$- and $y$-axes as well as a descriptive title and/or legend when appropriate.
When asked to interpret a visualization, do not simply describe it (e.g., "the curve has a steep slope up"), but instead explain what you think the plot means.
The use of 'hard-coded' values to try and pass tests rather than solving problems programmatically will not receive credit.
The use of extremely inefficient or error-prone code (e.g., copy-pasting nearly identical commands rather than looping) may result in only partial credit.
We have tried to include all the libraries you may need to do the assignment in the imports cell provided below. Please get course staff approval before importing any additional 3rd party libraries.
Enable scrolling output on cells with very long output.
Feel free to add additional code or markdown cells as needed.
Ensure your code runs top to bottom without error and passes all tests by restarting the kernel and running all cells. This is how the notebook will be evaluated (note that this can take a few minutes).

In [ ]:

# RUN THIS CELL
import os
import pathlib
working_dir = pathlib.Path().absolute()
os.chdir(working_dir)

# Import libraries
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import operator
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# pandas tricks for better display
pd.options.display.max_columns = 50  
pd.options.display.max_rows = 500     
pd.options.display.max_colwidth = 100
pd.options.display.precision = 3

Notebook Contents

PART 1 [60 pts]: Predicting the selling price of cars on CarDekho.com
Part 1 Overview
Question 1: Exploratory data analysis (EDA) [10 pts]
Question 2: k-Nearest Neighbors [25 pts]
Question 3: Simple linear regression [25 pts]
PART 2 [40 pts]: Analysis of Simulated ASEC Data
Part 2 Overview
Question 4: Investigating trends [25 pts]
Question 5: Calculate the Gini coefficient [10 pts]
Question 6: Critiquing the simulated data [5 pts]

About this homework

This assignment is the first in which we'll go through the process of loading a dataset, splitting it into train and test sets, performing some preprocessing, and finally fitting some models and evaluating our results.

We have two different datasets:

PART 1 car data from cardekho.com
PART 2 simulated income data created from the Annual Social and Economic (ASEC) Supplement

Part 1 explores two simple methods for prediction: k-nearest neighbors regression (kNN), a non-parametric method, and linear regression, a parametric method.

Part 2 focuses on EDA and visualization.

PART 1 [60 pts]: Predicting the selling price of cars on CarDekho.com

Overview

According to its website, CarDekho.com is India's leading car search venture. Its website and app carry rich automotive content such as expert reviews, detailed specs and prices, comparisons, as well as videos and pictures of all car brands and models available in India. Each car has a current selling price, which is the price for buying a used car on this site, and an MRP, which is the retail price of the car. These two prices differ depending on factors such as brand, make year, mileage, condition, etc.

Dataset

The dataset contains 601 used cars and is available as `data/car_dekho_full.csv`. It contains the following columns:

Year - make year (year the car was made)
Current_Selling_Price - current price of a used car on CarDekho.com (in lakhs)
MRP - maximum retail price of the car when it was new (in lakhs)
Kms_Driven - number of kilometers the car has been driven

NOTE: 1 lakh is 100,000 Rupees in the Indian numbering system. Also, kilometers are used as a measure of distance instead of miles.

Objective

Using kNN and linear regression we will predict the `Current_Selling_Price` from the other features available in this dataset.

Question 1: Exploratory data analysis (EDA) [10 pts]

To reach the goal of predicting the Current_Selling_Price, start by inspecting the dataset using Exploratory Data Analysis (EDA).

Load the dataset, inspect it, and answer the following questions:

Q1.1

Identify all variables in the dataset. Which ones are quantitative, and which ones are categorical? If you think any variables are categorical, briefly explain why.

Points: 2

Type your answer here, replacing this text.

In [ ]:

# your code here

Q1.2

What are the means and standard deviations for Current_Selling_Price and MRP?

Store your results in mean_csp, mean_mrp, std_csp, and std_mrp to match the variable names used in the provided print function.

Points: 2
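
When computing standard deviations, it may help to know that pandas and NumPy use different defaults. The sketch below runs on a small made-up series (not the CarDekho data) and only illustrates the difference. The variable names are hypothetical, and which convention the autograder expects is not stated in this handout.

```python
import numpy as np
import pandas as pd

# Toy values for illustration only; not taken from the assignment data.
prices = pd.Series([1.0, 2.0, 3.0, 4.0])

mean_price = prices.mean()

# pandas defaults to the sample standard deviation (ddof=1).
std_sample = prices.std()

# NumPy defaults to the population standard deviation (ddof=0).
std_population = np.std(prices.to_numpy())

print(f"mean={mean_price:.4f}, sample std={std_sample:.4f}, population std={std_population:.4f}")
```

Passing `ddof` explicitly to either call makes the choice visible to anyone reading the notebook.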

In [ ]:

mean_csp = ...
mean_mrp = ...
std_csp = ...
std_mrp = ...

In [ ]:

# Be certain to name your variables mean_csp, mean_mrp, std_csp, std_mrp
# to match the variable names used in the provided print function
print(
    "\n"
    f"The mean Current Selling Price is {mean_csp:.4f} lakhs\n"
    f"The mean MRP is {mean_mrp:.4f} lakhs\n"
    f"The Standard Deviation of Current Selling Price is {std_csp:.4f}\n"
    f"The Standard Deviation of MRP is {std_mrp:.4f}"
)

In [ ]:

grader.check("q1.2")

Q1.3

What is the range of kilometers that the cars have been driven? Store your answer in the variable km_range.

Hint: 'range' here refers to the difference between the highest and lowest recorded kilometers driven.

Points: 2

In [ ]:

# your code here
km_range = ...

In [ ]:

# check your result
print(f"the range of kilometers is {km_range:,.2f}")

In [ ]:

grader.check("q1.3")

Q1.4

The goal in this section is to identify the best feature to use to predict our response, Current_Selling_Price.

Make a scatter plot of each feature against our response and examine any relationships.
Which is the predictor that seems to best predict Current_Selling_Price? Provide an interpretation of the plots that justifies your choice.

Points: 4

Type your answer here, replacing this text.

In [ ]:

# your code here

Question 2: k-Nearest Neighbors [25 pts]

We will begin our modeling with k-Nearest Neighbors (kNN) regression, using sklearn for both preprocessing and model fitting.

Q2.1

Split the dataset into a train and test set with 75% training data and 25% testing data, using argument random_state = 109. The resulting splits should be stored in the variables X_train, X_test, y_train, y_test.

Points: 2
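
As a rough illustration of how `train_test_split` behaves, here is a minimal sketch on synthetic data. The DataFrame `toy`, its column names, and the variable names are made up for this example and are not part of the assignment.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the real assignment uses data/car_dekho_full.csv.
rng = np.random.default_rng(0)
toy = pd.DataFrame({"feature": rng.uniform(0, 10, size=100)})
toy["target"] = 2.0 * toy["feature"] + rng.normal(0, 1, size=100)

X = toy[["feature"]]   # 2D predictor frame, as sklearn estimators expect
y = toy["target"]      # 1D response

# 75% / 25% split; fixing random_state makes the split reproducible.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, train_size=0.75, random_state=109
)

print(X_tr.shape, X_te.shape, y_tr.shape, y_te.shape)
```

Keeping the predictor as a two-dimensional object (a DataFrame with one column) avoids reshaping later when fitting sklearn models.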

In [ ]:

# your code here

In [ ]:

grader.check("q2.1")

Q2.2

Now, we will fit several kNN regression models for various values of $k$ to identify the best parameterization.

Q2.2.1

For each $k \in [1, 2, 3, 5, 7, 10, 50, 100]$, fit a k-NN regression model to the training data with response Current_Selling_Price and the predictor MRP.

For each $k$, make a plot of response vs. predictor (8 plots in total, arranged in a 4×2 grid).
Each of your 8 plots should clearly show (a) the training data and the testing data in different colors, (b) the model prediction, and (c) title, legend, and axis labels.
NOTE: Feel free to use the plt.subplots() code we provide to specify your 4x2 grid, unless you first try that and decide that you have a clearer, cleaner way of accomplishing this task.

Points: 7

In [ ]:

# fig, axs = plt.subplots(4,2, figsize=(12, 14))
# fig.subplots_adjust(hspace = .5, wspace=.3)
# your code here

Q2.2.2

Plot the training and test $M S E$ values as a function of $k$ (1 plot in total).

Points: 4
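
One way to organize the bookkeeping for this kind of comparison is to loop over the candidate values of $k$ and store the errors in dictionaries. The sketch below does this on synthetic data and omits plotting. All data and variable names here are hypothetical, and it is an illustration of the pattern rather than a reference solution.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic one-predictor data (assumption: not the CarDekho dataset).
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200).reshape(-1, 1)
y = np.sin(x).ravel() + rng.normal(0, 0.3, size=200)
x_tr, x_te, y_tr, y_te = train_test_split(x, y, train_size=0.75, random_state=109)

k_values = [1, 2, 3, 5, 7, 10, 50, 100]
train_mse, test_mse = {}, {}
for k in k_values:
    # Fit one model per k and record the error on both splits.
    knn = KNeighborsRegressor(n_neighbors=k).fit(x_tr, y_tr)
    train_mse[k] = mean_squared_error(y_tr, knn.predict(x_tr))
    test_mse[k] = mean_squared_error(y_te, knn.predict(x_te))

# The k whose test MSE is smallest on this particular split.
k_with_lowest_test_mse = min(test_mse, key=test_mse.get)
print(k_with_lowest_test_mse, test_mse[k_with_lowest_test_mse])
```

Storing the errors keyed by $k$ means the same dictionaries can feed both a plot and a lookup of the minimum, without refitting any model.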

In [ ]:

# your code here

Q2.2.3

Find the best model based on the test $M S E$ values. Store the best $k$-value in best_k and the best test $M S E$ in best_mse.

Points: 2

In [ ]:

# your code here
best_k = ...
best_mse = ...

In [ ]:

print(
    f"The best k value is {best_k}. This corresponds to the "
    f"lowest test MSE of {best_mse:.3f}."
)

In [ ]:

grader.check("q2.2.3")

Q2.2.4

Evaluate and report the $R^{2}$ of the best model. Save the $R^{2}$ of the best model in best_r2.

Points: 2

In [ ]:

# your code here
best_r2 = ...

In [ ]:

print(
    f"The R-squared score evaluated on the test set for the best model "
    f"with k={best_k} is {best_r2:.4f}."
)

In [ ]:

grader.check("q2.2.4")

Q2.3

In this section you will discuss your results by answering the following questions. You should answer the questions directly in the provided markdown cells of your notebook.

Q2.3.1

How does the value of $k$ affect the fitted model?

Points: 2

Type your answer here, replacing this text.

Q2.3.2

If $n$ is the number of observations in the training set, what can you say about a kNN regression model that uses $k = n$?

Points: 2

Type your answer here, replacing this text.

Q2.3.3

Do the training and test $M S E$ plots exhibit different trends? Explain how the value of $k$ influences the training and test $M S E$ values.

Points: 2

Type your answer here, replacing this text.

Q2.3.4

If you were to change the random_state argument to train_test_split above and re-run the code, do you think you would select the same model? If not, why?

Points: 2

Type your answer here, replacing this text.

Question 3: Simple linear regression [25 pts]

Q3.1

We will now fit our data using a linear regression model. Choose the same predictor and response variables you used in the kNN model. You will also use the same 75% training and 25% testing split of the data, which was created using random_state = 109.

Q3.1.1

Fit a linear regression model. Name your model linreg.

Points: 6

In [ ]:

# your code here
# Instantiate a LinearRegression class object and fit with train data
linreg = ...

In [ ]:

grader.check("q3.1.1")

Q3.1.2

Report the slope and intercept values for the fitted linear model. Name your variables slope and intercept.

Points: 4
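
For reference, a fitted sklearn `LinearRegression` exposes its parameters through the `coef_` and `intercept_` attributes. The sketch below fits noise-free toy data generated from $y = 2x + 1$, so the recovered values are known in advance. The names `model`, `toy_slope`, and `toy_intercept` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Noise-free toy data on the line y = 2x + 1 (illustration only).
x = np.arange(10, dtype=float).reshape(-1, 1)
y = 2.0 * x.ravel() + 1.0

model = LinearRegression().fit(x, y)

# coef_ holds one entry per predictor; intercept_ is a scalar for a 1D response.
toy_slope = model.coef_[0]
toy_intercept = model.intercept_

print(f"slope={toy_slope:.4f}, intercept={toy_intercept:.4f}")
```

With a single predictor, `coef_` is still an array of length one, so indexing with `[0]` gives a plain number that formats cleanly in an f-string.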

In [ ]:

# your code here
slope = ...
intercept = ...

In [ ]:

print(
    f"Intercept of the fitted linear model\t\t{intercept:.4f}\n"
    f"Slope of the fitted linear model\t{slope:.4f}"
)

In [ ]:

grader.check("q3.1.2")

Q3.1.3

Report the $M S E$ for the training and test sets and the $R^{2}$ for the test set. Name your variables lin_train_mse, lin_test_mse, and lin_test_r2.

Points: 4

In [ ]:

# your code here 
#Compute the MSE of the model
lin_train_mse = ...
lin_test_mse = ...
#Compute the R-squared of the model
lin_test_r2 = ...

In [ ]:

print("Linear regression model results:\n")
print(
    "\tTrain MSE\t{:.4f}\n"
    "\tTest MSE\t{:.4f}\n".format(
        lin_train_mse,
        lin_test_mse,
    )
)

print(f"\tTest R-squared\t{lin_test_r2:.4f}")

In [ ]:

grader.check("q3.1.3")

Q3.1.4

Plot the residuals, $e = y - \hat{y}$, of the model on the training set as a function of the response variable. Draw a horizontal line denoting the zero residual value on the $y$-axis.

Points: 5
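
The residual definition $e = y - \hat{y}$ can be checked numerically on toy data. The sketch below uses `np.polyfit` for the least-squares line instead of sklearn's `LinearRegression`, purely to keep the example short. The data and variable names are made up.

```python
import numpy as np

# Synthetic linear data with noise (assumption: illustration only).
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 3.0 * x + 2.0 + rng.normal(0, 1, size=50)

# Degree-1 least-squares fit; polyfit returns the highest-degree coefficient first.
slope_hat, intercept_hat = np.polyfit(x, y, deg=1)

y_hat = slope_hat * x + intercept_hat
residuals = y - y_hat   # e = y - y_hat

# For a least-squares line with an intercept, residuals average to (numerically) zero.
print(f"mean residual = {residuals.mean():.2e}")
```

Because the mean residual is zero by construction for such a fit, a horizontal reference line at zero (for example via `plt.axhline(0)`) is a natural baseline when looking for curvature or changing spread.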

In [ ]:

# your code here

Q3.2

Answer the following questions about your results:

Q3.2.1

How does the test $M S E$ score compare with the best test $M S E$ value obtained with kNN regression?

Points: 2

Type your answer here, replacing this text.

Q3.2.2

What does the sign of the slope of the fitted linear model convey about the relationship between the predictor and the response?

Points: 2

Type your answer here, replacing this text.

Q3.2.3

Discuss the shape of the residual plot and what it shows for the quality of the model. Be sure to discuss whether or not the assumption of linearity is valid for this data.

Points: 2

Type your answer here, replacing this text.

PART 2 [40 pts]: Analysis of Simulated ASEC Data

Overview

In this part we analyze simulated income data from the publicly available 2021 US Annual Social and Economic (ASEC) Supplement (https://www.census.gov/data/datasets/time-series/demo/cps/cps-asec.2021.html), provided by the US Census Bureau. The Current Population Survey (CPS) has been conducted monthly for over 50 years. Currently, about 54,000 households are interviewed monthly, scientifically selected on the basis of area of residence to represent the nation as a whole, individual states, and other specified areas.

Dataset

The simulated data can be found in `data/census_data.csv`.

The number of features has been greatly reduced from the original dataset. You can see the description of the original dataset in the ASEC data dictionary.

In addition to subsetting the features, other preprocessing steps have been taken. For example, some categorical variables have had their number of unique values reduced.

We refer to the data as simulated because sampling was used to turn what was originally an ordinal response variable (42 income brackets) into something more continuous.

Considering this, the results of your analysis should be viewed skeptically. You can view the preprocessing steps taken to create the simplified, simulated data in `data/preprocessing.ipynb`.

NOTE: Variables have been renamed for easier interpretation. You can find the original variable names in the preprocessing notebook. It will be these original variable names that appear in the data dictionary linked above.

Features

age - Age of person
hourly_pay - Hourly salary of person (-1 if person is not paid by the hour)
hours_per_week - Number of hours usually worked per week
weeks_worked - Number of weeks worked per year
sex - {'Female': 0, 'Male': 1}
marital_status - {'married': 0, 'widowed': 1, 'Divorced': 2, 'Separated': 3, 'Never married': 4}
military_service - {'has not served in the US armed forces': 0, 'has served in the US armed forces': 1}
student_status - {'Not currently studying': 0, 'Enrolled full-time': 1, 'Enrolled part-time': 1}
education - {'Not finished high school': 0, 'High school': 1, 'Associate degree': 2, 'Bachelor\'s': 3, 'Master\'s': 4, 'Professional school degree': 5, 'Doctorate': 6}
race - {'White': 0, 'Black': 1, 'American Indian, Alaskan Native only (AI)': 2, 'Asian': 3, 'Hawaiian, Pacific Islander (HP)': 4, 'White-Black': 5, 'White-AI': 6, 'White-Asian': 7, 'White-HP': 8, 'Black-AI': 9, 'Black-Asian': 10, 'Black-HP': 11, 'AI-Asian': 12, 'AI-HP': 13, 'Asian-HP': 14, 'other race combinations': 15}
industry - Industry that the person is working in {'Other': 0, 'Agriculture, forestry, fishing, hunting': 1, 'Mining': 2, 'Construction': 3, 'Manufacturing': 4, 'Wholesale and retail trade': 5, 'Transportation and utilities': 6, 'Information': 7, 'Financial activities': 8, 'Professional and business services': 9, 'Education and health services': 10, 'Leisure and hospitality': 11, 'Other services': 12, 'Public administration': 13, 'Armed Forces': 14}
occupation - Occupation of person {'Other': 0, 'Management, business, and financial occ.': 1, 'Professional and related occ.': 2, 'Service occ.': 3, 'Sales and related occ.': 4, 'Office and administrative support occ.': 5, 'Farming, fishing and forestry': 6, 'Construction and extraction occ.': 7, 'Installation, maintenance and repair occ.': 8, 'Production occ.': 9, 'Transportation and material moving occ.': 10, 'Armed Forces': 11}

income - Annual income in dollars
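
Because `hourly_pay` uses -1 as a sentinel for people who are not paid by the hour, summary statistics computed on the raw column mix real wages with placeholder values. A minimal sketch of masking the sentinel first, using a made-up four-row frame (the values and the names `toy`, `naive_mean`, and `masked_mean` are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy frame; -1 marks "not paid by the hour", as described in the feature list.
toy = pd.DataFrame({"hourly_pay": [20.0, -1.0, 30.0, -1.0]})

# Averaging the raw column treats the sentinel as a real wage.
naive_mean = toy["hourly_pay"].mean()

# Replace the sentinel with NaN; pandas skips NaN in mean() by default.
hourly_only = toy["hourly_pay"].replace(-1, np.nan)
masked_mean = hourly_only.mean()

print(f"naive mean = {naive_mean:.1f}, mean over hourly workers = {masked_mean:.1f}")
```

The same consideration applies to plots: leaving the sentinel in place would add a spike at -1 that does not correspond to any actual wage.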

Question 4: Investigating trends [25 pts]

Below we'll answer questions about potential trends in the data with the help of plots and simple statistics.

Q4.1
Is there a disparity in income of participants by gender? Consider using a log scale or another technique when plotting to communicate findings more clearly.

Points: 3

Type your answer here, replacing this text.

Q4.2
Is there a relationship between income and the "occupation" variable?

Points: 3

Type your answer here, replacing this text.

Q4.3
Let's investigate a few questions about education and income:

Is there a relationship between income and education level?

Is this trend similar across both genders in the dataset?

Is it possible to consider education level as an ordinal variable? For instance, consider whether retaining the ordering used in the dataset might be preferable to treating education level as a categorical variable lacking order.

Points: 4

Type your answer here, replacing this text.

Q4.4
Is there a discernible trend in the incomes of participants from different industries?

Points: 3

Type your answer here, replacing this text.

Q4.5
Is there a clear trend between age and income?

Points: 3

Type your answer here, replacing this text.

Q4.6
Do any of the quantitative attributes show a clear relationship with income? If so, are these relationships linear or non-linear?

Points: 3

Type your answer here, replacing this text.

Q4.7
What is the relationship between income and the different values for `marital_status` in the dataset?

Points: 3

Type your answer here, replacing this text.

Q4.8
What is the average effect of the `military_service` variable on income?

Points: 3

Type your answer here, replacing this text.

Question 5: Calculate the Gini coefficient [10 pts]

Gini coefficients are often used to quantify income inequality. For an introductory overview of the Gini coefficient, its derivation, and its uses, you can read more about it here. That article also provides a useful graphical representation of the Gini coefficient to better understand how it measures inequality.

The Gini coefficient is defined by the formula:

$$G = \frac{\sum_{i = 1}^{n} (2 i - n - 1) x_{i}}{n \sum_{i = 1}^{n} x_{i}}$$

where $x_{i}$ is the observed value with rank $i$ when the values are sorted in ascending order, and $n$ is the number of values observed.

A Gini coefficient of $G = 0$ implies perfect income equality, whereas a Gini coefficient close to $G = 1$ implies a concentration of wealth among the richest few.

Q5.1

Based on the above formula, calculate and report the Gini coefficient for the income of those people in the provided ASEC dataset. Store the result in gini_coef.

NOTE: For algorithmic simplicity and consistency, you can rank income values for all observations, keeping duplicate values in your sorted array. Therefore, you will likely have sorted income values $x_{i}$ similar to [417, 417, 417, ..., 250000, 250000, 250000] with corresponding rank indices $i$ similar to [1, 2, 3, ..., 12353, 12354, 12355]. Nothing more sophisticated than that is required for dealing with ties (i.e. duplicates) in your sorted income values for Question 5.1.

Points: 7
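
To make the formula concrete, here is a small sketch that evaluates it on toy arrays where the result can be verified by hand. The function name `gini` and the example values are hypothetical, and the sketch assumes non-negative values with a positive total (otherwise the denominator is zero).

```python
import numpy as np

def gini(values):
    """Gini coefficient via the rank-based formula, with ties kept in sorted order."""
    x = np.sort(np.asarray(values, dtype=float))   # ascending order
    n = x.size
    ranks = np.arange(1, n + 1)                    # i = 1, ..., n
    return np.sum((2 * ranks - n - 1) * x) / (n * np.sum(x))

# Everyone has the same amount: perfect equality.
g_equal = gini([5, 5, 5, 5])

# One of four people holds everything: (n - 1) / n = 0.75 under this formula.
g_concentrated = gini([0, 0, 0, 1])

# Hand check for [1, 2, 3, 4]: numerator 10, denominator 4 * 10, so 0.25.
g_linear = gini([4, 1, 3, 2])

print(g_equal, g_concentrated, g_linear)
```

Note that with this formula the maximum for $n$ observations is $(n - 1)/n$, which approaches 1 only as $n$ grows.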

# your code here
gini_coef = ...

# Print resulting Gini coefficient
print(
    f"The Gini Index for this dataset is {gini_coef:.3f}")

grader.check("q5.1")

Q5.2

According to the World Bank estimate, the country with the largest Gini coefficient is South Africa, ranked 1st at $0.63$, while the lowest is the Slovak Republic, ranked 162nd at $0.232$. The United States is ranked 46th on the list and has a Gini index of $0.415$.

How well does your calculated Gini coefficient for this simulated dataset match the World Bank estimate?
Might the self-report nature of the data, preprocessing steps, or simulation (i.e., sampling) procedure have affected your results? If so, how?

Note: The World Bank estimate website uses a [0,100] range for the Gini Index. Above we have converted this to a [0,1] range.

Points: 3

Type your answer here, replacing this text.

Question 6: Critiquing the simulated data [5 pts]

Take a look at both the data dictionary for the original dataset and the notebook used to create the simplified simulation, `data/preprocessing.ipynb`.

What might you have done differently were you to write your own preprocessing code? A nonexhaustive list of a few things to consider would be:

Are there important features you think should have been included that were not?
Do you agree with the methods used to reduce the number of unique categorical values?
Might there be a better way to simulate a continuous response from the discrete income brackets in the original data?
Note: We used the record type 'person' data from the ASEC rather than 'household' or 'family.' All three record types are represented in the data dictionary.

Points: 5

Type your answer here, replacing this text.

This concludes HW2. Thank you!
