CS109A Introduction to Data Science
Homework 2: kNN and Linear Regression
Instructions
- To submit your assignment follow the instructions given in Canvas.
- Plots should be legible and interpretable without having to refer to the code that generated them, including labels for the $x$- and $y$-axes as well as a descriptive title and/or legend when appropriate.
- When asked to interpret a visualization, do not simply describe it (e.g., "the curve has a steep slope up"), but instead explain what you think the plot means.
- The use of 'hard-coded' values to try and pass tests rather than solving problems programmatically will not receive credit.
- The use of extremely inefficient or error-prone code (e.g., copy-pasting nearly identical commands rather than looping) may result in only partial credit.
- We have tried to include all the libraries you may need to do the assignment in the imports cell provided below. Please get course staff approval before importing any additional 3rd party libraries.
- Enable scrolling output on cells with very long output.
- Feel free to add additional code or markdown cells as needed.
- Ensure your code runs top to bottom without error and passes all tests by restarting the kernel and running all cells. This is how the notebook will be evaluated (note that this can take a few minutes).
In [ ]:# RUN THIS CELL
import os
import pathlib
working_dir = pathlib.Path().absolute()
os.chdir(working_dir)
# Import libraries
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import operator
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
# pandas tricks for better display
pd.options.display.max_columns = 50
pd.options.display.max_rows = 500
pd.options.display.max_colwidth = 100
pd.options.display.precision = 3
Notebook Contents
About this homework
This assignment is the first in which we'll go through the process of loading a dataset, splitting it into train and test sets, performing some preprocessing, and finally fitting some models and evaluating our results.
We have two different datasets:
- PART 1 car data from cardekho.com
- PART 2 simulated income data created from the Annual Social and Economic (ASEC) Supplement
Part 1 explores two simple methods for prediction: k-nearest neighbors regression (kNN), a non-parametric method, and linear regression, a parametric method.
Part 2 is focused on EDA and visualization.
PART 1 [60 pts]: Predicting the selling price of cars on CarDekho.com
Overview
According to its website, CarDekho.com is India's leading car search venture. Its website and app carry rich automotive content such as expert reviews, detailed specs and prices, comparisons, as well as videos and pictures of all car brands and models available in India. Each car has a current selling price, which is the price for buying a used car on this site, and an MRP, which is the retail price of the car. These two prices differ depending on factors such as brand, make year, mileage, condition, etc.
Dataset
The dataset contains 601 used cars and is available as data/car_dekho_full.csv. It contains the following columns:
- Year - make year (year the car was made)
- Current_Selling_Price - current price of a used car on CarDekho.com (in lakhs)
- MRP - maximum retail price of the car when it was new (in lakhs)
- Kms_Driven - number of kilometers driven
NOTE: 1 lakh is 100,000 Rupees in the Indian numbering system. Also, kilometers are used as a measure of distance instead of miles.
Objective
Using kNN and linear regression we will predict the Current_Selling_Price from the other features available in this dataset.
Question 1: Exploratory data analysis (EDA) [10 pts]
To reach the goal of predicting the Current_Selling_Price, start by inspecting the dataset using Exploratory Data Analysis (EDA).
Load the dataset, inspect it, and answer the following questions:
Q1.1 Identify all variables in the dataset. Which ones are quantitative, and which ones are categorical? If you think any variables are categorical, briefly explain why.
Points: 2
Type your answer here, replacing this text.
In [ ]:# your code here
Q1.2 What are the means and standard deviations for Current_Selling_Price and MRP?
Store your results in mean_csp, mean_mrp, std_csp, and std_mrp to match the variable names used in the provided print function.
Points: 2
In [ ]:mean_csp = ...
mean_mrp = ...
std_csp = ...
std_mrp = ...
In [ ]:# Be certain to name your variables mean_csp, mean_mrp, std_csp, std_mrp
# to match the variable names used in the provided print function
print(
"\n"
f"The mean Current Selling Price is {mean_csp:.4f} lakhs\n"
f"The mean MRP is {mean_mrp:.4f} lakhs\n"
f"The Standard Deviation of Current Selling Price is {std_csp:.4f}\n"
f"The Standard Deviation of MRP is {std_mrp:.4f}"
)
In [ ]:grader.check("q1.2")
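As a hedged illustration of the kind of summary statistics Q1.2 asks for (not the graded solution), the sketch below computes a mean and standard deviation on a small made-up column. The names toy_df, toy_mean, toy_std_sample, and toy_std_population are assumptions introduced here. One detail worth knowing: pandas' .std() uses the sample standard deviation (ddof=1) by default, while NumPy's np.std uses the population version (ddof=0), so the two can disagree on the same data.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration (not the CarDekho dataset)
toy_df = pd.DataFrame({"price": [2.0, 4.0, 4.0, 6.0]})

toy_mean = toy_df["price"].mean()

# pandas default: sample standard deviation (divides by n - 1)
toy_std_sample = toy_df["price"].std()

# NumPy default: population standard deviation (divides by n)
toy_std_population = np.std(toy_df["price"].to_numpy())

print(f"mean={toy_mean:.4f}, sample std={toy_std_sample:.4f}, "
      f"population std={toy_std_population:.4f}")
```

Which convention the autograder expects is not stated in this text, so it may be worth checking which default your chosen call uses.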
Q1.3 What is the range of kilometers that the cars have been driven? Store your answer in the variable km_range.
Hint: 'range' here refers to the difference between the highest and lowest recorded kilometers driven.
Points: 2
In [ ]:# your code here
km_range = ...
In [ ]:# check your result
print(f"the range of kilometers is {km_range:,.2f}")
In [ ]:grader.check("q1.3")
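To make the hint's definition of 'range' concrete, here is a minimal sketch on made-up values. The Series toy_km and the variable toy_range are hypothetical names, and the numbers are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy kilometer readings, made up for illustration
toy_km = pd.Series([27000, 43000, 6900, 5200])

# Range as defined in the hint: highest value minus lowest value
toy_range = float(toy_km.max() - toy_km.min())

# np.ptp ("peak to peak") computes the same quantity on an array
toy_range_ptp = float(np.ptp(toy_km.to_numpy()))

print(f"the range of kilometers is {toy_range:,.2f}")  # the range of kilometers is 37,800.00
```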
Q1.4 The goal in this section is to identify the best feature to use to predict our response, Current_Selling_Price.
- Plot a scatter plot of each feature against our response and examine any relationships.
- Which predictor seems to best predict Current_Selling_Price? Provide an interpretation of the plots that justifies your choice.
Points: 4
Type your answer here, replacing this text.
In [ ]:# your code here
Question 2: k-Nearest Neighbors [25 pts]
We will begin our modeling with k-Nearest Neighbors (kNN) regression, using sklearn for both preprocessing and model fitting.
Q2.1 Split the dataset into a train and test set with 75% training data and 25% testing data, using argument random_state = 109. The resulting splits should be stored in the variables X_train, X_test, y_train, y_test.
Points: 2
In [ ]:# your code here
In [ ]:grader.check("q2.1")
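The sketch below shows how a 75%/25% split with a fixed random_state behaves, using synthetic data rather than the car dataset. The arrays X_toy and y_toy and the names X_tr, X_te, y_tr, y_te are assumptions made for illustration; only train_test_split and its train_size and random_state arguments come from sklearn.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic single-predictor data, made up for illustration
rng = np.random.default_rng(0)
X_toy = rng.uniform(1, 20, size=(100, 1))
y_toy = 0.6 * X_toy[:, 0] + rng.normal(0, 1, size=100)

# 75% train / 25% test; a fixed random_state makes the split reproducible
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, train_size=0.75, random_state=109
)

# Repeating the call with the same random_state yields the same rows
X_tr_again, _, _, _ = train_test_split(
    X_toy, y_toy, train_size=0.75, random_state=109
)

print(X_tr.shape, X_te.shape)  # (75, 1) (25, 1)
```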
Q2.2 Now, we will fit several kNN regression models for various values of $k$ to identify the best parameterization.
Q2.2.1 For each $k$ in $k \in [1,2,3,5,7,10,50,100]$, fit a k-NN regression model to the training data with response Current_Selling_Price and the predictor MRP.
- For each $k$, make a plot of response vs. predictor (8 plots in total, arranged in a 4×2 grid).
- Each of your 8 plots should clearly show (a) the training data and the testing data in different colors, (b) the model prediction, and (c) title, legend, and axis labels.
- NOTE: Feel free to use the plt.subplots() code we provide to specify your 4×2 grid, unless you first try that and decide that you have a clearer, cleaner way of accomplishing this task.
Points: 7
In [ ]:# fig, axs = plt.subplots(4,2, figsize=(12, 14))
# fig.subplots_adjust(hspace = .5, wspace=.3)
# your code here
Q2.2.2 Plot the training and test $MSE$ values as a function of $k$ (1 plot in total).
Points: 4
In [ ]:# your code here
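One possible way to organize the bookkeeping for several values of $k$ is a loop that stores the train and test $MSE$ in dictionaries keyed by $k$. This is only a sketch on synthetic data: the sine-shaped toy response and the names train_mse, test_mse, and toy_best_k are assumptions, and plotting is left out.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data, made up for illustration (not the CarDekho dataset)
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(200, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(0, 0.3, size=200)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, train_size=0.75, random_state=109
)

k_values = [1, 2, 3, 5, 7, 10, 50, 100]
train_mse, test_mse = {}, {}

for k in k_values:
    # Fit one kNN regressor per k and record the error on both splits
    knn = KNeighborsRegressor(n_neighbors=k).fit(X_tr, y_tr)
    train_mse[k] = mean_squared_error(y_tr, knn.predict(X_tr))
    test_mse[k] = mean_squared_error(y_te, knn.predict(X_te))

# The k with the lowest test MSE on this toy split
toy_best_k = min(test_mse, key=test_mse.get)
print(toy_best_k, test_mse[toy_best_k])
```

Keeping the errors in dictionaries makes it straightforward to plot both curves against k_values afterwards and to look up the minimum.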
Q2.2.3 Find the best model based on the test $MSE$ values. Store the best $k$-value in best_k and the best test $MSE$ in best_mse.
Points: 2
In [ ]:# your code here
best_k = ...
best_mse = ...
In [ ]:print(
f"The best k value is {best_k}. This corresponds to the "
f"lowest test MSE of {best_mse:.3f}."
)
In [ ]:grader.check("q2.2.3")
Q2.2.4 Evaluate and report the $R^2$ of the best model. Save the $R^2$ of the best model in best_r2.
Points: 2
In [ ]:# your code here
best_r2 = ...
In [ ]:print(
f"The R-squared score evaluated on the test set for the best model "
f"with k={best_k} is {best_r2:.4f}."
)
In [ ]:grader.check("q2.2.4")
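For reference, $R^2$ can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares, and this agrees with sklearn's r2_score. The sketch below checks that on four made-up numbers; y_true, y_pred, r2_manual, and r2_sklearn are hypothetical names.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up observed and predicted values
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.0])

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum((y_true - y_pred) ** 2)         # 0.5
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # 20.0
r2_manual = 1 - ss_res / ss_tot

# Note the argument order: true values first, predictions second
r2_sklearn = r2_score(y_true, y_pred)

print(f"{r2_manual:.4f} {r2_sklearn:.4f}")  # 0.9750 0.9750
```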
Q2.3 In this section you will discuss your results by answering the following questions. You should answer the questions directly in the provided markdown cells of your notebook.
Q2.3.1 How does the value of $k$ affect the fitted model?
Points: 2
Type your answer here, replacing this text.
Q2.3.2 If $n$ is the number of observations in the training set, what can you say about a kNN regression model that uses $k = n$?
Points: 2
Type your answer here, replacing this text.
Q2.3.3 Do the training and test $MSE$ plots exhibit different trends? Explain how the value of $k$ influences the training and test $MSE$ values.
Points: 2
Type your answer here, replacing this text.
Q2.3.4 If you were to change the random_state argument to train_test_split above and re-run the code, do you think you would select the same model? If not, why?
Points: 2
Type your answer here, replacing this text.
Question 3: Simple linear regression [25 pts]
Q3.1 We will now fit our data using a linear regression model. Choose the same predictor and response variables you used in the kNN model. You will also use the same 75% training and 25% testing split of the data, which was created using random_state = 109.
Q3.1.1 Fit a linear regression model. Name your model linreg.
Points: 6
In [ ]:# your code here
# Instantiate a LinearRegression class object and fit with train data
linreg = ...
In [ ]:grader.check("q3.1.1")
Q3.1.2 Report the slope and intercept values for the fitted linear model. Name your variables slope and intercept.
Points: 4
In [ ]:# your code here
slope = ...
intercept = ...
In [ ]:print(
f"Slope of the fitted linear model\t\t{slope:.4f}\n"
f"Intercept of the fitted linear model\t{intercept:.4f}"
)
In [ ]:grader.check("q3.1.2")
Q3.1.3 Report the $MSE$ for the training and test sets and the $R^2$ for the test set. Name your variables lin_train_mse, lin_test_mse, and lin_test_r2.
Points: 4
In [ ]:# your code here
#Compute the MSE of the model
lin_train_mse = ...
lin_test_mse = ...
#Compute the R-squared of the model
lin_test_r2 = ...
In [ ]:print("Linear regression model results:\n")
print(
"\tTrain MSE\t{:.4f}\n"
"\tTest MSE\t{:.4f}\n".format(
lin_train_mse,
lin_test_mse,
)
)
print(f"\tTest R-squared\t{lin_test_r2:.4f}")
In [ ]:grader.check("q3.1.3")
Q3.1.4 Plot the residuals, $e = y - \hat{y}$, of the model on the training set as a function of the response variable. Draw a horizontal line denoting the zero residual value on the $y$-axis.
Points: 5
In [ ]:# your code here
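The following sketch illustrates the residual definition $e = y - \hat{y}$ for a fitted LinearRegression on synthetic data. Everything here (X_toy, y_toy, toy_linreg, toy_slope, toy_intercept, residuals) is a hypothetical name on made-up data, and the plotting calls are shown only as comments.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, roughly linear data, made up for illustration
rng = np.random.default_rng(0)
X_toy = rng.uniform(1, 20, size=(80, 1))
y_toy = 1.5 + 0.6 * X_toy[:, 0] + rng.normal(0, 1, size=80)

toy_linreg = LinearRegression().fit(X_toy, y_toy)
toy_slope = toy_linreg.coef_[0]         # one coefficient per predictor
toy_intercept = toy_linreg.intercept_

# Residuals on the data the model was fit to: e = y - y_hat
residuals = y_toy - toy_linreg.predict(X_toy)

# With an intercept term, least-squares residuals average to about zero
print(f"mean residual: {residuals.mean():.2e}")

# A residual plot could then be drawn with, for example:
#   plt.scatter(y_toy, residuals)
#   plt.axhline(0, color="k")
```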
Q3.2.1 How does the test $MSE$ score compare with the best test $MSE$ value obtained with kNN regression?
Points: 2
Type your answer here, replacing this text.
Q3.2.2 What does the sign of the slope of the fitted linear model convey about the relationship between the predictor and the response?
Points: 2
Type your answer here, replacing this text.
Q3.2.3 Discuss the shape of the residual plot and what it shows for the quality of the model. Be sure to discuss whether or not the assumption of linearity is valid for this data.
Points: 2
Type your answer here, replacing this text.
PART 2 [40 pts]: Analysis of 2021 US Annual Social and Economic (ASEC) Supplement
Overview
In this part we analyze simulated income data from the publicly available 2021 US Annual Social and Economic (ASEC) Supplement (https://www.census.gov/data/datasets/time-series/demo/cps/cps-asec.2021.html), provided by the US Census Bureau. The Current Population Survey (CPS) has been conducted monthly for over 50 years. Currently, about 54,000 households are interviewed monthly, scientifically selected on the basis of area of residence to represent the nation as a whole, individual states, and other specified areas.
Dataset
The simulated data can be found in data/census_data.csv.
The number of features has been greatly reduced from the original dataset. You can see the description of the original dataset in the ASEC data dictionary.
In addition to subsetting the features, other preprocessing steps have been taken. For example, some categorical variables have had their number of unique values reduced.
We refer to the data as simulated because sampling was used to turn what was originally an ordinal response variable (42 income brackets) into something more continuous.
Considering this, the results of your analysis should be viewed skeptically. You can view the preprocessing steps taken to create the simplified, simulated data in data/preprocessing.ipynb.
NOTE: Variables have been renamed for easier interpretation. You can find the original variable names in the preprocessing notebook. It will be these original variable names that appear in the data dictionary linked above.
Features
- age - Age of person
- hourly_pay - Hourly salary of person (-1 if person is not paid by the hour)
- hours_per_week - Number of hours usually worked per week
- weeks_worked - Number of weeks worked per year
- sex - {'Female': 0, 'Male': 1}
- marital_status - {'married':0,'widowed':1,'Divorced':2, 'Separated':3,'Never married':4}
- military_service - {'has not served in the US armed forces':0,'has served in the US armed forces':1}
- student_status - {'Not currently studying':0,'Enrolled full-time':1, 'Enrolled part-time':1}
- education - {'Not finished high school': 0, 'High school': 1, 'Associate degree': 2, 'Bachelor\'s': 3, 'Master\'s': 4, 'Professional school degree': 5, 'Doctorate': 6}
- race - {'White': 0, 'Black': 1, 'American Indian, Alaskan Native only (AI)': 2, 'Asian': 3, 'Hawaiian, Pacific Islander (HP)': 4, 'White-Black': 5, 'White-AI': 6, 'White-Asian': 7, 'White-HP': 8, 'Black-AI': 9, 'Black-Asian': 10, 'Black-HP': 11, 'AI-Asian': 12, 'AI-HP': 13, 'Asian-HP': 14, 'other race combinations': 15}
- industry - Industry that the person is working in {'Other': 0, 'Agriculture, forestry, fishing, hunting': 1, 'Mining': 2, 'Construction': 3, 'Manufacturing': 4, 'Wholesale and retail trade': 5, 'Transportation and utilities': 6, 'Information': 7, 'Financial activities': 8, 'Professional and business services': 9, 'Education and health services': 10, 'Leisure and hospitality': 11, 'Other services': 12, 'Public administration': 13, 'Armed Forces': 14}
- occupation - Occupation of person {'Other': 0, 'Management, business, and financial occ.': 1, 'Professional and related occ.': 2, 'Service occ.': 3, 'Sales and related occ.': 4, 'Office and administrative support occ.': 5, 'Farming, fishing and forestry': 6, 'Construction and extraction occ.': 7, 'Installation, maintenance and repair occ.': 8, 'Production occ.': 9, 'Transportation and material moving occ.': 10, 'Armed Forces': 11}
- income - Annual income in dollars
Question 4: Investigating trends [25 pts]
Below we'll answer questions about potential trends in the data with the help of plots and simple statistics.
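As a rough sketch of the "simple statistics" side of this kind of analysis, the example below summarizes a skewed, made-up income column by group and adds a log-transformed copy for plotting. The DataFrame toy, its columns, and the values are all assumptions; the real column names are those listed under Features.

```python
import numpy as np
import pandas as pd

# Made-up incomes for two groups; one large value skews group 0
toy = pd.DataFrame({
    "group": [0, 0, 0, 1, 1, 1],
    "income": [20000, 40000, 400000, 30000, 60000, 90000],
})

# Mean and median can tell different stories when the data are skewed
summary = toy.groupby("group")["income"].agg(["mean", "median"])
print(summary)

# A log10 transform compresses the long right tail, which can make
# histograms or box plots of income easier to read
toy["log10_income"] = np.log10(toy["income"])
```

In this toy case the group with the higher mean has the lower median, which is one reason to look at more than a single summary number.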
Q4.1 Is there a disparity in income of participants by gender? Consider using a log scale or another technique when plotting to communicate findings more clearly.
Points: 3
Type your answer here, replacing this text.
Q4.2 Is there a relationship between income and the "occupation" variable?
Points: 3
Type your answer here, replacing this text.
Q4.3 Let's investigate a few questions about education and income:
- Is there a relationship between income and education level?
- Is this trend similar across both genders in the dataset?
- Is it possible to consider education level as an ordinal variable? For instance, consider whether retaining the ordering used in the dataset might be preferable to treating education level as a categorical variable lacking order.
Points: 4
Type your answer here, replacing this text.
Q4.4 Is there a discernible trend in the incomes of participants from different industries?
Points: 3
Type your answer here, replacing this text.
Q4.5 Is there a clear trend between age and income?
Points: 3
Type your answer here, replacing this text.
Q4.6 Do any of the quantitative attributes show a clear relationship with income? If so, are these relationships linear or non-linear?
Points: 3
Type your answer here, replacing this text.
Q4.7 What is the relationship between income and the different values for marital_status in the dataset?
Points: 3
Type your answer here, replacing this text.
Q4.8 What is the average effect of the military_service variable on income?
Points: 3
Type your answer here, replacing this text.
Question 5: Calculate the Gini coefficient [10 pts]
Gini coefficients are often used to quantify income inequality. For an introductory overview of the Gini coefficient, its derivation, and its uses, you can read more about it here. That article also provides a useful graphical representation of the Gini coefficient to better understand how it measures inequality.
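The formula this question intends is not reproduced in the text above, so the sketch below uses one common discrete form of the Gini coefficient as an assumption: with values sorted in ascending order, $G = \sum_i (2i - n - 1)\,x_i / (n \sum_i x_i)$. The function name gini is hypothetical, and the assignment's own definition may differ (for example, in how it normalizes for sample size).

```python
import numpy as np

def gini(values):
    """Gini coefficient via a common sorted-rank formula (an assumption here).

    Returns 0 for perfectly equal values and approaches 1 as a single
    observation holds nearly everything. Assumes non-negative values
    with a positive sum.
    """
    x = np.sort(np.asarray(values, dtype=float))  # ascending order
    n = x.size
    ranks = np.arange(1, n + 1)                   # ranks i = 1..n
    return np.sum((2 * ranks - n - 1) * x) / (n * np.sum(x))

print(gini([1, 1, 1, 1]))    # 0.0  (perfect equality)
print(gini([0, 0, 0, 10]))   # 0.75 (one person holds everything, n = 4)
```

With this form, the maximum for a sample of size $n$ is $(n-1)/n$ rather than exactly 1, which is why the second example gives 0.75.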
The Gini coefficient is defined by the formula:

$$G = \frac{\sum_{i=1}^{n} (2i - n - 1)\, x_i}{n \sum_{i=1}^{n} x_i}$$

where $x_i$ is an observed value, $n$ is the number of values observed, and $i$ is the rank of the value in ascending order.

A Gini coefficient of $G=0$ implies perfect income equality, whereas a Gini coefficient close to $G=1$ implies a concentration of wealth among the richest few.
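As a sanity check on the rank-based definition, the sketch below evaluates it on two tiny made-up arrays rather than on the ASEC data. The function name `gini_from_ranks` and the toy values are assumptions introduced for illustration only.

```python
import numpy as np

def gini_from_ranks(values):
    """Rank-based Gini: sum((2i - n - 1) * x_i) / (n * sum(x_i)),
    with x sorted ascending and ranks i = 1..n (ties keep separate ranks)."""
    x = np.sort(np.asarray(values, dtype=float))
    n = x.size
    i = np.arange(1, n + 1)
    return np.sum((2 * i - n - 1) * x) / (n * np.sum(x))

# Toy data (not from the dataset): everyone equal vs. one person holding everything.
toy_equal = [10, 10, 10, 10]
toy_unequal = [0, 0, 0, 100]

print(gini_from_ranks(toy_equal))    # 0.0
print(gini_from_ranks(toy_unequal))  # 0.75
```

Note that with only $n$ observations the maximum attainable value under this formula is $(n-1)/n$, which is why the fully concentrated toy example gives 0.75 rather than 1.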
Q5.1 Based on the above formula, calculate and report the Gini coefficient for the income of the people in the provided ASEC dataset. Store the result in gini_coef.

- NOTE: For algorithmic simplicity and consistency, you can rank income values for all observations, keeping duplicate values in your sorted array. Therefore, you will likely have sorted income values $x_i$ similar to [417, 417, 417, ..., 250000, 250000, 250000] with corresponding rank indices $i$ similar to [1, 2, 3, ..., 12353, 12354, 12355]. Nothing more sophisticated than that is required for dealing with ties (i.e., duplicates) in your sorted income values for Question 5.1.

Points: 7
# your code here
gini_coef = ...
# Print resulting Gini coefficient
print(f"The Gini Index for this dataset is {gini_coef:.3f}")
grader.check("q5.1")
Q5.2 According to World Bank estimates, the country with the largest Gini coefficient is South Africa, ranked 1st at 0.63, while the lowest is the Slovak Republic, ranked 162nd at 0.232. The United States is ranked 46th on the list and has a Gini index of 0.415.

How well does your calculated Gini coefficient for this simulated dataset match the World Bank estimate?

Might the self-reported nature of the data, preprocessing steps, or simulation (i.e., sampling) procedure have affected your results? If so, how?

Note: The World Bank estimate website uses a [0,100] range for the Gini Index. Above we have converted this to a [0,1] range.

Points: 3

Type your answer here, replacing this text.
Question 6: Critiquing the simulated data [5 pts]
Take a look at both the data dictionary for the original dataset and the notebook used to create the simplified simulation, data/preprocessing.ipynb.

What might you have done differently were you to write your own preprocessing code? A nonexhaustive list of a few things to consider would be:

- Are there important features you think should have been included that were not?
- Do you agree with the methods used to reduce the number of unique categorical values?
- Might there be a better way to simulate a continuous response from the discrete income brackets in the original data?

Note: We used the record type 'person' data from the ASEC rather than 'household' or 'family.' All three record types are represented in the data dictionary.
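For reference when thinking about the last bullet, one simple way to turn bracketed incomes into a continuous value is to draw uniformly within each respondent's bracket. The sketch below shows that idea on made-up brackets. The bounds, the `reported` indices, and the variable names are all hypothetical, and this is not necessarily the approach taken in data/preprocessing.ipynb.

```python
import numpy as np

# Hypothetical bracket bounds in dollars (illustrative only, not the ASEC brackets).
bracket_low = np.array([0, 10_000, 25_000, 50_000])
bracket_high = np.array([10_000, 25_000, 50_000, 100_000])

# Hypothetical bracket index reported by each of five respondents.
reported = np.array([0, 2, 2, 3, 1])

# Draw one income per respondent, uniformly within that respondent's bracket.
rng = np.random.default_rng(seed=0)
simulated_income = rng.uniform(bracket_low[reported], bracket_high[reported])

print(simulated_income)
```

A uniform draw treats every income inside a bracket as equally likely, which may be a poor fit for wide or open-ended top brackets; that limitation is one thing a critique could weigh.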
Points: 5

Type your answer here, replacing this text.

This concludes HW2. Thank you!