Assignment 2: Logistic Regression and Support Vector Classifier
Question 1. [30 points] Suppose we collect data for a group of students in a statistics class with variables $X_1$ = hours studied, $X_2$ = cumulative GPA, and $Y = 1$ if the student receives an A and $Y = 0$ otherwise. We fit a logistic regression and obtain the estimated coefficients $\hat{\beta}_0 = -6$, $\hat{\beta}_1 = 0.05$, $\hat{\beta}_2 = 1$.
- (a) Estimate the probability that a student who studies for 40 hours and has a cumulative GPA of 3.5 gets an A in the class.
- (b) How many hours would the student in part (a) need to study to have a 50% chance of getting an A in the class?
- (c) What are the odds, $p/(1-p)$, and the log-odds for the student in (a)?
- (d) Write down the equation of the linear decision boundary (hyperplane) and draw it in a figure with $X_1$ on the x-axis and $X_2$ on the y-axis. Also indicate in the figure the region for an A grade (positive, $Y = 1$) and the region for a non-A grade (negative, $Y = 0$).
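Every part of Question 1 follows from the fitted log-odds, intercept plus coefficient times hours plus coefficient times GPA. The sketch below assumes the standard logistic model and uses the coefficients stated in the question. The function names are illustrative and not part of the assignment. It computes the log-odds, the predicted probability, the hours needed to reach a target probability, and the GPA on the decision boundary for a given number of hours.

```python
import math

# Estimated coefficients from Question 1: intercept, hours studied, GPA.
B0, B1, B2 = -6.0, 0.05, 1.0


def log_odds(hours, gpa):
    """Linear predictor b0 + b1*hours + b2*gpa (the log-odds of an A)."""
    return B0 + B1 * hours + B2 * gpa


def probability(hours, gpa):
    """Logistic (sigmoid) transform of the log-odds."""
    return 1.0 / (1.0 + math.exp(-log_odds(hours, gpa)))


def hours_for_probability(p, gpa):
    """Solve log(p / (1 - p)) = b0 + b1*hours + b2*gpa for hours."""
    target = math.log(p / (1.0 - p))
    return (target - B0 - B2 * gpa) / B1


def boundary_gpa(hours):
    """GPA at which the log-odds are zero, i.e. the predicted probability is 0.5."""
    return -(B0 + B1 * hours) / B2


if __name__ == "__main__":
    z = log_odds(40, 3.5)
    print(f"log-odds:           {z:.4f}")
    print(f"odds:               {math.exp(z):.4f}")
    print(f"probability:        {probability(40, 3.5):.4f}")
    print(f"hours for p = 0.5:  {hours_for_probability(0.5, 3.5):.1f}")
```

Points above the line traced by `boundary_gpa` have positive log-odds and fall in the positive region, since the GPA coefficient is positive.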
Question 2. [20 points]
- (a) Suppose that an individual has an 18% chance of defaulting on her credit card payment (positive, $Y = 1$). What are her odds of defaulting? (Round the result to 2 decimal places.)
- (b) Suppose the odds of defaulting on a credit card payment for a man are 0.4. What is the probability that he will default on his credit card payment? (Express the result as a percentage, rounded to 2 decimal places.)
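Both parts of Question 2 rest on the relation odds = p / (1 - p) between a probability p and the quantity the question asks for. Inverting it gives p = odds / (1 + odds). As a minimal sketch, with hypothetical function names, the two conversions can be written as:

```python
def odds_from_probability(p):
    """Convert a probability in [0, 1) to odds p / (1 - p)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must satisfy 0 <= p < 1")
    return p / (1.0 - p)


def probability_from_odds(odds):
    """Convert non-negative odds back to a probability odds / (1 + odds)."""
    if odds < 0.0:
        raise ValueError("odds must be non-negative")
    return odds / (1.0 + odds)


if __name__ == "__main__":
    # Values taken from parts (a) and (b) of Question 2.
    print(round(odds_from_probability(0.18), 2))
    print(round(100 * probability_from_odds(0.4), 2))
```

The two functions are inverses of each other, so converting a probability to odds and back returns the original value up to floating-point error.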
Question 3. [20 points] A support vector classifier was fit to a small data set with 12 instances. The colors indicate their classes (Blue represents positive and red represents negative). The hyperplane (solid line) and the two margins (dashed lines) are plotted in the following figure.
- (a) List the numbers of the instances that are support vectors.
- (b) Suppose instance 4 (the red dot in the figure) moves closer to the margin. Will it affect the hyperplane?
- (c) List the numbers of the instances that are on the wrong side of their margin but on the correct side of the hyperplane.
- (d) List the numbers of the instances that are on the wrong side of the hyperplane.
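The categories in Question 3 can be expressed through the functional margin m = y * (w·x + b), with labels y in {+1, -1} and the margins at w·x + b = ±1. The sketch below is illustrative only: the hyperplane `w`, `b` and the six points are hypothetical and are not the data in the assignment's figure. It uses the common definition that support vectors are the points on their margin or on the wrong side of it.

```python
def decision_value(w, b, x):
    """Signed value w.x + b; zero on the hyperplane, +1 or -1 on the margins."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def margin_status(w, b, x, y, tol=1e-9):
    """Classify a labeled point (y is +1 or -1) relative to its own margin."""
    m = y * decision_value(w, b, x)  # functional margin
    if m > 1 + tol:
        return "outside margin"      # correctly classified, not a support vector
    if m >= 1 - tol:
        return "on margin"
    if m > 0:
        return "inside margin"       # crossed the margin but not the hyperplane
    # Points exactly on the hyperplane (m == 0) are grouped here for simplicity.
    return "wrong side of hyperplane"


# Hypothetical hyperplane x1 - x2 = 0 and hypothetical labeled points.
w, b = (1.0, -1.0), 0.0
points = {
    1: ((3.0, 1.0), +1),
    2: ((2.0, 1.0), +1),
    3: ((1.5, 1.0), +1),
    4: ((1.0, 2.0), +1),
    5: ((1.0, 3.0), -1),
    6: ((1.0, 2.0), -1),
}

status = {i: margin_status(w, b, x, y) for i, (x, y) in points.items()}
support_vectors = [i for i, s in status.items() if s != "outside margin"]

if __name__ == "__main__":
    for i, s in status.items():
        print(i, s)
    print("support vectors:", support_vectors)
```

Only the points listed in `support_vectors` determine the fitted hyperplane. Moving a point that stays strictly outside its margin leaves the hyperplane unchanged.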
Question 4. Logistic Regression: Programming [30 points]
Please use the dataset Smoking.csv and write Python code to answer the questions below step by step. Please report both your code and its output.
- (a) How many observations are there? How many smokers are there (smoker = 1)?
- (b) Split the data into training (70%) and test (30%) sets. Set random_state = 0. Scale the features/predictors using MinMaxScaler.
- (c) Train a logistic regression model on the training data, with smoker as the target variable and smoke ban (smkban) and age (age) as features. Display the intercept and coefficients. How do you interpret the coefficient for age (age)?
- (d) Check model accuracy on the test data. What is the model accuracy if we take 0.5 (the default) as the cutoff for predicting class labels? (Hint: you may need to apply the scaler to the test data before making any predictions.)
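The steps in Question 4 can be sketched with pandas and scikit-learn as below. This is one possible layout under stated assumptions, not a reference solution. The column names `smoker`, `smkban`, and `age` are assumed, the function name `run_pipeline` is hypothetical, and `LogisticRegression` is used with its default settings, which include L2 regularization. The scaler is fitted on the training features only and then reused on the test features, as the hint in part (d) suggests.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler


def run_pipeline(df, target="smoker", features=("smkban", "age")):
    """Split, scale, fit, and evaluate; column names are assumptions."""
    X = df[list(features)]
    y = df[target]

    # (a) Number of observations and number of rows with target == 1.
    n_obs = len(df)
    n_smokers = int((y == 1).sum())

    # (b) 70/30 split, then fit the scaler on the training features only.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0
    )
    scaler = MinMaxScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)  # reuse the training min/max

    # (c) Logistic regression with scikit-learn's default settings.
    model = LogisticRegression()
    model.fit(X_train_scaled, y_train)

    # (d) Predict class 1 when the predicted probability is at least 0.5.
    proba = model.predict_proba(X_test_scaled)[:, 1]
    y_pred = (proba >= 0.5).astype(int)

    return {
        "n_obs": n_obs,
        "n_smokers": n_smokers,
        "intercept": float(model.intercept_[0]),
        "coefficients": dict(zip(features, model.coef_[0].tolist())),
        "accuracy": accuracy_score(y_test, y_pred),
    }
```

With the assignment's file, this would be called as `run_pipeline(pd.read_csv("Smoking.csv"))`, assuming the file is in the working directory and uses those column names. Because the features are min-max scaled to [0, 1], the coefficient for age is the change in log-odds between the youngest and oldest person in the training set, not the change per year of age.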