STA303 Lab1
This lab is meant to:
· Introduce you to LaTeX
· Review logistic regression concepts
· Get you working with one of the “state-of-the-art” packages for regression
You may use any piece of code I have provided. While in class, you may work together, but you must write up the solutions yourself. I do not allow collaboration outside of class. You may not share code with one another at any point. Clearly label each question.
The National Institute of Diabetes and Digestive and Kidney Diseases conducted a study on 768 adult female Pima Indians living near Phoenix. The purpose of the study was to investigate factors related to diabetes. The pima dataset is available in the “faraway” package. You will only use a sample of these data. The first 4 lines of code in your .rmd file should be:
library(rms)
data(pima, package="faraway")
set.seed(STUDENTNUMBER)
pima = pima[sample(1:nrow(pima), size = 120, replace = FALSE),]
1) [1 mark] Load the data and focus your attention on the outcome variable (test=1 means they had diabetes) and age, bmi, and pregnancy as predictors. Perform some basic exploratory data analysis and data cleaning as needed. Document any changes you made to the dataset. This question is worth 1 mark; one or two sentences/plots are fine.
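As a starting point for question 1, here is a minimal sketch of one possible exploratory pass. It rests on several assumptions that are not stated in the lab: the seed 12345 is a hypothetical stand-in for your student number, the pregnancy predictor is assumed to be the column named pregnant in the faraway version of the data, and a bmi of 0 is assumed to be a missing-value code rather than a real measurement. The names pima_sub and pima_clean are illustrative only.

```r
data(pima, package = "faraway")
set.seed(12345)  # hypothetical placeholder; replace with your own student number
pima <- pima[sample(1:nrow(pima), size = 120, replace = FALSE), ]

# Keep only the outcome and the three predictors named in the question.
pima_sub <- pima[, c("test", "age", "bmi", "pregnant")]

# Numeric summaries reveal implausible values such as a minimum bmi of 0.
summary(pima_sub)

# Assumption: bmi == 0 is a missing-value code, so recode it to NA and drop those rows.
n_zero_bmi <- sum(pima_sub$bmi == 0)
pima_sub$bmi[pima_sub$bmi == 0] <- NA
pima_clean <- na.omit(pima_sub)

# One quick plot: bmi by diabetes status.
boxplot(bmi ~ test, data = pima_clean,
        xlab = "test (1 = diabetes)", ylab = "bmi")
```

Whatever cleaning rule you choose, report how many rows it affected (here, n_zero_bmi) so the change to the dataset is documented.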
2) [3 marks] Fit a logistic regression model with no interactions.
a. Interpret the coefficient for bmi as an odds ratio.
b. Interpret the bmi coefficient as a change in probability near the average bmi (use the exact method). Show your work using LaTeX. Hint: For the other x-values, plug in their means.
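For question 2, the two interpretations could be typeset along the following lines. This is a sketch under assumptions: the coefficient labels \beta_0, ..., \beta_3 are notation introduced here, and the "exact method" is taken to mean the difference in fitted probabilities for a one-unit increase in bmi starting at its mean. If the course defines the comparison differently (for example, an interval centred on the mean), adjust the two evaluation points accordingly.

```latex
% Assumed model (notation introduced for illustration):
\[
\operatorname{logit} P(\text{test} = 1)
  = \beta_0 + \beta_1\,\text{age} + \beta_2\,\text{bmi} + \beta_3\,\text{pregnant}
\]

% (a) Odds ratio for a one-unit increase in bmi, other predictors held fixed:
\[
\mathrm{OR}_{\text{bmi}} = e^{\hat\beta_2}
\]

% (b) Exact change in probability near the mean bmi,
%     with the other predictors set to their sample means:
\[
\hat\eta_0 = \hat\beta_0 + \hat\beta_1\,\overline{\text{age}}
           + \hat\beta_3\,\overline{\text{pregnant}},
\qquad
\operatorname{logit}^{-1}(x) = \frac{1}{1 + e^{-x}}
\]
\[
\Delta \hat p
  = \operatorname{logit}^{-1}\!\bigl(\hat\eta_0 + \hat\beta_2(\overline{\text{bmi}} + 1)\bigr)
  - \operatorname{logit}^{-1}\!\bigl(\hat\eta_0 + \hat\beta_2\,\overline{\text{bmi}}\bigr)
\]
```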
3) [3 marks] Fit a new model, adding age/pregnancy and age/bmi interactions to the model in question 2.
a. Perform a likelihood ratio test to determine if the interactions provide an improvement in the fit. Report the test statistic and p-value.
b. Perform a Wald test on each interaction coefficient. (Null hypothesis is coefficient = 0, significance level = 0.05)
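The two tests in question 3 can be sketched with base R's glm as shown below. The likelihood ratio statistic is the drop in deviance between the nested models, compared to a chi-squared distribution with degrees of freedom equal to the number of added coefficients. The Wald statistic for each coefficient is its estimate divided by its standard error. The seed 12345 and the object names are hypothetical, the predictor name pregnant is assumed from the faraway data, and no data cleaning is applied here; if you cleaned the data in question 1, fit both models to the cleaned data.

```r
data(pima, package = "faraway")
set.seed(12345)  # hypothetical placeholder; replace with your own student number
pima <- pima[sample(1:nrow(pima), size = 120, replace = FALSE), ]

# Model from question 2 (no interactions) and the larger model with interactions.
fit_main <- glm(test ~ age + bmi + pregnant,
                family = binomial, data = pima)
fit_int  <- glm(test ~ age + bmi + pregnant + age:pregnant + age:bmi,
                family = binomial, data = pima)

# Likelihood ratio test: difference in residual deviance of the nested models.
lr_stat <- deviance(fit_main) - deviance(fit_int)
lr_df   <- df.residual(fit_main) - df.residual(fit_int)
lr_p    <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)
c(statistic = lr_stat, df = lr_df, p.value = lr_p)

# Wald tests: z = estimate / standard error, one row per interaction coefficient.
coefs <- coef(summary(fit_int))
wald  <- coefs[grepl(":", rownames(coefs)), , drop = FALSE]
wald
```

The same comparison is available through anova(fit_main, fit_int, test = "Chisq"), which should report the same statistic and p-value.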
4) [5 marks] Use bootstrap validation on both models with B=500.
a. Which model shows better predictive/discrimination ability in the original dataset?
b. Which model shows better predictive/discrimination ability on new datasets?
c. Comment on the degree of overfitting in both models.
d. For the no-interaction model, compute the re-calibrated regression coefficient for bmi. Show your work using LaTeX.
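A minimal sketch of how question 4 might be approached with the rms package follows. It assumes that "re-calibrated coefficient" means the original estimate multiplied by the optimism-corrected calibration slope; check this against the definition used in class. The seed 12345 and the object names are hypothetical, the predictor name pregnant is assumed from the faraway data, and no data cleaning is applied here.

```r
library(rms)
data(pima, package = "faraway")
set.seed(12345)  # hypothetical placeholder; replace with your own student number
pima <- pima[sample(1:nrow(pima), size = 120, replace = FALSE), ]

# validate() needs the design matrix and response stored in the fit (x = TRUE, y = TRUE).
fit_main <- lrm(test ~ age + bmi + pregnant,
                data = pima, x = TRUE, y = TRUE)
fit_int  <- lrm(test ~ age * pregnant + age * bmi,
                data = pima, x = TRUE, y = TRUE)

# Bootstrap validation with B = 500 resamples for each model.
val_main <- unclass(validate(fit_main, B = 500))
val_int  <- unclass(validate(fit_int,  B = 500))

# index.orig: apparent performance in the original data.
# index.corrected: optimism-corrected estimate of performance on new data.
# optimism: the gap between the two, one indicator of overfitting.
cols <- c("index.orig", "optimism", "index.corrected")
rbind(main = val_main["Dxy", cols], interaction = val_int["Dxy", cols])

# Assumed recalibration: shrink the estimate by the corrected calibration slope.
slope_corrected <- val_main["Slope", "index.corrected"]
beta_bmi_recal  <- slope_corrected * coef(fit_main)["bmi"]
beta_bmi_recal
```

Somers' Dxy converts to the c-index (area under the ROC curve) through C = 0.5 + Dxy/2, which may be easier to interpret when comparing discrimination between the two models.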