STAT 361 (Fall 2023) Assignment 3
The assignment is due on Nov. 04 (Saturday) at 23:00 (Kingston, Ontario time). Please submit to Crowdmark.
You can still submit your assignment after the scheduled submission deadline; the penalty for a late assignment is 1% per hour. Watch out for a Crowdmark technical feature: if you have clicked submit for a question before the deadline, then you CANNOT resubmit for that question once the deadline has passed.
Please read the course outline posted on OnQ in Week 1 if you need special accommodation
for your assignment.
Requests to extend the submission deadline by less than 24 hours (say, 1 hour) will not be
considered.
Guidelines for Preparing Solutions
For questions that need R coding, please include only the important R output and the
necessary results in the main text of your solutions. Present them in a clear and concise
fashion (for example, tabulate models and output).
If there is other lengthy code or output related to your work and exploration, please
put it in an Appendix at the end of EACH problem.
These Appendix sections will NOT be marked, but you may submit them as evidence of
your independent work.
If you do not submit Appendix sections, make sure your assignment solutions are presented
clearly and show your independent work.
Do not expect TAs to search through lengthy code and output for your answers. Identical solutions between students or copying from other sources will be investigated for academic integrity violations.
1. How is $R^2$ related to the sample correlation coefficient? Recall the correlation coefficient for random variables $X$ and $Y$, defined as
\[
\rho = \frac{E\{[X - E(X)][Y - E(Y)]\}}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}} = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}}.
\]
The sample correlation coefficient for the observed data $x$ and $y$ is
\[
\hat{\rho} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}.
\]
Show that the $R^2$ of the simple linear regression, model (1) of Chapter 2, is the square of the sample correlation coefficient between $x$ and $y$,
\[
R^2 = \hat{\rho}^2.
\]
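Before attempting the proof, it can help to see the identity hold numerically. The following is a minimal sketch in Python (the assignment itself expects R), using made-up toy data and assuming that model (1) of Chapter 2 is the usual straight-line model with an intercept and that $R^2$ is defined as 1 − SS(Res)/SS(Total). A numerical check like this is not a proof.

```python
# Toy data, made up purely for illustration (not from the assignment).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Centred sums of squares and cross-products.
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Sample correlation coefficient, as defined in the problem statement.
rho_hat = sxy / (sxx * syy) ** 0.5

# Least-squares fit of y = b0 + b1 * x (assumed form of the model).
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * xi for xi in x]

# Assumed definition: R^2 = 1 - SS(Res) / SS(Total).
ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
r_squared = 1.0 - ss_res / syy

# The two printed numbers should agree up to floating-point rounding.
print(r_squared, rho_hat ** 2)
```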
2. Consider the multiple regression model $Y = X\beta + \varepsilon$, where $\varepsilon \sim \mathrm{MVN}_n(0, \sigma^2 I)$. See descriptions of model forms (1) and (2) in Chapter 4.
(a) Show that the residual vector $r = (I - P)Y$, where $P = X(X^T X)^{-1}X^T$, and show that
$I - P$ is also a projection matrix.
(b) Let $U = (\hat{\beta}, r)^T$. Find the joint distribution of the random vector $U$. It may be helpful
to notice that
\[
U = \begin{pmatrix} (X^T X)^{-1} X^T \\ I - P \end{pmatrix} Y.
\]
(c) Show that $\hat{\beta}$ and $r$ are independent.
Hint: For (b) and (c), properties of the multivariate normal distribution may be useful.
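The matrix facts in this problem can be sanity-checked numerically before they are proved. The sketch below is plain Python (the assignment itself expects R) with a made-up 4 × 2 design matrix holding an intercept column and one covariate; the helper functions `matmul`, `transpose` and `inverse_2x2` are hypothetical names written for this illustration. It only checks the identities on one example and is no substitute for the derivations.

```python
def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def inverse_2x2(m):
    """Inverse of a 2 x 2 matrix via the adjugate formula."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def max_abs_diff(a, b):
    return max(abs(a[i][j] - b[i][j])
               for i in range(len(a)) for j in range(len(a[0])))

# Made-up design matrix (intercept + one covariate) and response vector.
X = [[1.0, 0.5], [1.0, 1.5], [1.0, 2.0], [1.0, 4.0]]
Y = [[1.2], [1.9], [3.1], [4.8]]
n = len(X)

Xt = transpose(X)
A = matmul(inverse_2x2(matmul(Xt, X)), Xt)   # (X^T X)^{-1} X^T
P = matmul(X, A)                             # P = X (X^T X)^{-1} X^T
M = [[(1.0 if i == j else 0.0) - P[i][j] for j in range(n)]
     for i in range(n)]                      # I - P

beta_hat = matmul(A, Y)
fitted = matmul(X, beta_hat)
r_direct = [[Y[i][0] - fitted[i][0]] for i in range(n)]  # Y - X beta_hat
r_proj = matmul(M, Y)                                    # (I - P) Y

resid_gap = max_abs_diff(r_direct, r_proj)        # residuals agree
idem_gap = max_abs_diff(matmul(M, M), M)          # (I - P)^2 = I - P
sym_gap = max_abs_diff(M, transpose(M))           # I - P is symmetric
cross = matmul(A, transpose(M))                   # product of the two blocks of U
cross_gap = max(abs(v) for row in cross for v in row)

print(resid_gap, idem_gap, sym_gap, cross_gap)    # all should be near zero
```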
3. Consider the “Savings.txt” data posted. It is an economic dataset collected in 48 different countries. The variable “sr” is the ratio of savings (aggregate personal saving divided by disposable income). The variables “pop15” and “pop75” are the percentages of population under 15 and over 75, respectively. The variable “dpi” is disposable income (per capita, in dollars), while the variable “ddpi” is the rate (percent) of change in disposable income (per capita).
(a) Draw a scatter plot matrix for all the variables involved. Comment on the possible relationships between variables, focusing on those that appear interesting to you.
(b) Fit a simple linear regression model with disposable income (“dpi”) as the response and the percentage of population under 15 (“pop15”) as the only covariate. Describe the model clearly in mathematical form. Report and interpret the fitted model: is there a significant association between the variables? Is this what you expect?
(c) Find the sample correlation coefficient between the two variables you studied in (b). How
is it related to the $R^2$ of the model you fitted in (b)?
(d) Fit a regression model with the ratio of savings ($Y$, “sr”) as the response and all other
variables as the covariates. Describe the model clearly in mathematical form, then report and
discuss the fit of the model. Interpret the estimated coefficient for the rate of change in
disposable income.
(e) Present the analysis of variance table for the model in (d), i.e., the ANOVA table in the form of Table 1 of Section 4.4. The model you specified in (d) assumes that the error terms are i.i.d. normal with mean 0 and variance $\sigma^2$. An estimate of $\sigma$, denoted by $\hat{\sigma}$, can be extracted from your fitted model (suppose it is named “fitd” in your code) with the R code “sigma(fitd)”. How is $\hat{\sigma}$ related to SS(Res), the residual sum of squares?
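As a reminder of what goes into an ANOVA table, the sketch below computes the sums of squares and degrees of freedom for a one-covariate fit on made-up toy data (not the Savings data). It is written in Python purely as an illustration, while your solution should use R. It assumes the total sum of squares is taken about the mean of the response, and it does not attempt to reproduce the exact layout of Table 1 of Section 4.4.

```python
# Toy data, made up for illustration only (not the Savings data).
x = [10.0, 20.0, 30.0, 40.0, 50.0]
y = [12.0, 9.5, 9.0, 6.5, 5.0]

n = len(x)
p = 2  # number of regression coefficients: intercept and slope

x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares estimates for y = b0 + b1 * x.
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * xi for xi in x]

# Sums of squares (total taken about the mean, an assumption here).
ss_total = sum((yi - y_bar) ** 2 for yi in y)
ss_reg = sum((fi - y_bar) ** 2 for fi in fitted)
ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))

# Source of variation, degrees of freedom, sum of squares.
rows = [
    ("Regression", p - 1, ss_reg),
    ("Residual", n - p, ss_res),
    ("Total", n - 1, ss_total),
]
for source, df, ss in rows:
    print(f"{source:<12}{df:>4}{ss:>12.4f}")
```

Both the degrees of freedom and the sums of squares in the first two rows add up to the entries in the last row.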