Q. 1 (30 pts)
To investigate the feasibility of starting a Sunday edition for a large metropolitan newspaper, information was obtained from a sample of 34 newspapers concerning their daily and Sunday circulations (in thousands) (Source: Gale Directory of Publications, 1994). The data file, “Newspapers.txt”, is posted on Quercus under Assignment 2. Fit the regression line predicting Sunday circulation from Daily circulation.
(a) (4 pts) Provide a 95% interval estimate for the true average Sunday circulation of newspapers with a Daily circulation of 500,000. Compute it by hand (using R) and compare the result with the built-in R function.
(b) (4 pts) The particular newspaper that is considering a Sunday edition has a Daily circulation of 500,000. Provide a 95% interval estimate for the predicted Sunday circulation of this paper. Compute it by hand (using R) and compare the result with the built-in R function. How does this interval differ from the one given in (a)?
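For reference, the two intervals in (a) and (b) usually take the following textbook form for simple linear regression. This is a sketch under the standard normal-error model; the symbols $b_0$, $b_1$, $MSE$, $S_{XX}$ and $X_h$ are notation assumed here, not taken from the assignment. Since circulations are recorded in thousands, a Daily circulation of 500,000 corresponds to $X_h = 500$.

```latex
\hat{Y}_h = b_0 + b_1 X_h, \qquad S_{XX} = \sum_{i=1}^{n} (X_i - \bar{X})^2

% (a) confidence interval for the mean response E(Y | X = X_h)
\hat{Y}_h \pm t_{1-\alpha/2;\, n-2} \sqrt{MSE \left[ \frac{1}{n} + \frac{(X_h - \bar{X})^2}{S_{XX}} \right]}

% (b) prediction interval for a single new observation at X = X_h
\hat{Y}_h \pm t_{1-\alpha/2;\, n-2} \sqrt{MSE \left[ 1 + \frac{1}{n} + \frac{(X_h - \bar{X})^2}{S_{XX}} \right]}
```

The only difference is the extra "1" inside the bracket, which accounts for the variability of an individual observation around its mean.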
(c) (4 pts) Construct the Analysis of Variance table, and interpret $R^2$.
(d) (4 pts) Plot the residuals versus the fitted values. Comment on the residual plot.
(e) (4 pts) Obtain a normal probability plot of the residuals and test the hypothesis that the errors are normally distributed using the Shapiro-Wilk test.
(f) (10 pts) We would like to conduct the Brown-Forsythe test to determine whether or not the error variance varies with the level of X. Divide the data into two groups based on the median of X. Use α = 0.05. Do not use a built-in R function; write your own R function to implement this test. What is your test result?
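A minimal sketch of how such a function might be organized, assuming the usual two-group form of the Brown-Forsythe test (absolute deviations of the residuals from their group medians, compared with a pooled two-sample t statistic). The function name `brown_forsythe`, its arguments, and the toy data are hypothetical and only illustrate the structure; the lecture notes may define the steps slightly differently.

```r
# Hypothetical helper: Brown-Forsythe test for constant error variance
# in a simple linear regression of y on x, splitting at the median of x.
brown_forsythe <- function(x, y, alpha = 0.05) {
  e <- resid(lm(y ~ x))                 # residuals from the fitted line
  low <- x <= median(x)                 # group 1: x at or below the median
  e1 <- e[low]
  e2 <- e[!low]
  d1 <- abs(e1 - median(e1))            # absolute deviations from group medians
  d2 <- abs(e2 - median(e2))
  n1 <- length(d1)
  n2 <- length(d2)
  df <- n1 + n2 - 2
  s2 <- (sum((d1 - mean(d1))^2) + sum((d2 - mean(d2))^2)) / df  # pooled variance
  t_star <- (mean(d1) - mean(d2)) / sqrt(s2 * (1 / n1 + 1 / n2))
  crit <- qt(1 - alpha / 2, df)
  list(t_star = t_star, df = df, critical = crit,
       p_value = 2 * pt(-abs(t_star), df),
       reject = abs(t_star) > crit)
}

# Toy data (NOT the newspaper data), just to show the call
set.seed(1)
x_toy <- runif(30, 100, 1000)
y_toy <- 14 + 1.3 * x_toy + rnorm(30, sd = 100)
res <- brown_forsythe(x_toy, y_toy)
res
```

The null hypothesis of constant error variance is rejected when `abs(t_star)` exceeds `critical`.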
Q. 2 (23 pts) This question is for practicing simulation in R. Whenever you generate random numbers, call set.seed(your student number) before the code that generates them, so that we can replicate your results.
We start by assuming the true regression parameters in the model. Thus, we assume that $Y_i = 13.8 + 1.3X_i + \varepsilon_i$, with $\varepsilon_i \sim N(0, 109^2)$. We use the predictor values X (Daily circulation) that we already have from “Newspapers.txt”.
(a) (5 pts) Based on the above information, calculate $P(|\hat{\beta}_1 - \beta_1| < 0.1)$, where $\hat{\beta}_1$ is the least squares estimator of $\beta_1$.
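One way to approach (a), sketched under the stated model with known $\sigma = 109$ and fixed X values: the least squares slope is exactly normal, so the probability reduces to a standard normal calculation. The symbol $S_{XX}$ is notation assumed here.

```latex
\hat{\beta}_1 \sim N\!\left(\beta_1, \frac{\sigma^2}{S_{XX}}\right),
\qquad S_{XX} = \sum_{i=1}^{n} (X_i - \bar{X})^2

P\left(|\hat{\beta}_1 - \beta_1| < 0.1\right)
  = P\left(|Z| < \frac{0.1 \sqrt{S_{XX}}}{\sigma}\right)
  = 2\,\Phi\!\left(\frac{0.1 \sqrt{S_{XX}}}{\sigma}\right) - 1
```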
These are the simulation steps for parts (b)-(d).
- Step 1: Simulate the fake data. Simulate a vector Y of fake data and put it in a data frame with the same X (Daily circulation).
- Step 2: Fit the model and keep the estimated regression coefficients.
- Step 3: Repeat Step 1 and Step 2 10,000 times.
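The three steps above can be sketched as follows. This is only an illustration: the vector `x` is a placeholder grid, not the real Daily circulation values from “Newspapers.txt”, the seed is arbitrary rather than a student number, and the helper name `simulate_once` is hypothetical.

```r
set.seed(123)                              # placeholder seed
x <- seq(100, 1200, length.out = 34)       # placeholder X, NOT the real data
beta0 <- 13.8
beta1 <- 1.3
sigma <- 109
n_sim <- 10000

# Steps 1 and 2: simulate fake Y, fit the line, return the two coefficients
simulate_once <- function(x) {
  y <- beta0 + beta1 * x + rnorm(length(x), mean = 0, sd = sigma)
  fake <- data.frame(x = x, y = y)
  coef(lm(y ~ x, data = fake))
}

# Step 3: repeat n_sim times; rows are replications, columns are b0 and b1
est <- t(replicate(n_sim, simulate_once(x)))
colnames(est) <- c("b0", "b1")

# Empirical means and standard deviations of the estimates
colMeans(est)
apply(est, 2, sd)

# Theoretical standard deviations under the assumed model, for comparison
sxx <- sum((x - mean(x))^2)
sd_b1_theory <- sigma / sqrt(sxx)
sd_b0_theory <- sigma * sqrt(1 / length(x) + mean(x)^2 / sxx)
c(sd_b0_theory, sd_b1_theory)
```

Keeping the coefficients in a matrix makes the later histogram and coverage questions a matter of working column by column.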
(b) (4 pts) Do Step 1 and Step 2. Obtain the least squares estimates of $\beta_0$ and $\beta_1$ with the fake data. Also, compute the estimated $E(Y \mid X_0 = 450)$ and obtain a 95% confidence interval for $E(Y \mid X_0 = 450)$ by hand, and compare it with the result from the R built-in function.
(c) (10 pts) Do Step 3. Make a histogram of the 10,000 values of $\hat{\beta}_0$ and a histogram of the 10,000 values of $\hat{\beta}_1$. Superimpose (overlay) the theoretical distribution on each histogram. Calculate the mean and standard deviation of each set of 10,000 estimates. Are the results consistent with the theoretical values?
(d) (4 pts) Do Step 3. Generate 10,000 95% confidence intervals for $E(Y \mid X_0 = 450)$. What proportion of the 10,000 confidence intervals include the true value of $E(Y \mid X_0 = 450)$? Is this result consistent with the theoretical coverage?
Q. 3 (12 pts) This question is for practicing writing an R function.
The number of injury incidents (y) and the proportion of total flights from New York (x) for nine (n = 9) major United States airlines for a single year are given below.
x <- c(0.095, 0.1920, 0.075, 0.2078, 0.1382, 0.054, 0.1292, 0.0503, 0.0629)
y <- c(11, 7, 7, 19, 9, 4, 3, 1, 3)
(a) (10 pts) Build a Box-Cox transformation function in R (follow the steps described in the lecture notes) and compare the result with the built-in R function.
(b) (2 pts) Why is a transformation needed for these data? Explain.
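As a rough sketch of what part (a) could look like, assuming the standardized Box-Cox procedure that picks the λ minimizing the error sum of squares over a grid (the steps in the lecture notes may differ in detail). The function name `box_cox_sse` and the grid `lambdas` are hypothetical choices; `boxcox()` from the MASS package is used as the built-in comparison.

```r
x <- c(0.095, 0.1920, 0.075, 0.2078, 0.1382, 0.054, 0.1292, 0.0503, 0.0629)
y <- c(11, 7, 7, 19, 9, 4, 3, 1, 3)

# Hypothetical helper: SSE of the regression of the standardized
# Box-Cox transform of y on x, for a given lambda
box_cox_sse <- function(lambda, x, y) {
  k2 <- exp(mean(log(y)))                 # geometric mean of y
  if (abs(lambda) < 1e-8) {
    w <- k2 * log(y)                      # limiting case lambda = 0
  } else {
    k1 <- 1 / (lambda * k2^(lambda - 1))
    w <- k1 * (y^lambda - 1)
  }
  sum(resid(lm(w ~ x))^2)
}

lambdas <- seq(-2, 2, by = 0.1)           # assumed search grid
sse <- sapply(lambdas, box_cox_sse, x = x, y = y)
best_lambda <- lambdas[which.min(sse)]
best_lambda

# Built-in comparison: MASS::boxcox maximizes the profile log-likelihood
library(MASS)
bc <- boxcox(lm(y ~ x), lambda = lambdas, plotit = FALSE)
mass_lambda <- bc$x[which.max(bc$y)]
mass_lambda
```

Because the standardized transform makes the SSE comparable across λ, minimizing SSE and maximizing the profile log-likelihood should select the same grid point.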
Q. 4 (20 pts) (4 pts each) A simple linear regression was fit, relating the number of injury incidents (Y) to the proportion of total flights from New York (X) from the above question (Q. 3).
Use the simple linear regression in matrix form.
(a) Obtain the design matrix $X$ and the response vector $Y$.
(b) Obtain the vector of estimated regression coefficients, $\hat{\beta}$, the vector of fitted values, $\hat{Y}$, and the residual vector, $e$.
(c) Compute the estimated variance-covariance matrix of $\hat{\beta}$, $\widehat{Var}(\hat{\beta})$.
(d) Find the hat matrix $H$. What does $\sum_{i=1}^{n} h_{ii}$ equal? Here, $h_{ij}$ is the element of $H$ in the $i$th row and $j$th column.
(e) Find the estimated variance-covariance matrix of the residual vector, $\widehat{Var}(e)$.
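The matrix quantities in parts (a)-(e) can be computed along these lines. This is a sketch using the standard matrix formulas for least squares; the variable names (`XtX_inv`, `var_b`, `var_e`, and so on) are illustrative choices, not notation from the assignment.

```r
x <- c(0.095, 0.1920, 0.075, 0.2078, 0.1382, 0.054, 0.1292, 0.0503, 0.0629)
y <- c(11, 7, 7, 19, 9, 4, 3, 1, 3)

X <- cbind(1, x)                      # (a) design matrix: intercept column and x
Y <- matrix(y, ncol = 1)              #     response as a column vector

XtX_inv <- solve(t(X) %*% X)
b <- XtX_inv %*% t(X) %*% Y           # (b) b = (X'X)^{-1} X'Y
H <- X %*% XtX_inv %*% t(X)           # (d) hat matrix
Y_hat <- H %*% Y                      # (b) fitted values
e <- Y - Y_hat                        # (b) residuals

n <- nrow(X)
p <- ncol(X)
mse <- as.numeric(t(e) %*% e) / (n - p)

var_b <- mse * XtX_inv                # (c) estimated Var(b)
var_e <- mse * (diag(n) - H)          # (e) estimated Var(e)

sum(diag(H))                          # (d) trace of H
```

Comparing `b` and `var_b` with `coef()` and `vcov()` from `lm(y ~ x)` is a quick way to check the matrix work.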
Q. 5 (15 pts) (5 pts each) An engineer is interested in the relationship between steel thickness (X) and its breaking strength (Y). She obtains the following matrices from a matrix computer package:
$$X'X = \begin{pmatrix} 12 & 60 \\ 60 & 360 \end{pmatrix}, \qquad X'Y = \begin{pmatrix} 120 \\ 800 \end{pmatrix}, \qquad Y'Y = 2470$$
(a) Construct the ANOVA table based on this information.
(b) Provide a 95% confidence interval for $\beta_1$.
(c) Test $H_0: \beta_1 = 0$ vs $H_a: \beta_1 \neq 0$ with $\alpha = 0.05$.
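The quantities needed for (a)-(c) can all be written in terms of the supplied $X'X$, $X'Y$ and $Y'Y$. The relations below are the standard matrix expressions for simple linear regression, given as a reminder rather than as the required solution path; $b$, $s\{b_1\}$ and $t^*$ are notation assumed here. Note that $n$ is the (1,1) entry of $X'X$ and $\sum Y_i$ is the first entry of $X'Y$.

```latex
b = (X'X)^{-1} X'Y

SSTO = Y'Y - \frac{\left(\sum Y_i\right)^2}{n}, \qquad
SSE  = Y'Y - b'X'Y, \qquad
SSR  = b'X'Y - \frac{\left(\sum Y_i\right)^2}{n}

MSE = \frac{SSE}{n-2}, \qquad
\widehat{Var}(b) = MSE\,(X'X)^{-1}

b_1 \pm t_{0.975;\, n-2}\; s\{b_1\}, \qquad
t^* = \frac{b_1}{s\{b_1\}}
```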