Assignment 2
For Stat 231 - Summer 2022 - U of Waterloo
Due on Thursday, June 9th, at 11:59 pm EDT.
Submit your answers and results from R in the Crowdmark PDF submission. Type the final results within your written responses and insert any images. Your assignment submission must be typed and submitted as a PDF; there are no exceptions. Any submitted answer that is not typed will not be marked and will receive a mark of zero. Further, all written answers must be in full sentences.
Written answers which are not in full sentences will receive a deduction of 50% of the marks in that question part. Additionally, all plots should have titles and axes labelled appropriately to receive full marks. The feedback you receive will be focused on your R output and interpretations, not the detailed code itself. Thus, no R code should be included in your pdf solution file.
Q1 Maximum likelihood estimation
(14 marks total)
This question pertains to the mobile games data set variable rare_gems.
For parts a, b, and c, specify the estimator as a function of the observations y and parameter θ, and give the estimated values for θ1, θ2, θ3 from the data calculated from R.
a) (2 marks) Apply a Poisson(θ1) distribution to rare_gems. Find the MLE of θ1.
b) (2 marks) Apply a Binomial(n = 12, θ2) distribution to rare_gems. Find the MLE of θ2.
c) (2 marks) Apply a Binomial(n = 24, θ3) distribution to rare_gems. Find the MLE of θ3.
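As a rough illustration of how parts a to c can be approached in R, the sketch below uses simulated counts in place of rare_gems, since the mobile games data set is not reproduced here. The variable names (y, theta1_hat, and so on) and the simulation settings are assumptions made for this example only. It uses the standard results that the Poisson MLE is the sample mean and that the Binomial MLE with known n is the sample mean divided by n, and it cross-checks the Poisson case numerically.

```r
# Simulated stand-in for rare_gems (hypothetical data, not the real variable).
set.seed(231)
y <- rbinom(200, size = 12, prob = 0.3)

# Poisson(theta): the MLE is the sample mean.
theta1_hat <- mean(y)

# Binomial(n, theta) with n known: the MLE is the sample mean divided by n.
theta2_hat <- mean(y) / 12
theta3_hat <- mean(y) / 24

# Numerical cross-check: maximize the Poisson log-likelihood directly.
pois_loglik <- function(theta) sum(dpois(y, lambda = theta, log = TRUE))
theta1_num <- optimize(pois_loglik, interval = c(0.01, 12),
                       maximum = TRUE)$maximum

# Collect the estimates for inspection.
c(theta1 = theta1_hat, theta2 = theta2_hat, theta3 = theta3_hat,
  theta1_numeric = theta1_num)
```

The numerical maximizer should agree with the closed-form estimate up to the tolerance of optimize, which is a useful sanity check before reporting a value.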
For parts d, e, and f, show your work instead of just giving the R output. That is, write out the log-likelihood function in terms of θ and observations y. Make sure to include the constants because we are comparing between different distributions.
d) (2 marks) Find the log-likelihood of the rare_gems data given the Poisson(θ1) distribution.
e) (2 marks) Find the log-likelihood of the rare_gems data given the Binomial(n = 12, θ2) distribution.
f) (2 marks) Find the log-likelihood of the rare_gems data given the Binomial(n = 24, θ3) distribution.
g) (2 marks) Which of these three distribution families best fits the data? How do you know?
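The following sketch shows one way to evaluate log-likelihoods with their constant terms included, again on simulated counts standing in for rare_gems. The helper binom_loglik and all variable names are hypothetical and introduced only for this example. Each hand-written expression is checked against the built-in density functions dpois and dbinom with log = TRUE.

```r
# Simulated stand-in for rare_gems (hypothetical data, not the real variable).
set.seed(231)
y <- rbinom(200, size = 12, prob = 0.3)

# Poisson log-likelihood at the MLE, including the constant -sum(log(y!)).
theta1_hat <- mean(y)
ll_pois <- sum(y * log(theta1_hat) - theta1_hat - lfactorial(y))

# Binomial log-likelihood, including the constant sum(log(choose(n, y))).
binom_loglik <- function(y, n, theta) {
  sum(lchoose(n, y) + y * log(theta) + (n - y) * log(1 - theta))
}
ll_bin12 <- binom_loglik(y, 12, mean(y) / 12)
ll_bin24 <- binom_loglik(y, 24, mean(y) / 24)

# Cross-check the hand-written formulas against the built-in densities.
check_pois <- sum(dpois(y, lambda = theta1_hat, log = TRUE))
check_bin12 <- sum(dbinom(y, size = 12, prob = mean(y) / 12, log = TRUE))
check_bin24 <- sum(dbinom(y, size = 24, prob = mean(y) / 24, log = TRUE))

# A larger (less negative) log-likelihood indicates a better fit to the data.
c(poisson = ll_pois, binomial12 = ll_bin12, binomial24 = ll_bin24)
```

Keeping the constant terms matters here because they differ between the Poisson and Binomial families, so dropping them would make the three values incomparable.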
Q2 Goodness of Fit
(11 marks total)
This question pertains to the mobile games data set variable pregame_skill. For the MLEs, you can use the estimators for the Normal/Gaussian distribution that you already know.
a) (2 marks) Make a Q-Q plot of the variate pregame_skill against the normal distribution. From this, does it seem reasonable to assume that pregame_skill is approximately normally distributed?
b) (1 mark) Assuming pregame_skill has a Gaussian distribution, find μ̂, the MLE of the parameter μ.
c) (1 mark) Assuming pregame_skill has a Gaussian distribution, find σ̂, the MLE of the parameter σ.
d) (2 marks) Use your answer from part c and the var function to calculate the bias in the MLE of σ. (Hint: The bias is calculated as E[S²] − σ².)
e) (3 marks - 1 for observed, 2 for expected) The variate skill_grade is an ordinal version of the variate pregame_skill with the following cutoffs. Use the table function, the qnorm function, and the invariance property to fill in the observed and expected counts.
pregame_skill | skill_grade | observed counts | expected counts
(-inf, 70)    | F           |                 |
[70, 85)      | E           |                 |
[85, 100)     | D           |                 |
[100, 115)    | C           |                 |
[115, 130)    | B           |                 |
[130, +inf)   | A           |                 |
f) (2 marks) Informally, do these observed and expected counts suggest that the normal distribution is a reasonable model for pregame skill? Why or why not?
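The workflow in this question can be sketched in R as follows, using simulated values in place of pregame_skill. All variable names and the simulation settings are assumptions for illustration. Note that this sketch computes interval probabilities with pnorm, which is a choice made here and not necessarily the qnorm-based approach the assignment has in mind.

```r
# Simulated stand-in for pregame_skill (hypothetical data).
set.seed(231)
skill <- rnorm(500, mean = 100, sd = 15)
n <- length(skill)

# Q-Q plot against the normal distribution, with a title and axis labels.
qqnorm(skill, main = "Normal Q-Q plot of simulated skill",
       xlab = "Theoretical quantiles", ylab = "Sample quantiles")
qqline(skill)

# Gaussian MLEs: the sample mean, and the standard deviation with divisor n.
mu_hat <- mean(skill)
sigma_hat <- sqrt(mean((skill - mu_hat)^2))

# var() uses divisor n - 1, so the two variance estimates differ
# by the factor (n - 1) / n.
s2 <- var(skill)
ratio <- sigma_hat^2 / s2

# Observed counts: cut() with right = FALSE gives intervals of the form [a, b).
breaks <- c(-Inf, 70, 85, 100, 115, 130, Inf)
grade <- cut(skill, breaks = breaks, right = FALSE,
             labels = c("F", "E", "D", "C", "B", "A"))
observed <- table(grade)

# Expected counts: n times the fitted normal probability of each interval.
expected <- n * diff(pnorm(breaks, mean = mu_hat, sd = sigma_hat))

data.frame(grade = names(observed),
           observed = as.vector(observed),
           expected = round(expected, 1))
```

Plugging the MLEs into the normal distribution function relies on the invariance property: the MLE of each interval probability is that probability evaluated at μ̂ and σ̂.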
Q3 PPDAC Case Study
(12 marks total, 1 mark per part)
Goobookazon wants to see which of two website setups results in a higher proportion of users clicking on ads. Whenever a user visits their homepage, they are randomly given one of two different setups. (This sort of experimentation happens all the time in a method called A/B testing.) After 30 days, each setup has had exactly 1 million views, and the number of ad clicks from each setup has been counted.
Apply the PPDAC framework as in recorded Lecture 04-3 to fill in the following definitions for this specific study.
PROBLEM:
a) Objectives:
b) Units:
c) Target Population/Process:
d) Response variate:
e) Explanatory variates (List at least 2):
f) Attribute:
PLAN:
g) Study population/process:
h) Study error:
i) Sampling protocol:
j) Sampling error:
k) Sampling size:
l) Measurement error: