CourseNana | MATH6166 Statistical Computing for Data Scientists - Coursework: Exploratory Analysis of Tooth Growth Data Set

MATH6166 Statistical Computing for Data Scientists/MATH6173 Statistical Computing — Coursework CourseNana.COM

Instructions for Coursework CourseNana.COM

This coursework assignment is worth 100% of the overall mark for the module MATH6166, and 50% of the overall mark for the module MATH6173. CourseNana.COM
The deadline is 15:00 on Friday 17th November 2023, and your completed work must be submitted electronically via Blackboard. CourseNana.COM
All of your answers to the coursework are to be written in the R Markdown file “Coursework- template.Rmd”, which can be found under the Coursework heading in the Assessment panel on Black- board. Download this file and write all of your answers to the coursework questions in this file only. CourseNana.COM
The file “Coursework-template.Rmd” is a template answer sheet that is used to write your answers to all of the coursework questions. This is the only file that you need to use to write down your answers. You must not delete any of the content from this file; for example, do not change the names or options of code chunks, or change the layout of the file. The layout of the file has been designed specifically so that the correctness of the R functions that you write can be systematically assessed. CourseNana.COM
You must write all of your coding answers to each question in the corresponding empty R Markdown code chunk within the file. Do not create any new code chunks. Each code chunk has a name that matches the corresponding coursework question, and there are comments that signpost where to provide an answer for a specific question. For example, for Question 1 (a), you will see the comment Write your code for Q1 (a) in the code chunk below: and below this comment will be a code chunk with the label Q1a. Any changes you make outside of the designated code chunks will likely result in your assessment not being able to be marked. All the coding answers must be carried out through R commands (do not use comments to give your answers). CourseNana.COM
Where relevant, report your answers in the designated section below the R code chunks for each question in the R Markdown file. For example, for Question 1 (a), you will see the comment Report your answer for Q1 (a) below:, below which you can write your answer in standard R Markdown formatting. CourseNana.COM
When writing an R function, the CourseNana.COM

• function name,
• argument names and data type, and • output names and data type CourseNana.COM

must be exactly the same as those specified in the questions; otherwise a heavy penalty will be applied. Please note that uppercase and lowercase of the same alphabet are not the same character, so if the question asks to write a function/argument called, apple, but a function/argument called applE is presented in your answer, then that would be wrong. CourseNana.COM
Your submission must consist of exactly two files; the R Markdown file and its associated html file that is created by knitting/rendering the R Markdown file. The R Markdown file is the “Coursework- template.Rmd” file with your answers added in the appropriate places. Using this file, you must generate a html output file using the knit functionality in R Markdown, and submit this along with the R Markdown file. CourseNana.COM
Your R Markdown file must be renamed, and have the file name <your student ID>.Rmd. For example, if your student ID is 12345678, then your script must have the file name 12345678.Rmd. CourseNana.COM
Similarly, the associated html file should be named <your student ID>.html. For example, if your student ID is 12345678, then your html file must have the file name 12345678.html. (This should be CourseNana.COM

1 CourseNana.COM

correctly named by default provided that you have named the R Markdown file correctly.) CourseNana.COM

Missing R codes will be penalised. Please ensure the submitted files have the correct file format; CourseNana.COM

otherwise it cannot be marked. Note an R Markdown file is a different format to an R script file. CourseNana.COM
Your entire R Markdown file must be reproducible and run without any errors. CourseNana.COM
You must informatively comment your R code where appropriate. CourseNana.COM
Unless specifically stated otherwise, you do not have to include error handling in your code. CourseNana.COM
Ensure that all required plots are informative, including the use of appropriate plotting areas, CourseNana.COM

points/lines sizes and colours, as well as including meaningful labels, titles and legends. CourseNana.COM
You must not load any add-on packages, i.e., using functions library or require etc. CourseNana.COM
All data files required to complete the Coursework can be downloaded from Blackboard, under the CourseNana.COM

Coursework panel. CourseNana.COM
To help you with the completion of the Coursework, there is an example Coursework question and CourseNana.COM

related files on Blackboard for reference. You can find the following files under the Coursework Example Question panel: CourseNana.COM
- An example Coursework question, in the file Coursework-example.pdf, CourseNana.COM
- A template R Markdown file Coursework-example-template.Rmd in which to write your answers CourseNana.COM
  
  to the example Coursework question, CourseNana.COM
- A model solution R Markdown file Coursework-example-solutions.Rmd, which contains solu- CourseNana.COM
  
  tions to the Coursework example question and shows you the way you are expected to fill in the CourseNana.COM
  
  Coursework-example-template.Rmd file with your answers. CourseNana.COM
- The model solution output html file Coursework-example-solutions.html, which is simply the CourseNana.COM
  
  rendered output of the Coursework-example-solutions.Rmd file. CourseNana.COM
- all-ad-sales.txt and youtube-ad-sales.txt: the data files needed for the example question. CourseNana.COM
When asked to import a data file into your R workspace, you should first set the working directory to the location of the file on your local computer using the command setwd, and then read the file in using the appropriate R command (e.g. read.csv or read.table). You should set the working directory every time you have to import a data file. For an example of how to do this, see Question 1 (a) in the example Coursework solutions file. CourseNana.COM
The standard Faculty rules on late submissions apply: for every weekday, or part of a weekday, that the coursework is late, the mark will be reduced by a factor of 10%. No credit will be given for work which is more than one week late. CourseNana.COM
All coursework must be carried out independently. You are reminded of the University’s Academic Integrity Policy, see https://www.southampton.ac.uk/quality/assessment/academic_integrity.page. CourseNana.COM
In the interests of fairness, queries which relate directly to the coursework must be made via the Coursework Discussion Forum on Blackboard. This ensures that everybody has access to the same information. CourseNana.COM

Total marks available: 100 CourseNana.COM

2 CourseNana.COM

Question 1: Exploratory Analysis of Tooth Growth Data Set CourseNana.COM

[22 marks] CourseNana.COM

The data set in the file tooth-growth.txt contains data on an experiment from researchers at McGill University investigating the effect of vitamin C intake on tooth growth in guinea pigs. For each of the 60 guinea pigs in the experiment, there are recordings of three variables: CourseNana.COM

len: a numeric variable, giving the tooth length of the guinea pig. CourseNana.COM
method: a character variable, giving the delivery method used to give the guinea pig the vitamin C, CourseNana.COM

which is either "AA" (ascorbic acid), or "OJ" (orange juice). CourseNana.COM
dose: a numeric variable, specifying the dose (in milligrams per day) of vitamin C that the guinea pig CourseNana.COM

received. CourseNana.COM

(a) [3 marks] Import the data set from the tooth-growth.txt file into the R workspace in a variable named tooth_data. Change the column name len to be tooth_len (this variable will henceforth be referred to as tooth_len). Change the data type of the method column from character to factor. How many of the observations have delivery method by orange juice? Store this in a variable called num_OJ. CourseNana.COM
(b) [2 marks] What is the variance of the tooth_len column for values where the delivery method is orange juice? Store this in a variable called var_OJ. What is the mean of the tooth_len column, where the delivery method is made by ascorbic acid and the dose value is 2? Store this in a variable called mean_AA_2. CourseNana.COM
(c) [3 marks] The variable dose takes one of three possible values: 0.5, 1, or 2. Create a side-by-side box plot of tooth_len for the three values of dose. CourseNana.COM
(d) [3 marks] Cross-tabulate the frequencies of the method and dose variables, creating a 2 × 3 table containing the frequencies of each of the 6 combinations of method and dose in tooth_data, and store this in a variable called dose_method_counts. Create a barplot showing the frequencies of the observations across the method and dose variables. There should be three bars, corresponding to each dose value, and each bar should be divided into two based on the method variable. CourseNana.COM
(e) [3 marks] Write a function, called subset_summary, that calculates the median and length of a given subset of a numeric vector x. The median and length should be calculated only for values of x whose cor- responding factor variable, x_factor, is a user-specified value chosen_factor. The subset_summary function should have the following features: CourseNana.COM

Arguments: CourseNana.COM

The subset_summary function should compute the median and length of the subset of x, where the values of x considered have corresponding x_factor value equal to chosen_factor. CourseNana.COM

Return:
subset_summary: a list object with two elements: CourseNana.COM

1. subset_median: the median of the subset of x. 2. subset_length: the length of the subset of x. CourseNana.COM
(f) [2 marks] Using the subset_summary function you wrote in part (e), calculate the median and number of observations for the tooth_len variable where the delivery method was ascorbic acid. CourseNana.COM

3 CourseNana.COM

(g) [6 marks] In this question, you will write a function, called subset_summary2, that improves upon the function subset_summary that you wrote in part (e), by including checks to ensure that the arguments passed to the function satisfy certain requirements. First, copy and paste the code you wrote for part (e) into your answer for part (g), and rename the function name to subset_summary2 in part (g). CourseNana.COM

The function subset_summary2 should check that the arguments satisfy the following five requirements: CourseNana.COM

1. x should contain numeric values,
2. x_factor should be a factor object,
3. x_factor should be the same length as x,
4. chosen_factor should be a character variable, and
5. chosen_factor should be one of the possible values (levels) in the x_factors variable. CourseNana.COM

If invalid inputs are passed to this function, it should stop prematurely by using the appropriate R function and print out the associated relevant error message exactly as written below: CourseNana.COM

For example, if requirement 2 is not satisifed, then the function should stop and print the error message 2. CourseNana.COM

4 CourseNana.COM

Question 2: The Multivariate t-Distribution [16 marks] CourseNana.COM

If X is a p-dimensional random vector distributed according to the multivariate t-distribution, then the probability density function (pdf) of X evaluated at x, denoted fX(x), is given by CourseNana.COM

Γ((ν + p)/2) 1 T −1 −(ν+p)/2 fX(x)=Γ(ν/2)νp/2πp/2det(Σ)1/2 1+ν(x−μ) Σ (x−μ) CourseNana.COM

, (1) where Γ(·) is the gamma function, det is the matrix determinant, Σ is a p × p scaling matrix, μ is a CourseNana.COM

p-dimensional vector, and ν > 0 is a positive scalar known as the degrees of freedom. CourseNana.COM

(a) [7 marks] Write a function, called calc_mvt_pdf, that calculates the pdf, or natural logarithm of the pdf, at a single value x, for user-specified values of Σ, μ, and ν. The function calc_mvt_pdf should have the following features: CourseNana.COM

Arguments: CourseNana.COM

x: a numeric vector of length p, representing the p-dimensional x. CourseNana.COM
Sigma: a numeric p × p matrix, representing Σ in Equation (1). The default value should be the CourseNana.COM

p × p identity matrix Ip. CourseNana.COM
mu: a numeric vector of length p, representing μ in Equation (1). The default value should be the CourseNana.COM

p-dimensional zero vector 0 = [0, . . . , 0]T . CourseNana.COM
df: a positive numeric scalar, representing the degrees of freedom ν given in Equation (1). The CourseNana.COM

default value should be 1. CourseNana.COM
calc_log: a logical variable. If set to TRUE, the natural logarithm of fX (x), given by log(fX (x)), CourseNana.COM

is calculated. If set to FALSE, the pdf fX(x) is calculated. The default value should be TRUE. CourseNana.COM

Computation: CourseNana.COM

If calc_log = FALSE, the calc_mvt_pdf function should compute the pdf fX(x) given in Equation (1), and if calc_log = TRUE, the natural logarithm of Equation (1) is computed. You must not use any existing functions in R that already calculate the multivariate t-distribution density. CourseNana.COM

When computing the pdf, numerical errors can occur due to the calculation of very small or very large numbers, which is why we might calculate the log pdf instead. If the log pdf is to be computed, you should not just apply the log function to the calculation you use to compute the pdf. Instead, you should take advantage of properties of the log function to rewrite the logarithm of Equation (1) to produce a more numerically stable result when computing the log pdf. CourseNana.COM

Return:
1. x_pdf: if calc_log = FALSE, the pdf fX(x) is returned. If calc_log = TRUE, the log pdf, CourseNana.COM

log(fX(x)), is returned. Hints: CourseNana.COM

You may use the R functions gamma and det. CourseNana.COM
To check the numerical accuracy of your calculations, note that the expected output of the cal- CourseNana.COM

culation calc_mvt_pdf(rep(3,200)) should be (approximately) the value −506.967. CourseNana.COM

Suppose that x1, . . . , xn are a set of n observed values that are all generated independently from the multivariate t-distribution with parameters Σ, μ and ν. Then, their joint probability density is given by Qni=1 fX(xi), with fX(x) as in Equation (1), whilst the logarithm of the joint probability density is given by Pni=1 log(fX(xi)). CourseNana.COM

5 CourseNana.COM

(b) [4 marks] Using a loop structure and the function calc_mvt_pdf that you wrote in part (a), write a function, calc_joint_mvt_pdf, that calculates the joint probability density for observed values x1, . . . , xn that are all generated independently from the same multivariate t-distribution. The function should have the following features: CourseNana.COM

Arguments: CourseNana.COM
1. x_matrix: a numeric n × p matrix of length p, with the i-th row representing an observed value xi . CourseNana.COM
2. Sigma: a numeric p×p matrix, representing Σ in Equation (1). The default value should the p×p identity matrix Ip.If calc_log = FALSE, the calc_joint_mvt_pdf function should compute the joint pdf Qni=1 fX(xi), and if calc_log = TRUE, the function should compute the natural logarithm of the joint pdf, Pni=1 log(fX(xi)). This value should be computed using a loop structure, by looping over the n values x1,...,xn. CourseNana.COM
Return:
1. joint_pdf: if calc_log = FALSE, the joint pdf Qni=1 fX(xi) is returned. If calc_log = TRUE, CourseNana.COM

the logarithm of the joint pdf, Pni=1 log(fX(xi)) is returned. CourseNana.COM
(c) [3 marks] Using the function joint_pdf you wrote in part (b), compute both the joint pdf and log joint pdf for the observed values x1 = [1.2,3]T, x2 = [1.1,3.1]T, x3 = [3.4,2.6]T, coming from the multivariate t-distribution with CourseNana.COM

T 1 0.3
μ=[1.2,3] , Σ= 0.3 1 , ν=2. CourseNana.COM

Store the results in variables joint_mvt_pdf and log_joint_mvt_pdf respectively. CourseNana.COM
(d) [2 marks] Suppose that X has a bivariate t-distribution with parameters μ = [0, 0]T, Σ is the 2 × 2 identity matrix I2, and ν = 1. Suppose that Y has a bivariate standard normal distribution (i.e. the bivariate normal distribution with mean vector and covariance matrix the same as X above). Cal- culate the absolute value of the difference between these two densities, evaluated at the value [0,1]T, i.e. calculate CourseNana.COM

fX([0,1]T)−fY ([0,1]T) , CourseNana.COM

where fX(·) and fY (·) are the pdfs of X and Y respectively. Store the result in a variable named density_diff. CourseNana.COM

6 CourseNana.COM

Question 3: Nonparametric Kernel Density Estimation CourseNana.COM

[32 marks] CourseNana.COM

Density estimation is the problem of estimating the probability density function (pdf) from a set of given data points. Concretely, suppose we observe n (1-dimensional) data points X1, . . . , Xn, and we wish to estimate the underlying unknown pdf, denoted f(x), of the data. CourseNana.COM

The Kernel Density Estimator (KDE) is one of the most well-known methods for solving this problem. The KDE at a given value x, denoted fˆ (x), is given by the formula CourseNana.COM

width h, the KDE fˆ n,h CourseNana.COM

(x) is the estimator of the true pdf f(x), K(·) is called the kernel function, and h > 0 is called the bandwidth. The kernel function is generally a smooth, symmetric function, and the bandwidth controls the amount of smoothing in the estimation. CourseNana.COM

(a) [3 marks] The Gaussian kernel function
K(y)=√ exp − CourseNana.COM

is one of the most CourseNana.COM

gauss_kernel_calc, CourseNana.COM

common choices for the kernel function
that computes, for n data points X1, . . . , Xn, CourseNana.COM

Write
given value x, and CourseNana.COM

n,h CourseNana.COM

Return:
• kde_est: a numeric object containing the KDE estimator fˆ CourseNana.COM

(b) [2 marks] Another possible kernel is the triangular kernel: CourseNana.COM

(x). CourseNana.COM

(c) [2 marks] The bandwidth parameter h can be set using the “rule-of-thumb” choice: −1/5 IQR CourseNana.COM

1 y2 2π 2 CourseNana.COM

(3) CourseNana.COM

(3). The function should have the following features: Arguments: CourseNana.COM

data: a numeric vector of length n containing the data set X1 , . . . , Xn that we want to estimate the pdf of, CourseNana.COM
x: a numeric object, the value x for which we want to estimate the pdf f(x), CourseNana.COM
n: a positive integer object specifying the number of data points n, CourseNana.COM
h: a numeric object specifying the bandwidth h of the kernel function. CourseNana.COM

Computation:
Calculate fˆ (x) for a given (single) value of x using the Gaussian kernel. CourseNana.COM

K .
(x) given in Equation (2) with K given by the Gaussian kernel in Equation CourseNana.COM

(1−|y| if|y|≤1, 0 otherwise. CourseNana.COM

K(y) =
where | · | is the absolute value. Write a function, called triangular_kernel_calc, that computes, CourseNana.COM

for n data points X , . . . , X , given value x, and bandwidth h, the KDE fˆ (x) given in Equation (2) 1n n,h CourseNana.COM

with K given by the triangular kernel in Equation (4). The function should have exactly the same features as your answer to part (a), except that the function uses the triangular kernel. CourseNana.COM

where σˆ is the estimated standard deviation σˆ = q(n − 1)−1 Pni=1(Xi − X)2 with X = n−1 Pni=1 Xi, IQR is the inter-quartile range, and n is the number of data points observed. Write a function, rule_of_thumb_calc, that computes hROT for a data set X1 , . . . , Xn . The function should have the following features: CourseNana.COM

Arguments:
1. data: a numeric vector of length n containing the data set X1 , . . . , Xn , CourseNana.COM

2. n: a positive integer object specifying the number of data points n. CourseNana.COM

data: a numeric vector of length n containing the data set X1 , . . . , Xn that we want to estimate the pdf of, CourseNana.COM
x_vals: a numeric vector of length m representing values x1 , . . . , xm at which we want to estimate the pdf f(x), CourseNana.COM
kernel: a character object specifying the kernel function we want to use. The kernel function can be set to be either "gaussian" or "triangular". The default argument of kernel is "gaussian". CourseNana.COM
h: a numeric object specifying the bandwidth of the kernel function. The default value of h should CourseNana.COM

be NULL. CourseNana.COM

Computation: CourseNana.COM

If no bandwidth parameter h is specified by the user, then the bandwidth h is calculated using the rule-of-thumb choice in Equation (5). Then, the function calculates fˆ (x) for all x = x , . . . , x , CourseNana.COM

with respect to the chosen bandwidth and kernel function. Return: CourseNana.COM

• kde_est: a numeric vector of length m containing the KDE esimtator fˆ (x) evaluated at the n,h CourseNana.COM

points x = x ,...,x , i.e. a vector containing the values fˆ (x ),...,fˆ (x
1 m n,h 1 n,h m CourseNana.COM

). CourseNana.COM

(e) [6 marks] Generate 100 random variables with distribution N (3, 2), i.e. normally distributed with mean equal to 3 and variance equal to 2, and store them in a variable named norm_data. Calculate the KDE with respect to norm_data, evaluating the KDE at 200 points x1, . . . , x200 on an equi-spaced grid, beginning at -3, and ending at 9; CourseNana.COM

1. using the Gaussian kernel, using the rule-of-thumb choice to set the bandwidth h; 2. using the triangular kernel, using bandwidth h = 0.7. CourseNana.COM

Store the results in variables named gaussian_est and triangular_est respectively. On a single figure, plot the values x ,...,x against the KDE fˆ (x ),...,fˆ (x ) for both gaussian_est CourseNana.COM

1 200 n,h 1 n,h 200
and triangular_est using a line plot. Add to this plot the true values of the pdf, f(x), evaluated at x1, . . . , x200. Make sure that each line is a different type and colour, and add a legend to distinguish the three lines. CourseNana.COM

8 CourseNana.COM

n,h 1m CourseNana.COM

In practice, we should carefully select the bandwidth h to minimise the mean squared error (MSE) of density CourseNana.COM

estimation: CourseNana.COM

ˆ Z∞ˆ 2 CourseNana.COM

MSE fn,h = E
CourseNana.COM

f−i(Xi) = (n − 1)h
i.e. fˆ (X ) is the KDE in Equation (2) evaluated at X , computed using all data except X . CourseNana.COM

fn,h(x) − f(x) dx . (6) CourseNana.COM

K h CourseNana.COM

, (8) CourseNana.COM

fˆ (X), (7) CourseNana.COM

j=1 j ̸=i CourseNana.COM

(f) [7 marks] Write a function J_calc, that computes, for a given bandwidth h, the value of J(h) given in Equation (7). The function J_calc should call the function kde_estimate written in part (d) when necessary, and have the following features: CourseNana.COM

Arguments: CourseNana.COM
1. h: a numeric object specifying the bandwidth of the kernel function, CourseNana.COM
2. data: a numeric vector of length n containing the data set X1 , . . . , Xn , CourseNana.COM
3. kernel: a character object specifying the kernel function we want to use. The default argument CourseNana.COM
  
  of kernel is "gaussian". CourseNana.COM
Computation: CourseNana.COM

The function calculates J(h) given in Equation (7). To compute the first term in Equation (7), you should use the integrate function. Refer to the help file of the integrate function for details of its usage. CourseNana.COM

Return:
• J: a numeric value containing J(h). CourseNana.COM
(g) [3 marks] Write a function find_opt_h, that computes, for a given set of possible bandwidth values H, the optimal choice of h by finding the bandwidth hopt ∈ H that minimises Equation (8), i.e. computing CourseNana.COM

hopt = argminh∈H{J(h)}. (9) The function should call the function J_calc that you wrote in part (f), and have the following features: CourseNana.COM

Arguments: CourseNana.COM
1. H_set: a numeric vector specifying the set of bandwidths H. CourseNana.COM
2. data: a numeric vector of length n containing the data set X1 , . . . , Xn , CourseNana.COM
3. kernel: a character object specifying the kernel function we want to use. The default argument CourseNana.COM
  
  of kernel is "gaussian". CourseNana.COM
Computation:
The function calculates hopt given in Equation (9). Return: CourseNana.COM

• h_opt: a numeric value containing hopt.
Hint: you may find the function which.min useful. CourseNana.COM

−ii i i CourseNana.COM

9 CourseNana.COM

(h) [4 marks] For the norm_data data set from part (e), compute the optimal bandwidth hopt for the Gaussian kernel on the set H = {0.1,0.11,0.12,...,1.19,1.2}. Plot the values of J(h) for all values in the set H in a line plot. Add a vertical line indicating hopt, and a vertical line indicating the rule-of-thumb choice for h, making sure the two lines can be distinguished. CourseNana.COM

10 CourseNana.COM

Question 4: Classification Using Linear Discriminant Analysis CourseNana.COM

[30 marks] CourseNana.COM

A researcher is interested in classifying two different species of iris flower based on measurements of the flower’s petals. Each flower comes from one of two populations, A or B. The researcher’s aim is to decide, given a flower’s measurements, if the flower should be classified as coming from population A or B. CourseNana.COM

Mathematically, suppose we observe p-dimensional data that comes from one of two populations, A or B, where an observation x has prior probability πA of belonging to population A, and prior probability πB of belonging to population B. CourseNana.COM

When each population is modeled as a multivariate Normal distribution with common p × p covariance matrix Σ, and population specific p-dimensional mean vectors μA and μB, the linear discriminant analysis (LDA) decision rule for classifying an observation x as belonging to population A takes the form: CourseNana.COM

T−1 1 T−1 πA AssignxtoPopulationAif(μA−μB) Σ x>2(μA−μB) Σ (μA+μB)−log π , (10) CourseNana.COM

B CourseNana.COM

otherwise assign x to Population B.
In practice, the quantities μA, μB, πA, πB, and Σ must be estimated. This can be achieved using a training CourseNana.COM

data set of n data points where membership of each observation is known. Suppose the training data (A) (A) (B) (B) CourseNana.COM

set contains nA data points x1 , . . . , xnA belonging to Population A, and nB data points x1 , . . . , xnB belonging to Population B (with nA + nB = n). Then, πA and πB can be estimated using the sample proportions CourseNana.COM

πˆA = nA/n, and πˆB = nB/n,
and μA and μB can be estimated using the p-dimensional sample mean vectors CourseNana.COM

1 nA 1 nB CourseNana.COM

μˆA = x ̄A = Xx(A), μˆB = x ̄B = Xx(B), nAi nBi CourseNana.COM

i=1 i=1 The p × p covariance matrix Σ can be estimated using the pooled estimator CourseNana.COM

(11) CourseNana.COM

(12) CourseNana.COM

(13) CourseNana.COM

where
ΣbA = nA −1 CourseNana.COM

1 CourseNana.COM

(A) (A) −xi xi CourseNana.COM

1 ΣbB = nB −1 CourseNana.COM

nB
X (B) CourseNana.COM

(B)
−x ̄B xi −x ̄B CourseNana.COM

T
, (14) CourseNana.COM

bA
Σ= 1 (n −1)Σ +(n −1)Σ , CourseNana.COM

b n−2 A CourseNana.COM

T CourseNana.COM

B CourseNana.COM

bB CourseNana.COM

nA
X (A) CourseNana.COM

To this end, the researcher has access to a correctly labelled training data set iris-training-data.txt, which contains 4 measurements from n observations of flowers: Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width. Alongside this, each observation has the variable population with the known population (A or B) that the flower belongs to. This data set will be used to calculate the estimated quantities given in Equations (11), (12), and (13). CourseNana.COM

(a) [4 marks] Load the data set iris-training-data.txt into your R workspace. Produce two plots, side-by-side in the same figure, where the left plot shows Petal.Width plotted against Petal.Length, whilst the right-hand plot shows Sepal.Width plotted against Sepal.Length. The observations should be coloured based on whether they came from population A or B, and you should include a legend to distinguish between the two populations. CourseNana.COM

−x ̄A
with the p-dimensional vectors x(A), x ̄A, x(B), and x ̄B in Equation (14) treated as p × 1 column vectors, so CourseNana.COM

xi CourseNana.COM

, CourseNana.COM

xi CourseNana.COM

i=1 CourseNana.COM

ii that ΣbA and ΣbB are p × p matrices. CourseNana.COM

11 CourseNana.COM

(b) [7 marks] Write a function called est_training that uses a training data set to calculate and re- turn the quantities required to classify an observation, given in Equations (11), (12), and (13). The est_training function should have the following features: CourseNana.COM

Argument : CourseNana.COM

1. training_data: a data frame with n rows and p + 1 columns, consisting of training data. Each row corresponds to an observation. The last column, with column name population, contains character variable data given by either "A" or "B", that depicts which of the two populations the observation belongs to. The first p columns contain the p variables from the n observations. CourseNana.COM

Computation:
The training_est function should compute five quantities: CourseNana.COM

the p-dimensional mean vector μˆA of all training data observations in population A (Equation (12) left), CourseNana.COM
the p-dimensional mean vector μˆB of all training data observations in population B (Equation (12) right), CourseNana.COM
the proportion of training data observations in population A, nA/n (Equation (11) left), CourseNana.COM
the proportion of training data observations in population B, nB/n (Equation (11) right), and CourseNana.COM
the pooled covariance estimate of all training data observations Σb (Equation (13)). CourseNana.COM

Return:
training_ests: a list object with five elements: CourseNana.COM

mean_A: a numeric vector of length p containing the mean vector of all training data observations belonging to population A. CourseNana.COM
mean_B: a numeric vector of length p containing the mean vector of all training data observations belonging to population B. CourseNana.COM
pi_A: a numeric variable between 0 and 1, containing the proportion of training data observations belonging to population A. CourseNana.COM
pi_B: a numeric variable between 0 and 1, containing the proportion of training data observations belonging to population B. CourseNana.COM
cov_est: a numeric p × p matrix containing the pooled covariance estimate. CourseNana.COM

Hint: you may use the function cov in order to calculate Σb. CourseNana.COM

Using the estimated quantities, the LDA decision rule classifies a new observation x as belonging to popu- lation A if CourseNana.COM

T 1T πˆA −1
bl x>2bl (μˆA+μˆB)−log πˆ , otherwiseassignxtoB, wherebl=Σb (μˆA−μˆB). (15) CourseNana.COM

B CourseNana.COM

(c) [2 marks] Write a function calc_l_hat, that takes as input Σb, μˆA, and μˆB, and computes the vector bl given in Equation (15). The calc_l_hat function should have the following features: CourseNana.COM

Arguments: CourseNana.COM

1. cov_est: the p × p pooled covriance estimator Σb,
2. mean_A: the p-dimensional mean vector μˆA for training data observations in Population A, 3. mean_B: the p-dimensional mean vector μˆB for training data observations in Population B. CourseNana.COM

Computation:
The function should compute the vector bl, given by bl = Σb−1(μˆA − μˆB). Return:
l_hat: a numeric vector of length p, containing the value of bl. CourseNana.COM

12 CourseNana.COM

(d) [2 marks] The right-hand-side of the inequality in Equation (15), 12blT(μˆA+μˆB)−log(πˆA/πˆB), is known CourseNana.COM

as the decision point. Write a function calc_decision_point, that takes as input bl, the mean vectors μˆA and μˆB, and prior probability estimates πˆA and πˆB, and computes the decision point given by the right-hand-side of the inequality in Equation (15). The calc_decision_point function should have the following features: CourseNana.COM

Arguments: CourseNana.COM

1. l_hat: the p-dimensional vector bl.
2. mean_A: the p-dimensional mean vector μˆA for training data observations in Population A.
3. mean_B: the p-dimensional mean vector μˆB for training data observations in Population B.
4. pi_A: a numeric variable between 0 and 1, containing the proportion of training data observations CourseNana.COM

belonging to population A, πˆA.
5. pi_B: a numeric variable between 0 and 1, containing the proportion of training data observations CourseNana.COM

belonging to population B, πˆB. CourseNana.COM

Computation:
The function should compute the decision point 12blT(μˆA +μˆB)−log(πˆA/πˆB), given in Equation (15). Return:
decision_point: a numeric value containing the value of the decision point. CourseNana.COM
(e) [2 marks] Write a function classify_obs, that classifies a new observation into either population A or B based on the decision rule (15). The function classify_obs should have the following features: CourseNana.COM

Arguments: CourseNana.COM

1. obs: a p-dimensional new observation x that is to be classified. 2. l_hat: the p-dimensional vector containing bl.
3. decision_point: the value of the decision point. CourseNana.COM

Computation:
The function should classify a new observation x as belonging to the population A or B, based on the CourseNana.COM

decision rule (15).
Return:
population_est: a character value, with value "A" or "B", containing the assigned population for x. CourseNana.COM
(f) [5 marks] Using the functions you wrote in parts (b), (c), (d), and (e), write a function classify_data, that takes as input (labelled) training data and (unlabeled) test data, estimates the necessary values needed for classification using the training data set, and classifies each observation in the test data set into population A or B. Your classify_data function should call the functions you wrote in parts (b) – (e) at appropriate places, and should have the following features: CourseNana.COM

Arguments: CourseNana.COM

1. training_data: a data frame with n rows and p + 1 columns, consisting of training data. This object has the same properties as described in part (b). CourseNana.COM

2. test_data: a data frame with m rows and p columns, consisting of test data. Each row corre- sponds to an observation. CourseNana.COM

Computation: CourseNana.COM

The function should classify each of the m observations in test_data as belonging to the population A or B, based on the decision rule (15). The quantities required to evaluate the decision rule should be computed using training_data. CourseNana.COM

Return:
population_ests: a character vector of length m, with values "A" or "B", containing the assigned CourseNana.COM

populations for the m observations in test_data. 13 CourseNana.COM

(g) [2 marks] Load the data set iris-test-data.txt into your R workspace in a variable named test_data. Using the classify_data function you wrote in part (f), perform linear discriminant analysis on test_data, using the training data set training_data. Store the result in a variable named iris_results, and add this information to the test_data data frame, by creating a new column with name population_est that contains the estimated populations. CourseNana.COM

Further analysis finds that the first 10 observations (i.e. those in rows 1–10) in the iris-test-data data set belong to Population A, whilst the last 15 observations (those in rows 11–25) belong to Population B. CourseNana.COM

(h) [2 marks] Add this information into the test_data variable, by creating a new column with name population containing the true populations of each observation. What proportion of the observations in test_data were correctly classified? Store this in a variable named prop_correct. CourseNana.COM

(i) [4 marks] Plot the Petal.Width variable against Petal.Length variable for the observations in test_data, with the two (true) populations distinguished from one another by colour, and with incorrectly classified points given by cross (x) symbols. CourseNana.COM

14 CourseNana.COM

MATH6166 Statistical Computing for Data Scientists - Coursework: Exploratory Analysis of Tooth Growth Data Set

Get in Touch with Our Experts