MATH6166 Statistical Computing for Data Scientists/MATH6173 Statistical Computing — Coursework
Instructions for Coursework
-
This coursework assignment is worth 100% of the overall mark for the module MATH6166, and 50% of the overall mark for the module MATH6173.
-
The deadline is 15:00 on Friday 17th November 2023, and your completed work must be submitted electronically via Blackboard.
-
All of your answers to the coursework are to be written in the R Markdown file “Coursework- template.Rmd”, which can be found under the Coursework heading in the Assessment panel on Black- board. Download this file and write all of your answers to the coursework questions in this file only.
-
The file “Coursework-template.Rmd” is a template answer sheet that is used to write your answers to all of the coursework questions. This is the only file that you need to use to write down your answers. You must not delete any of the content from this file; for example, do not change the names or options of code chunks, or change the layout of the file. The layout of the file has been designed specifically so that the correctness of the R functions that you write can be systematically assessed.
-
You must write all of your coding answers to each question in the corresponding empty R Markdown code chunk within the file. Do not create any new code chunks. Each code chunk has a name that matches the corresponding coursework question, and there are comments that signpost where to provide an answer for a specific question. For example, for Question 1 (a), you will see the comment Write your code for Q1 (a) in the code chunk below: and below this comment will be a code chunk with the label Q1a. Any changes you make outside of the designated code chunks will likely result in your assessment not being able to be marked. All the coding answers must be carried out through R commands (do not use comments to give your answers).
-
Where relevant, report your answers in the designated section below the R code chunks for each question in the R Markdown file. For example, for Question 1 (a), you will see the comment Report your answer for Q1 (a) below:, below which you can write your answer in standard R Markdown formatting.
-
When writing an R function, the
• function name,
• argument names and data type, and • output names and data typemust be exactly the same as those specified in the questions; otherwise a heavy penalty will be applied. Please note that uppercase and lowercase of the same alphabet are not the same character, so if the question asks to write a function/argument called, apple, but a function/argument called applE is presented in your answer, then that would be wrong.
-
Your submission must consist of exactly two files; the R Markdown file and its associated html file that is created by knitting/rendering the R Markdown file. The R Markdown file is the “Coursework- template.Rmd” file with your answers added in the appropriate places. Using this file, you must generate a html output file using the knit functionality in R Markdown, and submit this along with the R Markdown file.
-
Your R Markdown file must be renamed, and have the file name <your student ID>.Rmd. For example, if your student ID is 12345678, then your script must have the file name 12345678.Rmd.
-
Similarly, the associated html file should be named <your student ID>.html. For example, if your student ID is 12345678, then your html file must have the file name 12345678.html. (This should be
1
correctly named by default provided that you have named the R Markdown file correctly.)
-
Missing R codes will be penalised. Please ensure the submitted files have the correct file format;
otherwise it cannot be marked. Note an R Markdown file is a different format to an R script file.
-
Your entire R Markdown file must be reproducible and run without any errors.
-
You must informatively comment your R code where appropriate.
-
Unless specifically stated otherwise, you do not have to include error handling in your code.
-
Ensure that all required plots are informative, including the use of appropriate plotting areas,
points/lines sizes and colours, as well as including meaningful labels, titles and legends.
-
You must not load any add-on packages, i.e., using functions library or require etc.
-
All data files required to complete the Coursework can be downloaded from Blackboard, under the
Coursework panel.
-
To help you with the completion of the Coursework, there is an example Coursework question and
related files on Blackboard for reference. You can find the following files under the Coursework Example Question panel:
-
An example Coursework question, in the file Coursework-example.pdf,
-
A template R Markdown file Coursework-example-template.Rmd in which to write your answers
to the example Coursework question,
-
A model solution R Markdown file Coursework-example-solutions.Rmd, which contains solu-
tions to the Coursework example question and shows you the way you are expected to fill in the
Coursework-example-template.Rmd file with your answers.
-
The model solution output html file Coursework-example-solutions.html, which is simply the
rendered output of the Coursework-example-solutions.Rmd file.
-
all-ad-sales.txt and youtube-ad-sales.txt: the data files needed for the example question.
-
-
When asked to import a data file into your R workspace, you should first set the working directory to the location of the file on your local computer using the command setwd, and then read the file in using the appropriate R command (e.g. read.csv or read.table). You should set the working directory every time you have to import a data file. For an example of how to do this, see Question 1 (a) in the example Coursework solutions file.
-
The standard Faculty rules on late submissions apply: for every weekday, or part of a weekday, that the coursework is late, the mark will be reduced by a factor of 10%. No credit will be given for work which is more than one week late.
-
All coursework must be carried out independently. You are reminded of the University’s Academic Integrity Policy, see https://www.southampton.ac.uk/quality/assessment/academic_integrity.page.
-
In the interests of fairness, queries which relate directly to the coursework must be made via the Coursework Discussion Forum on Blackboard. This ensures that everybody has access to the same information.
Total marks available: 100
2
Question 1: Exploratory Analysis of Tooth Growth Data Set
[22 marks]
The data set in the file tooth-growth.txt contains data on an experiment from researchers at McGill University investigating the effect of vitamin C intake on tooth growth in guinea pigs. For each of the 60 guinea pigs in the experiment, there are recordings of three variables:
-
len: a numeric variable, giving the tooth length of the guinea pig.
-
method: a character variable, giving the delivery method used to give the guinea pig the vitamin C,
which is either "AA" (ascorbic acid), or "OJ" (orange juice).
-
dose: a numeric variable, specifying the dose (in milligrams per day) of vitamin C that the guinea pig
received.
-
(a) [3 marks] Import the data set from the tooth-growth.txt file into the R workspace in a variable named tooth_data. Change the column name len to be tooth_len (this variable will henceforth be referred to as tooth_len). Change the data type of the method column from character to factor. How many of the observations have delivery method by orange juice? Store this in a variable called num_OJ.
-
(b) [2 marks] What is the variance of the tooth_len column for values where the delivery method is orange juice? Store this in a variable called var_OJ. What is the mean of the tooth_len column, where the delivery method is made by ascorbic acid and the dose value is 2? Store this in a variable called mean_AA_2.
-
(c) [3 marks] The variable dose takes one of three possible values: 0.5, 1, or 2. Create a side-by-side box plot of tooth_len for the three values of dose.
-
(d) [3 marks] Cross-tabulate the frequencies of the method and dose variables, creating a 2 × 3 table containing the frequencies of each of the 6 combinations of method and dose in tooth_data, and store this in a variable called dose_method_counts. Create a barplot showing the frequencies of the observations across the method and dose variables. There should be three bars, corresponding to each dose value, and each bar should be divided into two based on the method variable.
-
(e) [3 marks] Write a function, called subset_summary, that calculates the median and length of a given subset of a numeric vector x. The median and length should be calculated only for values of x whose cor- responding factor variable, x_factor, is a user-specified value chosen_factor. The subset_summary function should have the following features:
Arguments:
The subset_summary function should compute the median and length of the subset of x, where the values of x considered have corresponding x_factor value equal to chosen_factor.
Return:
subset_summary: a list object with two elements:1. subset_median: the median of the subset of x. 2. subset_length: the length of the subset of x.
-
(f) [2 marks] Using the subset_summary function you wrote in part (e), calculate the median and number of observations for the tooth_len variable where the delivery method was ascorbic acid.
3
(g) [6 marks] In this question, you will write a function, called subset_summary2, that improves upon the function subset_summary that you wrote in part (e), by including checks to ensure that the arguments passed to the function satisfy certain requirements. First, copy and paste the code you wrote for part (e) into your answer for part (g), and rename the function name to subset_summary2 in part (g).
The function subset_summary2 should check that the arguments satisfy the following five requirements:
1. x should contain numeric values,
2. x_factor should be a factor object,
3. x_factor should be the same length as x,
4. chosen_factor should be a character variable, and
5. chosen_factor should be one of the possible values (levels) in the x_factors variable.
If invalid inputs are passed to this function, it should stop prematurely by using the appropriate R function and print out the associated relevant error message exactly as written below:
For example, if requirement 2 is not satisifed, then the function should stop and print the error message 2.
4
Question 2: The Multivariate t-Distribution [16 marks]
If X is a p-dimensional random vector distributed according to the multivariate t-distribution, then the probability density function (pdf) of X evaluated at x, denoted fX(x), is given by
Γ((ν + p)/2) 1 T −1 −(ν+p)/2 fX(x)=Γ(ν/2)νp/2πp/2det(Σ)1/2 1+ν(x−μ) Σ (x−μ)
, (1) where Γ(·) is the gamma function, det is the matrix determinant, Σ is a p × p scaling matrix, μ is a
p-dimensional vector, and ν > 0 is a positive scalar known as the degrees of freedom.
(a) [7 marks] Write a function, called calc_mvt_pdf, that calculates the pdf, or natural logarithm of the pdf, at a single value x, for user-specified values of Σ, μ, and ν. The function calc_mvt_pdf should have the following features:
Arguments:
-
x: a numeric vector of length p, representing the p-dimensional x.
-
Sigma: a numeric p × p matrix, representing Σ in Equation (1). The default value should be the
p × p identity matrix Ip.
-
mu: a numeric vector of length p, representing μ in Equation (1). The default value should be the
p-dimensional zero vector 0 = [0, . . . , 0]T .
-
df: a positive numeric scalar, representing the degrees of freedom ν given in Equation (1). The
default value should be 1.
-
calc_log: a logical variable. If set to TRUE, the natural logarithm of fX (x), given by log(fX (x)),
is calculated. If set to FALSE, the pdf fX(x) is calculated. The default value should be TRUE.
Computation:
If calc_log = FALSE, the calc_mvt_pdf function should compute the pdf fX(x) given in Equation (1), and if calc_log = TRUE, the natural logarithm of Equation (1) is computed. You must not use any existing functions in R that already calculate the multivariate t-distribution density.
When computing the pdf, numerical errors can occur due to the calculation of very small or very large numbers, which is why we might calculate the log pdf instead. If the log pdf is to be computed, you should not just apply the log function to the calculation you use to compute the pdf. Instead, you should take advantage of properties of the log function to rewrite the logarithm of Equation (1) to produce a more numerically stable result when computing the log pdf.
Return:
1. x_pdf: if calc_log = FALSE, the pdf fX(x) is returned. If calc_log = TRUE, the log pdf,
log(fX(x)), is returned. Hints:
-
You may use the R functions gamma and det.
-
To check the numerical accuracy of your calculations, note that the expected output of the cal-
culation calc_mvt_pdf(rep(3,200)) should be (approximately) the value −506.967.
Suppose that x1, . . . , xn are a set of n observed values that are all generated independently from the multivariate t-distribution with parameters Σ, μ and ν. Then, their joint probability density is given by Qni=1 fX(xi), with fX(x) as in Equation (1), whilst the logarithm of the joint probability density is given by Pni=1 log(fX(xi)).
5
-
(b) [4 marks] Using a loop structure and the function calc_mvt_pdf that you wrote in part (a), write a function, calc_joint_mvt_pdf, that calculates the joint probability density for observed values x1, . . . , xn that are all generated independently from the same multivariate t-distribution. The function should have the following features:
Arguments:
-
x_matrix: a numeric n × p matrix of length p, with the i-th row representing an observed value xi .
-
Sigma: a numeric p×p matrix, representing Σ in Equation (1). The default value should the p×p identity matrix Ip.If calc_log = FALSE, the calc_joint_mvt_pdf function should compute the joint pdf Qni=1 fX(xi), and if calc_log = TRUE, the function should compute the natural logarithm of the joint pdf, Pni=1 log(fX(xi)). This value should be computed using a loop structure, by looping over the n values x1,...,xn.
Return:
1. joint_pdf: if calc_log = FALSE, the joint pdf Qni=1 fX(xi) is returned. If calc_log = TRUE,the logarithm of the joint pdf, Pni=1 log(fX(xi)) is returned.
-
-
(c) [3 marks] Using the function joint_pdf you wrote in part (b), compute both the joint pdf and log joint pdf for the observed values x1 = [1.2,3]T, x2 = [1.1,3.1]T, x3 = [3.4,2.6]T, coming from the multivariate t-distribution with
T 1 0.3
μ=[1.2,3] , Σ= 0.3 1 , ν=2.Store the results in variables joint_mvt_pdf and log_joint_mvt_pdf respectively.
-
(d) [2 marks] Suppose that X has a bivariate t-distribution with parameters μ = [0, 0]T, Σ is the 2 × 2 identity matrix I2, and ν = 1. Suppose that Y has a bivariate standard normal distribution (i.e. the bivariate normal distribution with mean vector and covariance matrix the same as X above). Cal- culate the absolute value of the difference between these two densities, evaluated at the value [0,1]T, i.e. calculate
fX([0,1]T)−fY ([0,1]T) ,
where fX(·) and fY (·) are the pdfs of X and Y respectively. Store the result in a variable named density_diff.
6
Question 3: Nonparametric Kernel Density Estimation
[32 marks]
Density estimation is the problem of estimating the probability density function (pdf) from a set of given data points. Concretely, suppose we observe n (1-dimensional) data points X1, . . . , Xn, and we wish to estimate the underlying unknown pdf, denoted f(x), of the data.
The Kernel Density Estimator (KDE) is one of the most well-known methods for solving this problem. The KDE at a given value x, denoted fˆ (x), is given by the formula
width h, the KDE fˆ n,h
(x) is the estimator of the true pdf f(x), K(·) is called the kernel function, and h > 0 is called the bandwidth. The kernel function is generally a smooth, symmetric function, and the bandwidth controls the amount of smoothing in the estimation.
(a) [3 marks] The Gaussian kernel function
K(y)=√ exp −
is one of the most
gauss_kernel_calc,
common choices for the kernel function
that computes, for n data points X1, . . . , Xn,
Write
given value x, and
n,h
Return:
• kde_est: a numeric object containing the KDE estimator fˆ
(b) [2 marks] Another possible kernel is the triangular kernel:
(x).
(c) [2 marks] The bandwidth parameter h can be set using the “rule-of-thumb” choice: −1/5 IQR
1 y2 2π 2
(3)
(3). The function should have the following features: Arguments:
-
data: a numeric vector of length n containing the data set X1 , . . . , Xn that we want to estimate the pdf of,
-
x: a numeric object, the value x for which we want to estimate the pdf f(x),
-
n: a positive integer object specifying the number of data points n,
-
h: a numeric object specifying the bandwidth h of the kernel function.
Computation:
Calculate fˆ (x) for a given (single) value of x using the Gaussian kernel.
K .
(x) given in Equation (2) with K given by the Gaussian kernel in Equation
(1−|y| if|y|≤1, 0 otherwise.
K(y) =
where | · | is the absolute value. Write a function, called triangular_kernel_calc, that computes,
for n data points X , . . . , X , given value x, and bandwidth h, the KDE fˆ (x) given in Equation (2) 1n n,h
with K given by the triangular kernel in Equation (4). The function should have exactly the same features as your answer to part (a), except that the function uses the triangular kernel.
where σˆ is the estimated standard deviation σˆ = q(n − 1)−1 Pni=1(Xi − X)2 with X = n−1 Pni=1 Xi, IQR is the inter-quartile range, and n is the number of data points observed. Write a function, rule_of_thumb_calc, that computes hROT for a data set X1 , . . . , Xn . The function should have the following features:
Arguments:
1. data: a numeric vector of length n containing the data set X1 , . . . , Xn ,
2. n: a positive integer object specifying the number of data points n.
-
data: a numeric vector of length n containing the data set X1 , . . . , Xn that we want to estimate the pdf of,
-
x_vals: a numeric vector of length m representing values x1 , . . . , xm at which we want to estimate the pdf f(x),
-
kernel: a character object specifying the kernel function we want to use. The kernel function can be set to be either "gaussian" or "triangular". The default argument of kernel is "gaussian".
-
h: a numeric object specifying the bandwidth of the kernel function. The default value of h should
be NULL.
Computation:
If no bandwidth parameter h is specified by the user, then the bandwidth h is calculated using the rule-of-thumb choice in Equation (5). Then, the function calculates fˆ (x) for all x = x , . . . , x ,
with respect to the chosen bandwidth and kernel function. Return:
• kde_est: a numeric vector of length m containing the KDE esimtator fˆ (x) evaluated at the n,h
points x = x ,...,x , i.e. a vector containing the values fˆ (x ),...,fˆ (x
1 m n,h 1 n,h m
).
(e) [6 marks] Generate 100 random variables with distribution N (3, 2), i.e. normally distributed with mean equal to 3 and variance equal to 2, and store them in a variable named norm_data. Calculate the KDE with respect to norm_data, evaluating the KDE at 200 points x1, . . . , x200 on an equi-spaced grid, beginning at -3, and ending at 9;
1. using the Gaussian kernel, using the rule-of-thumb choice to set the bandwidth h; 2. using the triangular kernel, using bandwidth h = 0.7.
Store the results in variables named gaussian_est and triangular_est respectively. On a single figure, plot the values x ,...,x against the KDE fˆ (x ),...,fˆ (x ) for both gaussian_est
1 200 n,h 1 n,h 200
and triangular_est using a line plot. Add to this plot the true values of the pdf, f(x), evaluated at
x1, . . . , x200. Make sure that each line is a different type and colour, and add a legend to distinguish
the three lines.
8
n,h 1m
In practice, we should carefully select the bandwidth h to minimise the mean squared error (MSE) of density
estimation:
ˆ Z∞ˆ 2
MSE fn,h = E
f−i(Xi) = (n − 1)h
i.e. fˆ (X ) is the KDE in Equation (2) evaluated at X , computed using all data except X .
fn,h(x) − f(x) dx . (6)
K h
, (8)
fˆ (X), (7)
j=1 j ̸=i
-
(f) [7 marks] Write a function J_calc, that computes, for a given bandwidth h, the value of J(h) given in Equation (7). The function J_calc should call the function kde_estimate written in part (d) when necessary, and have the following features:
Arguments:
-
h: a numeric object specifying the bandwidth of the kernel function,
-
data: a numeric vector of length n containing the data set X1 , . . . , Xn ,
-
kernel: a character object specifying the kernel function we want to use. The default argument
of kernel is "gaussian".
Computation:
The function calculates J(h) given in Equation (7). To compute the first term in Equation (7), you should use the integrate function. Refer to the help file of the integrate function for details of its usage.
Return:
• J: a numeric value containing J(h). -
-
(g) [3 marks] Write a function find_opt_h, that computes, for a given set of possible bandwidth values H, the optimal choice of h by finding the bandwidth hopt ∈ H that minimises Equation (8), i.e. computing
hopt = argminh∈H{J(h)}. (9) The function should call the function J_calc that you wrote in part (f), and have the following features:
Arguments:
-
H_set: a numeric vector specifying the set of bandwidths H.
-
data: a numeric vector of length n containing the data set X1 , . . . , Xn ,
-
kernel: a character object specifying the kernel function we want to use. The default argument
of kernel is "gaussian".
Computation:
The function calculates hopt given in Equation (9). Return:• h_opt: a numeric value containing hopt.
Hint: you may find the function which.min useful. -
−ii i i
9
(h) [4 marks] For the norm_data data set from part (e), compute the optimal bandwidth hopt for the Gaussian kernel on the set H = {0.1,0.11,0.12,...,1.19,1.2}. Plot the values of J(h) for all values in the set H in a line plot. Add a vertical line indicating hopt, and a vertical line indicating the rule-of-thumb choice for h, making sure the two lines can be distinguished.
10
Question 4: Classification Using Linear Discriminant Analysis
[30 marks]
A researcher is interested in classifying two different species of iris flower based on measurements of the flower’s petals. Each flower comes from one of two populations, A or B. The researcher’s aim is to decide, given a flower’s measurements, if the flower should be classified as coming from population A or B.
Mathematically, suppose we observe p-dimensional data that comes from one of two populations, A or B, where an observation x has prior probability πA of belonging to population A, and prior probability πB of belonging to population B.
When each population is modeled as a multivariate Normal distribution with common p × p covariance matrix Σ, and population specific p-dimensional mean vectors μA and μB, the linear discriminant analysis (LDA) decision rule for classifying an observation x as belonging to population A takes the form:
T−1 1 T−1 πA AssignxtoPopulationAif(μA−μB) Σ x>2(μA−μB) Σ (μA+μB)−log π , (10)
B
otherwise assign x to Population B.
In practice, the quantities μA, μB, πA, πB, and Σ must be estimated. This can be achieved using a training
data set of n data points where membership of each observation is known. Suppose the training data (A) (A) (B) (B)
set contains nA data points x1 , . . . , xnA belonging to Population A, and nB data points x1 , . . . , xnB belonging to Population B (with nA + nB = n). Then, πA and πB can be estimated using the sample proportions
πˆA = nA/n, and πˆB = nB/n,
and μA and μB can be estimated using the p-dimensional sample mean vectors
1 nA 1 nB
μˆA = x ̄A = Xx(A), μˆB = x ̄B = Xx(B), nAi nBi
i=1 i=1 The p × p covariance matrix Σ can be estimated using the pooled estimator
(11)
(12)
(13)
where
ΣbA = nA −1
1
(A) (A) −xi xi
1 ΣbB = nB −1
nB
X (B)
(B)
−x ̄B xi −x ̄B
T
, (14)
bA
Σ= 1 (n −1)Σ +(n −1)Σ ,
b n−2 A
T
B
bB
nA
X (A)
To this end, the researcher has access to a correctly labelled training data set iris-training-data.txt, which contains 4 measurements from n observations of flowers: Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width. Alongside this, each observation has the variable population with the known population (A or B) that the flower belongs to. This data set will be used to calculate the estimated quantities given in Equations (11), (12), and (13).
(a) [4 marks] Load the data set iris-training-data.txt into your R workspace. Produce two plots, side-by-side in the same figure, where the left plot shows Petal.Width plotted against Petal.Length, whilst the right-hand plot shows Sepal.Width plotted against Sepal.Length. The observations should be coloured based on whether they came from population A or B, and you should include a legend to distinguish between the two populations.
−x ̄A
with the p-dimensional vectors x(A), x ̄A, x(B), and x ̄B in Equation (14) treated as p × 1 column vectors, so
xi
,
xi
i=1
i=1
ii that ΣbA and ΣbB are p × p matrices.
11
(b) [7 marks] Write a function called est_training that uses a training data set to calculate and re- turn the quantities required to classify an observation, given in Equations (11), (12), and (13). The est_training function should have the following features:
Argument :
1. training_data: a data frame with n rows and p + 1 columns, consisting of training data. Each row corresponds to an observation. The last column, with column name population, contains character variable data given by either "A" or "B", that depicts which of the two populations the observation belongs to. The first p columns contain the p variables from the n observations.
Computation:
The training_est function should compute five quantities:
-
the p-dimensional mean vector μˆA of all training data observations in population A (Equation (12) left),
-
the p-dimensional mean vector μˆB of all training data observations in population B (Equation (12) right),
-
the proportion of training data observations in population A, nA/n (Equation (11) left),
-
the proportion of training data observations in population B, nB/n (Equation (11) right), and
-
the pooled covariance estimate of all training data observations Σb (Equation (13)).
Return:
training_ests: a list object with five elements:
-
mean_A: a numeric vector of length p containing the mean vector of all training data observations belonging to population A.
-
mean_B: a numeric vector of length p containing the mean vector of all training data observations belonging to population B.
-
pi_A: a numeric variable between 0 and 1, containing the proportion of training data observations belonging to population A.
-
pi_B: a numeric variable between 0 and 1, containing the proportion of training data observations belonging to population B.
-
cov_est: a numeric p × p matrix containing the pooled covariance estimate.
Hint: you may use the function cov in order to calculate Σb.
Using the estimated quantities, the LDA decision rule classifies a new observation x as belonging to popu- lation A if
T 1T πˆA −1
bl x>2bl (μˆA+μˆB)−log πˆ , otherwiseassignxtoB, wherebl=Σb (μˆA−μˆB). (15)
B
(c) [2 marks] Write a function calc_l_hat, that takes as input Σb, μˆA, and μˆB, and computes the vector bl given in Equation (15). The calc_l_hat function should have the following features:
Arguments:
1. cov_est: the p × p pooled covriance estimator Σb,
2. mean_A: the p-dimensional mean vector μˆA for training data observations in Population A,
3. mean_B: the p-dimensional mean vector μˆB for training data observations in Population B.
Computation:
The function should compute the vector bl, given by bl = Σb−1(μˆA − μˆB).
Return:
l_hat: a numeric vector of length p, containing the value of bl.
12
-
(d) [2 marks] The right-hand-side of the inequality in Equation (15), 12blT(μˆA+μˆB)−log(πˆA/πˆB), is known
as the decision point. Write a function calc_decision_point, that takes as input bl, the mean vectors μˆA and μˆB, and prior probability estimates πˆA and πˆB, and computes the decision point given by the right-hand-side of the inequality in Equation (15). The calc_decision_point function should have the following features:
Arguments:
1. l_hat: the p-dimensional vector bl.
2. mean_A: the p-dimensional mean vector μˆA for training data observations in Population A.
3. mean_B: the p-dimensional mean vector μˆB for training data observations in Population B.
4. pi_A: a numeric variable between 0 and 1, containing the proportion of training data observationsbelonging to population A, πˆA.
5. pi_B: a numeric variable between 0 and 1, containing the proportion of training data observationsbelonging to population B, πˆB.
Computation:
The function should compute the decision point 12blT(μˆA +μˆB)−log(πˆA/πˆB), given in Equation (15). Return:
decision_point: a numeric value containing the value of the decision point. -
(e) [2 marks] Write a function classify_obs, that classifies a new observation into either population A or B based on the decision rule (15). The function classify_obs should have the following features:
Arguments:
1. obs: a p-dimensional new observation x that is to be classified. 2. l_hat: the p-dimensional vector containing bl.
3. decision_point: the value of the decision point.Computation:
The function should classify a new observation x as belonging to the population A or B, based on thedecision rule (15).
Return:
population_est: a character value, with value "A" or "B", containing the assigned population for x. -
(f) [5 marks] Using the functions you wrote in parts (b), (c), (d), and (e), write a function classify_data, that takes as input (labelled) training data and (unlabeled) test data, estimates the necessary values needed for classification using the training data set, and classifies each observation in the test data set into population A or B. Your classify_data function should call the functions you wrote in parts (b) – (e) at appropriate places, and should have the following features:
Arguments:
1. training_data: a data frame with n rows and p + 1 columns, consisting of training data. This object has the same properties as described in part (b).
2. test_data: a data frame with m rows and p columns, consisting of test data. Each row corre- sponds to an observation.
Computation:
The function should classify each of the m observations in test_data as belonging to the population A or B, based on the decision rule (15). The quantities required to evaluate the decision rule should be computed using training_data.
Return:
population_ests: a character vector of length m, with values "A" or "B", containing the assignedpopulations for the m observations in test_data. 13
(g) [2 marks] Load the data set iris-test-data.txt into your R workspace in a variable named test_data. Using the classify_data function you wrote in part (f), perform linear discriminant analysis on test_data, using the training data set training_data. Store the result in a variable named iris_results, and add this information to the test_data data frame, by creating a new column with name population_est that contains the estimated populations.
Further analysis finds that the first 10 observations (i.e. those in rows 1–10) in the iris-test-data data set belong to Population A, whilst the last 15 observations (those in rows 11–25) belong to Population B.
(h) [2 marks] Add this information into the test_data variable, by creating a new column with name population containing the true populations of each observation. What proportion of the observations in test_data were correctly classified? Store this in a variable named prop_correct.
(i) [4 marks] Plot the Petal.Width variable against Petal.Length variable for the observations in test_data, with the two (true) populations distinguished from one another by colour, and with incorrectly classified points given by cross (x) symbols.
14