Instructions
BUSS6002 Assignment
October 10, 2023
- Due: 23:59 on Friday, October 27, 2023 (end of week 12).
- You must submit a written report (in PDF) with the following filename format, replacing STUDENTID with your own student ID: BUSS6002 STUDENTID.pdf.
- You must also submit a Jupyter Notebook (.ipynb) file with the following filename format, replacing STUDENTID with your own student ID: BUSS6002 STUDENTID.ipynb.
- There is a limit of 6 A4-pages for your report (including equations, tables, and captions).
- All plots, computational tasks, and results must be completed using Python.
- Each section of your report must be clearly labelled with a heading.
- Do not include any Python code as part of your report.
- All figures must be appropriately sized and have readable axis labels and legends (where applicable).
- The submitted .ipynb file must contain all the code used in the development of your report.
- The submitted .ipynb file must be free of any errors, and the results must be reproducible.
- You may submit multiple times but only your last submission will be marked.
- A late penalty applies if you submit your assignment late without a successful special consideration. See the Unit Outline for more details.
- Generative AI tools (such as ChatGPT) may be used for this assignment but you must add a statement at the end of your report specifying how generative AI was used. E.g., “Generative AI was used only for editing the final report text.”
- Hint! It is highly recommended that you finish the week 10 tutorial before starting this assignment.
Description
In this assignment, you will conduct a study that compares the empirical performance of two families of basis functions for linear basis function (LBF) models: polynomial basis functions and radial basis functions. The aim is to investigate which family of basis functions is better suited for approximating highly nonlinear relationships between two scalar-valued variables.
More specifically, you are given four benchmark datasets: A, B, C, and D. Each dataset contains 5,000 observations of the response and predictor variables, which are named y and x, respectively. A scatter plot of each dataset is shown in Figure 1. Your task is to compare the performance of polynomial and radial basis function regression models on each of the datasets.
Figure 1: Benchmark Datasets
The LBF model being considered in your study is given by
$$y = \phi(x)^\top \beta + \varepsilon,$$
where $\phi(x) := [1, \phi_1(x), \ldots, \phi_p(x)]^\top$, $\beta := [\beta_0, \beta_1, \ldots, \beta_p]^\top$, and $\varepsilon$ is a random noise. For the set of basis functions $\{\phi_i\}_{i=1}^{p}$, two choices are being investigated: the first choice is the family of polynomial basis functions,
$$\phi_i(x) := x^i,$$
and the second choice is the family of radial basis functions,
$$\phi_i(x) := \exp\left(-\frac{\left(x - \frac{i}{p+1}\right)^2}{2s^2}\right).$$
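The two feature maps above can be turned into design matrices with a few lines of NumPy. The following is a minimal sketch, not a prescribed implementation: the function names `polynomial_design` and `radial_design` and the demo inputs are hypothetical, and the radial centres are taken to be i/(p+1) for i = 1, ..., p, as read from the definition above.

```python
import numpy as np

def polynomial_design(x, p):
    """Return the n x (p+1) matrix with columns 1, x, x^2, ..., x^p."""
    x = np.asarray(x, dtype=float)
    return np.vander(x, N=p + 1, increasing=True)

def radial_design(x, p, s):
    """Return the n x (p+1) matrix: an intercept column followed by
    p Gaussian bumps centred at i / (p + 1) for i = 1, ..., p."""
    x = np.asarray(x, dtype=float)
    centres = np.arange(1, p + 1) / (p + 1)
    # Broadcasting: (n, 1) minus (1, p) gives an (n, p) array of offsets.
    bumps = np.exp(-((x[:, None] - centres[None, :]) ** 2) / (2.0 * s ** 2))
    return np.column_stack([np.ones_like(x), bumps])

# Illustrative inputs only; these are not taken from the benchmark datasets.
x_demo = np.array([0.0, 0.5, 1.0])
Phi_pol = polynomial_design(x_demo, p=3)
Phi_rad = radial_design(x_demo, p=3, s=0.2)
```

Each row of either matrix is φ(x)⊤ for one observation, so the first column is always the constant 1 that multiplies β0.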
Before comparing the two basis function families, you must set the value of p for the polynomial regression model, as well as the values of p and s for the radial basis function regression model. These hyperparameter values should be selected for each dataset, using a validation set, by minimising the validation mean squared error (MSE).
In your study, the optimal value of p (for each basis function family) should be selected by exhaustively searching through an equally-spaced grid from 1 to 10, with a spacing of 1:
P := {1,2,3,...,10}.
For the radial basis functions, in addition to selecting p, you should also select the optimal value of s by exhaustively searching through another equally-spaced grid from 0.1 to 1, with a spacing of 0.1:
S := {0.1,0.2,0.3,...,1}.
That is, for each dataset, the optimal values must be determined for three hyperparameters in total: $p_{\mathrm{pol}} \in P$, $p_{\mathrm{rad}} \in P$, and $s \in S$, where $p_{\mathrm{pol}}$ denotes the number of polynomial basis functions (i.e., the degree of the polynomial) and $p_{\mathrm{rad}}$ denotes the number of radial basis functions.
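The exhaustive search over P and S can be sketched as below. This is an illustration under stated assumptions: the data are synthetic stand-ins (not one of the benchmark datasets), the 400/200 training/validation split is arbitrary because no split is specified here, and the helper names `design` and `validation_mse` are hypothetical.

```python
import itertools
import numpy as np

def design(x, p, s=None):
    """Polynomial features if s is None, otherwise radial features."""
    if s is None:
        return np.vander(x, N=p + 1, increasing=True)
    centres = np.arange(1, p + 1) / (p + 1)
    bumps = np.exp(-((x[:, None] - centres) ** 2) / (2.0 * s ** 2))
    return np.column_stack([np.ones_like(x), bumps])

def validation_mse(x_tr, y_tr, x_va, y_va, p, s=None):
    """Fit by least squares on the training set, score on the validation set."""
    beta, *_ = np.linalg.lstsq(design(x_tr, p, s), y_tr, rcond=None)
    resid = y_va - design(x_va, p, s) @ beta
    return float(np.mean(resid ** 2))

# Synthetic stand-in data (assumption, for illustration only).
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=600)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(600)
x_tr, y_tr, x_va, y_va = x[:400], y[:400], x[400:], y[400:]  # assumed split

P = range(1, 11)                                  # {1, 2, ..., 10}
S = [round(0.1 * k, 1) for k in range(1, 11)]     # {0.1, 0.2, ..., 1.0}

# One-dimensional search for the polynomial family.
p_pol = min(P, key=lambda p: validation_mse(x_tr, y_tr, x_va, y_va, p))

# Two-dimensional search over every (p, s) pair for the radial family.
p_rad, s_rad = min(
    itertools.product(P, S),
    key=lambda ps: validation_mse(x_tr, y_tr, x_va, y_va, *ps),
)
```

The polynomial search evaluates 10 candidate models and the radial search evaluates 10 × 10 = 100, so the full grid is cheap enough to enumerate directly.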
Once the optimal values of the hyperparameters are chosen for both basis function families, you will be able to compare the performance between the two using a test set (i.e., by comparing the test MSE between the two optimally selected models).
The files containing the datasets are listed in Table 1 and can be downloaded from the unit’s Canvas site. In each file, the dataset is organised as comma-separated values, with each row being an observation and each column being a variable. The response values are in the first column and the corresponding predictor values are in the second column.
File             Description
dataset-a.csv    Benchmark dataset A
dataset-b.csv    Benchmark dataset B
dataset-c.csv    Benchmark dataset C
dataset-d.csv    Benchmark dataset D
Table 1: Files Provided
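Given the column layout described above (response first, predictor second), reading a file might look like the following sketch. The three rows here are made up so that the example runs without the real files, and whether the actual files contain a header row is not stated, so that detail is an assumption to check.

```python
import io
import numpy as np

# Made-up rows standing in for a file such as dataset-a.csv.
# If the real files have a header row, pass skiprows=1 to np.loadtxt.
fake_csv = io.StringIO("0.52,0.10\n0.91,0.25\n-0.13,0.80\n")

data = np.loadtxt(fake_csv, delimiter=",")
y = data[:, 0]   # response values: first column
x = data[:, 1]   # predictor values: second column
```

With the real files, the `io.StringIO` object would be replaced by the file path, e.g. `np.loadtxt("dataset-a.csv", delimiter=",")`.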
Report Structure
Your report must contain the following four sections:
1 Introduction (0.5 pages)
– Provide a brief project background so that the reader of your report can understand the general problem that you are solving.
– Motivate your research question.
– State the aim of your project.
– Provide a short summary of each of the remaining sections in your report (e.g., “The report proceeds as follows: Section 2 presents . . . ”).
2 Methodology (2 pages)
– Define and describe the LBF model.
– Define and describe the two choices of basis function families being investigated.
– Describe how the parameter vector β is estimated given the hyperparameter value(s). Mention any potential numerical issues associated with the estimation procedure.
– Describe how the hyperparameter value(s) can be determined automatically from data (as opposed to manually setting the hyperparameters to arbitrary values).
– Describe how the performance of the two families of basis functions is compared given the optimal hyperparameter value(s).
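As an illustration of the estimation step and the numerical issues mentioned in the list above, the sketch below fits β by least squares on synthetic data and compares two condition numbers. It assumes ordinary least squares is the intended estimator, which the text above does not state explicitly; the data and variable names are hypothetical.

```python
import numpy as np

# Synthetic data (assumption, for illustration only).
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, size=200)
y = np.cos(3 * x) + 0.05 * rng.standard_normal(200)

# Degree-10 polynomial design matrix with columns 1, x, ..., x^10.
Phi = np.vander(x, N=11, increasing=True)

# SVD-based least squares; this avoids forming Phi^T Phi explicitly.
beta_lstsq, _, rank, _ = np.linalg.lstsq(Phi, y, rcond=None)

# Forming the Gram matrix roughly squares the condition number, which is
# why solving the normal equations directly can be fragile for large p.
cond_Phi = np.linalg.cond(Phi)
cond_gram = np.linalg.cond(Phi.T @ Phi)

fitted = Phi @ beta_lstsq
train_mse = float(np.mean((y - fitted) ** 2))
```

Printing `cond_Phi` and `cond_gram` for different degrees is one simple way to see how quickly the polynomial basis becomes ill-conditioned as p grows.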
3 Empirical Study (2.5 pages)
– Describe the benchmark datasets used in your study.
– Describe in detail the procedure that you followed to obtain the empirical results, including any computational challenges that you may have encountered. You may refer to details in Section 2 to avoid repetition in your writing.
– Present (in a table) the optimal hyperparameter values selected for each dataset and for each basis function family.
– Discuss the table of selected hyperparameters.
– Visually present (using plots) the predicted response values under each basis function family for each dataset.
– Discuss the plots of predicted values.
– Present (in a table) the test MSE under each basis function family for each dataset.
– Discuss the table of test MSE values.
4 Conclusion (0.5 pages)
– Discuss your overall findings / insights.
– Discuss any limitations of your study.
– Suggest potential directions of extending your study.