Machine Learning Fundamentals Group Assessment
2022-2023 Semester I
Instructions:
• Students are required to form a group of 3 or 4 to complete the assignment. Please write your name, student ID and class in the space provided. This assignment is worth 40% of the total course mark.
• Format specification: Write no more than 12 pages. Use 12-point font, 1.5 line spacing and A4 paper margins. Figures and tables should have captions, preferably with notes, and should be readable on their own.
• A well-structured assignment is strongly encouraged. Arrange the answer in the following order: introduction, main approach used, main findings, and conclusion. The R code used to generate the results should be submitted as a separate file.
Section 1: Short-answer Questions
[15 marks] Students are required to complete all the short-answer questions. Each question is worth 5 marks.
- Why do we need to consider feature engineering? What is the logical order in which feature engineering steps should be implemented? Using the Boston Housing dataset as an example, fit a KNN model with and without feature engineering and compare the differences. Hint: use recipes::recipe() to create a blueprint and caret::train() to fit KNN with and without the feature engineering steps. To access the Boston Housing dataset, install and load the mlbench package. The data can then be loaded with data(BostonHousing2).
- Describe the nature of hyperparameter tuning. What method can we use for hyperparameter tuning? Using the Boston Housing dataset as an example, fit a linear model with and without K-fold CV and compare the differences. (Hint: use lm() to fit a linear model without K-fold CV and caret::train() to fit a linear model with K-fold CV.)
- In a situation where the number of variables is greater than the number of observations, is fitting a linear regression still appropriate? If not, what types of regression should we consider? In R, using the Boston Housing dataset as an example, demonstrate the difference between the in-sample RMSE of linear regression and that of regularised regressions.
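The workflow named in the first hint (a recipes blueprint passed to caret::train()) can be sketched roughly as follows. This is a minimal illustration rather than a model answer. The choice of cmedv as the response, the removal of medv and town, the 10-fold CV setting, the grid of k values and the specific preprocessing steps are all assumptions made for the sketch, and the object names (boston, cv, grid, knn_raw, knn_fe, cv_rmse) are hypothetical.

```r
library(mlbench)  # provides the BostonHousing2 data
library(recipes)  # recipe() and the step_*() functions
library(caret)    # train() and trainControl()

data(BostonHousing2)

# Assumption: cmedv (corrected median value) is the response, so the
# uncorrected medv is dropped. town is dropped to keep the sketch small.
boston <- subset(BostonHousing2, select = -c(medv, town))

# Assumption: 10-fold CV and a small grid of k values.
cv   <- trainControl(method = "cv", number = 10)
grid <- expand.grid(k = seq(3, 15, by = 2))

# KNN without feature engineering: raw predictors via the formula interface.
set.seed(123)
knn_raw <- train(cmedv ~ ., data = boston, method = "knn",
                 trControl = cv, tuneGrid = grid, metric = "RMSE")

# KNN with feature engineering: a blueprint that centres and scales the
# numeric predictors, then dummy-encodes the remaining factor (chas).
blueprint <- recipe(cmedv ~ ., data = boston)
blueprint <- step_center(blueprint, all_numeric(), -all_outcomes())
blueprint <- step_scale(blueprint, all_numeric(), -all_outcomes())
blueprint <- step_dummy(blueprint, all_nominal(), -all_outcomes())

set.seed(123)  # same seed so both fits start from the same random state
knn_fe <- train(blueprint, data = boston, method = "knn",
                trControl = cv, tuneGrid = grid, metric = "RMSE")

# Best cross-validated RMSE for each approach.
cv_rmse <- c(raw        = min(knn_raw$results$RMSE),
             engineered = min(knn_fe$results$RMSE))
print(cv_rmse)
```

Because KNN is distance-based, the two RMSE values are usually expected to differ once the predictors are put on a common scale. The size and direction of the difference should be read from the output rather than assumed.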
Section 2: Report
[25 marks]
Background Information
Kevin is a professional real-estate manager. In the past, he relied on a few important features for home valuation. His boss recently asked him to take the initiative to learn to use big data and machine learning algorithms to value homes, in order to communicate better with customers. He took an online data science course and learned different predictive modelling techniques that can be very useful for predicting house prices. To take full advantage of machine learning algorithms, his boss has given him access to a house price database with more than 80 features and thousands of houses. He plans to run a model comparison to find out which model performs best, evaluated on both the training dataset and the testing dataset. The candidate models under consideration are linear regression, regularised regression, k-nearest neighbour, decision tree for regression, bagging, and random forests.
Requirements:
- Use the ames dataset from the AmesHousing package to perform the model comparison. Use a 75%/25% split between the training and testing datasets, and consider using a stratified sampling method. The evaluation metric should be the root mean squared error (RMSE). Model performance should be evaluated both in-sample (training dataset) and out-of-sample (testing dataset).
- The first model comparison should call each model's engine directly. (Hint: lm() for the linear model, glmnet() for regularised regression, knnreg() for KNN regression, rpart() for the decision tree, bagging() for the bagged tree and ranger() for the random forest.) Compare the model performance and analyse the results.
- The second model comparison should combine feature engineering and K-fold CV. Create a blueprint using the recipe() function and then use the train() function from the caret package to train each ML model. Compare the model performance and analyse the results.
- Compare and contrast the results for the models with and without feature engineering and K-fold CV. Outline which models benefit from feature engineering and K-fold CV and which do not. Explain why.
- Suggest a final recommendation and explain why.
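The 75%/25% stratified split and the in-sample versus out-of-sample RMSE described in the first requirement can be sketched as below. This is only an illustration under stated assumptions. It assumes the ames data is built with AmesHousing::make_ames(), that rsample::initial_split() is used for the split (the brief does not name a splitting function), that the response is Sale_Price, and that two numeric predictors are enough to show the pattern. The rmse() helper and the object names are hypothetical.

```r
library(AmesHousing)  # make_ames() builds the processed ames data
library(rsample)      # initial_split(), training(), testing()

ames <- make_ames()

# 75%/25% split, stratified on the response so that both sets cover a
# similar range of sale prices. Assumption: Sale_Price is the response.
set.seed(123)
split      <- initial_split(ames, prop = 0.75, strata = "Sale_Price")
ames_train <- training(split)
ames_test  <- testing(split)

# Hypothetical helper: root mean squared error.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Direct engine example: lm() on two numeric predictors only (an
# assumption to keep the sketch short; the assignment uses more).
fit_lm <- lm(Sale_Price ~ Gr_Liv_Area + Year_Built, data = ames_train)

# In-sample (training) and out-of-sample (testing) RMSE.
rmse_lm <- c(
  train = rmse(ames_train$Sale_Price, predict(fit_lm, newdata = ames_train)),
  test  = rmse(ames_test$Sale_Price,  predict(fit_lm, newdata = ames_test))
)
print(rmse_lm)
```

The same rmse() helper can be reused for the other engines, so that every model is scored on the same training and testing rows.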