DTS205TC High Performance Computing
Group Project Assignment 1
Overview
Random forest is an ensemble learning method for classification that operates by constructing a multitude of decision trees at training time.¹ It often has very good predictive accuracy and has been widely used in many applications. In this task, you will implement a random forest algorithm by hand and parallelize it.
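To make the ensemble idea concrete, the following is a minimal Python sketch of the two ingredients that turn individual trees into a forest: bootstrap resampling and majority voting. The function names `bootstrap_sample` and `majority_vote` are illustrative assumptions, not part of the assignment, and the decision trees themselves are not shown.

```python
import random
from collections import Counter


def bootstrap_sample(rows, labels, rng):
    """Draw len(rows) examples uniformly at random, with replacement."""
    n = len(rows)
    idx = [rng.randrange(n) for _ in range(n)]
    return [rows[i] for i in idx], [labels[i] for i in idx]


def majority_vote(predictions):
    """Return the most common label among the per-tree predictions.

    Ties are broken in favour of the label that appears first in the list.
    """
    return Counter(predictions).most_common(1)[0][0]
```

Passing an explicit `random.Random` instance as `rng` keeps the resampling reproducible, which helps when comparing runs.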
Team policy
Teams must have a minimum of two and a maximum of three members. One team member must fill in an online document containing all team members’ information before 6th March, 23:59. Students who fail to do so will be randomly assigned to a team. Changes will not be allowed once teams are settled.
Avoid Plagiarism
⚫ Do not submit work from other teams.
⚫ Do not share code/work with students other than your own team members.
⚫ Do not read code/work from other teams; discussions between teams should remain high level.
⚫ Do not use open-source code from the Web or code from textbooks.
1. Group Tasks (60 marks)
Dataset
We will provide a dataset, ‘data.csv’, which can be used to test your program. It is a 2-category, 10-feature dataset with 5×10⁵ samples.
Tasks
In order to implement a parallel random forest, the following tasks should be accomplished:
✓ Decision tree (without a stop-split-early condition)
It includes the following components:
1) Calculate information gain. (10 marks)
2) Split the data by finding the best feature based on task 1). (10 marks)
3) Create branches recursively based on task 2) until every leaf contains only a single category. (10 marks)
✓ Random Forest
4) Bagging. Draw 100 bootstrap samples from the data, generate a decision tree from each sample based on task 3), and perform majority voting on their prediction results. (10 marks)
5) Each time a node of the decision tree is split, randomly select 3 features by sampling without replacement, and perform task 2) on those features only. (10 marks)
✓ Report
6) For the task 2) single-layer decision tree, the task 3) decision tree, the task 4) bagging tree, and the task 5) random forest, perform 5-fold cross-validation (CV) and compare their average prediction accuracy² on the validation set. (10 marks)

Models                        Accuracy
Single-layer Decision Tree
Decision Tree
Bagging Tree
Random Forest
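As an illustration of tasks 1) and 2), the sketch below computes Shannon entropy and information gain, then searches for the best binary split. It is a naive sketch under stated assumptions, not a reference solution: it assumes numeric features, splits of the form "feature value <= threshold", and candidate thresholds at midpoints between consecutive distinct values. The names `entropy`, `information_gain`, and `best_split` are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def information_gain(labels, left, right):
    """Parent entropy minus the size-weighted entropy of the two children."""
    n = len(labels)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - weighted


def best_split(rows, labels, feature_indices=None):
    """Return (gain, feature, threshold) for the best split row[feature] <= threshold.

    Assumption: features are numeric. feature_indices restricts the search,
    for example to 3 randomly chosen features as in task 5).
    Returns (0.0, None, None) when no split has positive gain.
    """
    if feature_indices is None:
        feature_indices = range(len(rows[0]))
    best = (0.0, None, None)
    for f in feature_indices:
        values = sorted(set(row[f] for row in rows))
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for row, y in zip(rows, labels) if row[f] <= t]
            right = [y for row, y in zip(rows, labels) if row[f] > t]
            gain = information_gain(labels, left, right)
            if gain > best[0]:
                best = (gain, f, t)
    return best
```

This version rescans the data for every candidate threshold, so it is only practical on small inputs. A dataset with 5×10⁵ samples would need a more efficient search, for example sorting each feature once and updating class counts incrementally.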
2. Individual Challenge Tasks (40 marks)
Tasks
✓ Parallelization
7) For each of tasks 1)-4) above (excluding task 5), if it needs to be parallelized, what parallel method do you think should be used? Please explain the reasons for your choice. (4×5 marks)
8) Choose one (not all!) of tasks 1)-4), and implement the parallelization of random forest based on your solution in task 7). (10 marks)
NOTE: Marks are given based on the correctness and clarity of your code. Which approach you prefer will not affect your score.
9) Let the number of decision trees in the random forest be fixed at 100, change the number of processors, and measure the running time of the program for each setting. Record the results in the table below.
Please estimate the speedup of your program. Does it achieve a linear speedup? If not, why? (10 marks)
NOTE: There are no restrictions on the programming language or the parallelization library. You can use Python, C, or any other language, and you can use MPI, OpenMP, multiprocessing, coroutines, or various other parallelization libraries. To reiterate, no matter what language and library you use, you cannot directly call an off-the-shelf machine learning library; otherwise you will not get a score.
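For the speedup estimate in task 9), one common convention is speedup S(p) = T(1) / T(p) and parallel efficiency E(p) = S(p) / p, where T(p) is the measured running time on p processors. The sketch below assumes that convention. It also includes Amdahl's law as one possible way to reason about why speedup falls short of linear. The function names are hypothetical, and the timings in the usage example are placeholders, not measurements.

```python
def speedup_table(times):
    """Map {n_procs: seconds} to {n_procs: (speedup, efficiency)}.

    Speedup S(p) = T(1) / T(p); efficiency E(p) = S(p) / p.
    Requires a measurement for p = 1 as the baseline.
    """
    t1 = times[1]
    return {p: (t1 / t, t1 / t / p) for p, t in sorted(times.items())}


def amdahl_speedup(serial_fraction, n_procs):
    """Upper bound on speedup from Amdahl's law for a given serial fraction."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_procs)


# Placeholder timings in seconds (made up for illustration only).
example_times = {1: 80.0, 2: 40.0, 4: 25.0}
for p, (s, e) in speedup_table(example_times).items():
    print(f"p={p}: speedup={s:.2f}, efficiency={e:.2f}")
```

Linear speedup corresponds to an efficiency of 1.0 at every processor count. An efficiency that drops as p grows suggests serial work, communication, or process start-up overhead.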
3. Submission
Group Submission
One of the group members must submit the following files:
1) A cover letter with the student IDs and names of all group members (a template can be found on LMO).
2) All runnable source code, organised by folders.
3) A report (PDF) file containing all your answers, source code, and charts.
4) An explanation of what part of the work each person did.
Once you have all the files, please put them in a single directory (named groupid-assign1) and compress it into a .zip file.
Individual Challenge Submission
1) A cover letter with the student ID and name (a template can be found on LMO).
2) All runnable source code, organised by folders.
3) A report (PDF) file containing all your answers, source code, and charts.
Once you have all the files, please put them in a single directory (named studentID-challenge) and compress it into a .zip file.