Assignment 1: Decision Trees
Task 1 (No coding - show your general working):
A. You are building a classifier to determine which walking trail is best suited for a weekend outing with your friends. You scouted around and gathered data on eleven walking trails: their difficulty level (easy, some difficulty or advanced), their distance from Auckland (within, short distance or far), their direction (North, South or West), whether they can comply with restrictions (none, wheelchair access or flat terrain) and whether you enjoyed them or not. Using this data, build a decision tree to decide whether you would enjoy a particular trail or not, showing at each level how you decided which attribute to expand next.
Use the following data:
|T5||Some Difficulty||short distance||South||wheelchair access||1|
|T6||Advanced||short distance||South||flat terrain||1|
|T8||Some Difficulty||short distance||West||flat terrain||0|
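For reference, the choice of which attribute to expand is commonly based on entropy and information gain. The definitions below are the standard ones and are assumed (not confirmed) to match the lecture notation: S is the set of trails at the current node, p_1 and p_0 are the fractions of trails in S that were enjoyed and not enjoyed, A is a candidate attribute, and S_v is the subset of S where A takes the value v.

```latex
H(S) = -p_1 \log_2 p_1 - p_0 \log_2 p_0, \qquad p_0 = 1 - p_1

\mathrm{Gain}(S, A) = H(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|} \, H(S_v)
```

At each level, the attribute with the largest gain is expanded; a term with p = 0 is taken to contribute 0 to the entropy.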
B. What is the training set accuracy of your decision tree?
C. Given a new data set with several other trails, which ones would you choose?
|T13||Some Difficulty||within||North||flat terrain|
|T15||Some Difficulty||within||West||flat terrain|
|T16||Some Difficulty||short distance||South||none|
D. To verify the accuracy of your decision tree, you decide to try them all. The results are:
E. What is the test set error? Is this result ideal? Explain your answer.
Task 2 (coding):
In this task, you will implement a full ML classifier based on decision trees (in Python, using a Jupyter notebook). You will use the Mushroom Data Set (https://archive.ics.uci.edu/ml/datasets/Mushroom) to train and evaluate your classifier. This dataset comes from the UCI ML repository (https://archive.ics.uci.edu/ml/index.php). (Hint: There are missing values in this dataset.
For now, you may either remove the instances that have missing values or replace each missing value with the most frequent value of its column; the attributes are categorical, so a column mean is not defined. Please note that there are other ways of preprocessing data which we have not seen yet.)
You can use libraries such as pandas and NumPy, but you may NOT use any prebuilt decision tree packages.
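The two options in the hint could be sketched as follows. This is a minimal illustration in plain Python, not a required approach: the function names `drop_missing` and `fill_with_mode` are hypothetical, the `"?"` marker is an assumption to check against the raw file, and the toy rows are made up. Because the Mushroom attributes are categorical, the fill uses the most frequent value of a column rather than a numeric mean.

```python
from collections import Counter

MISSING = "?"  # assumption: marker used for missing values in the raw file


def drop_missing(rows):
    """Keep only the rows that contain no missing marker."""
    return [row for row in rows if MISSING not in row]


def fill_with_mode(rows):
    """Replace each missing entry with the most frequent known value of its column.

    Assumes every column has at least one known value.
    """
    n_cols = len(rows[0])
    modes = []
    for j in range(n_cols):
        known = [row[j] for row in rows if row[j] != MISSING]
        modes.append(Counter(known).most_common(1)[0][0])
    return [[modes[j] if v == MISSING else v for j, v in enumerate(row)]
            for row in rows]


# Toy rows, invented for illustration (not taken from the data set).
toy = [["e", "b", "x"], ["p", "?", "x"], ["e", "b", "f"], ["p", "c", "?"]]
print(drop_missing(toy))
print(fill_with_mode(toy))
```

Dropping rows is simpler but discards data; filling keeps every instance at the cost of introducing guessed values. Whichever you choose should be stated in the report.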
A. Implement the basic decision tree procedure as discussed in the lectures. Implement a DecisionTree algorithm with a train procedure, using the information gain criterion described in the lectures. In your report, use one or two sentences to discuss the output (the output of the training procedure is the trained decision tree, which is a representation of the if-then-else rules). You may print out your decision tree; you don't have to, but it might help you discuss the trained trees. The tree may be large, so consider the best way to print it.
B. Implement tree depth control as a means of controlling the model complexity. In the train procedure, implement a parameter stopping_depth and use it to stop further splits of the tree. In your report, use one or two sentences to discuss the output at stopping_depth values of 2, 3 and 4. You can print out your decision tree.
C. Implement a test procedure for your DecisionTree algorithm (a procedure that takes new data and the trained model and returns a prediction). Describe your test evaluation.
D. Propose and implement an evaluation method for your DecisionTree algorithm (one that evaluates the whole procedure: given a data set, it trains the tree, applies the tree to data and calculates a performance measure). Please explain your steps and results.
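To make the moving parts of A to C concrete, here is a minimal sketch under stated assumptions; it is one possible shape, not the reference solution. Only `stopping_depth` comes from the task text. The names `entropy`, `information_gain`, `train` and `predict`, the nested-dict tree representation, the majority-class fallback for unseen values, and the toy data are all illustrative choices, and the entropy-based gain is assumed to match the lecture definition.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows, labels, attr):
    """Entropy reduction obtained by splitting on column index `attr`."""
    groups = {}
    for row, y in zip(rows, labels):
        groups.setdefault(row[attr], []).append(y)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def train(rows, labels, attrs, stopping_depth=None, depth=0):
    """Return a nested-dict tree; leaves are plain class labels."""
    majority = Counter(labels).most_common(1)[0][0]
    depth_reached = stopping_depth is not None and depth >= stopping_depth
    if len(set(labels)) == 1 or not attrs or depth_reached:
        return majority
    best = max(attrs, key=lambda a: information_gain(rows, labels, a))
    node = {"attr": best, "default": majority, "children": {}}
    remaining = [a for a in attrs if a != best]
    for value in sorted(set(row[best] for row in rows)):
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        node["children"][value] = train(
            [rows[i] for i in idx], [labels[i] for i in idx],
            remaining, stopping_depth, depth + 1)
    return node


def predict(tree, row):
    """Walk the tree; unseen attribute values fall back to the node's majority class."""
    while isinstance(tree, dict):
        tree = tree["children"].get(row[tree["attr"]], tree["default"])
    return tree


# Toy data, invented for illustration: column 0 separates the classes perfectly.
rows = [["a", "x"], ["a", "y"], ["b", "x"], ["b", "y"]]
labels = ["e", "e", "p", "p"]
tree = train(rows, labels, attrs=[0, 1])
print(tree)
print(predict(tree, ["b", "x"]))
```

With `stopping_depth=0` the sketch returns a single majority-class leaf, which is a quick way to check that depth control works before trying depths 2, 3 and 4 on the real data.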
Task 3 (Reflection - not coding):
A. Discuss what will happen if you decide to change the splitting criterion. Explain the new splitting criterion and how it might change your decision tree.
B. Explain whether your evaluation method can indicate whether your tree is over- or underfitting.
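As one illustration for part A, Gini impurity is a commonly used alternative to entropy. The sketch below is only an example of what "a different splitting criterion" can look like; the assignment does not prescribe it, and the names `gini` and `gini_gain` and the toy data are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: chance that two labels drawn with replacement differ."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gini_gain(rows, labels, attr):
    """Impurity reduction obtained by splitting on column index `attr`."""
    groups = {}
    for row, y in zip(rows, labels):
        groups.setdefault(row[attr], []).append(y)
    weighted = sum(len(g) / len(labels) * gini(g) for g in groups.values())
    return gini(labels) - weighted


# Toy data, invented for illustration.
rows = [["a", "x"], ["a", "y"], ["b", "x"], ["b", "y"]]
labels = ["e", "e", "p", "p"]
print(gini_gain(rows, labels, 0), gini_gain(rows, labels, 1))
```

Swapping such a function in for information gain changes only how candidate splits are ranked. Whether the resulting tree actually differs on the Mushroom data is something to check empirically and discuss.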
What to submit? You need to submit:
The raw Jupyter notebook (.ipynb) AND
An HTML file generated from the notebook (including the outputs from the execution of the code).
The notebook needs to be clearly structured according to the assignment tasks listed above. Each part should contain a header pointing out which task it contains, your code with results, and one paragraph containing your answers to the questions. You won't get any marks for your code and results alone; they need to be explained. The discussion is the most important part!
The assignment must be submitted to Canvas. It will be run through Turnitin, so make sure that everything you submit has been done by you.
Note that we will deduct marks if the solution is not submitted in the correct format. You can only submit .html and .ipynb files.
Make sure everything is reproducible: all the parameters are defined, random seeds are set, and results are repeatable.
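One way to keep an evaluation repeatable is to drive every random step from an explicit seed. The sketch below shows a seeded hold-out split and an accuracy measure. The names `train_test_split` and `accuracy`, the 30% test fraction and the seed value are illustrative assumptions, not requirements of the assignment.

```python
import random


def train_test_split(rows, labels, test_fraction=0.3, seed=0):
    """Shuffle indices with a private, seeded RNG and split them into train/test."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(rows) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(idx, seq):
        return [seq[i] for i in idx]

    return (pick(train_idx, rows), pick(train_idx, labels),
            pick(test_idx, rows), pick(test_idx, labels))


def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


# Toy data, invented for illustration.
rows = [[i] for i in range(10)]
labels = ["e" if i % 2 == 0 else "p" for i in range(10)]
first = train_test_split(rows, labels, seed=42)
second = train_test_split(rows, labels, seed=42)
print(first == second)
```

Using `random.Random(seed)` instead of the module-level functions keeps the split independent of any other code that touches the global random state, so re-running the notebook top to bottom gives the same result.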