COMP7103 Assignment 1
Question 1 Data Preprocessing [15%]
Consider a numerical attribute with the following values. It is known that the range of values for the attribute is [0, 20] and that the values are always integers.
0 3 5 6 8 12 15 15 15 16 18 19 19 19 20
a) Convert the attribute into an ordinal attribute by splitting the data into 3 bins of equal interval width. State the ranges of the 3 bins.
b) Convert the attribute into an ordinal attribute by splitting the data into 3 bins of equal frequency. State the ranges of the 3 bins.
c) Apply min-max normalization to this attribute.
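For reference, a minimal Python sketch of the three operations in parts a)–c). It assumes the known range [0, 20] is used as the attribute range; whether min-max normalization uses the data minimum/maximum or the known range depends on the convention in the lecture notes.

    values = [0, 3, 5, 6, 8, 12, 15, 15, 15, 16, 18, 19, 19, 19, 20]
    lo, hi, k = 0, 20, 3

    # a) Equal-width binning: split the known range [0, 20] into 3 intervals
    #    of width (20 - 0) / 3.
    width = (hi - lo) / k
    equal_width_bins = [(lo + i * width, lo + (i + 1) * width) for i in range(k)]

    # b) Equal-frequency binning: 15 sorted values, 3 bins of 5 values each.
    n = len(values)
    equal_freq_bins = [values[i * n // k:(i + 1) * n // k] for i in range(k)]

    # c) Min-max normalization: v -> (v - min) / (max - min), here taking
    #    min/max from the known range [0, 20] (an assumption; using the data
    #    minimum and maximum is the other common convention).
    normalized = [(v - lo) / (hi - lo) for v in values]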
Question 2 Metric Axioms [20%]
Consider a set of document data that records the number of occurrences of n different words in each of the documents.
a) Suppose a distance function d is defined for the document data as d(x, y) = arccos(cos(x, y)), where cos(x, y) is the cosine similarity of x and y. Validate the distance measure against each of the criteria in the metric axioms (Lecture Notes Chapter 2 p.64).
You may use the following inequalities in your answer. Proof is required for any other inequalities used.
• ∀a, b ∈ R, |a| + |b| ≥ |a + b| (triangle inequality)
• ∀x, y ∈ Rⁿ, ‖x‖₂‖y‖₂ ≥ x ⋅ y (Cauchy–Schwarz inequality)
• Denote ∠xy as the angle between vectors x and y; then |∠xy − ∠yz| ≤ ∠xz ≤ ∠xy + ∠yz (triangle inequality for angles)
b) Suggest one transformation of the document data so that the distance function in part a) satisfies the metric axioms. Explain your answer.
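For reference, a minimal Python sketch of the distance function in part a); the vectors x and y below are illustrative word-count vectors, not data from the assignment.

    import math

    def cosine_similarity(x, y):
        dot = sum(a * b for a, b in zip(x, y))
        return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

    def d(x, y):
        # Clamp to [-1, 1] to guard against floating-point error before arccos.
        return math.acos(max(-1.0, min(1.0, cosine_similarity(x, y))))

    x, y = [3, 0, 1], [6, 0, 2]   # y is a scalar multiple of x
    print(d(x, y))                # 0.0 although x != y
    print(d(x, y) == d(y, x))     # symmetry holds for these vectors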
Question 3 Classification [30%]
Consider the training and testing samples shown in Table 1 for a binary classification problem.
a) What is the entropy of this collection of training examples with respect to the class attribute?
b) Build a decision tree using entropy as the impurity measure, with the pre-pruning criterion: stop splitting when the information gain < 0.1. Show your steps.
c) Using the testing data shown in Table 1 as the test set, show the confusion matrix of your classifier.
d) With respect to the '+' class, what are the precision and recall of your classifier?
Training Set
Record A B C Class
1 H X T +
2 H X T +
3 H X T +
4 H X F −
5 H X F −
6 H X F +
7 H Y T −
8 H Y T +
9 L X T +
10 L X T +
11 L X T +
12 L X T +
13 L Y T −
14 L Y T −
15 L Y F −
16 L Y F +
Testing Set
Record A B C Class
17 H X T −
18 H Y F +
19 L X T +
20 L X F +
21 L Y T +
Table 1 Dataset for a binary classification problem
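A minimal Python sketch of the quantities asked for in parts a) and d), with the class labels read off the training set of Table 1; the helper names are illustrative only.

    import math

    def entropy(counts):
        total = sum(counts)
        return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

    # Class labels of training records 1..16, read off Table 1.
    train_labels = list("+++--+-+++++---+")
    print(entropy([train_labels.count("+"), train_labels.count("-")]))

    # d) For the '+' class: precision = TP / (TP + FP), recall = TP / (TP + FN),
    #    where TP, FP, FN come from the confusion matrix of part c).
    def precision_recall(tp, fp, fn):
        return tp / (tp + fp), tp / (tp + fn)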
Question 4 Splitting [20%]
Consider a dataset consisting of 150 instances of data, with a single attribute x and a class attribute y. There are three possible values for y (A, B or C). Table 2 shows a summary of the dataset, giving the number of instances of data per class label for every value of x appearing in the dataset.
Suppose we want to predict the class attribute y using the numerical attribute x. Compare all the possible splits using GINI as the impurity measure, and derive the best binary split point. Show clearly all split points considered and the corresponding GINI.
x    y = A   y = B   y = C
1    30      0       0
5    20      7       1
9    0       19      4
11   0       22      11
13   0       2       34
Table 2 Summary of the 150 instances of data
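A minimal Python sketch of the computation this question asks for, with the counts taken from Table 2. Candidate binary split points are placed midway between consecutive observed values of x, which is one common convention; use whichever convention the lecture notes prescribe.

    rows = {1: (30, 0, 0), 5: (20, 7, 1), 9: (0, 19, 4),
            11: (0, 22, 11), 13: (0, 2, 34)}

    def gini(counts):
        total = sum(counts)
        return 1 - sum((c / total) ** 2 for c in counts) if total else 0

    xs = sorted(rows)
    for split in [(a + b) / 2 for a, b in zip(xs, xs[1:])]:  # 3, 7, 10, 12
        left = [sum(r[i] for x, r in rows.items() if x < split) for i in range(3)]
        right = [sum(r[i] for x, r in rows.items() if x > split) for i in range(3)]
        n_l, n_r = sum(left), sum(right)
        weighted = (n_l * gini(left) + n_r * gini(right)) / (n_l + n_r)
        print(split, round(weighted, 4))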
Question 5 Weka [15%]
The Abalone Data Set consists of measurements of abalones and their age (determined by counting the number of rings under a microscope).
Download the dataset: https://archive.ics.uci.edu/ml/datasets/Abalone and read the description. A copy of the dataset and an extract of the description are also available on Moodle.
Construct a new dataset with 5 attributes and a class label as described in Table 3, containing all data from the given dataset. Make sure the data can be imported into Weka so that Weka identifies the data types of the attributes correctly.
Attribute   Data Type                   Note
Sex         nominal                     Same as "Sex"
Length      numerical                   Same as "Length"
Diameter    numerical                   Same as "Diameter"
Height      numerical                   Same as "Height"
Weight      numerical                   Same as "Whole weight"
AgeGroup    nominal (class attribute)   "A" if Rings ≤ 5; "B" if 5 < Rings ≤ 10; "C" if 10 < Rings ≤ 15; "D" if Rings > 15.
Table 3 Attributes of new dataset
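A minimal Python sketch of the preprocessing described above: read the raw abalone data file (comma-separated, with Rings as the last field) and write an ARFF file with the six attributes of Table 3. The file names and the relation name are assumptions, not specified by the assignment.

    def age_group(rings):
        if rings <= 5:  return "A"
        if rings <= 10: return "B"
        if rings <= 15: return "C"
        return "D"

    header = """@RELATION abalone
    @ATTRIBUTE Sex {M,F,I}
    @ATTRIBUTE Length NUMERIC
    @ATTRIBUTE Diameter NUMERIC
    @ATTRIBUTE Height NUMERIC
    @ATTRIBUTE Weight NUMERIC
    @ATTRIBUTE AgeGroup {A,B,C,D}
    @DATA"""

    # "abalone.data" / "abalone.arff" are assumed file names.
    with open("abalone.data") as src, open("abalone.arff", "w") as dst:
        dst.write(header + "\n")
        for line in src:
            f = line.strip().split(",")
            if not f[0]:
                continue  # skip blank lines
            # f[0]=Sex, f[1]=Length, f[2]=Diameter, f[3]=Height,
            # f[4]=Whole weight, f[8]=Rings
            dst.write(",".join([f[0], f[1], f[2], f[3], f[4],
                                age_group(int(f[8]))]) + "\n")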
a) Construct the header of an ARFF file for the preprocessed dataset. Show all parts before “@DATA”.
b) Give a screenshot of the histograms of all attributes with respect to the class label AgeGroup in Weka.
c) Use CVParameterSelection in Weka with the J48 algorithm, picking the value of C among 5 values from 0.1 to 0.5. Choose 10-fold cross-validation both in the test options and in the options of CVParameterSelection. Give all classifier output before “J48 pruned tree”.
d) Using the “J48 pruned tree” from the classifier output in part c), classify the instance of data shown in Table 4. Clearly show the section of the tree involved in your answer.
Sex Length Diameter Height Weight
I 0.28 0.19 0.06 0.13
Table 4 Instance of test dataset
e) Using Weka, build a classifier for the class attribute using the J48 algorithm with each of the following setups, and give the accuracy of each:
1. confidenceFactor (C) = 0.1, test option = 10-fold cross-validation.
2. confidenceFactor (C) = 0.1, test option = 5-fold cross-validation.
3. confidenceFactor (C) = 0.3, test option = 10-fold cross-validation.
4. confidenceFactor (C) = 0.3, test option = 5-fold cross-validation.
5. confidenceFactor (C) = 0.5, test option = 10-fold cross-validation.
6. confidenceFactor (C) = 0.5, test option = 5-fold cross-validation.
f) Based on the results in part e), how will you build your final model? Does the test option affect your final model? Explain your answer.