COMP7103 Data Mining - Assignment 1: Data Processing, Metric Axioms, Classification and Weka

COMP7103 Assignment 1

Question 1 Data Preprocessing [15%]

Consider a numerical attribute with the following values. It is known that the attribute is always an integer in the range [0, 20].

0 3 5 6 8 12 15 15 15 16 18 19 19 19 20

a) Convert the attribute into an ordinal attribute by splitting the data into 3 bins of equal interval width. State the ranges of the 3 bins.

b) Convert the attribute into an ordinal attribute by splitting the data into 3 bins of equal frequency. State the ranges of the 3 bins.

c) Apply min-max normalization to this attribute.
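The three preprocessing steps above can be cross-checked with a short script. A minimal sketch using the known attribute range [0, 20] (not the sample minimum and maximum) for the width and normalization calculations:

```python
import numpy as np

values = np.array([0, 3, 5, 6, 8, 12, 15, 15, 15, 16, 18, 19, 19, 19, 20])
lo, hi = 0, 20  # the stated range of the attribute

# a) Equal-width binning: 3 bins of width (hi - lo) / 3
width = (hi - lo) / 3
equal_width_edges = [lo + i * width for i in range(4)]  # bin boundaries

# b) Equal-frequency binning: 15 sorted values -> 3 bins of 5 values each
equal_freq_bins = np.array_split(np.sort(values), 3)

# c) Min-max normalization onto [0, 1] using the known range
normalized = (values - lo) / (hi - lo)
```

Note that equal-width bins are fixed by the attribute's range, while equal-frequency bins depend only on the sorted sample.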

Question 2 Metric Axioms [20%]

Consider a set of document data that records the number of occurrences of d different words in each of the documents.

a) Suppose a distance function dist is defined for the document data as dist(x, y) = arccos(cos(x, y)), where cos(x, y) is the cosine similarity of x and y. Validate the distance measure with each of the criteria in the metric axioms (Lecture Notes Chapter 2 p.64).

You may use the following inequalities in your answer. Proof is required for any other inequalities used.

For a, b ∈ R, |a| + |b| ≥ |a + b| (triangle inequality)

For x, y ∈ R^d, ‖x‖²‖y‖² ≥ (x·y)² (Cauchy–Schwarz inequality)

Denote ∠xy as the angle between vectors x and y; then |∠xz − ∠yz| ≤ ∠xy ≤ ∠xz + ∠yz (triangle inequality for angles)

b) Suggest one transformation to the document data so that the distance function in part a) satisfies the metric axioms. Explain your answer.
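The distance function in part a) is straightforward to compute, and small count vectors make handy test cases when checking each axiom. A minimal sketch with hypothetical word-count vectors (the clamp guards against floating-point drift pushing the cosine slightly outside [−1, 1]):

```python
import numpy as np

def angular_distance(x, y):
    """dist(x, y) = arccos(cos(x, y)), with cos the cosine similarity."""
    cos_sim = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
    # Clamp before arccos to avoid NaN from rounding error.
    return np.arccos(np.clip(cos_sim, -1.0, 1.0))

x = np.array([2, 0, 1])
y = np.array([4, 0, 2])  # a scalar multiple of x
z = np.array([0, 3, 0])

print(angular_distance(x, y))  # ≈ 0, although x != y
print(angular_distance(x, z))  # π/2: the documents share no words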

Question 3 Classification [30%]

Consider the training and testing samples shown in Table 1 for a binary classification problem.

a) What is the entropy of this collection of training examples with respect to the class attribute?

b) Build a decision tree using entropy as the impurity measure, with the pre-pruning criterion that a split is not made when its information gain is < 0.1. Show your steps.

c) Using the testing data shown in Table 1 as the test set, show the confusion matrix of your classifier.

d) With respect to the ‘+’ class, what are the precision and recall of your classifier?
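For parts c) and d), the precision and recall follow directly from the confusion-matrix counts. A minimal sketch with hypothetical labels (not the assignment's classifier output), showing how the two measures are derived for the positive class:

```python
def precision_recall(actual, predicted, positive="+"):
    """Precision and recall with respect to the given positive class."""
    tp = sum(a == positive and p == positive for a, p in zip(actual, predicted))
    fp = sum(a != positive and p == positive for a, p in zip(actual, predicted))
    fn = sum(a == positive and p != positive for a, p in zip(actual, predicted))
    return tp / (tp + fp), tp / (tp + fn)

# Hypothetical illustration only:
actual    = ['+', '+', '-', '+', '-']
predicted = ['+', '-', '-', '+', '+']
print(precision_recall(actual, predicted))  # (2/3, 2/3) here
```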

Training Set

Record   A   B   C   Class
1        H   X   T   +
2        H   X   T   +
3        H   X   T   +
4        H   X   F   -
5        H   X   F   -
6        H   X   F   +
7        H   Y   T   -
8        H   Y   T   +
9        L   X   T   +
10       L   X   T   +
11       L   X   T   +
12       L   X   T   +
13       L   Y   T   -
14       L   Y   T   -
15       L   Y   F   -
16       L   Y   F   +

Testing Set

Record   A   B   C   Class
17       H   X   T   -
18       H   Y   F   +
19       L   X   T   +
20       L   X   F   +
21       L   Y   T   +

Table 1 Dataset for a binary classification problem
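The entropy asked for in part a) can be cross-checked numerically. A minimal sketch, assuming the Class entries left blank in the extracted table denote the negative ('-') class:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (base 2) of a collection of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

# Training-set class column (records 1-16), blanks assumed to be '-'.
train_labels = ['+', '+', '+', '-', '-', '+', '-', '+',
                '+', '+', '+', '+', '-', '-', '-', '+']
print(round(entropy(train_labels), 4))  # ≈ 0.9544 for 10 '+' and 6 '-'
```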

Question 4 Splitting [20%]

Consider a dataset consisting of 150 instances of data, with a single attribute X and a class attribute Y. There are three possible values for Y (A, B or C). Table 2 summarizes the dataset, showing the number of instances per class label for every value of X appearing in the dataset.

Suppose we want to predict the class attribute Y using the numerical attribute X. Compare all possible binary splits using GINI as the impurity measure, and derive the best binary split point. Show clearly all split points considered and the corresponding GINI.

X    Y=A   Y=B   Y=C
1    30    0     0
5    20    7     1
9    0     19    4
11   0     22    11
13   0     2     34

Table 2 Summary of 150 instances of data
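The split comparison can be enumerated with a short script over Table 2. A sketch, assuming binary splits of the form X ≤ t with t taken midway between consecutive attribute values (any t strictly between two adjacent values produces the same partition):

```python
# Rows of Table 2: value of X -> counts of classes (A, B, C).
table = {1: (30, 0, 0), 5: (20, 7, 1), 9: (0, 19, 4),
         11: (0, 22, 11), 13: (0, 2, 34)}

def gini(counts):
    """GINI impurity of a tuple of class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)

xs = sorted(table)
results = {}
# Candidate split points: midpoints between consecutive X values.
for split in [(a + b) / 2 for a, b in zip(xs, xs[1:])]:
    left = [sum(table[x][i] for x in xs if x <= split) for i in range(3)]
    right = [sum(table[x][i] for x in xs if x > split) for i in range(3)]
    n = sum(left) + sum(right)
    results[split] = (sum(left) / n) * gini(left) + (sum(right) / n) * gini(right)

for split, g in sorted(results.items()):
    print(f"X <= {split}: weighted GINI = {g:.4f}")
```

The best split is the candidate with the smallest weighted GINI.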

Question 5 Weka [15%]

The Abalone Data Set consists of measurements of abalones and their age (judged by counting the number of rings under a microscope).

Download the dataset: https://archive.ics.uci.edu/ml/datasets/Abalone and read the description. A copy of the dataset and an extract of the description are also available on Moodle.

Construct a new dataset with 5 attributes and a class label as described in Table 3, containing all data from the given dataset. Make sure the data can be imported into Weka so that Weka identifies the data types of the attributes correctly.

Attribute   Data Type                  Note
Sex         nominal                    Same as “Sex”
Length      numerical                  Same as “Length”
Diameter    numerical                  Same as “Diameter”
Height      numerical                  Same as “Height”
Weight      numerical                  Same as “Whole weight”
AgeGroup    nominal (class attribute)  “A” if Rings ≤ 5; “B” if 5 < Rings ≤ 10; “C” if 10 < Rings ≤ 15; “D” if Rings > 15.

Table 3 Attributes of new dataset
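The preprocessing described in Table 3 can be scripted. A minimal sketch, assuming the UCI file abalone.data (comma-separated, no header, columns Sex, Length, Diameter, Height, Whole weight, Shucked weight, Viscera weight, Shell weight, Rings); the file names here are illustrative:

```python
import csv

def age_group(rings):
    """Map the Rings count to the AgeGroup label of Table 3."""
    if rings <= 5:
        return "A"
    if rings <= 10:
        return "B"
    if rings <= 15:
        return "C"
    return "D"

def build_rows(reader):
    """Keep Sex, Length, Diameter, Height, Whole weight; derive AgeGroup."""
    for row in reader:
        yield [row[0], row[1], row[2], row[3], row[4], age_group(int(row[8]))]

# Usage (requires the downloaded dataset):
# with open("abalone.data") as src, open("abalone_new.csv", "w", newline="") as dst:
#     writer = csv.writer(dst)
#     writer.writerow(["Sex", "Length", "Diameter", "Height", "Weight", "AgeGroup"])
#     writer.writerows(build_rows(csv.reader(src)))
```

Writing a CSV with a header row is one way to let Weka infer the attribute types; alternatively the header of part a) can be prepended to produce an ARFF file directly.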

a) Construct the header of an ARFF file for the preprocessed dataset. Show all parts before “@DATA”.

b) Give a screenshot of the histogram of all attributes with respect to the class label AgeGroup in Weka.

c) Use CVParameterSelection in Weka with the J48 algorithm, picking the value of C among 5 values from 0.1 to 0.5. Choose 10-fold cross-validation in both the test options and the options in CVParameterSelection. Give all classifier output before “J48 pruned tree”.

d) Using the “J48 pruned tree” from the classifier output in part c), classify the instance of data shown in Table 4. Clearly show the section of the tree involved in your answer.

Sex   Length   Diameter   Height   Weight
I     0.28     0.19       0.06     0.13

Table 4 Instance of test dataset

e) Using Weka, build a classifier for the class attribute using the J48 algorithm with each of the following setups, and give the accuracy of each.

1. confidenceFactor (C) = 0.1, test option = 10-fold cross-validation.

2. confidenceFactor (C) = 0.1, test option = 5-fold cross-validation.

3. confidenceFactor (C) = 0.3, test option = 10-fold cross-validation.

4. confidenceFactor (C) = 0.3, test option = 5-fold cross-validation.

5. confidenceFactor (C) = 0.5, test option = 10-fold cross-validation.

6. confidenceFactor (C) = 0.5, test option = 5-fold cross-validation.

f) Based on the results in part e), how will you build your final model? Does the test option affect your final model? Explain your answer.
