ASSIGNMENT #1: Association Rule Mining
Q1: Apriori Algorithm [7pt]
For this question, you will implement part of the Apriori algorithm for association rule mining and test a potential improvement.
A. Improving Apriori [3pt]
The Apriori algorithm for discovering frequent itemsets can be expensive, so people are always looking for ways to improve the approach's efficiency further.
One expensive step, for example, is validating candidate itemsets by counting their supports to find the ones that meet or exceed the support threshold. A common implementation uses a hash-tree; done naïvely, this step is too slow to be usable.
Let us consider an Apriori program that takes as input the frequent itemsets of length k (with respect to some support-count threshold), the transaction itemsets, and a support-count threshold, and produces as output the frequent itemsets of length k+1. (We assume the support-count threshold for the input frequent itemsets provided will be the same as, or lower than, the requested threshold for the output.)
Let us name such a program levelUp. How much time levelUp takes for a given task depends on how many pre-candidates the join step produces. A pre-candidate advances to being a candidate itemset if it passes the apriori test. At a given support-count threshold, the same number of candidate itemsets will result regardless of ordering. But perhaps we can reduce the number of pre-candidates produced.
Consider ordering the items in each itemset from least to most frequent, rather than in some fixed “arbitrary” way, say lexicographically (as in the standard algorithm). That is, order them by how frequently each item appears in the transaction database, i.e., by the support counts of the corresponding 1-itemsets. Why could this help? The prefixes of the frequent itemsets of length k will be less common, so fewer pre-candidates should result from the join step. If this effect is significant in practice, it could make the algorithm perform better.
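To make the join and apriori-test steps concrete, here is a minimal sketch in Python; it is not the required implementation, and the name join_and_prune and the variable freq_k are illustrative. It assumes the frequent k-itemsets are tuples sorted under whichever item order is in use (lexicographic, or least-to-most-frequent for the variant above).

from itertools import combinations

def join_and_prune(freq_k):
    freq_set = set(freq_k)
    itemsets = sorted(freq_set)
    pre_candidates = []
    # Join step: two k-itemsets that share their first k-1 items yield a
    # (k+1)-item pre-candidate.
    for i, a in enumerate(itemsets):
        for b in itemsets[i + 1:]:
            if a[:-1] != b[:-1]:
                break  # sorted order: no later itemset can share this prefix
            pre_candidates.append(a + (b[-1],))
    # Apriori test: a pre-candidate becomes a candidate only if every one of
    # its k-subsets is itself frequent.
    candidates = [c for c in pre_candidates
                  if all(s in freq_set for s in combinations(c, len(c) - 1))]
    return pre_candidates, candidates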
Write a program in Python (3) or in Java, called “levelUp.py” or “LevelUp.java” respectively, to test this. Your program should take three arguments, plus an optional fourth argument:
- a file with the frequent itemsets of length k at the given support threshold (without the support counts),
- a file with the transaction itemsets, and
- the support threshold count.
E.g.,
% python levelUp.py mushroom-lev4-sup500.dat mushroom-trans.dat 500
or
% java LevelUp mushroom-lev4-sup500.dat mushroom-trans.dat 500
The frequent itemsets are to be read in from a file; e.g., mushroom-lev4-sup500.dat.
- Each frequent itemset is on a separate line and is space separated.
- Each item is represented by an integer value.
- Within each itemset, the items are ordered consistently (the same ordering is used across itemsets).
E.g.,
1 5 7 11 13
3 7 11 23 29
This should run the usual Apriori algorithm and write to standard output the frequent itemsets of length k+1, as described above, with their support counts.
If called with the optional argument, e.g.,
% python levelUp.py mushroom-lev5-sup500.dat mushroom-trans.dat 500 mushroom-lev1-sup500-wCount.dat
it should do the same, but applying the pre-candidate optimization of ordering the items per itemset by frequency. The last argument, e.g., mushroom-lev1-sup500-wCount.dat, is a file of the frequent 1-itemsets with their support counts. The “wCount” variant of a frequent-itemset file is the same as above, except that the first integer on each line is the support count. E.g.,
381 1 5 7 11 13
237 3 7 11 23 29
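As one possible way to read these two file formats in Python, here is a short sketch; the function names are hypothetical.

def read_itemsets(path):
    # Plain format: one space-separated itemset per line, items as integers.
    with open(path) as f:
        return [tuple(int(x) for x in line.split()) for line in f if line.strip()]

def read_itemsets_with_counts(path):
    # "wCount" format: the first integer on each line is the support count.
    counts = {}
    with open(path) as f:
        for line in f:
            fields = [int(x) for x in line.split()]
            if fields:
                counts[tuple(fields[1:])] = fields[0]
    return counts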
Instrument your program to track running time, the number of pre-candidates found, and the number of candidates found.
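One possible way to do this, building on the sketches above; writing the measurements to standard error is an assumption (the assignment does not say where they should go), and it keeps standard output free for the frequent itemsets.

import sys
import time

freq_k = read_itemsets(sys.argv[1])                  # from the parsing sketch above
start = time.perf_counter()
pre_candidates, candidates = join_and_prune(freq_k)  # from the join/prune sketch above
# ... count candidate supports against the transactions (sys.argv[2]) here ...
elapsed = time.perf_counter() - start

# Diagnostics written to stderr (an assumption), keeping stdout for the itemsets.
print(f"pre-candidates: {len(pre_candidates)}", file=sys.stderr)
print(f"candidates:     {len(candidates)}", file=sys.stderr)
print(f"time:           {elapsed:.3f} s", file=sys.stderr)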
Test at least on an input of level 5 with a support-count threshold of 500 from the mushroom transaction database. Report in your assignment report (PDF) whether this optimization is effective and, if so, how effective it is.
B. Extracting the Association Rules [4pt]
Write a program in Python (3) or in Java that extracts, from the frequent itemsets of length k, the association rules whose left-hand side is of length j (for some 0<j<k) and whose confidence equals or exceeds the provided confidence threshold p (0≤p≤1). Call your program “arules.py” or “Arules.java”, respectively.
Your program should take as input
- the frequent itemsets of length j with their support counts,
- the frequent itemsets of length k with their support counts, and
- a confidence threshold p.
It should output the discovered association rules.
The frequent itemsets are to be read in from files (the “wCount” variants), as described in Q1A above.
The program should write the association rules to standard output, one per line. Use the format as in
0.334 1 5 7 => 11 13
where the first number is the rule's confidence (printed to three decimal places), and the rest is the rule, with “=>” separating the rule's left- and right-hand sides.
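For reference, the confidence of a rule X => Y is supp(X ∪ Y) / supp(X). A minimal sketch of the extraction and printing steps, assuming freq_j and freq_k are dictionaries mapping itemset tuples to support counts (as a “wCount” parser might produce) and that items within every itemset follow the same consistent order:

from itertools import combinations

def extract_rules(freq_j, freq_k, p, j):
    for itemset, supp_k in freq_k.items():
        for lhs in combinations(itemset, j):   # candidate left-hand sides
            supp_lhs = freq_j.get(lhs)
            if supp_lhs is None:
                continue                       # lhs not among the length-j frequent itemsets
            conf = supp_k / supp_lhs           # conf(X => Y) = supp(X u Y) / supp(X)
            if conf >= p:
                rhs = [item for item in itemset if item not in lhs]
                print(f"{conf:.3f} {' '.join(map(str, lhs))} => {' '.join(map(str, rhs))}")

Generating the length-j subsets of each frequent k-itemset and looking them up, rather than comparing every pair of itemsets from the two files, keeps the work proportional to the number of rules actually considered.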
One should be able to run it from the command line as, e.g.,
% python arules.py mushroom-lev4-sup500-wCount.dat mushroom-lev6-sup500-wCount.dat 0.75
or
% java Arules mushroom-lev4-sup500-wCount.dat mushroom-lev6-sup500-wCount.dat 0.75
Here, j=4 and k=6, and the confidence threshold is 0.75. Since the frequent itemset files provided as input are from a support-count threshold of 500, the association rules found will be with respect to a support-count threshold of 500.
Do test your program at least on the mushroom transaction dataset, discussed below, with j=4, k=6, a confidence threshold of 0.75, and a support-count threshold of 500. Your program should be reasonably efficient on this, running within a couple of minutes at most.
Q2: FP-Growth [3pt]
Consider the following transaction itemsets.
T1: M O N K E Y
T2: D O N K E Y
T3: M A K E
T4: M U C K Y
T5: C O K I E
Consider a support threshold of 60% and a confidence threshold of 80%.
A. FP-Tree [2pt]
Show the transformed transaction itemsets with the items in frequency order as per FP-Growth and with non-frequent items eliminated.
Sketch the FP-Tree that results.
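As a generic illustration of this preprocessing step (not tied to the transactions above; the names are illustrative), the transformation can be sketched as:

from collections import Counter

def preprocess(transactions, min_support_count):
    # A percentage threshold such as 60% corresponds to a count of
    # ceil(0.6 * number of transactions).
    counts = Counter(item for t in transactions for item in t)
    frequent = {i for i, c in counts.items() if c >= min_support_count}
    # Keep only frequent items, ordered by descending frequency (ties by item).
    return [sorted((i for i in t if i in frequent), key=lambda i: (-counts[i], i))
            for t in transactions]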
B. Extracting the Association Rules [1pt]
Find the resulting association rules with respect to a support threshold of 60% and a confidence threshold of 80%.