INFS 630 – Data Mining, Winter 2023
Assignment 1: Association Rule Mining with RapidMiner Studio
This is an individual assignment, not a group assignment. In this assignment, you will learn frequent itemset mining and association rule mining in RapidMiner Studio with a real-world transaction dataset that contains grocery purchase records. The data can be downloaded from MyCourses. After completing the following instructions for assignment 1, you will learn how to:
- Install RapidMiner Studio and get familiar with its different UI components.
- Use RapidMiner to read transaction data.
- Transform transaction data to binominal data.
- Discover frequent item sets.
- Create association rules.
Then answer the questions and put your answers in a Word document. Submit your Word (or PDF) file via MyCourses before 2:30pm on February 22 (Tue). The instructions may seem lengthy, but the steps are not difficult. They are detailed so that you will not miss any step.
1. Installation.
- Download RapidMiner Studio v9.10.011 (or a later version) from https://rapidminer.com/platform/educational/.
- Click on Download Studio. You will be asked to create an account. Select Educational purposes. You have to verify your email address before logging in. You may want to use your McGill email address. Follow the instructions to get an educational license.
- Install RapidMiner according to the installation wizard.
- Open RapidMiner. Install the Text Processing package from the RapidMiner Marketplace. The Marketplace can be accessed in RapidMiner Studio through the main menu: Extensions / Marketplace. Search for Text Processing in the search box.
2. Introducing UI components in RapidMiner Studio:
- The upper-left Repository panel. This is a local storage repository on your computer where you can access your saved scripts and data.
- The central Process panel. This is the main canvas where you set up flows of operators to complete a data mining task. An operator executes a specific action. It is drawn as a rectangle, with input connectors on its left and output connectors on its right.
- The lower-left Operators panel. Here, you can search for a specific operator and include it in your process by dragging it to your Process panel.
- The upper-right Parameters panel. By clicking a specific operator in the Process panel, you can configure settings for this operator. For example, you can specify the file to be read for an Open File operator.
- The lower-right Help panel. By clicking a specific operator in the Process panel, you can find its information regarding the specific actions, required settings, and the types of input/output data.
3. Data preparation:
- Use Notepad, Excel, or any text editor to open the transaction file Assignment1 - Data - groceries.csv and see what it looks like.
- Open RapidMiner Studio.
- Select menu item File → New Process and create a Blank Process.
- In the Operators panel, search for the Open File operator. Drag the
Open File operator to the Process panel.
- Click the Open File operator in the Process panel. In the Parameters panel, select your input transaction file for the filename option using the button next to the field.
- In the Operators panel, search for the Read CSV operator. Drag the
Read CSV operator to the Process panel.
- In the Process panel, connect the fil output of the Open File operator to the fil input of the Read CSV operator by dragging a line between them.
- Select the Read CSV operator in the Process panel. In the Parameters panel, click the Import Configuration Wizard button.
- In the first step of the wizard, select your input transaction file again.
Then move to the second step by clicking Next.
- In the second step of the wizard, uncheck Header Row and select Semicolon as the Column Separator. Then move to the third step by clicking Next.
- In the third step, click Finish to complete the wizard.
- In the Parameters panel, scroll down to find data set meta data information and click Edit List (1)... (If you do not see it, click Show advanced parameters.)
- Change the type of att1 from polynominal to text. Then click Apply.
- In the Process panel, connect the out output of the Read CSV operator to the res result connector on the right edge of the Process panel. Click the Run button above the Process panel to run your process and see the result of the Read CSV operator. At this stage, the result should be a table with two columns. The first column, Row No., is the row identification number. The second, att1, lists the items included in a transaction, separated by commas. Switch back to your process by selecting the Design view above the Process panel.
- Next, in the Operators panel, search for the Process Documents from Data operator and drag it to your process. Connect the out output of the Read CSV operator to the exa input of the Process Documents from Data operator. Click the Process Documents from Data operator, and then in the Parameters panel, set the vector creation option to Term Occurrences.
- By double clicking the Process Documents from Data operator, you go into the inside flow of this operator. Here, we need to specify how we want to create a document from a transaction. A document is defined as a list of tokens. Search for the Tokenize operator and drag it to the flow.
- Connect the doc connector on the left edge of the Process panel to the input of the Tokenize operator. Connect the output of the Tokenize operator to the doc connector on the right edge of the Process panel. Click on the Tokenize operator and set the mode option in the Parameters panel to specify characters. Set the characters option to a comma by typing , in the input box. Go back to your main process by clicking the Process link above the Process panel.
- Connect the exa output of your Process Documents from Data operator to the res connector on the right edge of the Process panel. Hit the Run button to see the result. You should have a table that consists of multiple numeric attributes. Each row represents a transaction and each column represents a grocery item. If a transaction contains an item, the attribute corresponding to that item is 1, otherwise 0. Switch back to your process by selecting the Design view at the top of the Process panel.
- Next, we transform the numeric table into binominal data. Search for the Numerical to Binominal operator and drag it to your process. Connect the exa output of your Process Documents from Data operator to the input of the Numerical to Binominal operator.
- Click the Numerical to Binominal operator and set the min option to -0.5 and the max option to 0.5 in the Parameters panel. A numeric value that falls within this range will be replaced by the binominal value false. Any other value will be replaced by true.
- Inspect your result by connecting the exa output of the Numerical to Binominal operator to the res connector on the right edge of the Process panel. After clicking the Run button, you should have a table that consists of multiple binominal attributes. Each row represents a transaction and each column represents a grocery item. If a transaction contains an item, the attribute corresponding to that item is true, otherwise false. Switch back to your process by selecting the Design view above the Process panel.
- If you see the table with binominal data, your data is ready for frequent itemset mining and association rule mining. Otherwise, please go back to previous steps and check your process.
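For intuition only (this is not part of the RapidMiner process and is not required for the assignment), the data preparation in this section can be sketched in plain Python. The sample rows, function names, and the `low`/`high` parameters below are assumptions made for illustration; they only mirror the Tokenize, Term Occurrences, and Numerical to Binominal steps described above and say nothing about how RapidMiner implements them.

```python
# Hypothetical sample rows; the real data lives in Assignment1 - Data - groceries.csv.
rows = [
    "whole milk,yogurt,rolls/buns",
    "whole milk,soda",
    "yogurt,soda,whole milk",
]


def tokenize(row, separator=","):
    """Split one transaction string into item tokens on the separator character."""
    return [token.strip() for token in row.split(separator) if token.strip()]


def term_occurrences(rows):
    """Build a numeric table: one row per transaction, one column per item.

    Each cell holds how many times the item occurs in the transaction.
    """
    documents = [tokenize(row) for row in rows]
    items = sorted({token for document in documents for token in document})
    table = [[document.count(item) for item in items] for document in documents]
    return items, table


def to_binominal(table, low=-0.5, high=0.5):
    """Map values inside [low, high] to False and all other values to True."""
    return [[not (low <= value <= high) for value in row] for row in table]


items, counts = term_occurrences(rows)
binominal = to_binominal(counts)

print(items)
for row in binominal:
    print(row)
```

With min set to -0.5 and max set to 0.5, a count of 0 falls inside the range and becomes false, while any count of 1 or more falls outside it and becomes true, which is why that range is used.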
4. Discover frequent item sets.
1. Search for the FP-Growth operator in the Operators panel and drag it to
your process.
2. Connect the exa output of the Numerical to Binominal operator to the exa input of the FP-Growth operator. Click the FP-Growth operator and uncheck the find min number of itemsets option. Start by setting the min support option to 0.01, since we have a large dataset.
3. Connect the fre output of the FP-Growth operator to the res connector on the right edge of the Process panel. Click the Run button to see the result. You should see a list of frequent itemsets. If not, go back to previous steps and check your process and configurations.
4. Switch back to your process by selecting the Design view above the Process panel.
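To make the meaning of min support concrete, here is a minimal Python sketch that finds frequent itemsets by brute-force counting. It is deliberately not the FP-Growth algorithm (which reaches the same result far more efficiently on large data), and the four transactions, the function name, and the `max_size` cap are assumptions for illustration only.

```python
from itertools import combinations

# Hypothetical toy transactions, each a set of items.
transactions = [
    {"whole milk", "yogurt", "rolls/buns"},
    {"whole milk", "soda"},
    {"yogurt", "soda", "whole milk"},
    {"rolls/buns", "soda"},
]


def frequent_itemsets(transactions, min_support, max_size=3):
    """Return {itemset: support} for every itemset with support >= min_support.

    Support is the fraction of transactions that contain all items of the itemset.
    This enumerates every candidate, so it is only suitable for tiny examples.
    """
    total = len(transactions)
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, max_size + 1):
        for candidate in combinations(items, size):
            count = sum(1 for t in transactions if set(candidate) <= t)
            support = count / total
            if support >= min_support:
                result[frozenset(candidate)] = support
    return result


itemsets = frequent_itemsets(transactions, min_support=0.5)
for itemset, support in itemsets.items():
    print(sorted(itemset), support)
```

Lowering `min_support` in this sketch admits more (and larger) itemsets, which is the same effect to look for when changing the min support option of the FP-Growth operator.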
5. Create association rules.
- Search for the Create Association Rules operator in the Operators panel and drag it to your process.
- Connect the fre output of the FP-Growth operator to the ite input of the Create Association Rules operator. Click the Create Association Rules operator and set the minimum confidence to 0.5 in the Parameters panel.
- Connect the rul output of the Create Association Rules operator to the res connector on the right edge of the Process panel. Click the Run button to see the result. You can see a list of association rules. If not, go back to previous steps and check your process and configurations.
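As a rough illustration of what minimum confidence means, the following Python sketch derives rules from a table of itemset supports, using confidence(premise → conclusion) = support(premise ∪ conclusion) / support(premise). The support values and the function name are made up for the example, and the sketch does not claim to reproduce the internals of the Create Association Rules operator.

```python
from itertools import combinations

# Hypothetical supports of frequent itemsets (as a frequent itemset miner might report).
supports = {
    frozenset({"whole milk"}): 0.75,
    frozenset({"soda"}): 0.75,
    frozenset({"yogurt"}): 0.5,
    frozenset({"whole milk", "yogurt"}): 0.5,
    frozenset({"whole milk", "soda"}): 0.5,
}


def association_rules(supports, min_confidence):
    """Return (premise, conclusion, support, confidence) tuples.

    Every frequent itemset with at least two items is split into a non-empty
    premise and a non-empty conclusion; the rule is kept when its confidence
    reaches min_confidence.
    """
    rules = []
    for itemset, support in supports.items():
        if len(itemset) < 2:
            continue
        for size in range(1, len(itemset)):
            for premise_items in combinations(sorted(itemset), size):
                premise = frozenset(premise_items)
                conclusion = itemset - premise
                confidence = support / supports[premise]
                if confidence >= min_confidence:
                    rules.append((premise, conclusion, support, confidence))
    return rules


rules = association_rules(supports, min_confidence=0.7)
for premise, conclusion, support, confidence in rules:
    print(sorted(premise), "->", sorted(conclusion), support, confidence)
```

In this toy table, only the rule yogurt → whole milk passes a minimum confidence of 0.7, while lowering the threshold to 0.5 keeps all four candidate rules.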
6. You can manipulate the min support option and the minimum confidence option to see different results.
Questions
- Briefly describe the format of the input data. How is the data arranged? What does each row represent? What is the expected input format of the FP-Growth operator in RapidMiner?
- Get familiar with the rich tools provided in RapidMiner for data transformation and data cleaning. Convert the data into a table that meets the expected input format of the frequent itemset mining operator. Follow the above instructions to set up the processes for frequent itemset mining and association rule mining on top of your data transformation process. Capture your “Process” and paste it into your answer document. (Note: you can capture the process by right-clicking on the white space of the “Process” pane and selecting “Print/Export image”. You may also capture the screen by pressing Alt-Print Screen or using the Snipping Tool in Windows.)
- Briefly describe each operator in your process in one or two sentences. List three association rules that satisfy a minimum support of 0.02 and a minimum confidence of 0.4. Set the Min. Criterion in the lower-left corner to minimal by sliding the knob to the left. Under this setting, how many association rules contain the item 'whole milk'?
- Experiment with different minimum support and minimum confidence values. Describe your observations and comment on how the results differ across settings. What happens when you increase/decrease the minimum support value? What happens when you increase/decrease the minimum confidence value? What happens to the popular items, such as 'whole milk', when you have a low minimum support value and a high minimum confidence value? What happens to the popular items when you have a high minimum support value and a low minimum confidence value?