COMP9727 Recommender Systems Assignment 1 - 22T2
Due: 1st July, 17:00 AEST
Total Mark: 50
Introduction In this assignment, you will be required to manually implement a few recommendation algorithms in Python as well as answer some corresponding questions individually. Other than this spec, all the required files can be found in ‘a1.zip’, which you can download directly from the WebCMS3 page. The following are some general requirements for the assignment:
· File link (login required): https://cs9727.web.cse.unsw.edu.au/22T2/assn1/a1.zip. If you can not download the file, try to open it in incognito window.
· There are 5 questions that count for 25% of your final mark. They are not equally dis- tributed.
· Each question consists of a conceptual questions part and an implementation part.
· For conceptual questions part, you must submit your answers electronically. You should report your answers to all the questions in a single file ass1 YourZId.pdf. You are strongly encouraged to use LATEX to produce your answers (Word is fine if you prefer). Please note that we do not accept scanned copies.
· For implementation part, please note that all the algorithms are required to be imple- mented in Python3.
· As far as third-party packages are concerned, only the following are permitted: NumPy, Pandas and Matplotlib. For versioning, please check it in the submission section.
· For ‘a1.zip’, it contains the following datasets: q1.txt, q2.txt, q3.csv, shows.txt, user-shows.txt, q5.csv. It also contains some starter codes: q3.py and q5.py, please follow the instructions inside to finish the tasks. For q1.py, q2.py and q4.py, you need to implement them from scratch, including creating the python files yourself.
· Please ensure your code works well on CSE machines. If your code was unable to run on CSE machines, you will receive 0 for the corresponding question.
· Late Penalty: 5% reduction per day of the assessment mark. Late submissions that are more than 5 days late will not be accepted.
· We do not set up standard outputs for the questions. As such, you will not lose marks due to the output format, as long as you include all the required content and the outputs are well-structured and meaningful.
Question 1 - Associate Rules and its Application in Recommendation (10 Marks)
To measure significance and interest in rules for recommendations, there are three commonly used metrics:
· Confidence (denoted as conf(X → Y )): Confidence is defined as the probability of occur- rence of Y in the transaction if the transaction already contains X:
conf(X → Y ) = Pr(Y |X), where Pr(Y |X) is the conditional probability.
· Lift (denoted as lift(X → Y )): Lift is the ratio of the confidence of the rule and the expected confidence of the rule if X and Y are statistically independent:
lift(X →Y)= conf(X →Y)/support(Y )
· Conviction(denoted as conv(X → Y )): Conviction can be explained as the expected frequency that X occurs without Y (i.e., the frequency that the rule makes an incorrect prediction):
conv(X → Y) = 1−support(Y) /1−conf(X →Y)
(a) (2 Marks)
There is a problem by using confidence only to evaluate the rules: confidence ignores the Pr(Y ). Explain why it matters. Moreover, explain why lift and conviction are not affected by that.
(b) (2 Marks)
Consider the following definition of symmetrical:
Definition 1 (Symmetrical) A measure is symmetrical if measure(X → Y ) = measure(Y → X).
For those mentioned measures: Confidence, Lift and Conviction, prove or disprove that they are symmetrical.
(c) (2 Marks)
A rule is said to be a “perfect” rule if it has a conditional probability of 1 (i.e., P r(Y |X) = 1 for X → Y ). A measure is considered optimal if it reaches its maximum achievable value for a “perfect” rule. Which of these measures is optimal? You should prove or disprove the statement. You can ignore 0/0 but not other cases.
Now let’s move to the implementation part. You are required to create a program named q1.py. Use your program to answer the following questions: (d) and (e).
We will use the provided dataset q1.txt to apply the associate rule mining technique to make recommendations. In this dataset, each line represents a session of a user while the user id is not
known. Within each session, there are several item IDs (strings of length 8) that the user viewed during this session, separated by space.
In your q1.py, use the Apriori algorithm to find the items which are frequently viewed together. Assume the support is 100 and fixed. Find the itemsets of sizes 2 and 3.
(d) (2 Marks)
Given a pair of items (X,Y), assume that the support count of {X,Y} is at least 100. Compute the confidence scores of the following two association rules for all the pairs meeting that condition:
X→Y, Y→X
Sort the rules in decreasing order based on confidence scores. Additionally, if more than one rule has the same confidence score, sort them by lexicographically increasing order on the left hand side of the rule. Please report top 10 rules in your report. Note that, your program should output top 20 rules for sanity check.
(e) (2 Marks)
Now consider a tuple of items (X, Y, Z), follow the same instructions in part (d) and report your results. The association rules you need to consider are:
(X,Y)→Z, (X,Z)→Y, and (Y,Z)→X
Question 2 - Latent Factor Method (5 Marks)
The goal of this question is to implement an algorithm - Stochastic Gradient Descent (SGD) to build a recommender system. Given a matrix R of ratings which contains m items and n users (i.e., size of R is m×n). Riu represents the rating given by user u to item i. Where the dimension of Q is m×k and the dimension of P is n×k. The goal of this problem is to find user matrix P and item matrix Q. We define the error as the following:
where means that we sum only on the pairs (user,item) for which the user has rated the item, i.e. the (i, u) entry of the matrix R is known. qi denotes i−th row of the matrix Q, andpu istheu−throwofthematrixP. qi andpu arebothrowvectorsofsizek,λisisthe regularization parameter. ∥ · ∥2 is the L2 norm.
In the lecture, we’ve discussed how SGD is used for updating the gradient. Now, implement the algorithm in a file named q2.py by using the provided dataset q2.txt with the following requirements. Note q2.txt comes from the same dataset as what we used in Tutorial 2. You may consider to use the tutorial 2 Q3’s code as a start point.
• You are not allowed to store the matrix R in memory. You have to read each Riu once a time from disk and apply your update rules in each iteration.
• Your program should read the whole file within each iteration.
• There is no code template provided in this question, you need to implement it from scratch.
Now, let’s fix k = 20, λ = 0.1 and the number of iterations = 40. You are tasked with finding an optimal value for the learning rate η. With an ideal η, you should observe the following: E < 65, 000 after 40 iterations with both qi and pu stop changing.
Find the value of η that you believe is ideal. Based on the η you find, please plot the cor- responding value of the objective function E (defined in Eq. 1) changing over the number of iterations (i.e., value of E as y-axis and iteration as x-axis).