Assignment 3
Cutoff date: 1 Mar 2023
Instructions:
● Please be reminded that you must turn in all assignments on time. Assignment extensions will be given only for unforeseen circumstances, such as illness, when supporting documents are provided. Late assignments will still be marked for your benefit, but the scores will NOT be recorded.
● There are 4 assignments in the course. Only the top 3 assignments are counted towards the final result. The passing score for the assignment average is 40.
● Please prepare your answers in an answer file, in .docx or .doc format, and name the file like s12345678a1.docx where ‘12345678’ is your 8-digit student number and ‘1’ is the assignment number. For the programming questions, please also include the source code and execution outputs in one Jupyter Notebook file, and name the file like s12345678a1.ipynb. Do NOT submit multiple Jupyter Notebook files.
● Please remember to write your name and student number in your answer file and your Jupyter Notebook file (and any related files, if applicable). Put all the files in a zip file, name it like s12345678a1.zip, and submit only that zip file to OLE.
● This assignment contains 4 questions. All questions are compulsory.
Question 1 – Unsupervised learning concepts [25 marks]
a. Two main types of unsupervised learning are association analysis and cluster analysis. For each of the two types, give an example application in the education domain; describe the application and explain how unsupervised learning techniques are employed in it.
b. Principal component analysis (PCA) is applied to a dataset of 4 features to produce 4 principal components called PC0, PC1, PC2, and PC3.
i. What is the total amount of explained variance ratios of all the principal components?
ii. Compare the amounts of explained variance ratios of the 4 principal components.
iii. A student calculates some cumulative explained variance ratios by hand, and gets the following results. Comment on the correctness of the calculated results with justification.
Principal component(s) | Calculated cumulative explained variance ratio
---|---
PC0 | 0.40
PC0 + PC1 | 0.60
PC0 + PC1 + PC2 | 0.90
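For part (iii), the calculated ratios can be cross-checked in code. The sketch below is a minimal illustration assuming scikit-learn; the matrix X is hypothetical random data standing in for any 4-feature dataset.

import numpy as np
from sklearn.decomposition import PCA

X = np.random.default_rng(0).normal(size=(100, 4))  # hypothetical 4-feature data

pca = PCA(n_components=4).fit(X)
print(pca.explained_variance_ratio_)             # ratio explained by each component
print(np.cumsum(pca.explained_variance_ratio_))  # cumulative ratios, as in the table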
c. The following dataset contains 7 transactions of 6 items: apple, bread, carrot, donut, egg, and fish. Calculate (i) support({apple}), (ii) support({donut}), (iii) support({apple} => {donut}), and (iv) confidence({apple} => {donut}). A counting sketch for checking your calculations is given after the transaction list. (Note that this dataset is also used in some later questions.) [5]
● T0: apple, bread, egg, fish
● T1: apple, bread, carrot, donut, egg
● T2: bread, egg
● T3: bread, donut, egg
● T4: apple, bread, egg
● T5: apple, bread, egg, fish
● T6: apple, carrot, donut, egg
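If you wish to verify your hand counts for part (c), a minimal sketch in plain Python is given below; the helper names support and confidence are illustrative, not required.

transactions = [
    {"apple", "bread", "egg", "fish"},             # T0
    {"apple", "bread", "carrot", "donut", "egg"},  # T1
    {"bread", "egg"},                              # T2
    {"bread", "donut", "egg"},                     # T3
    {"apple", "bread", "egg"},                     # T4
    {"apple", "bread", "egg", "fish"},             # T5
    {"apple", "carrot", "donut", "egg"},           # T6
]

def support(itemset):
    # fraction of transactions that contain every item in itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    # support of the combined itemset divided by support of the antecedent
    return support(antecedent | consequent) / support(antecedent)

print(support({"apple"}), support({"donut"}),
      support({"apple", "donut"}), confidence({"apple"}, {"donut"}))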
d. Recommender systems use different types of techniques, which employ different data.
i. Which type of technique mainly uses product description data?
ii. Which type of technique mainly uses recent product sales data?
iii. Which type of technique mainly uses historical user purchase data?
iv. Which type of technique raises the most privacy concerns? Briefly justify your answer.
e. Four vectors are given below.
ID | Vector
---|---
#0 | (1, 2, 3)
#1 | (2, 2, 2)
#2 | (2, 2, 3)
#3 | (2, 2, 4)
i. Compute the cosine similarity among the four vectors, and show the resulting cosine similarity values to 4 decimal places in a 4×4 table. You may do it either by hand or by code (a starter sketch follows part (iii)), and need not show the intermediate work.
ii. From the table, determine the vector that is most similar to vector #1.
iii. From the table, determine the vector that is most similar to vector #2.
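If you take the code route for part (i), scikit-learn computes the whole table in one call; the rounding shown is just one way to obtain 4 decimal places.

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

vectors = np.array([[1, 2, 3],   # #0
                    [2, 2, 2],   # #1
                    [2, 2, 3],   # #2
                    [2, 2, 4]])  # #3

# 4x4 matrix of pairwise cosine similarities, rounded to 4 decimal places
print(np.round(cosine_similarity(vectors), 4))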
Question 2 – Unsupervised learning algorithms [25 marks]
a. Apply Apriori to the dataset in Q1-c to find frequent itemsets for the minimum support value of 0.4. Show your work of identifying the candidate and frequent 1-itemsets, 2-itemsets, and so on.
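Note that with 7 transactions, a minimum support of 0.4 corresponds to an absolute count of 0.4 × 7 = 2.8, so an itemset is frequent only if it appears in at least 3 transactions.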
b. Apply k-means to cluster the 6 data points (1, 1), (1, 2), (1, 3), (2, 1), (2, 2), and (4, 3) into 2 clusters. Use the initial centroids (1, 0) and (2, 2). Show your work of computing the distances and centroid updates in each iteration of k-means, and present the values to at most 2 decimal places.
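As a reminder, standard k-means uses the Euclidean distance d((x₁, y₁), (x₂, y₂)) = √((x₁ − x₂)² + (y₁ − y₂)²); for example, the distance from the point (1, 1) to the initial centroid (1, 0) is √((1 − 1)² + (1 − 0)²) = 1.00.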
Question 3 – Programming PCA and Apriori [25 marks]
a. Given a dataset created from the code fragment below, write code to apply PCA and linear regression on the dataset for various numbers of principal components, and plot the test scores and total explained variance ratios versus the number of principal components used. Do not use cross-validation. (A possible skeleton follows the code fragment.)
from sklearn.datasets import make_regression
X, y = make_regression(n_samples=500, n_features=30, n_informative=20,
effective_rank=10, noise=4, tail_strength=0.1,
random_state=42)
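One possible skeleton for part (a), continuing from the fragment above, is sketched below. Fitting PCA on the training split only, and the default train/test split ratio, are assumptions rather than requirements.

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

scores, variances = [], []
components = range(1, X.shape[1] + 1)
for n in components:
    pca = PCA(n_components=n).fit(X_train)
    model = LinearRegression().fit(pca.transform(X_train), y_train)
    scores.append(model.score(pca.transform(X_test), y_test))  # R^2 test score
    variances.append(pca.explained_variance_ratio_.sum())      # total ratio explained

plt.plot(components, scores, label="test score")
plt.plot(components, variances, label="total explained variance ratio")
plt.xlabel("number of principal components")
plt.legend()
plt.show()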
b. Write code to visualise the dataset in part (a) in a scatter plot using the first two principal components. Show the first principal component on the x-axis, the second principal component on the y-axis, and the target values in colours with the “bwr” Matplotlib colormap. [5]
c. Write code to apply Apriori to the dataset in Q1-c for a minimum support value of 0.4. Display the support values and the contents of the frequent itemsets. (A library-based sketch is given after this part.) [8]
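For part (c), one common route (an assumption; any equivalent library is acceptable) is mlxtend's TransactionEncoder and apriori:

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori

transactions = [
    ["apple", "bread", "egg", "fish"],
    ["apple", "bread", "carrot", "donut", "egg"],
    ["bread", "egg"],
    ["bread", "donut", "egg"],
    ["apple", "bread", "egg"],
    ["apple", "bread", "egg", "fish"],
    ["apple", "carrot", "donut", "egg"],
]

# One-hot encode the transactions into a Boolean DataFrame
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit_transform(transactions), columns=te.columns_)

# Frequent itemsets with support >= 0.4, keeping the item names
print(apriori(onehot, min_support=0.4, use_colnames=True))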
Question 4 – Programming and evaluation of clustering [25 marks]
a. Write code to apply k-means to cluster the 6 data points (1, 1), (1, 2), (1, 3), (2, 1), (2, 2), and (4, 3) into 2 clusters. Use the initial centroids (1, 0) and (2, 2). Display the resulting cluster labels of the data points, the cluster centroids, the number of iterations performed, and the inertia. (A starter sketch for this part and part (c) is given after part (d).) [10]
b. Repeat part (a), varying the initial centroids to obtain a different clustering scheme. That is, you need to find a set of initial centroids that produces cluster labels and cluster centroids different from those of part (a). Show the resulting cluster labels and cluster centroids. [3]
c. Write code to apply the silhouette method to determine the optimal number of clusters for the dataset in part (a). Use agglomerative hierarchical clustering, and plot the resulting silhouette coefficients for 2 to 5 clusters. [10]
d. Using the resulting graph of part (c), determine the optimal number of clusters for the dataset. [2]
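A starter sketch for parts (a) and (c) follows. The variable names are illustrative, setting n_init=1 so that k-means runs exactly once from the given centroids is an assumption about the intended setup, and the sketch prints rather than plots the silhouette coefficients.

import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import silhouette_score

X = np.array([[1, 1], [1, 2], [1, 3], [2, 1], [2, 2], [4, 3]])

# Part (a): k-means started from the given centroids, run once
km = KMeans(n_clusters=2, init=np.array([[1, 0], [2, 2]]), n_init=1).fit(X)
print(km.labels_, km.cluster_centers_, km.n_iter_, km.inertia_)

# Part (c): silhouette coefficients for agglomerative clustering, k = 2 to 5
for k in range(2, 6):
    labels = AgglomerativeClustering(n_clusters=k).fit_predict(X)
    print(k, silhouette_score(X, labels))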
End of Assignment