COMP24112 Machine Learning Lab 2: News Article Classification by k-NN

Engage in a Conversation

COMP24112 Lab 2: News Article Classification by k-NN

1. Task description

You will work on a news article classification task. The provided dataset includes a total of 800 articles taken from Reuters newswire. They belong to 4 classes: "earn" (0), "crude" (1), "trade" (2) and "interest" (3). There are 200 articles per class. Each article is characterised by word occurrences. The list of used words is called a vocabulary. In our dataset, the vocabulary includes a total of 6428 words. CourseNana.COM

2. Preparation

First we need to import the data. Run the below cell to load the data using NumPy. CourseNana.COM

In [3]:

import numpy as np
import matplotlib.pyplot as plt
import scipy.sparse

data, labels, class_names, vocabulary = np.load("ReutersNews_4Classes_sparse.npy", allow_pickle=True)

A Note on Sparsity

Most documents only contain a small subset of the vocabulary, resulting in a very sparse data matrix. To handle the sparsity, in this exercise data is represented as a scipy.sparse.csr_matrix, which can store sparse matrices efficiently while still allowing efficient row-based indexing. You can learn more about csr_matrix and other ways of dealing with sparse matrices at https://docs.scipy.org/doc/scipy/reference/sparse.html. CourseNana.COM

Note, however, that data is not a normal NumPy array. While most operations will be the same as with a normal dense array, you cannot use a sparse matrix to index another matrix. If you need to do this, either first convert the matrix to a NumPy array with the toarray() method, or use methods specifically designed to work with sparse matrices. CourseNana.COM

In [2]:

print(data[41,:]) # A sparse row vector; the output will be the non-zero indices and their values.
print(data[41,:].toarray()) # Convert back to a NumPy array. Note that the result is a (1, 6428) matrix, not a vector.
# print(vocabulary[data[41,:] > 0]) # Can't index vocabulary with a sparse matrix.
rows, columns, values = scipy.sparse.find(data[41,:]) # Find the non-zero entries in the 42nd document.
print(vocabulary[columns]) # Prints the words present in the 42nd document.

  (0, 2)	1
  (0, 3)	3
  (0, 5)	1
  (0, 8)	1
  (0, 10)	1
  (0, 11)	1
  (0, 12)	1
  (0, 13)	1
  (0, 21)	2
  (0, 24)	1
  (0, 105)	1
  (0, 127)	1
  (0, 227)	1
  (0, 275)	1
  (0, 334)	2
  (0, 341)	1
  (0, 348)	1
  (0, 359)	1
  (0, 411)	1
  (0, 426)	1
  (0, 1428)	1
  (0, 2058)	1
  (0, 5555)	1
[[0 0 1 ... 0 0 0]]
['share' 'split' 'say' 'two-for-one' 'shareholder' 'annual' 'meeting'
 'reuter' 'ct' 'note' 'company' 'pay' 'subject' 'increase' 'stock'
 'dividend' 'april' 'northern' 'declare' 'approval' 'telecom' 'post-split'
 'nt']

To see the full vocabulary, you can run CourseNana.COM

In [3]:

print(", ".join(vocabulary))

You can see how many times article $i$ contains word $j$ using CourseNana.COM

In [4]:

i, j = 40, 2
print(data[i,j])

You can see which class the $i$ th article belongs to using CourseNana.COM

In [5]:

print(labels[i])

For instance, by running CourseNana.COM

In [6]:

print("Occurrences:", data[0,10])
print("Class:", class_names[labels[0]])
print("Word:", vocabulary[10])

Occurrences: 2
Class: earn
Word: shareholder

you can see that the 11th word appears twice in the first document, the first document belongs to the class "earn", and the 11th word is "shareholder". CourseNana.COM

The following function randomly selects a subset of the data. CourseNana.COM

In [7]:

def sample_indices(labels, *num_per_class):
    """
    Returns randomly selected indices. It will return the specified number of indices for each class.
    """
    indices = []
    for cls, num in enumerate(num_per_class):
        cls_indices = np.where(labels == cls)[0]
        indices.extend(np.random.choice(cls_indices, size=num, replace=False))
    return np.array(indices)

For instance, to get one sample from the first class, two from the second, three from the third, and four from the fourth, you can run: CourseNana.COM

In [8]:

indices = sample_indices(labels, 1, 2, 3, 4)
print("Returned indices:", indices)
print("Samples:", data[indices])
print("Corresponding classes:", labels[indices])

Returned indices: [ 28 290 345 527 401 431 731 735 605 703]
Samples:   (0, 13)	1
  (0, 14)	1
  (0, 15)	3
  (0, 16)	3
  (0, 17)	4
  (0, 19)	7
  (0, 20)	4
  (0, 21)	3
  (0, 22)	2
  (0, 23)	2
  (0, 99)	1
  (0, 117)	1
  (0, 121)	1
  (0, 122)	1
  (0, 681)	1
  (0, 4504)	1
  (0, 4505)	1
  (1, 5)	5
  (1, 13)	1
  (1, 23)	9
  (1, 42)	2
  (1, 91)	1
  (1, 104)	1
  (1, 106)	1
  (1, 115)	1
  :	:
  (8, 640)	1
  (8, 772)	1
  (8, 821)	1
  (8, 836)	1
  (8, 837)	1
  (8, 842)	1
  (8, 959)	1
  (8, 984)	1
  (8, 1019)	1
  (8, 1475)	1
  (8, 1600)	1
  (8, 1951)	2
  (8, 1952)	1
  (8, 1953)	2
  (8, 2073)	1
  (8, 2521)	1
  (8, 3091)	1
  (9, 5)	1
  (9, 98)	1
  (9, 332)	1
  (9, 473)	1
  (9, 622)	1
  (9, 822)	1
  (9, 1229)	1
  (9, 1267)	1
Corresponding classes: [0 1 1 2 2 2 3 3 3 3]

3. k-NN Implementation (4 Marks, Normal)

Now, you will need to implement a k-NN classifier by filling the code below. This function should support two types of distance measures: Euclidean distance and cosine distance (defined as 1 - cosine similarity). It should take a set of training samples, a user-specified neighbour number, a distance option, and features of a set of testing samples as the input. It should return the predicted classes for the input set of testing samples. CourseNana.COM

In order to get 4 marks, you are asked to implement the k-NN classifier from scrach without relying on any machine learning library, particularly the distance calculation. But you are allowed to research NumPy functions relating to sorting. If you decide to use existing distance implementation from libraries, e.g., sklearn.metrics.pairwise_distances imported as cdist, you can get at most 3 marks. CourseNana.COM

Your implementation must NOT make use of Python loops over individual samples or features. You should use functions that operate on whole matrices, as this will be much faster than looping in Python. Each experiment below is expected to take no more than 2 minutes to run. CourseNana.COM

In [9]:

import scipy.stats

def knn_classify(test_samples, training_data, training_labels, metric="euclidean", k=1):
    """
    Performs k-nearest neighbour classification on the provided samples,
    given training data and the corresponding labels.
    
    test_samples: An m x d matrix of m samples to classify, each with d features.
    training_data: An n x d matrix consisting of n training samples, each with d features.
    training_labels: A vector of size n, where training_labels[i] is the label of training_data[i].
    metric: The metric to use for calculating distances between samples.
    k: The number of nearest neighbours to use for classification.
    
    Returns: A vector of size m, where out[i] is the predicted class of test_samples[i].
    """
    # Calculate an m x n distance matrix.
    pairwise_distance = ...
    
    # Find the k nearest neighbours of each samples as an m x k matrix of indices.
    nearest_neighbours = ...
    
    # Look up the classes corresponding to each index.
    nearest_labels = ...
    
    # Return the most frequent class on each row.
    # Note: Ensure that the returned vector does not contain any empty dimensions.
    # You may find the squeeze method useful here.
    return ...

4. Experiments (13 Marks in Total)

Use your k-NN function to perform the following experiments. CourseNana.COM

Experiment 1 (3 Marks, Easy)

Randomly select 80 articles per class for training, and use the remaining articles for testing. Fix a neighbour number setting as you see fit. Perform k-NN classification using the Euclidean distance and test it. CourseNana.COM

Repeat this process 20 times (trials). Calculate the mean and standard deviation of the testing accuracies. Print out the mean and standard deviation. CourseNana.COM

In [11]:

# Your code goes here

Use the same neighbour number, but use the cosine distance instead of the Euclidean distance. Repeat the same experiment. CourseNana.COM

Print out the mean and standard deviation. CourseNana.COM

In [14]:

# Your code goes here

Explain in your report which distance measure gives better performance and analyse the reason. CourseNana.COM

Experiment 2 (5 Marks, Easy)

Using the distance measure that you found performs better in Experiment 1. CourseNana.COM

Randomly select 80 articles per class for training, and use the remaining articles for testing. Perform k-NN classification with the neighbour number $k$ varying from 1 to 50. CourseNana.COM

For each values of $k$ , repeat the training process by 20 trials and record the average training error rates and standard deviation. CourseNana.COM

Do the same for testing errors. CourseNana.COM

In [16]:

# Your code goes here

Produce an error bar plot showing the training error rate for each $k$ here: CourseNana.COM

In [18]:

# Your code goes here

Produce your testing error bar plot here: CourseNana.COM

In [ ]:

# Your code goes here

Remember that all graphs should have axis labels and a title. CourseNana.COM

Discuss in your report the difference between the training and testing accuracies, and why this is the case. CourseNana.COM

Analyse in your report the effect of $k$ based on this experiment. What do you think is a reasonable value for $k$ ? Comment specifically on the bias and variance of your model at small and large values of $k$ . CourseNana.COM

Experiment 3 (5 Marks, Hard)

In this experiment we will create confusion matrices for a more detailed view on our model's performance. Then, we will observe the behaviour of our knn classifier on novel classes. CourseNana.COM

First, randomly select 100 articles per class for training, and use the remaining articles for testing. Set the neighbour number to $k = 3$ . Perform 3-NN classification using the Cosine distance, as in previous experiments. CourseNana.COM

Confusion Matrix Implementation

Implement a multi-class confusion matrix yourself, from scratch. Let the row index correspond to the known label, and column index to predicted label. If you decide to use existing confusion matrix implementation from libraries, e.g., sklearn.metrics.confusion_matrix, you can get at most 4 marks. (However, you may use an existing implementation to check the output of your own function.) CourseNana.COM

Print out the confusion matrix and overall accuracy of your classifier for the testing data. CourseNana.COM

In [21]:

# Your code goes here

On Novel Classes

5 new articles have been provided in string format below. The code to create a sparse representation of these articles has also been provided. Take a moment to skim through the articles. CourseNana.COM

Run the code below, saving the sparse matrix representation of these 5 articles into new_data. CourseNana.COM

In [5]:

# Make sure you have scikit-learn installed. 
from sklearn.feature_extraction.text import CountVectorizer

articles = []
for f in [sp0, sp1, sp2, sp3, sp4]:
    text = f.replace('\n', ' ')
    articles.append(text)
vrizer = CountVectorizer(vocabulary=vocabulary)
new_data = vrizer.fit_transform(articles)

(1) Run the classifier from step (1) to predict the classes of the articles in new_data. Print out the class predictions. CourseNana.COM

What classes to you think these 5 articles should belong to, based on your own judgement of their content? Can your classifer make an appropriate class prediction for these 5 articles? Analyse the reason for your answers in your report. CourseNana.COM

In [ ]:

# Your code goes here

(2) Introduce a new class, sport, to your dataset. The class should contain the 5 articles as above. Add this to your data using the command below. Your new data contains 805 articles, 800 from the original dataset and 5 from the new_data, belonging to 5 classes: 200 articles from each of the first 4 classes and 5 articles from the 5th class. CourseNana.COM

Randomly split the new data into a training set containing 100 articles each from 'earn', 'crude', 'trade', and 'interest', and then only 3 articles from 'sport' (you should be able to use the sample_indices function given at the start). Reserve the remaining articles for testing. Test the performance of the new 3-NN classifier. CourseNana.COM

Print the confusion matrix and classification accuracy for the testing data. CourseNana.COM

In [29]:

data_augmented = scipy.sparse.vstack((data, new_data))
labels_augmented = ... # your code goes here

(3) Repeat the above process 6 times, repeating the random train-test split. For each of the 5 classes, print out its averaged testing accuracy. Comment on your classifier's performance in your report. What are the consequences of having no training data and limited training data for the 'sports' class? CourseNana.COM

(4) Self-learn the concepts of zero-shot learning and few-shot learning. In your report, link these concepts to the experiments you've just performed. Is your model performing zero- or few-shot learning? Explain your reasoning. CourseNana.COM

5. Result Analysis (4 Marks in Total)

Analysis 1 (2 Marks, Normal)

Choose a training-testing trial in Experiment 2 for $k = 1$ . Observe the testing error of this 1-NN, and estimate the interval where its true error lies with 90% probability. Explain in your report how you compute it. CourseNana.COM

In [8]:

# You may write your calculations in LateX or in code here

Analysis 2 (2 Marks, Normal)

The following function Get_p_value() is provided to obtain $p$ according to $z_{p}$ . Use this function to perform Analysis 2. CourseNana.COM

In [34]:

# run this cell first

def Get_p_value(zp):
    return round(1 - scipy.stats.norm.sf(abs(zp))*2,2)

In [35]:

# Use this cell to compare the output value of function Get_p_value with 
# the table provided in your lecture notes (e.g., Slide 12, Chapter3C.pdf)

print('zp = 0.67, p = ', Get_p_value(0.67))
print('zp = 1, p = ', Get_p_value(1))
print('zp = 1.64, p = ', Get_p_value(1.64))
print('zp = 2.58, p = ', Get_p_value(2.58))
print()

# you can alert the input zp value and re-run this cell to help you to calculate the corresponding p.
print('p = ', Get_p_value(0.43))  


# you can change 0.43 to any zp value you obtained.

zp = 0.67, p =  0.5
zp = 1, p =  0.68
zp = 1.64, p =  0.9
zp = 2.58, p =  0.99

p =  0.33

Choose a training-testing trial in Experiment 2 for k=45. Observe the testing error of this 45-NN. Compare it with the 1-NN in Analysis 1. Which one has higher testing sample error? Estimate the probability that it also has higher true error. Explain your answer and how you compute it in the report. CourseNana.COM

6. Hyperparameter Selection (4 Marks, Normal)

Use your k-NN function with cosine distance. Design an appropriate and complete machine learning experiment, which should include the training, hyper-parameter selection and evaluation stages. In this case, your hyperparameter will be $k$ . You can choose from the random subsampling, k-fold CV and LOO approaches for hyperparameter selection. In order to get 4 marks, you should implement this from scrach without using readily implemented data-split functions provided in existing libraries. If you decide to use existing implementation on data splitting, model selection and/or evaluation, you can get at most 2 marks. CourseNana.COM

Explain in the report your strategy for splitting the data, and the design of your chosen hyperparameter selection method. Present your results and chosen value of $k$ . Why is it important to split the data into train, test, and validation sets in machine learning experiments? CourseNana.COM