1. Homepage
2. Programming
3. COMP24112 Machine Learning Lab 2: News Article Classification by k-NN

# COMP24112 Machine Learning Lab 2: News Article Classification by k-NN

UKUniversity Of ManchesterCOMP24112Machine LearningNews Article Classification by k-NN Python

# COMP24112 Lab 2: News Article Classification by k-NN

You will work on a news article classification task. The provided dataset includes a total of 800 articles taken from Reuters newswire. They belong to 4 classes: "earn" (0), "crude" (1), "trade" (2) and "interest" (3). There are 200 articles per class. Each article is characterised by word occurrences. The list of used words is called a vocabulary. In our dataset, the vocabulary includes a total of 6428 words.

## 2. Preparation

First we need to import the data. Run the below cell to load the data using NumPy.

In [3]:
import numpy as np
import matplotlib.pyplot as plt
import scipy.sparse

data, labels, class_names, vocabulary = np.load("ReutersNews_4Classes_sparse.npy", allow_pickle=True)


### A Note on Sparsity

Most documents only contain a small subset of the vocabulary, resulting in a very sparse data matrix. To handle the sparsity, in this exercise data is represented as a scipy.sparse.csr_matrix, which can store sparse matrices efficiently while still allowing efficient row-based indexing. You can learn more about csr_matrix and other ways of dealing with sparse matrices at https://docs.scipy.org/doc/scipy/reference/sparse.html.

Note, however, that data is not a normal NumPy array. While most operations will be the same as with a normal dense array, you cannot use a sparse matrix to index another matrix. If you need to do this, either first convert the matrix to a NumPy array with the toarray() method, or use methods specifically designed to work with sparse matrices.

In [2]:
print(data[41,:]) # A sparse row vector; the output will be the non-zero indices and their values.
print(data[41,:].toarray()) # Convert back to a NumPy array. Note that the result is a (1, 6428) matrix, not a vector.
# print(vocabulary[data[41,:] > 0]) # Can't index vocabulary with a sparse matrix.
rows, columns, values = scipy.sparse.find(data[41,:]) # Find the non-zero entries in the 42nd document.
print(vocabulary[columns]) # Prints the words present in the 42nd document.

  (0, 2)	1
(0, 3)	3
(0, 5)	1
(0, 8)	1
(0, 10)	1
(0, 11)	1
(0, 12)	1
(0, 13)	1
(0, 21)	2
(0, 24)	1
(0, 105)	1
(0, 127)	1
(0, 227)	1
(0, 275)	1
(0, 334)	2
(0, 341)	1
(0, 348)	1
(0, 359)	1
(0, 411)	1
(0, 426)	1
(0, 1428)	1
(0, 2058)	1
(0, 5555)	1
[[0 0 1 ... 0 0 0]]
['share' 'split' 'say' 'two-for-one' 'shareholder' 'annual' 'meeting'
'reuter' 'ct' 'note' 'company' 'pay' 'subject' 'increase' 'stock'
'dividend' 'april' 'northern' 'declare' 'approval' 'telecom' 'post-split'
'nt']


To see the full vocabulary, you can run

In [3]:
print(", ".join(vocabulary))

You can see how many times article $i$ contains word $j$ using

In [4]:
i, j = 40, 2
print(data[i,j])

4


You can see which class the $i$th article belongs to using

In [5]:
print(labels[i])

0


For instance, by running

In [6]:
print("Occurrences:", data[0,10])
print("Class:", class_names[labels[0]])
print("Word:", vocabulary[10])

Occurrences: 2
Class: earn
Word: shareholder


you can see that the 11th word appears twice in the first document, the first document belongs to the class "earn", and the 11th word is "shareholder".

The following function randomly selects a subset of the data.

In [7]:
def sample_indices(labels, *num_per_class):
"""
Returns randomly selected indices. It will return the specified number of indices for each class.
"""
indices = []
for cls, num in enumerate(num_per_class):
cls_indices = np.where(labels == cls)[0]
indices.extend(np.random.choice(cls_indices, size=num, replace=False))
return np.array(indices)


For instance, to get one sample from the first class, two from the second, three from the third, and four from the fourth, you can run:

In [8]:
indices = sample_indices(labels, 1, 2, 3, 4)
print("Returned indices:", indices)
print("Samples:", data[indices])
print("Corresponding classes:", labels[indices])

Returned indices: [ 28 290 345 527 401 431 731 735 605 703]
Samples:   (0, 13)	1
(0, 14)	1
(0, 15)	3
(0, 16)	3
(0, 17)	4
(0, 19)	7
(0, 20)	4
(0, 21)	3
(0, 22)	2
(0, 23)	2
(0, 99)	1
(0, 117)	1
(0, 121)	1
(0, 122)	1
(0, 681)	1
(0, 4504)	1
(0, 4505)	1
(1, 5)	5
(1, 13)	1
(1, 23)	9
(1, 42)	2
(1, 91)	1
(1, 104)	1
(1, 106)	1
(1, 115)	1
:	:
(8, 640)	1
(8, 772)	1
(8, 821)	1
(8, 836)	1
(8, 837)	1
(8, 842)	1
(8, 959)	1
(8, 984)	1
(8, 1019)	1
(8, 1475)	1
(8, 1600)	1
(8, 1951)	2
(8, 1952)	1
(8, 1953)	2
(8, 2073)	1
(8, 2521)	1
(8, 3091)	1
(9, 5)	1
(9, 98)	1
(9, 332)	1
(9, 473)	1
(9, 622)	1
(9, 822)	1
(9, 1229)	1
(9, 1267)	1
Corresponding classes: [0 1 1 2 2 2 3 3 3 3]


## 3. k-NN Implementation (4 Marks, Normal)

Now, you will need to implement a k-NN classifier by filling the code below. This function should support two types of distance measures: Euclidean distance and cosine distance (defined as 1 - cosine similarity). It should take a set of training samples, a user-specified neighbour number, a distance option, and features of a set of testing samples as the input. It should return the predicted classes for the input set of testing samples.

In order to get 4 marks, you are asked to implement the k-NN classifier from scrach without relying on any machine learning library, particularly the distance calculation. But you are allowed to research NumPy functions relating to sorting. If you decide to use existing distance implementation from libraries, e.g., sklearn.metrics.pairwise_distances imported as cdist, you can get at most 3 marks.

Your implementation must NOT make use of Python loops over individual samples or features. You should use functions that operate on whole matrices, as this will be much faster than looping in Python. Each experiment below is expected to take no more than 2 minutes to run.

In [9]:
import scipy.stats

def knn_classify(test_samples, training_data, training_labels, metric="euclidean", k=1):
"""
Performs k-nearest neighbour classification on the provided samples,
given training data and the corresponding labels.

test_samples: An m x d matrix of m samples to classify, each with d features.
training_data: An n x d matrix consisting of n training samples, each with d features.
training_labels: A vector of size n, where training_labels[i] is the label of training_data[i].
metric: The metric to use for calculating distances between samples.
k: The number of nearest neighbours to use for classification.

Returns: A vector of size m, where out[i] is the predicted class of test_samples[i].
"""
# Calculate an m x n distance matrix.
pairwise_distance = ...

# Find the k nearest neighbours of each samples as an m x k matrix of indices.
nearest_neighbours = ...

# Look up the classes corresponding to each index.
nearest_labels = ...

# Return the most frequent class on each row.
# Note: Ensure that the returned vector does not contain any empty dimensions.
# You may find the squeeze method useful here.
return ...


## 4. Experiments (13 Marks in Total)

Use your k-NN function to perform the following experiments.

### Experiment 1 (3 Marks, Easy)

Randomly select 80 articles per class for training, and use the remaining articles for testing. Fix a neighbour number setting as you see fit. Perform k-NN classification using the Euclidean distance and test it.

Repeat this process 20 times (trials). Calculate the mean and standard deviation of the testing accuracies. Print out the mean and standard deviation.

In [11]:
# Your code goes here


Use the same neighbour number, but use the cosine distance instead of the Euclidean distance. Repeat the same experiment.

Print out the mean and standard deviation.

In [14]:
# Your code goes here


Explain in your report which distance measure gives better performance and analyse the reason.

### Experiment 2 (5 Marks, Easy)

Using the distance measure that you found performs better in Experiment 1.

Randomly select 80 articles per class for training, and use the remaining articles for testing. Perform k-NN classification with the neighbour number $k$ varying from 1 to 50.

For each values of $k$, repeat the training process by 20 trials and record the average training error rates and standard deviation.

Do the same for testing errors.

In [16]:
# Your code goes here


Produce an error bar plot showing the training error rate for each $k$ here:

In [18]:
# Your code goes here


Produce your testing error bar plot here:

In [ ]:
# Your code goes here


Remember that all graphs should have axis labels and a title.

Discuss in your report the difference between the training and testing accuracies, and why this is the case.

Analyse in your report the effect of $k$ based on this experiment. What do you think is a reasonable value for $k$? Comment specifically on the bias and variance of your model at small and large values of $k$.

### Experiment 3 (5 Marks, Hard)

In this experiment we will create confusion matrices for a more detailed view on our model's performance. Then, we will observe the behaviour of our knn classifier on novel classes.

First, randomly select 100 articles per class for training, and use the remaining articles for testing. Set the neighbour number to $k=3$. Perform 3-NN classification using the Cosine distance, as in previous experiments.

#### Confusion Matrix Implementation

Implement a multi-class confusion matrix yourself, from scratch. Let the row index correspond to the known label, and column index to predicted label. If you decide to use existing confusion matrix implementation from libraries, e.g., sklearn.metrics.confusion_matrix, you can get at most 4 marks. (However, you may use an existing implementation to check the output of your own function.)

Print out the confusion matrix and overall accuracy of your classifier for the testing data.

In [21]:
# Your code goes here


#### On Novel Classes

5 new articles have been provided in string format below. The code to create a sparse representation of these articles has also been provided. Take a moment to skim through the articles.

Run the code below, saving the sparse matrix representation of these 5 articles into new_data.

In [5]:
# Make sure you have scikit-learn installed.
from sklearn.feature_extraction.text import CountVectorizer

articles = []
for f in [sp0, sp1, sp2, sp3, sp4]:
text = f.replace('\n', ' ')
articles.append(text)
vrizer = CountVectorizer(vocabulary=vocabulary)
new_data = vrizer.fit_transform(articles)


(1) Run the classifier from step (1) to predict the classes of the articles in new_data. Print out the class predictions.

What classes to you think these 5 articles should belong to, based on your own judgement of their content? Can your classifer make an appropriate class prediction for these 5 articles? Analyse the reason for your answers in your report.

In [ ]:
# Your code goes here


(2) Introduce a new class, sport, to your dataset. The class should contain the 5 articles as above. Add this to your data using the command below. Your new data contains 805 articles, 800 from the original dataset and 5 from the new_data, belonging to 5 classes: 200 articles from each of the first 4 classes and 5 articles from the 5th class.

Randomly split the new data into a training set containing 100 articles each from 'earn', 'crude', 'trade', and 'interest', and then only 3 articles from 'sport' (you should be able to use the sample_indices function given at the start). Reserve the remaining articles for testing. Test the performance of the new 3-NN classifier.

Print the confusion matrix and classification accuracy for the testing data.

In [29]:
data_augmented = scipy.sparse.vstack((data, new_data))
labels_augmented = ... # your code goes here


(3) Repeat the above process 6 times, repeating the random train-test split. For each of the 5 classes, print out its averaged testing accuracy. Comment on your classifier's performance in your report. What are the consequences of having no training data and limited training data for the 'sports' class?

(4) Self-learn the concepts of zero-shot learning and few-shot learning. In your report, link these concepts to the experiments you've just performed. Is your model performing zero- or few-shot learning? Explain your reasoning.

## 5. Result Analysis (4 Marks in Total)

### Analysis 1 (2 Marks, Normal)

Choose a training-testing trial in Experiment 2 for $k=1$. Observe the testing error of this 1-NN, and estimate the interval where its true error lies with 90% probability. Explain in your report how you compute it.

In [8]:
# You may write your calculations in LateX or in code here


### Analysis 2 (2 Marks, Normal)

The following function Get_p_value() is provided to obtain $p$ according to ${z}_{p}$. Use this function to perform Analysis 2.

In [34]:
# run this cell first

def Get_p_value(zp):
return round(1 - scipy.stats.norm.sf(abs(zp))*2,2)

In [35]:
# Use this cell to compare the output value of function Get_p_value with
# the table provided in your lecture notes (e.g., Slide 12, Chapter3C.pdf)

print('zp = 0.67, p = ', Get_p_value(0.67))
print('zp = 1, p = ', Get_p_value(1))
print('zp = 1.64, p = ', Get_p_value(1.64))
print('zp = 2.58, p = ', Get_p_value(2.58))
print()

# you can alert the input zp value and re-run this cell to help you to calculate the corresponding p.
print('p = ', Get_p_value(0.43))

# you can change 0.43 to any zp value you obtained.

zp = 0.67, p =  0.5
zp = 1, p =  0.68
zp = 1.64, p =  0.9
zp = 2.58, p =  0.99

p =  0.33


Choose a training-testing trial in Experiment 2 for k=45. Observe the testing error of this 45-NN. Compare it with the 1-NN in Analysis 1. Which one has higher testing sample error? Estimate the probability that it also has higher true error. Explain your answer and how you compute it in the report.

## 6. Hyperparameter Selection (4 Marks, Normal)

Use your k-NN function with cosine distance. Design an appropriate and complete machine learning experiment, which should include the training, hyper-parameter selection and evaluation stages. In this case, your hyperparameter will be $k$. You can choose from the random subsampling, k-fold CV and LOO approaches for hyperparameter selection. In order to get 4 marks, you should implement this from scrach without using readily implemented data-split functions provided in existing libraries. If you decide to use existing implementation on data splitting, model selection and/or evaluation, you can get at most 2 marks.

Explain in the report your strategy for splitting the data, and the design of your chosen hyperparameter selection method. Present your results and chosen value of $k$. Why is it important to split the data into train, test, and validation sets in machine learning experiments?

## Get Expert Help On This Assignment

#### Scan above qrcode with Wechat

UK代写,University Of Manchester代写,COMP24112代写,Machine Learning代写,News Article Classification by k-NN代写, Python代写,UK代编,University Of Manchester代编,COMP24112代编,Machine Learning代编,News Article Classification by k-NN代编, Python代编,UK代考,University Of Manchester代考,COMP24112代考,Machine Learning代考,News Article Classification by k-NN代考, Python代考,UKhelp,University Of Manchesterhelp,COMP24112help,Machine Learninghelp,News Article Classification by k-NNhelp, Pythonhelp,UK作业代写,University Of Manchester作业代写,COMP24112作业代写,Machine Learning作业代写,News Article Classification by k-NN作业代写, Python作业代写,UK编程代写,University Of Manchester编程代写,COMP24112编程代写,Machine Learning编程代写,News Article Classification by k-NN编程代写, Python编程代写,UKprogramming help,University Of Manchesterprogramming help,COMP24112programming help,Machine Learningprogramming help,News Article Classification by k-NNprogramming help, Pythonprogramming help,UKassignment help,University Of Manchesterassignment help,COMP24112assignment help,Machine Learningassignment help,News Article Classification by k-NNassignment help, Pythonassignment help,UKsolution,University Of Manchestersolution,COMP24112solution,Machine Learningsolution,News Article Classification by k-NNsolution, Pythonsolution,