INT401: Fundamentals of Machine Learning Fall Semester
Assignment 1: Feature Generation
Lecturer: Xiaobo Jin Unit: INT Dept. of XJTLU
1.1 Ob jectives
• Learn how to generate the features from the text collections
1.2 Introduction
Text categorization is the task on classifying a set of documents into categories from a set of predefined labels. Texts cannot be directly handled by our model. The indexing procedure is the first step that maps a text dj into a numeric representation during the training and validation. The standard tfidf function is used to represent the text. The unique words from English vocabulary are represented as a dimension of the dataset.
1.3 TFIDF Representation 1.3.1 Preprocessing
-
(20 marks) Read the text files from 5 subdirectories in dataset and split the document text into words (splitting separator is non-alphabet letters).
Hints: use os.listdir() function to index all files; use str.split() to split the text into words;
f = codecs.open(fname,’r’,encoding=’Latin1’)
Notice: The files from 5 subdirectories consist of ONE text dataset. The names of the subdirectories are the class labels of the text files.
-
(20 marks) Remove the stopwords from the text collections, which are frequent words that carry no information. Stopwords list are given in the file stopwords.txt . Convert all words into their lower case form. Delete all non-alphabet characters from the text.
Hint: use set collection to store all stopwords;
-
(20 marks) Perform word stemming to remove the word suffix.
from nltk.stem.porter import *
stemmer = PorterStemmer()
plurals = [’caresses’, ’flies’, ’dies’, ’mules’, ’denied’,
Lab 1: Feature Generation
’died’, ’agreed’, ’owned’, ’humbled’, ’sized’, ’meeting’, ’stating’, ’siezing’, ’itemization’, ’sensational’, ’traditional’, ’reference’, ’colonizer’, ’plotted’]
singles = [stemmer.stem(plural) for plural in plurals]
Hints: use the following code to remove non-alphabet letter from the lower-case w to obtain wd
wd = re.sub(r’[^a-z]’, ’’, w.lower()).strip()
1.3.2 (40 marks) TFIDF Representation
The documents are represented as the vector space model. In the vector space model, each document is represented as a vector of words. A collection of documents are represented by a document-by-word matrix A
A = (aik) (1.1)
where aik is the weight of word k in document i.
TFIDF representation assigns the weight to word i in document k in proportion to the number of occurrences of the word in the document, and inverse proportion to the number of documents in the collection for which the word occurs at least once.
aik = fik ∗ log(N/nk) (1.2)
-
(10 marks) fik: the frequency of word k in document i
-
N: the number of documents in the dataset
-
(10 marks) nk: the total number of documents that word k occurs in the dataset called the document frequency.
Notice that the entry aik is 0 if the word k is not included in the document i.
(20 marks) Taking into account the length of different documents, we normalize the representation of the
The data set can be represent as a matrix AN×D, where D is the number of the unique words in the document
collection.
Finally, the dataset is save into .npz file, where A is a matrix represented with the numpy array.
np.savez(’train-20ng.npz’,X=A)