
INT401: Fundamentals of Machine Learning, Fall Semester

Assignment 1: Feature Generation

Lecturer: Xiaobo Jin
Unit: INT Dept. of XJTLU

1.1 Objectives

Learn how to generate features from text collections.

1.2 Introduction

Text categorization is the task of classifying a set of documents into categories from a set of predefined labels. Raw text cannot be handled directly by our model, so the first step during training and validation is an indexing procedure that maps a text $d_j$ into a numeric representation. The standard TFIDF function is used to represent the texts: each unique word of the English vocabulary corresponds to one dimension of the dataset.

1.3 TFIDF Representation

1.3.1 Preprocessing

  • (20 marks) Read the text files from the 5 subdirectories in dataset and split each document's text into words (the splitting separator is any non-alphabetic character); a sketch of the full preprocessing pipeline follows this list.

    Hints: use the os.listdir() function to index all files; use str.split() to split the text into words;

    f = codecs.open(fname, 'r', encoding='Latin1')

    Notice: the files from the 5 subdirectories together constitute ONE text dataset. The names of the subdirectories are the class labels of the text files.

  • (20 marks) Remove the stopwords from the text collections; these are frequent words that carry no information. The stopword list is given in the file stopwords.txt. Convert all words to their lower-case form and delete all non-alphabetic characters from the text.

    Hint: use a set collection to store all the stopwords;

  • (20 marks) Perform word stemming to remove word suffixes.

    from nltk.stem.porter import *
    stemmer = PorterStemmer()
    plurals = ['caresses', 'flies', 'dies', 'mules', 'denied',
               'died', 'agreed', 'owned', 'humbled', 'sized',
               'meeting', 'stating', 'siezing', 'itemization',
               'sensational', 'traditional', 'reference',
               'colonizer', 'plotted']
    singles = [stemmer.stem(plural) for plural in plurals]
    Hints: use the following code to remove non-alphabetic characters from the lower-cased word w to obtain wd:

    wd = re.sub(r'[^a-z]', '', w.lower()).strip()
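The sketch below combines the three preprocessing steps above (reading and splitting the files, removing stopwords, and stemming). It is a minimal illustration, not the required solution: the dataset root name 'dataset', the helper names load_stopwords and preprocess, and the assumption that stopwords.txt lists one word per line are illustrative choices.

import os
import re
import codecs
from nltk.stem.porter import PorterStemmer

def load_stopwords(path='stopwords.txt'):
    # Store the stopwords in a set for fast membership tests (see the hint above).
    with codecs.open(path, 'r', encoding='Latin1') as f:
        return set(line.strip().lower() for line in f if line.strip())

def preprocess(root='dataset'):
    # Return (documents, labels): each document is a list of stemmed words,
    # and each label is the name of the subdirectory the file came from.
    stopwords = load_stopwords()
    stemmer = PorterStemmer()
    documents, labels = [], []
    for label in sorted(os.listdir(root)):               # subdirectory name = class label
        subdir = os.path.join(root, label)
        if not os.path.isdir(subdir):
            continue
        for fname in sorted(os.listdir(subdir)):
            with codecs.open(os.path.join(subdir, fname), 'r', encoding='Latin1') as f:
                text = f.read()
            words = []
            for w in text.split():                       # split on whitespace with str.split()
                wd = re.sub(r'[^a-z]', '', w.lower()).strip()   # keep alphabetic characters only
                if wd and wd not in stopwords:
                    words.append(stemmer.stem(wd))       # remove the word suffix
            documents.append(words)
            labels.append(label)
    return documents, labels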

1.3.2 (40 marks) TFIDF Representation

The documents are represented with the vector space model, in which each document is a vector of word weights. A collection of documents is represented by a document-by-word matrix $A$:

$A = (a_{ik})$    (1.1)

where $a_{ik}$ is the weight of word $k$ in document $i$.

The TFIDF representation assigns a weight to word $k$ in document $i$ in proportion to the number of occurrences of the word in the document, and in inverse proportion to the number of documents in the collection in which the word occurs at least once:

$a_{ik} = f_{ik} \log(N / n_k)$    (1.2)

  • (10 marks) $f_{ik}$: the frequency of word $k$ in document $i$

  • $N$: the number of documents in the dataset

  • (10 marks) $n_k$: the number of documents in the dataset in which word $k$ occurs at least once, called the document frequency.

    Notice that the entry $a_{ik}$ is 0 if word $k$ is not included in document $i$.

  • (20 marks) Taking into account the different lengths of the documents, normalize the representation of each document (see the sketch at the end of this section).
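As a concrete illustration with made-up numbers (using the natural logarithm): if the collection has $N = 100$ documents, word $k$ appears $f_{ik} = 3$ times in document $i$, and it occurs in $n_k = 10$ documents overall, then $a_{ik} = 3 \log(100/10) \approx 6.91$.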

The dataset can be represented as a matrix $A_{N \times D}$, where $D$ is the number of unique words in the document collection. Finally, the dataset is saved into an .npz file, where A is a matrix represented with a NumPy array:

np.savez('train-20ng.npz', X=A)
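Below is a minimal sketch of building the TFIDF matrix, normalizing it, and saving it in this format. It assumes documents is the list of stemmed word lists produced by the preprocessing sketch above; the helper name build_tfidf is illustrative, and the normalization step is interpreted here as scaling each document vector to unit Euclidean length.

import numpy as np

def build_tfidf(documents):
    # Vocabulary: the D unique words over the whole collection, one per column.
    vocab = sorted(set(w for doc in documents for w in doc))
    index = {w: k for k, w in enumerate(vocab)}
    N, D = len(documents), len(vocab)

    # f_ik: frequency of word k in document i.
    A = np.zeros((N, D))
    for i, doc in enumerate(documents):
        for w in doc:
            A[i, index[w]] += 1

    # n_k: document frequency of word k (number of documents containing it).
    n_k = np.count_nonzero(A, axis=0)

    # a_ik = f_ik * log(N / n_k); entries stay 0 for words absent from a document.
    A = A * np.log(N / n_k)

    # Normalize each document vector to unit Euclidean length to account for
    # the different document lengths (one common interpretation of this step).
    norms = np.linalg.norm(A, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    A = A / norms
    return A, vocab

# Example usage together with the preprocessing sketch above:
# documents, labels = preprocess('dataset')
# A, vocab = build_tfidf(documents)
# np.savez('train-20ng.npz', X=A)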

