
COMP SCI 7417 Applied Natural Language Processing - Assignment - Classifier and Distributional Semantics


ANLP Assignment (Autumn 2020)

For assessment, you are expected to complete and submit this notebook file. When answers require code, you may import and use library functions (unless explicitly told otherwise). All of your own code should be included in the notebook rather than imported from elsewhere. Written answers should also be included in the notebook. You may insert as many extra cells as you want and change the cell type between code and markdown as appropriate.

In order to avoid misconduct, you should not discuss the assignment questions with your peers. If you are not sure what a question is asking you to do, or have any other questions, please ask me or one of the Teaching Assistants.

Marking guidelines are provided as a separate document.

The first few cells contain code to set up the assignment and bring in some data. In order to provide unique datasets for analysis by different students, you must enter your candidate number in the following cell. Otherwise, do not change the code in these cells.

In [ ]:
candidateno=11111119 #this MUST be updated to your candidate number so that you get a unique data sample
In [ ]:
#set up drives for resources.  Change the path as necessary

from google.colab import drive
#mount google drive
drive.mount('/content/drive/')
import sys
sys.path.append('/content/drive/My Drive/NLE Notebooks/resources/')
In [ ]:
#do not change the code in this cell
#preliminary imports

import re
import random
import math
import pandas as pd
import matplotlib.pyplot as plt
from itertools import zip_longest

import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('wordnet_ic')
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic as wn_ic
brown_ic = wn_ic.ic("ic-brown.dat")

from sussex_nltk.corpus_readers import AmazonReviewCorpusReader
from sussex_nltk.corpus_readers import ReutersCorpusReader

Question 1: Books vs DVDs

In this question, you will be investigating NLP methods for distinguishing reviews written about books from reviews written about DVDs.

In [ ]:
#do not change the code in this cell
def split_data(data, ratio=0.7): # when the second argument is not given, it defaults to 0.7
    """
    Given corpus generator and ratio:
     - partitions the corpus into training data and test data, where the proportion in train is ratio,

    :param data: A corpus generator.
    :param ratio: The proportion of training documents (default 0.7)
    :return: a pair (tuple) of lists where the first element of the 
            pair is a list of the training data and the second is a list of the test data.
    """
    
    data = list(data)  
    n = len(data)  
    train_indices = random.sample(range(n), int(n * ratio))          
    test_indices = list(set(range(n)) - set(train_indices))    
    train = [data[i] for i in train_indices]           
    test = [data[i] for i in test_indices]             
    return (train, test)                       
 

def feature_extract(review):
    """
    Generate a feature representation for a review
    :param review: AmazonReview object
    :return: dictionary of Boolean features
    """
    return {word:True for word in review.words()}

def get_training_test_data(categories=('book','dvd'),ratio=0.7,seed=candidateno):
    """
    Get training and test data for a given pair of categories and ratio, pre-formatted for use with the NB classifier
    :param categories: pair of review-corpus categories, two from ["kitchen", "dvd", "book", "electronics"]
    :param ratio: proportion of data to use as training data
    :param seed: random seed (defaults to candidateno)
    :return: pair of lists
    """
    random.seed(seed)

    train_data=[]
    test_data=[]
    for category in categories:
      reader=AmazonReviewCorpusReader().category(category)    
      train, test = split_data(reader.documents(),ratio=ratio)
   
      train_data+=[(feature_extract(review),category)for review in train]
      test_data+=[(feature_extract(review),category)for review in test]
    random.shuffle(train_data)
    random.shuffle(test_data)

    return train_data,test_data

When you have run the cell below, your unique training and testing samples will be stored in training_data and testing_data.

In [ ]:
#do not change the code in this cell
training_data,testing_data=get_training_test_data()
print("The amount of training data is {}".format(len(training_data)))
print("The amount of testing data is {}".format(len(testing_data)))
print("The representation of a single data item is below")
print(training_data[0])

a) Use your training data to find: i) the top 20 words which occur more frequently in book reviews than in dvd reviews; and ii) the top 20 words which occur more frequently in dvd reviews than in book reviews. Discuss which pre-processing techniques you have applied (or not applied) in answering this question, and why. [10 marks]
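As a starting point, one way to approach a) is to count, for each word, how many training reviews of each category it appears in, and rank words by the difference. The sketch below uses made-up toy data in the same (feature_dict, category) shape as training_data; the function name and toy reviews are this sketch's assumptions. Note that with no pre-processing, stopwords tend to dominate raw counts in practice.

```python
from collections import Counter

# Hypothetical toy data shaped like training_data: (feature_dict, category) pairs.
toy_train = [
    ({"plot": True, "read": True, "author": True}, "book"),
    ({"read": True, "chapter": True}, "book"),
    ({"film": True, "actor": True, "plot": True}, "dvd"),
    ({"film": True, "scene": True}, "dvd"),
]

def top_discriminative_words(data, cat_a, cat_b, n=20):
    """Words occurring in more cat_a documents than cat_b documents,
    ranked by the difference in document frequency."""
    counts = {cat_a: Counter(), cat_b: Counter()}
    for features, category in data:
        counts[category].update(features.keys())
    diffs = {w: counts[cat_a][w] - counts[cat_b][w]
             for w in set(counts[cat_a]) | set(counts[cat_b])}
    ranked = sorted(diffs.items(), key=lambda kv: kv[1], reverse=True)
    return [w for w, d in ranked[:n] if d > 0]

print(top_discriminative_words(toy_train, "book", "dvd", n=3))
```

On the real data, this would be called once per direction ("book" vs "dvd" and the reverse) with n=20.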

b) Design, build and test a word list classifier to classify reviews as being from the book domain or from the dvd domain. Make sure you discuss: i) how you decided the lengths and contents of the word lists; and ii) the accuracy, precision and recall of your final classifier. [15 marks]
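A minimal shape for such a classifier: count how many of a review's words fall in each list, pick the category with the larger overlap, and score predictions against a test set. The word lists and toy test items below are illustrative placeholders, not suggested answers.

```python
def wordlist_classify(features, book_words, dvd_words):
    """Pick the category whose word list overlaps the review more; ties go to 'book'."""
    book_score = sum(1 for w in features if w in book_words)
    dvd_score = sum(1 for w in features if w in dvd_words)
    return "book" if book_score >= dvd_score else "dvd"

def evaluate(test_data, book_words, dvd_words):
    """Accuracy, plus precision/recall treating 'book' as the positive class."""
    tp = fp = fn = tn = 0
    for features, gold in test_data:
        pred = wordlist_classify(features, book_words, dvd_words)
        if pred == "book" and gold == "book": tp += 1
        elif pred == "book": fp += 1
        elif gold == "book": fn += 1
        else: tn += 1
    accuracy = (tp + tn) / len(test_data)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# Hypothetical word lists and test items, in the same shape as testing_data.
toy_test = [({"read": True}, "book"), ({"film": True}, "dvd"), ({"scene": True}, "book")]
print(evaluate(toy_test, book_words={"read", "author"}, dvd_words={"film", "scene"}))
```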

c) Compare the performance of your word list classifier with a Naive Bayes classifier (e.g., from NLTK). Make sure you discuss the results. [10 marks]
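For reference, training and scoring NLTK's Naive Bayes classifier on data in this (feature_dict, label) format takes only a couple of lines. The toy data here is a made-up stand-in for training_data and testing_data.

```python
import nltk

# Hypothetical toy data in the same shape as training_data / testing_data.
toy_train = [
    ({"read": True, "author": True}, "book"),
    ({"read": True, "chapter": True}, "book"),
    ({"film": True, "actor": True}, "dvd"),
    ({"film": True, "scene": True}, "dvd"),
]
toy_test = [({"read": True}, "book"), ({"film": True}, "dvd")]

classifier = nltk.NaiveBayesClassifier.train(toy_train)
print(nltk.classify.accuracy(classifier, toy_test))
classifier.show_most_informative_features(5)
```

show_most_informative_features is also handy for the discussion, since it exposes which features drive the classifier's decisions.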

d) Design and carry out an experiment into the impact of the amount of training data on each of these classifiers. Make sure you describe design decisions in your experiment, include a graph of your results and discuss your conclusions. [15 marks]
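One possible scaffold for d): train on increasing random subsamples of the training data, record test accuracy at each size, and then plot the (size, accuracy) pairs with matplotlib. All names below are hypothetical, and the demo plugs in a trivial majority-class baseline rather than either real classifier.

```python
import random

def learning_curve(train_data, test_data, train_fn, classify_fn,
                   fractions=(0.1, 0.25, 0.5, 1.0), seed=0):
    """For each fraction, train on a random subsample of train_data and
    record (sample size, test accuracy)."""
    rng = random.Random(seed)
    results = []
    for frac in fractions:
        size = max(1, int(len(train_data) * frac))
        sample = rng.sample(train_data, size)
        model = train_fn(sample)
        correct = sum(classify_fn(model, feats) == gold for feats, gold in test_data)
        results.append((size, correct / len(test_data)))
    return results

# Demo with a trivial majority-class baseline on toy data.
toy_train = [({}, "book")] * 6 + [({}, "dvd")] * 2
toy_test = [({}, "book"), ({}, "book"), ({}, "dvd")]
majority_train = lambda data: max(set(l for _, l in data), key=[l for _, l in data].count)
majority_classify = lambda model, feats: model
print(learning_curve(toy_train, toy_test, majority_train, majority_classify))
```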

Question 2: Distributional Semantics

In this question, you will be investigating the distributional hypothesis: words which appear in similar contexts tend to have similar meanings. We are going to be using the Reuters corpus of financial documents for this part of the assignment. When you run the following cell you should see that it contains 1,113,359 sentences.

In [ ]:
#do not change the code in this cell
rcr = ReutersCorpusReader().finance()
rcr.enumerate_sents()

The following cell will take 2-5 minutes to run. It will generate a unique-to-you sample of 200,000 sentences. These sentences are tokenised and normalised for case and number for you.

In [ ]:
#do not change the code in this cell
def normalise(tokenlist):
    tokenlist=[token.lower() for token in tokenlist]
    tokenlist=["NUM" if token.isdigit() else token for token in tokenlist]
    tokenlist=["Nth" if (token.endswith(("nd","st","th")) and token[:-2].isdigit()) else token for token in tokenlist]
    tokenlist=["NUM" if re.search(r"^[+-]?[0-9]+\.[0-9]",token) else token for token in tokenlist]  # raw string avoids an invalid-escape warning
    return tokenlist

random.seed(candidateno)  
samplesize=2000
iterations =100
sentences=[]
for i in range(0,iterations):
    sentences+=[normalise(sent) for sent in rcr.sample_sents(samplesize=samplesize)]
    print("Completed {}%".format(i+1))

The function generate_features(), defined below, will be used and explored in the questions that follow.

In [ ]:
# do not change the code in this cell
def generate_features(sentences,window=1):
    mydict={}
    for sentence in sentences:
        for i,token in enumerate(sentence):
            current=mydict.get(token,{})
            features=sentence[max(0,i-window):i]+sentence[i+1:i+window+1]
            for feature in features:
                current[feature]=current.get(feature,0)+1
            mydict[token]=current
    return mydict

a) Run generate_features(sentences[:5]). With reference to the code and to specific examples from the output, explain how the output was generated. [5 marks]

In [ ]:
generate_features(sentences[:5])
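To illustrate what to look for, here is a toy run on a single made-up sentence (the function is repeated so this cell stands alone): each token is mapped to a dictionary counting the tokens that appear within window positions of it, aggregated over all of its occurrences.

```python
# generate_features repeated from the cell above so this example is self-contained.
def generate_features(sentences, window=1):
    mydict = {}
    for sentence in sentences:
        for i, token in enumerate(sentence):
            current = mydict.get(token, {})
            # context = up to `window` tokens either side of position i
            features = sentence[max(0, i-window):i] + sentence[i+1:i+window+1]
            for feature in features:
                current[feature] = current.get(feature, 0) + 1
            mydict[token] = current
    return mydict

# A tiny made-up sentence; the sampled Reuters sentences are of course larger.
toy = [["the", "cat", "sat", "on", "the", "mat"]]
print(generate_features(toy))
# "the" occurs twice, so its context counts are pooled across both occurrences.
```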

b) Write code to find the 1000 most frequently occurring words that:

  • are in your sample; AND
  • have at least one noun sense according to WordNet. [5 marks]
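A sketch of the counting side of b), with the WordNet check left as an injectable predicate so the helper stays easy to test. In the notebook, one would pass a predicate built on the wn import above, e.g. lambda w: len(wn.synsets(w, pos=wn.NOUN)) > 0 (that exact lambda is this sketch's assumption, not prescribed wording).

```python
from collections import Counter

def most_frequent_filtered(sentences, keep, n=1000):
    """Return the n most frequent tokens for which keep(token) is True."""
    counts = Counter(token for sent in sentences for token in sent)
    result = []
    for word, _ in counts.most_common():
        if keep(word):
            result.append(word)
            if len(result) == n:
                break
    return result

# Toy demo: pretend the predicate is the WordNet noun-sense test.
toy = [["the", "bank", "rose"], ["the", "bank", "fell"]]
print(most_frequent_filtered(toy, keep=lambda w: w != "the", n=2))
```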
In [ ]:
# do not change the code in this cell.  It relates to part c)
wordpair=("house","garden")
concept_1=wn.synsets(wordpair[0])[0]
concept_2=wn.synsets(wordpair[1])[0]
print("Path similarity between 1st sense of {} and 1st sense of {} is {}".format(wordpair[0],wordpair[1],wn.path_similarity(concept_1,concept_2)))
print("Resnik similarity between 1st sense of {} and 1st sense of {} is {}".format(wordpair[0],wordpair[1],wn.res_similarity(concept_1,concept_2, brown_ic)))
print("Lin similarity between 1st sense of {} and 1st sense of {} is {}".format(wordpair[0],wordpair[1],wn.lin_similarity(concept_1,concept_2, brown_ic)))

c) i) The code above outputs the path similarity score, the Resnik similarity score and the Lin similarity score for a pair of concepts in WordNet. Explain what each of these numbers means.

ii) For every possible pair of words identified in Q2, determine the semantic similarity of the pair according to WordNet. Make sure you justify your choice of semantic similarity measure, and explain and justify the strategy used for words with multiple senses.

iii) Identify the 10 most similar words (according to WordNet) to the most frequent word in the corpus. [15 marks]

d) i) Write code to construct distributional vector representations of words in the corpus, with a parameter to specify context size. Explain how you calculate the value of the association between each word and each context feature.

ii) Use your code to construct representations of the 1000 words identified in Q2 with a window size of 1, and thus determine the 10 words which are distributionally most similar to the most frequent word in the corpus. [10 marks]
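One common choice of association measure for d) i) is positive pointwise mutual information (PPMI), with cosine similarity over the resulting sparse vectors. The sketch below assumes the dictionary-of-dictionaries shape that generate_features returns; PPMI is this sketch's assumption, not a measure the question prescribes.

```python
import math

def ppmi_vectors(cooccur):
    """cooccur: {word: {context: count}}. Returns {word: {context: PPMI weight}}."""
    total = sum(c for ctxs in cooccur.values() for c in ctxs.values())
    word_tot = {w: sum(ctxs.values()) for w, ctxs in cooccur.items()}
    ctx_tot = {}
    for ctxs in cooccur.values():
        for c, n in ctxs.items():
            ctx_tot[c] = ctx_tot.get(c, 0) + n
    vectors = {}
    for w, ctxs in cooccur.items():
        vec = {}
        for c, n in ctxs.items():
            # PMI compares observed co-occurrence with chance co-occurrence.
            pmi = math.log2((n * total) / (word_tot[w] * ctx_tot[c]))
            if pmi > 0:  # keep only positive associations
                vec[c] = pmi
        vectors[w] = vec
    return vectors

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    num = sum(u[k] * v[k] for k in u if k in v)
    den = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return num / den if den else 0.0

# Toy co-occurrence table in the generate_features output format.
toy = {"a": {"x": 4}, "b": {"y": 4}, "c": {"x": 2, "y": 2}}
vecs = ppmi_vectors(toy)
print(cosine(vecs["a"], vecs["b"]))
```

Finding the 10 most similar words to a target is then a matter of sorting the other 999 words by cosine against the target's vector.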

e) Plan and carry out an investigation into the correlation between semantic similarity according to WordNet and distributional similarity with different context window sizes. You should make sure that you include a graph of how correlation varies with context window size and that you discuss your results. [15 marks]
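Since the WordNet and distributional similarity scores are on different scales, a rank correlation such as Spearman's is a natural statistic for e). scipy.stats.spearmanr would do the job; a self-contained version with tie-aware average ranks looks like this (the helper names are this sketch's own):

```python
import math

def _ranks(vals):
    """Average ranks (1-based), giving tied values the mean of their positions."""
    order = sorted(range(len(vals)), key=lambda i: vals[i])
    r = [0.0] * len(vals)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and vals[order[j + 1]] == vals[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = _ranks(xs), _ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = math.sqrt(sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry))
    return num / den if den else 0.0

print(spearman([1, 2, 3], [10, 20, 30]))
```

In the experiment, xs would be the WordNet scores and ys the distributional scores for the same word pairs, recomputed for each window size before plotting.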

In [12]:
##This code will word count all of the markdown cells in the notebook saved at filepath
##Running it before providing any answers should report a submission length of 0

import io
import nbformat  # the old `from nbformat import current` API was removed in nbformat 4

filepath="/content/drive/My Drive/NLE Notebooks/assessment/ANLPassignment.ipynb"
question_count=754

with io.open(filepath, 'r', encoding='utf-8') as f:
    nb = nbformat.read(f, as_version=4)

word_count = 0
for cell in nb.cells:
    if cell.cell_type == "markdown":
        word_count += len(cell['source'].replace('#', '').lstrip().split(' '))
print("Submission length is {}".format(word_count-question_count))
Submission length is 0
