ANLP Assignment (Autumn 2020)
For assessment, you are expected to complete and submit this notebook file. When answers require code, you may import and use library functions (unless explicitly told otherwise). All of your own code should be included in the notebook rather than imported from elsewhere. Written answers should also be included in the notebook. You should insert as many extra cells as you want and change the type between code and markdown as appropriate.
In order to avoid misconduct, you should not talk about the assignment questions with your peers. If you are not sure what a question is asking you to do or have any other questions, please ask me or one of the Teaching Assistants.
Marking guidelines are provided as a separate document.
The first few cells contain code to set up the assignment and bring in some data. In order to provide unique datasets for analysis by different students, you must enter your candidate number in the following cell. Otherwise, do not change the code in these cells.
In [ ]:
candidateno=11111119 #this MUST be updated to your candidate number so that you get a unique data sample
In [ ]:
#set up drives for resources. Change the path as necessary
from google.colab import drive
#mount google drive
drive.mount('/content/drive/')
import sys
sys.path.append('/content/drive/My Drive/NLE Notebooks/resources/')
In [ ]:
#do not change the code in this cell
#preliminary imports
import re
import random
import math
import pandas as pd
import matplotlib.pyplot as plt
from itertools import zip_longest
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('wordnet_ic')
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic as wn_ic
brown_ic = wn_ic.ic("ic-brown.dat")
from sussex_nltk.corpus_readers import AmazonReviewCorpusReader
from sussex_nltk.corpus_readers import ReutersCorpusReader
Question 1: Books vs DVDs
In this question, you will be investigating NLP methods for distinguishing reviews written about books from reviews written about DVDs.
In [ ]:
#do not change the code in this cell
def split_data(data, ratio=0.7): # when the second argument is not given, it defaults to 0.7
    """
    Given corpus generator and ratio:
    - partitions the corpus into training data and test data, where the proportion in train is ratio.

    :param data: A corpus generator.
    :param ratio: The proportion of training documents (default 0.7)
    :return: a pair (tuple) of lists where the first element of the
        pair is a list of the training data and the second is a list of the test data.
    """
    data = list(data)
    n = len(data)
    train_indices = random.sample(range(n), int(n * ratio))
    test_indices = list(set(range(n)) - set(train_indices))
    train = [data[i] for i in train_indices]
    test = [data[i] for i in test_indices]
    return (train, test)

def feature_extract(review):
    """
    Generate a feature representation for a review

    :param review: AmazonReview object
    :return: dictionary of Boolean features
    """
    return {word: True for word in review.words()}

def get_training_test_data(categories=('book', 'dvd'), ratio=0.7, seed=candidateno):
    """
    Get training and test data for a given pair of categories and ratio, pre-formatted for use with NB classifier

    :param categories: pair of categories of review corpus, two from ["kitchen", "dvd", "book", "electronics"]
    :param ratio: proportion of data to use as training data
    :return: pair of lists
    """
    random.seed(seed)
    train_data = []
    test_data = []
    for category in categories:
        reader = AmazonReviewCorpusReader().category(category)
        train, test = split_data(reader.documents(), ratio=ratio)
        train_data += [(feature_extract(review), category) for review in train]
        test_data += [(feature_extract(review), category) for review in test]
    random.shuffle(train_data)
    random.shuffle(test_data)
    return train_data, test_data
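For illustration only, here is a minimal sketch of the Boolean bag-of-words dictionary that feature_extract produces; the toy token list below simply stands in for the output of review.words() and is not part of the assignment scaffolding.

# illustrative sketch: a plain token list stands in for review.words()
toy_tokens = ["this", "book", "was", "a", "great", "read", "great", "plot"]
# the same construction as feature_extract: every observed word maps to True,
# so the repeated word "great" collapses into a single Boolean feature
toy_features = {word: True for word in toy_tokens}
print(toy_features)
# {'this': True, 'book': True, 'was': True, 'a': True, 'great': True, 'read': True, 'plot': True}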
When you have run the cell below, your unique training and testing samples will be stored in training_data and testing_data.
In [ ]:
#do not change the code in this cell
training_data,testing_data=get_training_test_data()
print("The amount of training data is {}".format(len(training_data)))
print("The amount of testing data is {}".format(len(testing_data)))
print("The representation of a single data item is below")
print(training_data[0])
a) Use your training data to find i) the top 20 words which occur more frequently in book reviews than in dvd reviews and ii) the top 20 words which occur more frequently in dvd reviews than in book reviews. Discuss what pre-processing techniques you have applied (or not applied) in answering this question, and why. [10 marks]
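One possible starting point, offered purely as a hedged illustration rather than a model answer, is to accumulate per-category word counts from training_data with NLTK's ConditionalFreqDist and rank words by the difference in counts; the choices below (raw count differences, no pre-processing) are assumptions left open by the question.

# illustrative sketch only: per-category word counts from the Boolean feature dictionaries
from nltk.probability import ConditionalFreqDist

cfd = ConditionalFreqDist()
for features, category in training_data:
    for word in features:          # each word counted once per review
        cfd[category][word] += 1

# words whose count is higher in book reviews than in dvd reviews, and vice versa
book_minus_dvd = {w: cfd['book'][w] - cfd['dvd'][w] for w in cfd['book']}
dvd_minus_book = {w: cfd['dvd'][w] - cfd['book'][w] for w in cfd['dvd']}
print(sorted(book_minus_dvd, key=book_minus_dvd.get, reverse=True)[:20])
print(sorted(dvd_minus_book, key=dvd_minus_book.get, reverse=True)[:20])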
b) Design, build and test a word list classifier to classify reviews as being from the book domain or from the dvd domain. Make sure you discuss i) how you decide the lengths and contents of the word lists and ii) accuracy, precision and recall of your final classifier. [15 marks]
c) Compare the performance of your word list classifier with a Naive Bayes classifier (e.g., from NLTK). Make sure you discuss the results. [10 marks]
d) Design and carry out an experiment into the impact of the amount of training data on each of these classifiers. Make sure you describe design decisions in your experiment, include a graph of your results and discuss your conclusions. [15 marks]
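As a hedged illustration of the kind of experiment part d) asks for, the loop below trains an NLTK Naive Bayes classifier on increasing amounts of training data and plots test accuracy; the training-set sizes, the absence of repeated random resampling, and the choice to show only one classifier are all simplifying assumptions (a word list classifier could be slotted into the same loop).

# illustrative sketch only: learning curve for a Naive Bayes classifier
from nltk import NaiveBayesClassifier
from nltk.classify import accuracy

sizes = [100, 200, 400, 800, 1600, len(training_data)]
scores = []
for size in sizes:
    clf = NaiveBayesClassifier.train(training_data[:size])
    scores.append(accuracy(clf, testing_data))

plt.plot(sizes, scores, marker='o')
plt.xlabel("Number of training reviews")
plt.ylabel("Accuracy on test data")
plt.show()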
Question 2: Distributional Semantics
In this question, you will be investigating the distributional hypothesis: words which appear in similar contexts tend to have similar meanings. We are going to be using the Reuters corpus of financial documents for this part of the assignment. When you run the following cell you should see that it contains 1,113,359 sentences.
In [ ]:
#do not change the code in this cell
rcr = ReutersCorpusReader().finance()
rcr.enumerate_sents()
The following cell will take 2-5 minutes to run. It will generate a unique-to-you sample of 200,000 sentences. These sentences are tokenised and normalised for case and number for you.
In [ ]:
#do not change the code in this cell
def normalise(tokenlist):
    tokenlist = [token.lower() for token in tokenlist]
    tokenlist = ["NUM" if token.isdigit() else token for token in tokenlist]
    tokenlist = ["Nth" if (token.endswith(("nd", "st", "th")) and token[:-2].isdigit()) else token for token in tokenlist]
    tokenlist = ["NUM" if re.search(r"^[+-]?[0-9]+\.[0-9]", token) else token for token in tokenlist]
    return tokenlist

random.seed(candidateno)
samplesize = 2000
iterations = 100
sentences = []
for i in range(0, iterations):
    sentences += [normalise(sent) for sent in rcr.sample_sents(samplesize=samplesize)]
    print("Completed {}%".format(i))
print("Completed 100%")
The function generate_features() defined in the next cell will be used and explored below.
In [ ]:
# do not change the code in this cell
def generate_features(sentences, window=1):
    mydict = {}
    for sentence in sentences:
        for i, token in enumerate(sentence):
            current = mydict.get(token, {})
            features = sentence[max(0, i - window):i] + sentence[i + 1:i + window + 1]
            for feature in features:
                current[feature] = current.get(feature, 0) + 1
            mydict[token] = current
    return mydict
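As a toy illustration (separate from the sentences sample that part a) asks about), generate_features maps each token to a dictionary counting the tokens observed within the given window either side of it:

# illustrative sketch only: co-occurrence counts within a +/- 1 token window
toy = [["the", "bank", "fell"], ["the", "bank", "rose"]]
print(generate_features(toy, window=1))
# expected output:
# {'the': {'bank': 2}, 'bank': {'the': 2, 'fell': 1, 'rose': 1}, 'fell': {'bank': 1}, 'rose': {'bank': 1}}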
a) Run generate_features(sentences[:5]). With reference to the code and the specific examples, explain how the output was generated. [5 marks]
In [ ]:
generate_features(sentences[:5])
b) Write code to find the 1000 most frequently occurring words that
- are in your sample; AND
- have at least one noun sense according to WordNet [5 marks]
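A hedged, purely illustrative sketch of one way to approach part b): rank the words in the sample by frequency with a FreqDist and keep those for which WordNet returns at least one noun synset. Decisions such as whether to exclude stop words or the NUM/Nth placeholders are left open here.

# illustrative sketch only: frequency-ranked words from the sample that have a noun sense
from nltk.probability import FreqDist

fd = FreqDist(token for sentence in sentences for token in sentence)

def has_noun_sense(word):
    # wn.synsets restricted to pos=wn.NOUN is empty if the word has no noun sense
    return len(wn.synsets(word, pos=wn.NOUN)) > 0

top_nouns = [w for w, _ in fd.most_common() if has_noun_sense(w)][:1000]
print(top_nouns[:20])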
In [ ]:
# do not change the code in this cell. It relates to part c)
wordpair=("house","garden")
concept_1=wn.synsets(wordpair[0])[0]
concept_2=wn.synsets(wordpair[1])[0]
print("Path similarity between 1st sense of {} and 1st sense of {} is {}".format(wordpair[0],wordpair[1],wn.path_similarity(concept_1,concept_2)))
print("Resnik similarity between 1st sense of {} and 1st sense of {} is {}".format(wordpair[0],wordpair[1],wn.res_similarity(concept_1,concept_2, brown_ic)))
print("Lin similarity between 1st sense of {} and 1st sense of {} is {}".format(wordpair[0],wordpair[1],wn.lin_similarity(concept_1,concept_2, brown_ic)))
c) i) The code above outputs the path similarity score, the Resnik similarity score and the Lin similarity score for a pair of concepts in WordNet. Explain what each of these numbers means.
ii) For every possible pair of words identified in Q2, determine the semantic similarity of the pair according to WordNet. Make sure you justify your choice of semantic similarity measure and explain and justify the strategy used for words with multiple senses.
iii) Identify the 10 most similar words (according to WordNet) to the most frequent word in the corpus. [15 marks]
d) i) Write code to construct distributional vector representations of words in the corpus with a parameter to specify context size. Explain how you calculate the value of association between each word and each context feature.
ii) Use your code to construct representations of the 1000 words identified in Q2 with a window size of 1 and thus determine the 10 words which are distributionally most similar to the most frequent word in the corpus. [10 marks]
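As a hedged sketch of what part d) i) might look like, the function below builds sparse vectors from the generate_features counts and weights each context feature by positive pointwise mutual information (PPMI). PPMI is only one possible association measure, and the helper name ppmi_vectors is an assumption, not part of the assignment.

# illustrative sketch only: PPMI-weighted distributional vectors
def ppmi_vectors(sentences, window=1):
    counts = generate_features(sentences, window=window)
    total = sum(sum(ctx.values()) for ctx in counts.values())            # all co-occurrence events
    word_totals = {w: sum(ctx.values()) for w, ctx in counts.items()}    # events per target word
    feat_totals = {}                                                     # events per context feature
    for ctx in counts.values():
        for f, c in ctx.items():
            feat_totals[f] = feat_totals.get(f, 0) + c
    vectors = {}
    for w, ctx in counts.items():
        vec = {}
        for f, c in ctx.items():
            pmi = math.log2((c * total) / (word_totals[w] * feat_totals[f]))
            if pmi > 0:                                                  # keep positive PMI only
                vec[f] = pmi
        vectors[w] = vec
    return vectors

Cosine similarity over these sparse dictionaries could then be used to rank the neighbours asked for in part d) ii).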
e) Plan and carry out an investigation into the correlation between semantic similarity according to WordNet and distributional similarity with different context window sizes. You should make sure that you include a graph of how correlation varies with context window size and that you discuss your results. [15 marks]
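Finally, one possible shape for the part e) investigation, again offered only as a hedged sketch: compute a Spearman correlation between WordNet similarity scores and cosine similarity of the PPMI vectors for each window size, then plot the correlations. The names word_pairs and wordnet_scores below are assumptions standing in for whatever part c) produced, and ppmi_vectors refers to the sketch above.

# illustrative sketch only: how correlation might be computed and plotted per window size
from scipy.stats import spearmanr

def cosine(v1, v2):
    shared = set(v1) & set(v2)
    num = sum(v1[f] * v2[f] for f in shared)
    den = math.sqrt(sum(x * x for x in v1.values())) * math.sqrt(sum(x * x for x in v2.values()))
    return num / den if den else 0.0

windows = [1, 2, 3, 5, 7]
correlations = []
for w in windows:
    vectors = ppmi_vectors(sentences, window=w)
    wn_sims, dist_sims = [], []
    for a, b in word_pairs:                       # assumed: word pairs scored in part c)
        wn_sims.append(wordnet_scores[(a, b)])    # assumed: dict of WordNet similarities from part c)
        dist_sims.append(cosine(vectors.get(a, {}), vectors.get(b, {})))
    rho, _ = spearmanr(wn_sims, dist_sims)
    correlations.append(rho)

plt.plot(windows, correlations, marker='o')
plt.xlabel("Context window size")
plt.ylabel("Spearman correlation with WordNet similarity")
plt.show()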
In [12]:
##This code will word count all of the markdown cells in the notebook saved at filepath
##Running it before providing any answers shows that the questions have a word count of 388
import io
from nbformat import current
filepath = "/content/drive/My Drive/NLE Notebooks/assessment/ANLPassignment.ipynb"
question_count = 754
with io.open(filepath, 'r', encoding='utf-8') as f:
    nb = current.read(f, 'json')
    word_count = 0
    for cell in nb.worksheets[0].cells:
        if cell.cell_type == "markdown":
            word_count += len(cell['source'].replace('#', '').lstrip().split(' '))
print("Submission length is {}".format(word_count - question_count))
Submission length is 0