Assignment 1
COMP 7607: Natural Language Processing
Zoie comes from Beijing and she is learning English. She sometimes finds it difficult to decide which preposition to use. For example, should she say “they live in a small town” or “they live at a small town”? One day she visited HKU and came across a student who happens to be taking COMP7607 this semester.
“You must be an expert in language!”, said Zoie.
“We learned about language models recently!”, said the student with excitement.
“That’s nice! Could you please write a program for me that tells me which preposition I should use in a sentence?”, asked Zoie with her eyes shining like stars.
“Sure! Anything!”, said the student.
And here comes Assignment 1.
Grade: Your grade is based on code quality, output results/logs, and discussion. We do not grade your submission by the perplexity you obtain in Section 2, but better prediction accuracy in Section 3 increases your score. There are some optional problems in this assignment, and solving them increases your grade. Please don’t waste your time manually searching for hyperparameters. You are more than welcome to introduce a hyperparameter optimization method to find more reasonable values.
Submit: You should submit two files to Moodle: your UniversityNumber.ipynb notebook and the final prediction file UniversityNumber.test.out from Section 3. All code and discussion should be included in your submitted .ipynb file. Make sure your code does not depend on your local files so that the results are reproducible. Before submitting, please run your notebook and keep all running logs so that we can check the results.
1 N-gram Language Model
A language model is a probability function p that assigns probabilities to word sequences such as w⃗ = (i, love, hong, kong). Let’s envision p(w⃗ ) as the probability that, if you were to tune in to the radio at any random moment, the next four words would be “i love hong kong”. This probability can be
evaluated even in the context of a longer sentence, such as “a student of the COMP7607 course says, i love hong kong more than ever.” Often, we analyze p(w⃗ ) to compare it with alternative sequences and determine our preference.
Formally, in the context of language modeling, each element W belonging to the event space E represents a potential value of the infinite sequence of words that would be emitted from the radio once it is turned on. The notation p(w⃗ ) serves as a shorthand for p(prefix(W, |w⃗ |) = w⃗ ), where |w⃗ | represents the length of the sequence w⃗ . Consequently, p(i, love, hong, kong) encapsulates the cumulative probability of all infinite word sequences W that commence with the phrase “i love hong kong...”.
Suppose w⃗ = w_1 w_2 \ldots w_n (a sequence of n words). A trigram language model defines

p(w⃗) \overset{\text{def}}{=} p(w_1) \cdot p(w_2 \mid w_1) \cdot p(w_3 \mid w_1 w_2) \cdot p(w_4 \mid w_2 w_3) \cdots p(w_n \mid w_{n-2} w_{n-1})
on the assumption that the sequence was generated in the order w1, w2, w3, . . . (“from left to right”) with each word chosen in a way dependent on the previous two words. (But the first word w1 is not dependent on anything, since we turned on the radio at an arbitrary moment.)
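For instance, applied to the four-word example above, this factorization gives

p(i, love, hong, kong) = p(i) \cdot p(love \mid i) \cdot p(hong \mid i, love) \cdot p(kong \mid love, hong).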
Q1: Expand the above definition of p(w⃗ ) using naive estimates of the parameters, such as

p(w_4 \mid w_2, w_3) \overset{\text{def}}{=} \frac{C(w_2 w_3 w_4)}{C(w_2 w_3)}

where C(w_2 w_3 w_4) denotes the count of times the trigram w_2 w_3 w_4 was observed in a training corpus.
Note: In practice, we refer to simple parameter estimates like these as “maximum-likelihood estimates” (MLE). The advantage of MLE is that it maximizes the probability, or equivalently minimizes the perplexity, of the training data. However, such estimates tend to perform poorly on test data.
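As an illustration only, here is a minimal sketch of how such MLE estimates could be computed from counts; the toy corpus and function names are assumptions made for the example, not part of the assignment data.

from collections import Counter

# Toy corpus; in the assignment, sentences would come from train.txt instead (assumption).
corpus = [["i", "love", "hong", "kong"], ["i", "love", "nlp"]]

trigram_counts = Counter()
bigram_counts = Counter()
for sent in corpus:
    for i in range(len(sent) - 2):
        trigram_counts[tuple(sent[i:i + 3])] += 1
    for i in range(len(sent) - 1):
        bigram_counts[tuple(sent[i:i + 2])] += 1

def mle_trigram_prob(w2, w3, w4):
    """p(w4 | w2, w3) = C(w2 w3 w4) / C(w2 w3); returns 0.0 if the context bigram is unseen."""
    denom = bigram_counts[(w2, w3)]
    return trigram_counts[(w2, w3, w4)] / denom if denom else 0.0

print(mle_trigram_prob("i", "love", "hong"))  # 0.5 on the toy corpus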
Q2: One could also define a kind of reversed trigram language model p_reversed that instead assumes the words were generated in reverse order (“from right to left”):

p_{\text{reversed}}(w⃗) \overset{\text{def}}{=} p(w_n) \cdot p(w_{n-1} \mid w_n) \cdot p(w_{n-2} \mid w_{n-1} w_n) \cdot p(w_{n-3} \mid w_{n-2} w_{n-1}) \cdots p(w_2 \mid w_3 w_4) \cdot p(w_1 \mid w_2 w_3)
By manipulating the notation, show that the two models are identical (i.e., p(w⃗) = p_reversed(w⃗)) for any w⃗, provided that both models use MLE parameters estimated from the same training data (see Q1 above).
2 N-gram Language Model Implementation
In this section, you will build your own language model. We provide a dataset with three files containing many whitespace-tokenized sentences:
• train.txt: data for training your language model.
• dev.txt: data for developing and optimizing your language model.
• test.txt: data for only evaluating your language model.
You will first build a vocabulary with the training data, and then build your own unigram, bigram, and trigram language models with some smoothing methods.
2.1 Building Vocabulary
You will download and preprocess the tokenized training data to build the vocabulary. To handle out-of-vocabulary (OOV) words, you will convert tokens that occur fewer than three times in the training data into a special unknown token <UNK>. You should also add start-of-sentence <s> and end-of-sentence </s> tokens.
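A minimal sketch of this preprocessing step, assuming train.txt is already in the working directory; the helper names and the exact interface are illustrative, not required.

from collections import Counter

UNK, BOS, EOS = "<UNK>", "<s>", "</s>"

def build_vocab(path="train.txt", min_count=3):
    """Keep tokens seen at least `min_count` times; everything else will map to <UNK>."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            counts.update(line.split())
    vocab = {w for w, c in counts.items() if c >= min_count}
    vocab.update({UNK, BOS, EOS})
    return vocab

def preprocess(path, vocab):
    """Yield sentences with boundary tokens added and OOV tokens replaced by <UNK>."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            tokens = [w if w in vocab else UNK for w in line.split()]
            yield [BOS] + tokens + [EOS]

vocab = build_vocab()
print("Vocabulary size:", len(vocab))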
Please show the vocabulary size and discuss the number of parameters of n-gram models.
2.2 N-gram Language Modeling
After preparing your vocabulary, you are expected to build unigram and bigram language models and report their perplexity on the training set and the dev set. Please discuss your experimental results. If you encounter any problems, please analyze them and explain why they occur.
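One possible way to compute bigram perplexity is sketched below, assuming `sentences` is the output of the preprocessing sketch above; it is an illustration only. Note that with MLE estimates an unseen bigram yields zero probability and hence infinite perplexity, which is exactly the kind of problem worth analyzing.

import math
from collections import Counter

def train_bigram(sentences):
    """Collect unigram and bigram counts from preprocessed sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        unigrams.update(sent)
        bigrams.update(zip(sent, sent[1:]))
    return unigrams, bigrams

def bigram_perplexity(sentences, unigrams, bigrams):
    """Perplexity of an MLE bigram model; each bigram transition counts as one prediction."""
    log_prob, n_tokens = 0.0, 0
    for sent in sentences:
        for prev, cur in zip(sent, sent[1:]):
            p = bigrams[(prev, cur)] / unigrams[prev] if unigrams[prev] else 0.0
            if p == 0.0:
                return float("inf")  # an unseen bigram makes the MLE perplexity infinite
            log_prob += math.log(p)
            n_tokens += 1
    return math.exp(-log_prob / n_tokens)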
2.3 Smoothing
In this section, you will implement two smoothing methods to improve your language models.
2.3.1 Add-one (Laplace) smoothing
Please improve your bigram language model with add-one smoothing. Report its perplexity on the training set and dev set. Briefly discuss the differences in results between this and the bigram model you built in Section 2.2.
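A possible add-one version of the bigram probability, reusing the count tables from the sketch above (V denotes the vocabulary size; the function name is illustrative):

def add_one_bigram_prob(prev, cur, unigrams, bigrams, V):
    # Add-one (Laplace) smoothing: every bigram count is incremented by 1,
    # so the denominator grows by V to keep the distribution normalized.
    return (bigrams[(prev, cur)] + 1) / (unigrams[prev] + V)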
2.3.2 Add-k smoothing
One alternative to add-one smoothing is to move a bit less of the probability mass from the seen to the unseen events. Instead of adding 1 to each count, we add a fractional count k (.5? .05? .01?). This algorithm is therefore called add-k smoothing, as shown here in the bigram case:
P^{*}_{\text{Add-}k}(w_i \mid w_{i-1}) = \frac{C(w_{i-1} w_i) + k}{C(w_{i-1}) + kV} \qquad (3)
Please optimize the perplexity on the dev set by trying different values of k (no more than three). Report the perplexity on the training set and dev set. Briefly discuss the differences in results between this and the bigram model with add-one smoothing.
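One way to compare a few values of k on the dev set, sketched under the same assumptions as the earlier snippets (the commented-out loop assumes `dev_sentences`, `unigrams`, `bigrams`, and `vocab` from the previous sketches):

import math

def add_k_bigram_prob(prev, cur, unigrams, bigrams, V, k):
    # Add-k smoothing, Eq. (3): a fractional pseudo-count k instead of 1.
    return (bigrams[(prev, cur)] + k) / (unigrams[prev] + k * V)

def add_k_perplexity(sentences, unigrams, bigrams, V, k):
    log_prob, n_tokens = 0.0, 0
    for sent in sentences:
        for prev, cur in zip(sent, sent[1:]):
            log_prob += math.log(add_k_bigram_prob(prev, cur, unigrams, bigrams, V, k))
            n_tokens += 1
    return math.exp(-log_prob / n_tokens)

# for k in (0.5, 0.05, 0.01):
#     print(k, add_k_perplexity(dev_sentences, unigrams, bigrams, len(vocab), k))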
2.3.3 Linear Interpolation
Please implement linear interpolation smoothing between unigram, bigram, and trigram models. Report its perplexity on the training set and dev set. Please optimize on the dev set by trying different hyperparameter sets. Finally, report the perplexity on the test set with the best hyperparameter set you get. Briefly discuss the results.
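A sketch of an interpolated trigram probability; the lambda weights must sum to one, and the `trigrams` and `total_tokens` variables are assumed to come from counting analogous to the earlier sketches:

def interpolated_prob(w1, w2, w3, unigrams, bigrams, trigrams, total_tokens,
                      lambdas=(0.1, 0.3, 0.6)):
    """p(w3 | w1 w2) as a weighted mix of unigram, bigram, and trigram MLE estimates."""
    l1, l2, l3 = lambdas  # must sum to 1
    p_uni = unigrams[w3] / total_tokens
    p_bi = bigrams[(w2, w3)] / unigrams[w2] if unigrams[w2] else 0.0
    p_tri = trigrams[(w1, w2, w3)] / bigrams[(w1, w2)] if bigrams[(w1, w2)] else 0.0
    return l1 * p_uni + l2 * p_bi + l3 * p_tri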
Optimization. So far, we have manually chosen the hyperparameters that minimize the perplexity on the dev set and evaluated them on the test set. There are various ways to find this optimal set of hyperparameters. Do you know any other learning algorithms? Please give an example.
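For example, one could randomly sample interpolation weights and keep the set with the lowest dev perplexity; a rough sketch, where `dev_perplexity` is assumed to wrap your own perplexity computation:

import random

def random_search(dev_perplexity, n_trials=20, seed=0):
    """Sample lambda triples on the probability simplex and keep the best one."""
    rng = random.Random(seed)
    best, best_ppl = None, float("inf")
    for _ in range(n_trials):
        raw = [rng.random() for _ in range(3)]
        lambdas = tuple(x / sum(raw) for x in raw)  # normalize so the weights sum to 1
        ppl = dev_perplexity(lambdas)
        if ppl < best_ppl:
            best, best_ppl = lambdas, ppl
    return best, best_ppl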
3 Preposition Prediction
In this section, you will use your language model to decide which preposition to use in a sentence. We provide the following files:
• dev.in: The whitespace-tokenized sentences for developing your language model, where 5 prepositions (at, in, of, for, on) are replaced with a special token <PREP>.
• dev.out: The correct answers for the dev set.
• test.in: Input sentences for testing in the same format as dev.in, which will be released a week before the deadline.
Your task is to use your language model to predict, for each special token <PREP> in the dev (and, once released, the test) set, which preposition is the proper one there. You should optimize the accuracy of your model on the dev set and then predict the prepositions of the sentences in the test set. Please strictly follow the format of dev.out for your output file test.out.
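One straightforward approach is to substitute each candidate preposition for <PREP>, score the resulting sentence with your language model, and keep the highest-scoring candidate; a rough sketch, where `sentence_log_prob` is assumed to be your own scoring function:

PREPOSITIONS = ["at", "in", "of", "for", "on"]

def predict_prepositions(tokens, sentence_log_prob):
    """Fill every <PREP> slot in `tokens` with the candidate the LM scores highest."""
    filled = list(tokens)
    for i, tok in enumerate(filled):
        if tok == "<PREP>":
            scores = {}
            for prep in PREPOSITIONS:
                candidate = filled[:i] + [prep] + filled[i + 1:]
                scores[prep] = sentence_log_prob(candidate)
            filled[i] = max(scores, key=scores.get)
    return filled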
Please explain your method, report the accuracy of your model on the dev set, and discuss the results. Based on your hyperparameters and results, explain the relationship between a model’s test perplexity and its performance in downstream applications.
Go for it, Zoie. The journey of language modeling is about to start!