2023 COMP 4446 / 5046 Assignment 1
Assignment 1 is an individual assessment. Please note the University's Academic dishonesty and plagiarism policy.
Submit via Canvas:
- Your notebook
- Run all cells before saving the notebook, so we can see your output
In this assignment, we will explore ways to predict the length of a Wikipedia article based on the first 100 tokens in the article. Such a model could be used to explore whether there are systematic biases in the types of articles that get more detail.
If you are working in another language, please make sure to clearly indicate which part of your code is running which section of the assignment and produce output that provides all necessary information. Submit your code, example outputs, and instructions for executing it.
Note: This assignment contains topics that are not covered at the time of release. Each section has information about which lectures and/or labs covered the relevant material. We are releasing it now so you can (1) start working on some parts early, and (2) know what will be in the assignment when you attend the relevant labs and lectures.
TODO: Copy and Name this File
Make a copy of this notebook in your own Google Drive (File -> Save a Copy in Drive) and change the filename, replacing YOUR-UNIKEY. For example, for a person with unikey mcol1997, the filename should be: COMP-4446-5046_Assignment1_mcol1997.ipynb
Readme
If there is something you want to tell the marker about your submission, please mention it here.
[write here - optional]
Data Download [DO NOT MODIFY THIS]
We have already constructed a dataset for you using a recent dump of data from Wikipedia. Both the training and test datasets are provided in the form of csv files (training_data.csv, test_data.csv) and can be downloaded from Google Drive using the code below. Each row of the data contains:
- The length of the article
- The title of the article
- The first 100 tokens of the article
In case you are curious, we constructed this dataset as follows:
- Downloaded a recent dump of English Wikipedia.
- Ran WikiExtractor to get the contents of the pages.
- Filtered out very short pages.
- Ran SpaCy with the en_core_web_lg model to tokenise the pages (note: SpaCy's development is led by an alumnus of USyd!).
- Counted the tokens and saved the relevant data in the format described above.
This code will download the data. DO NOT MODIFY IT.
## DO NOT MODIFY THIS CODE
# Code to download files into Colaboratory
# Install the PyDrive library
!pip install -U -q PyDrive
# Import libraries for accessing Google Drive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
# Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
# Function to read the file, save it on the machine this colab is running on, and then read it in
import csv
def read_file(file_id, filename):
    downloaded = drive.CreateFile({'id': file_id})
    downloaded.GetContentFile(filename)
    with open(filename) as src:
        reader = csv.reader(src)
        data = [r for r in reader]
    return data
...
print("LABEL: {0} / SENTENCE: {1}".format(training_data[0][0], training_data[0][1:]))
print("------------------------------------")
# Preview of the data in the csv file, which has three columns:
# (1) length of article, (2) title of the article, (3) first 100 words in the article
for v in training_data[:10]:
    # Print the length, title, and start of the text for each row
    print(v[0], v[1], v[2][:100] + "...", sep="\n")
for the replacement of the state with stateless societies or other forms of free associations . As a historically left - wing movement , usually placed on the farthest left of the political spectrum , it is usually described alongside communalism and libertarian Marxism as the libertarian wing ( libertarian socialism )']
------------------------------------
6453
Anarchism
Anarchism is a political philosophy and movement that is skeptical of all justifications for authori...
...
Achilles
In Greek mythology , Achilles ( ) or Achilleus ( ) was a hero of the Trojan War , the greatest of al...
13412
Abraham Lincoln
Abraham Lincoln ( ; February 12 , 1809 – April 15 , 1865 ) was an American lawyer , politician , a...
9485
Aristotle
Aristotle (; " Aristotélēs " , ; 384–322 BC ) was a Greek philosopher and polymath during the Clas...
1683
An American in Paris
An American in Paris is a jazz - influenced orchestral piece by American composer George Gershwin fi...
149
Academy Award for Best Production Design
The Academy Award for Best Production Design recognizes achievement for art direction in film . The ...
7178
Academy Awards
The Academy Awards , better known as the Oscars , are awards for artistic and technical merit for th...
1 - Predicting article length from initial content
This section relates to content from the week 1 lecture and the week 2 lab. In this section, you will implement training and evaluation of a linear model (as seen in the week 2 lab) to predict the length of a Wikipedia article from its first 100 words. You will represent the text using a Bag of Words model (as seen in the week 1 lecture).
1.1 Word Mapping [2pt]
In the code block below, write code to go through the training data and, for any word that occurs at least 10 times:
- Assign it a unique ID (consecutive, starting at 0)
- Place it in a dictionary that maps from the word to the ID
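A minimal sketch of one way to do this, assuming (as in the preview above) that each row of training_data is [length, title, text] and that the text is whitespace-separated after tokenisation:

from collections import Counter

# Count how often each word appears across all training articles
counts = Counter()
for row in training_data:
    counts.update(row[2].split())

# Give every word with at least 10 occurrences a consecutive ID, starting at 0
word_to_id = {}
for word, count in counts.items():
    if count >= 10:
        word_to_id[word] = len(word_to_id)

print("Vocabulary size:", len(word_to_id))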
1.2 Data to Bag-of-Words Tensors [2pt]
In the code block below, write code to prepare the data in PyTorch tensors. The text should be converted to a bag of words (i.e., a vector the length of the vocabulary from the mapping in the previous step, holding counts of the words in the text).
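One possible sketch, reusing the word_to_id mapping from 1.1 (the helper name is hypothetical):

import torch

def to_bag_of_words(rows, word_to_id):
    # Turn rows of [length, title, text] into a count matrix and a target vector
    features = torch.zeros(len(rows), len(word_to_id))
    targets = torch.zeros(len(rows), 1)
    for i, row in enumerate(rows):
        targets[i, 0] = float(row[0])
        for word in row[2].split():
            if word in word_to_id:
                features[i, word_to_id[word]] += 1
    return features, targets

train_x, train_y = to_bag_of_words(training_data, word_to_id)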
1.3 Model Creation [2pt]
Construct a linear model with an SGD optimiser (we recommend a learning rate of 1e-4) and mean squared error as the loss.
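For example, using the recommended learning rate:

import torch

# A single linear layer maps the bag-of-words vector to one predicted length
model = torch.nn.Linear(len(word_to_id), 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
loss_fn = torch.nn.MSELoss()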
1.4 Training [2pt]
Write a loop to train your model for 100 epochs, printing performance on the dev set every 10 epochs (you can hold out part of the training data as a dev set).
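A sketch of one possible loop, assuming a simple 90/10 train/dev split of the tensors from 1.2 (the split is one choice among many, and this trains on the full batch each epoch):

# Hold out the last 10% of the training rows as a dev set
split = int(0.9 * len(train_x))
train_x, dev_x = train_x[:split], train_x[split:]
train_y, dev_y = train_y[:split], train_y[split:]

for epoch in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(train_x), train_y)
    loss.backward()
    optimizer.step()
    if (epoch + 1) % 10 == 0:
        with torch.no_grad():
            dev_loss = loss_fn(model(dev_x), dev_y)
        print(f"Epoch {epoch + 1}: train MSE {loss.item():.1f}, dev MSE {dev_loss.item():.1f}")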
1.5 Measure Accuracy [2pt] In the code block below, write code to evaluate your model on the test set.
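A sketch of the evaluation, assuming test_data was loaded by the download code above in the same row format, and reusing the hypothetical to_bag_of_words helper and training vocabulary:

# Build test tensors with the *training* vocabulary, then measure test MSE
test_x, test_y = to_bag_of_words(test_data, word_to_id)
with torch.no_grad():
    test_mse = loss_fn(model(test_x), test_y)
print("Test MSE:", test_mse.item())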
1.6 Analyse the Model [2pt] In the code block below, write code to identify the 50 words with the highest weights and the 50 words with the lowest weights.
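One way to do this, relying on the fact that a linear model has exactly one weight per vocabulary word:

# Invert the vocabulary mapping so weight indices map back to words
id_to_word = {i: w for w, i in word_to_id.items()}

# model.weight has shape (1, vocabulary size); sort word IDs by weight
weights = model.weight.detach().squeeze()
order = torch.argsort(weights)

print("50 lowest-weighted words: ", [id_to_word[i.item()] for i in order[:50]])
print("50 highest-weighted words:", [id_to_word[i.item()] for i in order[-50:]])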
2 - Compare Data Storage Methods
This section relates to content from the week 1 lecture and the week 2 lab.
Implement a variant of the model with a sparse tensor for your input bag of words (see https://pytorch.org/docs/stable/sparse.html for how to convert a tensor to sparse form). Use the default sparse format (COO).
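A minimal sketch of the conversion, assuming the dense tensors from 1.2. Note that nn.Linear may not accept sparse inputs directly, so this applies the same weights with a sparse-dense matrix multiply:

# COO format stores only the indices and values of nonzero entries,
# which suits bag-of-words vectors that are mostly zeros
sparse_train_x = train_x.to_sparse()

# Apply the linear model via torch.sparse.mm instead of calling model() directly
predictions = torch.sparse.mm(sparse_train_x, model.weight.t()) + model.bias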
2.1 Training and Test Speed [2pt]
Compare the time it takes to train and test the new model with the time it takes to train and test the old model. You can time the execution of a line of code using %time. See this guide for more on timing.
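For instance, a hypothetical comparison of one dense and one sparse forward pass (substitute your own training and evaluation lines):

%time dense_out = model(train_x)
%time sparse_out = torch.sparse.mm(sparse_train_x, model.weight.t()) + model.bias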
3 - Switch to Word Embeddings
This section relates to content from the week 2 lecture and the week 3 lab. In this section, you will implement a model based on word2vec.
- Use word2vec to learn embeddings for the words in your data.
- Represent each input document as the average of the word vectors for the words it contains.
- Train a linear regression model on these averaged vectors (one possible pipeline is sketched below).
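The sketch below uses gensim's Word2Vec; gensim is an assumption, and any word2vec implementation covered in the labs would work. The averaged vectors then replace the bag-of-words tensors as input to the same kind of linear model:

import numpy as np
import torch
from gensim.models import Word2Vec

# Train word2vec on the tokenised articles; each document is a list of tokens
tokenised = [row[2].split() for row in training_data]
w2v = Word2Vec(tokenised, vector_size=100, min_count=10)

def average_vector(tokens):
    # Average the embeddings of the tokens in the word2vec vocabulary
    vectors = [w2v.wv[t] for t in tokens if t in w2v.wv]
    if not vectors:
        return np.zeros(w2v.vector_size, dtype=np.float32)
    return np.mean(vectors, axis=0)

train_x_w2v = torch.tensor(np.stack([average_vector(t) for t in tokenised]))
model_w2v = torch.nn.Linear(w2v.vector_size, 1)  # regression on 100-dim inputs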
3.1 Accuracy [1pt]
Calculate the accuracy of your model.
3.2 Speed [1pt]
Calculate how long it takes to evaluate your model.
4 - Open-Ended Improvement
This section relates to content from the week 1, 2, and 3 lectures and the week 1, 2, and 3 labs. This section is an open-ended opportunity to find ways to make your model more accurate and/or faster (e.g., using WordNet to generalise words, trying different word features, or trying other optimisers). We encourage you to try several ideas to provide scope for comparisons. If none of your ideas work, you can still get full marks for this section; you just need to justify the ideas and discuss why they may not have improved performance.
4.1 Ideas and Motivation [1pt]
In this box, describe your ideas and why you think they will improve accuracy and/or speed.
[Your answer goes here]
4.2 Implementation [2pt]
Implement your ideas
4.3 Evaluation [1pt]
Evaluate the speed and accuracy of the model with your ideas
In this text box, briefly describe the results. Did your improvement work? Why / why not?
[Your answer goes here]