COMP5046 Natural Language Processing - Assignment 1: Predict the length of a Wikipedia article


2023 COMP 4446 / 5046 Assignment 1

Assignment 1 is an individual assessment. Please note the University's Academic dishonesty and plagiarism policy.

Submit via Canvas:

Your notebook
Run all cells before saving the notebook, so we can see your output

In this assignment, we will explore ways to predict the length of a Wikipedia article based on the first 100 tokens in the article. Such a model could be used to explore whether there are systematic biases in the types of articles that get more detail.

If you are working in another language, please make sure to clearly indicate which part of your code is running which section of the assignment and produce output that provides all necessary information. Submit your code, example outputs, and instructions for executing it.

Note: This assignment contains topics that are not covered at the time of release. Each section has information about which lectures and/or labs covered the relevant material. We are releasing it now so you can (1) start working on some parts early, and (2) know what will be in the assignment when you attend the relevant labs and lectures.

TODO: Copy and Name this File

Make a copy of this notebook in your own Google Drive (File -> Save a Copy in Drive) and change the filename, replacing YOUR-UNIKEY. For example, for a person with unikey mcol1997, the filename should be: COMP-4446-5046_Assignment1_mcol1997.ipynb

Readme

If there is something you want to tell the marker about your submission, please mention it here.

[write here - optional]

Data Download [DO NOT MODIFY THIS]

We have already constructed a dataset for you using a recent dump of data from Wikipedia. Both the training and test datasets are provided as CSV files (training_data.csv, test_data.csv) and can be downloaded from Google Drive using the code below. Each row of the data contains:

The length of the article
The title of the article
The first 100 tokens of the article

In case you are curious, we constructed this dataset as follows:

Downloaded a recent dump of English Wikipedia.
Ran WikiExtractor to get the contents of the pages.
Filtered out very short pages.
Ran spaCy with the en_core_web_lg model to tokenise the pages (Note: spaCy's development is led by an alumnus of USyd!).
Counted the tokens and saved the relevant data in the format described above.

This code will download the data. DO NOT MODIFY IT.

## DO NOT MODIFY THIS CODE
# Code to download files into Colaboratory
# Install the PyDrive library
!pip install -U -q PyDrive
# Import libraries for accessing Google Drive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
# Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
# Function to read the file, save it on the machine this colab is running on, and then rea
import csv
def read_file(file_id, filename):
    downloaded = drive.CreateFile({'id':file_id})
    downloaded.GetContentFile(filename)
    with open(filename) as src:
        reader = csv.reader(src)
        data = [r for r in reader]
...
print("LABEL: {0} / SENTENCE: {1}".format(training_data[0][0], training_data[0][1:]))
print("------------------------------------")
# Preview of the data in the csv file, which has three columns:
# (1) length of article, (2) title of the article, (3) first 100 words in the article
for v in training_data[:10]:
...
for the replacement of the state with stateless societies or other forms of free associations . As a historically left - wing movement , usually placed on the farthest left of the political spectrum , it is usually described alongside communalism and libertarian Marxism as the libertarian wing ( libertarian socialism )']
-----------------------------------
6453
Anarchism
Anarchism is a political philosophy and movement that is skeptical of all justifications for authori...
...
Achilles
In Greek mythology , Achilles ( ) or Achilleus ( ) was a hero of the Trojan War , the greatest of al...
13412
Abraham Lincoln
Abraham Lincoln ( ; February 12 , 1809 – April 15 , 1865 ) was an American lawyer , politician , a...
9485
Aristotle
Aristotle (; " Aristotélēs " , ; 384–322 BC ) was a Greek philosopher and polymath during the Clas...
1683
An American in Paris
An American in Paris is a jazz - influenced orchestral piece by American composer George Gershwin fi...
149
Academy Award for Best Production Design
The Academy Award for Best Production Design recognizes achievement for art direction in film . The ...
7178
Academy Awards
The Academy Awards , better known as the Oscars , are awards for artistic and technical merit for th...

1 - Predicting article length from initial content

This section relates to content from the week 1 lecture and the week 2 lab. In this section, you will implement training and evaluation of a linear model (as seen in the week 2 lab) to predict the length of a Wikipedia article from its first 100 words. You will represent the text using a Bag of Words model (as seen in the week 1 lecture).

1.1 Word Mapping [2pt]

In the code block below, write code to go through the training data and for any word that occurs at least 10 times:

Assign it a unique ID (consecutive, starting at 0)
Place it in a dictionary that maps from the word to the ID
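
As a rough illustration of the word-mapping step, the sketch below counts tokens and assigns consecutive IDs to those that meet a frequency threshold. The names `build_vocab`, `toy_rows` and `toy_vocab` are hypothetical, and the sketch assumes the text is the last column of each row and is already whitespace-tokenised. It is one possible approach, not the expected solution.

```python
from collections import Counter

# Hypothetical toy rows in the assignment's layout: (length, title, first tokens).
toy_rows = [
    ["12", "A", "the cat sat on the mat"],
    ["7", "B", "the dog sat"],
]

def build_vocab(rows, min_count=10):
    """Map every word seen at least `min_count` times to a consecutive ID."""
    counts = Counter()
    for row in rows:
        # Assumption: the text is the last column and is already tokenised,
        # so splitting on whitespace recovers the tokens.
        counts.update(row[-1].split())
    word_to_id = {}
    for word, count in counts.items():
        if count >= min_count:
            word_to_id[word] = len(word_to_id)  # next unused ID, starting at 0
    return word_to_id

# min_count=2 is used only so the toy data yields a non-empty vocabulary.
toy_vocab = build_vocab(toy_rows, min_count=2)
print(toy_vocab)
```

Using `len(word_to_id)` as the next ID keeps the IDs consecutive without a separate counter.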

1.2 Data to Bag-of-Words Tensors [2pt]

In the code block below, write code to prepare the data in PyTorch tensors. The text should be converted to a bag of words (i.e., a vector the length of the vocabulary in the mapping in the previous step, with counts of the words in the text).
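
The following is a minimal sketch of turning rows into count vectors. `rows_to_tensors`, `toy_vocab` and `toy_rows` are hypothetical names, and the sketch assumes column 0 holds the article length and that out-of-vocabulary words are simply skipped.

```python
import torch

# Hypothetical vocabulary and rows; in the assignment these would come from
# the word mapping and from training_data.csv.
toy_vocab = {"the": 0, "sat": 1}
toy_rows = [
    ["12", "A", "the cat sat on the mat"],
    ["7", "B", "the dog sat"],
]

def rows_to_tensors(rows, word_to_id):
    """Return (features, targets): per-word counts and article lengths."""
    features = torch.zeros(len(rows), len(word_to_id))
    targets = torch.zeros(len(rows), 1)
    for i, row in enumerate(rows):
        targets[i, 0] = float(row[0])  # assumption: column 0 holds the length
        for token in row[-1].split():
            idx = word_to_id.get(token)
            if idx is not None:        # words outside the vocabulary are skipped
                features[i, idx] += 1
    return features, targets

X, y = rows_to_tensors(toy_rows, toy_vocab)
print(X)
print(y)
```

Each row of `X` has one entry per vocabulary word, so its width equals the vocabulary size.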

1.3 Model Creation [2pt]

Construct a linear model with an SGD optimiser (we recommend a learning rate of 1e-4) and mean squared error as the loss.
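
One way these three pieces could be set up in PyTorch is sketched below. The value of `vocab_size` is a placeholder chosen for illustration; in practice it would be the size of your word-to-ID dictionary.

```python
import torch

vocab_size = 5  # hypothetical; use the size of your word-to-ID dictionary

model = torch.nn.Linear(vocab_size, 1)   # one weight per word, plus a bias
loss_fn = torch.nn.MSELoss()             # mean squared error
optimiser = torch.optim.SGD(model.parameters(), lr=1e-4)
```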

1.4 Training [2pt]

Write a loop to train your model for 100 epochs, printing performance on the dev set every 10 epochs.
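
A training loop of this shape might look like the sketch below. It uses random tensors as hypothetical stand-ins for the real data, assumes a dev split has been held out from the training data, and updates on the full batch each epoch. Mini-batching would be a reasonable alternative.

```python
import torch

torch.manual_seed(0)
# Hypothetical stand-ins for bag-of-words tensors (8 train / 4 dev examples).
train_X, train_y = torch.rand(8, 5), torch.rand(8, 1)
dev_X, dev_y = torch.rand(4, 5), torch.rand(4, 1)

model = torch.nn.Linear(5, 1)
loss_fn = torch.nn.MSELoss()
optimiser = torch.optim.SGD(model.parameters(), lr=1e-4)

dev_losses = []
for epoch in range(1, 101):
    model.train()
    optimiser.zero_grad()                    # clear gradients from the last step
    loss = loss_fn(model(train_X), train_y)
    loss.backward()                          # compute gradients
    optimiser.step()                         # update the weights

    if epoch % 10 == 0:
        model.eval()
        with torch.no_grad():                # no gradients needed for evaluation
            dev_loss = loss_fn(model(dev_X), dev_y).item()
        dev_losses.append(dev_loss)
        print(f"epoch {epoch:3d}  train MSE {loss.item():.4f}  dev MSE {dev_loss:.4f}")
```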

1.5 Measure Accuracy [2pt]

In the code block below, write code to evaluate your model on the test set.
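
Because the target is a number rather than a class, one reading of "accuracy" here is a regression error. The sketch below reports mean squared error and mean absolute error on hypothetical test tensors. The choice of metrics is an assumption, not something the assignment specifies.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(5, 1)                         # hypothetical trained model
test_X, test_y = torch.rand(4, 5), torch.rand(4, 1)   # hypothetical test tensors

model.eval()
with torch.no_grad():
    predictions = model(test_X)
    mse = torch.nn.functional.mse_loss(predictions, test_y).item()
    mae = (predictions - test_y).abs().mean().item()

print(f"test MSE {mse:.4f}  test MAE {mae:.4f}")
```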

1.6 Analyse the Model [2pt]

In the code block below, write code to identify the 50 words with the highest weights and the 50 words with the lowest weights.
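
A sketch of ranking words by weight with `torch.topk` follows. The vocabulary and weight values are made up for illustration, and `k` is reduced to 2 so that the toy example fits; the assignment asks for 50.

```python
import torch

# Hypothetical vocabulary and weights standing in for a trained model,
# e.g. weights = model.weight.detach().squeeze(0).
word_to_id = {"film": 0, "the": 1, "stub": 2, "war": 3}
weights = torch.tensor([0.5, 0.0, -1.2, 2.0])

id_to_word = {idx: word for word, idx in word_to_id.items()}
k = min(2, weights.numel())                      # the assignment asks for k = 50

top = torch.topk(weights, k)                     # largest weights first
bottom = torch.topk(weights, k, largest=False)   # smallest weights first

highest = [(id_to_word[i], weights[i].item()) for i in top.indices.tolist()]
lowest = [(id_to_word[i], weights[i].item()) for i in bottom.indices.tolist()]
print("highest:", highest)
print("lowest:", lowest)
```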

2 - Compare Data Storage Methods

This section relates to content from the week 1 lecture and the week 2 lab.

Implement a variant of the model with a sparse vector for your input bag of words (see https://pytorch.org/docs/stable/sparse.html for how to switch a vector to be sparse). Use the default sparse vector type (COO).
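
The sketch below shows a dense tensor converted to COO with `to_sparse()` and then multiplied by a dense weight matrix using `torch.sparse.mm`. The explicit `weight` and `bias` tensors are hypothetical stand-ins for a linear layer's parameters; this is one way to apply a linear map to sparse input, and other approaches may work too.

```python
import torch

# Hypothetical bag-of-words rows (mostly zeros, as real ones would be).
dense_X = torch.tensor([[2.0, 0.0, 0.0, 1.0],
                        [0.0, 0.0, 3.0, 0.0]])
sparse_X = dense_X.to_sparse()                    # COO layout by default

weight = torch.tensor([[0.5], [1.0], [-1.0], [2.0]])  # hypothetical (vocab, 1) weights
bias = torch.tensor([0.1])

# torch.sparse.mm multiplies a sparse matrix by a dense one.
sparse_out = torch.sparse.mm(sparse_X, weight) + bias
dense_out = dense_X @ weight + bias
print(sparse_X.layout, torch.allclose(sparse_out, dense_out))
```

Comparing against the dense result is a quick check that the sparse variant computes the same predictions.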

2.1 Training and Test Speed [2pt]

Compare the time it takes to train and test the new model with the time it takes to train and test the old model. You can time the execution of a line of code using %time. See this guide for more on timing.
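
Outside a notebook, or when a whole function needs timing, `time.perf_counter` from the standard library gives a similar measurement to `%time`. The helper `timed` below is a hypothetical name, and the summed range is only a placeholder workload.

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Hypothetical workload standing in for a training or test run.
total, seconds = timed(sum, range(1_000_000))
print(f"took {seconds:.4f} s")
```

A single run can be noisy, so repeating the measurement a few times and comparing typical values is usually more informative.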

3 - Switch to Word Embeddings

This section relates to content from the week 2 lecture and the week 3 lab. In this section, you will implement a model based on word2vec.

Use word2vec to learn embeddings for the words in your data.
Represent each input document as the average of the word vectors for the words it contains.
Train a linear regression model.
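
The averaging step in the list above can be sketched as follows. The embedding table is hand-written for illustration; in the assignment the vectors would be learned with word2vec. `average_vector` is a hypothetical helper, and returning zeros for a document with no known words is an assumption.

```python
import torch

# Hypothetical 3-dimensional embeddings; real ones would come from word2vec.
embeddings = {
    "the": torch.tensor([1.0, 0.0, 0.0]),
    "cat": torch.tensor([0.0, 2.0, 0.0]),
    "sat": torch.tensor([0.0, 0.0, 3.0]),
}

def average_vector(tokens, embeddings, dim=3):
    """Average the vectors of known tokens; all zeros if none are known."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return torch.zeros(dim)
    return torch.stack(vectors).mean(dim=0)

doc = average_vector("the cat sat on the mat".split(), embeddings)
print(doc)
```

The resulting vector has the embedding dimension rather than the vocabulary size, so the linear model trained on it has far fewer weights than the bag-of-words model.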

3.1 Accuracy [1pt]

Calculate the accuracy of your model.

3.2 Speed [1pt]

Calculate how long it takes to evaluate your model.

4 - Open-Ended Improvement

This section relates to content from the week 1, 2, and 3 lectures and the week 1, 2, and 3 labs. This section is an open-ended opportunity to find ways to make your model more accurate and/or faster (e.g., use WordNet to generalise words, try different word features, other optimisers, etc). We encourage you to try several ideas to provide scope for comparisons. If none of your ideas work, you can still get full marks for this section. You just need to justify the ideas and discuss why they may not have improved performance.

4.1 Ideas and Motivation [1pt]

In this box, describe your ideas and why you think they will improve accuracy and/or speed.

Your answer goes here

4.2 Implementation [2pt]

Implement your ideas.

4.3 Evaluation [1pt]

Evaluate the speed and accuracy of the model with your ideas.

In this text box, briefly describe the results. Did your improvement work? Why / Why not?

Your answer goes here
