COMP4650/6490 Document Analysis – Semester 2 2022 - Assignment 2: Movie Review Analysis

COMP4650/6490 Document Analysis – Semester 2 / 2022

Assignment 2

Overview

Throughout this assignment you will make changes to the provided code to improve or complete existing models. In some cases, you will write your own code from scratch after reviewing an example.

Submission

  • The answers to this assignment (including your code files) have to be submitted online in Wattle.
  • You will produce an answers file with your responses to each question. Your answers file must be a PDF file named u1234567.pdf where u1234567 should be replaced with your Uni ID.
  • You should submit a ZIP file containing all of the code files and your answers PDF file, BUT NO DATA.

Question 1: Movie Review Sentiment Classification (4 marks)

For this question you have been provided with a movie review dataset. The dataset consists of 50,000 review articles written for movies on IMDb, each labelled with the sentiment of the review – either positive or negative. Your task is to apply logistic regression with dense word vectors to the movie review dataset to predict the sentiment label from the review text.

A simple approach to building a sentiment classifier is to train a logistic regression model that uses aggregated pre-trained word embeddings. While this approach, with simple aggregation, normally works best with short sequences, you will try it out on the movie reviews.

You have been provided with a Python file dense_linear_classifier.py which reads in the dataset, splits it into training, testing, and validation sets, and then loads the pre-trained word embeddings. These embeddings were extracted from the spacy-2.3.5 Python package’s en_core_web_md model and, to save disk space, were filtered to only include words that occur in the movie reviews.

Your task is to use a logistic regression classifier with aggregated word embedding features to determine the sentiment labels of documents from their text. First implement the document_to_vector function which converts a document into a vector by first tokenising it (the TreebankWordTokenizer in the nltk package would be an excellent choice) and then aggregating the word embeddings of those words that exist in the dense word embedding dictionary. You will have to work out how to handle words that are missing from the dictionary. For aggregation, the mean is recommended but you could also try other functions such as max. Next, implement the fit_model and test_model functions using your document_to_vector function and LogisticRegression from the scikit-learn package. Using fit_model, test_model, and your training and validation sets you should then try several values for the regularisation parameter C and select the best based on accuracy. To try regularisation parameters, you should use an automatic hyperparameter search method. Next, re-train your classifier using the training set concatenated with the validation set and your best C value. Evaluate the performance of your model on the test set.
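The aggregation step can be illustrated with a minimal sketch. It is not the reference solution: the function signature, the `dim` and `tokenize` parameters, and the toy embeddings are assumptions made for this example, and the starter code may define document_to_vector differently. Here, words missing from the dictionary are simply skipped, and a document with no known words maps to a zero vector, which is only one of several reasonable choices.

```python
import numpy as np

def document_to_vector(doc, embeddings, dim, tokenize=str.split):
    """Mean of the embeddings of in-vocabulary tokens (illustrative sketch)."""
    # Keep only tokens that have an embedding; missing words are skipped.
    vectors = [embeddings[tok] for tok in tokenize(doc) if tok in embeddings]
    if not vectors:
        # No known tokens at all: fall back to a zero vector of the right size.
        return np.zeros(dim)
    return np.mean(vectors, axis=0)

# Toy 2-dimensional embeddings, invented for illustration only.
toy_embeddings = {
    "good": np.array([1.0, 0.0]),
    "film": np.array([0.0, 1.0]),
}
vec = document_to_vector("a good film", toy_embeddings, dim=2)    # "a" is skipped
empty = document_to_vector("zzz qqq", toy_embeddings, dim=2)      # no known words
```

The default whitespace tokeniser is a stand-in to keep the sketch dependency-free; the tokenize method of an nltk TreebankWordTokenizer instance could be passed as `tokenize` instead.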

Answer the following questions in your answer PDF:

  1. What range of values for C did you try? Explain why this range is reasonable. Also explain what search technique you used and why it is appropriate here.
  2. What was the best performing C value?
  3. What was your final accuracy?
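As a hedged illustration of an automatic search over C, the sketch below runs a plain grid search over log-spaced candidates. In scikit-learn's LogisticRegression, C is the inverse of the regularisation strength, so candidates are often spaced by powers of ten. The helper `search_c`, the grid bounds, and the toy scoring function are all assumptions for this example; in the assignment the score would come from fit_model and test_model on the training and validation sets.

```python
import math

def search_c(candidates, score_fn):
    """Return (best_C, best_score) over an explicit grid of candidate values."""
    best_c, best_score = None, float("-inf")
    for c in candidates:
        score = score_fn(c)
        if score > best_score:
            best_c, best_score = c, score
    return best_c, best_score

# Log-spaced grid: 1e-3, 1e-2, ..., 1e3 (bounds chosen arbitrarily here).
c_grid = [10.0 ** e for e in range(-3, 4)]

def toy_validation_accuracy(c):
    # Stand-in for "fit on the training set with this C, score on validation".
    # The shape is invented so that it peaks at C = 1.0; these are not real results.
    return 0.8 - 0.01 * abs(math.log10(c))

best_c, best_acc = search_c(c_grid, toy_validation_accuracy)
```

A coarse grid like this can be followed by a finer grid around the best value if the validation accuracy is still changing noticeably between neighbouring candidates.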

Question 2: Genre Classification (Kaggle competition: 4 marks, Write-up: 3 marks)

For this task you will design and implement a classification algorithm that identifies the genre of a piece of text. This task will be run as a competition on Kaggle. Your marks for this question will be partially based on your results in this competition, but your mark will not be affected by other students’ scores; instead, you will be graded against several benchmark solutions. The other part of your mark will come from your code and write-up.

The dataset consists of text sequences from English language books in the genres:

  • horror (class id 0),
  • science fiction (class id 1),
  • humour (class id 2), and
  • crime fiction (class id 3).

Footnote 3 (nltk Treebank tokeniser, referred to in Question 1): https://www.nltk.org/_modules/nltk/tokenize/treebank.html

Each text sequence is 10 contiguous sentences. Your task is to build the best classifier when evaluated with macro-averaged F1 score.
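To make the evaluation metric concrete, here is a small sketch of macro-averaged F1: the F1 score is computed separately for each class and the per-class scores are averaged with equal weight, so rare genres count as much as common ones. The helper name `macro_f1` and the toy labels are invented for this example; scikit-learn's f1_score with average="macro" computes the same kind of score.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        # Convention used here: a class with no true or predicted examples scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy labels using the four class ids from the task (values invented).
y_true = [0, 0, 1, 2, 3, 3]
y_pred = [0, 1, 1, 2, 3, 0]
score = macro_f1(y_true, y_pred, labels=[0, 1, 2, 3])
# Per-class F1: 0.5, 2/3, 1.0, 2/3
```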

Note: the training data and the test data come from different sets of books. You have been provided with docids (examples with the same docid come from the same book) for the training data but not the test data.
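Because the test books are disjoint from the training books, one way to get a validation score that mimics the test setting is to split by docid rather than by example. The sketch below is an assumption about how this might be done, not part of the provided code; `split_by_docid` and the toy docids are hypothetical. scikit-learn's GroupKFold offers a cross-validated version of the same idea.

```python
import random

def split_by_docid(docids, val_fraction=0.2, seed=0):
    """Return (train_idx, val_idx) so that no docid appears on both sides."""
    unique_ids = sorted(set(docids))
    rng = random.Random(seed)
    rng.shuffle(unique_ids)
    # Hold out a fraction of the books (at least one), not a fraction of examples.
    n_val = max(1, int(round(len(unique_ids) * val_fraction)))
    val_ids = set(unique_ids[:n_val])
    train_idx = [i for i, d in enumerate(docids) if d not in val_ids]
    val_idx = [i for i, d in enumerate(docids) if d in val_ids]
    return train_idx, val_idx

# Toy docids: examples 0 and 1 come from book 10, example 2 from book 11, and so on.
docids = [10, 10, 11, 12, 12, 12, 13, 14]
train_idx, val_idx = split_by_docid(docids, val_fraction=0.4)
```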

You have been provided with an example solution in genre_classifier_0pc.py which shows you how to read in the training data (genre_train.json), test data (genre_test.json) and output a CSV file that the judging system can read. This solution is provided only as an example (it is the 0% benchmark for this problem); you will want to build your own solution from scratch.

Rules

These rules are designed to ensure a degree of fairness between students with different access to computational resources and to ensure that the task is not trivial. Breaching the contest rules will likely result in 0 marks for the competition part of this assignment.

  • Do not use additional supervised training data. That is, you are not allowed to collect a new genre classification dataset to use for training. Pre-training on other tasks, such as language modelling, is permitted.
  • Pre-trained non-contextual word vectors (such as word2vec, GloVe, fasttext) may be used even if they require an additional download (e.g. you may use word vectors from spacy, gensim, or fasttext).
  • You may use the pre-trained transformers distilbert-base-cased or distilbert-base-uncased from the transformers library, but other pre-trained transformers are not permitted.
  • You can use the following libraries (in addition to Python standard libraries): numpy, scipy, pandas, torch, transformers, tensorflow, nltk, sklearn, xgboost, gensim, spacy, imblearn, torchnlp. If you would like to use other libraries, please ask on the Piazza forum well in advance of the assignment deadline.
  • This is an individual task; do not collude with other individuals. Copying code from other people’s models or models available on the internet is not permitted.

Question 3: RNN Name Generator (4 marks)

Your task is to develop an autoregressive RNN model which can generate people’s names. The RNN will generate each character of a person’s name given all previous characters. Your model should look like the following when training:

Note that the input is shown here as a sequence of characters but in practice the input will be a sequence of character ids. There is also a softmax non-linearity after the linear layer but this is not shown in the diagram. The output (after the softmax) is a categorical probability distribution over the vocabulary; what is shown as the output here is the ground truth label. Notice that the input to the model is just the expected output shifted to the right one step with the <bos> (beginning of sequence) token prepended. The three dots to the right of the diagram indicate that the RNN is to be rolled out to some maximum length. When generating sequences, rather than training, the model should look like the following:
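The relationship between the training input and the ground truth output can be sketched in a few lines. This is an illustration on characters rather than ids, and the helper name `make_training_pair` is hypothetical; in practice the same shift would be applied to sequences of character ids.

```python
BOS = "<bos>"

def make_training_pair(name):
    """Return (inputs, targets) as token lists for one name ending in '.'."""
    targets = list(name)
    # The input is the target shifted right by one step, with <bos> prepended.
    inputs = [BOS] + targets[:-1]
    return inputs, targets

inputs, targets = make_training_pair("Bec.")
# inputs  -> ['<bos>', 'B', 'e', 'c']
# targets -> ['B', 'e', 'c', '.']
```

At each position the model sees the input token (and, through the recurrent state, everything before it) and is trained to predict the target token at the same position.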

Specifically, we choose a character from the probability distribution output by the network and feed it as input to the next step. Choosing a character can be done by sampling from the probability distribution or by choosing the most likely character (otherwise known as argmax decoding).
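The generation loop described above might look like the following sketch. Everything here is illustrative: `generate`, its parameters, the five-token toy vocabulary, and `toy_next_probs` (a stand-in for the trained RNN) are assumptions, and whether the end-of-sequence token is kept in the returned string is a design choice left to the starter code's description. A real RNN would also carry its hidden state from step to step, whereas the stand-in only looks at the last id.

```python
import numpy as np

def generate(next_probs, vocab, bos_id, eos_id, max_length=20, sample=True, seed=0):
    """Generate one string by repeatedly choosing the next character id."""
    rng = np.random.default_rng(seed)
    ids = [bos_id]
    chars = []
    for _ in range(max_length):
        probs = next_probs(ids)  # distribution over the vocabulary
        if sample:
            next_id = int(rng.choice(len(vocab), p=probs))  # sampling
        else:
            next_id = int(np.argmax(probs))                 # argmax decoding
        if next_id == eos_id:
            break
        chars.append(vocab[next_id])
        ids.append(next_id)  # the chosen character becomes the next input
    return "".join(chars)

toy_vocab = ["", "<bos>", ".", "a", "b"]

def toy_next_probs(ids):
    # Hypothetical stand-in for the network, keyed on the previous id only.
    table = {
        1: [0.0, 0.0, 0.1, 0.7, 0.2],  # after <bos>
        3: [0.0, 0.0, 0.2, 0.1, 0.7],  # after 'a'
        4: [0.0, 0.0, 0.8, 0.1, 0.1],  # after 'b'
    }
    return np.array(table[ids[-1]])

greedy = generate(toy_next_probs, toy_vocab, bos_id=1, eos_id=2, sample=False)
sampled = generate(toy_next_probs, toy_vocab, bos_id=1, eos_id=2, sample=True, seed=0)
```

Argmax decoding always produces the same string for a given model, while sampling produces varied names, which is usually the more interesting behaviour for a name generator.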

The character vocabulary consists of the following:

  • “” The null token padding string.
  • <bos> The beginning of sequence token.
  • . The end of sequence token.
  • a-z All lowercase characters.
  • A-Z All uppercase characters.
  • 0-9 All digits.
  • “ ” The space character.

Starter code is provided in rnn_name_generator.py, and the list of names to use as training and validation sets are provided in names_small.json.

To complete this question you will need to complete three functions and one class method: the function seqs_to_ids, the forward method of class RNNLM, the function train_model, and the function gen_string. In each case you should read the description provided in the starter code.

seqs_to_ids: Takes as input a list of names. Returns a 2D numpy matrix containing the names represented using token ids. All output rows (each row corresponds to a name) should have the same length of max_length, achieved by either truncating the name or padding it with zeros. For example, an input of [“Bec.”, “Hannah.”, “Siqi.”] with max_length set to 6 (normally we will use max_length = 20, but this example uses 6) should return:

[[30 7 5 2 0 0]
[36 3 16 16 3 10]
[47 11 19 11 2 0]]
where the first row represents “Bec.” followed by two padding characters, the second row represents “Hannah” (the name is truncated to six characters, so its end-of-sequence token is dropped), and the third row represents “Siqi.” followed by one padding character.
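A minimal sketch that reproduces the example above is given below. It assumes the vocabulary is ordered as listed (padding string, <bos>, ".", a-z, A-Z, 0-9, space), which is consistent with the ids in the example; the names `vocab` and `char_to_id` are hypothetical, and the starter code may expose the vocabulary differently.

```python
import string
import numpy as np

# Assumed ordering: padding, <bos>, end of sequence, a-z, A-Z, 0-9, space.
vocab = (["", "<bos>", "."]
         + list(string.ascii_lowercase + string.ascii_uppercase + string.digits)
         + [" "])
char_to_id = {ch: i for i, ch in enumerate(vocab)}

def seqs_to_ids(seqs, max_length=20):
    """Convert names to a (len(seqs), max_length) matrix of token ids."""
    ids = np.zeros((len(seqs), max_length), dtype=np.int64)  # 0 is the padding id
    for row, seq in enumerate(seqs):
        for col, ch in enumerate(seq[:max_length]):  # truncate long names
            ids[row, col] = char_to_id[ch]
    return ids

result = seqs_to_ids(["Bec.", "Hannah.", "Siqi."], max_length=6)
```

Starting from a matrix of zeros means padding needs no separate step: any position not overwritten by a character keeps the padding id.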

forward: A method of class RNNLM. In this method you need to implement the GRU model shown in the diagram above. The layers have all been provided for you in the class initialiser.

train_model: In this function you need to train the model by mini-batch stochastic gradient descent. The optimiser and loss function are provided to you. Note that the loss function takes logits (output of the linear layer before softmax is applied) as input. At the end of every epoch you should print the validation loss using the provided calc_val_loss function.
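Two ideas in this description can be illustrated without the starter code: iterating over shuffled mini-batches, and a loss that takes logits directly. The NumPy sketch below is only an illustration under stated assumptions; `iterate_minibatches` and `cross_entropy_from_logits` are hypothetical helpers, and in the assignment the provided loss function and optimiser would be used instead.

```python
import numpy as np

def iterate_minibatches(data, batch_size, rng):
    """Yield shuffled mini-batches (rows of `data`) covering one epoch."""
    order = rng.permutation(len(data))
    for start in range(0, len(data), batch_size):
        yield data[order[start:start + batch_size]]

def cross_entropy_from_logits(logits, targets):
    """Mean cross-entropy; the softmax is applied inside, so inputs are raw logits."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

rng = np.random.default_rng(0)
data = np.arange(10).reshape(5, 2)  # five toy "examples"
batches = list(iterate_minibatches(data, batch_size=2, rng=rng))

# With all-zero logits over 4 classes the prediction is uniform, so the loss is log(4).
uniform_loss = cross_entropy_from_logits(np.zeros((3, 4)), np.array([0, 1, 2]))
```

In PyTorch, each mini-batch step would typically clear old gradients with the optimiser's zero_grad(), compute the loss from the model's logits, call backward() on the loss, and then call the optimiser's step().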

gen_string: In this function you will need to generate a new name, one character at a time. You will need to implement both sampling and argmax decoding.
