
CSE 158, CSE 258 Web Mining and Recommender Systems, Fall 2023 : Homework 4 - Text Mining


CSE 158/258, Fall 2023: Homework 4 Instructions

Please submit your solution by Monday Nov 27. Submissions should be made on Gradescope. Please complete the homework individually.

You should submit two files:

answers_hw4.txt should contain a Python dictionary containing your answers to each question. Its format should be like the following:

           { "Q1": 1.5, "Q2": [3,5,17,8], "Q3": "b", (etc.) }

The provided code stub demonstrates how to prepare your answers and includes an answer template for each question.
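As a rough illustration of the answer-file format described above, the sketch below writes a dictionary to a text file as a Python literal and parses it back as a sanity check. The helper names `write_answers` and `read_answers` are hypothetical, the values are placeholders rather than real answers, and the provided stub may prepare the file differently.

```python
import ast
import os
import tempfile


def write_answers(answers, path):
    """Write the answers dictionary as a Python literal on one line."""
    with open(path, "w") as f:
        f.write(str(answers) + "\n")


def read_answers(path):
    """Parse the file back into a dictionary to check it is well formed."""
    with open(path) as f:
        return ast.literal_eval(f.read())


# Placeholder values only; they follow the format example, not real answers.
answers = {"Q1": 1.5, "Q2": [3, 5, 17, 8], "Q3": "b"}

# Written to a temporary directory here; the submission itself would use
# the file name given in the instructions.
path = os.path.join(tempfile.mkdtemp(), "answers_hw4.txt")
write_answers(answers, path)
restored = read_answers(path)
```

Reading the file back with `ast.literal_eval` is one way to confirm that every value is a plain Python literal (numbers, strings, lists, tuples) before submitting.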

homework4.py A Python file containing working code for your solutions. The autograder will not execute your code; this file is required so that we can assign partial grades in the event of incorrect solutions, check for plagiarism, etc. Your solution should clearly document which sections correspond to each question and answer.

Tasks: Text Mining

  1. Using the Steam category data, build training/test sets consisting of 10,000 reviews each. Code to do so is provided in the stub.[1] We’ll start by building features to represent common words. Start by removing punctuation and capitalization, and finding the 1,000 most common words across all reviews (‘text’ field) in the training set. See the ‘text mining’ lectures for code for this process. Report the 10 most common words, along with their frequencies, as a list of (frequency, word) tuples.

  2. Build bag-of-words feature vectors by counting the instances of these 1,000 words in each review. Set the labels (y) to be the ‘genreID’ column for the training instances. You may use these labels directly with sklearn’s LogisticRegression model, which will automatically perform multiclass classification. Report performance (accuracy) on your test set.

  3. What is the inverse document frequency of the words ‘character’, ‘game’, ‘length’, ‘a’, and ‘it’? What are their tf-idf scores in the first (training) review (using log base 10, unigrams only, following the first definition of tf-idf given in the slides)? All frequencies etc. should be calculated using the training data only. (2 marks)

  4. Adapt your unigram model to use the tf-idf scores of words, rather than a bag-of-words representation. That is, rather than containing the word counts for the 1,000 most common unigrams, your features should contain tf-idf scores for the 1,000 most common unigrams. Report the accuracy of this new model.
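The word-counting and bag-of-words steps in Questions 1 and 2 can be sketched on a toy corpus as follows. This is a minimal sketch, not the course's reference code: the three reviews are made up, the helper names (`tokenize`, `top_words`, `bag_of_words`) are hypothetical, and the dictionary size is 3 instead of 1,000 so the result is easy to inspect.

```python
import string
from collections import Counter

PUNCT = set(string.punctuation)


def tokenize(text):
    """Lowercase the text, drop punctuation characters, split on whitespace."""
    cleaned = "".join(c for c in text.lower() if c not in PUNCT)
    return cleaned.split()


def top_words(reviews, k):
    """Return the k most frequent words as (frequency, word) tuples."""
    counts = Counter()
    for r in reviews:
        counts.update(tokenize(r["text"]))
    pairs = sorted(((c, w) for w, c in counts.items()), reverse=True)
    return pairs[:k]


def bag_of_words(review, word_index):
    """Count how often each dictionary word occurs in one review."""
    feat = [0] * len(word_index)
    for w in tokenize(review["text"]):
        if w in word_index:
            feat[word_index[w]] += 1
    return feat


# Made-up stand-ins for the training reviews.
train = [
    {"text": "Great game, great story!"},
    {"text": "The game is short."},
    {"text": "Story is great; the length is fine."},
]

common = top_words(train, k=3)  # [(3, 'is'), (3, 'great'), (2, 'the')]
word_index = {w: i for i, (_, w) in enumerate(common)}
X_train = [bag_of_words(r, word_index) for r in train]
```

Feature lists built this way, together with the genre labels, have the shape expected by the `fit(X, y)` method of sklearn's `LogisticRegression`.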

[1] Although the data is larger, we’ll use only a small fraction for these experiments.
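For the tf-idf quantities in Questions 3 and 4, the sketch below assumes one common definition: tf is the raw count of a word in a review, and idf is log10(N / df), where N is the number of training reviews and df is the number of training reviews containing the word. Whether this matches the "first definition" in the slides is an assumption to check against the lectures. The documents are made up and already tokenized, and the function names are hypothetical.

```python
import math
from collections import Counter


def document_frequency(docs):
    """Count, for each word, how many documents contain it at least once."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    return df


def idf(word, df, n_docs):
    """Assumed definition: log10(N / df). Requires df[word] >= 1."""
    return math.log10(n_docs / df[word])


def tfidf_vector(tokens, vocab, df, n_docs):
    """tf-idf score of each vocabulary word, with tf as the raw count."""
    tf = Counter(tokens)
    return [tf[w] * idf(w, df, n_docs) for w in vocab]


# Made-up, pre-tokenized stand-ins for the training reviews.
docs = [
    ["great", "game", "great", "story"],
    ["the", "game", "is", "short"],
    ["story", "is", "great", "the", "length", "is", "fine"],
]

df = document_frequency(docs)
vocab = ["game", "great", "length"]
first = tfidf_vector(docs[0], vocab, df, len(docs))
```

Because the vocabulary is drawn from the training set, every vocabulary word has df of at least 1, so the division is safe. A word that appears in every document gets idf 0 under this definition.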


  5. Which review in the test set has the highest cosine similarity to the first review in the training set, in terms of their tf-idf representations (considering unigrams only)? Provide the cosine similarity score and the reviewID.

  6. Try to improve upon the performance of the above classifiers from Questions 2 and 4 by using different dictionary sizes, or changing the regularization constant C passed to the logistic regression model. Report the performance of your solution.

    Use the first half (10,000) of the book review corpus for training and the rest for testing (code to read the data is provided in the stub). Process reviews without capitalization or punctuation (and without using stemming or removing stopwords).
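The cosine-similarity search asked for in Question 5 can be sketched as below. The vectors are made-up stand-ins for tf-idf feature vectors, the review IDs are placeholders, and `most_similar` is a hypothetical helper name.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(query, candidates):
    """Return (similarity, review_id) for the candidate closest to the query."""
    return max((cosine(query, vec), rid) for rid, vec in candidates)


# Made-up stand-ins: the query plays the role of the first training review,
# the candidates play the role of the test reviews.
query = [1.0, 2.0, 0.0]
test_vectors = [
    ("r1", [2.0, 4.0, 0.0]),
    ("r2", [0.0, 0.0, 3.0]),
    ("r3", [1.0, 0.0, 0.0]),
]

best_sim, best_id = most_similar(query, test_vectors)
```

Here `r1` points in the same direction as the query, so it is the closest candidate with a similarity of 1 up to floating-point rounding. Returning 0.0 for an all-zero vector is a design choice of this sketch; it avoids a division by zero for reviews that contain none of the dictionary words.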

These tasks should be completed using the entire dataset of 20,000 reviews from Goodreads:

7. Using the word2vec library in gensim, fit an item2vec model, treating each ‘sentence’ as a temporally-ordered[2] list of items per user. Use parameters min_count=1, size=5, window=3, sg=1.[3] Report the 5 most similar items to the book from the first review along with their similarity scores (your answer can be the output of the similar_by_word function).

[2] You may use dateutil.parser.parse to parse the date string.
[3] The size argument might be vector_size in some library versions.
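One way to prepare the item2vec "sentences" for Question 7 is sketched below. The field names `user_id`, `book_id`, and `date_added`, the ISO-format dates, and both helper names are assumptions made for illustration; the real data may use other keys and a date format that needs dateutil.parser.parse, which can be passed as `parse_date`. The gensim import is kept inside `fit_item2vec` so the sentence-building part runs without gensim installed.

```python
from collections import defaultdict
from datetime import datetime


def build_sentences(reviews, parse_date=datetime.fromisoformat):
    """Group item IDs by user, ordered by review date (oldest first)."""
    per_user = defaultdict(list)
    for r in reviews:
        per_user[r["user_id"]].append((parse_date(r["date_added"]), r["book_id"]))
    sentences = []
    for user in per_user:
        ordered = sorted(per_user[user], key=lambda pair: pair[0])
        sentences.append([book for _, book in ordered])
    return sentences


def fit_item2vec(sentences):
    """Fit a skip-gram Word2Vec model with the parameters from the task."""
    from gensim.models import Word2Vec  # lazy import; requires gensim

    params = dict(min_count=1, window=3, sg=1)
    try:
        return Word2Vec(sentences, vector_size=5, **params)  # newer gensim
    except TypeError:
        return Word2Vec(sentences, size=5, **params)  # older gensim


# Made-up records with hypothetical field names.
reviews = [
    {"user_id": "u1", "book_id": "b2", "date_added": "2017-03-01"},
    {"user_id": "u1", "book_id": "b1", "date_added": "2016-12-25"},
    {"user_id": "u2", "book_id": "b3", "date_added": "2017-01-10"},
    {"user_id": "u2", "book_id": "b1", "date_added": "2017-02-14"},
]

sentences = build_sentences(reviews)  # [['b1', 'b2'], ['b3', 'b1']]
```

With gensim available, `model = fit_item2vec(sentences)` followed by `model.wv.similar_by_word(reviews[0]["book_id"], topn=5)` would return (item, similarity) pairs. Item IDs should be strings so that they behave like words in the vocabulary.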
