1. Homepage
2. Programming
3. CPEN 221 Principles of Software Construction - Mini Project: n-grams, Autocompletion, and Gender Bias

# CPEN 221 Principles of Software Construction - Mini Project: n-grams, Autocompletion, and Gender Bias

CanadaCPEN 221CPEN221Principles of Software Constructionn-gramsAutocompletionsentiment analysisJava

# Corrigenda + Hints

## Overview

Have you wondered how Google makes recommendations for search terms when you start typing some text? Have you also thought about whether there are biases inherent in reviews?

In this mini-project you will learn a bit about both of these ideas, and you will get to explore some methods for sentiment analysis. Of course, you will also get to apply the principles of software construction that you have learnt so far.

When you look through the provided skeleton code, you should implement everything that is marked with a TODO. The only exceptions to this requirement are TODOs in the cpen221.mp1.autocompletion.gui package.

## Data

You will be working with a data set that represents a subset of the reviews from RateMyProfessor.com. The data set contains approximately 14000 reviews, with each review including the following information: the rating (a score between 1 and 5 with 1 being the worst and 5 being the best), the gender of the professor being reviewed (either M[ale] or F[emale]), and the text of the review (a short comment made by the reviewer about the professor). In this data set, the ratings are in increments of 0.5.

The data is stored in a text file named ratemyprofessor_data.txt, inside the data directory of the project. Each line contains a review, with the first line having the heading for the columns. To interact with the file, you can use the methods in the DataWrapper class. The methods available via DataWrapper are:

1. the constructor to create a DataWrapper object;
2. nextLine, which reads the next line of the file;

<aside> <img src="/icons/snippet_blue.svg" alt="/icons/snippet_blue.svg" width="40px" /> All the code for this task should be in the cpen221.mp1.ngrams package.

</aside>

The first step in this mini-project is to count the number of times an n-gram occurs in a document. An n-gram is — for this mini-project — a sequence of n words, although the concept is quite general and can be used in diverse fields such as linguistics and bioinformatics.

Let us consider the following [arbitrary] text: “The blue cow jumps over the blue cow moon.” This text has the following 1-grams (words), 2-grams, and 3-grams with the relevant counts:

1-gram count
the 2
blue 2
cow 2
jumps 1
over 1
moon 1
2-gram count
the blue 2
blue cow 2
cow jumps 1
jumps over 1
over the 1
cow moon 1
3-gram count
the blue cow 2
blue cow jumps 1
cow jumps over 1
jumps over the 1
over the blue 1
blue cow moon 1

We will ignore case in determining the words so “The”, “the” and “tHe”, for instance, would all be the same 1-gram. We will also ignore multiple whitespaces in determining the $n$-grams, so for example, we will treat “jumps over the” as the 3-gram “jumps over the”. We also ignore punctuation, so we will treat “it will not rain today, i hope” as “it will not rain today i hope” when computing the different $n$-grams.

Given a document or text input, as well as a parameter $n$, you will have two different methods:

1. a method that counts the total number of unique 1-grams, 2-grams, …, $n$-grams (up to the specified limit);
2. a method that returns a List of Maps, with a Map for the 1-grams and their counts, a Map for the 2-grams and their counts, and so on. The i-th entry in the List is a Map of the i+1-gram counts. The $n$-grams in each Map should consist of words produced by the getWords snippet below, separated by single whitespaces only.
• How should we define what constitutes a word?

Should punctuation characters be part of a word? Should we manually define word breaks? Whereas there are many possible definitions and interpretations, we shall use the definitions contained in Java’s BreakIterator. We will also work with words by converting all alphabets to lower case. Here is a code snippet, which you can use freely, that illustrates how to use a BreakIterator and obtain an array of words:

private String[] getWords(String text) {
ArrayList<String> words = new ArrayList<>();
BreakIterator wb = BreakIterator.getWordInstance();
wb.setText(text);
int start = wb.first();
for (int end = wb.next();
end != BreakIterator.DONE;
start = end, end = wb.next()) {
String word = text.substring(start, end).toLowerCase();
word = word.replaceAll("^\\\\s*\\\\p{Punct}+\\\\s*", "").replaceAll("\\\\s*\\\\p{Punct}+\\\\s*$", ""); if (!word.equals(" ")) { words.add(word); } } String[] wordsArray = new String[words.size()]; words.toArray(wordsArray); return wordsArray; }  ## Task 2: Keywords and RateMyProfessor <aside> <img src="/icons/snippet_blue.svg" alt="/icons/snippet_blue.svg" width="40px" /> All the code for this task should be in the cpen221.mp1.ratemyprofessor package. More specifically, you will implement the methods in the DataAnalyzer class. </aside> Given a sequence of keywords (up to 3 keywords, i.e., a 3-gram), identify all the reviews that contain that sequence, and then partition the reviews based on rating and gender. We want to group the reviews into six categories: men-low (ML), women-low (WL), men-medium (MM), women-medium (WM), men-high (MH), and women-high (WH). We will define a rating as low if it is in the range [0, 2], a rating as medium if it is in the range (2, 3.5] and high if it is in the range (3.5, 5]. For this task, you should implement a method that returns a Map with the keys ML, WL, MM, WM, MH and WH and the values corresponding to those keys would be the counts of reviews that match the provided search keywords. You can [optionally] generate a histogram chart of the data using the provided Histogram type. ## Task 3: Completing Search Terms <aside> <img src="/icons/snippet_blue.svg" alt="/icons/snippet_blue.svg" width="40px" /> Most of the code for this task should be in the cpen221.mp1.autocompletion package. You will also work with searchterm.SearchTerm and implement at least one method in the class. This task will bring together Tasks 1 and 2. Can you (this part is optional) use the Autocompletor GUI to search for terms in the Rate My Professor dataset and then plot the histogram? </aside> When one is typing a search term, it is helpful to have suggestions that complete the search term. By completing this task, we can also combine the work done in Tasks 1 and 2, but aspects of this task can be stand-alone. If$s$is a String that represents the search term then you should recommend$n$-grams that contain$s$as a prefix. Here is an example: Suppose the search term is “where in the wor” then the following$n$-grams are possible recommendations: • where in the world is carmen sandiego • where in the world is it 5pm • where in the world are there no mosquitoes • where in the world did you get that Notice that “where in the wor” is a prefix to all these$n$-grams. You should return an array of recommendations, with the recommendations ordered by a weight term and in decreasing order of weights. In the case of ties, you should order the tied terms in dictionary order (lexicographic ordering). The core functionality to implement is as follows (although there are a set of supporting methods — see the skeleton code — that you will also need to implement). Your search term recommendation method would return an array of SearchTerms. The recommendation method is invoked on an AutoCompletor instance that is seeded with the text on which we will run the search. One way to instantiate an AutoCompletor would be with a List of Maps, like you produced in Task 1. The values in the Maps are the weights that you would use to order the recommendations. This method will also take as input a parameter that limits the number of SearchTerms to be returned. To complete this task, you will use the searchterm.SearchTerm type and you should also understand how comparators work in Java (although you should not have to implement too much here). ## Task 4: Sentiment Analysis <aside> <img src="/icons/snippet_yellow.svg" alt="/icons/snippet_yellow.svg" width="40px" /> To complete this task, you will implement SentimentAnalyzer that should be part of the cpen221.mp1.sentimentanalysis package. The skeleton code you should use is provided at the end of this description. </aside> Can we use the words in a review — for instance, from the RateMyProfessor data set — to determine if the review is positive or not? Can we use the words in a review to determine the rating? We will use a naïve Bayesian analysis along with a bag-of-words approach to achieve a “reasonable” approach to this problem. What is a bag-of-words? Let us consider the following string: “The car drove down the highway”. The bag-of-words we can obtain from this string is: {”the”: 2, “car: 1, “drove”: 1, “down”: 1, “highway”: 1}. (Aren’t these 1-grams?) Your goal is to implement a method that generates the bag-of-words for all reviews with each rating; in other words, one bag-of-words for rating 1, one bag-of-words for rating 2, and so on. We will then use conditional probabilities to perform sentiment analysis. The conditional probability of an event occurring is the chance that an event occurs given that some other event has occurred. We will use the (standard) notation that$P(A)$is the probability that some event$A$occurs. The conditional probability$P(A|B)$is the probability that event$A$occurs given that event$B$has occurred. Here is an example: the probability of that it is raining in Vancouver may be 0.3 (30% chance), and we can call this event$A$. Therefore,$P(A) = 0.3$. The probability that it is raining given that the sky is cloudy may be 0.7. Let us denote the event that the sky is cloudy as$B$. Therefore,$P(A|B) = 0.7$. Finally, the probability that it is raining when it is not cloudy may be 0 and therefore$P(A|\textrm{not }B) = 0$. You will need to find the probability that a review has a rating given its review comment text:$P(\text{rating}|\text{review comment text})$. This conditional probability is hard to evaluate exactly, since we cannot directly find the rating from the comment text. To simplify this computation, we can use the bag of words as an approximation:$P(\text{rating}|\text{bag-of-words})$. That is given a list of the words used, find the rating. This is a lot simpler in that the order of the words does not matter. Some accuracy is lost doing so, but the problem becomes much easier to solve. If we have a bag-of-words for each rating, this can be used to see how likely it is that a specific bag-of-words came from each rating's bag-of-words. The most likely rating for that bag of words can then be found. To find compute this likelihood, we can use Bayes’ Theorem:$P(A|B) = \frac{P(B|A)P(A)}{P(B)}$. In our setting:$P(\text{rating}|\text{bag-of-words}) = \frac{P(\text{bag-of-words}|\text{rating})P(\text{rating})}{P(\text{bag-of-words})}$. This computation allows us to determine the probability that a bag of words occurs given a rating. In turn, we need to find$P(\text{bag-of-words}), P(\text{rating})$, and$P(\text{bag-of-words}|\text{rating})$.$P(\text{rating})$can be found easily by determining the number of occurrences of that rating relative to the total number of entries:$P(\text{rating}) = \frac{\text{occurrences of rating}}{\text{number of reviews}}$. It is difficult to find the probability of a bag-of-words, since all of the words have complicated interrelations, where the combination of two words combined can have a probability different from the product of the two probabilities:$P(A) \times P(B) \not = P(AB)$, unless$A$and$B$are independent, meaning that the probability of one does not effect the probability of the other. For this mini-project, we can use Naïve Bayesian analysis, which is where all the events are treated as independent. In our case, the events are the occurrences of words in a bag-of-words. This assumption makes the problem simpler but we sacrifice accuracy.$P(\text{bag-of-words}) = P(\text{word}_1, \text{word}_2, \text{word}_3,...,\text{word}_n)$becomes$P(\text{bag-of-words}) \sim P(\text{word}_1) \times P(\text{word}_2) \times P(\text{word}_3)\times \dots \times P(\text{word}_n)$. To find the probability of a single word, we can use:$P(\text{word}) = \frac{\text{number of occurrences of word}}{\text{total number of words}}$. This computation can also be used for the conditional probability of bag-of-words:$P(\text{bag-of-words}|\text{rating}) = P(\text{word}_1, \text{word}_2, \text{word}_3,...,\text{word}_n|\text{rating})$becomes$P(\text{bag-of-words}|\text{rating}) \sim P(\text{word}_1|\text{rating}) \times P(\text{word}_2|\text{rating}) \times P(\text{word}_3|\text{rating}) \times \dots \times P(\text{word}_n|\text{rating})$. To find the probability of occurrence of a single word we can use:$P(\text{word}|\text{rating}) = \frac{\text{number of occurrences word in rating}}{\text{total number of words in reviews at this rating}}$. The goal is to find the rating that best matches the comment. Using the corresponding bag-of words, the probability of each rating can be found:$P(\text{bag-of-words}|\text{rating}_i)$, where$i \in \{1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5\}\$. The most likely rating for a given comment can be found by returning the rating that results in the highest conditional probability for the bag-of-words associated with the review we are considering.

Here is the skeleton code that provides the signatures for the key operations:

## Get Expert Help On This Assignment

#### Scan above qrcode with Wechat

Canada代写,CPEN 221代写,CPEN221代写,Principles of Software Construction代写,n-grams代写,Autocompletion代写,sentiment analysis代写,Java代写,Canada代编,CPEN 221代编,CPEN221代编,Principles of Software Construction代编,n-grams代编,Autocompletion代编,sentiment analysis代编,Java代编,Canada代考,CPEN 221代考,CPEN221代考,Principles of Software Construction代考,n-grams代考,Autocompletion代考,sentiment analysis代考,Java代考,Canadahelp,CPEN 221help,CPEN221help,Principles of Software Constructionhelp,n-gramshelp,Autocompletionhelp,sentiment analysishelp,Javahelp,Canada作业代写,CPEN 221作业代写,CPEN221作业代写,Principles of Software Construction作业代写,n-grams作业代写,Autocompletion作业代写,sentiment analysis作业代写,Java作业代写,Canada编程代写,CPEN 221编程代写,CPEN221编程代写,Principles of Software Construction编程代写,n-grams编程代写,Autocompletion编程代写,sentiment analysis编程代写,Java编程代写,Canadaprogramming help,CPEN 221programming help,CPEN221programming help,Principles of Software Constructionprogramming help,n-gramsprogramming help,Autocompletionprogramming help,sentiment analysisprogramming help,Javaprogramming help,Canadaassignment help,CPEN 221assignment help,CPEN221assignment help,Principles of Software Constructionassignment help,n-gramsassignment help,Autocompletionassignment help,sentiment analysisassignment help,Javaassignment help,Canadasolution,CPEN 221solution,CPEN221solution,Principles of Software Constructionsolution,n-gramssolution,Autocompletionsolution,sentiment analysissolution,Javasolution,