A CPEN 221 MiniProject
Corrigenda + Hints
Overview
Have you wondered how Google makes recommendations for search terms when you start typing some text? Have you also thought about whether there are biases inherent in reviews?
In this mini-project you will learn a bit about both of these ideas, and you will get to explore some methods for sentiment analysis. Of course, you will also get to apply the principles of software construction that you have learnt so far.
When you look through the provided skeleton code, you should implement everything that is marked with a `TODO`. The only exceptions to this requirement are the `TODO`s in the `cpen221.mp1.autocompletion.gui` package.
Data
You will be working with a data set that represents a subset of the reviews from RateMyProfessor.com. The data set contains approximately 14000 reviews, with each review including the following information: the rating (a score between 1 and 5 with 1 being the worst and 5 being the best), the gender of the professor being reviewed (either M[ale] or F[emale]), and the text of the review (a short comment made by the reviewer about the professor). In this data set, the ratings are in increments of 0.5.
The data is stored in a text file named `ratemyprofessor_data.txt`, inside the `data` directory of the project. Each line contains a review, with the first line holding the column headings. To interact with the file, you can use the methods in the `DataWrapper` class. The methods available via `DataWrapper` are:
- the constructor, which creates a `DataWrapper` object;
- `nextLine`, which reads the next line of the file.
Task 1: n-grams
<aside>
All the code for this task should be in the `cpen221.mp1.ngrams` package.
</aside>
The first step in this mini-project is to count the number of times an n-gram occurs in a document. An n-gram is — for this mini-project — a sequence of n words, although the concept is quite general and can be used in diverse fields such as linguistics and bioinformatics.
Let us consider the following [arbitrary] text: “The blue cow jumps over the blue cow moon.” This text has the following 1-grams (words), 2-grams, and 3-grams with the relevant counts:
| 1-gram | count |
| --- | --- |
| the | 2 |
| blue | 2 |
| cow | 2 |
| jumps | 1 |
| over | 1 |
| moon | 1 |

| 2-gram | count |
| --- | --- |
| the blue | 2 |
| blue cow | 2 |
| cow jumps | 1 |
| jumps over | 1 |
| over the | 1 |
| cow moon | 1 |

| 3-gram | count |
| --- | --- |
| the blue cow | 2 |
| blue cow jumps | 1 |
| cow jumps over | 1 |
| jumps over the | 1 |
| over the blue | 1 |
| blue cow moon | 1 |
We will ignore case in determining the words, so “The”, “the” and “tHe”, for instance, would all be the same 1-gram. We will also ignore repeated whitespace in determining the $n$-grams, so, for example, we will treat “jumps  over   the” (with extra spaces) as the 3-gram “jumps over the”. We also ignore punctuation, so we will treat “it will not rain today, i hope” as “it will not rain today i hope” when computing the different $n$-grams.
Given a document or text input, as well as a parameter $n$, you will implement two different methods:
- a method that counts the total number of unique 1-grams, 2-grams, …, $n$-grams (up to the specified limit);
- a method that returns a `List` of `Map`s, with a `Map` for the 1-grams and their counts, a `Map` for the 2-grams and their counts, and so on. The `i`-th entry in the `List` is a `Map` of the `(i+1)`-gram counts. The $n$-grams in each `Map` should consist of words produced by the `getWords` snippet below, separated by single whitespaces only.
How should we define what constitutes a word?

Should punctuation characters be part of a word? Should we manually define word breaks? While there are many possible definitions and interpretations, we shall use the definitions contained in Java’s `BreakIterator`. We will also work with words by converting all letters to lower case. Here is a code snippet, which you can use freely, that illustrates how to use a `BreakIterator` and obtain an array of words:

```java
// Requires: import java.text.BreakIterator; import java.util.ArrayList;
private String[] getWords(String text) {
    ArrayList<String> words = new ArrayList<>();
    BreakIterator wb = BreakIterator.getWordInstance();
    wb.setText(text);
    int start = wb.first();
    for (int end = wb.next(); end != BreakIterator.DONE; start = end, end = wb.next()) {
        // lower-case the token, then strip leading/trailing punctuation
        String word = text.substring(start, end).toLowerCase();
        word = word.replaceAll("^\\s*\\p{Punct}+\\s*", "")
                   .replaceAll("\\s*\\p{Punct}+\\s*$", "");
        if (!word.isBlank()) { // skip whitespace and punctuation-only tokens
            words.add(word);
        }
    }
    String[] wordsArray = new String[words.size()];
    words.toArray(wordsArray);
    return wordsArray;
}
```
Task 2: Keywords and RateMyProfessor
<aside>
All the code for this task should be in the `cpen221.mp1.ratemyprofessor` package. More specifically, you will implement the methods in the `DataAnalyzer` class.
</aside>
Given a sequence of keywords (up to 3 keywords, i.e., a 3-gram), identify all the reviews that contain that sequence, and then partition the reviews based on rating and gender. We want to group the reviews into six categories: men-low (`ML`), women-low (`WL`), men-medium (`MM`), women-medium (`WM`), men-high (`MH`), and women-high (`WH`). We will define a rating as low if it is in the range [0, 2], medium if it is in the range (2, 3.5], and high if it is in the range (3.5, 5].
For this task, you should implement a method that returns a `Map` with the keys `ML`, `WL`, `MM`, `WM`, `MH` and `WH`, where the value for each key is the count of reviews in that category that match the provided search keywords.
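The category label can be computed directly from a review's gender and rating. Here is a minimal sketch, assuming the gender is given as `"M"` or `"F"` and the rating as a `double`; adapt it to the skeleton's actual representation:

```java
// A sketch only: maps (gender, rating) to one of the six category keys.
private static String categoryOf(String gender, double rating) {
    String group = gender.equalsIgnoreCase("M") ? "M" : "W";
    String level;
    if (rating <= 2.0) {
        level = "L";        // low: [0, 2]
    } else if (rating <= 3.5) {
        level = "M";        // medium: (2, 3.5]
    } else {
        level = "H";        // high: (3.5, 5]
    }
    return group + level;   // e.g., "ML", "WH"
}
```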
You can [optionally] generate a histogram chart of the data using the provided `Histogram` type.
Task 3: Completing Search Terms
<aside>
Most of the code for this task should be in the `cpen221.mp1.autocompletion` package. You will also work with `searchterm.SearchTerm` and implement at least one method in that class. This task will bring together Tasks 1 and 2. Can you (this part is optional) use the `AutoCompletor` GUI to search for terms in the Rate My Professor dataset and then plot the histogram?
</aside>
When one is typing a search term, it is helpful to have suggestions that complete the search term. This task lets us combine the work done in Tasks 1 and 2, but aspects of it can stand alone.
If $s$ is a `String` that represents the search term, then you should recommend $n$-grams that contain $s$ as a prefix.
Here is an example: suppose the search term is “where in the wor”; then the following $n$-grams are possible recommendations:
- where in the world is carmen sandiego
- where in the world is it 5pm
- where in the world are there no mosquitoes
- where in the world did you get that
Notice that “where in the wor” is a prefix of all these $n$-grams.
You should return an array of recommendations, ordered by weight in decreasing order. In the case of ties, you should order the tied terms in dictionary order (lexicographic ordering).
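For the ordering, a comparator along these lines may help; the accessors `getWeight()` and `getText()` are hypothetical stand-ins for whatever `SearchTerm` actually exposes:

```java
// A sketch only: sort by descending weight, breaking ties lexicographically.
// Requires: import java.util.Comparator;
Comparator<SearchTerm> byWeightThenText =
        Comparator.comparingLong(SearchTerm::getWeight).reversed()
                  .thenComparing(SearchTerm::getText);
```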
The core functionality to implement is as follows (although there is a set of supporting methods — see the skeleton code — that you will also need to implement). Your search term recommendation method would return an array of `SearchTerm`s. The recommendation method is invoked on an `AutoCompletor` instance that is seeded with the text on which we will run the search. One way to instantiate an `AutoCompletor` would be with a `List` of `Map`s, like you produced in Task 1. The values in the `Map`s are the weights that you would use to order the recommendations. This method will also take as input a parameter that limits the number of `SearchTerm`s to be returned.
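Putting this together, usage might look like the sketch below; the constructor and the method name `topKMatches` are assumptions to be matched against the skeleton code:

```java
// A sketch only: seed an AutoCompletor with the n-gram counts from Task 1
// (e.g., via the hypothetical getAllNGrams sketch above) and ask for at
// most 5 completions of a prefix.
List<Map<String, Long>> ngramCounts = getAllNGrams(documentText, 3);
AutoCompletor completor = new AutoCompletor(ngramCounts);
SearchTerm[] suggestions = completor.topKMatches("where in the wor", 5);
```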
To complete this task, you will use the `searchterm.SearchTerm` type and you should also understand how comparators work in Java (although you should not have to implement too much here).
Task 4: Sentiment Analysis
<aside>
To complete this task, you will implement `SentimentAnalyzer`, which should be part of the `cpen221.mp1.sentimentanalysis` package. The skeleton code you should use is provided at the end of this description.
</aside>
Can we use the words in a review — for instance, from the RateMyProfessor data set — to determine if the review is positive or not? Can we use the words in a review to determine the rating? We will use a naïve Bayesian analysis along with a bag-of-words approach to obtain a “reasonable” solution to this problem.
What is a bag-of-words? Let us consider the following string: “The car drove down the highway”. The bag-of-words we can obtain from this string is: {“the”: 2, “car”: 1, “drove”: 1, “down”: 1, “highway”: 1}. (Aren’t these 1-grams?)
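A bag-of-words is straightforward to build with a `Map`; here is a minimal sketch that reuses the `getWords` helper from Task 1:

```java
// A sketch only: count the occurrences of each word in a piece of text.
// Requires: import java.util.HashMap; import java.util.Map;
private Map<String, Long> bagOfWords(String text) {
    Map<String, Long> bag = new HashMap<>();
    for (String word : getWords(text)) {
        bag.merge(word, 1L, Long::sum); // increment the count for this word
    }
    return bag;
}
```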
Your goal is to implement a method that generates the bag-of-words for all reviews with each rating; in other words, one bag-of-words for rating 1, one bag-of-words for rating 1.5, and so on.
We will then use conditional probabilities to perform sentiment analysis. The conditional probability of an event is the chance that the event occurs given that some other event has occurred. We will use the (standard) notation that $P(A)$ is the probability that some event $A$ occurs. The conditional probability $P(A|B)$ is the probability that event $A$ occurs given that event $B$ has occurred. Here is an example: the probability that it is raining in Vancouver may be 0.3 (a 30% chance), and we can call this event $A$. Therefore, $P(A) = 0.3$. The probability that it is raining given that the sky is cloudy may be 0.7. Let us denote the event that the sky is cloudy as $B$. Therefore, $P(A|B) = 0.7$. Finally, the probability that it is raining when it is not cloudy may be 0 and therefore $P(A|\textrm{not }B) = 0$.
You will need to find the probability that a review has a rating given its review comment text: $P(\text{rating}|\text{review comment text})$.
This conditional probability is hard to evaluate exactly, since we cannot directly find the rating from the comment text. To simplify this computation, we can use the bag-of-words as an approximation: $P(\text{rating}|\text{bag-of-words})$. That is, given the list of words used, find the rating. This is a lot simpler because the order of the words does not matter. Some accuracy is lost in doing so, but the problem becomes much easier to solve.
If we have a bag-of-words for each rating, we can use it to see how likely it is that a specific bag-of-words came from each rating's bag-of-words. The most likely rating for that bag-of-words can then be found. To compute this likelihood, we can use Bayes’ Theorem: $P(A|B) = \frac{P(B|A)P(A)}{P(B)}$.
In our setting: $P(\text{rating}|\text{bag-of-words}) = \frac{P(\text{bag-of-words}|\text{rating})P(\text{rating})}{P(\text{bag-of-words})}$.
This computation allows us to determine the probability of a rating given a bag-of-words. To evaluate it, we need to find $P(\text{bag-of-words})$, $P(\text{rating})$, and $P(\text{bag-of-words}|\text{rating})$.
$P(\text{rating})$ can be found easily by determining the number of occurrences of that rating relative to the total number of entries: $P(\text{rating}) = \frac{\text{occurrences of rating}}{\text{number of reviews}}$.
It is difficult to find the probability of a bag-of-words, since the words have complicated interrelations: the probability of two words occurring together can differ from the product of their individual probabilities, i.e., $P(A) \times P(B) \neq P(AB)$ unless $A$ and $B$ are independent, meaning that the probability of one does not affect the probability of the other.
For this mini-project, we can use naïve Bayesian analysis, in which all the events are treated as independent. In our case, the events are the occurrences of words in a bag-of-words. This assumption makes the problem simpler, but we sacrifice some accuracy.
$P(\text{bag-of-words}) = P(\text{word}_1, \text{word}_2, \text{word}_3, \dots, \text{word}_n)$ becomes $P(\text{bag-of-words}) \approx P(\text{word}_1) \times P(\text{word}_2) \times P(\text{word}_3) \times \dots \times P(\text{word}_n)$.
To find the probability of a single word, we can use: $P(\text{word}) = \frac{\text{number of occurrences of word}}{\text{total number of words}}$.
The same approximation can be used for the conditional probability of the bag-of-words:
$P(\text{bag-of-words}|\text{rating}) = P(\text{word}_1, \text{word}_2, \text{word}_3, \dots, \text{word}_n|\text{rating})$ becomes $P(\text{bag-of-words}|\text{rating}) \approx P(\text{word}_1|\text{rating}) \times P(\text{word}_2|\text{rating}) \times P(\text{word}_3|\text{rating}) \times \dots \times P(\text{word}_n|\text{rating})$.
To find the probability of occurrence of a single word, we can use: $P(\text{word}|\text{rating}) = \frac{\text{number of occurrences of the word in reviews at this rating}}{\text{total number of words in reviews at this rating}}$.
The goal is to find the rating that best matches the comment. Using the corresponding bag-of-words, we can evaluate $P(\text{bag-of-words}|\text{rating}_i)\,P(\text{rating}_i)$ for each rating, where $i \in \{1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5\}$. Since $P(\text{bag-of-words})$ is the same for every rating, the most likely rating for a given comment is the one for which this product is largest.
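Here is a minimal sketch of that final step. One practical note: multiplying many small probabilities can underflow `double` arithmetic, so implementations commonly sum logarithms instead, which preserves the ordering. The helpers `pRating` and `pWordGivenRating` are hypothetical and correspond to the two formulas above:

```java
// A sketch only: pick the rating that maximizes
// log P(rating) + sum over words of log P(word|rating).
private double bestRating(String comment) {
    double[] ratings = {1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5};
    String[] words = getWords(comment);
    double best = ratings[0];
    double bestLogProb = Double.NEGATIVE_INFINITY;
    for (double rating : ratings) {
        double logProb = Math.log(pRating(rating));              // hypothetical helper
        for (String word : words) {
            logProb += Math.log(pWordGivenRating(word, rating)); // hypothetical helper
        }
        if (logProb > bestLogProb) {
            bestLogProb = logProb;
            best = rating;
        }
    }
    return best;
}
```

Beware that a word that never occurs at some rating gives $P(\text{word}|\text{rating}) = 0$ and wipes out the entire product; smoothing the counts (e.g., adding 1 to every word count) is a common way to avoid this.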
Here is the skeleton code that provides the signatures for the key operations: