Week 08Laboratory Exercises
Objectives
· Proficiency at text processing in Python.
· Understanding multi-dimensional dicts.
· Explore a simple machine learning algorithm.
Preparation
Before the lab you should re-read the relevant lecture slides and their accompanying examples.
Getting Started
Set up for the lab by creating a new directory called lab08 and changing to this directory.
1 $ mkdir lab08
2 $ cd lab08
There are some provided files for this lab which you can fetch with this command:
1 $ 2041 fetch lab08
If you're not working at CSE, you can download the provided files as a zip file or a tar file
How many words in standard input?
In these exercises you will work with a dataset containing sing lyrics.
This dataset contains the lyrics of the songs of 10 well-known artists.
1 $ unzip lyrics.zip
2 Archive: lyrics.zip
3 creating: lyrics/
4 inflating: lyrics/David_Bowie.txt inflating: lyrics/Adele.txt
5 inflating: lyrics/Metallica.txt
6 inflating: lyrics/Rage_Against_The_Machine.txt
7 inflating: lyrics/Taylor_Swift.txt
8 inflating: lyrics/Keith_Urban.txt inflating: lyrics/Ed_Sheeran.txt
9 inflating: lyrics/Justin_Bieber.txt
10 inflating: lyrics/Rihanna.txt
11 inflating: lyrics/Leonard_Cohen.txt
12 inflating: song0.txt
13 inflating: song1.txt
14 inflating: song2.txt
15 inflating: song3.txt
16 inflating: song4.txt
The lyrics for each song have been re-ordered to avoid copyright concerns.
The dataset also contains lyrics from 5 songs where we don't know the artists.
1 $ cat song0.txt
2 I've made up my mind, Don't need to think it over,
3 If I'm wrong I am right,
4 Don't need to look no further,
5 This ain't lust,
6 I know this is love but,
7 If I tell the world,
8 I'll never say enough,
9 Cause it was not said to you,
10 And that's exactly what I need to do,
11 If I'm in love with you,
12 $ cat song1.txt
13 Come Mr. DJ song pon de replay
14 Come Mr. DJ won't you turn the music up
15 All the gal pon the dance floor wantin' some more what Come Mr. DJ won't you turn the music up
16 $ cat song2.txt
17 And they say She's in the class A team
18 Stuck in her daydream
Each is from one of the artists in the dataset but they are not from a song in the dataset.
As a first step in this analysis, write a Python script
total_words.py
which counts the total number of words in its stdin.
For the purposes of this program (and the following programs)
we will define a word to be a maximal, non-empty, contiguous, sequence of alphabetic characters([a-zA-Z]). Any characters other than [a-zA-Z] separate words.
So for example the phrase "The soul's desire" contains 4 words: ("The", "soul", "s", "desire")
1 $ ./total_words.py < lyrics/Justin_Bieber.txt
2 46589 words
3 $ ./total_words.py < lyrics/Metallica.txt
4 38096 words
5 $ ./total_words.py < lyrics/Rihanna.txt
6 53157 words
HINT:
If your word counts are a little too high, you might be counting empty strings.
NOTE:
A word is defined for these exercises to be a maximal, non-empty, contiguous, sequence of alphabetic characters([a-zA-Z]).
You can assume your input is only ASCII.
Your answer must be Python only. You can not use other languages such as Shell, Perl or C.
You may not run external programs.
When you think your program is working, you can use autotest to run some simple automated tests:
1 $ 2041 autotest total_words
When you are finished working on this exercise,you must submit your work by running give :
1 $ give cs2041 lab08_total_words total_words.py
beforeMonday 25 July 12:00 to obtain the marks for this lab exercise.
How many times does a word occur in standard input
Write a Python script count_word.py that counts the number of times a specified word is found in its stdin.
The word you should count will be specified as a command line argument.
Your program should ignore the case of words.
1 $ ./count_word.py death < lyrics/Metallica.txt
2 death occurred 69 times
3 $ ./count_word.py death < lyrics/Justin_Bieber.txt
4 death occurred 0 times
5 $ ./count_word.py love < lyrics/Ed_Sheeran.txt
6 love occurred 218 times
7 $ ./count_word.py love < lyrics/Rage_Against_The_Machine.txt
8 love occurred 4 times
HINT:
Start with your code from the previous activity.
NOTE:
A word is defined for these exercises to be a maximal, non-empty, contiguous, sequence of alphabetic characters([a-zA-Z]).
You can assume your input is only ASCII.
Your answer must be Python only. You can not use other languages such as Shell, Perl or C.
You may not run external programs.
When you think your program is working, you can use autotest to run some simple automated tests:
1 $ 2041 autotest count_word
When you are finished working on this exercise, you must submit your work by running
give
1 $ give cs2041 lab08_count_word count_word.py
before Monday 25 July 12:00 to obtain the marks for this lab exercise.
Do you use that word often?
Write a Python script frequency.py thar prints the frequency with which each artist uses a word specified as an argument.
So if Justin Bieber uses the word "love" 493 times in the 46583 words of his songs, then its frequency is 493/46583 = 0.0105832599875491
1 $ ./frequency.py love
2 165/ 16359 = 0.010086191 Adele
3 189/ 34080 = 0.005545775 David Bowie
4 218/ 18207 = 0.011973417 Ed Sheeran
5 493/ 46589 = 0.010581897 Justin Bieber
6 217/ 27016 = 0.008032277 Keith Urban
7 212/ 26192 = 0.008094075 Leonard Cohen
8 57/ 38096 = 0.001496220 Metallica 4/ 18985 = 0.000210693 Rage Against The Machine
9 494/ 53157 = 0.009293226 Rihanna
10 89/ 26188 = 0.003398503 Taylor Swift
11 $ ./frequency.py death
12 1/ 16359 = 0.000061128 Adele
13 9/ 34080 = 0.000264085 David Bowie
14 3/ 18207 = 0.000164772 Ed Sheeran
15 0/ 46589 = 0.000000000 Justin Bieber
16 1/ 27016 = 0.000037015 Keith Urban
17 16/ 26192 = 0.000610874 Leonard Cohen
18 69/ 38096 = 0.001811214 Metallica 23/ 18985 = 0.001211483 Rage Against The Machine
19 0/ 53157 = 0.000000000 Rihanna
Make sure your Python script produces exactly the output above.
When numbers get very small, logarithms are your friend
Now suppose we have the song line"truth is beauty".
Given that David Bowie uses:
the word"truth" with frequency 0.000146727
the word "is" with frequency 0.005898407
the word "beauty" with frequency 0.000264108
we can estimate the probability of Bowie writing the phrase "truth is beauty"
as:
1 0.000146727 * 0.005898407 * 0.000264108 = 2.28573738067596e-10
We could similarly estimate probabilities for each of the other 9 artists and then determine which of the 10 artists is most likely to sing "truth is beauty" (it's Leonard Cohen).
A sidenote: we are actually making a large simplifying assumption in calculating this probability.
It is often called thebag of words model.
Multiplying probabilities like this quickly leads to very small numbers and may result in arithmetic underflow of our floating pointrepresentation.
A common solution to this underflow is instead to work with the log of the numbers.
So instead we will calculate the log of the probability of the phrase. You do this by adding the log of the probabilities of each word. For example, you calculate the log-probability of
Bowie singing the phrase "Truth is beauty." like this:
1 log(0.000146727) + log(0.005898407) + log(0.000264108) = -22.1991622527613
Log-probabilities can be used directly to determine the most likely artist, as the artist with the highest log-probability will also havethe highest probability.
Another problem is that we might be given a word that an artist has not used in the dataset we have.
You should avoid this when estimating probabilities by adding 1 to the count of occurrences of each word.
So for example we'd estimate the probability of Ed Sheeran using the word fear as (0+1)/18205 and the probability of Metallica usingthe word fear as (39+1)/38082.
This is a simple version of Additive smoothing.
Write a Python script log_probability.py which given a phrase (sequence of words) as arguments, prints the estimated log of theprobability that each artist would use this phrase.
1 $ ./log_probability.py truth is beauty
2 -23.11614 Adele
3 -21.90679 David Bowie
4 -23.10075 Ed Sheeran -21.70202 Justin Bieber
5 -23.45248 Keith Urban
6 -18.58417 Leonard Cohen
7 -21.08903 Metallica
8 -21.98171 Rage Against The Machine
9 -22.51582 Rihanna
10 -24.40992 Taylor Swift
11 $ ./log_probability.py death and taxes
12 -22.64301 Adele
13 -22.42756 David Bowie
14 -21.66227 Ed Sheeran -25.56650 Justin Bieber
15 -23.20281 Keith Urban
16 -20.97467 Leonard Cohen
17 -20.90589 Metallica
18 -20.26248 Rage Against The Machine -25.84396 Rihanna
Make sure your output matches the above exactly
submit your work by running
Who sang those words?
Write a Python script identify_artist.py that given 1 or more files (each containing part of a song), prints the most likely artist tohave sung those words.
For each file given as argument, you should go through all artists and for each calculate the log-probability that the artist sung thosewords.
You calculate the log-probability that the artist sung the words in theilfe, by for each word in the file calculating the log-probability of that artist using that word, and summing all the the log-probabilities.
You should print the artist with the highest log-probability.
Your program should produce exactly this output:
1 $ ./identify_artist.py song?.txt
2 song0.txt most resembles the work of Adele (log-probability=-352.4)
3 song1.txt most resembles the work of Rihanna (log-probability=-254.9)
4 song2.txt most resembles the work of Ed Sheeran (log-probability=-206.6)
5 song3.txt most resembles the work of Justin Bieber (log-probability=-1089.8)
6 song4.txt most resembles the work of Leonard Cohen (log-probability=-493.8)
Submission
When you are finished each exercises make sure you submit your work by running
give.
You can run give multiple times.
Only your last submission will be marked.
Don't submit any exercises you haven't attempted.
If you are working at home, you may find it more convenient
to upload your work via give's web interface.
Remember you have until Week 9 Monday 12:00:00 to submit your work.
You cannot obtain marks by e-mailing your code to tutors or lecturers.
You check the files you have submittedhere.
Automarking will be run by the lecturer several days after the submission deadline, using test cases different to those autotest runs for you.
After automarking is run by the lecturer you can view your results here.
The resulting mark will also be availablevia give's webinterface.