[2022] COMP(2041|9044) 22T2 - Software Construction - Week 08 Laboratory Exercises

Week 08 Laboratory Exercises

Objectives

· Proficiency at text processing in Python.
· Understanding multi-dimensional dicts.
· Exploring a simple machine learning algorithm.

Preparation

Before the lab you should re-read the relevant lecture slides and their accompanying examples.

Getting Started

Set up for the lab by creating a new directory called lab08 and changing to this directory.

$ mkdir lab08
$ cd lab08

There are some provided files for this lab which you can fetch with this command:

$ 2041 fetch lab08

If you're not working at CSE, you can download the provided files as a zip file or a tar file.

How many words in standard input?

In these exercises you will work with a dataset containing song lyrics.

This dataset contains the lyrics of the songs of 10 well-known artists.

$ unzip lyrics.zip
Archive: lyrics.zip
creating: lyrics/
inflating: lyrics/David_Bowie.txt
inflating: lyrics/Adele.txt
inflating: lyrics/Metallica.txt
inflating: lyrics/Rage_Against_The_Machine.txt
inflating: lyrics/Taylor_Swift.txt
inflating: lyrics/Keith_Urban.txt
inflating: lyrics/Ed_Sheeran.txt
inflating: lyrics/Justin_Bieber.txt
inflating: lyrics/Rihanna.txt
inflating: lyrics/Leonard_Cohen.txt
inflating: song0.txt
inflating: song1.txt
inflating: song2.txt
inflating: song3.txt
inflating: song4.txt

The lyrics for each song have been re-ordered to avoid copyright concerns.

The dataset also contains lyrics from 5 songs where we don't know the artists.

$ cat song0.txt
I've made up my mind,
Don't need to think it over,
If I'm wrong I am right,
Don't need to look no further,
This ain't lust,
I know this is love but,
If I tell the world,
I'll never say enough,
Cause it was not said to you,
And that's exactly what I need to do,
If I'm in love with you,
$ cat song1.txt
Come Mr. DJ song pon de replay
Come Mr. DJ won't you turn the music up
All the gal pon the dance floor wantin' some more what
Come Mr. DJ won't you turn the music up
$ cat song2.txt
And they say She's in the class A team
Stuck in her daydream

Each is from one of the artists in the dataset but they are not from a song in the dataset.

As a first step in this analysis, write a Python script total_words.py which counts the total number of words in its stdin.

For the purposes of this program (and the following programs) we will define a word to be a maximal, non-empty, contiguous sequence of alphabetic characters ([a-zA-Z]). Any characters other than [a-zA-Z] separate words.

So for example the phrase "The soul's desire" contains 4 words: ("The", "soul", "s", "desire")
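The word definition above maps naturally onto a regular expression. The following is only a minimal sketch of one possible approach, not the reference solution: the function name extract_words is hypothetical, and it works on a string so that it can be tried without the dataset. Feeding it the contents of standard input (for example sys.stdin.read()) is left to the reader.

```python
import re


def extract_words(text):
    """Return every maximal run of ASCII letters in text, in order."""
    # The '+' quantifier only matches non-empty runs, so findall
    # never produces empty strings here.
    return re.findall(r"[a-zA-Z]+", text)


# The example phrase from the exercise text.
words = extract_words("The soul's desire")
print(words)
print(f"{len(words)} words")
```

One reason to prefer matching words over splitting on separators: re.split on non-letters can return empty strings at the start or end of the input, and counting those would inflate the total.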

$ ./total_words.py < lyrics/Justin_Bieber.txt
46589 words
$ ./total_words.py < lyrics/Metallica.txt
38096 words
$ ./total_words.py < lyrics/Rihanna.txt
53157 words

HINT:

If your word counts are a little too high, you might be counting empty strings.

NOTE:

A word is defined for these exercises to be a maximal, non-empty, contiguous sequence of alphabetic characters ([a-zA-Z]).

You can assume your input is only ASCII.

Your answer must be Python only. You cannot use other languages such as Shell, Perl or C.

You may not run external programs.

When you think your program is working, you can use autotest to run some simple automated tests:

$ 2041 autotest total_words

When you are finished working on this exercise, you must submit your work by running give:

$ give cs2041 lab08_total_words total_words.py

before Monday 25 July 12:00 to obtain the marks for this lab exercise.

How many times does a word occur in standard input?

Write a Python script count_word.py that counts the number of times a specified word is found in its stdin.

The word you should count will be specified as a command line argument.

Your program should ignore the case of words.

$ ./count_word.py death < lyrics/Metallica.txt
death occurred 69 times
$ ./count_word.py death < lyrics/Justin_Bieber.txt
death occurred 0 times
$ ./count_word.py love < lyrics/Ed_Sheeran.txt
love occurred 218 times
$ ./count_word.py love < lyrics/Rage_Against_The_Machine.txt
love occurred 4 times
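Case-insensitive counting can be sketched by lower-casing both the target word and the text before comparing whole words. This is an illustrative sketch only: the function name count_word and the sample sentence are invented for the example, and the real script would take the word from the command line and the text from stdin.

```python
import re


def count_word(target, text):
    """Count case-insensitive occurrences of target among the words of text."""
    target = target.lower()
    # After lower-casing, a word is any maximal run of the letters a-z.
    words = re.findall(r"[a-z]+", text.lower())
    return sum(1 for word in words if word == target)


# Invented sample text, not taken from the lyrics dataset.
sample = "Death, death and DEATH; deathly quiet"
print(f"death occurred {count_word('death', sample)} times")
```

Comparing whole words rather than searching for substrings matters here: "deathly" contains "death" but is a different word under the definition used in these exercises.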

HINT:

Start with your code from the previous activity.

NOTE:

A word is defined for these exercises to be a maximal, non-empty, contiguous sequence of alphabetic characters ([a-zA-Z]).

You can assume your input is only ASCII.

Your answer must be Python only. You cannot use other languages such as Shell, Perl or C.

You may not run external programs.

When you think your program is working, you can use autotest to run some simple automated tests:

$ 2041 autotest count_word

When you are finished working on this exercise, you must submit your work by running give:

$ give cs2041 lab08_count_word count_word.py

before Monday 25 July 12:00 to obtain the marks for this lab exercise.

Do you use that word often?

Write a Python script frequency.py that prints the frequency with which each artist uses a word specified as an argument.

So if Justin Bieber uses the word "love" 493 times in the 46589 words of his songs, then its frequency is 493/46589 = 0.010581897

$ ./frequency.py love
165/ 16359 = 0.010086191 Adele
189/ 34080 = 0.005545775 David Bowie
218/ 18207 = 0.011973417 Ed Sheeran
493/ 46589 = 0.010581897 Justin Bieber
217/ 27016 = 0.008032277 Keith Urban
212/ 26192 = 0.008094075 Leonard Cohen
57/ 38096 = 0.001496220 Metallica
4/ 18985 = 0.000210693 Rage Against The Machine
494/ 53157 = 0.009293226 Rihanna
89/ 26188 = 0.003398503 Taylor Swift
$ ./frequency.py death
1/ 16359 = 0.000061128 Adele
9/ 34080 = 0.000264085 David Bowie
3/ 18207 = 0.000164772 Ed Sheeran
0/ 46589 = 0.000000000 Justin Bieber
1/ 27016 = 0.000037015 Keith Urban
16/ 26192 = 0.000610874 Leonard Cohen
69/ 38096 = 0.001811214 Metallica
23/ 18985 = 0.001211483 Rage Against The Machine
0/ 53157 = 0.000000000 Rihanna

Make sure your Python script produces exactly the output above.
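Per-artist frequencies are a natural fit for a dict of dicts: the outer key is the artist and the inner key is the word. The sketch below illustrates that structure under loud assumptions: the two "artists" and their texts are invented stand-ins for the lyrics files, the names word_counts, counts, totals and frequency are hypothetical, and the print format does not attempt to reproduce the exact column layout required by the exercise.

```python
import re
from collections import Counter


def word_counts(text):
    """Map each lower-cased word in text to its number of occurrences."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


# Invented stand-ins for the lyrics files (not from the dataset).
lyrics = {
    "Artist A": "love me do, love me true",
    "Artist B": "no love lost, only rain",
}

# counts[artist][word] is a two-level dict; totals[artist] is the word total.
counts = {artist: word_counts(text) for artist, text in lyrics.items()}
totals = {artist: sum(c.values()) for artist, c in counts.items()}


def frequency(word, artist):
    """Fraction of the artist's words that are the given word."""
    # A Counter returns 0 for a missing key, so unseen words need no special case.
    return counts[artist][word.lower()] / totals[artist]


for artist in sorted(counts):
    n = counts[artist]["love"]
    print(f"{n}/{totals[artist]} = {frequency('love', artist):.9f} {artist}")
```

In a real script the outer dict would presumably be filled by reading each file in the lyrics directory, with the artist name derived from the file name; that mapping is an assumption here, not something the exercise text specifies.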

When numbers get very small, logarithms are your friend

Now suppose we have the song line "truth is beauty".

Given that David Bowie uses:

the word "truth" with frequency 0.000146727
the word "is" with frequency 0.005898407
the word "beauty" with frequency 0.000264108

we can estimate the probability of Bowie writing the phrase "truth is beauty" as:

0.000146727 * 0.005898407 * 0.000264108 = 2.28573738067596e-10

We could similarly estimate probabilities for each of the other 9 artists and then determine which of the 10 artists is most likely to sing "truth is beauty" (it's Leonard Cohen).

A sidenote: we are actually making a large simplifying assumption in calculating this probability: we treat each word as occurring independently of the others and ignore word order. It is often called the bag of words model.

Multiplying probabilities like this quickly leads to very small numbers and may result in arithmetic underflow of our floating point representation. A common solution to this underflow is instead to work with the log of the numbers.

So instead we will calculate the log of the probability of the phrase. You do this by adding the log of the probabilities of each word. For example, you calculate the log-probability of Bowie singing the phrase "Truth is beauty." like this:

log(0.000146727) + log(0.005898407) + log(0.000264108) = -22.1991622527613

Log-probabilities can be used directly to determine the most likely artist, as the artist with the highest log-probability will also have the highest probability.
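The arithmetic above can be checked with a few lines of Python. This sketch uses the three Bowie frequencies quoted in the text and the natural logarithm (math.log), which is what reproduces the figure -22.199...; the list of 100 tiny probabilities is an invented illustration of underflow and is not part of the exercise. math.prod requires Python 3.8 or later.

```python
import math

# Frequencies quoted in the text for David Bowie: "truth", "is", "beauty".
freqs = [0.000146727, 0.005898407, 0.000264108]

product = math.prod(freqs)                  # direct product of probabilities
log_sum = sum(math.log(f) for f in freqs)   # sum of natural logs

print(product)
print(log_sum)
print(math.log(product))  # agrees with log_sum up to rounding

# Invented illustration: 100 words, each with probability 1e-5.
tiny = [1e-5] * 100
print(math.prod(tiny))                  # underflows to 0.0
print(sum(math.log(p) for p in tiny))   # still a usable finite number
```

Because the logarithm is strictly increasing, ranking artists by the sum of logs gives the same order as ranking them by the product, without ever forming the tiny product itself.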

Another problem is that we might be given a word that an artist has not used in the dataset we have. Its estimated probability would then be 0, whose logarithm is undefined, and a single unseen word would rule that artist out entirely.

You should avoid this when estimating probabilities by adding 1 to the count of occurrences of each word.

So for example we'd estimate the probability of Ed Sheeran using the word fear as (0+1)/18205 and the probability of Metallica using the word fear as (39+1)/38082.

This is a simple version of Additive smoothing.
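The add-one adjustment can be sketched as a small helper. The function name smoothed_log_probability is hypothetical; the counts and totals are the ones quoted in the text for the word "fear". The sketch also shows why smoothing is needed: math.log(0) raises ValueError.

```python
import math


def smoothed_log_probability(count, total_words):
    """Natural log of (count + 1) / total_words, as described in the text."""
    return math.log((count + 1) / total_words)


# Figures quoted in the text for the word "fear".
ed_sheeran = smoothed_log_probability(0, 18205)
metallica = smoothed_log_probability(39, 38082)
print(ed_sheeran)
print(metallica)

# Without smoothing, a zero count has no logarithm at all.
try:
    math.log(0 / 18205)
except ValueError as error:
    print("unsmoothed zero count fails:", error)
```

Note that this follows the exercise's simple version, which adds 1 to the count only and leaves the denominator unchanged.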

Write a Python script log_probability.py which, given a phrase (sequence of words) as arguments, prints the estimated log of the probability that each artist would use this phrase.

$ ./log_probability.py truth is beauty
-23.11614 Adele
-21.90679 David Bowie
-23.10075 Ed Sheeran
-21.70202 Justin Bieber
-23.45248 Keith Urban
-18.58417 Leonard Cohen
-21.08903 Metallica
-21.98171 Rage Against The Machine
-22.51582 Rihanna
-24.40992 Taylor Swift
$ ./log_probability.py death and taxes
-22.64301 Adele
-22.42756 David Bowie
-21.66227 Ed Sheeran
-25.56650 Justin Bieber
-23.20281 Keith Urban
-20.97467 Leonard Cohen
-20.90589 Metallica
-20.26248 Rage Against The Machine
-25.84396 Rihanna

Make sure your output matches the above exactly.

submit your work by running

Who sang those words?

Write a Python script identify_artist.py that, given 1 or more files (each containing part of a song), prints the most likely artist to have sung those words.

For each file given as argument, you should go through all artists and for each calculate the log-probability that the artist sang those words.

You calculate the log-probability that the artist sang the words in the file by calculating, for each word in the file, the log-probability of that artist using that word, and summing all the log-probabilities.

You should print the artist with the highest log-probability.

Your program should produce exactly this output:

$ ./identify_artist.py song?.txt
song0.txt most resembles the work of Adele (log-probability=-352.4)
song1.txt most resembles the work of Rihanna (log-probability=-254.9)
song2.txt most resembles the work of Ed Sheeran (log-probability=-206.6)
song3.txt most resembles the work of Justin Bieber (log-probability=-1089.8)
song4.txt most resembles the work of Leonard Cohen (log-probability=-493.8)
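The classification step combines the pieces from the earlier exercises: per-artist word counts, add-one smoothing, a sum of logs, and a final maximum. The sketch below shows that pipeline on invented toy data only. The function names build_model, log_probability and identify, and both toy "artists", are hypothetical; ignoring case is an assumption carried over from the count_word exercise; reading the real lyrics and song files is left out.

```python
import math
import re
from collections import Counter


def build_model(lyrics_by_artist):
    """Map each artist to (word counts, total number of words)."""
    model = {}
    for artist, text in lyrics_by_artist.items():
        counts = Counter(re.findall(r"[a-z]+", text.lower()))
        model[artist] = (counts, sum(counts.values()))
    return model


def log_probability(words, counts, total):
    """Sum of smoothed log-probabilities, one term per word occurrence."""
    return sum(math.log((counts[word] + 1) / total) for word in words)


def identify(text, model):
    """Return (artist, score) for the artist with the highest log-probability."""
    words = re.findall(r"[a-z]+", text.lower())
    scores = {
        artist: log_probability(words, counts, total)
        for artist, (counts, total) in model.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


# Invented toy corpus, not taken from the dataset.
toy = {
    "Artist A": "love love love me do you know I love you",
    "Artist B": "death and darkness death and fear master of death",
}
model = build_model(toy)
artist, score = identify("fear of death", model)
print(f"most resembles the work of {artist} (log-probability={score:.1f})")
```

Building the model once and reusing it for every input file avoids re-reading the lyrics for each song, which matters when several files are given on the command line.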

Submission

When you have finished each exercise, make sure you submit your work by running give.

You can run give multiple times.

Only your last submission will be marked.

Don't submit any exercises you haven't attempted.

If you are working at home, you may find it more convenient to upload your work via give's web interface.

Remember you have until Week 9 Monday 12:00:00 to submit your work.

You cannot obtain marks by e-mailing your code to tutors or lecturers.

You can check the files you have submitted here.

Automarking will be run by the lecturer several days after the submission deadline, using test cases different to those autotest runs for you.

After automarking is run by the lecturer you can view your results here.

The resulting mark will also be available via give's web interface.
