School of Information Technology and Electrical Engineering
INFS3208 – Cloud Computing
Programming Assignment Task III (10 Marks)
Task Description:
In this assignment, you are asked to write Spark code that counts the occurrences of verbs in the Shakespeare collection. The returned result should be the top 10 verbs that are most frequently used in Shakespeare’s collection. This assignment tests your ability to use transformation and action operations in Spark RDD programming. You will be given the collection file (shakespeare.txt), a verb list (all_verbs.txt), a verb dictionary file (verb_dict.txt), and the programming environment (a docker-compose file). You can choose either Scala or Python to program in the Jupyter Notebook. Your code submission must meet the following technical requirements:
1. You should use an appropriate method to load the files into RDDs. You are NOT allowed to make changes to the collection file (shakespeare.txt), verb list file (all_verbs.txt), and verb dictionary file (verb_dict.txt).
2. To accurately count the verbs in the collection, you should use learned RDD operations to pre-process the text in the collection file:
   a. Remove empty lines;
   b. Remove punctuation that could attach to the verbs;
      E.g., “work,” and “work” will be counted differently if you DO NOT remove the punctuation mark.
   c. Change the capitalization or case of text;
      E.g., “WORK”, “Work” and “work” will be counted as three different verbs if you DO NOT convert all of them to lower case.
3. You should use learned RDD operations to find the verbs used in the collection (shakespeare.txt) by matching them against the verbs in the given verb list (all_verbs.txt).
4. A verb can have different forms: present tense, past tense, and future tense.
   a. E.g., regular verb: “work” - “works”, “worked”, and “working”.
   b. E.g., irregular verb: “begin” - “begins”, “began”, and “begun”.
   c. E.g., linking verb “be” and its various forms, including “is”, “am”, “are”, “was”, “were”, “being” and “been”.
   You should use the learned RDD operations to calculate the occurrences of all the verbs (listed in the given verb dictionary file) and merge the verbs that have different tenses by looking up the verb dictionary file (verb_dict.txt).
   d. E.g., (work, 100), (works, 50), (working, 150) → (work, 300).
5. In the final result, you should return the top 10 verbs (in the base form, e.g., work) that are most frequently used in the collection of Shakespeare.
Preparation:
In this individual coding assignment, you will apply your knowledge of Spark and Spark RDD Programming (in Lectures 8 & 9). Firstly, you should read the Task Description to understand what the task is and what the technical requirements include. Secondly, you should review all the transformation and action operations in Lectures 8 & 9. In the Appendix, there are some transformation and action operations, and you can also revisit the RDD programming activity in Prac 7 (Week 8). Lastly, you need to write the code (Scala or Python) in the Jupyter Notebook. All technical requirements need to be fully met to achieve full marks.
You can practise either on a GCP VM or, if you are unable to access GCP, on your local machine with Oracle VirtualBox. Please read the Example of writing Spark code below for more details.
Assignment Submission:
▪ You need to compress the Jupyter Notebook file.
▪ The compressed file should be named “FirstName LastName StudentNo.zip”.
▪ You must make an online submission to Blackboard.
▪ Only one extension application may be approved, on the grounds of a medical condition.
Example of writing Spark code:
Step 1:
Log in to your VM and change to your home directory.
Step 2:
Download docker-compose.yml and the data you require (shakespeare.txt, all_verbs.txt, verb_dict.txt):
git clone https://github.com/csenw/cca3.git && cd cca3
sudo chmod -R 777
Step 3:
Run all the containers: docker-compose up -d
Step 4:
Open the Jupyter Notebook (EXTERNAL_IP:8888) and write the Spark code in it.
Step 5:
Use the learned method to load external files (shakespeare.txt, all_verbs.txt, verb_dict.txt) into RDDs.
Output samples: shakespeare.txt, all_verbs.txt, verb_dict.txt
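As a rough illustration of Step 5, the loading could be sketched in Python as below. This assumes that the three files sit in the notebook's working directory and that a SparkContext is available as `sc`; the helper name `load_rdds` and its `data_dir` parameter are hypothetical and not part of the provided environment.

```python
def load_rdds(sc, data_dir="."):
    """Load the three input files as RDDs of text lines.

    `sc` is expected to be a SparkContext. The files are only read,
    never modified. `data_dir` is an assumed location for the files.
    """
    shakespeare = sc.textFile(f"{data_dir}/shakespeare.txt")
    all_verbs = sc.textFile(f"{data_dir}/all_verbs.txt")
    verb_dict = sc.textFile(f"{data_dir}/verb_dict.txt")
    return shakespeare, all_verbs, verb_dict


# Possible usage in a notebook where PySpark is installed:
#   from pyspark import SparkContext
#   sc = SparkContext.getOrCreate()
#   shakespeare, all_verbs, verb_dict = load_rdds(sc)
#   shakespeare.take(5)   # inspect the first few lines
```

Calling `take` on each RDD is a cheap way to check what the lines look like before writing the later steps.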
Step 6:
Use learned RDD operations to pre-process the RDD that stores the text:
- Remove empty lines;
- Remove punctuation that could attach to the verbs;
- Change the capitalization or case of text.
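One possible way to express these three pre-processing operations in Python is sketched below. It assumes that replacing every ASCII punctuation character with a space is acceptable (so “work,” becomes “work”, but a contraction such as “don't” is split in two), and it also splits each line into words so that later steps can work on single tokens. The names `clean_line` and `preprocess` are illustrative only.

```python
import string

# Map every ASCII punctuation character to a space, so that words joined
# by punctuation (e.g. "work--hard") are separated rather than glued together.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_line(line):
    """Replace punctuation with spaces and convert the line to lower case."""
    return line.translate(_PUNCT_TO_SPACE).lower()


def preprocess(lines_rdd):
    """Turn an RDD of raw lines into an RDD of cleaned, lower-case words."""
    return (lines_rdd
            .filter(lambda line: line.strip() != "")   # remove empty lines
            .map(clean_line)                            # punctuation and case
            .flatMap(lambda line: line.split()))        # one word per element
```

With this sketch, “WORK,”, “Work” and “work” all end up as the same token, “work”.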
Step 7:
Use learned RDD operations to keep only the words that appear in the verb list (all_verbs.txt).
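A minimal sketch of Step 7 in Python follows. It assumes that all_verbs.txt holds one verb per line and that the list is small enough to collect into a Python set on the driver; neither is stated in the task, so check the file first. The function name `keep_verbs` is hypothetical.

```python
def keep_verbs(words_rdd, all_verbs_rdd):
    """Keep only the words that appear in the verb list.

    Assumption: each line of the verb list RDD holds exactly one verb.
    """
    # Normalise the verb list the same way as the text, then collect it.
    verb_set = set(
        all_verbs_rdd
        .map(lambda v: v.strip().lower())
        .filter(lambda v: v != "")
        .collect()
    )
    # The set is captured by the closure and shipped to the workers.
    return words_rdd.filter(lambda word: word in verb_set)
```

An alternative that stays entirely within RDD operations would be to key both RDDs by the word and use `join`; the set-based filter is shown here only because it is short.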
Step 8:
Use learned RDD operations to count the occurrences of the kept verbs.
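The counting in Step 8 follows the usual word-count pattern. A short Python sketch, with the hypothetical function name `count_verbs`, could look like this:

```python
from operator import add


def count_verbs(verbs_rdd):
    """Count how often each verb form occurs.

    Input: RDD of verb strings. Output: RDD of (verb, count) pairs.
    """
    return (verbs_rdd
            .map(lambda verb: (verb, 1))   # pair each occurrence with 1
            .reduceByKey(add))             # sum the 1s per verb
```

At this point “work”, “works” and “working” are still counted separately.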
Step 9:
Use learned RDD operations to merge the (verb, count) pairs that belong to the same base verb. E.g., (work, 100), (works, 50), (working, 150) → (work, 300).
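The merge could be sketched in Python as below. The layout of verb_dict.txt is not described in this document, so the sketch assumes that each line is a comma-separated list of forms with the base form first (for example `work,works,worked,working`); adjust `parse_dict_line` if the real file differs. All function names are hypothetical.

```python
from operator import add


def parse_dict_line(line):
    """Map every form on a dictionary line to the base form.

    Assumed line layout: base form first, then its other forms,
    separated by commas.
    """
    forms = [f.strip().lower() for f in line.split(",") if f.strip()]
    if not forms:
        return []
    base = forms[0]
    # dict.fromkeys drops repeated forms so a form is not counted twice.
    return [(form, base) for form in dict.fromkeys(forms)]


def merge_forms(counts_rdd, verb_dict_rdd):
    """Turn (form, count) pairs into (base verb, total count) pairs."""
    # If a form is listed under several base verbs, keep one base
    # (the alphabetically smallest, an arbitrary tie-break).
    form_to_base = verb_dict_rdd.flatMap(parse_dict_line).reduceByKey(min)
    return (counts_rdd
            .join(form_to_base)                      # (form, (count, base))
            .map(lambda kv: (kv[1][1], kv[1][0]))    # (base, count)
            .reduceByKey(add))                       # sum per base verb
```

Because `join` is an inner join, forms that do not appear in the dictionary are dropped, which matches counting only the verbs listed in the verb dictionary file.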
Step 10:
Use learned RDD operations to return the top 10 verbs that are most frequently used in the collection of Shakespeare.
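For the final step, one option in Python is the `takeOrdered` action with a key that sorts by descending count. The function name `top_verbs` is illustrative.

```python
def top_verbs(base_counts_rdd, n=10):
    """Return the n (verb, count) pairs with the highest counts.

    takeOrdered sorts ascending by the key, so the count is negated
    to obtain the largest counts first.
    """
    return base_counts_rdd.takeOrdered(n, key=lambda kv: -kv[1])
```

The result is a plain Python list on the driver, so it can be printed directly in the notebook.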
Note that Steps 5-10 are only a recommended procedure; feel free to use your own processing steps. However, your result should reasonably reflect the top 10 verbs that are most frequently used in the collection of Shakespeare.