
INFS3208 Cloud Computing - Assignment Task III: Spark RDD


School of Information Technology and Electrical Engineering

INFS3208 – Cloud Computing

Programming Assignment Task III (10 Marks)

Task Description:

In this assignment, you are asked to write Spark code that counts the occurrences of verbs in the collected works of Shakespeare. The returned result should be the 10 verbs used most frequently in the collection. This assignment tests your ability to use transformation and action operations in Spark RDD programming. You will be given the collection file (shakespeare.txt), a verb list (all_verbs.txt), a verb dictionary file (verb_dict.txt), and the programming environment (a docker-compose file). You can program in either Scala or Python in the Jupyter Notebook. Your code submission must meet the following technical requirements:

  1. You should use an appropriate method to load the files into RDDs. You are NOT allowed to make changes to the collection file (shakespeare.txt), the verb list file (all_verbs.txt), or the verb dictionary file (verb_dict.txt).
  2. To accurately count the verbs in the collection, you should use learned RDD operations to pre-process the text in the collection file:
    a. Remove empty lines.
    b. Remove punctuation that could attach to the verbs. E.g., “work,” and “work” will be counted differently if you DO NOT remove the punctuation mark.
    c. Change the capitalization or case of the text. E.g., “WORK”, “Work” and “work” will be counted as three different verbs if you DO NOT convert all of them to lower case.
  3. You should use learned RDD operations to find the verbs used in the collection (shakespeare.txt) by matching them against the verbs in the given verb list (all_verbs.txt).
  4. A verb can have different forms: present tense, past tense, and future tense.
    a. E.g., regular verb: “work” - “works”, “worked”, and “working”.
    b. E.g., irregular verb: “begin” - “begins”, “began”, and “begun”.
    c. E.g., linking verb “be” and its various forms, including “is”, “am”, “are”, “was”, “were”, “being” and “been”.
    You should use the learned RDD operations to calculate the occurrences of all the verbs (listed in the given verb dictionary file) and merge the verbs that have different tenses by using the learned RDD operations to look up the verb dictionary file (verb_dict.txt).
    d. E.g., (work, 100), (works, 50), (working, 150) → (work, 300).
  5. In the final result, you should return the top 10 verbs (in the base form, e.g., work) that are most frequently used in the collection of Shakespeare.
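The arithmetic behind the example in requirement 4 can be checked in a few lines of plain Python before moving to RDDs. This is only a worked illustration, not Spark code; the `form_to_base` mapping is a hypothetical stand-in for whatever verb_dict.txt actually contains.

```python
from collections import defaultdict

# Hypothetical form -> base mapping; the real one would be derived from verb_dict.txt.
form_to_base = {"work": "work", "works": "work", "worked": "work", "working": "work"}

# Per-form counts taken from the example in requirement 4.
form_counts = [("work", 100), ("works", 50), ("working", 150)]

# Add every form's count to the total of its base form.
merged = defaultdict(int)
for form, count in form_counts:
    merged[form_to_base[form]] += count

print(dict(merged))  # {'work': 300}
```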

Preparation:

In this individual coding assignment, you will apply your knowledge of Spark and Spark RDD programming (Lectures 8 & 9). Firstly, you should read the Task Description to understand what the task is and what the technical requirements include. Secondly, you should review all the transformation and action operations in Lectures 8 & 9. In the Appendix, there are some transformation and action operations; see also the RDD programming activity in Prac 7 (Week 8). Lastly, you need to write the code (Scala or Python) in the Jupyter Notebook. All technical requirements need to be fully met to achieve full marks.

You can practise either on the GCP VM or, if you are unable to access GCP, on your local machine with Oracle VirtualBox. Please read the Example of writing Spark code below for more details.

Assignment Submission:

You need to compress the Jupyter Notebook file. The compressed file should be named “FirstName LastName StudentNo.zip”.

You must make an online submission to Blackboard. Only one extension application could be approved due to medical conditions.

Example of writing Spark code:

Step 1:

Log in to your VM and change to your home directory.

Step 2:

Download docker-compose.yml and the data you require (shakespeare.txt, all_verbs.txt, verb_dict.txt).

Step 3:

Run all the containers: docker-compose up -d

Step 4:

Open the Jupyter Notebook (EXTERNAL_IP:8888) and write the Spark code in it.

Step 5:

Use the learned method to load external files (shakespeare.txt, all_verbs.txt, verb_dict.txt) into RDDs.
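One way to load the three files is `SparkContext.textFile`, which returns an RDD with one element per line and leaves the files on disk untouched. The sketch below assumes a SparkContext named `sc` already exists in the notebook; the helper name `load_rdds` and the `data_dir` parameter are hypothetical, and the directory has to be adjusted to wherever the files were downloaded.

```python
INPUT_FILES = ("shakespeare.txt", "all_verbs.txt", "verb_dict.txt")

def load_rdds(sc, data_dir="."):
    """Return {file name: RDD of lines} for the three input files.

    sc       -- an existing SparkContext
    data_dir -- hypothetical location of the downloaded files
    """
    # textFile only reads; the source files are never modified.
    return {name: sc.textFile(f"{data_dir}/{name}") for name in INPUT_FILES}
```

In the notebook this could be called as `rdds = load_rdds(SparkContext.getOrCreate())` after `from pyspark import SparkContext`, and `rdds["shakespeare.txt"].take(3)` then shows the first few lines.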

Output samples:

git clone https://github.com/csenw/cca3.git && cd cca3
sudo chmod -R 777

[Output sample: shakespeare.txt; screenshot not reproduced]

Step 6:

Use learned RDD operations to pre-process the RDD that stores the text:

  1. Remove empty lines.
  2. Remove punctuation that could attach to the verbs.
  3. Change the capitalization or case of the text.
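The three pre-processing steps above could be sketched with `filter` and `flatMap` as follows. The function names are hypothetical, and the punctuation set is an assumption: `string.punctuation` covers ASCII marks only, so curly quotes or words joined by a dash with no spaces would need extra handling.

```python
import string

def clean_line(line):
    """Split a line into lower-case words with surrounding punctuation removed."""
    words = (w.strip(string.punctuation) for w in line.lower().split())
    return [w for w in words if w]  # drop tokens that were only punctuation

def preprocess(lines):
    """lines: RDD of raw text lines -> RDD of cleaned, lower-case words."""
    return (lines
            .filter(lambda line: line.strip() != "")  # 1. remove empty lines
            .flatMap(clean_line))                     # 2. and 3. strip punctuation, lower-case
```

Stripping only at the edges of each word keeps internal apostrophes intact, so a token such as "know'st" stays in one piece instead of being split in two.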

[Output sample: screenshot not reproduced]

Step 7:

Use learned RDD operations to keep all the verbs according to all_verbs.txt.

[Output samples: all_verbs.txt, verb_dict.txt; screenshots not reproduced]
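For Step 7, one possible approach is to collect the verb list into a set, broadcast it, and filter the word RDD against it. This sketch assumes all_verbs.txt holds one verb per line and is small enough to collect to the driver; neither is stated in the handout, and `keep_verbs` is a hypothetical helper name.

```python
def keep_verbs(words, verb_lines, sc):
    """Keep only the words that appear in the verb list.

    words      -- RDD of cleaned, lower-case words
    verb_lines -- RDD of lines from all_verbs.txt (ASSUMED: one verb per line)
    sc         -- the SparkContext, used to broadcast the verb set
    """
    verbs = set(verb_lines
                .map(lambda v: v.strip().lower())
                .filter(lambda v: v != "")
                .collect())
    verb_set = sc.broadcast(verbs)  # ship the set to each executor once
    # Every occurrence is kept, so later counts reflect real frequencies.
    return words.filter(lambda w: w in verb_set.value)
```

A set gives constant-time membership tests, which matters because the filter runs once per word in the collection.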

Step 8:

Use learned RDD operations to count the occurrences of the kept verbs.

[Output sample: screenshot not reproduced]
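Counting in Step 8 is the classic word-count pattern: map each verb to a `(verb, 1)` pair and sum the ones per key with `reduceByKey`. A minimal sketch, with `count_words` as a hypothetical helper name:

```python
from operator import add

def count_words(verbs):
    """verbs: RDD of verb occurrences -> RDD of (verb, count) pairs."""
    return verbs.map(lambda v: (v, 1)).reduceByKey(add)
```

`reduceByKey` combines values within each partition before shuffling, which is generally cheaper than grouping all occurrences first and counting afterwards.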

Step 9:

Use learned RDD operations to merge the verb pairs that are from the same verb. E.g., (work, 100), (works, 50), (working, 150) → (work, 300).
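The merge in Step 9 can be expressed as a `join` between the per-form counts and a `(form, base)` lookup built from the dictionary file. The layout of verb_dict.txt is not described in this handout, so the sketch below ASSUMES each line is a comma-separated list of forms with the base form first (for example `work,works,worked,working`); `parse_dict_line` and `merge_forms` are hypothetical names, and the parser must be adapted to the real file.

```python
from operator import add

def parse_dict_line(line):
    """Turn one dictionary line into [(form, base), ...].

    ASSUMPTION: comma-separated forms, base form first.
    """
    forms = [f.strip().lower() for f in line.split(",") if f.strip()]
    if not forms:
        return []
    base = forms[0]
    # dict.fromkeys removes repeated forms while keeping their order.
    return [(form, base) for form in dict.fromkeys(forms)]

def merge_forms(form_counts, dict_lines):
    """form_counts: RDD of (form, count); dict_lines: RDD of dictionary lines."""
    form_to_base = dict_lines.flatMap(parse_dict_line).distinct()
    return (form_counts
            .join(form_to_base)                    # (form, (count, base))
            .map(lambda kv: (kv[1][1], kv[1][0]))  # (base, count)
            .reduceByKey(add))                     # sum counts per base form
```

Two properties of this design are worth checking against the real data: the inner join silently drops any form that is missing from the dictionary, and a form listed under two base verbs contributes its count to both.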

[Output sample: screenshot not reproduced]

Step 10:

Use learned RDD operations to return the top 10 verbs that are most frequently used in the collection of Shakespeare.
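For Step 10, the `takeOrdered` action returns the first `n` elements under a given sort key, so negating the count yields the most frequent verbs. A minimal sketch, with `top_verbs` as a hypothetical helper name:

```python
def top_verbs(base_counts, n=10):
    """base_counts: RDD of (base_verb, count) -> list of the n most frequent pairs."""
    # Sort by descending count; ties are broken alphabetically for stable output.
    return base_counts.takeOrdered(n, key=lambda kv: (-kv[1], kv[0]))
```

An equivalent alternative is `sortBy` with `ascending=False` followed by `take(n)`; `takeOrdered` avoids sorting the whole RDD when only a few elements are needed.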

[Output sample: screenshot not reproduced]

Note that Steps 5-10 are only a recommended procedure; feel free to use your own processing steps. However, your result should reasonably reflect the top 10 verbs that are most frequently used in the collection of Shakespeare.
