Question 3. Spark Programming and Distributed Execution (25 points)
This question has several parts. All parts relate to the following PySpark application, app.py. The application is submitted to a 5-node EMR cluster consisting of one master node and four worker nodes. Each worker node has 16 GB of memory usable by YARN, and each node has 4 vCPUs. The program uses the same tweets data you used in Assignment 2. The data set contains many tweet objects. Each tweet object has many fields; only the following three fields will be used in this question:
The size of the input file tweets.json is around 6 MB.
1. [3 points] How many times will the input file be scanned when executing this application? Describe a possible improvement to avoid multiple scans and re-computation.
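For context on why multiple scans can happen: Spark evaluates transformations lazily, so each action re-runs the whole lineage from the input file unless an intermediate result is cached with cache() or persist(). The following is a minimal pure-Python toy model of that behaviour; the LazySource class and its names are illustrative and not part of app.py:

```python
class LazySource:
    """Toy stand-in for a Spark input: every action re-reads it unless cached."""

    def __init__(self, records):
        self._records = records
        self.scans = 0          # how many times the "file" was read

    def read(self):
        self.scans += 1
        return list(self._records)


def count_action(src, cache):
    """Toy 'action': uses the cached data if available, else re-reads the source."""
    data = cache if cache is not None else src.read()
    return len(data)


tweets = LazySource(["t1", "t2", "t3"])

# Without caching: two actions -> two scans of the input.
count_action(tweets, None)
count_action(tweets, None)
print(tweets.scans)             # 2

# With "caching": materialise once, then both actions reuse it.
cached = tweets.read()          # analogous to df.cache() + first action
count_action(tweets, cached)
count_action(tweets, cached)
print(tweets.scans)             # 3 in total: one extra scan, then reuse
```

In the actual application, the analogous fix is to call cache() (or persist()) on whichever DataFrame is consumed by more than one action.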
2. [5 points] Identify all variables referring to a DataFrame or an RDD between lines 14 and 30. Describe the record/element structure of each DataFrame or RDD. For DataFrames with the same structure, you only need to describe the structure once.
3. [9 points] Assume the default resource configuration for the Spark application is:
Driver memory: 1G;
Application Master memory: 2G;
Executor memory: 8G;
Executor cores: 4
The submit script is:
spark-submit \
--master yarn \
--deploy-mode cluster \
--num-executors 3 \
app.py
Describe the process YARN uses to allocate resources for this application. Executing this application will generate a number of tasks; each task needs to be allocated to an executor. Also show a possible task allocation plan.
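As a worked example of the memory arithmetic YARN performs for this configuration: each container request is the heap size plus a memory overhead, which by default is max(384 MB, 10% of the heap). The figures below assume those defaults and ignore the cluster-specific rounding YARN applies to container sizes:

```python
def container_mb(heap_mb, min_overhead_mb=384, overhead_fraction=0.10):
    """YARN container size = requested heap + memory overhead (default rule)."""
    overhead = max(min_overhead_mb, int(heap_mb * overhead_fraction))
    return heap_mb + overhead


NODE_MB = 16 * 1024                     # 16 GB of YARN memory per worker node

executor = container_mb(8 * 1024)       # 8G executor heap
am = container_mb(2 * 1024)             # 2G Application Master

print(executor)                         # 8192 + 819 = 9011 MB
print(NODE_MB // executor)              # 1 -> only one executor fits per node
print(am)                               # 2048 + 384 = 2432 MB
```

So with --num-executors 3, each executor container occupies its own worker node, and the ApplicationMaster container (which hosts the driver in cluster deploy mode) runs on the remaining node or in the spare capacity of a node that already hosts an executor.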
4. [8 points] The given application produces top-5 results for tweets having retweets and/or replies. Assuming we are only interested in tweets having both retweets and replies, implement a workload using the PySpark API to produce similar top-5 results ONLY for those tweets. You may reuse code from the given program; if you do, indicate the lines you reuse. You are encouraged to design a more efficient operation sequence to produce the top-5 results.
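One efficient shape for such a workload is to filter first (keep only tweets with both counts positive) and then take the top 5, rather than ranking the whole data set. Below is a pure-Python sketch of that operation sequence over toy records; the field names retweet_count and reply_count and the combined-count score are assumptions for illustration, not taken from app.py. In PySpark the same shape would be a filter(...) followed by orderBy(...).limit(5):

```python
import heapq

# Toy tweet records; field names are assumed for illustration only.
tweets = [
    {"id": i, "retweet_count": rc, "reply_count": rp}
    for i, (rc, rp) in enumerate([(5, 2), (0, 9), (7, 0), (3, 3), (8, 1), (2, 2)])
]

# 1. Filter early: keep only tweets with BOTH retweets and replies.
both = [t for t in tweets if t["retweet_count"] > 0 and t["reply_count"] > 0]

# 2. Top 5 by combined count, without fully sorting the filtered data.
top5 = heapq.nlargest(5, both, key=lambda t: t["retweet_count"] + t["reply_count"])

print([t["id"] for t in top5])   # [4, 0, 3, 5]
```

Filtering before ranking shrinks the data that has to be ordered, which is the kind of more efficient operation sequence the question invites.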