Question 3. Spark Programming and Distributed Execution (25 points)
This question has several parts. All parts relate to the following PySpark application, app.py. The application is submitted to a 5-node EMR cluster consisting of one master node and four worker nodes. Each worker node has 16 GB of memory usable by YARN, and each node has 4 vCPUs. The program uses the same tweets data you used in Assignment 2. The data set contains many tweet objects. Each tweet object has many fields; only the following three fields will be used in this question:
The size of the input file tweets.json is around 6 MB.
1. [3 points] How many times will the input file be scanned when executing this application? Describe a possible improvement to avoid multiple scans and re-computation.
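For context on why multiple scans can happen: Spark evaluates transformations lazily, so each action re-runs the whole lineage from the input file unless an intermediate result is cached with cache() or persist(). The following is a minimal pure-Python toy model of that behaviour; the LazySource class and its names are illustrative and not part of app.py:

```python
class LazySource:
    """Toy stand-in for a Spark input: every action re-reads it unless cached."""

    def __init__(self, records):
        self._records = records
        self.scans = 0          # how many times the "file" was read

    def read(self):
        self.scans += 1
        return list(self._records)


def count_action(src, cache):
    """Toy 'action': uses the cached data if available, else re-reads the source."""
    data = cache if cache is not None else src.read()
    return len(data)


tweets = LazySource(["t1", "t2", "t3"])

# Without caching: two actions -> two scans of the input.
count_action(tweets, None)
count_action(tweets, None)
print(tweets.scans)             # 2

# With "caching": materialise once, then both actions reuse it.
cached = tweets.read()          # analogous to df.cache() + first action
count_action(tweets, cached)
count_action(tweets, cached)
print(tweets.scans)             # 3 in total: one extra scan, then reuse
```

In the actual application, the analogous fix is to call cache() (or persist()) on whichever DataFrame is consumed by more than one action.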
2. [5 points] Identify all variables referring to a DataFrame or an RDD between lines 14 and 30. Describe the record/element structure of each DataFrame or RDD. For DataFrames with the same structure, you only need to describe the structure once.
3. [9 points] Assume the default resource configuration for the Spark application is:
Driver memory: 1G;
Application Master memory: 2G;
Executor memory: 8G;
Executor cores: 4
The submit script is:
spark-submit \
--master yarn \
--deploy-mode cluster \
--num-executors 3 \
app.py
Describe the process YARN uses to allocate resources for this application. Executing this application will generate a number of tasks; each task needs to be allocated to an executor. Also show a possible task allocation plan.
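As a worked example of the memory arithmetic YARN performs for this configuration: each container request is the heap size plus a memory overhead, which by default is max(384 MB, 10% of the heap). The figures below assume those defaults and ignore the cluster-specific rounding YARN applies to container sizes:

```python
def container_mb(heap_mb, min_overhead_mb=384, overhead_fraction=0.10):
    """YARN container size = requested heap + memory overhead (default rule)."""
    overhead = max(min_overhead_mb, int(heap_mb * overhead_fraction))
    return heap_mb + overhead


NODE_MB = 16 * 1024                     # 16 GB of YARN memory per worker node

executor = container_mb(8 * 1024)       # 8G executor heap
am = container_mb(2 * 1024)             # 2G Application Master

print(executor)                         # 8192 + 819 = 9011 MB
print(NODE_MB // executor)              # 1 -> only one executor fits per node
print(am)                               # 2048 + 384 = 2432 MB
```

So with --num-executors 3, each executor container occupies its own worker node, and the ApplicationMaster container (which hosts the driver in cluster deploy mode) runs on the remaining node or in the spare capacity of a node that already hosts an executor.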
4. [8 points] The given application produces top-5 results for tweets having retweets and/or replies. Assuming we are only interested in tweets having both retweets and replies, implement a workload using the PySpark API to produce similar top-5 results ONLY for those tweets. You may reuse code from the given program; if you do, indicate the lines you reuse. You are encouraged to design a more efficient operation sequence to produce the top-5 results.
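One efficient shape for such a workload is to filter first (keep only tweets with both counts positive) and then take the top 5, rather than ranking the whole data set. Below is a pure-Python sketch of that operation sequence over toy records; the field names retweet_count and reply_count and the combined-count score are assumptions for illustration, not taken from app.py. In PySpark the same shape would be a filter(...) followed by orderBy(...).limit(5):

```python
import heapq

# Toy tweet records; field names are assumed for illustration only.
tweets = [
    {"id": i, "retweet_count": rc, "reply_count": rp}
    for i, (rc, rp) in enumerate([(5, 2), (0, 9), (7, 0), (3, 3), (8, 1), (2, 2)])
]

# 1. Filter early: keep only tweets with BOTH retweets and replies.
both = [t for t in tweets if t["retweet_count"] > 0 and t["reply_count"] > 0]

# 2. Top 5 by combined count, without fully sorting the filtered data.
top5 = heapq.nlargest(5, both, key=lambda t: t["retweet_count"] + t["reply_count"])

print([t["id"] for t in top5])   # [4, 0, 3, 5]
```

Filtering before ranking shrinks the data that has to be ordered, which is the kind of more efficient operation sequence the question invites.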