Question 4 Spark Programming and Distributed Execution (25 points)
This question has several parts. All parts relate to the following Spark program.
1. [8 points] Executing this application may start one or more jobs. Describe the data flow graph of each job by citing the line numbers of the operations defined in the driver program. DAG diagrams produced by the Spark history server will not earn any points.
2. [6 points] Describe the stages of the jobs identified in part 1. Highlight each operation where a shuffle happens.
3. [8 points] Describe the data type of each of the four variables in the program: var1, var2, var3 and var4. If a variable refers to an RDD or DataFrame, describe the type of its elements.
4. [3 points] What summary statistics does var4 represent?