
P4 (regular project): Spark, Loan Applications

Overview

In this project, we'll use Spark to analyze loan applications in WI. You'll load the data into Hive tables and views so you can easily query it. The big table (loans) has many ID columns; you'll join these against other tables/views to determine what the IDs mean. In addition to your analysis, you'll study the performance impact of caching and bucketing.

Important: you'll answer ten questions in this project. Paste each question and its number (e.g., "# Q1: ...") as a comment in your notebook prior to each answer so we can easily search your notebook and give you credit for your answers.

Learning objectives:

  • load data to Hive tables and views
  • write queries that use filtering, joining, grouping, and windowing
  • interpret Spark query explanations
  • optimize queries with bucketing and caching

Before starting, please review the general project directions.

Part 1: Data Setup

Set up an environment with three containers, running the following:

  1. HDFS Namenode and Datanode on fresh file system; Spark boss; JupyterLab
  2. Spark worker
  3. Spark worker

For inspiration, here are some files that could help: Container Setup Files

Create a p4.ipynb notebook and start a Spark session. We'll enable Hive, with metadata stored on HDFS:

from pyspark.sql import SparkSession
spark = (SparkSession.builder.appName("cs544")
         .master("spark://main:7077")
         .config("spark.executor.memory", "512M")
         .config("spark.sql.warehouse.dir", "hdfs://main:9000/user/hive/warehouse")
         .enableHiveSupport()
         .getOrCreate())

Each of the zips listed below contains one or more CSV files. Upload each CSV to HDFS. You can use shell commands and/or Python code, but be sure to show your work in p4.ipynb. Register or load the data into Spark (a sketch follows the requirements).

Requirements:

  • let Spark infer the schema
  • use Spark options/functions so that you can run the data setup code more than once, and it will simply replace previous data
  • code_sheets.zip: create these temporary views corresponding to the CSVs of the same names: "ethnicity", "race", "sex", "states", "counties", "tracts", "action_taken", "agency", "denial_reason", "loan_type", "loan_purpose", "preapproval", "property_type"
  • arid2017_to_lei_xref_csv.zip: load the data to a table named "banks"
  • hdma-wi-2021.zip: load the data to a table named "loans"
  • the "loans" table should be divided into 8 buckets by the county_code column

Q1: what tables are in our warehouse?

You can use spark.sql("SHOW TABLES").show() to answer. It should look like this:

+---------+-------------+-----------+
|namespace|    tableName|isTemporary|
+---------+-------------+-----------+
|  default|        banks|      false|
|  default|        loans|      false|
|         | action_taken|       true|
|         |       agency|       true|
|         |     counties|       true|
|         |denial_reason|       true|
|         |    ethnicity|       true|
|         | loan_purpose|       true|
|         |    loan_type|       true|
|         |  preapproval|       true|
|         |property_type|       true|
|         |         race|       true|
|         |          sex|       true|
|         |       states|       true|
|         |       tracts|       true|
+---------+-------------+-----------+

Part 2: Filter and Join

Q2: how many banks contain the word "first" in their name? Which ones contain "second"?

Your filtering should be case-insensitive. We're looking for a number for the first part and an actual listing for the second part.
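
Here's one possible shape, using lower() with LIKE (other case-insensitive approaches are fine):

# A count for the "first" part...
spark.sql("""
    SELECT COUNT(*)
    FROM banks
    WHERE lower(respondent_name) LIKE '%first%'
""").show()

# ...and an actual listing for the "second" part
spark.sql("""
    SELECT respondent_name
    FROM banks
    WHERE lower(respondent_name) LIKE '%second%'
""").show(truncate=False)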

Q3: how many loan applications has your "favorite" bank received in the dataset?

Browse the dataset a bit and pick a "favorite" bank for these questions. It could be one you use, or one with an interesting name, or one that is popular in WI. Pick one that has at least a couple hundred loans in the dataset.

Use an INNER JOIN between banks and loans to answer this question. The lei column in loans identifies the bank. Filter on respondent_name (do NOT hardcode the LEI).
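
Here's a sketch, using the Wells Fargo name from Part 4 as a stand-in for whatever bank you picked:

# Count this bank's applications; keeping the DataFrame around is handy
# because Q4 asks about results.explain("formatted")
results = spark.sql("""
    SELECT COUNT(*) AS applications
    FROM banks
    INNER JOIN loans
    ON banks.lei_2020 = loans.lei
    WHERE banks.respondent_name = 'Wells Fargo Bank, National Association'
""")
results.show()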

Q4: what does results.explain("formatted") tell us about Spark's query plan for Q3?

Show results.explain("formatted") and write a comment making observations about the following:

  1. which table is sent to every executor via a BroadcastExchange operation?
  2. on which tables is "is not null" filtering added by the optimizer?
  3. which Functions do the HashAggregates use?

Q5: what are the top 10 biggest loans (in terms of loan_amount) that were approved by your favorite bank?

A loan is approved if action_taken is "Loan originated". Your answer should have the following columns: census_tract, county, loan_purpose, derived_dwelling_category, thousands, interest_rate, years, action_taken

Join with the appropriate views to get the values for these columns. It should look like the following, though yours will differ depending on which bank you picked:

[example output not shown]

Use LEFT JOINs so that we don't ignore loans with missing values (for example, you can see some Nones in the loan_purpose column of the example).

Joining counties will be tricky (a sketch follows these tips). Tips:

  • sometimes multiple rows in counties have the same STATE, COUNTY, NAME combination; eliminate duplicates before joining
  • county_code in loans is the state and county codes concatenated together, whereas counties has these as separate columns. For example, 55025 is the county_code for Dane County in loans, but this shows up as STATE=55 and COUNTY=25 in the counties view. Be careful: county codes are 3 digits long (like 025), but the counties view drops the leading zeros.
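
Here's a sketch of one approach with PySpark functions; the rebuilt "code" column and the "counties_dedup" view are names we introduce here, not part of the dataset:

from pyspark.sql import functions as F

# Drop duplicate county rows, then rebuild a 5-digit code matching
# loans.county_code: 2-digit STATE + COUNTY left-padded to 3 digits
counties = (spark.table("counties")
            .dropDuplicates(["STATE", "COUNTY", "NAME"])
            .withColumn("code",
                        F.concat(F.col("STATE").cast("string"),
                                 F.lpad(F.col("COUNTY").cast("string"), 3, "0"))))
counties.createOrReplaceTempView("counties_dedup")   # reused in Part 3

# LEFT JOIN keeps loans with unmatched counties; cast in case Spark
# inferred county_code as a number
loans_df = spark.table("loans")
joined = loans_df.join(counties,
                       loans_df["county_code"].cast("string") == counties["code"],
                       "left")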

Part 3: GROUP BY and Windowing

Q6: when computing a MEAN aggregate per group of loans, when do we require network I/O between the partial_mean and mean operations?

Write some simple GROUP BY queries on loans and call .explain(). Try grouping by both the county_code and lei columns.

If a network transfer is necessary for one query but not the other, write a comment explaining why. You might want to look back at how you loaded the data to a Hive table earlier.
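
For example (AVG over loan_amount is just a stand-in aggregate):

# loans is bucketed by county_code, so check whether this plan needs an
# Exchange (shuffle) between the partial and final aggregates...
spark.sql("SELECT county_code, AVG(loan_amount) FROM loans GROUP BY county_code").explain()

# ...compared with grouping by a column the table is not bucketed by
spark.sql("SELECT lei, AVG(loan_amount) FROM loans GROUP BY lei").explain()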

Q7: what are the average interest rates for Wells Fargo applications for the ten counties where Wells Fargo receives the most applications?

Answer with a bar plot like this:

[example bar plot not shown]

The bars are sorted by the number of applications in each county (for example, Milwaukee has the most applications, Waukesha the second most, etc.).
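
One possible shape for the query, assuming the "counties_dedup" view sketched in the Q5 tips (a name we chose, not part of the dataset):

# Top-10 counties by Wells Fargo application count, with average rates
pdf = spark.sql("""
    SELECT c.NAME AS county,
           COUNT(*) AS applications,
           AVG(l.interest_rate) AS avg_interest_rate
    FROM banks b
    INNER JOIN loans l ON b.lei_2020 = l.lei
    LEFT JOIN counties_dedup c ON CAST(l.county_code AS STRING) = c.code
    WHERE b.respondent_name = 'Wells Fargo Bank, National Association'
    GROUP BY c.NAME
    ORDER BY applications DESC
    LIMIT 10
""").toPandas()

# Rows are already in application-count order; bar heights show the rate
pdf.plot.bar(x="county", y="avg_interest_rate")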

Q8: what is the second biggest loan application amount in each county?

Answer with a plot like the following:

[example plot not shown]

Hint: if we were asking for the biggest loan in each county, you could use GROUP BY and MAX. We're asking for the second biggest, so consider whether a windowing function can help.
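
Here's a sketch of the windowing approach (row_number() breaks ties arbitrarily; the join to county names, as in Q5, is omitted):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Rank loans within each county by amount, descending, then keep rank 2
w = Window.partitionBy("county_code").orderBy(F.desc("loan_amount"))
second = (spark.table("loans")
          .withColumn("rn", F.row_number().over(w))
          .where(F.col("rn") == 2)
          .select("county_code", "loan_amount"))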

Part 4: Caching

Create a DataFrame from the following query:

SELECT interest_rate
FROM banks
INNER JOIN loans
ON banks.lei_2020 = loans.lei 
WHERE banks.respondent_name = 'Wells Fargo Bank, National Association'

Q9: what is the cost of caching and the impact on subsequent queries?

Write a loop that calls .count() on your DataFrame ten times. Measure the latency in milliseconds each time. On the 5th time, cache the DataFrame.
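
Here's one possible shape for the experiment (timing with time.time() is just one option):

import time

# The Part 4 query from above
df = spark.sql("""
    SELECT interest_rate
    FROM banks
    INNER JOIN loans
    ON banks.lei_2020 = loans.lei
    WHERE banks.respondent_name = 'Wells Fargo Bank, National Association'
""")

times_ms = []
for i in range(10):
    if i == 4:
        df.cache()   # lazy: the 5th count pays the cost of populating the cache
    start = time.time()
    df.count()
    times_ms.append((time.time() - start) * 1000)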

Answer with a scatter or line plot showing the latency of each query execution that distinguishes cached from uncached runs. Here's one way to visualize it:

[example plot not shown]

Q10: what is the impact of caching a single partition on load balance?

Repartition your DataFrame to have 1 partition and cache it again. Write another loop that calls .count() ten times.

After each query, use Spark's REST API to check how many tasks each executor has completed: https://spark.apache.org/docs/latest/monitoring.html#rest-api

Make a plot like the following to show how much work each executor has completed:

[example plot not shown]

Hints:

  • you can use http://localhost:4040/api/v1/applications to find your app's ID
  • you can use http://localhost:4040/api/v1/applications/{app_id}/executors to find stats about executors for your app
  • a correct experiment should show that one executor always does the work (whichever one is caching the partition). Results may vary in terms of which of the two executors does the work, and in the initial counts prior to the start of the experiment.
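
Here's a sketch of the polling loop; the "id" and "completedTasks" fields come from the ExecutorSummary objects the REST API returns:

import requests

# One cached partition lives on a single executor, so that executor
# should do all the counting work
df1 = df.repartition(1).cache()

base = "http://localhost:4040/api/v1/applications"
app_id = requests.get(base).json()[0]["id"]

history = []
for _ in range(10):
    df1.count()
    executors = requests.get(f"{base}/{app_id}/executors").json()
    # completedTasks is cumulative; the driver shows up with id "driver"
    history.append({e["id"]: e["completedTasks"] for e in executors})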
