Big Data Programming
CISC 5950 — Project 1
In CISC 5950, we have learned the following topics:
- Set up a 3-node cluster with Hadoop Distributed File System and run examples.
- On top of HDFS, set up the cluster with MapReduce programming framework.
- Run examples of MapReduce programs.
- Scheduling on the Cloud.
In this project, we are going to design our own Hadoop MapReduce-based program to analyze the data. The project consists of two parts.
NY Parking Violations
The NYC Department of Finance collects data on every parking ticket issued in NYC (~10M per year!). This data is made publicly available to aid in ticket resolution and to guide policymakers. You can find the data at the NYC Parking Data link.
The above figure shows several records, where each row represents a parking ticket and the columns are the details of the ticket. To start the project, you have to:
- Start the 3-node cluster
- Set up the HDFS
- Store the data in HDFS
- Set up the MapReduce framework along with the scheduler for resource management.
By analyzing the data, we need to answer the following:
- When are tickets most likely to be issued?
- What are the most common years and types of cars to be ticketed?
- Where are tickets most commonly issued?
- Which color of the vehicle is most likely to get a ticket?
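As an illustration of how the first question could be framed as a map step and a reduce step, here is a minimal sketch that counts tickets per hour of day. It runs locally on an in-memory CSV string rather than on the cluster. The column name `Violation Time`, its assumed format (`hhmm` followed by `A` or `P`), and the sample rows are assumptions made for this example and should be checked against the real file.

```python
import csv
from collections import defaultdict
from io import StringIO

# Assumption: the CSV has a "Violation Time" column formatted like "0752A"
# (hhmm followed by A or P). Adjust the name and the parsing to the real data.
TIME_COLUMN = "Violation Time"


def parse_hour(raw):
    """Return the hour of day (0-23), or None if the value is malformed."""
    raw = raw.strip().upper()
    if len(raw) != 5 or not raw[:4].isdigit() or raw[4] not in "AP":
        return None
    hour = int(raw[:2])
    if hour > 12:
        return None
    hour %= 12                      # "12" becomes 0 before the AM/PM shift
    return hour + 12 if raw[4] == "P" else hour


def mapper(rows):
    """Map step: emit (hour, 1) for every row with a usable time."""
    for row in rows:
        hour = parse_hour(row.get(TIME_COLUMN) or "")
        if hour is not None:
            yield hour, 1


def reducer(pairs):
    """Reduce step: sum the counts for each key."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)


def tickets_by_hour(csv_text):
    rows = csv.DictReader(StringIO(csv_text))
    return reducer(mapper(rows))


# Made-up sample rows, only to exercise the functions.
sample = (
    "Summons Number,Violation Time\n"
    "1,0752A\n2,0810A\n3,0759A\n4,0115P\n5,bad\n"
)
counts = tickets_by_hour(sample)
busiest = max(counts, key=counts.get)
print(busiest, counts[busiest])
```

In a Hadoop Streaming job the same two functions would typically live in separate mapper and reducer scripts that read from standard input and write tab-separated key/value lines. The other questions (vehicle year, type, location, color) can follow the same pattern with a different key.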
NBA Shot Logs
This is the data (https://www.kaggle.com/dansbecker/nba-shot-logs) on shots taken during the 2014-2015 season: who took the shot, where on the floor the shot was taken from, who the nearest defender was, how far away the nearest defender was, the time on the shot clock, and much more. The column titles are generally self-explanatory.
The above figure shows several records, where each row represents a shot and the columns are the details of the shot, e.g. the game ID, who the defender is, and the distance between the shooter and the defender.
By analyzing the data, we need to answer the following:
- For each pair of players (A, B), we define the fear score of A when facing B as A's hit rate when B is the closest defender while A is shooting. Based on the fear score, for each player, please find out who is his "most unwanted defender".
- For each player, we define the comfortable zone of shooting as a matrix of {SHOT DIST, CLOSE DEF DIST, SHOT CLOCK}. Please develop a MapReduce-based algorithm to classify each player’s records into 4 comfortable zones. Considering the hit rate, which zone is the best for James Harden, Chris Paul, Stephen Curry and LeBron James?
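The fear score from the first question above can be read as a keyed aggregation: map every shot to the key (shooter, defender) and reduce to made shots divided by attempts. The sketch below is one possible reading, not a reference solution. It assumes the columns are named `player_name`, `CLOSEST_DEFENDER` and `SHOT_RESULT` (with the value `made` for a hit), and it interprets the "most unwanted defender" as the defender against whom the shooter's hit rate is lowest. The records are made up.

```python
from collections import defaultdict


def mapper(records):
    """Map step: emit ((shooter, defender), (made, attempts)) per shot."""
    for rec in records:
        made = 1 if rec["SHOT_RESULT"] == "made" else 0
        yield (rec["player_name"], rec["CLOSEST_DEFENDER"]), (made, 1)


def reducer(pairs):
    """Reduce step: hit rate per (shooter, defender) pair."""
    totals = defaultdict(lambda: [0, 0])
    for key, (made, attempts) in pairs:
        totals[key][0] += made
        totals[key][1] += attempts
    return {key: made / attempts for key, (made, attempts) in totals.items()}


def most_unwanted_defender(fear_scores):
    """For each shooter, keep the defender with the lowest hit rate.

    Assumption: "most unwanted" means the lowest fear score.
    """
    worst = {}
    for (shooter, defender), rate in fear_scores.items():
        if shooter not in worst or rate < worst[shooter][1]:
            worst[shooter] = (defender, rate)
    return worst


# Made-up records using the assumed column names.
shots = [
    {"player_name": "A", "CLOSEST_DEFENDER": "X", "SHOT_RESULT": "made"},
    {"player_name": "A", "CLOSEST_DEFENDER": "X", "SHOT_RESULT": "missed"},
    {"player_name": "A", "CLOSEST_DEFENDER": "Y", "SHOT_RESULT": "missed"},
    {"player_name": "A", "CLOSEST_DEFENDER": "Y", "SHOT_RESULT": "missed"},
    {"player_name": "B", "CLOSEST_DEFENDER": "X", "SHOT_RESULT": "made"},
]
fear = reducer(mapper(shots))
unwanted = most_unwanted_defender(fear)
print(unwanted)
```

Pairs with very few attempts give noisy rates, so a minimum-attempts threshold before picking the defender may be worth considering.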
Bonus Question
The biggest challenge when using K-Means is to decide on the number of clusters. Having more clusters creates some small classes with very few records, while having fewer clusters leads to classes that are too general. Based on the K-Means algorithm above, try to answer the following questions:
- Given a black vehicle parked illegally at 34510, 10030, 34050 (street codes), what is the probability that it will get a ticket? (Very rough prediction.)
- At 10 am, I want to go to Lincoln Center and I only want to walk within 0.5 mile. Where should I park? (Divided into zones.)
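K-Means fits the MapReduce model naturally: the map step assigns each record to its nearest centroid, and the reduce step averages the records assigned to each centroid. The sketch below shows that loop on a single machine, with made-up three-dimensional points standing in for (SHOT DIST, CLOSE DEF DIST, SHOT CLOCK). The function names, the initial centroids and the use of k = 2 (to keep the toy data easy to follow) are assumptions of this example.

```python
from collections import defaultdict


def nearest(point, centroids):
    """Index of the centroid with the smallest squared Euclidean distance."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def mapper(points, centroids):
    """Map step: emit (centroid index, point) for every point."""
    for point in points:
        yield nearest(point, centroids), point


def reducer(pairs, centroids):
    """Reduce step: move each centroid to the mean of its assigned points.

    A centroid that received no points keeps its previous position.
    """
    groups = defaultdict(list)
    for idx, point in pairs:
        groups[idx].append(point)
    updated = list(centroids)
    for idx, members in groups.items():
        updated[idx] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return updated


def kmeans(points, centroids, max_iter=20):
    """Repeat map + reduce until the centroids stop moving."""
    for _ in range(max_iter):
        updated = reducer(mapper(points, centroids), centroids)
        if updated == centroids:
            break
        centroids = updated
    return centroids


# Made-up points and initial centroids.
points = [(1.0, 1.0, 1.0), (3.0, 3.0, 3.0), (21.0, 21.0, 21.0), (23.0, 23.0, 23.0)]
initial = [(1.0, 1.0, 1.0), (23.0, 23.0, 23.0)]
result = kmeans(points, initial)
print(result)
```

On a cluster, each iteration would be one MapReduce job, with the current centroids shipped to every mapper. Rerunning with different numbers of initial centroids is one way to explore the trade-off between too many and too few clusters.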
Grading Rubric
You should complete the lab in groups of up to 3 students.
- (70%) P1: NY Parking Violations (17.5% × 4)
- (20%) P2: NBA Shot Logs (10% × 2)
- (10%) Two reports on your design and experiments. Please be as detailed as possible and include your screenshots. In addition, you also need to write two README files for P1 and P2.
- (5%) Bonus Question
Submission
You are expected to upload a zip (or tar) file to Blackboard before the deadline. The file should include two (or three) folders:
- Part1: your code, report and README
- Part2: your code, report and README
- Bonus: your code, report and README
Useful Links
- Analysis of NYC Parking Tickets.
- Preliminary Data Visualization.
- Exploring 42.3M NYC Parking Tickets.
- NY Parking Violations Issued.
- Insights From Raw NBA Shot Log Data.
- Investigating the hot hand phenomenon in the NBA (CODE).
- Parallel K-Means Clustering Based on MapReduce.
- NBA 16-17 regular season shot log.
- The Fear Factor.
- The Best And Worst Defenders.
- NBA Classification.
- Stephen Curry’s Decision Tree.
- Points per Match (ATL vs WAS only).
- MapReduce-kmeans.