CSE3BDC Big Data Management On The Cloud Assignment: Analysing Bank Data and Twitter Time Series Data

Department of Computer Science and Computer Engineering CSE3BDC Assignment 2024

1. Gain in-depth experience playing around with big data tools (Hive, Spark RDDs, and Spark SQL).
2. Solve challenging big data processing tasks by finding highly efficient solutions.
3. Experience processing three different types of real data:
   a. Standard multi-attribute data (Bank data).
   b. Time series data (Twitter feed data).
   c. Bag of words data.
4. Practice using programming APIs to find the best API calls to solve your problem. Here are the API descriptions for Hive and Spark (for Spark, look especially under RDD, where there are a lot of really useful API calls):
   a) [Hive] https://cwiki.apache.org/confluence/display/Hive/LanguageManual
   b) [Spark] http://spark.apache.org/docs/latest/api/scala/index.html#package
   c) [Spark SQL] https://spark.apache.org/docs/latest/sql-programming-guide.html
      https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset
      https://spark.apache.org/docs/latest/api/sql/index.html
   - If you are not sure what a Spark API call does, try to write a small example and try it in the spark shell.

This assignment is due 11:59 pm on Friday 24th of May, 2024.

Penalties are applied to late assignments (accepted up to 5 business days after the due date only). Five percent is deducted per business day late. A mark of zero will be assigned to assignments submitted more than 5 days late.

This is an individual assignment. You are not permitted to work as a part of a group when writing this assignment.

Submission checklist

• Ensure that all of your solutions read their input from the full data files (not the small example versions).
• Check that all of your solutions run without crashing in the docker containers that you used in the labs.
• Delete all output files.
• Archive up everything into a single zip file and submit your assignment via LMS.

Copying, Plagiarism

Plagiarism is the submission of somebody else’s work in a manner that gives the impression that the work is your own. For individual assignments, plagiarism includes the case where two or more students work collaboratively on the assignment. The Department of Computer Science and Computer Engineering treats plagiarism very seriously. When it is detected, penalties are strictly imposed.

If you are working on your assignment on the lab computers, make sure you delete the virtual machine and empty the recycle bin before you leave. Otherwise, other students may be able to see your solutions.

ChatGPT and similar AI tools

A key purpose of this assessment task is to test your own ability to complete the assigned tasks. Therefore, the use of ChatGPT, AI tools or chatbots with similar functionality is prohibited for this assessment task. Students who are found to be in breach of this rule will be subject to normal academic misconduct measures. Additionally, students may be engaged to provide an oral validation of their understanding of their submitted work (e.g. coding).

Expected quality of solutions

a) In general, writing more efficient code (fewer reads from and writes to HDFS and fewer data shuffles) will be rewarded with more marks.
b) This entire assignment can be done using the docker containers supplied in the labs and the supplied data sets without running out of memory. It is time to show your skills!
c) I am not too fussed about the layout of the output. As long as it looks similar to the example outputs for each task, that will be good enough. The idea is not to spend too much time massaging the output into the right format but instead to spend the time solving problems.
d) For Hive queries, we prefer answers that use fewer tables.

The questions in the assignment will be labelled using the following:

• [Hive]
  o Means this question needs to be done using Hive.
• [Spark RDD]
  o Means this question needs to be done using Spark RDDs; you are not allowed to use any Spark SQL features like DataFrames or Datasets.
• [Spark SQL]
  o Means this question needs to be done using Spark SQL, and therefore you are not allowed to use RDDs. In addition, you need to do these questions using the Spark DataFrame or Dataset API; do not use SQL syntax.

Assignment structure:

• A script which puts all of the data files into HDFS automatically is provided for you. Whenever you start the docker container again you will need to run the following script to upload the data to HDFS again, since HDFS state is not maintained across docker runs:

             $ bash put_data_in_hdfs.sh

The script will output the names of all of the data files it copies into HDFS. If you do not run this script, solutions to the Spark questions will not work since they load data from HDFS.

To put the files onto HDFS do the following:

1. First start the docker container using run.sh like you have done for your labs.
2. Change to the directory that contains the put_data_in_hdfs.sh file and then run the following command:
```
bash put_data_in_hdfs.sh
```
3. The above will put all the assignment files into HDFS. You can now look at the HDFS contents in Hue as follows: open the Firefox browser and go to the URL localhost:8888
4. Type in username: root and password: root
5. Next select the files icon on the left to see the files you have uploaded to HDFS.

For each Hive question a skeleton .hql file is provided for you to write your solution in. You can run these just like you did in labs:
```
          $ hive -f Task_XX.hql
```
For each Spark question, a skeleton project is provided for you. Write your solution in the .scala file in the src directory. Build and run your Spark code using the provided scripts:

$ bash build_and_run.sh

Follow the instructions below to run a small test program that outputs to HDFS so you can see the output.

1. Change to the Task_test directory and type the following command:

             bash build_and_run.sh

2. Next look at the output of the program in Hue.

Tips:

• Look at the data files before you begin each task. Try to understand what you are dealing with!
• For each subtask we provide small example input and the corresponding output in the assignment specifications below. These small versions of the files are also supplied with the assignment (they have “-small” in the name). It’s a good idea to get your solution working on the small inputs first before moving on to the full files.
• In addition to testing the correctness of your code using the very small example input, you should also use the large input files that we provide to test the scalability of your solutions.
• It can take some time to build and run Spark applications from .scala files. So for the Spark questions it’s best to experiment using spark-shell first to figure out a working solution, and then put your code into the .scala files afterwards. As an example, you can try copying the highlighted lines from the Task_test source file into the spark shell.

Task 1: Analysing Bank Data [38 marks total]

We will be doing some analytics on real data from a Portuguese banking institution [1]. The data is stored in a semicolon (“;”) delimited format.

The data is supplied with the assignment at the following locations:

Small version: Task_1/Data/bank-small.csv
Full version: Task_1/Data/bank.csv

The data has the following attributes:

| Attribute index | Attribute name | Description |
|---|---|---|
| 0 | age | numeric |
| 1 | job | type of job (categorical: "admin.", "unknown", "unemployed", "management", "housemaid", "entrepreneur", "student", "blue-collar", "self-employed", "retired", "technician", "services") |
| 2 | marital | marital status (categorical: "married", "divorced", "single"; note: "divorced" means divorced or widowed) |
| 3 | education | (categorical: "unknown", "secondary", "primary", "tertiary") |
| 4 | default | has credit in default? (binary: "yes", "no") |
| 5 | balance | average yearly balance, in euros (numeric) |
| 6 | housing | has housing loan? (binary: "yes", "no") |
| 7 | loan | has personal loan? (binary: "yes", "no") |
| 8 | contact | contact communication type (categorical: "unknown", "telephone", "cellular") |
| 9 | day | last contact day of the month (numeric) |
| 10 | month | last contact month of year (categorical: "jan", "feb", "mar", ..., "nov", "dec") |
| 11 | duration | last contact duration, in seconds (numeric) |
| 12 | campaign | number of contacts performed during this campaign and for this client (numeric, includes last contact) |
| 13 | pdays | number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted) |
| 14 | previous | number of contacts performed before this campaign and for this client (numeric) |
| 15 | poutcome | outcome of the previous marketing campaign (categorical: "unknown", "other", "failure", "success") |
| 16 | termdeposit | has the client subscribed a term deposit? (binary: "yes", "no") |
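As a minimal sketch of how the attribute indices map onto one record, the plain Scala snippet below splits a semicolon-delimited line and picks a few fields by index. The sample line is hypothetical: only the job, marital, education, balance and loan values follow the first row of the small example used in this task, and the remaining values are made up. `BankRecord` and `parseLine` are illustrative names that are not part of the supplied skeleton code, and the real file may differ in quoting or contain a header row.

```scala
// Illustrative only: plain Scala, no Spark or Hive involved.
object BankLineSketch {
  // A few of the 17 attributes, named after the attribute table.
  final case class BankRecord(age: Int, job: String, marital: String,
                              education: String, balance: Int, loan: String)

  // Split on ';' and strip the surrounding double quotes from text fields.
  def parseLine(line: String): BankRecord = {
    val f = line.split(";").map(_.trim)
    def text(i: Int): String = f(i).stripPrefix("\"").stripSuffix("\"")
    BankRecord(
      age = f(0).toInt,        // index 0
      job = text(1),           // index 1
      marital = text(2),       // index 2
      education = text(3),     // index 3
      balance = f(5).toInt,    // index 5
      loan = text(7)           // index 7
    )
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical record with 17 fields (an assumption, not taken from the data file).
    val sample =
      "58;\"management\";\"married\";\"tertiary\";\"no\";2143;\"no\";\"yes\";" +
      "\"unknown\";5;\"may\";261;1;-1;0;\"unknown\";\"no\""
    println(parseLine(sample))
  }
}
```

Whether to strip the quotes is a design choice: the example outputs for the Hive subtasks keep them, so a solution may prefer to leave the text fields untouched.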

[1] Banking data source: http://archive.ics.uci.edu/ml/datasets/Bank+Marketing

Here is a small example of the bank data that we will use to illustrate the subtasks below (we only list a subset of the attributes in this example, see the above table for the description of the attributes):

| job | marital | education | balance | loan |
|---|---|---|---|---|
| management | Married | tertiary | 2143 | Yes |
| technician | Divorced | secondary | 29 | Yes |
| entrepreneur | Single | secondary | 2 | No |
| blue-collar | Married | unknown | 1506 | No |
| services | Divorced | secondary | 829 | Yes |
| technician | Married | tertiary | 929 | Yes |
| management | Divorced | tertiary | 22 | No |
| technician | Married | primary | 10 | No |

Please note we specify whether you should use [Hive] or [Spark RDD] for each subtask at the beginning of each subtask.

a) [Hive] Report the number of clients of each job category. Write the results to “Task_1a-out”. For the above small example data set you would report the following (output order is not important for this question):

    "blue-collar" 1
    "entrepreneur" 1
    "management" 2
    "services" 1
    "technician" 3

[8 marks]

b) [Hive] Report the average yearly balance for all people in each education category. Write the results to “Task_1b-out”. For the small example data set you would report the following (output order is not important for this question):

    "primary" 10.0
    "secondary" 286.6666666666667
    "tertiary" 1031.3333333333333
    "unknown" 1506.0

[8 marks]

c) [Spark RDD] Group balance into the following three categories:

   a. Low: -infinity to 500
   b. Medium: 501 to 1500
   c. High: 1501 to +infinity

Report the number of people in each of the above categories. Write the results to “Task_1c-out” in text file format. For the small example data set you should get the following results (output order is not important in this question):

    (High,2)
    (Medium,2)
    (Low,4)

[10 marks]
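The three balance categories can be written as one small function. The sketch below is plain Scala over an in-memory list, not an RDD solution; it only checks the category boundaries against the small example. The names `category` and `countByCategory` are illustrative assumptions.

```scala
// Illustrative only: plain Scala collections, no Spark.
object BalanceCategorySketch {
  // Boundaries follow the task text: Low up to 500, Medium 501..1500, High from 1501.
  def category(balance: Int): String =
    if (balance <= 500) "Low"
    else if (balance <= 1500) "Medium"
    else "High"

  // Count how many balances fall into each category.
  def countByCategory(balances: Seq[Int]): Map[String, Int] =
    balances.groupBy(category).map { case (cat, xs) => cat -> xs.size }

  def main(args: Array[String]): Unit = {
    // Balances from the small example table.
    val balances = Seq(2143, 29, 2, 1506, 829, 929, 22, 10)
    countByCategory(balances).foreach(println) // order of the lines is unspecified
  }
}
```

On the eight example balances this gives 4 Low, 2 Medium and 2 High, which agrees with the expected results listed for this subtask.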

d) [Spark RDD] Sort all people in ascending order of education. For people with the same education, sort them in descending order by balance. This means that all people with the same education should appear grouped together in the output. For each person, report the following attribute values: education, balance, job, marital, loan. Write the results to “Task_1d-out” in text file format (multiple parts are allowed). For the small example data set, you would report the following:

    ("primary",10,"technician","married","no")
    ("secondary",829,"services","divorced","yes")
    ("secondary",29,"technician","divorced","yes")
    ("secondary",2,"entrepreneur","single","no")
    ("tertiary",2143,"management","married","yes")
    ("tertiary",929,"technician","married","yes")
    ("tertiary",22,"management","divorced","no")
    ("unknown",1506,"blue-collar","married","no")

[12 marks]
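One way to picture the required ordering is a composite sort key: education ascending, then balance descending. The sketch below uses plain Scala collections and negates the balance inside the key; this is just one possible technique, not the prescribed one, and the names `Person`, `smallExample` and `sortPeople` are illustrative assumptions.

```scala
// Illustrative only: plain Scala collections, no Spark.
object EducationSortSketch {
  // (education, balance, job, marital, loan), in the order of the expected output.
  type Person = (String, Int, String, String, String)

  // Rows of the small example, without the quotes kept by the real data file.
  val smallExample: Seq[Person] = Seq(
    ("tertiary", 2143, "management", "married", "yes"),
    ("secondary", 29, "technician", "divorced", "yes"),
    ("secondary", 2, "entrepreneur", "single", "no"),
    ("unknown", 1506, "blue-collar", "married", "no"),
    ("secondary", 829, "services", "divorced", "yes"),
    ("tertiary", 929, "technician", "married", "yes"),
    ("tertiary", 22, "management", "divorced", "no"),
    ("primary", 10, "technician", "married", "no")
  )

  // Ascending education, then descending balance via a negated balance in the key.
  // Negation assumes balances are not Int.MinValue, which holds for this data.
  def sortPeople(people: Seq[Person]): Seq[Person] =
    people.sortBy { case (education, balance, _, _, _) => (education, -balance) }

  def main(args: Array[String]): Unit =
    sortPeople(smallExample).foreach(println)
}
```

The resulting order of balances is 10, 829, 29, 2, 2143, 929, 22, 1506, the same order as the expected output for this subtask.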

Task 2: Analysing Twitter Time Series Data [32 marks]

In this task we will be doing some analytics on real Twitter data [2]. The data is stored in a tab (“\t”) delimited format.

The data is supplied with the assignment at the following locations:

Small version: Task_2/Data/twitter-small.tsv
Full version: Task_2/Data/twitter.tsv

The data has the following attributes:

| Attribute index | Attribute name | Description |
|---|---|---|
| 0 | tokenType | In our data set all rows have a token type of hashtag, so this attribute is useless for this assignment. |
| 1 | month | The year and month in the form YYYYMM: 4 digits for the year followed by 2 digits for the month. For example, 200905 means May 2009. |
| 2 | count | An integer representing the number of tweets of this hashtag for the given year and month. |
| 3 | hashtagName | The #tag name, e.g. babylove, mydate, etc. |

Here is a small example of the Twitter data that we will use to illustrate the subtasks below:

| Token type | Month | count | Hash Tag Name |
|---|---|---|---|
| hashtag | 200910 | 2 | babylove |
| hashtag | 200911 | 2 | babylove |
| hashtag | 200912 | 90 | babylove |
| hashtag | 200812 | 100 | mycoolwife |
| hashtag | 200901 | 201 | mycoolwife |
| hashtag | 200910 | 1 | mycoolwife |
| hashtag | 200912 | 500 | mycoolwife |
| hashtag | 200905 | 23 | abc |
| hashtag | 200907 | 1000 | abc |

[2] Twitter data source: http://www.infochimps.com/datasets/twitter-census-conversation-metrics-one-year-of-urls-hashtags-sm

a) [Spark RDD] Find the single row that has the highest count and for that row report the month, count and hashtag name. Print the result to the terminal output using println. So, for the above small example data set the result would be:

    month: 200907, count: 1000, hashtagName: abc

[6 marks]

b) [Do twice, once using Hive and once using Spark RDD] Find the hash tag name that was tweeted the most in the entire data set across all months. Report the total number of tweets for that hash tag name. You can either print the result to the terminal or output the result to a text file. So, for the above small example data set the output would be:

    abc 1023

[12 marks total: 6 marks for Hive and 6 marks for Spark RDD]
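To make the difference between the highest single row and the highest total per hashtag concrete, the sketch below parses the nine example rows from tab-delimited strings and computes both values with plain Scala collections. It is not an RDD or Hive solution; `TweetCount`, `parse`, `highestRow` and `mostTweeted` are illustrative names, and the input lines are typed in by hand from the small example.

```scala
// Illustrative only: plain Scala collections, no Spark or Hive.
object TwitterRowsSketch {
  final case class TweetCount(month: String, count: Int, hashtagName: String)

  // Columns: tokenType, month (YYYYMM), count, hashtagName, separated by tabs.
  def parse(line: String): TweetCount = {
    val f = line.split("\t")
    TweetCount(month = f(1), count = f(2).toInt, hashtagName = f(3))
  }

  // The nine rows of the small example, written as tab-delimited lines.
  val smallExample: Seq[TweetCount] = Seq(
    "hashtag\t200910\t2\tbabylove",
    "hashtag\t200911\t2\tbabylove",
    "hashtag\t200912\t90\tbabylove",
    "hashtag\t200812\t100\tmycoolwife",
    "hashtag\t200901\t201\tmycoolwife",
    "hashtag\t200910\t1\tmycoolwife",
    "hashtag\t200912\t500\tmycoolwife",
    "hashtag\t200905\t23\tabc",
    "hashtag\t200907\t1000\tabc"
  ).map(parse)

  // Highest single row: compare rows by their own count.
  def highestRow(rows: Seq[TweetCount]): TweetCount = rows.maxBy(_.count)

  // Highest total: sum counts per hashtag first, then take the largest sum.
  def mostTweeted(rows: Seq[TweetCount]): (String, Int) =
    rows.groupBy(_.hashtagName)
      .map { case (tag, rs) => tag -> rs.map(_.count).sum }
      .maxBy(_._2)

  def main(args: Array[String]): Unit = {
    val top = highestRow(smallExample)
    println(s"month: ${top.month}, count: ${top.count}, hashtagName: ${top.hashtagName}")
    val (tag, total) = mostTweeted(smallExample)
    println(s"$tag $total")
  }
}
```

On the small example the totals are 94 for babylove, 802 for mycoolwife and 1023 for abc, so both questions happen to point at abc; on the full data the two answers need not name the same hashtag.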

c) [Spark RDD] Given two months x and y, where y > x, find the hashtag name that has increased the number of tweets the most from month x to month y. Ignore the tweets in the months between x and y, so just compare the number of tweets at month x and at month y. Report the hashtag name and the number of tweets in months x and y. Ignore any hashtag names that had no tweets in either month x or y. You can assume that the combination of hashtag and month is unique. Therefore, the same hashtag and month combination cannot occur more than once. Print the result to the terminal output using println. For the above small example data set:

    Input: x = 200910, y = 200912
    Output: hashtagName: mycoolwife, countX: 1, countY: 500

For this subtask you can specify the months x and y as arguments to the script. This is required to test on the full-sized data. For example:

     $ bash build_and_run.sh 200901 200902

[14 marks]
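The rule "ignore any hashtag that is missing in either month" amounts to matching the counts of month x with the counts of month y by hashtag name. The sketch below shows that matching on the small example with plain Scala collections. It is a hedged illustration, not an RDD solution, and `biggestIncrease` is an assumed name.

```scala
// Illustrative only: plain Scala collections, no Spark.
object MonthIncreaseSketch {
  // (month, count, hashtagName) triples from the small example.
  val smallExample: Seq[(String, Int, String)] = Seq(
    ("200910", 2, "babylove"), ("200911", 2, "babylove"), ("200912", 90, "babylove"),
    ("200812", 100, "mycoolwife"), ("200901", 201, "mycoolwife"),
    ("200910", 1, "mycoolwife"), ("200912", 500, "mycoolwife"),
    ("200905", 23, "abc"), ("200907", 1000, "abc")
  )

  // Returns (hashtagName, countX, countY) with the largest countY - countX,
  // considering only hashtags that have a row in both months.
  def biggestIncrease(rows: Seq[(String, Int, String)],
                      x: String, y: String): Option[(String, Int, Int)] = {
    val atX = rows.filter(_._1 == x).map(r => r._3 -> r._2).toMap
    val atY = rows.filter(_._1 == y).map(r => r._3 -> r._2).toMap
    // Keep a hashtag only when it is present in month x and in month y.
    val both = atX.toSeq.flatMap { case (tag, cx) =>
      atY.get(tag).map(cy => (tag, cx, cy))
    }
    if (both.isEmpty) None else Some(both.maxBy { case (_, cx, cy) => cy - cx })
  }

  def main(args: Array[String]): Unit =
    biggestIncrease(smallExample, "200910", "200912").foreach {
      case (tag, cx, cy) => println(s"hashtagName: $tag, countX: $cx, countY: $cy")
    }
}
```

For x = 200910 and y = 200912, babylove goes from 2 to 90 and mycoolwife from 1 to 500, while abc has no row in either month and drops out, so mycoolwife is reported.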

Task 3: Indexing Bag of Words data [30 marks]

In this task you are asked to create a partitioned index of words to documents that contain the words. Using this index you can search for all the documents that contain a particular word efficiently.

The data is supplied with the assignment at the following locations [3]:

Small version:
Task_3/Data/docword-small.txt
Task_3/Data/vocab-small.txt

Full version:
Task_3/Data/docword.txt
Task_3/Data/vocab.txt

The first file is called docword.txt, which contains the contents of all the documents stored in the following format:

| Attribute index | Attribute name | Description |
|---|---|---|
| 0 | docId | The ID of the document that contains the word. |
| 1 | vocabId | Instead of storing the word itself, we store an ID from the vocabulary file. |
| 2 | count | An integer representing the number of times this word occurred in this document. |

The second file, called vocab.txt, contains each word in the vocabulary, which is indexed by the vocabId from the docword.txt file.

Here is a small example of the content of the docword.txt file:

| docId | vocabId | count |
|---|---|---|
| 3 | 3 | 600 |
| 2 | 3 | 702 |
| 1 | 2 | 120 |
| 2 | 5 | 200 |
| 2 | 2 | 500 |
| 3 | 1 | 100 |
| 3 | 5 | 2000 |
| 3 | 4 | 122 |
| 1 | 3 | 1200 |
| 1 | 1 | 1000 |

[3] Data source: http://archive.ics.uci.edu/ml/datasets/Bag+of+Words

Here is an example of the vocab.txt file:

| vocabId | word |
|---|---|
| 1 | plane |
| 2 | car |
| 3 | motorbike |
| 4 | truck |
| 5 | boat |

Complete the following subtasks using Spark:

a) [Spark SQL] Calculate the total count of each word across all documents. List the words in ascending alphabetical order. Write the results to “Task_3a-out” in CSV format (multiple output parts are allowed). So for the above small example input the output would be the following (outputs with multiple parts will be considered in order of the part number):

    boat,2200
    car,620
    motorbike,2502
    plane,1100
    truck,122

Note: Spark SQL will give the output in multiple files. You should ensure that the data is sorted globally across all the files (parts). So, all words in part 0 will be alphabetically before the words in part 1.

[8 marks]
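The expected totals can be checked by hand: sum the counts per vocabId, translate each ID through the vocabulary, and sort by word. The sketch below does exactly that on the example rows with plain Scala collections. It deliberately does not use the DataFrame or Dataset API that the subtask requires, and `wordTotals` is an illustrative name.

```scala
// Illustrative only: plain Scala collections, not the Spark SQL API.
object WordTotalsSketch {
  // (docId, vocabId, count) rows from the docword.txt example.
  val docword: Seq[(Int, Int, Int)] = Seq(
    (3, 3, 600), (2, 3, 702), (1, 2, 120), (2, 5, 200), (2, 2, 500),
    (3, 1, 100), (3, 5, 2000), (3, 4, 122), (1, 3, 1200), (1, 1, 1000)
  )

  // vocabId -> word from the vocab.txt example.
  val vocab: Map[Int, String] =
    Map(1 -> "plane", 2 -> "car", 3 -> "motorbike", 4 -> "truck", 5 -> "boat")

  // Sum counts per vocabId, look up the word, then sort alphabetically.
  def wordTotals(rows: Seq[(Int, Int, Int)], words: Map[Int, String]): Seq[(String, Int)] =
    rows.groupBy(_._2).toSeq
      .map { case (vocabId, rs) => words(vocabId) -> rs.map(_._3).sum }
      .sortBy(_._1)

  def main(args: Array[String]): Unit =
    wordTotals(docword, vocab).foreach { case (word, total) => println(s"$word,$total") }
}
```

For instance, motorbike has vocabId 3 and appears with counts 600, 702 and 1200, which add up to the 2502 shown in the expected output.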
b) [Spark SQL] Create a dataframe containing rows with four fields: (word, docId, count, firstLetter). You should add the firstLetter column by using a UDF, which extracts the first letter of word as a String. Save the results in parquet format partitioned by firstLetter to docwordIndexFilename. Use show() to print the first 10 rows of the dataframe that you saved.

So, for the above example input, you should see the following output (the exact ordering is not important):

    +---------+-----+-----+-----------+
    |     word|docId|count|firstLetter|
    +---------+-----+-----+-----------+
    |    plane|    1| 1000|          p|
    |    plane|    3|  100|          p|
    |      car|    2|  500|          c|
    |      car|    1|  120|          c|
    |motorbike|    1| 1200|          m|
    |motorbike|    2|  702|          m|
    |motorbike|    3|  600|          m|
    |    truck|    3|  122|          t|
    |     boat|    3| 2000|          b|
    |     boat|    2|  200|          b|
    +---------+-----+-----+-----------+

[14 marks]

c) [Spark SQL] Load the previously created dataframe stored in parquet format from subtask b). For each document ID in the docIds list (which is provided as a function argument for you), use println to display the following: the document ID, the word with the most occurrences in that document (you can break ties arbitrarily), and the number of occurrences of that word in the document. Skip any document IDs that aren’t found in the dataset. Use an optimisation to prevent loading the parquet file into memory multiple times.

If docIds contains “2” and “3”, then the output for the example dataset would be:

    [2, motorbike, 702]
    [3, boat, 2000]

For this subtask specify the document ids as arguments to the script. For example:

       $ bash build_and_run.sh 2 3
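The per-document lookup can be pictured as "group the rows once, then answer every docId from that single grouping". The sketch below mirrors this on the example rows with plain Scala collections. It is an analogy for reusing loaded data, not the Spark SQL code the subtask asks for, and `topWords` is an illustrative name.

```scala
// Illustrative only: plain Scala collections, not the Spark SQL API.
object TopWordPerDocSketch {
  // (word, docId, count) rows, as in the example output of subtask b).
  val index: Seq[(String, Int, Int)] = Seq(
    ("plane", 1, 1000), ("plane", 3, 100), ("car", 2, 500), ("car", 1, 120),
    ("motorbike", 1, 1200), ("motorbike", 2, 702), ("motorbike", 3, 600),
    ("truck", 3, 122), ("boat", 3, 2000), ("boat", 2, 200)
  )

  // For each requested docId that exists, the word with the highest count.
  def topWords(rows: Seq[(String, Int, Int)], docIds: Seq[Int]): Seq[(Int, String, Int)] = {
    val byDoc = rows.groupBy(_._2) // built once and reused for every docId
    docIds.flatMap { id =>
      byDoc.get(id).map { rs =>    // unknown docIds yield None and are skipped
        val (word, _, count) = rs.maxBy(_._3)
        (id, word, count)
      }
    }
  }

  def main(args: Array[String]): Unit =
    topWords(index, Seq(2, 3)).foreach {
      case (id, word, count) => println(s"[$id, $word, $count]")
    }
}
```

Document 2 contains car (500), motorbike (702) and boat (200), so motorbike is reported; a docId such as 99 that is absent from the data produces no line at all.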

[4 marks]

d) [Spark SQL] Load the previously created dataframe stored in parquet format from subtask b). For each word in the queryWords list (which is provided as a function argument for you), use println to display the docId with the most occurrences of that word (you can break ties arbitrarily). Use an optimisation based on how the data is partitioned.

If queryWords contains “car” and “truck”, then the output for the example dataset would be:

    [car,2]
    [truck,3]

For this subtask specify the query words as arguments to the script. For example:

          $ bash build_and_run.sh computer environment power
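The idea behind a partition-based optimisation is that a word can only live in the partition named after its first letter, so every other partition can be skipped. The sketch below imitates this with an in-memory map keyed by first letter. It is an analogy in plain Scala, not the parquet or Spark SQL mechanism itself, and `partitionByFirstLetter` and `bestDoc` are illustrative names.

```scala
// Illustrative only: an in-memory analogy for a layout partitioned by firstLetter.
object FirstLetterPartitionSketch {
  // (word, docId, count) rows, as in the example output of subtask b).
  val index: Seq[(String, Int, Int)] = Seq(
    ("plane", 1, 1000), ("plane", 3, 100), ("car", 2, 500), ("car", 1, 120),
    ("motorbike", 1, 1200), ("motorbike", 2, 702), ("motorbike", 3, 600),
    ("truck", 3, 122), ("boat", 3, 2000), ("boat", 2, 200)
  )

  // One bucket per first letter, mimicking one directory per firstLetter value.
  def partitionByFirstLetter(rows: Seq[(String, Int, Int)]): Map[String, Seq[(String, Int, Int)]] =
    rows.groupBy(_._1.take(1))

  // Search only the bucket that can contain the word.
  def bestDoc(partitions: Map[String, Seq[(String, Int, Int)]], word: String): Option[Int] = {
    val hits = partitions.getOrElse(word.take(1), Seq.empty).filter(_._1 == word)
    if (hits.isEmpty) None else Some(hits.maxBy(_._3)._2)
  }

  def main(args: Array[String]): Unit = {
    val partitions = partitionByFirstLetter(index)
    Seq("car", "truck").foreach { w =>
      bestDoc(partitions, w).foreach(docId => println(s"[$w,$docId]"))
    }
  }
}
```

Here a query for car touches only the "c" bucket (two rows) rather than all ten rows, which is the saving the partitioned layout is meant to provide.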

[4 marks]

Bonus Marks:

1. Using Spark, perform the following task using the data set of Task 2.

[Spark RDD or Spark SQL] Find the hash tag name that has increased the number of tweets the most between any two consecutive months, across all hash tag names. Consecutive months means, for example, 200801 to 200802, or 200902 to 200903, etc. Report the hash tag name, the 1st month count, and the 2nd month count using println.

For the small example data set of Task 2 the output would be:

    Hash tag name: mycoolwife count of month 200812: 100 count of month 200901: 201
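Because months are stored as YYYYMM, "consecutive" is not simply "plus one": the month after 200812 is 200901. The sketch below shows that roll-over and applies it to the small example with plain Scala collections. It is a hedged illustration rather than a Spark solution, and `Jump`, `nextMonth` and `biggestJump` are illustrative names.

```scala
// Illustrative only: plain Scala collections, no Spark.
object ConsecutiveMonthSketch {
  final case class Jump(hashtagName: String, month1: Int, count1: Int, month2: Int, count2: Int)

  // Next month in YYYYMM form, rolling December over to January of the next year.
  def nextMonth(yyyymm: Int): Int =
    if (yyyymm % 100 == 12) (yyyymm / 100 + 1) * 100 + 1 else yyyymm + 1

  // (month, count, hashtagName) triples from the small example of Task 2.
  val smallExample: Seq[(Int, Int, String)] = Seq(
    (200910, 2, "babylove"), (200911, 2, "babylove"), (200912, 90, "babylove"),
    (200812, 100, "mycoolwife"), (200901, 201, "mycoolwife"),
    (200910, 1, "mycoolwife"), (200912, 500, "mycoolwife"),
    (200905, 23, "abc"), (200907, 1000, "abc")
  )

  // Pair every row with the same hashtag's row in the following month, if any,
  // and keep the pair with the largest increase.
  def biggestJump(rows: Seq[(Int, Int, String)]): Option[Jump] = {
    val countAt: Map[(String, Int), Int] =
      rows.map { case (m, c, tag) => (tag, m) -> c }.toMap
    val jumps = rows.flatMap { case (m, c, tag) =>
      val m2 = nextMonth(m)
      countAt.get((tag, m2)).map(c2 => Jump(tag, m, c, m2, c2))
    }
    if (jumps.isEmpty) None else Some(jumps.maxBy(j => j.count2 - j.count1))
  }

  def main(args: Array[String]): Unit =
    biggestJump(smallExample).foreach(println)
}
```

On the small example the consecutive pairs are babylove 200910 to 200911 (+0), babylove 200911 to 200912 (+88) and mycoolwife 200812 to 200901 (+101), so mycoolwife wins even though its 200910 to 200912 change is larger, because those two months are not consecutive.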

[10 marks]

Total Marks:

Please note that the total mark for this assignment is capped at 100. If your marks add to more than 100 then your final mark will be 100.

Return of Assignments

Departmental Policy requires that assignments be returned within three weeks of the submission date. We will endeavour to have your assignment returned before the BDC exam. The time and place of the return will be posted on LMS.
