COMP9313 Big Data Management Project 3: Finding Similar News Article Headlines Using Pyspark

COMP9313 2023T2 Project 3 (22 marks)

Finding Similar News Article Headlines Using Pyspark

In this problem, we are still going to use the dataset of Australian news from ABC. Similar news may appear in different years. Your task is to find all similar news article headline pairs across different years.

Background: Set similarity self-join

Given a collection of records R, a similarity function sim(., .), and a threshold τ, the set similarity self-join on R finds all record pairs r and s from R such that sim(r, s) >= τ. In this project, you are required to use the Jaccard similarity function to compute sim(r, s). Consider the following example with τ = 0.5:

id    record
0     1 4 5 6
1     2 3 6
2     4 5 6
3     1 4 6
4     2 5 6
5     3 5

The result pairs are:

pair     similarity
(0,2)    0.75
(0,3)    0.75
(1,4)    0.5
(2,3)    0.5
(2,4)    0.5
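
As a quick sanity check of the example above, the following sketch computes the Jaccard similarity, |r ∩ s| / |r ∪ s|, for every pair of the six records and keeps the pairs at or above τ = 0.5. The names `jaccard` and `records` are illustrative only, and the brute-force loop is meant to illustrate the definition on a tiny input, not to suggest an approach for the project itself.

```python
from itertools import combinations


def jaccard(r, s):
    """Jaccard similarity of two sets: |r & s| / |r | s|."""
    union = len(r | s)
    return len(r & s) / union if union else 0.0


# The six example records, keyed by record id.
records = {
    0: {1, 4, 5, 6},
    1: {2, 3, 6},
    2: {4, 5, 6},
    3: {1, 4, 6},
    4: {2, 5, 6},
    5: {3, 5},
}
tau = 0.5

# Compare every pair (i, j) with i < j and keep those with sim >= tau.
result = []
for i, j in combinations(sorted(records), 2):
    sim = jaccard(records[i], records[j])
    if sim >= tau:
        result.append(((i, j), sim))

for pair, sim in result:
    print(pair, sim)
```

Running it lists the same five pairs and similarities as the table.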

Input files:

In the file, each line is the headline of a news article, in the format "date,term1 term2 ... ... ". The date and the text are separated by a comma, and the terms are separated by a space character (note that the stop words have already been removed). A sample file looks like this:

20191124,woman stabbed adelaide shopping centre
20191204,economy continue teetering edge recession
20200401,coronanomics learnt coronavirus economy
20200401,coronavirus home test kits selling chinese community
20201015,coronavirus pacific economy foriegn aid china
20201016,china builds pig apartment blocks guard swine flu
20211216,economy starts bounce unemployment
20211224,online shopping rise due coronavirus
20211229,china close encounters elon musks

Note that it is possible that one term appears multiple times in a headline.
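
A minimal sketch of how one input line might be turned into a year and a term set. It assumes that the year is the first four characters of the date and that repeated terms collapse into a single set element (consistent with a set-based Jaccard similarity); the helper name `parse_headline` is hypothetical.

```python
def parse_headline(line):
    """Split 'date,term1 term2 ...' into (year, set of terms)."""
    date, _, text = line.strip().partition(",")
    year = date[:4]                 # assumed: date is YYYYMMDD
    terms = frozenset(text.split())  # duplicates collapse in the set
    return year, terms


year, terms = parse_headline("20191124,woman stabbed adelaide shopping centre")
print(year, sorted(terms))
```

Keeping the year next to the term set makes it straightforward to discard pairs whose two headlines come from the same year.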

Output:

The output file contains all the similar headline pairs together with their similarities. In each pair, the headlines must be from different years. Please use the index of the headline in the file as its ID (starting from 0) and use the IDs to represent the headline pairs. Each line is in the format “(Id₁,Id₂)\tSimilarity” (Id₁<Id₂, and there should be no duplicate pairs in the result). The similarities are of double precision. The pairs are sorted in ascending order (by the first ID and then the second).

Given the example input data with threshold 0.1, the final result should be:

(0,7)\t0.1111111111111111
(1,2)\t0.125
(1,4)\t0.1
(1,6)\t0.125
(2,6)\t0.14285714285714285
(2,7)\t0.125
(4,6)\t0.1111111111111111
(4,7)\t0.1
(4,8)\t0.1
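
The output format can be sketched with a small formatting helper. It assumes that "\t" in the specification stands for a tab character and that Python's default float-to-string conversion is an acceptable rendering of a double; `format_pair` is a hypothetical name.

```python
def format_pair(id1, id2, sim):
    """Render one result line as '(id1,id2)<TAB>similarity'."""
    return f"({id1},{id2})\t{sim}"


# A few of the pairs from the example, deliberately out of order.
pairs = [((2, 6), 1 / 7), ((0, 7), 1 / 9), ((1, 2), 1 / 8)]

# Sorting the (id1, id2) tuples orders by the first ID, then the second.
for (id1, id2), sim in sorted(pairs):
    print(format_pair(id1, id2, sim))
```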

Code format:

Please name your Python file “project3.py”.

Command of running your code:

Your program should take three parameters: the input file, the output folder, and the similarity threshold τ.

$ spark-submit project3.py input output tau

Please ensure that the code you submit can be compiled. Any solution that has compilation errors will receive no more than 6 marks.
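
A minimal sketch of reading the three command-line parameters, assuming they arrive in `sys.argv` in the order shown in the command above. The helper name `parse_args` and the usage message are illustrative, and the Spark job itself is intentionally left out.

```python
import sys


def parse_args(argv):
    """Return (input_path, output_path, tau) from an argv-style list."""
    if len(argv) != 4:
        raise SystemExit("usage: spark-submit project3.py input output tau")
    _, input_path, output_path, tau = argv
    return input_path, output_path, float(tau)  # tau arrives as a string


if __name__ == "__main__" and len(sys.argv) == 4:
    input_path, output_path, tau = parse_args(sys.argv)
    print(input_path, output_path, tau)
```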

Run in Google Dataproc - Cluster configuration:

Create a bucket with the name “comp9313-<YOUR_STUDENTID>” in Dataproc. Create a folder “project3” in this bucket for holding the input files.

This project aims to let you see the power of distributed computation. Your code should scale well with the number of nodes used in a cluster. You are required to create three clusters in Dataproc to run the same job:

·      Cluster1 - 1 master node and 2 worker nodes;

·      Cluster2 - 1 master node and 4 worker nodes;

·      Cluster3 - 1 master node and 6 worker nodes.

For both master and worker nodes, select n1-standard-2 (2 vCPU, 7.5GB memory).

Record the runtime on each cluster and draw a figure where the x-axis is the number of nodes you used and the y-axis is the time taken to get the result, and store this figure in a file “Runtime.jpg”. Please also take a screenshot of your program running on Dataproc in each cluster as proof of the runtime. Compress the three screenshots into a zip file “Screenshots.zip”.

Create a project and test everything on your local computer first, and then run it in Google Dataproc.

Marking Criteria

Your source code will be inspected and marked based on readability and ease of understanding. The efficiency and scalability of your solution are very important and will be evaluated as well. Below is an indicative marking scheme:

Submission can be compiled and run on Spark: 6

Accuracy: 5

·      No unexpected pairs

·      No missing pairs

·      Correct order

·      Correct similarity scores

·      Correct format

Efficiency: 9

·      The rank of runtime (using two local threads):

Correct results:

    0.9 * (10 - floor((rank percentage-1)/10)), e.g., top 10% => 9

Incorrect results:

    0.4 * (10 - floor((rank percentage-1)/10))
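
The efficiency formula can be worked through with a short sketch. It assumes "rank percentage" is a number between 1 and 100 (for example, 10 for the top 10%); the function name `efficiency_mark` is hypothetical.

```python
import math


def efficiency_mark(rank_percentage, correct=True):
    """Apply weight * (10 - floor((rank percentage - 1) / 10))."""
    weight = 0.9 if correct else 0.4
    return weight * (10 - math.floor((rank_percentage - 1) / 10))


print(efficiency_mark(10))                 # top 10%, correct results
print(efficiency_mark(11))                 # just outside the top 10%
print(efficiency_mark(10, correct=False))  # top 10%, incorrect results
```

Under this reading the mark drops by one weighted step at each 10% boundary, so ranks 1 to 10 share the top band and ranks 91 to 100 share the lowest.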

Code format and structure, Readability, and Documentation: 2

·      The description of the optimization techniques used

Cautions:

·      You need to design an exact approach to finding similar records.

·      You cannot compute the pairwise similarities.

·      Regular Python programming is not permitted in project3.

·      When testing the correctness and efficiency of submissions, all the code will be run with two local threads using the default settings of Spark. Please be careful with your runtime and memory usage.
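
The cautions above ask for an exact method that avoids comparing every pair. The assignment does not name a technique; one well-known option, assumed here purely for illustration, is prefix filtering: order each record's tokens by ascending global frequency, index only the first |r| - ceil(τ·|r|) + 1 tokens, and verify only the records that share an indexed token. The sketch below is plain Python on the small numeric example from the background section, so it shows the idea only; a submission would need to express equivalent steps with Spark operations. The names `prefix_length` and `similarity_join` are hypothetical.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def prefix_length(size, tau):
    """Prefix length for Jaccard threshold tau (epsilon guards float error)."""
    return size - math.ceil(tau * size - 1e-9) + 1


def similarity_join(records, tau):
    # Global token order: rare tokens first, ties broken by token value.
    freq = Counter(t for tokens in records.values() for t in set(tokens))
    index = defaultdict(list)  # token -> ids of records with it in their prefix
    for rid, tokens in records.items():
        ordered = sorted(set(tokens), key=lambda t: (freq[t], t))
        for t in ordered[:prefix_length(len(ordered), tau)]:
            index[t].append(rid)

    # Only records sharing a prefix token can reach the threshold.
    candidates = set()
    for rids in index.values():
        candidates.update(combinations(sorted(rids), 2))

    # Verify each candidate with the exact Jaccard similarity.
    result = []
    for i, j in sorted(candidates):
        r, s = set(records[i]), set(records[j])
        sim = len(r & s) / len(r | s)
        if sim >= tau:
            result.append(((i, j), sim))
    return result


records = {0: [1, 4, 5, 6], 1: [2, 3, 6], 2: [4, 5, 6],
           3: [1, 4, 6], 4: [2, 5, 6], 5: [3, 5]}
for pair, sim in similarity_join(records, 0.5):
    print(pair, sim)
```

On this example the index yields 10 candidate pairs instead of all 15, and verification keeps the same five pairs as the brute-force definition, so the filter loses nothing.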

Submission:

Deadline: Wed 9th Aug 11:59:59 PM

You can submit through Moodle. You need to submit three files: project3.py, Runtime.jpg, and Screenshots.zip. Please compress everything in a package named “zID_proj3.zip” (e.g. z5123456_proj3.zip).

Late submission penalty

5% reduction of your marks for up to 5 days

Plagiarism:

The work you submit must be your own work. Submission of work partially or completely derived from any other person or jointly written with any other person is not permitted. The penalties for such an offence may include negative marks, automatic failure of the course and possibly other academic discipline. Assignment submissions will be examined manually.

Relevant scholarship authorities will be informed if students holding scholarships are involved in an incident of plagiarism or other misconduct.

Do not provide or show your assignment work to any other person, apart from the teaching staff of this subject. If you knowingly provide or show your assignment work to another person for any reason, and work derived from it is submitted, you may be penalized, even if the work was submitted without your knowledge or consent.
