COMP9313 Big Data Management Project 2: Top-k most frequent co-occurring term pairs


COMP9313 23T2 Project 2 (16 marks)

Problem statement:

In this problem, we are still going to use the dataset of Australian news from ABC. Your task is to find the top-k most frequent co-occurring term pairs in each year. Two terms w and u co-occur if they appear in the same article headline. The pair is unordered, i.e., (w, u) and (u, w) are treated as the same pair.
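The unordered-pair definition can be sketched in plain Python as follows. The function name `cooccurring_pairs` is hypothetical, and the de-duplication step is an assumption: the specification does not say how a term repeated within one headline should be counted.

```python
from itertools import combinations


def cooccurring_pairs(terms):
    """Return each unordered term pair of one headline exactly once.

    Assumption (not stated in the spec): a term repeated within a
    headline is counted once, so the terms are de-duplicated first.
    """
    unique_terms = sorted(set(terms))
    # combinations() over a sorted list yields (a, b) with a < b, so
    # (w, u) and (u, w) collapse to the same key.
    return list(combinations(unique_terms, 2))
```

Because every pair is emitted with its terms already in alphabetical order, the pair can be used directly as a counting key.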

Input files:

The dataset you are going to use contains news headlines published over several years. Each line of the text file is the headline of one news article, in the format "date,term1 term2 ... ... ". The date and the text are separated by a comma, and the terms are separated by the space character. A sample file looks like this:
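As a rough illustration of this line format, one way to split a line into its year and its raw terms is shown below. The name `parse_headline` is hypothetical, and the sketch assumes the date always starts with a four-digit year, as in the sample.

```python
def parse_headline(line):
    """Split 'date,term1 term2 ...' into (year, list of raw terms)."""
    # partition() splits on the first comma only, which is where the
    # date ends.
    date, _, text = line.partition(",")
    # Assumption: the date is YYYYMMDD, so the first four characters
    # are the year.
    year = date[:4]
    # Splitting on a single space can yield empty strings if a headline
    # contains consecutive spaces; callers would need to discard those.
    terms = text.split(" ")
    return year, terms
```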

20030219,council chief executive fails to secure position

20030219,council welcomes ambulance levy decision

20030219,council welcomes insurance breakthrough

20030219,fed opp to re introduce national insurance

20040501,cowboys survive eels comeback

20040501,cowboys withstand eels fightback

20040502,castro vows cuban socialism to survive bush

20200401,coronanomics things learnt about how coronavirus economy

20200401,coronavirus at home test kits selling in the chinese community

20200401,coronavirus campbell remess streams bear making classes

20201015,coronavirus pacific economy foriegn aid china

20201016,china builds pig apartment blocks to guard against swine flu

This small sample file can be downloaded at:

Output format:

You need to ignore stop words such as “to”, “the”, “in”, etc. (see broadcast variables for an efficient way to do this). A stop word list is stored in this file:
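A minimal sketch of the broadcast-variable idea follows. It assumes the stop word file holds one word per line (the file itself is not shown here), and the names `remove_stop_words`, `filter_headlines`, `term_lists_rdd` and `stopwords_path` are hypothetical. `sc` is expected to be an existing PySpark `SparkContext`.

```python
def remove_stop_words(terms, stop_words):
    """Keep only the terms that are not in the stop word set."""
    return [t for t in terms if t not in stop_words]


def filter_headlines(sc, term_lists_rdd, stopwords_path):
    """Drop stop words from an RDD of term lists using a broadcast set."""
    # Assumption: one stop word per line in the file.
    stop_words = set(sc.textFile(stopwords_path).collect())
    # Broadcasting ships the set to each executor once, instead of
    # serialising it again with every task closure.
    stop_words_bc = sc.broadcast(stop_words)
    return term_lists_rdd.map(
        lambda terms: remove_stop_words(terms, stop_words_bc.value)
    )
```

Storing the words in a `set` keeps each membership test at constant expected time, which matters when the check runs once per term.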

Please extract the terms from the dataset as follows:

·      Split the headline by the space character to obtain terms.

·      Ignore the stop words such as “to”, “the”, “in”, etc.

·      Ignore terms starting with non-alphabetical characters, i.e., only consider terms starting with “a” to “z”.
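The three rules above could be combined along these lines. The name `extract_terms` is hypothetical, and the sketch assumes the headlines are already lower-case, as in the sample file, so no case conversion is applied.

```python
def extract_terms(headline_text, stop_words):
    """Apply the three term rules to the text part of one headline."""
    terms = []
    for term in headline_text.split(" "):
        # An empty string (from consecutive spaces) has no first character.
        if not term:
            continue
        # Explicit range check: str.isalpha() would also accept
        # non-ASCII letters, which fall outside "a" to "z".
        if not ("a" <= term[0] <= "z"):
            continue
        if term in stop_words:
            continue
        terms.append(term)
    return terms
```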

Your Spark program should generate a list of (k * total years) results, each of which is in the format “Year\tTerm₁,Term₂:Count” (the two terms are sorted in alphabetical order and separated by “,”). The results should be ranked first by year in ascending order, then by the co-occurrence count of a pair in descending order, and finally by the term pair in alphabetical order.
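The ranking and the line format can be illustrated without Spark. The sketch below is plain Python, not an RDD or DataFrame solution. The name `top_k_lines` is hypothetical, and it assumes that “\t” in the format stands for a tab character. It uses `heapq.nsmallest` so that only k entries per year are kept rather than sorting every pair.

```python
import heapq
from collections import defaultdict


def top_k_lines(pair_counts, k):
    """Format the top-k pairs per year.

    pair_counts: iterable of (year, (term1, term2), count), where each
    pair already has its two terms in alphabetical order.
    """
    by_year = defaultdict(list)
    for year, pair, count in pair_counts:
        by_year[year].append((pair, count))

    lines = []
    for year in sorted(by_year):  # year ascending
        # Key (-count, pair): larger counts come first, and ties are
        # broken by the term pair in alphabetical order.
        best = heapq.nsmallest(k, by_year[year], key=lambda pc: (-pc[1], pc[0]))
        for (term1, term2), count in best:
            lines.append(f"{year}\t{term1},{term2}:{count}")
    return lines
```

For example, `top_k_lines([("2003", ("council", "welcomes"), 2), ("2003", ("council", "insurance"), 1)], 1)` returns a single line for 2003 containing the `council,welcomes` pair.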

Given k = 1 and the sample dataset, the output is like:

2003\tcouncil,welcomes:2

2004\tcowboys,eels:2

2020\tcoronavirus,economy:2

Code format:

Please name your two Python files “project2_rdd.py” and “project2_df.py” for the RDD and DataFrame APIs, respectively. Compress them into a package named “zID_proj2.zip” (e.g. z5123456_proj2.zip).

Command of running your code:

We will use the following command to run your code:

$ spark-submit project2_rdd.py input output stopwords k

In this command, input is the input file, output is the output folder, stopwords is the stop words file, and k is the number of pairs returned for each year.
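Reading the four arguments in that order might look like the sketch below. The function name `parse_args` and the usage message are hypothetical. The only conversion applied is turning k into an integer.

```python
def parse_args(argv):
    """Unpack [script, input, output, stopwords, k] from an argv list."""
    if len(argv) != 5:
        raise SystemExit("usage: project2_rdd.py input output stopwords k")
    input_path, output_path, stopwords_path, k_text = argv[1:]
    # k arrives as text on the command line, so convert it before use.
    return input_path, output_path, stopwords_path, int(k_text)
```

In a script this would typically be called as `parse_args(sys.argv)` after `import sys`.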

Notes:

·      You can read the files from either HDFS or the local file system. Using local files is more convenient, but you need to use the prefix "file:///...". Spark uses HDFS by default if the path does not have a prefix.

·      Please do not use numpy or pandas, since we aim to assess your understanding of the RDD/DataFrame APIs.

·      You can use coalesce(1) to merge the data into a single partition and then save the data to disk.

·      In the DataFrame solution, please do not use the spark.sql() function to pass the SQL statement to Spark directly.

·      It does not matter whether the output file ends with a new line. It will not affect the correctness of your solution.
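The first and third notes can be sketched as two small helpers. Both names, `to_local_uri` and `save_single_file`, are hypothetical, and `lines_rdd` is assumed to be a PySpark RDD of already formatted output strings.

```python
import os
from pathlib import Path


def to_local_uri(path):
    """Turn a local path into a 'file:///...' URI."""
    # as_uri() requires an absolute path, so make it absolute first.
    return Path(os.path.abspath(path)).as_uri()


def save_single_file(lines_rdd, output_path):
    """Write an RDD of strings to the output folder as one part file."""
    # coalesce(1) merges the data into a single partition, so the
    # output folder contains a single part file.
    lines_rdd.coalesce(1).saveAsTextFile(output_path)
```

Note that `saveAsTextFile` writes a folder rather than a single file, which matches the "output folder" argument of the run command.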

Marking Criteria:

Your source code will be checked and marked based on readability and ease of understanding. Each solution is worth 8 marks. Please ensure that the code you submit can be compiled. Below is an indicative marking scheme (for each solution):

Submission can be compiled and run on Spark: 3

Submission can obtain correct results: 3

·      Correct term pairs

·      Correct counts

·      Correct order

·      Correct format

·      Correctly passing self-defined functions to Spark

·      Correctly using Spark APIs (the RDD solution may use only RDD APIs, and the DataFrame solution only DataFrame APIs)

Efficiency of top-k computation: 1

Efficient stop words removal: 0.5

Code format and structure, Readability, and Documentation: 0.5

Submission:

Deadline: Sunday 16th Jul 11:59:59 PM

Late submission penalty

5% reduction of your marks for up to 5 days

Plagiarism:

The work you submit must be your own work. Submission of work partially or completely derived from any other person or jointly written with any other person is not permitted. The penalties for such an offence may include negative marks, automatic failure of the course and possibly other academic discipline. Assignment submissions will be examined manually.

Relevant scholarship authorities will be informed if students holding scholarships are involved in an incident of plagiarism or other misconduct.

Do not provide or show your assignment work to any other person, apart from the teaching staff of this subject. If you knowingly provide or show your assignment work to another person for any reason, and work derived from it is submitted, you may be penalized, even if the work was submitted without your knowledge or consent.
