COMP9313 23T2 Project 1 (12 marks)
Problem statement:
Detecting popular and trending topics from news articles is important for public opinion monitoring. In this project, your task is to perform text analysis over a dataset of Australian news from the ABC (Australian Broadcasting Corporation) using MRJob. The problem is to compute the weight of each term with respect to each year in the news dataset and to find the most important terms in each year, i.e., those whose weights are larger than a given threshold.
Input files:
The dataset you are going to use contains news headlines published over several years. In this text file, each line is the headline of a news article, in the format "date,term1 term2 ... ...". The date and the text are separated by a comma, and the terms are separated by spaces. A sample file looks like the one below (note that stop words such as “to”, “the”, and “in” have already been removed from the dataset):
20191124,woman stabbed adelaide shopping centre
20191204,economy continue teetering edge recession
20200401,coronanomics learnt coronavirus economy
20200401,coronavirus home test kits selling chinese community
20201015,coronavirus pacific economy foriegn aid china
20201016,china builds pig apartment blocks guard swine flu
20211216,economy starts bounce unemployment
20211224,online shopping rise due coronavirus
20211229,china close encounters elon musks
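For illustration only (this snippet is not part of the required submission), one way to split a line of the dataset into its year and terms is shown below; the variable names are ours, not mandated by the spec:

line = "20191124,woman stabbed adelaide shopping centre"
date, text = line.split(",", 1)   # the date and the headline are comma-separated
year = date[:4]                   # the year is the first four digits of the date
terms = text.split()              # the terms are space-separated
print(year, terms)                # 2019 ['woman', 'stabbed', 'adelaide', 'shopping', 'centre']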
Term weights computation:
To compute the weight of a term with respect to a year, please use the TF/IDF model. Specifically, the TF and IDF are computed as:
· TF(term t, year y) = the frequency of t in y
· IDF(term t, dataset D) = log10(the number of years in D / the number of years containing t)
Finally, the weight of term t with respect to year y is computed as:
· Weight(term t, year y, dataset D) = TF(term t, year y) * IDF(term t, dataset D)
Please import math and use math.log10() to compute the term weights.
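As a quick sanity check, the example below computes one weight by hand under the formulas above; the numbers (3 years in the dataset, the term appearing in 2 of them, 5 occurrences within the year in question) are hypothetical:

import math

tf = 5                    # TF(term t, year y): frequency of t in y
idf = math.log10(3 / 2)   # IDF(term t, dataset D): log10(#years / #years containing t)
weight = tf * idf         # Weight(term t, year y, dataset D)
print(weight)             # approximately 0.8805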
Code format:
Please name your Python file “project1.py” and compress it into a package named “zID_proj1.zip” (e.g., z5123456_proj1.zip).
Command of running your code:
To reduce the difficulty of the project, you are allowed to pass the total number of years to your job. We will also use more than one reducer to test your code. Assuming there are 20 years, β is set to 0.5, and 2 reducers are used, we will run your code with a command like the one below:
$ python3 project1.py -r hadoop hdfs_input -o hdfs_output --jobconf myjob.settings.years=20 --jobconf myjob.settings.beta=0.5 --jobconf mapreduce.job.reduces=2
· hdfs_input: input file in HDFS, e.g., “hdfs://localhost:9000/user/comp9313/input”
· hdfs_output: output folder in HDFS, e.g., “hdfs://localhost:9000/user/comp9313/output”
· You can access the total number of years and the value of β in your program via, e.g., “N = jobconf_from_env('myjob.settings.years')” (add “from mrjob.compat import jobconf_from_env” to your imports); see the sketch below.
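As a minimal sketch (assuming the parameters are read in a reducer_init method; any MRJob step would work), note that jobconf values arrive as strings and must be cast before use:

from mrjob.job import MRJob
from mrjob.compat import jobconf_from_env

class Project1(MRJob):   # illustrative class name, not mandated by the spec
    def reducer_init(self):
        # jobconf values are strings; cast them before arithmetic or comparison
        self.n_years = int(jobconf_from_env('myjob.settings.years'))
        self.beta = float(jobconf_from_env('myjob.settings.beta'))

if __name__ == '__main__':
    Project1.run()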
Output format:
You need to output all terms whose weights for a given year are larger than the threshold β (note that the same term can appear in multiple years). The format of each line is: “Term\tYear,Weight”. Sort the results first by term in alphabetical order and then by year in descending order.
For example, given the dataset above and β=0.4, the expected output can be checked at https://webcms3.cse.unsw.edu.au/COMP9313/23T2/resources/88345 (there is no need to remove the quotation marks generated by MRJob).
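One hedged way a final reducer (placed inside a class like the one sketched earlier) might emit records in this shape is shown below; it assumes the (year, weight) pairs for a term have already been gathered, and it leaves the cross-reducer ordering of terms to the shuffle configuration:

def reducer(self, term, pairs):
    # `pairs` is an assumed iterable of (year, weight) tuples for `term`;
    # sort years in descending order within each term
    for year, weight in sorted(pairs, key=lambda p: p[0], reverse=True):
        if weight > self.beta:
            # MRJob's default output protocol JSON-encodes the key and value,
            # which produces the quotation marks mentioned above
            yield term, "%s,%s" % (year, weight)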