COMP9313 23T2 Project 1 (12 marks)
Problem statement:
Detecting popular and trending topics from news articles is important for public opinion monitoring. In this project, your task is to perform text analysis over a dataset of Australian news from the ABC (Australian Broadcasting Corporation) using MRJob. The problem is to compute the weight of each term with respect to each year in the news dataset and to find the most important terms in each year, i.e., those whose weights are larger than a given threshold.
Input files:
The dataset you are going to use contains news headlines published over several years. In this text file, each line is the headline of a news article, in the format "date,term1 term2 ... ...". The date and the text are separated by a comma, and the terms are separated by spaces. A sample file looks like the one below (note that stop words such as “to”, “the”, and “in” have already been removed from the dataset):
20191124,woman stabbed adelaide shopping centre
20191204,economy continue teetering edge recession
20200401,coronanomics learnt coronavirus economy
20200401,coronavirus home test kits selling chinese community
20201015,coronavirus pacific economy foriegn aid china
20201016,china builds pig apartment blocks guard swine flu
20211216,economy starts bounce unemployment
20211224,online shopping rise due coronavirus
20211229,china close encounters elon musks
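For illustration only (this snippet is not part of the required submission), one way to split a line of the dataset into its year and terms is shown below; the variable names are ours, not mandated by the spec:

line = "20191124,woman stabbed adelaide shopping centre"
date, text = line.split(",", 1)   # the date and the headline are comma-separated
year = date[:4]                   # the year is the first four digits of the date
terms = text.split()              # the terms are space-separated
print(year, terms)                # 2019 ['woman', 'stabbed', 'adelaide', 'shopping', 'centre']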
Term weights computation:
To compute the weight of a term with respect to a year, please use the TF/IDF model. Specifically, the TF and IDF are computed as:
· TF(term t, year y) = the frequency of t in y
· IDF(term t, dataset D) = log10(the number of years in D / the number of years containing t)
Finally, the weight of term t with respect to year y is computed as:
· Weight(term t, year y, dataset D) = TF(term t, year y) * IDF(term t, dataset D)
Please import math and use math.log10() to compute the term weights.
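As a quick sanity check, the example below computes one weight by hand under the formulas above; the numbers (3 years in the dataset, the term appearing in 2 of them, 5 occurrences within the year in question) are hypothetical:

import math

tf = 5                    # TF(term t, year y): frequency of t in y
idf = math.log10(3 / 2)   # IDF(term t, dataset D): log10(#years / #years containing t)
weight = tf * idf         # Weight(term t, year y, dataset D)
print(weight)             # approximately 0.8805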
Code format:
Please name your Python file “project1.py” and compress it into a package named “zID_proj1.zip” (e.g., z5123456_proj1.zip).
Command of running your code:
To reduce the difficulty of the project, you are allowed to pass the total number of years to your job. We will also use more than one reducer to test your code. Assuming there are 20 years, β is set to 0.5, and 2 reducers are used, we will run your code with a command like the one below:
$ python3 project1.py -r hadoop hdfs_input -o hdfs_output --jobconf myjob.settings.years=20 --jobconf myjob.settings.beta=0.5 --jobconf mapreduce.job.reduces=2
· hdfs_input: input file in HDFS, e.g., “hdfs://localhost:9000/user/comp9313/input”
· hdfs_output: output folder in HDFS, e.g., “hdfs://localhost:9000/user/comp9313/output”
· You can access the total number of years and the value of β in your program via, e.g., “N = jobconf_from_env('myjob.settings.years')” (add “from mrjob.compat import jobconf_from_env” to your imports); see the sketch below.
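As a minimal sketch (assuming the parameters are read in a reducer_init method; any MRJob step would work), note that jobconf values arrive as strings and must be cast before use:

from mrjob.job import MRJob
from mrjob.compat import jobconf_from_env

class Project1(MRJob):   # illustrative class name, not mandated by the spec
    def reducer_init(self):
        # jobconf values are strings; cast them before arithmetic or comparison
        self.n_years = int(jobconf_from_env('myjob.settings.years'))
        self.beta = float(jobconf_from_env('myjob.settings.beta'))

if __name__ == '__main__':
    Project1.run()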
Output format:
You need to output all terms whose weights for a given year are larger than the threshold β (note that the same term can appear in multiple years). The format of each line is: “Term\tYear,Weight”. Sort the results first by term in alphabetical order and then by year in descending order.
For example, given the dataset above and β=0.4, the expected output can be checked at https://webcms3.cse.unsw.edu.au/COMP9313/23T2/resources/88345 (there is no need to remove the quotation marks generated by MRJob).
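One hedged way a final reducer (placed inside a class like the one sketched earlier) might emit records in this shape is shown below; it assumes the (year, weight) pairs for a term have already been gathered, and it leaves the cross-reducer ordering of terms to the shuffle configuration:

def reducer(self, term, pairs):
    # `pairs` is an assumed iterable of (year, weight) tuples for `term`;
    # sort years in descending order within each term
    for year, weight in sorted(pairs, key=lambda p: p[0], reverse=True):
        if weight > self.beta:
            # MRJob's default output protocol JSON-encodes the key and value,
            # which produces the quotation marks mentioned above
            yield term, "%s,%s" % (year, weight)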