ECON6087 Spring 2023 Assignment 1
For Assignment 1, we will use a new corpus, the “A Million News Headlines” corpus, covering all the news headlines published by the Australian news source ABC (Australian Broadcasting Corporation, http://www.abc.net.au) over a period of 19 years. The data can be accessed from the following Kaggle page: https://www.kaggle.com/datasets/therohk/million-headlines. You can also learn more about the dataset and find some coding examples on the same page. Please use this data to complete the following tasks:
1. Train word embeddings using word2vec on this corpus, and perform a sentiment analysis based on the word embeddings and the “positivity” vector. We construct this vector in the same way as Luca Bellodi (2022):
positivity = success + good + happy + perfect + important + worth + rich
− failure − bad − sad − terrible − bad − regret − poor
· Use the pre-processing steps that you consider appropriate;
· Decide on the number of dimensions, the number of iterations, and which model you would like to train;
· Choose a reasonable distance (or similarity) measure;
· Find a reasonable way to aggregate the word-level sentiment scores to the document level.
2. Plot the article-level sentiment scores by year-month.
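As a minimal sketch of the scoring in task 1, the Python snippet below builds the positivity vector by adding and subtracting seed-word vectors, scores each word by its cosine similarity to that vector, and averages the word scores within a headline. The toy three-dimensional embeddings, the function names, and the choice of cosine similarity and simple averaging are illustrative assumptions, not prescribed choices. A trained word2vec model would supply the real vectors, and the same arithmetic carries over to R.

```python
import math

POSITIVE = ["success", "good", "happy", "perfect", "important", "worth", "rich"]
# Copied from the handout, where "bad" is listed twice.
NEGATIVE = ["failure", "bad", "sad", "terrible", "bad", "regret", "poor"]


def positivity_vector(emb):
    """Sum the positive seed vectors and subtract the negative ones."""
    dim = len(next(iter(emb.values())))
    axis = [0.0] * dim
    for word in POSITIVE:
        axis = [a + b for a, b in zip(axis, emb[word])]
    for word in NEGATIVE:
        axis = [a - b for a, b in zip(axis, emb[word])]
    return axis


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def headline_score(tokens, emb, axis):
    """Average word-level similarity over tokens found in the vocabulary."""
    scores = [cosine(emb[t], axis) for t in tokens if t in emb]
    return sum(scores) / len(scores) if scores else None


# Toy embeddings (hypothetical values standing in for a trained model).
emb = {w: [1.0, 0.2, 0.0] for w in POSITIVE}
emb.update({w: [-1.0, 0.2, 0.0] for w in NEGATIVE})
emb.update({"wins": [0.8, 0.1, 0.3], "crash": [-0.7, 0.0, 0.4]})

axis = positivity_vector(emb)
print(headline_score(["team", "wins"], emb, axis))     # positive score
print(headline_score(["market", "crash"], emb, axis))  # negative score
```

Tokens missing from the vocabulary are skipped, and a headline with no known tokens gets no score (`None`) rather than zero, so it does not pull later averages toward neutral. Other aggregation rules, such as weighting words by frequency, are equally defensible.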
3. Try to construct sentiment scores toward different countries or international organizations, such as “US”, “UK”, “Russia”, “Iran”, “NATO”, and “UN”.
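One possible way to approach target-specific scores, sketched in Python under stated assumptions, is to keep only the headlines that mention a target and average their sentiment by year-month. The function name `monthly_target_sentiment`, the `YYYYMMDD` date strings, the toy rows, and the word-list scorer `toy_score` are all hypothetical; `toy_score` is a placeholder for whatever headline-level sentiment function is actually used.

```python
from collections import defaultdict

# Placeholder word scores (hypothetical); a real analysis would use
# embedding-based headline scores instead.
TOY_LEXICON = {"good": 1.0, "failure": -1.0}


def toy_score(tokens):
    """Mean lexicon value over known tokens, or None if none are known."""
    values = [TOY_LEXICON[t] for t in tokens if t in TOY_LEXICON]
    return sum(values) / len(values) if values else None


def monthly_target_sentiment(rows, aliases, score_fn):
    """Average headline scores by year-month for headlines naming a target.

    rows: iterable of (date, headline) pairs, date assumed to be 'YYYYMMDD'.
    aliases: set of lowercase tokens that identify the target.
    """
    buckets = defaultdict(list)
    for date, headline in rows:
        tokens = headline.lower().split()
        if not aliases.intersection(tokens):
            continue  # headline does not mention the target
        score = score_fn(tokens)
        if score is not None:
            buckets[date[:4] + "-" + date[4:6]].append(score)
    return {month: sum(s) / len(s) for month, s in sorted(buckets.items())}


# Toy rows (invented for illustration, not taken from the corpus).
rows = [
    ("20030219", "russia signs good trade deal"),
    ("20030225", "russia talks end in failure"),
    ("20030301", "un praises good progress"),
    ("20030305", "rain delays cricket match"),
]

print(monthly_target_sentiment(rows, {"russia"}, toy_score))
print(monthly_target_sentiment(rows, {"un"}, toy_score))
```

Matching is done on lowercased whole tokens, so a short target such as "us" would also match the pronoun; handling that ambiguity is a design decision left open here.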
Please submit your R Markdown file, including both the code that completes the above tasks and the plots as output. The deadline is 8 March, before class (6:15pm).