Assignment 3: LDA Topic Modeling
Note
Installing Tomotopy locally can return an error. If that happens, run this notebook on Google Colab.
Research Background
LDA is a popular topic modeling algorithm widely used in the Digital Humanities and Social Sciences. In political communication, topic modeling is often applied to politicians' Twitter/X posts to identify the thematic patterns or topics that run through them.
For this assignment, students will work with tweets from two US politicians, Donald Trump and Bernie Sanders, who are often regarded as a right-wing populist and a left-wing populist respectively. Right-wing populism often emphasizes nationalism, anti-immigration policies, and a critique of global elites from a culturally conservative perspective, focusing on preserving traditional values and social hierarchies. Left-wing populism prioritizes economic inequality, advocating for the redistribution of wealth, expansion of social services, and empowerment of the working class against the capitalist elite. While both forms of populism appeal to the "common people" against perceived elites and established structures, they diverge significantly in their identification of the elites, proposed solutions, and core ideologies. For a more detailed explanation, you can read the chapter by Macaulay (2019) "Bernie and the Donald: A comparison of left- and right-wing populist discourse" (full reference below).
Research Questions
- What topics emerge from Donald Trump's and Bernie Sanders's posts, considered separately?
- What are the topic differences between Trump (right-wing populist) and Sanders (left-wing populist)?
Aim:
- The first aim of the assignment is to conduct LDA topic modeling and identify the thematic patterns or topics that run through Trump's and Sanders's posts.
- The second aim is to critically evaluate the results of topic modeling. Try different numbers of topics to see with which settings the topics are more coherent. Critically reflect on the results of LDA topic modeling, discussing them in relation to existing theories about populism.
Data
Two datasets are prepared for this assignment: tweets from Trump and tweets from Sanders. Students are asked to work on both datasets.
Methods
- Word segmentation
- Removing stopwords
- LDA topic modeling
- Topic evaluation (coherence and human evaluation)
- Visualization of results.
References
- Macaulay, M. (2019). Bernie and the Donald: A comparison of left- and right-wing populist discourse. Populist discourse: International perspectives, 165-195.
Setup
Q1. Install necessary libraries, including tomotopy and little_mallet_wrapper, and import them
# Q1 (code)
Data preprocessing
Q2. Load the two datasets and concatenate them
The goal is to run topic modelling on the combined dataset of Sanders's and Trump's tweets.
# Q2 (code)
Q3. Clean the data
Transform all tweets to lowercase, remove stopwords, punctuation, and numbers. Add the processed text to a list called training_data. Create a list with the content of the tweets (original_texts) and a list that allows you to identify both the author of the tweet and its ID (titles).
# Q3 (code)
# Tip: add the following line to remove URLs and user mentions
processed_text = re.sub(r"http\S+|www\S+|https\S+|\/\/t|co\/|\@\w+|realdonaldtrump", '', processed_text, flags=re.MULTILINE)
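The cleaning steps described in Q3 could be combined roughly as follows. This is a minimal standard-library sketch, not the expected solution: the tiny STOPWORDS set, the clean_tweet helper, the example tweets, and the "Author_ID" title format are all hypothetical, and a stopword-aware helper from little_mallet_wrapper may be used instead.

```python
import re
import string

# Hypothetical, deliberately tiny stopword list; a real run needs a fuller one.
STOPWORDS = {"the", "a", "an", "and", "is", "to", "of", "in", "for", "our", "we", "will"}

# Same pattern as in the tip above: URLs, user mentions, and a recurring handle.
URL_MENTION_PATTERN = r"http\S+|www\S+|https\S+|\/\/t|co\/|\@\w+|realdonaldtrump"

# Map every punctuation character to a space so words stay separated.
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_tweet(text):
    """Lowercase a tweet and strip URLs, mentions, numbers, punctuation, stopwords."""
    text = text.lower()
    text = re.sub(URL_MENTION_PATTERN, "", text, flags=re.MULTILINE)
    text = re.sub(r"\d+", " ", text)
    text = text.translate(PUNCT_TABLE)
    words = [word for word in text.split() if word not in STOPWORDS]
    return " ".join(words)


# Hypothetical rows standing in for the combined dataset: (author, tweet ID, text).
tweets = [
    ("Trump", "1", "We will build the WALL in 2020! https://t.co/abc @realDonaldTrump"),
    ("Sanders", "2", "Healthcare is a human right for 100% of our people."),
]

training_data = []   # cleaned text, one string per tweet
original_texts = []  # untouched tweet text
titles = []          # assumed "Author_ID" label for each tweet

for author, tweet_id, text in tweets:
    training_data.append(clean_tweet(text))
    original_texts.append(text)
    titles.append(f"{author}_{tweet_id}")

print(training_data)
```

Keeping the three lists aligned by position means that index i always refers to the same tweet in training_data, original_texts, and titles.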
LDA topic modelling
Q4. Train an LDA topic model with tomotopy
# Q4 (code)
Q5. Print out the top words for each topic and manually evaluate their coherence
# Q5a (code)
# Q5b (words)
# Describe what each topic is about. What ideas, values, or situations do these keywords refer to?
Topic coherence
Use tomotopy's coherence module (tp.coherence.Coherence) to automatically calculate the topic coherence.
The coherence value can vary from 0 (no coherence) to 1 (maximum coherence). Interpret the results and, if needed, retrain the model using a different number of topics.
# There are different metrics for coherence, we choose `c_v`
coh = tp.coherence.Coherence(model, coherence='c_v')
average_coherence = coh.get_score()
coherence_per_topic = [coh.get_score(topic_id=k) for k in range(model.k)]
print('==== Coherence : {} ===='.format('c_v'))
print('Average:', average_coherence, '\nPer Topic:', coherence_per_topic)
print()
Q6. Interpret topic coherence
Report the following:
- number of topics you initially used to train the model and the coherence score you got
- changes made to the number of topics and new coherence scores obtained
# Q6 (words)
X1. Optional question 1
(This question is not compulsory; it only allows you to get an extra point.)
Create a function to plot the average coherence for models with different numbers of topics.
# X1 (code)
# Tip: y = average topic coherence; x = number of topics in the model
Q7. Topic distributions
Calculate the topic distributions for all tweets and get the top documents for some topics (between 2 and 5) that you think could be more representative of Sanders or Trump.
# Q7a (code)
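Ranking documents by their probability for a given topic could be sketched as below. The get_top_docs helper and the example values are hypothetical. With a trained tomotopy model, the per-document distributions can be read with doc.get_topic_dist() for each doc in model.docs.

```python
def get_top_docs(titles, topic_distributions, topic_index, n=3):
    """Return the n (title, probability) pairs with the highest weight on one topic."""
    ranked = sorted(
        zip(titles, topic_distributions),
        key=lambda pair: pair[1][topic_index],
        reverse=True,
    )
    return [(title, dist[topic_index]) for title, dist in ranked[:n]]


# Hypothetical labels and topic distributions (one row per tweet, one column per topic).
titles = ["Trump_1", "Sanders_1", "Trump_2", "Sanders_2"]
topic_distributions = [
    [0.9, 0.1],
    [0.2, 0.8],
    [0.7, 0.3],
    [0.1, 0.9],
]

for topic_index in range(2):
    print(f"Topic {topic_index}")
    for title, probability in get_top_docs(titles, topic_distributions, topic_index, n=2):
        print(f"  {probability:.2f}  {title}")
```

Because the titles carry the author, the printed ranking shows directly whether the top tweets of a topic come from one politician or both.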
Interpret the results above. Are there topics that have top tweets only by one politician? Why do you think these topics are more representative of one of the two politicians' views?
# Q7b (words)
Large scale analysis
Q8. Create a random sample of the whole dataset and visualize the topic distributions for the sampled tweets
# Create a sample of tweets
from random import sample
target_labels = sample(titles,100)
# Q8 (code)
# Create a heatmap using the random sample
# Tip: to display more than 20 tweets you have to change the values of `dim =` in sns.heatmap()
Q9. Interpret the heatmap
Do you see any pattern in the probability distributions of topics for each politician?
Are there topics that are more likely for one of the two politicians?
# Q9 (words)
X2. Optional question 2
(This question is not compulsory; it only allows you to get an extra point.)
Make the sample balanced, with 50 tweets by Trump and 50 by Sanders.
# X2 (code)
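One possible way to draw a balanced sample is to split the titles by author first and sample each group separately. The sketch below assumes every title starts with the author's name (for example "Trump_17"); that format, the balanced_sample helper, and the generated titles are hypothetical.

```python
import random


def balanced_sample(titles, per_author=50, seed=0):
    """Sample the same number of titles for each politician."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    trump_titles = [title for title in titles if title.startswith("Trump")]
    sanders_titles = [title for title in titles if title.startswith("Sanders")]
    return rng.sample(trump_titles, per_author) + rng.sample(sanders_titles, per_author)


# Hypothetical, unbalanced list of titles standing in for the real one.
titles = [f"Trump_{i}" for i in range(200)] + [f"Sanders_{i}" for i in range(80)]

target_labels = balanced_sample(titles, per_author=50)
print(len(target_labels))
```

random.sample raises a ValueError if a group has fewer tweets than per_author, which is a useful check that both politicians are well represented.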
X3. Optional question 3
(This question is not compulsory; it only allows you to get an extra point.)
Extend the analysis to all the tweets in the dataset.
# X3 (code and words)
# Tip: plotting a heatmap for thousands of tweets is not practical.
# Make a comparison based on the numerical values in the `df_norm_col` dataframe (see Week 6 notebook)
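A numerical comparison across all tweets could, for instance, average each topic's probability per politician. The sketch below is a plain-Python stand-in and does not reproduce the df_norm_col dataframe from the Week 6 notebook, which is not shown here. The helper name, the "Author_ID" title format, and the example values are hypothetical.

```python
from collections import defaultdict


def mean_topic_distribution_by_author(titles, topic_distributions):
    """Average the topic probabilities of all tweets belonging to each author."""
    sums = {}
    counts = defaultdict(int)
    for title, dist in zip(titles, topic_distributions):
        author = title.split("_")[0]  # assumes titles look like "Author_ID"
        if author not in sums:
            sums[author] = [0.0] * len(dist)
        for topic_index, probability in enumerate(dist):
            sums[author][topic_index] += probability
        counts[author] += 1
    return {
        author: [total / counts[author] for total in totals]
        for author, totals in sums.items()
    }


# Hypothetical labels and topic distributions for four tweets and two topics.
titles = ["Trump_1", "Trump_2", "Sanders_1", "Sanders_2"]
topic_distributions = [
    [1.0, 0.0],
    [0.5, 0.5],
    [0.25, 0.75],
    [0.25, 0.75],
]

means = mean_topic_distribution_by_author(titles, topic_distributions)
for author, averages in means.items():
    print(author, [round(value, 2) for value in averages])
```

Topics whose average probability differs strongly between the two authors are candidates for the politician-specific themes discussed in Q9.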