Assignment 3: LDA Topic Modeling

Note

Installing Tomotopy locally can fail with an error. If that happens, run this notebook on Google Colab.

Research Background

LDA (Latent Dirichlet Allocation) is a popular topic modeling algorithm widely used in the Digital Humanities and Social Sciences. In political communication research, topic modeling is often applied to politicians' Twitter/X posts to identify the thematic patterns, or topics, that run through them.

For this assignment, students will work with tweets from two US politicians, Donald Trump and Bernie Sanders, who are often regarded as a right-wing populist and a left-wing populist, respectively. Right-wing populism often emphasizes nationalism, anti-immigration policies, and a critique of global elites from a culturally conservative perspective, focusing on preserving traditional values and social hierarchies. Left-wing populism prioritizes economic inequality, advocating for the redistribution of wealth, expansion of social services, and empowerment of the working class against the capitalist elite. While both forms of populism appeal to the "common people" against perceived elites and established structures, they diverge significantly in their identification of the elites, proposed solutions, and core ideologies. For a more detailed explanation, you can read the chapter by Macaulay (2019), "Bernie and the Donald: A comparison of left- and right-wing populist discourse" (full reference below).

Research Questions

What topics characterize Donald Trump's posts and Bernie Sanders' posts, taken separately?
How do the topics differ between Trump (right-wing populist) and Sanders (left-wing populist)?

Aim:

The first aim of the assignment is to conduct LDA topic modeling: identify the thematic patterns or political themes in Trump's and Sanders' posts.
The second aim is to critically evaluate the results of topic modeling. Try different numbers of topics to see which settings produce more coherent topics. Critically reflect on the results of LDA topic modeling, discussing them in relation to existing theories about populism.

Data: Two datasets are prepared for this assignment, one with tweets from Trump and one with tweets from Sanders. Students are asked to work with both datasets.

Methods

Word segmentation
Removing stopwords
LDA topic modeling
Topic evaluation (coherence and human evaluation)
Visualization of results

References

Macaulay, M. (2019). Bernie and the Donald: A comparison of left- and right-wing populist discourse. Populist discourse: International perspectives, 165-195.

Setup

Q1. Install necessary libraries, including `tomotopy` and `little_mallet_wrapper`, and import them

In [ ]:

# Q1 (code)

Data preprocessing

Q2. Load the two datasets and concatenate them

The goal is to run topic modeling on the combined dataset of Sanders' and Trump's tweets.

In [ ]:

# Q2 (code)

Q3. Clean the data

Transform all tweets to lowercase and remove stopwords, punctuation, and numbers. Add the processed text to a list called `training_data`. Create a list with the original content of the tweets (`original_texts`) and a list that lets you identify both the author of each tweet and its ID (`titles`).

In [ ]:

# Q3 (code)

# Tip: add the following line to remove URLs and user mentions
processed_text = re.sub(r"http\S+|www\S+|https\S+|\/\/t|co\/|\@\w+|realdonaldtrump", '', processed_text, flags=re.MULTILINE)
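The cleaning steps described in Q3 could be sketched roughly as follows. This is a minimal illustration rather than the expected solution: the helper `clean_tweet`, the tiny `STOPWORDS` set, the example `rows`, and the `author_id` format used for `titles` are all assumptions made for the example. Only the regular expression comes from the tip above.

```python
import re
import string

# Small illustrative stopword set (an assumption); use a fuller list in practice.
STOPWORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "on", "or",
             "our", "the", "to", "we"}

# Same pattern as the tip above: strips URLs, user mentions, and "realdonaldtrump".
URL_MENTION_RE = re.compile(
    r"http\S+|www\S+|https\S+|\/\/t|co\/|\@\w+|realdonaldtrump",
    flags=re.MULTILINE,
)

# Translation table that turns every ASCII punctuation character into a space.
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_tweet(text):
    """Lowercase a tweet and drop URLs, mentions, punctuation, numbers, and stopwords."""
    text = text.lower()
    text = URL_MENTION_RE.sub("", text)
    text = text.translate(PUNCT_TABLE)
    text = re.sub(r"\d+", " ", text)
    return " ".join(tok for tok in text.split() if tok not in STOPWORDS)


# Hypothetical rows standing in for the combined dataset: (author, tweet id, text).
rows = [
    ("SenSanders", "1001",
     "We must RAISE the minimum wage to $15! https://t.co/abc @SenSanders"),
    ("realDonaldTrump", "2002",
     "The Wall is being built, 100 miles done. www.example.com"),
]

training_data, original_texts, titles = [], [], []
for author, tweet_id, text in rows:
    training_data.append(clean_tweet(text))   # processed text for the model
    original_texts.append(text)               # untouched tweet for later inspection
    titles.append(f"{author}_{tweet_id}")     # identifies author and tweet ID
```

Keeping the three lists in the same order means that index `i` always refers to the same tweet in `training_data`, `original_texts`, and `titles`.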

LDA topic modelling

Q4. Train an LDA topic model with `tomotopy`

In [ ]:

# Q4 (code)
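A possible shape for the training step, assuming `training_data` is a list of cleaned, space-separated strings as in Q3. The function name `train_lda` and the default values (15 topics, 100 iterations, seed 42) are arbitrary choices for illustration, not values prescribed by the assignment.

```python
def train_lda(training_data, num_topics=15, iterations=100, seed=42):
    """Train a tomotopy LDA model on whitespace-tokenised documents."""
    # Imported inside the function so it can be defined before tomotopy is installed.
    import tomotopy as tp

    model = tp.LDAModel(k=num_topics, seed=seed)
    for text in training_data:
        words = text.strip().split()
        if words:  # skip tweets that cleaning left empty
            model.add_doc(words)

    # Train in steps of 10 and report the per-word log-likelihood after each step.
    # If `iterations` is not a multiple of 10, the remainder is not trained.
    step = 10
    for done in range(step, iterations + 1, step):
        model.train(step)
        print(f"Iteration: {done}\tLog-likelihood: {model.ll_per_word:.4f}")
    return model
```

Note that skipping empty documents shifts document indices in the model. If any tweets are dropped here, the same tweets would need to be dropped from `titles` and `original_texts` to keep the lists aligned.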

Q5. Print out the top words for each topic and manually evaluate their coherence

In [ ]:

# Q5a (code)
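One way to list the top words per topic, sketched under the assumption that `model` is a trained `tomotopy.LDAModel`. It relies on the model's `k` attribute and `get_topic_words()` method; the helper names `top_words_per_topic` and `print_topics` are made up for this example.

```python
def top_words_per_topic(model, top_n=10):
    """Return {topic id: [top words]} for a trained topic model."""
    return {
        topic_id: [word for word, _prob in model.get_topic_words(topic_id, top_n=top_n)]
        for topic_id in range(model.k)
    }


def print_topics(model, top_n=10):
    """Print one line per topic with its most probable words."""
    for topic_id, words in top_words_per_topic(model, top_n).items():
        print(f"Topic {topic_id}: {' '.join(words)}")
```

Calling `print_topics(model)` after training gives a compact list to read through when judging by hand whether each topic hangs together.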

In [ ]:

# Q5b (words)
# Describe what each topic is about. What ideas, values, or situations do these keywords refer to?

Topic coherence

Use tomotopy's `coherence` module (`tp.coherence.Coherence`) to automatically calculate the topic coherence.

For the `c_v` metric used below, the coherence value can vary from 0 (no coherence) to 1 (maximum coherence). Interpret the results and, if needed, retrain the model using a different number of topics.

In [ ]:

# There are different metrics for coherence, we choose `c_v`

coh = tp.coherence.Coherence(model, coherence='c_v')
average_coherence = coh.get_score()
coherence_per_topic = [coh.get_score(topic_id=k) for k in range(model.k)]

print('==== Coherence : {} ===='.format('c_v'))
print('Average:', average_coherence, '\nPer Topic:', coherence_per_topic)
print()

Q6. Interpret topic coherence

Report the following:

number of topics you initially used to train the model and the coherence score you got
changes made to the number of topics and new coherence scores obtained

In [ ]:

# Q6 (words)

X1. Optional question 1

(This question is not compulsory; it only allows you to get an extra point.)

Create a function to plot the average coherence for models with different numbers of topics.

In [ ]:

# X1 (code)
# Tip: y = average topic coherence; x = number of topics in the model
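A rough sketch of how such a plot might be produced, assuming `training_data` is the list of cleaned strings from Q3 and that `matplotlib` is available. The function names, the default iteration count, and the seed are illustrative assumptions. The coherence call mirrors the `c_v` cell above.

```python
def average_coherence(training_data, num_topics, iterations=100, seed=42):
    """Train an LDA model with `num_topics` topics and return its average c_v score."""
    import tomotopy as tp  # imported lazily; requires tomotopy to be installed

    model = tp.LDAModel(k=num_topics, seed=seed)
    for text in training_data:
        words = text.strip().split()
        if words:  # skip empty documents
            model.add_doc(words)
    model.train(iterations)
    return tp.coherence.Coherence(model, coherence="c_v").get_score()


def plot_coherence(scores):
    """Plot {number of topics: average coherence} as a line chart."""
    import matplotlib.pyplot as plt  # imported lazily; requires matplotlib

    ks = sorted(scores)
    fig, ax = plt.subplots()
    ax.plot(ks, [scores[k] for k in ks], marker="o")
    ax.set_xlabel("Number of topics")
    ax.set_ylabel("Average topic coherence (c_v)")
    ax.set_title("Coherence by number of topics")
    plt.show()
    return ax


# Possible usage (left commented out because training several models takes time):
# scores = {k: average_coherence(training_data, k) for k in range(5, 31, 5)}
# plot_coherence(scores)
```

Because LDA training is stochastic, small differences between neighbouring values of `k` may not be meaningful. Fixing the seed at least makes the comparison repeatable.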

Q7. Topic distributions

Calculate the topic distributions for all tweets and get the top documents for some topics (between 2 and 5) that you think could be more representative of Sanders or Trump.

In [ ]:

# Q7a (code)
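The ranking step could look something like the sketch below. It assumes `topic_distributions` is a list with one probability list per tweet, in the same order as `titles` and `original_texts`. With tomotopy, such a list can be built from `doc.get_topic_dist()` for each document in `model.docs`. The helper name `get_top_docs` is an assumption for this example.

```python
def get_top_docs(titles, texts, topic_distributions, topic_index, n=5):
    """Return the n documents with the highest probability for one topic.

    Each result is a (probability, title, text) tuple, highest probability first.
    """
    ranked = sorted(
        zip(titles, texts, topic_distributions),
        key=lambda item: item[2][topic_index],
        reverse=True,
    )
    return [(dist[topic_index], title, text) for title, text, dist in ranked[:n]]


# Possible usage with a trained tomotopy model (commented out, needs `model`):
# topic_distributions = [list(doc.get_topic_dist()) for doc in model.docs]
# for prob, title, text in get_top_docs(titles, original_texts, topic_distributions, 3):
#     print(f"{prob:.3f}  {title}  {text}")
```

Since each title carries the author, scanning the titles of the top documents for a topic shows quickly whether one politician dominates it.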

Interpret the results above. Are there topics that have top tweets only by one politician? Why do you think these topics are more representative of one of the two politicians' views?

In [ ]:

# Q7b (words)

Large scale analysis

Q8. Create a random sample of the whole dataset and visualize the topic distributions for the sampled tweets

In [ ]:

# Create a sample of tweets

from random import sample

target_labels = sample(titles,100)

In [ ]:

# Q8 (code)
# Create a heatmap using the random sample
# Tip: to display more than 20 tweets you have to change the value of `dim=` in the heatmap plotting function, which wraps sns.heatmap()

Q9. Interpret the heatmap

Do you see any pattern in the probability distributions of topics for each politician?

Are there topics that are more likely for one of the two politicians?

In [ ]:

# Q9 (words)

X2. Optional question 2

(This question is not compulsory; it only allows you to get an extra point.)

Make the sample balanced, with 50 tweets by Trump and 50 by Sanders.

In [ ]:

# X2 (code)
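A balanced sample could be drawn along these lines. The sketch assumes each entry in `titles` contains the author's name somewhere (for example `SenSanders_1001`), which depends on how `titles` was built in Q3. The function name `balanced_sample` and the fixed seed are illustrative choices.

```python
import random


def balanced_sample(titles, authors=("Trump", "Sanders"), per_author=50, seed=0):
    """Draw the same number of titles for each author, without replacement."""
    rng = random.Random(seed)  # local generator so the sample is reproducible
    picked = []
    for author in authors:
        # Case-insensitive match, so "realDonaldTrump_..." counts as Trump.
        pool = [t for t in titles if author.lower() in t.lower()]
        # rng.sample raises ValueError if the pool has fewer than per_author items.
        picked.extend(rng.sample(pool, per_author))
    return picked
```

The result can be used in place of `target_labels` from the previous cell, so the heatmap shows 50 rows per politician.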

X3. Optional question 3

(This question is not compulsory; it only allows you to get an extra point.)

Extend the analysis to all the tweets in the dataset.

In [ ]:

# X3 (code and words)
# Tip: plotting a heatmap for thousands of tweets is not practical.
# Make a comparison based on the numerical values in the `df_norm_col` dataframe (see Week 6 notebook)
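As a plain-Python alternative to the dataframe comparison mentioned in the tip (this is not the Week 6 approach, whose details are not reproduced here), the per-tweet topic distributions can be averaged by author. The sketch assumes titles follow an `author_id` pattern and that `topic_distributions` holds one probability list per tweet. The function name is hypothetical.

```python
def mean_topic_dist_by_author(titles, topic_distributions):
    """Average the per-tweet topic distributions separately for each author."""
    sums, counts = {}, {}
    for title, dist in zip(titles, topic_distributions):
        author = title.split("_")[0]  # assumes titles look like "<author>_<tweet id>"
        if author not in sums:
            sums[author] = [0.0] * len(dist)
            counts[author] = 0
        sums[author] = [s + p for s, p in zip(sums[author], dist)]
        counts[author] += 1
    return {
        author: [s / counts[author] for s in total]
        for author, total in sums.items()
    }
```

Comparing the two resulting lists topic by topic gives a numerical view of which topics are, on average, more probable in one politician's tweets than in the other's.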
