Assignment 2: Content Analysis and Regression
For this assignment, you need to test a hypothesis using multiple linear regression. Before doing that, you also need to use computational content analysis and NLP techniques to create new variables that you will use as predictors in the regression model.
##Research Background##
According to Chatman (1980), characters are constructs within abstracted narratives, described through networks of personality traits (e.g., Sarrasine is feminine, Othello is jealous, Roland Deschain is brave). Docherty characterizes the process of depicting and interpreting characters in literature as 'characterization' (cited by Bennett and Royle, 2017). Reaske (1996) identifies several devices of characterization, including character appearance, asides and soliloquies, dialogue, hidden narration, language, and actions performed. Characterization is crucial in narrative because it allows readers to relate to characters and feel emotionally engaged in the story (Cohen, 2001). It also provides information on personalities and behaviors that can be used to analyze gender representation in fiction.
For this assignment, you'll work with a corpus of the genre Real Person Fiction (RPF), where characters are characterized by blending real-life traits with fans' interpretations and reimagination, reflecting societal and cultural trends.
On the online fanfiction platform AO3, fanfictions about the Korean boy band BTS represent the largest fandom, surpassing even the Marvel Universe and Harry Potter franchises. Research into the global popularity of the Korean Wave (Hallyu) has highlighted the concept of "manufactured versatile masculinity" exhibited by male K-pop idols, a blend of softer, more effeminate appearances or behaviors with traditional forms of hegemonic masculinity, described by scholars such as Jung (2011), Kuo et al. (2020), Kwon (2019), and Oh (2015). Oh (2015) terms this "liminal masculinity," with androgynous K-pop male idols crossing gender lines.
Aim:
This assignment aims to analyze the impact of soft masculinity on K-pop fanfiction's success using a corpus of 100 BTS fanfictions.
Data:
We will utilize a dataset from the GOLEM project, comprising 100 BTS-related fanfictions, including story ID, publication year, word count, kudos, comments, and story content in English (1,000 to 1,200 words).
Methods:
- operationalize the concept of 'soft masculinity' to make it measurable
- use regression analysis to test a hypothesis
# Let's check how many rows there are now
df.shape[0]
##Research Design##
The steps of this research involve formulating a hypothesis, selecting kudos as a proxy for the success of a story (the dependent variable, Y), and calculating a masculinity score to be used as the independent variable (x1). We will also use additional variables that are likely to affect the success of a story: publication_year (x2), because the AO3 user base has grown over time and more recent stories are likely to receive more kudos simply because there are more readers on the platform, and lexical richness (x3), because it is plausible that a story with a richer vocabulary also has a better style and is therefore liked more by readers.
Note that we don't have variables for masculinity_score and lexical_richness yet, so we need to calculate them from the text of the stories.
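The assignment does not prescribe a formula for lexical richness. One simple and common choice is the type-token ratio (unique words divided by total words). The sketch below assumes that measure and a basic regex tokenizer; the function name `type_token_ratio` is hypothetical and not part of the course material.

```python
import re

def type_token_ratio(text: str) -> float:
    """Return unique tokens / total tokens for a text (0.0 if the text has no tokens)."""
    # Lowercase and keep alphabetic word tokens (apostrophes allowed, e.g. "don't").
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

example = "He smiled and he laughed and he sang"
print(type_token_ratio(example))  # 5 unique tokens out of 8 -> 0.625
```

On the real data this could be applied with something like `df['lexical_richness'] = df['story_content'].apply(type_token_ratio)`. The type-token ratio is sensitive to text length, which matters less here because all stories are between 1,000 and 1,200 words.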
The hypothesis states:
H1: Low levels of masculinity in male characters positively affect fanfiction success when controlling for publication year and lexical richness.
H0: What is the null hypothesis?
# H0 (answer in words):
##Compute Masculinity Score##
To calculate a stereotypical masculinity score, we can refer to older theories of perceived gender identity, which tend to define gender roles in a stereotypical way. An example of this is the Bem Sex-Role Inventory (BSRI) by Dr. Sandra Lipsitz Bem (1974), which classifies personality traits into masculine, feminine, and androgynous.
The inventory comprises 60 personality traits: 20 masculine traits, 20 feminine traits, and 20 neutral traits (see figure below).
The above list shows that, although recent discussions of masculinity, femininity, and gender roles have become more diverse, traditional definitions such as those provided by the Bem Sex-Role Inventory (BSRI) can still be useful for detecting gender stereotypes. Within the definitions of masculinity and femininity outlined by the BSRI, we observe a clear power imbalance: masculinity is associated with dominance (e.g., assertive, strong personality, forceful, leadership ability, dominant, aggressive, ambitious), while femininity leans towards submissiveness (e.g., yielding, understanding, tender). Therefore, we can employ the power-agent frames designed by Sap et al. to compute a power score for the male characters in the fanfiction stories. Lower masculinity scores can plausibly be associated with a representation of 'soft masculinity' in a character.
###Riveter###
In the W5 lab, we have already gained preliminary experience with the Riveter pipeline.
In this section, we will use the Riveter pipeline with Sap's power-agent frames to calculate the masculinity_score for identifiable agents in the text. Since we are interested only in the masculinity of male characters, we will use regular expressions to identify male pronouns (he, him, himself) and calculate their corresponding masculinity_score. This score will be added to the df as a new column.
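As a rough illustration of the regular-expression step, the sketch below uses only Python's `re` module to pick out the three male pronouns from a list of tokens. It is independent of Riveter: the names `MALE_PRONOUN_PATTERN` and `is_male_pronoun` are hypothetical, and how Riveter itself applies a pattern (for example, whether it lowercases tokens first) should be checked against the W5 notebook.

```python
import re

# Anchored, case-insensitive pattern: a token matches only if it is exactly
# "he", "him", or "himself", so words like "the" or "hero" do not match.
MALE_PRONOUN_PATTERN = re.compile(r"^(?:he|him|himself)$", re.IGNORECASE)

def is_male_pronoun(token: str) -> bool:
    """Return True if the whole token is one of the three male pronouns."""
    return MALE_PRONOUN_PATTERN.match(token) is not None

tokens = ["He", "gave", "the", "hero", "a", "gift", "himself"]
print([t for t in tokens if is_male_pronoun(t)])  # ['He', 'himself']
```

The `^` and `$` anchors are the important design choice: without them, the pattern `he` would also match inside longer words.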
# Set up everything you need to use Riveter, following the notebook we used in W5 lab
# No need to put the code here, as long as it's working
# We assume that you have installed all the required packages, either locally or on Colab
Now we have prepared all the dependencies needed.
from collections import defaultdict
import os
import pandas as pd
import random
from riveter import Riveter # if the notebook is not in the /riveter folder, this will throw an error
import seaborn as sns
import matplotlib.pyplot as plt
Prepare the corpus:
First, we can try to initialize two lists to store story_content and story_id, but this will cause the problem below:
# Q0 (code): Load lexicon 'power' and create an empty dictionary called 'scores_dict'
Now we can apply the splitting function we defined above and then use Riveter on our corpus. To do this, we need to create a loop that iterates through all the rows in the dataframe and compute scores for each story.
Q1. Train Riveter to assign scores to text based on male pronouns
from tqdm import tqdm # used to display a progress bar when executing code
for index, row in tqdm(df.iterrows(), total=df.shape[0], desc="Processing stories"):
    story_id = row['story_id']
    story_content = row['story_content']

    # apply the splitting function:
    segments = split_text_into_segments(story_content)
    text_ids = [f"{story_id}_{i}" for i in range(len(segments))]

    # Q1 (code): train riveter specifying 'persona_patterns_dict= ' to assign scores only based on male pronouns
    # write code below:

    # store the computed scores in a dictionary
    persona_score_dict = riveter.get_score_totals()
    masculine_score = persona_score_dict.get('masculine', 0)

    # get feedback about the computed scores while the code is running
    print(f"Story ID: {story_id}, Masculine Power Score: {masculine_score}")

    # store the score of each story in the same dictionary so that we can then add it to the dataframe
    scores_dict[story_id] = masculine_score

# add the dictionary of scores to the dataframe
df['masculine_power_score'] = df['story_id'].map(scores_dict)
Q2. Print a sample of the dataframe to check whether the masculine_power_score has been added correctly
# Q2 (code)
Methodology
- Multiple Linear Regression: Perform a regression analysis with kudos as the dependent variable and masculine_power_score, published_year, and lexical_richness as independent variables.
- Residual Analysis: Conduct normality and homoscedasticity tests on the residuals to validate the assumptions of linear regression.
- Model Evaluation: Assess the model using adjusted R-squared, F-test, and t-tests for individual coefficients.
- Multicollinearity Check: Calculate the Variance Inflation Factor (VIF) for the independent variables to detect possible multicollinearity.
Q3. Check the data distribution and handle missing values
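A minimal sketch of such a check, using a tiny hypothetical dataframe in place of the real one. Dropping incomplete rows is only one possible way to handle missing values and is shown here as an assumption, not as the required approach.

```python
import numpy as np
import pandas as pd

# Tiny hypothetical dataframe standing in for the real one.
demo = pd.DataFrame({
    "kudos": [12, 340, 58, np.nan, 1500, 7],
    "masculine_power_score": [0.10, -0.05, np.nan, 0.20, 0.00, 0.15],
    "published_year": [2018, 2020, 2019, 2021, 2022, 2017],
})

print(demo.isna().sum())         # number of missing values per column
print(demo["kudos"].describe())  # summary statistics for the dependent variable
print(demo["kudos"].skew())      # a positive value indicates a long right tail

# One simple option: drop rows with a missing value in any model variable.
clean = demo.dropna(subset=["kudos", "masculine_power_score", "published_year"])
print(clean.shape)  # (4, 3): two of the six rows contained a missing value
```

Count-like variables such as kudos are often strongly right-skewed, so it is worth looking at a histogram as well before fitting the model.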
# Q6b (words): Interpret F-test result
# Q6c (words): Interpret coefficients and t-test result
Based on the OLS regression results provided, here is an example analysis:
Normality Test, Homoscedasticity Test
# Q7a (code): Calculate residuals and do a Shapiro-Wilk Test
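To show how the Shapiro-Wilk test behaves, the sketch below applies `scipy.stats.shapiro` to two synthetic residual vectors: one drawn from a normal distribution and one drawn from a clearly skewed distribution. These arrays are stand-ins; with a fitted statsmodels OLS model, the residuals would typically come from its `resid` attribute.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
normal_resid = rng.normal(0, 1, 100)      # normal by construction
skewed_resid = rng.exponential(1.0, 100)  # right-skewed by construction

for name, resid in [("normal", normal_resid), ("skewed", skewed_resid)]:
    # shapiro returns the W statistic and the p-value;
    # a small p-value is evidence against normality.
    stat, p_value = stats.shapiro(resid)
    print(f"{name}: W = {stat:.3f}, p = {p_value:.4f}")
```

The null hypothesis of the test is that the sample comes from a normal distribution, so a p-value below the chosen significance level (commonly 0.05) means normality is rejected.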
# Q7b (words): Write your analysis for the Normality Test here:
# Q7c (code): Homoscedasticity Test (plot residuals vs. predictions)
In the residuals vs. predicted values plot, you would look for patterns. In a well-fitted model, you would expect to see the residuals randomly scattered around zero, with no clear pattern. The presence of a pattern might suggest issues with model specification, such as non-linearity or heteroscedasticity.
# Q7d (words): Write your analysis for the Homoscedasticity Test here:
Q8: Multicollinearity
# Q8a (code)
from statsmodels.stats.outliers_influence import variance_inflation_factor
Regarding multicollinearity, the VIF values for masculine_power_score, lexical_richness, and published_year are close to 1, which suggests low multicollinearity. However, the very high VIF for the const term, along with the large condition number, suggests that there may be numerical issues, possibly due to a large scale difference between predictors or multicollinearity issues not captured by standard VIF calculations.
# Q8b (words): Write your analysis for the multicollinearity test here:
Q9: Reflection
# Q9 (words): Write your reflection on the whole research framework and corresponding result here, e.g., what do you think can be improved?
##References##
Chatman, Seymour Benjamin. Story and Discourse: Narrative Structure in Fiction and Film. Ithaca, NY: Cornell University Press, 1980.
Bennett, Andrew, and Nicholas Royle. Introduction to Literature Criticism and Theory. Edinburgh: Pearson Education Limited, 2004. Web. July 2017.
Reaske, Christopher Russell. Analyze Drama. New York: Monarch Press, 1996. Print.
Jung, Sun, “Bae Yong-Joon, Soft Masculinity, and Japanese Fans: Our Past Is in Your Present Body”, from Korean Masculinities and Transcultural Consumption, Hong Kong Scholarship Online, 2010.
Kuo, Linda, et al., “Performance, Fantasy, or Narrative: LGBTQ+ Asian American Identity through Kpop Media in Fandom”, Journal of Homosexuality, 2020.
Kwon, Jungmin, Straight Korean Female Fans and Their Gay Fantasies, University of Iowa Press, 2019, ebook.
Oh, Chuyun, “Queering spectatorship in K-pop: The androgynous male dancing body and western female fandom”, Journal of Fandom Studies, vol. 3, no. 1, 2015, pp. 59-78.