Assignment 2: Content Analysis and Regression

For this assignment, you need to test a hypothesis using multiple linear regression. Before doing that, you also need to use computational content analysis and NLP techniques to create new variables that you will use as predictors in the regression model.

##Research Background##

According to Chatman (1980), characters are constructs within abstracted narratives, described through networks of personality traits (e.g., Sarrasine is feminine, Othello is jealous, Roland Deschain is brave). Docherty characterizes the process of depicting and interpreting characters in literature as 'characterization' (cited by Bennett and Royle, 2017). Reaske (1996) identifies several devices of characterization, including character appearance, asides and soliloquies, dialogue, hidden narration, language, and actions performed. Characterization is crucial in narrative because it allows readers to relate to characters and feel emotionally engaged in the story (Cohen, 2001), and it provides information on personalities and behaviors that can be used to analyze gender representation in fiction.

For this assignment, you'll work with a corpus of the genre Real Person Fiction (RPF), where characters are characterized by blending real-life traits with fans' interpretations and reimagination, reflecting societal and cultural trends.

On the online fanfiction platform AO3, fanfictions about the Korean boy band BTS represent the largest fandom, surpassing even the Marvel Universe and Harry Potter franchises. Research into the global popularity of the Korean Wave (Hallyu) has highlighted the concept of "manufactured versatile masculinity" exhibited by male K-pop idols, a blend of softer, more effeminate appearances or behaviors with traditional forms of hegemonic masculinity, described by scholars such as Jung (2011), Kuo et al. (2020), Kwon (2019), and Oh (2015). Oh (2015) terms this "liminal masculinity," with androgynous K-pop male idols crossing gender lines.

Aim:

This assignment aims to analyze the impact of soft masculinity on K-pop fanfiction's success using a corpus of 100 BTS fanfictions.

Data:

We will utilize a dataset from the GOLEM project, comprising 100 BTS-related fanfictions, including story ID, publication year, word count, kudos, comments, and story content in English (1,000 to 1,200 words).

Methods:

operationalize the concept of 'soft masculinity' to make it measurable
use regression analysis to test a hypothesis

# Let's check how many rows there are now
df.shape[0]

##Research Design##

The steps of this research involve formulating a hypothesis, selecting kudos as a proxy for the success of a story (the dependent variable, Y), and calculating a masculinity score to be used as the independent variable (x1). We will also use two additional variables that are likely to affect the success of a story: publication_year (x2), because the AO3 user base has grown over time and more recent stories are more likely to receive more kudos simply because there are more readers on the platform; and lexical richness (x3), because it is plausible that a story with a richer vocabulary also has a better style and is therefore liked more by readers.
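The design above can be written as a single regression equation. The symbols below (the coefficients, the error term, and the story index i) are notation introduced here for illustration and are not taken from the assignment text:

```latex
\[
\text{kudos}_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{3i} + \varepsilon_i
\]
% x_{1i}: masculinity score of story i
% x_{2i}: publication year of story i
% x_{3i}: lexical richness of story i
```

Read this way, a claim that lower masculinity goes with more kudos is a claim about the sign of \(\beta_1\) (negative), with \(\beta_2\) and \(\beta_3\) absorbing the effects of publication year and lexical richness.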

Note that we don't have variables for masculinity_score and lexical_richness yet, so we need to calculate them from the text of the stories.
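The assignment does not say which measure of lexical richness to use. One common option is the type-token ratio (unique words divided by total words). The sketch below assumes that measure and a simple regex tokenizer; the function name `type_token_ratio` is hypothetical. Because the stories have similar lengths (1,000 to 1,200 words), the well-known sensitivity of this ratio to text length should matter less here.

```python
import re


def type_token_ratio(text: str) -> float:
    """Return unique tokens / total tokens for a lowercased text (assumed measure)."""
    # Assumption: a token is any run of letters or apostrophes
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0  # avoid division by zero for empty stories
    return len(set(tokens)) / len(tokens)


sample = "He smiled and he laughed and he sang"
ttr = type_token_ratio(sample)  # 5 unique tokens out of 8
print(ttr)
```

Applied to the corpus, this could fill a new column with something like `df['lexical_richness'] = df['story_content'].apply(type_token_ratio)`, assuming the column names used elsewhere in this assignment.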

The hypothesis states:

H1: Low levels of masculinity in male characters positively affect fanfiction success when controlling for publication year and lexical richness.

H0: What is the null hypothesis?

# H0 (answer in words):

##Compute Masculinity Score##

To calculate a stereotypical masculinity score, we can refer to older theories of perceived gender identity, which are likely to define gender roles in a stereotypical way. An example is the Bem Sex-Role Inventory (BSRI) by Dr. Sandra Lipsitz Bem (1974), which classifies personality traits as masculine, feminine, or androgynous.

The inventory comprises 60 personality traits: 20 masculine traits, 20 feminine traits, and 20 neutral traits (see figure below).

[Figure (Image 1.png): the BSRI list of masculine, feminine, and neutral traits]

The above list shows that, although recent discussions about masculinity, femininity, and gender roles have become more diversified, traditional definitions such as those provided by the Bem Sex-Role Inventory (BSRI) can be useful for detecting gender stereotypes. Within the definitions of masculinity and femininity outlined by the BSRI, we observe a clear power imbalance: masculinity is associated with dominance (e.g., assertive, strong personality, forceful, leadership ability, dominant, aggressive, ambitious), while femininity leans towards submissiveness (e.g., yielding, understanding, tender). Therefore, we can employ the power-agent frames designed by Sap et al. to compute a power score for the male characters in the fanfiction stories. Lower masculinity scores can plausibly be associated with a representation of 'soft masculinity' in a character.

###Riveter###

In the W5 lab, we have already gained preliminary experience with the Riveter pipeline.

In this section, we will use the Riveter pipeline with Sap's power-agent frames to calculate the masculinity_score for identifiable agents in the text. Since we are interested only in the masculinity of male characters, we will use regular expressions to identify male pronouns (he, him, himself) and calculate their corresponding masculinity_score. This score will be added to the df as a new column.
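To make the pronoun-matching idea concrete, here is a minimal sketch that uses only Python's `re` module. It maps the label 'masculine' to an anchored pattern for he, him, and himself, then counts matching tokens in a sentence. The helper `count_persona_tokens` and the tokenizer are hypothetical and only illustrate how such a pattern behaves; they are not part of Riveter.

```python
import re

# One persona label mapped to an anchored regex, so that 'her' or 'she' do not match
persona_patterns = {"masculine": r"^he$|^him$|^himself$"}


def count_persona_tokens(text, patterns):
    """Count the tokens that match each persona pattern (illustrative helper)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = {label: 0 for label in patterns}
    for token in tokens:
        for label, pattern in patterns.items():
            if re.match(pattern, token):
                counts[label] += 1
    return counts


example = "He told her that he would do it himself, and she believed him."
counts = count_persona_tokens(example, persona_patterns)
print(counts)
```

A dictionary of this shape is presumably what the `persona_patterns_dict` argument in Q1 expects. Check the W5 lab notebook for the exact call.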

# Set up everything you need to use Riveter, following the notebook we used in W5 lab
# No need to put the code here, as long as it's working
# We assume that you have installed all the required packages, either locally or on Colab

Now we have prepared all the dependencies needed.

from collections import defaultdict
import os
import pandas as pd
import random
from riveter import Riveter # if the notebook is not in the /riveter folder, this will throw an error

import seaborn as sns
import matplotlib.pyplot as plt

Prepare the corpus:

First, we can try to initialize two lists to store story_content and story_id, but this will cause the problem below:

# Q0 (code): Load lexicon 'power' and create an empty dictionary called 'scores_dict'

Now we can apply the splitting function we defined above and then use Riveter on our corpus. To do this, we need to create a loop that iterates through all the rows in the dataframe and computes scores for each story.
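The splitting function itself does not appear in this section. The sketch below is one possible version, assuming it simply cuts a story into consecutive chunks of at most `max_words` words. The chunk size and the whitespace-based splitting are assumptions, not the lab's definition.

```python
def split_text_into_segments(text, max_words=100):
    """Split a text into consecutive segments of at most max_words words (assumed behavior)."""
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


# Small demonstration with a tiny chunk size
segments = split_text_into_segments("a b c d e", max_words=2)
text_ids = [f"story42_{i}" for i in range(len(segments))]  # 'story42' is a made-up ID
print(segments, text_ids)
```

Each segment gets its own ID built from the story ID and the segment index, which mirrors the `text_ids` line in the loop for Q1.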

Q1. Train Riveter to assign scores to text based on male pronouns

from tqdm import tqdm # used to display a progress bar when executing code

for index, row in tqdm(df.iterrows(), total=df.shape[0], desc="Processing stories"):
    story_id = row['story_id']
    story_content = row['story_content']

    # apply the splitting function:
    segments = split_text_into_segments(story_content)
    text_ids = [f"{story_id}_{i}" for i in range(len(segments))]

    # Q1 (code): train riveter specifying 'persona_patterns_dict= ' to assign scores only based on male pronouns
    # write code below:

    # store the computed scores in a dictionary
    persona_score_dict = riveter.get_score_totals()
    masculine_score = persona_score_dict.get('masculine', 0)

    # get a feedback about the computed scores while the code is running
    print(f"Story ID: {story_id}, Masculine Power Score: {masculine_score}")

    # store the score of each story in the same dictionary so that we can then add it to the dataframe
    scores_dict[story_id] = masculine_score

# add the dictionary of scores to the dataframe
df['masculine_power_score'] = df['story_id'].map(scores_dict)

Q2. Print a sample of the dataframe to check whether the masculine_power_score has been added correctly

# Q2 (code)

Methodology

Multiple Linear Regression: Perform a regression analysis with kudos as the dependent variable and masculine_power_score, published_year, and lexical_richness as independent variables.
Residual Analysis: Conduct normality and homoscedasticity tests on the residuals to validate the assumptions of linear regression.
Model Evaluation: Assess the model using adjusted R-squared, F-test, and t-tests for individual coefficients.
Multicollinearity Check: Calculate the Variance Inflation Factor (VIF) for the independent variables to detect possible multicollinearity.

Q3. Check the data distribution and handle missing values

# Q6b (words): Interpret F-test result

# Q6c (words): Interpret coefficients and t-test result

Based on the OLS regression results provided, here is an example analysis:

Normality Test, Homoscedasticity Test

# Q7a (code): Calculate residuals and do a Shapiro-Wilk Test

# Q7b (words): Write your analysis for the Normality Test there:

# Q7c (code): Homoscedasticity Test (plot residuals vs. predictions)

In the residuals vs. predicted values plot, you would look for patterns. In a well-fitted model, you would expect to see the residuals randomly scattered around zero, with no clear pattern. The presence of a pattern might suggest issues with model specification, such as non-linearity or heteroscedasticity.

# Q7d (words): Write your analysis for the Homoscedasticity Test here:

Q8: Multicollinearity

# Q8a (code)
from statsmodels.stats.outliers_influence import variance_inflation_factor

Regarding multicollinearity, the VIF values for masculine_power_score, lexical_richness, and published_year are close to 1, which suggests low multicollinearity. However, the very high VIF for the const term, along with the large condition number, suggests that there may be numerical issues, possibly due to a large scale difference between predictors or multicollinearity issues not captured by standard VIF calculations.

# Q8b (words): Write your analysis for the multicollinearity test here:

Q9: Reflection

# Q9 (words): Write your reflection on the whole research framework and corresponding result here, e.g., what do you think can be improved?

##References##

Chatman, Seymour Benjamin. Story and Discourse: Narrative Structure in Fiction and Film. Ithaca, NY: Cornell University Press, 1980.

Bennett, Andrew, and Nicholas Royle. Introduction to Literature Criticism and Theory. Edinburgh: Pearson Education Limited, 2004. Web. July 2017.

Reaske, Christopher Russell. Analyze Drama. New York: Monarch Press, 1996. Print.

Jung, Sun. “Bae Yong-Joon, Soft Masculinity, and Japanese Fans: Our Past Is in Your Present Body.” Korean Masculinities and Transcultural Consumption, Hong Kong Scholarship Online, 2010.

Kuo, Linda, et al. “Performance, Fantasy, or Narrative: LGBTQ+ Asian American Identity through Kpop Media in Fandom.” Journal of Homosexuality, 2020.

Kwon, Jungmin. Straight Korean Female Fans and Their Gay Fantasies. University of Iowa Press, 2019. Ebook.

Oh, Chuyun. “Queering spectatorship in K-pop: The androgynous male dancing body and western female fandom.” Journal of Fandom Studies, vol. 3, no. 1, 2015, pp. 59-78.
