Machine Learning in Practice – 2024 S1 – Assignment
ML Solutions for Misinformation Detection in Social Media
ML Project: Jupyter Lifecycle Expedition
Introduction
Social media, particularly X (formerly known as Twitter), has revolutionized the way information spreads, but it is also an incubator for fake news and misinformation. Misinformation on X can take diverse forms and may stem from various sources, whether intentional or not, taking advantage of the platform's viral nature to widen its dissemination. As we approach major events like elections, the urgency of addressing this challenge becomes increasingly apparent. Because misinformation has no single form, there is a growing need for innovative approaches to addressing it.
Machine learning and natural language processing (NLP) offer promising solutions to identify trends and detect misinformation. However, free-text data is challenging to incorporate into classification models due to its lack of structure. To overcome this challenge, latent variable models (such as topic models) or feature generation can be used to infer intermediate representations that serve as structured data for classification tasks.
In this project, you will showcase the significance of integrating data sourced from X alongside newly engineered features to classify the authenticity of news-related tweets. A dataset obtained from X has been web-scraped, and the various sections of this assignment will establish one kind of exploratory strategy for addressing a classification challenge.
Dataset
The Assignment dataset consists of an assortment of news headlines, along with associated X posts relating to the headline. The dataset consists of 134,198 rows and 15 columns. There are 3 types of feature variables and only 1 target variable:
Feature Variables
➢ Textual Data:
news_author (str) – author of a news headline.
news_headline (str) – headline of a news article.
related_tweet (str) – X post relating to the news headline posted by a user.
➢ Post Metadata
post_replies (int) - number of replies on the post.
post_retweets (int) - number of retweets on the post.
post_favourites (int) - number of favourites on the post.
post_quotes (int) - number of times the post has been quote tweeted.
➢ User Metadata
user_followers (int) - number of followers.
user_following (int) - number of following users.
user_friends (int) - number of friends (mutual following).
user_tweet_count (int) – total number of tweets the user has made.
user_favourites_count (int) – total number of favourites user has across all tweets.
user_mentions (int) – total number of users mentioned (@) in related_tweet.
user_tweet_count_lists (int) – total number of tweets the user has in their lists.
Target Variable
➢ Misinformation (bool) – a True/False value indicating whether a tweet is false.
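As a quick sanity check on the column list above, the following minimal sketch records the 15 documented columns and compares them with the header of a loaded file. The names `EXPECTED_COLUMNS` and `check_columns` are hypothetical helpers, not part of the assignment template, and the exact column names or casing in the provided file may differ from what is written here.

```python
# Hypothetical schema transcribed from the column list in this document.
# The actual names/casing in the provided file may differ.
EXPECTED_COLUMNS = {
    # Textual data
    "news_author": str,
    "news_headline": str,
    "related_tweet": str,
    # Post metadata
    "post_replies": int,
    "post_retweets": int,
    "post_favourites": int,
    "post_quotes": int,
    # User metadata
    "user_followers": int,
    "user_following": int,
    "user_friends": int,
    "user_tweet_count": int,
    "user_favourites_count": int,
    "user_mentions": int,
    "user_tweet_count_lists": int,
    # Target variable
    "Misinformation": bool,
}


def check_columns(header):
    """Return (missing, unexpected) column names for a list of header names."""
    expected = set(EXPECTED_COLUMNS)
    found = set(header)
    return sorted(expected - found), sorted(found - expected)


# Usage with a made-up header that lacks one column and has one extra.
example_header = [c for c in EXPECTED_COLUMNS if c != "post_quotes"] + ["row_id"]
missing, unexpected = check_columns(example_header)
print("missing:", missing)
print("unexpected:", unexpected)
```

If the data is loaded into a pandas DataFrame (here assumed to be called `df`), the same check could be run as `check_columns(list(df.columns))`.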
Specification Summary
Type: Project report, individual assignment
Deliverable: Report in the format of a Jupyter notebook only (.ipynb)
The aim of this assignment is to provide you with experience in the steps involved in text preparation, feature generation, and creating, evaluating, and improving classification models. You will need to research NLP and Python functionality if you aim to achieve excellent marks and discover innovative techniques and methods.
Exploration, Preparation & Feature Generation
This section requires you to explore various aspects of your dataset and prepare the data for future sections. It is important you take time to carefully explore your data and make decisions on preparation or generation that make sense.
Preprocessing steps are essential to clean and standardize data before feature generation, and they enhance the quality of the extracted features. Classification models that harness generated features may understand and analyze the data better, or learn its patterns and relationships better, than models that use the original features alone.
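One possible cleaning step for the `related_tweet` text can be sketched as follows. This is a minimal illustration using only Python's standard `re` module; the function name `clean_tweet` and the choice of what to strip (URLs, @mentions, punctuation) are assumptions made for the example, not requirements of the assignment.

```python
import re

# Patterns are illustrative assumptions, not rules from the assignment.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
NON_WORD_RE = re.compile(r"[^a-z0-9#\s]")  # applied after lowercasing
WHITESPACE_RE = re.compile(r"\s+")


def clean_tweet(text):
    """Lowercase a tweet and strip URLs, @mentions, and punctuation."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = NON_WORD_RE.sub(" ", text)  # keeps letters, digits, and '#'
    return WHITESPACE_RE.sub(" ", text).strip()


raw = "BREAKING: @user says it's TRUE!! https://example.com #news"
print(clean_tweet(raw))
# -> breaking says it s true #news
```

Note that this simple rule splits contractions ("it's" becomes "it s") and discards capitalization and punctuation, which may themselves be informative. It can be worth generating features from the raw text before cleaning it.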
Further, X (formerly Twitter) recently open-sourced its algorithms, and many articles provide insights into which features of a tweet are important. Knowing this may help you better understand how to classify a tweet as misinformation.
Your task is to
➢ Explore and prepare your data.
o In this task, you could perform the necessary cleaning and pre-processing tasks, and explore or try to understand and profile your data through various techniques (e.g. clustering, topic modelling).
➢ Generate new features from your data.
o You should have a good understanding of your data from above and can now experiment with feature generation. In this task you should consider what can be generated to improve your classification model.
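As one illustration of feature generation, the sketch below derives a few numeric features from the raw text of a single tweet. The function name `tweet_features` and the particular features chosen are hypothetical; they are examples of the kind of thing that could be generated, not a recommended or complete set.

```python
import re


def tweet_features(text):
    """Derive simple numeric features from the raw text of one tweet."""
    letters = [ch for ch in text if ch.isalpha()]
    upper = sum(1 for ch in letters if ch.isupper())
    return {
        "n_chars": len(text),
        "n_words": len(text.split()),
        "n_hashtags": len(re.findall(r"#\w+", text)),
        "n_urls": len(re.findall(r"https?://\S+", text)),
        "n_exclaims": text.count("!"),
        # Share of alphabetic characters that are uppercase (0.0 if none).
        "upper_ratio": upper / len(letters) if letters else 0.0,
    }


example = "SHOCKING!! @bob look at this #fake #news https://example.com"
features = tweet_features(example)
print(features)
```

With a pandas DataFrame (assumed here to be called `df`), the same function could be applied row by row, for example `df["related_tweet"].apply(tweet_features).apply(pd.Series)`, to obtain one new column per feature.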
Classification (Model Building and Evaluation)
It is important to try multiple variations of features/parameters in model building to achieve the best performance. Additionally, you should elaborate on the performance metrics you have used to evaluate your model and explain why they suit the available data.
Your task
➢ Experiment with developing and evaluating classification models to find the model that has the best overall performance.
o Once you find the best performing model, you should only show how you built and evaluated that specific one.
➢ Elaborate on the major tasks you have undertaken to improve the best-performing model and explain why the performance metrics suit the available data.
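To make the point about choosing suitable performance metrics concrete, the sketch below computes accuracy, precision, recall, and F1 from scratch for a small set of made-up labels. The helper names and the toy labels are assumptions for illustration only; the example also assumes, hypothetically, that misinformation is the minority class, which you would need to verify on the actual data.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for boolean labels."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth and pred:
            tp += 1
        elif not truth and pred:
            fp += 1
        elif truth and not pred:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def classification_metrics(y_true, y_pred):
    """Return accuracy, precision, recall, and F1 for the positive class."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / denom if denom else 0.0,
    }


# Made-up labels: 8 genuine tweets (False) and 2 misinformation tweets (True).
y_true = [False] * 8 + [True] * 2
model_pred = [False] * 8 + [True, False]  # finds one of the two
always_false = [False] * 10               # never flags anything

model_metrics = classification_metrics(y_true, model_pred)
baseline_metrics = classification_metrics(y_true, always_false)
print(model_metrics)
print(baseline_metrics)
```

In this toy case the baseline that never flags a tweet still reaches 0.8 accuracy while its recall is 0, which is why accuracy alone can be misleading when classes are imbalanced. In practice, library implementations such as those in scikit-learn would normally be used instead of hand-written helpers.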
Submission
Your report should be delivered in an .ipynb file. A notebook template is provided to show how to structure your work. You need to use the template (Assignment_Template.ipynb) and strictly follow its format, which is designed based on the provided Assignment rubric.
It can be useful to add some in-line comments (using #) next to your code to explain it briefly.
You will get a better mark if your approach is innovative. This means no other student has applied it, or only a few others have applied a similar approach with some differences. Therefore, it is highly advised that you do not share your creative work with anyone else. You can still discuss preliminary ideas and help each other, but remember that your submission must be your own work.
You will only need to submit one .ipynb file and should use the provided Python template file. Before submission:
➢ Ensure that your code can run without errors. If your code returns an error at any point, your assignment will only be marked up until the error, and the remainder of your code won't earn any marks. Example errors include syntax errors and NameErrors.
➢ Make sure that all the important outputs are shown in your notebook. However, avoid showing trivial outputs. For example, you should remove code that displays the whole DataFrame for no particular reason.
➢ Your marker will first look at your generated output as a reference without running your notebook (unless deemed necessary). Therefore, your significant outputs need to be generated, and the elaboration should be provided in the notebook, as shown in the template.