DTS301TC Data Mining
School of AI and Advanced Computing
Individual Project
Sunday, October 29th 23:59 (Beijing Time), 2023
☐ Category A
Deadline: Sunday, October 29th 23:59 (Beijing Time), 2023
Percentage in final mark: 60%
Learning outcomes assessed:
D. Develop skills of using recent data mining software for solving practical problems.
E. Gain experience of doing independent study and research.
Late policy: 5% of the total marks available for the assessment shall be deducted from the assessment mark for each working day after the submission date, up to a maximum of five working days.
Risks:
- Please read the coursework instructions and requirements carefully. Not following these instructions and requirements may result in loss of marks.
- The assignment must be submitted via Learning Mall to the correct drop box. Only electronic submissions are accepted; hard copies are not.
- All students must download their file and check that it is viewable after submission. Documents may become corrupted during the uploading process (e.g. due to slow internet connections). However, students themselves are responsible for submitting a functional and correct file for assessments.
- The Academic Integrity Policy is strictly followed.
Overview
The objective of this project is to apply data mining techniques to a real-world dataset to gain a better understanding of real-world data mining applications. In this project, you need to identify one appropriate data mining problem from a COVID-19 related Twitter dataset and apply data mining algorithms to extract useful information from the dataset using R or Python. In line with learning outcome E, you are expected to do some independent study and research in this individual project.
Dataset
The project uses a sample of the GeoCoV19 Twitter dataset (https://crisisnlp.qcri.org/covid19). The dataset contains a large number of geo-tagged COVID-19 tweets posted during the period of Feb 1st – March 31st, 2020, from various locations in the United States.
The dataset is stored in a CSV file and needs to be processed with your R or Python program. Each record (row) contains information about a tweet. The columns are explained as follows.
- tweet_id – the ID of a tweet
- created_at – the time when a tweet is published
- user_id – the ID of a user
- country_code – in which country the tweet is published
- state – in which state the tweet is published
- text – the actual tweet message
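As an illustration only, the sketch below shows one way records with these columns could be read and summarized in Python. It uses only the standard library, and the sample rows, the `summarize` helper, and the `YYYY-MM-DD HH:MM:SS` timestamp format are assumptions made for the example; the real file may store `created_at` differently, so check it before parsing.

```python
import csv
import io
from collections import Counter
from datetime import datetime

# Hypothetical sample rows; the real file and its timestamp format may differ.
SAMPLE_CSV = """tweet_id,created_at,user_id,country_code,state,text
1,2020-02-01 08:15:00,u1,US,New York,Stay safe everyone #COVID19
2,2020-02-01 09:30:00,u2,US,California,Masks sold out again
3,2020-02-02 10:00:00,u1,US,New York,Working from home today
4,2020-03-31 23:10:00,u3,US,Texas,Schools closed until further notice
"""

def summarize(rows):
    """Return basic counts keyed by the column names described above."""
    rows = list(rows)
    tweets_per_user = Counter(r["user_id"] for r in rows)
    # Assumed timestamp format; adjust the pattern to match the real data.
    tweets_per_day = Counter(
        datetime.strptime(r["created_at"], "%Y-%m-%d %H:%M:%S").date().isoformat()
        for r in rows
    )
    tweets_per_state = Counter(r["state"] for r in rows)
    return {
        "n_tweets": len({r["tweet_id"] for r in rows}),   # distinct tweet IDs
        "n_users": len(tweets_per_user),                  # distinct user IDs
        "top_users": tweets_per_user.most_common(10),
        "per_day": dict(sorted(tweets_per_day.items())),
        "per_state": dict(tweets_per_state),
    }

summary = summarize(csv.DictReader(io.StringIO(SAMPLE_CSV)))
print(summary["n_tweets"], summary["n_users"])
print(summary["top_users"])
```

The per-day and per-state dictionaries can be passed to any plotting library to draw bar or line charts. A data-frame library would be an equally reasonable choice for the same counts.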
Requirements and Tasks
You are allowed to use existing R or Python libraries to solve the following tasks. The mark breakdown for each task can be found in the DTS301TC Project Marking Criteria at the end of this document.
T1 Statistic Analysis and Data Visualization:
T1-1: Find how many different tweets and users are included in this dataset.
T1-2: Find the top 10 users who tweeted the most.
T1-3: Draw a figure to show the number of tweets posted on each day (from Feb 1st to March 31st, 2020).
T1-4: Draw a figure to show the number of tweets posted from each state.
T2 Text Data Cleaning, Pre-processing and Visualization:
T2-1: Raw tweets are highly unstructured and often contain redundant and problematic information. For instance, the links, emojis and symbols (e.g., #, @) in a tweet may not be necessary for the text mining tasks. Use R or Python to clean and pre-process raw tweets.
T2-2: Apply necessary text mining preprocessing techniques, e.g., tokenization, stemming, stop word removal, etc.
T2-3: Generate a word cloud to show the frequently used words in the COVID-19 tweet dataset. You can further pre-process the dataset based on the topic you choose in T3.
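The T2 steps can be sketched roughly as follows. This is a minimal illustration rather than a required approach: the stop-word list is deliberately tiny, `crude_stem` is a hypothetical stand-in that only strips a few suffixes and is not a real stemming algorithm, and the regular expressions are one possible choice among many. Note that this particular cleaning also drops digits, so `#COVID19` becomes `covid`.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real project would use a fuller one.
STOP_WORDS = {"a", "an", "and", "are", "at", "in", "is", "of", "the", "to"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
NON_LETTER_RE = re.compile(r"[^a-z\s]")

def clean_tweet(text):
    """Remove links, @mentions, and non-letter symbols (including # and emojis)."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = text.lower()
    text = NON_LETTER_RE.sub(" ", text)
    return " ".join(text.split())  # collapse repeated whitespace

def crude_stem(token):
    """Very rough suffix stripping; NOT a real stemming algorithm."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Clean, tokenize on whitespace, drop stop words, then stem."""
    tokens = clean_tweet(text).split()
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

def word_frequencies(texts):
    """Count processed tokens across tweets, e.g. as input to a word cloud."""
    return Counter(tok for t in texts for tok in preprocess(t))

raw = "Wearing masks in the office 😷 @CDCgov https://t.co/abc123 #COVID19"
print(clean_tweet(raw))   # before/after comparison for the report
print(preprocess(raw))
print(word_frequencies([raw, "Masks are required at the office"]))
```

Printing a tweet before and after `clean_tweet` gives the kind of before/after example the report asks for, and the resulting frequency table can be handed to whichever word cloud tool you choose.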
T3 Data Processing and Analysis:
Identify one data mining problem and use data mining algorithm(s) to extract useful information from the given dataset. You can choose your own topic. Please make sure your topic is appropriate and has some research value. Some potential topics are listed for your reference.
- Identifying trending topics of COVID-19 on Twitter
- Extracting tweets related to a specific topic, e.g., China, vaccine, policy, mask, etc.
- Spatial and temporal analysis and sentiment analysis of tweets
- Topic modeling of COVID-19 tweets
- etc.
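To make one of these topics concrete, the sketch below illustrates a very simple keyword-based extraction of tweets on a specific topic, followed by a count per state. It is only a starting point under stated assumptions: the sample records, the keyword list, and the helper names `topic_pattern` and `filter_by_topic` are hypothetical, and a full T3 solution would be expected to go well beyond keyword matching.

```python
import re
from collections import Counter

# Hypothetical records using the dataset's column names.
tweets = [
    {"state": "New York", "text": "Masks are sold out everywhere"},
    {"state": "New York", "text": "Working from home again"},
    {"state": "Texas", "text": "Wear a mask when you go out"},
    {"state": "Texas", "text": "Unmasking the myths about flu"},
]

def topic_pattern(keywords):
    """Build a case-insensitive whole-word pattern for the given keywords."""
    alternatives = "|".join(re.escape(k) for k in keywords)
    return re.compile(rf"\b(?:{alternatives})\b", re.IGNORECASE)

def filter_by_topic(records, keywords):
    """Keep only records whose text mentions at least one keyword."""
    pattern = topic_pattern(keywords)
    return [r for r in records if pattern.search(r["text"])]

mask_tweets = filter_by_topic(tweets, ["mask", "masks"])
per_state = Counter(r["state"] for r in mask_tweets)
print(len(mask_tweets))
print(per_state)
```

Whole-word matching is a deliberate choice here: it keeps "Unmasking" out of the mask topic, at the cost of missing inflected forms that are not listed as keywords.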
Report
You need to write a report covering all the contents of this project. In general, the report must be in English and should include the following contents:
- Source code and results for T1 Statistic Analysis and Data Visualization. You can add one or two paragraphs to explain anything that is not obvious.
- Source code and results for T2 Text Data Cleaning, Pre-processing and Visualization. You need to give some examples to show the tweet content before and after data pre-processing. You can also add one or two paragraphs to explain anything that is not obvious.
- For T3 Data Processing and Analysis, you should include the following contents:
  a. Introduction: State clearly what the topic is and why you chose it, show the originality and significance of the topic, and discuss whether there are existing studies related to the topic.
  b. Methodology: State what data mining algorithm(s) you use to solve the problem, explain how you use them, and identify the novelty of your method (if any).
  c. Experiments: Include your code and some brief explanation.
  d. Evaluation: Show all the results (e.g., tables, figures, etc.) you get from your method and give the corresponding explanation. You can also discuss the pros and cons of different models if you implemented multiple models for your topic.
  e. Conclusion: Summarize the results and list some current limitations and future directions.
  f. Reference
- If you refer to any work from other sources, the original work must be cited.
Maximum 2500 words for the report excluding source code. (Clarity and brevity are valued over length).
Submission
Electronic submission on Learning Mall is mandatory. You need to submit a zip file named IDnumber_Name_DTS301TC_Project.zip (e.g., 1900000_ZhangSan_DTS301TC_Project.zip) containing all your source code in R or Python and your report in PDF format.
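If you prefer to build the archive programmatically, a sketch along the following lines could be used. The helper names `archive_name` and `build_archive` and the placeholder file names `analysis.py` and `report.pdf` are assumptions for the example; only the naming pattern comes from the requirement above.

```python
import tempfile
import zipfile
from pathlib import Path

def archive_name(student_id, name):
    """Build the archive file name in the required pattern."""
    return f"{student_id}_{name}_DTS301TC_Project.zip"

def build_archive(files, student_id, name, out_dir):
    """Zip the given files (stored flat, by base name) into out_dir."""
    out_path = Path(out_dir) / archive_name(student_id, name)
    with zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for f in files:
            zf.write(f, arcname=Path(f).name)
    return out_path

# Demonstration with placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    code_file = tmp_dir / "analysis.py"    # hypothetical source file
    report_file = tmp_dir / "report.pdf"   # hypothetical report file
    code_file.write_text("print('hello')\n")
    report_file.write_bytes(b"%PDF-1.4 placeholder")
    archive = build_archive([code_file, report_file], "1900000", "ZhangSan", tmp_dir)
    archive_filename = archive.name
    # Re-open the archive to confirm it is readable and lists both files.
    with zipfile.ZipFile(archive) as zf:
        members = sorted(zf.namelist())

print(archive_filename)
print(members)
```

Re-opening the archive after writing it mirrors the earlier advice to check that the submitted file is viewable.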