DTS301TC Data Mining Individual Project: COVID-19 tweet analysis

DTS301TC Data Mining

School of AI and Advanced Computing
Individual Project
Sunday, October 29th 23:59 (Beijing Time), 2023
Category A

DTS301TC Data Mining Individual Project

Deadline: Sunday, October 29th 23:59 (Beijing Time), 2023
Percentage in final mark: 60%

Learning outcomes assessed:

D. Develop skills of using recent data mining software for solving practical problems.

E. Gain experience of doing independent study and research.

Late policy: 5% of the total marks available for the assessment shall be deducted from the assessment mark for each working day after the submission date, up to a maximum of five working days.

Risks:

Please read the coursework instructions and requirements carefully. Not following these instructions and requirements may result in loss of marks.
The assignment must be submitted via Learning Mall to the correct drop box. Only electronic submission is accepted; hard copy submissions are not.
All students must download their file and check that it is viewable after submission. Documents may become corrupted during the uploading process (e.g. due to slow internet connections). However, students themselves are responsible for submitting a functional and correct file for assessments.
Academic Integrity Policy is strictly followed.

Overview

The objective of this project is to apply data mining techniques to a real-world dataset to gain a better understanding of real-world data mining applications. In this project, you need to identify one appropriate data mining problem from a COVID-19 related Twitter dataset and apply data mining algorithms to extract useful information from the dataset using R or Python. In line with learning outcome E, you are expected to do some independent study and research in this individual project.

Dataset

The project uses a sample of the GeoCoV19 Twitter dataset (https://crisisnlp.qcri.org/covid19). The dataset contains a large number of geo-tagged COVID-19 tweets during the period of Feb 1st – March 31st, 2020, from various locations in the United States.

The dataset is stored in a CSV file and needs to be processed with your R or Python program. Each record (row) contains information about a tweet. The columns are explained as follows.

tweet_id – the ID of a tweet
created_at – the time when a tweet is published
user_id – the ID of a user
country_code – in which country the tweet is published
state – in which state the tweet is published
text – the actual tweet message
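
As a starting point, the column list above could be checked when the file is loaded. The following is a minimal sketch using only the Python standard library. It assumes the CSV has a header row with exactly these column names, which the brief does not state, and the sample rows are made up for illustration.

```python
import csv
import io

# Column names taken from the list above.
COLUMNS = ["tweet_id", "created_at", "user_id", "country_code", "state", "text"]


def load_tweets(handle):
    """Read tweet records from an open CSV file handle into a list of dicts."""
    reader = csv.DictReader(handle)
    missing = set(COLUMNS) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"CSV is missing expected columns: {sorted(missing)}")
    return list(reader)


# Tiny in-memory sample standing in for the real file (values are made up).
SAMPLE_CSV = io.StringIO(
    "tweet_id,created_at,user_id,country_code,state,text\n"
    "1,2020-02-01 08:15:00,100,US,New York,Stay safe everyone #COVID19\n"
    '2,2020-02-01 09:30:00,101,US,California,"Wash your hands, please"\n'
)
tweets = load_tweets(SAMPLE_CSV)
print(len(tweets), tweets[1]["text"])
```

For the real dataset, the same function can be given a file opened with `open(path, newline="", encoding="utf-8")`, where `path` is wherever the provided CSV was saved.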

Requirements and Tasks

You are allowed to use existing R or Python libraries to solve the following tasks. The mark breakdown for each task can be found in the DTS301TC Project Marking Criteria at the end of this document.

T1 Statistic Analysis and Data Visualization:

T1-1: Find how many distinct tweets and users are included in this dataset.

T1-2: Find the top 10 users who tweeted the most.
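
One possible way to approach T1-1 and T1-2 is sketched below with the standard library. It assumes each record is a dict with `tweet_id` and `user_id` keys and that a repeated `tweet_id` means a duplicated row, which is a guess about the data rather than something the brief confirms. The function name `summarize_users` is hypothetical.

```python
from collections import Counter


def summarize_users(rows, top_n=10):
    """Return (distinct tweet count, distinct user count, top_n most active users)."""
    seen_tweets = set()
    tweets_per_user = Counter()
    for row in rows:
        # Count each tweet_id once so duplicated rows do not inflate totals.
        if row["tweet_id"] in seen_tweets:
            continue
        seen_tweets.add(row["tweet_id"])
        tweets_per_user[row["user_id"]] += 1
    return len(seen_tweets), len(tweets_per_user), tweets_per_user.most_common(top_n)


rows = [
    {"tweet_id": "1", "user_id": "a"},
    {"tweet_id": "2", "user_id": "a"},
    {"tweet_id": "3", "user_id": "b"},
    {"tweet_id": "3", "user_id": "b"},  # duplicated row
]
n_tweets, n_users, top_users = summarize_users(rows)
print(n_tweets, n_users, top_users)
```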

T1-3: Draw a figure to show the number of tweets posted on each day (from Feb 1st to March 31st, 2020).
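
Before drawing the figure for T1-3, the tweets need to be grouped by calendar day. The sketch below does only that counting step. The timestamp layout in `ASSUMED_FORMAT` is an assumption, since the brief does not say how `created_at` is formatted, and it should be adjusted to match the actual file.

```python
from collections import Counter
from datetime import date, datetime, timedelta

# Assumed timestamp layout; adjust to whatever created_at actually contains.
ASSUMED_FORMAT = "%Y-%m-%d %H:%M:%S"


def tweets_per_day(rows, start=date(2020, 2, 1), end=date(2020, 3, 31),
                   fmt=ASSUMED_FORMAT):
    """Count tweets per calendar day, including days with zero tweets."""
    counts = Counter(
        datetime.strptime(row["created_at"], fmt).date() for row in rows
    )
    n_days = (end - start).days + 1
    days = [start + timedelta(days=i) for i in range(n_days)]
    return [(day, counts.get(day, 0)) for day in days]


rows = [
    {"created_at": "2020-02-01 08:15:00"},
    {"created_at": "2020-02-01 21:40:00"},
    {"created_at": "2020-02-29 12:00:00"},
]
series = tweets_per_day(rows)
print(len(series), series[0], series[28])
```

Filling in zero-count days keeps the x-axis continuous, so the resulting (day, count) pairs can be passed directly to a plotting library as the x and y values of a line or bar chart.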

T1-4: Draw a figure to show the number of tweets posted from each state.

T2 Text Data Cleaning, Pre-processing and Visualization:

T2-1: Raw tweets are highly unstructured and often contain redundant and problematic information. For instance, the links, emojis and symbols (e.g., #, @) in a tweet may not be necessary for the text mining tasks. Use R or Python to clean and pre-process raw tweets.
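
The cleaning described in T2-1 could be sketched with regular expressions as follows. This is one illustrative set of rules, not a required procedure: it assumes English-only text is wanted (non-ASCII characters, including emojis, are dropped) and that the word after a `#` should be kept while the symbol is removed. The sample tweet is made up.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
# Keep ASCII letters, digits and whitespace only; this also drops emojis and
# punctuation, at the cost of discarding non-English characters.
NON_ALNUM_RE = re.compile(r"[^A-Za-z0-9\s]")
WHITESPACE_RE = re.compile(r"\s+")


def clean_tweet(text):
    """Strip links, mentions, symbols and emojis, then lowercase the text."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = text.replace("#", " ")  # keep the hashtag word, drop the symbol
    text = NON_ALNUM_RE.sub(" ", text)
    return WHITESPACE_RE.sub(" ", text).strip().lower()


raw = "Stay home!! 😷 #COVID19 @WHO says wash hands https://t.co/abc123"
print(clean_tweet(raw))  # stay home covid19 says wash hands
```

Printing the raw and cleaned versions side by side, as here, is also a simple way to produce before-and-after examples.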

T2-2: Apply necessary text mining preprocessing techniques, e.g., tokenization, stemming, stop word removal, etc.
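
To illustrate how the three steps named in T2-2 fit together, here is a deliberately simplified sketch. The stop word list is a tiny hand-written sample and `naive_stem` is a crude suffix stripper standing in for a real stemming algorithm such as Porter's. In practice an established NLP library would normally supply both.

```python
# Deliberately tiny stop word list for illustration; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(token):
    """Strip one common suffix; a crude stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(cleaned_text):
    """Tokenize on whitespace, drop stop words, then stem what remains."""
    tokens = cleaned_text.split()
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]


print(preprocess("the hospitals are testing patients in new york"))
# ['hospital', 'test', 'patient', 'new', 'york']
```

Stop words are removed before stemming here so that the stop word list can contain ordinary word forms rather than stems.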

T2-3: Generate a word cloud to show the frequently used words in the COVID-19 tweet dataset. You can further pre-process the dataset based on the topic you choose in T3.
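
A word cloud is driven by word frequencies, so the underlying computation for T2-3 can be sketched independently of any drawing library. The example below assumes each tweet has already been turned into a list of tokens. The token lists are invented and `word_frequencies` is a hypothetical helper name.

```python
from collections import Counter


def word_frequencies(token_lists, top_n=50):
    """Count tokens across all tweets and return the top_n as a dict."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    return dict(counts.most_common(top_n))


token_lists = [
    ["covid19", "test", "hospital"],
    ["covid19", "mask"],
    ["mask", "covid19"],
]
freqs = word_frequencies(token_lists, top_n=2)
print(freqs)  # {'covid19': 3, 'mask': 2}
```

A word-to-count mapping of this shape is the usual input for word cloud tools. For instance, the third-party `wordcloud` package for Python can build a cloud from such a dictionary.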

T3 Data Processing and Analysis:

Identify one data mining problem and use data mining algorithm(s) to extract useful information from the given dataset. You can choose your own topic. Please make sure your topic is appropriate and has some research value. Some potential topics are listed for your reference.

Identifying trending topics of COVID-19 on Twitter
Extracting tweets related to a specific topic, e.g., China, vaccine, policy, mask, etc.
Spatial and temporal analysis and sentiment analysis of tweets
Topic modeling of COVID-19 tweets
etc.
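
As an illustration of the second suggested topic, extracting tweets related to a specific subject, a simple keyword-matching baseline might look like the sketch below. The keyword lists in `TOPIC_KEYWORDS` are hypothetical and would need to be chosen for the topic actually studied. Keyword matching is only a baseline and is not a substitute for the data mining algorithm T3 asks for.

```python
import re

# Hypothetical keyword lists; choose terms that suit the topic being studied.
TOPIC_KEYWORDS = {
    "mask": ["mask", "masks", "n95"],
    "vaccine": ["vaccine", "vaccines", "vaccination"],
}


def build_matcher(keywords):
    """Compile a case-insensitive whole-word pattern for a keyword list."""
    alternatives = "|".join(re.escape(k) for k in keywords)
    return re.compile(rf"\b(?:{alternatives})\b", re.IGNORECASE)


def tag_topics(text, topic_keywords=TOPIC_KEYWORDS):
    """Return the sorted list of topics whose keywords appear in the text."""
    return sorted(
        topic for topic, words in topic_keywords.items()
        if build_matcher(words).search(text)
    )


print(tag_topics("Where can I buy N95 masks before a vaccine arrives?"))
print(tag_topics("Schools are closing next week"))
```

Whole-word matching avoids false hits such as "unmasked" counting as "mask", at the cost of having to list plural and variant forms explicitly.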

Report

You need to write a report to show all the contents for this project. In general, the report must be in English and should include the following contents:

1. Source code and results for T1 Statistic Analysis and Data Visualization. You can add one or two paragraphs to explain anything that is not obvious.
2. Source code and results for T2 Text Data Cleaning, Pre-processing and Visualization. You need to give some examples to show the tweet content before and after data pre-processing. You can also add one or two paragraphs to explain anything that is not obvious.
3. For T3 Data Processing and Analysis you should include the following contents:
  a. Introduction: State clearly what the topic is, why you chose the topic, show the originality and significance of the topic, and discuss if there are some existing studies related to the topic.
  b. Methodology: State what data mining algorithm(s) you use to solve the problem, explain how to use it and identify the novelty of your method (if any).
  c. Experiments: Include your code and some brief explanation.
  d. Evaluation: Show all the results (e.g., tables, figures, etc.) you get from your method and give the corresponding explanation. You can also discuss the pros and cons of different models if you implemented multiple models for your topic.
  e. Conclusion: Summary of the results, list some current limitations and future directions.
  f. Reference: If you refer to any work from other sources, the original work must be cited.

Maximum 2500 words for the report excluding source code. (Clarity and brevity are valued over length).

Submission

Electronic submission on Learning Mall is mandatory. You need to submit a zip file (named IDnumber_Name_DTS301TC_Project.zip (e.g.: 1900000_ZhangSan_DTS301TC_Project.zip)) containing all your source code in R or Python and your report in pdf format.
