SIT 112 - Data Science Concepts - Assignment 2
Lecturer: Shalini Stephen | shalini.stephen@deakin.edu.au
Deakin College
Associated with Deakin University, VIC 3215, Australia
Due: Week 10 Saturday, 10th September 2022, by 11:55 pm
Instructions
This notebook has been prepared for you to complete Assignment 2. Some sections have been partially completed to help you get started. The total mark for this notebook is 100.
· Before you start, read the entire notebook carefully to understand what you need to do. You should also refer to the main instructions in Assignment2_instructions.pdf to see what else you need to complete for this assignment, as well as the submission instructions.
· Instructions marked with (D) and (HD) are for students aiming at high grades. They are more involved and can be completed after all other instructions.
· For each cell marked with #YOU ARE REQUIRED TO INSERT YOUR CODE IN THIS CELL, there will be places where you must supply your own code when instructed.
· For each cell marked with #YOU ARE REQUIRED TO INSERT YOUR COMMENT IN THIS CELL, there will be places where you must provide your own comments when instructed.
· For each graphic/chart/plot you create, YOU ARE REQUIRED to provide an X-label, a Y-label and a Title.
· For any output of your code execution, YOU ARE REQUIRED to provide very brief explanatory information (a few words is enough).
In this notebook:
· Markdown cells marked with Note are description sections.
· Markdown cells marked with Instruction give the instructions you must follow to complete the designated section.
Part 1: Crawling and Storing Tweet Data
The first part of the assignment examines your skills and knowledge in querying tweets and storing them in json files. For each provided keyword, your tasks are:
· Crawl all tweets that contain the keyword, are written in English, and are geocoded within the location provided for your group.
· Store the tweets collected into json files.
Follow the instructions below to complete your task.

Note: The following packages will be required for this assignment. If you need to import more packages, you might append them to the end of the following cell.

# Import packages needed for processing
import re
import json
import xml
import numpy as np
from collections import Counter
from TwitterAPI import TwitterAPI  # in case you need to install this package, see practical 6
from sklearn.cluster import KMeans
import requests

# disabling urllib3 warnings
requests.packages.urllib3.disable_warnings()

import matplotlib.pyplot as plt
%matplotlib inline

# If you need to add any additional packages, then add them below
Instruction 1.1. Enter your provided keywords into the variable keywords below.
[Total mark: 1]
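For example (the values below are placeholders only; replace them with the keywords provided for your group):

keywords = ['keyword1', 'keyword2', 'keyword3']   # placeholder keywords, not the real ones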
# Twitter API credentials
CONSUMER_KEY =        # ENTER YOUR CONSUMER_KEY
CONSUMER_SECRET =     # ENTER YOUR CONSUMER_SECRET
OAUTH_TOKEN =         # ENTER YOUR OAUTH_TOKEN
OAUTH_TOKEN_SECRET =  # ENTER YOUR OAUTH_TOKEN_SECRET
# Authenticating with your application credentials
api = TwitterAPI(CONSUMER_KEY, CONSUMER_SECRET, OAUTH_TOKEN, OAUTH_TOKEN_SECRET)  # INSERT YOUR CODE HERE

Note: As you have learned from the practical sessions, to perform a query with the Twitter API for a particular geo-coded location you need a centre point and a radius. The centre point is specified by its (latitude, longitude) pair. The information below has been provided to you to perform the query in the subsequent tasks.
# geo coordinates of the desired place
PLACE_LAT =  # INSERT YOUR CODE
PLACE_LON =  # INSERT YOUR CODE
PLACE_RAD =  # INSERT YOUR CODE

Instruction 1.3. For each keyword, you are required to crawl at least 200 tweets (the more the better) using the Twitter API. However, as you have learned from the practical sessions, each query returns a maximum of only 100 tweets. Therefore, each subsequent query must use the maximum Tweet ID from the previous batch to crawl the next lot.
The following function, called retrieve_tweets(), has been partially implemented to automatically download tweets until it reaches the maximum number of tweets needed.
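As a rough illustration of the pagination idea only (this is not the notebook's partially implemented version; the parameter names and the geocode string format below are assumptions), such a function could be shaped like this:

def retrieve_tweets(api, keyword, batch_count, total_count):
    """
    Sketch: collect up to total_count tweets containing keyword,
    at most batch_count (<= 100) tweets per request, paging with max_id.
    """
    tweets = []
    max_id = None
    while len(tweets) < total_count:
        params = {'q': keyword,
                  'count': batch_count,
                  'lang': 'en',
                  'geocode': '{},{},{}'.format(PLACE_LAT, PLACE_LON, PLACE_RAD)}
        if max_id is not None:
            params['max_id'] = max_id
        response = api.request('search/tweets', params)
        batch = [tweet for tweet in response]
        if not batch:
            break  # no more tweets available for this query
        tweets.extend(batch)
        # move max_id below the smallest id seen so the next request returns older tweets
        max_id = min(tweet['id'] for tweet in batch) - 1
    return tweets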
# Data type of tweets
print(type(k1_tweets[0]))

Instruction 1.6. To examine what the tweets look like, write your code in the cell below to print out all fields of the first tweet in k1_tweets, and to print out the text of the first tweet collected for each keyword.
[Total mark: 4]
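A minimal sketch of such a cell (assuming each crawled tweet is a dictionary as returned by the Twitter search API):

# All top-level fields of the first tweet for the first keyword
print(list(k1_tweets[0].keys()))

# Text of the first tweet collected for each keyword
for keyword, tweets in zip(keywords, [k1_tweets, k2_tweets, k3_tweets]):
    print('{}: {}'.format(keyword, tweets[0]['text']))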
Part 2: Data Analytics
The second part of this assignment will examine your skills and knowledge in data manipulation and analysis tasks. It includes three main components:
Part 2A. For each keyword, you will be required to load the tweets from your saved json files (from Part 1) and filter out all tweets that are too short.
Part 2B. Using your knowledge from practical sessions 5, 6 and 7, you will be required to construct the term-by-document matrix for the tweets and to perform visualisation tasks to understand them.
Part 2C. You will apply the Kmeans clustering algorithm to cluster your tweets and report the clustering results.
Follow the instructions below to complete your assigned tasks.
Part 2A: Load and Filter Tweets from Files
Instruction 2.1. The following function, named read_json_file(), has been partially implemented to load data from a json file. This function will be used later on to load three json files you have saved from Part 1. Your task is to insert your own code where instructed to complete this function.
[Total mark: 3]
'''
Insert your own code where instructed to complete this function
'''
def read_json_file(filename):
    """
    reads from a json file and saves the result in a list named data
    """
    with open(filename, 'r') as fp:
        # INSERT THE MISSING PIECE OF CODE HERE

    return data
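If each json file was saved as a single list of tweet dictionaries (an assumption about how you saved your data in Part 1), the missing piece could simply be:

        data = json.load(fp)  # parse the whole file into a list of tweet dictionaries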
Instruction 2.2. Now, using the read_json_file() function defined above, write three function calls to load the data from the three json files you saved in Part 1.
[Total mark: 4]
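For example (the filenames below are placeholders; use whatever names you chose when saving in Part 1):

k1_tweets = read_json_file('k1_tweets.json')  # placeholder filename
k2_tweets = read_json_file('k2_tweets.json')  # placeholder filename
k3_tweets = read_json_file('k3_tweets.json')  # placeholder filename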
def is_short_tweet(tweet):
    '''
    Check if the text of "tweet" has less than 50 characters
    '''
    # INSERT YOUR CODE HERE
'''
Write your code to remove all tweets which have less than 50 characters in the variables
k1_tweets, k2_tweets and k3_tweets, and store the results in the new variables
k1_tweets_filtered, k2_tweets_filtered and k3_tweets_filtered respectively
'''
k1_tweets_filtered =
k2_tweets_filtered =
k3_tweets_filtered =
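One possible sketch (assuming each tweet dictionary stores its text under the 'text' key):

def is_short_tweet(tweet):
    '''Check if the text of "tweet" has less than 50 characters'''
    return len(tweet['text']) < 50

k1_tweets_filtered = [t for t in k1_tweets if not is_short_tweet(t)]
k2_tweets_filtered = [t for t in k2_tweets if not is_short_tweet(t)]
k3_tweets_filtered = [t for t in k3_tweets if not is_short_tweet(t)]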
Part 2B: Constructing Term-by-Document Matrix
As we have learned in our class, in text analytics and, more generally, when dealing with unstructured data, before we can perform computational tasks such as computing the distance between two documents, we need to represent the documents in a numerical format. A popular technique we have learned is the bag-of-words representation and the term-by-document matrix, also known as the vector-space model.
This part of the assignment will require you to construct the term-by-document matrix for the tweets stored in the three variables k1_tweets_filtered, k2_tweets_filtered and k3_tweets_filtered. Note: Tweets are often not neat, as you may have seen from earlier tasks, so they need to be pre-processed with the supplied pre_process() function before the matrix can be built.
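As a toy illustration of the vector-space model (the two documents below are made up and are not part of the assignment data):

# Two tiny made-up documents, already tokenised into words
docs = [['data', 'science', 'is', 'fun'], ['science', 'is', 'hard']]
dictionary = sorted(set(word for doc in docs for word in doc))
termdoc = np.array([[Counter(doc)[word] for word in dictionary] for doc in docs])
print(dictionary)  # ['data', 'fun', 'hard', 'is', 'science']
print(termdoc)     # each row is one document's word-count vector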
# tweet_k1_processed is now a list of words.
# We use the ' '.join() method to join the list back into a string.
'''
Using the example above, write your code to display the first tweets stored in the variables
k2_tweets_filtered and k3_tweets_filtered before and after they have been pre-processed
using the function pre_process() supplied earlier.
'''
'''
Now write your code to print out the first 5 processed tweets for each keyword.
Hint: Each tweet in tweets_processed is now a list of words, not a string.
To print it as a string, you might need to use ' '.join(tweet), where tweet is a processed tweet.
'''
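A minimal sketch for the first keyword (k1_tweets_processed is an assumed variable name for the processed tweets):

for tweet in k1_tweets_processed[:5]:
    print(' '.join(tweet))  # each processed tweet is a list of words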
Part 2C: Data Clustering
Thus far in this assignment, we have collected tweets for each keyword and analysed them separately. We have constructed the term-by-document matrix for each collection of tweets separately. A fundamental and common task in data science, analytics, machine learning, science and engineering is clustering. This is also known as unsupervised learning or exploratory data analysis, as we have learned in our classes.
This part of the assignment will use the Kmeans algorithm learned in our classes to cluster the entire collection of tweets collected for all keywords. To do so, we need to compute the distance between any pair of tweets. This requires us to compute a joint term-by-document matrix for all tweets.
The reason we cannot use the individual term-by-document matrices computed earlier (e.g., k1_termdoc, k2_termdoc, k3_termdoc) for this task is that they have different dictionary sizes. Hence, tweets collected for different keywords have been represented by vectors of different dimensions.
The following piece of code will help you inspect these dimensions.

print('Dimension of the term-by-document matrix for keyword "{}":'.format(keywords[0]))
print('{} x {}\n'.format(k1_termdoc.shape[0], k1_termdoc.shape[1]))
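One possible way to build such a joint matrix is sketched below, under the assumption that k1_tweets_processed, k2_tweets_processed and k3_tweets_processed hold the pre-processed tweets (these variable names, and the choice of 3 clusters, are assumptions rather than part of the supplied notebook):

# Sketch: pool all processed tweets and build one shared dictionary
all_tweets_processed = k1_tweets_processed + k2_tweets_processed + k3_tweets_processed
joint_dictionary = sorted(set(word for tweet in all_tweets_processed for word in tweet))
word_index = {word: i for i, word in enumerate(joint_dictionary)}

# One row per tweet, one column per dictionary word
joint_termdoc = np.zeros((len(all_tweets_processed), len(joint_dictionary)))
for row, tweet in enumerate(all_tweets_processed):
    for word, count in Counter(tweet).items():
        joint_termdoc[row, word_index[word]] = count

# Cluster the joint matrix with KMeans (3 clusters, one per keyword, is an assumption)
kmeans = KMeans(n_clusters=3, random_state=0).fit(joint_termdoc)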
Explain below why visualising the clusters is hard to do in this case.
Instruction 2.24 (HD). Nevertheless, it is possible to visualise the structure of the cluster centres, perhaps surprisingly, using bar charts. Each vector component of a cluster-centre vector corresponds to a word in the dictionary. The value (amplitude) of the component for a particular word shows the strength of that word's presence in the cluster. In this task, you are required to:
1. Plot a bar chart for each of the three clusters obtained from KMeans, where each bar chart shows the 20 strongest words sorted by their presence strength (a possible sketch is given after this instruction). [2 marks]
2. Explain the bar charts from the point of view of the chosen keywords, English grammar and our text pre-processing routine. [1 mark]
[Total mark: 3]
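A possible sketch of task 1, reusing the kmeans and joint_dictionary names assumed in the earlier sketch (again assumptions, not the notebook's own variables):

# Bar chart of the 20 strongest words per cluster centre
for c, centre in enumerate(kmeans.cluster_centers_):
    top20 = np.argsort(centre)[::-1][:20]  # indices of the 20 largest components
    plt.figure()
    plt.bar(range(20), centre[top20])
    plt.xticks(range(20), [joint_dictionary[i] for i in top20], rotation=90)
    plt.xlabel('Word')
    plt.ylabel('Strength of presence')
    plt.title('Cluster {} centre: 20 strongest words'.format(c))
    plt.show()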
[Total marks: 2]
'''
Write your code to obtain the labels of tweets for each keyword and store the labels of the
first keyword in k1_labels, the labels of the second keyword in k2_labels and the labels of
the third keyword in k3_labels.
'''
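One possible sketch, assuming the rows of the joint term-by-document matrix are ordered with all keyword-1 tweets first, then keyword 2, then keyword 3 (as in the earlier sketch), and that kmeans is the fitted KMeans object:

n1 = len(k1_tweets_processed)
n2 = len(k2_tweets_processed)
k1_labels = kmeans.labels_[:n1]
k2_labels = kmeans.labels_[n1:n1 + n2]
k3_labels = kmeans.labels_[n1 + n2:]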