SIT 112 - Data Science Concepts - Assignment 2
Lecturer: Shalini Stephen | shalini.stephen@deakin.edu.au
Deakin College
Associated with Deakin University, VIC 3215, Australia
Due: Week 10 Saturday, 10th September 2022, by 11:55 pm
Instructions
This notebook has been prepared for you to complete Assignment 2. Some sections have been partially completed to help you get started. The total mark for this notebook is 100.
· Before you start, read the entire notebook carefully to understand what you need to do. You should also refer to the main instructions in Assignment2_instructions.pdf to see what else you need to complete for this assignment, as well as the submission instructions.
· Instructions marked with (D) and (HD) are for students aiming at high grades. They are more involved and can be completed after all other instructions.
· For each cell marked with #YOU ARE REQUIRED TO INSERT YOUR CODE IN THIS CELL, there will be places where you must supply your own code when instructed.
· For each cell marked with #YOU ARE REQUIRED TO INSERT YOUR COMMENT IN THIS CELL, there will be places where you must provide your own comments when instructed.
· For each graphic/chart/plot you create, YOU ARE REQUIRED to provide an X-label, a Y-label and a Title.
· For any output of your code execution, YOU ARE REQUIRED to provide very brief explanatory information (a few words is enough).
In this notebook:
· Markdown cells marked with Note are description sections.
· Markdown cells marked with Instruction give the instructions you must follow to complete the designated section.
Part 1: Crawling and Storing Tweet Data
The first part of the assignment examines your skills and knowledge in querying tweets and storing them in json files. For each provided keyword, your tasks are:
· Crawl all tweets that contain the keyword, are written in English, and are geocoded within the location provided for your group.
· Store the tweets collected into json files.
Follow the instructions below to complete your task.

Note: The following packages will be required for this assignment. If you need to import more packages, you might append them to the end of the following cell.

# Import packages needed for processing
import re
import json
import xml
import numpy as np
from collections import Counter
from TwitterAPI import TwitterAPI  # in case you need to install this package, see practical 6
from sklearn.cluster import KMeans
import requests

# disabling urllib3 warnings
requests.packages.urllib3.disable_warnings()

import matplotlib.pyplot as plt
%matplotlib inline

# If you need to add any additional packages, then add them below
Instruction 1.1. Enter your provided keywords into the variable keywords below.
[Total mark: 1]
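For example (the values below are placeholders only; replace them with the keywords provided for your group):

keywords = ['keyword1', 'keyword2', 'keyword3']   # placeholder keywords, not the real ones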
# Twitter API credentials
CONSUMER_KEY =        # ENTER YOUR CONSUMER_KEY
CONSUMER_SECRET =     # ENTER YOUR CONSUMER_SECRET
OAUTH_TOKEN =         # ENTER YOUR OAUTH_TOKEN
OAUTH_TOKEN_SECRET =  # ENTER YOUR OAUTH_TOKEN_SECRET
# Authenticating with your application credentials
api = TwitterAPI(CONSUMER_KEY, CONSUMER_SECRET, OAUTH_TOKEN, OAUTH_TOKEN_SECRET)  # INSERT YOUR CODE HERE

Note: As you have learned from the practical sessions, to perform a query with the Twitter API for a particular geo-coded location you need a centre point and a radius. The centre point is specified by its (latitude, longitude) pair. The information below has been provided to you to perform the query in the subsequent tasks.
# geo coordinates of the desired place
PLACE_LAT =  # INSERT YOUR CODE
PLACE_LON =  # INSERT YOUR CODE
PLACE_RAD =  # INSERT YOUR CODE

Instruction 1.3. For each keyword, you are required to crawl at least 200 tweets (the more the better) using the Twitter API. However, as you have learned from the practical sessions, each query returns a maximum of only 100 tweets. Therefore, each subsequent query must use the maximum Tweet ID from the previous batch to crawl the next lot.
The following function, called retrieve_tweets(), has been partially implemented to automatically download tweets until it reaches the maximum number of tweets needed.
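As a rough illustration of the pagination idea only (this is not the notebook's partially implemented version; the parameter names and the geocode string format below are assumptions), such a function could be shaped like this:

def retrieve_tweets(api, keyword, batch_count, total_count):
    """
    Sketch: collect up to total_count tweets containing keyword,
    at most batch_count (<= 100) tweets per request, paging with max_id.
    """
    tweets = []
    max_id = None
    while len(tweets) < total_count:
        params = {'q': keyword,
                  'count': batch_count,
                  'lang': 'en',
                  'geocode': '{},{},{}'.format(PLACE_LAT, PLACE_LON, PLACE_RAD)}
        if max_id is not None:
            params['max_id'] = max_id
        response = api.request('search/tweets', params)
        batch = [tweet for tweet in response]
        if not batch:
            break  # no more tweets available for this query
        tweets.extend(batch)
        # move max_id below the smallest id seen so the next request returns older tweets
        max_id = min(tweet['id'] for tweet in batch) - 1
    return tweets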
# Data type of tweets
print(type(k1_tweets[0]))

Instruction 1.6. To examine what the tweets look like, write your code in the cell below to print out all fields of the first tweet in k1_tweets, and to print out the text of the first tweet collected for each keyword.
[Total mark: 4]
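A minimal sketch of such a cell (assuming each crawled tweet is a dictionary as returned by the Twitter search API):

# All top-level fields of the first tweet for the first keyword
print(list(k1_tweets[0].keys()))

# Text of the first tweet collected for each keyword
for keyword, tweets in zip(keywords, [k1_tweets, k2_tweets, k3_tweets]):
    print('{}: {}'.format(keyword, tweets[0]['text']))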
Part 2: Data Analytics
The second part of this assignment will examine your skills and knowledge in data manipulation and analysis tasks. It includes three main components:
Part 2A. For each keyword, you will be required to load the tweets from your saved json files (from Part 1) and filter out all tweets that are too short.
Part 2B. Using your knowledge from practical sessions 5, 6 and 7, you will be required to construct the term-by-document matrix for the tweets and to perform visualisation tasks to understand them.
Part 2C. You will apply the Kmeans clustering algorithm to cluster your tweets and report the clustering results.
Follow the instructions below to complete your assigned tasks.
Part 2A: Load and Filter Tweets from Files
Instruction 2.1. The following function, named read_json_file(), has been partially implemented to load data from a json file. This function will be used later on to load three json files you have saved from Part 1. Your task is to insert your own code where instructed to complete this function.
[Total mark: 3]
'''
Insert your own code where instructed to complete this function
'''
def read_json_file(filename):
    """
    reads from a json file and saves the result in a list named data
    """
    with open(filename, 'r') as fp:
        # INSERT THE MISSING PIECE OF CODE HERE

    return data
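If each json file was saved as a single list of tweet dictionaries (an assumption about how you saved your data in Part 1), the missing piece could simply be:

        data = json.load(fp)  # parse the whole file into a list of tweet dictionaries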
Instruction 2.2. Now, using the read_json_file() function defined above, write three function calls to load the data from the three json files you saved in Part 1.
[Total mark: 4]
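For example (the filenames below are placeholders; use whatever names you chose when saving in Part 1):

k1_tweets = read_json_file('k1_tweets.json')  # placeholder filename
k2_tweets = read_json_file('k2_tweets.json')  # placeholder filename
k3_tweets = read_json_file('k3_tweets.json')  # placeholder filename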
def is_short_tweet(tweet):
    '''
    Check if the text of "tweet" has less than 50 characters
    '''
    # INSERT YOUR CODE HERE
'''
Write your code to remove all tweets which have less than 50 characters in the variables
k1_tweets, k2_tweets and k3_tweets, and store the results in the new variables
k1_tweets_filtered, k2_tweets_filtered and k3_tweets_filtered respectively
'''
k1_tweets_filtered =
k2_tweets_filtered =
k3_tweets_filtered =
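One possible sketch (assuming each tweet dictionary stores its text under the 'text' key):

def is_short_tweet(tweet):
    '''Check if the text of "tweet" has less than 50 characters'''
    return len(tweet['text']) < 50

k1_tweets_filtered = [t for t in k1_tweets if not is_short_tweet(t)]
k2_tweets_filtered = [t for t in k2_tweets if not is_short_tweet(t)]
k3_tweets_filtered = [t for t in k3_tweets if not is_short_tweet(t)]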
Part 2B: Constructing Term-by-Document Matrix
As we have learned in our class, in text analytics and, more generally, when dealing with unstructured data, before we can perform computational tasks such as computing the distance between two documents, we need to represent the documents in a numerical format. A popular technique we have learned is the bag-of-words representation and the term-by-document matrix, also known as the vector-space model.
This part of the assignment will require you to construct the term-by-document matrix for the tweets stored in the three variables k1_tweets_filtered, k2_tweets_filtered and k3_tweets_filtered. Note: Tweets are often not neat, as you may have seen from earlier tasks, so they need to be pre-processed with the supplied pre_process() function before the matrix can be built.
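As a toy illustration of the vector-space model (the two documents below are made up and are not part of the assignment data):

# Two tiny made-up documents, already tokenised into words
docs = [['data', 'science', 'is', 'fun'], ['science', 'is', 'hard']]
dictionary = sorted(set(word for doc in docs for word in doc))
termdoc = np.array([[Counter(doc)[word] for word in dictionary] for doc in docs])
print(dictionary)  # ['data', 'fun', 'hard', 'is', 'science']
print(termdoc)     # each row is one document's word-count vector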
# tweet_k1_processed is now a list of words.
# We use the ' '.join() method to join the list back into a string.
'''
Using the example above, write your code to display the first tweets stored in the variables
k2_tweets_filtered and k3_tweets_filtered before and after they have been pre-processed
using the function pre_process() supplied earlier.
'''
'''
Now write your code to print out the first 5 processed tweets for each keyword.
Hint: Each tweet in tweets_processed is now a list of words, not a string.
To print it as a string, you might need to use ' '.join(tweet), where tweet is a processed tweet.
'''
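A minimal sketch for the first keyword (k1_tweets_processed is an assumed variable name for the processed tweets):

for tweet in k1_tweets_processed[:5]:
    print(' '.join(tweet))  # each processed tweet is a list of words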
Part 2C: Data Clustering
Thus far in this assignment, we have collected tweets for each keyword and analysed them separately. We have constructed the term-by-document matrix for each collection of tweets separately. A fundamental and common task in data science, analytics, machine learning, science and engineering is clustering. This is also known as unsupervised learning or exploratory data analysis, as we have learned in our classes.
This part of the assignment will use the Kmeans algorithm learned in our classes to cluster the entire collection of tweets collected for all keywords. To do so, we need to compute the distance between any pair of tweets. This requires us to compute a joint term-by-document matrix for all tweets.
The reason we cannot use the individual term-by-document matrices computed earlier (e.g., k1_termdoc, k2_termdoc, k3_termdoc) for this task is that they have different dictionary sizes. Hence, tweets collected for different keywords have been represented by vectors of different dimensions.
The following piece of code will help you inspect these dimensions.

print('Dimension of the term-by-document matrix for keyword "{}":'.format(keywords[0]))
print('{} x {}\n'.format(k1_termdoc.shape[0], k1_termdoc.shape[1]))
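One possible way to build such a joint matrix is sketched below, under the assumption that k1_tweets_processed, k2_tweets_processed and k3_tweets_processed hold the pre-processed tweets (these variable names, and the choice of 3 clusters, are assumptions rather than part of the supplied notebook):

# Sketch: pool all processed tweets and build one shared dictionary
all_tweets_processed = k1_tweets_processed + k2_tweets_processed + k3_tweets_processed
joint_dictionary = sorted(set(word for tweet in all_tweets_processed for word in tweet))
word_index = {word: i for i, word in enumerate(joint_dictionary)}

# One row per tweet, one column per dictionary word
joint_termdoc = np.zeros((len(all_tweets_processed), len(joint_dictionary)))
for row, tweet in enumerate(all_tweets_processed):
    for word, count in Counter(tweet).items():
        joint_termdoc[row, word_index[word]] = count

# Cluster the joint matrix with KMeans (3 clusters, one per keyword, is an assumption)
kmeans = KMeans(n_clusters=3, random_state=0).fit(joint_termdoc)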
Explain below why visualising the clusters is hard to do in this case.
Instruction 2.24 (HD). Nevertheless, it is possible to visualise the structure of the cluster centres, perhaps surprisingly, using bar charts. Each vector component of a cluster-centre vector corresponds to a word in the dictionary. The value (amplitude) of the component for a particular word shows the strength of that word's presence in the cluster. In this task, you are required to:
1. Plot a bar chart for each of the three clusters obtained from KMeans, where each bar chart shows the 20 strongest words sorted by their presence strength (a possible sketch is given after this instruction). [2 marks]
2. Explain the bar charts from the point of view of the chosen keywords, English grammar and our text pre-processing routine. [1 mark]
[Total mark: 3]
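A possible sketch of task 1, reusing the kmeans and joint_dictionary names assumed in the earlier sketch (again assumptions, not the notebook's own variables):

# Bar chart of the 20 strongest words per cluster centre
for c, centre in enumerate(kmeans.cluster_centers_):
    top20 = np.argsort(centre)[::-1][:20]  # indices of the 20 largest components
    plt.figure()
    plt.bar(range(20), centre[top20])
    plt.xticks(range(20), [joint_dictionary[i] for i in top20], rotation=90)
    plt.xlabel('Word')
    plt.ylabel('Strength of presence')
    plt.title('Cluster {} centre: 20 strongest words'.format(c))
    plt.show()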
[Total marks: 2]
'''
Write your code to obtain the labels of tweets for each keyword and store the labels of the
first keyword in k1_labels, the labels of the second keyword in k2_labels and the labels of
the third keyword in k3_labels.
'''
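One possible sketch, assuming the rows of the joint term-by-document matrix are ordered with all keyword-1 tweets first, then keyword 2, then keyword 3 (as in the earlier sketch), and that kmeans is the fitted KMeans object:

n1 = len(k1_tweets_processed)
n2 = len(k2_tweets_processed)
k1_labels = kmeans.labels_[:n1]
k2_labels = kmeans.labels_[n1:n1 + n2]
k3_labels = kmeans.labels_[n1 + n2:]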