FIT1043 Assignment 3 Semester 2, 2022
Due: Friday 21st October 2022, 11:55pm
Hand in Requirements:
1) Please hand in a PDF file containing your answers to all the questions, numbered correspondingly.
2) Your report should include the following:
● Screenshots/images of the outputs/graphs you generate to justify your answers to all the questions.
● Copies of all the bash command lines and R scripts you use. If your answer is wrong, you may still get half marks if your command line or script is close to correct.
3) You need to explain what each part of a command does for all your answers. For instance, if the command you use is ‘unzip tutorial_data.zip’, you need to explain that it uncompresses the zip file.
4) Please do not include the question text in your submission (doing so incurs a 5% penalty).
NOTE: The two data sets for this assignment are in the Google shared drive.
Both are large, so your best bet is to download them while in the lab/studio and do the assignment there. You will need a Linux machine, a Mac terminal, or Cygwin on a Windows machine for this.
Assignment Tasks:
This assignment is worth 50 marks, which makes up 15% of this unit’s assessment. There are two tasks that you need to complete for this assignment. Students who complete only Tasks A1-A7 and B1-B4 can achieve at most a Distinction. Students who attempt Tasks A8 and B5 will be showing critical analysis skills and a deeper understanding of the task at hand, and can achieve the highest grade. You need to use the Unix shell and R to complete the tasks.
[30 marks] Task A: Investigating Global-Scale User Check-in Data Collected from Foursquare in the Shell
Download the file dataset_TIST2015.tar, which contains user check-in data from Foursquare (https://foursquare.com/).
1) [4 marks] Decompress the tar file and have a look at the files it contains. How many files are there in the tar file? How big is each file?
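One possible approach (a sketch; exact flag behaviour may vary slightly between tar versions):

    tar -xvf dataset_TIST2015.tar   # x = extract, v = list each file as it is extracted, f = archive name
    ls -lh                          # list the extracted files with human-readable sizes
    # 'tar -tvf dataset_TIST2015.tar' also lists the contents and sizes without extracting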
2) [2 marks] What delimiter is used to separate the columns in the file (dataset_TIST2015_Checkins_v2.txt) and how many columns are there?
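For example, one way to inspect the delimiter and count the columns (a sketch, assuming the delimiter turns out to be a tab):

    head -n 1 dataset_TIST2015_Checkins_v2.txt | cat -A                   # show invisible characters; tabs appear as ^I
    head -n 1 dataset_TIST2015_Checkins_v2.txt | awk -F'\t' '{print NF}'  # count fields on the first line, assuming a tab delimiter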
3) [4 marks] The first column in the dataset_TIST2015_Checkins_v2.txt file is user_id. What are the other columns? Print out the names of the columns.
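If the file has a header row, printing its first line shows the column names; otherwise the column names are described in the dataset’s readme file:

    head -n 1 dataset_TIST2015_Checkins_v2.txt    # print the first line of the file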
4) [4 marks] How many check-ins are there in the file, and how many users?
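A possible sketch, assuming the file is tab-delimited with user_id in the first column:

    wc -l dataset_TIST2015_Checkins_v2.txt                       # each line is one check-in (subtract any header row)
    cut -f1 dataset_TIST2015_Checkins_v2.txt | sort -u | wc -l   # count distinct user_id values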
5) [2 marks] What are the first and last dates in the dataset_TIST2015_Checkins_v2.txt file? (Assume that the data is ordered chronologically by date.)
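Since the data is assumed to be in chronological order, the first and last records give the answer:

    head -n 1 dataset_TIST2015_Checkins_v2.txt   # earliest record
    tail -n 1 dataset_TIST2015_Checkins_v2.txt   # latest record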
6) [4 marks] How many unique venue IDs are there in the dataset_TIST2015_POIs.txt file?
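A sketch, assuming the venue ID is the first tab-separated column of the POI file (check the readme):

    cut -f1 dataset_TIST2015_POIs.txt | sort -u | wc -l   # count distinct venue IDs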
7) [4 marks] How many unique venue categories are included in the dataset_TIST2015_POIs.txt file for France? (Hint: FR is the country code for France.)
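One possible approach, assuming the POI file is tab-delimited with the venue category and the country code as its last two columns (check the readme and adjust the field numbers if your file differs):

    awk -F'\t' '$NF == "FR" {print $(NF-1)}' dataset_TIST2015_POIs.txt | sort -u | wc -l
    # keep rows whose last field (country code) is FR, print the category field, count distinct values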
8) Background: How would you select venues from Europe? Consider the structure of the data presented in the readme file. Check-ins are indexed by a Venue ID, and the venues themselves are described in a separate file, the POI file. You can select European venues from the POI file in (at least) two ways: select items in a latitude/longitude bounding box, or select items by country code. Don’t be too fussed about the exact boundaries (including or excluding Turkey, Ukraine, etc. is OK either way).
[6 marks] Create an awk script to create a European subset of the POI file, and name the subset file “POIeu.txt” (a possible starting point is sketched after the list below). Investigate your European subset.
- Submit the created POIeu.txt along with your PDF file.
- Which country has the most venues and which the fewest, and how many venues does each have?
- Which country has the most Seafood restaurants?
- Which class of restaurant is most common in Europe (i.e., has the most venues)?
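A possible skeleton for the country-code approach (the field numbers and the list of country codes are illustrative only; extend the list to cover the countries you decide to include):

    # POIeu.awk - keep POI rows whose country code (assumed to be the last tab-separated field) is European
    BEGIN {
        FS = OFS = "\t"
        split("FR DE GB ES IT NL BE PT SE NO FI DK PL AT CH IE GR CZ HU RO", codes, " ")
        for (i in codes) eu[codes[i]] = 1
    }
    eu[$NF]                      # pattern with no action: print the matching line

Run it with ‘awk -f POIeu.awk dataset_TIST2015_POIs.txt > POIeu.txt’, then the dot points above can be answered with pipelines such as ‘cut -fN POIeu.txt | sort | uniq -c | sort -nr | head’, where N is the country-code or category column in your file.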
[20 marks] Task B: Investigating the Twitter Data in the Shell and Graphing in R
In this task you will work with the Twitter_Data_1.gz data file. Please decompress the file and answer the following questions.
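For example (a sketch; on older gzip versions without the -k flag, use ‘zcat Twitter_Data_1.gz > Twitter_Data_1’ instead):

    gunzip -k Twitter_Data_1.gz   # decompress to Twitter_Data_1, keeping the original .gz file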
1) [2 marks] How many times does the term ‘Donald Trump’ appear in tweets? (Note: if the term appears twice in a tweet, count it twice.)
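One way to count occurrences (a sketch, assuming the decompressed file is named Twitter_Data_1; adjust to the actual name):

    grep -o 'Donald Trump' Twitter_Data_1 | wc -l
    # -o prints every match on its own line, so two occurrences in one tweet are counted twice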
2) [5 marks] Background: We want to consider how the amount of discussion regarding Donald Trump varies over the time period covered by the data file.
To answer this question, you will need to extract the timestamps for all tweets referring to Donald Trump. You will then need to read them into R and generate a histogram. [Hint: To read the data into R, first generate a file containing only the timestamp column as text. Then read the file into R as a CSV.] R will not recognise the strings as timestamps automatically, so you’ll need to convert them from text values using the strptime() function. Instructions on how to use the function are available here: (https://stat.ethz.ch/R-manual/R-devel/library/base/html/strptime.html).
[Note: the histogram should be plotted in the next question (Q3).]
Question: You will need to write a format string, starting with “%a %b” to tell the function how to parse the particular date/time format in your file. What format string do you need to use?
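A minimal R sketch, assuming the timestamps of the matching tweets have already been extracted in the shell into a one-column text file (here called trump_times.txt, a hypothetical name) and that they use Twitter’s usual ‘Wed Feb 14 10:32:10 +0000 2018’ layout; check your data and adjust the format string to match:

    # read the extracted timestamp column (one timestamp per line, no header)
    ts <- read.csv("trump_times.txt", header = FALSE, stringsAsFactors = FALSE)
    # parse the text into date-time values; adjust the format string to your file
    times <- strptime(ts$V1, format = "%a %b %d %H:%M:%S %z %Y")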
3) [6 marks] (R code) Once you’ve converted the timestamps, use the hist() function to plot the data. [Hint: you may need to set the number of bins sufficiently high to see the variation over time well.]
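For example, continuing the sketch above (the number of breaks is only a starting point to experiment with):

    times_ct <- as.POSIXct(times)        # hist() works most smoothly with POSIXct values
    hist(times_ct, breaks = 100,         # many bins make variation over time visible
         main = "Tweets mentioning Donald Trump", xlab = "Date")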
4) [3 marks] [R code] The plot has a bit of an unusual shape. Can you see a pattern before Feb 15, and what happens after that?
5) [4 marks] Plot a second histogram, but this time showing the distribution over number of tweets per author in the file.
[Hint: You’ll need to count the number of tweets by each unique author in the Twitter file, producing a file with two columns, “user” and “twitter count”, in the bash shell. Then load them into R. This is a large file, so you can also just isolate the counts, then sort and count them to get a summary statistics file with columns “twitter count” and “number of users”.]
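A possible sketch of the first approach from the hint, assuming the data is tab-delimited with the author in column 1 and usernames containing no whitespace (adjust the column number and delimiter to the actual file layout):

    # bash: one line per author with their tweet count, saved as "user<TAB>count"
    cut -f1 Twitter_Data_1 | sort | uniq -c | awk '{print $2 "\t" $1}' > user_counts.txt

    # R: read the per-author counts and plot their distribution
    uc <- read.table("user_counts.txt", sep = "\t", quote = "", comment.char = "",
                     col.names = c("user", "count"))
    hist(uc$count, breaks = 50, main = "Tweets per author", xlab = "Number of tweets")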