COMP2420/COMP6420 - 2022 Sample - Q2 Data Science

Q2) Data Science (20 marks)

Part 1 (10 marks)

Researcher A claims to have quantified the drink consumption habits of Coffee Grounds customers. He claims that each customer's choice is a random sample from a distribution over five outcomes: Smoothie, Coffee, Milk Tea, Classic Tea, and Sparkling Juice, with probabilities of 0.2, 0.15, 0.2, 0.4, and 0.05, respectively.


a) (2 marks) Another researcher, B, rejects one part of A's claim: he can’t believe customers choose Classic Tea with a 40% probability (he doesn’t care at all about the probabilities of the other choices). State the null and alternative hypotheses that he should use to investigate the issue.

b) (2 marks) Now B needs a sample of customer choices. Each beverage cup contains a mark describing its original contents. Should he look in the garbage can outside of Coffee Grounds at the end of the day and count the proportion of cups that contained Classic Tea? Why or why not?

c) (3 marks) Alternatively, Coffee Grounds offers to give B a uniform random sample of 10 orders from its database of all past orders. He replies, “That’s not enough, I need a large random sample.” They ask why. How should he respond to justify his request for a large random sample?

d) (3 marks) B chooses as his test statistic the absolute difference between 40% and the observed proportion of Classic Tea orders. Coffee Grounds provides B with a random sample of 1000 orders. 400 of the orders are for Classic Tea. He then simulates the test statistic 100,000 times and computes a p-value. Based on this information, which of the following is true? Briefly justify your answer.

  • (i) The null hypothesis will certainly be rejected using a 5% p-value threshold.
  • (ii) The null hypothesis will certainly not be rejected using a 5% p-value threshold.
  • (iii) Can’t tell without actually running the simulation and looking at the empirical histogram.
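
As an illustration of the procedure described in part d), the following is a minimal sketch of simulating the test statistic under the null hypothesis with NumPy. The variable names and the random seed are assumptions made for this example and are not part of the question. Note that the observed proportion is 400/1000 = 0.40, so the observed statistic is 0 and every simulated statistic is at least as extreme.

```python
import numpy as np

# Seed chosen arbitrarily so the sketch is reproducible (an assumption).
rng = np.random.default_rng(0)

n_orders = 1000          # sample size given in the question
null_p = 0.40            # Classic Tea probability under the null hypothesis
n_simulations = 100_000  # number of simulated test statistics

# Observed statistic: |observed proportion - 0.40|
observed_stat = abs(400 / n_orders - null_p)

# Simulate Classic Tea counts under the null, then the statistic for each.
sim_counts = rng.binomial(n_orders, null_p, size=n_simulations)
sim_stats = np.abs(sim_counts / n_orders - null_p)

# Empirical p-value: share of simulated statistics at least as large
# as the observed one.
p_value = np.mean(sim_stats >= observed_stat)
print(observed_stat, p_value)
```

Because an absolute difference can never be negative, the comparison `sim_stats >= observed_stat` holds for every simulated value here, whatever the seed.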

Part 2 (10 marks)

Bike sharing systems are a new generation of traditional bike rentals, where the whole process from membership, rental and bike return has become automatic. Through these systems, the user is able to easily rent a bike from a particular location and return the bike at another location. The characteristics of data being generated by these systems make them attractive for research. As opposed to other transport services such as bus or subway, the duration of travel, departure and arrival location is explicitly recorded in these systems. This feature turns the bike sharing system into a virtual sensor network that can be used for sensing mobility in the city.

One such dataset is provided to you here. The description of fields is as follows:

  • instant: Record index
  • dteday : Date - YYYY-MM-DD
  • season : Season (1:spring, 2:summer, 3:fall, 4:winter)
  • yr : Year. Only two years - 2011 and 2012. The value 0 represents 2011, and 1 represents 2012.
  • mnth : Month (1 to 12)
  • hr : Hour (0 to 23)
  • holiday : Whether that day is a holiday or not
  • weekday : Day of the week
  • workingday : If the day is neither a weekend nor a holiday, the value is 1. Else, it is 0.
  • weathersit : Weather situation. The value is one of the following:
    • 1: Clear, Few clouds, Partly cloudy, Cloudy
    • 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
    • 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
    • 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
  • temp : Temperature in Celsius.
  • atemp: "Feels like" temperature in Celsius (14 degrees might feel like 8 degrees)
  • hum: Humidity as a fraction (81% is represented as 0.81).
  • windspeed: Wind speed.
  • casual: Number of bikes rented by casual users
  • registered: Number of bikes rented by registered users
  • cnt: Total number of rental bikes (bikes rented by both casual and registered users. cnt = casual + registered)

The filename is data/hours.csv. Based on this data, please answer the questions below.

a) (1 mark) What are the mean temperature, humidity, windspeed and total rentals per month?
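
One possible way to approach part a) is a pandas group-by on the `mnth` column. The sketch below uses a tiny synthetic table with invented values in place of data/hours.csv, and it assumes that "mean total rentals" refers to the mean of the hourly `cnt` values within each month; other readings of the question are possible.

```python
import pandas as pd

# Synthetic stand-in for data/hours.csv; all values are invented.
# With the real file, one would presumably start from
# pd.read_csv("data/hours.csv") instead.
hours = pd.DataFrame({
    "mnth": [1, 1, 2, 2],
    "temp": [2.0, 4.0, 6.0, 10.0],
    "hum": [0.80, 0.60, 0.50, 0.70],
    "windspeed": [10.0, 20.0, 5.0, 15.0],
    "cnt": [10, 30, 50, 70],
})

# Mean of each column within each month (rows are hourly records).
monthly_means = hours.groupby("mnth")[["temp", "hum", "windspeed", "cnt"]].mean()
print(monthly_means)
```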

b) (2 marks) Is there a difference between the real temperature and the "feels like" temperature? If there is a difference, then does this exist across the different seasons? Justify your answer.

c) (3 marks) Is temperature associated with daily bike rentals (registered vs. casual)? Draw a plot to answer this question and describe your findings.
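
Since the records are hourly, part c) first needs a daily table. The sketch below shows one way to build it with pandas, again on invented data rather than data/hours.csv. Averaging `temp` over the day and summing the rental counts are assumptions about what "daily" should mean here. The resulting table could then feed a plot such as a scatter of temperature against each rental type.

```python
import pandas as pd

# Synthetic hourly records; all values are invented for illustration.
hours = pd.DataFrame({
    "dteday": ["2011-01-01", "2011-01-01", "2011-01-02",
               "2011-01-02", "2011-01-03", "2011-01-03"],
    "temp": [2.0, 4.0, 9.0, 11.0, 19.0, 21.0],
    "casual": [1, 2, 5, 6, 20, 22],
    "registered": [10, 12, 15, 17, 20, 22],
})

# Collapse hourly rows to one row per day:
# mean temperature, total casual rentals, total registered rentals.
daily = hours.groupby("dteday").agg(
    temp=("temp", "mean"),
    casual=("casual", "sum"),
    registered=("registered", "sum"),
)

# Pearson correlation of daily temperature with each rental type.
corr = daily[["casual", "registered"]].corrwith(daily["temp"])
print(daily)
print(corr)
```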

d) (4 marks) What percentage of days are appropriate for biking, for each of the following ideal biking conditions?

i. Temperature > 5°, weather situation 1-3 (1,2,3), windspeed < 40 km/h, hr > 8 and hr < 14
ii. Temperature > 10°, weather situation 1-2, windspeed < 20 km/h, hr > 6 and hr < 18

Note: You don't explicitly have daily data.
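
The question leaves open how hourly records translate into an "appropriate day". The sketch below takes one possible reading, stated here as an assumption: a day qualifies only if every recorded hour inside the hour window satisfies the condition, and days with no records in the window count as not appropriate. The helper `pct_ideal_days` and the sample values are hypothetical, and `temp` and `windspeed` are assumed to be in the units the conditions use.

```python
import pandas as pd

# Synthetic hourly records; all values are invented for illustration.
hours = pd.DataFrame({
    "dteday": ["2011-01-01"] * 3 + ["2011-01-02"] * 3,
    "hr": [9, 10, 20, 9, 10, 11],
    "temp": [12.0, 8.0, 3.0, 4.0, 6.0, 7.0],
    "weathersit": [1, 2, 4, 1, 1, 1],
    "windspeed": [10.0, 12.0, 50.0, 5.0, 5.0, 5.0],
})

def pct_ideal_days(df, min_temp, max_weathersit, max_wind, hr_after, hr_before):
    """Percentage of days whose in-window hours all meet the condition."""
    # Keep only the hours strictly inside the window.
    window = df[(df["hr"] > hr_after) & (df["hr"] < hr_before)]
    ok = (
        (window["temp"] > min_temp)
        & (window["weathersit"] <= max_weathersit)
        & (window["windspeed"] < max_wind)
    )
    # Assumption: a day qualifies only if all of its in-window hours qualify.
    day_ok = ok.groupby(window["dteday"]).all()
    # Denominator is every day in the data, including days with no
    # in-window records (those are treated as not appropriate).
    return 100.0 * day_ok.sum() / df["dteday"].nunique()

pct_i = pct_ideal_days(hours, 5, 3, 40, 8, 14)    # condition i
pct_ii = pct_ideal_days(hours, 10, 2, 20, 6, 18)  # condition ii
print(pct_i, pct_ii)
```

Replacing `.all()` with `.any()` would give the looser reading in which a single suitable hour makes the day appropriate.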
