Homework #4
Assignment Type: Team (You must work with your MQM-assigned team and make a single group submission. You may not collaborate with anyone from any other team.)
Due Date: 8:00 AM EST on Monday, October 9, 2023
Instructions:
--Your group will submit one file: a Jupyter notebook (i.e. an .ipynb file) that includes your queries, output, and commentary AND includes an embedded picture (i.e. a .jpg file) of your relational schema that you will create in ERDPlus
--I recognize that formatting your Jupyter cells can be a little bit of a pain. However, I do ask that you try to get your queries into a similar format to what we’ve been utilizing thus far in the course (and you can always refer to the example below)…
Example Question:
Select “bmi” and “age” for 5 records. Order the output from highest to lowest “bmi.” Include your output.
Answer:
Query
SELECT bmi,
age
FROM health
ORDER BY bmi DESC
LIMIT 5;
Output
bmi age
6388.49 48
5858.59 54
4324.40 27
4320.96 48
3745.48 66
--Notes about the formatting:
--Put each SQL clause on a new line
--Use all caps for all SQL clauses and keywords
--Write each field in the SELECT statement on a new line
--Clearly separate SQL clauses from field names, table names, etc. using spaces or tabs
Background
You have been hired by Dognition (https://www.dognition.com) to examine their data, identify potential issues, and provide some general insights regarding their product and users. The Dognition team is aware that their database is somewhat “messy” but they are unclear about the extent of the problems. There are six tables in the database and the team has provided you with some information about the fields in each table. None of the six tables have a properly-defined primary key. Given all of this information, you have been assigned three main tasks…
Task #1 (Broad Data Cleaning and Relational Schema)
· Drop duplicate records in each table. Be careful if you try to use COUNT(DISTINCT Col1, Col2, Col3, …). If you COUNT(DISTINCT …) with multiple columns at once, a row will be excluded from the COUNT(DISTINCT …) if it has a NULL value in any of those columns. This is simply how COUNT(DISTINCT …) works when you pass it multiple columns at once. Also note that I’m not asking you to actually remove duplicates from each table—you don’t have the permissions to alter the data in the database. However, in order to perform the next step, you definitely need to get rid of duplicates first (or else no set of columns will uniquely identify each row in a table). Also note that SELECT DISTINCT *… is a perfectly legitimate query (and something you might want to explore).
· After eliminating duplicates, determine which field(s) in a table could comprise a primary key for that table IN ITS CURRENT STATE. Put differently, find the column or set of columns that can be used to uniquely identify each row in a table. You do not need to worry about any additions to the database—just work with what you have! ALSO, PLEASE NOTE THAT THE SET OF COLUMNS THAT UNIQUELY IDENTIFY EACH ROW IN A TABLE MAY NOT BE A “VALID” (COMPOSITE) PRIMARY KEY BECAUSE THEY MAY CONTAIN SOME NULL VALUES. WE WILL IGNORE THIS POTENTIAL ISSUE WHEN BUILDING THE RELATIONAL SCHEMA. Some notes…
o The “exam_answers” table is especially messy. After dropping all duplicate records in the table, you should also exclude all records that have a NULL value for dog_guid or end_time. After doing that, I will tell you that the combination of script_detail_id, start_time, end_time, loop_number, and dog_guid can uniquely identify each row in the table. You still need to verify this.
o The “site_activities” table is also somewhat messy—I believe that the combination of the description, created_at, and user_guid fields can uniquely identify each row in the table. However, I would like you to verify this too.
o For the “users” table, I would like for you to set things up such that the user_guid field can uniquely identify each row in the table (basically you’ll have to deal with 13 records that have multiple values for one field—keep the ones that have an “informative” value for that field). Do NOT individually hard-code these 13 cases—figure out a way to eliminate them (I used two CTEs and a LEFT JOIN, but I’m sure there are other methods).
· In your Jupyter Notebook, be sure to include all of your work that got you to your final relational schema!
· If you can embed your relational schema in your Jupyter notebook that would be best! Simply upload a .jpg to your Jupyter file space and then you can embed it into any markdown cell with Edit > Insert Image.
· In everything that follows, ALWAYS USE YOUR “CLEANED” DATA TABLES THAT YOU GENERATED ABOVE. This is easily accomplished by copy/pasting the same CTEs in all subsequent Jupyter cells.
Task #2 (Searching for Potential Issues)
· Examine each table for potential issues, inconsistencies, etc.
· Summarize the “problems” for each table
· Since we do not have access to anyone at Dognition, make assumptions and restrictions that address these issues. For example, if it appears that a certain set of records are blatantly erroneous, state that you are excluding them from all further analyses.
· While this task is tedious it definitely is not trivial. Please be sure to spend ample time investigating these potential issues. However, I recognize that you will by no means be able to discover and vet everything.
· Just to give you an idea of my expectations here, I want you to work as a group for roughly 1 hour on each the dogs and users tables. For all of the other 4 tables an individual team member can spend roughly 30 minutes investigating each one. Be sure to include your work and any final conclusions you make for each table.
Task #3 (Summarize the Data)
a) Use the data to analyze “user sign-ups.” In particular, investigate the “user_guid,” “created_at,” and “membership_type” fields in the “users” table. Discuss potential issues associated with these fields. After identifying and managing these issues (by making reasonable assumptions—there is by no means a “single correct answer” here), generate a result set that contains 8 columns:
o Year
o Month
o Count of user sign-ups in that year/month combination
o Percentage of the total count of user sign-ups in that year/month combination that have membership_type = 1
o Percentage of the total count of user sign-ups in that year/month combination that have membership_type = 2
o Percentage of the total count of user sign-ups in that year/month combination that have membership_type = 3
Be sure to examine your results and attempt to explain any irregularities (perhaps using a quick Google search)!
b) Use your results from the “user sign-ups” analysis to investigate the correlation between time of joining (created_at in the users table), type of subscription, and number of tests completed. This analysis is more open-ended and again, does not have a single answer. I believe you will need to use data from the “complete_tests,” “dogs,” and “users” tables (at a minimum). Be sure to include plenty of comments/markdown to describe your approach and any final conclusions.
c) The Dognition team believes that they are having a hard time retaining customers. Some team members believe that the Dognition assessment is too complicated (so that many users get to a certain point, become frustrated, and quit). Other team members believe that there may be issues with the Dognition website, where certain webpages are prone to issues, resulting in user confusion. There is also a hypothesis that the assessment itself is simply better suited to certain “types” of owners and/or dogs. Help Dognition investigate these ideas. (Note that there is definitely no “correct” answer here. You will be judged on your creativity/potential impact and the scope of your analysis.)
Grading
· IN THE EVENT THAT A TEAM MEMBER (OR MULTIPLE TEAM MEMBERS) DO NOT CONTRIBUTE SUFFICIENTLY TO THE PROJECT, I RESERVE THE RIGHT TO ASSIGN DIFFERENT GRADES WITHIN A TEAM. ANY INSTANCE OF THIS TYPE OF ISSUE CAN BE REPORTED DIRECTLY TO ME VIA A DM IN SLACK.
· The assignment will be graded on a 10-point scale. Below is a general rubric for how each task will be evaluated (where we reserve the right to assign point values between these cut-points)…
o Task #1: 2 points
--2/2—Showed significant effort and the relational schema is mostly correct. Plenty of work in your Jupyter notebook supporting your relational schema. This is equivalent to a "check plus" grade.
--1/2—Showed good effort. Relational schema may have significant issues. Potentially less supporting work in your Jupyter notebook. This is equivalent to a "check" grade.
--0/2—Did not show enough effort. Relational schema is either not submitted or largely incorrect. Very little work in your Jupyter notebook. This is equivalent to a "check minus" grade.
o Task #2: 2 points
--2/2—Showed significant effort. Thoroughly investigated potential data issues and summarized these issues for each table. This is equivalent to a "check plus" grade.
--1/2—Showed good effort. Potentially a less-detailed investigation or missing altogether for certain tables. This is equivalent to a "check" grade.
--0/2—Did not show enough effort. Little to no querying/explanation. This is equivalent to a "check minus" grade.
o Task #3: 4 points
--4/4—Showed significant effort. All parts a.), b.), and c.) are completed. Queries/output are mostly correct, applicable, and interesting. This is equivalent to a "check plus" grade.
--3/4—Showed good effort. Analysis is not quite as thorough. More errors in queries/output. Results might be less informative. This is equivalent to a "check" grade.
--1/4—Did not show enough effort. Queries/results might be blatantly wrong or unsubstantiated. This is equivalent to a "check minus" grade.
o Formatting: 2 points
--2/2—In general, the relational schema and Jupyter Notebook are clear and easy to follow. Queries are formatted properly. There are sufficient notes/annotation for someone familiar with SQL (but not necessarily the dataset) to easily follow along.
--1/2—Formatting just not quite to the level of a 2/2, making it difficult to follow along at points. Missing some notes/annotations.
--0/2—Serious formatting issues making it incredibly difficult to follow along
Dognition Database Information
Dognition (https://www.dognition.com) is a company that teaches you how to build a deeper
connection with your dog by giving you an unprecedented perspective on your dog’s personality
and capabilities.
As a customer, you purchase the opportunity to follow detailed instructions and how-to videos
about how to play 20 fun and interactive games with your dog that were created by scientists,
trainers, and behavioral specialists. These 20 games comprise the Dognition Assessment. When
you finish the assessment, you receive a 10-15 page report about your dog’s unique personality
dimension.
The Profile Report gives you individualized insight into the cognitive strategies your dog uses to
interact with the world, and in-depth breakdowns of how your dog performed in each game
compared to other dogs. You are also told your dog's Personality Profile, which is one of nine
profiles or “dimensions” with titles like “Socialite,” “Einstein,” or “Maverick.”
Some important notes…
· The 20-game Dognition Assessment assesses 5 core dimensions of cognition: empathy, communication, cunning, memory, and reasoning.
· With a few exceptions, all games are presented to customers in the same order.
· Customers are not able to advance through tasks or trials out of order. This rule is implemented to make it as easy as possible for participants to follow all the steps of the games correctly.
· After completing the 20-game Dognition Assessment, customers can sign up to receive additional games and activities at the rate of one game and one activity per month.
· The Dognition team ran an experiment to determine whether a promotion that gave customers free access to the first 4 games of the Dognition Assessment would entice them to pay to complete the rest of the Dognition Assessment.
· All information that could be used to identify specific users has been removed.
· More details about the Dognition data collection method can be found at: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0135176 (Stewart, Laughlin, et al. “Citizen science as a new tool in dog cognition research.” PloS One 10.9 (2015): e0135176.)
NOTE: the “data dictionary” provided to us by Dognition (below) may not be fully accurate/complete. But, we need to make the best with what we were given!
Field Name | Description |
activity_type | Type of physical interaction with the website (ie: “cancel_monthly”, “video_player”, “pdf_report_render”, etc.) |
birthday | Birth year of the dog |
breed | Name of breed |
breed_group | Group to which a dog’s breed belongs (Herding, Hound, Non- Sporting, Sporting, Terrier, Toy, Working) |
breed_type | Type of dog breed (pure breed, mixed breed, cross breed, popular hybrid) |
category_id | All data is NULL in this field of the database |
city | City of the user |
country | Country of the user |
created_at | Time stamp when the record in that row was created. PLEASE NOTE THAT THIS FIELD DOES NOT NEED TO BE CONSISTENT FOR A SPECIFIC USER_GUID OR DOG_GUID ACROSS TABLES. IT IS FOR THE RECORD OF INTEREST IN THE TABLE OF INTEREST. |
description | Description related to script_detail_id |
dimension | One of Dognition’s 9 personality profiles (Ace, Charmer, Einstein, Expert, Maverick, Protodog, Renaissance-dog, Socialite, Stargazer) |
dna_tested | Flag for whether a dog’s DNA has been tested (1=Yes/0=No) |
dog_fixed | Flag for whether a dog is neutered (also known as “fixed”) (1=Yes/0=No) |
dog_guid | Unique ID for a dog |
end_time | Timestamp when the user submitted their answer to a question in a Dognition test. |
exclude | Flag provided by Dognition indicating whether an entry should be excluded (reasons not documented in the current data set; (1=Exclude, 0=Don’t necessarily exclude) |
free_start_user | Flag indicating whether a user received a “free start” by getting the first 4 games for free (1=free start, 0=not free start) |
gender | Gender of the dog |
last_active_at | Time-stamp of user’s last activity in his/her Dognition account |
loop_number | Exam questions often have multiple “loops.” This signifies which loop the customer was on. Negative numbers represent “re-dos.” |
max_dogs | Number of dogs associated with a user |
mean iti (days) | Mean inter-test-interval (ITI) between each test a dog completed (in days) |
mean iti (minutes) | Mean inter-test-interval (ITI) between each test a dog completed (in minutes) |
median iti (days) | Median inter-test-interval (ITI) between each test a dog completed (in days) |
median iti (minutes) | Median inter-test-interval (ITI) between each test a dog completed (in minutes) |
membership_id | Unique ID |
membership_type | Type of subscription. 1=Dognition Assessment of initial 20 games, 2=Annual, 3=Monthly, 4=Free, 5=Subscription type of membership chosen by the user.
Annual subscriptions provide the Dognition Assessment plus 12 months of subscription service (one new test and activity are “unlocked” per month, beginning of the month after purchase). Monthly subscriptions provide the Dognition Assessment plus the subscription service billed monthly. Free subscriptions were either offered through a “free start” promotion or through the Dognition MOOC. These users have access to the first 4 games for free and can then upgrade to unlock the rest. The “Subscription” option (value of 5) is a recent test that offers the entire 20-game Dognition Assessment for free, but then allows user to upgrade to a monthly subscription |
rank_by_dogid | The chronological rank of each test a dog completed (e.g. 1=first test dog completed, 3=third test dog completed, etc.) |
rank_by_userid | The chronological rank of each test a user completed (e.g. 1=first test user completed, 3=third test user completed, etc.) |
rating | The answer to the question “How surprising were [your dog’s name]’s choices?” Users could choose any number between 1 (not surprising) to 9 (very surprising) |
script_detail_id | Number corresponding with the exact webpage a customer was on |
script_id | An ID representing the type of Dognition activity users are engaged in. Numbers represent personality questionnaire (first set of questions after registering a dog), Dognition Assessment versions, other surveys, and monthly games available after the initial Dognition Assessment is completed |
sign_in_count | Number of times a user signs in to his/her Dognition account |
start_time | Timestamp when user received a question in a Dognition test |
state | State of the user |
step_type | Key for whether the test item was a question or a stopwatch |
subcategory_name | Name of the cognitive sub-category a test belongs to (Communication, Cunning, Empathy, Expression Game, Impossible Task, Laterality, Memory, Numerosity, Perspective Game, Reasoning, Self Control Game, Shaker Game, Shell Game, Smell Game, Social Bias, Spatial Navigation). |
subscribed | Flag for whether a user has a paid subscription (1=Yes/0=No) |
test_name | Name of Dognition test |
time diff between first and last game (days) | Time difference between the first and last game completed by a dog (in days) |
time diff between first and last game (minutes) | Time difference between the first and last game completed by a dog (in minutes) |
total tests completed | Number of tests completed by the dog |
updated_at | Time stamp when the record in that row was created (will be the same as created_at if the record was never updated). PLEASE NOTE THAT THIS FIELD DOES NOT NEED TO BE CONSISTENT FOR A SPECIFIC USER_GUID OR DOG_GUID ACROSS TABLES. IT IS FOR THE RECORD OF INTEREST IN THE TABLE OF INTEREST. |
user_guid | Unique ID for a human user |
weight | Weight of the dog (lbs) |
zip | Zip code of the user |