Homework #1
Assignment Type: Individual (COLLABORATION RULE—for this first homework you may collaborate with your classmates but you must ultimately submit a copy of your own work in Canvas. Consulting the internet for ideas is obviously fine. However, you cannot, for instance, explicitly ask someone on stackoverflow, “What is the answer to this homework question?” Similarly, I wouldn’t recommend copying a homework question verbatim into a generative AI prompt (like ChatGPT-3.5). This requires virtually zero effort and would be a wasted opportunity to learn by doing (which is the only way you will become a SQL master!).
Due Date: 8:00 AM EST on 9/18/2023 (no late homework will be accepted)
Instructions:
--You will essentially copy your queries (and output, when requested) into this Word document and then submit the document in Canvas
--SUBMIT ALL QUERIES USING THE EXAMPLE FORMATTING BELOW…
Example Question:
Select “bmi” and “age” for 5 records. Order the output from highest to lowest “bmi.” Include your output.
Answer:
Query
SELECT bmi,
age
FROM health
ORDER BY bmi DESC
LIMIT 5;
Output
bmi age
6388.49 48
5858.59 54
4324.40 27
4320.96 48
3745.48 66
--Notes about the formatting:
--Use Courier New font (because it is fixed-width)
--Put each SQL clause on a new line
--Use all caps for all SQL clauses and keywords
--Write each field in the SELECT statement on a new line
--Use tabs to clearly separate SQL clauses from field names, table names, etc.
--**IMPORTANT: Do not copy any of the actual data from the sanford database for any reason other than answering the questions on this assignment. If I, the teaching assistants, or the IT department detect any unauthorized access or usage the maximum grade that you can attain in the course is a LP. Please ask if you are not sure if a specific use is authorized.
--The assignment will be graded for both completeness and correctness. For the “correctness” part, we will randomly select a few questions, run your code, and compare your output to the answer.
Hints:
--The table name (“health”) is case-sensitive for the sanford database but the field names are not
--In most cases there is no such thing as a single “right” answer. If two different queries generate the same desired output then both are acceptable.
--To copy output directly from the results window in MySQL Workbench, it is typically easiest to right-click and choose the “tab separated” option
Name:
1.) First, let’s perform some simple “sanity checks” on the data. (Note: it is always a good idea to perform sanity checks when working with a new dataset.) Count the total number of records in the lone table, health, and include the output.
Query
Output
2.) Next, let’s check to see if we can determine the number of unique records in health. Unfortunately COUNT(DISTINCT(*)) does not work (because when COUNT() and DISTINCT() are combined they only accept one column as input—but, we’ll find a workaround next week). Is there a specific field that looks like it might be unique? Count the unique values of id and display the result. Based on this result, are all records unique?
Query
Output
Written Answer
3.) Write a query to count the total number of patients for each sex. Sort your output on this count from lowest to highest value. Include your output.
Query
Output
4.) Next, for each sex, calculate the average incidence of hypertension, vascular disease, and diabetes (in a single query). Exclude the one patient with ‘Unknown’ sex. (Note: there is no need for any difficult calculation here—take advantage of the fact that hypertension, vascular disease, and diabetes are defined as “dummy” variables).
Query
Output
sex AVG(hypertension) AVG(vasc_disease) AVG(diabetes)
5.) Run the following command:
SHOW FULL COLUMNS
FROM health;
The output provides important information about the columns in our table, including field name, type, whether the field can be NULL, whether the field is a key, any default values, and any extra information. For additional information, see: https://mariadb.com/kb/en/mariadb/show-columns/. For now we will focus on the Type column in this output, specifically for the “age” field. The type of “age” is listed as “varchar(45)”—this means that age is stored as a string (i.e. text) with a maximum length of 45 characters. Hmm—that’s a little odd. Wouldn’t we typically expect age to be an integer? Run a query that returns all of the distinct values of age. Is there a specific age value that is not an integer?
Query
Written Answer
6.) Well, let’s see if the “potentially problematic” age that you found in the previous problem will in fact pose an issue. Take that “problematic” value and run the following query (make sure to put the value in single-quotes to identify it as a string AND be careful if you copy/paste this directly from Word—the formatting here is NOT actually using single quotes):
SELECT 10 + ‘<your problematic value>’;
Show your output. What?!?!? How did it just do that magic? Well, MariaDB utilized implicit type conversion—it implicitly converted our “problematic” string to an integer and added the integers together (while managing to ignore the “extra” mathematical symbol). HOWEVER, WE TYPICALLY WILL WANT TO AVOID USING IMPLICIT TYPE CONVERSION BECAUSE ITS RESULTS WILL NOT ALWAYS BE WHAT WE ANTICIPATE. Field types are very important in SQL/RDBMS and we will explore them in more detail next week.
Query
Output
7.) For now, take advantage of this implicit type conversion and calculate the estimated average age for patients grouped by sex and hypertension. Be sure to ROUND the average ages to two decimal places (and search the internet if you need some help!). Include the count of patients in each group as well. Only include patients that were alive when the data was collected. Further, exclude the patient with ‘Unknown’ sex. Finally, only include groups that have at least 10,000 patients with diabetes. Order your output by sex and then hypertension (both in ascending order). Are we necessarily underestimating or overestimating the average age of each group (because we don’t know the ages of anyone older than 90)?
Query
Output
Written Answer
8.) Write a simple query to count the total number of records in the table that have a NULL value for the “a1c” field. Include your output.
Query
Output
9.) Confirm your output from the previous question by writing a query that groups patients by a1c value. Order your output from highest group count to lowest. Include only 4 rows in your output.
Query
Output
10.) Run your query from the previous question but now for the “visits_sched” field, changing it to include 20 rows in your output. How many records have a NULL value?
Query
Output
Written Answer
11.) Try to confirm your output from the previous question by adapting and re-running your query from Question 8. Does your output from this query seem to confirm or contradict your output from the previous query? If there seems to be a contradiction, do a little “ad hoc” testing and explain what is happening here.
Query
Output
Written Answer
12.) A stakeholder wants to run an analysis involving three fields: bmi, visits_sched, and visits_miss. However, she wants to exclude all records that have a missing value for any of these three fields. How many records will be available for her analysis? (Be sure to test for missing values in all three fields in your query. And, note that a missing value doesn’t need to be coded the same way across these three fields. In fact, it looks to me like someone is coding a “missing” value differently in each of these fields. How dare they!)
Query
Output
13.) A different stakeholder is attempting to identify patients who are missing a large proportion of their scheduled appointments (and therefore costing the health system a lot of money). Write a query to count the total number of patients who have had at least 10 scheduled appointments and missed at least 50% of those appointments. The stakeholder wants the counts grouped by payor and ranked by the count in descending order. Only include patients who are currently alive. (Hint: you may take advantage of implicit type conversion here. However, before running the query to generate your count, it might be reassuring to run some ad hoc queries to check if your calculation is operating as you had hoped.)
Query
Output
14.) A statistically-minded colleague is intrigued by the results from your previous query and wants to do some further investigation. First, they want you to determine, for each payor group, the average missed appointment rate across all patients in the group (where each patient’s individual missed appointment rate is equal to the number of missed appointments divided by the number of scheduled appointments). Round your group average missed appointment rates to three decimal places. Order your result by average missed appointment rate in descending order.
Query
Output
15.) **You will not be able to solve this problem until after Class 3** (This problem is a little tricky. I will say that nothing is preventing you from putting CASE statements inside aggregate functions…) “Very interesting,” they say! “Since I am very statistically-minded, I would be interested if you can take this one step further and test for a statistically significant difference, please!” In particular, calculate a Welch’s t-test statistic for the difference in average missed appointment rates between Medicaid patients and non-Medicaid patients. Just to be clear, you only need to calculate the test statistic here. No need to find a critical value since your colleague is SOOO statistically-minded. :) You can use the following equation for the test statistic:
Welch_t = [mean for Medicaid – mean for non-Medicaid]
/
sqrt[
((variance for Medicaid)/count of Medicaid patients) +
((variance for non-Medicaid)/count of non-Medicaid patients)]
Be sure to use the “sample” flavor of the variance aggregate function. Round your test statistic to two decimal places. Rename your test statistic “Welch_t”. Just to be super clear here—you should be writing a single query that returns a single value as its result set!
Query
Output
16.) **You will not be able to solve this problem until after Class 3** Let’s categorize patients as either “high_risk” or “low_risk” for severe illness from COVID-19. We will define a patient as “high_risk” if they are at least 65 years old OR they have hypertension OR they have vascular disease OR they have diabetes OR they have a BMI value of at least 30. Otherwise, a patient is “low_risk.” Call this categorization variable “risk_group.” Write a query that counts the number of patients in each risk category and also counts the number people who are “current every day smokers” for each risk category (and call this second aggregate “current_smoke”). Show “low_risk” patients first in your output.
Query
Output
17.) **It might make sense to wait on this until after Class 3** Now it’s time to “COME ON DOWN!” and answer some questions about the Price is Right! Recall from Class 3 that we are referring to our pir database. Also recall that you can assume that on each date (i.e. each episode/show), no two contestants have the same name. So, the combination of date and name can be used to identify a unique contestant. If you need a refresher on what all of the columns mean, try running SHOW FULL COLUMNS FROM five_ep; to view the comments that I included!
Let’s try to find the best “potentially winning bid” on “Bidders Row” that didn’t actually win. Remember that for a bid on Bidders Row to win, it must be LESS than the price of the prize. Specifically, write a query that returns the difference in dollars between the actual retail price of the Bidders Row prize and the best “potentially winning bid” that did not actually win. Put differently, you are essentially looking to identify the best “potentially winning second place bid” and return the number of dollars that this bid was away from the price of the prize. So, your result set should be a single dollar value!
Query
Output
18.) **It might make sense to wait on this until after Class 3** Next, let’s try to find the “Bidders Row” contestant who was the most unlucky. We will define the “most unlucky” contestant as the one who made the most accurate average bid on “Bidders Row.” Here we will NOT worry about whether the bid is above or below the price of the prize. So, for instance, if the actual retail price of the prize was $400 then bids of $300 and $500 would be considered equivalent here (both are off by $100). Write a query that returns the date, contestant name, number of Bidders Row bids, and the average amount that the contestant’s bids were off from the price of the prize FOR THE contestant who made the best bids (on average) BUT NEVER MADE IT TO A PRICING GAME. (Again, the difference between each bid and the price of prize should always be a positive value!) Your result set should have a single row!
Query
Output
19.) **It might make sense to wait on this until after Class 3** Finally, find all instances where the first three “Bidders Row” bids are GREATER than the prize amount (so that the fourth bidder wins automatically as long as they bid less than the prize amount). Return the date, value of “eventTypeCounter”, and the number of dollars between the prize amount and the lowest of the three first bids. Your result set should have one row for each Bidders Row instance satisfying the criteria. (As an aside, there are actual instances in the show when all four Bidders Row bidders make bids that are greater than the actual retail price of the prize. In these instances, all bids are cleared and the bidding is reset. When this happened in the shows recorded for the database I simply ignored the first round of failed bidding.)
Query
Output