Homepage
Homework
FUQUA - Data Infrastructure - HW1: SQL

FUQUA - Data Infrastructure - HW1: SQL

Engage in a Conversation

Assignment Type: Individual (COLLABORATION RULE—for this first homework you may collaborate with your classmates but you must ultimately submit a copy of your own work in Canvas. Consulting the internet for ideas is obviously fine. However, you cannot, for instance, explicitly ask someone on stackoverflow, “What is the answer to this homework question?” Similarly, I wouldn’t recommend copying a homework question verbatim into a generative AI prompt (like ChatGPT-3.5). This requires virtually zero effort and would be a wasted opportunity to learn by doing (which is the only way you will become a SQL master!). CourseNana.COM

Due Date: 8:00 AM EST on 9/18/2023 (no late homework will be accepted) CourseNana.COM

Instructions: CourseNana.COM

--You will essentially copy your queries (and output, when requested) into this Word document and then submit the document in Canvas CourseNana.COM

--SUBMIT ALL QUERIES USING THE EXAMPLE FORMATTING BELOW… CourseNana.COM

CourseNana.COM

Example Question: CourseNana.COM

Select “bmi” and “age” for 5 records. Order the output from highest to lowest “bmi.” Include your output. CourseNana.COM

CourseNana.COM

Answer: CourseNana.COM

CourseNana.COM

Query CourseNana.COM

CourseNana.COM

SELECT bmi, CourseNana.COM

age CourseNana.COM

FROM health CourseNana.COM

ORDER BY bmi DESC CourseNana.COM

LIMIT 5; CourseNana.COM

CourseNana.COM

Output CourseNana.COM

CourseNana.COM

bmi age CourseNana.COM

6388.49 48 CourseNana.COM

5858.59 54 CourseNana.COM

4324.40 27 CourseNana.COM

4320.96 48 CourseNana.COM

3745.48 66 CourseNana.COM

CourseNana.COM

--Notes about the formatting: CourseNana.COM

--Use Courier New font (because it is fixed-width) CourseNana.COM

--Put each SQL clause on a new line CourseNana.COM

--Use all caps for all SQL clauses and keywords CourseNana.COM

--Write each field in the SELECT statement on a new line CourseNana.COM

--Use tabs to clearly separate SQL clauses from field names, table names, etc. CourseNana.COM

--**IMPORTANT: Do not copy any of the actual data from the sanford database for any reason other than answering the questions on this assignment. If I, the teaching assistants, or the IT department detect any unauthorized access or usage the maximum grade that you can attain in the course is a LP. Please ask if you are not sure if a specific use is authorized. CourseNana.COM

--The assignment will be graded for both completeness and correctness. For the “correctness” part, we will randomly select a few questions, run your code, and compare your output to the answer. CourseNana.COM

Hints: CourseNana.COM

--The table name (“health”) is case-sensitive for the sanford database but the field names are not CourseNana.COM

--In most cases there is no such thing as a single “right” answer. If two different queries generate the same desired output then both are acceptable. CourseNana.COM

--To copy output directly from the results window in MySQL Workbench, it is typically easiest to right-click and choose the “tab separated” option CourseNana.COM

CourseNana.COM

Name: CourseNana.COM

CourseNana.COM

1.) First, let’s perform some simple “sanity checks” on the data. (Note: it is always a good idea to perform sanity checks when working with a new dataset.) Count the total number of records in the lone table, health, and include the output. CourseNana.COM

CourseNana.COM

Query CourseNana.COM

CourseNana.COM

Output CourseNana.COM

CourseNana.COM

2.) Next, let’s check to see if we can determine the number of unique records in health. Unfortunately COUNT(DISTINCT(*)) does not work (because when COUNT() and DISTINCT() are combined they only accept one column as input—but, we’ll find a workaround next week). Is there a specific field that looks like it might be unique? Count the unique values of id and display the result. Based on this result, are all records unique? CourseNana.COM

CourseNana.COM

Query CourseNana.COM

CourseNana.COM

Output CourseNana.COM

CourseNana.COM

Written Answer CourseNana.COM

CourseNana.COM

3.) Write a query to count the total number of patients for each sex. Sort your output on this count from lowest to highest value. Include your output. CourseNana.COM

CourseNana.COM

Query CourseNana.COM

CourseNana.COM

Output CourseNana.COM

CourseNana.COM

4.) Next, for each sex, calculate the average incidence of hypertension, vascular disease, and diabetes (in a single query). Exclude the one patient with ‘Unknown’ sex. (Note: there is no need for any difficult calculation here—take advantage of the fact that hypertension, vascular disease, and diabetes are defined as “dummy” variables). CourseNana.COM

CourseNana.COM

Query CourseNana.COM

CourseNana.COM

Output CourseNana.COM

CourseNana.COM

sex AVG(hypertension) AVG(vasc_disease) AVG(diabetes) CourseNana.COM

CourseNana.COM

5.) Run the following command: CourseNana.COM

CourseNana.COM

SHOW FULL COLUMNS CourseNana.COM

FROM health; CourseNana.COM

CourseNana.COM

The output provides important information about the columns in our table, including field name, type, whether the field can be NULL, whether the field is a key, any default values, and any extra information. For additional information, see: https://mariadb.com/kb/en/mariadb/show-columns/. For now we will focus on the Type column in this output, specifically for the “age” field. The type of “age” is listed as “varchar(45)”—this means that age is stored as a string (i.e. text) with a maximum length of 45 characters. Hmm—that’s a little odd. Wouldn’t we typically expect age to be an integer? Run a query that returns all of the distinct values of age. Is there a specific age value that is not an integer? CourseNana.COM

CourseNana.COM

Query CourseNana.COM

CourseNana.COM

Written Answer CourseNana.COM

CourseNana.COM

6.) Well, let’s see if the “potentially problematic” age that you found in the previous problem will in fact pose an issue. Take that “problematic” value and run the following query (make sure to put the value in single-quotes to identify it as a string AND be careful if you copy/paste this directly from Word—the formatting here is NOT actually using single quotes): CourseNana.COM

CourseNana.COM

SELECT 10 + ‘<your problematic value>’; CourseNana.COM

CourseNana.COM

Show your output. What?!?!? How did it just do that magic? Well, MariaDB utilized implicit type conversion—it implicitly converted our “problematic” string to an integer and added the integers together (while managing to ignore the “extra” mathematical symbol). HOWEVER, WE TYPICALLY WILL WANT TO AVOID USING IMPLICIT TYPE CONVERSION BECAUSE ITS RESULTS WILL NOT ALWAYS BE WHAT WE ANTICIPATE. Field types are very important in SQL/RDBMS and we will explore them in more detail next week. CourseNana.COM

CourseNana.COM

Query CourseNana.COM

CourseNana.COM

Output CourseNana.COM

CourseNana.COM

7.) For now, take advantage of this implicit type conversion and calculate the estimated average age for patients grouped by sex and hypertension. Be sure to ROUND the average ages to two decimal places (and search the internet if you need some help!). Include the count of patients in each group as well. Only include patients that were alive when the data was collected. Further, exclude the patient with ‘Unknown’ sex. Finally, only include groups that have at least 10,000 patients with diabetes. Order your output by sex and then hypertension (both in ascending order). Are we necessarily underestimating or overestimating the average age of each group (because we don’t know the ages of anyone older than 90)? CourseNana.COM

CourseNana.COM

Query CourseNana.COM

CourseNana.COM

Output CourseNana.COM

CourseNana.COM

Written Answer CourseNana.COM

CourseNana.COM

8.) Write a simple query to count the total number of records in the table that have a NULL value for the “a1c” field. Include your output. CourseNana.COM

CourseNana.COM

Query CourseNana.COM

CourseNana.COM

Output CourseNana.COM

CourseNana.COM

9.) Confirm your output from the previous question by writing a query that groups patients by a1c value. Order your output from highest group count to lowest. Include only 4 rows in your output. CourseNana.COM

CourseNana.COM

Query CourseNana.COM

CourseNana.COM

Output CourseNana.COM

CourseNana.COM

10.) Run your query from the previous question but now for the “visits_sched” field, changing it to include 20 rows in your output. How many records have a NULL value? CourseNana.COM

CourseNana.COM

Query CourseNana.COM

CourseNana.COM

Output CourseNana.COM

CourseNana.COM

Written Answer CourseNana.COM

CourseNana.COM

11.) Try to confirm your output from the previous question by adapting and re-running your query from Question 8. Does your output from this query seem to confirm or contradict your output from the previous query? If there seems to be a contradiction, do a little “ad hoc” testing and explain what is happening here. CourseNana.COM

CourseNana.COM

Query CourseNana.COM

CourseNana.COM

Output CourseNana.COM

CourseNana.COM

Written Answer CourseNana.COM

CourseNana.COM

12.) A stakeholder wants to run an analysis involving three fields: bmi, visits_sched, and visits_miss. However, she wants to exclude all records that have a missing value for any of these three fields. How many records will be available for her analysis? (Be sure to test for missing values in all three fields in your query. And, note that a missing value doesn’t need to be coded the same way across these three fields. In fact, it looks to me like someone is coding a “missing” value differently in each of these fields. How dare they!) CourseNana.COM

CourseNana.COM

Query CourseNana.COM

CourseNana.COM

Output CourseNana.COM

CourseNana.COM

13.) A different stakeholder is attempting to identify patients who are missing a large proportion of their scheduled appointments (and therefore costing the health system a lot of money). Write a query to count the total number of patients who have had at least 10 scheduled appointments and missed at least 50% of those appointments. The stakeholder wants the counts grouped by payor and ranked by the count in descending order. Only include patients who are currently alive. (Hint: you may take advantage of implicit type conversion here. However, before running the query to generate your count, it might be reassuring to run some ad hoc queries to check if your calculation is operating as you had hoped.) CourseNana.COM

CourseNana.COM

Query CourseNana.COM

CourseNana.COM

Output CourseNana.COM

CourseNana.COM

14.) A statistically-minded colleague is intrigued by the results from your previous query and wants to do some further investigation. First, they want you to determine, for each payor group, the average missed appointment rate across all patients in the group (where each patient’s individual missed appointment rate is equal to the number of missed appointments divided by the number of scheduled appointments). Round your group average missed appointment rates to three decimal places. Order your result by average missed appointment rate in descending order. CourseNana.COM

CourseNana.COM

Query CourseNana.COM

CourseNana.COM

Output CourseNana.COM

CourseNana.COM

15.) **You will not be able to solve this problem until after Class 3** (This problem is a little tricky. I will say that nothing is preventing you from putting CASE statements inside aggregate functions…) “Very interesting,” they say! “Since I am very statistically-minded, I would be interested if you can take this one step further and test for a statistically significant difference, please!” In particular, calculate a Welch’s t-test statistic for the difference in average missed appointment rates between Medicaid patients and non-Medicaid patients. Just to be clear, you only need to calculate the test statistic here. No need to find a critical value since your colleague is SOOO statistically-minded. :) You can use the following equation for the test statistic: CourseNana.COM

CourseNana.COM

Welch_t = [mean for Medicaid – mean for non-Medicaid] CourseNana.COM

/ CourseNana.COM

sqrt[ CourseNana.COM

((variance for Medicaid)/count of Medicaid patients) + CourseNana.COM

((variance for non-Medicaid)/count of non-Medicaid patients)] CourseNana.COM

CourseNana.COM

Be sure to use the “sample” flavor of the variance aggregate function. Round your test statistic to two decimal places. Rename your test statistic “Welch_t”. Just to be super clear here—you should be writing a single query that returns a single value as its result set! CourseNana.COM

CourseNana.COM

Query CourseNana.COM

CourseNana.COM

Output CourseNana.COM

CourseNana.COM

16.) **You will not be able to solve this problem until after Class 3** Let’s categorize patients as either “high_risk” or “low_risk” for severe illness from COVID-19. We will define a patient as “high_risk” if they are at least 65 years old OR they have hypertension OR they have vascular disease OR they have diabetes OR they have a BMI value of at least 30. Otherwise, a patient is “low_risk.” Call this categorization variable “risk_group.” Write a query that counts the number of patients in each risk category and also counts the number people who are “current every day smokers” for each risk category (and call this second aggregate “current_smoke”). Show “low_risk” patients first in your output. CourseNana.COM

CourseNana.COM

Query CourseNana.COM

CourseNana.COM

Output CourseNana.COM

CourseNana.COM

17.) **It might make sense to wait on this until after Class 3** Now it’s time to “COME ON DOWN!” and answer some questions about the Price is Right! Recall from Class 3 that we are referring to our pir database. Also recall that you can assume that on each date (i.e. each episode/show), no two contestants have the same name. So, the combination of date and name can be used to identify a unique contestant. If you need a refresher on what all of the columns mean, try running SHOW FULL COLUMNS FROM five_ep; to view the comments that I included! CourseNana.COM

CourseNana.COM

Let’s try to find the best “potentially winning bid” on “Bidders Row” that didn’t actually win. Remember that for a bid on Bidders Row to win, it must be LESS than the price of the prize. Specifically, write a query that returns the difference in dollars between the actual retail price of the Bidders Row prize and the best “potentially winning bid” that did not actually win. Put differently, you are essentially looking to identify the best “potentially winning second place bid” and return the number of dollars that this bid was away from the price of the prize. So, your result set should be a single dollar value! CourseNana.COM

CourseNana.COM

Query CourseNana.COM

CourseNana.COM

Output CourseNana.COM

CourseNana.COM

18.) **It might make sense to wait on this until after Class 3** Next, let’s try to find the “Bidders Row” contestant who was the most unlucky. We will define the “most unlucky” contestant as the one who made the most accurate average bid on “Bidders Row.” Here we will NOT worry about whether the bid is above or below the price of the prize. So, for instance, if the actual retail price of the prize was $400 then bids of $300 and $500 would be considered equivalent here (both are off by $100). Write a query that returns the date, contestant name, number of Bidders Row bids, and the average amount that the contestant’s bids were off from the price of the prize FOR THE contestant who made the best bids (on average) BUT NEVER MADE IT TO A PRICING GAME. (Again, the difference between each bid and the price of prize should always be a positive value!) Your result set should have a single row! CourseNana.COM

CourseNana.COM

Query CourseNana.COM

CourseNana.COM

Output CourseNana.COM

CourseNana.COM

19.) **It might make sense to wait on this until after Class 3** Finally, find all instances where the first three “Bidders Row” bids are GREATER than the prize amount (so that the fourth bidder wins automatically as long as they bid less than the prize amount). Return the date, value of “eventTypeCounter”, and the number of dollars between the prize amount and the lowest of the three first bids. Your result set should have one row for each Bidders Row instance satisfying the criteria. (As an aside, there are actual instances in the show when all four Bidders Row bidders make bids that are greater than the actual retail price of the prize. In these instances, all bids are cleared and the bidding is reset. When this happened in the shows recorded for the database I simply ignored the first round of failed bidding.) CourseNana.COM

CourseNana.COM

Query CourseNana.COM

CourseNana.COM

Output CourseNana.COM

CourseNana.COM

Get in Touch with Our Experts

WeChat (微信)

Last: Crypto-Systems & Data Protection (ES94N-15, 3rd attempt): Cryptosystem Report

Next: BTM 382 Database Management - Assignment 1: Database Integrity

Duke代写,Fuqua代写,Data Infrastructure代写,SQL代写,Duke代编,Fuqua代编,Data Infrastructure代编,SQL代编,Duke代考,Fuqua代考,Data Infrastructure代考,SQL代考,Dukehelp,Fuquahelp,Data Infrastructurehelp,SQLhelp,Duke作业代写,Fuqua作业代写,Data Infrastructure作业代写,SQL作业代写,Duke编程代写,Fuqua编程代写,Data Infrastructure编程代写,SQL编程代写,Dukeprogramming help,Fuquaprogramming help,Data Infrastructureprogramming help,SQLprogramming help,Dukeassignment help,Fuquaassignment help,Data Infrastructureassignment help,SQLassignment help,Dukesolution,Fuquasolution,Data Infrastructuresolution,SQLsolution,