Python, R and data structures – Group Coursework 3
Date set: 29/01/23
Date to be submitted: 20/02/23
Introduction
Again, apologies for not getting this up before Christmas due to illness but hopefully the three weeks at the start of term make it easier to carry out the work than having to find some time during revision.
This coursework is slightly different from the ones you have done in Python, as we will assume the user is happy to type in functions. The main goal instead is to make it easier for them to achieve what they require by creating user-defined functions, so that they do not have to use R syntax directly.
I am also assuming that your functions in part iii) will interact with the data frame stored on the hard drive/ssd/cloud as this will make the form of the functions easier for the user to type in (as they will not need to pass over a data frame and receive back a data frame).
There are lots of mini tasks in this coursework, so although there are more parts, some of the solutions need only a few lines, and there is less need to think about various data structures than in the Python coursework.
Question
A company runs a death benefit scheme for its employees. The scheme is free to join, but to be part of it the employee must take a brief medical check-up where their height (in cm), weight (in kg) and smoking status are taken. While they are part of the scheme, they can always take part in further medical checks, but this is not compulsory. However, if a check-up does occur, the new weight and smoking status are recorded and, where necessary, the data is updated.
The scheme only allows employees to join on 1st January in any year. When they join, their current age on the 1st January is stored in this particular benefits system with the year they joined the scheme.
i) Data verification – using data held in dataraw.csv
While the file on Moodle (dataraw.csv) is small, it is assumed to be only a sample. The columns included in the file are the employee number, the surname of the employee, the year they joined the scheme (which is 2023 for all the employees in this file), the age of the employee when they joined the scheme, the height (in cm), the weight (in kg) and the smoker status of the employee.
Your first task is to write code that verifies that the data meets the following criteria: the employee number is unique, the calculated BMI is between 15 and 45, the age is between 21 and 55, and the smoker status is a valid status.
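A minimal sketch of how these four checks could be collected into one error report is shown here. The column names (emp_no, age, height, weight, smoker), the function name verify_members and the rows in the demo data frame are all made up for illustration, and apart from "Never Smoked" the status values are placeholders. BMI is calculated as weight in kg divided by the square of height in metres.

```r
# ASSUMPTION: placeholder list of valid statuses; replace with the real list.
valid_status <- c("Never Smoked", "Ex-Smoker", "Smoker")

# Returns one row per problem found: row number, employee number, problem.
verify_members <- function(df, valid_status) {
  bmi <- df$weight / (df$height / 100)^2          # height is held in cm
  problems <- list(
    "Duplicate employee number" = df$emp_no %in% df$emp_no[duplicated(df$emp_no)],
    "BMI outside 15-45"         = is.na(bmi) | bmi < 15 | bmi > 45,
    "Age outside 21-55"         = is.na(df$age) | df$age < 21 | df$age > 55,
    "Invalid smoker status"     = !(df$smoker %in% valid_status)
  )
  report <- data.frame(row = integer(0), emp_no = character(0),
                       problem = character(0), stringsAsFactors = FALSE)
  for (p in names(problems)) {
    rows <- which(problems[[p]])
    if (length(rows) > 0) {
      report <- rbind(report,
                      data.frame(row = rows, emp_no = df$emp_no[rows],
                                 problem = p, stringsAsFactors = FALSE))
    }
  }
  report
}

# In real use: raw <- read.csv("dataraw.csv", stringsAsFactors = FALSE)
# Made-up demo rows (not taken from dataraw.csv):
raw <- data.frame(
  emp_no      = c("E1", "E2", "E2", "E4"),
  surname     = c("Able", "Baker", "Clark", "Dunn"),
  year_joined = 2023,
  age         = c(30, 19, 40, 50),
  height      = c(170, 180, 165, 175),
  weight      = c(70, 80, 60, 200),
  smoker      = c("Never Smoked", "Smoker", "Sometimes", "Smoker"),
  stringsAsFactors = FALSE
)
report <- verify_members(raw, valid_status)
print(report)
```

Listing every problem separately, rather than stopping at the first one, means a row with two errors appears twice in the report, which makes it easier to feed all the queries back in one go.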
ii) Post data verification – using the data held in dataclean.csv to build and save the initial data frame
It is assumed that you have fed back the queries you found in part i), the main data people have fixed the problems and have now supplied you with a cleaned-up data file (dataclean.csv). With this file you need to build your initial working data frame. To do this you need to add:
a) A column that contains the current BMI of the employee. Having BMI as only a numeric value is not as useful as also holding the category that it defines, so a column holding the BMI category should be added as well. One set of definitions currently in use is:
· Less than 18.5 – Underweight
· Between 18.5 and 25 (though not including 25) – Healthy weight
· Between 25 and 30 (though not including 30) – Overweight
· Between 30 and 40 (though not including 40) – Obese
· 40 and over – Severely obese
b) To be able to keep track of the employees who are members of the scheme (see part iii below), the data frame needs the following five columns to be added:
· Current Age
· Age at withdrawal
· Year of withdrawal
· Age at death
· Year of death
For the first of these columns, the current age will be the age of the employee when they joined the scheme. For the other four columns, as all the employees at the moment are current employees in the scheme, these columns should contain only the values NA.
Once the data frame has been completed it should be saved as a csv file. Again, the user is expected to change the file name in the code, but make it clear where this is. It is up to them to make sure they do not write over an existing file i.e. this is not your worry and doesn’t need to be tested.
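As a rough sketch of part ii), the BMI category can be derived with cut() using left-closed intervals, so that a BMI of exactly 25 falls into the overweight band. The column names, the function names bmi_category and build_members, and the demo rows are illustrative assumptions. The 30-to-40 band labelled "Obese" is also an assumption, used to bridge the overweight and severely obese bands.

```r
# Map numeric BMI to a category; right = FALSE gives [lower, upper) intervals.
bmi_category <- function(bmi) {
  cut(bmi,
      breaks = c(-Inf, 18.5, 25, 30, 40, Inf),
      labels = c("Underweight", "Healthy weight", "Overweight",
                 "Obese",            # ASSUMPTION: label for the 30-40 band
                 "Severely obese"),
      right = FALSE)
}

# Add the BMI columns and the tracking columns to the cleaned data.
build_members <- function(clean) {
  clean$bmi             <- clean$weight / (clean$height / 100)^2
  clean$bmi_category    <- as.character(bmi_category(clean$bmi))
  clean$current_age     <- clean$age       # age on joining, for now
  clean$age_withdrawal  <- NA_integer_     # nobody has withdrawn or died yet
  clean$year_withdrawal <- NA_integer_
  clean$age_death       <- NA_integer_
  clean$year_death      <- NA_integer_
  clean
}

# In real use: clean <- read.csv("dataclean.csv", stringsAsFactors = FALSE)
# Made-up demo rows (not taken from dataclean.csv):
clean <- data.frame(
  emp_no = c("E1", "E2"), surname = c("Able", "Baker"),
  year_joined = 2023, age = c(30, 45),
  height = c(170, 180), weight = c(70, 100),
  smoker = c("Never Smoked", "Smoker"),
  stringsAsFactors = FALSE
)

# >>> CHANGE THE OUTPUT FILE NAME HERE <<<
out_file <- file.path(tempdir(), "members.csv")

members <- build_members(clean)
write.csv(members, out_file, row.names = FALSE)
```

The category is taken from the unrounded BMI, so a value such as 24.96 stays in the healthy weight band even if it would be displayed as 25.0.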
iii) Maintenance of the file/data
The company will want to maintain and analyse the data as time progresses. To help with this, they want you to write some code that allows them to use user-defined functions to carry out the following tasks, rather than having to use built-in R functions and the standard syntax. (Remember to give your functions sensible names!)
a) A function that will show the current state of the data frame.
b) A function that will show the employee numbers and names of the employees that are currently alive in the data frame.
c) A function that is run at the start of the new year that ages the employees who are still alive by adding one to their current age.
d) After the start of the new year function (c)) has run, but before any of the following functions have run, the user can add new members. The function will require the data – employee number, employee surname, age they joined, height and weight. Before adding the entered values to the data frame
e) A function that allows the user to record the deaths of any employees. This function will be able to take one or more employee numbers. If any of the employee numbers do not exist, then the user should be told;
f) A function that allows the user to record the withdrawals of any employees. This function will be able to take one or more employee numbers. If any of the employee numbers do not exist, then the user should be told; if the employee number relates to an employee who is already recorded as dead or withdrawn then the user should be told.
g) A function that allows the weight of an employee to be updated. This will also update the BMI and make the standard check that it is between 15 and 45 and update the BMI category.
h) A function that allows the smoking status of an employee to be updated. This function will only update the record to one of the recognised statuses listed above and you cannot move to Never Smoked.
i) A function that summarises the ages of the population by giving the mean, median and standard deviation of the age of current policyholders, age at death of deceased policyholders and age at withdrawal of withdrawn policyholders. This is either for all policyholders or the user can ask for it to be broken down by smoking status or BMI grouping (they are only allowed to pick one of these at any time).
j) A function that saves data frames that are subsets of the main data frame. These data frames will be split by either smoker status or BMI grouping.
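One possible shape for tasks b), c) and e) is sketched here: each function reads the saved data frame, changes it, and writes it back, so the user never handles the data frame directly. All function and column names are hypothetical. The year argument and the refusal to record a death for someone already dead or withdrawn are assumptions (the second mirrors the rule given for withdrawals). Employee numbers are passed as quoted strings in this sketch.

```r
# >>> CHANGE THIS to the location of the saved data frame <<<
data_file <- file.path(tempdir(), "members_demo.csv")

load_members <- function() read.csv(data_file, stringsAsFactors = FALSE)
save_members <- function(df) write.csv(df, data_file, row.names = FALSE)

# b) Employee numbers and names of members who are currently alive.
show_alive <- function() {
  df <- load_members()
  df[is.na(df$age_death), c("emp_no", "surname")]
}

# c) Start of new year: add one to the current age of everyone still alive.
new_year <- function() {
  df <- load_members()
  alive <- is.na(df$age_death)
  df$current_age[alive] <- df$current_age[alive] + 1
  save_members(df)
  invisible(df)
}

# e) Record one or more deaths, telling the user about any bad numbers.
record_deaths <- function(..., year = as.integer(format(Sys.Date(), "%Y"))) {
  ids <- as.character(c(...))
  df <- load_members()
  for (id in ids) {
    i <- match(id, df$emp_no)
    if (is.na(i)) {
      message("Employee ", id, " does not exist; no change made.")
    } else if (!is.na(df$age_death[i]) || !is.na(df$age_withdrawal[i])) {
      message("Employee ", id, " is already recorded as dead or withdrawn.")
    } else {
      df$age_death[i]  <- df$current_age[i]
      df$year_death[i] <- year
    }
  }
  save_members(df)
  invisible(df)
}

# Made-up demo file so the sketch can be run on its own.
save_members(data.frame(
  emp_no = c("E1", "E2"), surname = c("Able", "Baker"),
  current_age = c(30, 45), age_withdrawal = NA, year_withdrawal = NA,
  age_death = NA, year_death = NA, stringsAsFactors = FALSE
))
new_year()
record_deaths("E1", "E9", year = 2024)   # E9 does not exist, so a message is shown
print(show_alive())
```

Using message() rather than stop() lets the valid employee numbers in a call still be processed when one of the others is wrong.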
Rules
For all three parts, I will load your data into the standard R console and run all the code for that part i.e. I will run part i) on its own, then load in part ii) and run that, etc.
For part i), all I expect to need to do is to change the file location in the code before running it and I expect I will get a summary of the data that looks to be in error from the file with the errors clearly laid out.
For part ii), all I expect to need to do is to change the file location for the files to be read in and written out, but the code should just run and create and save the data frame.
For part iii) I again expect to change the file names. Note that as there are a few output files in task j) above, you may want to have a directory path (e.g. “U:\\data\\”) and a file name separately, so all I need to do is change the directory path if I like the file names you have chosen. When I run the code, I do not expect much to happen, as it should nearly all be driven by functions that I will need to call with the correct parameters.
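To illustrate keeping the directory path separate from the file names, here is a sketch of task j) in which only one line needs editing. The names data_dir and save_subsets, the file name pattern and the demo rows are assumptions made for the example.

```r
# >>> CHANGE ONLY THIS LINE to your own directory, e.g. "U:/data" <<<
data_dir <- tempdir()

# Split the data frame by smoker status or BMI category and save each subset.
save_subsets <- function(df, by = c("smoker", "bmi_category"), dir = data_dir) {
  by <- match.arg(by)                      # only one of the two is allowed
  groups <- split(df, df[[by]])
  paths <- character(0)
  for (g in names(groups)) {
    # Build a file name from the group label, replacing spaces and symbols.
    file_name <- paste0("members_", by, "_",
                        gsub("[^A-Za-z0-9]+", "_", g), ".csv")
    path <- file.path(dir, file_name)
    write.csv(groups[[g]], path, row.names = FALSE)
    paths <- c(paths, path)
  }
  invisible(paths)
}

# Made-up demo rows:
demo <- data.frame(
  emp_no = c("E1", "E2", "E3"),
  smoker = c("Never Smoked", "Smoker", "Smoker"),
  bmi_category = c("Healthy weight", "Obese", "Overweight"),
  stringsAsFactors = FALSE
)
paths <- save_subsets(demo, "smoker")
print(basename(paths))
```

file.path() inserts the separator itself, so the directory is best written without a trailing slash or backslash.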
As noted above, I am assuming that your functions in part iii) will interact with the saved data frame so that the user can type in their functions as simply as possible, i.e. for the mortality function you may have something like:
deaths(E110234, E110237)
Rather than:
member.records<-deaths(member.records, E110234, E110237)
Submissions
As noted above, there will be four files to submit – R code for parts i), ii) and iii) and a user guide for part iii) functions.
While R does not crash as much as Python, you should still be checking the data that the user types in and informing them why it is incorrect e.g. wrong type of data or the employee number doesn’t exist, rather than just getting some R errors or nothing happening.
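The kind of input checking described here might look like the following sketch for the weight update in task g). The function name update_weight, the column names, the wording of the messages and the demo row are assumptions, and the "Obese" label for the 30-to-40 band is likewise assumed. Each check gives the user a plain message and leaves the file untouched, rather than letting R raise its own error.

```r
# >>> CHANGE THIS to the location of the saved data frame <<<
data_file <- file.path(tempdir(), "members_weight_demo.csv")

update_weight <- function(emp_no, new_weight) {
  df <- read.csv(data_file, stringsAsFactors = FALSE)
  if (length(emp_no) != 1 || length(new_weight) != 1) {
    message("Please give exactly one employee number and one weight.")
    return(invisible(FALSE))
  }
  if (!is.numeric(new_weight) || is.na(new_weight) || new_weight <= 0) {
    message("Weight must be a positive number in kg, e.g. 72.5.")
    return(invisible(FALSE))
  }
  i <- match(as.character(emp_no), df$emp_no)
  if (is.na(i)) {
    message("Employee ", emp_no, " does not exist.")
    return(invisible(FALSE))
  }
  bmi <- new_weight / (df$height[i] / 100)^2
  if (bmi < 15 || bmi > 45) {
    message("BMI would be ", round(bmi, 1),
            ", outside 15 to 45; record not updated.")
    return(invisible(FALSE))
  }
  df$weight[i] <- new_weight
  df$bmi[i] <- bmi
  df$bmi_category[i] <- as.character(cut(
    bmi, breaks = c(-Inf, 18.5, 25, 30, 40, Inf), right = FALSE,
    labels = c("Underweight", "Healthy weight", "Overweight",
               "Obese", "Severely obese")))   # "Obese" label is assumed
  write.csv(df, data_file, row.names = FALSE)
  invisible(TRUE)
}

# Made-up demo file so the sketch can be run on its own.
write.csv(data.frame(emp_no = "E1", height = 170, weight = 70,
                     bmi = 70 / 1.7^2, bmi_category = "Healthy weight",
                     stringsAsFactors = FALSE),
          data_file, row.names = FALSE)

update_weight("E1", "heavy")   # wrong type: message, nothing changes
update_weight("E9", 70)        # unknown employee: message, nothing changes
update_weight("E1", 75)        # valid: weight, BMI and category are updated
```

Returning TRUE or FALSE invisibly keeps the console output clean while still letting other code test whether the update happened.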