EBIS 2023 Business Analytics -
Individual Assignment 1 (Total 100 Marks, 3 pages)
This assignment is designed to walk you through data manipulation and basic visualization techniques for given research questions. You will also be asked to answer several questions about data distribution using real-world donorschoose.org data.
Now, suppose we are interested in the characteristics of projects proposed on the donorschoose.org platform. Specifically, we have two research interests: (a) whether teachers get a higher optional support ratio as they complete more donation projects over time, and (b) whether teachers get better at attracting a higher amount of donations when the poverty level of their schools is high.
Here are the steps you should follow:
1. Load projects.csv data into a dataframe or tibble named “dt”. [5 marks]
2. Before analyzing the data, you need to create the new variables needed for the analysis. [5 marks each, total 20 marks]
(2) Remove projects from the dataset if the value of their *resource_type* column is *Other*. Then recode *NA* in the *margin* and *margin_percentage* columns to 0. In other words, after the manipulation, the *resource_type* column should not contain *Other* among its values, and the *margin* and *margin_percentage* columns should not contain any missing values (NA).
Tips:
- For this step, you may use ifelse() inside mutate() to recode the values.
- is.na() returns a logical vector with TRUE for each value that is NA and FALSE otherwise.
(3) Create a new column called *completed_project_no* that shows the running count of completed projects for each teacher in the dataset. For example, if teacher A's first completed project has *completed* in the *funding_status* column, then this project's record should have 1 as the value in the *completed_project_no* column, his/her second completed project should have 2, etc.
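To illustrate the two tips above, here is a minimal sketch on a made-up tibble. The names `toy`, `kind`, `score`, and `cleaned` are hypothetical and are not taken from projects.csv; the sketch assumes the dplyr package is installed.

```r
library(dplyr)

# Hypothetical toy data; these are NOT columns from projects.csv.
toy <- tibble(
  kind  = c("A", "Other", "B", "A"),
  score = c(1.5, 2.0, NA, 4.0)
)

# is.na() is vectorised: it returns one TRUE/FALSE per element.
na_flags <- is.na(toy$score)

cleaned <- toy %>%
  filter(kind != "Other") %>%                     # drop rows with the unwanted category
  mutate(score = ifelse(is.na(score), 0, score))  # replace NA with 0, keep other values

cleaned
```

ifelse() evaluates the condition element by element, so only the missing entries are replaced and all other values pass through unchanged.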
Tips:
- For counting with conditions, you may use cumsum(condition_column == "value") to calculate the count.
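The cumsum() tip can be sketched on made-up data as follows. The names `toy`, `owner`, `status`, and `done_so_far` are hypothetical, and the sketch assumes dplyr is installed and that the rows are already in the order in which the count should accumulate.

```r
library(dplyr)

# Hypothetical toy data; rows are assumed to be in chronological order.
toy <- tibble(
  owner  = c("x", "x", "x", "y", "y"),
  status = c("done", "open", "done", "done", "done")
)

counted <- toy %>%
  group_by(owner) %>%                              # restart the count for each owner
  mutate(done_so_far = cumsum(status == "done")) %>%  # TRUE counts as 1, FALSE as 0
  ungroup()

counted$done_so_far
```

Because the comparison yields FALSE for non-matching rows, such a row simply carries the previous count forward (the second row for owner "x" stays at 1). The result depends on row order, so sorting before counting matters.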
(4) Create a new column called *optional_ratio* that contains a ratio (in terms of %) of the amount of optional support relative to total donation. Use “summed_donations_excluding_optional_support” and “summed_donations_including_optional_support” for the calculation.
3. Graph the density of the *optional_ratio* column for completed_project_no == 1, completed_project_no == 2, and completed_project_no == 4. If you use the ggplot2 library, you can use the *geom_density* function.
- Do you think they are close to a normal density?
- Interpret any difference you notice from these three densities.
[A correct visualization: 10 marks / Interpretation: 10 marks]
Tips:
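- %in% can be used for matching values. It returns a logical vector indicating, for each element of its first argument, whether that element appears in its second argument.
- xlim(0, 50) can be used to limit the x-axis to the range 0 to 50 for visualization. This does not change the values stored in the variable, although ggplot2 leaves observations outside the limits out of the plotted density.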
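As a small base-R illustration of how %in% behaves, the sketch below contrasts it with match(), which is the function that returns positions. The vectors `wanted` and `values` are made up for this example.

```r
# Hypothetical vectors for illustration only.
wanted <- c(1, 2, 4)
values <- c(1, 3, 4, 5, 2)

is_wanted <- values %in% wanted    # logical: TRUE FALSE TRUE FALSE TRUE
positions <- match(values, wanted) # positions in `wanted`: 1 NA 3 NA 2

# A logical vector can be used directly to keep only the matching elements.
kept <- values[is_wanted]          # 1 4 2
```

The same logical vector works as a row condition, for example inside dplyr's filter().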
4. Now, using the *dt* data frame (or tibble), create a new data set named *ts*. This table should have two columns for each teacher. [10 marks each, total 20 marks]
(a) *avg_donation*: the average donation to the completed projects created by each teacher.
(b) *poverty_level*: the poverty level of the area where the teachers are located.
Tips:
- Use the total_donations column to measure the amount of donations received for each project.
5. Using the new “ts” table, graph the densities of *avg_donation* only for poverty_level == minimal and poverty_level == high.
- Do you think they are close to a normal density?
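Building a one-row-per-group table of averages generally follows a group-then-summarise pattern. Here is a generic sketch on made-up data, assuming dplyr is installed; the names `toy`, `owner`, `region`, `amount`, and `per_owner` are hypothetical and do not correspond to the assignment's columns.

```r
library(dplyr)

# Hypothetical toy data; each owner is assumed to belong to exactly one region.
toy <- tibble(
  owner  = c("x", "x", "y", "y", "y"),
  region = c("north", "north", "south", "south", "south"),
  amount = c(10, 30, 5, 15, 10)
)

per_owner <- toy %>%
  group_by(owner, region) %>%                             # one group per owner (and its region)
  summarise(avg_amount = mean(amount), .groups = "drop")  # collapse each group to one row

per_owner
```

Grouping by the second column as well keeps it in the summarised table; this only yields one row per owner when that column is constant within each owner, which is an assumption of this sketch.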
- Interpret any difference you notice among these densities.
[A correct visualization: 10 marks / Interpretation: 10 marks]
Tips:
- %in% can be used for matching values. It returns a logical vector indicating, for each element of its first argument, whether that element appears in its second argument.
- xlim(0, 1000) can be used to limit the x-axis to the range 0 to 1000 for visualization. This does not change the values stored in the variable.
6. Now, interpret the differences between the outputs of step 3 and the outputs of step 5. What would you conclude concerning the two research interests presented at the beginning of this assignment? Please try to explain why. [7.5 marks each, total 15 marks]
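The following sketch, on simulated data, shows what setting x-axis limits does in ggplot2. It assumes ggplot2 is installed; `toy`, `value`, `p_limited`, and `p_zoomed` are hypothetical names. coord_cartesian() is included only as a point of comparison and is not something the assignment requires.

```r
library(ggplot2)

set.seed(1)
# Hypothetical right-skewed data with some values above 50.
toy <- data.frame(value = rexp(500, rate = 1 / 20))

# xlim() sets scale limits: the data frame `toy` is unchanged, but observations
# outside [0, 50] are left out of the density estimate (ggplot2 warns about
# removed rows when the plot is drawn).
p_limited <- ggplot(toy, aes(x = value)) +
  geom_density() +
  xlim(0, 50)

# coord_cartesian() only zooms the view: the density is estimated from all
# observations and the plot is then cropped to the 0-50 window.
p_zoomed <- ggplot(toy, aes(x = value)) +
  geom_density() +
  coord_cartesian(xlim = c(0, 50))

n_outside <- sum(toy$value > 50)  # observations beyond the upper limit
```

Printing `p_limited` and `p_zoomed` side by side shows how much the choice matters when many observations lie beyond the limit.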
Instructions:
1. Use R.
You do not have to use tidyverse, but it is recommended. If you are to use tidyverse, here’s a useful online resource that has most of what you need to finish this assignment:
https://r4ds.had.co.nz/transform.html
2. Use R Markdown, and compile your code, results, and explanations into an HTML or a PDF file. You should submit only one final compiled report file (other formats will NOT be graded).
3. Do not include more than twenty lines of output from your code. For example, if you want to show that your code successfully transforms the entire table, show only, say, the first ten rows of the table.
4. Do not create a new data frame unless you are instructed to do so. When you create a new column, use the instructed column name. If you have to make up your own name, you need to justify it.
5. The use of any generative AI tool is strictly prohibited for this assignment. If such use is detected, it will be considered an attempt at plagiarism.