STAT 1261/2260 Project Guidelines
Teams
• Team size
– 2 to 4 students
– I will consider smaller team sizes, but you should give a convincing justification. In particular, you need to convince me you have the capacity to complete the project with fewer people.
• Team request
– Email me with your request to form a team.
– One team member should email, with a Cc to the other members. Your email should include a list of the team members and their majors.
Scope
I only expect you to use the techniques that I have shown you in the lectures. You should not use any techniques that you do not understand.
I would prefer that you do relatively simple, clear analyses using simple techniques rather than complex analyses that you do not fully understand. Your job as a data scientist is to draw clear conclusions from data.
Suggested Structure
See the rubric for the requirements of your project files.
I also recommend that you:
• Download the data you are working on, and save it with your project files. Leave instructions on how I can get the original data that you downloaded.
• Consider using a “setup” file, such as a markdown file, that runs once, to set up the project. For example, it might install any libraries that you will use. It may download the data from a URL and save it to your project directory.
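A setup step along these lines could look like the following minimal sketch in base R. The package list, the helper names `install_if_missing` and `fetch_data`, and the URL in the commented-out call are all placeholders for illustration, not required names; adapt them to your own project.

```r
# Packages the project needs (placeholder list; replace with your own).
required_packages <- c("stats", "utils")

# Install only the packages that are not already available.
install_if_missing <- function(pkgs) {
  missing <- pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]
  if (length(missing) > 0) install.packages(missing)
  invisible(missing)
}

# Download the data once and keep a copy with the project files.
# If the file is already there, nothing is downloaded again.
fetch_data <- function(url, dest) {
  dir.create(dirname(dest), recursive = TRUE, showWarnings = FALSE)
  if (!file.exists(dest)) {
    download.file(url, destfile = dest, mode = "wb")
  }
  invisible(dest)
}

install_if_missing(required_packages)

# Placeholder URL: point this at the real source of your data.
# fetch_data("https://example.com/my-data.csv", file.path("data", "my-data.csv"))
```

Because `fetch_data` skips the download when the file already exists, the setup file can be run more than once without overwriting your saved copy of the data.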
Part I and Part II of the Project
Part I Data Wrangling and Visualization
The first part of the project will concentrate on data exploration. Your task is to select a dataset and carry out a comprehensive analysis of it, using graphical displays and summary statistics to gain insights from the data. To illustrate, try to address the following questions:
• What is the distribution of each variable? Is the distribution roughly symmetric? Any extreme values? Is it approximately normal?
• What relationships can be observed between the variables? You can employ graphs and descriptive statistics to answer this question.
• Based on the answers to the preceding questions, what hypotheses can you formulate for testing? Which variable can be considered as a response variable? Which variables appear to be valuable in the estimation or prediction of the response variable?
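As a rough illustration of the first two questions, here is a minimal sketch in base R. It uses the built-in `mtcars` data as a stand-in for your own dataset, and the three variables chosen are arbitrary; the plots and summaries you need will depend on your data.

```r
# mtcars ships with R; it stands in for your own data frame here.
dat <- mtcars[, c("mpg", "hp", "wt")]

# Distribution of each variable: minimum, quartiles, mean, and maximum.
print(summary(dat))

# Shape of one variable: histogram for symmetry, boxplot for extreme values,
# normal Q-Q plot for approximate normality.
hist(dat$mpg, main = "Distribution of mpg", xlab = "Miles per gallon")
boxplot(dat$mpg, horizontal = TRUE, main = "mpg")
qqnorm(dat$mpg)
qqline(dat$mpg)

# Relationships between variables: pairwise scatterplots and correlations.
pairs(dat)
cor_matrix <- cor(dat)
print(round(cor_matrix, 2))
```

Strong correlations in `cor_matrix` and clear patterns in the scatterplots are natural starting points for choosing a response variable and candidate predictors.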
Part II Modeling
Based on your findings from Part I, choose a few machine learning models to fit. Make sure that you tune their parameters to optimize the models according to some model assessment criteria.
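One way to tune a parameter against an assessment criterion is k-fold cross-validation. The sketch below is only an illustration under assumptions of my choosing: it uses the built-in `mtcars` data, a polynomial regression of `mpg` on `hp`, the polynomial degree as the tuning parameter, and cross-validated RMSE as the criterion. Your own models, parameters, and criteria will differ.

```r
set.seed(1)  # make the random fold assignment reproducible

dat <- mtcars
k <- 5
# Assign each row to one of k folds at random.
folds <- sample(rep(1:k, length.out = nrow(dat)))

# Candidate values of the tuning parameter (polynomial degree).
degrees <- 1:4

# For each degree, fit on k-1 folds, predict the held-out fold,
# and combine the held-out errors into a cross-validated RMSE.
cv_rmse <- sapply(degrees, function(d) {
  fold_mse <- sapply(1:k, function(i) {
    train <- dat[folds != i, ]
    test <- dat[folds == i, ]
    fit <- lm(mpg ~ poly(hp, d), data = train)
    mean((test$mpg - predict(fit, newdata = test))^2)
  })
  sqrt(mean(fold_mse))
})

names(cv_rmse) <- degrees
print(round(cv_rmse, 2))

# The degree with the smallest cross-validated RMSE.
best_degree <- degrees[which.min(cv_rmse)]
print(best_degree)
```

The same loop structure applies to other models: replace the `lm` call with your model and `degrees` with the grid of parameter values you want to compare.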
Project Report
The first part of the project report should be roughly between 1000 and 1500 words of explanatory text and code, and the final report (including Parts I and II) should be between 1500 and 3000 words, not including figures and tables.
It can be in the form of a PDF, Word, or HTML document. Note that your R Markdown file needs to be submitted as well.
Project Calendar
Week | Date (Monday) | Task |
1 | Aug. 28 | Start to form teams |
2 | Sept. 4 | Data exploration |
3 | Sept. 11 | Data exploration |
4 | Sept. 18 | Teams finalized |
5 | Sept. 25 | Start to work on Part I |
6 | Oct. 2 | |
7 | Oct. 9 | |
8 | Oct. 16 | Part I Due on Oct. 20 |
9 | Oct. 23 | Start to work on Part II |
10 | Oct. 30 | |
11 | Nov. 6 | |
12 | Nov. 13 | |
13 | Nov. 20 | Thanksgiving Break |
14 | Nov. 27 | Part II Due on Dec. 1 |
15 | Dec. 4 | Project Presentations |
Presentation
Each team will give a short presentation on its project to me and the rest of the class.
Presentation Details
• The project presentations should be 5-7 minutes.
• You should send your slides to me the day before the presentation day.
• I will video record the presentations, to make sure that the grading is fair and consistent.
Goal of the presentation
The goal of the project presentation is to get quickly to your main conclusions, and the evidence for these conclusions. A good presentation will help your listeners engage with your analysis, and think about new questions to ask. The focus of the presentation should be on the following:
• Summarize your data.
• Describe your main analysis strategy.
• Describe your main findings.
• Draw conclusions with care, citing evidence from your data, and from any relevant literature.
What if you get the “opposite” conclusion or no conclusion?
If all your analyses have so far proved negative or inconclusive, that’s fine too. Say what you tried, what evidence you were able to find, and whether you need new evidence or a new analysis strategy. You might also conclude that your initial hypothesis was wrong, and that the data gives evidence against it.
Who will do the presentation?
Please decide with your teammates who will prepare and who will deliver the presentation.
Data for Projects
Your task is open-ended, and this is typical of real data analysis projects. Each project will go in a different direction, and you will find that your group will become experts in interpreting your own data. You might even end up writing a little paper from your report.
Data Sets Suggested
If this is the first data science course for you or you have no previous experience in data analysis, I recommend you use one of the following data sets from Kaggle.
1. Heart Failure Prediction (4 kB)
2. Data Science Job Salaries (8 kB)
3. Sleep Health and Lifestyle Dataset (3 kB)
4. Heart Attack Analysis & Prediction Dataset (4 kB)
5. Airline Passenger Satisfaction (3.04 MB)
6. Credit Risk of Customers (19 kB)
7. American Citizens Annual Income (343 kB)
8. Loan Approval Prediction Data (83 kB)
9. Travel Insurance Prediction Data (13 kB)
10. Employee Satisfaction Index Dataset (8 kB)
Other Data Sources
Feel free to use data from your own discipline. Ask around to see if you can find interesting data from the University of Pittsburgh, maybe in your School.
Below are some links for finding data sets.
• Kaggle Datasets: Kaggle has a large list of datasets. You may use filters to choose the format and the size of the data set. I suggest you use small to medium-sized (<5MB) data sets because otherwise, it will take a long time to fit and tune models.
• Google Dataset Search: Try the link, you’ll get the idea.
• World Bank Data: The site has a lot of data on global development, and related issues.