Math 10 Introduction to Programming for Data Science - Final Project: Supervised Learning and Unsupervised Learning
Math 10 Final Project, 23 Winter
Due: 11:59 PM, Wednesday, March 22nd, 2023
Submission: Upload the .ipynb file to Canvas.
The submission should be a well-organized report: well-structured sections, high-quality figures, and the necessary descriptions written as markdown in the notebook file, with all code blocks already executed. Submitting only code and/or incomplete results will severely impact your grade. Please include everything in one single .ipynb file; other formats (.pdf, .doc, ...) and redundant files are not valid and won’t be graded.
Dataset Downloading:
For the final project, we provide the following three datasets, and you are required to use one of them. Please click the links for details and to download the datasets:
1) Tabular Data: Health Insurance Cross Sell Prediction
2) Image Data: American sign language-MNIST https://www.kaggle.com/datasets/datamunge/sign-language-mnist
3) Text Data: TripAdvisor Review https://www.kaggle.com/datasets/andrewmvd/trip-advisor-hotel-reviews
You must pick one of the datasets and complete all the tasks described below. The choice won’t affect your grade. Of course, exploring all three datasets is especially welcome if you aim to get an A+ in this course.
Tasks and Grading Policy (20 pts of the total course grade)
Task 1: Data Loading, Processing and Exploratory Data Analysis (EDA) (4 pts)
- Write code to correctly load the data, and use markdown to state which Python data type/package you selected to store/process the data and briefly explain why.
Additional requirements for each dataset:
Tabular Data: Generate the summary statistics of all columns. Write the code and use text to explain how you transform the data into the final number-valued matrix.
Image Data: For each category, randomly pick one sample image and plot them all together in the same figure.
Text Data: Use text and code to show the whole pipeline by which you transform the texts into vectors/a matrix.
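As an illustration of the tabular requirement only, here is a minimal sketch of producing summary statistics and a number-valued matrix with pandas. The toy table and its column names (`age`, `gender`, `vehicle_damage`, `response`) are hypothetical placeholders, not the schema of the provided dataset, and one-hot encoding is just one possible choice of transformation.

```python
import pandas as pd

# Hypothetical toy table: column names and values are placeholders,
# NOT the real schema of the provided dataset.
df = pd.DataFrame({
    "age": [25, 40, 31, 58],
    "gender": ["Male", "Female", "Female", "Male"],
    "vehicle_damage": ["Yes", "No", "Yes", "No"],
    "response": [1, 0, 0, 1],
})

# Summary statistics for every column, numeric and categorical alike.
summary = df.describe(include="all")

# Separate the label column from the feature columns.
y = df["response"].to_numpy()
features = df.drop(columns=["response"])

# One-hot encode the categorical columns; drop_first removes one
# redundant indicator column per categorical variable.
encoded = pd.get_dummies(features, drop_first=True, dtype=float)

# Final number-valued matrix, one row per sample.
X = encoded.to_numpy(dtype=float)

print(summary)
print(encoded.columns.tolist())
print(X.shape)
```

In your own report, the markdown around such a cell should explain why each transformation was chosen (for example, why a column was one-hot encoded rather than mapped to integers).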
Task 2: Supervised Learning (8 pts)
- Define a meaningful supervised learning problem for the dataset in a markdown cell. Use code to assign the appropriate 1) training data, 2) test data and 3) labels.
- Use at least three supervised learning models to solve the problem. If you use a third-party package, explain why you chose it. If you use customized code, please write documentation strings. If the package has not been covered in this course, cite the original resources and explain your understanding of the package, or you won’t receive credit for it.
- Choose at least one method and use text and equations to describe the algorithm in more detail. It’s okay to refer to lecture notes/discussion files, but please rephrase.
- Choose at least one method (it can be the same one or a different one) and change a hyperparameter of the model to see how the performance changes correspondingly. Use a markdown cell to describe your findings.
- Describe how you evaluate the performance of all the methods, and report the results. Be specific about the performance on both training and test data.
- It’s totally fine to drop some variables from the original data (e.g. by doing simple feature selection), especially in the tabular dataset, as long as you can explain your rationale.
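The workflow described in the items above (split, fit several models, compare training and test performance, vary a hyperparameter) can be sketched as follows. This is only a hedged example under stated assumptions: the data is synthetic (`make_classification`), the three scikit-learn models and the accuracy metric are arbitrary choices for illustration, and the swept hyperparameter (`max_depth`) is one option among many.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): replace with your processed X and y.
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=0
)

# Hold out 25% of the samples as test data, keeping class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Three example models; the choice is illustrative, not prescribed.
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "k_nearest_neighbors": KNeighborsClassifier(n_neighbors=5),
    "decision_tree": DecisionTreeClassifier(max_depth=3, random_state=0),
}

# Record accuracy on both the training and the test data for each model.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc = accuracy_score(y_test, model.predict(X_test))
    results[name] = (train_acc, test_acc)
    print(f"{name}: train={train_acc:.3f}, test={test_acc:.3f}")

# Hyperparameter sweep: how does tree depth change train/test accuracy?
depth_scores = {}
for depth in (1, 3, 5, None):  # None lets the tree grow without a depth limit
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    depth_scores[depth] = (tree.score(X_train, y_train), tree.score(X_test, y_test))
    print(f"max_depth={depth}: train/test={depth_scores[depth]}")
```

A large gap between training and test accuracy as the depth grows is the kind of finding worth describing in the accompanying markdown cell.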
Task 3: Unsupervised Learning (6 pts)
- Choose at least two unsupervised learning methods to analyze the data. For at least one method, use text and equations to describe the algorithm in more detail.
- Describe your findings in the unsupervised learning task using markdown cells.
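As one possible pairing of two unsupervised methods, the sketch below applies PCA for dimensionality reduction and K-means for clustering. It assumes synthetic data from `make_blobs` and an arbitrary choice of three clusters; neither the methods nor the parameter values are prescribed by this project.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): replace with your feature matrix.
X, _ = make_blobs(n_samples=300, n_features=6, centers=3, random_state=0)

# Standardize features so that no single column dominates the distances.
X_scaled = StandardScaler().fit_transform(X)

# Method 1: PCA, projecting the data onto its first two principal components.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

# Method 2: K-means with an assumed number of clusters (here 3).
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X_scaled)

# Fraction of variance captured by each component, and the cluster sizes.
print(pca.explained_variance_ratio_)
print(np.bincount(cluster_ids))
```

The 2-D projection `X_2d` colored by `cluster_ids` is a natural figure for the report, together with a discussion of whether the clusters correspond to anything meaningful in the data.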
Task 4: Organizing your report (2 pts)
These 2 pts will be determined by the overall quality of your report in ipynb form (judged by the instructor when grading). There is no guarantee that you will get the full 2 points if you only submit a correct (instead of nearly perfect) report. In other words, it’s entirely possible that you receive zero for this task if your report merely copies and pastes code and Markdown from lecture notes/discussion files.
Try to write the descriptions and code (including documentation strings and comments) in a well-organized and logical way, and generate high-quality figures. A practical tip is to imagine that you’re writing a thesis instead of merely code; therefore you need to include:
- Meaningful title of the whole project reflecting its scientific contents
- Sections and subsections at different heading levels using markdown language
- LaTeX equations when introducing the methods used
- Abstract/conclusion/transition paragraphs
- Clear figure axis labels when plotting
- Document strings or comments for long functions
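For the LaTeX item in the list above, a markdown cell can hold display equations between `$$` delimiters. As a hedged illustration only, the standard K-means objective could be typeset as below; the symbols ($K$ clusters $C_k$ with centroids $\mu_k$, samples $x_i$) are notation chosen for this example and should be adapted to the method you actually describe.

```latex
$$
J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2,
\qquad
\mu_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i
$$
```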
You may also consult the highly voted Kaggle notebooks for each dataset for possible examples. In particular, repeating the words of this file and copying the task requirements is neither necessary nor encouraged; I expect to see your own results and the associated descriptions/explanations.
Other Requirements/Resources:
- Each student should work on the final project independently; direct discussion of the content (especially about debugging) with other students, the TA, or the teacher is NOT allowed. Violations of the academic integrity rules will be reported to the department.
- Make sure to submit the final project .ipynb file to Canvas before the deadline. The deadline will not be extended.
- Computer/software issues are not a valid excuse for submitting incomplete results: we have already tested the datasets and basic tasks on a personal laptop satisfying the minimum requirements set by the university, and we have also introduced the free Kaggle and Google Colab resources for running code in the cloud.