Machine Learning Foundations with Python (90-803)
This is an individual test. No communication with other students is allowed. Do not post questions about the midterm on Piazza; this assignment is under exam conditions. You are only allowed to ask clarification questions.
Note: a submission that is even 1 minute late (whatever the reason) will receive a 0. Plan ahead and submit early!
From the datasets provided, you will be required to complete a series of tasks; please read carefully through them.
Part 1: Prediction
Using the datasets provided, complete the following tasks:
- Join all 6 datasets into a single dataset and write the result to a file called WH_2015_2020.csv. You will work with this file moving forward.
- Perform any necessary cleaning steps.
- Draw at least two plots that will give you insight into your data and your next steps (this can be a correlation matrix, the distribution of a particular variable, or a density plot, among others). Include an explanation of why these plots are relevant and how they will help you with your models.
- Predict the happiness score for each country in 2020 (pretend your model hasn't seen 2020!)
- You will be responsible for creating your training and testing datasets.
- Take any necessary feature engineering steps.
- Fit three different models, and choose the best model for your data. Make sure to:
- Properly tune your models (when needed).
- Perform feature selection (make sure that all the features you are using are significant for your model)
- Validate your model and report your performance metrics. Re-train your model if necessary. Include explanations on the metrics you decide to report.
- Justify: at each step, explain why you chose specific models, features, parameters, and techniques.
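As an illustration of the join and the "model hasn't seen 2020" requirement above, here is a minimal sketch using pandas. It is not the expected solution. The two tiny in-memory tables stand in for the six yearly files, and the column names, the RENAME mapping, and the helpers combine_years and split_by_year are all assumptions made for this example; the real files may use different headers.

```python
import pandas as pd

# Hypothetical stand-ins for two of the yearly files (values are made up).
frames_by_year = {
    2019: pd.DataFrame({"Country": ["A", "B"], "Score": [7.1, 4.2], "GDP": [1.3, 0.4]}),
    2020: pd.DataFrame({"Country": ["A", "B"], "Ladder score": [7.0, 4.4], "GDP": [1.2, 0.5]}),
}

# Assumed mapping from per-year header variants to one shared schema.
RENAME = {
    "Country": "country",
    "Score": "happiness_score",
    "Ladder score": "happiness_score",
    "GDP": "gdp",
}


def combine_years(frames):
    """Stack the yearly tables into one long table with a 'year' column."""
    parts = []
    for year, df in sorted(frames.items()):
        part = df.rename(columns=RENAME).copy()  # unify column names first
        part["year"] = year                      # remember where each row came from
        parts.append(part)
    return pd.concat(parts, ignore_index=True)


def split_by_year(df, holdout_year=2020):
    """Train on years before the holdout year, test on the holdout year only."""
    train = df[df["year"] < holdout_year]
    test = df[df["year"] == holdout_year]
    return train, test


combined = combine_years(frames_by_year)
train, test = split_by_year(combined)
```

The combined table could then be saved with combined.to_csv("WH_2015_2020.csv", index=False). Splitting on the year column, rather than sampling rows at random, keeps every 2020 row out of training.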
Once you have your final model, answer the following questions:
- What are your predicted scores for the 5 happiest countries?
- What are your predicted scores for the 5 least happy countries?
- Pick two features to compare the top and bottom 5 countries. How do these features differ between the two groups?
- How good are the happiness score predictions with your final model?
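One possible way to approach these questions is sketched below, assuming the 2020 predictions have been collected in a table with country, actual, and predicted columns. The data values, the column names, and the helpers top_and_bottom and regression_report are hypothetical; RMSE, MAE, and R² are shown only as common choices of regression metric, not as the required ones.

```python
import numpy as np
import pandas as pd

# Made-up 2020 results; 'predicted' would come from your final model.
results = pd.DataFrame({
    "country": list("ABCDEFGHIJKL"),
    "actual": [7.8, 7.6, 7.5, 7.4, 7.3, 6.0, 5.5, 5.0, 3.8, 3.5, 3.3, 2.6],
    "predicted": [7.7, 7.5, 7.6, 7.2, 7.3, 6.1, 5.4, 5.2, 3.9, 3.4, 3.5, 2.9],
})


def top_and_bottom(df, col="predicted", n=5):
    """Return the n rows with the highest and the n rows with the lowest values of col."""
    return df.nlargest(n, col), df.nsmallest(n, col)


def regression_report(y_true, y_pred):
    """Compute RMSE, MAE, and R^2 for a set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    ss_res = float(np.sum(err ** 2))                         # residual sum of squares
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))    # total sum of squares
    return {
        "rmse": float(np.sqrt(np.mean(err ** 2))),
        "mae": float(np.mean(np.abs(err))),
        "r2": 1.0 - ss_res / ss_tot,
    }


top5, bottom5 = top_and_bottom(results)
report = regression_report(results["actual"], results["predicted"])
```

RMSE and MAE are in the same units as the happiness score, which makes them easy to interpret next to the predicted values in top5 and bottom5.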
The submission for this section should be
General Midterm Notes
- You may decide whether to drop any columns or missing values, but you must document each decision and give reasons for it. The reasons cannot be "there were too many variables", "it was too difficult", "seemed like an obvious column to drop", or similar.
- You can add subsections to your file if it helps to make it easier/clearer to understand.
- Make each step of the required tasks clear to us so that we can easily award you well-deserved points!
- Remember to use both code and markdown cells, and include comments in your code (we can't read your mind to follow your process!).
- Only use the Python libraries, techniques, and knowledge learned from Weeks 1-7.
- Please keep all datasets inside the