Final Project
Prediction on movie rating
In this final project, we have an IMDB movie dataset that includes thousands of movies. In the training data, every movie has been rated on a scale of 1-10. The goal of this project is to analyse the data, build a machine learning model on the training data, and predict ratings for the movies in the test data.
The final grade will be a combination of data processing, exploratory data analysis, machine learning model building, implementation and report writing.
Data
The data set is stored as comma-separated values, with one attribute per column. It contains two files:
train.csv: training data, consisting of 50,000 instances with 8 features and the target “rate”.
test.csv: test data, consisting of 10,000 unlabelled instances with 8 features.
NOTE: the target “rate” in the data is a continuous number (e.g., 0.2, 5.8, 8.5). You can choose either of the following options to make predictions:
· Build regression models to predict the continuous ratings.
· Transform the continuous ratings into integers (e.g., 1, 5, 9) and treat them as 11 different classes, so that you can build classification models for the predictions.
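As an illustration only, the second option could be sketched as below. It assumes the class labels are the integers 0 to 10 (which gives the 11 classes mentioned above) and that halves round up; the helper name rating_to_class is hypothetical, so adjust the range and rounding rule to whatever you find in your data.

```python
import math


def rating_to_class(rating: float, low: int = 0, high: int = 10) -> int:
    """Round a continuous rating to the nearest integer class (halves round up)."""
    # Python's built-in round() sends halves to the nearest even integer
    # (round(8.5) == 8), so floor(x + 0.5) is used for conventional rounding.
    nearest = math.floor(rating + 0.5)
    # Clamp to the assumed label range so stray values cannot create extra classes.
    return max(low, min(high, nearest))


ratings = [0.2, 5.8, 8.5]  # the example values quoted in the note
classes = [rating_to_class(r) for r in ratings]
print(classes)  # [0, 6, 9]
```

If you choose the regression option instead, no conversion is needed and the continuous “rate” column can be used as the target directly.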
Requirements
I. Coding
1. Data pre-processing and Exploratory data analysis
a. Missing data – remove instances that contain missing values
b. Abnormal data – outliers, inconsistent data format/type, wrong values, etc.
c. Distribution analysis
d. Plotting, to help you understand your data
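A minimal sketch of steps a to d with pandas, run on a small made-up table rather than the real train.csv. Only the column name “rate” comes from this brief; “runtime” and “genre” are hypothetical, and both the 0-10 validity range and the 1.5 × IQR outlier rule are assumptions you may want to replace.

```python
import pandas as pd

# Made-up stand-in for train.csv ("runtime" and "genre" are hypothetical names).
df = pd.DataFrame({
    "runtime": ["95", "120", None, "110", "9999", "abc", "100", "105", "90", "130"],
    "genre": ["Drama", "Comedy", "Drama", None, "Action",
              "Drama", "Drama", "Comedy", "Action", "Drama"],
    "rate": [7.1, 5.8, 6.0, 8.5, 4.2, 6.5, 55.0, 6.9, 7.4, 3.9],
})

# b. Inconsistent type: coerce to numeric; unparseable entries become NaN.
df["runtime"] = pd.to_numeric(df["runtime"], errors="coerce")

# a. Missing data: drop every row that has at least one missing value.
clean = df.dropna().copy()

# b. Wrong values: keep only ratings inside the assumed valid range.
clean = clean[clean["rate"].between(0, 10)]

# b. Outliers: drop runtimes further than 1.5 * IQR from the quartiles.
q1, q3 = clean["runtime"].quantile([0.25, 0.75])
iqr = q3 - q1
clean = clean[clean["runtime"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# c. Distribution analysis: summary statistics of the target.
print(clean["rate"].describe())

# d. Plotting (needs matplotlib installed), for example:
# clean["rate"].plot.hist(bins=20)
```

In your notebook, record how many rows each step removes, since the report asks you to describe the pre-processing.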
2. Feature engineering
a. Feature selection – if needed, drop irrelevant features
b. Transformation – if needed, transform categorical features into multiple binary features
c. Normalisation
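The three steps above might look like the following sketch. The data is made up, the columns “genre” and “runtime” and the “train_1”-style identifiers are hypothetical, and min-max scaling is only one of several reasonable normalisation choices.

```python
import pandas as pd

# Made-up rows; only "movie_id" and "rate" are names taken from this brief.
df = pd.DataFrame({
    "movie_id": ["train_1", "train_2", "train_3", "train_4"],
    "genre": ["Drama", "Comedy", "Drama", "Action"],
    "runtime": [90.0, 120.0, 105.0, 150.0],
    "rate": [7.1, 5.8, 6.9, 4.2],
})

# a. Feature selection: the identifier carries no predictive signal,
#    and the target must not be left among the features.
features = df.drop(columns=["movie_id", "rate"])
target = df["rate"]

# b. Transformation: one binary column per category value.
features = pd.get_dummies(features, columns=["genre"], dtype=int)

# c. Normalisation: min-max scale the numeric column to the range [0, 1].
col_min = features["runtime"].min()
col_max = features["runtime"].max()
features["runtime"] = (features["runtime"] - col_min) / (col_max - col_min)

print(features)
```

When you apply the same steps to the test data, reuse the minimum and maximum computed on the training data, and make sure the test table ends up with the same binary columns in the same order.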
3. Model building
a. Build at least two machine learning models on training data
b. Evaluate your models and pick the best model for the prediction in the next step
c. (Optional) Hyperparameter tuning
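One possible way to compare two models, sketched with scikit-learn on synthetic data for the regression option. The two model types, the 5-fold cross-validation and the mean absolute error metric are illustrative choices, not requirements of this brief.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the processed training data: 200 rows, 8 features.
rng = np.random.default_rng(0)
X = rng.random((200, 8))
y = np.clip(10 * X[:, 0] + rng.normal(0, 0.5, 200), 0, 10)

# a. Two candidate models.
models = {
    "linear_regression": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

# b. Evaluate each model with 5-fold cross-validation.
scores = {}
for name, model in models.items():
    # scikit-learn reports errors as negative scores so that higher is better.
    neg_mae = cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")
    scores[name] = -neg_mae.mean()

# Pick the model with the lowest mean absolute error and refit it on all rows.
best_name = min(scores, key=scores.get)
best_model = models[best_name].fit(X, y)
print(scores, best_name)
```

For the optional tuning step, GridSearchCV from sklearn.model_selection can search over a parameter grid using the same cross-validation scheme.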
4. Prediction
a. Use your best model to make predictions on test data
b. Save your prediction results to a CSV file in the following format, with movie_id as the first column and rate (your predictions) as the second:
movie_id | rate |
test_1 | 3 |
test_2 | 9 |
test_3 | 7 |
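A sketch of writing a file in this layout with pandas. The three rows simply repeat the example above, and the file is written to a temporary folder here so the snippet runs anywhere; in your notebook you would save it under the file name required in the Submission section.

```python
import os
import tempfile

import pandas as pd

# Placeholder IDs and predictions copied from the example table above.
movie_ids = ["test_1", "test_2", "test_3"]
predictions = [3, 9, 7]

submission = pd.DataFrame({"movie_id": movie_ids, "rate": predictions})

# Temporary location for illustration only.
out_path = os.path.join(tempfile.mkdtemp(), "prediction.csv")

# index=False keeps the file to exactly two columns.
submission.to_csv(out_path, index=False)

with open(out_path) as f:
    print(f.read())
```

Opening the saved file once and checking the header row and the number of rows is a cheap way to catch an accidental extra index column.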
II. Report writing
A 1000-1500 word report detailing:
o The steps to pre-process your data
o How you performed the feature engineering
o How you built the machine learning models and evaluated their performance
o Conclusions
Submission
You are required to submit the following files:
o Jupyter notebook exported as an HTML file, named “Student ID_notebook.html”
o Code should be well documented (add markdown cells or comments to explain your code)
· Predicted result (20’)
o Prediction results in a CSV file, named “Student ID_prediction.csv”
o The file should contain two columns: movie_id and rate (your predictions)
· Report (40’)
o Save your report as a PDF file, named “Student ID_report.pdf”
Please zip all your submission files and include your student ID in the name of the zip file, e.g., “Student ID_submission.zip”.
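If you prefer to build the archive from Python, a sketch with the standard zipfile module follows. The student ID “12345678” is hypothetical, and the three files are empty placeholders created in a temporary folder so the snippet can run on its own.

```python
import os
import tempfile
import zipfile

student_id = "12345678"  # hypothetical ID, replace with your own
workdir = tempfile.mkdtemp()

# Placeholder files standing in for the real notebook export, predictions and report.
names = [
    f"{student_id}_notebook.html",
    f"{student_id}_prediction.csv",
    f"{student_id}_report.pdf",
]
for name in names:
    with open(os.path.join(workdir, name), "w") as f:
        f.write("placeholder")

zip_path = os.path.join(workdir, f"{student_id}_submission.zip")
with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
    for name in names:
        # arcname stores the bare file name instead of the full temporary path.
        zf.write(os.path.join(workdir, name), arcname=name)

# List the archive contents as a final check.
with zipfile.ZipFile(zip_path) as zf:
    print(zf.namelist())
```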