Using Machine Learning Tools Assignment 1
Overview
In this assignment, you will apply some popular machine learning techniques to the problem of predicting bike rental demand. A data set has been provided containing records of bike rentals in Seoul, collected during 2017-18.
General instructions
This assignment is divided into several tasks. Use the spaces provided in this notebook to answer the questions posed in each task. Note that some questions require writing a small amount of code and some require graphical results. It is your responsibility to make sure your responses are clearly labelled and your code has been fully executed (with the correct results displayed) before submission!
Do not manually edit the data set file we have provided! For marking purposes, it's important that your code is written to run correctly on the original data file.
When creating graphical output, label it clearly, with appropriate titles, x-axis labels and y-axis labels. Chapter 2 of the reference book is based on a similar workflow to this prac, so you may look there for further background and ideas. You can also use any other relevant general resources on the internet, but do not use ones that relate directly to these questions with this dataset (these would normally only be found in someone else's assignment answers). If you take a large portion of code or text from the internet then you should reference where it was taken from, but we do not expect references for small pieces of code, such as those from documentation, blogs or tutorials. Taking, and adapting, small portions of code is expected and is common practice when solving real problems.
The following code imports some of the essential libraries that you will need. You should not need to modify it, but you are expected to import other libraries as needed.
STEP1: Load the data set from the csv file (SeoulBikeData.csv) into a DataFrame, and summarise it with at least two appropriate pandas functions. Download the data set from MyUni using the link provided on the assignment page. A paper that describes a related version of this dataset is: Sathishkumar V E, Jangwoo Park, and Yongyun Cho, 'Using data mining techniques for bike sharing demand prediction in metropolitan city', Computer Communications, Vol. 153, pp. 353-366, March 2020. Feel free to look at this if you want more information about the dataset. A minimal loading sketch is given after the feature list below.
The data is stored in a CSV (comma-separated values) file and contains the following information:
- Date: year-month-day
- Rented Bike Count: Count of bikes rented at each hour
- Hour: Hour of the day
- Temperature: Temperature in Celsius
- Humidity: %
- Windspeed: m/s
- Visibility: in units of 10 m
- Dew point temperature: Celsius
- Solar radiation: MJ/m2
- Rainfall: mm
- Snowfall: cm
- Seasons: Winter, Spring, Summer, Autumn
- Holiday: Holiday/No holiday
- Functioning Day: NoFunc (Non-Functional Hours), Fun (Functional Hours)
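A minimal sketch of this step, assuming SeoulBikeData.csv is in the working directory (the encoding argument may need adjusting, since some exports of this file contain non-ASCII characters):

```python
import pandas as pd

# Load the CSV into a DataFrame; adjust the encoding if the default fails.
df = pd.read_csv("SeoulBikeData.csv", encoding="unicode_escape")

# Two appropriate pandas summaries: column types / non-null counts, and basic statistics.
df.info()
print(df.describe())
```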
STEP2: To get a feeling for the data it is a good idea to do some form of simple visualisation. Display a set of histograms for the features as they are right now, prior to any cleaning steps.
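For example, a quick look at the numeric columns (non-numeric columns are skipped by hist at this stage) could be:

```python
import matplotlib.pyplot as plt

# Histograms of every currently numeric feature, before any cleaning.
df.hist(bins=30, figsize=(15, 10))
plt.tight_layout()
plt.show()
```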
STEP3: The "Functioning Day" feature records whether the bike rental was open for business on that day. For this assignment we are only interested in predicting demand on days when the business is open, so remove rows from the DataFrame where the business is closed. After doing this, delete the Functioning Day feature from the DataFrame and verify that this worked.
The goal is to predict bike rental demand using historical data. To achieve this, you will use regression techniques with "Rented Bike Count" as the target feature, but for this it is important that all other features in the data are numerical. STEP4: Two of the features in the data, "Holiday" and "Seasons", need to be converted to numerical format. Write code to convert the "Holiday" feature to 0 or 1 from its current format. For the "Seasons" feature, a single numeric code would impose an artificial ordering, so instead add 4 new columns, labelled "Winter", "Spring", "Summer" and "Autumn". Each of these columns should store a 0 or 1, depending on the season recorded in each row.
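A sketch of one way to do this, assuming the Holiday values are "Holiday"/"No Holiday" and the season column is named "Seasons":

```python
# Holiday -> 1, No Holiday -> 0.
df["Holiday"] = (df["Holiday"] == "Holiday").astype(int)

# One 0/1 indicator column per season, then drop the original text column.
for season in ["Winter", "Spring", "Summer", "Autumn"]:
    df[season] = (df["Seasons"] == season).astype(int)
df = df.drop(columns=["Seasons"])
```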
STEP5: It is known that bike rentals depend strongly on whether it is a weekday or a weekend. Replace the Date feature with a Weekday feature that stores 1 for a weekday and 0 for a weekend. To do this, use the function date_is_weekday below, which returns 1 if the date is a weekday and 0 if it is a weekend. Apply the function to the Date column in your DataFrame (you can use DataFrame.transform to apply it).
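The assignment supplies its own date_is_weekday function; a sketch of what it might look like, assuming dates are day/month/year strings such as "01/12/2017", is:

```python
import pandas as pd

# Return 1 for a weekday (Mon-Fri) and 0 for a weekend (Sat/Sun).
def date_is_weekday(date_str):
    day = pd.to_datetime(date_str, dayfirst=True)
    return 1 if day.weekday() < 5 else 0

# Replace Date with a Weekday column.
df["Weekday"] = df["Date"].transform(date_is_weekday)
df = df.drop(columns=["Date"])
```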
STEP6: Convert all the remaining data to numerical format, with any non-numerical entries set to NaN.
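A minimal sketch, coercing every remaining column to a numeric dtype:

```python
import pandas as pd

# Entries that cannot be parsed as numbers become NaN.
df = df.apply(pd.to_numeric, errors="coerce")
print(df.dtypes)
print(df.isna().sum())
```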
STEP7: Use graphical methods to display your data and identify problematic entries. Set any problematic values in the numerical data to np.nan and check that this has worked. Once this is done, specify a sklearn pipeline that will perform imputation to replace problematic entries (NaN values) with an appropriate median value, plus any other pre-processing that you think should be used. Just specify the pipeline - do not run it now.
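One way the pipeline could be specified (median imputation plus scaling, which is optional here but generally helps the kernel-based models used later); which values count as "problematic" depends on what your plots reveal:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Example of flagging a problematic value (hypothetical condition only):
# df.loc[df["Humidity"] < 0, "Humidity"] = np.nan

# Pre-processing pipeline: specify only, do not fit it yet.
preprocess = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
```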
STEP8: Generate a pre-processed version of the entire dataset by applying the pipeline defined in STEP7. Then create separate scatter plots of each feature against the target variable "Rented Bike Count" to visualise the strength of the relationships. Additionally, calculate the correlation of each feature with the target using either the pandas function corr() or the numpy function corrcoef(), and find the 3 attributes that are most correlated with bike rentals.
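A sketch of this step, assuming the target column is named "Rented Bike Count" and reusing the preprocess pipeline from STEP7 (the scaler standardises the values, which does not affect the shape of the scatter plots or the correlations):

```python
import matplotlib.pyplot as plt

target = "Rented Bike Count"

# Pre-processed copy of the full dataset.
X_clean = pd.DataFrame(preprocess.fit_transform(df), columns=df.columns, index=df.index)

# Scatter plot of each feature against the target.
for col in X_clean.columns:
    if col == target:
        continue
    plt.figure()
    plt.scatter(X_clean[col], X_clean[target], s=2)
    plt.xlabel(col)
    plt.ylabel(target)
    plt.title(f"{col} vs {target}")
    plt.show()

# Correlation with the target; the three largest magnitudes answer the question.
corr = X_clean.corr()[target].drop(target)
print(corr.abs().sort_values(ascending=False).head(3))
```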
STEP9: Divide the data into training and test sets using an appropriate splitting method such that 20% of the data is kept for testing. Create a pipeline that adds a linear regression model to the pipeline defined in STEP7. Fit the pipeline to the training set and calculate the RMSE of the fit to evaluate its performance. As a comparison, compute the RMSE that would be obtained by predicting the mean value of bike rentals for all training examples.
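A sketch of the split, model pipeline and baseline comparison, reusing the target name and preprocess pipeline from the earlier sketches:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import Pipeline

X = df.drop(columns=[target])
y = df[target]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# STEP7 pre-processing followed by linear regression.
lin_pipe = Pipeline([
    ("preprocess", preprocess),
    ("linreg", LinearRegression()),
])
lin_pipe.fit(X_train, y_train)
rmse = np.sqrt(mean_squared_error(y_train, lin_pipe.predict(X_train)))
print("Linear regression training RMSE:", rmse)

# Baseline: always predict the mean of the training targets.
baseline = np.full(len(y_train), y_train.mean())
print("Mean-prediction RMSE:", np.sqrt(mean_squared_error(y_train, baseline)))
```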
STEP10: Fit a Kernel Ridge regression model (imported from sklearn.kernel_ridge) to the X_train data from STEP9. To do this, build a new pipeline that adds the Kernel Ridge regression model to the pipeline defined in STEP7, and fit it to the training data using the default settings. Generate a scatter plot of the predicted values against the actual values for the training data, and calculate the RMSE of the fit to the training data.
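Continuing from the STEP9 sketch, one way this might look:

```python
from sklearn.kernel_ridge import KernelRidge

# STEP7 pre-processing plus Kernel Ridge regression with default settings.
kr_pipe = Pipeline([
    ("preprocess", preprocess),
    ("kernelridge", KernelRidge()),
])
kr_pipe.fit(X_train, y_train)

# Predicted vs actual on the training data.
y_pred = kr_pipe.predict(X_train)
plt.figure()
plt.scatter(y_train, y_pred, s=2)
plt.xlabel("Actual rented bike count")
plt.ylabel("Predicted rented bike count")
plt.title("Kernel Ridge: training predictions vs actual")
plt.show()

print("Kernel Ridge training RMSE:", np.sqrt(mean_squared_error(y_train, y_pred)))
```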
STEP11: Fit a Support Vector Regression model (from sklearn.svm import SVR). As you did for STEP10, create a new pipeline combining the pipeline from STEP7 with this model, and fit it to your training data using the default settings. Again, generate a scatter plot of the predicted values against the actual values for the training data, and calculate the RMSE of the fit to the training data.
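This mirrors the STEP10 sketch, swapping in SVR (the scatter plot code is identical and omitted here):

```python
from sklearn.svm import SVR

# STEP7 pre-processing plus SVR with default settings.
svr_pipe = Pipeline([
    ("preprocess", preprocess),
    ("svr", SVR()),
])
svr_pipe.fit(X_train, y_train)
print("SVR training RMSE:",
      np.sqrt(mean_squared_error(y_train, svr_pipe.predict(X_train))))
```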
STEP12: Perform a 10-fold cross-validation for each of the three models (LinearRegression, KernelRidge, SVR). This splits the training set (as used above) into 10 equal-sized subsets and uses each in turn as the validation set while training a model on the other 9. You should therefore have 10 RMSE values for each cross-validation run. Find the mean and standard deviation of the RMSE values obtained for each model on the validation splits.
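A sketch using cross_val_score with the three pipelines built above (scores come back as negative RMSE, so they are negated):

```python
from sklearn.model_selection import cross_val_score

for name, model in [("LinearRegression", lin_pipe),
                    ("KernelRidge", kr_pipe),
                    ("SVR", svr_pipe)]:
    scores = -cross_val_score(model, X_train, y_train, cv=10,
                              scoring="neg_root_mean_squared_error")
    print(f"{name}: mean RMSE = {scores.mean():.1f}, std = {scores.std():.1f}")
```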
STEP13: Both Kernel Ridge Regression and Support Vector Regression have hyperparameters that can be adjusted to suit the problem. Use grid search to systematically compare the generalisation performance (RMSE) obtained with different hyperparameter settings (still with 10-fold CV). Use the sklearn class GridSearchCV to do this.
For KernelRidge, vary the hyperparameter alpha. (Note: if KernelRidge is the last step in a pipeline, alpha is referred to as kernelridge__alpha, where the prefix is the name given to that pipeline step.)
For SVR, vary the hyperparameter C. (Note: similarly, C is referred to as svr__C when the SVR step is named "svr".)
Find the best hyperparameter setting for each model. Finally, train both models with their best hyperparameter settings, apply them to the test set, and report the performance as RMSE.
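A sketch of the grid search, reusing the pipelines from STEP10 and STEP11; the hyperparameter grids below are illustrative only and sensible ranges depend on the data:

```python
from sklearn.model_selection import GridSearchCV

searches = {
    "KernelRidge": GridSearchCV(kr_pipe,
                                {"kernelridge__alpha": [0.01, 0.1, 1.0, 10.0]},
                                cv=10, scoring="neg_root_mean_squared_error"),
    "SVR": GridSearchCV(svr_pipe,
                        {"svr__C": [0.1, 1.0, 10.0, 100.0]},
                        cv=10, scoring="neg_root_mean_squared_error"),
}

for name, search in searches.items():
    search.fit(X_train, y_train)   # refits on the full training set with the best setting
    print(name, "best params:", search.best_params_)
    test_rmse = np.sqrt(mean_squared_error(y_test, search.predict(X_test)))
    print(name, "test RMSE:", test_rmse)
```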