Assignment 2: Predicting fuel consumption of the ship
Instructions to submit the assignment
- Submit your assignment before 12th February 11.59pm in Canvas under Assignment 2.
- Write your code in the codeblocks provided after respective questions.
- Run all the code before submitting your notebook, and submit it without deleting the outputs.
- This is an individual assignment. Write your own answers and refrain from working in groups.
In this assignment, we will learn how to implement a multiple regression model using the statsmodels library in Python. scikit-learn is an equally popular machine learning toolbox, but we prefer statsmodels because of the functionality it provides for analysing the results. You can install statsmodels by running the following command in your Anaconda shell:
conda install -c conda-forge statsmodels
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
Question 1: Loading the data
We will use a publicly available dataset, which was used to develop statistical models for analyzing the dynamics of ocean-going vessels, especially with regard to modelling fuel efficiency. You can refer to the dataset and the metadata at this website. Let us start exploring the dataset!
We have downloaded the dataset for you. We have taken a small subset of the original dataset after merging different files on the website. We intend to provide a prototype, which you can use to gain a better understanding of regression. The dataset is available in Assignment_2 under Files in LumiNUS.
Create a single dataframe in pandas by reading the dataset from the CSV file. Print the names of the columns and the total number of datapoints. (2 points)
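As a reminder of the pandas calls involved, here is a minimal sketch on made-up data. The CSV text, the `toy` dataframe and its values are assumptions for illustration only; the real file name and columns come from the assignment files.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for the real dataset file.
csv_text = """tstamp,fuelVolumeFlowRate,windSpeed
2010-02-15 10:00:00,1.2,5.0
2010-02-15 10:00:01,1.3,5.1
2010-02-15 10:00:02,1.1,4.9
"""

# read_csv accepts a file path or any file-like object.
toy = pd.read_csv(io.StringIO(csv_text))

column_names = toy.columns.tolist()  # names of the columns
n_points = len(toy)                  # number of rows (datapoints)
print(column_names)
print(n_points)
```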
df =
Question 2: Data preprocessing
Data preprocessing is a vital step before the data is fed to a machine learning model.
We observe a lot of NaN values.
- Print the total number of NaNs present in every column of the dataframe.
- Keep only those datapoints in df that have a non-NaN value for fuelVolumeFlowRate.
(2 points)
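The two steps above can be sketched on a toy dataframe as follows. The dataframe `toy` and its values are made up for illustration; only the column name fuelVolumeFlowRate comes from the assignment.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a few missing readings.
toy = pd.DataFrame({
    "fuelVolumeFlowRate": [1.2, np.nan, 0.9, np.nan],
    "windSpeed": [5.0, 6.1, np.nan, 4.8],
})

# isna() marks missing cells; summing the booleans counts them per column.
nan_counts = toy.isna().sum()
print(nan_counts)

# Keep only the rows where fuelVolumeFlowRate is present.
toy_clean = toy.dropna(subset=["fuelVolumeFlowRate"])
print(len(toy_clean))
```

Note that `dropna(subset=...)` only looks at the listed column, so rows with NaNs in other columns are kept.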
Although tstamp is in a human-readable format, it is not convenient for a computer to work with. We will now create a column UNIX that contains tstamp in the UNIX timestamp format, which is simply an integer: the number of seconds elapsed since 1970-01-01 00:00:00. Such a timestamp makes various operations easy to perform.
df['UNIX'] = (pd.to_datetime(df['tstamp']) - pd.Timestamp("1970-01-01")) // pd.Timedelta('1s')
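To see what this conversion produces, here is a small sketch that applies the same expression to two made-up timestamps (the series `stamps` is an assumption for illustration):

```python
import pandas as pd

# Two hypothetical timestamps close to the UNIX epoch.
stamps = pd.Series(["1970-01-01 00:00:10", "1970-01-02 00:00:00"])

# Subtracting the epoch gives a Timedelta; floor-dividing by one second
# turns it into an integer number of seconds.
unix = (pd.to_datetime(stamps) - pd.Timestamp("1970-01-01")) // pd.Timedelta("1s")
print(unix.tolist())  # [10, 86400]
```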
A closer look at the data reveals that it comprises multiple trips (try reading a few timestamps!). There are certain breakpoints in the series for which readings are not available. For the purpose of analysing the data, we want to assign a tripID to every datapoint. If the time difference between two consecutive readings is more than a certain threshold, we say that the two readings belong to two different trips.
Suppose that the tripID of the first datapoint is 0. If the time difference between the first and second datapoint is more than the threshold, then the second datapoint should be given a tripID of 1; otherwise the second datapoint also belongs to trip 0.
Create a new column tripID in the dataframe. Write a short piece of code to populate it, choosing the threshold to be 1E-7.
Hint: Read about function pct_change and use it on df['UNIX'].
(3 points)
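One possible way to combine pct_change with a running count is sketched below on made-up timestamps. The series `unix` and its values are assumptions for illustration; the idea is that a large relative jump marks the start of a new trip, and a cumulative sum of those marks yields consecutive IDs.

```python
import pandas as pd

# Hypothetical UNIX timestamps: three readings one second apart,
# then a gap of roughly an hour, then two more readings.
unix = pd.Series([
    1_500_000_000,
    1_500_000_001,
    1_500_000_002,
    1_500_003_600,
    1_500_003_601,
])

threshold = 1e-7

# Relative change with respect to the previous reading (NaN for the first row).
rel_change = unix.pct_change()

# True where a new trip starts; NaN > threshold evaluates to False,
# so the first reading stays in trip 0.
new_trip = rel_change > threshold

# Each True increments the running trip counter.
trip_id = new_trip.cumsum()
print(trip_id.tolist())  # [0, 0, 0, 1, 1]
```

Because pct_change is a relative measure, a threshold of 1E-7 on timestamps of this magnitude corresponds to a gap of about 150 seconds.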
How many trips are there?
(1 point)
When the ship is on its course, the sensor readings do not change every second. Readings at such a fine granularity not only increase the size of the data but also lead to issues of degenerate solutions (do you remember the assumption of linear independence?).
Windowing is a widely used technique to deal with these issues in time series data. Let's assume that we fix a window of size 180 for the current experiment. We perform windowing using the following rules:
- Assume that every datapoint within a trip amounts to a reading on the tick of a clock. For instance, if there are 400 datapoints in a trip, the first 180 datapoints belong to window 0, the next 180 datapoints belong to window 1 and the rest belong to window 2.
- Two datapoints in two different trips cannot belong to the same window. For instance, suppose there are 80 datapoints, of which the first 40 belong to trip 0 and the rest belong to trip 1. In this case, the first 40 datapoints belong to window 0 and the rest belong to window 1.
Create a new column windowID in the dataframe. Write a short piece of code to populate it using the above procedure.
(3 points)
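The windowing rules can be sketched on a toy example as follows. This is one possible approach, not the only one; the dataframe `toy`, the window size of 2 and the zero-based numbering of windows across trips are assumptions made to keep the example small.

```python
import pandas as pd

# Hypothetical trips: three readings in trip 0, two readings in trip 1.
toy = pd.DataFrame({"tripID": [0, 0, 0, 1, 1]})
window_size = 2  # the assignment uses 180

# Position of each reading within its own trip: 0, 1, 2, 0, 1.
pos_in_trip = toy.groupby("tripID").cumcount()

# A new window starts at the beginning of every trip and after every
# window_size readings within a trip.
starts_window = (pos_in_trip % window_size) == 0

# Counting the starts gives window numbers 1, 2, ...; subtract 1 to begin at 0.
toy["windowID"] = starts_window.cumsum() - 1
print(toy["windowID"].tolist())  # [0, 0, 1, 2, 2]
```

Since the position counter restarts with every trip, a window never spans two trips, even when the last window of a trip is shorter than the window size.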
How many windows are there?
(1 point)
We have almost finished pre-processing the data! The assumption one makes when using windowing is that all datapoints within a window are similar, so they can be replaced by a single representative of the window. The representative can be a random point within the window or an aggregate such as the mean or median.
Our dataset for the machine learning model comprises datapoints that are representatives of the windows. Create a pandas dataframe dataset that contains a representative from every window. We use the mean of the datapoints within a window as its representative.
Hint: Groupby and aggregate! The length of dataset should be the same as the number of windows.
(2 points)
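A minimal sketch of the groupby-and-aggregate pattern on made-up data is shown below. The dataframe `toy` and its values are assumptions for illustration, and `as_index=False` is one choice that keeps windowID as an ordinary column rather than the index.

```python
import pandas as pd

# Hypothetical readings already labelled with a window.
toy = pd.DataFrame({
    "windowID": [0, 0, 1],
    "fuelVolumeFlowRate": [1.0, 3.0, 5.0],
})

# One row per window, holding the mean of every other column.
toy_means = toy.groupby("windowID", as_index=False).mean()
print(toy_means)
```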
dataset =
Since windowID, UNIX and tripID have served their purpose, we will remove these columns from the dataset.
dataset = dataset.drop(columns=['windowID', 'UNIX', 'tripID'])
Question 3: Prediction model
We will use the statsmodels library to train a prediction model.
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
Train a simple model to predict fuelVolumeFlowRate given portPitch, starboardPitch, windSpeed and portRudder readings as input. Print the summary of the model that provides results of the OLS regression.
Hint: OLS Regression with statsmodels
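For reference, here is a small sketch of fitting an OLS model with the statsmodels formula interface on synthetic data. The dataframe `toy`, the column names y, x1 and x2, and the coefficients used to generate the data are all made up for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data: y depends linearly on x1 and x2, plus a little noise.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "x1": rng.normal(size=200),
    "x2": rng.normal(size=200),
})
toy["y"] = 2.0 + 3.0 * toy["x1"] - 1.0 * toy["x2"] + rng.normal(scale=0.1, size=200)

# The formula names the response on the left and the predictors on the right;
# an intercept is included by default.
toy_fit = smf.ols("y ~ x1 + x2", data=toy).fit()

# summary() reports coefficients, standard errors, R-squared and more.
print(toy_fit.summary())
```

The fitted coefficients are available as `toy_fit.params` and the R-squared value as `toy_fit.rsquared`.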
(2 points)
model =
You observe that the R-squared value of the regression is around 0.7! Is there a way to improve it? You have learnt in class that residual plots might be helpful in such a case. We plot the residuals versus portPitch. Run the following code to see the plots.
fig = plt.figure(figsize=(12,8))
sm.graphics.plot_regress_exog(model, 'portPitch', fig=fig)
Can you suggest any change in the model based on the observation of the residual plot?
(1 point)
Implement the change and validate it (show that the proposed change fixes the "issues" you observed earlier!).
(3 points)
new_model =