Assignment 2: Predicting fuel consumption of the ship

Instructions to submit the assignment

  • Submit your assignment before 12th February, 11:59 pm, in Canvas under Assignment 2.
  • Write your code in the codeblocks provided after respective questions.
  • While submitting your notebook, please run the code. Submit without deleting your outputs.
  • This is an individual assignment. Write your own answers and refrain from working in groups.

In this assignment, we will learn how to implement a multiple regression model using the statsmodels library in Python. scikit-learn is an equally popular machine learning toolbox. We prefer the former over the latter because of the functionality it provides for analysing the results. You can install statsmodels by running the following command in your Anaconda shell:

conda install -c conda-forge statsmodels

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Question 1: Loading the data

We will use a publicly available dataset, which was used to develop statistical models for analyzing the dynamics of ocean-going vessels, especially with regard to modelling fuel efficiency. You can refer to the dataset and the metadata at this website. Let us start exploring the dataset!

We have downloaded the dataset for you. We have taken a small subset of the original dataset after merging different files on the website. We intend to provide a prototype that you can use to gain a better understanding of regression. The dataset is available in Assignment_2 under Files in LumiNUS.

Create a single dataframe in pandas by reading the dataset from the CSV file. Print the names of the columns and the total number of datapoints. (2 points)

df = 
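As a rough illustration of the loading step, the sketch below reads a tiny in-memory CSV. The CSV contents and the column selection are invented for the example (only tstamp, fuelVolumeFlowRate and portPitch are names taken from this assignment), and the real file name is not given here, so it is left as a placeholder in the comment.

```python
import io
import pandas as pd

# Hypothetical stand-in for the assignment CSV; the values are made up.
csv_text = """tstamp,fuelVolumeFlowRate,portPitch
2010-02-15 08:00:00,1.20,3.1
2010-02-15 08:00:01,,3.2
2010-02-15 08:00:02,1.25,3.2
"""

# With the real file this would be pd.read_csv("<path to the CSV file>").
df = pd.read_csv(io.StringIO(csv_text))

print(list(df.columns))  # names of the columns
print(len(df))           # total number of datapoints (rows)
```

An empty CSV field is read as NaN, which is why the second row of this toy frame has a missing fuelVolumeFlowRate.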

Question 2: Data preprocessing

Data preprocessing is a vital step before the data is fed to a machine learning model.

We observe a lot of NaN values.

Print the total number of NaNs present in every column of the dataframe. Then keep only those datapoints in df that have a non-NaN value for fuelVolumeFlowRate. (2 points)
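The two operations can be sketched on a toy frame as follows. The frame df_demo and its values are assumptions made for illustration; only the column names come from the assignment.

```python
import numpy as np
import pandas as pd

# Toy data with missing values (values are invented).
df_demo = pd.DataFrame({
    "fuelVolumeFlowRate": [1.2, np.nan, 1.3, np.nan],
    "portPitch": [3.1, 3.2, np.nan, 3.0],
})

# isna() marks missing cells; summing the booleans counts them per column.
nan_counts = df_demo.isna().sum()
print(nan_counts)

# Keep only rows where fuelVolumeFlowRate is present.
# NaNs in other columns are left untouched.
df_demo = df_demo.dropna(subset=["fuelVolumeFlowRate"])
print(len(df_demo))
```

Restricting dropna to a subset matters here: dropping every row that has a NaN in any column could discard far more data than the task asks for.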

Although tstamp is in a human-readable format, it is not convenient for a computer to work with. We will now create a column UNIX that contains tstamp in the UNIX timestamp format, which is simply an integer: the number of seconds elapsed since 1970-01-01 00:00:00. Such a timestamp is easy to use in arithmetic operations.

df['UNIX'] = (pd.to_datetime(df['tstamp']) -  pd.Timestamp("1970-01-01")) // pd.Timedelta('1s')  
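To see what this expression produces, here is a small check on three hand-picked timestamps (chosen for illustration, not taken from the dataset). Subtracting the epoch gives a Timedelta, and floor-dividing by one second turns it into an integer count of seconds.

```python
import pandas as pd

# Hand-picked timestamps; ts is a hypothetical name for this example.
ts = pd.Series(pd.to_datetime([
    "1970-01-01 00:00:00",
    "1970-01-01 00:01:00",
    "2000-01-01 00:00:00",
]))

# Same expression as in the assignment, applied to the toy series.
unix = (ts - pd.Timestamp("1970-01-01")) // pd.Timedelta("1s")
print(unix.tolist())
```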

A closer look at the data will reveal that it comprises multiple trips (try to read a few timestamps!). There are certain breakpoints in the series for which readings are not available. For the purpose of analysing the data, we want to assign a tripID to every datapoint. If the time difference between two consecutive readings is more than a certain threshold, we say that the two readings belong to two different trips.

Suppose that the tripID of the first datapoint is 0. If the time difference between the first and second datapoint is more than the threshold, then the second datapoint should be given a tripID of 1; otherwise the second datapoint also belongs to trip 0.

Create a new column tripID in the dataframe. Write a short piece of code to populate it, choosing the threshold to be 1E-7.

Hint: Read about the function pct_change and use it on df['UNIX'].

(3 points)
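One possible way to turn the hint into code is sketched below on synthetic timestamps. The frame trips and its values are invented, and the sketch interprets the threshold as a bound on the relative change returned by pct_change, which is an assumption about the intended reading of the task.

```python
import pandas as pd

THRESHOLD = 1e-7

# Synthetic UNIX timestamps in seconds: three bursts of 1 Hz readings
# separated by gaps of roughly an hour (values are invented).
trips = pd.DataFrame({"UNIX": [
    1_265_000_000, 1_265_000_001, 1_265_000_002,
    1_265_003_600, 1_265_003_601,
    1_265_007_200,
]})

# pct_change gives (x[i] - x[i-1]) / x[i-1]; the first entry is NaN.
rel_gap = trips["UNIX"].pct_change()

# A new trip starts wherever the relative gap exceeds the threshold.
# NaN > THRESHOLD is False, so the first row stays in trip 0.
new_trip = rel_gap > THRESHOLD

# The running sum of the boolean flags yields 0, 0, 0, 1, 1, 2.
trips["tripID"] = new_trip.cumsum()

n_trips = trips["tripID"].nunique()
print(trips["tripID"].tolist(), n_trips)
```

Because pct_change is relative to the previous timestamp, the absolute gap that a threshold of 1E-7 corresponds to depends on the magnitude of the timestamps; for the synthetic values above it is a little over two minutes.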

How many trips are there?

(1 point)

When the ship is on its course, the sensor readings do not change every second. Readings at such a fine granularity not only increase the size of the data but also lead to degenerate solutions (do you remember the assumption of linear independence?).

Windowing is a widely used technique to deal with these issues in time series data. Let's assume that we fix a window of size 180 for the current experiment. We perform windowing using the following rules:

  • Assume that every datapoint within a trip is a reading on the tick of a clock. For instance, if there are 400 datapoints in a trip, the first 180 datapoints belong to window 0, the next 180 datapoints belong to window 1 and the rest belong to window 2.
  • Two datapoints in two different trips cannot belong to the same window. For instance, suppose there are 80 datapoints, of which the first 40 belong to trip 0 and the rest belong to trip 1. In this case, the first 40 datapoints belong to window 0 and the rest belong to window 1.

Create a new column windowID in the dataframe. Write a short piece of code to populate it using the above procedure.

(3 points)
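The windowing rules can be sketched as follows, assuming windows are numbered consecutively across trips starting from 0. The frame wdf is synthetic: 400 readings in trip 0 followed by 40 readings in trip 1.

```python
import pandas as pd

WINDOW = 180

# Synthetic trips (invented sizes): 400 readings in trip 0, 40 in trip 1.
wdf = pd.DataFrame({"tripID": [0] * 400 + [1] * 40})

# Position of each reading inside its own trip: 0, 1, 2, ...
pos_in_trip = wdf.groupby("tripID").cumcount()

# Window index local to the trip: 0 for the first 180 readings, then 1, ...
local_window = pos_in_trip // WINDOW

# Number every distinct (tripID, local_window) pair in sorted order,
# so that a window never spans two trips.
wdf["windowID"] = wdf.groupby(["tripID", local_window]).ngroup()

n_windows = wdf["windowID"].nunique()
print(wdf["windowID"].value_counts().sort_index().tolist(), n_windows)
```

Computing the position within each trip first is what enforces the second rule: the counter restarts at every trip boundary, so the last, shorter window of a trip is never topped up with readings from the next trip.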

How many windows are there?

(1 point)

We have almost finished pre-processing the data! The assumption one makes when using windowing is that all datapoints within a window are similar, so they can be replaced by a single representative of the window. The representative can be a random point within the window or an aggregate such as the mean or median.

Our dataset for the machine learning model comprises datapoints that are representatives of the windows. Create a pandas dataframe dataset that contains one representative from every window. We use the mean of the datapoints within a window as the representative of that window.

Hint: Groupby and aggregate! The length of the dataset should be the same as the number of windows.

(2 points)

dataset = 
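A minimal sketch of the groupby-and-aggregate idea on invented readings is shown below. The names readings and window_means are hypothetical, and as_index=False is one design choice among several: it keeps windowID as an ordinary column rather than moving it to the index.

```python
import pandas as pd

# Invented readings spread over two windows.
readings = pd.DataFrame({
    "windowID": [0, 0, 0, 1, 1],
    "fuelVolumeFlowRate": [1.0, 2.0, 3.0, 4.0, 6.0],
    "portPitch": [10.0, 10.0, 13.0, 20.0, 22.0],
})

# One row per window, each column replaced by its mean within the window.
# numeric_only=True skips non-numeric columns (such as a text timestamp).
window_means = readings.groupby("windowID", as_index=False).mean(numeric_only=True)
print(window_means)
```

The number of rows in the result equals the number of distinct windowID values, which is the sanity check suggested by the hint.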

Since the job of windowID and tripID is over, we will remove these columns from the dataset.

# Column names match those created above: windowID, UNIX and tripID.
dataset = dataset.drop(columns=['windowID', 'UNIX', 'tripID'])

Question 3: Prediction model

We will use the statsmodels library to train a prediction model.

import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

Train a simple model to predict fuelVolumeFlowRate given portPitch, starboardPitch, windSpeed and portRudder readings as input. Print the summary of the model that provides results of the OLS regression.

Hint: OLS Regression with statsmodels
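For orientation, the formula interface of statsmodels can be exercised on synthetic data as sketched below. The data are randomly generated with made-up coefficients, so the numbers in the summary say nothing about the ship dataset; demo and demo_model are hypothetical names.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic predictors (random values, not the ship data).
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "portPitch": rng.normal(size=n),
    "starboardPitch": rng.normal(size=n),
    "windSpeed": rng.normal(size=n),
    "portRudder": rng.normal(size=n),
})

# Made-up linear relationship plus a little noise.
demo["fuelVolumeFlowRate"] = (
    1.0 + 2.0 * demo["portPitch"] - 0.5 * demo["windSpeed"]
    + rng.normal(scale=0.1, size=n)
)

# The formula lists the response on the left and the predictors on the right;
# an intercept is included automatically.
demo_model = smf.ols(
    "fuelVolumeFlowRate ~ portPitch + starboardPitch + windSpeed + portRudder",
    data=demo,
).fit()

print(demo_model.summary())
```

The object returned by fit() is the results object; it carries the summary, the coefficients (params) and R-squared (rsquared), and is also what the plotting helpers in sm.graphics expect.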

(2 points)

model = 

You observe that the R-squared value of the regression is around 0.7! Is there a way to improve it? You have learnt in class that residual plots might be helpful in such a case. We make a plot of residuals versus portPitch. Run the following code to see the plots.

fig = plt.figure(figsize=(12,8))
sm.graphics.plot_regress_exog(model, 'portPitch', fig=fig)

Can you suggest any change in the model based on the observation of the residual plot?

(1 point)

Implement the change and validate it (show that the proposed change fixes the "issues" you observed earlier!).

(3 points)

new_model = 
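The text does not state what the residual plot shows, so the following is only a generic illustration of how a change to a model can be implemented and validated. It assumes, hypothetically, that the residuals trace a curve against a predictor, in which case a common remedy is to add a squared term. The data are synthetic and quadratic by construction; curved, linear_fit and quad_fit are invented names.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data with a built-in quadratic dependence (not the ship data).
rng = np.random.default_rng(1)
n = 300
curved = pd.DataFrame({"portPitch": rng.uniform(-3, 3, size=n)})
curved["fuelVolumeFlowRate"] = (
    2.0 + 0.5 * curved["portPitch"] + 1.5 * curved["portPitch"] ** 2
    + rng.normal(scale=0.5, size=n)
)

# Straight-line fit: misses the curvature, which stays in the residuals.
linear_fit = smf.ols("fuelVolumeFlowRate ~ portPitch", data=curved).fit()

# I(...) tells the formula parser to evaluate the expression arithmetically,
# which adds the squared predictor as an extra regressor.
quad_fit = smf.ols(
    "fuelVolumeFlowRate ~ portPitch + I(portPitch ** 2)", data=curved
).fit()

print(linear_fit.rsquared, quad_fit.rsquared)
```

Validation in this setting means comparing the two fits: R-squared should rise, and a residual plot of the second fit against the predictor should no longer show a systematic pattern. Whether a squared term is the right change for the actual data depends on what the real residual plot looks like.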
