Assignment 1 - Exploratory Data Analysis and SQLite
MTM5004 (AY2022/23)
European Maritime Safety Agency's Report
The European Maritime Safety Agency reports carbon dioxide emissions from ships, mrv.emsa.europa.eu , according to the European Community Regulation 2015/757. The data set
MRV 2019, provided in the CSV file MRV_2019.csv , contains technical data on vessel fuel consumption, carbon dioxide emissions, and technical efficiency. The dataset has 12,344 records and 14 variables.
The columns of the CSV file are: 'IMO Number Name', 'Ship type', 'Technical efficiency [gCO2/t·nm]', 'Port of Registry', 'DoC issue date', 'DoC expiry date', 'Verifier Name', 'Verifier Country', 'Total fuel consumption [m tonnes]', 'Total CO2 emissions [m tonnes]', 'Annual Total time spent at sea [hours]', 'Annual average Fuel consumption per distance [kg / n mile]', 'Annual average CO2 emissions per distance [kg CO2 / n mile]'.
They correspond to the following data.
1. IMO Number,
2. Name,
3. Ship type,
4. Technical efficiency (g/tM) 5. Port of Registry,
6. DoC issue,
7. DoC expiry,
8. Verifier Name, 9. Verifier Country,
10. Total fuel consumption (t),
11. Total CO2 emissions (t),
12. Annual Total time spent at sea (h),
13. Annual average Fuel consumption per distance (kg/M), 14. Annual average CO2 emissions per distance(kg/M).
The symbols for the units of measurements are M for nautical mile, g for gram, kg for kilogram, t for tonne, h for hours.
Each record corresponds to a vessel, usually a ship. The ship is identified by its matriculation number by the International Maritime Organization (IMO). Other fields of a ship's records indicate the name of the ship, its type, its technical efficiency, namely, the ship’s energy efficiency design index (EEDI) or its estimated value (EIV), its port of registry, the issue and expiry dates of its declaration of
Instructions to submit the assignment
- Name your jupyter notebook as `Assignment1_[StudentID].ipynb`.
- Your solution notebook must contain the python code that we can run to verify the answers.
conformity (DoC), the name and country of the verifier of the data in the record, its total consumption, its total carbon dioxide (CO2) emissions, its annual time at sea, and its annual average fuel consumption and annual average carbon dioxide emissions per distance.
It is recommended that you further research the above-mentioned terms and notions as they are frequently used in the maritime industry.
Part 1: Data Loading and Cleaning with Pandas
Let's begin by importing required libraries.
Inspect the dataset carefully and identify the anomalies in the dataset. Print the number of missing values in each column of the dataset.
import pandas as pd import numpy as np
Import the dataset from the table mrv2019 into a Pandas DataFrame. Load all the data as strings. Use the following names for the columns of the DataFrame.
1. 'IMO',
2. 'Name',
3. 'Type',
4. 'Technical Efficiency', 5. 'Port of Registry',
6. 'DoC issue',
7. 'DoC expiry',
8. 'Verifier Name',
9. 'Verifier Country',
10. 'Total Fuel',
11. 'Total CO2', 12. 'Time at Sea', 13. 'Fuel per Mile', 14. 'CO2 per Mile'.
You would have observed that the data even contains anomalous values such as Division by zero! and Not Applicable. Remove the rows with missing values and other anomalies.
Capitalise (change to uppercase) the names of the ports of registry.
Change the types of the columns 'Total Fuel', 'Total CO2', 'Time at Sea', 'Fuel per Mile', and 'CO2 per Mile' to pynum float type np.float64 .
Replace the column Technical Efficiency by two new columns Technical Efficiency Index and Technical Efficiency Value that contains the acronym of the technical efficiency index EEDI or EIV as a string and the value of the technical effieciency index as pynum float type
np.float64 ., respectively.
You would have observed that there are anomalied in the format of Technical Efficiency. We want to esure that the dataset contains only two indices EEDI and EIV . We do so by executing following code.
df = df[df['Technical Efficiency Index'].isin(['EEDI', 'EIV'])]
Part 2: Energy Efficiency Analysis
The technical energy efficiency or design energy efficiency of a vessel measures the CO2 emission relative to the amount of transport work done. The Energy Efficiency Design Index (EEDI) is commonly used to represent the technical efficiency of a vessel. The Estimated Index Value (EIV) is a simplified form of EEDI. Interested students are encouraged to read more about these values here (https://www.linkedin.com/pulse/eiv-another-irrelevant-shipping-acronym-poul-woodall).
Partition the dataset into two subsets, one for ships for which EEDI is available and another subset for EIV. To do so, create two Pandas DataFrames: df_eedi , which contains data for the ships for which Technical Efficiency Index is EEDI, and df_eiv , which contains data for the ships for which Technical Efficiency Index is EIV. Both of these DataFrames must be indexed by IMO number of the ships.
(2 points)
df_eedi = df[df['Technical Efficiency Index'].isin(['EEDI'])] df_eiv = df[df['Technical Efficiency Index'].isin(['EIV'])]
We import the Matplotlib library for drawing plots.
Remove the datapoints from df_eedi and df_eiv that have Technical Efficiency Value smaller than the lower bound and larger than the upper bound.
import matplotlib.pyplot as plt
We want to compare distribution of EEDI an EIV values. We used matplotlib boxplot to draw two boxplots side by side to each other as shown below (code not provided!). We do not see anything since the code also raised multiple errors. The errors are due to presence of outliers in the Technical Efficiency Values. Let's use the IQR method to remove such outliers.
EEDI versus EIV
IQR Method: The interquartile range (IQR) is calculated as the difference between the 75th and the 25th percentiles of the data. The IQR method considers observations with values 1.5 times the IQR below that of the 25th percentile or above that of the 75th percentile as outliers.
For each each DataFrame, df_eedi and df_eiv , calculate the lower and upper limit for the Technical Efficiency Value. The lower limit is 1.5 times of IQR below 25th percentile whereas the upper limit is 1.5 times of IQR above 75th percentile.
Write the code to draw boxplots to compare Technical Efficiency Values under two indices. The final visualisation must follow specifications in the boxplot shown earlier such as, two boxplots next to each other with a common Y-axis.
Generate the bivariate Pearson correlation matrix for the following five variables for each subset of the data between technical efficiency, fuel consumption, CO2 emissions, fuel consumption per mile, CO2 emissions per mile.
Generate the boxplot of Technical Efficiency Value for EEDI indexed data acroos different types of ships. This boxplots comprises of technical efficiency values on the Y-axis and ship types on the X- axis.
Qualitative Analysis
Please provide your observations/comments (up to two lines for each) on:
Boxplot
Correlation matrix
Technical efficiency values across ship types
Part 3: SQLite Integration
(4 points)
If we had done the data preparation and cleaning in a database with SQL, our code could scale to a much larger data set without putting much stress on the main memory of our computer. You may want to try and redo the above questions entirely in SQL with SQLite by first loading the original CSV file into a table, rawmrv2019 using sqlite3 interactive terminal command .import -csv MRV_2019.csv rawmrv2019 . This is not part of this assignment.
Instead, we can create a table and store the data from the data frame into the table. We use SQLite and the sqlite3 Python library.
import sqlite3
Create create and connect to a database mrv2019.db . In [ ]:
Create a table mrv2019 with appropriate domains to load the records of the data frame. Use the following names for the columns of the table. Choose the appropriate domain for the columns among
TEXT , DATE , and NUMERIC . We underscore instead of spaces for the column names in order to avoid having to escape double quotes in SQL and Python.
1. 'IMO',
2. 'Name',
3. 'Type',
4. 'Technical_Efficiency_Index', 5. 'Technical Efficiency_Value', 6. 'Port_of_Registry',
7. 'DoC_issue',
8. 'DoC_expiry',
connection = sqlite3.connect('mrv2019.db')
9. 'Verifier_Name', 10. 'Verifier_Country', 11. 'Total_Fuel',
12. 'Total_CO2',
13. 'Time_at_Sea', 14. 'Fuel_per_Mile', 15. 'CO2_per_Mile'.
Note that SQL is actually case insensitive. Capitalise SQL keywords for readability.
(1 point)
Insert the data from the DataFrame into the table.
cursor = connection.cursor()
for index, row in df.iterrows():
cursor.execute("INSERT INTO mrv2019 VALUES (? ,? ,? ,? ,? ,? ,? ,? ,? ,? ,? ,? ,? ,?, ?) ))
Let us do a simple exploratory analysis. In this dataset, we have different ships, each of which is registered in different port.
Now that the analysis is over, we should drop the table, commit changes to the database and gracefully close the database connections.
Write an SQL query that finds the ports with the top 21 highest number of ship registrations. Draw a barplot of the result. The X-axis is labeled by the names of the ports whereas the Y-axis is labeled by the corresponding number of ship registrations.
(3 points)
cursor = connection.cursor() cursor.execute("DROP TABLE mrv2019") cursor.close()
connection.commit() connection.close()