Exercise 1
1.1 Consider samples of a continuous random variable, drawn from an unknown distribution and stored in 'clean_data.npy'. Fit a univariate Gaussian distribution to this data, estimating its mean and variance with the maximum likelihood estimator (MLE). Report the estimated mean and variance. Overlay the resulting probability density function on the normalised histogram of the data.
(5 marks)
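A minimal sketch of the maximum likelihood fit asked for in 1.1, using synthetic samples as a stand-in for 'clean_data.npy'. The generating parameters (mean 5.0, standard deviation 2.0) and the helper names fit_gaussian_mle and gaussian_pdf are assumptions made for illustration only. For a Gaussian, the MLE of the mean is the sample mean and the MLE of the variance is the mean squared deviation, dividing by N rather than N - 1.

```python
import numpy as np


def fit_gaussian_mle(x):
    """Return the maximum likelihood estimates (mean, variance) of a 1-D Gaussian."""
    x = np.asarray(x, dtype=float).ravel()
    mu = x.mean()
    var = np.mean((x - mu) ** 2)  # MLE divides by N, not N - 1
    return mu, var


def gaussian_pdf(x, mu, var):
    """Evaluate the univariate Gaussian density with mean mu and variance var."""
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)


# Synthetic stand-in for np.load('clean_data.npy'); parameters are hypothetical.
rng = np.random.default_rng(0)
samples = rng.normal(loc=5.0, scale=2.0, size=10_000)

mu_hat, var_hat = fit_gaussian_mle(samples)

# Density evaluated on a grid wide enough to cover almost all the probability mass.
sigma_hat = np.sqrt(var_hat)
grid = np.linspace(mu_hat - 6.0 * sigma_hat, mu_hat + 6.0 * sigma_hat, 400)
pdf = gaussian_pdf(grid, mu_hat, var_hat)

print(f"estimated mean = {mu_hat:.3f}, estimated variance = {var_hat:.3f}")
```

To produce the overlay, the curve (grid, pdf) can be drawn on top of a histogram plotted with density normalisation, for example matplotlib's hist with density=True.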
1.2 Next, consider a 'corrupted' version of the data used in the previous exercise, stored in 'corrupted_data.npy'. This new data is contaminated by outliers from an unknown source. Repeat the process of fitting a univariate Gaussian distribution to this new data (using MLE) and report the estimated mean and variance of the distribution. Plot its probability density function over the range $[-10, 35]$. Comment on how the estimated Gaussian parameters have changed relative to the values estimated in exercise 1.1, and why.
(5 marks)
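The effect described in 1.2 can be illustrated on synthetic data. The sketch below is not the provided 'corrupted_data.npy': the inlier and outlier distributions and the 5% contamination level are assumptions chosen for illustration. Because the MLE weights every sample equally, a small cluster of distant outliers pulls the mean towards itself and inflates the variance.

```python
import numpy as np

rng = np.random.default_rng(1)
clean = rng.normal(loc=5.0, scale=2.0, size=1000)    # hypothetical inliers
outliers = rng.normal(loc=30.0, scale=1.0, size=50)  # hypothetical outliers
corrupted = np.concatenate([clean, outliers])


def fit_gaussian_mle(x):
    """Maximum likelihood estimates (mean, variance) of a 1-D Gaussian."""
    mu = x.mean()
    return mu, np.mean((x - mu) ** 2)


mu_clean, var_clean = fit_gaussian_mle(clean)
mu_corr, var_corr = fit_gaussian_mle(corrupted)

# Density of the Gaussian fitted to the corrupted data, over the range [-10, 35].
grid = np.linspace(-10.0, 35.0, 500)
pdf_corr = np.exp(-0.5 * (grid - mu_corr) ** 2 / var_corr) / np.sqrt(2.0 * np.pi * var_corr)

print(f"clean:     mean = {mu_clean:.2f}, variance = {var_clean:.2f}")
print(f"corrupted: mean = {mu_corr:.2f}, variance = {var_corr:.2f}")
```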
1.3 Fit a distribution to the corrupted data from exercise 1.2 in a manner that is robust to the outliers present. Demonstrate this robustness by comparing the probability density function of the robust fit with that of the univariate Gaussian fitted by MLE to the corrupted data. Explain briefly how your chosen approach achieves robustness.
(5 marks)
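One possible robust approach, shown here as a sketch and not necessarily the intended solution, replaces the mean and standard deviation with the median and the scaled median absolute deviation (MAD). Fitting a heavy-tailed Student's t distribution is another common choice. The synthetic data and the helper name robust_gaussian_fit are assumptions for illustration. Robustness comes from the fact that the median and MAD depend on the ordering of the samples, so a minority of extreme values barely moves them.

```python
import numpy as np


def robust_gaussian_fit(x):
    """Estimate Gaussian location and variance from the median and MAD."""
    x = np.asarray(x, dtype=float).ravel()
    loc = np.median(x)
    mad = np.median(np.abs(x - loc))
    scale = 1.4826 * mad  # factor that makes the MAD match the std for Gaussian data
    return loc, scale ** 2


# Synthetic stand-in for the corrupted data; all parameters are hypothetical.
rng = np.random.default_rng(1)
corrupted = np.concatenate([
    rng.normal(loc=5.0, scale=2.0, size=1000),  # inliers
    rng.normal(loc=30.0, scale=1.0, size=50),   # outliers
])

mu_mle = corrupted.mean()
var_mle = np.mean((corrupted - mu_mle) ** 2)
mu_rob, var_rob = robust_gaussian_fit(corrupted)

print(f"MLE:    mean = {mu_mle:.2f}, variance = {var_mle:.2f}")
print(f"robust: mean = {mu_rob:.2f}, variance = {var_rob:.2f}")
```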
Exercise 2
2.1 You are given a data array called "shape_array.npy" that comprises 7 samples organised as columns in the array. Each column vector is a 3D shape of a blood vessel of size $(N \times 3)$ that has been reshaped into a vector of size $(3N \times 1)$. Perform PCA (using the scikit-learn implementation) of the data array and extract the principal components (eigenvectors), and the singular values associated with each of the eigenvectors.
(5 marks)
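A minimal sketch of the scikit-learn step in 2.1, run on a random array standing in for "shape_array.npy" (the value N = 50 and the random contents are assumptions). scikit-learn expects samples as rows, so the array is transposed before fitting.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for np.load('shape_array.npy'): 7 shapes stored as columns.
rng = np.random.default_rng(0)
n_points, n_samples = 50, 7  # hypothetical N; the real N comes from the file
shape_array = rng.normal(size=(3 * n_points, n_samples))

pca = PCA()
pca.fit(shape_array.T)  # transpose so that each row is one shape

components = pca.components_            # one eigenvector per row, shape (7, 3N)
singular_values = pca.singular_values_  # one singular value per component

print(components.shape)
print(singular_values)
```

Since PCA centres the data, 7 samples span at most 6 independent directions, so the last singular value is expected to be numerically zero.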
2.2 Next, perform eigendecomposition of the covariance matrix estimated from the given data array. Report any differences you find between the PCA results from 2.1 and the eigendecomposition results, and briefly explain the reason for them. Find the new coordinates of each shape (i.e. each column in the data array) in the coordinate space defined by the estimated eigenvectors.
(5 marks)
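The eigendecomposition route in 2.2 can be sketched with NumPy alone. The random array below stands in for the real data, and an SVD of the centred data is used as the comparison point, since PCA is typically computed that way. Typical differences to look for are the ordering (np.linalg.eigh returns eigenvalues in ascending order), the sign of each eigenvector (arbitrary), and the scale of the spectrum: with covariance normalised by n - 1, each eigenvalue equals the squared singular value divided by n - 1.

```python
import numpy as np

# Synthetic stand-in for np.load('shape_array.npy'): 7 shapes stored as columns.
rng = np.random.default_rng(0)
n_points, n_samples = 50, 7  # hypothetical N
shape_array = rng.normal(size=(3 * n_points, n_samples))

mean_shape = shape_array.mean(axis=1, keepdims=True)
centred = shape_array - mean_shape

# np.cov treats rows as variables and columns as observations, dividing by n - 1.
cov = np.cov(shape_array)
eigvals, eigvecs = np.linalg.eigh(cov)  # ascending eigenvalues
order = np.argsort(eigvals)[::-1]       # reorder to descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = n_samples - 1                        # rank of the centred data
coords = centred.T @ eigvecs[:, :k]      # new coordinates, one row per shape

# SVD of the centred data for comparison.
_, s, vt = np.linalg.svd(centred.T, full_matrices=False)
alignment = np.abs(np.sum(eigvecs[:, :k] * vt[:k].T, axis=0))  # 1.0 means equal up to sign
```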
2.3 Reconstruct any one shape from the provided data array using (a) new coordinates estimated from PCA in 2.1 and (b) the new coordinates estimated using eigendecomposition in 2.2. Overlay the two resulting shapes and briefly comment on their similarity. Finally, in a couple of sentences explain why PCA is often described as an approach for dimensionality reduction/data compression.
(5 marks)
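A sketch of the two reconstructions in 2.3, again on a random stand-in array. The row-major reshape back to $(N \times 3)$ is an assumption about how the shapes were originally flattened. Individual coordinates may differ in sign between the two routes, but the signs cancel in the reconstruction, so both shapes should coincide to numerical precision.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for np.load('shape_array.npy'): 7 shapes stored as columns.
rng = np.random.default_rng(0)
n_points, n_samples = 50, 7  # hypothetical N
shape_array = rng.normal(size=(3 * n_points, n_samples))
X = shape_array.T            # one shape per row
k = n_samples - 1            # number of non-trivial components

# (a) Reconstruction from scikit-learn PCA coordinates.
pca = PCA(n_components=k)
coords_pca = pca.fit_transform(X)
recon_pca = pca.inverse_transform(coords_pca)[0]

# (b) Reconstruction from the covariance eigendecomposition.
mean_shape = X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
basis = eigvecs[:, np.argsort(eigvals)[::-1][:k]]
coords_eig = (X - mean_shape) @ basis
recon_eig = mean_shape + basis @ coords_eig[0]

# Back to (N, 3) point sets; row-major ordering is assumed here.
shape_pca = recon_pca.reshape(n_points, 3)
shape_eig = recon_eig.reshape(n_points, 3)
max_diff = np.max(np.abs(shape_pca - shape_eig))
print(f"largest coordinate difference: {max_diff:.2e}")
```

Each shape of 3N numbers is represented here by only k coordinates plus the shared mean and basis. Keeping fewer than k components gives an approximate reconstruction, which is the sense in which PCA compresses data.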
Exercise 3: Predict Cancer Mortality Rates in US Counties
The provided dataset comprises data collected from multiple counties in the US. The regression task for this assessment is to predict cancer mortality rates in "unseen" US counties, given some training data. The training data ('Training_data.csv') comprises various features/predictors related to socio-economic characteristics, amongst other types of information, for specific counties in the country. The corresponding target variables for the training set are provided in a separate CSV file ('Training_data_targets.csv'). Use the notebooks provided for lab sessions throughout this module to provide solutions to the exercises listed below. Throughout all exercises, text describing your code and answering any questions posed in the exercise descriptions should be included as part of your submitted solution.
The predictors/features available in this data set are described below:
Data Dictionary
avgAnnCount: Mean number of reported cases of cancer diagnosed annually
avgDeathsPerYear: Mean number of reported mortalities due to cancer
incidenceRate: Mean per capita (100,000) cancer diagnoses
medianIncome: Median income per county
popEst2015: Population of county
povertyPercent: Percent of populace in poverty
MedianAge: Median age of county residents
MedianAgeMale: Median age of male county residents
MedianAgeFemale: Median age of female county residents
AvgHouseholdSize: Mean household size of county
PercentMarried: Percent of county residents who are married
PctUnemployed16_Over: Percent of county residents ages 16 and over unemployed
PctPrivateCoverage: Percent of county residents with private health coverage
PctPrivateCoverageAlone: Percent of county residents with private health coverage alone (no public assistance)
PctEmpPrivCoverage: Percent of county residents with employee-provided private health coverage
PctPublicCoverage: Percent of county residents with government-provided health coverage
PctPubliceCoverageAlone: Percent of county residents with government-provided health coverage alone
PctWhite: Percent of county residents who identify as White
PctBlack: Percent of county residents who identify as Black
PctAsian: Percent of county residents who identify as Asian
PctOtherRace: Percent of county residents who identify in a category which is not White, Black, or Asian
PctMarriedHouseholds: Percent of married households
BirthRate: Number of live births relative to number of women in county
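import os
import pandas as pd

# Define paths to the training data and targets files.
# root_dir is assumed to be defined earlier as the folder containing the CSV files.
training_data_path = os.path.join(root_dir, 'Training_data.csv')
training_targets_path = os.path.join(root_dir, 'Training_data_targets.csv')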
Exercise 3.1
Read in the training data and targets files. The training data comprises features/predictors while the targets file comprises the targets (i.e. cancer mortality rates in US counties) you need to train models to predict. Plot histograms of all features to visualise their distributions and identify outliers. Do you notice any unusual values for any of the features? If so, comment on these in the text accompanying your code. Compute the correlation of each feature with the target variable and identify the five features most strongly correlated with the targets. Plot these correlations using the scatter matrix plotting function available in pandas and comment on at least two sets of features that show visible correlations with each other.
(5 marks)
- There seem to be errors/outliers in the median age feature (MedianAge), with values >> 100. This is clearly an error and needs to be corrected prior to fitting regression models. (1.5 marks for code above and this discussion)
- The top five features with the strongest correlations to the targets are: incidenceRate, PctBachDeg25_Over, PctPublicCoverageAlone, medIncome and povertyPercent. (2 marks for this description and code above)
- medIncome and povertyPercent are negatively correlated with each other, as you would expect.
- povertyPercent and PctBachDeg25_Over are also negatively correlated, highlighting that counties with higher degrees of poverty have fewer Bachelor graduates by the age of 25. povertyPercent also shows a strong positive correlation with PctPublicCoverageAlone, indicating that poverty-stricken counties are less likely to be able to afford private healthcare coverage.
- Similarly, PctBachDeg25_Over is negatively correlated with PctPublicCoverageAlone and positively correlated with medIncome. (1.5 marks for discussion of at least two sets of features that show correlations and code above)
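The ranking of features by correlation with the targets can be sketched as follows. The data frame here is synthetic, and the column names feat_a, feat_b, feat_c and target are hypothetical placeholders for the real training features and targets.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the training features and targets; names are hypothetical.
rng = np.random.default_rng(0)
n = 500
features = pd.DataFrame({
    "feat_a": rng.normal(size=n),
    "feat_b": rng.normal(size=n),
    "feat_c": rng.normal(size=n),
})
target = pd.Series(
    3.0 * features["feat_a"] - 1.0 * features["feat_b"] + rng.normal(scale=0.5, size=n),
    name="target",
)

# Pearson correlation of every feature with the target, ranked by absolute value
# so that strong negative correlations are not pushed to the bottom.
correlations = features.corrwith(target)
ranked = correlations.reindex(correlations.abs().sort_values(ascending=False).index)
top_features = ranked.index[:2].tolist()

print(ranked)
```

On the real data, taking the first five entries of the ranking and passing those columns (together with the target) to pandas.plotting.scatter_matrix gives the requested scatter matrix.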
Exercise 3.2
Create an ML pipeline using scikit-learn (as demonstrated in the lab notebooks) to pre-process the training data. (5 marks)
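A minimal sketch of a pre-processing pipeline, under assumptions: the data frame below is synthetic, only two columns are shown, and treating MedianAge values above 100 as missing before median imputation is one possible way to handle the errors noted in 3.1, not necessarily the approach used in the lab notebooks.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for two columns of 'Training_data.csv'; values are made up.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "MedianAge": rng.normal(40.0, 5.0, size=200),
    "povertyPercent": rng.normal(17.0, 6.0, size=200),
})
df.loc[:4, "MedianAge"] = 500.0  # inject implausible ages, as seen in the real data

# Treat implausible ages as missing so that the imputer can replace them.
df["MedianAge"] = df["MedianAge"].where(df["MedianAge"] <= 100.0)

preprocess = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # fill missing values per column
    ("scale", StandardScaler()),                   # zero mean, unit variance
])
X_prepared = preprocess.fit_transform(df)
print(X_prepared.shape)
```

Fitting the pipeline on the training data only and reusing it unchanged on unseen counties keeps the test data from leaking into the imputation and scaling statistics.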
Exercise 3.3
Fit linear regression models to the pre-processed data using: Ordinary least squares (OLS), Lasso and Ridge models. Choose suitable regularisation weights for Lasso and Ridge regression and include a description in text of how they were chosen. Quantitatively compare your results from all three models and report the best performing one. Report the overall performance of the best regression model identified. Include code for all steps above. (10 marks)
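One way to choose the regularisation weights and compare the three models is a cross-validated grid search, sketched below on a synthetic regression problem. The data from make_regression, the alpha grid and the 5-fold setting are all assumptions for illustration. Each model is scored by cross-validated root mean squared error (RMSE).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic stand-in for the pre-processed training data and targets.
X, y = make_regression(n_samples=300, n_features=20, n_informative=8,
                       noise=10.0, random_state=0)

alphas = np.logspace(-3, 2, 12)  # hypothetical grid of regularisation weights
results = {}

# OLS has no regularisation weight, so it is scored directly.
ols_mse = -cross_val_score(LinearRegression(), X, y, cv=5,
                           scoring="neg_mean_squared_error")
results["OLS"] = {"alpha": None, "rmse": float(np.sqrt(ols_mse.mean()))}

# Lasso and Ridge: pick the alpha with the lowest cross-validated error.
for name, model in [("Lasso", Lasso(max_iter=10_000)), ("Ridge", Ridge())]:
    search = GridSearchCV(model, {"alpha": alphas}, cv=5,
                          scoring="neg_mean_squared_error")
    search.fit(X, y)
    results[name] = {
        "alpha": float(search.best_params_["alpha"]),
        "rmse": float(np.sqrt(-search.best_score_)),
    }

best_model = min(results, key=lambda name: results[name]["rmse"])
for name, res in results.items():
    print(f"{name}: alpha = {res['alpha']}, CV RMSE = {res['rmse']:.3f}")
print("best:", best_model)
```

On the real data, placing the pre-processing pipeline and the regressor in a single scikit-learn Pipeline before the search ensures that imputation and scaling are refitted inside each cross-validation fold.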