COMP3611 Machine Learning - Assessment: Gaussian Distribution, PCA and Predict Cancer Mortality Rates in US counties

Exercise 1

1.1 The file 'clean_data.npy' stores samples of a continuous random variable drawn from an unknown distribution. Fit a univariate Gaussian distribution to this data, estimating its mean and variance with the maximum likelihood estimator (MLE). Report the estimated mean and variance. Overlay the resulting probability density function on the normalised histogram of the data.

(5 marks)
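A minimal sketch of the MLE fit described in 1.1 follows. It uses synthetic samples as a stand-in for `np.load('clean_data.npy')`, so the numbers it prints are not the assignment's answers. The helper names `fit_gaussian_mle` and `gaussian_pdf` are hypothetical.

```python
import numpy as np


def fit_gaussian_mle(x):
    """Return the maximum likelihood estimates (mean, variance) of a 1-D sample."""
    x = np.asarray(x, dtype=float).ravel()
    mu = x.mean()
    var = np.mean((x - mu) ** 2)  # the MLE divides by N, not N - 1
    return mu, var


def gaussian_pdf(x, mu, var):
    """Density of a univariate Gaussian with mean mu and variance var."""
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)


# Assumption: synthetic data standing in for np.load('clean_data.npy').
rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=5000)

mu_hat, var_hat = fit_gaussian_mle(data)

# Normalised histogram: bar heights integrate to 1, so they are comparable to a pdf.
density, edges = np.histogram(data, bins=40, density=True)
centres = 0.5 * (edges[:-1] + edges[1:])
pdf_at_centres = gaussian_pdf(centres, mu_hat, var_hat)

print(f"mean = {mu_hat:.3f}, variance = {var_hat:.3f}")
```

To produce the overlay itself, one option is matplotlib: `plt.hist(data, bins=40, density=True)` followed by `plt.plot(xs, gaussian_pdf(xs, mu_hat, var_hat))` for a grid `xs` covering the data range.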

1.2 Next, consider a 'corrupted' version of the data used in the previous exercise, stored in 'corrupted_data.npy'. This data is contaminated with outliers from an unknown source. Repeat the process of fitting a univariate Gaussian distribution to this new data (using MLE) and report the estimated mean and variance. Plot its probability density function over the range $[-10, 35]$. Comment on how the estimated parameters have changed relative to the values estimated in exercise 1.1, and why.

(5 marks)
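The effect that 1.2 asks about can be illustrated on synthetic data. In the sketch below, the clean sample and the contamination (5% of points drawn uniformly from $[25, 35]$) are assumptions made purely for illustration, since the true source of the outliers is unknown.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumption: synthetic clean sample plus hypothetical one-sided outliers.
clean = rng.normal(loc=5.0, scale=2.0, size=1000)
outliers = rng.uniform(low=25.0, high=35.0, size=50)
corrupted = np.concatenate([clean, outliers])


def mle(x):
    # np.var uses ddof=0 by default, which is the maximum likelihood estimate.
    return x.mean(), x.var()


mu_clean, var_clean = mle(clean)
mu_corr, var_corr = mle(corrupted)

# Evaluate the fitted density on the range requested in the exercise.
xs = np.linspace(-10.0, 35.0, 500)
pdf_corr = np.exp(-0.5 * (xs - mu_corr) ** 2 / var_corr) / np.sqrt(2.0 * np.pi * var_corr)

print(f"clean:     mean = {mu_clean:.2f}, variance = {var_clean:.2f}")
print(f"corrupted: mean = {mu_corr:.2f}, variance = {var_corr:.2f}")
```

Because the MLE mean is an average and the MLE variance is an average of squared deviations, every point contributes equally or quadratically, so a few distant points pull the mean towards them and inflate the variance, flattening the fitted curve.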

1.3 Fit a distribution to the corrupted data from exercise 1.2 in a manner that is robust to the outliers present. Demonstrate this robustness by comparing the probability density functions of the robust fit and the MLE Gaussian fit on the corrupted data. Briefly explain how your chosen approach achieves robustness.

(5 marks)
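One possible robust approach, shown here only as a sketch, replaces the mean with the median and the standard deviation with a scaled median absolute deviation (MAD). This is not necessarily the method the module intends; alternatives such as fitting a Student's t distribution are equally valid. The data is synthetic and `fit_gaussian_robust` is a hypothetical helper name.

```python
import numpy as np


def fit_gaussian_robust(x):
    """Estimate Gaussian location and variance from the median and the MAD."""
    x = np.asarray(x, dtype=float).ravel()
    loc = np.median(x)
    mad = np.median(np.abs(x - loc))
    scale = 1.4826 * mad  # factor that makes the MAD match the std for Gaussian data
    return loc, scale ** 2


# Assumption: synthetic corrupted sample standing in for 'corrupted_data.npy'.
rng = np.random.default_rng(1)
clean = rng.normal(loc=5.0, scale=2.0, size=1000)
outliers = rng.uniform(low=25.0, high=35.0, size=50)
corrupted = np.concatenate([clean, outliers])

mu_mle, var_mle = corrupted.mean(), corrupted.var()
mu_rob, var_rob = fit_gaussian_robust(corrupted)

print(f"MLE:    mean = {mu_mle:.2f}, variance = {var_mle:.2f}")
print(f"robust: mean = {mu_rob:.2f}, variance = {var_rob:.2f}")
```

The median and the MAD depend on the ordering of the samples rather than their magnitudes, so a small fraction of extreme values moves them only slightly.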

Exercise 2

2.1 You are given a data array called "shape_array.npy" that comprises 7 samples organised as columns in the array. Each column is a 3D blood-vessel shape of size $(N \times 3)$ that has been flattened into a vector of size $(3N \times 1)$. Perform PCA on the data array (using the scikit-learn implementation) and extract the principal components (eigenvectors) and the singular value associated with each eigenvector.

(5 marks)
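The sketch below shows one way to run scikit-learn's `PCA` on an array laid out as in 2.1. A random array with $N = 50$ stands in for `np.load('shape_array.npy')`; the value of $N$ and the data are assumptions. Note the transpose: scikit-learn expects samples as rows.

```python
import numpy as np
from sklearn.decomposition import PCA

# Assumption: random stand-in for np.load('shape_array.npy'), shape (3N, 7).
rng = np.random.default_rng(0)
n_points, n_shapes = 50, 7
shape_array = rng.normal(size=(n_points * 3, n_shapes))

X = shape_array.T  # (7, 3N): one shape per row, as scikit-learn expects

pca = PCA()  # keeps min(n_samples, n_features) = 7 components by default
scores = pca.fit_transform(X)  # coordinates of each shape in the PCA basis

components = pca.components_            # (7, 3N): each row is a principal axis
singular_values = pca.singular_values_  # one singular value per component

print(components.shape, singular_values.round(3))
```

Since PCA centres the data first, 7 samples span at most 6 directions, so the last singular value is expected to be numerically zero.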

2.2 Next, perform eigendecomposition of the covariance matrix estimated from the given data array. Report any differences you find between the results of the two approaches (PCA in 2.1 and eigendecomposition here) and briefly explain the reason for them. Find the new coordinates of each shape (i.e. each column in the data array) in the coordinate space defined by the estimated eigenvectors.

(5 marks)
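As a sketch of 2.2 using NumPy only, the code below eigendecomposes the sample covariance matrix and compares it with an SVD of the centred data, which is the route scikit-learn's `PCA` takes. The array is a random stand-in for the real shapes. The two results are expected to agree up to the sign of each eigenvector, with eigenvalues $\lambda_i = s_i^2 / (n - 1)$ for singular values $s_i$ and $n$ samples.

```python
import numpy as np

# Assumption: random stand-in for np.load('shape_array.npy'), shape (3N, 7).
rng = np.random.default_rng(0)
shape_array = rng.normal(size=(150, 7))
n_shapes = shape_array.shape[1]

mean_shape = shape_array.mean(axis=1, keepdims=True)
centred = shape_array - mean_shape  # (3N, 7), columns are centred shapes

# Sample covariance between the 3N coordinates (same as np.cov(shape_array)).
cov = centred @ centred.T / (n_shapes - 1)

eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]       # reorder to descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = n_shapes - 1  # centring leaves at most n - 1 non-zero eigenvalues
coords = eigvecs[:, :k].T @ centred  # (k, 7): new coordinates of each shape

# SVD of the centred data; columns of U are the principal axes here.
U, s, _ = np.linalg.svd(centred, full_matrices=False)

# Absolute cosine between matching axes: 1.0 means equal up to sign.
alignment = np.abs(np.sum(eigvecs[:, :k] * U[:, :k], axis=0))
print(alignment.round(6))
```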

2.3 Reconstruct any one shape from the provided data array using (a) the new coordinates estimated from PCA in 2.1 and (b) the new coordinates estimated using eigendecomposition in 2.2. Overlay the two resulting shapes and briefly comment on their similarity. Finally, in a couple of sentences, explain why PCA is often described as an approach for dimensionality reduction/data compression.

(5 marks)
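Reconstruction amounts to mapping the low-dimensional coordinates back through the principal axes and adding the mean shape. The sketch below does this on a random stand-in array, with a hypothetical helper `reconstruct`. It assumes the original $(N \times 3)$ shapes were flattened row by row, which should be checked against the real data before reshaping.

```python
import numpy as np

# Assumption: random stand-in for np.load('shape_array.npy'), shape (3N, 7).
rng = np.random.default_rng(0)
shape_array = rng.normal(size=(150, 7))

mean_shape = shape_array.mean(axis=1, keepdims=True)
centred = shape_array - mean_shape
U, s, _ = np.linalg.svd(centred, full_matrices=False)  # columns of U: principal axes


def reconstruct(index, k):
    """Rebuild shape `index` from its first k principal coordinates."""
    W = U[:, :k]
    z = W.T @ centred[:, index]  # k coordinates instead of 3N numbers
    return mean_shape[:, 0] + W @ z


full = reconstruct(0, 6)    # all non-trivial components: exact up to rounding
approx = reconstruct(0, 2)  # compressed: only 2 coordinates kept

err_full = np.linalg.norm(full - shape_array[:, 0])
err_approx = np.linalg.norm(approx - shape_array[:, 0])

# Assumed row-major flattening: recover an (N, 3) array of 3D points.
shape_3d = full.reshape(-1, 3)
print(f"error with 6 components: {err_full:.2e}, with 2: {err_approx:.2e}")
```

The compression view follows directly: each shape is stored as `k` coordinates plus the shared mean and axes, and the reconstruction error grows as `k` shrinks.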

Exercise 3: Predict Cancer Mortality Rates in US Counties

The provided dataset comprises data collected from multiple counties in the US. The regression task for this assessment is to predict cancer mortality rates in "unseen" US counties, given some training data. The training data ('Training_data.csv') comprises various features/predictors describing socio-economic characteristics, amongst other types of information, for specific counties. The corresponding target variables for the training set are provided in a separate CSV file ('Training_data_targets.csv'). Use the notebooks provided for lab sessions throughout this module to develop solutions to the exercises listed below. For every exercise, your submitted solution should include text that describes your code and answers any questions posed in the exercise description.

The predictors/features available in this data set are described below:

Data Dictionary

avgAnnCount: Mean number of reported cases of cancer diagnosed annually

avgDeathsPerYear: Mean number of reported mortalities due to cancer

incidenceRate: Mean per capita (100,000) cancer diagnoses

medianIncome: Median income per county

popEst2015: Population of county

povertyPercent: Percent of populace in poverty

MedianAge: Median age of county residents

MedianAgeMale: Median age of male county residents

MedianAgeFemale: Median age of female county residents

AvgHouseholdSize: Mean household size of county

PercentMarried: Percent of county residents who are married

PctUnemployed16_Over: Percent of county residents ages 16 and over unemployed

PctPrivateCoverage: Percent of county residents with private health coverage

PctPrivateCoverageAlone: Percent of county residents with private health coverage alone (no public assistance)

PctEmpPrivCoverage: Percent of county residents with employee-provided private health coverage

PctPublicCoverage: Percent of county residents with government-provided health coverage

PctPublicCoverageAlone: Percent of county residents with government-provided health coverage alone

PctWhite: Percent of county residents who identify as White

PctBlack: Percent of county residents who identify as Black

PctAsian: Percent of county residents who identify as Asian

PctOtherRace: Percent of county residents who identify in a category which is not White, Black, or Asian

PctMarriedHouseholds: Percent of married households

BirthRate: Number of live births relative to number of women in county

import os

import pandas as pd

# root_dir is not defined in the brief; it is assumed here to be the folder holding the CSV files
root_dir = '.'

# Define paths to the training data and targets files
training_data_path = os.path.join(root_dir, 'Training_data.csv')
training_targets_path = os.path.join(root_dir, 'Training_data_targets.csv')

Exercise 3.1

Read in the training data and targets files. The training data comprises features/predictors while the targets file comprises the targets (i.e. cancer mortality rates in US counties) you need to train models to predict. Plot histograms of all features to visualise their distributions and identify outliers. Do you notice any unusual values for any of the features? If so, comment on these in the text accompanying your code. Compute the correlations of all features with the targets and report the five features most strongly correlated with them. Plot the correlations between these features using the scatter matrix plotting function available in pandas and comment on at least two sets of features that show visible correlations to each other.

(5 marks)

  • The median age feature (MedianAge) contains values far above 100. These are clearly errors and need to be corrected before fitting regression models. (1.5 marks for code above and this discussion)

  • The five features most strongly correlated with the targets are: incidenceRate, PctBachDeg25_Over, PctPublicCoverageAlone, medIncome and povertyPercent. (2 marks for this description and code above)

  • medIncome and povertyPercent are negatively correlated with each other, as you would expect.

  • povertyPercent and PctBachDeg25_Over are also negatively correlated, indicating that counties with more poverty have fewer residents aged 25 and over with a Bachelor's degree. povertyPercent also shows a strong positive correlation with PctPublicCoverageAlone, indicating that residents of poverty-stricken counties are less likely to be able to afford private healthcare coverage.

  • Similarly, PctBachDeg25_Over is negatively correlated with PctPublicCoverageAlone and positively correlated with medIncome. (1.5 marks for discussion of at least two sets of features that show correlations and code above)
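The checks discussed in the points above can be sketched with pandas as follows. The small DataFrame is synthetic and only mimics three of the columns; the injected MedianAge values and the target series name `target` are hypothetical. With the real files, `df` and `targets` would come from `pd.read_csv` on the two paths defined earlier.

```python
import numpy as np
import pandas as pd

# Assumption: synthetic stand-in for the training features and targets.
rng = np.random.default_rng(0)
n = 200
poverty = rng.uniform(5, 35, n)
df = pd.DataFrame({
    'povertyPercent': poverty,
    'medIncome': 80000 - 1500 * poverty + rng.normal(0, 5000, n),
    'MedianAge': rng.normal(40, 5, n),
})
# Hypothetical erroneous entries, mimicking ages far above 100.
df.loc[:4, 'MedianAge'] = [349.2, 412.8, 508.8, 624.0, 498.0]
targets = pd.Series(150 + 2.0 * poverty + rng.normal(0, 10, n), name='target')

# Flag implausible ages.
implausible = df['MedianAge'] > 100
n_bad = int(implausible.sum())

# Rank features by absolute Pearson correlation with the target.
corr_with_target = df.corrwith(targets).abs().sort_values(ascending=False)

# Pairwise correlations between the features themselves.
feature_corr = df.corr()

print(f"{n_bad} rows with MedianAge > 100")
print(corr_with_target)
```

For the plots, `df.hist(bins=50)` draws one histogram per column and `pd.plotting.scatter_matrix(df[cols])` draws the scatter matrix for a chosen list of columns `cols`; both rely on matplotlib.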

Exercise 3.2

Create an ML pipeline using scikit-learn (as demonstrated in the lab notebooks) to pre-process the training data. (5 marks)
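A minimal pre-processing sketch is shown below. It is not the pipeline from the lab notebooks, which are not reproduced here. It assumes that implausible MedianAge values are first marked as missing, then imputes missing values with the column median and standardises every feature. The data is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic stand-in for two columns of Training_data.csv.
rng = np.random.default_rng(0)
n = 100
X = pd.DataFrame({
    'MedianAge': rng.normal(40, 5, n),
    'povertyPercent': rng.uniform(5, 35, n),
})
X.loc[:2, 'MedianAge'] = [400.0, 500.0, 600.0]  # hypothetical erroneous entries

# Treat ages above 100 as missing so the imputer can replace them.
X_clean = X.copy()
X_clean['MedianAge'] = X_clean['MedianAge'].where(X_clean['MedianAge'] <= 100)

preprocess = Pipeline([
    ('impute', SimpleImputer(strategy='median')),  # fill NaNs with column medians
    ('scale', StandardScaler()),                   # zero mean, unit variance
])

X_prepared = preprocess.fit_transform(X_clean)
print(X_prepared.mean(axis=0).round(6), X_prepared.std(axis=0).round(6))
```

Fitting the pipeline on the training split only, and then calling `preprocess.transform` on held-out data, avoids leaking test statistics into the medians and scaling factors.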

Exercise 3.3

Fit linear regression models to the pre-processed data using Ordinary least squares (OLS), Lasso and Ridge models. Choose suitable regularisation weights for Lasso and Ridge regression and include a description in text of how they were chosen. Quantitatively compare your results from all three models and report the best performing one. Report the overall performance of the best regression model identified. Include code for all steps above. (10 marks)
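One way to choose the regularisation weights and compare the three models is a cross-validated grid search, sketched below on data from `make_regression` rather than the real pre-processed features. The grid of `alpha` values, the 5-fold split and the RMSE metric are all assumptions, not requirements of the brief; `tuned` is a hypothetical helper.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

# Assumption: synthetic regression problem standing in for the prepared data.
X, y = make_regression(n_samples=300, n_features=20, n_informative=8,
                       noise=15.0, random_state=0)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
alphas = np.logspace(-3, 2, 12)  # hypothetical search grid
scoring = 'neg_root_mean_squared_error'


def tuned(model):
    """Pick the alpha with the lowest cross-validated RMSE."""
    search = GridSearchCV(model, {'alpha': alphas}, cv=cv, scoring=scoring)
    search.fit(X, y)
    return search.best_estimator_, search.best_params_['alpha']


ridge, ridge_alpha = tuned(Ridge())
lasso, lasso_alpha = tuned(Lasso(max_iter=10000))

models = {'OLS': LinearRegression(), 'Ridge': ridge, 'Lasso': lasso}
rmse = {
    name: -cross_val_score(model, X, y, cv=cv, scoring=scoring).mean()
    for name, model in models.items()
}
best = min(rmse, key=rmse.get)

print(f"ridge alpha = {ridge_alpha:.4g}, lasso alpha = {lasso_alpha:.4g}")
print({name: round(value, 3) for name, value in rmse.items()}, "best:", best)
```

Because the same folds are used to pick `alpha` and to score the models, these RMSE values are slightly optimistic; reporting the final performance of the best model on a separate held-out split gives a fairer estimate.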
