554.488/688 Computing for Applied Mathematics Fall 2022 - Final Project Assignment
Fannie Mae Loan Performance Prediction Project
Project Aim
The aim of this project is to use data collected in October 2021 on a large number of loans to develop prediction models for the number of months payments are made on mortgage loans and for predicting foreclosure of a loan, based on information available to FNMA at the time the loan is put on their books.
Some background
FNMA, aka Fannie Mae (look it up in Wikipedia), was put in place in order to ensure liquidity in the US mortgage loan markets. When a mortgage holder secures a mortgage from a bank, the bank will sell that mortgage to FNMA, giving the bank the capital to make additional future loans. FNMA bundles the mortgages it acquires into what are called mortgage-backed securities (MBSs) and sells them to investors while insuring the underlying mortgages against losses of principal. The investors receive the bundle of monthly payments associated with the underlying mortgages. When the holder of a mortgage forecloses/defaults, sells their house, or refinances their mortgage, most of the principal is transferred to the holder of the MBS, but this means that the investor's future cash flow might not be as expected (interest rates may have gone down, so any future bond investments bring lower returns). So the investor, in pricing the value of their asset, would like to be able to predict outcomes such as foreclosure or when the loan will be settled.
Hopefully this brief description gives you an understanding of why one would be interested in being able to determine, for a given loan, how likely it is to foreclose, or its duration, i.e. how many months of payments can be expected before monthly payments cease.
Your personal dataset
An email will be sent to every student in the class with URLs for two comma delimited files: a training dataset and a test dataset.
Each student will have their own unique set of data. Each dataset has been drawn from a different population, so using results from someone else's dataset will likely lead to poor performance.
You should not share these datasets with any other students in the class. You should not collaborate with other students in the class.
Any evidence of data sharing or collaboration will be viewed as an ethics violation and subject to the rules and regulations of the university.
Training set
The training set is a comma delimited file consisting of information for exactly 250,000 mortgage loans with 30 variables (LOAN ID, 27 predictor variables, and 2 response variables):
The LOAN ID variable is a unique 12 character identifier for a mortgage loan.
Predictor variables (as well as the others) are described in the Appendix. These variables provide information about the mortgage known to Fannie Mae when the mortgage was acquired by them.
The response variables are variables that had become known by the time the data on loan performance was collected in October 2021.
– The NMONTHS variable is the number of months of mortgage payments made on the loan up until the date when the data was collected.
– The FORECLOSURE variable is 1 if the loan had foreclosed as of the date when the data was collected, and 0 otherwise.
Test set
The test set is also a comma delimited file consisting of information for 100,000 mortgage loans (drawn at random from the same loan population as your training set) with only the LOAN ID and the 27 predictor variables. I have the ground truth, i.e. the NMONTHS and FORECLOSURE variables for the loans in your test set. Once I have your predictions I will be able to determine the quality of those predictions.
Your task
Your task is to use the training data to build predictors of each of the two response variables, NMONTHS and FORECLOSURE.
For FORECLOSURE, I am asking you to pick 1,000 loans you think are most likely to foreclose.
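One way to turn a fitted FORECLOSURE model into a list of 1,000 loans is to score every test loan and keep the highest-scoring ones. The sketch below assumes you already have an array of loan identifiers and a matching array of scores (for example, predicted foreclosure probabilities). The function name top_k_loans and the synthetic identifiers and scores are hypothetical and for illustration only; they are not part of the project files.

```python
import numpy as np

def top_k_loans(loan_ids, scores, k=1000):
    """Return the k loan IDs with the highest scores, highest first."""
    loan_ids = np.asarray(loan_ids)
    scores = np.asarray(scores, dtype=float)
    if loan_ids.shape[0] != scores.shape[0]:
        raise ValueError("loan_ids and scores must have the same length")
    k = min(k, scores.shape[0])
    # Stable sort on the negated scores: tied loans keep their original order.
    order = np.argsort(-scores, kind="stable")[:k]
    return loan_ids[order]

# Synthetic illustration only: made-up 12-character IDs and random scores.
rng = np.random.default_rng(0)
demo_ids = np.array([f"{i:012d}" for i in range(5000)])
demo_scores = rng.random(5000)
picked = top_k_loans(demo_ids, demo_scores, k=1000)
```

Only the ordering of the scores matters here, so any model output that ranks loans by foreclosure risk can be used in place of a calibrated probability.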
Reading the data to create a data frame
The two files you are provided with are comma delimited with all data represented as a string (each column/field of fixed size). To ease the process of reading these files to produce data frames, a Jupyter notebook called “FunctionToReadData” has been provided. The function in this notebook does the conversions of the fields for you, so to create the data frames (either training or testing) you simply give a command like:
df=read_data(fileid)
where fileid is the identifier of the downloaded file.
Some recommendations
You can use any method you wish to build your prediction model, but I recommend that you
– use regression for NMONTHS
– logistic regression for FORECLOSURE
get started early!!! don’t put this off!!!
don’t assume prediction rates will be low - do the best you can
it is not just important to get good predictions - it is also important to be able to quantify how well your predictions are likely to perform i.e. do a good job in estimating your error rates
since you only have ground truth in the training set, it is recommended that you separate that dataset into a training set and a test set so that your error rates are not underestimated due to over-fitting.
try various choices of sets of variables to use as predictors and compare performance on test data
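The recommendations on holding out part of the training data and comparing sets of predictors can be sketched as follows. This is a minimal illustration under stated assumptions, not a prescribed method: it uses ordinary least squares as the regression, root mean squared error as the error measure, and a synthetic numeric array in place of the real predictors. The helper names (holdout_split, fit_least_squares, predict, rmse) and the 80/20 split are hypothetical choices.

```python
import numpy as np

def holdout_split(n_rows, holdout_fraction=0.2, seed=0):
    """Randomly partition row indices into (fit, hold-out) index arrays."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_rows)
    n_hold = int(round(n_rows * holdout_fraction))
    return perm[n_hold:], perm[:n_hold]

def fit_least_squares(X, y):
    """Ordinary least squares with an intercept; returns the coefficients."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict(coef, X):
    """Apply fitted coefficients (intercept first) to new rows."""
    return np.column_stack([np.ones(len(X)), X]) @ coef

def rmse(y_true, y_pred):
    """Root mean squared error of the predictions."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Synthetic stand-in for numeric predictors and an NMONTHS-like response.
rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 3))
y = 60 + 5 * X[:, 0] - 3 * X[:, 1] + rng.normal(scale=2.0, size=2000)

fit_idx, hold_idx = holdout_split(len(y), holdout_fraction=0.2, seed=42)

# Compare candidate predictor sets on rows the model never saw while fitting.
candidates = {
    "first column only": [0],
    "first two columns": [0, 1],
    "all three columns": [0, 1, 2],
}
results = {}
for name, cols in candidates.items():
    coef = fit_least_squares(X[fit_idx][:, cols], y[fit_idx])
    results[name] = rmse(y[hold_idx], predict(coef, X[hold_idx][:, cols]))

for name, err in results.items():
    print(f"{name}: hold-out RMSE = {err:.2f}")
```

Because the hold-out rows play no part in fitting, the error measured on them is a less optimistic estimate of performance on new loans than the error measured on the rows used for fitting.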
I will ask you to provide a couple of summary bits of information about the variables in your training set.
NMONTHS (3): Number of months of mortgage payments made up until date the data was col-