ECON 178 S122:
Final Project Guidelines
Instructor: Ying Zhu
Overview of the data
The data is from the 1991 Survey of Income and Program Participation (SIPP). You are provided with 7933 observations.
The sample contains household-level data in which the reference person is aged 25-64. In each household, at least one person is employed and no one is self-employed. The unit of observation is the household reference person.
The data set contains a number of feature variables from which you can choose to predict total wealth. The outcome variable (total wealth) and feature variables are described in the next slide.
Dataframe with the following variables
Variable to predict (outcome variable):
• tw: total wealth (in US $).
• Total wealth equals net financial assets, including Individual Retirement Account (IRA) and 401(k) assets, plus housing equity plus the value of business, property, and motor vehicles.
Variables related to retirement (features):
- ira: individual retirement account (IRA) (in US $).
- e401: 1 if eligible for 401(k), 0 otherwise.
Financial variables (features):
- nifa: non-401k financial assets (in US $).
- inc: income (in US $).
Variables related to home ownership (features):
• hmort: home mortgage (in US $).
• hval: home value (in US $).
• hequity: home value minus home mortgage.
Other covariates (features):
- educ: education (in years).
- male: 1 if male, 0 otherwise.
- twoearn: 1 if two earners in the household, 0 otherwise.
- nohs, hs, smcol, col: dummies for education: no high school, high school, some college, college.
- age: age.
- fsize: family size.
- marr: 1 if married, 0 otherwise.
What are 401(k) and IRA?
- Both the 401(k) and the IRA are tax-deferred savings options that aim to increase individual saving for retirement
- The 401(k) plan:
  - A company-sponsored retirement account to which employees can contribute
  - Employers can match a certain % of an employee’s contribution
  - 401(k) plans are offered by employers; only employees of companies offering such plans can participate
  - The feature variable e401 indicates eligibility
- IRA accounts:
  - Individuals can participate
  - No employer matching
  - The feature variable ira contains the IRA amount (in US $)
Reference: https://www.investopedia.com/ask/answers/12/401k.asp
Your tasks
Build a prediction/fitted model to predict total wealth (tw) in US dollars
Write a paper of up to 20 pages (not including the code), in 11-point font with 1.5 line spacing
○ Introduction
■ Briefly state the objectives of the study
○ Statistical analyses
■ Describe how you apply the tools you have learned from this course to perform the prediction task
■ You should try different methods and compare their prediction performance and interpretability
○ Conclusions
■ Summarize what you have discovered from this project
■ (Optional) Discuss caveats to the conclusions drawn from your analyses
Bonus points
○ We held out 20% of the sample, on which we will run your proposed model and method. Students will be ranked by the accuracy of their predictions on that 20%.
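One way to rehearse this held-out evaluation yourself is sketched below. The data frame `toy` is simulated (it is not the SIPP data), its coefficients are made up, and the use of RMSE as the accuracy measure is an assumption, since the guidelines do not say which metric will be used for the ranking.

```r
# Toy stand-in for data_tr (simulated for illustration; NOT the SIPP data).
set.seed(178)
n <- 500
toy <- data.frame(
  inc = runif(n, 1e4, 1.2e5),
  age = sample(25:64, n, replace = TRUE)
)
toy$tw <- -2e4 + 0.8 * toy$inc + 900 * toy$age + rnorm(n, sd = 3e4)

# Set aside 20% of the rows to mimic a withheld evaluation sample.
test_idx <- sample(n, size = round(0.2 * n))
train <- toy[-test_idx, ]
test  <- toy[test_idx, ]

# Fit on the training rows only, then predict the held-out rows.
fit  <- lm(tw ~ inc + age, data = train)
pred <- predict(fit, newdata = test)

# Out-of-sample error (RMSE is an assumed metric).
mse_test  <- mean((test$tw - pred)^2)
rmse_test <- sqrt(mse_test)
rmse_test
```

With the real data, `toy` would be replaced by `data_tr` and the formula by your chosen model.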
The project is due on July 29 (by 5:00pm PST). Please submit your paper and code according to the instructions. Late assignments will NOT be accepted except with my prior consent in unusual circumstances permitted by University policies (proper documentation will be required).
How to carry out this project?
Data can be found on Canvas
- Download the data and save it in your working directory
- To load the data into R, use the code:
data_tr <- read.table("data_tr.txt", header = TRUE, sep = "\t", dec = ".")[,-1]
Inspecting your data and preliminary analyses
- Dependent variable (Y): tw: total wealth (in US $)
- Predictors (X): your choice (but please make sensible choices)
- Some suggestions: use scatter plots and/or simple linear regressions with OLS to visualize basic relationships between total wealth and various predictors
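The suggestion above (a scatter plot plus a simple OLS fit) might look like the following sketch. It runs on a simulated data frame `toy` that only borrows the column names `tw`, `inc`, and `age`; the values and the relationship between them are assumptions, not properties of the real data.

```r
# Toy stand-in for data_tr (simulated for illustration; NOT the SIPP data).
set.seed(178)
n <- 500
toy <- data.frame(
  inc = runif(n, 1e4, 1.2e5),
  age = sample(25:64, n, replace = TRUE)
)
toy$tw <- -2e4 + 0.8 * toy$inc + 900 * toy$age + rnorm(n, sd = 3e4)

# Quick numeric overview of every column.
summary(toy)

# Simple linear regression of total wealth on income.
fit_inc <- lm(tw ~ inc, data = toy)
summary(fit_inc)$coefficients

# Scatter plot of tw against inc, with the fitted OLS line overlaid.
plot(toy$inc, toy$tw, xlab = "inc (US $)", ylab = "tw (US $)")
abline(fit_inc, col = "red")

# Pairwise correlations between the outcome and the candidate predictors.
cor(toy[, c("tw", "inc", "age")])
```

With the real data, replace `toy` by `data_tr` and repeat the plot for each predictor you are considering.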
In-depth analyses
- What could be the X variables in your prediction exercise?
- What methods should you use? (OLS, Ridge, Stepwise selection, Lasso)
- How do you select the best prediction/fitted model? (K-fold cross-validation, leave-one-out)
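As an illustration of model selection by K-fold cross-validation, here is a minimal base-R sketch that compares two OLS specifications on the same folds. The helper `cv_mse`, the simulated data frame `toy`, and the two candidate formulas are all hypothetical; K = 5 is an arbitrary choice.

```r
# Toy stand-in for data_tr (simulated for illustration; NOT the SIPP data).
set.seed(178)
n <- 500
toy <- data.frame(
  inc = runif(n, 1e4, 1.2e5),
  age = sample(25:64, n, replace = TRUE)
)
toy$tw <- -2e4 + 0.8 * toy$inc + 150 * (toy$age - 25)^2 + rnorm(n, sd = 3e4)

# Hypothetical helper: K-fold CV estimate of the test MSE for an lm() formula.
# `folds` is a vector of fold labels, one per row of `data`.
cv_mse <- function(formula, data, folds) {
  response <- all.vars(formula)[1]           # name of the outcome variable
  fold_mse <- sapply(sort(unique(folds)), function(k) {
    fit  <- lm(formula, data = data[folds != k, ])   # train on the other folds
    held <- data[folds == k, ]                        # validate on fold k
    mean((held[[response]] - predict(fit, newdata = held))^2)
  })
  mean(fold_mse)
}

# Assign each row to one of K folds at random; reuse the same folds for all models.
K <- 5
folds <- sample(rep(1:K, length.out = n))

candidates <- list(
  linear    = tw ~ inc + age,
  quadratic = tw ~ inc + poly(age, 2)
)
cv_results <- sapply(candidates, cv_mse, data = toy, folds = folds)
cv_results
best <- names(which.min(cv_results))
```

Setting `folds <- 1:n` turns the same helper into leave-one-out cross-validation, at the cost of fitting the model n times.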
What could be the X variables in your prediction exercise?
● The plain predictors listed on Slide 3
○ Watch out for perfect collinearity: do not include predictors that are perfectly collinear.
■ For example, do not include hmort (home mortgage), hval (home value), and hequity (home value minus home mortgage) all at the same time, because hequity = hval - hmort. One solution is to drop hequity from your models.
■ As another example, suppose you include the intercept term (a column of 1s) and all four education dummies nohs, hs, smcol, col (no high school, high school, some college, college). Then nohs + hs + smcol + col equals a column of 1s (the intercept). One solution is to drop one of the education dummies from your models.
● Transformations of the plain predictors listed on Slide 3: use what you have learned from Topic 6: Flexible Linear Models
○ Polynomial transformation
○ The spline basis representation
○ Transformation using binary indicators
○ Generalized additive models (GAM)
○ Interacting dummy variables with other variables; for example, age x twoearn
● Before transforming the plain predictors, scatter plots may help you visualize how each predictor is associated with total wealth. For example, if you see a nonlinear relationship, you might consider some type of polynomial transformation or the spline basis representation
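The sketch below illustrates two of the points above on simulated data: what `lm()` does when perfectly collinear predictors are included, and how polynomial, spline, and interaction terms can be written in a model formula. The data frame `toy`, its made-up relationships, and the particular choices (degree 2, `df = 5`) are assumptions for illustration only.

```r
library(splines)  # ships with R; provides bs() for spline bases

# Toy stand-in for data_tr (simulated for illustration; NOT the SIPP data).
set.seed(178)
n <- 500
toy <- data.frame(
  age     = sample(25:64, n, replace = TRUE),
  inc     = runif(n, 1e4, 1.2e5),
  twoearn = rbinom(n, 1, 0.4),
  hval    = runif(n, 0, 3e5)
)
toy$hmort   <- toy$hval * runif(n, 0, 0.8)
toy$hequity <- toy$hval - toy$hmort
toy$tw <- 0.5 * toy$inc + toy$hequity + 30 * (toy$age - 25)^2 +
  rnorm(n, sd = 2e4)

# Perfect collinearity: hequity = hval - hmort, so lm() cannot estimate
# all three coefficients and reports NA for the redundant one.
fit_collinear <- lm(tw ~ hmort + hval + hequity, data = toy)
coef(fit_collinear)

# Drop hequity, then add flexible terms:
#   poly(inc, 2)   polynomial transformation of income
#   bs(age, df=5)  cubic spline basis in age
#   age:twoearn    interaction of age with the two-earner dummy
fit_flex <- lm(
  tw ~ hmort + hval + poly(inc, 2) + bs(age, df = 5) + twoearn + age:twoearn,
  data = toy
)
coef(fit_flex)
```

Note that a separate linear `age` term is left out on purpose: it already lies in the span of the intercept and the spline basis, so adding it would reintroduce perfect collinearity.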
Collection of methods
We have already seen:
• OLS
• Ridge regression
• Stepwise selection methods
• Lasso
Note:
1. In the project, you should select different methods from the list above and compare their prediction performance and interpretability
2. For Ridge, Stepwise selection, and Lasso, don’t forget to use cross-validation
3. In addition to prediction performance, you might want to think about whether the set of predictors used to predict total wealth makes intuitive sense
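To make note 2 concrete, here is a base-R sketch of choosing the ridge penalty by K-fold cross-validation. It uses the closed-form ridge solution on standardized predictors. The helpers `ridge_fit` and `ridge_predict`, the simulated data, the lambda grid, and K = 5 are all assumptions for illustration, and the penalty scaling here need not match that of any particular package.

```r
# Simulated predictors and outcome (for illustration; NOT the SIPP data).
set.seed(178)
n <- 300
X <- cbind(
  inc   = runif(n, 1e4, 1.2e5),
  age   = sample(25:64, n, replace = TRUE),
  hval  = runif(n, 0, 3e5),
  fsize = sample(1:6, n, replace = TRUE)
)
y <- 0.8 * X[, "inc"] + 900 * X[, "age"] + 0.5 * X[, "hval"] + rnorm(n, sd = 3e4)

# Hypothetical helper: closed-form ridge on standardized X and centered y.
ridge_fit <- function(X, y, lambda) {
  mu <- colMeans(X)
  s  <- apply(X, 2, sd)
  Xs <- scale(X, center = mu, scale = s)
  beta <- solve(crossprod(Xs) + lambda * diag(ncol(X)),
                crossprod(Xs, y - mean(y)))
  list(beta = drop(beta), mu = mu, s = s, ybar = mean(y))
}

# Hypothetical helper: predict using the training means and scales.
ridge_predict <- function(fit, Xnew) {
  Xs <- scale(Xnew, center = fit$mu, scale = fit$s)
  drop(fit$ybar + Xs %*% fit$beta)
}

# K-fold cross-validation over a grid of penalty values.
lambdas <- 10^seq(-2, 4, length.out = 25)
K <- 5
folds <- sample(rep(1:K, length.out = n))
cv_mse <- sapply(lambdas, function(lam) {
  mean(sapply(1:K, function(k) {
    fit <- ridge_fit(X[folds != k, ], y[folds != k], lam)
    mean((y[folds == k] - ridge_predict(fit, X[folds == k, ]))^2)
  }))
})

# Refit on all rows with the penalty that minimizes the CV error.
best_lambda <- lambdas[which.min(cv_mse)]
final_fit   <- ridge_fit(X, y, best_lambda)
final_fit$beta
```

In practice, a package such as glmnet provides `cv.glmnet()`, which performs this kind of cross-validated tuning for both ridge (`alpha = 0`) and lasso (`alpha = 1`); the hand-written version is only meant to show what the cross-validation step is doing.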