Assignment 1
Instructions:
- You can use this Word file as an answer sheet. Attach the required output below each question. R code can be attached to each question or at the end of the document (for partial credit if your results are not correct).
- Name the Word file YourName.doc
The following problem takes place in the US in the late 1990s, when many major US cities were facing airport congestion, partly as a result of the 1978 deregulation of airlines. Both fares and routes were freed from regulation, and low-fare carriers such as Southwest began competing on existing routes and starting nonstop service on routes that previously lacked it. Building completely new airports is generally not feasible, but decommissioned military bases or smaller municipal airports can sometimes be reconfigured as regional or larger commercial airports. Numerous players and interests are involved (airlines; city, state, and federal authorities; civic groups; the military; airport operators), and an aviation consulting firm is seeking advisory contracts with these players.
The firm needs predictive models to support its consulting service. One thing the firm might want to predict is fares in the event that a new airport is brought into service. The firm starts with the file Airfares.csv, which contains real data collected between Q3-1996 and Q2-1997. The variables in these data are listed below and are believed to be important in predicting FARE. Some airport-to-airport data are available, but most data are at the city-to-city level. One question of interest in the analysis is the effect that the presence or absence of Southwest (SW) has on FARE.
- COUPON: Average number of coupons for that route (a one-coupon flight is a nonstop flight, a two-coupon flight is a one-stop flight, etc.)
- NEW: Number of new carriers entering that route between Q3-96 and Q2-97
- VACATION: Whether (Yes) or not (No) the route is a vacation route
- SW: Whether (Yes) or not (No) Southwest Airlines serves that route
- HI: Herfindahl index, which measures market concentration
- S_INCOME: Starting city’s average personal income
- E_INCOME: Ending city’s average personal income
- S_POP: Starting city’s population
- E_POP: Ending city’s population
- SLOT: Whether or not either endpoint airport is slot controlled (a measure of airport congestion)
- GATE: Whether or not either endpoint airport has gate constraints (another measure of airport congestion)
- DISTANCE: Distance between the two endpoint airports in miles
- PAX: Number of passengers on that route during the period of data collection
- FARE: Average fare on that route
1 Exploratory Analysis
1.1) Explore the numerical predictors and the response (FARE) by creating a correlation table and examining scatterplots of FARE against those predictors. Which variable seems to be the best single predictor of FARE? Hint: consider the ggpairs() function from the GGally package.
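Hint (sketch only): the snippet below shows one possible way to build a correlation table and a scatterplot matrix in base R. It runs on a small made-up data frame named toy, not on Airfares.csv, so the numbers it prints say nothing about the real data. The column values and the object names toy, cor_table, and fare_cor are assumptions for illustration.

```r
# Toy data standing in for the numerical columns of Airfares.csv (values are made up).
set.seed(42)
toy <- data.frame(
  DISTANCE = runif(50, 100, 2500),
  COUPON   = runif(50, 1, 2)
)
toy$FARE <- 50 + 0.08 * toy$DISTANCE + rnorm(50, sd = 20)

# Correlation table of all numerical columns, rounded for readability.
cor_table <- round(cor(toy), 2)
print(cor_table)

# Correlation of each predictor with FARE, sorted by absolute strength.
fare_cor <- cor_table[rownames(cor_table) != "FARE", "FARE"]
print(fare_cor[order(abs(fare_cor), decreasing = TRUE)])

# Scatterplot matrix in base R; GGally::ggpairs(toy) is an alternative
# if the GGally package is installed.
pairs(toy)
```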
Attach your results here
1.2) Use either a bar chart or a frequency table to show the distribution of each categorical predictor (VACATION, SW, SLOT, GATE).
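Hint (sketch only): a frequency table and the matching bar chart can be produced with table() and barplot(). The vector below is made up for illustration; with the real data you would apply the same calls to each categorical column.

```r
# Made-up categorical values standing in for one column such as SW.
sw <- c("Yes", "No", "No", "Yes", "No", "No", "Yes", "No")

# Frequency table of the categories.
sw_counts <- table(sw)
print(sw_counts)

# Bar chart of the same counts.
barplot(sw_counts, xlab = "SW", ylab = "Number of routes")
```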
Attach your results here
1.3) For each categorical predictor, plot the average FARE for its categories.
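Hint (sketch only): one possible approach is to compute group means with tapply() and pass them to barplot(). The data frame toy and its values are made up for illustration and are not taken from Airfares.csv.

```r
# Made-up routes: one categorical predictor and the response.
toy <- data.frame(
  SW   = c("Yes", "No", "No", "Yes", "No", "Yes"),
  FARE = c(90, 180, 200, 110, 160, 100)
)

# Average FARE within each category of SW.
avg_fare <- tapply(toy$FARE, toy$SW, mean)
print(avg_fare)

# Bar chart of the category averages.
barplot(avg_fare, xlab = "SW", ylab = "Average FARE")
```

The same pattern can be repeated for VACATION, SLOT, and GATE.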
Attach your results here
2 Explanatory Modeling
2.1) Using the whole data set, fit a linear regression of FARE on all predictors.
Attach your results here
2.2) What percentage of the variation in FARE is explained by the model?
2.3) Does the model explain a significant amount of variation in FARE? Explain your answer.
2.4) Explain the effect that the presence or absence of Southwest (SW) has on FARE.
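Hint (sketch only): for questions 2.2 to 2.4, the relevant quantities can be read from the object returned by summary() on an lm fit. The example below uses simulated data in which the SW effect is built in by construction, so its output is not an answer to the assignment. The names toy, fit, r_squared, f_p_value, and sw_effect are assumptions for illustration.

```r
# Simulated data: FARE depends on DISTANCE and on a two-level factor SW.
set.seed(7)
n <- 60
toy <- data.frame(
  DISTANCE = runif(n, 100, 2500),
  SW       = factor(sample(c("No", "Yes"), n, replace = TRUE))
)
toy$FARE <- 60 + 0.07 * toy$DISTANCE - 40 * (toy$SW == "Yes") + rnorm(n, sd = 15)

# Regression of FARE on all other columns.
fit <- lm(FARE ~ ., data = toy)
s <- summary(fit)

# Share of variation in FARE explained by the model (R-squared).
r_squared <- s$r.squared

# Overall F-test: fstatistic holds the F value and its two degrees of freedom.
f <- s$fstatistic
f_p_value <- pf(f[["value"]], f[["numdf"]], f[["dendf"]], lower.tail = FALSE)

# Coefficient row for SW = Yes relative to the baseline level SW = No.
sw_effect <- coef(s)["SWYes", ]

print(r_squared)
print(f_p_value)
print(sw_effect)
```

The SWYes estimate is the expected change in FARE when SW is Yes rather than No, holding the other predictors fixed.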
3 Predictive Modeling
3.1) Partition the data into a training set (60%) and a validation set (40%). Run set.seed(1) before sampling the training set so that the partition is reproducible. Using the training set, fit a linear regression model to predict FARE with all the predictors.
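Hint (sketch only): the snippet below shows one possible way to draw a 60/40 partition with sample() and fit a model on the training rows. It uses a made-up data frame named toy; the exact sampling call your course expects may differ, and a different call can produce a different partition even with the same seed.

```r
# Made-up data standing in for Airfares.csv.
set.seed(123)
toy <- data.frame(DISTANCE = runif(100, 100, 2500))
toy$FARE <- 50 + 0.08 * toy$DISTANCE + rnorm(100, sd = 20)

# Set the seed immediately before sampling, as the question requires.
set.seed(1)
train_idx <- sample(seq_len(nrow(toy)), size = round(0.6 * nrow(toy)))

# Rows drawn by sample() form the training set; the rest form the validation set.
train <- toy[train_idx, ]
valid <- toy[-train_idx, ]

# Linear regression on all predictors, fitted on the training set only.
fit_full <- lm(FARE ~ ., data = train)
print(summary(fit_full))
```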
Attach model summary here
3.2) Use forward selection on the training set to select predictors. Show the summary of the selected model.
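Hint (sketch only): forward selection can be run with the step() function from base R, starting from an intercept-only model and allowing terms up to the full model. The simulated data and the names train, null_fit, full_fit, and fwd_fit are assumptions for illustration. step() selects by AIC by default, which may differ from the criterion used in your course materials.

```r
# Simulated training data: y depends on x1 and x2, while x3 is pure noise.
set.seed(1)
n <- 80
train <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
train$y <- 3 + 2 * train$x1 - 1.5 * train$x2 + rnorm(n, sd = 0.5)

# Smallest and largest models that the search may consider.
null_fit <- lm(y ~ 1, data = train)
full_fit <- lm(y ~ ., data = train)

# Forward selection: start from the null model and add one term at a time.
fwd_fit <- step(
  null_fit,
  scope     = list(lower = formula(null_fit), upper = formula(full_fit)),
  direction = "forward",
  trace     = 0
)
print(summary(fwd_fit))
```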
Attach model summary here
3.3) Compare the predictive accuracy of model 3.1 and model 3.2.
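Hint (sketch only): one common way to compare two models is to compute error measures such as RMSE and MAE on the validation set. The helper functions rmse() and mae() below are written for this example and are not part of base R. The data and the models model_a and model_b are made up for illustration.

```r
# Simulated data with a 60/40 partition.
set.seed(1)
n <- 100
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 1 + 2 * toy$x1 + rnorm(n)
train_idx <- sample(seq_len(n), size = 60)
train <- toy[train_idx, ]
valid <- toy[-train_idx, ]

# Two candidate models fitted on the training set.
model_a <- lm(y ~ x1 + x2, data = train)
model_b <- lm(y ~ x1, data = train)

# Hypothetical helpers: root mean squared error and mean absolute error.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
mae  <- function(actual, predicted) mean(abs(actual - predicted))

# Validation-set predictions and error measures, side by side.
pred_a <- predict(model_a, newdata = valid)
pred_b <- predict(model_b, newdata = valid)
accuracy <- data.frame(
  model = c("model_a", "model_b"),
  RMSE  = c(rmse(valid$y, pred_a), rmse(valid$y, pred_b)),
  MAE   = c(mae(valid$y, pred_a), mae(valid$y, pred_b))
)
print(accuracy)
```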
3.4) Using model 3.2, predict the FARE on a route with the following characteristics: COUPON = 1.202, NEW = 3, VACATION = No, SW = No, HI = 4442.141, S_INCOME = $28,760, E_INCOME = $27,664, S_POP = 4,557,004, E_POP = 3,195,503, SLOT = Free, GATE = Free, PAX = 12,782, DISTANCE = 1976 miles.
Hint: create a data frame for the new route, then apply model 3.2 to it with the predict() function.
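Sketch of the hint above, on made-up data: the new route is stored as a one-row data frame whose column names and factor levels match the training data, and predict() is called with the newdata argument. The training values, the model fit, and the new route below are assumptions for illustration, not the route from question 3.4.

```r
# Made-up training data with one numerical and one categorical predictor.
train <- data.frame(
  DISTANCE = c(300, 500, 800, 1200, 1500, 2000),
  SW       = factor(c("Yes", "No", "Yes", "No", "Yes", "No")),
  FARE     = c(70, 140, 110, 230, 170, 330)
)
fit <- lm(FARE ~ DISTANCE + SW, data = train)

# One-row data frame for the new route; dollar signs and thousands separators
# must be dropped so that numerical columns stay numeric.
new_route <- data.frame(
  DISTANCE = 1000,
  SW       = factor("No", levels = levels(train$SW))
)

# Predicted FARE for the new route.
predicted_fare <- predict(fit, newdata = new_route)
print(predicted_fare)
```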
You may attach your R code here (optional)