Project 2: Forecasting Highway Car Volumes
Chen Nan
Due date: Nov 17, 2023
1 Introduction
This project studies traffic volumes on an expressway connecting two metropolitan cities. A measuring station records the number of eastbound cars passing the station. The data are organized hourly and consist of the traffic volumes as well as weather information. The variables are described below:
1. IsHoliday: a categorical variable to indicate whether it is a public holiday or not
2. Temp: a numeric variable, average temperature during the hour in Kelvin
3. Rain1h: a numeric variable, amount in mm of rain that occurred in the hour
4. Snow1h: a numeric variable, amount in mm of snow that occurred in the hour
5. CloudsAll: a numeric variable, percentage of cloud covering the sky
6. WeatherMain: a categorical variable, short textual description of the current weather
7. WeatherDescription: a categorical variable, longer textual description of the current weather
8. Time: a DateTime variable, the hour at which the data were collected, in local time
9. TrafficVolume: the numerical response variable, hourly reported eastbound traffic volume
The data file “P2train.csv” contains 40,000 hourly readings. Each row includes the information for one hour. The testing dataset “P2test.csv” has the same data structure. However, 334 values of the response variable are “missing” (recorded as 0), and you need to forecast the hourly traffic volume for each of these rows based only on information prior to that row. Note that not all rows in “P2test.csv” have a missing response value. The row indices (starting from 1) of the rows with missing response values are saved in “P2index.csv”. You can use all values in the training data, as well as the rows before a missing value, to make predictions. The accuracy of your predictions will be evaluated based on the numbers you provide.
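The indexing convention described above can be sketched in Python as follows. This is a minimal illustration on made-up values, not the real data. It assumes that “P2index.csv” is a single column of integers without a header and that Time parses as a standard timestamp; neither is confirmed by this handout.

```python
import io
import pandas as pd

# Made-up stand-ins for "P2test.csv" and "P2index.csv" (only two columns shown).
test_csv = io.StringIO(
    "Time,TrafficVolume\n"
    "2023-01-01 00:00:00,500\n"
    "2023-01-01 01:00:00,300\n"
    "2023-01-01 02:00:00,0\n"
    "2023-01-01 03:00:00,250\n"
    "2023-01-01 04:00:00,0\n"
)
# Assumed layout: one 1-based row index per line, no header.
index_csv = io.StringIO("3\n5\n")

test = pd.read_csv(test_csv, parse_dates=["Time"])
missing_rows = pd.read_csv(index_csv, header=None)[0].tolist()

# Row indices in the handout start from 1, pandas positions start from 0.
positions = [r - 1 for r in missing_rows]

# Replace the placeholder zeros by NaN so they are never mistaken for real counts.
test["TrafficVolume"] = test["TrafficVolume"].astype(float)
test.loc[positions, "TrafficVolume"] = float("nan")

# Information available when forecasting a missing row: strictly earlier rows
# with an observed response (plus, in the real task, all of the training data).
target = positions[-1]
history = test.iloc[:target].dropna(subset=["TrafficVolume"])
print(history)
```

Masking the placeholder zeros first is a design choice: it prevents an earlier missing row from leaking into the history of a later one as if it were a genuine observation.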
2 Project Task
2.1 Regression models
In this task, you need to build a regression model using any predictors, or transformations of predictors, to predict TrafficVolume. Time can also be used as a predictor if you want (i.e., regression on time). You can also include interactions in your model. Consider which components are necessary to include in the model, and perform diagnostic checks and model interpretation. Model selection methods can be used as well.
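As one possible starting point, the sketch below fits a regression with hour-of-day dummy variables (a transformation of Time) and a centred temperature term by ordinary least squares. The data are synthetic and every number in it is an assumption made for illustration; the choice of predictors is only an example, not a recommended final model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for four weeks of hourly data (all values are made up).
n_hours = 24 * 28
hour = np.arange(n_hours) % 24
temp = 280 + 10 * rng.random(n_hours)  # Kelvin, as in the Temp variable
volume = (
    500
    + 1000 * np.sin(np.pi * hour / 24) ** 2  # daily pattern
    + 5 * (temp - 280)                       # temperature effect
    + rng.normal(0, 50, n_hours)             # noise
)

def design_matrix(hour, temp):
    """Intercept, 23 hour-of-day dummies (hour 0 is the baseline), centred Temp."""
    hour_dummies = (hour[:, None] == np.arange(1, 24)[None, :]).astype(float)
    return np.column_stack([np.ones(len(hour)), hour_dummies, temp - 280.0])

X = design_matrix(hour, temp)

# Ordinary least squares fit.
beta, *_ = np.linalg.lstsq(X, volume, rcond=None)

# Residuals are the basis for diagnostic checks (plots against time, hour, fitted values).
fitted = X @ beta
resid = volume - fitted
r2 = 1 - resid.var() / volume.var()
print(f"R^2 = {r2:.3f}, Temp coefficient = {beta[-1]:.2f}")
```

Dropping one hour as the baseline avoids perfect collinearity with the intercept; each remaining dummy coefficient is then interpreted as the difference in expected volume relative to hour 0.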
2.2 Exponential smoothing
In this task, you can apply an appropriate exponential smoothing method to make forecasts, based on Time and TrafficVolume only. Please explore and verify your choice of model, and elaborate on the potential pros and cons of the model.
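To make the mechanics concrete, here is a hand-written sketch of additive seasonal exponential smoothing with a level and a seasonal component but no trend (a simplified Holt-Winters variant). The 24-hour period, the smoothing constants, and the noise-free synthetic series are all assumptions chosen for illustration; whether this variant suits the real data is part of what the task asks you to verify.

```python
import math

def seasonal_exp_smoothing(y, period, alpha, gamma):
    """Additive level + seasonal smoothing.

    Returns the one-step-ahead forecasts for y[period:], the final level,
    and the final seasonal indices.
    """
    # Initialise the level as the mean of the first season and the
    # seasonal indices as deviations from that mean.
    level = sum(y[:period]) / period
    season = [y[i] - level for i in range(period)]

    forecasts = []
    for t in range(period, len(y)):
        s = season[t % period]       # seasonal index from one period ago
        forecasts.append(level + s)  # forecast of y[t] made before observing it
        new_level = alpha * (y[t] - s) + (1 - alpha) * level
        season[t % period] = gamma * (y[t] - new_level) + (1 - gamma) * s
        level = new_level
    return forecasts, level, season

# Synthetic, noise-free hourly series with a daily cycle (made-up values).
period = 24
y = [1000 + 500 * math.sin(2 * math.pi * t / period) for t in range(period * 7)]

# Smoothing constants are arbitrary here; in practice they would be chosen
# by minimising one-step-ahead forecast errors on the training data.
forecasts, level, season = seasonal_exp_smoothing(y, period, alpha=0.3, gamma=0.2)

errors = [f - a for f, a in zip(forecasts, y[period:])]
next_forecast = level + season[len(y) % period]  # forecast for the next hour
print(max(abs(e) for e in errors), next_forecast)
```

On this perfectly periodic series the one-step errors are essentially zero, which is a quick sanity check of the recursion rather than evidence of real-world accuracy.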
2.3 Free form forecasting
In this task, you can select or build your own model to predict the traffic volumes. You can only use methods discussed in the module (regression, exponential smoothing, Box-Jenkins methodologies, etc.) or their combinations. Elaborate on your choice and the potential room for improvement. For this task, you need to submit your forecasting results for accuracy evaluation.
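As a hedged illustration of a Box-Jenkins style step, the sketch below applies a seasonal difference at lag 24 and fits an AR(1) model to the differenced series by least squares, then compares its in-sample one-step error with a seasonal naive forecast. The series is synthetic, RMSE is an assumed accuracy measure (the handout does not state the evaluation metric), and in practice a library ARIMA routine with proper identification and diagnostic checking would replace this hand-rolled fit.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic hourly series: daily cycle plus AR(1) noise (all values made up).
n = 24 * 30
t = np.arange(n)
noise = np.zeros(n)
for i in range(1, n):
    noise[i] = 0.7 * noise[i - 1] + rng.normal(0, 30)
y = 1000 + 500 * np.sin(2 * np.pi * t / 24) + noise

season = 24

# Seasonal differencing removes the repeating daily pattern.
d = y[season:] - y[:-season]

# AR(1) without intercept on the differenced series: d[t] = phi * d[t-1] + error.
phi = (d[1:] @ d[:-1]) / (d[:-1] @ d[:-1])

# In-sample one-step-ahead comparison on the differenced scale.
rmse_ar = np.sqrt(np.mean((d[1:] - phi * d[:-1]) ** 2))
rmse_naive = np.sqrt(np.mean(d[1:] ** 2))  # seasonal naive predicts d[t] = 0

# One-step forecast for the next hour, mapped back to the original scale.
forecast = y[n - season] + phi * d[-1]
print(f"phi = {phi:.2f}, RMSE AR = {rmse_ar:.1f}, RMSE naive = {rmse_naive:.1f}")
```

Because the comparison is in-sample, the AR(1) fit cannot do worse than the seasonal naive forecast here; a fair comparison between candidate models would use a hold-out period from the training data.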
3 Submission
Make sure you zip all three files below into a single zip file following the naming convention “student id.zip”, e.g., “U2305000.zip”.
1. A report of not more than 10 pages with 1.5 spacing (soft copy only), which documents the methods used, main findings, and interpretations. Please take note of the following guidelines:
• Your report should be concise and complete.
• Your report should be self-contained, without referring to figures or numbers in your Jupyter Notebook. However, code and uninformative software printouts should NOT be included in the report.
• Focus on the analysis and reasoning in your report. Make sure your statements/hypotheses/conjectures are (partially) supported by data evidence.
• You do NOT need to list every approach you have tried. Focus on elaborating “why you chose the model/approach in the end”.
• You can discuss with others, but you need to write your code and report independently.
• Copying from other sources is considered plagiarism and will result in disciplinary action from NUS.
2. Complete code used for the analysis, with reasonably detailed comments, in a Jupyter Notebook (soft copy only). Attention: Make sure your results are reproducible with the code you submit. Unreproducible results are considered cheating/plagiarism.
3. Forecasting results on the test dataset in a “csv” file with a single column and 335 rows (including the header), as shown in the following example:
U2305000
10.31
8.5
20.1
...
11.5
Only the third task (Section 2.3, free form forecasting) requires submission of forecasting results.
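The single-column submission format can be produced and checked with a short script such as the one below. It is a sketch only: the output file name is hypothetical (the handout does not specify one), the student ID is the example ID from this handout, and the forecast values are placeholders rather than real predictions.

```python
import csv
import os
import tempfile

student_id = "U2305000"  # example ID from the handout; replace with your own

# Placeholder values only: 334 forecasts, one per missing row listed in "P2index.csv".
forecasts = [10.31, 8.5, 20.1] + [11.5] * 331

# Hypothetical file name, written to a temporary directory for this demonstration.
out_path = os.path.join(tempfile.gettempdir(), f"{student_id}_forecast.csv")

with open(out_path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow([student_id])  # header row: the student ID
    for value in forecasts:
        writer.writerow([value])   # one forecast per row

# Read the file back and verify the required shape: 335 rows, one column.
with open(out_path, newline="") as f:
    rows = list(csv.reader(f))

print(len(rows), rows[:3])
```

Reading the file back before zipping is a cheap way to catch an off-by-one in the number of forecasts or an accidental extra index column.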