Homework 1 (40 pts)
Part 1: Introduction to the data
This homework assignment is based on a project to study the impact of weather conditions on bike sharing demand, the Bike project.
Background
Biking as an alternative mode of transportation benefits not only individual health but also the wider community, because it eases problems common in big cities, where traffic congestion, insufficient parking facilities, and air and noise pollution are a daily burden. One benefit is the flexibility to travel short distances easily. Biking also supports longer trips by covering the first or last mile to and from transit stations, thereby shortening travel time. Moreover, biking is convenient because it offers door-to-door service: bike parking facilities are usually next door, so a bike is available to riders at any time.
These benefits have opened a gap that bike sharing systems have emerged to fill: an individual does not have to own a bike to ride one. However, as beneficial as bike sharing is, its demand is strongly affected by weather conditions and by trip purpose (whether or not the trip is a work trip). Past studies have shown that weather conditions such as temperature, humidity, and wind speed have a significant impact on the usage demand of bike sharing (Fuller et al. 2013; Gebhart and Noland 2014; Heinen et al. 2010).
Data Source
The bike sharing usage data from 2011 to 2012 used in this study were collected by Capital Bikeshare (CaBi), one of the largest companies providing bike sharing systems in the United States, with more than 2500 bikes distributed across 300 stations in Washington, DC and Arlington, VA (Capital Bikeshare 2012). The whole dataset, including the bike sharing usage data, weather data, and holiday schedule, was obtained from the University of California Irvine Center for Machine Learning website: https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset.
Data Description
In this study, a literature review was conducted to choose relevant and adequate variables to evaluate how bike sharing usage is impacted by weather conditions. Ultimately, the following continuous and categorical variables were chosen:
Variable Notation | Variable name | Variable Type | Description |
X1 | Temperature | Continuous | Normalized temperature in Celsius. The values are divided by 41 (max). |
X2 | Humidity | Continuous | Normalized humidity. The values are divided by 100 (max). |
X3 | Windspeed | Continuous | Normalized windspeed. The values are divided by 67 (max). |
X4 | Working day | Categorical | 1 if the day is neither a weekend nor a holiday, and 0 otherwise. |
Y | Count | Continuous | The number of total rental bike users. |
Data is accessible from the course website: Data and Resource > Data used in class > BikeProject.csv
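To make the normalization in the table concrete, here is a minimal R sketch that converts normalized values back to their original units. The data frame below is a made-up stand-in, and the column names X1 and X2 are assumptions borrowed from the variable notation above; the actual column names in BikeProject.csv may differ.

```r
# Hypothetical stand-in for a few rows of the data; with the real file you
# would start from something like read.csv("BikeProject.csv") instead.
bike <- data.frame(
  X1 = c(0.25, 0.50, 0.75),  # normalized temperature
  X2 = c(0.40, 0.60, 0.80)   # normalized humidity
)

# Undo the normalization described in the table:
# temperature was divided by 41, humidity by 100.
bike$temp_celsius <- bike$X1 * 41
bike$humidity_pct <- bike$X2 * 100

print(bike)
```

A slope estimated on the normalized X1 scale is therefore the change in Y per 41 degrees Celsius, which is worth keeping in mind when interpreting the regression.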
Part 2: Introduction to the homework types and format requirements
There are two kinds of problems: conceptual and application. A conceptual problem focuses on definitions, notation, and formulas. For this kind of problem, you are expected to compute by hand (or with basic arithmetic functions in Excel or R), not with a function that returns the answer directly. Formulas and intermediate work should be shown clearly. By default, all questions in the homework assignment are of this type.
An application problem focuses on R skills and the interpretation of output. This kind of problem usually contains the phrase “use R…” or “according to the R output”. For this kind of problem, you don’t need to compute the results by hand. Instead, get the result from R and proceed.
For instance, in Homework 1, problems 1-3 are conceptual problems and problem 4 is an application problem.
For problems 1 to 3, you may use Excel or R to compute the variable means, the residuals, and the sums of squares before computing the residual standard error. When computing each item, show the formula and the details, and use the correct notation. You may not use a linear regression function such as lm() to compute the numbers, because the purpose of these problems is to become familiar with the formulas and notation.
For problem 4, you may use a linear regression function such as lm() to run the analysis; the purpose is to become familiar with the R output.
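The difference between the two routes can be illustrated with a short R sketch. It uses a tiny made-up data set (not the bike data), and the object names (Sxx, Sxy, b0, b1, SSE, s) are illustrative choices rather than required notation. The first part uses only basic arithmetic, as allowed for conceptual problems; the second part uses lm(), as allowed for application problems.

```r
# Toy data (hypothetical), used only to illustrate the two routes
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)
n <- length(x)

# Conceptual route: basic arithmetic only
x_bar <- mean(x)
y_bar <- mean(y)
Sxx <- sum((x - x_bar)^2)               # sum of squares of X
Sxy <- sum((x - x_bar) * (y - y_bar))   # sum of cross products
b1  <- Sxy / Sxx                        # least squares slope
b0  <- y_bar - b1 * x_bar               # least squares intercept
e   <- y - (b0 + b1 * x)                # residuals
SSE <- sum(e^2)                         # error sum of squares
s   <- sqrt(SSE / (n - 2))              # residual standard error

# Application route: let lm() do the work
fit <- lm(y ~ x)
coef(fit)    # intercept and slope, should match b0 and b1
sigma(fit)   # residual standard error, should match s
```

On this toy data both routes give a slope of 0.6 and an intercept of 2.2, which is a quick way to check hand computations before writing up the formulas.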
Part 3: Homework questions
In this homework, we consider a simple linear regression Y ~ X, where X=X1 is the temperature. The goal is to study the impact of the temperature on the bike rental counts.
1. (10) Estimate the parameters ($\beta_0$, $\beta_1$) for a linear regression to predict Y based on X. Complete the following with details.
2. (8) In order to estimate the linear impact of X on Y at a confidence level of $(1-\alpha)$, you should use the critical value, or the t value, denoted as t(___, ____), which has a value of ____ (use a basic R function or Excel for the exact value) at , and _____ at . The standard error of the estimate is _______________ (formula) = ________ (value). The margin of error, or , of the confidence interval is _________ at , and _____ at .
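As a sketch of how the pieces of a confidence interval for the slope fit together, the R code below uses the basic function qt() on toy data. The data and the value of alpha are hypothetical placeholders, and the code assumes the usual two-sided convention in which the critical value is the $(1-\alpha/2)$ quantile of a t distribution with $n-2$ degrees of freedom.

```r
# Toy data (hypothetical), not the bike data
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)
n <- length(x)
alpha <- 0.05   # placeholder significance level, chosen for illustration

# Least squares estimates and mean square error, by basic arithmetic
Sxx <- sum((x - mean(x))^2)
b1  <- sum((x - mean(x)) * (y - mean(y))) / Sxx
b0  <- mean(y) - b1 * mean(x)
MSE <- sum((y - b0 - b1 * x)^2) / (n - 2)

# Critical value, standard error of the slope, and margin of error
t_crit <- qt(1 - alpha / 2, df = n - 2)
se_b1  <- sqrt(MSE / Sxx)
margin <- t_crit * se_b1

# Confidence interval: point estimate plus or minus the margin of error
ci <- c(lower = b1 - margin, upper = b1 + margin)
print(ci)
```

Changing alpha changes only t_crit, so the margin of error at a second significance level can be obtained by rerunning the last few lines.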
3. (10) Perform a hypothesis test on the linear impact of X on Y, using a t test at a significance level of 0.1.
Note:
· If a question doesn’t specify the hypothesized value, it is a two-sided test against 0.
· Every hypothesis test problem should include the following components: H0/Ha defined in symbols ($\beta_1$, etc.), the test statistic (notation and formula), the rejection region defined by a critical value (or the p-value computed from a probability formula), and the conclusion.
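The components listed in the note can be laid out in R as follows. This is a sketch on toy data, not a solution for the bike data: the hypotheses are assumed to be the default two-sided test of the slope against 0, and names such as t_star and b1_null are illustrative.

```r
# Toy data (hypothetical), not the bike data
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)
n <- length(x)
alpha <- 0.1      # significance level
b1_null <- 0      # hypothesized slope: H0: beta1 = 0 vs Ha: beta1 != 0

# Point estimate and its standard error, by basic arithmetic
Sxx   <- sum((x - mean(x))^2)
b1    <- sum((x - mean(x)) * (y - mean(y))) / Sxx
b0    <- mean(y) - b1 * mean(x)
MSE   <- sum((y - b0 - b1 * x)^2) / (n - 2)
se_b1 <- sqrt(MSE / Sxx)

# Test statistic
t_star <- (b1 - b1_null) / se_b1

# Rejection region based on a critical value: reject H0 if |t*| > t_crit
t_crit <- qt(1 - alpha / 2, df = n - 2)
reject_by_critical_value <- abs(t_star) > t_crit

# Equivalent decision based on the two-sided p-value
p_value <- 2 * pt(-abs(t_star), df = n - 2)
reject_by_p_value <- p_value < alpha

print(c(t_star = t_star, t_crit = t_crit, p_value = p_value))
```

The two decision rules always agree, so either one can be used to state the conclusion.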
4. (6) Use R to obtain a summary of this SLR model. For each of the following concepts, highlight it on the output and give its notation, its value, and finally an interpretation. Compute the item with R or Excel if it is not directly available in the summary.
For example, for the point estimate of the linear impact of X on Y:
The point estimate of the linear impact of X on Y: $b_1 = 0.037756$. It means that when X is increased by 1 unit, Y is increased by 0.037756 units. It measures the linear impact of X on Y through the SLR model.
a) The standard error of the point estimate of the linear impact of X on Y
b) The residual standard error
c) The degrees of freedom of the residual (the interpretation of this concept will be covered later)
d) The mean square error (MSE), that is, the square of the residual standard error
e) The standard deviation of the dependent variable Y, denoted by $s_Y$; briefly explain how it is related to the total sum of squares, $SST = \sum_{i=1}^{n} (Y_i - \bar{Y})^2$
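As a rough guide to where these items sit in R, the sketch below fits a model to toy data (not the bike data) and pulls each quantity from the summary object or computes it with a basic function. The object names are illustrative; only lm(), summary(), df.residual(), and sd() are standard R functions.

```r
# Toy data (hypothetical), not the bike data
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)
n <- length(y)

fit <- lm(y ~ x)
out <- summary(fit)

se_b1  <- out$coefficients["x", "Std. Error"]  # a) standard error of the slope
s      <- out$sigma                            # b) residual standard error
df_res <- df.residual(fit)                     # c) residual degrees of freedom, n - 2
MSE    <- s^2                                  # d) mean square error
sd_y   <- sd(y)                                # e) standard deviation of Y

# The sample variance of Y is the total sum of squares divided by n - 1
SST <- sum((y - mean(y))^2)
all.equal(sd_y^2, SST / (n - 1))
```

Items d) and e) do not appear directly in the printed summary, which is why they are computed from other quantities here.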
5. (6) Multiple choice.
· The tendency, or the form by which the response variable, Y, varies with X, can be estimated with a linear function ____(T/F). The linear function has a true form of $E(Y) = \beta_0 + \beta_1 X$ in the population domain.
· At a general X=Xh level, the predicted value is estimated by . Both and are variables and can be estimated by and on a sample ________(T/F)
· The deviation between the actual response variable Y and the predicted Y, or at a given X=Xh level is called the random error and is denoted by _____(/ ), which can be estimated with a value denoted by_____(/ ) in a sample.
· This random error is assumed to have a distribution of _______(T/F), where the standard deviation, can be estimated by the standard error term denoted by ___ ( / ) computed from a sample.
· The actual response variable, , represents the linear relationship between X and Y. The two “ingredients” in this relationship can be identified as ________.
A. B. C.