III Causal
Forecasting Methods
These methods use historical data to estimate the
relationship between the variable you are trying to predict (a.k.a. dependent
variable or response variable) and other variables (a.k.a. independent
variables or explanatory variables).
For example, we might try to predict a driver’s reaction
time using the number of hours of sleep the driver had the night before and the
number of ounces of alcohol he/she has consumed. In this case, reaction time is the dependent variable and sleep
hours and alcohol ounces are the independent variables.
Another example might involve estimating someone’s income
(the dependent variable) based on his/her age and number of years of higher
education (the independent variables).
The forecasting model will always involve only one dependent
variable and one or more independent variables.
IIIA Simple Linear
Regression
This model is appropriate when there is only one independent
variable.
The population model is stated as
yi = b 0 + b 1 x i + e I
where
yi = observed
value of the dependent variable for case i
b 0
= the population intercept
b 1
= the population slope
x i = the observed value of the independent
variable
e i
= the error for case i (can be
positive or negative)
This model is fitted using sample data and the sample model
can be stated as
yhat i = b 0 + b 1 x i
where
yhat i = expected or predicted value of the
dependent variable for case i
b 0 = the intercept (if x equals zero, the expected
value of y is b 0 )
b 1 = the slope (for every one unit increase in x, y
is expected to change by b 1 units)
x i = the
value of the independent variable for case i
Let e i = y i – yhat i be the error for case i
But how do you come up with b 0 and b 1? They are determined by minimizing the sum of
the squared error terms. This is known
as least squares estimation.
b 1 = [S (xi – xbar)(yi-ybar)]/S(xi
– xbar)2
b 0 = ybar - b 1xbar
where ybar is the average value of the dependent variable and
xbar is the average value of the independent variable.
To illustrate, look at SODA.xls on the formulas page.
The resulting model is
yhat i = 10.90 + 0.90 x i
So if attendance is 200 (first verify that it is in the
range for the sample). We can forecast soda consumption by calculating
yhat i = 10.90 + 0.90 * 200 = 190.90 sodas
For each additional person attending the game, soda
consumption is expected to increase by 0.90 sodas.
If no people went to the game, expected consumption is 10.90
sodas (??)
The intercept does not always have a good intuitive meaning.
We can achieve the same results as well as some additional
analysis using some built-in functions of Excel. Look at SODA.XLS on the built-in page.
First name ranges.
Highlight cells b6:b25 and name this block ATTEND. Highlight c6:c25 and name this block SODAS.
In cell b28 type =intercept(sodas,attend)
In cell b29 type =slope(sodas,attend)
In cell b30 type =rsq(sodas,attend)
In cell b31 type =steyx(sodas,attend)
In cell b32 type =correl(sodas,attend)
We already know what the intercept and slope represent, but
what about the others? First some
background…
SST = SSR + SSE
Sum of Squares Total (SST) is the total variability of the y
values around their mean.
Sum of Squares Regression (SSR) is the amount of variability
explained by the regression relationship.
Sum of Squares Error (SSE) is the amount of variability not
explained (recall that we minimize the sum of the squared error terms).
Want SSE è 0 and SSR
è SST
Coefficient of Determination R2 = SSR/SST Proportion of variability in Y explained by
x. [rsq(y range, x range)]
0<= R2 <= 1
If R2 =
1, perfect relationship
If R2 = 0,
no relationship
For the soda problem R2 = 0.96 which means that 96%
of the variability in soda consumption can be explained by attendance.
Related to that is the Correlation Coefficient rxy = sqrt(R2) [=correl(y
range, x range)]
It takes the sign of the slope.
-1 <= rxy <= 1
If rxy = 1, perfect positive correlation
If rxy = -1, perfect negative correlation
If rxy = 0, no correlation
Standard Error of the Estimate se is a
measure of forecasting accuracy. [=steyx(yrange,
x range)]
68% of actual values will be within +/- one se of
the forecasted value.
95% of actual values will be within +/- two se of
the forecasted value.
Can construct a scatter plot of these observations.
Highlight ATTEND and SODAS.
Click on chart wizard.
XY scatter
Next
Next
Titles Soda
consumption versus Attendance
Next
As new sheet
Finish
Now right mouse click on any point.
Choose Add Trendline
Linear
OK
Can also run a regression using Excel’s Data Analysis tools.
Tools
Data Analysis (if this option does not appear, you need to
go to add-ins and select it)
Regression
OK
Input Y Range c5:c25
Input X Range b5:b25
Check labels box
Output Range any
blank space
Check Residuals box
Check Residual Plots box
OK