III         Causal Forecasting Methods

 

These methods use historical data to estimate the relationship between the variable you are trying to predict (a.k.a. dependent variable or response variable) and other variables (a.k.a. independent variables or explanatory variables).

 

For example, we might try to predict a driver’s reaction time using the number of hours of sleep the driver had the night before and the number of ounces of alcohol he/she has consumed.  In this case, reaction time is the dependent variable and sleep hours and alcohol ounces are the independent variables.

 

Another example might involve estimating someone’s income (the dependent variable) based on his/her age and number of years of higher education (the independent variables).

 

The forecasting model will always involve only one dependent variable and one or more independent variables.

 

IIIA      Simple Linear Regression

 

This model is appropriate when there is only one independent variable. 

 

The population model is stated as

 

yi = b 0 + b 1 x i + e I

 

where

yi =      observed value of the dependent variable for case i

b 0 =    the population intercept

b 1 =    the population slope

x i  =    the observed value of the independent variable

e i =     the error for case i (can be positive or negative)

 

This model is fitted using sample data and the sample model can be stated as

 

yhat i = b 0 + b 1 x i

 

where

yhat i = expected or predicted value of the dependent variable for case i

b 0  =   the intercept (if x equals zero, the expected value of y is  b 0 )

b 1  =   the slope (for every one unit increase in x, y is expected to change by b 1 units)

x i =     the value of the independent variable for case i

 

Let e i = y i – yhat i     be the error for case i

 

But how do you come up with b 0 and b 1?  They are determined by minimizing the sum of the squared error terms.  This is known as least squares estimation. 

 

b 1 = [S (xi – xbar)(yi-ybar)]/S(xi – xbar)2

 

b 0 = ybar - b 1xbar

 

where ybar is the average value of the dependent variable and xbar is the average value of the independent variable.

 

To illustrate, look at SODA.xls on the formulas page.

 

The resulting model is

yhat i = 10.90 + 0.90 x i

 

So if attendance is 200 (first verify that it is in the range for the sample). We can forecast soda consumption by calculating

 

yhat i = 10.90 + 0.90 * 200 = 190.90 sodas

 

For each additional person attending the game, soda consumption is expected to increase by 0.90 sodas.

 

If no people went to the game, expected consumption is 10.90 sodas (??)

The intercept does not always have a good intuitive meaning.

 

We can achieve the same results as well as some additional analysis using some built-in functions of Excel.  Look at SODA.XLS on the built-in page.

 

First name ranges.  Highlight cells b6:b25 and name this block ATTEND.  Highlight c6:c25 and name this block SODAS.

 

In cell b28 type =intercept(sodas,attend)

In cell b29 type =slope(sodas,attend)

In cell b30 type =rsq(sodas,attend)

In cell b31 type =steyx(sodas,attend)

In cell b32 type =correl(sodas,attend)

 

We already know what the intercept and slope represent, but what about the others?  First some background…

 

SST = SSR + SSE

 

Sum of Squares Total (SST) is the total variability of the y values around their mean.

Sum of Squares Regression (SSR) is the amount of variability explained by the regression relationship.

Sum of Squares Error (SSE) is the amount of variability not explained (recall that we minimize the sum of the squared error terms).

 

Want SSE è 0 and SSR è SST

 

Coefficient of Determination R2 = SSR/SST    Proportion of variability in Y explained by x.  [rsq(y range, x range)]

 

0<= R2 <= 1

 

If R2  = 1, perfect relationship

If R2  = 0, no relationship

 

For the soda problem R2 = 0.96 which means that 96% of the variability in soda consumption can be explained by attendance.

 

Related to that is the Correlation Coefficient  rxy = sqrt(R2)                [=correl(y range, x range)]

It takes the sign of the slope.

 

-1 <= rxy <= 1

 

If rxy = 1, perfect positive correlation

If rxy = -1, perfect negative correlation

If rxy = 0, no correlation

 

Standard Error of the Estimate se is a measure of forecasting accuracy.    [=steyx(yrange, x range)]

68% of actual values will be within +/- one se of the forecasted value.

95% of actual values will be within +/- two se of the forecasted value.

 

Can construct a scatter plot of these observations.

Highlight ATTEND and SODAS.

Click on chart wizard.

XY scatter

Next

Next

Titles   Soda consumption versus Attendance

Next   

As new sheet

Finish

 

Now right mouse click on any point.

Choose Add Trendline

Linear

OK

 

Can also run a regression using Excel’s Data Analysis tools.

 

Tools

Data Analysis (if this option does not appear, you need to go to add-ins and select it)

Regression

OK

Input Y Range            c5:c25

Input X Range            b5:b25

Check labels box

Output Range            any blank space

Check Residuals box

Check Residual Plots box

OK