Assumptions of Linear Regression

 

1.                  Homoscedasticity – the variance of the error terms is constant for each value of x,  To check this, look at the plot(s) of the residuals versus the X value(s).  You do not want to see a fanning effect in the residuals.

 

2.                  Linearity – the relationship between each x and y is linear.  To check this,  again look at the plot(s) of the residuals versus the X value(s).  You don’t want to see a clustering of positive residuals or a clustering of negative residuals.

 

3.                  Normally Distributed Error Terms – the error terms follow the normal distribution.

 

4.                  Independence of Error Terms – successive residuals are not correlated,  If they are correlated, it is known as autocorrelation.  If possible, use the Durbin Watson statistic to check this.  This statistic is not reported in Excel, so we will use a rule of thumb recommended by Winston and Albright.

 

First define a sign change as a situation where successive residuals

have different signs.  Let n = the number of observations.

 

If the number of sign changes is <= (n-1)/2-sqrt(n-1) then positive

autocorrelation is likely. 

 

If the number of sign changes is >= (n-1)/2+sqrt(n-1) then negative

autocorrelation is likely. 

 

 

Inferences in Regression

 

In the simple linear regression model, this answers the question, “Is the relationship between X and Y significant?

 

Let a = tolerance for error.  Typical values are .01, .05, and .10.

1-a is the confidence level

 

Null hypothesis                      H0 :      the relationship between x & y is not sig

Alternate hypothesis             H1 :      the relationship between x & y is sig

Find p value that correspond to x

If p >= a, fail to reject the H0

If p < a, conclude the H1

 

For our SODA problem, p = .000.

Since p<a, we conclude that there is a significant relationship between soda

consumption and attendance.

 

IIIB                  Multiple Regression Models

 

Allows for more than one independent variable.

 

The model is now expressed as

yhat i = b 0 + b 1 x 1i  + b 1 x 2i  + … b k x ki  

 

where k = the number of independent variables

 

yhat i = expected or predicted value of the dependent variable for case i

b 0  = the intercept (if all x’s are zero, the expected value of y is  b 0 )

b j  = the slope (for every one unit increase in xj , y is expected to change by b j units, given that

the other independent variables are held constant)

x i = the value of the independent variable for case i

 

To illustrate, look at DISTCOST.XLS.

 

Tools

Data Analysis

Regression

OK

Input Y Range      a2:a26

Input X Range      b2:c26

Check labels box

Output Range      any blank space

Check Residuals box

Check Residual Plots box

OK

 

 

The resulting model is

yhat i = -2.72 + .047 x 1i + .012 x 2i

 

So if Sales are 410 thousand and Orders are 4000 (first verify that these are

in the range for the sample).  We can forecast distribution costs by calculating

 

yhat i = -2.72 + .047(410) + .012(4000) =  64.55 in thousands of dollars

 

For each additional thousand dollars in sales, distribution costs are expected to

increase by .047 (thousand) given that orders is held constant.

 

For each additional order filled, distribution costs are expected to increase by

.012 (thousand) given that sales is held constant.

 

If no sales revenues are earned and no orders are filled, expected distribution

costs are –2.72 (thousand) (??)  The intercept does not always have a good

intuitive meaning.

 

 

Coefficient of Determination R2 = SSR/SST    Proportion of variability in Y explained by the set of

independent variables. 

 

For the distcost problem R2 = 0.88 which means that 88% of the variability in

distribution costs can be explained by sales and orders.

 

 

Standard Error of the Estimate se is a measure of forecasting accuracy.   

Se = 4.77  seems relatively small

68% of actual values will be within +/- one se of the forecasted value.

95% of actual values will be within +/- two se of the forecasted value.

 

Inferences in Multiple Regression

 

First answer the question, “Is the relationship between the DV and the set of IVs significant?”

 

Use the Significance F value on the Excel output

 

Null hypothesis                H0 :      the relationship between y and the set of xs is not significant

Alternate hypothesis       H1 :      the relationship between y and the set of xs is significant

If Significance F >= a, fail to reject the H0

If Significance F < a, conclude the H1

 

For the distcost problem, Significance F = 3.04E-10 < a, so conclude the H1

 

Now check the individual variables.

This answers the question, “Is the relationship between xj and y significant in the presence of the other variables?

 

Null hypothesis                H0 :      the relationship between xj  and y is not sig

in the presence of the other variables

Alternate hypothesis       H1 :      the relationship between xj  and y is sig

in the presence of the other variables

 

Find p value that correspond to xj

If p >= a, fail to reject the H0

If p < a, conclude the H1

 

For our distcost problem, for sales, p = .031.

Since p<a, we conclude that there is a significant relationship between sales and distribution costs in the presence of orders.

 

For our distcost problem, for orders, p = .0000287.

Since p<a, we conclude that there is a significant relationship between orders and distribution costs in the presence of sales.

 

So it is worthwhile to keep both variables in the model.