1.
Homoscedasticity – the variance of the error terms is
constant for each value of x, To check
this, look at the plot(s) of the residuals versus the X value(s). You do not want to see a fanning effect in
the residuals.
2.
Linearity – the relationship between each x and y is
linear. To check this, again look at the plot(s) of the residuals
versus the X value(s). You don’t want
to see a clustering of positive residuals or a clustering of negative
residuals.
3.
Normally Distributed Error Terms – the error terms follow
the normal distribution.
4.
Independence of Error Terms – successive residuals are not
correlated, If they are correlated, it
is known as autocorrelation. If possible,
use the Durbin Watson statistic to check this.
This statistic is not reported in Excel, so we will use a rule of thumb
recommended by Winston and Albright.
First
define a sign change as a situation where successive residuals
have
different signs. Let n = the number of
observations.
If the
number of sign changes is <= (n-1)/2-sqrt(n-1) then positive
autocorrelation
is likely.
If the
number of sign changes is >= (n-1)/2+sqrt(n-1) then negative
autocorrelation
is likely.
In the simple linear regression model, this answers the
question, “Is the relationship between X and Y significant?
Let a =
tolerance for error. Typical values are
.01, .05, and .10.
1-a is the
confidence level
Null hypothesis H0
: the relationship between x & y
is not sig
Alternate hypothesis H1 : the relationship between x & y is sig
Find p value that correspond to x
If p >= a, fail to
reject the H0
If p < a, conclude
the H1
For our SODA problem, p = .000.
Since p<a, we
conclude that there is a significant relationship between soda
consumption and attendance.
Allows for more than one independent variable.
The model is now expressed as
yhat i = b 0 + b 1 x 1i
+ b 1 x 2i + … b k x ki
where k = the number of independent variables
yhat i = expected or predicted value of the
dependent variable for case i
b 0 = the
intercept (if all x’s are zero, the expected value of y is b 0 )
b j = the
slope (for every one unit increase in xj , y is expected to change
by b j units, given that
the other independent variables are held constant)
x i = the value of the independent variable for
case i
To illustrate, look at DISTCOST.XLS.
Tools
Data Analysis
Regression
OK
Input Y Range a2:a26
Input X Range b2:c26
Check labels box
Output Range any blank space
Check Residuals box
Check Residual Plots box
OK
The resulting model is
yhat i = -2.72 + .047 x 1i + .012 x 2i
So if Sales are 410 thousand and Orders are 4000 (first
verify that these are
in the range for the sample). We can forecast distribution costs by calculating
yhat i = -2.72 +
.047(410) + .012(4000) = 64.55 in
thousands of dollars
For each additional thousand dollars in sales, distribution
costs are expected to
increase by .047 (thousand) given that orders is held
constant.
For each additional order filled, distribution costs are
expected to increase by
.012 (thousand) given that sales is held constant.
If no sales revenues are earned and no orders are filled,
expected distribution
costs are –2.72 (thousand) (??) The intercept does not always have a good
intuitive meaning.
Coefficient of Determination R2 = SSR/SST Proportion of variability in Y explained by
the set of
independent variables.
For the distcost problem R2 = 0.88 which means
that 88% of the variability in
distribution costs can be explained by sales and orders.
Standard Error of the Estimate se is a
measure of forecasting accuracy.
Se = 4.77
seems relatively small
68% of actual values will be within +/- one se of
the forecasted value.
95% of actual values will be within +/- two se of
the forecasted value.
First answer the question, “Is the relationship between the
DV and the set of IVs significant?”
Use the Significance F value on the Excel output
Null hypothesis H0 : the relationship between y and the set of
xs is not significant
Alternate hypothesis H1 : the relationship between y and the set of xs is significant
If Significance F >= a, fail to
reject the H0
If Significance F < a, conclude
the H1
For the distcost problem,
Significance F = 3.04E-10 < a, so
conclude the H1
Now check the individual variables.
This answers the question, “Is the relationship between xj
and y significant in the presence of the other variables?
Null hypothesis H0 : the relationship between xj and y is not sig
in the
presence of the other variables
Alternate hypothesis H1 : the relationship between xj and y is sig
in the
presence of the other variables
Find p value that correspond to xj
If p >= a, fail to
reject the H0
If p < a, conclude
the H1
For our distcost problem, for
sales, p = .031.
Since p<a, we conclude
that there is a significant relationship between sales and distribution costs
in the presence of orders.
For our distcost problem, for
orders, p = .0000287.
Since p<a, we
conclude that there is a significant relationship between orders and distribution
costs in the presence of sales.
So it is worthwhile to keep both
variables in the model.