A categorical variable is one whose values are categories rather than
naturally numerical quantities. We represent categories
using dummy variables (a.k.a. binary or indicator variables).
For example, if there are two categories, then X = 0
if the observation belongs to category 1 and X = 1 if the observation belongs to
category 2.
If the variable is gender, then we could code it so that
X = 0 if male and X = 1 if female.
Perhaps we’re trying to estimate the life span of an
individual given his/her gender. The
relationship might be something like
$\hat{y}_i = 72 + 3.5x_i$
So if you’re female,
$\hat{y}_i = 72 + 3.5(1) = 75.5$ years.
And if you’re male,
$\hat{y}_i = 72 + 3.5(0) = 72$ years.
The difference in predicted life spans is the coefficient on the dummy
variable, $b_1 = 3.5$ years.
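The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the coefficients 72 and 3.5 come from the example line above, while the function names `encode_gender` and `predict_life_span` are hypothetical helpers introduced here.

```python
# Coefficients from the example line in the text: yhat = 72 + 3.5 x
B0 = 72.0  # intercept: predicted life span when x = 0 (male)
B1 = 3.5   # dummy coefficient: female minus male difference


def encode_gender(gender: str) -> int:
    """Hypothetical helper: map a gender label to the dummy value (male=0, female=1)."""
    codes = {"male": 0, "female": 1}
    return codes[gender.lower()]


def predict_life_span(gender: str) -> float:
    """Evaluate yhat = B0 + B1 * x for the given gender."""
    return B0 + B1 * encode_gender(gender)


male = predict_life_span("male")
female = predict_life_span("female")
# The gap between the two predictions equals the dummy coefficient B1.
print(male, female, female - male)
```

Whichever category is coded 0 gets the intercept as its prediction; the other category's prediction is shifted by the dummy coefficient.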
What if there are more than two categories? For example, how would we represent class
standing (FR, SO, JR, SR)? It takes one
fewer dummy variable than the number of categories to represent this, so four class standings need three dummies. An example coding scheme follows.
Let X1 = 1 if SO and 0 otherwise. Let X2 = 1 if JR and 0 otherwise. Let X3 = 1 if SR and 0 otherwise.
      X1  X2  X3
FR     0   0   0
SO     1   0   0
JR     0   1   0
SR     0   0   1
The category with all zeros (here FR) is referred to as the reference class.
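As a rough sketch, the coding scheme above could be produced programmatically as follows. The names `CATEGORIES`, `REFERENCE`, and `encode_standing` are assumptions made for illustration; only the FR/SO/JR/SR labels and the 0/1 pattern come from the table.

```python
from typing import Tuple

# Category labels from the table; FR is the reference class (all zeros).
CATEGORIES = ["FR", "SO", "JR", "SR"]
REFERENCE = "FR"


def encode_standing(standing: str) -> Tuple[int, ...]:
    """Hypothetical helper: return (X1, X2, X3) for a class standing."""
    if standing not in CATEGORIES:
        raise ValueError(f"unknown class standing: {standing!r}")
    # One dummy per non-reference category, in the order SO, JR, SR.
    non_reference = [c for c in CATEGORIES if c != REFERENCE]
    return tuple(1 if standing == c else 0 for c in non_reference)


for label in CATEGORIES:
    print(label, encode_standing(label))
```

With four categories the helper returns three values, and each observation has at most one dummy equal to 1.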
Let’s use this in an example. Suppose we are trying to predict someone’s income based on
his/her class standing (X1, X2, X3) and age (x4). (See the INCOME.XLS file.)
The regression results show that
$\hat{y}_i = 2435.3 + 163.1x_{1i} + 3704.5x_{2i} + 11239x_{3i} + 478.2x_{4i}$
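For freshmen, x1 = x2 = x3 = 0:
$\hat{y}_i = 2435.3 + 478.2x_{4i}$
For sophomores, x1 = 1, x2 = x3 = 0:
$\hat{y}_i = (2435.3 + 163.1) + 478.2x_{4i} = 2598.4 + 478.2x_{4i}$
For juniors, x2 = 1, x1 = x3 = 0:
$\hat{y}_i = (2435.3 + 3704.5) + 478.2x_{4i} = 6139.8 + 478.2x_{4i}$
For seniors, x3 = 1, x1 = x2 = 0:
$\hat{y}_i = (2435.3 + 11239) + 478.2x_{4i} = 13674.3 + 478.2x_{4i}$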
These are all parallel lines in age (i.e., the slope on x4 is the
same for every class). Only the intercepts differ. Graph these. Then answer the following questions.
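A minimal sketch of the four class-specific lines, assuming the coefficients reported above. The names `B_CLASS`, `intercept`, and `predict_income` are hypothetical and are not part of INCOME.XLS.

```python
# Coefficients from the regression results in the text.
B0 = 2435.3                      # intercept for the reference class (FR)
B_CLASS = {"FR": 0.0,            # reference class adds nothing
           "SO": 163.1,
           "JR": 3704.5,
           "SR": 11239.0}
B_AGE = 478.2                    # common slope on age (x4)


def intercept(standing: str) -> float:
    """Intercept of the line for one class: B0 plus that class's dummy coefficient."""
    return B0 + B_CLASS[standing]


def predict_income(standing: str, age: float) -> float:
    """Predicted income for a given class standing and age."""
    return intercept(standing) + B_AGE * age


for standing in B_CLASS:
    # Each class has its own intercept but shares the slope B_AGE.
    print(standing, round(intercept(standing), 1), B_AGE)
```

Evaluating `predict_income` over a range of ages for each class gives the points needed to graph the four parallel lines.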
· If there are two students who are both 28, but one is a freshman and one is a senior, what is the difference in their expected incomes?
· Do you see why freshman is referred to as the reference class?
The coefficient on a dummy variable shows the difference in
$\hat{y}$ between that category and the reference class, holding the other
independent variables constant!
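To see "holding other independent variables constant" in action, the sketch below compares a junior and a sophomore of the same age at several ages. The coefficients are the ones reported above; the function names are hypothetical, and the junior/sophomore pairing is just one possible comparison.

```python
# Coefficients from the regression results in the text.
B0 = 2435.3
B_SO, B_JR, B_SR = 163.1, 3704.5, 11239.0
B_AGE = 478.2


def predict_income(x1: int, x2: int, x3: int, age: float) -> float:
    """Evaluate the fitted equation for dummies (x1, x2, x3) and age."""
    return B0 + B_SO * x1 + B_JR * x2 + B_SR * x3 + B_AGE * age


def junior_minus_sophomore(age: float) -> float:
    """Gap in predicted income between a junior and a sophomore of the same age."""
    junior = predict_income(0, 1, 0, age)
    sophomore = predict_income(1, 0, 0, age)
    return junior - sophomore


# The age term cancels, so the gap is the same at every age.
gaps = [round(junior_minus_sophomore(age), 1) for age in (20, 30, 40)]
print(gaps)
```

Because age enters both predictions identically, the gap reduces to the difference between the two dummy coefficients (3704.5 - 163.1), regardless of age.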
Note that this model is significant overall, and each variable is
significant in the presence of the others except for X1. How would you interpret that?