Categorical Independent Variables

 

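A categorical variable is one whose values are labels rather than naturally numerical quantities.  To use it in a regression, we represent its categories with dummy variables (a.k.a. binary or indicator variables), which take only the values 0 and 1.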

 

For example, if there are two categories, then X = 0 if the observation belongs to category 1 and X = 1 if the observation belongs to category 2.

 

If the variable is gender, we could code it so that X = 0 if male and X = 1 if female.

 

Perhaps we’re trying to estimate the life span of an individual given his/her gender.  The relationship might be something like

 

$\hat{y}_i = 72 + 3.5x_i$

 

So if you’re female, $\hat{y}_i = 72 + 3.5(1) = 75.5$ years.

 

And if you’re male, $\hat{y}_i = 72 + 3.5(0) = 72$ years.

 

The difference in predicted life spans is the coefficient on the dummy variable, $b_1 = 3.5$ years.
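
The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function name `predicted_life_span` is hypothetical, and the coefficients 72 and 3.5 are taken from the example equation, not from real data.

```python
def predicted_life_span(is_female: int) -> float:
    """Predicted life span in years from the example equation yhat = 72 + 3.5x."""
    b0 = 72.0  # intercept: prediction for the category coded 0 (male)
    b1 = 3.5   # dummy coefficient: shift for the category coded 1 (female)
    return b0 + b1 * is_female


male = predicted_life_span(0)
female = predicted_life_span(1)

# The gap between the two predictions is exactly the dummy coefficient b1.
print(male, female, female - male)  # prints: 72.0 75.5 3.5
```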

 

 

 

What if there are more than two categories?  For example, how would we represent class standing (FR, SO, JR, SR)?  It takes one fewer dummy variable than the number of categories, so four categories need three dummy variables.  An example coding scheme follows.

 

Let X1 = 1 if SO and 0 otherwise.  Let X2 = 1 if JR and 0 otherwise.  Let X3 = 1 if SR and 0 otherwise.

 

        X1    X2    X3
FR       0     0     0     (the category with all zeros is referred to as the reference class)
SO       1     0     0
JR       0     1     0
SR       0     0     1
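
The coding scheme in the table can be sketched in Python as follows. The names `LEVELS` and `dummy_code` are hypothetical, introduced only for illustration, and the sketch assumes the first entry of `LEVELS` is the reference class.

```python
# Class standings in order; the first one (FR) is assumed to be the reference class.
LEVELS = ["FR", "SO", "JR", "SR"]


def dummy_code(standing: str) -> tuple:
    """Return (X1, X2, X3) for a class standing, with FR coded as all zeros."""
    if standing not in LEVELS:
        raise ValueError(f"unknown class standing: {standing!r}")
    # One dummy per non-reference level: 1 if the standing matches, else 0.
    return tuple(int(standing == level) for level in LEVELS[1:])


for standing in LEVELS:
    print(standing, dummy_code(standing))
```

Four categories yield three dummies because the reference class needs no variable of its own: it is identified by all three dummies being zero.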

 

Let’s use this in an example.  Suppose we are trying to predict someone’s income based on his/her class standing (X1, X2, X3) and age (X4).  (See the INCOME.XLS file.)

 

The regression results show that

$\hat{y}_i = 2435.3 + 163.1x_{1i} + 3704.5x_{2i} + 11239x_{3i} + 478.2x_{4i}$

 

For freshmen, $x_1 = x_2 = x_3 = 0$:

$\hat{y}_i = 2435.3 + 478.2x_{4i}$

 

 

For sophomores, $x_1 = 1$, $x_2 = x_3 = 0$:

$\hat{y}_i = (2435.3 + 163.1) + 478.2x_{4i} = 2598.4 + 478.2x_{4i}$

 

 

For juniors, $x_2 = 1$, $x_1 = x_3 = 0$:

$\hat{y}_i = (2435.3 + 3704.5) + 478.2x_{4i} = 6139.8 + 478.2x_{4i}$

 

For seniors, $x_3 = 1$, $x_1 = x_2 = 0$:

$\hat{y}_i = (2435.3 + 11239) + 478.2x_{4i} = 13674.3 + 478.2x_{4i}$
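
The four class-specific equations can be reproduced with a short Python sketch. The coefficients are the ones from the regression equation in the text; the names `B0`, `B_CLASS`, `B_AGE`, and `predicted_income` are hypothetical, chosen for this illustration.

```python
B0 = 2435.3     # intercept (freshmen, the reference class)
B_AGE = 478.2   # slope on age, shared by every class
# Dummy coefficients; the reference class contributes nothing extra.
B_CLASS = {"FR": 0.0, "SO": 163.1, "JR": 3704.5, "SR": 11239.0}


def predicted_income(standing: str, age: float) -> float:
    """Predicted income for a given class standing and age."""
    return B0 + B_CLASS[standing] + B_AGE * age


# Setting age to 0 isolates each class's intercept; the age slope never changes.
for standing in B_CLASS:
    intercept = predicted_income(standing, 0)
    print(f"{standing}: yhat = {intercept:.1f} + {B_AGE}x4")
```

Each class's intercept is the base intercept plus that class's dummy coefficient, which is why the lines differ only in height.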

 

These are all parallel lines: the slope on age (478.2) is the same for every class, and only the intercepts differ.  Graph these, then answer the following questions.

 

 

 

· If there are two students who are both 28, but one is a freshman and one is a senior, what is the difference in their expected incomes?

 

Do you see why freshman is referred to as the reference class?

 

 

The slope for a dummy variable shows the difference in $\hat{y}$ between that category and the reference class, holding the other independent variables constant!
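
A quick numerical check of this statement, sketched in Python using the income equation from the text. The function names `yhat` and `sophomore_gap` are hypothetical, and the ages tried are arbitrary; sophomores are compared with freshmen (the reference class) at the same age.

```python
# Coefficients from the fitted income equation in the text.
b0, b1, b2, b3, b4 = 2435.3, 163.1, 3704.5, 11239.0, 478.2


def yhat(x1: int, x2: int, x3: int, age: float) -> float:
    """Predicted income given the three class dummies and age."""
    return b0 + b1 * x1 + b2 * x2 + b3 * x3 + b4 * age


def sophomore_gap(age: float) -> float:
    """Sophomore minus freshman predicted income at the same age."""
    return yhat(1, 0, 0, age) - yhat(0, 0, 0, age)


# The age terms cancel, so the gap is the dummy coefficient b1 at every age.
for age in (20, 30, 40):
    print(age, round(sophomore_gap(age), 1))
```

Because age is held constant, its contribution cancels in the subtraction and only the dummy coefficient remains.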

 

Note that the overall model is significant, and each variable except X1 is significant in the presence of the others.  How would you interpret that?