

Creating Indicator (Dummy) Variables

 

Shortcuts to save you time.

Sometimes we need to create a number of indicator, or dummy, variables from a single categorical variable. These indicators usually take the value 1 if the observation has the attribute and 0 if it does not. Most Stata commands support a syntax, called factor variables, that creates indicator variables for you automatically. There's a clear and detailed discussion in Chapter 25 of the User's Guide.

Here's a simple example to give you the flavor of factor variables syntax. Suppose each respondent has a value for their age group in a variable called agegroup that has 5 values.  You want to create 5 indicator variables with values 0/1 that describe whether the respondent is in age group 1 or not, age group 2 or not, etc., and use them in a regression on the correlates of body mass index (bmi). The factor variable syntax is:

regress bmi i.agegroup

It's that simple! By default the lowest value, here age group 1, is treated as the base level and omitted from the equation.

Factor variables syntax is much more powerful than this simple example illustrates. For example, it will create interactions for you with continuous as well as categorical variables. See the User's Guide for a complete discussion.
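As a rough sketch of what that extra power looks like, the commands below change the base level and add interactions. The variables female (coded 0/1) and income (continuous) are hypothetical and used only for illustration. They are not part of the example dataset described above.

```stata
* Assumes bmi and agegroup are in memory; female (0/1) and income
* (continuous) are hypothetical variables used only for illustration.

* Use age group 3, rather than the lowest value, as the base level
regress bmi ib3.agegroup

* Main effects plus the interaction of two categorical variables
regress bmi i.agegroup##i.female

* Main effects plus the interaction with a continuous variable
regress bmi i.agegroup##c.income
```

The c. prefix tells Stata to treat income as continuous. Without it, a variable in an interaction is assumed to be categorical.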

Unfortunately, not all commands support the factor variable syntax. Below are some alternatives in case you need to use a command that doesn't support factor variables.

The most obvious way to do this is the generate command. Suppose we want to create 5 indicator variables from agegroup, a variable with 5 values:

gen age1 = 0
replace age1 = 1 if agegroup == 1
gen age2 = 0
replace age2 = 1 if agegroup == 2
etc.
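One thing to watch with the two-step pattern: an observation with a missing agegroup ends up with 0 in every indicator. A possible one-line variant, assuming agegroup is coded 1 through 5, generates each indicator from a logical expression and leaves it missing where agegroup is missing:

```stata
* Assumes agegroup is in memory, coded 1 through 5.
* (agegroup == 1) evaluates to 1 when true and 0 when false;
* the if qualifier leaves the indicator missing where agegroup is missing.
generate byte age1 = (agegroup == 1) if !missing(agegroup)
generate byte age2 = (agegroup == 2) if !missing(agegroup)
```

Whether missing values should become 0 or stay missing depends on your analysis, so treat this as one option rather than a rule.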

This can be tedious if you're creating a lot of indicator variables. A few shortcuts are available, including recode, autocode, and egen, all of which are discussed in the User's Guide referred to above. Here are a few more alternatives.


The first shortcut is the forvalues command. See Looping over variables and values in this tutorial to learn the basics of this command.

forvalues n = 1/5 {
    gen byte age`n' = 0
    replace age`n' = 1 if agegroup == `n'
}

This use of the forvalues command simply repeats the two commands from the first example once for each of the 5 values, eliminating all that typing and opportunity for error. The closing brace must be on a line by itself. Note that we've added the data storage type byte to the generate command. Since the indicators only contain the values 0 and 1, they easily fit in a single byte of storage. Each indicator then takes 1 byte per observation instead of the 4 bytes of Stata's default float type, which adds up in large datasets. See Describing the data in this tutorial for an explanation of Stata's storage types.
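The forvalues loop works when the values are consecutive integers. If they are not, one possible approach is to collect the observed values with levelsof and loop over them with foreach. The sketch below builds a small toy dataset purely for illustration. The codes 10, 20, and 30 are made up.

```stata
* Toy data for illustration only: age groups coded 10, 20, 30, plus one missing
clear
input agegroup
10
20
30
20
.
end

* Store the distinct nonmissing values of agegroup in the local macro levels
levelsof agegroup, local(levels)

* Create one byte indicator per observed value: age10, age20, age30
foreach v of local levels {
    generate byte age`v' = (agegroup == `v') if !missing(agegroup)
}
```

This assumes the values are nonnegative integers, since each value becomes part of a variable name. Negative or fractional values would need to be recoded first.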


The second shortcut is the tabulate command, which is the easiest to use.

tab agegroup, gen(age)

The gen option on tabulate creates a new dummy variable for each value of agegroup. It names each dummy using the prefix you assign in parentheses, in this case "age". Note that the dummies are numbered sequentially, age1 through age5, in order of the values of agegroup, so the number in the name may or may not match the value it represents. However, the values are recorded in the variable labels.
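A quick way to confirm which value each dummy stands for is to look at the variable labels and cross-tabulate one dummy against the original variable. This is a minimal sketch, assuming agegroup is in memory and that no variables named age1 through age5 exist yet:

```stata
* Assumes agegroup is in memory and age1-age5 do not exist yet.
tabulate agegroup, generate(age)

* Show the variable labels, which record the value behind each dummy
describe age1-age5

* Cross-check one dummy against the original variable, including missings
tabulate agegroup age1, missing
```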


The third shortcut is the xi command. This command is really intended to feed indicator variables into another Stata command, such as a regression. It has largely been replaced by factor variables, but it will create dummy variables.

rename agegroup age  
xi, prefix(i) noomit i.age

First we rename agegroup to age so that the indicator variables have a shorter name. In the xi command, the prefix(i) option replaces Stata's default prefix, "_I", which it adds to each dummy variable name, with "i". We use the noomit option because by default xi does not create a dummy for the lowest value (remember, it's designed to feed these dummies into a multivariate procedure, so one category must be dropped). The result is 5 indicator variables named iage_1, iage_2, ..., iage_5.
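For comparison, here is a sketch of xi in its intended role as a prefix to an estimation command. It assumes bmi is in memory and uses the original variable name agegroup, before the rename shown above.

```stata
* Assumes bmi and agegroup are in memory.
* Used as a prefix, xi expands i.agegroup into dummies, omits one
* category, and passes the rest to regress.
xi: regress bmi i.agegroup

* The dummies stay in the dataset under the default _I prefix
describe _I*
```

With a command that supports factor variables, regress bmi i.agegroup does the same job without adding variables to the dataset.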

 




Questions or comments? If you are affiliated with the Carolina Population Center, send them to Phil Bardsley.