Adding summary statistics to a data file


Adding a statistic to each observation, reducing the file to summary statistics.

Stata has commands that either add a summary statistic to each observation in memory or reduce the file to one observation per group, so that each resulting observation holds the summary statistic for that group.

use "q:\utilities\statatut\exampfac.dta"

/* Add the mean facility age to each observation. */

egen mage= mean(age)
su mage
list facid age mage in 1/5

/* Create a file containing the mean facility age by authority. */

collapse (mean) age, by(authorit)

Both the egen and collapse commands offer a large number of statistics. See the manual or the online help for a full list.
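
As a rough illustration of that range, the sketch below reloads the example file (the collapse command above replaced the data in memory) and asks for a few other statistics. The new variable names sdage, nfac, meanage, medage, and maxage are made up for this example; age and authorit are the variables used above.

use "q:\utilities\statatut\exampfac.dta", clear

/* Add the standard deviation of age, and the number of nonmissing ages, within each authority. */

bysort authorit: egen sdage= sd(age)
bysort authorit: egen nfac= count(age)

/* Reduce the file to several statistics per authority. Each newname=age pair names one result. */

collapse (mean) meanage=age (median) medage=age (max) maxage=age, by(authorit)
list

Naming each result with newname=age is what lets one collapse command keep several statistics of the same variable; without the names, the statistics would all try to use the name age.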


1. The egen command adds a new variable, in this case called mage, to every observation in the data. What happens to the original observations and variables?

2. The name "egen" stands for "extensions to generate." What's the difference between generate and egen?

3. The collapse command calculated the mean age for each value of authorit. How many observations and variables did the resulting data file in memory contain?




1. The original observations and variables are unchanged. The egen command simply adds another column to the data in memory.

2. The generate command has functions, such as "log" below, that compute a value for each observation from that observation's own data. The egen command has a different set of functions. Some of them also work observation by observation, while others compute a summary statistic across all observations (or across the observations in each group) and put that statistic on each observation.

For example, the following generate command would calculate the natural log of the age of each facility in exampfac.dta and add that value to each facility's record:

      generate lage= log(age)

In contrast, the following egen command would calculate the median age of the facilities within each authority (authorit), and it would add that value to each facility's record:

      sort authorit
      by authorit: egen medage= median(age)

By the way, you can use bysort to combine the above two commands and reduce your typing:

      bysort authorit: egen medage= median(age)

See the manual or online help for generate, egen, and functions for more information.

3. After the collapse command, the resulting file had 13 observations and 2 variables. The number of observations equals the number of distinct values of the by variable. The number of variables equals the number of summary statistics calculated plus the number of by variables.
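
One way to check these counts yourself is sketched below. It assumes the example file is still at the path used earlier, and it wraps the collapse in preserve and restore so that the original data come back afterward.

      use "q:\utilities\statatut\exampfac.dta", clear
      preserve
      collapse (mean) age, by(authorit)
      describe
      count
      restore

The describe command reports the number of observations and variables in memory and lists the variables, and count displays the number of observations. After restore, the full facility-level file is in memory again.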

