The by command in detail


More examples of the by command.

This page is intended to show how to use the by command to create variables and access specific observations in sub-groups.  Sub-group processing breaks down the dataset into multiple mini-datasets.  The way to think about it is as if each sub-group is the entire dataset.

The way to access data from different observations is with internal Stata variables _n and _N.  The variable _n is equal to the numeric position of an observation and _N is equal to the total number of observations.  For example, in a dataset of 10 observations, in the first observations _n is equal to 1 and _N is equal to 10.  In the second observation _n is equal to 2 and _N is equal to 10, etc.

v001 _n _N
---- -- --
1002  1 10
1002  2 10
1002  3 10
1002  4 10
1002  5 10
1003  6 10
1003  7 10
1003  8 10
1004  9 10
1004 10 10

When using the by command, _n is equal to 1 for the first obs of the by-group and _N is equal to the total number of observations for the by-group.  So if a person has 5 observations in the data set and v001 is the id variable, "by v001: " creates by-groups where the first observation in the by-group _n=1 and _N=5.  The second observation _n=2 and _N=5 , etc.  The last observation for that person _n will equal _N.


Note: if conditions in the command being run by the by command do not subset the by group so _n and _N are not affected by the ifcondition.

by v001:

 makes the data look like this:

v001 _n _N
---- -- --
1002  1  5
1002  2  5
1002  3  5
1002  4  5
1002  5  5
1003  1  3
1003  2  3
1003  3  3
1004  1  2
1004  2  2

If you want to keep the first obervation per person do:

 by v001: keep  if _n == 1

 second observation is:

 by v001: keep  if _n == 2

 second to the last observation is:

 by v001: keep  if _n == _N - 1

 last observation is:

 by v001: keep  if _n == _N


The following is Stata code that plays with this concept.

NOTE:  the following is a do-file.  Highlight and copy all text between the horizontal bars and paste into the Stata interactive do-file editor window so that you can submit this do-file.

use "q:\utilities\statatut\examfac2.dta",clear

keep facid factype q102_1-q103_20
** q103_1-q103_20 are # hours/week usually worked **

** facid is the id var **

** q102_1-q102_20 are titles of facility workers **
label list title type

/** The following code figures out how many total hours per week
  * different types of workers at a health facility work. **/
sort facid

/* Get rid of duplicate ids. */
drop  if facid == facid[_n-1]

/* This would also keep the first obs in a duplicate id situation:
 * by facid: keep  if facid[_n] == 1 */

/* Make the data set multiple obs per facility. */
reshape long q102_ q103_, i(facid) j(position)

/** Save the data set as a temporary data set in case you want to try another idea. **/
gen ttl_h1= 0
replace ttl_h1=q103_  if q103_ != 99 & q102_ == 1

by facid: gen ttl_th1=ttl_h1[1]+ttl_h1[2]+ttl_h1[3]+ttl_h1[4]+ttl_h1[5]+ ///
   ttl_h1[6]+ttl_h1[7]+ttl_h1[8]+ttl_h1[9]+ttl_h1[10]+                   ///
   ttl_h1[11]+ttl_h1[12]+ttl_h1[13]+ttl_h1[14]+ttl_h1[15]+               ///

list facid factype q102_ q103_ ttl_h1 ttl_th1 in 1/500 if factype==3

/* or you could use the -forvalues- command */
replace ttl_th1= 0
forvalues num= 1/20 {
 by facid: replace ttl_th1= ttl_h1[`num'] + ttl_th1

/* If you wanted totals for all the other titles you'd have
 * to repeat the above 19 more times. */

/* You could use the -egen- command to sum up work hours per title for each facilty
 * and for commands to reduce the amount of typing. */


foreach var of newlist ttl_h1-ttl_h10 {
 gen `var'= .

forvalues num = 1/10 {
 replace ttl_h`num'= q103_  if q103_ != 99 & q102_ == `num'
 egen ttl_th`num'= sum(ttl_h`num'), by(facid)

// examples of when factype == 3 in the first 500 obs
list facid factype q102_ q103_ ttl_h1 ttl_th1  in 1/500  if factype == 3

/** Keep only the variables that represent all obs of the facility. **/
keep facid factype ttl_th1-ttl_th10

/** Keep only the last obs per facid, though any of it's obs would be just as good. **/
by facid: keep  if _n == _N

list  in 1/500  if factype == 3


This page was contributed by Dan Blanchette, however please direct questions to Phil Bardsley as noted below.


Review again?


Another topic?

Wink Plone Theme by Quintagroup © 2013.

Personal tools
This is themeComment for Wink theme