The by command in detailMore examples of the by command.This page is intended to show how to use the by command to create variables and access specific observations in sub-groups. Sub-group processing breaks down the dataset into multiple mini-datasets. The way to think about it is as if each sub-group is the entire data set. The way to access data from different observations is with internal Stata variables _n and _N. The variable _n is equal to the numeric position of an observation and _N is equal to the total number of observations. For example, in a data set of 10 obs, the first obs _n is equal to 1 and _N is equal to 10. For the second obs _n is equal to 2 and _N is equal to 10, etc. v001 _n _N ---- -- -- 1002 1 10 1002 2 10 1002 3 10 1002 4 10 1002 5 10 1003 6 10 1003 7 10 1003 8 10 1004 9 10 1004 10 10When using the by command, _n is equal to 1 for the first obs of the by-group and _N is equal to the total number of obs for the by-group. So if a person has 5 observations in the data set and v001 is the id variable, "by v001:" creates by-groups where the first obs in the by-group _n=1 and _N=5. The second obs _n=2 and _N=5 etc. The last obs for that person _n will equal _N. by v001: makes the data look like this: v001 _n _N ---- -- -- 1002 1 5 1002 2 5 1002 3 5 1002 4 5 1002 5 5 1003 1 3 1003 2 3 1003 3 3 1004 1 2 1004 2 2 If you want to keep the first obs per person do: by v001: keep if _n==1 second obs is- by v001: keep if _n==2 second to the last obs is- by v001: keep if _n==_N-1 last obs is- by v001: keep if _n==_N The following is Stata code that plays with this concept. NOTE: the following is a do-file. Highlight and copy all text between the horizontal bars and paste into the Stata interactive do-file window so that you can submit this do-file.
capture log close
log using by.log, replace
version 8.0
clear
set memory 20m
use "t:\statatut\examfac2.dta"
keep facid factype q102_1-q103_20
** q103_1-q103_20 are # hours/week usually worked **
** facid is the id var **
** q102_1-q102_20 are titles of facility workers **
label list title type
/** The following code figures out how many total hours per week
* different types of workers at a health facility work. **/
sort facid
/* Get rid of duplicate ids. */
drop if facid==facid[_n-1]
/* This would also keep the first obs in a duplicate id situation:
* by facid: keep if facid[_n]==1 */
/* Make the data set multiple obs per facility. */
reshape long q102_ q103_, i(facid) j(position)
/** Save the data set as a tempory data set in case you want to try another idea. **/
preserve
gen ttl_h1=0
replace ttl_h1=q103_ if q103_!=99 & q102_==1
by facid: gen ttl_th1=ttl_h1[1]+ttl_h1[2]+ttl_h1[3]+ttl_h1[4]+ttl_h1[5]+ ///
ttl_h1[6]+ttl_h1[7]+ttl_h1[8]+ttl_h1[9]+ttl_h1[10]+ ///
ttl_h1[11]+ttl_h1[12]+ttl_h1[13]+ttl_h1[14]+ttl_h1[15]+ ///
ttl_h1[16]+ttl_h1[17]+ttl_h1[18]+ttl_h1[19]+ttl_h1[20]
list facid factype q102_ q103_ ttl_h1 ttl_th1 in 1/500 if factype==3
/* or you could use the -forvalues- command */
replace ttl_th1=0
forvalues num = 1/20 {
by facid: replace ttl_th1=ttl_h1[`num']+ttl_th1
}
/* If you wanted totals for all the other titles you'd have
* to repeat the above 19 more times. */
/* You could use the egen command to sum up work hours per title for each facilty
* and for commands to reduce the amount of typing. */
clear
restore
preserve
foreach var of newlist ttl_h1-ttl_h10 {
gen `var'=.
}
forvalues num = 1/10 {
replace ttl_h`num'=q103_ if q103_!=99 & q102_==`num'
egen ttl_th`num'=sum(ttl_h`num'), by(facid)
}
list facid factype q102_ q103_ ttl_h1 ttl_th1 in 1/500 if factype==3
/** Keep only the variables that represent all obs of the facility. **/
keep facid factype ttl_th1-ttl_th10
/** Keep only the last obs per facid, though any of it's obs would be just as good. **/
by facid: keep if _n==_N
list in 1/500 if factype==3
Review again? Another topic? Questions or comments? If you are affiliated with the Carolina Population Center, send them to Phil Bardsley.
|

