Merging summary statistics onto each observation


Hierarchical data, analysis at the higher level

This example combines the collapse command learned in Adding summary statistics to a data file with the merge command covered earlier. We'll use the same data here as the previous example: 503 facilities (which we call "higher" level) in one file, and 2,584 service providers (which we call "lower" level) in another file. But this time we want to combine them in such a way that each observation is a facility, i.e., we want to do our analysis at the higher level.

Our analysis question for this example might be: is the mean number of trained providers at government facilities higher than at private facilities? We need to count trained providers in the provider file (using the collapse command) to summarize their data at the facility level, and merge the resulting file onto the facility data.

We use only the new merge syntax here. Please refer to the previous examples of merge for Stata 10 and earlier syntax.

Copy these commands into a do-file editor and run them.

/* Use the facility file and drop the duplicate. */

use "q:\utilities\statatut\exampfac.dta"
keep facid authorit
duplicates list facid 
drop  if facid == 1001 & missing(authorit) // get rid of the duplicate
save "d:\statatemp\tempfac.dta", replace

/* Collapse the service provider file to count the trained providers per facility */

use "q:\utilities\statatut\exampro.dta"
keep facid bcs ccs 
replace bcs=0 if bcs==2
replace ccs=0 if ccs==2
sort facid
collapse (sum) bcs ccs, by (facid)
list in 1/10

/* Merge the collapsed provider data with the facility data */

merge 1:1 facid using "d:\statatemp\tempfac.dta"

/* Check how well the merge went - we don't expect all "3's" */

keep  if _merge == 3

/* Recode authorit to give 2 groups: govt and non-govt */ 
/* Note that the private group is labeled "MarieStopes" */
/* because that's the label on value 2. */
recode authorit 3/12=2 ta authorit /* Run a t-test to check difference in mean training levels */ ttest bcs, by(authorit) /* Clean up */ erase "d:\statatemp\tempfac.dta" drop _merge



1. Why recode bcs and ccs to 0/1 before the collapse command? Answer.

2. After collapsing the provider data to the facility level, we merged a facility-level file with another facility-level file. Could we have merged them without using facid, in other words, could we have done a one-to-one merge instead of a match-merge? Answer.

3. What were the results of the match-merge? Answer.

4. Do we need to drop the non-matching observations before running the t-test? Answer.

5. Could we use egen instead of collapse in this situation? Answer.




1. We need to count the number of trained providers at each facility. The easiest way to do that is to add the people who are trained, so we recode them to 1=trained and 0=not trained.

Back to question 



2. No. In the earlier example, where we merged two files of DHS women's data, the two files originally came from the same combined file. So, we were confident that by sorting first, the women would be in the same order and match one-to-one without using their identifying variables. (We matched on identifiers anyway just to be safe.)


In this case, though, the provider data were collected separately from the facility data. Theoretically, all facilities should be represented in the provider data. But practically, given all the ways field surveys can and do go wrong, it didn't work out that cleanly. Therefore, we needed to match-merge.

Back to question 



3. The results were the same as in our previous example using these two files. 468 matches, 34 facilities with no matching provider data, and 9 (facility-level) provider observations with no matching facilities. These 9 provider observations are collapsed from the 22 providers we saw unmatched in the previous example.

Back to question



4. No we don't. The t-test, and most other Stata statistical commands, drop observations with missing values on any one of the analysis variables. In this case we have 467 observations in the analysis instead of 468, because 1 facility has a missing value for authority.

In fact, if we were to do this analysis using the svy commands, we would need to leave all facilities in the sample. The svy commands correct the standard error for the effects of clustering and stratification, and they require the full set of observations in order to make this correction accurately. See Commands to Analyze Survey Data for more information on this topic.

Back to question


5. Yes, we could use egen to count the number of trained providers, drop the duplicates in the provider file so there was only one observation per facility, and merge the result onto the facility file:

/* Use egen to count the trained providers per facility */

use "q:\utilities\statatut\exampro.dta"
keep facid bcs ccs 
replace bcs= 0 if bcs == 2
replace ccs= 0 if ccs == 2
sort facid
by facid: egen numbcs= total(bcs)
by facid: egen numccs= total(ccs)
list in 1/10
duplicates drop facid, force
drop bcs ccs

/* Merge the collapsed provider data onto the facility data */

merge 1:1 facid using "tempfac.dta"

There are a couple of advantages to the egen approach. First, you can easily keep additional variables that may be of interest on the provider file. While this is possible with collapse, it can be cumbersome. Second, you can browse the data after egen to see whether you created the count correctly. Then, if it looks good, you can drop the duplicates.

Back to question 



Review again?


Another topic?

Wink Plone Theme by Quintagroup © 2013.

Personal tools
This is themeComment for Wink theme