One-to-many merging

 

Hierarchical data, analysis at the lower level

In this example, we have data from the 1999 Tanzania Facility Survey. We have data from 503 facilities (which we call "higher" level) in one file, and data from 2,584 service providers (which we call "lower" level) in another file. We want to combine them so that each observation is a service provider, i.e., we want to do our analysis at the lower level.

Our analysis question for this example might be: are providers from government facilities more likely to be trained than those from private facilities?

Copy these commands into a do-file editor and run them.


 

Merge using New Syntax (Stata 11 and later)

clear
use "q:\utilities\statatut\exampfac.dta"
keep facid authorit 
merge 1:m facid using "q:\utilities\statatut\exampro.dta", keepusing(facid bcs ccs)
drop _merge
/* Check how well the merge went - we do not expect all "3's" */

 

 

Merge using Old Syntax (Stata 10 and earlier)

/* Use the provider file and sort it */

clear
use "q:\utilities\statatut\exampro.dta"
keep facid bcs ccs
sort facid
save "d:\statatemp\tempro.dta", replace

/* Use the facility file and sort it */

clear
use "q:\utilities\statatut\exampfac.dta"
keep facid authorit 
sort facid

/* Merge the provider data onto the facility data */

merge facid using "d:\statatemp\tempro.dta"

/* Check how well the merge went - we don't expect all "3's" */

tab _merge
keep if _merge == 3
drop _merge
su

/* Recode authority to form government vs private */

recode authorit 3/12=2

/* Run chi-square analysis */

tab authorit bcs, row chi

/* Clean up */

erase "d:\statatemp\tempro.dta"

 

Questions:

1. I used the new Stata syntax, and the merge failed! I got an error message:

variable facid does not uniquely identify observations in the master data

What does this mean, and how can I fix it? Answer.

 

2. Using the Stata 11 and later syntax, can I first use providers and then merge facilities onto them? Answer.

 

3. I used the Stata 10 and earlier syntax, and the merge worked, but I got two notes:

variable facid does not uniquely identify observations in the master data
variable facid does not uniquely identify observations in tempro.dta

What does this mean, and do I need to worry about it? Answer.

 

4. This merge is not too bad - about 98% of the service providers were matched with their appropriate facilities. How many providers had no matching facility data, and how many facilities weren't paired with providers? Answer.


5. Is it possible to tell from the tabulation of _merge that we're getting the correct number of providers in the merge? How about the correct number of facilities? Answer.

 


Answers:

 

1. The problem is that the facility ("master") data has duplicates, that is, it contains more than one observation with the same value of facid. The good news is that Stata stopped you from continuing with the merge, getting the wrong answer, and possibly never discovering the error. (Earlier versions of Stata allowed you to continue, but with a warning - see Answer 2.) The bad news is that you should have checked for duplicates before doing this merge!

Earlier, you saw how to check directly for duplicate identifiers:

       use "q:\utilities\statatut\exampfac.dta", clear
       duplicates list facid

That is the easiest way to check for duplicates, and it should always be done prior to a merge.

The error message in our results window refers to the master file, which in this case is the facilities. There may be multiple providers from each facility, so we expect duplicates of facid in the provider (using) file. We told Stata to expect duplicate providers by putting "m" on the right side (corresponding to the using file) of "1:m" in the merge command. Similarly, we told Stata to expect unique values of facid in the facility file by putting "1" on the left side (corresponding to the master file).

Using the duplicates list command, we discovered that our facility data has two facilities with facid == 1001. One of these is a mistake, because it's value of authorit is missing:

     +------------------+
     | facid   authorit |
     |------------------|
 12. |  1001        Gov |
 13. |  1001          . |
     +------------------+

To fix the problem we drop that facility and continue with the merge. Back to question.

 

 


 

 

2. Yes, you can merge facilities and providers together in either order. To merge in the other order, the syntax (after fixing the duplicates problem) is:

clear
use "q:\utilities\statatut\exampro.dta"
keep facid bcs ccs 
merge m:1 facid using "q:\utilities\statatut\exampfac.dta", keepusing(facid authorit)

The terms "1:m" and "m:1" tell Stata what to expect when it tries to match observations in the two files. If it finds what you tell it to expect, you're merge will be successful. But, if Stata doesn't find what you expect, it will stop the merge. The only case where it will continue is where you tell it there are many observations with the same identifiers, but in fact there is only 1 instance of each identifier. Since the resulting merged files will be correct, there's no reason for Stata to stop or even to warn you.

Stata also allows a "many-to-many" merge (m:m), but we recommend that you forget you ever heard this. In over 30 years of managing survey data, we have never found an instance where Stata's implementation of this concept fit. In most situations it would result in the wrong answer, and as seen in question 3, that is the case here as well. Back to question.


 

3. Please read the answer to Question 1 first, then continue here.

As we see here, it's possible to ignore these notes and continue as though the merge worked properly, when in fact we got the wrong answer. (We could even use the new syntax "merge m:m" and continue as though all is well, but the merge would be incorrect.)

Specifically, here's what happens when we ignore the warnings:

As we saw earlier, one of the facilities with facid 1001 is a mistake, because it's value of authorit is missing:

     +------------------+
     | facid   authorit |
     |------------------|
 12. |  1001        Gov |
 13. |  1001          . |
     +------------------+

Our provider data contains 5 observations from facility 1001:

      +-------------------+
      | facid   bcs   ccs |
      |-------------------|
  29. |  1001     1     1 |
  30. |  1001     2     2 |
  31. |  1001     2     2 |
  32. |  1001     2     1 |
  33. |  1001     1     2 |
      +-------------------+

When we allow the merge to continue, our mistake will be hidden from us. Look at the result of the merge by facid:

      +---------------------------------------+
      | facid   authorit   bcs   ccs   _merge |
      |---------------------------------------|
  12. |  1001        Gov     1     1        3 |
  13. |  1001          .     2     2        3 |
 521. |  1001          .     2     2        3 |
 522. |  1001          .     2     1        3 |
 523. |  1001          .     1     2        3 |
      +---------------------------------------+

Stata matches the first facility, which has a value of "Gov" for authorit, with the first provider. This match is correct. Then it matches the second facility, which has a value of "." (missing) for authorit, with the second provider. It has no new facilities to match with the third provider, so it uses the second facility again, repeating this mistake for the remaining providers from facility 1001.

 

Despite these mistakes, we think the merge is fine when we see all those 3's in the _merge column. The best way to prevent the error is to check for duplicate facilities before the merge. Back to question.

 

 


 

4. 22 providers had no facility data, and 34 facilities had no provider data. The facilities were in memory for the merge (the "master" data), so if they had no matches they received a value of 1 for _merge. The providers were on disk (the "using" data), so if they had no matches they received a value of 2 for _merge. Back to question.

 

 


 

5. The tabulation of _merge tells us that we have 22 providers with no matching facilities (_merge==2) and 2562 providers with matching facilities (_merge==3). The total is 2584, which is the number of providers in the original file, so all are accounted for.

 

There's no simple check like this for the number of facilities. Back to question.

 

 

 


 

Review again?

 

Another topic?


Wink Plone Theme by Quintagroup © 2013.

Personal tools
This is themeComment for Wink theme