One-to-one merging

 

Same observations in each file being merged

The first example is a one-to-one merge of DHS data. The women's file in DHS has more variables than the 2,047 limit in Intercooled Stata. This is not a problem at CPC, where we are running Stata/SE with a limit of 32,767 variables. However, many people run Intercooled Stata, so we are taking a conservative approach in this example. (Type help limits to see how many variables your version of Stata can handle.)

Because of this variable limit, we broke the women's file into 2 pieces. Each file has exactly the same women (8,781 observations), but each file has a different set of variables (about 1500 variables in each). When we split the original file in two, we were careful to keep the 3 identifying variables in each file for later merging.

For this example, we selected age, education, and number of children from the first file, and marital status, age at first marriage, and desired family size from the second file. We need to merge these two sample files into a single file for analysis. The identifying variables for this merge are cluster (v001), household within cluster (v002), and line number within household (v003). The order in which the identifying variables are nested determines the sort order.

Prior to Stata 11, the data had to be sorted by the identifying variables before merging. Now you have a choice of syntax for merging. We have shown you both methods below, the old which requires sorting, and the new which does not (starting with version 11 Stata will sort the data for you when you use the new syntax). Both the old and new syntax work in Stata 11, but the new syntax is better for several reasons explained below.

Regardless of which version you are using, it is always a good idea to check your identifiers for duplicates before a merge. These are observations that have the same identifying variables, in this case v001 v002 v003. We expect each combination of these three variables to uniquely identify a woman, but as we see below, two duplicates crept into our data somehow.

Before trying this example, create a folder somewhere on your computer to store intermediate data files. These are files that you need for only for a short time while creating an analysis file, and then you can erase them. At CPC I put them into a folder I named d:\statatemp.

Then, copy the following commands into a do-file and submit them.


 

Check for, examine, and drop duplicates

/* Check for duplicates in exampw1 */

clear
use "q:\utilities\statatut\exampw1.dta"
duplicates report v001 v002 v003

/* List the duplicates */

duplicates tag v001 v002 v003, gen(duptag)
list if duptag > 0

/* Drop the duplicates */

duplicates drop v001 v002 v003, force
drop duptag
save "d:\statatemp\tempw1.dta"

/* Drop duplicates from exampw2 */

clear
use "q:\utilities\statatut\exampw2.dta"
duplicates drop v001 v002 v003, force
save "d:\statatemp\tempw2.dta"

 

Merge using New Syntax (Stata 11 and later)

clear
use "d:\statatemp\tempw1.dta"
merge 1:1 v001 v002 v003 using "d:\statatemp\tempw2.dta"

/* Look at the Results Window - should see only matches */

/* Clean up */

drop _merge

 

Merge using Old Syntax (Stata 10 and earlier)

/* Sort the first file and save it temporarily */

clear
use "d:\statatemp\tempw1.dta"
sort v001 v002 v003
save "d:\statatemp\tempw1.dta", replace

/* Sort the second file and save it temporarily */

clear
use "d:\statatemp\tempw2.dta" 
sort v001 v002 v003
save "d:\statatemp\tempw2.dta", replace

/* Use the first file, merge on the second file */

clear
use "d:\statatemp\tempw1.dta"
merge v001 v002 v003 using "d:\statatemp\tempw2.dta"

/* Check how well the merge went - should see only "3" for _merge */

tab _merge
su

/* Clean up */

erase "d:\statatemp\tempw1.dta"
erase "d:\statatemp\tempw2.dta"
drop _merge

 

Questions:

1. What does the duplicates command do? Answer.


2. The file tempw1.dta has 8,779 observations (after dropping two duplicates) and 6 variables. The file tempw2.dta also has 8,779 observations and 6 variables. After merging them, how many observations and variables do you expect the merged file to contain? Answer.


3. There is a 10th variable on the merged file, named _merge. What is its role? Answer.


4. What does the command erase do? Answer.


5. Why drop _merge? Answer.


6. Is there an easier way to work with intermediate data files? Answer.

 


Answers:

 

1. The duplicates command checks whether the variables you list identify observations in your data uniquely. If more than one observation has the same combination of identifying variables, duplicates report will tell you. You can then tag the duplicates and examine them using duplicates tag.

To examine the duplicates, you can list them as we show above. Or you might use the Data Editor in browse mode (browse), so you don't accidentally change the data:

browse  if duptag > 0

At this point, if you have access to the questionnaires or other information about the raw data, you may be able to resolve the duplicates problem. In this case, though, we do not have access to the questionnaires, so we used duplicates drop to arbitrarily drop all but the first instance of each duplicate. (The command drops all but the first instance, given the current sort order of the data.)

Back to question


 

2. It is a good idea to anticipate the results of a merge, so that when you see the actual results, you will know immediately whether the merge worked the way you intended it.

In this case, a one-to-one merge, we expect the resulting file to have the same number of observations as the two original files: 8,779. The two files have 3 ID variables in common, so the merged file will have 3 ID variables. The two files have 3 additional variables each, so the merged file will have 6 variables in addition to the IDs, for a total of 9 variables.

Back to question 

 


 

3. Stata creates the variable _merge when you ask it to merge two files. The variable can have 3 different values as follows:

  • 1= this observation comes from the "master" file only (the file in memory)
  • 2= this observation comes from the "using" file only (the file on disk)
  • 3= this observation contains variables from both the master and using files (a match)

In this case, all observations have variables from both the master and using files, so they all have a value of 3 for _merge.

Back to question 

 


 

4. After sorting the master and using files, we saved each one on disk with the names tempw1.dta and tempw2.dta. When we no longer needed them, we deleted them from disk with the erase command.

Back to question 

 


 

5. At some time in the future, we may want to merge another file onto this new file. During the merge, Stata will again want to create the variable _merge. Since the variable already exists, Stata will stop the merge command and write an error message to the log.

The easiest way to handle this is to always drop _merge after looking at it. If you want to keep it around, you can rename it, or you can ask Stata to create it with a different name like "merge1" (using the generate option on the merge command).

Back to question 

 


 

6. Intermediate data files can be handled in two ways. One way is shown here, where we create a permanent file and then erase it when we no longer need it. The second way is to use Stata temporary files, which Stata automatically erases at the end of your interactive Stata session. This second method is described in temporary files in the Miscellaneous Tips and Tricks section of this tutorial.

 

It is up to you to decide which way is easier!

Back to question.

 

 


 

Review again?

 

Another topic?


Wink Plone Theme by Quintagroup © 2013.

Personal tools
This is themeComment for Wink theme