Introduction to merging in SAS

 

What is a match-merge?

A match-merge

  • combines observations from two or more SAS data sets based on the values of specified common variables (one or more)
  • creates a new data set (the merged data set)
  • is done in a data step with the statements
    • MERGE to name the input data sets
    • BY to name the common variable(s) to be used for matching

Prerequisites for a match-merge

  • input data sets must have at least one common variable to merge on
  • input data sets must be sorted by the common variable(s) that will be used to merge on

From now on when we use the term "merge" it will mean "match-merge."

 

Examples of merging

When we want to combine two data sets by merging and we know beforehand that each observation in each data set has a match in the other data set we can do a very straight-forward merge. Here are a couple of examples.

Example 1.

Often different data on the same cases are stored in two or more different data sets. For example, you may have two person level data sets on exactly the same individuals but containing different information on those individuals. To combine the data on those individuals into one data set requires a merge. Here is the way the merge would work using the simplest example:

 

        Data set one           Data set two

         ID   A   B             ID   C
         
         10   1   2             10   0
         20   3   4             20   5
         30   5   6             30   7

 

This data step does a merge of data set one and two by ID:
        data three;
          merge one two;
          by id;
        run;

 

The output data set three looks like this:
            ID    A   B   C
                                                
            10    1   2   0
            20    3   4   5
            30    5   6   7

The data sets being merged in the example above contain different data on the same cases and the variable ID is a unique identifier (no duplicates).


Example 2.

Often we want to combine data from two data sets where each observation has a match on the BY-variable(s) but there are duplicate values of the BY-variable(s) in one of the data sets. For example, we may want to add household level variables (in one data set) to persons from those same households (in a second data set). Or, we may want to add community level variables (in one data set) to the households of those same communities (in a second data set).

As an example we will merge two data sets used in examples in the section "Working with grouped observations"--a person level data set named percps99 and a household level data set named hhcps99. We know that the households represented are the same in both data sets and the household identifier is the variable H_SEQ. In the person level data set the variable H_SEQ has duplicate values (when the household has more than one member). In the household level data set, however, H_SEQ is unique.

We want to add the household level variables (in hhcps99) to the persons (in percps99). The two input data sets to the merge are permanent SAS data sets in the same directory. The merged data set will be a temporary SAS data set named indiv.

	libname in "C:\data\class01\";

	data indiv;
	 merge in.percps99
  	       in.hhcps99;
	 by h_seq;
	run;

 

More on MERGE: when matching is not perfect

What happens when there is no match on the BY variable(s) in a match-merge?

Here is a simple example:

        Data set one         Data set two

         ID   A    B            ID   C
       
         10   1    2            10   0
         20   3    4            30   1
         30   5    6            40   1
The second observation (ID=20) in data set one does not have a match on ID in data set two. A match-merge of one and two by ID like this,
         data three;
           merge one two;
           by id;
         run;

NOTE: It is a good habit not to do data manipulation in the data step of a merge like shown in these examples.
results in an output data set three which looks like this:
         ID   A   B     C

         10   1   2     0    both data sets contributed to this observation
         20   3   4     .    only data set one (left-hand data set) contributed to this observation
         30   5   6     1    both data sets contributed to this observation
         40   .   .     1    only data set two (right-hand data set) contributed to this observation

 

The IN= variables

What if you want to keep in the output data set of a merge only the matches (only those observations to which both input data sets contribute)? SAS will set up for you special temporary variables, called the "IN=" variables, so that you can do this and more.

 

Here's what you have to do:
  • signal to SAS on the MERGE statement that you need the IN= variables for the input data set(s)
  • use the IN= variables in the data step appropriately

So to keep only the matches in the match-merge above, ask for the IN= variables and use them:

         data  three;
           merge one(in=x) two(in=y);       /* x & y are your choices of names */ 
           by id;                           /* for the IN= variables for data  */
           if x=1 and y=1;                  /* sets one and two respectively   */
         run;

Data set three will now consist of only the matches on ID:

         ID   A   B     C

         10   1   2     0 
         30   5   6     1
Only the matches are kept in the output data set above because of the way the IN= variables X and Y take on values in the PDV:
  • 1 if the data set contributes to the observation
  • 0 if the data set does not contribute to the observation
For the above example, you can picture the IN= variables X and Y taking on values like this:
      
         ID   A   B   C   X    Y

         10   1   2   0   1    1
         20   3   4   .   1    0
         30   5   6   1   1    1


If you want to keep not only the matches, but also to keep track in separate data sets of the non-matches, you can let the data step create three data sets like this:

     data x1y1          /* x1y1, x1y0, x0y1 are your choices of data set names */
          x1y0 
          x0y1;
      merge one(in= x) two(in= y);
       by id;

       if x = 1 and y = 1 then output x1y1; /* write all matches to x1y1 */
       if x = 1 and y = 0 then output x1y0;
       if x = 0 and y = 1 then output x0y1;
     run;

 

How do you know when you merged correctly?

Before doing a merge, ask yourself these questions:

  • Do you know your input data sets to MERGE?
    • Is each input data set sorted by the BY-variables?
    • Are the BY-variables named the same in each input data set? If not, you will need to do some renaming.
    • Are there other common variables that are not used in the by statement? If yes, watch out--results may surprise you.
    • How many observations and how many variables are in the input data sets?

  • Do you need to keep all variables in the input data sets? If not, you may want to use the DROP= or KEEP= data set options.
  • Can you predict the number of variables in the output (merged) data set? (always possible)
  • Can you predict the N of the output data set? (not always possible, but it's a good idea to ask yourself this question anyhow)

Test your program: print and examine observations BEFORE and AFTER the merge.

Finally, take time to examine the log carefully.

 

More on merging in the Tips and Tricks Data Management page:

 


Another topic?

Wink Plone Theme by Quintagroup © 2013.

Personal tools
This is themeComment for Wink theme