Working with grouped observations

 

What is a BY-group?

Sometimes the observations in your data are arranged in groups defined by the values of one or more variables (the "BY variables"). A simple illustration is a data set with variables A, B, C, and D that looks like this:

Example 1.

                    A     B    C    D 
  
                   10     1    2    3
                   20     1    2    0
                   20     2    4    1
                   30     1    0    1
                   40     1    1    0
                   40     2    6    9
                   40     3    0    5


The observations in this data set are arranged as if the data set had been sorted by the two variables A and B. The values of A are in ascending order and, for each fixed value of A, the observations are in ascending order by the values of B. Each group of observations that share a value of A is called a BY-group defined by A. Each group of observations that share a pair of values of A and B is a BY-group defined by A and B. In this example, the BY-groups defined by A have varying numbers of observations (1, 2, or 3), and the BY-groups defined by the pair A and B have exactly one observation each.
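The arrangement described above can be produced with PROC SORT. The following is a minimal sketch, not part of the original page: the data set name example1 is an assumption, and the rows are entered deliberately out of order so that the effect of the sort is visible.

```sas
/* Build the Example 1 data set; the name "example1" is an assumption. */
/* The rows are entered out of order on purpose.                       */
data example1;
  input A B C D;
  datalines;
40 3 0 5
10 1 2 3
20 2 4 1
40 1 1 0
30 1 0 1
20 1 2 0
40 2 6 9
;
run;

/* Sort by A, then by B within each value of A. */
proc sort data=example1;
  by A B;
run;

/* Print the sorted data; the rows should match the listing in Example 1. */
proc print data=example1 noobs;
run;
```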

Here are examples of real data with this structure:

 

Importance of knowing how to work with grouped observations

Knowing how to work with grouped observations lets you work with data at different levels, for example persons within households. In addition, understanding BY-groups and the FIRST. and LAST. variables explained below lets you quickly test whether an identifier variable is unique in your data.

 

Applications

  • variable construction at the aggregate level
  • restructuring data sets in order to change the unit of analysis
  • identifying observations with duplicate values on a variable or set of variables

 

What you have to do

  • make sure the data set is sorted by the appropriate variable(s), for example with PROC SORT
  • use the BY statement following the SET statement to name the BY variable(s)
  • use the temporary variables FIRST.byvariable and LAST.byvariable to keep track of where you are in the BY group; SAS creates them from the BY statement, and they exist only for the current DATA step
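The three steps above can be sketched as follows. This is an illustrative example under stated assumptions rather than the page's own program: the data set names example1 and groups and the variable n_in_group are hypothetical. It counts the observations in each BY-group defined by A.

```sas
/* Example 1 data; the data set name "example1" is an assumption. */
data example1;
  input A B C D;
  datalines;
10 1 2 3
20 1 2 0
20 2 4 1
30 1 0 1
40 1 1 0
40 2 6 9
40 3 0 5
;
run;

/* Step 1: sort by the BY variable. */
proc sort data=example1;
  by A;
run;

/* Steps 2 and 3: name the BY variable after SET, then use FIRST.A and LAST.A. */
data groups;                       /* "groups" and "n_in_group" are hypothetical names */
  set example1;
  by A;
  if first.A then n_in_group = 0;  /* reset the counter at the start of each BY group */
  n_in_group + 1;                  /* sum statement: the value is retained across observations */
  if last.A then output;           /* write one observation per BY group */
  keep A n_in_group;
run;

proc print data=groups noobs;
run;
```

For Example 1, this should give one row per value of A, with counts of 1, 2, 1, and 3 for A = 10, 20, 30, and 40.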

 

How the FIRST.byvariable and LAST.byvariable take on values

  • the values depend on where the current observation is in the BY group
    • if in the first observation of the BY group: FIRST.byvariable = 1
    • if not in the first observation of the BY group: FIRST.byvariable = 0
    • if in the last observation of the BY group: LAST.byvariable = 1
    • if not in the last observation of the BY group: LAST.byvariable = 0

  • illustration of how the FIRST.A and LAST.A variables take on values for Example 1 above.  A blank line has been inserted between the separate BY-groups defined by A.
         A    B    C    D    FIRST.A   LAST.A 
    
         10   1    2    3     1          1
    
         20   1    2    0     1          0
         20   2    4    1     0          1
    
         30   1    0    1     1          1
    
         40   1    1    0     1          0
         40   2    6    9     0          0
         40   3    0    5     0          1
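FIRST.A and LAST.A are temporary and are not written to the output data set, so one way to reproduce the illustration above is to copy them into ordinary variables. A minimal sketch, assuming Example 1 is stored in a data set named example1; the names example1, flags, first_A, and last_A are assumptions made for this example.

```sas
/* Example 1 data, already in sorted order; "example1" is an assumed name. */
data example1;
  input A B C D;
  datalines;
10 1 2 3
20 1 2 0
20 2 4 1
30 1 0 1
40 1 1 0
40 2 6 9
40 3 0 5
;
run;

data flags;            /* "flags" is a hypothetical data set name */
  set example1;
  by A;
  first_A = first.A;   /* copy the temporary variables into regular variables */
  last_A  = last.A;    /* so that they are kept in the output data set        */
run;

proc print data=flags noobs;
run;
```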

 

Program examples of BY-group processing

 

Determining the uniqueness of an identifier variable

We will test for uniqueness of the identifier variable in a household level data set.  Click here to see the data set and the program code to do this.
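The linked data set and program are not reproduced here. The following is a sketch of one common approach, using a small made-up data set: the names households, hhid, region, and duplicates, and all data values, are hypothetical. An identifier value is unique exactly when its observation is both the first and the last in its BY-group.

```sas
/* Hypothetical household-level data; hhid 102 appears twice on purpose. */
data households;
  input hhid region $;
  datalines;
101 N
102 S
103 E
102 W
;
run;

proc sort data=households;
  by hhid;
run;

data duplicates;
  set households;
  by hhid;
  /* Subsetting IF: keep only observations whose hhid occurs more than once. */
  if not (first.hhid and last.hhid);
run;

/* If this prints no observations, hhid is unique in the data set. */
proc print data=duplicates noobs;
run;
```

With the made-up data above, the duplicates data set should contain the two observations with hhid 102.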

 

Creating a household level variable from person level data

Using the same person-level data set that you saw earlier in this section, we will count the number of persons under 18 years of age in each household.  Click here to see the data set and the program code to do this.
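The linked data set and program are not reproduced here. A minimal sketch of the technique, using made-up person-level data: the names persons, hhid, age, hh_kids, and n_under18, and all data values, are assumptions. The counter is reset at the first person in each household and written out at the last.

```sas
/* Hypothetical person-level data: one observation per person. */
data persons;
  input hhid age;
  datalines;
101 34
101 8
101 15
102 70
103 17
103 41
;
run;

proc sort data=persons;
  by hhid;
run;

data hh_kids;                        /* one observation per household */
  set persons;
  by hhid;
  if first.hhid then n_under18 = 0;  /* reset at the start of each household */
  /* Missing values compare as less than any number in SAS, so exclude them. */
  if not missing(age) and age < 18 then n_under18 + 1;
  if last.hhid then output;          /* write the household total */
  keep hhid n_under18;
run;

proc print data=hh_kids noobs;
run;
```

With the made-up data above, the counts should be 2, 0, and 1 for households 101, 102, and 103.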

 

 

