
UNC Carolina Population Center

 

Working with grouped observations


What is a BY-group?

Sometimes the data you work with will have observations arranged in groups by the values of one or more variables (the "BY variables"). A simple illustration of this would be a data set with variables A, B, C, D that looks like this:

Example 1.

         A    B    C    D

         10   1    2    3
         20   1    2    0
         20   2    4    1
         30   1    0    1
         40   1    1    0
         40   2    6    9
         40   3    0    5


The observations in this data set are arranged as if it had been sorted by two variables, A and B. The values of A are in ascending order and, within each value of A, the observations are in ascending order of B. Each group of observations that share a value of A is called a BY-group defined by A. Each group of observations that share a pair of values of A and B is a BY-group defined by A, B. In this example the BY-groups defined by A contain varying numbers of observations (1, 2, or 3), and each BY-group defined by the pair A, B contains exactly one observation.
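As a minimal sketch of how such an arrangement can be produced, the program below reads the Example 1 values in scrambled order and sorts them by A and B. The data set name example1 is an assumption made for illustration; it does not come from the original page.

```sas
/* Read the Example 1 values in an arbitrary order.          */
/* The data set name "example1" is hypothetical.             */
data example1;
   input A B C D;
   datalines;
40 2 6 9
10 1 2 3
20 2 4 1
40 1 1 0
30 1 0 1
20 1 2 0
40 3 0 5
;
run;

/* Sort by A, then by B within each value of A */
proc sort data=example1;
   by A B;
run;

/* Print the sorted data; the rows now match Example 1 */
proc print data=example1 noobs;
run;
```

After the sort, every BY-group defined by A occupies consecutive observations, which is what BY-group processing in a DATA step requires.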

Here are examples of real data with this structure:


Importance of knowing how to work with grouped observations

Knowing how to work with grouped observations allows you to work with data at different levels. In addition, understanding BY-groups and the FIRST. and LAST. variables explained below will allow you to quickly test whether the identifier variables in your data are unique.


Applications

  • variable construction at the aggregate level
  • restructuring data sets in order to change the unit of analysis
  • identifying observations with duplicate values on a variable or set of variables

What you have to do

  • make sure the data set is sorted by the appropriate variable(s)
  • use a BY statement immediately after the SET statement to name the BY variable(s)
  • use the special variables FIRST.byvariable and LAST.byvariable to keep track of where you are in the BY group

How the FIRST.byvariable and LAST.byvariable take on values in the program data vector (PDV)

  • the values depend on where the current observation is in the BY group
    • if first in the BY group: FIRST.byvariable=1
    • if not first in the BY group: FIRST.byvariable=0
    • if last in the BY group: LAST.byvariable=1
    • if not last in the BY group: LAST.byvariable=0

  • illustration of how the FIRST.A and LAST.A variables take on values in the PDV for Example 1 above. A blank line has been inserted between the separate BY-groups defined by A.
         A    B    C    D    FIRST.A   LAST.A 
    
         10   1    2    3     1          1
    
         20   1    2    0     1          0
         20   2    4    1     0          1
    
         30   1    0    1     1          1
    
         40   1    1    0     1          0
         40   2    6    9     0          0
         40   3    0    5     0          1
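The illustration above can be reproduced with a short program. This is a sketch under stated assumptions: the data set names example1 and flags and the variables first_A and last_A are hypothetical names chosen for illustration.

```sas
/* Example 1 data; "example1" is a hypothetical data set name */
data example1;
   input A B C D;
   datalines;
10 1 2 3
20 1 2 0
20 2 4 1
30 1 0 1
40 1 1 0
40 2 6 9
40 3 0 5
;
run;

/* BY-group processing requires the data to be sorted by the BY variable */
proc sort data=example1;
   by A;
run;

data flags;
   set example1;
   by A;
   /* FIRST.A and LAST.A are temporary variables that SAS does  */
   /* not write to the output data set, so copy them into       */
   /* ordinary variables (hypothetical names) to see them.      */
   first_A = first.A;
   last_A  = last.A;
run;

proc print data=flags noobs;
run;
```

An observation that is alone in its BY-group, such as the ones with A=10 and A=30, has both first_A=1 and last_A=1.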
    



Program examples of BY-group processing

Determining the uniqueness of an identifier variable

We will test for uniqueness of the identifier variable in a household-level data set. The data set and the program code for this are on a separate linked page.
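The linked data set and program are not reproduced here. As a hedged sketch of the general idea, an identifier is unique exactly when every observation is both first and last in its BY-group. The data set name households, the identifier hhid, the variable region, and all data values below are made up for illustration.

```sas
/* Made-up household-level data; hhid 102 appears twice on purpose */
data households;
   input hhid region;
   datalines;
101 1
102 2
103 1
102 3
104 2
;
run;

proc sort data=households out=hh_sorted;
   by hhid;
run;

/* Keep only observations whose hhid is NOT unique.           */
/* A unique hhid is both first and last in its BY-group.      */
data duplicates;
   set hh_sorted;
   by hhid;
   if not (first.hhid and last.hhid);
run;

proc print data=duplicates noobs;
   title 'Observations with a non-unique hhid';
run;
```

If the duplicates data set has zero observations, hhid uniquely identifies the households. With the made-up values above, it contains the two observations with hhid 102.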


Creating a household-level variable from person-level data

Using the same person-level data set that you saw earlier in this section, we will count, for each household, the number of persons less than 18 years of age. The data set and the program code for this are on a separate linked page.
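The tutorial's own person-level data set is not reproduced here. The sketch below shows one common way to build such a count with FIRST. and LAST. variables. The names persons, hh_level, hhid, personid, age, and n_under18, and all data values, are assumptions made for illustration.

```sas
/* Made-up person-level data: one observation per person */
data persons;
   input hhid personid age;
   datalines;
1 1 45
1 2 12
1 3 9
2 1 30
3 1 17
3 2 52
;
run;

proc sort data=persons;
   by hhid;
run;

data hh_level (keep=hhid n_under18);
   set persons;
   by hhid;
   /* Reset the counter at the first person in each household */
   if first.hhid then n_under18 = 0;
   /* Sum statement: n_under18 is retained across observations. */
   /* A missing age compares as less than any number in SAS,     */
   /* so missing values are excluded explicitly.                 */
   if not missing(age) and age < 18 then n_under18 + 1;
   /* Write one observation per household, at its last person */
   if last.hhid then output;
run;

proc print data=hh_level noobs;
run;
```

With the made-up values above, the result has one observation per household: n_under18 is 2 for household 1, 0 for household 2, and 1 for household 3.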


Questions or comments? If you are affiliated with the Carolina Population Center, send them to Phil Bardsley; non-affiliates may contact the author, Dan Blanchette.