
UNC Carolina Population Center

 

Testing for uniqueness of an identifier variable


A household-level data set derived from U.S. Census data

  • H_SEQ is the household sequence number (which is supposed to be unique)
  • each observation is a household
  • the original data set consists of 50,785 households interviewed in the Current Population Survey of March 1999.


H_SEQ  H_FAMINC  H_NUMPER  HG_REG  HRHTYPE  HSUP_WGT

    1         6         4       3        1    140484
    2        10         3       3        4    179294
    3        13         5       3        1    193890
    4         0         2       1        1     33756
    5         1         1       1        7    124633
    6         8         2       1        1    100164
    7         6         5       3        2    133469
    .
    .
    .


The program below uses the automatic variable FIRST.H_SEQ, which SAS creates (along with LAST.H_SEQ) when a DATA step contains a BY H_SEQ statement, to determine whether the variable H_SEQ is unique. The first observation of each H_SEQ BY-group is written to the temporary data set unique; any other observations in the BY-group (the duplicates) are written to the temporary data set dups. If dups ends up with no observations, H_SEQ is unique. The household-level data set being examined is a permanent SAS data set named hhcps99 that has already been sorted by H_SEQ, as the BY statement normally requires.

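   /* Library holding the permanent data set hhcps99 */
   libname in '/afs/isis/depts/cpc/computer/stone/data/class01/';

   data unique dups;
   set in.hhcps99;
   by h_seq;          /* creates first.h_seq and last.h_seq */

   /* first observation of each BY-group goes to unique; the rest are duplicates */
   if first.h_seq=1 then output unique;
                    else output dups;
   run;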
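
Once the step above has run, the SAS log reports how many observations were written to unique and to dups. The sketch below is one possible way to make the same check in a program. It is not part of the original example: it assumes the temporary data sets unique and dups already exist in the WORK library, and the variable name n_dups is chosen here only for illustration.

   /* Read the observation count of dups without reading any data. */
   /* NOBS= is filled in when the step is compiled, so the SET     */
   /* statement never needs to execute.                            */
   data _null_;
   if 0 then set dups nobs=n_dups;
   if n_dups = 0 then put 'H_SEQ is unique.';
                 else put 'H_SEQ is not unique: ' n_dups 'extra observation(s) in dups.';
   stop;
   run;

   /* List the duplicate households, if there are any. */
   proc print data=dups;
   var h_seq;
   run;

If H_SEQ is unique, the first step writes 'H_SEQ is unique.' to the log and PROC PRINT has no observations to list.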



Questions or comments?  If you are affiliated with the Carolina Population Center, send them to Phil Bardsley.