Skip Navigation

UNC Carolina Population Center

 

SAS Programming Tips for Working With Very Large Data Sets


What is a very large data set?

Some symptoms:

  • when one million observations seems relatively small
  • when you can't remember your N's
  • when a job that runs in 15 minutes seems fast
  • when a one gigabyte file seems reasonable in size
  • when the number of observations with duplicate id problem is larger than the number of observations in most data sets you've worked with
  • when you find yourself planning your work around David's vacation schedule

Examples from a current project



How can you save time and space?

Some rules:

  • Make your first run with OPTIONS OBS=0;
    
    
    • Repeat until no syntax errors.
    • If creating variable labels, repeat with Proc Contents until labels correct.
    
    


  • Use KEEP= data set option with SET, MERGE, and Proc Sort.
    
    
    • KEEP= with SET
    	data women;
    	set in.indiv(keep=personid status emigdth mures98 migrat97 mig1-mig63);
    	.	
    	.
    
    • KEEP= with MERGE
    	data tmp;
    	merge women(in=a)
                  in.indiv(keep=personid dob muborn wherborn county
    	               muwrk80 muwrk90 ed70 ed80 ed81 ed85-ed97
    	               mothrid fathrid);
    	by personid;
    	.
    	.
    
    • KEEP= with Proc Sort
    	proc sort data=in.womfixed(keep=personid mothrid myrborn fathrid fyrborn)
                      out=tmp;
    	by mothrid;
    	run;
    
    
  • Minimize sorting in programs by storing permanent data sets in the sorted order that will be most useful in the future.
  • 
    
    • Beware that sorting a permanent data set without the OUT= option on Proc Sort will overwrite the data set and will destroy it completely in a run with system option OBS=0. Use the system option NOREPLACE to avoid accidental replacement of a permanent data set.


  • Use LENGTH to set lengths in bytes for each variable.
    	length personid mothrid fathrid spid74-spid98 6;
    	length emigdth chnum migrat97 3;
    	length child1-child17 $ 4;
    

  • Use WHERE= data set option or the WHERE statement to select observations when possible.
    
    
    • Do this:
    	data tmp;
    	set parntliv(where=(mother='1' & malive=' '));
    	.
    	.
    
    
    • Or this:
    	data tmp;
    	set parntliv;
    	where mother='1' & malive=' ';
    	.
    	.
    
    
    • Not this:
    	data tmp;
    	set parntliv;
    	if mother='1' & malive=' '; 
    	.
    	.
    
    

  • Use Proc Datasets to delete, during job execution, temporary data sets no longer needed.
    	proc datasets library=work; delete tmp tmp2; run;
    
    • The log will display:
         	proc datasets library=work;
                                              -----Directory-----
    
                                   Libref:            WORK                                
                                   Engine:            V8                                  
                                   Physical Name:     /tmpsas2/SAS_work247300006C54_gromit
                                   File Name:         /tmpsas2/SAS_work247300006C54_gromit
                                   Inode Number:      16384                               
                                   Access Permission: rwxr-xr-x                           
                                   Owner Name:        cpcstone                            
                                   File Size (bytes): 512                                 
    
    
    
                                   #  Name  Memtype    File Size  Last Modified
                                   --------------------------------------------------
                                   1  TMP   DATA      4118159360   30JUL2001:18:01:09
                                   2  TMP2  DATA        39460864   30JUL2001:18:06:46
                                   3  TMP3  DATA      4118159360   30JUL2001:18:25:55
         	delete tmp tmp2; run;
    
    NOTE: Deleting WORK.TMP (memtype=DATA).
    NOTE: Deleting WORK.TMP2 (memtype=DATA).
    
    

  • Reuse temporary data set names.
    	.
    	.
    	. 
    	data tmp2;
    	set tmp;
    	. 
    	.
    	run;
    
    	proc .... data=tmp2;
    	run;
    
    	data tmp;   /* <---- WORK.TMP is reused to save space in WORK library */
    	set tmp2;
    	.
    	.
    	.
    
    
  • Don't write unnecessary data steps that create unnecessary temporary data sets.
    
    
    • NO--don't do this!
    	/*first make a copy of in.indiv and then merge with inw.womfixed*/
    
    	data tmp; 
    	set in.indiv(keep=personid dob muborn wherborn county);
    	run;
    
    	data women;
    	merge inw.womfixed(in=a)
          	      tmp;
    	by personid;
    	if a=1;
    	run;
    
    • YES--do this!
    	/*merge the permanent data sets directly*/
     
    	data women;
    	merge inw.womfixed(in=a)
                  in.indiv(keep=personid dob muborn wherborn county);
    	by personid;
    	if a=1;
    	run;
    

  • Create random samples of your data to test/develop your programs.
    
    	data out.smpl5pct; 
    	set in.indiv;
    
    	if ranuni(2106)<.05 then output;   /* 5% random sample of in.indiv*/
    	run;
    
    
    



Erika Stone
last modified: Feb. 21, 2002
Questions or comments?  If you are affiliated to the Carolina Population Center, send them to Phil Bardsley