SAS code matched to Stata code

How to "Do" in Stata What You Know How to "Program" in SAS

Here is Stata code matched to SAS code as closely as it can be. Use your browser's search tool to find the SAS code for which you need the Stata equivalent.

 

Note: Stata commands are partially underlined to show the minimum characters that need to be typed for Stata to recognize that command.

SAS

Stata


In SAS operators can be symbols or mnemonic equivalents such as:
  & 
or
  and 
For many situations in SAS order does not matter:
  <=  
can be:
  =< 
and
  >=  
can be:
  => 

Most operators are the same in Stata as in SAS, but in Stata operators do not have mnemonic equivalents.  For example, you have to use the ampersand ( & ) and not the word "and".

This works:
 var_a >= 1 & var_b <= 10 
where this does not:
 var_a >= 1 and var_b <= 10

These are the operators that are different in Stata:
 Symbol Definition

  &      and

  |      or  

  >=     greater than or equal to 

  <=     less than or equal to 

  ==     equality (for equality testing)

  !=    does not equal

  !      not

  ^      power

Note:  Symbols have to be in the order shown: " >= " not " => " .


 /* this is a comment */

 * this is also a comment ;




 /* this is a comment */

 * this is also a comment 

 // this is a comment as well 
To continue a command to the next line (line continuation):
 /// you can comment here as well 
For example:
 list id state gender age income ///     
      race income date


Range of values:
  if 1 <= var_a <= 10 
or:
  if var_a in(1,2,3,4,5,6,7,8,9,10)
or a list of character values:
  if state in("NC","AZ","TX","NY","MA","CA","NJ")





  if var_a >= 1 & var_a <= 10 
or:
  if inrange(var_a,1,10)
or:
  if inlist(var_a,1,2,3,4,5,6,7,8,9,10)
or a list of string values:
  if inlist(state,"NC","AZ","TX","NY","MA","CA","NJ")
Stata limits inlist() to 10 arguments when the arguments are strings.  That count includes the string variable itself, so at most 9 values can be listed.  More than one of the arguments can be a variable.
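When more than nine string values are needed, one workaround is to join several inlist() calls with the or operator ( | ).  The following is only a sketch: the three-row dataset and the variable name in_group are made up for illustration.

```stata
* hypothetical three-row dataset, typed in so the example is self-contained
clear
input str2 state
"NC"
"AZ"
"WY"
end

* each inlist() call stays within the 10-argument limit for strings
gen byte in_group = inlist(state,"NC","AZ","TX","NY","MA","CA","NJ","FL","GA") ///
                  | inlist(state,"OH","PA","VA")
list
```

Because /// is used for line continuation, this should be run from a do-file rather than typed into the Command window.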



Referencing multiple variables at a time:
Say the following variables are in a dataset in the order shown:
 var1 var2 var3 age var4 var5 
Then you could code them as:
 var1--var5 
To SAS, this means "all variables that are positionally between var1 and var5" which would include the variable age.
 var1-var5 
is the same as:
 var1 var2 var3 var4 var5 
no matter where the variables are positioned in the dataset.

Using a colon selects variables containing the same prefix:
 var:
could represent:
 var1 var2 var10 variable varying var_1


Referencing multiple variables at a time:
 var1-var5 
To Stata, this means "all variables that are positionally between var1 and var5."  Notice that there is only one dash ( - ).
 var?
The question mark ( ? ) is a wild card that represents one character in the variable name.  It could be a number, a letter, or an underscore ( _ ).
 var*
The asterisk/star ( * ) is a wild card that represents many characters in the variable name.  They could be numbers, letters, or underscores.  Thus:
 var* 
could represent:
 var1 var2 var10 variable varying var_1 
More on this in Stata's help page on varlists.
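To see which variables a varlist pattern picks up before using it in a real command, the ds command can be used to list the matching names.  This is a small sketch on a made-up one-row dataset whose variables are created in the order shown:

```stata
* hypothetical dataset: variables are created in this order
clear
set obs 1
gen var1 = 1
gen var2 = 2
gen age = 30
gen var10 = 10
gen variable = 0

ds var?          // one character after "var": var1 var2
ds var*          // any characters after "var": var1 var2 var10 variable
ds var1-var10    // positional range: var1 var2 age var10
```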



To save the contents of the Log window and/or Output window, go to that window and click on the menu bar's "File", "Save".  In SAS batch mode these files are automatically generated for you.  You can also use PROC PRINTTO to print the log and/or just the output to the same or other external files.  The default is to append to the files if they exist so use the "new" option to overwrite the files if they already exist:


proc printto log= "C:\MyDir\test.log" 
               new
           print= "C:\MyDir\test.lst";

proc print data= sashelp.shoes;
run;

** specifying nothing shuts off writing to the external files**;
proc printto;
run;


To save the contents of the results window, start logging to a log file BEFORE you submit commands that you want logged.  Open a log file by clicking on the icon in the tool bar that looks like a scroll and a traffic light.  A "*.log" file is a simple ASCII text file; a "*.smcl" file is formatted with html-like tags so that Stata's viewer will display it like the results window does. 

You can also use the log command:
 log using "D:\mydata\mydofile.log", replace 
Note: The replace option simply tells Stata to overwrite the log file if it already exists.  This is helpful when you have to run a do-file over and over again.
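A typical do-file might wrap its commands in a log like the sketch below.  The file name mydofile.log is only a placeholder and is written to the current working directory; the text option asks for a plain text log rather than a smcl one.

```stata
capture log close                      // close any log that is still open; ignore the error if none is
log using "mydofile.log", replace text // start a plain text log, overwriting an old copy
display "this line is written to the log"
log close                              // stop logging
```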



 libname in "D:\mydata\";

 data new;
  set in.mySASfile;
 run;
or, starting in SAS 8:
 data new;
  set "D:\mydata\mysasfile.sas7bdat";
 run; 



 use "D:\mydata\myStataFile.dta" 
You can also click on the "open file" icon and select your dataset.



Save the dataset newer to "D:\mydata\":
 libname in "D:\mydata\"; 

 data in.newer; 
  set new;
 run;



 save "D:\mydata\newer.dta" 
To overwrite the dataset newer if it already exists:
 save "D:\mydata\newer.dta" , replace 
You can also click on the "save" icon to save your dataset.



 proc contents; 
 run;
On selected variables:
 proc contents data= in.newer
  (keep= id state gender age income);
 run; 


 describe
On selected variables:
 describe id state gender age income



 proc means;
 run;
On selected variables:
 proc means;
  var age income;
 run;
or:
 proc univariate;
  var age income;
 run;




 summarize
On selected variables:
 summarize age income
If you want variable labels and a proc univariate style output try:
 summarize age income, detail
or:
 codebook age income


 proc freq;
   table var1;
 run; 


 tabulate var1 
or, for just checking out your dataset, try the codebook command.


A series of 1-way tables:
 proc freq;
  tables var1 var2; 
 run;


A series of 1-way tables:
 tab1 var1 var2 



A 2-way table:
 proc freq;
  tables var1*var2; 
 run;


A 2-way table:
 tab2 var1 var2 



 proc print;
 run;
On selected variables in this order:
 proc print;
  var id age income;
 run;
On selected variables and a limited range of observations:
 proc print data= new (firstobs= 1 obs= 20);
  var id age income;
 run;


 list
On selected variables in this order:
 list id age income
On selected variables and a limited range of observations:
 list id age income in 1/20


Create a numeric variable with a default length of 8 bytes:
 var1= 1234; 


Create a numeric variable with the minimum allowable length (3 bytes):
 length var1 3;
 var1= 1234; 



 generate var1= 1234
Note:  the default numeric data type is "float."  The statement above is relying on that default. 
It could have been written explicitly as:
 generate float var1= 1234 
"float" is Stata's 4-byte floating-point storage type.
You could more wisely save storage space by specifying:
 gen int var1= 1234 

"int" stands for "integer data type."


Create a character variable with a length of 3 bytes:
 name= "Bob"; 


Generate a string variable with a length of 3 bytes:
 gen str3 name= "Bob" 


Increase the variable length to allow for 5 characters:
 data new;
  length name $5;
  set new; 

 *Change the values of numeric 
 *  and character variables: *;
   var1= 123456;
   name= "Bobby";
run;



 replace var1= 123456 
Stata automatically increases the storage type if necessary.  To change the storage of a variable manually, use the recast command. 
 replace name= "Bobby" 
Stata automatically increases length to 5
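The promotions described above can be watched with describe.  This is a sketch on a made-up one-row dataset; it also shows recast for changing a type by hand and compress for letting Stata pick the smallest type that holds the data without loss.

```stata
clear
set obs 1
gen int var1 = 1234
gen str3 name = "Bob"
replace var1 = 123456      // int is too small, so Stata promotes var1 to long
replace name = "Bobby"     // str3 is too short, so Stata promotes name to str5
describe

recast double var1         // change the storage type by hand
compress                   // demote var1 back to the smallest type that fits (long)
describe
```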


Example of an if-then statement:
 if var1 = 123456 then var2= 1; 



The condition follows the command:
 replace var2= 1  if var1 == 123456 
Notice that Stata requires two equals signs when testing equality.

Example of an if-then do group:
   if age <= 10 then do;
    child= 1;
    parent= 0;
   end; 




 replace child= 1  if age <= 10
 replace parent= 0  if age <= 10 

Since each command is executed on all observations before the next command is executed, the if-then-do group is not an option.  Stata does have excellent looping tools: foreach, forvalues, and while.
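One way to avoid retyping the condition for every command is to keep it in a local macro.  The sketch below uses made-up ages and also shows a forvalues loop; the names kid, over1, over2, and over3 are invented for the example, and the lines need to be run together in one do-file because local macros disappear when the do-file ends.

```stata
clear
set obs 5
gen age = 4 * _n               // made-up ages: 4, 8, 12, 16, 20
gen child = 0
gen parent = 1

* store the shared condition once, then reuse it in each command
local kid age <= 10
replace child  = 1  if `kid'
replace parent = 0  if `kid'

* forvalues loops over a range of numbers
forvalues k = 1/3 {
    gen byte over`k' = age > 5 * `k'    // creates over1, over2, over3
}
list
```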


Example of an if-then-else:
   if       0 <= age <= 2 then agegp= 1;
   else if  2 < age <= 10 then agegp= 2;
   else if 10 < age <= 20 then agegp= 3;
   else if 20 < age <= 40 then agegp= 4; 
   else agegp = . ; 




If-then-else is not possible in Stata for the same reason that if-then-do groups (above) are not, but here is a way of doing the same thing.  In this example "missing(agegp)" simply highlights the fact that agegp has not yet been assigned a value, just as the else does in if-then-else:
 gen agegp= .
 replace agegp= 1  if missing(agegp)   ///
                     & age >= 0 & age <= 2 
 replace agegp= 2  if missing(agegp)   ///
                     & age >  2 & age <= 10
 replace agegp= 3  if missing(agegp)   ///
                     & age > 10 & age <= 20
 replace agegp= 4  if missing(agegp)   ///
                     & age > 20 & age <= 40 

The cond() function can also be used:
 // nest cond() functions
 gen agegp= cond(missing(age),.,            /// else
             cond(age >= 0 & age <= 2 ,1,   /// else
              cond(age >  2 & age <= 10,2,  /// else
               cond(age > 10 & age <= 20,3, /// else
                cond(age > 20 & age <= 40,4,.))))) 
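When the same grouping is written two different ways, assert is a cheap check that both give the same answer.  This sketch builds a tiny made-up dataset (including a boundary value, a fractional value, and a missing value) and compares the replace approach with the nested cond() approach; the variable names a and b are arbitrary.

```stata
clear
input age
0
2
2.5
10
35
.
end

* version 1: a series of replace commands
gen a = .
replace a = 1  if age >= 0 & age <= 2
replace a = 2  if age >  2 & age <= 10
replace a = 3  if age > 10 & age <= 20
replace a = 4  if age > 20 & age <= 40

* version 2: nested cond() functions
gen b = cond(missing(age), .,            ///
        cond(age >= 0 & age <= 2 , 1,    ///
        cond(age >  2 & age <= 10, 2,    ///
        cond(age > 10 & age <= 20, 3,    ///
        cond(age > 20 & age <= 40, 4, .)))))

assert a == b      // stops with an error if the two versions ever disagree
list
```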


Better done with the recode command which can also create value labels:
 recode age ( 0/2.9999  = 1 "0 to 2 year olds")   ///
            ( 3/10.9999 = 2 "3 to 10 year olds")  ///
            (11/20.9999 = 3 "11 to 20 year olds") ///
            (21/40.9999 = 4 "21 to 40 year olds") ///
            (    else   = . ) , gen(agegp) test 

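The test option checks to see if the ranges overlap.
Since recode's ranges are >= and <= , adding .9999 to the upper range ensures that fractional values are handled correctly.  Note that these ranges treat age as completed years, so an age of 2.5 falls in group 1 here but in group 2 in the code above.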



Drop variables var1, var2, and var3:
 data new(drop= var1 var2 var3);
  set new;
 run;


Drop variables var1, var2, and var3:
 drop var1 var2 var3 




Keep variables var1, var2, and var3:
 data new(keep= var1 var2 var3);
  set new;
 run;


Keep variables var1, var2, and var3:
 keep var1 var2 var3 

Keep observations / subsetting if statement:
 data new;
  set new;
   if var1 = 1 then output new;  
 run;


Keep observations:
 keep  if var1 == 1 




Delete observations:
 data new;
  set new;
   if var1 = 1 then delete;
 run;


Drop observations:
 drop  if var1 == 1 




Loop over a variable list (varlist):
 data new(drop= i);
  set new;
   array raymond {4} var1 var2 var3 var4;
   do i= 1 to 4;
    if raymond{i} = 99 then raymond{i}= . ;
   end;
 run;

Check out this array example in the Topics in SAS Programming page.


Loop over a variable list (varlist):
 foreach i of varlist  var1 var2 var3 var4 {
   replace `i'= .  if `i' == 99
 }
Note:  The quote to the left of the local macro variable i is a left quote ( ` ), and the quote to its right is an ordinary single quote ( ' ).  The left quote is located at the top of your keyboard next to the ( ! 1 ) key.  In this example i is a local macro variable that exists only for the duration of the foreach command, so it does not need to be dropped like the variable i in the SAS code.
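The varlist in foreach accepts the same shortcuts as any other varlist, and for this particular task Stata's mvdecode command does the conversion without an explicit loop.  Here is a sketch on a made-up two-row dataset that uses the loop for two variables and mvdecode for the other two:

```stata
clear
input var1 var2 var3 var4
1 99 3 4
99 2 99 4
end

* loop over a range of variables instead of naming each one
foreach v of varlist var1-var2 {
    replace `v' = .  if `v' == 99
}

* mvdecode changes a numeric code to missing without a loop
mvdecode var3 var4, mv(99)
list
```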



Create variable labels:
data new;
 set new;
 label age= "age in years"
       income= "salary plus bonuses"
   ;;; 
run;


Create variable labels:
 label var age "age in years"
 label var income "salary plus bonuses" 




Define a format:
 proc format;
  value yesno 
   1= "yes"
   2= "no"
  ;;;
 run;
Assign the format to a variable:
 data newer;
  set newer;
   format smokes yesno.;
 run;


Define a format.  These are called "value labels":
 label define yesno 1 "yes" /*
                */  2 "no" 


Assign the value label to a variable:
 label value smokes yesno 
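Put together on a small made-up dataset, the two steps look like the sketch below.  tabulate then shows the labels instead of the numbers, and label list shows the mapping that was defined.

```stata
clear
input smokes
1
2
1
end

label define yesno 1 "yes" 2 "no"   // define the value label
label values smokes yesno           // attach it to the variable
tabulate smokes                     // rows are shown as "yes" and "no"
label list yesno                    // display the number-to-text mapping
```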


Remove formats from a variable:
 data newer;
  set newer;
     ** just do not specify a format **;
   format smokes  ; 
 run;



 label value smokes . 


Assign formats defined by SAS to a variable:
 data newer;
  set newer;  
   format interview_date mmddyy8.;
 run;



Assign formats defined by Stata to a variable:
 format interview_date %tdNN/DD/YY
 /* pre Stata 10 the format did not start 
  * with the letter "t" and did not 
  * need two letters for each part of the date: */
 format interview_date %dN/D/Y

Note:  The letter N in %tdNN/DD/YY stands for "number of the month".  Specifying Mon in %tdDDMonCCYY uses the three letter abbreviation of the name of the month.  So %tdNN/DD/YY displays as "11/06/45" and %tdDDMonCCYY displays as "06Nov1945".
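The two display formats can be tried on a single made-up date.  This sketch builds the date with the mdy() function, which takes month, day, and year, and then switches formats; the underlying number stored in interview_date does not change, only how it is displayed.

```stata
clear
set obs 1
gen interview_date = mdy(11, 6, 1945)   // stored as days since 01jan1960
format interview_date %tdNN/DD/YY
list                                    // displays 11/06/45
format interview_date %tdDDMonCCYY
list                                    // displays 06Nov1945
```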



 title "Number of Companies That Got Acquired";


Since the Results window/log file is a mix of both the log and the Output window, Stata does not need a title statement.  Titling can be accomplished with a comment.
 /* Number of Companies That Got Acquired */ 




 proc sort data= new 
             out= newer;
  by id;
 run;


 sort id 


 proc sort data= sashelp.shoes (keep= region product 
                  subsidiary stores sales inventory) 
   out= work.shoes; 
   by region subsidiary product; 
 run; 

 /* fix flaw in dataset 
  * where the Copenhagen subsidiary 
  *  has 2 obs for product = "Sport Shoe" **/
 proc summary nway data= work.shoes;
 /* the by statement fixes 
  * the variable order in work.shoes **/
 by region subsidiary product;
 var stores sales inventory;
 output out= work.shoes (drop= _TYPE_ _FREQ_)
        sum= stores sales inventory;
run;

 /* long to wide because:
  *  there are repeats of by-variable values **/
 proc transpose data= work.shoes 
   out= shoes_wide prefix=prodnum; 
   by region subsidiary;
   var product; 
 run; 

 
 keep region subsidiary product
 bysort region subsidiary (product) : gen prodnum= _n
 reshape wide product,  ///
    i(region subsidiary) j(prodnum)


The xpose command is similar but only works with numeric data.  It will turn string variables into missing values.

 /* wide to long because:
  *  there are no repeats of by-variable values  **/
  proc transpose data= work.shoes_wide
    out= shoes_long name=prodnum;
   by region subsidiary;
   var prodnum: ;
  run; 


  // "j(prodnum)" just names the _j variable prodnum
 reshape long product, i(region subsidiary) j(prodnum) 
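The round trip from long to wide and back can be tried on a few typed-in rows.  This is a sketch with made-up values rather than the sashelp.shoes data; note that going back to long creates a row for every prodnum in every group, so groups with fewer products get rows with an empty product that can be dropped.

```stata
clear
input str6 region str8 subsidiary str10 product
"Africa" "Cairo" "Boot"
"Africa" "Cairo" "Sandal"
"Asia"   "Seoul" "Boot"
end

* number the products within each region/subsidiary group
bysort region subsidiary (product) : gen prodnum= _n

reshape wide product, i(region subsidiary) j(prodnum)
list                              // one row per group: product1 product2

reshape long product, i(region subsidiary) j(prodnum)
drop  if missing(product)         // remove the empty slot created for Seoul
list
```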

Check out this reshape example in the Stata Tutorial page.

Using by-groups:
 data newer;
  set newer;
   by id;

   if first.id = 1 then f_num= 1;

   if first.id = 1 and last.id = 1 
        then s_num= 1;

   if last.id = 1 then l_num= 1;
 run; 



 by id: gen f_num= 1  if _n == 1 
 by id: gen s_num= 1  if _n == 1 & _N == 1
 by id: gen l_num= 1  if _n == _N 

Stata's _n is equivalent to SAS's _n_ in that it is equal to the observation number; but when inside a by command _n is equal to 1 for the first observation of the by-group, 2 for the second observation of the by-group, etc.

Stata's _N is equal to the number of observations in the dataset except in a by command when it is equal to the total number of observations in the by-group.
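The difference between using _n and _N with and without by can be seen on a tiny made-up dataset.  The variable names obsno, seq, and size are invented for this sketch.

```stata
clear
input id
1
1
1
2
end

gen obsno = _n                 // observation number in the whole dataset
bysort id: gen seq  = _n       // position within the id group
bysort id: gen size = _N       // number of observations in the id group
list
```

Here seq runs 1, 2, 3 for id 1 and restarts at 1 for id 2, while size is 3 on every row of id 1 and 1 on the row of id 2.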



Count the total number of observations within each ID group, and add that total to each observation:
 proc summary data= new  nway;
   class id;
   var age;
   output out= temp(drop= _type_ _freq_) 
            n= totboys;
 run;

 proc sort data= temp;
   by id;
 run;

 proc sort data= new;
   by id;
 run;

 data newer;
   merge new temp;
   by id;
 run; 



  bysort id: egen totboys= count(age) 

Note:  in both SAS and Stata, the count will be the number of observations where the variable being counted has a non-missing value.  Here we used the variable age.



Create a cumulative/running sum of boys within each ID group:
 
 data new;
   set newer;
   by id;
   retain count 0;
   if first.id then count = 0;
   if gender = 1 and age <= 18 
      then count = count + 1;
 run; 



 bysort id: gen count= sum(gender == 1 & age <= 18) 
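A running sum depends on the order of the observations within each group, so it can help to fix that order explicitly.  In this sketch on made-up data, seq records the original order and is listed in parentheses so that bysort sorts on it without treating it as part of the group.

```stata
clear
input id gender age
1 1 10
1 2 12
1 1 17
2 1 30
2 1 5
end

gen seq = _n                   // remember the original order
bysort id (seq): gen count= sum(gender == 1 & age <= 18)
list
```

The running sum restarts for each id: count is 1, 1, 2 for id 1 and 0, 1 for id 2.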


 data both;
  merge in.new(in = a)
      in.newer(in = b);
   by id;
   if a = 1 and b = 1;
 run;

Check out this merge example in the Topics in SAS Programming page.


 use "D:\mydata\new.dta"
 sort id 
 /* Starting in Stata 11 you have to specify 
  *  what type of merge you are doing, and you no longer
  *  have to sort your datasets before the merge.
  *  This is a one-to-one merge:
  */
 merge 1:1 id using "D:\mydata\newer.dta"
 // or in previous versions of Stata (use one form or the other, not both):
 merge id using "D:\mydata\newer.dta"
 keep  if _merge == 3
Stata automatically creates the variable _merge after a merge.  Stata will not merge on another dataset if the variable _merge already exists in one of the datasets.
The dataset in memory is the "master" dataset.  The dataset that is being merged on is the "using" dataset.  Unlike SAS, variables shared by the master dataset and the using dataset will not be updated (values overwritten) by the using dataset.  Like SAS, the formats, labels, and informats of variables shared by the master dataset and the using dataset will be defined by the master dataset.  Remember that the master always wins.  Use the update option to overwrite missing data in the master file.
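The master-wins rule and the update option can be seen on two small made-up datasets.  This is a sketch that assumes Stata 11 or later (for the merge 1:1 syntax); the variable x and the tempfile names are hypothetical, and the lines need to be run together in one do-file so the tempfile macros stay defined.

```stata
* made-up master dataset: x is missing for id 2
clear
input id x
1 10
2 .
3 30
end
tempfile master
save "`master'"

* made-up using dataset: overlaps on id 2 and 3, adds id 4
clear
input id x
2 20
3 99
4 40
end
tempfile using_data
save "`using_data'"

use "`master'", clear
merge 1:1 id using "`using_data'", update
list
```

With update, the missing x for id 2 is filled in with 20, while id 3 keeps the master's value of 30 even though the using dataset has 99.  The _merge variable records what happened to each row.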


Concatenate two datasets / add observations to a dataset:
 data both;
  set in.new
      in.newer;
 run;



 use "D:\mydata\new.dta"
 append using "D:\mydata\newer.dta"

 /* Starting in Stata 11 you can use append without 
  *  having a dataset already in memory: */
 append using "D:\mydata\new.dta" "D:\mydata\newer.dta"


Sort datasets in order to prepare them for a merge:

Sort permanently stored datasets and create new, sorted copies in the WORK library:
 proc sort data= in.company
             out= work.company;
  by id;
 run;

 proc sort data= in.firm
             out= work.firm;
  by id;
 run;

 data temp2;
  merge firm(in= a)
        company(in= b);
  by id;
 run; 


Sorting datasets in order to prepare them for a merge is only required if you are using a version of Stata prior to Stata 11:

Create a local macro variable to represent a filename for Stata to use in temporarily storing a data file on the computer's hard drive if requested to do so later:
 tempfile company
 use "D:\mydata\company.dta" 
 sort id 

Save the dataset that is currently in memory to a temporary filename in Stata's temp directory.  This file will be deleted when Stata is exited just like a dataset in SAS's WORK library:
 save "`company'"
 use "D:\mydata\firm.dta" 
 // pre Stata 11 code:
 sort id
 merge id using "`company'" 

 /* Starting in Stata 11 the data does not need to
  *  be sorted but the type of merge needs to be
  *  specified like in this one-to-one merge: */
 merge 1:1 id using "`company'" 


 proc surveymeans;
  cluster sampunit;
  strata stratum;
  var age income;
  weight sampwt;
 run; 


 svyset sampunit [pweight= sampwt], strata(stratum)

 svy: mean age income


Analyze a subpopulation by implementing the domain option:
 proc surveymeans;
  cluster sampunit;
  strata stratum;
  domain females;
  var age income;
  weight sampwt;
 run; 


Analyze a subpopulation by implementing the subpop option:
 svy: mean age income, subpop(females)
Note:  options come after a comma ( , ).


Starting in SAS 9:
 proc surveyfreq;  
  cluster sampunit;
  strata stratum;
  tables females*var1*var2; 
  weight sampwt;
 run;
When using proc surveyfreq the domain/subpop variable needs to be included in the tables statement.



 svyset sampunit [pweight= sampwt], strata(stratum)

 svy: tab var1 var2, subpop(females)

 svy: tab var1 , subpop(females)



 proc surveyreg; 
  cluster sampunit;
  strata stratum;
  model depvar= indvar1 indvar2 indvar3;
  weight sampwt;
 run; 
Early SAS 9 releases of the surveyreg procedure do not have a way of dealing with subpopulations; later releases add a domain statement.  Using by or where will not suffice as they will compute incorrect standard errors.



 svyset sampunit [pweight= sampwt], strata(stratum)

 svy: regress depvar indvar1 indvar2 indvar3, ///
      subpop(females) 




Starting in SAS 9:
 proc surveylogistic; 
  cluster sampunit;
  strata stratum;
  model depvar= indvar1 indvar2 indvar3;
  weight sampwt;
 run; 
Early SAS 9 releases of the surveylogistic procedure do not have a way of dealing with subpopulations; later releases add a domain statement.  Using by or where will not suffice as they will compute incorrect standard errors.



 svyset sampunit [pweight = sampwt], strata(stratum)


 svy: logit depvar indvar1 indvar2 indvar3, ///
      subpop(females) 



Create a local macro variable ver:
 %let ver= 7;
 version= &ver.; 

Technically, SAS macro variables begin with an ampersand ( & ) and end with a period ( . ).  It is good practice to end your macro variables with a period.




 local ver= 7
 gen version= `ver' 

Notice that to evaluate the local macro variable ver a left quote ( ` ) is used and then a right quote ( ' ).  The left quote is located on your keyboard next to the ( ! 1 ) key.
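One detail that is easy to trip over when coming from %let: with an equals sign, local evaluates the expression and stores the result; without one, local stores the text as typed.  A short sketch (the macro names a and b are arbitrary), meant to be run together in one do-file since local macros vanish when the do-file ends:

```stata
local a = 2 + 2      // evaluated first: a holds 4
local b 2 + 2        // copied as text: b holds 2 + 2
display "`a'"        // shows 4
display "`b'"        // shows 2 + 2
display `b'          // the text is substituted and then evaluated: shows 4
```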

Print a subset of observations when a condition is true just to see examples (not all situations) where the condition exists in your data:
/** WHERE subsets the data *
  * before OBS subsets the data */

 proc print data= sashelp.shoes
      (where= (stores < 20)  obs= 10);
 run;
The above code lists the first 10 observations where (stores < 20).

 list  in 1/10 if stores < 20
 // the order of if and in does not matter:
 list  if stores < 20 in 1/10 
Both will first subset the data to the first 10 observations and then attempt to subset the data based on the condition "if stores < 20".  So, a hack way of doing the same in Stata is to use the sum() function.  Since sum() creates a running sum, you have to repeat the condition outside the sum() to subset the data to that condition to list the first 10 observations.  The sum() function adds up the true conditions because true conditions evaluate to 1 (one) and false evaluate to 0 (zero).

 list  if sum((stores < 20)) <= 10 & stores < 20
So you have to repeat the condition to subset the dataset to just those observations before starting the running sum.

If the condition is long you could mess up typing it twice so put it in a local macro variable:
 local  cond  stores < 20
 list  if sum((`cond')) <= 10 & `cond'
This is what the Stata command ifwins does.
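A more long-winded variant of the same idea stores the running sum in a variable first, which makes it easy to inspect.  This is only a sketch on made-up data; the variable name hits and the cutoff of 3 observations are arbitrary.

```stata
clear
set obs 100
gen stores = mod(_n, 40)          // made-up values from 0 to 39

local cond stores < 20
gen long hits = sum(`cond')       // running count of observations that meet the condition
list  if hits <= 3 & `cond'       // the first 3 observations where the condition is true
```

The variable hits can be dropped afterwards if it is no longer needed.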

Get a frequency count for each combination of a set of multiple categorical variables:
 ** example of a 3-way table **;
 proc freq data= sashelp.shoes;
   tables  region * product * stores / list;
 run;


Stata has no direct equivalent of the list option, but the contract command can be used like so:
 preserve
 contract region product stores , ///
          freq(frequency)         ///
          percent(percentage)     ///
          cfreq(cumulative_freq)  /// 
          cpercent(cumulative_pct)
 list
 restore


Questions or comments? If you are affiliated with the Carolina Population Center, send them to Phil Bardsley.