
UNC Carolina Population Center

 

How to "Do" in Stata What You Know How to "Program" in SAS

The Stata code on this page is valid for Stata version 8.

Here is Stata 8 code matched to SAS code as closely as it can be.  Use your browser's search tool to find the SAS code for which you need the Stata equivalent.

**Note**:  Stata commands are partially underlined to show the minimum characters that need to be typed for Stata to recognize that command.

 




SAS

Stata

In SAS operators can be symbols or mnemonic equivalents such as:
 & 
or
 and 
For many situations in SAS order doesn't matter:
 <=  
can be:
 =< 
and
 >=  
can be:
 => 
Most operators are the same in Stata as in SAS, but in Stata operators do not have mnemonic equivalents.  For example, you have to use the ampersand (&) and not the word "and":
This works:
var_a >= 1 & var_b <= 10 
where this does not:
var_a >= 1 and var_b <= 10

These are the operators that are different in Stata:
Symbol Definition
& and
| or
>= greater than or equal to
<= less than or equal to
== equality (for equality testing)
!= does not equal (~= also works)
~ not
^ power
NOTE: Symbols have to be in the order shown: " >=  " not "  => " .
Range of values:
 if 1 <= var_a <= 10 
or:
 if var_a in(1,2,3,4,5,6,7,8,9,10)
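Range of values in Stata (the chained form 1 <= var_a <= 10 is accepted but does not test a range, so write both comparisons):
 if var_a >= 1 & var_a <= 10 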
or:
 if inrange(var_a,1,10)
or:
 if inlist(var_a,1,2,3,4,5,6,7,8,9,10)
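The difference between the SAS and Stata range tests can be checked with a small sketch. It builds a hypothetical 15-observation dataset (the values are made up for illustration) and counts matches. Stata evaluates 1 <= var_a <= 10 from left to right, so the first comparison yields 0 or 1, and that result is then compared to 10.

```stata
clear
set obs 15
gen var_a = _n                      // hypothetical values 1 through 15

* Parsed as (1 <= var_a) <= 10: the inner test is 0 or 1, so this is always true
count if 1 <= var_a <= 10

* The intended range test
count if var_a >= 1 & var_a <= 10
count if inrange(var_a, 1, 10)
```

The first count reports every observation; the last two report only the observations whose values lie between 1 and 10.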
Referencing multiple variables at a time:
Say the following variables are in a data file in the order shown:
var1 var2 var3 age var4 var5 
Then you could code them as:
var1--var5 
To SAS, this means "all variables that are positionally between var1 and var5," which would include the variable age.
Referencing multiple variables at a time:
var1-var5 
To Stata, this means "all variables that are positionally between var1 and var5."  Notice that there is only one hyphen ( - ).
Referencing multiple variables at a time:
In SAS, a single hyphen refers to a numbered range of variables, so:
var1-var5 
is the same as:
var1 var2 var3 var4 var5 
no matter what positions the variables occupy in the observation.

Using a colon selects variables containing the same prefix:
var:
could represent:
var1 var2 var10 variable varying var_1

Referencing multiple variables at a time:
var?
The question mark ( ? ) is a wild card that represents one character in the variable name.  It could be a number, a letter, or an underscore ( _ ).
var*
The asterisk ( * ) is a wild card that represents zero or more characters in the variable name. They could be numbers, letters, or underscores.  Thus
var* 
could represent:
var1 var2 var10 variable varying var_1 
To save the contents of the Log window and/or Output window, go to that window and click on the menu bar's "File", "Save".  In SAS batch mode these files are automatically generated for you.
To save the contents of the Results window, start logging to a log file BEFORE you submit commands that you want logged.  Open a log file by clicking on the icon in the tool bar that looks like a scroll and a traffic light.  A " *.log " file is a simple ASCII text file; a " *.smcl " file is formatted with html-like tags. 

You can also use the log command:
log using "d:\mydata\mydofile.log", replace 
NOTE: The "replace" option simply tells Stata to overwrite the log file if it already exists.  This is helpful when you have to run a do-file over and over again.
More on this in the Stata tutorial.
libname in8 v8 "d:\mydata\";

data new;
set in8.mySASfile;
run;
or, starting in SAS 8:
data new;
set "d:\mydata\mysasfile.sas7bdat";
run;
use "d:\mydata\myStataFile.dta" 
You can also click on the "open file" icon and select your dataset.
More on this in the Stata tutorial.
Save the dataset newer to d:\mydata\ :
data in8.newer; 
set new;
run;
save "d:\mydata\newer.dta" 
To overwrite the dataset newer if it already exists:
save "d:\mydata\newer.dta" , replace 
You can also click on the "save" icon to save your dataset.
More on this in the Stata tutorial.
proc contents; 
On selected variables:
proc contents data = in8.newer
(keep = id age height);
run;
describe
On selected variables:
describe id age height
More on this in the Stata tutorial.
proc means;
On selected variables:
proc means;
var age height;
run;
or
proc univariate;
var age height;
run;
summarize 
On selected variables:
summarize age height
If you want variable labels and a proc univariate style output try:
summarize age height, detail
or:
codebook age height 
More on this in the Stata tutorial.
proc surveymeans;
cluster sampunit;
strata stratum;
var age height;
weight sampwt;
run;
Stata version 8:
svyset [pweight = sampwt], strata(stratum) psu(sampunit)

svymean age height

More on this in the Stata tutorial.
Analyze a subpopulation by implementing the domain option:
proc surveymeans;
cluster sampunit;
strata stratum;
domain female;
var age height;
weight sampwt;
run;
Stata version 8:

Analyze a subpopulation by implementing the subpop option:
svymean age height, subpop(female)

More on this in the Stata tutorial.
proc freq;
tabulate
or, for just checking out your dataset, try:
codebook
More on this in the Stata tutorial.
A series of 1-way tables:
proc freq;
 tables var1 var2;
run;
A series of 1-way tables:
tab1 var1 var2 
More on this in the Stata tutorial.
A 2-way table:
proc freq;
 tables var1*var2;
run;
A 2-way table:
tab2 var1 var2 
More on this in the Stata tutorial.
Starting in SAS 9:
proc surveyfreq;  
cluster sampunit;
strata stratum;
tables females*var1*var2;
weight sampwt;
run;
When using proc surveyfreq the domain/subpop variable needs to be included in the tables statement.
Stata version 8:
svyset [pweight = sampwt], strata(stratum) psu(sampunit)

svytab var1 var2, subpop(females)

More on this in the Stata tutorial.
proc surveyreg; 
cluster sampunit;
strata stratum;
model depvar = indvar1 indvar2 indvar3;
weight sampwt;
run;
Proc surveyreg does not have a way of dealing with subpopulations.  Using "by" or "where" will not suffice as they will compute incorrect standard errors.
Stata version 8:
svyset [pweight = sampwt], strata(stratum) psu(sampunit)
svyregress depvar indvar1 indvar2 indvar3, subpop(females)

Starting in SAS 9:
proc surveylogistic; 
cluster sampunit;
strata stratum;
model depvar = indvar1 indvar2 indvar3;
weight sampwt;
run;
Proc surveylogistic does not have a way of dealing with subpopulations.  Using "by" or "where" will not suffice as they will compute incorrect standard errors.
Stata version 8:
svyset [pweight = sampwt], strata(stratum) psu(sampunit)

svylogit depvar indvar1 indvar2 indvar3, subpop(females)
proc print;
On selected variables:
proc print;
var id age height;
run;
On selected variables and a limited range of observations:
proc print data = new (firstobs = 1 obs = 20);
var id age height;
run;
list
On selected variables:
list id age height 
On selected variables and a limited range of observations:
list id age height in 1/20
More on this in the Stata tutorial.
/* comment */
* comment ;
Stata version 8:
/* comment */
* comment
// comment
To continue a line:
///
For example:
list hhid personid gender age weight height ///     
race income date
More on this in the Stata tutorial.
Create a numeric variable with a default length of 8 bytes:
var1 = 1234; 

Create a numeric variable with the minimum allowable length (3 bytes):
length var1 3;
var1 = 1234;
generate var1 = 1234 
NOTE:  the default numeric type is "float."  The statement above is relying on that default. 
It could have been written explicitly as:
generate float var1 = 1234  
"float" stands for "floating point": a 4-byte number accurate to about 7 digits.
You could more wisely save storage space by specifying:
gen int var1 = 1234  
"int" stands for "integer."
More on this in the Stata tutorial.
Create a character variable with a length of 3 bytes:
name = "Bob";  
Generate a string variable with a length of 3 bytes:
gen str3 name = "Bob"  
Increase the variable length to allow for 5 characters, then change the values of the numeric and character variables:
data new;
  length name $5; 
  set new;
  var1 = 123456;
  name = "Bobby";
run;
replace var1 = 123456  
Stata automatically increases the storage type if necessary.  To change the storage of a variable manually, use the recast command. 
replace name = "Bobby"  
Stata automatically increases length to 5 
More on this in the Stata tutorial.
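As a minimal sketch of the automatic promotion and the recast command described above, the following builds a hypothetical one-observation dataset and checks the storage types with describe. The choice of double as the recast target is only an illustration.

```stata
clear
set obs 1
gen int var1 = 1234
gen str3 name = "Bob"

replace var1 = 123456        // too large for int, so Stata promotes var1 to long
replace name = "Bobby"       // str3 is too short, so Stata promotes name to str5
describe var1 name

recast double var1           // change the storage type manually
describe var1
```

recast refuses a change that would lose information unless told otherwise, so moving to a larger type such as double is always safe.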
Example of an if-then statement:
if var1=123456 then var2=1;
The condition follows the command:
replace var2 = 1 if var1 == 123456 
Notice that Stata requires two equals signs when testing equality.
Example of an if-then do loop:
  if age <= 10 then do;
child = 1;
parent = 0;
end;
replace child = 1 if age <= 10
replace parent = 0 if age <= 10
Since each command is executed on all observations before the next command is executed, the "if-then do loop" is not an option. Stata does have excellent looping tools: foreach, forvalues, and while.
More on this in the Stata tutorial
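A brief sketch of two of the looping tools mentioned above, forvalues and while. The dataset and the variable names flag1 to flag3 are hypothetical and exist only for this illustration.

```stata
clear
set obs 5
gen age = _n * 10                        // hypothetical ages 10, 20, 30, 40, 50

* forvalues steps the local macro k through 1, 2, 3
forvalues k = 1/3 {
    gen flag`k' = age <= `k' * 10        // 1 if true, 0 if false
}

* while repeats until its condition fails; the counter is updated by hand
local k = 1
while `k' <= 3 {
    label variable flag`k' "age <= `k'0"
    local k = `k' + 1
}
list
```

These loops repeat whole commands. They do not step through observations the way a SAS data step does, since each command inside the loop still runs on all observations.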
Example of an if-then-else:
  if 0 <= age <= 2 then agegp = 1;
else if 2 < age <= 10 then agegp = 2;
else if 10 < age <= 20 then agegp = 3;
else if 20 < age <= 40 then agegp = 4;
else agegp = . ;
For the same reason that "if-then do loops" (above) are not possible in Stata, neither is "if-then-else", but here is a way of doing the same thing.  In this example " agegp == . " simply highlights the fact that agegp has not yet been assigned a value, just as the "else" does in "if-then-else":
gen agegp = .
replace agegp = 1 if agegp == . & age >= 0 & age <= 2
replace agegp = 2 if agegp == . & age > 2 & age <= 10
replace agegp = 3 if agegp == . & age > 10 & age <= 20
replace agegp = 4 if agegp == . & age > 20 & age <= 40

Better done with the recode command which can also create value labels:
recode age ( 0/2.9999  = 1 "0 to 2 year olds")   ///
( 3/10.9999 = 2 "3 to 10 year olds") ///
(11/20.9999 = 3 "11 to 20 year olds") ///
(21/40.9999 = 4 "21 to 40 year olds") ///
( else = . ) , gen(agegp) test
The test option checks to see if the ranges overlap.
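Since recode's ranges are >= and <= , adding .9999 to the upper range ensures that fractional values are handled correctly.  Note that these ranges group ages by whole years, so an age of 2.5 falls in group 1 here but in group 2 in the replace version above.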

Drop variables var1, var2, and var3:
data new(drop = var1 var2 var3);
set new;
run;
Drop variables var1, var2, and var3:
drop var1 var2 var3 
More on this in the Stata tutorial.
Keep variables var1, var2, and var3:
data new(keep = var1 var2 var3);
set new;
run;
Keep variables var1, var2, and var3:
keep var1 var2 var3 
Keep observations / subsetting if statement:
data new;
set new;
if var1 = 1 then output;
run;
Keep observations
keep if var1 == 1 
Delete observations:
data new;
set new;
if var1 = 1 then delete ;
run;
Drop observations:
drop if var1 == 1 
More on this in the Stata tutorial.
Loop over a variable list (varlist):

data new(drop = i);
set new;
array raymond {4} var1 var2 var3 var4;
do i = 1 to 4;
if raymond{i} = 99 then raymond{i} = .;
end;
run;
foreach i in var1 var2 var3 var4 {
replace `i' = . if `i' == 99
}
NOTE: Notice that the quote to the left of the letter " i " is a left quote ( ` ). The left quote is located at the top of your keyboard next to the "! 1" key. In this example i is a local macro variable that exists only for the duration of the foreach command so it does not need to be dropped like the variable i in the SAS code.
More on this in the Stata tutorial.
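The same loop can be written over a variable list rather than a list of words, which lets Stata expand var1-var4 itself. This is a sketch on a small made-up dataset; mvdecode, shown in the closing comment, is a built-in command that performs this kind of recoding to missing without a loop.

```stata
clear
input var1 var2 var3 var4
99  1  2  3
 4 99  5 99
end

* "of varlist" expands var1-var4 by position in the dataset
foreach v of varlist var1-var4 {
    replace `v' = . if `v' == 99
}
list

* Equivalent one-line alternative to the loop:
*   mvdecode var1-var4, mv(99)
```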
Create variable labels:

label age = "age in years"
height = "height in inches";
label var age "age in years"
label var height "height in inches"
More on this in the Stata tutorial.
Define a format:
proc format;
value yesno
1 = "yes"
2 = "no";
run;
Assign the format to a variable:
data newer;
set newer;
format smokes yesno.;
run;
Define a format.  These are called "value labels":
label define yesno 1 "yes" /*
*/ 2 "no"

Assign the value label to a variable:
label value smokes yesno 

More on this in the Stata tutorial.
Assign formats defined by SAS to a variable:
 format interview_date mmddyy8.;
Assign formats defined by Stata to a variable:
format interview_date %dN/D/Y
NOTE: Date formats begin with "%d".  The letter "N" in "%dN/D/Y" stands for "number of the month".   "%dM/D/Y" would use the name of the month.
title "Nutritional Intakes for 12-18 year olds";
Because Stata's Results window and log file combine what SAS splits between the Log and Output windows, Stata doesn't need a title statement.  Titling can be accomplished with a comment.
/* Nutritional Intakes for 12-18 year olds */ 
proc sort data = new out = newer;
by id;
run;
sort id 
More on this in the Stata tutorial.
proc transpose data = new
(keep = age edu rel sex id lineno)
out = tr_new;
by id;
run;
reshape long age edu rel sex, i(id) j(lineno)
More on this in the Stata tutorial.
data newer;
set newer;
by id;
if first.id = 1 then f_num = 1;
if first.id = 1 and last.id = 1 then s_num = 1;
if last.id = 1 then l_num = 1;
run;
by id: gen f_num = 1 if _n == 1 
by id: gen s_num = 1 if _n == 1 & _N == 1
by id: gen l_num = 1 if _n == _N
Stata's "_n" is equivalent to SAS's "_n_" in that it is equal to the observation number; but when inside a by command "_n" is equal to 1 for the first observation of the by-group, 2 for the second observation of the by-group, etc.
Stata's "_N" is equal to the number of observations in the dataset except in a by-command when it is equal to the total number of observations in the by-group.
More on this in the Stata tutorial.
Count the number of boys within an id by-group:
data new;
set newer;
by id;
retain count 0;
if first.id then count = 0;
if gender = 1 and age<= 18 then count = count+1;
run;
Count the number of boys by id:
by id: gen count = sum(gender == 1 & age<= 18)
The sum function creates a running sum of the expression inside it.
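Because sum() is a running total, the full count for each by-group sits in the group's last observation. A minimal sketch, using made-up data and a hypothetical variable name nboys, copies that total to every observation in the group:

```stata
clear
input id gender age
1 1 10
1 2 12
1 1 30
2 1  5
2 1  7
end

sort id
by id: gen count = sum(gender == 1 & age <= 18)   // running count within id
by id: gen nboys = count[_N]                      // last value = group total
list
```

Here count[_N] refers to the value of count in the last observation of the by-group, so nboys is constant within each id.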
data both;
merge new(in = a)
in8.newer(in = b);
by id;
if a = 1 and b = 1;
run;
merge id using "d:\mydata\newer.dta"
keep if _merge == 3
Stata automatically creates the variable "_merge" after a merge.  Stata will not merge on another dataset if _merge already exists on one of the datasets.
The dataset in memory is the "master" dataset.  The dataset that is being merged on is the "using" dataset.  Unlike SAS, values of variables shared by the master dataset and the using dataset will not be overwritten by the using dataset.  Like SAS, the formats, labels, and informats of variables shared by the master dataset and the using dataset will be defined by the master dataset.  Remember that the master always wins.  To change that, use the update option, which fills in missing values in the master with values from the using dataset; adding the replace option also overwrites nonmissing values in the master.
More on this in the Stata tutorial.
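The "master always wins" rule and the update option can be seen in a small sketch. Both datasets and the variable height are made up for illustration, and the using dataset is kept in a temporary file so nothing is written to a permanent location.

```stata
* Hypothetical using dataset
clear
input id height
1 60
2 65
end
sort id
tempfile newer
save "`newer'"

* Hypothetical master dataset: height is missing for id 2
clear
input id height
1 72
2 .
end
sort id

* update fills in only the missing master value; id 1 keeps height 72
merge id using "`newer'", update
list
```

With "update replace" instead of "update", the using value would also overwrite the nonmissing master value for id 1.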
Concatenate two datasets:
data both;
set new
in8.newer;
run;
append using "d:\mydata\newer.dta"

More on this in the Stata tutorial.
Sort datasets in order to prepare them for a merge:

Sort permanently stored datasets and create new, sorted copies in the work library:
proc sort data = in8.individual out = indiv;
by id;
run;

proc sort data = in8.household out = house;
by id;
run;

data temp2;
merge house(in = a)
indiv(in = b);
by id;
run;
Sort datasets in order to prepare them for a merge:

Create a local macro variable to represent a filename for Stata to use in  temporarily storing a data file on the computer's hard drive if requested to do so later:
tempfile indiv
use "d:\mydata\individual.dta" 
sort id 

Save the dataset that's currently in memory to a temporary filename in Stata's temp directory.  This file will be deleted when Stata is exited just like a dataset in SAS's work library:
save "`indiv'"
use "d:\mydata\household.dta" 
sort id
merge id using "`indiv'"
More on this in the Stata tutorial.
Create a local macro variable "ver":
%let ver = 7;
version = &ver.;
local ver = 7
gen version = `ver'
Notice that to evaluate the local macro variable "ver" a left quote " ` " is used and then a right quote " ' ". The left quote is located on your keyboard next to the "! 1" key.







Questions or comments?  If you are affiliated to the Carolina Population Center, send them to Phil Bardsley; non-affiliates may contact the author Dan Blanchette.