# Commands to Analyze Survey Data

Stata provides two ways to analyze survey data. After a description of the two ways, there is a table to help you decide which one to choose.

### The survey Commands

The preferred way is to use the family of commands that begin with **svy:**. (See **help survey** in Stata for a list of commands that can be run after the svy: prefix.) These commands were designed especially for analyzing data from sample surveys. Before any of the survey estimation commands can be used, the **svyset** command must be used to identify the variables that describe the sample design: the stratification variable, the sampling weight, and/or the primary sampling unit (PSU). You can try **svyset** by running the following commands:

    clear
    use "q:\utilities\statatut\svysamp.dta"
    svyset earea [pweight=sampwt], strata(urbrur)

In this example from the 1999 Tanzania DHS data, the variable earea ("enumeration area") is the PSU, sampwt is the probability weight, and urbrur (urban-rural) is the stratum identifier.

These values stay in effect until they are cleared or reset. If you save the data, these values are saved with the data and will be in effect the next time you use the data file.
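As a minimal sketch of how these settings persist, the do-file lines below display the current settings, save the data, reload the file, and finally clear the settings. The file name `svysamp_set.dta` is a hypothetical name chosen for illustration, and the sketch assumes you can write to the current working directory.

```stata
* Sketch: svyset information travels with the saved data file.
clear
use "q:\utilities\statatut\svysamp.dta"
svyset earea [pweight=sampwt], strata(urbrur)

svyset                              // with no arguments, displays the current settings
save "svysamp_set.dta", replace     // hypothetical file name, for illustration only

clear
use "svysamp_set.dta"
svyset                              // the settings saved with the file are still in effect

svyset, clear                       // removes the settings from the data in memory
```

Running **svyset** with no arguments is also a quick way to confirm which PSU, weight, and strata variables are in effect before estimating anything.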

We could now use any of the survey estimation commands. For example, the mean of a variable from the data set could be estimated as follows:

    svy: mean numkids
    (running mean on estimation sample)

    Survey: Mean estimation

    Number of strata =       2          Number of obs    =    4029
    Number of PSUs   =     176          Population size  =    4029
                                        Design df        =     174

    --------------------------------------------------------------
                 |             Linearized
                 |       Mean   Std. Err.     [95% Conf. Interval]
    -------------+------------------------------------------------
         numkids |   2.409699   .0617263       2.28787    2.531528
    --------------------------------------------------------------

Stata first reports some statistics about the sample design and the data used in the computation. It is a good idea to make sure the number of strata and the number of PSUs reported are correct; to confirm the names of the design variables, run **svyset** with no arguments. The number of observations with non-missing data (4029) and the size of the population represented by the observations (4029) are also reported. The two numbers are equal here because the weights are normalized, so they sum to the number of observations.

After any of the survey estimation commands, you can use the **test** command to test linear hypotheses and **lincom** to compute linear combinations of estimates. These commands adjust the test statistics properly for the sample design. For example, to test whether the mean number of children differs between urban and rural women:

    svy: mean numkids, over(urbrur)
    test [numkids]Urban = [numkids]Rural
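The **lincom** command can be used in the same way to estimate the size of the urban-rural difference rather than only testing it. The sketch below assumes the same coefficient names as the **test** example (the value labels Urban and Rural on urbrur). If your version of Stata names the coefficients differently, the names in the **lincom** line will need adjusting.

```stata
* Sketch: estimate the urban-rural difference in mean numkids.
clear
use "q:\utilities\statatut\svysamp.dta"
svyset earea [pweight=sampwt], strata(urbrur)

svy: mean numkids, over(urbrur)

* Difference in means, with a standard error and confidence
* interval that reflect the sample design.
lincom [numkids]Urban - [numkids]Rural
```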

### Subpopulation Analysis

When using the svy commands to analyze only a portion of the sample (a sub-population), it is important to analyze the entire data set and to use the **subpop** option to identify those observations you want included in the estimate. This is because Stata needs to have information from every observation in the sample to compute the variance, standard error, and confidence intervals even though only the observations in the sub-sample are needed to compute means, proportions, and regression coefficients.

To use the subpop option, you need to generate a variable that has a value of 1 for the observations in your sub-population and a value of 0 for those that should be excluded. Here is an example where we compute the mean of numkids for the people living on Zanzibar (the observations for which the variable zanzibar has a value of 1):

    svy, subpop(zanzibar): mean numkids    // CORRECT SUBPOPULATION ANALYSIS
    (running mean on estimation sample)

    Survey: Mean estimation

    Number of strata =       2          Number of obs    =    4029
    Number of PSUs   =     176          Population size  = 4.0e+09
                                        Subpop. no. obs  =     969
                                        Subpop. size     = 1.0e+08
                                        Design df        =     174

    --------------------------------------------------------------
                 |             Linearized
                 |       Mean   Std. Err.     [95% Conf. Interval]
    -------------+------------------------------------------------
         numkids |   2.858553   .1157252      2.630147    3.086959
    --------------------------------------------------------------

Note that the subpopulation number of observations is listed as 969. It's a good idea to check that number to make sure your subpop variable is working as expected.
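If the data set does not already contain an indicator for the group you want, you can generate one before estimating. The sketch below is only an illustration: it assumes that the variable age holds age in years, and the indicator name `young` and the cutoff of 25 are arbitrary choices made for this example.

```stata
* Sketch: define a 0/1 indicator and use it with subpop().
clear
use "q:\utilities\statatut\svysamp.dta"
svyset earea [pweight=sampwt], strata(urbrur)

* 1 = under 25, 0 = 25 or older; left missing when age is missing.
* The name "young" and the cutoff of 25 are hypothetical.
generate byte young = (age < 25) if !missing(age)
tabulate young, missing              // check the counts before estimating

svy, subpop(young): mean numkids
```

Comparing the count of 1s from **tabulate** with the "Subpop. no. obs" line in the output is a simple check that the indicator is working as intended.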

It would be **incorrect** to use the **if** qualifier to subset the data:

    svy: mean numkids if zanzibar==1    // INCORRECT - DO NOT DO THIS
    (running mean on estimation sample)

    Survey: Mean estimation

    Number of strata =       2          Number of obs    =     969
    Number of PSUs   =      30          Population size  = 1.0e+08
                            |           Design df        =      28
                          WRONG
    --------------------------------------------------------------
                 |             Linearized
                 |       Mean   Std. Err.     [95% Conf. Interval]
    -------------+------------------------------------------------
         numkids |   2.858553   .1037723      2.645985    3.071121
    --------------------------------------------------------------
                                   |             |           |
                                 WRONG         WRONG       WRONG

The number of PSUs is incorrect, so the standard error and the confidence interval are also incorrect. Note that the estimate of the mean is the same. This is just one example of how different the results can be when you subset the data. For some variables the difference might be much smaller or much larger. It is best to always use the subpop option when analyzing a sub-population with the svy command.
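The arithmetic behind the two outputs can be sketched with Stata's **display** command. The design degrees of freedom equal the number of PSUs minus the number of strata, and each confidence limit is the mean plus or minus a t critical value times the standard error. All numbers below are copied from the two outputs above, so the results should agree with the reported intervals up to rounding.

```stata
* Sketch: design df = number of PSUs - number of strata.
display 176 - 2                      // df when the full sample is kept (subpop)
display 30 - 2                       // df after subsetting with if

* Confidence limits = mean -/+ t critical value * standard error.
* subpop() analysis: 174 df, standard error .1157252
display 2.858553 - invttail(174, 0.025) * .1157252
display 2.858553 + invttail(174, 0.025) * .1157252

* if-subset analysis: 28 df, standard error .1037723
display 2.858553 - invttail(28, 0.025) * .1037723
display 2.858553 + invttail(28, 0.025) * .1037723
```

The two analyses therefore differ in both the standard error and the degrees of freedom used for the critical value.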

### Using the pweight and robust cluster() Options

The second way to analyze survey data is to use the estimation commands that allow the pweight and robust cluster() options. Used with these options, the estimation commands handle the sampling weights and clustering properly. However, there is no option for specifying the stratification variable. As a result, the standard error may be larger than it would be using an svy command.

The following set of commands demonstrates the difference between logit (without stratum) and svy: logit (with stratum):

    clear
    use "q:\utilities\statatut\svysamp.dta"
    svyset earea [pweight=sampwt], strata(urbrur)
    logit twokids age educat [pweight=sampwt], robust cluster(earea)
    svy: logit twokids age educat
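To view the two sets of standard errors side by side, one option is to store each set of results and tabulate them together. This is a sketch, not part of the original example; the stored-result names `nostrata` and `withstrata` are arbitrary names chosen for illustration.

```stata
* Sketch: compare standard errors with and without the strata information.
clear
use "q:\utilities\statatut\svysamp.dta"
svyset earea [pweight=sampwt], strata(urbrur)

logit twokids age educat [pweight=sampwt], robust cluster(earea)
estimates store nostrata             // hypothetical name for the stored results

svy: logit twokids age educat
estimates store withstrata           // hypothetical name for the stored results

* Coefficients with standard errors from both fits, side by side.
estimates table nostrata withstrata, se
```

The coefficients should be the same in both columns; only the standard errors are expected to differ.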

### Choosing a method

The following table compares the two methods available for analyzing data from a sample survey:

Method | Strengths | Limitations |
---|---|---|
The survey commands | test and lincom commands used after estimation adjust the test statistics correctly for the sample design. Can make finite population corrections for without-replacement samples. Option available on svyset command to specify the stratification variable. | The analysis command you want may not support the svy prefix: see help svy_estimation for a current list of those commands. |
Commands that allow pweight and robust cluster() options | There may be an estimation command that supports cluster but does not support svy. | Should have at least 40 clusters available. Option for specifying a stratification variable is not available. |

It is best to use the survey commands to analyze survey data. These commands incorporate the effect of clustering and stratification as well as the effect of sampling weights when computing the variance, standard error, and confidence intervals, and they allow you to perform analyses on a subset of the data using the subpop option. If the analysis technique you need is not available with the survey commands, then using the estimation commands with pweights and robust cluster() options would be a good choice.