Choosing the Correct Weight Syntax
One of the most common mistakes made when analyzing data from sample surveys is specifying an incorrect type of weight for the sampling weights. Only one of the four weight keywords provided by Stata, pweight, is correct to use for sampling weights. The purpose of each type of weight follows.
Sampling or Probability Weights: pweight
Stata uses the keyword pweight to specify probability weights, which is another name for sampling weights. With pweight, Stata treats the sampling weight as the number of subjects in the population that each observation represents when computing estimates such as proportions, means, and regression parameters. A robust variance estimation technique is used automatically to adjust for the design characteristics so that variances, standard errors, and confidence intervals are correct.
Run the following commands to demonstrate the difference between unweighted and weighted results, and to see that Stata automatically uses the robust estimation technique when you use pweights. As before, these data are from the 1999 Tanzania DHS women's survey. Here we predict having 0-2 kids using a woman's education and controlling for her age.
    clear
    use "q:\utilities\statatut\svysamp.dta"
    logit twokids age educat
    logit twokids age educat [pweight=sampwt]
    logit twokids age educat [pweight=sampwt], robust cluster(earea)
Note that the coefficient associated with education changes only slightly with the use of pweights. However, the standard error increases quite a bit, with a corresponding decrease in z. Most notably, p increases from .001 to .030 with the addition of weights. If we had not included probability weights, we would have assigned too much importance to the role education plays in the number of children these women have. Adding the cluster(earea) option makes only a slight adjustment in this case, but is recommended.
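In Stata 9 and later, the same design information can also be declared once with svyset and then applied through the svy: prefix. The following is a minimal sketch, not part of the original exercise. It assumes that earea identifies the primary sampling units and that no stratum variable needs to be declared; if the data set contains a stratification variable, it would be added with the strata() option.

```stata
* Sketch only: assumes earea is the cluster (primary sampling unit) variable
* and sampwt is the sampling weight, as in the commands above.
use "q:\utilities\statatut\svysamp.dta", clear

* Declare the design once: cluster variable first, then the probability weight.
svyset earea [pweight=sampwt]

* The svy: prefix applies the declared weights and clustering.
svy: logit twokids age educat
```

The coefficients should match those from the logit command with [pweight=sampwt], and the standard errors should be close to those from the version with cluster(earea).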
The discussion below of the other weight types is included as general information. In most cases, these weight types are ***NOT APPROPRIATE*** for use with sample survey data.
Frequency Weights: fweight
Frequency weights are integers that indicate the number of times an observation was actually observed. They are used when your data set has been collapsed and contains a variable that records how many times each record occurred. For example, if the original data were:
    x1   x2   y
    16    3   1
    16    3   1
    19    2   0
    19    2   0
    19    2   0
and the estimation command would be
    logit y x1 x2
then the collapsed data would look like:
    x1   x2   y   count
    16    3   1       2
    19    2   0       3
and the estimation command would be
    logit y x1 x2 [fweight=count]
Do not use fweights to specify sampling weights. The variances of the estimates, the standard errors, and the p-values will be computed incorrectly.
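The equivalence between the full and the collapsed form of the data can be checked directly. The following sketch is illustrative only: it enters the five records from the example above, collapses them with contract, and compares a summary of x1 before and after. The variable name count follows the example. Because y is perfectly predicted by x1 in these few records, summarize is used here instead of logit.

```stata
* Enter the five original records from the example above.
clear
input x1 x2 y
16 3 1
16 3 1
19 2 0
19 2 0
19 2 0
end

* Summary of x1 on the full data: 5 observations.
summarize x1
scalar full_N    = r(N)
scalar full_mean = r(mean)

* Collapse identical records; count holds how often each one occurred.
contract x1 x2 y, freq(count)
list

* With fweight, each record is treated as count identical observations,
* so the number of observations and the mean match the full data.
summarize x1 [fweight=count]
scalar fw_N    = r(N)
scalar fw_mean = r(mean)
```

After running this, fw_N and fw_mean should equal full_N and full_mean, which is the behavior that makes fweight appropriate for collapsed data and inappropriate for sampling weights.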
Analytic Weights: aweight
Analytic weights are used when you want to compute a linear regression on data that are observed means. For example, instead of having data that look like:
    group   x    y
        1   3   22
        1   4   30
        2   8   25
        2   2   19
        2   5   16
suppose the data have been condensed so that only the group averages are available, along with the number of observations (n) in each group:
    group     x      y   n
        1   3.5   26.0   2
        2   5.0   20.0   3
and a linear regression could be done by using the command:
    regress y x [aweight=n]
Do not use aweights to specify sampling weights. The formulas that use aweights assume that larger weights designate more accurately measured observations. In a sample survey, however, one observation is no more accurately measured than any other. Hence, using the aweight keyword to specify sampling weights will cause Stata to estimate incorrect values for the variances and standard errors of estimates, and for the p-values of hypothesis tests.
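The condensed table in the example above can be reproduced from the individual-level records with collapse. This is an illustrative sketch only; the variable name n for the group size follows the example.

```stata
* Enter the five individual-level records from the example above.
clear
input group x y
1 3 22
1 4 30
2 8 25
2 2 19
2 5 16
end

* Replace the data with one record per group: the means of x and y,
* plus n, the number of observations that went into each mean.
collapse (mean) x y (count) n=x, by(group)
list
```

With only two group means, regress y x [aweight=n] would fit the line exactly, so this small data set illustrates the layout of the data rather than a meaningful regression.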
Importance Weights: iweight
Stata has a special weight type, iweight, intended for programmers who need to implement their own analytical techniques using some of the available estimation commands. When using importance weights, take special care to understand how they enter the formulas for the estimates and the variance. This information is available in the Methods and Formulas section of the Stata manual entry for each estimation command. In general, these formulas will be incorrect for computing the variance for data from a sample survey.