How to analyze USDA data at CPC
Using sample weights:
As long as a sample weight is used, the mean is going
to be the same regardless of what software is used (SAS, SUDAAN, Stata).
The standard error is what changes based on the software's choice of variance
formulas and on that software's method for correcting for sample design.
When using survey software (SUDAAN, the survey commands in Stata, or the survey
procedures in SAS), the standard error is not affected by whether the
sample weight is normalized. In Stata the type of weight to use is
the pweight. NOTE: pweight does not automatically normalize the sample weight the way aweight does. Stata's survey commands do not allow aweight.
Note: A normalized sample weight sums to the number
of observations in the data set and its mean is 1.
There is never a need to normalize the sample weight, because normalization makes no difference when the result is a ratio, mean, or percentage.
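To illustrate the note on normalized weights, here is a minimal Python sketch. The intake values and weights are made up for illustration and are not taken from any USDA file. It rescales a weight so that it sums to the number of observations and shows that the weighted mean does not move.

```python
def weighted_mean(values, weights):
    """Weighted mean: sum(w * y) / sum(w)."""
    total_weight = sum(weights)
    return sum(w * y for w, y in zip(weights, values)) / total_weight


def normalize(weights):
    """Rescale weights so they sum to the number of observations (mean of 1)."""
    n = len(weights)
    total = sum(weights)
    return [w * n / total for w in weights]


# Hypothetical values and population-scaled weights, for illustration only.
intake = [1800.0, 2200.0, 2500.0, 1500.0]
weights = [12000, 8000, 15000, 5000]

norm_weights = normalize(weights)
raw_mean = weighted_mean(intake, weights)        # uses the original weights
norm_mean = weighted_mean(intake, norm_weights)  # uses the normalized weights

print(sum(norm_weights), raw_mean, norm_mean)
```

The scale factor appears in both the numerator and the denominator of the mean, so it cancels. The same holds for any ratio or percentage.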
- CSFII 1994-96,1998 sample weights sum to the population of the United States. The weights are integers.
- CSFII 1989-91 sample weights sum to the number of people they intended to sample. The weights are integers.
- NFCS 1977-78 sample weights sum to the number of people they intended to sample. The weights are not integers.
- NFCS 1965 sample weights are either 1 or 2. They were created by a former programmer here at CPC because adults 20 to 64 years old were undersampled.
Why Control For Sample Design
Ideally, we would account for all unobserved differences between members of different strata, PSUs, households, etc. Correcting for all sampling levels would change the standard error more than correcting for, say, half the levels. SUDAAN comes closest by offering corrections on many levels of sampling. In practice, correcting for more than 1-2 levels of sampling usually results in little gain in precision (little difference in the standard error), and Stata only allows for up to 2 levels.
The question then becomes, "Which of the many levels of sampling should we correct for?" Since the goal is to correct for as much unobserved error as we can, we experiment with different combinations of the sampling levels to see which combination gets us closest to the ideal. The ideal is the combination whose standard error differs most from the standard error obtained without any sample design correction. Note: Weight the data in all tests.
Controlling for sample design (looking for the correct standard errors):
Since the USDA data are survey samples, always use Stata's survey commands
(svy: mean, svy: tab, etc.) or SUDAAN's procedures (DESCRIPT, CROSSTAB, etc.) in order
to derive the correct standard error. Stata uses SUDAAN's formulas for
its survey commands, so either software gives the same standard errors.
SAS's survey procedures do not use the same formulas, so there are some
minor differences in the standard errors (but not in the mean, of course).
USDA data were collected with a "With Replacement" (WR) sample design. Stata assumes WR, and you should choose "design=WR" in SUDAAN (though WR is SUDAAN's default sample design if design is not specified). Because the USDA sample design is WR, a finite population correction is neither needed nor meaningful.
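As a rough illustration of what a design-based standard error involves, the Python sketch below implements a textbook Taylor-linearization estimator for a weighted mean under a WR design with strata and PSUs. It is an assumption that this matches what Stata or SUDAAN compute internally; check their manuals for the exact formulas. The function name svy_mean and the data are hypothetical.

```python
from collections import defaultdict
from math import sqrt


def svy_mean(rows):
    """Weighted mean and WR linearized standard error.

    rows: iterable of (stratum, psu, weight, value) tuples.
    """
    rows = list(rows)
    total_w = sum(w for _, _, w, _ in rows)
    mean = sum(w * y for _, _, w, y in rows) / total_w

    # Linearized score of each observation, summed to PSU totals.
    psu_totals = defaultdict(lambda: defaultdict(float))
    for stratum, psu, w, y in rows:
        psu_totals[stratum][psu] += w * (y - mean) / total_w

    variance = 0.0
    for stratum, totals in psu_totals.items():
        n_psu = len(totals)
        if n_psu < 2:
            raise ValueError(f"stratum {stratum!r} has fewer than 2 PSUs")
        center = sum(totals.values()) / n_psu
        ss = sum((t - center) ** 2 for t in totals.values())
        variance += n_psu / (n_psu - 1) * ss  # WR: no finite population correction
    return mean, sqrt(variance)


# Hypothetical data: one stratum, four PSUs, one observation each, weight 1.
simple = [("A", 1, 1.0, 1.0), ("A", 2, 1.0, 2.0),
          ("A", 3, 1.0, 3.0), ("A", 4, 1.0, 4.0)]
mean, se = svy_mean(simple)

# Multiplying every weight by a constant changes neither the mean nor the SE.
scaled = [(h, p, w * 1000, y) for h, p, w, y in simple]
mean_scaled, se_scaled = svy_mean(scaled)
print(mean, se, mean_scaled, se_scaled)
```

With equal weights and one observation per PSU, the estimator reduces to the familiar s / sqrt(n). The check on the number of PSUs is the reason every stratum needs at least two PSUs.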
In order to compute the standard error accurately
(and to ensure that there are at least two PSUs per STRATUM), always have
the whole dataset present when analyzing the data (make sure that
every subject has at least one observation). Use a subpop variable to indicate who is in your desired subsample and who is not. A subpop variable has to be a 0/1 variable for everyone in the dataset (it cannot be missing for anyone). Everyone in the dataset needs a non-missing value for the STRATUM, PSU, sample weight, and SUBPOP variables. All ANALYSIS variables should keep their original values (a missing value if that is the subject's original value).
For example, if you are constructing a variable for 12-14 year old Hispanic females, do not subset the dataset to just 12-14 year old Hispanic females. Construct the variable for everyone in the entire dataset (males, other females, etc.). During analysis, use a subpop variable to limit the analysis to 12-14 year old Hispanic females.
Having missing data for an analysis variable is equivalent to running the survey command with the condition: if !missing(analysis_var). The Stata manual clearly
warns against the use of if or in with the survey commands.
Subjects in the subsample can have multiple observations, but subjects
not in the subsample (subpop == 0) only need to have one observation.
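The Python sketch below illustrates why the whole dataset has to stay in place. It uses a textbook WR linearization estimator, which is assumed here and is not necessarily the exact formula in Stata or SUDAAN. Observations outside the subpopulation contribute a score of zero, but their PSUs still count toward the design. The function name and the data are hypothetical.

```python
from collections import defaultdict
from math import sqrt


def svy_mean_subpop(rows):
    """Subpopulation mean and WR linearized standard error.

    rows: (stratum, psu, weight, value, subpop) tuples, subpop equal to 0 or 1.
    """
    rows = list(rows)
    sub_w = sum(w for _, _, w, _, s in rows if s == 1)
    mean = sum(w * y for _, _, w, y, s in rows if s == 1) / sub_w

    # Every PSU in the full dataset gets an entry, even with no subpop members.
    psu_totals = defaultdict(lambda: defaultdict(float))
    for stratum, psu, w, y, s in rows:
        score = w * (y - mean) / sub_w if s == 1 else 0.0
        psu_totals[stratum][psu] += score

    variance = 0.0
    for stratum, totals in psu_totals.items():
        n_psu = len(totals)
        if n_psu < 2:
            raise ValueError(f"stratum {stratum!r} has fewer than 2 PSUs")
        center = sum(totals.values()) / n_psu
        variance += n_psu / (n_psu - 1) * sum((t - center) ** 2
                                              for t in totals.values())
    return mean, sqrt(variance)


# Hypothetical data: PSU 2 of stratum "A" has no subpopulation members.
full = [("A", 1, 2.0, 10.0, 1), ("A", 1, 1.0, 20.0, 0), ("A", 2, 3.0, 30.0, 0),
        ("B", 1, 1.0, 40.0, 1), ("B", 2, 1.0, 20.0, 1)]
sub_mean, sub_se = svy_mean_subpop(full)

# Subsetting first (the "if" approach) leaves stratum "A" with a single PSU.
subset = [row for row in full if row[4] == 1]
try:
    svy_mean_subpop(subset)
    subset_failed = False
except ValueError:
    subset_failed = True
print(sub_mean, sub_se, subset_failed)
```

In this example the full dataset yields a standard error, while the subsetted data cannot, because one stratum is left with a single PSU.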
When controlling for sample design, never analyze more than one wave of data in the same dataset. Survey software uses the whole dataset to control for sample design, and a whole dataset is supposed to represent one population. Because each wave of data has its own sample design, several waves put in one dataset do not represent a whole population. When waves were analyzed together in one dataset as a test, the standard errors turned out to be a bit more conservative than when each wave was analyzed separately. The means were not affected.
When comparing waves of data, analyze each wave separately, then put the results from the output datasets into a single dataset to test the differences between the means (t-test). Any further analysis cannot use sample design correction methods, and should not need to, since the results are already corrected for sample design.
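A minimal Python sketch of that last step follows. It assumes the waves are independent samples, so the variances of the two means add, and it does not address degrees of freedom. The function name and the numbers are hypothetical.

```python
from math import sqrt


def compare_waves(mean_a, se_a, mean_b, se_b):
    """Difference between two separately estimated means, its SE, and t."""
    diff = mean_a - mean_b
    se_diff = sqrt(se_a ** 2 + se_b ** 2)  # independence assumed
    return diff, se_diff, diff / se_diff


# Hypothetical design-corrected results saved from two separate runs.
diff, se_diff, t_stat = compare_waves(2100.0, 30.0, 2000.0, 40.0)
print(diff, se_diff, t_stat)
```

The inputs are the means and standard errors that each wave's own survey analysis produced, so no further design correction is applied here.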
What set of observations is Stata analyzing?
Using a subpop variable does not do the same thing as an if condition.
In fact that is why the subpop option was invented. The svy commands
use the whole dataset to help determine the standard error even
if you are only looking at a subset of it (with a subpop var).
During the time Stata is analyzing your data, Stata subsets to only those
observations where ALL the following variables are non-missing:
- strata (if using one)
- psu (if using one)
- sample weight (if using one)
- subpop (if using one)
- analysis variable(s)**
If any of them is missing for an observation, Stata drops that observation.
** NOTE: svy: mean with more than one variable will not subset to observations where
all analysis variables are non-missing unless the complete
option is specified.
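The rule above can be mimicked in a few lines of Python. This is only a sketch of the casewise rule as described here, not Stata's own code. The variable names (stratum, psu, wt, subpop, kcal) are hypothetical, and None stands for a missing value.

```python
def estimation_sample(rows, design_vars, analysis_vars):
    """Keep rows where every listed variable is non-missing (None = missing)."""
    needed = list(design_vars) + list(analysis_vars)
    return [row for row in rows if all(row.get(v) is not None for v in needed)]


# Hypothetical records, for illustration only.
rows = [
    {"stratum": 1, "psu": 1, "wt": 5000, "subpop": 1, "kcal": 2100.0},
    {"stratum": 1, "psu": 2, "wt": 4000, "subpop": 0, "kcal": None},    # dropped
    {"stratum": 2, "psu": 1, "wt": None, "subpop": 1, "kcal": 1800.0},  # dropped
    {"stratum": 2, "psu": 2, "wt": 6000, "subpop": 0, "kcal": 2500.0},
]

kept = estimation_sample(rows, ["stratum", "psu", "wt", "subpop"], ["kcal"])
print(len(kept))
```

The second record is outside the subpopulation but is still dropped because its analysis variable is missing. If that removes the only observation in a PSU, the PSU disappears from the design, which is why missing analysis values matter even for subjects with subpop == 0.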
The sample design correction variables are:
- CSFII 1994-96,1998: VARSTRAT and VARUNIT. These have been renamed to STRATUM and PSU. The documentation refers to them as VARSTRAT and VARUNIT, and some datasets may still carry the original names.
- CSFII 1989-91: STRATUM and PSU.
- NFCS 1977-78: SUPERSTR and STRATUM. NOTE: Only 1 PSU per STRATUM was sampled. The documentation uses PSU and STRATUM interchangeably. In order to correct on more than one level use SUPERSTR as the STRATUM variable and the STRATUM variable as the PSU.
- NFCS 1965-66: SUPERSTR and STRATUM. NOTE: Only 1 PSU per STRATUM was sampled. In order to correct on more than one level use SUPERSTR as the STRATUM variable and the STRATUM variable as the PSU.


