This document contains:
--Using sample weights
--Why control for sample design?
--Controlling for sample design
--What set of observations is Stata analyzing?
--Sample design correction variables
NOTE: This document is an attempt to standardize how analysis of the USDA data is done here at CPC.
Using sample weights:
As long as a sample weight is used, the mean is going
to be the same regardless of what software is used (SAS, SUDAAN, Stata).
The standard error is what changes based on the software's choice of variance
formulas and on that software's method for correcting for sample design.
When using survey software (SUDAAN, the survey commands in Stata, or the survey
procedures in SAS), the standard error is not affected by whether the
sample weight is normalized. In Stata the type of weight to use is
the "pweight." NOTE: pweight does not automatically normalize the sample
weight like aweight does. Stata's survey commands do not allow aweight.
Note: A normalized sample weight sums to the number
of observations in the data set and its mean is 1.
Do not use a normalized sample weight: normalizing makes no
difference when the result is a ratio, mean, or percentage.
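The note above can be checked with a minimal Python sketch. The values and weights are made up for illustration (they are not USDA data), and the helper name weighted_mean is hypothetical. It rescales the weights so they sum to the number of observations and shows that the weighted mean does not change.

```python
# Hypothetical values and sample weights, for illustration only (not USDA data).
values = [2.0, 4.0, 6.0, 8.0]
weights = [100.0, 300.0, 200.0, 400.0]


def weighted_mean(x, w):
    """Weighted mean: sum(w * x) / sum(w)."""
    return sum(wi * xi for wi, xi in zip(w, x)) / sum(w)


# Normalize: rescale so the weights sum to the number of observations.
n = len(weights)
total = sum(weights)
normalized = [wi * n / total for wi in weights]

mean_raw = weighted_mean(values, weights)
mean_norm = weighted_mean(values, normalized)

print(sum(normalized))      # sums to n (up to rounding)
print(sum(normalized) / n)  # mean weight is 1 (up to rounding)
print(mean_raw, mean_norm)  # the two means agree (up to rounding)
```

The rescaling factor cancels in the ratio, which is why any ratio, mean, or percentage is unaffected.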
--CSFII 1994-96,1998 sample weights sum to the population
of the United States. The weights are integers.
--CSFII 1989-91 sample weights sum to the number
of people they intended to sample. The weights are integers.
--NFCS 1977-78 sample weights sum to the number of
people they intended to sample. The weights are not integers.
--NFCS 1965 sample weights are either 1 or 2. They
were created by a former programmer here at CPC because adults
20 to 64 years old were undersampled.
Why control for sample design?
Ideally, we would account for all unobserved differences between members
of different strata, PSUs, households, etc. Correcting for all sampling
levels would change the standard error more than correcting
for, say, half the levels. SUDAAN comes closest by offering corrections
on many levels of sampling. In practice, correcting for more than 1-2 levels
of sampling usually results in little gain in precision (little difference
in the standard error). Stata only allows for up to 2 levels. The
question then becomes, "Which of the many levels of sampling should we
correct for?" Since the goal is to correct for as much unobserved error
as we can, we experiment with different combinations of the sampling
levels to see which combination gets us closest to the ideal.
The ideal is the combination that changes the standard error the most
relative to using no sample design correction at all.
Note: weight the data in all tests.
Controlling for sample design (looking for the correct standard errors):
Since the USDA data are survey samples, always use
Stata (svymean, svytab, etc.) or SUDAAN (descript, crosstab, etc.) in order
to derive the correct standard error. Stata uses SUDAAN's formulas for
its survey commands, so either software gives the same standard errors.
SAS's survey procedures do not use the same formulas, so there are some
minor differences in the standard errors (but not the mean, of course).
USDA data were collected with a "With Replacement" (WR)
sample design. Stata assumes WR, and you should choose "design=WR" in SUDAAN
(though WR is SUDAAN's default sample design if design is not specified).
Since the USDA sample design is WR, a finite population correction is
neither needed nor meaningful.
In order to accurately compute the standard error
(and to ensure that there are at least two PSUs per STRATUM), always have
the whole data set present when analyzing the data (make sure that
every subject has at least one observation). Use a subpop variable to
indicate who is in your desired subsample and who is
not. A subpop variable has to be a 0/1 variable for everyone in the data
set (it cannot be missing for anyone). Everyone in the data set
needs a non-missing value for the STRATUM, PSU, sample weight, and SUBPOP variables.
All ANALYSIS variables should have their original values (a missing value if
that is that subject's original value). For example, if you are constructing
a variable for 12-14 year old Hispanic females, do not subset the data set
to just 12-14 year old Hispanic females. Construct the variable for everyone
in the entire data set (males, other females, etc). During analysis use a subpop
variable to limit the analysis to 12-14 year old Hispanic females. Having
missing data for an analysis variable is equivalent to running the survey
command with the condition "if analysis_var != . ". The Stata manual clearly
warns against the use of "if" or "in".
Subjects in the subsample can have multiple observations, but subjects
not in the subsample (subpop=0) only need to have one observation.
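As a rough sketch of the subpop rule described above, the Python snippet below builds a 0/1 flag for 12-14 year old Hispanic females without dropping anyone. The records and field names (age, sex, hispanic, intake) are hypothetical. Coding a person of unknown age as 0 is an assumption of this sketch, made so that the flag is never missing.

```python
# Hypothetical records; field names are assumptions, not USDA variable names.
people = [
    {"id": 1, "age": 13, "sex": "F", "hispanic": True, "intake": 1800.0},
    {"id": 2, "age": 13, "sex": "M", "hispanic": True, "intake": 2100.0},
    {"id": 3, "age": 40, "sex": "F", "hispanic": False, "intake": None},
    {"id": 4, "age": None, "sex": "F", "hispanic": True, "intake": 1700.0},
]


def subpop_flag(person):
    """Return 1 for 12-14 year old Hispanic females, else 0 (never missing)."""
    age = person["age"]
    in_group = (
        age is not None          # unknown age is coded 0 here (an assumption)
        and 12 <= age <= 14
        and person["sex"] == "F"
        and person["hispanic"] is True
    )
    return 1 if in_group else 0


# Set the flag for everyone; no rows are dropped and analysis variables
# (here, intake) keep their original values, including missing ones.
for person in people:
    person["subpop"] = subpop_flag(person)

print([(p["id"], p["subpop"]) for p in people])
```

The full data set stays intact; only the flag decides who is analyzed.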
When controlling for sample design, never analyze
more than one wave of data in the same data set. Since each wave of data
has a different sample design, the combined data do not represent a whole
population. It's okay to put the results from output data sets
into one data set to test the differences between the means
(t-test) of different waves of data. When this was tested, the standard errors
turned out to be a bit more conservative than when done separately. The
means were not affected. Survey software uses the whole data set to control
for sample design, and a whole data set is supposed to represent a population.
When comparing waves of data, analyze each wave separately, then put the
results of all waves into a single data set. Any further analysis cannot
use sample design correction methods and should not need to, since the results
are already corrected for sample design.
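One way to compare two waves from their separately computed results is sketched below in Python. It assumes the waves are independent samples, so the standard error of the difference is the square root of the sum of the squared standard errors. The function name and the numbers are hypothetical, and this is not necessarily the exact test used at CPC.

```python
import math


def wave_difference_t(mean_a, se_a, mean_b, se_b):
    """t statistic for the difference between two wave means.

    Assumes the two waves are independent samples (an assumption of this
    sketch), so the variances of the two means add.
    """
    se_diff = math.sqrt(se_a ** 2 + se_b ** 2)
    return (mean_a - mean_b) / se_diff


# Hypothetical design-corrected results, one (mean, SE) pair per wave.
t_stat = wave_difference_t(34.0, 0.3, 33.0, 0.4)
print(t_stat)
```

The inputs are the means and standard errors already corrected for sample design within each wave, so no further design correction is applied at this step.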
What set of observations is Stata analyzing?
Using a subpop variable does not do the same thing as an -if-.
In fact that's why the subpop option was invented. The -svy- commands
use the whole dataset to help determine the standard error even
if you are only looking at a subset of it (with a subpop var).
During the time Stata is analyzing your data, Stata subsets to only those
observations where ALL the following variables are non-missing:
--strata (if using one)
--psu (if using one)
--sample weight (if using one)
--subpop (if using one)
--analysis variable(s)**
If any one of them is missing for an observation, Stata drops that
observation.
** svymean with more than one variable will not subset to obs where
all analysis variables are non-missing unless the "complete"
option is specified.
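The selection rule above can be imitated in a few lines of Python. This is only an illustration of the rule as described here, not Stata's own code. The rows and variable names (wt, kcal, fat) are hypothetical, and None stands for a missing value.

```python
def rows_used(rows, design_vars, analysis_vars):
    """Keep rows where every listed design and analysis variable is non-missing."""
    needed = list(design_vars) + list(analysis_vars)
    return [r for r in rows if all(r.get(v) is not None for v in needed)]


# Hypothetical rows; None marks a missing value.
rows = [
    {"stratum": 1, "psu": 1, "wt": 10.0, "subpop": 1, "kcal": 1800.0, "fat": 60.0},
    {"stratum": 1, "psu": 2, "wt": 12.0, "subpop": 0, "kcal": None, "fat": 70.0},
    {"stratum": 2, "psu": 1, "wt": None, "subpop": 1, "kcal": 2000.0, "fat": 80.0},
    {"stratum": 2, "psu": 2, "wt": 9.0, "subpop": 1, "kcal": 2200.0, "fat": None},
]
design = ["stratum", "psu", "wt", "subpop"]

# Each analysis variable on its own non-missing rows (no "complete" option).
kcal_rows = rows_used(rows, design, ["kcal"])
fat_rows = rows_used(rows, design, ["fat"])

# All analysis variables required together (as with the "complete" option).
complete_rows = rows_used(rows, design, ["kcal", "fat"])

print(len(kcal_rows), len(fat_rows), len(complete_rows))
```

Requiring all analysis variables together can leave fewer rows than analyzing each variable on its own.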
The sample design correction variables are:
CSFII 1994-96,1998: VARSTRAT and VARUNIT. These have been
renamed to STRATUM and PSU. The documentation refers to them as VARSTRAT
and VARUNIT, and some data sets may still carry the original names.
CSFII 1989-91: STRATUM and PSU.
NFCS 1977-78: SUPERSTR and STRATUM. NOTE: Only 1
PSU per STRATUM was sampled. The documentation uses PSU and STRATUM
interchangeably. In order to correct on more than one level, use SUPERSTR as
the STRATUM variable and the STRATUM variable as the PSU.
NFCS 1965-66: SUPERSTR and STRATUM. NOTE: Only 1
PSU per STRATUM was sampled. In order to correct on more than one level
use SUPERSTR as the STRATUM variable and the STRATUM variable as the PSU.
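Whichever pair of variables is used, it can help to confirm that every stratum contains at least two PSUs before running the survey commands. The Python sketch below does that check. The function name and the rows are hypothetical; SUPERSTR plays the role of the stratum and STRATUM the role of the PSU, as recommended above for the NFCS data.

```python
from collections import defaultdict


def strata_with_too_few_psus(rows, stratum_var, psu_var, minimum=2):
    """Return the strata that contain fewer than `minimum` distinct PSUs."""
    psus = defaultdict(set)
    for row in rows:
        psus[row[stratum_var]].add(row[psu_var])
    return sorted(s for s, ids in psus.items() if len(ids) < minimum)


# Hypothetical rows, for illustration only (not NFCS data).
rows = [
    {"SUPERSTR": 1, "STRATUM": 11},
    {"SUPERSTR": 1, "STRATUM": 12},
    {"SUPERSTR": 2, "STRATUM": 21},
    {"SUPERSTR": 2, "STRATUM": 22},
    {"SUPERSTR": 2, "STRATUM": 22},
    {"SUPERSTR": 3, "STRATUM": 31},  # only one PSU in this stratum
]

flagged = strata_with_too_few_psus(rows, "SUPERSTR", "STRATUM")
print(flagged)
```

An empty result means every stratum has enough PSUs for a variance estimate.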