# Data Characteristics

A sample survey is conducted to obtain information about the characteristics of a population. To reduce the cost and time needed to collect the data, researchers usually select a subset (a sample) from the set of all measurements of interest (the target population). The methods used to select the sample give the data certain characteristics, and these characteristics must be incorporated into your analysis to obtain valid estimates for the entire population. Your data set should contain a variable for each observation that identifies each of the characteristics described next.

### Clustering

A truly random selection of households would involve listing all the households in the country and randomly selecting the desired number of households from that list. Using this approach, each household would have an independent and equal chance of being included in the survey. While this approach is statistically ideal, the cost of first enumerating all households and then visiting the selected households, which would be scattered all over the country, makes it impractical.

A more practical approach is cluster sampling. For example, in the Demographic and Health Surveys, "enumeration areas" from the Census or similar national surveys are first selected randomly from a list of all such areas in the country (or within strata if stratification is being used). These areas are often referred to as "clusters" or as "primary sampling units" (PSUs). They may be towns or villages, or they may be census tracts in cities. Generally, each cluster contains roughly the same number of households.

The next step is to enumerate (count and label) all the households in the cluster. Then a random-selection process is used to select households within each cluster. This is the sample of households that will be visited for the survey.
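The two selection steps described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: clusters are drawn by simple random sampling (real surveys often use other selection schemes), and the names `two_stage_sample`, `frame`, and the `EA…` identifiers are hypothetical.

```python
import random

def two_stage_sample(clusters, n_clusters, n_households, seed=0):
    """Pick clusters at random, then households at random within each one.

    `clusters` maps a cluster id to the list of household ids enumerated in it.
    """
    rng = random.Random(seed)
    # Stage 1: simple random sample of clusters (PSUs), without replacement.
    chosen = rng.sample(sorted(clusters), n_clusters)
    sample = {}
    for cid in chosen:
        households = clusters[cid]
        # Stage 2: simple random sample of households inside the cluster.
        k = min(n_households, len(households))
        sample[cid] = rng.sample(households, k)
    return sample

# Hypothetical sampling frame: 20 enumeration areas with 50 households each.
frame = {
    f"EA{c:02d}": [f"EA{c:02d}-H{h:02d}" for h in range(50)]
    for c in range(20)
}
selected = two_stage_sample(frame, n_clusters=5, n_households=10)
```

Only the five selected areas need a full household listing, which is where the cost saving over a national household list comes from.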

While cluster sampling is much more practical, it also means that the households are not statistically independent. Instead, the characteristics of a given household (and its household members) are more like those of other households in the same cluster, and less like those of households in other clusters. This effect of a non-independent sampling process, called the "sample survey design effect", shows up in the standard errors of estimates (means, regression coefficients). Ignoring clustering tends to understate the standard errors, leading to a greater likelihood of rejecting the null hypothesis. In other words, it is more conservative to correct statistically for the design effect.
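To get a feel for the size of this effect, a widely used approximation (not given in the text above, so treat it as an added assumption) is `deff = 1 + (m - 1) * rho`, where `m` is the number of households sampled per cluster and `rho` is the intraclass correlation. The sketch below uses made-up values for `m`, `rho`, and the sample size.

```python
import math

def design_effect(cluster_size, icc):
    """Approximate variance inflation from clustering: 1 + (m - 1) * rho."""
    return 1.0 + (cluster_size - 1) * icc

# Hypothetical survey: 20 households per cluster, intraclass correlation 0.05.
deff = design_effect(cluster_size=20, icc=0.05)

# Standard errors grow by the square root of the design effect.
se_inflation = math.sqrt(deff)

# A clustered sample of 1000 households carries roughly the information
# of a simple random sample of 1000 / deff households.
effective_n = 1000 / deff
```

With no within-cluster similarity (`icc=0`) or a single household per cluster, the function returns 1, meaning no inflation relative to simple random sampling.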

### Stratification

The population can be divided into sections (the strata) that are internally more homogeneous. This may be done in order to over-sample smaller groups in a target population. Examples of strata are region of country, urban/rural residence, or education level. A separate sample is selected from each stratum. Because the sample is drawn separately within each stratum, the observations are not a simple random sample of the whole population, and your analysis needs to identify the stratum of each observation. Unlike clustering, stratification usually reduces sampling variability when the strata are internally homogeneous, so accounting for it typically leaves standard errors unchanged or makes them somewhat smaller.
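A small sketch shows why the stratum of each observation matters when one stratum is over-sampled. The stratum sizes, sample values, and the function name `stratified_mean` are all hypothetical; the estimator simply weights each stratum's sample mean by that stratum's share of the population.

```python
def stratified_mean(strata):
    """Combine per-stratum sample means using population shares.

    `strata` is a list of (population_size, sample_values) pairs.
    """
    total_pop = sum(pop for pop, _ in strata)
    return sum(
        (pop / total_pop) * (sum(values) / len(values))
        for pop, values in strata
    )

# Hypothetical design: the rural stratum is small but deliberately over-sampled.
urban = (8000, [4.0, 6.0])
rural = (2000, [10.0, 12.0, 14.0, 10.0])

estimate = stratified_mean([urban, rural])

# Pooling all observations ignores the design and over-represents rural values.
pooled = urban[1] + rural[1]
naive_mean = sum(pooled) / len(pooled)
```

Here the pooled mean is pulled toward the rural values because rural households make up two thirds of the sample but only one fifth of the population.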

### Sampling Weights

Each observation in the sample is chosen using a method of random selection. An important property of this method is that the probability of selection may not be equal for all members of the population. The sampling weight for each observation is computed as the inverse of its selection probability. Additional adjustments (such as for non-response) may be made to the sampling weights. An observation with a sampling weight of 1000 represents one thousand individuals from the target population, while another observation with a sampling weight of 50 represents only fifty individuals. Your analysis technique will need to use the sampling weights to estimate the characteristics of the target population from the sample data. Thus, the sampling weights are needed in computing both the population estimates (such as means and regression coefficients) and their standard errors.
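As a minimal sketch of how weights enter a point estimate, the example below converts selection probabilities into weights and compares a weighted mean with an unweighted one. The probabilities and values are invented, and the helper names are hypothetical; non-response adjustments and standard errors are left out.

```python
def sampling_weights(selection_probs):
    """Sampling weight = inverse of the selection probability."""
    return [1.0 / p for p in selection_probs]

def weighted_mean(values, weights):
    """Estimate a population mean from sample values and their weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Hypothetical sample: two observations drawn with probability 1/1000
# and two drawn with probability 1/50.
probs = [0.001, 0.001, 0.02, 0.02]
values = [10.0, 12.0, 30.0, 34.0]

weights = sampling_weights(probs)      # about 1000, 1000, 50, 50
w_mean = weighted_mean(values, weights)
u_mean = sum(values) / len(values)     # treats all observations alike
```

The weighted mean sits close to the values of the two heavily weighted observations, since each of them stands in for far more of the population.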

### Population Number of PSUs in Each Stratum

Sampling with replacement means that once a PSU is chosen, it remains eligible to be selected again. Sampling without replacement means that once a PSU is selected, it is no longer eligible for selection, and a finite population correction (fpc) needs to be made in your analysis. Data sets for samples drawn without replacement need a variable that gives the number of PSUs in the population for each stratum. Stata uses the data set to count how many PSUs were selected per stratum and computes a sampling fraction to use in the analysis.

If the proportion of PSUs selected from each stratum (the sampling fraction) is small, then your sample can be analyzed as if it were selected with replacement, and you do not need this variable. This simplifies your analysis, since you can ignore the finite population correction. According to Cochran, "In practice the fpc can be ignored whenever the sampling fraction does not exceed 5% and for many purposes even if it is as high as 10%. The effect of ignoring the correction is to overestimate the standard error of the estimate." (William G. Cochran, Sampling Techniques, 3rd Edition, 1977, John Wiley & Sons)
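The arithmetic behind this rule of thumb can be sketched as follows, assuming the usual simple-random-sampling form of the correction, in which the standard error is multiplied by the square root of one minus the sampling fraction. The PSU counts and function names are hypothetical, and this is not a description of how Stata computes it internally.

```python
import math

def sampling_fraction(n_sampled, n_population):
    """Share of the stratum's PSUs that ended up in the sample."""
    return n_sampled / n_population

def fpc_factor(n_sampled, n_population):
    """Multiplier applied to the standard error: sqrt(1 - f)."""
    return math.sqrt(1.0 - sampling_fraction(n_sampled, n_population))

# Hypothetical stratum: 10 PSUs sampled out of 400 in the population.
f = sampling_fraction(10, 400)
factor = fpc_factor(10, 400)

# Relative overstatement of the standard error if the fpc is ignored.
overstatement = 1.0 / factor - 1.0
```

At a sampling fraction of 2.5 percent the correction shrinks the standard error by a little over one percent, which is why it is usually safe to leave out when the fraction is small.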

### Questions

What characteristics of the sampling design affect estimates such as totals, means, proportions, and regression coefficients? **Answer:** Sampling weights.

What characteristics of the sampling design affect standard errors, p-values, and confidence intervals? **Answer:** Sampling weights, clustering, and stratification.