Logistic Regression Analysis
The Logit Model
Logistic regression is used to model dichotomous (0 or 1) outcomes. The technique models the log odds of the outcome as a linear function of the covariates in your model. In addition to showing how to model subpopulations, we will use both the svy commands and the robust cluster options. The following example comes from the National Longitudinal Study of Adolescent Health.
Research Question: How is being in the upper quartile of the vocabulary test score (PVT_Q4) influenced by a boy's grade in English (ENGL_GPA) and family composition (BIOMAPA), adjusting for his age (AGE_KID)?
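Predictive Model:
Ln(odds) = ln(p / (1 - p)) = b_{0} + b_{1}AGE_KID + b_{2}BIOMAPA + b_{3}ENGL_GPA
Where
p = Probability of being in upper quartile (PVT_Q4 = 1)
b_{0} = Intercept
b_{1} = Change in log odds of being in upper quartile for a one-year increase in age
b_{2} = Change in log odds of being in upper quartile for living with both biological parents
b_{3} = Change in log odds of being in upper quartile for a one-level increase in English grade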
The model-predicted log odds for the categorical subpopulations will be:
BIOMAPA    ENGL_GPA    Ln(odds)
0 = No     4 = A       b_{0} + b_{1}AGE_KID + 4b_{3}
0 = No     3 = B       b_{0} + b_{1}AGE_KID + 3b_{3}
0 = No     2 = C       b_{0} + b_{1}AGE_KID + 2b_{3}
0 = No     1 = D/F     b_{0} + b_{1}AGE_KID + b_{3}
1 = Yes    4 = A       b_{0} + b_{1}AGE_KID + b_{2} + 4b_{3}
1 = Yes    3 = B       b_{0} + b_{1}AGE_KID + b_{2} + 3b_{3}
1 = Yes    2 = C       b_{0} + b_{1}AGE_KID + b_{2} + 2b_{3}
1 = Yes    1 = D/F     b_{0} + b_{1}AGE_KID + b_{2} + b_{3}
We are assuming a model with a common slope for age of the boy, but different intercepts defined by grade in English and living with both biological parents.
The relationship between probability and odds
The odds of an outcome are related to the probability p of the outcome by the following relation:
odds = p / (1 - p), or equivalently p = odds / (1 + odds)
An odds ratio is just the ratio of the odds of the outcome evaluated at two different sets of values for your covariates. Because the odds increase strictly with the probability, p1 = p2 exactly when the odds ratio comparing group 1 to group 2 equals 1, so you can test the hypothesis that p1 = p2 by testing that this odds ratio is equal to 1. However, you cannot easily put a confidence interval on the difference between the two probabilities.
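The relation between probability, odds, and odds ratios can be checked with a few lines of arithmetic. The following is a minimal Python sketch, not part of the Stata analysis; the function names and the probabilities 0.20, 0.25, and 0.30 are illustrative assumptions.

```python
import math

def odds(p):
    """Odds corresponding to a probability p, with 0 <= p < 1."""
    return p / (1.0 - p)

def prob_from_log_odds(xb):
    """Invert the logit: the probability whose log odds equal xb."""
    return 1.0 / (1.0 + math.exp(-xb))

def odds_ratio(p1, p2):
    """Ratio of the odds in group 1 to the odds in group 2."""
    return odds(p1) / odds(p2)

# Illustrative (made-up) probabilities for two groups.
p1, p2 = 0.20, 0.25
print("odds in group 1:", round(odds(p1), 4))
print("odds in group 2:", round(odds(p2), 4))
print("odds ratio, group 1 vs group 2:", round(odds_ratio(p1, p2), 4))

# Equal probabilities give an odds ratio of 1.
print("odds ratio when p1 = p2:", round(odds_ratio(0.30, 0.30), 4))

# Going from probability to log odds and back recovers the probability.
print("round trip:", round(prob_from_log_odds(math.log(odds(p1))), 4))
```

The last two lines illustrate why a test that the odds ratio equals 1 is also a test that the two probabilities are equal.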
SVY: LOGIT
The svyset command is used to specify the design information for analysis. Use the strata keyword to specify the stratification variable (region), the pweight keyword to specify the probability weight variable (gswgt1), and specify the primary sampling unit (psuscid).
svyset psuscid [pweight=gswgt1], strata(region)
The svy: logit command states the model being tested. The first variable following svy: logit denotes the outcome (pvt_q4) of our model, and the following variables are the covariates. The subpop option specifies the subpopulation (here, boys) used to compute the parameter estimates. All 18,924 observations must remain in the data set for the variance computation, because Stata determines the design information (such as the number of primary sampling units) used in the variance formula from the full sample.
svy, subpop(male): logit pvt_q4 age_kid biomapa engl_gpa
Stata lists the number of observations with no missing values for the variables in the model (N=17,191) and sums the corresponding sample weights to estimate that these observations represent 19,955,620 adolescents in the U.S. The number of observations with complete data in the subpopulation is 8,366, representing 10,084,117 boys. Note that the number of strata (4) and the number of primary sampling units (132) have been correctly counted.
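The population sizes in the output are sums of sampling weights. The following Python sketch illustrates that idea on a toy data set; the five weights and male indicators are made up for illustration and are not Add Health values, and the function name is hypothetical.

```python
# Hypothetical toy data: (sampling weight, male indicator) for five respondents.
# These numbers are invented for illustration only.
respondents = [
    (1200.0, 1),
    (950.5, 0),
    (1430.25, 1),
    (800.0, 0),
    (1010.0, 1),
]

def estimated_population(rows, subpop_only=False):
    """Sum the sampling weights, optionally restricted to males."""
    return sum(w for w, male in rows if male == 1 or not subpop_only)

# Analogue of "Population size": every complete observation contributes its weight.
total_size = estimated_population(respondents)

# Analogue of "Subpop. size": only observations in the subpopulation contribute.
subpop_size = estimated_population(respondents, subpop_only=True)

print("estimated population size:", total_size)
print("estimated subpopulation size:", subpop_size)
```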
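Survey: Logistic regression

Number of strata =       4          Number of obs      =     17191
Number of PSUs   =     132          Population size    =  19955620
                                    Subpop. no. of obs =      8366
                                    Subpop. size       =  10084117
                                    Design df          =       128
                                    F(   3,    126)    =     49.14
                                    Prob > F           =    0.0000

------------------------------------------------------------------------------
             |             Linearized
      pvt_q4 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     age_kid |  -.0451845   .0278879    -1.62   0.108    -.1003656    .0099965
     biomapa |   .4273138   .0820139     5.21   0.000     .2650354    .5895923
    engl_gpa |   .4258579   .0423055    10.07   0.000     .3421493    .5095664
       _cons |  -1.886177   .4411884    -4.28   0.000    -2.759144   -1.013211
------------------------------------------------------------------------------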
The adjust command can be used to estimate a linear combination of the coefficients estimated for the variables in our model. If you do not specify a value for a variable when using adjust, Stata will incorrectly substitute the sample mean rather than an estimate of the population mean. This is because adjust ignores any weights used by the estimation commands. (See Stata Reference Manual, Release 9, Vol 1 AG, page 10.) To correctly compute a linear combination, it is necessary to specify a value for all variables in the model. For example, the following statement:
adjust age_kid=17 engl_gpa=3, by(biomapa) xb se ci
produces an estimate of the log odds of scoring above the 75th percentile for boys at age 17 with a grade of B in English for both categories of living with both biological parents:
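     Dependent variable: pvt_q4     Command: logit
Covariates set to value: age_kid = 17, engl_gpa = 3

----------------------------------------------------------------------
Live with |
Bio Mom & |
Dad       |
0=N/1=Y   |         xb        stdp          lb          ub
----------+-----------------------------------------------------------
        0 |   -1.37674   (.100564)   [-1.57572   -1.17776]
        1 |   -.949427   (.094665)   [-1.13674   -.762115]
----------------------------------------------------------------------
     Key:  xb         =  Linear Prediction
           stdp       =  Standard Error
           [lb , ub]  =  [95% Confidence Interval]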
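The linear predictions reported by adjust can be reproduced by hand from the coefficient estimates. The Python sketch below is an arithmetic check only, assuming the coefficient values from the svy: logit output (negative for age_kid and _cons, as their confidence intervals indicate). It also converts each log odds to a probability by inverting the logit; that conversion is an addition for illustration, not something adjust printed here.

```python
import math

# Coefficient estimates copied from the svy: logit output (log-odds scale).
coef = {
    "_cons": -1.886177,
    "age_kid": -0.0451845,
    "biomapa": 0.4273138,
    "engl_gpa": 0.4258579,
}

def linear_prediction(age_kid, biomapa, engl_gpa):
    """Log odds b0 + b1*AGE_KID + b2*BIOMAPA + b3*ENGL_GPA."""
    return (coef["_cons"]
            + coef["age_kid"] * age_kid
            + coef["biomapa"] * biomapa
            + coef["engl_gpa"] * engl_gpa)

def prob_from_log_odds(xb):
    """Invert the logit to get a probability from log odds."""
    return 1.0 / (1.0 + math.exp(-xb))

# Boys aged 17 with a B in English, for both values of biomapa.
xb_no = linear_prediction(age_kid=17, biomapa=0, engl_gpa=3)
xb_yes = linear_prediction(age_kid=17, biomapa=1, engl_gpa=3)

for label, xb in (("biomapa = 0", xb_no), ("biomapa = 1", xb_yes)):
    print(label, "log odds:", round(xb, 5),
          "probability:", round(prob_from_log_odds(xb), 3))
```

The two log odds agree with the xb column of the adjust output to the printed precision.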
You can also include the exp option at the end of the adjust command to print the exponentiated linear combination of the coefficients, that is, the odds rather than the log odds. The pr option on adjust is not available after using the svy: logit command.
The lincom command can also be used to produce linear combinations of the coefficients:
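lincom 17*age_kid + 1*biomapa + 3*engl_gpa + _cons

 ( 1)  17.0 age_kid + biomapa + 3.0 engl_gpa + _cons = 0.0

------------------------------------------------------------------------------
      pvt_q4 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |  -.9494267   .0946653   -10.03   0.000    -1.136738   -.7621154
------------------------------------------------------------------------------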
The results from lincom match those from adjust. The advantage of using lincom is that a hypothesis test can also be performed. For example, suppose you want to compute the odds ratio comparing 17-year-old boys not living with both biological parents to 12-year-old boys living with both biological parents. Assume both boys make the same grade in English. We would want to estimate the difference in log odds for these two groups:
(b_{0} + 17*b_{1} + GRADE*b_{3}) - (b_{0} + 12*b_{1} + b_{2} + GRADE*b_{3}) = 5*b_{1} - b_{2}
Since b_{1} is the coefficient for AGE_KID and b_{2} is the coefficient for BIOMAPA, the lincom command would be:
lincom 5*age_kid - 1*biomapa
This produces the desired difference in log odds:
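 ( 1)  5.0 age_kid - biomapa = 0.0

------------------------------------------------------------------------------
      pvt_q4 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |  -.6532364   .1641137    -3.98   0.000    -.9779635   -.3285094
------------------------------------------------------------------------------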
The or option can be added to the lincom command to get the odds ratio (e^{5*b_{1} - b_{2}}):
lincom 5*age_kid - 1*biomapa, or
The following table will be printed:
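 ( 1)  5.0 age_kid - biomapa = 0.0

------------------------------------------------------------------------------
      pvt_q4 | Odds Ratio   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |   .5203589    .085398    -3.98   0.000     .3760762    .7199962
------------------------------------------------------------------------------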
Thus, assuming equal grades in English, the odds of scoring in the upper quartile for a 17-year-old boy not living with both biological parents are only about half (0.52 times) those of a 12-year-old boy who lives with both biological parents.
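The odds ratio and its confidence limits are simply the exponentiated log-odds difference and its limits. The Python sketch below checks that arithmetic, assuming the coefficient values from the svy: logit output (age_kid = -0.0451845, biomapa = 0.4273138) and the log-odds confidence limits reported by lincom (-0.9779635 and -0.3285094); the variable names are illustrative.

```python
import math

# Coefficient estimates from the svy: logit output (log-odds scale).
b_age_kid = -0.0451845
b_biomapa = 0.4273138

# Difference in log odds: 17-year-old without both biological parents
# versus 12-year-old with both, at the same English grade.
log_odds_diff = 5 * b_age_kid - b_biomapa

# Exponentiating the difference gives the odds ratio.
odds_ratio = math.exp(log_odds_diff)

# Confidence limits on the log-odds scale, as reported by lincom,
# are exponentiated to give limits for the odds ratio.
ci_log_odds = (-0.9779635, -0.3285094)
ci_odds_ratio = tuple(math.exp(limit) for limit in ci_log_odds)

print("difference in log odds:", round(log_odds_diff, 7))
print("odds ratio:", round(odds_ratio, 4))
print("95% interval for the odds ratio:",
      tuple(round(limit, 4) for limit in ci_odds_ratio))
```

Because the interval is exponentiated rather than built as estimate plus or minus a multiple of the standard error, it is not symmetric around the odds ratio.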
The test command can be used to test joint hypotheses about variables. For example, testing that the coefficients for age_kid and biomapa are both equal to zero can be done with the following Stata command:
test age_kid biomapa
which produces the following output:
Adjusted Wald test

 ( 1)  age_kid = 0.0
 ( 2)  biomapa = 0.0

       F(  2,   127) =   14.51
            Prob > F =  0.0000
logit with pweight and robust cluster
Note that we can subset the data (if male == 1) when using the robust cluster( ) options in Stata and still have the variance computed with an acceptable technique. The primary sampling unit (psuscid) is used as the argument to the cluster option and the sample weights (gswgt1) are specified by [pweight=gswgt1].
logit pvt_q4 age_kid biomapa engl_gpa if male == 1 [pweight=gswgt1], robust cluster(psuscid)
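The coefficient estimates in the following output are identical to those from svy: logit. The standard errors differ slightly (for example, .0277835 versus .0278879 for age_kid) and the tests are reported as z rather than t statistics, but the interpretation is the same.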
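Logistic regression                               Number of obs   =       8366
                                                  Wald chi2(3)    =     142.44
                                                  Prob > chi2     =     0.0000
Log pseudolikelihood = -4429.7883                 Pseudo R2       =     0.0384

                              (Std. Err. adjusted for 132 clusters in psuscid)
------------------------------------------------------------------------------
             |               Robust
      pvt_q4 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     age_kid |  -.0451845   .0277835   -1.626   0.104    -.0996393    .0092702
     biomapa |   .4273138   .0817512    5.227   0.000     .2670844    .5875433
    engl_gpa |   .4258579   .0430438    9.894   0.000     .3414936    .5102222
       _cons |  -1.886177   .4414855   -4.272   0.000    -2.751473   -1.020881
------------------------------------------------------------------------------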
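The z statistic and confidence interval in this output follow directly from each coefficient and its robust standard error. The Python sketch below recomputes them for age_kid, assuming a coefficient of -0.0451845 and a robust standard error of 0.0277835 as reported, and using the standard normal critical value from Python's statistics module. It is an arithmetic check, not a re-estimation of the model.

```python
from statistics import NormalDist

# Values for age_kid from the logit output with robust cluster().
coef_age_kid = -0.0451845
robust_se = 0.0277835

# z statistic: coefficient divided by its standard error.
z = coef_age_kid / robust_se

# Two-sided p-value from the standard normal distribution.
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

# 95% interval: coefficient plus or minus the normal critical value times the SE.
critical = NormalDist().inv_cdf(0.975)
lower = coef_age_kid - critical * robust_se
upper = coef_age_kid + critical * robust_se

print("z:", round(z, 3))
print("p-value:", round(p_value, 3))
print("95% interval:", round(lower, 7), round(upper, 7))
```

The svy: logit results use a t distribution with the design degrees of freedom (128) instead of the normal distribution, which is one reason the two sets of intervals are not exactly the same.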
Questions or comments? If you are affiliated with the Carolina Population Center, send them to Phil Bardsley.