Logistic Regression Analysis
The Logit Model
Logistic regression is used to model dichotomous (0 or 1) outcomes. This technique models the log odds of an outcome as a linear function of the covariates in your model. In addition to covering how to model sub-populations, we will use both the svy commands and the robust cluster commands. The following example comes from The National Longitudinal Study of Adolescent Health.
Research Question: How is being in the upper quartile of the Vocabulary test score (PVT_Q4) influenced by a boy's grade in English (ENGL_GPA) and family composition (BIOMAPA), after adjusting for his age (AGE_KID)?
Predictive Model:
Ln(odds) = b0 + b1*AGE_KID + b2*BIOMAPA + b3*ENGL_GPA
Where
b0 = Intercept
b1 = Change in log odds of being in upper quartile for one year increment in age
b2 = Change in log odds of being in upper quartile for living with Biological Parents
b3 = Change in log odds of being in upper quartile for increase in one grade level
The model predicted log-odds for the categorical subpopulations will be:
| BIOMAPA | ENGL_GPA | Ln(odds) |
|---|---|---|
| 0 = No | 4 = A | b0 + b1AGE_KID + 4b3 |
| 0 = No | 3 = B | b0 + b1AGE_KID + 3b3 |
| 0 = No | 2 = C | b0 + b1AGE_KID + 2b3 |
| 0 = No | 1 = D/F | b0 + b1AGE_KID + b3 |
| 1 = Yes | 4 = A | b0 + b1AGE_KID + b2 + 4b3 |
| 1 = Yes | 3 = B | b0 + b1AGE_KID + b2 + 3b3 |
| 1 = Yes | 2 = C | b0 + b1AGE_KID + b2 + 2b3 |
| 1 = Yes | 1 = D/F | b0 + b1AGE_KID + b2 + b3 |
We are assuming a model with a common slope for age of the boy, but different intercepts defined by grade in English and living with both biological parents.
The relationship between probability and odds
The odds of an outcome is related to the probability of the outcome by the following relation:
odds = p / (1 - p), or equivalently p = odds / (1 + odds)
An odds ratio is just the ratio of the odds of the outcome evaluated at two different sets of values for your covariates. Because the odds are a strictly increasing function of the probability, testing the hypothesis that p1 = p2 is equivalent to testing that the odds ratio comparing group 1 to group 2 is equal to 1. However, you cannot easily put a confidence interval on the difference between the two probabilities.
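A quick numeric check of the odds and odds-ratio definitions may help. The sketch below converts two probabilities to odds, forms their odds ratio, and converts an odds value back to a probability. The probabilities 0.20 and 0.28 and all scalar names are hypothetical, chosen only for illustration.

```stata
* Hypothetical probabilities for two groups (illustrative values only)
scalar p1 = 0.20
scalar p2 = 0.28

* odds = p / (1 - p)
scalar odds1 = p1 / (1 - p1)
scalar odds2 = p2 / (1 - p2)

* The odds ratio equals 1 exactly when p1 = p2
scalar or21 = odds2 / odds1
display "odds1 = " odds1 "  odds2 = " odds2 "  odds ratio = " or21

* Going back: p = odds / (1 + odds), which is the same as invlogit(ln(odds))
display "p1 recovered    = " odds1 / (1 + odds1)
display "p1 via invlogit = " invlogit(ln(odds1))
```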
SVY: LOGIT
The svyset command is used to specify the design information for analysis. Use the strata keyword to specify the stratification variable (region), the pweight keyword to specify the probability weight variable (gswgt1), and specify the primary sampling unit (psuscid).
svyset psuscid [pweight=gswgt1], strata(region)
The svy: logit command states the model being tested. The first variable following svy: logit denotes the outcome (pvt_q4) of our model, and the following variables are the covariates. The subpop option specifies the sub-population used to compute the parameter estimates. All 18,924 observations are needed for the variance computation because Stata determines the design information (number of primary sampling units) used in the variance formula from the full sample.
svy, subpop(male): logit pvt_q4 age_kid biomapa engl_gpa
Stata lists the number of observations with no missing values
for the variables in the model (N=17,191) and has summed the
corresponding sample weights to estimate 19,955,620 adolescents in the
U.S. are represented by these observations. The number of observations
with complete data in the sub-population is 8,366 representing
10,084,117 boys. Note that the number of strata (4) and primary
sampling units (132) has been correctly counted.
Survey: Logistic regression
Number of strata = 4 Number of obs = 17191
Number of PSUs = 132 Population size = 19955620
Subpop. no. of obs = 8366
Subpop. size = 10084117
Design df = 128
F( 3, 126) = 49.14
Prob > F = 0.0000
------------------------------------------------------------------------------
| Linearized
pvt_q4 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age_kid | -.0451845 .0278879 -1.62 0.108 -.1003656 .0099965
biomapa | .4273138 .0820139 5.21 0.000 .2650354 .5895923
engl_gpa | .4258579 .0423055 10.07 0.000 .3421493 .5095664
_cons | -1.886177 .4411884 -4.28 0.000 -2.759144 -1.013211
------------------------------------------------------------------------------
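The degrees of freedom and confidence limits in this output can be checked by hand. The sketch below assumes the usual survey convention that the design degrees of freedom equal the number of PSUs minus the number of strata, and that each interval is the coefficient plus or minus a t critical value times the linearized standard error. The numbers are copied from the table above; the scalar names are illustrative.

```stata
* Design df = number of PSUs - number of strata
scalar ddf = 132 - 4
display "design df = " ddf

* Two-sided 95% critical value from the t distribution with ddf degrees of freedom
scalar tcrit = invttail(ddf, 0.025)

* Coefficient and linearized standard error for biomapa, copied from the output
scalar b_bio  = .4273138
scalar se_bio = .0820139

* 95% confidence limits; these should agree with the printed interval up to rounding
display "lower = " b_bio - tcrit*se_bio
display "upper = " b_bio + tcrit*se_bio

* Odds ratio for living with both biological parents
display "odds ratio = " exp(b_bio)
```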
The adjust command can be used to estimate a linear
combination of the coefficients estimated for the variables in our
model. If you do not specify a value for a variable when using adjust,
Stata will incorrectly substitute the sample mean rather than an
estimate of the population mean. This is because adjust ignores any weights used by the estimation commands.
(See Stata Reference Manual, Release 9, Vol 1 A-G,
page 10.) To correctly compute a linear combination, it is necessary to
specify a value for all variables in the model. For example, the
following statement:
adjust age_kid=17 engl_gpa=3, by(biomapa) xb se ci
produces an estimate of the log odds of scoring above the 75th
percentile for boys at age 17 with a grade of B in English for both
categories of living with both biological parents:
-----------------------------------------------------------------------------
Dependent variable: pvt_q4 Command: logit
Covariates set to value: age_kid = 17, engl_gpa = 3
-----------------------------------------------------------------------------
----------------------------------------------------------
Live with |
Bio Mom & |
Dad |
0=N/1=Y | xb stdp lb ub
----------+-----------------------------------------------
0 | -1.37674 (.100564) [-1.57572 -1.17776]
1 | -.949427 (.094665) [-1.13674 -.762115]
----------------------------------------------------------
Key: xb = Linear Prediction
stdp = Standard Error
[lb , ub] = [95% Confidence Interval]
You can also include the exp option at the end of the adjust command to print the exponentiated linear combination of the coefficients. The pr option on adjust is not available after using the svy: logit command.
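One way to express these predictions as probabilities without the pr option is to apply the inverse logit transformation, p = exp(xb) / (1 + exp(xb)), to the linear predictions and to their confidence limits. This is a sketch using the values printed by adjust above, with illustrative scalar names; invlogit() is Stata's built-in inverse logit function.

```stata
* Linear predictions (log odds) from the adjust output: age 17, English grade B
* xb0: not living with both biological parents
* xb1: living with both biological parents
scalar xb0 = -1.37674
scalar xb1 = -.949427

* Convert log odds to predicted probabilities
display "P(upper quartile | biomapa = 0) = " invlogit(xb0)
display "P(upper quartile | biomapa = 1) = " invlogit(xb1)

* Transform the confidence limits the same way (the transformation is monotone)
display "biomapa = 0: [" invlogit(-1.57572) ", " invlogit(-1.17776) "]"
display "biomapa = 1: [" invlogit(-1.13674) ", " invlogit(-.762115) "]"
```

With these values the predicted probabilities come out at roughly 0.20 and 0.28.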
The lincom command can also be used to produce linear combinations of the coefficients:
lincom 17*age_kid + 1*biomapa + 3*engl_gpa + _cons
( 1) 17.0 age_kid + biomapa + 3.0 engl_gpa + _cons = 0.0
------------------------------------------------------------------------------
pvt_q4 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
(1) | -.9494267 .0946653 -10.03 0.000 -1.136738 -.7621154
------------------------------------------------------------------------------
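The point estimate can be verified by hand from the coefficient table, and the t statistic is the estimate divided by its standard error. A minimal sketch with the printed values typed in as scalars (the scalar names are illustrative):

```stata
* Coefficients copied from the svy: logit output
scalar b0 = -1.886177
scalar b1 = -.0451845
scalar b2 = .4273138
scalar b3 = .4258579

* Log odds for a 17-year-old boy living with both biological parents, grade B
scalar xb = b0 + 17*b1 + 1*b2 + 3*b3
display "log odds = " xb

* t statistic = estimate / standard error reported by lincom
display "t = " xb / .0946653
```

Small differences in the last digits are expected because the typed coefficients are rounded.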
The results from lincom match those from adjust. The advantage of
using lincom is that a hypothesis test can also be performed. For
example, suppose you want to compute the odds ratio comparing 17
year-old boys not living with both biological parents to 12 year-old
boys living with both biological parents. Assume both boys make the
same grade in English. We would want to estimate the difference in log
odds for these two:
(b0 + 17*b1 + GRADE*b3) - (b0 + 12*b1 + b2 + GRADE*b3) = 5*b1 - b2
Since b1 is the coefficient for AGE_KID and b2 is the coefficient for BIOMAPA, the lincom command would be:
lincom 5*age_kid - 1*biomapa
This produces the desired difference in log odds:
( 1) 5.0 age_kid - biomapa = 0.0
------------------------------------------------------------------------------
pvt_q4 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
(1) | -.6532364 .1641137 -3.98 0.000 -.9779635 -.3285094
------------------------------------------------------------------------------
The or option can be added to the lincom command to get the odds ratio, exp(5*b1 - b2):
lincom 5*age_kid - 1*biomapa , or
The following table will be printed:
( 1) 5.0 age_kid - biomapa = 0.0
------------------------------------------------------------------------------
pvt_q4 | Odds Ratio Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
(1) | .5203589 .085398 -3.98 0.000 .3760762 .7199962
------------------------------------------------------------------------------
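The odds-ratio table is a transformation of the log-odds table: the estimate and confidence limits are exponentiated, and the standard error follows from the delta method (the odds ratio times the standard error of the log odds). The following sketch reproduces the printed values from the lincom results for 5*b1 - b2; the scalar names are illustrative.

```stata
* Log-odds difference and its standard error, copied from the lincom output
scalar d    = -.6532364
scalar se_d = .1641137

* Odds ratio and its confidence limits: exponentiate each quantity
display "odds ratio = " exp(d)
display "95% CI     = [" exp(-.9779635) ", " exp(-.3285094) "]"

* Delta-method standard error of the odds ratio
display "std. err.  = " exp(d)*se_d
```

The t statistic and p-value are unchanged because the test is still carried out on the log-odds scale.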
Thus, assuming equal grades in English, the odds of scoring in the upper quartile for a 17 year-old boy not living with both biological parents are only about half (odds ratio = 0.52) those of a 12 year-old boy who lives with both biological parents.
The test command
can be used to test joint hypotheses about variables. For example,
testing that the coefficients for age_kid and biomapa are both equal to
zero can be done with the following Stata command:
test age_kid biomapa
which produces the following output:
Adjusted Wald test
( 1) age_kid = 0.0
( 2) biomapa = 0.0
F( 2, 127) = 14.51
Prob > F = 0.0000
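The denominator degrees of freedom come from the adjustment applied to Wald tests for survey data. Assuming the standard adjustment, with design degrees of freedom d and k tested coefficients, the statistic is referred to an F distribution with k and d - k + 1 degrees of freedom. The sketch below recomputes those degrees of freedom and the p-value from the printed F statistic; Ftail() is Stata's upper-tail F probability function, and the scalar names are illustrative.

```stata
* Design df (PSUs - strata) and number of coefficients tested
scalar d = 132 - 4
scalar k = 2

* Denominator df of the adjusted Wald test: d - k + 1
scalar df2 = d - k + 1
display "F(" k ", " df2 ")"

* Upper-tail p-value for the printed statistic F = 14.51
display "Prob > F = " Ftail(k, df2, 14.51)
```

The same rule gives the F(3, 126) reported with the model itself: 128 - 3 + 1 = 126.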
logit with pweight and robust cluster
Note that we can subset the data (if male == 1) when using the robust cluster( ) options in Stata and still have the variance computed with an acceptable technique.
The primary sampling unit (psuscid) is used as the argument to the cluster option and the sample weights (gswgt1) are specified by [pweight=gswgt1].
logit pvt_q4 age_kid biomapa engl_gpa if male == 1 [pweight=gswgt1], robust cluster(psuscid)
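The coefficient estimates in the following output are identical to those from svy: logit, and the standard errors differ only slightly, so the interpretation is the same. Note that this output reports z statistics rather than t statistics.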
Logistic regression Number of obs = 8366
Wald chi2(3) = 142.44
Prob > chi2 = 0.0000
Log pseudolikelihood = -4429.7883 Pseudo R2 = 0.0384
(Std. Err. adjusted for 132 clusters in psuscid)
------------------------------------------------------------------------------
| Robust
pvt_q4 | Coef. Std. Err. z P>|z| [95% Conf. Interval]
---------+--------------------------------------------------------------------
age_kid | -.0451845 .0277835 -1.626 0.104 -.0996393 .0092702
biomapa | .4273138 .0817512 5.227 0.000 .2670844 .5875433
engl_gpa | .4258579 .0430438 9.894 0.000 .3414936 .5102222
_cons | -1.886177 .4414855 -4.272 0.000 -2.751473 -1.020881
------------------------------------------------------------------------------
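Because this table reports z statistics, its confidence limits use the standard normal critical value rather than a t value based on the design degrees of freedom. A sketch that reproduces the interval for biomapa and compares the robust standard error with the linearized one from the svy output, using values copied from the two tables (scalar names are illustrative):

```stata
* biomapa coefficient and robust standard error from the output above
scalar b_bio  = .4273138
scalar se_rob = .0817512

* 95% limits based on the standard normal distribution
scalar zcrit = invnormal(0.975)
display "lower = " b_bio - zcrit*se_rob
display "upper = " b_bio + zcrit*se_rob

* Ratio of the robust standard error to the svy linearized one (.0820139)
display "SE ratio = " se_rob / .0820139
```

The ratio is very close to 1, which is why the two approaches lead to the same conclusions here.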


