Logistic Regression Analysis

The Logit Model

Logistic regression is used to model dichotomous (0 or 1) outcomes. The technique models the log odds of the outcome as a linear function of the covariates in your model. In addition to covering how to model sub-populations, we will use both the svy commands and the robust cluster options. The following example comes from the National Longitudinal Study of Adolescent Health.

Research Question: How is being in the upper quartile of the Vocabulary test score (PVT_Q4) influenced by a boy's grade in English (ENGL_GPA) and Family composition (BIOMAPA)?

Predictive Model:

$$\log\left(\frac{\Pr(\mathrm{PVT\_Q4} = 1)}{1 - \Pr(\mathrm{PVT\_Q4} = 1)}\right) = b_0 + b_1\,\mathrm{AGE\_KID} + b_2\,\mathrm{BIOMAPA} + b_3\,\mathrm{ENGL\_GPA}$$

Where

b0 = Intercept

b1 = Change in log odds of being in upper quartile for one year increment in age

b2 = Change in log odds of being in upper quartile for living with Biological Parents

b3 = Change in log odds of being in upper quartile for increase in one grade level

The model predicted log-odds for the categorical subpopulations will be:

BIOMAPA    ENGL_GPA    Ln(odds)
-------    --------    ----------------------------
0 = No     4 = A       b0 + b1 AGE_KID + 4 b3
0 = No     3 = B       b0 + b1 AGE_KID + 3 b3
0 = No     2 = C       b0 + b1 AGE_KID + 2 b3
0 = No     1 = D/F     b0 + b1 AGE_KID + b3
1 = Yes    4 = A       b0 + b1 AGE_KID + b2 + 4 b3
1 = Yes    3 = B       b0 + b1 AGE_KID + b2 + 3 b3
1 = Yes    2 = C       b0 + b1 AGE_KID + b2 + 2 b3
1 = Yes    1 = D/F     b0 + b1 AGE_KID + b2 + b3

We are assuming a model with a common slope for age of the boy, but different intercepts defined by grade in English and living with both biological parents.

The relationship between probability and odds

The odds of an outcome is related to the probability of the outcome by the following relation:

$$\text{odds} = \frac{\text{probability}}{1 - \text{probability}}$$

An odds ratio is just the ratio of the odds of the outcome evaluated at two different sets of values for your covariates. Because the odds increase whenever the probability increases, two probabilities are equal exactly when their odds are equal. To test the hypothesis that p1 = p2, you can therefore test the hypothesis that the odds ratio comparing group 1 to group 2 is equal to 1. However, you cannot easily put a confidence interval on the difference between the two probabilities.
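The relation between probability, odds, and odds ratios can be checked numerically. The following is a minimal Python sketch, not part of the Stata session; the function names and the two probabilities (0.25 and 0.40) are made up for illustration.

```python
import math

def odds(p):
    """Odds corresponding to a probability p, where 0 <= p < 1."""
    return p / (1.0 - p)

def prob(o):
    """Probability corresponding to odds o (the inverse of odds())."""
    return o / (1.0 + o)

def odds_ratio(p1, p2):
    """Odds of the outcome in group 1 divided by the odds in group 2."""
    return odds(p1) / odds(p2)

# Hypothetical probabilities, chosen only to illustrate the arithmetic.
p1, p2 = 0.25, 0.40

print("odds for p1:", odds(p1))
print("log odds for p1:", math.log(odds(p1)))
print("odds ratio, group 1 vs group 2:", odds_ratio(p1, p2))
# Equal probabilities always give an odds ratio of 1.
print("odds ratio when p1 = p2:", odds_ratio(p1, p1))
```

The last line illustrates why testing that an odds ratio equals 1 is equivalent to testing p1 = p2.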

SVY: LOGIT

The svyset command is used to specify the design information for analysis. Use the strata keyword to specify the stratification variable (region), the pweight keyword to specify the probability weight variable (gswgt1), and specify the primary sampling unit (psuscid).

     svyset psuscid [pweight=gswgt1], strata(region)

The svy: logit command states the model being tested. The first variable following svy: logit denotes the outcome (pvt_q4) of our model, and the following variables are the covariates. The subpop option specifies the sub-population used to compute the parameter estimates. All 18,924 observations are needed for the variance computation because Stata determines the design information (number of primary sampling units) used in the variance formula from the full sample.

     svy, subpop(male): logit pvt_q4 age_kid biomapa engl_gpa

Stata lists the number of observations with no missing values for the variables in the model (N=17,191) and has summed the corresponding sample weights to estimate that 19,955,620 adolescents in the U.S. are represented by these observations. The number of observations with complete data in the sub-population is 8,366, representing 10,084,117 boys. Note that the number of strata (4) and primary sampling units (132) have been correctly counted.

Survey: Logistic regression

Number of strata   =         4                  Number of obs      =     17191
Number of PSUs     =       132                  Population size    =  19955620
                                                Subpop. no. of obs =      8366
                                                Subpop. size       =  10084117
                                                Design df          =       128
                                                F(   3,    126)    =     49.14
                                                Prob > F           =    0.0000

------------------------------------------------------------------------------
             |             Linearized
      pvt_q4 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     age_kid |  -.0451845   .0278879    -1.62   0.108    -.1003656    .0099965
     biomapa |   .4273138   .0820139     5.21   0.000     .2650354    .5895923
    engl_gpa |   .4258579   .0423055    10.07   0.000     .3421493    .5095664
       _cons |  -1.886177   .4411884    -4.28   0.000    -2.759144   -1.013211
------------------------------------------------------------------------------
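Each coefficient is a change in log odds, so exponentiating it gives the odds ratio for a one-unit increase in that covariate. The sketch below is a Python check outside Stata, with the coefficients and standard errors copied by hand from the output above. It also recomputes the t statistics and the design degrees of freedom (PSUs minus strata). The variable names are illustrative only.

```python
import math

# Coefficients and linearized standard errors copied from the svy: logit output.
coef = {"age_kid": -0.0451845, "biomapa": 0.4273138, "engl_gpa": 0.4258579}
se = {"age_kid": 0.0278879, "biomapa": 0.0820139, "engl_gpa": 0.0423055}

# exp(b) is the odds ratio associated with a one-unit increase in the covariate.
odds_ratios = {name: math.exp(b) for name, b in coef.items()}

# Each t statistic is the coefficient divided by its standard error.
t_stats = {name: coef[name] / se[name] for name in coef}

# Design degrees of freedom: number of PSUs minus number of strata.
design_df = 132 - 4

for name in coef:
    print(f"{name:>8}  odds ratio = {odds_ratios[name]:.3f}  t = {t_stats[name]:.2f}")
print("design df =", design_df)
```

For example, the odds ratio for biomapa is about 1.53, so the odds of scoring in the upper quartile are roughly 53% higher for boys living with both biological parents, holding age and English grade fixed.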

The adjust command can be used to estimate a linear combination of the coefficients estimated for the variables in our model. If you do not specify a value for a variable when using adjust, Stata will incorrectly substitute the sample mean rather than an estimate of the population mean. This is because adjust ignores any weights used by the estimation commands. (See Stata Reference Manual, Release 9, Vol 1 A-G, page 10.) To correctly compute a linear combination, it is necessary to specify a value for all variables in the model. For example, the following statement:

     adjust age_kid=17 engl_gpa=3, by(biomapa) xb se ci

produces an estimate of the log odds of scoring above the 75th percentile for boys at age 17 with a grade of B in English for both categories of living with both biological parents:

-----------------------------------------------------------------------------
     Dependent variable: pvt_q4      Command: logit
Covariates set to value: age_kid = 17, engl_gpa = 3
-----------------------------------------------------------------------------

----------------------------------------------------------
Live with |
Bio Mom & |
Dad       |
0=N/1=Y   |         xb        stdp          lb          ub
----------+-----------------------------------------------
        0 |   -1.37674   (.100564)   [-1.57572   -1.17776]
        1 |   -.949427   (.094665)   [-1.13674   -.762115]
----------------------------------------------------------
     Key:  xb         =  Linear Prediction
           stdp       =  Standard Error
           [lb , ub]  =  [95% Confidence Interval]
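Log odds are hard to read directly, so it can help to convert the linear predictions to probabilities with the inverse logit, p = 1 / (1 + exp(-xb)). The following Python sketch does this for the two rows of the adjust output, with the values typed in by hand. Applying the same transformation to the interval endpoints is shown as an assumption-laden convenience, not as something Stata reports here.

```python
import math

def inv_logit(xb):
    """Convert a log odds value (linear prediction) to a probability."""
    return 1.0 / (1.0 + math.exp(-xb))

# xb and 95% bounds copied from the adjust output above, keyed by biomapa.
adjust_rows = {
    0: {"xb": -1.37674, "lb": -1.57572, "ub": -1.17776},
    1: {"xb": -0.949427, "lb": -1.13674, "ub": -0.762115},
}

probabilities = {}
for biomapa, row in adjust_rows.items():
    # The inverse logit is increasing, so the bounds keep their order.
    probabilities[biomapa] = {key: inv_logit(value) for key, value in row.items()}
    p = probabilities[biomapa]
    print(f"biomapa={biomapa}: Pr = {p['xb']:.3f} "
          f"(95% CI {p['lb']:.3f} to {p['ub']:.3f})")

# The two rows differ only in biomapa, so the gap in xb is the biomapa coefficient.
gap = adjust_rows[1]["xb"] - adjust_rows[0]["xb"]
print(f"difference in log odds = {gap:.4f}")
```

The difference between the two linear predictions reproduces the biomapa coefficient (.4273138) up to rounding in the printed output.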

You can also include the exp option at the end of the adjust command to print the exponentiated linear combination of the coefficients. The pr option on adjust is not available after using the svy: logit command.

The lincom command can also be used to produce linear combinations of the coefficients:

     lincom 17*age_kid + 1*biomapa + 3*engl_gpa + _cons

 ( 1)  17.0 age_kid + biomapa + 3.0 engl_gpa + _cons = 0.0

------------------------------------------------------------------------------
      pvt_q4 |      Coef.    Std. Err.      t    P>|t|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |  -.9494267    .0946653   -10.03   0.000   -1.136738   -.7621154
------------------------------------------------------------------------------
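The point estimate in this table is simply the weighted sum of the coefficients. As a quick check outside Stata, the Python sketch below recomputes it from the coefficients printed by svy: logit; the dictionary names are illustrative. The standard error cannot be reproduced this way, because it also depends on the covariances between the coefficients, which are not shown in the output.

```python
# Coefficients copied from the svy: logit output.
b = {
    "_cons": -1.886177,
    "age_kid": -0.0451845,
    "biomapa": 0.4273138,
    "engl_gpa": 0.4258579,
}

# Covariate values for a 17 year-old boy living with both biological
# parents who has a B in English; _cons is always multiplied by 1.
x = {"_cons": 1, "age_kid": 17, "biomapa": 1, "engl_gpa": 3}

# Linear combination: b0 + 17*b1 + 1*b2 + 3*b3
xb = sum(b[name] * x[name] for name in b)
print(f"log odds = {xb:.6f}")
```

The result agrees with the lincom estimate of -.9494267 to about six decimal places; the small remainder comes from rounding in the printed coefficients.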

The results from lincom match those from adjust. The advantage of using lincom is that a hypothesis test can also be performed. For example, suppose you want to compute the odds ratio comparing 17 year-old boys not living with both biological parents to 12 year-old boys living with both biological parents. Assume both boys make the same grade in English. We would want to estimate the difference in log odds for these two:

(b0 + 17*b1 + GRADE*b3) - (b0 + 12*b1 + b2 + GRADE*b3) = 5*b1 - b2

Since b1 is the coefficient for AGE_KID and b2 is the coefficient for BIOMAPA, the lincom command would be:

     lincom 5*age_kid - 1*biomapa 

This produces the desired difference in log odds:

( 1)  5.0 age_kid - biomapa = 0.0

------------------------------------------------------------------------------
      pvt_q4 |      Coef.    Std. Err.      t    P>|t|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |  -.6532364    .1641137    -3.98   0.000   -.9779635   -.3285094
------------------------------------------------------------------------------

The or option can be added to the lincom command to get the odds ratio, $e^{5 b_1 - b_2}$:

     lincom 5*age_kid - 1*biomapa , or

The following table will be printed:

 ( 1)  5.0 age_kid - biomapa = 0.0

------------------------------------------------------------------------------
      pvt_q4 | Odds Ratio    Std. Err.      t    P>|t|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |   .5203589     .085398    -3.98   0.000    .3760762    .7199962
------------------------------------------------------------------------------
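The odds ratio table is the log odds table passed through the exponential function. The Python sketch below, run outside Stata with the numbers copied by hand, exponentiates the estimate and the interval endpoints. It also multiplies the odds ratio by the log odds standard error, which is the usual delta-method approximation and matches the standard error printed above; treat that step as an assumption about how the printed value was obtained.

```python
import math

# Estimate, standard error, and 95% bounds from "lincom 5*age_kid - 1*biomapa".
diff = -0.6532364
se_diff = 0.1641137
lb, ub = -0.9779635, -0.3285094

# Exponentiate the estimate and the interval endpoints.
or_est = math.exp(diff)
or_ci = (math.exp(lb), math.exp(ub))

# Delta-method standard error on the odds ratio scale: OR * SE(log OR).
or_se = or_est * se_diff

print(f"odds ratio = {or_est:.7f}")
print(f"std. err.  = {or_se:.6f}")
print(f"95% CI     = {or_ci[0]:.7f} to {or_ci[1]:.7f}")
```

Note that the interval is not symmetric around the odds ratio, because it is symmetric on the log odds scale instead.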

Thus, assuming equal grades in English, the odds of scoring in the upper quartile for a 17 year-old boy not living with both biological parents are only about half (0.52 times) those of a 12 year-old boy who lives with both biological parents.

The test command can be used to test joint hypotheses about variables. For example, testing that the coefficients for age_kid and biomapa are both equal to zero can be done with the following Stata command:

     test age_kid biomapa

which produces the following output:

Adjusted Wald test

 ( 1)  age_kid = 0.0
 ( 2)  biomapa = 0.0

       F(  2,   127) =   14.51
            Prob > F =    0.0000

logit with pweight and robust cluster

Note that we can subset the data (if male == 1) when using the robust cluster( ) options in Stata and still have the variance computed with an acceptable technique. The primary sampling unit (psuscid) is used as the argument to the cluster option and the sample weights (gswgt1) are specified by [pweight=gswgt1].

     logit pvt_q4 age_kid biomapa engl_gpa  if male == 1 [pweight=gswgt1], robust cluster(psuscid)

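The coefficient estimates in the following output are identical to those from svy: logit, and the interpretation is the same. The standard errors differ slightly, and the tests are reported as z rather than t statistics.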

Logistic regression                               Number of obs   =       8366
                                                  Wald chi2(3)    =     142.44
                                                  Prob > chi2     =     0.0000
Log pseudolikelihood = -4429.7883                 Pseudo R2       =     0.0384

                              (Std. Err. adjusted for 132 clusters in psuscid)
------------------------------------------------------------------------------
         |               Robust
  pvt_q4 |      Coef.   Std. Err.       z     P>|z|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
 age_kid |  -.0451845   .0277835     -1.626   0.104      -.0996393    .0092702
 biomapa |   .4273138   .0817512      5.227   0.000       .2670844    .5875433
engl_gpa |   .4258579   .0430438      9.894   0.000       .3414936    .5102222
   _cons |  -1.886177   .4414855     -4.272   0.000      -2.751473   -1.020881
------------------------------------------------------------------------------

Questions or comments? If you are affiliated with the Carolina Population Center, send them to Phil Bardsley.