Logistic regression
Reading: 5.1 & 5.2 in Data Analysis Using Regression and Multilevel/Hierarchical Models, Gelman & Hill (http://search.library.oregonstate.edu/OSU:everything:CP71242639930001451)
Logistic regression is the standard way to model binary outcomes.
That is, outcomes where the response variable takes only the values 0 or 1.
Example: political preference from Gelman & Hill
Conservative parties generally receive more support among voters with higher incomes. We illustrate classical logistic regression with a simple analysis of this pattern from the National Election Study in 1992.
For each respondent, $i$, in this poll, we label $y_i = 1$ if he or she preferred George Bush (the Republican candidate for president) or $y_i = 0$ if he or she preferred Bill Clinton (the Democratic candidate), for now excluding respondents who preferred Ross Perot or other candidates.
We predict preferences given the respondent’s income level, $x_i$, which is characterized on a five-point scale.
Our goal is to relate $y_i$ to $x_i$.
Exploratory analysis in R
Can we fit a regression model?
Should we fit a regression model?
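One way to explore these two questions is sketched below. The survey data frame is not reproduced in these notes, so the block simulates a hypothetical stand-in (`sim`, with made-up sample size and seed) from coefficients close to the ones fitted later. It tabulates the proportion of 1s in each income category and then fits a straight line with `lm()`, which runs without complaint but is not constrained to stay within [0, 1].

```r
# Hypothetical stand-in for the survey data: income on a 1-5 scale and a
# 0/1 vote simulated from a logistic model (coefficients are assumptions).
set.seed(1)
n <- 1000
income <- sample(1:5, n, replace = TRUE)
vote <- rbinom(n, size = 1, prob = plogis(-1.40 + 0.33 * income))
sim <- data.frame(income, vote)

# Exploratory summary: proportion of 1s within each income category
prop_by_income <- tapply(sim$vote, sim$income, mean)
print(round(prop_by_income, 2))

# A straight-line fit to a 0/1 response is possible ...
fit_lm <- lm(vote ~ income, data = sim)

# ... but extrapolating the line gives "probabilities" outside [0, 1]
pred_lm <- predict(fit_lm, newdata = data.frame(income = c(-10, 20)))
print(pred_lm)
```

So a linear regression can be fit, but whether it should be is a separate question: the fitted line has no built-in reason to respect the range of a probability.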
Logistic regression model
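In logistic regression, the response is related to the explanatory variables through the probability of the response being 1:
$P(y_i = 1) = \text{logit}^{-1}(X_i \beta)$
or equivalently
$\text{logit}(P(y_i = 1)) = X_i \beta$
where $X_i \beta$ is known as the linear predictor.
The $y_i$ are assumed to be independent Bernoulli with probability of success $p_i = P(y_i = 1)$.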
The inverse logit transforms continuous values to (0, 1)
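A minimal sketch of the two transformations may help here. The helper names `logit` and `invlogit` are defined in the block itself; base R's `plogis()` and `qlogis()` compute the same quantities, and the input values are arbitrary.

```r
# Logit and inverse logit, written out explicitly
logit <- function(p) log(p / (1 - p))
invlogit <- function(x) 1 / (1 + exp(-x))

# Arbitrary values on the whole real line
x <- c(-5, -1, 0, 1, 5)

# Their inverse-logit transforms all lie strictly between 0 and 1
p <- invlogit(x)
print(round(p, 3))

# Base R equivalents: plogis() is the inverse logit, qlogis() the logit
print(all.equal(p, plogis(x)))
print(all.equal(logit(p), qlogis(p)))

# The two functions undo each other
print(all.equal(logit(p), x))
```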
Interpreting the logistic regression coefficients
fit_1 <- glm(vote ~ income, family = binomial(link = "logit"),
data = pres_1992)
summary(fit_1)
##
## Call:
## glm(formula = vote ~ income, family = binomial(link = "logit"),
## data = pres_1992)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2756 -1.0034 -0.8796 1.2194 1.6550
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.40213 0.18946 -7.401 1.35e-13 ***
## income 0.32599 0.05688 5.731 9.97e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1591.2 on 1178 degrees of freedom
## Residual deviance: 1556.9 on 1177 degrees of freedom
## AIC: 1560.9
##
## Number of Fisher Scoring iterations: 4
The fitted model
$P(y_i = 1) = \text{logit}^{-1}(-1.40 + 0.33 \cdot \text{income}_i)$
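To see what the fit implies, the rounded coefficients from the summary output (-1.40 and 0.33) can be plugged into the inverse logit for each of the five income categories. This is only a sketch using the rounded estimates, so the values will differ slightly from those based on the full-precision fit.

```r
invlogit <- function(x) 1 / (1 + exp(-x))

# Rounded estimates copied from the summary output
a <- -1.40  # intercept
b <- 0.33   # coefficient on income

# Estimated probability of supporting Bush in each income category
income <- 1:5
p_hat <- invlogit(a + b * income)
print(data.frame(income, p_hat = round(p_hat, 2)))
```

The estimated probabilities rise with income, but not by the same amount at each step, which is why the coefficient cannot be read directly as a change in probability.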
Interpreting the logistic regression coefficients
Very generally, a coefficient greater than zero indicates that the probability increases as the explanatory variable increases. A coefficient less than zero indicates that the probability decreases as the explanatory variable increases.
But the non-linear relationship with $P(y_i = 1)$ makes it hard to interpret the exact value of a coefficient.
Three approaches:
At or near center of data
Divide by 4 rule
Odds ratios
At or near center of data
invlogit <- function(x) 1/(1 + exp(-x))
# = Interpret at some x =
mean_inc <- with(pres_1992, mean(income, na.rm = TRUE))
invlogit(-1.40 + 0.33*mean_inc)
## [1] 0.4049001
The estimated probability of supporting Bush for a respondent of average income is about 0.40.
At or near center of data
# = Interpret change in P for 1 unit change in x, at some x =
invlogit(-1.40 + 0.33*3) - invlogit(-1.40 + 0.33*2)
## [1] 0.07590798
An increase in income from category 2 to category 3 is associated with an increase in the estimated probability of supporting Bush of about 0.08.
At or near center of data
logit_p <- (-1.40 + 0.33*3.1)
0.33*exp(logit_p)/(1 + exp(logit_p))^2
## [1] 0.07963666
At the average income (about 3.1), the slope of the fitted curve is about 0.08: each “small” unit of increase in income is associated with an increase in the estimated probability of supporting Bush at a rate of 0.08 per unit of income.
Divide by 4 rule
The logistic function reaches its maximum slope at its center, where the linear predictor $\alpha + \beta x = 0$ and the derivative is $\beta e^0 / (1 + e^0)^2 = \beta / 4$.
# = Interpret bound on change in P =
coef(fit_1)[2]/4
## income
## 0.08149868
A one-unit change in income is associated with an increase in P(Bush) of at most 0.08.
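The bound can be checked numerically. The sketch below uses the rounded estimates from the summary output (-1.40 and 0.33); the helper `slope()` is a name introduced here for the derivative of the fitted curve, which can be written as b * p * (1 - p).

```r
invlogit <- function(x) 1 / (1 + exp(-x))

# Rounded estimates copied from the summary output
a <- -1.40
b <- 0.33

# Derivative of invlogit(a + b * x) with respect to x
slope <- function(x) {
  p <- invlogit(a + b * x)
  b * p * (1 - p)
}

# The curve is steepest where a + b * x = 0, i.e. where p = 0.5
x_center <- -a / b
print(slope(x_center))  # same as b / 4
print(b / 4)

# Changes in estimated probability between adjacent income categories
step_changes <- diff(invlogit(a + b * 1:5))
print(round(step_changes, 3))

# None of the one-unit changes exceeds the divide-by-4 bound
print(all(step_changes <= b / 4))
```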
Odds ratios
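A one-unit increase in $x$ results in an increase of $\beta$ in the log odds of supporting Bush. Equivalently, the odds are multiplied by $e^{\beta}$, the odds ratio.
A one-unit increase in income is associated with a change in the log odds of 0.33, that is, an odds ratio of $e^{0.33} \approx 1.39$.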
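A short check makes the odds interpretation concrete. Using the rounded estimates from the summary output (-1.40 and 0.33), the sketch compares income categories 2 and 3; the helper `odds()` is defined here for illustration, and any other pair of adjacent values would give the same ratio.

```r
invlogit <- function(x) 1 / (1 + exp(-x))
odds <- function(p) p / (1 - p)

# Rounded estimates copied from the summary output
a <- -1.40
b <- 0.33

# Estimated probabilities at income categories 2 and 3
p2 <- invlogit(a + b * 2)
p3 <- invlogit(a + b * 3)

# The difference in log odds equals the coefficient ...
log_odds_diff <- log(odds(p3)) - log(odds(p2))
print(log_odds_diff)

# ... and the ratio of odds equals exp(coefficient)
odds_ratio <- odds(p3) / odds(p2)
print(odds_ratio)
print(exp(b))
```

Unlike the change in probability, the odds ratio is the same wherever along the income scale the comparison is made.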
Inference & prediction
Coefficients are estimated with maximum likelihood.
Standard errors represent uncertainty in estimates.
Asymptotically, estimates are Normally distributed under repeated sampling.
An approximate 95% confidence interval for an estimate is:
estimate $\pm$ 2 $\times$ standard error
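As a worked example, the interval for the income coefficient can be computed from the estimate and standard error printed in the summary output. The multiplier 2 is an assumption here, a common rounding of the Normal quantile 1.96.

```r
# Estimate and standard error for income, copied from the summary output
est <- 0.32599
se <- 0.05688

# Approximate 95% interval on the log odds scale
ci <- est + c(-2, 2) * se
print(ci)

# The same interval expressed as odds ratios
print(exp(ci))
```

The interval excludes zero on the log odds scale, and equivalently excludes 1 on the odds ratio scale.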
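Predictions take the form of a predictive probability, the inverse logit of the linear predictor evaluated at the new voter's explanatory values.
For a voter not in the survey with an income level of 5, the predicted probability of supporting Bush is $\text{logit}^{-1}(-1.40 + 0.33 \cdot 5) \approx 0.56$.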
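In R, such predictions are usually obtained with `predict()` and `type = "response"`. Since the survey data are not reproduced in these notes, the sketch below fits the same kind of model to a simulated stand-in (`sim` and `fit_sim` are hypothetical, with made-up sample size, seed, and generating coefficients), so its numbers will not match those above.

```r
# Hypothetical stand-in data simulated from a logistic model
set.seed(2)
n <- 1000
sim <- data.frame(income = sample(1:5, n, replace = TRUE))
sim$vote <- rbinom(n, size = 1, prob = plogis(-1.40 + 0.33 * sim$income))

fit_sim <- glm(vote ~ income, family = binomial(link = "logit"), data = sim)

# Predicted probability for a new voter with income level 5
new_voter <- data.frame(income = 5)
p_pred <- predict(fit_sim, newdata = new_voter, type = "response")
print(p_pred)

# The same value by hand: inverse logit of the linear predictor
p_manual <- plogis(sum(coef(fit_sim) * c(1, 5)))
print(p_manual)
```

Without `type = "response"`, `predict()` returns the linear predictor on the log odds scale rather than a probability.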