Logistic regression
Reading: 5.1 & 5.2 in Data Analysis Using Regression and Multilevel/Hierarchical Models, Gelman & Hill (http://search.library.oregonstate.edu/OSU:everything:CP71242639930001451)
Logistic regression is the standard way to model binary outcomes.
That is, a response variable that takes only the values 0 or 1.
\[ y_i = \begin{cases} 1, & \text{with probability }p_i \\ 0, & \text{with probability } 1 - p_i \end{cases} \]
Example: political preference from Gelman & Hill
Conservative parties generally receive more support among voters with higher incomes. We illustrate classical logistic regression with a simple analysis of this pattern from the National Election Study in 1992.
For each respondent, \(i\), in this poll, we label \(y_i=1\) if he or she preferred George Bush (the Republican candidate for president) or \(y_i=0\) if he or she preferred Bill Clinton (the Democratic candidate), for now excluding respondents who preferred Ross Perot or other candidates.
We predict preferences given the respondent’s income level, which is characterized on a five-point scale.
\[ y_i = \begin{cases} 1, & \text{respondent $i$ preferred George Bush} \\ 0, & \text{respondent $i$ preferred Bill Clinton} \end{cases} \]
\[ x_i = \text{Income class of respondent $i$: 1 (poor), 2, 3, 4 or 5 (rich)} \]
Our goal is to relate \(y_i\) to \(x_i\).
Exploratory analysis in R
Can we fit a regression model?
Should we fit a regression model?
Logistic regression model
In logistic regression, the response is related to the explanatory variables through the probability of the response being 1:
\[ \text{logit}\left(P(y_i = 1)\right) = X_i\beta \] or equivalently
\[ P(y_i = 1) = \text{logit}^{-1}\left(X_i\beta\right) \] where \(\text{logit}(p_i) = \log{\left(\tfrac{p_i}{1-p_i}\right)}\)
\(X_i\beta\) is known as the linear predictor.
\(y_i\) are assumed to be independent Bernoulli random variables, each with its own probability \(p_i\) of success.
The inverse logit transforms continuous values to (0, 1)
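As a small illustration of the logit and its inverse, the sketch below writes both functions by hand and checks them against base R's `qlogis()` and `plogis()`. The helper names `logit` and `invlogit` and the grid `eta` are chosen here for illustration only.

```r
# Logit and inverse logit written out by hand (helper names are illustrative)
logit <- function(p) log(p / (1 - p))
invlogit <- function(x) 1 / (1 + exp(-x))

# A few linear-predictor values spanning a wide range
eta <- c(-10, -2, 0, 2, 10)
p <- invlogit(eta)
p  # every value lies strictly between 0 and 1, and invlogit(0) is 0.5

# The two functions undo each other
all.equal(logit(p), eta)

# Base R provides the same pair: qlogis() is the logit, plogis() its inverse
all.equal(p, plogis(eta))
all.equal(logit(p), qlogis(p))
```

However large or small the linear predictor, the inverse logit maps it to a valid probability, which is why it is used as the link here.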
Interpreting the logistic regression coefficients
fit_1 <- glm(vote ~ income, family = binomial(link = "logit"),
data = pres_1992)
summary(fit_1)
##
## Call:
## glm(formula = vote ~ income, family = binomial(link = "logit"),
## data = pres_1992)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2756 -1.0034 -0.8796 1.2194 1.6550
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.40213 0.18946 -7.401 1.35e-13 ***
## income 0.32599 0.05688 5.731 9.97e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1591.2 on 1178 degrees of freedom
## Residual deviance: 1556.9 on 1177 degrees of freedom
## AIC: 1560.9
##
## Number of Fisher Scoring iterations: 4
The fitted model
Interpreting the logistic regression coefficients
Very generally, a coefficient greater than zero indicates that the probability increases as the explanatory variable increases. A coefficient less than zero indicates that the probability decreases as the explanatory variable increases.
However, the nonlinear relationship with \(p_i\) makes the exact value of a coefficient hard to interpret.
Three approaches:
At or near center of data
Divide by 4 rule
Odds ratios
At or near center of data
invlogit <- function(x) 1/(1 + exp(-x))
# = Interpret at some x =
mean_inc <- with(pres_1992, mean(income, na.rm = TRUE))
invlogit(-1.40 + 0.33*mean_inc)
## [1] 0.4049001
Estimated probability of supporting Bush for a respondent of average income is 0.4
At or near center of data
# = Interpret change in P for 1 unit change in x, at some x =
invlogit(-1.40 + 0.33*3) - invlogit(-1.40 + 0.33*2)
## [1] 0.07590798
An increase in income from category 2 to category 3 is associated with an increase in the estimated probability of supporting Bush of 0.08
At or near center of data
logit_p <- (-1.40 + 0.33*3.1)
0.33*exp(logit_p)/(1 + exp(logit_p))^2
## [1] 0.07963666
Each “small” unit of increase in income, at the average income, is associated with an increase in the estimated probability of supporting Bush of 0.08
Divide by 4 rule
The logistic function reaches its maximum slope at its center, where the derivative is \(\beta/4\).
# = Interpret bound on change in P =
coef(fit_1)[2]/4
## income
## 0.08149868
A one unit change in income is associated with an increase in P(Bush) of at most 0.08.
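The bound comes from the derivative of \(\text{logit}^{-1}(\alpha + \beta x)\) with respect to \(x\), which is \(\beta e^{\eta}/(1 + e^{\eta})^2\) with \(\eta = \alpha + \beta x\). A rough numerical check, using the rounded coefficient 0.33 and a grid of \(\eta\) values chosen for illustration, might look like this:

```r
# Slope of P = invlogit(eta) with respect to x, written as a function of eta
beta <- 0.33  # rounded income coefficient from the summary output
slope <- function(eta) beta * exp(eta) / (1 + exp(eta))^2

# Evaluate the slope on a grid of linear-predictor values (grid is illustrative)
eta <- seq(-6, 6, by = 0.01)
max_slope <- max(slope(eta))
eta_at_max <- eta[which.max(slope(eta))]

all.equal(max_slope, beta / 4)  # the largest slope equals beta / 4
abs(eta_at_max) < 1e-8          # and occurs where eta = 0, i.e. P = 0.5
```

Away from \(P = 0.5\) the slope is smaller, so \(\beta/4\) is an upper bound on the change in probability per unit of \(x\) and not an estimate of it.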
Odds ratios
\[ \log \left( \frac{P(y = 1 | x)}{P(y = 0 | x)} \right) = \alpha + \beta x \]
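A unit increase in \(x\) is associated with a \(\beta\) increase in the log odds of supporting Bush, or equivalently multiplies the odds by \(e^{\beta}\) (the odds ratio).
A one unit increase in income is associated with a change in the log odds of 0.33, i.e. an odds ratio of \(e^{0.33} \approx 1.39\).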
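To see why exponentiating the coefficient gives a constant multiplier on the odds, the sketch below computes the odds at several income levels from the rounded estimates. The helper `odds` is defined here for illustration.

```r
alpha <- -1.40  # rounded intercept from the summary output
beta  <- 0.33   # rounded income coefficient

# Odds of supporting Bush at income x: P(y = 1 | x) / P(y = 0 | x)
odds <- function(x) exp(alpha + beta * x)

# The ratio of odds for adjacent income categories is the same everywhere
odds(2) / odds(1)
odds(5) / odds(4)
exp(beta)  # about 1.39: each step up in income multiplies the odds by ~1.39
```

Unlike a change in probability, this multiplier does not depend on the income level at which it is evaluated.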
Inference & prediction
Coefficients are estimated with maximum likelihood.
Standard errors represent uncertainty in estimates.
Asymptotically, estimates are approximately Normally distributed under repeated sampling.
An approximate 95% confidence interval for estimates is:
estimate \(\pm 2 \times\)standard error
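Applied to the income coefficient, the interval can be computed directly from the estimate and standard error printed in the summary output. This is a minimal sketch; the variable names are illustrative.

```r
est <- 0.32599  # income estimate from the summary output
se  <- 0.05688  # its standard error

# Approximate 95% interval: estimate plus or minus 2 standard errors
ci <- est + c(-2, 2) * se
ci       # roughly 0.21 to 0.44 on the log-odds scale

exp(ci)  # the same interval expressed as a multiplier on the odds
```

With the fitted object available, `confint.default(fit_1)` gives the analogous Wald interval, using the Normal quantile 1.96 in place of 2.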
Predictions take the form of a predictive probability \[ \hat{p}_0 = \hat{P}(y_0 = 1) = \text{logit}^{-1} (x_0\hat{\beta}) \]
For a voter not in the survey with an income level of 5, the predicted probability of supporting Bush is \(\text{logit}^{-1}(-1.40 + 0.33 \times 5) \approx 0.56\).
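The predictive probability can be reproduced from the printed coefficients. The sketch below plugs the estimates into the inverse logit using base R's `plogis()`; the names `b0`, `b1` and `x0` are illustrative.

```r
b0 <- -1.40213  # intercept estimate from the summary output
b1 <-  0.32599  # income estimate from the summary output
x0 <- 5         # income level of the new voter

# Predicted probability: inverse logit of the linear predictor
p_hat <- plogis(b0 + b1 * x0)
p_hat
```

With the fitted object available, `predict(fit_1, newdata = data.frame(income = 5), type = "response")` should return the same quantity without copying coefficients by hand.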