## Logistic regression

Reading: 5.1 & 5.2 in Data Analysis Using Regression and Multilevel/Hierarchical Models, Gelman & Hill (http://search.library.oregonstate.edu/OSU:everything:CP71242639930001451)

Logistic regression is the standard way to model binary outcomes.

I.e. a response variable that only takes the values 0 or 1.

\[ y_i = \begin{cases} 1, & \text{with probability }p_i \\ 0, & \text{with probability } 1 - p_i \end{cases} \]

## Example: political preference from Gelman & Hill

Conservative parties generally receive more support among voters with higher incomes. We illustrate classical logistic regression with a simple analysis of this pattern from the 1992 National Election Study.

For each respondent, \(i\), in this poll, we label \(y_i = 1\) if he or she preferred George Bush (the Republican candidate for president) or 0 if he or she preferred Bill Clinton (the Democratic candidate), for now excluding respondents who preferred Ross Perot or other candidates.

We predict preferences from the respondent's income level, which is characterized on a five-point scale.

\[ y_i = \begin{cases} 1, & \text{respondent $i$ preferred George Bush} \\ 0, & \text{respondent $i$ preferred Bill Clinton} \end{cases} \]

\[ x_i = \text{Income class of respondent $i$: 1 (poor), 2, 3, 4, or 5 (rich)} \]

Our goal is to relate \(y_i\) to \(x_i\).

## Exploratory analysis in R

Can we fit a regression model?

Should we fit a regression model?
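One way to explore both questions is to look at the proportion of Bush supporters within each income category. A minimal sketch, using simulated data as a stand-in for `pres_1992` (whose structure is assumed to be a 0/1 `vote` and a 1–5 `income`):

```
# Simulated stand-in for the NES data: income class 1-5, vote = 1 (Bush) / 0 (Clinton)
set.seed(1)
n <- 1000
income <- sample(1:5, n, replace = TRUE)
invlogit <- function(x) 1 / (1 + exp(-x))
vote <- rbinom(n, 1, invlogit(-1.4 + 0.33 * income))
pres_sim <- data.frame(vote, income)

# Proportion supporting Bush in each income class
tapply(pres_sim$vote, pres_sim$income, mean)
```

Plotting these proportions against income suggests an increasing trend, but the raw response is binary, which is why ordinary linear regression is an awkward fit here.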

## Logistic regression model

In logistic regression, the response is related to the explanatory variables through the probability of the response being 1:

\[ \text{logit}\left(P(y_i = 1)\right) = X_i\beta \] or equivalently

\[ P(y_i = 1) = \text{logit}^{-1}\left(X_i\beta\right) \] where \(\text{logit}(p_i) = \log{\left(\tfrac{p_i}{1-p_i}\right)}\), so \(\text{logit}^{-1}(x) = \tfrac{e^x}{1 + e^x}\)

\(X_i\beta\) is known as the linear predictor.

The \(y_i\) are assumed to be independent Bernoulli with success probability \(p_i = P(y_i = 1)\).

## The inverse logit transforms continuous values to (0, 1)
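A quick numerical check of this property:

```
invlogit <- function(x) 1 / (1 + exp(-x))

# Any real input is squeezed into (0, 1); invlogit(0) is exactly 0.5,
# and the curve saturates towards 0 and 1 in the tails
invlogit(c(-10, -2, 0, 2, 10))
```

`curve(invlogit(x), -10, 10)` displays the familiar S-shaped curve.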

## Interpreting the logistic regression coefficients

```
fit_1 <- glm(vote ~ income, family = binomial(link = "logit"),
             data = pres_1992)
summary(fit_1)
```

```
##
## Call:
## glm(formula = vote ~ income, family = binomial(link = "logit"),
## data = pres_1992)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2756 -1.0034 -0.8796 1.2194 1.6550
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.40213 0.18946 -7.401 1.35e-13 ***
## income 0.32599 0.05688 5.731 9.97e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1591.2 on 1178 degrees of freedom
## Residual deviance: 1556.9 on 1177 degrees of freedom
## AIC: 1560.9
##
## Number of Fisher Scoring iterations: 4
```

## The fitted model
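Using the estimates in the `summary` output above, the fitted model is \(\hat{P}(y_i = 1) = \text{logit}^{-1}(-1.40 + 0.33\,x_i)\). Its fitted probabilities at each income class can be computed directly:

```
invlogit <- function(x) 1 / (1 + exp(-x))

# Coefficients reported by summary(fit_1)
b0 <- -1.40213
b1 <- 0.32599

# Fitted probability of supporting Bush at each income class
income <- 1:5
round(invlogit(b0 + b1 * income), 2)
```

The fitted probabilities rise steadily with income class, from roughly 0.25 at the lowest category to over half at the highest.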

## Interpreting the logistic regression coefficients

Very generally, a coefficient greater than zero indicates that the probability increases with the explanatory variable; a coefficient less than zero indicates that the probability decreases.

But the non-linear relationship between the linear predictor and \(p_i\) makes the exact value of a coefficient hard to interpret.

Three approaches:

**At or near center of data**

**Divide by 4 rule**

**Odds ratios**

## At or near center of data

```
invlogit <- function(x) 1/(1 + exp(-x))
# = Interpret at some x =
mean_inc <- with(pres_1992, mean(income, na.rm = TRUE))
invlogit(-1.40 + 0.33*mean_inc)
```

`## [1] 0.4049001`

The estimated probability of supporting Bush for a respondent of average income is 0.4

## At or near center of data

```
# = Interpret change in P for 1 unit change in x, at some x =
invlogit(-1.40 + 0.33*3) - invlogit(-1.40 + 0.33*2)
```

`## [1] 0.07590798`

An increase in income from category 2 to category 3 is associated with an increase in the estimated probability of supporting Bush of 0.08

## At or near center of data

```
logit_p <- (-1.40 + 0.33*3.1)
0.33*exp(logit_p)/(1 + exp(logit_p))^2
```

`## [1] 0.07963666`

Each “small” unit of increase in income, at the average income, is associated with an increase in the estimated probability of supporting Bush of 0.08

## Divide by 4 rule

The logistic function reaches its maximum slope at its center, where the derivative is \(\beta/4\).
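This follows from differentiating the inverse logit: with \(p = \text{logit}^{-1}(\alpha + \beta x)\),

\[
\frac{dp}{dx} = \beta\, p (1 - p) \le \beta \cdot \tfrac{1}{2} \cdot \tfrac{1}{2} = \frac{\beta}{4},
\]

with the maximum attained at \(p = 1/2\), i.e. at the center of the curve.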

```
# = Interpret bound on change in P =
coef(fit_1)[2]/4
```

```
## income
## 0.08149868
```

A one-unit change in income is associated with an increase in P(Bush) of at most 0.08

## Odds ratios

\[ \log \left( \frac{P(y = 1 | x)}{P(y = 0 | x)} \right) = \alpha + \beta x \]

A unit increase in \(x\) is associated with a \(\beta\) increase in the log odds of supporting Bush; equivalently, the odds are multiplied by \(e^\beta\).

A one unit increase in income is associated with an increase in the log odds of 0.33, i.e. the odds of supporting Bush are multiplied by \(e^{0.33} \approx 1.39\)
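A sketch of this calculation with the reported income coefficient:

```
# Income coefficient from summary(fit_1)
b1 <- 0.32599

# Estimated odds ratio for a one-unit increase in income class
exp(b1)
```

In practice `exp(coef(fit_1))` gives the odds ratios for all coefficients at once.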

## Inference & prediction

Coefficients are estimated with maximum likelihood.

Standard errors represent uncertainty in estimates.

Asymptotically, estimates are Normally distributed under repeated sampling.

An approximate 95% confidence interval for estimates is:

estimate \(\pm 2 \times\)standard error
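For the income coefficient above (estimate 0.32599, standard error 0.05688), the approximate interval is:

```
est <- 0.32599
se  <- 0.05688

# Approximate 95% confidence interval: estimate +/- 2 standard errors
c(lower = est - 2 * se, upper = est + 2 * se)
```

The interval excludes zero, consistent with the small p-value in the `summary` output; `confint(fit_1)` gives profile-likelihood intervals instead.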

**Predictions** take the form of a predictive probability
\[
\hat{p}_0 = \hat{P}(y_0 = 1) = \text{logit}^{-1} (x_0\hat{\beta})
\]

For a voter not in the survey with an income level of 5, the predicted probability of supporting Bush is about 0.56
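This prediction can be reproduced directly from the fitted coefficients:

```
invlogit <- function(x) 1 / (1 + exp(-x))

# Predicted probability for a new voter with income class 5,
# using the coefficients reported by summary(fit_1)
x0 <- 5
invlogit(-1.40213 + 0.32599 * x0)
```

Equivalently, `predict(fit_1, newdata = data.frame(income = 5), type = "response")` computes the same predictive probability from the fitted model object.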