Understand the purpose of interactions and how to interpret interactions correctly in the context of regression.
Note: Before lab begins, please install the packages
datasets onto the lab computer or your personal laptop.
df <- as.data.frame(cbind(state.x77[,c(3, 4)], state.region)) colnames(df) <- c("Illiteracy", "Life.Exp", "Region") head(df) tail(df)
Suppose we were interested in determining whether there was an association of Life Expectancy and Illiteracy or Life Expectancy and Geographic Region.
Assuming life expectancy is the response variable, what kind of regression model would we fit? Would we include an illiteracy by region interaction or not? Take a minute or two to discuss with your neighbor.
Let’s suppose that we are professors in public health and U.S. geography (ie we are experts in this field). We deem that fitting a model with an interaction term makes sense. Before fitting the model, write what the model would be with your neighbor, making sure to define any indicator variables you use:
\[ Life_i = \beta_0 + ... \]
Let’s also explore the data a bit:
library(ggplot2) ggplot(data = df, aes(x = Illiteracy, y = Life.Exp)) + geom_point() + facet_wrap(~ Region)
From the plot, do you anticipate any signficant interactions between illiteracy and region?
Now let’s fit the model in R.
mod <- lm(Life.Exp ~ Illiteracy + factor(Region) + Illiteracy*factor(Region), data = df) ## summary(mod)
What terms were significant? Was your intuition correct based on the exploratory plot?
What does this mean? Discuss with your neighbor the interpretations of the interactions, particularly the significant interaction.
Can we see if there is a difference in “slopes” between region 2 and region 3 from the output above? If so, great! If not, what can we do to compare these two regions directly? Let’s do it. Now! Hint: Look into the
Finally, how we test to see if there is ANY interaction effect? Hint: Think Extra Sum of Squares F Test.
df2 <- as.data.frame(cbind(state.x77[,c(4, 5, 1)])) colnames(df2) <- c("Life.Exp", "Murder", "Pop") head(df2)
Is there evidence of an association between murder rate and life expectancy or high school graduation rate and life expectancy in the U.S.?
Again, suppose that we all just received doctorate degrees in public policy and geography and are experts in this field (congratulations!). As experts, we think that, BEFORE fitting a model, that the interaction between murder rate and high school graduation rate is associated with life expectancy. How would you explain the interaction term to someone with little to no statistical background?
Now, let’s fit the model
mod2 <- lm(Life.Exp ~ Murder + Pop + Murder*Pop, data = df2) summary(mod2)
Is there any evidence for an interaction? How would you interpret the interaction?
We see from the above output that Population on its own is not a signficant predictor. Would we ever want to fit the following model?
\[ Life_i = \beta_0 + \beta_1Murder + \beta_2Murder*Pop + \epsilon_i\]
mod3 <- lm(Life.Exp ~ Murder + Murder:Pop, data = df2) summary(mod3)
Can you think of a good plot to examine the nature of the interaction? There is not one correct answer to this, but we will discuss three different types of plots (two now, one later).
Notice that we CANNOT interpret the beta coefficient for murder directly. (If we tried, we would say that a one unit increase in murder is associated with a -0.3696 decrease in mean life expectancy, holding population and murder times population constant). BUT, we can’t hold murder times population constant. Oh no. How could we interpret the association of murder with life expectancy?
We see that the parameter estimate for the interaction is positive, but the estimate for murder is negative. Intuitively, we estimate that increases in murder is associated with a mean decrease in life expectancy, the response (which makes sense). However, as population size of the state increases, the estimated strength of the association of murder with life expectancy decreases.
We can also make interpretations if we fix population size: For example, if we fix the population size at 5000, we estimate that a one unit change in murder is associated with a mean change of life expectancy of: -0.3696 + 0.00001856*5000 = -0.2768 units of life expectancy.
If we fix the population size at 10000, we estimate that a one unit change in murder is associated with a mean change of life expectancy of: -0.3696 + 0.00001856*10000 = -0.184 units of life expectancy (so a less strong association for a larger population size).
How might we visualize this association in a plot? Can you compare this plot to a plot of a model without the interaction term? Hint: Use the
expand.grid function that we used in lab a few weeks ago. Below is some code to get you started.
mod2 <- lm(Life.Exp ~ Murder + Pop + Murder*Pop, data = df2) mod4 <- lm(Life.Exp ~ Murder + Pop, data = df2) pred_df <- expand.grid(Pop = c(500, 1000, 5000, 10000, 15000), Murder = 2:14)
I know that was a lot of information: let me know if you have any questions about interactions now, or as you are studying for comps this summer!