Lab 4 - Visualizing models | ST 552 Statistical Methods

Learning Objectives

Visualize a model by making predictions on a grid of explanatories.

This is one useful strategy for understanding a model. A complementary strategy is to examine the model form mathematically, we’ll see a few examples in class on Friday.

Warmup

library(tidyverse)
library(modelr)

For this lab we’ll work with some data from the openintro package. The dataset bdims contains body measurements of 507 physically active individuals. For the purposes of the lab we’ll look at trying to predict a physically active individuals weight.

data(bdims, package = "openintro") 
bdims <- bdims %>% 
  mutate(sex = as.factor(sex) %>% fct_recode(male = "1", female = "0"))
(bdims <- as_tibble(bdims))

Read more about the data at: ?openintro::bdims

Let’s start with the naive model: \[ \text{weight}_i = \beta_0 + \beta_1 \text{height}_i + \epsilon_i, \quad i = 1, \ldots, 507 \]

Taking a quick look at the data:

ggplot(bdims, aes(hgt, wgt)) +
  geom_point()

This seems like a reasonable place to start.

We can fit the model to the data with:

fit_hgt <- lm(wgt ~ hgt, data = bdims)
summary(fit_hgt)

## 
## Call:
## lm(formula = wgt ~ hgt, data = bdims)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.743  -6.402  -1.231   5.059  41.103 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -105.01125    7.53941  -13.93   <2e-16 ***
## hgt            1.01762    0.04399   23.14   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.308 on 505 degrees of freedom
## Multiple R-squared:  0.5145, Adjusted R-squared:  0.5136 
## F-statistic: 535.2 on 1 and 505 DF,  p-value: < 2.2e-16

Now, how do we visualize this fitted model in the context of the data?

There are two common approaches:

Extract the coefficients and add a line with the corresponding slope and intercept:

coefs <- coef(fit_hgt)

ggplot(bdims, aes(hgt, wgt)) +
  geom_point(size = 0.5, alpha = 0.5) +
  geom_abline(intercept = coefs[1], slope = coefs[2], color = "#377EB8")

Or, use the model to predict the response on a grid of explanatory values:

(pred_linear <- data_grid(bdims, hgt) %>% 
  add_predictions(fit_hgt))

## # A tibble: 147 x 2
##      hgt  pred
##    <dbl> <dbl>
##  1  147.  44.8
##  2  150.  47.1
##  3  150.  47.5
##  4  151.  48.8
##  5  152   49.7
##  6  152.  50.1
##  7  153.  51.1
##  8  154.  52.1
##  9  154.  52.2
## 10  155.  52.6
## # ... with 137 more rows

Then join the predictions with a line (Charlotte will build this plot up):

ggplot(bdims, aes(hgt, wgt)) +
  geom_point(size = 0.5, alpha = 0.5) +
  geom_line(aes(y = pred), data = pred_linear, color = "#377EB8")

The second approach tends to generalize to more complicated models more easily, and is the approach I’d recommend.

modelr provides functions to help with this approach

Let’s take a closer look at the code used in the second approach:

pred_linear <- data_grid(bdims, hgt) %>% 
      add_predictions(fit_hgt)

Two things are happening in this chunk: data_grid() from modelr is being used to create a tibble containing values at which we will make predictions, and add_predictions() also from modelr, is adding a column for the predictions based on our model at these values. The result is just a tibble with we can plot using our usual tools:

pred_linear

## # A tibble: 147 x 2
##      hgt  pred
##    <dbl> <dbl>
##  1  147.  44.8
##  2  150.  47.1
##  3  150.  47.5
##  4  151.  48.8
##  5  152   49.7
##  6  152.  50.1
##  7  153.  51.1
##  8  154.  52.1
##  9  154.  52.2
## 10  155.  52.6
## # ... with 137 more rows

Let’s explore data_grid() and add_predictions() in turn. data_grid() takes a data frame and one or more variable names. If you provide a single variable name, it returns the unique values that variable takes in the data:

data_grid(data = bdims, hgt)

## # A tibble: 147 x 1
##      hgt
##    <dbl>
##  1  147.
##  2  150.
##  3  150.
##  4  151.
##  5  152 
##  6  152.
##  7  153.
##  8  154.
##  9  154.
## 10  155.
## # ... with 137 more rows

Alternatively you can use seq_range() to get an evenly spaced grid of values:

data_grid(bdims, hgt = seq_range(hgt, by = 10))

## # A tibble: 6 x 1
##     hgt
##   <dbl>
## 1  147.
## 2  157.
## 3  167.
## 4  177.
## 5  187.
## 6  197.

If you provide more than one variable, you’ll get all possible combinations of the two:

(grid_hgt_sex <- data_grid(bdims, 
  hgt = seq_range(hgt, by = 10), sex))

## # A tibble: 12 x 2
##      hgt sex   
##    <dbl> <fct> 
##  1  147. female
##  2  147. male  
##  3  157. female
##  4  157. male  
##  5  167. female
##  6  167. male  
##  7  177. female
##  8  177. male  
##  9  187. female
## 10  187. male  
## 11  197. female
## 12  197. male

(We’ll use this data grid shortly so I’m keeping it in grid_hgt_sex)

This grid specifies the point at which we want to make predictions, add_predictions() also from modelr, will add them if we specify a model:

add_predictions(data = grid_hgt_sex, model = fit_hgt)

## # A tibble: 12 x 3
##      hgt sex     pred
##    <dbl> <fct>  <dbl>
##  1  147. female  44.8
##  2  147. male    44.8
##  3  157. female  55.0
##  4  157. male    55.0
##  5  167. female  65.1
##  6  167. male    65.1
##  7  177. female  75.3
##  8  177. male    75.3
##  9  187. female  85.5
## 10  187. male    85.5
## 11  197. female  95.7
## 12  197. male    95.7

Notice the predictions end up in a columns called pred and our original columns are preserved.

(This is equivalent to predict(fit_height, newdata = data_hgt_sex)) but returns a data frame with the original columns intact).

Plotting this data alone isn’t particularly useful:

pred_hgt_sex <- add_predictions(data = grid_hgt_sex, model = fit_hgt)

ggplot(pred_hgt_sex, aes(hgt, pred)) +
  geom_point()

But adding to a layer in our plot of the data helps us understand the model in the context of the data. Putting it all together:

grid_hgt_sex <- data_grid(bdims, hgt = seq_range(hgt, by = 10), sex) 
pred_hgt_sex <- add_predictions(grid_hgt_sex, model = fit_hgt) 

ggplot(bdims, aes(hgt, wgt)) +
  geom_point(size = 0.5, alpha = 0.5) +
  geom_line(aes(y = pred), data = pred_hgt_sex)

Your turn - a non-linear relationship

A quadratic model in height: \[ \text{weight}_i = \beta_0 + \beta_1 \text{height}_i + \beta_2 \text{height}_i^2 + \epsilon_i, \quad i = 1, \ldots, 507 \]

can be fit with:

fit_height_quad <- lm(wgt ~ hgt + I(hgt^2), data = bdims)

Use add_predictions() with grid_hgt_sex to add the fitted model to the plot of the data

grid_hgt_sex <- data_grid(bdims, hgt = seq_range(hgt, n = 1000), sex) 
fit_height_poly <- lm(wgt ~ poly(hgt, 10), data = bdims)
pred_poly <- add_predictions(grid_hgt_sex, model = fit_height_poly) 

ggplot(bdims, aes(hgt, wgt)) +
  geom_point(size = 0.5, alpha = 0.5) +
  geom_line(aes(y = pred), data = pred_poly)

# geom_fun

(How did I find a color?)

RColorBrewer::display.brewer.all() # choose a palette
RColorBrewer::brewer.pal(3, "Set1") # extract one of the colors

Visualizing more dimensions

So far we’ve ignored sex but you might imagine it plays a role in the relationship between height and weight. Our current model predicts the same straight line relationship for both male and female:

fit_hgt_sex_1 <- lm(wgt ~ hgt * sex, data = bdims)
fit_hgt_sex_2 <- lm(wgt ~ hgt + sex, data = bdims)

summary(fit_hgt_sex_1)

## 
## Call:
## lm(formula = wgt ~ hgt * sex, data = bdims)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -20.187  -5.957  -1.439   4.955  43.355 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -43.81929   13.77877  -3.180  0.00156 ** 
## hgt           0.63334    0.08351   7.584 1.63e-13 ***
## sexmale     -17.13407   19.56250  -0.876  0.38152    
## hgt:sexmale   0.14923    0.11431   1.305  0.19233    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.795 on 503 degrees of freedom
## Multiple R-squared:  0.5682, Adjusted R-squared:  0.5657 
## F-statistic: 220.7 on 3 and 503 DF,  p-value: < 2.2e-16

summary(fit_hgt_sex_2)

## 
## Call:
## lm(formula = wgt ~ hgt + sex, data = bdims)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -20.184  -5.978  -1.356   4.709  43.337 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -56.94949    9.42444  -6.043 2.95e-09 ***
## hgt           0.71298    0.05707  12.494  < 2e-16 ***
## sexmale       8.36599    1.07296   7.797 3.66e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.802 on 504 degrees of freedom
## Multiple R-squared:  0.5668, Adjusted R-squared:  0.5651 
## F-statistic: 329.7 on 2 and 504 DF,  p-value: < 2.2e-16

pred_sex_1 <- add_predictions(grid_hgt_sex, model = fit_hgt_sex_1)
pred_sex_2 <- add_predictions(grid_hgt_sex, model = fit_hgt_sex_2)

ggplot(bdims, aes(hgt, wgt, color = sex)) +
  geom_point(size = 0.5, alpha = 0.5) +
  geom_line(aes(y = pred), data = pred_sex_1)

ggplot(bdims, aes(hgt, wgt, color = sex)) +
  geom_point(size = 0.5, alpha = 0.5) +
  geom_line(aes(y = pred), data = pred_sex_2)

If you want to put both models in one plot:

pred_sex <- add_predictions(grid_hgt_sex, model = fit_hgt_sex_1, "pred_1") %>% 
  add_predictions(fit_hgt_sex_2, "pred_2")

ggplot(bdims, aes(hgt, wgt, color = sex)) +
  geom_point(size = 0.5, alpha = 0.5) +
  geom_line(aes(y = pred_1, linetype = "M1"), data = pred_sex) +
  geom_line(aes(y = pred_2, linetype = "M2"), data = pred_sex) 

## Or alternatively
pred_sex %>% 
  gather("model", "pred", pred_1, pred_2) %>% 
  ggplot(aes(hgt, color = sex)) +
    geom_point(aes(y = wgt), size = 0.5, alpha = 0.5, data = bdims) +
    geom_line(aes(y = pred, linetype = model))

Two possible ways of including sex in the regression model are:

fit_hgt_sex_1 <- lm(wgt ~ hgt * sex, data = bdims)
fit_hgt_sex_2 <- lm(wgt ~ hgt + sex, data = bdims)

Your turn: Visualize the two models and describe the difference between them. Try to write down the models that are being fit.

Visualizing residuals

Visualizing predictions only tells us about the patterns the model has captured, but not the ones that it has failed to capture. Residuals are one way to provide this insight.

The modelr function add_residuals() takes two inputs like add_predictions() but now the input data is the observed data:

add_residuals(bdims, fit_hgt_sex_1)

## # A tibble: 507 x 26
##    bia.di bii.di bit.di che.de che.di elb.di wri.di kne.di ank.di sho.gi
##     <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
##  1   42.9   26     31.5   17.7   28     13.1   10.4   18.8   14.1   106.
##  2   43.7   28.5   33.5   16.9   30.8   14     11.8   20.6   15.1   110.
##  3   40.1   28.2   33.3   20.9   31.7   13.9   10.9   19.7   14.1   115.
##  4   44.3   29.9   34     18.4   28.2   13.9   11.2   20.9   15     104.
##  5   42.5   29.9   34     21.5   29.4   15.2   11.6   20.7   14.9   108.
##  6   43.3   27     31.5   19.6   31.3   14     11.5   18.8   13.9   120.
##  7   43.5   30     34     21.9   31.7   16.1   12.5   20.8   15.6   124.
##  8   44.4   29.8   33.2   21.8   28.8   15.1   11.9   21     14.6   120.
##  9   43.5   26.5   32.1   15.5   27.5   14.1   11.2   18.9   13.2   111 
## 10   42     28     34     22.5   28     15.6   12     21.1   15     120.
## # ... with 497 more rows, and 16 more variables: che.gi <dbl>,
## #   wai.gi <dbl>, nav.gi <dbl>, hip.gi <dbl>, thi.gi <dbl>, bic.gi <dbl>,
## #   for.gi <dbl>, kne.gi <dbl>, cal.gi <dbl>, ank.gi <dbl>, wri.gi <dbl>,
## #   age <int>, wgt <dbl>, hgt <dbl>, sex <fct>, resid <dbl>

Notice there is now an additional column called resid.

Which we can then visualize, for example, here the residuals are plotted against height:

add_residuals(bdims, fit_hgt_sex_1) %>% 
  ggplot(aes(hgt, resid, color = sex)) +
    geom_ref_line(h = 0) +
    geom_point()

If the model has failed to capture an important relationship between weight and height we’d expect to see a pattern in the average residual as a function of height. A line at zero has been added to help with the comparison. Looks pretty good here.

Your turn Look at the residuals against the variable wai.gi. Does our model fail to capture something?

add_residuals(data = bdims, model = fit_hgt_sex_1) %>% 
  ggplot(aes(wai.gi, resid, color = sex)) +
    geom_ref_line(h = 0) +
    geom_point()

More complicated models

As models get more complicated:

Visualizing more than three variables at a time is hard! A good strategy is to pick a couple of variables to visualize at a time, and make more than one plot. Or fix a variable at a meaningful value.
Use a mathematical examination of the model to influence your choice of interesting views. E.g. A parameter that only shifts the intercept might be omitted from a plot in favor of a parameter that changes the slope.
Make sure your grid has all the variables required for your model. If you pass a model to the .model argument in data_grid(), it will pick “typical” values for any needed variables that you don’t specify.

As an example let’s add wai.gi, waist girth, to our model. There are numerous ways we could do this, but let’s look at adding it as a main (linear) effect as well as through an interaction with sex: \[ \begin{aligned} \text{weight}_i &= \beta_0 + \beta_1 \text{height}_i + \beta_2 \text{male}_i + \beta_3\text{wai.gi}_i + \\ & \beta_4\left(\text{height}_i \times \text{male}_i\right) + \beta_5\left(\text{wai.gi}_i \times \text{male}_i\right) + \epsilon_i, \quad i = 1, \ldots, 507 \end{aligned} \]

fit_wai <- lm(wgt ~ hgt*sex + wai.gi*sex, data = bdims)

We’ve now got a model with three variables and five parameters. First we need to make sure our data grid contains value for all three variables (notice I’m specifying the sequences a little differently),

(grid_wai <- data_grid(bdims,
  hgt = seq_range(hgt, n = 15, pretty = TRUE),
  wai.gi = seq_range(wai.gi, n = 5, pretty = TRUE), 
  sex
))

## # A tibble: 192 x 3
##      hgt wai.gi sex   
##    <int>  <int> <fct> 
##  1   145     50 female
##  2   145     50 male  
##  3   145     60 female
##  4   145     60 male  
##  5   145     70 female
##  6   145     70 male  
##  7   145     80 female
##  8   145     80 male  
##  9   145     90 female
## 10   145     90 male  
## # ... with 182 more rows

Then we can add predictions as per usual

pred_wai <- grid_wai %>%  
  add_predictions(fit_wai)

A useful way to start, is to fix one of the values, let’s look at holding height constant at 170cm, which we can easily do by filtering the prediction grid, and looking at the predicted weight as a function of waist girth:

pred_wai %>% 
  filter(hgt == 170) %>% 
  ggplot(aes(wai.gi, pred, color = sex)) +
    geom_line() +
  labs(title = "Height = 170cm")

For someone of height 170cm, we can see this model shows increasing waist girth is associated with increasing mean weight (lines have positive slope), and that for females the increase is slightly larger per unit increase in waist girth (female line has steeper slope).

But we only looked at one value of height, we could now repeat this view for our grid of heights:

ggplot(pred_wai, aes(wai.gi, pred, color = sex)) +
  geom_line() +
  facet_wrap(~ hgt, labeller = label_both)

And we see that increasing waist girth is associated with increasing mean weight for all heights, and that for females the increase is slightly larger per unit increase in waist girth for all heights. It’s a little harder to see if the slopes are changing over the different heights. But here’s an alternate view:

ggplot(pred_wai, aes(wai.gi, pred, color = hgt)) +
  geom_line(aes(group = hgt)) +
  facet_wrap(~ sex)

Now it’s easier to see the slope between the predicted weight and waist girth is the same for all heights (the lines are parallel within gender).

Your turn: Fit a model where there is also an interaction between waist girth and height and repeat the above plots.

Wrapping up

Combining the functions data_grid() and add_predictions() is a useful way to visualize a model.
Residuals are a useful way to look for things our model fails to capture.

Lab 4 - Visualizing models Jan 30 2019