ST552 Lab 4 Jan 27th 2016

For this lab we’ll work with some data from the openintro package (you may need to install it install.packages("openintro")). The dataset bdims contains body measurements of 507 physically active individuals. For the purposes of the lab we’ll look at trying to predict a physically active individuals weight from their waist girth and gender.

bdims$gender <- ifelse(bdims$sex == 1, "male", "female")

Our goal is to practice adding regression lines to plots as a strategy to understand the models we fit.

Simple plot of the raw data using ggplot2

qplot works a lot like plot, except variables are evaluated inside the data frame, so if you specify data = you don’t need to preface variable names with $.

qplot(, wgt, data = bdims, colour = gender) # a scatter plot

The visual properties of the objects used to represent observations (aesthetics in ggplot2 nomenclature) can also be mapped to variables. For example, colour = gender in the above maps the colour of the points to the gender variable. Try instead shape = gender or colour = hgt

It’s easy to add a fitted regression line to a plot,

qplot(, wgt, data = bdims, colour = gender) +
  geom_smooth(method = "lm", se = FALSE)

But beware! This fits a separate simple linear regression to each group defined by the aesthetics and/or facets. With geom_smooth you are limited to regression with a single explanatory variable, and it is hard to get out any information on the fit. For these reasons, you generally fit the model explicitly using lm, then generate your own predictions to plot.

What’s the difference between fitting two separate simple linear regressions, and one multiple linear regression that allows different slopes and intercepts for men and women?

men <- lm(wgt ~, data = subset(bdims, gender == "male"))
women <- lm(wgt ~, data = subset(bdims, gender == "female"))
both <- lm(wgt ~*gender, data = bdims)

Generating fitted mean lines from lm for plotting

Let’s replicate this plot, by manually fitting the model and producing predicted means.

The model that corresponds to separate lines for men and women is fit in R with:

fit <- lm(wgt ~*gender , data = bdims)

A very flexible way to examine and plot fitted regression models involves predicting the mean response on a grid of values for the explanatory variables. The key idea is to stop thinking about adding the regression lines as “adding a line with a given slope and intercept”, and instead think about “adding predicted means and joining them together with lines”. This generally involves three steps:

  1. Creating a data frame where each row is a particular combination of explanatory values we are interested in.
  2. Predicting the mean at each combination of explanatory values (and maybe standard errors, or CI/PI limits) and combining them with the original data.
  3. Plotting the predicted means or adding them to a plot of the raw data


expand.grid creates a data frame from all combinations of the variables you give it. For example, examine

expand.grid(gender = c("male", "female"), = c(80, 90),
  hgt = c(170, 180))

To replicate the plot, what would be a useful set of values for gender and wgt?

Other useful functions are unique, seq, levels, quantile


Compare the structure of the returned object for each of these (i.e. using str),

predict(fit, newdata = pred_grid)
predict(fit, newdata = pred_grid, = TRUE)
predict(fit, newdata = pred_grid, interval = "confidence")

Notice how the type of object returned depends on what you asked for. How you work with the output depends on what you ask for too. I want confidence intervals as well, so since the output is an array with the same number of rows as my original data, I’ll just cbind them together.

pred_fit <- cbind(pred_grid, 
  predict(fit, newdata = pred_grid, interval = "confidence"))


Now it’s just a matter of creating the plot I want. Start with the raw data:

qplot(, wgt, data = bdims, colour = gender)

Then add a new layer that uses our predictions as the data instead of the raw data. New layers inherit all the aesthetics from the qplot command. There is no wgt column in pred_fit so we need to explicitly say the y aesthetic should instead be mapped to the fit column.

qplot(, wgt, data = bdims, colour = gender) +
  geom_line(aes(y = fit), data = pred_fit) # tada!

We might even add CIs:

qplot(, wgt, data = bdims, colour = gender) +
  geom_line(aes(y = fit), data = pred_fit) +
  geom_ribbon(aes(ymax = upr, ymin = lwr, y = NULL, fill = gender), 
    colour = NA, alpha = 0.2, data = pred_fit)

Now it might be better to downplay the raw data to make the lines easier to see.

qplot(, wgt, data = bdims, colour = gender,
  size = I(1)) +
  geom_line(aes(y = fit), data = pred_fit) +
  geom_ribbon(aes(ymax = upr, ymin = lwr, y = NULL, fill = gender), 
    colour = NA, alpha = 0.4, data = pred_fit)

Exploring models

Ok, your turn. Using the above technique, examine the predicted means of these models.

fit_1 <- lm(wgt ~ + gender, data = bdims)
pred_fit_1 <- cbind(pred_grid, 
  predict(fit_1, newdata = pred_grid, interval = "confidence"))
qplot(, wgt, data = bdims, colour = gender) +
  geom_line(aes(y = fit), data = pred_fit_1)

fit_2 <- lm(wgt ~, data = bdims)
fit_3 <- lm(wgt ~ - 1, data = bdims)
fit_4 <- lm(wgt ~ +, data = bdims)
fit_5 <- lm(wgt ~ (1 + + I(^2))*gender, data = bdims)

Can you match up the model to the descriptions below of the relationship between mean weight and waist girth?

Can you write down the equations for each model?

What is the difference between these two models?

fit_2 <- lm(wgt ~ + gender, data = bdims)
fit_2a <- lm(wgt ~ + gender - 1, data = bdims)

Produce a plot with predictions from a model that includes the height variable instead of gender (you will need to create a new prediction grid that also includes height).

Challenge question Are we just cylinders? Is there a way to use the fact that for cylinders of constant density: \[ \text{weight} \propto \text{height}\times \text{circumference}^2\].