ST552 Lab 4 Jan 27th 2016

For this lab we’ll work with some data from the `openintro`

package (you may need to install it `install.packages("openintro")`

). The dataset `bdims`

contains body measurements of 507 physically active individuals. For the purposes of the lab we’ll look at trying to predict a physically active individuals weight from their waist girth and gender.

```
library(openintro)
head(bdims)
?bdims
bdims$gender <- ifelse(bdims$sex == 1, "male", "female")
```

Our goal is to practice adding regression lines to plots as a strategy to understand the models we fit.

`qplot`

works a lot like plot, except variables are evaluated inside the data frame, so if you specify `data =`

you don’t need to preface variable names with `$`

.

```
library(ggplot2)
qplot(wai.gi, wgt, data = bdims, colour = gender) # a scatter plot
```

The visual properties of the objects used to represent observations (aesthetics in ggplot2 nomenclature) can also be mapped to variables. For example, `colour = gender`

in the above maps the colour of the points to the `gender`

variable. **Try instead shape = gender or colour = hgt**

It’s easy to add a fitted regression line to a plot,

```
qplot(wai.gi, wgt, data = bdims, colour = gender) +
geom_smooth(method = "lm", se = FALSE)
```

But **beware**! This fits a separate simple linear regression to each group defined by the aesthetics and/or facets. With `geom_smooth`

you are limited to regression with a single explanatory variable, and it is hard to get out any information on the fit. For these reasons, you generally fit the model explicitly using `lm`

, then generate your own predictions to plot.

**What’s the difference between fitting two separate simple linear regressions, and one multiple linear regression that allows different slopes and intercepts for men and women?**

```
men <- lm(wgt ~ wai.gi, data = subset(bdims, gender == "male"))
women <- lm(wgt ~ wai.gi, data = subset(bdims, gender == "female"))
both <- lm(wgt ~ wai.gi*gender, data = bdims)
```

Let’s replicate this plot, by manually fitting the model and producing predicted means.

The model that corresponds to separate lines for men and women is fit in R with:

`fit <- lm(wgt ~ wai.gi*gender , data = bdims)`

A very flexible way to examine and plot fitted regression models involves predicting the mean response on a grid of values for the explanatory variables. The key idea is to stop thinking about adding the regression lines as “adding a line with a given slope and intercept”, and instead think about “adding predicted means and joining them together with lines”. This generally involves three steps:

- Creating a data frame where each row is a particular combination of explanatory values we are interested in.
- Predicting the mean at each combination of explanatory values (and maybe standard errors, or CI/PI limits) and combining them with the original data.
- Plotting the predicted means or adding them to a plot of the raw data

`expand.grid`

creates a data frame from all combinations of the variables you give it. For example, examine

```
expand.grid(gender = c("male", "female"),
wai.gi = c(80, 90),
hgt = c(170, 180))
```

**To replicate the plot, what would be a useful set of values for gender and wgt?**

Other useful functions are `unique`

, `seq`

, `levels`

, `quantile`

Compare the structure of the returned object for each of these (i.e. using `str`

),

```
predict(fit, newdata = pred_grid)
predict(fit, newdata = pred_grid, se.fit = TRUE)
predict(fit, newdata = pred_grid, interval = "confidence")
?predict.lm
```

Notice how the type of object returned depends on what you asked for. How you work with the output depends on what you ask for too. I want confidence intervals as well, so since the output is an array with the same number of rows as my original data, I’ll just `cbind`

them together.

```
pred_fit <- cbind(pred_grid,
predict(fit, newdata = pred_grid, interval = "confidence"))
head(pred_fit)
```

Now it’s just a matter of creating the plot I want. Start with the raw data:

`qplot(wai.gi, wgt, data = bdims, colour = gender)`

Then add a new layer that uses our predictions as the data instead of the raw data. New layers inherit all the aesthetics from the `qplot`

command. There is no `wgt`

column in `pred_fit`

so we need to explicitly say the y aesthetic should instead be mapped to the `fit`

column.

```
qplot(wai.gi, wgt, data = bdims, colour = gender) +
geom_line(aes(y = fit), data = pred_fit) # tada!
```

We might even add CIs:

```
qplot(wai.gi, wgt, data = bdims, colour = gender) +
geom_line(aes(y = fit), data = pred_fit) +
geom_ribbon(aes(ymax = upr, ymin = lwr, y = NULL, fill = gender),
colour = NA, alpha = 0.2, data = pred_fit)
```

Now it might be better to downplay the raw data to make the lines easier to see.

```
qplot(wai.gi, wgt, data = bdims, colour = gender,
size = I(1)) +
geom_line(aes(y = fit), data = pred_fit) +
geom_ribbon(aes(ymax = upr, ymin = lwr, y = NULL, fill = gender),
colour = NA, alpha = 0.4, data = pred_fit)
```

Ok, your turn. Using the above technique, examine the predicted means of these models.

```
fit_1 <- lm(wgt ~ wai.gi + gender, data = bdims)
pred_fit_1 <- cbind(pred_grid,
predict(fit_1, newdata = pred_grid, interval = "confidence"))
qplot(wai.gi, wgt, data = bdims, colour = gender) +
geom_line(aes(y = fit), data = pred_fit_1)
fit_2 <- lm(wgt ~ wai.gi, data = bdims)
fit_3 <- lm(wgt ~ wai.gi:gender - 1, data = bdims)
fit_4 <- lm(wgt ~ wai.gi + gender:wai.gi, data = bdims)
fit_5 <- lm(wgt ~ (1 + wai.gi + I(wai.gi^2))*gender, data = bdims)
```

**Can you match up the model to the descriptions below of the relationship between mean weight and waist girth?**

- A single straight line for both genders
- Parallel straight lines for both genders
- Separate quadratic lines for each gender
- Lines that pass through (weight = 0, waist girth = 0) with different slopes for each gender
- Lines with the same intercept, but different slopes for each gender

**Can you write down the equations for each model?**

**What is the difference between these two models?**

```
fit_2 <- lm(wgt ~ wai.gi + gender, data = bdims)
fit_2a <- lm(wgt ~ wai.gi + gender - 1, data = bdims)
```

**Produce a plot with predictions from a model that includes the height variable instead of gender (you will need to create a new prediction grid that also includes height).**

**Challenge question** Are we just cylinders? Is there a way to use the fact that for cylinders of constant density: \[ \text{weight} \propto \text{height}\times \text{circumference}^2\].