---
title: Predicting Body Fat
author: Charlotte Wickham
date: "2019-02-01"
---

Based on code from http://www.maths.bath.ac.uk/~jjf23/LMR/scripts2/prediction.R, also found in Section 4.2 of Linear Models with R

```{r setup}
library(faraway)
library(tidyverse)
data(fat, package = "faraway")
```

## Background

**Q1**: Take a quick read through of the documentation on this dataset, `?fat`.  We will be using the `brozek` variable, a hard to obtain but accurate measure of body fat percentage, as our response.

In context of the data (discuss with your neighbours):

  `y`: body fat percentage (based on Brozek's eqn.)
  `x_0`: a specific value for age, weight, height and the 10 body circumference.

  * What would a confidence interval on the mean response tell us? When might it be useful?
  
  For men with characteristic given by `x_0`, the 95% confidence interval for _average_ body fat percentage.  Public health scenario...
  
  * What would a prediction interval on a response tell us? When might it be useful?
  
  For a man with characteristics given by `x_0`, 95% prediction interval for body percentage.  E.g. Nick wants to know his body fat.  Doctor who wants to estimate body fat for a specific male patient in front of them. 


Faraway says: 

> "Normally, we would start with an exploratory analysis of the data and a detailed consideration of what model to use but let's be rash and just fit a model and start predicting."

```{r}
lmod <- lm(brozek ~ age + weight + height + neck + chest + 
    abdom + hip + thigh + knee + ankle + biceps + forearm + wrist, data = fat)

X <- model.matrix(lmod)
```

## Prediction for a "typical" man

This is Faraway's definition of an average man:
```{r}
(x0 <- apply(X, 2, median))  # 
```

He decided he wanted a prediction for an "average man". We needed an `x0` to play with. 

We can then predict bodyfat for a man with these physical characteristics:
```{r}
(y0 <- sum(x0 * coef(lmod))) # x_0^T \hat{\beta}
```

Or let R do the work:
```{r}
(pred <- predict(lmod, new = data.frame(t(x0)), se = TRUE))
# se.fit is the standard error in the predicted *mean* response.
# residual.scale is estimate of sigma
```

**Q2**: Find the `se.fit` value by using matrix algebra i.e. $\hat{\sigma}\sqrt{x_0^T(X^TX)^{-1}x_0}$ and `residual.scale` instead.

```{r}
pred$residual.scale * sqrt( t(x0) %*% solve(t(X) %*% X) %*% x0   )
```


`predict()` will construct a prediction interval if you ask for it:
```{r}
predict(lmod, new = data.frame(t(x0)), interval = "prediction")
```

**Q3** Using the values from Q2 replicate the calculation of this interval.

```{r}
se_mean <- pred$residual.scale * 
  sqrt(t(x0) %*% solve(t(X) %*% X) %*% x0)
se_pred <- pred$residual.scale * 
  sqrt(1 + t(x0) %*% solve(t(X) %*% X) %*% x0)

y0 + c(-1, 1) * qt( 0.975, df = 238) * as.numeric(se_pred)
```

If you ask for a `"confidence"` interval instead you'll get the interval for the **mean** response:
```{r}
predict(lmod, new = data.frame(t(x0)), interval = "confidence") 
```

## Width of intervals

Intervals are wider the further we are from the average explanatory values. Let's look at a prediction for someone who is at the (sample) 95th percentile on all variables:
```{r}
(x1 <- apply(X, 2, quantile, probs = 0.95))
```

Compare the width of these intervals to those above:
```{r}
predict(lmod, new = data.frame(t(x1)), interval = "prediction")
predict(lmod, new = data.frame(t(x1)), interval = "confidence")
```

## Aside on other packages

The `add_predictions()` function we saw in lab doesn't calculate standard errors so it isn't very useful in this scenario.  However `data_grid()` may be.  If you don't provide  variables it will pick "typical" values for all those needed in `.model`:
```{r}
library(modelr)
data_grid(fat, .model = lmod)
```

It returns a tibble so it works nicely with `predict()` without having to do any `t()` or `data.frame()` calls:
```{r}
typical_man <- data_grid(fat, .model = lmod)
predict(lmod, newdata = typical_man, interval = "prediction")
```

The broom package has an `augment()` function that will add predictions and their standard errors and return then in tibble form:
```{r}
library(broom)
augment(lmod, newdata = typical_man)
```
Notice the new columns `.fitted` and `.se.fit` which you could use to construct intervals.