```
library(faraway)
library(tidyverse)
```

The code in today’s lab comes straight from Chapter 3 in Linear Models with R with only minor edits that I think improve clarity.

# Preliminaries

## Brief review of `for` loops

An example of a `for` loop in R:

```
n_sims <- 10
output <- numeric(n_sims)
for (i in 1:n_sims) {
  output[i] <- mean(rnorm(20))
}
output
```

```
## [1] -0.21798434 -0.16131401 0.50563841 -0.25441329 0.14596754
## [6] 0.18129472 -0.14947953 0.02673052 0.09009637 -0.12237762
```

Your TA will remind you of the three parts of a `for` loop:

- Output
- Sequence
- Body
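Labelled on the example above, the three parts look like this:

```r
n_sims <- 10
output <- numeric(n_sims)        # output: space allocated before the loop runs
for (i in 1:n_sims) {            # sequence: the values that i steps through
  output[i] <- mean(rnorm(20))   # body: the work done once per step
}
```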

There are alternatives to `for` loops that remove all the repetitive book-keeping and let you focus on **what** is being done rather than **how** it is being done, for example the `map()` functions in the purrr package. We won't use them here, because when you see these statistical methods for the first time, I think the `for` loop syntax more closely matches the textual description of what is happening (at least as Faraway describes it).
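For reference only, here is a sketch of the same simulation written with `map_dbl()` (purrr loads with the tidyverse); we won't use this style in the lab:

```r
library(purrr)

# same simulation as the for loop above, with the book-keeping handled by map_dbl()
output <- map_dbl(1:10, ~ mean(rnorm(20)))
output
```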

## The `sample()` function

Used in two ways in Chapter 3 of Faraway:

To randomly permute a set of values:

`sample(1:5)`

`## [1] 3 4 2 1 5`

`sample(c("a", "b", "c"))`

`## [1] "a" "c" "b"`

With no further arguments, `sample()` will randomly reorder the vector you give it.

To sample with replacement from a set of numbers, add `replace = TRUE`:

`sample(1:5, replace = TRUE)`

`## [1] 4 3 3 3 5`

`sample(c("a", "b", "c"), replace = TRUE)`

`## [1] "a" "c" "c"`

In both cases we are omitting the `size` argument, which by default is the same as the length of the input vector.
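For example, supplying `size` explicitly draws that many values (asking for more draws than the input has values requires `replace = TRUE`):

```r
sample(1:5, size = 2)                           # two values, without replacement
sample(c("a", "b"), size = 5, replace = TRUE)   # five draws from two values
```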

# Bootstrap confidence intervals

**Overall idea:** To understand the sampling distribution of a parameter estimate, generate data that approximate what repeated experiments/samples would look like.

**Procedure** (from Faraway, *Model based bootstrap*):

Repeat many times:

- Generate \(\epsilon^*\) by sampling with replacement from the residuals, \(e_1, \ldots, e_n\).
- Form \(y^* = X\hat{\beta} + \epsilon^*\).
- Compute \(\hat{\beta}^*\) from \((X, y^*)\).

Examine distribution of parameter estimates generated by bootstrap to make inferences on true parameter value.

## Setup

Model-based bootstrapping requires us to assume a specific model form and fit it:

```
lmod <- lm(Species ~ Area + Elevation + Nearest + Scruz + Adjacent,
           data = gala)
```

And retain the fitted values and residuals:

```
resids <- residuals(lmod)
preds <- fitted(lmod)
```

## A single bootstrap sample

A single bootstrap dataset is constructed by adding the fitted values (i.e. \(X\hat{\beta}\)) and bootstrapped errors (a sample with replacement from the residuals):

```
# bootstrapped errors
(errors <- sample(resids, replace = TRUE))
```

```
## Eden Isabela Espanola Champion SanCristobal
## -20.314202 -39.403558 49.343513 14.635734 73.048122
## Pinzon Isabela SantaCruz Enderby Gardner1
## -42.475375 -39.403558 182.583587 -28.785943 62.033276
## Pinzon Champion Bartolome Pinta SantaFe
## -42.475375 14.635734 38.273154 -111.679486 -23.376486
## Pinta Bartolome Genovesa Isabela Espanola
## -111.679486 38.273154 40.497176 -39.403558 49.343513
## Pinzon Rabida Baltra Coamano SantaCruz
## -42.475375 -5.553122 -58.725946 38.383916 182.583587
## Darwin Caldwell Pinta Tortuga Seymour
## 19.018992 -26.330659 -111.679486 -36.935732 -5.805095
```

```
# bootstrapped response
preds + errors
```

```
## Baltra Bartolome Caldwell Champion Coamano
## 96.4117443 -46.6767121 78.6741729 25.0000000 36.6642065
## Daphne.Major Daphne.Minor Darwin Eden Enderby
## 0.6123302 -5.4838900 173.5645950 -0.4717408 92.8192184
## Espanola Fernandina Gardner1 Gardner2 Genovesa
## 5.1811115 111.6253322 34.2398785 -47.0456905 -23.8736612
## Isabela Marchena Onslow Pinta Pinzon
## 274.7240716 126.9676948 44.5344083 176.2759284 199.8188885
## Las.Plazas Rabida SanCristobal SanSalvador SantaCruz
## -7.3995684 70.0000000 148.2259319 316.0602339 444.0000000
## SantaFe SantaMaria Seymour Tortuga Wolf
## 104.3954776 169.2859692 -61.8743916 16.0000000 20.8954789
```

In this example our model involves multiple terms, so it's a little hard to visualise. But here are the original fitted values (blue open dots) along with a bootstrapped set of response values (black filled dots), plotted against one of the predictors, `Elevation`:

```
ggplot(gala, aes(Elevation, preds)) +
  geom_point(color = "#377EB8", shape = 21) +
  geom_point(aes(y = preds + errors), color = "black") +
  ylim(c(-200, 500))
```

Here’s another one:

```
errors <- sample(resids, replace = TRUE)
ggplot(gala, aes(Elevation, preds)) +
  geom_point(color = "#377EB8", shape = 21) +
  geom_point(aes(y = preds + errors), color = "black") +
  ylim(c(-200, 500))
```

Notice this process retains the structure implied by the fitted model. So, if our model has a large effect due to `Elevation`, that effect is preserved in our bootstrap samples.

## Repeating many times

We can repeat this process many times, each time fitting the model and retaining the estimated coefficients:

```
nb <- 4000
coefmat <- matrix(NA, nrow = nb, ncol = 6)
colnames(coefmat) <- names(coef(lmod))
for (i in 1:nb) {
  booty <- preds + sample(resids, replace = TRUE)
  bmod <- update(lmod, booty ~ .)
  coefmat[i, ] <- coef(bmod)
}
```

Then, using these many estimated coefficients, we can construct confidence intervals using the percentile method:

`(cis <- apply(coefmat, 2, quantile, probs = c(0.025, 0.975)))`

```
## (Intercept) Area Elevation Nearest Scruz Adjacent
## 2.5% -25.04466 -0.06304558 0.2259949 -1.683388 -0.6046124 -0.10553011
## 97.5% 41.41317 0.02103163 0.4198246 2.139087 0.1760862 -0.03806448
```
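The `apply()` call is just taking column-wise quantiles; for a single column it reduces to a plain `quantile()` call. A standalone sketch, using a simulated vector as a stand-in for one column of `coefmat`:

```r
set.seed(1)
fake_boot <- rnorm(4000)                      # stand-in for one column of bootstrap estimates
quantile(fake_boot, probs = c(0.025, 0.975))  # a percentile-method 95% interval
```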

Or create density plots of individual parameters:

```
coefmat <- as_tibble(coefmat)
cis <- as_tibble(cis)
ggplot(coefmat, aes(x = Area)) +
  geom_density() +
  geom_vline(aes(xintercept = Area), data = cis, linetype = "dashed",
             color = "grey20") +
  labs(title = expression(
    paste("Bootstrap estimate of the sampling distribution for ", hat(beta)[Area])))
```

# Permutation tests

**Overall idea:** To understand the null distribution of a test statistic, approximate what repeated experiments/samples would look like, **if the null hypothesis were true**.

**Procedure**

Repeat many times:

- Generate \((X^*, y^*)\) by permutation, in a way suggested by the null hypothesis.
- Compute test statistic \(T(X^*, y^*)\).

Compare the observed test statistic to the distribution of test statistics generated by permutation to conduct a hypothesis test.

## An overall F-test example

Faraway chooses an example where the overall F-test p-value is not too small (I've given this model object a different name than Faraway does, to emphasize this is not the same model as `lmod` above):

```
lmod_small <- lm(Species ~ Nearest + Scruz, data = gala)
lms <- summary(lmod_small)
```

The overall F-test statistic is available:

`lms$fstatistic`

```
## value numdf dendf
## 0.6019558 2.0000000 27.0000000
```

which we can use to find the corresponding p-value:

`1 - pf(lms$fstatistic[1], lms$fstatistic[2], lms$fstatistic[3])`

```
## value
## 0.5549255
```
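Equivalently, `pf()` can return the upper tail directly via `lower.tail = FALSE`, which avoids the subtraction (and its floating-point cancellation when p-values are tiny). Using the statistic values printed above:

```r
# upper-tail probability of an F(2, 27) distribution at the observed statistic
pf(0.6019558, df1 = 2, df2 = 27, lower.tail = FALSE)
```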

This p-value is based on our assumption of Normal errors. We’ll save the statistic for later:

`obs_fstat <- lms$fstatistic[1]`

The permutation approach attempts to find the null distribution of this test statistic without relying on any distributional assumptions. The null hypothesis in this overall F-test is that none of the explanatory variables impact our response. If this were the case, the response value we see for a particular island shouldn't depend on its attributes. We should have been just as likely to see 58 species on Bartolome and 31 species on Baltra instead of the other way around. In fact, any permutation of the response values assigned to islands should have been equally likely to occur.

*(Notice this feels a little awkward with observational data. It is generally much easier to justify a permutation approach in a randomized experiment, where the treatment variables were actually assigned to units at random.)*

## A single permutation

To generate a null distribution of the test statistic, we consider permutations of the response variable.

`permuted_species <- sample(gala$Species)`

Again, it's hard to visualise because we have a few variables involved, but take a look at this permuted response (black filled points) against `Elevation`:

```
ggplot(gala, aes(Elevation, Species)) +
  geom_point(color = "#377EB8", shape = 21) +
  geom_point(aes(y = permuted_species), color = "black") +
  ylim(c(-200, 500))
```

Notice that a key difference between this and the bootstrap approach is that here we break relationships. In the original data there was a strong relationship between `Elevation` and `Species`, but under the null hypothesis there isn't, and our permuted data reflect this.

To get a test statistic from these data, we fit the model of interest and extract the F-statistic:

```
lm_permutation <- lm(permuted_species ~ Nearest + Scruz, data = gala)
summary(lm_permutation)$fstatistic[1]
```

```
## value
## 7.434452
```

This is an example of one test statistic generated under the null hypothesis; to conduct a test we need a whole lot of them.

## Scaling up

```
nperms <- 4000
fstats <- numeric(nperms)
for (i in 1:nperms) {
  lmods <- lm(sample(Species) ~ Nearest + Scruz, data = gala)
  fstats[i] <- summary(lmods)$fstatistic[1]
}
```

These statistics then form our null distribution:

```
ggplot(mapping = aes(x = fstats)) +
  geom_density() +
  geom_vline(aes(xintercept = obs_fstat), linetype = "dashed",
             color = "grey20") +
  labs(title = "Permutation based null distribution for the overall F-statistic",
       x = "F-statistic")
```

And to get a p-value, we find the proportion of test statistics generated under the null hypothesis that are more extreme than that observed:

`mean(fstats > obs_fstat)`

`## [1] 0.5565`

In this case, pretty close to the p-value derived using the Normal assumptions.

## An individual parameter example

You can also use a permutation approach to test individual parameter estimates in a regression. Since the null hypothesis is now about only one parameter, after accounting for the effects of the other variables, we permute only the values of the explanatory variable of interest.

For example, in `lmod_small` we might ask only about the parameter on `Scruz`, \(\beta_{\text{Scruz}}\). For the null hypothesis \(\beta_{\text{Scruz}} = 0\), the Normal-based inference gives us a p-value of 0.28:

`summary(lmod_small)$coef`

```
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 98.4765494 28.3561013 3.472852 0.001751461
## Nearest 1.1792236 1.9184475 0.614676 0.543914527
## Scruz -0.4406401 0.4025312 -1.094673 0.283329519
```

We'll still use the t-statistic as our test statistic, so we'll keep the value from the observed data:

`obs_tstat <- summary(lmod_small)$coef[3, 3]`

But now we'll use permutation to find a null distribution, rather than relying on theory and comparing to a Student's t-distribution.

```
tstats <- numeric(nperms)
for (i in 1:nperms) {
  lmods <- lm(Species ~ Nearest + sample(Scruz), data = gala)
  tstats[i] <- summary(lmods)$coef[3, 3]
}
```

Here’s the resulting null distribution, with the t-distribution that the Normal assumption would lead to overlaid:

```
ggplot(mapping = aes(x = tstats)) +
  geom_density() +
  geom_vline(aes(xintercept = obs_tstat), linetype = "dashed",
             color = "grey20") +
  stat_function(fun = dt, args = list(df = lmod_small$df.residual),
                color = "grey50") +
  labs(title = expression(
    paste("Permutation based null distribution for the t-statistic, ", beta[Scruz] == 0)),
    x = "t-statistic")
```

Notice the null distribution based on the permutation test doesn’t seem that close to the t-distribution. However, the p-value isn’t particularly different in this case:

`mean(abs(tstats) > abs(obs_tstat))`

`## [1] 0.26375`