Multiple Linear Regression

Today

Matrix warmup
Multiple Linear Regression
Matrix setup

Matrix warmup

See handout

Simple linear regression

Recall in simple linear regression:

Have $n$ observations of a response, $y_{i}$, and a single explanatory variable, $x_{i}$.

The response is related to the explanatory variable by: $y_{i} = β_{0} + β_{1} x_{i} + ϵ_{i}, \quad i = 1, \dots, n$

where $ϵ_{i}$ are independent and identically distributed with expected value 0 and variance $σ^{2}$.

Now we have more than one explanatory variable.

Have $n$ observations of a response, $y_{i}$, and a set of explanatory variables, $(x_{i 1}, x_{i 2}, \dots, x_{i (p - 1)})$.

The response is related to the explanatory variables by: $y_{i} = β_{0} + β_{1} x_{i 1} + β_{2} x_{i 2} + \dots + β_{p - 1} x_{i (p - 1)} + ϵ_{i}, \quad i = 1, \dots, n$

where $ϵ_{i}$ are independent and identically distributed with expected value 0 and variance $σ^{2}$.
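
To make the model concrete, here is a minimal Python sketch that simulates responses from it. The coefficient values, the explanatory values, and the choice of normally distributed errors are all assumptions made for illustration; the model itself only requires errors with expected value 0 and variance $σ^{2}$.

```python
import random

def simulate_response(x_rows, beta, sigma, seed=0):
    """Simulate y_i = beta_0 + beta_1 x_i1 + ... + beta_(p-1) x_i(p-1) + eps_i.

    x_rows: list of n rows, each holding the p - 1 explanatory values.
    beta:   list of p coefficients, intercept first (hypothetical values).
    sigma:  error standard deviation; normal errors are an assumption here.
    """
    rng = random.Random(seed)
    y = []
    for row in x_rows:
        # Systematic part: intercept plus the weighted explanatory values.
        mean = beta[0] + sum(b * x for b, x in zip(beta[1:], row))
        # Random part: an independent error with mean 0 and sd sigma.
        y.append(mean + rng.gauss(0.0, sigma))
    return y

# Hypothetical example with n = 4 observations and p = 3 parameters.
x_rows = [[1.0, 2.0], [0.5, 1.5], [2.0, 0.0], [3.0, 1.0]]
beta = [1.0, 2.0, -1.0]
y = simulate_response(x_rows, beta, sigma=0.5)
```

Setting `sigma=0.0` removes the error term, so the returned values are exactly the systematic part of the model.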

Example: Galápagos Islands

Faraway 2.6

Measurements on 30 Galápagos Islands are made.

First 5 islands:

Island	Species	Area	Elevation	Nearest	Scruz	Adjacent
Baltra	58	25.09	346	0.6	0.6	1.84
Bartolome	31	1.24	109	0.6	26.3	572.33
Caldwell	3	0.21	114	2.8	58.7	0.78
Champion	25	0.1	46	1.9	47.4	0.18
Coamano	2	0.05	77	1.9	1.9	903.82

Variable Descriptions

?gala

gala	R Documentation

Species diversity on the Galapagos Islands

Format

The dataset contains the following variables

Species: the number of plant species found on the island
Endemics: the number of endemic species
Area: the area of the island (km $^{2}$ )
Elevation: the highest elevation of the island (m)
Nearest: the distance from the nearest island (km)
Scruz: the distance from Santa Cruz island (km)
Adjacent: the area of the adjacent island (square km)

A possible model

$\begin{aligned} {Species}_{i} & = β_{0} + β_{1} {Area}_{i} + β_{2} {Elevation}_{i} + β_{3} {Nearest}_{i} \\ & \quad + β_{4} {Scruz}_{i} + β_{5} {Adjacent}_{i} + ϵ_{i}, \quad i = 1, \dots, n \end{aligned}$

E.g. $i = 1$, Baltra: $58 = β_{0} + β_{1} (25.09) + β_{2} (346) + β_{3} (0.6) + β_{4} (0.6) + β_{5} (1.84) + ϵ_{1}$

Your turn:

What does $i$ index?
What is the value of $n$?
What is the value of $p$?

General matrix form

$\begin{aligned} (\begin{matrix} y_{1} \\ y_{2} \\ ⋮ \\ y_{n} \end{matrix}) & = (\begin{matrix} 1 & x_{11} & x_{12} & \dots & x_{1 (p - 1)} \\ 1 & x_{21} & x_{22} & \dots & x_{2 (p - 1)} \\ ⋮ & ⋮ & ⋮ & ⋮ & ⋮ \\ 1 & x_{n 1} & x_{n 2} & \dots & x_{n (p - 1)} \end{matrix}) (\begin{matrix} β_{0} \\ β_{1} \\ ⋮ \\ β_{p - 1} \end{matrix}) + (\begin{matrix} ϵ_{1} \\ ϵ_{2} \\ ⋮ \\ ϵ_{n} \end{matrix}) \\ y & = X β + ϵ \end{aligned}$ where $\begin{aligned} y_{n \times 1} & = (y_{1}, y_{2}, \dots, y_{n})^{T} \\ ϵ_{n \times 1} & = (ϵ_{1}, ϵ_{2}, \dots, ϵ_{n})^{T} \\ β_{p \times 1} & = (β_{0}, β_{1}, \dots, β_{p - 1})^{T} \\ X_{n \times p} & = (\begin{matrix} 1 & x_{11} & x_{12} & \dots & x_{1 (p - 1)} \\ 1 & x_{21} & x_{22} & \dots & x_{2 (p - 1)} \\ ⋮ & ⋮ & ⋮ & ⋮ & ⋮ \\ 1 & x_{n 1} & x_{n 2} & \dots & x_{n (p - 1)} \end{matrix}) \end{aligned}$
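
As a check that the matrix form matches the scalar model, row $i$ of $y = X β + ϵ$ can be written out. Labelling the leading column of ones as $x_{i 0} = 1$ is notation introduced here for convenience, not part of the setup above:

$\begin{aligned} y_{i} & = \sum_{j = 0}^{p - 1} x_{i j} β_{j} + ϵ_{i} \\ & = β_{0} + β_{1} x_{i 1} + \dots + β_{p - 1} x_{i (p - 1)} + ϵ_{i}, \quad i = 1, \dots, n \end{aligned}$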

Galápagos: Matrix form

$y_{30 \times 1} = (\begin{matrix} 58 \\ 31 \\ 3 \\ 25 \\ 2 \\ ⋮ \end{matrix}), X_{30 \times 6} = (\begin{array}{rrrrrr} 1 & 25.09 & 346 & 0.6 & 0.6 & 1.84 \\ 1 & 1.24 & 109 & 0.6 & 26.3 & 572.33 \\ 1 & 0.21 & 114 & 2.8 & 58.7 & 0.78 \\ 1 & 0.1 & 46 & 1.9 & 47.4 & 0.18 \\ 1 & 0.05 & 77 & 1.9 & 1.9 & 903.82 \\ ⋮ & ⋮ & ⋮ & ⋮ & ⋮ & ⋮ \end{array})$

$β_{6 \times 1} = (\begin{matrix} β_{0} \\ β_{1} \\ β_{2} \\ β_{3} \\ β_{4} \\ β_{5} \end{matrix}), ϵ_{30 \times 1} = (\begin{matrix} ϵ_{1} \\ ϵ_{2} \\ ϵ_{3} \\ ϵ_{4} \\ ϵ_{5} \\ ⋮ \end{matrix})$
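
The same construction can be sketched in plain Python for the first five islands. The values are copied from the matrix above; the variable names (`islands`, `y`, `X`) are chosen here for illustration. Because only five rows are used, this `X` is $5 \times 6$ rather than $30 \times 6$.

```python
# First five islands:
# (name, Species, Area, Elevation, Nearest, Scruz, Adjacent)
islands = [
    ("Baltra",    58, 25.09, 346, 0.6,  0.6,   1.84),
    ("Bartolome", 31,  1.24, 109, 0.6, 26.3, 572.33),
    ("Caldwell",   3,  0.21, 114, 2.8, 58.7,   0.78),
    ("Champion",  25,  0.10,  46, 1.9, 47.4,   0.18),
    ("Coamano",    2,  0.05,  77, 1.9,  1.9, 903.82),
]

# Response vector y: the Species column.
y = [row[1] for row in islands]

# Design matrix X: a leading 1 for the intercept,
# followed by the five explanatory variables.
X = [[1.0, *row[2:]] for row in islands]

n_rows = len(X)     # number of islands used here
n_cols = len(X[0])  # number of columns of X, one per coefficient
```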

Your Turn

Write out the design matrix, $X$, for the following models, using the data for the first five islands:

$\begin{aligned} {Species}_{i} & = β_{0} + β_{1} {Area}_{i} + β_{2} {Nearest}_{i} + ϵ_{i} \\ {Species}_{i} & = β_{1} {Area}_{i} + β_{2} {Area}_{i}^{2} + ϵ_{i} \\ {Species}_{i} & = β_{0} + β_{1} 1_{\{{Area}_{i} > 1\}} + ϵ_{i} \end{aligned}$ where $1_{\{\cdot\}}$ is an indicator variable that takes the value 1 when the condition in the argument is true, and 0 otherwise.

Fitted values and residuals

If we had an estimate for the $β$ vector, $\hat{β} = (\hat{β}_{0}, \hat{β}_{1}, \dots, \hat{β}_{p - 1})^{T}$, then we could define fitted value and residual vectors: $\begin{aligned} \hat{y} & = (\hat{y}_{1}, \dots, \hat{y}_{n})^{T} = X \hat{β} \\ e & = \hat{ϵ} = (e_{1}, \dots, e_{n})^{T} = y - X \hat{β} \end{aligned}$
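
A minimal Python sketch of these two definitions follows. The toy data and the value of `beta_hat` are hypothetical, picked by hand so the arithmetic is easy to follow; `beta_hat` is not a fitted estimate, and `mat_vec` is a helper defined here rather than a library function.

```python
def mat_vec(X, b):
    """Return the matrix-vector product X b as a list."""
    return [sum(x_ij * b_j for x_ij, b_j in zip(row, b)) for row in X]

# Toy data: n = 3 observations, intercept plus one explanatory variable (p = 2).
X = [[1.0, 0.0],
     [1.0, 1.0],
     [1.0, 2.0]]
y = [1.0, 3.0, 4.0]

# Hypothetical estimate, chosen by hand for illustration only.
beta_hat = [1.0, 1.5]

# Fitted values: y_hat = X beta_hat
y_hat = mat_vec(X, beta_hat)

# Residuals: e = y - X beta_hat
e = [y_i - f_i for y_i, f_i in zip(y, y_hat)]

print(y_hat)  # [1.0, 2.5, 4.0]
print(e)      # [0.0, 0.5, 0.0]
```

A different choice of `beta_hat` gives different fitted values and residuals, which is why the question of how to choose $\hat{β}$ comes next.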

Questions to answer this week:

How will we find $\hat{β}$ ?
What properties do the estimates have?

Multiple Linear Regression Jan 14 2019
