Multiple Linear Regression Jan 14 2019

Today

Matrix warmup

See handout

Simple linear regression

Recall in simple linear regression:

We have \(n\) observations of a response, \(y_i\), and a single explanatory variable, \(x_i\).

The response is related to the explanatory variable by: \[ y_i = \beta_0 + \beta_1 x_{i} + \epsilon_i \quad i = 1, \ldots, n \]

where \(\epsilon_i\) are independent and identically distributed with expected value 0, and variance \(\sigma^2\).

Multiple linear regression

Now we have more than one explanatory variable.

We have \(n\) observations of a response, \(y_i\), and a set of explanatory variables, \((x_{i1}, x_{i2}, \ldots, x_{i(p-1)})\).

The response is related to the explanatory variables by: \[ y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_{p-1} x_{i(p-1)} + \epsilon_i \quad i = 1, \ldots, n \]

where \(\epsilon_i\) are independent and identically distributed with expected value 0, and variance \(\sigma^2\).
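To make the model concrete, here is a minimal simulation sketch with two explanatory variables (so \(p = 3\)). The coefficient values, \(\sigma\), and the ranges of `x1` and `x2` are arbitrary choices for illustration. The errors are drawn from a Normal distribution only for convenience; the model itself asks only for mean 0 and variance \(\sigma^2\).

```r
set.seed(1)                      # for reproducibility

n     <- 100                     # number of observations
beta  <- c(1, 2, -0.5)           # (beta_0, beta_1, beta_2): hypothetical values
sigma <- 1.5                     # error standard deviation: hypothetical value

# Two explanatory variables, so p - 1 = 2 and p = 3
x1 <- runif(n, min = 0, max = 10)
x2 <- runif(n, min = 0, max = 5)

# i.i.d. errors with mean 0 and variance sigma^2.
# Normality is an extra assumption, made here only to have something to draw from.
eps <- rnorm(n, mean = 0, sd = sigma)

# The model equation, applied to all i = 1, ..., n at once
y <- beta[1] + beta[2] * x1 + beta[3] * x2 + eps

head(data.frame(y, x1, x2))
```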

Example: Galápagos Islands

Faraway 2.6

Measurements were made on 30 Galápagos islands.

First 5 islands:

           Species   Area  Elevation  Nearest  Scruz  Adjacent
Baltra          58  25.09        346      0.6    0.6      1.84
Bartolome       31   1.24        109      0.6   26.3     572.3
Caldwell         3   0.21        114      2.8   58.7      0.78
Champion        25    0.1         46      1.9   47.4      0.18
Coamano          2   0.05         77      1.9    1.9     903.8

Variable Descriptions

?gala
gala R Documentation

Species diversity on the Galapagos Islands

Format

The dataset contains the following variables

Species: the number of plant species found on the island

Endemics: the number of endemic species

Area: the area of the island (km\(^2\))

Elevation: the highest elevation of the island (m)

Nearest: the distance from the nearest island (km)

Scruz: the distance from Santa Cruz island (km)

Adjacent: the area of the adjacent island (square km)

A possible model

\[ \begin{aligned} \text{Species}_i &= \beta_0 + \beta_1 \text{Area}_i + \beta_2 \text{Elevation}_i + \beta_3 \text{Nearest}_i + \\ & \quad \beta_4 \text{Scruz}_i + \beta_5 \text{Adjacent}_i + \epsilon_i \quad i = 1, \ldots, n \end{aligned} \]

E.g. \(i = 1\), Baltra: \[ 58 = \beta_0 + \beta_1 \times 25.09 + \beta_2 \times 346 + \beta_3 \times 0.6 + \beta_4 \times 0.6 + \beta_5 \times 1.84 + \epsilon_1 \]

Your turn:

General matrix form

\[ \begin{aligned} \left(\begin{matrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{matrix}\right) &= \left(\begin{matrix} 1 & x_{11} & x_{12} & \ldots & x_{1 (p-1)}\\ 1 & x_{21} & x_{22} & \ldots & x_{2 (p-1)}\\ \vdots & \vdots & \vdots & \vdots & \vdots \\ 1 & x_{n1} & x_{n2} & \ldots & x_{n (p-1)}\\ \end{matrix}\right) \left(\begin{matrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_{p-1} \end{matrix}\right) + \left(\begin{matrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{matrix}\right) \\ y &= X\beta + \epsilon \end{aligned} \] where \[ \begin{aligned} y_{n\times 1} &= (y_1, y_2, \ldots, y_n)^T \\ \epsilon_{n\times 1} &= (\epsilon_1, \epsilon_2, \ldots, \epsilon_n)^T \\ \beta_{p\times 1} &= (\beta_0, \beta_1, \ldots, \beta_{p-1})^T \\ X_{n \times p} &= \left(\begin{matrix} 1 & x_{11} & x_{12} & \ldots & x_{1 (p-1)}\\ 1 & x_{21} & x_{22} & \ldots & x_{2 (p-1)}\\ \vdots & \vdots & \vdots & \vdots & \vdots \\ 1 & x_{n1} & x_{n2} & \ldots & x_{n (p-1)}\\ \end{matrix}\right) \end{aligned} \]
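For a concrete small case (sizes picked only for illustration), take \(n = 3\) and \(p = 3\), so there are two explanatory variables: \[ \left(\begin{matrix} y_1 \\ y_2 \\ y_3 \end{matrix}\right) = \left(\begin{matrix} 1 & x_{11} & x_{12} \\ 1 & x_{21} & x_{22} \\ 1 & x_{31} & x_{32} \end{matrix}\right) \left(\begin{matrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{matrix}\right) + \left(\begin{matrix} \epsilon_1 \\ \epsilon_2 \\ \epsilon_3 \end{matrix}\right) \]

Multiplying out the first row gives \(y_1 = \beta_0 + \beta_1 x_{11} + \beta_2 x_{12} + \epsilon_1\), which is the scalar model for \(i = 1\); the other rows work the same way.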

Galápagos: Matrix form

\[ y_{30\times 1} = \left( \begin{array}{c} 58\\ 31\\ 3\\ 25\\ 2 \\ \vdots \end{array} \right), \, X_{30\times 6} = \left( \begin{array}{rrrrrr} 1 & 25.09 & 346 & 0.6 & 0.6 & 1.84\\ 1 & 1.24 & 109 & 0.6 & 26.3 & 572.33\\ 1 & 0.21 & 114 & 2.8 & 58.7 & 0.78\\ 1 & 0.1 & 46 & 1.9 & 47.4 & 0.18\\ 1 & 0.05 & 77 & 1.9 & 1.9 & 903.82 \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ \end{array} \right) \]

\[ \beta_{6 \times 1} = \left( \begin{array}{c} \beta_0 \\ \beta_1 \\ \beta_2 \\ \beta_3 \\ \beta_4 \\ \beta_5 \end{array} \right), \quad \epsilon_{30 \times 1} = \left( \begin{array}{c} \epsilon_1 \\ \epsilon_2 \\ \epsilon_3 \\ \epsilon_4 \\ \epsilon_5 \\ \vdots \end{array} \right) \]
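The first five rows of \(y\) and \(X\) can be built in R as a rough sketch. The values are typed in by hand from the matrix form above, and `gala5` is a name made up for this example; the full 30-island data set is `gala` in the faraway package.

```r
# First five islands, typed in from the matrix form above
gala5 <- data.frame(
  Species   = c(58, 31, 3, 25, 2),
  Area      = c(25.09, 1.24, 0.21, 0.10, 0.05),
  Elevation = c(346, 109, 114, 46, 77),
  Nearest   = c(0.6, 0.6, 2.8, 1.9, 1.9),
  Scruz     = c(0.6, 26.3, 58.7, 47.4, 1.9),
  Adjacent  = c(1.84, 572.33, 0.78, 0.18, 903.82),
  row.names = c("Baltra", "Bartolome", "Caldwell", "Champion", "Coamano")
)

# Response vector y (length 5 here; length 30 for the full data)
y <- gala5$Species

# Design matrix X: a column of ones, then the five explanatory variables
X <- cbind(Intercept = 1, as.matrix(gala5[, -1]))

dim(X)   # rows = islands, columns = p = 6
X
```

The column of ones is what carries \(\beta_0\) into every row of \(X\beta\).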

Your Turn

Write out the design matrix, \(X\), for the following models, using the data for the first five islands:

\[ \begin{aligned} \text{Species}_i &= \beta_0 + \beta_1 \text{Area}_i + \beta_2 \text{Nearest}_{i} + \epsilon_i \\ \text{Species}_i &= \beta_1 \text{Area}_i + \beta_2 \text{Area}^2_i + \epsilon_i \\ \text{Species}_i &= \beta_0 + \beta_1 1_{\{\text{Area}_i > 1 \}} + \epsilon_i \end{aligned} \] where \(1_{\{\cdot\}}\) is an indicator variable that takes the value 1 when the condition in its argument is true, and 0 otherwise.
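One way to check hand-written answers is R's `model.matrix()`, which builds a design matrix from a model formula. The sketch below assumes the first five islands are typed in by hand as a data frame called `gala5` (a name chosen here for illustration). Running it prints the matrices, so try the exercise on paper first.

```r
# First five islands; only the variables these three models need
gala5 <- data.frame(
  Species = c(58, 31, 3, 25, 2),
  Area    = c(25.09, 1.24, 0.21, 0.10, 0.05),
  Nearest = c(0.6, 0.6, 2.8, 1.9, 1.9),
  row.names = c("Baltra", "Bartolome", "Caldwell", "Champion", "Coamano")
)

# Model 1: intercept, Area and Nearest
X1 <- model.matrix(Species ~ Area + Nearest, data = gala5)

# Model 2: no intercept (the "0 +"), Area and Area squared.
# I() protects ^ so it means arithmetic squaring inside a formula.
X2 <- model.matrix(Species ~ 0 + Area + I(Area^2), data = gala5)

# Model 3: intercept and a 0/1 indicator for Area > 1
X3 <- model.matrix(Species ~ as.numeric(Area > 1), data = gala5)

X1
X2
X3
```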

Fitted values and residuals

If we had an estimate for the \(\beta\) vector, \[ \hat{\beta} = \left(\hat{\beta}_0, \hat{\beta}_1 , \ldots, \hat{\beta}_{p-1} \right)^T \]

then we could define the fitted value and residual vectors: \[ \begin{aligned} \hat{y} &= (\hat{y}_1, \ldots, \hat{y}_n)^T = X\hat{\beta} \\ e &= \hat{\epsilon} = (e_1, \ldots, e_n)^T = y - X\hat{\beta} \end{aligned} \]
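As a small numerical sketch of these two definitions, take the model \(\text{Species}_i = \beta_0 + \beta_1 \text{Area}_i + \epsilon_i\) on the first five islands. The values in `beta_hat` are made up for illustration; they are not least squares estimates.

```r
# First five islands: Species (response) and Area (explanatory variable)
y    <- c(58, 31, 3, 25, 2)
area <- c(25.09, 1.24, 0.21, 0.10, 0.05)

# Design matrix for Species_i = beta_0 + beta_1 Area_i + epsilon_i
X <- cbind(1, area)

# Made-up coefficient values, NOT least squares estimates
beta_hat <- c(10, 1.5)

# Fitted values: y_hat = X beta_hat (drop() turns the 5 x 1 matrix into a vector)
y_hat <- drop(X %*% beta_hat)

# Residuals: e = y - X beta_hat
e <- y - y_hat

cbind(y, y_hat, e)
```

For Baltra this gives a fitted value of \(10 + 1.5 \times 25.09 = 47.635\) and a residual of \(58 - 47.635 = 10.365\).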

Questions to answer this week: