Least squares estimates of the regression parameters Jan 16 2019

Warmup

Recall from last time we can set up a multiple linear regression model in the matrix form: \[ y = X\beta + \epsilon \]

Give the name and dimensions of each term.
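As a quick preview, here is an added numpy sketch (not part of the original handout; the data are simulated, and the convention that \(X\) carries a column of ones for the intercept is an assumption) generating data from this model, with the dimensions noted in comments:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3                                # n observations, p regression parameters

X = np.column_stack([np.ones(n),            # intercept column of ones
                     rng.normal(size=(n, p - 1))])  # explanatory variables; X is n x p
beta = np.array([1.0, 2.0, -0.5])           # beta: length-p parameter vector (p x 1 column in the notes)
eps = rng.normal(scale=0.5, size=n)         # epsilon: length-n vector of random errors (n x 1)
y = X @ beta + eps                          # y: length-n response vector (n x 1)
```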

Today’s goal

Derive the form of the estimates for the parameter vector \(\beta\).

Least Squares

Just like in simple linear regression, we’ll estimate \(\beta\) by least squares. In simple linear regression this involved finding \(\hat{\beta}_0\) and \(\hat{\beta}_1\) to minimise the sum of squared residuals: \[ \text{sum of squared residuals SLR} = \sum_{i = 1}^{n} e_i^2 = \sum_{i = 1}^{n}\left( y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i) \right)^2 \]
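As a small numerical aside (an added sketch with simulated data, not from the original notes): the SLR sum of squared residuals is simply a function of a candidate intercept and slope, and least squares picks the pair that makes it smallest.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=30)
y = 2 + 0.7 * x + rng.normal(scale=1.0, size=30)   # simulated SLR data

def ssr(b0, b1):
    """Sum of squared residuals for a candidate intercept b0 and slope b1."""
    resid = y - (b0 + b1 * x)
    return np.sum(resid ** 2)

# A candidate near the true (2, 0.7) gives a much smaller SSR than a poor one.
print(ssr(0.0, 0.0), ssr(2.0, 0.7))
```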

Your turn: What procedure do you use to minimise a function? E.g. if \(f(x)\) is a function of a single real value \(x\), how do you find the \(x\) that minimises \(f(x)\)?

(2 min discussion)

For multiple linear regression the least squares estimate of \(\beta\) is the vector \(\hat{\beta}\) that minimises the sum of squared residuals: \[ \text{sum of squared residuals MLR} = \sum_{i = 1}^{n} e_i^2 = ||e||^2 = \left(y - X\hat{\beta}\right)^T \left(y - X\hat{\beta}\right) \]

Your turn: Expand the matrix product on the right into four terms. Be careful with the order of matrix multiplication and recall that \(\left(AB\right)^T = B^TA^T\).

\[ \sum_{i = 1}^{n} e_i^2 = ||e||^2 = \left(y - X\hat{\beta}\right)^T \left(y - X\hat{\beta}\right) \]

Consider the terms: \[ -\hat{\beta}^TX^Ty \quad \text{and} \quad -y^TX\hat{\beta} \]

Argue that these can be combined into the single term \[ -2\hat{\beta}^TX^Ty \]

(Hint: consider the dimensions of these terms)
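A numeric sanity check of this expansion (an added sketch with simulated data, not part of the handout): the four-term expansion matches the matrix form exactly, and the two cross terms are equal scalars, so they combine into \(-2\hat{\beta}^TX^Ty\).

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)
b = rng.normal(size=p)                       # an arbitrary candidate beta_hat

ssr_matrix = (y - X @ b) @ (y - X @ b)       # (y - Xb)'(y - Xb)
four_terms = y @ y - y @ X @ b - b @ X.T @ y + b @ X.T @ X @ b

print(np.isclose(ssr_matrix, four_terms))    # True: the expansion is exact
print(np.isclose(y @ X @ b, b @ X.T @ y))    # True: the cross terms are equal scalars
```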

Finding the minimum

Now our objective is to find \(\hat{\beta}\) that minimises: \[ y^Ty - 2\hat{\beta}^TX^Ty + \hat{\beta}^T X^TX\hat{\beta} \]

The usual procedure would be to take the derivative with respect to \(\hat{\beta}\), set it to zero, and solve for \(\hat{\beta}\). Except \(\hat{\beta}\) is a vector! We need to use vector calculus.

Vector calculus

You should be familiar with the usual differentiation rules for scalars \(a\) and \(x\): \[ \frac{d}{dx}\left(ax\right) = a \qquad \text{and} \qquad \frac{d}{dx}\left(ax^2\right) = 2ax \]

There are analogs when we want to take the derivative with respect to a vector \(\mathbf{x}\) (here \(\mathbf{a}\) is a constant vector and \(A\) a constant symmetric matrix): \[ \frac{\partial}{\partial \mathbf{x}}\left(\mathbf{a}^T\mathbf{x}\right) = \frac{\partial}{\partial \mathbf{x}}\left(\mathbf{x}^T\mathbf{a}\right) = \mathbf{a} \qquad \text{and} \qquad \frac{\partial}{\partial \mathbf{x}}\left(\mathbf{x}^TA\mathbf{x}\right) = 2A\mathbf{x} \]
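If you want to convince yourself of the vector rules numerically, here is an added sketch comparing them against central finite differences (numpy only; the helper num_grad is just for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
k = 4
a = rng.normal(size=k)
A = rng.normal(size=(k, k))
A = (A + A.T) / 2                     # make A symmetric
x0 = rng.normal(size=k)

def num_grad(f, x, h=1e-6):
    """Central finite-difference gradient of a scalar function f at x."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

print(np.allclose(num_grad(lambda v: a @ v, x0), a))              # d/dx a'x = a
print(np.allclose(num_grad(lambda v: v @ A @ v, x0), 2 * A @ x0))  # d/dx x'Ax = 2Ax
```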

Use the rules above to take the derivative of the sum of squared residuals with respect to the vector \(\hat{\beta}\):

\[ \begin{aligned} \frac{\partial}{\partial \hat{\beta}} \left( y^Ty - 2\hat{\beta}^TX^Ty + \hat{\beta}^T X^TX\hat{\beta} \right) &= \end{aligned} \]

Normal Equations

Setting the above derivative to zero leads to the Normal Equations. The least squares estimates satisfy: \[ X^Ty = X^TX \hat{\beta} \]

If \(X^TX\) is invertible, the least squares estimates are (fill me in): \[ \hat{\beta} = \left(\phantom{X^T}\phantom{X} \right)^{-1}\phantom{X}^Ty \]

If \(X\) has full column rank \(p\) (its \(p\) columns are linearly independent) then \(X^TX\) will be invertible.
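A minimal sketch of computing the estimate numerically (added here; the simulated data and variable names are not from the handout). Solving the normal equations with np.linalg.solve avoids forming the inverse explicitly, and np.linalg.lstsq gives the same answer via a more numerically stable factorisation:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 60, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.5, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)        # solves X'X beta_hat = X'y
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)  # library least squares routine

print(np.allclose(beta_hat, beta_lstsq))            # True: same least squares estimate
```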

Fitted Values and Residuals

Plug in the least squares estimate for \(\hat{\beta}\) to find the fitted values and residuals \[ \begin{aligned} \hat{y} = X\hat{\beta} = \\ \hat{\epsilon} = e = y - X\hat{\beta} = \end{aligned} \]
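Numerically (an added sketch continuing the same style of simulated example), the fitted values and residuals follow directly from \(\hat{\beta}\), and by construction they add back up to \(y\):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 60, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.5, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat            # fitted values
e = y - y_hat                   # residuals

print(np.allclose(y, y_hat + e))   # True: y decomposes into fitted values plus residuals
```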

Hat matrix

The hat matrix is: \[ H = X\left(X^TX\right)^{-1}X^T \] named because it puts “hats” on the response, i.e. multiplying the response by the hat matrix gives the fitted values: \[ Hy = \hat{y} \]
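Here is an added sketch (simulated data, not from the handout) building \(H\) explicitly and checking that it does put a hat on \(y\). Forming the \(n \times n\) matrix with an explicit inverse is fine for illustration, though you would not do this for large \(n\):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 60, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.5, size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T           # hat matrix, n x n
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

print(np.allclose(H @ y, X @ beta_hat))        # True: Hy = y_hat
```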

Your Turn: Show \(\left(I- H\right)X = \pmb{0}\)

Other properties of \(H\)

\(H\) is symmetric, \(H^T = H\), and idempotent, \(HH = H\). The same holds for \(I - H\): it is symmetric and \((I - H)(I - H) = I - H\).

You can use these results to argue that the residuals are orthogonal to the columns of \(X\), i.e. show \(e^TX = \pmb{0}\): \[ \begin{aligned} e^TX &= \left((I - H)y\right)^TX \quad \text{plug in the form of the residuals} \\ &= y^T(I - H)^T X \quad \text{distribute the transpose} \\ &= y^T(I - H) X \quad \text{symmetry of } I - H \\ &= y^T \pmb{0} \quad \text{from above} \\ &= \pmb{0} \end{aligned} \]
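All of these properties can be checked numerically as well (an added sketch with simulated data): \(H\) is symmetric and idempotent, \((I - H)X = \pmb{0}\), and therefore the residuals are orthogonal to the columns of \(X\).

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 60, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.5, size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T
e = (np.eye(n) - H) @ y                       # residuals written as (I - H)y

print(np.allclose(H, H.T))                    # symmetric
print(np.allclose(H @ H, H))                  # idempotent
print(np.allclose((np.eye(n) - H) @ X, 0))    # (I - H)X = 0
print(np.allclose(e @ X, 0))                  # e'X = 0: residuals orthogonal to columns of X
```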

Next time

What are the properties of the least squares estimates?