Properties of the least squares estimates Jan 18 2019

Warmup

Let \(a\) and \(b\) be scalar constants, and \(X\) be a scalar random variable.

Fill in the blanks \[ \begin{aligned} \E{aX + b} &= \underline{\phantom{a \E{X} + b}} \\ \Var{aX + b} &= \underline{\phantom{a^2 \Var{X}}} \end{aligned} \]
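Once you have filled in the blanks, you can check your answers with a quick simulation. This is only a sketch: the constants \(a\), \(b\) and the distribution of \(X\) below are arbitrary choices for illustration.

```python
# Quick numerical check of the scalar rules (illustrative values only).
import numpy as np

rng = np.random.default_rng(0)
a, b = 3.0, -2.0                                     # arbitrary constants
X = rng.normal(loc=1.0, scale=2.0, size=1_000_000)   # X with E[X] = 1, Var[X] = 4

Y = a * X + b
print(Y.mean(), a * X.mean() + b)   # both close to a*E[X] + b
print(Y.var(), a**2 * X.var())      # both close to a^2 * Var[X]
```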

Goal

Recall that the least squares estimates are: \[ \hat{\beta}_{p\times1} = \left( X^TX \right)^{-1} X^Ty \]
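For reference, here is one way to compute \(\hat{\beta}\) numerically. This is a sketch using numpy with simulated \(X\) and \(y\); in practice you would usually rely on your software's regression routine.

```python
# Sketch: computing the least squares estimates (X^T X)^{-1} X^T y in numpy.
# The design matrix X and response y are simulated for illustration.
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # intercept + 2 predictors
beta = np.array([1.0, 2.0, -0.5])
y = X @ beta + rng.normal(scale=0.5, size=n)

# (X^T X)^{-1} X^T y, written with solve() rather than an explicit inverse
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)

# np.linalg.lstsq returns the same estimates
beta_hat_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat_lstsq)
```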

Our goal today is to learn about the statistical properties of these estimates, in particular their expectation and variance.

Random Vectors

\(\hat{\beta}\) is a vector-valued random variable, so we first need to cover a little more background.

Let \(U_1, \ldots, U_n\) be scalar random variables. Then the vector \[ \mathbf{U} = \left( U_1, \ldots, U_n \right)^T \] is a vector-valued random variable, a.k.a. a random vector.

The expectation of \(\mathbf{U}\) is the vector of expectations, \[ \E{\mathbf{U}} = \left(\E{U_1}, \ldots, \E{U_n} \right)^T \]

And the variance-covariance matrix of \(\mathbf{U}\) is \[ \Var{\mathbf{U}} = \Cov{\mathbf{U}} = \begin{pmatrix} \Var{U_1} & \Cov{U_1, U_2} & \cdots & \Cov{U_1, U_n} \\ \Cov{U_2, U_1} & \Var{U_2} & \cdots & \Cov{U_2, U_n} \\ \vdots & \vdots & \ddots & \vdots \\ \Cov{U_n, U_1} & \Cov{U_n, U_2} & \cdots & \Var{U_n} \end{pmatrix}_{n\times n} \]

For example, the errors in multiple linear regression, \(\epsilon_i,\, i = 1, \ldots, n\), are independent with mean 0 and variance \(\sigma^2\).

Then, \[ \epsilon = \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{pmatrix}, \quad \E{\epsilon} = \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{pmatrix} = \mathbf{0}, \quad \Var{\epsilon} = \begin{pmatrix} \sigma^2 & 0 & \cdots & 0 \\ 0 & \sigma^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & \sigma^2 \end{pmatrix} = \sigma^2 I_{n} \]
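A simulation sketch of this example (the values of \(n\) and \(\sigma\) below are arbitrary): drawing many independent copies of \(\epsilon\), the sample mean vector and sample covariance matrix should be close to \(\mathbf{0}\) and \(\sigma^2 I_n\).

```python
# Sketch: the sample covariance matrix of iid mean-zero errors with
# variance sigma^2 is close to sigma^2 * I. Values chosen for illustration.
import numpy as np

rng = np.random.default_rng(2)
n, sigma = 4, 1.5
eps = rng.normal(scale=sigma, size=(100_000, n))  # each row is one draw of (eps_1, ..., eps_n)

print(eps.mean(axis=0))           # close to the zero vector
print(np.cov(eps, rowvar=False))  # close to sigma^2 * I_n
```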

Properties of Expectation and Variance for random vectors

Let \(\mathbf{U}\) be an \(n \times 1\) random vector, \(A\) an \(m \times n\) matrix of constants, and \(b\) an \(m \times 1\) vector of constants.

Then: \[ \begin{aligned} \E{A\mathbf{U} + b} &= A\E{\mathbf{U}} + b \\ \Var{A\mathbf{U} + b} &= A\Var{\mathbf{U}}A^T \end{aligned} \]

These are the vector analogs of the properties you wrote down in the warmup.
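Here is a simulation sketch of these two rules; the particular \(A\), \(b\), and distribution of \(\mathbf{U}\) are made up purely for illustration.

```python
# Sketch: checking E[AU + b] = A E[U] + b and Var[AU + b] = A Var[U] A^T
# by simulation. A, b, and the distribution of U are illustrative choices.
import numpy as np

rng = np.random.default_rng(3)
A = np.array([[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]])   # 3 x 2 constant matrix
b = np.array([1.0, 1.0, 1.0])                         # 3 x 1 constant vector

mu = np.array([0.5, -0.5])
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
U = rng.multivariate_normal(mu, Sigma, size=200_000)  # rows are draws of U

V = U @ A.T + b                                       # each row is A u + b
print(V.mean(axis=0), A @ mu + b)                     # should agree
print(np.cov(V, rowvar=False))                        # close to A Sigma A^T
print(A @ Sigma @ A.T)
```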

Find \(\E{y}\) and \(\Var{y}\), where \(y_{n \times 1}\) satisfies the multiple linear regression equation \[ y = X\beta + \epsilon \] with \(\E{\epsilon} = \mathbf{0}\) and \(\Var{\epsilon} = \sigma^2I\).

Expectation of the least squares estimates

Assume the regression setup (with the usual dimensions): \[ y = X\beta + \epsilon \] where \(X\) is fixed with rank \(p\), \(\E{\epsilon} = \mathbf{0}\), and \(\Var{\epsilon} = \sigma^2 I_n\).

Fill in the blanks to show that the least squares estimates are unbiased:

\[ \begin{aligned} \E{\hat{\beta}} & = \E{\left(X^TX\right)^{-1}X^Ty} \\[10mm] & = \E{ \left(X^TX\right)^{-1}X^T \left(\phantom{X\beta + \epsilon} \right)} \quad \text{plug in the regression equation for } y \\[10mm] & = \E{ \phantom{\left(X^TX\right)^{-1}X^T X \beta} + \phantom{\left(X^TX\right)^{-1}X^T \epsilon }} \quad \text{expand} \\[10mm] & = \E{ \phantom{\beta} + \phantom{\left(X^TX\right)^{-1}X^T \epsilon }} \quad \text{simplify term on the left } A^{-1}A = I \\[10mm] & = \phantom{\beta} + \phantom{\left(X^TX\right)^{-1}X^T} \E{ \phantom{\epsilon} } \quad \text{property of expectation} \\[10mm] & = \phantom{\beta + \left(X^TX\right)^{-1}X^T \mathbf{0}} \quad \text{regression assumptions} \\[10mm] & = \beta \end{aligned} \]
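A simulation check of unbiasedness (a sketch; the design, \(\beta\), and \(\sigma\) are arbitrary): holding \(X\) fixed and repeatedly generating new errors, the average of the simulated \(\hat{\beta}\)'s should be close to \(\beta\).

```python
# Sketch: averaging beta_hat over many simulated datasets (fixed X,
# fresh errors each time) recovers beta. Setup is illustrative.
import numpy as np

rng = np.random.default_rng(4)
n, sigma = 50, 1.0
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # fixed design, p = 2
beta = np.array([2.0, -3.0])
XtX_inv_Xt = np.linalg.solve(X.T @ X, X.T)             # (X^T X)^{-1} X^T

beta_hats = []
for _ in range(20_000):
    y = X @ beta + rng.normal(scale=sigma, size=n)
    beta_hats.append(XtX_inv_Xt @ y)
beta_hats = np.array(beta_hats)

print(beta_hats.mean(axis=0))   # close to beta: unbiasedness
```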

Variance-covariance matrix of the least square estimates

Fill in the blanks to find the variance-covariance matrix of the least squares estimates:

\[ \begin{aligned} \Var{\hat{\beta}} &= \Var{\left(X^TX\right)^{-1}X^Ty} \\ &= \Var{\left(X^TX\right)^{-1}X^T X \beta + \left(X^TX\right)^{-1}X^T \epsilon} \quad \text{plug in reg. eqn. and expand}\\[10mm] &= 0 + \bigl(\phantom{\left(X^TX\right)^{-1}X^T}\bigr) \Var{\epsilon} \bigl(\phantom{\left(X^TX\right)^{-1}X^T} \bigr)^T \quad \text{property of Var}\\[10mm] & = \bigl(\phantom{\left(X^TX\right)^{-1}X^T}\bigr) \phantom{\sigma^2 I} \bigl(\phantom{\left(X^TX\right)^{-1}X^T} \bigr)^T \quad \text{regression assumption}\\[10mm] & = \phantom{\sigma^2}\bigl(\phantom{\left(X^TX\right)^{-1}X^T}\bigr) \bigl(\phantom{\left(X^TX\right)^{-1}X^T} \bigr)^T \quad \text{move scalar to front}\\[10mm] & = \sigma^2 \bigl(\phantom{\left(X^TX\right)^{-1}X^T}\bigr) \phantom{X \left(X^TX\right)^{-1}} \quad \text{distribute transpose}\\[10mm] &= \sigma^2 \phantom{\left(X^TX\right)^{-1}} \quad \text{since } A^{-1}A = I\\[10mm] & = \sigma^2 \left(X^TX\right)^{-1} \end{aligned} \]
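The same kind of simulation lets you check this result (again only a sketch, with all numerical choices made for illustration): the sample covariance matrix of many simulated \(\hat{\beta}\)'s should match \(\sigma^2 (X^TX)^{-1}\).

```python
# Sketch: with X fixed, the sample covariance matrix of the simulated
# beta_hat's is close to sigma^2 (X^T X)^{-1}. Setup is illustrative.
import numpy as np

rng = np.random.default_rng(5)
n, sigma = 50, 1.0
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([2.0, -3.0])
XtX_inv = np.linalg.inv(X.T @ X)

beta_hats = np.array([
    XtX_inv @ X.T @ (X @ beta + rng.normal(scale=sigma, size=n))
    for _ in range(20_000)
])

print(np.cov(beta_hats, rowvar=False))   # empirical covariance of beta_hat
print(sigma**2 * XtX_inv)                # theoretical sigma^2 (X^T X)^{-1}
```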

We can pull out the variance of a particular parameter estimate, say \(\hat{\beta}_i\), from the diagonal of the matrix: \[ \Var{\hat{\beta}_i} = \sigma^2(X^TX)^{-1}_{i+1,\,i+1} \] where \(A_{ij}\) indicates the element in the \(i\)'th row and \(j\)'th column of the matrix \(A\).

Why \(i+1\)?

The off-diagonal terms tell us about the covariance between parameter estimates.

Estimating \(\sigma\)

To make use of the variance-covariance results we need to be able to estimate \(\sigma^2\).

An unbiased estimate is: \[ \hat{\sigma}^2 = \frac{1}{n-p}\sum_{i =1}^{n}{e_i^2} = \frac{||e||^2}{n-p} \]

The denominator \(n-p\) is known as the residual degrees of freedom.
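In code, the estimate is just the residual sum of squares divided by \(n - p\). A sketch with simulated data (the design, coefficients, and \(\sigma\) are illustrative):

```python
# Sketch: the unbiased estimate of sigma^2 from the residuals.
import numpy as np

rng = np.random.default_rng(6)
n, p, sigma = 200, 3, 1.5
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=sigma, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta_hat              # residuals
sigma2_hat = (e @ e) / (n - p)    # ||e||^2 / (n - p)
print(sigma2_hat, sigma**2)       # close to the true sigma^2
```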

Standard errors of particular parameters

The standard error of a particular parameter is then the square root of its variance, replacing \(\sigma^2\) with its estimate: \[ \SE{\hat{\beta}_i} = \hat{\sigma} \sqrt{(X^TX)^{-1}_{i+1,\,i+1}} \]
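Putting the pieces together, here is a sketch with simulated data. Note the indexing: numpy arrays are 0-based, so the entry for \(\hat{\beta}_i\) sits at position \(i\) of the diagonal below, whereas the \(i+1\) in the formula above reflects 1-based indexing with the intercept \(\hat{\beta}_0\) in the first position.

```python
# Sketch: standard errors of the coefficient estimates, computed directly
# from sigma_hat and the diagonal of (X^T X)^{-1}. Data are illustrative.
import numpy as np

rng = np.random.default_rng(7)
n, p, sigma = 200, 3, 1.5
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=sigma, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
e = y - X @ beta_hat
sigma_hat = np.sqrt(e @ e / (n - p))

se = sigma_hat * np.sqrt(np.diag(XtX_inv))  # SE(beta_hat_i), i = 0, ..., p-1
print(se)
```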

Gauss-Markov Theorem

You might wonder whether we can find estimates with better properties. The Gauss-Markov theorem says the least squares estimates are BLUE (Best Linear Unbiased Estimator).

Of all linear, unbiased estimates, the least squares estimates have the smallest variance.

Of course, if you are willing to use a non-linear and/or biased estimate, you may be able to find one with smaller variance.

For a proof, see Section 2.8 in Faraway.

Summary

Consider the linear regression model \[ Y = X\beta + \epsilon \] where \(\E{\epsilon} = 0_{n\times 1}\), \(\Var{\epsilon} = \sigma^2 I_n\), and the matrix \(X_{n \times p}\) is fixed with rank \(p\).

The least squares estimates are \[ \hat{\beta} = (X^TX)^{-1}X^TY \]

Furthermore, the least squares estimates are BLUE, and \[ \begin{aligned} \E{\hat{\beta}} &= \beta, \qquad \Var{\hat{\beta}} = \sigma^2 (X^TX)^{-1} \\ \E{\hat{\sigma}^2} &= \E{\tfrac{1}{n-p}\sum_{i =1}^{n}{e_i^2}} = \sigma^2 \end{aligned} \]

We have not used any Normality assumptions to show these properties.