Prediction in regression Feb 01 2019

Next week

Midterm in class on Friday. Posted under today’s date:

Homework 4 has a due date two weeks away (you decide if you want to do it now or later)

Next week:


We’ve built a model: \[ y = X\beta + \epsilon \] Now, given a new vector of values of the explanatory variables \(x_0\), we can predict the response: \[ \hat{y_0} = x_0^T\hat{\beta} \]
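As a minimal sketch of this point prediction, here is the one-predictor case in plain Python (the course materials themselves use R). The data values and the helper names `fit_simple_ols` and `predict` are made up for illustration; they are not from Faraway or from the course code.

```python
# Toy data (hypothetical, for illustration only): one explanatory variable.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]


def fit_simple_ols(x, y):
    """Least-squares estimates [intercept, slope] for y = b0 + b1 * x + error."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return [intercept, slope]


def predict(x0, beta_hat):
    """Point prediction x0^T beta_hat; x0 must include the leading 1."""
    return sum(a * b for a, b in zip(x0, beta_hat))


beta_hat = fit_simple_ols(x, y)
x0 = [1.0, 3.5]  # leading 1 for the intercept, then the new x value
y0_hat = predict(x0, beta_hat)
print(beta_hat, y0_hat)
```

Note that \(x_0\) carries a leading 1 whenever the model has an intercept, so that \(x_0^T\hat{\beta}\) lines up with the columns of \(X\).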

But what is the uncertainty in this prediction?

Two kinds:

Faraway example

Suppose we have built a regression model that predicts the rental price of houses in a given area based on predictors such as the number of bedrooms and closeness to a major highway.

Two kinds of predictions:

Prediction of a future value

Suppose a specific house comes on the market with characteristics \(x_0\). Its rental price will be \(x_0^T\beta + \epsilon\).

Since \(\E{\epsilon} = 0\), our predicted price will be \(x_0^T\hat{\beta}\), but in assessing the variance of this prediction we must include the variance of \(\epsilon\).

Our uncertainty comes from two sources: uncertainty in our estimates, and the variability of the response about its mean.

Prediction of the mean response

Suppose we ask the question – “What would a house with characteristics \(x_0\) rent for on average?”

This price is \(x_0^T\beta\) and is again predicted by \(x_0^T \hat{\beta}\), but now only the variance in \(\hat{\beta}\) needs to be taken into account.

Our uncertainty comes only from the uncertainty in our estimates.

Leads to two types of interval

\[ \Var{x_0^T \hat{\beta}} = \sigma^2 x_0^T(X^TX)^{-1}x_0 \]
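This variance follows in one step from the variance of the least-squares estimate. Assuming the usual setup (errors uncorrelated with constant variance \(\sigma^2\), which is an assumption stated here rather than on the slide), we have \(\Var{\hat{\beta}} = \sigma^2 (X^TX)^{-1}\), and \(x_0\) is a fixed vector, so: \[ \Var{x_0^T \hat{\beta}} = x_0^T \, \Var{\hat{\beta}} \, x_0 = x_0^T \left( \sigma^2 (X^TX)^{-1} \right) x_0 = \sigma^2 x_0^T(X^TX)^{-1}x_0 \]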

Assuming the future \(\epsilon\) is independent of \(\hat{\beta}\), a prediction interval for a future response is: \[ \hat{y_0} \pm t_{n-p}^{(\alpha/2)}\hat{\sigma}\sqrt{1 + x_0^T(X^TX)^{-1}x_0} \]
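To see where the \(1\) under the square root comes from, write the future response as \(y_0 = x_0^T\beta + \epsilon_0\) (the subscript on \(\epsilon_0\) is notation introduced here for the future error). The prediction error is \(y_0 - \hat{y_0}\), and by the independence assumption the two variances add: \[ \Var{y_0 - \hat{y_0}} = \Var{\epsilon_0} + \Var{x_0^T\hat{\beta}} = \sigma^2 + \sigma^2 x_0^T(X^TX)^{-1}x_0 = \sigma^2\left(1 + x_0^T(X^TX)^{-1}x_0\right) \]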

A confidence interval for the mean response is: \[ \hat{y_0} \pm t_{n-p}^{(\alpha/2)}\hat{\sigma}\sqrt{x_0^T(X^TX)^{-1}x_0} \] which will always be narrower than the prediction interval, since it omits the \(1\) under the square root.
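The two intervals can be sketched side by side in plain Python (the course itself works in R) for a model with an intercept and one explanatory variable. Everything named here is hypothetical: the toy data, the helper functions, and the choice to pass the critical value in by hand. The value 2.306 is the 0.975 quantile of a t distribution with \(n - p = 8\) degrees of freedom, read from a standard t table, so the intervals are 95% intervals.

```python
import math

# Toy data (hypothetical): n = 10 observations, intercept plus one predictor (p = 2).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
y = [2.3, 4.1, 5.8, 8.4, 9.9, 12.2, 13.8, 16.1, 18.3, 19.7]


def xtx_inverse(x):
    """(X^T X)^{-1} for the design matrix with columns [1, x], as a 2x2 list."""
    n = len(x)
    sx = sum(x)
    sxx = sum(xi * xi for xi in x)
    det = n * sxx - sx * sx
    return [[sxx / det, -sx / det], [-sx / det, n / det]]


def quad_form(x0, m):
    """The quadratic form x0^T m x0 for a 2-vector x0 and 2x2 matrix m."""
    return sum(x0[i] * m[i][j] * x0[j] for i in range(2) for j in range(2))


def intervals(x, y, x_new, t_crit):
    """Return (y0_hat, confidence interval for the mean, prediction interval)."""
    n, p = len(x), 2
    inv = xtx_inverse(x)
    xty = [sum(y), sum(xi * yi for xi, yi in zip(x, y))]
    # beta_hat = (X^T X)^{-1} X^T y
    beta_hat = [inv[0][0] * xty[0] + inv[0][1] * xty[1],
                inv[1][0] * xty[0] + inv[1][1] * xty[1]]
    rss = sum((yi - beta_hat[0] - beta_hat[1] * xi) ** 2 for xi, yi in zip(x, y))
    sigma_hat = math.sqrt(rss / (n - p))
    x0 = [1.0, x_new]
    y0_hat = beta_hat[0] + beta_hat[1] * x_new
    h0 = quad_form(x0, inv)  # x0^T (X^T X)^{-1} x0
    half_ci = t_crit * sigma_hat * math.sqrt(h0)        # mean response
    half_pi = t_crit * sigma_hat * math.sqrt(1.0 + h0)  # future value
    return (y0_hat,
            (y0_hat - half_ci, y0_hat + half_ci),
            (y0_hat - half_pi, y0_hat + half_pi))


y0_hat, conf_int, pred_int = intervals(x, y, x_new=5.5, t_crit=2.306)
print("fit:", y0_hat)
print("confidence interval for mean response:", conf_int)
print("prediction interval for future value:", pred_int)
```

Both intervals are centred at the same \(\hat{y_0}\); only the half-width differs.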

Work through Faraway’s example

Normally, we would start with an exploratory analysis of the data and a detailed consideration of what model to use, but let’s be rash and just fit a model and start predicting.

Find Rmarkdown in or get at:

We are interested in predicting body fat (%) as a function of physical measurements (e.g. weight, height, circumference of hip, etc.)

(This is different data from the lab; remember, there we were predicting weight.)

Your Turn #1


Take a quick read through of the documentation on this dataset.

In context of the data (discuss with your neighbours):

Your Turn #2

Go through the code. Discuss each step:

There are three questions for you in the code. Work with your neighbours to answer them.


What can go wrong with predictions?

  1. Bad model
  2. Quantitative extrapolation (explanatory values beyond those observed)
  3. Qualitative extrapolation (situations beyond those which generated the data, i.e. extrapolation to a different population, e.g. females)
  4. Overconfidence due to overfitting
  5. Black swans (very unusual events that don’t occur in sample, but do occasionally occur)
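For item 2, a special case shows how the intervals respond to quantitative extrapolation. Assume, purely for illustration, a simple linear regression with an intercept and a single explanatory variable, so that \(x_0 = (1, x^*)^T\) for a new value \(x^*\) (this is not the body fat model, and \(x^*\) is notation introduced here). Then the term that appears in both intervals reduces to: \[ x_0^T(X^TX)^{-1}x_0 = \frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2} \] The intervals are narrowest at \(x^* = \bar{x}\) and widen as \(x^*\) moves away from the observed data. This widening only reflects estimation uncertainty under the assumed model; it says nothing about whether the model still holds outside the observed range.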