Next week
Midterm in class on Friday. Posted under today’s date:
- Study guide
- Previous year’s midterm (and solution)
Homework 4 has a due date two weeks away (you decide if you want to do it now or later)
Next week:
- Weds lecture: review. Bring your questions.
- Weds lab: Trevor will go over a relevant comp exam question (you might like to look it over beforehand)
Prediction
We’ve built a model: \[ y = X\beta + \epsilon \] Now, given a new vector of values of the explanatory variables \(x_0\), we can predict the response \[ \hat{y_0} = x_0^T\hat{\beta} \]
But what is the uncertainty in this prediction?
Two kinds:
- prediction of the mean response
- prediction of a future observation
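As a concrete sketch of the point prediction \(\hat{y_0} = x_0^T\hat{\beta}\) (simulated data invented purely for illustration, not the course example): compute it by hand and check it against predict().

```r
# Simulated example (made-up data): point prediction x0' beta-hat
set.seed(1)
dat <- data.frame(x1 = runif(50), x2 = runif(50))
dat$y <- 2 + 3 * dat$x1 - dat$x2 + rnorm(50)

fit <- lm(y ~ x1 + x2, data = dat)

x0 <- c(1, 0.5, 0.25)          # intercept, x1, x2 for a new observation
drop(x0 %*% coef(fit))         # x0' beta-hat, by hand
predict(fit, newdata = data.frame(x1 = 0.5, x2 = 0.25))  # same value
```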
Faraway example
Suppose we have built a regression model that predicts the rental price of houses in a given area based on predictors such as the number of bedrooms and closeness to a major highway.
Two kinds of predictions:
- Prediction of a future value
- Prediction of the mean response
Prediction of a future value
Suppose a specific house comes on the market with characteristics \(x_0\). Its rental price will be \(x_0^T\beta + \epsilon\).
Since \(\E{\epsilon} = 0\), our predicted price will be \(x_0^T\hat{\beta}\), but in assessing the variance of this prediction, we must include the variance of \(\epsilon\).
Our uncertainty comes both from the uncertainty in our estimates and from the variability of the response about its mean.
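To make the extra term explicit: the future observation is \(y_0 = x_0^T\beta + \epsilon\), so the prediction error is \(y_0 - \hat{y_0} = \epsilon + x_0^T(\beta - \hat{\beta})\), and since the future \(\epsilon\) is independent of \(\hat{\beta}\),
\[ \Var{y_0 - \hat{y_0}} = \Var{\epsilon} + \Var{x_0^T\hat{\beta}} = \sigma^2\left(1 + x_0^T(X^TX)^{-1}x_0\right). \]
This is the quantity that appears under the square root in the prediction interval below.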
Prediction of the mean response
Suppose we ask the question – “What would a house with characteristics \(x_0\) rent for on average?”
This price is \(x_0^T\beta\) and is again predicted by \(x_0^T \hat{\beta}\), but now only the variance in \(\hat{\beta}\) needs to be taken into account.
Our uncertainty comes only from the uncertainty in our estimates.
Leads to two types of interval
\[ \Var{x_0^T \hat{\beta}} = \sigma^2 x_0^T(X^TX)^{-1}x_0 \]
Assuming the future \(\epsilon\) is independent of \(\hat{\beta}\), a prediction interval for a future response is: \[ \hat{y_0} \pm t_{n-p}^{(\alpha/2)}\hat{\sigma}\sqrt{1 + x_0^T(X^TX)^{-1}x_0} \]
A confidence interval for the mean response is: \[ \hat{y_0} \pm t_{n-p}^{(\alpha/2)}\hat{\sigma}\sqrt{x_0^T(X^TX)^{-1}x_0} \] which will always be narrower.
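A sketch of both intervals in R, continuing the simulated fit from the earlier sketch (names invented for illustration): predict() computes them directly, and the by-hand version matches the formulas above.

```r
# Built-in intervals
new_x <- data.frame(x1 = 0.5, x2 = 0.25)
predict(fit, newdata = new_x, interval = "prediction", level = 0.95)
predict(fit, newdata = new_x, interval = "confidence", level = 0.95)

# By hand, matching the formulas
x0    <- c(1, 0.5, 0.25)
sig   <- summary(fit)$sigma                 # sigma-hat
XtXi  <- summary(fit)$cov.unscaled          # (X'X)^{-1}
tcrit <- qt(0.975, df = fit$df.residual)    # t_{n-p} critical value
yhat0 <- drop(x0 %*% coef(fit))
h0    <- drop(t(x0) %*% XtXi %*% x0)        # x0' (X'X)^{-1} x0
yhat0 + c(-1, 1) * tcrit * sig * sqrt(1 + h0)  # prediction interval
yhat0 + c(-1, 1) * tcrit * sig * sqrt(h0)      # confidence interval (narrower)
```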
Work through Faraway’s example
Normally, we would start with an exploratory analysis of the data and a detailed consideration of what model to use, but let’s be rash and just fit a model and start predicting.
Find the R Markdown file in rstudio.cloud or get it at: stat552.cwick.co.nz/lecture/11-faraway-fat.Rmd
We are interested in predicting body fat (%) as a function of physical measurements (e.g. weight, height, circumference of hip, etc.)
(Different data from the lab; remember, there we were predicting weight.)
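A rough sketch of the kind of code in the Rmd, assuming the faraway package’s fat data and its usual column names (with brozek as the percent body fat measure); treat the details as illustrative rather than as the posted file.

```r
library(faraway)
data(fat)

# Body fat (Brozek %) as a function of physical measurements
lmod <- lm(brozek ~ age + weight + height + neck + chest + abdom +
             hip + thigh + knee + ankle + biceps + forearm + wrist,
           data = fat)

# Predict for a "typical" individual: the median of each predictor
x0 <- apply(fat[, c("age", "weight", "height", "neck", "chest", "abdom",
                    "hip", "thigh", "knee", "ankle", "biceps", "forearm",
                    "wrist")], 2, median)
predict(lmod, newdata = data.frame(t(x0)), interval = "prediction")
predict(lmod, newdata = data.frame(t(x0)), interval = "confidence")
```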
Your Turn #1
`?fat`
Take a quick read through of the documentation on this dataset.
In context of the data (discuss with your neighbours):
- What would a confidence interval on the mean response tell us? When might it be useful?
- What would a prediction interval on a response tell us? When might it be useful?
Your Turn #2
Go through the code. Discuss each step:
- what is happening conceptually?
- what is the code doing?
- are there other ways to do it?
There are three questions for you in the code. Work with your neighbours to answer them.
Notes
Prediction intervals become wider the further we are from observed values of the predictors.
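A small simulated illustration of this (invented numbers, not the course data): the prediction interval width grows as \(x_0\) moves away from the bulk of the observed predictor values.

```r
set.seed(2)
x <- runif(100, 0, 10)
y <- 1 + 2 * x + rnorm(100)
m <- lm(y ~ x)

new_x <- data.frame(x = c(5, 10, 20, 50))        # from the centre of the data to far beyond it
pi <- predict(m, newdata = new_x, interval = "prediction")
cbind(new_x, width = pi[, "upr"] - pi[, "lwr"]) # widths increase with distance
```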
These intervals depend on the model being correctly specified. In practice, you never know the true model. We do our best to specify a good model, but there is always uncertainty in the form of the model.
This model uncertainty is not reflected in these intervals. We take into account parameter uncertainty, but model uncertainty is harder to quantify.
What can go wrong with predictions?
- Bad model
- Quantitative extrapolation (explanatory values beyond those observed)
- Qualitative extrapolation (situations beyond those which generated the data, i.e. extrapolation to a different population, e.g. females)
- Overconfidence due to overfitting
- Black swans (very unusual events that don’t occur in sample, but do occasionally occur)