Violations of assumptions
In rough order of importance:
- Systematic form of the model, \(\E{Y} = X\beta\). If violated, the parameters in the model may be meaningless, estimates may be biased.
- Independence of errors, \(\epsilon_i\) independent of \(\epsilon_j\) for all \(i\) and \(j\). If violated, estimates are still unbiased, but standard errors are generally inappropriate.
- Constant variance, \(\Var{\epsilon_i} = \sigma^2\) for all \(i\). If violated, variance in predictions may not be properly quantified.
- Normality, \(\epsilon \sim N()\). Can rely on CLT for large samples. If violated, prediction intervals are probably innappropriate.
Using the residuals to diagnose problems
If our model is correct, \(\epsilon \sim N(0, \sigma^2 I)\). But, we don’t observe the errors.
Usually, we use the residuals as our best guess for the errors, and examine them for problems with the assumptions.
However, residuals by construction are not equal variance, or uncorrelated (you can try to standardize), but in practice the effects are small and ignored.
We can’t prove the assumptions are satisfied, but we can look for evidence of gross violations.
Graphical versus formal inferential methods
I am a strong proponent of graphical methods over formal tests for assumption checking.
Tests can only provide quantification of a deviation you are expecting, graphics reveal the unexpected.
Tests tend to make you focus on statistical significance not practical significance.
For example, a large sample of data that is just a little non-Normal, will tend to give tiny p-values in a test of Normality, but for our purposes it isn’t really a problem.
Residual plots to examine
Residuals versus fitted values
Residuals versus explanatories (both those included and those excluded from the model)
Normal probability plot (Q-Q plot) of the residuals
Anything else you can think of that might reveal structure in the residuals. For example, if measurements are made over time or space, look for temporal or spatial patterns in the residuals.
What to look for
An even width band vertically centered around zero as you move left to right (always put the residuals on the y-axis).
Points falling close to a straight line
Your turn: Part One
Handout (Charlotte will bring):
Part One Describe what you see in the residual plots that suggests a violation of assumptions.
Your turn: Part Two
Part Two The same five models are examined but in a random order, with a much smaller sample size. Can you match these diagnostics to those in Part Two??
Your turn: Part Three
Part Three Do you see any violations here?
Common problems and possible solutions
Non-constant spread
- transform response (background knowledge, trial & error, Box-Cox)
- use more complicated models (glm, gee)
Non-linearity
- transform response
- transform predictor
- allow for curvature (add predictor\(^2\), splines, gam)
- use a non-linear model
Non-normality
- transform response
- use more complicated models (glm)
Structure when examined against an excluded variable - include it