Problems with predictors Feb 20 2019

Roadmap

Done:

To Do:

There will be 8 homeworks in total; recall that your lowest score is dropped.

Today

Problems with predictors (Faraway 7)

Seat position in cars

data(seatpos, package = "faraway")
?seatpos

Car drivers like to adjust the seat position for their own comfort. Car designers would find it helpful to know where different drivers will position the seat depending on their size and age. Researchers at the HuMoSim laboratory at the University of Michigan collected data on 38 drivers.

The dataset contains the following variables: Age (years), Weight (lbs), HtShoes (height in shoes, cm), Ht (bare-foot height, cm), Seated (seated height, cm), Arm (lower-arm length, cm), Thigh (thigh length, cm), Leg (lower-leg length, cm), and the response hipcenter (horizontal distance of the midpoint of the hips from a fixed location in the car, mm).

library(ggplot2)
# hip position against bare-foot height
ggplot(seatpos, aes(Ht, hipcenter)) +
  geom_point()

library(faraway)   # for sumary(), Faraway's compact version of summary()
lmod <- lm(hipcenter ~ ., data = seatpos)
sumary(lmod)
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) 436.432128 166.571619  2.6201  0.01384
## Age           0.775716   0.570329  1.3601  0.18427
## Weight        0.026313   0.330970  0.0795  0.93718
## HtShoes      -2.692408   9.753035 -0.2761  0.78446
## Ht            0.601345  10.129874  0.0594  0.95307
## Seated        0.533752   3.761894  0.1419  0.88815
## Arm          -1.328069   3.900197 -0.3405  0.73592
## Thigh        -1.143119   2.660024 -0.4297  0.67056
## Leg          -6.439046   4.713860 -1.3660  0.18245
## 
## n = 38, p = 9, Residual SE = 37.72029, R-Squared = 0.69

Variance inflation factors

Let $R_i^2$ be the $R^2$ from the regression of the $i$th explanatory variable on all the other explanatory variables. That is, the proportion of the variation in the $i$th explanatory variable that is explained by the other explanatory variables.

If the $i$th variable was orthogonal to the other variables, $R_i^2 = 0$.

If the $i$th variable was a linear combination of the other variables, $R_i^2 = 1$.

$$\operatorname{Var}(\hat{\beta}_j) = \sigma^2 \left( \frac{1}{1 - R_j^2} \right) \frac{1}{\sum_i (x_{ij} - \bar{x}_j)^2},$$ where $\frac{1}{1 - R_j^2}$ is known as the variance inflation factor.
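For instance, the variance inflation factor for Ht can be computed directly from this definition (a minimal sketch; it regresses Ht on the remaining predictors in seatpos):

r2 <- summary(lm(Ht ~ . - hipcenter, data = seatpos))$r.squared
1 / (1 - r2)   # variance inflation factor for Ht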

This is not a violation of the model assumptions.

Detecting multicollinearity

In the seat example we see the classic symptoms: the model $R^2$ is large, but no variable is individually significant, and there are large standard errors on terms that should be highly significant.

  1. Look at the correlation matrix of the explanatory variables. But this will only identify pairs of correlated explanatories, not more complicated linear relationships.
  2. Regress $X_i$ on the other explanatory variables and look for a high $R_i^2$; equivalently, compute the variance inflation factors directly.
  3. Look at the eigenvalues of $X^T X$ and look for large condition numbers, $\kappa = \sqrt{\lambda_1 / \lambda_p} > 30$.

Example

Go through the example in R.
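A minimal sketch of the three diagnostics applied to the seatpos fit (the object names are my own):

x <- model.matrix(lmod)[, -1]    # predictor matrix, intercept column dropped

round(cor(x), 2)                 # 1. pairwise correlations

vif(x)                           # 2. variance inflation factors (from faraway)

e <- eigen(crossprod(x))$values  # 3. eigenvalues of X^T X
sqrt(e[1] / e)                   # condition numbers; kappa > 30 signals trouble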

What to do about multicollinearity?

Errors in variables

We assumed fixed X.

You can also use least squares when X is random, provided you do inference conditional on the observed X.

If X is measured with error, i.e. $$X = X^a + \delta, \qquad Y = X^a \beta + \epsilon,$$ then the least squares estimates will be biased (usually towards zero if $X^a$ and $\delta$ are unrelated).
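The attenuation is easy to see in a small simulation (my own sketch, not from the notes; the slope, sample size, and error variances are arbitrary choices):

set.seed(1)
n  <- 1000
xa <- rnorm(n)             # true, error-free predictor
y  <- 2 * xa + rnorm(n)    # true slope is 2
x  <- xa + rnorm(n)        # observed predictor = truth + error, Var(delta) = 1
coef(lm(y ~ xa))["xa"]     # close to 2
coef(lm(y ~ x))["x"]       # close to 2 * Var(xa) / (Var(xa) + Var(delta)) = 1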

There are “errors in variables” estimation techniques.

Linear transformations of predictors

Transformations of the form $X_j \to \frac{X_j - a}{b}$ do not change the fit of the regression model, only the interpretation of the parameters.
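A quick check of this invariance on seatpos (a sketch; the shift and scale applied to Ht are arbitrary):

lmod_raw    <- lm(hipcenter ~ Ht, data = seatpos)
lmod_scaled <- lm(hipcenter ~ I((Ht - 150) / 10), data = seatpos)
all.equal(fitted(lmod_raw), fitted(lmod_scaled))   # TRUE: identical fitted values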

One useful choice is to standardise all the explanatory variables, $X_j \to \frac{X_j - \bar{X}_j}{s_{X_j}}$, which puts all the parameters on the same scale: “…a change in $X_j$ of one standard deviation is associated with a change in the response of $\beta_j$…”
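One way to do this for seatpos (a minimal sketch; it assumes hipcenter is the response and every other column is a predictor):

preds <- subset(seatpos, select = -hipcenter)
seatpos_std <- data.frame(scale(preds), hipcenter = seatpos$hipcenter)
lmod_std <- lm(hipcenter ~ ., data = seatpos_std)
coef(lmod_std)   # each slope is per one standard deviation of that predictor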

It can also be useful to re-express a predictor in more reasonable units, for example expressing income in $1000s rather than $1s.