Predictive performance Mar 04 2019

From last time: comments on criteria-based methods

These metrics are estimates, and like all estimates are subject to variability.

The ranking of models obtained from one dataset might differ from the ranking obtained from another dataset generated by the same data generating process.

There are some asymptotic results. Two common types: consistency (as the sample size grows, the criterion selects the true model with probability tending to one) and efficiency (the prediction error of the selected model is asymptotically as small as possible).

Alternative estimate of model performance: use an external test set and estimate your desired metric directly. (This is the idea behind cross validation.)

Predictive performance of models

A Tecator Infratec Food and Feed Analyzer, working in the wavelength range 850-1050 nm by the Near Infrared Transmission (NIT) principle, was used to collect data on samples of finely chopped pure meat. 215 samples were measured. For each sample, the fat content was measured along with a 100-channel spectrum of absorbances. Since determining the fat content via analytical chemistry is time consuming, we would like to build a model that predicts the fat content of new samples from the 100 absorbances, which can be measured more easily.
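A quick look at the data confirms this structure. The snippet below assumes the data are the meatspec data frame from the faraway R package (an assumption; substitute however your copy of the data is stored):

library(faraway)        # assumed source of the meatspec data frame
dim(meatspec)           # expect 215 rows and 101 columns: channels V1-V100 plus fat
summary(meatspec$fat)   # the response we want to predict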

How do we decide what a good model is?

How good is our model at prediction?

We build a model that takes the predictors \(x_0\) of a new observation as input and produces a prediction \(\hat{y}_0\).

A common (but not the only) metric is the mean squared prediction error: \[ \operatorname{E}\left[\left(y_0 - \hat{y}_0\right)^2\right] \] How far on average are our predictions from the truth? (The expectation is over repeated samples from the data generating mechanism.)
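To make "expectation over repeated samples" concrete, here is a small simulation sketch. The data generating mechanism below (intercept 1, slope 2, Gaussian noise) is made up purely for illustration:

set.seed(1)

# one draw from a hypothetical data generating mechanism: y = 1 + 2x + noise
mspe_once <- function(n = 50) {
  x <- runif(n)
  y <- 1 + 2 * x + rnorm(n, sd = 0.5)
  fit <- lm(y ~ x)
  # a new observation from the same mechanism
  x0 <- runif(1)
  y0 <- 1 + 2 * x0 + rnorm(1, sd = 0.5)
  (y0 - predict(fit, data.frame(x = x0)))^2
}

# averaging over many repetitions approximates E[(y0 - yhat0)^2]
mean(replicate(2000, mspe_once()))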

This error refers to unseen data, so it is often called the extra-sample or generalization error.

We can’t find this expectation because we don’t know the true model. So, we try to estimate it.

Illustration

One way: split the data in two. Build the model with one part, the training set. Estimate the prediction error with the other part, the test set.

set.seed(19128)
ind <- sample(nrow(meatspec), size = 172)
trainmeat <- meatspec[ind, ]
testmeat <- meatspec[-ind, ]

# fit model with training set
fit_meat <- lm(fat ~ ., data = trainmeat)

# predict on "unseen" data, the test set
testmeat$pred <- predict(fit_meat, testmeat)

# average squared error
mse <- with(testmeat, mean((fat - pred)^2))
# usually reported on root scale
sqrt(mse)
## [1] 3.339487

Using the same data for both training and evaluation leads to overconfidence

We could try to estimate the prediction error using the residuals (a.k.a. the training error) \[ \frac{1}{n} \sum_{i=1}^n (\hat{y}_i - y_i)^2 \] but this tends to underestimate the error because we use the same data to fit and to evaluate the model.

sqrt(mean(residuals(fit_meat)^2)) # much too small!
## [1] 0.7320721

In general you can replace the mean squared error with any loss function you like, e.g. median absolute error. The choice of loss should reflect what you are using the predictions for.
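For example, the median absolute prediction error on the same test set (continuing from the split above):

# median absolute error on the test set, an alternative to the (root) mean squared error
with(testmeat, median(abs(fat - pred)))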

How could we improve our model?

# current model: linear regression on all 100 absorbances
fit_meat <- lm(fat ~ ., data = trainmeat)

23-code.R: On your laptops, carry out model selection using the training set, then evaluate the chosen model on the test set.
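One possible approach, sketched below (not necessarily what 23-code.R contains): backward elimination by AIC on the training data, then evaluation of the selected model on the held-out test set. With 100 predictors this can take a while to run.

# backward elimination by AIC, using the training data only (can be slow with 100 predictors)
fit_step <- step(fit_meat, direction = "backward", trace = 0)

# evaluate the selected model on the held-out test set
pred_step <- predict(fit_step, testmeat)
sqrt(mean((testmeat$fat - pred_step)^2))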

K-fold cross validation
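A minimal hand-rolled sketch of K-fold cross validation for this model (the choice K = 10 and the use of the full meatspec data are made here for illustration, not taken from the notes):

set.seed(19128)
K <- 10
folds <- sample(rep(1:K, length.out = nrow(meatspec)))   # random fold labels

cv_mse <- sapply(1:K, function(k) {
  fit <- lm(fat ~ ., data = meatspec[folds != k, ])       # fit on the other K - 1 folds
  pred <- predict(fit, meatspec[folds == k, ])            # predict the held-out fold
  mean((meatspec$fat[folds == k] - pred)^2)               # held-out MSE for fold k
})
sqrt(mean(cv_mse))   # cross-validated estimate of the RMSE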

Regularized regression

Next time…

Lasso and ridge.