Stat 552

Homework 8

Due Mar 11th

Reading: Faraway 10, 11.3 & 11.4

  1. 10.7

  2. Consider the dataset diamonds in the ggplot2 package. I am providing you with a (random) 50% subset of the data:

load(url("http://stat552.cwick.co.nz/data/diamonds_sub.rda"))
head(diamonds_sub)

Build a regression model using diamonds_sub to predict a diamond's price from the other available variables.

You can look at ?diamonds to learn about the variables, but otherwise you should not examine the full data set and only use the subset provided.

Beware! Building a good predictive model can swallow a lot of time. Your answer needs to include at least:

  • some exploration of the data
  • a model that captures a few of the largest relationships, and an estimate of the mean squared prediction error for this model.
  • at least one alternative model and a comparison to the model above in terms of predictive accuracy.
  • a summary of your proposed model, including at least one figure that displays some predicted prices as a function of the explanatory variables (no single figure will capture your entire model; it is up to you to decide which aspect of your model to display).
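
One possible way to estimate the mean squared prediction error and compare two candidate models is k-fold cross-validation. The following is a minimal sketch, not a required approach. The helper cv_mspe and the data frame toy are hypothetical: toy is simulated stand-in data with made-up coefficients so the example runs without the course file, and you would substitute diamonds_sub and your own formulas. The sketch assumes the response on the left-hand side of the formula is untransformed; if you model log(price), you would need to back-transform the predictions before computing the error.

```r
# Hypothetical helper: k-fold cross-validated mean squared prediction error
# for a linear model. Assumes the response in `formula` is untransformed.
cv_mspe <- function(formula, data, k = 5, seed = 1) {
  set.seed(seed)
  # Randomly assign each row to one of k folds of (nearly) equal size
  fold <- sample(rep(seq_len(k), length.out = nrow(data)))
  response <- all.vars(formula)[1]
  sq_err <- numeric(nrow(data))
  for (i in seq_len(k)) {
    held_out <- fold == i
    # Fit on the other k - 1 folds, predict on the held-out fold
    fit <- lm(formula, data = data[!held_out, ])
    pred <- predict(fit, newdata = data[held_out, ])
    sq_err[held_out] <- (data[[response]][held_out] - pred)^2
  }
  mean(sq_err)
}

# Simulated stand-in data (NOT the diamonds data; coefficients are made up)
set.seed(552)
n <- 200
toy <- data.frame(carat = runif(n, 0.2, 2))
toy$price <- 500 + 3000 * toy$carat^2 + rnorm(n, sd = 300)

# Compare a simple model with one alternative on the same folds
mspe_linear <- cv_mspe(price ~ carat, toy)
mspe_quad <- cv_mspe(price ~ carat + I(carat^2), toy)
c(linear = mspe_linear, quadratic = mspe_quad)
```

Using the same seed for both calls means both models are scored on identical folds, so the difference in error reflects the models rather than the random split.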

Any extra work should only be done if it doesn’t impact your ability to meet your other commitments (inside or outside of school).

Your writeup for this question should follow the general guidelines for the report in HW #6. However, prediction is the goal here, so your methods and results sections will focus more on models you considered and their predictive performance, rather than assumptions and inference.

There is a prize for the person who has the best predictions on a 20% sample that is disjoint from the records in diamonds_sub. To be eligible for the “best predictions” prize, you must also submit an .rda (R binary) file containing a function that takes as input a data frame with the same columns as diamonds and returns a vector of predicted prices. An easy way to create this from a fitted model is provided in hw8_make_preds.R.
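
To illustrate what such a function might look like, here is a minimal sketch under stated assumptions. It is not the contents of hw8_make_preds.R, which may do this differently and should be preferred for your submission. The names make_predictor, predict_price, toy and rda_file are hypothetical, and toy is simulated stand-in data rather than the diamonds data. The one design point worth noting is that the fitted model is captured inside the function's own environment, so it is stored in the file along with the function.

```r
# Simulated stand-in data (NOT the diamonds data; coefficients are made up)
set.seed(552)
toy <- data.frame(carat = runif(100, 0.2, 2))
toy$price <- 500 + 3000 * toy$carat + rnorm(100, sd = 300)

fit <- lm(price ~ carat, data = toy)

# Hypothetical factory: returns a function that carries `fit` with it.
# A function defined at top level would only refer to `fit` in the global
# workspace, and save() would not store the model alongside it.
make_predictor <- function(fit) {
  force(fit)
  function(newdata) {
    unname(predict(fit, newdata = newdata))
  }
}
predict_price <- make_predictor(fit)

# Save the function; a temporary file is used here only for illustration
rda_file <- tempfile(fileext = ".rda")
save(predict_price, file = rda_file)

# Check: load into a fresh environment and predict without `fit` in scope
check_env <- new.env()
load(rda_file, envir = check_env)
head(check_env$predict_price(toy))
```

Loading the file into a fresh environment, as in the last lines, is a quick way to confirm that the saved function does not depend on anything left behind in your workspace.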