ST552 Lab 3 Jan 20th 2016


Reduce repetition in code by:

The DRY principle (Don’t Repeat Yourself)

DRY is the principle in programming that every idea in your code should be captured in one place only. For us, in practice, this means it’s easy to change parameters in a simulation, or make an adjustment in a model without having to make the change in many places in our code.

Take HW #2 part II.3. How did you do it? Share with your neighbour and TA. Can you identify repetition in your code?

(Matt go through an example)

A pseudo-code version of what we are trying to achieve:

  1. Degree, k = 1
  2. Fit a polynomial model of degree k using the direct method.
  3. If the fit worked, increase k and repeat 2., otherwise stop.

Writing functions

In this section, we will write a function that captures the idea of: “fit a kth degree polynomial regression model, using the direct method”

If I already set up (we’ll do this in a better way later)

n <- 20
x <- 1:n
y <- x + rnorm(n)
X <- cbind(1, x, x^2, x^3, x^4, x^5, x^6, x^7, x^8, x^9, x^10)

How would we fit (using the direct method) the model \(y_i = \beta_0 + \beta_1 x_i + \epsilon_i\)?

How would we fit (using the direct method) the model \(y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \epsilon_i\)?

In general, how could we fit \[ y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \ldots + \beta_k x_i^k + \epsilon_i \] for \(k \le 10\).

We really want to capture this idea in a function. Fill in the body of this function so it finds the polynomial regression coefficients using the direct method

find_poly <- function(k){

(Matt: remind people how functions work.)

Is there any repetition? Can you remove it?

It’s good practice to not let functions depend on anything outside the function. So, we should supply the design matrix and response as arguments to our function.

Ok, now to get the task done we just do:

k <- 1
find_poly(k, X, y)
k <- 2
find_poly(k, X, y)
k <- 3
find_poly(k, X, y)
# and so on until it fails

There’s still a lot of repetition!

Loops in R

Loops are a general programming structure that perform repetition. A for loop repeatedly evaluates it’s contents as it cycles through it’s argument. For example,

for (i in 1:10){
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10

the argument i cycles through the values 1 to 10, each time evaluating print(i). How can we write our task as a for loop?

Try it. Now!

for loops don’t return anything and if we want to keep our results from each fit we need to store them somewhere explicitly. For example,

coefs <- vector("list", 10) # an empty list
for (k in 1:10){
  coefs[[k]] <- find_poly(k, X, y)

Ok, our task is now (mostly) complete without repetition.

Some other notes if there is time

What about the lm approach?

The key is to realize lm can take a matrix on the RHS of a formula and it will include a term for every column:

lm(y ~ X[, 1:3] - 1)  # fitting the quadratic for example
## Call:
## lm(formula = y ~ X[, 1:3] - 1)
## Coefficients:
##   X[, 1:3]   X[, 1:3]x    X[, 1:3]  
##  0.2229180   1.0295678  -0.0006584
lm(y ~ X - 1)         # fitting the degree 9 polynomial
## Call:
## lm(formula = y ~ X - 1)
## Coefficients:
##          X          Xx           X           X           X           X  
##  2.513e+01  -6.038e+01   5.341e+01  -2.281e+01   5.546e+00  -8.274e-01  
##          X           X           X           X           X  
##  7.842e-02  -4.737e-03   1.767e-04  -3.711e-06   3.356e-08

You could either expand the find_poly function to also do the lm method, or give it it’s own function and put it in the same loop as the direct method.

The apply family

The apply functions in R, perform the same type of computation as loops, but with much less bookkeeping on your part. For example, our for loop using lapply looks like:

ks <- 1:6
coefs_2 <- lapply(ks, find_poly,  X,  y)

It’s also easy to create our design matrix with sapply:

X <- sapply(0:10, function(k, x) x^k, x)


If you don’t want your loop to stop at the first error, take a look at the function try.