Stat 411/511

Lab 1

Jan 6th

Goals:

  • RStudio and projects
  • Getting set up for reproducible code
  • Coding with style
  • The dangers of attach

Getting set up in RStudio

We’ll use RStudio as our interface to R.

Projects are a useful way to organize your work in RStudio. When you open a project:

  • R will open with the working directory set to the project directory
  • code files you had open last time you were working on the project will still be open
  • the History contains only code you have run in this project
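As a quick check of the first point, you can ask R where it is working. This is only a sketch: it assumes you have the project open, and the path `data/example.csv` is a made-up name for illustration, not a file from this class.

```r
# With a project open, this should print the project directory
getwd()

# Paths can then be written relative to the project directory.
# "data/example.csv" is a hypothetical file name used only for illustration.
data_path <- file.path("data", "example.csv")
data_path

# Check that a file is really there before trying to read it
file.exists(data_path)
```

Because the path is relative, the same code keeps working if the whole project folder is moved or opened on another computer.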

To create a new project for this class:

  1. Open RStudio
  2. Go to File -> New Project
  3. Select New Directory
  4. Select Empty Project
  5. Call it ST552. Put it on your ONID or SCIENCE home drive if you want it available from other computers.

Create the project once. The next time you work on ST552, open the project in RStudio (File -> Open Project), or use the project dropdown in the top right of RStudio.

(Accessing your SCIENCE drive from other computers: http://my.science.oregonstate.edu/mount_network_drives)
(Accessing your ONID drive from other computers: http://oregonstate.edu/helpdocs/accounts/onid-osu-network-id/using-your-onid/your-home-directory)

Reproducible code

Reproducibility is one of the huge advantages of using a programming language for data analysis. Our code becomes a complete recipe to go from a possibly messy dataset to the numbers and figures for a statistical report. We can repeat our analyses in the future and get exactly the same result. However, writing truly reproducible code takes discipline.

By default, R (and RStudio) saves a copy of your workspace (the objects you have created) when you exit R, and loads it again when you come back. This may seem convenient, but it encourages bad habits for reproducibility. It’s too easy to accidentally rely on objects you created outside of your script, or on packages you loaded by hand rather than in the script. The first thing we will do is change this default behavior.

Go to Tools -> Project Options. In the General tab, set the following:

  • Restore .RData into workspace at startup = No
  • Save workspace to .RData on exit = No
  • Always save history = Yes

Now, when we start a fresh R session (Session -> Restart R) we know there is nothing from a previous session hanging around.

This does it for your ST552 project, but I’d encourage you to use these options for all your work (you can set them globally in Tools -> Global Options).
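If you want to convince yourself that a restarted session really is clean, a minimal sketch like the following can help. The variable names `leftover_objects` and `attached` are just names I picked for illustration.

```r
# Objects in the global environment.
# Right after Session -> Restart R, with the options above, there should be none.
leftover_objects <- ls(envir = globalenv())
leftover_objects

# The search path lists the global environment plus attached packages.
# In a fresh session only R's default packages should appear here.
attached <- search()
attached
```

If a package your script needs is missing from the search path, that is a sign the script should load it itself with `library()`.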

When I am writing R code, I occasionally check for reproducibility by restarting R and sourcing my code (Code -> Source File, or the Source button in the editor). Sourcing a file runs all the code in the file from top to bottom, stopping if an error occurs. If an error does occur, fix it, restart R, and try sourcing again. At the very least, do this before closing R and before handing in code. Your first homework requires submitting an R code file that will be checked for reproducibility.
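The same thing can be done from the console with `source()`. Here is a small sketch of the "stops at the first error" behavior. The throwaway script and the names `script_path` and `check_env` are made up for illustration. With your own work, you would simply call `source()` on your code file after restarting R.

```r
# Write a tiny throwaway script to a temporary file
# (a stand-in for your own code file)
script_path <- tempfile(fileext = ".R")
writeLines(c(
  "total <- sum(1:10)",
  "stop(\"something went wrong\")",
  "never_created <- TRUE"
), script_path)

# Source it into its own environment so the objects it creates are easy to inspect.
# tryCatch() is only here so the example keeps running after the error.
check_env <- new.env()
result <- tryCatch(
  source(script_path, local = check_env),
  error = function(e) conditionMessage(e)
)

result                                                        # the error message
exists("total", envir = check_env, inherits = FALSE)          # TRUE: ran before the error
exists("never_created", envir = check_env, inherits = FALSE)  # FALSE: never reached
```

Everything before the error ran, and nothing after it did, which is why you need to fix the error and source again from the top.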

Reproducible code is the first step towards reproducible reports, next week…

R Style

Code is a form of communication. You should write it in a way that is easy for others (including your future self) to read and understand. Consistency is key: pick a style and stick with it.

We’ll follow the style guide at http://adv-r.had.co.nz/Style.html. Your first homework involves a submission of code that will be checked according to this guide. All code submitted for future homeworks in this class must conform to this style.
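To give a flavor of what the guide asks for, here is a small made-up example (not taken from the guide or from the homework). The poorly styled version is shown in comments so that only the cleaned-up version runs.

```r
# Poorly styled: = for assignment, no spaces around operators or after commas,
# uninformative capitalized names, everything crammed onto one line
# X=c(2,4,6,8)
# if(mean(X)>4){Res="large"}else{Res="small"}

# Restyled: <- for assignment, spaces around operators and after commas,
# lowercase names with underscores, two-space indentation
measurements <- c(2, 4, 6, 8)

if (mean(measurements) > 4) {
  size_label <- "large"
} else {
  size_label <- "small"
}

size_label
```

Both versions do the same thing. Only the second is easy to read at a glance.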

Download code.r, a poorly styled file. Open it in RStudio. Find the problems with style and fix them according to the style guidelines.

Don’t use attach

attach leads to confusion.

x <- 4
fake_data <- data.frame(x = 1, y = 2)
attach(fake_data)
# what will this return?
x
# this?
x + y
# this?
rm(x)
x
# this?
detach(fake_data)
x
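Once you have made your guesses, it helps to look at where R is actually finding each name. The sketch below reuses the same objects and only adds calls to `search()` and `get()`. The name `attached_x` is mine, for illustration.

```r
x <- 4
fake_data <- data.frame(x = 1, y = 2)
attach(fake_data)   # R prints a message that x is masked by .GlobalEnv

# The attached data frame sits at position 2 of the search path,
# behind the global environment at position 1
search()[1:2]

# Names are looked up in search-path order, so the global x is found first
x

# The attached copy is still there, just hidden behind the global one
attached_x <- get("x", pos = "fake_data")
attached_x

detach(fake_data)
```

So which `x` you get depends on what else happens to exist in your session, which is exactly the kind of hidden dependence we are trying to avoid.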

It’s better to be explicit about where variables are coming from. Use the data argument if a function has it, or use the with function.

fake_data <- data.frame(x = 1:10, y = rnorm(10))
lm(y ~ x, data = fake_data) # use the data argument
## 
## Call:
## lm(formula = y ~ x, data = fake_data)
## 
## Coefficients:
## (Intercept)            x  
##     -0.6742       0.1546
with(fake_data, lm(y ~ x))  # or use with
## 
## Call:
## lm(formula = y ~ x)
## 
## Coefficients:
## (Intercept)            x  
##     -0.6742       0.1546