Lab 2 Jan 16 2019

You can find this document in its raw Rmarkdown form in this project on rstudio.cloud.

Learning Objectives

Tibbles/Data frames/Matrices

There are a number of ways to store rectangular data (observations in rows, variables in columns) in R. Data frames are the most common because each column can hold a different data type. For example, most of the data examples we’ve seen in class are data frames:

str(Sleuth3::ex0727)
## 'data.frame':    21 obs. of  2 variables:
##  $ Mass : num  3.33 4.62 5.43 5.73 6.12 6.29 6.45 6.51 6.65 6.75 ...
##  $ Tcell: num  0.252 0.263 0.251 0.251 0.183 0.213 0.332 0.203 0.252 0.342 ...
str(faraway::gala)
## 'data.frame':    30 obs. of  7 variables:
##  $ Species  : num  58 31 3 25 2 18 24 10 8 2 ...
##  $ Endemics : num  23 21 3 9 1 11 0 7 4 2 ...
##  $ Area     : num  25.09 1.24 0.21 0.1 0.05 ...
##  $ Elevation: num  346 109 114 46 77 119 93 168 71 112 ...
##  $ Nearest  : num  0.6 0.6 2.8 1.9 1.9 8 6 34.1 0.4 2.6 ...
##  $ Scruz    : num  0.6 26.3 58.7 47.4 1.9 ...
##  $ Adjacent : num  1.84 572.33 0.78 0.18 903.82 ...
str(HistData::GaltonFamilies)
## 'data.frame':    934 obs. of  8 variables:
##  $ family         : Factor w/ 205 levels "001","002","003",..: 1 1 1 1 2 2 2 2 3 3 ...
##  $ father         : num  78.5 78.5 78.5 78.5 75.5 75.5 75.5 75.5 75 75 ...
##  $ mother         : num  67 67 67 67 66.5 66.5 66.5 66.5 64 64 ...
##  $ midparentHeight: num  75.4 75.4 75.4 75.4 73.7 ...
##  $ children       : int  4 4 4 4 4 4 4 4 2 2 ...
##  $ childNum       : int  1 2 3 4 1 2 3 4 1 2 ...
##  $ gender         : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 1 1 2 1 ...
##  $ childHeight    : num  73.2 69.2 69 69 73.5 72.5 65.5 65.5 71 68 ...

Your turn: What data types are in the columns of GaltonFamilies? What other data types are common as variables in data frames in R?

Tibbles are a modern re-imagining of data frames, very common in the tidyverse set of R packages. One particularly useful property of tibbles is slightly nicer printing, especially of large data sets. Some built-in data sets are tibbles, e.g.

ggplot2::diamonds
## # A tibble: 53,940 x 10
##    carat cut       color clarity depth table price     x     y     z
##    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
##  2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
##  3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31
##  4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
##  5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75
##  6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
##  7 0.24  Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
##  8 0.26  Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
##  9 0.22  Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
## 10 0.23  Very Good H     VS1      59.4    61   338  4     4.05  2.39
## # ... with 53,930 more rows

(If you are working in an Rmarkdown notebook and just run the code in this chunk, you might not see any difference between the display of a data frame and a tibble. That’s because RStudio uses its own display for both.)

You can also turn a data frame into a tibble with as_tibble(), from the tibble package, which is loaded as part of the tidyverse package:

library(tidyverse)
galton <- as_tibble(HistData::GaltonFamilies)
galton
## # A tibble: 934 x 8
##    family father mother midparentHeight children childNum gender
##    <fct>   <dbl>  <dbl>           <dbl>    <int>    <int> <fct> 
##  1 001      78.5   67              75.4        4        1 male  
##  2 001      78.5   67              75.4        4        2 female
##  3 001      78.5   67              75.4        4        3 female
##  4 001      78.5   67              75.4        4        4 female
##  5 002      75.5   66.5            73.7        4        1 male  
##  6 002      75.5   66.5            73.7        4        2 male  
##  7 002      75.5   66.5            73.7        4        3 female
##  8 002      75.5   66.5            73.7        4        4 female
##  9 003      75     64              72.1        2        1 male  
## 10 003      75     64              72.1        2        2 female
## # ... with 924 more rows, and 1 more variable: childHeight <dbl>

Your turn: Describe the differences between the way a data frame and a tibble print.

If you are interested you can read more about tibbles, and their differences from data frames, in Section 3.6 Data frames and tibbles of Advanced R.
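One difference that tends to matter in practice is how single-column subsetting behaves. The sketch below uses a small made-up data frame (`toy` is a hypothetical name, not one of the class data sets) and assumes the tibble package is installed:

```r
library(tibble)

# A small made-up data frame, for illustration only
toy <- data.frame(x = c(1.5, 2.5, 3.5), y = c(10, 20, 30))
toy_tbl <- as_tibble(toy)

# Selecting one column with [ , ] drops a data frame down to a plain vector...
df_col <- toy[, "x"]
is.data.frame(df_col)   # FALSE

# ...but a tibble stays a tibble with one column
tbl_col <- toy_tbl[, "x"]
is.data.frame(tbl_col)  # TRUE

# $ and [[ ]] pull out the underlying vector from either one
identical(toy$x, toy_tbl[["x"]])  # TRUE
```

If you need a plain vector from a tibble, `$` or `[[ ]]` is the safer choice.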

While data frames and tibbles are the most useful structures for data analysis, we’ll use matrices to help us learn about linear models. In contrast to data frames, matrices in R hold only one data type. For this class, we’ll only work with matrices that have numeric entries.

To create a matrix from scratch you can use the matrix() function, passing in a vector of values and dimensions (I find it easiest to set byrow = TRUE and lay out the values in a rectangular form), e.g.

matrix(c(
    1, 0,
    2, 1, 
    1, 0
  ), 
  ncol = 2,
  byrow = TRUE)
##      [,1] [,2]
## [1,]    1    0
## [2,]    2    1
## [3,]    1    0
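To see why byrow = TRUE matters, here is a small sketch comparing it with the default fill order, using the same values as above (the names `vals`, `by_col` and `by_row` are just for illustration):

```r
vals <- c(1, 0,
          2, 1,
          1, 0)

# Default (byrow = FALSE): values fill down the columns first
by_col <- matrix(vals, ncol = 2)

# byrow = TRUE: values fill across the rows, matching the typed layout
by_row <- matrix(vals, ncol = 2, byrow = TRUE)

by_col[, 1]  # 1 0 2
by_row[, 1]  # 1 2 1

dim(by_row)  # 3 2
```

Both matrices are 3 by 2, but only the byrow = TRUE version matches the way the values were typed.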

However, most of the time in this class, columns will be based on data. You can pull the columns out (by name using $) and join them together with cbind():

cbind(galton$midparentHeight, galton$childHeight)

Or subset the data, then try as.matrix():

as.matrix(galton[, c("midparentHeight", "childHeight")])

But be careful: it’s up to you to make sure you end up with the desired data types.
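One way to check what you ended up with is to inspect the result directly. This is a minimal sketch using a made-up, all-numeric data frame (`toy` is hypothetical), so it only shows the well-behaved case:

```r
# Made-up numeric data, for illustration only
toy <- data.frame(a = c(1, 2, 3), b = c(4.5, 5.5, 6.5))

m1 <- cbind(toy$a, toy$b)
m2 <- as.matrix(toy[, c("a", "b")])

# Both are numeric matrices with the same dimensions
is.numeric(m1)  # TRUE
typeof(m2)      # "double"
dim(m1)         # 3 2

# as.matrix() keeps the column names; cbind() on toy$a, toy$b does not
colnames(m1)    # NULL
colnames(m2)    # "a" "b"
```

Running the same checks, is.numeric() and typeof(), on your own matrices is a quick way to spot an unwanted conversion.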

Your turn: What kind of matrices do these two commands produce? That is, what type of data is inside? How did they handle the fact that the gender column was a factor?

cbind(galton$children, galton$gender)
as.matrix(galton[, c("children", "gender")])

Neither is probably what we want if we want to include gender in a design matrix, but we’ll talk more about that at a later time.

Learning how R handles linear models is mostly about learning how R creates design matrices from formulas that are specified relative to a tibble/data frame.
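As a small preview, base R’s model.matrix() builds a design matrix from a formula and a data frame. The sketch below uses made-up data with a single numeric predictor (`toy`, `x` and `y` are hypothetical names):

```r
# Made-up data, for illustration only
toy <- data.frame(y = c(2.1, 3.9, 6.2, 7.8),
                  x = c(1, 2, 3, 4))

# The formula y ~ x asks for an intercept plus the column x
X <- model.matrix(y ~ x, data = toy)
X

colnames(X)  # "(Intercept)" "x"
dim(X)       # 4 2
```

The column of ones for the intercept is added automatically, with one row per observation.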

Math in Rmarkdown

Including math in Rmarkdown is easy: you can put LaTeX commands inside dollar signs ($).

For example, single dollar signs include math inline: $\sigma$ produces the Greek letter sigma, \(\sigma\). Double dollar signs put math in a displayed equation, e.g.

\[ \overline{X} = \frac{1}{n}\sum_{i=1}^n x_i \]

is produced by

$$
\overline{X} = \frac{1}{n}\sum_{i=1}^n x_i
$$

Learning all the possible LaTeX commands is much harder. A reasonable overview can be found at this MathJax Basic Tutorial.
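A handful of commands cover most of what comes up in a regression class: subscripts (`_`), superscripts (`^`), Greek letters (`\beta`), fractions (`\frac`), sums (`\sum`), and accents (`\hat`, `\overline`). As an illustration, not taken from the class notes, here is the least squares slope for simple linear regression, which uses all of them:

```latex
$$
\hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \overline{x})(y_i - \overline{y})}{\sum_{i=1}^n (x_i - \overline{x})^2}
$$
```

Braces group anything longer than one character, e.g. `x_{ij}` rather than `x_ij`.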

For any formula on the class page, you should be able to right-click -> “Show Math As” -> “TeX Commands”, and copy and paste into your own document (remember to surround it with $ or $$).

Your turn: Find an example of a model specification from last week’s notes on the website (the HTML versions, not the PDFs). Try to copy it into your own Rmarkdown document.

Matrix algebra in R

Head to http://www.statmethods.net/advstats/matrix.html to see a list of many of the matrix functions available in R, which you’ll need to complete your homework this week.
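To give a feel for a few of the base R functions involved, here is a minimal sketch on a made-up 2 by 2 matrix (`A` and `B` are hypothetical names):

```r
# A made-up invertible 2 x 2 matrix
A <- matrix(c(2, 1,
              1, 3), ncol = 2, byrow = TRUE)
B <- diag(2)          # 2 x 2 identity matrix

t(A)                  # transpose
A %*% B               # matrix multiplication (A * B would be element-wise)

A_inv <- solve(A)     # matrix inverse
A %*% A_inv           # the identity matrix, up to rounding error
```

Note that `*` multiplies element by element, so matrix products need `%*%`.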

Your turn Skim the page and make notes on how to:

To practice, create the following matrices with as little typing as possible. (You might like to look at the source for this lab to see how the math is typeset.)

\[ I_{10 \times 10} \]

\[ D = \begin{pmatrix} 1 & 0 & 0 & \ldots & 0\\ 0 & 2 & 0 & \ldots & 0\\ 0 & 0 & 3 & \ldots & 0\\ \vdots & \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & 0 & \ldots & 10 \end{pmatrix} \]

\[ O = \pmb{1}_{10 \times 10} \quad (\text{a } 10 \times 10 \text{ matrix full of ones}) \]

\[ X = \left[ \begin{matrix} 1 & 1\\ 1 & 2 \\ 1 & 3 \\ \vdots & \vdots \\ 1 & 10 \end{matrix}\right] \]

Then calculate: \[ X^T, \quad D^{-1}, \quad \text{and} \quad X^TX \]

Simulation of Normal random variables in R

You probably already know this, but for completeness, to simulate a realization of \(n\) independent Normal random variables with mean 0 and standard deviation 1 in R use rnorm():

n <- 10 # for example
rnorm(n)
##  [1]  0.60889975 -0.19138996  0.01755698  0.09124131  0.42197315
##  [6]  1.00143668 -0.26411662  1.24313495  0.58255142 -0.89598511

rnorm has arguments mean and sd if you need a different mean and standard deviation.
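For instance, here is a sketch drawing from a Normal distribution with mean 5 and standard deviation 2 (the seed and the parameter values are arbitrary choices for illustration):

```r
set.seed(2019)  # arbitrary seed, so the draws are reproducible

n <- 1000
x <- rnorm(n, mean = 5, sd = 2)

# With this many draws the sample summaries should land near the targets
mean(x)  # close to 5
sd(x)    # close to 2
```

Remember that the second argument is a standard deviation, not a variance.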

Want dependence? Start with uncorrelated observations, and transform them (check the first answer) or use the function rmvnorm in the mvtnorm package.
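One common transformation uses the Cholesky factor of the target covariance matrix. This is a minimal sketch with a made-up 2 by 2 covariance matrix `Sigma`; it is one standard approach, and not necessarily the one given in the linked answer:

```r
set.seed(1)  # arbitrary seed, so the draws are reproducible

n <- 5000
# Made-up target covariance: unit variances, correlation 0.8
Sigma <- matrix(c(1.0, 0.8,
                  0.8, 1.0), ncol = 2, byrow = TRUE)

# Start with uncorrelated standard Normal draws, one observation per row
Z <- matrix(rnorm(n * 2), ncol = 2)

# chol() returns an upper triangular R with t(R) %*% R equal to Sigma
R <- chol(Sigma)

# Each row of Y then has covariance t(R) %*% R = Sigma
Y <- Z %*% R

cor(Y)[1, 2]  # roughly 0.8
```

Multiplying on the right by `R` works here because the observations are stored in rows.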

We can simulate output from a linear model by combining matrix algebra with simulated errors; you’ll do this in Homework 2.
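The general shape of such a simulation is roughly as follows. This is only a sketch with made-up values: the design matrix, the coefficients `beta`, and the error standard deviation `sigma` are all assumptions for illustration, not the homework setup.

```r
set.seed(42)  # arbitrary seed, so the draws are reproducible

n <- 50
x <- seq(1, 10, length.out = n)

X <- cbind(1, x)       # design matrix: a column of ones and the predictor
beta <- c(2, 0.5)      # made-up intercept and slope
sigma <- 1             # made-up error standard deviation

eps <- rnorm(n, mean = 0, sd = sigma)  # independent Normal errors
y <- X %*% beta + eps                  # simulated responses, an n x 1 matrix

dim(y)  # 50 1
```

Each run with a different seed gives a different realization of `y` from the same model.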