Understanding regression models Jan 30 2019

Today

Inference so far

We’ve talked about the machinery to perform:

But what can we do with this machinery?

Two stages in understanding regression models

  1. Understand a model in the context of a problem

  2. Define a set of models to answer questions of interest

Focus on #1 to gain intuition in how to approach #2.

Example

LA Dodgers (baseball team) sometimes give out “bobbleheads” at home games. They are curious if this increases attendance at games.

81 games in 2012 season, 11 of which bobbleheads were given out.

Have measurements on:

(Inspired by Chapter 2 in Modeling Techniques in Predictive Analytics: Business Problems and Solutions with R. Get data from http://www.informit.com/promotions/modeling-techniques-in-predictive-analytics-141183 if you interested.)

A model

\[ \begin{aligned} \text{attendance}_i &= \beta_0 + \beta_1 1\{\text{bobblehead YES}\}_i + \\ & \beta_2 1\{\text{Tue}\}_i + \beta_3 1\{\text{Wed}\}_i+ \beta_4 1\{\text{Thu}\}_i+ \\ & \beta_5 1\{\text{Fri}\}_i+ \beta_6 1\{\text{Sat}\}_i+ \beta_7 1\{\text{Sun}\}_i+ \epsilon_i \end{aligned} \]

What does this model say about the relationship between attendance and whether bobbleheads are given out and day of the week?

Aside some terminology

variable: a measurement made on the observational units.

E.g. bobblehead (yes/no) and day of week (mon/tue/wed/thu/fri/sat/sun).

term: a column of the design matrix

E.g. \(\text{bobblehead}\), \(1\{\text{Fri}\}\)

One useful strategy

Ask about the effect of a variable?

What does the model say about the mean response when a variable is varied, holding all other variables constant.

Categorical variable: Find the mean response for each level and compare.

Continuous variable: Find the change in mean response if the variable increases by 1 unit.

The effect of bobbleheads

The variable has two levels: yes, no

We’ll find \(\E{\text{attendence} | \, \text{bobblehead} = \text{Yes}\,}\) and \(\E{\text{attendence} | \text{bobblehead} = \text{No}}\) then compare them.

\[ \E{\text{attendence} | \, \text{bobblehead} = \text{Yes}\,} \]

\[ \E{\text{attendence} | \, \text{bobblehead} = \text{No}\,} \]

The effect of bobbleheads

\[ \E{\text{attendence} | \, \text{bobblehead} = \text{Yes}\,} - \E{\text{attendence} | \, \text{bobblehead} = \text{No}\,} \]

Relationship to inference

If \(\beta_1\) is zero, then bobbleheads don’t have an effect on attendance.

We could answer the questions:

But this is an observational study, so we need to be careful with our language!
OK “It is estimated distributing bobbleheads is associated with an increased mean attendance of XX”.
Not OK “It is estimated distributing bobbleheads increases the mean attendance by XX”.

Your turn: Day of the week

What does model say about the expected attendance on Monday?

What does model say about the expected attendance on Tuesday?

What does model say about the expected attendance on Wednesday?

Your turn: Day of the week

Day of the week: inference

If \(\beta_2 = \beta_3 = \beta_4 = \beta_5 = \beta_6 = \beta_7\) is zero, then day of the week doesn’t have an effect on attendance.

What tools could we use to answer the questions:

An example with a continuous variable

In lab today: \[ \text{weight}_i = \beta_0 + \beta_11\{\text{male}\}_i + \beta_2\text{height}_i + \epsilon_i \]

What is the effect of height?

What is the effect of height?

Compare \(\E{\text{weight}| \, \text{height} = h \,}\) to \(\E{\text{weight}| \, \text{height} = h + 1 \,}\), holding other variables constant.

Interactions

Interactions: describe situations where the effect of one variable depends on the level of another explanatory variable.

E.g. \[ \text{weight}_i = \beta_0 + \beta_11\{\text{male}\}_i + \beta_2\text{height}_i + \beta_3 \left(1\{\text{male}\} \times \text{height}\right)_i + \epsilon_i \]

The same strategy will work.