class: center, middle, title-slide

# Creating Models
## NHS-R Conference 2021
### Emil Hvitfeldt
### 2021-11-02

---
layout: false
class: inverse, middle, center

<!--- Packages --------------------------------------------------------------->
<!--- Chunk options ---------------------------------------------------------->
<!--- pkg highlight ---------------------------------------------------------->

<style>
.pkg {
  font-weight: bold;
  letter-spacing: 0.5pt;
  color: #866BBF;
}
</style>

<!--- Highlighting colors ----------------------------------------------------->

<div style = "position:fixed; visibility: hidden">
`$$\require{color}\definecolor{purple}{rgb}{0.525490196078431, 0.419607843137255, 0.749019607843137}$$`
`$$\require{color}\definecolor{green}{rgb}{0.0117647058823529, 0.650980392156863, 0.415686274509804}$$`
`$$\require{color}\definecolor{orange}{rgb}{0.949019607843137, 0.580392156862745, 0.254901960784314}$$`
`$$\require{color}\definecolor{white}{rgb}{1, 1, 1}$$`
</div>

<script type="text/x-mathjax-config">
MathJax.Hub.Config({
  TeX: {
    Macros: {
      purple: ["{\\color{purple}{#1}}", 1],
      green: ["{\\color{green}{#1}}", 1],
      orange: ["{\\color{orange}{#1}}", 1],
      white: ["{\\color{white}{#1}}", 1]
    },
    loader: {load: ['[tex]/color']},
    tex: {packages: {'[+]': ['color']}}
  }
});
</script>

<style>
.purple {color: #866BBF;}
.green {color: #03A66A;}
.orange {color: #F29441;}
.white {color: #FFFFFF;}
</style>

<!--- knitr hooks ------------------------------------------------------------>

# [`tidymodels.org`](https://www.tidymodels.org/)

# _Tidy Modeling with R_ ([`tmwr.org`](https://www.tmwr.org/))

---

# Define the scope of "The Model"

In tidymodels, there is the idea that a model-oriented data analysis consists of

- a preprocessor, and
- a model

The preprocessor might be a simple formula or a sophisticated recipe.

It's important to consider both of these activities as part of the data analysis process.

- Post-model activities should also be included there (e.g. calibration, cut-off optimization, etc.)
- We don't have those implemented yet.

---

# Basic tidymodels components

<img src="images/blocks.png" width="70%" style="display: block; margin: auto;" />

---

# A relevant example

Let's say that we have some highly correlated predictors and we want to reduce the correlation by first applying principal component analysis to the data.

- AKA principal component regression

---

# A relevant example

Let's say that we have some highly correlated predictors and we want to reduce the correlation by first applying principal component analysis to the data.

- AKA ~~principal component regression~~ feature extraction

---

# A relevant example

Let's say that we have some highly correlated predictors and we want to reduce the correlation by first applying principal component analysis to the data.

- AKA ~~principal component regression~~ feature extraction

What do we consider the estimation part of this process?

---

# Is it this?

<img src="images/faux-model.svg" width="70%" style="display: block; margin: auto;" />

---

# Or is it this?

<img src="images/the-model.svg" width="70%" style="display: block; margin: auto;" />

---

# What's the difference?

It is easy to think that the model fit is the only estimation step.

There are cases where this could go really wrong:

* Poor estimation of performance (by treating the PCA parts as known)

* Selection bias in feature selection

* Information/data leakage

These problems are exacerbated as the preprocessors increase in complexity and/or effectiveness.
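As a small aside (not from the original slides), here is a minimal base-R sketch of how that leakage can arise: if the PCA rotation is estimated on *all* rows before splitting, information from the eventual test rows has already influenced the training features. The toy data and the 80/20 split below are made up purely for illustration.

```r
set.seed(123)
dat <- data.frame(y = rnorm(100), x1 = rnorm(100), x2 = rnorm(100))

# Leaky: the rotation "sees" rows that will later be held out
pca_all <- prcomp(dat[, c("x1", "x2")], scale. = TRUE)

# Safer: estimate the rotation on the training rows only,
# then apply it to the held-out rows
train_rows  <- 1:80
pca_train   <- prcomp(dat[train_rows, c("x1", "x2")], scale. = TRUE)
test_scores <- predict(pca_train, newdata = dat[-train_rows, c("x1", "x2")])
```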
_We'll come back to this at the end of this section_

---
class: inverse, middle, center

<div style="font-size: 110pt;">
.orange[Data splitting]
</div>

---
class: inverse, middle, center

<div style="font-size: 50pt;">
.white[Always have a separate piece of data that can] .purple[contradict] .white[what you] .green[believe]
</div>

---

# Data splitting and spending

How do we "spend" the data to find an optimal model?

We _typically_ split data into training and test data sets:

* ***Training Set***: these data are used to estimate model parameters and to pick the values of the complexity parameter(s) for the model.

* ***Test Set***: these data can be used to get an independent assessment of model efficacy. **They should not be used during model training** (like, at all).

---

# Data splitting and spending

The more data we spend, the better estimates we'll get (provided the data is accurate).

Given a fixed amount of data:

* Too much spent in training won't allow us to get a good assessment of predictive performance. We may find a model that fits the training data very well, but is not generalizable (overfitting).

* Too much spent in testing won't allow us to get a good assessment of model parameters.

Statistically, the best course of action would be to use all the data for model building and use statistical methods to get good estimates of error.

From a non-statistical perspective, many consumers of complex models emphasize the need for an untouched set of samples to evaluate performance.

---

# Mechanics of data splitting

There are a few different ways to do the split: simple random sampling, _stratified sampling based on the outcome_, by date, or methods that focus on the distribution of the predictors.

For stratification:

* **classification**: this would mean sampling within the classes to preserve the distribution of the outcome in the training and test sets

* **regression**: determine the quartiles of the data set and sample within those artificial groups

For _time series_, we often use the most recent data as the test set.

---

# Splitting with Chicago data

`initial_split()` can be used when we use randomness to make the split.

Let's put the last two weeks of data into the test set. `initial_time_split()` can be used for this purpose:

```r
set.seed(1234)
chi_split <- initial_time_split(Chicago, prop = 1 - (14/nrow(Chicago)))
chi_split
```

```
## <Analysis/Assess/Total>
## <5684/14/5698>
```

```r
chi_train <- training(chi_split)
chi_test <- testing(chi_split)

c(training = nrow(chi_train), testing = nrow(chi_test))
```

```
## training  testing 
##     5684       14
```

---
class: inverse, middle, center

<div style="font-size: 80pt;">
.orange[Creating models in R]
</div>

---

# Specifying models in R using formulas

To fit a model to the Chicago ridership data, the model terms must be specified. Historically, there are two main interfaces for doing this.
The **formula** interface uses R [formula rules](https://cran.r-project.org/doc/manuals/r-release/R-intro.html#Formulae-for-statistical-models) to specify a _symbolic_ representation of the terms:

Variables + interactions

```r
# day_of_week is not in the data set but could be created as
# day_of_week = lubridate::wday(date, label = TRUE)
model_fn(ridership ~ day_of_week + Western + day_of_week:Western, data = chi_train)
```

Shorthand for all predictors

```r
model_fn(ridership ~ ., data = chi_train)
```

Inline functions / transformations

```r
model_fn(log10(ridership) ~ ns(Western, df = 3) + ., data = chi_train)
```

---

# Downsides to formulas

* You can't nest in-line functions such as `model_fn(y ~ pca(scale(x1), scale(x2), scale(x3)), data = dat)`.

* All the model matrix calculations happen at once and can't be recycled when used in a model function.

* For very _wide_ data sets, the formula method can be [extremely inefficient](https://rviews.rstudio.com/2017/03/01/the-r-formula-method-the-bad-parts/).

* There are limited _roles_ that variables can take, which has led to several re-implementations of formulas.

* Specifying multivariate outcomes is clunky and inelegant.

* Not all modeling functions have a formula method (consistency!).

---

# Specifying models without formulas

Some modeling functions have a non-formula (XY) interface. This usually has arguments for the predictors and the outcome(s):

```r
# Usually, the variables must all be numeric
pre_vars <- c("Austin", "Clark_Lake", "California")
model_fn(x = chi_train[, pre_vars], y = chi_train$ridership)
```

This is inconvenient if you have transformations, factor variables, interactions, or any other operations to apply to the data prior to modeling.

Overall, it is difficult to predict if a package has one or both of these interfaces. For example, `lm` only has formulas.

There is a **third interface**, using _recipes_, that will be discussed later and that solves some of these issues.

---

# A linear regression model

Let's start by fitting an ordinary linear regression model to the training set. You can choose the model terms for your model, but I will use a very simple model:

```r
simple_lm <- lm(ridership ~ Clark_Lake + humidity, data = chi_train)
```

Before looking at coefficients, we should do some model checking to see if there is anything obviously wrong with the model.

To get the statistics on the individual data points, we will use the awesome `broom` package:

```r
simple_lm_values <- augment(simple_lm)
names(simple_lm_values)
```

```
## [1] "ridership"  "Clark_Lake" "humidity"   ".fitted"    ".resid"    
## [6] ".hat"       ".sigma"     ".cooksd"    ".std.resid"
```

---
class: inverse, middle, center

<div style="font-size: 110pt;">
.orange[Fitting via tidymodels]
</div>

---

# The parsnip package

.pull-left[

- A tidy unified _interface_ to models
- `lm()` isn't the only way to perform linear regression
  - .pkg[glmnet] for regularized regression
  - .pkg[stan] for Bayesian regression
  - .pkg[keras] for regression using tensorflow
- But...remember the consistency slide?
  - Each interface has its own minutiae to remember
  - .pkg[parsnip] standardizes all that!
]

.pull-right[

<img src="images/all_the_models.jpeg" width="700px" style="display: block; margin: auto;" />

]

---

# parsnip in action

.pull-left[

1) Create specification

2) Set the engine

3) Fit the model

```r
spec_lin_reg <- linear_reg()
spec_lin_reg
```

```
## Linear Regression Model Specification (regression)
## 
## Computational engine: lm
```

```r
spec_lm <- spec_lin_reg %>% set_engine("lm")
spec_lm
```

```
## Linear Regression Model Specification (regression)
## 
## Computational engine: lm
```

]

.pull-right[

```r
fit_lm <- fit(
  spec_lm,
  ridership ~ Clark_Lake + humidity,
  data = chi_train
)

fit_lm
```

```
## parsnip model object
## 
## Fit time:  4ms 
## 
## Call:
## stats::lm(formula = ridership ~ Clark_Lake + humidity, data = data)
## 
## Coefficients:
## (Intercept)   Clark_Lake     humidity  
##    1.764496     0.883693    -0.002497
```

Note: Models have default engines. We don't really need to use `set_engine("lm")` for this example.

]

---

# Alternative engines

With .pkg[parsnip], it is easy to switch to a different engine, like Stan, to run the same model with alternative backends.

.pull-left[

```r
spec_stan <- 
  spec_lin_reg %>%
  # Engine specific arguments are
  # passed through here
  set_engine("stan", chains = 4, iter = 1000)

# Otherwise, looks exactly the same!
fit_stan <- fit(
  spec_stan,
  ridership ~ Clark_Lake + humidity,
  data = chi_train
)
```

]

.pull-right[

```r
coef(fit_stan$fit)
```

```
## (Intercept)  Clark_Lake    humidity 
##  1.76242135  0.88364918 -0.00241108
```

```r
coef(fit_lm$fit)
```

```
##  (Intercept)   Clark_Lake     humidity 
##  1.764495522  0.883692888 -0.002497053
```

]

---

# Duplicate computations

Note that, for both of these fits, some of the computations are repeated.

For example, the formula method does a fair amount of work to figure out how to turn the data frame into a matrix of predictors.

When there are special effects (e.g. splines), dummy variables, interactions, or other components, the formula/terms objects have to keep track of everything.

In cases where there are a lot of _predictors_, these computations can consume a lot of resources. If we can save them, that would be helpful.

The answer is a _workflow_ object. These bundle together a preprocessor (such as a formula) along with a model.
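As a small illustration of that repeated work (not in the original slides; it assumes `chi_train` from earlier is available), `model.matrix()` is the base-R machinery that expands a formula into the numeric predictor matrix, and every plain formula-interface fit re-runs it:

```r
# The formula/terms machinery converts the data frame into a numeric matrix;
# a call such as lm(ridership ~ ..., data = chi_train) repeats this step each time
head(model.matrix(ridership ~ Clark_Lake + humidity, data = chi_train))
```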
---

# A modeling _workflow_

We can _optionally_ bundle the preprocessor and model together into a <span style="color:LightGray;"><strike>pipeline</strike></span> _workflow_:

```r
reg_wflow <- 
  workflow() %>%   # attached with the tidymodels package
  add_model(spec_lm) %>% 
  add_formula(ridership ~ Clark_Lake + humidity)   # or add_recipe() or add_variables()

reg_fit <- fit(reg_wflow, data = chi_train)
reg_fit
```

```
## ══ Workflow [trained] ════════════════════════════════════════════════════════
## Preprocessor: Formula
## Model: linear_reg()
## 
## ── Preprocessor ──────────────────────────────────────────────────────────────
## ridership ~ Clark_Lake + humidity
## 
## ── Model ─────────────────────────────────────────────────────────────────────
## 
## Call:
## stats::lm(formula = ..y ~ ., data = data)
## 
## Coefficients:
## (Intercept)   Clark_Lake     humidity  
##    1.764496     0.883693    -0.002497
```

---

# Swapping models

```r
stan_wflow <- 
  reg_wflow %>% 
  update_model(spec_stan)

set.seed(21)
stan_fit <- fit(stan_wflow, data = chi_train)
stan_fit
```

```
## ══ Workflow [trained] ════════════════════════════════════════════════════════
## Preprocessor: Formula
## Model: linear_reg()
## 
## ── Preprocessor ──────────────────────────────────────────────────────────────
## ridership ~ Clark_Lake + humidity
## 
## ── Model ─────────────────────────────────────────────────────────────────────
## stan_glm
##  family:       gaussian [identity]
##  formula:      ..y ~ .
##  observations: 5684
##  predictors:   3
## ------
##             Median MAD_SD
## (Intercept) 1.8    0.2   
## Clark_Lake  0.9    0.0   
## humidity    0.0    0.0   
## 
## Auxiliary parameter(s):
##       Median MAD_SD
## sigma 3.1    0.0   
## 
## ------
## * For help interpreting the printed output see ?print.stanreg
## * For info on the priors used see ?prior_summary.stanreg
```

---

# Workflows

Once the first model is fit, the preprocessor (i.e. the formula) is processed and the model matrix is formed. New models don't need to repeat those computations.

Some other nice features:

* Workflows are smarter with data than `model.matrix()` in terms of new factor levels.

* Other preprocessors can be used: recipes and `dplyr::select()` statements (that do no data processing).

* As will be seen later, they can help organize your work when a sequence of models is used.

* A workflow captures the entire modeling process (mentioned earlier), and a simple `fit()` and `predict()` sequence is used for all of the estimation parts.

---

# Using workflows to predict

```r
# generate some bogus data (instead of using the training or test sets)
set.seed(3)
shuffled_data <- map_dfc(Chicago, ~ sample(.x, size = 10))

predict(stan_fit, shuffled_data) %>% slice(1:3)
```

```
## # A tibble: 3 × 1
##   .pred
##   <dbl>
## 1  15.9
## 2  18.5
## 3  17.4
```

```r
predict(stan_fit, shuffled_data, type = "pred_int") %>% slice(1:3)
```

```
## # A tibble: 3 × 2
##   .pred_lower .pred_upper
##         <dbl>       <dbl>
## 1        10.2        22.0
## 2        12.6        24.6
## 3        11.3        23.4
```

---

# The tidymodels prediction guarantee!

* The predictions will always be inside a **tibble**.

* The column names and types are **unsurprising**.

* The number of rows in `new_data` and the output **are the same**.

This enables the use of `bind_cols()` to combine the original data and the predictions.

---

# Evaluating models

tidymodels has a [lot of performance metrics](https://yardstick.tidymodels.org/reference/index.html) for different types of models (e.g. binary classification, etc.).
Each takes a tibble as an input along with the observed and predicted column names:

```r
pred_results <- 
  predict(stan_fit, shuffled_data) %>% 
  bind_cols(shuffled_data)

# Data was randomized; these results should be bad
pred_results %>% rmse(truth = ridership, estimate = .pred)
```

```
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard        6.17
```

---

# Multiple metrics/KPIs

A _metric set_ can bundle multiple statistics:

```r
reg_metrics <- metric_set(rmse, rsq, mae, ccc)

# A tidy format of the results
pred_results %>% reg_metrics(truth = ridership, estimate = .pred)
```

```
## # A tibble: 4 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard       6.17 
## 2 rsq     standard       0.304
## 3 mae     standard       4.36 
## 4 ccc     standard       0.434
```

---

# broom methods

.pkg[parsnip] and .pkg[workflows] fits have corresponding .pkg[broom] tidiers:

```r
glance(reg_fit)
```

```
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic p.value    df  logLik    AIC    BIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>   <dbl>  <dbl>  <dbl>
## 1     0.780         0.780  3.07    10094.       0     2 -14448. 28905. 28931.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
```

```r
tidy(reg_fit)
```

```
## # A tibble: 3 × 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)  1.76      0.222       7.94  2.46e-15
## 2 Clark_Lake   0.884     0.00622   142.    0       
## 3 humidity    -0.00250   0.00295    -0.847 3.97e- 1
```

---

# broom methods

For `augment()`, we also supply the data to predict on; the predictions are attached as a new column:

```r
augment(reg_fit, shuffled_data %>% select(Clark_Lake, humidity, ridership))
```

```
## # A tibble: 10 × 4
##    Clark_Lake humidity ridership .pred
##         <dbl>    <dbl>     <dbl> <dbl>
##  1      16.3      78        6.36 15.9 
##  2      19.2      70       19.3  18.5 
##  3      17.8      45       15.7  17.4 
##  4      19.9      50        4.31 19.3 
##  5      18.7      61       18.7  18.1 
##  6       2.13     54.5      8.44  3.51
##  7       4.56     43        1.74  5.69
##  8      17.1      77.5     14.7  16.6 
##  9      19.4      59       14.6  18.7 
## 10      16.7      39       15.3  16.5
```

---

# Hands-On: Fit a model

Go to the lab and try to fit some models. The labs include skeleton code. Try to mix it up with different predictors or models.
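If you want a starting point, here is one possible sketch (the predictor choices are arbitrary examples, not prescribed by the lab):

```r
library(tidymodels)

# Swap in any predictors or model specification you like
lab_wflow <- 
  workflow() %>% 
  add_model(linear_reg() %>% set_engine("lm")) %>% 
  add_formula(ridership ~ Clark_Lake + Austin + humidity)

lab_fit <- fit(lab_wflow, data = chi_train)

# Evaluate on the held-out test weeks
predict(lab_fit, chi_test) %>% 
  bind_cols(chi_test) %>% 
  rmse(truth = ridership, estimate = .pred)
```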