class: center, middle, title-slide

# Using resampling to estimate performance

## NHS-R Conference 2021

### Emil Hvitfeldt

### 2021-11-02

---
class: inverse, middle, center

<!--- Packages --------------------------------------------------------------->
<!--- Chunk options ---------------------------------------------------------->
<!--- pkg highlight ----------------------------------------------------------->

<style>
.pkg {
  font-weight: bold;
  letter-spacing: 0.5pt;
  color: #866BBF;
}
</style>

<!--- Highlighting colors ----------------------------------------------------->

<div style = "position:fixed; visibility: hidden">
`$$\require{color}\definecolor{purple}{rgb}{0.525490196078431, 0.419607843137255, 0.749019607843137}$$`
`$$\require{color}\definecolor{green}{rgb}{0.0117647058823529, 0.650980392156863, 0.415686274509804}$$`
`$$\require{color}\definecolor{orange}{rgb}{0.949019607843137, 0.580392156862745, 0.254901960784314}$$`
`$$\require{color}\definecolor{white}{rgb}{1, 1, 1}$$`
</div>

<script type="text/x-mathjax-config">
MathJax.Hub.Config({
  TeX: {
    Macros: {
      purple: ["{\\color{purple}{#1}}", 1],
      green: ["{\\color{green}{#1}}", 1],
      orange: ["{\\color{orange}{#1}}", 1],
      white: ["{\\color{white}{#1}}", 1]
    },
    loader: {load: ['[tex]/color']},
    tex: {packages: {'[+]': ['color']}}
  }
});
</script>

<style>
.purple {color: #866BBF;}
.green {color: #03A66A;}
.orange {color: #F29441;}
.white {color: #FFFFFF;}
</style>

<!--- knitr hooks ------------------------------------------------------------->

# [`tidymodels.org`](https://www.tidymodels.org/)

# _Tidy Modeling with R_ ([`tmwr.org`](https://www.tmwr.org/))

---

# What is resampling?

---

# Resampling methods

.pull-left[
These are additional data splitting schemes that are applied to the _training_ set and are used for **estimating model performance**.

They attempt to simulate slightly different versions of the training set. Each of these versions is split into two subsets:

* The _analysis set_ is used to fit the model (analogous to the training set).
* Performance is determined using the _assessment set_.

This process is repeated many times.
]

.pull-right[
<img src="images/resampling.svg" width="120%" style="display: block; margin: auto;" />
]

There are [different flavors of resampling](https://bookdown.org/max/FES/resampling.html) but we will focus on two methods in these notes.

---

# The model workflow and resampling

All resampling methods repeat this process multiple times:

<img src="images/diagram-resampling.svg" width="65%" style="display: block; margin: auto;" />

The final resampling estimate is the average of all of the estimated metrics (e.g. RMSE).

---

# V-Fold cross-validation

.pull-left[
Here, we randomly split the training data into _V_ distinct blocks of roughly equal size (AKA the "folds").

* We leave out the first block of data and fit a model to the remaining blocks (the analysis set).
* This model is used to predict the held-out block of assessment data.
* We continue this process until we've predicted all _V_ assessment blocks.

The final performance estimate is computed by _averaging_ the statistics from the _V_ hold-out blocks.
]

.pull-right[
_V_ is usually taken to be 5 or 10; leave-one-out cross-validation uses each sample as its own block.

**Repeated CV** can be used when training set sizes are small. 5 repeats of 10-fold CV yields 50 sets of metrics to average (see the sketch on the next slide).
]
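---

# Repeated CV in code

This slide is not part of the original example, but as a minimal sketch: rsample's `vfold_cv()` (covered in detail shortly) takes a `repeats` argument, so 5 repeats of 10-fold CV is a single call. The seed value is arbitrary, and `chi_train` is the training set from the earlier sections.

```r
library(rsample)

# 5 repeats of 10-fold CV: 50 resamples, hence 50 sets of metrics to average
set.seed(101)
vfold_cv(chi_train, v = 10, repeats = 5)
```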
---

# 3-Fold cross-validation with _n_ = 30

Randomly assign each sample to one of three folds.

<img src="images/three-CV.svg" width="55%" style="display: block; margin: auto;" />

---

# 3-Fold cross-validation with _n_ = 30

<img src="images/three-CV-iter.svg" width="65%" style="display: block; margin: auto;" />

---

# Resampling results

The goal of resampling is to produce a single estimate of performance for a model.

Even though we end up estimating _V_ models (for _V_-fold CV), these models are discarded after we have our performance estimate.

Resampling is basically an _empirical simulation system_ used to understand how well the model would work on _new data_.

---

# Cross-validating using rsample

rsample has a number of resampling functions built in. One is `vfold_cv()`, for performing V-fold cross-validation like we've been discussing.

```r
set.seed(2453)

cv_splits <- vfold_cv(chi_train) # 10-fold is the default
cv_splits
```

```
## #  10-fold cross-validation 
## # A tibble: 10 × 2
##    splits             id    
##    <list>             <chr> 
##  1 <split [5115/569]> Fold01
##  2 <split [5115/569]> Fold02
##  3 <split [5115/569]> Fold03
##  4 <split [5115/569]> Fold04
##  5 <split [5116/568]> Fold05
##  6 <split [5116/568]> Fold06
##  7 <split [5116/568]> Fold07
##  8 <split [5116/568]> Fold08
##  9 <split [5116/568]> Fold09
## 10 <split [5116/568]> Fold10
```

???

Note that a split printed as `<split [2K/222]>` is rounded to the nearest thousand and is the same as `<1977/222/2199>`.

---

# Cross-validating using rsample

- Each individual split object is similar to the `initial_split()` example.
- Use `analysis()` to extract the resample's data used for the fitting process.
- Use `assessment()` to extract the resample's data used for performance estimation.

.pull-left[
```r
cv_splits$splits[[1]]
```

```
## <Analysis/Assess/Total>
## <5115/569/5684>
```
]

.pull-right[
```r
cv_splits$splits[[1]] %>%
  analysis() %>%
  dim()
```

```
## [1] 5115   50
```

```r
cv_splits$splits[[1]] %>%
  assessment() %>%
  dim()
```

```
## [1] 569  50
```
]
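---

# What happens inside one resample

A minimal sketch of a single iteration, not the workflow we will actually use: fit on the analysis set, predict the assessment set, and score the holdout predictions. The simple `lm()` model and the `Clark_Lake` predictor are just illustrative choices from the Chicago data.

```r
library(rsample)
library(yardstick)

split <- cv_splits$splits[[1]]

# Fit on the analysis set (the "mini training set" for this resample)
fit <- lm(ridership ~ Clark_Lake, data = analysis(split))

# Predict the held-out assessment set and compute the holdout RMSE
holdout <- assessment(split)
rmse_vec(truth = holdout$ridership, estimate = predict(fit, holdout))
```

`fit_resamples()`, shown later, repeats exactly this loop over every resample and averages the results.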
---

# Time series resampling

Our Chicago data is ordered over time. Regular cross-validation, which uses random sampling, may not be the best idea.

We can emulate our training/test split by making similar resamples.

* Fold 1: Take the first X years of data as the analysis set, the next 2 weeks as the assessment set.
* Fold 2: Take the first X years + 2 weeks of data as the analysis set, the next 2 weeks as the assessment set.
* Fold 3: Take the first X years + 4 weeks of data as the analysis set, the next 2 weeks as the assessment set.
* and so on

Here is a small example with a 3-day assessment set:

---

# Rolling forecast origin resampling

<img src="images/rolling.svg" width="65%" style="display: block; margin: auto;" />

---

# Using rsample to do this

```r
chi_rs <-
  chi_train %>%
  sliding_period(
    index = "date"
  )
```

Use the `date` column to find the date data.

---

# Using rsample to do this

```r
chi_rs <-
  chi_train %>%
  sliding_period(
    index = "date",
    period = "week"
  )
```

Our units will be weeks.

---

# Using rsample to do this

```r
chi_rs <-
  chi_train %>%
  sliding_period(
    index = "date",
    period = "week",
    lookback = 52 * 15
  )
```

Every analysis set has 15 years of data.

---

# Using rsample to do this

```r
chi_rs <-
  chi_train %>%
  sliding_period(
    index = "date",
    period = "week",
    lookback = 52 * 15,
    assess_stop = 2
  )
```

Every assessment set has 2 weeks of data.

---

# Using rsample to do this

```r
chi_rs <-
  chi_train %>%
  sliding_period(
    index = "date",
    period = "week",
    lookback = 52 * 15,
    assess_stop = 2,
    step = 2
  )
```

Increment by 2 weeks so that there are no overlapping assessment sets.

For example:

```r
chi_rs$splits[[1]] %>%
  assessment() %>%
  pluck("date") %>%
  range()
```

```
## [1] "2016-01-07" "2016-01-20"
```

```r
chi_rs$splits[[2]] %>%
  assessment() %>%
  pluck("date") %>%
  range()
```

```
## [1] "2016-01-21" "2016-02-03"
```

---

# Our resampling object

```r
chi_rs
```

```
## # Sliding period resampling 
## # A tibble: 16 × 2
##    splits            id     
##    <list>            <chr>  
##  1 <split [5463/14]> Slice01
##  2 <split [5467/14]> Slice02
##  3 <split [5467/14]> Slice03
##  4 <split [5467/14]> Slice04
##  5 <split [5467/14]> Slice05
##  6 <split [5467/14]> Slice06
##  7 <split [5467/14]> Slice07
##  8 <split [5467/14]> Slice08
##  9 <split [5467/14]> Slice09
## 10 <split [5467/14]> Slice10
## 11 <split [5467/14]> Slice11
## 12 <split [5467/14]> Slice12
## 13 <split [5467/14]> Slice13
## 14 <split [5467/14]> Slice14
## 15 <split [5467/14]> Slice15
## 16 <split [5467/11]> Slice16
```

We will fit 16 models on 16 slightly different analysis sets. Each will produce a separate RMSE, and we will average the 16 RMSE values to get the resampling estimate of that statistic.

---

# Generating the resampling statistics

Let's use the workflow from the last section (`chi_wflow`).

In tidymodels, there is a function called `fit_resamples()` that will do all of this for us:

```r
ctrl <- control_resamples(save_pred = TRUE)

chi_res <-
  chi_wflow %>%
  fit_resamples(resamples = chi_rs, control = ctrl)
chi_res
```

```
## # Resampling results
## # Sliding period resampling 
## # A tibble: 16 × 5
##    splits            id      .metrics         .notes           .predictions   
##    <list>            <chr>   <list>           <list>           <list>         
##  1 <split [5463/14]> Slice01 <tibble [2 × 4]> <tibble [0 × 1]> <tibble [14 × …
##  2 <split [5467/14]> Slice02 <tibble [2 × 4]> <tibble [0 × 1]> <tibble [14 × …
##  3 <split [5467/14]> Slice03 <tibble [2 × 4]> <tibble [0 × 1]> <tibble [14 × …
##  4 <split [5467/14]> Slice04 <tibble [2 × 4]> <tibble [0 × 1]> <tibble [14 × …
##  5 <split [5467/14]> Slice05 <tibble [2 × 4]> <tibble [0 × 1]> <tibble [14 × …
##  6 <split [5467/14]> Slice06 <tibble [2 × 4]> <tibble [0 × 1]> <tibble [14 × …
##  7 <split [5467/14]> Slice07 <tibble [2 × 4]> <tibble [0 × 1]> <tibble [14 × …
##  8 <split [5467/14]> Slice08 <tibble [2 × 4]> <tibble [0 × 1]> <tibble [14 × …
##  9 <split [5467/14]> Slice09 <tibble [2 × 4]> <tibble [0 × 1]> <tibble [14 × …
## 10 <split [5467/14]> Slice10 <tibble [2 × 4]> <tibble [0 × 1]> <tibble [14 × …
## 11 <split [5467/14]> Slice11 <tibble [2 × 4]> <tibble [0 × 1]> <tibble [14 × …
## 12 <split [5467/14]> Slice12 <tibble [2 × 4]> <tibble [0 × 1]> <tibble [14 × …
## 13 <split [5467/14]> Slice13 <tibble [2 × 4]> <tibble [0 × 1]> <tibble [14 × …
## 14 <split [5467/14]> Slice14 <tibble [2 × 4]> <tibble [0 × 1]> <tibble [14 × …
## 15 <split [5467/14]> Slice15 <tibble [2 × 4]> <tibble [0 × 1]> <tibble [14 × …
## 16 <split [5467/11]> Slice16 <tibble [2 × 4]> <tibble [0 × 1]> <tibble [11 × …
```
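---

# Peeking at per-resample metrics

Before averaging, each slice's metrics live in the `.metrics` list column. A quick way to see them all is a sketch like the following, assuming dplyr and tidyr are attached (they load with tidymodels); `collect_metrics(chi_res, summarize = FALSE)`, used later, returns the same information.

```r
library(dplyr)
library(tidyr)

# One rmse row and one rsq row per slice
chi_res %>%
  select(id, .metrics) %>%
  unnest(.metrics)
```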
---

# Getting the results

To obtain the resampling estimates of performance:

```r
collect_metrics(chi_res)
```

```
## # A tibble: 2 × 6
##   .metric .estimator  mean     n std_err .config             
##   <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
## 1 rmse    standard   1.86     16  0.241  Preprocessor1_Model1
## 2 rsq     standard   0.946    16  0.0218 Preprocessor1_Model1
```

To get the holdout predictions:

```r
chi_pred <- collect_predictions(chi_res)
chi_pred %>% slice(1:4)
```

```
## # A tibble: 4 × 5
##   id      .pred  .row ridership .config             
##   <chr>   <dbl> <int>     <dbl> <chr>               
## 1 Slice01 20.1   5464     20.4  Preprocessor1_Model1
## 2 Slice01 18.5   5465     20.1  Preprocessor1_Model1
## 3 Slice01  6.84  5466      4.78 Preprocessor1_Model1
## 4 Slice01  5.35  5467      3.26 Preprocessor1_Model1
```

---

# Plot performance

A simple ggplot with a custom `coord_*` can be used.

.pull-left[
```r
chi_pred %>%
  ggplot(aes(.pred, ridership)) +
  geom_abline(lty = 2, col = "green") +
  geom_point(alpha = 0.3, cex = 2) +
  coord_obs_pred()
```

We can also use the [`shinymodels`](https://github.com/tidymodels/shinymodels) package to get an interactive version of this plot:

```r
library(shinymodels)
explore(chi_res, hover_cols = c(date, ridership))
```
]

.pull-right[
<img src="4-resampling_files/figure-html/unnamed-chunk-14-1.svg" width="90%" style="display: block; margin: auto;" />
]

---

# Plotting the metrics over time

You can get the per-resample metrics and predictions using the `summarize = FALSE` option.

An example function to add dates to the results:

```r
# Add a date column to time series resampling object metrics
add_date_to_metrics <- function(x, date_col, value = min, ...) {
  res <- collect_metrics(x, summarize = FALSE, ...)
  x %>%
    mutate(
      # Get the assessment set
      holdout = purrr::map(splits, assessment),
      # Keep the date column
      holdout = purrr::map(holdout, ~ select(.x, all_of(date_col))),
      # Find a date to represent the range
      date = purrr::map(holdout, ~ value(.x[[date_col]]))
    ) %>%
    # date is a nested list column so unnest then merge with the results
    unnest(c(all_of(date_col))) %>%
    select(id, all_of(date_col)) %>%
    full_join(res, by = "id")
}
```

---

# Plotting the metrics over time

.pull-left[
```r
chi_res %>%
  add_date_to_metrics("date") %>%
  filter(.metric == "rmse") %>%
  ggplot(aes(x = date, y = .estimate)) +
  geom_point() +
  labs(y = "RMSE") +
  scale_x_date(date_breaks = "2 months")
```
]

.pull-right[
<img src="4-resampling_files/figure-html/unnamed-chunk-16-1.svg" width="90%" style="display: block; margin: auto;" />
]

---

# Some notes

* These model fits are independent of one another. [Parallel processing](https://www.tmwr.org/resampling.html#parallel) can be used to significantly speed up the training process.
* The individual models can [be saved](https://www.tmwr.org/resampling.html#extract) so you can look at variation in the model parameters or recipe steps.
* If you are interested in a [validation set](https://www.tmwr.org/resampling.html#validation), tidymodels considers that a single resample of the data. Everything else in these notes works the same.

---

# Hands-On: Perform resampling

Go to the lab and fit your model within some resamples.
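A possible starting template, assuming the objects built in these notes; the `my_*` names are placeholders:

```r
set.seed(123)

# Resample the training set (or use sliding_period() as shown earlier)
my_folds <- vfold_cv(chi_train, v = 10)

my_res <-
  chi_wflow %>%
  fit_resamples(resamples = my_folds, control = control_resamples(save_pred = TRUE))

collect_metrics(my_res)
```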