class: center, middle, title-slide

# Using resampling to estimate performance

## NHS-R Conference 2021

### Emil Hvitfeldt

### 2021-11-02

---
class: inverse, middle, center

<!--- Packages --------------------------------------------------------------->
<!--- Chunk options ---------------------------------------------------------->
<!--- pkg highlight ----------------------------------------------------------->

<style>
.pkg {
  font-weight: bold;
  letter-spacing: 0.5pt;
  color: #866BBF;
}
</style>

<!--- Highlighting colors ----------------------------------------------------->

<div style = "position:fixed; visibility: hidden">
`$$\require{color}\definecolor{purple}{rgb}{0.525490196078431, 0.419607843137255, 0.749019607843137}$$`
`$$\require{color}\definecolor{green}{rgb}{0.0117647058823529, 0.650980392156863, 0.415686274509804}$$`
`$$\require{color}\definecolor{orange}{rgb}{0.949019607843137, 0.580392156862745, 0.254901960784314}$$`
`$$\require{color}\definecolor{white}{rgb}{1, 1, 1}$$`
</div>

<script type="text/x-mathjax-config">
MathJax.Hub.Config({
  TeX: {
    Macros: {
      purple: ["{\\color{purple}{#1}}", 1],
      green: ["{\\color{green}{#1}}", 1],
      orange: ["{\\color{orange}{#1}}", 1],
      white: ["{\\color{white}{#1}}", 1]
    },
    loader: {load: ['[tex]/color']},
    tex: {packages: {'[+]': ['color']}}
  }
});
</script>

<style>
.purple {color: #866BBF;}
.green {color: #03A66A;}
.orange {color: #F29441;}
.white {color: #FFFFFF;}
</style>

<!--- knitr hooks ------------------------------------------------------------->

# [`tidymodels.org`](https://www.tidymodels.org/)

# _Tidy Modeling with R_ ([`tmwr.org`](https://www.tmwr.org/))

---

# What is resampling?

---

# Resampling methods

.pull-left[
These are additional data splitting schemes that are applied to the _training_ set and are used for **estimating model performance**.

They attempt to simulate slightly different versions of the training set. Each of these versions is split into two subsets:

* The _analysis set_ is used to fit the model (analogous to the training set).
* Performance is determined using the _assessment set_.

This process is repeated many times.
]

.pull-right[
<img src="images/resampling.svg" width="120%" style="display: block; margin: auto;" />
]

There are [different flavors of resampling](https://bookdown.org/max/FES/resampling.html) but we will focus on two methods in these notes.

---

# The model workflow and resampling

All resampling methods repeat this process multiple times:

<img src="images/diagram-resampling.svg" width="65%" style="display: block; margin: auto;" />

The final resampling estimate is the average of all of the estimated metrics (e.g. RMSE).

---

# V-Fold cross-validation

.pull-left[
Here, we randomly split the training data into _V_ distinct blocks of roughly equal size (AKA the "folds").

* We leave out the first block of data and fit a model to the remaining blocks (the analysis set).
* This model is used to predict the held-out block of assessment data.
* We continue this process until we've predicted all _V_ assessment blocks.

The final performance estimate is computed by _averaging_ the statistics from the _V_ hold-out blocks.
]

.pull-right[
_V_ is usually taken to be 5 or 10; leave-one-out cross-validation uses each sample as its own block.

**Repeated CV** can be used when training set sizes are small. 5 repeats of 10-fold CV yields 50 sets of metrics to average (see the sketch on the next slide).
]
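---

# Repeated CV in code

This slide is not part of the original example, but as a minimal sketch: rsample's `vfold_cv()` (covered in detail shortly) takes a `repeats` argument, so 5 repeats of 10-fold CV is a single call. The seed value is arbitrary, and `chi_train` is the training set from the earlier sections.

```r
library(rsample)

# 5 repeats of 10-fold CV: 50 resamples, hence 50 sets of metrics to average
set.seed(101)
vfold_cv(chi_train, v = 10, repeats = 5)
```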
---

# 3-Fold cross-validation with _n_ = 30

Randomly assign each sample to one of three folds.

<img src="images/three-CV.svg" width="55%" style="display: block; margin: auto;" />

---

# 3-Fold cross-validation with _n_ = 30

<img src="images/three-CV-iter.svg" width="65%" style="display: block; margin: auto;" />

---

# Resampling results

The goal of resampling is to produce a single estimate of performance for a model.

Even though we end up estimating _V_ models (for _V_-fold CV), these models are discarded after we have our performance estimate.

Resampling is basically an _empirical simulation system_ used to understand how well the model would work on _new data_.

---

# Cross-validating using rsample

rsample has a number of resampling functions built in. One is `vfold_cv()`, for performing V-fold cross-validation like we've been discussing.

```r
set.seed(2453)

cv_splits <- vfold_cv(chi_train) # 10-fold is the default
cv_splits
```

```
## #  10-fold cross-validation 
## # A tibble: 10 × 2
##    splits             id    
##    <list>             <chr> 
##  1 <split [5115/569]> Fold01
##  2 <split [5115/569]> Fold02
##  3 <split [5115/569]> Fold03
##  4 <split [5115/569]> Fold04
##  5 <split [5116/568]> Fold05
##  6 <split [5116/568]> Fold06
##  7 <split [5116/568]> Fold07
##  8 <split [5116/568]> Fold08
##  9 <split [5116/568]> Fold09
## 10 <split [5116/568]> Fold10
```

???

Note that a split printed as `<split [2K/222]>` is rounded to the nearest thousand and is the same as `<1977/222/2199>`.

---

# Cross-validating using rsample

- Each individual split object is similar to the `initial_split()` example.
- Use `analysis()` to extract the resample's data used for the fitting process.
- Use `assessment()` to extract the resample's data used for performance estimation.

.pull-left[
```r
cv_splits$splits[[1]]
```

```
## <Analysis/Assess/Total>
## <5115/569/5684>
```
]

.pull-right[
```r
cv_splits$splits[[1]] %>%
  analysis() %>%
  dim()
```

```
## [1] 5115   50
```

```r
cv_splits$splits[[1]] %>%
  assessment() %>%
  dim()
```

```
## [1] 569  50
```
]
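---

# What happens inside one resample

A minimal sketch of a single iteration, not the workflow we will actually use: fit on the analysis set, predict the assessment set, and score the holdout predictions. The simple `lm()` model and the `Clark_Lake` predictor are just illustrative choices from the Chicago data.

```r
library(rsample)
library(yardstick)

split <- cv_splits$splits[[1]]

# Fit on the analysis set (the "mini training set" for this resample)
fit <- lm(ridership ~ Clark_Lake, data = analysis(split))

# Predict the held-out assessment set and compute the holdout RMSE
holdout <- assessment(split)
rmse_vec(truth = holdout$ridership, estimate = predict(fit, holdout))
```

`fit_resamples()`, shown later, repeats exactly this loop over every resample and averages the results.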
---

# Time series resampling

Our Chicago data is ordered over time. Regular cross-validation, which uses random sampling, may not be the best idea.

We can emulate our training/test split by making similar resamples.

* Fold 1: Take the first X years of data as the analysis set, the next 2 weeks as the assessment set.
* Fold 2: Take the first X years + 2 weeks of data as the analysis set, the next 2 weeks as the assessment set.
* Fold 3: Take the first X years + 4 weeks of data as the analysis set, the next 2 weeks as the assessment set.
* and so on

Here is a small example with a 3-day assessment set:

---

# Rolling forecast origin resampling

<img src="images/rolling.svg" width="65%" style="display: block; margin: auto;" />

---

# Using rsample to do this

```r
chi_rs <-
  chi_train %>%
  sliding_period(
    index = "date"
  )
```

Use the `date` column to find the date data.

---

# Using rsample to do this

```r
chi_rs <-
  chi_train %>%
  sliding_period(
    index = "date",
    period = "week"
  )
```

Our units will be weeks.

---

# Using rsample to do this

```r
chi_rs <-
  chi_train %>%
  sliding_period(
    index = "date",
    period = "week",
    lookback = 52 * 15
  )
```

Every analysis set has 15 years of data.

---

# Using rsample to do this

```r
chi_rs <-
  chi_train %>%
  sliding_period(
    index = "date",
    period = "week",
    lookback = 52 * 15,
    assess_stop = 2
  )
```

Every assessment set has 2 weeks of data.

---

# Using rsample to do this

```r
chi_rs <-
  chi_train %>%
  sliding_period(
    index = "date",
    period = "week",
    lookback = 52 * 15,
    assess_stop = 2,
    step = 2
  )
```

Increment by 2 weeks so that there are no overlapping assessment sets.

For example:

```r
chi_rs$splits[[1]] %>%
  assessment() %>%
  pluck("date") %>%
  range()
```

```
## [1] "2016-01-07" "2016-01-20"
```

```r
chi_rs$splits[[2]] %>%
  assessment() %>%
  pluck("date") %>%
  range()
```

```
## [1] "2016-01-21" "2016-02-03"
```

---

# Our resampling object

```r
chi_rs
```

```
## # Sliding period resampling 
## # A tibble: 16 × 2
##    splits            id     
##    <list>            <chr>  
##  1 <split [5463/14]> Slice01
##  2 <split [5467/14]> Slice02
##  3 <split [5467/14]> Slice03
##  4 <split [5467/14]> Slice04
##  5 <split [5467/14]> Slice05
##  6 <split [5467/14]> Slice06
##  7 <split [5467/14]> Slice07
##  8 <split [5467/14]> Slice08
##  9 <split [5467/14]> Slice09
## 10 <split [5467/14]> Slice10
## 11 <split [5467/14]> Slice11
## 12 <split [5467/14]> Slice12
## 13 <split [5467/14]> Slice13
## 14 <split [5467/14]> Slice14
## 15 <split [5467/14]> Slice15
## 16 <split [5467/11]> Slice16
```

We will fit 16 models on 16 slightly different analysis sets. Each will produce a separate RMSE, and we will average the 16 RMSE values to get the resampling estimate of that statistic.

---

# Generating the resampling statistics

Let's use the workflow from the last section (`chi_wflow`).

In tidymodels, there is a function called `fit_resamples()` that will do all of this for us:

```r
ctrl <- control_resamples(save_pred = TRUE)

chi_res <-
  chi_wflow %>%
  fit_resamples(resamples = chi_rs, control = ctrl)
chi_res
```

```
## # Resampling results
## # Sliding period resampling 
## # A tibble: 16 × 5
##    splits            id      .metrics         .notes           .predictions   
##    <list>            <chr>   <list>           <list>           <list>         
##  1 <split [5463/14]> Slice01 <tibble [2 × 4]> <tibble [0 × 1]> <tibble [14 × …
##  2 <split [5467/14]> Slice02 <tibble [2 × 4]> <tibble [0 × 1]> <tibble [14 × …
##  3 <split [5467/14]> Slice03 <tibble [2 × 4]> <tibble [0 × 1]> <tibble [14 × …
##  4 <split [5467/14]> Slice04 <tibble [2 × 4]> <tibble [0 × 1]> <tibble [14 × …
##  5 <split [5467/14]> Slice05 <tibble [2 × 4]> <tibble [0 × 1]> <tibble [14 × …
##  6 <split [5467/14]> Slice06 <tibble [2 × 4]> <tibble [0 × 1]> <tibble [14 × …
##  7 <split [5467/14]> Slice07 <tibble [2 × 4]> <tibble [0 × 1]> <tibble [14 × …
##  8 <split [5467/14]> Slice08 <tibble [2 × 4]> <tibble [0 × 1]> <tibble [14 × …
##  9 <split [5467/14]> Slice09 <tibble [2 × 4]> <tibble [0 × 1]> <tibble [14 × …
## 10 <split [5467/14]> Slice10 <tibble [2 × 4]> <tibble [0 × 1]> <tibble [14 × …
## 11 <split [5467/14]> Slice11 <tibble [2 × 4]> <tibble [0 × 1]> <tibble [14 × …
## 12 <split [5467/14]> Slice12 <tibble [2 × 4]> <tibble [0 × 1]> <tibble [14 × …
## 13 <split [5467/14]> Slice13 <tibble [2 × 4]> <tibble [0 × 1]> <tibble [14 × …
## 14 <split [5467/14]> Slice14 <tibble [2 × 4]> <tibble [0 × 1]> <tibble [14 × …
## 15 <split [5467/14]> Slice15 <tibble [2 × 4]> <tibble [0 × 1]> <tibble [14 × …
## 16 <split [5467/11]> Slice16 <tibble [2 × 4]> <tibble [0 × 1]> <tibble [11 × …
```
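---

# Peeking at per-resample metrics

Before averaging, each slice's metrics live in the `.metrics` list column. A quick way to see them all is a sketch like the following, assuming dplyr and tidyr are attached (they load with tidymodels); `collect_metrics(chi_res, summarize = FALSE)`, used later, returns the same information.

```r
library(dplyr)
library(tidyr)

# One rmse row and one rsq row per slice
chi_res %>%
  select(id, .metrics) %>%
  unnest(.metrics)
```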
---

# Getting the results

To obtain the resampling estimates of performance:

```r
collect_metrics(chi_res)
```

```
## # A tibble: 2 × 6
##   .metric .estimator  mean     n std_err .config             
##   <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
## 1 rmse    standard   1.86     16  0.241  Preprocessor1_Model1
## 2 rsq     standard   0.946    16  0.0218 Preprocessor1_Model1
```

To get the holdout predictions:

```r
chi_pred <- collect_predictions(chi_res)
chi_pred %>% slice(1:4)
```

```
## # A tibble: 4 × 5
##   id      .pred  .row ridership .config             
##   <chr>   <dbl> <int>     <dbl> <chr>               
## 1 Slice01 20.1   5464     20.4  Preprocessor1_Model1
## 2 Slice01 18.5   5465     20.1  Preprocessor1_Model1
## 3 Slice01  6.84  5466      4.78 Preprocessor1_Model1
## 4 Slice01  5.35  5467      3.26 Preprocessor1_Model1
```

---

# Plot performance

A simple ggplot with a custom `coord_*` can be used.

.pull-left[
```r
chi_pred %>%
  ggplot(aes(.pred, ridership)) +
  geom_abline(lty = 2, col = "green") +
  geom_point(alpha = 0.3, cex = 2) +
  coord_obs_pred()
```

We can also use the [`shinymodels`](https://github.com/tidymodels/shinymodels) package to get an interactive version of this plot:

```r
library(shinymodels)
explore(chi_res, hover_cols = c(date, ridership))
```
]

.pull-right[
<img src="4-resampling_files/figure-html/unnamed-chunk-14-1.svg" width="90%" style="display: block; margin: auto;" />
]

---

# Plotting the metrics over time

You can get the per-resample metrics and predictions using the `summarize = FALSE` option.

An example function to add dates to the results:

```r
# Add a date column to time series resampling object metrics
add_date_to_metrics <- function(x, date_col, value = min, ...) {
  res <- collect_metrics(x, summarize = FALSE, ...)
  x %>%
    mutate(
      # Get the assessment set
      holdout = purrr::map(splits, assessment),
      # Keep the date column
      holdout = purrr::map(holdout, ~ select(.x, all_of(date_col))),
      # Find a date to represent the range
      date = purrr::map(holdout, ~ value(.x[[date_col]]))
    ) %>%
    # date is a nested list column so unnest then merge with the results
    unnest(c(all_of(date_col))) %>%
    select(id, all_of(date_col)) %>%
    full_join(res, by = "id")
}
```

---

# Plotting the metrics over time

.pull-left[
```r
chi_res %>%
  add_date_to_metrics("date") %>%
  filter(.metric == "rmse") %>%
  ggplot(aes(x = date, y = .estimate)) +
  geom_point() +
  labs(y = "RMSE") +
  scale_x_date(date_breaks = "2 months")
```
]

.pull-right[
<img src="4-resampling_files/figure-html/unnamed-chunk-16-1.svg" width="90%" style="display: block; margin: auto;" />
]

---

# Some notes

* These model fits are independent of one another. [Parallel processing](https://www.tmwr.org/resampling.html#parallel) can be used to significantly speed up the training process.
* The individual models can [be saved](https://www.tmwr.org/resampling.html#extract) so you can look at variation in the model parameters or recipe steps.
* If you are interested in a [validation set](https://www.tmwr.org/resampling.html#validation), tidymodels considers that a single resample of the data. Everything else in these notes works the same.

---

# Hands-On: Perform resampling

Go to the lab and fit your model within some resamples.
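A possible starting template, assuming the objects built in these notes; the `my_*` names are placeholders:

```r
set.seed(123)

# Resample the training set (or use sliding_period() as shown earlier)
my_folds <- vfold_cv(chi_train, v = 10)

my_res <-
  chi_wflow %>%
  fit_resamples(resamples = my_folds, control = control_resamples(save_pred = TRUE))

collect_metrics(my_res)
```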