Model Tuning

class: center, middle, title-slide

# Model Tuning
## NHS-R Conference 2021
### Emil Hvitfeldt
### 2021-11-02

---


class: inverse, middle, center

<div style = "position:fixed; visibility: hidden">
`$$\require{color}\definecolor{purple}{rgb}{0.525490196078431, 0.419607843137255, 0.749019607843137}$$`
`$$\require{color}\definecolor{green}{rgb}{0.0117647058823529, 0.650980392156863, 0.415686274509804}$$`
`$$\require{color}\definecolor{orange}{rgb}{0.949019607843137, 0.580392156862745, 0.254901960784314}$$`
`$$\require{color}\definecolor{white}{rgb}{1, 1, 1}$$`
</div>

# [`tidymodels.org`](https://www.tidymodels.org/)

# _Tidy Modeling with R_ ([`tmwr.org`](https://www.tmwr.org/))

---

# Tuning parameters

These are model or preprocessing parameters that are important but cannot be estimated directly from the data.

Some examples:

.pull-left[

* Tree depth in decision trees.

* Number of neighbors in a K-nearest neighbor model.

* Activation function (e.g. sigmoidal, ReLU) in neural networks.

* Number of PCA components to retain.

]
.pull-right[

* Covariance/correlation matrix structure in mixed models.

* Data distribution in survival models.

* Spline degrees of freedom. 
]
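
In tidymodels, parameters like these are flagged for tuning by setting the argument to `tune()`. A minimal sketch, assuming the rpart and kknn engines and regression mode; the object names `tree_spec` and `knn_spec` are made up for illustration:

```r
library(tidymodels)

# Tree depth in a decision tree, marked as "to be tuned"
tree_spec <- 
  decision_tree(tree_depth = tune()) %>% 
  set_engine("rpart") %>% 
  set_mode("regression")

# Number of neighbors in a K-nearest neighbor model
knn_spec <- 
  nearest_neighbor(neighbors = tune()) %>% 
  set_engine("kknn") %>% 
  set_mode("regression")

# Printing a specification shows which arguments are placeholders
tree_spec
```

Nothing is fitted here; `tune()` only marks the argument so that a later tuning step can fill in candidate values.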

---

# Optimizing tuning parameters

The main approach is to try different values and measure their performance. This can lead us to good values for these parameters.

The two main classes of optimization methods are:

* _Grid search_, where a pre-defined set of candidate values is tested.

* _Iterative search_, where methods suggest/estimate new candidate values to evaluate.

Once the value(s) of the parameter(s) are determined, a model can be finalized by fitting the model to the entire training set.
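
To make the grid search idea concrete, here is a small base R sketch. It is not the tidymodels workflow: `mtcars` stands in for a training set, the polynomial degree plays the role of the tuning parameter, and a single held-out split is used only to keep the example short.

```r
set.seed(1)

# Assumption: mtcars stands in for a training set
n <- nrow(mtcars)
assess_rows <- sample(n, size = 8)
analysis    <- mtcars[-assess_rows, ]  # used to fit candidate models
assessment  <- mtcars[assess_rows, ]   # used to measure performance

# Pre-defined grid of candidate values for the tuning parameter
grid <- data.frame(degree = 1:5)

# RMSE on the held-out rows for one candidate value
rmse_for_degree <- function(degree) {
  fit  <- lm(mpg ~ poly(hp, degree), data = analysis)
  pred <- predict(fit, newdata = assessment)
  sqrt(mean((assessment$mpg - pred)^2))
}

grid$rmse <- vapply(grid$degree, rmse_for_degree, numeric(1))

# Pick the candidate with the lowest held-out error
best_degree <- grid$degree[which.min(grid$rmse)]

# Finalize: refit with the chosen value on the entire training set
final_fit <- lm(mpg ~ poly(hp, best_degree), data = mtcars)
```

An iterative search would replace the fixed `grid` with candidates that are proposed based on the results seen so far.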

---

# Measuring tuning parameters

We need performance metrics to tell us which candidate values are good and which are not.

Using the test set, or simply re-predicting the training set, are very bad ideas.

Since tuning parameters often control complexity, they can often lead to [_overfitting_](https://www.tmwr.org/tuning.html#overfitting-bad).

* This is where the model does very well on the training set but poorly on new data.

Using _resampling_ to estimate performance can help identify parameters that lead to overfitting.

The cost is computational time.
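
A rough base R illustration of why re-predicting the training set is misleading. The data are simulated and the polynomial degree is a stand-in for any complexity parameter; both are assumptions made for this sketch.

```r
set.seed(123)

# Simulated training data (hypothetical): smooth signal plus noise
n <- 60
train <- data.frame(x = runif(n, -2, 2))
train$y <- sin(2 * train$x) + rnorm(n, sd = 0.4)

rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

# 10-fold cross-validation: assign each row to one fold
folds <- sample(rep(1:10, length.out = n))

evaluate_degree <- function(degree) {
  # Re-predicting the training set
  full_fit <- lm(y ~ poly(x, degree), data = train)
  resub <- rmse(train$y, predict(full_fit, newdata = train))

  # Resampling: each fold is predicted by a model that never saw it
  cv <- vapply(1:10, function(k) {
    fit      <- lm(y ~ poly(x, degree), data = train[folds != k, ])
    held_out <- train[folds == k, ]
    rmse(held_out$y, predict(fit, newdata = held_out))
  }, numeric(1))

  c(degree = degree, resubstitution = resub, cross_validation = mean(cv))
}

results <- as.data.frame(t(sapply(1:12, evaluate_degree)))
results
```

The `resubstitution` column can only go down as the degree grows, so it would always favor the most complex model. The `cross_validation` column is the one to use for comparing candidates, and it costs ten extra model fits per candidate.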

---

# Overfitting with a support vector machine

---

# Choosing tuning parameters

Let's take our previous model and add a few changes:

```r
lm_spec <- 
  linear_reg() %>% 
  set_engine("lm")

chi_rec <- 
  recipe(ridership ~ ., data = chi_train) %>% 
  step_date(date, features = c("dow", "year")) %>% 
  step_holiday(date) %>% 
  update_role(date, new_role = "id") %>% 
  step_dummy(all_nominal_predictors()) %>% 
  step_zv(all_predictors()) %>% 
  step_normalize(all_numeric_predictors()) %>% 
  step_corr(all_numeric_predictors(), threshold = 0.9)