class: center, middle, title-slide

# Feature engineering

## NHS-R Conference 2021

### Emil Hvitfeldt

### 2021-11-02

---
class: inverse, middle, center

<!--- Packages --------------------------------------------------------------->
<!--- Chunk options ---------------------------------------------------------->
<!--- pkg highlight ---------------------------------------------------------->

<style>
.pkg {
  font-weight: bold;
  letter-spacing: 0.5pt;
  color: #866BBF;
}
</style>

<!--- Highlighting colors ---------------------------------------------------->

<div style = "position:fixed; visibility: hidden">
`$$\require{color}\definecolor{purple}{rgb}{0.525490196078431, 0.419607843137255, 0.749019607843137}$$`
`$$\require{color}\definecolor{green}{rgb}{0.0117647058823529, 0.650980392156863, 0.415686274509804}$$`
`$$\require{color}\definecolor{orange}{rgb}{0.949019607843137, 0.580392156862745, 0.254901960784314}$$`
`$$\require{color}\definecolor{white}{rgb}{1, 1, 1}$$`
</div>

<script type="text/x-mathjax-config">
MathJax.Hub.Config({
  TeX: {
    Macros: {
      purple: ["{\\color{purple}{#1}}", 1],
      green: ["{\\color{green}{#1}}", 1],
      orange: ["{\\color{orange}{#1}}", 1],
      white: ["{\\color{white}{#1}}", 1]
    },
    loader: {load: ['[tex]/color']},
    tex: {packages: {'[+]': ['color']}}
  }
});
</script>

<style>
.purple {color: #866BBF;}
.green  {color: #03A66A;}
.orange {color: #F29441;}
.white  {color: #FFFFFF;}
</style>

<!--- knitr hooks ------------------------------------------------------------>

# [`tidymodels.org`](https://www.tidymodels.org/)

# _Tidy Modeling with R_ ([`tmwr.org`](https://www.tmwr.org/))

---

# What is feature engineering?

First things first: what's a feature?

I tend to think of a feature as some representation of a predictor that will be used in a model.

Old-school features:

* Interactions
* Polynomial expansions/splines
* PCA feature extraction

"Feature engineering" sounds pretty cool, but let's take a minute to talk about _preprocessing_ data.

---

# Two types of preprocessing

<img src="images/fe_venn.svg" width="75%" style="display: block; margin: auto;" />

---

# Two types of preprocessing

<img src="images/fe_venn_info.svg" width="75%" style="display: block; margin: auto;" />

---

# Easy examples

For example, centering and scaling are definitely not feature engineering.

Consider the `date` field in the Chicago data. If given as a raw predictor, it is converted to an integer. Spoiler alert: the date is the most important predictor. It can be re-encoded as:

* Days since a reference date 😪
* Day of the week ❤️❤️❤️❤️
* Month 😪
* Year ❤️❤️
* Indicators for holidays ❤️❤️❤️
* Indicators for home games for NFL, NBA, etc. 😪

---

# General definitions

* _Data preprocessing_ is the set of steps that you take to make your model successful.
* _Feature engineering_ is what you do to the original predictors to make the model do the least work to predict the outcome as well as possible.

We'll demonstrate the .pkg[recipes] package for all of your data needs.

---

# Recipes prepare your data for modeling

The package is an extensible framework for pipeable sequences of feature engineering steps that provide preprocessing tools to be applied to data.

Statistical parameters for the steps can be estimated from an initial data set and then applied to other data sets.

The resulting processed output can then be used as inputs for statistical or machine learning models.
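---

# Recipes prepare your data for modeling

That estimate-then-apply pattern is worth seeing once in miniature. Here is a minimal sketch using a hypothetical split of `mtcars` (not the workshop data): the statistics come from the training portion only and are reused, unchanged, on the new data.

```r
library(recipes)

# Purely illustrative split of mtcars (not the workshop data)
train_df <- mtcars[1:20, ]
test_df  <- mtcars[21:32, ]

rec <- recipe(mpg ~ ., data = train_df) %>% 
  step_normalize(all_numeric_predictors()) %>% 
  prep()  # means and SDs are estimated from train_df only

bake(rec, new_data = test_df)  # test_df is scaled with the training statistics
```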
---

# A first recipe

```r
chi_rec <- recipe(ridership ~ ., data = chi_train)

# If ncol(data) is large, you can use
# recipe(data = chi_train)
```

Based on the formula, the function assigns columns to roles of "outcome" or "predictor".

```r
summary(chi_rec)
```

```
## # A tibble: 50 × 4
##    variable         type    role      source  
##    <chr>            <chr>   <chr>     <chr>   
##  1 Austin           numeric predictor original
##  2 Quincy_Wells     numeric predictor original
##  3 Belmont          numeric predictor original
##  4 Archer_35th      numeric predictor original
##  5 Oak_Park         numeric predictor original
##  6 Western          numeric predictor original
##  7 Clark_Lake       numeric predictor original
##  8 Clinton          numeric predictor original
##  9 Merchandise_Mart numeric predictor original
## 10 Irving_Park      numeric predictor original
## # … with 40 more rows
```

---

# A first recipe - work with dates

```r
chi_rec <- recipe(ridership ~ ., data = chi_train) %>% 
  step_date(date, features = c("dow", "month", "year"))
```

This creates three new columns in the data based on the date. Note that the day-of-the-week column is a factor.

---

# A first recipe - work with dates

```r
chi_rec <- recipe(ridership ~ ., data = chi_train) %>% 
  step_date(date, features = c("dow", "month", "year")) %>% 
  step_holiday(date)
```

Add indicators for major holidays. Specific holidays, especially those ex-US, can also be generated.

At this point, we don't need `date` anymore. Instead of deleting it (there is a step for that) we will change its _role_ to be an identification variable.

---

# A first recipe - work with dates

```r
chi_rec <- recipe(ridership ~ ., data = chi_train) %>% 
  step_date(date, features = c("dow", "month", "year")) %>% 
  step_holiday(date) %>% 
  update_role(date, new_role = "id")
```

`date` is still in the data set but tidymodels knows not to treat it as an analysis column.

---

# A first recipe - create indicator variables

```r
chi_rec <- recipe(ridership ~ ., data = chi_train) %>% 
  step_date(date, features = c("dow", "month", "year")) %>% 
  step_holiday(date) %>% 
  update_role(date, new_role = "id") %>% 
  step_dummy(all_nominal_predictors())
```

For any factor or character predictors, make binary indicators.

There are _many_ recipe steps that can convert categorical predictors to numeric columns.

---

# A first recipe - filter out constant columns

```r
chi_rec <- recipe(ridership ~ ., data = chi_train) %>% 
  step_date(date, features = c("dow", "month", "year")) %>% 
  step_holiday(date) %>% 
  update_role(date, new_role = "id") %>% 
  step_dummy(all_nominal_predictors()) %>% 
  step_zv(all_predictors())
```

In case there is a holiday that was never observed, we can delete any _zero-variance_ predictors that have a single unique value.

Note that the selector chooses all columns with a role of "predictor".

---

# A first recipe - normalization

```r
chi_rec <- recipe(ridership ~ ., data = chi_train) %>% 
  step_date(date, features = c("dow", "month", "year")) %>% 
  step_holiday(date) %>% 
  update_role(date, new_role = "id") %>% 
  step_dummy(all_nominal_predictors()) %>% 
  step_zv(all_predictors()) %>% 
  step_normalize(all_numeric_predictors())
```

This centers and scales the numeric predictors.

Note that this will use the training set to estimate the means and standard deviations of the data.

All data put through the recipe will be normalized using those statistics (there is no re-estimation).
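---

# A first recipe - peek at the result

If you want to see what these steps produce, one option is to estimate the recipe and look at the new date-based columns. A sketch, assuming `chi_train` from the workshop materials and the tidymodels packages are in your session:

```r
chi_rec %>% 
  prep() %>%                  # estimate the recipe on chi_train
  bake(new_data = NULL) %>%   # NULL returns the processed training set
  select(starts_with("date_")) %>% 
  slice(1:3)
```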
---

# A first recipe - reduce correlation

```r
chi_rec <- recipe(ridership ~ ., data = chi_train) %>% 
  step_date(date, features = c("dow", "month", "year")) %>% 
  step_holiday(date) %>% 
  update_role(date, new_role = "id") %>% 
  step_dummy(all_nominal_predictors()) %>% 
  step_zv(all_predictors()) %>% 
  step_normalize(all_numeric_predictors()) %>% 
  step_corr(all_numeric_predictors(), threshold = 0.9)
```

To deal with highly correlated predictors, find the minimum set of predictors to remove so that all pairwise correlations are less than 0.9.

There are other filter steps too.

---

# Other possible steps

```r
chi_rec <- recipe(ridership ~ ., data = chi_train) %>% 
  step_date(date, features = c("dow", "month", "year")) %>% 
  step_holiday(date) %>% 
  update_role(date, new_role = "id") %>% 
  step_dummy(all_nominal_predictors()) %>% 
  step_zv(all_predictors()) %>% 
  step_normalize(all_numeric_predictors()) %>% 
  step_pca(all_numeric_predictors())
```

PCA feature extraction...

---

# Other possible steps

```r
chi_rec <- recipe(ridership ~ ., data = chi_train) %>% 
  step_date(date, features = c("dow", "month", "year")) %>% 
  step_holiday(date) %>% 
  update_role(date, new_role = "id") %>% 
  step_dummy(all_nominal_predictors()) %>% 
  step_zv(all_predictors()) %>% 
  step_normalize(all_numeric_predictors()) %>% 
  step_umap(all_numeric_predictors(), outcome = ridership)
```

A fancy supervised dimension reduction technique from machine learning.

---

# Other possible steps

```r
chi_rec <- recipe(ridership ~ ., data = chi_train) %>% 
  step_date(date, features = c("dow", "month", "year")) %>% 
  step_holiday(date) %>% 
  update_role(date, new_role = "id") %>% 
  step_dummy(all_nominal_predictors()) %>% 
  step_zv(all_predictors()) %>% 
  step_normalize(all_numeric_predictors()) %>% 
  step_ns(Clark_Lake, deg_free = 10)
```

Nonlinear transforms like _natural splines_, and so on.

---

# Recipes are estimated

_Every_ preprocessing step in a recipe that involves calculations uses the _training set_. For example:

* Levels of a factor
* Determination of zero-variance predictors
* Normalization
* Feature extraction

and so on.

Once a recipe is added to a workflow, this occurs when `fit()` is called.
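---

# Recipes are estimated

Because steps are estimated, you can also inspect what a trained step decided. A sketch, assuming the `chi_rec` with `step_corr()` from a few slides back, that lists which columns the correlation filter removed:

```r
chi_rec %>% 
  prep() %>%        # estimate every step on the training set
  tidy(number = 6)  # step_corr() is the sixth step; `terms` lists removed columns
```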
---

# Recipes follow this strategy

<img src="images/the-model.svg" width="70%" style="display: block; margin: auto;" />

---

# Adding recipes to workflows

Let's stick to a linear model for now and add a recipe (instead of a formula):

.pull-left[

```r
lm_spec <- linear_reg() 

chi_wflow <- 
  workflow() %>% 
  add_model(lm_spec) %>% 
  add_recipe(chi_rec)

chi_wflow
```
]

.pull-right[

```
## ══ Workflow ══════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: linear_reg()
## 
## ── Preprocessor ──────────────────────────────────────────────────────────────
## 6 Recipe Steps
## 
## • step_date()
## • step_holiday()
## • step_dummy()
## • step_zv()
## • step_normalize()
## • step_corr()
## 
## ── Model ─────────────────────────────────────────────────────────────────────
## Linear Regression Model Specification (regression)
## 
## Computational engine: lm
```
]

---

# Estimate via `fit()`

The `fit()` function estimates the recipe using the training set, then fits the model to the processed version of that data:

.pull-left[

```r
chi_fit <- 
  chi_wflow %>% 
  fit(chi_train)

chi_fit
```
]

.pull-right[

```
## ══ Workflow [trained] ════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: linear_reg()
## 
## ── Preprocessor ──────────────────────────────────────────────────────────────
## 6 Recipe Steps
## 
## • step_date()
## • step_holiday()
## • step_dummy()
## • step_zv()
## • step_normalize()
## • step_corr()
## 
## ── Model ─────────────────────────────────────────────────────────────────────
## 
## Call:
## stats::lm(formula = ..y ~ ., data = data)
## 
## Coefficients:
##       (Intercept)   Washington_Wells        temp_change                dew  
##         13.611685          -0.102278          -0.026406           0.422014  
##          humidity           pressure    pressure_change               wind  
##         -0.070948           0.002097           0.032573          -0.120868  
##          wind_max               gust           gust_max             percip  
##          0.005030          -0.038944           0.097526          -0.034111  
##        percip_max       weather_rain       weather_snow      weather_cloud  
##         -0.030426          -0.127184          -0.127124          -0.087404  
##     weather_storm    Blackhawks_Away    Blackhawks_Home         Bulls_Away  
##          0.009117          -0.037365          -0.005315           0.015849  
##        Bulls_Home         Bears_Away         Bears_Home          Cubs_Home  
##          0.112938           0.061506           0.055691          -0.259078  
##         date_year      date_LaborDay   date_NewYearsDay  date_ChristmasDay  
##          1.743613           0.044648          -0.512929          -0.581508  
##      date_dow_Mon       date_dow_Tue       date_dow_Wed       date_dow_Thu  
##          4.475308           4.969280           4.974286           4.901382  
##      date_dow_Fri       date_dow_Sat     date_month_Feb     date_month_Mar  
##          4.695743           0.411799           0.116181           0.246738  
##    date_month_Apr     date_month_May     date_month_Jun     date_month_Jul  
##          0.346733           0.238892           0.498224           0.331972  
##    date_month_Aug     date_month_Sep     date_month_Oct     date_month_Nov  
##          0.401437           0.326200           0.481322           0.127999  
##    date_month_Dec  
##         -0.057195  
```
]

---

# Prediction

When `predict()` is called, the fitted recipe is applied to the new data before it is predicted by the model:

```r
predict(chi_fit, chi_test)
```

```
## # A tibble: 14 × 1
##    .pred
##    <dbl>
##  1 20.6 
##  2 21.4 
##  3 21.7 
##  4 21.5 
##  5 20.8 
##  6  8.41
##  7  7.39
##  8 20.2 
##  9 21.6 
## 10 21.5 
## 11 21.2 
## 12 20.7 
## 13  8.86
## 14  7.60
```

You don't need to do anything else.
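---

# Prediction

Under the hood, this is roughly equivalent to baking the new data with the trained recipe and then predicting with the underlying model fit. A sketch of that equivalence (never needed in practice, but useful for understanding the mechanics):

```r
processed <- chi_fit %>% 
  extract_recipe() %>%        # the trained recipe
  bake(new_data = chi_test)   # apply it to the test set

chi_fit %>% 
  extract_fit_parsnip() %>%   # the underlying lm fit
  predict(new_data = processed)
```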
---

# Tidying a recipe

.pull-left[

`tidy(recipe)` gives a summary of the steps:

```r
tidy(chi_rec)
```

```
## # A tibble: 6 × 6
##   number operation type      trained skip  id             
##    <int> <chr>     <chr>     <lgl>   <lgl> <chr>          
## 1      1 step      date      FALSE   FALSE date_sJxC9     
## 2      2 step      holiday   FALSE   FALSE holiday_nMzT3  
## 3      3 step      dummy     FALSE   FALSE dummy_LUPxu    
## 4      4 step      zv        FALSE   FALSE zv_HsCdL       
## 5      5 step      normalize FALSE   FALSE normalize_KGE7Q
## 6      6 step      corr      FALSE   FALSE corr_64q8w     
```

After fitting the recipe, you might want access to the statistics from each step. We can pull the fitted recipe from the workflow and choose which step to tidy by number or `id`.
]

.pull-right[

```r
chi_fit %>% 
  extract_recipe() %>% 
  tidy(number = 5) # For step_normalize()
```

```
## # A tibble: 138 × 4
##    terms            statistic value id             
##    <chr>            <chr>     <dbl> <chr>          
##  1 Austin           mean       1.52 normalize_KGE7Q
##  2 Quincy_Wells     mean       5.58 normalize_KGE7Q
##  3 Belmont          mean       4.09 normalize_KGE7Q
##  4 Archer_35th      mean       2.21 normalize_KGE7Q
##  5 Oak_Park         mean       1.32 normalize_KGE7Q
##  6 Western          mean       2.87 normalize_KGE7Q
##  7 Clark_Lake       mean      13.6  normalize_KGE7Q
##  8 Clinton          mean       2.44 normalize_KGE7Q
##  9 Merchandise_Mart mean       4.67 normalize_KGE7Q
## 10 Irving_Park      mean       3.41 normalize_KGE7Q
## # … with 128 more rows
```
]

---

# Debugging a recipe

90% of the time, you will want to use a workflow to estimate and apply a recipe.

If you have an error, the original recipe object (e.g. `chi_rec`) can be estimated manually with a function called `prep()` (analogous to `fit()`). This returns the fitted recipe and can help debug any issues.

Another function, `bake()`, is analogous to `predict()` and gives you the processed data back.

---

# Fun facts about recipes

* Once `fit()` is called on a workflow, changing the model does not re-fit the recipe.
* A list of all known steps is [here](https://www.tidymodels.org/find/recipes/).
* Some steps can be [skipped](https://recipes.tidymodels.org/articles/Skipping.html) when using `predict()`.
* The [order](https://recipes.tidymodels.org/articles/Ordering.html) of the steps matters.
* There are .pkg[recipes]-adjacent packages with more steps: .pkg[embed], .pkg[timetk], .pkg[textrecipes], .pkg[themis], and others.
* Julia and I have written an amazing text processing book: [_Supervised Machine Learning for Text Analysis in R_](https://smltar.com/)
* There are a lot of ways to handle [categorical predictors](https://recipes.tidymodels.org/articles/Dummies.html), even those with novel levels.
* Several .pkg[dplyr] steps exist, such as `step_mutate()`.

---

# Hands-On: Add a recipe to your model

Go to the lab and add a custom recipe to perform feature engineering.
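If you want a starting point, here is a minimal sketch that recombines steps from this session (assuming `chi_train` and the earlier workshop objects; adapt the steps to your own data):

```r
# A possible starting point for the lab - adapt to your own data
lab_rec <- recipe(ridership ~ ., data = chi_train) %>% 
  step_date(date, features = c("dow", "month")) %>% 
  update_role(date, new_role = "id") %>% 
  step_dummy(all_nominal_predictors()) %>% 
  step_zv(all_predictors())

lab_wflow <- workflow() %>% 
  add_model(linear_reg()) %>% 
  add_recipe(lab_rec)

lab_fit <- fit(lab_wflow, chi_train)
```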