Flexible feature engineering using {recipes}

What is
feature engineering?

What are features?

Stylized text providing an overview of Tidy Data. The top reads “Tidy data is a standard way of mapping the meaning of a dataset to its structure. - Hadley Wickham.” On the left reads “In tidy data: each variable forms a column; each observation forms a row; each cell is a single measurement.” There is an example table on the lower right with columns ‘id’, ‘name’ and ‘color’ with observations for different cats, illustrating tidy data structure.

I tend to think of a feature as some representation of a predictor that will be used in a model.

Old-school features:

Interactions
Polynomial expansions/splines
PCA feature extraction
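
As a rough sketch, the three old-school features above could be written as recipe steps. This assumes the built-in mtcars data as a stand-in (it is not the data used in these slides), and the column choices are purely illustrative.

```r
library(recipes)

# Stand-in data: mtcars, with mpg as the outcome (an assumption for illustration)
old_school_rec <- recipe(mpg ~ disp + hp + wt, data = mtcars) |>
  # Interaction: adds a disp_x_hp column
  step_interact(terms = ~ disp:hp) |>
  # Polynomial expansion: replaces wt with wt_poly_1 and wt_poly_2
  step_poly(wt, degree = 2) |>
  # PCA feature extraction: normalize first, then replace disp and hp with PC1
  step_normalize(disp, hp) |>
  step_pca(disp, hp, num_comp = 1)

old_school <- old_school_rec |>
  prep() |>
  bake(new_data = NULL)

names(old_school)
```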

“Feature engineering” sounds pretty cool, but let’s take a minute to talk about preprocessing data.

Two types of preprocessing

General definitions

Data preprocessing is the set of steps that you take to make your model successful
Feature engineering is what you do to the original predictors so that the model has to do the least work to perform well

Some models can’t handle non-numeric or missing data

Linear Regression
K Nearest Neighbors

Some models are fine with categorical variables

Most tree-based models

Some models struggle if numeric variables aren’t scaled

K Nearest Neighbors
Anything using gradient descent
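
A small base R sketch of why scaling matters for distance-based models such as K Nearest Neighbors. The three houses are made-up values: house 2 differs from house 1 by one bedroom, house 3 differs from house 1 by 100 square feet.

```r
# Hypothetical toy data, not taken from the slides
homes <- data.frame(
  bedrooms = c(2, 3, 2),
  sqft     = c(1500, 1500, 1600)
)

# Unscaled: the sqft difference dominates the Euclidean distance
raw_dist <- as.matrix(dist(homes))

# Centered and scaled: both differences contribute on the same footing
scaled_dist <- as.matrix(dist(scale(homes)))

raw_dist[1, 2:3]
scaled_dist[1, 2:3]
```

On the raw scale, house 3 looks 100 times farther from house 1 than house 2 does; after scaling the two distances are equal.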

Working with dates

Consider a datetime field. If given as a raw predictor, it is converted to an integer.

It can be re-encoded as:

Days since a reference date
Day of the week
Month
Year
Indicators for holidays
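
A minimal sketch of some of these re-encodings, assuming a made-up data frame with a date_sold column. step_date() and step_mutate() are real {recipes} steps; the data and the reference date are illustrative only. Holiday indicators could be added in a similar way with step_holiday(), which is left out here to keep the sketch small.

```r
library(recipes)

# Hypothetical toy data: three sale dates and a numeric outcome
sales <- data.frame(
  date_sold = as.Date(c("2024-01-01", "2024-07-04", "2024-12-25")),
  price     = c(100, 150, 120)
)

date_rec <- recipe(price ~ date_sold, data = sales) |>
  # Days since a reference date (the reference date is an arbitrary choice)
  step_mutate(days_since_2024 = as.numeric(date_sold - as.Date("2024-01-01"))) |>
  # Day of the week, month, and year become date_sold_dow, date_sold_month, date_sold_year
  step_date(date_sold, features = c("dow", "month", "year"))

date_features <- date_rec |>
  prep() |>
  bake(new_data = NULL)

names(date_features)
```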

static

sqrt, log, inverse
dummies with known levels
date time extractions

trained

centering & scaling
Imputation
PCA
anything for unknown levels

Trained methods need to calculate sufficient information to be applied again
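
To make "sufficient information" concrete, here is a small sketch with step_normalize() on made-up data: the mean and sd are estimated once from the training rows, stored in the prepped recipe, and reused on new rows.

```r
library(recipes)

# Hypothetical toy data
train    <- data.frame(y = c(1, 2, 3, 4), x = c(10, 20, 30, 40))
new_rows <- data.frame(y = 5, x = 50)

norm_rec <- recipe(y ~ x, data = train) |>
  step_normalize(x)

# prep() estimates the mean and sd of x from the training data
norm_prepped <- prep(norm_rec, training = train)

# The stored statistics
norm_stats <- tidy(norm_prepped, number = 1)

# New data is scaled with the training mean and sd, not its own
baked_new <- bake(norm_prepped, new_data = new_rows)
```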

Considerations

if all methods are static, they can be done ahead of time

good for computationally expensive methods
- BERT
bad for fast and expanding methods
- feature hashing

anything after a trained transformation needs to be done within the modeling process

{recipes} package

An extensible framework for pipeable sequences of feature engineering steps that provides preprocessing tools to be applied to data

Modular + extensible
Pipeable
Deferred evaluation
Isolates test data from training data
Can do things formulas can’t

What is a recipe?

recipe(sale_price ~ ., data = ames_training) |>
  step_unknown(all_nominal_predictors()) |>
  step_other(all_nominal_predictors()) |>
  step_dummy(all_nominal_predictors()) |>
  step_date(date_sold, features = c("month", "dow", "week")) |>
  step_zv(all_predictors()) |>
  step_normalize(all_numeric_predictors())

Start by calling recipe() to denote the data source and variables used

What is a recipe?

recipe(sale_price ~ ., data = ames_training) |>
  step_unknown(all_nominal_predictors()) |>
  step_other(all_nominal_predictors()) |>
  step_dummy(all_nominal_predictors()) |>
  step_date(date_sold, features = c("month", "dow", "week")) |>
  step_zv(all_predictors()) |>
  step_normalize(all_numeric_predictors())

Specify what actions to take by adding step_*() functions

What is a recipe?

recipe(sale_price ~ ., data = ames_training) |>
  step_unknown(all_nominal_predictors()) |>
  step_other(all_nominal_predictors()) |>
  step_dummy(all_nominal_predictors()) |>
  step_date(date_sold, features = c("month", "dow", "week")) |>
  step_zv(all_predictors()) |>
  step_normalize(all_numeric_predictors())

Use {tidyselect} and recipes-specific selectors to denote affected variables

What is a recipe?

recipe(sale_price ~ ., data = ames_training) |>
  step_unknown(all_nominal_predictors()) |>
  step_other(all_nominal_predictors()) |>
  step_dummy(all_nominal_predictors()) |>
  step_date(date_sold, features = c("month", "dow", "week")) |>
  step_zv(all_predictors()) |>
  step_normalize(all_numeric_predictors())

Many steps have options to modify their behavior

Using a recipe

rec_spec <- recipe(sale_price ~ ., data = ames_training) |>
  step_unknown(all_nominal_predictors()) |>
  step_other(all_nominal_predictors()) |>
  step_dummy(all_nominal_predictors()) |>
  step_date(date_sold, features = c("month", "dow", "week")) |>
  step_zv(all_predictors()) |>
  step_normalize(all_numeric_predictors())

Recipes can be used with {workflows} to “combine” them with a model

wf_spec <- workflow() |>
  add_recipe(rec_spec) |>
  add_model(linear_reg())

recipes are estimated

Every preprocessing step in a recipe that involves
calculations uses the training set. For example:

Levels of a factor
Determination of zero-variance
Normalization
Feature extraction

Once a recipe is added to a workflow,
this occurs when fit() is called.
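
A minimal end-to-end sketch of that flow, using the built-in mtcars data as a stand-in for the training set (an assumption; the slides use ames_training). Calling fit() estimates the recipe and then fits the model; predict() applies the estimated recipe to new data before predicting.

```r
library(recipes)
library(parsnip)
library(workflows)

# Stand-in data and columns, chosen only for illustration
car_rec <- recipe(mpg ~ disp + hp + wt, data = mtcars) |>
  step_normalize(all_numeric_predictors())

car_wf <- workflow() |>
  add_recipe(car_rec) |>
  add_model(linear_reg())

# The recipe is estimated here, on the data passed to fit()
car_fit <- fit(car_wf, data = mtcars)

# The same estimated preprocessing is applied to the new rows
preds <- predict(car_fit, new_data = mtcars[1:3, ])
```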

types of steps - trained

recipe(sale_price ~ ., data = ames_training) |>
  step_unknown(all_nominal_predictors()) |>
  step_other(all_nominal_predictors()) |>
  step_dummy(all_nominal_predictors()) |>
  step_date(date_sold, features = c("month", "dow", "week")) |>
  step_zv(all_predictors()) |>
  step_normalize(all_numeric_predictors())

Levels not found in training data set are set to “unseen”

types of steps - trained

recipe(sale_price ~ ., data = ames_training) |>
  step_unknown(all_nominal_predictors()) |>
  step_other(all_nominal_predictors()) |>
  step_dummy(all_nominal_predictors()) |>
  step_date(date_sold, features = c("month", "dow", "week")) |>
  step_zv(all_predictors()) |>
  step_normalize(all_numeric_predictors())

records which levels are seen in training data set
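
A small sketch of a step recording levels from the training data, assuming this caption refers to step_other(). The zone factor and the 0.2 threshold are made up for illustration.

```r
library(recipes)

# Hypothetical training data: zone "c" is rare (1 of 10 rows)
train <- data.frame(
  price = c(10, 12, 11, 13, 12, 14, 20, 22, 21, 30),
  zone  = factor(c(rep("a", 6), rep("b", 3), "c"))
)

other_rec <- recipe(price ~ zone, data = train) |>
  # Pool levels seen in fewer than 20% of training rows into "other"
  step_other(zone, threshold = 0.2)

other_prepped <- prep(other_rec)

# The levels that were kept, as recorded during prep()
retained <- tidy(other_prepped, number = 1)$retained

# New data: the rare level is pooled using the recorded information
new_homes <- data.frame(
  price = c(15, 25),
  zone  = factor(c("a", "c"), levels = c("a", "b", "c"))
)
baked <- bake(other_prepped, new_data = new_homes)
as.character(baked$zone)
```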

types of steps - trained

recipe(sale_price ~ ., data = ames_training) |>
  step_unknown(all_nominal_predictors()) |>
  step_other(all_nominal_predictors()) |>
  step_dummy(all_nominal_predictors()) |>
  step_date(date_sold, features = c("month", "dow", "week")) |>
  step_zv(all_predictors()) |>
  step_normalize(all_numeric_predictors())

records which variables had zero variance

types of steps - trained

recipe(sale_price ~ ., data = ames_training) |>
  step_unknown(all_nominal_predictors()) |>
  step_other(all_nominal_predictors()) |>
  step_dummy(all_nominal_predictors()) |>
  step_date(date_sold, features = c("month", "dow", "week")) |>
  step_zv(all_predictors()) |>
  step_normalize(all_numeric_predictors())

records mean and sd of variables

types of steps - static

recipe(sale_price ~ ., data = ames_training) |>
  step_unknown(all_nominal_predictors()) |>
  step_other(all_nominal_predictors()) |>
  step_dummy(all_nominal_predictors()) |>
  step_date(date_sold, features = c("month", "dow", "week")) |>
  step_zv(all_predictors()) |>
  step_normalize(all_numeric_predictors())

These steps are static transformations, so they could also be applied to the data before the recipe

A cartoon showing the progression of a fuzzy green round monster baking something, representing steps for data pre-processing available in the recipes package. From left to right: a pantry labeled “Variables pantry” where the monster is picking “response” and “predictor” variables, with text below reading “1. Specify variables”, then the monster writing pre-processing steps on a chalkboard (text reads “define pre-processing steps”), then the monster carrying boxes full of data (text reads “Provide datasets for recipe steps”), and finally the monster mixing things with a stand mixer, pouring contents into different tupperwares labeled “imputed”, “scaled”, “centered”, with text below reading “Apply pre-processing!”

Extensive role selection

All steps use {tidyselect} to select variables

… |>
step_pca(BsmtFin_SF_1, BsmtFin_SF_2, Bsmt_Unf_SF,
         Total_Bsmt_SF, First_Flr_SF, Second_Flr_SF,
         Wood_Deck_SF, Open_Porch_SF,
         threshold = 0.8) |>
…

variables can be written out one by one

… |>
step_pca(contains("SF"), threshold = 0.8) |>
…

using tidyselect::contains()

… |>
step_pca(contains("SF"), -starts_with("Bsmt"),
         threshold = 0.8) |>
…

or combine multiple {tidyselect} selectors
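
A runnable sketch of combining selectors, on a made-up data frame whose column names mimic the ones in the slides. It uses num_comp = 2 instead of threshold = 0.8 so that the number of output columns is fixed.

```r
library(recipes)

# Hypothetical toy data with Ames-like column names
homes <- data.frame(
  sale_price    = c(200, 250, 180, 300, 220, 270),
  Bsmt_Unf_SF   = c(400, 0, 650, 300, 120, 500),
  First_Flr_SF  = c(900, 1200, 850, 1500, 1000, 1300),
  Second_Flr_SF = c(0, 600, 0, 800, 400, 700),
  Wood_Deck_SF  = c(100, 0, 50, 200, 0, 150),
  Lot_Area      = c(8000, 9500, 7000, 12000, 8800, 10000)
)

pca_rec <- recipe(sale_price ~ ., data = homes) |>
  # Every column containing "SF", except those starting with "Bsmt"
  step_pca(contains("SF"), -starts_with("Bsmt"), num_comp = 2)

pca_prepped <- prep(pca_rec)

# Which columns the selectors resolved to
selected  <- unique(tidy(pca_prepped, number = 1)$terms)
pca_baked <- bake(pca_prepped, new_data = NULL)
```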

Useful selectors - tidyselect

starts_with()
ends_with()
contains()
matches()
num_range()
one_of()
any_of()

Useful selectors - recipes

In addition to all_predictors() and all_outcomes()

all_numeric()
- all_double()
- all_integer()
all_logical()
all_date()
all_datetime()

all_nominal()
- all_string()
- all_factor()
- all_unordered()
- all_ordered()

all have *_predictors() variants
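
A short sketch of the type-based selectors in use, on a made-up data frame. all_nominal_predictors() picks up the factor column, and all_numeric_predictors() then picks up the numeric columns, including the freshly created dummies, while leaving the outcome alone.

```r
library(recipes)

# Hypothetical toy data
pets <- data.frame(
  weight  = c(4.1, 5.3, 3.8, 6.0),
  age     = c(2L, 7L, 4L, 10L),
  color   = factor(c("black", "tabby", "black", "white")),
  adopted = c(1.5, 2.0, 1.0, 3.0)
)

pets_rec <- recipe(adopted ~ ., data = pets) |>
  # Only the factor predictor is turned into dummy columns
  step_dummy(all_nominal_predictors()) |>
  # Numeric predictors are centered and scaled; the outcome is not touched
  step_normalize(all_numeric_predictors())

pets_baked <- pets_rec |>
  prep() |>
  bake(new_data = NULL)

names(pets_baked)
```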

Tidying a recipe

We can use prep() to “train” a recipe

rec <- recipe(sale_price ~ ., data = ames_training) |>
  step_unknown(all_nominal_predictors()) |>
  step_other(all_nominal_predictors()) |>
  step_dummy(all_nominal_predictors()) |>
  step_nzv(all_predictors()) |>
  step_normalize(all_numeric_predictors())

rec_prepped <- prep(rec)

Tidying a recipe

running tidy() reveals the steps and basic information

rec_prepped |>
  tidy()

# A tibble: 5 × 6
  number operation type      trained skip  id             
   <int> <chr>     <chr>     <lgl>   <lgl> <chr>          
1      1 step      unknown   TRUE    FALSE unknown_n939d  
2      2 step      other     TRUE    FALSE other_zfonZ    
3      3 step      dummy     TRUE    FALSE dummy_pdHMv    
4      4 step      nzv       TRUE    FALSE nzv_RUieL      
5      5 step      normalize TRUE    FALSE normalize_Bp5vK

Tidying a recipe

you can use number or id to select a step

rec_prepped |>
  tidy(number = 4) # id = "nzv_RUieL"

# A tibble: 32 × 2
   terms              id       
   <chr>              <chr>    
 1 BsmtFin_SF_2       nzv_RUieL
 2 Kitchen_AbvGr      nzv_RUieL
 3 Open_Porch_SF      nzv_RUieL
 4 Enclosed_Porch     nzv_RUieL
 5 Three_season_porch nzv_RUieL
 6 Screen_Porch       nzv_RUieL
 7 Pool_Area          nzv_RUieL
 8 Misc_Val           nzv_RUieL
 9 Street_other       nzv_RUieL
10 Lot_Shape_other    nzv_RUieL
# ℹ 22 more rows

rec_prepped |>
  tidy(number = 3) # id = "dummy_pdHMv"

# A tibble: 98 × 3
   terms       columns                              id         
   <chr>       <chr>                                <chr>      
 1 MS_SubClass One_and_Half_Story_Finished_All_Ages dummy_pdHMv
 2 MS_SubClass Two_Story_1946_and_Newer             dummy_pdHMv
 3 MS_SubClass One_Story_PUD_1946_and_Newer         dummy_pdHMv
 4 MS_SubClass other                                dummy_pdHMv
 5 MS_Zoning   Residential_Medium_Density           dummy_pdHMv
 6 MS_Zoning   other                                dummy_pdHMv
 7 Street      other                                dummy_pdHMv
 8 Alley       other                                dummy_pdHMv
 9 Lot_Shape   Slightly_Irregular                   dummy_pdHMv
10 Lot_Shape   other                                dummy_pdHMv
# ℹ 88 more rows

Extension packages

{textrecipes} provides steps to handle text variables

tokenization
filtering
counting

Extension packages

{embed} provides advanced methods to embed categorical and numeric variables into smaller vector spaces

weight of evidence
string collapsing
PCA variants

Extension packages

{themis} provides steps to deal with imbalanced data

up and down-sampling
SMOTE variants
ADASYN

Extension packages

{timetk} provides time-based methods

time series signatures
lags & diffs
smoothing

Thank You!