</ Flexible
feature engineering
using {recipes} />
What is
feature engineering?
I tend to think of a feature as some representation of a predictor that will be used in a model.
Old-school features:
- Interactions
- Polynomial expansions/splines
- PCA feature extraction
“Feature engineering” sounds pretty cool, but let’s take a minute to talk about preprocessing data.
</ Two types of preprocessing
</ Two types of preprocessing
</ General definitions
- Data preprocessing are the steps that you take to make your model successful
- Feature engineering are what you do to the original predictors to make the model do the least work to perform great
Some models can’t handle non-numeric data (missing data)
- Linear Regression
- K Nearest Neighbors
Some models are fine with categorical variables
Some models struggle if numeric variables aren’t scaled
- K Nearest Neighbors
- Anything using gradient descent
</ Working with dates
Consider a datetime field. If given as a raw predictor, it is converted to an integer.
It can be re-encoded as:
- Days since a reference date
- Day of the week
- Month
- Year
- Indicators for holidays
static
- sqrt, log, inverse
- dummies with known levels
- date time extractions
trained
- centering & scaling
- Imputation
- PCA
- anything for unknown levels
Trained methods needs to calculate sufficient information to be applied again
</ Considerations
if all methods are static, they can be done ahead of time
- good for computational methods
- bad for fast and expanding methods
anything after a trained transformation needs to be done within the modeling process
%>% {recipes} package
extensible framework for pipeable sequences of feature engineering steps provides preprocessing tools to be applied to data
- Modular + extensible
- pipeable
- Deferred evaluation
- Isolates test data from training data
- Can do things formulas can’t
</ What is a recipe?
recipe(sale_price ~ ., data = ames_training) |>
step_unknown(all_nominal_predictors()) |>
step_other(all_nominal_predictors()) |>
step_dummy(all_nominal_predictors()) |>
step_date(date_sold, features = c(“month”, “dow”, “week”)) |>
step_zv(all_predictors()) |>
step_normalize(all_numeric_predictors())
</ What is a recipe?
recipe(sale_price ~ ., data = ames_training) |>
step_unknown(all_nominal_predictors()) |>
step_other(all_nominal_predictors()) |>
step_dummy(all_nominal_predictors()) |>
step_date(date_sold, features = c(“month”, “dow”, “week”)) |>
step_zv(all_predictors()) |>
step_normalize(all_numeric_predictors())
Start by calling recipe() to denote the data source and variables used
</ What is a recipe?
recipe(sale_price ~ ., data = ames_training) |>
step_unknown(all_nominal_predictors()) |>
step_other(all_nominal_predictors()) |>
step_dummy(all_nominal_predictors()) |>
step_date(date_sold, features = c(“month”, “dow”, “week”)) |>
step_zv(all_predictors()) |>
step_normalize(all_numeric_predictors())
specifying what actions to take by adding step_*()s
</ What is a recipe?
recipe(sale_price ~ ., data = ames_training) |>
step_unknown(all_nominal_predictors()) |>
step_other(all_nominal_predictors()) |>
step_dummy(all_nominal_predictors()) |>
step_date(date_sold, features = c(“month”, “dow”, “week”)) |>
step_zv(all_predictors()) |>
step_normalize(all_numeric_predictors())
using {tidyselect} and recipes specific selectors to denote affected variables
</ What is a recipe?
recipe(sale_price ~ ., data = ames_training) |>
step_unknown(all_nominal_predictors()) |>
step_other(all_nominal_predictors()) |>
step_dummy(all_nominal_predictors()) |>
step_date(date_sold, features = c(“month”, “dow”, “week”)) |>
step_zv(all_predictors()) |>
step_normalize(all_numeric_predictors())
many steps have options to modify behavior
</ Using a recipe
recipes are can be used with {workflows} to “combine” it with a model
wf_spec <- workflow() |>
add_recipe(rec_spec) |>
add_model(linear_reg())
</ recipes are estimated
Every preprocessing step in a recipe that involved
calculations uses the training set. For example:
- Levels of a factor
- Determination of zero-variance
- Normalization
- Feature extraction
Once a a recipe is added to a workflow,
this occurs when fit() is called.
</ types of steps - trained
recipe(sale_price ~ ., data = ames_training) |>
step_unknown(all_nominal_predictors()) |>
step_other(all_nominal_predictors()) |>
step_dummy(all_nominal_predictors()) |>
step_date(date_sold, features = c(“month”, “dow”, “week”)) |>
step_zv(all_predictors()) |>
step_normalize(all_numeric_predictors())
Levels not found in tranining data set are set to “unseen”
</ types of steps - trained
recipe(sale_price ~ ., data = ames_training) |>
step_unknown(all_nominal_predictors()) |>
step_other(all_nominal_predictors()) |>
step_dummy(all_nominal_predictors()) |>
step_date(date_sold, features = c(“month”, “dow”, “week”)) |>
step_zv(all_predictors()) |>
step_normalize(all_numeric_predictors())
records which levels are seen in training data set
</ types of steps - trained
recipe(sale_price ~ ., data = ames_training) |>
step_unknown(all_nominal_predictors()) |>
step_other(all_nominal_predictors()) |>
step_dummy(all_nominal_predictors()) |>
step_date(date_sold, features = c(“month”, “dow”, “week”)) |>
step_zv(all_predictors()) |>
step_normalize(all_numeric_predictors())
records which variables had zero variance
</ types of steps - trained
recipe(sale_price ~ ., data = ames_training) |>
step_unknown(all_nominal_predictors()) |>
step_other(all_nominal_predictors()) |>
step_dummy(all_nominal_predictors()) |>
step_date(date_sold, features = c(“month”, “dow”, “week”)) |>
step_zv(all_predictors()) |>
step_normalize(all_numeric_predictors())
records mean and sd of variables
</ types of steps - static
recipe(sale_price ~ ., data = ames_training) |>
step_unknown(all_nominal_predictors()) |>
step_other(all_nominal_predictors()) |>
step_dummy(all_nominal_predictors()) |>
step_date(date_sold, features = c(“month”, “dow”, “week”)) |>
step_zv(all_predictors()) |>
step_normalize(all_numeric_predictors())
these steps provide static transformations, and could thus be done outside before the recipe
</ Extensive role selection
All steps use {tidyselect} to select variables
… |>
step_pca(BsmtFin_SF_1, BsmtFin_SF_2, Bsmt_Unf_SF,
Total_Bsmt_SF, First_Flr_SF, Second_Flr_SF,
Wood_Deck_SF, Open_Porch_SF,
threshold = 0.8) |>
…
variables can be written out one by one
</ Extensive role selection
All steps use {tidyselect} to select variables
… |>
step_pca(contains(“SF”), threshold = 0.8) |>
…
using tidyselect::contains()
</ Extensive role selection
All steps use {tidyselect} to select variables
… |>
step_pca(contains(“SF”), -starts_with(“Bsmt”),
threshold = 0.8) |>
…
or combine multiple {tidyselect} selectors
</ Useful selectors - tidyselect
starts_with()
ends_with()
contains()
matches()
num_range()
one_of()
any_of()
</ Useful selectors - recipes
In addition to all_predictors() and all_outcomes()
- all_numeric()
- all_double()
- all_integer()
- all_logical()
- all_date()
- all_datetime()
- all_nominal()
- all_string()
- all_factor()
- all_unordered()
- all_ordered()
all have *_predictors() variants
</ Tidying a recipe
We can use prep() to “train” a recipe
rec <- recipe(sale_price ~ ., data = ames_training) |>
step_unknown(all_nominal_predictors()) |>
step_other(all_nominal_predictors()) |>
step_dummy(all_nominal_predictors()) |>
step_nzv(all_predictors()) |>
step_normalize(all_numeric_predictors())
rec_prepped <- prep(rec)
</ Tidying a recipe
running tidy() reveals the steps and basic information
# A tibble: 5 × 6
number operation type trained skip id
<int> <chr> <chr> <lgl> <lgl> <chr>
1 1 step unknown TRUE FALSE unknown_n939d
2 2 step other TRUE FALSE other_zfonZ
3 3 step dummy TRUE FALSE dummy_pdHMv
4 4 step nzv TRUE FALSE nzv_RUieL
5 5 step normalize TRUE FALSE normalize_Bp5vK
</ Tidying a recipe
you can use number or id to select a step
rec_prepped |>
tidy(number = 4) # id = "nzv_RUieL"
# A tibble: 32 × 2
terms id
<chr> <chr>
1 BsmtFin_SF_2 nzv_RUieL
2 Kitchen_AbvGr nzv_RUieL
3 Open_Porch_SF nzv_RUieL
4 Enclosed_Porch nzv_RUieL
5 Three_season_porch nzv_RUieL
6 Screen_Porch nzv_RUieL
7 Pool_Area nzv_RUieL
8 Misc_Val nzv_RUieL
9 Street_other nzv_RUieL
10 Lot_Shape_other nzv_RUieL
# ℹ 22 more rows
</ Tidying a recipe
you can use number or id to select a step
rec_prepped |>
tidy(number = 3) # id = "dummy_pdHMv"
# A tibble: 98 × 3
terms columns id
<chr> <chr> <chr>
1 MS_SubClass One_and_Half_Story_Finished_All_Ages dummy_pdHMv
2 MS_SubClass Two_Story_1946_and_Newer dummy_pdHMv
3 MS_SubClass One_Story_PUD_1946_and_Newer dummy_pdHMv
4 MS_SubClass other dummy_pdHMv
5 MS_Zoning Residential_Medium_Density dummy_pdHMv
6 MS_Zoning other dummy_pdHMv
7 Street other dummy_pdHMv
8 Alley other dummy_pdHMv
9 Lot_Shape Slightly_Irregular dummy_pdHMv
10 Lot_Shape other dummy_pdHMv
# ℹ 88 more rows
</ Extension packages
Provides steps to handle text variable
- tokenization
- filtering
- counting
</ Extension packages
Advanced methods to embed categorical and numeric variables to smaller vector spaces
- weight of evidence
- string collapsing
- PCA variants
</ Extension packages
Steps to deal with imbalanced data
- up and down-sampling
- SMOTE variants
- ADASYN
</ Extension packages
Time based methods
- time series signatures
- lags & diffs
- smoothing