Feature Engineering


Emil Hvitfeldt

What is feature engineering?

First things first: what’s a feature?

I tend to think of a feature as some representation of a predictor that will be used in a model.

Old-school features:

  • Interactions
  • Polynomial expansions/splines
  • PCA feature extraction

“Feature engineering” sounds pretty cool, but let’s take a minute to talk about preprocessing data.

Two types of preprocessing

Easy examples

For example, centering and scaling are definitely not feature engineering.

Consider the lastper_insp_date field in the elevators data. If given as a raw predictor, it is converted to an integer.

It can be re-encoded as:

  • Days since a reference date
  • Day of the week
  • Month
  • Year
  • Indicators for holidays
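As a rough base-R sketch of these encodings (the dates, the reference date, and the holiday list below are made up for illustration and are not taken from the elevators data):

```r
# Hypothetical inspection dates, used only for illustration
insp_date <- as.Date(c("2015-07-04", "2020-12-25", "2021-03-15"))

# Illustrative reference date (an assumption)
ref_date <- as.Date("2000-01-01")

# POSIXlt exposes the calendar components of each date
lt <- as.POSIXlt(insp_date)

date_features <- data.frame(
  days_since_ref = as.numeric(insp_date - ref_date),
  dow            = lt$wday,        # 0 = Sunday, ..., 6 = Saturday
  month          = lt$mon + 1,     # 1 = January, ..., 12 = December
  year           = lt$year + 1900,
  # Toy holiday indicator: match on month-day against a made-up list
  is_holiday     = format(insp_date, "%m-%d") %in% c("07-04", "12-25")
)

date_features
```

Each derived column gives the model a different view of the same date, which is usually more useful than a single integer.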

General definitions

  • Data preprocessing is the set of steps that you take to make your model successful.
  • Feature engineering is what you do to the original predictors so that the model has to do the least work to predict the outcome as well as possible.

We’ll demonstrate the recipes package for all of your data needs.

Recipes prepare your data for modeling

The package is an extensible framework for pipeable sequences of feature engineering steps that provide preprocessing tools to be applied to data.

Statistical parameters for the steps can be estimated from an initial data set and then applied to other data sets.

The resulting processed output can then be used as inputs for statistical or machine learning models.

A first recipe

elevators_rec <- 
  recipe(speed_fpm ~ ., data = elevators_train)

# If ncol(data) is large, you can use
# recipe(data = elevators_train)

Based on the formula, the function assigns columns to roles of “outcome” or “predictor”:

#> # A tibble: 18 × 4
#>    variable               type    role      source  
#>    <chr>                  <chr>   <chr>     <chr>   
#>  1 borough                nominal predictor original
#>  2 device_type            nominal predictor original
#>  3 lastper_insp_date      date    predictor original
#>  4 approval_date          date    predictor original
#>  5 manufacturer           nominal predictor original
#>  6 travel_distance        numeric predictor original
#>  7 capacity_lbs           numeric predictor original
#>  8 car_buffer_type        nominal predictor original
#>  9 governor_type          nominal predictor original
#> 10 machine_type           nominal predictor original
#> 11 safety_type            nominal predictor original
#> 12 mode_operation         nominal predictor original
#> 13 floor_from             numeric predictor original
#> 14 floor_to               numeric predictor original
#> 15 latitude               numeric predictor original
#> 16 longitude              numeric predictor original
#> 17 elevators_per_building numeric predictor original
#> 18 speed_fpm              numeric outcome   original
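The role table can be reproduced on a small scale with summary(). This is only a sketch: toy_train is a hypothetical stand-in for elevators_train, with made-up values.

```r
library(recipes)

# Toy training data (hypothetical; stands in for elevators_train)
toy_train <- data.frame(
  speed    = c(100, 150, 200, 350),
  capacity = c(2000, 2500, 3000, 3500),
  borough  = factor(c("Bronx", "Queens", "Bronx", "Manhattan"))
)

# The left-hand side of the formula becomes the outcome,
# everything else becomes a predictor
toy_rec <- recipe(speed ~ ., data = toy_train)

# One row per column, with the role assigned by the formula
role_info <- summary(toy_rec)
role_info[, c("variable", "role")]
```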

A first recipe - work with dates

elevators_rec <- 
  recipe(speed_fpm ~ ., data = elevators_train) %>% 
  step_date(approval_date, lastper_insp_date, keep_original_cols = FALSE)

This creates three new columns for each date variable (day of the week, month, and year). Note that the day-of-the-week column is a factor.
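To see which columns step_date() produces, here is a minimal sketch on made-up data (toy_train and insp_date are hypothetical names, not part of the elevators data):

```r
library(recipes)

# Toy data with a single date predictor (values are made up)
toy_train <- data.frame(
  y         = c(1, 2, 3),
  insp_date = as.Date(c("2015-07-04", "2020-12-25", "2021-03-15"))
)

date_rec <- recipe(y ~ ., data = toy_train) %>%
  step_date(insp_date,
            features = c("dow", "month", "year"),
            keep_original_cols = FALSE)

# Estimate the recipe and return the processed training data
date_baked <- date_rec %>% prep() %>% bake(new_data = NULL)

names(date_baked)
str(date_baked)
```

The new columns are named after the original variable with a suffix for each feature, and the original date column is dropped because keep_original_cols = FALSE.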

A first recipe - Dealing with missing data

elevators_rec <- 
  recipe(speed_fpm ~ ., data = elevators_train) %>% 
  step_date(approval_date, lastper_insp_date, keep_original_cols = FALSE) %>%
  step_impute_mean(all_numeric_predictors()) %>%
  step_novel(all_nominal_predictors()) %>%
  step_unknown(all_nominal_predictors())

Many step_impute_*() functions are available for numeric predictors. step_unknown() is used for categorical predictors: it assigns missing values to a dedicated level.

step_novel() helps with new levels after training

Note that we can use fancy selectors such as all_numeric_predictors() and all_nominal_predictors().
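A small sketch of how these three steps behave, using made-up data (toy_train, toy_new, size, and kind are hypothetical). The new data contains a missing number and a factor level that was not seen during training:

```r
library(recipes)

# Toy training data with missing values in both column types
toy_train <- data.frame(
  y    = c(1, 2, 3, 4),
  size = c(10, NA, 30, 20),
  kind = factor(c("a", "b", NA, "a"))
)

# New data: a missing number and a level ("c") never seen in training
toy_new <- data.frame(y = 5, size = NA_real_, kind = factor("c"))

impute_rec <- recipe(y ~ ., data = toy_train) %>%
  step_impute_mean(all_numeric_predictors()) %>%
  step_novel(all_nominal_predictors()) %>%
  step_unknown(all_nominal_predictors())

impute_fit <- prep(impute_rec, training = toy_train)

baked_train <- bake(impute_fit, new_data = NULL)
baked_new   <- bake(impute_fit, new_data = toy_new)

baked_train
baked_new
```

The missing size is filled with the training-set mean, the missing kind becomes the level "unknown", and the unseen level is mapped to "new".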

A first recipe - create indicator variables

elevators_rec <- 
  recipe(speed_fpm ~ ., data = elevators_train) %>% 
  step_date(approval_date, lastper_insp_date, 
            features = c("dow", "month", "year"), 
            keep_original_cols = FALSE) %>%
  step_impute_mean(all_numeric_predictors()) %>%
  step_novel(all_nominal_predictors()) %>%
  step_unknown(all_nominal_predictors()) %>%
  step_dummy(all_nominal_predictors())

For any factor or character predictors, make binary indicators.

There are many recipe steps that can convert categorical predictors to numeric columns.
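As a minimal sketch of what step_dummy() produces (the data below is made up, and borough has only three levels here):

```r
library(recipes)

# Toy data with one categorical predictor
toy_train <- data.frame(
  y       = c(1, 2, 3, 4),
  borough = factor(c("Bronx", "Queens", "Bronx", "Manhattan"))
)

dummy_rec <- recipe(y ~ ., data = toy_train) %>%
  step_dummy(all_nominal_predictors())

dummy_baked <- dummy_rec %>% prep() %>% bake(new_data = NULL)

# The first level (Bronx) is the reference and gets no column
names(dummy_baked)
dummy_baked
```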

A first recipe - filter out constant columns

elevators_rec <- 
  recipe(speed_fpm ~ ., data = elevators_train) %>% 
  step_date(approval_date, lastper_insp_date, keep_original_cols = FALSE) %>%
  step_impute_mean(all_numeric_predictors()) %>%
  step_novel(all_nominal_predictors()) %>%
  step_unknown(all_nominal_predictors()) %>%
  step_dummy(all_nominal_predictors()) %>% 
  step_zv(all_predictors())

If a factor level was never observed in the training set, its indicator column is constant. We can delete any such zero-variance predictors, that is, columns with a single unique value.

Note that the selector all_predictors() chooses all columns with a role of “predictor”.
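A minimal sketch of the zero-variance filter on made-up data, where the hypothetical column constant has a single unique value:

```r
library(recipes)

# Toy data: `constant` carries no information
toy_train <- data.frame(
  y        = c(1, 2, 3, 4),
  x1       = c(5, 6, 7, 8),
  constant = c(1, 1, 1, 1)
)

zv_rec <- recipe(y ~ ., data = toy_train) %>%
  step_zv(all_predictors())

zv_baked <- zv_rec %>% prep() %>% bake(new_data = NULL)

# `constant` is removed, `x1` is kept
names(zv_baked)
```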

A first recipe - normalization

elevators_rec <- 
  recipe(speed_fpm ~ ., data = elevators_train) %>% 
  step_date(approval_date, lastper_insp_date, 
            features = c("dow", "month", "year"), 
            keep_original_cols = FALSE) %>%
  step_impute_mean(all_numeric_predictors()) %>%
  step_novel(all_nominal_predictors()) %>%
  step_unknown(all_nominal_predictors()) %>%
  step_dummy(all_nominal_predictors()) %>% 
  step_zv(all_predictors()) %>%
  step_normalize(all_numeric_predictors())

This centers and scales the numeric predictors.

Note that this will use the training set to estimate the means and standard deviations of the data. All data put through the recipe will be normalized using those statistics (there is no re-estimation).
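The "no re-estimation" point can be checked with a small sketch. The data is made up, and toy_train and toy_new are hypothetical names:

```r
library(recipes)

toy_train <- data.frame(y = c(1, 2, 3, 4), x = c(10, 20, 30, 40))
toy_new   <- data.frame(y = c(5, 6),       x = c(25, 50))

norm_rec <- recipe(y ~ x, data = toy_train) %>%
  step_normalize(all_numeric_predictors())

# Means and standard deviations are estimated from toy_train only
norm_fit <- prep(norm_rec, training = toy_train)

baked_train <- bake(norm_fit, new_data = NULL)
baked_new   <- bake(norm_fit, new_data = toy_new)

# The training data has mean 0 and sd 1 after normalization
c(mean(baked_train$x), sd(baked_train$x))

# The new data is shifted and scaled with the *training* statistics,
# so it is generally not centered at 0 itself
baked_new$x
```

The first new value (25) equals the training mean and so maps to 0, while the new data as a whole is not re-centered.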

A first recipe - reduce correlation

elevators_rec <- 
  recipe(speed_fpm ~ ., data = elevators_train) %>% 
  step_date(approval_date, lastper_insp_date, keep_original_cols = FALSE) %>%
  step_impute_mean(all_numeric_predictors()) %>%
  step_novel(all_nominal_predictors()) %>%
  step_unknown(all_nominal_predictors()) %>%
  step_dummy(all_nominal_predictors()) %>% 
  step_zv(all_predictors()) %>%
  step_normalize(all_numeric_predictors()) %>% 
  step_corr(all_numeric_predictors(), threshold = 0.9)

To deal with highly correlated predictors, find the minimum set of predictors to remove so that all pairwise correlations are less than 0.9.

There are other filter steps too.
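A small sketch of the correlation filter on made-up data. Here x2 is almost exactly twice x1, while x3 is only weakly related to either, so one of x1 and x2 should be dropped:

```r
library(recipes)

# Toy data: x1 and x2 are nearly collinear, x3 is not
toy_train <- data.frame(
  y  = c(3, 1, 4, 1, 5, 9, 2, 6, 5, 3),
  x1 = as.numeric(1:10),
  x2 = 2 * (1:10) + c(0.1, -0.1, 0.2, -0.2, 0.1, -0.1, 0.2, -0.2, 0.1, -0.1),
  x3 = c(5, 1, 4, 2, 8, 3, 9, 1, 7, 2)
)

corr_rec <- recipe(y ~ ., data = toy_train) %>%
  step_corr(all_numeric_predictors(), threshold = 0.9)

corr_baked <- corr_rec %>% prep() %>% bake(new_data = NULL)

# One of x1 / x2 has been removed; x3 is retained
names(corr_baked)
```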

Other possible steps

elevators_rec <- 
  recipe(speed_fpm ~ ., data = elevators_train) %>% 
  step_date(approval_date, lastper_insp_date, keep_original_cols = FALSE) %>%
  step_impute_mean(all_numeric_predictors()) %>%
  step_novel(all_nominal_predictors()) %>%
  step_unknown(all_nominal_predictors()) %>%
  step_dummy(all_nominal_predictors()) %>% 
  step_zv(all_predictors()) %>%
  step_normalize(all_numeric_predictors()) %>% 
  step_pca(all_numeric_predictors())

PCA feature extraction…
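One way PCA feature extraction could look in a recipe is sketched below with step_pca(). The data is made up, and the choice of num_comp = 2 is an arbitrary assumption for illustration:

```r
library(recipes)

toy_train <- data.frame(
  y  = c(3, 1, 4, 1, 5, 9),
  x1 = c(1, 2, 3, 4, 5, 6),
  x2 = c(2.1, 3.9, 6.2, 8.1, 9.8, 12.2),
  x3 = c(5, 1, 4, 2, 8, 3)
)

pca_rec <- recipe(y ~ ., data = toy_train) %>%
  # PCA is scale-sensitive, so normalize first
  step_normalize(all_numeric_predictors()) %>%
  # Replace the numeric predictors with the first two components
  step_pca(all_numeric_predictors(), num_comp = 2)

pca_baked <- pca_rec %>% prep() %>% bake(new_data = NULL)

names(pca_baked)
```

The original predictors are replaced by component scores, which the model then uses as its features.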

Recipes are estimated

Every preprocessing step in a recipe that involves calculations uses the training set. For example:

  • Levels of a factor
  • Determination of zero-variance
  • Normalization
  • Feature extraction

and so on.

Once a recipe is added to a workflow, this occurs when fit() is called.

Recipes follow this strategy

Adding recipes to workflows

Let’s stick to a linear model for now and add a recipe (instead of a formula):

lm_spec <- linear_reg() 

elevators_wflow <- 
  workflow() %>% 
  add_model(lm_spec) %>% 
  add_recipe(elevators_rec)

elevators_wflow

#> ══ Workflow ════════════════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: linear_reg()
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> 8 Recipe Steps
#> • step_date()
#> • step_impute_mean()
#> • step_novel()
#> • step_unknown()
#> • step_dummy()
#> • step_zv()
#> • step_normalize()
#> • step_corr()
#> ── Model ───────────────────────────────────────────────────────────────────────
#> Linear Regression Model Specification (regression)
#> Computational engine: lm

Estimate via fit()

Calling fit() on the workflow estimates both the recipe and the model from the training set:

elevators_fit <- elevators_wflow %>% fit(elevators_train)
#> ══ Workflow [trained] ══════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: linear_reg()
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> 8 Recipe Steps
#> • step_date()
#> • step_impute_mean()
#> • step_novel()
#> • step_unknown()
#> • step_dummy()
#> • step_zv()
#> • step_normalize()
#> • step_corr()
#> ── Model ───────────────────────────────────────────────────────────────────────
#> Call:
#> stats::lm(formula = ..y ~ ., data = data)
#> Coefficients:
#>                                (Intercept)  
#>                                  5.180e+00  
#>                            travel_distance  
#>                                  4.385e-02  
#>                               capacity_lbs  
#>                                  1.006e-02  
#>                                 floor_from  
#>                                 -2.140e-02  
#>                                   floor_to  
#>                                  1.752e-01  
#>                                   latitude  
#>                                 -1.654e-02  
#>                                  longitude  
#>                                  3.636e-03  
#>                     elevators_per_building  
#>                                  6.899e-02  
#>                         approval_date_year  
#>                                 -2.008e-02  
#>                     lastper_insp_date_year  
#>                                  1.477e-02  
#>                           borough_Brooklyn  
#>                                 -1.240e-03  
#>                          borough_Manhattan  
#>                                  4.881e-02  
#>                             borough_Queens  
#>                                 -1.052e-02  
#>                      borough_Staten.Island  
#>                                 -6.661e-03  
#>                      device_type_Escalator  
#>                                  6.073e-02  
#>                        device_type_Freight  
#>                                  8.614e-02  
#>                  device_type_Handicap.Lift  
#>                                 -3.758e-02  
#>                        device_type_Manlift  
#>                                 -2.247e-02  
#>             device_type_Passenger.Elevator  
#>                                  2.584e-01  
#>               device_type_Private.Elevator  
#>                                 -4.811e-03  
#>                device_type_Public.Elevator  
#>                                 -7.799e-03  
#>                       device_type_Sidewalk  
#>                                 -4.091e-02  
#>                         manufacturer_A..J.  
#> ...
#> and 306 more lines.


When predict() is called, the fitted recipe is applied to the new data before it is predicted by the model:

predict(elevators_fit, elevators_train)
#> # A tibble: 26,281 × 1
#>    .pred
#>    <dbl>
#>  1  6.12
#>  2  4.51
#>  3  5.17
#>  4  4.64
#>  5  5.31
#>  6  5.89
#>  7  5.57
#>  8  5.25
#>  9  5.90
#> 10  4.78
#> # … with 26,271 more rows

You don’t need to do anything else.
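The whole cycle can be sketched end to end on made-up data. All object names below (toy_train, toy_new, toy_rec, toy_wflow) are hypothetical, and the recipe is much shorter than the one built above:

```r
library(recipes)
library(parsnip)
library(workflows)

toy_train <- data.frame(
  y = c(1.1, 1.9, 3.2, 3.9, 5.1, 6.0),
  x = c(1, 2, 3, 4, 5, 6),
  g = factor(c("a", "b", "a", "b", "a", "b"))
)

# New data has raw predictors only; no manual preprocessing is done
toy_new <- data.frame(x = c(2.5, 7), g = factor(c("a", "b")))

toy_rec <- recipe(y ~ ., data = toy_train) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_normalize(all_numeric_predictors())

toy_wflow <- workflow() %>%
  add_model(linear_reg()) %>%
  add_recipe(toy_rec)

# fit() estimates the recipe, then the model
toy_fit <- fit(toy_wflow, data = toy_train)

# predict() applies the fitted recipe to toy_new, then predicts
toy_preds <- predict(toy_fit, new_data = toy_new)
toy_preds
```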

Tidying a recipe

tidy(recipe) gives a summary of the steps:

#> # A tibble: 8 × 6
#>   number operation type        trained skip  id               
#>    <int> <chr>     <chr>       <lgl>   <lgl> <chr>            
#> 1      1 step      date        FALSE   FALSE date_0Y0le       
#> 2      2 step      impute_mean FALSE   FALSE impute_mean_NEmLh
#> 3      3 step      novel       FALSE   FALSE novel_WBlve      
#> 4      4 step      unknown     FALSE   FALSE unknown_YbzYd    
#> 5      5 step      dummy       FALSE   FALSE dummy_VKBbq      
#> 6      6 step      zv          FALSE   FALSE zv_HjsLv         
#> 7      7 step      normalize   FALSE   FALSE normalize_iuyff  
#> 8      8 step      corr        FALSE   FALSE corr_YZ2DO

After fitting the recipe, you might want access to the statistics from each step. We can pull the fitted recipe from the workflow and choose which step to tidy by number or id:

elevators_fit %>% 
  extract_recipe() %>% 
  tidy(number = 7) # For step normalize
#> # A tibble: 356 × 4
#>    terms                  statistic    value id             
#>    <chr>                  <chr>        <dbl> <chr>          
#>  1 travel_distance        mean        52.2   normalize_iuyff
#>  2 capacity_lbs           mean      2795.    normalize_iuyff
#>  3 floor_from             mean         2.01  normalize_iuyff
#>  4 floor_to               mean         9.78  normalize_iuyff
#>  5 latitude               mean        40.7   normalize_iuyff
#>  6 longitude              mean       -73.9   normalize_iuyff
#>  7 elevators_per_building mean         5.08  normalize_iuyff
#>  8 approval_date_year     mean      2003.    normalize_iuyff
#>  9 lastper_insp_date_year mean      2015.    normalize_iuyff
#> 10 borough_Brooklyn       mean         0.235 normalize_iuyff
#> # … with 346 more rows
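Selecting a step by id rather than by number can be sketched as follows. The data is made up, and the id "my_normalize" is a hypothetical label set by hand so that it does not depend on the randomly generated suffix:

```r
library(recipes)

toy_train <- data.frame(y = c(1, 2, 3, 4), x = c(10, 20, 30, 40))

rec_fit <- recipe(y ~ x, data = toy_train) %>%
  step_normalize(all_numeric_predictors(), id = "my_normalize") %>%
  prep()

# Summary of all steps
tidy(rec_fit)

# Statistics estimated by one step, selected by its id
norm_stats <- tidy(rec_fit, id = "my_normalize")
norm_stats
```

Setting the id explicitly makes this kind of lookup stable across runs.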

Debugging a recipe

90% of the time, you will want to use a workflow to estimate and apply a recipe.

If you have an error, the original recipe object (e.g. elevators_rec) can be estimated manually with a function called prep() (analogous to fit()).

This returns the fitted recipe. This can help debug any issues.

Another function (bake()) is analogous to predict() and gives you the processed data back.
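A minimal sketch of this manual route, using made-up data (toy_train and toy_rec are hypothetical):

```r
library(recipes)

toy_train <- data.frame(
  y = c(1, 2, 3, 4),
  x = c(10, NA, 30, 20),
  g = factor(c("a", "b", "a", "b"))
)

toy_rec <- recipe(y ~ ., data = toy_train) %>%
  step_impute_mean(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors())

# prep() estimates the recipe (analogous to fit())
rec_fit <- prep(toy_rec, training = toy_train)

# bake() applies it (analogous to predict());
# new_data = NULL returns the processed training set
processed_train <- bake(rec_fit, new_data = NULL)
processed_train

# The same fitted recipe can be applied to any other data frame
processed_head <- bake(rec_fit, new_data = toy_train[1:2, ])
processed_head
```

Inspecting the baked output after each added step is often the quickest way to find which step causes a problem.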

Fun facts about recipes

  • Once fit() is called on a workflow, changing the model does not re-fit the recipe.
  • A list of all known steps is available on the tidymodels website.
  • Some steps can be skipped when using predict().
  • The order of the steps matters.
  • There are recipes-adjacent packages with more steps: embed, timetk, textrecipes, themis, and others.
  • There are a lot of ways to handle categorical predictors, even those with novel levels.
  • Several dplyr steps exist, such as step_mutate().
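As an illustration of the dplyr-style steps, the sketch below uses step_mutate() to derive a new predictor. The data and the derived column floors_travelled are made up for this example:

```r
library(recipes)

toy_train <- data.frame(
  speed      = c(100, 150, 200),
  floor_from = c(1, 1, 2),
  floor_to   = c(5, 10, 20)
)

mutate_rec <- recipe(speed ~ ., data = toy_train) %>%
  # Works like dplyr::mutate(), but is stored as part of the recipe
  step_mutate(floors_travelled = floor_to - floor_from)

mutate_baked <- mutate_rec %>% prep() %>% bake(new_data = NULL)
mutate_baked
```

Because the computation lives in the recipe, it is applied automatically to any new data that passes through it.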

Hands-On: Add a recipe to your model

Go to the lab and add a custom recipe to perform feature engineering.