useR2022
Emil Hvitfeldt
First things first: what’s a feature?
I tend to think of a feature as some representation of a predictor that will be used in a model.
Old-school features:
“Feature engineering” sounds pretty cool, but let’s take a minute to talk about preprocessing data.
For example, centering and scaling are definitely not feature engineering.
Consider the lastper_insp_date
field in the elevators data. If given as a raw predictor, it is converted to an integer.
It can instead be re-encoded as features such as the day of the week, the month, or the year of the inspection.
We’ll demonstrate the recipes package for all of your data needs.
The package is an extensible framework for pipeable sequences of feature engineering steps, providing preprocessing tools to be applied to data.
Statistical parameters for the steps can be estimated from an initial data set and then applied to other data sets.
The resulting processed output can then be used as inputs for statistical or machine learning models.
Based on the formula, the recipe() function assigns columns to roles of “outcome” or “predictor”.
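A minimal sketch of creating that recipe, assuming the training data is in elevators_train (the name used in the later code):

```r
library(recipes)

# The formula assigns roles: speed_fpm is the outcome,
# everything else becomes a predictor
elevators_rec <- recipe(speed_fpm ~ ., data = elevators_train)
```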
summary(elevators_rec)
#> # A tibble: 18 × 4
#> variable type role source
#> <chr> <chr> <chr> <chr>
#> 1 borough nominal predictor original
#> 2 device_type nominal predictor original
#> 3 lastper_insp_date date predictor original
#> 4 approval_date date predictor original
#> 5 manufacturer nominal predictor original
#> 6 travel_distance numeric predictor original
#> 7 capacity_lbs numeric predictor original
#> 8 car_buffer_type nominal predictor original
#> 9 governor_type nominal predictor original
#> 10 machine_type nominal predictor original
#> 11 safety_type nominal predictor original
#> 12 mode_operation nominal predictor original
#> 13 floor_from numeric predictor original
#> 14 floor_to numeric predictor original
#> 15 latitude numeric predictor original
#> 16 longitude numeric predictor original
#> 17 elevators_per_building numeric predictor original
#> 18 speed_fpm numeric outcome original
This creates three new columns for each date variable. Note that the day-of-the-week column is a factor.
Many step_impute_*() functions are used for numeric predictors, step_unknown() is used for categorical predictors, and step_novel() helps with new levels after training.
Note that we can use fancy selectors such as all_numeric_predictors() and all_nominal_predictors().
elevators_rec <-
recipe(speed_fpm ~ ., data = elevators_train) %>%
step_date(approval_date, lastper_insp_date,
features = c("dow", "month", "year"),
keep_original_cols = FALSE) %>%
step_impute_mean(all_numeric_predictors()) %>%
step_novel(all_nominal_predictors()) %>%
step_unknown(all_nominal_predictors()) %>%
step_dummy(all_nominal_predictors())
For any factor or character predictors, make binary indicators.
There are many recipe steps that can convert categorical predictors to numeric columns.
elevators_rec <-
recipe(speed_fpm ~ ., data = elevators_train) %>%
step_date(approval_date, lastper_insp_date, keep_original_cols = FALSE) %>%
step_impute_mean(all_numeric_predictors()) %>%
step_novel(all_nominal_predictors()) %>%
step_unknown(all_nominal_predictors()) %>%
step_dummy(all_nominal_predictors()) %>%
step_zv(all_predictors())
In case a level was never observed, we can delete any zero-variance predictors (columns with a single unique value).
Note that the selector chooses all columns with a role of “predictor”
elevators_rec <-
recipe(speed_fpm ~ ., data = elevators_train) %>%
step_date(approval_date, lastper_insp_date,
features = c("dow", "month", "year"),
keep_original_cols = FALSE) %>%
step_impute_mean(all_numeric_predictors()) %>%
step_novel(all_nominal_predictors()) %>%
step_unknown(all_nominal_predictors()) %>%
step_dummy(all_nominal_predictors()) %>%
step_zv(all_predictors()) %>%
step_normalize(all_numeric_predictors())
This centers and scales the numeric predictors.
Note that this will use the training set to estimate the means and standard deviations of the data. All data put through the recipe will be normalized using those statistics (there is no re-estimation).
elevators_rec <-
recipe(speed_fpm ~ ., data = elevators_train) %>%
step_date(approval_date, lastper_insp_date, keep_original_cols = FALSE) %>%
step_impute_mean(all_numeric_predictors()) %>%
step_novel(all_nominal_predictors()) %>%
step_unknown(all_nominal_predictors()) %>%
step_dummy(all_nominal_predictors()) %>%
step_zv(all_predictors()) %>%
step_normalize(all_numeric_predictors()) %>%
step_corr(all_numeric_predictors(), threshold = 0.9)
To deal with highly correlated predictors, step_corr() finds the minimum set of predictors to remove so that all pairwise correlations are less than 0.9.
There are other filter steps too.
elevators_rec <-
recipe(speed_fpm ~ ., data = elevators_train) %>%
step_date(approval_date, lastper_insp_date, keep_original_cols = FALSE) %>%
step_impute_mean(all_numeric_predictors()) %>%
step_novel(all_nominal_predictors()) %>%
step_unknown(all_nominal_predictors()) %>%
step_dummy(all_nominal_predictors()) %>%
step_zv(all_predictors()) %>%
step_normalize(all_numeric_predictors()) %>%
step_pca(all_numeric_predictors())
PCA feature extraction…
Every preprocessing step in a recipe that involves calculations uses the training set. For example: the means used for imputation, the factor levels used to create dummy variables, the means and standard deviations used for normalization, and so on.
Once a recipe is added to a workflow, this estimation occurs when fit()
is called.
Let’s stick to a linear model for now and add a recipe (instead of a formula):
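A minimal sketch of that workflow; the object name elevators_wflow is illustrative (any name works):

```r
library(parsnip)
library(workflows)

# Linear regression via the default "lm" engine
lm_spec <- linear_reg()

# Bundle the preprocessing recipe and the model specification
elevators_wflow <-
  workflow() %>%
  add_recipe(elevators_rec) %>%
  add_model(lm_spec)
```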
#> ══ Workflow ════════════════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: linear_reg()
#>
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> 8 Recipe Steps
#>
#> • step_date()
#> • step_impute_mean()
#> • step_novel()
#> • step_unknown()
#> • step_dummy()
#> • step_zv()
#> • step_normalize()
#> • step_corr()
#>
#> ── Model ───────────────────────────────────────────────────────────────────────
#> Linear Regression Model Specification (regression)
#>
#> Computational engine: lm
fit()
Calling fit() estimates both the recipe and the model on the training set:
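The fit is a single call; elevators_fit is the object name used in the later code:

```r
# fit() preps the recipe on elevators_train and then fits the
# linear model to the processed predictors
elevators_fit <- fit(elevators_wflow, data = elevators_train)
```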
#> ══ Workflow [trained] ══════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: linear_reg()
#>
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> 8 Recipe Steps
#>
#> • step_date()
#> • step_impute_mean()
#> • step_novel()
#> • step_unknown()
#> • step_dummy()
#> • step_zv()
#> • step_normalize()
#> • step_corr()
#>
#> ── Model ───────────────────────────────────────────────────────────────────────
#>
#> Call:
#> stats::lm(formula = ..y ~ ., data = data)
#>
#> Coefficients:
#> (Intercept)
#> 5.180e+00
#> travel_distance
#> 4.385e-02
#> capacity_lbs
#> 1.006e-02
#> floor_from
#> -2.140e-02
#> floor_to
#> 1.752e-01
#> latitude
#> -1.654e-02
#> longitude
#> 3.636e-03
#> elevators_per_building
#> 6.899e-02
#> approval_date_year
#> -2.008e-02
#> lastper_insp_date_year
#> 1.477e-02
#> borough_Brooklyn
#> -1.240e-03
#> borough_Manhattan
#> 4.881e-02
#> borough_Queens
#> -1.052e-02
#> borough_Staten.Island
#> -6.661e-03
#> device_type_Escalator
#> 6.073e-02
#> device_type_Freight
#> 8.614e-02
#> device_type_Handicap.Lift
#> -3.758e-02
#> device_type_Manlift
#> -2.247e-02
#> device_type_Passenger.Elevator
#> 2.584e-01
#> device_type_Private.Elevator
#> -4.811e-03
#> device_type_Public.Elevator
#> -7.799e-03
#> device_type_Sidewalk
#> -4.091e-02
#> manufacturer_A..J.
#>
#> ...
#> and 306 more lines.
When predict() is called, the fitted recipe is applied to the new data before it is passed to the model; you don’t need to do anything else.
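For example (assuming a holdout set named elevators_test):

```r
# The recipe transforms elevators_test using statistics estimated
# from the training set, then the model produces predictions
predict(elevators_fit, new_data = elevators_test)
```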
tidy(recipe) gives a summary of the steps:
tidy(elevators_rec)
#> # A tibble: 8 × 6
#> number operation type trained skip id
#> <int> <chr> <chr> <lgl> <lgl> <chr>
#> 1 1 step date FALSE FALSE date_0Y0le
#> 2 2 step impute_mean FALSE FALSE impute_mean_NEmLh
#> 3 3 step novel FALSE FALSE novel_WBlve
#> 4 4 step unknown FALSE FALSE unknown_YbzYd
#> 5 5 step dummy FALSE FALSE dummy_VKBbq
#> 6 6 step zv FALSE FALSE zv_HjsLv
#> 7 7 step normalize FALSE FALSE normalize_iuyff
#> 8 8 step corr FALSE FALSE corr_YZ2DO
After fitting the recipe, you might want access to the statistics from each step. We can pull the fitted recipe from the workflow and choose which step to tidy, by number or id:
elevators_fit %>%
extract_recipe() %>%
tidy(number = 7) # For step normalize
#> # A tibble: 356 × 4
#> terms statistic value id
#> <chr> <chr> <dbl> <chr>
#> 1 travel_distance mean 52.2 normalize_iuyff
#> 2 capacity_lbs mean 2795. normalize_iuyff
#> 3 floor_from mean 2.01 normalize_iuyff
#> 4 floor_to mean 9.78 normalize_iuyff
#> 5 latitude mean 40.7 normalize_iuyff
#> 6 longitude mean -73.9 normalize_iuyff
#> 7 elevators_per_building mean 5.08 normalize_iuyff
#> 8 approval_date_year mean 2003. normalize_iuyff
#> 9 lastper_insp_date_year mean 2015. normalize_iuyff
#> 10 borough_Brooklyn mean 0.235 normalize_iuyff
#> # … with 346 more rows
90% of the time, you will want to use a workflow to estimate and apply a recipe.
If you have an error, the original recipe object (e.g. elevators_rec
) can be estimated manually with a function called prep()
(analogous to fit()
).
This returns the fitted recipe, which can help debug any issues.
Another function, bake()
, is analogous to predict()
and gives you the processed data back.
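A minimal sketch of this manual route; the prepped object name is an assumption:

```r
# prep() estimates the recipe's statistics from the training set
# (analogous to fit())
elevators_prep <- prep(elevators_rec, training = elevators_train)

# bake() applies the fitted recipe (analogous to predict());
# new_data = NULL returns the processed training set
bake(elevators_prep, new_data = NULL)
```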
Once fit() is called on a workflow, changing the model does not re-fit the recipe.
The fitted recipe is applied to new data automatically by predict().
Ad hoc computations on columns can be done with step_mutate().
Go to the lab and add a custom recipe to perform feature engineering.