recipe(sale_price ~ ., data = ames_time) |>
  step_unknown(all_nominal_predictors()) |>
  step_other(all_nominal_predictors()) |>
  step_dummy(all_nominal_predictors()) |>
  step_date(date_sold, features = c("month", "dow", "week")) |>
  step_zv(all_predictors()) |>
  step_normalize(all_numeric_predictors())
Start by calling recipe() to denote the data source and the variables used.
Specify what actions to take by adding step_*() functions.
Use {tidyselect} and recipes-specific selectors to denote the affected variables.
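As a minimal sketch of how the two kinds of selectors can be mixed, the example below uses a made-up data frame (`toy` and its columns are hypothetical, not part of `ames_time`). `starts_with()` comes from {tidyselect}, while `all_nominal_predictors()` is a recipes-specific selector that picks variables by type and role.

```r
library(recipes)

# Hypothetical toy data, invented for illustration only
toy <- data.frame(
  price  = c(10, 12, 9, 15),
  size_a = c(1, 2, 3, 4),
  size_b = c(2, 1, 4, 3),
  kind   = factor(c("x", "y", "x", "y"))
)

rec <- recipe(price ~ ., data = toy) |>
  # tidyselect helper: pick columns by name
  step_log(starts_with("size_")) |>
  # recipes selector: pick columns by type and role
  step_dummy(all_nominal_predictors())

baked <- bake(prep(rec), new_data = NULL)
names(baked)
```

The name-based selector only touches the two `size_` columns, and the type-based selector only touches the factor `kind`.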
Many steps have options to modify their behavior.
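One such option, shown here as a small sketch on invented data, is the `threshold` argument of `step_other()`, which controls how rare a level must be before it is pooled. The data frame `toy` and the value `0.2` are assumptions chosen for illustration.

```r
library(recipes)

# Hypothetical toy data: two common levels and two rare ones
toy <- data.frame(
  y = 1:11,
  colour = factor(c(rep("red", 6), rep("blue", 3), "green", "pink"))
)

rec <- recipe(y ~ colour, data = toy) |>
  # Pool every level that makes up less than 20% of the training rows
  step_other(colour, threshold = 0.2)

pooled <- bake(prep(rec), new_data = NULL)
table(pooled$colour)
```

With this threshold, the two levels that each appear once fall below 20% and are collapsed into a single "other" level, while "red" and "blue" are kept.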
rec_spec <- recipe(sale_price ~ ., data = ames_time) |>
  step_unknown(all_nominal_predictors()) |>
  step_other(all_nominal_predictors()) |>
  step_dummy(all_nominal_predictors()) |>
  step_date(date_sold, features = c("month", "dow", "week")) |>
  step_zv(all_predictors()) |>
  step_normalize(all_numeric_predictors())
Recipes can be used with {workflows} to “combine” them with a model.
wf_spec <- workflow() |>
add_recipe(rec_spec) |>
add_model(linear_reg())
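A minimal sketch of what happens next with such a workflow is to fit it and predict. Since `ames_time` is not available here, the sketch assumes `mtcars` as a stand-in, and `toy_rec`, `toy_wf`, and the train/test split are hypothetical names and choices.

```r
library(recipes)
library(parsnip)
library(workflows)

# Stand-in data: mtcars is used only because ames_time is not available here
train <- mtcars[1:24, ]
test  <- mtcars[25:32, ]

toy_rec <- recipe(mpg ~ disp + hp + wt, data = train) |>
  step_normalize(all_numeric_predictors())

toy_wf <- workflow() |>
  add_recipe(toy_rec) |>
  add_model(linear_reg())

# fit() estimates the recipe on the training data, then fits the model
toy_fit <- fit(toy_wf, data = train)

# predict() applies the already-estimated recipe to the new data
preds <- predict(toy_fit, new_data = test)
preds
```

Bundling the recipe with the model means the same preprocessing is applied at prediction time without any extra code.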
Every preprocessing step in a recipe that involves calculations estimates them from the training set.
Once a recipe is added to a workflow, this happens when fit() is called. For example:
recipe(sale_price ~ ., data = ames_time) |>
  step_unknown(all_nominal_predictors()) |>
  step_other(all_nominal_predictors()) |>
  step_dummy(all_nominal_predictors()) |>
  step_date(date_sold, features = c("month", "dow", "week")) |>
  step_zv(all_predictors()) |>
  step_normalize(all_numeric_predictors())
Levels not found in the training data set are set to “unseen”
records which levels are seen in training data set
records which variables had zero variance
records mean and sd of variables
Some steps provide static transformations and could thus be applied to the data before it reaches the recipe
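To make the training-set estimation concrete, here is a small sketch outside a workflow, where `prep()` does the estimating and `bake()` applies the stored values. The data frames `train` and `new_data` are invented for illustration.

```r
library(recipes)

# Hypothetical training data with one constant column
train <- data.frame(
  y = c(1, 2, 3, 4),
  x = c(10, 20, 30, 40),
  constant = c(5, 5, 5, 5)
)
# New data in which `constant` happens to vary
new_data <- data.frame(y = c(5, 6), x = c(50, 60), constant = c(5, 7))

rec <- recipe(y ~ ., data = train) |>
  step_zv(all_predictors()) |>
  step_normalize(all_numeric_predictors())

# prep() estimates everything from the training set
prepped <- prep(rec, training = train)

tidy(prepped, number = 1)  # columns step_zv() recorded for removal
tidy(prepped, number = 2)  # mean and sd step_normalize() recorded

# bake() reuses the stored values; nothing is re-estimated on new_data
baked <- bake(prepped, new_data = new_data)
baked
```

Note that `constant` is dropped from `new_data` even though it varies there, and `x` is centered and scaled with the training mean and sd.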
lot_area | lot_frontage | date_sold | lot_area_max | lot_area_mean | lot_frontage_max | lot_frontage_mean |
---|---|---|---|---|---|---|
10839 | 100 | 2008-07-11 | 159000 | 10295.181 | 200 | 58.06139 |
8944 | 65 | 2009-06-16 | 164660 | 10139.572 | 313 | 57.70433 |
9000 | 50 | 2006-02-17 | 22950 | 9442.355 | 153 | 64.39474 |
1488 | 24 | 2009-02-28 | 164660 | 10253.137 | 313 | 57.77512 |
8120 | 70 | 2009-02-27 | 164660 | 10256.759 | 313 | 57.79140 |
Using the group argument
lot_area | lot_frontage | date_sold | neighborhood | lot_area_max | lot_area_mean | lot_frontage_max | lot_frontage_mean |
---|---|---|---|---|---|---|---|
10839 | 100 | 2008-07-11 | Gilbert | 47280 | 12045.516 | 195 | 51.90323 |
8944 | 65 | 2009-06-16 | North_Ames | 39384 | 9934.830 | 313 | 62.84195 |
9000 | 50 | 2006-02-17 | Iowa_DOT_and_Rail_Road | 8600 | 6962.250 | 63 | 55.75000 |
1488 | 24 | 2009-02-28 | Blueste | 3907 | 2379.500 | 35 | 26.75000 |
8120 | 70 | 2009-02-27 | North_Ames | 39384 | 9986.279 | 313 | 62.98701 |
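The code that produced these tables is not shown, so the following is only a base R sketch of one way such columns could be computed. It assumes that each `*_max` and `*_mean` value summarises the rows sold strictly before the current row's `date_sold`, optionally within the same group. The helper `past_summary()` and the toy data are hypothetical and are not the function used for the tables.

```r
# Hypothetical toy data; the values are made up for illustration
sales <- data.frame(
  lot_area = c(9000, 8120, 1488, 10839),
  date_sold = as.Date(c("2006-02-17", "2009-02-27", "2009-02-28", "2008-07-11")),
  neighborhood = c("A", "A", "B", "A")
)

# For each row, apply `fn` to `var` over the rows sold strictly earlier,
# optionally restricted to rows in the same group (assumed semantics).
past_summary <- function(data, var, fn, group = NULL) {
  vapply(seq_len(nrow(data)), function(i) {
    keep <- data$date_sold < data$date_sold[i]
    if (!is.null(group)) {
      keep <- keep & data[[group]] == data[[group]][i]
    }
    if (!any(keep)) {
      return(NA_real_)  # no earlier sales to summarise
    }
    fn(data[[var]][keep])
  }, numeric(1))
}

sales$lot_area_max <- past_summary(sales, "lot_area", max)
sales$lot_area_mean_grp <- past_summary(sales, "lot_area", mean,
                                        group = "neighborhood")
sales
```

Because each value depends only on earlier sales, a feature built this way does not leak information from the future into a row.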
Given this infrastructure, creating the right features is just a function away