time aware recipes

May Open Source Demo

What is a recipe?

recipe(sale_price ~ ., data = ames_time) |>
  step_unknown(all_nominal_predictors()) |>
  step_other(all_nominal_predictors()) |>
  step_dummy(all_nominal_predictors()) |>
  step_date(date_sold, features = c("month", "dow", "week")) |>
  step_zv(all_predictors()) |>
  step_normalize(all_numeric_predictors())

Start by calling recipe() to denote the data source and variables used

specifying what actions to take by adding step_*() functions

using {tidyselect} and recipes-specific selectors to denote the affected variables

many steps have options to modify behavior

rec_spec <- recipe(sale_price ~ ., data = ames_time) |>
  step_unknown(all_nominal_predictors()) |>
  step_other(all_nominal_predictors()) |>
  step_dummy(all_nominal_predictors()) |>
  step_date(date_sold, features = c("month", "dow", "week")) |>
  step_zv(all_predictors()) |>
  step_normalize(all_numeric_predictors())

recipes can be used with {workflows} to “combine” them with a model

wf_spec <- workflow() |>
  add_recipe(rec_spec) |>
  add_model(linear_reg())

recipes are estimated

Every preprocessing step in a recipe that involves
calculations estimates them from the training set. For example:

  • Levels of a factor
  • Determination of zero-variance
  • Normalization
  • Feature extraction

Once a recipe is added to a workflow,
this estimation occurs when fit() is called.
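
To make "estimated" concrete, here is a minimal sketch of estimating a recipe by hand with prep() and bake(). It uses the built-in mtcars data as a stand-in for ames_time, and the train/test split is arbitrary; both are assumptions made only for illustration.

```r
library(recipes)

# Toy split of a built-in data set; mtcars stands in for ames_time here
train <- mtcars[1:24, ]
test  <- mtcars[25:32, ]

rec <- recipe(mpg ~ disp + hp, data = train) |>
  step_normalize(all_numeric_predictors())

# prep() estimates the means and standard deviations from the training set
rec_trained <- prep(rec, training = train)

# tidy() shows the statistics stored in the first (and only) step
norm_stats <- tidy(rec_trained, number = 1)

# bake() applies those stored training statistics to new data
test_baked <- bake(rec_trained, new_data = test)
```

The test rows are centered and scaled with the training mean and sd, never with their own.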

types of steps

recipe(sale_price ~ ., data = ames_time) |>
  step_unknown(all_nominal_predictors()) |>
  step_other(all_nominal_predictors()) |>
  step_dummy(all_nominal_predictors()) |>
  step_date(date_sold, features = c("month", "dow", "week")) |>
  step_zv(all_predictors()) |>
  step_normalize(all_numeric_predictors())

Levels not found in the training data set are set to “unseen”

records which levels are seen in training data set

records which variables had zero variance

records mean and sd of variables

these steps provide static transformations, and could thus be done outside the recipe, before it is applied

time aware calculations

  • how much did the last house sell for?
    • ever
    • last month
    • last week
  • how many houses were sold in the last month?
    • in this neighborhood
  • how long since last house sale?
    • of this type

“step_time_features”

rec_spec <- recipe(~ lot_area + lot_frontage + date_sold,
                  data = ames_time) |>
  step_time_features(lot_area, lot_frontage, 
                     time = date_sold,
                     features = list(max = max, mean = mean))

lot_area  lot_frontage  date_sold   lot_area_max  lot_area_mean  lot_frontage_max  lot_frontage_mean
   10839           100  2008-07-11        159000      10295.181               200           58.06139
    8944            65  2009-06-16        164660      10139.572               313           57.70433
    9000            50  2006-02-17         22950       9442.355               153           64.39474
    1488            24  2009-02-28        164660      10253.137               313           57.77512
    8120            70  2009-02-27        164660      10256.759               313           57.79140
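
The output above suggests that each row's features are computed only from sales up to that row's date. A rough base R sketch of that idea follows. The helper `time_features()` is hypothetical and is not the step's implementation; in particular, excluding the current row from its own window is an assumption.

```r
# Hypothetical helper, not part of recipes: for every row, apply each
# function in `features` to the values of `x` observed strictly before
# that row's time stamp (assumption: the current row is excluded).
time_features <- function(x, time, features) {
  out <- lapply(features, function(f) {
    vapply(seq_along(x), function(i) {
      past <- x[time < time[i]]
      if (length(past) == 0) NA_real_ else as.numeric(f(past))
    }, numeric(1))
  })
  as.data.frame(out)
}

# Toy data, not ames_time
toy <- data.frame(
  lot_area  = c(9000, 8120, 1488, 10839),
  date_sold = as.Date(c("2006-02-17", "2008-07-11", "2009-02-27", "2009-02-28"))
)

res <- time_features(toy$lot_area, toy$date_sold,
                     features = list(max = max, mean = mean))
```

The first row has no history, so its features are NA; later rows see a growing window.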

“step_time_features”

rec_spec <- recipe(~ lot_area + lot_frontage + date_sold + neighborhood,
                  data = ames_time) |>
  step_time_features(lot_area, lot_frontage, 
                     time = date_sold, 
                     group = neighborhood,
                     features = list(max = max, mean = mean))

Using group argument

lot_area  lot_frontage  date_sold   neighborhood            lot_area_max  lot_area_mean  lot_frontage_max  lot_frontage_mean
   10839           100  2008-07-11  Gilbert                        47280      12045.516               195           51.90323
    8944            65  2009-06-16  North_Ames                     39384       9934.830               313           62.84195
    9000            50  2006-02-17  Iowa_DOT_and_Rail_Road          8600       6962.250                63           55.75000
    1488            24  2009-02-28  Blueste                         3907       2379.500                35           26.75000
    8120            70  2009-02-27  North_Ames                     39384       9986.279               313           62.98701
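
Grouping restricts each row's history to sales from the same group. The base R sketch below shows the idea for a single feature. The function `past_mean_by_group()` and the toy data are made up for illustration, and the strict "earlier than" comparison is an assumption.

```r
# Toy data, not ames_time
toy <- data.frame(
  lot_area     = c(9000, 8120, 1488, 8944, 10839),
  date_sold    = as.Date(c("2006-02-17", "2009-02-27", "2009-02-28",
                           "2009-06-16", "2009-07-11")),
  neighborhood = c("North_Ames", "North_Ames", "Blueste",
                   "North_Ames", "Blueste")
)

# Assumption: a row only sees strictly earlier sales from its own group
past_mean_by_group <- function(x, time, group) {
  vapply(seq_along(x), function(i) {
    past <- x[group == group[i] & time < time[i]]
    if (length(past) == 0) NA_real_ else mean(past)
  }, numeric(1))
}

toy$lot_area_mean <- past_mean_by_group(toy$lot_area, toy$date_sold,
                                        toy$neighborhood)
```

The first sale in each neighborhood has no history within its group, so its value is NA.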

limitless skies

Given this infrastructure, creating the right features is just a function away

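# Assumes x is ordered from most recent to oldest
mean_last_10_values <- function(x, time, now) {
  mean(x[1:10], na.rm = TRUE)
}

mean_last_week <- function(x, time, now) {
  # Fix the units: difftime() otherwise picks them automatically
  last_week <- difftime(now, time, units = "days") <= 7
  mean(x[last_week], na.rm = TRUE)
}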
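
As a quick check of how such functions behave, the sketch below calls both on a made-up history. The two functions are restated so the block runs on its own, with explicit units = "days" added as a safeguard. The ordering of `price` (most recent first) and the idea that only past values are passed in are assumptions about the infrastructure.

```r
mean_last_10_values <- function(x, time, now) {
  mean(x[1:10], na.rm = TRUE)
}

mean_last_week <- function(x, time, now) {
  last_week <- difftime(now, time, units = "days") <= 7
  mean(x[last_week], na.rm = TRUE)
}

# Toy history, most recent sale first (assumed ordering)
now   <- as.Date("2009-03-01")
time  <- as.Date(c("2009-02-27", "2009-02-25", "2009-02-10", "2009-01-15"))
price <- c(200000, 180000, 150000, 100000)

# Fewer than 10 values exist, so this averages all four
avg_10 <- mean_last_10_values(price, time, now)

# Only the two sales within 7 days of `now` are averaged
avg_week <- mean_last_week(price, time, now)
```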

  • will work with very short-term future data
    • not unlike other machine learning models
  • Future work: I have an idea for a {scales}-inspired package to make aggregation functions easier to write