time aware recipes

May Open Source Demo

What is a recipe?

recipe(sale_price ~ ., data = ames_time) |>
  step_unknown(all_nominal_predictors()) |>
  step_other(all_nominal_predictors()) |>
  step_dummy(all_nominal_predictors()) |>
  step_date(date_sold, features = c("month", "dow", "week")) |>
  step_zv(all_predictors()) |>
  step_normalize(all_numeric_predictors())

What is a recipe?

recipe(sale_price ~ ., data = ames_time) |>
  step_unknown(all_nominal_predictors()) |>
  step_other(all_nominal_predictors()) |>
  step_dummy(all_nominal_predictors()) |>
  step_date(date_sold, features = c("month", "dow", "week")) |>
  step_zv(all_predictors()) |>
  step_normalize(all_numeric_predictors())

Start by calling recipe() to denote the data source and variables used

What is a recipe?

recipe(sale_price ~ ., data = ames_time) |>
  step_unknown(all_nominal_predictors()) |>
  step_other(all_nominal_predictors()) |>
  step_dummy(all_nominal_predictors()) |>
  step_date(date_sold, features = c("month", "dow", "week")) |>
  step_zv(all_predictors()) |>
  step_normalize(all_numeric_predictors())

specifying what actions to take by adding step_*()s

What is a recipe?

recipe(sale_price ~ ., data = ames_time) |>
  step_unknown(all_nominal_predictors()) |>
  step_other(all_nominal_predictors()) |>
  step_dummy(all_nominal_predictors()) |>
  step_date(date_sold, features = c("month", "dow", "week")) |>
  step_zv(all_predictors()) |>
  step_normalize(all_numeric_predictors())

using {tidyselect} and recipes-specific selectors to denote affected variables

What is a recipe?

recipe(sale_price ~ ., data = ames_time) |>
  step_unknown(all_nominal_predictors()) |>
  step_other(all_nominal_predictors()) |>
  step_dummy(all_nominal_predictors()) |>
  step_date(date_sold, features = c("month", "dow", "week")) |>
  step_zv(all_predictors()) |>
  step_normalize(all_numeric_predictors())

many steps have options to modify behavior

What is a recipe?

rec_spec <- recipe(sale_price ~ ., data = ames_time) |>
  step_unknown(all_nominal_predictors()) |>
  step_other(all_nominal_predictors()) |>
  step_dummy(all_nominal_predictors()) |>
  step_date(date_sold, features = c("month", "dow", "week")) |>
  step_zv(all_predictors()) |>
  step_normalize(all_numeric_predictors())

recipes can be used with {workflows} to “combine” them with a model

wf_spec <- workflow() |>
  add_recipe(rec_spec) |>
  add_model(linear_reg())
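
As a minimal sketch of what this combination enables, assuming the {recipes}, {parsnip}, and {workflows} packages are installed. The built-in mtcars data and the toy_* names are stand-ins chosen for illustration, not the data or objects used in these slides:

```r
library(recipes)
library(parsnip)
library(workflows)

# mtcars is a stand-in data set for illustration only
toy_rec <- recipe(mpg ~ ., data = mtcars) |>
  step_normalize(all_numeric_predictors())

toy_wf <- workflow() |>
  add_recipe(toy_rec) |>
  add_model(linear_reg())

# fit() estimates the recipe, then fits the model on the processed data
toy_fit <- fit(toy_wf, data = mtcars)

# predict() applies the estimated recipe to new data before predicting
toy_preds <- predict(toy_fit, new_data = head(mtcars))
```

The recipe and the model travel together, so the same preprocessing is applied at fit time and at prediction time.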

recipes are estimated

Every preprocessing step in a recipe that involves
calculations uses the training set. For example:

Levels of a factor
Determination of zero-variance
Normalization
Feature extraction

Once a recipe is added to a workflow,
this estimation occurs when fit() is called.
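
Outside of a workflow, the same estimation can be triggered by hand with prep() and applied with bake(). A minimal sketch, assuming {recipes} is installed. The mtcars data, the arbitrary train/test split, and the object names are illustrative assumptions:

```r
library(recipes)

# mtcars is a stand-in data set; the split below is arbitrary
train <- mtcars[1:24, ]
test  <- mtcars[25:32, ]

norm_rec <- recipe(mpg ~ ., data = train) |>
  step_normalize(all_numeric_predictors())

# prep() estimates the means and standard deviations from the training set
norm_prepped <- prep(norm_rec, training = train)

# tidy() shows the statistics stored for the first step
norm_stats <- tidy(norm_prepped, number = 1)

# bake() applies the training-set statistics to new data
test_baked <- bake(norm_prepped, new_data = test)
```

The test rows are centered and scaled with the training means and standard deviations, never with their own.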

types of steps

recipe(sale_price ~ ., data = ames_time) |>
  step_unknown(all_nominal_predictors()) |>
  step_other(all_nominal_predictors()) |>
  step_dummy(all_nominal_predictors()) |>
  step_date(date_sold, features = c("month", "dow", "week")) |>
  step_zv(all_predictors()) |>
  step_normalize(all_numeric_predictors())

Levels not found in the training data set are set to “unseen”

types of steps

recipe(sale_price ~ ., data = ames_time) |>
  step_unknown(all_nominal_predictors()) |>
  step_other(all_nominal_predictors()) |>
  step_dummy(all_nominal_predictors()) |>
  step_date(date_sold, features = c("month", "dow", "week")) |>
  step_zv(all_predictors()) |>
  step_normalize(all_numeric_predictors())

records which levels are seen in training data set

types of steps

recipe(sale_price ~ ., data = ames_time) |>
  step_unknown(all_nominal_predictors()) |>
  step_other(all_nominal_predictors()) |>
  step_dummy(all_nominal_predictors()) |>
  step_date(date_sold, features = c("month", "dow", "week")) |>
  step_zv(all_predictors()) |>
  step_normalize(all_numeric_predictors())

records which variables had zero variance

types of steps

recipe(sale_price ~ ., data = ames_time) |>
  step_unknown(all_nominal_predictors()) |>
  step_other(all_nominal_predictors()) |>
  step_dummy(all_nominal_predictors()) |>
  step_date(date_sold, features = c("month", "dow", "week")) |>
  step_zv(all_predictors()) |>
  step_normalize(all_numeric_predictors())

records mean and sd of variables

types of steps

recipe(sale_price ~ ., data = ames_time) |>
  step_unknown(all_nominal_predictors()) |>
  step_other(all_nominal_predictors()) |>
  step_dummy(all_nominal_predictors()) |>
  step_date(date_sold, features = c("month", "dow", "week")) |>
  step_zv(all_predictors()) |>
  step_normalize(all_numeric_predictors())

these steps provide static transformations, and could thus be done outside the recipe, before it is applied
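
Date features are one example of a static transformation: each value depends only on its own row. A minimal base R sketch, with a hypothetical toy data frame and column names chosen to resemble step_date() output (the exact output of step_date() may differ, for example by returning labelled factors):

```r
# Hypothetical toy data with a few example dates
sales <- data.frame(
  date_sold = as.Date(c("2008-07-11", "2009-06-16", "2006-02-17"))
)

# Date parts depend only on the row itself, so nothing has to be
# estimated from a training set
parts <- as.POSIXlt(sales$date_sold)
sales$date_sold_month <- parts$mon + 1L  # 1 = January
sales$date_sold_dow   <- parts$wday      # 0 = Sunday
```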

time aware calculations

how much did the last house sell for?
- ever
- last month
- last week
how many houses were sold in the last month?
- in this neighborhood
how long since last house sale?
- of this type
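
These questions share one pattern: the answer for a row depends on rows that came earlier in time. A minimal base R sketch of two of them, assuming a hypothetical toy data frame that is already sorted by sale date (all values and column names are made up for illustration):

```r
# Hypothetical toy sales, already sorted by date
sales <- data.frame(
  date_sold  = as.Date(c("2009-01-05", "2009-01-20", "2009-02-10", "2009-02-12")),
  sale_price = c(150000, 200000, 175000, 225000)
)

sales$last_price    <- NA_real_
sales$n_sold_30days <- 0L

for (i in seq_len(nrow(sales))) {
  now <- sales$date_sold[i]
  # Only sales strictly before the current one are visible
  earlier <- sales$date_sold < now
  if (any(earlier)) {
    # Price of the most recent earlier sale (relies on the date ordering)
    sales$last_price[i] <- sales$sale_price[max(which(earlier))]
  }
  # Number of earlier sales within the previous 30 days
  in_window <- earlier & (as.numeric(now - sales$date_sold) <= 30)
  sales$n_sold_30days[i] <- sum(in_window)
}
```

Unlike a static transformation, none of these values can be computed from a row in isolation.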

“step_time_features”

rec_spec <- recipe(~ lot_area + lot_frontage + date_sold,
                  data = ames_time) |>
  step_time_features(lot_area, lot_frontage, 
                     time = date_sold,
                     features = list(max = max, mean = mean))

lot_area	lot_frontage	date_sold	lot_area_max	lot_area_mean	lot_frontage_max	lot_frontage_mean
10839	100	2008-07-11	159000	10295.181	200	58.06139
8944	65	2009-06-16	164660	10139.572	313	57.70433
9000	50	2006-02-17	22950	9442.355	153	64.39474
1488	24	2009-02-28	164660	10253.137	313	57.77512
8120	70	2009-02-27	164660	10256.759	313	57.79140

“step_time_features”

rec_spec <- recipe(~ lot_area + lot_frontage + date_sold + neighborhood,
                  data = ames_time) |>
  step_time_features(lot_area, lot_frontage, 
                     time = date_sold, 
                     group = neighborhood,
                     features = list(max = max, mean = mean))

Using group argument

lot_area	lot_frontage	date_sold	neighborhood	lot_area_max	lot_area_mean	lot_frontage_max	lot_frontage_mean
10839	100	2008-07-11	Gilbert	47280	12045.516	195	51.90323
8944	65	2009-06-16	North_Ames	39384	9934.830	313	62.84195
9000	50	2006-02-17	Iowa_DOT_and_Rail_Road	8600	6962.250	63	55.75000
1488	24	2009-02-28	Blueste	3907	2379.500	35	26.75000
8120	70	2009-02-27	North_Ames	39384	9986.279	313	62.98701
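
To make the idea concrete, here is a rough base R sketch of how such grouped, time-aware features could be computed. It is not the implementation of step_time_features(): the helper name time_features_sketch(), the toy data, and the rule that only strictly earlier rows are used are all assumptions made for illustration.

```r
# Hypothetical helper, not part of {recipes}: for each row, apply every
# function in `features` to the values of `x` from strictly earlier rows
# (an assumption), optionally restricted to rows in the same group
time_features_sketch <- function(x, time, features, group = NULL) {
  if (is.null(group)) group <- rep(1L, length(x))
  out <- lapply(features, function(f) {
    vapply(seq_along(x), function(i) {
      past <- time < time[i] & group == group[i]
      if (any(past)) f(x[past]) else NA_real_
    }, numeric(1))
  })
  as.data.frame(out)
}

# Toy data with made-up group labels
toy <- data.frame(
  lot_area     = c(9000, 8120, 1488, 10839),
  date_sold    = as.Date(c("2006-02-17", "2009-02-27", "2009-02-28", "2008-07-11")),
  neighborhood = c("A", "B", "A", "B")
)

res <- time_features_sketch(
  toy$lot_area, toy$date_sold,
  features = list(max = max, mean = mean),
  group = toy$neighborhood
)
```

Rows with no earlier sale in their group get NA, since there is nothing to aggregate.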

limitless skies

Given this infrastructure, creating the right features is just a function away

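mean_last_10_values <- function(x, time, now) {
  # Sort most recent first so the first 10 values are the latest 10
  recent <- head(x[order(time, decreasing = TRUE)], 10)
  mean(recent, na.rm = TRUE)
}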

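mean_last_week <- function(x, time, now) {
  # Fix the units: difftime() otherwise picks them automatically,
  # which can silently turn the 7 into seconds
  last_week <- difftime(now, time, units = "days") <= 7
  mean(x[last_week], na.rm = TRUE)
}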
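
Functions with this (x, time, now) signature can be tried out directly on plain vectors. A small sketch answering "how long since the last sale?": the function name days_since_last() and the toy values are hypothetical, and how step_time_features() would call it is not shown here.

```r
# Hypothetical aggregation function following the (x, time, now) signature:
# days elapsed since the most recent earlier observation
days_since_last <- function(x, time, now) {
  earlier <- time[time < now]
  if (length(earlier) == 0) return(NA_real_)
  as.numeric(difftime(now, max(earlier), units = "days"))
}

# Toy inputs, made up for illustration
time <- as.Date(c("2009-02-01", "2009-02-20", "2009-02-27"))
x    <- c(150000, 200000, 175000)
now  <- as.Date("2009-02-28")

gap <- days_since_last(x, time, now)
```

Here x is unused, but keeping it in the signature lets the function be swapped in wherever the other aggregation functions are accepted.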

will work with very short-term future data
- not unlike other machine learning models
My future: I have an idea for a {scales}-inspired package to make aggregation functions easier