What’s New in Tidymodels

What is tidymodels?

Data

The Movies Dataset from Kaggle

Filtered to only include Horror movies

horror_movies |>
  relocate(title, id, runtime, genres) |>
  glimpse()
Rows: 3,759
Columns: 10
$ title                <chr> "Dracula: Dead and Loving It", "From Dusk Till Da…
$ id                   <dbl> 12110, 755, 9102, 9095, 12158, 34996, 8973, 34574…
$ runtime              <dbl> 88, 108, 108, 104, 100, 82, 119, 93, 98, 108, 95,…
$ genres               <chr> "Comedy", "Action, Thriller, Crime", "Science Fic…
$ budget               <dbl> NA, 1.9e+07, 2.0e+07, 4.7e+07, 1.4e+07, NA, NA, N…
$ production_countries <chr> "France, United States of America", "United State…
$ release_date         <date> 1995-12-22, 1996-01-19, 1995-09-08, 1996-02-23, …
$ spoken_languages     <chr> "English, Deutsch", "English, Español", "English"…
$ target               <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FA…
$ time                 <dbl> 112.804145, 205.761933, 114.304145, 111.904145, 1…

Heavily modified (turns out it is hard to force a themed example)

Validation splits

Validation splits

We can use initial_validation_split() to create a 3-way split of our data

set.seed(1234)
horror_split <- horror_movies |>
  mutate(surv = Surv(time, target)) |>
  select(-time, -target) |>
  initial_validation_split()

horror_split
<Training/Validation/Testing/Total>
<2255/752/752/3759>

This can then be turned into an rset object for tuning purposes

horror_set <- validation_set(horror_split)
horror_set
# A tibble: 1 × 2
  splits             id        
  <list>             <chr>     
1 <split [2255/752]> validation

Validation splits

We can also use the training(), testing(), and validation() functions as we know them

horror_train <- training(horror_split)
horror_train
# A tibble: 2,255 × 9
     budget genres                  id production_countries release_date runtime
      <dbl> <chr>                <dbl> <chr>                <date>         <dbl>
 1 19000000 Action, Thriller, C…   755 United States of Am… 1996-01-19       108
 2 47000000 Drama, Thriller, Ro…  9095 United States of Am… 1996-02-23       104
 3       NA Drama                34996 United States of Am… 1995-01-10        82
 4       NA Mystery, Thriller     8973 United States of Am… 1995-08-25       119
 5 35000000 Science Fiction, Ac…  9348 United States of Am… 1995-07-07       108
 6       NA Thriller, Drama, My… 18256 United States of Am… 1995-08-04        95
 7 45000000 Drama, Science Fict…  3036 United Kingdom, Jap… 1994-11-04       123
 8       NA Drama                92769 Canada               1995-10-03        90
 9       NA Comedy, Thriller      9059 United States of Am… 1995-01-13        92
10  6000000 Crime, Thriller      25066 United States of Am… 1995-05-24        98
# ℹ 2,245 more rows
# ℹ 3 more variables: spoken_languages <chr>, title <chr>, surv <Surv>
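
The validation and test portions work the same way; a quick sketch (row counts follow the 2255/752/752 split shown above):

horror_val  <- validation(horror_split)  # 752 rows
horror_test <- testing(horror_split)     # 752 rows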

Multi Dummy Variables

Multi Dummy Variables

Both the genres and production_countries variables are formatted as comma-separated lists:

horror_train$genres[1:10]
 [1] "Action, Thriller, Crime"         "Drama, Thriller, Romance"       
 [3] "Drama"                           "Mystery, Thriller"              
 [5] "Science Fiction, Action"         "Thriller, Drama, Mystery"       
 [7] "Drama, Science Fiction, Romance" "Drama"                          
 [9] "Comedy, Thriller"                "Crime, Thriller"                

This gives high cardinality

horror_train$genres |> unique() |> length()
[1] 321

despite a low number of distinct genres

horror_train$genres |> strsplit(", ") |> unlist() |> unique() |> length()
[1] 19

Creating dummies

With step_dummy(genres)

recipe(~ genres, data = horror_train) |>
  step_dummy(genres) |>
  prep() |>
  bake(new_data = NULL)
# A tibble: 2,255 × 320
   genres_Action genres_Action..Adventure genres_Action..Adventure..Animation.…¹
           <dbl>                    <dbl>                                  <dbl>
 1             0                        0                                      0
 2             0                        0                                      0
 3             0                        0                                      0
 4             0                        0                                      0
 5             0                        0                                      0
 6             0                        0                                      0
 7             0                        0                                      0
 8             0                        0                                      0
 9             0                        0                                      0
10             0                        0                                      0
# ℹ 2,245 more rows
# ℹ abbreviated name: ¹​genres_Action..Adventure..Animation..Comedy..Family
# ℹ 317 more variables:
#   genres_Action..Adventure..Drama..Foreign..Thriller <dbl>,
#   genres_Action..Adventure..Drama..History <dbl>,
#   genres_Action..Adventure..Drama..Thriller <dbl>,
#   genres_Action..Adventure..Fantasy <dbl>, …

Creating fancy dummies

With step_dummy_extract(genres, sep = ", ")

recipe(~ genres, data = horror_train) |>
  step_dummy_extract(genres, sep = ", ") |>
  prep() |>
  bake(new_data = NULL)
# A tibble: 2,255 × 20
   genres_Action genres_Adventure genres_Animation genres_Comedy genres_Crime
           <int>            <int>            <int>         <int>        <int>
 1             1                0                0             0            1
 2             0                0                0             0            0
 3             0                0                0             0            0
 4             0                0                0             0            0
 5             1                0                0             0            0
 6             0                0                0             0            0
 7             0                0                0             0            0
 8             0                0                0             0            0
 9             0                0                0             1            0
10             0                0                0             0            1
# ℹ 2,245 more rows
# ℹ 15 more variables: genres_Documentary <int>, genres_Drama <int>,
#   genres_Family <int>, genres_Fantasy <int>, genres_Foreign <int>,
#   genres_History <int>, genres_Music <int>, genres_Mystery <int>,
#   genres_Romance <int>, genres_Science.Fiction <int>, genres_Thriller <int>,
#   genres_TV.Movie <int>, genres_War <int>, genres_Western <int>,
#   genres_other <int>

Creating fancy dummies

With threshold = 0.1, levels that appear in less than 10% of the rows are pooled into the other column

recipe(~ genres, data = horror_train) |>
  step_dummy_extract(genres, sep = ", ", threshold = 0.1) |>
  prep() |>
  bake(new_data = NULL)
# A tibble: 2,255 × 7
   genres_Action genres_Comedy genres_Drama genres_Mystery
           <int>         <int>        <int>          <int>
 1             1             0            0              0
 2             0             0            1              0
 3             0             0            1              0
 4             0             0            0              1
 5             1             0            0              0
 6             0             0            1              1
 7             0             0            1              0
 8             0             0            1              0
 9             0             1            0              0
10             0             0            0              0
# ℹ 2,245 more rows
# ℹ 3 more variables: genres_Science.Fiction <int>, genres_Thriller <int>,
#   genres_other <int>

Dealing with Dates

We use step_date() to deal with the release_date variable

with the arguments features = c("dow", "month"), label = FALSE, and keep_original_cols = FALSE

recipe(~ release_date, data = horror_train) |>
  step_date(all_date_predictors(),
            features = c("dow", "month"), label = FALSE,
            keep_original_cols = FALSE) |>
  prep() |>
  bake(new_data = NULL)
# A tibble: 2,255 × 2
   release_date_dow release_date_month
              <int>              <int>
 1                6                  1
 2                6                  2
 3                3                  1
 4                6                  8
 5                6                  7
 6                6                  8
 7                6                 11
 8                3                 10
 9                6                  1
10                4                  5
# ℹ 2,245 more rows

New recipes selectors

In addition to all_predictors() and all_outcomes()

  • all_numeric()
    • all_double()
    • all_integer()
  • all_logical()
  • all_date()
  • all_datetime()
  • all_nominal()
    • all_string()
    • all_factor()
    • all_unordered()
    • all_ordered()

all have *_predictors() variants
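
A minimal sketch of how these compose (the selectors come from the list above; step_string2factor() and step_normalize() are standard recipes steps):

recipe(~ budget + runtime + genres, data = horror_train) |>
  step_string2factor(all_string_predictors()) |> # genres is <chr>
  step_normalize(all_double_predictors())        # budget, runtime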

Final recipe

Putting everything together, we can create the recipe we will use going forward

surv_rec <- recipe(surv ~ ., data = horror_train) |>
  update_role(id, title, new_role = "ID") |>
  step_dummy_extract(genres, spoken_languages, sep = ", ", threshold = 0.1) |>
  step_dummy_extract(production_countries, sep = ", ", threshold = 0.05) |>
  step_impute_median(budget) |>
  step_date(all_date_predictors(), 
            features = c("dow", "month"), label = FALSE, 
            keep_original_cols = FALSE)

Sneak peek at the data

surv_rec |>
  prep() |>
  bake(new_data = NULL) |>
  glimpse()
Rows: 2,255
Columns: 20
$ budget                                        <dbl> 19000000, 47000000, 3000…
$ id                                            <dbl> 755, 9095, 34996, 8973, …
$ runtime                                       <dbl> 108, 104, 82, 119, 108, …
$ title                                         <fct> "From Dusk Till Dawn", "…
$ surv                                          <Surv> <Surv[26 x 2]>
$ genres_Action                                 <int> 1, 0, 0, 0, 1, 0, 0, 0,…
$ genres_Comedy                                 <int> 0, 0, 0, 0, 0, 0, 0, 0, …
$ genres_Drama                                  <int> 0, 1, 1, 0, 0, 1, 1, 1, …
$ genres_Mystery                                <int> 0, 0, 0, 1, 0, 1, 0, 0, …
$ genres_Science.Fiction                        <int> 0, 0, 0, 0, 1, 0, 1, 0, …
$ genres_Thriller                               <int> 1, 1, 0, 1, 0, 1, 0, 0, …
$ genres_other                                  <int> 1, 1, 0, 0, 0, 0, 1, 0, …
$ spoken_languages_English                      <int> 1, 1, 1, 1, 1, 1, 1, 1, …
$ spoken_languages_other                        <int> 1, 0, 0, 0, 0, 2, 0, 0, …
$ production_countries_Canada                   <int> 0, 0, 0, 0, 0, 0, 0, 1, …
$ production_countries_United.Kingdom           <int> 0, 0, 0, 0, 0, 0, 1, 0, …
$ production_countries_United.States.of.America <int> 1, 1, 1, 1, 1, 1, 1, 0, …
$ production_countries_other                    <int> 0, 0, 0, 0, 0, 0, 1, 0, …
$ release_date_dow                              <int> 6, 6, 3, 6, 6, 6, 6, 3, …
$ release_date_month                            <int> 1, 2, 1, 8, 7, 8, 11, 10…

Survival analysis

Problem: you are working with time-to-event data, and you want to do it right!


For the horror movie data set, the event is reaching “100 ratings”, and the time to that event is measured in weeks


Is this the best example? No, but it is 🎃seasonal🎃

What do we have

  • multiple different models in censored
  • a number of performance metrics in yardstick
  • support for tuning multiple models with tune

For the best experience, install the dev versions

# pak::pak(c("tidymodels/censored", "tidymodels/parsnip", "tidymodels/tune"))

We are very close and would love any last-minute feedback!

specifying the models

If you know how to use parsnip, then you know how to use survival models

All the engines are provided in censored

Notice the new mode "censored regression"

sr_spec <- 
  survival_reg(dist = "weibull") %>%
  set_engine("survival") %>% 
  set_mode("censored regression") 

sr_spec
Parametric Survival Regression Model Specification (censored regression)

Main Arguments:
  dist = weibull

Computational engine: survival 

Types of models

We used a parametric survival regression model, but there are also (an example spec follows the list):

  • Decision trees
  • Bagged trees
  • Boosted trees
  • Random forests
  • Proportional hazards regression
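
For example, a proportional hazards spec looks like this (a sketch, not from the slides; the survival engine is one of those provided by censored):

ph_spec <- proportional_hazards() |>
  set_engine("survival") |>
  set_mode("censored regression")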

Works with workflows

These types of models behave basically the same as other models

The only difference is that the outcome needs to be a Surv() object

surv_wflow <- workflow() |>
  add_recipe(surv_rec) |>
  add_model(sr_spec)

surv_wflow_fit <- fit(surv_wflow, horror_train)
surv_wflow_fit
══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: survival_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
4 Recipe Steps

• step_dummy_extract()
• step_dummy_extract()
• step_impute_median()
• step_date()

── Model ───────────────────────────────────────────────────────────────────────
Call:
survival::survreg(formula = x ~ ., data = data, dist = ~"weibull", 
    model = TRUE)

Coefficients:
                                  (Intercept) 
                                 1.188669e+01 
                                       budget 
                                -3.697045e-08 
                                      runtime 
                                -9.604047e-03 
                                genres_Action 
                                 2.932672e-01 
                                genres_Comedy 
                                -3.241042e-01 
                                 genres_Drama 
                                 4.863489e-01 
                               genres_Mystery 
                                 1.948017e-02 
                       genres_Science.Fiction 
                                -9.928659e-01 
                              genres_Thriller 
                                -1.850366e-01 
                                 genres_other 
                                -3.287082e-01 
                     spoken_languages_English 
                                -1.217487e+00 
                       spoken_languages_other 
                                -3.205024e-02 
                  production_countries_Canada 
                                -3.507768e-02 
          production_countries_United.Kingdom 
                                -1.434201e-01 
production_countries_United.States.of.America 
                                 3.679797e-01 
                   production_countries_other 
                                -2.733567e-01 
                             release_date_dow 
                                -9.793042e-02 
                           release_date_month 
                                -5.917901e-02 

Scale= 1.509844 

Loglik(model)= -1188.6   Loglik(intercept only)= -1214.6
    Chisq= 52.06 on 17 degrees of freedom, p= 2.01e-05 
n= 2255 

Prediction

In the time-to-event setting, there are many things we could try to predict. In tidymodels, we believe we have it covered

These are selected by setting type =, and for some types by also specifying eval_time

Prediction - time

for type = "time" we get time to event prediction (The default)

surv_wflow_fit |>
  predict(horror_train, type = "time") 
# A tibble: 2,255 × 1
   .pred_time
        <dbl>
 1      4439.
 2      1934.
 3     28915.
 4      5202.
 5      1102.
 6      9992.
 7       299.
 8     10505.
 9      7236.
10      5842.
# ℹ 2,245 more rows

Prediction - linear prediction

for type = "linear_pred" we get the linear prediction

surv_wflow_fit |>
  predict(horror_train, type = "linear_pred") 
# A tibble: 2,255 × 1
   .pred_linear_pred
               <dbl>
 1              8.40
 2              7.57
 3             10.3 
 4              8.56
 5              7.00
 6              9.21
 7              5.70
 8              9.26
 9              8.89
10              8.67
# ℹ 2,245 more rows

Prediction - quantile

for type = "quantile" we get the quantiles of the event time distribution

You can set quantile to something other than the default (1:9)/10

surv_wflow_fit |>
  predict(horror_train, type = "quantile") 
# A tibble: 2,255 × 1
   .pred           
   <list>          
 1 <tibble [9 × 2]>
 2 <tibble [9 × 2]>
 3 <tibble [9 × 2]>
 4 <tibble [9 × 2]>
 5 <tibble [9 × 2]>
 6 <tibble [9 × 2]>
 7 <tibble [9 × 2]>
 8 <tibble [9 × 2]>
 9 <tibble [9 × 2]>
10 <tibble [9 × 2]>
# ℹ 2,245 more rows

Prediction - quantile

for type = "quantile" we get the quantiles of the event time distribution

You can set quantile to something other than the default (1:9)/10

surv_wflow_fit |>
  predict(horror_train, type = "quantile") |>
  slice(1:2) |>
  pull(.pred)
[[1]]
# A tibble: 9 × 2
  .quantile .pred_quantile
      <dbl>          <dbl>
1       0.1           148.
2       0.2           461.
3       0.3           936.
4       0.4          1610.
5       0.5          2553.
6       0.6          3890.
7       0.7          5875.
8       0.8          9106.
9       0.9         15638.

[[2]]
# A tibble: 9 × 2
  .quantile .pred_quantile
      <dbl>          <dbl>
1       0.1           64.7
2       0.2          201. 
3       0.3          408. 
4       0.4          701. 
5       0.5         1112. 
6       0.6         1695. 
7       0.7         2560. 
8       0.8         3968. 
9       0.9         6814. 
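
If we only wanted quartiles, a minimal sketch (using the quantile argument mentioned above):

surv_wflow_fit |>
  predict(horror_train, type = "quantile", quantile = c(0.25, 0.5, 0.75))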

Prediction - survival

for type = "survival" we get the survival probability

surv_wflow_fit |>
  predict(horror_train, 
          type = "survival", 
          eval_time = c(1, 10, 20, 30)) 
# A tibble: 2,255 × 1
   .pred           
   <list>          
 1 <tibble [4 × 2]>
 2 <tibble [4 × 2]>
 3 <tibble [4 × 2]>
 4 <tibble [4 × 2]>
 5 <tibble [4 × 2]>
 6 <tibble [4 × 2]>
 7 <tibble [4 × 2]>
 8 <tibble [4 × 2]>
 9 <tibble [4 × 2]>
10 <tibble [4 × 2]>
# ℹ 2,245 more rows

Prediction - survival

for type = "survival" we get the survival probability (The default)

surv_wflow_fit |>
  predict(horror_train, 
          type = "survival", 
          eval_time = c(1, 100, 200, 300)) |>
  slice(1:2) |>
  pull(.pred)
[[1]]
# A tibble: 4 × 2
  .eval_time .pred_survival
       <dbl>          <dbl>
1          1          0.996
2        100          0.922
3        200          0.880
4        300          0.845

[[2]]
# A tibble: 4 × 2
  .eval_time .pred_survival
       <dbl>          <dbl>
1          1          0.993
2        100          0.869
3        200          0.801
4        300          0.747

Prediction - hazard

for type = "hazard" we get the hazard estimate

surv_wflow_fit |>
  predict(horror_train, 
          type = "hazard", 
          eval_time = c(1, 10, 20, 30)) 
# A tibble: 2,255 × 1
   .pred           
   <list>          
 1 <tibble [4 × 2]>
 2 <tibble [4 × 2]>
 3 <tibble [4 × 2]>
 4 <tibble [4 × 2]>
 5 <tibble [4 × 2]>
 6 <tibble [4 × 2]>
 7 <tibble [4 × 2]>
 8 <tibble [4 × 2]>
 9 <tibble [4 × 2]>
10 <tibble [4 × 2]>
# ℹ 2,245 more rows

Prediction - hazard

for type = "hazard" we get the hazard estimate

surv_wflow_fit |>
  predict(horror_train, 
          type = "hazard", 
          eval_time = c(1, 100, 200, 300)) |>
  slice(1:2) |>
  pull(.pred)
[[1]]
# A tibble: 4 × 2
  .eval_time .pred_hazard
       <dbl>        <dbl>
1          1     0.00254 
2        100     0.000537
3        200     0.000425
4        300     0.000371

[[2]]
# A tibble: 4 × 2
  .eval_time .pred_hazard
       <dbl>        <dbl>
1          1     0.00441 
2        100     0.000931
3        200     0.000737
4        300     0.000643

Performance metrics

There are several performance metrics in yardstick that work specifically with survival data, such as brier_survival(), brier_survival_integrated(), roc_auc_survival(), and concordance_survival().

Performance metrics

Using augment() as a shortcut for predict() + bind_cols()

preds <- surv_wflow_fit |>
  augment(horror_train, eval_time = c(1, 100, 200, 300))

glimpse(preds)
Rows: 2,255
Columns: 11
$ budget               <dbl> 19000000, 47000000, NA, NA, 35000000, NA, 4500000…
$ genres               <chr> "Action, Thriller, Crime", "Drama, Thriller, Roma…
$ id                   <dbl> 755, 9095, 34996, 8973, 9348, 18256, 3036, 92769,…
$ production_countries <chr> "United States of America", "United States of Ame…
$ release_date         <date> 1996-01-19, 1996-02-23, 1995-01-10, 1995-08-25, …
$ runtime              <dbl> 108, 104, 82, 119, 108, 95, 123, 90, 92, 98, 112,…
$ spoken_languages     <chr> "English, Español", "English", "English", "Englis…
$ title                <chr> "From Dusk Till Dawn", "Mary Reilly", "The Addict…
$ surv                 <Surv> <Surv[26 x 2]>
$ .pred                <list> [<tbl_df[4 x 2]>], [<tbl_df[4 x 2]>], [<tbl_df[4…
$ .pred_time           <dbl> 4439.1661, 1934.1405, 28915.1872, 5202.1271, 110…

Performance metrics

You need a lot of information for some of these metrics; .censoring_weights_graf() can sometimes help

preds <- surv_wflow_fit |>
  augment(horror_train, eval_time = c(1, 100, 200, 300))

# hopefully better interface soon
.censoring_weights_graf(surv_wflow_fit, preds) |>
  brier_survival(truth = surv, .pred)
# A tibble: 4 × 4
  .metric        .estimator .eval_time .estimate
  <chr>          <chr>           <dbl>     <dbl>
1 brier_survival standard            1  0.000462
2 brier_survival standard          100  0.0567  
3 brier_survival standard          200  0.0749  
4 brier_survival standard          300  0.106   
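
Other survival metrics follow the same pattern. For example, concordance on the predicted event times (a sketch, reusing preds from above):

preds |>
  concordance_survival(truth = surv, estimate = .pred_time)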

Fitting many models

tune_res <- fit_resamples(surv_wflow, horror_set, eval_time = c(1, 10, 30, 50)) 
tune_res
# Resampling results
#  
# A tibble: 1 × 4
  splits             id         .metrics         .notes          
  <list>             <chr>      <list>           <list>          
1 <split [2255/752]> validation <tibble [4 × 5]> <tibble [0 × 3]>
tune_res |>
  collect_metrics()
# A tibble: 4 × 7
  .metric        .estimator .eval_time      mean     n std_err .config          
  <chr>          <chr>           <dbl>     <dbl> <int>   <dbl> <chr>            
1 brier_survival standard            1 0.0000233     1      NA Preprocessor1_Mo…
2 brier_survival standard           10 0.0149        1      NA Preprocessor1_Mo…
3 brier_survival standard           30 0.0287        1      NA Preprocessor1_Mo…
4 brier_survival standard           50 0.0436        1      NA Preprocessor1_Mo…

Clustering

What is clustering?

You have some data with no clear outcome, and you want to see if you can partition it

What we have

  • some models
  • some metrics
  • extraction
  • tuning

All in tidyclust

clustering recipe

Borrowed from the previous recipe, now with no outcome and with normalization added

clust_rec <- recipe(~ ., data = horror_train) |>
  update_role(id, title, surv, new_role = "ID") |>
  step_dummy_extract(genres, spoken_languages, sep = ", ", threshold = 0.1) |>
  step_dummy_extract(production_countries, sep = ", ", threshold = 0.05) |>
  step_impute_median(budget) |>
  step_date(all_date_predictors(), 
            features = c("dow", "month"), label = FALSE, 
            keep_original_cols = FALSE) |>
  step_normalize(all_numeric_predictors())

Peek at the data

clust_rec |>
  prep() |>
  bake(new_data = NULL) |>
  glimpse()
Rows: 2,255
Columns: 20
$ budget                                        <dbl> 1.601214800, 4.786132390…
$ id                                            <dbl> 755, 9095, 34996, 8973, …
$ runtime                                       <dbl> 0.89860756, 0.70468823, …
$ title                                         <fct> "From Dusk Till Dawn", "…
$ surv                                          <Surv> <Surv[26 x 2]>
$ genres_Action                                 <dbl> 2.9956460, -0.3336698, …
$ genres_Comedy                                 <dbl> -0.4080522, -0.4080522, …
$ genres_Drama                                  <dbl> -0.4249096, 2.3523981, 2…
$ genres_Mystery                                <dbl> -0.3916435, -0.3916435, …
$ genres_Science.Fiction                        <dbl> -0.5040422, -0.5040422, …
$ genres_Thriller                               <dbl> 1.1517608, 1.1517608, -0…
$ genres_other                                  <dbl> 1.4933815, 1.4933815, -0…
$ spoken_languages_English                      <dbl> 0.3076007, 0.3076007, 0.…
$ spoken_languages_other                        <dbl> 1.4857330, -0.3975317, -…
$ production_countries_Canada                   <dbl> -0.3015174, -0.3015174, …
$ production_countries_United.Kingdom           <dbl> -0.3893817, -0.3893817, …
$ production_countries_United.States.of.America <dbl> 0.654854, 0.654854, 0.65…
$ production_countries_other                    <dbl> -0.3861011, -0.3861011, …
$ release_date_dow                              <dbl> 0.7965929, 0.7965929, -0…
$ release_date_month                            <dbl> -1.5170577, -1.2337479, …

Specifying models and fitting

tidyclust is different from parsnip, but is used in almost the same way

kmeans_spec <- k_means(num_clusters = 4) |>
  set_engine("stats") |>
  set_mode("partition")

kmeans_wflow <- workflow(clust_rec, kmeans_spec)

kmeans_fit <- fit(kmeans_wflow, data = horror_train)

Available models

  • K-Means
    • K-Means
    • K-Modes
    • K-Prototypes
  • Hierarchical (Agglomerative) Clustering

More to come in the next release
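
For example, a hierarchical clustering spec looks much the same (a sketch; linkage_method is an argument of hier_clust()):

hc_spec <- hier_clust(num_clusters = 4, linkage_method = "average") |>
  set_engine("stats") |>
  set_mode("partition")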

cluster assignment + clusters + prediction

extract_cluster_assignment(kmeans_fit)
# A tibble: 2,255 × 1
   .cluster 
   <fct>    
 1 Cluster_1
 2 Cluster_1
 3 Cluster_2
 4 Cluster_3
 5 Cluster_1
 6 Cluster_3
 7 Cluster_1
 8 Cluster_4
 9 Cluster_2
10 Cluster_1
# ℹ 2,245 more rows

cluster assignment + clusters + prediction

extract_centroids(kmeans_fit)
# A tibble: 4 × 18
  .cluster  budget runtime genres_Action genres_Comedy genres_Drama
  <fct>      <dbl>   <dbl>         <dbl>         <dbl>        <dbl>
1 Cluster_1  0.251   0.216        0.211        -0.352        0.0751
2 Cluster_2 -0.164  -0.128       -0.0693        0.393       -0.170 
3 Cluster_3  0.189   0.246       -0.222        -0.265        0.281 
4 Cluster_4 -0.144  -0.190       -0.0125       -0.0422       0.0131
# ℹ 12 more variables: genres_Mystery <dbl>, genres_Science.Fiction <dbl>,
#   genres_Thriller <dbl>, genres_other <dbl>, spoken_languages_English <dbl>,
#   spoken_languages_other <dbl>, production_countries_Canada <dbl>,
#   production_countries_United.Kingdom <dbl>,
#   production_countries_United.States.of.America <dbl>,
#   production_countries_other <dbl>, release_date_dow <dbl>,
#   release_date_month <dbl>

cluster assignment + clusters + prediction

predict(kmeans_fit, new_data = horror_train)
# A tibble: 2,255 × 1
   .pred_cluster
   <fct>        
 1 Cluster_1    
 2 Cluster_1    
 3 Cluster_2    
 4 Cluster_3    
 5 Cluster_1    
 6 Cluster_3    
 7 Cluster_1    
 8 Cluster_4    
 9 Cluster_2    
10 Cluster_1    
# ℹ 2,245 more rows

Looking at the clusters

Cluster-aware metrics

Metrics are also available that use information about the centroids

my_metrics <- cluster_metric_set(sse_ratio, sse_within_total)

my_metrics(kmeans_fit)
# A tibble: 2 × 3
  .metric          .estimator .estimate
  <chr>            <chr>          <dbl>
1 sse_ratio        standard       0.825
2 sse_within_total standard   31602.   

You can do tuning as well!

By setting tune() in our cluster spec, we can find the “optimal” value, using tune_cluster() where we would otherwise use tune_grid(); a sketch follows
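
A minimal sketch of what that can look like (reusing clust_rec from above; the 5-fold resamples are an assumption, not from the slides):

kmeans_tune_spec <- k_means(num_clusters = tune())
kmeans_tune_wflow <- workflow(clust_rec, kmeans_tune_spec)

set.seed(1234)
folds <- vfold_cv(horror_train, v = 5)

tune_cluster(
  kmeans_tune_wflow,
  resamples = folds,
  grid = tibble(num_clusters = 2:8),
  metrics = cluster_metric_set(sse_ratio)
)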

Conformal inference

How to make prediction intervals with no parametric assumptions

Max Kuhn - Conformal Inference with Tidymodels

Talk material

Recording up “soon” on YouTube

using probably

tidymodels.org article
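
A rough sketch of the probably interface (reg_fit, cal_data, and new_data are hypothetical stand-ins for a fitted regression workflow, a calibration set, and new observations; see the article for the full workflow):

conf_int <- int_conformal_split(reg_fit, cal_data = cal_data)
predict(conf_int, new_data, level = 0.90)  # adds .pred_lower / .pred_upper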

Calibration

Post-processing tool

There are essentially three different parts to a predictive model:

  • the pre-processing stage
  • model fitting
  • post-processing

Does your model always predict between 40% and 60%? Then you might need calibration!

also using probably

tidymodels.org article
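
A rough sketch of the probably calibration functions (preds_cls is a hypothetical data frame of class probabilities with a truth factor column class):

cal_plot_breaks(preds_cls, truth = class)     # diagnose miscalibration
cal_fit <- cal_estimate_logistic(preds_cls, truth = class)
cal_apply(preds_cls, cal_fit)                 # recalibrated probabilities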

Causal inference

We want more eyes!

Link here if interested

Coming up soon

  • fairness metrics
  • cli errors (this year 🤞)

where to look