class: center, middle, title-slide

# Feature engineering

## NHS-R Conference 2021

### Emil Hvitfeldt

### 2021-11-02

---
class: inverse, middle, center

<!--- Packages --------------------------------------------------------------->
<!--- Chunk options ---------------------------------------------------------->
<!--- pkg highlight ---------------------------------------------------------->

<style>
.pkg {
  font-weight: bold;
  letter-spacing: 0.5pt;
  color: #866BBF;
}
</style>

<!--- Highlighting colors ---------------------------------------------------->

<div style = "position:fixed; visibility: hidden">
`$$\require{color}\definecolor{purple}{rgb}{0.525490196078431, 0.419607843137255, 0.749019607843137}$$`
`$$\require{color}\definecolor{green}{rgb}{0.0117647058823529, 0.650980392156863, 0.415686274509804}$$`
`$$\require{color}\definecolor{orange}{rgb}{0.949019607843137, 0.580392156862745, 0.254901960784314}$$`
`$$\require{color}\definecolor{white}{rgb}{1, 1, 1}$$`
</div>

<script type="text/x-mathjax-config">
MathJax.Hub.Config({
  TeX: {
    Macros: {
      purple: ["{\\color{purple}{#1}}", 1],
      green: ["{\\color{green}{#1}}", 1],
      orange: ["{\\color{orange}{#1}}", 1],
      white: ["{\\color{white}{#1}}", 1]
    },
    loader: {load: ['[tex]/color']},
    tex: {packages: {'[+]': ['color']}}
  }
});
</script>

<style>
.purple {color: #866BBF;}
.green  {color: #03A66A;}
.orange {color: #F29441;}
.white  {color: #FFFFFF;}
</style>

<!--- knitr hooks ------------------------------------------------------------>

# [`tidymodels.org`](https://www.tidymodels.org/)

# _Tidy Modeling with R_ ([`tmwr.org`](https://www.tmwr.org/))

---

# What is feature engineering?

First things first: what's a feature?

I tend to think of a feature as some representation of a predictor that will be used in a model.

Old-school features:

* Interactions
* Polynomial expansions/splines
* PCA feature extraction

"Feature engineering" sounds pretty cool, but let's take a minute to talk about _preprocessing_ data.

---

# Two types of preprocessing

<img src="images/fe_venn.svg" width="75%" style="display: block; margin: auto;" />

---

# Two types of preprocessing

<img src="images/fe_venn_info.svg" width="75%" style="display: block; margin: auto;" />

---

# Easy examples

For example, centering and scaling are definitely not feature engineering.

Consider the `date` field in the Chicago data. If given as a raw predictor, it is converted to an integer. Spoiler alert: the date is the most important predictor. It can be re-encoded as:

* Days since a reference date 😪
* Day of the week ❤️❤️❤️❤️
* Month 😪
* Year ❤️❤️
* Indicators for holidays ❤️❤️❤️
* Indicators for home games for NFL, NBA, etc. 😪

---

# General definitions

* _Data preprocessing_ is the set of steps that you take to make your model successful.
* _Feature engineering_ is what you do to the original predictors to make the model do the least work to predict the outcome as well as possible.

We'll demonstrate the .pkg[recipes] package for all of your data needs.

---

# Recipes prepare your data for modeling

The package is an extensible framework for pipeable sequences of feature engineering steps that provide preprocessing tools to be applied to data.

Statistical parameters for the steps can be estimated from an initial data set and then applied to other data sets.

The resulting processed output can then be used as inputs for statistical or machine learning models.
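---

# Recipes prepare your data for modeling

That estimate-then-apply pattern is worth seeing once in miniature. Here is a minimal sketch using a hypothetical split of `mtcars` (not the workshop data): the statistics come from the training portion only and are reused, unchanged, on the new data.

```r
library(recipes)

# Purely illustrative split of mtcars (not the workshop data)
train_df <- mtcars[1:20, ]
test_df  <- mtcars[21:32, ]

rec <- recipe(mpg ~ ., data = train_df) %>% 
  step_normalize(all_numeric_predictors()) %>% 
  prep()  # means and SDs are estimated from train_df only

bake(rec, new_data = test_df)  # test_df is scaled with the training statistics
```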
---

# A first recipe

```r
chi_rec <- recipe(ridership ~ ., data = chi_train)

# If ncol(data) is large, you can use
# recipe(data = chi_train)
```

Based on the formula, the function assigns columns to roles of "outcome" or "predictor".

```r
summary(chi_rec)
```

```
## # A tibble: 50 × 4
##    variable         type    role      source  
##    <chr>            <chr>   <chr>     <chr>   
##  1 Austin           numeric predictor original
##  2 Quincy_Wells     numeric predictor original
##  3 Belmont          numeric predictor original
##  4 Archer_35th      numeric predictor original
##  5 Oak_Park         numeric predictor original
##  6 Western          numeric predictor original
##  7 Clark_Lake       numeric predictor original
##  8 Clinton          numeric predictor original
##  9 Merchandise_Mart numeric predictor original
## 10 Irving_Park      numeric predictor original
## # … with 40 more rows
```

---

# A first recipe - work with dates

```r
chi_rec <- recipe(ridership ~ ., data = chi_train) %>% 
  step_date(date, features = c("dow", "month", "year"))
```

This creates three new columns in the data based on the date. Note that the day-of-the-week column is a factor.

---

# A first recipe - work with dates

```r
chi_rec <- recipe(ridership ~ ., data = chi_train) %>% 
  step_date(date, features = c("dow", "month", "year")) %>% 
  step_holiday(date)
```

Add indicators for major holidays. Specific holidays, especially those ex-US, can also be generated.

At this point, we don't need `date` anymore. Instead of deleting it (there is a step for that) we will change its _role_ to be an identification variable.

---

# A first recipe - work with dates

```r
chi_rec <- recipe(ridership ~ ., data = chi_train) %>% 
  step_date(date, features = c("dow", "month", "year")) %>% 
  step_holiday(date) %>% 
  update_role(date, new_role = "id")
```

`date` is still in the data set but tidymodels knows not to treat it as an analysis column.

---

# A first recipe - create indicator variables

```r
chi_rec <- recipe(ridership ~ ., data = chi_train) %>% 
  step_date(date, features = c("dow", "month", "year")) %>% 
  step_holiday(date) %>% 
  update_role(date, new_role = "id") %>% 
  step_dummy(all_nominal_predictors())
```

For any factor or character predictors, make binary indicators.

There are _many_ recipe steps that can convert categorical predictors to numeric columns.

---

# A first recipe - filter out constant columns

```r
chi_rec <- recipe(ridership ~ ., data = chi_train) %>% 
  step_date(date, features = c("dow", "month", "year")) %>% 
  step_holiday(date) %>% 
  update_role(date, new_role = "id") %>% 
  step_dummy(all_nominal_predictors()) %>% 
  step_zv(all_predictors())
```

In case there is a holiday that was never observed, we can delete any _zero-variance_ predictors that have a single unique value.

Note that the selector chooses all columns with a role of "predictor".

---

# A first recipe - normalization

```r
chi_rec <- recipe(ridership ~ ., data = chi_train) %>% 
  step_date(date, features = c("dow", "month", "year")) %>% 
  step_holiday(date) %>% 
  update_role(date, new_role = "id") %>% 
  step_dummy(all_nominal_predictors()) %>% 
  step_zv(all_predictors()) %>% 
  step_normalize(all_numeric_predictors())
```

This centers and scales the numeric predictors.

Note that this will use the training set to estimate the means and standard deviations of the data.

All data put through the recipe will be normalized using those statistics (there is no re-estimation).
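---

# A first recipe - peek at the result

If you want to see what these steps produce, one option is to estimate the recipe and look at the new date-based columns. A sketch, assuming `chi_train` from the workshop materials and the tidymodels packages are in your session:

```r
chi_rec %>% 
  prep() %>%                  # estimate the recipe on chi_train
  bake(new_data = NULL) %>%   # NULL returns the processed training set
  select(starts_with("date_")) %>% 
  slice(1:3)
```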
---

# A first recipe - reduce correlation

```r
chi_rec <- recipe(ridership ~ ., data = chi_train) %>% 
  step_date(date, features = c("dow", "month", "year")) %>% 
  step_holiday(date) %>% 
  update_role(date, new_role = "id") %>% 
  step_dummy(all_nominal_predictors()) %>% 
  step_zv(all_predictors()) %>% 
  step_normalize(all_numeric_predictors()) %>% 
  step_corr(all_numeric_predictors(), threshold = 0.9)
```

To deal with highly correlated predictors, find the minimum set of predictors to remove so that all pairwise correlations are less than 0.9.

There are other filter steps too.

---

# Other possible steps

```r
chi_rec <- recipe(ridership ~ ., data = chi_train) %>% 
  step_date(date, features = c("dow", "month", "year")) %>% 
  step_holiday(date) %>% 
  update_role(date, new_role = "id") %>% 
  step_dummy(all_nominal_predictors()) %>% 
  step_zv(all_predictors()) %>% 
  step_normalize(all_numeric_predictors()) %>% 
  step_pca(all_numeric_predictors())
```

PCA feature extraction...

---

# Other possible steps

```r
chi_rec <- recipe(ridership ~ ., data = chi_train) %>% 
  step_date(date, features = c("dow", "month", "year")) %>% 
  step_holiday(date) %>% 
  update_role(date, new_role = "id") %>% 
  step_dummy(all_nominal_predictors()) %>% 
  step_zv(all_predictors()) %>% 
  step_normalize(all_numeric_predictors()) %>% 
  step_umap(all_numeric_predictors(), outcome = ridership)
```

A fancy supervised dimension reduction technique from machine learning.

---

# Other possible steps

```r
chi_rec <- recipe(ridership ~ ., data = chi_train) %>% 
  step_date(date, features = c("dow", "month", "year")) %>% 
  step_holiday(date) %>% 
  update_role(date, new_role = "id") %>% 
  step_dummy(all_nominal_predictors()) %>% 
  step_zv(all_predictors()) %>% 
  step_normalize(all_numeric_predictors()) %>% 
  step_ns(Clark_Lake, deg_free = 10)
```

Nonlinear transforms like _natural splines_, and so on.

---

# Recipes are estimated

_Every_ preprocessing step in a recipe that involves calculations uses the _training set_. For example:

* Levels of a factor
* Determination of zero-variance predictors
* Normalization
* Feature extraction

and so on.

Once a recipe is added to a workflow, this occurs when `fit()` is called.
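---

# Recipes are estimated

Because steps are estimated, you can also inspect what a trained step decided. A sketch, assuming the `chi_rec` with `step_corr()` from a few slides back, that lists which columns the correlation filter removed:

```r
chi_rec %>% 
  prep() %>%        # estimate every step on the training set
  tidy(number = 6)  # step_corr() is the sixth step; `terms` lists removed columns
```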
---

# Recipes follow this strategy

<img src="images/the-model.svg" width="70%" style="display: block; margin: auto;" />

---

# Adding recipes to workflows

Let's stick to a linear model for now and add a recipe (instead of a formula):

.pull-left[

```r
lm_spec <- linear_reg() 

chi_wflow <- 
  workflow() %>% 
  add_model(lm_spec) %>% 
  add_recipe(chi_rec)

chi_wflow
```
]

.pull-right[

```
## ══ Workflow ══════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: linear_reg()
## 
## ── Preprocessor ──────────────────────────────────────────────────────────────
## 6 Recipe Steps
## 
## • step_date()
## • step_holiday()
## • step_dummy()
## • step_zv()
## • step_normalize()
## • step_corr()
## 
## ── Model ─────────────────────────────────────────────────────────────────────
## Linear Regression Model Specification (regression)
## 
## Computational engine: lm
```
]

---

# Estimate via `fit()`

The `fit()` function estimates the recipe using the training set, then fits the model to the processed version of that data:

.pull-left[

```r
chi_fit <- 
  chi_wflow %>% 
  fit(chi_train)

chi_fit
```
]

.pull-right[

```
## ══ Workflow [trained] ════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: linear_reg()
## 
## ── Preprocessor ──────────────────────────────────────────────────────────────
## 6 Recipe Steps
## 
## • step_date()
## • step_holiday()
## • step_dummy()
## • step_zv()
## • step_normalize()
## • step_corr()
## 
## ── Model ─────────────────────────────────────────────────────────────────────
## 
## Call:
## stats::lm(formula = ..y ~ ., data = data)
## 
## Coefficients:
##       (Intercept)   Washington_Wells        temp_change                dew  
##         13.611685          -0.102278          -0.026406           0.422014  
##          humidity           pressure    pressure_change               wind  
##         -0.070948           0.002097           0.032573          -0.120868  
##          wind_max               gust           gust_max             percip  
##          0.005030          -0.038944           0.097526          -0.034111  
##        percip_max       weather_rain       weather_snow      weather_cloud  
##         -0.030426          -0.127184          -0.127124          -0.087404  
##     weather_storm    Blackhawks_Away    Blackhawks_Home         Bulls_Away  
##          0.009117          -0.037365          -0.005315           0.015849  
##        Bulls_Home         Bears_Away         Bears_Home          Cubs_Home  
##          0.112938           0.061506           0.055691          -0.259078  
##         date_year      date_LaborDay   date_NewYearsDay  date_ChristmasDay  
##          1.743613           0.044648          -0.512929          -0.581508  
##      date_dow_Mon       date_dow_Tue       date_dow_Wed       date_dow_Thu  
##          4.475308           4.969280           4.974286           4.901382  
##      date_dow_Fri       date_dow_Sat     date_month_Feb     date_month_Mar  
##          4.695743           0.411799           0.116181           0.246738  
##    date_month_Apr     date_month_May     date_month_Jun     date_month_Jul  
##          0.346733           0.238892           0.498224           0.331972  
##    date_month_Aug     date_month_Sep     date_month_Oct     date_month_Nov  
##          0.401437           0.326200           0.481322           0.127999  
##    date_month_Dec  
##         -0.057195  
```
]

---

# Prediction

When `predict()` is called, the fitted recipe is applied to the new data before it is predicted by the model:

```r
predict(chi_fit, chi_test)
```

```
## # A tibble: 14 × 1
##    .pred
##    <dbl>
##  1 20.6 
##  2 21.4 
##  3 21.7 
##  4 21.5 
##  5 20.8 
##  6  8.41
##  7  7.39
##  8 20.2 
##  9 21.6 
## 10 21.5 
## 11 21.2 
## 12 20.7 
## 13  8.86
## 14  7.60
```

You don't need to do anything else.
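---

# Prediction

Under the hood, this is roughly equivalent to baking the new data with the trained recipe and then predicting with the underlying model fit. A sketch of that equivalence (never needed in practice, but useful for understanding the mechanics):

```r
processed <- chi_fit %>% 
  extract_recipe() %>%        # the trained recipe
  bake(new_data = chi_test)   # apply it to the test set

chi_fit %>% 
  extract_fit_parsnip() %>%   # the underlying lm fit
  predict(new_data = processed)
```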
---

# Tidying a recipe

.pull-left[

`tidy(recipe)` gives a summary of the steps:

```r
tidy(chi_rec)
```

```
## # A tibble: 6 × 6
##   number operation type      trained skip  id             
##    <int> <chr>     <chr>     <lgl>   <lgl> <chr>          
## 1      1 step      date      FALSE   FALSE date_sJxC9     
## 2      2 step      holiday   FALSE   FALSE holiday_nMzT3  
## 3      3 step      dummy     FALSE   FALSE dummy_LUPxu    
## 4      4 step      zv        FALSE   FALSE zv_HsCdL       
## 5      5 step      normalize FALSE   FALSE normalize_KGE7Q
## 6      6 step      corr      FALSE   FALSE corr_64q8w     
```

After fitting the recipe, you might want access to the statistics from each step. We can pull the fitted recipe from the workflow and choose which step to tidy by number or `id`.
]

.pull-right[

```r
chi_fit %>% 
  extract_recipe() %>% 
  tidy(number = 5) # For step_normalize()
```

```
## # A tibble: 138 × 4
##    terms            statistic value id             
##    <chr>            <chr>     <dbl> <chr>          
##  1 Austin           mean       1.52 normalize_KGE7Q
##  2 Quincy_Wells     mean       5.58 normalize_KGE7Q
##  3 Belmont          mean       4.09 normalize_KGE7Q
##  4 Archer_35th      mean       2.21 normalize_KGE7Q
##  5 Oak_Park         mean       1.32 normalize_KGE7Q
##  6 Western          mean       2.87 normalize_KGE7Q
##  7 Clark_Lake       mean      13.6  normalize_KGE7Q
##  8 Clinton          mean       2.44 normalize_KGE7Q
##  9 Merchandise_Mart mean       4.67 normalize_KGE7Q
## 10 Irving_Park      mean       3.41 normalize_KGE7Q
## # … with 128 more rows
```
]

---

# Debugging a recipe

90% of the time, you will want to use a workflow to estimate and apply a recipe.

If you have an error, the original recipe object (e.g. `chi_rec`) can be estimated manually with a function called `prep()` (analogous to `fit()`). This returns the fitted recipe and can help debug any issues.

Another function, `bake()`, is analogous to `predict()` and gives you the processed data back.

---

# Fun facts about recipes

* Once `fit()` is called on a workflow, changing the model does not re-fit the recipe.
* A list of all known steps is [here](https://www.tidymodels.org/find/recipes/).
* Some steps can be [skipped](https://recipes.tidymodels.org/articles/Skipping.html) when using `predict()`.
* The [order](https://recipes.tidymodels.org/articles/Ordering.html) of the steps matters.
* There are .pkg[recipes]-adjacent packages with more steps: .pkg[embed], .pkg[timetk], .pkg[textrecipes], .pkg[themis], and others.
* Julia and I have written an amazing text processing book: [_Supervised Machine Learning for Text Analysis in R_](https://smltar.com/)
* There are a lot of ways to handle [categorical predictors](https://recipes.tidymodels.org/articles/Dummies.html), even those with novel levels.
* Several .pkg[dplyr] steps exist, such as `step_mutate()`.

---

# Hands-On: Add a recipe to your model

Go to the lab and add a custom recipe to perform feature engineering.
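If you want a starting point, here is a minimal sketch that recombines steps from this session (assuming `chi_train` and the earlier workshop objects; adapt the steps to your own data):

```r
# A possible starting point for the lab - adapt to your own data
lab_rec <- recipe(ridership ~ ., data = chi_train) %>% 
  step_date(date, features = c("dow", "month")) %>% 
  update_role(date, new_role = "id") %>% 
  step_dummy(all_nominal_predictors()) %>% 
  step_zv(all_predictors())

lab_wflow <- workflow() %>% 
  add_model(linear_reg()) %>% 
  add_recipe(lab_rec)

lab_fit <- fit(lab_wflow, chi_train)
```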