class: center, middle, title-slide

# Creating Models
## NHS-R Conference 2021
### Emil Hvitfeldt
### 2021-11-02

---
layout: false
class: inverse, middle, center

<!--- Packages --------------------------------------------------------------->
<!--- Chunk options ---------------------------------------------------------->
<!--- pkg highlight ---------------------------------------------------------->

<style>
.pkg {
  font-weight: bold;
  letter-spacing: 0.5pt;
  color: #866BBF;
}
</style>

<!--- Highlighting colors ----------------------------------------------------->

<div style = "position:fixed; visibility: hidden">
`$$\require{color}\definecolor{purple}{rgb}{0.525490196078431, 0.419607843137255, 0.749019607843137}$$`
`$$\require{color}\definecolor{green}{rgb}{0.0117647058823529, 0.650980392156863, 0.415686274509804}$$`
`$$\require{color}\definecolor{orange}{rgb}{0.949019607843137, 0.580392156862745, 0.254901960784314}$$`
`$$\require{color}\definecolor{white}{rgb}{1, 1, 1}$$`
</div>

<script type="text/x-mathjax-config">
MathJax.Hub.Config({
  TeX: {
    Macros: {
      purple: ["{\\color{purple}{#1}}", 1],
      green: ["{\\color{green}{#1}}", 1],
      orange: ["{\\color{orange}{#1}}", 1],
      white: ["{\\color{white}{#1}}", 1]
    },
    loader: {load: ['[tex]/color']},
    tex: {packages: {'[+]': ['color']}}
  }
});
</script>

<style>
.purple {color: #866BBF;}
.green {color: #03A66A;}
.orange {color: #F29441;}
.white {color: #FFFFFF;}
</style>

<!--- knitr hooks ------------------------------------------------------------>

# [`tidymodels.org`](https://www.tidymodels.org/)

# _Tidy Modeling with R_ ([`tmwr.org`](https://www.tmwr.org/))

---

# Define the scope of "The Model"

In tidymodels, there is the idea that a model-oriented data analysis consists of

- a preprocessor, and
- a model

The preprocessor might be a simple formula or a sophisticated recipe.

It's important to consider both of these activities as part of the data analysis process.

- Post-model activities should also be included there (e.g. calibration, cut-off optimization, etc.)
- We don't have those implemented yet.

---

# Basic tidymodels components

<img src="images/blocks.png" width="70%" style="display: block; margin: auto;" />

---

# A relevant example

Let's say that we have some highly correlated predictors and we want to reduce the correlation by first applying principal component analysis to the data.

- AKA principal component regression

---

# A relevant example

Let's say that we have some highly correlated predictors and we want to reduce the correlation by first applying principal component analysis to the data.

- AKA ~~principal component regression~~ feature extraction

---

# A relevant example

Let's say that we have some highly correlated predictors and we want to reduce the correlation by first applying principal component analysis to the data.

- AKA ~~principal component regression~~ feature extraction

What do we consider the estimation part of this process?

---

# Is it this?

<img src="images/faux-model.svg" width="70%" style="display: block; margin: auto;" />

---

# Or is it this?

<img src="images/the-model.svg" width="70%" style="display: block; margin: auto;" />

---

# What's the difference?

It is easy to think that the model fit is the only estimation step.

There are cases where this could go really wrong:

* Poor estimation of performance (by treating the PCA parts as known)

* Selection bias in feature selection

* Information/data leakage

These problems are exacerbated as the preprocessors increase in complexity and/or effectiveness.
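As a small aside (not from the original slides), here is a minimal base-R sketch of how that leakage can arise: if the PCA rotation is estimated on *all* rows before splitting, information from the eventual test rows has already influenced the training features. The toy data and the 80/20 split below are made up purely for illustration.

```r
set.seed(123)
dat <- data.frame(y = rnorm(100), x1 = rnorm(100), x2 = rnorm(100))

# Leaky: the rotation "sees" rows that will later be held out
pca_all <- prcomp(dat[, c("x1", "x2")], scale. = TRUE)

# Safer: estimate the rotation on the training rows only,
# then apply it to the held-out rows
train_rows  <- 1:80
pca_train   <- prcomp(dat[train_rows, c("x1", "x2")], scale. = TRUE)
test_scores <- predict(pca_train, newdata = dat[-train_rows, c("x1", "x2")])
```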
_We'll come back to this at the end of this section_

---
class: inverse, middle, center

<div style="font-size: 110pt;">
.orange[Data splitting]
</div>

---
class: inverse, middle, center

<div style="font-size: 50pt;">
.white[Always have a separate piece of data that can] .purple[contradict] .white[what you] .green[believe]
</div>

---

# Data splitting and spending

How do we "spend" the data to find an optimal model?

We _typically_ split data into training and test data sets:

* ***Training Set***: these data are used to estimate model parameters and to pick the values of the complexity parameter(s) for the model.

* ***Test Set***: these data can be used to get an independent assessment of model efficacy. **They should not be used during model training** (like, at all).

---

# Data splitting and spending

The more data we spend, the better estimates we'll get (provided the data is accurate).

Given a fixed amount of data:

* Too much spent in training won't allow us to get a good assessment of predictive performance. We may find a model that fits the training data very well, but is not generalizable (overfitting).

* Too much spent in testing won't allow us to get a good assessment of model parameters.

Statistically, the best course of action would be to use all the data for model building and use statistical methods to get good estimates of error.

From a non-statistical perspective, many consumers of complex models emphasize the need for an untouched set of samples to evaluate performance.

---

# Mechanics of data splitting

There are a few different ways to do the split: simple random sampling, _stratified sampling based on the outcome_, by date, or methods that focus on the distribution of the predictors.

For stratification:

* **classification**: this would mean sampling within the classes to preserve the distribution of the outcome in the training and test sets

* **regression**: determine the quartiles of the data set and sample within those artificial groups

For _time series_, we often use the most recent data as the test set.

---

# Splitting with Chicago data

`initial_split()` can be used when we use randomness to make the split.

Let's put the last two weeks of data into the test set. `initial_time_split()` can be used for this purpose:

```r
set.seed(1234)
chi_split <- initial_time_split(Chicago, prop = 1 - (14/nrow(Chicago)))
chi_split
```

```
## <Analysis/Assess/Total>
## <5684/14/5698>
```

```r
chi_train <- training(chi_split)
chi_test <- testing(chi_split)

c(training = nrow(chi_train), testing = nrow(chi_test))
```

```
## training  testing 
##     5684       14
```

---
class: inverse, middle, center

<div style="font-size: 80pt;">
.orange[Creating models in R]
</div>

---

# Specifying models in R using formulas

To fit a model to the Chicago ridership data, the model terms must be specified. Historically, there are two main interfaces for doing this.
The **formula** interface uses R [formula rules](https://cran.r-project.org/doc/manuals/r-release/R-intro.html#Formulae-for-statistical-models) to specify a _symbolic_ representation of the terms:

Variables + interactions

```r
# day_of_week is not in the data set but could be created as
# day_of_week = lubridate::wday(date, label = TRUE)
model_fn(ridership ~ day_of_week + Western + day_of_week:Western, data = chi_train)
```

Shorthand for all predictors

```r
model_fn(ridership ~ ., data = chi_train)
```

Inline functions / transformations

```r
model_fn(log10(ridership) ~ ns(Western, df = 3) + ., data = chi_train)
```

---

# Downsides to formulas

* You can't nest in-line functions such as `model_fn(y ~ pca(scale(x1), scale(x2), scale(x3)), data = dat)`.

* All the model matrix calculations happen at once and can't be recycled when used in a model function.

* For very _wide_ data sets, the formula method can be [extremely inefficient](https://rviews.rstudio.com/2017/03/01/the-r-formula-method-the-bad-parts/).

* There are limited _roles_ that variables can take, which has led to several re-implementations of formulas.

* Specifying multivariate outcomes is clunky and inelegant.

* Not all modeling functions have a formula method (consistency!).

---

# Specifying models without formulas

Some modeling functions have a non-formula (XY) interface. This usually has arguments for the predictors and the outcome(s):

```r
# Usually, the variables must all be numeric
pre_vars <- c("Austin", "Clark_Lake", "California")
model_fn(x = chi_train[, pre_vars], y = chi_train$ridership)
```

This is inconvenient if you have transformations, factor variables, interactions, or any other operations to apply to the data prior to modeling.

Overall, it is difficult to predict if a package has one or both of these interfaces. For example, `lm` only has formulas.

There is a **third interface**, using _recipes_, that will be discussed later and that solves some of these issues.

---

# A linear regression model

Let's start by fitting an ordinary linear regression model to the training set. You can choose the model terms for your model, but I will use a very simple model:

```r
simple_lm <- lm(ridership ~ Clark_Lake + humidity, data = chi_train)
```

Before looking at coefficients, we should do some model checking to see if there is anything obviously wrong with the model.

To get the statistics on the individual data points, we will use the awesome `broom` package:

```r
simple_lm_values <- augment(simple_lm)
names(simple_lm_values)
```

```
## [1] "ridership"  "Clark_Lake" "humidity"   ".fitted"    ".resid"    
## [6] ".hat"       ".sigma"     ".cooksd"    ".std.resid"
```

---
class: inverse, middle, center

<div style="font-size: 110pt;">
.orange[Fitting via tidymodels]
</div>

---

# The parsnip package

.pull-left[

- A tidy unified _interface_ to models
- `lm()` isn't the only way to perform linear regression
  - .pkg[glmnet] for regularized regression
  - .pkg[stan] for Bayesian regression
  - .pkg[keras] for regression using tensorflow
- But...remember the consistency slide?
  - Each interface has its own minutiae to remember
  - .pkg[parsnip] standardizes all that!
]

.pull-right[

<img src="images/all_the_models.jpeg" width="700px" style="display: block; margin: auto;" />

]

---

# parsnip in action

.pull-left[

1) Create specification

2) Set the engine

3) Fit the model

```r
spec_lin_reg <- linear_reg()
spec_lin_reg
```

```
## Linear Regression Model Specification (regression)
## 
## Computational engine: lm
```

```r
spec_lm <- spec_lin_reg %>% set_engine("lm")
spec_lm
```

```
## Linear Regression Model Specification (regression)
## 
## Computational engine: lm
```

]

.pull-right[

```r
fit_lm <- fit(
  spec_lm,
  ridership ~ Clark_Lake + humidity,
  data = chi_train
)

fit_lm
```

```
## parsnip model object
## 
## Fit time:  4ms 
## 
## Call:
## stats::lm(formula = ridership ~ Clark_Lake + humidity, data = data)
## 
## Coefficients:
## (Intercept)   Clark_Lake     humidity  
##    1.764496     0.883693    -0.002497
```

Note: Models have default engines. We don't really need to use `set_engine("lm")` for this example.

]

---

# Alternative engines

With .pkg[parsnip], it is easy to switch to a different engine, like Stan, to run the same model with alternative backends.

.pull-left[

```r
spec_stan <- 
  spec_lin_reg %>%
  # Engine specific arguments are
  # passed through here
  set_engine("stan", chains = 4, iter = 1000)

# Otherwise, looks exactly the same!
fit_stan <- fit(
  spec_stan,
  ridership ~ Clark_Lake + humidity,
  data = chi_train
)
```

]

.pull-right[

```r
coef(fit_stan$fit)
```

```
## (Intercept)  Clark_Lake    humidity 
##  1.76242135  0.88364918 -0.00241108
```

```r
coef(fit_lm$fit)
```

```
##  (Intercept)   Clark_Lake     humidity 
##  1.764495522  0.883692888 -0.002497053
```

]

---

# Duplicate computations

Note that, for both of these fits, some of the computations are repeated.

For example, the formula method does a fair amount of work to figure out how to turn the data frame into a matrix of predictors.

When there are special effects (e.g. splines), dummy variables, interactions, or other components, the formula/terms objects have to keep track of everything.

In cases where there are a lot of _predictors_, these computations can consume a lot of resources. If we can save them, that would be helpful.

The answer is a _workflow_ object. These bundle together a preprocessor (such as a formula) along with a model.
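As a small illustration of that repeated work (not in the original slides; it assumes `chi_train` from earlier is available), `model.matrix()` is the base-R machinery that expands a formula into the numeric predictor matrix, and every plain formula-interface fit re-runs it:

```r
# The formula/terms machinery converts the data frame into a numeric matrix;
# a call such as lm(ridership ~ ..., data = chi_train) repeats this step each time
head(model.matrix(ridership ~ Clark_Lake + humidity, data = chi_train))
```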
---

# A modeling _workflow_

We can _optionally_ bundle the preprocessor and model together into a <span style="color:LightGray;"><strike>pipeline</strike></span> _workflow_:

```r
reg_wflow <- 
  workflow() %>%   # attached with the tidymodels package
  add_model(spec_lm) %>% 
  add_formula(ridership ~ Clark_Lake + humidity)   # or add_recipe() or add_variables()

reg_fit <- fit(reg_wflow, data = chi_train)
reg_fit
```

```
## ══ Workflow [trained] ════════════════════════════════════════════════════════
## Preprocessor: Formula
## Model: linear_reg()
## 
## ── Preprocessor ──────────────────────────────────────────────────────────────
## ridership ~ Clark_Lake + humidity
## 
## ── Model ─────────────────────────────────────────────────────────────────────
## 
## Call:
## stats::lm(formula = ..y ~ ., data = data)
## 
## Coefficients:
## (Intercept)   Clark_Lake     humidity  
##    1.764496     0.883693    -0.002497
```

---

# Swapping models

```r
stan_wflow <- 
  reg_wflow %>% 
  update_model(spec_stan)

set.seed(21)
stan_fit <- fit(stan_wflow, data = chi_train)
stan_fit
```

```
## ══ Workflow [trained] ════════════════════════════════════════════════════════
## Preprocessor: Formula
## Model: linear_reg()
## 
## ── Preprocessor ──────────────────────────────────────────────────────────────
## ridership ~ Clark_Lake + humidity
## 
## ── Model ─────────────────────────────────────────────────────────────────────
## stan_glm
##  family:       gaussian [identity]
##  formula:      ..y ~ .
##  observations: 5684
##  predictors:   3
## ------
##             Median MAD_SD
## (Intercept) 1.8    0.2   
## Clark_Lake  0.9    0.0   
## humidity    0.0    0.0   
## 
## Auxiliary parameter(s):
##       Median MAD_SD
## sigma 3.1    0.0   
## 
## ------
## * For help interpreting the printed output see ?print.stanreg
## * For info on the priors used see ?prior_summary.stanreg
```

---

# Workflows

Once the first model is fit, the preprocessor (i.e. the formula) is processed and the model matrix is formed. New models don't need to repeat those computations.

Some other nice features:

* Workflows are smarter with data than `model.matrix()` in terms of new factor levels.

* Other preprocessors can be used: recipes and `dplyr::select()` statements (that do no data processing).

* As will be seen later, they can help organize your work when a sequence of models is used.

* A workflow captures the entire modeling process (mentioned earlier), and a simple `fit()` and `predict()` sequence is used for all of the estimation parts.

---

# Using workflows to predict

```r
# generate some bogus data (instead of using the training or test sets)
set.seed(3)
shuffled_data <- map_dfc(Chicago, ~ sample(.x, size = 10))

predict(stan_fit, shuffled_data) %>% slice(1:3)
```

```
## # A tibble: 3 × 1
##   .pred
##   <dbl>
## 1  15.9
## 2  18.5
## 3  17.4
```

```r
predict(stan_fit, shuffled_data, type = "pred_int") %>% slice(1:3)
```

```
## # A tibble: 3 × 2
##   .pred_lower .pred_upper
##         <dbl>       <dbl>
## 1        10.2        22.0
## 2        12.6        24.6
## 3        11.3        23.4
```

---

# The tidymodels prediction guarantee!

* The predictions will always be inside a **tibble**.

* The column names and types are **unsurprising**.

* The number of rows in `new_data` and the output **are the same**.

This enables the use of `bind_cols()` to combine the original data and the predictions.

---

# Evaluating models

tidymodels has a [lot of performance metrics](https://yardstick.tidymodels.org/reference/index.html) for different types of models (e.g. binary classification, etc.).
Each takes a tibble as an input along with the observed and predicted column names:

```r
pred_results <- 
  predict(stan_fit, shuffled_data) %>% 
  bind_cols(shuffled_data)

# Data was randomized; these results should be bad
pred_results %>% rmse(truth = ridership, estimate = .pred)
```

```
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard        6.17
```

---

# Multiple metrics/KPIs

A _metric set_ can bundle multiple statistics:

```r
reg_metrics <- metric_set(rmse, rsq, mae, ccc)

# A tidy format of the results
pred_results %>% reg_metrics(truth = ridership, estimate = .pred)
```

```
## # A tibble: 4 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard       6.17 
## 2 rsq     standard       0.304
## 3 mae     standard       4.36 
## 4 ccc     standard       0.434
```

---

# broom methods

.pkg[parsnip] and .pkg[workflows] fits have corresponding .pkg[broom] tidiers:

```r
glance(reg_fit)
```

```
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic p.value    df  logLik    AIC    BIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>   <dbl>  <dbl>  <dbl>
## 1     0.780         0.780  3.07    10094.       0     2 -14448. 28905. 28931.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
```

```r
tidy(reg_fit)
```

```
## # A tibble: 3 × 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)  1.76      0.222       7.94  2.46e-15
## 2 Clark_Lake   0.884     0.00622   142.    0       
## 3 humidity    -0.00250   0.00295    -0.847 3.97e- 1
```

---

# broom methods

For `augment()`, we also supply the data to predict on; the predictions are attached as a new column:

```r
augment(reg_fit, shuffled_data %>% select(Clark_Lake, humidity, ridership))
```

```
## # A tibble: 10 × 4
##    Clark_Lake humidity ridership .pred
##         <dbl>    <dbl>     <dbl> <dbl>
##  1      16.3      78        6.36 15.9 
##  2      19.2      70       19.3  18.5 
##  3      17.8      45       15.7  17.4 
##  4      19.9      50        4.31 19.3 
##  5      18.7      61       18.7  18.1 
##  6       2.13     54.5      8.44  3.51
##  7       4.56     43        1.74  5.69
##  8      17.1      77.5     14.7  16.6 
##  9      19.4      59       14.6  18.7 
## 10      16.7      39       15.3  16.5
```

---

# Hands-On: Fit a model

Go to the lab and try to fit some models. The labs include skeleton code. Try to mix it up with different predictors or models.
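If you want a starting point, here is one possible sketch (the predictor choices are arbitrary examples, not prescribed by the lab):

```r
library(tidymodels)

# Swap in any predictors or model specification you like
lab_wflow <- 
  workflow() %>% 
  add_model(linear_reg() %>% set_engine("lm")) %>% 
  add_formula(ridership ~ Clark_Lake + Austin + humidity)

lab_fit <- fit(lab_wflow, data = chi_train)

# Evaluate on the held-out test weeks
predict(lab_fit, chi_test) %>% 
  bind_cols(chi_test) %>% 
  rmse(truth = ridership, estimate = .pred)
```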