In tidymodels, there is the idea that a model-oriented data analysis consists of
a preprocessor, and
a model
The preprocessor might be a simple formula or a sophisticated recipe.
It’s important to consider both of these activities as part of the data analysis process.
Post-model activities should also be included there (e.g. calibration, cut-off optimization, etc.)
(We don’t have those implemented yet)
Basic tidymodels components
A relevant example
Let’s say that we have some highly correlated predictors and we want to reduce the correlation by first applying principal component analysis to the data.
AKA principal component regression (a form of feature extraction)
What do we consider the estimation part of this process?
Is it just the model fit?
Or is it the PCA step plus the model fit?
What’s the difference?
It is easy to think that the model fit is the only estimation step.
There are cases where this could go really wrong:
Poor estimation of performance (by treating the PCA parts as known)
Selection bias in feature selection
Information/data leakage
These problems are exacerbated as the preprocessors increase in complexity and/or effectiveness.
We’ll come back to this at the end of this section
Data Splitting
Always have a separate piece of data that can contradict what you believe
Data splitting and spending
How do we “spend” the data to find an optimal model?
We typically split data into training and test data sets:
Training Set: these data are used to estimate model parameters and to pick the values of the complexity parameter(s) for the model.
Test Set: these data can be used to get an independent assessment of model efficacy. They should not be used during model training (like, at all).
Data splitting and spending
The more data we spend, the better estimates we’ll get (provided the data is accurate).
Given a fixed amount of data:
Too much spent in training won’t allow us to get a good assessment of predictive performance. We may find a model that fits the training data very well, but is not generalizable (overfitting)
Too much spent in testing won’t allow us to get a good assessment of model parameters
Statistically, the best course of action would be to use all the data for model building and use statistical methods to get good estimates of error.
From a non-statistical perspective, many consumers of complex models emphasize the need for an untouched set of samples to evaluate performance.
Mechanics of data splitting
There are a few different ways to do the split: simple random sampling, stratified sampling based on the outcome, by date, or methods that focus on the distribution of the predictors.
For stratification:
classification: this would mean sampling within the classes to preserve the distribution of the outcome in the training and test sets
regression: determine the quartiles of the data set and sample within those artificial groups
For time series, we often use the most recent data as the test set.
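As a minimal sketch, a stratified split with rsample might look like this for the elevators data (the seed, proportion, and stratification column are illustrative choices):

library(tidymodels)

set.seed(1234)  # illustrative seed
elevators_split <- initial_split(elevators, prop = 0.8, strata = speed_fpm)
elevators_train <- training(elevators_split)
elevators_test  <- testing(elevators_split)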
Cleaning the data
We don’t need all the variables, and some are not encoded in a nice manner
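A rough sketch of this cleanup with dplyr (the column choices and recodings are illustrative, not the exact steps used for these slides):

library(dplyr)

elevators <- elevators %>%
  # keep a handful of columns used later in this section
  select(speed_fpm, capacity_lbs, borough, car_buffer_type,
         elevators_per_building, lastper_insp_date) %>%
  # re-encode the borough as a factor
  mutate(borough = factor(borough))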
To fit a model to the elevators data, the model terms must be specified. Historically, there are two main interfaces for doing this.
The formula interface using R formula rules to specify a symbolic representation of the terms:
Variables + interactions
# day_of_week is not in the data set but
# day_of_week = lubridate::wday(lastper_insp_date, label = TRUE)
model_fn(
  speed_fpm ~ day_of_week + car_buffer_type + day_of_week:car_buffer_type,
  data = elevators_train
)
Shorthand for all predictors
model_fn(speed_fpm ~ ., data = elevators_train)
Inline functions / transformations
model_fn(log10(speed_fpm) ~ ns(capacity_lbs, df = 3) + ., data = elevators_train)
Downsides to formulas
You can’t nest in-line functions such as model_fn(y ~ pca(scale(x1), scale(x2), scale(x3)), data = dat).
All the model matrix calculations happen at once and can’t be recycled when used in a model function.
There are limited roles that variables can take, which has led to several re-implementations of formulas.
Specifying multivariate outcomes is clunky and inelegant.
Not all modeling functions have a formula method (consistency!).
Specifying models without formulas
Some modeling functions have a non-formula (XY) interface. This usually has arguments for the predictors and the outcome(s):
# Usually, the variables must all be numeric
pre_vars <- c("capacity_lbs", "elevators_per_building")
model_fn(
  x = elevators_train[, pre_vars],
  y = elevators_train$speed_fpm
)
This is inconvenient if you have transformations, factor variables, interactions, or any other operations to apply to the data prior to modeling.
Overall, it is difficult to predict if a package has one or both of these interfaces. For example, lm only has formulas.
There is a third interface, using recipes, that will be discussed later and that solves some of these issues.
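As a preview, a recipe for the PCA example above might look like this (the steps and number of components are illustrative):

library(recipes)

pca_rec <- recipe(speed_fpm ~ ., data = elevators_train) %>%
  step_log(speed_fpm, base = 10) %>%                # inline-style transformation
  step_dummy(all_nominal_predictors()) %>%          # factors -> indicator columns
  step_normalize(all_numeric_predictors()) %>%      # center/scale before PCA
  step_pca(all_numeric_predictors(), num_comp = 3)  # feature extraction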
A linear regression model
Let’s start by fitting an ordinary linear regression model to the training set. You can choose the model terms for your model, but I will use a very simple model:
simple_lm <- lm(speed_fpm ~ borough + capacity_lbs, data = elevators_train)
Before looking at coefficients, we should do some model checking to see if there is anything obviously wrong with the model.
To get the statistics on the individual data points, we will use the awesome broom package:
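For example, a minimal use of broom with the lm fit above might be:

library(broom)
library(dplyr)

# one row per training set point: fitted values, residuals, leverage, etc.
augment(simple_lm) %>% slice(1:3)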
Note that, for both of these fits, some of the computations are repeated.
For example, the formula method does a fair amount of work to figure out how to turn the data frame into a matrix of predictors.
When there are special effects (e.g. splines), dummy variables, interactions, or other components, the formula/terms objects have to keep track of everything.
In cases where there are a lot of predictors, these computations can consume a lot of resources. If we can save them, that would be helpful.
The answer is a workflow object. These bundle together a preprocessor (such as a formula) along with a model.
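The reg_wflow and spec_stan objects used below are assumed to have been created along these lines; this is a plausible sketch, with the formula taken from the fitted output shown later:

spec_lm   <- linear_reg()                          # default engine: "lm"
spec_stan <- linear_reg() %>% set_engine("stan")   # Bayesian fit via rstanarm

reg_wflow <- workflow() %>%
  add_formula(speed_fpm ~ borough + capacity_lbs) %>%
  add_model(spec_lm)

reg_fit <- fit(reg_wflow, data = elevators_train)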
A modeling workflow
We can optionally bundle the preprocessor and model together into a workflow:
stan_wflow <- reg_wflow %>% update_model(spec_stan)

set.seed(21)
stan_fit <- fit(stan_wflow, data = elevators_train)
stan_fit
#> ══ Workflow [trained] ══════════════════════════════════════════════════════════
#> Preprocessor: Formula
#> Model: linear_reg()
#> 
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> speed_fpm ~ borough + capacity_lbs
#> 
#> ── Model ───────────────────────────────────────────────────────────────────────
#> stan_glm
#>  family:       gaussian [identity]
#>  formula:      ..y ~ .
#>  observations: 26112
#>  predictors:   6
#> ------
#>                        Median MAD_SD
#> (Intercept)             4.9   0.0
#> boroughBrooklyn         0.0   0.0
#> boroughManhattan        0.6   0.0
#> boroughQueens           0.0   0.0
#> `boroughStaten Island` -0.2   0.0
#> capacity_lbs            0.0   0.0
#> 
#> Auxiliary parameter(s):
#>       Median MAD_SD
#> sigma 0.7    0.0
#> 
#> ------
#> * For help interpreting the printed output see ?print.stanreg
#> * For info on the priors used see ?prior_summary.stanreg
Workflows
Once the first model is fit, the preprocessor (i.e. the formula) is processed and the model matrix is formed.
New models don’t need to repeat those computations.
Some other nice features:
Workflows are smarter with data than model.matrix() in terms of new factor levels.
Other preprocessors can be used: recipes and dplyr::select() statements (that do no data processing).
As will be seen later, they can help organize your work when a sequence of models is used.
A workflow captures the entire modeling process (mentioned earlier), and a simple fit() and predict() sequence is used for all of the estimation parts.
Using workflows to predict
# generate some bogus data (instead of using the training or test sets)
set.seed(3)
shuffled_data <- map_dfc(elevators, ~ sample(.x, size = 10))

predict(stan_fit, shuffled_data) %>% slice(1:3)
#> # A tibble: 3 × 1
#>   .pred
#>   <dbl>
#> 1  4.93
#> 2  4.93
#> 3  4.88

predict(stan_fit, shuffled_data, type = "pred_int") %>% slice(1:3)
#> # A tibble: 3 × 2
#>   .pred_lower .pred_upper
#>         <dbl>       <dbl>
#> 1        3.48        6.45
#> 2        3.52        6.31
#> 3        3.40        6.30
The tidymodels prediction guarantee!
The predictions will always be inside a tibble.
The column names and types are unsurprising.
The number of rows in new_data and the output are the same.
This enables the use of bind_cols() to combine the original data and the predictions.
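For example, predictions can be glued back onto the data they were made from:

predict(stan_fit, shuffled_data) %>%
  bind_cols(shuffled_data)   # same number of rows, so this is safe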
Evaluating models
tidymodels has a lot of performance metrics for different types of models (e.g. binary classification, etc.).
Each takes a tibble as an input along with the observed and predicted column names:
pred_results <- augment(stan_fit, shuffled_data)

# Data was randomized; these results should be bad
pred_results %>% rmse(truth = speed_fpm, estimate = .pred)
#> # A tibble: 1 × 3
#>   .metric .estimator .estimate
#>   <chr>   <chr>          <dbl>
#> 1 rmse    standard        345.
Multiple metrics/KPIs
A metric set can bundle multiple statistics:
reg_metrics <- metric_set(rmse, rsq, mae, ccc)

# A tidy format of the results
pred_results %>% reg_metrics(truth = speed_fpm, estimate = .pred)
#> # A tibble: 4 × 3
#>   .metric .estimator   .estimate
#>   <chr>   <chr>            <dbl>
#> 1 rmse    standard     345.
#> 2 rsq     standard       0.0844
#> 3 mae     standard     290.
#> 4 ccc     standard      -0.000280
broom methods
parsnip and workflow fits have corresponding broom tidiers:
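A minimal sketch using the fits from earlier in this section (the tidier for the Stan fit may require the broom.mixed package to be installed):

library(broom)

tidy(simple_lm)    # coefficient-level summary of the lm fit
glance(simple_lm)  # one-row, model-level summary
tidy(stan_fit)     # the same idea for the trained workflow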