```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, eval = FALSE)
```
This document will serve as a place where you can run some of the code yourself to get a feel for the tidymodels code. I highly encourage you to write notes in this document and take it with you for later use.
```{r packages}
library(tidymodels)
library(elevators)
```
## Introduction
We will be working on the `elevators` data set today. I would like you to take a look at this data set. {dplyr} and {ggplot2} comes loaded with {tidymodels} for you to be able to explore data.
The goal of this section is for you to get familiar with the `elevators` data set.
```{r}
elevators
```
## Models
This section will let you get some training fitting models. Once you know the general structure in tidymodels, then using different models comes with much less friction.
Validation splitting:
```{r split}
number_extractor <- function(x) {
x <- stringr::str_extract(x, "[0-9]+")
x <- as.integer(x)
x[x > 100] <- NA
x
}
elevators_cleaned <- elevators %>%
mutate(speed_fpm = log(speed_fpm + 0.5),
floor_from = number_extractor(floor_from),
floor_to = number_extractor(floor_to),
travel_distance = number_extractor(travel_distance)) %>%
select(-device_number, -bin, -tax_block, -tax_lot, -house_number,
-street_name, -zip_code)
set.seed(1234)
elevators_split <- initial_split(elevators_cleaned)
elevators_train <- training(elevators_split)
elevators_test <- testing(elevators_split)
```
A linear regression specification using `linear_reg()`. Note that we are using the `"lm"` engine to specify we want the `lm()` function to do the calculations.
```{r}
spec_lm <- linear_reg() %>%
set_engine("lm") %>%
set_mode("regression")
```
then we fit the model like normal
```{r}
fit_lm <- fit(
spec_lm,
speed_fpm ~ capacity_lbs + floor_to,
data = elevators_train
)
fit_lm
```
Moving forward we will be using `workflow()` to construct the models. This will perform the
```{r}
reg_wflow <-
workflow() %>% # attached with the tidymodels package
add_model(spec_lm) %>%
add_formula(speed_fpm ~ capacity_lbs + floor_to) # or add_recipe() or add_variables()
reg_fit <- fit(reg_wflow, data = elevators_train)
reg_fit
```
Try swapping the model we use. You could also use a random forest with `rand_forest()`, or any of the other regression models available in [parsnip or addon packages](https://www.tidymodels.org/find/parsnip/).
```{r}
spec_tree <- decision_tree() %>%
set_engine("rpart") %>%
set_mode("regression")
tree_wflow <-
reg_wflow %>%
update_model(spec_tree)
set.seed(21)
tree_fit <- fit(tree_wflow, data = elevators_train)
tree_fit
```
Once you have a model you can `predict()` with it. `augment()` is a personal favorite of mine since it attaches the predictions to the data set.
```{r}
predict(reg_fit, elevators_train)
augment(reg_fit, elevators_train)
```
## Features
This section gives you a basic recipe. You can run this, try some variations as shown in the slides or look for more steps in the [recipes documentation](https://recipes.tidymodels.org/reference/index.html).
```{r}
elevators_rec <-
recipe(speed_fpm ~ ., data = elevators_train) %>%
step_date(approval_date, lastper_insp_date, keep_original_cols = FALSE) %>%
step_impute_mean(all_numeric_predictors()) %>%
step_novel(all_nominal_predictors()) %>%
step_unknown(all_nominal_predictors()) %>%
step_dummy(all_nominal_predictors()) %>%
step_zv(all_predictors()) %>%
step_normalize(all_numeric_predictors()) %>%
step_corr(all_numeric_predictors(), threshold = 0.9)
```
Here we are using the previously specified `spec_lm`. You could swap in any other model specification.
```{r}
elevators_wflow <-
workflow() %>%
add_model(spec_lm) %>%
add_recipe(elevators_rec)
elevators_fit <- elevators_wflow %>% fit(elevators_train)
elevators_fit
```
## Resampling
We can now produce a resampled data set. Note how we are using sliding periods. If your data doesn't have a time compoment you can use `bootstraps()` or `vfold_cv()`. If you have time to spare, try to create resamples using `vfold_cv()` and look at how the estimated performance differs to when we use a sliding window.
```{r}
set.seed(2453)
elevators_folds <- vfold_cv(elevators_train)
```
I want you to see what happens when you set `verbose = TRUE`. It defaults to `FALSE` and you properly want it to be false for many cases.
```{r}
ctrl <- control_resamples(save_pred = TRUE, verbose = TRUE)
elevators_res <-
elevators_wflow %>%
fit_resamples(resamples = elevators_folds, control = ctrl)
elevators_res
```
To obtain the resampling estimates of performance:
```{r}
collect_metrics(elevators_res)
```
To get the holdout predictions:
```{r}
chi_pred <- collect_predictions(elevators_res)
chi_pred %>% slice(1:4)
```
## Tuning
We will now create a new workflow with tuneable parameters. The tunable parameters are specified using the `tune()` function.
```{r}
elevators_rec <-
recipe(speed_fpm ~ ., data = elevators_train) %>%
step_date(approval_date, lastper_insp_date, keep_original_cols = FALSE) %>%
step_impute_mean(all_numeric_predictors()) %>%
step_novel(all_nominal_predictors()) %>%
step_unknown(all_nominal_predictors()) %>%
step_dummy(all_nominal_predictors()) %>%
step_zv(all_predictors()) %>%
step_normalize(all_numeric_predictors()) %>%
step_corr(all_numeric_predictors(), threshold = 0.9)
lm_spec <-
linear_reg(penalty = tune(), mixture = tune()) %>%
set_engine("glmnet")
elevators_wflow <-
workflow() %>%
add_model(lm_spec) %>%
add_recipe(elevators_rec)
```
You can see what tuning parameters are set for tuning using the `parameters()` function.
```{r}
elevators_wflow %>%
extract_parameter_set_dials()
```
We also set up a hypercube of parameters.
```{r}
set.seed(2)
grid <-
elevators_wflow %>%
extract_parameter_set_dials() %>%
grid_regular(levels = c(mixture = 10, penalty = 50))
grid
```
We can not fit this model over the different parameters.
```{r}
ctrl <- control_grid(save_pred = TRUE)
set.seed(9)
elevators_res <-
tune_grid(
elevators_wflow,
resamples = elevators_folds,
grid = grid
)
elevators_res
```
We can visualize the model
```{r}
autoplot(elevators_res, metric = "rmse")
```
and see how well each model is doing
```{r}
collect_metrics(elevators_res)
```
Look at the best performing sets of hyperparameters
```{r}
smallest_rmse <- select_by_pct_loss(
elevators_res,
metric = "rmse",
desc(penalty),
)
smallest_rmse
```
Updating the workflow and final fit
```{r}
elevators_wflow <-
elevators_wflow %>%
finalize_workflow(smallest_rmse)
test_res <-
elevators_wflow %>%
last_fit(split = elevators_split)
test_res
```