Download template here
We will be using the nycflights13 package again today. If R prompts you to install a package, or you get a "package not found" error, run install.packages("packagename") once to install it.
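For reference, the code in this lab assumes a setup chunk along these lines (tidymodels supplies rsample, parsnip, recipes, workflows, tune, and yardstick):
# Load the modeling framework and the flights data
library(tidymodels)
library(nycflights13)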
flights1 <- flights %>%
  mutate(delay = factor(arr_delay > 0,
                        levels = c(TRUE, FALSE),
                        labels = c("Delayed", "On time"))) %>%
  filter(month == 1, !is.na(delay)) %>%
  select(delay, hour, minute, dep_delay, carrier, distance, tailnum, origin, dest)
set.seed(1234)
flights_split <- initial_split(flights1)
flights_train <- training(flights_split)
flights_test <- testing(flights_split)
We have already explored this data set quite a bit, so we can move on to some modeling. We will be using the logistic_reg() specification since we are doing binary classification.
lasso_spec <- logistic_reg(mixture = 1, penalty = 0) %>%
  set_mode("classification") %>%
  set_engine("glmnet")
Next, we need to create a recipe for preprocessing. Here we want to include many variables since we will be doing shrinkage to weed out unimportant predictors.
rec_spec <- recipe(delay ~ ., data = flights_train) %>%
  step_novel(all_nominal_predictors()) %>%
  step_unknown(all_nominal_predictors()) %>%
  step_impute_mean(all_numeric_predictors()) %>%
  step_other(all_nominal_predictors(), threshold = 0.001) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_numeric_predictors())
We can then combine these into a workflow. Notice that this is another way we could create a workflow.
lasso_wf <- workflow(rec_spec, lasso_spec)
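For comparison, this is equivalent to the piped construction we have used before:
lasso_wf <- workflow() %>%
  add_recipe(rec_spec) %>%
  add_model(lasso_spec)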
We can now fit the model
fitted_model <- fit(lasso_wf, data = flights_train)
and look at the performance of the model.
fitted_model %>%
  augment(new_data = flights_train) %>%
  conf_mat(truth = delay, estimate = .pred_class)
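If you would rather have a single summary number than a confusion matrix, the same augmented data can be passed to a yardstick metric function such as accuracy() (a quick sketch, not part of the original pipeline):
fitted_model %>%
  augment(new_data = flights_train) %>%
  accuracy(truth = delay, estimate = .pred_class)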
But above we set penalty to a specific value, which isn't ideal since we don't know what the best value would be for our case. So this time when we specify the model, we set penalty = tune() to tell the tuning functions that we want to tune the penalty parameter.
lasso_spec <- logistic_reg(mixture = 1, penalty = tune()) %>%
  set_mode("classification") %>%
  set_engine("glmnet")
and we create a new workflow.
lasso_wf <- workflow(rec_spec, lasso_spec)
To be able to tune hyperparameters with tune_grid() we need three things: a model/workflow specification, folds to fit our model across, and the parameter values we want to try. We already have the workflow, lasso_wf. We create the cross-validation folds with vfold_cv().
set.seed(5678)
flights_folds <- vfold_cv(flights_train)
And the different hyperparameter values with grid_regular()
.
penalty_grid <- grid_regular(penalty(), levels = 50)
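The default range of penalty() covers 10^-10 to 10^0 on a log10 scale. As a sketch, if you suspected the best value lies in a narrower window, you could pass a custom range (the bounds below are an illustrative choice, not from the lab):
# Illustrative narrower grid: penalties from 10^-5 to 10^0
penalty_grid_narrow <- grid_regular(penalty(range = c(-5, 0)), levels = 50)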
Now we can run the training loop.
lasso_results <- tune_grid(
  lasso_wf,
  resamples = flights_folds,
  grid = penalty_grid
)
The results can then be summarized with either collect_metrics()
lasso_results %>%
  collect_metrics()
or autoplot()
. The regularization in this model doesn’t affect the performance too much.
lasso_results %>%
  autoplot()
The best-performing hyperparameter values can be shown with show_best(). Notice how they all perform essentially the same.
lasso_results %>%
  show_best(metric = "roc_auc")
We can pluck out the best hyperparameters
select_best(lasso_results, metric = "roc_auc")
and use them to update our model
lasso_wf_final <- finalize_workflow(
  lasso_wf,
  select_best(lasso_results, metric = "roc_auc")
)
And now we can fit our final model on the training data set.
lasso_wf_final_fit <- fit(lasso_wf_final, data = flights_train)
lasso_wf_final_fit
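To see the shrinkage at work, one option is to pull the fitted coefficients out of the workflow; a sketch along these lines should show which predictors were shrunk to (or toward) zero:
# Extract the parsnip fit and list coefficients, largest magnitude first
lasso_wf_final_fit %>%
  extract_fit_parsnip() %>%
  tidy() %>%
  arrange(desc(abs(estimate)))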
Repeat the above procedure, but with a ridge model instead of a lasso model. You get a ridge model by setting mixture = 0. Compare the performance of both models on the training data set, pick a model, and use that one to evaluate on the testing data set. Explain how you picked your model.
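As a starting point (this is only the model specification, not a full solution), the ridge version would differ only in the mixture argument:
# Ridge: mixture = 0, penalty still tuned
ridge_spec <- logistic_reg(mixture = 0, penalty = tune()) %>%
  set_mode("classification") %>%
  set_engine("glmnet")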