Standard Modeling Pipeline

csv
standard
This analysis is done using tidymodels R alone
Note

This page was last generated on 2024-03-06. If you find the code out of date please file an issue.

Loading packages

We are using the tidymodels package to do the modeling and embed for target encoding.

# install.packages("pak")
# pak::pak("tidymodels", "embed")
library(tidymodels)
── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──
✔ broom        1.0.5           ✔ recipes      1.0.10.9000
✔ dials        1.2.0           ✔ rsample      1.2.0      
✔ dplyr        1.1.4           ✔ tibble       3.2.1      
✔ ggplot2      3.4.4           ✔ tidyr        1.3.1      
✔ infer        1.0.6           ✔ tune         1.1.2.9022 
✔ modeldata    1.3.0           ✔ workflows    1.1.4      
✔ parsnip      1.2.0           ✔ workflowsets 1.0.1      
✔ purrr        1.0.2           ✔ yardstick    1.3.0      
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ purrr::discard() masks scales::discard()
✖ dplyr::filter()  masks stats::filter()
✖ dplyr::lag()     masks stats::lag()
✖ recipes::step()  masks stats::step()
• Use suppressPackageStartupMessages() to eliminate package startup messages
library(embed)

Loading Data

We are using the standard laxflights2022_lite data set described on the data preparation page.

flights <- readr::read_csv(here::here("data/laxflights2022_lite.csv"))

glimpse(flights)
Rows: 3,757
Columns: 8
$ arr_delay <dbl> 4, -15, -12, 38, -9, -17, 5, 12, -40, 6, -7, 28, 25, -9, 180…
$ dep_delay <dbl> 9, -8, 0, -7, 3, 6, 29, -1, 2, 7, 6, 13, 34, -2, 191, 52, 9,…
$ carrier   <chr> "UA", "OO", "AA", "UA", "OO", "OO", "UA", "AA", "DL", "DL", …
$ tailnum   <chr> "N37502", "N198SY", "N410AN", "N77261", "N402SY", "N509SY", …
$ origin    <chr> "LAX", "LAX", "LAX", "LAX", "LAX", "LAX", "LAX", "LAX", "LAX…
$ dest      <chr> "KOA", "EUG", "HNL", "DEN", "FAT", "SFO", "MCO", "MIA", "OGG…
$ distance  <dbl> 2504, 748, 2556, 862, 209, 337, 2218, 2342, 2486, 862, 156, …
$ time      <dttm> 2022-01-01 13:15:00, 2022-01-01 14:00:00, 2022-01-01 14:45:…

Modeling

As a reminder, the modeling task we are trying to accomplish is the following:

Given all the information we have, from the moment the plane leaves for departure. Can we predict the arrival delay arr_delay?

Our outcome is arr_delay and the remaining variables are predictors. We will be fitting an xgboost model as a regression model.

Splitting Data

Since the data set is already in chronological order, we can create a time split of the data using initial_time_split(), this will put the first 75% of the data into the training data set, and the remaining 25% into the testing data set.

set.seed(1234)

flights_split <- initial_time_split(flights, prop = 3/4)
flights_training <- training(flights_split)

Since we are doing hyperparameter tuning, we will also be creating a cross-validation split

flights_folds <- vfold_cv(flights_training)

Feature Engineering

We need to do a couple of things to make this data set work for our model. The datetime variable time needs to be transformed, as does the categorical variables carrier, tailnum, origin and dest.

From the time variable, the month and day of the week are extracted as categorical variables, then the day of year and time of day are extracted as numerics. The origin and dest variables will be turned into dummy variables, and carrier, tailnum, time_month, and time_dow will be converted to numerics with likelihood encoding.

flights_rec <- recipe(arr_delay ~ ., data = flights_training) %>%
  step_novel(all_nominal_predictors()) %>%
  step_other(origin, dest, threshold = 0.025) %>%
  step_dummy(origin, dest) %>%
  step_date(time, 
            features = c("month", "dow", "doy"), 
            label = TRUE, 
            keep_original_cols = TRUE) %>%
  step_time(time, features = "decimal_day", keep_original_cols = FALSE) %>%
  step_lencode_mixed(all_nominal_predictors(), outcome = vars(arr_delay)) %>%
  step_zv(all_predictors())

Specifying Models

We will be fitting a boosted tree model in the form of a xgboost model.

xgb_spec <-
  boost_tree(
    trees = tune(),
    min_n = tune(),
    mtry = tune(),
    learn_rate = 0.01
  ) %>%
  set_engine("xgboost") %>%
  set_mode("regression")
xgb_wf <- workflow(flights_rec, xgb_spec)

Hyperparameter Tuning

doParallel::registerDoParallel()

xgb_rs <- tune_grid(
  xgb_wf,
  resamples = flights_folds,
  grid = 10
)
i Creating pre-processing data to finalize unknown parameter: mtry

We can visualize the performance of the different hyperparameter selections

autoplot(xgb_rs)

and look at the top result

show_best(xgb_rs, metric = "rmse")
# A tibble: 5 × 9
   mtry trees min_n .metric .estimator  mean     n std_err .config              
  <int> <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
1     3  1988     4 rmse    standard    28.1    10    5.88 Preprocessor1_Model01
2     8   849    13 rmse    standard    29.6    10    6.44 Preprocessor1_Model04
3     3  1543     9 rmse    standard    29.6    10    6.11 Preprocessor1_Model02
4    10  1139    14 rmse    standard    30.0    10    6.44 Preprocessor1_Model05
5    12   554    18 rmse    standard    30.6    10    6.77 Preprocessor1_Model06

Fitting Final Model

Once we are satisfied with the modeling that has been done, we can fit our final model. We use finalize_workflow() to use the best hyperparameters, and last_fit() to fit the model to the training data set and evaluate it on the testing data set.

xgb_last <- xgb_wf %>%
  finalize_workflow(select_best(xgb_rs, metric = "rmse")) %>%
  last_fit(flights_split)

Results

Once we have the final model, we can do all kinds of analyses. Below we have a truth version prediction plot to showcase how well our model works

xgb_last %>%
  augment() %>%
  ggplot(aes(arr_delay, .pred)) +
  geom_abline(intercept = 0, slope = 1, color = "blue") +
  geom_point(alpha = 0.25) +
  theme_minimal()

Truth against prediction plot. The model has a hard time with overly long delays.