tidymodels pipelines

The goal of this project is to show how tidymodels can be used with different services, cloud providers and techniques.

Each pipeline shows how tidymodels can be used with another software solution. All the pipelines are expanded versions of the “standard pipeline”, which uses only R and tidymodels.

All other pipelines can be found in the pipelines list.

The modeling problem we are creating solutions for is stated as follows:

Given all the information we have at the moment the plane departs, can we predict the arrival delay arr_delay?
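As a rough illustration of what a solution to this problem can look like, the sketch below fits a baseline linear regression with tidymodels and reports its error on held-out data. The helper name fit_baseline() is hypothetical, and the choice of dep_delay and distance as predictors is an assumption made for the example, not necessarily what the standard pipeline uses.

```r
library(tidymodels)

# Hypothetical helper (not part of this project): fits a baseline linear
# regression for arr_delay and returns its RMSE on a held-out test set.
fit_baseline <- function(flights, prop = 0.75) {
  # Hold out part of the data for evaluation
  flights_split <- initial_split(flights, prop = prop)

  # Assumed predictors: both are known once the plane has departed
  wf <- workflow() |>
    add_formula(arr_delay ~ dep_delay + distance) |>
    add_model(linear_reg())

  # Fit on the training portion only
  wf_fit <- fit(wf, data = training(flights_split))

  # Predict on the test portion and compute the root mean squared error
  augment(wf_fit, new_data = testing(flights_split)) |>
    rmse(truth = arr_delay, estimate = .pred)
}
```

Calling fit_baseline() on a data frame that has the arr_delay, dep_delay and distance columns, such as the flights data described under Data source, should return a one-row tibble with the test-set RMSE. The split is random, so call set.seed() first if the result needs to be reproducible.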

Data source

The pipelines in this project largely use the same data set. This is done to move focus away from the modeling problem and towards how tidymodels can be used with other software.

The data set is generated using the anyflights package. The standard data set includes all flights departing from LAX in 2022.

library(readr)

laxflights2022 <- read_csv("data/laxflights2022.csv", show_col_types = FALSE)

dplyr::glimpse(laxflights2022)
Rows: 187,868
Columns: 8
$ arr_delay <dbl> -12, 28, 46, -38, 74, 69, -20, -7, 10, 16, 109, -12, 122, -1…
$ dep_delay <dbl> 8, 31, 60, -7, 86, 79, 9, 10, 24, 32, 115, 9, 172, -2, 16, -…
$ carrier   <chr> "UA", "AA", "NK", "AA", "NK", "NK", "UA", "NK", "DL", "NK", …
$ tailnum   <chr> "N57864", "N919NN", "N949NK", "N812AA", "N903NK", "N509NK", …
$ origin    <chr> "LAX", "LAX", "LAX", "LAX", "LAX", "LAX", "LAX", "LAX", "LAX…
$ dest      <chr> "IAH", "BNA", "CLE", "PHL", "PIT", "DTW", "ORD", "IAH", "MSP…
$ distance  <dbl> 1379, 1797, 2052, 2402, 2136, 1979, 1744, 1379, 1535, 1235, …
$ time      <dttm> 2022-01-01 23:59:00, 2022-01-01 23:43:00, 2022-01-01 23:15:…

Why this data set?

  • Poses a realistic enough modeling problem
  • More data can be fetched to showcase larger data problems
  • Data from more airports can be used together to showcase a “many models” approach
  • US government data, so the data license is friendly