Data Preparation

This page goes through how we generate the data set laxflights2022 that is used throughout this project.

Loading packages

We load tidyverse for general data manipulation and anyflights to download the data.

library(tidyverse)
library(anyflights)

Downloading the data

The data set we are using can be downloaded with a single call to get_flights()

laxflights2022raw <- get_flights("LAX", year = 2022, months = 1:12)

To avoid repeated downloads, we save the data set right after we download it. Can be found at data-raw/laxflights2022raw.csv.

write_csv(laxflights2022raw, "data-raw/laxflights2022raw.csv")

Bad internet connection

If you are having issues with downloading this data all at once then you can split up the download into smaller chunks like so:

laxflights1 <- get_flights("LAX", 2022, 1:6)
laxflights2 <- get_flights("LAX", 2022, 7:12)

laxflights2022raw <- dplyr::bind_rows(
  laxflights1,
  laxflights2
)

Cleaning the data

We will do two things to this data set:

remove redundant variables
remove variables that won’t be available at the prediction time

When looking at the data

glimpse(laxflights2022raw)

Rows: 191,156
Columns: 19
$ year           <dbl> 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2…
$ month          <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ day            <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ dep_time       <dbl> 7, 14, 15, 23, 25, 31, 34, 109, 119, 122, 135, 139, 224…
$ sched_dep_time <dbl> 2359, 2343, 2315, 30, 2259, 2312, 25, 59, 55, 50, 2340,…
$ dep_delay      <dbl> 8, 31, 60, -7, 86, 79, 9, 10, 24, 32, 115, 9, 172, -2, …
$ arr_time       <dbl> 459, 608, 733, 742, 745, 745, 611, 608, 643, 608, 1000,…
$ sched_arr_time <dbl> 511, 540, 647, 820, 631, 636, 631, 615, 633, 552, 811, …
$ arr_delay      <dbl> -12, 28, 46, -38, 74, 69, -20, -7, 10, 16, 109, -12, 12…
$ carrier        <chr> "UA", "AA", "NK", "AA", "NK", "NK", "UA", "NK", "DL", "…
$ flight         <dbl> 468, 359, 44, 177, 185, 205, 600, 122, 317, 33, 276, 59…
$ tailnum        <chr> "N57864", "N919NN", "N949NK", "N812AA", "N903NK", "N509…
$ origin         <chr> "LAX", "LAX", "LAX", "LAX", "LAX", "LAX", "LAX", "LAX",…
$ dest           <chr> "IAH", "BNA", "CLE", "PHL", "PIT", "DTW", "ORD", "IAH",…
$ air_time       <dbl> 153, 203, 234, 238, 240, 231, 187, 160, 177, 143, 298, …
$ distance       <dbl> 1379, 1797, 2052, 2402, 2136, 1979, 1744, 1379, 1535, 1…
$ hour           <dbl> 23, 23, 23, 0, 22, 23, 0, 0, 0, 0, 23, 1, 23, 5, 5, 5, …
$ minute         <dbl> 59, 43, 15, 30, 59, 12, 25, 59, 55, 50, 40, 30, 32, 10,…
$ time_hour      <dttm> 2022-01-01 23:00:00, 2022-01-01 23:00:00, 2022-01-01 2…

the first thing we notice is that time_hour encodes almost the same information as year, month, day, hour and minute. With minute being the only difference. Let us deal with this by adding the minutes to time_hour and only keeping that variable

laxflights2022 <- laxflights2022raw |>
  mutate(time = time_hour + minutes(minute)) |>
  select(-c(year, month, day, hour, minute, time_hour))

glimpse(laxflights2022)

Rows: 191,156
Columns: 14
$ dep_time       <dbl> 7, 14, 15, 23, 25, 31, 34, 109, 119, 122, 135, 139, 224…
$ sched_dep_time <dbl> 2359, 2343, 2315, 30, 2259, 2312, 25, 59, 55, 50, 2340,…
$ dep_delay      <dbl> 8, 31, 60, -7, 86, 79, 9, 10, 24, 32, 115, 9, 172, -2, …
$ arr_time       <dbl> 459, 608, 733, 742, 745, 745, 611, 608, 643, 608, 1000,…
$ sched_arr_time <dbl> 511, 540, 647, 820, 631, 636, 631, 615, 633, 552, 811, …
$ arr_delay      <dbl> -12, 28, 46, -38, 74, 69, -20, -7, 10, 16, 109, -12, 12…
$ carrier        <chr> "UA", "AA", "NK", "AA", "NK", "NK", "UA", "NK", "DL", "…
$ flight         <dbl> 468, 359, 44, 177, 185, 205, 600, 122, 317, 33, 276, 59…
$ tailnum        <chr> "N57864", "N919NN", "N949NK", "N812AA", "N903NK", "N509…
$ origin         <chr> "LAX", "LAX", "LAX", "LAX", "LAX", "LAX", "LAX", "LAX",…
$ dest           <chr> "IAH", "BNA", "CLE", "PHL", "PIT", "DTW", "ORD", "IAH",…
$ air_time       <dbl> 153, 203, 234, 238, 240, 231, 187, 160, 177, 143, 298, …
$ distance       <dbl> 1379, 1797, 2052, 2402, 2136, 1979, 1744, 1379, 1535, 1…
$ time           <dttm> 2022-01-01 23:59:00, 2022-01-01 23:43:00, 2022-01-01 2…

Next, we see that the combination of dep_time, sched_dep_time and dep_delay is linearly dependent as dep_time - sched_dep_time = dep_delay, so we can remove sched_dep_time and dep_time without losing any information since we also have sched_dep_time as a function of time.

laxflights2022 <- laxflights2022 |>
  select(-c(sched_dep_time, dep_time))

glimpse(laxflights2022)

Rows: 191,156
Columns: 12
$ dep_delay      <dbl> 8, 31, 60, -7, 86, 79, 9, 10, 24, 32, 115, 9, 172, -2, …
$ arr_time       <dbl> 459, 608, 733, 742, 745, 745, 611, 608, 643, 608, 1000,…
$ sched_arr_time <dbl> 511, 540, 647, 820, 631, 636, 631, 615, 633, 552, 811, …
$ arr_delay      <dbl> -12, 28, 46, -38, 74, 69, -20, -7, 10, 16, 109, -12, 12…
$ carrier        <chr> "UA", "AA", "NK", "AA", "NK", "NK", "UA", "NK", "DL", "…
$ flight         <dbl> 468, 359, 44, 177, 185, 205, 600, 122, 317, 33, 276, 59…
$ tailnum        <chr> "N57864", "N919NN", "N949NK", "N812AA", "N903NK", "N509…
$ origin         <chr> "LAX", "LAX", "LAX", "LAX", "LAX", "LAX", "LAX", "LAX",…
$ dest           <chr> "IAH", "BNA", "CLE", "PHL", "PIT", "DTW", "ORD", "IAH",…
$ air_time       <dbl> 153, 203, 234, 238, 240, 231, 187, 160, 177, 143, 298, …
$ distance       <dbl> 1379, 1797, 2052, 2402, 2136, 1979, 1744, 1379, 1535, 1…
$ time           <dttm> 2022-01-01 23:59:00, 2022-01-01 23:43:00, 2022-01-01 2…

On a similar note, since we are trying to predict arr_delay, we can’t have arr_time or air_time in the data set as they allow us to figure out arr_delay from dep_time. With some loss of information, we will also remove sched_arr_time to make the modeling a little easier.

laxflights2022 <- laxflights2022 |>
  select(-c(arr_time, air_time, sched_arr_time))

glimpse(laxflights2022)

Rows: 191,156
Columns: 9
$ dep_delay <dbl> 8, 31, 60, -7, 86, 79, 9, 10, 24, 32, 115, 9, 172, -2, 16, -…
$ arr_delay <dbl> -12, 28, 46, -38, 74, 69, -20, -7, 10, 16, 109, -12, 122, -1…
$ carrier   <chr> "UA", "AA", "NK", "AA", "NK", "NK", "UA", "NK", "DL", "NK", …
$ flight    <dbl> 468, 359, 44, 177, 185, 205, 600, 122, 317, 33, 276, 595, 51…
$ tailnum   <chr> "N57864", "N919NN", "N949NK", "N812AA", "N903NK", "N509NK", …
$ origin    <chr> "LAX", "LAX", "LAX", "LAX", "LAX", "LAX", "LAX", "LAX", "LAX…
$ dest      <chr> "IAH", "BNA", "CLE", "PHL", "PIT", "DTW", "ORD", "IAH", "MSP…
$ distance  <dbl> 1379, 1797, 2052, 2402, 2136, 1979, 1744, 1379, 1535, 1235, …
$ time      <dttm> 2022-01-01 23:59:00, 2022-01-01 23:43:00, 2022-01-01 23:15:…

We will also be removing the flight variable to have one less high cardinality variable to work with.

laxflights2022 <- laxflights2022 |>
  select(-flight)

glimpse(laxflights2022)

Rows: 191,156
Columns: 8
$ dep_delay <dbl> 8, 31, 60, -7, 86, 79, 9, 10, 24, 32, 115, 9, 172, -2, 16, -…
$ arr_delay <dbl> -12, 28, 46, -38, 74, 69, -20, -7, 10, 16, 109, -12, 122, -1…
$ carrier   <chr> "UA", "AA", "NK", "AA", "NK", "NK", "UA", "NK", "DL", "NK", …
$ tailnum   <chr> "N57864", "N919NN", "N949NK", "N812AA", "N903NK", "N509NK", …
$ origin    <chr> "LAX", "LAX", "LAX", "LAX", "LAX", "LAX", "LAX", "LAX", "LAX…
$ dest      <chr> "IAH", "BNA", "CLE", "PHL", "PIT", "DTW", "ORD", "IAH", "MSP…
$ distance  <dbl> 1379, 1797, 2052, 2402, 2136, 1979, 1744, 1379, 1535, 1235, …
$ time      <dttm> 2022-01-01 23:59:00, 2022-01-01 23:43:00, 2022-01-01 23:15:…

Since we are trying to model arr_delay we exclude all the flights where that information is missing.

laxflights2022 <- laxflights2022 |>
  drop_na(arr_delay)

glimpse(laxflights2022)

Rows: 187,868
Columns: 8
$ dep_delay <dbl> 8, 31, 60, -7, 86, 79, 9, 10, 24, 32, 115, 9, 172, -2, 16, -…
$ arr_delay <dbl> -12, 28, 46, -38, 74, 69, -20, -7, 10, 16, 109, -12, 122, -1…
$ carrier   <chr> "UA", "AA", "NK", "AA", "NK", "NK", "UA", "NK", "DL", "NK", …
$ tailnum   <chr> "N57864", "N919NN", "N949NK", "N812AA", "N903NK", "N509NK", …
$ origin    <chr> "LAX", "LAX", "LAX", "LAX", "LAX", "LAX", "LAX", "LAX", "LAX…
$ dest      <chr> "IAH", "BNA", "CLE", "PHL", "PIT", "DTW", "ORD", "IAH", "MSP…
$ distance  <dbl> 1379, 1797, 2052, 2402, 2136, 1979, 1744, 1379, 1535, 1235, …
$ time      <dttm> 2022-01-01 23:59:00, 2022-01-01 23:43:00, 2022-01-01 23:15:…

Lastly, the outcome arr_delay has been moved to the beginning of the data set as it aids in investigating the data.

laxflights2022 <- laxflights2022 |>
  relocate(arr_delay)

glimpse(laxflights2022)

Rows: 187,868
Columns: 8
$ arr_delay <dbl> -12, 28, 46, -38, 74, 69, -20, -7, 10, 16, 109, -12, 122, -1…
$ dep_delay <dbl> 8, 31, 60, -7, 86, 79, 9, 10, 24, 32, 115, 9, 172, -2, 16, -…
$ carrier   <chr> "UA", "AA", "NK", "AA", "NK", "NK", "UA", "NK", "DL", "NK", …
$ tailnum   <chr> "N57864", "N919NN", "N949NK", "N812AA", "N903NK", "N509NK", …
$ origin    <chr> "LAX", "LAX", "LAX", "LAX", "LAX", "LAX", "LAX", "LAX", "LAX…
$ dest      <chr> "IAH", "BNA", "CLE", "PHL", "PIT", "DTW", "ORD", "IAH", "MSP…
$ distance  <dbl> 1379, 1797, 2052, 2402, 2136, 1979, 1744, 1379, 1535, 1235, …
$ time      <dttm> 2022-01-01 23:59:00, 2022-01-01 23:43:00, 2022-01-01 23:15:…

Saving the data

The full data set is now ready and is saved as data/laxflights2022.csv.

write_csv(laxflights2022, "data/laxflights2022.csv")

A smaller version of the data set is also created and is saved as data/laxflights2022_lite.csv.

set.seed(1234)
laxflights2022_lite <- laxflights2022 |>
  slice_sample(prop = 0.02) %>%
  arrange(time)
write_csv(laxflights2022_lite, "data/laxflights2022_lite.csv")

Data dictionary

arr_delay: Arrival delays, in minutes. Negative times represent early arrivals.
dep_delay: Departure delays, in minutes. Negative times represent early departures.
carrier: Two letter carrier abbreviation.
tailnum: Plane tail number.
origin: FAA airport code for origin of flight.
dest: FAA airport code for destination of flight.
distance: Distance between airports, in miles.
time: Scheduled time of the flight as a POSIXct date, rounded to the nearest minute.