library(tidyverse)
library(anyflights)
Data Preparation
This page goes through how we generate the data set laxflights2022 that is used throughout this project.
Loading packages
We load tidyverse for general data manipulation and anyflights to download the data.
Downloading the data
The data set we are using can be downloaded with a single call to get_flights()
<- get_flights("LAX", year = 2022, months = 1:12) laxflights2022raw
To avoid repeated downloads, we save the data set right after we download it. Can be found at data-raw/laxflights2022raw.csv.
write_csv(laxflights2022raw, "data-raw/laxflights2022raw.csv")
If you are having issues with downloading this data all at once then you can split up the download into smaller chunks like so:
<- get_flights("LAX", 2022, 1:6)
laxflights1 <- get_flights("LAX", 2022, 7:12)
laxflights2
<- dplyr::bind_rows(
laxflights2022raw
laxflights1,
laxflights2 )
Cleaning the data
We will do two things to this data set:
- remove redundant variables
- remove variables that won’t be available at the prediction time
When looking at the data
glimpse(laxflights2022raw)
Rows: 191,156
Columns: 19
$ year <dbl> 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2…
$ month <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ day <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ dep_time <dbl> 7, 14, 15, 23, 25, 31, 34, 109, 119, 122, 135, 139, 224…
$ sched_dep_time <dbl> 2359, 2343, 2315, 30, 2259, 2312, 25, 59, 55, 50, 2340,…
$ dep_delay <dbl> 8, 31, 60, -7, 86, 79, 9, 10, 24, 32, 115, 9, 172, -2, …
$ arr_time <dbl> 459, 608, 733, 742, 745, 745, 611, 608, 643, 608, 1000,…
$ sched_arr_time <dbl> 511, 540, 647, 820, 631, 636, 631, 615, 633, 552, 811, …
$ arr_delay <dbl> -12, 28, 46, -38, 74, 69, -20, -7, 10, 16, 109, -12, 12…
$ carrier <chr> "UA", "AA", "NK", "AA", "NK", "NK", "UA", "NK", "DL", "…
$ flight <dbl> 468, 359, 44, 177, 185, 205, 600, 122, 317, 33, 276, 59…
$ tailnum <chr> "N57864", "N919NN", "N949NK", "N812AA", "N903NK", "N509…
$ origin <chr> "LAX", "LAX", "LAX", "LAX", "LAX", "LAX", "LAX", "LAX",…
$ dest <chr> "IAH", "BNA", "CLE", "PHL", "PIT", "DTW", "ORD", "IAH",…
$ air_time <dbl> 153, 203, 234, 238, 240, 231, 187, 160, 177, 143, 298, …
$ distance <dbl> 1379, 1797, 2052, 2402, 2136, 1979, 1744, 1379, 1535, 1…
$ hour <dbl> 23, 23, 23, 0, 22, 23, 0, 0, 0, 0, 23, 1, 23, 5, 5, 5, …
$ minute <dbl> 59, 43, 15, 30, 59, 12, 25, 59, 55, 50, 40, 30, 32, 10,…
$ time_hour <dttm> 2022-01-01 23:00:00, 2022-01-01 23:00:00, 2022-01-01 2…
the first thing we notice is that time_hour
encodes almost the same information as year
, month
, day
, hour
and minute
. With minute
being the only difference. Let us deal with this by adding the minutes to time_hour
and only keeping that variable
<- laxflights2022raw |>
laxflights2022 mutate(time = time_hour + minutes(minute)) |>
select(-c(year, month, day, hour, minute, time_hour))
glimpse(laxflights2022)
Rows: 191,156
Columns: 14
$ dep_time <dbl> 7, 14, 15, 23, 25, 31, 34, 109, 119, 122, 135, 139, 224…
$ sched_dep_time <dbl> 2359, 2343, 2315, 30, 2259, 2312, 25, 59, 55, 50, 2340,…
$ dep_delay <dbl> 8, 31, 60, -7, 86, 79, 9, 10, 24, 32, 115, 9, 172, -2, …
$ arr_time <dbl> 459, 608, 733, 742, 745, 745, 611, 608, 643, 608, 1000,…
$ sched_arr_time <dbl> 511, 540, 647, 820, 631, 636, 631, 615, 633, 552, 811, …
$ arr_delay <dbl> -12, 28, 46, -38, 74, 69, -20, -7, 10, 16, 109, -12, 12…
$ carrier <chr> "UA", "AA", "NK", "AA", "NK", "NK", "UA", "NK", "DL", "…
$ flight <dbl> 468, 359, 44, 177, 185, 205, 600, 122, 317, 33, 276, 59…
$ tailnum <chr> "N57864", "N919NN", "N949NK", "N812AA", "N903NK", "N509…
$ origin <chr> "LAX", "LAX", "LAX", "LAX", "LAX", "LAX", "LAX", "LAX",…
$ dest <chr> "IAH", "BNA", "CLE", "PHL", "PIT", "DTW", "ORD", "IAH",…
$ air_time <dbl> 153, 203, 234, 238, 240, 231, 187, 160, 177, 143, 298, …
$ distance <dbl> 1379, 1797, 2052, 2402, 2136, 1979, 1744, 1379, 1535, 1…
$ time <dttm> 2022-01-01 23:59:00, 2022-01-01 23:43:00, 2022-01-01 2…
Next, we see that the combination of dep_time
, sched_dep_time
and dep_delay
is linearly dependent as dep_time - sched_dep_time = dep_delay
, so we can remove sched_dep_time
and dep_time
without losing any information since we also have sched_dep_time
as a function of time
.
<- laxflights2022 |>
laxflights2022 select(-c(sched_dep_time, dep_time))
glimpse(laxflights2022)
Rows: 191,156
Columns: 12
$ dep_delay <dbl> 8, 31, 60, -7, 86, 79, 9, 10, 24, 32, 115, 9, 172, -2, …
$ arr_time <dbl> 459, 608, 733, 742, 745, 745, 611, 608, 643, 608, 1000,…
$ sched_arr_time <dbl> 511, 540, 647, 820, 631, 636, 631, 615, 633, 552, 811, …
$ arr_delay <dbl> -12, 28, 46, -38, 74, 69, -20, -7, 10, 16, 109, -12, 12…
$ carrier <chr> "UA", "AA", "NK", "AA", "NK", "NK", "UA", "NK", "DL", "…
$ flight <dbl> 468, 359, 44, 177, 185, 205, 600, 122, 317, 33, 276, 59…
$ tailnum <chr> "N57864", "N919NN", "N949NK", "N812AA", "N903NK", "N509…
$ origin <chr> "LAX", "LAX", "LAX", "LAX", "LAX", "LAX", "LAX", "LAX",…
$ dest <chr> "IAH", "BNA", "CLE", "PHL", "PIT", "DTW", "ORD", "IAH",…
$ air_time <dbl> 153, 203, 234, 238, 240, 231, 187, 160, 177, 143, 298, …
$ distance <dbl> 1379, 1797, 2052, 2402, 2136, 1979, 1744, 1379, 1535, 1…
$ time <dttm> 2022-01-01 23:59:00, 2022-01-01 23:43:00, 2022-01-01 2…
On a similar note, since we are trying to predict arr_delay
, we can’t have arr_time
or air_time
in the data set as they allow us to figure out arr_delay
from dep_time
. With some loss of information, we will also remove sched_arr_time
to make the modeling a little easier.
<- laxflights2022 |>
laxflights2022 select(-c(arr_time, air_time, sched_arr_time))
glimpse(laxflights2022)
Rows: 191,156
Columns: 9
$ dep_delay <dbl> 8, 31, 60, -7, 86, 79, 9, 10, 24, 32, 115, 9, 172, -2, 16, -…
$ arr_delay <dbl> -12, 28, 46, -38, 74, 69, -20, -7, 10, 16, 109, -12, 122, -1…
$ carrier <chr> "UA", "AA", "NK", "AA", "NK", "NK", "UA", "NK", "DL", "NK", …
$ flight <dbl> 468, 359, 44, 177, 185, 205, 600, 122, 317, 33, 276, 595, 51…
$ tailnum <chr> "N57864", "N919NN", "N949NK", "N812AA", "N903NK", "N509NK", …
$ origin <chr> "LAX", "LAX", "LAX", "LAX", "LAX", "LAX", "LAX", "LAX", "LAX…
$ dest <chr> "IAH", "BNA", "CLE", "PHL", "PIT", "DTW", "ORD", "IAH", "MSP…
$ distance <dbl> 1379, 1797, 2052, 2402, 2136, 1979, 1744, 1379, 1535, 1235, …
$ time <dttm> 2022-01-01 23:59:00, 2022-01-01 23:43:00, 2022-01-01 23:15:…
We will also be removing the flight
variable to have one less high cardinality variable to work with.
<- laxflights2022 |>
laxflights2022 select(-flight)
glimpse(laxflights2022)
Rows: 191,156
Columns: 8
$ dep_delay <dbl> 8, 31, 60, -7, 86, 79, 9, 10, 24, 32, 115, 9, 172, -2, 16, -…
$ arr_delay <dbl> -12, 28, 46, -38, 74, 69, -20, -7, 10, 16, 109, -12, 122, -1…
$ carrier <chr> "UA", "AA", "NK", "AA", "NK", "NK", "UA", "NK", "DL", "NK", …
$ tailnum <chr> "N57864", "N919NN", "N949NK", "N812AA", "N903NK", "N509NK", …
$ origin <chr> "LAX", "LAX", "LAX", "LAX", "LAX", "LAX", "LAX", "LAX", "LAX…
$ dest <chr> "IAH", "BNA", "CLE", "PHL", "PIT", "DTW", "ORD", "IAH", "MSP…
$ distance <dbl> 1379, 1797, 2052, 2402, 2136, 1979, 1744, 1379, 1535, 1235, …
$ time <dttm> 2022-01-01 23:59:00, 2022-01-01 23:43:00, 2022-01-01 23:15:…
Since we are trying to model arr_delay
we exclude all the flights where that information is missing.
<- laxflights2022 |>
laxflights2022 drop_na(arr_delay)
glimpse(laxflights2022)
Rows: 187,868
Columns: 8
$ dep_delay <dbl> 8, 31, 60, -7, 86, 79, 9, 10, 24, 32, 115, 9, 172, -2, 16, -…
$ arr_delay <dbl> -12, 28, 46, -38, 74, 69, -20, -7, 10, 16, 109, -12, 122, -1…
$ carrier <chr> "UA", "AA", "NK", "AA", "NK", "NK", "UA", "NK", "DL", "NK", …
$ tailnum <chr> "N57864", "N919NN", "N949NK", "N812AA", "N903NK", "N509NK", …
$ origin <chr> "LAX", "LAX", "LAX", "LAX", "LAX", "LAX", "LAX", "LAX", "LAX…
$ dest <chr> "IAH", "BNA", "CLE", "PHL", "PIT", "DTW", "ORD", "IAH", "MSP…
$ distance <dbl> 1379, 1797, 2052, 2402, 2136, 1979, 1744, 1379, 1535, 1235, …
$ time <dttm> 2022-01-01 23:59:00, 2022-01-01 23:43:00, 2022-01-01 23:15:…
Lastly, the outcome arr_delay
has been moved to the beginning of the data set as it aids in investigating the data.
<- laxflights2022 |>
laxflights2022 relocate(arr_delay)
glimpse(laxflights2022)
Rows: 187,868
Columns: 8
$ arr_delay <dbl> -12, 28, 46, -38, 74, 69, -20, -7, 10, 16, 109, -12, 122, -1…
$ dep_delay <dbl> 8, 31, 60, -7, 86, 79, 9, 10, 24, 32, 115, 9, 172, -2, 16, -…
$ carrier <chr> "UA", "AA", "NK", "AA", "NK", "NK", "UA", "NK", "DL", "NK", …
$ tailnum <chr> "N57864", "N919NN", "N949NK", "N812AA", "N903NK", "N509NK", …
$ origin <chr> "LAX", "LAX", "LAX", "LAX", "LAX", "LAX", "LAX", "LAX", "LAX…
$ dest <chr> "IAH", "BNA", "CLE", "PHL", "PIT", "DTW", "ORD", "IAH", "MSP…
$ distance <dbl> 1379, 1797, 2052, 2402, 2136, 1979, 1744, 1379, 1535, 1235, …
$ time <dttm> 2022-01-01 23:59:00, 2022-01-01 23:43:00, 2022-01-01 23:15:…
Saving the data
The full data set is now ready and is saved as data/laxflights2022.csv.
write_csv(laxflights2022, "data/laxflights2022.csv")
A smaller version of the data set is also created and is saved as data/laxflights2022_lite.csv.
set.seed(1234)
<- laxflights2022 |>
laxflights2022_lite slice_sample(prop = 0.02) %>%
arrange(time)
write_csv(laxflights2022_lite, "data/laxflights2022_lite.csv")
Data dictionary
arr_delay
: Arrival delays, in minutes. Negative times represent early arrivals.dep_delay
: Departure delays, in minutes. Negative times represent early departures.carrier
: Two letter carrier abbreviation.tailnum
: Plane tail number.origin
: FAA airport code for origin of flight.dest
: FAA airport code for destination of flight.distance
: Distance between airports, in miles.time
: Scheduled time of the flight as a POSIXct date, rounded to the nearest minute.