Azure, Pins Input, Docker Endpoint
This page was last generated on 2024-03-13. If you find the code out of date please file an issue.
All changes from the standard pipeline are highlighted with a cranberry line to the right.
Loading packages
We are using the tidymodels package to do the modeling, embed for target encoding, pins for versioning, vetiver for versioning and deployment, and AzureStor for connecting with Azure Storage.

library(tidymodels)
library(embed)
library(vetiver)
library(pins)
library(AzureStor)
Loading data from Azure with pins
We will fetch the data from, and version the final model on, Azure storage using the pins package.
For the smoothest experience, we recommend that you authenticate using environment variables. The two variables you will need are AZURE_CONTAINER_ENDPOINT and AZURE_SAS_KEY.
The function usethis::edit_r_environ() is very handy for opening your .Renviron file to specify your environment variables.
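For example (a usage sketch; remember to restart R afterwards so the new values are picked up):

# Opens the user-level .Renviron file in your editor
usethis::edit_r_environ()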
First we need to create an Azure container for us to point to. You can find out how to create a storage account and how to create a container from the official documentation. You will also need to generate a SAS key.
Once you have those two, you can add them to your .Renviron file in the following format:

AZURE_CONTAINER_ENDPOINT=xxxxxxxxxxxxxxxxxxxxxxxxxxxx
AZURE_SAS_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxx
The container endpoint will have the following format: https://name-of-storage-account.blob.core.windows.net/name-of-container.
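As a quick sanity check (our addition, not part of the original pipeline), you can confirm both variables are visible to your R session before connecting:

# Both should return TRUE; an empty string means the variable is unset
nzchar(Sys.getenv("AZURE_CONTAINER_ENDPOINT"))
nzchar(Sys.getenv("AZURE_SAS_KEY"))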
Loading Data
We are using the smaller laxflights2022 data set described on the data preparation page. The data set has been uploaded to pins, as described on the data pins page.
library(pins)
library(AzureStor)
container <- storage_container(
  endpoint = Sys.getenv("AZURE_CONTAINER_ENDPOINT"),
  sas = Sys.getenv("AZURE_SAS_KEY")
)

board <- board_azure(container)

flights <- board |>
  pin_read("laxflights2022_lite")
glimpse(flights)
Rows: 3,757
Columns: 8
$ arr_delay <dbl> 4, -15, -12, 38, -9, -17, 5, 12, -40, 6, -7, 28, 25, -9, 180…
$ dep_delay <dbl> 9, -8, 0, -7, 3, 6, 29, -1, 2, 7, 6, 13, 34, -2, 191, 52, 9,…
$ carrier <chr> "UA", "OO", "AA", "UA", "OO", "OO", "UA", "AA", "DL", "DL", …
$ tailnum <chr> "N37502", "N198SY", "N410AN", "N77261", "N402SY", "N509SY", …
$ origin <chr> "LAX", "LAX", "LAX", "LAX", "LAX", "LAX", "LAX", "LAX", "LAX…
$ dest <chr> "KOA", "EUG", "HNL", "DEN", "FAT", "SFO", "MCO", "MIA", "OGG…
$ distance <dbl> 2504, 748, 2556, 862, 209, 337, 2218, 2342, 2486, 862, 156, …
$ time <dttm> 2022-01-01 13:15:00, 2022-01-01 14:00:00, 2022-01-01 14:45:…
Modeling
As a reminder, the modeling task we are trying to accomplish is the following:
Given all the information we have, from the moment the plane leaves for departure, can we predict the arrival delay (arr_delay)?
Our outcome is arr_delay and the remaining variables are predictors. We will be fitting an xgboost model as a regression model.
Splitting Data
Since the data set is already in chronological order, we can create a time split of the data using initial_time_split(). This will put the first 75% of the data into the training data set and the remaining 25% into the testing data set.
set.seed(1234)
flights_split <- initial_time_split(flights, prop = 3/4)
flights_training <- training(flights_split)
Since we are doing hyperparameter tuning, we will also be creating a cross-validation split:

flights_folds <- vfold_cv(flights_training)
Feature Engineering
We need to do a couple of things to make this data set work for our model. The datetime variable time needs to be transformed, as do the categorical variables carrier, tailnum, origin, and dest.
From the time variable, the month and day of the week are extracted as categorical variables, then the day of year and time of day are extracted as numerics. The origin and dest variables will be turned into dummy variables, and carrier, tailnum, time_month, and time_dow will be converted to numerics with likelihood encoding.
flights_rec <- recipe(arr_delay ~ ., data = flights_training) %>%
  step_novel(all_nominal_predictors()) %>%
  step_other(origin, dest, threshold = 0.025) %>%
  step_dummy(origin, dest) %>%
  step_date(time,
            features = c("month", "dow", "doy"),
            label = TRUE,
            keep_original_cols = TRUE) %>%
  step_time(time, features = "decimal_day", keep_original_cols = FALSE) %>%
  step_lencode_mixed(all_nominal_predictors(), outcome = vars(arr_delay)) %>%
  step_zv(all_predictors())
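If you want to double-check what this preprocessing produces, one option (not shown on the original page) is to prep() the recipe and bake() it on the training data:

# Estimate the recipe steps and inspect the processed training set
flights_rec %>%
  prep() %>%
  bake(new_data = NULL) %>%
  glimpse()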
Specifying Models
We will be fitting a boosted tree model in the form of an xgboost model.
xgb_spec <-
  boost_tree(
    trees = tune(),
    min_n = tune(),
    mtry = tune(),
    learn_rate = 0.01
  ) %>%
  set_engine("xgboost") %>%
  set_mode("regression")

xgb_wf <- workflow(flights_rec, xgb_spec)
Hyperparameter Tuning
doParallel::registerDoParallel()

xgb_rs <- tune_grid(
  xgb_wf,
  resamples = flights_folds,
  grid = 10
)
i Creating pre-processing data to finalize unknown parameter: mtry
We can visualize the performance of the different hyperparameter selections
autoplot(xgb_rs)
and look at the top results:
show_best(xgb_rs, metric = "rmse")
# A tibble: 5 × 9
mtry trees min_n .metric .estimator mean n std_err .config
<int> <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
1 3 1988 4 rmse standard 28.1 10 5.88 Preprocessor1_Model01
2 8 849 13 rmse standard 29.6 10 6.44 Preprocessor1_Model04
3 3 1543 9 rmse standard 29.6 10 6.11 Preprocessor1_Model02
4 10 1139 14 rmse standard 30.0 10 6.44 Preprocessor1_Model05
5 12 554 18 rmse standard 30.6 10 6.77 Preprocessor1_Model06
Fitting Final Model
Once we are satisfied with the modeling that has been done, we can fit our final model. We use finalize_workflow() to apply the best hyperparameters, and last_fit() to fit the model to the training data set and evaluate it on the testing data set.
xgb_last <- xgb_wf %>%
  finalize_workflow(select_best(xgb_rs, metric = "rmse")) %>%
  last_fit(flights_split)
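If you are curious how the finalized model performs on the testing data set, collect_metrics() works directly on a last_fit() result; this check is our addition, not part of the original page:

# Test-set performance metrics from the last_fit() object
collect_metrics(xgb_last)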
Creating vetiver model
v <- xgb_last %>%
  extract_workflow() %>%
  vetiver_model("flights_xgb")

v
── flights_xgb ─ <bundled_workflow> model for deployment
A xgboost regression modeling workflow using 7 features
Version model with pins on Azure
We will version this model on Azure using the pins package.
We will use the board we created earlier to store the model.
vetiver_pin_write(board, v)
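As an optional follow-up (not on the original page), you can confirm the model was written and see its versions with pin_versions():

# List the versions of the pinned model on the Azure board
pin_versions(board, "flights_xgb")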
Create Docker artifacts
To build a Docker image that can serve your model, you need three artifacts:
- the Dockerfile itself,
- a renv.lock file to capture your model dependencies, and
- a plumber.R file containing the information to serve a vetiver REST API.
You can create all the needed files with one function.
vetiver_prepare_docker(
board, "flights_xgb",
docker_args = list(port = 8080)
)
The following package(s) were installed from an unknown source:
- recipes [1.0.10.9000]
renv may be unable to restore these packages in the future.
Consider reinstalling these packages from a known source (e.g. CRAN).
The following package(s) will be updated in the lockfile:
# CRAN -----------------------------------------------------------------------
- backports [1.4.1 -> *]
- base64enc [0.1-3 -> *]
- config [0.3.2 -> *]
- here [1.0.1 -> *]
- keras [2.13.0 -> *]
- png [0.1-8 -> *]
- RcppTOML [0.2.2 -> *]
- reticulate [1.35.0 -> *]
- rstudioapi [0.15.0 -> *]
- tensorflow [2.15.0 -> *]
- tfautograph [0.3.2 -> *]
- tfruns [1.5.2 -> *]
- zeallot [0.1.0 -> *]
# Local ----------------------------------------------------------------------
- embed [1.1.3 -> 1.1.3.9000]
# (Unknown Source) -----------------------------------------------------------
- recipes [1.0.10 -> 1.0.10.9000]
- Lockfile written to "vetiver_renv.lock".
Keep an eye on the value of port; we want to make sure we use the same port throughout the whole pipeline.
For ease of use, we make sure to use only CRAN versions of packages.
Build and run your Dockerfile
Now we have everything we need to build a Docker image, but there is one more thing to do first: install Docker if you haven’t already, then launch it so we can interact with it from the command line (not from R). Use the following docker build command. Notice that we can give the image a name using the --tag flag. The . denotes the path to the build context, which in this example is the folder we are in.
docker build --tag flights .
If you are on an ARM architecture locally and deploying an R model, use --platform linux/amd64 for RSPM’s fast installation of R package binaries.
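In that case, the build command becomes:

docker build --platform linux/amd64 --tag flights .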
To run the Docker container, we need to pass in the environment variables the code needs to connect to Azure storage. We could pass in all of our system environment variables, but we will be safer if we just pass in what we need. We do this by creating a project-specific .Renviron file (fs::file_touch(".Renviron")) and specifying AZURE_CONTAINER_ENDPOINT and AZURE_SAS_KEY in that file.
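A minimal sketch of creating that file (our approach, not the page’s; it copies the two values from your current session, so treat the resulting file as a secret and keep it out of version control):

# Write only the two variables the container needs into a project .Renviron
writeLines(
  c(
    paste0("AZURE_CONTAINER_ENDPOINT=", Sys.getenv("AZURE_CONTAINER_ENDPOINT")),
    paste0("AZURE_SAS_KEY=", Sys.getenv("AZURE_SAS_KEY"))
  ),
  ".Renviron"
)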
Then we run the docker run command. We set two flags: --env-file to pass in the environment variables we need, and --publish to specify the port mapping.
docker run --env-file .Renviron --publish 8080:8080 flights
Make predictions from Docker container
Now that the Docker container is running, we can create an endpoint with vetiver_endpoint(), and that endpoint can be used to make predictions.
endpoint <- vetiver_endpoint("http://0.0.0.0:8080/predict")

predict(endpoint, flights_training)
# A tibble: 2,817 × 1
.pred
<dbl>
1 -1.79
2 -13.3
3 -17.3
4 -3.67
5 82.8
6 52.2
7 10.7
8 8.04
9 58.1
10 5.39
# ℹ 2,807 more rows
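To eyeball how the endpoint predictions line up with reality, you can bind them to the observed delays (a quick check of our own, not from the original page):

# Put the endpoint predictions next to the observed arrival delays
predict(endpoint, flights_training) %>%
  bind_cols(flights_training["arr_delay"])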