Embracing sparsity in tidymodels
Emil Hvitfeldt
Salt Lake City R Users Group
![]()
Motivated example
library(tidymodels)
library(nycflights13)
library(sparsevctrs)
flights$arr_time <- NULL
flights$time_hour <- NULL
flights <- flights[!is.na(flights$arr_delay), ]
rec <- recipe(arr_delay ~ ., data = flights) |>
step_impute_mean(all_numeric_predictors()) |>
step_unknown(all_nominal_predictors()) |>
step_dummy(all_nominal_predictors())
mod <- boost_tree() |>
set_engine("xgboost") |>
set_mode("regression")
wf_spec <- workflow(rec, mod)
Motivated example
- Using xgboost
- Contains many dummy variables (>4000)
- Running on a single core
Fitting the workflow
Last year
runtime: 480 seconds
Memory allocated: 82 GB
Today
runtime: 4 seconds
Memory allocated: 822 MB
What happened???
- ALTREP
- Sparsity
- Integration
- User Experiance
What is ALTREP?
Short for ALTerntive REPresentation
Some examples of intended uses would be to
- allow vector data to be in a memory-mapped file or distributed
- allow compact representation of arithmetic sequences
- support adding meta-data to objects
- support alternative representations of environments
ALTREP explained
When running f(x)
Normally
x is found in memory
- Passed to
f() for calculations
ALTREP
x is found in memory with metadata
- Passed to
f() for calculations
- meta data is used to help calculations
Integer sequences
created using n:m, seq(), seq_len(), and seq_along()
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
library(lobstr)
obj_size(numeric(100))
obj_size(numeric(1000))
obj_size(numeric(10000))
library(lobstr)
obj_size(seq_len(100))
obj_size(seq_len(1000))
obj_size(seq_len(10000))
Integer sequences
# https://github.com/multimeric/altrepr
library(altrepr)
vec <- 7:100
alt_data1(vec)
vec <- 7:100000
alt_data1(vec)
Integer sequences
Altrep functions kick in for certain operations
pseudo implementation for length() and sum()
int_seq_length <- function(x) {
x[1]
}
int_seq_sum <- function(x) {
length <- x[1]
start <- x[2]
step <- x[3]
(length / 2.0) * (start * 2 + step * (length - 1))
}
Alrep methods
altrep
- UnserializeEX
- Unserialize
- Serialized_state
- DuplicateEX
- Duplicate
- Coerce
- Inspect
- Length
altvec
- Dataptr
- Dataptr_or_null
- Extract_subset
integer
- Elt
- Get_region
- Is_sorted
- No_NA
- Sum
- Min
- Max
Sparsity vocabulary
sparse: contains a lot of 0s
dense: doesn’t contain a lot of 0s
Sparse data is often found in
- Natural Language Processing
- Recommender Systems
- Interactions / Dummy variables
Sparse Vector
0 0 0 0 0 0 0 0 0 0 0 4 0 0 0 0 0 0 0 0 2 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
This vector take 36 values to store, but most are 0s
How could we store this more efficiently?
- length: How long is the vector (1)
- positions: Locations of non-zero values (3)
- values: Values of non-zero values (3)
Sparse Vector
library(sparsevctrs)
obj_size(sparse_integer(
values = c(4, 6, 2),
positions = c(1, 10, 100),
length = 100
))
obj_size(sparse_integer(
values = c(4, 6, 2),
positions = c(1, 10, 100),
length = 10000
))
library(sparsevctrs)
obj_size(sparse_integer(
values = c(4),
positions = c(1),
length = 100
))
obj_size(sparse_integer(
values = c(4, 6, 2, 1, 1),
positions = c(1, 10, 20, 30, 56),
length = 100
))
Sparse operations
sparse_sum <- function(x) {
sum(x$values)
}
sparse_mean <- function(x) {
sum(x$values) / x$length
}
sparse_elt <- function(x, i) {
if (i %in% x$positions) {
index <- match(i, x$positions)
return(x$values[index])
} else {
return(0)
}
}
Sparse operations in C
static int altrep_sparse_integer_Elt(SEXP x, R_xlen_t i) {
SEXP val = extract_val(x);
SEXP pos = extract_pos(x);
const int* v_pos = INTEGER_RO(pos);
const R_xlen_t size = Rf_xlength(pos);
const R_xlen_t len = extract_len(x);
const int v_default_val = extract_default_integer(x);
if (i > len) {
// OOB of vector itself
return NA_INTEGER;
}
const int needle = (int) i + 1;
const R_xlen_t loc = binary_search(needle, v_pos, size);
if (loc == size) {
// Can't find it, must be the default value
return v_default_val;
} else {
// Look it up in `val`
return INTEGER_ELT(val, loc);
}
}
sparsevctrs
- All ALTREP happens in C
- Operations must match dense operations
- Very fragile, will materialize if not careful
- {sparsevctrs} contains more helpers
sparsevctrs
- Contains converters between sparse matrices from {Matrix} and tibbles with sparse vectors
- Is a developer focused package
Why are we doing this?
Because sparse matrices from {Matrix} doesn’t work with tibbles
tidymodels is built on tibbles
These sparse vectors fits inside tibbles
Use sparsevctrs functions to move between Sparse matrices and sparse tibbles
Integration in tidymodels
Integration in tidymodels
sparse tibbles in internals
Making it so sparse matrices are turned into sparse tibbles in all entry points
recipe(), prep(), bake(), fit(), predict(), and augment()
Adding a ton of testing with sparse matrices and sparse tibbles
Calling as_tibble(as.matrix(data)) would break sparsity
Integration in tidymodels
recipe steps
Recipes steps that could produce sparse data now have sparse argument
It takes 3 values:
"yes": Generate sparse vectors
"no": Don’t generate sparse vectors
"auto": Automatically decide (default)
Integration in tidymodels
Automatic toggle
For the smoothest user experiance we try to automatically toggle sparse computations
- Does model support sparse data?
- Is there high sparsity in the data?
If the answer to both questions are “yes” then tidymodels will do things sparsely otherwise not
Integration in tidymodels
detecting high sparsity
The sparsity matters for the parsnip model
We need to estimate the sparsity that comes out of recipes
Each step have a method that returns how it is expected to change sparsity
Integration in tidymodels
detecting high sparsity
Simulation study using different sizes of data, different amounts of sparsity, for all supported models
Fit MARS model on the log(dense time / sparse time)
used orbital to embed this fitted model in the toggler