Don’t be dense!

Embracing sparsity in tidymodels


Emil Hvitfeldt

Salt Lake City R Users Group

Motivating example

library(tidymodels)
library(nycflights13)
library(sparsevctrs)

flights$arr_time <- NULL
flights$time_hour <- NULL
flights <- flights[!is.na(flights$arr_delay), ]

rec <- recipe(arr_delay ~ ., data = flights) |>
  step_impute_mean(all_numeric_predictors()) |>
  step_unknown(all_nominal_predictors()) |>
  step_dummy(all_nominal_predictors())

mod <- boost_tree() |>
  set_engine("xgboost") |>
  set_mode("regression")

wf_spec <- workflow(rec, mod)

Motivating example

  • Using xgboost
  • Contains many dummy variables (>4000)
  • Running on a single core

Fitting the workflow

Last year


Runtime: 480 seconds

Memory allocated: 82 GB

Today


Runtime: 4 seconds

Memory allocated: 822 MB
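A rough way to reproduce these numbers for yourself, assuming the {bench} package is installed; exact timings and memory will differ by machine and package versions.

library(bench)

# fit the workflow once and record elapsed time and memory allocated
bench::mark(
  fit(wf_spec, data = flights),
  iterations = 1
)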

What happened???

  • ALTREP
  • Sparsity
  • Integration
  • User experience

ALTREP

What is ALTREP?

Short for ALTernative REPresentation

Some examples of intended uses would be to

  • allow vector data to be in a memory-mapped file or distributed
  • allow compact representation of arithmetic sequences
  • support adding meta-data to objects
  • support alternative representations of environments

ALTREP explained

When running f(x)

Normally

  • x is found in memory
  • Passed to f() for calculations

ALTREP

  • x is found in memory with metadata
  • Passed to f() for calculations
  • The metadata is used to help with the calculations

Metadata example

vec1 <- sample(10000000)

bench::mark(vec2 = {vec2 <- sort(vec1)})
# A tibble: 1 × 6
  expression      min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 vec2          184ms    188ms      5.35    76.3MB     5.35
bench::mark(vec3 = {vec3 <- sort(vec2)})
# A tibble: 1 × 6
  expression      min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 vec3         1.11µs   1.31µs   662570.     128MB        0

Metadata example

vec1 <- sample(10000000)

bench::mark(vec2 = {vec2 <- sort(vec1)})
# A tibble: 1 × 6
  expression      min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 vec2          162ms    162ms      6.17    76.3MB     12.3
new_vec <- c(0, vec2)
bench::mark(vec3 = {vec3 <- sort(new_vec)})
# A tibble: 1 × 6
  expression      min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 vec3         26.4ms   26.4ms      37.8     114MB     114.
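One way to peek at this metadata is base R's low-level inspector. This sketch is not part of the original slides, and the exact output depends on your R version, but after sort() you should see an ALTREP wrapper recording sortedness, and it is gone once the vector is modified.

x <- sort(sample(100))
.Internal(inspect(x))  # look for an ALTREP wrapper noting the vector is sorted

y <- c(0L, x)
.Internal(inspect(y))  # a plain integer vector again; the sortedness metadata is gone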

Integer sequences

Created using n:m, seq(), seq_len(), and seq_along()


1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28


library(lobstr)

obj_size(numeric(100))
obj_size(numeric(1000))
obj_size(numeric(10000))
848 B
8.05 kB
80.05 kB
library(lobstr)

obj_size(seq_len(100))
obj_size(seq_len(1000))
obj_size(seq_len(10000))
680 B
680 B
680 B

Integer sequences

# https://github.com/multimeric/altrepr
library(altrepr)

vec <- 7:100
alt_data1(vec)
[1] 94  7  1


vec <- 7:100000
alt_data1(vec)
[1] 99994     7     1

length - start - step

Integer sequences

ALTREP methods kick in for certain operations

Pseudo-implementations of length() and sum()

int_seq_length <- function(x) {
  # x is the compact representation: c(length, start, step)
  x[1]
}

int_seq_sum <- function(x) {
  length <- x[1]
  start <- x[2]
  step <- x[3]

  # arithmetic series: n / 2 * (2 * a + (n - 1) * d)
  (length / 2.0) * (start * 2 + step * (length - 1))
}
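Trying the pseudo-implementations on the 7:100 example from before (length 94, start 7, step 1) gives the same answers as the dense versions:

x <- c(94, 7, 1)   # the compact length / start / step representation of 7:100

int_seq_length(x)  # 94, same as length(7:100)
int_seq_sum(x)     # 5029, same as sum(7:100)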

ALTREP methods

altrep

  • UnserializeEX
  • Unserialize
  • Serialized_state
  • DuplicateEX
  • Duplicate
  • Coerce
  • Inspect
  • Length

altvec

  • Dataptr
  • Dataptr_or_null
  • Extract_subset

integer

  • Elt
  • Get_region
  • Is_sorted
  • No_NA
  • Sum
  • Min
  • Max

SPARSITY

Sparsity vocabulary


sparse: contains a lot of 0s

dense: doesn’t contain a lot of 0s


Sparse data is often found in

  • Natural Language Processing
  • Recommender Systems
  • Interactions / Dummy variables

Sparse Vector

0 0 0 0 0 0 0 0 0 0 0 4 0 0 0 0 0 0 0 0 2 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0

This vector takes 37 values to store, even though only 3 of them are non-zero

How could we store this more efficiently?


  • length: how long the vector is (1 number)
  • positions: where the non-zero values are (3 numbers)
  • values: what the non-zero values are (3 numbers)
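For the vector above that works out to length 37, positions c(12, 21, 25), and values c(4, 2, 1). Using the {sparsevctrs} constructor shown on the next slide, it could be written roughly as:

library(sparsevctrs)

sparse_integer(
  values = c(4, 2, 1),
  positions = c(12, 21, 25),
  length = 37
)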

Sparse Vector

library(sparsevctrs)

obj_size(sparse_integer(
  values = c(4, 6, 2), 
  positions = c(1, 10, 100), 
  length = 100
))
obj_size(sparse_integer(
  values = c(4, 6, 2), 
  positions = c(1, 10, 100), 
  length = 10000
))
920 B
920 B
library(sparsevctrs)

obj_size(sparse_integer(
  values = c(4), 
  positions = c(1), 
  length = 100
))
obj_size(sparse_integer(
  values = c(4, 6, 2, 1, 1), 
  positions = c(1, 10, 20, 30, 56), 
  length = 100
))
904 B
952 B

Sparse operations

# x is assumed to be a list with fields values, positions, and length
sparse_sum <- function(x) {
  # the zeros contribute nothing, so summing the stored values is enough
  sum(x$values)
}

sparse_mean <- function(x) {
  # same sum as above, divided by the full (dense) length
  sum(x$values) / x$length
}

sparse_elt <- function(x, i) {
  if (i %in% x$positions) {
    # a stored position: look up the matching value
    index <- match(i, x$positions)

    return(x$values[index])
  } else {
    # not stored, so it must be a zero
    return(0)
  }
}
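Checking the pseudo-implementations against their dense counterparts on the example vector (4 at position 12, 2 at position 21, 1 at position 25, length 37):

x <- list(values = c(4, 2, 1), positions = c(12, 21, 25), length = 37)
dense <- c(rep(0, 11), 4, rep(0, 8), 2, rep(0, 3), 1, rep(0, 12))

sparse_sum(x)      # 7, same as sum(dense)
sparse_mean(x)     # 0.1891892, same as mean(dense)
sparse_elt(x, 21)  # 2, same as dense[21]
sparse_elt(x, 2)   # 0, same as dense[2]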

Sparse operations in C

static int altrep_sparse_integer_Elt(SEXP x, R_xlen_t i) {
  SEXP val = extract_val(x);

  SEXP pos = extract_pos(x);
  const int* v_pos = INTEGER_RO(pos);
  const R_xlen_t size = Rf_xlength(pos);

  const R_xlen_t len = extract_len(x);

  const int v_default_val = extract_default_integer(x);

  if (i > len) {
    // OOB of vector itself
    return NA_INTEGER;
  }

  const int needle = (int) i + 1;
  const R_xlen_t loc = binary_search(needle, v_pos, size);

  if (loc == size) {
    // Can't find it, must be the default value
    return v_default_val;
  } else {
    // Look it up in `val`
    return INTEGER_ELT(val, loc);
  }
}
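From R, that Elt method is what makes single-element indexing of a sparse vector cheap. A small sketch, assuming {sparsevctrs} is loaded:

library(sparsevctrs)

x <- sparse_integer(values = c(4, 2, 1), positions = c(12, 21, 25), length = 37)

x[21]  # 2: found in positions, looked up in values
x[2]   # 0: not in positions, so the default value is returned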

sparsevctrs

  • All ALTREP happens in C
  • Operations must match dense operations
  • Very fragile: sparse vectors will materialize (turn dense) if you are not careful
  • {sparsevctrs} contains more helpers

sparsevctrs

  • Contains converters between sparse matrices from {Matrix} and tibbles with sparse vectors (see the sketch below)
  • Is a developer-focused package
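A sketch of going back and forth, assuming the converter functions coerce_to_sparse_tibble() and coerce_to_sparse_matrix() and a sparse matrix with column names:

library(sparsevctrs)
library(Matrix)

m <- rsparsematrix(nrow = 10, ncol = 5, density = 0.2)
colnames(m) <- paste0("x", 1:5)

tbl <- coerce_to_sparse_tibble(m)    # tibble whose columns are sparse vectors
m2  <- coerce_to_sparse_matrix(tbl)  # back to a sparse matrix from {Matrix}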

INTEGRATION

Why are we doing this?

  • Because sparse matrices from {Matrix} don’t work inside tibbles

  • tidymodels is built on tibbles

  • These sparse vectors fit inside tibbles

  • Use sparsevctrs functions to move between sparse matrices and sparse tibbles

Integration in tidymodels

  • Tibbles can carry sparse vectors through all of tidymodels’ internals

  • Recipe steps can produce sparsity

  • Automatic toggle

Integration in tidymodels

sparse tibbles in internals

Making it so sparse matrices are turned into sparse tibbles in all entry points

recipe(), prep(), bake(), fit(), predict(), and augment()

Adding a ton of testing with sparse matrices and sparse tibbles

Calling as_tibble(as.matrix(data)) would break sparsity
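A minimal sketch of a sparse column flowing through those entry points, assuming a recent tidymodels stack where sparse vectors are preserved:

library(tidymodels)
library(sparsevctrs)

dat <- tibble(
  y = rnorm(20),
  x = sparse_double(values = c(3, 1), positions = c(2, 15), length = 20)
)

baked <- recipe(y ~ x, data = dat) |>
  prep() |>
  bake(new_data = NULL)

is_sparse_vector(baked$x)  # TRUE if sparsity survived the round trip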

Integration in tidymodels

recipe steps

Recipe steps that can produce sparse data now have a sparse argument (example after this list)

It takes one of three values:

  • "yes": Generate sparse vectors
  • "no": Don’t generate sparse vectors
  • "auto": Automatically decide (default)
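For example, forcing sparse dummy variables in the motivating recipe, assuming a recipes version recent enough to have the argument:

rec_sparse <- recipe(arr_delay ~ ., data = flights) |>
  step_impute_mean(all_numeric_predictors()) |>
  step_unknown(all_nominal_predictors()) |>
  step_dummy(all_nominal_predictors(), sparse = "yes")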

Integration in tidymodels

Automatic toggle

For the smoothest user experience we try to automatically toggle sparse computations

  1. Does model support sparse data?
  2. Is there high sparsity in the data?

If the answer to both questions is “yes”, tidymodels does things sparsely; otherwise it doesn’t
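In pseudocode the decision is roughly the following. The helper and the 0.9 cutoff are made up for illustration; the real check uses the simulation-based estimate described on the next slides.

# made-up helper, not the real tidymodels internals
should_use_sparsity <- function(supports_sparse, estimated_sparsity) {
  # both answers must be "yes": the engine accepts sparse input and the
  # processed predictors are expected to be mostly zeros
  supports_sparse && estimated_sparsity > 0.9
}

should_use_sparsity(supports_sparse = TRUE, estimated_sparsity = 0.98)  # TRUE
should_use_sparsity(supports_sparse = TRUE, estimated_sparsity = 0.30)  # FALSE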

Integration in tidymodels

detecting high sparsity

The sparsity matters for the parsnip model


We need to estimate the sparsity that comes out of recipes


Each step has a method that reports how it is expected to change sparsity
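As a hypothetical illustration (not the real recipes API): step_dummy() turns a factor with k roughly balanced levels into indicator columns that are each mostly zeros, so the expected sparsity of those columns can be computed directly.

# a dummy column made from a factor with `k` roughly balanced levels is
# about (k - 1) / k zeros
expected_dummy_sparsity <- function(k) {
  (k - 1) / k
}

expected_dummy_sparsity(2)    # 0.5
expected_dummy_sparsity(100)  # 0.99, e.g. a high-cardinality column like dest in flights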

Integration in tidymodels

detecting high sparsity

Simulation study using different sizes of data, different amounts of sparsity, for all supported models

Fit a MARS model on log(dense time / sparse time)

Used {orbital} to embed the fitted model in the toggler
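A sketch of that pipeline on made-up stand-in data, assuming the {earth} engine for MARS and the {orbital} package:

library(tidymodels)
library(orbital)

# made-up stand-in for the simulation results
sim <- tibble(
  sparsity  = runif(500),
  n_rows    = 10 ^ runif(500, 2, 6),
  log_ratio = rnorm(500)  # placeholder for log(dense time / sparse time)
)

mars_spec <- mars() |>
  set_engine("earth") |>
  set_mode("regression")

mars_fit <- workflow(recipe(log_ratio ~ sparsity + n_rows, data = sim), mars_spec) |>
  fit(data = sim)

# orbital() extracts the fitted model as plain R expressions that can be
# embedded without carrying the full model object around
orbital(mars_fit)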

USER EXPERIENCE

User Experience

  • Sparse data input now has native support in tidymodels

  • The automatic toggle can be overridden manually by setting sparse = “yes” or sparse = “no” in the relevant steps

Thank you!