Don’t be dense!

Embracing sparsity in tidymodels


Emil Hvitfeldt

Salt Lake City R Users Group

Motivating example

library(tidymodels)
library(nycflights13)
library(sparsevctrs)

flights$arr_time <- NULL
flights$time_hour <- NULL
flights <- flights[!is.na(flights$arr_delay), ]

rec <- recipe(arr_delay ~ ., data = flights) |>
  step_impute_mean(all_numeric_predictors()) |>
  step_unknown(all_nominal_predictors()) |>
  step_dummy(all_nominal_predictors())

mod <- boost_tree() |>
  set_engine("xgboost") |>
  set_mode("regression")

wf_spec <- workflow(rec, mod)

Motivating example

  • Using xgboost
  • Contains many dummy variables (>4000)
  • Running on a single core

Fitting the workflow

Last year


Runtime: 480 seconds

Memory allocated: 82 GB

Today


Runtime: 4 seconds

Memory allocated: 822 MB
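A rough way to reproduce these numbers for yourself, assuming the {bench} package is installed; exact timings and memory will differ by machine and package versions.

library(bench)

# fit the workflow once and record elapsed time and memory allocated
bench::mark(
  fit(wf_spec, data = flights),
  iterations = 1
)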

What happened???

  • ALTREP
  • Sparsity
  • Integration
  • User experience

ALTREP

What is ALTREP?

Short for ALTernative REPresentation

Some examples of intended uses would be to

  • allow vector data to be in a memory-mapped file or distributed
  • allow compact representation of arithmetic sequences
  • support adding meta-data to objects
  • support alternative representations of environments

ALTREP explained

When running f(x)

Normally

  • x is found in memory
  • Passed to f() for calculations

ALTREP

  • x is found in memory with metadata
  • Passed to f() for calculations
  • The metadata is used to help with the calculations

Metadata example

vec1 <- sample(10000000)

bench::mark(vec2 = {vec2 <- sort(vec1)})
# A tibble: 1 × 6
  expression      min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 vec2          184ms    188ms      5.35    76.3MB     5.35
bench::mark(vec3 = {vec3 <- sort(vec2)})
# A tibble: 1 × 6
  expression      min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 vec3         1.11µs   1.31µs   662570.     128MB        0

Metadata example

vec1 <- sample(10000000)

bench::mark(vec2 = {vec2 <- sort(vec1)})
# A tibble: 1 × 6
  expression      min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 vec2          162ms    162ms      6.17    76.3MB     12.3
new_vec <- c(0, vec2)
bench::mark(vec3 = {vec3 <- sort(new_vec)})
# A tibble: 1 × 6
  expression      min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 vec3         26.4ms   26.4ms      37.8     114MB     114.
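One way to peek at this metadata is base R's low-level inspector. This sketch is not part of the original slides, and the exact output depends on your R version, but after sort() you should see an ALTREP wrapper recording sortedness, and it is gone once the vector is modified.

x <- sort(sample(100))
.Internal(inspect(x))  # look for an ALTREP wrapper noting the vector is sorted

y <- c(0L, x)
.Internal(inspect(y))  # a plain integer vector again; the sortedness metadata is gone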

Integer sequences

Created using n:m, seq(), seq_len(), and seq_along()


1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28


library(lobstr)

obj_size(numeric(100))
obj_size(numeric(1000))
obj_size(numeric(10000))
848 B
8.05 kB
80.05 kB
library(lobstr)

obj_size(seq_len(100))
obj_size(seq_len(1000))
obj_size(seq_len(10000))
680 B
680 B
680 B

Integer sequences

# https://github.com/multimeric/altrepr
library(altrepr)

vec <- 7:100
alt_data1(vec)
[1] 94  7  1


vec <- 7:100000
alt_data1(vec)
[1] 99994     7     1

length - start - step

Integer sequences

ALTREP methods kick in for certain operations

Pseudo-implementations of length() and sum()

int_seq_length <- function(x) {
  # x is the compact representation: c(length, start, step)
  x[1]
}

int_seq_sum <- function(x) {
  length <- x[1]
  start <- x[2]
  step <- x[3]

  # arithmetic series: n / 2 * (2 * a + (n - 1) * d)
  (length / 2.0) * (start * 2 + step * (length - 1))
}
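Trying the pseudo-implementations on the 7:100 example from before (length 94, start 7, step 1) gives the same answers as the dense versions:

x <- c(94, 7, 1)   # the compact length / start / step representation of 7:100

int_seq_length(x)  # 94, same as length(7:100)
int_seq_sum(x)     # 5029, same as sum(7:100)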

ALTREP methods

altrep

  • UnserializeEX
  • Unserialize
  • Serialized_state
  • DuplicateEX
  • Duplicate
  • Coerce
  • Inspect
  • Length

altvec

  • Dataptr
  • Dataptr_or_null
  • Extract_subset

integer

  • Elt
  • Get_region
  • Is_sorted
  • No_NA
  • Sum
  • Min
  • Max

SPARSITY

Sparsity vocabulary


sparse: contains a lot of 0s

dense: doesn’t contain a lot of 0s


Sparse data is often found in

  • Natural Language Processing
  • Recommender Systems
  • Interactions / Dummy variables

Sparse Vector

0 0 0 0 0 0 0 0 0 0 0 4 0 0 0 0 0 0 0 0 2 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0

This vector takes 37 values to store, even though only 3 of them are non-zero

How could we store this more efficiently?


  • length: how long the vector is (1 number)
  • positions: where the non-zero values are (3 numbers)
  • values: what the non-zero values are (3 numbers)
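For the vector above that works out to length 37, positions c(12, 21, 25), and values c(4, 2, 1). Using the {sparsevctrs} constructor shown on the next slide, it could be written roughly as:

library(sparsevctrs)

sparse_integer(
  values = c(4, 2, 1),
  positions = c(12, 21, 25),
  length = 37
)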

Sparse Vector

library(sparsevctrs)

obj_size(sparse_integer(
  values = c(4, 6, 2), 
  positions = c(1, 10, 100), 
  length = 100
))
obj_size(sparse_integer(
  values = c(4, 6, 2), 
  positions = c(1, 10, 100), 
  length = 10000
))
920 B
920 B
library(sparsevctrs)

obj_size(sparse_integer(
  values = c(4), 
  positions = c(1), 
  length = 100
))
obj_size(sparse_integer(
  values = c(4, 6, 2, 1, 1), 
  positions = c(1, 10, 20, 30, 56), 
  length = 100
))
904 B
952 B

Sparse operations

# x is assumed to be a list with fields values, positions, and length
sparse_sum <- function(x) {
  # the zeros contribute nothing, so summing the stored values is enough
  sum(x$values)
}

sparse_mean <- function(x) {
  # same sum as above, divided by the full (dense) length
  sum(x$values) / x$length
}

sparse_elt <- function(x, i) {
  if (i %in% x$positions) {
    # a stored position: look up the matching value
    index <- match(i, x$positions)

    return(x$values[index])
  } else {
    # not stored, so it must be a zero
    return(0)
  }
}
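Checking the pseudo-implementations against their dense counterparts on the example vector (4 at position 12, 2 at position 21, 1 at position 25, length 37):

x <- list(values = c(4, 2, 1), positions = c(12, 21, 25), length = 37)
dense <- c(rep(0, 11), 4, rep(0, 8), 2, rep(0, 3), 1, rep(0, 12))

sparse_sum(x)      # 7, same as sum(dense)
sparse_mean(x)     # 0.1891892, same as mean(dense)
sparse_elt(x, 21)  # 2, same as dense[21]
sparse_elt(x, 2)   # 0, same as dense[2]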

Sparse operations in C

static int altrep_sparse_integer_Elt(SEXP x, R_xlen_t i) {
  SEXP val = extract_val(x);

  SEXP pos = extract_pos(x);
  const int* v_pos = INTEGER_RO(pos);
  const R_xlen_t size = Rf_xlength(pos);

  const R_xlen_t len = extract_len(x);

  const int v_default_val = extract_default_integer(x);

  if (i > len) {
    // OOB of vector itself
    return NA_INTEGER;
  }

  const int needle = (int) i + 1;
  const R_xlen_t loc = binary_search(needle, v_pos, size);

  if (loc == size) {
    // Can't find it, must be the default value
    return v_default_val;
  } else {
    // Look it up in `val`
    return INTEGER_ELT(val, loc);
  }
}
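From R, that Elt method is what makes single-element indexing of a sparse vector cheap. A small sketch, assuming {sparsevctrs} is loaded:

library(sparsevctrs)

x <- sparse_integer(values = c(4, 2, 1), positions = c(12, 21, 25), length = 37)

x[21]  # 2: found in positions, looked up in values
x[2]   # 0: not in positions, so the default value is returned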

sparsevctrs

  • All ALTREP happens in C
  • Operations must match dense operations
  • Very fragile: sparse vectors will materialize (turn dense) if you are not careful
  • {sparsevctrs} contains more helpers

sparsevctrs

  • Contains converters between sparse matrices from {Matrix} and tibbles with sparse vectors (see the sketch below)
  • Is a developer-focused package
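A sketch of going back and forth, assuming the converter functions coerce_to_sparse_tibble() and coerce_to_sparse_matrix() and a sparse matrix with column names:

library(sparsevctrs)
library(Matrix)

m <- rsparsematrix(nrow = 10, ncol = 5, density = 0.2)
colnames(m) <- paste0("x", 1:5)

tbl <- coerce_to_sparse_tibble(m)    # tibble whose columns are sparse vectors
m2  <- coerce_to_sparse_matrix(tbl)  # back to a sparse matrix from {Matrix}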

INTEGRATION

Why are we doing this?

  • Because sparse matrices from {Matrix} don’t work inside tibbles

  • tidymodels is built on tibbles

  • These sparse vectors fit inside tibbles

  • Use sparsevctrs functions to move between sparse matrices and sparse tibbles

Integration in tidymodels

  • Tibbles can carry sparse vectors through all of tidymodels’ internals

  • Recipe steps can produce sparsity

  • Automatic toggle

Integration in tidymodels

sparse tibbles in internals

Making it so sparse matrices are turned into sparse tibbles in all entry points

recipe(), prep(), bake(), fit(), predict(), and augment()

Adding a ton of testing with sparse matrices and sparse tibbles

Calling as_tibble(as.matrix(data)) would break sparsity
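A minimal sketch of a sparse column flowing through those entry points, assuming a recent tidymodels stack where sparse vectors are preserved:

library(tidymodels)
library(sparsevctrs)

dat <- tibble(
  y = rnorm(20),
  x = sparse_double(values = c(3, 1), positions = c(2, 15), length = 20)
)

baked <- recipe(y ~ x, data = dat) |>
  prep() |>
  bake(new_data = NULL)

is_sparse_vector(baked$x)  # TRUE if sparsity survived the round trip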

Integration in tidymodels

recipe steps

Recipe steps that can produce sparse data now have a sparse argument (example after this list)

It takes one of three values:

  • "yes": Generate sparse vectors
  • "no": Don’t generate sparse vectors
  • "auto": Automatically decide (default)
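For example, forcing sparse dummy variables in the motivating recipe, assuming a recipes version recent enough to have the argument:

rec_sparse <- recipe(arr_delay ~ ., data = flights) |>
  step_impute_mean(all_numeric_predictors()) |>
  step_unknown(all_nominal_predictors()) |>
  step_dummy(all_nominal_predictors(), sparse = "yes")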

Integration in tidymodels

Automatic toggle

For the smoothest user experience we try to automatically toggle sparse computations

  1. Does model support sparse data?
  2. Is there high sparsity in the data?

If the answer to both questions is “yes”, tidymodels does things sparsely; otherwise it doesn’t
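In pseudocode the decision is roughly the following. The helper and the 0.9 cutoff are made up for illustration; the real check uses the simulation-based estimate described on the next slides.

# made-up helper, not the real tidymodels internals
should_use_sparsity <- function(supports_sparse, estimated_sparsity) {
  # both answers must be "yes": the engine accepts sparse input and the
  # processed predictors are expected to be mostly zeros
  supports_sparse && estimated_sparsity > 0.9
}

should_use_sparsity(supports_sparse = TRUE, estimated_sparsity = 0.98)  # TRUE
should_use_sparsity(supports_sparse = TRUE, estimated_sparsity = 0.30)  # FALSE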

Integration in tidymodels

detecting high sparsity

The sparsity matters for the parsnip model


We need to estimate the sparsity that comes out of recipes


Each step has a method that reports how it is expected to change sparsity
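As a hypothetical illustration (not the real recipes API): step_dummy() turns a factor with k roughly balanced levels into indicator columns that are each mostly zeros, so the expected sparsity of those columns can be computed directly.

# a dummy column made from a factor with `k` roughly balanced levels is
# about (k - 1) / k zeros
expected_dummy_sparsity <- function(k) {
  (k - 1) / k
}

expected_dummy_sparsity(2)    # 0.5
expected_dummy_sparsity(100)  # 0.99, e.g. a high-cardinality column like dest in flights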

Integration in tidymodels

detecting high sparsity

Simulation study using different sizes of data, different amounts of sparsity, for all supported models

Fit a MARS model on log(dense time / sparse time)

Used {orbital} to embed the fitted model in the toggler
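A sketch of that pipeline on made-up stand-in data, assuming the {earth} engine for MARS and the {orbital} package:

library(tidymodels)
library(orbital)

# made-up stand-in for the simulation results
sim <- tibble(
  sparsity  = runif(500),
  n_rows    = 10 ^ runif(500, 2, 6),
  log_ratio = rnorm(500)  # placeholder for log(dense time / sparse time)
)

mars_spec <- mars() |>
  set_engine("earth") |>
  set_mode("regression")

mars_fit <- workflow(recipe(log_ratio ~ sparsity + n_rows, data = sim), mars_spec) |>
  fit(data = sim)

# orbital() extracts the fitted model as plain R expressions that can be
# embedded without carrying the full model object around
orbital(mars_fit)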

USER EXPERIENCE

User Experience

  • Sparse data input now has native support in tidymodels

  • The automatic toggle can be overridden manually by setting sparse = “yes” or sparse = “no” in the relevant steps

Thank you!