
preprocessing with recipes

AU STAT-427/627

Emil Hvitfeldt

2021-06-05

1 / 37



What happens to the data between read_data() and fit_model()?

2 / 37

Prices of 54,000 round-cut diamonds

library(ggplot2)
diamonds
## # A tibble: 53,940 x 10
##    carat cut       color clarity depth table price     x     y     z
##    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
##  2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
##  3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
##  4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
##  5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
##  6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
##  7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
##  8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
##  9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
## 10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
## # … with 53,930 more rows
3 / 37

Formula expression in modeling

model <- lm(price ~ cut:color + carat + log(depth),
            data = diamonds)

  • Select outcome & predictors
  • Operators to matrix of predictors
  • Inline functions
6 / 37

Work under the hood - model.matrix

model.matrix(price ~ cut:color + carat + log(depth) + table,
             data = diamonds)
## Rows: 53,940
## Columns: 39
## $ `(Intercept)` <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ carat <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, …
## $ `log(depth)` <dbl> 4.119037, 4.091006, 4.041295, 4.133565, 4.147885…
## $ table <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, …
## $ `cutFair:colorD` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ `cutGood:colorD` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ `cutVery Good:colorD` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ `cutPremium:colorD` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ `cutIdeal:colorD` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ `cutFair:colorE` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, …
## $ `cutGood:colorE` <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
...
7 / 37

Downsides

  • Tedious typing with many variables
  • Functions have to be applied manually to each variable

lm(y ~ log(x01) + log(x02) + log(x03) + log(x04) + log(x05) + log(x06) + log(x07) +
     log(x08) + log(x09) + log(x10) + log(x11) + log(x12) + log(x13) + log(x14) +
     log(x15) + log(x16) + log(x17) + log(x18) + log(x19) + log(x20) + log(x21) +
     log(x22) + log(x23) + log(x24) + log(x25) + log(x26) + log(x27) + log(x28) +
     log(x29) + log(x30) + log(x31) + log(x32) + log(x33) + log(x34) + log(x35),
   data = dat)

  • Operations are constrained to single columns

# Not possible
lm(y ~ pca(x01, x02, x03, x04, x05), data = dat)

  • Everything happens at once

You can't apply multiple transformations to the same variable.

  • Connected to the model; calculations are not saved between models

One could manually use model.matrix() and pass the result to the modeling function.

12 / 37
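The last point can be sketched in base R. This is a minimal illustration (using mtcars rather than the diamonds data, so it needs no extra packages): model.matrix() builds the design matrix once, and lm.fit() accepts that precomputed matrix directly, so the same matrix could be reused across models.

```r
# Build the design matrix manually, then hand it to the fitting routine.
X <- model.matrix(mpg ~ cyl:gear + wt + log(disp), data = mtcars)
y <- mtcars$mpg

# lm.fit() takes a precomputed design matrix instead of a formula,
# so X can be shared between several model fits
fit <- lm.fit(x = X, y = y)
coef(fit)
```
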

13 / 37

Recipes

A new package to deal with these problems

Benefits:

  • Modular
  • Pipeable
  • Deferred evaluation
  • Isolates test data from training data
  • Can do things formulas can't
18 / 37

Modularity and pipeability

price ~ cut + color + carat + log(depth) + table

Taking the formula from before, we can rewrite it as the following recipe:

rec <- recipe(price ~ cut + color + carat + depth + table,
              data = diamonds) %>%
  step_log(depth) %>%
  step_dummy(cut, color)

A formula expression specifies the variables, then a log transformation is applied to depth, and lastly dummy variables are created from cut and color.

22 / 37

Deferred evaluation

If we look at the recipe we created, we don't see a dataset; instead, we see a specification

rec
## Data Recipe
##
## Inputs:
##
##       role #variables
##    outcome          1
##  predictor          5
##
## Operations:
##
## Log transformation on depth
## Dummy variables from cut, color
23 / 37

Deferred evaluation

recipes gives a specification of what we intend to do.

No calculations have been carried out yet.

First, we need to prep() the recipe. This will calculate the sufficient statistics needed to perform each of the steps.

prepped_rec <- prep(rec)
24 / 37

Deferred evaluation

prepped_rec
## Data Recipe
##
## Inputs:
##
##       role #variables
##    outcome          1
##  predictor          5
##
## Training data contained 53940 data points and no missing data.
##
## Operations:
##
## Log transformation on depth [trained]
## Dummy variables from cut, color [trained]
25 / 37

Baking

After we have prepped the recipe, we can bake() it to apply all the transformations.

bake(prepped_rec, new_data = diamonds)
## Rows: 53,940
## Columns: 14
## $ carat <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23, 0.…
## $ depth <dbl> 4.119037, 4.091006, 4.041295, 4.133565, 4.147885, 4.139955, 4.…
## $ table <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62, 58…
## $ price <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340, 34…
## $ cut_1 <dbl> 0.6324555, 0.3162278, -0.3162278, 0.3162278, -0.3162278, 0.000…
## $ cut_2 <dbl> 0.5345225, -0.2672612, -0.2672612, -0.2672612, -0.2672612, -0.…
## $ cut_3 <dbl> 3.162278e-01, -6.324555e-01, 6.324555e-01, -6.324555e-01, 6.32…
## $ cut_4 <dbl> 0.1195229, -0.4780914, -0.4780914, -0.4780914, -0.4780914, 0.7…
...
26 / 37

Baking / Juicing

Since the dataset is already calculated after running prep(), we can use juice() to extract it.

juice(prepped_rec)
## Rows: 53,940
## Columns: 14
## $ carat <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23, 0.…
## $ depth <dbl> 4.119037, 4.091006, 4.041295, 4.133565, 4.147885, 4.139955, 4.…
## $ table <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62, 58…
## $ price <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340, 34…
## $ cut_1 <dbl> 0.6324555, 0.3162278, -0.3162278, 0.3162278, -0.3162278, 0.000…
## $ cut_2 <dbl> 0.5345225, -0.2672612, -0.2672612, -0.2672612, -0.2672612, -0.…
## $ cut_3 <dbl> 3.162278e-01, -6.324555e-01, 6.324555e-01, -6.324555e-01, 6.32…
## $ cut_4 <dbl> 0.1195229, -0.4780914, -0.4780914, -0.4780914, -0.4780914, 0.7…
...
27 / 37

recipes workflow

recipe()  ->  prep()      ->  bake()/juice()
(define)      (estimate)      (apply)
28 / 37
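The three stages can be sketched end to end on the diamonds data, assuming the recipes and ggplot2 packages are installed:

```r
library(recipes)
library(ggplot2)  # for the diamonds data

# define: a specification, no computation yet
rec <- recipe(price ~ cut + color + carat + depth + table,
              data = diamonds) %>%
  step_log(depth) %>%
  step_dummy(cut, color)

# estimate: compute the statistics each step needs
prepped <- prep(rec, training = diamonds)

# apply: transform a dataset with the estimated steps
baked <- bake(prepped, new_data = diamonds)

# depth is now on the log scale
head(baked$depth)
```
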

Isolates test & training data

When working with data for predictive modeling, it is important to make sure that no information from the test data leaks into the training data.

recipes avoids this by making sure you only prep() the recipe with the training dataset.

29 / 37
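A sketch of that discipline, assuming recipes, rsample, and ggplot2 are installed (step_normalize() is used here as an illustration; it is not one of the steps from the slides). prep() only ever sees the training data, so the test set is transformed with statistics estimated from training alone:

```r
library(recipes)
library(rsample)
library(ggplot2)  # for the diamonds data

set.seed(1234)
split <- initial_split(diamonds)
train <- training(split)
test  <- testing(split)

rec <- recipe(price ~ carat + depth + table, data = train) %>%
  step_normalize(all_predictors())

# means and sds are estimated from train only
prepped <- prep(rec, training = train)

# test is centered and scaled with the *training* statistics,
# so no information flows from test into the preprocessing
baked_test <- bake(prepped, new_data = test)
```
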

Can do things formulas can't

30 / 37

selectors

It can be annoying to specify variables manually by name. The use of selectors can greatly help you!

rec <- recipe(price ~ ., data = diamonds) %>%
  step_dummy(all_nominal()) %>%
  step_zv(all_numeric()) %>%
  step_center(all_predictors())

all_nominal() selects all the nominal variables.

all_numeric() selects all the numeric variables, even the ones generated by step_dummy().

all_predictors() selects all predictor variables, and will not break even if a variable is removed with step_zv().

34 / 37

Roles

update_role() can be used to give variables roles, which can then be selected with has_role().

Roles can also be set with the role = argument inside steps.

rec <- recipe(price ~ ., data = diamonds) %>%
  update_role(x, y, z, new_role = "size") %>%
  step_log(has_role("size")) %>%
  step_dummy(all_nominal()) %>%
  step_zv(all_numeric()) %>%
  step_center(all_predictors())

35 / 37

PCA extraction

rec <- recipe(price ~ ., data = diamonds) %>%
  step_dummy(all_nominal()) %>%
  step_scale(all_predictors()) %>%
  step_center(all_predictors()) %>%
  step_pca(all_predictors(), threshold = 0.8)

You can write a recipe that extracts enough principal components to explain 80% of the variance.

The loadings are kept in the prepped recipe to make sure other datasets are transformed correctly.

36 / 37
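A quick check of that behavior, assuming recipes and ggplot2 are installed: after prepping, bake() returns only the principal components needed to reach the threshold, named PC1, PC2, and so on, alongside the untouched outcome.

```r
library(recipes)
library(ggplot2)  # for the diamonds data

rec <- recipe(price ~ ., data = diamonds) %>%
  step_dummy(all_nominal()) %>%
  step_scale(all_predictors()) %>%
  step_center(all_predictors()) %>%
  step_pca(all_predictors(), threshold = 0.8)

prepped <- prep(rec, training = diamonds)
baked <- bake(prepped, new_data = diamonds)

# the predictors have been replaced by PC columns; price is kept as-is
names(baked)
```
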

Imputation

By default, recipes does NOT deal with missing data.

There are many steps that perform imputation; some include step_knnimpute(), step_meanimpute(), and step_medianimpute() for numeric variables, and step_unknown() for factors.

37 / 37
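A small sketch of imputation on a toy data frame (the data and column names are made up for illustration), assuming recipes is installed. step_medianimpute() replaces NAs in a numeric column with the training median; step_unknown() assigns missing factor values to a new "unknown" level.

```r
library(recipes)

dat <- data.frame(
  y = c(1, 2, 3, 4),
  x = c(10, NA, 30, 40),
  f = factor(c("a", NA, "b", "a"))
)

rec <- recipe(y ~ ., data = dat) %>%
  step_medianimpute(x) %>%  # NA in x becomes median(10, 30, 40) = 30
  step_unknown(f)           # NA in f becomes the level "unknown"

baked <- bake(prep(rec, training = dat), new_data = dat)
baked
```
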


