Back for the second day of #adventofsteps, where I show you a {recipes} step I hope you will find useful, each and every day for 25 days!
This time we are looking at the extension package {embed} for the step step_collapse_stringdist(). This step will collapse all the levels that have a string distance less than the specified threshold.
Many different types of distances can be selected with the method argument.
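Here is a minimal sketch of how that could look. The toy data and the choice of `distance = 2` are made up for illustration, and `"osa"` is just one of the available methods.

```r
library(recipes)
library(embed)

# Hypothetical toy data: near-duplicate spellings of two words
example_data <- data.frame(
  x = factor(c("hello", "helloo", "helo", "world", "wrld"))
)

rec <- recipe(~ x, data = example_data) |>
  # Collapse levels that are close to each other in "osa" string distance
  step_collapse_stringdist(x, distance = 2, method = "osa") |>
  prep()

baked <- bake(rec, new_data = NULL)

# Fewer levels remain than the five we started with
levels(baked$x)
```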
# A tibble: 6 × 8
     x1    x2 x3       x4 na_ind_x1 na_ind_x2 na_ind_x3 na_ind_x4
  <dbl> <dbl> <lgl> <dbl>     <int>     <int>     <int>     <int>
1     1     1 NA        7         0         0         1         0
2     5    NA NA        8         0         1         1         0
3     8     3 NA        4         0         0         1         0
4    NA     6 NA        2         1         0         1         0
5    NA     2 NA        1         1         0         1         0
6     3     2 NA        1         0         0         1         0
For the third day of #adventofsteps we are back in {recipes}, and we are looking at a different way to handle missing values.
Before you do any imputation on missing values, it might be beneficial to know which predictors had missing data and when. step_indicate_na() handles that with ease.
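A small sketch of the step in action. The toy data is made up for illustration; each selected column gets a companion indicator column with the default `na_ind_` prefix.

```r
library(recipes)

# Hypothetical toy data with missing values in several columns
example_data <- tibble::tibble(
  x1 = c(1, 5, 8, NA, NA, 3),
  x2 = c(1, NA, 3, 6, 2, 2),
  x3 = NA,
  x4 = c(7, 8, 4, 2, 1, 1)
)

baked <- recipe(~ ., data = example_data) |>
  # Add one 0/1 indicator column per predictor, marking where it was NA
  step_indicate_na(all_predictors()) |>
  prep() |>
  bake(new_data = NULL)

baked
```

Running this before any imputation step means the indicators still reflect the original missingness.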
For the fourth day of #adventofsteps we take a look at {textrecipes} for some non-text-related steps.
Some functions are much more strict regarding the names of the columns that are accepted. Things like spaces and non-ASCII characters will sometimes cause errors. step_clean_names() should always give you valid names.
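A minimal sketch, assuming the {janitor} package is installed (the step relies on it, as far as I know). The column names here are made up for illustration.

```r
library(recipes)
library(textrecipes)

# Hypothetical data with inconsistent column names
example_data <- data.frame(Bad.Name = 1:3, MixedCase = 4:6)

baked <- recipe(~ ., data = example_data) |>
  # Rewrite the column names into a consistent, valid form
  step_clean_names(all_predictors()) |>
  prep() |>
  bake(new_data = NULL)

names(baked)
```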
For the fifth day of #adventofsteps we look at ways to handle categorical variables with many levels.
Creating dummy variables can be ineffective when dealing with many levels. Instead, we can use target/likelihood/mean/impact encoding to capture the relationship between a variable (typically the outcome) and our predictors.
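The text above does not name a specific step, so as an assumption on my part, here is a sketch using step_lencode_glm() from {embed}, one of several steps of this kind. The `ames` data comes from {modeldata}.

```r
library(recipes)
library(embed)

data(ames, package = "modeldata")

baked <- recipe(Sale_Price ~ Neighborhood, data = ames) |>
  # Replace each neighborhood with a single number estimated from the outcome
  step_lencode_glm(Neighborhood, outcome = dplyr::vars(Sale_Price)) |>
  prep() |>
  bake(new_data = NULL)

# Neighborhood is now one numeric column instead of many dummy columns
head(baked)
```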
For the seventh day of #adventofsteps we look at time with step_date() and step_time().
Each of these steps takes date or datetime variables and returns a number of extracted components, with the former extracting the “day” and larger elements, and the latter the smaller ones.
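A short sketch of the two steps side by side. The datetime values and the chosen `features` are made up for illustration.

```r
library(recipes)

# Hypothetical datetime data
example_data <- data.frame(
  when = as.POSIXct(
    c("2023-12-07 08:30:15", "2024-02-29 17:05:45"),
    tz = "UTC"
  )
)

baked <- recipe(~ when, data = example_data) |>
  # Day-level and larger components
  step_date(when, features = c("year", "month", "dow")) |>
  # Components smaller than a day
  step_time(when, features = c("hour", "minute", "second")) |>
  prep() |>
  bake(new_data = NULL)

names(baked)
```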
For the eighth day of #adventofsteps we go very controversial, with step_discretize()
There is a lot of talk about whether you should discretize numerical variables into categorical variables. Whether or not it is a good idea, there is a step for it, so you can experiment for yourself.
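If you want to experiment, a sketch could look like this. The predictor and `num_breaks = 4` are arbitrary choices for illustration.

```r
library(recipes)

data(ames, package = "modeldata")

baked <- recipe(Sale_Price ~ Gr_Liv_Area, data = ames) |>
  # Cut the numeric predictor into bins based on its training-set quantiles
  step_discretize(Gr_Liv_Area, num_breaks = 4) |>
  prep() |>
  bake(new_data = NULL)

table(baked$Gr_Liv_Area)
```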
For the ninth day of #adventofsteps we look at one way to deal with cyclical predictors.
step_harmonic() calculates sin() and cos() of the predictors passed to it. With the right frequency and cycle_size, you can extract good signal if it is there.
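As a sketch, assume a made-up hour-of-day predictor, where one full cycle is 24 hours and we want one sine/cosine pair per cycle.

```r
library(recipes)

# Hypothetical hour-of-day predictor
example_data <- data.frame(hour = as.numeric(0:23))

baked <- recipe(~ hour, data = example_data) |>
  # One full sin/cos cycle every 24 units of `hour`
  step_harmonic(hour, frequency = 1, cycle_size = 24) |>
  prep() |>
  bake(new_data = NULL)

names(baked)
```

Hour 23 and hour 0 end up close to each other in the new columns, which the raw numeric value cannot express.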
For the tenth day of #adventofsteps we look at another package, {bestNormalize}.
This community-created package implements the step step_best_normalize(), which gives us new ways of normalizing numerical predictors. Please see the documentation of the package for more information.
For day 11 of #adventofsteps we look at another community package, this time {timetk}
{timetk} is a very nice package for time series analysis. step_timeseries_signature() is similar to the step_date() and step_time() we saw earlier, but this step gives us even more insight using time-series-specific values.
For day 12 of #adventofsteps we look at a low-fi way of dealing with text predictors.
The idea is quite simple: create a number of predictors that count the number of characters, words, periods, emojis and so on. This is what step_textfeature() does.
library(recipes)
library(Matrix) # needs to be loaded for step to work

data(ames, package = "modeldata")

recipe(Sale_Price ~ ., data = ames) |>
  step_dummy(all_nominal_predictors()) |>
  step_nzv(all_numeric_predictors()) |>
  step_normalize(all_numeric_predictors()) |>
  step_nnmf_sparse(all_numeric_predictors()) |>
  prep() |>
  bake(new_data = NULL)
# A tibble: 2 × 1
x
<fct>
1 bad_names
2 oeird_characters
For day 15 of #adventofsteps we look at how to deal with dirty categorical levels.
When we say dirty in this context, we mean that some levels will produce bad column names if used for other things such as dummy variables. step_clean_levels() will make it so all levels only consist of letters, numbers and underscores.
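A minimal sketch, using step_clean_levels() from {textrecipes} and assuming {janitor} is installed. The factor levels are made up for illustration.

```r
library(recipes)
library(textrecipes)

# Hypothetical factor with levels that would make poor column names
example_data <- data.frame(
  x = factor(c("Bad Names", "Weird.Chars!", "Bad Names"))
)

baked <- recipe(~ x, data = example_data) |>
  # Rewrite the levels using only letters, numbers and underscores
  step_clean_levels(x) |>
  prep() |>
  bake(new_data = NULL)

levels(baked$x)
```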
For day 16 of #adventofsteps, we will show a more sophisticated way to discretize your numeric predictors.
step_discretize_cart() from {embed} fits a decision tree using the numeric predictor against the outcome, then replaces the predictor with levels according to the leaves of the tree.
For day 17 of #adventofsteps we look at another hidden gem with step_dummy_multi_choice().
This step shines in exactly one scenario: when multiple columns in our data set are connected in the specific way seen in the example.
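To make the scenario concrete, here is a sketch with made-up survey-style data, where several columns hold answers drawn from the same set of choices. This is my own illustration, not necessarily the data used in the original example.

```r
library(recipes)

# Hypothetical data: each row lists up to two languages a person speaks
example_data <- data.frame(
  lang_1 = c("English", "Spanish", "Danish"),
  lang_2 = c("Danish", NA, "English"),
  stringsAsFactors = TRUE
)

baked <- recipe(~ ., data = example_data) |>
  # Create one shared set of indicator columns across both lang_* columns
  step_dummy_multi_choice(starts_with("lang")) |>
  prep() |>
  bake(new_data = NULL)

baked
```

Using step_dummy() here would instead create separate indicator columns for each of lang_1 and lang_2, which loses the fact that the two columns describe the same thing.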
For day 19 of #adventofsteps we look at a way to deal with weird distributions
step_percentile() will replace the value of each predictor with its percentile from the training set. This will effectively map any distribution into the range [0, 1].
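A quick sketch on the built-in `mtcars` data, chosen only for convenience.

```r
library(recipes)

baked <- recipe(~ mpg + hp, data = mtcars) |>
  # Replace each value with its percentile in the training data
  step_percentile(all_numeric_predictors()) |>
  prep() |>
  bake(new_data = NULL)

# Both columns now live in [0, 1]
range(baked$mpg)
range(baked$hp)
```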
For day 21 of #adventofsteps we are doing another multi step day! This time talking about splines
We recently added a new batch of spline steps, all with the function signature step_spline_*(). These additions greatly expand the types of splines you can create.
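As one example from the family, here is a sketch using step_spline_natural(), assuming the {splines2} package is installed. The data and `deg_free = 4` are arbitrary choices for illustration.

```r
library(recipes)

baked <- recipe(mpg ~ hp, data = mtcars) |>
  # Replace hp with a natural spline basis expansion
  step_spline_natural(hp, deg_free = 4) |>
  prep() |>
  bake(new_data = NULL)

names(baked)
```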
# A tibble: 4 × 2
a b
<dbl> <dbl>
1 1 6
2 2 5
3 3 4
4 4 3
For day 22 of #adventofsteps we look at a way to deal with the very specific problem of having linearly combined predictors.
Some models and methods don’t like it if numeric predictors have exact linear combinations between them, as it can make it hard to invert the matrix. This is luckily easy to deal with using step_lincomb().
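A small sketch with made-up data, where the column `c` is constructed as an exact linear combination of `a` and `b`.

```r
library(recipes)

# Hypothetical data: c is exactly a + b
example_data <- data.frame(
  a = c(1, 2, 3, 4),
  b = c(2, 1, 4, 3)
)
example_data$c <- example_data$a + example_data$b

baked <- recipe(~ ., data = example_data) |>
  # Drop columns until no exact linear combinations remain
  step_lincomb(all_numeric_predictors()) |>
  prep() |>
  bake(new_data = NULL)

names(baked)
```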
# A tibble: 4 × 2
a b
<dbl> <dbl>
1 1 1
2 2 2
3 3 2
4 4 1
For day 23 of #adventofsteps we look at the issue of zero variance predictors.
Some methods don’t like it when a predictor has zero variance. Zero variance is a fancy way of saying that it only takes one value. These variables can be removed with no harm, as they don’t contain any information by definition.
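The text above does not name the step, so as an assumption on my part, here is a sketch using step_zv() from {recipes}. The toy data is made up, with `c` being constant.

```r
library(recipes)

# Hypothetical data: c takes a single value
example_data <- data.frame(
  a = c(1, 2, 3, 4),
  b = c(1, 2, 2, 1),
  c = c(5, 5, 5, 5)
)

baked <- recipe(~ ., data = example_data) |>
  # Remove predictors that only take one value in the training data
  step_zv(all_predictors()) |>
  prep() |>
  bake(new_data = NULL)

names(baked)
```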
For day 24 of #adventofsteps we look at a fun alternative to dummy variables
Feature hashing is an interesting technique where you create dummy variables, but instead of giving each level its own column, you run the level through a hashing function to determine the column. This means that any number of levels can be put into a fixed number of columns.
For day 25 of #adventofsteps we have an all-rounder!
If there are any simple calculations that aren’t implemented already, you can do them directly with step_mutate(), which works much like the mutate() you already know.
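For instance, a sketch on the built-in `mtcars` data. The new column name `hp_per_wt` is made up for illustration.

```r
library(recipes)

baked <- recipe(mpg ~ wt + hp, data = mtcars) |>
  # Same syntax as dplyr::mutate(): new_column = expression
  step_mutate(hp_per_wt = hp / wt) |>
  prep() |>
  bake(new_data = NULL)

head(baked)
```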