Week 13 - Trees

Download template here

We will be using tidymodels and the beans data set.

We will also use a couple of other packages: rpart, rpart.plot, ranger, and vip. The {beans} data set needs no further introduction, so we will split it right away.

library(tidymodels)
library(beans)  # assumed source of the beans data set

set.seed(1234)
beans_split <- initial_split(beans)

beans_train <- training(beans_split)
beans_test <- testing(beans_split)
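To see what the split produced, we can look at the row counts and the class balance. This is a minimal sketch, assuming the data comes from the {beans} package and that initial_split() is left at its default proportion of 3/4 for training; the names split_sizes and class_share are only illustrative.

```r
library(tidymodels)
library(beans)  # assumption: this package supplies the `beans` data frame

set.seed(1234)
beans_split <- initial_split(beans)
beans_train <- training(beans_split)
beans_test <- testing(beans_split)

# Number of rows in each part of the split
split_sizes <- c(train = nrow(beans_train), test = nrow(beans_test))
split_sizes

# Share of each bean class in the training set
class_share <- beans_train %>%
  count(class) %>%
  mutate(prop = n / sum(n))
class_share
```

If the classes turn out to be unevenly represented, passing strata = class to initial_split() is one option for keeping the proportions similar in both parts.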

The models

We will start with a decision_tree(). Decision trees can be used for both regression and classification, so we need to set the mode for this model. We will be using the rpart package as the engine.

tree_spec <- decision_tree() %>%
  set_mode("classification") %>%
  set_engine("rpart")

and then we will fit it right away. We could do some pre-processing, but in this case it should not be needed. Trees can handle the numeric data just fine, even without scaling.

bean_tree <- fit(tree_spec, class ~ ., data = beans_train)
bean_tree

Printing the fit writes out the whole tree as text.

We can use the rpart.plot() function from the rpart.plot package to visualize the tree in its entirety.

bean_tree %>%
  extract_fit_engine() %>%
  rpart.plot::rpart.plot(box.palette = as.list(rainbow(7)))

We can also get the confusion matrix on the training data

bean_tree %>%
  augment(new_data = beans_train) %>%
  conf_mat(class, .pred_class) %>%
  autoplot(type = "heatmap")

and calculate the accuracy.

bean_tree %>%
  augment(new_data = beans_train) %>%
  accuracy(class, .pred_class)
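The accuracy above is computed on the same data the tree was fit to, so it tends to be optimistic. As a hedged sketch, the same pipeline can be pointed at beans_test to get a held-out estimate; the names train_acc and test_acc are only illustrative, and the {beans} package is assumed as the data source.

```r
library(tidymodels)
library(beans)  # assumption: this package supplies the `beans` data frame

set.seed(1234)
beans_split <- initial_split(beans)
beans_train <- training(beans_split)
beans_test <- testing(beans_split)

tree_spec <- decision_tree() %>%
  set_mode("classification") %>%
  set_engine("rpart")
bean_tree <- fit(tree_spec, class ~ ., data = beans_train)

# Accuracy on the data the tree was trained on
train_acc <- bean_tree %>%
  augment(new_data = beans_train) %>%
  accuracy(class, .pred_class)

# Accuracy on the held-out test set
test_acc <- bean_tree %>%
  augment(new_data = beans_test) %>%
  accuracy(class, .pred_class)

# Stack the two results, labelled by the data they came from
bind_rows(train = train_acc, test = test_acc, .id = "data")
```

A large gap between the two rows would suggest that the tree is overfitting the training data.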

Decision trees have a couple of parameters we could change. Below is what happens when we increase the cost complexity (for the rpart engine it defaults to 0.01).

bean_tree01 <- tree_spec %>%
  set_args(cost_complexity = 0.1) %>%
  fit(class ~ ., data = beans_train)

This tree is a lot smaller. It is so small that it even neglects a couple of the classes.

bean_tree01 %>%
  extract_fit_engine() %>%
  rpart.plot::rpart.plot(box.palette = as.list(rainbow(7)))
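To put a number on how much smaller the tree gets, we can count its terminal nodes. The sketch below assumes the rpart default of 0.01 for the first fit and uses a hypothetical helper, count_leaves(), that is not part of any package; it relies on rpart labelling terminal nodes as "<leaf>" in the var column of the fitted object's frame.

```r
library(tidymodels)
library(beans)  # assumption: this package supplies the `beans` data frame

set.seed(1234)
beans_train <- training(initial_split(beans))

tree_spec <- decision_tree() %>%
  set_mode("classification") %>%
  set_engine("rpart")

# Hypothetical helper: fit a tree at a given cost complexity and count its leaves
count_leaves <- function(cp) {
  fitted <- tree_spec %>%
    set_args(cost_complexity = cp) %>%
    fit(class ~ ., data = beans_train)
  # rpart marks terminal nodes with "<leaf>" in the `var` column of `frame`
  sum(extract_fit_engine(fitted)$frame$var == "<leaf>")
}

leaves <- c(default_cp = count_leaves(0.01), larger_cp = count_leaves(0.1))
leaves
```

A tree with fewer leaves than there are classes cannot predict every class, which is what the plot shows.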

We can see this in its confusion matrix: the model is simply unable to predict 2 of our classes.

bean_tree01 %>%
  augment(new_data = beans_train) %>%
  conf_mat(class, .pred_class) %>%
  autoplot(type = "heatmap")
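Rather than picking a cost complexity by hand, a value could be chosen by cross-validation. The following is only a sketch under some assumptions: 5 folds and 5 grid levels are arbitrary choices to keep the run short, the grid spans the default range of the dials cost_complexity() parameter, and the names tune_spec, cp_grid, tune_res, and best_cp are illustrative.

```r
library(tidymodels)
library(beans)  # assumption: this package supplies the `beans` data frame

set.seed(1234)
beans_train <- training(initial_split(beans))

# Mark cost_complexity as a parameter to be tuned
tune_spec <- decision_tree(cost_complexity = tune()) %>%
  set_mode("classification") %>%
  set_engine("rpart")

# Assumption: 5 folds, chosen only to keep the run time short
set.seed(1234)
beans_folds <- vfold_cv(beans_train, v = 5)

# Regular grid over the default range of cost_complexity()
cp_grid <- grid_regular(cost_complexity(), levels = 5)

tune_wf <- workflow() %>%
  add_model(tune_spec) %>%
  add_formula(class ~ .)

tune_res <- tune_grid(
  tune_wf,
  resamples = beans_folds,
  grid = cp_grid,
  metrics = metric_set(accuracy)
)

# Candidate value with the highest cross-validated accuracy
best_cp <- select_best(tune_res, metric = "accuracy")
best_cp
```

The selected value could then be passed to finalize_workflow() before refitting on the full training set.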

If we wanted some improvements, it might be worth taking a look at a random forest model. We set importance = "impurity" so we can extract variable importance afterwards.

rf_spec <- rand_forest(trees = 100) %>%
  set_mode("classification") %>%
  set_engine("ranger", importance = "impurity")

Now we fit as usual.

rf_fit <- fit(rf_spec, class ~ ., data = beans_train)

Let us look at the performance. It is doing quite well, although this accuracy is measured on the training data.

rf_fit %>%
  augment(new_data = beans_train) %>%
  accuracy(class, .pred_class)

The confusion matrix looks good too.

rf_fit %>%
  augment(new_data = beans_train) %>%
  conf_mat(class, .pred_class)

Lastly, we can use the vip() function from the vip package to extract and visualize the variable importance in this model.

vip::vip(rf_fit)
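If the numbers behind the plot are wanted, they can be read from the underlying ranger object, which stores the scores in its variable.importance element when importance is set. This is a minimal sketch; imp and importance_tbl are illustrative names, and the exact values will vary with the seed.

```r
library(tidymodels)
library(beans)  # assumption: this package supplies the `beans` data frame

set.seed(1234)
beans_train <- training(initial_split(beans))

rf_spec <- rand_forest(trees = 100) %>%
  set_mode("classification") %>%
  set_engine("ranger", importance = "impurity")
rf_fit <- fit(rf_spec, class ~ ., data = beans_train)

# ranger keeps the importance scores as a named numeric vector
imp <- extract_fit_engine(rf_fit)$variable.importance

# Turn the vector into a tibble sorted from most to least important
importance_tbl <- tibble(variable = names(imp), importance = unname(imp)) %>%
  arrange(desc(importance))
importance_tbl
```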

Your turn

You have 2 options. A, tune some parameters to see if you can improve the performance. B, give boosted trees a try. Boosted trees will require some hyperparameter tuning as well. Try using grid_latin_hypercube() instead of grid_regular(). The random forest we fit performed really well on the training data, so we should probably use some cross-validation to detect whether we are overfitting.
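As a starting point for the cross-validation check, here is a hedged sketch that resamples the random forest with fit_resamples(). The choice of 5 folds is an assumption made to keep the run short, and the names beans_folds, rf_wf, rf_res, and cv_metrics are illustrative.

```r
library(tidymodels)
library(beans)  # assumption: this package supplies the `beans` data frame

set.seed(1234)
beans_train <- training(initial_split(beans))

rf_spec <- rand_forest(trees = 100) %>%
  set_mode("classification") %>%
  set_engine("ranger")

# Assumption: 5 folds, chosen only to keep the run time short
set.seed(1234)
beans_folds <- vfold_cv(beans_train, v = 5)

rf_wf <- workflow() %>%
  add_model(rf_spec) %>%
  add_formula(class ~ .)

# Fit the forest on each analysis set and score it on the matching assessment set
rf_res <- fit_resamples(
  rf_wf,
  resamples = beans_folds,
  metrics = metric_set(accuracy)
)

# Mean accuracy across folds, with its standard error
cv_metrics <- collect_metrics(rf_res)
cv_metrics
```

If the cross-validated accuracy is clearly lower than the training accuracy, the forest is fitting the training data more closely than it generalizes.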