
Tree-Based Methods

AU STAT627

Emil Hvitfeldt

2021-06-16

1 / 35

Overview

We will cover 4 new methods today

  • Decision Trees
  • Bagging
  • Random Forests
  • Boosting
2 / 35

Overview

We will cover 4 new methods today

  • Decision Trees
  • Bagging
  • Random Forests
  • Boosting

Decision trees act as the building block for this chapter

3 / 35

Decision Trees

Given a problem, give me a flowchart of if-else statements that finds the answer

4 / 35

Penguins

5 / 35

Penguins

6 / 35

The flowchart

7 / 35

The rules

## ..y Ade Chi Gen
## Adelie [.97 .03 .00] when flipper_length_mm < 207 & bill_length_mm < 43
## Chinstrap [.06 .92 .02] when flipper_length_mm < 207 & bill_length_mm >= 43
## Gentoo [.02 .04 .95] when flipper_length_mm >= 207
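
The slide does not show the code behind these rules; here is a minimal sketch that produces output in this format, assuming the palmerpenguins data and the rpart / rpart.plot packages:

# Fit a classification tree and print it as a set of if-else rules
# (assumes the palmerpenguins data and the rpart / rpart.plot packages)
library(palmerpenguins)
library(rpart)
library(rpart.plot)

tree_fit <- rpart(
  species ~ flipper_length_mm + bill_length_mm,
  data = penguins
)

rpart.rules(tree_fit)   # prints rules like the ones above
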
8 / 35

General setup

  • We divide the predictor space into multiple non-overlapping regions ( R_1, R_2, ..., R_J ).
  • Every observation that falls into a region will have the same prediction, and that prediction will be based on the observations in that region
    • Regression: the mean value in the region
    • Classification: the most common class in the region (see the sketch below)
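
A small illustrative sketch (not from the slides) of these two prediction rules, assuming the palmerpenguins data and a single split at flipper_length_mm >= 207:

library(palmerpenguins)

# Assign every observation to one of two regions
region <- ifelse(penguins$flipper_length_mm >= 207, "R1", "R2")

# Regression: predict the mean response within each region
tapply(penguins$body_mass_g, region, mean, na.rm = TRUE)

# Classification: predict the most common class within each region
tapply(penguins$species, region, function(x) names(which.max(table(x))))
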
9 / 35

General setup

The regions could in theory be any shape, but for simplicity we use rectangles/boxes to partition the space

The main goal is to build a partition that minimizes some loss such as RSS

\sum_{j=1}^J \sum_{i \in R_j} \left(y_i - \hat y_{R_j} \right)^2

10 / 35

General setup

It is generally computationally infeasible to consider all possible partitions

We use a recursive binary splitting procedure to grow the tree

This is a top-down approach since we start at the top and split our way down

It is greedy because we select the best possible split each time
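
An illustrative sketch (not the slide's code) of one greedy step of recursive binary splitting for a single numeric predictor, choosing the split point that minimizes RSS:

# For one predictor x and response y, try every candidate split point
# and return the one with the smallest RSS over the two resulting regions
best_split <- function(x, y) {
  candidates <- sort(unique(x))[-1]   # drop the smallest so both sides are non-empty
  rss <- sapply(candidates, function(s) {
    left  <- y[x <  s]
    right <- y[x >= s]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  candidates[which.min(rss)]
}

# Example: best single split of body mass on flipper length (palmerpenguins data)
library(palmerpenguins)
dat <- na.omit(penguins[, c("flipper_length_mm", "body_mass_g")])
best_split(dat$flipper_length_mm, dat$body_mass_g)
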

11 / 35

Details

How many times should we split?

If we keep splitting, we end up with each observation in its own region, giving us a wildly flexible model

We can control a number of different things; two simple ones are (see the sketch after this list)

  • Tree depth: the maximum depth of the tree
  • Minimum node size: the minimum number of data points in a node required for the node to be split further
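
A hedged sketch of setting these two controls with rpart (maxdepth and minsplit are rpart's names for them):

library(rpart)
library(palmerpenguins)

tree_fit <- rpart(
  species ~ .,
  data = penguins,
  control = rpart.control(
    maxdepth = 3,    # maximum depth of the tree
    minsplit = 20    # minimum observations in a node before it can be split further
  )
)
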
12 / 35

Tree Pruning

Due to the way decision trees are grown, it can be beneficial to grow a larger tree first and then go back and prune it to reduce its complexity afterwards
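
A hedged sketch of this grow-then-prune idea using rpart's cost-complexity pruning (the exact workflow from the lecture is not shown on the slide):

library(rpart)
library(palmerpenguins)

# Grow a deliberately large tree by turning off the complexity penalty
big_tree <- rpart(
  body_mass_g ~ .,
  data = penguins,
  control = rpart.control(cp = 0, minsplit = 2)
)

printcp(big_tree)                           # cross-validated error for each subtree size

pruned_tree <- prune(big_tree, cp = 0.01)   # prune back at a chosen complexity value
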

13 / 35

Regression "curves"

14 / 35

Regression "curves"

15 / 35

Regression "curves"

16 / 35

Regression "curves"

17 / 35

Regression "curves"

18 / 35

Regression "curves"

19 / 35

Decision boundary

20 / 35

Decision boundary

21 / 35

Decision boundary

22 / 35

Decision boundary

23 / 35

Pros and Cons

Pros

  • Very easy to explain and reason about
  • Can handle qualitative predictors without the need for dummy variables

Cons

  • Don't have great predictive power
  • Non-robust, small changes in the data can give wildly different models
24 / 35

Next Steps

Individual decision trees don't offer great predictive performance due to their simple nature

Bagging, Random Forests, and Boosting use multiple decision trees together to get better performance, with a trade-off of more complexity

25 / 35

Bagging

Decision trees suffer from high variance

We saw in week 3 how bootstrapping could be used to reduce the variance of a statistical learning method

We will use bootstrapping again, this time with decision trees, to reduce the variance. This is feasible since individual decision trees are fast to train

26 / 35

Bagging

"Algorithm"

  • Generate B different bootstrapped training data sets
  • Fit a decision tree on each of the bootstraps to get \hat {f^{*b}}(x)
  • Take the average of all the estimates to get your final estimate (see the sketch after the formula)

\hat{f}_{\text{bag}}(x) = \dfrac{1}{B} \sum^B_{b=1} \hat {f^{*b}}(x)
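
An illustrative sketch (not the slide's code) of this algorithm as an explicit loop over B bootstrap samples, using rpart and the palmerpenguins data:

library(rpart)
library(palmerpenguins)

dat <- na.omit(penguins[, c("body_mass_g", "flipper_length_mm", "bill_length_mm")])
B   <- 100

# Each column holds the predictions of one bootstrapped tree, \hat{f}^{*b}(x)
preds <- sapply(seq_len(B), function(b) {
  boot <- dat[sample(nrow(dat), replace = TRUE), ]   # bootstrapped training set
  fit  <- rpart(body_mass_g ~ ., data = boot)        # tree fit on this bootstrap
  predict(fit, newdata = dat)
})

bagged_pred <- rowMeans(preds)   # average the B estimates
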

27 / 35

Bagging

28 / 35

Bagging Notes

The number of bootstraps is not very important here; you just need a value of B large enough for the error to settle down, and ~100 seems to work well

You do not overfit by increasing B, you just increase the run-time

Bagged trees offer quite low interpretability since the result is a mixture of multiple models

We can obtain a summary of the variable importance of our model by looking at the average amount the RSS decreases due to splits on a given predictor
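
A hedged sketch of such a variable-importance summary using the randomForest package (bagging corresponds to a random forest where every predictor is available at each split):

library(randomForest)
library(palmerpenguins)

dat <- na.omit(penguins)

bag_fit <- randomForest(
  body_mass_g ~ .,
  data = dat,
  mtry = ncol(dat) - 1,   # consider every predictor at each split = bagging
  importance = TRUE
)

importance(bag_fit)   # decrease in node impurity (RSS) per predictor, averaged over trees
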

29 / 35

Random Forest

The random forest method offers an improvement over bagged trees

One of the main downsides to bagged trees is that the trees become quite correlated with each other

When fitting a random forest, we start the same way as with bagged trees, with multiple bootstrapped data sets,

but each time a split in a tree is considered, only a random sample of the predictors can be chosen

30 / 35

Random Forest

The sample is typically m = \sqrt{p} with p predictors

But this value is tunable as well, along with everything that is tunable in a single decision tree
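
A hedged sketch of fitting a random forest with m = sqrt(p), again using the randomForest package and the palmerpenguins data:

library(randomForest)
library(palmerpenguins)

dat <- na.omit(penguins)
p   <- ncol(dat) - 1   # number of predictors

rf_fit <- randomForest(
  species ~ .,
  data = dat,
  mtry  = floor(sqrt(p)),   # random sample of m = sqrt(p) predictors per split
  ntree = 500
)
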

31 / 35

Random Forest

32 / 35

Boosting

Boosting is a general approach that can be used with many statistical machine learning methods

In bagging we fit multiple decision trees side by side

In boosting we fit multiple decision trees back to back

33 / 35

Boosting

Algorithm

  • Fit a tree \hat {f^b} to the current residuals
  • Update the final fit using a shrunken version of the tree
  • Update the residuals using a shrunken version of the tree
  • Repeat B times (see the sketch after the final model)

Final model

\hat f(x)= \sum_{b=1}^B \lambda \hat {f^b}(x)
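
An illustrative sketch (not the slide's code) of this algorithm for regression, fitting B small trees to the residuals with shrinkage lambda, using rpart and the palmerpenguins data:

library(rpart)
library(palmerpenguins)

dat <- na.omit(penguins[, c("body_mass_g", "flipper_length_mm", "bill_length_mm")])
B      <- 1000
lambda <- 0.01

f_hat <- rep(0, nrow(dat))   # current fit, starts at 0
dat$r <- dat$body_mass_g     # current residuals, start as the response

for (b in seq_len(B)) {
  tree  <- rpart(r ~ flipper_length_mm + bill_length_mm, data = dat,
                 control = rpart.control(maxdepth = 1))   # small tree \hat{f}^b
  step  <- predict(tree, newdata = dat)
  f_hat <- f_hat + lambda * step   # update the fit with a shrunken version of the tree
  dat$r <- dat$r - lambda * step   # update the residuals the same way
}
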

34 / 35

Boosting

Large values of B can result in overfitting

The shrinkage parameter \lambda typically takes a small value but will need to be tuned

The number of splits d will need to be tuned as well, typically very small trees are fit during boosting
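
A hedged sketch of how these tuning parameters map onto the gbm package (n.trees = B, shrinkage = lambda, interaction.depth = d):

library(gbm)
library(palmerpenguins)

dat <- as.data.frame(na.omit(penguins[, c("body_mass_g", "flipper_length_mm", "bill_length_mm")]))

boost_fit <- gbm(
  body_mass_g ~ .,
  data = dat,
  distribution = "gaussian",
  n.trees = 1000,           # B: too large a value can overfit
  shrinkage = 0.01,         # lambda: typically a small value
  interaction.depth = 1     # d: number of splits per tree, typically very small
)
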

35 / 35