Last week we looked at regression tasks. In regression the response variable Y is quantitative
In classification tasks, the response variable Y is qualitative
This difference will present some challenges we will cover this week
Two or more classes
There can be uncertainty
Can be more than one class at the same time
Conceptually creates a linear decision boundary separating 2 classes
Low flexibility, explainable method
(we will talk about LDA, QDA, and K-nearest neighbors on Wednesday)
You might ask
Suppose we want to classify what kind of wine to market:
Y has to be numeric for a linear model to work.
We could encode red = 0, white = 1.
But what would happen if we let \hat{Y} > 1 or \hat{Y} < 0?
What if we have more than 2 classes?
We can't do red = 1, white = 2, rosé = 3, dessert = 4, sparkling = 5 because there is no natural ordering, and nothing indicates that dessert wine is twice white wine
Logistic regression (abstractly) models the probability that Y corresponds to a particular category
Now some mathematics!
We want to model the relationship between p(X) = Pr(Y = 1|X) and X.
If we use a linear formulation
p(X) = \beta_0 + \beta_1X
then for some values of X we will get p(X) < 0 or p(X) > 1, which would be no good!
We need to restrict the values of p(X) to be between 0 and 1
We can use the logistic function
f(x) = \dfrac{e^x}{1+e^x}
Using the logistic function gives us
p(X) = \dfrac{e^{\beta_0 + \beta_1X}}{1 + e^{\beta_0 + \beta_1X}}
Now no matter what the values of X, \beta_0 or \beta_1, p(X) will always be contained between 0 and 1.
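As a quick sanity check (a base-R sketch; the coefficient values below are arbitrary, made up for illustration), `plogis()` computes exactly this logistic transformation:

```r
# Logistic function: p(x) = e^z / (1 + e^z), with z = beta0 + beta1 * x.
# plogis() in base R computes this directly.
beta0 <- 1   # hypothetical intercept
beta1 <- 2   # hypothetical slope
x <- seq(-10, 10, by = 0.5)
p <- plogis(beta0 + beta1 * x)  # same as exp(z) / (1 + exp(z))
range(p)  # always strictly between 0 and 1
```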
If we start with
p(X) = \dfrac{e^{\blue{\beta_0 + \beta_1X}}}{1 + e^{\blue{\beta_0 + \beta_1X}}}
we see that the highlighted part looks familiar: it is the same linear combination we saw in linear regression last week
Explain what the parameter estimates mean
If we start with
p(X) = \dfrac{e^{\beta_0 + \beta_1X}}{1 + e^{\beta_0 + \beta_1X}}
rearrangement gives
\dfrac{p(X)}{1 - p(X)} = e^{\beta_0 + \beta_1X}
This is called the odds and can take any value between 0 and \infty.
Taking the logarithm gives
\log\left(\dfrac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1X
The left-hand side is called the log-odds or logit.
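We can verify the logit identity numerically (a small base-R check with arbitrary coefficient values; `qlogis()` is base R's logit, the inverse of `plogis()`):

```r
# Check that log(p / (1 - p)) recovers the linear predictor beta0 + beta1 * x
beta0 <- -1.5  # arbitrary values for illustration
beta1 <- 0.8
x <- c(-2, 0, 1, 3)
z <- beta0 + beta1 * x          # linear predictor (the log-odds)
p <- plogis(z)                  # p(X) from the logistic function
log_odds <- log(p / (1 - p))    # the logit; same as qlogis(p)
all.equal(log_odds, z)          # TRUE
```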
Logistic regression is not modeling classes
Logistic regression is modeling the probability that Y is equal to one of the classes
Logistic regression turns into a classifier by picking a cutoff (usually 50%) and classifying according to this threshold.
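As a minimal sketch of that thresholding step (the probabilities and class labels below are made up):

```r
# Turning predicted probabilities into class labels with a 50% cutoff
p_hat <- c(0.1, 0.45, 0.5, 0.73, 0.99)  # made-up predicted probabilities
pred_class <- ifelse(p_hat > 0.5, "positive", "negative")
pred_class
```

The cutoff is a modeling choice: lowering it makes the classifier predict "positive" more often.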
Understanding:
Increasing X by one unit changes the log-odds by \beta_1 (equivalently, multiplies the odds by e^{\beta_1})
The amount of change in p(X) depends on the current value of X
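A small numerical illustration of the coefficient interpretation (arbitrary made-up coefficients):

```r
# A one-unit increase in X adds beta1 to the log-odds,
# i.e. multiplies the odds by exp(beta1)
beta0 <- 0.5   # hypothetical intercept
beta1 <- 0.3   # hypothetical slope
odds <- function(x) exp(beta0 + beta1 * x)
odds(2) / odds(1)   # ratio of odds at x = 2 vs x = 1
exp(beta1)          # the same number: the odds ratio
```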
Fitting the model gives us \hat{\beta_0} and \hat{\beta_1} which we can use to construct \hat{p}(X)
\hat{p}(X) = \dfrac{e^{\hat{\beta_0} + \hat{\beta_1}X}}{1 + e^{\hat{\beta_0} + \hat{\beta_1}X}}
Plugging in the values of \hat{\beta_0}, \hat{\beta_1} and X gives us a prediction
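For instance (a sketch with made-up estimates, not the fitted penguin model below):

```r
# Plugging estimated coefficients into p-hat(X); these estimates are
# hypothetical values chosen for illustration
beta0_hat <- -10
beta1_hat <- 0.25
x_new <- 45   # a new observation
p_hat <- exp(beta0_hat + beta1_hat * x_new) /
  (1 + exp(beta0_hat + beta1_hat * x_new))
p_hat  # predicted probability that Y = 1 for this observation
```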
library(palmerpenguins)
penguins2 <- penguins %>%
  mutate(species = factor(species == "Adelie", labels = c("Adelie", "Not Adelie")))

library(parsnip)
lr_spec <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

lr_fit <- lr_spec %>%
  fit(species ~ bill_length_mm + bill_depth_mm + body_mass_g, data = penguins2)
lr_fit
## parsnip model object
## 
## Fit time:  4ms 
## 
## Call:  stats::glm(formula = species ~ bill_length_mm + bill_depth_mm + 
##     body_mass_g, family = stats::binomial, data = data)
## 
## Coefficients:
##    (Intercept)  bill_length_mm   bill_depth_mm     body_mass_g  
##      32.965109       -4.903438        8.616116        0.006746  
## 
## Degrees of Freedom: 341 Total (i.e. Null);  338 Residual
##   (2 observations deleted due to missingness)
## Null Deviance:      469.4 
## Residual Deviance: 9.652    AIC: 17.65
tidy(lr_fit)
## # A tibble: 4 x 5
##   term           estimate std.error statistic p.value
##   <chr>             <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)    33.0      25.6          1.29  0.199 
## 2 bill_length_mm -4.90      2.65        -1.85  0.0647
## 3 bill_depth_mm   8.62      4.81         1.79  0.0733
## 4 body_mass_g     0.00675   0.00385      1.75  0.0800
We have so far only talked about what happens with 2 classes
Standard logistic regression isn't able to work with more than 2 classes, since it finds a single decision boundary separating 2 classes
To evaluate a classifier we need to quantify how good and bad it is performing
Different metrics will be different algebraic combinations of the confusion-matrix counts (TP, FP, TN, FN)
\dfrac{TN + TP}{TN + FN + FP + TP}
Percentage of correct predictions
Drawback: If there are two classes A and B split 99% and 1%, you can get an accuracy of 99% by always predicting A
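This drawback is easy to demonstrate (a base-R sketch with made-up labels):

```r
# With a 99%/1% class split, always predicting the majority class "A"
# scores 99% accuracy without learning anything about the data
truth <- c(rep("A", 99), rep("B", 1))
pred  <- rep("A", 100)          # the "always predict A" classifier
accuracy <- mean(pred == truth)
accuracy  # 0.99
```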
\dfrac{TP}{FP + TP}
Defined as the proportion of predicted positives that are truly positive (also known as precision)
\dfrac{TN}{FP + TN}
Measures the proportion of negatives that are correctly identified as negatives
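Putting these metrics together from confusion-matrix counts (the counts below are made up for illustration):

```r
# Computing classification metrics from confusion-matrix counts
TP <- 40; FP <- 10; TN <- 45; FN <- 5   # hypothetical counts
accuracy    <- (TN + TP) / (TN + FN + FP + TP)  # share of correct predictions
precision   <- TP / (FP + TP)                   # predicted positives that are right
specificity <- TN / (FP + TN)                   # negatives correctly identified
c(accuracy = accuracy, precision = precision, specificity = specificity)
```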
We have spent some time talking about fitting models and measuring performance
However, we need to be careful about how we go about that
Performance metrics calculated on the data that was used to fit the model are likely to mislead
In a prediction model, we are interested in the generalized performance, i.e. how well the model can perform on data it hasn't seen
We split the data into two groups (typically 75%/25%)
We do the modeling on the training data set (it can be multiple models)
And then we use the testing data set ONCE to measure the performance
There are no real guidelines as to how you split the data
80/20 split is also used
It will depend on data size
If you are working in a prediction setting, the testing data set represents fresh new data
If you modify your model you are essentially using information from the future to guide your modeling decisions
This is a kind of data-leakage and it will lead to overconfidence in the model and will come back to bite you once you start using the model
We will talk more about how to efficiently use data in the next two weeks
This stratification also works for regression tasks. The response variable can be binned and sampled within bins to ensure an equal distribution between training and testing data
There is very little downside to using stratified sampling.
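A minimal base-R sketch of what stratified sampling does, assuming a 75% training proportion (in practice `rsample::initial_split(strata = ...)` handles this for you):

```r
# Stratified split: sample 75% of the indices within each class separately,
# so the class balance is preserved in the training data
set.seed(1234)
y <- factor(rep(c("A", "B"), times = c(90, 10)))  # imbalanced toy outcome
train_idx <- unlist(lapply(split(seq_along(y), y),
                           function(i) sample(i, size = floor(0.75 * length(i)))))
prop.table(table(y[train_idx]))   # class proportions close to the full data
```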
Performing the training-testing split is another place where data can leak
Any transformation of the data should be done AFTER the split occurs, so that future information does not affect the modeling process
We bring back the penguins
penguins
## # A tibble: 344 x 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           NA            NA                  NA          NA
##  5 Adelie  Torgersen           36.7          19.3               193        3450
##  6 Adelie  Torgersen           39.3          20.6               190        3650
##  7 Adelie  Torgersen           38.9          17.8               181        3625
##  8 Adelie  Torgersen           39.2          19.6               195        4675
##  9 Adelie  Torgersen           34.1          18.1               193        3475
## 10 Adelie  Torgersen           42            20.2               190        4250
## # … with 334 more rows, and 2 more variables: sex <fct>, year <int>
Use initial_split() from rsample to generate an rsplit object
set.seed(1234) # remember the seed!
penguins_split <- initial_split(penguins)
penguins_split
## <Analysis/Assess/Total>## <258/86/344>
This object stores the information about which observations belong to each data set
training() and testing() are used to extract the training data set and testing data set
set.seed(1234) # remember the seed!
penguins_split <- initial_split(penguins)
penguins_train <- training(penguins_split)
penguins_test <- testing(penguins_split)
dim(penguins_train)

## [1] 258   8

dim(penguins_test)

## [1]  86   8