
Logistic Regression

AU STAT-427/627

Emil Hvitfeldt

2021-5-24

1 / 54

Classification

Last week we looked at regression tasks. In regression, the response variable Y is quantitative

In classification tasks, the response variable Y is qualitative

This difference presents some challenges that we will cover this week

2 / 54

3 / 54

Examples of classification tasks

  • Should we send an email ad to this person?
  • Are these symptoms indicative of cancer?
  • Given an image, which fruit is depicted?

Two or more classes

There can be uncertainty

An observation can belong to more than one class at the same time

4 / 54

Classification visual

5 / 54

Classification visual - decision boundary

6 / 54

Classification visual

7 / 54

Classification visual - no hope

8 / 54

Nonlinear decision boundary

9 / 54

Logistic regression

Conceptually creates a linear decision boundary separating 2 classes

Low flexibility, explainable method

(we will talk about LDA, QDA, and K-nearest neighbors on Wednesday)

10 / 54

Logistic regression

You might ask

  • Why can't you use linear regression?
11 / 54


Response encoding

Suppose we want to classify what kind of wine to market:

  • red
  • white

Y has to be numeric for a linear model to work.

We could encode red = 0, white = 1.

but what would it mean if we got \hat{Y} > 1 or \hat{Y} < 0?

12 / 54

Response encoding

What if we have more than 2 classes?

  • red
  • white
  • rosé
  • dessert
  • sparkling

We can't do red = 1, white = 2, rosé = 3, dessert = 4, sparkling = 5 because there is no natural ordering of the classes, and nothing indicates that dessert wine (4) is twice white wine (2)

13 / 54

Logistic regression

Logistic regression (abstractly) models the probability that Y corresponds to a particular category

Now some mathematics!

14 / 54

The Logistic Model

We want to model the relationship between p(X) = Pr(Y = 1|X) and X.

If we use a linear formulation

p(X) = \beta_0 + \beta_1X

then for some values of X we would get p(X) < 0 or p(X) > 1, which would be no good for a probability!

15 / 54

The Logistic Model

We need to restrict the values of p(X) to be between 0 and 1

We can use the logistic function

f(x) = \dfrac{e^x}{1 + e^x}

16 / 54

The Logistic Model

Using the logistic function gives us

p(X) = \dfrac{e^{\beta_0 + \beta_1X}}{1 + e^{\beta_0 + \beta_1X}}

Now no matter what the values of X, \beta_0 or \beta_1, p(X) will always be contained between 0 and 1.
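As a quick sanity check, base R's plogis() computes exactly this logistic function, e^x / (1 + e^x), so we can verify the boundedness numerically; a minimal sketch:

x <- c(-100, -5, 0, 5, 100)
plogis(x) # every value lies strictly between 0 and 1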

17 / 54

The Logistic Model

If we start with

p(X) = \dfrac{e^{\beta_0 + \beta_1X}}{1 + e^{\beta_0 + \beta_1X}}

and we see that this looks familiar: the exponent is the same linear combination we saw in linear regression last week

18 / 54

odds

If we start with

p(X) = \dfrac{e^{\beta_0 + \beta_1X}}{1 + e^{\beta_0 + \beta_1X}}

which after rearrangement gives

\dfrac{p(X)}{1 - p(X)} = e^{\beta_0 + \beta_1X}
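The rearrangement, step by step: multiply both sides by the denominator, collect the p(X) terms, and divide:

p(X)\left(1 + e^{\beta_0 + \beta_1X}\right) = e^{\beta_0 + \beta_1X}

p(X) = e^{\beta_0 + \beta_1X}\left(1 - p(X)\right)

\dfrac{p(X)}{1 - p(X)} = e^{\beta_0 + \beta_1X}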

19 / 54

odds

If we start with

p(X) = \dfrac{e^{\beta_0 + \beta_1X}}{1 + e^{\beta_0 + \beta_1X}}

which after rearrangement gives

\dfrac{p(X)}{1 - p(X)} = e^{\beta_0 + \beta_1X}

The left-hand side is called the odds and can take any value between 0 and \infty.

20 / 54

log-odds

If we start with

p(X) = \dfrac{e^{\beta_0 + \beta_1X}}{1 + e^{\beta_0 + \beta_1X}}

which after rearrangement gives

\dfrac{p(X)}{1 - p(X)} = e^{\beta_0 + \beta_1X}

and taking the logarithm gives

\log\left(\dfrac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1X

21 / 54

log-odds

\log\left(\dfrac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1X

The left-hand side is called the log-odds or logit.

22 / 54

How is this a classifier?

Logistic regression is not modeling classes

Logistic regression is modeling the probability that Y is equal to one of the classes

Logistic regression turns into a classifier by picking a cutoff (usually 50%) and classifying according to this threshold.
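A minimal sketch of that thresholding step, with made-up predicted probabilities:

probs <- c(0.10, 0.43, 0.62, 0.97) # hypothetical values of Pr(Y = 1)
ifelse(probs > 0.5, "class 1", "class 0") # the 50% cutoff turns probabilities into classes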

23 / 54

Logistic regression decision boundary

24 / 54

Non-linear separator

25 / 54

Coefficients

Understanding:

Increasing X by one unit changes the log-odds by \beta_1, which is the same as multiplying the odds by e^{\beta_1}

The amount of change in p(X) depends on the current value of X
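A small numerical illustration of both points (the slope here is made up):

beta_1 <- 0.7 # hypothetical slope
exp(beta_1) # a one-unit increase in X multiplies the odds by about 2
plogis(0 + beta_1) - plogis(0) # near p = 0.5, the same step moves p(X) by about 0.17
plogis(4 + beta_1) - plogis(4) # near p = 1, it moves p(X) by less than 0.01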

26 / 54

Making Predictions

Fitting the model gives us \hat{\beta_0} and \hat{\beta_1} which we can use to construct \hat{p}(X)

\hat{p}(X) = \dfrac{e^{\hat{\beta_0} + \hat{\beta_1}X}}{1 + e^{\hat{\beta_0} + \hat{\beta_1}X}}

Plugging in the values of \hat{\beta_0}, \hat{\beta_1} and X gives us a prediction
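For example, with made-up estimates \hat{\beta_0} = -6, \hat{\beta_1} = 0.05 and a new observation X = 100:

b0 <- -6; b1 <- 0.05; x <- 100 # hypothetical estimates and observation
exp(b0 + b1 * x) / (1 + exp(b0 + b1 * x)) # predicted probability, about 0.27
plogis(b0 + b1 * x) # the same computation using base R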

27 / 54

Example with penguins

library(palmerpenguins)
library(dplyr)

# collapse the three species into a binary response
# (species == "Adelie" is FALSE/TRUE, and factor levels sort FALSE first)
penguins2 <- penguins %>%
  mutate(species = factor(species == "Adelie",
                          labels = c("Not Adelie", "Adelie")))

library(parsnip)

# model specification: logistic regression fitted with glm
lr_spec <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

# fit the specification to the data
lr_fit <- lr_spec %>%
  fit(species ~ bill_length_mm + bill_depth_mm + body_mass_g,
      data = penguins2)
28 / 54

Example with penguins

lr_fit
## parsnip model object
##
## Fit time: 4ms
##
## Call:  stats::glm(formula = species ~ bill_length_mm + bill_depth_mm +
##     body_mass_g, family = stats::binomial, data = data)
##
## Coefficients:
##    (Intercept)  bill_length_mm   bill_depth_mm     body_mass_g
##      32.965109       -4.903438        8.616116        0.006746
##
## Degrees of Freedom: 341 Total (i.e. Null);  338 Residual
##   (2 observations deleted due to missingness)
## Null Deviance:     469.4
## Residual Deviance: 9.652    AIC: 17.65
29 / 54

Example with penguins

tidy(lr_fit)
## # A tibble: 4 x 5
##   term           estimate std.error statistic p.value
##   <chr>             <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)    33.0      25.6          1.29  0.199
## 2 bill_length_mm -4.90      2.65        -1.85  0.0647
## 3 bill_depth_mm   8.62      4.81         1.79  0.0733
## 4 body_mass_g     0.00675   0.00385      1.75  0.0800
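With the fitted model, parsnip's predict() gives hard class predictions or class probabilities (output not shown):

predict(lr_fit, new_data = penguins2, type = "class") # predicted classes
predict(lr_fit, new_data = penguins2, type = "prob") # class probabilities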
30 / 54

Multi class classification

We have so far only talked about what happens with 2 classes

Logistic regression can't directly handle more than 2 classes, since it finds a single best boundary separating 2 classes

31 / 54

Logistic regression multiclass struggles

32 / 54

Logistic regression multiclass struggles

33 / 54

Evaluation

To evaluate a classifier we need to quantify how well or how badly it is performing

Counting the four possible outcomes gives us the true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN)

Different metrics will be different algebraic combinations of the above numbers

34 / 54

Evaluation metrics

Accuracy

\dfrac{TN + TP}{TN + FN + FP + TP}

Percentage of correct predictions

Drawback: If there are two classes A and B split 99% and 1%, you can get an accuracy of 99% by always predicting A
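A minimal sketch with made-up confusion-matrix counts:

TP <- 50; TN <- 40; FP <- 5; FN <- 5 # hypothetical counts
(TN + TP) / (TN + FN + FP + TP) # accuracy = 0.9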

35 / 54

Evaluation metrics

Sensitivity

\dfrac{TP}{TP + FN}

Defined as the proportion of actual positives that are correctly identified as positive

36 / 54

Evaluation metrics

Specificity

\dfrac{TN}{TN + FP}

Measures the proportion of negatives that are correctly identified as negatives
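Using the same made-up counts as before, both metrics are simple ratios:

TP <- 50; TN <- 40; FP <- 5; FN <- 5 # hypothetical counts
TP / (TP + FN) # sensitivity, about 0.91
TN / (TN + FP) # specificity, about 0.89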

37 / 54

ROC curve

38 / 54

Test-Train split

We have spent some time talking about fitting models and measuring performance

However, we need to be careful about how we go about that

Performance metrics calculated on the data that was used to fit the model are likely to mislead

39 / 54

Test-Train split

In a prediction model, we are interested in the generalization performance, i.e. how well the model performs on data it hasn't seen

40 / 54

Test-Train split

41 / 54

Test-Train split

We split the data into two groups (typically 75%/25%)

  • training data set
  • testing data set

We do the modeling on the training data set (it can be multiple models)

And then we use the testing data set ONCE to measure the performance

42 / 54

Why 75%/25%?

There are no real guidelines as to how you split the data

An 80%/20% split is also used

It will depend on the size of the data

43 / 54

Why just once?

If you are working in a prediction setting, the testing data set represents fresh new data

If you modify your model you are essentially using information from the future to guide your modeling decisions

This is a kind of data leakage; it will lead to overconfidence in the model and will come back to bite you once you start using the model

44 / 54

How will I be able to iterate?

We will talk more about how to efficiently use data in the next two weeks

45 / 54

How should we handle unbalanced classes?

46 / 54

How should we handle unbalanced classes?

47 / 54

How should we handle unbalanced classes?

48 / 54

stratified sampling

This stratification also works for regression tasks: the outcome variable can be binned, and sampling within the bins ensures a similar distribution between the training and testing data

There is very little downside to using stratified sampling.
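A minimal sketch using rsample's strata argument (numeric outcomes are binned automatically before stratifying):

library(rsample)
set.seed(1234)
# keep the species proportions similar in both data sets
penguins_split <- initial_split(penguins, strata = species)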

49 / 54

More Data Leakage

How we perform the training-testing split is another place where data can leak

Any transformation of the data should be done AFTER the split occurs, so that information from the test set cannot affect the modeling process
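A sketch of the idea with hypothetical train and test data frames: estimate the transformation on the training data only, then apply it to both.

# hypothetical training and testing data frames
train <- data.frame(x = c(1, 2, 3, 4))
test <- data.frame(x = c(2.5, 5))

# estimate the scaling on the training data ONLY
m <- mean(train$x)
s <- sd(train$x)
train$x_scaled <- (train$x - m) / s
test$x_scaled <- (test$x - m) / s # reuse the training m and s on the test set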

50 / 54

rsample

rsample provides functionality to perform all different kinds of data splitting with a minimal memory footprint

51 / 54

rsample example

We bring back the penguins

penguins
## # A tibble: 344 x 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           NA            NA                  NA          NA
##  5 Adelie  Torgersen           36.7          19.3               193        3450
##  6 Adelie  Torgersen           39.3          20.6               190        3650
##  7 Adelie  Torgersen           38.9          17.8               181        3625
##  8 Adelie  Torgersen           39.2          19.6               195        4675
##  9 Adelie  Torgersen           34.1          18.1               193        3475
## 10 Adelie  Torgersen           42            20.2               190        4250
## # … with 334 more rows, and 2 more variables: sex <fct>, year <int>
52 / 54

rsample example

Use initial_split() from rsample to generate an rsplit object

set.seed(1234) # remember the seed!
penguins_split <- initial_split(penguins)
penguins_split
## <Analysis/Assess/Total>
## <258/86/344>

This object stores the information about which observations belong to each data set

53 / 54

rsample example

training() and testing() are used to extract the training data set and the testing data set

set.seed(1234) # remember the seed!
penguins_split <- initial_split(penguins)
penguins_train <- training(penguins_split)
penguins_test <- testing(penguins_split)
dim(penguins_train)
## [1] 258 8
dim(penguins_test)
## [1] 86 8
54 / 54