Last week we looked at regression tasks. In regression the response variable Y is quantitative
In classification tasks, the response variable Y is qualitative
This difference will present some challenges we will cover this week
Two or more classes
There can be uncertainty
Can be more than one class at the same time
Conceptually creates a linear decision boundary separating 2 classes
Low flexibility, explainable method
(we will talk about LDA, QDA, and K-nearest neighbors on Wednesday)
You might ask
Suppose we want to classify what kind of wine to market:
Y has to be numeric for a linear model to work.
We could encode red = 0, white = 1.
But what would happen if we let \hat{Y} > 1 or \hat{Y} < 0?
What if we have more than 2 classes?
We can't do red = 1, white = 2, rosé = 3, dessert = 4, sparkling = 5 because there is no natural ordering, and nothing indicates that dessert wine is twice white wine
Logistic regression (abstractly) models the probability that Y corresponds to a particular category
Now some mathematics!
We want to model the relationship between p(X) = Pr(Y = 1|X) and X.
If we use a linear formulation
p(X) = \beta_0 + \beta_1X
then for some values of X we will get p(X) < 0 or p(X) > 1, which would be no good!
We need to restrict the values of p(X) to be between 0 and 1
We can use the logistic function
f(x) = \dfrac{e^x}{1+e^x}
Using the logistic function gives us
p(X) = \dfrac{e^{\beta_0 + \beta_1X}}{1 + e^{\beta_0 + \beta_1X}}
Now no matter what the values of X, \beta_0 or \beta_1, p(X) will always be contained between 0 and 1.
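As a quick sanity check (a base-R sketch; the coefficient values below are arbitrary, made up for illustration), `plogis()` computes exactly this logistic transformation:

```r
# Logistic function: p(x) = e^z / (1 + e^z), with z = beta0 + beta1 * x.
# plogis() in base R computes this directly.
beta0 <- 1   # hypothetical intercept
beta1 <- 2   # hypothetical slope
x <- seq(-10, 10, by = 0.5)
p <- plogis(beta0 + beta1 * x)  # same as exp(z) / (1 + exp(z))
range(p)  # always strictly between 0 and 1
```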
If we start with
p(X) = \dfrac{e^{\blue{\beta_0 + \beta_1X}}}{1 + e^{\blue{\beta_0 + \beta_1X}}}
we see that the highlighted part looks familiar: it is the same linear combination we saw in linear regression last week
Explain what the parameter estimates mean
If we start with
p(X) = \dfrac{e^{\beta_0 + \beta_1X}}{1 + e^{\beta_0 + \beta_1X}}
rearrangement gives
\dfrac{p(X)}{1 - p(X)} = e^{\beta_0 + \beta_1X}
This is called the odds and can take any value between 0 and \infty.
Taking the logarithm gives
\log\left(\dfrac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1X
The left-hand side is called the log-odds or logit.
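We can verify the logit identity numerically (a small base-R check with arbitrary coefficient values; `qlogis()` is base R's logit, the inverse of `plogis()`):

```r
# Check that log(p / (1 - p)) recovers the linear predictor beta0 + beta1 * x
beta0 <- -1.5  # arbitrary values for illustration
beta1 <- 0.8
x <- c(-2, 0, 1, 3)
z <- beta0 + beta1 * x          # linear predictor (the log-odds)
p <- plogis(z)                  # p(X) from the logistic function
log_odds <- log(p / (1 - p))    # the logit; same as qlogis(p)
all.equal(log_odds, z)          # TRUE
```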
Logistic regression is not modeling classes
Logistic regression is modeling the probability that Y is equal to one of the classes
Logistic regression turns into a classifier by picking a cutoff (usually 50%) and classifying according to this threshold.
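As a minimal sketch of that thresholding step (the probabilities and class labels below are made up):

```r
# Turning predicted probabilities into class labels with a 50% cutoff
p_hat <- c(0.1, 0.45, 0.5, 0.73, 0.99)  # made-up predicted probabilities
pred_class <- ifelse(p_hat > 0.5, "positive", "negative")
pred_class
```

The cutoff is a modeling choice: lowering it makes the classifier predict "positive" more often.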
Understanding:
Increasing X by one unit changes the log-odds by \beta_1 (equivalently, multiplies the odds by e^{\beta_1})
The amount of change in p(X) depends on the current value of X
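A small numerical illustration of the coefficient interpretation (arbitrary made-up coefficients):

```r
# A one-unit increase in X adds beta1 to the log-odds,
# i.e. multiplies the odds by exp(beta1)
beta0 <- 0.5   # hypothetical intercept
beta1 <- 0.3   # hypothetical slope
odds <- function(x) exp(beta0 + beta1 * x)
odds(2) / odds(1)   # ratio of odds at x = 2 vs x = 1
exp(beta1)          # the same number: the odds ratio
```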
Fitting the model gives us \hat{\beta_0} and \hat{\beta_1} which we can use to construct \hat{p}(X)
\hat{p}(X) = \dfrac{e^{\hat{\beta_0} + \hat{\beta_1}X}}{1 + e^{\hat{\beta_0} + \hat{\beta_1}X}}
Plugging in the values of \hat{\beta_0}, \hat{\beta_1} and X gives us a prediction
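For instance (a sketch with made-up estimates, not the fitted penguin model below):

```r
# Plugging estimated coefficients into p-hat(X); these estimates are
# hypothetical values chosen for illustration
beta0_hat <- -10
beta1_hat <- 0.25
x_new <- 45   # a new observation
p_hat <- exp(beta0_hat + beta1_hat * x_new) /
  (1 + exp(beta0_hat + beta1_hat * x_new))
p_hat  # predicted probability that Y = 1 for this observation
```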
library(palmerpenguins)
penguins2 <- penguins %>%
  mutate(species = factor(species == "Adelie", labels = c("Adelie", "Not Adelie")))

library(parsnip)
lr_spec <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

lr_fit <- lr_spec %>%
  fit(species ~ bill_length_mm + bill_depth_mm + body_mass_g, data = penguins2)
lr_fit
## parsnip model object
## 
## Fit time:  4ms 
## 
## Call:  stats::glm(formula = species ~ bill_length_mm + bill_depth_mm + 
##     body_mass_g, family = stats::binomial, data = data)
## 
## Coefficients:
##    (Intercept)  bill_length_mm   bill_depth_mm     body_mass_g  
##      32.965109       -4.903438        8.616116        0.006746  
## 
## Degrees of Freedom: 341 Total (i.e. Null);  338 Residual
##   (2 observations deleted due to missingness)
## Null Deviance:      469.4 
## Residual Deviance: 9.652    AIC: 17.65
tidy(lr_fit)
## # A tibble: 4 x 5
##   term           estimate std.error statistic p.value
##   <chr>             <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)    33.0      25.6          1.29  0.199 
## 2 bill_length_mm -4.90      2.65        -1.85  0.0647
## 3 bill_depth_mm   8.62      4.81         1.79  0.0733
## 4 body_mass_g     0.00675   0.00385      1.75  0.0800
We have so far only talked about what happens with 2 classes
Standard logistic regression isn't able to work with more than 2 classes, since it finds a single decision boundary separating 2 classes
To evaluate a classifier we need to quantify how good and bad it is performing
Different metrics will be different algebraic combinations of the confusion-matrix counts (TP, FP, TN, FN)
\dfrac{TN + TP}{TN + FN + FP + TP}
Percentage of correct predictions
Drawback: If there are two classes A and B split 99% and 1%, you can get an accuracy of 99% by always predicting A
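This drawback is easy to demonstrate (a base-R sketch with made-up labels):

```r
# With a 99%/1% class split, always predicting the majority class "A"
# scores 99% accuracy without learning anything about the data
truth <- c(rep("A", 99), rep("B", 1))
pred  <- rep("A", 100)          # the "always predict A" classifier
accuracy <- mean(pred == truth)
accuracy  # 0.99
```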
\dfrac{TP}{FP + TP}
Defined as the proportion of predicted positives that are truly positive (also known as precision)
\dfrac{TN}{FP + TN}
Measures the proportion of negatives that are correctly identified as negatives
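Putting these metrics together from confusion-matrix counts (the counts below are made up for illustration):

```r
# Computing classification metrics from confusion-matrix counts
TP <- 40; FP <- 10; TN <- 45; FN <- 5   # hypothetical counts
accuracy    <- (TN + TP) / (TN + FN + FP + TP)  # share of correct predictions
precision   <- TP / (FP + TP)                   # predicted positives that are right
specificity <- TN / (FP + TN)                   # negatives correctly identified
c(accuracy = accuracy, precision = precision, specificity = specificity)
```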
We have spent some time talking about fitting models and measuring performance
However, we need to be careful about how we go about that
Performance metrics calculated on the data that was used to fit the model are likely to mislead
In a prediction model, we are interested in the generalized performance, i.e. how well the model can perform on data it hasn't seen
We split the data into two groups (typically 75%/25%)
We do the modeling on the training data set (it can be multiple models)
And then we use the testing data set ONCE to measure the performance
There are no real guidelines as to how you split the data
80/20 split is also used
It will depend on data size
If you are working in a prediction setting, the testing data set represents fresh new data
If you modify your model you are essentially using information from the future to guide your modeling decisions
This is a kind of data-leakage and it will lead to overconfidence in the model and will come back to bite you once you start using the model
We will talk more about how to efficiently use data in the next two weeks
This stratification also works for regression tasks. The response variable can be binned and sampled within bins to ensure an equal distribution between training and testing data
There is very little downside to using stratified sampling.
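A minimal base-R sketch of what stratified sampling does, assuming a 75% training proportion (in practice `rsample::initial_split(strata = ...)` handles this for you):

```r
# Stratified split: sample 75% of the indices within each class separately,
# so the class balance is preserved in the training data
set.seed(1234)
y <- factor(rep(c("A", "B"), times = c(90, 10)))  # imbalanced toy outcome
train_idx <- unlist(lapply(split(seq_along(y), y),
                           function(i) sample(i, size = floor(0.75 * length(i)))))
prop.table(table(y[train_idx]))   # class proportions close to the full data
```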
Performing the training-testing split is another place where data can leak
Any transformation of the data should be done AFTER the split occurs, so that future information does not affect the modeling process
We bring back the penguins
penguins
## # A tibble: 344 x 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           NA            NA                  NA          NA
##  5 Adelie  Torgersen           36.7          19.3               193        3450
##  6 Adelie  Torgersen           39.3          20.6               190        3650
##  7 Adelie  Torgersen           38.9          17.8               181        3625
##  8 Adelie  Torgersen           39.2          19.6               195        4675
##  9 Adelie  Torgersen           34.1          18.1               193        3475
## 10 Adelie  Torgersen           42            20.2               190        4250
## # … with 334 more rows, and 2 more variables: sex <fct>, year <int>
Use initial_split() from rsample to generate an rsplit object
set.seed(1234) # remember the seed!
penguins_split <- initial_split(penguins)
penguins_split
## <Analysis/Assess/Total>## <258/86/344>
This object stores the information about which observations belong to each data set
training() and testing() are used to extract the training data set and testing data set
set.seed(1234) # remember the seed!
penguins_split <- initial_split(penguins)
penguins_train <- training(penguins_split)
penguins_test <- testing(penguins_split)
dim(penguins_train)

## [1] 258   8

dim(penguins_test)

## [1]  86   8