One Metric to Fool Yourself

A Cautionary Tale in Machine Learning Evaluation


Emil Hvitfeldt, Posit PBC

Setting the stage

ames data set

library(tidymodels)
glimpse(ames)
Rows: 2,930
Columns: 74
$ MS_SubClass        <fct> One_Story_1946_and_Newer_All_Styles, One_Story_1946…
$ MS_Zoning          <fct> Residential_Low_Density, Residential_High_Density, …
$ Lot_Frontage       <dbl> 141, 80, 81, 93, 74, 78, 41, 43, 39, 60, 75, 0, 63,…
$ Lot_Area           <int> 31770, 11622, 14267, 11160, 13830, 9978, 4920, 5005…
$ Street             <fct> Pave, Pave, Pave, Pave, Pave, Pave, Pave, Pave, Pav…
$ Alley              <fct> No_Alley_Access, No_Alley_Access, No_Alley_Access, …
$ Lot_Shape          <fct> Slightly_Irregular, Regular, Slightly_Irregular, Re…
$ Land_Contour       <fct> Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, HLS, Lvl, Lvl, L…
$ Utilities          <fct> AllPub, AllPub, AllPub, AllPub, AllPub, AllPub, All…
$ Lot_Config         <fct> Corner, Inside, Corner, Corner, Inside, Inside, Ins…
$ Land_Slope         <fct> Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, G…
$ Neighborhood       <fct> North_Ames, North_Ames, North_Ames, North_Ames, Gil…
$ Condition_1        <fct> Norm, Feedr, Norm, Norm, Norm, Norm, Norm, Norm, No…
$ Condition_2        <fct> Norm, Norm, Norm, Norm, Norm, Norm, Norm, Norm, Nor…
$ Bldg_Type          <fct> OneFam, OneFam, OneFam, OneFam, OneFam, OneFam, Twn…
$ House_Style        <fct> One_Story, One_Story, One_Story, One_Story, Two_Sto…
$ Overall_Cond       <fct> Average, Above_Average, Above_Average, Average, Ave…
$ Year_Built         <int> 1960, 1961, 1958, 1968, 1997, 1998, 2001, 1992, 199…
$ Year_Remod_Add     <int> 1960, 1961, 1958, 1968, 1998, 1998, 2001, 1992, 199…
$ Roof_Style         <fct> Hip, Gable, Hip, Hip, Gable, Gable, Gable, Gable, G…
$ Roof_Matl          <fct> CompShg, CompShg, CompShg, CompShg, CompShg, CompSh…
$ Exterior_1st       <fct> BrkFace, VinylSd, Wd Sdng, BrkFace, VinylSd, VinylS…
$ Exterior_2nd       <fct> Plywood, VinylSd, Wd Sdng, BrkFace, VinylSd, VinylS…
$ Mas_Vnr_Type       <fct> Stone, None, BrkFace, None, None, BrkFace, None, No…
$ Mas_Vnr_Area       <dbl> 112, 0, 108, 0, 0, 20, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6…
$ Exter_Cond         <fct> Typical, Typical, Typical, Typical, Typical, Typica…
$ Foundation         <fct> CBlock, CBlock, CBlock, CBlock, PConc, PConc, PConc…
$ Bsmt_Cond          <fct> Good, Typical, Typical, Typical, Typical, Typical, …
$ Bsmt_Exposure      <fct> Gd, No, No, No, No, No, Mn, No, No, No, No, No, No,…
$ BsmtFin_Type_1     <fct> BLQ, Rec, ALQ, ALQ, GLQ, GLQ, GLQ, ALQ, GLQ, Unf, U…
$ BsmtFin_SF_1       <dbl> 2, 6, 1, 1, 3, 3, 3, 1, 3, 7, 7, 1, 7, 3, 3, 1, 3, …
$ BsmtFin_Type_2     <fct> Unf, LwQ, Unf, Unf, Unf, Unf, Unf, Unf, Unf, Unf, U…
$ BsmtFin_SF_2       <dbl> 0, 144, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1120, 0…
$ Bsmt_Unf_SF        <dbl> 441, 270, 406, 1045, 137, 324, 722, 1017, 415, 994,…
$ Total_Bsmt_SF      <dbl> 1080, 882, 1329, 2110, 928, 926, 1338, 1280, 1595, …
$ Heating            <fct> GasA, GasA, GasA, GasA, GasA, GasA, GasA, GasA, Gas…
$ Heating_QC         <fct> Fair, Typical, Typical, Excellent, Good, Excellent,…
$ Central_Air        <fct> Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, …
$ Electrical         <fct> SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SB…
$ First_Flr_SF       <int> 1656, 896, 1329, 2110, 928, 926, 1338, 1280, 1616, …
$ Second_Flr_SF      <int> 0, 0, 0, 0, 701, 678, 0, 0, 0, 776, 892, 0, 676, 0,…
$ Gr_Liv_Area        <int> 1656, 896, 1329, 2110, 1629, 1604, 1338, 1280, 1616…
$ Bsmt_Full_Bath     <dbl> 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, …
$ Bsmt_Half_Bath     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ Full_Bath          <int> 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 3, 2, …
$ Half_Bath          <int> 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, …
$ Bedroom_AbvGr      <int> 3, 2, 3, 3, 3, 3, 2, 2, 2, 3, 3, 3, 3, 2, 1, 4, 4, …
$ Kitchen_AbvGr      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ TotRms_AbvGrd      <int> 7, 5, 6, 8, 6, 7, 6, 5, 5, 7, 7, 6, 7, 5, 4, 12, 8,…
$ Functional         <fct> Typ, Typ, Typ, Typ, Typ, Typ, Typ, Typ, Typ, Typ, T…
$ Fireplaces         <int> 2, 0, 0, 2, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, …
$ Garage_Type        <fct> Attchd, Attchd, Attchd, Attchd, Attchd, Attchd, Att…
$ Garage_Finish      <fct> Fin, Unf, Unf, Fin, Fin, Fin, Fin, RFn, RFn, Fin, F…
$ Garage_Cars        <dbl> 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, …
$ Garage_Area        <dbl> 528, 730, 312, 522, 482, 470, 582, 506, 608, 442, 4…
$ Garage_Cond        <fct> Typical, Typical, Typical, Typical, Typical, Typica…
$ Paved_Drive        <fct> Partial_Pavement, Paved, Paved, Paved, Paved, Paved…
$ Wood_Deck_SF       <int> 210, 140, 393, 0, 212, 360, 0, 0, 237, 140, 157, 48…
$ Open_Porch_SF      <int> 62, 0, 36, 0, 34, 36, 0, 82, 152, 60, 84, 21, 75, 0…
$ Enclosed_Porch     <int> 0, 0, 0, 0, 0, 0, 170, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ Three_season_porch <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ Screen_Porch       <int> 0, 120, 0, 0, 0, 0, 0, 144, 0, 0, 0, 0, 0, 0, 140, …
$ Pool_Area          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ Pool_QC            <fct> No_Pool, No_Pool, No_Pool, No_Pool, No_Pool, No_Poo…
$ Fence              <fct> No_Fence, Minimum_Privacy, No_Fence, No_Fence, Mini…
$ Misc_Feature       <fct> None, None, Gar2, None, None, None, None, None, Non…
$ Misc_Val           <int> 0, 0, 12500, 0, 0, 0, 0, 0, 0, 0, 0, 500, 0, 0, 0, …
$ Mo_Sold            <int> 5, 6, 6, 4, 3, 6, 4, 1, 3, 6, 4, 3, 5, 2, 6, 6, 6, …
$ Year_Sold          <int> 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 201…
$ Sale_Type          <fct> WD , WD , WD , WD , WD , WD , WD , WD , WD , WD , W…
$ Sale_Condition     <fct> Normal, Normal, Normal, Normal, Normal, Normal, Nor…
$ Sale_Price         <int> 215000, 105000, 172000, 244000, 189900, 195500, 213…
$ Longitude          <dbl> -93.61975, -93.61976, -93.61939, -93.61732, -93.638…
$ Latitude           <dbl> 42.05403, 42.05301, 42.05266, 42.05125, 42.06090, 4…

Data split

set.seed(1234)
ames_split <- initial_split(ames)
ames_train <- training(ames_split)
ames_test <- testing(ames_split)

Our modeling task

Predicting the sale price of a house based on its physical and geographical features

rec_spec <- recipe(Sale_Price ~ ., ames_train) |>
  step_normalize(all_numeric_predictors()) |>
  step_dummy(all_nominal_predictors())

mod_spec <- boost_tree(mode = "regression", engine = "xgboost")

wf_spec <- workflow(rec_spec, mod_spec)

wf_fit <- fit(wf_spec, ames_train)

Not tidymodels specific



we are using tidymodels

but this talk isn’t about software

it is about decisions

The previous example is very simplistic. A full modeling workflow includes, but is not limited to, the following (a sketch follows the lists):


  • preprocessor
  • model
  • post processor


  • hyperparameter tuning
  • performance metric
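
A hedged sketch (not from the slides) of where the last two pieces plug in. The tuned parameter, the resamples, and the grid size are illustrative choices, not recommendations.

# Pretend we had marked one xgboost parameter for tuning
mod_tune <- boost_tree(mode = "regression", engine = "xgboost", trees = tune())
wf_tune <- workflow(rec_spec, mod_tune)

set.seed(1234)
tune_res <- tune_grid(
  wf_tune,
  resamples = vfold_cv(ames_train, v = 5),
  grid = 10,
  # the performance metrics decide which candidate "wins"
  metrics = metric_set(rmse, mae)
)

select_best(tune_res, metric = "mae")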

What are performance metrics


A computation, typically resulting in a single numeric value, representing how well the model's predictions match the known truth.


The choice of metric matters a lot: it reflects what you are prioritizing, and it will affect which model you end up with.

root mean squared error

\[ \text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(truth_i - estimate_i)^2} \]


This is a classic regression metric. The square root doesn't do anything here beyond bringing the units back to the original scale.
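
As an illustrative check (not in the original slides), computing RMSE by hand on the training predictions should agree with yardstick's rmse_vec().

# Predictions on the training data from the fitted workflow above
preds <- augment(wf_fit, ames_train)

# RMSE by hand, straight from the formula
sqrt(mean((preds$Sale_Price - preds$.pred)^2))

# The same value via yardstick
rmse_vec(truth = preds$Sale_Price, estimate = preds$.pred)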

mean absolute error

\[ \text{MAE} = \frac{1}{n}\sum_{i=1}^{n}|truth_i - estimate_i| \]


The mean in both metrics makes the result invariant to the number of observations we evaluate on.
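
The same illustrative check for MAE, reusing preds from the RMSE check above.

# MAE by hand
mean(abs(preds$Sale_Price - preds$.pred))

# The same value via yardstick
mae_vec(truth = preds$Sale_Price, estimate = preds$.pred)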

R Squared

\[ R^2 = 1 - \frac{\sum_{i=1}^{n}(truth_i - estimate_i)^2}{\sum_{i=1}^{n}(truth_i - \bar{truth})^2} \]


More a measure of consistency and correlation than of accuracy.


Notice that this value is something we want to maximize instead of minimize.
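
An illustrative check of the sum-of-squares definition above, again reusing preds. As an aside, yardstick's rsq() estimates R² as a squared correlation, while rsq_trad() follows the formula shown here.

# R squared by hand, from the traditional definition
ss_res <- sum((preds$Sale_Price - preds$.pred)^2)
ss_tot <- sum((preds$Sale_Price - mean(preds$Sale_Price))^2)
1 - ss_res / ss_tot

# The traditional definition via yardstick
rsq_trad_vec(truth = preds$Sale_Price, estimate = preds$.pred)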

Asymmetric metrics


Is overpredicting the house price by $10,000 as bad as underpredicting by $10,000?



Is overpredicting the house price by 5% as bad as underpredicting by 5%?
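
If the answer is no, the metric itself can encode that preference. A minimal hand-rolled sketch (not a yardstick metric object; the 2x penalty is an arbitrary illustration):

# Penalize underprediction twice as heavily as overprediction
asymmetric_mae <- function(truth, estimate, under_weight = 2, over_weight = 1) {
  error <- truth - estimate
  weight <- ifelse(error > 0, under_weight, over_weight)  # error > 0 means we underpredicted
  mean(weight * abs(error))
}

asymmetric_mae(preds$Sale_Price, preds$.pred)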

Asymmetric metrics




The same idea is more commonly known in classification settings as type 1 and type 2 errors.

Asymmetric metrics - marketing

More than meets the eye

What do you think about a model that has 86% accuracy?

  • And it isn't an imbalanced modeling problem
  • It depends


But that could be a model that simply doesn't work on Fridays.


Something that works for 96% of observations could be a model that never works on Americans or blondes.

Stratified performance

We can calculate performance metrics on our data

augment(wf_fit, ames_train) |>
  mae(Sale_Price, .pred)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 mae     standard       8525.

Stratified performance

But what if we calculate it stratified by some other variable?

augment(wf_fit, ames_train) |>
  group_by(Neighborhood) |>
  mae(Sale_Price, .pred)
# A tibble: 28 × 4
   Neighborhood       .metric .estimator .estimate
   <fct>              <chr>   <chr>          <dbl>
 1 North_Ames         mae     standard       7084.
 2 College_Creek      mae     standard       7819.
 3 Old_Town           mae     standard       8257.
 4 Edwards            mae     standard       9027.
 5 Somerset           mae     standard       9320.
 6 Northridge_Heights mae     standard      10127.
 7 Gilbert            mae     standard       7012.
 8 Sawyer             mae     standard       7679.
 9 Northwest_Ames     mae     standard       7899.
10 Sawyer_West        mae     standard       7797.
# ℹ 18 more rows

Stratified performance

Looking at each end of the spectrum

augment(wf_fit, ames_train) |>
  group_by(Neighborhood) |>
  mae(Sale_Price, .pred) |>
  arrange(.estimate) |>
  slice(1:5, n() - (4:0))
# A tibble: 10 × 4
   Neighborhood    .metric .estimator .estimate
   <fct>           <chr>   <chr>          <dbl>
 1 Briardale       mae     standard       4834.
 2 Northpark_Villa mae     standard       6282.
 3 Gilbert         mae     standard       7012.
 4 North_Ames      mae     standard       7084.
 5 Meadow_Village  mae     standard       7445.
 6 Clear_Creek     mae     standard      11466.
 7 Blueste         mae     standard      12427.
 8 Northridge      mae     standard      12532.
 9 Landmark        mae     standard      13413.
10 Green_Hills     mae     standard      28467.

Look at your data and discover what is happening

Different metrics give different orderings

Different metrics give different orderings

Performance metrics are as much a part of the modeling pipeline as the model or the preprocessing

We need to think carefully about what we are working on

Fairness metrics

How do we define fairness?

Definitions of fairness “are not mathematically or morally compatible in general.”

Algorithmic Fairness: Choices, Assumptions, and Definitions

Shira Mitchell, Eric Potash, Solon Barocas, Alexander D’Amour and Kristian Lum

Modeling is hard, and there is no way around it

fairness metrics in yardstick

  • new_groupwise_metric()
    • Create groupwise metrics
  • demographic_parity()
    • Demographic parity
  • equalized_odds()
    • Equalized odds
  • equal_opportunity()
    • Equal opportunity

Using new_groupwise_metric()

# Worst-case RMSE: compute RMSE within each group, then report the largest one
max_rmse <- new_groupwise_metric(
  fn = rmse,
  name = "max_rmse",
  aggregate = function(x) max(x$.estimate),
  direction = "minimize"
)

max_rmse_Neighborhood <- max_rmse(Neighborhood)

augment(wf_fit, ames_train) |>
  max_rmse_Neighborhood(Sale_Price, .pred)
# A tibble: 1 × 4
  .metric  .by          .estimator .estimate
  <chr>    <chr>        <chr>          <dbl>
1 max_rmse Neighborhood standard      28682.
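
A hedged aside: a groupwise metric like this is designed to drop into metric_set() next to ordinary metrics, so the worst-group summary can be tracked alongside overall performance (for example during tuning).

my_metrics <- metric_set(rmse, mae, max_rmse_Neighborhood)

augment(wf_fit, ames_train) |>
  my_metrics(truth = Sale_Price, estimate = .pred)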

Using new_groupwise_metric()

# Mean RMSE across the five worst-performing groups
high_mean_rmse <- new_groupwise_metric(
  fn = rmse,
  name = "high_mean_rmse",
  aggregate = function(x) mean(sort(x$.estimate, decreasing = TRUE)[1:5]),
  direction = "minimize"
)

high_mean_rmse_Neighborhood <- high_mean_rmse(Neighborhood)

augment(wf_fit, ames_train) |>
  high_mean_rmse_Neighborhood(Sale_Price, .pred)
# A tibble: 1 × 4
  .metric        .by          .estimator .estimate
  <chr>          <chr>        <chr>          <dbl>
1 high_mean_rmse Neighborhood standard      18007.

Demographic parity

Demographic parity is satisfied when a model’s predictions have the same predicted positive rate across groups.


demographic_parity() is calculated as the difference between the largest and smallest value of detection_prevalence() across groups.


Demographic parity is sometimes referred to as group fairness, disparate impact, or statistical parity.
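
A hedged sketch of the calling pattern, which mirrors the groupwise metrics shown later; loan_preds, approved, .pred_class, and gender are hypothetical names standing in for a classification data set with a protected group column.

# Hypothetical data: factor truth `approved`, factor prediction `.pred_class`,
# and a protected group column `gender`
dp_by_gender <- demographic_parity(gender)

loan_preds |>
  dp_by_gender(truth = approved, estimate = .pred_class)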

Equalized odds

Equalized odds is satisfied when a model’s predictions have the same false positive, true positive, false negative, and true negative rates across protected groups


equalized_odds() takes the maximum difference in range of sens() and spec() .estimates across groups.


Equalized odds is sometimes referred to as conditional procedure accuracy equality or disparate mistreatment.

Equal opportunity

Equal opportunity is satisfied when a model’s predictions have the same true positive and false negative rates across protected groups


equal_opportunity() is calculated as the difference between the largest and smallest value of sens() across groups.


Equal opportunity is sometimes referred to as equality of opportunity.

Goodhart’s law

“When a measure becomes a target, it ceases to be a good measure”

I would argue that modeling performance metrics are different enough to not be covered by this law, but we need to be careful

we have all seen badly written metrics that caused bad effects

but that doesn’t mean you have to pick the simplest one and call it a day

Thank you