Feature Engineering

SoCal RUG 2024 Hackathon

Emil Hvitfeldt

What is Feature Engineering?

The act of modifying data
to allow for easy
extraction of signal and
elimination of noise

why do we use feature engineering?

most models can’t handle non-numeric data

and we often have non-numeric data

Are there other reasons?

to deal with missing values

and correlated features

or deal with scaling issues

preprocessing vs feature engineering

Preprocessing

what needs to happen


transformation

Normalization

Formatting

Imputing

Encoding

Feature engineering

what helps capture signal


transformation

Normalization

Formatting

Imputing

Encoding

Long Beach Animal Shelter Data

animal_id animal_name animal_type primary_color secondary_color sex dob intake_date intake_condition intake_type intake_subtype reason_for_intake outcome_date crossing jurisdiction outcome_type outcome_subtype latitude longitude intake_is_dead outcome_is_dead was_outcome_alive geopoint
A625007 NA CAT ORANGE WHITE Unknown 2019-04-06 2019-04-20 UNDER AGE/WEIGHT STRAY FIELD NA 2019-04-20 300 BLK E NEECE ST, LONG BEACH, CA 90805 LONG BEACH RESCUE OTHER RESC 33.87183 -118.1968 Alive on Intake FALSE 1 33.871828, -118.1967512
A602604 GANDALF CAT GRAY NA Male 2016-01-17 2018-01-17 NORMAL OWNER SURRENDER OTC OWNER DIED 2018-02-18 300 BLK E NORTON ST, LONG BEACH, CA 90805 LONG BEACH TRANSFER SPCALA 33.85827 -118.1898 Alive on Intake FALSE 1 33.8582742, -118.1897576
A651319 A651319 CAT SEAL PT NA Neutered 2015-11-11 2020-11-11 FRACTIOUS STRAY FIELD NA 2020-12-01 300 BLK E PEACE ST, LONG BEACH, CA 90805 LONG BEACH SHELTER, NEUTER, RETURN STRAYCATAL 33.84548 -118.1897 Alive on Intake FALSE 1 33.8454759, -118.1897493
A614730 *MATILDA CAT TORTIE NA Spayed 2016-03-15 2018-09-15 NORMAL STRAY OTC NA 2018-11-10 300 BLK E RHEA ST, LONG BEACH, CA 90806 LONG BEACH ADOPTION WALKIN 33.79247 -118.1892 Alive on Intake FALSE 1 33.7924717, -118.1892489
A716440 NA CAT GRAY NA Male 2016-03-06 2024-03-06 ILL MODERATETE STRAY FIELD NA 2024-03-06 300 BLK E SOUTH ST LONG BEACH DIED ENROUTE 33.86006 -118.1899 Alive on Intake TRUE 0 33.8600568, -118.1898684
A596684 XYLA CAT BLACK WHITE Spayed 2016-06-03 2017-09-07 NORMAL OWNER SURRENDER OTC LANDLORD 2017-09-17 300 BLK GLADYS AVE, LONG BEACH, CA 90814 LONG BEACH ADOPTION SPCALA 33.76833 -118.1577 Alive on Intake FALSE 1 33.7683335, -118.1576793
A709550 *PECORINO CAT BLACK NA Neutered 2023-09-06 2023-10-29 UNDER AGE/WEIGHT STRAY OTC NA 2023-12-27 300 BLK HERMOSA AVE LONG BEACH CA 90802 LONG BEACH ADOPTION WEB 33.76920 -118.1693 Alive on Intake FALSE 1 33.7691954, -118.1693268
A686577 *PACH JR CAT ORANGE NA Male 2022-08-12 2022-10-12 NORMAL STRAY OTC NA 2022-11-09 300 BLK HULA LN, LONG BEACH, CA 90805 LONG BEACH DIED IN CARE 33.87243 -118.1882 Alive on Intake TRUE 0 33.8724304, -118.1882392
A686576 *DOUG JR CAT ORG TABBY WHITE Neutered 2022-08-12 2022-10-12 NORMAL STRAY OTC NA 2023-01-28 300 BLK HULA LN, LONG BEACH, CA 90805 LONG BEACH HOMEFIRST NA 33.87243 -118.1882 Alive on Intake FALSE 1 33.8724304, -118.1882392
A689819 *RUPERT CAT BLACK NA Male 2022-10-02 2022-12-02 NORMAL STRAY OTC NA 2022-12-17 300 BLK JUNIPERO AVE, LONG BEACH, CA 90814 LONG BEACH RESCUE LITTLEPAWS 33.76860 -118.1649 Alive on Intake FALSE 1 33.7686022, -118.1649357

Date Time

How can we deal with this variable?

intake_date
2019-04-20
2018-01-17
2020-11-11
2018-09-15
2024-03-06
2017-09-07
2023-10-29
2022-10-12
2022-10-12
2022-12-02

intake_date_integer
18006
17548
18577
17789
19788
17416
19659
19277
19277
19328

intake_date_year intake_date_month intake_date_day
2019 4 20
2018 1 17
2020 11 11
2018 9 15
2024 3 6
2017 9 7
2023 10 29
2022 10 12
2022 10 12
2022 12 2

intake_date_before_christmas intake_date_before_halloween
249 194
342 287
44 354
101 46
294 239
109 54
57 2
74 19
74 19
23 333

intake_date_before_christmas intake_date_before_halloween
5.519459 5.2704322
5.836272 5.6612229
3.795489 5.8707083
4.620059 3.8394523
5.685279 5.4785534
4.695925 3.9982007
4.051785 0.9162907
4.310799 2.9704145
4.310799 2.9704145
3.157000 5.8096429
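
The date tables above can be reproduced with a short sketch. It assumes the integer is the number of days since 1970-01-01, that the "before" columns count days to the next occurrence of the holiday, and that the last table is log(x + 0.5); the 0.5 offset is inferred from the printed values, not stated on the slide. The function names are made up for illustration.

```python
import math
from datetime import date


def days_until(d, month, day):
    """Days from d to the next occurrence of month/day (0 if d is that day)."""
    target = date(d.year, month, day)
    if target < d:
        # The holiday already passed this year, so count to next year's.
        target = date(d.year + 1, month, day)
    return (target - d).days


def date_features(d):
    """Turn one date into several numeric features."""
    christmas = days_until(d, 12, 25)
    halloween = days_until(d, 10, 31)
    return {
        "integer": (d - date(1970, 1, 1)).days,
        "year": d.year,
        "month": d.month,
        "day": d.day,
        "before_christmas": christmas,
        "before_halloween": halloween,
        # Assumed offset of 0.5 so that a count of 0 can still be logged.
        "log_before_christmas": math.log(christmas + 0.5),
        "log_before_halloween": math.log(halloween + 0.5),
    }


features = date_features(date(2019, 4, 20))
print(features)
```

Which of these features helps depends on the model and the question; the sketch only shows that one date column can become many numeric columns.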

Multiple Categorical variables

How can we deal with these variables?

primary_color secondary_color
ORANGE WHITE
GRAY NA
SEAL PT NA
TORTIE NA
GRAY NA
BLACK WHITE
BLACK NA
ORANGE NA
ORG TABBY WHITE
BLACK NA

Multiple Categorical variables

How can we deal with these variables? Create dummies

primary_color_BC.LYNX.PT primary_color_BL.LYNX.PT primary_color_BLACK primary_color_BLK.SMOKE primary_color_BLK.TABBY primary_color_BLK.TIGER primary_color_BLONDE primary_color_BLUE primary_color_BLUE.CREAM primary_color_BLUE.PT primary_color_BLUE.TABBY primary_color_BR.BRINDLE primary_color_BRN.TABBY primary_color_BRN.TIGER primary_color_BROWN primary_color_BUFF primary_color_CALICO primary_color_CALICO.DIL primary_color_CALICO.PT primary_color_CALICO.TAB primary_color_CH.LYNX.PT primary_color_CHOC.PT primary_color_CHOCOLATE primary_color_CR.LYNX.PT primary_color_CREAM primary_color_CREAM.PT primary_color_CRM.TABBY primary_color_CRM.TIGER primary_color_FAWN primary_color_FLAME.PT primary_color_GOLD primary_color_GRAY primary_color_GRAY.TABBY primary_color_GRAY.TIGER primary_color_L.C.PT primary_color_LC.LYNX.PT primary_color_LI.LYNX.PT primary_color_LILAC.PT primary_color_LYNX.PT primary_color_ORANGE primary_color_ORG.TABBY primary_color_ORG.TIGER primary_color_PEACH primary_color_PINK primary_color_RD.LYNX.PT primary_color_RED primary_color_RED.PT primary_color_S.T.PT primary_color_SABLE primary_color_SEAL primary_color_SEAL.PT primary_color_SILVER primary_color_SL.LYNX.PT primary_color_SLVR.TABBY primary_color_SNOWSHOE primary_color_ST.LYNX.PT primary_color_TAN primary_color_TORBI primary_color_TORTIE primary_color_TORTIE.DIL primary_color_TORTIE.MUT primary_color_TORTIE.PT primary_color_TRICOLOR primary_color_UNKNOWN primary_color_WHEAT primary_color_WHITE primary_color_YELLOW secondary_color_BLACK secondary_color_BLK.SMOKE secondary_color_BLK.TABBY secondary_color_BLK.TIGER secondary_color_BLUE secondary_color_BRN.TABBY secondary_color_BRN.TIGER secondary_color_BROWN secondary_color_CALICO secondary_color_CALICO.TABBY secondary_color_CHOC.PT secondary_color_CREAM secondary_color_CRM.TABBY secondary_color_FAWN secondary_color_FLAME.PT secondary_color_GOLD secondary_color_GRAY secondary_color_GRAY.TABBY secondary_color_LYNX.PT secondary_color_MARBLED.TABBY secondary_color_ORANGE secondary_color_ORG.TABBY secondary_color_PEACH secondary_color_RED secondary_color_SEAL.PT secondary_color_SLVR.TABBY secondary_color_TAN secondary_color_TORBI secondary_color_TORTIE secondary_color_TORTIE.DIL secondary_color_TRICOLOR secondary_color_WHITE secondary_color_YELLOW secondary_color_unknown
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1

Multiple Categorical variables

How can we deal with these variables? othering of low counts

primary_color_BRN.TABBY primary_color_CALICO primary_color_GRAY primary_color_GRAY.TABBY primary_color_ORANGE primary_color_ORG.TABBY primary_color_SEAL.PT primary_color_TORTIE primary_color_WHITE primary_color_OTHER secondary_color_GRAY secondary_color_WHITE secondary_color_unknown secondary_color_OTHER
0 0 0 0 1 0 0 0 0 0 0 1 0 0
0 0 1 0 0 0 0 0 0 0 0 0 1 0
0 0 0 0 0 0 1 0 0 0 0 0 1 0
0 0 0 0 0 0 0 1 0 0 0 0 1 0
0 0 1 0 0 0 0 0 0 0 0 0 1 0
0 0 0 0 0 0 0 0 0 0 0 1 0 0
0 0 0 0 0 0 0 0 0 0 0 0 1 0
0 0 0 0 1 0 0 0 0 0 0 0 1 0
0 0 0 0 0 1 0 0 0 0 0 1 0 0
0 0 0 0 0 0 0 0 0 0 0 0 1 0
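
Othering of low counts can be sketched as follows: count how often each level appears, then replace any level below a threshold with a catch-all level before creating dummies. The threshold of 2 and the function name are hypothetical, and the ten values are the primary colors from the preview rows, so the result will not match the table above, which was built from the full data.

```python
from collections import Counter


def collapse_rare(values, min_count, other="OTHER"):
    """Replace levels seen fewer than min_count times with a catch-all level."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]


primary_color = [
    "ORANGE", "GRAY", "SEAL PT", "TORTIE", "GRAY",
    "BLACK", "BLACK", "ORANGE", "ORG TABBY", "BLACK",
]

# Hypothetical threshold: keep levels that appear at least twice.
collapsed = collapse_rare(primary_color, min_count=2)
print(collapsed)
```

In practice the counts would be learned on the training data only and then reused, so that new data is collapsed the same way.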

Multiple Categorical variables

How can we deal with these variables? combined dummies

color_ORANGE color_WHITE color_BLACK color_BLK.SMOKE color_BLK.TABBY color_BLK.TIGER color_BLUE color_BRN.TABBY color_BRN.TIGER color_BROWN color_CALICO color_CHOC.PT color_CREAM color_CRM.TABBY color_FAWN color_FLAME.PT color_GOLD color_GRAY color_GRAY.TABBY color_LYNX.PT color_ORG.TABBY color_PEACH color_RED color_SEAL.PT color_SLVR.TABBY color_TAN color_TORBI color_TORTIE color_TORTIE.DIL color_TRICOLOR color_YELLOW color_OTHER
1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1
0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
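
A minimal sketch of combined dummies: instead of one set of columns per variable, a single set of `color_*` columns is shared, and a column is 1 when the level appears in either the primary or the secondary color. The function name is made up for illustration, and missing secondary colors are represented as `None`.

```python
def combined_dummies(rows, prefix="color"):
    """One shared indicator column per level across several categorical columns."""
    levels = sorted({value for row in rows for value in row if value is not None})
    return [
        {f"{prefix}_{level}": int(level in row) for level in levels}
        for row in rows
    ]


# (primary_color, secondary_color) pairs taken from the preview rows.
rows = [("ORANGE", "WHITE"), ("GRAY", None), ("BLACK", "WHITE")]
encoded = combined_dummies(rows)
for row in encoded:
    print(row)
```

This keeps the number of columns down, at the cost of no longer knowing which color was primary and which was secondary.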

What will we go over?

Types of data

  • numeric
  • character
    • factors
    • logical
  • Datetime
  • text
  • image
  • time series
  • … and more

Data problems

  • missing data
  • too many variables
  • correlated variables
  • outliers
  • imbalanced
  • … and more

numeric variables

Aren’t we done? we have numeric variables

They might still need improvements

Distributional problems, scaling, outliers, non-linear effects

Distributional problems

Think about how models work

Especially relevant for count data

We rarely care whether a predictor is normal

A skewed distribution

A logged skewed distribution

A square rooted skewed distribution

Methods to alter distributions

  • Logarithms
  • Square Roots
  • Box-Cox
  • Yeo-Johnson

Box-Cox uses maximum likelihood estimation to find the transformation parameter \(\lambda\) in the following equation that makes \(x^*\) as close to normal as possible, where \(\tilde{x}\) is the geometric mean of \(x\)

\[ x^* = \left\{ \begin{array}{ll} \dfrac{x^\lambda - 1}{\lambda \tilde{x}^{\lambda - 1}}, & \lambda \neq 0 \\ \tilde{x} \log x, & \lambda = 0 \end{array} \right. \]
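
The equation can be written out directly for a given \(\lambda\). This is only a sketch of the transformation itself: it takes `lam` as an input and does not do the maximum likelihood search, and it assumes all values are strictly positive. The function names are made up for illustration.

```python
import math


def geometric_mean(xs):
    """Geometric mean of strictly positive values."""
    return math.exp(sum(math.log(x) for x in xs) / len(xs))


def box_cox(xs, lam):
    """Apply the geometric-mean-scaled transformation for a fixed lambda."""
    gm = geometric_mean(xs)
    if lam == 0:
        return [gm * math.log(x) for x in xs]
    return [(x ** lam - 1) / (lam * gm ** (lam - 1)) for x in xs]


skewed = [1.0, 2.0, 4.0, 8.0, 64.0]
print(box_cox(skewed, 0))    # scaled logarithm
print(box_cox(skewed, 0.5))  # close to a shifted and scaled square root
print(box_cox(skewed, 1))    # only shifts the data by 1
```

The \(\lambda = 0\) case is the limit of the other case as \(\lambda\) approaches 0, which is why the two branches give nearly identical values for very small \(\lambda\).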

Scaling issues

Think about how models work

  • tree-based models don’t care
  • distance based models do

Scaling has few downsides

but it makes interpretation slightly harder

Motivated example


Scaling Methods

All scaling methods are trained on our data


All are a variation on

\[ X_{scaled} = \dfrac{X - a}{b} \]


either to change magnitude of data, or its range

Scaling Methods

Method Definition
Max-Abs \(X_{scaled} = \dfrac{X}{\text{max}(\text{abs}(X))}\)
Normalization \(X_{scaled} = \dfrac{X - \text{mean}(X)}{\text{sd}(X)}\)
Min-Max \(X_{scaled} = \dfrac{X - \text{min}(X)}{\text{max}(X) - \text{min}(X)}\)
Robust \(X_{scaled} = \dfrac{X - \text{median}(X)}{\text{Q3}(X) - \text{Q1}(X)}\)
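
The four definitions above all fit the form \(X_{scaled} = (X - a) / b\), which can be sketched as a pair of functions: one learns `a` and `b` from the training data, the other applies them to any data. The function names are hypothetical, and the quartiles use the default method of Python's `statistics.quantiles`, which may differ slightly from other tools.

```python
import statistics


def fit_scaler(train, method):
    """Learn (a, b) from training data so that scaled = (x - a) / b."""
    if method == "max_abs":
        return 0.0, max(abs(x) for x in train)
    if method == "normalization":
        return statistics.mean(train), statistics.stdev(train)
    if method == "min_max":
        return min(train), max(train) - min(train)
    if method == "robust":
        q1, _, q3 = statistics.quantiles(train, n=4)
        return statistics.median(train), q3 - q1
    raise ValueError(f"unknown method: {method}")


def apply_scaler(values, params):
    """Scale values with parameters learned earlier."""
    a, b = params
    return [(x - a) / b for x in values]


train = [1, 2, 3, 4, 5]
new_data = [1, 3, 5, 7]

for method in ["max_abs", "normalization", "min_max", "robust"]:
    params = fit_scaler(train, method)
    print(method, params, apply_scaler(new_data, params))
```

Because the parameters come from the training data, new values can land outside the training range, for example above 1 after min-max scaling.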

Dealing with outliers


What are outliers?


values that are substantially different
from the rest of the values

Dealing with outliers


  1. Identify them
    1. expert knowledge
    2. (not recommended) thresholding
  2. Dealing with them
    1. removal
    2. imputation
    3. indication

Non-linear effects

With a linear effect, whenever the predictor increases by one unit, the outcome changes by the same fixed amount

non-linear effects don’t have this property

depending on the type of model, having linear effects is nice

Non-linear example

Non-linear example - distance to maximum

Non-linear example - splines

Basis Spline Features
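
As a rough sketch of what basis spline features are, the function below evaluates B-spline basis functions with the standard Cox-de Boor recursion. Each basis function becomes one new column, and a linear model on those columns can follow a curved relationship. The knots, the degree, and the function name are assumptions made for illustration; this is not necessarily how the figures in the slides were produced.

```python
def bspline_basis(x, knots, degree):
    """Evaluate all B-spline basis functions of the given degree at x."""
    # Degree 0: indicator of the half-open knot interval containing x.
    basis = [
        1.0 if knots[i] <= x < knots[i + 1] else 0.0
        for i in range(len(knots) - 1)
    ]
    if x == knots[-1]:
        # Put the right boundary into the last non-empty interval.
        for i in range(len(knots) - 2, -1, -1):
            if knots[i] < knots[i + 1]:
                basis[i] = 1.0
                break
    # Build higher degrees from lower ones (Cox-de Boor recursion).
    for d in range(1, degree + 1):
        new_basis = []
        for i in range(len(knots) - d - 1):
            left_den = knots[i + d] - knots[i]
            right_den = knots[i + d + 1] - knots[i + 1]
            left = (x - knots[i]) / left_den * basis[i] if left_den > 0 else 0.0
            right = (
                (knots[i + d + 1] - x) / right_den * basis[i + 1]
                if right_den > 0
                else 0.0
            )
            new_basis.append(left + right)
        basis = new_basis
    return basis


# Hypothetical setup: cubic splines on [0, 1] with one interior knot at 0.5.
knots = [0, 0, 0, 0, 0.5, 1, 1, 1, 1]
for x in [0.0, 0.25, 0.5, 0.9, 1.0]:
    print(x, [round(b, 3) for b in bspline_basis(x, knots, degree=3)])
```

Each value of the predictor turns into five features here, and the features for any one value sum to 1.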

Monotone Spline Features

Categorical Variables

Dealing with categorical variables is easy, right?

Yes and no, there are some unique struggles with categoricals

and there are lots of methods to turn them into numeric

Character vs Factor variables


Character


Free form text


no restrictions

Factor


Known possible values


A logical variable is conceptually a factor

Factor Examples - sex

sex n
Female 3311
Male 3291
Neutered 2255
Spayed 2277
Unknown 1647

A lot of information is crammed in here

We could have it split into sex and spayed/neutered
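
One way the split could look is sketched below. It assumes that "Neutered" means an altered male, "Spayed" means an altered female, and that plain "Male" and "Female" mean not altered; that reading of the shelter's coding is an assumption, and the new column names are made up.

```python
# Assumed meaning of each level: (sex, is_altered).
SEX_SPLIT = {
    "Female": ("Female", False),
    "Male": ("Male", False),
    "Spayed": ("Female", True),
    "Neutered": ("Male", True),
    "Unknown": (None, None),  # neither piece of information is known
}


def split_sex(value):
    """Split the combined sex column into two simpler features."""
    sex, is_altered = SEX_SPLIT[value]
    return {"sex": sex, "spayed_neutered": is_altered}


for value in ["Unknown", "Male", "Neutered", "Spayed"]:
    print(value, split_sex(value))
```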

Factor Examples - intake_condition

intake_condition n
AGED 7
BEHAVIOR MILD 28
BEHAVIOR MODERATE 29
BEHAVIOR SEVERE 3
FERAL 313
FRACTIOUS 619
I/I REPORT 97
ILL MILD 663
ILL MODERATETE 422
ILL SEVERE 535
INJURED MILD 190
INJURED MODERATE 240
INJURED SEVERE 507
NORMAL 4679
UNDER AGE/WEIGHT 4437
WELFARE SEIZURES 12

Character Examples - primary color

primary_color n
B-C PT 1
BC LYNX PT 1
BL LYNX PT 3
BLACK 3610
BLK SMOKE 20
BLK TABBY 60
BLK TIGER 4
BLONDE 3
BLUE 10
BLUE CREAM 1
BLUE PT 31
BLUE TABBY 4
BR BRINDLE 1
BRN TABBY 2056
BRN TIGER 8
BROWN 107
BUFF 2
CALICO 472
CALICO DIL 98
CALICO PT 3
CALICO TAB 77
CH LYNX PT 1
CHOC PT 30
CHOCOLATE 7
CR LYNX PT 2
CREAM 92
CREAM PT 6
CRM TABBY 66
CRM TIGER 1
FAWN 1
FLAME PT 54
GOLD 2
GRAY 1455
GRAY TABBY 1008
GRAY TIGER 17
L-C PT 1
LC LYNX PT 11
LI LYNX PT 14
LILAC PT 17
LYNX PT 87
ORANGE 455
ORG TABBY 813
ORG TIGER 3
PEACH 2
PINK 1
RD LYNX PT 1
RED 1
RED PT 1
S-T PT 13
SABLE 1
SEAL 13
SEAL PT 209
SILVER 3
SL LYNX PT 5
SLVR TABBY 18
SNOWSHOE 39
ST LYNX PT 1
TAN 55
TORBI 12
TORTIE 475

Character Examples - animal name

animal_name n
LUNA 31
MILO 19
LILY 18
DAISY 16
JACK 16
OLIVER 15
CHARLIE 14
TIGER 14
BELLA 12
COCO 12
LUCY 12
NALA 12
OREO 12
SHADOW 12
SIMBA 11
SMOKEY 11
TEDDY 11
TULIP 11
LARRY 10
LEO 10
LULU 10
OSCAR 10
PUMPKIN 10
SIMON 10
CLEO 9
FELIX 9
HENRY 9
MAX 9
PENELOPE 9
SAMMY 9
SASHA 9
SASSY 9
SMUDGE 9
UNKNOWN 9
APPLE 8
BABY 8
BETTY 8
BUDDY 8
BUTTERCUP 8
CALI 8
CHIP 8
CHLOE 8
FIONA 8
FRANKIE 8
GEORGE 8
IRIS 8
KIKI 8
LOLA 8
LUCA 8
MINNIE 8
OLIVE 8
ONYX 8
PENNY 8
PEPPER 8
ROMEO 8
TOBY 8
ASH 7
BEAU 7
BELLE 7
BENNY 7

Character Examples - animal name

animal_name n
BARTLEBY VON HAMMERSMARK 1
THUMBALINA [POLYDACTAL] 1
CHARLIE GEORGE SHIPPEE 1
KITTEN BLACK AND WHITE 1
KING SCHNOODLE-DOODLE 1
XENA WARRIOR PRINCESS 1
PRINCESS HELLO KITTY 1
SANTAS LITTLE HELPER 1
SHAKIRA (FKA JASPER) 1
TOE-BEE [POLYDACTAL] 1
INDEPENDENCE "INDY" 1
NEIL CATRICK HARRIS 1
SHOELACE AKA STALIN 1
CHEF GUARNASCHELLI 1
DJ KITTY BOOM BOOM 1
DR. FURASIER CRANE 1
EXOTIC MR. SMOOTHE 1
LITTLE JACK HORNER 1
MRS. J CASABLANCAS 1
PEPPERMINT STRIPES 1

Messy Characters

Missing values

Inconsistent encoding

Typos

Best dealt with manually

Unseen Levels


Only applicable for character variables

Will depend on method and models

How to deal with Categorical variables


Non-exhaustive list of methods


  • label / ordinal encoding
  • dummy encoding
  • frequency encoding
  • target encoding

Label / Ordinal Encoding

You take the categorical variable

Give each level a number

Replace level with said number

This is unlikely to work well unless done with care

Label / Ordinal Encoding - Example

sex_before intake_type_before sex_after intake_type_after
Unknown STRAY 5 8
Male OWNER SURRENDER 2 4
Neutered STRAY 3 8
Spayed STRAY 4 8
Male STRAY 2 8
Spayed OWNER SURRENDER 4 4
Neutered STRAY 3 8
Male STRAY 2 8
Neutered STRAY 3 8
Male STRAY 2 8
Male STRAY 2 8
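
The `sex_after` column above is consistent with numbering the levels in alphabetical order, starting at 1. A minimal sketch, with a hypothetical function name and an optional explicit ordering for cases where the levels have a natural order:

```python
def fit_label_encoder(values, order=None):
    """Map each level to an integer, alphabetically unless an order is given."""
    levels = order if order is not None else sorted(set(values))
    return {level: i for i, level in enumerate(levels, start=1)}


sex_levels = ["Female", "Male", "Neutered", "Spayed", "Unknown"]
sex_mapping = fit_label_encoder(sex_levels)

sex_before = ["Unknown", "Male", "Neutered", "Spayed", "Male"]
sex_after = [sex_mapping[value] for value in sex_before]
print(sex_after)

# Doing it "with care": give an order that means something.
severity = fit_label_encoder(
    [], order=["ILL MILD", "ILL MODERATETE", "ILL SEVERE"]
)
print(severity)
```

The alphabetical numbering implies that Spayed is "larger" than Male, which has no real meaning; this is why the method rarely works well without a deliberate ordering.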
Male STRAY 2 8

Dummy Encoding

A variable is created for each possible level in categorical variables

They take value 1 when the original variable takes that level, 0 otherwise

pros

  • easy to use and interpret
  • Versatile
  • basis for many other methods

cons

  • Needs clean levels
  • Can create many columns
  • Not likely most efficient representation

Dummy vs One-Hot Encoding

The terms dummy encoding and one-hot encoding get thrown around interchangeably, but they do have different and distinct meanings.


One-hot encoding returns k variables


Dummy Encoding returns k-1 variables


Dummy encoding is often preferred because the dropped column is redundant: it can be recovered from the other k-1 columns

Dummy Encoding - Example

sex_Male sex_Neutered sex_Spayed sex_Unknown intake_type_Euthenasia.Required intake_type_FOSTER intake_type_OWNER.SURRENDER intake_type_QUARANTINE intake_type_RETURN intake_type_SAFE.KEEP intake_type_STRAY intake_type_TRAP..NEUTER..RETURN intake_type_WELFARE.SEIZED
0 0 0 1 0 0 0 0 0 0 1 0 0
1 0 0 0 0 0 1 0 0 0 0 0 0
0 1 0 0 0 0 0 0 0 0 1 0 0
0 0 1 0 0 0 0 0 0 0 1 0 0
1 0 0 0 0 0 0 0 0 0 1 0 0
0 0 1 0 0 0 1 0 0 0 0 0 0
0 1 0 0 0 0 0 0 0 0 1 0 0
1 0 0 0 0 0 0 0 0 0 1 0 0
0 1 0 0 0 0 0 0 0 0 1 0 0
1 0 0 0 0 0 0 0 0 0 1 0 0
1 0 0 0 0 0 0 0 0 0 1 0 0
1 0 0 0 0 0 0 0 0 0 1 0 0
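
The difference between the two encodings, sketched for the `sex` variable. The table above has no `sex_Female` column, which is consistent with dropping the first level in alphabetical order; the function name and the `drop_first` argument are hypothetical.

```python
def encode(value, levels, drop_first):
    """Return 0/1 indicators; drop the first level for dummy encoding."""
    kept = levels[1:] if drop_first else levels
    return [int(value == level) for level in kept]


sex_levels = ["Female", "Male", "Neutered", "Spayed", "Unknown"]

for value in ["Male", "Female"]:
    one_hot = encode(value, sex_levels, drop_first=False)  # k columns
    dummy = encode(value, sex_levels, drop_first=True)     # k-1 columns
    print(value, one_hot, dummy)
```

With dummy encoding the dropped level (Female here) is represented by all zeros, so no information is lost.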

Frequency Encoding

For each level calculate how often it appears


replace the level with that value


Can be raw count or percentage (doesn’t matter much)

Frequency Encoding - Example

sex_before intake_type_before sex_after intake_type_after
Unknown STRAY 0.1288632 0.89820828
Male OWNER SURRENDER 0.2574916 0.07339019
Neutered STRAY 0.1764338 0.89820828
Spayed STRAY 0.1781551 0.89820828
Male STRAY 0.2574916 0.89820828
Spayed OWNER SURRENDER 0.1781551 0.07339019
Neutered STRAY 0.1764338 0.89820828
Male STRAY 0.2574916 0.89820828
Neutered STRAY 0.1764338 0.89820828
Male STRAY 0.2574916 0.89820828
Male STRAY 0.2574916 0.89820828
Male STRAY 0.2574916 0.89820828
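
The `sex_after` values above can be reproduced from the counts shown earlier for `sex`: each level is replaced by its share of all rows. A short sketch, with a made-up function name:

```python
def fit_frequency_encoder(counts):
    """Turn level counts into proportions of the total."""
    total = sum(counts.values())
    return {level: n / total for level, n in counts.items()}


# Counts from the "Factor Examples - sex" table.
sex_counts = {
    "Female": 3311,
    "Male": 3291,
    "Neutered": 2255,
    "Spayed": 2277,
    "Unknown": 1647,
}
sex_frequency = fit_frequency_encoder(sex_counts)

sex_before = ["Unknown", "Male", "Neutered", "Spayed"]
sex_after = [sex_frequency[value] for value in sex_before]
print(sex_after)
```

Levels with the same frequency get the same value, so the model can no longer tell them apart.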

Target Encoding

also called mean encoding, likelihood encoding, or impact encoding


done by replacing each level of a categorical variable with the mean of the target variable within said level


The target variable will typically be the outcome, but that is not necessarily a requirement.

Target Encoding - Example

sex_before intake_type_before outcome_type sex_after intake_type_after
Unknown STRAY RESCUE -5.78934836 -1.5615482
Male OWNER SURRENDER TRANSFER -3.07352887 -0.5419162
Neutered STRAY SHELTER, NEUTER, RETURN -0.04469019 -1.5615482
Spayed STRAY ADOPTION -0.17892102 -1.5615482
Male STRAY DIED -3.07352887 -1.5615482
Spayed OWNER SURRENDER ADOPTION -0.17892102 -0.5419162
Neutered STRAY ADOPTION -0.04469019 -1.5615482
Male STRAY DIED -3.07352887 -1.5615482
Neutered STRAY HOMEFIRST -0.04469019 -1.5615482
Male STRAY RESCUE -3.07352887 -1.5615482
Male STRAY RESCUE -3.07352887 -1.5615482
Male STRAY RESCUE -3.07352887 -1.5615482

Want to learn more?

My WIP Book: Feature Engineering A-Z

https://feaz-book.com/

Finished books by other people:

  • Python Feature Engineering Cookbook
  • Feature Engineering Bookcamp
  • Feature Engineering and Selection