
Predictive modeling with text using tidy data principles

useR2020

Julia Silge & Emil Hvitfeldt

2020-07-24

1 / 102

Plan for today

  • We will walk through our case study using slides and live coding
  • After the tutorial, use the materials on GitHub and the YouTube recording to run the code yourself 💪
3 / 102

Text as data

Let's look at complaints submitted to the United States Consumer Financial Protection Bureau (CFPB).

  • An individual experiences a problem 😩 with a consumer financial product or service, like a bank, loan, or credit card 💰
  • They submit a complaint to the CFPB explaining what happened 😡
  • This complaint is sent to the company, which can respond or dispute
4 / 102

Text as data

Let's look at complaints submitted to the United States Consumer Financial Protection Bureau (CFPB).

library(tidyverse)
complaints <- read_csv("data/complaints.csv.gz") %>% sample_frac(0.1)
names(complaints)
## [1] "date_received" "product"
## [3] "sub_product" "issue"
## [5] "sub_issue" "consumer_complaint_narrative"
## [7] "company_public_response" "company"
## [9] "state" "zip_code"
## [11] "tags" "consumer_consent_provided"
## [13] "submitted_via" "date_sent_to_company"
## [15] "company_response_to_consumer" "timely_response"
## [17] "consumer_disputed" "complaint_id"
5 / 102

Text as data

complaints %>%
  sample_n(10) %>%
  pull(consumer_complaint_narrative)
## [1] "I don't recognize this inquiries. I never applied for it. Except as otherwise provided in this section, a consumer reporting agency shall block the reporting of any information in the file of a consumer that the consumer identifies as information that resulted from an alleged identity theft, not later than 4 business days after the date of receipt by such agency of ( 1 ) appropriate proof of the identity of the consumer ; ( 2 ) a copy of an identity theft report ; ( 3 ) the identification of such information by the consumer ; and ( 4 ) a statement by the consumer that the information is not information relating to any transaction by the consumer. ( b ) Notification. A consumer reporting agency shall promptly notify the furnisher of information identified by the consumer under subsection ( a ) of this section ( 1 ) that the information may be a result of identity theft ; ( 2 ) that an identity theft report has been filed ; ( 3 ) that a block has been requested under this section ; and ( 4 ) of the effective dates of the block. ( c ) Authority to decline or rescind. ( 1 ) In general. A consumer reporting agency may decline to block, or may rescind any block, of information relating to a consumer under this section, if the consumer reporting agency reasonably determines that ( A ) the information was blocked in error or a block was requested by the consumer in error ; ( B ) the information was blocked, or a block was requested by the consumer, on the basis of a material misrepresentation of fact by the consumer relevant to the request to block ; or ( C ) the consumer obtained possession of goods, services, or money as a result of the blocked transaction or transactions. ( 2 ) Notification to consumer. If a block of information is declined or rescinded under this subsection, the affected consumer shall be notified promptly, in the same manner as consumers are notified of the reinsertion of information under section 1681i ( a ) ( 5 ) ( B ) of this title. ( 3 ) Significance of block. For purposes of this subsection, if a consumer reporting agency rescinds a block, the presence of information in the file of a consumer prior to the blocking of such information is not evidence of whether the consumer knew or should have known that the consumer obtained possession of any goods, services, or money as a result of the block. ( d ) Exception for resellers. ( 1 ) No reseller file. This section shall not apply to a consumer reporting agency, if the consumer reporting agency ( A ) is a reseller ; ( B ) is not, at the time of the request of the consumer under subsection ( a ) of this section, otherwise furnishing or reselling a consumer report concerning the information identified by the consumer ; and ( C ) informs the consumer, by any means, that the consumer may report the identity theft to the Bureau to obtain consumer information regarding identity theft. ( 2 ) Reseller with file. The sole obligation of the consumer reporting agency under this section, with regard to any request of a consumer under this section, shall be to block the consumer report maintained by the consumer reporting agency from any subsequent use, if ( A ) the consumer, in accordance with the provisions of subsection ( a ) of this section, identifies, to a consumer reporting agency, information in the file of the consumer that resulted from identity theft ; and ( B ) the consumer reporting agency is a reseller of the identified information. ( 3 ) Notice. 
In carrying out its obligation under paragraph ( 2 ), the reseller shall promptly provide a notice to the consumer of the decision to block the file. Such notice shall contain the name, address, and telephone number of each consumer reporting agency from which the consumer information was obtained for resale. ( e ) Exception for verification companies. The provisions of this section do not apply to a check services company, acting as such, which issues authorizations for the purpose of approving or processing negotiable instruments, electronic fund transfers, or similar methods of payments, except that, beginning 4 business days after receipt of information described in paragraphs ( 1 ) through ( 3 ) of subsection ( a ) of this section, a check services company shall not report to a national consumer reporting agency described in section 1681a ( p ) of this title, any information identified in the subject identity theft report as resulting from identity theft. ( f ) Access to blocked information by law enforcement agencies. No provision of this section shall be construed as requiring a consumer reporting agency to prevent a Federal, State, or local law enforcement agency from accessing blocked information in a consumer file to which the agency could otherwise obtain access under this subchapter."
## [2] "NOTE : I have selected available options regarding the complaint but it is regarding the purchases that I did not make as the card was never used, activated or opened from the origina wrapper until XXXX - details below. \n\nCOMPLAINT about Mastercard Gift Card but inside wrapper pamphlet says \" US Bank Gift Card : Ok, so one more thing to do before the year end.\n\nI purchased 2 Mastercard gift cards in XX/XX/XXXX from XXXX ( I still have the original receipt ) mainly to use these cards for phone recharging and some online purchases. Before this I had purchased a similar {$200.00} Mastercard for the first time and was using it without any issues and since do small amounts in recharges the first one was still good and I had put away these other 2 gift cards in my safe box ie. I never opened them or activated them or used them. One thing I would like to mention is that I did not know myraid of \" GOTCHAS '' these cards have and as the date of expiration on these is XXXX, I thought they were good as long as I did not activate and use them and never thought these are susceptible to fraud when they have not been activated or used. In fact when I search online ( and if you search online ) it seems that there have been thousands of users who have been defrauded. It is one thing when a card is active and used online and some thief or hacker gets a hold of the card number and uses it for fraudulent purchases - we all know this happens, but what if a card has never been activated or used for any purchases and put away in a safe with original wrapper unopened. How does a thief get a hold of the card information? It seems like this happens often ( specifically with Mastercard Gift Cards by US BANK ) and again if you go online you can see several forums where users discuss this. This is something I was not aware of. \n\nSo, in XXXX XXXX I opened one of the 2 cards that I had in my safe ( in original wrapper ) to start using it and upon calling the phone number to check my balance I found that the card had been charged a \" {$2.00} fee for each month '' for inactivity and almost half of the account had been drained by monthly inactivity fees. Again, I was not aware of this but I take responsibility for it although I would like to mention that it seems like a novel idea to charge someone for \" not using a product '' - we all know companies charge for something when a product is used but how about charging someone for not using it? Is this done by design to make more money or does it really cost the bank in fees when the consumer is not using the service? \n\nHowever, the real issue here is that with regards to my 2nd card and when I checked the balance on that it had a balance of {$11.00} ( something like that ) and I was shocked. So, I called the customer service. Without going into all the details of the conversation which will make this even longer, he first told me that he did not know as there is history for only 24 months in the system. Then after I pressed him and told him that I never used the card before then said that the card was first used in XXXX of XXXX there were two charges for {$43.00} or something. So, I was not able to understand how he was able to get that information beyond 24 months which he was not willing to share first. 
Again, I would like to reiterate that I had never opened the wrapper, activated or used this card and I told him that and at that he said that he supected fraud and that he was going to lock the card and if I wanted to use the card I have to fax my original receipt and my address etc., so they can mail me a new card and when I asked him if it was for the original purchase amount of {$200.00}, he said no it would be for {$11.00}. I asked to speak with a supervisor and he said he would have someone call me back. Anyway, after that I spoke with 2-3 agents and they said they would refund me back the fees. Later I understood that by that they meant that they were \" only refunding me the {$2.00} fee '' and not restore the original card. Different agents told me different versions. Anyway, I wanted to complain to CFPB about this since XXXX ( as you can see how long this is ) and just did not have the time to do so and wanted to this before the year end. \n\nThe issue is that as of today, they have locked that card, ( even with what was left {$11.00} ) and for that it seems I have to fax them and write them, etc., I am not willing to do that and as I told the agent in anger they can keep that, since they have taken the rest already. This is disgusting and if you search online, this seems to happen a lot where theives steal money from \" never used/inactivated '' cards and yet banks like US Bank and Mastercard continue to issue these cards every year but customers lose millions of dollars to fraud and fees. I hope they can resolve this and I hope that your agency will look into the whole issue of \" GIFT CARD '' business where banks are profiting at the expense of customers with inactivity fees for \" not using '' the service and of course the issue of fraud.\n\nBottomline is that if they are willing to resolve and restore my original amount to that card or mail me a new card with the original {$200.00} purchase ( minus any monthly fees for inactivity, as absurd as they are and fees for not using a product ) I will fax them my original receipt and the mailing address, etc., and please provide me the proper contact phone number or email address and not your general phone system where I will have to go through the whole history again with another helpless rep. \n\nThank You"
## [3] "XXXX XXXX credit card services reported that there was an incorrect 30 day late payment on my credit card. After an investigation, XXXX XXXX realized they made an error and reported it to XXXX, who updated my credit file. However, Equifax has not. Please have them fix this problem immediately."
## [4] "I opened a bank account with Chase Bank in XXXX XXXX Tx. A week after i deposited a check, my funds are still not available. I've called the bank and asked them to remove the hold. They claim the hold is due to my address not being verified. I did verify my address with the bank representative, using my Texas state identification card. I also received mail from Chase bank at this address. I'm not understanding why close to {$1000.00} of my money is being held from me because of a mistake made by the bank."
## [5] "I was a victim of fraud and for whatever reason Exeperian is not removing an account that even the company has indicated they were removing it because it was fraud. There is an account reporting under XXXX XXXX XXXX account number XXXX for which I have attached the correspondence with XXXX showing that Experian never even attempted to rectify this."
## [6] "I had a nordstrom visa creditcard (issued by tdbank ) was closed by the grantor in XX/XX/XXXX, and there were pending charges on nordstrom.com were canceled. As a result, now I have overpay the balance, since all orders were canceled. i want the bank refund the money how much i overpaid . I have called and talked to the customer service a few time and they said would call me back and never called."
## [7] "DEROGATORY ACCUSATIONS INCLUDING ALLEGED INQUIRIES MUST BE METRO 2 COMPLIANT TO RETAIN OR REPORT SO DELETE ANY NOT PHYSICALLY WITH PROOF OF PERMISSIBLE PURPOSE FROM THE EXACTLY AND LEGALLY IDENTIFIED ME.I DO NOT AUTHORIZE YOU TO REPORT NOT PROVEN COMPLIANT INFORMATION AND YOU MUST COMPLY.. DELETE XXXX XXXX, XXXX XXXX, XXXX XXXX"
## [8] "Made 2 payments to my account. Never credited but has cleared my checking account. Talked to several people on the phone still not resolved. My bank confirmed check was cashed by usaa"
## [9] "XXXX has reported my name in error conjunction with a collection account. This reporting has been on my credit file since 2014 and has cost me financially due to incorrect information being reported and maintained by XXXX."
## [10] "On XX/XX/XXXX and XXXX, 2019 a company by the name \" XXXX XXXXXXXX XXXX '' just popped up on my XXXX, XXXX, and Transunion credit report file. The original debtor is XXXX XXXX XXXX! This account does not belong to me, as all my bills have always been in my husband 's name. When I try to contact XXXX XXXX XXXX, my calls are not be accepted, their recording says their too busy to take my call. I have filed a dispute with XXXX and Transunion."
6 / 102

Text as data

  • Text like this can be used for supervised or predictive modeling
  • We can build both regression and classification models with text data
  • We can use the ways language exhibits organization to create features for modeling
7 / 102

Modeling Packages

library(tidymodels)
library(textrecipes)
  • tidymodels is a collection of packages for modeling and machine learning using tidyverse principles
  • textrecipes extends the recipes package to handle text preprocessing
8 / 102

Modeling workflow

9 / 102

smltar.com

10 / 102

Class imbalance

11 / 102

Let's approach this as a binary classification task

12 / 102

Credit or not?

credit <- "Credit reporting, credit repair services, or other personal consumer reports"
complaints2class <- complaints %>%
  mutate(product = factor(if_else(
    condition = product == credit,
    true = "Credit",
    false = "Other"
  ))) %>%
  rename(text = consumer_complaint_narrative)
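
A quick way to see the class imbalance mentioned above is to count the two levels (a sketch; output not shown here):

complaints2class %>%
  count(product)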
13 / 102

Data splitting

The testing set is a precious resource which can be used only once

14 / 102

Data splitting

set.seed(1234)
complaints_split <- initial_split(complaints2class, strata = product)
complaints_train <- training(complaints_split)
complaints_test <- testing(complaints_split)
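
Because strata = product was used in initial_split(), the class proportions should be similar across the two sets. A quick sketch to check (output not shown):

complaints_train %>% count(product) %>% mutate(prop = n / sum(n))
complaints_test %>% count(product) %>% mutate(prop = n / sum(n))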
15 / 102

Which of these variables can we use?

names(complaints_train)
## [1] "date_received" "product"
## [3] "sub_product" "issue"
## [5] "sub_issue" "text"
## [7] "company_public_response" "company"
## [9] "state" "zip_code"
## [11] "tags" "consumer_consent_provided"
## [13] "submitted_via" "date_sent_to_company"
## [15] "company_response_to_consumer" "timely_response"
## [17] "consumer_disputed" "complaint_id"
16 / 102

Feature selection checklist

  • Is it ethical to use this variable? (or even legal?)
  • Will this variable be available at prediction time?
  • Does this variable contribute to explainability?
17 / 102

Which of these variables can we use?

  • date_received
  • tags
  • consumer_complaint_narrative == 📃
19 / 102

Preprocessing specification

complaints_rec <-
  recipe(product ~ date_received + tags + text,
    data = complaints_train
  ) %>%
  step_date(date_received, features = c("month", "dow"), role = "dates") %>%
  step_rm(date_received) %>%
  step_dummy(has_role("dates")) %>%
  step_unknown(tags) %>%
  step_dummy(tags) %>%
  step_tokenize(text) %>%
  step_stopwords(text) %>%
  step_ngram(text, num_tokens = 3, min_num_tokens = 1) %>%
  step_tokenfilter(text, max_tokens = tune(), min_times = 5) %>%
  step_tfidf(text)
20 / 102

Feature engineering

complaints_rec <-
  recipe(product ~ date_received + tags + text,
    data = complaints_train
  ) %>%
  step_date(date_received, features = c("month", "dow"), role = "dates") %>%
  step_rm(date_received) %>%
  step_dummy(has_role("dates")) %>%
  step_unknown(tags) %>%
  step_dummy(tags) %>%
  step_tokenize(text) %>%
  step_stopwords(text) %>%
  step_ngram(text, num_tokens = 3, min_num_tokens = 1) %>%
  step_tokenfilter(text, max_tokens = tune(), min_times = 5) %>%
  step_tfidf(text)

Also, what does tune() mean here? 🤔

21 / 102

You can combine text and non-text features in your model

22 / 102

Feature engineering: handling dates

recipe(product ~ date_received + tags + text,
  data = complaints_train
) %>%
  step_date(date_received, features = c("month", "dow"), role = "dates") %>%
  step_rm(date_received)
## Data Recipe
##
## Inputs:
##
## role #variables
## outcome 1
## predictor 3
##
## Operations:
##
## Date features from date_received
## Delete terms date_received
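
A sketch of what these steps produce: prep() then bake() makes the new date features visible. Column names like date_received_month and date_received_dow are what step_date() typically creates; the exact output is not shown in the slides.

recipe(product ~ date_received, data = complaints_train) %>%
  step_date(date_received, features = c("month", "dow")) %>%
  step_rm(date_received) %>%
  prep() %>%
  bake(new_data = NULL) %>%
  glimpse()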
23 / 102

Feature engineering: categorical data

recipe(product ~ date_received + tags + text,
  data = complaints_train
) %>%
  step_unknown(tags) %>%
  step_dummy(tags)
## Data Recipe
##
## Inputs:
##
## role #variables
## outcome 1
## predictor 3
##
## Operations:
##
## Unknown factor level assignment for tags
## Dummy variables from tags
24 / 102

You can combine text and non-text features in your model

25 / 102

Feature engineering: text

recipe(product ~ date_received + tags + text,
  data = complaints_train
) %>%
  step_tokenize(text) %>%
  step_stopwords(text) %>%
  step_ngram(text, num_tokens = 3, min_num_tokens = 1) %>%
  step_tokenfilter(text, max_tokens = tune(), min_times = 5) %>%
  step_tfidf(text)
26 / 102

Feature engineering: text

## Data Recipe
##
## Inputs:
##
## role #variables
## outcome 1
## predictor 3
##
## Operations:
##
## Tokenization for text
## Stop word removal for text
## ngramming for text
## Text filtering for text
## Term frequency-inverse document frequency with text
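
To peek at the resulting tf-idf features before tuning, one sketch is to swap tune() for a fixed max_tokens and prep/bake on the training data (max_tokens = 100 here is only for illustration, and this can take a little while):

recipe(product ~ text, data = complaints_train) %>%
  step_tokenize(text) %>%
  step_stopwords(text) %>%
  step_ngram(text, num_tokens = 3, min_num_tokens = 1) %>%
  step_tokenfilter(text, max_tokens = 100, min_times = 5) %>%
  step_tfidf(text) %>%
  prep() %>%
  bake(new_data = NULL)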
27 / 102

How do we create features from natural language?

28 / 102

From natural language to ML features

library(tidytext)
complaints_train %>%
  unnest_tokens(word, text) %>%
  anti_join(get_stopwords(), by = "word") %>%
  count(complaint_id, word) %>%
  bind_tf_idf(word, complaint_id, n) %>%
  cast_dfm(complaint_id, word, tf_idf)
## Document-feature matrix of: 8,791 documents, 18,098 features (99.7% sparse).
29 / 102

🛑 STOP WORDS 🛑

30 / 102

Stop words

library(stopwords)
stopwords()
## [1] "i" "me" "my" "myself" "we" "our"
## [7] "ours" "ourselves" "you" "your" "yours" "yourself"
## [13] "yourselves" "he" "him" "his" "himself" "she"
## [19] "her" "hers" "herself" "it" "its" "itself"
## [25] "they" "them" "their" "theirs" "themselves" "what"
## [31] "which" "who" "whom" "this" "that" "these"
## [37] "those" "am" "is" "are" "was" "were"
## [43] "be" "been" "being" "have" "has" "had"
## [49] "having" "do" "does" "did" "doing" "would"
## [55] "should" "could" "ought" "i'm" "you're" "he's"
## [61] "she's" "it's" "we're" "they're" "i've" "you've"
## [67] "we've" "they've" "i'd" "you'd" "he'd" "she'd"
## [73] "we'd" "they'd" "i'll" "you'll" "he'll" "she'll"
## [79] "we'll" "they'll" "isn't" "aren't" "wasn't" "weren't"
## [85] "hasn't" "haven't" "hadn't" "doesn't" "don't" "didn't"
## [91] "won't" "wouldn't" "shan't" "shouldn't" "can't" "cannot"
## [97] "couldn't" "mustn't" "let's" "that's" "who's" "what's"
## [103] "here's" "there's" "when's" "where's" "why's" "how's"
## [109] "a" "an" "the" "and" "but" "if"
## [115] "or" "because" "as" "until" "while" "of"
## [121] "at" "by" "for" "with" "about" "against"
## [127] "between" "into" "through" "during" "before" "after"
## [133] "above" "below" "to" "from" "up" "down"
## [139] "in" "out" "on" "off" "over" "under"
## [145] "again" "further" "then" "once" "here" "there"
## [151] "when" "where" "why" "how" "all" "any"
## [157] "both" "each" "few" "more" "most" "other"
## [163] "some" "such" "no" "nor" "not" "only"
## [169] "own" "same" "so" "than" "too" "very"
## [175] "will"
31 / 102

Stop words

  • Stop words are context specific
  • Stop word lexicons can have bias
  • You can create your own stop word list
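
Since stop words are context specific, a custom list can be combined with a standard lexicon. A minimal sketch (the extra words here, like the "xxxx" redaction tokens, are only an illustration; custom_stopword_source is the textrecipes argument for supplying such a list):

my_stopwords <- c(stopwords::stopwords("en"), "xxxx", "xx")

recipe(product ~ text, data = complaints_train) %>%
  step_tokenize(text) %>%
  step_stopwords(text, custom_stopword_source = my_stopwords)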
33 / 102

What kind of models work well for text?

34 / 102

Text models

Remember that text data is sparse! 😮

  • Regularized linear models (glmnet)
  • Support vector machines
  • naive Bayes
  • Tree-based models like random forest? 🙅
36 / 102

Does text data have to be sparse?

37 / 102

You shall know a word by the company it keeps.

💬 John Rupert Firth

Learn more about word embeddings:

38 / 102

To specify a model in tidymodels

1. Pick a model

2. Set the mode (if needed)

3. Set the engine
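
A sketch of the three steps together, using a different model chosen here only to illustrate the pattern (random forest via the ranger engine):

rand_forest(trees = 1000) %>%
  set_mode("classification") %>%
  set_engine("ranger")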

39 / 102

To specify a model in tidymodels

All available models are listed at https://tidymodels.org/find/parsnip

41 / 102

set_mode()

Some models can solve multiple types of problems

svm_rbf() %>% set_mode(mode = "regression")
## Radial Basis Function Support Vector Machine Specification (regression)
42 / 102

set_mode()

Some models can solve multiple types of problems

svm_rbf() %>% set_mode(mode = "classification")
## Radial Basis Function Support Vector Machine Specification (classification)
43 / 102

set_engine()

The same model can be implemented by multiple computational engines

svm_rbf() %>% set_engine("kernlab")
## Radial Basis Function Support Vector Machine Specification (unknown)
##
## Computational engine: kernlab
44 / 102

set_engine()

The same model can be implemented by multiple computational engines

svm_rbf() %>% set_engine("liquidSVM")
## Radial Basis Function Support Vector Machine Specification (unknown)
##
## Computational engine: liquidSVM
45 / 102

What makes a model?

lasso_spec <- logistic_reg(penalty = tune(), mixture = 1) %>%
  set_mode("classification") %>%
  set_engine("glmnet")
lasso_spec
## Logistic Regression Model Specification (classification)
##
## Main Arguments:
## penalty = tune()
## mixture = 1
##
## Computational engine: glmnet

It's tune() again! 😟

46 / 102

Parameters and... hyperparameters?

  • Some model parameters can be learned from data during fitting/training
  • Some CANNOT 😱
  • These are hyperparameters of a model, and we estimate them by training lots of models with different hyperparameters and comparing them
47 / 102

A grid of possible hyperparameters

param_grid <- grid_regular(
  penalty(range = c(-4, 0)),
  max_tokens(range = c(500, 2000)),
  levels = 6
)
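
Note that penalty() is parameterized on a log10 scale, so range = c(-4, 0) spans penalty values from 1e-4 to 1. A quick sketch of the spacing for that one parameter:

grid_regular(penalty(range = c(-4, 0)), levels = 6)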
48 / 102

A grid of possible hyperparameters

## # A tibble: 36 x 2
## penalty max_tokens
## <dbl> <int>
## 1 0.0001 500
## 2 0.000631 500
## 3 0.00398 500
## 4 0.0251 500
## 5 0.158 500
## 6 1 500
## 7 0.0001 800
## 8 0.000631 800
## 9 0.00398 800
## 10 0.0251 800
## # … with 26 more rows
49 / 102

How can we compare and evaluate these different models?

50 / 102

Spend your data budget

set.seed(123)
complaints_folds <- vfold_cv(complaints_train, v = 10, strata = product)
complaints_folds
## # 10-fold cross-validation using stratification
## # A tibble: 10 x 2
## splits id
## <list> <chr>
## 1 <split [7.9K/880]> Fold01
## 2 <split [7.9K/880]> Fold02
## 3 <split [7.9K/880]> Fold03
## 4 <split [7.9K/880]> Fold04
## 5 <split [7.9K/879]> Fold05
## 6 <split [7.9K/879]> Fold06
## 7 <split [7.9K/879]> Fold07
## 8 <split [7.9K/878]> Fold08
## 9 <split [7.9K/878]> Fold09
## 10 <split [7.9K/878]> Fold10
52 / 102

✨ CROSS-VALIDATION ✨

53 / 102

Art by Alison Hill

54-63 / 102

Spend your data wisely to create simulated validation sets

64 / 102

Now we have resamples, features, plus a model

65 / 102

Create a workflow

complaints_wf <- workflow() %>%
  add_recipe(complaints_rec) %>%
  add_model(lasso_spec)
66 / 102

What is a workflow()?

67 / 102

Time to tune! ⚡

set.seed(42)
lasso_rs <- tune_grid(
  complaints_wf,
  resamples = complaints_folds,
  grid = param_grid,
  control = control_grid(save_pred = TRUE)
)
68 / 102

Time to tune! ⚡

## # Tuning results
## # 10-fold cross-validation using stratification
## # A tibble: 10 x 5
## splits id .metrics .notes .predictions
## <list> <chr> <list> <list> <list>
## 1 <split [7.9K/880]> Fold01 <tibble [72 × 6]> <tibble [0 × 1]> <tibble [31,680 × 8]>
## 2 <split [7.9K/880]> Fold02 <tibble [72 × 6]> <tibble [0 × 1]> <tibble [31,680 × 8]>
## 3 <split [7.9K/880]> Fold03 <tibble [72 × 6]> <tibble [0 × 1]> <tibble [31,680 × 8]>
## 4 <split [7.9K/880]> Fold04 <tibble [72 × 6]> <tibble [0 × 1]> <tibble [31,680 × 8]>
## 5 <split [7.9K/879]> Fold05 <tibble [72 × 6]> <tibble [0 × 1]> <tibble [31,644 × 8]>
## 6 <split [7.9K/879]> Fold06 <tibble [72 × 6]> <tibble [0 × 1]> <tibble [31,644 × 8]>
## 7 <split [7.9K/879]> Fold07 <tibble [72 × 6]> <tibble [0 × 1]> <tibble [31,644 × 8]>
## 8 <split [7.9K/878]> Fold08 <tibble [72 × 6]> <tibble [0 × 1]> <tibble [31,608 × 8]>
## 9 <split [7.9K/878]> Fold09 <tibble [72 × 6]> <tibble [0 × 1]> <tibble [31,608 × 8]>
## 10 <split [7.9K/878]> Fold10 <tibble [72 × 6]> <tibble [0 × 1]> <tibble [31,608 × 8]>
69 / 102

💫 TOKENIZATION 💫

70 / 102

Tokenization

  • The process of splitting text into smaller pieces (tokens)
  • Most common token == word, but sometimes we tokenize in a different way
  • An essential part of most text analyses
  • Many options to take into consideration
71 / 102

Tokenization: whitespace

token_example
## [1] "I am a long-time victim of identity theft. This debt doesn't belong to me."
strsplit(token_example, "\\s")
## [[1]]
## [1] "I" "am" "a" "long-time" "victim" "of" "identity"
## [8] "theft." "This" "debt" "doesn't" "belong" "to" "me."
72 / 102

Tokenization: tokenizers package

token_example
## [1] "I am a long-time victim of identity theft. This debt doesn't belong to me."
library(tokenizers)
tokenize_words(token_example)
## [[1]]
## [1] "i" "am" "a" "long" "time" "victim" "of" "identity"
## [9] "theft" "this" "debt" "doesn't" "belong" "to" "me"
73 / 102

Tokenization: spaCy library

token_example
## [1] "I am a long-time victim of identity theft. This debt doesn't belong to me."
library(spacyr)
spacy_tokenize(token_example)
## [[1]]
## [1] "I" "am" "a" "long" "-" "time" "victim" "of"
## [9] "identity" "theft" "." "This" "debt" "does" "n't" "belong"
## [17] "to" "me" "."
74 / 102

whitespace

## [[1]]
## [1] "I" "am" "a" "long-time" "victim" "of" "identity"
## [8] "theft." "This" "debt" "doesn't" "belong" "to" "me."

tokenizers package

## [[1]]
## [1] "i" "am" "a" "long" "time" "victim" "of" "identity"
## [9] "theft" "this" "debt" "doesn't" "belong" "to" "me"

spaCy library

## [[1]]
## [1] "I" "am" "a" "long" "-" "time" "victim" "of"
## [9] "identity" "theft" "." "This" "debt" "does" "n't" "belong"
## [17] "to" "me" "."
75 / 102

Tokenization considerations

  • Should we turn UPPERCASE letters to lowercase?
  • How should we handle punctuation⁉️
  • What about non-word characters inside words?
  • Should compound words be split or multi-word ideas be kept together?
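
Several of these choices map directly onto arguments of tokenize_words(); this particular combination of settings is only an example:

tokenize_words(
  token_example,
  lowercase = TRUE,
  strip_punct = TRUE,
  strip_numeric = FALSE
)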
76 / 102

Tokenization for English text is typically much easier than for many other languages.

77 / 102

N-grams

A contiguous sequence of n tokens

  • Captures words that appear together often
  • Can detect negations ("not happy")
  • Larger cardinality (many more unique tokens)
78 / 102
tokenize_ngrams(token_example, n = 1)
## [[1]]
## [1] "i" "am" "a" "long" "time" "victim" "of" "identity"
## [9] "theft" "this" "debt" "doesn't" "belong" "to" "me"
tokenize_ngrams(token_example, n = 2)
## [[1]]
## [1] "i am" "am a" "a long" "long time" "time victim"
## [6] "victim of" "of identity" "identity theft" "theft this" "this debt"
## [11] "debt doesn't" "doesn't belong" "belong to" "to me"
tokenize_ngrams(token_example, n = 3)
## [[1]]
## [1] "i am a" "am a long" "a long time" "long time victim"
## [5] "time victim of" "victim of identity" "of identity theft" "identity theft this"
## [9] "theft this debt" "this debt doesn't" "debt doesn't belong" "doesn't belong to"
## [13] "belong to me"
79 / 102

Tokenization

See Chapter 2 for more!

80 / 102

Look at the tuning results 👀

collect_metrics(lasso_rs)
## # A tibble: 72 x 8
## penalty max_tokens .metric .estimator mean n std_err .config
## <dbl> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 0.0001 500 accuracy binary 0.864 10 0.00490 Recipe1_Model1
## 2 0.0001 500 roc_auc binary 0.928 10 0.00268 Recipe1_Model1
## 3 0.000631 500 accuracy binary 0.867 10 0.00467 Recipe1_Model2
## 4 0.000631 500 roc_auc binary 0.931 10 0.00264 Recipe1_Model2
## 5 0.00398 500 accuracy binary 0.869 10 0.00473 Recipe1_Model3
## 6 0.00398 500 roc_auc binary 0.934 10 0.00282 Recipe1_Model3
## 7 0.0251 500 accuracy binary 0.840 10 0.00502 Recipe1_Model4
## 8 0.0251 500 roc_auc binary 0.911 10 0.00427 Recipe1_Model4
## 9 0.158 500 accuracy binary 0.539 10 0.00351 Recipe1_Model5
## 10 0.158 500 roc_auc binary 0.723 10 0.00575 Recipe1_Model5
## # … with 62 more rows
81 / 102
autoplot(lasso_rs)

82 / 102

Look at the tuning results 👀

lasso_rs %>%
  show_best("roc_auc")
## # A tibble: 5 x 8
## penalty max_tokens .metric .estimator mean n std_err .config
## <dbl> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 0.00398 1400 roc_auc binary 0.942 10 0.00223 Recipe4_Model3
## 2 0.00398 2000 roc_auc binary 0.942 10 0.00241 Recipe6_Model3
## 3 0.00398 1700 roc_auc binary 0.942 10 0.00222 Recipe5_Model3
## 4 0.00398 1100 roc_auc binary 0.941 10 0.00243 Recipe3_Model3
## 5 0.00398 800 roc_auc binary 0.940 10 0.00293 Recipe2_Model3
83 / 102

The best 🥇 hyperparameters

best_roc_auc <- select_best(lasso_rs, "roc_auc")
best_roc_auc
## # A tibble: 1 x 3
## penalty max_tokens .config
## <dbl> <int> <chr>
## 1 0.00398 1400 Recipe4_Model3
84 / 102

Evaluate the best model 📏

collect_predictions(lasso_rs, parameters = best_roc_auc)
## # A tibble: 8,791 x 9
## id .pred_Credit .pred_Other .row max_tokens penalty .pred_class product .config
## <chr> <dbl> <dbl> <int> <int> <dbl> <fct> <fct> <chr>
## 1 Fold01 0.0143 0.986 1 1400 0.00398 Other Other Recipe4_Mod…
## 2 Fold01 0.992 0.00798 3 1400 0.00398 Credit Credit Recipe4_Mod…
## 3 Fold01 0.0335 0.966 8 1400 0.00398 Other Other Recipe4_Mod…
## 4 Fold01 0.644 0.356 12 1400 0.00398 Credit Credit Recipe4_Mod…
## 5 Fold01 0.133 0.867 25 1400 0.00398 Other Other Recipe4_Mod…
## 6 Fold01 0.158 0.842 41 1400 0.00398 Other Other Recipe4_Mod…
## 7 Fold01 0.0801 0.920 64 1400 0.00398 Other Other Recipe4_Mod…
## 8 Fold01 0.831 0.169 88 1400 0.00398 Credit Credit Recipe4_Mod…
## 9 Fold01 0.236 0.764 113 1400 0.00398 Other Other Recipe4_Mod…
## 10 Fold01 0.981 0.0194 115 1400 0.00398 Credit Credit Recipe4_Mod…
## # … with 8,781 more rows
85 / 102

Evaluate the best model 📏

collect_predictions(lasso_rs, parameters = best_roc_auc) %>%
  group_by(id) %>%
  roc_curve(truth = product, .pred_Credit) %>%
  autoplot()
86 / 102

Evaluate the best model 📏

87 / 102

Update the workflow

We can update our workflow with the best performing hyperparameters.

wf_spec_final <- finalize_workflow(complaints_wf, best_roc_auc)

This workflow is ready to go! It can now be applied to new data.
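
For example, a minimal sketch of applying it, where new_complaints is a hypothetical data frame with the same columns as the training data:

fitted_wf <- fit(wf_spec_final, data = complaints_train)
predict(fitted_wf, new_data = new_complaints, type = "prob")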

88 / 102

How is our model thinking?

89 / 102

Variable importance

library(vip)
wf_spec_final %>%
  fit(complaints_train) %>%
  pull_workflow_fit() %>%
  vi(lambda = best_roc_auc$penalty) %>%
  filter(!str_detect(Variable, "tfidf")) %>%
  filter(Importance != 0)
## # A tibble: 19 x 3
## Variable Importance Sign
## <chr> <dbl> <chr>
## 1 tags_Older.American..Servicemember 0.509 POS
## 2 date_received_dow_Mon 0.480 POS
## 3 date_received_dow_Fri 0.337 POS
## 4 date_received_dow_Thu 0.253 POS
## 5 date_received_dow_Wed 0.108 POS
## 6 date_received_dow_Sat 0.0106 POS
## 7 tags_unknown 0.00293 POS
## 8 date_received_month_Sep -0.0558 NEG
## 9 date_received_month_Jun -0.0615 NEG
## 10 date_received_month_Apr -0.100 NEG
## 11 tags_Servicemember -0.132 NEG
## 12 date_received_month_Aug -0.227 NEG
## 13 date_received_month_Mar -0.253 NEG
## 14 date_received_month_May -0.361 NEG
## 15 date_received_month_Jul -0.564 NEG
## 16 date_received_month_Oct -0.586 NEG
## 17 date_received_month_Nov -0.717 NEG
## 18 date_received_month_Feb -0.734 NEG
## 19 date_received_month_Dec -1.09 NEG
90 / 102

Variable importance

vi_data <- wf_spec_final %>%
  fit(complaints_train) %>%
  pull_workflow_fit() %>%
  vi(lambda = best_roc_auc$penalty) %>%
  mutate(Variable = str_remove_all(Variable, "tfidf_text_")) %>%
  filter(Importance != 0)
91 / 102

Variable importance

vi_data
## # A tibble: 1,377 x 3
## Variable Importance Sign
## <chr> <dbl> <chr>
## 1 identity_theft_2 296. POS
## 2 section_consumer 262. POS
## 3 appraisal 153. POS
## 4 agency_shall 130. POS
## 5 credit_reporting_act 123. POS
## 6 xxxx_oh 99.5 POS
## 7 blocked_information 92.0 POS
## 8 pnc 88.8 POS
## 9 requested_consumer 83.9 POS
## 10 american_express 83.0 POS
## # … with 1,367 more rows
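
One way to visualize these term importances; this is a sketch only, since the styling of the figure on the next slide is not shown here:

vi_data %>%
  group_by(Sign) %>%
  slice_max(abs(Importance), n = 20) %>%
  ungroup() %>%
  ggplot(aes(abs(Importance), fct_reorder(Variable, abs(Importance)), fill = Sign)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~Sign, scales = "free_y") +
  labs(x = "Importance", y = NULL)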
92 / 102

Credit Complaint #1

Credit And Other

i have contacted equifax on at least 5 occasions to include a xxxx xxxx credit card on my profile i have provided photo identification my ss and a copy of the card because they have refused to update my profile my credit score remains below 600 and i have been denied credit on several occasions i have had the card for at least 4 years yet my credit profile with equifax claims i have no credit history i have contacted them via letters online disputes and telephone to no avail please advise as to other remedies available to me
94 / 102

Credit Complaint #2

Credit And Other

i have disputed items with equifax back on xx xx 2019 and they have failed to respond i have also sent them a notice that they failed to respond and they have ignore my request to dispute items that are reporting on my credit reports
95 / 102

Other Complaint #1

Credit And Other

i was contacted on xxxx xx xx 2019 by portfolio recovery associates about a past debt that i have no knowledge of having the representative mentioned that there was an attempt to deliver paperwork but paperwork delivery was at a different location that my physical home address upon requesting further clarification the representative continued to say how i was refusing to pay the debt i never refused to pay the debt i would just like clarification on the debt i will be delivering a cease and desist order to the aforementioned party
96 / 102

Other Complaint #2

Credit And Other

xxxx xxxx is alleging that i owe them 690.00 i did not breach my rental contract and do not understand the charges against me i have attempted to resolve the matter but every time i try the claim that i owe them additional monies i believe that i am a victim of fraud and unfair business tactics management is run very poorly and one person says one thing another says another thing i do have documentation to prove my innocence
97 / 102

Final fit

We will now use last_fit() to fit our model one last time on our training data and evaluate it on our testing data.

final_fit <- last_fit(
  wf_spec_final,
  complaints_split
)
98 / 102

Notice that this is the first and only time we have used our testing data

99 / 102

Evaluate on the test data 📏

final_fit %>%
  collect_metrics()
## # A tibble: 2 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.880
## 2 roc_auc binary 0.943
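
Beyond these metrics, a confusion matrix on the test-set predictions is a quick sketch of where the model errs (conf_mat() comes from yardstick, loaded with tidymodels):

final_fit %>%
  collect_predictions() %>%
  conf_mat(truth = product, estimate = .pred_class)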
100 / 102
final_fit %>%
  collect_predictions() %>%
  roc_curve(truth = product, .pred_Credit) %>%
  autoplot()

101 / 102