class: center, middle, title-slide # Predictive modeling with text using tidy data principles ## useR2020 ### Julia Silge & Emil Hvitfeldt ### 2020-07-24 --- class: center, middle # Hello! .pull-left[ <img style="border-radius: 50%;" src="https://github.com/EmilHvitfeldt.png" width="150px"/> [
@EmilHvitfeldt](https://github.com/EmilHvitfeldt) [
@Emil_Hvitfeldt](https://twitter.com/Emil_Hvitfeldt) [
hvitfeldt.me](https://www.hvitfeldt.me/) ] .pull-right[ <img style="border-radius: 50%;" src="https://github.com/juliasilge.png" width="150px"/> [
@juliasilge](https://github.com/juliasilge) [
@juliasilge](https://twitter.com/juliasilge) [
juliasilge.com](https://juliasilge.com) ] --- # Plan for today -- - We will walk through our case study using slides and live coding -- - After the tutorial, use the [materials on GitHub](https://github.com/EmilHvitfeldt/useR2020-text-modeling-tutorial) and YouTube recording to run the code yourself 💪 --- # Text as data Let's look at complaints submitted to the [United States Consumer Financial Protection Bureau (CFPB)](https://www.consumerfinance.gov/data-research/consumer-complaints/). -- - An individual experiences a problem 😩 with a consumer financial product or service, like a bank, loan, or credit card 💰 -- - They submit a **complaint** to the CFPB explaining what happened 😡 -- - This complaint is sent to the company, which can respond or dispute --- # Text as data Let's look at complaints submitted to the [United States Consumer Financial Protection Bureau (CFPB)](https://www.consumerfinance.gov/data-research/consumer-complaints/). ```r library(tidyverse) complaints <- read_csv("data/complaints.csv.gz") %>% sample_frac(0.1) names(complaints) ``` ``` ## [1] "date_received" "product" ## [3] "sub_product" "issue" ## [5] "sub_issue" "consumer_complaint_narrative" ## [7] "company_public_response" "company" ## [9] "state" "zip_code" ## [11] "tags" "consumer_consent_provided" ## [13] "submitted_via" "date_sent_to_company" ## [15] "company_response_to_consumer" "timely_response" ## [17] "consumer_disputed" "complaint_id" ``` --- # Text as data ```r complaints %>% sample_n(10) %>% pull(consumer_complaint_narrative) ``` ``` ## [1] "I don't recognize this inquiries. I never applied for it. Except as otherwise provided in this section, a consumer reporting agency shall block the reporting of any information in the file of a consumer that the consumer identifies as information that resulted from an alleged identity theft, not later than 4 business days after the date of receipt by such agency of ( 1 ) appropriate proof of the identity of the consumer ; ( 2 ) a copy of an identity theft report ; ( 3 ) the identification of such information by the consumer ; and ( 4 ) a statement by the consumer that the information is not information relating to any transaction by the consumer. ( b ) Notification. A consumer reporting agency shall promptly notify the furnisher of information identified by the consumer under subsection ( a ) of this section ( 1 ) that the information may be a result of identity theft ; ( 2 ) that an identity theft report has been filed ; ( 3 ) that a block has been requested under this section ; and ( 4 ) of the effective dates of the block. ( c ) Authority to decline or rescind. ( 1 ) In general. A consumer reporting agency may decline to block, or may rescind any block, of information relating to a consumer under this section, if the consumer reporting agency reasonably determines that ( A ) the information was blocked in error or a block was requested by the consumer in error ; ( B ) the information was blocked, or a block was requested by the consumer, on the basis of a material misrepresentation of fact by the consumer relevant to the request to block ; or ( C ) the consumer obtained possession of goods, services, or money as a result of the blocked transaction or transactions. ( 2 ) Notification to consumer. If a block of information is declined or rescinded under this subsection, the affected consumer shall be notified promptly, in the same manner as consumers are notified of the reinsertion of information under section 1681i ( a ) ( 5 ) ( B ) of this title. 
( 3 ) Significance of block. For purposes of this subsection, if a consumer reporting agency rescinds a block, the presence of information in the file of a consumer prior to the blocking of such information is not evidence of whether the consumer knew or should have known that the consumer obtained possession of any goods, services, or money as a result of the block. ( d ) Exception for resellers. ( 1 ) No reseller file. This section shall not apply to a consumer reporting agency, if the consumer reporting agency ( A ) is a reseller ; ( B ) is not, at the time of the request of the consumer under subsection ( a ) of this section, otherwise furnishing or reselling a consumer report concerning the information identified by the consumer ; and ( C ) informs the consumer, by any means, that the consumer may report the identity theft to the Bureau to obtain consumer information regarding identity theft. ( 2 ) Reseller with file. The sole obligation of the consumer reporting agency under this section, with regard to any request of a consumer under this section, shall be to block the consumer report maintained by the consumer reporting agency from any subsequent use, if ( A ) the consumer, in accordance with the provisions of subsection ( a ) of this section, identifies, to a consumer reporting agency, information in the file of the consumer that resulted from identity theft ; and ( B ) the consumer reporting agency is a reseller of the identified information. ( 3 ) Notice. In carrying out its obligation under paragraph ( 2 ), the reseller shall promptly provide a notice to the consumer of the decision to block the file. Such notice shall contain the name, address, and telephone number of each consumer reporting agency from which the consumer information was obtained for resale. ( e ) Exception for verification companies. The provisions of this section do not apply to a check services company, acting as such, which issues authorizations for the purpose of approving or processing negotiable instruments, electronic fund transfers, or similar methods of payments, except that, beginning 4 business days after receipt of information described in paragraphs ( 1 ) through ( 3 ) of subsection ( a ) of this section, a check services company shall not report to a national consumer reporting agency described in section 1681a ( p ) of this title, any information identified in the subject identity theft report as resulting from identity theft. ( f ) Access to blocked information by law enforcement agencies. No provision of this section shall be construed as requiring a consumer reporting agency to prevent a Federal, State, or local law enforcement agency from accessing blocked information in a consumer file to which the agency could otherwise obtain access under this subchapter." ## [2] "NOTE : I have selected available options regarding the complaint but it is regarding the purchases that I did not make as the card was never used, activated or opened from the origina wrapper until XXXX - details below. \n\nCOMPLAINT about Mastercard Gift Card but inside wrapper pamphlet says \" US Bank Gift Card : Ok, so one more thing to do before the year end.\n\nI purchased 2 Mastercard gift cards in XX/XX/XXXX from XXXX ( I still have the original receipt ) mainly to use these cards for phone recharging and some online purchases. 
Before this I had purchased a similar {$200.00} Mastercard for the first time and was using it without any issues and since do small amounts in recharges the first one was still good and I had put away these other 2 gift cards in my safe box ie. I never opened them or activated them or used them. One thing I would like to mention is that I did not know myraid of \" GOTCHAS '' these cards have and as the date of expiration on these is XXXX, I thought they were good as long as I did not activate and use them and never thought these are susceptible to fraud when they have not been activated or used. In fact when I search online ( and if you search online ) it seems that there have been thousands of users who have been defrauded. It is one thing when a card is active and used online and some thief or hacker gets a hold of the card number and uses it for fraudulent purchases - we all know this happens, but what if a card has never been activated or used for any purchases and put away in a safe with original wrapper unopened. How does a thief get a hold of the card information? It seems like this happens often ( specifically with Mastercard Gift Cards by US BANK ) and again if you go online you can see several forums where users discuss this. This is something I was not aware of. \n\nSo, in XXXX XXXX I opened one of the 2 cards that I had in my safe ( in original wrapper ) to start using it and upon calling the phone number to check my balance I found that the card had been charged a \" {$2.00} fee for each month '' for inactivity and almost half of the account had been drained by monthly inactivity fees. Again, I was not aware of this but I take responsibility for it although I would like to mention that it seems like a novel idea to charge someone for \" not using a product '' - we all know companies charge for something when a product is used but how about charging someone for not using it? Is this done by design to make more money or does it really cost the bank in fees when the consumer is not using the service? \n\nHowever, the real issue here is that with regards to my 2nd card and when I checked the balance on that it had a balance of {$11.00} ( something like that ) and I was shocked. So, I called the customer service. Without going into all the details of the conversation which will make this even longer, he first told me that he did not know as there is history for only 24 months in the system. Then after I pressed him and told him that I never used the card before then said that the card was first used in XXXX of XXXX there were two charges for {$43.00} or something. So, I was not able to understand how he was able to get that information beyond 24 months which he was not willing to share first. Again, I would like to reiterate that I had never opened the wrapper, activated or used this card and I told him that and at that he said that he supected fraud and that he was going to lock the card and if I wanted to use the card I have to fax my original receipt and my address etc., so they can mail me a new card and when I asked him if it was for the original purchase amount of {$200.00}, he said no it would be for {$11.00}. I asked to speak with a supervisor and he said he would have someone call me back. Anyway, after that I spoke with 2-3 agents and they said they would refund me back the fees. Later I understood that by that they meant that they were \" only refunding me the {$2.00} fee '' and not restore the original card. Different agents told me different versions. 
Anyway, I wanted to complain to CFPB about this since XXXX ( as you can see how long this is ) and just did not have the time to do so and wanted to this before the year end. \n\nThe issue is that as of today, they have locked that card, ( even with what was left {$11.00} ) and for that it seems I have to fax them and write them, etc., I am not willing to do that and as I told the agent in anger they can keep that, since they have taken the rest already. This is disgusting and if you search online, this seems to happen a lot where theives steal money from \" never used/inactivated '' cards and yet banks like US Bank and Mastercard continue to issue these cards every year but customers lose millions of dollars to fraud and fees. I hope they can resolve this and I hope that your agency will look into the whole issue of \" GIFT CARD '' business where banks are profiting at the expense of customers with inactivity fees for \" not using '' the service and of course the issue of fraud.\n\nBottomline is that if they are willing to resolve and restore my original amount to that card or mail me a new card with the original {$200.00} purchase ( minus any monthly fees for inactivity, as absurd as they are and fees for not using a product ) I will fax them my original receipt and the mailing address, etc., and please provide me the proper contact phone number or email address and not your general phone system where I will have to go through the whole history again with another helpless rep. \n\nThank You" ## [3] "XXXX XXXX credit card services reported that there was an incorrect 30 day late payment on my credit card. After an investigation, XXXX XXXX realized they made an error and reported it to XXXX, who updated my credit file. However, Equifax has not. Please have them fix this problem immediately." ## [4] "I opened a bank account with Chase Bank in XXXX XXXX Tx. A week after i deposited a check, my funds are still not available. I've called the bank and asked them to remove the hold. They claim the hold is due to my address not being verified. I did verify my address with the bank representative, using my Texas state identification card. I also received mail from Chase bank at this address. I'm not understanding why close to {$1000.00} of my money is being held from me because of a mistake made by the bank." ## [5] "I was a victim of fraud and for whatever reason Exeperian is not removing an account that even the company has indicated they were removing it because it was fraud. There is an account reporting under XXXX XXXX XXXX account number XXXX for which I have attached the correspondence with XXXX showing that Experian never even attempted to rectify this." ## [6] "I had a nordstrom visa creditcard (issued by tdbank ) was closed by the grantor in XX/XX/XXXX, and there were pending charges on nordstrom.com were canceled. As a result, now I have overpay the balance, since all orders were canceled. i want the bank refund the money how much i overpaid . I have called and talked to the customer service a few time and they said would call me back and never called." ## [7] "DEROGATORY ACCUSATIONS INCLUDING ALLEGED INQUIRIES MUST BE METRO 2 COMPLIANT TO RETAIN OR REPORT SO DELETE ANY NOT PHYSICALLY WITH PROOF OF PERMISSIBLE PURPOSE FROM THE EXACTLY AND LEGALLY IDENTIFIED ME.I DO NOT AUTHORIZE YOU TO REPORT NOT PROVEN COMPLIANT INFORMATION AND YOU MUST COMPLY.. DELETE XXXX XXXX, XXXX XXXX, XXXX XXXX" ## [8] "Made 2 payments to my account. Never credited but has cleared my checking account. 
Talked to several people on the phone still not resolved. My bank confirmed check was cashed by usaa" ## [9] "XXXX has reported my name in error conjunction with a collection account. This reporting has been on my credit file since 2014 and has cost me financially due to incorrect information being reported and maintained by XXXX." ## [10] "On XX/XX/XXXX and XXXX, 2019 a company by the name \" XXXX XXXXXXXX XXXX '' just popped up on my XXXX, XXXX, and Transunion credit report file. The original debtor is XXXX XXXX XXXX! This account does not belong to me, as all my bills have always been in my husband 's name. When I try to contact XXXX XXXX XXXX, my calls are not be accepted, their recording says their too busy to take my call. I have filed a dispute with XXXX and Transunion." ``` --- # Text as data -- - Text like this can be used for **supervised** or **predictive** modeling -- - We can build both regression and classification models with text data -- - We can use the ways language exhibits organization to create features for modeling --- # Modeling Packages ```r library(tidymodels) library(textrecipes) ``` - [tidymodels](https://www.tidymodels.org/) is a collection of packages for modeling and machine learning using tidyverse principles - [textrecipes](https://textrecipes.tidymodels.org/) extends the recipes package to handle text preprocessing --- # Modeling workflow ![](https://rviews.rstudio.com/post/2019-06-14-a-gentle-intro-to-tidymodels_files/figure-html/tidymodels.png) --- # smltar.com <iframe src="https://smltar.com/" width="100%" height="400px"></iframe> --- # Class imbalance <img src="index_files/figure-html/unnamed-chunk-8-1.png" width="700px" style="display: block; margin: auto;" /> --- class: inverse, right, middle # Let's approach this as a **binary classification task** --- # Credit or not? ```r credit <- "Credit reporting, credit repair services, or other personal consumer reports" complaints2class <- complaints %>% mutate(product = factor(if_else( condition = product == credit, true = "Credit", false = "Other"))) %>% rename(text = consumer_complaint_narrative) ``` --- # Data splitting The testing set is a precious resource which can be used only once <img src="index_files/figure-html/all-split-1.png" width="700px" style="display: block; margin: auto;" /> --- # Data splitting ```r set.seed(1234) complaints_split <- initial_split(complaints2class, strata = product) complaints_train <- training(complaints_split) complaints_test <- testing(complaints_split) ``` --- # Which of these variables can we use? ```r names(complaints_train) ``` ``` ## [1] "date_received" "product" ## [3] "sub_product" "issue" ## [5] "sub_issue" "text" ## [7] "company_public_response" "company" ## [9] "state" "zip_code" ## [11] "tags" "consumer_consent_provided" ## [13] "submitted_via" "date_sent_to_company" ## [15] "company_response_to_consumer" "timely_response" ## [17] "consumer_disputed" "complaint_id" ``` --- # Feature selection checklist -- - Is it ethical to use this variable? (or even legal?) -- - Will this variable be available at prediction time? -- - Does this variable contribute to explainability? --- # Which of these variables can we use? 
```r names(complaints_train) ``` ``` ## [1] "date_received" "product" ## [3] "sub_product" "issue" ## [5] "sub_issue" "text" ## [7] "company_public_response" "company" ## [9] "state" "zip_code" ## [11] "tags" "consumer_consent_provided" ## [13] "submitted_via" "date_sent_to_company" ## [15] "company_response_to_consumer" "timely_response" ## [17] "consumer_disputed" "complaint_id" ``` --- # Which of these variables can we use? - `date_received` - `tags` - `consumer_complaint_narrative` == 📃 --- # Preprocessing specification ```r complaints_rec <- recipe(product ~ date_received + tags + text, data = complaints_train ) %>% step_date(date_received, features = c("month", "dow"), role = "dates") %>% step_rm(date_received) %>% step_dummy(has_role("dates")) %>% step_unknown(tags) %>% step_dummy(tags) %>% step_tokenize(text) %>% step_stopwords(text) %>% step_ngram(text, num_tokens = 3, min_num_tokens = 1) %>% step_tokenfilter(text, max_tokens = tune(), min_times = 5) %>% step_tfidf(text) ``` --- # Feature engineering ```r complaints_rec <- recipe(product ~ date_received + tags + text, data = complaints_train ) %>% step_date(date_received, features = c("month", "dow"), role = "dates") %>% step_rm(date_received) %>% step_dummy(has_role("dates")) %>% step_unknown(tags) %>% step_dummy(tags) %>% step_tokenize(text) %>% step_stopwords(text) %>% step_ngram(text, num_tokens = 3, min_num_tokens = 1) %>% step_tokenfilter(text, max_tokens = tune(), min_times = 5) %>% step_tfidf(text) ``` -- Also, what does `tune()` mean here? 🤔 --- class: inverse, right, middle # You can **combine** text and non-text features in your model --- # Feature engineering: handling dates ```r recipe(product ~ date_received + tags + text, data = complaints_train ) %>% step_date(date_received, features = c("month", "dow"), role = "dates") %>% step_rm(date_received) ``` ``` ## Data Recipe ## ## Inputs: ## ## role #variables ## outcome 1 ## predictor 3 ## ## Operations: ## ## Date features from date_received ## Delete terms date_received ``` --- # Feature engineering: categorical data ```r recipe(product ~ date_received + tags + text, data = complaints_train ) %>% step_unknown(tags) %>% step_dummy(tags) ``` ``` ## Data Recipe ## ## Inputs: ## ## role #variables ## outcome 1 ## predictor 3 ## ## Operations: ## ## Unknown factor level assignment for tags ## Dummy variables from tags ``` --- class: inverse, right, middle # You can **combine** text and non-text features in your model --- # Feature engineering: text ```r recipe(product ~ date_received + tags + text, data = complaints_train ) %>% step_tokenize(text) %>% step_stopwords(text) %>% step_ngram(text, num_tokens = 3, min_num_tokens = 1) %>% step_tokenfilter(text, max_tokens = tune(), min_times = 5) %>% step_tfidf(text) ``` --- # Feature engineering: text ``` ## Data Recipe ## ## Inputs: ## ## role #variables ## outcome 1 ## predictor 3 ## ## Operations: ## ## Tokenization for text ## Stop word removal for text ## ngramming for text ## Text filtering for text ## Term frequency-inverse document frequency with text ``` --- class: inverse, right, middle # How do we create **features** from **natural language**? --- # From natural language to ML features ```r library(tidytext) complaints_train %>% unnest_tokens(word, text) %>% anti_join(get_stopwords(), by = "word") %>% count(complaint_id, word) %>% bind_tf_idf(word, complaint_id, n) %>% cast_dfm(complaint_id, word, tf_idf) ``` ``` ## Document-feature matrix of: 8,791 documents, 18,098 features (99.7% sparse). 
``` --- class: inverse, center, middle # 🛑 STOP WORDS 🛑 --- # Stop words ```r library(stopwords) stopwords() ``` ``` ## [1] "i" "me" "my" "myself" "we" "our" ## [7] "ours" "ourselves" "you" "your" "yours" "yourself" ## [13] "yourselves" "he" "him" "his" "himself" "she" ## [19] "her" "hers" "herself" "it" "its" "itself" ## [25] "they" "them" "their" "theirs" "themselves" "what" ## [31] "which" "who" "whom" "this" "that" "these" ## [37] "those" "am" "is" "are" "was" "were" ## [43] "be" "been" "being" "have" "has" "had" ## [49] "having" "do" "does" "did" "doing" "would" ## [55] "should" "could" "ought" "i'm" "you're" "he's" ## [61] "she's" "it's" "we're" "they're" "i've" "you've" ## [67] "we've" "they've" "i'd" "you'd" "he'd" "she'd" ## [73] "we'd" "they'd" "i'll" "you'll" "he'll" "she'll" ## [79] "we'll" "they'll" "isn't" "aren't" "wasn't" "weren't" ## [85] "hasn't" "haven't" "hadn't" "doesn't" "don't" "didn't" ## [91] "won't" "wouldn't" "shan't" "shouldn't" "can't" "cannot" ## [97] "couldn't" "mustn't" "let's" "that's" "who's" "what's" ## [103] "here's" "there's" "when's" "where's" "why's" "how's" ## [109] "a" "an" "the" "and" "but" "if" ## [115] "or" "because" "as" "until" "while" "of" ## [121] "at" "by" "for" "with" "about" "against" ## [127] "between" "into" "through" "during" "before" "after" ## [133] "above" "below" "to" "from" "up" "down" ## [139] "in" "out" "on" "off" "over" "under" ## [145] "again" "further" "then" "once" "here" "there" ## [151] "when" "where" "why" "how" "all" "any" ## [157] "both" "each" "few" "more" "most" "other" ## [163] "some" "such" "no" "nor" "not" "only" ## [169] "own" "same" "so" "than" "too" "very" ## [175] "will" ``` --- class: center, middle <img src="index_files/figure-html/unnamed-chunk-21-1.png" width="700px" style="display: block; margin: auto;" /> --- # Stop words -- - Stop words are context specific -- - Stop word lexicons can have bias -- - You can create your own stop word list -- - See [Chapter 3](https://smltar.com/stopwords.html) for more! 🛑 --- class: inverse, right, middle ## What kind of **models** work well for text? --- # Text models Remember that text data is sparse! 😮 -- - Regularized linear models (glmnet) - Support vector machines - naive Bayes - Tree-based models like random forest? --- # Text models Remember that text data is sparse! 😮 - Regularized linear models (glmnet) - Support vector machines - naive Bayes - Tree-based models like random forest? 🙅 --- class: inverse, right, middle # Does text data have to be **sparse**? --- >### You shall know a word by the company it keeps. #### [💬 John Rupert Firth](https://en.wikiquote.org/wiki/John_Rupert_Firth) -- Learn more about word embeddings: - in [Chapter 5](https://smltar.com/embeddings.html) - at [juliasilge.github.io/why-r-webinar/](https://juliasilge.github.io/why-r-webinar/) --- # To specify a model in tidymodels 1\. Pick a **model** 2\. Set the **mode** (if needed) 3\. 
Set the **engine** --- background-image: url(https://github.com/allisonhorst/stats-illustrations/raw/master/rstats-artwork/parsnip.png) background-size: cover .footnote[ Art by [Allison Horst](https://github.com/allisonhorst/stats-illustrations) ] --- # To specify a model in tidymodels All available models are listed at <https://tidymodels.org/find/parsnip> <iframe src="https://tidymodels.org/find/parsnip" width="100%" height="400px"></iframe> --- class: middle # `set_mode()` Some models can solve multiple types of problems ```r svm_rbf() %>% set_mode(mode = "regression") ``` ``` ## Radial Basis Function Support Vector Machine Specification (regression) ``` --- class: middle # `set_mode()` Some models can solve multiple types of problems ```r svm_rbf() %>% set_mode(mode = "classification") ``` ``` ## Radial Basis Function Support Vector Machine Specification (classification) ``` --- class: middle # `set_engine()` The same model can be implemented by multiple computational engines ```r svm_rbf() %>% set_engine("kernlab") ``` ``` ## Radial Basis Function Support Vector Machine Specification (unknown) ## ## Computational engine: kernlab ``` --- class: middle # `set_engine()` The same model can be implemented by multiple computational engines ```r svm_rbf() %>% set_engine("liquidSVM") ``` ``` ## Radial Basis Function Support Vector Machine Specification (unknown) ## ## Computational engine: liquidSVM ``` --- # What makes a model? ```r lasso_spec <- logistic_reg(penalty = tune(), mixture = 1) %>% set_mode("classification") %>% set_engine("glmnet") lasso_spec ``` ``` ## Logistic Regression Model Specification (classification) ## ## Main Arguments: ## penalty = tune() ## mixture = 1 ## ## Computational engine: glmnet ``` -- It's `tune()` again! 😟 --- ## Parameters and... hyperparameters? - Some model parameters can be learned from data during fitting/training -- - Some CANNOT 😱 -- - These are **hyperparameters** of a model, and we estimate them by training lots of models with different hyperparameters and comparing them --- # A grid of possible hyperparameters ```r param_grid <- grid_regular( penalty(range = c(-4, 0)), max_tokens(range = c(500, 2000)), levels = 6 ) ``` --- # A grid of possible hyperparameters ``` ## # A tibble: 36 x 2 ## penalty max_tokens ## <dbl> <int> ## 1 0.0001 500 ## 2 0.000631 500 ## 3 0.00398 500 ## 4 0.0251 500 ## 5 0.158 500 ## 6 1 500 ## 7 0.0001 800 ## 8 0.000631 800 ## 9 0.00398 800 ## 10 0.0251 800 ## # … with 26 more rows ``` --- class: inverse, right, middle # How can we **compare** and **evaluate** these different models? 
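---

# Choosing metrics

Part of the answer is deciding **which metrics** to compare models on.

A minimal sketch, not run in this tutorial (`tune_grid()` defaults to accuracy and ROC AUC for classification):

```r
library(tidymodels)

# Request a specific set of classification metrics from yardstick...
complaints_metrics <- metric_set(roc_auc, accuracy, sensitivity, specificity)

# ...which could then be passed to the tuning step on the upcoming slides, e.g.
# tune_grid(..., metrics = complaints_metrics)
```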
--- background-image: url(https://www.tidymodels.org/start/resampling/img/resampling.svg) background-size: 60% --- # Spend your data budget ```r set.seed(123) complaints_folds <- vfold_cv(complaints_train, v = 10, strata = product) complaints_folds ``` ``` ## # 10-fold cross-validation using stratification ## # A tibble: 10 x 2 ## splits id ## <list> <chr> ## 1 <split [7.9K/880]> Fold01 ## 2 <split [7.9K/880]> Fold02 ## 3 <split [7.9K/880]> Fold03 ## 4 <split [7.9K/880]> Fold04 ## 5 <split [7.9K/879]> Fold05 ## 6 <split [7.9K/879]> Fold06 ## 7 <split [7.9K/879]> Fold07 ## 8 <split [7.9K/878]> Fold08 ## 9 <split [7.9K/878]> Fold09 ## 10 <split [7.9K/878]> Fold10 ``` --- class: middle, center, inverse # ✨ CROSS-VALIDATION ✨ --- background-image: url(images/cross-validation/Slide2.png) background-size: contain .footnote[ Art by [Alison Hill](https://alison.rbind.io/) ] --- background-image: url(images/cross-validation/Slide3.png) background-size: contain .footnote[ Art by [Alison Hill](https://alison.rbind.io/) ] --- background-image: url(images/cross-validation/Slide4.png) background-size: contain .footnote[ Art by [Alison Hill](https://alison.rbind.io/) ] --- background-image: url(images/cross-validation/Slide5.png) background-size: contain .footnote[ Art by [Alison Hill](https://alison.rbind.io/) ] --- background-image: url(images/cross-validation/Slide6.png) background-size: contain .footnote[ Art by [Alison Hill](https://alison.rbind.io/) ] --- background-image: url(images/cross-validation/Slide7.png) background-size: contain .footnote[ Art by [Alison Hill](https://alison.rbind.io/) ] --- background-image: url(images/cross-validation/Slide8.png) background-size: contain .footnote[ Art by [Alison Hill](https://alison.rbind.io/) ] --- background-image: url(images/cross-validation/Slide9.png) background-size: contain .footnote[ Art by [Alison Hill](https://alison.rbind.io/) ] --- background-image: url(images/cross-validation/Slide10.png) background-size: contain .footnote[ Art by [Alison Hill](https://alison.rbind.io/) ] --- background-image: url(images/cross-validation/Slide11.png) background-size: contain .footnote[ Art by [Alison Hill](https://alison.rbind.io/) ] --- class: inverse, right, middle # Spend your data wisely to create **simulated** validation sets --- class: inverse, right, middle # Now we have **resamples**, **features**, plus a **model** --- # Create a workflow ```r complaints_wf <- workflow() %>% add_recipe(complaints_rec) %>% add_model(lasso_spec) ``` --- class: inverse, right, middle # What is a `workflow()`? --- ## Time to tune! ⚡ ```r set.seed(42) lasso_rs <- tune_grid( complaints_wf, resamples = complaints_folds, grid = param_grid, control = control_grid(save_pred = TRUE) ) ``` --- ## Time to tune! 
⚡ ``` ## # Tuning results ## # 10-fold cross-validation using stratification ## # A tibble: 10 x 5 ## splits id .metrics .notes .predictions ## <list> <chr> <list> <list> <list> ## 1 <split [7.9K/880]> Fold01 <tibble [72 × 6]> <tibble [0 × 1]> <tibble [31,680 × 8]> ## 2 <split [7.9K/880]> Fold02 <tibble [72 × 6]> <tibble [0 × 1]> <tibble [31,680 × 8]> ## 3 <split [7.9K/880]> Fold03 <tibble [72 × 6]> <tibble [0 × 1]> <tibble [31,680 × 8]> ## 4 <split [7.9K/880]> Fold04 <tibble [72 × 6]> <tibble [0 × 1]> <tibble [31,680 × 8]> ## 5 <split [7.9K/879]> Fold05 <tibble [72 × 6]> <tibble [0 × 1]> <tibble [31,644 × 8]> ## 6 <split [7.9K/879]> Fold06 <tibble [72 × 6]> <tibble [0 × 1]> <tibble [31,644 × 8]> ## 7 <split [7.9K/879]> Fold07 <tibble [72 × 6]> <tibble [0 × 1]> <tibble [31,644 × 8]> ## 8 <split [7.9K/878]> Fold08 <tibble [72 × 6]> <tibble [0 × 1]> <tibble [31,608 × 8]> ## 9 <split [7.9K/878]> Fold09 <tibble [72 × 6]> <tibble [0 × 1]> <tibble [31,608 × 8]> ## 10 <split [7.9K/878]> Fold10 <tibble [72 × 6]> <tibble [0 × 1]> <tibble [31,608 × 8]> ``` --- class: middle, center, inverse # 💫 TOKENIZATION 💫 --- # Tokenization - The process of splitting text in smaller pieces of text (_tokens_) -- - Most common token == word, but sometimes we tokenize in a different way -- - An essential part of most text analyses -- - Many options to take into consideration --- # Tokenization: whitespace ```r token_example ``` ``` ## [1] "I am a long-time victim of identity theft. This debt doesn't belong to me." ``` -- ```r strsplit(token_example, "\\s") ``` ``` ## [[1]] ## [1] "I" "am" "a" "long-time" "victim" "of" "identity" ## [8] "theft." "This" "debt" "doesn't" "belong" "to" "me." ``` --- # Tokenization: [tokenizers](https://docs.ropensci.org/tokenizers/) package ```r token_example ``` ``` ## [1] "I am a long-time victim of identity theft. This debt doesn't belong to me." ``` ```r library(tokenizers) tokenize_words(token_example) ``` ``` ## [[1]] ## [1] "i" "am" "a" "long" "time" "victim" "of" "identity" ## [9] "theft" "this" "debt" "doesn't" "belong" "to" "me" ``` --- # Tokenization: [spaCy](https://spacy.io/) library ```r token_example ``` ``` ## [1] "I am a long-time victim of identity theft. This debt doesn't belong to me." ``` ```r library(spacyr) spacy_tokenize(token_example) ``` ``` ## [[1]] ## [1] "I" "am" "a" "long" "-" "time" "victim" "of" ## [9] "identity" "theft" "." "This" "debt" "does" "n't" "belong" ## [17] "to" "me" "." ``` --- whitespace ``` ## [[1]] ## [1] "I" "am" "a" "long-time" "victim" "of" "identity" ## [8] "theft." "This" "debt" "doesn't" "belong" "to" "me." ``` tokenizers package ``` ## [[1]] ## [1] "i" "am" "a" "long" "time" "victim" "of" "identity" ## [9] "theft" "this" "debt" "doesn't" "belong" "to" "me" ``` spaCy library ``` ## [[1]] ## [1] "I" "am" "a" "long" "-" "time" "victim" "of" ## [9] "identity" "theft" "." "This" "debt" "does" "n't" "belong" ## [17] "to" "me" "." ``` --- # Tokenization considerations - Should we turn UPPERCASE letters to lowercase? -- - How should we handle punctuation⁉️ -- - What about non-word characters _inside_ words? -- - Should compound words be split or multi-word ideas be kept together? --- class: inverse, right, middle ## Tokenization for English text is typically **much easier** than other languages. 
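---

# Tokenization: choosing options

The considerations on the previous slides map onto tokenizer arguments. A small sketch, assuming the `token_example` string from the earlier slides:

```r
library(tokenizers)

# Keep case and punctuation instead of the defaults (lowercasing, stripping punctuation)
tokenize_words(token_example, lowercase = FALSE, strip_punct = FALSE)
```

In a recipe, similar options can be passed through textrecipes, e.g. `step_tokenize(text, options = list(lowercase = FALSE, strip_punct = FALSE))`.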
--- # N-grams ## A sequence of `n` sequential tokens -- - Captures words that appear together often -- - Can detect negations ("not happy") -- - Larger cardinality --- ```r tokenize_ngrams(token_example, n = 1) ``` ``` ## [[1]] ## [1] "i" "am" "a" "long" "time" "victim" "of" "identity" ## [9] "theft" "this" "debt" "doesn't" "belong" "to" "me" ``` ```r tokenize_ngrams(token_example, n = 2) ``` ``` ## [[1]] ## [1] "i am" "am a" "a long" "long time" "time victim" ## [6] "victim of" "of identity" "identity theft" "theft this" "this debt" ## [11] "debt doesn't" "doesn't belong" "belong to" "to me" ``` ```r tokenize_ngrams(token_example, n = 3) ``` ``` ## [[1]] ## [1] "i am a" "am a long" "a long time" "long time victim" ## [5] "time victim of" "victim of identity" "of identity theft" "identity theft this" ## [9] "theft this debt" "this debt doesn't" "debt doesn't belong" "doesn't belong to" ## [13] "belong to me" ``` --- # Tokenization See [Chapter 2](https://smltar.com/tokenization.html) for more! --- # Look at the tuning results 👀 ```r collect_metrics(lasso_rs) ``` ``` ## # A tibble: 72 x 8 ## penalty max_tokens .metric .estimator mean n std_err .config ## <dbl> <int> <chr> <chr> <dbl> <int> <dbl> <chr> ## 1 0.0001 500 accuracy binary 0.864 10 0.00490 Recipe1_Model1 ## 2 0.0001 500 roc_auc binary 0.928 10 0.00268 Recipe1_Model1 ## 3 0.000631 500 accuracy binary 0.867 10 0.00467 Recipe1_Model2 ## 4 0.000631 500 roc_auc binary 0.931 10 0.00264 Recipe1_Model2 ## 5 0.00398 500 accuracy binary 0.869 10 0.00473 Recipe1_Model3 ## 6 0.00398 500 roc_auc binary 0.934 10 0.00282 Recipe1_Model3 ## 7 0.0251 500 accuracy binary 0.840 10 0.00502 Recipe1_Model4 ## 8 0.0251 500 roc_auc binary 0.911 10 0.00427 Recipe1_Model4 ## 9 0.158 500 accuracy binary 0.539 10 0.00351 Recipe1_Model5 ## 10 0.158 500 roc_auc binary 0.723 10 0.00575 Recipe1_Model5 ## # … with 62 more rows ``` --- ```r autoplot(lasso_rs) ``` <img src="index_files/figure-html/unnamed-chunk-48-1.png" width="700px" style="display: block; margin: auto;" /> --- # Look at the tuning results 👀 ```r lasso_rs %>% show_best("roc_auc") ``` ``` ## # A tibble: 5 x 8 ## penalty max_tokens .metric .estimator mean n std_err .config ## <dbl> <int> <chr> <chr> <dbl> <int> <dbl> <chr> ## 1 0.00398 1400 roc_auc binary 0.942 10 0.00223 Recipe4_Model3 ## 2 0.00398 2000 roc_auc binary 0.942 10 0.00241 Recipe6_Model3 ## 3 0.00398 1700 roc_auc binary 0.942 10 0.00222 Recipe5_Model3 ## 4 0.00398 1100 roc_auc binary 0.941 10 0.00243 Recipe3_Model3 ## 5 0.00398 800 roc_auc binary 0.940 10 0.00293 Recipe2_Model3 ``` --- # The **best** 🥇 hyperparameters ```r best_roc_auc <- select_best(lasso_rs, "roc_auc") best_roc_auc ``` ``` ## # A tibble: 1 x 3 ## penalty max_tokens .config ## <dbl> <int> <chr> ## 1 0.00398 1400 Recipe4_Model3 ``` --- # Evaluate the best model 📏 ```r collect_predictions(lasso_rs, parameters = best_roc_auc) ``` ``` ## # A tibble: 8,791 x 9 ## id .pred_Credit .pred_Other .row max_tokens penalty .pred_class product .config ## <chr> <dbl> <dbl> <int> <int> <dbl> <fct> <fct> <chr> ## 1 Fold01 0.0143 0.986 1 1400 0.00398 Other Other Recipe4_Mod… ## 2 Fold01 0.992 0.00798 3 1400 0.00398 Credit Credit Recipe4_Mod… ## 3 Fold01 0.0335 0.966 8 1400 0.00398 Other Other Recipe4_Mod… ## 4 Fold01 0.644 0.356 12 1400 0.00398 Credit Credit Recipe4_Mod… ## 5 Fold01 0.133 0.867 25 1400 0.00398 Other Other Recipe4_Mod… ## 6 Fold01 0.158 0.842 41 1400 0.00398 Other Other Recipe4_Mod… ## 7 Fold01 0.0801 0.920 64 1400 0.00398 Other Other Recipe4_Mod… ## 8 Fold01 0.831 0.169 
88 1400 0.00398 Credit Credit Recipe4_Mod… ## 9 Fold01 0.236 0.764 113 1400 0.00398 Other Other Recipe4_Mod… ## 10 Fold01 0.981 0.0194 115 1400 0.00398 Credit Credit Recipe4_Mod… ## # … with 8,781 more rows ``` --- # Evaluate the best model 📏 ```r collect_predictions(lasso_rs, parameters = best_roc_auc) %>% group_by(id) %>% roc_curve(truth = product, .pred_Credit) %>% autoplot() ``` --- # Evaluate the best model 📏 <img src="index_files/figure-html/unnamed-chunk-53-1.png" width="700px" style="display: block; margin: auto;" /> --- # Update the workflow We can update our workflow with the best performing hyperparameters. ```r wf_spec_final <- finalize_workflow(complaints_wf, best_roc_auc) ``` This workflow is ready to go! It can now be applied to new data. --- class: inverse, right, middle # How is our model **thinking**? --- ## Variable importance ```r library(vip) wf_spec_final %>% fit(complaints_train) %>% pull_workflow_fit() %>% vi(lambda = best_roc_auc$penalty) %>% filter(!str_detect(Variable, "tfidf")) %>% filter(Importance != 0) ``` ``` ## # A tibble: 19 x 3 ## Variable Importance Sign ## <chr> <dbl> <chr> ## 1 tags_Older.American..Servicemember 0.509 POS ## 2 date_received_dow_Mon 0.480 POS ## 3 date_received_dow_Fri 0.337 POS ## 4 date_received_dow_Thu 0.253 POS ## 5 date_received_dow_Wed 0.108 POS ## 6 date_received_dow_Sat 0.0106 POS ## 7 tags_unknown 0.00293 POS ## 8 date_received_month_Sep -0.0558 NEG ## 9 date_received_month_Jun -0.0615 NEG ## 10 date_received_month_Apr -0.100 NEG ## 11 tags_Servicemember -0.132 NEG ## 12 date_received_month_Aug -0.227 NEG ## 13 date_received_month_Mar -0.253 NEG ## 14 date_received_month_May -0.361 NEG ## 15 date_received_month_Jul -0.564 NEG ## 16 date_received_month_Oct -0.586 NEG ## 17 date_received_month_Nov -0.717 NEG ## 18 date_received_month_Feb -0.734 NEG ## 19 date_received_month_Dec -1.09 NEG ``` --- ## Variable importance ```r vi_data <- wf_spec_final %>% fit(complaints_train) %>% pull_workflow_fit() %>% vi(lambda = best_roc_auc$penalty) %>% mutate(Variable = str_remove_all(Variable, "tfidf_text_")) %>% filter(Importance != 0) ``` --- ## Variable importance ```r vi_data ``` ``` ## # A tibble: 1,377 x 3 ## Variable Importance Sign ## <chr> <dbl> <chr> ## 1 identity_theft_2 296. POS ## 2 section_consumer 262. POS ## 3 appraisal 153. POS ## 4 agency_shall 130. POS ## 5 credit_reporting_act 123. POS ## 6 xxxx_oh 99.5 POS ## 7 blocked_information 92.0 POS ## 8 pnc 88.8 POS ## 9 requested_consumer 83.9 POS ## 10 american_express 83.0 POS ## # … with 1,367 more rows ``` --- class: center, middle <img src="index_files/figure-html/unnamed-chunk-59-1.png" width="700px" style="display: block; margin: auto;" /> --- ## Credit Complaint #1 <span, style = 'color:green;'>Credit</span> <span, style = 'color:black;'>And</span> <span, style = 'color:blue;'>Other</span>
i have contacted equifax on at least 5 occasions to include a xxxx xxxx credit card on my profile i have provided photo identification my ss and a copy of the card because they have refused to update my profile my credit score remains below 600 and i have been denied credit on several occasions i have had the card for at least 4 years yet my credit profile with equifax claims i have no credit history i have contacted them via letters online disputes and telephone to no avail please advise as to other remedies available to me
--- ## Credit Complaint #2 <span style = 'color:green;'>Credit</span> <span style = 'color:black;'>And</span> <span style = 'color:blue;'>Other</span>
i have disputed items with equifax back on xx xx 2019 and they have failed to respond i have also sent them a notice that they failed to respond and they have ignore my request to dispute items that are reporting on my credit reports
--- ## Other Complaint #1 <span style = 'color:green;'>Credit</span> <span style = 'color:black;'>And</span> <span style = 'color:blue;'>Other</span>
i was contacted on xxxx xx xx 2019 by portfolio recovery associates about a past debt that i have no knowledge of having the representative mentioned that there was an attempt to deliver paperwork but paperwork delivery was at a different location that my physical home address upon requesting further clarification the representative continued to say how i was refusing to pay the debt i never refused to pay the debt i would just like clarification on the debt i will be delivering a cease and desist order to the aforementioned party
--- ## Other Complaint #2 <span style = 'color:green;'>Credit</span> <span style = 'color:black;'>And</span> <span style = 'color:blue;'>Other</span>
xxxx xxxx is alleging that i owe them 690.00 i did not breach my rental contract and do not understand the charges against me i have attempted to resolve the matter but every time i try the claim that i owe them additional monies i believe that i am a victim of fraud and unfair business tactics management is run very poorly and one person says one thing another says another thing i do have documentation to prove my innocence
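---

# How highlighting like this could be made

A rough sketch, not the code behind these slides: tokenize one complaint and join its words to the `vi_data` coefficients, then map each sign to a color. Only unigrams will match (n-gram terms such as `identity_theft` are skipped), and which class a positive sign points to depends on the outcome's factor level order.

```r
library(tidyverse)
library(tidytext)

complaints_train %>%
  slice(1) %>%                                  # one example complaint
  select(complaint_id, text) %>%
  unnest_tokens(word, text) %>%                 # one row per word
  left_join(vi_data, by = c("word" = "Variable")) %>%
  mutate(color = case_when(
    Sign == "POS" ~ "green",
    Sign == "NEG" ~ "blue",
    TRUE          ~ "black"                     # words with no matching term
  ))
```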
--- # Final fit We will now use `last_fit()` to **fit** our model one last time on our training data and **evaluate** it on our testing data. ```r final_fit <- last_fit( wf_spec_final, complaints_split ) ``` --- class: inverse, right, middle # Notice that this is the **first** and **only** time we have used our **testing data** --- # Evaluate on the **test** data 📏 ```r final_fit %>% collect_metrics() ``` ``` ## # A tibble: 2 x 3 ## .metric .estimator .estimate ## <chr> <chr> <dbl> ## 1 accuracy binary 0.880 ## 2 roc_auc binary 0.943 ``` --- ```r final_fit %>% collect_predictions() %>% roc_curve(truth = product, .pred_Credit) %>% autoplot() ``` <img src="index_files/figure-html/unnamed-chunk-66-1.png" width="700px" style="display: block; margin: auto;" /> --- class: center, middle # Thanks! ##[smltar.com](https://smltar.com/) .pull-left[ <img style="border-radius: 50%;" src="https://github.com/EmilHvitfeldt.png" width="150px"/> [
@EmilHvitfeldt](https://github.com/EmilHvitfeldt) [
@Emil_Hvitfeldt](https://twitter.com/Emil_Hvitfeldt) [
hvitfeldt.me](https://www.hvitfeldt.me/) ] .pull-right[ <img style="border-radius: 50%;" src="https://github.com/juliasilge.png" width="150px"/> [
@juliasilge](https://github.com/juliasilge) [
@juliasilge](https://twitter.com/juliasilge) [
juliasilge.com](https://juliasilge.com) ]