The calculations are done with the word2vec package.
word2vec(
  text,
  tokenizer = text2vec::space_tokenizer,
  dim = 50,
  type = c("cbow", "skip-gram"),
  window = 5L,
  min_count = 5L,
  loss = c("ns", "hs"),
  negative = 5L,
  n_iter = 5L,
  lr = 0.05,
  sample = 0.001,
  stopwords = character(),
  threads = 1L,
  collapse_character = "\t",
  composition = c("tibble", "data.frame", "matrix")
)
text | Character string.
tokenizer | Function to perform tokenization. Defaults to text2vec::space_tokenizer.
dim | Integer, dimension of the word vectors. Defaults to 50.
type | Character, the type of algorithm to use, either "cbow" or "skip-gram". Defaults to "cbow".
window | Integer, skip length between words. Defaults to 5.
min_count | Integer, the number of times a word must occur to be included in the training vocabulary. Defaults to 5.
loss | Character, choice of loss function; must be one of "ns" or "hs". See details for more information. Defaults to "ns".
negative | Integer, the number of negative samples. Only used when loss = "ns". Defaults to 5.
n_iter | Integer, number of training iterations. Defaults to 5.
lr | Numeric, initial learning rate, also known as alpha. Defaults to 0.05.
sample | Numeric, threshold for down-sampling frequent words. Defaults to 0.001.
stopwords | Character vector of stopwords to exclude from training.
threads | Integer, number of CPU threads to use. Defaults to 1.
collapse_character | Character vector of length 1, the character used to glue tokens back together after tokenizing. See details for more information. Defaults to "\t" (tab).
composition | Character, either "tibble", "data.frame", or "matrix", giving the format of the resulting word vectors.
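Most of these defaults can be overridden in a single call. A minimal sketch, using the fairy_tales corpus from the examples below, that fits a skip-gram model with larger vectors and a wider context window:

# Sketch only: a skip-gram model with larger vectors, a wider context
# window, and a lower vocabulary cutoff. fairy_tales is the example
# corpus used in the examples below.
word2vec(
  fairy_tales,
  dim = 100,
  type = "skip-gram",
  window = 10L,
  min_count = 2L
)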
A tibble, data.frame, or matrix (according to composition) containing the tokens in the first column and the word vectors in the remaining columns.
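Since the tokens sit in the first column, individual word vectors are easy to look up. A minimal sketch, assuming the default tibble output with its tokens column (as shown in the examples below), that compares two tokens by cosine similarity:

# Sketch: pull single word vectors out of the returned tibble and
# compare two tokens by cosine similarity. Assumes the default tibble
# output with a `tokens` column; "witch" and "door" both appear in the
# example output below.
emb <- word2vec(fairy_tales)
get_vector <- function(token) unlist(emb[emb$tokens == token, -1])
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
cosine(get_vector("witch"), get_vector("door"))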
A trade-off has been made to allow for an arbitrary tokenizing function. The text is first passed through the tokenizer, then collapsed back into strings using collapse_character as the separator. You need to pick a collapse_character that will not appear in any of the tokens after tokenizing is done. The default value is the tab character. If you pick a character that is present in the tokens, those words will be split.
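If a tab can legitimately occur inside your tokens, pass a separator that cannot. A hedged sketch using the ASCII "unit separator" control character, on the assumption that it never occurs in natural-language tokens:

# Sketch: swap the default tab for the ASCII unit separator, assuming
# that this control character cannot occur in any token.
word2vec(fairy_tales, collapse_character = "\x1f")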
The choice of loss function is one of the following (a short example follows the list):
"ns" negative sampling
"hs" hierarchical softmax
Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. "Distributed Representations of Words and Phrases and their Compositionality."
word2vec(fairy_tales)
#> # A tibble: 452 x 51
#>    tokens     V1     V2     V3     V4     V5    V6    V7      V8    V9    V10
#>    <chr>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl> <dbl> <dbl>   <dbl> <dbl>  <dbl>
#>  1 into    0.640  0.461 -0.778 -0.486 -0.829  1.15 -1.18  0.0822 -1.26 -0.526
#>  2 once    0.646  0.455 -0.805 -0.478 -0.831  1.12 -1.21  0.0821 -1.25 -0.523
#>  3 witch   0.606  0.474 -0.786 -0.444 -0.811  1.08 -1.17  0.137  -1.29 -0.487
#>  4 green   0.642  0.430 -0.761 -0.504 -0.809  1.15 -1.18  0.108  -1.28 -0.480
#>  5 princ…  0.684  0.474 -0.837 -0.490 -0.861  1.12 -1.24  0.106  -1.28 -0.475
#>  6 a       0.631  0.387 -0.756 -0.493 -0.824  1.14 -1.18  0.151  -1.26 -0.489
#>  7 moment  0.625  0.459 -0.801 -0.466 -0.808  1.11 -1.19  0.119  -1.25 -0.518
#>  8 drove   0.627  0.445 -0.785 -0.465 -0.827  1.18 -1.20  0.0511 -1.24 -0.504
#>  9 door    0.635  0.448 -0.796 -0.487 -0.815  1.12 -1.18  0.109  -1.28 -0.522
#> 10 </s>    0.808 -0.170 -1.23   0.353  0.432 -1.04 -1.62 -1.30   -1.50 -0.324
#> # … with 442 more rows, and 40 more variables: V11 <dbl>, V12 <dbl>, V13 <dbl>,
#> #   V14 <dbl>, V15 <dbl>, V16 <dbl>, V17 <dbl>, V18 <dbl>, V19 <dbl>,
#> #   V20 <dbl>, V21 <dbl>, V22 <dbl>, V23 <dbl>, V24 <dbl>, V25 <dbl>,
#> #   V26 <dbl>, V27 <dbl>, V28 <dbl>, V29 <dbl>, V30 <dbl>, V31 <dbl>,
#> #   V32 <dbl>, V33 <dbl>, V34 <dbl>, V35 <dbl>, V36 <dbl>, V37 <dbl>,
#> #   V38 <dbl>, V39 <dbl>, V40 <dbl>, V41 <dbl>, V42 <dbl>, V43 <dbl>,
#> #   V44 <dbl>, V45 <dbl>, V46 <dbl>, V47 <dbl>, V48 <dbl>, V49 <dbl>, V50 <dbl>

# Custom tokenizer that splits on non-alphanumeric characters
word2vec(fairy_tales, tokenizer = function(x) strsplit(x, "[^[:alnum:]]+"))
#> # A tibble: 489 x 51
#>    tokens     V1    V2     V3    V4     V5    V6    V7     V8       V9    V10
#>    <chr>   <dbl> <dbl>  <dbl> <dbl>  <dbl> <dbl> <dbl>  <dbl>    <dbl>  <dbl>
#>  1 field   0.457  1.17 -0.717 1.05  -0.441 0.198 0.528 -0.787  0.00984 -0.751
#>  2 four    0.492  1.15 -0.775 1.00  -0.498 0.238 0.567 -0.720 -0.0716  -0.671
#>  3 black   0.520  1.16 -0.771 1.02  -0.495 0.253 0.568 -0.713 -0.0803  -0.662
#>  4 woman   0.451  1.16 -0.792 1.02  -0.423 0.271 0.495 -0.719 -0.0415  -0.720
#>  5 a       0.449  1.16 -0.752 0.995 -0.383 0.261 0.474 -0.751  0.00606 -0.695
#>  6 body    0.534  1.18 -0.840 1.01  -0.462 0.227 0.507 -0.741 -0.0582  -0.639
#>  7 door    0.419  1.16 -0.756 1.01  -0.439 0.283 0.497 -0.731  0.00333 -0.695
#>  8 let     0.524  1.17 -0.777 1.02  -0.503 0.205 0.536 -0.735 -0.0570  -0.703
#>  9 himse…  0.481  1.15 -0.795 1.02  -0.440 0.256 0.527 -0.748 -0.0367  -0.720
#> 10 There   0.493  1.15 -0.798 1.01  -0.455 0.255 0.539 -0.712 -0.0449  -0.693
#> # … with 479 more rows, and 40 more variables: V11 <dbl>, V12 <dbl>, V13 <dbl>,
#> #   V14 <dbl>, V15 <dbl>, V16 <dbl>, V17 <dbl>, V18 <dbl>, V19 <dbl>,
#> #   V20 <dbl>, V21 <dbl>, V22 <dbl>, V23 <dbl>, V24 <dbl>, V25 <dbl>,
#> #   V26 <dbl>, V27 <dbl>, V28 <dbl>, V29 <dbl>, V30 <dbl>, V31 <dbl>,
#> #   V32 <dbl>, V33 <dbl>, V34 <dbl>, V35 <dbl>, V36 <dbl>, V37 <dbl>,
#> #   V38 <dbl>, V39 <dbl>, V40 <dbl>, V41 <dbl>, V42 <dbl>, V43 <dbl>,
#> #   V44 <dbl>, V45 <dbl>, V46 <dbl>, V47 <dbl>, V48 <dbl>, V49 <dbl>, V50 <dbl>