The calculations are done with the word2vec package.
word2vec(
  text,
  tokenizer = text2vec::space_tokenizer,
  dim = 50,
  type = c("cbow", "skip-gram"),
  window = 5L,
  min_count = 5L,
  loss = c("ns", "hs"),
  negative = 5L,
  n_iter = 5L,
  lr = 0.05,
  sample = 0.001,
  stopwords = character(),
  threads = 1L,
  collapse_character = "\t",
  composition = c("tibble", "data.frame", "matrix")
)
text | Character string.
tokenizer | Function to perform tokenization. Defaults to text2vec::space_tokenizer.
dim | Integer, dimension of the word vectors. Defaults to 50.
type | Character, the type of algorithm to use, either "cbow" or "skip-gram". Defaults to "cbow".
window | Integer, skip length between words. Defaults to 5.
min_count | Integer, the number of times a word must occur to be included in the training vocabulary. Defaults to 5.
loss | Character, choice of loss function; must be one of "ns" or "hs". See details for more information. Defaults to "ns".
negative | Integer, the number of negative samples. Only used when loss = "ns". Defaults to 5.
n_iter | Integer, number of training iterations. Defaults to 5.
lr | Numeric, initial learning rate, also known as alpha. Defaults to 0.05.
sample | Numeric, threshold for down-sampling frequent words. Defaults to 0.001.
stopwords | Character vector of stopwords to exclude from training.
threads | Integer, number of CPU threads to use. Defaults to 1.
collapse_character | Character vector of length 1, the character used to glue tokens back together after tokenizing. See details for more information. Defaults to "\t" (tab).
composition | Character, either "tibble", "data.frame", or "matrix", giving the format of the resulting word vectors.
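Most of these defaults can be overridden in a single call. A minimal sketch, using the fairy_tales corpus from the examples below, that fits a skip-gram model with larger vectors and a wider context window:

# Sketch only: a skip-gram model with larger vectors, a wider context
# window, and a lower vocabulary cutoff. fairy_tales is the example
# corpus used in the examples below.
word2vec(
  fairy_tales,
  dim = 100,
  type = "skip-gram",
  window = 10L,
  min_count = 2L
)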
A tibble, data.frame, or matrix (according to composition) containing the tokens in the first column and the word vectors in the remaining columns.
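Since the tokens sit in the first column, individual word vectors are easy to look up. A minimal sketch, assuming the default tibble output with its tokens column (as shown in the examples below), that compares two tokens by cosine similarity:

# Sketch: pull single word vectors out of the returned tibble and
# compare two tokens by cosine similarity. Assumes the default tibble
# output with a `tokens` column; "witch" and "door" both appear in the
# example output below.
emb <- word2vec(fairy_tales)
get_vector <- function(token) unlist(emb[emb$tokens == token, -1])
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
cosine(get_vector("witch"), get_vector("door"))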
A trade-off has been made to allow for an arbitrary tokenizing function. The text is first passed through the tokenizer, then collapsed back into strings using collapse_character as the separator. You need to pick a collapse_character that will not appear in any of the tokens after tokenizing is done. The default value is the tab character. If you pick a character that is present in the tokens, those words will be split.
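If a tab can legitimately occur inside your tokens, pass a separator that cannot. A hedged sketch using the ASCII "unit separator" control character, on the assumption that it never occurs in natural-language tokens:

# Sketch: swap the default tab for the ASCII unit separator, assuming
# that this control character cannot occur in any token.
word2vec(fairy_tales, collapse_character = "\x1f")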
The choice of loss function is one of the following (a short example follows the list):
"ns" negative sampling
"hs" hierarchical softmax
Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. "Distributed Representations of Words and Phrases and their Compositionality."
word2vec(fairy_tales)
#> # A tibble: 452 x 51
#>    tokens     V1     V2     V3     V4     V5    V6    V7      V8    V9    V10
#>    <chr>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl> <dbl> <dbl>   <dbl> <dbl>  <dbl>
#>  1 into    0.640  0.461 -0.778 -0.486 -0.829  1.15 -1.18  0.0822 -1.26 -0.526
#>  2 once    0.646  0.455 -0.805 -0.478 -0.831  1.12 -1.21  0.0821 -1.25 -0.523
#>  3 witch   0.606  0.474 -0.786 -0.444 -0.811  1.08 -1.17  0.137  -1.29 -0.487
#>  4 green   0.642  0.430 -0.761 -0.504 -0.809  1.15 -1.18  0.108  -1.28 -0.480
#>  5 princ…  0.684  0.474 -0.837 -0.490 -0.861  1.12 -1.24  0.106  -1.28 -0.475
#>  6 a       0.631  0.387 -0.756 -0.493 -0.824  1.14 -1.18  0.151  -1.26 -0.489
#>  7 moment  0.625  0.459 -0.801 -0.466 -0.808  1.11 -1.19  0.119  -1.25 -0.518
#>  8 drove   0.627  0.445 -0.785 -0.465 -0.827  1.18 -1.20  0.0511 -1.24 -0.504
#>  9 door    0.635  0.448 -0.796 -0.487 -0.815  1.12 -1.18  0.109  -1.28 -0.522
#> 10 </s>    0.808 -0.170 -1.23   0.353  0.432 -1.04 -1.62 -1.30   -1.50 -0.324
#> # … with 442 more rows, and 40 more variables: V11 <dbl>, V12 <dbl>, V13 <dbl>,
#> #   V14 <dbl>, V15 <dbl>, V16 <dbl>, V17 <dbl>, V18 <dbl>, V19 <dbl>,
#> #   V20 <dbl>, V21 <dbl>, V22 <dbl>, V23 <dbl>, V24 <dbl>, V25 <dbl>,
#> #   V26 <dbl>, V27 <dbl>, V28 <dbl>, V29 <dbl>, V30 <dbl>, V31 <dbl>,
#> #   V32 <dbl>, V33 <dbl>, V34 <dbl>, V35 <dbl>, V36 <dbl>, V37 <dbl>,
#> #   V38 <dbl>, V39 <dbl>, V40 <dbl>, V41 <dbl>, V42 <dbl>, V43 <dbl>,
#> #   V44 <dbl>, V45 <dbl>, V46 <dbl>, V47 <dbl>, V48 <dbl>, V49 <dbl>, V50 <dbl>

# Custom tokenizer that splits on non-alphanumeric characters
word2vec(fairy_tales, tokenizer = function(x) strsplit(x, "[^[:alnum:]]+"))
#> # A tibble: 489 x 51
#>    tokens     V1    V2     V3    V4     V5    V6    V7     V8       V9    V10
#>    <chr>   <dbl> <dbl>  <dbl> <dbl>  <dbl> <dbl> <dbl>  <dbl>    <dbl>  <dbl>
#>  1 field   0.457  1.17 -0.717 1.05  -0.441 0.198 0.528 -0.787  0.00984 -0.751
#>  2 four    0.492  1.15 -0.775 1.00  -0.498 0.238 0.567 -0.720 -0.0716  -0.671
#>  3 black   0.520  1.16 -0.771 1.02  -0.495 0.253 0.568 -0.713 -0.0803  -0.662
#>  4 woman   0.451  1.16 -0.792 1.02  -0.423 0.271 0.495 -0.719 -0.0415  -0.720
#>  5 a       0.449  1.16 -0.752 0.995 -0.383 0.261 0.474 -0.751  0.00606 -0.695
#>  6 body    0.534  1.18 -0.840 1.01  -0.462 0.227 0.507 -0.741 -0.0582  -0.639
#>  7 door    0.419  1.16 -0.756 1.01  -0.439 0.283 0.497 -0.731  0.00333 -0.695
#>  8 let     0.524  1.17 -0.777 1.02  -0.503 0.205 0.536 -0.735 -0.0570  -0.703
#>  9 himse…  0.481  1.15 -0.795 1.02  -0.440 0.256 0.527 -0.748 -0.0367  -0.720
#> 10 There   0.493  1.15 -0.798 1.01  -0.455 0.255 0.539 -0.712 -0.0449  -0.693
#> # … with 479 more rows, and 40 more variables: V11 <dbl>, V12 <dbl>, V13 <dbl>,
#> #   V14 <dbl>, V15 <dbl>, V16 <dbl>, V17 <dbl>, V18 <dbl>, V19 <dbl>,
#> #   V20 <dbl>, V21 <dbl>, V22 <dbl>, V23 <dbl>, V24 <dbl>, V25 <dbl>,
#> #   V26 <dbl>, V27 <dbl>, V28 <dbl>, V29 <dbl>, V30 <dbl>, V31 <dbl>,
#> #   V32 <dbl>, V33 <dbl>, V34 <dbl>, V35 <dbl>, V36 <dbl>, V37 <dbl>,
#> #   V38 <dbl>, V39 <dbl>, V40 <dbl>, V41 <dbl>, V42 <dbl>, V43 <dbl>,
#> #   V44 <dbl>, V45 <dbl>, V46 <dbl>, V47 <dbl>, V48 <dbl>, V49 <dbl>, V50 <dbl>