The calculations are done with the fastTextR package.

fasttext(
  text,
  tokenizer = text2vec::space_tokenizer,
  dim = 10L,
  type = c("skip-gram", "cbow"),
  window = 5L,
  loss = "hs",
  negative = 5L,
  n_iter = 5L,
  min_count = 5L,
  threads = 1L,
  composition = c("tibble", "data.frame", "matrix"),
  verbose = FALSE
)

Arguments

text

Character vector containing the text used to train the word vectors.

tokenizer

Function, the function used to perform tokenization. Defaults to text2vec::space_tokenizer.

dim

Integer, number of dimensions of the resulting word vectors. Defaults to 10.

type

Character, the type of algorithm to use, either 'cbow' or 'skip-gram'. Defaults to 'skip-gram'.

window

Integer, skip length between words. Defaults to 5.

loss

Character, choice of loss function; must be one of "ns", "hs", or "softmax". See Details for more. Defaults to "hs".

negative

Integer, number of negative samples. Only used when loss = "ns". Defaults to 5.

n_iter

Integer, number of training iterations. Defaults to 5.

min_count

Integer, number of times a token should appear to be considered in the model. Defaults to 5.

threads

Integer, number of CPU threads to use. Defaults to 1.

composition

Character, either "tibble", "data.frame", or "matrix", giving the format of the resulting word vectors. Defaults to "tibble".

verbose

Logical, controls whether progress is reported as operations are executed.
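
As a quick sketch of how these arguments combine (fairy_tales stands in for any character vector, as in the Examples below; the settings shown are illustrative, not recommendations):

# Sketch: setting several arguments at once; `fairy_tales` is the
# character vector used in the Examples below.
vectors <- fasttext(
  fairy_tales,
  dim       = 25L,      # 25-dimensional word vectors instead of 10
  type      = "cbow",   # continuous bag-of-words instead of skip-gram
  window    = 3L,       # narrower context window
  min_count = 2L,       # keep tokens appearing at least twice
  n_iter    = 10L,      # more training iterations
  threads   = 2L        # train on two CPU threads
)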

Source

https://fasttext.cc/

Value

A tibble, data.frame, or matrix containing the tokens in the first column and the word vectors in the remaining columns.
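
Since the tokens sit in the first column, the vectors are easy to index by token. A small sketch (assuming the compared tokens met min_count; cosine() is a hypothetical helper defined here, not part of the package):

emb <- fasttext(fairy_tales, n_iter = 2)
mat <- as.matrix(emb[, -1])        # drop the `tokens` column
rownames(mat) <- emb$tokens        # index rows by token
# Hypothetical helper: cosine similarity between two word vectors
cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))
cosine(mat["she", ], mat["he", ])  # both tokens appear in the example output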

Details

The choice of loss function is one of the following (a short sketch contrasting them follows the list):

  • "ns" negative sampling

  • "hs" hierarchical softmax

  • "softmax" full softmax

References

Enriching Word Vectors with Subword Information, 2016, P. Bojanowski, E. Grave, A. Joulin, T. Mikolov.

Examples

fasttext(fairy_tales, n_iter = 2)
#> # A tibble: 452 x 11
#>    tokens     V1     V2    V3     V4     V5      V6    V7     V8      V9   V10
#>    <chr>   <dbl>  <dbl> <dbl>  <dbl>  <dbl>   <dbl> <dbl>  <dbl>   <dbl> <dbl>
#>  1 the    -0.282 0.0299 0.443 -0.414 0.131  -0.0250 0.231 -0.198 -0.0193 0.223
#>  2 and    -0.217 0.0142 0.483 -0.456 0.131  -0.0469 0.231 -0.208 -0.0527 0.192
#>  3 a      -0.199 0.0468 0.421 -0.429 0.133  -0.0780 0.228 -0.207 -0.0396 0.214
#>  4 to     -0.256 0.0122 0.417 -0.448 0.0965 -0.0589 0.241 -0.227 -0.0462 0.183
#>  5 he     -0.215 0.0563 0.380 -0.361 0.0772 -0.0212 0.173 -0.186 -0.0733 0.184
#>  6 of     -0.180 0.0346 0.417 -0.438 0.127  -0.0677 0.253 -0.213 -0.0217 0.178
#>  7 in     -0.307 0.0502 0.533 -0.514 0.111  -0.0652 0.258 -0.228 -0.0335 0.229
#>  8 she    -0.187 0.0389 0.363 -0.376 0.0901 -0.0588 0.210 -0.165 -0.0210 0.157
#>  9 was    -0.215 0.0439 0.371 -0.362 0.0886 -0.0382 0.188 -0.170 -0.0402 0.159
#> 10 as     -0.231 0.0385 0.560 -0.552 0.157  -0.0821 0.308 -0.228 -0.0645 0.209
#> # … with 442 more rows
# Custom tokenizer that splits on non-alphanumeric characters
fasttext(fairy_tales, n_iter = 2, tokenizer = function(x) strsplit(x, "[^[:alnum:]]+"))
#> # A tibble: 490 x 11
#>    tokens       V1    V2      V3     V4     V5       V6     V7     V8       V9
#>    <chr>     <dbl> <dbl>   <dbl>  <dbl>  <dbl>    <dbl>  <dbl>  <dbl>    <dbl>
#>  1 the     0.0375  0.516 -0.0992 0.0700 -0.276 -0.0166  -0.265 -0.209 -0.00763
#>  2 and     0.0275  0.522 -0.0804 0.0831 -0.264 -0.00803 -0.277 -0.223 -0.00484
#>  3 a       0.0264  0.507 -0.0448 0.0188 -0.239 -0.0572  -0.247 -0.207 -0.00223
#>  4 to      0.0256  0.520 -0.136  0.0479 -0.331 -0.00933 -0.274 -0.246 -0.00939
#>  5 he      0.0656  0.461 -0.0624 0.0535 -0.256 -0.0286  -0.252 -0.170 -0.0213
#>  6 of      0.00377 0.531 -0.137  0.0733 -0.279 -0.0114  -0.280 -0.208  0.00984
#>  7 in      0.0215  0.480 -0.0681 0.0660 -0.270 -0.0350  -0.263 -0.173  0.0342
#>  8 she     0.0272  0.482 -0.115  0.0502 -0.255  0.00664 -0.265 -0.202 -0.0184
#>  9 you    -0.00101 0.617 -0.0984 0.0716 -0.329 -0.0198  -0.322 -0.240 -0.0310
#> 10 was     0.0360  0.482 -0.0888 0.0455 -0.259 -0.0131  -0.242 -0.185 -0.0135
#> # … with 480 more rows, and 1 more variable: V10 <dbl>