The calculations are done with the fastTextR package.
```r
fasttext(
  text,
  tokenizer = text2vec::space_tokenizer,
  dim = 10L,
  type = c("skip-gram", "cbow"),
  window = 5L,
  loss = "hs",
  negative = 5L,
  n_iter = 5L,
  min_count = 5L,
  threads = 1L,
  composition = c("tibble", "data.frame", "matrix"),
  verbose = FALSE
)
```
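Under the hood, fasttext() delegates training to fastTextR. As a rough, hedged orientation (not this function's exact internals), a direct call to fastTextR with equivalent settings might look like the sketch below; it assumes fastTextR's ft_control()/ft_train()/ft_word_vectors() interface, and since fastTextR trains from a file on disk, the text is first written to a temporary file.

```r
library(fastTextR)

# Illustrative sketch only: `fairy_tales` stands in for any character
# vector of training text, one document per element.
tmp <- tempfile(fileext = ".txt")
writeLines(fairy_tales, tmp)

ctrl <- ft_control(
  word_vec_size = 10L,  # mirrors dim = 10L
  window_size   = 5L,   # mirrors window = 5L
  epoch         = 5L,   # mirrors n_iter = 5L
  min_count     = 5L,   # mirrors min_count = 5L
  nthreads      = 1L,   # mirrors threads = 1L
  loss          = "hs"  # hierarchical softmax, the default above
)

model <- ft_train(tmp, method = "skipgram", control = ctrl)
vecs  <- ft_word_vectors(model, ft_words(model))  # one row per word
```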
| Argument | Description |
|---|---|
| text | Character string. |
| tokenizer | Function to perform tokenization. Defaults to text2vec::space_tokenizer. |
| dim | Integer, number of dimensions of the resulting word vectors. Defaults to 10. |
| type | Character, the type of algorithm to use, either "cbow" or "skip-gram". Defaults to "skip-gram". |
| window | Integer, skip length between words. Defaults to 5. |
| loss | Character, choice of loss function; must be one of "ns", "hs", or "softmax". See the details below. Defaults to "hs". |
| negative | Integer, number of negative samples. Only used when loss = "ns". Defaults to 5. |
| n_iter | Integer, number of training iterations. Defaults to 5. |
| min_count | Integer, number of times a token must appear to be considered in the model. Defaults to 5. |
| threads | Integer, number of CPU threads to use. Defaults to 1. |
| composition | Character, either "tibble", "data.frame", or "matrix", giving the format of the resulting word vectors (see the example after this table). Defaults to "tibble". |
| verbose | Logical, controls whether progress is reported as operations are executed. Defaults to FALSE. |
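As a quick illustration of the composition argument (a hedged sketch; fairy_tales stands in for any character vector of text, as in the examples below):

```r
# Request a plain data.frame rather than the default tibble.
emb <- fasttext(fairy_tales, n_iter = 2, composition = "data.frame")
class(emb)        # "data.frame"
head(emb$tokens)  # the token column shown in the examples below
```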
A tibble, data.frame, or matrix containing the tokens in the first column and the word vectors in the remaining columns.
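A typical follow-up is to pull out one token's vector and compare it against the rest. A minimal base-R sketch on the default tibble output (the tokens column is visible in the examples below):

```r
emb <- fasttext(fairy_tales, n_iter = 2)

# One token's vector: its row, minus the leading token column.
v <- unlist(emb[emb$tokens == "the", -1])

# Cosine similarity of every token's vector against "the".
m    <- as.matrix(emb[, -1])
sims <- as.vector(m %*% v) / (sqrt(rowSums(m^2)) * sqrt(sum(v^2)))
head(emb$tokens[order(sims, decreasing = TRUE)])
```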
The choice of loss function is one of:

- "ns": negative sampling
- "hs": hierarchical softmax
- "softmax": full softmax
Enriching Word Vectors with Subword Information, 2016, P. Bojanowski, E. Grave, A. Joulin, T. Mikolov.
```r
fasttext(fairy_tales, n_iter = 2)
#> # A tibble: 452 x 11
#>    tokens     V1     V2    V3     V4     V5      V6    V7     V8      V9   V10
#>    <chr>   <dbl>  <dbl> <dbl>  <dbl>  <dbl>   <dbl> <dbl>  <dbl>   <dbl> <dbl>
#>  1 the    -0.282 0.0299 0.443 -0.414 0.131  -0.0250 0.231 -0.198 -0.0193 0.223
#>  2 and    -0.217 0.0142 0.483 -0.456 0.131  -0.0469 0.231 -0.208 -0.0527 0.192
#>  3 a      -0.199 0.0468 0.421 -0.429 0.133  -0.0780 0.228 -0.207 -0.0396 0.214
#>  4 to     -0.256 0.0122 0.417 -0.448 0.0965 -0.0589 0.241 -0.227 -0.0462 0.183
#>  5 he     -0.215 0.0563 0.380 -0.361 0.0772 -0.0212 0.173 -0.186 -0.0733 0.184
#>  6 of     -0.180 0.0346 0.417 -0.438 0.127  -0.0677 0.253 -0.213 -0.0217 0.178
#>  7 in     -0.307 0.0502 0.533 -0.514 0.111  -0.0652 0.258 -0.228 -0.0335 0.229
#>  8 she    -0.187 0.0389 0.363 -0.376 0.0901 -0.0588 0.210 -0.165 -0.0210 0.157
#>  9 was    -0.215 0.0439 0.371 -0.362 0.0886 -0.0382 0.188 -0.170 -0.0402 0.159
#> 10 as     -0.231 0.0385 0.560 -0.552 0.157  -0.0821 0.308 -0.228 -0.0645 0.209
#> # … with 442 more rows

# Custom tokenizer that splits on non-alphanumeric characters
fasttext(fairy_tales, n_iter = 2, tokenizer = function(x) strsplit(x, "[^[:alnum:]]+"))
#> # A tibble: 490 x 11
#>    tokens       V1    V2      V3     V4     V5       V6     V7     V8       V9
#>    <chr>     <dbl> <dbl>   <dbl>  <dbl>  <dbl>    <dbl>  <dbl>  <dbl>    <dbl>
#>  1 the     0.0375  0.516 -0.0992 0.0700 -0.276 -0.0166  -0.265 -0.209 -0.00763
#>  2 and     0.0275  0.522 -0.0804 0.0831 -0.264 -0.00803 -0.277 -0.223 -0.00484
#>  3 a       0.0264  0.507 -0.0448 0.0188 -0.239 -0.0572  -0.247 -0.207 -0.00223
#>  4 to      0.0256  0.520 -0.136  0.0479 -0.331 -0.00933 -0.274 -0.246 -0.00939
#>  5 he      0.0656  0.461 -0.0624 0.0535 -0.256 -0.0286  -0.252 -0.170 -0.0213
#>  6 of      0.00377 0.531 -0.137  0.0733 -0.279 -0.0114  -0.280 -0.208  0.00984
#>  7 in      0.0215  0.480 -0.0681 0.0660 -0.270 -0.0350  -0.263 -0.173  0.0342
#>  8 she     0.0272  0.482 -0.115  0.0502 -0.255  0.00664 -0.265 -0.202 -0.0184
#>  9 you    -0.00101 0.617 -0.0984 0.0716 -0.329 -0.0198  -0.322 -0.240 -0.0310
#> 10 was     0.0360  0.482 -0.0888 0.0455 -0.259 -0.0131  -0.242 -0.185 -0.0135
#> # … with 480 more rows, and 1 more variable: V10 <dbl>
```