Extract word vectors from a GloVe word embedding

The calculations are done with the text2vec package.

glove(
  text,
  tokenizer = text2vec::space_tokenizer,
  dim = 10L,
  window = 5L,
  min_count = 5L,
  n_iter = 10L,
  x_max = 10L,
  stopwords = character(),
  convergence_tol = -1,
  threads = 1,
  composition = c("tibble", "data.frame", "matrix"),
  verbose = FALSE
)

Arguments

text	Character string.
tokenizer	Function used to perform tokenization. Defaults to text2vec::space_tokenizer.
dim	Integer, number of dimensions of the resulting word vectors. Defaults to 10.
window	Integer, skip length between words. Defaults to 5.
min_count	Integer, number of times a token should appear to be considered in the model. Defaults to 5.
n_iter	Integer, number of training iterations. Defaults to 10.
x_max	Integer, maximum number of co-occurrences to use in the weighting function. Defaults to 10.
stopwords	Character, a vector of stop words to exclude from training.
convergence_tol	Numeric, tolerance that defines the early stopping strategy. Fitting stops when either of the following conditions is satisfied: (a) all iterations have been run, or (b) `cost_previous_iter / cost_current_iter - 1 < convergence_tol`. Defaults to -1; because condition (b) cannot hold for positive costs at that value, all `n_iter` iterations are run.
threads	Integer, number of CPU threads to use. Defaults to 1.
composition	Character, either "tibble", "data.frame", or "matrix"; the format of the resulting word vectors. Defaults to "tibble".
verbose	Logical, controls whether progress is reported as operations are executed.
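
The roles of x_max and convergence_tol can be illustrated in base R. This is a minimal sketch, not the package's code: glove_weight() and should_stop() are hypothetical helpers, the weighting formula and alpha = 0.75 are taken from Pennington et al. (2014), and the stopping test simply restates condition (b) from the convergence_tol description.

```r
# GloVe weighting function from Pennington et al. (2014):
# f(x) = (x / x_max)^alpha for x < x_max, and 1 otherwise.
# alpha = 0.75 is the value suggested in the paper (an assumption here;
# glove() does not document an alpha argument).
glove_weight <- function(x, x_max = 10, alpha = 0.75) {
  ifelse(x < x_max, (x / x_max)^alpha, 1)
}

# Condition (b) from the convergence_tol description.
# Returns TRUE when the relative improvement in cost falls below the tolerance.
should_stop <- function(cost_previous_iter, cost_current_iter,
                        convergence_tol = -1) {
  cost_previous_iter / cost_current_iter - 1 < convergence_tol
}

# Co-occurrence counts at or above x_max all receive weight 1.
glove_weight(c(1, 5, 10, 50), x_max = 10)

# A relative improvement of about 0.2% is below a 1% tolerance: stop.
should_stop(0.0500, 0.0499, convergence_tol = 0.01)

# A large improvement is above the tolerance: keep iterating.
should_stop(0.0500, 0.0300, convergence_tol = 0.01)

# With the default of -1 the test is not met for positive costs.
should_stop(0.0500, 0.0499, convergence_tol = -1)
```

A smaller x_max means that more co-occurrence counts reach the maximum weight of 1, which is why the example in this page lowers it for a small corpus.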

Source

https://nlp.stanford.edu/projects/glove/

Value

A tibble, data.frame or matrix containing the token in the first column and word vectors in the remaining columns.

References

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation.

Examples

glove(fairy_tales, x_max = 5)
#> # A tibble: 451 x 11
#>    tokens     V1      V2      V3      V4    V5       V6      V7      V8      V9
#>    <chr>   <dbl>   <dbl>   <dbl>   <dbl> <dbl>    <dbl>   <dbl>   <dbl>   <dbl>
#>  1 "\"Do" -0.136 -0.319  -0.364  -1.14   0.567  0.376   -0.229  -0.840  -0.0496
#>  2 "\"Go…  0.605 -0.764  -0.887   0.0500 1.34   0.185    0.207  -0.543  -0.453 
#>  3 "\"He" -0.483  0.0754 -0.194  -0.838  0.689  0.229   -0.621   0.328  -0.679 
#>  4 "\"He…  0.263 -0.265  -0.499  -0.827  0.149 -0.00565 -0.170   1.05   -1.11  
#>  5 "\"Oh"  0.927 -0.599  -0.0827 -0.762  0.967 -0.158   -0.319   0.452   0.207 
#>  6 "\"Th…  0.433 -0.0803  0.510  -1.33   0.557 -0.0535  -0.127   1.07   -1.56  
#>  7 "\"Ye… -0.162 -0.421   0.149  -1.20   0.833  0.132    0.325  -0.457  -0.516 
#>  8 "-"     0.217 -0.0786 -0.472  -1.29   0.935 -0.212   -0.0861  0.175  -0.425 
#>  9 "All"   0.495 -0.602  -0.0697 -0.301  0.437  0.256    0.571  -0.0976 -0.691 
#> 10 "You"   0.435 -0.783  -0.264  -1.29   0.646 -0.987   -0.535  -0.0246 -1.20  
#> # … with 441 more rows, and 1 more variable: V10 <dbl>
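
Since the calculations are done with text2vec, a roughly comparable fit can be sketched by calling text2vec directly. This is an illustration under assumptions, not the implementation of glove(): glove_sketch() is a hypothetical helper, the mapping of arguments onto text2vec calls is a guess based on their names, and the GlobalVectors$new(rank = , x_max = ) form assumes text2vec 0.6 or later.

```r
# Hypothetical helper, not part of the documented package.
# Assumes text2vec >= 0.6 is installed.
glove_sketch <- function(text, dim = 10L, window = 5L, min_count = 5L,
                         n_iter = 10L, x_max = 10L, stopwords = character(),
                         convergence_tol = -1, threads = 1) {
  # Tokenize on spaces and wrap the tokens in an iterator.
  tokens <- text2vec::space_tokenizer(text)
  it <- text2vec::itoken(tokens, progressbar = FALSE)

  # Build the vocabulary, dropping stop words and rare tokens.
  vocab <- text2vec::create_vocabulary(it, stopwords = stopwords)
  vocab <- text2vec::prune_vocabulary(vocab, term_count_min = min_count)

  # Term co-occurrence matrix within the skip-gram window.
  vectorizer <- text2vec::vocab_vectorizer(vocab)
  tcm <- text2vec::create_tcm(it, vectorizer, skip_grams_window = window)

  # Fit GloVe; the result is a matrix with one row per token.
  model <- text2vec::GlobalVectors$new(rank = dim, x_max = x_max)
  model$fit_transform(tcm, n_iter = n_iter,
                      convergence_tol = convergence_tol,
                      n_threads = threads)
}
```

The sketch returns only the main word vectors as a matrix with tokens as row names. Whether glove() also combines them with the context vectors before applying composition is not stated in this page.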