A word sequence (sentence), encoded as a list
of word indices (integers). If using a sampling_table,
word indices are expected to match the rank
of the words in a reference dataset (e.g. 10 would encode
the 10th most frequently occurring token).
Note that index 0 is expected to be a non-word and will be skipped.
vocabulary_size
Int, maximum possible word index + 1
window_size
Int, size of sampling windows (technically half-window).
The window of a word w_i will be
[i - window_size, i + window_size + 1].
negative_samples
Float >= 0. 0 for no negative (i.e. random) samples.
1 for the same number as positive samples.
shuffle
Whether to shuffle the word couples before returning them.
categorical
Bool. If False, labels will be
integers (e.g. [0, 1, 1, ...]);
if True, labels will be categorical, e.g.
[[1, 0], [0, 1], [0, 1], ...].
sampling_table
1D array of size vocabulary_size where the entry i
encodes the probability to sample a word of rank i.
seed
Random seed.
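
The windowing described under window_size can be sketched in a few lines of plain Python. This is an illustration only, not the library's implementation: the helper name `positive_couples` is hypothetical, and the sketch assumes the upper window bound is exclusive (as in Python's `range`) and that the window is clipped to the sequence boundaries.

```python
def positive_couples(sequence, window_size):
    """Hypothetical helper: collect [target, context] pairs from one sequence."""
    couples = []
    for i, wi in enumerate(sequence):
        if not wi:
            # Index 0 is treated as a non-word and skipped as a target.
            continue
        # Assumption: the window is clipped to the sequence and the
        # upper bound is exclusive.
        start = max(0, i - window_size)
        end = min(len(sequence), i + window_size + 1)
        for j in range(start, end):
            if j == i:
                continue  # a word is not paired with itself
            wj = sequence[j]
            if not wj:
                continue  # index 0 is also skipped as a context word
            couples.append([wi, wj])
    return couples


pairs = positive_couples([1, 2, 0, 3], window_size=1)
print(pairs)  # [[1, 2], [2, 1]]
```

With window_size=1, the word at position 3 produces no couple here because its only neighbour is the non-word 0.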
Returns
couples, labels: where couples are int pairs and
labels are either 0 or 1.
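
To make the relationship between couples, labels, negative_samples and categorical concrete, here is a minimal sketch. The helpers `add_negatives` and `to_categorical` are hypothetical names introduced for illustration, and the way negatives are drawn (a shuffled target paired with a uniformly random non-zero word index) is an assumption, not a statement of how the library does it.

```python
import random


def add_negatives(couples, vocabulary_size, negative_samples, seed=None):
    """Hypothetical helper: append random (negative) couples labelled 0."""
    rng = random.Random(seed)
    labels = [1] * len(couples)  # every observed couple is a positive
    num_negative = int(len(couples) * negative_samples)
    targets = [c[0] for c in couples]
    rng.shuffle(targets)
    # Assumption: a negative pairs an existing target word with a
    # uniformly random word index in [1, vocabulary_size - 1].
    negatives = [
        [targets[k % len(targets)], rng.randint(1, vocabulary_size - 1)]
        for k in range(num_negative)
    ]
    return couples + negatives, labels + [0] * num_negative


def to_categorical(labels):
    """Map 0 -> [1, 0] and 1 -> [0, 1], matching the categorical format."""
    return [[1, 0] if y == 0 else [0, 1] for y in labels]


couples, labels = add_negatives(
    [[1, 2], [2, 1]], vocabulary_size=5, negative_samples=1.0, seed=0
)
print(labels)                  # [1, 1, 0, 0]
print(to_categorical(labels))  # [[0, 1], [0, 1], [1, 0], [1, 0]]
```

With negative_samples=1.0 the number of negatives equals the number of positives; with 0 the function returns only the positives.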
Note
By convention, index 0 in the vocabulary is
a non-word and will be skipped.
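
The index-0 convention and the sampling_table argument can both be read as a filter on target words. The sketch below is illustrative only: `keep_target` is a hypothetical helper, and it assumes a word of rank i is kept with probability sampling_table[i].

```python
import random


def keep_target(word_index, sampling_table=None, rng=random):
    """Hypothetical helper: decide whether a word is used as a target."""
    if not word_index:
        return False  # index 0 is a non-word and is always skipped
    if sampling_table is not None:
        # Assumption: keep the word with probability sampling_table[word_index].
        if sampling_table[word_index] < rng.random():
            return False
    return True


# Toy table: rank 1 is always kept, rank 2 is never kept.
table = [0.0, 1.0, 0.0]
rng = random.Random(0)
kept = [w for w in [0, 1, 2, 1] if keep_target(w, table, rng)]
print(kept)  # [1, 1]
```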