Generates skipgram word pairs.
tf.keras.preprocessing.sequence.skipgrams(
sequence, vocabulary_size, window_size=4, negative_samples=1.0, shuffle=True,
categorical=False, sampling_table=None, seed=None
)
This function transforms a sequence of word indexes (list of integers) into tuples of words of the form:
- (word, word in the same window), with label 1 (positive samples).
- (word, random word from the vocabulary), with label 0 (negative samples).
For background on the skipgram model, see the paper by Mikolov et al.: Efficient Estimation of Word Representations in Vector Space
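The pairing described above can be sketched in plain Python. This is a minimal illustration of the documented behavior, not the library's implementation: the helper name `sketch_skipgrams` is hypothetical, negatives are assumed to be drawn uniformly from indices 1 to `vocabulary_size - 1`, and shuffling, `categorical`, and `sampling_table` are left out so the output stays deterministic.

```python
import random


def sketch_skipgrams(sequence, vocabulary_size, window_size=4,
                     negative_samples=1.0, seed=None):
    """Illustrative sketch of skipgram pairing (hypothetical helper)."""
    rng = random.Random(seed)
    couples, labels = [], []

    # Positive samples: pair each word with every other word in its window.
    for i, word in enumerate(sequence):
        if word == 0:  # index 0 is a non-word and is skipped
            continue
        start = max(0, i - window_size)
        end = min(len(sequence), i + window_size + 1)
        for j in range(start, end):
            if j != i and sequence[j] != 0:
                couples.append([word, sequence[j]])
                labels.append(1)

    # Negative samples: pair a word from the sequence with a random index.
    num_negative = int(len(couples) * negative_samples)
    positives = [pair[0] for pair in couples]
    for k in range(num_negative):
        # Assumption: negatives are drawn uniformly from 1..vocabulary_size-1.
        couples.append([positives[k % len(positives)],
                        rng.randint(1, vocabulary_size - 1)])
        labels.append(0)

    return couples, labels


couples, labels = sketch_skipgrams([1, 2, 3], vocabulary_size=10,
                                   window_size=1, negative_samples=0)
print(couples)  # [[1, 2], [2, 1], [2, 3], [3, 2]]
print(labels)   # [1, 1, 1, 1]
```

With `negative_samples=1.0` the sketch appends one label-0 pair per positive pair, so the returned lists double in length.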
Arguments | |
---|---|
sequence | A word sequence (sentence), encoded as a list of word indices (integers). If using a sampling_table, word indices are expected to match the rank of the words in a reference dataset (e.g. 10 would encode the 10th most frequently occurring token). Note that index 0 is expected to be a non-word and will be skipped. |
vocabulary_size | Int, maximum possible word index + 1. |
window_size | Int, size of sampling windows (technically half-window). The window of a word w_i covers positions i - window_size through i + window_size, i.e. the half-open index range [i - window_size, i + window_size + 1). |
negative_samples | Float >= 0. 0 for no negative (i.e. random) samples, 1 for the same number as positive samples. |
shuffle | Whether to shuffle the word couples before returning them. |
categorical | Bool. If False, labels will be integers (e.g. [0, 1, 1, ...]); if True, labels will be categorical (e.g. [[1, 0], [0, 1], [0, 1], ...]). |
sampling_table | 1D array of size vocabulary_size where entry i encodes the probability of sampling a word of rank i. |
seed | Random seed. |
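As a rough illustration of how a sampling_table could be applied, the sketch below keeps each word with the probability stored at its index. The helper name `subsample`, the table values, and the exact keep/drop rule are assumptions made for this example and are not taken from the library.

```python
import random


def subsample(sequence, sampling_table, seed=None):
    """Keep each word with probability sampling_table[word] (hypothetical helper)."""
    rng = random.Random(seed)
    kept = []
    for word in sequence:
        if word == 0:  # index 0 is a non-word
            continue
        # Assumed rule: drop the word when a uniform draw exceeds its probability.
        if sampling_table[word] < rng.random():
            continue
        kept.append(word)
    return kept


# Hypothetical table for a vocabulary of size 4: rank 1 (most frequent) has
# keep probability 0, rank 2 has 0.5, and rank 3 has 1.
table = [0.0, 0.0, 0.5, 1.0]
print(subsample([1, 3, 1, 3, 0], table, seed=0))  # [3, 3]
```

Because the table is indexed by word index, this only makes sense when indices follow frequency rank, as the sequence argument requires.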
Returns | |
---|---|
couples, labels: where couples are int pairs and labels are either 0 or 1 (or the corresponding categorical pairs when categorical is True). |
Note:
By convention, index 0 in the vocabulary is a non-word and will be skipped.
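Downstream code often needs the two halves of each couple as separate sequences. A minimal sketch follows, using hand-written values in place of real function output; the names `targets`, `contexts`, and `categorical_labels` are illustrative only.

```python
# Hand-written stand-in for the (couples, labels) return value.
couples = [[1, 2], [2, 1], [2, 7], [3, 2]]
labels = [1, 1, 0, 1]

# Split each [word, other_word] pair into two parallel lists.
targets = [pair[0] for pair in couples]
contexts = [pair[1] for pair in couples]

# Integer labels map to the categorical form as 0 -> [1, 0] and 1 -> [0, 1].
categorical_labels = [[1, 0] if label == 0 else [0, 1] for label in labels]

print(targets)             # [1, 2, 2, 3]
print(contexts)            # [2, 1, 7, 2]
print(categorical_labels)  # [[0, 1], [0, 1], [1, 0], [0, 1]]
```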