tf.keras.preprocessing.sequence.skipgrams
Generates skipgram word pairs.
tf.keras.preprocessing.sequence.skipgrams(
    sequence,
    vocabulary_size,
    window_size=4,
    negative_samples=1.0,
    shuffle=True,
    categorical=False,
    sampling_table=None,
    seed=None
)
This function transforms a sequence of word indices (a list of integers)
into pairs of words of the form:
- (word, word in the same window), with label 1 (positive samples).
- (word, random word from the vocabulary), with label 0 (negative samples).
Read more about Skipgram in this paper by Mikolov et al.:
Efficient Estimation of Word Representations in Vector Space
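To make the notion of "word in the same window" concrete, here is a minimal pure-Python sketch of how positive pairs could be collected. The helper name `positive_pairs` is hypothetical and is not part of Keras; it only illustrates the windowing and the skipping of index 0 described on this page, and it ignores `sampling_table`, negative samples and shuffling.

```python
def positive_pairs(sequence, window_size):
    """Illustrative helper (not a Keras function): collect (word, context) pairs."""
    pairs = []
    for i, word in enumerate(sequence):
        if word == 0:
            # Index 0 is treated as a non-word and never used as a target.
            continue
        # Window bounds, clipped to the sequence; `end` is exclusive.
        start = max(0, i - window_size)
        end = min(len(sequence), i + window_size + 1)
        for j in range(start, end):
            # Skip the target position itself and any non-word context.
            if j != i and sequence[j] != 0:
                pairs.append((word, sequence[j]))
    return pairs


pairs = positive_pairs([1, 2, 0, 3], window_size=1)
print(pairs)  # [(1, 2), (2, 1)]
```

With `window_size=1`, the word at position 3 gets no pairs because its only neighbour is the non-word 0.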
| Arguments | Description |
| sequence | A word sequence (sentence), encoded as a list of word indices (integers). If using a sampling_table, word indices are expected to match the rank of the words in a reference dataset (e.g. 10 would encode the 10th most frequently occurring token). Note that index 0 is expected to be a non-word and will be skipped. |
| vocabulary_size | Int, maximum possible word index + 1. |
| window_size | Int, size of sampling windows (technically half-window). The window of a word w_i covers positions i - window_size through i + window_size inclusive, that is, the half-open index range [i - window_size, i + window_size + 1), clipped to the bounds of the sequence. |
| negative_samples | Float >= 0. 0 for no negative (i.e. random) samples. 1 for the same number as positive samples. |
| shuffle | Bool. Whether to shuffle the word couples before returning them. |
| categorical | Bool. If False, labels will be integers (e.g. [0, 1, 1, ...]); if True, labels will be categorical (e.g. [[1, 0], [0, 1], [0, 1], ...]). |
| sampling_table | 1D array of size vocabulary_size where the entry i encodes the probability to sample a word of rank i. |
| seed | Random seed. |
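The role of sampling_table can be illustrated with a small sketch. It assumes the table is consulted once per target word, so that a word of rank r is kept as a target with probability sampling_table[r]; the helper `subsample_targets` is hypothetical and only demonstrates that reading of the argument description, not the library's code.

```python
import random


def subsample_targets(sequence, sampling_table, seed=None):
    """Hypothetical helper: positions that survive subsampling as targets."""
    rng = random.Random(seed)
    kept = []
    for i, word in enumerate(sequence):
        if word == 0:
            # Non-word, always skipped.
            continue
        # Keep the word with probability sampling_table[word].
        if sampling_table[word] < rng.random():
            continue
        kept.append(i)
    return kept


# Entry r is the keep-probability for the word of rank r (entry 0 is unused).
table = [0.0, 1.0, 0.0, 1.0]
kept = subsample_targets([1, 2, 0, 3, 2], table, seed=0)
print(kept)  # [0, 3]
```

Very frequent words would typically get a low probability in such a table, so they appear as targets less often.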
| Returns | couples, labels: where couples are int pairs and labels are either 0 or 1 (or their categorical encodings when categorical is True). |
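The relationship between couples, labels, negative_samples and categorical can be sketched as follows. This is an illustrative approximation, not the Keras source: the helper `add_negatives` is hypothetical, it starts from an already computed list of positive pairs, and it assumes negative words are drawn uniformly from the range 1 to vocabulary_size - 1.

```python
import random


def add_negatives(pairs, vocabulary_size, negative_samples=1.0,
                  categorical=False, seed=None):
    """Hypothetical helper: append random negative couples to positive pairs."""
    rng = random.Random(seed)
    couples = list(pairs)
    # Positive samples carry label 1, or [0, 1] in categorical form.
    labels = [[0, 1] if categorical else 1 for _ in pairs]
    # negative_samples is a ratio relative to the number of positives.
    num_negative = int(len(pairs) * negative_samples)
    targets = [target for target, _ in pairs]
    for k in range(num_negative):
        # Assumption: uniform draw over real words, never the non-word 0.
        random_word = rng.randint(1, vocabulary_size - 1)
        couples.append((targets[k % len(targets)], random_word))
        # Negative samples carry label 0, or [1, 0] in categorical form.
        labels.append([1, 0] if categorical else 0)
    return couples, labels


couples, labels = add_negatives([(1, 2), (2, 1)], vocabulary_size=5, seed=0)
print(labels)  # [1, 1, 0, 0]

one_hot = add_negatives([(1, 2)], vocabulary_size=5, categorical=True, seed=0)[1]
print(one_hot)  # [[0, 1], [1, 0]]
```

Because the negative word is drawn at random, a negative couple can occasionally coincide with a genuine context pair. When shuffling, couples and labels have to be permuted together so that each label stays aligned with its couple.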
| Note | By convention, index 0 in the vocabulary is a non-word and will be skipped. |
Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates. Some content is licensed under the numpy license.
Last updated 2022-10-27 UTC.