Generates skip-gram token and label paired Tensors from the input tensor.
```python
tfa.text.skip_gram_sample(
    input_tensor: tfa.types.TensorLike,
    min_skips: tfa.types.FloatTensorLike = 1,
    max_skips: tfa.types.FloatTensorLike = 5,
    start: tfa.types.FloatTensorLike = 0,
    limit: tfa.types.FloatTensorLike = -1,
    emit_self_as_target: bool = False,
    vocab_freq_table: tf.lookup.KeyValueTensorInitializer = None,
    vocab_min_count: Optional[FloatTensorLike] = None,
    vocab_subsampling: Optional[FloatTensorLike] = None,
    corpus_size: Optional[FloatTensorLike] = None,
    seed: Optional[FloatTensorLike] = None,
    name: Optional[str] = None
) -> tf.Tensor
```
Generates skip-gram ("token", "label")
pairs using each element in the
rank-1 input_tensor
as a token. The window size used for each token will
be randomly selected from the range specified by [min_skips, max_skips]
,
inclusive. See https://arxiv.org/abs/1301.3781 for more details about
skip-gram.
For example, given `input_tensor = ["the", "quick", "brown", "fox", "jumps"]`, `min_skips = 1`, `max_skips = 2`, and `emit_self_as_target = False`, the output `(tokens, labels)` pairs for the token "quick" will be randomly selected from either `(tokens=["quick", "quick"], labels=["the", "brown"])` for 1 skip, or `(tokens=["quick", "quick", "quick"], labels=["the", "brown", "fox"])` for 2 skips.

If `emit_self_as_target = True`, each token will also be emitted as a label for itself. From the previous example, the output will be either `(tokens=["quick", "quick", "quick"], labels=["the", "quick", "brown"])` for 1 skip, or `(tokens=["quick", "quick", "quick", "quick"], labels=["the", "quick", "brown", "fox"])` for 2 skips.
The same process is repeated for each element of `input_tensor` and concatenated together into the two output rank-1 `Tensors` (one for all the tokens, another for all the labels), as shown in the sketch below.
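A minimal sketch of invoking the op on the sentence above. The exact pairs vary with the sampled window sizes unless `seed` is fixed, and the values shown in the comments are illustrative, not guaranteed:

```python
import tensorflow as tf
import tensorflow_addons as tfa

input_tensor = tf.constant(["the", "quick", "brown", "fox", "jumps"])

# Each token's window size is drawn uniformly from [min_skips, max_skips];
# fixing `seed` makes the sampling reproducible across runs.
tokens, labels = tfa.text.skip_gram_sample(
    input_tensor,
    min_skips=1,
    max_skips=2,
    emit_self_as_target=False,
    seed=42,
)

# Both outputs are rank-1 tensors of equal length and the same dtype as
# `input_tensor`, paired element-wise, e.g.
#   tokens = [b"the", b"the", b"quick", b"quick", ...]
#   labels = [b"quick", b"brown", b"the", b"brown", ...]
print(tokens.numpy())
print(labels.numpy())
```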
If `vocab_freq_table` is specified, tokens in `input_tensor` that are not present in the vocabulary are discarded. Tokens whose frequency counts are below `vocab_min_count` are also discarded. Tokens whose frequency proportions in the corpus exceed `vocab_subsampling` may be randomly down-sampled. See Eq. 5 in http://arxiv.org/abs/1310.4546 for more details about subsampling.
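A sketch of vocabulary filtering and subsampling, assuming a frequency table built as a `tf.lookup.StaticHashTable` from a `tf.lookup.KeyValueTensorInitializer`; the counts, thresholds, and `default_value` below are illustrative assumptions, not values from this documentation:

```python
import tensorflow as tf
import tensorflow_addons as tfa

# Hypothetical corpus frequency counts; corpus_size is the total token count.
keys = tf.constant(["the", "quick", "brown", "fox", "jumps"])
counts = tf.constant([1000.0, 20.0, 50.0, 5.0, 80.0])
corpus_size = tf.reduce_sum(counts)

vocab_freq_table = tf.lookup.StaticHashTable(
    tf.lookup.KeyValueTensorInitializer(keys, counts),
    default_value=-1.0,  # out-of-vocabulary tokens map to -1 and are discarded
)

tokens, labels = tfa.text.skip_gram_sample(
    tf.constant(["the", "quick", "brown", "fox", "jumps"]),
    min_skips=1,
    max_skips=2,
    vocab_freq_table=vocab_freq_table,
    vocab_min_count=10,       # drops "fox" (count 5 < 10)
    vocab_subsampling=1e-2,   # tokens above 1% of the corpus may be down-sampled
    corpus_size=corpus_size,  # must accompany vocab_subsampling
    seed=42,
)
```

Note that `vocab_subsampling` and `corpus_size` must be supplied together: the subsampling formula needs each token's frequency as a proportion of the whole corpus.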
| Returns |
| --- |
| A tuple containing (token, label) `Tensors`. Each output `Tensor` is of rank-1 and has the same type as `input_tensor`. |
| Raises | |
| --- | --- |
| `ValueError` | If `vocab_freq_table` is not provided, but `vocab_min_count`, `vocab_subsampling`, or `corpus_size` is specified. If `vocab_subsampling` and `corpus_size` are not both present or both absent. |