Generates skip-gram token and label paired Tensors from the input tensor.
```python
tfa.text.skip_gram_sample(
    input_tensor: tfa.types.TensorLike,
    min_skips: tfa.types.FloatTensorLike = 1,
    max_skips: tfa.types.FloatTensorLike = 5,
    start: tfa.types.FloatTensorLike = 0,
    limit: tfa.types.FloatTensorLike = -1,
    emit_self_as_target: bool = False,
    vocab_freq_table: tf.lookup.KeyValueTensorInitializer = None,
    vocab_min_count: Optional[FloatTensorLike] = None,
    vocab_subsampling: Optional[FloatTensorLike] = None,
    corpus_size: Optional[FloatTensorLike] = None,
    seed: Optional[FloatTensorLike] = None,
    name: Optional[str] = None
) -> tf.Tensor
```
Generates skip-gram ("token", "label") pairs using each element in the
rank-1 input_tensor as a token. The window size used for each token will
be randomly selected from the range specified by [min_skips, max_skips],
inclusive. See https://arxiv.org/abs/1301.3781 for more details about
skip-gram.
For example, given input_tensor = ["the", "quick", "brown", "fox",
"jumps"], min_skips = 1, max_skips = 2, emit_self_as_target = False,
the output (tokens, labels) pairs for the token "quick" will be randomly
selected from either (tokens=["quick", "quick"], labels=["the", "brown"])
for 1 skip, or (tokens=["quick", "quick", "quick"],
labels=["the", "brown", "fox"]) for 2 skips.
If emit_self_as_target = True, each token will also be emitted as a label
for itself. From the previous example, the output will be either
(tokens=["quick", "quick", "quick"], labels=["the", "quick", "brown"])
for 1 skip, or (tokens=["quick", "quick", "quick", "quick"],
labels=["the", "quick", "brown", "fox"]) for 2 skips.
The same process is repeated for each element of input_tensor, and the
results are concatenated into two rank-1 output Tensors (one for all the
tokens, another for all the labels).
If vocab_freq_table is specified, tokens in input_tensor that are not
present in the vocabulary are discarded. Tokens whose frequency counts are
below vocab_min_count are also discarded. Tokens whose frequency
proportions in the corpus exceed vocab_subsampling may be randomly
down-sampled. See Eq. 5 in http://arxiv.org/abs/1310.4546 for more details
about subsampling.
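For reference, Eq. 5 of that paper gives the probability of discarding a
token $w_i$ as

$$P(w_i) = 1 - \sqrt{\frac{t}{f(w_i)}}$$

where $f(w_i)$ is the frequency proportion of $w_i$ in the corpus and $t$
is the subsampling threshold (vocab_subsampling here); implementations
commonly apply a smoothed variant of this rule, so exact sampling rates may
differ.

The sketch below shows one way to supply a frequency table. The vocabulary,
counts, thresholds, and corpus_size are made-up illustrative values, not
part of this API's documentation:

```python
import tensorflow as tf
import tensorflow_addons as tfa

# Hypothetical raw frequency counts for a toy vocabulary.
keys = tf.constant(["the", "quick", "brown", "fox", "jumps"])
counts = tf.constant([100, 10, 5, 8, 2], dtype=tf.int64)

# Tokens missing from the table look up to default_value and are discarded.
vocab_freq_table = tf.lookup.StaticHashTable(
    tf.lookup.KeyValueTensorInitializer(keys, counts), default_value=-1)

tokens, labels = tfa.text.skip_gram_sample(
    tf.constant(["the", "quick", "brown", "fox", "jumps"]),
    min_skips=1,
    max_skips=2,
    vocab_freq_table=vocab_freq_table,
    vocab_min_count=3,       # drops "jumps" (count 2 < 3)
    vocab_subsampling=1e-2,  # threshold t; frequent tokens like "the" may be down-sampled
    corpus_size=125,         # total count backing the toy table (100+10+5+8+2)
    seed=42)
```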
| Returns |
|---|
| A tuple containing (token, label) Tensors. Each output Tensor is rank-1 and has the same type as input_tensor. |
| Raises | |
|---|---|
| ValueError | If vocab_freq_table is not provided but vocab_min_count, vocab_subsampling, or corpus_size is specified; or if only one of vocab_subsampling and corpus_size is specified (they must be provided together). |