Skip-gram sampling with a text vocabulary file.
tfa.text.skip_gram_sample_with_text_vocab(
    input_tensor: tfa.types.TensorLike,
    vocab_freq_file: str,
    vocab_token_index: tfa.types.FloatTensorLike = 0,
    vocab_token_dtype: Optional[AcceptableDTypes] = tf.dtypes.string,
    vocab_freq_index: tfa.types.FloatTensorLike = 1,
    vocab_freq_dtype: Optional[AcceptableDTypes] = tf.dtypes.float64,
    vocab_delimiter: str = ',',
    vocab_min_count: Optional[FloatTensorLike] = None,
    vocab_subsampling: Optional[FloatTensorLike] = None,
    corpus_size: Optional[FloatTensorLike] = None,
    min_skips: tfa.types.FloatTensorLike = 1,
    max_skips: tfa.types.FloatTensorLike = 5,
    start: tfa.types.FloatTensorLike = 0,
    limit: tfa.types.FloatTensorLike = -1,
    emit_self_as_target: bool = False,
    seed: Optional[FloatTensorLike] = None,
    name: Optional[str] = None
) -> tf.Tensor
Wrapper around `skip_gram_sample()` for use with a text vocabulary file.
The vocabulary file is expected to be a plain-text file, with lines of `vocab_delimiter`-separated columns. The `vocab_token_index` column should contain the vocabulary term, while the `vocab_freq_index` column should contain the number of times that term occurs in the corpus. For example, with a text vocabulary file of:
bonjour,fr,42
hello,en,777
hola,es,99
You should set `vocab_delimiter=","`, `vocab_token_index=0`, and `vocab_freq_index=2`.
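To make the column settings concrete, here is a minimal pure-Python sketch of how such a file could be read. It is not the library's own parser: `load_vocab_counts` is a hypothetical helper, and the assumption that the total is summed over every row (including rows later dropped by a minimum count) is taken from the `corpus_size` description, not from the implementation.

```python
import csv
import io


def load_vocab_counts(lines, delimiter=",", token_index=0, freq_index=1,
                      min_count=None):
    """Hypothetical helper: return ({token: count}, total_count).

    total_count sums every row in the file; tokens whose count is below
    min_count are left out of the returned dictionary.
    """
    counts = {}
    total = 0.0
    for row in csv.reader(lines, delimiter=delimiter):
        if not row:
            continue  # skip blank lines
        count = float(row[freq_index])
        total += count
        if min_count is not None and count < min_count:
            continue  # too rare to keep
        counts[row[token_index]] = count
    return counts, total


# The three-column example file from above, held in memory for illustration.
vocab_text = "bonjour,fr,42\nhello,en,777\nhola,es,99\n"
counts, total = load_vocab_counts(
    io.StringIO(vocab_text), delimiter=",", token_index=0, freq_index=2,
    min_count=50,
)
```

With `min_count=50`, the `bonjour` row is dropped from `counts`, while `total` still reflects all three rows.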
See the `skip_gram_sample()` documentation for more details about the skip-gram sampling process.
Args | |
---|---|
`input_tensor` | A rank-1 `Tensor` from which to generate skip-gram candidates. |
`vocab_freq_file` | `string` specifying the full file path to the text vocab file. |
`vocab_token_index` | `int` specifying which column in the text vocab file contains the tokens. |
`vocab_token_dtype` | `DType` specifying the format of the tokens in the text vocab file. |
`vocab_freq_index` | `int` specifying which column in the text vocab file contains the frequency counts of the tokens. |
`vocab_freq_dtype` | `DType` specifying the format of the frequency counts in the text vocab file. |
`vocab_delimiter` | `string` specifying the delimiter used in the text vocab file. |
`vocab_min_count` | `int`, `float`, or scalar `Tensor` specifying the minimum frequency threshold (from `vocab_freq_file`) for a token to be kept in `input_tensor`. This should correspond with `vocab_freq_dtype`. |
`vocab_subsampling` | (Optional) `float` specifying the frequency proportion threshold for tokens from `input_tensor`. Tokens that occur more frequently will be randomly down-sampled. Reasonable starting values may be around 1e-3 or 1e-5. See Eq. 5 in http://arxiv.org/abs/1310.4546 for more details. |
`corpus_size` | (Optional) `int`, `float`, or scalar `Tensor` specifying the total number of tokens in the corpus (e.g., the sum of all the frequency counts of `vocab_freq_file`). Used with `vocab_subsampling` for down-sampling frequently occurring tokens. If this is specified, `vocab_freq_file` and `vocab_subsampling` must also be specified. If `corpus_size` is needed but not supplied, it will be calculated from `vocab_freq_file`. You might want to supply your own value if you have already eliminated infrequent tokens from your vocabulary files (where frequency < `vocab_min_count`) to save memory in the internal token lookup table. Otherwise, the unused tokens' variables will waste memory. The user-supplied `corpus_size` value must be greater than or equal to the sum of all the frequency counts of `vocab_freq_file`. |
`min_skips` | `int` or scalar `Tensor` specifying the minimum window size to randomly use for each token. Must be >= 0 and <= `max_skips`. If `min_skips` and `max_skips` are both 0, the only label output will be the token itself. |
`max_skips` | `int` or scalar `Tensor` specifying the maximum window size to randomly use for each token. Must be >= 0. |
`start` | `int` or scalar `Tensor` specifying the position in `input_tensor` from which to start generating skip-gram candidates. |
`limit` | `int` or scalar `Tensor` specifying the maximum number of elements in `input_tensor` to use in generating skip-gram candidates. -1 means to use the rest of the `Tensor` after `start`. |
`emit_self_as_target` | `bool` or scalar `Tensor` specifying whether to emit each token as a label for itself. |
`seed` | (Optional) `int` used to create a random seed for window size and subsampling. See `set_random_seed` for behavior. |
`name` | (Optional) A `string` name or a name scope for the operations. |
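The interaction between `vocab_subsampling` and `corpus_size` can be illustrated with the discard probability from Eq. 5 of the cited paper, P(w) = 1 - sqrt(t / f(w)), where f(w) is a token's count divided by the corpus size and t is the threshold. The sketch below implements that equation in plain Python. The function names are hypothetical, and the library's kernel may use a different variant of the formula, so treat the numbers as illustrative only.

```python
import math
import random


def discard_probability(count, corpus_size, threshold):
    """Eq. 5 of Mikolov et al. (2013), floored at 0 for rare tokens."""
    frequency = count / corpus_size  # relative frequency f(w)
    return max(0.0, 1.0 - math.sqrt(threshold / frequency))


def subsample(tokens, counts, corpus_size, threshold, seed=None):
    """Randomly drop tokens according to their discard probability."""
    rng = random.Random(seed)
    kept = []
    for tok in tokens:
        p_discard = discard_probability(counts[tok], corpus_size, threshold)
        if rng.random() >= p_discard:
            kept.append(tok)
    return kept


# Made-up counts for illustration; they sum to a corpus size of 1000.
counts = {"the": 900.0, "quick": 60.0, "fox": 40.0}
corpus_size = sum(counts.values())

p_the = discard_probability(counts["the"], corpus_size, 1e-3)
p_fox = discard_probability(counts["fox"], corpus_size, 0.05)
kept = subsample(["fox", "fox", "fox"], counts, corpus_size, 0.05, seed=0)
```

A token whose relative frequency is below the threshold gets a discard probability of 0 and is always kept, which is why only frequent tokens are down-sampled.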
Returns | |
---|---|
A `tuple` containing (token, label) `Tensors`. Each output `Tensor` is of rank-1 and has the same type as `input_tensor`. |
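The shape of the returned (token, label) pair, and the roles of `min_skips`, `max_skips`, `start`, `limit`, and `emit_self_as_target`, can be sketched in plain Python as follows. This is a simplified illustration, not the TensorFlow op: `skip_gram_pairs` is a hypothetical function, and it assumes that context windows are confined to the slice selected by `start` and `limit`.

```python
import random


def skip_gram_pairs(tokens, min_skips=1, max_skips=5, start=0, limit=-1,
                    emit_self_as_target=False, seed=None):
    """Return parallel lists (tokens, labels) of skip-gram pairs."""
    rng = random.Random(seed)
    # limit=-1 means "use everything after start".
    end = len(tokens) if limit < 0 else min(len(tokens), start + limit)
    window = tokens[start:end]

    out_tokens, out_labels = [], []
    for i, tok in enumerate(window):
        # Draw a window size for this token, inclusive on both ends.
        skips = rng.randint(min_skips, max_skips)
        lo = max(0, i - skips)
        hi = min(len(window) - 1, i + skips)
        for j in range(lo, hi + 1):
            if j == i and not emit_self_as_target:
                continue  # do not pair a token with itself
            out_tokens.append(tok)
            out_labels.append(window[j])
    return out_tokens, out_labels


sample_tokens, sample_labels = skip_gram_pairs(
    ["the", "quick", "brown", "fox"], min_skips=1, max_skips=1
)
```

With `min_skips=1` and `max_skips=1` the window size is fixed, so the output is deterministic: each token is paired with its immediate neighbors, and the two lists have equal length.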