Tokenizes a tensor of UTF-8 string tokens into phrases.
Inherits From: `Tokenizer`, `Splitter`, `Detokenizer`
text.PhraseTokenizer(
vocab=None,
token_out_type=dtypes.int32,
unknown_token='<UNK>',
support_detokenization=True,
prob=0,
split_end_punctuation=False,
model_buffer=None
)
| Args | |
|---|---|
| `vocab` | (optional) The list of tokens in the vocabulary. |
| `token_out_type` | (optional) The type of the token to return. This can be `tf.int64` or `tf.int32` IDs, or `tf.string` subwords. |
| `unknown_token` | (optional) The string value to substitute for an unknown token. It must be included in `vocab`. |
| `support_detokenization` | (optional) Whether to make the tokenizer support detokenization. Setting it to true expands the size of the model flatbuffer. |
| `prob` | Probability of emitting a phrase when there is a match. |
| `split_end_punctuation` | Whether to split end punctuation off of a phrase. |
| `model_buffer` | (optional) Bytes object (or a uint8 `tf.Tensor`) that contains the phrase model in flatbuffer format (see `phrase_tokenizer_model.fbs`). If not `None`, all other arguments (except `token_out_type`) are ignored. |
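For illustration, here is a minimal sketch of constructing the tokenizer from an in-memory vocabulary and emitting integer ids. It reuses the toy vocabulary from the method examples below; exact splits depend on the vocabulary and the phrase model.

import tensorflow as tf
import tensorflow_text as text

vocab = ["I", "have", "a", "dream", "a dream", "I have a", "<UNK>"]

# Return vocabulary indices instead of string phrases.
tokenizer = text.PhraseTokenizer(vocab=vocab, token_out_type=tf.int32)
ids = tokenizer.tokenize(["I have a dream"])
# Mirroring the string example under `tokenize` below, the phrase
# "I have a" has vocab index 5 and "dream" has index 3, so `ids`
# is <tf.RaggedTensor [[5, 3]]>.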
Methods
detokenize
detokenize(
input_t
)
Detokenizes a tensor of int64 or int32 phrase ids into sentences.
Tokenizing an input string and then detokenizing the result returns the original string when the input is normalized and the tokenized phrases don't contain `<UNK>`.
Example:
>>> vocab = ["I", "have", "a", "dream", "a dream", "I have a", "<UNK>"]
>>> tokenizer = text.PhraseTokenizer(vocab, support_detokenization=True)
>>> ids = tf.ragged.constant([[0, 1, 2], [5, 3]])
>>> tokenizer.detokenize(ids)
<tf.Tensor: shape=(2,), dtype=string,
numpy=array([b'I have a', b'I have a dream'], dtype=object)>
| Args | |
|---|---|
| `input_t` | An N-dimensional `Tensor` or `RaggedTensor` of int64 or int32. |

| Returns | |
|---|---|
| A `RaggedTensor` of sentences that has N - 1 dimensions when N > 1; otherwise, a string tensor. |
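The round-trip property described above can be sketched with the same toy vocabulary (the expected output assumes the phrase splits shown in the examples):

import tensorflow as tf
import tensorflow_text as text

vocab = ["I", "have", "a", "dream", "a dream", "I have a", "<UNK>"]
tokenizer = text.PhraseTokenizer(vocab, support_detokenization=True)

# tokenize -> detokenize returns the input when it is normalized and
# none of the phrases map to the unknown token.
ids = tokenizer.tokenize(["I have a dream"])
round_trip = tokenizer.detokenize(ids)
# round_trip holds [b'I have a dream'].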
split
split(
input
)
Alias for `Tokenizer.tokenize`.
tokenize
tokenize(
input
)
Tokenizes a tensor of UTF-8 string tokens further into phrase tokens.
Example, single string tokenization:
>>> vocab = ["I", "have", "a", "dream", "a dream", "I have a", "<UNK>"]
>>> tokenizer = text.PhraseTokenizer(vocab, token_out_type=tf.string)
>>> tokens = [["I have a dream"]]
>>> phrases = tokenizer.tokenize(tokens)
>>> phrases
<tf.RaggedTensor [[[b'I have a', b'dream']]]>
| Args | |
|---|---|
| `input` | An N-dimensional `Tensor` or `RaggedTensor` of UTF-8 strings. |
| Returns | |
|---|---|
| `tokens` | A `RaggedTensor` where `tokens[i, j]` is the j-th token (i.e., phrase) for `input[i]` (i.e., the i-th input string). Each token is either the actual token string content or the corresponding integer id, i.e., the index of that token string in the vocabulary. This choice is controlled by the `token_out_type` parameter passed to the initializer method. |
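To make the id/string correspondence concrete, here is a sketch that tokenizes the same input twice, once per `token_out_type` (toy vocabulary again; exact splits depend on the phrase model):

import tensorflow as tf
import tensorflow_text as text

vocab = ["I", "have", "a", "dream", "a dream", "I have a", "<UNK>"]
str_tokenizer = text.PhraseTokenizer(vocab, token_out_type=tf.string)
id_tokenizer = text.PhraseTokenizer(vocab, token_out_type=tf.int32)

phrases = str_tokenizer.tokenize(["I have a dream"])  # [[b'I have a', b'dream']]
ids = id_tokenizer.tokenize(["I have a dream"])       # [[5, 3]]
# Each id is the index of the matching phrase in `vocab`:
# vocab[5] == "I have a", vocab[3] == "dream".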