Base class for tokenizer implementations.
Inherits From: Splitter
```python
text.Tokenizer(
    name=None
)
```
A `Tokenizer` is a `text.Splitter` that splits strings into tokens. Tokens generally correspond to short substrings of the source string. Tokens can be encoded using either strings or integer ids (where integer ids could be created by hashing strings or by looking them up in a fixed vocabulary table that maps strings to ids).
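For instance, a subclass could encode tokens as integer ids by hashing them into a fixed number of buckets. The sketch below is illustrative rather than part of the library; the `HashingTokenizer` name and its `num_buckets` parameter are assumptions:

```python
import tensorflow as tf
import tensorflow_text as tf_text

class HashingTokenizer(tf_text.Tokenizer):
  """Illustrative (not built-in) tokenizer that encodes tokens as hashed ids."""

  def __init__(self, num_buckets=1000):  # num_buckets is an assumed knob
    super().__init__()
    self.num_buckets = num_buckets

  def tokenize(self, input):
    words = tf.strings.split(input)  # RaggedTensor of string tokens
    # Map each string token to an integer id in [0, num_buckets).
    return tf.strings.to_hash_bucket_fast(words, self.num_buckets)

print(HashingTokenizer().tokenize(["hello world"]))
# The resulting RaggedTensor holds integer ids; the values depend on the hash.
```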
Each `Tokenizer` subclass must implement a `tokenize` method, which splits each string in a `Tensor` into tokens. E.g.:
```python
class SimpleTokenizer(tf_text.Tokenizer):
  def tokenize(self, input):
    return tf.strings.split(input)

print(SimpleTokenizer().tokenize(["hello world", "this is a test"]))
```
```
<tf.RaggedTensor [[b'hello', b'world'], [b'this', b'is', b'a', b'test']]>
```
By default, the `split` method simply delegates to `tokenize`.
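For example, with the `SimpleTokenizer` defined above, both calls produce the same tokens:

```python
tokenizer = SimpleTokenizer()
# split() is an alias that delegates to tokenize().
print(tokenizer.split(["hello world"]))
print(tokenizer.tokenize(["hello world"]))
```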
Methods

split

```python
split(
    input
)
```

Alias for `Tokenizer.tokenize`.
tokenize

```python
@abc.abstractmethod
tokenize(
    input
)
```

Tokenizes the input tensor.

Splits each string in the input tensor into a sequence of tokens. Tokens generally correspond to short substrings of the source string. Tokens can be encoded using either strings or integer ids.
Example:

```python
print(tf_text.WhitespaceTokenizer().tokenize("small medium large"))
```
```
tf.Tensor([b'small' b'medium' b'large'], shape=(3,), dtype=string)
```
| Args | |
|---|---|
| `input` | An N-dimensional UTF-8 string (or optionally integer) `Tensor` or `RaggedTensor`. |
| Returns | |
|---|---|
| An N+1-dimensional UTF-8 string or integer `Tensor` or `RaggedTensor`. For each string from the input tensor, the final, extra dimension contains the tokens that string was split into. |
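To illustrate the shape contract: tokenizing a rank-1 batch of strings yields a rank-2 `RaggedTensor`, with one row of tokens per input string (the output shown is a sketch of the expected result):

```python
print(tf_text.WhitespaceTokenizer().tokenize(["small medium", "large"]))
```
```
<tf.RaggedTensor [[b'small', b'medium'], [b'large']]>
```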