text.Tokenizer

Base class for tokenizer implementations.

Inherits From: Splitter

A Tokenizer is a text.Splitter that splits strings into tokens. Tokens generally correspond to short substrings of the source string. Tokens can be encoded using either strings or integer ids (where integer ids could be created by hashing strings or by looking them up in a fixed vocabulary table that maps strings to ids).

Each Tokenizer subclass must implement a tokenize method, which splits each string in a Tensor into tokens. E.g.:

import tensorflow as tf
import tensorflow_text as tf_text

class SimpleTokenizer(tf_text.Tokenizer):
  def tokenize(self, input):
    return tf.strings.split(input)

print(SimpleTokenizer().tokenize(["hello world", "this is a test"]))
# <tf.RaggedTensor [[b'hello', b'world'], [b'this', b'is', b'a', b'test']]>
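
The same pattern works for integer-id encodings. As a minimal sketch (the HashingTokenizer name and num_buckets parameter below are illustrative, not part of the library), a subclass can hash each token into a fixed number of buckets:

import tensorflow as tf
import tensorflow_text as tf_text

class HashingTokenizer(tf_text.Tokenizer):
  """Splits on whitespace, then hashes each token to an integer id."""

  def __init__(self, num_buckets=1000):  # num_buckets is an illustrative choice
    super().__init__()
    self.num_buckets = num_buckets

  def tokenize(self, input):
    tokens = tf.strings.split(input)
    # Hash each string token to an int64 id in [0, num_buckets).
    return tf.strings.to_hash_bucket_fast(tokens, self.num_buckets)

print(HashingTokenizer().tokenize(["hello world"]))
# <tf.RaggedTensor [[..., ...]]> -- int64 ids whose exact values depend on the hash.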

By default, the split method simply delegates to tokenize.
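
Continuing the SimpleTokenizer example above, the two methods are interchangeable:

tokenizer = SimpleTokenizer()
# split() delegates to tokenize(), so both calls return the same tokens.
print(tokenizer.split(["hello world"]))
# <tf.RaggedTensor [[b'hello', b'world']]>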

Methods

split

Alias for Tokenizer.tokenize.

tokenize

Tokenizes the input tensor.

Splits each string in the input tensor into a sequence of tokens. Tokens generally correspond to short substrings of the source string. Tokens can be encoded using either strings or integer ids.

Example:

print(tf_text.WhitespaceTokenizer().tokenize("small medium large"))
# tf.Tensor([b'small' b'medium' b'large'], shape=(3,), dtype=string)

Args
input: An N-dimensional UTF-8 string (or optionally integer) Tensor or RaggedTensor.

Returns
An N+1-dimensional UTF-8 string or integer Tensor or RaggedTensor. For each string in the input tensor, the added innermost dimension contains the tokens that string was split into.
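
To make the rank contract concrete, here is the WhitespaceTokenizer from the example above applied to a 1-D batch: the result is a 2-D RaggedTensor with one row of tokens per input string.

import tensorflow_text as tf_text

tokenizer = tf_text.WhitespaceTokenizer()
# Input of shape (2,) -> 2-D RaggedTensor: one ragged row of tokens per string.
print(tokenizer.tokenize(["small medium", "large"]))
# <tf.RaggedTensor [[b'small', b'medium'], [b'large']]>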