Base class for tokenizer implementations.
Inherits From: Splitter
```python
text.Tokenizer(
    name=None
)
```
A `Tokenizer` is a `text.Splitter` that splits strings into tokens. Tokens generally correspond to short substrings of the source string. Tokens can be encoded using either strings or integer ids (where integer ids could be created by hashing strings or by looking them up in a fixed vocabulary table that maps strings to ids).
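For instance, a subclass could encode tokens as integer ids by hashing them into a fixed number of buckets. The sketch below is illustrative rather than part of the library; the `HashingTokenizer` name and its `num_buckets` parameter are assumptions:

```python
import tensorflow as tf
import tensorflow_text as tf_text

class HashingTokenizer(tf_text.Tokenizer):
  """Illustrative (not built-in) tokenizer that encodes tokens as hashed ids."""

  def __init__(self, num_buckets=1000):  # num_buckets is an assumed knob
    super().__init__()
    self.num_buckets = num_buckets

  def tokenize(self, input):
    words = tf.strings.split(input)  # RaggedTensor of string tokens
    # Map each string token to an integer id in [0, num_buckets).
    return tf.strings.to_hash_bucket_fast(words, self.num_buckets)

print(HashingTokenizer().tokenize(["hello world"]))
# The resulting RaggedTensor holds integer ids; the values depend on the hash.
```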
Each `Tokenizer` subclass must implement a `tokenize` method, which splits each string in a `Tensor` into tokens. E.g.:
```python
class SimpleTokenizer(tf_text.Tokenizer):
  def tokenize(self, input):
    return tf.strings.split(input)

print(SimpleTokenizer().tokenize(["hello world", "this is a test"]))
```
```
<tf.RaggedTensor [[b'hello', b'world'], [b'this', b'is', b'a', b'test']]>
```
By default, the `split` method simply delegates to `tokenize`.
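For example, with the `SimpleTokenizer` defined above, both calls produce the same tokens:

```python
tokenizer = SimpleTokenizer()
# split() is an alias that delegates to tokenize().
print(tokenizer.split(["hello world"]))
print(tokenizer.tokenize(["hello world"]))
```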
Methods

split

```python
split(
    input
)
```

Alias for `Tokenizer.tokenize`.
tokenize

```python
@abc.abstractmethod
tokenize(
    input
)
```

Tokenizes the input tensor.

Splits each string in the input tensor into a sequence of tokens. Tokens generally correspond to short substrings of the source string. Tokens can be encoded using either strings or integer ids.
Example:

```python
print(tf_text.WhitespaceTokenizer().tokenize("small medium large"))
```
```
tf.Tensor([b'small' b'medium' b'large'], shape=(3,), dtype=string)
```
| Args | |
|---|---|
| `input` | An N-dimensional UTF-8 string (or optionally integer) `Tensor` or `RaggedTensor`. |
| Returns | |
|---|---|
| An N+1-dimensional UTF-8 string or integer `Tensor` or `RaggedTensor`. For each string from the input tensor, the final, extra dimension contains the tokens that string was split into. |
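To illustrate the shape contract: tokenizing a rank-1 batch of strings yields a rank-2 `RaggedTensor`, with one row of tokens per input string (the output shown is a sketch of the expected result):

```python
print(tf_text.WhitespaceTokenizer().tokenize(["small medium", "large"]))
```
```
<tf.RaggedTensor [[b'small', b'medium'], [b'large']]>
```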