Tokenizer used for BERT, a faster version with TFLite support.
Inherits From: TokenizerWithOffsets, Tokenizer, SplitterWithOffsets, Splitter, Detokenizer
text.FastBertTokenizer(
vocab=None,
suffix_indicator='##',
max_bytes_per_word=100,
token_out_type=dtypes.int64,
unknown_token='[UNK]',
no_pretokenization=False,
support_detokenization=False,
fast_wordpiece_model_buffer=None,
lower_case_nfd_strip_accents=False,
fast_bert_normalizer_model_buffer=None
)
This tokenizer applies end-to-end tokenization, converting text strings into wordpieces. It is equivalent to BertTokenizer for most common scenarios, while running faster and supporting TFLite. It does not support certain special settings (see the docs below).
See WordpieceTokenizer for details on the subword tokenization.
For an example of use, see https://www.tensorflow.org/text/guide/bert_preprocessing_guide
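As a quick, illustrative sketch of the end-to-end behavior (using a toy vocabulary, so the IDs shown depend entirely on that vocabulary), whitespace splitting and wordpiece lookup happen in a single call:
import tensorflow as tf
import tensorflow_text as text
vocab = ['they', "##'", '##re', 'the', 'great', '##est', '[UNK]']
tokenizer = text.FastBertTokenizer(vocab=vocab)
# Each input string is split on whitespace/punctuation, then wordpiece-tokenized.
tokenizer.tokenize(tf.constant(['the greatest', 'greatest']))
<tf.RaggedTensor [[3, 4, 5], [4, 5]]>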
Attributes | |
---|---|
vocab | (optional) The list of tokens in the vocabulary. |
suffix_indicator | (optional) The characters prepended to a wordpiece to indicate that it is a suffix of another subword. |
max_bytes_per_word | (optional) Max size of input token. |
token_out_type | (optional) The type of the token to return. This can be tf.int64 or tf.int32 IDs, or tf.string subwords. |
unknown_token | (optional) The string value to substitute for an unknown token. It must be included in vocab. |
no_pretokenization | (optional) By default, the input is split on whitespace and punctuation before applying the wordpiece tokenization. When true, the input is assumed to be pretokenized already. |
support_detokenization | (optional) Whether to make the tokenizer support detokenization. Setting it to true expands the size of the model flatbuffer. As a reference, when using the 120k multilingual BERT WordPiece vocab, the flatbuffer's size increases from ~5MB to ~6MB. |
fast_wordpiece_model_buffer | (optional) Bytes object (or a uint8 tf.Tensor) that contains the wordpiece model in flatbuffer format (see fast_wordpiece_tokenizer_model.fbs). If not None, all other arguments related to FastWordPieceTokenizer (except token_out_type) are ignored. |
lower_case_nfd_strip_accents | (optional) Whether to lowercase the input, apply NFD normalization, and strip accent characters before wordpiece tokenization. |
fast_bert_normalizer_model_buffer | (optional) Bytes object (or a uint8 tf.Tensor) that contains the fast BERT normalizer model in flatbuffer format (see fast_bert_normalizer_model.fbs). If not None, lower_case_nfd_strip_accents is ignored. |
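Several of these arguments change what tokenize returns. As an illustrative sketch (reusing the toy vocab from above), setting token_out_type=tf.string yields subword strings instead of vocabulary IDs:
tokenizer = text.FastBertTokenizer(vocab=vocab, token_out_type=tf.string)
tokenizer.tokenize(tf.constant(['greatest']))
<tf.RaggedTensor [[b'great', b'##est']]>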
Methods
detokenize
detokenize(
token_ids
)
Convert a Tensor or RaggedTensor of wordpiece IDs to string-words. See WordpieceTokenizer.detokenize for details.
Example:
vocab = ['they', "##'", '##re', 'the', 'great', '##est', '[UNK]']
tokenizer = text.FastBertTokenizer(vocab=vocab, support_detokenization=True)
tokenizer.detokenize([[4, 5]])
<tf.Tensor: shape=(1,), dtype=string, numpy=array([b'greatest'],
dtype=object)>
Args | |
---|---|
token_ids | A RaggedTensor or Tensor with an int dtype. |
Returns | |
---|---|
A RaggedTensor with dtype string and the same rank as the input token_ids. |
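When the tokenizer is constructed with support_detokenization=True, detokenize inverts tokenize. A minimal round-trip sketch using the toy vocab from above:
tokenizer = text.FastBertTokenizer(vocab=vocab, support_detokenization=True)
ids = tokenizer.tokenize(tf.constant(['the greatest']))
tokenizer.detokenize(ids)
<tf.Tensor: shape=(1,), dtype=string, numpy=array([b'the greatest'], dtype=object)>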
split
split(
input
)
Alias for Tokenizer.tokenize.
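Because split is an alias, it returns exactly what tokenize returns; for example, with the toy vocab from above:
tokenizer = text.FastBertTokenizer(vocab=vocab)
tokenizer.split(tf.constant(['greatest']))
<tf.RaggedTensor [[4, 5]]>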
split_with_offsets
split_with_offsets(
input
)
Alias for TokenizerWithOffsets.tokenize_with_offsets.
tokenize
tokenize(
text_input
)
Tokenizes a tensor of UTF-8 strings into subword tokens for BERT.
Example:
vocab = ['they', "##'", '##re', 'the', 'great', '##est', '[UNK]']
tokenizer = text.FastBertTokenizer(vocab=vocab)
text_inputs = tf.constant(['greatest'.encode('utf-8')])
tokenizer.tokenize(text_inputs)
<tf.RaggedTensor [[4, 5]]>
Args | |
---|---|
text_input | A Tensor or RaggedTensor of untokenized UTF-8 strings. |
Returns | |
---|---|
A RaggedTensor of tokens where tokens[i1...iN, j] is the string contents (or ID in the vocab representing that string) of the jth token in input[i1...iN]. |
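Words that the vocabulary cannot cover map to unknown_token. An illustrative sketch with the toy vocab above, where '[UNK]' has ID 6:
tokenizer = text.FastBertTokenizer(vocab=vocab)
tokenizer.tokenize(tf.constant(['greatest ever']))
<tf.RaggedTensor [[4, 5, 6]]>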
tokenize_with_offsets
tokenize_with_offsets(
text_input
)
Tokenizes a tensor of UTF-8 strings into subword tokens for BERT, along with byte offsets into the input.
Example:
vocab = ['they', "##'", '##re', 'the', 'great', '##est', '[UNK]']
tokenizer = text.FastBertTokenizer(vocab=vocab)
text_inputs = tf.constant(['greatest'.encode('utf-8')])
tokenizer.tokenize_with_offsets(text_inputs)
(<tf.RaggedTensor [[4, 5]]>,
<tf.RaggedTensor [[0, 5]]>,
<tf.RaggedTensor [[5, 8]]>)
Args | |
---|---|
text_input | A Tensor or RaggedTensor of untokenized UTF-8 strings. |
Returns | |
---|---|
A tuple of RaggedTensors where the first element is the tokens (see tokenize for details), the second element is the start byte offsets of each token, and the third element is the end byte offsets. |
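Since the offsets are byte offsets into the input strings, they can be used to recover the text span behind each wordpiece. A minimal sketch reusing the example above, assuming the standard tf.strings.substr op:
tokens, starts, ends = tokenizer.tokenize_with_offsets(text_inputs)
# Slice the original string using the per-token start offsets and lengths.
tf.strings.substr(text_inputs[0], starts[0], ends[0] - starts[0])
<tf.Tensor: shape=(2,), dtype=string, numpy=array([b'great', b'est'], dtype=object)>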