Tokenizer used for BERT, a faster version with TFLite support.
Inherits From: TokenizerWithOffsets, Tokenizer, SplitterWithOffsets, Splitter, Detokenizer
text.FastBertTokenizer(
vocab=None,
suffix_indicator='##',
max_bytes_per_word=100,
token_out_type=dtypes.int64,
unknown_token='[UNK]',
no_pretokenization=False,
support_detokenization=False,
fast_wordpiece_model_buffer=None,
lower_case_nfd_strip_accents=False,
fast_bert_normalizer_model_buffer=None
)
This tokenizer applies end-to-end tokenization, converting text strings into wordpieces. It is equivalent to BertTokenizer for most common scenarios, while running faster and supporting TFLite. It does not support certain special settings (see the docs below).
See WordpieceTokenizer for details on the subword tokenization.
For an example of use, see https://www.tensorflow.org/text/guide/bert_preprocessing_guide
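As a quick, illustrative sketch of the end-to-end behavior (using a toy vocabulary, so the IDs shown depend entirely on that vocabulary), whitespace splitting and wordpiece lookup happen in a single call:
import tensorflow as tf
import tensorflow_text as text
vocab = ['they', "##'", '##re', 'the', 'great', '##est', '[UNK]']
tokenizer = text.FastBertTokenizer(vocab=vocab)
# Each input string is split on whitespace/punctuation, then wordpiece-tokenized.
tokenizer.tokenize(tf.constant(['the greatest', 'greatest']))
<tf.RaggedTensor [[3, 4, 5], [4, 5]]>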
Attributes | |
---|---|
vocab | (optional) The list of tokens in the vocabulary. |
suffix_indicator | (optional) The characters prepended to a wordpiece to indicate that it is a suffix of another subword. |
max_bytes_per_word | (optional) Max size of input token. |
token_out_type | (optional) The type of the token to return. This can be tf.int64 or tf.int32 IDs, or tf.string subwords. |
unknown_token | (optional) The string value to substitute for an unknown token. It must be included in vocab. |
no_pretokenization | (optional) By default, the input is split on whitespace and punctuation before applying the wordpiece tokenization. When true, the input is assumed to be pretokenized already. |
support_detokenization | (optional) Whether to make the tokenizer support detokenization. Setting it to true expands the size of the model flatbuffer. As a reference, when using the 120k multilingual BERT WordPiece vocab, the flatbuffer's size increases from ~5MB to ~6MB. |
fast_wordpiece_model_buffer | (optional) Bytes object (or a uint8 tf.Tensor) that contains the wordpiece model in flatbuffer format (see fast_wordpiece_tokenizer_model.fbs). If not None, all other arguments related to FastWordPieceTokenizer (except token_out_type) are ignored. |
lower_case_nfd_strip_accents | (optional) Whether to lowercase the input, apply NFD normalization, and strip accent characters before wordpiece tokenization. |
fast_bert_normalizer_model_buffer | (optional) Bytes object (or a uint8 tf.Tensor) that contains the fast BERT normalizer model in flatbuffer format (see fast_bert_normalizer_model.fbs). If not None, lower_case_nfd_strip_accents is ignored. |
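Several of these arguments change what tokenize returns. As an illustrative sketch (reusing the toy vocab from above), setting token_out_type=tf.string yields subword strings instead of vocabulary IDs:
tokenizer = text.FastBertTokenizer(vocab=vocab, token_out_type=tf.string)
tokenizer.tokenize(tf.constant(['greatest']))
<tf.RaggedTensor [[b'great', b'##est']]>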
Methods
detokenize
detokenize(
token_ids
)
Convert a Tensor or RaggedTensor of wordpiece IDs to string-words. See WordpieceTokenizer.detokenize for details.
Example:
vocab = ['they', "##'", '##re', 'the', 'great', '##est', '[UNK]']
tokenizer = text.FastBertTokenizer(vocab=vocab, support_detokenization=True)
tokenizer.detokenize([[4, 5]])
<tf.Tensor: shape=(1,), dtype=string, numpy=array([b'greatest'],
dtype=object)>
Args | |
---|---|
token_ids | A RaggedTensor or Tensor with an int dtype. |
Returns | |
---|---|
A RaggedTensor with dtype string and the same rank as the input token_ids. |
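When the tokenizer is constructed with support_detokenization=True, detokenize inverts tokenize. A minimal round-trip sketch using the toy vocab from above:
tokenizer = text.FastBertTokenizer(vocab=vocab, support_detokenization=True)
ids = tokenizer.tokenize(tf.constant(['the greatest']))
tokenizer.detokenize(ids)
<tf.Tensor: shape=(1,), dtype=string, numpy=array([b'the greatest'], dtype=object)>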
split
split(
input
)
Alias for Tokenizer.tokenize.
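Because split is an alias, it returns exactly what tokenize returns; for example, with the toy vocab from above:
tokenizer = text.FastBertTokenizer(vocab=vocab)
tokenizer.split(tf.constant(['greatest']))
<tf.RaggedTensor [[4, 5]]>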
split_with_offsets
split_with_offsets(
input
)
Alias for TokenizerWithOffsets.tokenize_with_offsets.
tokenize
tokenize(
text_input
)
Tokenizes a tensor of UTF-8 strings into subword tokens for BERT.
Example:
vocab = ['they', "##'", '##re', 'the', 'great', '##est', '[UNK]']
tokenizer = text.FastBertTokenizer(vocab=vocab)
text_inputs = tf.constant(['greatest'.encode('utf-8')])
tokenizer.tokenize(text_inputs)
<tf.RaggedTensor [[4, 5]]>
Args | |
---|---|
text_input | A Tensor or RaggedTensor of untokenized UTF-8 strings. |
Returns | |
---|---|
A RaggedTensor of tokens where tokens[i1...iN, j] is the string contents (or ID in the vocab representing that string) of the jth token in input[i1...iN]. |
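Words that the vocabulary cannot cover map to unknown_token. An illustrative sketch with the toy vocab above, where '[UNK]' has ID 6:
tokenizer = text.FastBertTokenizer(vocab=vocab)
tokenizer.tokenize(tf.constant(['greatest ever']))
<tf.RaggedTensor [[4, 5, 6]]>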
tokenize_with_offsets
tokenize_with_offsets(
text_input
)
Tokenizes a tensor of UTF-8 strings into subword tokens for BERT, along with byte offsets into the input.
Example:
vocab = ['they', "##'", '##re', 'the', 'great', '##est', '[UNK]']
tokenizer = text.FastBertTokenizer(vocab=vocab)
text_inputs = tf.constant(['greatest'.encode('utf-8')])
tokenizer.tokenize_with_offsets(text_inputs)
(<tf.RaggedTensor [[4, 5]]>,
<tf.RaggedTensor [[0, 5]]>,
<tf.RaggedTensor [[5, 8]]>)
Args | |
---|---|
text_input | A Tensor or RaggedTensor of untokenized UTF-8 strings. |
Returns | |
---|---|
A tuple of RaggedTensors where the first element is the tokens (see tokenize for details), the second element is the start byte offsets of each token, and the third element is the end byte offsets. |
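Since the offsets are byte offsets into the input strings, they can be used to recover the text span behind each wordpiece. A minimal sketch reusing the example above, assuming the standard tf.strings.substr op:
tokens, starts, ends = tokenizer.tokenize_with_offsets(text_inputs)
# Slice the original string using the per-token start offsets and lengths.
tf.strings.substr(text_inputs[0], starts[0], ends[0] - starts[0])
<tf.Tensor: shape=(2,), dtype=string, numpy=array([b'great', b'est'], dtype=object)>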