Tokenizes a tensor of UTF-8 string tokens into subword pieces.
Inherits From: `TokenizerWithOffsets`, `Tokenizer`, `SplitterWithOffsets`, `Splitter`, `Detokenizer`
text.FastWordpieceTokenizer(
    vocab=None,
    suffix_indicator='##',
    max_bytes_per_word=100,
    token_out_type=dtypes.int64,
    unknown_token='[UNK]',
    no_pretokenization=False,
    support_detokenization=False,
    model_buffer=None
)
It employs the linear (as opposed to quadratic) WordPiece algorithm (see the paper).
Differences compared to the classic WordpieceTokenizer are as follows (as of 11/2021):
* `unknown_token` cannot be None or empty. That means if a word is too long or cannot be tokenized, FastWordpieceTokenizer always returns `unknown_token`. In contrast, the original WordpieceTokenizer would return the original word if `unknown_token` is empty or None.
* `unknown_token` must be included in the vocabulary.
* When `unknown_token` is returned, in tokenize_with_offsets(), the result end_offset is set to be the length of the original input word. In contrast, when `unknown_token` is returned by the original WordpieceTokenizer, the end_offset is set to be the length of the `unknown_token` string (see the sketch after this list).
* `split_unknown_characters` is not supported.
* `max_chars_per_token` is not used or needed.
* By default the input is assumed to be general text (i.e., sentences), and FastWordpieceTokenizer first splits it on whitespace and punctuation and then applies the Wordpiece tokenization (see the parameter `no_pretokenization`). If the input already contains single words only, please set `no_pretokenization=True` to be consistent with the classic WordpieceTokenizer.
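A minimal sketch of the unknown-token behavior above (illustrative vocab; `tf` is TensorFlow and `FastWordpieceTokenizer` is this class, as in the method examples further down; exact tensor formatting may differ). The word "greatest" cannot be tokenized with this vocab, so it maps to `[UNK]`, and its end offset is the byte length of the original word (8), not the length of the `[UNK]` string:
>>> vocab = ["they", "##'", "##re", "[UNK]"]
>>> tokenizer = FastWordpieceTokenizer(
...     vocab, token_out_type=tf.string, no_pretokenization=True)
>>> tokens, starts, ends = tokenizer.tokenize_with_offsets(
...     ["they're", "greatest"])
>>> tokens
<tf.RaggedTensor [[b'they', b"##'", b'##re'], [b'[UNK]']]>
>>> ends
<tf.RaggedTensor [[4, 5, 7], [8]]>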
Args | |
---|---|
`vocab` | (optional) The list of tokens in the vocabulary.
`suffix_indicator` | (optional) The characters prepended to a wordpiece to indicate that it is a suffix to another subword.
`max_bytes_per_word` | (optional) Max size of an input token.
`token_out_type` | (optional) The type of the token to return. This can be `tf.int64` or `tf.int32` IDs, or `tf.string` subwords.
`unknown_token` | (optional) The string value to substitute for an unknown token. It must be included in `vocab`.
`no_pretokenization` | (optional) By default, the input is split on whitespace and punctuation before applying the Wordpiece tokenization. When true, the input is assumed to be pretokenized already.
`support_detokenization` | (optional) Whether to make the tokenizer support detokenization. Setting it to true expands the size of the model flatbuffer. As a reference, when using the 120k multilingual BERT WordPiece vocab, the flatbuffer's size increases from ~5MB to ~6MB.
`model_buffer` | (optional) Bytes object (or a uint8 `tf.Tensor`) that contains the wordpiece model in flatbuffer format (see fast_wordpiece_tokenizer_model.fbs). If not None, all other arguments (except `token_out_type`) are ignored.
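As a hedged illustration of `max_bytes_per_word` (illustrative vocab and limit, not from the original documentation): per the notes above, a word whose byte length exceeds the limit is mapped to `unknown_token` even if it could otherwise be tokenized.
>>> vocab = ["ab", "##c", "[UNK]"]
>>> tokenizer = FastWordpieceTokenizer(
...     vocab, max_bytes_per_word=4, token_out_type=tf.string,
...     no_pretokenization=True)
>>> tokenizer.tokenize(["abc", "abccc"])  # "abccc" is 5 bytes > 4
<tf.RaggedTensor [[b'ab', b'##c'], [b'[UNK]']]>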
Methods
detokenize
detokenize(
input
)
Detokenizes a tensor of int64 or int32 subword ids into sentences.
Tokenizing and then detokenizing an input string returns the original string when the input string is normalized and the tokenized wordpieces don't contain `<unk>`.
Example:
>>> vocab = ["they", "##'", "##re", "the", "great", "##est", "[UNK]",
... "'", "re", "ok"]
>>> tokenizer = FastWordpieceTokenizer(vocab, support_detokenization=True)
>>> ids = tf.ragged.constant([[0, 1, 2, 3, 4, 5], [9]])
>>> tokenizer.detokenize(ids)
<tf.Tensor: shape=(2,), dtype=string,
... numpy=array([b"they're the greatest", b'ok'], dtype=object)>
>>> ragged_ids = tf.ragged.constant([[[0, 1, 2, 3, 4, 5], [9]], [[4, 5]]])
>>> tokenizer.detokenize(ragged_ids)
<tf.RaggedTensor [[b"they're the greatest", b'ok'], [b'greatest']]>
Args | |
---|---|
`input` | An N-dimensional `Tensor` or `RaggedTensor` of int64 or int32.

Returns | |
---|---|
A `RaggedTensor` of sentences that has N - 1 dimension when N > 1. Otherwise, a string tensor.
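A minimal round-trip sketch of the property above, with an illustrative vocab, a single pretokenized word, and `support_detokenization=True` (exact tensor formatting may differ):
>>> vocab = ["they", "##'", "##re", "[UNK]"]
>>> tokenizer = FastWordpieceTokenizer(
...     vocab, support_detokenization=True, no_pretokenization=True)
>>> ids = tokenizer.tokenize(["they're"])
>>> ids
<tf.RaggedTensor [[0, 1, 2]]>
>>> tokenizer.detokenize(ids)
<tf.Tensor: shape=(1,), dtype=string, numpy=array([b"they're"], dtype=object)>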
split
split(
input
)
Alias for `Tokenizer.tokenize`.
split_with_offsets
split_with_offsets(
input
)
Alias for `TokenizerWithOffsets.tokenize_with_offsets`.
tokenize
tokenize(
input
)
Tokenizes a tensor of UTF-8 string tokens further into subword tokens.
Example 1, single word tokenization:
>>> vocab = ["they", "##'", "##re", "the", "great", "##est", "[UNK]"]
>>> tokenizer = FastWordpieceTokenizer(vocab, token_out_type=tf.string,
... no_pretokenization=True)
>>> tokens = [["they're", "the", "greatest"]]
>>> tokenizer.tokenize(tokens)
<tf.RaggedTensor [[[b'they', b"##'", b'##re'], [b'the'],
[b'great', b'##est']]]>
Example 2, general text tokenization (pre-tokenization on
punctuation and whitespace followed by WordPiece tokenization):
>>> vocab = ["they", "##'", "##re", "the", "great", "##est", "[UNK]",
... "'", "re"]
>>> tokenizer = FastWordpieceTokenizer(vocab, token_out_type=tf.string)
>>> tokens = [["they're the greatest", "the greatest"]]
>>> tokenizer.tokenize(tokens)
<tf.RaggedTensor [[[b'they', b"'", b're', b'the', b'great', b'##est'],
[b'the', b'great', b'##est']]]>
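For comparison, a minimal sketch of the single-word example with the default `token_out_type` (`tf.int64`), which returns vocabulary indices instead of subword strings (exact tensor formatting may differ):
>>> vocab = ["they", "##'", "##re", "the", "great", "##est", "[UNK]"]
>>> tokenizer = FastWordpieceTokenizer(vocab, no_pretokenization=True)
>>> tokenizer.tokenize([["they're", "the", "greatest"]])
<tf.RaggedTensor [[[0, 1, 2], [3], [4, 5]]]>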
Args | |
---|---|
`input` | An N-dimensional `Tensor` or `RaggedTensor` of UTF-8 strings.

Returns | |
---|---|
A `RaggedTensor` of tokens where `tokens[i, j]` is the j-th token (i.e., wordpiece) for `input[i]` (i.e., the i-th input word). This token is either the actual token string content, or the corresponding integer id, i.e., the index of that token string in the vocabulary. This choice is controlled by the `token_out_type` parameter passed to the initializer method.
tokenize_with_offsets
tokenize_with_offsets(
input
)
Tokenizes a tensor of UTF-8 string tokens further into subword tokens.
Example 1, single word tokenization:
>>> vocab = ["they", "##'", "##re", "the", "great", "##est", "[UNK]"]
>>> tokenizer = FastWordpieceTokenizer(vocab, token_out_type=tf.string,
... no_pretokenization=True)
>>> tokens = [["they're", "the", "greatest"]]
>>> subtokens, starts, ends = tokenizer.tokenize_with_offsets(tokens)
>>> subtokens
<tf.RaggedTensor [[[b'they', b"##'", b'##re'], [b'the'],
[b'great', b'##est']]]>
>>> starts
<tf.RaggedTensor [[[0, 4, 5], [0], [0, 5]]]>
>>> ends
<tf.RaggedTensor [[[4, 5, 7], [3], [5, 8]]]>
Example 2, general text tokenization (pre-tokenization on
punctuation and whitespace followed by WordPiece tokenization):
>>> vocab = ["they", "##'", "##re", "the", "great", "##est", "[UNK]",
... "'", "re"]
>>> tokenizer = FastWordpieceTokenizer(vocab, token_out_type=tf.string)
>>> tokens = [["they're the greatest", "the greatest"]]
>>> subtokens, starts, ends = tokenizer.tokenize_with_offsets(tokens)
>>> subtokens
<tf.RaggedTensor [[[b'they', b"'", b're', b'the', b'great', b'##est'],
[b'the', b'great', b'##est']]]>
>>> starts
<tf.RaggedTensor [[[0, 4, 5, 8, 12, 17], [0, 4, 9]]]>
>>> ends
<tf.RaggedTensor [[[4, 5, 7, 11, 17, 20], [3, 9, 12]]]>
Args | |
---|---|
`input` | An N-dimensional `Tensor` or `RaggedTensor` of UTF-8 strings.

Returns | |
---|---|
A tuple `(tokens, start_offsets, end_offsets)` where:
`tokens` | is a `RaggedTensor`, where `tokens[i, j]` is the j-th token (i.e., wordpiece) for `input[i]` (i.e., the i-th input word). This token is either the actual token string content, or the corresponding integer id, i.e., the index of that token string in the vocabulary. This choice is controlled by the `token_out_type` parameter passed to the initializer method.
`start_offsets[i1...iN, j]` | is a `RaggedTensor` of the byte offsets for the inclusive start of the j-th token in `input[i1...iN]`.
`end_offsets[i1...iN, j]` | is a `RaggedTensor` of the byte offsets for the exclusive end of the j-th token in `input[i1...iN]` (exclusive, i.e., the first byte after the end of the token).
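Because the offsets index into the UTF-8 bytes of the input, they can be used to recover the surface text covered by each wordpiece. A minimal sketch reusing the tokenizer and sentence from Example 2 above (note that the suffix piece `##est` covers only the bytes `est`; exact output formatting may differ):
>>> sentence = "they're the greatest"
>>> tokens, starts, ends = tokenizer.tokenize_with_offsets([sentence])
>>> [sentence.encode("utf-8")[s:e]
...  for s, e in zip(starts[0].numpy(), ends[0].numpy())]
[b'they', b"'", b're', b'the', b'great', b'est']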