text.UnicodeScriptTokenizer

Tokenizes UTF-8 by splitting when there is a change in Unicode script.

Inherits From: TokenizerWithOffsets, Tokenizer, SplitterWithOffsets, Splitter

By default, this tokenizer drops characters matching the whitespace Unicode property (use the keep_whitespace argument to keep them), so in this case the results are similar to those of the WhitespaceTokenizer. Any punctuation gets its own token (since it is in a different script), and any script change in the input string marks the location of a split.

Example:

import tensorflow_text as tf_text

tokenizer = tf_text.UnicodeScriptTokenizer()
tokens = tokenizer.tokenize(["xy.,z de", "fg?h", "abαβ"])
print(tokens.to_list())
[[b'xy', b'.,', b'z', b'de'], [b'fg', b'?', b'h'],
 [b'ab', b'\xce\xb1\xce\xb2']]
tokens = tokenizer.tokenize(u"累計7239人")
print(tokens)
tf.Tensor([b'\xe7\xb4\xaf\xe8\xa8\x88' b'7239' b'\xe4\xba\xba'], shape=(3,),
          dtype=string)

Both the punctuation and the whitespace in the first string have been split, but the punctuation run is present as a token while the whitespace isn't emitted (by default). The third example shows the case of a script change without any whitespace. This results in a split at that boundary point.
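
For comparison, constructing the tokenizer with keep_whitespace=True emits the whitespace runs as tokens as well. A minimal sketch, applying the default splitting behavior described above to the first input:

tokenizer = tf_text.UnicodeScriptTokenizer(keep_whitespace=True)
tokens = tokenizer.tokenize(["xy.,z de"])
print(tokens.to_list())
[[b'xy', b'.,', b'z', b' ', b'de']]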

Args
keep_whitespace A boolean that specifies whether to emit whitespace tokens (default False).

Methods

split

Alias for Tokenizer.tokenize.
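
Because split simply delegates to tokenize, the two methods are interchangeable. A quick sketch, assuming tf_text is imported as in the example above:

print(tf_text.UnicodeScriptTokenizer().split(["xy.,z de"]).to_list())
[[b'xy', b'.,', b'z', b'de']]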

split_with_offsets

Alias for TokenizerWithOffsets.tokenize_with_offsets.

tokenize

Tokenizes UTF-8 by splitting when there is a change in Unicode script.

The strings are split when successive characters change their Unicode script or switch between being whitespace and non-whitespace. The script codes used correspond to International Components for Unicode (ICU) UScriptCode values. See: http://icu-project.org/apiref/icu4c/uscript_8h.html

ICU-defined whitespace characters are dropped, unless the keep_whitespace option was specified at construction time.

Args
input A RaggedTensor or Tensor of UTF-8 strings of any shape.

Returns
A RaggedTensor of tokenized text. The returned shape is the shape of the input tensor with an added ragged dimension for tokens of each string.
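
For example, a rank-1 batch of two strings comes back with one added ragged dimension, while a scalar input (as in the class example above) collapses to a plain Tensor. A sketch of the expected shapes:

tokenizer = tf_text.UnicodeScriptTokenizer()
print(tokenizer.tokenize(["xy.,z de", "fg?h"]).shape)
(2, None)
print(tokenizer.tokenize("fg?h").shape)
(3,)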

tokenize_with_offsets

Tokenizes UTF-8 by splitting when there is a change in Unicode script.

The strings are split when a change in the Unicode script is detected between sequential tokens. The script codes used correspond to International Components for Unicode (ICU) UScriptCode values. See: http://icu-project.org/apiref/icu4c/uscript_8h.html

ICU-defined whitespace characters are dropped, unless the keep_whitespace option was specified at construction time.

Example:

tokenizer = tf_text.UnicodeScriptTokenizer()
tokens = tokenizer.tokenize_with_offsets(["xy.,z de", "abαβ"])
print(tokens[0].to_list())
[[b'xy', b'.,', b'z', b'de'], [b'ab', b'\xce\xb1\xce\xb2']]
print(tokens[1].to_list())
[[0, 2, 4, 6], [0, 2]]
print(tokens[2].to_list())
[[2, 4, 5, 8], [2, 6]]
tokens = tokenizer.tokenize_with_offsets(u"累計7239人")
print(tokens[0])
tf.Tensor([b'\xe7\xb4\xaf\xe8\xa8\x88' b'7239' b'\xe4\xba\xba'],
    shape=(3,), dtype=string)
print(tokens[1])
tf.Tensor([ 0  6 10], shape=(3,), dtype=int64)
print(tokens[2])
tf.Tensor([ 6 10 13], shape=(3,), dtype=int64)

The start_offsets and end_offsets are in byte indices of the original string. When calling with multiple string inputs, the offset indices will be relative to the individual source strings.
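
Because the offsets count bytes, each token can be recovered from the source string with tf.strings.substr, which indexes by byte by default. A sketch reusing the scalar example above; the recovered tensor should equal the tokens tensor:

import tensorflow as tf

tokens, starts, ends = tokenizer.tokenize_with_offsets(u"累計7239人")
recovered = tf.strings.substr(u"累計7239人", starts, ends - starts)
print(recovered)
tf.Tensor([b'\xe7\xb4\xaf\xe8\xa8\x88' b'7239' b'\xe4\xba\xba'], shape=(3,), dtype=string)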

Args
input A RaggedTensor or Tensor of UTF-8 strings of any shape.

Returns
A tuple (tokens, start_offsets, end_offsets) where:

  • tokens: A RaggedTensor of tokenized text.
  • start_offsets: A RaggedTensor of the tokens' starting byte offsets.
  • end_offsets: A RaggedTensor of the tokens' ending byte offsets.