Tokenizes a tensor of UTF-8 strings into words according to labels.
Inherits From: TokenizerWithOffsets, Tokenizer, SplitterWithOffsets, Splitter
text.SplitMergeTokenizer()
Methods
split
split(
input
)
Alias for Tokenizer.tokenize.
split_with_offsets
split_with_offsets(
input
)
Alias for TokenizerWithOffsets.tokenize_with_offsets.
tokenize
tokenize(
input, labels, force_split_at_break_character=True
)
Tokenizes a tensor of UTF-8 strings according to labels.
Example:
    >>> import tensorflow_text as text
    >>> strings = ["HelloMonday", "DearFriday"]
    >>> labels = [[0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1],
    ...           [0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0]]
    >>> tokenizer = text.SplitMergeTokenizer()
    >>> tokenizer.tokenize(strings, labels)
    <tf.RaggedTensor [[b'Hello', b'Monday'], [b'Dear', b'Friday']]>
| Args | |
|---|---|
input
|
An N-dimensional Tensor or RaggedTensor of UTF-8 strings.
|
labels
|
An (N+1)-dimensional Tensor or RaggedTensor of int32, with
labels[i1...iN, j] being the split(0)/merge(1) label of the j-th
character for input[i1...iN]. Here split means starting a new word with
this character, and merge means appending this character to the previous
word.
|
force_split_at_break_character
|
bool indicating whether to force the start of a
new word after an ICU-defined whitespace character. When seeing
one or more ICU-defined whitespace characters:
|
| Returns | |
|---|---|
A RaggedTensor of strings where tokens[i1...iN, j] is the string
content of the j-th token in input[i1...iN].
|
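The split/merge label semantics described above can be illustrated with a minimal pure-Python sketch. This is not the library's implementation: `split_merge` is a hypothetical helper, it treats each Unicode code point as one character, it ignores `force_split_at_break_character` and whitespace handling entirely, and it drops surplus labels by relying on `zip`.

```python
def split_merge(text, labels):
    """Group characters into words using split (0) / merge (1) labels.

    Hypothetical helper for illustration only; not part of tensorflow_text.
    """
    words = []
    for ch, label in zip(text, labels):
        if label == 0 or not words:
            # split: this character starts a new word
            # (the first character always starts one, whatever its label)
            words.append(ch)
        else:
            # merge: append this character to the previous word
            words[-1] += ch
    return words


print(split_merge("HelloMonday", [0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1]))
# ['Hello', 'Monday']
```

Applied to the two strings from the example above, the sketch yields the same word boundaries as the documented output, although it returns Python `str` values rather than a RaggedTensor of byte strings.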
tokenize_with_offsets
tokenize_with_offsets(
input, labels, force_split_at_break_character=True
)
Tokenizes a tensor of UTF-8 strings into tokens with [start,end) offsets.
Example:
    >>> import tensorflow_text as text
    >>> strings = ["HelloMonday", "DearFriday"]
    >>> labels = [[0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1],
    ...           [0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0]]
    >>> tokenizer = text.SplitMergeTokenizer()
    >>> tokens, starts, ends = tokenizer.tokenize_with_offsets(strings, labels)
    >>> tokens
    <tf.RaggedTensor [[b'Hello', b'Monday'], [b'Dear', b'Friday']]>
    >>> starts
    <tf.RaggedTensor [[0, 5], [0, 4]]>
    >>> ends
    <tf.RaggedTensor [[5, 11], [4, 10]]>
| Args | |
|---|---|
input
|
An N-dimensional Tensor or RaggedTensor of UTF-8 strings.
|
labels
|
An (N+1)-dimensional Tensor or RaggedTensor of int32, with
labels[i1...iN, j] being the split(0)/merge(1) label of the j-th
character for input[i1...iN]. Here split means starting a new word with
this character, and merge means appending this character to the previous
word.
|
force_split_at_break_character
|
bool indicating whether to force the start of a
new word after an ICU-defined whitespace character. When seeing
one or more ICU-defined whitespace characters:
|
| Returns | |
|---|---|
A tuple (tokens, start_offsets, end_offsets) where:
|
|
tokens
|
is a RaggedTensor of strings where tokens[i1...iN, j] is
the string content of the j-th token in input[i1...iN].
|
start_offsets
|
is a RaggedTensor of int64s where
start_offsets[i1...iN, j] is the byte offset for the start of the
j-th token in input[i1...iN].
|
end_offsets
|
is a RaggedTensor of int64s where
end_offsets[i1...iN, j] is the byte offset immediately after the
end of the j-th token in input[i1...iN].
|
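Because the offsets are byte offsets into the UTF-8 encoding, they can differ from character indices when the input contains non-ASCII characters. The following pure-Python sketch shows how half-open [start, end) byte offsets could be tracked alongside the split/merge labels. `split_merge_with_offsets` is a hypothetical helper, not the library's implementation; it assumes one label per Unicode code point and ignores `force_split_at_break_character` and whitespace handling.

```python
def split_merge_with_offsets(text, labels):
    """Return (tokens, starts, ends) with [start, end) UTF-8 byte offsets.

    Hypothetical helper for illustration only; not part of tensorflow_text.
    """
    tokens, starts, ends = [], [], []
    pos = 0  # running byte offset into the UTF-8 encoding of text
    for ch, label in zip(text, labels):
        width = len(ch.encode("utf-8"))  # bytes used by this character
        if label == 0 or not tokens:
            # split: start a new token at the current byte offset
            tokens.append(ch)
            starts.append(pos)
            ends.append(pos + width)
        else:
            # merge: extend the previous token and move its end offset
            tokens[-1] += ch
            ends[-1] = pos + width
        pos += width
    return tokens, starts, ends


print(split_merge_with_offsets("HelloMonday", [0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1]))
# (['Hello', 'Monday'], [0, 5], [5, 11])
```

For an input such as "caféau" with labels [0, 1, 1, 1, 0, 1], the sketch gives starts [0, 5] and ends [5, 7], since "é" occupies two bytes in UTF-8.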