Normalizes a tensor of UTF-8 strings.
text.FastBertNormalizer(
lower_case_nfd_strip_accents=False, model_buffer=None
)
Args |
lower_case_nfd_strip_accents
|
(optional).
- If true, it first lowercases the text, applies NFD normalization, strips
accent characters, and then replaces control characters with whitespace.
- If false, it only replaces control characters with whitespace.
|
model_buffer
|
(optional) bytes object (or a uint8 tf.Tensor) that contains
the fast bert normalizer model in flatbuffer format (see
fast_bert_normalizer_model.fbs). If not None, all other arguments are
ignored.
|
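The two modes of lower_case_nfd_strip_accents can be approximated in plain Python for a single string. This is only a sketch of the steps listed above, not the library's implementation: the helper name normalize_sketch is hypothetical, and it assumes that "accent characters" means Unicode nonspacing marks (category Mn) and that "control characters" means category Cc.

```python
import unicodedata


def normalize_sketch(text: str, lower_case_nfd_strip_accents: bool = False) -> str:
    """Approximate the two documented modes on one Python string."""
    if lower_case_nfd_strip_accents:
        # Step 1: lowercase. Step 2: NFD splits letters from their accents.
        text = unicodedata.normalize("NFD", text.lower())
        # Step 3 (assumption): accents are nonspacing marks, category Mn.
        text = "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
    # Both modes (assumption): control characters are category Cc.
    return "".join(" " if unicodedata.category(ch) == "Cc" else ch for ch in text)


lowered = normalize_sketch("\xC0bc", lower_case_nfd_strip_accents=True)
untouched = normalize_sketch("\xC0bc")
cleaned = normalize_sketch("A\tB")
```

With the flag set, "\xC0bc" loses both its case and its accent; without it, only the tab in "A\tB" changes.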
Methods
normalize
normalize(
input
)
Normalizes a tensor of UTF-8 strings.
Example:
>>> import tensorflow_text as text
>>> texts = [["They're", "the", "Greatest", "\xC0bc"]]
>>> normalizer = text.FastBertNormalizer(lower_case_nfd_strip_accents=True)
>>> normalizer.normalize(texts)
<tf.RaggedTensor [[b"they're", b'the', b'greatest', b'abc']]>
Args |
input
|
An N-dimensional Tensor or RaggedTensor of UTF-8 strings.
|
Returns |
An N-dimensional Tensor or RaggedTensor of UTF-8 strings.
|
normalize_with_offsets
normalize_with_offsets(
input
)
Normalizes a tensor of UTF-8 strings and returns offsets map.
Example:
>>> import tensorflow_text as text
>>> texts = ["They're", "the", "Greatest", "\xC0bc"]
>>> normalizer = text.FastBertNormalizer(lower_case_nfd_strip_accents=True)
>>> normalized_text, offsets = (
...     normalizer.normalize_with_offsets(texts))
>>> normalized_text
<tf.Tensor: shape=(4,), dtype=string, numpy=array([b"they're", b'the',
b'greatest', b'abc'], dtype=object)>
>>> offsets
<tf.RaggedTensor [[0, 1, 2, 3, 4, 5, 6, 7], [0, 1, 2, 3], [0, 1, 2, 3, 4, 5,
6, 7, 8], [0, 2, 3, 4]]>
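To illustrate how the offsets in this example can be read, the plain-Python sketch below copies the values for the last string ("\xC0bc") and maps a byte range of the normalized text back to the input bytes. The helper source_span is hypothetical and is not part of the library.

```python
# Values copied from the example above (last string only).
input_bytes = "\xC0bc".encode("utf-8")  # b'\xc3\x80bc', 4 bytes
normalized = b"abc"
offsets = [0, 2, 3, 4]  # one entry per output byte, plus the end position


def source_span(offsets, start, end):
    """Map the half-open output byte range [start, end) to input byte offsets."""
    return offsets[start], offsets[end]


# The first output byte, b"a", came from the two-byte input character.
begin, limit = source_span(offsets, 0, 1)
original = input_bytes[begin:limit].decode("utf-8")

# The whole normalized string maps back to the whole input.
full_span = source_span(offsets, 0, len(normalized))
```

The extra trailing offset is what makes the end of a range addressable: without it, a range ending at the last output byte would have no input position to map to.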
Args |
input
|
An N-dimensional Tensor or RaggedTensor of UTF-8 strings.
|
Returns |
A tuple (normalized_texts, offsets) where:
|
normalized_texts
|
is a Tensor or RaggedTensor.
|
offsets
|
is a RaggedTensor of byte offsets that map the output back to the
input. For example, if the input is input[i1...iN] with N
strings, offsets[i1...iN, k] is the byte offset in input[i1...iN]
of the kth byte in normalized_texts[i1...iN]. Note that
offsets[i1...iN, ...] also covers the position following the last byte
in normalized_texts[i1...iN], so the byte offset in input[i1...iN]
that corresponds to the end of normalized_texts[i1...iN] is also
available.
|
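The offsets described above can be reproduced for a single string with a short pure-Python sketch. It is not the library's implementation: the function name normalize_with_offsets_sketch is hypothetical, it handles only the lower_case_nfd_strip_accents=True mode, and it assumes that every output byte points at the first byte of the input character it came from.

```python
import unicodedata


def normalize_with_offsets_sketch(text: str):
    """Return (normalized_bytes, offsets) for one string."""
    out = bytearray()
    offsets = []
    byte_pos = 0  # byte offset of the current character in the UTF-8 input
    for ch in text:
        # Lowercase and decompose one character at a time.
        mapped = unicodedata.normalize("NFD", ch.lower())
        # Drop accents (assumed: category Mn), blank out controls (assumed: Cc).
        mapped = "".join(
            " " if unicodedata.category(c) == "Cc" else c
            for c in mapped
            if unicodedata.category(c) != "Mn"
        )
        encoded = mapped.encode("utf-8")
        out.extend(encoded)
        # Assumption: each output byte maps to the start of its source character.
        offsets.extend([byte_pos] * len(encoded))
        byte_pos += len(ch.encode("utf-8"))
    # Extra entry: the input position that follows the last output byte.
    offsets.append(byte_pos)
    return bytes(out), offsets


normalized, offsets = normalize_with_offsets_sketch("\xC0bc")
```

For "\xC0bc" the first input character occupies two bytes in UTF-8 but normalizes to the single byte b"a", which is why the second offset jumps from 0 to 2.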