text.FastBertNormalizer

Normalizes a tensor of UTF-8 strings.

lower_case_nfd_strip_accents (optional) If true, the normalizer first lowercases the text, applies NFD normalization, strips accent characters, and then replaces control characters with whitespace. If false, it only replaces control characters with whitespace.
model_buffer (optional) A bytes object (or a uint8 tf.Tensor) that contains the fast BERT normalizer model in flatbuffer format (see fast_bert_normalizer_model.fbs). If not None, all other arguments are ignored.
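
The steps listed for lower_case_nfd_strip_accents can be approximated in plain Python with the standard unicodedata module. This is only an illustrative sketch, not the library's implementation: the helper name approx_normalize is hypothetical, and the choices to treat Unicode category Mn as "accents" and category Cc as "control characters" are assumptions that may differ from the real model in edge cases.

```python
import unicodedata


def approx_normalize(text: str, lower_case_nfd_strip_accents: bool = True) -> str:
    """Hypothetical stand-in for the normalization steps described above."""
    if lower_case_nfd_strip_accents:
        # Step 1: lowercase.
        text = text.lower()
        # Step 2: NFD splits an accented letter into base letter + combining mark.
        text = unicodedata.normalize("NFD", text)
        # Step 3: drop the combining marks (assumed here to be category "Mn").
        text = "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
    # Final step (both modes): replace control characters (assumed category "Cc")
    # with a single space.
    return "".join(" " if unicodedata.category(ch) == "Cc" else ch for ch in text)


print(approx_normalize("\xC0bc"))         # accent stripped and lowercased
print(approx_normalize("\xC0bc", False))  # left unchanged: no control characters
```

With the flag set to false, only the control-character replacement runs, so accented and uppercase characters pass through untouched.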

Methods

normalize

Normalizes a tensor of UTF-8 strings.

Example:

texts = [["They're", "the", "Greatest", "\xC0bc"]]
normalizer = FastBertNormalizer(lower_case_nfd_strip_accents=True)
normalizer.normalize(texts)
<tf.RaggedTensor [[b"they're", b'the', b'greatest', b'abc']]>

Args
input An N-dimensional Tensor or RaggedTensor of UTF-8 strings.

Returns
An N-dimensional Tensor or RaggedTensor of UTF-8 strings.

normalize_with_offsets

Normalizes a tensor of UTF-8 strings and returns a map of byte offsets from the output back to the input.

Example:

texts = ["They're", "the", "Greatest", "\xC0bc"]
normalizer = FastBertNormalizer(lower_case_nfd_strip_accents=True)
normalized_text, offsets = (
  normalizer.normalize_with_offsets(texts))
normalized_text
<tf.Tensor: shape=(4,), dtype=string, numpy=array([b"they're", b'the',
b'greatest', b'abc'], dtype=object)>
offsets
<tf.RaggedTensor [[0, 1, 2, 3, 4, 5, 6, 7], [0, 1, 2, 3], [0, 1, 2, 3, 4, 5,
6, 7, 8], [0, 2, 3, 4]]>

Args
input An N-dimensional Tensor or RaggedTensor of UTF-8 strings.

Returns
A tuple (normalized_texts, offsets) where:
normalized_texts is a Tensor or RaggedTensor.
offsets is a RaggedTensor of the byte offsets from the output to the input. For example, if the input is input[i1...iN] with N strings, offsets[i1...iN, k] is the byte offset in input[i1...iN] of the kth byte in normalized_texts[i1...iN]. Note that offsets[i1...iN, ...] has one extra entry for the position following the last byte of normalized_texts[i1...iN], which gives the byte offset in input[i1...iN] that corresponds to the end of normalized_texts[i1...iN].
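
As a sketch of how the offsets might be used, the snippet below maps a byte span in the normalized text back to the input, using the values from the "\xC0bc" example above copied into plain Python lists. The helper name source_span is hypothetical and not part of the library; in real code the offsets would come from normalize_with_offsets.

```python
# Values copied from the example above for the last input string.
input_bytes = "\xC0bc".encode("utf-8")  # b"\xc3\x80bc": the accented letter takes 2 bytes
normalized = b"abc"
offsets = [0, 2, 3, 4]                  # len(normalized) + 1 entries


def source_span(offsets, start, end):
    """Map the half-open byte span [start, end) of the normalized text to the input.

    The extra trailing entry in offsets makes end == len(normalized) valid.
    """
    return offsets[start], offsets[end]


# The single normalized byte b"a" came from the two-byte accented letter.
begin, limit = source_span(offsets, 0, 1)
print(begin, limit, input_bytes[begin:limit].decode("utf-8"))

# The whole normalized string maps to the whole input.
print(source_span(offsets, 0, len(normalized)))
```

The trailing offset is what lets a span that ends at the last normalized byte be mapped without a special case.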