|View source on GitHub|
Decode UTF8 tokens into code points and return their bits.
text.utf8_binarize( tokens, word_length=16, bits_per_char=24, replacement_char=65533, name=None )
See the RetVec paper for details.
code_points = utf8_binarize("hello", word_length=3, bits_per_char=4)
[0. 0. 0. 1. 1. 0. 1. 0. 0. 0. 1. 1.]
The codepoints are encoded bitwise in the little-endian order.
The inner dimension of the output is always
word_length * bits_per_char,
because extra characters are truncated / missing characters are padded,
bits_per_char lowest bits of each codepoint is stored.
Decoding errors (which in applications are often replaced with the character
U+65533 "REPLACEMENT CHARACTER") are represented with
bits_per_char lowest bits.
|A tensor of floating-point zero and one values corresponding to the bits of the token characters' Unicode code points.|