Decode UTF8 tokens into code points and return their bits.
text.utf8_binarize(
tokens, word_length=16, bits_per_char=24, replacement_char=65533, name=None
)
See the RetVec paper for details.
Example:

>>> code_points = utf8_binarize("hello", word_length=3, bits_per_char=4)
>>> print(code_points.numpy())
[0. 0. 0. 1. 1. 0. 1. 0. 0. 0. 1. 1.]
The code points are encoded bitwise in little-endian order. The inner dimension of the output is always `word_length * bits_per_char`, because extra characters are truncated, missing characters are padded, and the `bits_per_char` lowest bits of each code point are stored.
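The truncation, padding, and little-endian bit layout described above can be sketched in plain Python. This is a hypothetical reference implementation for a single string token, not the library's actual op, and the function name `utf8_binarize_ref` is made up for illustration:

```python
def utf8_binarize_ref(token, word_length, bits_per_char):
    """Reference sketch: lowest bits of each code point, little-endian."""
    # Truncate to word_length characters, then pad with code point 0.
    code_points = [ord(c) for c in token[:word_length]]
    code_points += [0] * (word_length - len(code_points))
    bits = []
    for cp in code_points:
        # Emit the bits_per_char lowest bits, least-significant bit first.
        bits.extend(float((cp >> i) & 1) for i in range(bits_per_char))
    return bits

print(utf8_binarize_ref("hello", word_length=3, bits_per_char=4))
# [0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0]
```

The output matches the example above: "hello" is truncated to "hel", and each of the three code points contributes its 4 lowest bits.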
Decoding errors (which in applications are often replaced with the character U+FFFD "REPLACEMENT CHARACTER", code point 65533) are represented with the `bits_per_char` lowest bits of `replacement_char`.
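As a sketch of how such errors arise and what bits they would contribute (assuming the default `replacement_char=65533` and, for brevity, 4 bits per character; the helper `char_bits` is hypothetical):

```python
def char_bits(code_point, bits_per_char):
    # Lowest bits of the code point, least-significant bit first.
    return [float((code_point >> i) & 1) for i in range(bits_per_char)]

# An invalid byte sequence decodes to U+FFFD under errors="replace",
# which is the same code point as the default replacement_char (65533).
token = b"\xff".decode("utf-8", errors="replace")
print(ord(token))                 # 65533
print(char_bits(ord(token), 4))   # [1.0, 0.0, 1.0, 1.0]
```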
Returns |
---|
A tensor of floating-point zero and one values corresponding to the bits of the token characters' Unicode code points, of shape `[<shape of tokens>, word_length * bits_per_char]`. |
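As a quick check of the output shape, a minimal sketch in plain Python, assuming the documented defaults and a hypothetical batch of tokens with shape `[2, 3]`:

```python
# Defaults from the signature above.
word_length, bits_per_char = 16, 24

# Hypothetical tokens input: 2 examples, 3 tokens each.
tokens_shape = [2, 3]

# The output appends one inner dimension of word_length * bits_per_char.
output_shape = tokens_shape + [word_length * bits_per_char]
print(output_shape)  # [2, 3, 384]
```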