Splits each string into a sequence of code points with start offsets.
tf.strings.unicode_split_with_offsets(
    input,
    input_encoding,
    errors='replace',
    replacement_char=65533,
    name=None
)
This op is similar to tf.strings.unicode_split(...), but it also returns the
start offset for each character in its respective string. This information
can be used to align the characters with the original byte sequence.
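For instance, a character's start offset and encoded length together identify its byte span in the original string. The following plain-Python sketch (not the TensorFlow implementation, and no TensorFlow required) mirrors what the op computes for a single UTF-8 string:

```python
# Plain-Python sketch of unicode_split_with_offsets for one string:
# each character's UTF-8 bytes, plus its start byte offset.
def split_with_offsets(s: str):
    chars, offsets = [], []
    pos = 0
    for ch in s:
        encoded = ch.encode('utf-8')
        chars.append(encoded)
        offsets.append(pos)
        pos += len(encoded)  # multi-byte characters advance by > 1
    return chars, offsets

data = 'G\xf6\xf6dnight'.encode('utf-8')
chars, offsets = split_with_offsets(data.decode('utf-8'))

# The offsets align each character with the original byte sequence:
for ch, off in zip(chars, offsets):
    assert data[off:off + len(ch)] == ch
```

Note that the offsets are byte offsets, not character indices: the two-byte `ö` characters cause the offsets to skip values (0, 1, 3, 5, ...).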
Returns a tuple (chars, start_offsets) where:
* `chars[i1...iN, j]` is the substring of `input[i1...iN]` that encodes its `j`th character, when decoded using `input_encoding`.
* `start_offsets[i1...iN, j]` is the start byte offset for the `j`th character in `input[i1...iN]`, when decoded using `input_encoding`.
Returns
A tuple of N+1 dimensional tensors `(codepoints, start_offsets)`, as described above.
Example:

>>> input = [s.encode('utf8') for s in (u'G\xf6\xf6dnight', u'\U0001f60a')]
>>> result = tf.strings.unicode_split_with_offsets(input, 'UTF-8')
>>> result[0].to_list()  # character substrings
[[b'G', b'\xc3\xb6', b'\xc3\xb6', b'd', b'n', b'i', b'g', b'h', b't'],
 [b'\xf0\x9f\x98\x8a']]
>>> result[1].to_list()  # offsets
[[0, 1, 3, 5, 6, 7, 8, 9, 10], [0]]
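The `errors` and `replacement_char` arguments govern how structurally invalid input bytes are handled; by default each invalid byte is replaced with code point 65533 (U+FFFD, the Unicode replacement character). Python's own `bytes.decode` offers an analogous `errors='replace'` mode, which this sketch uses to illustrate the default behavior (it is not the TensorFlow op itself):

```python
# 0xFF can never appear in well-formed UTF-8, so this input is invalid.
bad = b'a\xffb'

# errors='replace' substitutes U+FFFD (code point 65533) for each
# invalid byte, analogous to the op's default errors/replacement_char.
decoded = bad.decode('utf-8', errors='replace')
print([ord(c) for c in decoded])  # [97, 65533, 98]
```

Passing `errors='ignore'` would instead drop the bad byte, and `errors='strict'` would raise an error, mirroring the other accepted values of the op's `errors` argument.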