Transcode the input text from a source encoding to a destination encoding.
tf.strings.unicode_transcode(
    input: Annotated[Any, _atypes.String],
    input_encoding: str,
    output_encoding: str,
    errors: str = 'replace',
    replacement_char: int = 65533,
    replace_control_characters: bool = False,
    name=None
) -> Annotated[Any, _atypes.String]
The input is a string tensor of any shape. The output is a string tensor of
the same shape containing the transcoded strings. Output strings are always
valid unicode. If the input contains invalid encoding positions, the errors
attribute sets the policy for how to deal with them. If the default
error-handling policy ('replace') is used, invalid formatting is substituted
in the output by the replacement_char (65533, i.e. U+FFFD, unless overridden).
If the errors policy is 'ignore', any invalid encoding positions in the input
are skipped and not included in the output. If it is set to 'strict', any
invalid formatting results in an InvalidArgument error.
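The three error policies can be compared on a single malformed input. The following is a minimal sketch, assuming eager execution in TensorFlow 2.x; the input bytes and variable names are illustrative only.

```python
import tensorflow as tf

# b"a\xffb" is not valid UTF-8: the byte 0xFF never appears in UTF-8.
bad = tf.constant([b"a\xffb"])

# Default policy ('replace'): the invalid byte becomes replacement_char.
replaced = tf.strings.unicode_transcode(bad, "UTF-8", "UTF-8")

# replacement_char is given as a code point; 63 is "?".
question = tf.strings.unicode_transcode(
    bad, "UTF-8", "UTF-8", replacement_char=63)

# 'ignore': the invalid byte is dropped from the output.
ignored = tf.strings.unicode_transcode(
    bad, "UTF-8", "UTF-8", errors="ignore")

# 'strict': the op raises instead of repairing the input.
try:
    tf.strings.unicode_transcode(bad, "UTF-8", "UTF-8", errors="strict")
    strict_failed = False
except tf.errors.InvalidArgumentError:
    strict_failed = True

print(replaced.numpy(), question.numpy(), ignored.numpy(), strict_failed)
```

Using the same encoding on both sides keeps the focus on error handling: any difference between input and output comes from the policy, not from the conversion.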
This operation can be used with output_encoding = input_encoding to enforce
correct formatting for inputs even if they are already in the desired encoding.
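As a sketch of this same-encoding use, the snippet below wraps the op in a small helper that drops invalid bytes from UTF-8 input. The name sanitize_utf8 is hypothetical and not part of TensorFlow; the choice of errors="ignore" is one option among the policies described above.

```python
import tensorflow as tf

def sanitize_utf8(strings):
    """Hypothetical helper: UTF-8 to UTF-8 transcode that drops invalid bytes."""
    return tf.strings.unicode_transcode(
        strings, "UTF-8", "UTF-8", errors="ignore")

raw = tf.constant([b"valid", b"\xffstray", b"mid\xfedle"])
clean = sanitize_utf8(raw)

# Every output string should now decode as UTF-8 without raising.
decoded = [s.decode("utf-8") for s in clean.numpy()]
print(decoded)
```

Swapping errors="ignore" for errors="strict" turns the same helper into a validator that fails on the first malformed string rather than repairing it.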
If the input is prefixed by a Byte Order Mark (BOM) that is needed to determine the encoding (e.g. if the encoding is UTF-16 and the BOM indicates big-endian), then that BOM is consumed and not emitted into the output. If the input encoding is marked with an explicit endianness (e.g. UTF-16-BE), then the BOM is interpreted as a zero-width non-breaking space (U+FEFF) and is preserved in the output. This is always the case for UTF-8.
The end result is that if the input is marked with an explicit endianness, the transcoding is faithful to all codepoints in the source. If it is not marked with an explicit endianness, the BOM is treated as metadata rather than as part of the string itself, and so is not preserved in the output.
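The BOM behavior can be observed by transcoding the same bytes under two input encodings. This is a minimal sketch, assuming eager execution; the input is a big-endian BOM followed by the letter "A", and the variable names are illustrative.

```python
import tensorflow as tf

# Big-endian BOM (0xFE 0xFF) followed by "A" encoded as UTF-16.
with_bom = tf.constant([b"\xfe\xff\x00A"])

# "UTF-16" leaves the byte order open, so the BOM is read as metadata.
implicit = tf.strings.unicode_transcode(with_bom, "UTF-16", "UTF-8")

# "UTF-16-BE" fixes the byte order, so U+FEFF is an ordinary code point.
explicit = tf.strings.unicode_transcode(with_bom, "UTF-16-BE", "UTF-8")

print(implicit.numpy(), explicit.numpy())
```

Following the description above, the first result should contain only the letter, while the second should keep U+FEFF at the front, encoded in UTF-8 as the bytes EF BB BF.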
Examples:
>>> tf.strings.unicode_transcode(["Hello", "TensorFlow", "2.x"], "UTF-8", "UTF-16-BE")
<tf.Tensor: shape=(3,), dtype=string, numpy=
array([b'\x00H\x00e\x00l\x00l\x00o',
       b'\x00T\x00e\x00n\x00s\x00o\x00r\x00F\x00l\x00o\x00w',
       b'\x002\x00.\x00x'], dtype=object)>
>>> tf.strings.unicode_transcode(["A", "B", "C"], "US ASCII", "UTF-8").numpy()
array([b'A', b'B', b'C'], dtype=object)
Returns |
---|
A Tensor of type string. |