UnicodeTranscode

public final class UnicodeTranscode

Transcode the input text from a source encoding to a destination encoding.

The input is a string tensor of any shape. The output is a string tensor of the same shape containing the transcoded strings. Output strings are always valid Unicode. If the input contains invalid encoding positions, the `errors` attribute sets the policy for handling them. Under the default policy (`replace`), invalid formatting is substituted in the output by the `replacement_char`. If the policy is `ignore`, invalid encoding positions in the input are skipped and not included in the output. If it is set to `strict`, any invalid formatting results in an InvalidArgument error.
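The three policies can be illustrated with a minimal sketch that uses only the JDK's `java.nio.charset` decoder rather than this op. The class `ErrorPolicySketch` and its `decode` helper are hypothetical names introduced for illustration. Mapping `strict`, `replace` and `ignore` onto `CodingErrorAction` is an assumed analogy, not a description of how the op is implemented.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: a JDK-only analog of the `errors` attribute.
final class ErrorPolicySketch {
    static String decode(byte[] utf8, String errors) {
        CodingErrorAction action;
        switch (errors) {
            case "strict":  action = CodingErrorAction.REPORT;  break;
            case "ignore":  action = CodingErrorAction.IGNORE;  break;
            case "replace": action = CodingErrorAction.REPLACE; break;
            default: throw new IllegalArgumentException("unknown policy: " + errors);
        }
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(action)
                .onUnmappableCharacter(action);
        try {
            return decoder.decode(ByteBuffer.wrap(utf8)).toString();
        } catch (CharacterCodingException e) {
            // Only reachable under "strict"; stands in for InvalidArgument.
            throw new IllegalArgumentException("invalid input formatting", e);
        }
    }

    public static void main(String[] args) {
        byte[] bad = {'A', (byte) 0xFF, 'B'}; // 0xFF never occurs in valid UTF-8
        System.out.println(decode(bad, "replace").length()); // 3: A, U+FFFD, B
        System.out.println(decode(bad, "ignore"));           // AB
        try {
            decode(bad, "strict");
        } catch (IllegalArgumentException e) {
            System.out.println("strict policy rejected the input");
        }
    }
}
```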

This operation can be used with `output_encoding = input_encoding` to enforce correct formatting for inputs even if they are already in the desired encoding.
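The same idea, decoding and re-encoding in one encoding to repair malformed input, can be sketched with the JDK alone. `SameEncodingSketch` and `sanitizeUtf8` are hypothetical names. The sketch relies on `new String(bytes, charset)` substituting U+FFFD for malformed input, which only approximates the op's default `replace` policy.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper: UTF-8 in, guaranteed-valid UTF-8 out.
final class SameEncodingSketch {
    static byte[] sanitizeUtf8(byte[] maybeUtf8) {
        // Malformed sequences become U+FFFD during decoding.
        String decoded = new String(maybeUtf8, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] bad = {'A', (byte) 0xFF, 'B'};
        // U+FFFD is EF BF BD in UTF-8, so the repaired string has 5 bytes.
        System.out.println(Arrays.toString(sanitizeUtf8(bad)));
    }
}
```

Valid input passes through unchanged, so the call is safe to apply to data that is already well formed.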

If the input is prefixed by a Byte Order Mark needed to determine encoding (e.g. if the encoding is UTF-16 and the BOM indicates big-endian), then that BOM will be consumed and not emitted into the output. If the input encoding is marked with an explicit endianness (e.g. UTF-16-BE), then the BOM is interpreted as a zero-width non-breaking space (U+FEFF) and is preserved in the output. This is always the case for UTF-8.

The end result is that if the input is marked with an explicit endianness, the transcoding is faithful to all codepoints in the source. If it is not marked with an explicit endianness, the BOM is not considered part of the string itself but metadata, and so is not preserved in the output.
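The JDK's own charsets follow the same BOM convention, which makes for a compact illustration. This is a sketch of the analogous JDK behaviour, not of this op: `BomSketch` is a hypothetical class, and the JDK charset names `UTF_16` and `UTF_16BE` stand in for the op's `"UTF-16"` and `"UTF-16-BE"` by assumption.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo of BOM handling with JDK decoders.
final class BomSketch {
    // Endianness not fixed: a leading BOM selects the byte order and is consumed.
    static String decodeUtf16(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16);
    }

    // Explicit endianness: a leading FE FF is kept as the character U+FEFF.
    static String decodeUtf16Be(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xFE, (byte) 0xFF, 0x00, 0x41}; // BOM + "A"
        System.out.println(decodeUtf16(withBom).length());   // 1
        System.out.println(decodeUtf16Be(withBom).length()); // 2
    }
}
```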

Examples:

>>> tf.strings.unicode_transcode(["Hello", "TensorFlow", "2.x"], "UTF-8", "UTF-16-BE")
>>> tf.strings.unicode_transcode(["A", "B", "C"], "US ASCII", "UTF-8").numpy()
array([b'A', b'B', b'C'], dtype=object)
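For readers working in Java, the byte-level effect of the examples above can be reproduced with a JDK-only sketch. `TranscodeSketch.transcode` is a hypothetical helper that handles a single string rather than a tensor. It does not call this op.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical single-string analog of the transcoding examples.
final class TranscodeSketch {
    static byte[] transcode(byte[] input, Charset from, Charset to) {
        // Decode with the source charset, then encode with the destination charset.
        return new String(input, from).getBytes(to);
    }

    public static void main(String[] args) {
        byte[] hello = "Hello".getBytes(StandardCharsets.UTF_8);
        byte[] utf16be = transcode(hello, StandardCharsets.UTF_8, StandardCharsets.UTF_16BE);
        // Each ASCII letter becomes a zero byte followed by the letter's code.
        System.out.println(Arrays.toString(utf16be));
    }
}
```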

Nested Classes

class UnicodeTranscode.Options Optional attributes for UnicodeTranscode

Constants

String OP_NAME The name of this op, as known by the TensorFlow core engine

Public Methods

Output<TString>	asOutput() Returns the symbolic handle of the tensor.
static UnicodeTranscode	create(Scope scope, Operand<TString> input, String inputEncoding, String outputEncoding, Options... options) Factory method to create a class wrapping a new UnicodeTranscode operation.
static UnicodeTranscode.Options	errors(String errors)
Output<TString>	output() A string tensor containing unicode text encoded using `output_encoding`.
static UnicodeTranscode.Options	replaceControlCharacters(Boolean replaceControlCharacters)
static UnicodeTranscode.Options	replacementChar(Long replacementChar)

Inherited Methods

From class org.tensorflow.op.RawOp

final boolean	equals(Object obj)
final int	hashCode()
Operation	op() Return this unit of computation as a single `Operation`.
final String	toString()

From class java.lang.Object

boolean	equals(Object arg0)
final Class<?>	getClass()
int	hashCode()
final void	notify()
final void	notifyAll()
String	toString()
final void	wait(long arg0, int arg1)
final void	wait(long arg0)
final void	wait()

From interface org.tensorflow.op.Op

abstract ExecutionEnvironment	env() Return the execution environment this op was created in.
abstract Operation	op() Return this unit of computation as a single `Operation`.

From interface org.tensorflow.Operand

abstract Output<TString>	asOutput() Returns the symbolic handle of the tensor.
abstract TString	asTensor() Returns the tensor at this operand.
abstract Shape	shape() Returns the (possibly partially known) shape of the tensor referred to by the `Output` of this operand.
abstract Class<TString>	type() Returns the tensor type of this operand

From interface org.tensorflow.ndarray.Shaped

abstract int	rank()
abstract Shape	shape()
abstract long	size() Computes and returns the total size of this container, in number of values.

Constants

public static final String OP_NAME

The name of this op, as known by the TensorFlow core engine

Constant Value: "UnicodeTranscode"

Public Methods

public Output<TString> asOutput ()

Returns the symbolic handle of the tensor.

Inputs to TensorFlow operations are outputs of another TensorFlow operation. This method is used to obtain a symbolic handle that represents the computation of the input.

public static UnicodeTranscode create (Scope scope, Operand<TString> input, String inputEncoding, String outputEncoding, Options... options)

Factory method to create a class wrapping a new UnicodeTranscode operation.

Parameters

scope	current scope
input	The text to be processed. Can have any shape.
inputEncoding	Text encoding of the input strings. This is any of the encodings supported by ICU ucnv algorithmic converters. Examples: `"UTF-16", "US ASCII", "UTF-8"`.
outputEncoding	The unicode encoding to use in the output. Must be one of `"UTF-8", "UTF-16-BE", "UTF-32-BE"`. Multi-byte encodings will be big-endian.
options	carries optional attribute values

Returns

a new instance of UnicodeTranscode
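To make the three permitted `outputEncoding` values concrete, the following JDK-only sketch encodes the same text in each and shows the big-endian byte order. `OutputEncodingSketch` is a hypothetical class. It assumes the op's names correspond to the JDK charsets `UTF-8`, `UTF-16BE` and `UTF-32BE`, and that the running JDK provides `UTF-32BE`, as OpenJDK does.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical mapping from the op's output encoding names to JDK charsets.
final class OutputEncodingSketch {
    static byte[] encode(String text, String outputEncoding) {
        switch (outputEncoding) {
            case "UTF-8":     return text.getBytes(StandardCharsets.UTF_8);
            case "UTF-16-BE": return text.getBytes(StandardCharsets.UTF_16BE);
            case "UTF-32-BE": return text.getBytes(Charset.forName("UTF-32BE"));
            default:
                throw new IllegalArgumentException("unsupported: " + outputEncoding);
        }
    }

    public static void main(String[] args) {
        // Most significant byte first in the multi-byte encodings.
        System.out.println(Arrays.toString(encode("A", "UTF-8")));     // [65]
        System.out.println(Arrays.toString(encode("A", "UTF-16-BE"))); // [0, 65]
        System.out.println(Arrays.toString(encode("A", "UTF-32-BE"))); // [0, 0, 0, 65]
    }
}
```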

public static UnicodeTranscode.Options errors (String errors)

Parameters

errors	Error handling policy when there is invalid formatting found in the input. The value of 'strict' will cause the operation to produce an InvalidArgument error on any invalid input formatting. A value of 'replace' (the default) will cause the operation to replace any invalid formatting in the input with the `replacement_char` codepoint. A value of 'ignore' will cause the operation to skip any invalid formatting in the input and produce no corresponding output character.


public Output<TString> output ()

A string tensor containing unicode text encoded using `output_encoding`.

public static UnicodeTranscode.Options replaceControlCharacters (Boolean replaceControlCharacters)

Parameters

replaceControlCharacters	Whether to replace the C0 control characters (00-1F) with the `replacement_char`. Default is false.
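The effect of this option can be sketched in plain Java. The class `ControlCharSketch` and the method `replaceC0` are hypothetical, and the sketch takes the stated range 00-1F literally, which means tab and newline are replaced as well. Whether the op treats every code point in that range identically is an assumption.

```java
// Hypothetical illustration of replacing C0 control characters.
final class ControlCharSketch {
    static String replaceC0(String text, int replacementChar) {
        StringBuilder out = new StringBuilder();
        // Iterate by code point so supplementary characters stay intact.
        text.codePoints().forEach(cp ->
                out.appendCodePoint(cp <= 0x1F ? replacementChar : cp));
        return out.toString();
    }

    public static void main(String[] args) {
        // The tab (0x09) is replaced; the space (0x20) is outside the range.
        System.out.println(replaceC0("a\tb c", '?')); // a?b c
    }
}
```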

public static UnicodeTranscode.Options replacementChar (Long replacementChar)

Parameters

replacementChar	The replacement character codepoint to be used in place of any invalid formatting in the input when `errors='replace'`. Any valid unicode codepoint may be used. The default value is the default unicode replacement character, U+FFFD (decimal 65533). Note that for UTF-8, passing a replacement character expressible in 1 byte, such as ' ', will preserve string alignment to the source since invalid bytes will be replaced with a 1-byte replacement. For UTF-16-BE and UTF-16-LE, any 1 or 2 byte replacement character will preserve byte alignment to the source.
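The alignment note can be checked with a small JDK-only sketch. `ReplacementCharSketch.repairUtf8` is a hypothetical helper that approximates the `replace` policy with `CharsetDecoder.replaceWith`. That method accepts only a single-`char` replacement for the UTF-8 decoder, so the sketch is limited to BMP code points, a restriction of the sketch and not of this op.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: repair UTF-8 using a caller-chosen replacement code point.
final class ReplacementCharSketch {
    static byte[] repairUtf8(byte[] input, int replacementChar) {
        // Assumes a BMP code point, so the replacement is one Java char.
        String replacement = new String(Character.toChars(replacementChar));
        try {
            String decoded = StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPLACE)
                    .onUnmappableCharacter(CodingErrorAction.REPLACE)
                    .replaceWith(replacement)
                    .decode(ByteBuffer.wrap(input))
                    .toString();
            return decoded.getBytes(StandardCharsets.UTF_8);
        } catch (CharacterCodingException e) {
            throw new IllegalStateException(e); // not expected under REPLACE
        }
    }

    public static void main(String[] args) {
        byte[] bad = {'A', (byte) 0xFF, 'B'};
        System.out.println(repairUtf8(bad, ' ').length);    // 3, same as the input
        System.out.println(repairUtf8(bad, 0xFFFD).length); // 5, U+FFFD needs 3 bytes
    }
}
```

With a 1-byte replacement the output offsets line up with the input offsets, which is useful when byte positions are later mapped back to the source.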
