tff.analytics.heavy_hitters.iblt.UTF8Chunker

Encodes and decodes strings into integer tensors using UTF-8 encoding.

Args
string_max_length Maximum length of the string to encode, measured in bytes. Some Unicode characters take more than one byte. If string_max_length is not a multiple of self._dtype_size_bytes, it is rounded up to the next multiple.
max_chunk_value Maximum value in each chunk. Defaults to the maximum possible value in dtype.
dtype tf.dtypes.DType indicating the data type of the output. Must be either tf.int32 or tf.int64. Defaults to tf.int64.

Raises
ValueError If arguments do not meet expectations.
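
The rounding of string_max_length can be sketched in plain Python. This is a minimal illustration of the arithmetic only, not the library's code: `padded_length` and `bytes_per_chunk` are hypothetical names, with `bytes_per_chunk` standing in for self._dtype_size_bytes, whose actual value depends on dtype and max_chunk_value.

```python
import math


def padded_length(string_max_length, bytes_per_chunk):
    """Return (num_chunks, rounded_length) for a byte budget.

    bytes_per_chunk is a hypothetical stand-in for self._dtype_size_bytes.
    """
    # Number of fixed-width chunks needed to hold string_max_length bytes.
    num_chunks = math.ceil(string_max_length / bytes_per_chunk)
    # The effective maximum length is the next multiple of bytes_per_chunk.
    return num_chunks, num_chunks * bytes_per_chunk


print(padded_length(10, 7))  # (2, 14)
print(padded_length(14, 7))  # (2, 14)
```

Under these assumptions, a 10-byte limit with 7 bytes per chunk behaves like a 14-byte limit spread over two chunks.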

Methods

decode_python

Decodes encoded_chunks of shape (n, self._num_chunks) to n strings.

Args
encoded_chunks A np.ndarray of shape (num_strings, self._num_chunks) and self._dtype.

Returns
A np.ndarray of shape (num_strings,) and type np.string.

decode_tensorflow

Decodes encoded_chunks of shape (n, self._num_chunks) to n strings.

Args
encoded_chunks A tf.Tensor of shape (num_strings, self._num_chunks) and self._dtype.

Returns
A tf.Tensor of shape (num_strings,) and type tf.string.
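
Conceptually, the decode methods reverse a packing of string bytes into a fixed number of integers per string. The sketch below shows one possible layout in plain Python. It is an assumption-laden illustration, not the library's implementation: `pack_chunks`, `unpack_chunks`, the big-endian byte order, and the zero padding are all hypothetical choices made for the example.

```python
def pack_chunks(data, num_chunks, bytes_per_chunk):
    """Pack a byte string into num_chunks integers (illustrative layout only)."""
    max_length = num_chunks * bytes_per_chunk
    # Trim to the maximum length, then zero-pad to fill the last chunk.
    padded = data[:max_length].ljust(max_length, b"\x00")
    return [
        int.from_bytes(padded[i:i + bytes_per_chunk], "big")
        for i in range(0, max_length, bytes_per_chunk)
    ]


def unpack_chunks(chunks, bytes_per_chunk):
    """Invert pack_chunks and drop the trailing zero padding."""
    raw = b"".join(c.to_bytes(bytes_per_chunk, "big") for c in chunks)
    return raw.rstrip(b"\x00")


original = "héllo".encode("utf-8")  # 6 bytes: 'é' takes two
chunks = pack_chunks(original, num_chunks=2, bytes_per_chunk=4)
assert unpack_chunks(chunks, bytes_per_chunk=4) == original
```

In this sketch each string maps to one row of num_chunks integers, which matches the (num_strings, self._num_chunks) shape the decode methods accept. Note that this toy layout cannot distinguish padding from genuine trailing NUL bytes.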

encode_tensorflow

Encodes input_strings to tensors.

Args
input_strings A 1-D tf.Tensor of type tf.string. Denote the shape of input_strings as (num_strings,).

Returns
A tuple (encoded_strings, trimmed_input_strings):

  • encoded_strings: A tf.Tensor of shape (num_strings, self._num_chunks) containing the encoded input_strings.
  • trimmed_input_strings: A tf.Tensor of shape (num_strings,) containing input_strings trimmed so that each string is at most self._max_length bytes long. Because a UTF-8 character may take more than one byte, both the encoded and the trimmed strings can contain a character that is cut in the middle. The caller needs to account for this when decoding these strings, e.g. by decoding a byte string s with s.decode('utf-8', 'ignore') to avoid decoding errors.
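
The effect of trimming in the middle of a multi-byte character can be reproduced with plain Python byte strings. This is a small sketch that does not call the library; the sample text and the 3-byte limit are arbitrary choices for illustration.

```python
text = "naïve"
raw = text.encode("utf-8")   # 6 bytes, because 'ï' is encoded as two bytes
trimmed = raw[:3]            # cuts the two-byte 'ï' in half

try:
    trimmed.decode("utf-8")  # strict decoding rejects the dangling byte
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# The 'ignore' error handler silently drops the incomplete character.
decoded = trimmed.decode("utf-8", "ignore")
print(strict_ok, decoded)    # False na
```

With 'ignore', the partial character is dropped rather than raising, so the decoded string can be shorter than the byte limit suggests.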

get_num_chunks
