Encodes and decodes strings into integer tensors using UTF-8 encoding.
tff.analytics.heavy_hitters.iblt.UTF8Chunker(
string_max_length: int,
*,
max_chunk_value: Optional[int] = None,
dtype: tf.dtypes.DType = tf.int64
)
Args |
string_max_length
|
Maximum length of the string to encode. Note that this
is measured in bytes, and some Unicode characters take more than 1
byte. If string_max_length is not a multiple of
self._dtype_size_bytes (computed internally), it is rounded up to the
next multiple.
|
max_chunk_value
|
Maximum value in each chunk. Defaults to the maximum
possible value in dtype.
|
dtype
|
tf.dtypes.DType indicating the data type of the output. Must be
either tf.int32 or tf.int64. Defaults to tf.int64.
|
Raises |
ValueError
|
If arguments do not meet expectations.
|
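The rounding applied to string_max_length can be sketched in plain Python. This is only an illustration of rounding up to a multiple: the helper name round_up_to_multiple and the bytes_per_chunk value are assumptions made for the example, not part of the library, and the value the library actually derives for self._dtype_size_bytes may differ.

```python
import math


def round_up_to_multiple(string_max_length: int, bytes_per_chunk: int) -> int:
    """Rounds string_max_length up to the next multiple of bytes_per_chunk.

    Hypothetical helper for illustration only; not a UTF8Chunker method.
    """
    if string_max_length < 1 or bytes_per_chunk < 1:
        raise ValueError("Both arguments must be positive.")
    # Number of chunks needed to hold string_max_length bytes.
    num_chunks = math.ceil(string_max_length / bytes_per_chunk)
    return num_chunks * bytes_per_chunk


# bytes_per_chunk=8 is an assumed value chosen for the example.
effective_length = round_up_to_multiple(10, 8)
num_chunks = effective_length // 8
```

With these assumed values, a requested length of 10 bytes needs two chunks, so the effective maximum length becomes 16 bytes.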
Methods
decode_python
decode_python(
encoded_chunks: np.ndarray
) -> np.ndarray
Decodes encoded_chunks of shape (n, self._num_chunks) to n strings.
Args |
encoded_chunks
|
A np.ndarray of shape (num_strings, self._num_chunks)
and dtype self._dtype.
|
Returns |
A np.ndarray of shape (num_strings,) and type np.string.
|
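To make the shape (num_strings, self._num_chunks) more concrete, here is a minimal pure-Python sketch of packing a byte string into fixed-width integer chunks and unpacking it again. It is not the library's implementation: the function names, the big-endian byte order, the zero padding, and the choice of 4 bytes per chunk are all assumptions made for illustration.

```python
from typing import List


def pack_chunks(data: bytes, max_length: int, bytes_per_chunk: int) -> List[int]:
    """Packs data into max_length // bytes_per_chunk integers (hypothetical helper).

    Assumes max_length is a multiple of bytes_per_chunk.
    """
    # Trim to max_length bytes, then pad with zero bytes to a fixed width.
    padded = data[:max_length].ljust(max_length, b"\x00")
    return [
        int.from_bytes(padded[i:i + bytes_per_chunk], "big")
        for i in range(0, max_length, bytes_per_chunk)
    ]


def unpack_chunks(chunks: List[int], bytes_per_chunk: int) -> bytes:
    """Inverse of pack_chunks: rebuilds the bytes and drops the zero padding."""
    raw = b"".join(c.to_bytes(bytes_per_chunk, "big") for c in chunks)
    return raw.rstrip(b"\x00")


# One string of up to 8 bytes becomes a row of 2 chunks of 4 bytes each.
row = pack_chunks("hello".encode("utf-8"), max_length=8, bytes_per_chunk=4)
restored = unpack_chunks(row, bytes_per_chunk=4)
```

In this sketch each input string yields one row of integers, and stacking the rows for num_strings inputs gives a two-dimensional array analogous to encoded_chunks.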
decode_tensorflow
@tf.function
decode_tensorflow(
encoded_chunks: tf.Tensor
) -> tf.Tensor
Decodes encoded_chunks of shape (n, self._num_chunks) to n strings.
Args |
encoded_chunks
|
A tf.Tensor of shape (num_strings, self._num_chunks)
and dtype self._dtype.
|
encode_tensorflow
encode_tensorflow(
input_strings: tf.Tensor
) -> Tuple[tf.Tensor, tf.Tensor]
Encodes input_strings to tensors.
Args |
input_strings
|
A 1-D tf.Tensor of type tf.string. Denote the shape of
input_strings as (num_strings,).
|
Returns |
A Tuple (encoded_strings, trimmed_input_strings)
- encoded_strings: A tf.Tensor of shape (num_strings, self._num_chunks)
containing the encoded input_strings.
- trimmed_input_strings: A tf.Tensor of shape (num_strings,) containing
input_strings trimmed so that each string is at most self._max_length
bytes long.
Note that a UTF-8 character may take more than one byte, so both the
encoded and trimmed strings could contain characters that are cut in the
middle. The caller needs to be aware of this when decoding these strings,
e.g. decode a byte string s with s.decode('utf-8', 'ignore') to avoid
decoding errors.
|
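The effect of trimming in the middle of a multi-byte character can be reproduced with plain Python byte strings. The sketch below does not call UTF8Chunker; the helper trim_to_bytes and the 4-byte limit are assumptions used only to show why the 'ignore' error handler is recommended.

```python
def trim_to_bytes(text: str, max_length: int) -> bytes:
    """Encodes text as UTF-8 and keeps at most max_length bytes (hypothetical helper)."""
    return text.encode("utf-8")[:max_length]


# "café" is 5 bytes in UTF-8 because "é" takes 2 bytes, so a 4-byte
# limit cuts "é" in half and leaves a dangling lead byte.
trimmed = trim_to_bytes("café", 4)

try:
    trimmed.decode("utf-8")  # Strict decoding rejects the partial character.
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# The 'ignore' handler silently drops the incomplete trailing bytes.
safe_text = trimmed.decode("utf-8", "ignore")
```

Here strict decoding fails on the truncated bytes, while decoding with 'ignore' returns the string without the partial character.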
get_num_chunks
get_num_chunks() -> int