tf.data.experimental.bucket_by_sequence_length
A transformation that buckets elements in a Dataset
by length. (deprecated)
tf.data.experimental.bucket_by_sequence_length(
    element_length_func,
    bucket_boundaries,
    bucket_batch_sizes,
    padded_shapes=None,
    padding_values=None,
    pad_to_bucket_boundary=False,
    no_padding=False,
    drop_remainder=False
)
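Deprecated: This function is deprecated and will be removed in a future version. Use tf.data.Dataset.bucket_by_sequence_length(...) instead. A minimal migration sketch (assuming TF 2.6 or later, where the Dataset method accepts the same arguments), given a dataset of variable-length sequences:

dataset = dataset.bucket_by_sequence_length(
    element_length_func=lambda elem: tf.shape(elem)[0],
    bucket_boundaries=[3, 5],
    bucket_batch_sizes=[2, 2, 2])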
Elements of the Dataset are grouped together by length and then padded and batched.

This is useful for sequence tasks in which the elements have variable length. Grouping together elements that have similar lengths reduces the total fraction of padding in a batch, which increases training step efficiency.
The following example buckets the input data into the three buckets "[0, 3), [3, 5), [5, inf)" based on sequence length, with batch size 2.
elements = [
    [0], [1, 2, 3, 4], [5, 6, 7],
    [7, 8, 9, 10, 11], [13, 14, 15, 16, 19, 20], [21, 22]]

dataset = tf.data.Dataset.from_generator(
    lambda: elements, tf.int64, output_shapes=[None])

dataset = dataset.apply(
    tf.data.experimental.bucket_by_sequence_length(
        element_length_func=lambda elem: tf.shape(elem)[0],
        bucket_boundaries=[3, 5],
        bucket_batch_sizes=[2, 2, 2]))

for elem in dataset.as_numpy_iterator():
    print(elem)
[[1 2 3 4]
[5 6 7 0]]
[[ 7 8 9 10 11 0]
[13 14 15 16 19 20]]
[[ 0 0]
[21 22]]
You can also pad each batch out to its bucket boundary and specify the value used for padding. The following example uses -1 as the padding value and shows the input data being bucketed into the two buckets "[0, 3], [4, 6]".
elements = [
    [0], [1, 2, 3, 4], [5, 6, 7],
    [7, 8, 9, 10, 11], [13, 14, 15, 16, 19, 20], [21, 22]]

dataset = tf.data.Dataset.from_generator(
    lambda: elements, tf.int32, output_shapes=[None])

dataset = dataset.apply(
    tf.data.experimental.bucket_by_sequence_length(
        element_length_func=lambda elem: tf.shape(elem)[0],
        bucket_boundaries=[4, 7],
        bucket_batch_sizes=[2, 2, 2],
        pad_to_bucket_boundary=True,
        padding_values=-1))

for elem in dataset.as_numpy_iterator():
    print(elem)
[[ 0 -1 -1]
[ 5 6 7]]
[[ 1 2 3 4 -1 -1]
[ 7 8 9 10 11 -1]]
[[21 22 -1]]
[[13 14 15 16 19 20]]
When using the pad_to_bucket_boundary option, it is not always possible to maintain the bucket batch size. You can drop the batches that do not reach the bucket batch size by using the drop_remainder option. Using the same input data as in the example above, you get the following result.
elements = [
    [0], [1, 2, 3, 4], [5, 6, 7],
    [7, 8, 9, 10, 11], [13, 14, 15, 16, 19, 20], [21, 22]]

dataset = tf.data.Dataset.from_generator(
    lambda: elements, tf.int32, output_shapes=[None])

dataset = dataset.apply(
    tf.data.experimental.bucket_by_sequence_length(
        element_length_func=lambda elem: tf.shape(elem)[0],
        bucket_boundaries=[4, 7],
        bucket_batch_sizes=[2, 2, 2],
        pad_to_bucket_boundary=True,
        padding_values=-1,
        drop_remainder=True))

for elem in dataset.as_numpy_iterator():
    print(elem)
[[ 0 -1 -1]
[ 5 6 7]]
[[ 1 2 3 4 -1 -1]
[ 7 8 9 10 11 -1]]
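In practice, elements often carry more than one feature, for example a token sequence paired with a label. Below is a minimal sketch with hypothetical data; it assumes that, for tuple-structured elements, element_length_func receives the components as separate arguments (as functions passed to tf.data.Dataset.map do), and it relies on the default padding value of 0.

pairs = [([1, 2, 3], 0), ([4, 5], 1), ([6, 7, 8, 9], 0), ([10], 1)]

dataset = tf.data.Dataset.from_generator(
    lambda: pairs, (tf.int32, tf.int32), output_shapes=([None], []))

dataset = dataset.apply(
    tf.data.experimental.bucket_by_sequence_length(
        # Bucket by the token sequence length only; the label is a scalar.
        element_length_func=lambda tokens, label: tf.shape(tokens)[0],
        bucket_boundaries=[3],
        bucket_batch_sizes=[2, 2]))

# Each batch pairs padded token sequences with their labels.
for tokens, labels in dataset.as_numpy_iterator():
    print(tokens, labels)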
Args

element_length_func: function from element in Dataset to tf.int32, determines the length of the element, which will determine the bucket it goes into.

bucket_boundaries: list<int>, upper length boundaries of the buckets.

bucket_batch_sizes: list<int>, batch size per bucket. Length should be len(bucket_boundaries) + 1.

padded_shapes: Nested structure of tf.TensorShape to pass to tf.data.Dataset.padded_batch. If not provided, will use dataset.output_shapes, which will result in variable length dimensions being padded out to the maximum length in each batch.

padding_values: Values to pad with, passed to tf.data.Dataset.padded_batch. Defaults to padding with 0.

pad_to_bucket_boundary: bool, if False, will pad dimensions with unknown size to maximum length in batch. If True, will pad dimensions with unknown size to bucket boundary minus 1 (i.e., the maximum length in each bucket), and caller must ensure that the source Dataset does not contain any elements with length longer than max(bucket_boundaries).

no_padding: bool, indicates whether to pad the batch features (features need to be either of type tf.sparse.SparseTensor or of same shape).

drop_remainder: (Optional.) A tf.bool scalar tf.Tensor, representing whether the last batch should be dropped in the case it has fewer than batch_size elements; the default behavior is not to drop the smaller batch.
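Returns

A Dataset transformation function, which can be passed to tf.data.Dataset.apply.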
Raises

ValueError: if len(bucket_batch_sizes) != len(bucket_boundaries) + 1.
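For instance, a call like the following sketch raises ValueError, because two boundaries define three buckets and therefore require three batch sizes:

tf.data.experimental.bucket_by_sequence_length(
    element_length_func=lambda elem: tf.shape(elem)[0],
    bucket_boundaries=[3, 5],
    bucket_batch_sizes=[2, 2])  # needs len(bucket_boundaries) + 1 == 3 entries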