tf.distribute.experimental.CollectiveHints
Hints for collective operations like AllReduce.
tf.distribute.experimental.CollectiveHints(
    bytes_per_pack=0, timeout_seconds=None
)
This can be passed to methods like tf.distribute.get_replica_context().all_reduce() to optimize collective operation performance. Note that these are only hints, which may or may not change the actual behavior. Some options apply only to certain strategies and are ignored by others.
One common optimization is to break the gradient all-reduce into multiple packs so that weight updates can overlap with the gradient all-reduce.
Examples:

- bytes_per_pack

    hints = tf.distribute.experimental.CollectiveHints(
        bytes_per_pack=50 * 1024 * 1024)
    grads = tf.distribute.get_replica_context().all_reduce(
        'sum', grads, experimental_hints=hints)
    optimizer.apply_gradients(zip(grads, vars),
        experimental_aggregate_gradients=False)

- timeout_seconds

    strategy = tf.distribute.MirroredStrategy()
    hints = tf.distribute.experimental.CollectiveHints(
        timeout_seconds=120.0)
    try:
      strategy.reduce("sum", v, axis=None, experimental_hints=hints)
    except tf.errors.DeadlineExceededError:
      do_something()
Args:

bytes_per_pack
    A non-negative integer. Breaks collective operations into packs of a
    certain size. If it is zero, the value is determined automatically. This
    currently only applies to all-reduce with MultiWorkerMirroredStrategy.

timeout_seconds
    A float or None, timeout in seconds. If not None, the collective raises
    tf.errors.DeadlineExceededError if it takes longer than this timeout.
    This can be useful when debugging hanging issues, but it should only be
    used for debugging, since it creates a new thread for each collective,
    i.e. an overhead of timeout_seconds * num_collectives_per_second more
    threads. This only works with
    tf.distribute.experimental.MultiWorkerMirroredStrategy. A combined usage
    sketch follows these argument descriptions.
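For a combined picture, here is a minimal sketch, assuming a custom training loop under MultiWorkerMirroredStrategy; make_model and loss_fn are hypothetical placeholders, and only the CollectiveHints, all_reduce, and apply_gradients calls are taken from the examples above.

    import tensorflow as tf

    strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()
    hints = tf.distribute.experimental.CollectiveHints(
        bytes_per_pack=32 * 1024 * 1024,  # pack gradients into ~32 MB chunks
        timeout_seconds=300.0)            # debugging aid; remove once hangs are diagnosed

    with strategy.scope():
      model = make_model()                           # hypothetical model factory
      optimizer = tf.keras.optimizers.SGD(0.1)

    @tf.function
    def train_step(features, labels):
      def step_fn(features, labels):
        with tf.GradientTape() as tape:
          loss = loss_fn(model(features), labels)    # hypothetical per-replica loss
        grads = tape.gradient(loss, model.trainable_variables)
        # All-reduce the gradients explicitly so the hints take effect, then
        # apply them without a second aggregation pass.
        grads = tf.distribute.get_replica_context().all_reduce(
            'sum', grads, experimental_hints=hints)
        optimizer.apply_gradients(
            zip(grads, model.trainable_variables),
            experimental_aggregate_gradients=False)
        return loss
      return strategy.run(step_fn, args=(features, labels))

Since timeout_seconds creates an extra thread per collective, it would normally be dropped once the hanging collective has been identified.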
Raises:

ValueError
    When arguments have an invalid value.
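As a hypothetical illustration of the ValueError case (not taken from the official examples), a negative bytes_per_pack is presumably rejected when the hints object is constructed:

    try:
      # bytes_per_pack is documented as non-negative, so -1 should be invalid.
      tf.distribute.experimental.CollectiveHints(bytes_per_pack=-1)
    except ValueError:
      print("bytes_per_pack must be a non-negative integer")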