|TensorFlow 1 version||View source on GitHub|
An asynchronous multi-worker parameter server tf.distribute strategy.
tf.distribute.experimental.ParameterServerStrategy( cluster_resolver=None )
This strategy requires two roles: workers and parameter servers. Variables and updates to those variables will be assigned to parameter servers and other operations are assigned to workers.
When each worker has more than one GPU, operations will be replicated on all GPUs. Even though operations may be replicated, variables are not and each worker shares a common view for which parameter server a variable is assigned to.
By default it uses
TFConfigClusterResolver to detect configurations for
multi-worker training. This requires a 'TF_CONFIG' environment variable and
the 'TF_CONFIG' must have a cluster spec.
This class assumes each worker is running the same code independently, but parameter servers are running a standard server. This means that while each worker will synchronously compute a single gradient update across all GPUs, updates between workers proceed asynchronously. Operations that occur only on the first replica (such as incrementing the global step), will occur on the first replica of every worker.
It is expected to call
call_for_each_replica(fn, ...) for any
operations which potentially can be replicated across replicas (i.e. multiple
GPUs) even if there is only CPU or one GPU. When defining the
caution needs to be taken:
1) It is generally not recommended to open a device scope under the strategy's
scope. A device scope (i.e. calling
tf.device) will be merged with or
override the device for operations but will not change the device for
2) It is also not recommended to open a colocation scope (i.e. calling
tf.compat.v1.colocate_with) under the strategy's scope. For colocating
strategy.extended.colocate_vars_with instead. Colocation of
ops will possibly create device assignment conflicts.
strategy = tf.distribute.experimental.ParameterServerStrategy() run_config = tf.estimator.RunConfig( experimental_distribute.train_distribute=strategy) estimator = tf.estimator.Estimator(config=run_config) tf.estimator.train_and_evaluate(estimator,...)
Returns the cluster resolver associated with this strategy.
In general, when using a multi-worker
Strategies that intend to have an associated
Single-worker strategies usually do not have a
For more information, please see
||Returns number of replicas over which gradients are aggregated.|
experimental_assign_to_logical_device( tensor, logical_device_id )
Adds annotation that
tensor will be assigned to a logical device.
# Initializing TPU system with 2 logical devices and 4 replicas. resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='') tf.config.experimental_connect_to_cluster(resolver) topology = tf.tpu.experimental.initialize_tpu_system(resolver) device_assignment = tf.tpu.experimental.DeviceAssignment.build( topology, computation_shape=[1, 1, 1, 2], num_replicas=4) strategy = tf.distribute.TPUStrategy( resolver, experimental_device_assignment=device_assignment) iterator = iter(inputs) @tf.function() def step_fn(inputs): output = tf.add(inputs, inputs) # Add operation will be executed on logical device 0. output = strategy.experimental_assign_to_logical_device(output, 0) return output strategy.run(step_fn, args=(next(iterator),))
||Input tensor to annotate.|
||Id of the logical core to which the tensor will be assigned.|
||The logical device id presented is not consistent with total number of partitions specified by the device assignment.|
Annotated tensor with idential value as
experimental_distribute_dataset( dataset )
The following is an example:
strategy = tf.distribute.MirroredStrategy() # Create a dataset dataset = dataset_ops.Dataset.TFRecordDataset([ "/a/1.tfr", "/a/2.tfr", "/a/3.tfr", "/a/4.tfr"]) # Distribute that dataset dist_dataset = strategy.experimental_distribute_dataset(dataset) # Iterate over the `tf.distribute.DistributedDataset` for x in dist_dataset: # process dataset elements strategy.run(replica_fn, args=(x,))
In the code snippet above, the
dist_dataset is batched by
GLOBAL_BATCH_SIZE, and we iterate through it
for x in dist_dataset.
containing data for all replicas, which aggregates to a batch of
tf.distribute.Strategy.run will take care of feeding
the right per-replica data in
x to the right
replica_fn executed on each