
Custom training with tf.distribute.Strategy


This tutorial demonstrates how to use tf.distribute.Strategy with a custom training loop. We will train a simple CNN model on the Fashion-MNIST dataset. The Fashion-MNIST dataset contains 60,000 training images and 10,000 test images, each of size 28 x 28.

We use a custom training loop to train our model because it gives us flexibility and finer control over training. Moreover, it makes debugging the model and the training loop easier.

# Import TensorFlow
import tensorflow as tf

# Helper libraries
import numpy as np
import os

print(tf.__version__)
2.6.0

Download the Fashion MNIST dataset

fashion_mnist = tf.keras.datasets.fashion_mnist

(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()

# Adding a dimension to the array -> new shape == (28, 28, 1)
# We are doing this because the first layer in our model is a convolutional
# layer and it requires a 4D input (batch_size, height, width, channels).
# batch_size dimension will be added later on.
train_images = train_images[..., None]
test_images = test_images[..., None]

# Getting the images in [0, 1] range.
train_images = train_images / np.float32(255)
test_images = test_images / np.float32(255)
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/train-labels-idx1-ubyte.gz
32768/29515 [=================================] - 0s 0us/step
40960/29515 [=========================================] - 0s 0us/step
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/train-images-idx3-ubyte.gz
26427392/26421880 [==============================] - 0s 0us/step
26435584/26421880 [==============================] - 0s 0us/step
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/t10k-labels-idx1-ubyte.gz
16384/5148 [===============================================================================================] - 0s 0us/step
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/t10k-images-idx3-ubyte.gz
4423680/4422102 [==============================] - 0s 0us/step
4431872/4422102 [==============================] - 0s 0us/step

Create a strategy to distribute the variables and the graph

How does the tf.distribute.MirroredStrategy strategy work?

  • All the variables and the model graph are replicated across the replicas.
  • Input is evenly distributed across the replicas.
  • Each replica calculates the loss and gradients for the input it received.
  • The gradients are synced across all the replicas by summing them.
  • After the sync, the same update is made to the copies of the variables on each replica.

Note: You can put all the code below inside a single cell. We split it into several code cells for illustration purposes.

# If the list of devices is not specified in the
# `tf.distribute.MirroredStrategy` constructor, it will be auto-detected.
strategy = tf.distribute.MirroredStrategy()
2021-08-13 21:23:56.336601: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-13 21:23:56.343258: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-13 21:23:56.344132: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-13 21:23:56.346010: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-08-13 21:23:56.346532: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-13 21:23:56.347425: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-13 21:23:56.348255: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-13 21:23:56.927153: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-13 21:23:56.928079: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-13 21:23:56.928915: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-13 21:23:56.929762: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 14648 MB memory:  -> device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:05.0, compute capability: 7.0
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)
print ('Number of devices: {}'.format(strategy.num_replicas_in_sync))
Number of devices: 1
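
As described above, MirroredStrategy replicates a computation across the replicas and syncs the results by summing. A minimal sketch of that mechanism using strategy.run and strategy.reduce (with the single GPU here there is just one replica, so the printed value is purely illustrative):

# Run a trivial computation on every replica and sum the per-replica results.
def replica_fn():
  return tf.constant(1.0)

per_replica_values = strategy.run(replica_fn)
# With N replicas in sync this prints N; here it prints 1.0.
print(strategy.reduce(tf.distribute.ReduceOp.SUM, per_replica_values, axis=None))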

Set up the input pipeline

Export the graph and the variables to the platform-agnostic SavedModel format. After your model is saved, you can load it with or without the scope.

BUFFER_SIZE = len(train_images)

BATCH_SIZE_PER_REPLICA = 64
GLOBAL_BATCH_SIZE = BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync

EPOCHS = 10

Create the datasets and distribute them:

train_dataset = tf.data.Dataset.from_tensor_slices((train_images, train_labels)).shuffle(BUFFER_SIZE).batch(GLOBAL_BATCH_SIZE) 
test_dataset = tf.data.Dataset.from_tensor_slices((test_images, test_labels)).batch(GLOBAL_BATCH_SIZE) 

train_dist_dataset = strategy.experimental_distribute_dataset(train_dataset)
test_dist_dataset = strategy.experimental_distribute_dataset(test_dataset)
2021-08-13 21:23:57.794731: W tensorflow/core/grappler/optimizers/data/auto_shard.cc:695] AUTO sharding policy will apply DATA sharding policy as it failed to apply FILE sharding policy because of the following reason: Found an unshardable source dataset: name: "TensorSliceDataset/_2"
op: "TensorSliceDataset"
input: "Placeholder/_0"
input: "Placeholder/_1"
attr {
  key: "Toutput_types"
  value {
    list {
      type: DT_FLOAT
      type: DT_UINT8
    }
  }
}
attr {
  key: "output_shapes"
  value {
    list {
      shape {
        dim {
          size: 28
        }
        dim {
          size: 28
        }
        dim {
          size: 1
        }
      }
      shape {
      }
    }
  }
}

2021-08-13 21:23:57.835235: W tensorflow/core/grappler/optimizers/data/auto_shard.cc:695] AUTO sharding policy will apply DATA sharding policy as it failed to apply FILE sharding policy because of the following reason: Found an unshardable source dataset: name: "TensorSliceDataset/_2"
op: "TensorSliceDataset"
input: "Placeholder/_0"
input: "Placeholder/_1"
attr {
  key: "Toutput_types"
  value {
    list {
      type: DT_FLOAT
      type: DT_UINT8
    }
  }
}
attr {
  key: "output_shapes"
  value {
    list {
      shape {
        dim {
          size: 28
        }
        dim {
          size: 28
        }
        dim {
          size: 1
        }
      }
      shape {
      }
    }
  }
}

Create the model

Create a model using tf.keras.Sequential. You can also use the Model Subclassing API to do this.

def create_model():
  model = tf.keras.Sequential([
      tf.keras.layers.Conv2D(32, 3, activation='relu'),
      tf.keras.layers.MaxPooling2D(),
      tf.keras.layers.Conv2D(64, 3, activation='relu'),
      tf.keras.layers.MaxPooling2D(),
      tf.keras.layers.Flatten(),
      tf.keras.layers.Dense(64, activation='relu'),
      tf.keras.layers.Dense(10)
    ])

  return model
# Create a checkpoint directory to store the checkpoints.
checkpoint_dir = './training_checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt")
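
The Model Subclassing API mentioned above works just as well here; a minimal, equivalent sketch (the class name SubclassedModel is just for illustration, and the rest of this tutorial keeps using the Sequential create_model above):

# A subclassed equivalent of create_model (a sketch, not used below).
class SubclassedModel(tf.keras.Model):
  def __init__(self):
    super(SubclassedModel, self).__init__()
    self.conv1 = tf.keras.layers.Conv2D(32, 3, activation='relu')
    self.pool1 = tf.keras.layers.MaxPooling2D()
    self.conv2 = tf.keras.layers.Conv2D(64, 3, activation='relu')
    self.pool2 = tf.keras.layers.MaxPooling2D()
    self.flatten = tf.keras.layers.Flatten()
    self.dense1 = tf.keras.layers.Dense(64, activation='relu')
    self.dense2 = tf.keras.layers.Dense(10)

  def call(self, x):
    x = self.pool1(self.conv1(x))
    x = self.pool2(self.conv2(x))
    x = self.flatten(x)
    return self.dense2(self.dense1(x))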

Define the loss function

Normally, on a single machine with 1 GPU/CPU, the loss is divided by the number of examples in the batch of input.

So, how should the loss be calculated when using tf.distribute.Strategy?

  • For example, let's say you have 4 GPUs and a batch size of 64. One batch of input is distributed across the replicas (4 GPUs), each replica getting an input of size 16.

  • The model on each replica does a forward pass with its respective input and calculates the loss. Now, instead of dividing the loss by the number of examples in its respective input (BATCH_SIZE_PER_REPLICA = 16), the loss should be divided by the GLOBAL_BATCH_SIZE (64).

Why do this?

  • This needs to be done because after the gradients are calculated on each replica, they are synced across the replicas by summing them. Since each per-replica loss was already divided by the GLOBAL_BATCH_SIZE, the summed gradient equals the gradient of the average loss over the full global batch.

How to do this in TensorFlow?

  • If you're writing a custom training loop, as in this tutorial, you should sum the per-example losses and divide the sum by the GLOBAL_BATCH_SIZE: scale_loss = tf.reduce_sum(loss) * (1. / GLOBAL_BATCH_SIZE), or you can use tf.nn.compute_average_loss, which takes the per-example loss, optional sample weights, and GLOBAL_BATCH_SIZE as arguments and returns the scaled loss.

  • If you are using regularization losses in your model, you need to scale the loss value by the number of replicas. You can do this with the tf.nn.scale_regularization_loss function (see the sketch after the code cell below).

  • Using tf.reduce_mean is not recommended. Doing so divides the loss by the actual per-replica batch size, which may vary from step to step.

  • This reduction and scaling is done automatically in Keras model.compile and model.fit.

  • If using tf.keras.losses classes (as in the example below), the loss reduction needs to be explicitly specified to be either NONE or SUM. AUTO and SUM_OVER_BATCH_SIZE are disallowed when used with tf.distribute.Strategy. AUTO is disallowed because the user should explicitly think about what reduction they want, to make sure it is correct in the distributed case. SUM_OVER_BATCH_SIZE is disallowed because currently it would only divide by the per-replica batch size and leave dividing by the number of replicas to the user, which might be easy to miss. So instead we ask the user to do the reduction themselves explicitly.

  • If labels is multi-dimensional, average the per_example_loss across the number of elements in each sample. For example, if the shape of predictions is (batch_size, H, W, n_classes) and labels is (batch_size, H, W), you will need to update per_example_loss like: per_example_loss /= tf.cast(tf.reduce_prod(tf.shape(labels)[1:]), tf.float32)

    Caution: Verify the shape of your loss. Loss functions in tf.losses/tf.keras.losses typically return the average over the last dimension of the input. The loss classes wrap these functions. Passing reduction=Reduction.NONE when creating an instance of a loss class means "no additional reduction". For categorical losses with an example input shape of [batch, W, H, n_classes], the n_classes dimension is reduced. For pointwise losses like losses.mean_squared_error or losses.binary_crossentropy, include a dummy axis so that [batch, W, H, 1] is reduced to [batch, W, H]. Without the dummy axis, [batch, W, H] will be incorrectly reduced to [batch, W].

with strategy.scope():
  # Set reduction to `none` so we can do the reduction afterwards and divide by
  # global batch size.
  loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
      from_logits=True,
      reduction=tf.keras.losses.Reduction.NONE)
  def compute_loss(labels, predictions):
    per_example_loss = loss_object(labels, predictions)
    return tf.nn.compute_average_loss(per_example_loss, global_batch_size=GLOBAL_BATCH_SIZE)
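
If your model also had regularization losses, the bullet above notes that they should be scaled by the number of replicas using tf.nn.scale_regularization_loss. A hedged sketch of how compute_loss could be extended (this tutorial's model has no regularizers, so the extra term would be zero here; compute_loss_with_regularization is just an illustrative name):

def compute_loss_with_regularization(labels, predictions, model_losses):
  per_example_loss = loss_object(labels, predictions)
  loss = tf.nn.compute_average_loss(per_example_loss,
                                    global_batch_size=GLOBAL_BATCH_SIZE)
  if model_losses:
    # model.losses holds the per-replica regularization terms; scale them so
    # that summing across replicas gives the intended total regularization loss.
    loss += tf.nn.scale_regularization_loss(tf.add_n(model_losses))
  return loss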

Define the metrics to track loss and accuracy

These metrics track the test loss and the training and test accuracy. You can use .result() to get the accumulated statistics at any time.

with strategy.scope():
  test_loss = tf.keras.metrics.Mean(name='test_loss')

  train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(
      name='train_accuracy')
  test_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(
      name='test_accuracy')
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).

Training loop

# model, optimizer, and checkpoint must be created under `strategy.scope`.
with strategy.scope():
  model = create_model()

  optimizer = tf.keras.optimizers.Adam()

  checkpoint = tf.train.Checkpoint(optimizer=optimizer, model=model)
def train_step(inputs):
  images, labels = inputs

  with tf.GradientTape() as tape:
    predictions = model(images, training=True)
    loss = compute_loss(labels, predictions)

  gradients = tape.gradient(loss, model.trainable_variables)
  optimizer.apply_gradients(zip(gradients, model.trainable_variables))

  train_accuracy.update_state(labels, predictions)
  return loss 

def test_step(inputs):
  images, labels = inputs

  predictions = model(images, training=False)
  t_loss = loss_object(labels, predictions)

  test_loss.update_state(t_loss)
  test_accuracy.update_state(labels, predictions)
# `run` replicates the provided computation and runs it
# with the distributed input.
@tf.function
def distributed_train_step(dataset_inputs):
  per_replica_losses = strategy.run(train_step, args=(dataset_inputs,))
  return strategy.reduce(tf.distribute.ReduceOp.SUM, per_replica_losses,
                         axis=None)

@tf.function
def distributed_test_step(dataset_inputs):
  return strategy.run(test_step, args=(dataset_inputs,))

for epoch in range(EPOCHS):
  # TRAIN LOOP
  total_loss = 0.0
  num_batches = 0
  for x in train_dist_dataset:
    total_loss += distributed_train_step(x)
    num_batches += 1
  train_loss = total_loss / num_batches

  # TEST LOOP
  for x in test_dist_dataset:
    distributed_test_step(x)

  if epoch % 2 == 0:
    checkpoint.save(checkpoint_prefix)

  template = ("Epoch {}, Loss: {}, Accuracy: {}, Test Loss: {}, "
              "Test Accuracy: {}")
  print (template.format(epoch+1, train_loss,
                         train_accuracy.result()*100, test_loss.result(),
                         test_accuracy.result()*100))

  test_loss.reset_states()
  train_accuracy.reset_states()
  test_accuracy.reset_states()
2021-08-13 21:23:58.131484: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)
2021-08-13 21:23:59.022869: I tensorflow/stream_executor/cuda/cuda_dnn.cc:369] Loaded cuDNN version 8100
2021-08-13 21:23:59.578566: I tensorflow/core/platform/default/subprocess.cc:304] Start cannot spawn child process: No such file or directory
Epoch 1, Loss: 0.5184440612792969, Accuracy: 81.22833251953125, Test Loss: 0.40535494685173035, Test Accuracy: 85.40999603271484
Epoch 2, Loss: 0.33861199021339417, Accuracy: 87.77999877929688, Test Loss: 0.3343721926212311, Test Accuracy: 87.76000213623047
Epoch 3, Loss: 0.2895026206970215, Accuracy: 89.47833251953125, Test Loss: 0.3115186095237732, Test Accuracy: 88.43000030517578
Epoch 4, Loss: 0.25864723324775696, Accuracy: 90.5, Test Loss: 0.3231189548969269, Test Accuracy: 88.3800048828125
Epoch 5, Loss: 0.23562075197696686, Accuracy: 91.30833435058594, Test Loss: 0.27625685930252075, Test Accuracy: 89.84000396728516
Epoch 6, Loss: 0.21540267765522003, Accuracy: 92.0816650390625, Test Loss: 0.25776195526123047, Test Accuracy: 90.58999633789062
Epoch 7, Loss: 0.19832941889762878, Accuracy: 92.73832702636719, Test Loss: 0.2531856298446655, Test Accuracy: 90.63999938964844
Epoch 8, Loss: 0.18321861326694489, Accuracy: 93.27999877929688, Test Loss: 0.24788013100624084, Test Accuracy: 91.1199951171875
Epoch 9, Loss: 0.1684563010931015, Accuracy: 93.77999877929688, Test Loss: 0.2517089247703552, Test Accuracy: 91.25999450683594
Epoch 10, Loss: 0.15191349387168884, Accuracy: 94.38333129882812, Test Loss: 0.2564716041088104, Test Accuracy: 90.86000061035156

Things to note in the example above:

  • We iterate over train_dist_dataset and test_dist_dataset using a for x in ... construct.
  • The scaled loss is the return value of distributed_train_step. This value is aggregated across replicas using the tf.distribute.Strategy.reduce call, and then across batches by summing the return values of the tf.distribute.Strategy.reduce calls.
  • tf.keras.Metrics should be updated inside train_step and test_step, which get executed by tf.distribute.Strategy.run.

Restore the latest checkpoint and test

A model checkpointed with tf.distribute.Strategy can be restored with or without the strategy.

eval_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(
      name='eval_accuracy')

new_model = create_model()
new_optimizer = tf.keras.optimizers.Adam()

test_dataset = tf.data.Dataset.from_tensor_slices((test_images, test_labels)).batch(GLOBAL_BATCH_SIZE)
@tf.function
def eval_step(images, labels):
  predictions = new_model(images, training=False)
  eval_accuracy(labels, predictions)
checkpoint = tf.train.Checkpoint(optimizer=new_optimizer, model=new_model)
checkpoint.restore(tf.train.latest_checkpoint(checkpoint_dir))

for images, labels in test_dataset:
  eval_step(images, labels)

print ('Accuracy after restoring the saved model without strategy: {}'.format(
    eval_accuracy.result()*100))
Accuracy after restoring the saved model without strategy: 91.25999450683594
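
The cell above restores the checkpoint without the strategy. To restore it with the strategy instead, the same objects are simply created inside strategy.scope() before restoring; a minimal sketch (the restored_* names are just for illustration):

with strategy.scope():
  restored_model = create_model()
  restored_optimizer = tf.keras.optimizers.Adam()

  restored_checkpoint = tf.train.Checkpoint(optimizer=restored_optimizer,
                                            model=restored_model)
  restored_checkpoint.restore(tf.train.latest_checkpoint(checkpoint_dir))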

Alternate ways of iterating over a dataset

Using iterators

If you want to iterate over a given number of steps rather than through the entire dataset, you can create an iterator using the iter call and explicitly call next on it. You can choose to iterate over the dataset both inside and outside the tf.function. Here is a small snippet demonstrating iteration of the dataset outside the tf.function using an iterator.

for _ in range(EPOCHS):
  total_loss = 0.0
  num_batches = 0
  train_iter = iter(train_dist_dataset)

  for _ in range(10):
    total_loss += distributed_train_step(next(train_iter))
    num_batches += 1
  average_train_loss = total_loss / num_batches

  template = ("Epoch {}, Loss: {}, Accuracy: {}")
  print (template.format(epoch+1, average_train_loss, train_accuracy.result()*100))
  train_accuracy.reset_states()
Epoch 10, Loss: 0.12727877497673035, Accuracy: 95.46875
Epoch 10, Loss: 0.12111912667751312, Accuracy: 95.625
Epoch 10, Loss: 0.11665823310613632, Accuracy: 94.53125
Epoch 10, Loss: 0.12236034870147705, Accuracy: 95.46875
Epoch 10, Loss: 0.12217365205287933, Accuracy: 96.40625
Epoch 10, Loss: 0.13115283846855164, Accuracy: 95.625
Epoch 10, Loss: 0.12177123874425888, Accuracy: 95.625
Epoch 10, Loss: 0.11623428016901016, Accuracy: 95.0
Epoch 10, Loss: 0.14430288970470428, Accuracy: 94.6875
Epoch 10, Loss: 0.13273152709007263, Accuracy: 95.3125

Iterating inside a tf.function

You can also iterate over the entire input train_dist_dataset inside a tf.function using the for x in ... construct, or by creating iterators like we did above. The example below demonstrates wrapping one epoch of training in a tf.function and iterating over train_dist_dataset inside the function.

@tf.function
def distributed_train_epoch(dataset):
  total_loss = 0.0
  num_batches = 0
  for x in dataset:
    per_replica_losses = strategy.run(train_step, args=(x,))
    total_loss += strategy.reduce(
      tf.distribute.ReduceOp.SUM, per_replica_losses, axis=None)
    num_batches += 1
  return total_loss / tf.cast(num_batches, dtype=tf.float32)

for epoch in range(EPOCHS):
  train_loss = distributed_train_epoch(train_dist_dataset)

  template = ("Epoch {}, Loss: {}, Accuracy: {}")
  print (template.format(epoch+1, train_loss, train_accuracy.result()*100))

  train_accuracy.reset_states()
/tmpfs/src/tf_docs_env/lib/python3.7/site-packages/tensorflow/python/data/ops/dataset_ops.py:374: UserWarning: To make it possible to preserve tf.data options across serialization boundaries, their implementation has moved to be part of the TensorFlow graph. As a consequence, the options value is in general no longer known at graph construction time. Invoking this method in graph mode retains the legacy behavior of the original implementation, but note that the returned value might not reflect the actual value of the options.
  warnings.warn("To make it possible to preserve tf.data options across "
Epoch 1, Loss: 0.14146514236927032, Accuracy: 94.69000244140625
Epoch 2, Loss: 0.12875722348690033, Accuracy: 95.08332824707031
Epoch 3, Loss: 0.11861380189657211, Accuracy: 95.68167114257812
Epoch 4, Loss: 0.10926252603530884, Accuracy: 95.82499694824219
Epoch 5, Loss: 0.10033459216356277, Accuracy: 96.25166320800781
Epoch 6, Loss: 0.09170950204133987, Accuracy: 96.57333374023438
Epoch 7, Loss: 0.08375364542007446, Accuracy: 96.9000015258789
Epoch 8, Loss: 0.07536998391151428, Accuracy: 97.24333190917969
Epoch 9, Loss: 0.07213420420885086, Accuracy: 97.3116683959961
Epoch 10, Loss: 0.06588523089885712, Accuracy: 97.54166412353516

Tracking training loss across replicas

Note: As a general rule, you should use tf.keras.Metrics to track per-sample values and avoid values that have been aggregated within a replica.

We do not recommend using tf.metrics.Mean to track the training loss across different replicas, because of the loss scaling computation that is carried out.

For example, if you run a training job with the following characteristics:

  • Two replicas
  • Two examples processed on each replica
  • Resulting loss values: [2, 3] and [4, 5] on each replica
  • Global batch size = 4

With loss scaling, you calculate the per-sample value of loss on each replica by adding the loss values and then dividing by the global batch size. In this case: (2 + 3) / 4 = 1.25 and (4 + 5) / 4 = 2.25.

If you use tf.metrics.Mean to track loss across the two replicas, the result is different. In this example, you end up with a total of 3.50 and a count of 2, which results in total/count = 1.75 when result() is called on the metric. Loss calculated with tf.keras.Metrics is scaled by an additional factor that is equal to the number of replicas in sync.
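
The arithmetic above can be checked directly; a small sketch using the hypothetical loss values from the example (the numbers are illustrative and are not produced by this tutorial's model):

# Two replicas, two examples each, global batch size of 4.
per_replica_example_losses = [tf.constant([2.0, 3.0]), tf.constant([4.0, 5.0])]
global_batch = 4

# Loss scaling: each replica divides its summed loss by the global batch size
# (1.25 and 2.25), and the per-replica results are then summed across replicas.
scaled = [tf.reduce_sum(l) / global_batch for l in per_replica_example_losses]
print(sum(s.numpy() for s in scaled))   # 3.5 -> the correct combined loss

# tf.metrics.Mean averages the two scaled values instead of summing them, so
# the result is smaller by a factor equal to the number of replicas.
mean_metric = tf.keras.metrics.Mean()
for s in scaled:
  mean_metric.update_state(s)
print(mean_metric.result().numpy())     # 1.75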

Examples and tutorials

Here are some examples for using distribution strategies with custom training loops:

  1. Distributed training guide
  2. DenseNet example using MirroredStrategy.
  3. BERT example trained using MirroredStrategy and TPUStrategy. This example is particularly helpful for understanding how to load from a checkpoint and generate periodic checkpoints during distributed training, etc.
  4. NCF example trained using MirroredStrategy that can be enabled with the keras_use_ctl flag.
  5. NMT example trained using MirroredStrategy.

More examples are listed in the Distribution strategy guide.

Next steps

  • Try out the new tf.distribute.Strategy API on your models.
  • Visit the Performance section in the guide to learn more about other strategies and tools you can use to optimize the performance of your TensorFlow models.