Custom training loop with Keras and MultiWorkerMirroredStrategy

Overview

This tutorial demonstrates multi-worker training with a custom training loop API, distributed via MultiWorkerMirroredStrategy, so that a Keras model designed to run on a single worker can work seamlessly on multiple workers with minimal code changes.

This tutorial uses custom training loops because they give you flexibility and greater control over training. They also make it easier to debug the model and the training loop. More detailed information is available in Writing a training loop from scratch.

If you are looking for how to use MultiWorkerMirroredStrategy with Keras model.fit, refer to this tutorial instead.

For a deeper understanding of the tf.distribute.Strategy APIs, the Distributed Training in TensorFlow guide gives an overview of the distribution strategies TensorFlow supports.

Setup

First, some necessary imports.

import json
import os
import sys


Before importing TensorFlow, make a few changes to the environment.

Disable all GPUs. This prevents errors caused by the workers all trying to use the same GPU. For a real application each worker would be on a different machine.

os.environ["CUDA_VISIBLE_DEVICES"] = "-1"


Reset the TF_CONFIG environment variable. You'll see more about this later.

os.environ.pop('TF_CONFIG', None)


Be sure that the current directory is on Python's path. This allows the notebook to import the files written by %%writefile later.

if '.' not in sys.path:
    sys.path.insert(0, '.')


Now import TensorFlow.

import tensorflow as tf


Dataset and model definition

Next, create an mnist.py file with a simple model and dataset setup. This Python file will be used by the worker processes in this tutorial:

%%writefile mnist.py

import os
import tensorflow as tf
import numpy as np

def mnist_dataset(batch_size):
    # Load the MNIST training split; the test split is not used here.
    (x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
    # The x arrays are in uint8 and have values in the range [0, 255].
    # You need to convert them to float32 with values in the range [0, 1].
    x_train = x_train / np.float32(255)
    y_train = y_train.astype(np.int64)
    train_dataset = tf.data.Dataset.from_tensor_slices(
        (x_train, y_train)).shuffle(60000)
    return train_dataset

def dataset_fn(global_batch_size, input_context):
    # Split the global batch evenly across replicas, then give each
    # input pipeline its own shard of the data.
    batch_size = input_context.get_per_replica_batch_size(global_batch_size)
    dataset = mnist_dataset(batch_size)
    dataset = dataset.shard(input_context.num_input_pipelines,
                            input_context.input_pipeline_id)
    dataset = dataset.batch(batch_size)
    return dataset

def build_cnn_model():
    return tf.keras.Sequential([
        tf.keras.Input(shape=(28, 28)),
        tf.keras.layers.Reshape(target_shape=(28, 28, 1)),
        tf.keras.layers.Conv2D(32, 3, activation='relu'),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(10)
    ])

Writing mnist.py


Multi-worker Configuration

Now let's enter the world of multi-worker training. In TensorFlow, the TF_CONFIG environment variable is required for training on multiple machines, each of which possibly has a different role. TF_CONFIG, used below, is a JSON string that specifies the cluster configuration on each worker that is part of the cluster. This is the default method for specifying a cluster, using cluster_resolver.TFConfigClusterResolver, but there are other options available in the distribute.cluster_resolver module.

Here is an example configuration:

tf_config = {
    'cluster': {
        'worker': ['localhost:12345', 'localhost:23456']
    },
    'task': {'type': 'worker', 'index': 0}
}


Here is the same TF_CONFIG serialized as a JSON string:

json.dumps(tf_config)

'{"cluster": {"worker": ["localhost:12345", "localhost:23456"]}, "task": {"type": "worker", "index": 0}}'


There are two components of TF_CONFIG: cluster and task.

• cluster is the same for all workers and provides information about the training cluster. It is a dict consisting of different types of jobs, such as worker. In multi-worker training with MultiWorkerMirroredStrategy, there is usually one worker that takes on a little more responsibility, such as saving checkpoints and writing summary files for TensorBoard, in addition to what a regular worker does. Such a worker is referred to as the chief worker, and it is customary that the worker with index 0 is appointed as the chief worker (in fact this is how tf.distribute.Strategy is implemented).

• task provides information about the current task and is different on each worker. It specifies the type and index of that worker.

In this example, you set the task type to "worker" and the task index to 0. This machine is the first worker and will be appointed as the chief worker and do more work than the others. Note that other machines will need to have the TF_CONFIG environment variable set as well, and it should have the same cluster dict, but different task type or task index depending on what the roles of those machines are.
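The chief convention described above can be sketched in plain Python. This is only an illustration of the rule, not TensorFlow's implementation: the helper name describe_task is hypothetical, and it assumes that a dedicated "chief" job, when present in the cluster dict, takes precedence over worker 0.

```python
import json


def describe_task(tf_config_json):
    """Return (task_type, task_index, is_chief) for one TF_CONFIG string.

    Hypothetical helper for illustration; it is not part of TensorFlow.
    """
    config = json.loads(tf_config_json)
    cluster = config.get("cluster", {})
    task = config.get("task", {})
    task_type = task.get("type")
    task_index = task.get("index")
    # Assumption: a dedicated "chief" job wins; otherwise worker 0 is chief.
    if "chief" in cluster:
        is_chief = task_type == "chief"
    else:
        is_chief = task_type == "worker" and task_index == 0
    return task_type, task_index, is_chief


example = json.dumps({
    "cluster": {"worker": ["localhost:12345", "localhost:23456"]},
    "task": {"type": "worker", "index": 0},
})
print(describe_task(example))
```

With the two-worker configuration used in this tutorial, only the task with index 0 is reported as chief.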

For illustration purposes, this tutorial shows how one may set a TF_CONFIG with 2 workers on localhost. In practice, users would create multiple workers on external IP addresses/ports, and set TF_CONFIG on each worker appropriately.

In this example you will use 2 workers. The first worker's TF_CONFIG is shown above. For the second worker you would set tf_config['task']['index'] = 1.

Above, tf_config is just a local variable in Python. To actually use it to configure training, this dictionary needs to be serialized as JSON and placed in the TF_CONFIG environment variable.
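As a minimal sketch of that serialization step, the snippet below derives one JSON string per worker from a shared dictionary. The helper name tf_config_for_worker and the variable base_config are hypothetical and exist only for this illustration; each string is what the corresponding worker would place in its own TF_CONFIG variable.

```python
import copy
import json

base_config = {
    "cluster": {"worker": ["localhost:12345", "localhost:23456"]},
    "task": {"type": "worker", "index": 0},
}


def tf_config_for_worker(config, index):
    """Return the JSON string one worker would export as TF_CONFIG."""
    worker_config = copy.deepcopy(config)  # leave the shared dict untouched
    worker_config["task"]["index"] = index
    return json.dumps(worker_config)


per_worker_json = [
    tf_config_for_worker(base_config, i)
    for i in range(len(base_config["cluster"]["worker"]))
]
for value in per_worker_json:
    print(value)
```

Copying the dictionary before changing the index keeps the cluster section identical across workers, which is what the configuration requires.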

Environment variables and subprocesses in notebooks

Subprocesses inherit environment variables from their parent. So if you set an environment variable in this Jupyter notebook process:

os.environ['GREETINGS'] = 'Hello TensorFlow!'


You can access the environment variable from a subprocess:

%%bash
echo ${GREETINGS}

Hello TensorFlow!


In the next section, you'll use this to pass the TF_CONFIG to the worker subprocesses. You would never really launch your jobs this way, but it's sufficient for the purposes of this tutorial: to demonstrate a minimal multi-worker example.
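The same inheritance can be checked from plain Python, without notebook magics. The sketch below is an illustration only: it starts a child Python interpreter with a copy of the parent's environment plus the GREETINGS variable, and the names child_env and child_output are made up for this example.

```python
import os
import subprocess
import sys

# Copy the parent's environment and add one variable for the child only.
child_env = dict(os.environ, GREETINGS="Hello TensorFlow!")

result = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ['GREETINGS'])"],
    env=child_env,
    capture_output=True,
    text=True,
    check=True,
)
child_output = result.stdout.strip()
print(child_output)
```

Passing an explicit env mapping, rather than mutating os.environ, keeps the variable from leaking into later cells of the parent process.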

MultiWorkerMirroredStrategy

To train the model, use an instance of tf.distribute.MultiWorkerMirroredStrategy, which creates copies of all variables in the model's layers on each device across all workers. The tf.distribute.Strategy guide has more details about this strategy.

strategy = tf.distribute.MultiWorkerMirroredStrategy()

INFO:tensorflow:Single-worker MultiWorkerMirroredStrategy with local_devices = ('/device:CPU:0',), communication = CommunicationImplementation.AUTO
2021-11-23 02:29:16.957442: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2021-11-23 02:29:16.957748: E tensorflow/stream_executor/cuda/cuda_diagnostics.cc:313] kernel version 470.63.1 does not match DSO version 470.82.0 -- cannot find working devices in this configuration


Use tf.distribute.Strategy.scope to specify that a strategy should be used when building your model. This puts you in the "cross-replica context" for this strategy, which means the strategy is put in control of things like variable placement.

import mnist
with strategy.scope():
    # Model building needs to be within strategy.scope().
    multi_worker_model = mnist.build_cnn_model()


In multi-worker training, dataset sharding is not strictly required, but it gives you exactly-once semantics, which makes training more reproducible: training on multiple workers should give the same result as training on one worker. Note: performance can be affected in some cases.

per_worker_batch_size = 64
num_workers = len(tf_config['cluster']['worker'])
global_batch_size = per_worker_batch_size * num_workers

with strategy.scope():
    multi_worker_dataset = strategy.distribute_datasets_from_function(
        lambda input_context: mnist.dataset_fn(global_batch_size, input_context))

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
11493376/11490434 [==============================] - 0s 0us/step
11501568/11490434 [==============================] - 0s 0us/step
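To make the sharding and batch-size arithmetic concrete, here is a small plain-Python sketch. It does not use tf.data: the shard function below is a stand-in that mimics the documented behavior of Dataset.shard(num_shards, index) on a list, and it assumes one replica per worker and that every worker sees the elements in the same order.

```python
def shard(elements, num_shards, index):
    """Keep every num_shards-th element, starting at position index."""
    return elements[index::num_shards]


global_batch_size = 128
num_workers = 2
# With one replica per worker, each worker's batch is an equal split.
per_worker_batch_size = global_batch_size // num_workers

examples = list(range(10))  # stand-in for dataset elements
shards = [shard(examples, num_workers, i) for i in range(num_workers)]
print(per_worker_batch_size, shards)
```

Under those assumptions the shards are disjoint and together cover every element once, which is the exactly-once property mentioned above.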


Define Custom Training Loop and Train the model

Specify an optimizer

with strategy.scope():
    # The creation of optimizer and train_accuracy will need to be in
    # strategy.scope() as well, since they create variables.
    optimizer = tf.keras.optimizers.RMSprop(learning_rate=0.001)
    train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(
        name='train_accuracy')


Define a training step with tf.function

@tf.function
def train_step(iterator):
    """Training step function."""

    def step_fn(inputs):
        """Per-Replica step function."""
        x, y = inputs
        predictions = multi_worker_model(x, training=True)
        per_batch_loss = tf.keras.losses.SparseCategoricalCrossentropy(
            from_logits=True,
            reduction=tf.keras.losses.Reduction.NONE)(y, predictions)
        loss = tf.nn.compute_average_loss(
            per_batch_loss, global_batch_size=global_batch_size)

        train_accuracy.update_state(y, predictions)
        return loss

    per_replica_losses = strategy.run(step_fn, args=(next(iterator),))
    return strategy.reduce(
        tf.distribute.ReduceOp.SUM, per_replica_losses, axis=None)
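The loss handling in the step function combines two operations: each replica divides its summed per-example loss by the global batch size, and the per-replica results are then reduced with SUM. The plain-Python sketch below, using made-up loss values for two replicas, shows why that combination yields the mean over the whole global batch. It illustrates the arithmetic only and calls no TensorFlow API.

```python
# Hypothetical per-example losses on two replicas (two examples each).
replica_losses = [[0.2, 0.4], [0.6, 0.8]]
global_batch_size = sum(len(losses) for losses in replica_losses)

# Each replica divides its summed loss by the *global* batch size ...
scaled = [sum(losses) / global_batch_size for losses in replica_losses]
# ... so a SUM reduction across replicas yields the global mean.
reduced = sum(scaled)

all_losses = [loss for losses in replica_losses for loss in losses]
global_mean = sum(all_losses) / len(all_losses)
print(reduced, global_mean)
```

Dividing by the per-replica batch size instead would make the summed result grow with the number of replicas.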


Checkpoint saving and restoring

In a custom training loop, you have to handle checkpointing yourself instead of relying on a Keras callback. Checkpointing lets you save the model's weights and restore them without having to save the whole model.

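from multiprocessing import util
checkpoint_dir = os.path.join(util.get_temp_dir(), 'ckpt')

def _is_chief(task_type, task_id, cluster_spec):
    # Worker 0 acts as chief only when the cluster has no dedicated "chief" job.
    return (task_type is None
            or task_type == 'chief'
            or (task_type == 'worker'
                and task_id == 0
                and "chief" not in cluster_spec.as_dict()))

def _get_temp_dir(dirpath, task_id):
    # Each non-chief worker writes to its own scratch directory.
    base_dirpath = 'workertemp_' + str(task_id)
    temp_dir = os.path.join(dirpath, base_dirpath)
    tf.io.gfile.makedirs(temp_dir)
    return temp_dir

def write_filepath(filepath, task_type, task_id, cluster_spec):
    # The chief writes to the real path; other workers write to a temp path.
    dirpath = os.path.dirname(filepath)
    base = os.path.basename(filepath)
    if not _is_chief(task_type, task_id, cluster_spec):
        dirpath = _get_temp_dir(dirpath, task_id)
    return os.path.join(dirpath, base)


Here, you'll create one tf.train.Checkpoint that tracks the model, which is managed by a tf.train.CheckpointManager so that only the latest checkpoint is preserved.

epoch = tf.Variable(
    initial_value=tf.constant(0, dtype=tf.dtypes.int64), name='epoch')
step_in_epoch = tf.Variable(
    initial_value=tf.constant(0, dtype=tf.dtypes.int64),
    name='step_in_epoch')
task_type, task_id = (strategy.cluster_resolver.task_type,
                      strategy.cluster_resolver.task_id)
# We normally don't need to manually instantiate a ClusterSpec, but in this
# illustrative example we did not set TF_CONFIG before initializing the
# strategy. See the next section for "real-world" usage.
cluster_spec = tf.train.ClusterSpec(tf_config['cluster'])

checkpoint = tf.train.Checkpoint(
    model=multi_worker_model, epoch=epoch, step_in_epoch=step_in_epoch)

write_checkpoint_dir = write_filepath(checkpoint_dir, task_type, task_id,
                                      cluster_spec)
checkpoint_manager = tf.train.CheckpointManager(
    checkpoint, directory=write_checkpoint_dir, max_to_keep=1)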


Now, when you need to restore, you can find the latest checkpoint saved using the convenient tf.train.latest_checkpoint function.

latest_checkpoint = tf.train.latest_checkpoint(checkpoint_dir)
if latest_checkpoint:
    checkpoint.restore(latest_checkpoint)


After restoring the checkpoint, you can continue with training your custom training loop.

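num_epochs = 3
num_steps_per_epoch = 70

while epoch.numpy() < num_epochs:
    iterator = iter(multi_worker_dataset)
    total_loss = 0.0
    num_batches = 0

    while step_in_epoch.numpy() < num_steps_per_epoch:
        total_loss += train_step(iterator)
        num_batches += 1
        # Advance the step counter so the inner loop terminates.
        step_in_epoch.assign_add(1)

    train_loss = total_loss / num_batches
    print('Epoch: %d, accuracy: %f, train_loss: %f.'
          % (epoch.numpy(), train_accuracy.result(), train_loss))

    train_accuracy.reset_states()

    # Once the CheckpointManager is set up, you're now ready to save, and remove
    # the checkpoints non-chief workers saved.
    checkpoint_manager.save()
    # Only non-chief workers write to a directory other than checkpoint_dir,
    # so only they have temporary checkpoints to clean up.
    if write_checkpoint_dir != checkpoint_dir:
        tf.io.gfile.rmtree(write_checkpoint_dir)

    # Advance the epoch counter so the outer loop terminates.
    epoch.assign_add(1)
    step_in_epoch.assign(0)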

2021-11-23 02:29:18.214294: W tensorflow/core/framework/dataset.cc:744] Input of GeneratorDatasetOp::Dataset will not be optimized because the dataset does not implement the AsGraphDefInternal() method needed to apply optimizations.
Epoch: 0, accuracy: 0.826228, train_loss: 0.540131.
Epoch: 1, accuracy: 0.937946, train_loss: 0.207413.
Epoch: 2, accuracy: 0.960603, train_loss: 0.137420.
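The reason for tracking epoch and step_in_epoch in the checkpoint is that a restarted loop can pick up where it left off. The plain-Python sketch below shows that idea with a dict standing in for the checkpointed variables. The function name run_training and the mid-epoch starting state are hypothetical, and no TensorFlow API is involved.

```python
def run_training(state, num_epochs, num_steps_per_epoch):
    """Run the remaining steps, starting from the counters in `state`.

    `state` is a plain dict standing in for the checkpointed variables.
    Returns the number of training steps executed.
    """
    executed = 0
    while state["epoch"] < num_epochs:
        while state["step_in_epoch"] < num_steps_per_epoch:
            executed += 1  # a real loop would call train_step here
            state["step_in_epoch"] += 1
        state["epoch"] += 1
        state["step_in_epoch"] = 0
    return executed


fresh = run_training({"epoch": 0, "step_in_epoch": 0}, 3, 70)
resumed = run_training({"epoch": 1, "step_in_epoch": 30}, 3, 70)
print(fresh, resumed)
```

A run that starts from restored counters executes only the steps that remain, rather than repeating the epochs already completed.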


Full code setup on workers

To actually run with MultiWorkerMirroredStrategy you'll need to run worker processes and pass a TF_CONFIG to them.

Like the mnist.py file written earlier, main.py contains the same code walked through step by step earlier in this Colab. It is written to a file so that each worker can run it:

File: main.py

Writing main.py


Train and Evaluate

The current directory now contains both Python files:

ls *.py

main.py
mnist.py


So JSON-serialize the TF_CONFIG and add it to the environment variables:

os.environ['TF_CONFIG'] = json.dumps(tf_config)


Now, you can launch a worker process that will run the main.py and use the TF_CONFIG:

# first kill any previous runs
%killbgscripts

All background processes were killed.

%%bash --bg
python main.py &> job_0.log


There are a few things to note about the above command:

1. It uses %%bash, which is a notebook "magic" for running bash commands.
2. It uses the --bg flag to run the bash process in the background, because this worker will not terminate. It waits for all the workers before it starts.

The backgrounded worker process won't print output to this notebook, so the &> redirects its output to a file where you can see what happened.

So, wait a few seconds for the process to start up:

import time
time.sleep(20)


Now look at what has been written to the worker's log file so far:

cat job_0.log

2021-11-23 02:29:29.831748: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2021-11-23 02:29:29.832003: E tensorflow/stream_executor/cuda/cuda_diagnostics.cc:313] kernel version 470.63.1 does not match DSO version 470.82.0 -- cannot find working devices in this configuration


The last line of the log file should say: Started server with target: grpc://localhost:12345. The first worker is now ready, and is waiting for all the other worker(s) to be ready to proceed.
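A fixed sleep is simple but can be either too short or needlessly long. One hedged alternative is to poll the log file until the expected message appears. The helper wait_for_line below is a hypothetical sketch, not part of TensorFlow or of this tutorial's main.py, and the demonstration writes its own throwaway log file so that it runs on its own.

```python
import os
import tempfile
import time


def wait_for_line(path, needle, timeout=30.0, poll_interval=0.5):
    """Return True once `needle` appears in the file, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        if os.path.exists(path):
            with open(path, "r", errors="replace") as log_file:
                if needle in log_file.read():
                    return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)


# Self-contained demonstration with a throwaway log file.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_log = os.path.join(tmp_dir, "job_0.log")
    with open(demo_log, "w") as log_file:
        log_file.write("Started server with target: grpc://localhost:12345\n")
    found = wait_for_line(demo_log, "Started server with target", timeout=2.0)
    missing = wait_for_line(demo_log, "no such line", timeout=0.2,
                            poll_interval=0.05)
print(found, missing)
```

In the notebook, a call such as wait_for_line('job_0.log', 'Started server with target') could stand in for the fixed sleep, assuming the worker logs that message as described above.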

So update the tf_config for the second worker's process to pick up:

tf_config['task']['index'] = 1
os.environ['TF_CONFIG'] = json.dumps(tf_config)


Now launch the second worker. This will start the training since all the workers are active (so there's no need to background this process):

%%bash
python main.py > /dev/null 2>&1


Now if you recheck the logs written by the first worker you'll see that it participated in training that model:

cat job_0.log

2021-11-23 02:29:29.831748: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2021-11-23 02:29:29.832003: E tensorflow/stream_executor/cuda/cuda_diagnostics.cc:313] kernel version 470.63.1 does not match DSO version 470.82.0 -- cannot find working devices in this configuration
2021-11-23 02:29:50.709898: W tensorflow/core/framework/dataset.cc:744] Input of GeneratorDatasetOp::Dataset will not be optimized because the dataset does not implement the AsGraphDefInternal() method needed to apply optimizations.
Epoch: 0, accuracy: 0.820424, train_loss: 0.575663.
Epoch: 1, accuracy: 0.927344, train_loss: 0.241324.
Epoch: 2, accuracy: 0.953237, train_loss: 0.154762.

# Delete the TF_CONFIG, and kill any background tasks so they don't affect the next section.
os.environ.pop('TF_CONFIG', None)
%killbgscripts

All background processes were killed.
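The two launches above can also be scripted from plain Python. The sketch below is an illustration under stated assumptions: instead of running main.py, each child process runs a one-line stand-in that prints the task index it found in TF_CONFIG, so the example needs no TensorFlow installation. The names child_code, processes, and reported are made up for this example.

```python
import json
import os
import subprocess
import sys

cluster = {"worker": ["localhost:12345", "localhost:23456"]}

# Stand-in for main.py: each child just reports the task index it received.
child_code = (
    "import json, os; "
    "print(json.loads(os.environ['TF_CONFIG'])['task']['index'])"
)

processes = []
for index in range(len(cluster["worker"])):
    config = {"cluster": cluster, "task": {"type": "worker", "index": index}}
    env = dict(os.environ, TF_CONFIG=json.dumps(config))
    # Popen returns immediately, so all workers start before any is awaited.
    processes.append(subprocess.Popen(
        [sys.executable, "-c", child_code],
        env=env, stdout=subprocess.PIPE, text=True))

reported = [int(p.communicate()[0].strip()) for p in processes]
print(reported)
```

Starting every worker before waiting on any of them matters for real training, because each worker blocks until the whole cluster is up.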


Multi-worker training in depth

This tutorial has demonstrated a custom training loop workflow for the multi-worker setup. Other topics are described in detail in the model.fit guide to the multi-worker setup, and they also apply to custom training loops.