Keras custom callbacks

A custom callback is a powerful tool to customize the behavior of a Keras model during training, evaluation, or inference, including reading or changing the Keras model. Examples include tf.keras.callbacks.TensorBoard, which exports training progress and results so they can be visualized with TensorBoard, and tf.keras.callbacks.ModelCheckpoint, which automatically saves the model during training. In this guide, you will learn what a Keras callback is, when it is called, what it can do, and how you can build your own. Towards the end of this guide, there are demos of a couple of simple callback applications to get you started on your own custom callbacks.

Setup

from __future__ import absolute_import, division, print_function, unicode_literals

import tensorflow as tf

Introduction to Keras callbacks

In Keras, Callback is a Python class meant to be subclassed to provide specific functionality, with a set of methods called at various stages of training (including batch/epoch starts and ends), testing, and predicting. Callbacks are useful for getting a view of the internal states and statistics of the model during training. You can pass a list of callbacks (as the keyword argument callbacks) to any of the tf.keras.Model.fit(), tf.keras.Model.evaluate(), and tf.keras.Model.predict() methods. The callback methods will then be called at the corresponding stages of training/evaluation/inference.

To get started, define a simple Sequential Keras model:

# Define the Keras model to add callbacks to
def get_model():
  model = tf.keras.Sequential()
  model.add(tf.keras.layers.Dense(1, activation='linear', input_dim=784))
  model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=0.1), loss='mean_squared_error', metrics=['mae'])
  return model

Then, load the MNIST data for training and testing from Keras datasets API:

# Load example MNIST data and pre-process it
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(60000, 784).astype('float32') / 255
x_test = x_test.reshape(10000, 784).astype('float32') / 255

Now, define a simple custom callback to track the start and end of every batch of data. During those calls, it prints the index of the current batch.

import datetime

class MyCustomCallback(tf.keras.callbacks.Callback):

  def on_train_batch_begin(self, batch, logs=None):
    print('Training: batch {} begins at {}'.format(batch, datetime.datetime.now().time()))

  def on_train_batch_end(self, batch, logs=None):
    print('Training: batch {} ends at {}'.format(batch, datetime.datetime.now().time()))

  def on_test_batch_begin(self, batch, logs=None):
    print('Evaluating: batch {} begins at {}'.format(batch, datetime.datetime.now().time()))

  def on_test_batch_end(self, batch, logs=None):
    print('Evaluating: batch {} ends at {}'.format(batch, datetime.datetime.now().time()))

Providing a callback to model methods such as tf.keras.Model.fit() ensures the callback's methods are called at the corresponding stages:

model = get_model()
_ = model.fit(x_train, y_train,
          batch_size=64,
          epochs=1,
          steps_per_epoch=5,
          verbose=0,
          callbacks=[MyCustomCallback()])
Training: batch 0 begins at 22:53:23.497069
Training: batch 0 ends at 22:53:24.406412
Training: batch 1 begins at 22:53:24.406879
Training: batch 1 ends at 22:53:24.410124
Training: batch 2 begins at 22:53:24.410395
Training: batch 2 ends at 22:53:24.412897
Training: batch 3 begins at 22:53:24.413114
Training: batch 3 ends at 22:53:24.415623
Training: batch 4 begins at 22:53:24.415865
Training: batch 4 ends at 22:53:24.418233

Model methods that take callbacks

Users can supply a list of callbacks to the following tf.keras.Model methods:

fit(), fit_generator()

Trains the model for a fixed number of epochs (iterations over a dataset, or data yielded batch-by-batch by a Python generator).

evaluate(), evaluate_generator()

Evaluates the model on the given data or data generator. Outputs the loss and metric values from the evaluation.

predict(), predict_generator()

Generates output predictions for the input data or data generator.

_ = model.evaluate(x_test, y_test, batch_size=128, verbose=0, steps=5,
          callbacks=[MyCustomCallback()])
Evaluating: batch 0 begins at 22:53:24.503264
Evaluating: batch 0 ends at 22:53:24.565644
Evaluating: batch 1 begins at 22:53:24.566251
Evaluating: batch 1 ends at 22:53:24.568336
Evaluating: batch 2 begins at 22:53:24.568842
Evaluating: batch 2 ends at 22:53:24.570645
Evaluating: batch 3 begins at 22:53:24.571082
Evaluating: batch 3 ends at 22:53:24.572955
Evaluating: batch 4 begins at 22:53:24.573187
Evaluating: batch 4 ends at 22:53:24.575074
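
predict() accepts callbacks in the same way. Since MyCustomCallback above only overrides the training and test batch hooks, the sketch below (with the hypothetical class name PredictTimingCallback) adds the prediction hooks before calling predict():

class PredictTimingCallback(tf.keras.callbacks.Callback):

  def on_predict_batch_begin(self, batch, logs=None):
    print('Predicting: batch {} begins at {}'.format(batch, datetime.datetime.now().time()))

  def on_predict_batch_end(self, batch, logs=None):
    print('Predicting: batch {} ends at {}'.format(batch, datetime.datetime.now().time()))

_ = model.predict(x_test, batch_size=128, steps=5,
          callbacks=[PredictTimingCallback()])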

An overview of callback methods

Common methods for training/testing/predicting

For training, testing, and predicting, the following methods are provided to be overridden.

on_(train|test|predict)_begin(self, logs=None)

Called at the beginning of fit/evaluate/predict.

on_(train|test|predict)_end(self, logs=None)

Called at the end of fit/evaluate/predict.

on_(train|test|predict)_batch_begin(self, batch, logs=None)

Called right before processing a batch during training/testing/predicting. Within this method, logs is a dict with the keys batch and size, representing the current batch number and the size of the batch.

on_(train|test|predict)_batch_end(self, batch, logs=None)

Called at the end of training/testing/predicting a batch. Within this method, logs is a dict containing the results of the stateful metrics.
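
For instance, the run-level begin/end hooks can be used to time an entire call to fit() or evaluate(). The sketch below (hypothetical class name RunTimerCallback) is one way to do this:

import time

class RunTimerCallback(tf.keras.callbacks.Callback):

  def on_train_begin(self, logs=None):
    # Record the wall-clock time when fit() starts.
    self._train_start = time.time()

  def on_train_end(self, logs=None):
    print('Training took {:.2f} seconds'.format(time.time() - self._train_start))

  def on_test_begin(self, logs=None):
    # Record the wall-clock time when evaluate() starts.
    self._test_start = time.time()

  def on_test_end(self, logs=None):
    print('Evaluation took {:.2f} seconds'.format(time.time() - self._test_start))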

Training specific methods

In addition, the following methods are provided for training.

on_epoch_begin(self, epoch, logs=None)

Called at the beginning of an epoch during training.

on_epoch_end(self, epoch, logs=None)

Called at the end of an epoch during training.
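
As a quick illustration of the epoch hooks, the sketch below (hypothetical class name EpochLoggerCallback) prints when each epoch starts and which keys the logs dict exposes when it ends:

class EpochLoggerCallback(tf.keras.callbacks.Callback):

  def on_epoch_begin(self, epoch, logs=None):
    print('Epoch {} is starting'.format(epoch))

  def on_epoch_end(self, epoch, logs=None):
    # logs holds the loss and metric values averaged over the epoch.
    print('Epoch {} finished; available logs keys: {}'.format(epoch, list(logs.keys())))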

Usage of logs dict

The logs dict contains the loss value and all the metrics at the end of a batch or epoch. The example below prints the loss and mean absolute error.

class LossAndErrorPrintingCallback(tf.keras.callbacks.Callback):

  def on_train_batch_end(self, batch, logs=None):
    print('For batch {}, loss is {:7.2f}.'.format(batch, logs['loss']))

  def on_test_batch_end(self, batch, logs=None):
    print('For batch {}, loss is {:7.2f}.'.format(batch, logs['loss']))

  def on_epoch_end(self, epoch, logs=None):
    print('The average loss for epoch {} is {:7.2f} and mean absolute error is {:7.2f}.'.format(epoch, logs['loss'], logs['mae']))

model = get_model()
_ = model.fit(x_train, y_train,
          batch_size=64,
          steps_per_epoch=5,
          epochs=3,
          verbose=0,
          callbacks=[LossAndErrorPrintingCallback()])
For batch 0, loss is   32.60.
For batch 1, loss is  977.06.
For batch 2, loss is   26.34.
For batch 3, loss is    9.67.
For batch 4, loss is    8.72.
The average loss for epoch 0 is  210.88 and mean absolute error is    8.68.
For batch 0, loss is    7.02.
For batch 1, loss is    6.24.
For batch 2, loss is    7.08.
For batch 3, loss is    7.00.
For batch 4, loss is    8.16.
The average loss for epoch 1 is    7.10 and mean absolute error is    2.19.
For batch 0, loss is    6.65.
For batch 1, loss is    4.38.
For batch 2, loss is    5.21.
For batch 3, loss is    6.02.
For batch 4, loss is    5.20.
The average loss for epoch 2 is    5.49 and mean absolute error is    1.94.

Similarly, one can provide callbacks in evaluate() calls.

_ = model.evaluate(x_test, y_test, batch_size=128, verbose=0, steps=20,
          callbacks=[LossAndErrorPrintingCallback()])
For batch 0, loss is    5.60.
For batch 1, loss is    3.91.
For batch 2, loss is    4.94.
For batch 3, loss is    5.03.
For batch 4, loss is    5.88.
For batch 5, loss is    4.70.
For batch 6, loss is    4.75.
For batch 7, loss is    4.44.
For batch 8, loss is    4.73.
For batch 9, loss is    5.69.
For batch 10, loss is    5.49.
For batch 11, loss is    5.58.
For batch 12, loss is    5.79.
For batch 13, loss is    7.22.
For batch 14, loss is    4.87.
For batch 15, loss is    5.16.
For batch 16, loss is    6.29.
For batch 17, loss is    6.54.
For batch 18, loss is    6.31.
For batch 19, loss is    4.51.

Examples of Keras callback applications

The following section will guide you through creating simple Callback applications.

Early stopping at minimum loss

The first example showcases a Callback that stops Keras training when the minimum loss has been reached, by setting the attribute model.stop_training (boolean). Optionally, the user can provide an argument patience to specify how many epochs training should wait before it eventually stops.

tf.keras.callbacks.EarlyStopping provides a more complete and general implementation.
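
For reference, a roughly equivalent configuration of the built-in callback (the parameter values here are illustrative) could be:

early_stopping = tf.keras.callbacks.EarlyStopping(monitor='loss', patience=2,
                                                  restore_best_weights=True)
# Then pass it to fit(), e.g. model.fit(..., callbacks=[early_stopping])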

import numpy as np

class EarlyStoppingAtMinLoss(tf.keras.callbacks.Callback):
  """Stop training when the loss is at its min, i.e. the loss stops decreasing.

  Arguments:
      patience: Number of epochs to wait after min has been hit. After this
          number of epochs with no improvement, training stops.
  """

  def __init__(self, patience=0):
    super(EarlyStoppingAtMinLoss, self).__init__()

    self.patience = patience

    # best_weights to store the weights at which the minimum loss occurs.
    self.best_weights = None

  def on_train_begin(self, logs=None):
    # The number of epochs waited while the loss is no longer improving.
    self.wait = 0
    # The epoch the training stops at.
    self.stopped_epoch = 0
    # Initialize the best as infinity.
    self.best = np.Inf

  def on_epoch_end(self, epoch, logs=None):
    current = logs.get('loss')
    if np.less(current, self.best):
      self.best = current
      self.wait = 0
      # Record the best weights if the current result is better (lower).
      self.best_weights = self.model.get_weights()
    else:
      self.wait += 1
      if self.wait >= self.patience:
        self.stopped_epoch = epoch
        self.model.stop_training = True
        print('Restoring model weights from the end of the best epoch.')
        self.model.set_weights(self.best_weights)

  def on_train_end(self, logs=None):
    if self.stopped_epoch > 0:
      print('Epoch %05d: early stopping' % (self.stopped_epoch + 1))
model = get_model()
_ = model.fit(x_train, y_train,
          batch_size=64,
          steps_per_epoch=5,
          epochs=30,
          verbose=0,
          callbacks=[LossAndErrorPrintingCallback(), EarlyStoppingAtMinLoss()])
For batch 0, loss is   22.00.
For batch 1, loss is  824.54.
For batch 2, loss is   20.08.
For batch 3, loss is   12.31.
For batch 4, loss is    8.96.
The average loss for epoch 0 is  177.58 and mean absolute error is    8.02.
For batch 0, loss is    6.24.
For batch 1, loss is    5.54.
For batch 2, loss is    4.51.
For batch 3, loss is    5.26.
For batch 4, loss is    8.51.
The average loss for epoch 1 is    6.01 and mean absolute error is    2.00.
For batch 0, loss is    6.76.
For batch 1, loss is    5.01.
For batch 2, loss is    5.96.
For batch 3, loss is    4.84.
For batch 4, loss is    5.34.
The average loss for epoch 2 is    5.58 and mean absolute error is    1.92.
For batch 0, loss is    5.15.
For batch 1, loss is    4.82.
For batch 2, loss is    9.00.
For batch 3, loss is   28.72.
For batch 4, loss is  114.01.
The average loss for epoch 3 is   32.34 and mean absolute error is    4.14.
Restoring model weights from the end of the best epoch.
Epoch 00004: early stopping

Learning rate scheduling

One thing that is commonly done in model training is changing the learning rate as more epochs have passed. The Keras backend exposes the get_value and set_value APIs, which can be used to read and update the optimizer's variables. This example shows how a custom Callback can be used to dynamically change the learning rate.

class LearningRateScheduler(tf.keras.callbacks.Callback):
  """Learning rate scheduler which sets the learning rate according to schedule.

  Arguments:
      schedule: a function that takes an epoch index
          (integer, indexed from 0) and current learning rate
          as inputs and returns a new learning rate as output (float).
  """

  def __init__(self, schedule):
    super(LearningRateScheduler, self).__init__()
    self.schedule = schedule

  def on_epoch_begin(self, epoch, logs=None):
    if not hasattr(self.model.optimizer, 'lr'):
      raise ValueError('Optimizer must have a "lr" attribute.')
    # Get the current learning rate from model's optimizer.
    lr = float(tf.keras.backend.get_value(self.model.optimizer.lr))
    # Call schedule function to get the scheduled learning rate.
    scheduled_lr = self.schedule(epoch, lr)
    # Set the value back to the optimizer before this epoch starts
    tf.keras.backend.set_value(self.model.optimizer.lr, scheduled_lr)
    print('\nEpoch %05d: Learning rate is %6.4f.' % (epoch, scheduled_lr))
LR_SCHEDULE = [
    # (epoch to start, learning rate) tuples
    (3, 0.05), (6, 0.01), (9, 0.005), (12, 0.001)
]

def lr_schedule(epoch, lr):
  """Helper function to retrieve the scheduled learning rate based on epoch."""
  if epoch < LR_SCHEDULE[0][0] or epoch > LR_SCHEDULE[-1][0]:
    return lr
  for i in range(len(LR_SCHEDULE)):
    if epoch == LR_SCHEDULE[i][0]:
      return LR_SCHEDULE[i][1]
  return lr

model = get_model()
_ = model.fit(x_train, y_train,
          batch_size=64,
          steps_per_epoch=5,
          epochs=15,
          verbose=0,
          callbacks=[LossAndErrorPrintingCallback(), LearningRateScheduler(lr_schedule)])

Epoch 00000: Learning rate is 0.1000.
For batch 0, loss is   17.97.
For batch 1, loss is 1061.05.
For batch 2, loss is   20.94.
For batch 3, loss is    9.75.
For batch 4, loss is    6.84.
The average loss for epoch 0 is  223.31 and mean absolute error is    8.41.

Epoch 00001: Learning rate is 0.1000.
For batch 0, loss is    5.55.
For batch 1, loss is    7.26.
For batch 2, loss is    5.09.
For batch 3, loss is    6.07.
For batch 4, loss is    5.28.
The average loss for epoch 1 is    5.85 and mean absolute error is    1.96.

Epoch 00002: Learning rate is 0.1000.
For batch 0, loss is    5.50.
For batch 1, loss is    6.10.
For batch 2, loss is    5.71.
For batch 3, loss is    8.08.
For batch 4, loss is    5.85.
The average loss for epoch 2 is    6.25 and mean absolute error is    2.01.

Epoch 00003: Learning rate is 0.0500.
For batch 0, loss is    5.73.
For batch 1, loss is    4.33.
For batch 2, loss is    3.53.
For batch 3, loss is    5.04.
For batch 4, loss is    5.11.
The average loss for epoch 3 is    4.75 and mean absolute error is    1.77.

Epoch 00004: Learning rate is 0.0500.
For batch 0, loss is    3.79.
For batch 1, loss is    5.21.
For batch 2, loss is    4.28.
For batch 3, loss is    5.18.
For batch 4, loss is    5.39.
The average loss for epoch 4 is    4.77 and mean absolute error is    1.75.

Epoch 00005: Learning rate is 0.0500.
For batch 0, loss is    4.60.
For batch 1, loss is    4.82.
For batch 2, loss is    3.07.
For batch 3, loss is    3.61.
For batch 4, loss is    6.60.
The average loss for epoch 5 is    4.54 and mean absolute error is    1.68.

Epoch 00006: Learning rate is 0.0100.
For batch 0, loss is    4.70.
For batch 1, loss is    6.51.
For batch 2, loss is    3.84.
For batch 3, loss is    4.23.
For batch 4, loss is    4.24.
The average loss for epoch 6 is    4.70 and mean absolute error is    1.74.

Epoch 00007: Learning rate is 0.0100.
For batch 0, loss is    5.25.
For batch 1, loss is    4.71.
For batch 2, loss is    4.92.
For batch 3, loss is    5.17.
For batch 4, loss is    2.96.
The average loss for epoch 7 is    4.60 and mean absolute error is    1.70.

Epoch 00008: Learning rate is 0.0100.
For batch 0, loss is    3.22.
For batch 1, loss is    4.52.
For batch 2, loss is    3.15.
For batch 3, loss is    5.05.
For batch 4, loss is    4.31.
The average loss for epoch 8 is    4.05 and mean absolute error is    1.61.

Epoch 00009: Learning rate is 0.0050.
For batch 0, loss is    4.31.
For batch 1, loss is    3.53.
For batch 2, loss is    3.64.
For batch 3, loss is    4.84.
For batch 4, loss is    3.52.
The average loss for epoch 9 is    3.97 and mean absolute error is    1.54.

Epoch 00010: Learning rate is 0.0050.
For batch 0, loss is    4.84.
For batch 1, loss is    4.54.
For batch 2, loss is    3.06.
For batch 3, loss is    4.19.
For batch 4, loss is    5.61.
The average loss for epoch 10 is    4.45 and mean absolute error is    1.66.

Epoch 00011: Learning rate is 0.0050.
For batch 0, loss is    4.27.
For batch 1, loss is    4.28.
For batch 2, loss is    3.13.
For batch 3, loss is    2.63.
For batch 4, loss is    3.06.
The average loss for epoch 11 is    3.47 and mean absolute error is    1.46.

Epoch 00012: Learning rate is 0.0010.
For batch 0, loss is    3.26.
For batch 1, loss is    4.26.
For batch 2, loss is    2.71.
For batch 3, loss is    3.88.
For batch 4, loss is    3.36.
The average loss for epoch 12 is    3.49 and mean absolute error is    1.49.

Epoch 00013: Learning rate is 0.0010.
For batch 0, loss is    4.96.
For batch 1, loss is    4.26.
For batch 2, loss is    3.49.
For batch 3, loss is    3.48.
For batch 4, loss is    3.30.
The average loss for epoch 13 is    3.90 and mean absolute error is    1.54.

Epoch 00014: Learning rate is 0.0010.
For batch 0, loss is    3.28.
For batch 1, loss is    4.09.
For batch 2, loss is    5.22.
For batch 3, loss is    2.88.
For batch 4, loss is    3.55.
The average loss for epoch 14 is    3.80 and mean absolute error is    1.55.

Standard Keras callbacks

Be sure to check out the existing Keras callbacks by visiting the API docs. Applications include logging to CSV, saving the model, visualizing metrics in TensorBoard, and a lot more.
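
As a quick illustration (the file paths below are placeholders), several built-in callbacks can be combined in a single fit() call:

callbacks = [
    # Log per-epoch metrics to a CSV file.
    tf.keras.callbacks.CSVLogger('training_log.csv'),
    # Save the model at the end of every epoch.
    tf.keras.callbacks.ModelCheckpoint('model_{epoch:02d}.h5'),
    # Write summaries that can be visualized with TensorBoard.
    tf.keras.callbacks.TensorBoard(log_dir='./logs'),
]
# model.fit(x_train, y_train, batch_size=64, epochs=5, callbacks=callbacks)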