
Callback to back up and restore the training state.

Inherits From: Callback


The BackupAndRestore callback is intended to recover training from an interruption that happens in the middle of an execution, by backing up the training state in a temporary checkpoint file at the end of each epoch. Each backup overwrites the previously written checkpoint file, so at any given time there is at most one such checkpoint file for backup and restore purposes.

If training restarts before completion, the training state (which includes the Model weights and epoch number) is restored to the most recently saved state at the beginning of a new run. At the completion of a run, the temporary checkpoint file is deleted.

Note that the user is responsible for restarting jobs after an interruption. This callback provides the backup and restore mechanism for fault tolerance, and the model to be restored from a previous checkpoint is expected to be the same as the one used to back up. If the user changes the arguments passed to compile or fit, the checkpoint saved for fault tolerance can become invalid.


import numpy as np
import keras

class InterruptingCallback(keras.callbacks.Callback):
  # Simulates an interruption at the start of zero-indexed epoch 4.
  def on_epoch_begin(self, epoch, logs=None):
    if epoch == 4:
      raise RuntimeError('Interrupting!')

callback = keras.callbacks.BackupAndRestore(backup_dir="/tmp/backup")
model = keras.models.Sequential([keras.layers.Dense(10)])
model.compile(keras.optimizers.SGD(), loss='mse')
try:
  model.fit(np.arange(100).reshape(5, 20), np.zeros(5), epochs=10,
            batch_size=1, callbacks=[callback, InterruptingCallback()],
            verbose=0)
except RuntimeError:
  pass  # The first run stops here; the backup stays in backup_dir.
history = model.fit(np.arange(100).reshape(5, 20), np.zeros(5),
                    epochs=10, batch_size=1, callbacks=[callback],
                    verbose=0)
# Only 6 more epochs are run, since first training got interrupted at
# zero-indexed epoch 4, second training will continue from 4 to 9.
print(len(history.history['loss']))
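
The backup, restore, and cleanup cycle described above can be illustrated without Keras. The following is only a minimal plain-Python sketch of the idea, not the callback's actual implementation: the `train` function, the `backup.json` file name, and the JSON state layout are all hypothetical stand-ins chosen for illustration.

```python
import json
import os
import tempfile


def train(backup_path, epochs, interrupt_at=None):
    """Toy training loop with a single, overwritten backup file (hypothetical)."""
    start_epoch, weights = 0, 0.0
    # At the beginning of a run, restore the most recently saved state if any.
    if os.path.exists(backup_path):
        with open(backup_path) as f:
            state = json.load(f)
        start_epoch, weights = state["next_epoch"], state["weights"]

    epochs_run = []
    for epoch in range(start_epoch, epochs):
        if epoch == interrupt_at:
            raise RuntimeError("Interrupting!")
        weights += 1.0  # Stand-in for one epoch of real training.
        epochs_run.append(epoch)
        # Each backup overwrites the previous one, so only one file exists.
        with open(backup_path, "w") as f:
            json.dump({"next_epoch": epoch + 1, "weights": weights}, f)

    # The run completed, so the temporary backup is deleted.
    if os.path.exists(backup_path):
        os.remove(backup_path)
    return epochs_run


backup_path = os.path.join(tempfile.mkdtemp(), "backup.json")
try:
    train(backup_path, epochs=10, interrupt_at=4)
except RuntimeError:
    pass  # Simulated interruption; the backup file is left on disk.

backup_survived = os.path.exists(backup_path)
resumed_epochs = train(backup_path, epochs=10)
print(resumed_epochs)  # [4, 5, 6, 7, 8, 9]
```

In this sketch the second call resumes at epoch 4 and runs six epochs, mirroring the arithmetic in the example above, and the backup file is gone once the run completes.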

backup_dir String, path of directory where to store the data needed to restore the model. The directory cannot be reused elsewhere to store other files, e.g. by the BackupAndRestore callback of another training run, or by another callback (e.g. ModelCheckpoint) of the same training run.
save_freq "epoch", integer, or False. When set to "epoch" the callback saves the checkpoint at the end of each epoch. When set to an integer, the callback saves the checkpoint every save_freq batches. Set save_freq=False only if using preemption checkpointing (i.e. with save_before_preemption=True).
delete_checkpoint Boolean, defaults to True. This BackupAndRestore callback works by saving a checkpoint to back up the training state. If delete_checkpoint=True, the checkpoint will be deleted after training is finished. Use False if you'd like to keep the checkpoint for future usage.
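
To make the three documented forms of save_freq concrete, here is a small plain-Python sketch of how they could be interpreted. The helper `should_save` is hypothetical and is not part of Keras. It also assumes that `batches_seen` counts completed batches starting from 1, which may differ from how the callback tracks batches internally.

```python
def should_save(save_freq, batches_seen, epoch_ended):
    """Hypothetical helper mirroring the documented meaning of save_freq."""
    if save_freq is False:
        # Regular backups are disabled; only preemption checkpointing applies.
        return False
    if save_freq == "epoch":
        # Back up once, at the end of each epoch.
        return epoch_ended
    if isinstance(save_freq, int) and save_freq > 0:
        # Back up every `save_freq` batches (assumed counting scheme).
        return batches_seen % save_freq == 0
    raise ValueError(f"Unsupported save_freq: {save_freq!r}")


print(should_save("epoch", batches_seen=3, epoch_ended=True))
print(should_save(5, batches_seen=10, epoch_ended=False))
print(should_save(False, batches_seen=5, epoch_ended=True))
```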




on_batch_begin

A backwards compatibility alias for on_train_batch_begin.


on_batch_end

A backwards compatibility alias for on_train_batch_end.


on_epoch_begin

Called at the start of an epoch.

Subclasses should override for any actions to run. This function should only be called during TRAIN mode.

epoch Integer, index of epoch.
logs Dict. Currently no data is passed to this argument for this method but that may change in the future.


on_epoch_end

Called at the end of an epoch.

Subclasses should override for any actions to run. This function should only be called during TRAIN mode.

epoch Integer, index of epoch.
logs Dict, metric results for this training epoch, and for the validation epoch if validation is performed. Validation result keys are prefixed with val_. For training epoch, the values of the Model's metrics are returned. Example: {'loss': 0.2, 'accuracy': 0.7}.
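
Because validation keys share the val_ prefix, a logs dict of the shape described above can be separated into training and validation metrics with ordinary dict operations. The helper below is a hypothetical illustration (the name `split_logs` and the sample values are not from Keras); it is the kind of logic one might place inside an on_epoch_end override.

```python
def split_logs(logs):
    """Split an epoch-end logs dict into training and validation metrics."""
    logs = logs or {}  # The logs argument may be None.
    prefix = "val_"
    train = {k: v for k, v in logs.items() if not k.startswith(prefix)}
    # Strip the prefix so both dicts use the same metric names.
    val = {k[len(prefix):]: v for k, v in logs.items() if k.startswith(prefix)}
    return train, val


logs = {"loss": 0.2, "accuracy": 0.7, "val_loss": 0.3, "val_accuracy": 0.6}
train_metrics, val_metrics = split_logs(logs)
print(train_metrics)
print(val_metrics)
```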


on_predict_batch_begin

Called at the beginning of a batch in predict methods.

Subclasses should override for any actions to run.

Note that if the steps_per_execution argument to compile in Model is set to N, this method will only be called every N batches.

batch Integer, index of batch within the current epoch.
logs Dict. Currently no data is passed to this argument for this method but that may change in the future.
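
The effect of steps_per_execution on how often batch-level hooks fire can be sketched with simple arithmetic. The helper below is hypothetical, and it assumes that a final, smaller group of leftover batches still triggers one hook call; that detail is an assumption rather than something this page states.

```python
def hook_call_groups(num_batches, steps_per_execution):
    """Group batch indices by the execution that would trigger one hook call."""
    groups = []
    for start in range(0, num_batches, steps_per_execution):
        # Assumption: the last group may be shorter than steps_per_execution.
        stop = min(start + steps_per_execution, num_batches)
        groups.append(list(range(start, stop)))
    return groups


groups = hook_call_groups(num_batches=10, steps_per_execution=4)
print(len(groups))  # Number of hook calls under this assumption.
print(groups)
```

With steps_per_execution left at 1, the same helper yields one group per batch, which corresponds to the hook being called for every batch.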


on_predict_batch_end

Called at the end of a batch in predict methods.

Subclasses should override for any actions to run.

Note that if the steps_per_execution argument to compile in Model is set to N, this method will only be called every N batches.

batch Integer, index of batch within the current epoch.
logs Dict. Aggregated metric results up until this batch.


on_predict_begin

Called at the beginning of prediction.

Subclasses should override for any actions to run.

logs Dict. Currently no data is passed to this argument for this method but that may change in the future.


on_predict_end

Called at the end of prediction.

Subclasses should override for any actions to run.

logs Dict. Currently no data is passed to this argument for this method but that may change in the future.


on_test_batch_begin

Called at the beginning of a batch in evaluate methods.

Also called at the beginning of a validation batch in the fit methods, if validation data is provided.

Subclasses should override for any actions to run.

Note that if the steps_per_execution argument to compile in Model is set to N, this method will only be called every N batches.

batch Integer, index of batch within the current epoch.
logs Dict. Currently no data is passed to this argument for this method but that may change in the future.


on_test_batch_end

Called at the end of a batch in evaluate methods.

Also called at the end of a validation batch in the fit methods, if validation data is provided.

Subclasses should override for any actions to run.

Note that if the steps_per_execution argument to compile in Model is set to N, this method will only be called every N batches.

batch Integer, index of batch within the current epoch.
logs Dict. Aggregated metric results up until this batch.


on_test_begin

Called at the beginning of evaluation or validation.

Subclasses should override for any actions to run.

logs Dict. Currently no data is passed to this argument for this method but that may change in the future.


on_test_end

Called at the end of evaluation or validation.

Subclasses should override for any actions to run.

logs Dict. Currently the output of the last call to on_test_batch_end() is passed to this argument for this method but that may change in the future.


on_train_batch_begin

Called at the beginning of a training batch in fit methods.

Subclasses should override for any actions to run.

Note that if the steps_per_execution argument to compile in Model is set to N, this method will only be called every N batches.

batch Integer, index of batch within the current epoch.
logs Dict. Currently no data is passed to this argument for this method but that may change in the future.


on_train_batch_end

Called at the end of a training batch in fit methods.

Subclasses should override for any actions to run.

Note that if the steps_per_execution argument to compile in Model is set to N, this method will only be called every N batches.

batch Integer, index of batch within the current epoch.
logs Dict. Aggregated metric results up until this batch.


on_train_begin

Get training state from temporary file and restore it.


on_train_end

Called at the end of training.

Subclasses should override for any actions to run.

logs Dict. Currently the output of the last call to on_epoch_end() is passed to this argument for this method but that may change in the future.

