tf.keras.callbacks.BackupAndRestore
Callback to back up and restore the training state.
Inherits From: Callback
tf.keras.callbacks.BackupAndRestore(
    backup_dir,
    save_freq='epoch',
    delete_checkpoint=True,
    save_before_preemption=False
)
The BackupAndRestore callback is intended to recover training from an
interruption that has happened in the middle of a Model.fit execution, by
backing up the training state in a temporary checkpoint file (with the help
of a tf.train.CheckpointManager) at the end of each epoch. Each backup
overwrites the previously written checkpoint file, so at any given time
there is at most one such checkpoint file for backup/restoring purposes.

If training restarts before completion, the training state (which includes
the Model weights and epoch number) is restored to the most recently saved
state at the beginning of a new Model.fit run. At the completion of a
Model.fit run, the temporary checkpoint file is deleted.
Note that the user is responsible for bringing jobs back after the
interruption. This callback is important for the backup and restore mechanism
for fault tolerance purposes, and the model to be restored from a previous
checkpoint is expected to be the same as the one used to back it up. If the
user changes the arguments passed to compile or fit, the checkpoint saved for
fault tolerance can become invalid.
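Because the restored checkpoint must match the model that produced it, one practical pattern is to keep model construction and compilation in a single helper that every (re)started job calls. The sketch below is illustrative only and not part of the official API; the helper name and the backup path are placeholders.

import os
import numpy as np
import tensorflow as tf

def build_and_compile_model():
    # Hypothetical helper: the architecture and compile arguments here must be
    # identical across the original run and any restarted run, otherwise the
    # fault-tolerance checkpoint in backup_dir may no longer be valid.
    model = tf.keras.models.Sequential([tf.keras.layers.Dense(10)])
    model.compile(tf.keras.optimizers.SGD(), loss='mse')
    return model

backup_dir = os.path.join('/tmp', 'backup')  # placeholder path
callback = tf.keras.callbacks.BackupAndRestore(backup_dir=backup_dir)

model = build_and_compile_model()
model.fit(np.arange(100).reshape(5, 20), np.zeros(5),
          epochs=5, batch_size=1, callbacks=[callback], verbose=0)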
Note:
- This callback is not compatible with disabling eager execution.
- A checkpoint is saved at the end of each epoch. After restoring, Model.fit
  redoes any partial work from the unfinished epoch in which training was
  interrupted (so the work done before the interruption doesn't affect the
  final model state).
- This works for both single-worker and multi-worker modes. When Model.fit
  is used with tf.distribute, it supports tf.distribute.MirroredStrategy,
  tf.distribute.MultiWorkerMirroredStrategy, tf.distribute.TPUStrategy, and
  tf.distribute.experimental.ParameterServerStrategy (see the
  distributed-training sketch after the example below).
Example:
import numpy as np
import tensorflow as tf

class InterruptingCallback(tf.keras.callbacks.Callback):
    def on_epoch_begin(self, epoch, logs=None):
        if epoch == 4:
            raise RuntimeError('Interrupting!')

callback = tf.keras.callbacks.BackupAndRestore(backup_dir="/tmp/backup")
model = tf.keras.models.Sequential([tf.keras.layers.Dense(10)])
model.compile(tf.keras.optimizers.SGD(), loss='mse')
try:
    model.fit(np.arange(100).reshape(5, 20), np.zeros(5), epochs=10,
              batch_size=1, callbacks=[callback, InterruptingCallback()],
              verbose=0)
except RuntimeError:
    pass
history = model.fit(np.arange(100).reshape(5, 20), np.zeros(5),
                    epochs=10, batch_size=1, callbacks=[callback],
                    verbose=0)
# Only 6 more epochs are run: the first run was interrupted at zero-indexed
# epoch 4, so the second run continues from epoch 4 to 9.
len(history.history['loss'])
6
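As noted above, the callback also works when Model.fit runs under a tf.distribute strategy. The following is a minimal sketch (not taken from the original example) assuming a single-host tf.distribute.MirroredStrategy; the backup directory name is a placeholder.

import numpy as np
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    # The model must be created and compiled inside the strategy scope.
    model = tf.keras.models.Sequential([tf.keras.layers.Dense(10)])
    model.compile(tf.keras.optimizers.SGD(), loss='mse')

callback = tf.keras.callbacks.BackupAndRestore(backup_dir='/tmp/backup_mirrored')
model.fit(np.arange(100).reshape(5, 20), np.zeros(5),
          epochs=3, batch_size=1, callbacks=[callback], verbose=0)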
Besides the option to save at the end of every epoch or every N steps, if
you are doing distributed training with
tf.distribute.MultiWorkerMirroredStrategy on Google Cloud Platform or
Google Borg, you can also use the save_before_preemption argument to enable
saving a checkpoint right before a worker gets preempted by other jobs and
training gets interrupted. See
tf.distribute.experimental.PreemptionCheckpointHandler for more details.
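As a rough illustration of that combination, the sketch below (an assumption-laden example, not from the official docs) pairs save_before_preemption=True with save_freq=False so that checkpoints are written only when a preemption is detected. It assumes a multi-worker job on Google Cloud Platform or Google Borg with TF_CONFIG already configured; the backup directory is a placeholder.

import numpy as np
import tensorflow as tf

strategy = tf.distribute.MultiWorkerMirroredStrategy()
with strategy.scope():
    model = tf.keras.models.Sequential([tf.keras.layers.Dense(10)])
    model.compile(tf.keras.optimizers.SGD(), loss='mse')

callback = tf.keras.callbacks.BackupAndRestore(
    backup_dir='/tmp/backup_preempt',   # placeholder path
    save_freq=False,                    # no periodic epoch/step checkpoints
    save_before_preemption=True)        # checkpoint only on preemption signals

model.fit(np.arange(100).reshape(5, 20), np.zeros(5),
          epochs=3, batch_size=1, callbacks=[callback], verbose=0)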
Args

backup_dir
    String, path to store the checkpoint,
    e.g. backup_dir = os.path.join(working_dir, 'backup').
    This is the directory in which the system stores temporary files to
    recover the model from jobs terminated unexpectedly. The directory
    cannot be reused elsewhere to store other files, e.g. by the
    BackupAndRestore callback of another training run, or by another
    callback (e.g. ModelCheckpoint) of the same training run.

save_freq
    'epoch', integer, or False. When set to 'epoch' the callback saves the
    checkpoint at the end of each epoch. When set to an integer, the
    callback saves the checkpoint every save_freq batches. Set save_freq to
    False if only using preemption checkpointing
    (with save_before_preemption=True).

delete_checkpoint
    Boolean, defaults to True. This BackupAndRestore callback works by
    saving a checkpoint to back up the training state. If
    delete_checkpoint=True, the checkpoint is deleted after training
    finishes. Set it to False if you'd like to keep the checkpoint for
    future use.

save_before_preemption
    Boolean, whether to turn on automatic checkpoint saving for
    preemption/maintenance events. This is currently supported only with
    tf.distribute.MultiWorkerMirroredStrategy on Google Cloud Platform or
    Google Borg.
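To make the step-based and keep-checkpoint options above concrete, here is a small hedged sketch (the directory name and frequency are arbitrary choices for illustration): save_freq=2 saves every 2 batches, and delete_checkpoint=False leaves the backup checkpoint in place after fit completes.

import numpy as np
import tensorflow as tf

callback = tf.keras.callbacks.BackupAndRestore(
    backup_dir='/tmp/backup_steps',  # must not be shared with other runs or callbacks
    save_freq=2,                     # save every 2 batches instead of every epoch
    delete_checkpoint=False)         # keep the checkpoint after training finishes

model = tf.keras.models.Sequential([tf.keras.layers.Dense(10)])
model.compile(tf.keras.optimizers.SGD(), loss='mse')
model.fit(np.arange(100).reshape(5, 20), np.zeros(5),
          epochs=2, batch_size=1, callbacks=[callback], verbose=0)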
Methods

set_model

    set_model(
        model
    )

set_params

    set_params(
        params
    )