tf.keras.callbacks.experimental.BackupAndRestore
Callback to back up and restore the training state.
Inherits From: Callback
tf.keras.callbacks.experimental.BackupAndRestore(
    backup_dir
)
The `BackupAndRestore` callback is intended to recover training from an interruption that occurs in the middle of a `model.fit` execution, by backing up the training state in a temporary checkpoint file (based on the TF `CheckpointManager`) at the end of each epoch. If training is restarted before completion, the training state and model are restored to the most recently saved state at the beginning of a new `model.fit` run. Note that the user is responsible for bringing jobs back up.
This callback provides the backup-and-restore mechanism for fault tolerance. The model restored from a previous checkpoint is expected to be the same as the one used for the backup; if the user changes the arguments passed to `compile` or `fit`, the checkpoint saved for fault tolerance can become invalid.
Note:
- This callback is not compatible with disabling eager execution.
- A checkpoint is saved at the end of each epoch. When restoring, any partial work from the unfinished epoch in which training was interrupted is redone, so the work done before an interruption doesn't affect the final model state.
- This works for both single-worker and multi-worker modes; only MirroredStrategy and MultiWorkerMirroredStrategy are supported for now (see the multi-worker sketch after this list).
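As a minimal sketch of the multi-worker case, assuming the TF 2.3-era tf.distribute.experimental.MultiWorkerMirroredStrategy, a TF_CONFIG environment variable already configured for each worker, and a hypothetical backup path:

  import numpy as np
  import tensorflow as tf

  # Assumes TF_CONFIG is already set for this worker (cluster spec and task).
  strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

  # The model must be built and compiled inside the strategy's scope.
  with strategy.scope():
    model = tf.keras.models.Sequential([tf.keras.layers.Dense(10)])
    model.compile(tf.keras.optimizers.SGD(), loss='mse')

  # Training state is backed up under backup_dir at the end of every epoch;
  # a restarted worker restores from the most recently saved state.
  backup = tf.keras.callbacks.experimental.BackupAndRestore(
      backup_dir='/tmp/backup')
  model.fit(np.arange(100).reshape(5, 20), np.zeros(5),
            epochs=10, callbacks=[backup], verbose=0)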
Example:
import numpy as np
import tensorflow as tf

class InterruptingCallback(tf.keras.callbacks.Callback):
  def on_epoch_begin(self, epoch, logs=None):
    if epoch == 4:
      raise RuntimeError('Interrupting!')

callback = tf.keras.callbacks.experimental.BackupAndRestore(
    backup_dir="/tmp")
model = tf.keras.models.Sequential([tf.keras.layers.Dense(10)])
model.compile(tf.keras.optimizers.SGD(), loss='mse')
try:
  model.fit(np.arange(100).reshape(5, 20), np.zeros(5), epochs=10,
            batch_size=1, callbacks=[callback, InterruptingCallback()],
            verbose=0)
except RuntimeError:
  pass
history = model.fit(np.arange(100).reshape(5, 20), np.zeros(5), epochs=10,
                    batch_size=1, callbacks=[callback], verbose=0)
# Only 6 more epochs are run: the first training was interrupted at
# zero-indexed epoch 4, so the second training continues from epoch 4 to 9.
len(history.history['loss'])
6
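Note that the restarted model.fit call passes the same BackupAndRestore callback, pointing at the same backup_dir; restoration happens at the beginning of that run, which is why training resumes from the last completed epoch instead of starting over.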
Arguments

backup_dir
    String, path to save the model file. This is the directory in which the
    system stores temporary files to recover the model from jobs terminated
    unexpectedly. The directory cannot be reused elsewhere to store other
    checkpoints, e.g. by the BackupAndRestore callback of another training
    run, or by another callback (such as ModelCheckpoint) of the same
    training run.
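For example, a sketch of keeping the fault-tolerance directory separate from other checkpoint locations; the run identifier and paths here are hypothetical:

  import os
  import tensorflow as tf

  run_id = 'run_01'  # hypothetical identifier for this training run

  # Directory reserved exclusively for BackupAndRestore's temporary state.
  backup = tf.keras.callbacks.experimental.BackupAndRestore(
      backup_dir=os.path.join('/tmp/backups', run_id))

  # A ModelCheckpoint in the same training run must write elsewhere entirely.
  checkpoint = tf.keras.callbacks.ModelCheckpoint(
      filepath=os.path.join('/tmp/checkpoints', run_id, 'ckpt-{epoch:02d}'))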
Methods
set_model
set_model(
    model
)
set_params
set_params(
    params
)