Callback to back up and restore the training state.
Inherits From: Callback
tf.keras.callbacks.BackupAndRestore(
backup_dir, save_freq='epoch', delete_checkpoint=True
)
The BackupAndRestore callback is intended to recover training from an interruption that happens in the middle of a Model.fit execution, by backing up the training state in a temporary checkpoint file at the end of each epoch. Each backup overwrites the previously written checkpoint file, so at any given time there is at most one such checkpoint file for backup and restore purposes.
If training restarts before completion, the training state (which includes the Model weights and epoch number) is restored to the most recently saved state at the beginning of a new Model.fit run. At the completion of a Model.fit run, the temporary checkpoint file is deleted.
Note that the user is responsible for bringing jobs back up after an interruption. This callback is an important part of the backup and restore mechanism for fault tolerance, and the model restored from a previous checkpoint is expected to be the same as the one that was backed up. If the user changes the arguments passed to compile or fit, the checkpoint saved for fault tolerance can become invalid.
Example:
import numpy as np
from tensorflow import keras

class InterruptingCallback(keras.callbacks.Callback):
    # Simulates an interruption at the start of zero-indexed epoch 4.
    def on_epoch_begin(self, epoch, logs=None):
        if epoch == 4:
            raise RuntimeError('Interrupting!')

callback = keras.callbacks.BackupAndRestore(backup_dir="/tmp/backup")
model = keras.models.Sequential([keras.layers.Dense(10)])
model.compile(keras.optimizers.SGD(), loss='mse')
try:
    model.fit(np.arange(100).reshape(5, 20), np.zeros(5), epochs=10,
              batch_size=1, callbacks=[callback, InterruptingCallback()],
              verbose=0)
except RuntimeError:
    pass  # Expected: the simulated interruption raised above.
history = model.fit(np.arange(100).reshape(5, 20), np.zeros(5),
                    epochs=10, batch_size=1, callbacks=[callback],
                    verbose=0)
# Only 6 more epochs are run: the first training got interrupted at
# zero-indexed epoch 4, so the second training continues from 4 to 9.
print(len(history.history['loss']))  # 6
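To make the mechanism behind the example concrete, the following is a minimal sketch in plain Python, with no Keras involved. It is not the callback's implementation: the train function, the JSON file format, and the fail_at parameter are all hypothetical and exist only to illustrate the three steps described above (restore at the start of a run, overwrite a single backup at the end of each epoch, delete the backup on completion).

```python
import json
import os
import tempfile


def train(backup_path, epochs, fail_at=None):
    """Toy training loop (hypothetical); returns the epochs actually run."""
    state = {"epoch": 0, "weight": 0.0}
    # Restore: if a backup exists, resume from the state it recorded.
    if os.path.exists(backup_path):
        with open(backup_path) as f:
            state = json.load(f)

    ran = []
    for epoch in range(state["epoch"], epochs):
        if fail_at is not None and epoch == fail_at:
            raise RuntimeError("Interrupting!")
        state["weight"] += 1.0  # stand-in for one epoch of training
        ran.append(epoch)
        # Back up: overwrite the single checkpoint file at the end of the epoch.
        state["epoch"] = epoch + 1
        with open(backup_path, "w") as f:
            json.dump(state, f)

    # Training completed: the temporary checkpoint is no longer needed.
    if os.path.exists(backup_path):
        os.remove(backup_path)
    return ran


backup_path = os.path.join(tempfile.mkdtemp(), "backup.json")
try:
    train(backup_path, epochs=10, fail_at=4)
except RuntimeError:
    pass  # epochs 0-3 finished and were backed up before the interruption

resumed = train(backup_path, epochs=10)
print(resumed)  # [4, 5, 6, 7, 8, 9]
```

As in the Keras example, the second run covers only the six remaining epochs, and no backup file is left behind once it finishes.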
Methods
on_batch_begin
on_batch_begin(
batch, logs=None
)
A backwards compatibility alias for on_train_batch_begin.
on_batch_end
on_batch_end(
batch, logs=None
)
A backwards compatibility alias for on_train_batch_end.
on_epoch_begin
on_epoch_begin(
epoch, logs=None
)
Called at the start of an epoch.
Subclasses should override for any actions to run. This function should only be called during TRAIN mode.
Args | |
---|---|
epoch | Integer, index of epoch. |
logs | Dict. Currently no data is passed to this argument for this method but that may change in the future. |
on_epoch_end
on_epoch_end(
epoch, logs=None
)
Called at the end of an epoch.
Subclasses should override for any actions to run. This function should only be called during TRAIN mode.
Args | |
---|---|
epoch | Integer, index of epoch. |
logs | Dict, metric results for this training epoch, and for the validation epoch if validation is performed. Validation result keys are prefixed with val_. For training epoch, the values of the Model's metrics are returned. Example: {'loss': 0.2, 'accuracy': 0.7}. |
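Because validation keys carry the val_ prefix, code that consumes the logs dict can separate the two groups of metrics by key. The sketch below shows one way to do that in plain Python. The split_logs helper is hypothetical (it is not part of Keras), and the validation values in the sample dict are made up for illustration.

```python
def split_logs(logs):
    """Separate training metrics from validation metrics (hypothetical helper)."""
    logs = logs or {}  # logs defaults to None in the callback signatures
    prefix = "val_"
    train = {k: v for k, v in logs.items() if not k.startswith(prefix)}
    # Strip the prefix so both dicts share the same metric names.
    val = {k[len(prefix):]: v for k, v in logs.items() if k.startswith(prefix)}
    return train, val


train_metrics, val_metrics = split_logs(
    {"loss": 0.2, "accuracy": 0.7, "val_loss": 0.3, "val_accuracy": 0.6}
)
print(train_metrics)  # {'loss': 0.2, 'accuracy': 0.7}
print(val_metrics)    # {'loss': 0.3, 'accuracy': 0.6}
```

A helper like this could be called from an on_epoch_end override; handling None up front keeps it safe for hooks that receive no data.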
on_predict_batch_begin
on_predict_batch_begin(
batch, logs=None
)
Called at the beginning of a batch in predict methods.
Subclasses should override for any actions to run.
Note that if the steps_per_execution argument to compile in Model is set to N, this method will only be called every N batches.
Args | |
---|---|
batch | Integer, index of batch within the current epoch. |
logs | Dict. Currently no data is passed to this argument for this method but that may change in the future. |
on_predict_batch_end
on_predict_batch_end(
batch, logs=None
)
Called at the end of a batch in predict methods.
Subclasses should override for any actions to run.
Note that if the steps_per_execution argument to compile in Model is set to N, this method will only be called every N batches.
Args | |
---|---|
batch | Integer, index of batch within the current epoch. |
logs | Dict. Aggregated metric results up until this batch. |
on_predict_begin
on_predict_begin(
logs=None
)
Called at the beginning of prediction.
Subclasses should override for any actions to run.
Args | |
---|---|
logs | Dict. Currently no data is passed to this argument for this method but that may change in the future. |
on_predict_end
on_predict_end(
logs=None
)
Called at the end of prediction.
Subclasses should override for any actions to run.
Args | |
---|---|
logs | Dict. Currently no data is passed to this argument for this method but that may change in the future. |
on_test_batch_begin
on_test_batch_begin(
batch, logs=None
)
Called at the beginning of a batch in evaluate methods.
Also called at the beginning of a validation batch in the fit methods, if validation data is provided.
Subclasses should override for any actions to run.
Note that if the steps_per_execution argument to compile in Model is set to N, this method will only be called every N batches.
Args | |
---|---|
batch | Integer, index of batch within the current epoch. |
logs | Dict. Currently no data is passed to this argument for this method but that may change in the future. |
on_test_batch_end
on_test_batch_end(
batch, logs=None
)
Called at the end of a batch in evaluate methods.
Also called at the end of a validation batch in the fit methods, if validation data is provided.
Subclasses should override for any actions to run.
Note that if the steps_per_execution argument to compile in Model is set to N, this method will only be called every N batches.
Args | |
---|---|
batch | Integer, index of batch within the current epoch. |
logs | Dict. Aggregated metric results up until this batch. |
on_test_begin
on_test_begin(
logs=None
)
Called at the beginning of evaluation or validation.
Subclasses should override for any actions to run.
Args | |
---|---|
logs | Dict. Currently no data is passed to this argument for this method but that may change in the future. |
on_test_end
on_test_end(
logs=None
)
Called at the end of evaluation or validation.
Subclasses should override for any actions to run.
Args | |
---|---|
logs | Dict. Currently the output of the last call to on_test_batch_end() is passed to this argument for this method but that may change in the future. |
on_train_batch_begin
on_train_batch_begin(
batch, logs=None
)
Called at the beginning of a training batch in fit methods.
Subclasses should override for any actions to run.
Note that if the steps_per_execution argument to compile in Model is set to N, this method will only be called every N batches.
Args | |
---|---|
batch | Integer, index of batch within the current epoch. |
logs | Dict. Currently no data is passed to this argument for this method but that may change in the future. |
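The effect of steps_per_execution on how often a batch-level hook fires can be illustrated with a small plain-Python sketch. This is an assumption-laden model, not Keras code: the hook_batch_indices function is hypothetical, and it assumes the begin hook fires at the first batch of each group of N batches, a detail this page does not spell out.

```python
def hook_batch_indices(total_batches, steps_per_execution):
    """Batch indices at which a begin hook would fire (hypothetical model).

    Assumes one call per group of `steps_per_execution` batches, made at
    the first batch of the group.
    """
    return list(range(0, total_batches, steps_per_execution))


print(hook_batch_indices(10, 1))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(hook_batch_indices(10, 4))  # [0, 4, 8]
```

Under these assumptions, a callback that relies on seeing every batch index (for example, to log each step) sees far fewer calls once N is greater than 1.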
on_train_batch_end
on_train_batch_end(
batch, logs=None
)
Called at the end of a training batch in fit methods.
Subclasses should override for any actions to run.
Note that if the steps_per_execution argument to compile in Model is set to N, this method will only be called every N batches.
Args | |
---|---|
batch | Integer, index of batch within the current epoch. |
logs | Dict. Aggregated metric results up until this batch. |
on_train_begin
on_train_begin(
logs=None
)
Get training state from temporary file and restore it.
on_train_end
on_train_end(
logs=None
)
Called at the end of training.
Subclasses should override for any actions to run.
Args | |
---|---|
logs | Dict. Currently the output of the last call to on_epoch_end() is passed to this argument for this method but that may change in the future. |
set_model
set_model(
model
)
set_params
set_params(
params
)