tfm.core.base_trainer.Recovery

Built-in model blowup recovery module.

Checks the loss value against the given threshold. If applicable, recovers the model by restoring the checkpoint from disk.

Methods

maybe_recover

Conditionally recovers the training by triggering checkpoint restoration.

Args
loss_value: The loss value as a float.
global_step: The number of global training steps.

Raises
RuntimeError: Raised when recovery happens more than the maximum number of trials; the job should then crash.
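
The behavior described for maybe_recover can be illustrated with a small, self-contained sketch. This is not the library's implementation: the class name ToyRecovery, the restore_fn callback, the constructor parameter names, and the exact trigger condition (NaN loss, or loss above the threshold once a warm-up step count has passed) are all assumptions made for illustration. Only the overall pattern follows the text above: check the loss, restore a checkpoint if needed, and raise RuntimeError once the trial budget is exhausted.

```python
import math


class ToyRecovery:
    """Illustrative stand-in for a loss-blowup recovery helper (hypothetical)."""

    def __init__(self, loss_upper_bound, restore_fn,
                 recovery_begin_steps=0, recovery_max_trials=3):
        # All parameter names here are assumptions, not the documented API.
        self.loss_upper_bound = loss_upper_bound
        self.restore_fn = restore_fn  # stands in for checkpoint restoration
        self.recovery_begin_steps = recovery_begin_steps
        self.recovery_max_trials = recovery_max_trials
        self.recover_counter = 0

    def should_recover(self, loss_value, global_step):
        # Assumed rule: a NaN loss always counts as a blowup.
        if math.isnan(loss_value):
            return True
        # Assumed rule: a large loss counts only after the warm-up steps.
        return (global_step >= self.recovery_begin_steps
                and loss_value > self.loss_upper_bound)

    def maybe_recover(self, loss_value, global_step):
        if not self.should_recover(loss_value, global_step):
            return False
        self.recover_counter += 1
        if self.recover_counter > self.recovery_max_trials:
            # Too many recoveries: let the job crash.
            raise RuntimeError(
                f"Loss blew up {self.recover_counter} times; "
                f"limit is {self.recovery_max_trials}.")
        self.restore_fn()
        return True


# Usage: the callback records each simulated restore.
restores = []
recovery = ToyRecovery(loss_upper_bound=100.0,
                       restore_fn=lambda: restores.append("restored"),
                       recovery_max_trials=1)
recovery.maybe_recover(0.5, global_step=10)           # healthy loss, no restore
recovery.maybe_recover(float("nan"), global_step=11)  # NaN loss, one restore
```

Passing the restore step as a callback keeps the sketch independent of any checkpointing library; in real training code that step would typically be delegated to a checkpoint manager.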

should_recover
