Action that saves on-demand checkpoints after a preemption.
orbit.actions.SaveCheckpointIfPreempted(
cluster_resolver: tf.distribute.cluster_resolver.ClusterResolver,
checkpoint_manager: tf.train.CheckpointManager,
checkpoint_number: Optional[tf.Variable] = None,
keep_running_after_save: Optional[bool] = False
)
Args | |
---|---|
cluster_resolver | A tf.distribute.cluster_resolver.ClusterResolver object. |
checkpoint_manager | A tf.train.CheckpointManager object. |
checkpoint_number | A tf.Variable that indicates the checkpoint number for the checkpoint manager. Usually this is the global step. |
keep_running_after_save | Whether to keep the job running after the on-demand preemption checkpoint is saved. Set to True only when in-process preemption recovery with tf.distribute.experimental.PreemptionWatcher is enabled. |
Methods
__call__
__call__(
_
) -> None
Call self as a function.
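The call contract above (a callable that takes one argument, ignores it, and returns None) can be illustrated with a minimal, TensorFlow-free sketch. Every name here (PreemptionSaveSketch, run_steps, preempted_at_step, saved_steps, stopped) is hypothetical and exists only for illustration; none of it is part of Orbit or TensorFlow. The real action relies on TensorFlow's preemption handling, which this sketch replaces with a simple call counter, and it models "stopping the job" as a flag that the loop checks.

```python
from typing import Any, Dict, List


class PreemptionSaveSketch:
    """Hypothetical stand-in for the action's call contract.

    This is NOT the Orbit implementation. Preemption is simulated by
    counting calls instead of querying a cluster.
    """

    def __init__(self, preempted_at_step: int,
                 keep_running_after_save: bool = False):
        self.preempted_at_step = preempted_at_step
        self.keep_running_after_save = keep_running_after_save
        self.saved_steps: List[int] = []  # steps at which a "checkpoint" was saved
        self.stopped = False              # assumption: models the job exiting
        self._calls = 0

    def __call__(self, _: Any) -> None:
        # The argument (the step output) is ignored, matching `__call__(_)`.
        self._calls += 1
        if self._calls == self.preempted_at_step:
            # Simulated on-demand checkpoint.
            self.saved_steps.append(self._calls)
            # By default the job stops after saving; with
            # keep_running_after_save=True it continues.
            self.stopped = not self.keep_running_after_save


def run_steps(num_steps: int, action: PreemptionSaveSketch) -> int:
    """Hypothetical loop that invokes the action after every step."""
    completed = 0
    for step in range(1, num_steps + 1):
        output: Dict[str, int] = {"step": step}  # placeholder step output
        completed = step
        action(output)
        if action.stopped:
            break
    return completed


# Default behavior: save once on the simulated preemption, then stop.
default_action = PreemptionSaveSketch(preempted_at_step=3)
steps_default = run_steps(5, default_action)

# keep_running_after_save=True: save once, then keep going.
recovering_action = PreemptionSaveSketch(
    preempted_at_step=3, keep_running_after_save=True)
steps_recovering = run_steps(5, recovering_action)
```

In this sketch the default action ends the loop right after the simulated save, while the variant with keep_running_after_save=True runs all remaining steps. With the real class, the actual behavior depends on the cluster and on whether in-process preemption recovery is enabled.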