

Interface for service job manager.




Ensures necessary service jobs are started and healthy for the pipeline.

Service jobs are long-running jobs associated with a node or the pipeline that persist across executions (e.g., worker pools, TensorBoard). Service jobs are started before the nodes that depend on them.

ensure_services will be called in the orchestration loop periodically and is expected to:

  1. Start any service jobs required by the pipeline nodes.
  2. Probe job health and handle failures. If a service job fails, the corresponding node uids should be returned.
  3. Optionally stop service jobs that are no longer needed. Whether a service job is still needed depends on context. For example, in a typical sync pipeline, one may want the TensorBoard job to keep running even after the corresponding trainer has completed, while others, such as worker pool services, may be shut down.
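The three responsibilities above can be sketched with a small in-memory manager. This is an illustration only, not the TFX implementation: the PipelineState stand-in, the InMemoryServiceJobManager class, the keep_alive option, the string job statuses, and the stop_services method name are all assumptions made for the example.

```python
import abc
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Set


@dataclass
class PipelineState:
    """Hypothetical stand-in: maps each node uid to the service jobs it needs."""
    required_jobs: Dict[str, List[str]]
    finished_nodes: Set[str] = field(default_factory=set)


class ServiceJobManager(abc.ABC):
    """Sketch of the interface described above (method names partly assumed)."""

    @abc.abstractmethod
    def ensure_services(self, pipeline_state: PipelineState) -> List[str]:
        """Returns uids of nodes whose service jobs have permanently failed."""

    @abc.abstractmethod
    def stop_services(self, pipeline_state: PipelineState) -> None:
        """Stops all service jobs associated with the pipeline."""


class InMemoryServiceJobManager(ServiceJobManager):
    """Toy manager that tracks job status in a dict instead of real processes."""

    def __init__(self, keep_alive: Iterable[str] = ()):
        self.status: Dict[str, str] = {}  # job -> "running" | "failed" | "stopped"
        self.keep_alive = set(keep_alive)  # jobs that outlive their node

    def ensure_services(self, pipeline_state: PipelineState) -> List[str]:
        failed_nodes: List[str] = []
        needed: Set[str] = set()
        for node_uid, jobs in pipeline_state.required_jobs.items():
            node_done = node_uid in pipeline_state.finished_nodes
            for job in jobs:
                if not node_done or job in self.keep_alive:
                    needed.add(job)
                    # Step 1: start jobs that are required but not running.
                    if self.status.get(job) in (None, "stopped"):
                        self.status[job] = "running"
            # Step 2: probe health and report nodes with failed jobs.
            if not node_done and any(self.status.get(j) == "failed" for j in jobs):
                failed_nodes.append(node_uid)
        # Step 3: stop running jobs that no node needs any more.
        for job, state in self.status.items():
            if job not in needed and state == "running":
                self.status[job] = "stopped"
        return failed_nodes

    def stop_services(self, pipeline_state: PipelineState) -> None:
        for jobs in pipeline_state.required_jobs.values():
            for job in jobs:
                if job in self.status:
                    self.status[job] = "stopped"


# Usage: the trainer needs a worker pool and TensorBoard.
demo_state = PipelineState({"trainer": ["worker_pool", "tensorboard"]})
demo_manager = InMemoryServiceJobManager(keep_alive={"tensorboard"})
demo_manager.ensure_services(demo_state)      # starts both jobs
demo_state.finished_nodes.add("trainer")
demo_manager.ensure_services(demo_state)      # stops worker_pool, keeps tensorboard
```

Marking TensorBoard as keep_alive mirrors the sync-pipeline example in step 3. A real manager would apply whatever context-dependent policy suits the deployment.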

Args:
  pipeline_state: A PipelineState object for an active pipeline.

Returns:
  List of NodeUids of nodes whose service jobs are in a state of permanent failure.



Stops all service jobs associated with the pipeline.

Args:
  pipeline_state: A PipelineState object for an active pipeline.
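To show how the two methods might fit together, here is a rough sketch of an orchestration loop that calls ensure_services on every iteration and stops all service jobs on exit. The run_orchestration_loop function, the ServiceJobManagerLike protocol, the is_done callback, the max_iterations limit, and the stop_services method name are hypothetical; stopping the loop on the first reported failure is one possible policy, not something this page prescribes.

```python
from typing import Callable, List, Protocol, Set


class ServiceJobManagerLike(Protocol):
    """Structural type for any object offering the two methods used below."""

    def ensure_services(self, pipeline_state: object) -> List[str]: ...

    def stop_services(self, pipeline_state: object) -> None: ...


def run_orchestration_loop(
    manager: ServiceJobManagerLike,
    pipeline_state: object,
    is_done: Callable[[], bool],
    max_iterations: int = 100,
) -> Set[str]:
    """Returns the uids of nodes whose service jobs failed permanently."""
    failed_nodes: Set[str] = set()
    try:
        for _ in range(max_iterations):
            # Start or probe service jobs before any node work is scheduled.
            failed_nodes.update(manager.ensure_services(pipeline_state))
            if failed_nodes or is_done():
                break
            # A real loop would schedule node executions and wait here.
    finally:
        # Always release service jobs, even if an iteration raised.
        manager.stop_services(pipeline_state)
    return failed_nodes
```

Placing the stop call in a finally block is a design choice for this sketch, so that service jobs are not leaked when the loop exits early.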