Interface for service job manager.



Ensures necessary service jobs are started and healthy for the pipeline.

Service jobs are long-running jobs associated with a node or the pipeline that persist across executions (e.g., worker pools, TensorBoard). They are started before the nodes that depend on them.

ensure_services is called periodically in the orchestration loop and is expected to:

  1. Start any service jobs required by the pipeline nodes.
  2. Probe job health and handle failures. If a service job fails permanently, the uid of the corresponding node should be returned.
  3. Optionally stop service jobs that are no longer needed. Whether a service job is still needed depends on context. For example, in a typical sync pipeline, one may want the TensorBoard job to keep running even after the corresponding trainer has completed, while others, such as worker pool services, may be shut down.

pipeline_state: A PipelineState object for an active pipeline.

Returns a list of NodeUids for the nodes whose service jobs are in a state of permanent failure.
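
The ensure_services contract described above can be illustrated with a small, self-contained sketch. It is not the library's implementation: the NodeUid and PipelineState classes below are simplified, hypothetical stand-ins for the real types, the nodes_needing_service field is invented for the example, and job status is tracked in a dictionary instead of launching real jobs.

```python
import abc
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class NodeUid:
    """Hypothetical stand-in for the real node identifier type."""
    pipeline_id: str
    node_id: str


@dataclass
class PipelineState:
    """Hypothetical stand-in: maps node id -> whether its service job is still needed."""
    pipeline_id: str
    nodes_needing_service: Dict[str, bool] = field(default_factory=dict)


class ServiceJobManager(abc.ABC):
    """Assumed shape of the interface; only ensure_services is sketched here."""

    @abc.abstractmethod
    def ensure_services(self, pipeline_state: PipelineState) -> List[NodeUid]:
        """Returns uids of nodes whose service jobs have permanently failed."""


class InMemoryServiceJobManager(ServiceJobManager):
    """Toy manager that records job status ("running" or "failed") in a dict."""

    def __init__(self) -> None:
        self.jobs: Dict[NodeUid, str] = {}

    def ensure_services(self, pipeline_state: PipelineState) -> List[NodeUid]:
        failed: List[NodeUid] = []
        for node_id, needed in pipeline_state.nodes_needing_service.items():
            uid = NodeUid(pipeline_state.pipeline_id, node_id)
            if not needed:
                # Step 3 (optional): drop jobs that are no longer needed.
                self.jobs.pop(uid, None)
                continue
            # Step 1: "start" the job if it is not already tracked.
            status = self.jobs.setdefault(uid, "running")
            # Step 2: probe health and collect permanent failures.
            if status == "failed":
                failed.append(uid)
        return failed


manager = InMemoryServiceJobManager()
state = PipelineState("p1", {"trainer": True, "tuner": True})
first = manager.ensure_services(state)            # starts both jobs
manager.jobs[NodeUid("p1", "tuner")] = "failed"   # simulate a permanent failure
second = manager.ensure_services(state)           # reports the failed node
```

Because the method is called on every pass of the orchestration loop, the sketch is written to be idempotent: calling it again for an already running job changes nothing.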


Stops all service jobs associated with the pipeline.

pipeline_state: A PipelineState object for an active pipeline.
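
A minimal sketch of the stop-all behavior follows, under stated assumptions: the method name stop_services, the ToyPipelineState class, and the in-memory registry are all hypothetical and chosen only for illustration. The point it shows is that stopping should affect only the jobs of the given pipeline and should be safe to call more than once.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class ToyPipelineState:
    """Hypothetical stand-in for PipelineState; carries only a pipeline id."""
    pipeline_id: str


class ToyServiceJobRegistry:
    """Hypothetical registry of running jobs keyed by (pipeline_id, node_id)."""

    def __init__(self) -> None:
        self.jobs: Dict[Tuple[str, str], str] = {}

    def start(self, pipeline_id: str, node_id: str) -> None:
        self.jobs[(pipeline_id, node_id)] = "running"

    def stop_services(self, pipeline_state: ToyPipelineState) -> int:
        """Stops every job of the given pipeline; returns how many were stopped."""
        # Collect keys first so the dict is not mutated while being iterated.
        keys = [k for k in self.jobs if k[0] == pipeline_state.pipeline_id]
        for key in keys:
            del self.jobs[key]
        return len(keys)


registry = ToyServiceJobRegistry()
registry.start("p1", "trainer")
registry.start("p1", "tuner")
registry.start("p2", "trainer")

stopped_first = registry.stop_services(ToyPipelineState("p1"))   # stops both p1 jobs
stopped_again = registry.stop_services(ToyPipelineState("p1"))   # nothing left to stop
```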