A manager for facilitating multiple in-progress evaluations.
tff.learning.programs.EvaluationManager(
    data_source: tff.program.FederatedDataSource,
    aggregated_metrics_manager: Optional[release_manager.ReleaseManager[release_manager.ReleasableStructure, int]],
    create_state_manager_fn: Callable[[str], tff.program.FileProgramStateManager],
    create_process_fn: Callable[[str], tuple[learning_process.LearningProcess, Optional[release_manager.ReleaseManager[release_manager.ReleasableStructure, int]]]],
    cohort_size: int,
    duration: datetime.timedelta = datetime.timedelta(hours=24)
)
This manager has three responsibilities:
- Prepares, starts, and tracks new evaluation loops. This involves creating a new evaluation process and a state manager for that process, adding the new process to the list of tracked in-progress evaluations, and creating a new asyncio.Task to run the evaluation loop.
- Records evaluations that have finished. This removes the evaluation from the list of in-progress evaluations.
- If the program has restarted, loads the most recent state of in-progress evaluations and restarts each of the evaluations.
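The first two responsibilities can be pictured with a small toy model. This is a minimal sketch using only the standard library; `ToyEvaluationTracker`, its `num_eval_rounds` parameter, and the `in_progress` helper are hypothetical names for illustration and are not part of TFF. Resuming after a restart is left out.

```python
import asyncio


class ToyEvaluationTracker:
    """Hypothetical stand-in for the tracking behaviour; not TFF code."""

    def __init__(self):
        # Maps train_round -> asyncio.Task running that round's evaluation loop.
        self._tasks = {}

    async def start_evaluation(self, train_round, num_eval_rounds):
        # Create a background task for the loop and start tracking it.
        task = asyncio.create_task(self._run_loop(train_round, num_eval_rounds))
        self._tasks[train_round] = task

    async def _run_loop(self, train_round, num_eval_rounds):
        for _ in range(num_eval_rounds):
            await asyncio.sleep(0)  # Placeholder for one evaluation round.
        self.record_evaluations_finished(train_round)

    def record_evaluations_finished(self, train_round):
        # Stop tracking a finished evaluation; untracked rounds are an error.
        if train_round not in self._tasks:
            raise RuntimeError(f"Round {train_round} is not being tracked.")
        del self._tasks[train_round]

    def in_progress(self):
        return sorted(self._tasks)

    async def wait_for_evaluations_to_finish(self):
        # Snapshot the tasks first, because finishing loops untrack themselves.
        await asyncio.gather(*list(self._tasks.values()))


async def demo():
    tracker = ToyEvaluationTracker()
    await tracker.start_evaluation(train_round=1, num_eval_rounds=3)
    await tracker.start_evaluation(train_round=2, num_eval_rounds=2)
    started = tracker.in_progress()
    await tracker.wait_for_evaluations_to_finish()
    return started, tracker.in_progress()


started, remaining = asyncio.run(demo())
```

Keying the tasks by training round keeps the "record finished" step a simple dictionary removal, which is also where an untracked round can be detected.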
This class uses N + 1 tff.program.ProgramStateManagers to enable resumable evaluations.
- The first state manager is for this class itself, and manages the list of in-progress evaluations via two tensor objects. Tensor objects must be used (rather than Python lists) because tff.program.FileProgramStateManager does not support state objects that change Python structure across versions (e.g. to load the next version, we must know its shape, but after a restart we don't). Instead, we can use tensor or ndarray objects with shape [None] to support changing shapes of the structure's leaf elements.
- The next N state managers manage the cross-round metric aggregation, one for each evaluation process started.
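The fixed-structure idea can be sketched as follows. The example uses the standard library `array` type as a stand-in for a tensor or ndarray with shape [None], and it assumes (this is not stated above) that the two objects hold training round numbers and start timestamps. All function names are hypothetical.

```python
from array import array


def empty_state():
    # The Python structure is always a 2-tuple of arrays; only the array
    # lengths change between saved versions.
    return (array("q"), array("q"))  # (train rounds, start timestamps)


def add_evaluation(state, train_round, start_timestamp_seconds):
    rounds, starts = state
    return (
        rounds + array("q", [train_round]),
        starts + array("q", [start_timestamp_seconds]),
    )


def remove_evaluation(state, train_round):
    rounds, starts = state
    if train_round not in rounds:
        raise RuntimeError(f"Round {train_round} is not being tracked.")
    # Keep every entry except the finished round, in both arrays.
    keep = [i for i, r in enumerate(rounds) if r != train_round]
    return (
        array("q", [rounds[i] for i in keep]),
        array("q", [starts[i] for i in keep]),
    )


state = empty_state()
state = add_evaluation(state, train_round=10, start_timestamp_seconds=1000)
state = add_evaluation(state, train_round=20, start_timestamp_seconds=2000)
state = remove_evaluation(state, train_round=10)
```

A loader that only knows "a tuple of two one-dimensional integer arrays" can read any version of this state, whereas a Python list of per-evaluation records would change its structure each time an evaluation starts or finishes.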
| Args | |
|---|---|
| data_source | A tff.program.FederatedDataSource that the manager will use to create iterators for evaluation loops. |
| aggregated_metrics_manager | A tff.program.ReleaseManager for releasing the total aggregated metrics at the end of the evaluation loop. |
| create_state_manager_fn | A callable that returns a tff.program.FileProgramStateManager. It is used to create the overall evaluation manager's state manager and each per-evaluation-loop state manager that enables checkpointing and resuming. |
| create_process_fn | A callable that returns a 2-tuple of a tff.learning.templates.LearningProcess and a tff.program.ReleaseManager for releasing per-evaluation-round metrics. It will be used to start each evaluation loop. |
| cohort_size | An integer denoting the size of each evaluation round to select from the iterator created from data_source. |
| duration | The datetime.timedelta duration to run each evaluation loop. |
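As a rough illustration of how a duration could bound an evaluation loop, the sketch below computes an end time from a start timestamp in seconds. It assumes the window is measured from the start timestamp, which the table above does not state. The helper names are hypothetical and not part of TFF.

```python
import datetime


def evaluation_end_seconds(start_timestamp_seconds, duration):
    # Assumption: the loop may run until start + duration.
    return start_timestamp_seconds + int(duration.total_seconds())


def is_expired(now_seconds, start_timestamp_seconds, duration):
    return now_seconds >= evaluation_end_seconds(start_timestamp_seconds, duration)


default_duration = datetime.timedelta(hours=24)
end = evaluation_end_seconds(1_000, default_duration)
```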
Methods
record_evaluations_finished
record_evaluations_finished(
train_round
)
Removes the evaluation for train_round from the internal state manager.
| Args | |
|---|---|
| train_round | The integer round number of the training round that has finished evaluation. |

| Raises | |
|---|---|
| RuntimeError | If train_round was not currently being tracked as an in-progress evaluation. |
resume_from_previous_state
resume_from_previous_state()
Loads the most recent state and restarts in-progress evaluations.
start_evaluation
start_evaluation(
train_round, start_timestamp_seconds, model_weights
)
Starts a new evaluation loop for the incoming model_weights.
wait_for_evaluations_to_finish
wait_for_evaluations_to_finish()
Creates an awaitable that blocks until all evaluations are finished.