Convergence criterion based on inner products between successive gradients.
Inherits From: ConvergenceCriterion
tfp.substrates.numpy.optimizer.convergence_criteria.SuccessiveGradientsAreUncorrelated(
window_size=10, min_num_steps=20, name=None
)
Let g[t] be the gradient vector at step t, and g[t-1] the previous
gradient. Their inner product:
grad_inner_product[t] = sum_i(g[t, i] * g[t - 1, i])
measures correlation between successive optimization steps. We expect this to be positive if the optimization is making progress; conversely, it can be shown to be negative in expectation at the stationary distribution of constant-step-size SGD [(Pflug, 1990)][2].
This criterion detects convergence when an exponentially-weighted moving
average of grad_inner_product becomes negative; intuitively, this happens when
there has been no consistent direction of progress across the most recent
window_size steps.
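To make the mechanics concrete, here is a minimal numpy sketch of the tracked
quantity: the inner product of successive gradients summed across all
variables, smoothed by an exponentially-weighted moving average whose
effective window is `window_size`. The decay formula and helper names below
are illustrative assumptions, not the library's exact internals.

```python
import numpy as np

def grad_inner_product(grads, prev_grads):
  """Inner product of successive gradients, summed over all variables."""
  return sum(np.sum(g * pg) for g, pg in zip(grads, prev_grads))

def update_and_check(step, ewma, inner_product,
                     window_size=10, min_num_steps=20):
  """Illustrative EWMA update and convergence test (assumed decay formula)."""
  decay = 1. - 1. / window_size  # assumed mapping from window_size to decay
  ewma = decay * ewma + (1. - decay) * inner_product
  has_converged = (step >= min_num_steps) and (ewma < 0.)
  return has_converged, ewma
```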
Theoretical analysis shows that with no decay (window_size=np.inf), this
rule stops in finite time almost surely, for constant-step-size SGD under
standard assumptions ([Pflug, 1990][2]; [Chee and Toulis, 2017][1]). In
practice, it is often more efficient to use a decaying moving average.
Batch semantics: because this criterion does not depend on the loss,
vector-valued losses will not produce vector-valued convergence indicators.
Instead, the returned has_converged is always scalar, and is computed from
the inner product summed across gradients from all variables being
optimized.
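For orientation, a hedged usage sketch with the TensorFlow backend (this page
documents the numpy substrate, whose analogous symbols live under
`tfp.substrates.numpy`); it assumes `tfp.math.minimize` accepts a
`convergence_criterion` argument:

```python
import tensorflow as tf
import tensorflow_probability as tfp

x = tf.Variable(5.)
losses = tfp.math.minimize(
    loss_fn=lambda: (x - 3.)**2,
    num_steps=500,
    optimizer=tf.optimizers.SGD(learning_rate=0.1),
    convergence_criterion=(
        tfp.optimizer.convergence_criteria.SuccessiveGradientsAreUncorrelated(
            window_size=10, min_num_steps=20)))
# Parameter updates stop once the criterion reports convergence. The
# convergence indicator is scalar even for batched losses, since the
# criterion depends only on gradients summed across all variables.
```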
References
[1] Jerry Chee and Panos Toulis. Convergence diagnostics for stochastic gradient descent with constant step size. _arXiv preprint arXiv:1710.06382_, 2017. https://arxiv.org/abs/1710.06382
[2] Georg Ch. Pflug. Non-asymptotic confidence bounds for stochastic approximation algorithms with constant step size. _Monatshefte für Mathematik_, 110(3-4), pp. 297-314, 1990.
| Attributes | |
|---|---|
| `min_num_steps` | Minimum number of steps before convergence may be reported. |
| `name` | Python `str` name for this convergence criterion. |
| `window_size` | Effective window size for the exponentially-weighted moving average of successive-gradient inner products. |
Methods
bootstrap
bootstrap(
loss, grads, parameters
)
Returns a structure of Tensors for the rule's state at step 0.
The shape of the Tensors specifying loss, grads, and parameters may
optionally be prefixed by one or more batch dimension(s).
| Args | |
|---|---|
| `loss` | float `Tensor` initial value of the loss being optimized. |
| `grads` | list of float `Tensor` gradients of `loss` wrt `parameters`. |
| `parameters` | list of float `Tensor` initial values of the parameters being optimized. |
| Returns | |
|---|---|
| `initial_auxiliary_state` | (Structure of) `Tensor`(s) representing the initial auxiliary state carried forward by this criterion. |
one_step
one_step(
step, loss, grads, parameters, auxiliary_state
)
Updates tracked quantities for a new step, and determines if converged.
The shape of the Tensors specifying loss, grads, and parameters may
optionally be prefixed by one or more batch dimension(s). In this case,
the returned value has_converged will have shape equal to the broadcast
batch shape of whichever of those quantities is used by this convergence
criterion, and the quantities defining the convergence criterion (
min_num_steps, etc.).
| Args | |
|---|---|
| `step` | integer `Tensor` index of the current step, where `step >= 1` (on step 0, `bootstrap` should be called instead). |
| `loss` | float `Tensor` value of the loss at the current step. |
| `grads` | list of float `Tensor` gradients of `loss` wrt `parameters`. |
| `parameters` | list of float `Tensor` current values of the parameters being optimized. |
| `auxiliary_state` | the (structure of) `Tensor`(s) containing state carried forward from the previous step. |
| Returns | |
|---|---|
| `has_converged` | boolean `Tensor` indicating whether the optimization has converged. |
| `updated_auxiliary_state` | (Structure of) `Tensor`(s) representing updated quantities tracked by the convergence criterion. This should match the structure of the value returned by `bootstrap`. |
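For orientation, a minimal sketch of how `bootstrap` and `one_step` fit
together when driving the criterion by hand. The loss, gradient, and update
rule below are illustrative placeholders, and the sketch assumes the numpy
substrate accepts plain numpy arrays and Python ints where `Tensor`s are
documented.

```python
import numpy as np
from tensorflow_probability.substrates import numpy as tfp

rng = np.random.default_rng(seed=0)

def loss_and_grads(params):
  # Quadratic loss with noisy gradients, to mimic minibatch SGD (placeholder).
  x, = params
  noise = rng.normal(scale=0.5, size=x.shape)
  return np.sum((x - 3.)**2), [2. * (x - 3.) + noise]

params = [np.array([5., -2.])]
criterion = tfp.optimizer.convergence_criteria.SuccessiveGradientsAreUncorrelated(
    window_size=10, min_num_steps=20)

loss, grads = loss_and_grads(params)
aux = criterion.bootstrap(loss, grads, params)        # state at step 0
for step in range(1, 200):
  params = [p - 0.1 * g for p, g in zip(params, grads)]  # plain SGD update
  loss, grads = loss_and_grads(params)
  has_converged, aux = criterion.one_step(step, loss, grads, params, aux)
  if has_converged:
    break
```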