Optimization parameters for Adam with TPU embeddings.
tf.compat.v1.tpu.experimental.AdamParameters(
    learning_rate: float,
    beta1: float = 0.9,
    beta2: float = 0.999,
    epsilon: float = 1e-08,
    lazy_adam: bool = True,
    sum_inside_sqrt: bool = True,
    use_gradient_accumulation: bool = True,
    clip_weight_min: Optional[float] = None,
    clip_weight_max: Optional[float] = None,
    weight_decay_factor: Optional[float] = None,
    multiply_weight_decay_factor_by_learning_rate: Optional[bool] = None,
    clip_gradient_min: Optional[float] = None,
    clip_gradient_max: Optional[float] = None
)
Pass this to tf.estimator.tpu.experimental.EmbeddingConfigSpec via the optimization_parameters argument to set the optimizer and its parameters. See the documentation for tf.estimator.tpu.experimental.EmbeddingConfigSpec for more details.
estimator = tf.estimator.tpu.TPUEstimator(
    ...
    embedding_config_spec=tf.estimator.tpu.experimental.EmbeddingConfigSpec(
        ...
        optimization_parameters=tf.tpu.experimental.AdamParameters(0.1),
        ...))
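A construction that also exercises the optional arguments could look like the following sketch. The hyperparameter values are illustrative only, not recommendations; the keyword arguments are those from the signature above.

adam_params = tf.compat.v1.tpu.experimental.AdamParameters(
    learning_rate=0.1,
    beta1=0.9,
    beta2=0.999,
    epsilon=1e-08,
    lazy_adam=True,
    use_gradient_accumulation=True,  # must stay True when clip_gradient_* is set
    clip_gradient_min=-1.0,
    clip_gradient_max=1.0,
    weight_decay_factor=1e-4,
    multiply_weight_decay_factor_by_learning_rate=True)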
Args
learning_rate
A floating point value. The learning rate.
beta1
A float value. The exponential decay rate for the 1st moment
estimates.
beta2
A float value. The exponential decay rate for the 2nd moment
estimates.
epsilon
A small constant for numerical stability.
lazy_adam
Use lazy Adam instead of Adam. Lazy Adam trains faster by only updating the moment estimates and weights of the embedding rows that receive gradients in a step. See optimization_parameters.proto for details, and the toy sketch after this list.
sum_inside_sqrt
When true, the epsilon term is applied inside the square root of the second-moment estimate rather than outside it, which improves training speed. Please see optimization_parameters.proto for details.
use_gradient_accumulation
Setting this to False makes embedding gradient calculation less accurate but faster. Please see optimization_parameters.proto for details.
clip_weight_min
The minimum value to clip the weights by; None means -infinity.
clip_weight_max
The maximum value to clip the weights by; None means +infinity.
weight_decay_factor
Amount of weight decay to apply; None means the weights are not decayed.
multiply_weight_decay_factor_by_learning_rate
If true, weight_decay_factor is multiplied by the current learning rate.
clip_gradient_min
The minimum value to clip the gradients by; None means -infinity. use_gradient_accumulation must be set to True if this is set.
clip_gradient_max
The maximum value to clip the gradients by; None means +infinity. use_gradient_accumulation must be set to True if this is set.
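As a rough intuition for the lazy_adam option: lazy Adam applies the Adam update only to the embedding rows that actually receive gradients in a step, instead of decaying every row's moment estimates. The NumPy sketch below is a toy illustration of that idea using the standard Adam update rule; it is not the TPU embedding implementation, and the function name is made up for illustration.

import numpy as np

def lazy_adam_row_update(table, m, v, row_ids, grads, t,
                         lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """Toy illustration: apply Adam only to rows that received gradients.

    table, m, v: [vocab, dim] arrays (embedding weights, 1st and 2nd moments).
    row_ids:     ids of the embedding rows touched this step.
    grads:       [len(row_ids), dim] gradients for those rows.
    t:           1-based step count, used for bias correction.
    """
    for rid, g in zip(row_ids, grads):
        m[rid] = beta1 * m[rid] + (1.0 - beta1) * g
        v[rid] = beta2 * v[rid] + (1.0 - beta2) * g * g
        m_hat = m[rid] / (1.0 - beta1 ** t)
        v_hat = v[rid] / (1.0 - beta2 ** t)
        table[rid] -= lr * m_hat / (np.sqrt(v_hat) + eps)
    # Rows not in row_ids are left untouched, which is what makes the update
    # "lazy" (and cheaper) compared to dense Adam, where every row's moments
    # are decayed on every step.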