Warning: This project is deprecated. TensorFlow Addons has stopped development;
the project will only provide minimal maintenance releases until May 2024. See the
full announcement on GitHub.
# tfa.optimizers.DecoupledWeightDecayExtension

This class allows extending optimizers with decoupled weight decay.

    tfa.optimizers.DecoupledWeightDecayExtension(
        weight_decay: Union[FloatTensorLike, Callable],
        exclude_from_weight_decay: Optional[List[str]] = None,
        **kwargs
    )

It implements the decoupled weight decay described by
[Loshchilov & Hutter](https://arxiv.org/pdf/1711.05101.pdf), in which the weight
decay is decoupled from the optimization steps w.r.t. the loss function.
For SGD variants, this simplifies hyperparameter search since it decouples
the settings of weight decay and learning rate.
For adaptive gradient algorithms, it regularizes variables with large
gradients more than L2 regularization would, which was shown to yield
better training loss and generalization error in the paper above.
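As a rough, illustrative sketch (plain SGD written as pseudocode with placeholder
names; not the library's actual update rule), the two approaches differ as follows:

    # L2 regularization folds the penalty into the gradient before the step:
    #   grad = grad_of_loss + weight_decay * var
    #   var  = var - lr * grad
    #
    # Decoupled weight decay shrinks the variable separately from the gradient step:
    #   var = var - lr * grad_of_loss
    #   var = var - weight_decay * var

For plain SGD the two roughly coincide up to a rescaling of the weight decay factor;
for adaptive methods such as Adam they do not, which is what motivates AdamW.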
This class alone is not an optimizer but rather extends existing
optimizers with decoupled weight decay. We explicitly define the two
examples used in the above paper (SGDW and AdamW), but in general this can
extend any `OptimizerX` class by using
`ExtendedCls = extend_with_decoupled_weight_decay(OptimizerX)`.
Weight decay can then be set when instantiating the optimizer:
`optimizerX = ExtendedCls(weight_decay=0.001, learning_rate=0.001)`.
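For instance, a minimal sketch of the factory pattern (assuming TensorFlow Addons is
importable as `tfa`; the hyperparameter values are illustrative only):

    import tensorflow as tf
    import tensorflow_addons as tfa

    # Build an SGD variant with decoupled weight decay (an SGDW-style optimizer).
    SGDW = tfa.optimizers.extend_with_decoupled_weight_decay(tf.keras.optimizers.SGD)
    optimizer = SGDW(weight_decay=1e-4, learning_rate=1e-2, momentum=0.9)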
For this to work, `DecoupledWeightDecayExtension` must be the first class that the
optimizer with weight decay inherits from, e.g.

    class AdamW(DecoupledWeightDecayExtension, tf.keras.optimizers.Adam):
        def __init__(self, weight_decay, *args, **kwargs):
            super(AdamW, self).__init__(weight_decay, *args, **kwargs)

**Note:** this extension decays weights BEFORE applying the update based on the
gradient, i.e. this extension only has the desired behaviour for optimizers which do
not depend on the value of `var` in the update step!

**Note:** when applying a decay to the learning rate, be sure to manually apply the
decay to the `weight_decay` as well. For example:

    step = tf.Variable(0, trainable=False)
    schedule = tf.optimizers.schedules.PiecewiseConstantDecay(
        [10000, 15000], [1e-0, 1e-1, 1e-2])
    # lr and wd can be a function or a tensor
    lr = 1e-1 * schedule(step)
    wd = lambda: 1e-4 * schedule(step)

    # ...

    optimizer = tfa.optimizers.AdamW(learning_rate=lr, weight_decay=wd)
| Args ||
|---|---|
| `weight_decay` | A `Tensor`, a floating point value, or a schedule that is a `tf.keras.optimizers.schedules.LearningRateSchedule` to decay the variable by, in the update step. |
| `exclude_from_weight_decay` | List of regex patterns of variables excluded from weight decay. Variables whose name contains a substring matching one of the patterns will be excluded. Note `decay_var_list` in `minimize` or `apply_gradients` takes priority over `exclude_from_weight_decay` if specified. |
| `**kwargs` | Optional list, tuple, or set of `Variable` objects to decay. |
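As an illustrative sketch of `exclude_from_weight_decay` (the pattern strings below
are placeholders; match them to your own variable names):

    import tensorflow_addons as tfa

    # Variables whose names contain "bias" or "layer_norm" are not decayed.
    optimizer = tfa.optimizers.AdamW(
        weight_decay=1e-4,
        learning_rate=1e-3,
        exclude_from_weight_decay=["bias", "layer_norm"],
    )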
## Methods

### `apply_gradients`

    apply_gradients(
        grads_and_vars, name=None, decay_var_list=None, **kwargs
    )

Apply gradients to variables.

This is the second part of `minimize()`. It returns an `Operation` that applies
gradients.
| Args ||
|---|---|
| `grads_and_vars` | List of (gradient, variable) pairs. |
| `name` | Optional name for the returned operation. Defaults to the name passed to the `Optimizer` constructor. |
| `decay_var_list` | Optional list of variables to be decayed. Defaults to all variables in `var_list`. Note `decay_var_list` takes priority over `exclude_from_weight_decay` if specified. |
| `**kwargs` | Additional arguments to pass to the base optimizer's `apply_gradients` method, e.g., TF 2.2 added an argument `experimental_aggregate_gradients`. |
| Returns ||
|---|---|
| An `Operation` that applies the specified gradients. ||

| Raises ||
|---|---|
| `TypeError` | If `grads_and_vars` is malformed. |
| `ValueError` | If none of the variables have gradients. |
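A minimal usage sketch (the variables, loss, and hyperparameter values are
illustrative only):

    import tensorflow as tf
    import tensorflow_addons as tfa

    kernel = tf.Variable([[1.0], [2.0]], name="kernel")
    bias = tf.Variable([0.0], name="bias")
    optimizer = tfa.optimizers.AdamW(weight_decay=1e-4, learning_rate=1e-3)

    with tf.GradientTape() as tape:
        loss = tf.reduce_sum(tf.matmul(tf.ones([1, 2]), kernel) + bias)
    grads = tape.gradient(loss, [kernel, bias])

    # Only `kernel` is decayed; `bias` receives the plain Adam update.
    optimizer.apply_gradients(zip(grads, [kernel, bias]), decay_var_list=[kernel])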
### `from_config`

    @classmethod
    from_config(
        config, custom_objects=None
    )

### `get_config`

    get_config()

### `minimize`

    minimize(
        loss, var_list, grad_loss=None, name=None, decay_var_list=None, tape=None
    )

Minimize `loss` by updating `var_list`.

This method simply computes the gradients using `tf.GradientTape` and calls
`apply_gradients()`. If you want to process the gradients before applying them,
call `tf.GradientTape` and `apply_gradients()` explicitly instead of using this
function.
| Args ||
|---|---|
| `loss` | `Tensor` or callable. If a callable, `loss` should take no arguments and return the value to minimize. If a `Tensor`, the `tape` argument must be passed. |
| `var_list` | List or tuple of `Variable` objects to update to minimize `loss`, or a callable returning the list or tuple of `Variable` objects. Use a callable when the variable list would otherwise be incomplete before `minimize`, since the variables are created the first time `loss` is called. |
| `grad_loss` | Optional. A `Tensor` holding the gradient computed for `loss`. |
| `decay_var_list` | Optional list of variables to be decayed. Defaults to all variables in `var_list`. Note `decay_var_list` takes priority over `exclude_from_weight_decay` if specified. |
| `name` | Optional name for the returned operation. |
| `tape` | (Optional) `tf.GradientTape`. If `loss` is provided as a `Tensor`, the tape that computed the `loss` must be provided. |
| Returns ||
|---|---|
| An `Operation` that updates the variables in `var_list`. ||

| Raises ||
|---|---|
| `ValueError` | If some of the variables are not `Variable` objects. |
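A minimal sketch of `minimize` with a callable loss (the variable and hyperparameter
values are illustrative only):

    import tensorflow as tf
    import tensorflow_addons as tfa

    var = tf.Variable([1.0, 2.0], name="kernel")
    optimizer = tfa.optimizers.AdamW(weight_decay=1e-4, learning_rate=1e-3)

    # With a callable loss no tape is needed; gradients are computed internally.
    loss = lambda: tf.reduce_sum(tf.square(var))
    optimizer.minimize(loss, var_list=[var], decay_var_list=[var])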
[null,null,["Last updated 2023-05-25 UTC."],[],[],null,["# tfa.optimizers.DecoupledWeightDecayExtension\n\n\u003cbr /\u003e\n\n|---------------------------------------------------------------------------------------------------------------------------------------------|\n| [View source on GitHub](https://github.com/tensorflow/addons/blob/v0.20.0/tensorflow_addons/optimizers/weight_decay_optimizers.py#L26-L268) |\n\nThis class allows to extend optimizers with decoupled weight decay. \n\n tfa.optimizers.DecoupledWeightDecayExtension(\n weight_decay: Union[FloatTensorLike, Callable],\n exclude_from_weight_decay: Optional[List[str]] = None,\n **kwargs\n )\n\nIt implements the decoupled weight decay described by [Loshchilov \\& Hutter](https://arxiv.org/pdf/1711.05101.pdf), in which the weight decay is\ndecoupled from the optimization steps w.r.t. to the loss function.\nFor SGD variants, this simplifies hyperparameter search since it decouples\nthe settings of weight decay and learning rate.\nFor adaptive gradient algorithms, it regularizes variables with large\ngradients more than L2 regularization would, which was shown to yield\nbetter training loss and generalization error in the paper above.\n\nThis class alone is not an optimizer but rather extends existing\noptimizers with decoupled weight decay. We explicitly define the two\nexamples used in the above paper (SGDW and AdamW), but in general this can\nextend any OptimizerX class by using\n`ExtendedCls = extend_with_decoupled_weight_decay(OptimizerX)`.\nWeight decay can then be set when instantiating the optimizer:\n`optimizerX = ExtendedCls(weight_decay=0.001, learning_rate=0.001)`.\nIn order for it to work, it must be the first class the Optimizer with\nweight decay inherits from, e.g. \n\n class AdamW(DecoupledWeightDecayExtension, tf.keras.optimizers.Adam):\n def __init__(self, weight_decay, *args, **kwargs):\n super(AdamW, self).__init__(weight_decay, *args, **kwargs).\n\n| **Note:** this extension decays weights BEFORE applying the update based on the gradient, i.e. this extension only has the desired behaviour for optimizers which do not depend on the value of'var' in the update step!\n**Note:** when applying a decay to the learning rate, be sure to manually apply the decay to the `weight_decay` as well. For example: \n\n step = tf.Variable(0, trainable=False)\n schedule = tf.optimizers.schedules.PiecewiseConstantDecay(\n [10000, 15000], [1e-0, 1e-1, 1e-2])\n # lr and wd can be a function or a tensor\n lr = 1e-1 * schedule(step)\n wd = lambda: 1e-4 * schedule(step)\n\n # ...\n\n optimizer = tfa.optimizers.AdamW(learning_rate=lr, weight_decay=wd)\n\n\u003cbr /\u003e\n\n\u003cbr /\u003e\n\n\u003cbr /\u003e\n\n| Args ---- ||\n|-----------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|\n| `weight_decay` | A `Tensor`, a floating point value, or a schedule that is a [`tf.keras.optimizers.schedules.LearningRateSchedule`](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/schedules/LearningRateSchedule) to decay the variable by, in the update step. |\n| `exclude_from_weight_decay` | List of regex patterns of variables excluded from weight decay. Variables whose name contain a substring matching the pattern will be excluded. 
Note `decay_var_list` in `minimize` or `apply_gradients` takes priority over `exclude_from_weight_decay` if specified. |\n| `**kwargs` | Optional list or tuple or set of `Variable` objects to decay. |\n\n\u003cbr /\u003e\n\nMethods\n-------\n\n### `apply_gradients`\n\n[View source](https://github.com/tensorflow/addons/blob/v0.20.0/tensorflow_addons/optimizers/weight_decay_optimizers.py#L172-L196) \n\n apply_gradients(\n grads_and_vars, name=None, decay_var_list=None, **kwargs\n )\n\nApply gradients to variables.\n\nThis is the second part of `minimize()`. It returns an `Operation` that\napplies gradients.\n\n\u003cbr /\u003e\n\n\u003cbr /\u003e\n\n\u003cbr /\u003e\n\n| Args ||\n|------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------|\n| `grads_and_vars` | List of (gradient, variable) pairs. |\n| `name` | Optional name for the returned operation. Default to the name passed to the `Optimizer` constructor. |\n| `decay_var_list` | Optional list of variables to be decayed. Defaults to all variables in var_list. Note `decay_var_list` takes priority over `exclude_from_weight_decay` if specified. |\n| `**kwargs` | Additional arguments to pass to the base optimizer's apply_gradient method, e.g., TF2.2 added an argument `experimental_aggregate_gradients`. |\n\n\u003cbr /\u003e\n\n\u003cbr /\u003e\n\n\u003cbr /\u003e\n\n\u003cbr /\u003e\n\n| Returns ||\n|---|---|\n| An `Operation` that applies the specified gradients. ||\n\n\u003cbr /\u003e\n\n\u003cbr /\u003e\n\n\u003cbr /\u003e\n\n\u003cbr /\u003e\n\n| Raises ||\n|--------------|------------------------------------------|\n| `TypeError` | If `grads_and_vars` is malformed. |\n| `ValueError` | If none of the variables have gradients. |\n\n\u003cbr /\u003e\n\n### `from_config`\n\n[View source](https://github.com/tensorflow/addons/blob/v0.20.0/tensorflow_addons/optimizers/weight_decay_optimizers.py#L112-L127) \n\n @classmethod\n from_config(\n config, custom_objects=None\n )\n\n### `get_config`\n\n[View source](https://github.com/tensorflow/addons/blob/v0.20.0/tensorflow_addons/optimizers/weight_decay_optimizers.py#L102-L110) \n\n get_config()\n\n### `minimize`\n\n[View source](https://github.com/tensorflow/addons/blob/v0.20.0/tensorflow_addons/optimizers/weight_decay_optimizers.py#L129-L170) \n\n minimize(\n loss, var_list, grad_loss=None, name=None, decay_var_list=None, tape=None\n )\n\nMinimize `loss` by updating `var_list`.\n\nThis method simply computes gradient using [`tf.GradientTape`](https://www.tensorflow.org/api_docs/python/tf/GradientTape) and calls\n`apply_gradients()`. If you want to process the gradient before\napplying then call [`tf.GradientTape`](https://www.tensorflow.org/api_docs/python/tf/GradientTape) and `apply_gradients()` explicitly\ninstead of using this function.\n\n\u003cbr /\u003e\n\n\u003cbr /\u003e\n\n\u003cbr /\u003e\n\n| Args ||\n|------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|\n| `loss` | `Tensor` or callable. If a callable, `loss` should take no arguments and return the value to minimize. If a `Tensor`, the `tape` argument must be passed. 
|\n| `var_list` | list or tuple of `Variable` objects to update to minimize `loss`, or a callable returning the list or tuple of `Variable` objects. Use callable when the variable list would otherwise be incomplete before `minimize` since the variables are created at the first time `loss` is called. |\n| `grad_loss` | Optional. A `Tensor` holding the gradient computed for `loss`. |\n| `decay_var_list` | Optional list of variables to be decayed. Defaults to all variables in var_list. Note `decay_var_list` takes priority over `exclude_from_weight_decay` if specified. |\n| `name` | Optional name for the returned operation. |\n| `tape` | (Optional) [`tf.GradientTape`](https://www.tensorflow.org/api_docs/python/tf/GradientTape). If `loss` is provided as a `Tensor`, the tape that computed the `loss` must be provided. |\n\n\u003cbr /\u003e\n\n\u003cbr /\u003e\n\n\u003cbr /\u003e\n\n\u003cbr /\u003e\n\n| Returns ||\n|---|---|\n| An Operation that updates the variables in `var_list`. ||\n\n\u003cbr /\u003e\n\n\u003cbr /\u003e\n\n\u003cbr /\u003e\n\n\u003cbr /\u003e\n\n| Raises ||\n|--------------|------------------------------------------------------|\n| `ValueError` | If some of the variables are not `Variable` objects. |\n\n\u003cbr /\u003e"]]