Save the date! Google I/O returns May 18-20

# tf.keras.layers.LayerNormalization

Layer normalization layer (Ba et al., 2016).

Inherits From: `Layer`, `Module`

### Used in the notebooks

Used in the tutorials

Normalize the activations of the previous layer for each given example in a batch independently, rather than across a batch like Batch Normalization. i.e. applies a transformation that maintains the mean activation within each example close to 0 and the activation standard deviation close to 1.

Given a tensor `inputs`, moments are calculated and normalization is performed across the axes specified in `axis`.

#### Example:

````data = tf.constant(np.arange(10).reshape(5, 2) * 10, dtype=tf.float32)`
`print(data)`
`tf.Tensor(`
`[[ 0. 10.]`
` [20. 30.]`
` [40. 50.]`
` [60. 70.]`
` [80. 90.]], shape=(5, 2), dtype=float32)`
```
````layer = tf.keras.layers.LayerNormalization(axis=1)`
`output = layer(data)`
`print(output)`
`tf.Tensor(`
`[[-1. 1.]`
` [-1. 1.]`
` [-1. 1.]`
` [-1. 1.]`
` [-1. 1.]], shape=(5, 2), dtype=float32)`
```

Notice that with Layer Normalization the normalization happens across the axes within each example, rather than across different examples in the batch.

If `scale` or `center` are enabled, the layer will scale the normalized outputs by broadcasting them with a trainable variable `gamma`, and center the outputs by broadcasting with a trainable variable `beta`. `gamma` will default to a ones tensor and `beta` will default to a zeros tensor, so that centering and scaling are no-ops before training has begun.

So, with scaling and centering enabled the normalization equations are as follows: Let the intermediate activations for a mini-batch to be the `inputs`.

For each sample `x_i` in `inputs` with `k` features, we compute the mean and variance of the sample:

``````  mean_i = sum(x_i[j] for j in range(k)) / k
var_i = sum((x_i[j] - mean_i) ** 2 for j in range(k)) / k
``````

and then compute a normalized `x_i_normalized`, including a small factor `epsilon` for numerical stability.

``````  x_i_normalized = (x_i - mean_i) / sqrt(var_i + epsilon)
``````

And finally `x_i_normalized` is linearly transformed by `gamma` and `beta`, which are learned parameters:

``````  output_i = x_i_normalized * gamma + beta
``````

`gamma` and `beta` will span the axes of `inputs` specified in `axis`, and this part of the inputs' shape must be fully defined.

#### For example:

````layer = tf.keras.layers.LayerNormalization(axis=[1, 2, 3])`
`layer.build([5, 20, 30, 40])`
`print(layer.beta.shape)`
`(20, 30, 40)`
`print(layer.gamma.shape)`
`(20, 30, 40)`
```

Note that other implementations of layer normalization may choose to define `gamma` and `beta` over a separate set of axes from the axes being normalized across. For example, Group Normalization (Wu et al. 2018) with group size of 1 corresponds to a Layer Normalization that normalizes across height, width, and channel and has `gamma` and `beta` span only the channel dimension. So, this Layer Normalization implementation will not match a Group Normalization layer with group size set to 1.

`axis` Integer or List/Tuple. The axis or axes to normalize across. Typically this is the features axis/axes. The left-out axes are typically the batch axis/axes. This argument defaults to `-1`, the last dimension in the input.
`epsilon` Small float added to variance to avoid dividing by zero. Defaults to 1e-3
`center` If True, add offset of `beta` to normalized tensor. If False, `beta` is ignored. Defaults to True.
`scale` If True, multiply by `gamma`. If False, `gamma` is not used. Defaults to True. When the next layer is linear (also e.g. `nn.relu`), this can be disabled since the scaling will be done by the next layer.
`beta_initializer` Initializer for the beta weight. Defaults to zeros.
`gamma_initializer` Initializer for the gamma weight. Defaults to ones.
`beta_regularizer` Optional regularizer for the beta weight. None by default.
`gamma_regularizer` Optional regularizer for the gamma weight. None by default.
`beta_constraint` Optional constraint for the beta weight. None by default.
`gamma_constraint` Optional constraint for the gamma weight. None by default.
`trainable` Boolean, if `True` the variables will be marked as trainable. Defaults to True.

Input shape: Arbitrary. Use the keyword argument `input_shape` (tuple of integers, does not include the samples axis) when using this layer as the first layer in a model. Output shape: Same shape as input. Reference: