tfm.nlp.layers.ReuseTransformer

Transformer layer.

This layer implements the ReuseTransformer encoder block from "Leveraging redundancy in attention with Reuse Transformers" (https://arxiv.org/abs/2110.06821). A brief usage sketch follows the parameter list below.

Args

num_attention_heads Number of attention heads.
inner_dim The output dimension of the first Dense layer in a two-layer feedforward network.
inner_activation The activation for the first Dense layer in a two-layer feedforward network.
head_size Projection size of heads.
output_range The sequence output range, [0, output_range), for slicing the target sequence. None means the target sequence is not sliced.
kernel_initializer Initializer for dense layer kernels.
bias_initializer Initializer for dense layer biases.
kernel_regularizer Regularizer for dense layer kernels.
bias_regularizer Regularizer for dense layer biases.
activity_regularizer Regularizer for dense layer activity.
kernel_constraint Constraint for dense layer kernels.
bias_constraint Constraint for dense layer biases.
use_bias Whether to use a bias in the attention layer. If set to False, the attention layer's projections are computed without bias terms.
norm_first Whether to apply layer normalization to the inputs of the attention and intermediate dense layers (pre-normalization). If set to False, the outputs of the attention and intermediate dense layers are normalized instead (post-normalization).
norm_epsilon Epsilon value to initialize normalization layers.
output_dropout Dropout probability for the post-attention and output dropout.
attention_dropout Dropout probability within the attention layer.
inner_dropout Dropout probability for the first Dense layer in a two-layer feedforward network.
attention_initializer Initializer for kernels of attention layers. If set to None, the attention layers fall back to kernel_initializer for their kernels.
attention_axes Axes over which the attention is applied. None means attention over all axes except batch, heads, and features.
reuse_attention An integer specifying the number of attention heads whose attention scores are reused from the previous layer; -1 reuses all heads.
use_relative_pe Whether to use relative position bias.
pe_max_seq_length Maximum sequence length used to set the size of the relative position encodings.
layer_idx The index of this layer.
max_reuse_layer_idx If passed, layers with layer_idx greater than this value will not reuse attention scores from previous layers.
**kwargs Keyword arguments.
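
Below is a minimal usage sketch, not taken from the original reference: it constructs a single ReuseTransformer block with a few of the parameters listed above and runs a batch of embeddings through it. The hyperparameter values and tensor shapes (batch 2, sequence length 16, hidden size 64) are illustrative assumptions, not defaults.

```python
import tensorflow as tf
import tensorflow_models as tfm

# Only the argument names come from the parameter list above;
# the concrete values are illustrative.
layer = tfm.nlp.layers.ReuseTransformer(
    num_attention_heads=4,
    inner_dim=128,
    inner_activation="relu",
    reuse_attention=-1,  # reuse all heads' scores when scores are provided
    norm_first=True,
)

embeddings = tf.random.uniform((2, 16, 64))            # [batch, seq_len, hidden]
attention_mask = tf.ones((2, 16, 16), dtype=tf.int32)  # [batch, from_seq, to_seq]

# Inputs may be a single tensor or a list; with no previous attention
# scores available, None is passed and reuse is skipped.
outputs = layer([embeddings, attention_mask, None])
```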

Methods

call

Transformer self-attention encoder block call.

Args
inputs A single tensor or a list of tensors: an input tensor as the single sequence of embeddings; [input tensor, attention mask] to supply an additional attention mask; or [query tensor, attention mask, attention scores] to additionally supply attention scores from a previous layer for reuse computation. If the attention scores are None, the reuse_attention flag is ignored.

Returns
An output tensor with the same dimensions as the input/query tensor. Attention scores are also returned if return_attention_scores is true.
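
The following is a hedged sketch of chaining two blocks so the second reuses attention scores produced by the first, using the three-element input format described above. It assumes the call returns an (output tensor, attention scores) pair, consistent with the Returns note; that return structure is an assumption, so verify it against the installed tensorflow_models version.

```python
import tensorflow as tf
import tensorflow_models as tfm

first = tfm.nlp.layers.ReuseTransformer(
    num_attention_heads=4, inner_dim=128, inner_activation="relu")
second = tfm.nlp.layers.ReuseTransformer(
    num_attention_heads=4, inner_dim=128, inner_activation="relu",
    reuse_attention=2)  # reuse the scores of 2 heads from the previous layer

x = tf.random.uniform((2, 16, 64))
mask = tf.ones((2, 16, 16), dtype=tf.int32)

# Assumed return structure: (output, attention_scores). The scores from
# the first block are fed to the second block as the third list element.
out1, scores1 = first([x, mask, None])
out2, scores2 = second([out1, mask, scores1])
```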