TransformerEncoderBlock layer.
tfm.nlp.layers.TransformerEncoderBlock(
    num_attention_heads,
    inner_dim,
    inner_activation,
    output_range=None,
    kernel_initializer='glorot_uniform',
    bias_initializer='zeros',
    kernel_regularizer=None,
    bias_regularizer=None,
    activity_regularizer=None,
    kernel_constraint=None,
    bias_constraint=None,
    use_bias=True,
    norm_first=False,
    norm_epsilon=1e-12,
    output_dropout=0.0,
    attention_dropout=0.0,
    inner_dropout=0.0,
    attention_initializer=None,
    attention_axes=None,
    use_query_residual=True,
    key_dim=None,
    value_dim=None,
    output_last_dim=None,
    diff_q_kv_att_layer_norm=False,
    **kwargs
)
This layer implements the Transformer encoder from
"Attention Is All You Need" (https://arxiv.org/abs/1706.03762),
combining a tf.keras.layers.MultiHeadAttention layer with a
two-layer feedforward network.
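
A minimal usage sketch (not part of the original reference): assuming
TensorFlow and the tensorflow-models package are installed so that tfm
resolves, the block can be applied directly to a batch of token embeddings,
and the output keeps the input's shape. The shapes below are arbitrary
placeholders.

import tensorflow as tf
import tensorflow_models as tfm

# Encoder block with 8 attention heads and a 2048-unit inner Dense layer.
encoder_block = tfm.nlp.layers.TransformerEncoderBlock(
    num_attention_heads=8,
    inner_dim=2048,
    inner_activation='relu',
)

# A batch of token embeddings: (batch_size, sequence_length, hidden_dim).
embeddings = tf.random.uniform(shape=(2, 16, 512))
outputs = encoder_block(embeddings)
print(outputs.shape)  # (2, 16, 512) -- same shape as the input sequence.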
Args
  num_attention_heads: Number of attention heads.
  inner_dim: The output dimension of the first Dense layer in a two-layer
    feedforward network.
  inner_activation: The activation for the first Dense layer in a two-layer
    feedforward network.
  output_range: The sequence output range, [0, output_range), for slicing the
    target sequence. None means the target sequence is not sliced.
  kernel_initializer: Initializer for dense layer kernels.
  bias_initializer: Initializer for dense layer biases.
  kernel_regularizer: Regularizer for dense layer kernels.
  bias_regularizer: Regularizer for dense layer biases.
  activity_regularizer: Regularizer for dense layer activity.
  kernel_constraint: Constraint for dense layer kernels.
  bias_constraint: Constraint for dense layer biases.
  use_bias: Whether to use bias terms in the attention layer. If False, the
    attention layer is built without bias.
  norm_first: Whether to normalize the inputs to the attention and
    intermediate dense layers (pre-layer norm). If False, the outputs of the
    attention and intermediate dense layers are normalized instead
    (post-layer norm).
  norm_epsilon: Epsilon value used to initialize normalization layers.
  output_dropout: Dropout probability for the post-attention and output
    dropout.
  attention_dropout: Dropout probability within the attention layer.
  inner_dropout: Dropout probability for the first Dense layer in the
    two-layer feedforward network.
  attention_initializer: Initializer for the attention layer kernels. If
    None, attention layers use kernel_initializer for their kernels.
  attention_axes: Axes over which attention is applied. None means attention
    over all axes except batch, heads, and features.
  use_query_residual: Whether to apply a residual connection after attention.
  key_dim: key_dim for the tf.keras.layers.MultiHeadAttention. If None, the
    last dimension of the first input shape is used.
  value_dim: value_dim for the tf.keras.layers.MultiHeadAttention.
  output_last_dim: Final dimension of this module's output, which also
    dictates the final dimension of the multi-head attention output. If None,
    the output's last dimension is, in order of decreasing precedence,
    key_dim * num_heads or the last dimension of the first input shape.
  diff_q_kv_att_layer_norm: If True, create a separate attention layer norm
    for the query and the key-value inputs when norm_first is True. Setting
    this to True when norm_first is False is invalid.
  **kwargs: Keyword arguments.
Methods
call
call(inputs)
Transformer self-attention encoder block call.
Args
  inputs: A single tensor or a list of tensors:
    - input tensor: the single sequence of embeddings;
    - [input tensor, attention mask]: to pass an additional attention mask;
    - [query tensor, key value tensor, attention mask]: to provide separate
      input streams for the query and the key/value to the multi-head
      attention.
Returns
  An output tensor with the same dimensions as the input/query tensor.
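
An illustrative sketch (not part of the original reference) of the three
accepted input forms; the tensors, shapes, and mask values below are
arbitrary placeholders.

import tensorflow as tf
import tensorflow_models as tfm

block = tfm.nlp.layers.TransformerEncoderBlock(
    num_attention_heads=2, inner_dim=512, inner_activation='relu')

batch, seq_len, hidden = 2, 8, 128
x = tf.random.uniform((batch, seq_len, hidden))

# 1) A single tensor: the sequence of embeddings.
y1 = block(x)

# 2) [input tensor, attention mask]: mask of shape (batch, seq_len, seq_len),
#    1 where attention is allowed and 0 where it is blocked.
mask = tf.ones((batch, seq_len, seq_len))
y2 = block([x, mask])

# 3) [query tensor, key value tensor, attention mask]: separate query and
#    key/value streams; the output matches the query tensor's dimensions.
query = tf.random.uniform((batch, 4, hidden))
cross_mask = tf.ones((batch, 4, seq_len))
y3 = block([query, x, cross_mask])

print(y1.shape, y2.shape, y3.shape)  # (2, 8, 128) (2, 8, 128) (2, 4, 128)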