Number of attention heads in the transformer block.
intermediate_size
The size of the "intermediate" (a.k.a. feed-forward) layer.
intermediate_act_fn
The non-linear activation function to apply
to the output of the intermediate/feed-forward layer.
hidden_dropout_prob
Dropout probability for the hidden layers.
attention_probs_dropout_prob
Dropout probability of the attention
probabilities.
intra_bottleneck_size
The size of the intra-block bottleneck.
initializer_range
The stddev of the truncated_normal_initializer for
initializing all weight matrices.
use_bottleneck_attention
Whether to use attention inputs from the bottleneck
transformation. If true, the following key_query_shared_bottleneck
argument will be ignored.
key_query_shared_bottleneck
Whether to share linear transformation for
keys and queries.
num_feedforward_networks
Number of stacked feed-forward networks.
normalization_type
The type of normalization; only no_norm and
layer_norm are supported. no_norm represents an element-wise linear
transformation for the student model, as suggested by the original
MobileBERT paper. layer_norm is used for the teacher model.
classifier_activation
Whether to use the tanh activation for the final
representation of the [CLS] token in fine-tuning.
input_mask_dtype
The dtype of the input_mask tensor, which is one of the
input tensors of this encoder. Defaults to int32. If you want
to use tf.lite quantization, which does not support the Cast op,
set this argument to tf.float32 and feed the input_mask
tensor with float32 values to avoid a tf.cast in the computation.
**kwargs
Other keyword arguments.
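The two normalization_type options above can be illustrated with a minimal numpy sketch. This is not the encoder's actual implementation; the function names and shapes here are assumptions for illustration only: no_norm applies a per-feature linear transformation with no statistics, while layer_norm normalizes over the hidden axis.

```python
import numpy as np

def no_norm(x, weight, bias):
    # Element-wise linear transformation (the "no_norm" variant used for
    # the student model): no mean/variance statistics are computed.
    return x * weight + bias

def layer_norm(x, gamma, beta, eps=1e-12):
    # Standard layer normalization over the last (hidden) axis, as used
    # for the teacher model.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps) * gamma + beta

hidden = 4
x = np.array([[1.0, 2.0, 3.0, 4.0]])
w = np.ones(hidden)   # unit weight
b = np.zeros(hidden)  # zero bias

out_no_norm = no_norm(x, w, b)        # identical to x for unit weight, zero bias
out_layer_norm = layer_norm(x, w, b)  # mean over the hidden axis is ~0
```

Because no_norm avoids the mean/variance reductions, it is cheaper at inference time, which is the motivation for using it in the distilled student model.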
Attributes
pooler_layer
The pooler dense layer after the transformer layers.
transformer_layers
List of Transformer layers in the encoder.
Methods
call
call(
inputs, training=None, mask=None
)
Calls the model on new inputs and returns the outputs as tensors.
In this case call() just reapplies
all ops in the graph to the new inputs
(i.e., builds a new computational graph from the provided inputs).
Args
inputs
Input tensor, or dict/list/tuple of input tensors.
training
Boolean or boolean scalar tensor, indicating whether to
run the Network in training mode or inference mode.
mask
A mask or list of masks. A mask can be either a boolean tensor
or None (no mask). For more details, see the Keras guide on
masking and padding.
Returns
A tensor if there is a single output, or
a list of tensors if there is more than one output.
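How a padding mask feeds into attention can be sketched in numpy. This is an illustration, not the encoder's real code; the shapes and the -1e9 bias value are assumptions. It also shows the input_mask_dtype point above: feeding the mask already in float32 means no Cast op is needed in the graph, which matters for tf.lite quantization.

```python
import numpy as np

# Hypothetical shapes; the real encoder derives these from its inputs.
batch, seq_len = 2, 5

# An input_mask marks real tokens with 1 and padding with 0.
input_mask_int = np.array([[1, 1, 1, 0, 0],
                           [1, 1, 1, 1, 1]])

# Converting to float32 outside the model avoids a cast inside the graph.
input_mask_f32 = input_mask_int.astype(np.float32)

# Attention can then ignore padded positions by adding a large negative
# bias where the mask is 0, so those logits vanish after the softmax.
attention_bias = (1.0 - input_mask_f32) * -1e9
```

With this convention, padded positions get a bias of -1e9 and real tokens a bias of 0, so the softmax assigns padded positions near-zero attention weight.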