

Sparse Mixer encoder network.

Based on "Sparse Mixers: Combining MoE and Mixing to build a more efficient BERT". Sparse Mixer is an efficient encoder network that replaces typical Transformer encoder blocks with a combination of linear mixing and sparsely activated Mixture-of-Experts (MoE) sublayers.

This implementation defaults to the canonical Sparse Mixer Base model. To use the "Fast Sparse Mixer" configuration, set *_capacity_factor=0.5. This yields a sparser, faster variant of the canonical Sparse Mixer model, in which each expert processes roughly 50% fewer tokens.
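As a rough sketch of how the capacity factor changes per-expert load: a common MoE capacity rule is ceil(capacity_factor * tokens_per_group / num_experts). The exact rule used here lives in layers.MoeLayer; the numbers below are purely illustrative.

```python
import math

def expert_capacity(tokens_per_group, num_experts, capacity_factor):
    """Per-expert token budget under a given capacity factor.

    Illustrative version of a common MoE capacity formula; see
    layers.MoeLayer for the exact rule used by this implementation.
    """
    return math.ceil(capacity_factor * tokens_per_group / num_experts)

# Canonical Sparse Mixer (factor 1.0) vs. "Fast Sparse Mixer" (factor 0.5):
print(expert_capacity(4096, 16, 1.0))  # 256
print(expert_capacity(4096, 16, 0.5))  # 128
```

Halving the capacity factor halves each expert's token budget, which is where the sparsity and speedup of the "Fast" variant come from.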


  • The underlying MoeLayer uses the Keras add_loss() and add_metric() APIs to propagate auxiliary MoE losses and metrics. Any model using this network should collect these losses and, if desired, metrics.
  • The input length is fixed to 'max_sequence_length' to accommodate the mixing mechanisms.
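Because the mixing matrices are built for a single fixed sequence length, every input example must be padded or truncated to exactly max_sequence_length before being fed to the encoder. A minimal preprocessing sketch (the length and pad ID below are illustrative, not the model's defaults):

```python
MAX_SEQUENCE_LENGTH = 8  # illustrative; use the encoder's max_sequence_length

def pad_or_truncate(token_ids, max_len=MAX_SEQUENCE_LENGTH, pad_id=0):
    """Force every example to exactly max_len tokens, since the mixing
    matrices only accept one fixed sequence length."""
    ids = list(token_ids)[:max_len]          # truncate long inputs
    return ids + [pad_id] * (max_len - len(ids))  # pad short inputs

print(pad_or_truncate([5, 9, 2]))        # [5, 9, 2, 0, 0, 0, 0, 0]
print(len(pad_or_truncate(range(20))))   # 8
```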

vocab_size The size of the token vocabulary.
hidden_size The size of the transformer hidden layers.
num_layers The number of transformer layers.
moe_layers Specifies which layers, if any, should be sparsely activated Mixture-of-Experts (MoE) layers. The remaining [0, num_layers) setminus moe_layers will use the vanilla MLP sublayers. Defaults to placing MoE layers in the middle of the model.
attention_layers Specifies which layers, if any, should be attention layers in the encoder. The remaining [0, num_layers) setminus attention_layers will use the specified mixing_mechanism. If using attention layers, a good rule of thumb is to place them in the final few layers.
num_experts Number of experts. Experts are themselves MLP modules, with the same inner_dim and inner_activation as the vanilla MLP sublayers.
train_capacity_factor Scaling factor to increase the expert token capacity during training. See layers.MoeLayer for further details. The "Fast Sparse Mixer" increases model sparsity (and speed) by using a capacity factor of 0.5.
eval_capacity_factor As above, but used during evaluation.
max_group_size The total number of tokens on each device is subdivided into groups of this size. Router computations are then performed on a per-group basis. See layers.MoeLayer for further details.
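To make the grouping concrete, here is the arithmetic it implies (illustrative only; see layers.MoeLayer for the actual grouping logic):

```python
import math

def num_groups(tokens_per_device, max_group_size):
    """Number of router groups when the tokens on one device are split
    into groups of at most max_group_size tokens."""
    return math.ceil(tokens_per_device / max_group_size)

# e.g. a batch of 32 sequences of length 512 on one device, groups of 4096:
print(num_groups(32 * 512, 4096))  # 4
```

Routing is then performed independently within each of these groups, which bounds the cost and memory of each router computation.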
mixing_mechanism Type of mixing mechanism used in place of self-attention layers. Defaults to 'Linear' mixing.
use_fft Only used for spectral mixing mechanisms. Determines whether to use Fast Fourier Transform (True) or the Discrete Fourier Transform (DFT) matrix (False; default) to compute the Fourier Transform. See layers.FourierTransformLayer or layers.HartleyTransformLayer for advice.
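The two options compute the same transform; they differ only in cost. A NumPy sketch of the equivalence (not this library's implementation): the DFT-matrix route is an O(n^2) matmul against a precomputed matrix, while the FFT route is O(n log n).

```python
import numpy as np

n = 8
x = np.random.default_rng(0).standard_normal(n)

# DFT-matrix route (use_fft=False): precompute the n x n transform matrix.
dft_matrix = np.exp(-2j * np.pi * np.outer(np.arange(n), np.arange(n)) / n)
via_matmul = dft_matrix @ x

# FFT route (use_fft=True): same result via the fast algorithm.
via_fft = np.fft.fft(x)

print(np.allclose(via_matmul, via_fft))  # True
```

The FFT generally wins for long sequences, while the matmul form can be faster on accelerators for short, fixed sequence lengths; hence the per-layer advice in layers.FourierTransformLayer and layers.HartleyTransformLayer.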
num_attention_heads The number of attention heads for each transformer. The hidden size must be divisible by the number of attention heads.
max_sequence_length The only sequence length that this encoder can consume. This determines the variable shape for positional embeddings and the size of the mixing matrices.
type_vocab_size The number of types that the 'type_ids' input can take.
inner_dim The output dimension of the first Dense layer in a two-layer feedforward network for each transformer.
inner_activation The activation for the first Dense layer in a two-layer feedforward network for each transformer.
output_dropout Dropout probability for the post-attention and output dropout.
attention_dropout The dropout rate to use for the attention layers within the transformer layers.
initializer The initializer to use for all weights in this encoder.
output_range The sequence output range, [0, output_range), by slicing the target sequence of the last transformer layer. None means the entire target sequence will attend to the source sequence, which yields the full output.
embedding_width The width of the word embeddings. If the embedding width is not equal to the hidden size, the embedding parameters will be factorized into two matrices of shape ['vocab_size', 'embedding_width'] and ['embedding_width', 'hidden_size'].
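The point of the factorization is parameter savings when the vocabulary is large. An illustrative count (the vocabulary and widths below are example values, not this model's defaults):

```python
def embedding_params(vocab_size, hidden_size, embedding_width=None):
    """Parameter count of the word-embedding block, with optional
    factorization into [vocab, width] and [width, hidden] matrices."""
    if embedding_width is None or embedding_width == hidden_size:
        return vocab_size * hidden_size  # single [vocab, hidden] matrix
    return vocab_size * embedding_width + embedding_width * hidden_size

print(embedding_params(30522, 768))       # 23440896 (unfactorized)
print(embedding_params(30522, 768, 128))  # 4005120 (factorized)
```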
embedding_layer An optional Layer instance which will be called to generate embeddings for the input word IDs.
norm_first Whether to normalize inputs to the attention and intermediate dense layers. If set to False, the outputs of the attention and intermediate dense layers are normalized instead.
with_dense_inputs Whether to accept dense embeddings as the input.
export_metrics Whether to export metrics using Keras add_metric API.

pooler_layer The pooler dense layer after the transformer layers.
transformer_layers List of Transformer layers in the encoder.



call

This is where the layer's logic lives.

The call() method may not create state (except in its first invocation, wrapping the creation of variables or other resources in tf.init_scope()). It is recommended to create state, including tf.Variable instances and nested Layer instances, in __init__(), or in the build() method that is called automatically before call() executes for the first time.

inputs Input tensor, or dict/list/tuple of input tensors. The first positional inputs argument is subject to special rules:

  • inputs must be explicitly passed. A layer cannot have zero arguments, and inputs cannot be provided via the default value of a keyword argument.
  • NumPy array or Python scalar values in inputs get cast as tensors.
  • Keras mask metadata is only collected from inputs.
  • Layers are built (build(input_shape) method) using shape info from inputs only.
  • input_spec compatibility is only checked against inputs.
  • Mixed precision input casting is only applied to inputs. If a layer has tensor arguments in *args or **kwargs, their casting behavior in mixed precision should be handled manually.
  • The SavedModel input specification is generated using inputs only.
  • Integration with various ecosystem packages like TFMOT, TFLite, TF.js, etc is only supported for inputs and not for tensors in positional and keyword arguments.
*args Additional positional arguments. May contain tensors, although this is not recommended, for the reasons above.
**kwargs Additional keyword arguments. May contain tensors, although this is not recommended, for the reasons above. The following optional keyword arguments are reserved:
  • training: Boolean scalar tensor or Python boolean indicating whether the call is meant for training or inference.
  • mask: Boolean input mask. If the layer's call() method takes a mask argument, its default value will be set to the mask generated for inputs by the previous layer (if input did come from a layer that generated a corresponding mask, i.e. if it came from a Keras layer with masking support).
Returns
A tensor or list/tuple of tensors.
