

ALBERT text encoder network.

This network implements the encoder described in the paper "ALBERT: A Lite BERT for Self-supervised Learning of Language Representations".

Compared with BERT, ALBERT factorizes the embedding parameters into two smaller matrices and shares parameters across layers.
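As a rough illustration of why these two changes shrink the model, the sketch below counts weights with and without each one. The helper names and the sizes are assumptions chosen for illustration and are not read from the library. The per-layer estimate ignores biases and layer normalization.

```python
def embedding_param_count(vocab_size, embedding_width, hidden_size):
    """Count word-embedding weights; factorized when the two widths differ."""
    if embedding_width == hidden_size:
        return vocab_size * hidden_size
    # A (vocab_size, embedding_width) lookup table plus an
    # (embedding_width, hidden_size) projection matrix.
    return vocab_size * embedding_width + embedding_width * hidden_size


def stacked_layer_param_count(per_layer_params, num_layers, share_across_layers):
    """Count transformer-layer weights with or without cross-layer sharing."""
    if share_across_layers:
        return per_layer_params  # one set of weights reused by every layer
    return per_layer_params * num_layers


# Illustrative sizes (assumed for this sketch).
vocab_size, embedding_width, hidden_size = 30000, 128, 768
num_layers, intermediate_size = 12, 3072

unfactorized = embedding_param_count(vocab_size, hidden_size, hidden_size)
factorized = embedding_param_count(vocab_size, embedding_width, hidden_size)

# Rough per-layer estimate: four attention projections plus two feed-forward matrices.
per_layer = 4 * hidden_size * hidden_size + 2 * hidden_size * intermediate_size
unshared = stacked_layer_param_count(per_layer, num_layers, share_across_layers=False)
shared = stacked_layer_param_count(per_layer, num_layers, share_across_layers=True)

print(unfactorized, factorized)  # 23040000 3938304
print(unshared, shared)          # 84934656 7077888
```

With these assumed sizes, factorization cuts the word-embedding weights by roughly a factor of six, and sharing reduces the transformer-layer weights by a factor equal to the number of layers.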

The default values for this object are taken from the ALBERT-Base implementation described in the paper.

vocab_size The size of the token vocabulary.
embedding_width The width of the word embeddings. If the embedding width is not equal to hidden size, embedding parameters will be factorized into two matrices in the shape of (vocab_size, embedding_width) and (embedding_width, hidden_size), where embedding_width is usually much smaller than hidden_size.
hidden_size The size of the transformer hidden layers.
num_layers The number of transformer layers.
num_attention_heads The number of attention heads for each transformer. The hidden size must be divisible by the number of attention heads.
max_sequence_length The maximum sequence length that this encoder can consume. If None, max_sequence_length uses the value from sequence length. This determines the variable shape for positional embeddings.
type_vocab_size The number of types that the 'type_ids' input can take.
intermediate_size The intermediate size for the transformer layers.
activation The activation to use for the transformer layers.
dropout_rate The dropout rate to use for the transformer layers.
attention_dropout_rate The dropout rate to use for the attention layers within the transformer layers.
initializer The initializer to use for all weights in this encoder.
dict_outputs Whether to use a dictionary as the model outputs.
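The constraints stated in the argument descriptions above can be checked with a small standalone helper. This is a minimal sketch that does not call the library: `EncoderArgsSketch` and `describe_shapes` are hypothetical names, and the default values are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class EncoderArgsSketch:
    """Hypothetical container mirroring a few of the documented arguments."""
    vocab_size: int
    embedding_width: int = 128      # assumed illustrative value
    hidden_size: int = 768          # assumed illustrative value
    num_attention_heads: int = 12   # assumed illustrative value
    max_sequence_length: int = 512  # assumed illustrative value
    type_vocab_size: int = 16       # assumed illustrative value


def describe_shapes(args):
    """Validate the arguments and return the sizes they imply."""
    if args.hidden_size % args.num_attention_heads != 0:
        raise ValueError(
            f"hidden_size ({args.hidden_size}) must be divisible by "
            f"num_attention_heads ({args.num_attention_heads})"
        )
    shapes = {
        "word_embeddings": (args.vocab_size, args.embedding_width),
        "size_per_head": args.hidden_size // args.num_attention_heads,
        "position_embedding_rows": args.max_sequence_length,
        "type_embedding_rows": args.type_vocab_size,
    }
    # The second matrix exists only when the embeddings are factorized.
    if args.embedding_width != args.hidden_size:
        shapes["embedding_projection"] = (args.embedding_width, args.hidden_size)
    return shapes


shapes = describe_shapes(EncoderArgsSketch(vocab_size=30000))
print(shapes["word_embeddings"])       # (30000, 128)
print(shapes["embedding_projection"])  # (128, 768)
print(shapes["size_per_head"])         # 64
```

Passing an `embedding_width` equal to `hidden_size` omits the projection entry, which corresponds to the unfactorized case described for `embedding_width`.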



Calls the model on new inputs and returns the outputs as tensors.

In this case call() just reapplies all ops in the graph to the new inputs (i.e., it builds a new computational graph from the provided inputs).

inputs Input tensor, or dict/list/tuple of input tensors.
training Boolean or boolean scalar tensor, indicating whether to run the Network in training mode or inference mode.
mask A mask or list of masks. A mask can be either a boolean tensor or None (no mask). For more details, see the Keras guide on masking and padding.

A tensor if there is a single output, or a list of tensors if there is more than one output.

