Invertible TextEncoder using word pieces with a byte-level fallback.

Inherits From: TextEncoder

Encoding is fully invertible because all out-of-vocab wordpieces are byte-encoded.
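
To illustrate why a byte-level fallback makes encoding invertible, here is a minimal toy sketch in plain Python. It is not the library's implementation: the subword list, the greedy longest-match strategy, the id layout (subwords first, then 256 byte ids, with id 0 left unused), and the names `toy_encode` and `toy_decode` are all assumptions made for this example.

```python
# Toy sketch only: NOT the tfds implementation. All names are hypothetical.
SUBWORDS = ["hello", "wor", "ld", " "]
NUM_BYTES = 256

# Assumed id layout: ids start at 1 (0 is left unused, e.g. for padding),
# subwords come first, then one id per possible byte value.
SUBWORD_TO_ID = {s: i + 1 for i, s in enumerate(SUBWORDS)}
BYTE_OFFSET = len(SUBWORDS) + 1
VOCAB_SIZE = BYTE_OFFSET + NUM_BYTES


def toy_encode(text):
    """Greedy longest-match over SUBWORDS, falling back to UTF-8 bytes."""
    ids = []
    pos = 0
    max_len = max(len(s) for s in SUBWORDS)
    while pos < len(text):
        for end in range(min(len(text), pos + max_len), pos, -1):
            piece = text[pos:end]
            if piece in SUBWORD_TO_ID:
                ids.append(SUBWORD_TO_ID[piece])
                pos = end
                break
        else:
            # No subword matches here: emit the character's UTF-8 bytes.
            for b in text[pos].encode("utf-8"):
                ids.append(BYTE_OFFSET + b)
            pos += 1
    return ids


def toy_decode(ids):
    """Map ids back to bytes and decode the whole buffer as UTF-8."""
    out = bytearray()
    for i in ids:
        if i < BYTE_OFFSET:
            out.extend(SUBWORDS[i - 1].encode("utf-8"))
        else:
            out.append(i - BYTE_OFFSET)
    return out.decode("utf-8")


text = "hello wörld!"
ids = toy_encode(text)
round_trip = toy_decode(ids)
```

Because every character that no subword covers can still be written as bytes, `round_trip` equals `text` even for input the toy vocabulary has never seen.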

The vocabulary is "trained" on a corpus and all wordpieces are stored in a vocabulary file. To generate a vocabulary from a corpus, use tfds.deprecated.text.SubwordTextEncoder.build_from_corpus.

Typical usage:

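import tensorflow_datasets as tfds

# Build a vocabulary from a corpus (corpus_generator yields str)
encoder = tfds.deprecated.text.SubwordTextEncoder.build_from_corpus(
    corpus_generator, target_vocab_size=2**15)

# Load a previously saved vocabulary, then encode and decode
encoder = tfds.deprecated.text.SubwordTextEncoder.load_from_file(vocab_fname)
ids = encoder.encode("hello world")
text = encoder.decode([1, 2, 3, 4])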

vocab_list list<str>, list of subwords for the vocabulary. Note that an underscore at the end of a subword indicates the end of the word (i.e. a space will be inserted afterwards when decoding). Underscores in the interior of subwords are disallowed and should use the underscore escape sequence.
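
The trailing-underscore convention can be sketched as follows. This is an illustrative toy, not the library's decoding code: the function name `join_subwords` is hypothetical, and `UNDERSCORE_ESCAPE` is a placeholder value, since the actual escape sequence is not given here.

```python
# Placeholder only: the real underscore escape sequence is not specified here.
UNDERSCORE_ESCAPE = "\\u"


def join_subwords(subwords):
    """Join subwords into text, treating a trailing "_" as end of word."""
    pieces = []
    for subword in subwords:
        if subword.endswith("_"):
            # End of word: drop the marker and insert a space afterwards.
            piece = subword[:-1] + " "
        else:
            piece = subword
        # Interior underscores are stored escaped; restore them.
        pieces.append(piece.replace(UNDERSCORE_ESCAPE, "_"))
    return "".join(pieces)


plain = join_subwords(["hel", "lo_", "wor", "ld"])
escaped = join_subwords(["snake" + UNDERSCORE_ESCAPE + "case_", "ok"])
```

Here `plain` is "hello world" and `escaped` is "snake_case ok": only a trailing underscore becomes a space, while escaped interior underscores are restored literally.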


vocab_size Size of the vocabulary. Encode produces ints in the range [1, vocab_size).



Builds a SubwordTextEncoder based on the corpus_generator.

corpus_generator generator yielding str, from which subwords will be constructed.
target_vocab_size int, approximate size of the vocabulary to create.
max_subword_length int, maximum length of a subword. Note that memory and compute scale quadratically in the length of the longest token.
max_corpus_chars int, the maximum number of characters to consume from corpus_generator for the purposes of building the subword vocabulary.
reserved_tokens list<str>, list of tokens that will always be treated as whole tokens and not split up. Note that these must contain a mix of alphanumeric and non-alphanumeric characters (e.g. "") and not end in an underscore.
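
The constraint on reserved_tokens can be checked before building a vocabulary. The helper below is a hypothetical sketch written for this example (it is not part of the library), and it approximates "alphanumeric" with Python's `str.isalnum`. The token "<EOS>" is used purely as an illustrative value.

```python
def is_valid_reserved_token(token):
    """Return True if token mixes alphanumeric and non-alphanumeric
    characters and does not end in an underscore."""
    has_alnum = any(ch.isalnum() for ch in token)
    has_other = any(not ch.isalnum() for ch in token)
    return has_alnum and has_other and not token.endswith("_")


candidates = ["<EOS>", "EOS", "<>", "<EOS>_"]
valid = [t for t in candidates if is_valid_reserved_token(t)]
```

Of the candidates, only "<EOS>" passes: "EOS" is purely alphanumeric, "<>" has no alphanumeric character, and "<EOS>_" ends in an underscore.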



Decodes a list of integers into text.


Encodes text into a list of integers.


Extracts list of subwords from file.


Saves the vocabulary to a file.