Invertible TextEncoder using word pieces with a byte-level fallback.
Inherits From: TextEncoder
tfds.deprecated.text.SubwordTextEncoder(
    vocab_list=None
)
Encoding is fully invertible because all out-of-vocab wordpieces are byte-encoded.
The vocabulary is "trained" on a corpus and all wordpieces are stored in a
vocabulary file. To generate a vocabulary from a corpus, use
tfds.deprecated.text.SubwordTextEncoder.build_from_corpus.
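To make the byte-level fallback concrete, here is a minimal toy sketch in plain Python. It is not the TFDS algorithm or its vocabulary format: the three-entry `VOCAB`, the greedy longest-match rule, and the id layout (`BYTE_OFFSET`) are all assumptions chosen for illustration. It only shows why reserving one id per byte value makes encoding invertible for any input string.

```python
# Toy illustration only: NOT the TFDS implementation or vocabulary format.
VOCAB = ["hello", " ", "world"]                        # hypothetical learned pieces
PIECE_TO_ID = {p: i + 1 for i, p in enumerate(VOCAB)}  # piece ids start at 1
BYTE_OFFSET = len(VOCAB) + 1                           # byte b maps to BYTE_OFFSET + b
VOCAB_SIZE = BYTE_OFFSET + 256                         # pieces + 256 byte ids (+ unused 0)


def encode(text):
    ids, i = [], 0
    while i < len(text):
        # Greedy longest match against the known pieces.
        match = None
        for piece in sorted(VOCAB, key=len, reverse=True):
            if text.startswith(piece, i):
                match = piece
                break
        if match:
            ids.append(PIECE_TO_ID[match])
            i += len(match)
        else:
            # Fallback: emit one id per UTF-8 byte of the unmatched character.
            ids.extend(BYTE_OFFSET + b for b in text[i].encode("utf-8"))
            i += 1
    return ids


def decode(ids):
    out = bytearray()
    for id_ in ids:
        if id_ >= BYTE_OFFSET:
            out.append(id_ - BYTE_OFFSET)               # byte id
        else:
            out.extend(VOCAB[id_ - 1].encode("utf-8"))  # piece id
    return out.decode("utf-8")
```

In this sketch, text made only of known pieces encodes to short id sequences, while anything else degrades to bytes and still decodes to the original string.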
Typical usage:
import tensorflow_datasets as tfds

# Build
encoder = tfds.deprecated.text.SubwordTextEncoder.build_from_corpus(
    corpus_generator, target_vocab_size=2**15)
encoder.save_to_file(vocab_fname)
# Load
encoder = tfds.deprecated.text.SubwordTextEncoder.load_from_file(vocab_fname)
ids = encoder.encode("hello world")
text = encoder.decode([1, 2, 3, 4])
| Attributes | |
|---|---|
| subwords | |
| vocab_size | Size of the vocabulary. Encode produces ints in [1, vocab_size). | 
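The half-open range [1, vocab_size) leaves id 0 unused by real subwords, which makes it a convenient padding value when batching sequences of different lengths. The sketch below assumes that convention; `pad_batch` and `strip_padding` are hypothetical helpers written for this example, not part of the API documented here.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad each id list with pad_id so all rows have equal length."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]


def strip_padding(ids, pad_id=0):
    """Drop padding ids before handing a row back to decode()."""
    return [i for i in ids if i != pad_id]
```

Stripping the padding before decoding keeps the round trip intact, since id 0 does not correspond to any subword under this assumption.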
Methods
build_from_corpus
@classmethod
build_from_corpus(
    corpus_generator,
    target_vocab_size,
    max_subword_length=20,
    max_corpus_chars=None,
    reserved_tokens=None
)
Builds a SubwordTextEncoder based on the corpus_generator.
| Args | |
|---|---|
| corpus_generator | generator yielding str, from which subwords will be constructed. | 
| target_vocab_size | int, approximate size of the vocabulary to create. | 
| max_subword_length | int, maximum length of a subword. Note that memory and compute scale quadratically in the length of the longest token. | 
| max_corpus_chars | int, the maximum number of characters to consume from corpus_generator for the purposes of building the subword vocabulary. | 
| reserved_tokens | list<str>, list of tokens that will always be treated
as whole tokens and not split up. Note that these must contain a mix of
alphanumeric and non-alphanumeric characters (e.g. " | 
| Returns | |
|---|---|
| SubwordTextEncoder. | 
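As a sketch of what a suitable `corpus_generator` might look like, the example below yields one `str` per non-empty line and passes it to `build_from_corpus`. The helper names (`corpus_generator` as a function taking `lines`, `build_and_save`) and the default `target_vocab_size` are assumptions for illustration; only the `build_from_corpus` and `save_to_file` calls come from this page. The TFDS import is done lazily so the generator can be tried without the library installed.

```python
def corpus_generator(lines):
    """Yield one str per non-empty line (hypothetical helper)."""
    for line in lines:
        line = line.strip()
        if line:
            yield line


def build_and_save(lines, vocab_fname, target_vocab_size=2**13):
    """Build an encoder from `lines` and save its vocabulary (hypothetical helper)."""
    # Imported here so corpus_generator works without tensorflow-datasets installed.
    import tensorflow_datasets as tfds

    encoder = tfds.deprecated.text.SubwordTextEncoder.build_from_corpus(
        corpus_generator(lines), target_vocab_size=target_vocab_size)
    encoder.save_to_file(vocab_fname)
    return encoder
```

Any iterable of strings works as `lines`, for example an open text file, since iterating over a file yields one line at a time.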
decode
decode(
    ids
)
Decodes a list of integers into text.
encode
encode(
    s
)
Encodes text into a list of integers.
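Since `encode` and `decode` are meant to be inverses, a small property check can be useful when validating a loaded vocabulary. The sketch below works with any object exposing `encode`, `decode`, and `vocab_size`. `ByteOnlyEncoder` is a hypothetical stand-in used so the example runs without TFDS, and `check_encoder` is likewise a helper invented for this example.

```python
class ByteOnlyEncoder:
    """Hypothetical stand-in with the same encode/decode/vocab_size surface."""

    vocab_size = 257  # ids 1..256 cover the 256 byte values; 0 is unused

    def encode(self, s):
        return [b + 1 for b in s.encode("utf-8")]

    def decode(self, ids):
        return bytes(i - 1 for i in ids).decode("utf-8")


def check_encoder(encoder, samples):
    """Check the id range and the round-trip property on sample strings."""
    for text in samples:
        ids = encoder.encode(text)
        # Every id should fall in the half-open range [1, vocab_size).
        assert all(1 <= i < encoder.vocab_size for i in ids), text
        # Decoding the ids should give back the original string.
        assert encoder.decode(ids) == text, text
    return True
```

Passing an encoder returned by `load_from_file` in place of the stand-in should exercise the same two properties, assuming it behaves as described on this page.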
load_from_file
@classmethod
load_from_file(
    filename_prefix
)
Extracts list of subwords from file.
save_to_file
save_to_file(
    filename_prefix
)
Save the vocabulary to a file.