Text tokenization utility class.
tf.keras.preprocessing.text.Tokenizer(
    num_words=None,
    filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
    lower=True, split=' ', char_level=False, oov_token=None,
    document_count=0, **kwargs
)
This class allows you to vectorize a text corpus, by turning each text into either a sequence of integers (each integer being the index of a token in a dictionary) or into a vector where the coefficient for each token can be binary, based on word count, or based on tf-idf.
| Arguments | |
|---|---|
| num_words | the maximum number of words to keep, based on word frequency. Only the most common num_words-1 words will be kept. | 
| filters | a string where each element is a character that will be filtered from the texts. The default is all punctuation, plus tabs and line breaks, minus the ' character. | 
| lower | boolean. Whether to convert the texts to lowercase. | 
| split | str. Separator for word splitting. | 
| char_level | if True, every character will be treated as a token. | 
| oov_token | if given, it will be added to word_index and used to replace out-of-vocabulary words during texts_to_sequences calls. | 
By default, all punctuation is removed, turning the texts into
space-separated sequences of words
(words may include the ' character). These sequences are then
split into lists of tokens, which are then indexed or vectorized.
0 is a reserved index that won't be assigned to any word.
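A minimal usage sketch (the example texts and the num_words/oov_token values are illustrative, not defaults):

```python
from tensorflow.keras.preprocessing.text import Tokenizer

texts = ["The quick brown fox.", "The lazy, lazy dog!"]

# Keep the 100 most frequent words; map everything else to an OOV token.
tokenizer = Tokenizer(num_words=100, oov_token="<OOV>")
tokenizer.fit_on_texts(texts)

print(tokenizer.word_index)  # word -> index; indexing starts at 1
print(tokenizer.texts_to_sequences(["The quick dog"]))
```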
Methods
fit_on_sequences
fit_on_sequences(
    sequences
)
Updates internal vocabulary based on a list of sequences.
Required before using sequences_to_matrix
(if fit_on_texts was never called).
| Arguments | |
|---|---|
| sequences | A list of sequences. A "sequence" is a list of integer word indices. | 
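A minimal sketch of fitting on integer sequences directly (the sequences shown are illustrative):

```python
from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=5)

# Integer word index sequences produced elsewhere (no raw text involved).
sequences = [[1, 2, 2], [3, 1]]
tokenizer.fit_on_sequences(sequences)

# The document counts gathered above make "tfidf" mode usable.
matrix = tokenizer.sequences_to_matrix(sequences, mode='tfidf')
print(matrix.shape)  # (2, 5)
```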
fit_on_texts
fit_on_texts(
    texts
)
Updates internal vocabulary based on a list of texts.
In the case where texts contains lists, we assume each entry of the lists to be a token.
Required before using texts_to_sequences or texts_to_matrix.
| Arguments | |
|---|---|
| texts | can be a list of strings, a generator of strings (for memory-efficiency), or a list of list of strings. | 
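A short sketch of the vocabulary state that fitting builds up (the example texts are illustrative):

```python
from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer()
tokenizer.fit_on_texts(["hello world", "hello there"])

print(tokenizer.word_counts)     # per-word frequencies across all texts
print(tokenizer.document_count)  # 2
print(tokenizer.word_index)      # {'hello': 1, 'world': 2, 'there': 3}
```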
get_config
get_config()
Returns the tokenizer configuration as a Python dictionary. The word count dictionaries used by the tokenizer are serialized into plain JSON, so that the configuration can be read by other projects.
| Returns | |
|---|---|
| A Python dictionary with the tokenizer configuration. | 
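A small sketch, assuming the returned dictionary exposes the constructor arguments plus the serialized count dictionaries under keys such as 'word_counts':

```python
from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=10)
tokenizer.fit_on_texts(["hello world"])

config = tokenizer.get_config()
print(config['num_words'])          # 10
# The word count dictionaries are serialized as plain JSON strings.
print(type(config['word_counts']))  # <class 'str'>
```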
sequences_to_matrix
sequences_to_matrix(
    sequences, mode='binary'
)
Converts a list of sequences into a Numpy matrix.
| Arguments | |
|---|---|
| sequences | list of sequences (a sequence is a list of integer word indices). | 
| mode | one of "binary", "count", "tfidf", "freq". | 
| Returns | |
|---|---|
| A Numpy matrix. | 
| Raises | |
|---|---|
| ValueError | In case of an invalid mode argument, or if the Tokenizer has not been fit on any data. | 
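A minimal sketch using "count" mode (the texts and vocabulary size are illustrative):

```python
from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=4)
tokenizer.fit_on_texts(["a b b", "b c"])

sequences = tokenizer.texts_to_sequences(["a b b", "b c"])
# One row per sequence, one column per word index (column 0 stays unused).
print(tokenizer.sequences_to_matrix(sequences, mode='count'))
# [[0. 2. 1. 0.]
#  [0. 1. 0. 1.]]
```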
sequences_to_texts
sequences_to_texts(
    sequences
)
Transforms each sequence in sequences into a text (a string).
Only the top num_words-1 most frequent words will be taken into account. Only words known to the tokenizer will be taken into account.
| Arguments | |
|---|---|
| sequences | A list of sequences (list of integers). | 
| Returns | |
|---|---|
| A list of texts (strings). | 
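A minimal round-trip sketch (the example texts are illustrative):

```python
from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer()
tokenizer.fit_on_texts(["the cat sat", "the dog ran"])

sequences = tokenizer.texts_to_sequences(["the cat", "the dog"])
# Maps the index sequences back to space-joined words.
print(tokenizer.sequences_to_texts(sequences))  # ['the cat', 'the dog']
```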
sequences_to_texts_generator
sequences_to_texts_generator(
    sequences
)
Transforms each sequence in sequences into a text (a string).
Each sequence has to be a list of integers; in other words, sequences should be a list of sequences.
Only the top num_words-1 most frequent words will be taken into account. Only words known to the tokenizer will be taken into account.
| Arguments | |
|---|---|
| sequences | A list of sequences. | 
| Yields | |
|---|---|
| Yields individual texts. | 
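The generator variant yields one text at a time instead of building the whole list in memory; a minimal sketch (the example data is illustrative):

```python
from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer()
tokenizer.fit_on_texts(["one two", "two three"])

sequences = tokenizer.texts_to_sequences(["one two", "two three"])
# Texts are produced lazily, one per input sequence.
for text in tokenizer.sequences_to_texts_generator(sequences):
    print(text)
```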
texts_to_matrix
texts_to_matrix(
    texts, mode='binary'
)
Converts a list of texts to a Numpy matrix.
| Arguments | |
|---|---|
| texts | list of strings. | 
| mode | one of "binary", "count", "tfidf", "freq". | 
| Returns | |
|---|---|
| A Numpy matrix. | 
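A minimal sketch in "binary" mode (the texts and vocabulary size are illustrative); this is effectively texts_to_sequences followed by sequences_to_matrix:

```python
from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=4)
tokenizer.fit_on_texts(["a b b", "b c"])

# A 1 marks each word index that occurs in the text.
print(tokenizer.texts_to_matrix(["a b b"], mode='binary'))
# [[0. 1. 1. 0.]]
```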
texts_to_sequences
texts_to_sequences(
    texts
)
Transforms each text in texts to a sequence of integers.
Only the top num_words-1 most frequent words will be taken into account. Only words known to the tokenizer will be taken into account.
| Arguments | |
|---|---|
| texts | A list of texts (strings). | 
| Returns | |
|---|---|
| A list of sequences. | 
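A minimal sketch (the example text is illustrative):

```python
from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer()
tokenizer.fit_on_texts(["the cat sat on the mat"])

# Unknown words ("dog") are skipped because no oov_token was set.
print(tokenizer.texts_to_sequences(["the cat", "the dog mat"]))
# [[1, 2], [1, 5]]
```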
texts_to_sequences_generator
texts_to_sequences_generator(
    texts
)
Transforms each text in texts to a sequence of integers.
Each item in texts can also be a list, in which case each item of that list is assumed to be a token.
Only the top num_words-1 most frequent words will be taken into account. Only words known to the tokenizer will be taken into account.
| Arguments | |
|---|---|
| texts | A list of texts (strings). | 
| Yields | |
|---|---|
| Yields individual sequences. | 
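As with the other generator variant, sequences are yielded one at a time; a minimal sketch (the example data is illustrative):

```python
from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer()
tokenizer.fit_on_texts(["the cat sat on the mat"])

# Sequences are produced lazily, one per input text.
for seq in tokenizer.texts_to_sequences_generator(["the cat", "the mat"]):
    print(seq)  # [1, 2] then [1, 5]
```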
to_json
to_json(
    **kwargs
)
Returns a JSON string containing the tokenizer configuration.
To load a tokenizer from a JSON string, use
keras.preprocessing.text.tokenizer_from_json(json_string).
| Arguments | |
|---|---|
| **kwargs | Additional keyword arguments to be passed to json.dumps(). | 
| Returns | |
|---|---|
| A JSON string containing the tokenizer configuration. |
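A minimal save/restore sketch using tokenizer_from_json from the same module (the example text is illustrative):

```python
from tensorflow.keras.preprocessing.text import Tokenizer, tokenizer_from_json

tokenizer = Tokenizer()
tokenizer.fit_on_texts(["hello world"])

json_string = tokenizer.to_json()
restored = tokenizer_from_json(json_string)

# The restored tokenizer reproduces the original vocabulary.
print(restored.word_index == tokenizer.word_index)  # True
```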