Text tokenization utility class.

This class vectorizes a text corpus by turning each text into either a sequence of integers (each integer being the index of a token in a dictionary) or a vector where the coefficient for each token can be binary, based on word count, based on tf-idf, and so on.

num_words the maximum number of words to keep, based on word frequency. Only the most common num_words-1 words will be kept.
filters a string where each element is a character that will be filtered from the texts. The default is all punctuation, plus tabs and line breaks, minus the ' character.
lower boolean. Whether to convert the texts to lowercase.
split str. Separator for word splitting.
char_level if True, every character will be treated as a token.
oov_token if given, it will be added to word_index and used to replace out-of-vocabulary words during texts_to_sequences calls.
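
The interaction between num_words and oov_token can be illustrated with a small pure-Python sketch. It is not the library's code: the helper to_sequence and the sample vocabulary are hypothetical, and it assumes that the OOV token occupies index 1 and that words ranked at or beyond num_words are replaced by the OOV index (or dropped when no oov_token is set).

```python
def to_sequence(tokens, word_index, num_words=None, oov_token=None):
    """Convert tokens to indices, honoring num_words and oov_token (sketch)."""
    oov_index = word_index.get(oov_token) if oov_token is not None else None
    sequence = []
    for token in tokens:
        index = word_index.get(token)
        # A word is kept only if it is known and its index is below num_words
        in_vocab = index is not None and (num_words is None or index < num_words)
        if in_vocab:
            sequence.append(index)
        elif oov_index is not None:
            sequence.append(oov_index)  # replace the unknown or too-rare word
        # otherwise the word is silently dropped
    return sequence


# Hypothetical vocabulary: the OOV token takes index 1 (an assumption),
# and the remaining words follow in order of frequency.
tokens = ["the", "cat", "sat", "down"]

with_oov = to_sequence(
    tokens, {"<OOV>": 1, "the": 2, "cat": 3, "sat": 4},
    num_words=4, oov_token="<OOV>",
)
without_oov = to_sequence(tokens, {"the": 1, "cat": 2, "sat": 3}, num_words=3)
```

In the second call, num_words=3 keeps only indices 1 and 2, which is the "num_words-1 words" behavior described above.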

By default, all punctuation is removed, turning the texts into space-separated sequences of words (words may include the ' character). These sequences are then split into lists of tokens, which are then indexed or vectorized.

0 is a reserved index that won't be assigned to any word.
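
The default pipeline described above (filter, lowercase, split, index) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the class's implementation: the FILTERS string, the helper names tokenize and build_word_index, and the alphabetical tie-breaking are all choices made for this sketch.

```python
from collections import Counter

# Punctuation plus tab and newline, minus the ' character.
# The exact contents of this string are an assumption made for the sketch.
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def tokenize(text, filters=FILTERS, lower=True, split=" "):
    """Lowercase, replace filtered characters with the separator, then split."""
    if lower:
        text = text.lower()
    table = str.maketrans({ch: split for ch in filters})
    return [tok for tok in text.translate(table).split(split) if tok]


def build_word_index(texts):
    """Map each word to an integer index, most frequent first, starting at 1."""
    counts = Counter(tok for text in texts for tok in tokenize(text))
    # Sort by descending count; ties are broken alphabetically for determinism
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    # Index 0 is never assigned, matching the reserved-index rule above
    return {word: i for i, (word, _) in enumerate(ranked, start=1)}


texts = ["The cat sat.", "The cat's hat!"]
word_index = build_word_index(texts)
```

Starting the enumeration at 1 is what leaves index 0 free, for example for padding.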




Updates internal vocabulary based on a list of sequences.

Required before using sequences_to_matrix (if fit_on_texts was never called).

sequences A list of sequences. A "sequence" is a list of integer word indices.



Updates internal vocabulary based on a list of texts.

In the case where texts contains lists, we assume each entry of the lists to be a token.

Required before using texts_to_sequences or texts_to_matrix.

texts can be a list of strings, a generator of strings (for memory-efficiency), or a list of lists of strings.



Returns the tokenizer configuration as a Python dictionary. The word count dictionaries used by the tokenizer are serialized into plain JSON, so that the configuration can be read by other projects.

A Python dictionary with the tokenizer configuration.
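
To make the serialization idea concrete, the sketch below stores a word count dictionary as a JSON string inside a configuration dictionary and reads it back. The key names and values are hypothetical and chosen only for illustration; the real configuration may contain different fields.

```python
import json

# Hypothetical internal state; key names below are assumptions for illustration.
word_counts = {"the": 2, "cat": 1}

config = {
    "num_words": 100,
    "lower": True,
    "oov_token": None,
    # The dictionary is stored as a JSON string so non-Python tools can parse it
    "word_counts": json.dumps(word_counts),
}

# The whole configuration is itself JSON-serializable and survives a round trip
serialized = json.dumps(config)
restored = json.loads(serialized)
restored_counts = json.loads(restored["word_counts"])
```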



Converts a list of sequences into a NumPy matrix.

sequences list of sequences (a sequence is a list of integer word indices).
mode one of "binary", "count", "tfidf", "freq"

A NumPy matrix.

ValueError In case of an invalid mode argument, or if the Tokenizer has not yet been fit to sample data.
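
The "binary", "count" and "freq" modes can be illustrated with a short pure-Python sketch. It is an approximation for explanation only: the helper sequences_to_rows is hypothetical, it returns nested lists rather than a NumPy matrix, it assumes "freq" divides each count by the sequence length, and it leaves out "tfidf", which needs document statistics gathered while fitting.

```python
def sequences_to_rows(sequences, num_words, mode="binary"):
    """Build one row of length num_words per sequence (sketch, no NumPy)."""
    if mode not in ("binary", "count", "freq"):
        raise ValueError(f"Unsupported mode in this sketch: {mode}")
    rows = []
    for seq in sequences:
        row = [0.0] * num_words
        counts = {}
        for index in seq:
            if index < num_words:  # indices beyond the vocabulary are ignored
                counts[index] = counts.get(index, 0) + 1
        for index, count in counts.items():
            if mode == "binary":
                row[index] = 1.0
            elif mode == "count":
                row[index] = float(count)
            else:  # "freq": share of the sequence taken up by this index
                row[index] = count / len(seq)
        rows.append(row)
    return rows


sequences = [[1, 2, 2], [3]]
binary = sequences_to_rows(sequences, num_words=4, mode="binary")
count = sequences_to_rows(sequences, num_words=4, mode="count")
freq = sequences_to_rows(sequences, num_words=4, mode="freq")
```

Column 0 stays empty in every row because index 0 is never assigned to a word.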



Transforms each sequence into a list of text.
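
As a rough illustration of the reverse mapping, the sketch below inverts a word index and joins the recovered words with spaces. The helper decode_sequences and the sample vocabulary are hypothetical, and skipping indices that are not in the vocabulary is an assumption of this sketch.

```python
def decode_sequences(sequences, word_index, split=" "):
    """Turn each list of integer indices back into a single string (sketch)."""
    # Invert the word -> index mapping into index -> word
    index_word = {index: word for word, index in word_index.items()}
    return [
        split.join(index_word[i] for i in seq if i in index_word)
        for seq in sequences
    ]


word_index = {"the": 1, "cat": 2, "sat": 3}
decoded = decode_sequences([[1, 2, 3], [2, 9]], word_index)
```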