Text processing tools for TensorFlow
```python
import tensorflow as tf
import tensorflow_text as tf_text

def preprocess(vocab_lookup_table, example_text):
  # Normalize text
  example_text = tf_text.normalize_utf8(example_text)

  # Tokenize into words
  word_tokenizer = tf_text.WhitespaceTokenizer()
  tokens = word_tokenizer.tokenize(example_text)

  # Tokenize into subwords
  subword_tokenizer = tf_text.WordpieceTokenizer(
      vocab_lookup_table, token_out_type=tf.int64)
  subtokens = subword_tokenizer.tokenize(tokens).merge_dims(1, -1)

  # Apply padding; returns the padded ids and a matching mask
  padded_inputs, input_mask = tf_text.pad_model_inputs(
      subtokens, max_seq_length=16)
  return padded_inputs, input_mask
```
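As a minimal usage sketch, you might call `preprocess` with a small in-memory vocabulary table. The vocabulary below and the `num_oov_buckets` value are illustrative assumptions, not part of the library:

```python
# A hypothetical, tiny wordpiece vocabulary purely for illustration.
vocab = ["[UNK]", "the", "quick", "brown", "fox", "##es"]
init = tf.lookup.KeyValueTensorInitializer(
    keys=vocab,
    values=tf.range(len(vocab), dtype=tf.int64),
    key_dtype=tf.string,
    value_dtype=tf.int64)
vocab_lookup_table = tf.lookup.StaticVocabularyTable(init, num_oov_buckets=1)

padded_ids, input_mask = preprocess(
    vocab_lookup_table, tf.constant(["the quick brown foxes"]))
```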
TensorFlow provides a rich collection of ops and libraries to help you work with input in text form, such as raw text strings or documents. These libraries can perform the preprocessing regularly required by text-based models, and include other features useful for sequence modeling.
You can extract powerful syntactic and semantic text features from inside the TensorFlow graph as input to your neural net.
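As a minimal sketch of what "inside the TensorFlow graph" means, the snippet below wraps a tokenizer in a `tf.function` so the text op is traced into the graph like any other TensorFlow op (the function name `tokenize_in_graph` and the example sentence are illustrative assumptions):

```python
import tensorflow as tf
import tensorflow_text as tf_text

@tf.function
def tokenize_in_graph(texts):
  # WhitespaceTokenizer runs as a regular TensorFlow op, so this
  # whole function is traced into a single graph.
  tokenizer = tf_text.WhitespaceTokenizer()
  return tokenizer.tokenize(texts)

# Returns a tf.RaggedTensor of UTF-8 tokens, one row per input string.
tokens = tokenize_in_graph(tf.constant(["the quick brown fox"]))
```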
Integrating preprocessing with the TensorFlow graph provides the following benefits:
- Facilitates a large toolkit for working with text
- Allows integration with a large suite of TensorFlow tools to support projects from problem definition through training, evaluation, and launch
- Reduces complexity at serving time and prevents training-serving skew
In addition, because preprocessing lives in the graph, you do not need to worry about tokenization during training differing from tokenization at inference, or about managing preprocessing scripts.
