Common Transformations

In this document we describe how to do common transformations with tf.transform.

We assume you have already constructed the beam pipeline along the lines of the examples, and only describe what needs to be added to preprocessing_fn and possibly model.

Using String/Categorical data

The following preprocessing_fn will compute a vocabulary over the values of feature x with tokens in descending frequency order, convert feature x values to their index in the vocabulary, and finally perform a one-hot encoding for the output.

This is common for example in use cases where the label feature is a categorical string. The resulting one-hot encoding is ready for training.

def preprocessing_fn(inputs):
  integerized = tft.compute_and_apply_vocabulary(
      inputs['x'],
      num_oov_buckets=1,
      vocab_filename='x_vocab')
  one_hot_encoded = tf.one_hot(
      integerized,
      depth=tf.cast(tft.experimental.get_vocabulary_size_by_name('x_vocab') + 1,
                    tf.int32),
      on_value=1.0,
      off_value=0.0)
  return {
    'x_out': one_hot_encoded,
  }

Mean imputation for missing data

In this example, feature x is an optional feature, represented as a tf.SparseTensor in the preprocessing_fn. In order to convert it to a dense tensor, we compute its mean, and set the mean to be the default value when it is missing from an instance.

The resulting dense tensor will have the shape [None, 1], None represents the batch dimension, and for the second dimension it will be the number of values that x can have per instance. In this case it's 1.

def preprocessing_fn(inputs):
  return {
      'x_out': tft.sparse_tensor_to_dense_with_shape(
          inputs['x'], default_value=tft.mean(x), shape=[None, 1])
  }