tfx.v1.components.Transform

A TFX component to transform the input examples.

Inherits From: BaseComponent, BaseNode

The Transform component wraps TensorFlow Transform (tf.Transform) to preprocess data in a TFX pipeline. This component loads the preprocessing_fn from the input module file, preprocesses both the 'train' and 'eval' splits of the input examples, generates the tf.Transform output, and saves both the transform function and the transformed examples to locations determined by the orchestrator.

The Transform component can also invoke TFDV to compute statistics on the pre-transform and post-transform data. Invocations of TFDV take an optional StatsOptions object. To configure the StatsOptions object that is passed to TFDV for both pre-transform and post-transform statistics, users can define the optional stats_options_updater_fn within the module file.

Providing a preprocessing function

The Transform executor loads the module provided in module_file and looks specifically for the preprocessing_fn() function within that file.

An example of preprocessing_fn() can be found in the user-supplied code of the TFX Chicago Taxi pipeline example.
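For orientation, here is a minimal, hedged sketch of the preprocessing_fn contract (not the Chicago Taxi code itself). A real module file would apply tensorflow_transform (tft) operations such as tft.scale_to_z_score to tf.Tensor values; plain-Python stand-ins are used here so the dict-in/dict-out shape of the function stays visible:

```python
from typing import Any, Dict


def preprocessing_fn(inputs: Dict[str, Any]) -> Dict[str, Any]:
  """Maps raw feature values to transformed feature values (sketch).

  In a real module file each value would be a tf.Tensor or tf.SparseTensor
  and the body would call tft ops, e.g.:
      outputs[name + '_xf'] = tft.scale_to_z_score(value)
  """
  outputs = {}
  for name, value in inputs.items():
    outputs[name + '_xf'] = value  # identity stand-in for a tft op
  return outputs
```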

Updating StatsOptions

The Transform executor will look specifically for the stats_options_updater_fn() within the module file specified above.

An example of stats_options_updater_fn() can be found in the user-supplied code of the TFX BERT MRPC pipeline example.

Example

# Performs transformations and feature engineering in training and serving.
transform = Transform(
    examples=example_gen.outputs['examples'],
    schema=infer_schema.outputs['schema'],
    module_file=module_file)

Component outputs contains:

  • transform_graph: Channel of type standard_artifacts.TransformGraph, which includes an exported TensorFlow graph suitable for both training and serving.
  • transformed_examples: Channel of type standard_artifacts.Examples for the materialized transformed examples, which includes the transformed splits as specified in splits_config. This output is optional and is controlled by the materialize argument.
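These channels are typically consumed downstream, for example by a Trainer that reads the materialized examples and the transform graph. A sketch, assuming the example_gen and infer_schema components and the trainer_module_file path from a typical pipeline:

```python
from tfx.components import Trainer

# `transform` is the Transform instance from the example above; the other
# names are assumed pipeline pieces, not defined by this document.
trainer = Trainer(
    module_file=trainer_module_file,
    examples=transform.outputs['transformed_examples'],
    transform_graph=transform.outputs['transform_graph'],
    schema=infer_schema.outputs['schema'])
```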

Please see the Transform guide for more details.

Args

examples A BaseChannel of type standard_artifacts.Examples (required). This should contain the custom splits specified in splits_config. If custom splits are not provided, this should contain the two splits 'train' and 'eval'.
schema A BaseChannel of type standard_artifacts.Schema. This should contain a single schema artifact.
module_file The file path to a python module file, from which the 'preprocessing_fn' function will be loaded. Exactly one of 'module_file' or 'preprocessing_fn' must be supplied.

The function needs to have the following signature:

def preprocessing_fn(inputs: Dict[Text, Any]) -> Dict[Text, Any]:
  ...

where the values of the input and returned Dicts are either tf.Tensor or tf.SparseTensor.

If additional inputs are needed for preprocessing_fn, they can be passed in custom_config:

def preprocessing_fn(inputs: Dict[Text, Any], custom_config:
                     Dict[Text, Any]) -> Dict[Text, Any]:
  ...
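The extra dict is supplied on the component itself via the custom_config argument; a sketch (the num_buckets key is an assumed example, not part of the API):

```python
transform = Transform(
    examples=example_gen.outputs['examples'],
    schema=infer_schema.outputs['schema'],
    module_file=module_file,
    custom_config={'num_buckets': 10})  # passed through to preprocessing_fn
```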

To update the stats options used to compute the pre-transform or post-transform statistics, optionally define the 'stats_options_updater_fn' within the same module. If implemented, this function needs to have the following signature:

def stats_options_updater_fn(
    stats_type: tfx.components.transform.stats_options_util.StatsType,
    stats_options: tfdv.StatsOptions) -> tfdv.StatsOptions:
  ...
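As a hedged illustration of the branching such a function might perform (the arguments are left untyped here so the logic reads without importing tfx/tfdv; the sample_rate tweak and the string-based PRE_TRANSFORM check are assumptions for illustration, not the BERT MRPC code):

```python
def stats_options_updater_fn(stats_type, stats_options):
  """Returns stats_options adjusted per statistics phase (sketch)."""
  # In a real module file, compare stats_type against the members of
  # tfx.components.transform.stats_options_util.StatsType instead of
  # matching on its string form.
  if 'PRE_TRANSFORM' in str(stats_type):
    # Example tweak: sample the raw data before computing statistics.
    stats_options.sample_rate = 0.1
  return stats_options
```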

Use of a RuntimeParameter for this argument is experimental.

preprocessing_fn The path to python function that implements a 'preprocessing_fn'. See 'module_file' for expected signature of the function. Exactly one of 'module_file' or 'preprocessing_fn' must be supplied. Use of a RuntimeParameter for this argument is experimental.
splits_config A transform_pb2.SplitsConfig instance, providing splits that should be analyzed and splits that should be transformed. Note analyze and transform splits can have overlap. Default behavior (when splits_config is not set) is analyze the 'train' split and transform all splits. If splits_config is set, analyze cannot be empty.
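For example, to analyze only the 'train' split while transforming both splits (a configuration sketch; the upstream component names are assumed as in the earlier example):

```python
from tfx.proto import transform_pb2

transform = Transform(
    examples=example_gen.outputs['examples'],
    schema=infer_schema.outputs['schema'],
    module_file=module_file,
    splits_config=transform_pb2.SplitsConfig(
        analyze=['train'],
        transform=['train', 'eval']))
```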
analyzer_cache Optional input 'TransformCache' channel containing cached information from previous Transform runs. When provided, Transform will try to use the cached calculations if possible.
materialize If True, write transformed examples as an output.
disable_analyzer_cache If False, Transform will use input cache if provided and write cache output. If True, analyzer_cache must not be provided.
force_tf_compat_v1 (Optional) If True, or if TF2 behaviors are disabled in the installed TensorFlow, Transform will use TensorFlow in compat.v1 mode irrespective of the installed TensorFlow version. Defaults to False.
custom_config A dict which contains additional parameters that will be passed to preprocessing_fn.
disable_statistics If True, do not invoke TFDV to compute pre-transform and post-transform statistics. When statistics are computed, they will be stored in the pre_transform_feature_stats/ and post_transform_feature_stats/ subfolders of the transform_graph export.
stats_options_updater_fn The path to a python function that implements a 'stats_options_updater_fn'. See 'module_file' for expected signature of the function. 'stats_options_updater_fn' cannot be defined if 'module_file' is specified.

Raises

ValueError When both, or neither, of 'module_file' and 'preprocessing_fn' are supplied.

Attributes

outputs Component's output channel dict.

Methods

with_beam_pipeline_args

Add per-component Beam pipeline args.

Args
beam_pipeline_args List of Beam pipeline args to be added to the Beam executor spec.

Returns
the same component itself.