tfx.components.example_gen.base_example_gen_executor.BaseExampleGenExecutor

Generic TFX example gen base executor.

Inherits From: BaseExecutor

The base ExampleGen executor takes a configuration and converts external data sources to TensorFlow Examples (tf.train.Example, tf.train.SequenceExample), or to any other protocol buffer that a subclass defines.

The common configuration (defined in https://github.com/tensorflow/tfx/blob/master/tfx/proto/example_gen.proto#L44) describes the general properties of input data and shared instructions when producing output data.

The conversion is done in GenerateExamplesByBeam as a Beam pipeline, which validates the configuration, reads the external data sources, converts the records in the input source to a supported output payload format (e.g., tf.Example or tf.SequenceExample) if needed, and splits the examples if an output split config is given. The executor's Do method then writes the resulting splits to the output path.

For simple custom ExampleGens, the details of transforming input data records to a specific output payload format (e.g., tf.Example or tf.SequenceExample) are expected to be given in GetInputSourceToExamplePTransform, which returns a Beam PTransform with the actual implementation. For complex use cases, such as joining multiple data sources or interpreting the configuration differently, the custom ExampleGen can override GenerateExamplesByBeam.
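The division of labour described above can be pictured with a library-free toy. This is only a sketch of the pattern, not TFX code: ToyExampleGen, get_record_transform, and generate_examples are hypothetical stand-ins for BaseExampleGenExecutor, GetInputSourceToExamplePTransform, and GenerateExamplesByBeam, and plain Python lists stand in for Beam PCollections.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in: a "transform" maps one input split's raw rows to
# serialized records. A real executor returns a Beam PTransform instead.
RecordTransform = Callable[[List[str]], List[bytes]]


class ToyExampleGen:
    """Toy base class that mirrors the hook/override pattern only."""

    def get_record_transform(self) -> RecordTransform:
        # Simple subclasses implement only this hook.
        raise NotImplementedError

    def generate_examples(
        self, inputs: Dict[str, List[str]]
    ) -> Dict[str, List[bytes]]:
        # Default driver: apply the hook to each input split separately.
        transform = self.get_record_transform()
        return {name: transform(rows) for name, rows in inputs.items()}


class SimpleGen(ToyExampleGen):
    """Simple case: supply only the per-split conversion."""

    def get_record_transform(self) -> RecordTransform:
        return lambda rows: [row.strip().encode("utf-8") for row in rows]


class JoiningGen(ToyExampleGen):
    """Complex case: override the driver to combine two sources."""

    def generate_examples(
        self, inputs: Dict[str, List[str]]
    ) -> Dict[str, List[bytes]]:
        joined = [
            f"{left}|{right}".encode("utf-8")
            for left, right in zip(inputs["left"], inputs["right"])
        ]
        return {"joined": joined}


simple_out = SimpleGen().generate_examples(
    {"train": ["a \n", "b\n"], "eval": ["c\n"]}
)
joined_out = JoiningGen().generate_examples(
    {"left": ["1", "2"], "right": ["x", "y"]}
)
```

SimpleGen never touches the driver, while JoiningGen replaces it because its logic spans more than one input at a time. That is the same trade-off as choosing between the two override points on the real executor.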

Child Classes

class Context

Methods

Do

Takes an input data source and generates serialized data splits.

The output is intended to be serialized tf.train.Example or tf.train.SequenceExample protocol buffers in gzipped TFRecord format. Subclasses can override this to write any serialized record payload into gzipped TFRecord files, as long as the downstream component can consume it. The payload format is recorded in the payload_format custom property of the output Example artifact.

Args
input_dict: Input dict from input key to a list of Artifacts. Depends on detailed example gen implementation.
output_dict: Output dict from output key to a list of Artifacts.
  • examples: splits of serialized records.
exec_properties: A dict of execution properties. Depends on detailed example gen implementation.
  • input_base: an external directory containing the data files.
  • input_config: JSON string of example_gen_pb2.Input instance, providing input configuration.
  • output_config: JSON string of example_gen_pb2.Output instance, providing output configuration.
  • output_data_format: Payload format of generated data in output artifact, one of example_gen_pb2.PayloadFormat enum.

Returns
None
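As a rough illustration of the exec_properties described above, the following sketch builds the dict with the standard json module. The field names inside the two configs (splits, name, pattern, split_config, hash_buckets) are assumptions based on one reading of example_gen.proto and should be checked against the proto itself. The directory path and split names are hypothetical.

```python
import json

# Assumed shape of example_gen_pb2.Input: one input split matched by a glob.
input_config = {
    "splits": [{"name": "single_split", "pattern": "*"}],
}

# Assumed shape of example_gen_pb2.Output: divide that single input split
# into train and eval output splits at a 2:1 ratio.
output_config = {
    "split_config": {
        "splits": [
            {"name": "train", "hash_buckets": 2},
            {"name": "eval", "hash_buckets": 1},
        ]
    }
}

exec_properties = {
    "input_base": "/path/to/data",  # hypothetical directory
    "input_config": json.dumps(input_config),
    "output_config": json.dumps(output_config),
    # Placeholder: a real pipeline passes a value of the
    # example_gen_pb2.PayloadFormat enum. The enum name is used here only
    # because this sketch does not import the proto module.
    "output_data_format": "FORMAT_TF_EXAMPLE",
}

# The two configs travel as JSON strings, so they decode back to the
# same structure on the executor side.
decoded_output = json.loads(exec_properties["output_config"])
total_buckets = sum(
    split["hash_buckets"] for split in decoded_output["split_config"]["splits"]
)
```

In real code the JSON strings would normally come from serializing the proto messages rather than from hand-written dicts.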

GenerateExamplesByBeam

Converts input source to serialized record splits based on configs.

A custom ExampleGen executor should provide GetInputSourceToExamplePTransform for converting an input split to serialized records. Override GenerateExamplesByBeam instead if complex logic is needed, e.g., custom splitting logic.

Args
pipeline: Beam pipeline.
exec_properties: A dict of execution properties. Depends on detailed example gen implementation.
  • input_base: an external directory containing the data files.
  • input_config: JSON string of example_gen_pb2.Input instance, providing input configuration.
  • output_config: JSON string of example_gen_pb2.Output instance, providing output configuration.
  • output_data_format: Payload format of generated data in output artifact, one of example_gen_pb2.PayloadFormat enum.

Returns
Dict of Beam PCollections with the split name as key. Each PCollection is a single output split that contains serialized records.
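The text above does not spell out how records are assigned to output splits. One plausible scheme, assumed here purely for illustration, is hash-bucket partitioning: hash each serialized record, reduce it modulo the total bucket count, and walk the cumulative bucket ranges. The sketch uses plain lists in place of PCollections and hashlib.md5 as a stand-in hash. The function name partition_by_hash is hypothetical.

```python
import hashlib
from typing import Dict, List, Tuple


def partition_by_hash(
    records: List[bytes], splits: List[Tuple[str, int]]
) -> Dict[str, List[bytes]]:
    """Assigns each record to one split; splits is [(name, buckets), ...]."""
    total_buckets = sum(buckets for _, buckets in splits)
    result: Dict[str, List[bytes]] = {name: [] for name, _ in splits}
    for record in records:
        # md5 is a stand-in; any stable hash of the record bytes would do.
        digest = hashlib.md5(record).digest()
        bucket = int.from_bytes(digest[:8], "big") % total_buckets
        # Find the split whose cumulative bucket range contains this bucket.
        upper_bound = 0
        for name, buckets in splits:
            upper_bound += buckets
            if bucket < upper_bound:
                result[name].append(record)
                break
    return result


records = [f"record-{i}".encode("utf-8") for i in range(300)]
split_result = partition_by_hash(records, [("train", 2), ("eval", 1)])
```

Because the assignment depends only on the record bytes, rerunning the partition over the same data yields the same splits. The result has the same shape as the return value described above: one collection per split name.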

GetInputSourceToExamplePTransform

Returns PTransform for converting input source to records.

Records are assumed by default to be tf.train.Example protos. Subclasses can serialize any protocol buffer into bytes as the output PCollection, as long as the downstream component can consume it.

Note that each input split is transformed by this function separately. For complex use cases, consider overriding GenerateExamplesByBeam instead.

Here is an example PTransform signature (beam is apache_beam, tf is tensorflow, and Union, Dict, Text, and Any come from typing):

    @beam.ptransform_fn
    @beam.typehints.with_input_types(beam.Pipeline)
    @beam.typehints.with_output_types(Union[tf.train.Example, tf.train.SequenceExample, bytes])
    def ExamplePTransform(
        pipeline: beam.Pipeline,
        exec_properties: Dict[Text, Any],
        split_pattern: Text) -> beam.pvalue.PCollection