Generic TFX example gen base executor.
Inherits From: BaseExecutor
```python
tfx.components.example_gen.base_example_gen_executor.BaseExampleGenExecutor(
    context: Optional[tfx.dsl.components.base.base_executor.BaseExecutor.Context] = None
)
```
The base ExampleGen executor takes a configuration and converts external data sources to TensorFlow Examples (tf.train.Example, tf.train.SequenceExample), or any other protocol buffer as subclass defines.
The common configuration (defined in https://github.com/tensorflow/tfx/blob/master/tfx/proto/example_gen.proto#L44) describes the general properties of input data and shared instructions when producing output data.
The conversion is done in `GenerateExamplesByBeam` as a Beam pipeline, which validates the configuration, reads the external data sources, converts the records in the input source to any supported output payload format (e.g., tf.Example or tf.SequenceExample) if needed, and splits the examples if an output split config is given. Then the executor's `Do` writes the results in splits to the output path.
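The hash-bucket splitting step can be sketched without Beam. The following is an illustrative, pure-Python analogue (`partition` and `bucket_of` are hypothetical names, not TFX APIs), assuming splits are expressed as a mapping from split name to a hash-bucket count, as in the SplitConfig proto:

```python
# Illustrative, pure-Python analogue of ExampleGen's hash-bucket splitting.
# `partition` and `bucket_of` are hypothetical names, not TFX APIs; TFX does
# this inside a Beam pipeline, driven by SplitConfig hash_buckets settings.
import hashlib
from typing import Dict, Iterable, List

def bucket_of(record: bytes, total_buckets: int) -> int:
    # A stable hash of the record bytes, so the same record always lands
    # in the same bucket across runs.
    digest = hashlib.sha256(record).digest()
    return int.from_bytes(digest[:8], "big") % total_buckets

def partition(records: Iterable[bytes],
              splits: Dict[str, int]) -> Dict[str, List[bytes]]:
    """Assigns each serialized record to a named split by hash bucket."""
    total = sum(splits.values())
    # Cumulative bucket ranges, e.g. {"train": 2, "eval": 1} means
    # train covers buckets [0, 2) and eval covers [2, 3).
    boundaries, start = [], 0
    for name, buckets in splits.items():
        boundaries.append((name, start, start + buckets))
        start += buckets
    out = {name: [] for name in splits}
    for rec in records:
        b = bucket_of(rec, total)
        for name, lo, hi in boundaries:
            if lo <= b < hi:
                out[name].append(rec)
                break
    return out
```

Because the split assignment is a pure function of the record bytes, re-running the partition over the same input yields identical splits.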
For simple custom ExampleGens, the details of transforming input data record(s) to a specific output payload format (e.g., tf.Example or tf.SequenceExample) are expected to be given in `GetInputSourceToExamplePTransform`, which returns a Beam PTransform with the actual implementation. For complex use cases, such as joining multiple data sources or different interpretations of the configurations, the custom ExampleGen can override `GenerateExamplesByBeam`.
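The division of labor between these two hooks can be sketched in plain Python (no Beam or TFX dependency; the class and snake_case method names below are illustrative stand-ins, not the real API):

```python
# Pure-Python sketch of the two ExampleGen extension points described above.
# The class and method names here are illustrative stand-ins, not TFX APIs.
import abc
from typing import Dict, List

class SketchExampleGenExecutor(abc.ABC):

    @abc.abstractmethod
    def input_source_to_examples(self, split_pattern: str) -> List[bytes]:
        """Simple custom ExampleGens implement only this per-split hook
        (the analogue of GetInputSourceToExamplePTransform)."""

    def generate_examples(
            self, input_splits: Dict[str, str]) -> Dict[str, List[bytes]]:
        """Applies the per-split hook once per input split (the analogue of
        GenerateExamplesByBeam). Complex ExampleGens, e.g. ones joining
        multiple data sources, would override this method instead."""
        return {name: self.input_source_to_examples(pattern)
                for name, pattern in input_splits.items()}
```

A minimal subclass only has to say how one split pattern becomes serialized records; the base class handles fanning out over all configured splits.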
Child Classes
Methods
Do
```python
Do(
    input_dict: Dict[Text, List[types.Artifact]],
    output_dict: Dict[Text, List[types.Artifact]],
    exec_properties: Dict[Text, Any]
) -> None
```
Takes the input data source and generates serialized data splits.
The output is intended to be serialized tf.train.Example or tf.train.SequenceExample protocol buffers in gzipped TFRecord format, but subclasses can choose to write any serialized record payload into gzipped TFRecords, so long as the downstream component can consume it. The format of the payload is recorded in the `payload_format` custom property of the output Example artifact.
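For context on the on-disk container, the TFRecord framing can be sketched as a roundtrip in plain Python. This is illustrative only: real TFRecord files use masked CRC32C checksums, while this sketch substitutes zlib.crc32 (so its files are not readable by actual TFRecord readers), and `write_tfrecord_gz`/`read_tfrecord_gz` are hypothetical helpers, not TFX APIs.

```python
# Illustrative sketch of the gzipped record container described above.
# NOTE: real TFRecord framing uses a masked CRC32C checksum; zlib.crc32 is
# used here only as a stand-in, since crc32c is not in the standard library,
# so files produced by this sketch are NOT valid TFRecord files.
import gzip
import struct
import zlib

def _masked_crc(data: bytes) -> int:
    crc = zlib.crc32(data) & 0xFFFFFFFF  # stand-in for CRC32C
    return ((crc >> 15) | (crc << 17)) + 0xA282EAD8 & 0xFFFFFFFF

def write_tfrecord_gz(path: str, records) -> None:
    """Writes records as [length][length-crc][data][data-crc] frames."""
    with gzip.open(path, "wb") as f:
        for rec in records:
            length = struct.pack("<Q", len(rec))
            f.write(length)
            f.write(struct.pack("<I", _masked_crc(length)))
            f.write(rec)
            f.write(struct.pack("<I", _masked_crc(rec)))

def read_tfrecord_gz(path: str):
    """Reads frames back, skipping checksum verification for brevity."""
    out = []
    with gzip.open(path, "rb") as f:
        while True:
            header = f.read(8)
            if not header:
                break
            (length,) = struct.unpack("<Q", header)
            f.read(4)  # skip length checksum
            out.append(f.read(length))
            f.read(4)  # skip data checksum
    return out
```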
Args | |
---|---|
`input_dict` | Input dict from input key to a list of Artifacts. Depends on detailed example gen implementation. |
`output_dict` | Output dict from output key to a list of Artifacts. |
`exec_properties` | A dict of execution properties. Depends on detailed example gen implementation. |
Returns | |
---|---|
None | |
GenerateExamplesByBeam
```python
GenerateExamplesByBeam(
    pipeline: beam.Pipeline,
    exec_properties: Dict[Text, Any]
) -> Dict[Text, beam.pvalue.PCollection]
```
Converts input source to serialized record splits based on configs.
A custom ExampleGen executor should provide `GetInputSourceToExamplePTransform` for converting an input split to serialized records. Override this `GenerateExamplesByBeam` method instead if complex logic is needed, e.g., custom splitting logic.
Args | |
---|---|
`pipeline` | Beam pipeline. |
`exec_properties` | A dict of execution properties. Depends on detailed example gen implementation. |
Returns | |
---|---|
Dict of Beam PCollections with split name as key; each PCollection is a single output split that contains serialized records. |
GetInputSourceToExamplePTransform
```python
@abc.abstractmethod
GetInputSourceToExamplePTransform() -> beam.PTransform
```
Returns PTransform for converting input source to records.
The records are by default assumed to be tf.train.Example protos; subclasses can serialize any protocol buffer into bytes as the output PCollection, so long as the downstream component can consume it.
Note that each input split will be transformed by this function separately. For complex use cases, consider overriding `GenerateExamplesByBeam` instead.
Here is an example PTransform:

```python
@beam.ptransform_fn
@beam.typehints.with_input_types(beam.Pipeline)
@beam.typehints.with_output_types(Union[tf.train.Example, tf.train.SequenceExample, bytes])
def ExamplePTransform(
    pipeline: beam.Pipeline,
    exec_properties: Dict[Text, Any],
    split_pattern: Text) -> beam.pvalue.PCollection
```