TFX example gen executor for processing parquet format.
Inherits From: BaseExampleGenExecutor, BaseExecutor
tfx.components.example_gen.custom_executors.parquet_executor.Executor(
context: Optional[tfx.dsl.components.base.base_executor.BaseExecutor.Context] = None
)
Data type conversion:
- Integer types are converted to tf.train.Feature with tf.train.Int64List.
- Float types are converted to tf.train.Feature with tf.train.FloatList.
- String types are converted to tf.train.Feature with tf.train.BytesList, using utf-8 encoding.

Note that:

- A single value is converted to a list containing that single value.
- A missing value is converted to an empty tf.train.Feature().
- Parquet data might lose precision, e.g., int96.
For details, check the dict_to_example function in example_gen.utils.
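The conversion rules above can be illustrated with a small, dependency-free sketch. This is not the dict_to_example implementation: the function name `describe_feature` is hypothetical, and it only reports which tf.train.Feature field (int64_list, float_list, or bytes_list) a value would land in under the rules stated here. Treating an empty list like a missing value is an assumption of this sketch.

```python
from typing import Any, List, Optional, Tuple


def describe_feature(value: Any) -> Tuple[Optional[str], List[Any]]:
    """Hypothetical helper: map a Python value to (feature field, values).

    A field of None stands for an empty tf.train.Feature().
    """
    if value is None:
        # Missing value -> empty feature.
        return (None, [])
    # A single value becomes a list holding that single value.
    values = list(value) if isinstance(value, (list, tuple)) else [value]
    if not values:
        # Assumption of this sketch: an empty list is treated like a missing value.
        return (None, [])
    first = values[0]
    if isinstance(first, int):
        return ("int64_list", values)
    if isinstance(first, float):
        return ("float_list", values)
    if isinstance(first, str):
        # Strings are stored as utf-8 encoded bytes.
        return ("bytes_list", [v.encode("utf-8") for v in values])
    raise ValueError(f"Unsupported value type: {type(first).__name__}")


if __name__ == "__main__":
    row = {"age": 42, "score": [0.5, 1.5], "name": "café", "nickname": None}
    for key, value in row.items():
        print(key, describe_feature(value))
```

The sketch checks only the first element of a list, so it assumes every list holds values of a single type.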
Example usage:

    from tfx.components.base import executor_spec
    from tfx.components.example_gen.component import FileBasedExampleGen
    from tfx.components.example_gen.custom_executors import parquet_executor
    from tfx.utils.dsl_utils import external_input

    example_gen = FileBasedExampleGen(
        input=external_input(parquet_dir_path),
        custom_executor_spec=executor_spec.ExecutorClassSpec(
            parquet_executor.Executor))

Child Classes
Methods
Do
Do(
input_dict: Dict[Text, List[types.Artifact]],
output_dict: Dict[Text, List[types.Artifact]],
exec_properties: Dict[Text, Any]
) -> None
Takes the input data source and generates serialized data splits.
The output is intended to be serialized tf.train.Example or
tf.train.SequenceExample protocol buffers in gzipped TFRecord format,
but subclasses can override this method to write any serialized record
payload into gzipped TFRecord files, as long as the downstream
component can consume it. The payload format is recorded in the
payload_format custom property of the output Example artifact.
Args | |
---|---|
input_dict | Input dict from input key to a list of Artifacts. Depends on detailed example gen implementation. |
output_dict | Output dict from output key to a list of Artifacts. |
exec_properties | A dict of execution properties. Depends on detailed example gen implementation. |
Returns | |
---|---|
None |
GenerateExamplesByBeam
GenerateExamplesByBeam(
pipeline: beam.Pipeline,
exec_properties: Dict[Text, Any]
) -> Dict[Text, beam.pvalue.PCollection]
Converts input source to serialized record splits based on configs.
A custom ExampleGen executor should provide GetInputSourceToExamplePTransform for converting an input split to serialized records. Override this GenerateExamplesByBeam method instead if complex logic is needed, e.g., custom splitting logic.
Args | |
---|---|
pipeline | Beam pipeline. |
exec_properties | A dict of execution properties. Depends on detailed example gen implementation. |
Returns | |
---|---|
Dict of Beam PCollections keyed by split name; each PCollection is a single output split that contains serialized records. |
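To make the shape of the return value concrete, here is a minimal plain-Python sketch of one possible splitting scheme: hash-bucket partitioning of serialized records into named splits. It uses lists instead of Beam PCollections and is not this executor's implementation. The function `split_records` and the `split_buckets` parameter are hypothetical names introduced for illustration, and the choice of hashing is an assumption.

```python
import hashlib
from typing import Dict, List, Sequence, Tuple


def split_records(
    records: Sequence[bytes],
    split_buckets: Sequence[Tuple[str, int]],
) -> Dict[str, List[bytes]]:
    """Hypothetical helper: assign each record to a split by hash bucket.

    split_buckets lists (split name, bucket count) pairs, e.g.
    [("train", 2), ("eval", 1)] for a roughly 2:1 split.
    """
    total_buckets = sum(count for _, count in split_buckets)
    splits: Dict[str, List[bytes]] = {name: [] for name, _ in split_buckets}
    for record in records:
        # A stable hash keeps the assignment deterministic across runs.
        digest = hashlib.sha256(record).digest()
        bucket = int.from_bytes(digest[:8], "big") % total_buckets
        upper = 0
        for name, count in split_buckets:
            upper += count
            if bucket < upper:
                splits[name].append(record)
                break
    return splits


if __name__ == "__main__":
    rows = [f"row-{i}".encode("utf-8") for i in range(10)]
    for name, members in split_records(rows, [("train", 2), ("eval", 1)]).items():
        print(name, len(members))
```

The result maps each split name to the records assigned to it, mirroring the split-name-to-PCollection dict described above. In a Beam pipeline the same assignment would be expressed as a transform over a PCollection rather than a Python loop.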
GetInputSourceToExamplePTransform
GetInputSourceToExamplePTransform() -> beam.PTransform
Returns the PTransform that converts parquet data to TF examples.