tfx.v1.components.CsvExampleGen

Official TFX CsvExampleGen component.

Inherits From: BaseComponent, BaseNode

The CSV ExampleGen component takes CSV data and generates train and eval examples for downstream components.

The CSV ExampleGen encodes column values as tf.Example int, float, or bytes features. When a cell is missing, it uses:
-- tf.train.Feature(<type>_list=tf.train.<Type>List(value=[])) when the type can be inferred, where <type> stands for the inferred type (for example, bytes_list=tf.train.BytesList(value=[])).
-- tf.train.Feature() when the type cannot be inferred from the column.

Note that types are inferred per input split. If the input has more than one split, users need to ensure that the column types align across all pre-split inputs.

For example, given the following csv rows of a split:

    header: A,B,C,D
    row1:   1,,x,0.1
    row2:   2,,y,0.2
    row3:   3,,,0.3
    row4:

The output examples will be:

    example1: 1(int), empty feature(no type), x(string), 0.1(float)
    example2: 2(int), empty feature(no type), y(string), 0.2(float)
    example3: 3(int), empty feature(no type), empty list(string), 0.3(float)

Note that the empty feature is tf.train.Feature() while empty list string feature is tf.train.Feature(bytes_list=tf.train.BytesList(value=[])).
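The per-split inference described above can be sketched in plain Python. This is only an illustration of the documented behaviour, not TFX's implementation: the function names, the int/float/bytes ranking, and the `(list_name, values)` tuples standing in for `tf.train.Feature` are all assumptions of this sketch, as is dropping the empty `row4`.

```python
import csv
import io

# Type ranks used by this sketch: a column takes the widest type seen in it.
INT, FLOAT, BYTES = 1, 2, 3
_LIST_NAME = {INT: "int64_list", FLOAT: "float_list", BYTES: "bytes_list"}


def infer_cell_type(cell):
    """Returns the narrowest type that can hold one cell, or None if empty."""
    if cell == "":
        return None
    try:
        int(cell)
        return INT
    except ValueError:
        pass
    try:
        float(cell)
        return FLOAT
    except ValueError:
        return BYTES


def infer_column_types(rows, num_columns):
    """Infers one type per column across all rows of a single split."""
    types = [None] * num_columns
    for row in rows:
        for i, cell in enumerate(row):
            cell_type = infer_cell_type(cell)
            if cell_type is not None:
                types[i] = max(types[i] or 0, cell_type)
    return types


def encode_row(row, column_types):
    """Encodes a row as (list_name, values) pairs standing in for features."""
    encoded = []
    for cell, col_type in zip(row, column_types):
        if col_type is None:
            encoded.append((None, []))  # untyped, like tf.train.Feature()
        elif cell == "":
            encoded.append((_LIST_NAME[col_type], []))  # typed but empty list
        elif col_type == INT:
            encoded.append(("int64_list", [int(cell)]))
        elif col_type == FLOAT:
            encoded.append(("float_list", [float(cell)]))
        else:
            encoded.append(("bytes_list", [cell.encode("utf-8")]))
    return encoded


csv_text = "A,B,C,D\n1,,x,0.1\n2,,y,0.2\n3,,,0.3\n\n"
reader = csv.reader(io.StringIO(csv_text))
header = next(reader)
rows = [row for row in reader if row]  # this sketch skips the empty row4
column_types = infer_column_types(rows, len(header))
examples = [dict(zip(header, encode_row(row, column_types))) for row in rows]
```

Column B never has a value, so it stays untyped in every example, while the missing cell in column C of the third row becomes an empty bytes list because the column type was inferred from the other rows.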

Component outputs contains:

examples Channel of type standard_artifacts.Examples for output train and eval examples.

Args
input_base An external directory containing the CSV files.
input_config An example_gen_pb2.Input instance, providing input configuration. If unset, the files under input_base will be treated as a single split.
output_config An example_gen_pb2.Output instance, providing output configuration. If unset, the default splits will be 'train' and 'eval' with a size ratio of 2:1.
range_config An optional range_config_pb2.RangeConfig instance, specifying the range of span values to consider. If unset, the driver defaults to searching for the latest span with no restrictions.

Attributes
outputs Component's output channel dict.
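To give a feel for what a default 'train'/'eval' split of size 2:1 means, here is a minimal sketch that partitions records by weighted hash buckets. The text above only states the ratio. The hashing approach, the use of SHA-256, and the name `assign_split` are assumptions made for illustration and do not describe TFX's implementation.

```python
import hashlib


def assign_split(record, splits=(("train", 2), ("eval", 1))):
    """Maps a record to a split name using weighted hash buckets."""
    total = sum(weight for _, weight in splits)
    digest = hashlib.sha256(record.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as a stable integer hash.
    bucket = int.from_bytes(digest[:8], "big") % total
    upper = 0
    for name, weight in splits:
        upper += weight
        if bucket < upper:
            return name
    raise AssertionError("unreachable: bucket is always below total")


# Hypothetical records; each one lands in exactly one split.
records = [f"row-{i}" for i in range(3000)]
counts = {"train": 0, "eval": 0}
for record in records:
    counts[assign_split(record)] += 1
```

Because the assignment depends only on the record's content, the same record always lands in the same split, and over many records the counts approach the 2:1 ratio.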

Methods

with_beam_pipeline_args

Add per component Beam pipeline args.

Args
beam_pipeline_args List of Beam pipeline args to be added to the Beam executor spec.

Returns
The same component itself.