tfx_bsl.public.tfxio.CsvTFXIO

TFXIO implementation for CSV.

Inherits From: TFXIO


Args

file_pattern A file glob pattern to read CSV files from.
column_names List of CSV column names. Order must match the order in the CSV file.
telemetry_descriptors A set of descriptors that identify the component that is instantiating this TFXIO. These will be used to construct the namespace to contain metrics for profiling and are therefore expected to be identifiers of the component itself and not individual instances of source use.
validate Boolean flag to verify that the files exist at pipeline creation time.
delimiter A one-character string used to separate fields.
skip_blank_lines A boolean to indicate whether to skip over blank lines rather than interpreting them as missing values.
multivalent_columns Names of the columns that can contain multiple values. If secondary_delimiter is provided, this must also be provided.
secondary_delimiter Delimiter used for parsing multivalent columns. If multivalent_columns is provided, this must also be provided.
schema An optional TFMD Schema describing the dataset. If a schema is provided, it will determine the data types of the CSV columns. Otherwise, each column's data type will be inferred by the CSV decoder. The schema should contain exactly the same features as column_names.
raw_record_column_name If not None, the generated Arrow RecordBatches will contain a column of the given name that contains raw CSV rows.
skip_header_lines Number of header lines to skip. The same number is skipped from each file. Must be 0 or higher. A large number of skipped lines may impact performance.
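To illustrate how delimiter, multivalent_columns, and secondary_delimiter interact, the following stdlib-only sketch (it does not use tfx_bsl itself; the sample data and column names are made up) splits a multivalent column the way the CSV decoder would when both parameters are set:

```python
import csv
import io

# Hypothetical CSV with a multivalent "tags" column whose values are
# separated by a secondary delimiter '|'.
raw = "1,a|b|c\n2,x\n"
column_names = ["id", "tags"]
multivalent_columns = {"tags"}
secondary_delimiter = "|"

rows = []
for values in csv.reader(io.StringIO(raw), delimiter=","):
    row = dict(zip(column_names, values))
    # Split each multivalent column on the secondary delimiter, mirroring
    # what CsvTFXIO does when multivalent_columns and secondary_delimiter
    # are both provided.
    for col in multivalent_columns:
        row[col] = row[col].split(secondary_delimiter)
    rows.append(row)

print(rows)
```

Note that a row whose multivalent column holds a single value (row 2 above) still yields a one-element list, which is why the two parameters must be provided together.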

Attributes

raw_record_column_name

telemetry_descriptors

Methods

ArrowSchema

Returns the schema of the RecordBatch produced by self.BeamSource().

May raise an error if the TFMD schema was not provided at construction time.

BeamSource

Returns a beam PTransform that produces PCollection[pa.RecordBatch].

May NOT raise an error even if the TFMD schema was not provided at construction time.

If a TFMD schema was provided at construction time, all the pa.RecordBatches in the result PCollection must be of the same schema returned by self.ArrowSchema. If a TFMD schema was not provided, the pa.RecordBatches might not be of the same schema (they may contain different numbers of columns).

Args
batch_size if not None, the pa.RecordBatch produced will be of the specified size. Otherwise it's automatically tuned by Beam.
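The effect of batch_size can be sketched without Beam: when a size is given, rows are grouped into batches of exactly that size (a stdlib-only illustration; the helper name is made up):

```python
def fixed_size_batches(rows, batch_size):
    """Group rows into batches of batch_size (the last one may be smaller),
    mimicking what BeamSource(batch_size=...) does for its RecordBatches."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]

# Ten rows with batch_size=4 yield batches of sizes 4, 4, and 2.
batches = list(fixed_size_batches(list(range(10)), batch_size=4))
```

When batch_size is None, Beam tunes the batch size automatically, so batch sizes in the output PCollection may vary.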

Project

Projects the dataset represented by this TFXIO.

A Projected TFXIO:

  • Only columns needed for given tensor_names are guaranteed to be produced by self.BeamSource()
  • self.TensorAdapterConfig() and self.TensorFlowDataset() are trimmed to contain only those tensors.
  • It retains a reference to the very original TFXIO, so its TensorAdapter knows about the specs of the tensors that would be produced by the original TensorAdapter. Also see TensorAdapter.OriginalTensorSpec().

May raise an error if the TFMD schema was not provided at construction time.

Args
tensor_names a set of tensor names.

Returns
A TFXIO instance that is the same as self except that:

  • Only columns needed for given tensor_names are guaranteed to be produced by self.BeamSource()
  • self.TensorAdapterConfig() and self.TensorFlowDataset() are trimmed to contain only those tensors.
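The column trimming a projected TFXIO performs can be sketched as follows (a stdlib-only illustration; a dict of column name to values stands in for a pa.RecordBatch, and the data is made up):

```python
# A dict of column name -> values stands in for a pa.RecordBatch here.
record_batch = {
    "age": [29, 41],
    "name": ["ann", "bob"],
    "unused": [0, 1],
}

def project(batch, needed_columns):
    """Keep only the columns needed for the requested tensors, mirroring
    the guarantee a projected TFXIO makes about self.BeamSource()."""
    return {col: vals for col, vals in batch.items() if col in needed_columns}

projected = project(record_batch, needed_columns={"age", "name"})
```

Unlike this sketch, the real projected TFXIO also keeps a reference to the original TFXIO so that TensorAdapter.OriginalTensorSpec() can still describe the untrimmed tensors.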

RawRecordBeamSource

Returns a PTransform that produces a PCollection[bytes].

Used together with RawRecordToRecordBatch(), it allows getting both the PCollection of the raw records and the PCollection of the RecordBatch from the same source. For example:

record_batch = pipeline | tfxio.BeamSource()
raw_record = pipeline | tfxio.RawRecordBeamSource()

would result in the files being read twice, while the following would only read once:

raw_record = pipeline | tfxio.RawRecordBeamSource()
record_batch = raw_record | tfxio.RawRecordToRecordBatch()

RawRecordTensorFlowDataset

Returns a Dataset that contains nested Datasets of raw records.

May not be implemented for some TFXIOs.

This should be used when RawTfRecordTFXIO.TensorFlowDataset does not suffice. Namely, when there is some logical grouping of files on which we need to perform an operation (e.g. shuffle) without applying the operation to each individual group.

The returned Dataset object is a dataset of datasets, where each nested dataset is a dataset of serialized records. When shuffle=False (default), the nested datasets are deterministically ordered. Each nested dataset can represent multiple files. The files are merged into one dataset if the files have the same format. For example:

file_patterns = ['file_1', 'file_2', 'dir_1/*']
file_formats = ['recordio', 'recordio', 'sstable']
tfxio = SomeTFXIO(file_patterns, file_formats)
datasets = tfxio.RawRecordTensorFlowDataset(options)

Here, datasets would contain two nested datasets, [ds1, ds2], where ds1 iterates over records from 'file_1' and 'file_2', and ds2 iterates over records from files matched by 'dir_1/*'.

Example usage:

tfxio = SomeTFXIO(file_patterns, file_formats)
ds = tfxio.RawRecordTensorFlowDataset(options=options)
ds = ds.flat_map(lambda x: x)
records = list(ds.as_numpy_iterator())
# Iterating over `records` yields records from each file in
# `file_patterns`. See `tf.data.Dataset.list_files` for more information
# about the order of files when expanding globs.

Note that we need a flat_map, because RawRecordTensorFlowDataset returns a dataset of datasets.

When shuffle=True, the nested datasets are not deterministically ordered, but the contents of each nested dataset are still deterministically ordered. For example, we may get [ds2, ds1, ds3], where the contents of ds1, ds2, and ds3 are each deterministically ordered.

Args
options A TensorFlowDatasetOptions object. Not all options will apply.

RawRecordToRecordBatch

Returns a PTransform that converts raw records to Arrow RecordBatches.

The input PCollection must be from self.RawRecordBeamSource() (also see the documentation for that method).

Args
batch_size if not None, the pa.RecordBatch produced will be of the specified size. Otherwise it's automatically tuned by Beam.

RecordBatches

Returns an iterable of record batches.

This can be used outside of Apache Beam or TensorFlow to access data.

Args
options An options object for iterating over record batches. Look at dataset_options.RecordBatchesOptions for more details.

SupportAttachingRawRecords

TensorAdapter

Returns a TensorAdapter that converts pa.RecordBatch to TF inputs.

May raise an error if the TFMD schema was not provided at construction time.

TensorAdapterConfig

Returns the config to initialize a TensorAdapter.

Returns
a TensorAdapterConfig that is the same as what is used to initialize the TensorAdapter returned by self.TensorAdapter().

TensorFlowDataset

Returns a tf.data.Dataset of TF inputs.

May raise an error if the TFMD schema was not provided at construction time.

Args
options an options object for the tf.data.Dataset. Look at dataset_options.TensorFlowDatasetOptions for more details.

TensorRepresentations

Returns the TensorRepresentations.

These TensorRepresentations describe the tensors or composite tensors produced by the TensorAdapter created from self.TensorAdapter() or the tf.data.Dataset created from self.TensorFlowDataset().

May raise an error if the TFMD schema was not provided at construction time. May raise an error if the tensor representations are invalid.