TFXIO implementation for CSV records in a PCollection[bytes].
Inherits From: TFXIO
tfx_bsl.public.tfxio.BeamRecordCsvTFXIO(
physical_format: Text,
column_names: List[Text],
delimiter: Optional[Text] = ',',
skip_blank_lines: Optional[bool] = True,
multivalent_columns: Optional[Text] = None,
secondary_delimiter: Optional[Text] = None,
schema: Optional[schema_pb2.Schema] = None,
raw_record_column_name: Optional[Text] = None,
telemetry_descriptors: Optional[List[Text]] = None
)
This is a special TFXIO that does not actually do I/O -- it relies on the caller to prepare a PCollection of bytes.
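As a minimal sketch of that division of labor, the snippet below has the caller build the bytes records and then hands them to BeamSource(). The column names, sample rows, and the helper names `encode_csv_records` and `run_pipeline` are hypothetical; no TFMD schema is passed, so column types are left to the library, and the printed output has not been verified here.

```python
import csv
import io
from typing import Iterable, List, Sequence

# Hypothetical column names, used only for illustration.
COLUMN_NAMES = ["id", "label", "score"]


def encode_csv_records(rows: Iterable[Sequence[object]]) -> List[bytes]:
    """Serializes each row into one CSV-encoded bytes record (no newline)."""
    records = []
    for row in rows:
        buf = io.StringIO()
        csv.writer(buf, lineterminator="").writerow(row)
        records.append(buf.getvalue().encode("utf-8"))
    return records


def run_pipeline(records: List[bytes]) -> None:
    """Feeds caller-prepared bytes records through BeamSource()."""
    # Imported here so the encoding helper above stays usable without Beam.
    import apache_beam as beam
    from tfx_bsl.public import tfxio

    csv_tfxio = tfxio.BeamRecordCsvTFXIO(
        physical_format="text",
        column_names=COLUMN_NAMES,
    )
    with beam.Pipeline() as pipeline:
        _ = (
            pipeline
            # The caller, not the TFXIO, produces the PCollection[bytes].
            | "CreateRecords" >> beam.Create(records)
            # Converts the raw CSV records to pa.RecordBatch.
            | "ToRecordBatch" >> csv_tfxio.BeamSource()
            | "Print" >> beam.Map(print)
        )


# Hypothetical sample rows.
records = encode_csv_records([(1, "cat", 0.5), (2, "dog", 0.25)])
```

With apache_beam and tfx_bsl installed, calling `run_pipeline(records)` runs the pipeline. In a real job the `beam.Create` step would typically be replaced by whatever source yields the CSV lines as bytes.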
Attributes | |
---|---|
raw_record_column_name | |
telemetry_descriptors | |
Methods
ArrowSchema
ArrowSchema() -> pa.Schema
Returns the schema of the RecordBatch produced by self.BeamSource().
May raise an error if the TFMD schema was not provided at construction time.
BeamSource
BeamSource(
batch_size: Optional[int] = None
) -> beam.PTransform
Returns a Beam PTransform that produces PCollection[pa.RecordBatch].
May NOT raise an error if the TFMD schema was not provided at construction time.
If a TFMD schema was provided at construction time, all the pa.RecordBatches in the result PCollection must be of the same schema returned by self.ArrowSchema. If a TFMD schema was not provided, the pa.RecordBatches might not be of the same schema (they may contain different numbers of columns).
Args | |
---|---|
batch_size | If not None, the pa.RecordBatch produced will be of the specified size. Otherwise it's automatically tuned by Beam. |
Project
Project(
tensor_names: List[Text]
) -> "TFXIO"
Projects the dataset represented by this TFXIO.
A Projected TFXIO:
- Only columns needed for given tensor_names are guaranteed to be produced by self.BeamSource().
- self.TensorAdapterConfig() and self.TensorFlowDataset() are trimmed to contain only those tensors.
- It retains a reference to the very original TFXIO, so its TensorAdapter knows about the specs of the tensors that would be produced by the original TensorAdapter. Also see TensorAdapter.OriginalTensorSpec().
May raise an error if the TFMD schema was not provided at construction time.
Args | |
---|---|
tensor_names | A list of tensor names. |
Returns | |
---|---|
A TFXIO instance that is the same as self except for the projection changes listed above. |
RawRecordBeamSource
RawRecordBeamSource() -> beam.PTransform
Returns a PTransform that produces a PCollection[bytes].
Used together with RawRecordToRecordBatch(), it allows getting both the PCollection of the raw records and the PCollection of the RecordBatch from the same source. For example:
record_batch = pipeline | tfxio.BeamSource()
raw_record = pipeline | tfxio.RawRecordBeamSource()
would result in the files being read twice, while the following would only read once:
raw_record = pipeline | tfxio.RawRecordBeamSource()
record_batch = raw_record | tfxio.RawRecordToRecordBatch()
RawRecordToRecordBatch
RawRecordToRecordBatch(
batch_size: Optional[int] = None
) -> beam.PTransform
Returns a PTransform that converts raw records to Arrow RecordBatches.
The input PCollection must be from self.RawRecordBeamSource() (also see the documentation for that method).
Args | |
---|---|
batch_size | If not None, the pa.RecordBatch produced will be of the specified size. Otherwise it's automatically tuned by Beam. |
RecordBatches
RecordBatches(
options: tfx_bsl.public.tfxio.RecordBatchesOptions
)
Returns an iterable of record batches.
This can be used outside of Apache Beam or TensorFlow to access data.
Args | |
---|---|
options | An options object for iterating over record batches. Look at dataset_options.RecordBatchesOptions for more details. |
SupportAttachingRawRecords
SupportAttachingRawRecords() -> bool
TensorAdapter
TensorAdapter() -> tfx_bsl.public.tfxio.TensorAdapter
Returns a TensorAdapter that converts pa.RecordBatch to TF inputs.
May raise an error if the TFMD schema was not provided at construction time.
TensorAdapterConfig
TensorAdapterConfig() -> tfx_bsl.public.tfxio.TensorAdapterConfig
Returns the config to initialize a TensorAdapter.
Returns | |
---|---|
A TensorAdapterConfig that is the same as what is used to initialize the TensorAdapter returned by self.TensorAdapter(). |
TensorFlowDataset
TensorFlowDataset(
options: tfx_bsl.public.tfxio.TensorFlowDatasetOptions
)
Returns a tf.data.Dataset of TF inputs.
May raise an error if the TFMD schema was not provided at construction time.
Args | |
---|---|
options | An options object for the tf.data.Dataset. Look at dataset_options.TensorFlowDatasetOptions for more details. |
TensorRepresentations
TensorRepresentations() -> tfx_bsl.public.tfxio.TensorRepresentations
Returns the TensorRepresentations.
These TensorRepresentations describe the tensors or composite tensors produced by the TensorAdapter created from self.TensorAdapter() or the tf.data.Dataset created from self.TensorFlowDataset().
May raise an error if the TFMD schema was not provided at construction time. May raise an error if the tensor representations are invalid.