tf.data.experimental.CsvDataset

A Dataset comprising lines from one or more CSV files.

Used in the notebooks

Used in the guide Used in the tutorials

filenames A tf.string tensor containing one or more filenames.
record_defaults A list of default values for the CSV fields. Each item in the list is either a valid CSV DType (float32, float64, int32, int64, string), or a Tensor object with one of the above types. One per column of CSV data, with either a scalar Tensor default value for the column if it is optional, or DType or empty Tensor if required. If both this and select_columns are specified, these must have the same lengths, and column_defaults is assumed to be sorted in order of increasing column index.
compression_type (Optional.) A tf.string scalar evaluating to one of "" (no compression), "ZLIB", or "GZIP". Defaults to no compression.
buffer_size (Optional.) A tf.int64 scalar denoting the number of bytes to buffer while reading files. Defaults to 4MB.
header (Optional.) A tf.bool scalar indicating whether the CSV file(s) have header line(s) that should be skipped when parsing. Defaults to False.
field_delim (Optional.) A tf.string scalar containing the delimiter character that separates fields in a record. Defaults to ",".
use_quote_delim (Optional.) A tf.bool scalar. If False, treats double quotation marks as regular characters inside of string fields (ignoring RFC 4180, Section 2, Bullet 5). Defaults to True.
na_value (Optional.) A tf.string scalar indicating a value that will be treated as NA/NaN.
select_cols (Optional.) A sorted list of column indices to select from the input data. If specified, only this subset of columns will be parsed. Defaults to parsing all columns.

element_spec The type specification of an element of this dataset.

dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3])
dataset.element_spec
TensorSpec(shape=(), dtype=tf.int32, name=None)

Methods

apply

View source

Applies a transformation function to this dataset.

apply enables chaining of custom Dataset transformations, which are represented as functions that take one Dataset argument and return a transformed Dataset.

dataset = tf.data.Dataset.range(100)
def dataset_fn(ds):
  return ds.filter(lambda x: x < 5)
dataset = dataset.apply(dataset_fn)
list(dataset.as_numpy_iterator())
[0, 1, 2, 3, 4]

Args
transformation_func A function that takes one Dataset argument and returns a Dataset.

Returns
Dataset The Dataset returned by applying transformation_func to this dataset.

as_numpy_iterator

View source

Returns an iterator which converts all elements of the dataset to numpy.

Use as_numpy_iterator to inspect the content of your dataset. To see element shapes and types, print dataset elements directly instead of using as_numpy_iterator.

dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3])
for element in dataset:
  print(element)
tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(3, shape=(), dtype=int32)

This method requires that you are running in eager mode and the dataset's element_spec contains only TensorSpec components.

dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3])
for element in dataset.as_numpy_iterator():
  print(element)
1
2
3
dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3])
print(list(dataset.as_numpy_iterator()))
[1, 2, 3]

as_numpy_iterator() will preserve the nested structure of dataset elements.

dataset = tf.data.Dataset.from_tensor_slices({'a': ([1, 2], [3, 4]),
                                              'b': [5, 6]})
list(dataset.as_numpy_iterator()) == [{'a': (1, 3), 'b': 5},
                                      {'a': (2, 4), 'b': 6}]
True

Returns
An iterable over the elements of the dataset, with their tensors converted to numpy arrays.

Raises
TypeError if an element contains a non-Tensor value.
RuntimeError if eager execution is not enabled.

batch

View source

Combines consecutive elements of this dataset into batches.

dataset = tf.data.Dataset.range(8)
dataset = dataset.batch(3)
list(dataset.as_numpy_iterator())
[array([0, 1, 2]), array([3, 4, 5]), array([6, 7])]
dataset = tf.data.Dataset.range(8)
dataset = dataset.batch(3, drop_remainder=True)
list(dataset.as_numpy_iterator())
[array([0, 1, 2]), array([3, 4, 5])]

The components of the resulting element will have an additional outer dimension, which will be batch_size (or N % batch_size for the last element if batch_size does not divide the number of input elements N evenly and drop_remainder is False). If your program depends on the batches having the same outer dimension, you should set the drop_remainder argument to True to prevent the smaller batch from being produced.

Args
batch_size A tf.int64 scalar tf.Tensor, representing the number of consecutive elements of this dataset to combine in a single batch.
drop_remainder