tfds.core.DatasetBuilder

Abstract base class for all datasets.

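DatasetBuilder has 3 key methods:

DatasetBuilder.info: the tfds.core.DatasetInfo that documents the dataset.
DatasetBuilder.download_and_prepare: downloads the source data and prepares it on disk for reading.
DatasetBuilder.as_dataset: constructs a tf.data.Dataset from the prepared data.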

Configuration: Some DatasetBuilders expose multiple variants of the dataset by defining a tfds.core.BuilderConfig subclass and accepting a config object (or name) on construction. Configurable datasets expose a pre-defined set of configurations in DatasetBuilder.builder_configs.

Typical DatasetBuilder usage:

import tensorflow as tf
import tensorflow_datasets as tfds

mnist_builder = tfds.builder("mnist")
mnist_info = mnist_builder.info
mnist_builder.download_and_prepare()
datasets = mnist_builder.as_dataset()

train_dataset, test_dataset = datasets["train"], datasets["test"]
assert isinstance(train_dataset, tf.data.Dataset)

# And then the rest of your input pipeline
train_dataset = train_dataset.repeat().shuffle(1024).batch(128)
train_dataset = train_dataset.prefetch(2)
features = tf.compat.v1.data.make_one_shot_iterator(train_dataset).get_next()
image, label = features['image'], features['label']

Args
data_dir directory to read/write data. Defaults to the value of the environment variable TFDS_DATA_DIR, if set; otherwise falls back to the default directory where datasets are stored.
config tfds.core.BuilderConfig or str name, optional configuration for the dataset that affects the data generated on disk. Different builder_configs will have their own subdirectories and versions.
version Optional version at which to load the dataset. An error is raised if the specified version cannot be satisfied. E.g., '1.2.3', '1.2.*'. The special value "experimental_latest" will use the highest version, even if it is not the default. This is not recommended unless you know what you are doing, as that version could be broken.
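
To make the wildcard form of the version argument concrete, here is a minimal sketch of how a request such as '1.2.*' could be matched against a list of available versions. The helpers version_matches and pick_version are hypothetical names introduced for illustration, not part of the TFDS API, and choosing the highest match is an assumption of this sketch rather than documented behavior.

```python
from typing import List


def version_matches(requested: str, available: str) -> bool:
    """Return True if `available` satisfies `requested` (hypothetical helper)."""
    req_parts = requested.split(".")
    avail_parts = available.split(".")
    if len(req_parts) != len(avail_parts):
        return False
    # A '*' component matches anything; other components must be equal.
    return all(r == "*" or r == a for r, a in zip(req_parts, avail_parts))


def pick_version(requested: str, available: List[str]) -> str:
    """Pick the highest matching version, or raise if none matches.

    Assumption: taking the highest match is a choice made for this sketch.
    """
    matches = [v for v in available if version_matches(requested, v)]
    if not matches:
        raise ValueError(f"No available version satisfies {requested!r}")
    # Compare numerically so that '1.2.10' ranks above '1.2.9'.
    return max(matches, key=lambda v: tuple(int(p) for p in v.split(".")))


chosen = pick_version("1.2.*", ["1.2.3", "1.2.10", "1.3.0"])
```

In this sketch a request that matches nothing raises an error, mirroring the statement above that an unsatisfiable version is an error.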

Attributes
builder_config tfds.core.BuilderConfig for this builder.
canonical_version

data_dir Returns the directory where this version + config is stored.

Note that this is different from data_dir_root. For example, if data_dir_root is /data/tfds, then data_dir would be /data/tfds/my_dataset/my_config/1.2.3.

data_dir_root Returns the root directory where all TFDS datasets are stored.

Note that this is different from data_dir, which includes the dataset name, config, and version. For example, if data_dir is /data/tfds/my_dataset/my_config/1.2.3, then data_dir_root is /data/tfds.

data_path Returns the path where this version + config is stored.
info tfds.core.DatasetInfo for this builder.
release_notes

supported_versions

version

versions Versions (canonical + available), in preference order.
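
The relationship between the data_dir and data_dir_root attributes described above can be sketched with plain path arithmetic. This is only an illustration of the layout in the example (root / dataset name / config / version); build_data_dir is a hypothetical helper, not a TFDS function.

```python
from pathlib import PurePosixPath


def build_data_dir(data_dir_root: str, name: str, config: str, version: str) -> str:
    """Compose the per-dataset directory from its parts (hypothetical helper)."""
    return str(PurePosixPath(data_dir_root) / name / config / version)


data_dir = build_data_dir("/data/tfds", "my_dataset", "my_config", "1.2.3")
print(data_dir)  # /data/tfds/my_dataset/my_config/1.2.3

# Going the other way: strip version, config, and dataset name to reach the root.
data_dir_root = str(PurePosixPath(data_dir).parents[2])
print(data_dir_root)  # /data/tfds
```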

Methods

as_data_source

Constructs an ArrayRecordDataSource.

Args
split Which split of the data to load (e.g. 'train', 'test', ['train', 'test'], 'train[80%:]',...). See our split API guide. If None, will return all splits in a Dict[Split, Sequence].
decoders Nested dict of Decoder objects that customize the decoding. The structure should match the feature structure, but only customized feature keys need to be present. See the guide for more info.

Returns
Sequence if split, dict<key: tfds.Split, value: Sequence> otherwise.

Raises
NotImplementedError if the data was not generated using ArrayRecords.

as_dataset

Constructs a tf.data.Dataset.

Callers must pass arguments as keyword arguments.

The output types vary depending on the parameters. Examples:

import tensorflow as tf
import tensorflow_datasets as tfds

builder = tfds.builder('imdb_reviews')
builder.download_and_prepare()

# Default parameters: Returns the dict of tf.data.Dataset
ds_all_dict = builder.as_dataset()
assert isinstance(ds_all_dict, dict)
print(ds_all_dict.keys())  # ==> ['test', 'train', 'unsupervised']

assert isinstance(ds_all_dict['test'], tf.data.Dataset)
# Each dataset (test, train, unsup.) consists of dictionaries
# {'label': <tf.Tensor: .. dtype=int64, numpy=1>,
#  'text': <tf.Tensor: .. dtype=string, numpy=b"I've watched the movie ..">}
# {'label': <tf.Tensor: .. dtype=int64, numpy=1>,
#  'text': <tf.Tensor: .. dtype=string, numpy=b'If you love Japanese ..'>}

# With as_supervised: tf.data.Dataset only contains (feature, label) tuples
ds_all_supervised = builder.as_dataset(as_supervised=True)
assert isinstance(ds_all_supervised, dict)
print(ds_all_supervised.keys())  # ==> ['test', 'train', 'unsupervised']

assert isinstance(ds_all_supervised['test'], tf.data.Dataset)
# Each dataset (test, train, unsup.) consists of tuples (text, label)
# (<tf.Tensor: ... dtype=string, numpy=b"I've watched the movie ..">,
#  <tf.Tensor: ... dtype=int64, numpy=1>)
# (<tf.Tensor: ... dtype=string, numpy=b"If you love Japanese ..">,
#  <tf.Tensor: ... dtype=int64, numpy=1>)

# Same as above plus requesting a particular split
ds_test_supervised = builder.as_dataset(as_supervised=True, split='test')
assert isinstance(ds_test_supervised, tf.data.Dataset)
# The dataset consists of tuples (text, label)
# (<tf.Tensor: ... dtype=string, numpy=b"I've watched the movie ..">,
#  <tf.Tensor: ... dtype=int64, numpy=1>)
# (<tf.Tensor: ... dtype=string, numpy=b"If you love Japanese ..">,
#  <tf.Tensor: ... dtype=int64, numpy=1>)

Args
split Which split of the data to load (e.g. 'train', 'test', ['train', 'test'], 'train[80%:]',...). See our split API guide. If None, will return all splits in a Dict[Split, tf.data.Dataset].
batch_size int, batch size. Note that variable-length features will be 0-padded if batch_size is set. Users that want more custom behavior should use batch_size=None and use the tf.data API to construct a custom pipeline. If batch_size == -1, will return feature dictionaries of the whole dataset with tf.Tensors instead of a tf.data.Dataset.
shuffle_files bool, whether to shuffle the input files. Defaults to False.
decoders Nested dict of Decoder objects that customize the decoding. The structure should match the feature structure, but only customized feature keys need to be present. See the guide for more info.
read_config tfds.ReadConfig, additional options to configure the input pipeline (e.g. seed, number of parallel reads, ...).
as_supervised bool, if True, the returned tf.data.Dataset will have a 2-tuple structure (input, label) according to builder.info.supervised_keys. If False, the default, the returned tf.data.Dataset will have a dictionary with all the features.

Returns
tf.data.Dataset, or if split=None, dict<key: tfds.Split, value: tf.data.Dataset>.

If batch_size is -1, will return feature dictionaries containing the entire dataset in tf.Tensors instead of a tf.data.Dataset.
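
The note on batch_size says that variable-length features are 0-padded when batching. The pure-Python sketch below illustrates what that padding means for a batch of integer sequences. pad_batch is a hypothetical helper written for this illustration; it does not call TFDS or tf.data and is not how the library implements padding.

```python
from typing import List


def pad_batch(sequences: List[List[int]], pad_value: int = 0) -> List[List[int]]:
    """Right-pad every sequence to the length of the longest one (hypothetical helper)."""
    max_len = max((len(s) for s in sequences), default=0)
    return [s + [pad_value] * (max_len - len(s)) for s in sequences]


batch = pad_batch([[1, 2, 3], [4], [5, 6]])
print(batch)  # [[1, 2, 3], [4, 0, 0], [5, 6, 0]]
```

If zeros are meaningful values in a feature, this kind of padding is ambiguous, which is one reason to use batch_size=None and build a custom pipeline with the tf.data API as suggested above.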

dataset_info_from_configs

Returns the DatasetInfo using given kwargs and config files.

Sub-class should call this and add information not present in config files using kwargs directly passed to tfds.core.DatasetInfo object.

If information is present both in passed arguments and config files, config files will prevail.

Args
**kwargs keyword arguments to pass to DatasetInfo directly.
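
The precedence rule for dataset_info_from_configs (config files prevail over passed arguments) can be pictured as a dictionary merge. The sketch below is an assumption-laden illustration only: merge_info_kwargs and the sample keys are hypothetical, and the real method reads its values from config files rather than from a dict.

```python
from typing import Any, Dict


def merge_info_kwargs(config_file_values: Dict[str, Any], **kwargs: Any) -> Dict[str, Any]:
    """Merge kwargs with config-file values; config-file values win (hypothetical helper)."""
    merged = dict(kwargs)
    # Values from config files overwrite any kwargs with the same key.
    merged.update(config_file_values)
    return merged


merged = merge_info_kwargs(
    {"description": "From config file"},
    description="From kwargs",
    homepage="https://example.com",
)
```

Here the description from the config file replaces the one passed as a keyword argument, while homepage, which appears only in the keyword arguments, is kept.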

download_and_prepare

Downloads and prepares dataset for reading.

Args
download_dir directory where downloaded files are stored. Defaults to "~/tensorflow-datasets/downloads".
download_config tfds.download.DownloadConfig, further configuration for downloading and preparing dataset.
file_format optional str or file_adapters.FileFormat, format of the record files in which the dataset will be written.

Raises
IOError if there is not enough disk space available.
RuntimeError when the config cannot be found.

get_default_builder_config

Returns the default builder config if there is one.

Note that for dataset builders that cannot use the cls.BUILDER_CONFIGS, we need a method that uses the instance to get BUILDER_CONFIGS and DEFAULT_BUILDER_CONFIG_NAME.

Returns
the default builder config if there is one

get_metadata

Returns metadata (README, CITATIONS, ...) specified in config files.

The config files are read from the same package where the DatasetBuilder has been defined, so those metadata might be wrong for legacy builders.

get_reference

Returns a reference to the dataset produced by this dataset builder.

Includes the config if specified, the version, and the data_dir that should contain this dataset.

Arguments
namespace if this dataset is a community dataset, and therefore has a namespace, then the namespace must be provided such that it can be set in the reference. Note that a dataset builder is not aware that it is part of a namespace.

Returns
a reference to this instantiated builder.

is_prepared

Returns whether this dataset is already downloaded and prepared.

Class Variables
BUILDER_CONFIGS []
DEFAULT_BUILDER_CONFIG_NAME None
MANUAL_DOWNLOAD_INSTRUCTIONS None
MAX_SIMULTANEOUS_DOWNLOADS None
RELEASE_NOTES {}

SUPPORTED_VERSIONS []
VERSION None
builder_config_cls None
builder_configs {}

code_path Instance of etils.epath.gpath.PosixGPath
default_builder_config None
name 'dataset_builder'
pkg_dir_path None
url_infos None