Base class for datasets with data generation based on file adapter.
Inherits From: DatasetBuilder
tfds.core.GeneratorBasedBuilder(
*, file_format: Union[None, str, file_adapters.FileFormat] = None, **kwargs
)
GeneratorBasedBuilder is a convenience class that abstracts away much of the data writing and reading of DatasetBuilder.
It expects subclasses to override _split_generators to return a dict mapping each split to its generator. See the method docstrings for details.
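The shape that _split_generators is expected to return can be sketched in plain Python. The snippet below is an illustration only and does not import TFDS: `generate_examples` and `split_generators` are hypothetical stand-ins for the `_generate_examples` and `_split_generators` methods a subclass would define, and the `text`/`label` features are made up. It assumes each generator yields `(key, example)` pairs whose keys are unique within a split.

```python
from typing import Any, Dict, Iterable, Iterator, Tuple

Example = Dict[str, Any]
Row = Tuple[str, int]


def generate_examples(rows: Iterable[Row]) -> Iterator[Tuple[int, Example]]:
    """Yields (key, example) pairs; the key is unique within the split."""
    for index, (text, label) in enumerate(rows):
        yield index, {'text': text, 'label': label}


def split_generators(
    data: Dict[str, Iterable[Row]],
) -> Dict[str, Iterator[Tuple[int, Example]]]:
    """Returns a dict mapping each split name to its example generator."""
    return {name: generate_examples(rows) for name, rows in data.items()}


# Hypothetical in-memory data standing in for downloaded files.
raw = {
    'train': [('good film', 1), ('dull plot', 0)],
    'test': [('fine', 1)],
}
splits = split_generators(raw)
train_examples = list(splits['train'])
```

In a real subclass the same logic would live in methods on the builder, with downloaded file paths replacing the in-memory rows.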
Args | |
---|---|
file_format | EXPERIMENTAL, may change at any time. Format of the record files in which the dataset will be read/written. If None, defaults to tfrecord. |
**kwargs | Arguments passed to DatasetBuilder. |
Attributes | |
---|---|
builder_config | tfds.core.BuilderConfig for this builder. |
canonical_version | |
data_dir | Returns the directory where this version + config is stored. Note that this is different from |
data_dir_root | Returns the root directory where all TFDS datasets are stored. Note that this is different from |
data_path | Returns the path where this version + config is stored. |
info | tfds.core.DatasetInfo for this builder. |
release_notes | |
supported_versions | |
version | |
versions | Versions (canonical + available), in preference order. |
Methods
as_data_source
as_data_source(
split: Optional[Tree[splits_lib.SplitArg]] = None,
*,
decoders: Optional[TreeDict[decode.partial_decode.DecoderArg]] = None
) -> ListOrTreeOrElem[Sequence[Any]]
Constructs an ArrayRecordDataSource.
Args | |
---|---|
split | Which split of the data to load (e.g. 'train', 'test', ['train', 'test'], 'train[80%:]', ...). See our split API guide. If None, will return all splits in a Dict[Split, Sequence]. |
decoders | Nested dict of Decoder objects that allow customizing the decoding. The structure should match the feature structure, but only customized feature keys need to be present. See the guide for more info. |
Returns | |
---|---|
Sequence if split, dict<key: tfds.Split, value: Sequence> otherwise. |
Raises | |
---|---|
NotImplementedError if the data was not generated using ArrayRecords. |
as_dataset
as_dataset(
split: Optional[Tree[splits_lib.SplitArg]] = None,
*,
batch_size: Optional[int] = None,
shuffle_files: bool = False,
decoders: Optional[TreeDict[decode.partial_decode.DecoderArg]] = None,
read_config: Optional[read_config_lib.ReadConfig] = None,
as_supervised: bool = False
)
Constructs a tf.data.Dataset.
Callers must pass arguments as keyword arguments.
The output types vary depending on the parameters. Examples:
import tensorflow as tf
import tensorflow_datasets as tfds

builder = tfds.builder('imdb_reviews')
builder.download_and_prepare()

# Default parameters: Returns the dict of tf.data.Dataset
ds_all_dict = builder.as_dataset()
assert isinstance(ds_all_dict, dict)
print(ds_all_dict.keys()) # ==> ['test', 'train', 'unsupervised']
assert isinstance(ds_all_dict['test'], tf.data.Dataset)
# Each dataset (test, train, unsup.) consists of dictionaries
# {'label': <tf.Tensor: .. dtype=int64, numpy=1>,
# 'text': <tf.Tensor: .. dtype=string, numpy=b"I've watched the movie ..">}
# {'label': <tf.Tensor: .. dtype=int64, numpy=1>,
# 'text': <tf.Tensor: .. dtype=string, numpy=b'If you love Japanese ..'>}

# With as_supervised: tf.data.Dataset only contains (feature, label) tuples
ds_all_supervised = builder.as_dataset(as_supervised=True)
assert isinstance(ds_all_supervised, dict)
print(ds_all_supervised.keys()) # ==> ['test', 'train', 'unsupervised']
assert isinstance(ds_all_supervised['test'], tf.data.Dataset)
# Each dataset (test, train, unsup.) consists of tuples (text, label)
# (<tf.Tensor: ... dtype=string, numpy=b"I've watched the movie ..">,
# <tf.Tensor: ... dtype=int64, numpy=1>)
# (<tf.Tensor: ... dtype=string, numpy=b"If you love Japanese ..">,
# <tf.Tensor: ... dtype=int64, numpy=1>)

# Same as above plus requesting a particular split
ds_test_supervised = builder.as_dataset(as_supervised=True, split='test')
assert isinstance(ds_test_supervised, tf.data.Dataset)
# The dataset consists of tuples (text, label)
# (<tf.Tensor: ... dtype=string, numpy=b"I've watched the movie ..">,
# <tf.Tensor: ... dtype=int64, numpy=1>)
# (<tf.Tensor: ... dtype=string, numpy=b"If you love Japanese ..">,
# <tf.Tensor: ... dtype=int64, numpy=1>)
Args | |
---|---|
split | Which split of the data to load (e.g. 'train', 'test', ['train', 'test'], 'train[80%:]', ...). See our split API guide. If None, will return all splits in a Dict[Split, tf.data.Dataset]. |
batch_size | int, batch size. Note that variable-length features will be 0-padded if batch_size is set. Users who want more custom behavior should use batch_size=None and use the tf.data API to construct a custom pipeline. If batch_size == -1, will return feature dictionaries of the whole dataset with tf.Tensors instead of a tf.data.Dataset. |
shuffle_files | bool, whether to shuffle the input files. Defaults to False. |
decoders | Nested dict of Decoder objects that allow customizing the decoding. The structure should match the feature structure, but only customized feature keys need to be present. See the guide for more info. |
read_config | tfds.ReadConfig, additional options to configure the input pipeline (e.g. seed, num parallel reads, ...). |
as_supervised | bool. If True, the returned tf.data.Dataset will have a 2-tuple structure (input, label) according to builder.info.supervised_keys. If False (the default), the returned tf.data.Dataset will have a dictionary with all the features. |
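To make the 0-padding note on batch_size concrete, here is a minimal pure-Python model of what padding a batch of variable-length features looks like. It does not call TFDS or tf.data; `pad_batch` is a hypothetical helper written for this illustration, and the library's own batching may differ in details such as dtype handling.

```python
from typing import List, Sequence


def pad_batch(
    sequences: Sequence[Sequence[int]], pad_value: int = 0
) -> List[List[int]]:
    """Right-pads every sequence with pad_value to the longest length."""
    max_len = max((len(seq) for seq in sequences), default=0)
    return [
        list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences
    ]


# Three variable-length features that end up in the same batch.
batch = pad_batch([[1, 2, 3], [4], [5, 6]])
```

If the padded zeros would be ambiguous with real values, passing batch_size=None and building the batching step with the tf.data API, as the description suggests, keeps that choice in the caller's hands.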
Returns | |
---|---|
tf.data.Dataset, or if split=None, dict<key: tfds.Split, value: tf.data.Dataset>.
If |
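The effect of as_supervised on a single example can be sketched without TensorFlow. The snippet below is a simplified model, assuming supervised_keys is a flat 2-tuple of feature names; `to_supervised` is a hypothetical helper, not part of the TFDS API.

```python
from typing import Any, Dict, Tuple


def to_supervised(
    example: Dict[str, Any], supervised_keys: Tuple[str, str]
) -> Tuple[Any, Any]:
    """Maps a feature dict to an (input, label) tuple."""
    input_key, label_key = supervised_keys
    return example[input_key], example[label_key]


# A feature dict shaped like the imdb_reviews examples shown above.
example = {'text': b'If you love Japanese ..', 'label': 1}
pair = to_supervised(example, ('text', 'label'))
```

Features that are not named in supervised_keys are dropped from the tuple, which is why as_supervised=False is the option to use when every feature is needed.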
dataset_info_from_configs
dataset_info_from_configs(
**kwargs
)
Returns the DatasetInfo using given kwargs and config files.
Subclasses should call this method and supply any information not present in the config files as kwargs, which are passed directly to the tfds.core.DatasetInfo object.
If information is present in both the passed arguments and the config files, the config files take precedence.
Args | |
---|---|
**kwargs | Keyword arguments passed directly to DatasetInfo. |
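The precedence rule described above (config files win over passed arguments) behaves like a dictionary merge in which config values are applied last. The sketch below models only that rule; `merge_info_kwargs` and the field names are hypothetical and do not reflect how the library actually reads its config files.

```python
from typing import Any, Dict


def merge_info_kwargs(
    kwargs: Dict[str, Any], config_values: Dict[str, Any]
) -> Dict[str, Any]:
    """Combines both sources; values from config files override kwargs."""
    merged = dict(kwargs)
    merged.update(config_values)
    return merged


merged = merge_info_kwargs(
    {'description': 'from kwargs', 'homepage': 'https://example.com'},
    {'description': 'from config file'},
)
```

Here `description` comes from the config values, while `homepage`, which only the kwargs provide, is kept.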
download_and_prepare
download_and_prepare(
*,
download_dir: Optional[epath.PathLike] = None,
download_config: Optional[download.DownloadConfig] = None,
file_format: Optional[Union[str, file_adapters.FileFormat]] = None
) -> None
Downloads and prepares dataset for reading.
Args | |
---|---|
download_dir | Directory where downloaded files are stored. Defaults to "~/tensorflow_datasets/downloads". |
download_config | tfds.download.DownloadConfig, further configuration for downloading and preparing the dataset. |
file_format | Optional str or file_adapters.FileFormat, format of the record files in which the dataset will be written. |
Raises | |
---|---|
IOError | If there is not enough disk space available. |
RuntimeError | When the config cannot be found. |
get_default_builder_config
get_default_builder_config() -> Optional[BuilderConfig]
Returns the default builder config if there is one.
Note that for dataset builders that cannot use the cls.BUILDER_CONFIGS, we need a method that uses the instance to get BUILDER_CONFIGS and DEFAULT_BUILDER_CONFIG_NAME.
Returns | |
---|---|
the default builder config if there is one |
get_metadata
@classmethod
get_metadata() -> dataset_metadata.DatasetMetadata
Returns metadata (README, CITATIONS, ...) specified in config files.
The config files are read from the same package where the DatasetBuilder has been defined, so this metadata may be wrong for legacy builders.
get_reference
get_reference(
namespace: Optional[str] = None
) -> naming.DatasetReference
Returns a reference to the dataset produced by this dataset builder.
Includes the config if specified, the version, and the data_dir that should contain this dataset.
Arguments | |
---|---|
namespace | If this dataset is a community dataset and therefore has a namespace, the namespace must be provided so that it can be set in the reference. Note that a dataset builder is not aware that it is part of a namespace. |
Returns | |
---|---|
a reference to this instantiated builder. |
is_prepared
is_prepared() -> bool
Returns whether this dataset is already downloaded and prepared.
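A common pattern is to check is_prepared before triggering download_and_prepare. The helper below is a small sketch of that pattern: `ensure_prepared` is a hypothetical name, and `FakeBuilder` is a stand-in used only so the snippet runs without TFDS installed. A real builder exposing the two methods documented here could be passed in its place.

```python
from typing import Optional


def ensure_prepared(builder, download_dir: Optional[str] = None) -> bool:
    """Prepares the dataset if needed; returns True if preparation ran."""
    if builder.is_prepared():
        return False
    builder.download_and_prepare(download_dir=download_dir)
    return True


class FakeBuilder:
    """Hypothetical stand-in for a builder, used only to exercise the helper."""

    def __init__(self) -> None:
        self.prepared = False
        self.calls = 0

    def is_prepared(self) -> bool:
        return self.prepared

    def download_and_prepare(self, *, download_dir=None) -> None:
        self.calls += 1
        self.prepared = True


fake = FakeBuilder()
first = ensure_prepared(fake)   # Preparation runs on the first call.
second = ensure_prepared(fake)  # The second call finds the data ready.
```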
read_text_file
read_text_file(
filename: epath.PathLike, encoding: Optional[str] = None
) -> str
Returns the text in the given file and records the lineage.
read_tfrecord_as_dataset
read_tfrecord_as_dataset(
filenames: (str | Sequence[str]),
compression_type: (str | None) = None,
num_parallel_reads: (int | None) = None
) -> tf.data.Dataset
Returns the dataset for the given tfrecord files and records the lineage.
read_tfrecord_as_examples
read_tfrecord_as_examples(
filenames: Union[str, Sequence[str]],
compression_type: (str | None) = None,
num_parallel_reads: (int | None) = None
) -> Iterator[tf.train.Example]
Returns tf.Examples from the given tfrecord files and records the lineage.
read_tfrecord_beam
read_tfrecord_beam(
file_pattern, /, **kwargs
) -> 'beam.PTransform'
Returns a PTransform reading the TFRecords and records it in the dataset lineage.
This function records the lineage in the DatasetInfo and then invokes beam.io.ReadFromTFRecord. The kwargs should contain any other parameters for beam.io.ReadFromTFRecord. See
https://beam.apache.org/releases/pydoc/2.6.0/apache_beam.io.tfrecordio.html#apache_beam.io.tfrecordio.ReadFromTFRecord
Arguments | |
---|---|
file_pattern | A file glob pattern to read TFRecords from. |
**kwargs | The other parameters for beam.io.ReadFromTFRecord. |
Returns | |
---|---|
a Beam PTransform that reads the given TFRecord files. |