Information about a dataset.
tfds.core.DatasetInfo(
*,
builder: Union[DatasetIdentity, Any],
description: Optional[str] = None,
features: Optional[feature_lib.FeatureConnector] = None,
supervised_keys: Optional[SupervisedKeysType] = None,
disable_shuffling: bool = False,
homepage: Optional[str] = None,
citation: Optional[str] = None,
metadata: Optional[Metadata] = None,
license: Optional[str] = None,
redistribution_info: Optional[Dict[str, str]] = None,
split_dict: Optional[splits_lib.SplitDict] = None
)
DatasetInfo documents a dataset, including its name, version, and features.
See the constructor arguments and properties for a full list.
Args | |
---|---|
`builder` | `DatasetBuilder` or `DatasetIdentity`. The dataset builder or identity will be used to populate this info.
`description` | `str`, description of this dataset.
`features` | `tfds.features.FeaturesDict`, information on the feature dict of the `tf.data.Dataset()` object from the `builder.as_dataset()` method.
`supervised_keys` | Specifies the input structure for supervised learning, if applicable for the dataset, used with `as_supervised`. The keys correspond to the feature names to select in `info.features`. When calling `tfds.core.DatasetBuilder.as_dataset()` with `as_supervised=True`, the `tf.data.Dataset` object will yield the structure defined by the keys passed here, instead of that defined by the `features` argument. Typically this is a `(input_key, target_key)` tuple, and the dataset yields a tuple of `(input, target)` tensors. To yield a more complex structure, pass a `tf.nest`-compatible structure of feature names; the dataset will yield the same structure, with tensors in place of the names. Note that selecting features in nested feature dicts is not supported.
`disable_shuffling` | `bool`, specifies whether to disable shuffling of the examples.
`homepage` | `str`, optional, the homepage for this dataset.
`citation` | `str`, optional, the citation to use for this dataset.
`metadata` | `tfds.core.Metadata`, additional object which will be stored/restored with the dataset. This allows for storing additional information with the dataset.
`license` | license of the dataset.
`redistribution_info` | information needed for redistribution, as specified in `dataset_info_pb2.RedistributionInfo`. The content of the `license` subfield will automatically be written to a LICENSE file stored with the dataset.
`split_dict` | information about the splits in this dataset.
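To make the `supervised_keys` behavior concrete, here is a minimal pure-Python sketch (not TFDS internals; the function name and example dict are hypothetical) of how `as_supervised=True` restructures each example: instead of the full feature dict, only the selected keys are yielded as a tuple.

```python
def as_supervised_view(example, supervised_keys):
    """Restructure a feature dict into the tuple defined by supervised_keys.

    Conceptually mimics as_supervised=True: only the (input, target)
    features named in supervised_keys are kept, as a tuple.
    """
    input_key, target_key = supervised_keys
    return example[input_key], example[target_key]

# A dataset example as a feature dict; extra features ("id") are dropped.
example = {"image": [[0, 1], [1, 0]], "label": 1, "id": "ex-0"}
pair = as_supervised_view(example, ("image", "label"))
```

With real TFDS tensors the same selection happens per element of the `tf.data.Dataset`.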
Methods
add_file_data_source_access
add_file_data_source_access(
path: Union[epath.PathLike, Iterable[epath.PathLike]],
url: Optional[str] = None
) -> None
Records that the given file path(s) were used to generate this dataset.
Arguments | |
---|---|
`path` | path or paths of files that were read. Can be a file pattern. Multiple paths or patterns can be specified as a comma-separated string or a list.
`url` | URL referring to the data being used.
add_sql_data_source_access
add_sql_data_source_access(
sql_query: str
) -> None
Records that the given query was used to generate this dataset.
add_tfds_data_source_access
add_tfds_data_source_access(
dataset_reference: naming.DatasetReference, url: Optional[str] = None
) -> None
Records that the given TFDS dataset was used to generate this dataset.
Args | |
---|---|
`dataset_reference` | `naming.DatasetReference`, the TFDS dataset that was used.
`url` | a URL referring to the TFDS dataset.
add_url_access
add_url_access(
url: str, checksum: Optional[str] = None
) -> None
Records the URL used to generate this dataset.
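The bookkeeping that `add_url_access` describes can be pictured with a minimal, hypothetical recorder (the class, attribute, and URL below are illustrative only, not the TFDS implementation): each URL used to build the dataset is remembered, together with its optional checksum, alongside the dataset's metadata.

```python
class Provenance:
    """Toy stand-in for the provenance tracking described above."""

    def __init__(self):
        self.url_accesses = []

    def add_url_access(self, url, checksum=None):
        # Remember every source URL (and its checksum, if known) so the
        # dataset's origin can be audited later.
        self.url_accesses.append({"url": url, "checksum": checksum})

prov = Provenance()
prov.add_url_access("https://example.com/data.zip", checksum="abc123")
```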
from_proto
@classmethod
from_proto( builder, proto: dataset_info_pb2.DatasetInfo ) -> 'DatasetInfo'
Instantiates DatasetInfo from the given builder and proto.
initialize_from_bucket
initialize_from_bucket() -> None
Initialize DatasetInfo from GCS bucket info files.
read_from_directory
read_from_directory(
dataset_info_dir: epath.PathLike
) -> None
Updates DatasetInfo from the metadata files in dataset_info_dir.
This function updates all the dynamically generated fields (num_examples, hash, time of creation, ...) of the DatasetInfo. This will overwrite all previous metadata.
Args | |
---|---|
`dataset_info_dir` | The directory containing the metadata file. This should be the root directory of a specific dataset version.
Raises | |
---|---|
`FileNotFoundError` | If the dataset_info.json can't be found.
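The contract above can be sketched in plain Python (an illustration of the documented behavior, not TFDS internals; the function name is hypothetical): load `dataset_info.json` from the version's root directory, and raise `FileNotFoundError` when the file is absent.

```python
import json
from pathlib import Path

def read_dataset_info(dataset_info_dir):
    """Sketch of read_from_directory's contract: parse dataset_info.json
    from the dataset version's root directory."""
    path = Path(dataset_info_dir) / "dataset_info.json"
    if not path.exists():
        raise FileNotFoundError(f"No dataset_info.json found in {dataset_info_dir}")
    return json.loads(path.read_text())
```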
set_file_format
set_file_format(
file_format: Union[None, str, file_adapters.FileFormat],
override: bool = False
) -> None
Internal function to define the file format.
The file format is set during `FileReaderBuilder.__init__`, not `DatasetInfo.__init__`.
Args | |
---|---|
`file_format` | The file format.
`override` | Whether the file format should be overridden if it is already set.
Raises | |
---|---|
`ValueError` | if the file format was already set and the `override` parameter was False.
`RuntimeError` | if an incorrect combination of options is given, e.g. `override=True` when the DatasetInfo is already fully initialized.
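The `override` contract described in the table can be sketched as follows (a hypothetical stand-in class, not the TFDS implementation): setting a file format a second time fails unless `override=True` is passed.

```python
class InfoWithFormat:
    """Toy illustration of set_file_format's override semantics."""

    def __init__(self):
        self._file_format = None

    def set_file_format(self, file_format, override=False):
        # A second assignment is rejected unless explicitly overridden.
        if self._file_format is not None and not override:
            raise ValueError(
                f"File format is already set to {self._file_format}. "
                "Pass override=True to change it.")
        self._file_format = file_format

info = InfoWithFormat()
info.set_file_format("tfrecord")
```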
set_splits
set_splits(
split_dict: splits_lib.SplitDict
) -> None
Split setter (private method).
update_data_dir
update_data_dir(
data_dir: str
) -> None
Updates the data dir for each split.
write_to_directory
write_to_directory(
dataset_info_dir: epath.PathLike, all_metadata=True
) -> None
Writes DatasetInfo as JSON to dataset_info_dir, along with labels and features.
Args | |
---|---|
`dataset_info_dir` | path to directory in which to save the dataset_info.json file, as well as features.json and *.labels.txt if applicable.
`all_metadata` | defaults to True. If False, will not write metadata which may have an impact on how the data is read (features.json). Should be set to True whenever write_to_directory is called for the first time for a new dataset.
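The write side of this contract can likewise be sketched in plain Python (again an illustration of the documented behavior, not TFDS internals; the function name is hypothetical): serialize the info as `dataset_info.json`, and only emit the read-affecting metadata (features.json in real TFDS) when `all_metadata=True`.

```python
import json
from pathlib import Path

def write_dataset_info(dataset_info_dir, info, all_metadata=True):
    """Sketch of write_to_directory's contract: save dataset_info.json,
    plus features.json when all_metadata is True."""
    out = Path(dataset_info_dir)
    (out / "dataset_info.json").write_text(json.dumps(info))
    if all_metadata:
        # Metadata that affects how the data is read back.
        (out / "features.json").write_text(json.dumps(info.get("features", {})))
```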