This guide documents all format-specific dataset builders currently available in TFDS.
Format-specific dataset builders are subclasses of `tfds.core.GeneratorBasedBuilder`, which take care of most data processing for a specific data format.
## Datasets based on `tf.data.Dataset`
If you want to create a TFDS dataset from data in `tf.data.Dataset` format, then you can use `tfds.dataset_builders.TfDataBuilder` (see the API docs).
We envision two typical uses of this class:
- Creating experimental datasets in a notebook-like environment
- Defining a dataset builder in code
### Creating a new dataset from a notebook
Suppose you are working in a notebook, loaded some data as a `tf.data.Dataset`, applied various transformations (map, filter, etc.), and now you want to store this data and easily share it with teammates or load it in other notebooks.
Instead of defining a new dataset builder class, you can instantiate a `tfds.dataset_builders.TfDataBuilder` and call `download_and_prepare` to store your dataset as a TFDS dataset.

Because it's a TFDS dataset, you can version it, use configs, have different splits, and document it for easier use later. Note that this also means you need to tell TFDS what the features in your dataset are.
Here's a dummy example of how you can use it.
```python
import tensorflow as tf
import tensorflow_datasets as tfds

my_ds_train = tf.data.Dataset.from_tensor_slices({"number": [1, 2, 3]})
my_ds_test = tf.data.Dataset.from_tensor_slices({"number": [4, 5]})

# Optionally define a custom `data_dir`.
# If None, then the default data dir is used.
custom_data_dir = "/my/folder"

# Define the builder.
single_number_builder = tfds.dataset_builders.TfDataBuilder(
    name="my_dataset",
    config="single_number",
    version="1.0.0",
    data_dir=custom_data_dir,
    split_datasets={
        "train": my_ds_train,
        "test": my_ds_test,
    },
    features=tfds.features.FeaturesDict({
        "number": tfds.features.Scalar(dtype=tf.int64),
    }),
    description="My dataset with a single number.",
    release_notes={
        "1.0.0": "Initial release with numbers up to 5!",
    }
)

# Make the builder store the data as a TFDS dataset.
single_number_builder.download_and_prepare()
```
The `download_and_prepare` method will iterate over the input `tf.data.Dataset`s and store the corresponding TFDS dataset in `/my/folder/my_dataset/single_number/1.0.0`, which will contain both the train and test splits.

The `config` argument is optional and can be useful if you want to store different configs under the same dataset.

The `data_dir` argument can be used to store the generated TFDS dataset in a different folder, for example in your own sandbox if you don't want to share this with others (yet). Note that when doing this, you also need to pass the `data_dir` to `tfds.load`. If the `data_dir` argument is not specified, then the default TFDS data dir will be used.
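If you are still working in the same session, you can also read the prepared data back directly from the builder object. Here is a minimal sketch using the standard `DatasetBuilder.as_dataset` API, reusing the names from the example above:

```python
# Read the prepared data back as a tf.data.Dataset.
ds_train = single_number_builder.as_dataset(split="train")
for example in ds_train:
  print(example["number"])
```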
#### Loading your dataset
After the TFDS dataset has been stored, it can be loaded from other scripts or by teammates if they have access to the data:
```python
# If no custom data dir was specified:
ds_test = tfds.load("my_dataset/single_number", split="test")

# When there are multiple versions, you can also specify the version.
ds_test = tfds.load("my_dataset/single_number:1.0.0", split="test")

# If the TFDS dataset was stored in a custom folder, then it can be loaded as follows:
custom_data_dir = "/my/folder"
ds_test = tfds.load("my_dataset/single_number:1.0.0", split="test", data_dir=custom_data_dir)
```
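If you want to inspect the dataset's metadata (features, splits, etc.), you can also recover the corresponding builder. Here is a minimal sketch using the standard `tfds.builder` API, with the dataset name and `data_dir` from the example above:

```python
# Recover the builder to inspect its metadata.
builder = tfds.builder("my_dataset/single_number:1.0.0", data_dir=custom_data_dir)
print(builder.info.features)
print(builder.info.splits)
```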
#### Adding a new version or config

After iterating further on your dataset, you may have added or changed some of the transformations of the source data. To share this new dataset, you can easily store it as a new version.
```python
def add_one(example):
  example["number"] = example["number"] + 1
  return example

my_ds_train_v2 = my_ds_train.map(add_one)
my_ds_test_v2 = my_ds_test.map(add_one)

single_number_builder_v2 = tfds.dataset_builders.TfDataBuilder(
    name="my_dataset",
    config="single_number",
    version="1.1.0",
    data_dir=custom_data_dir,
    split_datasets={
        "train": my_ds_train_v2,
        "test": my_ds_test_v2,
    },
    features=tfds.features.FeaturesDict({
        "number": tfds.features.Scalar(dtype=tf.int64, doc="Some number"),
    }),
    description="My dataset with a single number.",
    release_notes={
        "1.1.0": "Incremented the numbers by one (up to 6)!",
        "1.0.0": "Initial release with numbers up to 5!",
    }
)

# Make the builder store the data as a TFDS dataset.
single_number_builder_v2.download_and_prepare()
```
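The new version can then be loaded like any other version, following the same loading pattern as above:

```python
# Load the test split of the new version.
ds_test_v2 = tfds.load("my_dataset/single_number:1.1.0", split="test", data_dir=custom_data_dir)
```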
### Defining a new dataset builder class

You can also define a new `DatasetBuilder` based on this class.
```python
import tensorflow as tf
import tensorflow_datasets as tfds


class MyDatasetBuilder(tfds.dataset_builders.TfDataBuilder):

  def __init__(self):
    # Wrap the values in a dict so the examples match the features declared below.
    ds_train = tf.data.Dataset.from_tensor_slices({"number": [1, 2, 3]})
    ds_test = tf.data.Dataset.from_tensor_slices({"number": [4, 5]})
    super().__init__(
        name="my_dataset",
        version="1.0.0",
        split_datasets={
            "train": ds_train,
            "test": ds_test,
        },
        features=tfds.features.FeaturesDict({
            "number": tfds.features.Scalar(dtype=tf.int64),
        }),
        config="single_number",
        description="My dataset with a single number.",
        release_notes={
            "1.0.0": "Initial release with numbers up to 5!",
        })
```
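Here is a minimal sketch of how this builder could then be used, reusing the class defined above:

```python
# Instantiate the builder, store the data as a TFDS dataset, and read it back.
builder = MyDatasetBuilder()
builder.download_and_prepare()
ds_train = builder.as_dataset(split="train")
```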
## CroissantBuilder

### The format
Croissant 🥐 is a high-level format for machine learning datasets that combines metadata, resource file descriptions, data structure, and default ML semantics into a single file; it works with existing datasets to make them easier to find, use, and support with tools.
Croissant builds on schema.org and its `sc:Dataset` vocabulary, a widely used format for representing datasets on the Web and making them searchable.
### CroissantBuilder

A `CroissantBuilder` defines a TFDS dataset based on a Croissant 🥐 metadata file; each of the `record_set_ids` specified will result in a separate `ConfigBuilder`.

For example, to initialize a `CroissantBuilder` for the MNIST dataset using its Croissant 🥐 definition:
```python
import tensorflow_datasets as tfds

builder = tfds.dataset_builders.CroissantBuilder(
    jsonld="https://raw.githubusercontent.com/mlcommons/croissant/main/datasets/0.8/huggingface-mnist/metadata.json",
    file_format='array_record',
)
builder.download_and_prepare()
ds = builder.as_data_source()
print(ds['default'][0])
```
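Because the dataset was prepared in `array_record` format, the returned data source supports random access by index. A small sketch continuing the example above (the `'default'` split name is the one used there):

```python
# Random access into the prepared data source.
print(len(ds['default']))  # Number of examples in the record set.
for i in range(3):
  print(ds['default'][i].keys())
```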
## CoNLL

### The format
CoNLL is a popular format used to represent annotated text data.
CoNLL-formatted data usually contain one token with its linguistic annotations per line; within the same line, annotations are usually separated by spaces or tabs. Empty lines represent sentence boundaries.
Consider as an example the following sentence from the conll2003 dataset, which follows the CoNLL annotation format:
```
U.N. NNP I-NP I-ORG
official NN I-NP O
Ekeus NNP I-NP I-PER
heads VBZ I-VP O
for IN I-PP O
Baghdad NNP I-NP I-LOC
. . O O
```
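To make the column layout concrete, here is a small, hypothetical sketch (not part of TFDS) that parses whitespace-separated lines in this shape; the column meanings (token, POS tag, chunk tag, NER tag) are assumed from the conll2003 layout shown above:

```python
def parse_conll_sentence(lines):
  """Parses one CoNLL sentence into per-column lists of annotations."""
  tokens, pos_tags, chunk_tags, ner_tags = [], [], [], []
  for line in lines:
    if not line.strip():  # An empty line marks a sentence boundary.
      break
    token, pos, chunk, ner = line.split()
    tokens.append(token)
    pos_tags.append(pos)
    chunk_tags.append(chunk)
    ner_tags.append(ner)
  return {'tokens': tokens, 'pos': pos_tags, 'chunks': chunk_tags, 'ner': ner_tags}
```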
### ConllDatasetBuilder

To add a new CoNLL-based dataset to TFDS, you can base your dataset builder class on `tfds.dataset_builders.ConllDatasetBuilder`. This base class contains common code to deal with the specificities of CoNLL datasets (iterating over the column-based format, precompiled lists of features and tags, ...).

`tfds.dataset_builders.ConllDatasetBuilder` implements a CoNLL-specific `GeneratorBasedBuilder`. Refer to the following class as a minimal example of a CoNLL dataset builder:
```python
from tensorflow_datasets.core.dataset_builders.conll import conll_dataset_builder_utils as conll_lib
import tensorflow_datasets.public_api as tfds


class MyCoNNLDataset(tfds.dataset_builders.ConllDatasetBuilder):
  VERSION = tfds.core.Version('1.0.0')
  RELEASE_NOTES = {'1.0.0': 'Initial release.'}

  # conll_lib contains a set of ready-to-use CoNLL-specific configs.
  BUILDER_CONFIGS = [conll_lib.CONLL_2003_CONFIG]

  def _info(self) -> tfds.core.DatasetInfo:
    return self.create_dataset_info(
        # ...
    )

  def _split_generators(self, dl_manager):
    path = dl_manager.download_and_extract('https://data-url')

    return {'train': self._generate_examples(path=path / 'train.txt'),
            'test': self._generate_examples(path=path / 'test.txt'),
    }
```
As with standard dataset builders, it requires overriding the class methods `_info` and `_split_generators`. Depending on the dataset, you might also need to update `conll_dataset_builder_utils.py` to include the features and the list of tags specific to your dataset.

The `_generate_examples` method should not require further overriding, unless your dataset needs a specific implementation.
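Here is a minimal sketch of how the dataset defined above could then be built and read programmatically (assuming the first builder config is used when none is specified explicitly):

```python
# Prepare and read back the dataset defined above.
builder = MyCoNNLDataset()
builder.download_and_prepare()
ds_train = builder.as_dataset(split='train')
```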
### Examples
Consider conll2003 as an example of a dataset implemented using the CoNLL-specific dataset builder.
### CLI
The easiest way to write a new CoNLL-based dataset is to use the TFDS CLI:
```sh
cd path/to/my/project/datasets/
tfds new my_dataset --format=conll   # Create `my_dataset/my_dataset.py` CoNLL-specific template files
```
## CoNLL-U

### The format
CoNLL-U is a popular format used to represent annotated text data.
CoNLL-U enhances the CoNLL format by adding a number of features, such as support for multi-token words. CoNLL-U formatted data usually contain one token with its linguistic annotations per line; within the same line, annotations are usually separated by single tab characters. Empty lines represent sentence boundaries.
Typically, each CoNLL-U annotated word line contains the following fields, as reported in the official documentation:
- ID: Word index, integer starting at 1 for each new sentence; may be a range for multiword tokens; may be a decimal number for empty nodes (decimal numbers can be lower than 1 but must be greater than 0).
- FORM: Word form or punctuation symbol.
- LEMMA: Lemma or stem of word form.
- UPOS: Universal part-of-speech tag.
- XPOS: Language-specific part-of-speech tag; underscore if not available.
- FEATS: List of morphological features from the universal feature inventory or from a defined language-specific extension; underscore if not available.
- HEAD: Head of the current word, which is either a value of ID or zero (0).
- DEPREL: Universal dependency relation to the HEAD (root iff HEAD = 0) or a defined language-specific subtype of one.
- DEPS: Enhanced dependency graph in the form of a list of head-deprel pairs.
- MISC: Any other annotation.
Consider as an example the following CoNLL-U annotated sentence from the official documentation:
```
1-2  vámonos  _
1    vamos    ir
2    nos      nosotros
3-4  al       _
3    a        a
4    el       el
5    mar      mar
```
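To make the field layout concrete, here is a small, hypothetical sketch (not part of TFDS) that splits a fully annotated, tab-separated CoNLL-U word line into the ten fields listed above:

```python
CONLLU_FIELDS = [
    'ID', 'FORM', 'LEMMA', 'UPOS', 'XPOS',
    'FEATS', 'HEAD', 'DEPREL', 'DEPS', 'MISC',
]


def parse_conllu_line(line):
  """Maps one tab-separated CoNLL-U word line to a dict of its ten fields."""
  values = line.rstrip('\n').split('\t')
  return dict(zip(CONLLU_FIELDS, values))
```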
### ConllUDatasetBuilder

To add a new CoNLL-U based dataset to TFDS, you can base your dataset builder class on `tfds.dataset_builders.ConllUDatasetBuilder`. This base class contains common code to deal with the specificities of CoNLL-U datasets (iterating over the column-based format, precompiled lists of features and tags, ...).

`tfds.dataset_builders.ConllUDatasetBuilder` implements a CoNLL-U specific `GeneratorBasedBuilder`. Refer to the following class as a minimal example of a CoNLL-U dataset builder:
```python
from tensorflow_datasets.core.dataset_builders.conll import conllu_dataset_builder_utils as conllu_lib
import tensorflow_datasets.public_api as tfds


class MyCoNNLUDataset(tfds.dataset_builders.ConllUDatasetBuilder):
  VERSION = tfds.core.Version('1.0.0')
  RELEASE_NOTES = {'1.0.0': 'Initial release.'}

  # conllu_lib contains a set of ready-to-use features.
  BUILDER_CONFIGS = [
      conllu_lib.get_universal_morphology_config(
          language='en',
          features=conllu_lib.UNIVERSAL_DEPENDENCIES_FEATURES,
      )
  ]

  def _info(self) -> tfds.core.DatasetInfo:
    return self.create_dataset_info(
        # ...
    )

  def _split_generators(self, dl_manager):
    path = dl_manager.download_and_extract('https://data-url')

    return {
        'train':
            self._generate_examples(
                path=path / 'train.txt',
                # If necessary, add optional custom processing (see conllu_lib
                # for examples).
                # process_example_fn=...,
            )
    }
```
As with standard dataset builders, it requires overriding the class methods `_info` and `_split_generators`. Depending on the dataset, you might also need to update `conllu_dataset_builder_utils.py` to include the features and the list of tags specific to your dataset.

The `_generate_examples` method should not require further overriding, unless your dataset needs a specific implementation. Note that, if your dataset requires specific preprocessing - for example if it considers non-classic universal dependency features - you might need to update the `process_example_fn` attribute of your `generate_examples` function (see the xtreme_pos dataset as an example).
### Examples

Consider, for example, the xtreme_pos dataset, which is implemented using the CoNLL-U specific dataset builder.
### CLI
The easiest way to write a new CoNLL-U based dataset is to use the TFDS CLI:
```sh
cd path/to/my/project/datasets/
tfds new my_dataset --format=conllu   # Create `my_dataset/my_dataset.py` CoNLL-U specific template files
```