TensorFlow Datasets

TensorFlow Datasets provides a collection of ready-to-use datasets. It handles downloading and preparing the data and constructing a tf.data.Dataset.

Copyright 2018 The TensorFlow Datasets Authors, Licensed under the Apache License, Version 2.0


Installation

pip install tensorflow-datasets

Note that tensorflow-datasets expects you to have TensorFlow already installed, and currently depends on tensorflow (or tensorflow-gpu) >= 1.15.0.

pip install -q tensorflow tensorflow-datasets matplotlib
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import matplotlib.pyplot as plt
import numpy as np
import tensorflow.compat.v2 as tf

import tensorflow_datasets as tfds
tfds.disable_progress_bar()

Citation

Please include the following citation when using tensorflow-datasets for a paper, in addition to any citation specific to the datasets used.

@misc{TFDS,
  title = { {TensorFlow Datasets}, A collection of ready-to-use datasets},
  howpublished = {\url{https://www.tensorflow.org/datasets}},
}

Eager execution

TensorFlow Datasets is compatible with both TensorFlow Eager mode and Graph mode. For this colab, we'll run in Eager mode.

tf.enable_v2_behavior()

List the available datasets

Each dataset is implemented as a tfds.core.DatasetBuilder and you can list all available builders with tfds.list_builders().

You can see all the datasets, with additional documentation, on the datasets documentation page.

tfds.list_builders()
['abstract_reasoning',
 'aeslc',
 'aflw2k3d',
 'amazon_us_reviews',
 'arc',
 'bair_robot_pushing_small',
 'big_patent',
 'bigearthnet',
 'billsum',
 'binarized_mnist',
 'binary_alpha_digits',
 'c4',
 'caltech101',
 'caltech_birds2010',
 'caltech_birds2011',
 'cars196',
 'cassava',
 'cats_vs_dogs',
 'celeb_a',
 'celeb_a_hq',
 'chexpert',
 'cifar10',
 'cifar100',
 'cifar10_1',
 'cifar10_corrupted',
 'citrus_leaves',
 'cityscapes',
 'civil_comments',
 'clevr',
 'cmaterdb',
 'cnn_dailymail',
 'coco',
 'coil100',
 'colorectal_histology',
 'colorectal_histology_large',
 'cos_e',
 'curated_breast_imaging_ddsm',
 'cycle_gan',
 'deep_weeds',
 'definite_pronoun_resolution',
 'diabetic_retinopathy_detection',
 'dmlab',
 'downsampled_imagenet',
 'dsprites',
 'dtd',
 'duke_ultrasound',
 'dummy_dataset_shared_generator',
 'dummy_mnist',
 'emnist',
 'esnli',
 'eurosat',
 'fashion_mnist',
 'flic',
 'flores',
 'food101',
 'gap',
 'gigaword',
 'glue',
 'groove',
 'higgs',
 'horses_or_humans',
 'i_naturalist2017',
 'image_label_folder',
 'imagenet2012',
 'imagenet2012_corrupted',
 'imagenet_resized',
 'imagenette',
 'imdb_reviews',
 'iris',
 'kitti',
 'kmnist',
 'lfw',
 'lm1b',
 'lost_and_found',
 'lsun',
 'malaria',
 'math_dataset',
 'mnist',
 'mnist_corrupted',
 'movie_rationales',
 'moving_mnist',
 'multi_news',
 'multi_nli',
 'multi_nli_mismatch',
 'newsroom',
 'nsynth',
 'omniglot',
 'open_images_v4',
 'oxford_flowers102',
 'oxford_iiit_pet',
 'para_crawl',
 'patch_camelyon',
 'pet_finder',
 'places365_small',
 'plant_leaves',
 'plant_village',
 'plantae_k',
 'quickdraw_bitmap',
 'reddit_tifu',
 'resisc45',
 'rock_paper_scissors',
 'rock_you',
 'scan',
 'scene_parse150',
 'scicite',
 'scientific_papers',
 'shapes3d',
 'smallnorb',
 'snli',
 'so2sat',
 'squad',
 'stanford_dogs',
 'stanford_online_products',
 'starcraft_video',
 'sun397',
 'super_glue',
 'svhn_cropped',
 'ted_hrlr_translate',
 'ted_multi_translate',
 'tf_flowers',
 'the300w_lp',
 'titanic',
 'trivia_qa',
 'uc_merced',
 'ucf101',
 'vgg_face2',
 'visual_domain_decathlon',
 'voc',
 'wider_face',
 'wikihow',
 'wikipedia',
 'wmt14_translate',
 'wmt15_translate',
 'wmt16_translate',
 'wmt17_translate',
 'wmt18_translate',
 'wmt19_translate',
 'wmt_t2t_translate',
 'wmt_translate',
 'xnli',
 'xsum']

tfds.load: A dataset in one line

tfds.load is a convenience method that's the simplest way to build and load a tf.data.Dataset.

tf.data.Dataset is the standard TensorFlow API to build input pipelines. If you're not familiar with this API, we strongly encourage you to read the official TensorFlow guide.

Below, we load the MNIST training data. It downloads and prepares the data, unless you specify download=False. Note that once the data has been prepared, subsequent calls to load will reuse the prepared data. You can customize where the data is saved and loaded by specifying data_dir (defaults to ~/tensorflow_datasets/).

mnist_train = tfds.load(name="mnist", split="train")
assert isinstance(mnist_train, tf.data.Dataset)
print(mnist_train)
WARNING:absl:Dataset mnist is hosted on GCS. It will automatically be downloaded to your
local data directory. If you'd instead prefer to read directly from our public
GCS bucket (recommended if you're running on GCP), you can instead set
data_dir=gs://tfds-data/datasets.


Downloading and preparing dataset mnist (11.06 MiB) to /home/kbuilder/tensorflow_datasets/mnist/3.0.0...
Dataset mnist downloaded and prepared to /home/kbuilder/tensorflow_datasets/mnist/3.0.0. Subsequent calls will reuse this data.
<DatasetV1Adapter shapes: {label: (), image: (28, 28, 1)}, types: {label: tf.int64, image: tf.uint8}>
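
As mentioned above, you can combine the data_dir and download arguments to control where the prepared data lives and whether any network access happens. A minimal sketch (the directory path here is just an example):

# Prepare and read the data under a custom directory instead of
# ~/tensorflow_datasets/ (the path is just an example).
mnist_train = tfds.load(name="mnist", split="train",
                        data_dir="/tmp/my_tfds_data")

# With download=False, load only reuses already-prepared data and
# raises an error instead of downloading anything.
mnist_train = tfds.load(name="mnist", split="train", download=False)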

When loading a dataset, the canonical default version is used. However, it is recommended to specify the major version of the dataset to use, and to advertise which version of the dataset was used in your results. See the documentation on datasets versioning for more details.

mnist = tfds.load("mnist:1.*.*")
WARNING:absl:Found a different version 3.0.0 of dataset mnist in data_dir /home/kbuilder/tensorflow_datasets. Using currently defined version 1.0.0.
WARNING:absl:Dataset mnist is hosted on GCS. It will automatically be downloaded to your
local data directory. If you'd instead prefer to read directly from our public
GCS bucket (recommended if you're running on GCP), you can instead set
data_dir=gs://tfds-data/datasets.


Downloading and preparing dataset mnist (11.06 MiB) to /home/kbuilder/tensorflow_datasets/mnist/1.0.0...
Dataset mnist downloaded and prepared to /home/kbuilder/tensorflow_datasets/mnist/1.0.0. Subsequent calls will reuse this data.

Feature dictionaries

All TFDS datasets contain feature dictionaries mapping feature names to Tensor values. A typical dataset, like MNIST, will have two keys: "image" and "label". Below we inspect a single example.

for mnist_example in mnist_train.take(1):  # Only take a single example
  image, label = mnist_example["image"], mnist_example["label"]

  plt.imshow(image.numpy()[:, :, 0].astype(np.float32), cmap=plt.get_cmap("gray"))
  print("Label: %d" % label.numpy())
Label: 4

[image: the MNIST example digit plotted in grayscale]
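
If you prefer plain NumPy arrays over Tensors, tfds.as_numpy converts a tf.data.Dataset into an iterable of NumPy examples. A quick sketch using the dataset loaded above:

# Iterate over examples as NumPy arrays rather than eager Tensors.
for example in tfds.as_numpy(mnist_train.take(1)):
  image, label = example["image"], example["label"]
  print(image.shape, image.dtype, label)  # (28, 28, 1) uint8 and an int label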

DatasetBuilder

tfds.load is really a thin convenience wrapper around DatasetBuilder. We can accomplish the same as above directly with the MNIST DatasetBuilder.

mnist_builder = tfds.builder("mnist")
mnist_builder.download_and_prepare()
mnist_train = mnist_builder.as_dataset(split="train")
mnist_train
<DatasetV1Adapter shapes: {label: (), image: (28, 28, 1)}, types: {label: tf.int64, image: tf.uint8}>

Input pipelines

Once you have a tf.data.Dataset object, it's simple to define the rest of an input pipeline suitable for model training by using the tf.data API.

Here we'll repeat the dataset so that we have an infinite stream of examples, shuffle, and create batches of 32.

mnist_train = mnist_train.repeat().shuffle(1024).batch(32)

# prefetch will enable the input pipeline to asynchronously fetch batches while
# your model is training.
mnist_train = mnist_train.prefetch(tf.data.experimental.AUTOTUNE)

# Now you could loop over batches of the dataset and train
# for batch in mnist_train:
#   ...
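
For supervised training, tfds.load also accepts as_supervised=True, which yields (input, label) tuples instead of feature dictionaries, so the dataset can be passed straight to a tf.keras model. A sketch under that assumption; the normalize_img helper is our own addition:

def normalize_img(image, label):
  # Scale uint8 pixel values to floats in [0, 1].
  return tf.cast(image, tf.float32) / 255.0, label

ds = tfds.load("mnist", split="train", as_supervised=True)
ds = ds.map(normalize_img,
            num_parallel_calls=tf.data.experimental.AUTOTUNE)
ds = ds.shuffle(1024).batch(32).prefetch(tf.data.experimental.AUTOTUNE)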

DatasetInfo

After generation, the builder contains useful information on the dataset:

info = mnist_builder.info
print(info)
tfds.core.DatasetInfo(
    name='mnist',
    version=3.0.0,
    description='The MNIST database of handwritten digits.',
    homepage='http://yann.lecun.com/exdb/mnist/',
    features=FeaturesDict({
        'image': Image(shape=(28, 28, 1), dtype=tf.uint8),
        'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
    }),
    total_num_examples=70000,
    splits={
        'test': 10000,
        'train': 60000,
    },
    supervised_keys=('image', 'label'),
    citation="""@article{lecun2010mnist,
      title={MNIST handwritten digit database},
      author={LeCun, Yann and Cortes, Corinna and Burges, CJ},
      journal={ATT Labs [Online]. Available: http://yann. lecun. com/exdb/mnist},
      volume={2},
      year={2010}
    }""",
    redistribution_info=,
)

DatasetInfo also contains useful information about the features:

print(info.features)
print(info.features["label"].num_classes)
print(info.features["label"].names)
FeaturesDict({
    'image': Image(shape=(28, 28, 1), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
})
10
['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
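
DatasetInfo also lets you look up split sizes and feature shapes programmatically, for example:

print(info.splits["train"].num_examples)  # 60000
print(info.splits["test"].num_examples)   # 10000
print(info.features["image"].shape)       # (28, 28, 1)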

You can also load the DatasetInfo directly with tfds.load using with_info=True.

mnist_test, info = tfds.load("mnist", split="test", with_info=True)
print(info)
tfds.core.DatasetInfo(
    name='mnist',
    version=3.0.0,
    description='The MNIST database of handwritten digits.',
    homepage='http://yann.lecun.com/exdb/mnist/',
    features=FeaturesDict({
        'image': Image(shape=(28, 28, 1), dtype=tf.uint8),
        'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
    }),
    total_num_examples=70000,
    splits={
        'test': 10000,
        'train': 60000,
    },
    supervised_keys=('image', 'label'),
    citation="""@article{lecun2010mnist,
      title={MNIST handwritten digit database},
      author={LeCun, Yann and Cortes, Corinna and Burges, CJ},
      journal={ATT Labs [Online]. Available: http://yann. lecun. com/exdb/mnist},
      volume={2},
      year={2010}
    }""",
    redistribution_info=,
)

Visualization

For image classification datasets, you can use tfds.show_examples to display some examples.

fig = tfds.show_examples(info, mnist_test)

[image: a grid of example MNIST digits with their labels]