tff.simulation.datasets.inaturalist.load_data

Loads a federated version of the iNaturalist 2017 dataset.

If the dataset is loaded for the first time, the images for the entire iNaturalist 2017 dataset will be downloaded from the AWS Open Data Program.

The dataset is created from the images stored in image_dir. Once the dataset is created, it is cached in cache_dir.

The tf.data.Datasets returned by tff.simulation.datasets.ClientData.create_tf_dataset_for_client will yield collections.OrderedDict objects at each iteration, with the following keys and values:

  • 'image/decoded': A tf.Tensor with dtype=tf.uint8 that corresponds to the pixels of the images.
  • 'class': A tf.Tensor with dtype=tf.int64 and shape [1], corresponding to the class label.
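
As a minimal sketch of consuming this element structure, the helper below splits one element into an (image, label) pair. The name to_image_label is hypothetical and not part of TFF. It is exercised here on a plain collections.OrderedDict with nested lists standing in for the tensors; with a real client dataset, the same function could presumably be passed to tf.data.Dataset.map.

```python
import collections


def to_image_label(element):
    """Split one dataset element into an (image, label) pair."""
    return element['image/decoded'], element['class']


# Stand-in element: nested lists replace the tf.Tensor values.
# This is an assumption for illustration only, not real dataset output.
fake_element = collections.OrderedDict([
    ('image/decoded', [[[0, 0, 0], [255, 255, 255]]]),  # 1 x 2 x 3 "image"
    ('class', [7]),  # label with shape [1]
])

image, label = to_image_label(fake_element)
print(len(image), len(image[0]), len(image[0][0]))  # height, width, channels
print(label)
```

Returning a tuple rather than the dictionary is a design choice that suits training loops expecting (features, label) pairs; keeping the OrderedDict is equally valid.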

Seven splits of the iNaturalist dataset are available. The details of each split can be found in https://arxiv.org/abs/2003.08082.

For the USER_120K dataset, the images are split by user id. The number of clients for USER_120K is 9,275. The training set contains 120,300 images of 1,203 species, and the test set contains 35,641 images.

For the GEO_* datasets, the images are split by geographic location. The number of clients for each GEO_* dataset is:

  1. GEO_100: 3607.
  2. GEO_300: 1209.
  3. GEO_1K: 369.
  4. GEO_3K: 136.
  5. GEO_10K: 39.
  6. GEO_30K: 12.
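
The client counts above can be kept in a small lookup table and used as a sanity check after loading a split. The sketch below is illustrative only: EXPECTED_NUM_CLIENTS and has_expected_client_count are hypothetical names, and it assumes the listed counts refer to the clients of the training ClientData.

```python
# Client counts as listed in this section, keyed by split name.
EXPECTED_NUM_CLIENTS = {
    'USER_120K': 9275,
    'GEO_100': 3607,
    'GEO_300': 1209,
    'GEO_1K': 369,
    'GEO_3K': 136,
    'GEO_10K': 39,
    'GEO_30K': 12,
}


def has_expected_client_count(split_name, client_ids):
    """Return True if the number of client ids matches the documented count."""
    return len(client_ids) == EXPECTED_NUM_CLIENTS[split_name]


# Average number of training images per client for USER_120K,
# using the 120,300 training images stated above.
avg_images_per_client = 120300 / EXPECTED_NUM_CLIENTS['USER_120K']
print(f'{avg_images_per_client:.2f}')
```

With a loaded split, the check could be called as has_expected_client_count('USER_120K', train.client_ids), where train is the ClientData returned by load_data.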

image_dir (Optional) The directory containing the images downloaded from https://github.com/visipedia/inat_comp/tree/master/2017
cache_dir (Optional) The directory to cache the created datasets.
split (Optional) The split of the dataset to load. Defaults to the split by user.

A tuple of (train, test), where train is a tff.simulation.datasets.ClientData and test is a tf.data.Dataset.
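
A hedged usage sketch follows. It assumes an installed TFF version that still ships this module, and the wrapper name load_user_split and any directory paths passed to it are hypothetical. The import is placed inside the function so that defining it does not require TFF; calling it may trigger the full image download described above.

```python
def load_user_split(image_dir, cache_dir):
    """Load the default (by-user) split and return (train, test)."""
    # Imported lazily so this module can be defined without TFF installed.
    import tensorflow_federated as tff

    # The split argument is omitted, so the documented default is used.
    train, test = tff.simulation.datasets.inaturalist.load_data(
        image_dir=image_dir,
        cache_dir=cache_dir,
    )
    return train, test


def first_client_dataset(train):
    """Build the tf.data.Dataset for the first client id in `train`."""
    client_id = train.client_ids[0]
    return train.create_tf_dataset_for_client(client_id)
```

For example, train, test = load_user_split('images', 'cache') followed by first_client_dataset(train) would give one client's dataset, whose elements carry the 'image/decoded' and 'class' keys.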