tff.simulation.datasets.inaturalist.load_data
Loads a federated version of the iNaturalist 2017 dataset.
tff.simulation.datasets.inaturalist.load_data(
    image_dir: str = 'images',
    cache_dir: str = 'cache',
    split: tff.simulation.datasets.inaturalist.INaturalistSplit = tff.simulation.datasets.inaturalist.INaturalistSplit.USER_120K
) -> tuple[ClientData, tf.data.Dataset]
The first time the dataset is loaded, the images for the entire
iNaturalist 2017 dataset are downloaded from the AWS Open Data Program.
The dataset is then created from the images stored in image_dir. Once
created, it is cached in cache_dir.
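A minimal sketch of calling the loader, assuming the tensorflow_federated package is installed in a version that still ships this module. The wrapper name load_inaturalist_user_split is hypothetical. The import is deferred into the function body so that defining the function neither requires TFF nor triggers the download.

```python
def load_inaturalist_user_split(image_dir="images", cache_dir="cache"):
    """Load the USER_120K split and return (train, test)."""
    # Deferred import: assumes tensorflow_federated is installed.
    import tensorflow_federated as tff

    inat = tff.simulation.datasets.inaturalist
    # On the first call the images are downloaded and the dataset is built
    # from image_dir; the created dataset is then cached in cache_dir.
    train, test = inat.load_data(
        image_dir=image_dir,
        cache_dir=cache_dir,
        split=inat.INaturalistSplit.USER_120K,
    )
    return train, test
```

Calling load_inaturalist_user_split() with no arguments mirrors the documented defaults; pass explicit directories if the working directory is not a suitable place for a large download.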
The tf.data.Datasets returned by
tff.simulation.datasets.ClientData.create_tf_dataset_for_client will yield
collections.OrderedDict objects at each iteration, with the following keys
and values:
- 'image/decoded': A tf.Tensor with dtype=tf.uint8 that corresponds to the pixels of the images.
- 'class': A tf.Tensor with dtype=tf.int64 and shape [1], corresponding to the class label.
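The element structure above can be consumed as in the following sketch. The helper names split_features_and_label and first_client_examples are hypothetical, and train is assumed to be the ClientData returned by load_data. The first helper works on any mapping with the two documented keys, so it can be checked without TensorFlow.

```python
EXPECTED_KEYS = ("image/decoded", "class")


def split_features_and_label(element):
    """Return (image, label) from one yielded element."""
    missing = [key for key in EXPECTED_KEYS if key not in element]
    if missing:
        raise KeyError(f"element is missing keys: {missing}")
    return element["image/decoded"], element["class"]


def first_client_examples(train, limit=3):
    """Yield up to `limit` (image, label) pairs from the first client."""
    # `train` is assumed to be the ClientData returned by load_data.
    client_id = train.client_ids[0]
    dataset = train.create_tf_dataset_for_client(client_id)
    for element in dataset.take(limit):
        yield split_features_and_label(element)
```

Checking for the keys up front gives a clearer error than a bare lookup failure if a preprocessing step has renamed or dropped a field.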
Seven splits of the iNaturalist dataset are available. The details of each
split can be found in https://arxiv.org/abs/2003.08082.
For the USER_120K dataset, the images are split by user id.
The number of clients for USER_120K is 9,275. The training set contains
120,300 images of 1,203 species, and the test set contains 35,641 images.
For the GEO_* datasets, the images are split by geographic location.
The number of clients for the GEO_* datasets:
- GEO_100: 3607.
- GEO_300: 1209.
- GEO_1K: 369.
- GEO_3K: 136.
- GEO_10K: 39.
- GEO_30K: 12.
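Because the splits differ so widely in client count, it can help to validate a simulation's per-round sample size against the chosen split. The sketch below copies the client counts from the list above into a lookup table; the names NUM_CLIENTS and check_clients_per_round are hypothetical and not part of TFF.

```python
# Client counts per split, copied from the figures stated above.
NUM_CLIENTS = {
    "USER_120K": 9275,
    "GEO_100": 3607,
    "GEO_300": 1209,
    "GEO_1K": 369,
    "GEO_3K": 136,
    "GEO_10K": 39,
    "GEO_30K": 12,
}

# Training images in USER_120K, as stated above.
USER_120K_TRAIN_IMAGES = 120_300


def check_clients_per_round(split_name, clients_per_round):
    """Return the split's client count, or raise if the round is too large."""
    available = NUM_CLIENTS[split_name]
    if not 1 <= clients_per_round <= available:
        raise ValueError(
            f"{split_name} has {available} clients; "
            f"cannot sample {clients_per_round} per round"
        )
    return available


# Mean training images per client in USER_120K (just under 13).
mean_images_per_client = USER_120K_TRAIN_IMAGES / NUM_CLIENTS["USER_120K"]
```

For example, a round size of 50 is valid for USER_120K but not for GEO_10K or GEO_30K, which have only 39 and 12 clients.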
| Args | |
|---|---|
| image_dir | (Optional) The directory containing the images downloaded from https://github.com/visipedia/inat_comp/tree/master/2017 |
| cache_dir | (Optional) The directory in which to cache the created datasets. |
| split | (Optional) The split of the dataset. Defaults to INaturalistSplit.USER_120K, which splits by user. |
| Returns |
|---|
| Tuple of (train, test), where train is a tff.simulation.datasets.ClientData and test is a tf.data.Dataset. |

Last updated 2024-09-20 UTC.