• Description:

This dataset contains ILSVRC-2012 (ImageNet) validation images annotated with multi-class labels from "Evaluating Machine Accuracy on ImageNet", ICML, 2020. The multi-class labels were reviewed by a panel of experts extensively trained in the intricacies of fine-grained class distinctions in the ImageNet class hierarchy (see paper for more details). Compared to the original labels, these expert-reviewed multi-class labels enable a more semantically coherent evaluation of accuracy.

Version 3.0.0 of this dataset contains more corrected labels from "When does dough become a bagel? Analyzing the remaining mistakes on ImageNet as well as the ImageNet-Major (ImageNet-M) 68-example split under 'imagenet-m'.

Only 20,000 of the 50,000 ImageNet validation images have multi-label annotations. The set of multi-labels was first generated by a testbed of 67 trained ImageNet models, and then each individual model prediction was manually annotated by the experts as either correct (the label is correct for the image),wrong (the label is incorrect for the image), or unclear (no consensus was reached among the experts).

Additionally, during annotation, the expert panel identified a set of problematic images. An image was problematic if it met any of the below criteria:

  • The original ImageNet label (top-1 label) was incorrect or unclear
  • Image was a drawing, painting, sketch, cartoon, or computer-rendered
  • Image was excessively edited
  • Image had inappropriate content

The problematic images are included in this dataset but should be ignored when computing multi-label accuracy. Additionally, since the initial set of 20,000 annotations is class-balanced, but the set of problematic images is not, we recommend computing the per-class accuracies and then averaging them. We also recommend counting a prediction as correct if it is marked as correct or unclear (i.e., being lenient with the unclear labels).

One possible way of doing this is with the following NumPy code:

import tensorflow_datasets as tfds

ds = tfds.load('imagenet2012_multilabel', split='validation')

# We assume that predictions is a dictionary from file_name to a class index between 0 and 999

num_correct_per_class = {}
num_images_per_class = {}

for example in ds:
    # We ignore all problematic images
    if example[‘is_problematic’].numpy():

    # The label of the image in ImageNet
    cur_class = example['original_label'].numpy()

    # If we haven't processed this class yet, set the counters to 0
    if cur_class not in num_correct_per_class:
        num_correct_per_class[cur_class] = 0
        assert cur_class not in num_images_per_class
        num_images_per_class[cur_class] = 0

    num_images_per_class[cur_class] += 1

    # Get the predictions for this image
    cur_pred = predictions[example['file_name'].numpy()]

    # We count a prediction as correct if it is marked as correct or unclear
    # (i.e., we are lenient with the unclear labels)
    if cur_pred is in example['correct_multi_labels'].numpy() or cur_pred is in example['unclear_multi_labels'].numpy():
        num_correct_per_class[cur_class] += 1

# Check that we have collected accuracy data for each of the 1,000 classes
num_classes = 1000
assert len(num_correct_per_class) == num_classes
assert len(num_images_per_class) == num_classes

# Compute the per-class accuracies and then average them
final_avg = 0
for cid in range(num_classes):
  assert cid in num_correct_per_class
  assert cid in num_images_per_class
  final_avg += num_correct_per_class[cid] / num_images_per_class[cid]
final_avg /= num_classes

Split Examples
'imagenet_m' 68
'validation' 20,000
  • Feature structure:
    'correct_multi_labels': Sequence(ClassLabel(shape=(), dtype=int64, num_classes=1000)),
    'file_name': Text(shape=(), dtype=string),
    'image': Image(shape=(None, None, 3), dtype=uint8),
    'is_problematic': bool,
    'original_label': ClassLabel(shape=(), dtype=int64, num_classes=1000),
    'unclear_multi_labels': Sequence(ClassLabel(shape=(), dtype=int64, num_classes=1000)),
    'wrong_multi_labels': Sequence(ClassLabel(shape=(), dtype=int64, num_classes=1000)),
  • Feature documentation:
Feature Class Shape Dtype Description
correct_multi_labels Sequence(ClassLabel) (None,) int64
file_name Text string
image Image (None, None, 3) uint8
is_problematic Tensor bool
original_label ClassLabel int64
unclear_multi_labels Sequence(ClassLabel) (None,) int64
wrong_multi_labels Sequence(ClassLabel) (None,) int64


  • Citation:
  title={Evaluating Machine Accuracy on ImageNet},
  author={Vaishaal Shankar* and Rebecca Roelofs* and Horia Mania and Alex Fang and Benjamin Recht and Ludwig Schmidt},
  note={\url{} }
  title={ {ImageNet} large scale visual recognition challenge},
  author={Olga Russakovsky and Jia Deng and Hao Su and Jonathan Krause
   and Sanjeev Satheesh and Sean Ma and Zhiheng Huang and Andrej Karpathy and Aditya Khosla and Michael Bernstein and
   Alexander C. Berg and Fei-Fei Li},
  journal={International Journal of Computer Vision},
  note={\url{} }
   author={Jia Deng and Wei Dong and Richard Socher and Li-Jia Li and Kai Li and Li Fei-Fei},
   booktitle={Conference on Computer Vision and Pattern Recognition (CVPR)},
   title={ {ImageNet}: A large-scale hierarchical image database},
   note={\url{} }
  title={When does dough become a bagel? Analyzing the remaining mistakes on ImageNet},
  author={Vasudevan, Vijay and Caine, Benjamin and Gontijo-Lopes, Raphael and Fridovich-Keil, Sara and Roelofs, Rebecca},
  journal={arXiv preprint arXiv:2205.04596},