- Description:
A collection of three referring expression datasets based on images in the COCO dataset. A referring expression is a piece of text that describes a unique object in an image. These datasets were collected by asking human raters to disambiguate objects delineated by bounding boxes in the COCO dataset.
RefCoco and RefCoco+ are from Kazemzadeh et al. 2014. RefCoco+ expressions are strictly appearance-based descriptions, enforced by preventing raters from using location-based descriptions (e.g., "person to the right" is not a valid description for RefCoco+). RefCocoG is from Mao et al. 2016 and has richer descriptions of objects than RefCoco due to differences in the annotation process. In particular, RefCoco was collected in an interactive game-based setting, while RefCocoG was collected in a non-interactive setting. On average, RefCocoG expressions are 8.4 words long, while RefCoco expressions are 3.5 words long.
Each dataset has different split allocations, all of which are typically reported in papers. The "testA" and "testB" sets in RefCoco and RefCoco+ contain only people and only non-people, respectively. The partitions differ in how examples are divided: in the "google" partition, objects, not images, are split between the train and non-train sets, so the same image can appear in both the train and validation splits, but the objects being referred to will differ between the two. In contrast, the "unc" and "umd" partitions split images between the train, validation, and test sets. In RefCocoG, the "google" partition does not have a canonical test set, and its validation set is typically reported in papers as "val*".
Stats for each dataset and split ("refs" is the number of referring expressions, and "images" is the number of images):
dataset | partition | split | refs | images |
---|---|---|---|---|
refcoco | google | train | 40000 | 19213 |
refcoco | google | val | 5000 | 4559 |
refcoco | google | test | 5000 | 4527 |
refcoco | unc | train | 42404 | 16994 |
refcoco | unc | val | 3811 | 1500 |
refcoco | unc | testA | 1975 | 750 |
refcoco | unc | testB | 1810 | 750 |
refcoco+ | unc | train | 42278 | 16992 |
refcoco+ | unc | val | 3805 | 1500 |
refcoco+ | unc | testA | 1975 | 750 |
refcoco+ | unc | testB | 1798 | 750 |
refcocog | google | train | 44822 | 24698 |
refcocog | google | val | 5000 | 4650 |
refcocog | umd | train | 42226 | 21899 |
refcocog | umd | val | 2573 | 1300 |
refcocog | umd | test | 5023 | 2600 |
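Each dataset/partition pair in the table above corresponds to a TFDS config (e.g., the "unc" partition of RefCoco is the `ref_coco/refcoco_unc` config listed further down). A minimal loading sketch, assuming the dataset has already been prepared via the manual download steps below:

```python
import tensorflow_datasets as tfds

# Pick a config by dataset and partition, e.g. RefCoco with the "unc" partition.
ds, info = tfds.load('ref_coco/refcoco_unc', split='validation', with_info=True)

# Split names follow the table: 'train', 'validation', 'testA', 'testB'.
print(info.splits)
```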
Additional Documentation: Explore on Papers With Code
Homepage: https://github.com/lichengunc/refer
Source code:
tfds.datasets.ref_coco.Builder
Versions:
- `1.0.0`: Initial release.
- `1.1.0` (default): Added masks.
Download size:
Unknown size
Manual download instructions: This dataset requires you to download the source data manually into `download_config.manual_dir` (defaults to `~/tensorflow_datasets/downloads/manual/`):
1. Follow the instructions in https://github.com/lichengunc/refer and download the annotations and the images, matching the data/ directory specified in the repo.
2. Follow the instructions of the PythonAPI in https://github.com/cocodataset/cocoapi to get pycocotools and the instances_train2014 annotations file from https://cocodataset.org/#download.
3. Add both refer.py from (1) and pycocotools from (2) to your PYTHONPATH.
4. Run manual_download_process.py to generate refcoco.json, replacing `ref_data_root`, `coco_annotations_file`, and `out_file` with the values corresponding to where you have downloaded / want to save these files. Note that manual_download_process.py can be found in the TFDS repository.
5. Download the COCO training set from https://cocodataset.org/#download and place it in a folder called `coco_train2014/`. Move `refcoco.json` to the same level as `coco_train2014`.
6. Follow the standard manual download instructions.
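Once the manual files are in place, a minimal preparation sketch could look like the following; the `manual_dir` value is only the default location and should be replaced with wherever you actually placed `refcoco.json` and `coco_train2014/`:

```python
import tensorflow_datasets as tfds

# Assumption: the manual files live in the default manual directory.
manual_dir = '~/tensorflow_datasets/downloads/manual/'

builder = tfds.builder('ref_coco/refcoco_unc')
builder.download_and_prepare(
    download_config=tfds.download.DownloadConfig(manual_dir=manual_dir)
)
ds = builder.as_dataset(split='train')
```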
Auto-cached (documentation): No
Feature structure:
FeaturesDict({
'coco_annotations': Sequence({
'area': int64,
'bbox': BBoxFeature(shape=(4,), dtype=float32),
'id': int64,
'label': int64,
}),
'image': Image(shape=(None, None, 3), dtype=uint8),
'image/id': int64,
'objects': Sequence({
'area': int64,
'bbox': BBoxFeature(shape=(4,), dtype=float32),
'gt_box_index': int64,
'id': int64,
'label': int64,
'mask': Image(shape=(None, None, 3), dtype=uint8),
'refexp': Sequence({
'raw': Text(shape=(), dtype=string),
'refexp_id': int64,
}),
}),
})
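As a rough sketch of reading back the nested structure above (feature names are taken directly from the structure; exact tensor types, e.g. ragged vs. dense, may vary by TFDS version):

```python
import tensorflow_datasets as tfds

ds = tfds.load('ref_coco/refcoco_unc', split='validation')

for example in ds.take(1):
  image = example['image']        # uint8 tensor of shape (H, W, 3)
  objects = example['objects']    # dict of per-object tensors
  boxes = objects['bbox']         # float32, shape (num_objects, 4)
  # Each object carries a variable number of referring expressions,
  # so this is typically a ragged/nested tensor of strings.
  expressions = objects['refexp']['raw']
  print(image.shape, boxes.shape)
```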
- Feature documentation:
Feature | Class | Shape | Dtype | Description |
---|---|---|---|---|
 | FeaturesDict | | | |
coco_annotations | Sequence | | | |
coco_annotations/area | Tensor | | int64 | |
coco_annotations/bbox | BBoxFeature | (4,) | float32 | |
coco_annotations/id | Tensor | | int64 | |
coco_annotations/label | Tensor | | int64 | |
image | Image | (None, None, 3) | uint8 | |
image/id | Tensor | | int64 | |
objects | Sequence | | | |
objects/area | Tensor | | int64 | |
objects/bbox | BBoxFeature | (4,) | float32 | |
objects/gt_box_index | Tensor | | int64 | |
objects/id | Tensor | | int64 | |
objects/label | Tensor | | int64 | |
objects/mask | Image | (None, None, 3) | uint8 | |
objects/refexp | Sequence | | | |
objects/refexp/raw | Text | | string | |
objects/refexp/refexp_id | Tensor | | int64 | |
Supervised keys (See `as_supervised` doc): `None`
Citation:
@inproceedings{kazemzadeh2014referitgame,
title={Referitgame: Referring to objects in photographs of natural scenes},
author={Kazemzadeh, Sahar and Ordonez, Vicente and Matten, Mark and Berg, Tamara},
booktitle={Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP)},
pages={787--798},
year={2014}
}
@inproceedings{yu2016modeling,
title={Modeling context in referring expressions},
author={Yu, Licheng and Poirson, Patrick and Yang, Shan and Berg, Alexander C and Berg, Tamara L},
booktitle={European Conference on Computer Vision},
pages={69--85},
year={2016},
organization={Springer}
}
@inproceedings{mao2016generation,
title={Generation and Comprehension of Unambiguous Object Descriptions},
author={Mao, Junhua and Huang, Jonathan and Toshev, Alexander and Camburu, Oana and Yuille, Alan and Murphy, Kevin},
booktitle={CVPR},
year={2016}
}
@inproceedings{nagaraja2016modeling,
title={Modeling context between objects for referring expression understanding},
author={Nagaraja, Varun K and Morariu, Vlad I and Davis, Larry S},
booktitle={European Conference on Computer Vision},
pages={792--807},
year={2016},
organization={Springer}
}
ref_coco/refcoco_unc (default config)
Dataset size:
3.29 GiB
Splits:
Split | Examples |
---|---|
'testA' | 750 |
'testB' | 750 |
'train' | 16,994 |
'validation' | 1,500 |
- Figure (tfds.show_examples):
- Examples (tfds.as_dataframe):
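The figure and example previews referenced above can be reproduced locally with something like the following sketch, assuming `ref_coco/refcoco_unc` has already been prepared:

```python
import tensorflow_datasets as tfds

ds, info = tfds.load('ref_coco/refcoco_unc', split='train', with_info=True)

fig = tfds.show_examples(ds, info)        # matplotlib figure of a few samples
df = tfds.as_dataframe(ds.take(4), info)  # pandas DataFrame preview
```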
ref_coco/refcoco_google
Dataset size:
4.65 GiB
Splits:
Split | Examples |
---|---|
'test' | 4,527 |
'train' | 19,213 |
'validation' | 4,559 |
- Figure (tfds.show_examples):
- Examples (tfds.as_dataframe):
ref_coco/refcocoplus_unc
Dataset size:
3.29 GiB
Splits:
Split | Examples |
---|---|
'testA' | 750 |
'testB' | 750 |
'train' | 16,992 |
'validation' | 1,500 |
- Figure (tfds.show_examples):
- Examples (tfds.as_dataframe):
ref_coco/refcocog_google
Dataset size:
4.64 GiB
Splits:
Split | Examples |
---|---|
'train' | 24,698 |
'validation' | 4,650 |
- Figure (tfds.show_examples):
- Examples (tfds.as_dataframe):
ref_coco/refcocog_umd
Dataset size:
4.08 GiB
Splits:
Split | Examples |
---|---|
'test' | 2,600 |
'train' | 21,899 |
'validation' | 1,300 |
- Figure (tfds.show_examples):
- Examples (tfds.as_dataframe):