
wit

Description:

The Wikipedia-based Image Text (WIT) dataset is a large multimodal, multilingual dataset. WIT is composed of a curated set of 37.6 million entity-rich image-text examples with 11.5 million unique images across 108 Wikipedia languages. Its size enables WIT to be used as a pretraining dataset for multimodal machine learning models.

Additional Documentation: Explore on Papers With Code
Homepage: https://github.com/google-research-datasets/wit/
Source code: tfds.vision_language.wit.Wit
Versions:
- 1.0.0: Initial release. It loads the WIT dataset from https://storage.googleapis.com/gresearch/wit/
- 1.1.0 (default): Added val and test splits.
Download size: 25.20 GiB
Dataset size: 81.17 GiB
Auto-cached (documentation): No
Splits:

Split	Examples
`'test'`	210,166
`'train'`	37,046,386
`'val'`	261,024
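
Given the size of the 'train' split, it is often convenient to read only part of a split. The sketch below assumes the standard `tfds.load` entry point and the TFDS split-slicing syntax (for example `train[:1%]`). `percent_slice` and `load_wit` are hypothetical helper names used for illustration and are not part of TFDS. Slicing limits what is read, not what is downloaded and prepared.

```python
from typing import Optional

# Split names as listed in the Splits table ('val' and 'test' need version 1.1.0).
WIT_SPLITS = ("train", "val", "test")


def percent_slice(split: str, percent: int) -> str:
    """Build a TFDS split string such as 'train[:1%]' (hypothetical helper)."""
    if split not in WIT_SPLITS:
        raise ValueError(f"unknown split {split!r}; expected one of {WIT_SPLITS}")
    if not 1 <= percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    return f"{split}[:{percent}%]"


def load_wit(split_spec: str, data_dir: Optional[str] = None):
    """Load a WIT split as a tf.data.Dataset (hypothetical wrapper).

    tensorflow_datasets is imported lazily so that the helpers above can be
    used without it. Calling this triggers the full download and preparation
    on first use.
    """
    import tensorflow_datasets as tfds

    return tfds.load("wit:1.1.0", split=split_spec, data_dir=data_dir)


if __name__ == "__main__":
    # Prints the split string only; no data is downloaded here.
    print(percent_slice("train", 1))
```

A call such as `load_wit(percent_slice("val", 10))` would then read roughly the first tenth of the validation split once the dataset has been prepared.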

Feature structure:

FeaturesDict({
    'attribution_passes_lang_id': bool,
    'caption_alt_text_description': Text(shape=(), dtype=string),
    'caption_attribution_description': Text(shape=(), dtype=string),
    'caption_reference_description': Text(shape=(), dtype=string),
    'context_page_description': Text(shape=(), dtype=string),
    'context_section_description': Text(shape=(), dtype=string),
    'hierarchical_section_title': Text(shape=(), dtype=string),
    'image_url': Text(shape=(), dtype=string),
    'is_main_image': bool,
    'language': Text(shape=(), dtype=string),
    'mime_type': Text(shape=(), dtype=string),
    'original_height': int32,
    'original_width': int32,
    'page_changed_recently': bool,
    'page_title': Text(shape=(), dtype=string),
    'page_url': Text(shape=(), dtype=string),
    'section_title': Text(shape=(), dtype=string),
})
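
When examples are iterated as NumPy values (for instance through `tfds.as_numpy`), `Text` features typically arrive as UTF-8 encoded bytes and the scalar features as NumPy scalars. The following is a minimal sketch of converting one such example into plain Python types. The feature names are taken from the structure above; `decode_example` is a hypothetical helper, and the assumption that text values are UTF-8 bytes should be checked against real data.

```python
from typing import Any, Dict

# Feature names grouped by dtype, copied from the FeaturesDict above.
TEXT_FEATURES = (
    "caption_alt_text_description",
    "caption_attribution_description",
    "caption_reference_description",
    "context_page_description",
    "context_section_description",
    "hierarchical_section_title",
    "image_url",
    "language",
    "mime_type",
    "page_title",
    "page_url",
    "section_title",
)
BOOL_FEATURES = (
    "attribution_passes_lang_id",
    "is_main_image",
    "page_changed_recently",
)
INT_FEATURES = ("original_height", "original_width")


def decode_example(example: Dict[str, Any]) -> Dict[str, Any]:
    """Convert one WIT example to str/bool/int values (hypothetical helper)."""
    decoded: Dict[str, Any] = {}
    for key in TEXT_FEATURES:
        value = example[key]
        # Assumption: text arrives as UTF-8 bytes; fall back to str() otherwise.
        decoded[key] = value.decode("utf-8") if isinstance(value, bytes) else str(value)
    for key in BOOL_FEATURES:
        decoded[key] = bool(example[key])
    for key in INT_FEATURES:
        decoded[key] = int(example[key])
    return decoded
```

Applied inside a loop over `tfds.as_numpy(ds)`, each decoded row can be filtered with ordinary Python, for example on `row["language"]` or `row["is_main_image"]`.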

Feature documentation:

Feature	Class	Dtype
	FeaturesDict
attribution_passes_lang_id	Tensor	bool
caption_alt_text_description	Text	string
caption_attribution_description	Text	string
caption_reference_description	Text	string
context_page_description	Text	string
context_section_description	Text	string
hierarchical_section_title	Text	string
image_url	Text	string
is_main_image	Tensor	bool
language	Text	string
mime_type	Text	string
original_height	Tensor	int32
original_width	Tensor	int32
page_changed_recently	Tensor	bool
page_title	Text	string
page_url	Text	string
section_title	Text	string

Supervised keys (See as_supervised doc): None
Figure (tfds.show_examples): Not supported.
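
Because no supervised keys are defined, `as_supervised=True` is not expected to work for this dataset, and (input, target) pairs have to be built by hand. The sketch below pairs `image_url` with the first non-empty caption field. The priority order and the assumption that a missing caption is stored as an empty string are illustrative choices, not something the dataset prescribes; `url_caption_pair` is a hypothetical helper.

```python
from typing import Any, Mapping, Optional, Tuple

# Assumed priority: reference caption, then attribution, then alt text.
CAPTION_PRIORITY = (
    "caption_reference_description",
    "caption_attribution_description",
    "caption_alt_text_description",
)


def _as_text(value: Any) -> str:
    """Decode UTF-8 bytes to str; pass other values through str()."""
    return value.decode("utf-8") if isinstance(value, bytes) else str(value)


def url_caption_pair(example: Mapping[str, Any]) -> Optional[Tuple[str, str]]:
    """Return (image_url, caption), or None if every caption field is empty."""
    url = _as_text(example["image_url"])
    for key in CAPTION_PRIORITY:
        caption = _as_text(example[key]).strip()
        if caption:  # assumption: a missing caption is an empty string
            return url, caption
    return None
```

Examples for which the function returns `None` carry no caption text and could be skipped or paired with the page or section context instead, depending on the task.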

Citation:

@article{srinivasan2021wit,
  title={WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning},
  author={Srinivasan, Krishna and Raman, Karthik and Chen, Jiecao and Bendersky, Michael and Najork, Marc},
  journal={arXiv preprint arXiv:2103.01913},
  year={2021}
}