- Description:
The Wikipedia-based Image Text (WIT) Dataset is a large multimodal, multilingual dataset. WIT is composed of a curated set of 37.6 million entity-rich image-text examples with 11.5 million unique images across 108 Wikipedia languages. Its size enables WIT to be used as a pretraining dataset for multimodal machine learning models.
Additional Documentation: Explore on Papers With Code
Source code: tfds.vision_language.wit.Wit
Versions:
1.0.0: Initial release. It loads the WIT dataset from https://storage.googleapis.com/gresearch/wit/
1.1.0 (default): Added val and test splits.
Download size: 25.20 GiB
Dataset size: 81.17 GiB
Auto-cached (documentation): No
Splits:
| Split | Examples |
|---|---|
| 'test' | 210,166 |
| 'train' | 37,046,386 |
| 'val' | 261,024 |
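As a rough illustration of how these splits might be used, the sketch below computes each split's share of the data and wraps a call to tfds.load. The dataset name "wit" is inferred from the builder class and the helper names (split_fractions, load_wit_split) are hypothetical. The import is done lazily so the arithmetic can be run without TensorFlow Datasets installed.

```python
# Example counts copied from the Splits table.
SPLIT_EXAMPLES = {"train": 37_046_386, "val": 261_024, "test": 210_166}


def split_fractions(counts):
    """Return each split's share of the total number of examples."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


def load_wit_split(split="val", data_dir=None):
    """Load one WIT split (hypothetical helper).

    Assumes the dataset is registered under the name "wit". The first call
    downloads and prepares the full dataset (about 25 GiB to download and
    81 GiB on disk), so data_dir should point at a large enough volume.
    """
    import tensorflow_datasets as tfds  # lazy import: heavy dependency

    return tfds.load("wit", split=split, data_dir=data_dir)


if __name__ == "__main__":
    for name, fraction in split_fractions(SPLIT_EXAMPLES).items():
        print(f"{name}: {fraction:.2%}")
```

Passing a slice such as split="train[:1%]" to the same helper should give a small subset for quick experiments, although the full dataset still has to be prepared first.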
- Feature structure:
FeaturesDict({
'attribution_passes_lang_id': bool,
'caption_alt_text_description': Text(shape=(), dtype=string),
'caption_attribution_description': Text(shape=(), dtype=string),
'caption_reference_description': Text(shape=(), dtype=string),
'context_page_description': Text(shape=(), dtype=string),
'context_section_description': Text(shape=(), dtype=string),
'hierarchical_section_title': Text(shape=(), dtype=string),
'image_url': Text(shape=(), dtype=string),
'is_main_image': bool,
'language': Text(shape=(), dtype=string),
'mime_type': Text(shape=(), dtype=string),
'original_height': int32,
'original_width': int32,
'page_changed_recently': bool,
'page_title': Text(shape=(), dtype=string),
'page_url': Text(shape=(), dtype=string),
'section_title': Text(shape=(), dtype=string),
})
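The feature structure above maps naturally onto a plain dictionary per example. The following sketch shows one way to filter and inspect such dictionaries, for instance those yielded when iterating with tfds.as_numpy, where Text features typically arrive as bytes. The helper names are hypothetical, and the language code "en" is an assumed value for the language feature, not something this page specifies.

```python
def decode_text(value):
    """Decode a Text feature, which may be bytes when read as NumPy."""
    return value.decode("utf-8") if isinstance(value, bytes) else value


def matches(example, language="en", main_image_only=True):
    """Keep examples in one language, optionally only a page's main image.

    The default language code "en" is an assumption for illustration.
    """
    if decode_text(example["language"]) != language:
        return False
    return bool(example["is_main_image"]) or not main_image_only


def aspect_ratio(example):
    """Width divided by height, or None if the height is not positive."""
    height = int(example["original_height"])
    width = int(example["original_width"])
    return width / height if height > 0 else None


if __name__ == "__main__":
    # Hand-written stand-in for one example; the values are made up.
    sample = {
        "language": b"en",
        "is_main_image": True,
        "original_width": 640,
        "original_height": 480,
    }
    print(matches(sample), aspect_ratio(sample))
```

Keeping the predicates free of TensorFlow calls makes them easy to test on hand-written dictionaries before running them over the full dataset.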
- Feature documentation:
| Feature | Class | Shape | Dtype | Description |
|---|---|---|---|---|
| | FeaturesDict | | | |
| attribution_passes_lang_id | Tensor | | bool | |
| caption_alt_text_description | Text | | string | |
| caption_attribution_description | Text | | string | |
| caption_reference_description | Text | | string | |
| context_page_description | Text | | string | |
| context_section_description | Text | | string | |
| hierarchical_section_title | Text | | string | |
| image_url | Text | | string | |
| is_main_image | Tensor | | bool | |
| language | Text | | string | |
| mime_type | Text | | string | |
| original_height | Tensor | | int32 | |
| original_width | Tensor | | int32 | |
| page_changed_recently | Tensor | | bool | |
| page_title | Text | | string | |
| page_url | Text | | string | |
| section_title | Text | | string | |
Supervised keys (See as_supervised doc): None
Figure (tfds.show_examples): Not supported.
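Because no supervised keys are defined, loading with as_supervised=True is not expected to work, and any (input, target) pairing has to be built by hand. The sketch below shows one possible pairing of image_url with a caption. The priority order among the three caption fields and the treatment of empty strings as missing captions are both assumptions made for illustration, and the helper name is hypothetical.

```python
# Caption fields in an assumed order of preference (not prescribed by WIT).
CAPTION_FIELDS = (
    "caption_reference_description",
    "caption_attribution_description",
    "caption_alt_text_description",
)


def _as_str(value):
    """Decode bytes (as produced by NumPy iteration) to str."""
    return value.decode("utf-8") if isinstance(value, bytes) else str(value)


def to_image_caption_pair(example):
    """Return (image_url, caption), or None if every caption field is empty."""
    for field in CAPTION_FIELDS:
        caption = _as_str(example[field]).strip()
        if caption:
            return _as_str(example["image_url"]), caption
    return None


if __name__ == "__main__":
    # Hand-written stand-in for one example; the values are made up.
    sample = {
        "image_url": b"https://example.org/cat.jpg",
        "caption_reference_description": b"",
        "caption_attribution_description": b"A cat on a wall",
        "caption_alt_text_description": b"cat",
    }
    print(to_image_caption_pair(sample))
```

Note that image_url is a text feature, so the image bytes themselves would still need to be fetched separately.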
- Citation:
@article{srinivasan2021wit,
title={WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning},
author={Srinivasan, Krishna and Raman, Karthik and Chen, Jiecao and Bendersky, Michael and Najork, Marc},
journal={arXiv preprint arXiv:2103.01913},
year={2021}
}