- Description:
The Wikipedia-based Image Text (WIT) Dataset is a large multimodal multilingual dataset. WIT is composed of a curated set of 37.6 million entity-rich image-text examples with 11.5 million unique images across 108 Wikipedia languages. Its size enables WIT to be used as a pretraining dataset for multimodal machine learning models.
Additional Documentation: Explore on Papers With Code
Source code:
tfds.vision_language.wit.Wit
Versions:
1.0.0: Initial release. It loads the WIT dataset from https://storage.googleapis.com/gresearch/wit/
1.1.0 (default): Added 'val' and 'test' splits.
Download size:
25.20 GiB
Dataset size:
81.17 GiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'test' | 210,166 |
'train' | 37,046,386 |
'val' | 261,024 |
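As a quick sanity check on the split table above, the counts can be summed and turned into per-split fractions. This is a plain-Python sketch: the names `SPLIT_EXAMPLES` and `split_fractions` are made up for illustration, and the numbers are copied from the table.

```python
# Example counts copied from the split table (version 1.1.0).
SPLIT_EXAMPLES = {
    "test": 210_166,
    "train": 37_046_386,
    "val": 261_024,
}


def split_fractions(counts):
    """Return each split's share of the total number of examples."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


total_examples = sum(SPLIT_EXAMPLES.values())
fractions = split_fractions(SPLIT_EXAMPLES)

# Print one line per split, then the overall total.
for name, share in sorted(fractions.items()):
    print(f"{name}: {share:.2%}")
print(f"total: {total_examples:,}")
```

The three splits add up to 37,517,576 examples, with the 'train' split holding the large majority.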
- Feature structure:
FeaturesDict({
'attribution_passes_lang_id': bool,
'caption_alt_text_description': Text(shape=(), dtype=string),
'caption_attribution_description': Text(shape=(), dtype=string),
'caption_reference_description': Text(shape=(), dtype=string),
'context_page_description': Text(shape=(), dtype=string),
'context_section_description': Text(shape=(), dtype=string),
'hierarchical_section_title': Text(shape=(), dtype=string),
'image_url': Text(shape=(), dtype=string),
'is_main_image': bool,
'language': Text(shape=(), dtype=string),
'mime_type': Text(shape=(), dtype=string),
'original_height': int32,
'original_width': int32,
'page_changed_recently': bool,
'page_title': Text(shape=(), dtype=string),
'page_url': Text(shape=(), dtype=string),
'section_title': Text(shape=(), dtype=string),
})
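The feature structure above can be read back through the standard TFDS loading path. The sketch below assumes `tensorflow_datasets` is installed and that the builder is registered under the name `"wit"` (inferred from the source path `tfds.vision_language.wit.Wit`). The helper names `iter_wit` and `is_main_image_in` are hypothetical, and the language code `b"en"` is an assumed value used only for illustration.

```python
def iter_wit(split="val"):
    """Yield WIT examples as dicts of NumPy values.

    Assumes tensorflow_datasets is installed and the dataset name is "wit".
    The first call downloads and prepares the dataset (about 25 GiB to
    download, per the sizes listed on this page).
    """
    import tensorflow_datasets as tfds  # imported lazily: heavy dependency

    ds = tfds.load("wit", split=split)
    yield from tfds.as_numpy(ds)


def is_main_image_in(example, language=b"en"):
    """Return True if the example is its page's main image in `language`.

    String features arrive as bytes when iterating via tfds.as_numpy,
    so the language code is compared as bytes.
    """
    return bool(example["is_main_image"]) and example["language"] == language


# Usage (commented out because it triggers the full download):
# for ex in iter_wit("val"):
#     if is_main_image_in(ex):
#         print(ex["page_title"].decode("utf-8"))
```

The filter only touches the `is_main_image` and `language` features, so it can be tried on hand-built dictionaries before committing to the download.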
- Feature documentation:
Feature | Class | Shape | Dtype | Description |
---|---|---|---|---|
| | FeaturesDict | | | |
attribution_passes_lang_id | Tensor | | bool | |
caption_alt_text_description | Text | | string | |
caption_attribution_description | Text | | string | |
caption_reference_description | Text | | string | |
context_page_description | Text | | string | |
context_section_description | Text | | string | |
hierarchical_section_title | Text | | string | |
image_url | Text | | string | |
is_main_image | Tensor | | bool | |
language | Text | | string | |
mime_type | Text | | string | |
original_height | Tensor | | int32 | |
original_width | Tensor | | int32 | |
page_changed_recently | Tensor | | bool | |
page_title | Text | | string | |
page_url | Text | | string | |
section_title | Text | | string | |
Supervised keys (See as_supervised doc): None
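Because no supervised keys are defined, the dataset does not come with a ready-made (input, target) pair, and any pairing has to be built from the feature dictionary by hand. The following is one possible sketch, not a convention set by the dataset: the helper `to_url_caption_pair` and the priority order in `CAPTION_FIELDS` are arbitrary choices made for illustration.

```python
# Caption features from the feature structure, in an assumed priority order.
CAPTION_FIELDS = (
    "caption_reference_description",
    "caption_attribution_description",
    "caption_alt_text_description",
)


def _to_str(value):
    """Decode bytes (as produced by tfds.as_numpy) to str; pass str through."""
    return value.decode("utf-8") if isinstance(value, bytes) else value


def to_url_caption_pair(example):
    """Return (image_url, caption) using the first non-empty caption field.

    Returns None when every caption field is missing or empty.
    """
    for field in CAPTION_FIELDS:
        caption = _to_str(example.get(field, "")).strip()
        if caption:
            return _to_str(example["image_url"]), caption
    return None
```

Examples for which the function returns None carry no caption text in any of the three fields and can be skipped when building image-text pairs.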
Figure (tfds.show_examples): Not supported.
- Citation:
@article{srinivasan2021wit,
title={WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning},
author={Srinivasan, Krishna and Raman, Karthik and Chen, Jiecao and Bendersky, Michael and Najork, Marc},
journal={arXiv preprint arXiv:2103.01913},
year={2021}
}