wit

  • Description:

The Wikipedia-based Image Text (WIT) Dataset is a large multimodal, multilingual dataset. WIT is composed of a curated set of 37.6 million entity-rich image-text examples with 11.5 million unique images across 108 Wikipedia languages. Its size enables WIT to be used as a pretraining dataset for multimodal machine learning models.

  • Splits:
Split    Examples
'test'   210,166
'train'  37,046,386
'val'    261,024
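As a quick sanity check on the split counts listed above, the following minimal sketch sums the three splits and reports each split's share of the total. The names `SPLIT_SIZES` and `split_fractions` are illustrative only and are not part of any library.

```python
# Split sizes copied from the split counts listed above.
SPLIT_SIZES = {"test": 210_166, "train": 37_046_386, "val": 261_024}


def split_fractions(sizes):
    """Return each split's share of the total number of examples."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


total_examples = sum(SPLIT_SIZES.values())
fractions = split_fractions(SPLIT_SIZES)

print(f"total examples: {total_examples:,}")
for name, frac in sorted(fractions.items()):
    print(f"{name}: {frac:.2%}")
```

The train split accounts for the large majority of examples, which is consistent with the dataset's intended use for pretraining.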
  • Feature structure:
FeaturesDict({
    'attribution_passes_lang_id': bool,
    'caption_alt_text_description': Text(shape=(), dtype=object),
    'caption_attribution_description': Text(shape=(), dtype=object),
    'caption_reference_description': Text(shape=(), dtype=object),
    'context_page_description': Text(shape=(), dtype=object),
    'context_section_description': Text(shape=(), dtype=object),
    'hierarchical_section_title': Text(shape=(), dtype=object),
    'image_url': Text(shape=(), dtype=object),
    'is_main_image': bool,
    'language': Text(shape=(), dtype=object),
    'mime_type': Text(shape=(), dtype=object),
    'original_height': int32,
    'original_width': int32,
    'page_changed_recently': bool,
    'page_title': Text(shape=(), dtype=object),
    'page_url': Text(shape=(), dtype=object),
    'section_title': Text(shape=(), dtype=object),
})
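The feature structure above can be consumed roughly as follows. This is a minimal sketch, assuming `tensorflow_datasets` is installed and the dataset can be prepared locally. It also assumes that `language` holds a Wikipedia language code such as `en`, and that text features arrive as `bytes` after `tfds.as_numpy`. The helper names (`load_wit`, `is_main_image_in`, `aspect_ratio`) and the sample record are hypothetical.

```python
def load_wit(split="val"):
    """Load one WIT split as an iterable of NumPy-decoded example dicts.

    Assumption: tensorflow_datasets is installed and the dataset can be
    prepared locally. The import is lazy so the helpers below work without it.
    """
    import tensorflow_datasets as tfds

    return tfds.as_numpy(tfds.load("wit", split=split))


def is_main_image_in(example, language="en"):
    """True if the example is a page's main image in the given language."""
    lang = example["language"]
    if isinstance(lang, bytes):  # Text features typically decode to bytes
        lang = lang.decode("utf-8")
    return bool(example["is_main_image"]) and lang == language


def aspect_ratio(example):
    """Width divided by height, from the original image dimensions."""
    return int(example["original_width"]) / int(example["original_height"])


# Hypothetical record with the same keys as the feature structure
# (only the fields used here are filled in).
sample = {
    "language": b"en",
    "is_main_image": True,
    "original_width": 800,
    "original_height": 600,
}
print(is_main_image_in(sample), round(aspect_ratio(sample), 3))
```

With the dataset available, the same helpers can be applied while iterating over `load_wit("val")`, for example to keep only main images for one language. Note that the validation split is named `val`, not `validation`.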
  • Feature documentation:
Feature                          Class         Shape  Dtype   Description
                                 FeaturesDict
attribution_passes_lang_id       Tensor               bool
caption_alt_text_description     Text                 object
caption_attribution_description  Text                 object
caption_reference_description    Text                 object
context_page_description         Text                 object
context_section_description      Text                 object
hierarchical_section_title       Text                 object
image_url                        Text                 object
is_main_image                    Tensor               bool
language                         Text                 object
mime_type                        Text                 object
original_height                  Tensor               int32
original_width                   Tensor               int32
page_changed_recently            Tensor               bool
page_title                       Text                 object
page_url                         Text                 object
section_title                    Text                 object
  • Citation:
@article{srinivasan2021wit,
  title={WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning},
  author={Srinivasan, Krishna and Raman, Karthik and Chen, Jiecao and Bendersky, Michael and Najork, Marc},
  journal={arXiv preprint arXiv:2103.01913},
  year={2021}
}