- Description:
Wikipedia - Image/Caption Matching Kaggle Competition.
This competition is organized by the Research team at the Wikimedia Foundation in collaboration with Google Research and a few external collaborators. This competition is based on the WIT dataset published by Google Research as detailed in this SIGIR paper.
In this competition, you’ll build a model that automatically retrieves the text closest to an image. Specifically, you'll train your model to associate given images with article titles or complex captions, in multiple languages. The best models will account for the semantic granularity of Wikipedia images. If successful, you'll be contributing to the accessibility of the largest online encyclopedia. The millions of Wikipedia readers and editors will be able to more easily understand, search, and describe media at scale. As a result, you’ll contribute to an open model to improve learning for all.
- Homepage: https://www.kaggle.com/c/wikipedia-image-caption/code 
- Source code: tfds.vision_language.wit_kaggle.WitKaggle
- Versions:
  - 1.0.0: Initial release. It provides the train and test datasets from the Wikipedia - Image/Caption Matching Kaggle competition (https://www.kaggle.com/c/wikipedia-image-caption/data).
    - The goal of the competition is to build a model that automatically retrieves the text closest to an image. Specifically, the model should be trained to associate given images with article titles or complex captions, in multiple languages. The best models will account for the semantic granularity of Wikipedia images.
    - Note that this release doesn't provide the ground truth for the test set, as it hasn't been provided by the Kaggle competition yet.
    - Note that not all of the training observations have corresponding image data. The released images exclude all images containing humans. For samples which are not associated with image data, the following image features are used:
      - image is a base64-encoded blank image,
      - embedding is a vector of 2048 zeros.
    - The samples released for the competition can be loaded as:
      - tfds.load("wit_kaggle/train_with_extended_features")
      - tfds.load("wit_kaggle/test_without_gold")
  - 1.0.1: Optimize Beam pipeline to avoid stragglers, ignoring rows without an image URL. Also added more Beam counters.
  - 1.0.2 (default): Fixes parsing of boolean fields.
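The two tfds.load calls listed in the 1.0.0 notes can be wrapped in a small helper. What follows is a minimal sketch rather than official usage: the names `dataset_name` and `load_wit_kaggle` are hypothetical, and the sketch assumes that tensorflow_datasets is installed, that the Kaggle files are already in place, and that each config exposes a single split named after the config.

```python
from typing import Optional

# Config names as listed on this page.
CONFIGS = ("train_with_extended_features", "test_without_gold")


def dataset_name(config: str) -> str:
    """Builds the 'wit_kaggle/<config>' name passed to tfds.load."""
    if config not in CONFIGS:
        raise ValueError(f"unknown config {config!r}; expected one of {CONFIGS}")
    return f"wit_kaggle/{config}"


def load_wit_kaggle(config: str, manual_dir: Optional[str] = None):
    """Loads one config as a tf.data.Dataset (hypothetical helper)."""
    # Imported lazily so dataset_name() works without TensorFlow installed.
    import tensorflow_datasets as tfds

    kwargs = {}
    if manual_dir is not None:
        # Point TFDS at the manually downloaded Kaggle files.
        kwargs["download_and_prepare_kwargs"] = {
            "download_config": tfds.download.DownloadConfig(manual_dir=manual_dir)
        }
    # Assumption: the split name matches the config name.
    return tfds.load(dataset_name(config), split=config, **kwargs)
```

Calling `load_wit_kaggle("test_without_gold")` would prepare the smaller test config first. Since the version notes mention a Beam pipeline, preparing either config probably also requires Apache Beam to be installed.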
 
- Download size: Unknown size
- Manual download instructions: This dataset requires you to download the source data manually into download_config.manual_dir (defaults to ~/tensorflow_datasets/downloads/manual/). Depending on the config called, manual_dir should contain some of the following subdirectories:
  - train
    - train-{0000x}-of-00005.tsv.zip
    - image_data_train/
      - image_pixels/
        - train_image_pixels_part-00{000-199}.csv.gz
      - resnet_embeddings/
        - train_resnet_embeddings_part-00{000-214}.csv.gz
  - test
    - test.tsv.zip
    - image_data_test/
      - image_pixels/
        - test_image_pixels_part-0000{0-4}.csv
      - resnet_embeddings/
        - test_resnet_embeddings_part-0000{0-9}.csv

  Registration at https://www.kaggle.com/c/wikipedia-image-caption/data is needed to get the links to download the dataset.
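Before starting a long preparation run, it can help to check that manual_dir has the expected layout. The sketch below covers only the test config and uses the standard library alone. The helper names `expected_test_files` and `missing_files` are hypothetical, and the nesting (image_pixels/ and resnet_embeddings/ inside image_data_test/, itself inside test/) is an assumption read from the listing above.

```python
from pathlib import Path
from typing import List


def expected_test_files() -> List[str]:
    """Relative paths of the test-config files, under the assumed nesting."""
    files = ["test/test.tsv.zip"]
    # test_image_pixels_part-0000{0-4}.csv
    files += [
        f"test/image_data_test/image_pixels/test_image_pixels_part-{i:05d}.csv"
        for i in range(5)
    ]
    # test_resnet_embeddings_part-0000{0-9}.csv
    files += [
        f"test/image_data_test/resnet_embeddings/test_resnet_embeddings_part-{i:05d}.csv"
        for i in range(10)
    ]
    return files


def missing_files(manual_dir: str, expected: List[str]) -> List[str]:
    """Returns the expected relative paths that are not regular files."""
    root = Path(manual_dir).expanduser()
    return [rel for rel in expected if not (root / rel).is_file()]
```

For instance, `missing_files("~/tensorflow_datasets/downloads/manual/", expected_test_files())` returns an empty list once every test file is in place.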
- Auto-cached (documentation): No 
- Supervised keys (see the as_supervised doc): ('image_url', 'caption_title_and_reference_description')
- Citation: 
@article{srinivasan2021wit,
  title={WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning},
  author={Srinivasan, Krishna and Raman, Karthik and Chen, Jiecao and Bendersky, Michael and Najork, Marc},
  journal={arXiv preprint arXiv:2103.01913},
  year={2021}
}
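With as_supervised=True, tfds.load yields (input, label) tuples built from the supervised keys listed above instead of full feature dictionaries. The sketch below mirrors that mapping on a plain Python dictionary. The helper name `to_supervised` is hypothetical, and any example values passed to it are illustrative, not taken from the dataset.

```python
from typing import Mapping, Tuple

# Supervised keys as documented on this page: (input, label).
SUPERVISED_KEYS = ("image_url", "caption_title_and_reference_description")


def to_supervised(example: Mapping[str, str]) -> Tuple[str, str]:
    """Projects a feature dictionary onto the (input, label) pair."""
    input_key, label_key = SUPERVISED_KEYS
    return example[input_key], example[label_key]
```

Note that the input here is the image URL, not the decoded image, so a model that consumes pixels or embeddings would need the full feature dictionary.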
wit_kaggle/train_with_extended_features (default config)
- Config description: Training samples for the Wikipedia-Image/Caption Matching competition. 
- Dataset size: 1.16 TiB
- Splits: 
| Split | Examples | 
|---|---|
| 'train_with_extended_features' | 37,046,386 | 
- Feature structure:
FeaturesDict({
    'attribution_passes_lang_id': bool,
    'caption_alt_text_description': Text(shape=(), dtype=string),
    'caption_attribution_description': Text(shape=(), dtype=string),
    'caption_reference_description': Text(shape=(), dtype=string),
    'caption_title_and_reference_description': Text(shape=(), dtype=string),
    'context_page_description': Text(shape=(), dtype=string),
    'context_section_description': Text(shape=(), dtype=string),
    'embedding': Tensor(shape=(2048,), dtype=float32),
    'hierarchical_section_title': Text(shape=(), dtype=string),
    'image': Image(shape=(None, None, 3), dtype=uint8),
    'image_url': Text(shape=(), dtype=string),
    'is_main_image': bool,
    'language': Text(shape=(), dtype=string),
    'metadata_url': Text(shape=(), dtype=string),
    'mime_type': Text(shape=(), dtype=string),
    'original_height': int32,
    'original_width': int32,
    'page_changed_recently': bool,
    'page_title': Text(shape=(), dtype=string),
    'page_url': Text(shape=(), dtype=string),
    'section_title': Text(shape=(), dtype=string),
})
- Feature documentation:
| Feature | Class | Shape | Dtype | Description |
|---|---|---|---|---|
| | FeaturesDict | | | |
| attribution_passes_lang_id | Tensor | | bool | |
| caption_alt_text_description | Text | | string | |
| caption_attribution_description | Text | | string | |
| caption_reference_description | Text | | string | |
| caption_title_and_reference_description | Text | | string | |
| context_page_description | Text | | string | |
| context_section_description | Text | | string | |
| embedding | Tensor | (2048,) | float32 | |
| hierarchical_section_title | Text | | string | |
| image | Image | (None, None, 3) | uint8 | |
| image_url | Text | | string | |
| is_main_image | Tensor | | bool | |
| language | Text | | string | |
| metadata_url | Text | | string | |
| mime_type | Text | | string | |
| original_height | Tensor | | int32 | |
| original_width | Tensor | | int32 | |
| page_changed_recently | Tensor | | bool | |
| page_title | Text | | string | |
| page_url | Text | | string | |
| section_title | Text | | string | |
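The version notes state that training samples without image data carry an all-zero 2048-dimensional embedding. A minimal sketch of how such placeholders might be filtered out follows. It assumes examples have already been converted to plain Python or NumPy values (for instance with tfds.as_numpy), and the helper names `has_placeholder_embedding` and `keep_with_image` are hypothetical.

```python
from typing import Iterable, List, Mapping, Sequence

EMBEDDING_DIM = 2048  # Shape of the 'embedding' feature on this page.


def has_placeholder_embedding(embedding: Sequence[float]) -> bool:
    """True if the embedding is the documented all-zero placeholder."""
    return len(embedding) == EMBEDDING_DIM and all(v == 0 for v in embedding)


def keep_with_image(examples: Iterable[Mapping]) -> List[Mapping]:
    """Drops examples whose embedding is the all-zero placeholder."""
    return [
        ex for ex in examples if not has_placeholder_embedding(ex["embedding"])
    ]
```

Checking the embedding rather than the image avoids decoding pixels, but it relies on the assumption that no real image produces an exactly zero embedding.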
wit_kaggle/test_without_gold
- Config description: Test samples (without gold answers) for the Wikipedia-Image/Caption Matching competition. 
- Dataset size: 3.37 GiB
- Splits: 
| Split | Examples | 
|---|---|
| 'test_without_gold' | 92,366 | 
- Feature structure:
FeaturesDict({
    'caption_title_and_reference_description': Text(shape=(), dtype=string),
    'embedding': Tensor(shape=(2048,), dtype=float32),
    'id': Text(shape=(), dtype=string),
    'image': Image(shape=(None, None, 3), dtype=uint8),
    'image_url': Text(shape=(), dtype=string),
    'metadata_url': Text(shape=(), dtype=string),
})
- Feature documentation:
| Feature | Class | Shape | Dtype | Description |
|---|---|---|---|---|
| | FeaturesDict | | | |
| caption_title_and_reference_description | Text | | string | |
| embedding | Tensor | (2048,) | float32 | |
| id | Text | | string | |
| image | Image | (None, None, 3) | uint8 | |
| image_url | Text | | string | |
| metadata_url | Text | | string | |