
wit_kaggle

  • Description:

Wikipedia - Image/Caption Matching Kaggle Competition.

This competition is organized by the Research team at the Wikimedia Foundation in collaboration with Google Research and a few external collaborators. This competition is based on the WIT dataset published by Google Research as detailed in this SIGIR paper.

In this competition, you’ll build a model that automatically retrieves the text closest to an image. Specifically, you'll train your model to associate given images with article titles or complex captions, in multiple languages. The best models will account for the semantic granularity of Wikipedia images. If successful, you'll be contributing to the accessibility of the largest online encyclopedia. The millions of Wikipedia readers and editors will be able to more easily understand, search, and describe media at scale. As a result, you’ll contribute to an open model to improve learning for all.
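
As a loose illustration of the task (not part of the dataset or the competition code), a retrieval step might rank candidate captions against an image embedding by cosine similarity; all names in this sketch are hypothetical, and the encoder that would produce the caption embeddings is not part of this dataset:

import numpy as np

# Hypothetical sketch of image/caption matching: given one image embedding
# (e.g. the 2048-dim ResNet embedding shipped with this dataset) and a matrix
# of candidate caption embeddings, rank captions by cosine similarity.
def rank_captions(image_embedding, caption_embeddings):
    image_embedding = image_embedding / np.linalg.norm(image_embedding)
    caption_embeddings = caption_embeddings / np.linalg.norm(
        caption_embeddings, axis=1, keepdims=True
    )
    scores = caption_embeddings @ image_embedding
    return np.argsort(-scores)  # caption indices, best match first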

  • Homepage: https://www.kaggle.com/c/wikipedia-image-caption/code

  • Source code: tfds.vision_language.wit_kaggle.WitKaggle

  • Versions:

    • 1.0.0: Initial release. It provides the train and test datasets from the Wikipedia - Image/Caption Matching Kaggle competition (https://www.kaggle.com/c/wikipedia-image-caption/data).

      The goal of the competition is to build a model that automatically retrieves the text closest to an image. Specifically, the model should be trained to associate given images with article titles or complex captions, in multiple languages. The best models will account for the semantic granularity of Wikipedia images.

      Note that this release doesn't provide the ground truth for the test set, as it hasn't been provided by the Kaggle competition yet.

      Note that not all of the training observations have corresponding image data. The released images exclude all images containing humans. For samples that are not associated with image data, the following placeholder image features are used: image is a base64-encoded blank image, and embedding is a vector of 2048 zeros.

      The samples released for the competition can be loaded as tfds.load("wit_kaggle/train_with_extended_features") and tfds.load("wit_kaggle/test_without_gold"); a short loading sketch follows this version list.

    • 1.0.1 (default): Optimize Beam pipeline to avoid stragglers by ignoring rows without an image URL. Also added more Beam counters.
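
A minimal loading sketch (assuming the archives have already been placed in manual_dir as described under the manual download instructions below); the all-zero-embedding check for placeholder samples is an assumption based on the note above:

import tensorflow as tf
import tensorflow_datasets as tfds

# Load both configs; split names match the config names.
train_ds = tfds.load(
    "wit_kaggle/train_with_extended_features",
    split="train_with_extended_features",
)
test_ds = tfds.load("wit_kaggle/test_without_gold", split="test_without_gold")

# Assumption: training samples released without image data carry an
# all-zero 2048-dim embedding placeholder, so they can be filtered out.
def has_image_data(sample):
    return tf.math.reduce_any(tf.not_equal(sample["embedding"], 0.0))

train_with_images = train_ds.filter(has_image_data)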

  • Download size: Unknown size

  • Manual download instructions: This dataset requires you to download the source data manually into download_config.manual_dir (defaults to ~/tensorflow_datasets/downloads/manual/):
    Depending on the requested config, manual_dir should contain some of the following subdirectories:

    • train
      • train-{0000x}-of-00005.tsv.zip
      • image_data_train/
        • image_pixels/
          • train_image_pixels_part-00{000-199}.csv.gz
        • resnet_embeddings/
          • train_resnet_embeddings_part-00{000-214}.csv.gz
    • test
      • test.tsv.zip
      • image_data_test/
        • image_pixels/
          • test_image_pixels_part-0000{0-4}.csv
        • resnet_embeddings/
          • test_resnet_embeddings_part-0000{0-9}.csv

Registration at https://www.kaggle.com/c/wikipedia-image-caption/data is needed to get the links to download the dataset.
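
A hedged sketch of pointing the builder at the manually downloaded archives, assuming the default manual_dir location:

import tensorflow_datasets as tfds

# manual_dir must hold the archives downloaded from Kaggle; the path below
# is the TFDS default and can be changed to wherever the files actually live.
download_config = tfds.download.DownloadConfig(
    manual_dir="~/tensorflow_datasets/downloads/manual/"
)
ds = tfds.load(
    "wit_kaggle/train_with_extended_features",
    split="train_with_extended_features",
    download_and_prepare_kwargs={"download_config": download_config},
)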

  • Citation:

@article{srinivasan2021wit,
  title={WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning},
  author={Srinivasan, Krishna and Raman, Karthik and Chen, Jiecao and Bendersky, Michael and Najork, Marc},
  journal={arXiv preprint arXiv:2103.01913},
  year={2021}
}

wit_kaggle/train_with_extended_features (default config)

  • Config description: Training samples for the Wikipedia-Image/Caption Matching competition.

  • Dataset size: 1.16 TiB

  • Splits:

Split                           Examples
'train_with_extended_features'  37,046,386

  • Feature structure:
FeaturesDict({
    'attribution_passes_lang_id': bool,
    'caption_alt_text_description': Text(shape=(), dtype=object),
    'caption_attribution_description': Text(shape=(), dtype=object),
    'caption_reference_description': Text(shape=(), dtype=object),
    'caption_title_and_reference_description': Text(shape=(), dtype=object),
    'context_page_description': Text(shape=(), dtype=object),
    'context_section_description': Text(shape=(), dtype=object),
    'embedding': Tensor(shape=(2048,), dtype=float32),
    'hierarchical_section_title': Text(shape=(), dtype=object),
    'image': Image(shape=(None, None, 3), dtype=uint8),
    'image_url': Text(shape=(), dtype=object),
    'is_main_image': bool,
    'language': Text(shape=(), dtype=object),
    'metadata_url': Text(shape=(), dtype=object),
    'mime_type': Text(shape=(), dtype=object),
    'original_height': int32,
    'original_width': int32,
    'page_changed_recently': bool,
    'page_title': Text(shape=(), dtype=object),
    'page_url': Text(shape=(), dtype=object),
    'section_title': Text(shape=(), dtype=object),
})
  • Feature documentation:

Feature                                   Class    Shape            Dtype    Description
FeaturesDict
attribution_passes_lang_id                Tensor                    bool
caption_alt_text_description              Text                      object
caption_attribution_description           Text                      object
caption_reference_description             Text                      object
caption_title_and_reference_description   Text                      object
context_page_description                  Text                      object
context_section_description               Text                      object
embedding                                 Tensor   (2048,)          float32
hierarchical_section_title                Text                      object
image                                     Image    (None, None, 3)  uint8
image_url                                 Text                      object
is_main_image                             Tensor                    bool
language                                  Text                      object
metadata_url                              Text                      object
mime_type                                 Text                      object
original_height                           Tensor                    int32
original_width                            Tensor                    int32
page_changed_recently                     Tensor                    bool
page_title                                Text                      object
page_url                                  Text                      object
section_title                             Text                      object
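
A minimal hedged sketch of reading one training sample and accessing a few of the fields documented above:

import tensorflow_datasets as tfds

# Inspect a single training sample; field names follow the feature structure above.
ds = tfds.load(
    "wit_kaggle/train_with_extended_features",
    split="train_with_extended_features",
)
for sample in ds.take(1):
    print(sample["page_title"].numpy().decode("utf-8"))
    print(sample["caption_reference_description"].numpy().decode("utf-8"))
    print(sample["image"].shape)      # (height, width, 3) uint8 image
    print(sample["embedding"].shape)  # (2048,) float32 ResNet embedding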


wit_kaggle/test_without_gold

  • Config description: Test samples (without gold answers) for the Wikipedia-Image/Caption Matching competition.

  • Dataset size: 3.37 GiB

  • Splits:

Split                Examples
'test_without_gold'  92,366

  • Feature structure:
FeaturesDict({
    'caption_title_and_reference_description': Text(shape=(), dtype=object),
    'embedding': Tensor(shape=(2048,), dtype=float32),
    'id': Text(shape=(), dtype=object),
    'image': Image(shape=(None, None, 3), dtype=uint8),
    'image_url': Text(shape=(), dtype=object),
    'metadata_url': Text(shape=(), dtype=object),
})
  • Feature documentation:

Feature                                   Class    Shape            Dtype    Description
FeaturesDict
caption_title_and_reference_description   Text                      object
embedding                                 Tensor   (2048,)          float32
id                                        Text                      object
image                                     Image    (None, None, 3)  uint8
image_url                                 Text                      object
metadata_url                              Text                      object
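
A similar hedged sketch for the test config, which ships without gold labels; the id field identifies each sample:

import tensorflow_datasets as tfds

# The test split carries no ground truth for the matching task.
ds = tfds.load("wit_kaggle/test_without_gold", split="test_without_gold")
for sample in ds.take(1):
    print(sample["id"].numpy().decode("utf-8"))
    print(sample["embedding"].shape)  # (2048,) float32 image embedding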
