- Description:
YouTube-VIS is a video instance segmentation dataset. It contains 2,883 high-resolution YouTube videos, a per-pixel category label set covering 40 common object categories such as people, animals and vehicles, 4,883 unique video instances, and 131k high-quality manual annotations.
The YouTube-VIS dataset is split into 2,238 training videos, 302 validation videos and 343 test videos.
No files were removed or altered during preprocessing.
- Additional Documentation: Explore on Papers With Code 
- Homepage: https://youtube-vos.org/dataset/vis/ 
- Source code: tfds.video.youtube_vis.YoutubeVis
- Versions:
  - 1.0.0 (default): Initial release.
- Download size: Unknown size
- Manual download instructions: This dataset requires you to download the source data manually into download_config.manual_dir (defaults to ~/tensorflow_datasets/downloads/manual/):
Please download all files for the 2019 version of the dataset (test_all_frames.zip, test.json, train_all_frames.zip, train.json, valid_all_frames.zip, valid.json) from the YouTube-VIS website and move them to ~/tensorflow_datasets/downloads/manual/.
Note that the dataset landing page is located at https://youtube-vos.org/dataset/vis/ and redirects to a page on https://competitions.codalab.org, where you can download the 2019 version of the dataset. You will need to create a CodaLab account to download the data. At the time of writing, you also need to bypass a "Connection not secure" warning when accessing CodaLab.
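Before starting a build, it can help to confirm that all six files listed above are actually in the manual directory. The following is a minimal sketch using only the Python standard library; the helper name `missing_manual_files` is hypothetical (it is not part of TFDS), and its default path simply mirrors the directory named above.

```python
from pathlib import Path

# File names taken from the manual download instructions (2019 version).
EXPECTED_FILES = (
    "test_all_frames.zip",
    "test.json",
    "train_all_frames.zip",
    "train.json",
    "valid_all_frames.zip",
    "valid.json",
)


def missing_manual_files(manual_dir="~/tensorflow_datasets/downloads/manual/"):
    """Return the expected file names that are not present in manual_dir.

    Hypothetical helper for illustration; it only checks that the files
    exist, not that their contents are complete or valid.
    """
    root = Path(manual_dir).expanduser()
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

An empty list from `missing_manual_files()` means every expected file name was found; any names returned still need to be downloaded or moved.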
- Auto-cached (documentation): No 
- Supervised keys (See as_supervised doc): None
- Figure (tfds.show_examples): Not supported. 
- Citation: 
@article{DBLP:journals/corr/abs-1905-04804,
  author    = {Linjie Yang and
               Yuchen Fan and
               Ning Xu},
  title     = {Video Instance Segmentation},
  journal   = {CoRR},
  volume    = {abs/1905.04804},
  year      = {2019},
  url       = {http://arxiv.org/abs/1905.04804},
  archivePrefix = {arXiv},
  eprint    = {1905.04804},
  timestamp = {Tue, 28 May 2019 12:48:08 +0200},
  biburl    = {https://dblp.org/rec/journals/corr/abs-1905-04804.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
youtube_vis/full (default config)
- Config description: The full-resolution version of the dataset, with all frames included, whether labeled or not.
- Dataset size: 33.31 GiB
- Splits: 
| Split | Examples | 
|---|---|
| 'test' | 343 | 
| 'train' | 2,238 | 
| 'validation' | 302 | 
- Feature structure:
FeaturesDict({
    'metadata': FeaturesDict({
        'height': int32,
        'num_frames': int32,
        'video_name': string,
        'width': int32,
    }),
    'tracks': Sequence({
        'areas': Sequence(float32),
        'bboxes': Sequence(BBoxFeature(shape=(4,), dtype=float32)),
        'category': ClassLabel(shape=(), dtype=int64, num_classes=40),
        'frames': Sequence(int32),
        'is_crowd': bool,
        'segmentations': Video(Image(shape=(None, None, 1), dtype=uint8)),
    }),
    'video': Video(Image(shape=(None, None, 3), dtype=uint8)),
})
- Feature documentation:
| Feature | Class | Shape | Dtype | Description |
|---|---|---|---|---|
| | FeaturesDict | | | |
| metadata | FeaturesDict | | | |
| metadata/height | Tensor | | int32 | |
| metadata/num_frames | Tensor | | int32 | |
| metadata/video_name | Tensor | | string | |
| metadata/width | Tensor | | int32 | |
| tracks | Sequence | | | |
| tracks/areas | Sequence(Tensor) | (None,) | float32 | |
| tracks/bboxes | Sequence(BBoxFeature) | (None, 4) | float32 | |
| tracks/category | ClassLabel | | int64 | |
| tracks/frames | Sequence(Tensor) | (None,) | int32 | |
| tracks/is_crowd | Tensor | | bool | |
| tracks/segmentations | Video(Image) | (None, None, None, 1) | uint8 | |
| video | Video(Image) | (None, None, None, 3) | uint8 | |
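Because `tracks` is a Sequence of a dictionary, a decoded example exposes it as a dictionary of per-track sequences rather than a list of track records. The sketch below regroups those annotations by frame index using plain Python containers. It rests on an assumption that this page does not state: that `frames[i]`, `bboxes[i]` and `areas[i]` of a track all describe the same video frame. The function name `annotations_by_frame` is hypothetical.

```python
from collections import defaultdict


def annotations_by_frame(tracks):
    """Group track annotations by the frame index they refer to.

    `tracks` is a dict of per-track sequences with the keys 'category',
    'is_crowd', 'frames', 'bboxes' and 'areas', mirroring the feature
    structure of this dataset.
    ASSUMPTION: within one track, frames[i], bboxes[i] and areas[i]
    are aligned and describe the same frame.
    """
    per_frame = defaultdict(list)
    num_tracks = len(tracks["category"])
    for track_id in range(num_tracks):
        aligned = zip(
            tracks["frames"][track_id],
            tracks["bboxes"][track_id],
            tracks["areas"][track_id],
        )
        for frame, bbox, area in aligned:
            per_frame[int(frame)].append(
                {
                    "track_id": track_id,
                    "category": int(tracks["category"][track_id]),
                    "is_crowd": bool(tracks["is_crowd"][track_id]),
                    "bbox": tuple(float(v) for v in bbox),
                    "area": float(area),
                }
            )
    return dict(per_frame)
```

The result maps each annotated frame index to the list of instances visible in it, which is often a more convenient layout for per-frame visualisation or loss computation.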
youtube_vis/480_640_full
- Config description: All frames are included, each bilinearly resized to 480 × 640.
- Dataset size: 130.02 GiB
- Splits: 
| Split | Examples | 
|---|---|
| 'test' | 343 | 
| 'train' | 2,238 | 
| 'validation' | 302 | 
- Feature structure:
FeaturesDict({
    'metadata': FeaturesDict({
        'height': int32,
        'num_frames': int32,
        'video_name': string,
        'width': int32,
    }),
    'tracks': Sequence({
        'areas': Sequence(float32),
        'bboxes': Sequence(BBoxFeature(shape=(4,), dtype=float32)),
        'category': ClassLabel(shape=(), dtype=int64, num_classes=40),
        'frames': Sequence(int32),
        'is_crowd': bool,
        'segmentations': Video(Image(shape=(480, 640, 1), dtype=uint8)),
    }),
    'video': Video(Image(shape=(480, 640, 3), dtype=uint8)),
})
- Feature documentation:
| Feature | Class | Shape | Dtype | Description |
|---|---|---|---|---|
| | FeaturesDict | | | |
| metadata | FeaturesDict | | | |
| metadata/height | Tensor | | int32 | |
| metadata/num_frames | Tensor | | int32 | |
| metadata/video_name | Tensor | | string | |
| metadata/width | Tensor | | int32 | |
| tracks | Sequence | | | |
| tracks/areas | Sequence(Tensor) | (None,) | float32 | |
| tracks/bboxes | Sequence(BBoxFeature) | (None, 4) | float32 | |
| tracks/category | ClassLabel | | int64 | |
| tracks/frames | Sequence(Tensor) | (None,) | int32 | |
| tracks/is_crowd | Tensor | | bool | |
| tracks/segmentations | Video(Image) | (None, 480, 640, 1) | uint8 | |
| video | Video(Image) | (None, 480, 640, 3) | uint8 | |
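In the resized configs every frame has height 480 and width 640, so a box can be mapped to pixel coordinates with fixed constants. The sketch below assumes the usual TFDS `BBoxFeature` convention of `(ymin, xmin, ymax, xmax)` normalized to the range 0 to 1; the helper name `bbox_to_pixels` is hypothetical. For the native-resolution configs, the same function could be called with `metadata/height` and `metadata/width` instead of the defaults.

```python
def bbox_to_pixels(bbox, height=480, width=640):
    """Convert a normalized (ymin, xmin, ymax, xmax) box to pixel units.

    ASSUMPTION: coordinates are normalized to [0, 1], with y measured
    along the image height and x along the image width.
    """
    ymin, xmin, ymax, xmax = (float(v) for v in bbox)
    return (ymin * height, xmin * width, ymax * height, xmax * width)


# A box covering the right half of the middle band of a 480 x 640 frame.
pixel_box = bbox_to_pixels((0.25, 0.5, 0.75, 1.0))
```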
youtube_vis/480_640_only_frames_with_labels
- Config description: Only frames with labels are included, each bilinearly resized to 480 × 640.
- Dataset size: 26.27 GiB
- Splits: 
| Split | Examples | 
|---|---|
| 'test' | 343 | 
| 'train' | 2,238 | 
| 'validation' | 302 | 
- Feature structure:
FeaturesDict({
    'metadata': FeaturesDict({
        'height': int32,
        'num_frames': int32,
        'video_name': string,
        'width': int32,
    }),
    'tracks': Sequence({
        'areas': Sequence(float32),
        'bboxes': Sequence(BBoxFeature(shape=(4,), dtype=float32)),
        'category': ClassLabel(shape=(), dtype=int64, num_classes=40),
        'frames': Sequence(int32),
        'is_crowd': bool,
        'segmentations': Video(Image(shape=(480, 640, 1), dtype=uint8)),
    }),
    'video': Video(Image(shape=(480, 640, 3), dtype=uint8)),
})
- Feature documentation:
| Feature | Class | Shape | Dtype | Description |
|---|---|---|---|---|
| | FeaturesDict | | | |
| metadata | FeaturesDict | | | |
| metadata/height | Tensor | | int32 | |
| metadata/num_frames | Tensor | | int32 | |
| metadata/video_name | Tensor | | string | |
| metadata/width | Tensor | | int32 | |
| tracks | Sequence | | | |
| tracks/areas | Sequence(Tensor) | (None,) | float32 | |
| tracks/bboxes | Sequence(BBoxFeature) | (None, 4) | float32 | |
| tracks/category | ClassLabel | | int64 | |
| tracks/frames | Sequence(Tensor) | (None,) | int32 | |
| tracks/is_crowd | Tensor | | bool | |
| tracks/segmentations | Video(Image) | (None, 480, 640, 1) | uint8 | |
| video | Video(Image) | (None, 480, 640, 3) | uint8 | |
youtube_vis/only_frames_with_labels
- Config description: Only frames with labels are included, at their native resolution.
- Dataset size: 6.91 GiB
- Splits: 
| Split | Examples | 
|---|---|
| 'test' | 343 | 
| 'train' | 2,238 | 
| 'validation' | 302 | 
- Feature structure:
FeaturesDict({
    'metadata': FeaturesDict({
        'height': int32,
        'num_frames': int32,
        'video_name': string,
        'width': int32,
    }),
    'tracks': Sequence({
        'areas': Sequence(float32),
        'bboxes': Sequence(BBoxFeature(shape=(4,), dtype=float32)),
        'category': ClassLabel(shape=(), dtype=int64, num_classes=40),
        'frames': Sequence(int32),
        'is_crowd': bool,
        'segmentations': Video(Image(shape=(None, None, 1), dtype=uint8)),
    }),
    'video': Video(Image(shape=(None, None, 3), dtype=uint8)),
})
- Feature documentation:
| Feature | Class | Shape | Dtype | Description |
|---|---|---|---|---|
| | FeaturesDict | | | |
| metadata | FeaturesDict | | | |
| metadata/height | Tensor | | int32 | |
| metadata/num_frames | Tensor | | int32 | |
| metadata/video_name | Tensor | | string | |
| metadata/width | Tensor | | int32 | |
| tracks | Sequence | | | |
| tracks/areas | Sequence(Tensor) | (None,) | float32 | |
| tracks/bboxes | Sequence(BBoxFeature) | (None, 4) | float32 | |
| tracks/category | ClassLabel | | int64 | |
| tracks/frames | Sequence(Tensor) | (None,) | int32 | |
| tracks/is_crowd | Tensor | | bool | |
| tracks/segmentations | Video(Image) | (None, None, None, 1) | uint8 | |
| video | Video(Image) | (None, None, None, 3) | uint8 | |
youtube_vis/full_train_split
- Config description: The full-resolution version of the dataset, with all frames included, whether labeled or not. The validation and test splits are created from the training data.
- Dataset size: 26.09 GiB
- Splits: 
| Split | Examples | 
|---|---|
| 'test' | 200 | 
| 'train' | 1,838 | 
| 'validation' | 200 | 
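The counts in this table add up to the size of the original training split: 1,838 + 200 + 200 = 2,238. This page does not describe how the builder chooses which videos go where, so the sketch below only illustrates one deterministic way to carve fixed-size validation and test subsets out of a list of training videos. The function name `partition_training_videos`, the seed, and the shuffle-based strategy are all assumptions made for illustration, not the builder's actual procedure.

```python
import random


def partition_training_videos(video_names, num_val=200, num_test=200, seed=0):
    """Split video names into disjoint (train, validation, test) lists.

    ILLUSTRATIVE ONLY: sorts the names, shuffles them with a fixed seed,
    then slices off num_val and num_test entries. The real TFDS builder
    may select videos differently.
    """
    names = sorted(video_names)
    random.Random(seed).shuffle(names)
    validation = names[:num_val]
    test = names[num_val:num_val + num_test]
    train = names[num_val + num_test:]
    return train, validation, test


# Stand-in names for the 2,238 original training videos.
train, validation, test = partition_training_videos(
    [f"video_{i:04d}" for i in range(2238)]
)
```

With 2,238 input names the three lists have 1,838, 200 and 200 entries, matching the counts in the table.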
- Feature structure:
FeaturesDict({
    'metadata': FeaturesDict({
        'height': int32,
        'num_frames': int32,
        'video_name': string,
        'width': int32,
    }),
    'tracks': Sequence({
        'areas': Sequence(float32),
        'bboxes': Sequence(BBoxFeature(shape=(4,), dtype=float32)),
        'category': ClassLabel(shape=(), dtype=int64, num_classes=40),
        'frames': Sequence(int32),
        'is_crowd': bool,
        'segmentations': Video(Image(shape=(None, None, 1), dtype=uint8)),
    }),
    'video': Video(Image(shape=(None, None, 3), dtype=uint8)),
})
- Feature documentation:
| Feature | Class | Shape | Dtype | Description |
|---|---|---|---|---|
| | FeaturesDict | | | |
| metadata | FeaturesDict | | | |
| metadata/height | Tensor | | int32 | |
| metadata/num_frames | Tensor | | int32 | |
| metadata/video_name | Tensor | | string | |
| metadata/width | Tensor | | int32 | |
| tracks | Sequence | | | |
| tracks/areas | Sequence(Tensor) | (None,) | float32 | |
| tracks/bboxes | Sequence(BBoxFeature) | (None, 4) | float32 | |
| tracks/category | ClassLabel | | int64 | |
| tracks/frames | Sequence(Tensor) | (None,) | int32 | |
| tracks/is_crowd | Tensor | | bool | |
| tracks/segmentations | Video(Image) | (None, None, None, 1) | uint8 | |
| video | Video(Image) | (None, None, None, 3) | uint8 | |
youtube_vis/480_640_full_train_split
- Config description: All frames are included, each bilinearly resized to 480 × 640. The validation and test splits are created from the training data.
- Dataset size: 101.57 GiB
- Splits: 
| Split | Examples | 
|---|---|
| 'test' | 200 | 
| 'train' | 1,838 | 
| 'validation' | 200 | 
- Feature structure:
FeaturesDict({
    'metadata': FeaturesDict({
        'height': int32,
        'num_frames': int32,
        'video_name': string,
        'width': int32,
    }),
    'tracks': Sequence({
        'areas': Sequence(float32),
        'bboxes': Sequence(BBoxFeature(shape=(4,), dtype=float32)),
        'category': ClassLabel(shape=(), dtype=int64, num_classes=40),
        'frames': Sequence(int32),
        'is_crowd': bool,
        'segmentations': Video(Image(shape=(480, 640, 1), dtype=uint8)),
    }),
    'video': Video(Image(shape=(480, 640, 3), dtype=uint8)),
})
- Feature documentation:
| Feature | Class | Shape | Dtype | Description |
|---|---|---|---|---|
| | FeaturesDict | | | |
| metadata | FeaturesDict | | | |
| metadata/height | Tensor | | int32 | |
| metadata/num_frames | Tensor | | int32 | |
| metadata/video_name | Tensor | | string | |
| metadata/width | Tensor | | int32 | |
| tracks | Sequence | | | |
| tracks/areas | Sequence(Tensor) | (None,) | float32 | |
| tracks/bboxes | Sequence(BBoxFeature) | (None, 4) | float32 | |
| tracks/category | ClassLabel | | int64 | |
| tracks/frames | Sequence(Tensor) | (None,) | int32 | |
| tracks/is_crowd | Tensor | | bool | |
| tracks/segmentations | Video(Image) | (None, 480, 640, 1) | uint8 | |
| video | Video(Image) | (None, 480, 640, 3) | uint8 | |
youtube_vis/480_640_only_frames_with_labels_train_split
- Config description: Only frames with labels are included, each bilinearly resized to 480 × 640. The validation and test splits are created from the training data.
- Dataset size: 20.55 GiB
- Splits: 
| Split | Examples | 
|---|---|
| 'test' | 200 | 
| 'train' | 1,838 | 
| 'validation' | 200 | 
- Feature structure:
FeaturesDict({
    'metadata': FeaturesDict({
        'height': int32,
        'num_frames': int32,
        'video_name': string,
        'width': int32,
    }),
    'tracks': Sequence({
        'areas': Sequence(float32),
        'bboxes': Sequence(BBoxFeature(shape=(4,), dtype=float32)),
        'category': ClassLabel(shape=(), dtype=int64, num_classes=40),
        'frames': Sequence(int32),
        'is_crowd': bool,
        'segmentations': Video(Image(shape=(480, 640, 1), dtype=uint8)),
    }),
    'video': Video(Image(shape=(480, 640, 3), dtype=uint8)),
})
- Feature documentation:
| Feature | Class | Shape | Dtype | Description |
|---|---|---|---|---|
| | FeaturesDict | | | |
| metadata | FeaturesDict | | | |
| metadata/height | Tensor | | int32 | |
| metadata/num_frames | Tensor | | int32 | |
| metadata/video_name | Tensor | | string | |
| metadata/width | Tensor | | int32 | |
| tracks | Sequence | | | |
| tracks/areas | Sequence(Tensor) | (None,) | float32 | |
| tracks/bboxes | Sequence(BBoxFeature) | (None, 4) | float32 | |
| tracks/category | ClassLabel | | int64 | |
| tracks/frames | Sequence(Tensor) | (None,) | int32 | |
| tracks/is_crowd | Tensor | | bool | |
| tracks/segmentations | Video(Image) | (None, 480, 640, 1) | uint8 | |
| video | Video(Image) | (None, 480, 640, 3) | uint8 | |
youtube_vis/only_frames_with_labels_train_split
- Config description: Only frames with labels are included, at their native resolution. The validation and test splits are created from the training data.
- Dataset size: 5.46 GiB
- Splits: 
| Split | Examples | 
|---|---|
| 'test' | 200 | 
| 'train' | 1,838 | 
| 'validation' | 200 | 
- Feature structure:
FeaturesDict({
    'metadata': FeaturesDict({
        'height': int32,
        'num_frames': int32,
        'video_name': string,
        'width': int32,
    }),
    'tracks': Sequence({
        'areas': Sequence(float32),
        'bboxes': Sequence(BBoxFeature(shape=(4,), dtype=float32)),
        'category': ClassLabel(shape=(), dtype=int64, num_classes=40),
        'frames': Sequence(int32),
        'is_crowd': bool,
        'segmentations': Video(Image(shape=(None, None, 1), dtype=uint8)),
    }),
    'video': Video(Image(shape=(None, None, 3), dtype=uint8)),
})
- Feature documentation:
| Feature | Class | Shape | Dtype | Description |
|---|---|---|---|---|
| | FeaturesDict | | | |
| metadata | FeaturesDict | | | |
| metadata/height | Tensor | | int32 | |
| metadata/num_frames | Tensor | | int32 | |
| metadata/video_name | Tensor | | string | |
| metadata/width | Tensor | | int32 | |
| tracks | Sequence | | | |
| tracks/areas | Sequence(Tensor) | (None,) | float32 | |
| tracks/bboxes | Sequence(BBoxFeature) | (None, 4) | float32 | |
| tracks/category | ClassLabel | | int64 | |
| tracks/frames | Sequence(Tensor) | (None,) | int32 | |
| tracks/is_crowd | Tensor | | bool | |
| tracks/segmentations | Video(Image) | (None, None, None, 1) | uint8 | |
| video | Video(Image) | (None, None, None, 3) | uint8 | |
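The eight config names on this page follow a regular pattern: an optional `480_640` prefix, then either `full` or `only_frames_with_labels`, then an optional `train_split` suffix. The sketch below builds the name from three flags. The helper `config_name` and its parameter names are hypothetical conveniences, not part of TFDS.

```python
def config_name(resize_480_640=False, only_labeled_frames=False, train_split=False):
    """Build a youtube_vis config name from three independent choices.

    Hypothetical helper; it only reproduces the naming pattern of the
    eight configs listed on this page.
    """
    parts = []
    if resize_480_640:
        parts.append("480_640")
    parts.append("only_frames_with_labels" if only_labeled_frames else "full")
    if train_split:
        parts.append("train_split")
    return "youtube_vis/" + "_".join(parts)
```

The returned string can be passed as the dataset name to `tfds.load`, for example `tfds.load(config_name(only_labeled_frames=True), split="train")`, provided the manual download has been completed first.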