- Description:
YouTube-VIS is a video instance segmentation dataset. It contains 2,883 high-resolution YouTube videos, a per-pixel category label set of 40 common objects such as people, animals, and vehicles, 4,883 unique video instances, and 131k high-quality manual annotations.
The YouTube-VIS dataset is split into 2,238 training videos, 302 validation videos and 343 test videos.
No files were removed or altered during preprocessing.
Homepage: https://youtube-vos.org/dataset/vis/
Source code: tfds.video.youtube_vis.YoutubeVis
Versions:
1.0.0 (default): Initial release.
Download size: Unknown size
Manual download instructions: This dataset requires you to download the source data manually into download_config.manual_dir (defaults to ~/tensorflow_datasets/downloads/manual/):
Please download all files for the 2019 version of the dataset (test_all_frames.zip, test.json, train_all_frames.zip, train.json, valid_all_frames.zip, valid.json) from the YouTube-VIS website and move them to ~/tensorflow_datasets/downloads/manual/.
Note that the dataset landing page is located at https://youtube-vos.org/dataset/vis/; it redirects to a page on https://competitions.codalab.org where you can download the 2019 version of the dataset. You will need a CodaLab account to download the data. At the time of writing, you may also need to bypass a "Connection not secure" warning when accessing CodaLab.
Auto-cached (documentation): No
Supervised keys (See as_supervised doc): None
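Before building the dataset, it can be useful to confirm that all six required archives are actually in the manual directory. The helper below is a minimal sketch (plain stdlib, no tfds dependency); the file list follows the 2019 manual-download instructions above, and the default path mirrors the documented `manual_dir` default.

```python
import os

# Archives required for the 2019 version, per the manual download
# instructions above.
REQUIRED_FILES = [
    "test_all_frames.zip", "test.json",
    "train_all_frames.zip", "train.json",
    "valid_all_frames.zip", "valid.json",
]

def missing_manual_files(manual_dir):
    """Return the required files not yet present in manual_dir."""
    return [name for name in REQUIRED_FILES
            if not os.path.exists(os.path.join(manual_dir, name))]

if __name__ == "__main__":
    manual_dir = os.path.expanduser("~/tensorflow_datasets/downloads/manual/")
    missing = missing_manual_files(manual_dir)
    if missing:
        print("Still missing:", ", ".join(missing))
    else:
        print("All YouTube-VIS 2019 files are in place.")
```

Running this before `tfds build` (or the first `tfds.load` call) gives a clearer error than a failed generation partway through.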
Figure (tfds.show_examples): Not supported.
Citation:
@article{DBLP:journals/corr/abs-1905-04804,
  author    = {Linjie Yang and
               Yuchen Fan and
               Ning Xu},
  title     = {Video Instance Segmentation},
  journal   = {CoRR},
  volume    = {abs/1905.04804},
  year      = {2019},
  url       = {http://arxiv.org/abs/1905.04804},
  archivePrefix = {arXiv},
  eprint    = {1905.04804},
  timestamp = {Tue, 28 May 2019 12:48:08 +0200},
  biburl    = {https://dblp.org/rec/journals/corr/abs-1905-04804.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
youtube_vis/full (default config)
Config description: The full-resolution version of the dataset, with all frames included, even those without labels.
Dataset size: 33.31 GiB
Splits:
Split | Examples
---|---
'test' | 343
'train' | 2,238
'validation' | 302
- Feature structure:
FeaturesDict({
'metadata': FeaturesDict({
'height': int32,
'num_frames': int32,
'video_name': string,
'width': int32,
}),
'tracks': Sequence({
'areas': Sequence(float32),
'bboxes': Sequence(BBoxFeature(shape=(4,), dtype=float32)),
'category': ClassLabel(shape=(), dtype=int64, num_classes=40),
'frames': Sequence(int32),
'is_crowd': bool,
'segmentations': Video(Image(shape=(None, None, 1), dtype=uint8)),
}),
'video': Video(Image(shape=(None, None, 3), dtype=uint8)),
})
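The `tracks/bboxes` feature is a TFDS `BBoxFeature`, which stores boxes as normalized `[ymin, xmin, ymax, xmax]` coordinates in `[0, 1]`. A minimal sketch of converting one track's boxes back to pixel coordinates using the `metadata` fields (numpy only; the `metadata` dict and box values below are made-up stand-ins for a decoded example):

```python
import numpy as np

def to_pixel_boxes(bboxes, height, width):
    """Scale normalized [ymin, xmin, ymax, xmax] boxes to pixel coordinates."""
    bboxes = np.asarray(bboxes, dtype=np.float32)
    scale = np.array([height, width, height, width], dtype=np.float32)
    return bboxes * scale

# Stand-in for one decoded example's metadata and a single track's boxes.
metadata = {"height": 720, "width": 1280}
track_bboxes = np.array([[0.25, 0.50, 0.75, 1.00]])  # one annotated frame

pixels = to_pixel_boxes(track_bboxes, metadata["height"], metadata["width"])
# pixels[0] is [180., 640., 540., 1280.]
```

For the `480_640_*` configs the scale factors are simply the fixed 480 and 640, but `metadata/height` and `metadata/width` still record the original video resolution.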
- Feature documentation:
Feature | Class | Shape | Dtype | Description
---|---|---|---|---
 | FeaturesDict | | |
metadata | FeaturesDict | | |
metadata/height | Tensor | | int32 |
metadata/num_frames | Tensor | | int32 |
metadata/video_name | Tensor | | string |
metadata/width | Tensor | | int32 |
tracks | Sequence | | |
tracks/areas | Sequence(Tensor) | (None,) | float32 |
tracks/bboxes | Sequence(BBoxFeature) | (None, 4) | float32 |
tracks/category | ClassLabel | | int64 |
tracks/frames | Sequence(Tensor) | (None,) | int32 |
tracks/is_crowd | Tensor | | bool |
tracks/segmentations | Video(Image) | (None, None, None, 1) | uint8 |
video | Video(Image) | (None, None, None, 3) | uint8 |
- Examples (tfds.as_dataframe):
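Because `tracks/frames` records the frame indices at which each track is annotated, a track's `segmentations` are sparse relative to the video. The sketch below (synthetic numpy data, not actual dataset contents) shows one way to scatter a single track's masks into a dense `(num_frames, height, width)` array, with unannotated frames left all-zero:

```python
import numpy as np

def dense_track_mask(num_frames, height, width, frame_indices, segmentations):
    """Scatter a track's sparse masks into a (num_frames, H, W) array.

    frame_indices: the video frames that carry an annotation for this track,
        as in 'tracks/frames'.
    segmentations: (len(frame_indices), H, W, 1) uint8 masks, as in
        'tracks/segmentations'. Frames without annotations stay zero.
    """
    dense = np.zeros((num_frames, height, width), dtype=np.uint8)
    for idx, mask in zip(frame_indices, segmentations):
        dense[idx] = mask[..., 0]  # drop the trailing channel axis
    return dense

# Synthetic stand-in: a 4-frame video annotated only at frames 1 and 3.
h, w = 2, 3
segs = np.ones((2, h, w, 1), dtype=np.uint8)
dense = dense_track_mask(4, h, w, [1, 3], segs)
# dense[0] and dense[2] are all zeros; dense[1] and dense[3] are all ones.
```

The same indexing applies to `tracks/areas` and `tracks/bboxes`, which are aligned element-for-element with `tracks/frames`.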
youtube_vis/480_640_full
Config description: All images are bilinearly resized to 480 x 640, with all frames included.
Dataset size: 130.02 GiB
Splits:
Split | Examples
---|---
'test' | 343
'train' | 2,238
'validation' | 302
- Feature structure:
FeaturesDict({
'metadata': FeaturesDict({
'height': int32,
'num_frames': int32,
'video_name': string,
'width': int32,
}),
'tracks': Sequence({
'areas': Sequence(float32),
'bboxes': Sequence(BBoxFeature(shape=(4,), dtype=float32)),
'category': ClassLabel(shape=(), dtype=int64, num_classes=40),
'frames': Sequence(int32),
'is_crowd': bool,
'segmentations': Video(Image(shape=(480, 640, 1), dtype=uint8)),
}),
'video': Video(Image(shape=(480, 640, 3), dtype=uint8)),
})
- Feature documentation:
Feature | Class | Shape | Dtype | Description
---|---|---|---|---
 | FeaturesDict | | |
metadata | FeaturesDict | | |
metadata/height | Tensor | | int32 |
metadata/num_frames | Tensor | | int32 |
metadata/video_name | Tensor | | string |
metadata/width | Tensor | | int32 |
tracks | Sequence | | |
tracks/areas | Sequence(Tensor) | (None,) | float32 |
tracks/bboxes | Sequence(BBoxFeature) | (None, 4) | float32 |
tracks/category | ClassLabel | | int64 |
tracks/frames | Sequence(Tensor) | (None,) | int32 |
tracks/is_crowd | Tensor | | bool |
tracks/segmentations | Video(Image) | (None, 480, 640, 1) | uint8 |
video | Video(Image) | (None, 480, 640, 3) | uint8 |
- Examples (tfds.as_dataframe):
youtube_vis/480_640_only_frames_with_labels
Config description: All images are bilinearly resized to 480 x 640, with only labeled frames included.
Dataset size: 26.27 GiB
Splits:
Split | Examples
---|---
'test' | 343
'train' | 2,238
'validation' | 302
- Feature structure:
FeaturesDict({
'metadata': FeaturesDict({
'height': int32,
'num_frames': int32,
'video_name': string,
'width': int32,
}),
'tracks': Sequence({
'areas': Sequence(float32),
'bboxes': Sequence(BBoxFeature(shape=(4,), dtype=float32)),
'category': ClassLabel(shape=(), dtype=int64, num_classes=40),
'frames': Sequence(int32),
'is_crowd': bool,
'segmentations': Video(Image(shape=(480, 640, 1), dtype=uint8)),
}),
'video': Video(Image(shape=(480, 640, 3), dtype=uint8)),
})
- Feature documentation:
Feature | Class | Shape | Dtype | Description
---|---|---|---|---
 | FeaturesDict | | |
metadata | FeaturesDict | | |
metadata/height | Tensor | | int32 |
metadata/num_frames | Tensor | | int32 |
metadata/video_name | Tensor | | string |
metadata/width | Tensor | | int32 |
tracks | Sequence | | |
tracks/areas | Sequence(Tensor) | (None,) | float32 |
tracks/bboxes | Sequence(BBoxFeature) | (None, 4) | float32 |
tracks/category | ClassLabel | | int64 |
tracks/frames | Sequence(Tensor) | (None,) | int32 |
tracks/is_crowd | Tensor | | bool |
tracks/segmentations | Video(Image) | (None, 480, 640, 1) | uint8 |
video | Video(Image) | (None, 480, 640, 3) | uint8 |
- Examples (tfds.as_dataframe):
youtube_vis/only_frames_with_labels
Config description: Only labeled frames are included, at their native resolution.
Dataset size: 6.91 GiB
Splits:
Split | Examples
---|---
'test' | 343
'train' | 2,238
'validation' | 302
- Feature structure:
FeaturesDict({
'metadata': FeaturesDict({
'height': int32,
'num_frames': int32,
'video_name': string,
'width': int32,
}),
'tracks': Sequence({
'areas': Sequence(float32),
'bboxes': Sequence(BBoxFeature(shape=(4,), dtype=float32)),
'category': ClassLabel(shape=(), dtype=int64, num_classes=40),
'frames': Sequence(int32),
'is_crowd': bool,
'segmentations': Video(Image(shape=(None, None, 1), dtype=uint8)),
}),
'video': Video(Image(shape=(None, None, 3), dtype=uint8)),
})
- Feature documentation:
Feature | Class | Shape | Dtype | Description
---|---|---|---|---
 | FeaturesDict | | |
metadata | FeaturesDict | | |
metadata/height | Tensor | | int32 |
metadata/num_frames | Tensor | | int32 |
metadata/video_name | Tensor | | string |
metadata/width | Tensor | | int32 |
tracks | Sequence | | |
tracks/areas | Sequence(Tensor) | (None,) | float32 |
tracks/bboxes | Sequence(BBoxFeature) | (None, 4) | float32 |
tracks/category | ClassLabel | | int64 |
tracks/frames | Sequence(Tensor) | (None,) | int32 |
tracks/is_crowd | Tensor | | bool |
tracks/segmentations | Video(Image) | (None, None, None, 1) | uint8 |
video | Video(Image) | (None, None, None, 3) | uint8 |
- Examples (tfds.as_dataframe):
youtube_vis/full_train_split
Config description: The full-resolution version of the dataset, with all frames included, even those without labels. The val and test splits are derived from the original training data.
Dataset size: 26.09 GiB
Splits:
Split | Examples
---|---
'test' | 200
'train' | 1,838
'validation' | 200
- Feature structure:
FeaturesDict({
'metadata': FeaturesDict({
'height': int32,
'num_frames': int32,
'video_name': string,
'width': int32,
}),
'tracks': Sequence({
'areas': Sequence(float32),
'bboxes': Sequence(BBoxFeature(shape=(4,), dtype=float32)),
'category': ClassLabel(shape=(), dtype=int64, num_classes=40),
'frames': Sequence(int32),
'is_crowd': bool,
'segmentations': Video(Image(shape=(None, None, 1), dtype=uint8)),
}),
'video': Video(Image(shape=(None, None, 3), dtype=uint8)),
})
- Feature documentation:
Feature | Class | Shape | Dtype | Description
---|---|---|---|---
 | FeaturesDict | | |
metadata | FeaturesDict | | |
metadata/height | Tensor | | int32 |
metadata/num_frames | Tensor | | int32 |
metadata/video_name | Tensor | | string |
metadata/width | Tensor | | int32 |
tracks | Sequence | | |
tracks/areas | Sequence(Tensor) | (None,) | float32 |
tracks/bboxes | Sequence(BBoxFeature) | (None, 4) | float32 |
tracks/category | ClassLabel | | int64 |
tracks/frames | Sequence(Tensor) | (None,) | int32 |
tracks/is_crowd | Tensor | | bool |
tracks/segmentations | Video(Image) | (None, None, None, 1) | uint8 |
video | Video(Image) | (None, None, None, 3) | uint8 |
- Examples (tfds.as_dataframe):
youtube_vis/480_640_full_train_split
Config description: All images are bilinearly resized to 480 x 640, with all frames included. The val and test splits are derived from the original training data.
Dataset size: 101.57 GiB
Splits:
Split | Examples
---|---
'test' | 200
'train' | 1,838
'validation' | 200
- Feature structure:
FeaturesDict({
'metadata': FeaturesDict({
'height': int32,
'num_frames': int32,
'video_name': string,
'width': int32,
}),
'tracks': Sequence({
'areas': Sequence(float32),
'bboxes': Sequence(BBoxFeature(shape=(4,), dtype=float32)),
'category': ClassLabel(shape=(), dtype=int64, num_classes=40),
'frames': Sequence(int32),
'is_crowd': bool,
'segmentations': Video(Image(shape=(480, 640, 1), dtype=uint8)),
}),
'video': Video(Image(shape=(480, 640, 3), dtype=uint8)),
})
- Feature documentation:
Feature | Class | Shape | Dtype | Description
---|---|---|---|---
 | FeaturesDict | | |
metadata | FeaturesDict | | |
metadata/height | Tensor | | int32 |
metadata/num_frames | Tensor | | int32 |
metadata/video_name | Tensor | | string |
metadata/width | Tensor | | int32 |
tracks | Sequence | | |
tracks/areas | Sequence(Tensor) | (None,) | float32 |
tracks/bboxes | Sequence(BBoxFeature) | (None, 4) | float32 |
tracks/category | ClassLabel | | int64 |
tracks/frames | Sequence(Tensor) | (None,) | int32 |
tracks/is_crowd | Tensor | | bool |
tracks/segmentations | Video(Image) | (None, 480, 640, 1) | uint8 |
video | Video(Image) | (None, 480, 640, 3) | uint8 |
- Examples (tfds.as_dataframe):
youtube_vis/480_640_only_frames_with_labels_train_split
Config description: All images are bilinearly resized to 480 x 640, with only labeled frames included. The val and test splits are derived from the original training data.
Dataset size: 20.55 GiB
Splits:
Split | Examples
---|---
'test' | 200
'train' | 1,838
'validation' | 200
- Feature structure:
FeaturesDict({
'metadata': FeaturesDict({
'height': int32,
'num_frames': int32,
'video_name': string,
'width': int32,
}),
'tracks': Sequence({
'areas': Sequence(float32),
'bboxes': Sequence(BBoxFeature(shape=(4,), dtype=float32)),
'category': ClassLabel(shape=(), dtype=int64, num_classes=40),
'frames': Sequence(int32),
'is_crowd': bool,
'segmentations': Video(Image(shape=(480, 640, 1), dtype=uint8)),
}),
'video': Video(Image(shape=(480, 640, 3), dtype=uint8)),
})
- Feature documentation:
Feature | Class | Shape | Dtype | Description
---|---|---|---|---
 | FeaturesDict | | |
metadata | FeaturesDict | | |
metadata/height | Tensor | | int32 |
metadata/num_frames | Tensor | | int32 |
metadata/video_name | Tensor | | string |
metadata/width | Tensor | | int32 |
tracks | Sequence | | |
tracks/areas | Sequence(Tensor) | (None,) | float32 |
tracks/bboxes | Sequence(BBoxFeature) | (None, 4) | float32 |
tracks/category | ClassLabel | | int64 |
tracks/frames | Sequence(Tensor) | (None,) | int32 |
tracks/is_crowd | Tensor | | bool |
tracks/segmentations | Video(Image) | (None, 480, 640, 1) | uint8 |
video | Video(Image) | (None, 480, 640, 3) | uint8 |
- Examples (tfds.as_dataframe):
youtube_vis/only_frames_with_labels_train_split
Config description: Only labeled frames are included, at their native resolution. The val and test splits are derived from the original training data.
Dataset size: 5.46 GiB
Splits:
Split | Examples
---|---
'test' | 200
'train' | 1,838
'validation' | 200
- Feature structure:
FeaturesDict({
'metadata': FeaturesDict({
'height': int32,
'num_frames': int32,
'video_name': string,
'width': int32,
}),
'tracks': Sequence({
'areas': Sequence(float32),
'bboxes': Sequence(BBoxFeature(shape=(4,), dtype=float32)),
'category': ClassLabel(shape=(), dtype=int64, num_classes=40),
'frames': Sequence(int32),
'is_crowd': bool,
'segmentations': Video(Image(shape=(None, None, 1), dtype=uint8)),
}),
'video': Video(Image(shape=(None, None, 3), dtype=uint8)),
})
- Feature documentation:
Feature | Class | Shape | Dtype | Description
---|---|---|---|---
 | FeaturesDict | | |
metadata | FeaturesDict | | |
metadata/height | Tensor | | int32 |
metadata/num_frames | Tensor | | int32 |
metadata/video_name | Tensor | | string |
metadata/width | Tensor | | int32 |
tracks | Sequence | | |
tracks/areas | Sequence(Tensor) | (None,) | float32 |
tracks/bboxes | Sequence(BBoxFeature) | (None, 4) | float32 |
tracks/category | ClassLabel | | int64 |
tracks/frames | Sequence(Tensor) | (None,) | int32 |
tracks/is_crowd | Tensor | | bool |
tracks/segmentations | Video(Image) | (None, None, None, 1) | uint8 |
video | Video(Image) | (None, None, None, 3) | uint8 |
- Examples (tfds.as_dataframe):