- Description:
YouTube-VIS is a video instance segmentation dataset. It contains 2,883 high-resolution YouTube videos, a per-pixel category label set of 40 common objects such as people, animals, and vehicles, 4,883 unique video instances, and 131k high-quality manual annotations.
The YouTube-VIS dataset is split into 2,238 training videos, 302 validation videos and 343 test videos.
No files were removed or altered during preprocessing.
Homepage: https://youtube-vos.org/dataset/vis/
Source code: tfds.video.youtube_vis.YoutubeVis
Versions:
1.0.0 (default): Initial release.
Download size: Unknown size
Manual download instructions: This dataset requires you to download the source data manually into download_config.manual_dir (defaults to ~/tensorflow_datasets/downloads/manual/):
Please download all files for the 2019 version of the dataset (test_all_frames.zip, test.json, train_all_frames.zip, train.json, valid_all_frames.zip, valid.json) from the YouTube-VIS website and move them to ~/tensorflow_datasets/downloads/manual/.
Note that the dataset landing page is located at https://youtube-vos.org/dataset/vis/; it redirects to a page on https://competitions.codalab.org where you can download the 2019 version of the dataset. You will need a CodaLab account to download the data. At the time of writing, you may also need to bypass a "Connection not secure" warning when accessing CodaLab.
Auto-cached (documentation): No
Supervised keys (See as_supervised doc): None
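Before building the dataset, it can be useful to confirm that all six required archives are actually in the manual directory. The helper below is a minimal sketch (plain stdlib, no tfds dependency); the file list follows the 2019 manual-download instructions above, and the default path mirrors the documented `manual_dir` default.

```python
import os

# Archives required for the 2019 version, per the manual download
# instructions above.
REQUIRED_FILES = [
    "test_all_frames.zip", "test.json",
    "train_all_frames.zip", "train.json",
    "valid_all_frames.zip", "valid.json",
]

def missing_manual_files(manual_dir):
    """Return the required files not yet present in manual_dir."""
    return [name for name in REQUIRED_FILES
            if not os.path.exists(os.path.join(manual_dir, name))]

if __name__ == "__main__":
    manual_dir = os.path.expanduser("~/tensorflow_datasets/downloads/manual/")
    missing = missing_manual_files(manual_dir)
    if missing:
        print("Still missing:", ", ".join(missing))
    else:
        print("All YouTube-VIS 2019 files are in place.")
```

Running this before `tfds build` (or the first `tfds.load` call) gives a clearer error than a failed generation partway through.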
Figure (tfds.show_examples): Not supported.
Citation:
@article{DBLP:journals/corr/abs-1905-04804,
  author    = {Linjie Yang and
               Yuchen Fan and
               Ning Xu},
  title     = {Video Instance Segmentation},
  journal   = {CoRR},
  volume    = {abs/1905.04804},
  year      = {2019},
  url       = {http://arxiv.org/abs/1905.04804},
  archivePrefix = {arXiv},
  eprint    = {1905.04804},
  timestamp = {Tue, 28 May 2019 12:48:08 +0200},
  biburl    = {https://dblp.org/rec/journals/corr/abs-1905-04804.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
youtube_vis/full (default config)
Config description: The full-resolution version of the dataset, with all frames included, even those without labels.
Dataset size: 33.31 GiB
Splits:
Split | Examples
---|---
'test' | 343
'train' | 2,238
'validation' | 302
- Feature structure:
FeaturesDict({
'metadata': FeaturesDict({
'height': int32,
'num_frames': int32,
'video_name': string,
'width': int32,
}),
'tracks': Sequence({
'areas': Sequence(float32),
'bboxes': Sequence(BBoxFeature(shape=(4,), dtype=float32)),
'category': ClassLabel(shape=(), dtype=int64, num_classes=40),
'frames': Sequence(int32),
'is_crowd': bool,
'segmentations': Video(Image(shape=(None, None, 1), dtype=uint8)),
}),
'video': Video(Image(shape=(None, None, 3), dtype=uint8)),
})
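The `tracks/bboxes` feature is a TFDS `BBoxFeature`, which stores boxes as normalized `[ymin, xmin, ymax, xmax]` coordinates in `[0, 1]`. A minimal sketch of converting one track's boxes back to pixel coordinates using the `metadata` fields (numpy only; the `metadata` dict and box values below are made-up stand-ins for a decoded example):

```python
import numpy as np

def to_pixel_boxes(bboxes, height, width):
    """Scale normalized [ymin, xmin, ymax, xmax] boxes to pixel coordinates."""
    bboxes = np.asarray(bboxes, dtype=np.float32)
    scale = np.array([height, width, height, width], dtype=np.float32)
    return bboxes * scale

# Stand-in for one decoded example's metadata and a single track's boxes.
metadata = {"height": 720, "width": 1280}
track_bboxes = np.array([[0.25, 0.50, 0.75, 1.00]])  # one annotated frame

pixels = to_pixel_boxes(track_bboxes, metadata["height"], metadata["width"])
# pixels[0] is [180., 640., 540., 1280.]
```

For the `480_640_*` configs the scale factors are simply the fixed 480 and 640, but `metadata/height` and `metadata/width` still record the original video resolution.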
- Feature documentation:
Feature | Class | Shape | Dtype | Description
---|---|---|---|---
 | FeaturesDict | | |
metadata | FeaturesDict | | |
metadata/height | Tensor | | int32 |
metadata/num_frames | Tensor | | int32 |
metadata/video_name | Tensor | | string |
metadata/width | Tensor | | int32 |
tracks | Sequence | | |
tracks/areas | Sequence(Tensor) | (None,) | float32 |
tracks/bboxes | Sequence(BBoxFeature) | (None, 4) | float32 |
tracks/category | ClassLabel | | int64 |
tracks/frames | Sequence(Tensor) | (None,) | int32 |
tracks/is_crowd | Tensor | | bool |
tracks/segmentations | Video(Image) | (None, None, None, 1) | uint8 |
video | Video(Image) | (None, None, None, 3) | uint8 |
- Examples (tfds.as_dataframe):
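Because `tracks/frames` records the frame indices at which each track is annotated, a track's `segmentations` are sparse relative to the video. The sketch below (synthetic numpy data, not actual dataset contents) shows one way to scatter a single track's masks into a dense `(num_frames, height, width)` array, with unannotated frames left all-zero:

```python
import numpy as np

def dense_track_mask(num_frames, height, width, frame_indices, segmentations):
    """Scatter a track's sparse masks into a (num_frames, H, W) array.

    frame_indices: the video frames that carry an annotation for this track,
        as in 'tracks/frames'.
    segmentations: (len(frame_indices), H, W, 1) uint8 masks, as in
        'tracks/segmentations'. Frames without annotations stay zero.
    """
    dense = np.zeros((num_frames, height, width), dtype=np.uint8)
    for idx, mask in zip(frame_indices, segmentations):
        dense[idx] = mask[..., 0]  # drop the trailing channel axis
    return dense

# Synthetic stand-in: a 4-frame video annotated only at frames 1 and 3.
h, w = 2, 3
segs = np.ones((2, h, w, 1), dtype=np.uint8)
dense = dense_track_mask(4, h, w, [1, 3], segs)
# dense[0] and dense[2] are all zeros; dense[1] and dense[3] are all ones.
```

The same indexing applies to `tracks/areas` and `tracks/bboxes`, which are aligned element-for-element with `tracks/frames`.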
youtube_vis/480_640_full
Config description: All images are bilinearly resized to 480 x 640, with all frames included.
Dataset size: 130.02 GiB
Splits:
Split | Examples
---|---
'test' | 343
'train' | 2,238
'validation' | 302
- Feature structure:
FeaturesDict({
'metadata': FeaturesDict({
'height': int32,
'num_frames': int32,
'video_name': string,
'width': int32,
}),
'tracks': Sequence({
'areas': Sequence(float32),
'bboxes': Sequence(BBoxFeature(shape=(4,), dtype=float32)),
'category': ClassLabel(shape=(), dtype=int64, num_classes=40),
'frames': Sequence(int32),
'is_crowd': bool,
'segmentations': Video(Image(shape=(480, 640, 1), dtype=uint8)),
}),
'video': Video(Image(shape=(480, 640, 3), dtype=uint8)),
})
- Feature documentation:
Feature | Class | Shape | Dtype | Description
---|---|---|---|---
 | FeaturesDict | | |
metadata | FeaturesDict | | |
metadata/height | Tensor | | int32 |
metadata/num_frames | Tensor | | int32 |
metadata/video_name | Tensor | | string |
metadata/width | Tensor | | int32 |
tracks | Sequence | | |
tracks/areas | Sequence(Tensor) | (None,) | float32 |
tracks/bboxes | Sequence(BBoxFeature) | (None, 4) | float32 |
tracks/category | ClassLabel | | int64 |
tracks/frames | Sequence(Tensor) | (None,) | int32 |
tracks/is_crowd | Tensor | | bool |
tracks/segmentations | Video(Image) | (None, 480, 640, 1) | uint8 |
video | Video(Image) | (None, 480, 640, 3) | uint8 |
- Examples (tfds.as_dataframe):
youtube_vis/480_640_only_frames_with_labels
Config description: All images are bilinearly resized to 480 x 640, with only labeled frames included.
Dataset size: 26.27 GiB
Splits:
Split | Examples
---|---
'test' | 343
'train' | 2,238
'validation' | 302
- Feature structure:
FeaturesDict({
'metadata': FeaturesDict({
'height': int32,
'num_frames': int32,
'video_name': string,
'width': int32,
}),
'tracks': Sequence({
'areas': Sequence(float32),
'bboxes': Sequence(BBoxFeature(shape=(4,), dtype=float32)),
'category': ClassLabel(shape=(), dtype=int64, num_classes=40),
'frames': Sequence(int32),
'is_crowd': bool,
'segmentations': Video(Image(shape=(480, 640, 1), dtype=uint8)),
}),
'video': Video(Image(shape=(480, 640, 3), dtype=uint8)),
})
- Feature documentation:
Feature | Class | Shape | Dtype | Description
---|---|---|---|---
 | FeaturesDict | | |
metadata | FeaturesDict | | |
metadata/height | Tensor | | int32 |
metadata/num_frames | Tensor | | int32 |
metadata/video_name | Tensor | | string |
metadata/width | Tensor | | int32 |
tracks | Sequence | | |
tracks/areas | Sequence(Tensor) | (None,) | float32 |
tracks/bboxes | Sequence(BBoxFeature) | (None, 4) | float32 |
tracks/category | ClassLabel | | int64 |
tracks/frames | Sequence(Tensor) | (None,) | int32 |
tracks/is_crowd | Tensor | | bool |
tracks/segmentations | Video(Image) | (None, 480, 640, 1) | uint8 |
video | Video(Image) | (None, 480, 640, 3) | uint8 |
- Examples (tfds.as_dataframe):
youtube_vis/only_frames_with_labels
Config description: Only labeled frames are included, at their native resolution.
Dataset size: 6.91 GiB
Splits:
Split | Examples
---|---
'test' | 343
'train' | 2,238
'validation' | 302
- Feature structure:
FeaturesDict({
'metadata': FeaturesDict({
'height': int32,
'num_frames': int32,
'video_name': string,
'width': int32,
}),
'tracks': Sequence({
'areas': Sequence(float32),
'bboxes': Sequence(BBoxFeature(shape=(4,), dtype=float32)),
'category': ClassLabel(shape=(), dtype=int64, num_classes=40),
'frames': Sequence(int32),
'is_crowd': bool,
'segmentations': Video(Image(shape=(None, None, 1), dtype=uint8)),
}),
'video': Video(Image(shape=(None, None, 3), dtype=uint8)),
})
- Feature documentation:
Feature | Class | Shape | Dtype | Description
---|---|---|---|---
 | FeaturesDict | | |
metadata | FeaturesDict | | |
metadata/height | Tensor | | int32 |
metadata/num_frames | Tensor | | int32 |
metadata/video_name | Tensor | | string |
metadata/width | Tensor | | int32 |
tracks | Sequence | | |
tracks/areas | Sequence(Tensor) | (None,) | float32 |
tracks/bboxes | Sequence(BBoxFeature) | (None, 4) | float32 |
tracks/category | ClassLabel | | int64 |
tracks/frames | Sequence(Tensor) | (None,) | int32 |
tracks/is_crowd | Tensor | | bool |
tracks/segmentations | Video(Image) | (None, None, None, 1) | uint8 |
video | Video(Image) | (None, None, None, 3) | uint8 |
- Examples (tfds.as_dataframe):
youtube_vis/full_train_split
Config description: The full-resolution version of the dataset, with all frames included, even those without labels. The val and test splits are derived from the original training data.
Dataset size: 26.09 GiB
Splits:
Split | Examples
---|---
'test' | 200
'train' | 1,838
'validation' | 200
- Feature structure:
FeaturesDict({
'metadata': FeaturesDict({
'height': int32,
'num_frames': int32,
'video_name': string,
'width': int32,
}),
'tracks': Sequence({
'areas': Sequence(float32),
'bboxes': Sequence(BBoxFeature(shape=(4,), dtype=float32)),
'category': ClassLabel(shape=(), dtype=int64, num_classes=40),
'frames': Sequence(int32),
'is_crowd': bool,
'segmentations': Video(Image(shape=(None, None, 1), dtype=uint8)),
}),
'video': Video(Image(shape=(None, None, 3), dtype=uint8)),
})
- Feature documentation:
Feature | Class | Shape | Dtype | Description
---|---|---|---|---
 | FeaturesDict | | |
metadata | FeaturesDict | | |
metadata/height | Tensor | | int32 |
metadata/num_frames | Tensor | | int32 |
metadata/video_name | Tensor | | string |
metadata/width | Tensor | | int32 |
tracks | Sequence | | |
tracks/areas | Sequence(Tensor) | (None,) | float32 |
tracks/bboxes | Sequence(BBoxFeature) | (None, 4) | float32 |
tracks/category | ClassLabel | | int64 |
tracks/frames | Sequence(Tensor) | (None,) | int32 |
tracks/is_crowd | Tensor | | bool |
tracks/segmentations | Video(Image) | (None, None, None, 1) | uint8 |
video | Video(Image) | (None, None, None, 3) | uint8 |
- Examples (tfds.as_dataframe):
youtube_vis/480_640_full_train_split
Config description: All images are bilinearly resized to 480 x 640, with all frames included. The val and test splits are derived from the original training data.
Dataset size: 101.57 GiB
Splits:
Split | Examples
---|---
'test' | 200
'train' | 1,838
'validation' | 200
- Feature structure:
FeaturesDict({
'metadata': FeaturesDict({
'height': int32,
'num_frames': int32,
'video_name': string,
'width': int32,
}),
'tracks': Sequence({
'areas': Sequence(float32),
'bboxes': Sequence(BBoxFeature(shape=(4,), dtype=float32)),
'category': ClassLabel(shape=(), dtype=int64, num_classes=40),
'frames': Sequence(int32),
'is_crowd': bool,
'segmentations': Video(Image(shape=(480, 640, 1), dtype=uint8)),
}),
'video': Video(Image(shape=(480, 640, 3), dtype=uint8)),
})
- Feature documentation:
Feature | Class | Shape | Dtype | Description
---|---|---|---|---
 | FeaturesDict | | |
metadata | FeaturesDict | | |
metadata/height | Tensor | | int32 |
metadata/num_frames | Tensor | | int32 |
metadata/video_name | Tensor | | string |
metadata/width | Tensor | | int32 |
tracks | Sequence | | |
tracks/areas | Sequence(Tensor) | (None,) | float32 |
tracks/bboxes | Sequence(BBoxFeature) | (None, 4) | float32 |
tracks/category | ClassLabel | | int64 |
tracks/frames | Sequence(Tensor) | (None,) | int32 |
tracks/is_crowd | Tensor | | bool |
tracks/segmentations | Video(Image) | (None, 480, 640, 1) | uint8 |
video | Video(Image) | (None, 480, 640, 3) | uint8 |
- Examples (tfds.as_dataframe):
youtube_vis/480_640_only_frames_with_labels_train_split
Config description: All images are bilinearly resized to 480 x 640, with only labeled frames included. The val and test splits are derived from the original training data.
Dataset size: 20.55 GiB
Splits:
Split | Examples
---|---
'test' | 200
'train' | 1,838
'validation' | 200
- Feature structure:
FeaturesDict({
'metadata': FeaturesDict({
'height': int32,
'num_frames': int32,
'video_name': string,
'width': int32,
}),
'tracks': Sequence({
'areas': Sequence(float32),
'bboxes': Sequence(BBoxFeature(shape=(4,), dtype=float32)),
'category': ClassLabel(shape=(), dtype=int64, num_classes=40),
'frames': Sequence(int32),
'is_crowd': bool,
'segmentations': Video(Image(shape=(480, 640, 1), dtype=uint8)),
}),
'video': Video(Image(shape=(480, 640, 3), dtype=uint8)),
})
- Feature documentation:
Feature | Class | Shape | Dtype | Description
---|---|---|---|---
 | FeaturesDict | | |
metadata | FeaturesDict | | |
metadata/height | Tensor | | int32 |
metadata/num_frames | Tensor | | int32 |
metadata/video_name | Tensor | | string |
metadata/width | Tensor | | int32 |
tracks | Sequence | | |
tracks/areas | Sequence(Tensor) | (None,) | float32 |
tracks/bboxes | Sequence(BBoxFeature) | (None, 4) | float32 |
tracks/category | ClassLabel | | int64 |
tracks/frames | Sequence(Tensor) | (None,) | int32 |
tracks/is_crowd | Tensor | | bool |
tracks/segmentations | Video(Image) | (None, 480, 640, 1) | uint8 |
video | Video(Image) | (None, 480, 640, 3) | uint8 |
- Examples (tfds.as_dataframe):
youtube_vis/only_frames_with_labels_train_split
Config description: Only labeled frames are included, at their native resolution. The val and test splits are derived from the original training data.
Dataset size: 5.46 GiB
Splits:
Split | Examples
---|---
'test' | 200
'train' | 1,838
'validation' | 200
- Feature structure:
FeaturesDict({
'metadata': FeaturesDict({
'height': int32,
'num_frames': int32,
'video_name': string,
'width': int32,
}),
'tracks': Sequence({
'areas': Sequence(float32),
'bboxes': Sequence(BBoxFeature(shape=(4,), dtype=float32)),
'category': ClassLabel(shape=(), dtype=int64, num_classes=40),
'frames': Sequence(int32),
'is_crowd': bool,
'segmentations': Video(Image(shape=(None, None, 1), dtype=uint8)),
}),
'video': Video(Image(shape=(None, None, 3), dtype=uint8)),
})
- Feature documentation:
Feature | Class | Shape | Dtype | Description
---|---|---|---|---
 | FeaturesDict | | |
metadata | FeaturesDict | | |
metadata/height | Tensor | | int32 |
metadata/num_frames | Tensor | | int32 |
metadata/video_name | Tensor | | string |
metadata/width | Tensor | | int32 |
tracks | Sequence | | |
tracks/areas | Sequence(Tensor) | (None,) | float32 |
tracks/bboxes | Sequence(BBoxFeature) | (None, 4) | float32 |
tracks/category | ClassLabel | | int64 |
tracks/frames | Sequence(Tensor) | (None,) | int32 |
tracks/is_crowd | Tensor | | bool |
tracks/segmentations | Video(Image) | (None, None, None, 1) | uint8 |
video | Video(Image) | (None, None, None, 3) | uint8 |
- Examples (tfds.as_dataframe):