# stanford_mask_vit_converted_externally_to_rlds
- **Description**:

Sawyer pushing and picking objects in a bin.

- **Homepage**: <https://arxiv.org/abs/2206.11894>

- **Source code**: [`tfds.robotics.rtx.StanfordMaskVitConvertedExternallyToRlds`](https://github.com/tensorflow/datasets/tree/master/tensorflow_datasets/robotics/rtx/rtx.py)

- **Versions**:
  - **`0.1.0`** (default): Initial release.

- **Download size**: `Unknown size`

- **Dataset size**: `76.17 GiB`

- **Auto-cached** ([documentation](https://www.tensorflow.org/datasets/performances#auto-caching)): No

- **Splits**:
| Split     | Examples |
|-----------|----------|
| `'train'` | 9,109    |
| `'val'`   | 91       |
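The dataset follows the RLDS episode format and loads through the standard TFDS API. A minimal loading sketch (assuming the dataset is available to your `tensorflow_datasets` installation; it is registered under `tfds.robotics.rtx`):

    import tensorflow_datasets as tfds

    # Each element of these datasets is one full episode (RLDS format).
    train_ds = tfds.load(
        'stanford_mask_vit_converted_externally_to_rlds', split='train')
    val_ds = tfds.load(
        'stanford_mask_vit_converted_externally_to_rlds', split='val')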
- **Feature structure**:

    FeaturesDict({
        'episode_metadata': FeaturesDict({
            'file_path': Text(shape=(), dtype=string),
        }),
        'steps': Dataset({
            'action': Tensor(shape=(5,), dtype=float32, description=Robot action, consists of [3x change in end effector position, 1x gripper yaw, 1x open/close gripper (-1 means to open the gripper, 1 means close)].),
            'discount': Scalar(shape=(), dtype=float32, description=Discount if provided, default to 1.),
            'is_first': bool,
            'is_last': bool,
            'is_terminal': bool,
            'language_embedding': Tensor(shape=(512,), dtype=float32, description=Kona language embedding. See https://tfhub.dev/google/universal-sentence-encoder-large/5),
            'language_instruction': Text(shape=(), dtype=string),
            'observation': FeaturesDict({
                'end_effector_pose': Tensor(shape=(5,), dtype=float32, description=Robot end effector pose, consists of [3x Cartesian position, 1x gripper yaw, 1x gripper position]. This is the state used in the MaskViT paper.),
                'finger_sensors': Tensor(shape=(1,), dtype=float32, description=1x Sawyer gripper finger sensors.),
                'high_bound': Tensor(shape=(5,), dtype=float32, description=High bound for end effector pose normalization. Consists of [3x Cartesian position, 1x gripper yaw, 1x gripper position].),
                'image': Image(shape=(480, 480, 3), dtype=uint8, description=Main camera RGB observation.),
                'low_bound': Tensor(shape=(5,), dtype=float32, description=Low bound for end effector pose normalization. Consists of [3x Cartesian position, 1x gripper yaw, 1x gripper position].),
                'state': Tensor(shape=(15,), dtype=float32, description=Robot state, consists of [7x robot joint angles, 7x robot joint velocities, 1x gripper position].),
            }),
            'reward': Scalar(shape=(), dtype=float32, description=Reward if provided, 1 on final step for demos.),
        }),
    })
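Because this is an RLDS dataset, each top-level example is an episode whose `steps` field is a nested `tf.data.Dataset` of per-timestep dictionaries. A short iteration sketch, assuming the dataset was loaded as in the snippet above:

    # Inspect the first episode and walk its steps.
    for episode in train_ds.take(1):
        print(episode['episode_metadata']['file_path'].numpy())
        for step in episode['steps']:
            action = step['action']                      # (5,) float32 delta action
            image = step['observation']['image']         # (480, 480, 3) uint8
            instruction = step['language_instruction']   # scalar string tensor
            if bool(step['is_last']):
                reward = step['reward']                  # 1.0 on final demo step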
- **Feature documentation**:

| Feature | Class | Shape | Dtype | Description |
|---------|-------|-------|-------|-------------|
|  | FeaturesDict |  |  |  |
| episode_metadata | FeaturesDict |  |  |  |
| episode_metadata/file_path | Text |  | string | Path to the original data file. |
| steps | Dataset |  |  |  |
| steps/action | Tensor | (5,) | float32 | Robot action, consists of [3x change in end effector position, 1x gripper yaw, 1x open/close gripper (-1 means to open the gripper, 1 means close)]. |
| steps/discount | Scalar |  | float32 | Discount if provided, default to 1. |
| steps/is_first | Tensor |  | bool |  |
| steps/is_last | Tensor |  | bool |  |
| steps/is_terminal | Tensor |  | bool |  |
| steps/language_embedding | Tensor | (512,) | float32 | Kona language embedding. See https://tfhub.dev/google/universal-sentence-encoder-large/5 |
| steps/language_instruction | Text |  | string | Language instruction. |
| steps/observation | FeaturesDict |  |  |  |
| steps/observation/end_effector_pose | Tensor | (5,) | float32 | Robot end effector pose, consists of [3x Cartesian position, 1x gripper yaw, 1x gripper position]. This is the state used in the MaskViT paper. |
| steps/observation/finger_sensors | Tensor | (1,) | float32 | 1x Sawyer gripper finger sensors. |
| steps/observation/high_bound | Tensor | (5,) | float32 | High bound for end effector pose normalization. Consists of [3x Cartesian position, 1x gripper yaw, 1x gripper position]. |
| steps/observation/image | Image | (480, 480, 3) | uint8 | Main camera RGB observation. |
| steps/observation/low_bound | Tensor | (5,) | float32 | Low bound for end effector pose normalization. Consists of [3x Cartesian position, 1x gripper yaw, 1x gripper position]. |
| steps/observation/state | Tensor | (15,) | float32 | Robot state, consists of [7x robot joint angles, 7x robot joint velocities, 1x gripper position]. |
| steps/reward | Scalar |  | float32 | Reward if provided, 1 on final step for demos. |
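The `low_bound` and `high_bound` observations are documented only as the bounds used for end effector pose normalization; the exact convention is not stated on this page. The sketch below assumes a plain min-max scaling to [0, 1], which is an assumption rather than documented behavior:

    def normalize_pose(step):
        """Min-max normalize the (5,) end effector pose of one step.

        ASSUMPTION: scaling to [0, 1] via (pose - low) / (high - low);
        the dataset only states that low_bound/high_bound are the
        normalization bounds, not the target range.
        """
        obs = step['observation']
        low, high = obs['low_bound'], obs['high_bound']
        return (obs['end_effector_pose'] - low) / (high - low)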
- **Supervised keys** (See [`as_supervised` doc](https://www.tensorflow.org/datasets/api_docs/python/tfds/load#args)): `None`

- **Figure** ([tfds.show_examples](https://www.tensorflow.org/datasets/api_docs/python/tfds/visualization/show_examples)): Not supported.

- **Citation**:

    @inproceedings{gupta2022maskvit,
      title={MaskViT: Masked Visual Pre-Training for Video Prediction},
      author={Agrim Gupta and Stephen Tian and Yunzhi Zhang and Jiajun Wu and Roberto Martín-Martín and Li Fei-Fei},
      booktitle={International Conference on Learning Representations},
      year={2022}
    }