FeatureConnector
The tfds.features.FeatureConnector API:
- Defines the structure, shapes, and dtypes of the final tf.data.Dataset
- Abstracts away serialization to/from disk
- Exposes additional metadata (e.g. label names, audio sample rate, ...)
Overview
tfds.features.FeatureConnector defines the dataset feature structure (in tfds.core.DatasetInfo):
tfds.core.DatasetInfo(
    features=tfds.features.FeaturesDict({
        'image': tfds.features.Image(shape=(28, 28, 1), doc='Grayscale image'),
        'label': tfds.features.ClassLabel(
            names=['no', 'yes'],
            doc=tfds.features.Documentation(
                desc='Whether this is a picture of a cat',
                value_range='yes or no'
            ),
        ),
        'metadata': {
            'id': tf.int64,
            'timestamp': tfds.features.Scalar(
                tf.int64,
                doc='Timestamp when this picture was taken as seconds since epoch'),
            'language': tf.string,
        },
    }),
)
Features can be documented either with just a textual description (doc='description') or by using tfds.features.Documentation directly to provide a more detailed feature description.
Features can be:
- Scalar values: tf.bool, tf.string, tf.float32, ... When you want to document the feature, you can also use tfds.features.Scalar(tf.int64, doc='description').
- tfds.features.Audio, tfds.features.Video, ... (see the list of available features)
- Nested dicts of features: {'metadata': {'image': Image(), 'description': tf.string}}, ...
- Nested tfds.features.Sequence: Sequence({'image': ..., 'id': ...}), Sequence(Sequence(tf.int64)), ... (see the sketch right after this list)
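For illustration, here is a minimal sketch of a features definition that combines a nested dict with Sequence features; the field names ('frames', 'tags', 'metadata') are hypothetical and only meant to show the nesting:

import tensorflow as tf
import tensorflow_datasets as tfds

features = tfds.features.FeaturesDict({
    # Variable-length list of images (e.g. video frames).
    'frames': tfds.features.Sequence(
        tfds.features.Image(shape=(None, None, 3))),
    # Variable-length list of string tags.
    'tags': tfds.features.Sequence(tf.string),
    # Nested dict of plain scalar features.
    'metadata': {
        'description': tf.string,
        'id': tf.int64,
    },
})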
During generation, examples are automatically serialized by FeatureConnector.encode_example into a format suitable for disk (currently tf.train.Example protocol buffers):
yield {
    'image': '/path/to/img0.png',  # `np.array`, file bytes,... also accepted
    'label': 'yes',  # int (0-num_classes) also accepted
    'metadata': {
        'id': 43,
        'language': 'en',
    },
}
When reading the dataset (e.g. with tfds.load), the data is automatically decoded with FeatureConnector.decode_example. The returned tf.data.Dataset will match the dict structure defined in tfds.core.DatasetInfo:
ds = tfds.load(...)
ds.element_spec == {
    'image': tf.TensorSpec(shape=(28, 28, 1), dtype=tf.uint8),
    'label': tf.TensorSpec(shape=(), dtype=tf.int64),
    'metadata': {
        'id': tf.TensorSpec(shape=(), dtype=tf.int64),
        'language': tf.TensorSpec(shape=(), dtype=tf.string),
    },
}
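As a usage sketch (assuming a dataset with the features defined above), the decoded examples can be consumed like any tf.data.Dataset of dicts:

for example in ds.take(1):
    image = example['image']                # tf.Tensor, shape (28, 28, 1), tf.uint8
    label = example['label']                # tf.Tensor, shape (), tf.int64
    lang = example['metadata']['language']  # tf.Tensor, shape (), tf.string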
Serialize/deserialize to proto
TFDS exposes a low-level API to serialize/deserialize examples to the tf.train.Example proto.
To serialize a dict[np.ndarray | Path | str | ...] to proto bytes, use features.serialize_example:
with tf.io.TFRecordWriter('path/to/file.tfrecord') as writer:
    for ex in all_exs:
        ex_bytes = features.serialize_example(ex)
        writer.write(ex_bytes)
To deserialize proto bytes to tf.Tensor, use features.deserialize_example:
ds = tf.data.TFRecordDataset('path/to/file.tfrecord')
ds = ds.map(features.deserialize_example)
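Note that `features` in the two snippets above refers to a tfds.features.FeaturesDict instance. One way to obtain such an object (a sketch; the 'mnist' dataset name is only an example) is from a dataset's DatasetInfo:

import tensorflow_datasets as tfds

# Any registered dataset works here; 'mnist' is only an illustrative choice.
builder = tfds.builder('mnist')
features = builder.info.features  # tfds.features.FeaturesDict

# `features` can now be used with serialize_example / deserialize_example
# as in the snippets above.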
Access metadata
To access feature metadata (label names, shapes, dtypes, ...), see the introduction doc. Example:
ds, info = tfds.load(..., with_info=True)
info.features['label'].names # ['cat', 'dog', ...]
info.features['label'].str2int('cat') # 0
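Shape and dtype metadata can be accessed the same way. Continuing the example above (the exact values shown depend on the dataset; these assume the cat-picture features defined earlier):

info.features['image'].shape        # (28, 28, 1)
info.features['image'].dtype        # tf.uint8
info.features['label'].num_classes  # 2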
Create your own tfds.features.FeatureConnector
If you believe a feature is missing from the available features, please open a new issue.
To create your own feature connector, you need to inherit from tfds.features.FeatureConnector and implement the abstract methods.
- If your feature is a single tensor value, it's best to inherit from tfds.features.Tensor and use super() when needed. See the tfds.features.BBoxFeature source code for an example.
- If your feature is a container of multiple tensors, it's best to inherit from tfds.features.FeaturesDict and use super() to automatically encode sub-connectors.
The tfds.features.FeatureConnector object abstracts away how the feature is encoded on disk from how it is presented to the user. The diagram below shows the abstraction layers of the dataset and the transformation from the raw dataset files to the tf.data.Dataset object.

[Diagram: dataset abstraction layers, from raw dataset files to the tf.data.Dataset object]
To create your own feature connector, subclass tfds.features.FeatureConnector and implement the abstract methods (a minimal sketch follows this list):
- encode_example(data): Defines how to encode the data given in the generator _generate_examples() into tf.train.Example-compatible data. It can return a single value or a dict of values.
- decode_example(data): Defines how to decode the data from the tensor read from tf.train.Example into the user tensor returned by tf.data.Dataset.
- get_tensor_info(): Indicates the shape/dtype of the tensor(s) returned by tf.data.Dataset. May be optional if inheriting from another tfds.features.
- (optionally) get_serialized_info(): If the info returned by get_tensor_info() differs from how the data are actually written on disk, you need to overwrite get_serialized_info() to match the specs of the tf.train.Example.
- to_json_content/from_json_content: Required to allow your dataset to be loaded without the original source code. See the Audio feature for an example.
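Below is a minimal, hypothetical sketch of such a subclass. The class name, the scaling behavior, and the constructor argument are illustrative only (not part of the TFDS API); a real connector may need additional care, e.g. around serialized dtypes:

import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds


class ScaledFloat(tfds.features.FeatureConnector):
    """Hypothetical connector: stores a float multiplied by a constant factor."""

    def __init__(self, factor: float = 100.0):
        super().__init__()
        self._factor = factor

    def get_tensor_info(self):
        # Shape/dtype of the tensor returned by tf.data.Dataset.
        return tfds.features.TensorInfo(shape=(), dtype=tf.float32)

    def encode_example(self, example_data):
        # Called at generation time: convert the value yielded by
        # `_generate_examples()` into tf.train.Example-compatible data.
        return np.float32(example_data) * self._factor

    def decode_example(self, tfexample_data):
        # Called at read time: map the stored tensor back to the user tensor.
        return tf.cast(tfexample_data, tf.float32) / self._factor

    def to_json_content(self):
        # Allows the dataset to be reloaded without the original source code.
        return {'factor': self._factor}

    @classmethod
    def from_json_content(cls, value) -> 'ScaledFloat':
        return cls(factor=value['factor'])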
Note: Make sure to test your feature connectors with self.assertFeature and tfds.testing.FeatureExpectationItem. Have a look at the test examples.
For more info, have a look at the tfds.features.FeatureConnector documentation. It's also best to look at real examples.