Datasets versioning
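Definition
Versioning can refer to three different things:
- The TFDS API version (pip version): tfds.__version__
- The public dataset version, independent of TFDS (e.g. Voc2007, Voc2012). In TFDS, each public dataset version should be implemented as an independent dataset:
  - Either through builder configs: e.g. voc/2007, voc/2012
  - Or as 2 independent datasets: e.g. wmt13_translate, wmt14_translate
- The dataset generation code version in TFDS (my_dataset:1.0.0): for example, if a bug is found in the TFDS implementation of voc/2007, the voc.py generation code will be updated (voc/2007:1.0.0 -> voc/2007:2.0.0).
The rest of this guide focuses only on the last definition (the dataset code version in the TFDS repository).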
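Supported versions
As a general rule:
- Only the latest current version can be generated.
- All previously generated datasets can be read (note: this requires datasets generated with TFDS 4+).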
import tensorflow_datasets as tfds

builder = tfds.builder('my_dataset')
builder.info.version  # Current version is: '2.0.0'

# Download and load the last available version (2.0.0)
ds = tfds.load('my_dataset')

# Explicitly load a previous version (only works if
# `~/tensorflow_datasets/my_dataset/1.0.0/` already exists)
ds = tfds.load('my_dataset:1.0.0')
Semantics
Every DatasetBuilder defined in TFDS comes with a version, for example:
class MNIST(tfds.core.GeneratorBasedBuilder):
  VERSION = tfds.core.Version('2.0.0')
  RELEASE_NOTES = {
      '1.0.0': 'Initial release',
      '2.0.0': 'Update dead download url',
  }
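The version follows Semantic Versioning 2.0.0: MAJOR.MINOR.PATCH. The purpose of the version is to guarantee reproducibility: loading a given dataset at a fixed version yields the same data. More specifically:
- If the PATCH version is incremented, the data as read by the client is the same, although it might be serialized differently on disk, or the metadata might have changed. For any given slice, the slicing API returns the same set of records.
- If the MINOR version is incremented, the existing data as read by the client is the same, but there is additional data (features in each record). For any given slice, the slicing API returns the same set of records.
- If the MAJOR version is incremented, the existing data has been changed and/or the slicing API does not necessarily return the same set of records for a given slice.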
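The three rules above can be summarised in a small sketch. This is an illustration only, not TFDS code: the helper names parse_version and describe_bump and the GUARANTEES table are hypothetical, and the sketch assumes plain 'MAJOR.MINOR.PATCH' strings with integer components.

```python
def parse_version(version: str) -> tuple:
    """Split 'MAJOR.MINOR.PATCH' into a tuple of three integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def describe_bump(old: str, new: str) -> str:
    """Name the highest component that differs between two versions."""
    old_v, new_v = parse_version(old), parse_version(new)
    if new_v <= old_v:
        raise ValueError(f"{new} is not newer than {old}")
    if new_v[0] != old_v[0]:
        return "MAJOR"
    if new_v[1] != old_v[1]:
        return "MINOR"
    return "PATCH"


# What a reader of the dataset can expect after each kind of bump
# (paraphrased from the rules above).
GUARANTEES = {
    "PATCH": "same data and slices; only on-disk format or metadata may differ",
    "MINOR": "existing data and slices unchanged; records may carry extra features",
    "MAJOR": "existing data and/or slice contents may have changed",
}

for old, new in [("1.0.0", "1.0.1"), ("1.0.1", "1.1.0"), ("1.1.0", "2.0.0")]:
    bump = describe_bump(old, new)
    print(f"{old} -> {new}: {bump} ({GUARANTEES[bump]})")
```

A check like this can help decide whether results produced with an older version are still comparable after an upgrade.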
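When a code change is made to the TFDS library and that change affects the way a dataset is serialized and/or read by the client, the corresponding builder version is incremented according to the above guidelines.
Note that the above semantics are best effort, and there might be unnoticed bugs affecting a dataset while the version was not incremented. Such bugs are eventually fixed, but if you rely heavily on the versioning, we advise you to use TFDS from a released version (as opposed to HEAD).
Also note that some datasets have another versioning scheme that is independent of the TFDS version. For example, the Open Images dataset has several versions, and in TFDS the corresponding builders are open_images_v4, open_images_v5, ...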
Loading a specific version
When loading a dataset or a DatasetBuilder, you can specify the version to use. For example:
tfds.load('imagenet2012:2.0.1')
tfds.builder('imagenet2012:2.0.1')

tfds.load('imagenet2012:2.0.0')  # Error: unsupported version.

# Resolves to 3.0.0 for now, but would resolve to 3.1.1 once it is added.
tfds.load('imagenet2012:3.*.*')
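To make the wildcard behaviour concrete, here is a minimal sketch of how a pattern such as 3.*.* could be matched against a list of available versions. It is not the TFDS implementation: resolve_version is a hypothetical helper, the error type is an assumption, and the version lists are taken from the example above.

```python
def resolve_version(requested: str, available: list) -> str:
    """Return the highest available version matching `requested`.

    `requested` is 'MAJOR.MINOR.PATCH', where any component may be '*'.
    Raises ValueError when nothing matches (an assumption of this sketch).
    """
    wanted = requested.split(".")
    if len(wanted) != 3:
        raise ValueError(f"Expected MAJOR.MINOR.PATCH, got: {requested}")
    matches = []
    for version in available:
        parts = version.split(".")
        # A component matches if it is a wildcard or is textually equal.
        if all(w == "*" or w == p for w, p in zip(wanted, parts)):
            matches.append(tuple(int(p) for p in parts))
    if not matches:
        raise ValueError(f"Unsupported version: {requested}")
    # Tuples of integers compare component by component.
    return ".".join(str(n) for n in max(matches))


available = ["2.0.1", "3.0.0"]
print(resolve_version("3.*.*", available))  # 3.0.0

available.append("3.1.1")
print(resolve_version("3.*.*", available))  # 3.1.1
```

Because the result changes when a new matching version appears, a wildcard trades strict reproducibility for automatic updates within the pinned components.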
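If you use TFDS for a publication, we advise you to:
- pin only the MAJOR component of the version;
- state which version of the dataset was used in your results.
Doing so should make it easier for your future self, your readers, and reviewers to reproduce your results.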
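BUILDER_CONFIGS and versions
Some datasets define several BUILDER_CONFIGS. When that is the case, version and supported_versions are defined on the config objects themselves. Other than that, semantics and usage are identical. For example: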
class OpenImagesV4(tfds.core.GeneratorBasedBuilder):

  BUILDER_CONFIGS = [
      OpenImagesV4Config(
          name='original',
          version=tfds.core.Version('0.2.0'),
          supported_versions=[
              tfds.core.Version('1.0.0', "Major change in data"),
          ],
          description='Images at their original resolution and quality.'),
      ...
  ]

tfds.load('open_images_v4/original:1.*.*')
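The strings passed to tfds.load in this guide combine a dataset name, an optional config, and an optional version. The sketch below only illustrates that layout; DatasetSpec and parse_spec are hypothetical names, and TFDS uses its own parser internally.

```python
from typing import NamedTuple, Optional


class DatasetSpec(NamedTuple):
    """Parts of a 'name/config:version' string (hypothetical container)."""
    name: str
    config: Optional[str]
    version: Optional[str]


def parse_spec(spec: str) -> DatasetSpec:
    """Split 'name', 'name:version', or 'name/config:version' into parts."""
    name_and_config, _, version = spec.partition(":")
    name, _, config = name_and_config.partition("/")
    # Empty strings mean the part was absent.
    return DatasetSpec(name, config or None, version or None)


print(parse_spec("open_images_v4/original:1.*.*"))
print(parse_spec("imagenet2012:2.0.1"))
print(parse_spec("my_dataset"))
```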
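Experimental version
Note: The following is bad practice, error-prone, and should be discouraged.
It is possible to allow 2 versions to be generated at the same time: one default version and one experimental version. For example: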
class MNIST(tfds.core.GeneratorBasedBuilder):
  VERSION = tfds.core.Version("1.0.0")  # Default version
  SUPPORTED_VERSIONS = [
      tfds.core.Version("2.0.0"),  # Experimental version
  ]


# Download and load default version 1.0.0
builder = tfds.builder('mnist')

# Download and load experimental version 2.0.0
builder = tfds.builder('mnist', version='experimental_latest')
In the code, you need to make sure to support both versions:
class MNIST(tfds.core.GeneratorBasedBuilder):

  ...

  def _generate_examples(self, path):
    if self.info.version >= '2.0.0':
      ...
    else:
      ...
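The check above compares the builder's version object against '2.0.0', which presumably relies on that object comparing component by component. If you ever hold a version as a plain string instead (a hypothetical situation, not something the guide describes), a direct string comparison gives wrong answers once a component reaches two digits. A minimal sketch with hypothetical helpers version_tuple and at_least:

```python
def version_tuple(version: str) -> tuple:
    """Convert 'MAJOR.MINOR.PATCH' into a tuple of integers."""
    return tuple(int(part) for part in version.split("."))


def at_least(version: str, minimum: str) -> bool:
    """Return True if `version` is greater than or equal to `minimum`."""
    return version_tuple(version) >= version_tuple(minimum)


# Plain strings compare character by character, so '1' < '2' decides it.
print("10.0.0" >= "2.0.0")          # False
# Integer tuples compare numerically, component by component.
print(at_least("10.0.0", "2.0.0"))  # True
```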
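Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.
Last updated 2022-06-07 UTC.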