The StatisticsGen TFX Pipeline Component
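The StatisticsGen TFX pipeline component generates feature statistics over both training and serving data, which other pipeline components can use. StatisticsGen uses Beam to scale to large datasets.
- Consumes: datasets created by an ExampleGen pipeline component.
- Emits: dataset statistics.
StatisticsGen and TensorFlow Data Validation
StatisticsGen makes extensive use of TensorFlow Data Validation (TFDV) to generate statistics from your dataset.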
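To make "feature statistics" concrete, here is a minimal pure-Python sketch of the kind of per-feature summary that can be computed for each data split. It is not TFDV's implementation and does not use its API; the function name `feature_statistics`, the `splits` data, and the chosen summary fields are all hypothetical, and TFDV itself produces a much richer set of statistics.

```python
from collections import Counter
from statistics import mean


def feature_statistics(examples):
    """Summarize each feature across a list of example dicts (illustrative only)."""
    names = sorted({name for example in examples for name in example})
    stats = {}
    for name in names:
        # Values that are absent or None count as missing.
        present = [ex[name] for ex in examples if ex.get(name) is not None]
        summary = {"count": len(present), "missing": len(examples) - len(present)}
        if present and all(isinstance(v, (int, float)) for v in present):
            # Numeric feature: report range and mean.
            summary.update(min=min(present), max=max(present), mean=mean(present))
        else:
            # Anything else is treated as categorical: report the most common values.
            summary["top_values"] = Counter(present).most_common(2)
        stats[name] = summary
    return stats


# Hypothetical training and evaluation splits.
splits = {
    "train": [
        {"age": 31, "city": "Oslo"},
        {"age": 45, "city": "Lima"},
        {"age": None, "city": "Oslo"},
    ],
    "eval": [{"age": 28, "city": "Lima"}],
}

# One statistics summary per split.
stats_by_split = {split: feature_statistics(rows) for split, rows in splits.items()}
```

Computing the summary per split is what lets downstream checks compare, for example, training statistics against serving statistics.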
Using the StatsGen Component
A StatisticsGen pipeline component is typically very easy to deploy and requires little customization. Typical code looks like this:
from tfx import components
...
# example_gen is the upstream ExampleGen component defined earlier in the pipeline.
compute_eval_stats = components.StatisticsGen(
    examples=example_gen.outputs['examples'],
    name='compute-eval-stats'
)
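Using the StatsGen Component With a Schema
For the first run of a pipeline, the output of StatisticsGen is used to infer a schema. On subsequent runs, however, you may have a manually curated schema that contains additional information about your dataset. When you provide this schema to StatisticsGen, TFDV can produce more useful statistics based on the declared properties of your dataset.
In this setting, you invoke StatisticsGen with a curated schema that has been imported by an ImporterNode, like this: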
from tfx import components
from tfx.types import standard_artifacts
...
user_schema_importer = components.ImporterNode(
    instance_name='import_user_schema',
    source_uri=user_schema_dir,  # directory containing only schema text proto
    artifact_type=standard_artifacts.Schema)
compute_eval_stats = components.StatisticsGen(
    examples=example_gen.outputs['examples'],
    schema=user_schema_importer.outputs['result'],
    name='compute-eval-stats'
)
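As a rough illustration of why a declared property can change which statistics are useful, the sketch below summarizes the same integer column twice: once as an inferred numeric feature, and once as a feature that a curated schema declares categorical. This is plain Python, not TFDV; the `summarize` function, the `zip_code` feature, and the dictionary standing in for the schema proto are hypothetical.

```python
from collections import Counter
from statistics import mean


def summarize(values, categorical=False):
    """Return numeric or categorical statistics depending on the declared type."""
    if categorical:
        counts = Counter(values)
        return {
            "type": "categorical",
            "unique": len(counts),
            "top": counts.most_common(1)[0],  # (value, frequency)
        }
    return {"type": "numeric", "mean": mean(values), "min": min(values), "max": max(values)}


zip_codes = [94103, 10001, 94103, 60601]

# Without a schema, integers look numeric, so the summary is a mean and a range.
inferred = summarize(zip_codes)

# Hypothetical stand-in for a curated schema that marks the feature as categorical.
curated_schema = {"zip_code": {"categorical": True}}
declared = summarize(zip_codes, categorical=curated_schema["zip_code"]["categorical"])
```

The mean of a set of ZIP codes carries little meaning, whereas the unique count and most frequent value describe the column well; this is the sense in which a declared property can make statistics more useful.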
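Creating a Curated Schema
Schema in TFX is an instance of the TensorFlow Metadata Schema proto. It can be composed in text format from scratch. However, it is much easier to use the inferred schema produced by SchemaGen as a starting point. Once the SchemaGen component has executed, the schema is located under the pipeline root at the following path: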
<pipeline_root>/SchemaGen/schema/<artifact_id>/schema.pbtxt
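Here, <artifact_id> represents a unique ID for this version of the schema in MLMD (ML Metadata). You can then modify this schema proto to communicate information about the dataset that cannot be reliably inferred, which makes the output of StatisticsGen more useful and the validation performed in the ExampleValidator component more stringent.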
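The path layout above can be assembled programmatically when you want to copy the inferred schema as a starting point. The following is a small sketch using only the standard library; the helper names `schema_path` and `latest_schema_path` are hypothetical, and treating the largest integer directory name as the most recent artifact is an assumption rather than documented behavior.

```python
import os
from pathlib import Path


def schema_path(pipeline_root, artifact_id):
    """Build <pipeline_root>/SchemaGen/schema/<artifact_id>/schema.pbtxt."""
    return Path(pipeline_root) / "SchemaGen" / "schema" / str(artifact_id) / "schema.pbtxt"


def latest_schema_path(pipeline_root):
    """Return the schema path for the highest-numbered artifact directory.

    Assumption: artifact IDs are integers and a larger ID means a newer schema.
    """
    schema_dir = Path(pipeline_root) / "SchemaGen" / "schema"
    # Compare IDs as integers so that 12 sorts after 3.
    ids = [int(name) for name in os.listdir(schema_dir) if name.isdigit()]
    if not ids:
        raise FileNotFoundError(f"no schema artifacts found under {schema_dir}")
    return schema_path(pipeline_root, max(ids))
```

For example, `schema_path("/tmp/my_pipeline", 7)` points at the `schema.pbtxt` file for artifact 7 under a hypothetical pipeline root; the file found there can be copied into a separate directory, edited, and supplied as the curated schema.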
Last updated 2020-11-13 UTC.