The SchemaGen TFX Pipeline Component
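Some TFX components use a description of your input data called a schema. The schema is an instance of schema.proto. It can specify data types for feature values, whether a feature has to be present in all examples, allowed value ranges, and other properties. A SchemaGen pipeline component automatically generates a schema by inferring types, categories, and ranges from the training data.
- Consumes: statistics from a StatisticsGen component
- Emits: Data schema proto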
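To make "inferring types and presence from the training data" concrete, here is a minimal pure-Python sketch. It is a toy illustration only, not the algorithm SchemaGen or TensorFlow Data Validation actually runs: the function names, the dict-based schema layout, and the rule "mark a feature as required when it appears in every example" are all assumptions made for this example.

```python
from collections import defaultdict


def _infer_type(values):
    """Pick the narrowest of INT, FLOAT, BYTES that fits every value."""
    if all(isinstance(v, int) for v in values):
        return "INT"
    if all(isinstance(v, (int, float)) for v in values):
        return "FLOAT"
    return "BYTES"


def infer_toy_schema(examples):
    """Infer a toy schema from examples shaped like {feature_name: [values]}.

    Hypothetical helper: the output only mimics the field names used in a
    schema proto (type, value_count, presence); it is a plain dict.
    """
    total = len(examples)
    per_feature = defaultdict(list)  # feature name -> list of value lists
    for example in examples:
        for name, values in example.items():
            per_feature[name].append(values)

    schema = {}
    for name, value_lists in per_feature.items():
        flat = [v for values in value_lists for v in values]
        lengths = [len(values) for values in value_lists]
        presence = {"min_count": 1}
        if len(value_lists) == total:
            # Assumption for this toy: seen in every example => required.
            presence["min_fraction"] = 1.0
        schema[name] = {
            "type": _infer_type(flat),
            "value_count": {"min": min(lengths), "max": max(lengths)},
            "presence": presence,
        }
    return schema


examples = [
    {"age": [39.0], "capital-gain": [2174.0], "occupation": ["Sales"]},
    {"age": [50.0], "capital-gain": [0.0]},
]
toy_schema = infer_toy_schema(examples)
```

A real pipeline works from the aggregated statistics produced by StatisticsGen rather than from raw examples, which is what lets it scale to large datasets.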
Here's an excerpt from a schema proto:
...
feature {
  name: "age"
  value_count {
    min: 1
    max: 1
  }
  type: FLOAT
  presence {
    min_fraction: 1
    min_count: 1
  }
}
feature {
  name: "capital-gain"
  value_count {
    min: 1
    max: 1
  }
  type: FLOAT
  presence {
    min_fraction: 1
    min_count: 1
  }
}
...
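Read field by field, each feature in the excerpt says: exactly one value per example (value_count min and max of 1), floating-point values (type: FLOAT), and presence in every example (min_fraction: 1) with at least one occurrence overall (min_count: 1). The sketch below checks a few in-memory examples against those constraints. It is a hypothetical pure-Python helper written for illustration; the function name and the example layout are assumptions, and real validation is done by TensorFlow Data Validation.

```python
def check_feature(examples, name, py_type, min_values, max_values, min_fraction):
    """Return a list of human-readable problems for one feature.

    examples: list of dicts shaped like {feature_name: [values]}.
    """
    problems = []
    present = [ex[name] for ex in examples if name in ex]

    # presence.min_fraction: share of examples that must contain the feature.
    fraction = len(present) / len(examples) if examples else 0.0
    if fraction < min_fraction:
        problems.append(f"{name}: present in {fraction:.0%} of examples")

    for values in present:
        # value_count.min / value_count.max: values allowed per example.
        if not min_values <= len(values) <= max_values:
            problems.append(f"{name}: {len(values)} values in one example")
        # type: every value must have the expected Python type.
        if not all(isinstance(v, py_type) for v in values):
            problems.append(f"{name}: unexpected value type in {values!r}")
    return problems


good = [{"age": [39.0]}, {"age": [50.0]}]
bad = [{"age": [39.0]}, {"capital-gain": [0.0]}, {"age": [1.0, 2.0]}, {"age": ["x"]}]

good_problems = check_feature(good, "age", float, 1, 1, 1.0)
bad_problems = check_feature(bad, "age", float, 1, 1, 1.0)
```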
The following TFX libraries use the schema:
- TensorFlow Data Validation
- TensorFlow Transform
- TensorFlow Model Analysis
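In a typical TFX pipeline, SchemaGen generates a schema, which is consumed by the other pipeline components.
Note: The auto-generated schema is best-effort and only tries to infer basic properties of the data. Developers are expected to review and modify it as needed.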
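One way to review and tighten a generated schema is with TensorFlow Data Validation's schema utilities. The following is a minimal sketch, assuming the schema has been written out as a text proto and that the "age" feature from the excerpt should be limited to a numeric range. The function name, the file paths passed in, and the 0 to 120 bounds are hypothetical choices for this example.

```python
def constrain_age_range(schema_path, reviewed_path, low=0.0, high=120.0):
    """Load a text-format schema, bound the 'age' feature, and save the result."""
    # Imported inside the function so the sketch can be loaded on a machine
    # without TFDV; calling it requires tensorflow-data-validation.
    import tensorflow_data_validation as tfdv
    from tensorflow_metadata.proto.v0 import schema_pb2

    schema = tfdv.load_schema_text(schema_path)

    # 'age' is typed FLOAT in the excerpt, so a FloatDomain is used here.
    # The bounds are illustrative assumptions, not values from the data.
    tfdv.set_domain(schema, "age", schema_pb2.FloatDomain(min=low, max=high))

    tfdv.write_schema_text(schema, reviewed_path)
    return schema
```

Keeping the reviewed schema in a separate file leaves the auto-generated one available for comparison.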
SchemaGen and TensorFlow Data Validation
SchemaGen makes extensive use of TensorFlow Data Validation for inferring a schema.
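Outside a pipeline, a similar inference can be run directly with TensorFlow Data Validation. The sketch below assumes the training data is a CSV file; the function name and both path arguments are hypothetical, and the code is a rough standalone analogue, not a description of SchemaGen's internals.

```python
def infer_schema_from_csv(csv_path, schema_path):
    """Compute statistics over a CSV file, infer a schema, and save it as text proto."""
    # Imported inside the function so the sketch can be loaded on a machine
    # without TFDV; calling it requires tensorflow-data-validation.
    import tensorflow_data_validation as tfdv

    # Step 1: summary statistics, the same kind of input SchemaGen consumes.
    stats = tfdv.generate_statistics_from_csv(data_location=csv_path)

    # Step 2: best-effort schema inferred from those statistics.
    schema = tfdv.infer_schema(statistics=stats)

    # Step 3: write a human-readable text proto that can be reviewed by hand.
    tfdv.write_schema_text(schema, schema_path)
    return schema
```

In a notebook, passing the returned schema to tfdv.display_schema renders it as a table for inspection.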
Using the SchemaGen Component
A SchemaGen pipeline component is typically very easy to deploy and requires little customization. Typical code looks like this:
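from tfx import components
...
# Infer a schema from the statistics emitted by the upstream StatisticsGen
# component (compute_training_stats, presumably defined in the elided code).
infer_schema = components.SchemaGen(
    statistics=compute_training_stats.outputs['statistics'])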
Last updated 2020-11-13 UTC.