Building TFX pipelines
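Note: For a conceptual view of TFX pipelines, see Understanding TFX Pipelines.
Note: Want to build your first pipeline before you dive into the details? Get started by building a pipeline using a template.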
Using the Pipeline class
TFX pipelines are defined using the Pipeline class. The following example demonstrates how to use the Pipeline class.
pipeline.Pipeline(
pipeline_name=<var>pipeline-name</var>,
pipeline_root=<var>pipeline-root</var>,
components=<var>components</var>,
enable_cache=<var>enable-cache</var>,
metadata_connection_config=<var>metadata-connection-config</var>,
)
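Replace the following:
pipeline-name: The name of this pipeline. The pipeline name must be unique.
TFX uses the pipeline name when querying ML Metadata for component input artifacts. Reusing a pipeline name may result in unexpected behavior.
pipeline-root: The root path of this pipeline's outputs. The root path must be the full path to a directory that your orchestrator has read and write access to. At runtime, TFX uses the pipeline root to generate output paths for component artifacts. This directory can be local, or on a supported distributed file system such as Google Cloud Storage or HDFS.
components: A list of component instances that make up this pipeline's workflow.
enable-cache: (Optional.) A boolean value that indicates whether this pipeline uses caching to speed up pipeline execution.
metadata-connection-config: (Optional.) A connection configuration for ML Metadata.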
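To make the pipeline-root description concrete, the following is a minimal sketch of how output paths for component artifacts could be derived from a pipeline root. The helper artifact_output_uri and the root/component/output-key/execution-id layout are assumptions made for illustration only; they are not taken from the TFX source, and the layout TFX actually uses may differ.

```python
import posixpath


def artifact_output_uri(pipeline_root: str, component_id: str,
                        output_key: str, execution_id: int) -> str:
    """Builds an output location under the pipeline root (assumed layout)."""
    # posixpath.join always uses forward slashes, so URI-style roots such as
    # gs://bucket/path stay intact regardless of the host operating system.
    return posixpath.join(pipeline_root, component_id, output_key,
                          str(execution_id))


# The same helper works for a local directory and for a distributed file system.
local_uri = artifact_output_uri(
    "/tmp/tfx/my_pipeline", "StatisticsGen", "statistics", 3)
gcs_uri = artifact_output_uri(
    "gs://my-bucket/tfx/my_pipeline", "StatisticsGen", "statistics", 3)

print(local_uri)
print(gcs_uri)
```

Because every path is derived from the root, the orchestrator needs read and write access to that one directory tree only.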
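Defining the component execution graph
Component instances produce artifacts as outputs and typically depend on artifacts produced by upstream component instances as inputs. The execution sequence for component instances is determined by creating a directed acyclic graph (DAG) of the artifact dependencies.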
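For instance, the ExampleGen standard component can ingest data from a CSV file and output serialized example records. The StatisticsGen standard component accepts these example records as input and produces dataset statistics. In this example, the instance of StatisticsGen must follow ExampleGen because StatisticsGen depends on the output of ExampleGen.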
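The ordering rule described above can be sketched with Python's standard graphlib module. This is an illustrative model of deriving an execution order from artifact dependencies, not the way TFX orchestrators are implemented. The dictionary artifact_dependencies is a hypothetical representation, and the SchemaGen entry (which consumes the statistics produced by StatisticsGen) is added only to lengthen the chain.

```python
from graphlib import TopologicalSorter

# Each component maps to the set of upstream components whose output
# artifacts it consumes as inputs.
artifact_dependencies = {
    "ExampleGen": set(),               # ingests CSV data, no upstream artifacts
    "StatisticsGen": {"ExampleGen"},   # consumes the example records
    "SchemaGen": {"StatisticsGen"},    # consumes the dataset statistics
}

# A topological sort yields an order in which every component runs only
# after all of the components it depends on.
execution_order = list(TopologicalSorter(artifact_dependencies).static_order())
print(execution_order)
```

If the dependencies contained a cycle, static_order would raise graphlib.CycleError, which is why the graph must be acyclic.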
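Task-based dependencies
Note: Using task-based dependencies is typically not recommended. Defining the execution graph with artifact dependencies lets you take advantage of the automatic artifact lineage tracking and caching features of TFX.
You can also define task-based dependencies using your component's add_upstream_node and add_downstream_node methods. add_upstream_node lets you specify that the current component must be executed after the specified component. add_downstream_node lets you specify that the current component must be executed before the specified component.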
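The semantics of the two methods can be sketched with a small stand-in class. The Node class, the execution_order helper, and the node names below are hypothetical and exist only for this illustration; only the method names add_upstream_node and add_downstream_node come from the text above, and the real TFX implementation may differ.

```python
from graphlib import TopologicalSorter


class Node:
    """Hypothetical stand-in for a pipeline component; not TFX's node class."""

    def __init__(self, node_id: str):
        self.id = node_id
        self.upstream = set()
        self.downstream = set()

    def add_upstream_node(self, other: "Node") -> None:
        # The current node must be executed after `other`.
        self.upstream.add(other)
        other.downstream.add(self)

    def add_downstream_node(self, other: "Node") -> None:
        # The current node must be executed before `other`.
        other.add_upstream_node(self)


def execution_order(nodes):
    """Orders node ids so that every node follows its upstream nodes."""
    graph = {node.id: {up.id for up in node.upstream} for node in nodes}
    return list(TopologicalSorter(graph).static_order())


prepare = Node("prepare_storage")
train = Node("train")
notify = Node("notify")

train.add_upstream_node(prepare)   # train runs after prepare_storage
train.add_downstream_node(notify)  # train runs before notify

order = execution_order([notify, train, prepare])
print(order)
```

In this sketch no artifact flows between the nodes, so nothing is recorded about what each node consumed. That is the trade-off the note above describes: task-based edges fix the order but give up lineage tracking and caching.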
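Pipeline templates
The easiest way to get a pipeline set up quickly, and to see how all the pieces fit together, is to use a template. Using templates is covered in Building a TFX Pipeline Locally.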
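Caching
TFX pipeline caching lets your pipeline skip components that have already been executed with the same set of inputs in a previous pipeline run. If caching is enabled, the pipeline attempts to match the signature of each component (the component together with its set of inputs) to one of this pipeline's previous component executions. If there is a match, the pipeline uses the component outputs from the previous run. If there is no match, the component is executed.
Do not use caching if your pipeline uses non-deterministic components. For example, if you create a component that generates a random number for your pipeline, enabling the cache causes this component to execute only once. In this example, subsequent runs use the first run's random number instead of generating a new one.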
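The random-number example can be reproduced with a toy executor. This is a minimal sketch, assuming the signature is a hash of the component name plus its inputs; the CachingRunner class and random_number function are hypothetical, and TFX's actual cache-key computation is not shown in this guide.

```python
import hashlib
import json
import random


class CachingRunner:
    """Toy executor that reuses outputs when component and inputs match."""

    def __init__(self, enable_cache: bool):
        self.enable_cache = enable_cache
        self._cache = {}
        self.executions = 0

    def run(self, component_name, func, inputs):
        # Assumed signature: the component name together with its inputs.
        payload = json.dumps(
            {"component": component_name, "inputs": inputs}, sort_keys=True)
        key = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        if self.enable_cache and key in self._cache:
            return self._cache[key]  # match: reuse the earlier output
        self.executions += 1         # no match: execute the component
        output = func(**inputs)
        self._cache[key] = output
        return output


def random_number(upper):
    # Non-deterministic: the same input can produce a different output.
    return random.randint(1, upper)


cached = CachingRunner(enable_cache=True)
first = cached.run("RandomNumber", random_number, {"upper": 10**9})
second = cached.run("RandomNumber", random_number, {"upper": 10**9})

uncached = CachingRunner(enable_cache=False)
uncached.run("RandomNumber", random_number, {"upper": 10**9})
uncached.run("RandomNumber", random_number, {"upper": 10**9})

print(first == second, cached.executions, uncached.executions)
```

With caching enabled the component executes once and the second call returns the stored number; with caching disabled it executes on every call. The cache cannot tell that the component's output is supposed to vary, because the signature covers only the component and its inputs.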
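Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.
Last updated 2021-08-16 UTC.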