Understanding TFX Pipelines
MLOps is the practice of applying DevOps practices to help automate, manage, and audit machine learning (ML) workflows. ML workflows include steps to:
- Prepare, analyze, and transform data.
- Train and evaluate a model.
- Deploy trained models to production.
- Track ML artifacts and understand their dependencies.
Managing these steps in an ad-hoc manner can be difficult and time-consuming.
TFX makes it easier to implement MLOps by providing a toolkit that helps you orchestrate your ML process on various orchestrators, such as Apache Airflow, Apache Beam, and Kubeflow Pipelines. By implementing your workflow as a TFX pipeline, you can:
- Automate your ML process, which lets you regularly retrain, evaluate, and deploy your model.
- Utilize distributed compute resources for processing large datasets and workloads.
- Increase the velocity of experimentation by running a pipeline with different sets of hyperparameters.
This guide describes the core concepts required to understand TFX pipelines.
Artifact
The outputs of steps in a TFX pipeline are called artifacts. Subsequent steps in your workflow may use these artifacts as inputs. In this way, TFX lets you transfer data between workflow steps.
For instance, the ExampleGen standard component emits serialized examples, which components such as the StatisticsGen standard component use as inputs.
Artifacts must be strongly typed with an artifact type registered in the ML Metadata store. Learn more about the concepts used in ML Metadata.
Artifact types have a name and define a schema for their properties. Artifact type names must be unique in your ML Metadata store. TFX provides several standard artifact types that describe complex data types and value types, such as string, integer, and float. You can reuse these artifact types or define custom artifact types that derive from Artifact, as sketched below.
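For illustration, here is a minimal sketch of a custom artifact type, following the pattern used by TFX's own standard artifact types; the `MyDataset` name and `record_count` property are hypothetical:

```python
# A minimal sketch of a custom artifact type; the type name and property
# below are hypothetical and exist only for illustration.
from tfx.types import artifact

class MyDataset(artifact.Artifact):
  """A custom artifact type for a proprietary dataset."""
  # TYPE_NAME must be unique in the ML Metadata store.
  TYPE_NAME = 'MyDataset'
  # Properties define the schema recorded for each artifact instance.
  PROPERTIES = {
      'record_count': artifact.Property(type=artifact.PropertyType.INT),
  }
```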
Parameter
Parameters are inputs to pipelines that are known before your pipeline is executed. Parameters let you change the behavior of a pipeline, or a part of a pipeline, through configuration instead of code.
For example, you can use parameters to run a pipeline with different sets of hyperparameters without changing the pipeline's code.
Using parameters lets you increase the velocity of experimentation by making it easier to run your pipeline with different sets of parameters.
Learn more about the RuntimeParameter class.
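As a hedged sketch, declaring a runtime parameter might look like the following, assuming you are running on an orchestrator that supports RuntimeParameter (such as Kubeflow Pipelines); the parameter name and default value are illustrative:

```python
# A minimal sketch of declaring a runtime parameter; the name and default
# are illustrative, and RuntimeParameter support varies by orchestrator.
from tfx.orchestration import data_types

train_steps = data_types.RuntimeParameter(
    name='train_steps',
    default=1000,
    ptype=int,
)
# The parameter can then be passed into a component's configuration so a
# pipeline run can override it, for example in a Trainer's training args.
```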
Component
A component is an implementation of an ML task that you can use as a step in your TFX pipeline. Components are composed of:
- A component specification, which defines the component's input and output artifacts, and the component's required parameters.
- An executor, which implements the code to perform a step in your ML workflow, such as ingesting and transforming data or training and evaluating a model.
- A component interface, which packages the component specification and executor for use in a pipeline.
TFX provides several standard components that you can use in your pipelines. If these components do not meet your needs, you can build custom components. Learn more about custom components.
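As a hedged sketch of how these three parts fit together, a Python-function-based custom component bundles them in a single definition (assuming a recent TFX release with the function-based component API; the component name and parameter are illustrative):

```python
# A minimal sketch of a function-based custom component, assuming tfx>=1.0.
# The typed signature acts as the component specification, the function body
# is the executor, and the decorator generates the component interface.
from tfx import v1 as tfx

@tfx.dsl.components.component
def EchoComponent(message: tfx.dsl.components.Parameter[str] = 'hello'):
  # Executor logic: the code that runs when this pipeline step executes.
  print(message)
```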
Pipeline
A TFX pipeline is a portable implementation of an ML workflow that can be run on various orchestrators, such as Apache Airflow, Apache Beam, and Kubeflow Pipelines. A pipeline is composed of component instances and input parameters.
Component instances produce artifacts as outputs and typically depend on artifacts produced by upstream component instances as inputs. The execution sequence for component instances is determined by creating a directed acyclic graph (DAG) of the artifact dependencies.
For example, consider a pipeline that does the following (a code sketch of wiring it up appears after the analysis below):
- Ingests data directly from a proprietary system using a custom component.
- Calculates statistics for the training data using the StatisticsGen standard component.
- Creates a data schema using the SchemaGen standard component.
- Checks the training data for anomalies using the ExampleValidator standard component.
- Performs feature engineering on the dataset using the Transform standard component.
- Trains a model using the Trainer standard component.
- Evaluates the trained model using the Evaluator component.
- If the model passes its evaluation, the pipeline enqueues the trained model to a proprietary deployment system using a custom component.

To determine the execution sequence for the component instances, TFX analyzes the artifact dependencies.
- The data ingestion component does not have any artifact dependencies, so it can be the first node in the graph.
- StatisticsGen depends on the examples produced by data ingestion, so it must be executed after data ingestion.
- SchemaGen depends on the statistics created by StatisticsGen, so it must be executed after StatisticsGen.
- ExampleValidator depends on the statistics created by StatisticsGen and the schema created by SchemaGen, so it must be executed after StatisticsGen and SchemaGen.
- Transform depends on the examples produced by data ingestion and the schema created by SchemaGen, so it must be executed after data ingestion and SchemaGen.
- Trainer depends on the examples produced by data ingestion, the schema created by SchemaGen, and the saved model produced by Transform, so it can be executed only after data ingestion, SchemaGen, and Transform.
- Evaluator depends on the examples produced by data ingestion and the saved model produced by the Trainer, so it must be executed after data ingestion and the Trainer.
- The custom deployer depends on the saved model produced by the Trainer and the analysis results created by the Evaluator, so it must be executed after the Trainer and the Evaluator.
Based on this analysis, an orchestrator:
- Runs the data ingestion, StatisticsGen, and SchemaGen component instances sequentially.
- Can run the ExampleValidator and Transform components in parallel, since they share input artifact dependencies and do not depend on each other's output.
- Runs the Trainer, Evaluator, and custom deployer component instances sequentially after the Transform component is complete.
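The following is a hedged sketch of wiring up the example pipeline above, assuming tfx>=1.0; MyIngestionComponent, MyDeployerComponent, and the module file paths are hypothetical stand-ins for the proprietary pieces:

```python
# A hedged sketch of the example pipeline described above, assuming tfx>=1.0.
# MyIngestionComponent and MyDeployerComponent are hypothetical custom
# components standing in for the proprietary ingestion/deployment systems.
from tfx import v1 as tfx

def create_pipeline(pipeline_name, pipeline_root, metadata_path):
  ingest = MyIngestionComponent()  # hypothetical custom component
  statistics_gen = tfx.components.StatisticsGen(
      examples=ingest.outputs['examples'])
  schema_gen = tfx.components.SchemaGen(
      statistics=statistics_gen.outputs['statistics'])
  example_validator = tfx.components.ExampleValidator(
      statistics=statistics_gen.outputs['statistics'],
      schema=schema_gen.outputs['schema'])
  transform = tfx.components.Transform(
      examples=ingest.outputs['examples'],
      schema=schema_gen.outputs['schema'],
      module_file='preprocessing.py')  # assumed user-provided module
  trainer = tfx.components.Trainer(
      examples=transform.outputs['transformed_examples'],
      transform_graph=transform.outputs['transform_graph'],
      schema=schema_gen.outputs['schema'],
      module_file='trainer.py',  # assumed user-provided module
      train_args=tfx.proto.TrainArgs(num_steps=1000),
      eval_args=tfx.proto.EvalArgs(num_steps=100))
  evaluator = tfx.components.Evaluator(
      examples=ingest.outputs['examples'],
      model=trainer.outputs['model'])
  deployer = MyDeployerComponent(  # hypothetical custom component
      model=trainer.outputs['model'],
      evaluation=evaluator.outputs['evaluation'])
  # TFX derives the execution DAG from these artifact dependencies; the
  # components list does not need to be in execution order.
  return tfx.dsl.Pipeline(
      pipeline_name=pipeline_name,
      pipeline_root=pipeline_root,
      components=[ingest, statistics_gen, schema_gen, example_validator,
                  transform, trainer, evaluator, deployer],
      metadata_connection_config=(
          tfx.orchestration.metadata.sqlite_metadata_connection_config(
              metadata_path)))
```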
Learn more about building a TFX pipeline.
TFX Pipeline Template
TFX Pipeline Templates make it easier to get started with pipeline development by providing a prebuilt pipeline that you can customize for your use case.
Learn more about customizing a TFX pipeline template.
Pipeline Run
A run is a single execution of a pipeline.
Orchestrator
An orchestrator is a system in which you can execute pipeline runs. TFX supports orchestrators such as Apache Airflow, Apache Beam, and Kubeflow Pipelines. TFX also uses the term DagRunner to refer to an implementation that supports an orchestrator.
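As a hedged sketch, executing a pipeline run with a DagRunner might look like the following, assuming tfx>=1.0 and the local runner; orchestrator-specific runners (for example, the Beam and Kubeflow DagRunners) follow the same pattern. create_pipeline is the function sketched in the Pipeline section, and the names and paths are illustrative:

```python
# A minimal sketch of executing a pipeline run with a DagRunner, assuming
# tfx>=1.0. Orchestrator-specific runners follow the same run() pattern.
from tfx import v1 as tfx

tfx.orchestration.LocalDagRunner().run(
    create_pipeline(                      # sketched in the Pipeline section
        pipeline_name='my_pipeline',      # illustrative name and paths
        pipeline_root='/tmp/pipeline_root',
        metadata_path='/tmp/metadata.db'))
```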