
Data validation using TFX Pipeline and TensorFlow Data Validation

In this notebook-based tutorial, we will create and run TFX pipelines to validate input data and create an ML model. This notebook is based on the TFX pipeline we built in Simple TFX Pipeline Tutorial. If you have not read that tutorial yet, you should read it before proceeding with this notebook.

The first task in any data science or ML project is to understand and clean the data, which includes:

  • Understanding the data types, distributions, and other information (e.g., mean value, or number of unique values) about each feature
  • Generating a preliminary schema that describes the data
  • Identifying anomalies and missing values in the data with respect to a given schema

In this tutorial, we will create two TFX pipelines.

First, we will create a pipeline to analyze the dataset and generate a preliminary schema of the given dataset. This pipeline will include two new components, StatisticsGen and SchemaGen.

Once we have a proper schema of the data, we will create a pipeline to train an ML classification model based on the pipeline from the previous tutorial. In this pipeline, we will use the schema from the first pipeline and a new component, ExampleValidator, to validate the input data.

The three new components, StatisticsGen, SchemaGen and ExampleValidator, are TFX components for data analysis and validation, and they are implemented using the TensorFlow Data Validation library.

Please see Understanding TFX Pipelines to learn more about various concepts in TFX.

Set Up

We first need to install the TFX Python package and download the dataset which we will use for our model.

Upgrade Pip

To avoid upgrading Pip in a system when running locally, we check that we are running in Colab. Local systems can, of course, be upgraded separately.

try:
  import colab  # The `colab` package is only available in a Colab runtime.
  !pip install -q --upgrade pip
except ImportError:
  pass  # Not running in Colab; leave the local Pip installation alone.

Install TFX

# TODO(b/178712706): Stop using legacy resolver after PIP issue is resolved.
!pip install -q -U --use-deprecated=legacy-resolver tfx==0.27.0

Did you restart the runtime?

If you are using Google Colab, the first time that you run the cell above, you must restart the runtime by clicking the "RESTART RUNTIME" button above or by using the "Runtime > Restart runtime ..." menu. This is because of the way that Colab loads packages.

Check the TensorFlow and TFX versions.

import tensorflow as tf
print('TensorFlow version: {}'.format(tf.__version__))
import tfx
print('TFX version: {}'.format(tfx.__version__))
TensorFlow version: 2.4.1
TFX version: 0.27.0

Set up variables

There are some variables used to define a pipeline. You can customize these variables as you want. By default, all output from the pipeline will be generated under the current directory.

import os

# We will create two pipelines. One for schema generation and one for training.
SCHEMA_PIPELINE_NAME = "penguin-tfdv-schema"
PIPELINE_NAME = "penguin-tfdv"

# Output directory to store artifacts generated from the pipeline.
SCHEMA_PIPELINE_ROOT = os.path.join('pipelines', SCHEMA_PIPELINE_NAME)
PIPELINE_ROOT = os.path.join('pipelines', PIPELINE_NAME)
# Path to a SQLite DB file to use as an MLMD storage.
SCHEMA_METADATA_PATH = os.path.join('metadata', SCHEMA_PIPELINE_NAME,
                                    'metadata.db')
METADATA_PATH = os.path.join('metadata', PIPELINE_NAME, 'metadata.db')

# Output directory where created models from the pipeline will be exported.
SERVING_MODEL_DIR = os.path.join('serving_model', PIPELINE_NAME)

from absl import logging
logging.set_verbosity(logging.INFO)  # Set default logging level.

Prepare example data

We will download the example dataset for use in our TFX pipeline. The dataset we are using is the Palmer Penguins dataset, which is also used in other TFX examples.

There are four numeric features in this dataset:

  • culmen_length_mm
  • culmen_depth_mm
  • flipper_length_mm
  • body_mass_g

All features were already normalized to the range [0,1]. We will build a classification model which predicts the species of penguins.

Because the TFX ExampleGen component reads inputs from a directory, we need to create a directory and copy the dataset to it.

import urllib.request
import tempfile

DATA_ROOT = tempfile.mkdtemp(prefix='tfx-data')  # Create a temporary directory.
_data_url = 'https://raw.githubusercontent.com/tensorflow/tfx/master/tfx/examples/penguin/data/penguins_processed.csv'
_data_filepath = os.path.join(DATA_ROOT, "data.csv")
urllib.request.urlretrieve(_data_url, _data_filepath)
('/tmp/tfx-datacjdu22yh/data.csv', <http.client.HTTPMessage at 0x7f93259774a8>)

Take a quick look at the CSV file.

!head {_data_filepath}
species,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g
0,0.2545454545454545,0.6666666666666666,0.15254237288135594,0.2916666666666667
0,0.26909090909090905,0.5119047619047618,0.23728813559322035,0.3055555555555556
0,0.29818181818181805,0.5833333333333334,0.3898305084745763,0.1527777777777778
0,0.16727272727272732,0.7380952380952381,0.3559322033898305,0.20833333333333334
0,0.26181818181818167,0.892857142857143,0.3050847457627119,0.2638888888888889
0,0.24727272727272717,0.5595238095238096,0.15254237288135594,0.2569444444444444
0,0.25818181818181823,0.773809523809524,0.3898305084745763,0.5486111111111112
0,0.32727272727272727,0.5357142857142859,0.1694915254237288,0.1388888888888889
0,0.23636363636363636,0.9642857142857142,0.3220338983050847,0.3055555555555556

You should be able to see five columns. species is one of 0, 1 or 2, and all other features should have values between 0 and 1. We will create a TFX pipeline to analyze this dataset.
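
If you want to double-check these properties programmatically, a quick look with pandas (which is available in this environment as a TFX dependency) might look like the sketch below. This is just an optional sanity check and not part of the pipeline.

import pandas as pd

# Quick sanity check of the raw CSV: label values and feature ranges.
penguins_df = pd.read_csv(_data_filepath)
print('species values:', sorted(penguins_df['species'].unique()))
print(penguins_df.drop(columns=['species']).agg(['min', 'max']))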

Generate a preliminary schema

TFX pipelines are defined using Python APIs. We will create a pipeline to generate a schema from the input examples automatically. This schema can be reviewed by a human and adjusted as needed. Once the schema is finalized, it can be used for training and example validation in later tasks.

In addition to CsvExampleGen, which was used in the Simple TFX Pipeline Tutorial, we will use StatisticsGen and SchemaGen:

  • StatisticsGen calculates statistics for the dataset.
  • SchemaGen examines the statistics and creates an initial data schema.

See the guides for each component or the TFX components tutorial to learn more about these components.

Write a pipeline definition

We define a function to create a TFX pipeline. A Pipeline object represents a TFX pipeline, which can be run using one of the pipeline orchestration systems that TFX supports.

from tfx.components import CsvExampleGen
from tfx.components import SchemaGen
from tfx.components import StatisticsGen
from tfx.orchestration import metadata
from tfx.orchestration import pipeline

def _create_schema_pipeline(pipeline_name: str, pipeline_root: str,
                            data_root: str,
                            metadata_path: str) -> pipeline.Pipeline:
  """Creates a pipeline for schema generation."""
  # Brings data into the pipeline.
  example_gen = CsvExampleGen(input_base=data_root)

  # NEW: Computes statistics over data for visualization and schema generation.
  statistics_gen = StatisticsGen(examples=example_gen.outputs['examples'])

  # NEW: Generates schema based on the generated statistics.
  schema_gen = SchemaGen(
      statistics=statistics_gen.outputs['statistics'], infer_feature_shape=True)

  components = [
      example_gen,
      statistics_gen,
      schema_gen,
  ]

  return pipeline.Pipeline(
      pipeline_name=pipeline_name,
      pipeline_root=pipeline_root,
      metadata_connection_config=metadata.sqlite_metadata_connection_config(
          metadata_path),
      components=components)
INFO:absl:tensorflow_ranking is not available: No module named 'tensorflow_ranking'
INFO:absl:tensorflow_text is not available: No module named 'tensorflow_text'
WARNING:absl:RuntimeParameter is only supported on Cloud-based DAG runner currently.

Run the pipeline

We will use LocalDagRunner as in the previous tutorial.

import os
from tfx.orchestration.local import local_dag_runner

local_dag_runner.LocalDagRunner().run(
  _create_schema_pipeline(
      pipeline_name=SCHEMA_PIPELINE_NAME,
      pipeline_root=SCHEMA_PIPELINE_ROOT,
      data_root=DATA_ROOT,
      metadata_path=SCHEMA_METADATA_PATH))
INFO:absl:Excluding no splits because exclude_splits is not set.
INFO:absl:Excluding no splits because exclude_splits is not set.
INFO:absl:Component CsvExampleGen is running.
INFO:absl:Running driver for CsvExampleGen
INFO:absl:MetadataStore with DB connection initialized
INFO:absl:select span and version = (0, None)
INFO:absl:latest span and version = (0, None)
INFO:absl:Running executor for CsvExampleGen
INFO:absl:Generating examples.
WARNING:apache_beam.runners.interactive.interactive_environment:Dependencies required for Interactive Beam PCollection visualization are not available, please use: `pip install apache-beam[interactive]` to install necessary dependencies to enable all data visualization features.
INFO:absl:Processing input csv data /tmp/tfx-datacjdu22yh/* to TFExample.
WARNING:apache_beam.io.tfrecordio:Couldn't find python-snappy so the implementation of _TFRecordUtil._masked_crc32c is not as fast as it could be.
INFO:absl:Examples generated.
INFO:absl:Running publisher for CsvExampleGen
INFO:absl:MetadataStore with DB connection initialized
INFO:absl:Component CsvExampleGen is finished.
INFO:absl:Component StatisticsGen is running.
INFO:absl:Running driver for StatisticsGen
INFO:absl:MetadataStore with DB connection initialized
INFO:absl:Running executor for StatisticsGen
INFO:absl:Generating statistics for split train.
INFO:absl:Statistics for split train written to pipelines/penguin-tfdv-schema/StatisticsGen/statistics/2/train.
INFO:absl:Generating statistics for split eval.
INFO:absl:Statistics for split eval written to pipelines/penguin-tfdv-schema/StatisticsGen/statistics/2/eval.
INFO:absl:Running publisher for StatisticsGen
INFO:absl:MetadataStore with DB connection initialized
INFO:absl:Component StatisticsGen is finished.
INFO:absl:Component SchemaGen is running.
INFO:absl:Running driver for SchemaGen
INFO:absl:MetadataStore with DB connection initialized
INFO:absl:Running executor for SchemaGen
INFO:absl:Processing schema from statistics for split train.
WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.6/site-packages/tensorflow_data_validation/utils/stats_util.py:247: tf_record_iterator (from tensorflow.python.lib.io.tf_record) is deprecated and will be removed in a future version.
Instructions for updating:
Use eager execution and: 
`tf.data.TFRecordDataset(path)`
WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.6/site-packages/tensorflow_data_validation/utils/stats_util.py:247: tf_record_iterator (from tensorflow.python.lib.io.tf_record) is deprecated and will be removed in a future version.
Instructions for updating:
Use eager execution and: 
`tf.data.TFRecordDataset(path)`
INFO:absl:Processing schema from statistics for split eval.
INFO:absl:Schema written to pipelines/penguin-tfdv-schema/SchemaGen/schema/3/schema.pbtxt.
INFO:absl:Running publisher for SchemaGen
INFO:absl:MetadataStore with DB connection initialized
INFO:absl:Component SchemaGen is finished.

You should see "INFO:absl:Component SchemaGen is finished." if the pipeline finished successfully.

We will examine the output of the pipeline to understand our dataset.

Review outputs of the pipeline

As explained in the previous tutorial, a TFX pipeline produces two kinds of outputs: artifacts and a metadata DB (MLMD) which contains metadata of artifacts and pipeline executions. We defined the location of these outputs in the cells above. By default, artifacts are stored under the pipelines directory and metadata is stored as an SQLite database under the metadata directory.
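
As a side note, the metadata DB is a standard MLMD store backed by the SQLite file we configured earlier. A minimal sketch of inspecting it directly with the ml_metadata client is shown below; this is optional, and the utility functions that follow use TFX helpers instead.

from ml_metadata import metadata_store
from ml_metadata.proto import metadata_store_pb2

connection_config = metadata_store_pb2.ConnectionConfig()
connection_config.sqlite.filename_uri = SCHEMA_METADATA_PATH
connection_config.sqlite.connection_mode = (
    metadata_store_pb2.SqliteMetadataSourceConfig.READWRITE_OPENCREATE)

store = metadata_store.MetadataStore(connection_config)
# List the artifact types and the number of artifacts recorded so far.
for artifact_type in store.get_artifact_types():
  print(artifact_type.name)
print('Total artifacts:', len(store.get_artifacts()))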

You can use MLMD APIs to locate these outputs programmatically. First, we will define some utility functions to search for the output artifacts that were just produced.

from ml_metadata.proto import metadata_store_pb2
from tfx.orchestration.portable.mlmd import execution_lib
from tfx.types import standard_component_specs

# TODO(b/171447278): Move these functions into the TFX library.

def get_latest_executions(metadata, pipeline_name):
  """Gets the execution objects for the latest run of the pipeline."""
  run_contexts = [
      c for c in metadata.store.get_contexts_by_type('run')
      if c.properties['pipeline_name'].string_value == pipeline_name
  ]
  if not run_contexts:
    return []
  # Pick the latest run context.
  latest_context = max(run_contexts,
                       key=lambda c: c.last_update_time_since_epoch)
  return metadata.store.get_executions_by_context(latest_context.id)

def get_artifacts_for_component_id(metadata, executions, component_id):
  """Gets output artifacts of the execution that matches the component id."""
  for execution in executions:
    if execution.properties['component_id'].string_value == component_id:
      return execution_lib.get_artifacts_dict(metadata, execution.id,
                                              metadata_store_pb2.Event.OUTPUT)
  raise RuntimeError(f'Execution not found for {component_id}')

from tfx.orchestration.experimental.interactive import visualizations

def visualize_artifacts(artifacts):
  """Visualizes artifacts using standard visualization modules."""
  for artifact in artifacts:
    visualization = visualizations.get_registry().get_visualization(
        artifact.type_name)
    if visualization:
      visualization.display(artifact)

from tfx.orchestration.experimental.interactive import standard_visualizations
standard_visualizations.register_standard_visualizations()

from tfx.types import standard_artifacts

Now we can examine the outputs from the pipeline execution.

metadata_connection_config = metadata.sqlite_metadata_connection_config(
              SCHEMA_METADATA_PATH)

with metadata.Metadata(metadata_connection_config) as metadata_handler:
  # Search component executions from the previous pipeline run.
  executions = get_latest_executions(metadata_handler, SCHEMA_PIPELINE_NAME)

  # Find output artifacts from MLMD.
  stat_gen_output = get_artifacts_for_component_id(metadata_handler, executions,
                                                   'StatisticsGen')
  stats_artifacts = stat_gen_output[standard_component_specs.STATISTICS_KEY]

  schema_gen_output = get_artifacts_for_component_id(metadata_handler,
                                                     executions, 'SchemaGen')
  schema_artifacts = schema_gen_output[standard_component_specs.SCHEMA_KEY]
INFO:absl:MetadataStore with DB connection initialized

It is time to examine the outputs from each component. As described above, TensorFlow Data Validation (TFDV) is used in StatisticsGen and SchemaGen, and TFDV also provides visualization of the outputs from these components.

In this tutorial, we will use the visualization helper methods in TFX, which use TFDV internally to render the output.

Examine the output from StatisticsGen

visualize_artifacts(stats_artifacts)

You can see various statistics for the input data. These statistics are supplied to SchemaGen to construct an initial schema of the data automatically.

Examine the output from SchemaGen

visualize_artifacts(schema_artifacts)

This schema is automatically inferred from the output of StatisticsGen. You should be able to see 4 FLOAT features and 1 INT feature.
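
As noted earlier, StatisticsGen and SchemaGen are implemented using TFDV. For reference, here is a minimal sketch of the equivalent TFDV calls run directly on the downloaded CSV file; this is purely illustrative, since the pipeline above already produced these artifacts.

import tensorflow_data_validation as tfdv

# Compute statistics directly from the CSV file, similar to what
# StatisticsGen does for the examples emitted by CsvExampleGen.
stats = tfdv.generate_statistics_from_csv(data_location=_data_filepath)
tfdv.visualize_statistics(stats)

# Infer an initial schema from the statistics, similar to SchemaGen.
inferred_schema = tfdv.infer_schema(statistics=stats)
tfdv.display_schema(inferred_schema)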

Export the schema for future use

We need to review and refine the generated schema. The reviewed schema needs to be persisted so that it can be used in subsequent pipelines for ML model training. In practice, you would typically add the schema file to your version control system. In this tutorial, we will just copy the schema to a predefined filesystem path for simplicity.

import shutil

_schema_filename = 'schema.pbtxt'
SCHEMA_PATH = 'schema'

os.makedirs(SCHEMA_PATH, exist_ok=True)
_generated_path = os.path.join(schema_artifacts[0].uri, _schema_filename)

# Copy the 'schema.pbtxt' file from the artifact uri to a predefined path.
shutil.copy(_generated_path, SCHEMA_PATH)
'schema/schema.pbtxt'

The schema file is stored in the Protocol Buffer text format and is an instance of the TensorFlow Metadata Schema proto.

print(f'Schema at {SCHEMA_PATH}-----')
!cat {SCHEMA_PATH}/*
Schema at schema-----
feature {
  name: "body_mass_g"
  type: FLOAT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "culmen_depth_mm"
  type: FLOAT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "culmen_length_mm"
  type: FLOAT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "flipper_length_mm"
  type: FLOAT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "species"
  type: INT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}

Be sure to review and, if needed, edit the schema definition. In this tutorial, we will just use the generated schema unchanged.
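
If you did need to adjust the schema, TFDV provides helpers to load, edit, and write it back. The snippet below is a hypothetical sketch; the domain constraint on species is only an illustration and is not applied in this tutorial.

import tensorflow_data_validation as tfdv
from tensorflow_metadata.proto.v0 import schema_pb2

schema_file = os.path.join(SCHEMA_PATH, _schema_filename)
curated_schema = tfdv.load_schema_text(schema_file)

# Example adjustment: constrain the label to the known range of species ids.
tfdv.set_domain(curated_schema, 'species', schema_pb2.IntDomain(min=0, max=2))

# Persist the curated schema back to the exported location (shown for
# completeness; not done in this tutorial).
tfdv.write_schema_text(curated_schema, schema_file)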

Validate input examples and train an ML model

We will go back to the pipeline that we created in the Simple TFX Pipeline Tutorial to train an ML model, and we will use the generated schema when writing the model training code.

We will also add an ExampleValidator component which will look for anomalies and missing values in the incoming dataset with respect to the schema.

Write model training code

We need to write the model code as we did in the Simple TFX Pipeline Tutorial.

The model itself is the same as in the previous tutorial, but this time we will use the schema generated from the previous pipeline instead of specifying features manually. Most of the code was not changed. The only difference is that we do not need to specify the names and types of features in this file. Instead, we read them from the schema file.

_trainer_module_file = 'penguin_trainer.py'
%%writefile {_trainer_module_file}

from typing import List
from absl import logging
import tensorflow as tf
from tensorflow import keras
from tensorflow_transform.tf_metadata import schema_utils

from tfx.components.trainer.executor import TrainerFnArgs
from tfx.components.trainer.fn_args_utils import DataAccessor
from tfx.utils import io_utils
from tfx_bsl.tfxio import dataset_options
from tensorflow_metadata.proto.v0 import schema_pb2

# We don't need to specify _FEATURE_KEYS and _FEATURE_SPEC anymore.
# That information can be read from the given schema file.

_LABEL_KEY = 'species'

_TRAIN_BATCH_SIZE = 20
_EVAL_BATCH_SIZE = 10

def _input_fn(file_pattern: List[str],
              data_accessor: DataAccessor,
              schema: schema_pb2.Schema,
              batch_size: int = 200) -> tf.data.Dataset:
  """Generates features and label for training.

  Args:
    file_pattern: List of paths or patterns of input tfrecord files.
    data_accessor: DataAccessor for converting input to RecordBatch.
    schema: schema of the input data.
    batch_size: the number of consecutive elements of the returned dataset to
      combine in a single batch.

  Returns:
    A dataset that contains (features, indices) tuple where features is a
      dictionary of Tensors, and indices is a single Tensor of label indices.
  """
  return data_accessor.tf_dataset_factory(
      file_pattern,
      dataset_options.TensorFlowDatasetOptions(
          batch_size=batch_size, label_key=_LABEL_KEY),
      schema=schema).repeat()


def _build_keras_model(schema: schema_pb2.Schema) -> tf.keras.Model:
  """Creates a DNN Keras model for classifying penguin data.

  Args:
    schema: the schema of the input data, used to derive the feature names.

  Returns:
    A Keras Model.
  """
  # The model below is built with Functional API, please refer to
  # https://www.tensorflow.org/guide/keras/overview for all API options.

  # ++ Changed code: Uses all features in the schema except the label.
  feature_keys = [f.name for f in schema.feature if f.name != _LABEL_KEY]
  inputs = [keras.layers.Input(shape=(1,), name=f) for f in feature_keys]
  # ++ End of the changed code.

  d = keras.layers.concatenate(inputs)
  for _ in range(2):
    d = keras.layers.Dense(8, activation='relu')(d)
  outputs = keras.layers.Dense(3)(d)

  model = keras.Model(inputs=inputs, outputs=outputs)
  model.compile(
      optimizer=keras.optimizers.Adam(1e-2),
      loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
      metrics=[keras.metrics.SparseCategoricalAccuracy()])

  model.summary(print_fn=logging.info)
  return model


# TFX Trainer will call this function.
def run_fn(fn_args: TrainerFnArgs):
  """Train the model based on given args.

  Args:
    fn_args: Holds args used to train the model as name/value pairs.
  """

  # ++ Changed code: Reads in schema file passed to the Trainer component.
  schema = io_utils.parse_pbtxt_file(fn_args.schema_path, schema_pb2.Schema())
  # ++ End of the changed code.

  train_dataset = _input_fn(
      fn_args.train_files,
      fn_args.data_accessor,
      schema,
      batch_size=_TRAIN_BATCH_SIZE)
  eval_dataset = _input_fn(
      fn_args.eval_files,
      fn_args.data_accessor,
      schema,
      batch_size=_EVAL_BATCH_SIZE)

  model = _build_keras_model(schema)
  model.fit(
      train_dataset,
      steps_per_epoch=fn_args.train_steps,
      validation_data=eval_dataset,
      validation_steps=fn_args.eval_steps)

  # The result of the training should be saved in `fn_args.serving_model_dir`
  # directory.
  model.save(fn_args.serving_model_dir, save_format='tf')
Writing penguin_trainer.py

Now you have completed all preparation steps to build a TFX pipeline for model training.

Write a pipeline definition

We will add two new components, Importer and ExampleValidator. Importer brings an external file into the TFX pipeline; in this case, it is the file containing the schema definition. ExampleValidator will examine the input data and validate whether all input data conforms to the data schema we provided.

from tfx.components import CsvExampleGen
from tfx.components import ExampleValidator
from tfx.components import Pusher
from tfx.components import StatisticsGen
from tfx.components import Trainer
from tfx.components.trainer.executor import GenericExecutor
from tfx.dsl.components.base import executor_spec
from tfx.dsl.components.common.importer import Importer
from tfx.orchestration import metadata
from tfx.orchestration import pipeline
from tfx.proto import pusher_pb2
from tfx.proto import trainer_pb2
from tfx.types import standard_artifacts


def _create_pipeline(pipeline_name: str, pipeline_root: str, data_root: str,
                     schema_path: str, module_file: str, serving_model_dir: str,
                     metadata_path: str) -> pipeline.Pipeline:
  """Creates a pipeline using predefined schema with TFX."""
  # Brings data into the pipeline.
  example_gen = CsvExampleGen(input_base=data_root)

  # Computes statistics over data for visualization and example validation.
  statistics_gen = StatisticsGen(examples=example_gen.outputs['examples'])

  # NEW: Import the schema.
  schema_importer = Importer(
      instance_name='import_schema',
      source_uri=schema_path,
      artifact_type=standard_artifacts.Schema)

  # NEW: Performs anomaly detection based on statistics and data schema.
  example_validator = ExampleValidator(
      statistics=statistics_gen.outputs['statistics'],
      schema=schema_importer.outputs['result'])

  # Uses user-provided Python function that trains a model.
  trainer = Trainer(
      module_file=module_file,
      custom_executor_spec=executor_spec.ExecutorClassSpec(GenericExecutor),
      examples=example_gen.outputs['examples'],
      schema=schema_importer.outputs['result'],  # Pass the imported schema.
      train_args=trainer_pb2.TrainArgs(num_steps=100),
      eval_args=trainer_pb2.EvalArgs(num_steps=5))

  # Pushes the model to a filesystem destination.
  pusher = Pusher(
      model=trainer.outputs['model'],
      push_destination=pusher_pb2.PushDestination(
          filesystem=pusher_pb2.PushDestination.Filesystem(
              base_directory=serving_model_dir)))

  components = [
      example_gen,

      # NEW: The following three components were added to the pipeline.
      statistics_gen,
      schema_importer,
      example_validator,

      trainer,
      pusher,
  ]

  return pipeline.Pipeline(
      pipeline_name=pipeline_name,
      pipeline_root=pipeline_root,
      metadata_connection_config=metadata.sqlite_metadata_connection_config(
          metadata_path),
      components=components)

Run the pipeline

import os
from tfx.orchestration.local import local_dag_runner

local_dag_runner.LocalDagRunner().run(
  _create_pipeline(
      pipeline_name=PIPELINE_NAME,
      pipeline_root=PIPELINE_ROOT,
      data_root=DATA_ROOT,
      schema_path=SCHEMA_PATH,
      module_file=_trainer_module_file,
      serving_model_dir=SERVING_MODEL_DIR,
      metadata_path=METADATA_PATH))
INFO:absl:Excluding no splits because exclude_splits is not set.
WARNING:absl:`instance_name` is deprecated, please set the node id directly using `with_id()` or the `.id` setter.
INFO:absl:Excluding no splits because exclude_splits is not set.
INFO:absl:Component CsvExampleGen is running.
INFO:absl:Running driver for CsvExampleGen
INFO:absl:MetadataStore with DB connection initialized
INFO:absl:select span and version = (0, None)
INFO:absl:latest span and version = (0, None)
INFO:absl:Running executor for CsvExampleGen
INFO:absl:Generating examples.
INFO:absl:Processing input csv data /tmp/tfx-datacjdu22yh/* to TFExample.
INFO:absl:Examples generated.
INFO:absl:Running publisher for CsvExampleGen
INFO:absl:MetadataStore with DB connection initialized
INFO:absl:Component CsvExampleGen is finished.
INFO:absl:Component Importer.import_schema is running.
INFO:absl:Running driver for Importer.import_schema
INFO:absl:MetadataStore with DB connection initialized
INFO:absl:Processing source uri: schema, properties: {}, custom_properties: {}
INFO:absl:Running executor for Importer.import_schema
INFO:absl:Running publisher for Importer.import_schema
INFO:absl:MetadataStore with DB connection initialized
INFO:absl:Component Importer.import_schema is finished.
INFO:absl:Component StatisticsGen is running.
INFO:absl:Running driver for StatisticsGen
INFO:absl:MetadataStore with DB connection initialized
INFO:absl:Running executor for StatisticsGen
INFO:absl:Generating statistics for split train.
INFO:absl:Statistics for split train written to pipelines/penguin-tfdv/StatisticsGen/statistics/3/train.
INFO:absl:Generating statistics for split eval.
INFO:absl:Statistics for split eval written to pipelines/penguin-tfdv/StatisticsGen/statistics/3/eval.
INFO:absl:Running publisher for StatisticsGen
INFO:absl:MetadataStore with DB connection initialized
INFO:absl:Component StatisticsGen is finished.
INFO:absl:Component Trainer is running.
INFO:absl:Running driver for Trainer
INFO:absl:MetadataStore with DB connection initialized
INFO:absl:Running executor for Trainer
INFO:absl:Train on the 'train' split when train_args.splits is not set.
INFO:absl:Evaluate on the 'eval' split when eval_args.splits is not set.
INFO:absl:Loading penguin_trainer.py because it has not been loaded before.
INFO:absl:Training model.
INFO:absl:Feature body_mass_g has a shape dim {
  size: 1
}
. Setting to DenseTensor.
INFO:absl:Feature culmen_depth_mm has a shape dim {
  size: 1
}
. Setting to DenseTensor.
INFO:absl:Feature culmen_length_mm has a shape dim {
  size: 1
}
. Setting to DenseTensor.
INFO:absl:Feature flipper_length_mm has a shape dim {
  size: 1
}
. Setting to DenseTensor.
INFO:absl:Feature species has a shape dim {
  size: 1
}
. Setting to DenseTensor.
INFO:absl:Feature body_mass_g has a shape dim {
  size: 1
}
. Setting to DenseTensor.
INFO:absl:Feature culmen_depth_mm has a shape dim {
  size: 1
}
. Setting to DenseTensor.
INFO:absl:Feature culmen_length_mm has a shape dim {
  size: 1
}
. Setting to DenseTensor.
INFO:absl:Feature flipper_length_mm has a shape dim {
  size: 1
}
. Setting to DenseTensor.
INFO:absl:Feature species has a shape dim {
  size: 1
}
. Setting to DenseTensor.
INFO:absl:Feature body_mass_g has a shape dim {
  size: 1
}
. Setting to DenseTensor.
INFO:absl:Feature culmen_depth_mm has a shape dim {
  size: 1
}
. Setting to DenseTensor.
INFO:absl:Feature culmen_length_mm has a shape dim {
  size: 1
}
. Setting to DenseTensor.
INFO:absl:Feature flipper_length_mm has a shape dim {
  size: 1
}
. Setting to DenseTensor.
INFO:absl:Feature species has a shape dim {
  size: 1
}
. Setting to DenseTensor.
INFO:absl:Feature body_mass_g has a shape dim {
  size: 1
}
. Setting to DenseTensor.
INFO:absl:Feature culmen_depth_mm has a shape dim {
  size: 1
}
. Setting to DenseTensor.
INFO:absl:Feature culmen_length_mm has a shape dim {
  size: 1
}
. Setting to DenseTensor.
INFO:absl:Feature flipper_length_mm has a shape dim {
  size: 1
}
. Setting to DenseTensor.
INFO:absl:Feature species has a shape dim {
  size: 1
}
. Setting to DenseTensor.
INFO:absl:Model: "model"
INFO:absl:__________________________________________________________________________________________________
INFO:absl:Layer (type)                    Output Shape         Param #     Connected to                     
INFO:absl:==================================================================================================
INFO:absl:body_mass_g (InputLayer)        [(None, 1)]          0                                            
INFO:absl:__________________________________________________________________________________________________
INFO:absl:culmen_depth_mm (InputLayer)    [(None, 1)]          0                                            
INFO:absl:__________________________________________________________________________________________________
INFO:absl:culmen_length_mm (InputLayer)   [(None, 1)]          0                                            
INFO:absl:__________________________________________________________________________________________________
INFO:absl:flipper_length_mm (InputLayer)  [(None, 1)]          0                                            
INFO:absl:__________________________________________________________________________________________________
INFO:absl:concatenate (Concatenate)       (None, 4)            0           body_mass_g[0][0]                
INFO:absl:                                                                 culmen_depth_mm[0][0]            
INFO:absl:                                                                 culmen_length_mm[0][0]           
INFO:absl:                                                                 flipper_length_mm[0][0]          
INFO:absl:__________________________________________________________________________________________________
INFO:absl:dense (Dense)                   (None, 8)            40          concatenate[0][0]                
INFO:absl:__________________________________________________________________________________________________
INFO:absl:dense_1 (Dense)                 (None, 8)            72          dense[0][0]                      
INFO:absl:__________________________________________________________________________________________________
INFO:absl:dense_2 (Dense)                 (None, 3)            27          dense_1[0][0]                    
INFO:absl:==================================================================================================
INFO:absl:Total params: 139
INFO:absl:Trainable params: 139
INFO:absl:Non-trainable params: 0
INFO:absl:__________________________________________________________________________________________________
100/100 [==============================] - 2s 9ms/step - loss: 0.7374 - sparse_categorical_accuracy: 0.7421 - val_loss: 0.2698 - val_sparse_categorical_accuracy: 0.9600
INFO:tensorflow:Assets written to: pipelines/penguin-tfdv/Trainer/model/4/serving_model_dir/assets
INFO:tensorflow:Assets written to: pipelines/penguin-tfdv/Trainer/model/4/serving_model_dir/assets
INFO:absl:Training complete. Model written to pipelines/penguin-tfdv/Trainer/model/4/serving_model_dir. ModelRun written to pipelines/penguin-tfdv/Trainer/model_run/4
INFO:absl:Running publisher for Trainer
INFO:absl:MetadataStore with DB connection initialized
INFO:absl:Component Trainer is finished.
INFO:absl:Component ExampleValidator is running.
INFO:absl:Running driver for ExampleValidator
INFO:absl:MetadataStore with DB connection initialized
INFO:absl:Running executor for ExampleValidator
INFO:absl:Validating schema against the computed statistics for split train.
INFO:absl:Validation complete for split train. Anomalies written to pipelines/penguin-tfdv/ExampleValidator/anomalies/5/train.
INFO:absl:Validating schema against the computed statistics for split eval.
INFO:absl:Validation complete for split eval. Anomalies written to pipelines/penguin-tfdv/ExampleValidator/anomalies/5/eval.
INFO:absl:Running publisher for ExampleValidator
INFO:absl:MetadataStore with DB connection initialized
INFO:absl:Component ExampleValidator is finished.
INFO:absl:Component Pusher is running.
INFO:absl:Running driver for Pusher
INFO:absl:MetadataStore with DB connection initialized
INFO:absl:Running executor for Pusher
WARNING:absl:Pusher is going to push the model without validation. Consider using Evaluator or InfraValidator in your pipeline.
INFO:absl:Model version: 1619774279
INFO:absl:Model written to serving path serving_model/penguin-tfdv/1619774279.
INFO:absl:Model pushed to pipelines/penguin-tfdv/Pusher/pushed_model/6.
INFO:absl:Running publisher for Pusher
INFO:absl:MetadataStore with DB connection initialized
INFO:absl:Component Pusher is finished.

You should see "INFO:absl:Component Pusher is finished." if the pipeline finished successfully.

Examine outputs of the pipeline

We have trained the classification model for penguins, and we have also validated the input examples with the ExampleValidator component. We can analyze the output from ExampleValidator as we did with the previous pipeline.

metadata_connection_config = metadata.sqlite_metadata_connection_config(
              METADATA_PATH)

with metadata.Metadata(metadata_connection_config) as metadata_handler:
  # Search component executions from the previous pipeline run.
  executions = get_latest_executions(metadata_handler, PIPELINE_NAME)

  # Find output artifacts from MLMD.
  stat_gen_output = get_artifacts_for_component_id(metadata_handler, executions,
                                                   'ExampleValidator')
  anomalies_artifacts = stat_gen_output[standard_component_specs.ANOMALIES_KEY]
INFO:absl:MetadataStore with DB connection initialized

ExampleAnomalies from the ExampleValidator can be visualized as well.

visualize_artifacts(anomalies_artifacts)

You should see "No anomalies found" for each split of examples. Because this pipeline uses the same data that was used for schema generation, no anomalies are expected here. If you run this pipeline repeatedly with new incoming data, ExampleValidator should be able to find any discrepancies between the new data and the existing schema.

If any anomalies were found, you should review your data to check whether any examples violate your assumptions. Outputs from other components, like StatisticsGen, might be useful. Note, however, that any anomalies which are found will NOT block further pipeline executions.
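
To see how such a discrepancy surfaces without re-running the whole pipeline, you can validate the statistics of modified data against the exported schema using TFDV directly. The snippet below is a hypothetical sketch: it drops one feature from a copy of the data to simulate broken input.

import tempfile

import pandas as pd
import tensorflow_data_validation as tfdv

# Load the schema that was exported earlier in this tutorial.
schema = tfdv.load_schema_text(os.path.join(SCHEMA_PATH, _schema_filename))

# Write a modified copy of the data (with one feature dropped) to a separate
# directory so it does not interfere with DATA_ROOT.
broken_dir = tempfile.mkdtemp(prefix='tfx-broken-data')
broken_path = os.path.join(broken_dir, 'data.csv')
pd.read_csv(_data_filepath).drop(columns=['body_mass_g']).to_csv(
    broken_path, index=False)

# Validate statistics of the modified data against the schema; the missing
# feature should be reported as an anomaly.
broken_stats = tfdv.generate_statistics_from_csv(data_location=broken_path)
anomalies = tfdv.validate_statistics(statistics=broken_stats, schema=schema)
tfdv.display_anomalies(anomalies)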

Next steps

You can find more resources at https://www.tensorflow.org/tfx/tutorials.

Please see Understanding TFX Pipelines to learn more about various concepts in TFX.