TensorFlow Model Analysis

An Example of a Key Component of TensorFlow Extended (TFX)

TensorFlow Model Analysis (TFMA) is a library for performing model evaluation across different slices of data. TFMA performs its computations in a distributed manner over large amounts of data using Apache Beam.

This example colab notebook illustrates how TFMA can be used to investigate and visualize the performance of a model with respect to characteristics of the dataset. We'll use a model that we trained previously, and now you get to play with the results! The model we trained was for the Chicago Taxi Example, which uses the Taxi Trips dataset released by the City of Chicago. Explore the full dataset in the BigQuery UI.

As a modeler and developer, think about how this data is used and the potential benefits and harm a model's predictions can cause. A model like this could reinforce societal biases and disparities. Is a feature relevant to the problem you want to solve or will it introduce bias? For more information, read about ML fairness.

The columns in the dataset are:

pickup_community_area   fare                    trip_start_month
trip_start_hour         trip_start_day          trip_start_timestamp
pickup_latitude         pickup_longitude        dropoff_latitude
dropoff_longitude       trip_miles              pickup_census_tract
dropoff_census_tract    payment_type            company
trip_seconds            dropoff_community_area  tips

Install Jupyter Extensions

jupyter nbextension enable --py widgetsnbextension --sys-prefix 
jupyter nbextension install --py --symlink tensorflow_model_analysis --sys-prefix 
jupyter nbextension enable --py tensorflow_model_analysis --sys-prefix 

Install TensorFlow Model Analysis (TFMA)

This will pull in all the dependencies, and will take a minute.

Note: to ensure that all dependencies are installed properly, you may need to re-run this install step several times before it completes without errors.

# This setup was tested with TF 2.3 and TFMA 0.24 (using colab), but it should
# also work with the latest release.
import sys

# Confirm that we're using Python 3
assert sys.version_info.major==3, 'This notebook must be run using Python 3.'

print('Installing TensorFlow')
import tensorflow as tf
print('TF version: {}'.format(tf.__version__))

print('Installing Tensorflow Model Analysis and Dependencies')
!pip install -q tensorflow_model_analysis
import apache_beam as beam
print('Beam version: {}'.format(beam.__version__))
import tensorflow_model_analysis as tfma
print('TFMA version: {}'.format(tfma.__version__))
Installing TensorFlow
TF version: 2.3.1
Installing Tensorflow Model Analysis and Dependencies
ERROR: After October 2020 you may experience errors when installing or updating packages. This is because pip will change the way that it resolves dependency conflicts.

We recommend you use --use-feature=2020-resolver to test your packages with the new resolver before it becomes the default.

google-api-python-client 1.12.3 requires httplib2<1dev,>=0.15.0, but you'll have httplib2 0.9.2 which is incompatible.
Beam version: 2.24.0
TFMA version: 0.24.3

Load The Files

We'll download a tar file that has everything we need. That includes:

  • Training and evaluation datasets
  • Data schema
  • Training and serving saved models (keras and estimator) and eval saved models (estimator).
# Download the tar file from GCP and extract it
import io, os, tempfile
TAR_NAME = 'saved_models-2.2'
BASE_DIR = tempfile.mkdtemp()
DATA_DIR = os.path.join(BASE_DIR, TAR_NAME, 'data')
MODELS_DIR = os.path.join(BASE_DIR, TAR_NAME, 'models')
SCHEMA = os.path.join(BASE_DIR, TAR_NAME, 'schema.pbtxt')
OUTPUT_DIR = os.path.join(BASE_DIR, 'output')

!curl -O https://storage.googleapis.com/artifacts.tfx-oss-public.appspot.com/datasets/{TAR_NAME}.tar
!tar xf {TAR_NAME}.tar
!mv {TAR_NAME} {BASE_DIR}
!rm {TAR_NAME}.tar

print("Here's what we downloaded:")
!ls -R {BASE_DIR}
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 6800k  100 6800k    0     0  25.4M      0 --:--:-- --:--:-- --:--:-- 25.3M
Here's what we downloaded:
/tmp/tmpj6t03cp6:
saved_models-2.2

/tmp/tmpj6t03cp6/saved_models-2.2:
data  models  schema.pbtxt

/tmp/tmpj6t03cp6/saved_models-2.2/data:
eval  train

/tmp/tmpj6t03cp6/saved_models-2.2/data/eval:
data.csv

/tmp/tmpj6t03cp6/saved_models-2.2/data/train:
data.csv

/tmp/tmpj6t03cp6/saved_models-2.2/models:
estimator  keras

/tmp/tmpj6t03cp6/saved_models-2.2/models/estimator:
eval_model_dir  serving_model_dir

/tmp/tmpj6t03cp6/saved_models-2.2/models/estimator/eval_model_dir:
1591221811

/tmp/tmpj6t03cp6/saved_models-2.2/models/estimator/eval_model_dir/1591221811:
saved_model.pb  tmp.pbtxt  variables

/tmp/tmpj6t03cp6/saved_models-2.2/models/estimator/eval_model_dir/1591221811/variables:
variables.data-00000-of-00001  variables.index

/tmp/tmpj6t03cp6/saved_models-2.2/models/estimator/serving_model_dir:
checkpoint
eval_chicago-taxi-eval
events.out.tfevents.1591221780.my-pipeline-b57vp-237544850
export
graph.pbtxt
model.ckpt-100.data-00000-of-00001
model.ckpt-100.index
model.ckpt-100.meta

/tmp/tmpj6t03cp6/saved_models-2.2/models/estimator/serving_model_dir/eval_chicago-taxi-eval:
events.out.tfevents.1591221799.my-pipeline-b57vp-237544850

/tmp/tmpj6t03cp6/saved_models-2.2/models/estimator/serving_model_dir/export:
chicago-taxi

/tmp/tmpj6t03cp6/saved_models-2.2/models/estimator/serving_model_dir/export/chicago-taxi:
1591221801

/tmp/tmpj6t03cp6/saved_models-2.2/models/estimator/serving_model_dir/export/chicago-taxi/1591221801:
saved_model.pb  variables

/tmp/tmpj6t03cp6/saved_models-2.2/models/estimator/serving_model_dir/export/chicago-taxi/1591221801/variables:
variables.data-00000-of-00001  variables.index

/tmp/tmpj6t03cp6/saved_models-2.2/models/keras:
0  1  2

/tmp/tmpj6t03cp6/saved_models-2.2/models/keras/0:
saved_model.pb  variables

/tmp/tmpj6t03cp6/saved_models-2.2/models/keras/0/variables:
variables.data-00000-of-00001  variables.index

/tmp/tmpj6t03cp6/saved_models-2.2/models/keras/1:
saved_model.pb  variables

/tmp/tmpj6t03cp6/saved_models-2.2/models/keras/1/variables:
variables.data-00000-of-00001  variables.index

/tmp/tmpj6t03cp6/saved_models-2.2/models/keras/2:
saved_model.pb  variables

/tmp/tmpj6t03cp6/saved_models-2.2/models/keras/2/variables:
variables.data-00000-of-00001  variables.index

Parse the Schema

Among the things we downloaded was a schema for our data that was created by TensorFlow Data Validation. Let's parse that now so that we can use it with TFMA.

import tensorflow as tf
from google.protobuf import text_format
from tensorflow.python.lib.io import file_io
from tensorflow_metadata.proto.v0 import schema_pb2
from tensorflow.core.example import example_pb2

schema = schema_pb2.Schema()
contents = file_io.read_file_to_string(SCHEMA)
schema = text_format.Parse(contents, schema)
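
As an optional check, we can list the feature names and types that were parsed from the schema. This is just a small sketch for inspecting the schema proto and isn't required for the rest of the notebook:

# Optional check: print the name and type of each feature in the parsed schema.
for feature in schema.feature:
  print(feature.name, schema_pb2.FeatureType.Name(feature.type))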

Use the Schema to Create TFRecords

We need to give TFMA access to our dataset, so let's create a TFRecords file. We can use our schema to create it, since it gives us the correct type for each feature.

import csv

datafile = os.path.join(DATA_DIR, 'eval', 'data.csv')
reader = csv.DictReader(open(datafile, 'r'))
examples = []
for line in reader:
  example = example_pb2.Example()
  for feature in schema.feature:
    key = feature.name
    if feature.type == schema_pb2.FLOAT:
      example.features.feature[key].float_list.value[:] = (
          [float(line[key])] if len(line[key]) > 0 else [])
    elif feature.type == schema_pb2.INT:
      example.features.feature[key].int64_list.value[:] = (
          [int(line[key])] if len(line[key]) > 0 else [])
    elif feature.type == schema_pb2.BYTES:
      example.features.feature[key].bytes_list.value[:] = (
          [line[key].encode('utf8')] if len(line[key]) > 0 else [])
  # Add a new column 'big_tipper' that indicates if tips was > 20% of the fare. 
  # TODO(b/157064428): Remove after label transformation is supported for Keras.
  big_tipper = float(line['tips']) > float(line['fare']) * 0.2
  example.features.feature['big_tipper'].float_list.value[:] = [big_tipper]
  examples.append(example)

tfrecord_file = os.path.join(BASE_DIR, 'train_data.rio')
with tf.io.TFRecordWriter(tfrecord_file) as writer:
  for example in examples:
    writer.write(example.SerializeToString())

!ls {tfrecord_file}
/tmp/tmpj6t03cp6/train_data.rio
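
As an optional check, we can read one record back and confirm that it parses as a tf.train.Example. This sketch simply lists a few of the feature keys in the first record:

# Optional check: read the first serialized record back from the TFRecord file
# and print a few of its feature keys.
for serialized in tf.data.TFRecordDataset(tfrecord_file).take(1):
  parsed = tf.train.Example.FromString(serialized.numpy())
  print(sorted(parsed.features.feature.keys())[:5])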

Setup and Run TFMA

TFMA supports a number of different model types, including TF keras models, models based on generic TF2 signature APIs, as well as TF estimator based models. The get_started guide has the full list of model types supported and any restrictions. For this example we are going to show how to configure a keras based model as well as an estimator based model that was saved as an EvalSavedModel. See the FAQ for examples of other configurations.

TFMA provides support for calculating metrics that were used at training time (i.e. built-in metrics) as well as metrics defined after the model was saved as part of the TFMA configuration settings. For our keras setup we will demonstrate adding our metrics and plots manually as part of our configuration (see the metrics guide for information on the metrics and plots that are supported). For the estimator setup we will use the built-in metrics that were saved with the model. Our setups also include a number of slicing specs, which are discussed in more detail in the following sections.

After creating a tfma.EvalConfig and tfma.EvalSharedModel we can then run TFMA using tfma.run_model_analysis. This will create a tfma.EvalResult which we can use later for rendering our metrics and plots.

Keras

import tensorflow_model_analysis as tfma

# Setup tfma.EvalConfig settings
keras_eval_config = text_format.Parse("""
  ## Model information
  model_specs {
    # For keras (and serving models) we need to add a `label_key`.
    label_key: "big_tipper"
  }

  ## Post training metric information. These will be merged with any built-in
  ## metrics from training.
  metrics_specs {
    metrics { class_name: "ExampleCount" }
    metrics { class_name: "BinaryAccuracy" }
    metrics { class_name: "BinaryCrossentropy" }
    metrics { class_name: "AUC" }
    metrics { class_name: "AUCPrecisionRecall" }
    metrics { class_name: "Precision" }
    metrics { class_name: "Recall" }
    metrics { class_name: "MeanLabel" }
    metrics { class_name: "MeanPrediction" }
    metrics { class_name: "Calibration" }
    metrics { class_name: "CalibrationPlot" }
    metrics { class_name: "ConfusionMatrixPlot" }
    # ... add additional metrics and plots ...
  }

  ## Slicing information
  slicing_specs {}  # overall slice
  slicing_specs {
    feature_keys: ["trip_start_hour"]
  }
  slicing_specs {
    feature_keys: ["trip_start_day"]
  }
  slicing_specs {
    feature_values: {
      key: "trip_start_month"
      value: "1"
    }
  }
  slicing_specs {
    feature_keys: ["trip_start_hour", "trip_start_day"]
  }
""", tfma.EvalConfig())

# Create a tfma.EvalSharedModel that points at our keras model.
keras_model_path = os.path.join(MODELS_DIR, 'keras', '2')
keras_eval_shared_model = tfma.default_eval_shared_model(
    eval_saved_model_path=keras_model_path,
    eval_config=keras_eval_config)

keras_output_path = os.path.join(OUTPUT_DIR, 'keras')

# Run TFMA
keras_eval_result = tfma.run_model_analysis(
    eval_shared_model=keras_eval_shared_model,
    eval_config=keras_eval_config,
    data_location=tfrecord_file,
    output_path=keras_output_path)
WARNING:absl:Tensorflow version (2.3.1) found. Note that TFMA support for TF 2.0 is currently in beta
WARNING:apache_beam.runners.interactive.interactive_environment:Dependencies required for Interactive Beam PCollection visualization are not available, please use: `pip install apache-beam[interactive]` to install necessary dependencies to enable all data visualization features.

Warning:apache_beam.io.tfrecordio:Couldn't find python-snappy so the implementation of _TFRecordUtil._masked_crc32c is not as fast as it could be.

Warning:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.6/site-packages/tensorflow_model_analysis/writers/metrics_plots_and_validations_writer.py:70: tf_record_iterator (from tensorflow.python.lib.io.tf_record) is deprecated and will be removed in a future version.
Instructions for updating:
Use eager execution and: 
`tf.data.TFRecordDataset(path)`

Warning:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.6/site-packages/tensorflow_model_analysis/writers/metrics_plots_and_validations_writer.py:70: tf_record_iterator (from tensorflow.python.lib.io.tf_record) is deprecated and will be removed in a future version.
Instructions for updating:
Use eager execution and: 
`tf.data.TFRecordDataset(path)`
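
Note that tfma.run_model_analysis builds and runs an Apache Beam pipeline for us. If you need more control, for example to run the evaluation on a distributed Beam runner over a large dataset, the same evaluation can be expressed as an explicit Beam pipeline using tfma.ExtractEvaluateAndWriteResults (see the get_started guide). The sketch below re-uses the keras objects defined above and writes to a separate output directory ('keras_beam' is just an illustrative name):

# Sketch: the same keras evaluation expressed as an explicit Beam pipeline.
# The 'keras_beam' output directory is only used for this illustration.
beam_output_path = os.path.join(OUTPUT_DIR, 'keras_beam')
with beam.Pipeline() as pipeline:
  _ = (
      pipeline
      | 'ReadData' >> beam.io.ReadFromTFRecord(file_pattern=tfrecord_file)
      | 'ExtractEvaluateAndWriteResults' >> tfma.ExtractEvaluateAndWriteResults(
          eval_shared_model=keras_eval_shared_model,
          eval_config=keras_eval_config,
          output_path=beam_output_path))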

Estimator

import tensorflow_model_analysis as tfma

# Setup tfma.EvalConfig settings
estimator_eval_config = text_format.Parse("""
  ## Model information
  model_specs {
    # To use EvalSavedModel set `signature_name` to "eval".
    signature_name: "eval"
  }

  ## Post training metric information. These will be merged with any built-in
  ## metrics from training.
  metrics_specs {
    metrics { class_name: "ConfusionMatrixPlot" }
    # ... add additional metrics and plots ...
  }

  ## Slicing information
  slicing_specs {}  # overall slice
  slicing_specs {
    feature_keys: ["trip_start_hour"]
  }
  slicing_specs {
    feature_keys: ["trip_start_day"]
  }
  slicing_specs {
    feature_values: {
      key: "trip_start_month"
      value: "1"
    }
  }
  slicing_specs {
    feature_keys: ["trip_start_hour", "trip_start_day"]
  }
""", tfma.EvalConfig())

# Create a tfma.EvalSharedModel that points at our eval saved model.
estimator_base_model_path = os.path.join(
    MODELS_DIR, 'estimator', 'eval_model_dir')
estimator_model_path = os.path.join(
    estimator_base_model_path, os.listdir(estimator_base_model_path)[0])
estimator_eval_shared_model = tfma.default_eval_shared_model(
    eval_saved_model_path=estimator_model_path,
    eval_config=estimator_eval_config)

estimator_output_path = os.path.join(OUTPUT_DIR, 'estimator')

# Run TFMA
estimator_eval_result = tfma.run_model_analysis(
    eval_shared_model=estimator_eval_shared_model,
    eval_config=estimator_eval_config,
    data_location=tfrecord_file,
    output_path=estimator_output_path)
WARNING:absl:Tensorflow version (2.3.1) found. Note that TFMA support for TF 2.0 is currently in beta

Warning:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.6/site-packages/tensorflow_model_analysis/eval_saved_model/load.py:169: load (from tensorflow.python.saved_model.loader_impl) is deprecated and will be removed in a future version.
Instructions for updating:
This function will only be available through the v1 compatibility library as tf.compat.v1.saved_model.loader.load or tf.compat.v1.saved_model.load. There will be a new function for importing SavedModels in Tensorflow 2.0.

Warning:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.6/site-packages/tensorflow_model_analysis/eval_saved_model/load.py:169: load (from tensorflow.python.saved_model.loader_impl) is deprecated and will be removed in a future version.
Instructions for updating:
This function will only be available through the v1 compatibility library as tf.compat.v1.saved_model.loader.load or tf.compat.v1.saved_model.load. There will be a new function for importing SavedModels in Tensorflow 2.0.

INFO:tensorflow:Restoring parameters from /tmp/tmpj6t03cp6/saved_models-2.2/models/estimator/eval_model_dir/1591221811/variables/variables

INFO:tensorflow:Restoring parameters from /tmp/tmpj6t03cp6/saved_models-2.2/models/estimator/eval_model_dir/1591221811/variables/variables

Warning:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.6/site-packages/tensorflow_model_analysis/eval_saved_model/graph_ref.py:189: get_tensor_from_tensor_info (from tensorflow.python.saved_model.utils_impl) is deprecated and will be removed in a future version.
Instructions for updating:
This function will only be available through the v1 compatibility library as tf.compat.v1.saved_model.utils.get_tensor_from_tensor_info or tf.compat.v1.saved_model.get_tensor_from_tensor_info.

Warning:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.6/site-packages/tensorflow_model_analysis/eval_saved_model/graph_ref.py:189: get_tensor_from_tensor_info (from tensorflow.python.saved_model.utils_impl) is deprecated and will be removed in a future version.
Instructions for updating:
This function will only be available through the v1 compatibility library as tf.compat.v1.saved_model.utils.get_tensor_from_tensor_info or tf.compat.v1.saved_model.get_tensor_from_tensor_info.

Visualizing Metrics and Plots

Now that we've run the evaluation, let's take a look at our visualizations using TFMA. For the following examples, we will visualize the results from running the evaluation on the keras model. To view results for the estimator based model, update eval_result to point at our estimator_eval_result variable.

eval_result = keras_eval_result
# eval_result = estimator_eval_result
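
The evaluation results were also written to disk under OUTPUT_DIR, so they can be reloaded later without re-running the analysis. A minimal sketch using tfma.load_eval_result:

# Optional: reload the keras evaluation results that TFMA wrote to disk above.
# This produces an EvalResult equivalent to keras_eval_result.
reloaded_result = tfma.load_eval_result(keras_output_path)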

Rendering Metrics

To view metrics you use tfma.view.render_slicing_metrics.

By default the views will display the Overall slice. To view a particular slice you can either use the name of the column (by setting slicing_column) or provide a tfma.SlicingSpec.

The metrics visualization supports the following interactions:

  • Click and drag to pan
  • Scroll to zoom
  • Right click to reset the view
  • Hover over the desired data point to see more details.
  • Select from four different types of views using the selections at the bottom.

For example, we'll be setting slicing_column to look at the trip_start_hour feature from our previous slicing_specs.

tfma.view.render_slicing_metrics(eval_result, slicing_column='trip_start_hour')
SlicingMetricsViewer(config={'weightedExamplesColumn': 'example_count'}, data=[{'slice': 'trip_start_hour:2', …

Slices Overview

The default visualization is the Slices Overview when the number of slices is small. It shows the values of metrics for each slice. Since we've selected trip_start_hour above, it's showing us metrics like accuracy and AUC for each hour, which allows us to look for issues that are specific to some hours and not others.

In the visualization above:

  • Try sorting the feature column, which is our trip_start_hour feature, by clicking on the column header
  • Try sorting by precision, and notice that the precision for some of the hours with examples is 0, which may indicate a problem

The chart also allows us to select and display different metrics in our slices.

  • Try selecting different metrics from the "Show" menu
  • Try selecting recall in the "Show" menu, and notice that the recall for some of the hours with examples is 0, which may indicate a problem

It is also possible to set a threshold to filter out slices with smaller numbers of examples, or "weights". You can type a minimum number of examples, or use the slider.

Metrics Histogram

This view also supports a Metrics Histogram as an alternative visualization, which is also the default view when the number of slices is large. The results will be divided into buckets and the number of slices / total weights / both can be visualized. Columns can be sorted by clicking on the column header. Slices with small weights can be filtered out by setting the threshold. Further filtering can be applied by dragging the grey band. To reset the range, double click the band. Filtering can also be used to remove outliers in the visualization and the metrics tables. Click the gear icon to switch to a logarithmic scale instead of a linear scale.

  • Try selecting "Metrics Histogram" in the Visualization menu

More Slices

Our initial tfma.EvalConfig created a whole list of slicing_specs, which we can visualize by updating the slice information passed to tfma.view.render_slicing_metrics. Here we'll select the trip_start_day slice (days of the week). Try changing trip_start_day to trip_start_month and rendering again to examine different slices.

tfma.view.render_slicing_metrics(eval_result, slicing_column='trip_start_day')
SlicingMetricsViewer(config={'weightedExamplesColumn': 'example_count'}, data=[{'slice': 'trip_start_day:3', '…

TFMA also supports creating feature crosses to analyze combinations of features. Our original settings created a cross of trip_start_hour and trip_start_day:

tfma.view.render_slicing_metrics(
    eval_result,
    slicing_spec=tfma.SlicingSpec(
        feature_keys=['trip_start_hour', 'trip_start_day']))
SlicingMetricsViewer(config={'weightedExamplesColumn': 'example_count'}, data=[{'slice': 'trip_start_day_X_tri…

Crossing the two columns creates a lot of combinations! Let's narrow down our cross to only look at trips that start at noon. Then let's select binary_accuracy from the visualization:

tfma.view.render_slicing_metrics(
    eval_result,
    slicing_spec=tfma.SlicingSpec(
        feature_keys=['trip_start_day'], feature_values={'trip_start_hour': '12'}))
SlicingMetricsViewer(config={'weightedExamplesColumn': 'example_count'}, data=[{'slice': 'trip_start_day_X_tri…

Rendering Plots

Any plots that were added to the tfma.EvalConfig as post training metrics_specs can be displayed using tfma.view.render_plot.

As with metrics, plots can be viewed by slice. Unlike metrics, only plots for a particular slice value can be displayed, so the tfma.SlicingSpec must be used, and it must specify both a slice feature name and value. If no slice is provided then the plots for the Overall slice are used.

In the example below we are displaying the CalibrationPlot and ConfusionMatrixPlot plots that were computed for the trip_start_hour:1 slice.

tfma.view.render_plot(
    eval_result,
    tfma.SlicingSpec(feature_values={'trip_start_hour': '1'}))
PlotViewer(config={'sliceName': 'trip_start_hour:1', 'metricKeys': {'calibrationPlot': {'metricName': 'calibra…

Tracking Model Performance Over Time

Your training dataset will be used for training your model, and will hopefully be representative of your test dataset and the data that will be sent to your model in production. However, while the data in inference requests may initially remain similar to your training data, in many cases it will change enough over time that the performance of your model will change.

That means that you need to monitor and measure your model's performance on an ongoing basis, so that you can be aware of and react to changes. Let's take a look at how TFMA can help.

Let's load 3 different model runs and use TFMA to see how they compare using render_time_series.

# Note this re-uses the EvalConfig from the keras setup.

# Run eval on each saved model
output_paths = []
for i in range(3):
  # Create a tfma.EvalSharedModel that points at our saved model.
  eval_shared_model = tfma.default_eval_shared_model(
      eval_saved_model_path=os.path.join(MODELS_DIR, 'keras', str(i)),
      eval_config=keras_eval_config)

  output_path = os.path.join(OUTPUT_DIR, 'time_series', str(i))
  output_paths.append(output_path)

  # Run TFMA
  tfma.run_model_analysis(eval_shared_model=eval_shared_model,
                          eval_config=keras_eval_config,
                          data_location=tfrecord_file,
                          output_path=output_path)
WARNING:absl:Tensorflow version (2.3.1) found. Note that TFMA support for TF 2.0 is currently in beta
WARNING:absl:Tensorflow version (2.3.1) found. Note that TFMA support for TF 2.0 is currently in beta
WARNING:absl:Tensorflow version (2.3.1) found. Note that TFMA support for TF 2.0 is currently in beta

First, we'll imagine that we trained and deployed our model yesterday, and now we want to see how it's doing on the new data coming in today. The visualization will start by displaying AUC. From the UI you can:

  • Add other metrics using the "Add metric series" menu.
  • Close unwanted graphs by clicking on x
  • Hover over data points (the ends of line segments in the graph) to get more details
eval_results_from_disk = tfma.load_eval_results(output_paths[:2])

tfma.view.render_time_series(eval_results_from_disk)
TimeSeriesViewer(config={'isModelCentric': True}, data=[{'metrics': {'': {'': {'calibration': {'doubleValue': …

Now we'll imagine that another day has passed and we want to see how it's doing on the new data coming in today, compared to the previous two days:

eval_results_from_disk = tfma.load_eval_results(output_paths)

tfma.view.render_time_series(eval_results_from_disk)
TimeSeriesViewer(config={'isModelCentric': True}, data=[{'metrics': {'': {'': {'calibration': {'doubleValue': …

Model Validation

TFMA can be configured to evaluate multiple models at the same time. Typically this is done to compare a new model against a baseline (such as the currently serving model) to determine how its metrics (e.g. AUC) differ relative to the baseline. When thresholds are configured, TFMA will produce a tfma.ValidationResult record indicating whether the performance matches expectations.

Let's re-configure our keras evaluation to compare two models: a candidate and a baseline. We will also validate the candidate's performance against the baseline by setting a tfma.MetricThreshold on the AUC metric.

# Setup tfma.EvalConfig setting
eval_config_with_thresholds = text_format.Parse("""
  ## Model information
  model_specs {
    name: "candidate"
    # For keras we need to add a `label_key`.
    label_key: "big_tipper"
  }
  model_specs {
    name: "baseline"
    # For keras we need to add a `label_key`.
    label_key: "big_tipper"
    is_baseline: true
  }

  ## Post training metric information
  metrics_specs {
    metrics { class_name: "ExampleCount" }
    metrics { class_name: "BinaryAccuracy" }
    metrics { class_name: "BinaryCrossentropy" }
    metrics {
      class_name: "AUC"
      threshold {
        # Ensure that AUC is always > 0.9
        value_threshold {
          lower_bound { value: 0.9 }
        }
        # Ensure that AUC does not drop by more than a small epsilon
        # e.g. (candidate - baseline) > -1e-10 or candidate > baseline - 1e-10
        change_threshold {
          direction: HIGHER_IS_BETTER
          absolute { value: -1e-10 }
        }
      }
    }
    metrics { class_name: "AUCPrecisionRecall" }
    metrics { class_name: "Precision" }
    metrics { class_name: "Recall" }
    metrics { class_name: "MeanLabel" }
    metrics { class_name: "MeanPrediction" }
    metrics { class_name: "Calibration" }
    metrics { class_name: "CalibrationPlot" }
    metrics { class_name: "ConfusionMatrixPlot" }
    # ... add additional metrics and plots ...
  }

  ## Slicing information
  slicing_specs {}  # overall slice
  slicing_specs {
    feature_keys: ["trip_start_hour"]
  }
  slicing_specs {
    feature_keys: ["trip_start_day"]
  }
  slicing_specs {
    feature_keys: ["trip_start_month"]
  }
  slicing_specs {
    feature_keys: ["trip_start_hour", "trip_start_day"]
  }
""", tfma.EvalConfig())

# Create tfma.EvalSharedModels that point at our keras models.
candidate_model_path = os.path.join(MODELS_DIR, 'keras', '2')
baseline_model_path = os.path.join(MODELS_DIR, 'keras', '1')
eval_shared_models = [
  tfma.default_eval_shared_model(
      model_name=tfma.CANDIDATE_KEY,
      eval_saved_model_path=candidate_model_path,
      eval_config=eval_config_with_thresholds),
  tfma.default_eval_shared_model(
      model_name=tfma.BASELINE_KEY,
      eval_saved_model_path=baseline_model_path,
      eval_config=eval_config_with_thresholds),
]

validation_output_path = os.path.join(OUTPUT_DIR, 'validation')

# Run TFMA
eval_result_with_validation = tfma.run_model_analysis(
    eval_shared_models,
    eval_config=eval_config_with_thresholds,
    data_location=tfrecord_file,
    output_path=validation_output_path)
WARNING:absl:Tensorflow version (2.3.1) found. Note that TFMA support for TF 2.0 is currently in beta
/tmpfs/src/tf_docs_env/lib/python3.6/site-packages/tensorflow_model_analysis/evaluators/metrics_validator.py:66: RuntimeWarning: invalid value encountered in true_divide
  ratio = diff / metrics[key.make_baseline_key(baseline_model_name)]

When running evaluations with one or more models against a baseline, TFMA automatically adds diff metrics for all the metrics computed during the evaluation. These metrics are named after the corresponding metric but with _diff appended to the metric name.

Let's take a look at the metrics produced by our run:

tfma.view.render_time_series(eval_result_with_validation)
TimeSeriesViewer(config={'isModelCentric': True}, data=[{'metrics': {'': {'': {'calibration_diff': {'doubleVal…

Now let's look at the output from our validation checks. To view the validation results we use tfma.load_validation_result. For our example, the validation fails because AUC is below the threshold.

validation_result = tfma.load_validation_result(validation_output_path)
print(validation_result.validation_ok)
False
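
The validation result is a protocol buffer, so we can print it to see the details of which metric and slice combinations failed to meet their thresholds (the exact field layout depends on your TFMA version):

# Print the full validation proto to inspect which metrics and slices failed.
print(validation_result)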