Pruning for on-device inference w/ XNNPACK


This guide shows how to prune Keras model weights to reduce the latency of on-device inference with XNNPACK.

It introduces the new tfmot.sparsity.keras.PruningPolicy API and demonstrates how to use it to accelerate mostly convolutional models on modern CPUs with XNNPACK sparse inference.

The guide covers the following steps of the model creation process:

  • Build and train the dense baseline
  • Fine-tune model with pruning
  • Convert to TFLite
  • On-device benchmark

The guide doesn't cover best practices for fine-tuning with pruning. For more detail on that topic, see the comprehensive guide.

Setup

 pip install -q tensorflow
 pip install -q tensorflow-model-optimization
import tempfile

import tensorflow as tf
import numpy as np

from tensorflow import keras
import tensorflow_datasets as tfds
import tensorflow_model_optimization as tfmot

%load_ext tensorboard

Build and train the dense model

We build and train a simple baseline CNN for a classification task on the CIFAR10 dataset.

# Load CIFAR10 dataset.
(ds_train, ds_val, ds_test), ds_info = tfds.load(
    'cifar10',
    split=['train[:90%]', 'train[90%:]', 'test'],
    as_supervised=True,
    with_info=True,
)

# Normalize the input image so that each pixel value is between 0 and 1.
def normalize_img(image, label):
  """Normalizes images: `uint8` -> `float32`."""
  return tf.image.convert_image_dtype(image, tf.float32), label

# Load the data in batches of 128 images.
batch_size = 128
def prepare_dataset(ds, buffer_size=None):
  ds = ds.map(normalize_img, num_parallel_calls=tf.data.experimental.AUTOTUNE)
  ds = ds.cache()
  if buffer_size:
    ds = ds.shuffle(buffer_size)
  ds = ds.batch(batch_size)
  ds = ds.prefetch(tf.data.experimental.AUTOTUNE)
  return ds

ds_train = prepare_dataset(ds_train,
                           buffer_size=ds_info.splits['train'].num_examples)
ds_val = prepare_dataset(ds_val)
ds_test = prepare_dataset(ds_test)

# Build the dense baseline model.
dense_model = keras.Sequential([
    keras.layers.InputLayer(input_shape=(32, 32, 3)),
    keras.layers.ZeroPadding2D(padding=1),
    keras.layers.Conv2D(
        filters=8,
        kernel_size=(3, 3),
        strides=(2, 2),
        padding='valid'),
    keras.layers.BatchNormalization(),
    keras.layers.ReLU(),
    keras.layers.DepthwiseConv2D(kernel_size=(3, 3), padding='same'),
    keras.layers.BatchNormalization(),
    keras.layers.ReLU(),
    keras.layers.Conv2D(filters=16, kernel_size=(1, 1)),
    keras.layers.BatchNormalization(),
    keras.layers.ReLU(),
    keras.layers.ZeroPadding2D(padding=1),
    keras.layers.DepthwiseConv2D(
        kernel_size=(3, 3), strides=(2, 2), padding='valid'),
    keras.layers.BatchNormalization(),
    keras.layers.ReLU(),
    keras.layers.Conv2D(filters=32, kernel_size=(1, 1)),
    keras.layers.BatchNormalization(),
    keras.layers.ReLU(),
    keras.layers.GlobalAveragePooling2D(),
    keras.layers.Flatten(),
    keras.layers.Dense(10)
])

# Compile and train the dense model for 10 epochs.
dense_model.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer='adam',
    metrics=['accuracy'])

dense_model.fit(
  ds_train,
  epochs=10,
  validation_data=ds_val)

# Evaluate the dense model.
_, dense_model_accuracy = dense_model.evaluate(ds_test, verbose=0)
Epoch 1/10
352/352 [==============================] - 12s 21ms/step - loss: 1.9929 - accuracy: 0.2651 - val_loss: 2.5594 - val_accuracy: 0.1466
Epoch 2/10
352/352 [==============================] - 7s 19ms/step - loss: 1.7293 - accuracy: 0.3582 - val_loss: 1.7533 - val_accuracy: 0.3414
Epoch 3/10
352/352 [==============================] - 7s 19ms/step - loss: 1.6531 - accuracy: 0.3849 - val_loss: 1.6463 - val_accuracy: 0.3886
Epoch 4/10
352/352 [==============================] - 7s 19ms/step - loss: 1.6073 - accuracy: 0.4024 - val_loss: 1.6127 - val_accuracy: 0.3980
Epoch 5/10
352/352 [==============================] - 7s 19ms/step - loss: 1.5692 - accuracy: 0.4200 - val_loss: 1.5552 - val_accuracy: 0.4228
Epoch 6/10
352/352 [==============================] - 7s 19ms/step - loss: 1.5358 - accuracy: 0.4344 - val_loss: 1.6375 - val_accuracy: 0.4030
Epoch 7/10
352/352 [==============================] - 7s 19ms/step - loss: 1.5074 - accuracy: 0.4475 - val_loss: 1.5514 - val_accuracy: 0.4258
Epoch 8/10
352/352 [==============================] - 7s 19ms/step - loss: 1.4810 - accuracy: 0.4598 - val_loss: 1.7087 - val_accuracy: 0.3866
Epoch 9/10
352/352 [==============================] - 7s 19ms/step - loss: 1.4610 - accuracy: 0.4669 - val_loss: 1.5219 - val_accuracy: 0.4492
Epoch 10/10
352/352 [==============================] - 7s 19ms/step - loss: 1.4445 - accuracy: 0.4748 - val_loss: 1.5329 - val_accuracy: 0.4302

Build the sparse model

Following the comprehensive guide, we apply the tfmot.sparsity.keras.prune_low_magnitude function with parameters that target on-device acceleration, namely the tfmot.sparsity.keras.PruneForLatencyOnXNNPack pruning policy.
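
For intuition, magnitude-based pruning zeroes the weights with the smallest absolute values until a target sparsity is reached. The sketch below illustrates the idea on a flat list of numbers. `magnitude_prune` is a hypothetical helper written for this illustration only; it is not the tfmot implementation, which works on tensors and applies masks gradually during training.

```python
def magnitude_prune(weights, sparsity):
    """Return a copy of `weights` with the smallest-magnitude fraction set to zero."""
    num_to_prune = int(round(len(weights) * sparsity))
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:num_to_prune]:
        pruned[i] = 0.0
    return pruned


# Illustrative values, not taken from the trained model.
weights = [0.5, -0.1, 0.03, -0.8, 0.2, 0.01, -0.4, 0.9]
pruned = magnitude_prune(weights, sparsity=0.75)
print(pruned)  # [0.0, 0.0, 0.0, -0.8, 0.0, 0.0, 0.0, 0.9]
```

At 75% sparsity, six of the eight values are zeroed and only the two largest magnitudes survive.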

prune_low_magnitude = tfmot.sparsity.keras.prune_low_magnitude

# Compute the end step so that pruning finishes after 5 epochs.
end_epoch = 5

num_iterations_per_epoch = len(ds_train)
end_step =  num_iterations_per_epoch * end_epoch

# Define parameters for pruning.
pruning_params = {
      'pruning_schedule': tfmot.sparsity.keras.PolynomialDecay(initial_sparsity=0.25,
                                                               final_sparsity=0.75,
                                                               begin_step=0,
                                                               end_step=end_step),
      'pruning_policy': tfmot.sparsity.keras.PruneForLatencyOnXNNPack()
}

# Try to apply pruning wrapper with pruning policy parameter.
try:
  model_for_pruning = prune_low_magnitude(dense_model, **pruning_params)
except ValueError as e:
  print(e)
Could not find a `GlobalAveragePooling2D` layer with `keepdims = True` in all output branches

The call to prune_low_magnitude raises a ValueError with the message Could not find a GlobalAveragePooling2D layer with keepdims = True in all output branches. The message indicates that the model is not supported by the tfmot.sparsity.keras.PruneForLatencyOnXNNPack policy: specifically, the GlobalAveragePooling2D layer must be created with keepdims=True. Let's fix that and reapply the prune_low_magnitude function.
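
To see what keepdims changes, the sketch below uses NumPy's mean as a stand-in for the pooling layer (an assumption made for illustration; the Keras layer is not called here). With keepdims=True the spatial axes are kept with size 1, so the output stays 4-dimensional; a following Flatten layer yields the same values either way, which is why the dense model's weights can be reused unchanged.

```python
import numpy as np

# A batch of 2 feature maps of size 8x8 with 32 channels (NHWC layout).
features = np.arange(2 * 8 * 8 * 32, dtype=np.float32).reshape(2, 8, 8, 32)

# Averaging over the spatial axes without keepdims drops them: shape (2, 32).
pooled_flat = features.mean(axis=(1, 2))

# With keepdims=True the spatial axes are kept with size 1: shape (2, 1, 1, 32).
pooled_4d = features.mean(axis=(1, 2), keepdims=True)

print(pooled_flat.shape, pooled_4d.shape)

# Flattening the 4-D result recovers the same values as the 2-D result.
same_values = np.allclose(pooled_4d.reshape(2, -1), pooled_flat)
print(same_values)
```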

fixed_dense_model = keras.Sequential([
    keras.layers.InputLayer(input_shape=(32, 32, 3)),
    keras.layers.ZeroPadding2D(padding=1),
    keras.layers.Conv2D(
        filters=8,
        kernel_size=(3, 3),
        strides=(2, 2),
        padding='valid'),
    keras.layers.BatchNormalization(),
    keras.layers.ReLU(),
    keras.layers.DepthwiseConv2D(kernel_size=(3, 3), padding='same'),
    keras.layers.BatchNormalization(),
    keras.layers.ReLU(),
    keras.layers.Conv2D(filters=16, kernel_size=(1, 1)),
    keras.layers.BatchNormalization(),
    keras.layers.ReLU(),
    keras.layers.ZeroPadding2D(padding=1),
    keras.layers.DepthwiseConv2D(
        kernel_size=(3, 3), strides=(2, 2), padding='valid'),
    keras.layers.BatchNormalization(),
    keras.layers.ReLU(),
    keras.layers.Conv2D(filters=32, kernel_size=(1, 1)),
    keras.layers.BatchNormalization(),
    keras.layers.ReLU(),
    keras.layers.GlobalAveragePooling2D(keepdims=True),
    keras.layers.Flatten(),
    keras.layers.Dense(10)
])

# Use the pretrained model for pruning instead of training from scratch.
fixed_dense_model.set_weights(dense_model.get_weights())

# Try to reapply pruning wrapper.
model_for_pruning = prune_low_magnitude(fixed_dense_model, **pruning_params)

This time prune_low_magnitude finishes without errors, which means that the model is fully supported by the tfmot.sparsity.keras.PruneForLatencyOnXNNPack policy and can be accelerated using XNNPACK sparse inference.

Fine-tune the sparse model

Following the pruning example, we fine-tune the sparse model starting from the weights of the dense model. Fine-tuning starts at 25% sparsity (25% of the weights are set to zero) and reaches 75% sparsity after 5 epochs; the remaining 10 of the 15 epochs train at a constant 75% sparsity.
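
The ramp from 25% to 75% can be sketched with a few lines of plain Python. This is a simplified model of a polynomial decay schedule, assuming an exponent of 3 and ignoring how often the pruning masks are actually updated; `polynomial_decay_sparsity` is a hypothetical helper, not a tfmot function. The 352 steps per epoch are taken from the training logs.

```python
def polynomial_decay_sparsity(step, initial_sparsity, final_sparsity,
                              begin_step, end_step, power=3):
    """Target sparsity at `step` for a polynomial ramp (assumed exponent: 3)."""
    progress = (step - begin_step) / (end_step - begin_step)
    progress = min(1.0, max(0.0, progress))  # Clamp outside the pruning window.
    return final_sparsity + (initial_sparsity - final_sparsity) * (1.0 - progress) ** power


steps_per_epoch = 352            # From the training logs above.
end_step = steps_per_epoch * 5   # Pruning finishes after 5 epochs.

for epoch in range(7):
    sparsity = polynomial_decay_sparsity(
        epoch * steps_per_epoch, 0.25, 0.75, begin_step=0, end_step=end_step)
    print(f"after epoch {epoch}: target sparsity {sparsity:.3f}")
```

Under these assumptions most of the pruning happens early: the target rises quickly during the first epochs and flattens out as it approaches 75%.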

logdir = tempfile.mkdtemp()

callbacks = [
  tfmot.sparsity.keras.UpdatePruningStep(),
  tfmot.sparsity.keras.PruningSummaries(log_dir=logdir),
]

model_for_pruning.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer='adam',
    metrics=['accuracy'])

model_for_pruning.fit(
  ds_train,
  epochs=15,
  validation_data=ds_val,
  callbacks=callbacks)

# Evaluate the pruned model.
_, pruned_model_accuracy = model_for_pruning.evaluate(ds_test, verbose=0)

print('Dense model test accuracy:', dense_model_accuracy)
print('Pruned model test accuracy:', pruned_model_accuracy)
Epoch 1/15
352/352 [==============================] - 9s 20ms/step - loss: 1.4474 - accuracy: 0.4732 - val_loss: 1.5224 - val_accuracy: 0.4368
Epoch 2/15
352/352 [==============================] - 7s 19ms/step - loss: 1.4763 - accuracy: 0.4601 - val_loss: 1.9179 - val_accuracy: 0.3514
Epoch 3/15
352/352 [==============================] - 7s 19ms/step - loss: 1.4861 - accuracy: 0.4602 - val_loss: 1.5849 - val_accuracy: 0.4100
Epoch 4/15
352/352 [==============================] - 7s 19ms/step - loss: 1.4838 - accuracy: 0.4614 - val_loss: 1.5123 - val_accuracy: 0.4412
Epoch 5/15
352/352 [==============================] - 7s 19ms/step - loss: 1.4669 - accuracy: 0.4696 - val_loss: 1.7005 - val_accuracy: 0.3620
Epoch 6/15
352/352 [==============================] - 7s 19ms/step - loss: 1.4497 - accuracy: 0.4772 - val_loss: 1.4644 - val_accuracy: 0.4576
Epoch 7/15
352/352 [==============================] - 7s 19ms/step - loss: 1.4397 - accuracy: 0.4799 - val_loss: 1.4532 - val_accuracy: 0.4710
Epoch 8/15
352/352 [==============================] - 7s 19ms/step - loss: 1.4307 - accuracy: 0.4844 - val_loss: 2.0308 - val_accuracy: 0.3674
Epoch 9/15
352/352 [==============================] - 7s 19ms/step - loss: 1.4254 - accuracy: 0.4849 - val_loss: 1.6031 - val_accuracy: 0.4180
Epoch 10/15
352/352 [==============================] - 7s 19ms/step - loss: 1.4200 - accuracy: 0.4834 - val_loss: 1.8140 - val_accuracy: 0.3768
Epoch 11/15
352/352 [==============================] - 7s 19ms/step - loss: 1.4132 - accuracy: 0.4892 - val_loss: 1.4289 - val_accuracy: 0.4810
Epoch 12/15
352/352 [==============================] - 7s 19ms/step - loss: 1.4075 - accuracy: 0.4915 - val_loss: 1.4257 - val_accuracy: 0.4734
Epoch 13/15
352/352 [==============================] - 7s 19ms/step - loss: 1.4032 - accuracy: 0.4922 - val_loss: 1.4693 - val_accuracy: 0.4620
Epoch 14/15
352/352 [==============================] - 7s 19ms/step - loss: 1.3992 - accuracy: 0.4950 - val_loss: 1.3901 - val_accuracy: 0.4860
Epoch 15/15
352/352 [==============================] - 7s 19ms/step - loss: 1.3957 - accuracy: 0.4952 - val_loss: 1.4754 - val_accuracy: 0.4620
Dense model test accuracy: 0.43209999799728394
Pruned model test accuracy: 0.4596000015735626

The logs show the progression of sparsity on a per-layer basis.

%tensorboard --logdir={logdir}

After fine-tuning with pruning, test accuracy shows a modest improvement (43% to 46%) over the dense model. Let's compare on-device latency using the TFLite benchmark.

Model conversion and benchmarking

To convert the pruned model to TFLite, we need to replace the PruneLowMagnitude wrappers with the original layers via the strip_pruning function. Also, since the weights of the pruned model (model_for_pruning) are mostly zeros, we can apply the tf.lite.Optimize.EXPERIMENTAL_SPARSITY optimization to store the resulting TFLite model efficiently. This optimization flag is not required for the dense model.
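
Before converting, it can be useful to check how sparse the stripped weights actually are. The sketch below is one way to do that with NumPy; `fraction_of_zeros` is a hypothetical helper and the arrays are stand-in values rather than weights from the trained model.

```python
import numpy as np


def fraction_of_zeros(weight_arrays):
    """Return the fraction of exactly-zero entries across a list of arrays."""
    total = sum(w.size for w in weight_arrays)
    zeros = sum(int(np.count_nonzero(w == 0)) for w in weight_arrays)
    return zeros / total


# Stand-in arrays for illustration only.
kernel = np.array([[0.0, 0.3], [0.0, -0.7]])
bias = np.array([0.1, 0.0])

kernel_sparsity = fraction_of_zeros([kernel])
print(f"kernel sparsity: {kernel_sparsity:.2f}")
```

In the notebook, the same helper could be applied per layer, for example to the arrays returned by `layer.get_weights()` for each layer of the model produced by strip_pruning.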

converter = tf.lite.TFLiteConverter.from_keras_model(dense_model)
dense_tflite_model = converter.convert()

_, dense_tflite_file = tempfile.mkstemp('.tflite')
with open(dense_tflite_file, 'wb') as f:
  f.write(dense_tflite_model)

model_for_export = tfmot.sparsity.keras.strip_pruning(model_for_pruning)

converter = tf.lite.TFLiteConverter.from_keras_model(model_for_export)
converter.optimizations = [tf.lite.Optimize.EXPERIMENTAL_SPARSITY]
pruned_tflite_model = converter.convert()

_, pruned_tflite_file = tempfile.mkstemp('.tflite')
with open(pruned_tflite_file, 'wb') as f:
  f.write(pruned_tflite_model)

Following the instructions for the TFLite Model Benchmarking Tool, we build the tool, upload it to the Android device together with the dense and pruned TFLite models, and benchmark both models on the device.
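
The upload and benchmark steps could be scripted along the following lines. This is a sketch that only builds and prints the adb command lines; it does not run them. The local file names are placeholders (assumptions), while the device directory and benchmark flags mirror the commands used in this guide.

```python
DEVICE_DIR = "/data/local/tmp"


def push_command(local_path, remote_name):
    """Build an `adb push` command that copies a local file to the device."""
    return ["adb", "push", local_path, f"{DEVICE_DIR}/{remote_name}"]


def benchmark_command(remote_model, num_runs=100, num_threads=1):
    """Build the benchmark invocation used in this guide for one model."""
    return [
        "adb", "shell", f"{DEVICE_DIR}/benchmark_model",
        f"--graph={DEVICE_DIR}/{remote_model}",
        "--use_xnnpack=true",
        f"--num_runs={num_runs}",
        f"--num_threads={num_threads}",
    ]


commands = [
    # Placeholder local paths; substitute the real locations of the files.
    push_command("benchmark_model", "benchmark_model"),
    ["adb", "shell", "chmod", "+x", f"{DEVICE_DIR}/benchmark_model"],
    push_command("dense_model.tflite", "dense_model.tflite"),
    push_command("pruned_model.tflite", "pruned_model.tflite"),
    benchmark_command("dense_model.tflite"),
    benchmark_command("pruned_model.tflite"),
]

for command in commands:
    print(" ".join(command))
```

To execute the commands instead of printing them, each list could be passed to `subprocess.run(command, check=True)` on a machine with adb installed and a device attached.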

! adb shell /data/local/tmp/benchmark_model \
    --graph=/data/local/tmp/dense_model.tflite \
    --use_xnnpack=true \
    --num_runs=100 \
    --num_threads=1
! adb shell /data/local/tmp/benchmark_model \
    --graph=/data/local/tmp/pruned_model.tflite \
    --use_xnnpack=true \
    --num_runs=100 \
    --num_threads=1

Benchmarks on a Pixel 4 resulted in an average inference time of 17 µs for the dense model and 12 µs for the pruned model. The on-device benchmarks demonstrate a clear latency improvement of 5 µs, or about 30%, even for such a small model. In our experience, larger models based on MobileNetV3 or EfficientNet-lite show similar performance improvements. The speed-up varies with the relative contribution of 1x1 convolutions to the overall model.
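
As a rough illustration of that last point, the sketch below counts multiply-accumulate operations (MACs) per layer for the model in this guide and reports the share spent in 1x1 convolutions, followed by the latency arithmetic from the benchmark numbers above. The MAC count is a simplification (an assumption of this sketch): it ignores batch normalization, biases, activations, and memory traffic, so it only approximates where time is spent. The helper functions are hypothetical.

```python
def conv_macs(out_h, out_w, kernel_h, kernel_w, in_channels, out_channels):
    """MACs of a standard convolution producing an out_h x out_w feature map."""
    return out_h * out_w * kernel_h * kernel_w * in_channels * out_channels


def depthwise_macs(out_h, out_w, kernel_h, kernel_w, channels):
    """MACs of a depthwise convolution (one filter per channel)."""
    return out_h * out_w * kernel_h * kernel_w * channels


# Output sizes follow from the 32x32x3 input, the padding, and the strides.
layer_macs = {
    "conv 3x3, stride 2": conv_macs(16, 16, 3, 3, 3, 8),
    "depthwise 3x3": depthwise_macs(16, 16, 3, 3, 8),
    "conv 1x1 (8 -> 16)": conv_macs(16, 16, 1, 1, 8, 16),
    "depthwise 3x3, stride 2": depthwise_macs(8, 8, 3, 3, 16),
    "conv 1x1 (16 -> 32)": conv_macs(8, 8, 1, 1, 16, 32),
    "dense (32 -> 10)": 32 * 10,
}

total_macs = sum(layer_macs.values())
pointwise_macs = layer_macs["conv 1x1 (8 -> 16)"] + layer_macs["conv 1x1 (16 -> 32)"]
print(f"1x1 convolution share of MACs: {pointwise_macs / total_macs:.1%}")  # 44.0%

dense_us, pruned_us = 17, 12  # Average latencies reported above.
reduction = (dense_us - pruned_us) / dense_us
speedup = dense_us / pruned_us
print(f"latency reduction: {reduction:.1%}, speed-up: {speedup:.2f}x")  # 29.4%, 1.42x
```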

Conclusion

In this tutorial, we showed how to create sparse models for faster on-device performance using the new functionality introduced by the TF MOT API and XNNPACK. These sparse models are smaller and faster than their dense counterparts while retaining or even surpassing their quality.

We encourage you to try this new capability, which can be particularly important for deploying your models on device.