Implementing a Custom Delegate

What is a TensorFlow Lite Delegate?

A TensorFlow Lite Delegate allows you to run your models (in part or in whole) on another executor. This mechanism can leverage a variety of on-device accelerators such as the GPU or Edge TPU (Tensor Processing Unit) for inference. It gives developers a flexible way to speed up inference that is decoupled from the default TFLite runtime.

The diagram below summarizes the delegates; the following sections give more details.

[Figure: TFLite Delegates]

When should I create a Custom delegate?

TensorFlow Lite has a wide variety of delegates for target accelerators such as GPU, DSP, and Edge TPU, and for frameworks like Android NNAPI.

Creating your own delegate is useful in the following scenarios:

  • You want to integrate a new ML inference engine not supported by any existing delegate.
  • You have a custom hardware accelerator that improves runtime for known scenarios.
  • You are developing CPU optimizations (such as operator fusing) that can speed up certain models.

How do delegates work?

Consider a simple model graph such as the following, and a delegate “MyDelegate” that has a faster implementation for Conv2D and Mean operations.

[Figure: Original graph]

After applying “MyDelegate”, the original TensorFlow Lite graph is updated as follows:

[Figure: Graph with delegate]

TensorFlow Lite obtains the graph above by splitting the original graph according to two rules:

  • Specific operations that could be handled by the delegate are put into a partition while still satisfying the original computing workflow dependencies among operations.
  • Each to-be-delegated partition only has input and output nodes that are not handled by the delegate.

Each partition that is handled by a delegate is replaced in the original graph by a delegate node (also called a delegate kernel), which evaluates the partition when it is invoked.
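The splitting rules can be illustrated with a minimal sketch. It assumes the simplest possible case, a linear chain of ops already in execution order, whereas real TensorFlow Lite partitioning works on a general graph. The names `Partition`, `PartitionChain` and `CountDelegateNodes` are hypothetical and are not part of the TFLite API.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// One contiguous run of ops that share an executor (hypothetical type).
struct Partition {
  bool delegated;                // true if the run would go to the delegate
  std::vector<std::string> ops;  // op names in execution order
};

// Splits a linear chain of ops into maximal runs of supported and
// unsupported ops. Assumption: `ops` is a simple chain, not a general graph.
std::vector<Partition> PartitionChain(const std::vector<std::string>& ops,
                                      const std::set<std::string>& supported) {
  std::vector<Partition> partitions;
  for (const std::string& op : ops) {
    const bool delegated = supported.count(op) > 0;
    // Start a new partition whenever the executor changes.
    if (partitions.empty() || partitions.back().delegated != delegated) {
      partitions.push_back(Partition{delegated, {}});
    }
    partitions.back().ops.push_back(op);
  }
  return partitions;
}

// Counts the partitions that would each become one delegate node.
int CountDelegateNodes(const std::vector<Partition>& partitions) {
  int count = 0;
  for (const Partition& p : partitions) {
    if (p.delegated) ++count;
  }
  return count;
}
```

For a chain such as Conv2D, Softmax, Mean, Conv2D with a delegate that supports Conv2D and Mean, this sketch yields three partitions, two of which are delegated. Each delegated partition implies a handoff between the delegate and the main graph.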

Depending on the model, the final graph can end up with one or more delegate nodes; more than one means that some ops are not supported by the delegate. In general, you don’t want multiple partitions handled by the delegate, because each switch between the delegate and the main graph incurs overhead for passing results from the delegated subgraph to the main graph, typically as memory copies (for example, GPU to CPU). This overhead can offset the performance gains, especially when there are many memory copies.

Implementing your own Custom delegate

The preferred method to add a delegate is to use the SimpleDelegate API.

To create a new delegate, you need to implement two interfaces and provide your own implementation for the interface methods.

1 - SimpleDelegateInterface

This class represents the capabilities of the delegate: which operations are supported, and a factory for creating a kernel that encapsulates the delegated graph. For more details, see the interface defined in this C++ header file. The comments in the code explain each API in detail.

2 - SimpleDelegateKernelInterface

This class encapsulates the logic for initializing, preparing, and running the delegated partition.

It has the following methods (see the definition):

  • Init(...): called once to do any one-time initialization.
  • Prepare(...): called for each different instance of this node, which happens if you have multiple delegated partitions. Usually you want to do memory allocations here, since this is called every time tensors are resized.
  • Eval(...): called for inference.
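The call order of these methods can be sketched with a toy driver. This is a simplified illustration, not how the TFLite interpreter is implemented: `RecordingKernel` and `ToyRuntime` are hypothetical names, and the sketch assumes that a resize triggers Prepare immediately.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a delegate kernel; it only records calls.
class RecordingKernel {
 public:
  void Init() { calls_.push_back("Init"); }
  void Prepare() { calls_.push_back("Prepare"); }
  void Eval() { calls_.push_back("Eval"); }
  const std::vector<std::string>& calls() const { return calls_; }

 private:
  std::vector<std::string> calls_;
};

// Hypothetical driver that mimics the order described above:
// Init once, Prepare again whenever tensors are resized, Eval per inference.
class ToyRuntime {
 public:
  explicit ToyRuntime(RecordingKernel* kernel) : kernel_(kernel) {
    kernel_->Init();     // one-time setup
    kernel_->Prepare();  // initial allocation
  }
  void ResizeInputs() { kernel_->Prepare(); }  // re-allocate after a resize
  void Invoke() { kernel_->Eval(); }           // one inference

 private:
  RecordingKernel* kernel_;
};
```

Running one inference, resizing, and running again records Init, Prepare, Eval, Prepare, Eval. This is why allocations belong in Prepare rather than in Init.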


In this example, you will create a very simple delegate that supports only two types of operations, ADD and SUB, with float32 tensors only.

// MyDelegate implements the interface of SimpleDelegateInterface.
// This holds the Delegate capabilities.
class MyDelegate : public SimpleDelegateInterface {
 public:
  bool IsNodeSupportedByDelegate(const TfLiteRegistration* registration,
                                 const TfLiteNode* node,
                                 TfLiteContext* context) const override {
    // Only supports Add and Sub ops.
    if (kTfLiteBuiltinAdd != registration->builtin_code &&
        kTfLiteBuiltinSub != registration->builtin_code)
      return false;
    // This delegate only supports float32 types.
    for (int i = 0; i < node->inputs->size; ++i) {
      auto& tensor = context->tensors[node->inputs->data[i]];
      if (tensor.type != kTfLiteFloat32) return false;
    }
    return true;
  }

  TfLiteStatus Initialize(TfLiteContext* context) override { return kTfLiteOk; }

  const char* Name() const override {
    static constexpr char kName[] = "MyDelegate";
    return kName;
  }

  std::unique_ptr<SimpleDelegateKernelInterface> CreateDelegateKernelInterface()
      override {
    return std::make_unique<MyDelegateKernel>();
  }
};
Next, create your own delegate kernel by inheriting from SimpleDelegateKernelInterface:

// My delegate kernel.
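class MyDelegateKernel : public SimpleDelegateKernelInterface {
 public:
  TfLiteStatus Init(TfLiteContext* context,
                    const TfLiteDelegateParams* params) override {
    // Save index to all nodes which are part of this delegate.
    inputs_.resize(params->nodes_to_replace->size);
    outputs_.resize(params->nodes_to_replace->size);
    builtin_code_.resize(params->nodes_to_replace->size);
    for (int i = 0; i < params->nodes_to_replace->size; ++i) {
      const int node_index = params->nodes_to_replace->data[i];
      // Get this node information.
      TfLiteNode* delegated_node = nullptr;
      TfLiteRegistration* delegated_node_registration = nullptr;
      TF_LITE_ENSURE_EQ(
          context,
          context->GetNodeAndRegistration(context, node_index, &delegated_node,
                                          &delegated_node_registration),
          kTfLiteOk);
      // Add/Sub take two inputs and produce one output.
      inputs_[i].push_back(delegated_node->inputs->data[0]);
      inputs_[i].push_back(delegated_node->inputs->data[1]);
      outputs_[i].push_back(delegated_node->outputs->data[0]);
      builtin_code_[i] = delegated_node_registration->builtin_code;
    }
    return kTfLiteOk;
  }

  TfLiteStatus Prepare(TfLiteContext* context, TfLiteNode* node) override {
    return kTfLiteOk;
  }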

  TfLiteStatus Eval(TfLiteContext* context, TfLiteNode* node) override {
    // Evaluate the delegated graph.
    // Here we loop over all the delegated nodes.
    // We know that all the nodes are either ADD or SUB operations and the
    // number of nodes equals 'inputs_.size()'. inputs_[i] is the list of
    // tensor indices for inputs to node 'i', while outputs_[i] is the list of
    // outputs for node 'i'.
    // Note that the implementation is intentionally simple, as this is for
    // demonstration.

    for (int i = 0; i < inputs_.size(); ++i) {
      // Get the node input tensors.
      // Add/Sub operation accepts 2 inputs.
      auto& input_tensor_1 = context->tensors[inputs_[i][0]];
      auto& input_tensor_2 = context->tensors[inputs_[i][1]];
      auto& output_tensor = context->tensors[outputs_[i][0]];
      TF_LITE_ENSURE_EQ(
          context,
          ComputeResult(context, builtin_code_[i], &input_tensor_1,
                        &input_tensor_2, &output_tensor),
          kTfLiteOk);
    }
    return kTfLiteOk;
  }

 private:
  // Computes the sum or difference (depending on 'builtin_code') of
  // 'input_tensor_1' and 'input_tensor_2' and stores the result in
  // 'output_tensor'.
  TfLiteStatus ComputeResult(TfLiteContext* context, int builtin_code,
                             const TfLiteTensor* input_tensor_1,
                             const TfLiteTensor* input_tensor_2,
                             TfLiteTensor* output_tensor) {
    if (NumElements(input_tensor_1) != NumElements(input_tensor_2) ||
        NumElements(input_tensor_1) != NumElements(output_tensor)) {
      return kTfLiteDelegateError;
    }
    // This code assumes no activation, and no broadcasting needed (both inputs
    // have the same size).
    auto* input_1 = GetTensorData<float>(input_tensor_1);
    auto* input_2 = GetTensorData<float>(input_tensor_2);
    auto* output = GetTensorData<float>(output_tensor);
    for (int i = 0; i < NumElements(input_tensor_1); ++i) {
      if (builtin_code == kTfLiteBuiltinAdd)
        output[i] = input_1[i] + input_2[i];
      else
        output[i] = input_1[i] - input_2[i];
    }
    return kTfLiteOk;
  }

  // Holds the indices of the input/output tensors.
  // inputs_[i] is list of all input tensors to node at index 'i'.
  // outputs_[i] is list of all output tensors to node at index 'i'.
  std::vector<std::vector<int>> inputs_, outputs_;
  // Holds the builtin code of the ops.
  // builtin_code_[i] is the type of node at index 'i'
  std::vector<int> builtin_code_;
};
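The element-wise arithmetic in ComputeResult can be tried outside TFLite. The following is a minimal sketch that swaps TfLiteTensor for std::vector<float>. `ToyOp` and `ElementwiseAddSub` are hypothetical names used only for illustration. As in the kernel above, no broadcasting or activation is handled.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class ToyOp { kAdd, kSub };  // hypothetical stand-in for builtin codes

// Writes a + b or a - b element-wise into *out. Returns false on a size
// mismatch, mirroring the kTfLiteDelegateError path in ComputeResult.
bool ElementwiseAddSub(ToyOp op, const std::vector<float>& a,
                       const std::vector<float>& b, std::vector<float>* out) {
  if (a.size() != b.size() || a.size() != out->size()) return false;
  for (std::size_t i = 0; i < a.size(); ++i) {
    (*out)[i] = (op == ToyOp::kAdd) ? a[i] + b[i] : a[i] - b[i];
  }
  return true;
}
```

Keeping the arithmetic in a small function like this makes it easy to unit-test independently of the delegate plumbing.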

Benchmark and evaluate the new delegate

TFLite has a set of tools that you can use to quickly test a delegate against a TFLite model.

  • Model Benchmark Tool: The tool takes a TFLite model, generates random inputs, and then repeatedly runs the model for a specified number of runs. It prints aggregated latency statistics at the end.
  • Inference Diff Tool: For a given model, the tool generates random Gaussian data and passes it through two different TFLite interpreters, one running single-threaded CPU kernels and the other using a user-defined spec. It measures the absolute difference between the output tensors from each interpreter, on a per-element basis. This tool can also be helpful for debugging accuracy issues.
  • There are also task-specific evaluation tools for image classification and object detection. These tools can be found here.
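The per-element comparison that the Inference Diff Tool performs can be sketched as follows. This is only an illustration of the idea: the choice of maximum and mean as aggregates is an assumption, and `DiffStats` and `CompareOutputs` are hypothetical names rather than part of the tool.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Aggregated absolute differences between two outputs (hypothetical type).
struct DiffStats {
  float max_abs_diff;
  float mean_abs_diff;
};

// Computes per-element absolute differences between a reference output and a
// delegate output, then aggregates them. Both vectors must have equal size.
DiffStats CompareOutputs(const std::vector<float>& reference,
                         const std::vector<float>& candidate) {
  assert(reference.size() == candidate.size());
  DiffStats stats{0.0f, 0.0f};
  if (reference.empty()) return stats;
  float sum = 0.0f;
  for (std::size_t i = 0; i < reference.size(); ++i) {
    const float diff = std::fabs(reference[i] - candidate[i]);
    stats.max_abs_diff = std::max(stats.max_abs_diff, diff);
    sum += diff;
  }
  stats.mean_abs_diff = sum / static_cast<float>(reference.size());
  return stats;
}
```

A large maximum with a small mean usually points to a few outlier elements, while a large mean suggests a systematic discrepancy in the delegate's kernels.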

In addition, TFLite has a large set of kernel and op unit tests that could be reused to test the new delegate with more coverage and to ensure the regular TFLite execution path is not broken.

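To reuse TFLite tests and tooling for the new delegate, you can use either of the following two options:

  • Use the delegate registrar mechanism (Option 1 below).
  • Use the external delegate mechanism (Option 2 below).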

Choosing the best approach

Both approaches require a few changes as detailed below. However, the first approach links the delegate statically and requires rebuilding the testing, benchmarking, and evaluation tools. In contrast, the second one builds the delegate as a shared library and requires you to expose the create/delete methods from the shared library.

As a result, the external-delegate mechanism works with TFLite’s pre-built tooling binaries. However, it is less explicit and might be more complicated to set up in automated integration tests. Use the delegate registrar approach for better clarity.

Option 1: Leverage delegate registrar

The delegate registrar keeps a list of delegate providers, each of which provides an easy way to create TFLite delegates based on command-line flags and is hence convenient for tooling. To plug the new delegate into all the TensorFlow Lite tools mentioned above, first create a new delegate provider like this one, and then make only a few changes to the BUILD rules. A full example of this integration process is shown below (and code can be found here).

Assume you have a delegate that implements the SimpleDelegate APIs, and the extern "C" APIs for creating and deleting this 'dummy' delegate, as shown below:

// Returns default options for DummyDelegate.
DummyDelegateOptions TfLiteDummyDelegateOptionsDefault();

// Creates a new delegate instance that needs to be destroyed with
// `TfLiteDummyDelegateDelete` when the delegate is no longer used by TFLite.
// When `options` is set to `nullptr`, the above default values are used.
TfLiteDelegate* TfLiteDummyDelegateCreate(const DummyDelegateOptions* options);

// Destroys a delegate created with `TfLiteDummyDelegateCreate` call.
void TfLiteDummyDelegateDelete(TfLiteDelegate* delegate);

To integrate the “DummyDelegate” with the Benchmark Tool and the Inference Diff Tool, define a DelegateProvider as follows:

class DummyDelegateProvider : public DelegateProvider {
 public:
  DummyDelegateProvider() {
    // Register the flag's parameter with a default value of false.
    default_params_.AddParam("use_dummy_delegate",
                             ToolParam::Create<bool>(false));
  }

  std::vector<Flag> CreateFlags(ToolParams* params) const final;

  void LogParams(const ToolParams& params) const final;

  TfLiteDelegatePtr CreateTfLiteDelegate(const ToolParams& params) const final;

  std::string GetName() const final { return "DummyDelegate"; }
};
REGISTER_DELEGATE_PROVIDER(DummyDelegateProvider);

std::vector<Flag> DummyDelegateProvider::CreateFlags(ToolParams* params) const {
  std::vector<Flag> flags = {CreateFlag<bool>("use_dummy_delegate", params,
                                              "use the dummy delegate.")};
  return flags;
}

void DummyDelegateProvider::LogParams(const ToolParams& params) const {
  TFLITE_LOG(INFO) << "Use dummy test delegate : ["
                   << params.Get<bool>("use_dummy_delegate") << "]";
}

TfLiteDelegatePtr DummyDelegateProvider::CreateTfLiteDelegate(
    const ToolParams& params) const {
  if (params.Get<bool>("use_dummy_delegate")) {
    auto default_options = TfLiteDummyDelegateOptionsDefault();
    return TfLiteDummyDelegateCreateUnique(&default_options);
  }
  return TfLiteDelegatePtr(nullptr, [](TfLiteDelegate*) {});
}

The BUILD rule definitions are important because you need to make sure that the library is always linked and not dropped by the optimizer.

#### The following are for using the dummy test delegate in TFLite tooling ####
cc_library(
    name = "dummy_delegate_provider",
    srcs = [""],
    copts = tflite_copts(),
    deps = [
        # ...
    ],
    alwayslink = 1,  # This is required so the optimizer doesn't optimize the library away.
)
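To see why always-linking matters, consider a minimal sketch of a registrar that is populated by static objects. This illustrates the general self-registration pattern only; it is assumed, not confirmed here, that the TFLite delegate registrar works in a similar spirit. `ToyProvider`, `ToyRegistrar` and `ToyDummyProvider` are hypothetical names.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Hypothetical provider interface; only the name matters for this sketch.
class ToyProvider {
 public:
  virtual ~ToyProvider() = default;
  virtual std::string GetName() const = 0;
};

// Hypothetical registrar: a process-wide list of providers.
class ToyRegistrar {
 public:
  static std::vector<std::unique_ptr<ToyProvider>>& Providers() {
    // Function-local static so the list exists before any registration runs.
    static auto* providers = new std::vector<std::unique_ptr<ToyProvider>>();
    return *providers;
  }

  // Registers a provider as a side effect of constructing a static object.
  template <typename T>
  struct Register {
    Register() { Providers().push_back(std::make_unique<T>()); }
  };
};

class ToyDummyProvider : public ToyProvider {
 public:
  std::string GetName() const override { return "ToyDummy"; }
};

// Nothing else refers to this object, which is why a linker may drop the
// containing object file from a static library unless it is always linked.
static ToyRegistrar::Register<ToyDummyProvider> toy_dummy_registration;
```

Because the tool never references the provider by name, the only thing that keeps the registration alive in a statically linked library is a setting such as `alwayslink = 1`.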

Now add the following wrapper rules to your BUILD file to create versions of the Benchmark Tool, the Inference Diff Tool, and the other evaluation tools that can run with your own delegate.

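cc_binary(
    name = "benchmark_model_plus_dummy_delegate",
    copts = tflite_copts(),
    linkopts = task_linkopts(),
    deps = [
        ":dummy_delegate_provider",
        # ... plus the main library target of the Benchmark Tool.
    ],
)

cc_binary(
    name = "inference_diff_plus_dummy_delegate",
    copts = tflite_copts(),
    linkopts = task_linkopts(),
    deps = [
        ":dummy_delegate_provider",
        # ... plus the main library target of the Inference Diff Tool.
    ],
)

cc_binary(
    name = "imagenet_classification_eval_plus_dummy_delegate",
    copts = tflite_copts(),
    linkopts = task_linkopts(),
    deps = [
        ":dummy_delegate_provider",
        # ... plus the main library target of the image classification tool.
    ],
)

cc_binary(
    name = "coco_object_detection_eval_plus_dummy_delegate",
    copts = tflite_copts(),
    linkopts = task_linkopts(),
    deps = [
        ":dummy_delegate_provider",
        # ... plus the main library target of the object detection tool.
    ],
)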

You can also plug in this delegate provider to TFLite kernel tests as described here.

Option 2: Leverage external delegate

In this alternative, you first create an external delegate adaptor as shown below. Note that this approach is slightly less preferred than Option 1, for the reasons given above.

namespace tflite {
namespace tools {

TfLiteDelegate* CreateDummyDelegateFromOptions(char** options_keys,
                                               char** options_values,
                                               size_t num_options) {
  DummyDelegateOptions options = TfLiteDummyDelegateOptionsDefault();

  // Parse key-values options to DummyDelegateOptions.
  // You can achieve this by mimicking them as command-line flags.
  // The array form of unique_ptr releases the buffer with delete[].
  std::unique_ptr<const char*[]> argv(new const char*[num_options + 1]);
  constexpr char kDummyDelegateParsing[] = "dummy_delegate_parsing";
  argv[0] = kDummyDelegateParsing;

  std::vector<std::string> option_args;
  // Reserve up front so the c_str() pointers stored in argv stay valid.
  option_args.reserve(num_options);
  for (size_t i = 0; i < num_options; ++i) {
    option_args.emplace_back("--");
    option_args.rbegin()->append(options_keys[i]);
    option_args.rbegin()->push_back('=');
    option_args.rbegin()->append(options_values[i]);
    argv[i + 1] = option_args.rbegin()->c_str();
  }

  // Define command-line flags.
  // ...
  std::vector<tflite::Flag> flag_list = {
      // ...
  };

  int argc = static_cast<int>(num_options) + 1;
  if (!tflite::Flags::Parse(&argc, argv.get(), flag_list)) {
    return nullptr;
  }

  return TfLiteDummyDelegateCreate(&options);
}

}  // namespace tools
}  // namespace tflite

#ifdef __cplusplus
extern "C" {
#endif  // __cplusplus

// Defines two symbols that need to be exported to use the TFLite external
// delegate. See tensorflow/lite/delegates/external for details.
TFL_CAPI_EXPORT TfLiteDelegate* tflite_plugin_create_delegate(
    char** options_keys, char** options_values, size_t num_options,
    void (*report_error)(const char*)) {
  return tflite::tools::CreateDummyDelegateFromOptions(
      options_keys, options_values, num_options);
}

TFL_CAPI_EXPORT void tflite_plugin_destroy_delegate(TfLiteDelegate* delegate) {
  TfLiteDummyDelegateDelete(delegate);
}

#ifdef __cplusplus
}
#endif  // __cplusplus
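The key-value to command-line conversion used by the adaptor can be exercised on its own. The following is a minimal sketch of that step using only the standard library. `BuildFlagStrings` and `BuildArgv` are hypothetical helper names, and the option keys in the usage note are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Builds "--key=value" strings from parallel key/value arrays.
std::vector<std::string> BuildFlagStrings(const char* const* keys,
                                          const char* const* values,
                                          std::size_t num_options) {
  std::vector<std::string> args;
  args.reserve(num_options);
  for (std::size_t i = 0; i < num_options; ++i) {
    args.push_back(std::string("--") + keys[i] + "=" + values[i]);
  }
  return args;
}

// Builds an argv-style pointer array. The pointers borrow from `args`, so
// `args` must outlive the result and must not be modified afterwards.
std::vector<const char*> BuildArgv(const char* program_name,
                                   const std::vector<std::string>& args) {
  std::vector<const char*> argv;
  argv.reserve(args.size() + 1);
  argv.push_back(program_name);  // argv[0] is conventionally the program name
  for (const std::string& arg : args) argv.push_back(arg.c_str());
  return argv;
}
```

For example, the hypothetical keys "num_threads" and "verbose" with values "4" and "true" become "--num_threads=4" and "--verbose=true". Building all strings first and taking pointers afterwards avoids dangling pointers if the string vector reallocates.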

Now create the corresponding BUILD target to build a dynamic library as shown below:

cc_binary(
    name = "",
    srcs = [
        # ...
    ],
    linkshared = 1,
    linkstatic = 1,
    deps = [
        # ...
    ],
)

After this external delegate .so file is created, you can build binaries or use pre-built ones to run with the new delegate, as long as the binary is linked with the external_delegate_provider library, which supports command-line flags as described here. Note: this external delegate provider is already linked into the existing testing and tooling binaries.

Refer to descriptions here for an illustration of how to benchmark the dummy delegate via this external-delegate approach. You can use similar commands for the testing and evaluation tools mentioned earlier.

Note that the external delegate is the C++ counterpart of the delegate in the TensorFlow Lite Python binding, as shown here. Therefore, the dynamic external delegate adaptor library created here can be used directly with the TensorFlow Lite Python APIs.

