Performance is an important consideration when training machine learning models. Faster training shortens research iterations and lets experiments scale, while fast inference gives end users near-instant predictions. This section describes the high-level APIs and best practices for building and training high-performance models, and for quantizing models to achieve the lowest latency and highest throughput at inference time.

The TensorFlow Model Optimization Toolkit is a set of techniques for optimizing models for inference.
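
As a rough illustration of what quantizing a model involves, the sketch below maps floating-point weights to 8-bit integers with a scale and a zero point, then maps them back. It assumes affine (asymmetric) 8-bit quantization, which is one common scheme. It is not the toolkit's implementation, and the helper names (quantize_params, quantize, dequantize) are hypothetical.

```python
def quantize_params(values, qmin=-128, qmax=127):
    """Pick a scale and zero point so the value range maps onto [qmin, qmax]."""
    # Extend the range to include 0.0 so that zero is exactly representable.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin)
    if scale == 0.0:
        scale = 1.0  # all values are zero; any positive scale works
    zero_point = int(round(qmin - lo / scale))
    return scale, zero_point


def quantize(values, scale, zero_point, qmin=-128, qmax=127):
    """Map floats to integers, clamping to the representable range."""
    return [max(qmin, min(qmax, int(round(v / scale)) + zero_point))
            for v in values]


def dequantize(quantized, scale, zero_point):
    """Map integers back to approximate floats."""
    return [scale * (q - zero_point) for q in quantized]


# Hypothetical weights, chosen only for illustration.
weights = [-1.0, -0.25, 0.0, 0.5, 1.0]
scale, zero_point = quantize_params(weights)
q = quantize(weights, scale, zero_point)
restored = dequantize(q, scale, zero_point)
```

Each restored value differs from the original by at most about one quantization step (scale). That small loss of precision is the price paid for storing one byte per weight instead of four.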

XLA (Accelerated Linear Algebra) is an experimental compiler for linear algebra that optimizes TensorFlow computations. The following guides explore XLA:

  • XLA Overview, which introduces XLA.
  • Broadcasting Semantics, which describes XLA's broadcasting semantics.
  • Developing a new back end for XLA, which explains how to re-target TensorFlow in order to optimize the performance of the computational graph for particular hardware.
  • Using JIT Compilation, which describes the XLA JIT compiler that compiles and runs parts of TensorFlow graphs via XLA in order to optimize performance.
  • Operation Semantics, which is a reference manual describing the semantics of operations in the ComputationBuilder interface.
  • Shapes and Layout, which details the Shape protocol buffer.
  • Using AOT compilation, which explains tfcompile, a standalone tool that compiles TensorFlow graphs into executable code in order to optimize performance.
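
To make the JIT compilation guide listed above more concrete, here is a minimal sketch of asking TensorFlow to compile a function with XLA. It assumes a recent TensorFlow 2.x release in which tf.function accepts jit_compile=True. Older releases may enable XLA JIT differently, so treat the decorator argument as an assumption. The function name dense_relu and the input values are hypothetical.

```python
import tensorflow as tf


# Assumption: jit_compile=True is available (recent TensorFlow 2.x).
# It asks TensorFlow to compile the whole function body with XLA,
# which can fuse the matmul, add, and relu into fewer kernels.
@tf.function(jit_compile=True)
def dense_relu(x, w, b):
    return tf.nn.relu(tf.matmul(x, w) + b)


x = tf.ones([4, 3])             # batch of 4 inputs with 3 features each
w = tf.fill([3, 2], 0.5)        # 3x2 weight matrix, every entry 0.5
b = tf.constant([0.0, -2.0])    # one bias per output unit

y = dense_relu(x, w, b)         # first call triggers compilation
```

Whether compilation helps depends on the model and hardware, so it is worth measuring with and without the flag.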