Overview
This guide assumes familiarity with the TensorFlow Profiler and tf.data. It aims to provide step-by-step instructions with examples to help users diagnose and fix input pipeline performance issues.
To begin, collect a profile of your TensorFlow job. Instructions on how to do so are available for CPUs/GPUs and Cloud TPUs.
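As a minimal sketch, you can also collect a profile programmatically with the tf.profiler.experimental API; the log directory, step count, dataset, and train_step below are illustrative placeholders, not part of the original instructions:
import tensorflow as tf

tf.profiler.experimental.start('logdir')  # 'logdir' is an illustrative output path.
for step, batch in enumerate(dataset):    # `dataset` is assumed to be your tf.data pipeline.
  train_step(batch)                       # `train_step` stands in for your model computation.
  if step >= 10:                          # Capture a handful of steps, then stop profiling.
    break
tf.profiler.experimental.stop()           # The resulting trace can be viewed in TensorBoard's Profiler.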
The analysis workflow detailed below focuses on the trace viewer tool in the
Profiler. This tool displays a timeline that shows the duration of ops executed
by your TensorFlow program and allows you to identify which ops take the longest
to execute. For more information on the trace viewer, check out
this section of the TF
Profiler guide. In general, tf.data
events will appear on the host CPU
timeline.
Analysis Workflow
Please follow the workflow below. If you have feedback to help us improve it, please create a GitHub issue with the label “comp:data”.
1. Is your tf.data pipeline producing data fast enough?
Begin by ascertaining whether the input pipeline is the bottleneck for your TensorFlow program.
To do so, look for IteratorGetNext::DoCompute
ops in the trace viewer. In
general, you expect to see these at the start of a step. These slices represent
the time it takes for your input pipeline to yield a batch of elements when it
is requested. If you’re using Keras or iterating over your dataset in a tf.function, these should be found in tf_data_iterator_get_next threads.
Note that if you’re using a
distribution strategy,
you may see IteratorGetNextAsOptional::DoCompute
events instead of
IteratorGetNext::DoCompute
(as of TF 2.3).
If the calls return quickly (<= 50 us), this means that your data is available when it is requested. The input pipeline is not your bottleneck; see the Profiler guide for more generic performance analysis tips.
If the calls return slowly, tf.data
is unable to keep up with the
consumer’s requests. Continue to the next section.
2. Are you prefetching data?
The best practice for input pipeline performance is to insert a
tf.data.Dataset.prefetch
transformation at the end of your tf.data
pipeline.
This transformation overlaps the input pipeline’s preprocessing computation with
the next step of model computation and is required for optimal input pipeline
performance when training your model. If you’re prefetching data, you should see an Iterator::Prefetch slice on the same thread as the IteratorGetNext::DoCompute op.
If you don’t have a prefetch
at the end of your pipeline, you should add
one. For more information about tf.data
performance recommendations, see the
tf.data performance guide.
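As a minimal sketch, adding prefetch at the end of an existing pipeline looks like the following (dataset stands in for your own pipeline):
# Overlap input preprocessing with model computation; let tf.data tune the buffer size.
dataset = dataset.prefetch(tf.data.AUTOTUNE)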
If you’re already prefetching data, and the input pipeline is still your bottleneck, continue to the next section to further analyze performance.
3. Are you reaching high CPU utilization?
tf.data
achieves high throughput by trying to make the best possible use of
available resources. In general, even when running your model on an accelerator
like a GPU or TPU, the tf.data
pipelines are run on the CPU. You can check
your utilization with tools like sar and
htop, or in the
cloud monitoring console if you’re running on GCP.
If your utilization is low, this suggests that your input pipeline may not be taking full advantage of the host CPU. You should consult the tf.data performance guide for best practices. If you have applied the best practices and utilization and throughput remain low, continue to Bottleneck analysis below.
If your utilization is approaching the resource limit, in order to improve performance further, you need to either improve the efficiency of your input pipeline (for example, avoiding unnecessary computation) or offload computation.
You can improve the efficiency of your input pipeline by avoiding unnecessary computation in tf.data. One way of doing this is to insert a
tf.data.Dataset.cache
transformation after computation-intensive work if your data fits into memory;
this reduces computation at the cost of increased memory usage. Additionally,
disabling intra-op parallelism in tf.data
has the potential to increase
efficiency by > 10%, and can be done by setting the following option on your
input pipeline:
dataset = ...  # your existing input pipeline
options = tf.data.Options()
# Limit tf.data to a single intra-op thread per operation.
options.experimental_threading.max_intra_op_parallelism = 1
dataset = dataset.with_options(options)
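As a sketch of the caching suggestion above, assuming a computation-intensive map function named expensive_preprocess (a hypothetical placeholder) and a dataset small enough to fit in memory:
dataset = dataset.map(expensive_preprocess)  # hypothetical computation-intensive step
# Cache results so the expensive work runs only during the first epoch,
# trading additional memory usage for reduced computation.
dataset = dataset.cache()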
4. Bottleneck Analysis
The following section walks through how to read tf.data
events in the trace
viewer to understand where the bottleneck is and possible mitigation strategies.
Understanding tf.data events in the Profiler
Each tf.data event in the Profiler has the name Iterator::<Dataset>, where <Dataset> is the name of the dataset source or transformation. Each event also has the long name Iterator::<Dataset_1>::...::<Dataset_n>, which you can see by clicking on the tf.data event. In the long name, <Dataset_n> matches <Dataset> from the (short) name, and the other datasets in the long name represent downstream transformations.
For example, the above screenshot was generated from the following code:
dataset = tf.data.Dataset.range(10)
dataset = dataset.map(lambda x: x)
dataset = dataset.repeat(2)
dataset = dataset.batch(5)
Here, the Iterator::Map
event has the long name
Iterator::BatchV2::FiniteRepeat::Map
. Note that the dataset names may differ slightly from the Python API (for example, FiniteRepeat instead of Repeat), but should be intuitive enough to parse.
Synchronous and asynchronous transformations
For synchronous tf.data transformations (such as Batch and Map), you will see events from upstream transformations on the same thread. In the above example, since all the transformations used are synchronous, all the events appear on the same thread.
For asynchronous transformations (such as Prefetch, ParallelMap, ParallelInterleave, and MapAndBatch), events from upstream transformations will be on a different thread. In such cases, the “long name” can help you identify which transformation in a pipeline an event corresponds to.
For example, the above screenshot was generated from the following code:
dataset = tf.data.Dataset.range(10)
dataset = dataset.map(lambda x: x)
dataset = dataset.repeat(2)
dataset = dataset.batch(5)
dataset = dataset.prefetch(1)
Here, the Iterator::Prefetch events are on the tf_data_iterator_get_next threads. Since Prefetch is asynchronous, its input events (BatchV2) will be on a different thread, and can be located by searching for the long name Iterator::Prefetch::BatchV2. In this case, they are on the tf_data_iterator_resource thread. From its long name, you can deduce that BatchV2 is upstream of Prefetch. Furthermore, the parent_id of the BatchV2 event will match the ID of the Prefetch event.
Identifying the bottleneck
In general, to identify the bottleneck in your input pipeline, walk the input
pipeline from the outermost transformation all the way to the source. Starting
from the final transformation in your pipeline, recurse into upstream
transformations until you find a slow transformation or reach a source dataset, such as TFRecord. In the example above, you would start from Prefetch, then walk upstream to BatchV2, FiniteRepeat, Map, and finally Range.
In general, a slow transformation corresponds to one whose events are long, but whose input events are short. Some examples follow.
Note that the final (outermost) transformation in most host input pipelines is
the Iterator::Model
event. The Model transformation is introduced
automatically by the tf.data
runtime and is used for instrumenting and
autotuning the input pipeline performance.
If your job is using a
distribution strategy,
the trace viewer will contain additional events that correspond to the device input pipeline. The outermost transformation of the device pipeline (nested under IteratorGetNextOp::DoCompute or IteratorGetNextAsOptionalOp::DoCompute) will be an Iterator::Prefetch event with an upstream Iterator::Generator event. You can find the corresponding host pipeline by searching for Iterator::Model events.
Example 1
The above screenshot is generated from the following input pipeline:
dataset = tf.data.TFRecordDataset(filename)
dataset = dataset.map(parse_record)
dataset = dataset.batch(32)
dataset = dataset.repeat()
In the screenshot, observe that (1) Iterator::Map events are long, but (2) their input events (Iterator::FlatMap) return quickly. This suggests that the sequential Map transformation is the bottleneck.
Note that in the screenshot, the InstantiatedCapturedFunction::Run
event
corresponds to the time it takes to execute the map function.
Example 2
The above screenshot is generated from the following input pipeline:
dataset = tf.data.TFRecordDataset(filename)
dataset = dataset.map(parse_record, num_parallel_calls=2)
dataset = dataset.batch(32)
dataset = dataset.repeat()
This example is similar to the above, but uses ParallelMap instead of Map. We notice here that (1) Iterator::ParallelMap events are long, but (2) their input events Iterator::FlatMap (which are on a different thread, since ParallelMap is asynchronous) are short. This suggests that the ParallelMap transformation is the bottleneck.
Addressing the bottleneck
Source datasets
If you’ve identified a dataset source as the bottleneck, such as reading from
TFRecord files, you can improve performance by parallelizing data extraction. To
do so, ensure that your data is sharded across multiple files and use
tf.data.Dataset.interleave
with the num_parallel_calls
parameter set to
tf.data.AUTOTUNE
. If determinism is not important to your program, you can further improve performance by setting the deterministic=False flag on tf.data.Dataset.interleave (available as of TF 2.2). For example, if you’re reading from TFRecords, you can do the following:
dataset = tf.data.Dataset.from_tensor_slices(filenames)
dataset = dataset.interleave(tf.data.TFRecordDataset,
                             num_parallel_calls=tf.data.AUTOTUNE,
                             deterministic=False)
Note that sharded files should be reasonably large to amortize the overhead of
opening a file. For more details on parallel data extraction, see
this section
of the tf.data
performance guide.
Transformation datasets
If you’ve identified an intermediate tf.data
transformation as the bottleneck,
you can address it by parallelizing the transformation or
caching the computation
if your data fits into memory and it is appropriate. Some transformations, such as Map, have parallel counterparts; the tf.data performance guide demonstrates how to parallelize these (a brief sketch also follows the outer-parallelism example below). Other transformations, such as Filter, Unbatch, and Batch, are inherently sequential; you can parallelize them by introducing “outer parallelism”. For example, supposing your input pipeline initially looks like the following, with Batch as the bottleneck:
filenames = tf.data.Dataset.list_files(file_path, shuffle=is_training)
dataset = filenames_to_dataset(filenames)
dataset = dataset.batch(batch_size)
You can introduce “outer parallelism” by running multiple copies of the input pipeline over sharded inputs and combining the results:
filenames = tf.data.Dataset.list_files(file_path, shuffle=is_training)

def make_dataset(shard_index):
  # Each copy of the pipeline reads a different shard of the input files.
  shard_filenames = filenames.shard(NUM_SHARDS, shard_index)
  dataset = filenames_to_dataset(shard_filenames)
  return dataset.batch(batch_size)

indices = tf.data.Dataset.range(NUM_SHARDS)
dataset = indices.interleave(make_dataset,
                             num_parallel_calls=tf.data.AUTOTUNE)
dataset = dataset.prefetch(tf.data.AUTOTUNE)
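As a brief sketch of the parallel counterpart of Map mentioned above, assuming a placeholder map function preprocess_fn:
# Parallelize the map transformation and let tf.data choose the parallelism level.
dataset = dataset.map(preprocess_fn, num_parallel_calls=tf.data.AUTOTUNE)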
Additional resources
- tf.data performance guide on how to write performant tf.data input pipelines
- Inside TensorFlow video: tf.data best practices
- Profiler guide
- Profiler tutorial with Colab