View source on GitHub |
A PTransformAnalyzer which enables analyzer cache.
tft.experimental.CacheablePTransformAnalyzer(
make_accumulators_ptransform,
merge_accumulators_ptransform,
extract_output_ptransform,
cache_coder
)
- make_accumulators_ptransform: this is a
beam.PTransform
which maps data to a more compact mergeable representation (accumulator). Mergeable here means that it is possible to combine multiple representations produced from a partition of the dataset into a representation of the entire dataset. - merge_accumulators_ptransform: this is a
beam.PTransform
which operates on a collection of accumulators, i.e. the results of both the make_accumulators_ptransform and merge_accumulators_ptransform stages, and produces a single reduced accumulator. This operation must be associative and commutative in order to have reliably reproducible results. - extract_output: this is a
beam.PTransform
which operates on the result of the merge_accumulators_ptransform stage, and produces the outputs of the analyzer. These outputs must be consistent with theoutput_dtypes
andoutput_shapes
provided toptransform_analyzer
.
This container also holds a cache_coder
(PTransformAnalyzerCacheCoder
)
which can encode outputs and decode the inputs of the
merge_accumulators_ptransform
stage.
In many cases, SimpleJsonPTransformAnalyzerCacheCoder
would be sufficient.
To ensure the correctness of this analyzer, the following must hold: merge(make({D1, ..., Dn})) == merge({make(D1), ..., make(Dn)})