API for detecting feature skew between training and serving examples.
tfdv.DetectFeatureSkew(
    identifier_features: List[types.FeatureName],
    features_to_ignore: Optional[List[types.FeatureName]] = None,
    sample_size: int = 0,
    float_round_ndigits: Optional[int] = None,
    allow_duplicate_identifiers: bool = False
) -> None
Example:
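import apache_beam as beam
import tensorflow as tf
import tensorflow_data_validation as tfdv

with beam.Pipeline(runner=...) as p:
    training_examples = (
        p | 'ReadTrainingData' >> beam.io.ReadFromTFRecord(
            training_filepaths, coder=beam.coders.ProtoCoder(tf.train.Example)))
    serving_examples = (
        p | 'ReadServingData' >> beam.io.ReadFromTFRecord(
            serving_filepaths, coder=beam.coders.ProtoCoder(tf.train.Example)))
    # The transform yields two outputs: skew results and sampled skew pairs.
    skew_results, skew_pairs = (
        (training_examples, serving_examples)
        | 'DetectFeatureSkew' >> tfdv.DetectFeatureSkew(
            identifier_features=['id1'], sample_size=5))
    # Write each output to its own (placeholder) path.
    _ = (skew_results | 'WriteFeatureSkewResultsOutput' >>
         tfdv.WriteFeatureSkewResultsToTFRecord(skew_results_output_path))
    _ = (skew_pairs | 'WriteFeatureSkewPairsOutput' >>
         tfdv.WriteFeatureSkewPairsToTFRecord(skew_pairs_output_path))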
See the documentation for DetectFeatureSkewImpl for more detail about feature
skew detection.
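Conceptually, skew detection joins training and serving examples on their identifier features and then compares the remaining features pair by pair. The following is a minimal pure-Python sketch of that idea, not TFDV's implementation: the helper name `count_feature_skew`, the plain-dict representation of an example, and the shape of the returned counts are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List, Optional, Tuple

Example = Dict[str, Any]  # Hypothetical stand-in for a tf.train.Example.


def count_feature_skew(
    training: Iterable[Example],
    serving: Iterable[Example],
    identifier_features: List[str],
    features_to_ignore: Optional[List[str]] = None,
) -> Dict[str, Dict[str, int]]:
    """Counts per-feature matches and mismatches across joined examples."""
    skipped = set(identifier_features) | set(features_to_ignore or [])

    def key(example: Example) -> Tuple[Any, ...]:
        return tuple(example[name] for name in identifier_features)

    # Index serving examples by identifier (assumes identifiers are unique).
    serving_by_key = {key(ex): ex for ex in serving}

    counts: Dict[str, Dict[str, int]] = defaultdict(
        lambda: {"match": 0, "mismatch": 0})
    for train_ex in training:
        serve_ex = serving_by_key.get(key(train_ex))
        if serve_ex is None:
            continue  # No serving counterpart, so nothing to compare.
        for name in (set(train_ex) | set(serve_ex)) - skipped:
            same = train_ex.get(name) == serve_ex.get(name)
            counts[name]["match" if same else "mismatch"] += 1
    return dict(counts)


training = [{"id1": "a", "x": 1, "y": 2.0}, {"id1": "b", "x": 3, "y": 4.0}]
serving = [{"id1": "a", "x": 1, "y": 2.5}, {"id1": "b", "x": 3, "y": 4.0}]
skew = count_feature_skew(training, serving, identifier_features=["id1"])
print(skew)
```

In this toy data, feature `y` differs for the pair with `id1 == "a"`, so it is the only feature that records a mismatch. The identifier feature itself is excluded from the comparison, as are any names passed in `features_to_ignore`.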
| Args | |
|---|---|
| identifier_features | Names of features to use as identifiers. |
| features_to_ignore | Names of features for which no feature skew detection is done. |
| sample_size | Size of the sample of training-serving example pairs that exhibit skew to include in the skew results. |
| float_round_ndigits | Number of digits of precision after the decimal point to which float values are rounded before they are compared. |
| allow_duplicate_identifiers | If set, skew detection is also done on examples that share duplicate identifier feature values. In this case, the counts in the FeatureSkew result are based on each training-serving example pair analyzed. All examples with a given set of identifier feature values must fit in memory. |
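To make the `float_round_ndigits` and `allow_duplicate_identifiers` descriptions more concrete, here is a small plain-Python sketch. It illustrates the described behavior under assumptions and is not TFDV code: `values_match` and `pairs_for_identifier` are hypothetical helpers, and using Python's built-in `round` is only an assumption about how the rounding might be done.

```python
import itertools
from typing import Any, List, Optional, Tuple


def values_match(
    a: Any, b: Any, float_round_ndigits: Optional[int] = None
) -> bool:
    """Compares two values, rounding floats first when ndigits is given."""
    if (float_round_ndigits is not None
            and isinstance(a, float) and isinstance(b, float)):
        return round(a, float_round_ndigits) == round(b, float_round_ndigits)
    return a == b


def pairs_for_identifier(
    training: List[Any], serving: List[Any]
) -> List[Tuple[Any, Any]]:
    """Returns every training-serving pair that shares one identifier value."""
    return list(itertools.product(training, serving))


# Without rounding, a small numeric difference counts as a mismatch.
print(values_match(0.12341, 0.12349))                         # False
# Rounding both values to 3 digits makes them compare equal.
print(values_match(0.12341, 0.12349, float_round_ndigits=3))  # True
# Two training and three serving duplicates produce six pairs to count.
print(len(pairs_for_identifier(["t1", "t2"], ["s1", "s2", "s3"])))  # 6
```

Under this reading, the number of pairs for a duplicated identifier grows as the product of the two group sizes, which is one way to see why all examples sharing an identifier would need to fit in memory together.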
| Class Variables | |
|---|---|
| pipeline | None |