Options for generating statistics.
tfdv.StatsOptions(
generators: Optional[List[stats_generator.StatsGenerator]] = None,
feature_whitelist: Optional[List[types.FeatureName]] = None,
schema: Optional[schema_pb2.Schema] = None,
label_feature: Optional[types.FeatureName] = None,
weight_feature: Optional[types.FeatureName] = None,
slice_functions: Optional[List[types.SliceFunction]] = None,
sample_rate: Optional[float] = None,
num_top_values: int = 20,
frequency_threshold: int = 1,
weighted_frequency_threshold: float = 1.0,
num_rank_histogram_buckets: int = 1000,
num_values_histogram_buckets: int = 10,
num_histogram_buckets: int = 10,
num_quantiles_histogram_buckets: int = 10,
epsilon: float = 0.01,
infer_type_from_schema: bool = False,
desired_batch_size: Optional[int] = None,
enable_semantic_domain_stats: bool = False,
semantic_domain_stats_sample_rate: Optional[float] = None,
per_feature_weight_override: Optional[Dict[types.FeaturePath, types.FeatureName]] = None,
vocab_paths: Optional[Dict[types.VocabName, types.VocabPath]] = None,
add_default_generators: bool = True,
feature_allowlist: Optional[List[types.FeatureName]] = None
)
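A minimal sketch of constructing the options with a few of the keyword arguments from the signature above. The feature names (`age`, `country`, `label`, `example_weight`) and all values are hypothetical. The import is guarded so the snippet still runs where TFDV is not installed.

```python
# Keyword arguments come from the signature above; all values are illustrative.
options_kwargs = {
    "label_feature": "label",             # hypothetical feature name
    "weight_feature": "example_weight",   # hypothetical feature name
    "sample_rate": 0.1,                   # compute statistics over a 10% sample
    "num_top_values": 10,
    "num_histogram_buckets": 20,
    "feature_allowlist": ["age", "country", "label", "example_weight"],
}

try:
    import tensorflow_data_validation as tfdv
except ImportError:
    tfdv = None  # TFDV is not installed in this environment

if tfdv is not None:
    # Unspecified arguments keep the defaults listed in the signature.
    options = tfdv.StatsOptions(**options_kwargs)
```

The resulting object is typically passed as the `stats_options` argument of a statistics-generation function such as `tfdv.generate_statistics_from_csv`.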
Args |
generators
|
An optional list of statistics generators. A statistics
generator must extend either CombinerStatsGenerator or
TransformStatsGenerator.
|
feature_whitelist
|
DEPRECATED. Use feature_allowlist instead.
|
schema
|
An optional tensorflow_metadata Schema proto. Currently we use the
schema to infer categorical and bytes features.
|
label_feature
|
An optional feature name which represents the label.
|
weight_feature
|
An optional feature name whose numeric value represents
the weight of an example.
|
slice_functions
|
An optional list of functions that generate slice keys
for each example. Each slice function should take an example dict as
input and return a list of zero or more slice keys.
|
sample_rate
|
An optional sampling rate. If specified, statistics are
computed over the sample.
|
num_top_values
|
An optional number of most frequent feature values to keep
for string features.
|
frequency_threshold
|
An optional minimum number of examples the most
frequent values must be present in.
|
weighted_frequency_threshold
|
An optional minimum weighted number of
examples the most frequent weighted values must be present in. This
option is only relevant when a weight_feature is specified.
|
num_rank_histogram_buckets
|
An optional number of buckets in the rank
histogram for string features.
|
num_values_histogram_buckets
|
An optional number of buckets in a quantiles
histogram for the number of values per Feature, which is stored in
CommonStatistics.num_values_histogram.
|
num_histogram_buckets
|
An optional number of buckets in a standard
NumericStatistics.histogram with equal-width buckets.
|
num_quantiles_histogram_buckets
|
An optional number of buckets in a
quantiles NumericStatistics.histogram.
|
epsilon
|
An optional error tolerance for the computation of quantiles,
typically a small fraction close to zero (e.g. 0.01). Higher values of
epsilon increase the quantile approximation error, and hence result in
more unequal buckets, but could improve performance and reduce resource
consumption.
|
infer_type_from_schema
|
A boolean to indicate whether the feature types
should be inferred from the schema. If set to True, an input schema
must be provided. This flag is used only when generating statistics
on CSV data.
|
desired_batch_size
|
An optional number of examples to include in each
batch that is passed to the statistics generators.
|
enable_semantic_domain_stats
|
If True, statistics for semantic domains are
generated (e.g., image and text domains).
|
semantic_domain_stats_sample_rate
|
An optional sampling rate for semantic
domain statistics. If specified, semantic domain statistics are computed
over a sample.
|
per_feature_weight_override
|
If specified, the "example weight" paired
with a feature is first looked up in this map; if it is not found there,
the lookup falls back to weight_feature.
|
vocab_paths
|
An optional dictionary mapping vocab names to paths. Used in
the schema when specifying a NaturalLanguageDomain. The paths can either
be to GZIP-compressed TF record files that have a tfrecord.gz suffix
or to text files.
|
add_default_generators
|
Whether to invoke the default set of stats
generators in the run. The generators invoked consist of 1) the default
generators (controlled by this option); 2) user-provided generators
(controlled by the generators option); 3) semantic generators
(controlled by enable_semantic_domain_stats); and 4) schema-based
generators that are enabled based on information provided in the schema.
|
feature_allowlist
|
An optional list of names of the features to calculate
statistics for.
|
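As an illustration of the slice_functions argument described above, here is a minimal sketch of a function that takes an example dict and returns a list of zero or more slice keys. The layout of the example dict (feature name mapped to a list of values), the `country` feature, and the key format are assumptions made for the example, not something the reference specifies.

```python
from typing import Any, Dict, List


def slice_by_country(example: Dict[str, Any]) -> List[str]:
    """Returns one slice key per `country` value, or no keys if it is absent."""
    values = example.get("country")  # hypothetical feature name
    if values is None:
        return []
    return [f"country_{value}" for value in values]


print(slice_by_country({"country": ["US"], "age": [31]}))  # ['country_US']
print(slice_by_country({"age": [31]}))                     # []
```

A function like this would be supplied as `slice_functions=[slice_by_country]`. An example with no `country` value yields no slice keys.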
Attributes |
add_default_generators
|
|
desired_batch_size
|
|
example_weight_map
|
|
feature_allowlist
|
|
generators
|
|
num_histogram_buckets
|
|
num_quantiles_histogram_buckets
|
|
num_values_histogram_buckets
|
|
sample_rate
|
|
schema
|
|
semantic_domain_stats_sample_rate
|
|
slice_functions
|
|
vocab_paths
|
|
Methods
from_json
@classmethod
from_json(
options_json: Text
) -> "StatsOptions"
Construct an instance of stats options from a JSON representation.
Args |
options_json
|
A JSON representation of the __dict__ attribute of a
StatsOptions instance.
|
Returns |
A StatsOptions instance constructed by setting the __dict__ attribute to
the deserialized value of options_json.
|
to_json
to_json() -> Text
Converts the object to a JSON representation of its __dict__ attribute.
Custom generators and slice_functions are skipped, meaning that they will
not be used when running TFDV in a setting where the stats options have
first been JSON-serialized. This happens when TFDV is run as a
TFX component. The schema proto will be JSON-encoded.
Returns |
A JSON representation of a filtered version of __dict__.
|
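The effect of skipping generators and slice_functions during serialization can be sketched without TFDV. The snippet below is a conceptual stand-in, not TFDV's implementation: the helper names, the plain-dict representation, and the key names are all hypothetical. It only shows why entries holding Python callables do not survive a JSON round trip.

```python
import json
from typing import Any, Dict

# Hypothetical key names; TFDV's internal attribute names may differ.
NON_SERIALIZABLE_KEYS = ("generators", "slice_functions")


def to_json_sketch(options: Dict[str, Any]) -> str:
    """Serializes the options, dropping entries that hold Python callables."""
    filtered = {k: v for k, v in options.items() if k not in NON_SERIALIZABLE_KEYS}
    return json.dumps(filtered)


def from_json_sketch(options_json: str) -> Dict[str, Any]:
    """Rebuilds the options dict from its JSON form."""
    return json.loads(options_json)


original = {
    "sample_rate": 0.1,
    "num_top_values": 10,
    "slice_functions": [lambda example: ["all"]],  # cannot be JSON-encoded
    "generators": None,
}
restored = from_json_sketch(to_json_sketch(original))
print(sorted(restored))  # ['num_top_values', 'sample_rate']
```

With TFDV installed, the corresponding round trip uses the two methods documented above: `tfdv.StatsOptions.from_json(options.to_json())`.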