Computes the unique values of a Tensor
over the whole dataset.
tft.experimental.approximate_vocabulary(
x: common_types.TensorType,
top_k: int,
*,
vocab_filename: Optional[str] = None,
store_frequency: bool = False,
reserved_tokens: Optional[Union[Sequence[str], tf.Tensor]] = None,
weights: Optional[tf.Tensor] = None,
file_format: common_types.VocabularyFileFormatType = analyzers.DEFAULT_VOCABULARY_FILE_FORMAT,
name: Optional[str] = None
) -> common_types.TemporaryAnalyzerOutputType
Approximately computes the unique values taken by x, which can be a Tensor, SparseTensor, or RaggedTensor of any size. The unique values will be aggregated over all dimensions of x and all instances.
This analyzer provides an approximate alternative to tft.vocabulary that can be more efficient with a smaller top_k and/or a smaller number of unique elements in x. As a rule of thumb, approximate_vocabulary becomes more efficient than tft.vocabulary if top_k or the number of unique elements in x is smaller than 2*10^5. Moreover, this analyzer is subject to a combiner packing optimization that does not apply to tft.vocabulary. Caching is also more efficient with the approximate implementation, since filtering happens before the cache is written. The output artifact of approximate_vocabulary is consistent with that of tft.vocabulary and can be used with the tft.apply_vocabulary mapper.
The implementation of this analyzer is based on the Misra-Gries algorithm [1]. It stores at most top_k elements with lower bound frequency estimates at a time. The algorithm keeps track of the approximation error delta such that for any item x with true frequency X:

frequency[x] <= X <= frequency[x] + delta,
delta <= (m - m') / (top_k + 1),

where m is the total frequency of the items in the dataset and m' is the sum of the lower bound estimates in frequency [2]. For datasets that are Zipfian distributed with parameter a, the algorithm provides an expected value of delta = m / (top_k ^ a) [3].
[1] https://www.cs.utexas.edu/users/misra/scannedPdf.dir/FindRepeatedElements.pdf
[2] http://www.cohenwang.com/edith/bigdataclass2013/lectures/lecture1.pdf
[3] http://dimacs.rutgers.edu/~graham/pubs/papers/countersj.pdf
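
The error bound can be illustrated with a minimal pure-Python sketch of the classic Misra-Gries summary. This is not the code the analyzer runs: the function name misra_gries and its return shape are assumptions made for illustration, and weights are ignored.

```python
from collections import Counter


def misra_gries(stream, top_k):
    """Illustrative Misra-Gries summary (hypothetical helper, not the TFT code).

    Returns a dict of lower bound frequency estimates holding at most
    top_k items, and the error bound delta.
    """
    counters = {}
    total = 0  # m: total number of items seen
    for item in stream:
        total += 1
        if item in counters:
            counters[item] += 1
        elif len(counters) < top_k:
            counters[item] = 1
        else:
            # Summary is full: drop the new item and decrement every counter.
            # This removes exactly top_k + 1 occurrences from the summary.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    # m' is the sum of the remaining lower bound estimates.
    delta = (total - sum(counters.values())) / (top_k + 1)
    return counters, delta


stream = "abacabad"
counters, delta = misra_gries(stream, top_k=2)

# Every true frequency lies within [estimate, estimate + delta].
for item, true_count in Counter(stream).items():
    estimate = counters.get(item, 0)
    assert estimate <= true_count <= estimate + delta
```

In this sketch delta equals the number of decrement rounds, which is why no estimate can undercount by more than delta.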
If file_format is 'text', any token that is empty or contains the '\n' or '\r' characters will be discarded.
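
The discard rule for the 'text' format can be sketched as a simple predicate. The helper name keep_in_text_vocab is hypothetical and only mirrors the rule as described here; it is not part of the tft API.

```python
def keep_in_text_vocab(token: str) -> bool:
    """Hypothetical check: True if a token could be kept in a 'text' vocabulary file."""
    # Empty tokens and tokens containing line breaks are discarded,
    # since each line of a text vocabulary file holds one token.
    return token != "" and "\n" not in token and "\r" not in token


tokens = ["cat", "", "two\nlines", "carriage\rreturn", "dog"]
kept = [t for t in tokens if keep_in_text_vocab(t)]
```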
If an integer Tensor is provided, its semantic type should be categorical, not continuous/numeric, since computing a vocabulary over a continuous feature is not appropriate.
The unique values are sorted by decreasing frequency and then reverse lexicographical order (e.g. [('a', 5), ('c', 3), ('b', 3)]). This is true even if x has a numerical dtype (e.g. [('3', 5), ('2', 3), ('111', 3)]).
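
The ordering can be reproduced with a small pure-Python sketch. The helper order_vocabulary is hypothetical; it assumes that numerical values are compared as their string representations, which is what the second example implies.

```python
from collections import Counter


def order_vocabulary(values):
    """Hypothetical helper: count values and order them as described above."""
    # Values are converted to strings first, so numbers compare lexicographically.
    counts = Counter(str(v) for v in values)
    # Sort by frequency, then by token, both in descending order.
    return sorted(counts.items(), key=lambda kv: (kv[1], kv[0]), reverse=True)


letters = order_vocabulary(["a"] * 5 + ["b"] * 3 + ["c"] * 3)
numbers = order_vocabulary([3] * 5 + [2] * 3 + [111] * 3)
```

Here '2' sorts ahead of '111' because the comparison is on strings, not on numeric value.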
Args | |
---|---|
x | A categorical/discrete input Tensor, SparseTensor, or RaggedTensor with dtype tf.string, tf.int8, tf.int16, tf.int32, or tf.int64. |
top_k | Limit the generated vocabulary to the first top_k elements. Note that if top_k is larger than the number of unique elements in x, then the result will be exact. |
vocab_filename | The file name for the vocabulary file. If None, a file name will be chosen based on the current scope. If not None, should be unique within a given preprocessing function. NOTE: To make your pipelines resilient to implementation details, please set vocab_filename when you are using the vocab_filename on a downstream component. |
store_frequency | If True, frequency of the words is stored in the vocabulary file. Each line in the file will be of the form 'frequency word'. NOTE: if this is True then the computed vocabulary cannot be used with tft.apply_vocabulary directly, since frequencies are added to the beginning of each row of the vocabulary, which the mapper will not ignore. |
reserved_tokens | (Optional) A list of tokens that should appear in the vocabulary regardless of their appearance in the input. These tokens would maintain their order, and have a reserved spot at the beginning of the vocabulary. Note: this field has no effect on cache. |
weights | (Optional) Weights Tensor for the vocabulary. It must have the same shape as x. |
file_format | (Optional) A str. The format of the resulting vocabulary file. Accepted formats are: 'tfrecord_gzip', 'text'. 'tfrecord_gzip' requires tensorflow>=2.4. The default value is 'text'. |
name | (Optional) A name for this operation. |
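
When store_frequency is True and file_format is 'text', each line has the form 'frequency word', so the file has to be parsed before the tokens can be used on their own. A minimal sketch, assuming integer counts (no weights) and a single space between the frequency and the word; parse_frequency_line is a hypothetical helper, not part of the tft API.

```python
def parse_frequency_line(line: str):
    """Hypothetical parser for one 'frequency word' line of a text vocabulary."""
    # Split on the first space only, so words containing spaces stay intact.
    frequency, word = line.rstrip("\n").split(" ", 1)
    return word, int(frequency)


lines = ["5 a\n", "3 new york\n"]
parsed = [parse_frequency_line(line) for line in lines]
```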
Returns | |
---|---|
The path name for the vocabulary file containing the unique values of x. |
Raises | |
---|---|
ValueError | If top_k is negative; if file_format is not in the list of allowed formats; or if x.dtype is not string or integral. |