tft.experimental.approximate_vocabulary

Computes the unique values of a Tensor over the whole dataset.

Approximately computes the unique values taken by x, which can be a Tensor or CompositeTensor of any size. The unique values will be aggregated over all dimensions of x and all instances.

This analyzer is an approximate alternative to tft.vocabulary that can be more efficient when top_k or the number of unique elements in x is small. As a rule of thumb, approximate_vocabulary becomes more efficient than tft.vocabulary if top_k or the number of unique elements in x is smaller than 2*10^5. Moreover, this analyzer is subject to a combiner packing optimization that does not apply to tft.vocabulary. Caching is also more efficient with the approximate implementation because filtering happens before the cache is written. The output artifact of approximate_vocabulary is consistent with that of tft.vocabulary and can be used with the tft.apply_vocabulary mapper.

The implementation of this analyzer is based on the Misra-Gries algorithm [1]. At any time it stores at most top_k elements, each with a lower-bound frequency estimate. The algorithm keeps track of the approximation error delta such that for any item x with true frequency X:

        frequency[x] <= X <= frequency[x] + delta,
        delta <= (m - m') / (top_k + 1),

where m is the total frequency of the items in the dataset and m' is the sum of the lower-bound estimates in frequency [2]. For datasets that follow a Zipfian distribution with parameter a, the algorithm provides an expected value of delta = m / (top_k ^ a) [3].
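The bounds above can be checked on a toy stream. The following is a minimal sketch of the textbook, unweighted Misra-Gries summary, not the analyzer's actual implementation; the function name misra_gries and the sample stream are hypothetical and chosen only for illustration.

```python
from collections import Counter


def misra_gries(stream, top_k):
    """Return (lower-bound counts, delta) using at most top_k counters.

    Hypothetical helper: a textbook sketch, not tf.Transform code.
    """
    counts = {}
    delta = 0  # number of decrement rounds; bounds the undercount of any item
    for item in stream:
        if item in counts:
            counts[item] += 1
        elif len(counts) < top_k:
            counts[item] = 1
        else:
            # No free counter: decrement every tracked item and drop zeros.
            # The incoming item is not stored.
            delta += 1
            for key in list(counts):
                counts[key] -= 1
                if counts[key] == 0:
                    del counts[key]
    return counts, delta


stream = list("aaaaabbbcccdde")
top_k = 2
estimates, delta = misra_gries(stream, top_k)

true_counts = Counter(stream)
m = len(stream)                    # total frequency of all items
m_prime = sum(estimates.values())  # sum of the lower-bound estimates

for item, true_count in true_counts.items():
    low = estimates.get(item, 0)
    print(item, low, true_count, low + delta)
print("delta:", delta, "bound:", (m - m_prime) / (top_k + 1))
```

Each printed row shows the lower-bound estimate, the true count, and the upper bound for one item. Items that were evicted have an estimate of 0 but still satisfy the same inequality.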

[1] https://www.cs.utexas.edu/users/misra/scannedPdf.dir/FindRepeatedElements.pdf
[2] http://www.cohenwang.com/edith/bigdataclass2013/lectures/lecture1.pdf
[3] http://dimacs.rutgers.edu/~graham/pubs/papers/countersj.pdf

If file_format is 'text', any token that is empty or contains a '\n' or '\r' character is discarded.
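The rule for the 'text' format can be expressed as a simple predicate. This is an illustrative sketch only; is_valid_text_token is a hypothetical helper, not part of the tf.Transform API, and it operates on Python str values rather than on tensors.

```python
def is_valid_text_token(token: str) -> bool:
    """Hypothetical check: keep tokens that are non-empty and have no '\\n' or '\\r'."""
    return bool(token) and "\n" not in token and "\r" not in token


tokens = ["apple", "", "two\nlines", "carriage\rreturn", "pear"]
kept = [t for t in tokens if is_valid_text_token(t)]
print(kept)
```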

If an integer Tensor is provided, its semantic type should be categorical, not continuous/numeric, since computing a vocabulary over a continuous feature is not appropriate.

The unique values are sorted by decreasing frequency and then in reverse lexicographical order (e.g. [('a', 5), ('c', 3), ('b', 3)]). This is true even if x has a numerical dtype (e.g. [('3', 5), ('2', 3), ('111', 3)]).
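The ordering can be reproduced with a plain sort. The sketch below assumes tokens are compared as Python strings, which is consistent with the numerical example above ('2' sorts after '111'); sort_vocabulary is a hypothetical helper, and the analyzer itself may compare raw bytes instead.

```python
def sort_vocabulary(pairs):
    """Sort (token, frequency) pairs by decreasing frequency, then reverse
    lexicographical token order. Hypothetical helper for illustration."""
    return sorted(pairs, key=lambda pair: (pair[1], pair[0]), reverse=True)


string_vocab = sort_vocabulary([("b", 3), ("a", 5), ("c", 3)])
numeric_vocab = sort_vocabulary([("111", 3), ("2", 3), ("3", 5)])
print(string_vocab)
print(numeric_vocab)
```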

Args:
  x: A categorical/discrete input Tensor or CompositeTensor with dtype tf.string or tf.int[8|16|32|64].
  top_k: Limit the generated vocabulary to the first top_k elements. Note that if top_k is larger than the number of unique elements in x, then the result will be exact.
  vocab_filename: The file name for the vocabulary file. If None, a file name will be chosen based on the current scope. If not None, it should be unique within a given preprocessing function. NOTE: To make your pipelines resilient to implementation details, set vocab_filename explicitly if a downstream component uses the vocabulary file name.
  store_frequency: If True, the frequency of each word is stored in the vocabulary file. Each line in the file will be of the form 'frequency word'. NOTE: If this is True, the computed vocabulary cannot be used with tft.apply_vocabulary directly, since frequencies are added to the beginning of each row of the vocabulary, which the mapper will not ignore.
  weights: (Optional) Weights Tensor for the vocabulary. It must have the same shape as x.
  file_format: (Optional) A str. The format of the resulting vocabulary file. Accepted formats are: 'tfrecord_gzip', 'text'. 'tfrecord_gzip' requires tensorflow>=2.4. The default value is 'text'.
  name: (Optional) A name for this operation.

Returns:
  The path name for the vocabulary file containing the unique values of x.

Raises:
  ValueError: If top_k is negative, if file_format is not in the list of allowed formats, or if x.dtype is not string or integral.