tft.experimental.document_frequency
View source on GitHub: https://github.com/tensorflow/transform/blob/v1.16.0/tensorflow_transform/experimental/mappers.py#L129-L204
Maps the terms in x to their document frequency in the same order.
tft.experimental.document_frequency(
    x: tf.SparseTensor, vocab_size: int, name: Optional[str] = None
) -> tf.SparseTensor
The document frequency of a term is the number of documents in the entire
dataset that contain that term. Each unique vocabulary term has its own
document frequency.
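For intuition, the same quantity can be computed by hand in plain Python. This is only a sketch of the definition, not how tft computes it (tft uses a full-pass analyzer over the dataset):

import collections

docs = [["I", "like", "pie", "pie", "pie"],
        ["yum", "yum", "pie"]]

# Each term counts at most once per document, no matter how often it repeats
# within that document.
df = collections.Counter()
for doc in docs:
    df.update(set(doc))

# "pie" appears in both documents; every other term in exactly one.
assert df['pie'] == 2 and df['I'] == df['like'] == df['yum'] == 1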
Example usage:
import tempfile

import tensorflow as tf
import tensorflow_transform as tft
import tensorflow_transform.beam as tft_beam

def preprocessing_fn(inputs):
  integerized = tft.compute_and_apply_vocabulary(inputs['x'])
  vocab_size = tft.get_num_buckets_for_transformed_feature(integerized)
  return {
      'df': tft.experimental.document_frequency(integerized, vocab_size),
      'integerized': integerized,
  }

raw_data = [dict(x=["I", "like", "pie", "pie", "pie"]),
            dict(x=["yum", "yum", "pie"])]
feature_spec = dict(x=tf.io.VarLenFeature(tf.string))
raw_data_metadata = tft.DatasetMetadata.from_feature_spec(feature_spec)
with tft_beam.Context(temp_dir=tempfile.mkdtemp()):
  transformed_dataset, transform_fn = (
      (raw_data, raw_data_metadata)
      | tft_beam.AnalyzeAndTransformDataset(preprocessing_fn))
transformed_data, transformed_metadata = transformed_dataset

transformed_data then holds:

[{'df': array([1, 1, 2, 2, 2]), 'integerized': array([3, 2, 0, 0, 0])},
 {'df': array([1, 1, 2]), 'integerized': array([1, 1, 0])}]
In terms of the integerized values above (vocabulary: "pie" -> 0, "yum" -> 1, "like" -> 2, "I" -> 3):

example strings: [["I", "like", "pie", "pie", "pie"], ["yum", "yum", "pie"]]
in: SparseTensor(indices=[[0, 0], [0, 1], [0, 2], [0, 3], [0, 4],
                          [1, 0], [1, 1], [1, 2]],
                 values=[3, 2, 0, 0, 0, 1, 1, 0])
out: SparseTensor(indices=[[0, 0], [0, 1], [0, 2], [0, 3], [0, 4],
                           [1, 0], [1, 1], [1, 2]],
                  values=[1, 1, 2, 2, 2, 1, 1, 2])
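To make the mapping concrete, a small hand computation (plain Python, not the tft implementation) reproduces the out values from the in values, given the per-vocab-id document frequencies computed over the two example documents:

import numpy as np

# Document frequency indexed by vocab id
# (pie=0 appears in 2 docs; yum=1, like=2, I=3 each appear in 1).
df_by_id = np.array([2, 1, 1, 1])

in_values = np.array([3, 2, 0, 0, 0, 1, 1, 0])
out_values = df_by_id[in_values]  # gather the df for each token position
print(out_values)  # [1 1 2 2 2 1 1 2]

The indices are untouched; only the values are replaced, which is why the output has the same sparse structure as the input.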
Args:
  x: A 2D SparseTensor of int64 values, most likely the result of calling
    compute_and_apply_vocabulary on a tokenized string feature.
  vocab_size: An int; the size of the vocabulary used to map the strings to
    int64 values, including any OOV buckets.
  name: (Optional) A name for this operation.
Returns:
  A SparseTensor with indices [index_in_batch, index_in_local_sequence] and
  document-frequency values. Same dense shape as the input x.
Raises:
  ValueError: If x does not have 2 dimensions.