tft.ngrams
Create a SparseTensor of n-grams.
    tft.ngrams(
        tokens: tf.SparseTensor,
        ngram_range: Tuple[int, int],
        separator: str,
        name: Optional[str] = None
    ) -> tf.SparseTensor
Given a SparseTensor of tokens, returns a SparseTensor containing the ngrams that can be constructed from each row.

separator is inserted between each pair of tokens, so " " would be an appropriate choice if the tokens are words, while "" would be an appropriate choice if they are characters.
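To make the choice of separator concrete, here is a minimal plain-Python sketch. It does not call tft; it only joins a window of tokens the way the description above suggests. The helper name join_window is hypothetical and exists purely for illustration.

```python
def join_window(tokens, start, size, separator):
    """Join `size` consecutive tokens beginning at `start` (hypothetical helper)."""
    return separator.join(tokens[start:start + size])


word_tokens = ["Two", "was", "a", "rat"]
char_tokens = list("rat")  # ['r', 'a', 't']

# Word tokens read naturally with a space between them.
word_bigram = join_window(word_tokens, 0, 2, " ")
# Character tokens are usually joined with the empty string.
char_bigram = join_window(char_tokens, 0, 2, "")
# A space separator on characters yields a spaced-out string instead.
spaced_char_bigram = join_window(char_tokens, 0, 2, " ")

print(word_bigram)         # Two was
print(char_bigram)         # ra
print(spaced_char_bigram)  # r a
```

With character tokens, the empty separator gives back contiguous substrings of the original word, which is typically what character n-grams are meant to be.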
Example:
    import tensorflow as tf
    import tensorflow_transform as tft

    tokens = tf.SparseTensor(
        indices=[[0, 0], [0, 1], [0, 2], [1, 0], [1, 1], [1, 2], [1, 3]],
        values=['One', 'was', 'Johnny', 'Two', 'was', 'a', 'rat'],
        dense_shape=[2, 4])
    print(tft.ngrams(tokens, ngram_range=(1, 3), separator=' '))

Output:

    SparseTensor(indices=tf.Tensor(
        [[0 0] [0 1] [0 2] [0 3] [0 4] [0 5]
         [1 0] [1 1] [1 2] [1 3] [1 4] [1 5] [1 6] [1 7] [1 8]],
        shape=(15, 2), dtype=int64),
      values=tf.Tensor(
        [b'One' b'One was' b'One was Johnny' b'was' b'was Johnny' b'Johnny' b'Two'
         b'Two was' b'Two was a' b'was' b'was a' b'was a rat' b'a' b'a rat'
         b'rat'], shape=(15,), dtype=string),
      dense_shape=tf.Tensor([2 9], shape=(2,), dtype=int64))
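The ordering in the output above (by starting token first, then by n-gram size) can be modelled in plain Python. The sketch below is inferred from the example output only; it is not the library's implementation, and the function name ngrams_for_row is an assumption made for illustration.

```python
def ngrams_for_row(tokens, ngram_range, separator):
    """Pure-Python model of one output row (hypothetical helper, not tft code)."""
    lo, hi = ngram_range  # both bounds are inclusive
    out = []
    for start in range(len(tokens)):
        for n in range(lo, hi + 1):
            if start + n > len(tokens):
                break  # longer n-grams would run past the end of the row
            out.append(separator.join(tokens[start:start + n]))
    return out


rows = [["One", "was", "Johnny"], ["Two", "was", "a", "rat"]]
result = [ngrams_for_row(row, (1, 3), " ") for row in rows]

# The widest row determines the second dimension of the dense shape.
dense_shape = (len(result), max(len(row) for row in result))

print(result[0])    # ['One', 'One was', 'One was Johnny', 'was', 'was Johnny', 'Johnny']
print(dense_shape)  # (2, 9)
```

The first row yields 6 n-grams and the second yields 9, which matches the 15 values and the dense shape of [2 9] in the example.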
| Args | Description |
|---|---|
| tokens | A two-dimensional SparseTensor of dtype tf.string containing tokens that will be used to construct ngrams. |
| ngram_range | A pair with the range (inclusive) of ngram sizes to return. |
| separator | A string that will be inserted between tokens when ngrams are constructed. |
| name | (Optional) A name for this operation. |
| Returns |
|---|
| A SparseTensor containing all ngrams from each row of the input. Note: if an ngram appears multiple times in the input row, it will be present the same number of times in the output. For unique ngrams, see tft.bag_of_words. |
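The note about repeated ngrams can be illustrated with a small plain-Python sketch. It models a single row under the same assumptions as the description above and deduplicates with dict.fromkeys; this only stands in for the idea of unique ngrams and makes no claim about how tft.bag_of_words is implemented or how it orders its output. The helper name row_ngrams is hypothetical.

```python
from collections import Counter


def row_ngrams(tokens, ngram_range, separator):
    """Pure-Python model of the n-grams for one row (hypothetical helper)."""
    lo, hi = ngram_range  # inclusive bounds
    return [
        separator.join(tokens[start:start + n])
        for start in range(len(tokens))
        for n in range(lo, hi + 1)
        if start + n <= len(tokens)
    ]


row = ["a", "rat", "a", "rat"]
bigrams = row_ngrams(row, (2, 2), " ")

# Repeated n-grams are kept, once per occurrence in the row.
counts = Counter(bigrams)

# Dropping repeats while keeping first-seen order (illustration only).
unique_bigrams = list(dict.fromkeys(bigrams))

print(bigrams)         # ['a rat', 'rat a', 'a rat']
print(counts["a rat"])  # 2
print(unique_bigrams)  # ['a rat', 'rat a']
```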
| Raises | Condition |
|---|---|
| ValueError | if tokens is not 2D. |
| ValueError | if ngram_range[0] < 1 or ngram_range[1] < ngram_range[0]. |
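The ngram_range condition can be checked ahead of time in calling code. The following is a sketch of an equivalent pre-check, assuming only the condition stated in the table above; the function names and the error message are hypothetical and are not taken from the library.

```python
def validate_ngram_range(ngram_range):
    """Mirror the documented ngram_range condition (hypothetical helper)."""
    lo, hi = ngram_range
    if lo < 1 or hi < lo:
        # The message text is illustrative, not the library's wording.
        raise ValueError(f"Invalid ngram_range: {ngram_range}")
    return lo, hi


def is_valid(ngram_range):
    """Return True if the range passes the check, False otherwise."""
    try:
        validate_ngram_range(ngram_range)
        return True
    except ValueError:
        return False


checks = {r: is_valid(r) for r in [(1, 3), (2, 2), (0, 2), (3, 1)]}
print(checks)  # {(1, 3): True, (2, 2): True, (0, 2): False, (3, 1): False}
```

A range such as (2, 2) is valid and selects only bigrams, since both bounds are inclusive.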
Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.
Last updated 2024-11-01 UTC.
View source on GitHub: https://github.com/tensorflow/transform/blob/v1.16.0/tensorflow_transform/mappers.py#L1517-L1655