# tf.contrib.lookup.index_table_from_file
Returns a lookup table that converts a string tensor into int64 IDs.
```python
tf.contrib.lookup.index_table_from_file(
    vocabulary_file=None, num_oov_buckets=0, vocab_size=None, default_value=-1,
    hasher_spec=tf.contrib.lookup.FastHashSpec, key_dtype=tf.dtypes.string,
    name=None, key_column_index=TextFileIndex.WHOLE_LINE,
    value_column_index=TextFileIndex.LINE_NUMBER, delimiter='\t'
)
```
This operation constructs a lookup table that converts a tensor of strings into int64 IDs. The mapping can be initialized from a vocabulary file specified in `vocabulary_file`, where each whole line is a key and the zero-based line number is its ID.

If `num_oov_buckets` is greater than zero, a lookup of an out-of-vocabulary token returns a bucket ID based on the token's hash; otherwise the token is assigned `default_value`. The bucket ID range is `[vocabulary size, vocabulary size + num_oov_buckets - 1]`.
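The assignment rule can be sketched in plain Python. This is a simplified model, not TF's implementation: `zlib.crc32` is a stand-in for the fingerprint hash selected by `hasher_spec`, so real bucket assignments will differ, but the bucket range is the same.

```python
# Illustrative model of the ID-assignment semantics (not TF's actual hash).
import zlib

def lookup_id(token, vocab, num_oov_buckets=0, default_value=-1):
    if token in vocab:
        return vocab[token]  # zero-based line number from the vocabulary file
    if num_oov_buckets > 0:
        # OOV bucket in [len(vocab), len(vocab) + num_oov_buckets - 1];
        # zlib.crc32 stands in for TF's fingerprint hash.
        return len(vocab) + zlib.crc32(token.encode("utf-8")) % num_oov_buckets
    return default_value

vocab = {"emerson": 0, "lake": 1, "palmer": 2}
lookup_id("lake", vocab)                    # 1 (in vocabulary)
lookup_id("and", vocab, num_oov_buckets=1)  # 3 (the single OOV bucket)
lookup_id("and", vocab)                     # -1 (default_value)
```

With a single OOV bucket, every out-of-vocabulary token necessarily maps to ID 3 regardless of the hash function; with more buckets, only the range is guaranteed.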
The underlying table must be initialized once by running either `session.run(tf.compat.v1.tables_initializer())` or `session.run(table.init())`.
To read multi-column vocabulary files, use `key_column_index`, `value_column_index`, and `delimiter`:

- `TextFileIndex.LINE_NUMBER` uses the zero-based line number; expects data type int64.
- `TextFileIndex.WHOLE_LINE` uses the whole line content; expects data type string.
- A value >= 0 uses the zero-based index of the field obtained by splitting the line on `delimiter`.
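The column options above can be sketched in plain Python (for illustration only; the real parsing is done by TF's table initializer). The sentinel constants mirror `TextFileIndex`:

```python
# Sketch of how key_column_index / value_column_index / delimiter select
# keys and values from a vocabulary file. Sentinels mirror TextFileIndex.
WHOLE_LINE = -2   # use the whole line content
LINE_NUMBER = -1  # use the zero-based line number

def build_table(lines, key_column_index=WHOLE_LINE,
                value_column_index=LINE_NUMBER, delimiter="\t"):
    table = {}
    for line_number, raw in enumerate(lines):
        line = raw.rstrip("\n")
        fields = line.split(delimiter)

        def column(index):
            if index == WHOLE_LINE:
                return line
            if index == LINE_NUMBER:
                return line_number
            return fields[index]  # a value >= 0 selects a split field

        table[column(key_column_index)] = int(column(value_column_index))
    return table

# Default behavior: whole line is the key, line number is the ID.
build_table(["emerson", "lake", "palmer"])
# -> {'emerson': 0, 'lake': 1, 'palmer': 2}

# Multi-column file of the form "word<TAB>id": key in column 0, ID in column 1.
build_table(["emerson\t10", "lake\t20"], key_column_index=0, value_column_index=1)
# -> {'emerson': 10, 'lake': 20}
```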
#### Sample Usage

Given a vocabulary file "test.txt" with the following content:

```
emerson
lake
palmer
```

```python
features = tf.constant(["emerson", "lake", "and", "palmer"])
table = tf.lookup.index_table_from_file(
    vocabulary_file="test.txt", num_oov_buckets=1)
ids = table.lookup(features)
...
tf.compat.v1.tables_initializer().run()

ids.eval()  # ==> [0, 1, 3, 2], where 3 is the out-of-vocabulary bucket
```
#### Args

| Argument | Description |
|---|---|
| `vocabulary_file` | The vocabulary filename; may be a constant scalar `Tensor`. |
| `num_oov_buckets` | The number of out-of-vocabulary buckets. |
| `vocab_size` | The number of elements in the vocabulary, if known. |
| `default_value` | The value to use for out-of-vocabulary feature values. Defaults to -1. |
| `hasher_spec` | A `HasherSpec` specifying the hash function used to assign out-of-vocabulary buckets. |
| `key_dtype` | The key data type. |
| `name` | A name for this op (optional). |
| `key_column_index` | The column index in the text file to read key values from. Defaults to the whole line content. |
| `value_column_index` | The column index in the text file to read value values from. Defaults to the zero-based line number. |
| `delimiter` | The delimiter separating fields in a line. |
#### Returns

The lookup table mapping a `key_dtype` `Tensor` to an `int64` index `Tensor`.
#### Raises

| Exception | Condition |
|---|---|
| `ValueError` | If `vocabulary_file` is not set. |
| `ValueError` | If `num_oov_buckets` is negative or `vocab_size` is not greater than zero. |
Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.
Last updated 2020-10-01 UTC.