Class LMDBDataset
An LMDB Dataset that reads the lmdb file.
__init__
__init__(filenames)
Creates an LMDBDataset.
LMDBDataset allows a user to read data from an mdb file as (key, value) pairs sequentially.
For example:
tf.compat.v1.enable_eager_execution()
dataset = tf.contrib.lmdb.LMDBDataset("/foo/bar.mdb")
# Prints the (key, value) pairs inside a lmdb file.
for key, value in dataset:
  print(key, value)
Args:
filenames: A tf.string tensor containing one or more filenames.
Properties
element_spec
The type specification of an element of this dataset.
Returns:
A nested structure of tf.TypeSpec objects matching the structure of an element of this dataset and specifying the type of individual components.
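A minimal sketch of inspecting element_spec (assumes eager execution and a simple in-memory dataset rather than an LMDBDataset):
import tensorflow as tf

dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3])
print(dataset.element_spec)
# e.g. TensorSpec(shape=(), dtype=tf.int32, name=None)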
Methods
__iter__
__iter__()
Creates an Iterator for enumerating the elements of this dataset.
The returned iterator implements the Python iterator protocol and therefore can only be used in eager mode.
Returns:
An Iterator over the elements of this dataset.
Raises:
RuntimeError: If not inside of tf.function and not executing eagerly.
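A minimal sketch of using the iterator protocol directly (assumes eager execution and a simple in-memory dataset):
import tensorflow as tf

dataset = tf.data.Dataset.from_tensor_slices([10, 20, 30])
it = iter(dataset)        # invokes Dataset.__iter__
print(next(it).numpy())   # 10
print(next(it).numpy())   # 20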
apply
apply(transformation_func)
Applies a transformation function to this dataset.
apply enables chaining of custom Dataset transformations, which are represented as functions that take one Dataset argument and return a transformed Dataset.
For example:
dataset = (dataset.map(lambda x: x ** 2)
           .apply(group_by_window(key_func, reduce_func, window_size))
           .map(lambda x: x ** 3))
Args:
transformation_func: A function that takes one Dataset argument and returns a Dataset.
Returns:
Dataset: The Dataset returned by applying transformation_func to this dataset.
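The group_by_window example above assumes user-defined key_func, reduce_func, and window_size. A minimal self-contained sketch, using an illustrative transformation built from Dataset.shard:
import tensorflow as tf

def keep_first_shard(dataset):
  # A transformation function: takes one Dataset and returns a Dataset.
  return dataset.shard(num_shards=2, index=0)

dataset = tf.data.Dataset.range(10).apply(keep_first_shard)
# ==> [0, 2, 4, 6, 8]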
batch
batch(
batch_size,
drop_remainder=False
)
Combines consecutive elements of this dataset into batches.
The components of the resulting element will have an additional outer dimension, which will be batch_size (or N % batch_size for the last element if batch_size does not divide the number of input elements N evenly and drop_remainder is False). If your program depends on the batches having the same outer dimension, you should set the drop_remainder argument to True to prevent the smaller batch from being produced.
Args:
batch_size: A tf.int64 scalar tf.Tensor, representing the number of consecutive elements of this dataset to combine in a single batch.
drop_remainder: (Optional.) A tf.bool scalar tf.Tensor, representing whether the last batch should be dropped in the case it has fewer than batch_size elements; the default behavior is not to drop the smaller batch.
Returns:
Dataset: A Dataset.
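A minimal sketch of batching and the effect of drop_remainder (assumes eager execution):
import tensorflow as tf

dataset = tf.data.Dataset.range(8)
for batch in dataset.batch(3):
  print(batch.numpy())
# [0 1 2]
# [3 4 5]
# [6 7]
# With drop_remainder=True the smaller final batch [6 7] would not be produced.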
cache
cache(filename='')
Caches the elements in this dataset.
Args:
filename: A tf.string scalar tf.Tensor, representing the name of a directory on the filesystem to use for caching elements in this Dataset. If a filename is not provided, the dataset will be cached in memory.
Returns:
Dataset: A Dataset.
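A minimal sketch of caching (the file path below is hypothetical; omitting it caches in memory):
import tensorflow as tf

dataset = tf.data.Dataset.range(5).map(lambda x: x * 2)
dataset = dataset.cache()                     # cache in memory after the first pass
# dataset = dataset.cache("/tmp/data_cache")  # or cache on the filesystem (hypothetical path)
for value in dataset:
  print(value.numpy())   # 0, 2, 4, 6, 8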
concatenate
concatenate(dataset)
Creates a Dataset by concatenating the given dataset with this dataset.
a = Dataset.range(1, 4) # ==> [ 1, 2, 3 ]
b = Dataset.range(4, 8) # ==> [ 4, 5, 6, 7 ]
# The input dataset and dataset to be concatenated should have the same
# nested structures and output types.
# c = Dataset.range(8, 14).batch(2) # ==> [ [8, 9], [10, 11], [12, 13] ]
# d = Dataset.from_tensor_slices([14.0, 15.0, 16.0])
# a.concatenate(c) and a.concatenate(d) would result in error.
a.concatenate(b) # ==> [ 1, 2, 3, 4, 5, 6, 7 ]
Args:
dataset: Dataset to be concatenated.
Returns:
Dataset: A Dataset.
enumerate
enumerate(start=0)
Enumerates the elements of this dataset.
It is similar to Python's enumerate.
For example:
# NOTE: The following examples use `{ ... }` to represent the
# contents of a dataset.
a = { 1, 2, 3 }
b = { (7, 8), (9, 10) }
# The nested structure of the `datasets` argument determines the
# structure of elements in the resulting dataset.
a.enumerate(start=5) == { (5, 1), (6, 2), (7, 3) }
b.enumerate() == { (0, (7, 8)), (1, (9, 10)) }
Args:
start: A tf.int64 scalar tf.Tensor, representing the start value for enumeration.
Returns:
Dataset: A Dataset.
filter
filter(predicate)
Filters this dataset according to predicate.
d = tf.data.Dataset.from_tensor_slices([1, 2, 3])
d = d.filter(lambda x: x < 3) # ==> [1, 2]
# `tf.math.equal(x, y)` is required for equality comparison
def filter_fn(x):
  return tf.math.equal(x, 1)

d = d.filter(filter_fn) # ==> [1]
Args:
predicate: A function mapping a dataset element to a boolean.
Returns:
Dataset: The Dataset containing the elements of this dataset for which predicate is True.
flat_map
flat_map(map_func)
Maps map_func across this dataset and flattens the result.
Use flat_map if you want to make sure that the order of your dataset stays the same. For example, to flatten a dataset of batches into a dataset of their elements:
a = Dataset.from_tensor_slices([ [1, 2, 3], [4, 5, 6], [7, 8, 9] ])
a.flat_map(lambda x: Dataset.from_tensor_slices(x + 1)) # ==>
# [ 2, 3, 4, 5, 6, 7, 8, 9, 10 ]
tf.data.Dataset.interleave() is a generalization of flat_map, since flat_map produces the same output as tf.data.Dataset.interleave(cycle_length=1).
Args:
map_func: A function mapping a dataset element to a dataset.
Returns:
Dataset: A Dataset.
from_generator
from_generator(
generator,
output_types,
output_shapes=None,
args=None
)
Creates a Dataset whose elements are generated by generator.
The generator argument must be a callable object that returns an object that supports the iter() protocol (e.g. a generator function). The elements generated by generator must be compatible with the given output_types and (optional) output_shapes arguments.
For example:
import itertools
tf.compat.v1.enable_eager_execution()
def gen():
  for i in itertools.count(1):
    yield (i, [1] * i)

ds = tf.data.Dataset.from_generator(
    gen, (tf.int64, tf.int64), (tf.TensorShape([]), tf.TensorShape([None])))

for value in ds.take(2):
  print(value)
# (1, array([1]))
# (2, array([1, 1]))
NOTE: The current implementation of Dataset.from_generator() uses tf.numpy_function and inherits the same constraints. In particular, it requires the Dataset- and Iterator-related operations to be placed on a device in the same process as the Python program that called Dataset.from_generator(). The body of generator will not be serialized in a GraphDef, and you should not use this method if you need to serialize your model and restore it in a different environment.
NOTE: If generator depends on mutable global variables or other external state, be aware that the runtime may invoke generator multiple times (in order to support repeating the Dataset) and at any time between the call to Dataset.from_generator() and the production of the first element from the generator. Mutating global variables or external state can cause undefined behavior, and we recommend that you explicitly cache any external state in generator before calling Dataset.from_generator().
Args:
generator: A callable object that returns an object that supports the iter() protocol. If args is not specified, generator must take no arguments; otherwise it must take as many arguments as there are values in args.
output_types: A nested structure of tf.DType objects corresponding to each component of an element yielded by generator.
output_shapes: (Optional.) A nested structure of tf.TensorShape objects corresponding to each component of an element yielded by generator.
args: (Optional.) A tuple of tf.Tensor objects that will be evaluated and passed to generator as NumPy-array arguments.
Returns:
Dataset: A Dataset.
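A minimal sketch of the args parameter (the generator and its limit argument are illustrative; values in args arrive as NumPy arrays):
import tensorflow as tf

def gen(limit):
  # `limit` is passed in via `args` and arrives as a NumPy scalar.
  for i in range(limit):
    yield i

ds = tf.data.Dataset.from_generator(
    gen, output_types=tf.int64, output_shapes=tf.TensorShape([]), args=(4,))
for value in ds:
  print(value.numpy())   # 0, 1, 2, 3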
from_tensor_slices
from_tensor_slices(tensors)
Creates a Dataset whose elements are slices of the given tensors.
Note that if tensors contains a NumPy array, and eager execution is not enabled, the values will be embedded in the graph as one or more tf.constant operations. For large datasets (> 1 GB), this can waste memory and run into byte limits of graph serialization. If tensors contains one or more large NumPy arrays, consider the alternative described in this guide.
Args:
tensors: A dataset element, with each component having the same size in the 0th dimension.
Returns:
Dataset: A Dataset.
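A minimal sketch: each component is sliced along its 0th dimension, so the element count equals that dimension (assumes eager execution):
import tensorflow as tf

features = [[1, 2], [3, 4], [5, 6]]
labels = [0, 1, 0]
dataset = tf.data.Dataset.from_tensor_slices((features, labels))
for x, y in dataset:
  print(x.numpy(), y.numpy())
# [1 2] 0
# [3 4] 1
# [5 6] 0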
from_tensors
from_tensors(tensors)
Creates a Dataset with a single element, comprising the given tensors.
Note that if tensors contains a NumPy array, and eager execution is not enabled, the values will be embedded in the graph as one or more tf.constant operations. For large datasets (> 1 GB), this can waste memory and run into byte limits of graph serialization. If tensors contains one or more large NumPy arrays, consider the alternative described in this guide.
Args:
tensors: A dataset element.
Returns:
Dataset: A Dataset.
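A minimal sketch contrasting with from_tensor_slices: the entire structure becomes a single element (assumes eager execution):
import tensorflow as tf

dataset = tf.data.Dataset.from_tensors(([1, 2, 3], "a"))
for x, y in dataset:
  print(x.numpy(), y.numpy())   # [1 2 3] b'a'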
interleave
interleave(
map_func,
cycle_length=AUTOTUNE,
block_length=1,
num_parallel_calls=None
)
Maps map_func across this dataset, and interleaves the results.
For example, you can use Dataset.interleave() to process many input files concurrently:
# Preprocess 4 files concurrently, and interleave blocks of 16 records from
# each file.
filenames = ["/var/data/file1.txt", "/var/data/file2.txt", ...]
dataset = (Dataset.from_tensor_slices(filenames)
           .interleave(lambda x:
                       TextLineDataset(x).map(parse_fn, num_parallel_calls=1),
                       cycle_length=4, block_length=16))
The cycle_length and block_length arguments control the order in which elements are produced. cycle_length controls the number of input elements that are processed concurrently. If you set cycle_length to 1, this transformation will handle one input element at a time, and will produce identical results to tf.data.Dataset.flat_map. In general, this transformation will apply map_func to cycle_length input elements, open iterators on the returned Dataset objects, and cycle through them producing block_length consecutive elements from each iterator, and consuming the next input element each time it reaches the end of an iterator.
For example:
a = Dataset.range(1, 6) # ==> [ 1, 2, 3, 4, 5 ]
# NOTE: New lines indicate "block" boundaries.
a.interleave(lambda x: Dataset.from_tensors(x).repeat(6),
cycle_length=2, block_length=4) # ==> [1, 1, 1, 1,
# 2, 2, 2, 2,
# 1, 1,
# 2, 2,
# 3, 3, 3, 3,
# 4, 4, 4, 4,
# 3, 3,
# 4, 4,
# 5, 5, 5, 5,
# 5, 5]
NOTE: The order of elements yielded by this transformation is deterministic, as long as map_func is a pure function. If map_func contains any stateful operations, the order in which that state is accessed is undefined.
Args:
map_func: A function mapping a dataset element to a dataset.
cycle_length: (Optional.) The number of input elements that will be processed concurrently. If not specified, the value will be derived from the number of available CPU cores. If the num_parallel_calls argument is set to tf.data.experimental.AUTOTUNE, the cycle_length argument also identifies the maximum degree of parallelism.
block_length: (Optional.) The number of consecutive elements to produce from each input element before cycling to another input element.
num_parallel_calls: (Optional.) If specified, the implementation creates a threadpool, which is used to fetch inputs from cycle elements asynchronously and in parallel. The default behavior is to fetch inputs from cycle elements synchronously with no parallelism. If the value tf.data.experimental.AUTOTUNE is used, then the number of parallel calls is set dynamically based on available CPU.
Returns:
Dataset: A Dataset.
list_files
list_files(
file_pattern,
shuffle=None,
seed=None
)
A dataset of all files matching one or more glob patterns.
NOTE: The default behavior of this method is to return filenames in a non-deterministic random shuffled order. Pass a seed or shuffle=False to get results in a deterministic order.
Example:
If we had the following files on our filesystem:
- /path/to/dir/a.txt
- /path/to/dir/b.py
- /path/to/dir/c.py
If we pass "/path/to/dir/*.py" as the directory, the dataset would produce:
- /path/to/dir/b.py
- /path/to/dir/c.py
Args:
file_pattern: A string, a list of strings, or a tf.Tensor of string type (scalar or vector), representing the filename glob (i.e. shell wildcard) pattern(s) that will be matched.
shuffle: (Optional.) If True, the file names will be shuffled randomly. Defaults to True.
seed: (Optional.) A tf.int64 scalar tf.Tensor, representing the random seed that will be used to create the distribution. See tf.compat.v1.set_random_seed for behavior.
Returns:
Dataset: A Dataset of strings corresponding to file names.
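A minimal sketch (the paths are illustrative and must exist on disk for the pattern to match):
import tensorflow as tf

dataset = tf.data.Dataset.list_files("/path/to/dir/*.py", shuffle=False)
for f in dataset:
  print(f.numpy())
# b'/path/to/dir/b.py'
# b'/path/to/dir/c.py'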
map
map(
map_func,
num_parallel_calls=None
)
Maps map_func across the elements of this dataset.
This transformation applies map_func to each element of this dataset, and returns a new dataset containing the transformed elements, in the same order as they appeared in the input.
For example:
a = Dataset.range(1, 6) # ==> [ 1, 2, 3, 4, 5 ]
a.map(lambda x: x + 1) # ==> [ 2, 3, 4, 5, 6 ]
The input signature of map_func is determined by the structure of each element in this dataset. For example:
# NOTE: The following examples use `{ ... }` to represent the
# contents of a dataset.
# Each element is a `tf.Tensor` object.
a = { 1, 2, 3, 4, 5 }
# `map_func` takes a single argument of type `tf.Tensor` with the same
# shape and dtype.
result = a.map(lambda x: ...)
# Each element is a tuple containing two `tf.Tensor` objects.
b = { (1, "foo"), (2, "bar"), (3, "baz") }
# `map_func` takes two arguments of type `tf.Tensor`.
result = b.map(lambda x_int, y_str: ...)
# Each element is a dictionary mapping strings to `tf.Tensor` objects.
c = { {"a": 1, "b": "foo"}, {"a": 2, "b": "bar"}, {"a": 3, "b": "baz"} }
# `map_func` takes a single argument of type `dict` with the same keys as
# the elements.
result = c.map(lambda d: ...)
The value or values returned by map_func determine the structure of each element in the returned dataset.
# `map_func` returns a scalar `tf.Tensor` of type `tf.float32`.
def f(...):
  return tf.constant(37.0)

result = dataset.map(f)
result.output_classes == tf.Tensor
result.output_types == tf.float32
result.output_shapes == []  # scalar
# `map_func` returns two `tf.Tensor` objects.
def g(...):
  return tf.constant(37.0), tf.constant(["Foo", "Bar", "Baz"])
result = dataset.map(g)
result.output_classes == (tf.Tensor, tf.Tensor)
result.output_types == (tf.float32, tf.string)
result.output_shapes == ([], [3])
# Python primitives, lists, and NumPy arrays are implicitly converted to
# `tf.Tensor`.
def h(...):
  return 37.0, ["Foo", "Bar", "Baz"], np.array([1.0, 2.0], dtype=np.float64)
result = dataset.map(h)
result.output_classes == (tf.Tensor, tf.Tensor, tf.Tensor)
result.output_types == (tf.float32, tf.string, tf.float64)
result.output_shapes == ([], [3], [2])
# `map_func` can return nested structures.
def i(...):
  return {"a": 37.0, "b": [42, 16]}, "foo"

result = dataset.map(i)
result.output_classes == ({"a": tf.Tensor, "b": tf.Tensor}, tf.Tensor)
result.output_types == ({"a": tf.float32, "b": tf.int32}, tf.string)
result.output_shapes == ({"a": [], "b": [2]}, [])
map_func can accept as arguments and return any type of dataset element.
Note that irrespective of the context in which map_func is defined (eager vs. graph), tf.data traces the function and executes it as a graph. To use Python code inside of the function you have two options:
1) Rely on AutoGraph to convert Python code into an equivalent graph computation. The downside of this approach is that AutoGraph can convert some but not all Python code.
2) Use tf.py_function, which allows you to write arbitrary Python code but will generally result in worse performance than 1). For example:
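A minimal sketch of option 2 (the upper-casing function is illustrative):
import tensorflow as tf

d = tf.data.Dataset.from_tensor_slices(["hello", "world"])

def upper_case_fn(t):
  # Arbitrary Python code: decode the string tensor and upper-case it.
  return t.numpy().decode("utf-8").upper()

d = d.map(lambda x: tf.py_function(func=upper_case_fn, inp=[x], Tout=tf.string))
# ==> [ "HELLO", "WORLD" ]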