Abstract base class for TF RL environments.
tf_agents.environments.TFEnvironment(
    time_step_spec=None, action_spec=None, batch_size=1
)
The current_time_step() method returns the current time_step, resetting the
environment if necessary.
The step(action) method applies the action and returns the new time_step.
This method will also reset the environment if needed and ignore the action in
that case.
The reset() method returns the time_step that results from an environment
reset; it is guaranteed to have step_type=ts.FIRST.
The reset() method is only needed for explicit resets. In general, the
environment will reset automatically when needed, for example, when no
episode was started or when it reaches a step after the end of the episode
(i.e. step_type=ts.LAST).
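The automatic-reset behavior described above can be sketched in plain Python. This is a toy model for illustration only, not the TF-Agents implementation: `ToyCountingEnv`, the local `TimeStep` namedtuple, and the integer constants `FIRST`, `MID`, `LAST` are hypothetical stand-ins, and the reward and discount values are arbitrary.

```python
from collections import namedtuple

# Hypothetical stand-ins for illustration; not the TF-Agents classes.
TimeStep = namedtuple("TimeStep", ["step_type", "reward", "discount", "observation"])
FIRST, MID, LAST = 0, 1, 2


class ToyCountingEnv:
    """Pure-Python toy that mimics the reset/step contract described above."""

    def __init__(self, episode_length=3):
        self._episode_length = episode_length
        self._current = None  # No episode has been started yet.
        self._count = 0

    def reset(self):
        # An explicit reset always yields a FIRST time step.
        self._count = 0
        self._current = TimeStep(FIRST, 0.0, 1.0, self._count)
        return self._current

    def current_time_step(self):
        # Reset lazily if no episode was started.
        if self._current is None:
            return self.reset()
        return self._current

    def step(self, action):
        # Auto-reset (and ignore the action) when no episode is running
        # or when the previous step ended the episode.
        if self._current is None or self._current.step_type == LAST:
            return self.reset()
        self._count += action
        done = self._count >= self._episode_length
        self._current = TimeStep(
            LAST if done else MID, 1.0, 0.0 if done else 1.0, self._count
        )
        return self._current


env = ToyCountingEnv(episode_length=2)
# The first call starts an episode; the fourth call follows a LAST step.
step_types = [env.step(1).step_type for _ in range(4)]
```

In this sketch the first and fourth calls to `step` both return a `FIRST` time step and discard the action, which matches the two automatic-reset cases named in the text: no episode started, and a step after the end of an episode.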
Example for collecting an episode in eager mode:
tf_env = TFEnvironment()
# reset() creates the initial time_step and resets the environment.
time_step = tf_env.reset()
while not time_step.is_last():
  action_step = policy.action(time_step)
  time_step = tf_env.step(action_step.action)
Example of simple use in graph mode:
tf_env = TFEnvironment()
# current_time_step() creates the initial TimeStep.
time_step = tf_env.current_time_step()
action_step = policy.action(time_step)
# Apply the action and return the new TimeStep.
next_time_step = tf_env.step(action_step.action)
sess.run([time_step, action_step, next_time_step])
Example with explicit resets in graph mode:
reset_op = tf_env.reset()
time_step = tf_env.current_time_step()
action_step = policy.action(time_step)
# Apply the action and return the new TimeStep.
next_time_step = tf_env.step(action_step.action)
# The environment will initialize before starting.
sess.run([time_step, action_step, next_time_step])
# This will force reset the Environment.
sess.run(reset_op)
# This will apply a new action in the environment.
sess.run([time_step, action_step, next_time_step])
Example of random actions in graph mode:
tf_env = TFEnvironment()
# Action needs to depend on the time_step using control_dependencies.
time_step = tf_env.current_time_step()
with tf.control_dependencies([time_step.step_type]):
  action = tensor_spec.sample_bounded_spec(tf_env.action_spec())
next_time_step = tf_env.step(action)
sess.run([time_step, action, next_time_step])
Example of collecting full episodes with a while_loop:
tf_env = TFEnvironment()
# reset() creates the initial time_step
time_step = tf_env.reset()
c = lambda t: tf.logical_not(t.is_last())
body = lambda t: [tf_env.step(t.observation)]
final_time_step = tf.while_loop(c, body, [time_step])
sess.run(final_time_step)
| Attributes | |
|---|---|
| batch_size | |
| batched | |
Methods
action_spec
action_spec()
Describes the specs of the Tensors expected by step(action).
action can be a single Tensor, or a nested dict, list or tuple of
Tensors.
| Returns | |
|---|---|
| A single TensorSpec, or a nested dict, list or tuple of TensorSpec objects, which describe the shape and
dtype of each Tensor expected by step(). |
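To illustrate what "a nested dict, list or tuple" means for an action, here is a minimal pure-Python sketch that checks whether an action mirrors the nesting of a spec. `ToySpec` and `same_structure` are hypothetical names introduced for this example and are not part of TF-Agents; the sketch compares nesting only and does not validate shapes or dtypes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToySpec:
    """Hypothetical stand-in for a TensorSpec: just a shape and a dtype name."""
    shape: tuple
    dtype: str


def same_structure(spec, action):
    """Returns True if `action` mirrors the nesting of `spec`."""
    if isinstance(spec, ToySpec):
        # A leaf spec must be matched by a leaf value, not a container.
        return not isinstance(action, (dict, list, tuple))
    if isinstance(spec, dict):
        return (
            isinstance(action, dict)
            and spec.keys() == action.keys()
            and all(same_structure(spec[k], action[k]) for k in spec)
        )
    if isinstance(spec, (list, tuple)):
        return (
            type(action) is type(spec)
            and len(action) == len(spec)
            and all(same_structure(s, a) for s, a in zip(spec, action))
        )
    return False


# A nested spec: a dict whose values are a leaf and a tuple of leaves.
action_spec = {
    "move": ToySpec(shape=(), dtype="float32"),
    "buttons": (ToySpec(shape=(), dtype="int32"), ToySpec(shape=(), dtype="int32")),
}
good_action = {"move": 0.5, "buttons": (1, 0)}
bad_action = {"move": 0.5, "buttons": [1, 0, 1]}  # list instead of 2-tuple
```

Here `same_structure(action_spec, good_action)` holds while `same_structure(action_spec, bad_action)` does not, because the `buttons` entry differs in container type and length.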
current_time_step
current_time_step()
Returns the current TimeStep.
| Returns | |
|---|---|
| A TimeStep namedtuple containing:
step_type: A StepType value.
reward: Reward at this time_step.
discount: A discount in the range [0, 1].
observation: A Tensor, or a nested dict, list or tuple of Tensors
  corresponding to observation_spec(). |
observation_spec
observation_spec()
Defines the TensorSpec of observations provided by the environment.
| Returns | |
|---|---|
| A TensorSpec, or a nested dict, list or tuple of TensorSpec objects, which describe the observation. |
render
render()
Renders a frame from the environment.
| Raises | |
|---|---|
| NotImplementedError | If the environment does not support rendering. | 
reset
reset()
Resets the environment and returns the current time_step.
| Returns | |
|---|---|
| A TimeStep namedtuple containing:
step_type: A StepType value.
reward: Reward at this time_step.
discount: A discount in the range [0, 1].
observation: A Tensor, or a nested dict, list or tuple of Tensors
  corresponding to observation_spec(). |
reward_spec
reward_spec()
Defines the TensorSpec of rewards provided by the environment.
| Returns | |
|---|---|
| A TensorSpec, or a nested dict, list or tuple of TensorSpec objects, which describe the reward. |
step
step(
    action
)
Steps the environment according to the action.
If the environment returned a TimeStep with StepType.LAST at the
previous step, this call to step should reset the environment and start a
new sequence, and action will be ignored. Whoever implements this method is
expected to call reset in this case.
This method will also start a new sequence if called after the environment
has been constructed and reset() has not been called. In this case
action will be ignored.
Expected sequences look like:
time_step -> action -> next_time_step
The action should depend on the previous time_step for correctness.
| Args | |
|---|---|
| action | A Tensor, or a nested dict, list or tuple of Tensors corresponding
to action_spec(). | 
| Returns | |
|---|---|
| A TimeStep namedtuple containing:
step_type: A StepType value.
reward: Reward at this time_step.
discount: A discount in the range [0, 1].
observation: A Tensor, or a nested dict, list or tuple of Tensors
  corresponding to observation_spec(). |
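The `time_step -> action -> next_time_step` sequence described for `step` can be sketched without TensorFlow. Everything below is a hypothetical stand-in chosen for illustration (`toy_step`, `toy_policy`, the local `TimeStep` and `Transition` namedtuples, and the counter dynamics); the point is only that each action is computed from the time step it responds to, and that each `next_time_step` becomes the `time_step` of the following transition.

```python
from collections import namedtuple

# Hypothetical stand-ins for illustration; not the TF-Agents classes.
TimeStep = namedtuple("TimeStep", ["step_type", "reward", "discount", "observation"])
Transition = namedtuple("Transition", ["time_step", "action", "next_time_step"])
FIRST, MID, LAST = 0, 1, 2


def toy_step(time_step, action, goal=3):
    """Toy dynamics: the observation is a counter that the action increments."""
    observation = time_step.observation + action
    done = observation >= goal
    return TimeStep(LAST if done else MID, 1.0, 0.0 if done else 1.0, observation)


def toy_policy(time_step):
    """The action is computed from the time step it responds to."""
    return 2 if time_step.observation == 0 else 1


time_step = TimeStep(FIRST, 0.0, 1.0, 0)
transitions = []
while time_step.step_type != LAST:
    action = toy_policy(time_step)
    next_time_step = toy_step(time_step, action)
    transitions.append(Transition(time_step, action, next_time_step))
    # The new time step is the input for the next action.
    time_step = next_time_step
```

In graph mode the same ordering has to be enforced through data or control dependencies rather than Python statement order, which is why the text stresses that the action should depend on the previous time_step.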
time_step_spec
time_step_spec()
Describes the TimeStep specs of Tensors returned by step().
| Returns | |
|---|---|
| A TimeStep namedtuple containing TensorSpec objects defining the
Tensors returned by step(), i.e.
(step_type, reward, discount, observation). |