Implements finite-armed Bernoulli Bandits.
Inherits From: BanditPyEnvironment, PyEnvironment
tf_agents.bandits.environments.bernoulli_py_environment.BernoulliPyEnvironment(
means: Sequence[tf_agents.typing.types.Float],
batch_size: Optional[types.Int] = 1
)
This environment implements a finite-armed non-contextual Bernoulli Bandit environment as a subclass of BanditPyEnvironment. For every arm, the reward distribution is 0/1 (Bernoulli) with parameter p set at initialization. For a reference, see, e.g., Example 1.1 in "A Tutorial on Thompson Sampling" by Russo et al. (https://web.stanford.edu/~bvr/pubs/TS_Tutorial.pdf)
Args | |
---|---|
means | Vector of floats in [0, 1], the mean rewards for actions. The number of arms is determined by its length. |
batch_size | (int) The batch size. |
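The reward model described above can be illustrated with a minimal sketch in plain Python. It does not use TF-Agents and is not the library's implementation; `sample_bernoulli_rewards` is a hypothetical helper, and the batch is assumed to be a simple list of arm indices.

```python
import random
from typing import List, Sequence


def sample_bernoulli_rewards(
    means: Sequence[float],
    actions: Sequence[int],
    rng: random.Random,
) -> List[float]:
    """Draws one 0/1 reward per batch entry.

    Hypothetical helper for illustration; not part of TF-Agents.
    """
    for p in means:
        if not 0.0 <= p <= 1.0:
            raise ValueError("every mean must lie in [0, 1]")
    # The reward for arm `a` is 1.0 with probability means[a], else 0.0.
    return [1.0 if rng.random() < means[a] else 0.0 for a in actions]


means = [0.0, 0.5, 1.0]   # three arms; the arm count is len(means)
rng = random.Random(0)    # fixed seed so the draw is repeatable
rewards = sample_bernoulli_rewards(means, actions=[0, 2, 1, 2], rng=rng)
```

Here the batch holds four actions, so four rewards come back. Arm 0 (mean 0.0) never pays out and arm 2 (mean 1.0) always does; only arm 1 is random.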
Methods
action_spec
action_spec() -> tf_agents.typing.types.NestedArraySpec
Defines the actions that should be provided to step().
May use a subclass of ArraySpec that specifies additional properties such as min and max bounds on the values.
Returns | |
---|---|
An ArraySpec, or a nested dict, list or tuple of ArraySpecs. |
close
close() -> None
Frees any resources used by the environment.
Implement this method for an environment backed by an external process.
This method can be used directly:
env = Env(...)
# Use env.
env.close()
or via a context manager:
with Env(...) as env:
  # Use env.
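The two usage patterns can be sketched with a stand-in class. `ToyEnv` is hypothetical and not a TF-Agents class; the sketch assumes that `__exit__` simply calls `close()`, which is what the with-statement pattern above implies.

```python
class ToyEnv:
    """Hypothetical stand-in for an environment; not a TF-Agents class."""

    def __init__(self) -> None:
        self.closed = False

    def close(self) -> None:
        # Free external resources here (processes, sockets, files).
        self.closed = True

    def __enter__(self) -> "ToyEnv":
        return self

    def __exit__(self, exc_type, exc_value, traceback) -> None:
        # Release resources even if the body of the with-block raised.
        self.close()


with ToyEnv() as env:
    inside = env.closed   # the environment is still open inside the block
after = env.closed        # close() has run once the block exits
```

The context-manager form is usually the safer choice, because `close()` still runs when the body raises an exception.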
current_time_step
current_time_step() -> tf_agents.trajectories.TimeStep
Returns the current timestep.
discount_spec
discount_spec() -> tf_agents.typing.types.NestedArraySpec
Defines the discounts that are returned by step().
Override this method to define an environment that uses non-standard discount values, for example an environment with array-valued discounts.
Returns | |
---|---|
An ArraySpec, or a nested dict, list or tuple of ArraySpecs. |
get_info
get_info() -> tf_agents.typing.types.NestedArray
Returns the environment info returned on the last step.
Returns | |
---|---|
Info returned by last call to step(). None by default. |
Raises | |
---|---|
NotImplementedError | If the environment does not use info. |
get_state
get_state() -> Any
Returns the state of the environment.
The state contains everything required to restore the environment to the current configuration. This can contain e.g.
- The current time_step.
- The number of steps taken in the environment (for finite horizon MDPs).
- Hidden state (for POMDPs).
Callers should not assume anything about the contents or format of the returned state. It should be treated as a token that can be passed back to set_state() later.
Note that the returned state handle should not be modified by the environment later on, and ensuring this (e.g. using copy.deepcopy) is the responsibility of the environment.
Returns | |
---|---|
state | The current state of the environment. |
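The snapshot contract described above can be sketched as follows. `CountingEnv` and its `advance` method are hypothetical, and the dictionary layout of the state is an assumption made only for this example; real environments may store anything.

```python
import copy
from typing import Any, Dict


class CountingEnv:
    """Hypothetical environment with mutable internal state."""

    def __init__(self) -> None:
        self._state: Dict[str, Any] = {"steps": 0, "history": []}

    def advance(self, action: int) -> None:
        self._state["steps"] += 1
        self._state["history"].append(action)

    def get_state(self) -> Any:
        # Deep copy so later steps cannot mutate the returned token.
        return copy.deepcopy(self._state)

    def set_state(self, state: Any) -> None:
        # Copy again so the caller's token stays reusable.
        self._state = copy.deepcopy(state)


env = CountingEnv()
env.advance(1)
token = env.get_state()   # opaque snapshot taken after one step
env.advance(0)
env.advance(1)
env.set_state(token)      # roll back to the snapshot
restored = env.get_state()
```

Without the copy in `get_state()`, the two later `advance` calls would have modified `token` in place and the rollback would silently do nothing.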
observation_spec
observation_spec() -> tf_agents.typing.types.NestedArraySpec
Defines the observations provided by the environment.
May use a subclass of ArraySpec that specifies additional properties such as min and max bounds on the values.
Returns | |
---|---|
An ArraySpec, or a nested dict, list or tuple of ArraySpecs. |
render
render(
mode: Text = 'rgb_array'
) -> Optional[types.NestedArray]
Renders the environment.
Args | |
---|---|
mode | One of ['rgb_array', 'human']. Renders to a NumPy array, or brings up a window where the environment can be visualized. |
Returns | |
---|---|
An ndarray of shape [width, height, 3] denoting an RGB image if mode is rgb_array. Otherwise returns nothing and renders directly to a display window. |
Raises | |
---|---|
NotImplementedError | If the environment does not support rendering. |
reset
reset() -> tf_agents.trajectories.TimeStep
Starts a new sequence and returns the first TimeStep of this sequence.
Returns | |
---|---|
A TimeStep namedtuple containing:
- step_type: A StepType of FIRST.
- reward: 0.0, indicating the reward.
- discount: 1.0, indicating the discount.
- observation: A NumPy array, or a nested dict, list or tuple of arrays corresponding to observation_spec().
|
reward_spec
reward_spec() -> tf_agents.typing.types.NestedArraySpec
Defines the rewards that are returned by step().
Override this method to define an environment that uses non-standard reward values, for example an environment with array-valued rewards.
Returns | |
---|---|
An ArraySpec, or a nested dict, list or tuple of ArraySpecs. |
seed
seed(
seed: tf_agents.typing.types.Seed
) -> Any
Seeds the environment.
Args | |
---|---|
seed | Value to use as seed for the environment. |
set_state
set_state(
state: Any
) -> None
Restores the environment to a given state.
See definition of state in the documentation for get_state().
Args | |
---|---|
state | A state to restore the environment to. |
should_reset
should_reset(
current_time_step: tf_agents.trajectories.TimeStep
) -> bool
Whether the Environment should reset given the current timestep.
By default it only resets when all time_steps are LAST.
Args | |
---|---|
current_time_step | The current TimeStep. |
Returns | |
---|---|
A bool indicating whether the Environment should reset or not. |
step
step(
action: tf_agents.typing.types.NestedArray
) -> tf_agents.trajectories.TimeStep
Updates the environment according to the action and returns a TimeStep.
If the environment returned a TimeStep with StepType.LAST at the previous step, the implementation of _step in the environment should call reset to start a new sequence and ignore action.
This method will start a new sequence if called after the environment has been constructed and reset has not been called. In this case action will be ignored.
If should_reset(current_time_step) is True, then this method will reset by itself. In this case action will be ignored.
Args | |
---|---|
action | A NumPy array, or a nested dict, list or tuple of arrays corresponding to action_spec(). |
Returns | |
---|---|
A TimeStep namedtuple containing:
- step_type: A StepType value.
- reward: A NumPy array, reward value for this timestep.
- discount: A NumPy array, discount in the range [0, 1].
- observation: A NumPy array, or a nested dict, list or tuple of arrays corresponding to observation_spec().
|
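The auto-reset rules for step() can be sketched with a toy environment. Everything here is a hypothetical stand-in: `ToyTimeStep`, `ToyEpisodeEnv`, the integer step types and the two-step episode length are assumptions for illustration, not TF-Agents types or behavior specific to the Bernoulli environment.

```python
from typing import NamedTuple, Optional

FIRST, MID, LAST = 0, 1, 2  # stand-ins for StepType values


class ToyTimeStep(NamedTuple):
    step_type: int
    reward: float
    discount: float
    observation: int


class ToyEpisodeEnv:
    """Hypothetical two-step environment; not a TF-Agents class."""

    def __init__(self) -> None:
        self._current: Optional[ToyTimeStep] = None
        self._t = 0

    def should_reset(self, time_step: ToyTimeStep) -> bool:
        return time_step.step_type == LAST

    def reset(self) -> ToyTimeStep:
        self._t = 0
        self._current = ToyTimeStep(FIRST, 0.0, 1.0, self._t)
        return self._current

    def step(self, action: int) -> ToyTimeStep:
        # Never reset, or the previous step ended the sequence:
        # start a new sequence and ignore `action`.
        if self._current is None or self.should_reset(self._current):
            return self.reset()
        self._t += 1
        done = self._t >= 2
        self._current = ToyTimeStep(
            LAST if done else MID, float(action), 0.0 if done else 1.0, self._t
        )
        return self._current


env = ToyEpisodeEnv()
types = [env.step(1).step_type for _ in range(4)]
```

The first call to `step` acts as a reset because `reset` was never called, and the fourth call resets again because the third returned a LAST step. In both cases the action is ignored and the reward is 0.0.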
time_step_spec
time_step_spec() -> tf_agents.trajectories.TimeStep
Describes the TimeStep fields returned by step().
Override this method to define an environment that uses non-standard values for any of the items returned by step(). For example, an environment with array-valued rewards.
Returns | |
---|---|
A TimeStep namedtuple containing (possibly nested) ArraySpecs defining the step_type, reward, discount, and observation structure. |
__enter__
__enter__()
Allows the environment to be used in a with-statement context.
__exit__
__exit__(
unused_exception_type, unused_exc_value, unused_traceback
)
Allows the environment to be used in a with-statement context.