Implements finite-armed Bernoulli Bandits.
Inherits From: BanditPyEnvironment, PyEnvironment
tf_agents.bandits.environments.bernoulli_py_environment.BernoulliPyEnvironment(
    means: Sequence[tf_agents.typing.types.Float],
    batch_size: Optional[types.Int] = 1
)
This environment implements a finite-armed, non-contextual Bernoulli bandit environment as a subclass of BanditPyEnvironment. For every arm, the reward distribution is 0/1 (Bernoulli) with parameter p fixed at initialization. For a reference, see, e.g., Example 1.1 in "A Tutorial on Thompson Sampling" by Russo et al. (https://web.stanford.edu/~bvr/pubs/TS_Tutorial.pdf).
| Args | |
|---|---|
| means | vector of floats in [0, 1], the mean rewards for actions. The number of arms is determined by its length. | 
| batch_size | (int) The batch size. | 
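A minimal usage sketch (the three arm means and the batch size below are illustrative values, not defaults of the API):
import numpy as np
from tf_agents.bandits.environments import bernoulli_py_environment

# Three arms with success probabilities 0.1, 0.5 and 0.8; each pull of arm i
# yields a Bernoulli(means[i]) reward.
env = bernoulli_py_environment.BernoulliPyEnvironment(
    means=[0.1, 0.5, 0.8], batch_size=2)

time_step = env.reset()
# Pull arm 2 for both batch entries; the reward is a 0/1 sample per entry.
time_step = env.step(np.array([2, 2]))
print(time_step.reward)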
Methods
action_spec
action_spec() -> tf_agents.typing.types.NestedArraySpec
Defines the actions that should be provided to step().
May use a subclass of ArraySpec that specifies additional properties such
as min and max bounds on the values.
| Returns | |
|---|---|
| An ArraySpec, or a nested dict, list or tuple of ArraySpecs. | 
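For the illustrative three-arm environment constructed above, inspecting the spec makes the action bounds explicit (a sketch; the exact shape and dtype come from the environment):
# Expected to be a bounded integer spec with minimum 0 and maximum len(means) - 1.
print(env.action_spec())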
close
close() -> None
Frees any resources used by the environment.
Implement this method for an environment backed by an external process.
This method can be used directly
env = Env(...)
# Use env.
env.close()
or via a context manager
with Env(...) as env:
  # Use env.
current_time_step
current_time_step() -> tf_agents.trajectories.TimeStep
Returns the current timestep.
discount_spec
discount_spec() -> tf_agents.typing.types.NestedArraySpec
Defines the discount that is returned by step().
Override this method to define an environment that uses non-standard discount values, for example an environment with array-valued discounts.
| Returns | |
|---|---|
| An ArraySpec, or a nested dict, list or tuple of ArraySpecs. | 
get_info
get_info() -> tf_agents.typing.types.NestedArray
Returns the environment info returned on the last step.
| Returns | |
|---|---|
| Info returned by last call to step(). None by default. | 
| Raises | |
|---|---|
| NotImplementedError | If the environment does not use info. | 
get_state
get_state() -> Any
Returns the state of the environment.
The state contains everything required to restore the environment to the
current configuration. This can contain e.g.
- The current time_step.
- The number of steps taken in the environment (for finite horizon MDPs).
- Hidden state (for POMDPs).
Callers should not assume anything about the contents or format of the
returned state. It should be treated as a token that can be passed back to
set_state() later.
Note that the returned state handle should not be modified by the
environment later on, and ensuring this (e.g. using copy.deepcopy) is the
responsibility of the environment.
| Returns | |
|---|---|
| state | The current state of the environment. | 
observation_spec
observation_spec() -> tf_agents.typing.types.NestedArraySpec
Defines the observations provided by the environment.
May use a subclass of ArraySpec that specifies additional properties such
as min and max bounds on the values.
| Returns | |
|---|---|
| An ArraySpec, or a nested dict, list or tuple of ArraySpecs. | 
render
render(
    mode: Text = 'rgb_array'
) -> Optional[types.NestedArray]
Renders the environment.
| Args | |
|---|---|
| mode | One of ['rgb_array', 'human']. Renders to a numpy array, or brings up a window where the environment can be visualized. | 
| Returns | |
|---|---|
| An ndarray of shape [width, height, 3] denoting an RGB image if mode is 'rgb_array'. Otherwise, returns nothing and renders directly to a display window. | 
| Raises | |
|---|---|
| NotImplementedError | If the environment does not support rendering. | 
reset
reset() -> tf_agents.trajectories.TimeStep
Starts a new sequence and returns the first TimeStep of this sequence.
| Returns | |
|---|---|
| A TimeStep namedtuple containing:
step_type: A StepType of FIRST.
reward: 0.0, indicating the reward.
discount: 1.0, indicating the discount.
observation: A NumPy array, or a nested dict, list or tuple of arrays corresponding to observation_spec(). | 
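As a sketch, with the illustrative environment constructed above, the first TimeStep of a sequence looks like this:
first = env.reset()
print(first.step_type)  # FIRST for every batch entry.
print(first.reward)     # 0.0 for every batch entry.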
reward_spec
reward_spec() -> tf_agents.typing.types.NestedArraySpec
Defines the rewards that are returned by step().
Override this method to define an environment that uses non-standard reward values, for example an environment with array-valued rewards.
| Returns | |
|---|---|
| An ArraySpec, or a nested dict, list or tuple of ArraySpecs. | 
seed
seed(
    seed: tf_agents.typing.types.Seed
) -> Any
Seeds the environment.
| Args | |
|---|---|
| seed | Value to use as seed for the environment. | 
set_state
set_state(
    state: Any
) -> None
Restores the environment to a given state.
See definition of state in the documentation for get_state().
| Args | |
|---|---|
| state | A state to restore the environment to. | 
should_reset
should_reset(
    current_time_step: tf_agents.trajectories.TimeStep
) -> bool
Whether the Environment should reset given the current timestep.
By default it only resets when all time_steps are LAST.
| Args | |
|---|---|
| current_time_step | The current TimeStep. | 
| Returns | |
|---|---|
| A bool indicating whether the Environment should reset or not. | 
step
step(
    action: tf_agents.typing.types.NestedArray
) -> tf_agents.trajectories.TimeStep
Updates the environment according to the action and returns a TimeStep.
If the environment returned a TimeStep with StepType.LAST at the
previous step, the implementation of _step in the environment should call
reset to start a new sequence and ignore action.
This method will start a new sequence if called after the environment
has been constructed and reset has not been called. In this case
action will be ignored.
If should_reset(current_time_step) is True, then this method will reset
by itself. In this case action will be ignored.
| Args | |
|---|---|
| action | A NumPy array, or a nested dict, list or tuple of arrays
corresponding to action_spec(). | 
| Returns | |
|---|---|
| A TimeStep namedtuple containing:
step_type: A StepType value.
reward: A NumPy array, reward value for this timestep.
discount: A NumPy array, discount in the range [0, 1].
observation: A NumPy array, or a nested dict, list or tuple of arrays corresponding to observation_spec(). | 
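As a sketch of the reset()/step() contract, the loop below pulls uniformly random arms of the illustrative environment constructed above; this is not a bandit algorithm, only an API demonstration:
time_step = env.reset()
total_reward = 0.0
num_rounds = 10
for _ in range(num_rounds):
  # One action per batch entry, drawn uniformly from the three arms.
  action = np.random.randint(low=0, high=3, size=2).astype(np.int32)
  time_step = env.step(action)
  total_reward += np.sum(time_step.reward)
print('Average reward per pull:', total_reward / (num_rounds * 2))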
time_step_spec
time_step_spec() -> tf_agents.trajectories.TimeStep
Describes the TimeStep fields returned by step().
Override this method to define an environment that uses non-standard values
for any of the items returned by step(). For example, an environment with
array-valued rewards.
| Returns | |
|---|---|
| A TimeStep namedtuple containing (possibly nested) ArraySpecs defining
the step_type, reward, discount, and observation structure. | 
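Printing the spec of the illustrative environment makes this nested structure explicit (a sketch; exact shapes and dtypes depend on the environment):
# A TimeStep namedtuple of ArraySpecs for step_type, reward, discount and observation.
print(env.time_step_spec())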
__enter__
__enter__()
Allows the environment to be used in a with-statement context.
__exit__
__exit__(
    unused_exception_type, unused_exc_value, unused_traceback
)
Allows the environment to be used in a with-statement context.