Dynamics for a drifting linear environment.
Inherits From: EnvironmentDynamics
tf_agents.bandits.environments.drifting_linear_environment.DriftingLinearDynamics(
    observation_distribution: types.Distribution,
    observation_to_reward_distribution: types.Distribution,
    drift_distribution: types.Distribution,
    additive_reward_distribution: types.Distribution
)
This is a drifting linear environment that computes rewards as:

rewards(t) = observation(t) * observation_to_reward(t) + additive_reward(t)

where t is the environment time and observation_to_reward slowly rotates over time. The environment time is incremented in the base class after the reward is computed. The parameters observation_to_reward and additive_reward are updated at each time step.
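As a minimal sketch of the reward formula above (shapes and random values are illustrative only, not part of the API), the per-arm rewards are the matrix product of the observation batch with the linear parameter, plus the per-arm additive term:

```python
import tensorflow as tf

# Illustrative shapes only: batch_size=8, observation_dim=4, num_actions=3.
observation = tf.random.normal([8, 4])            # observation(t)
observation_to_reward = tf.random.normal([4, 3])  # observation_to_reward(t)
additive_reward = tf.random.normal([3])           # additive_reward(t)

# rewards(t) = observation(t) * observation_to_reward(t) + additive_reward(t)
rewards = tf.matmul(observation, observation_to_reward) + additive_reward
print(rewards.shape)  # (8, 3) -> [batch_size, num_actions]
```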
In order to preserve the norm of observation_to_reward (and hence the range of reward values), the drift is applied in the form of rotations, i.e.,

observation_to_reward(t) = R(theta(t)) * observation_to_reward(t - 1)

where theta is the angle of the rotation. The angle is sampled from a provided input distribution.
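The sketch below shows one way such a drift step could look; it rotates in the plane of the first two observation dimensions (an assumption made here for illustration; the library may choose the rotation plane differently), which leaves the column norms of observation_to_reward unchanged:

```python
import tensorflow as tf

def drift_step(observation_to_reward, theta):
  """Illustrative drift: rotate observation_to_reward by angle theta (radians).

  Assumes observation_dim >= 2 and rotates only the first two observation
  dimensions; any rotation matrix R(theta) preserves the norms the same way.
  """
  observation_dim = tf.shape(observation_to_reward)[0]
  cos_t, sin_t = tf.cos(theta), tf.sin(theta)
  # Start from the identity and overwrite the 2x2 block of the first two axes.
  rotation = tf.eye(observation_dim)
  rotation = tf.tensor_scatter_nd_update(
      rotation,
      indices=[[0, 0], [0, 1], [1, 0], [1, 1]],
      updates=tf.stack([cos_t, -sin_t, sin_t, cos_t]))
  # observation_to_reward(t) = R(theta(t)) * observation_to_reward(t - 1)
  return tf.matmul(rotation, observation_to_reward)
```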
| Args | |
| --- | --- |
| observation_distribution | A distribution from tfp.distributions with shape [batch_size, observation_dim]. Note that the values of batch_size and observation_dim are deduced from the distribution. |
| observation_to_reward_distribution | A distribution from tfp.distributions with shape [observation_dim, num_actions]. The value observation_dim must match the second dimension of observation_distribution. |
| drift_distribution | A scalar distribution from tfp.distributions of type tf.float32. It represents the angle of rotation. |
| additive_reward_distribution | A distribution from tfp.distributions with shape [num_actions]. This models the non-contextual behavior of the bandit. |
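A minimal construction sketch, assuming simple Normal distributions and example shapes (batch_size=8, observation_dim=4, num_actions=3); any tfp distributions with the shapes listed above would work:

```python
import tensorflow as tf
import tensorflow_probability as tfp
from tf_agents.bandits.environments import drifting_linear_environment as dle

tfd = tfp.distributions
batch_size, observation_dim, num_actions = 8, 4, 3

dynamics = dle.DriftingLinearDynamics(
    # Observations: [batch_size, observation_dim].
    observation_distribution=tfd.Normal(
        loc=tf.zeros([batch_size, observation_dim]),
        scale=tf.ones([batch_size, observation_dim])),
    # Linear map from observations to per-arm rewards:
    # [observation_dim, num_actions].
    observation_to_reward_distribution=tfd.Normal(
        loc=tf.zeros([observation_dim, num_actions]),
        scale=tf.ones([observation_dim, num_actions])),
    # Scalar rotation angle (radians) sampled at every step.
    drift_distribution=tfd.Normal(loc=0.0, scale=0.01),
    # Non-contextual per-arm reward: [num_actions].
    additive_reward_distribution=tfd.Normal(
        loc=tf.zeros([num_actions]),
        scale=tf.ones([num_actions])))
```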
| Attributes | |
| --- | --- |
| action_spec | Specification of the actions. |
| batch_size | Returns the batch size used for observations and rewards. |
| observation_spec | Specification of the observations. |
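For example, continuing the construction sketch above, the specs and the batch size are derived from the distributions passed to the constructor:

```python
# Values in the comments are what the example shapes above would imply.
print(dynamics.batch_size)        # 8
print(dynamics.observation_spec)  # spec for a single 4-dimensional observation
print(dynamics.action_spec)       # spec for selecting among 3 arms
```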
Methods
compute_optimal_action
compute_optimal_action(
    observation: tf_agents.typing.types.NestedTensor
) -> tf_agents.typing.types.NestedTensor
compute_optimal_reward
compute_optimal_reward(
    observation: tf_agents.typing.types.NestedTensor
) -> tf_agents.typing.types.NestedTensor
observation
observation(
    unused_t
) -> tf_agents.typing.types.NestedTensor
Returns an observation batch for the given time.
| Args | |
| --- | --- |
| env_time | The scalar int64 tensor of the environment time step. This is incremented by the environment after the reward is computed. |

| Returns | |
| --- | --- |
| The observation batch with spec according to observation_spec. |
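An illustrative call, using the dynamics instance from the constructor sketch above:

```python
env_time = tf.constant(0, dtype=tf.int64)
observation = dynamics.observation(env_time)  # [batch_size, observation_dim]
```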
reward
reward(
    observation: tf_agents.typing.types.NestedTensor,
    t: tf_agents.typing.types.Int
) -> tf_agents.typing.types.NestedTensor
Reward for the given observation and time step.
| Args | |
| --- | --- |
| observation | A batch of observations with spec according to observation_spec. |
| env_time | The scalar int64 tensor of the environment time step. This is incremented by the environment after the reward is computed. |

| Returns | |
| --- | --- |
| A batch of rewards with spec shape [batch_size, num_actions] containing rewards for all arms. |
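Continuing the sketches above, the reward and the optimal-action helpers can be called on the observation batch; compute_optimal_action and compute_optimal_reward are assumed here to report, per observation, the best arm and its reward under the current (drifted) parameters, which is how regret metrics typically use them:

```python
rewards = dynamics.reward(observation, env_time)  # [batch_size, num_actions]
optimal_action = dynamics.compute_optimal_action(observation)
optimal_reward = dynamics.compute_optimal_reward(observation)
```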