tf_agents.bandits.environments.drifting_linear_environment.DriftingLinearDynamics

A drifting linear environment dynamics.

Inherits From: EnvironmentDynamics

This is a drifting linear environment which computes rewards as:

rewards(t) = observation(t) * observation_to_reward(t) + additive_reward(t)

where t is the environment time and observation_to_reward slowly rotates over time. The base class increments the environment time after the reward is computed. The parameters observation_to_reward and additive_reward are updated at each time step. To preserve the norm of observation_to_reward (and hence the range of reward values), the drift is applied in the form of rotations, i.e.,

observation_to_reward(t) = R(theta(t)) * observation_to_reward(t - 1)

where theta is the angle of the rotation. The angle is sampled from the provided drift_distribution.
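
The drift rule above can be sketched in NumPy. This is a minimal illustration, not the library's implementation: it assumes R(theta) is a Givens rotation in a single coordinate plane, and the helper names givens_rotation, drift_step, and linear_reward are hypothetical. It shows that rotating observation_to_reward leaves its norm unchanged while the rewards keep the form observation * observation_to_reward + additive_reward.

```python
import numpy as np


def givens_rotation(dim, i, j, theta):
    """Return a dim x dim rotation by angle theta in the (i, j) plane."""
    rot = np.eye(dim)
    c, s = np.cos(theta), np.sin(theta)
    rot[i, i] = c
    rot[j, j] = c
    rot[i, j] = -s
    rot[j, i] = s
    return rot


def drift_step(observation_to_reward, theta, i=0, j=1):
    """Apply observation_to_reward(t) = R(theta) * observation_to_reward(t - 1).

    Assumption: the rotation acts in the (i, j) coordinate plane.
    """
    rot = givens_rotation(observation_to_reward.shape[0], i, j, theta)
    return rot @ observation_to_reward


def linear_reward(observation, observation_to_reward, additive_reward):
    """Compute per-arm rewards with shape [batch_size, num_actions]."""
    return observation @ observation_to_reward + additive_reward


rng = np.random.default_rng(0)
batch_size, observation_dim, num_actions = 4, 3, 2

observation = rng.normal(size=(batch_size, observation_dim))
weights = rng.normal(size=(observation_dim, num_actions))
additive = rng.normal(size=(num_actions,))

# Sample a small rotation angle, standing in for drift_distribution.
theta = rng.normal(loc=0.0, scale=0.1)
drifted = drift_step(weights, theta)
rewards = linear_reward(observation, drifted, additive)
```

Because a rotation matrix is orthogonal, each column of drifted has the same Euclidean norm as the corresponding column of weights, which is why the reward range stays stable as the parameters drift.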

Args
observation_distribution A distribution from tfp.distributions with shape [batch_size, observation_dim]. Note that the values of batch_size and observation_dim are deduced from the distribution.
observation_to_reward_distribution A distribution from tfp.distributions with shape [observation_dim, num_actions]. The value observation_dim must match the second dimension of observation_distribution.
drift_distribution A scalar distribution from tfp.distributions of type tf.float32. It represents the angle of rotation.
additive_reward_distribution A distribution from tfp.distributions with shape [num_actions]. This models the non-contextual behavior of the bandit.
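
The shape constraints among these arguments can be sketched in plain Python. The helper deduce_dims is hypothetical and works on shape tuples rather than on tfp.distributions objects; whether the library performs each of these checks itself is an assumption, not something stated here.

```python
def deduce_dims(observation_shape, observation_to_reward_shape,
                additive_reward_shape):
    """Deduce (batch_size, observation_dim, num_actions) from argument shapes.

    Hypothetical helper: mirrors the documented shape requirements only.
    """
    batch_size, observation_dim = observation_shape
    weight_dim, num_actions = observation_to_reward_shape

    # observation_dim must match the second dimension of the observations.
    if weight_dim != observation_dim:
        raise ValueError(
            f"observation_dim mismatch: {weight_dim} != {observation_dim}")

    # The additive reward is documented as having shape [num_actions].
    if tuple(additive_reward_shape) != (num_actions,):
        raise ValueError(
            f"additive reward shape {tuple(additive_reward_shape)} "
            f"does not match ({num_actions},)")

    return batch_size, observation_dim, num_actions


dims = deduce_dims(
    observation_shape=(8, 5),
    observation_to_reward_shape=(5, 3),
    additive_reward_shape=(3,),
)
```

Here dims unpacks to a batch size of 8, an observation dimension of 5, and 3 actions.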

Attributes
action_spec Specification of the actions.
batch_size Returns the batch size used for observations and rewards.
observation_spec Specification of the observations.

Methods

compute_optimal_action

compute_optimal_reward

observation

Returns an observation batch for the given time.

Args
env_time The scalar int64 tensor of the environment time step. This is incremented by the environment after the reward is computed.

Returns
The observation batch with spec according to observation_spec.

reward

Reward for the given observation and time step.

Args
observation A batch of observations with spec according to observation_spec.
env_time The scalar int64 tensor of the environment time step. This is incremented by the environment after the reward is computed.

Returns
A batch of rewards with shape [batch_size, num_actions], containing the rewards for all arms.
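
Since the returned batch holds one reward per arm, the best arm for each observation can be read off row by row. The NumPy sketch below assumes that compute_optimal_action and compute_optimal_reward correspond to an argmax and a max over the arm axis; this page does not describe those methods, so treat that correspondence as an assumption. The reward values are made up for illustration.

```python
import numpy as np

# Illustrative rewards with shape [batch_size, num_actions] = [2, 3].
rewards = np.array([[0.2, 1.5, -0.3],
                    [0.9, 0.1, 0.4]])

# Index of the best arm for each observation in the batch.
optimal_action = np.argmax(rewards, axis=-1)

# Reward obtained by playing the best arm for each observation.
optimal_reward = np.max(rewards, axis=-1)
```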