Dynamics for a drifting linear environment.
Inherits From: EnvironmentDynamics
tf_agents.bandits.environments.drifting_linear_environment.DriftingLinearDynamics(
    observation_distribution: types.Distribution,
    observation_to_reward_distribution: types.Distribution,
    drift_distribution: types.Distribution,
    additive_reward_distribution: types.Distribution
)
This is a drifting linear environment that computes rewards as:

rewards(t) = observation(t) * observation_to_reward(t) + additive_reward(t)

where t is the environment time and observation_to_reward slowly rotates over time. The environment time is incremented in the base class after the reward is computed. The parameters observation_to_reward and additive_reward are updated at each time step.
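As a minimal sketch of the reward formula above (shapes and random values are illustrative only, not part of the API), the per-arm rewards are the matrix product of the observation batch with the linear parameter, plus the per-arm additive term:

```python
import tensorflow as tf

# Illustrative shapes only: batch_size=8, observation_dim=4, num_actions=3.
observation = tf.random.normal([8, 4])            # observation(t)
observation_to_reward = tf.random.normal([4, 3])  # observation_to_reward(t)
additive_reward = tf.random.normal([3])           # additive_reward(t)

# rewards(t) = observation(t) * observation_to_reward(t) + additive_reward(t)
rewards = tf.matmul(observation, observation_to_reward) + additive_reward
print(rewards.shape)  # (8, 3) -> [batch_size, num_actions]
```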
In order to preserve the norm of observation_to_reward (and hence the range of reward values), the drift is applied in the form of rotations, i.e.,

observation_to_reward(t) = R(theta(t)) * observation_to_reward(t - 1)

where theta is the angle of the rotation. The angle is sampled from a provided input distribution.
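The sketch below shows one way such a drift step could look; it rotates in the plane of the first two observation dimensions (an assumption made here for illustration; the library may choose the rotation plane differently), which leaves the column norms of observation_to_reward unchanged:

```python
import tensorflow as tf

def drift_step(observation_to_reward, theta):
  """Illustrative drift: rotate observation_to_reward by angle theta (radians).

  Assumes observation_dim >= 2 and rotates only the first two observation
  dimensions; any rotation matrix R(theta) preserves the norms the same way.
  """
  observation_dim = tf.shape(observation_to_reward)[0]
  cos_t, sin_t = tf.cos(theta), tf.sin(theta)
  # Start from the identity and overwrite the 2x2 block of the first two axes.
  rotation = tf.eye(observation_dim)
  rotation = tf.tensor_scatter_nd_update(
      rotation,
      indices=[[0, 0], [0, 1], [1, 0], [1, 1]],
      updates=tf.stack([cos_t, -sin_t, sin_t, cos_t]))
  # observation_to_reward(t) = R(theta(t)) * observation_to_reward(t - 1)
  return tf.matmul(rotation, observation_to_reward)
```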
| Args | |
| --- | --- |
| observation_distribution | A distribution from tfp.distributions with shape [batch_size, observation_dim]. Note that the values of batch_size and observation_dim are deduced from the distribution. |
| observation_to_reward_distribution | A distribution from tfp.distributions with shape [observation_dim, num_actions]. The value observation_dim must match the second dimension of observation_distribution. |
| drift_distribution | A scalar distribution from tfp.distributions of type tf.float32. It represents the angle of rotation. |
| additive_reward_distribution | A distribution from tfp.distributions with shape [num_actions]. This models the non-contextual behavior of the bandit. |
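A minimal construction sketch, assuming simple Normal distributions and example shapes (batch_size=8, observation_dim=4, num_actions=3); any tfp distributions with the shapes listed above would work:

```python
import tensorflow as tf
import tensorflow_probability as tfp
from tf_agents.bandits.environments import drifting_linear_environment as dle

tfd = tfp.distributions
batch_size, observation_dim, num_actions = 8, 4, 3

dynamics = dle.DriftingLinearDynamics(
    # Observations: [batch_size, observation_dim].
    observation_distribution=tfd.Normal(
        loc=tf.zeros([batch_size, observation_dim]),
        scale=tf.ones([batch_size, observation_dim])),
    # Linear map from observations to per-arm rewards:
    # [observation_dim, num_actions].
    observation_to_reward_distribution=tfd.Normal(
        loc=tf.zeros([observation_dim, num_actions]),
        scale=tf.ones([observation_dim, num_actions])),
    # Scalar rotation angle (radians) sampled at every step.
    drift_distribution=tfd.Normal(loc=0.0, scale=0.01),
    # Non-contextual per-arm reward: [num_actions].
    additive_reward_distribution=tfd.Normal(
        loc=tf.zeros([num_actions]),
        scale=tf.ones([num_actions])))
```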
| Attributes | |
| --- | --- |
| action_spec | Specification of the actions. |
| batch_size | Returns the batch size used for observations and rewards. |
| observation_spec | Specification of the observations. |
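For example, continuing the construction sketch above, the specs and the batch size are derived from the distributions passed to the constructor:

```python
# Values in the comments are what the example shapes above would imply.
print(dynamics.batch_size)        # 8
print(dynamics.observation_spec)  # spec for a single 4-dimensional observation
print(dynamics.action_spec)       # spec for selecting among 3 arms
```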
Methods
compute_optimal_action
compute_optimal_action(
    observation: tf_agents.typing.types.NestedTensor
) -> tf_agents.typing.types.NestedTensor
compute_optimal_reward
compute_optimal_reward(
    observation: tf_agents.typing.types.NestedTensor
) -> tf_agents.typing.types.NestedTensor
observation
observation(
    unused_t
) -> tf_agents.typing.types.NestedTensor
Returns an observation batch for the given time.
| Args | |
| --- | --- |
| env_time | The scalar int64 tensor of the environment time step. This is incremented by the environment after the reward is computed. |

| Returns | |
| --- | --- |
| The observation batch with spec according to observation_spec. |
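An illustrative call, using the dynamics instance from the constructor sketch above:

```python
env_time = tf.constant(0, dtype=tf.int64)
observation = dynamics.observation(env_time)  # [batch_size, observation_dim]
```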
reward
reward(
    observation: tf_agents.typing.types.NestedTensor,
    t: tf_agents.typing.types.Int
) -> tf_agents.typing.types.NestedTensor
Reward for the given observation and time step.
| Args | |
| --- | --- |
| observation | A batch of observations with spec according to observation_spec. |
| env_time | The scalar int64 tensor of the environment time step. This is incremented by the environment after the reward is computed. |

| Returns | |
| --- | --- |
| A batch of rewards with spec shape [batch_size, num_actions] containing rewards for all arms. |
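Continuing the sketches above, the reward and the optimal-action helpers can be called on the observation batch; compute_optimal_action and compute_optimal_reward are assumed here to report, per observation, the best arm and its reward under the current (drifted) parameters, which is how regret metrics typically use them:

```python
rewards = dynamics.reward(observation, env_time)  # [batch_size, num_actions]
optimal_action = dynamics.compute_optimal_action(observation)
optimal_reward = dynamics.compute_optimal_reward(observation)
```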