Linear Thompson Sampling Policy.
Inherits From: LinearBanditPolicy, TFPolicy
```python
tf_agents.bandits.policies.linear_thompson_sampling_policy.LinearThompsonSamplingPolicy(
    action_spec: tf_agents.typing.types.BoundedTensorSpec,
    cov_matrix: Sequence[tf_agents.typing.types.Float],
    data_vector: Sequence[tf_agents.typing.types.Float],
    num_samples: Sequence[tf_agents.typing.types.Int],
    time_step_spec: Optional[tf_agents.typing.types.TimeStep] = None,
    alpha: float = 1.0,
    eig_vals: Sequence[tf_agents.typing.types.Float] = (),
    eig_matrix: Sequence[tf_agents.typing.types.Float] = (),
    tikhonov_weight: float = 1.0,
    add_bias: bool = False,
    emit_policy_info: Sequence[Text] = (),
    observation_and_action_constraint_splitter: Optional[types.Splitter] = None,
    name: Optional[Text] = None
)
```
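A minimal construction sketch. The context dimension, number of arms, and per-arm statistic shapes below are illustrative assumptions, not values prescribed by the API:

```python
import tensorflow as tf
from tf_agents.bandits.policies import linear_thompson_sampling_policy
from tf_agents.specs import tensor_spec
from tf_agents.trajectories import time_step as ts

context_dim = 2   # assumed observation (context) dimension
num_actions = 3   # assumed number of arms

observation_spec = tf.TensorSpec([context_dim], tf.float32)
action_spec = tensor_spec.BoundedTensorSpec(
    shape=(), dtype=tf.int32, minimum=0, maximum=num_actions - 1)

# One covariance matrix `B`, data vector `f`, and sample count per arm.
cov_matrix = [tf.eye(context_dim) for _ in range(num_actions)]
data_vector = [tf.zeros([context_dim]) for _ in range(num_actions)]
num_samples = [tf.zeros([], dtype=tf.int64) for _ in range(num_actions)]

policy = linear_thompson_sampling_policy.LinearThompsonSamplingPolicy(
    action_spec=action_spec,
    cov_matrix=cov_matrix,
    data_vector=data_vector,
    num_samples=num_samples,
    time_step_spec=ts.time_step_spec(observation_spec))
```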
Implements the Linear Thompson Sampling Policy from the paper "Thompson Sampling for Contextual Bandits with Linear Payoffs", Shipra Agrawal and Navin Goyal, ICML 2013. The implemented algorithm is Algorithm 3 from the supplementary material of the paper, available at http://proceedings.mlr.press/v28/agrawal13-supp.pdf.

In a nutshell, the algorithm estimates a reward distribution for every action based on the parameters `B_inv` and `f`. It then samples a reward from each action's distribution and takes the argmax over the sampled rewards.
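As an illustration of that sampling step, here is a standalone NumPy sketch (not the library's implementation); `B` and `f` are the per-arm statistics, and `alpha` scales the exploration as in the constructor argument above:

```python
import numpy as np

def thompson_sampling_choice(context, cov_matrices, data_vectors, alpha=1.0):
  """Illustrative per-arm linear Thompson sampling step."""
  sampled_rewards = []
  for B, f in zip(cov_matrices, data_vectors):
    B_inv = np.linalg.inv(B)                    # B_inv for this arm
    theta_mean = B_inv.dot(f)                   # posterior mean B_inv @ f
    theta = np.random.multivariate_normal(theta_mean, alpha**2 * B_inv)
    sampled_rewards.append(context.dot(theta))  # sampled reward estimate
  return int(np.argmax(sampled_rewards))        # act greedily on the samples
```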
Methods
action
```python
action(
    time_step: tf_agents.trajectories.TimeStep,
    policy_state: tf_agents.typing.types.NestedTensor = (),
    seed: Optional[types.Seed] = None
) -> tf_agents.trajectories.PolicyStep
```
Generates next action given the time_step and policy_state.
| Args | |
|---|---|
| `time_step` | A `TimeStep` tuple corresponding to `time_step_spec()`. |
| `policy_state` | A Tensor, or a nested dict, list or tuple of Tensors representing the previous `policy_state`. |
| `seed` | Seed to use if action performs sampling (optional). |
| Returns | |
|---|---|
| A `PolicyStep` named tuple containing:<br>`action`: An action Tensor matching the `action_spec`.<br>`state`: A policy state tensor to be fed into the next call to `action`.<br>`info`: Optional side information such as action log probabilities. | |
| Raises | |
|---|---|
| `RuntimeError` | If subclass `__init__` didn't call `super().__init__`. |
| `ValueError` or `TypeError` | If `validate_args` is True and inputs or outputs do not match `time_step_spec`, `policy_state_spec`, or `policy_step_spec`. |
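A usage sketch, assuming the `policy` built in the constructor example above and a batch of two 2-dimensional contexts:

```python
import tensorflow as tf
from tf_agents.trajectories import time_step as ts

observations = tf.constant([[1.0, 2.0], [3.0, 4.0]], dtype=tf.float32)
time_step = ts.restart(observations, batch_size=2)

action_step = policy.action(time_step)
print(action_step.action)  # One chosen arm index per row of the batch.
```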
distribution
```python
distribution(
    time_step: tf_agents.trajectories.TimeStep,
    policy_state: tf_agents.typing.types.NestedTensor = ()
) -> tf_agents.trajectories.PolicyStep
```
Generates the distribution over next actions given the time_step.
| Args | |
|---|---|
| `time_step` | A `TimeStep` tuple corresponding to `time_step_spec()`. |
| `policy_state` | A Tensor, or a nested dict, list or tuple of Tensors representing the previous `policy_state`. |
| Returns | |
|---|---|
| A `PolicyStep` named tuple containing:<br>`action`: A distribution capturing the distribution of next actions.<br>`state`: A policy state tensor for the next call to distribution.<br>`info`: Optional side information such as action log probabilities. | |
| Raises | |
|---|---|
| `ValueError` or `TypeError` | If `validate_args` is True and inputs or outputs do not match `time_step_spec`, `policy_state_spec`, or `policy_step_spec`. |
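A usage sketch, continuing with the `policy` and `time_step` from the sketches above; the returned step carries a distribution object in its `action` field, which can be sampled:

```python
distribution_step = policy.distribution(time_step)
action_distribution = distribution_step.action   # a distribution over arms
sampled_actions = action_distribution.sample()   # draw concrete arm indices
```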
get_initial_state
```python
get_initial_state(
    batch_size: Optional[types.Int]
) -> tf_agents.typing.types.NestedTensor
```
Returns an initial state usable by the policy.
| Args | |
|---|---|
| `batch_size` | Tensor or constant: size of the batch dimension. Can be None, in which case no batch dimension is added. |
| Returns | |
|---|---|
| A nested object of type `policy_state` containing properly initialized Tensors. | |
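A usage sketch of the generic call pattern; for a stateless policy such as this one the returned state is expected to be empty, but threading it through keeps driver loops uniform:

```python
policy_state = policy.get_initial_state(batch_size=2)
action_step = policy.action(time_step, policy_state)
next_policy_state = action_step.state   # feed into the next action() call
```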
update
```python
update(
    policy,
    tau: float = 1.0,
    tau_non_trainable: Optional[float] = None,
    sort_variables_by_name: bool = False
) -> tf.Operation
```
Update the current policy with another policy.
This would include copying the variables from the other policy.
| Args | |
|---|---|
| `policy` | Another policy this policy can update from. |
| `tau` | A float scalar in [0, 1]. When `tau` is 1.0 (the default), a hard update is performed. This is used for trainable variables. |
| `tau_non_trainable` | A float scalar in [0, 1] for non-trainable variables. If None, the value of `tau` is used. |
| `sort_variables_by_name` | A bool; when True, the variables are sorted by name before the update. |
| Returns | |
|---|---|
| A TF op that performs the update. | |
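A usage sketch; `other_policy` is a hypothetical second policy built with the same specs (and therefore the same variable structure) as `policy`:

```python
# Hard update: copy the other policy's variables verbatim.
update_op = policy.update(other_policy, tau=1.0)

# Soft update: blend 10% of the other policy's trainable variables
# into the current values (Polyak-style averaging).
soft_update_op = policy.update(other_policy, tau=0.1)
```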