Neural LinUCB Policy.
Inherits From: TFPolicy
tf_agents.bandits.policies.neural_linucb_policy.NeuralLinUCBPolicy(
    encoding_network: tf_agents.typing.types.Network,
    encoding_dim: int,
    reward_layer: tf.keras.layers.Dense,
    epsilon_greedy: float,
    actions_from_reward_layer: tf_agents.typing.types.Bool,
    cov_matrix: Sequence[tf_agents.typing.types.Float],
    data_vector: Sequence[tf_agents.typing.types.Float],
    num_samples: Sequence[tf_agents.typing.types.Int],
    time_step_spec: tf_agents.typing.types.TimeStep,
    alpha: float = 1.0,
    emit_policy_info: Sequence[Text] = (),
    emit_log_probability: bool = False,
    accepts_per_arm_features: bool = False,
    distributed_use_reward_layer: bool = False,
    observation_and_action_constraint_splitter: Optional[types.Splitter] = None,
    name: Optional[Text] = None
)
Applies LinUCB on top of an encoding network. Since LinUCB is a linear method, the encoding network is used to capture the non-linear relationship between the context features and the expected rewards. The policy starts with exploration based on epsilon greedy and then switches to LinUCB for exploring more efficiently.
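The action choice described above can be illustrated with a minimal NumPy sketch. This is not the library's implementation: the helper names `linucb_scores` and `choose_action` are hypothetical, the encoding is assumed to be a plain vector already produced by the encoding network, and the identity ridge term added to each covariance matrix is an assumption made here to keep the matrix invertible. The argument names `cov_matrix`-style inputs, `alpha`, `epsilon_greedy`, and `actions_from_reward_layer` mirror the constructor arguments only loosely.

```python
import numpy as np


def linucb_scores(encoding, cov_matrices, data_vectors, alpha=1.0):
    """Upper confidence bound per arm for one encoded context.

    encoding:     array of shape [d], output of the encoder.
    cov_matrices: list of [d, d] arrays, one per arm (sum of z z^T).
    data_vectors: list of [d] arrays, one per arm (sum of reward * z).
    """
    d = encoding.shape[0]
    scores = []
    for a_mat, b_vec in zip(cov_matrices, data_vectors):
        # Assumption: an identity ridge term keeps the matrix invertible.
        a_inv = np.linalg.inv(a_mat + np.eye(d))
        theta = a_inv @ b_vec                         # ridge-regression weights
        mean = theta @ encoding                       # estimated reward
        width = np.sqrt(encoding @ a_inv @ encoding)  # confidence width
        scores.append(mean + alpha * width)
    return np.array(scores)


def choose_action(encoding, reward_layer_values, cov_matrices, data_vectors,
                  actions_from_reward_layer, epsilon_greedy, alpha, rng):
    """Picks an arm with epsilon-greedy or with LinUCB, depending on the flag."""
    if actions_from_reward_layer:
        # Early phase: epsilon-greedy over the reward layer's predictions.
        if rng.random() < epsilon_greedy:
            return int(rng.integers(len(reward_layer_values)))
        return int(np.argmax(reward_layer_values))
    # Later phase: LinUCB on top of the encoded context.
    scores = linucb_scores(encoding, cov_matrices, data_vectors, alpha)
    return int(np.argmax(scores))
```

In this sketch a larger `alpha` widens the confidence term, so arms with few observations are favored more strongly; with `alpha=0.0` the choice reduces to the greedy estimate.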
This policy supports both the global-only observation model and the global and per-arm model:
-- In the global-only case, there is one single observation per time step, and every arm has its own reward estimation function.
-- In the per-arm case, all arms receive individual observations, and the reward estimation function is identical for all arms.
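The difference between the two observation models can be sketched with linear estimators standing in for the real reward functions. Everything here is illustrative: the function names are hypothetical, and concatenating the global observation with each arm's features is an assumption about how the per-arm inputs are combined, not a statement about the library's internals.

```python
import numpy as np


def global_only_estimates(observation, per_arm_weights):
    """One shared observation, one weight vector per arm.

    observation:     array of shape [d].
    per_arm_weights: array of shape [num_arms, d].
    """
    return per_arm_weights @ observation


def per_arm_estimates(global_obs, arm_obs, shared_weights):
    """One observation per arm, a single weight vector shared by all arms.

    global_obs:     array of shape [d_global].
    arm_obs:        array of shape [num_arms, d_arm].
    shared_weights: array of shape [d_global + d_arm].
    """
    num_arms = arm_obs.shape[0]
    # Assumption: the global part is repeated and concatenated to each arm.
    tiled = np.tile(global_obs, (num_arms, 1))
    features = np.concatenate([tiled, arm_obs], axis=1)
    return features @ shared_weights
```

In the first function the number of arms is fixed by the shape of the weights; in the second it is set by the number of rows in `arm_obs`, since the same weights score every arm.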
Reference:
Carlos Riquelme, George Tucker, Jasper Snoek,
Deep Bayesian Bandits Showdown: An Empirical Comparison of Bayesian Deep
Networks for Thompson Sampling, ICLR 2018.
Methods
action
action(
    time_step: tf_agents.trajectories.TimeStep,
    policy_state: tf_agents.typing.types.NestedTensor = (),
    seed: Optional[types.Seed] = None
) -> tf_agents.trajectories.PolicyStep
Generates next action given the time_step and policy_state.
| Args | |
|---|---|
| time_step | A TimeStep tuple corresponding to time_step_spec(). | 
| policy_state | A Tensor, or a nested dict, list or tuple of Tensors representing the previous policy_state. | 
| seed | Seed to use if action performs sampling (optional). | 
| Returns | |
|---|---|
| A PolicyStep named tuple containing: action: An action Tensor matching the action_spec. state: A policy state tensor to be fed into the next call to action. info: Optional side information such as action log probabilities. | 
| Raises | |
|---|---|
| RuntimeError | If subclass __init__ didn't call super().__init__. | 
| ValueError or TypeError | If validate_args is True and inputs or outputs do not match time_step_spec, policy_state_spec, or policy_step_spec. | 
distribution
distribution(
    time_step: tf_agents.trajectories.TimeStep,
    policy_state: tf_agents.typing.types.NestedTensor = ()
) -> tf_agents.trajectories.PolicyStep
Generates the distribution over next actions given the time_step.
| Args | |
|---|---|
| time_step | A TimeStep tuple corresponding to time_step_spec(). | 
| policy_state | A Tensor, or a nested dict, list or tuple of Tensors representing the previous policy_state. | 
| Returns | |
|---|---|
| A PolicyStep named tuple whose action field holds the distribution over next actions. | 
| Raises | |
|---|---|
| ValueError or TypeError | If validate_args is True and inputs or outputs do not match time_step_spec, policy_state_spec, or policy_step_spec. | 
get_initial_state
get_initial_state(
    batch_size: Optional[types.Int]
) -> tf_agents.typing.types.NestedTensor
Returns an initial state usable by the policy.
| Args | |
|---|---|
| batch_size | Tensor or constant: size of the batch dimension. Can be None, in which case no batch dimension is added. | 
| Returns | |
|---|---|
| A nested object of type policy_state containing properly initialized Tensors. | 
update
update(
    policy,
    tau: float = 1.0,
    tau_non_trainable: Optional[float] = None,
    sort_variables_by_name: bool = False
) -> tf.Operation
Update the current policy with another policy.
This would include copying the variables from the other policy.
| Args | |
|---|---|
| policy | Another policy it can update from. | 
| tau | A float scalar in [0, 1]. When tau is 1.0 (the default), we do a hard update. This is used for trainable variables. | 
| tau_non_trainable | A float scalar in [0, 1] for non_trainable variables. If None, will copy from tau. | 
| sort_variables_by_name | A bool, when True would sort the variables by name before doing the update. | 
| Returns | |
|---|---|
| A TF op to do the update. |
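The role of tau and tau_non_trainable can be illustrated with a small pure-Python sketch of a soft update, where each variable moves toward its counterpart as `(1 - tau) * target + tau * source`. The dict-of-tuples layout and the function name `soft_update` are hypothetical stand-ins for the policies' variable lists, and the sketch returns new values instead of building a TF op.

```python
def soft_update(target, source, tau=1.0, tau_non_trainable=None):
    """Blends source values into target values.

    target, source: dicts mapping a variable name to (value, trainable flag).
    Returns a new dict with the same layout as `target`.
    """
    if tau_non_trainable is None:
        # Non-trainable variables fall back to the same rate as trainable ones.
        tau_non_trainable = tau
    updated = {}
    for name, (target_value, trainable) in target.items():
        source_value, _ = source[name]
        rate = tau if trainable else tau_non_trainable
        # rate == 1.0 copies the source outright (a hard update).
        new_value = (1.0 - rate) * target_value + rate * source_value
        updated[name] = (new_value, trainable)
    return updated
```

With the default `tau=1.0` every value is replaced by the source value; smaller values of tau move the target only part of the way on each call.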