Module: tf_agents.bandits.environments.ranking_environment

Ranking Python Bandit environment with items as per-arm features.

The observations are drawn using the global_sampling_fn and item_sampling_fn arguments.
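As a rough illustration of how two sampling callables could produce an observation, here is a minimal pure-Python sketch. It does not use the library itself. It assumes each callable returns one feature vector per call and that an observation is a dict keyed by 'global' and 'per_arm'; the helpers make_sampling_fns and draw_observation, and the num_items parameter as used here, are hypothetical names for illustration only.

```python
import random

# Key names taken from the module constants; the dict layout is an assumption.
GLOBAL_KEY = 'global'
PER_ARM_KEY = 'per_arm'


def make_sampling_fns(global_dim, item_dim, seed=0):
    """Hypothetical helper: builds two zero-argument sampling callables."""
    rng = random.Random(seed)

    def global_sampling_fn():
        # One global feature vector per call.
        return [rng.gauss(0.0, 1.0) for _ in range(global_dim)]

    def item_sampling_fn():
        # One item feature vector per call.
        return [rng.gauss(0.0, 1.0) for _ in range(item_dim)]

    return global_sampling_fn, item_sampling_fn


def draw_observation(global_sampling_fn, item_sampling_fn, num_items):
    """Hypothetical helper: one global feature plus num_items item features."""
    return {
        GLOBAL_KEY: global_sampling_fn(),
        PER_ARM_KEY: [item_sampling_fn() for _ in range(num_items)],
    }


global_fn, item_fn = make_sampling_fns(global_dim=3, item_dim=2)
observation = draw_observation(global_fn, item_fn, num_items=4)
```

Here observation['global'] holds a single vector of length 3 and observation['per_arm'] holds four vectors of length 2.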

The user is modeled as follows: the score of an item is calculated as a weighted inner product of the global feature and the item feature. The scores of all items in a recommendation are treated as unnormalized logits of a categorical distribution, from which the user's choice is drawn.
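The scoring step can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the library's implementation: it assumes the weighting is a matrix W of shape (global_dim, item_dim), so that the score is g^T W x, and it normalizes the logits with a softmax before sampling. The function names item_scores, softmax and sample_choice are hypothetical.

```python
import math
import random


def item_scores(global_feature, item_features, weights):
    """Weighted inner product g^T W x for every item x (W is an assumption)."""
    scores = []
    for item in item_features:
        score = sum(
            global_feature[i] * weights[i][j] * item[j]
            for i in range(len(global_feature))
            for j in range(len(item))
        )
        scores.append(score)
    return scores


def softmax(logits):
    """Turns unnormalized logits into categorical probabilities."""
    shift = max(logits)  # Subtract the max for numerical stability.
    exps = [math.exp(logit - shift) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_choice(logits, rng):
    """Samples the index of the chosen item from the categorical distribution."""
    probs = softmax(logits)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


global_feature = [1.0, 0.0]
items = [[2.0, 0.0], [0.0, 3.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # With W = I this is a plain inner product.

scores = item_scores(global_feature, items, identity)  # [2.0, 0.0]
chosen = sample_choice(scores, random.Random(0))
```

Items with higher scores are chosen more often, but every item keeps a nonzero probability because the scores are used as logits rather than ranked deterministically.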

To model diversity and no-click, one can choose one of the following options:
--Do the following trick: every action (a list of recommended items) gets item_dim many extra "ghost actions", represented with unit vectors as item features. If, based on the inner products over all items in the recommendation, one of these ghost items is chosen by the environment's user model, it means there was no suitable candidate in the neighborhood, and thus the user did not click on any of the real items. This relates to diversity: had the item feature space been covered better, the ghost items would have been selected with very low probability.
--Calculate the scores of all items, and if none of them exceeds a given threshold, no item is selected by the user.
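Both no-click options can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the library's code; the helper names with_ghost_items, interpret_choice and no_click_by_threshold are hypothetical, and the sketch assumes the ghost items are appended after the real items.

```python
def with_ghost_items(item_features, item_dim):
    """Appends item_dim unit vectors ("ghost items") to the real items."""
    ghosts = [
        [1.0 if j == k else 0.0 for j in range(item_dim)]
        for k in range(item_dim)
    ]
    return list(item_features) + ghosts


def interpret_choice(chosen_index, num_real_items):
    """Returns the clicked position, or None if a ghost item was chosen."""
    if chosen_index < num_real_items:
        return chosen_index
    return None  # A ghost item won: treated as "no click".


def no_click_by_threshold(scores, threshold):
    """True if no score exceeds the threshold, meaning no item is selected."""
    return all(score <= threshold for score in scores)


real_items = [[0.5, 0.5]]
extended = with_ghost_items(real_items, item_dim=2)
# extended == [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]]

clicked = interpret_choice(chosen_index=2, num_real_items=len(real_items))
# clicked is None, because index 2 points at a ghost item.

skipped = no_click_by_threshold([0.1, 0.2], threshold=0.5)
# skipped is True, because no score exceeds 0.5.
```

In the ghost-item variant the extended item list would be scored and sampled like any other recommendation, so the no-click probability depends on how well the real items cover the feature space. In the threshold variant the no-click decision is deterministic given the scores.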


class ClickModel: Enumeration of user click models.

class FeedbackModel: Enumeration of feedback models.

class RankingPyEnvironment: Stationary Stochastic Bandit environment with per-arm features.

GLOBAL_KEY 'global'
PER_ARM_KEY 'per_arm'