This agent trains ranking policies. The policy has a scoring network used to
score items; some of these items are then selected based on their scores and
their similarity to each other. The agent receives feedback based on which item
in the recommendation list was interacted with. The agent assumes either a
score_vector or a cascading feedback framework. In the former case, the
feedback is a vector containing a score for every item in the slots. In the
latter case, if the kth item was clicked, then the items up to position k-1
receive a score of -1, the kth item receives a score based on a feedback value,
and the rest of the items receive a feedback of 0. The task of the agent is to
train the scoring network to estimate these scores.
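The cascading scoring rule above can be sketched as a small standalone helper. This is an illustrative NumPy sketch, not the library's implementation; the function name and signature are hypothetical.

```python
import numpy as np

def cascading_scores(num_slots, clicked_index, click_value):
    """Illustrative sketch of cascading feedback scoring (hypothetical helper).

    Items before the clicked position are assumed seen and skipped: score -1.
    The clicked item receives the feedback value; items after it receive 0.
    """
    scores = np.zeros(num_slots, dtype=np.float32)
    scores[:clicked_index] = -1.0
    scores[clicked_index] = click_value
    return scores

# A click on the third slot (index 2) of a 5-slot list with feedback value 3.0:
print(cascading_scores(5, 2, 3.0))  # [-1. -1.  3.  0.  0.]
```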
The observation the agent ingests contains the global features and the features
of the items in the recommendation slots. The item features are stored in the
per_arm part of the observation, in the order in which they were recommended.
Since this ordered list of items expresses what action was taken by the policy,
the action value of the trajectory is not used by the agent.
Note the difference between the per-arm part of the observation received by the policy and by the agent: while the agent receives the items in the recommendation slots (as explained above), the policy receives the items that are available for recommendation. The user is responsible for converting the observation to the format required by the agent.
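The layout described above can be sketched as a nested observation with a global part and a per-arm part. The key names, shapes, and dimensions below are illustrative assumptions, not the exact spec the library enforces.

```python
import numpy as np

# Hypothetical dimensions for illustration only.
batch_size = 2   # number of trajectories in the batch
num_slots = 3    # number of recommendation slots
global_dim = 4   # size of the global (user/context) feature vector
item_dim = 5     # size of each item's feature vector

# The agent-side observation: the per-arm part holds the features of the
# items that were actually placed in the slots, in recommendation order.
observation = {
    'global': np.random.rand(batch_size, global_dim).astype(np.float32),
    'per_arm': np.random.rand(batch_size, num_slots, item_dim).astype(np.float32),
}

print(observation['global'].shape)   # (2, 4)
print(observation['per_arm'].shape)  # (2, 3, 5)
```

On the policy side, the per-arm axis would instead range over all items available for recommendation, which is typically larger than the number of slots; converting between the two layouts is the user's responsibility.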
class FeedbackModel: Enumeration of feedback models.
class RankingAgent: Ranking agent class.
class RankingPolicyType: Enumeration of ranking policy types.
compute_score_tensor_for_cascading(...): Gives scores for all items in a batch.