Introduction to Multi-Armed Bandits

View on View source on GitHub Download notebook


Multi-Armed Bandit (MAB) is a Machine Learning framework in which an agent has to select actions (arms) in order to maximize its cumulative reward in the long term. In each round, the agent receives some information about the current state (context), then it chooses an action based on this information and the experience gathered in previous rounds. At the end of each round, the agent receives the reward assiociated with the chosen action.

Perhaps the purest example is the problem that lent its name to MAB: imagine that we are faced with k slot machines (one-armed bandits), and we need to figure out which one has the best payout, while not losing too much money.

Multi-Armed Bandits

Trying each machine once and then choosing the one that paid the most would not be a good strategy: The agent could fall into choosing a machine that had a lucky outcome in the beginning but is suboptimal in general. Instead, the agent should repeatedly come back to choosing machines that do not look so good, in order to collect more information about them. This is the main challenge in Multi-Armed Bandits: the agent has to find the right mixture between exploiting prior knowledge and exploring so as to avoid overlooking the optimal actions.

More practical instances of MAB involve a piece of side information every time the learner makes a decision. We call this side information "context" or "observation".

Multi-Armed Bandits and Reinforcement Learning

Why is there a MAB Suite in the TF-Agents library? What is the connection between RL and MAB? Multi-Armed Bandits can be thought of as a special case of Reinforcement Learning. To quote Intro to RL:

At each time step, the agent takes an action on the environment based on its policy $\pi(a_t|s_t)$, where $s_t$ is the current observation from the environment, and receives a reward $r_{t+1}$ and the next observation $s_{t+1}$ from the environment. The goal is to improve the policy so as to maximize the sum of rewards (return).

In the general RL case, the next observation $s_{t+1}$ depends on the previous state $s_t$ and the action $a_t$ taken by the policy. This last part is what separates MAB from RL: in MAB, the next state, which is the observation, does not depend on the action chosen by the agent.

This similarity allows us to reuse all the concepts that exist in TF-Agents.

  • An environment outputs observations, and responds to actions with rewards.
  • A policy outputs an action based on an observation, and
  • An agent repeatedly updates the policy based on previous observation-action-reward tuples.

The Mushroom Environment

For illustrative purposes, we use a toy example called the "Mushroom Environment". The mushroom dataset (Schlimmer, 1981) consists of labeled examples of edible and poisonous mushrooms. Features include shapes, colors, sizes of different parts of the mushroom, as well as odor and many more.


The mushroom dataset, just like all supervised learning datasets, can be turned into a contextual MAB problem. We use the method also used by Riquelme et al. (2018). In this conversion, the agent receives the features of a mushroom, decides to eat it or not. Eating an edible mushroom results in a reward of +5, while eating a poisonous mushroom will give either +5 or -35 with equal probability. Not eating the mushroom results in 0 reward, independently of the type of the mushroom. The following table summarizes the reward assignments:

           | edible | poisonous
eating it  |     +5 | -35 / +5
leaving it |      0 |        0

The LinUCB Agent

Performing well in a contextual bandit environment requires a good estimate on the reward function of each action, given the observation. One possibility is to estimate the reward function with linear functions. That is, for every action $i$, we are trying to find the parameter $\theta_i\in\mathbb R^d$ for which the estimates

$r_{t, i} \sim \langle v_t, \theta_i\rangle$

are as close to the reality as possible. Here $v_t\in\mathbb R^d$ is the context received at time step $t$. Then, if the agent is very confident in its estimates, it can choose $\arg\max_{1, ..., K}\langle v_t, \theta_k\rangle$ to get the highest expected reward.

As explained above, simply choosing the arm with the best estimated reward does not lead to a good strategy. There are many different ways to mix exploitation and exploration in linear estimator agents, and one of the most famous is the Linear Upper Confidence Bound (LinUCB) algorithm (see e.g. Li et al. 2010). LinUCB has two main building blocks (with some details omitted):

  1. It maintains estimates for the parameters of every arm with Linear Least Squares: $\hat\theta_i\sim X^+_i r_i$, where $X_i$ and $r_i$ are the stacked contexts and rewards of rounds where arm $i$ was chosen, and $()^+$ is the pseudo inverse.
  2. It maintains confidence ellipsoids defined by the inverse covariance $X_i^\top X_i$ for the above estimates.

The main idea of LinUCB is that of "Optimisim in the Face of Uncertainty". The agent incorporates exploration via boosting the estimates by an amount that corresponds to the variance of those estimates. That is where the confidence ellipsoids come into the picture: for every arm, the optimistic estimate is $\hat r_i = \max_{\theta\in E_i}\langle v_t, \theta\rangle$, where $E_i$ is the ellipsoid around $\hat\theta_i$. The agent chooses best looking arm $\arg\max_i\hat r_i$.

Of course the above description is just an intuitive but superficial summary of what LinUCB does. An implementation can be found in our codebase here

What's Next?

If you want to have a more detailed tutorial on our Bandits library take a look at our tutorial for Bandits. If, instead, you would like to start exploring our library right away, you can find it here. If you are even more eager to start training, look at some of our end-to-end examples here, including the above described mushroom environment with LinUCB here.