Online learning and multi-armed bandits: Lecture notes

06 November 2019
Prof. Varun Kanade
Notes taken by Miroslav Gasparek

04 November 2019

References

  • (BC) S. Bubeck and N. Cesa-Bianchi. Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems. link
  • (CL) N. Cesa-Bianchi and G. Lugosi. Prediction, Learning and Games. Cambridge University Press 2006.
  • (LS) T. Lattimore and C. Szepesvári. Bandit Algorithms. link
  • (Sli) Alex Slivkins. Introduction to Multi-Armed Bandits. link

Introduction

  • It is difficult to find algorithms that solve everything...
  • An agent (decision-maker, algorithm, player) interacts with an environment
  • A (finite) set of actions: the agent picks an action, interacts with the environment, receives a reward, and the process repeats
  • The environment can vary over time
  • We can start coming up with different models for agents and environments

Example: A/B Testing

  • Buying flowers online
  • You would like to know where the optimal position of the "buy flowers!" banner is
  • Statistician: Pick 1000 positions for the banner and simply check which gives the best outcomes
  • RL: Consider the problem and reframe it as a game, perhaps try to maximize the reward

Example: Drug Trials

  • Find out if the drug actually works when compared to placebo (control group)
  • If the drug is very effective, you can start shifting patients from the placebo group to the active drug

Multi-armed Bandits

(Stochastic bandits) Action set: $A = \{1, ..., k \}$. In each round $t$:

  1. Algorithm picks action $a_{t} \in A$
  2. Gets reward $r_{t} \in [0,1]$ for the chosen action, $r_{t} \sim D_{a_{t}}$ (for each $a \in A$ there exists a distribution $D_{a}$ over $[0, 1]$)

Suppose we play for $T$ rounds. "Pick the action with maximum expected value": $\mu^{*}$ is the expected reward of the optimal action, so the optimal strategy gets $T\mu^{*}$ in expectation.

$Regret(Alg) = T\mu^{*} - \mathbb{E}[Reward(Alg)]$

Explore-then-Exploit algorithm (pseudocode):

  1. For the first $Nk$ steps, try each action $N$ times
  2. For $t = Nk+1, ..., T$, play the action with the highest empirical mean from the exploration phase (a code sketch follows below)
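
A minimal sketch of this explore-then-exploit scheme in Python; the Bernoulli arms and the reward oracle `pull(a)` are illustrative assumptions, not part of the lecture:

```python
import random

def explore_then_exploit(pull, k, N, T):
    """Explore-then-exploit sketch: pull(a) returns a reward in [0, 1]; assumes N*k <= T."""
    sums, counts = [0.0] * k, [0] * k
    total_reward = 0.0
    # Exploration phase: try each arm N times (N*k rounds in total).
    for a in range(k):
        for _ in range(N):
            r = pull(a)
            sums[a] += r
            counts[a] += 1
            total_reward += r
    # Exploitation phase: commit to the arm with the highest empirical mean.
    best = max(range(k), key=lambda a: sums[a] / counts[a])
    for _ in range(T - N * k):
        total_reward += pull(best)
    return total_reward

# Example with (hypothetical) Bernoulli arms.
means = [0.3, 0.5, 0.7]
pull = lambda a: 1.0 if random.random() < means[a] else 0.0
print(explore_then_exploit(pull, k=3, N=100, T=10_000))
```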

Hoeffding's Inequality:
Let $X_1, ..., X_n$ be i.i.d. random variables with $X_i \in [0, 1]$ and $\mathbb{E}[X_i] = \mu$;

then $\mathbb{P}(\mid \frac{1}{n}\sum_{i} X_i - \mu \mid > t) \leq 2e^{-2nt^2}$

Question: Are assumptions stronger than in the central limit theorem?

Fix some arm $a$ and let $\epsilon_{a}$ be the bad event that its empirical mean is far from its true mean. By Hoeffding (with $t = \sqrt{\frac{2 \log T}{N}}$): \begin{equation} \epsilon_{a}: \quad \mathbb{P}\left( \mid \hat{\mu}(a) - \mu(a) \mid \geq \sqrt{\frac{2 \log T}{N}} \right) \leq 2e^{-2N \frac{2 \log T}{N}} = \frac{2}{T^{4}} \end{equation} Then $\mathbb{P}(\epsilon_a) \leq \frac{2}{T^4}$

$\mathbb{P}(\cap_{a} \epsilon^c_{a}) = \mathbb{P}((\cup_a \epsilon_a)^c) = 1 - \mathbb{P}(\cup_a \epsilon_a) \geq 1 - \sum_a \mathbb{P}(\epsilon_a) \geq 1 - \frac{2k}{T^4} \geq 1 - \frac{2}{T^3}$ (using $k \leq T$)

Let $a^*$ be an arm with optimal expected reward, let $a$ be the arm picked by the algorithm after exploration, and suppose the good event (all empirical means inside their confidence intervals) occurs.

then $\mu(a) + \sqrt{\frac{2 log \ T}{N}} \geq \hat{\mu}(a) \geq \hat{\mu}(a^*) \geq \mu^* - \sqrt{\frac{2 log \ T}{N}}$

Then

$Regret \leq Nk + 2T \sqrt{\frac{2 \log T}{N}}$

where $Nk$ bounds the regret of the exploration phase and the second term bounds the regret of the exploitation phase. How should we pick $N$?

$N = \left(\frac{2T \sqrt{2log \ T}}{k} \right)^{2/3}$

$Regret = \mathcal{O}(k^{1/3}T^{2/3} (\log T)^{1/3})$

The regret grows sublinearly in $T$, which means that we are actually learning; if it grew linearly with $T$, we would not be learning anything.

"Best arm identification" (for the fixed time)...

$\epsilon$-greedy:

For each round $t = 1, ...,T,$ do:

  • with probability $\epsilon_t$: pick an arm uniformly at random
  • else: pick the arm with the highest empirical mean
  • here $\epsilon_t \sim t^{-1/3}$ (a code sketch follows below)
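
A short sketch of $\epsilon$-greedy under the same assumptions (hypothetical reward oracle `pull(a)`, exploration probability $\epsilon_t \propto t^{-1/3}$):

```python
import random

def epsilon_greedy(pull, k, T):
    """Epsilon-greedy sketch with exploration probability epsilon_t ~ t^(-1/3)."""
    sums, counts = [0.0] * k, [0] * k
    for t in range(1, T + 1):
        eps_t = min(1.0, t ** (-1 / 3))
        if random.random() < eps_t or 0 in counts:
            a = random.randrange(k)  # explore: pick an arm uniformly at random
        else:
            a = max(range(k), key=lambda i: sums[i] / counts[i])  # exploit
        sums[a] += pull(a)
        counts[a] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]  # empirical means
```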

What were the problems with the algorithm? Perhaps we do not exploit the good arms more often?

Successive Elimination: Initially all arms are active

At each phase:

  • Try all active arms
  • Deactivate all arms $a$, such that $\exists a'$ such that $UCB(a) < LCB(a')$
  • Repeat until end of time

The confidence bounds come from Hoeffding's inequality, with radius $\sqrt{2 \log T / n_t(a)}$ around the empirical mean (a sketch follows below).
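
A sketch of successive elimination with this Hoeffding-based radius; `pull(a)` is again a hypothetical reward oracle, and a phase may overshoot $T$ by at most $k$ pulls (fine for a sketch):

```python
import math

def successive_elimination(pull, k, T):
    """Successive elimination with Hoeffding confidence radius sqrt(2 log T / n)."""
    active = list(range(k))
    sums, counts = [0.0] * k, [0] * k
    t = 0
    while t < T:
        for a in active:  # one phase: pull every active arm once
            sums[a] += pull(a)
            counts[a] += 1
            t += 1
        rad = {a: math.sqrt(2 * math.log(T) / counts[a]) for a in active}
        ucb = {a: sums[a] / counts[a] + rad[a] for a in active}
        lcb = {a: sums[a] / counts[a] - rad[a] for a in active}
        best_lcb = max(lcb.values())
        # Deactivate arm a if some arm a' has LCB(a') > UCB(a).
        active = [a for a in active if ucb[a] >= best_lcb]
    return active
```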

UCB Algorithm

  • Always pick $a \in \underset{a}{arg \ max} UCB_t(a)$
  • $UCB_t(a) = \hat{\mu}_t(a) + r_t(a)$
  • $r_t(a) = \sqrt{\frac{2 \log T}{n_t(a)}}$
  • $n_t(a) =$ the number of times arm $a$ has been pulled up to time $t$ (a sketch of UCB1 follows this list)
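
A sketch of UCB1 as described above, with a hypothetical reward oracle `pull(a)`; the initial round-robin over the arms (so that every $n_t(a) > 0$) is an implementation detail assumed here, not spelled out in the notes:

```python
import math

def ucb1(pull, k, T):
    """UCB sketch: pull the arm maximizing empirical mean + sqrt(2 log T / n_t(a))."""
    sums, counts = [0.0] * k, [0] * k
    for t in range(1, T + 1):
        if t <= k:
            a = t - 1  # initialization: pull each arm once so that n_t(a) > 0
        else:
            a = max(range(k),
                    key=lambda i: sums[i] / counts[i]
                    + math.sqrt(2 * math.log(T) / counts[i]))
        sums[a] += pull(a)
        counts[a] += 1
    return [s / c for s, c in zip(sums, counts)]  # empirical means
```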

How suboptimal is it to pick arm $a$? Define the gap $\Delta(a) = \mu^* - \mu(a)$.

$a^*$ is optimal

Suppose the arm picked at time $t$, $a_t$, is a suboptimal arm. Assume the good event holds, i.e. every empirical mean lies inside its confidence interval. Then the following must hold.

Then: \begin{equation} \mu(a_t) + 2r_t(a_t) \geq \hat{\mu}(a_t) + r_t(a_t) \geq \mu(a^*) \end{equation}

The quantity $\hat{\mu}(a_t) + r_t(a_t)$ is $UCB_t(a_t)$

Also, \begin{equation} \Delta(a_t) = \mu(a^*) - \mu(a_t) \leq 2 \sqrt{\frac{2 \log T}{n_t(a_t)}} \end{equation}

\begin{align} Regret &= \sum^T_{t=1} \Delta(a_t) \leq 2\sqrt{2\log T} \sum^{k}_{a=1} \sum^{n_T(a)}_{s=1}\frac{1}{\sqrt{s}} \\ & \leq 4\sqrt{2 \log T} \sum_{a=1}^k\sqrt{n_T(a)} \\ & \leq 4\sqrt{2 \log T} \, k \sqrt{\frac{T}{k}} = \mathcal{O}(\sqrt{Tk \log T}) \end{align}

NB: the second step uses $\sum_{s=1}^{n} \frac{1}{\sqrt{s}} \leq 2\sqrt{n}$, and the last step uses Jensen's inequality for the concave function $\sqrt{\cdot}$: $\frac{1}{k}\sum_a \sqrt{n_T(a)} \leq \sqrt{\frac{1}{k}\sum_a n_T(a)} = \sqrt{\frac{T}{k}}$.

\begin{align} n_T(a) & \leq c \frac{log \ T}{(\Delta(a))^2} \\ Regret &= \sum_{a \ suboptimal}n_T(a) \Delta(a) \leq c\sum_a \Delta(a) \frac{log \ T}{\Delta(a)^2} \\ &= c \ log \ T \sum_{a \ suboptimal} \frac{1}{\Delta(a)} \end{align}

05 November 2019

Explore then Exploit: $Regret = \mathcal{O}(T^{2/3}(k log \ T)^{1/3})$

$\epsilon$-greedy: $Regret = \mathcal{O}(T^{2/3}(k log \ T)^{1/3})$

Successive elimination: $Regret = \mathcal{O}(\sqrt{Tk \ log \ T}) \land \mathcal{O}(log \ T \sum_a \frac{1}{\Delta(a)})$

UCB1: $Regret = \mathcal{O}(\sqrt{Tk \ log \ T}) \land \mathcal{O}(log \ T \sum_a \frac{1}{\Delta(a)})$

Random Coin with $\epsilon$ bias

\begin{align} RC_\epsilon = \begin{cases} P(H): 1/2 + \epsilon/2 \\ P(T): 1/2 - \epsilon/2 \end{cases} \end{align}

KL divergence

$p, q$ probability distributions over $\Omega$

\begin{align} KL(p \| q) &= \sum_{x \in \Omega} p(x) ln \left( \frac{p(x)}{q(x)} \right) \\ &= \mathbb{E}_{x \sim p} \left[ln \left( \frac{p(x)}{q(x)} \right) \right] \end{align}

Properties:

i. $KL(p \| q) \geq 0$ (equality if $p=q$)

ii. Chain rule: if $\Omega = \Omega_1 \times \dots \times \Omega_n$ and $p = p_1 \times \dots \times p_n$, $q = q_1 \times \dots \times q_n$ are product distributions, then

$KL(p \| q) = \sum_i KL(p_i \| q_i)$

iii. (Pinsker's inequality) $2(p(A) - q(A))^2 \leq KL(p\|q) \ \forall A$, equivalently

$\forall A: \ \mid p(A) - q(A) \mid \leq \sqrt{\frac{1}{2}KL(p\|q)}$

iv. For $0 < \epsilon < \frac{1}{12}$:

$KL(RC_{\epsilon} \| RC_0) \leq 2\epsilon^2$

$KL(RC_0 \| RC_\epsilon) \leq \epsilon^2$
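
A quick numerical sanity check of property (iv) for the biased coins defined above (not part of the lecture):

```python
import math

def kl_bernoulli(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q), in nats."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

for eps in [0.01, 0.05, 0.08]:
    p_eps, p0 = 0.5 + eps / 2, 0.5
    assert kl_bernoulli(p_eps, p0) <= 2 * eps ** 2  # KL(RC_eps || RC_0) <= 2 eps^2
    assert kl_bernoulli(p0, p_eps) <= eps ** 2      # KL(RC_0 || RC_eps) <= eps^2
```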

Lower bound: fix $T$ and $k$; for any bandit algorithm, there exists a problem instance on which $\mathbb{E}[Regret] \geq c\sqrt{kT}$

For arm $a$: define the instance \begin{align} I_a: \begin{cases} \mu_a = \frac{1}{2} + \frac{\epsilon}{2}, \quad \epsilon = \sqrt{\frac{k}{T}} \\ \mu_i = \frac{1}{2}, \quad i\neq a \end{cases} \end{align}

In the "bandits with predictions" problem, the algorithm additionally outputs a guess for the best arm after $T$ rounds; ideally $\mathbb{P}[\text{prediction after } T \text{ rounds is correct} \mid I_a] \geq 0.99$ for every arm $a$.

Lemma: For "bandits with predictions", when $T\leq \frac{ck}{\epsilon^2}$ for some constant $c$, any deterministic algorithm has the following property:

there exist at least $k/3$ arms $a$ such that $\mathbb{P}[\text{prediction is } a \mid I_a] \leq 3/4$

Corollary: If the instance is picked uniformly at random, then $\mathbb{P}[\text{prediction is incorrect}] \geq \frac{1}{12}$

Proof:

Let $\epsilon = \sqrt{\frac{ck}{T}}$

Fix any round $t \leq T$

$\mathbb{P}[\text{arm picked at time } t \text{ is not the best arm}] \geq 1/12$ (by the corollary, viewing the arm pulled at time $t$ as a prediction)

$\Delta(a_t) = \mu^* - \mu(a_t)$

$\mathbb{E}[\Delta(a_t)] \geq \epsilon/24$ (the gap of a wrong arm is $\epsilon/2$, incurred with probability at least $1/12$)

$\mathbb{E}[Regret] = \sum_{t=1}^{T} \mathbb{E}[\Delta(a_t)] \geq \frac{T\epsilon}{24} \geq \hat{c}\sqrt{kT}$

Now consider $k = 2$, with instances $I_1, I_2$

$\Omega$ is a $2 \times T$ table of rewards, $\Omega = \{ 0,1 \}^{2T}$

$A:$ "Algorithm outputs arm 1"

$P_1(A) \geq 3/4, P_2(A) \leq 1/4$ Then

$KL(P_1 \| P_2) = \sum^{2}_{a=1} \sum^{T}_{t=1} KL(P_1^{a,t} \| P_2^{a,t}) \leq 4\epsilon^2 T$

but by Pinsker, $\mid P_1(A) - P_2(A) \mid \leq \sqrt{\frac{1}{2}KL(P_1 \| P_2)} \leq \epsilon \sqrt{2T}$; choosing $\epsilon = \frac{1}{4\sqrt{T}}$ makes this at most $\frac{1}{2\sqrt{2}} < \frac{1}{2}$, contradicting $P_1(A) \geq 3/4$ and $P_2(A) \leq 1/4$

Non-stochastic Multi-armed bandit:

  • DM picks one of $K$ arms at each time-step.
  • Environment assigns a reward to every arm, $r_{t,a} \in [0,1]$; the algorithm only observes the reward of the arm it pulled
  • $Regret = \underset{a}{max} \sum_{t=1}^T r_{t,a} - Reward(Alg)$; the expected regret is $\mathbb{E}[Regret] = \mathcal{O}(\sqrt{kT \log T})$. The bound also scales with the upper bound on the rewards, i.e. for $r_{t,a} \in [0, M]$, $M \in \mathbb{R}^+$, we have $\mathbb{E}[Regret] = \mathcal{O}(M\sqrt{kT \log T})$

Next...

Algorithm:

$K \subseteq \mathbb{R}^n, x_t \in K$

$Loss(Algorithm) = \sum_t f_t(x_t)$

$Regret = \sum_t f_t (x_t) - \underset{x \in K}{min} \sum_t f_t(x)$

Environment:

$f_t:K \rightarrow [0,1]$

Learning with expert advice

You have $n$ "experts" (e.g. each telling you which stock to pick); the decision set is the simplex $\Delta_n = \{ x \in \mathbb{R}^n_{\geq 0} \mid \sum_i x_i = 1 \}$

Algorithm: pick $x_t \in \Delta_n$ (probability distribution of $n$ experts)

Each expert has a loss $l_{t,i} \in [0,1]$

The algorithm's loss is then $\sum_i x_{t,i}l_{t,i}$; in this (full-information) setting the algorithm observes the entire vector $l_t$

Regret is simply the loss of the algorithm minus the loss of the best expert: $Regret = \sum_t l_t \cdot x_t - \underset{i}{min} \sum_{t=1}^{T} l_{t,i}$. How can we minimize this?

One natural (but wrong) strategy, at time $t$:

$i \in \underset{i}{arg \ min} \ \sum_{s=1}^{t-1} l_{s,i}$

$x_{t,i} = 1$ and $x_{t,j} = 0$ for $j \neq i$

This is called the Follow the Leader (FTL) algorithm, and it does not work: an adversary can force it to incur linear regret.

The correct choice is... at time $t$:

$x_t = softmax \left(-\eta \sum_{s=1}^{t-1} l_{s}\right)$, i.e.

$x_{t,i} \propto exp\left(- \eta \sum_{s=1}^{t-1} l_{s,i}\right)$, also called Follow the Regularized (or Perturbed) Leader, Hedge, Multiplicative Weights Update, or Mirror Descent
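
A sketch of this exponential-weights (Hedge / MWU) update; `losses` is a hypothetical list of per-round loss vectors $l_t \in [0,1]^n$ and `eta` is the step size:

```python
import math

def hedge(losses, n, eta):
    """Exponential weights: play x_t proportional to exp(-eta * cumulative loss of expert i)."""
    weights = [1.0] * n
    total_loss = 0.0
    for l_t in losses:                  # l_t is the full loss vector observed at round t
        Z = sum(weights)
        x_t = [w / Z for w in weights]  # the distribution played at round t
        total_loss += sum(x * l for x, l in zip(x_t, l_t))
        weights = [w * math.exp(-eta * l) for w, l in zip(weights, l_t)]
    return total_loss
```

With $\eta$ of order $\sqrt{\log n / T}$ this matches the $\mathcal{O}(\sqrt{T \log n})$ regret bound derived in the proof below.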

Proof:

Let the weight on expert $i$ at time $t=1$ be $w_{1,i} = 1$. Then we update the weights as

$w_{t+1,i} = w_{t,i} exp(-\eta l_{t,i})$

$Z_{t+1} = \sum_{i}w_{t+1,i}$

\begin{align} \frac{Z_{t+1}}{Z_t} &= \sum_i \frac{w_{t,i}}{Z_t} exp(-\eta l_{t,i}) \\ &\leq \sum_i x_{t,i} (1+(e^{-\eta}-1)l_{t,i}) \\ &= 1 + (e^{-\eta}-1)\left(\sum_i x_{t,i}l_{t,i}\right) \\ &\leq exp\left((e^{-\eta}-1)(x_{t} \cdot l_t)\right) \\ \prod_{t=1}^{T} \frac{Z_{t+1}}{Z_{t}} &\leq exp \left( (e^{-\eta}-1)\, loss(Alg) \right) \end{align}

(the second line uses convexity, $e^{-\eta z} \leq 1 + (e^{-\eta}-1)z$ for $z \in [0,1]$; the fourth uses $1+y \leq e^y$)

Then ($i^*$ is the best expert):

\begin{align} \frac{Z_{T+1}}{n} \geq \frac{w_{T+1, i^*}}{n} &= \frac{exp(-\eta \, loss(i^{*}))}{n} \\ - \eta \, loss(i^*) - \log n &\leq (e^{-\eta}-1)\, loss(Alg) \\ loss(Alg) - loss(i^*) &\leq \frac{e^{-\eta}-1+\eta}{\eta}\, loss(Alg) + \frac{\log n}{\eta} \\ &\leq \frac{\eta}{2}\, loss(Alg) + \frac{\log n}{\eta} \\ &\leq \frac{\eta T}{2} + \frac{\log n}{\eta} \leq 2\sqrt{T \log n} \end{align}

(taking logs and using $Z_1 = n$; then $e^{-\eta} \leq 1 - \eta + \frac{\eta^2}{2}$, $loss(Alg) \leq T$, and finally $\eta = \sqrt{\frac{2\log n}{T}}$)

06 November 2019

Online learning with experts

$n$ experts

  • Play $x_t \in \Delta_n$
  • Observe $l_t \in [0,1]^n$ and incur loss $l_t \cdot x_t$; $Regret = \sum_t l_t \cdot x_t - \underset{i}{min} \sum_t l_{t,i}$

  • MWUA: $Regret = \mathcal{O}(\sqrt{T \log n})$

  • AdaBoost (an instance of online learning with experts)

  • Weak learning guarantee:

    • Binary classification:
    • training data -> weak learner -> classifier with accuracy at least $\frac{1}{2} + \gamma$ (i.e. slightly better than random, e.g. at least 51%)

Let us have a set of data points $x_1, ..., x_n$ and classifiers $c_1, c_2, ... ,c_T$, where $c_t$ is the classifier produced at round $t$ and \begin{align} l_{t,i} &= \begin{cases} 1, \quad \text{if $c_t$ is correct on $x_i$} \\ 0, \quad \text{otherwise} \end{cases} \end{align}

Choose a distribution $p_t$ on $\{ x_1, ..., x_n\}$ (the data points play the role of experts); the weak learning guarantee gives $l_t \cdot p_t \geq \frac{1}{2} + \gamma$

$\sum_{t=1}^T l_t p_t \geq \left( \frac{1}{2} + \gamma \right)T$

$\sum_{t=1}^{T} l_t \cdot p_t - \sum_{t=1}^{T} l_{t,i} \leq 2\sqrt{T \log n} \quad \forall i$

$\left( \frac{1}{2}+\gamma \right)T - \sum_{t=1}^T l_{t,i} \leq 2\sqrt{T \ log \ n}$

$\sum_{t=1}^T l_{t,i} \geq \frac{T}{2} + \gamma T - 2\sqrt{T \log n} > \frac{T}{2}$, provided $T > \frac{4 \log n}{\gamma^2}$

$MAJORITY(c_1, c_2,..., c_T)$ then classifies all $n$ examples correctly, since each $x_i$ is classified correctly by more than half of the classifiers.

A simple weak learner: take the $k$ features in the data, threshold each one to get a 0/1 classifier, and check whether some threshold does better than a random guess.

Algorithm:

  • Just a remark: stop when you get to accuracy $50\%$; in that case, simply take the majority of the classifiers to classify correctly (this needs some explanation and thinking). A sketch of the resulting boosting loop follows.
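
A sketch of the boosting loop this argument suggests: run multiplicative weights over the $n$ training examples, call a weak learner on the reweighted data each round, and output the majority vote. The `weak_learner(points, labels, p_t)` interface is a hypothetical placeholder:

```python
import math

def boost(points, labels, weak_learner, T, eta):
    """Boosting via multiplicative weights over the n training examples."""
    n = len(points)
    weights = [1.0] * n
    classifiers = []
    for _ in range(T):
        Z = sum(weights)
        p_t = [w / Z for w in weights]           # distribution over the examples
        c_t = weak_learner(points, labels, p_t)  # accuracy >= 1/2 + gamma w.r.t. p_t
        classifiers.append(c_t)
        # l_{t,i} = 1 if c_t is correct on x_i: correctly classified examples are
        # downweighted, so hard examples get relatively more weight next round.
        weights = [w * math.exp(-eta * (1.0 if c_t(x) == y else 0.0))
                   for w, x, y in zip(weights, points, labels)]

    def majority(x):  # final classifier: majority vote of c_1, ..., c_T
        votes = sum(1 if c(x) == 1 else -1 for c in classifiers)
        return 1 if votes > 0 else 0

    return majority
```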

Linear programming (feasibility) problem:

$\exists x$ such that $Ax \leq b$?

With this machinery we can find $x$ satisfying the relaxed constraints $Ax \leq b + \delta$.

Let's go through this algorithm (online gradient descent). Start at some $x_1$;

For $t = 1,...T$:

  • incur loss $f_t (x_t)$
  • update $x_{t+1} = x_t - \eta \nabla f_t (x_t)$

Then we can expand the quantity $\lVert x_{t+1} - x^* \rVert^2$...

Then $\lVert x_t - x^* \rVert^2 - \lVert x_{t+1} - x^* \rVert^2 = -\eta^2 \lVert \nabla f_t(x_t) \rVert^2 + 2\eta \langle \nabla f_t(x_t), x_t - x^* \rangle$

The 2-norm of the gradient of the loss is bounded by $n$ (which makes sense; e.g. in the experts setting $\nabla f_t(x) = l_t \in [0,1]^n$). A sketch of the update follows.
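
A minimal sketch of this online gradient descent update; `grad_fs` is a hypothetical list of per-round gradient oracles, and the projection back onto the decision set $K$ is omitted for brevity:

```python
def online_gradient_descent(x1, grad_fs, eta):
    """Online gradient descent: x_{t+1} = x_t - eta * grad f_t(x_t).
    Projection back onto the decision set K is omitted for brevity."""
    x = list(x1)
    iterates = [list(x)]
    for grad_f_t in grad_fs:  # one gradient oracle per round
        g = grad_f_t(x)
        x = [xi - eta * gi for xi, gi in zip(x, g)]
        iterates.append(list(x))
    return iterates
```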

Adversarial Bandits

  • Parameter $\gamma \in (0,1)$
  • Initialize: $w_{1,i} = 1$

  • for $t=1,...,T$:

    • $p_{t,i} = (1-\gamma) \frac{w_{t,i}}{Z_t} + \frac{\gamma}{k}$, where $Z_t = \sum_i w_{t,i}$

    • $a_t \sim p_t$, receive loss $l_{t,a_t}$

    For $j = 1, ..., k$:

    • Build the estimated (importance-weighted) loss vector $\hat{l}_{t}$, so that \begin{align} \hat{l}_{t,j} &= \begin{cases} \frac{l_{t,a_t}}{p_{t,a_t}}, \quad \text{if } j=a_t \\ 0, \quad \text{otherwise} \end{cases} \end{align}

      $w_{t+1,j} = w_{t,j} exp(-\frac{\gamma}{k} \hat{l}_{t,j})$
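
A sketch of this EXP3-style adversarial-bandit algorithm, using the importance-weighted loss estimate from the update above; `get_loss(t, a)` is a hypothetical oracle returning the loss of the pulled arm:

```python
import math
import random

def exp3(get_loss, k, T, gamma):
    """EXP3-style adversarial bandit: exponential weights mixed with uniform exploration."""
    weights = [1.0] * k
    for t in range(T):
        Z = sum(weights)
        p = [(1 - gamma) * w / Z + gamma / k for w in weights]
        a = random.choices(range(k), weights=p)[0]  # a_t ~ p_t
        loss = get_loss(t, a)       # only the pulled arm's loss is observed
        l_hat = loss / p[a]         # importance-weighted loss estimate
        weights[a] *= math.exp(-(gamma / k) * l_hat)
    return weights
```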

(Proof follows)

  • Adaptive adversary: may react to the algorithm's past actions.
  • Oblivious adversary: we do not want to build a stochastic model of the world; in a sense we can think of the adversary as writing out all the future losses ahead of time.
  • We can think of regret as fixing one particular strategy, anchoring it, and comparing the algorithm to it up to an additive term.

Problem: We have a graph where we start at node $s$ and end at node $t$; each edge has a weight, which changes at each time step. How do we find the optimal (shortest) path?

(Can use Dijkstra's algorithm)

We know the graph, but we only see the weights of the edges at the end of each round... How do we solve it?

Objective: Minimize the regret:

$Regret = \sum_t cost_t(\text{path chosen at round } t) - \underset{\text{fixed } s\text{-}t \text{ path } p}{min} \sum_t cost_t(p)$

07 Nov 2019

Online Learning

  • Player: Decision set $k \subseteq \mathbb{R}^n$
  • At $t$: $x_t \in k$
  • Environment: $f_t:k \rightarrow \mathbb{R}$

  • Loss of player at time $t$: $f_t(x_t)$

  • Regret: $\sum_{t=1}^{T} f_t (x_t) - \underset{x^{*} \in k}{min} \sum_{t=1}^{T} f_t(x^*)$

Algorithms

  • Follow the Leader (FTL): $x_t = \underset{x \in k}{arg \ min} \sum_{s=1}^{t-1} f_s(x)$
  • Be the Leader (BTL): "Play $x_{t+1}$ at time $t$"
  • Follow the "Regularized"/"Perturbed" Leader (FTRL/FTPL): $x_t = \underset{x \in k}{arg \ min}\left( \sum_{s=1}^{t-1}f_s(x) + R(x) \right)$, where $R(\cdot)$ is a regularization function (or a random perturbation)
  • (Ex. FTL works if $k$ is convex & $f_t$ are strongly convex)

  • Lemma: Be the Leader has negative regret

  • Proof:
    • $\underset{x \in k}{min} \sum_{t=1}^{T} f_t(x)$ is the best loss in hindsight \begin{align} \underset{x \in k}{min} \sum_{t=1}^{T} f_t(x) &= \sum_{t=1}^T f_t(x_{T+1}) \\ &= f_T(x_{T+1}) + \sum_{t=1}^{T-1}f_t(x_{T+1}) \end{align}
  • ... and then, since $x_{T}$ minimizes $\sum_{t=1}^{T-1}f_t$, we have $\sum_{t=1}^{T-1}f_t(x_{T+1}) \geq \sum_{t=1}^{T-1}f_t(x_{T})$; repeating this argument (induction) gives $\underset{x \in k}{min} \sum_{t=1}^{T} f_t(x) \geq \sum_{t=1}^{T} f_t(x_{t+1})$, i.e. BTL's loss is at most the best loss in hindsight (negative regret).
  • $Regret(FTRL) = \sum_{t=1}^{T} f_t(x_t) - \underset{x \in k}{min} \sum_{t=1}^{T} f_t(x)$; let $x^*$ be the minimizer \begin{align} \sum_{t=0}^{T} f_t(x_{t+1}) \leq \underset{x \in k}{min} \sum_{t=0}^{T} f_t(x) \leq f_0(x^*) + \sum_{t=1}^{T}f_t(x^*) \end{align}
  • $\sum_{t=1}^{T}f_t(x^*)$ is the loss of best in hindsight
\begin{align} Regret(FTRL) &= \sum_{t=1}^T f_t(x_t) - \sum_{t=1}^{T} f_t(x^*) + \sum_{t=0}^{T}f_t(x_{t+1}) - \sum_{t=0}^{T}f_t(x_{t+1}) + f_0(x^*) - f_0(x^*) \\ &\leq \sum_{t=1}^T (f_t(x_t)-f_t(x_{t+1})) + f_0(x^*) - f_0(x_1) \end{align}
  • But the term $f_0(x^*) - f_0(x_1)$ depends only on regularizer, and $\sum_{t=1}^T (f_t(x_t)-f_t(x_{t+1}))$ depends on stability of the algorithm.

  • In FTRL, do we only minimize the regularizer at time $0$? (Yes: $x_1 = \underset{x \in k}{arg \ min} \ R(x)$, and we can treat $f_0 = R$ as a loss played at time $0$.)

  • Lemma: $\sum_{t=0}^T f_t(x_{t+1}) \leq \underset{x \in k}{min} \sum_{t=0}^T f_t(x)$, where $f_0 = R$

  • Lemma: $Regret(FTRL) \leq \sum_{t=1}^T(f_t(x_t) - f_t(x_{t+1})) + R(x^*) - R(x_1)$
  • Remark: Whenever working on simplex, entropy is a good regularizer

  • Experts Problem: $K = \{ x \in \mathbb{R}^{n}_+ \mid \sum_{i}x_{i} = 1 \}, R(x) = -\frac{1}{\eta}H(x), l \in \mathbb{R}_+^n$

  • We aim to solve $\underset{x \in k}{min} \ l \cdot x - \frac{1}{\eta} H(x)$
  • Using the (equality-constrained) Lagrange multipliers, we get $x_i = exp(-\eta l_i) \cdot exp(-(\eta \mu + 1))$
  • $f_t(x) = l_t x$
  • Let's try to bound $Regret(FTRL) \leq \sum_{t=1}^T(f_t(x_t) - f_t(x_{t+1})) + R(x^*) - R(x_1)$
  • We can bound $R(x^*) - R(x_1)$ by $\frac{\log n}{\eta}$ (the entropy ranges over $[0, \log n]$); now let's bound the stability term...
  • $x_{t,i} = \frac{w_{t,i}}{Z_t}$
  • $w_{t+1,i} = w_{t,i} \ exp(-\eta \ l_{t,i})$
  • $w_{t,i} exp(-\eta) \leq w_{t+1,i} \leq w_{t,i}$
  • $Z_t \ exp(-\eta) \leq Z_{t+1} \leq Z_t$
  • $x_{t,i} exp(-\eta) \leq x_{t+1,i} \leq x_{t,i} exp(\eta)$, so $f_t(x_t) - f_t(x_{t+1}) = l_t \cdot (x_t - x_{t+1}) \leq (1 - e^{-\eta}) \, l_t \cdot x_t \leq \eta$, and the bound on $\sum_{t=1}^T(f_t(x_t) - f_t(x_{t+1}))$ is then $\eta \ T$

The Shortest Path Problem

  • We have a graph with $N$ nodes and $n$ edges; every path can be written as a vector $p \in \{0,1\}^n$, and $k \subseteq \{0,1\}^n$ contains only the valid $s$-$t$ paths; the set of weights is $w = (w_1, ..., w_n)$
  • $w \cdot x$ (length of the path $x$)
  • The set of losses: $l = (l_1, ..., l_n)$, $l \cdot x$ is the loss.
  • The problem: \begin{align} x_t = \underset{x \in k}{arg \ min} \left( \sum_{s=1}^{t-1} l_s \cdot x + R(x) \right) \end{align}
  • we choose the regularizer $R(x) = z \cdot x$, where $z$ is a random vector, i. e. $z \in [0, \beta]^n, \ \beta = \sqrt{T}$
  • Assume that the diameter of $k$ is bounded: $diam(k) \leq D$, where $diam(\cdot)$ is the largest distance between two points, $diam(k) = \underset{x, x' \in k}{sup} \lVert x - x' \rVert$
  • $R(x^*) - R(x_1) \leq \beta \underset{x \in k}{max} \lVert x \rVert_1 \leq \beta N$
  • Now we need to bound the stability term $l_t \cdot x_t - l_t \cdot x_{t+1}$
  • Let us have $L_{t} = \sum_{s=1}^{t-1} l_{s}$, but we are only worrying about the expectation.
  • Volume of the cube $[0, \beta]^n$ is $\beta^n$
  • Maximum value of $l_t \cdot x_t$ is $\lVert l \rVert_{\infty} \lVert x_t \rVert_1$
  • Then $\sum_t (l_t \cdot x_t - l_t \cdot x_{t+1})$ is at most $T \frac{N^2}{\beta}$ in expectation; there is a trade-off between the $\beta$ in this denominator and the $\beta$ in the regularizer term above
  • $x_t = \underset{x \in k}{arg \ min} \left( \sum_{s=1}^{t-1}l_s \cdot x + z \cdot x \right) = \underset{x \in k}{arg \ min} \left( \left(\sum_{s=1}^{t-1} l_s + z\right) \cdot x \right)$, i.e. each round is just a shortest path computation with perturbed cumulative edge weights; see the sketch below
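
A sketch of this Follow-the-Perturbed-Leader scheme for online shortest paths: draw the perturbation $z$ once, then at each round run Dijkstra on the perturbed cumulative edge weights. The graph representation (`adj` maps a node to `(neighbour, edge_id)` pairs) and the per-round loss list `losses` are illustrative assumptions:

```python
import heapq
import random

def dijkstra(adj, weights, s, t):
    """adj: node -> list of (neighbour, edge_id); returns the edge ids of a shortest s-t path."""
    dist, prev = {s: 0.0}, {}
    pq = [(0.0, s)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == t:
            break
        if d > dist.get(u, float("inf")):
            continue
        for v, e in adj[u]:
            nd = d + weights[e]
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, (u, e)
                heapq.heappush(pq, (nd, v))
    path, u = [], t
    while u != s:
        u, e = prev[u]
        path.append(e)
    return path

def ftpl_shortest_path(adj, n_edges, losses, s, t, beta):
    """FTPL: draw z once, then each round take the shortest path under cumulative + perturbed weights."""
    z = [random.uniform(0, beta) for _ in range(n_edges)]  # perturbation, drawn once
    cum = [0.0] * n_edges
    chosen = []
    for l_t in losses:  # l_t: the per-edge losses revealed at the end of round t
        x_t = dijkstra(adj, [c + zi for c, zi in zip(cum, z)], s, t)
        chosen.append(x_t)
        cum = [c + l for c, l in zip(cum, l_t)]
    return chosen
```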

Shortest path problem (continued):

  • To bound the stability term, compare $\mathbb{E}_{z}[l_t \cdot x_t]$ with $\mathbb{E}_{z}[l_t \cdot x_{t+1}]$ (the random perturbation $z$ keeps these close).

And this is the end, my friends!