Time-Varying Truth Prediction in Social Networks Using Online
Learning
Olusola T. Odeyomi1, Hyuck M. Kwon1, and David A. Murrell2
Abstract—This paper shows how agents in a social network
can predict their true state when the true state is arbitrarily
time-varying. We model the social network using graph theory,
where the agents are all strongly connected. We then apply
online learning and propose a non-stochastic multi-armed bandit
algorithm. We obtain a sublinear upper bound on the regret and show
by simulation that all agents make better predictions over
time.
Index Terms—non-Bayesian learning, online learning, multi-
armed bandit, regret, graph theory, strongly connected network.
I. INTRODUCTION
The motivation for this work is to study how humans
(agents) can predict time-varying truth in social networks
among a plethora of false information. We consider time-
varying truth (otherwise called the true state) because in social
networks, information is released and updated almost every
second.
The objective of this work is to help agents predict the true
state, even when that true state is dynamic. To achieve that
objective, this paper models the social learning problem using
graph theory for the case of a strongly connected network and
develops a non-stochastic multi-armed bandit algorithm that
can help agents of the network make a good prediction over
time. The regret incurred by agents is shown to be sublinear.
This means that the regret per round converges to zero as the
number of rounds tends to infinity. We express the upper bound on
this convergence rate in "Big-O" notation.
Non-Bayesian learning is an established graph-theoretic
approach employed by agents to learn the true state when
the true state is fixed [1], [2]. In this approach, each agent
observes a set of private information from the environment
to form an intermediate belief using Bayes’ rule and then
cooperates with neighbors by creating a weighted combination
of its intermediate belief with the old or updated belief from
its connected neighbors to form the agent’s own belief. The
authors in [1], [3] showed that by using the updated beliefs
of neighbors, rather than the old beliefs, the convergence of
agents’ beliefs to the true state improves. They referred to this
type of non-Bayesian learning as diffusion learning.
Most existing literature has focused on learning the true
state when the true state is fixed [2]–[5]. This learning ap-
proach is not practical since the true state of social networks
changes arbitrarily with time [6]–[8]. Moreover, the analyses
1Olusola T. Odeyomi and Hyuck M. Kwon are with the Department of Electrical
Engineering and Computer Science, Wichita State University, Wichita,
KS 67260, USA. Emails: otodeyomi@shockers.wichita.edu
and hyuck.kwon@wichita.edu.
2David A. Murrell is with the Air Force Research Laboratory / RVBY, 3550
Aberdeen Avenue, Kirtland AFB, NM 87117-577, USA.
and results in the existing literature are not applicable for
a dynamic setting. For instance, the convergence of agents’
beliefs to the true state plays a key role when the true state is
fixed, whereas convergence is difficult to achieve when the true
state is dynamic [9]. This is why we consider online learning,
which can handle a dynamic true state in real-life
scenarios.
Online learning is one of the branches of machine learning
where a player (or agent) chooses an action and observes some
feedback that allows him to learn and improve his choices
in subsequent rounds of play. The feedback is either a loss
or reward function. In this paper, we use the loss function.
Losses over the entire duration of play can either be fixed
before the game begins by an oblivious adversary, or chosen
adaptively at the start of each round of play, leveraging the
player's history of play, by a non-oblivious adversary [10].
The types of feedback vary. The most common type is full feedback,
where the losses of all actions are revealed to the player at the
end of each round [11], [12]. That is to say, the player knows the
loss of every action regardless of his/her choice of play. A second
common feedback model is bandit feedback [10], where only the loss
of the action the player chooses is revealed to him. That is to say,
the losses of all unchosen actions are kept secret from the player.
The bandit setting is a special case of the partial monitoring
prediction problem [13], [14], where a player never has full
knowledge of the losses s/he suffered in the past; rather, s/he
receives feedback with limited information. The player must
then balance an exploration-exploitation trade-off to minimize
regret. In other words, the player faces two choices:
exploiting what s/he has learned from previous rounds by
choosing an action that has performed well in the past, or
exploring less frequently chosen actions in the hope of
receiving more informative feedback. When the
total number of actions from which a player can select at
any given round is more than two, it is referred to as the
multi-armed bandit. Full feedback and bandit feedback are
well understood by using a directed feedback graph [15]–[18].
Because of its limited feedback, the bandit setting is the tougher
one. Therefore, in this paper, we employ the multi-armed
bandit setting.
The application of the multi-armed bandit to the social learning
problem is not new and dates back to the work in [19].
There has been increasing work on social learning using the
multi-armed bandit ever since, formulated from different variants
of the multi-armed bandit. One important limitation of these works,
however, is that the loss distribution is assumed to be independent
and identically distributed [20]–[24]. To effectively capture the
behavior of humans in social networks, no restriction should be
placed on the loss distribution. Much of this work also lacks proper
theoretical analysis [20]–[22]. This is the reason for considering
the non-stochastic, or adversarial, multi-armed bandit, first
proposed in [10]. In this multi-armed bandit model, losses are
assigned by an adversary, which may be oblivious or non-oblivious.
In this paper, we use the oblivious-adversary, non-stochastic
multi-armed bandit model.
The contributions of this paper are as follows: (1) we
formulate the social network problem of a dynamic true state
as an online learning problem; (2) we propose a novel non-
stochastic bandit algorithm; and (3) we show that all agents
can learn the truth with sublinear regret.
II. MODEL
Consider a directed graph $G = (V, E)$, where $V = \{1, \cdots, N\}$ represents a set of agents in the network with $|V| = N$. We can assign a pair of non-negative scalar weights $\{a_{jk}, a_{kj}\} \in E$ to the edge joining agents $k \in V$ and $j \in V$. The network is said to be strongly connected if there is a directed path in both directions connecting any two agents and there is at least one self-loop, i.e., $a_{kk} > 0$. It is also possible to have $a_{jk} > 0$ and $a_{kj} = 0$. We define the neighborhood $N_k$ of agent $k$ as the set of agents connected to $k$. Note that agent $k$ also belongs to its own neighborhood. We define the adjacency matrix of a graph as a square matrix whose elements are the weights of the edges linking any two agents. We denote the adjacency matrix by $A$. If the sum of the elements in each column of the adjacency matrix is one, then we say the matrix is left-stochastic, i.e.,

$$a_{jk} \geq 0, \qquad \sum_{j} a_{jk} = 1. \tag{1}$$

A strongly connected network is left-stochastic and has a spectral radius of one, i.e., the eigenvalues are positive and bounded by one. A strongly connected network also obeys the Perron-Frobenius theorem: it has a single eigenvalue at one, while all other eigenvalues are strictly inside the unit disc [3].
In diffusion learning, all agents start from a prior belief. Agent $k$'s prior belief, for instance, is $\mu_{k,0}(\theta)$, which is uniformly distributed over $\theta \in \Theta$. Each agent updates its belief at each time $t \geq 1$, first by observing a realization of some random observational private signals from a finite state space (say $S_{k,t} \in Z_{k,t}$ for agent $k$), which are generated by some known likelihood function $L_k(\cdot|\theta_t^*)$ that depends on the true state $\theta_t^*$. In this paper, the likelihood function of all agents is given as $L(S_{k,t}|\theta)$, i.e.,

$$L(S_{k,t}|\theta) = \theta_t^* + \Gamma_{k,t}(\theta), \quad \forall k \in V, \; t \leq T, \; \theta \in \Theta \tag{2}$$

where $\theta_t^* \in \Theta$ is the true state at time $t$, chosen arbitrarily, and $\Gamma_{k,t}(\theta)$ is a dithered quantization noise. Therefore, the random private observation is a noisy observation of the underlying true state. The private signals are not fully informative, so each agent must cooperate with its neighbors. Each agent uses Bayes' rule to generate an intermediate belief as follows:

$$\psi_{k,t}(\theta) = \frac{\mu_{k,t-1}(\theta) L_k(S_{k,t}|\theta)}{\sum_{\theta' \in \Theta} \mu_{k,t-1}(\theta') L_k(S_{k,t}|\theta')} \tag{3}$$

where $\psi_{k,t}(\theta)$ is the intermediate belief of agent $k$ at round $t$. The agent uses the intermediate belief to compute an intermediate probability $p_{k,t}(\theta)$ as

$$p_{k,t}(\theta) = (1 - \gamma)\mu_{k,t-1}(\theta) + \gamma \psi_{k,t}(\theta). \tag{4}$$

Then, it cooperates with its neighbors to form the probability $P_{k,t}(\theta)$ from which it chooses a state $\theta_t$ at each time $t$. $P_{k,t}(\theta)$ is computed as

$$P_{k,t}(\theta) = \sum_{j \in N_k} a_{jk} p_{j,t}(\theta). \tag{5}$$
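The belief updates in (3)–(5) can be sketched for a single agent and round as follows (a minimal sketch; the likelihood values, neighbor probabilities, and edge weights are illustrative placeholders, not values from the paper):

```python
import numpy as np

def intermediate_belief(mu_prev, likelihood):
    """Bayes update (3): weight the previous belief by the likelihood
    of the observed private signal, then normalize over states."""
    psi = mu_prev * likelihood
    return psi / psi.sum()

M = 5                                    # number of states |Theta|
mu_prev = np.full(M, 1.0 / M)            # uniform prior belief
likelihood = np.array([0.1, 0.3, 0.9, 0.2, 0.1])  # illustrative L_k(S|theta)

psi = intermediate_belief(mu_prev, likelihood)
gamma = 0.2
p = (1 - gamma) * mu_prev + gamma * psi  # mixing step (4)

# Cooperation step (5): combine own and neighbors' intermediate
# probabilities with left-stochastic edge weights a_jk.
a = np.array([0.5, 0.3, 0.2])            # weights a_jk for j in N_k
p_neighbors = np.vstack([p,
                         np.full(M, 1.0 / M),
                         np.full(M, 1.0 / M)])
P = a @ p_neighbors                      # P_{k,t}(theta)
assert np.isclose(P.sum(), 1.0)          # still a probability distribution
```

Because each step mixes probability distributions with weights that sum to one, the output $P_{k,t}$ remains a valid distribution over the states.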
We now set up the network as an online learning problem, where the goal of each agent is to minimize the regret over the entire duration of play. The loss $l_{k,t}(\theta) \in [0, 1]$ is fixed for each agent $k$ before the start of the game. The true state has $l_{k,t}(\theta_t^*) = 0$, while every other state has $l_{k,t}(\theta) = 1$. The expected weighted regret of agent $k$, which compares the performance of our randomized algorithm to the choice of the best fixed state $\theta^\bullet$ for the entire duration $[1, \cdots, T]$, is

$$\mathbb{E}_{F_t}\left[\sum_{t=1}^{T} \sum_{\theta \in \Theta} p_{k,t}(\theta) l_{k,t}(\theta) - \sum_{t=1}^{T} l_{k,t}(\theta^\bullet)\right]. \tag{6}$$

This expectation is taken over the history of all observed private information from time $t = 1$ to time $t \leq T$, represented as the $\sigma$-algebra $F_t$. In Algorithm 1, we introduce the parameters $\eta$ and $\gamma$, which represent the learning rate and the exploration parameter, respectively.
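The regret in (6) can be computed from loss traces as follows (a sketch; the probability and loss arrays are illustrative, and the expectation over $F_t$ is replaced by a single realized trajectory):

```python
import numpy as np

def expected_weighted_regret(p, losses):
    """Regret (6) for one agent: cumulative expected loss of the
    randomized play minus the cumulative loss of the best fixed state.

    p:      (T, M) array; p[t] is the agent's distribution p_{k,t}
    losses: (T, M) array; losses[t, i] = l_{k,t}(theta_i)
    """
    expected_loss = np.sum(p * losses)       # sum_t sum_theta p * l
    best_fixed = np.min(losses.sum(axis=0))  # best single state in hindsight
    return expected_loss - best_fixed

# Illustrative 0/1 losses over T = 4 rounds and M = 3 states,
# with the agent always playing uniformly.
losses = np.array([[1, 0, 1],
                   [1, 0, 1],
                   [0, 1, 1],
                   [1, 0, 1]], dtype=float)
p = np.full((4, 3), 1.0 / 3)
print(expected_weighted_regret(p, losses))   # → 5/3 ≈ 1.667
```

Dividing this quantity by $T$ gives the regret per round plotted in the simulation section.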
Definition 1: The independence number of a graph $G$, denoted by $\alpha(G)$, is the largest size of any subset $B \subseteq V$ not connected by any edges.

Lemma 1: The expected value of $\hat{l}_{k,t}(\theta)$, which is an unbiased estimate of the true loss $l_{k,t}(\theta)$, is given as

$$\mathbb{E}_{\theta_t \sim P_{k,t}(\theta)}\left[\hat{l}_{k,t}(\theta)\right] = l_{k,t}(\theta) \quad \text{and} \quad \mathbb{E}_{\theta_t \sim P_{k,t}(\theta)}\left[\hat{l}_{k,t}(\theta)^2\right] = \frac{l_{k,t}(\theta)^2}{P_{k,t}(\theta)}. \tag{7}$$

Proof sketch:

$$\mathbb{E}_{\theta_t \sim P_{k,t}(\theta)}\left[\hat{l}_{k,t}(\theta)\right] = \frac{l_{k,t}(\theta)}{P_{k,t}(\theta)} P_{k,t}(\theta) = l_{k,t}(\theta),$$

$$\mathbb{E}_{\theta_t \sim P_{k,t}(\theta)}\left[\hat{l}_{k,t}(\theta)^2\right] = \left(\frac{l_{k,t}(\theta)}{P_{k,t}(\theta)}\right)^2 P_{k,t}(\theta) = \frac{l_{k,t}(\theta)^2}{P_{k,t}(\theta)}.$$
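Lemma 1 can be verified numerically by Monte Carlo simulation (a sketch with an illustrative sampling distribution and loss vector, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sampling distribution P_{k,t} over M = 4 states
# and an illustrative 0/1 loss vector.
P = np.array([0.4, 0.3, 0.2, 0.1])
losses = np.array([0.0, 1.0, 1.0, 1.0])
M, n = len(P), 200_000

# Draw theta_t ~ P and form the importance-weighted estimate
# l_hat(theta) = (l(theta) / P(theta)) * 1{theta = theta_t}.
draws = rng.choice(M, size=n, p=P)
l_hat = np.zeros((n, M))
l_hat[np.arange(n), draws] = losses[draws] / P[draws]

# First identity of Lemma 1: the empirical mean approaches l(theta).
print(l_hat.mean(axis=0))          # ≈ losses
# Second identity: the second moment approaches l(theta)^2 / P(theta).
print((l_hat ** 2).mean(axis=0))   # ≈ losses**2 / P
```

The estimator is unbiased precisely because the division by $P_{k,t}(\theta)$ cancels the probability of observing that state's loss.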
Lemma 2: In a strongly connected network with the cardinality of $\Theta$ equal to $M$, if all agents are self-aware such that $\sum_{\theta \in \Theta} p_{k,t}(\theta) \leq 1$ and $p_{k,t}(\theta) \geq \varepsilon$ for all $\theta \in \Theta$, $|\Theta| = M$, then

$$\sum_{\theta \in \Theta} \frac{p_{k,t}(\theta)}{P_{k,t}(\theta)} \leq 4\alpha \ln \frac{4M}{\alpha \varepsilon}, \qquad 0 < \varepsilon \leq \frac{1}{2}. \tag{8}$$

Proof: We employ the result of Lemma 5 in [18]. By using Definition 1, Lemma 1, and Lemma 2, we can show that all agents in a strongly connected network can learn and incur sublinear regret with our proposed algorithm.
Algorithm 1: Online Diffusion Learning for a Strongly Connected Network

Parameters: feedback graph $G = (V, E)$, where $V$ is the set of strongly connected agents and $E$ is the set of edges; learning rate $\eta > 0$; exploration parameter $\gamma \in (0, \frac{1}{2}]$.

Step 0: Initialize $\mu_{k,0}(\theta) = \frac{1}{M}$, $\forall \theta \in \Theta$.
For each round $t \in \{1, \cdots, T\}$:
  Step 1: Compute $p_{k,t}(\theta) = (1 - \gamma)\mu_{k,t-1}(\theta) + \gamma \psi_{k,t}(\theta)$ for each $\theta \in \Theta$.
  Step 2: Compute $P_{k,t}(\theta) = \sum_{j \in N_k} a_{jk} p_{j,t}(\theta)$.
  Step 3: Draw state $\theta_t \sim P_{k,t}$, and incur loss $l_{k,t}(\theta_t) \in [0, 1]$.
  Step 4: Update
    $$\hat{l}_{k,t}(\theta) = \frac{l_{k,t}(\theta)}{P_{k,t}(\theta)} \mathbb{I}\{\theta = \theta_t\}, \quad \forall \theta \in \Theta,$$
    $$\mu_{k,t}(\theta) = \frac{\mu_{k,t-1}(\theta) \exp(-\eta \hat{l}_{k,t}(\theta))}{\sum_{\theta' \in \Theta} \mu_{k,t-1}(\theta') \exp(-\eta \hat{l}_{k,t}(\theta'))}, \quad \forall \theta \in \Theta.$$
end
Theorem 1: The upper bound on the expected weighted regret of Algorithm 1 for a strongly connected network is $O\left(\alpha^{-1} M^2 T^{1/2} \ln M\right)$ when $\gamma = \min\left(8\alpha\eta, \frac{1}{2}\right)$ and $\eta = \left(\frac{M}{\alpha\sqrt{T}}\right)^2$.
Proof sketch: We follow the approach of the proof of Theorem 2 in [18], with some modifications that ensure that the sum of all beliefs is one. Moreover, we remove the restriction that some portion of the graph networks have losses $l_{k,t}(\theta) \leq \frac{1}{\eta}$, used in Lemma 4 in [18]. For the sake of brevity, however, we omit the proof.
III. PROPOSED ALGORITHM
The proposed algorithm is a non-stochastic algorithm that takes a graph network as input. The network geometry is not time-varying, but the true state of the network is time-varying in this paper. Step 0: The belief of each agent $\mu_{k,0}(\theta)$ over the states $\theta \in \Theta$ is initialized as a uniform distribution over $\Theta$ at the start of the algorithm. Step 1: At each round of the algorithm, the probability $p_{k,t}(\theta)$ is computed for all $\theta \in \Theta$. Computing $p_{k,t}(\theta)$ involves the exploration parameter $\gamma$, which trades off using the previous belief $\mu_{k,t-1}(\theta)$ against exploring a new state based on the likelihood function through $\psi_{k,t}(\theta)$. Step 2: Each agent then combines its own probability $p_{k,t}(\theta)$ with those of its neighbors, using the weights of the connecting edges, to obtain a weighted probability $P_{k,t}(\theta)$ over all states $\theta \in \Theta$. Step 3: From this weighted probability, the agent draws a state $\theta_t$ as its prediction of the true state $\theta_t^*$ at time $t$. Step 4: If the prediction is correct, the agent incurs a loss $l_{k,t}(\theta_t^*) = 0$; if the prediction is wrong, the agent incurs a loss $l_{k,t}(\theta) = 1$, $\theta \neq \theta_t^*$. The agent knows whether the state it chose is the true state based on the loss value it receives. However, the agent does not know the losses of the states it did not choose in that round of play. To update its belief, the agent needs the loss over all states. To achieve this, the algorithm computes an estimated loss $\hat{l}_{k,t}(\theta)$ over all states. Based on Lemma 1, the expectation of the estimated loss $\hat{l}_{k,t}(\theta)$ over $\theta_t$ gives the true loss $l_{k,t}(\theta)$. The algorithm then uses the estimated loss and the learning rate $\eta$ to update the belief probability $\mu_{k,t}(\theta)$ over all states. This is an exponential update in which states that incur losses have their belief probability reduced; it is normalized to keep the sum of all beliefs equal to one. The algorithm continues until the final round, or time slot, $T$.
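The steps above can be sketched as a multi-agent implementation (a sketch under the paper's setup; the likelihood callback, the 0/1 loss assignment, and the random true-state sequence are illustrative placeholders, so this run exercises the mechanics rather than reproducing the paper's figures):

```python
import numpy as np

def run_diffusion_bandit(A, M, T, eta, gamma, likelihood_fn, rng):
    """Sketch of Algorithm 1 for N agents over M states.

    A[j, k] = a_jk is the left-stochastic edge weight from agent j
    to agent k; likelihood_fn(t) returns an (N, M) array of the
    likelihoods L_k(S_{k,t} | theta) (a hypothetical callback here).
    """
    N = A.shape[0]
    mu = np.full((N, M), 1.0 / M)           # Step 0: uniform priors
    total_loss = np.zeros(N)
    for t in range(T):
        L = likelihood_fn(t)                # private-signal likelihoods
        psi = mu * L                        # Bayes update, eq. (3)
        psi /= psi.sum(axis=1, keepdims=True)
        p = (1 - gamma) * mu + gamma * psi  # Step 1, eq. (4)
        P = A.T @ p                         # Step 2, eq. (5)

        theta_true = rng.integers(M)        # arbitrary time-varying truth
        losses = np.ones(M)
        losses[theta_true] = 0.0            # true state has zero loss
        for k in range(N):
            Pk = P[k] / P[k].sum()
            theta_t = rng.choice(M, p=Pk)   # Step 3: draw a state
            total_loss[k] += losses[theta_t]
            # Step 4: importance-weighted loss estimate over all states,
            # then the normalized exponential belief update
            l_hat = np.zeros(M)
            l_hat[theta_t] = losses[theta_t] / Pk[theta_t]
            mu[k] *= np.exp(-eta * l_hat)
            mu[k] /= mu[k].sum()
    return total_loss

# Usage with the paper's parameters (likelihoods are placeholders):
rng = np.random.default_rng(1)
A = np.array([[0.2, 0.2, 0.8],
              [0.5, 0.4, 0.1],
              [0.3, 0.4, 0.1]])
noisy = lambda t: rng.uniform(0.1, 1.0, size=(3, 5))
loss = run_diffusion_bandit(A, M=5, T=1000, eta=0.025, gamma=0.2,
                            likelihood_fn=noisy, rng=rng)
```

Note that the cooperation step uses `A.T @ p` because $P_{k,t}(\theta) = \sum_j a_{jk} p_{j,t}(\theta)$ sums over the column $k$ of the left-stochastic adjacency matrix.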
IV. SIMULATION RESULTS

For our simulation, we use a strongly connected graph consisting of three agents. The adjacency matrix of this graph is

$$A = \begin{pmatrix} 0.2 & 0.2 & 0.8 \\ 0.5 & 0.4 & 0.1 \\ 0.3 & 0.4 & 0.1 \end{pmatrix}.$$
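The left-stochastic and spectral claims for this matrix can be checked numerically (a sketch using numpy):

```python
import numpy as np

A = np.array([[0.2, 0.2, 0.8],
              [0.5, 0.4, 0.1],
              [0.3, 0.4, 0.1]])

# Left-stochastic: every column sums to one.
assert np.allclose(A.sum(axis=0), 1.0)

# Perron-Frobenius structure: a single eigenvalue of modulus one,
# all other eigenvalues strictly inside the unit disc.
moduli = np.sort(np.abs(np.linalg.eigvals(A)))[::-1]
assert np.isclose(moduli[0], 1.0)
assert np.all(moduli[1:] < 1.0)
```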
[Figure: expected weighted regret versus number of rounds, for agents 1–3.]
Fig. 1. Expected weighted regret for agents 1, 2 and 3 for γ = 0.2.
Observe that the matrix is left-stochastic, i.e., the sum of the elements in each column is 1. The largest eigenvalue is one, and all other eigenvalues lie strictly within the unit disc. Hence, the matrix obeys the Perron-Frobenius theorem. Assume five states, i.e., $\Theta = \{\theta_1, \cdots, \theta_5\}$; any of these states can be the true state at each time $t$, and the true state is arbitrarily time-varying. The goal of the agents is to predict this true state. We use $\alpha = 1$, $T = 1{,}000$, $\eta = 0.025$, and $\gamma = 0.2$ for the simulation shown in Figures 1 and 2.
The likelihood function for all agents over all states at each
time is computed from [2] with quantized Gaussian noise of
mean zero and variances randomly chosen from one to five.
The algorithm runs 100,000 times, and the average is plotted in all figures. Observe in Figure 1 that the expected regret across all agents begins to converge as the number of rounds increases.
This is because the agents learn from their mistakes and make
better predictions in subsequent rounds. Cooperation among
agents also helps them ensure that they learn at almost the
same rate. Figure 2 shows the decay of the regret with time.
[Figure: expected weighted regret per round versus number of rounds, for agents 1–3.]
Fig. 2. Sublinear regret for agents 1, 2 and 3 for γ = 0.2.
As time goes to infinity, the sublinear regret for all three agents
goes to zero.
If we change the value of γfrom 0.2 to 0.5 while keeping
all other simulation parameters unchanged, we obtain Figure
3. The regret in Figure 3 is higher than in Figure 1 because
[Figure: expected weighted regret versus number of rounds, for agents 1–3.]
Fig. 3. Expected weighted regret for agents 1, 2 and 3 for γ = 0.5.
the agents do not make good use of their own history. That is,
they fail to maximally exploit their past beliefs about the true
state before exploring a new state. This can be seen from Step 1 of Algorithm 1. The decay of the regret is correspondingly slower in Figure 4 than in Figure 2.
[Figure: expected weighted regret per round versus number of rounds, for agents 1–3.]
Fig. 4. Sublinear regret for agents 1, 2 and 3 for γ = 0.5.
In Figure 5, we use an adjacency matrix $A_1$ in which the self-loop of each agent has a higher weight than the weights of neighboring agents. The matrix is still left-stochastic with the largest eigenvalue equal to 1. It can be seen that the algorithm performs better, and the expected weighted regret is much lower, when γ = 0.2. The matrix $A_1$ is represented as:

$$A_1 = \begin{pmatrix} 0.5 & 0.3 & 0.2 \\ 0.3 & 0.5 & 0.2 \\ 0.2 & 0.2 & 0.6 \end{pmatrix}.$$
[Figure: expected weighted regret versus number of rounds, for agents 1–3.]
Fig. 5. Expected weighted regret for agents 1, 2 and 3 for γ = 0.2 with adjacency matrix $A_1$.
V. RELATED WORK
Prediction with expert advice [11], [25]–[27] is a general abstract framework for studying sequential prediction problems, formulated as repeated games between a player and an adversary. For $T$-round sequential learning with $K$ arms (or actions), the Hedge algorithm [25] is known to achieve the optimal regret bound of order $O(\sqrt{T \log K})$ in the full feedback setting, while the INF algorithm [28] achieves the optimal regret bound of order $O(\sqrt{TK})$ in the bandit setting. The famous EXP3 algorithm achieves a slightly worse regret of $O(\sqrt{TK \log K})$ in the bandit setting. The bandit setting is a special case of partial monitoring [13], [14].
The feedback graph model for representing games over an undirected graph was first proposed in [15]. In this setting, an edge from action $i$ to action $j$ implies that the losses of actions $i$ and $j$ are revealed when action $i$ is played. This is a middle ground between the full feedback setting, where the losses of all actions are revealed, and the bandit setting, where only the loss of the played action is revealed. The regret bound of [15] is $O(\sqrt{T \alpha \log K})$ and can be achieved by the ELP algorithm, a variant of the EXP3 algorithm. Here, $\alpha$ represents the independence number of the graph. The authors in [16], [29] considered the case of directed graphs and proposed the EXP3-DOM and EXP3-SET algorithms. The upper bound on the regret of EXP3-SET is not tight. An improved regret bound is obtained by EXP3-IX, proposed in [30].
The authors in [17] fully characterized the minimax regret of online learning problems defined over feedback graphs. They categorized the set of all feedback graphs into three distinct sets: strongly observable, weakly observable, and unobservable. The minimax regret for strongly observable feedback graphs is on the order of $\tilde{\Theta}(\alpha^{1/2} T^{1/2})$; for weakly observable feedback graphs, it is on the order of $\tilde{\Theta}(\delta^{1/3} T^{2/3})$, where $\delta$ is the domination number; and for unobservable graphs, where learning is difficult, it is on the order of $\Theta(T)$. Social learning problems are well formulated using directed feedback graphs.
VI. CONCLUSION
This paper investigated how agents in a graph network can
predict their true state in situations where the true state of the
agents keeps changing arbitrarily with time. We formulated
the problem as an online learning problem and proposed
a non-stochastic multi-armed bandit algorithm for strongly
connected agents to predict their arbitrarily time-varying true
state. We showed theoretically and through simulation that
strongly connected agents can predict their true state with
sublinear regret over the entire duration of play. Our findings
are applicable to the behavior of humans in social networks.
REFERENCES
[1] H. Salami, B. Ying, and A. H. Sayed, “Social learning over weakly
connected graphs,” IEEE Transactions on Signal and Information Pro-
cessing over Networks, vol. 3, no. 2, pp. 222–238, 2017.
[2] A. Jadbabaie, P. Molavi, A. Sandroni, and A. Tahbaz-Salehi, “Non-
bayesian social learning,” Games and Economic Behavior, vol. 76, no. 1,
pp. 210–225, 2012.
[3] A. H. Sayed et al., “Adaptation, learning, and optimization over net-
works,” Foundations and Trends® in Machine Learning, vol. 7, no. 4-5,
pp. 311–801, 2014.
[4] M. H. DeGroot, “Reaching a consensus,” Journal of the American
Statistical Association, vol. 69, no. 345, pp. 118–121, 1974.
[5] A. H. Sayed, “Adaptive networks,” Proceedings of the IEEE, vol. 102,
no. 4, pp. 460–497, 2014.
[6] S. Shahrampour, S. Rakhlin, and A. Jadbabaie, “Online learning of
dynamic parameters in social networks,” in Advances in Neural Infor-
mation Processing Systems, 2013.
[7] R. M. Frongillo, G. Schoenebeck, and O. Tamuz, “Social learning in
a changing world,” in International Workshop on Internet and Network
Economics. Springer, 2011, pp. 146–157.
[8] K. Dasaratha, B. Golub, and N. Hak, “Social learning in a dynamic
environment,” Available at SSRN 3097505, 2018.
[9] G. Moscarini, M. Ottaviani, and L. Smith, “Social learning in a changing
world,” Economic Theory, vol. 11, no. 3, pp. 657–665, 1998.
[10] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire, “Gambling
in a rigged casino: The adversarial multi-armed bandit problem,” in
Proceedings of IEEE 36th Annual Foundations of Computer Science.
IEEE, 1995, pp. 322–331.
[11] N. Littlestone and M. K. Warmuth, “The weighted majority algorithm,”
Information and computation, vol. 108, no. 2, pp. 212–261, 1994.
[12] V. Vovk, “A game of prediction with expert advice,” Journal of Computer
and System Sciences, vol. 56, no. 2, pp. 153–173, 1998.
[13] A. György, T. Linder, G. Lugosi, and G. Ottucsák, “The on-line shortest
path problem under partial monitoring,” Journal of Machine Learning
Research, vol. 8, pp. 2369–2403, 2007.
[14] N. Cesa-Bianchi, G. Lugosi, and G. Stoltz, “Regret minimization under
partial monitoring,” Mathematics of Operations Research, vol. 31, no. 3,
pp. 562–580, 2006.
[15] S. Mannor and O. Shamir, “From bandits to experts: On the value
of side-observations,” in Advances in Neural Information Processing
Systems, 2011, pp. 684–692.
[16] N. Alon, N. Cesa-Bianchi, C. Gentile, S. Mannor, Y. Mansour, and
O. Shamir, “Nonstochastic multi-armed bandits with graph-structured
feedback,” SIAM Journal on Computing, vol. 46, no. 6, pp. 1785–1826,
2017.
[17] N. Alon, N. Cesa-Bianchi, O. Dekel, and T. Koren, “Online learning with
feedback graphs: Beyond bandits,” in Annual Conference on Learning
Theory, vol. 40. Microtome Publishing, 2015.
[18] ——, “Online learning with feedback graphs: Beyond bandits,” arXiv
preprint arXiv:1502.07617, 2015.
[19] K. H. Schlag, “Why imitate, and if so, how?: A boundedly rational
approach to multi-armed bandits,” Journal of Economic Theory, vol. 78,
no. 1, pp. 130–156, 1998.
[20] A. Zhou, Y. Feng, P. Zhou, and J. Xu, “Social intimacy based iot services
mining of massive data,” in 2017 IEEE International Conference on
Data Mining Workshops (ICDMW). IEEE, 2017, pp. 641–648.
[21] T. T. Nguyen and H. W. Lauw, “Dynamic clustering of contextual
multi-armed bandits,” in Proceedings of the 23rd ACM International
Conference on Conference on Information and Knowledge Management.
ACM, 2014, pp. 1959–1962.
[22] S. L. Scott, “A modern bayesian look at the multi-armed bandit,” Applied
Stochastic Models in Business and Industry, vol. 26, no. 6, pp. 639–658,
2010.
[23] R. K. Kolla, K. Jagannathan, and A. Gopalan, “Collaborative learning
of stochastic bandits over a social network,” IEEE/ACM Transactions
on Networking (TON), vol. 26, no. 4, pp. 1782–1795, 2018.
[24] S. Buccapatnam, A. Eryilmaz, and N. B. Shroff, “Multi-armed bandits
in the presence of side observations in social networks,” in 52nd IEEE
Conference on Decision and Control. IEEE, 2013, pp. 7309–7314.
[25] Y. Freund and R. E. Schapire, “A decision-theoretic generalization of
on-line learning and an application to boosting,” Journal of Computer
and System Sciences, vol. 55, no. 1, pp. 119–139, 1997.
[26] N. Cesa-Bianchi, Y. Freund, D. Haussler, D. P. Helmbold, R. E. Schapire,
and M. K. Warmuth, “How to use expert advice,” Journal of the ACM
(JACM), vol. 44, no. 3, pp. 427–485, 1997.
[27] N. Cesa-Bianchi and G. Lugosi, Prediction, Learning, and Games.
Cambridge University Press, 2006.
[28] J.-Y. Audibert and S. Bubeck, “Minimax policies for adversarial and
stochastic bandits,” in COLT, 2009, pp. 217–226.
[29] N. Alon, N. Cesa-Bianchi, C. Gentile, and Y. Mansour, “From bandits to
experts: A tale of domination and independence,” in Advances in Neural
Information Processing Systems, 2013, pp. 1610–1618.
[30] T. Kocák, G. Neu, M. Valko, and R. Munos, “Efficient learning by
implicit exploration in bandit problems with side observations,” in
Advances in Neural Information Processing Systems, 2014, pp. 613–621.