Time-Varying Truth Prediction in Social Networks Using Online Learning
Olusola T. Odeyomi¹, Hyuck M. Kwon¹, and David A. Murrell²
Abstract—This paper shows how agents in a social network
can predict their true state when the true state is arbitrarily
time-varying. We model the social network using graph theory,
where the agents are all strongly connected. We then apply
online learning and propose a non-stochastic multi-armed bandit
algorithm. We obtain a sublinear upper bound on the regret and show by simulation that all agents improve their predictions over time.
Index Terms—non-Bayesian learning, online learning, multi-
armed bandit, regret, graph theory, strongly connected network.
I. INTRODUCTION
The motivation for this work is to study how humans
(agents) can predict time-varying truth in social networks
among a plethora of false information. We consider time-
varying truth (otherwise called the true state) because in social
networks, information is released and updated almost every
second.
The objective of this work is to help agents predict the true
state, even when that true state is dynamic. To achieve that
objective, this paper models the social learning problem using
graph theory for the case of a strongly connected network and
develops a non-stochastic multi-armed bandit algorithm that
can help agents of the network make a good prediction over
time. The regret incurred by the agents is shown to be sublinear. This means that the regret per round converges to zero as the number of rounds tends to infinity. We express the upper bound on this convergence rate using Big-O notation.
Non-Bayesian learning is an established graph-theoretic
approach employed by agents to learn the true state when
the true state is fixed [1], [2]. In this approach, each agent observes a set of private signals from the environment and forms an intermediate belief using Bayes' rule. It then cooperates with its neighbors by taking a weighted combination of its intermediate belief and the old or updated beliefs of its connected neighbors to form its own belief. The
authors in [1], [3] showed that by using the updated beliefs
of neighbors, rather than the old beliefs, the convergence of
agents’ beliefs to the true state improves. They referred to this
type of non-Bayesian learning as diffusion learning.
Most existing literature has focused on learning the true
state when the true state is fixed [2]–[5]. This learning ap-
proach is not practical since the true state of social networks
changes arbitrarily with time [6]–[8]. Moreover, the analyses
¹Olusola T. Odeyomi and Hyuck M. Kwon are with the Department of Electrical Engineering and Computer Science, Wichita State University, Wichita, KS 67260, USA. otodeyomi@shockers.wichita.edu and hyuck.kwon@wichita.edu
²David A. Murrell is with the Air Force Research Laboratory / RVBY, 3550 Aberdeen Avenue, Kirtland AFB, NM 87117-577, USA.
and results in the existing literature are not applicable to a dynamic setting. For instance, the convergence of agents' beliefs to the true state plays a key role when the true state is fixed, whereas convergence is difficult to achieve when the true state is dynamic [9]. This is why we consider online learning, which can handle the dynamics of the true state in real-life scenarios.
Online learning is one of the branches of machine learning
where a player (or agent) chooses an action and observes some
feedback that allows him to learn and improve his choices
in subsequent rounds of play. The feedback is either a loss
or reward function. In this paper, we use the loss function.
Losses over the entire duration of play can either be fixed before the game begins by an oblivious adversary, or chosen adaptively at the start of each round of play by a non-oblivious adversary that leverages the player's history of play [10]. The types of feedback vary. The most common
type is full feedback, where the loss of all actions is revealed
to the player at the end of each round [11], [12]. This is to say that the player knows the loss of every action regardless of his/her choice of play. A second common feedback model
is the bandit feedback [10], where the loss of only the action a
player chooses is revealed to him. This is to say that the losses of all unchosen actions are kept secret from the player.
The bandit setting is a special case of the partial monitoring
prediction problem [13], [14], where a player never has full
knowledge of the losses s/he suffered in the past, rather, s/he
receives feedback with limited information. The player must
then balance an exploitation-exploration trade-off to minimize regret. In other words, the player faces two choices: exploiting what s/he has learned from previous rounds by choosing an action that has performed well in the past, or exploring less frequently chosen actions in the hope that the most informative feedback will be received. When the
total number of actions from which a player can select at
any given round is more than two, it is referred to as the
multi-armed bandit. Full feedback and bandit feedback are well understood through directed feedback graphs [15]–[18]. Because feedback is limited in the bandit setting, it is the more challenging setting. Therefore, in this paper, we employ the multi-armed
bandit setting.
The application of the multi-armed bandit to social learning problems is not new and dates back to the work in [19]. There has been increasing work on social learning using the multi-armed bandit ever since, formulated from different variants of the multi-armed bandit problem. One important limitation of these works is that the loss distribution is assumed to be independent and identically distributed [20]–[24]. However, to effectively capture the behavior of humans in social networks, no restriction should be placed on the loss distribution. Much of this work also lacks proper theoretical analysis [20]–[22]. These limitations motivate the non-stochastic, or adversarial, multi-armed bandit, first proposed in [10]. In this multi-armed bandit model, losses are assigned by an adversary, which may be oblivious or non-oblivious. In this paper, we use the oblivious-adversary, non-stochastic multi-armed bandit model.
The contributions of this paper are as follows: (1) we
formulate the social network problem of a dynamic true state
as an online learning problem; (2) we propose a novel non-
stochastic bandit algorithm; and (3) we show that all agents
can learn the truth with sublinear regret.
II. MODEL
Consider a directed graph $G = (V, E)$, where $V = \{1, \cdots, N\}$ represents the set of agents in the network with $|V| = N$. We assign a pair of non-negative scalar weights $\{a_{jk}, a_{kj}\} \in E$ to the edge joining agents $k \in V$ and $j \in V$. The network is said to be strongly connected if there is a directed path in both directions connecting any two agents and there is at least one self-loop, i.e., $a_{kk} > 0$. It is also possible to have $a_{jk} > 0$ and $a_{kj} = 0$. We define the neighborhood $N_k$ of agent $k$ as the set of agents connected to $k$. Note that agent $k$ also belongs to its own neighborhood. We define the adjacency matrix of a graph as a square matrix whose elements are the weights of the edges linking any two agents. We denote the adjacency matrix by $A$. If the sum of the elements in each column of the adjacency matrix is one, then the matrix is left-stochastic, i.e.,
$$a_{jk} \geq 0, \qquad \sum_{j} a_{jk} = 1. \tag{1}$$
A strongly connected network is left-stochastic and has a spectral radius of one, i.e., the magnitudes of the eigenvalues are bounded by one. A strongly connected network also obeys the Perron-Frobenius theorem: it has a single eigenvalue at one, while all other eigenvalues lie strictly inside the unit disc [3].
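As a concrete illustration of these definitions, the following sketch (our own, using a small example weight matrix that is not from the paper) checks strong connectivity, the self-loop condition, and the left-stochastic property in (1):

```python
import numpy as np

def is_strongly_connected(A):
    """Every agent can reach every other agent along directed edges:
    the sum of powers of the binary adjacency pattern has no zero entries."""
    B = (A > 0).astype(float)
    reach = np.eye(len(B))
    acc = np.eye(len(B))
    for _ in range(len(B) - 1):
        acc = acc @ B
        reach += acc
    return bool(np.all(reach > 0))

# Example weight matrix with a_{jk} as entries; note one pair has a_{jk} > 0 but a_{kj} = 0.
A = np.array([[0.5, 0.0, 0.3],
              [0.3, 0.6, 0.0],
              [0.2, 0.4, 0.7]])

print(is_strongly_connected(A))         # True: directed paths exist in both ways
print(np.all(np.diag(A) > 0))           # True: self-loops present (a_kk > 0)
print(np.allclose(A.sum(axis=0), 1.0))  # True: columns sum to one, cf. (1)
```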
In diffusion learning, all agents start from a prior belief. Agent $k$'s prior belief, for instance, is $\mu_{k,0}(\theta)$, which is uniformly distributed over $\theta \in \Theta$. Each agent updates its belief at each time $t \geq 1$ by first observing a realization of a random private observational signal from a finite state space (say $S_{k,t} \in Z_{k,t}$ for agent $k$), generated by a known likelihood function $L_k(\cdot \mid \theta^*_t)$ that depends on the true state $\theta^*_t$. In this paper, the likelihood function of all agents is given as $L(S_{k,t} \mid \theta)$, i.e.,
$$L(S_{k,t} \mid \theta) = \theta^*_t + \Gamma_{k,t}(\theta), \qquad \forall k \in V,\ t \in T,\ \theta \in \Theta, \tag{2}$$
where $\theta^*_t \in \Theta$ is the true state at time $t$, chosen arbitrarily, and $\Gamma_{k,t}(\theta)$ is dithered quantization noise. Therefore, the random private observation is a noisy observation of the underlying true state. The private signals are not fully informative, so each agent must cooperate with its neighbors. Each agent uses Bayes' rule to generate an intermediate belief as follows:
$$\psi_{k,t}(\theta) = \frac{\mu_{k,t-1}(\theta)\, L_k(S_{k,t} \mid \theta)}{\sum_{\theta' \in \Theta} \mu_{k,t-1}(\theta')\, L_k(S_{k,t} \mid \theta')}, \tag{3}$$
where $\psi_{k,t}(\theta)$ is the intermediate belief of agent $k$ at round $t$. The agent uses this intermediate belief to compute an intermediate probability $p_{k,t}(\theta)$ as
$$p_{k,t}(\theta) = (1-\gamma)\,\mu_{k,t-1}(\theta) + \gamma\, \psi_{k,t}(\theta). \tag{4}$$
Then, it cooperates with its neighbors to form the probability $P_{k,t}(\theta)$ from which it chooses a state $\theta_t$ at each time $t$. $P_{k,t}(\theta)$ is computed as
$$P_{k,t}(\theta) = \sum_{j \in N_k} a_{jk}\, p_{j,t}(\theta). \tag{5}$$
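To make the update concrete, the following sketch (our own illustration, not code from the paper) carries out one round of (3)-(5) for a single agent, assuming the likelihood values $L_k(S_{k,t} \mid \theta)$ and the neighbors' intermediate probabilities are already available:

```python
import numpy as np

def intermediate_belief(mu_prev, likelihood):
    """Bayesian update (3): combine the previous belief with the likelihood of
    the observed private signal, then normalize over the state set Theta."""
    psi = mu_prev * likelihood
    return psi / psi.sum()

def intermediate_probability(mu_prev, psi, gamma):
    """Convex mixture (4) of the previous belief and the intermediate belief."""
    return (1.0 - gamma) * mu_prev + gamma * psi

def cooperative_probability(p_neighbors, weights):
    """Diffusion step (5): weighted combination of the neighbors' intermediate
    probabilities, where weights[j] = a_{jk} for each neighbor j of agent k."""
    return sum(w * p for w, p in zip(weights, p_neighbors))

# Hypothetical numbers for M = 5 states
mu_prev = np.full(5, 0.2)                          # uniform prior belief
likelihood = np.array([0.1, 0.6, 0.1, 0.1, 0.1])   # assumed L_k(S_{k,t} | theta)
psi = intermediate_belief(mu_prev, likelihood)
p_k = intermediate_probability(mu_prev, psi, gamma=0.2)
```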
We now set up the network as an online learning problem, where the goal of each agent is to minimize the regret over the entire round of play. The loss $l_{k,t}(\theta) \in [0,1]$ is fixed for each agent $k$ before the start of the game. The true state has $l_{k,t}(\theta^*_t) = 0$, while every other state has $l_{k,t}(\theta) = 1$. The expected weighted regret of agent $k$, which measures the performance of our randomized algorithm against the choice of the best fixed state $\theta$ over the entire duration $[1, \cdots, T]$, is
$$\mathbb{E}_{F_t}\!\left[\sum_{t=1}^{T} \sum_{\theta \in \Theta} p_{k,t}(\theta)\, l_{k,t}(\theta) \;-\; \sum_{t=1}^{T} l_{k,t}(\theta)\right]. \tag{6}$$
This expectation is taken over the history of all observed private information from time $t = 1$ to time $t \leq T$, represented as the $\sigma$-algebra $F_t$. In Algorithm 1, we introduce the parameters $\eta$ and $\gamma$, which represent the learning rate and the exploration parameter, respectively.
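For reference, the empirical counterpart of (6) can be accumulated round by round from quantities the algorithm already maintains. The sketch below is illustrative; the array names are our own assumptions, not notation from the paper:

```python
import numpy as np

def expected_weighted_regret(p_hist, loss_hist, comparator_hist):
    """Empirical version of (6): cumulative expected loss of the randomized play
    minus the cumulative loss of the comparator state.

    p_hist[t, theta]     -- p_{k,t}(theta) used by agent k at round t
    loss_hist[t, theta]  -- realized losses l_{k,t}(theta) in [0, 1]
    comparator_hist[t]   -- index of the comparator state at round t
    """
    T = p_hist.shape[0]
    expected_loss = np.sum(p_hist * loss_hist, axis=1)            # sum_theta p_{k,t}(theta) l_{k,t}(theta)
    comparator_loss = loss_hist[np.arange(T), comparator_hist]    # l_{k,t}(theta)
    return np.cumsum(expected_loss - comparator_loss)             # regret after each round
```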
Definition 1: The independence number of a graph $G$, denoted by $\alpha(G)$, is the largest size of any subset $B \subseteq V$ whose members are not connected by any edges.
Lemma 1: The expected value of $\hat{l}_{k,t}(\theta)$, which is an unbiased estimate of the true loss $l_{k,t}(\theta)$, is given as
$$\mathbb{E}_{\theta_t \sim P_{k,t}(\theta)}\!\left[\hat{l}_{k,t}(\theta)\right] = l_{k,t}(\theta) \quad \text{and} \quad \mathbb{E}_{\theta_t \sim P_{k,t}(\theta)}\!\left[\hat{l}_{k,t}(\theta)^2\right] = \frac{l_{k,t}(\theta)^2}{P_{k,t}(\theta)}. \tag{7}$$
Proof sketch:
$$\mathbb{E}_{\theta_t \sim P_{k,t}(\theta)}\!\left[\hat{l}_{k,t}(\theta)\right] = \frac{l_{k,t}(\theta)}{P_{k,t}(\theta)}\, P_{k,t}(\theta) = l_{k,t}(\theta),$$
$$\mathbb{E}_{\theta_t \sim P_{k,t}(\theta)}\!\left[\hat{l}_{k,t}(\theta)^2\right] = \left(\frac{l_{k,t}(\theta)}{P_{k,t}(\theta)}\right)^{\!2} P_{k,t}(\theta) = \frac{l_{k,t}(\theta)^2}{P_{k,t}(\theta)}.$$
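A quick Monte Carlo check of Lemma 1 (a minimal sketch we add, assuming an arbitrary sampling distribution $P_{k,t}$ and 0/1 losses) confirms that the importance-weighted estimate recovers the true loss in expectation and has second moment $l^2/P$:

```python
import numpy as np

rng = np.random.default_rng(0)
P = np.array([0.1, 0.2, 0.3, 0.25, 0.15])    # P_{k,t}(theta) over M = 5 states
loss = np.array([0.0, 1.0, 1.0, 1.0, 1.0])   # l_{k,t}(theta): the true state has loss 0

draws = rng.choice(len(P), size=200_000, p=P)
# Importance-weighted estimates: l_hat(theta) = l(theta)/P(theta) * I{theta = theta_t}
est = np.zeros((len(draws), len(P)))
est[np.arange(len(draws)), draws] = loss[draws] / P[draws]

print(est.mean(axis=0))          # approaches l_{k,t}(theta), the first identity in (7)
print((est ** 2).mean(axis=0))   # approaches l_{k,t}(theta)^2 / P_{k,t}(theta)
```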
Lemma 2: In a strongly connected network with the cardinality of $\Theta$ equal to $M$, if all agents are self-aware such that $\sum_{\theta \in \Theta} p_{k,t}(\theta) \leq 1$ and $p_{k,t}(\theta) \geq \varepsilon$ for all $\theta \in \Theta$, $|\Theta| = M$, then
$$\sum_{\theta \in \Theta} \frac{p_{k,t}(\theta)}{P_{k,t}(\theta)} \leq 4\alpha \ln \frac{4M}{\alpha \varepsilon}, \qquad 0 < \varepsilon \leq \frac{1}{2}. \tag{8}$$
Proof: We employ the result of Lemma 5 in [18]. Using Definition 1, Lemma 1, and Lemma 2, we can show that all agents in a strongly connected network can learn and incur sublinear regret with our proposed algorithm.
Algorithm 1: Online Diffusion Learning for a Strongly Connected Network
Parameters: feedback graph $G = (V, E)$, where $V$ is the set of strongly connected agents and $E$ is the set of edges; learning rate $\eta > 0$; exploration parameter $\gamma \in (0, \tfrac{1}{2}]$.
Step 0: $\mu_{k,0}(\theta) = \tfrac{1}{M}$, $\forall \theta \in \Theta$.
For each round $t \in \{1, \cdots, T\}$:
  Step 1: Compute $p_{k,t}(\theta) = (1-\gamma)\,\mu_{k,t-1}(\theta) + \gamma\, \psi_{k,t}(\theta)$ for each $\theta \in \Theta$.
  Step 2: Compute $P_{k,t}(\theta) = \sum_{j \in N_k} a_{jk}\, p_{j,t}(\theta)$.
  Step 3: Draw a state $\theta_t \sim P_{k,t}$ and incur the loss $l_{k,t}(\theta_t) \in [0,1]$.
  Step 4: Update
    $\hat{l}_{k,t}(\theta) = \dfrac{l_{k,t}(\theta)}{P_{k,t}(\theta)}\, \mathbb{I}\{\theta = \theta_t\}$, $\forall \theta \in \Theta$,
    $\mu_{k,t}(\theta) = \dfrac{\mu_{k,t-1}(\theta)\, \exp(-\eta\, \hat{l}_{k,t}(\theta))}{\sum_{\theta' \in \Theta} \mu_{k,t-1}(\theta')\, \exp(-\eta\, \hat{l}_{k,t}(\theta'))}$, $\forall \theta \in \Theta$.
end
Theorem 1: The upper bound on the expected weighted regret of Algorithm 1 for a strongly connected network is $O\!\left(\alpha M^2 T^{1/2} \ln M\right)$ when $\gamma = \min\!\left\{8\alpha\eta, \tfrac{1}{2}\right\}$ and $\eta = \frac{M}{\alpha T^2}$.
Proof sketch: We follow the approach of the proof of Theorem 2 in [18], with some modifications that ensure that the sum of all beliefs is one. Moreover, we remove the restriction, used in Lemma 4 of [18], that some portion of the graph network has losses $l_{k,t}(\theta) \leq \frac{1}{\eta}$. For the sake of brevity, we omit the proof.
III. PROPOSED ALGORITHM
The proposed algorithm is a non-stochastic algorithm that takes a graph network as input. The geometry of the graph network is not time-varying, but the true state of the network is time-varying in this paper. Step 0: At the start of the algorithm, the belief $\mu_{k,0}(\theta)$ of each agent over the states $\theta \in \Theta$ is initialized as a uniform distribution over $\Theta$. Step 1: At each round of the algorithm, the probability $p_{k,t}(\theta)$ is computed for all $\theta \in \Theta$. Computing $p_{k,t}(\theta)$ involves the exploration parameter $\gamma$, which trades off between relying on the previous belief $\mu_{k,t-1}(\theta)$ and exploring a new state based on the likelihood-driven intermediate belief $\psi_{k,t}(\theta)$. Step 2: Each agent then combines its own probability $p_{k,t}(\theta)$ with those of its neighbors, using the weights of the connecting edges, to obtain a weighted probability $P_{k,t}(\theta)$ over all states $\theta \in \Theta$. Step 3: From this weighted probability, the agent draws a state as its prediction of the true state $\theta^*_t$ at time $t$. Step 4: If the prediction is correct, the agent incurs a loss $l_{k,t}(\theta) = 0$; if the prediction is wrong, the agent incurs a loss $l_{k,t}(\theta) = 1$, $\theta \neq \theta^*_t$. The agent knows whether the state it chose is the true state based on the loss value it receives. However, the agent does not know the losses of the other states it did not choose in that round of play. In order to update its belief, the agent needs the loss over all states. To achieve this, the algorithm computes an estimated loss $\hat{l}_{k,t}(\theta)$ over all states. By Lemma 1, the expectation of the estimated loss $\hat{l}_{k,t}(\theta)$ over $\theta_t$ gives the true loss $l_{k,t}(\theta)$. The algorithm then uses the estimated loss and the learning rate $\eta$ to update the belief $\mu_{k,t}(\theta)$ over all states. This is an exponential update in which states that incur losses have their belief probability reduced. The exponential update is normalized so that the beliefs sum to one. The algorithm continues until the end of the horizon $T$.
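The per-round procedure described above can be summarized in a short simulation sketch. This is our own illustrative implementation of Algorithm 1 under simplifying assumptions (every state other than the current true state incurs loss one, and the likelihood is a noisy indicator of the true state); it is not the authors' released code.

```python
import numpy as np

def run_online_diffusion(A, T=1000, M=5, eta=0.025, gamma=0.2, seed=0):
    """Sketch of Algorithm 1 for N strongly connected agents.
    A is the N x N left-stochastic adjacency matrix with A[j, k] = a_{jk}."""
    rng = np.random.default_rng(seed)
    N = A.shape[0]
    mu = np.full((N, M), 1.0 / M)           # Step 0: uniform prior beliefs
    regret = np.zeros((N, T))

    for t in range(T):
        theta_star = rng.integers(M)         # arbitrarily time-varying true state
        # Noisy private signals -> intermediate beliefs psi (assumed likelihood model)
        signal_like = np.full((N, M), 0.1)
        signal_like[:, theta_star] = 0.6
        psi = mu * signal_like
        psi /= psi.sum(axis=1, keepdims=True)

        p = (1 - gamma) * mu + gamma * psi   # Step 1
        P = A.T @ p                          # Step 2: P_k(theta) = sum_j a_{jk} p_j(theta)
        P /= P.sum(axis=1, keepdims=True)    # guard against numerical drift

        loss = np.ones(M)
        loss[theta_star] = 0.0               # loss 0 for the true state, 1 otherwise
        for k in range(N):
            theta_t = rng.choice(M, p=P[k])                    # Step 3: draw a state
            l_hat = np.zeros(M)
            l_hat[theta_t] = loss[theta_t] / P[k, theta_t]     # Step 4: estimated loss
            w = mu[k] * np.exp(-eta * l_hat)                   # exponential-weights update
            mu[k] = w / w.sum()
            regret[k, t] = P[k] @ loss - loss[theta_star]      # per-round expected regret
    return np.cumsum(regret, axis=1)
```

With the adjacency matrix and parameters reported in Section IV, cumulative regret curves qualitatively similar in shape to Figures 1 and 2 should emerge, although the exact values depend on the assumed likelihood model.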
IV. SIMULATION RESULTS
For our simulation, we use a strongly connected graph consisting of three agents. The adjacency matrix of this graph is
$$A = \begin{bmatrix} 0.2 & 0.2 & 0.8 \\ 0.5 & 0.4 & 0.1 \\ 0.3 & 0.4 & 0.1 \end{bmatrix}.$$
Fig. 1. Expected weighted regret versus number of rounds for agents 1, 2, and 3, with $\gamma = 0.2$.
Observe that the matrix is left-stochastic, i.e., the elements of each column sum to one. The largest eigenvalue is one, and all other eigenvalues lie strictly inside the unit disc. Hence, the matrix obeys the Perron-Frobenius theorem. Assume five states, i.e., $\Theta = \{\theta_1, \cdots, \theta_5\}$, that any of these states can be the true state at each time $t$, and that the true state is arbitrarily time-varying. The goal of the agents is to predict this true state. We use $\alpha = 1$, $T = 1{,}000$, $\eta = 0.025$, and $\gamma = 0.2$ for the simulation shown in Figures 1 and 2.
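The stated properties of $A$ (left-stochastic, spectral radius one, all other eigenvalues inside the unit disc) can be verified numerically. This is a small illustrative check we add here, not part of the original simulation code:

```python
import numpy as np

A = np.array([[0.2, 0.2, 0.8],
              [0.5, 0.4, 0.1],
              [0.3, 0.4, 0.1]])

print(np.allclose(A.sum(axis=0), 1.0))   # True: each column sums to one (left-stochastic)
eigvals = np.linalg.eigvals(A)
print(np.max(np.abs(eigvals)))            # spectral radius equals one
print(np.sort(np.abs(eigvals)))           # remaining eigenvalue magnitudes are strictly below one
```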
The likelihood function for all agents over all states at each time is computed from [2], with quantized Gaussian noise of zero mean and variances randomly chosen between one and five. The algorithm is run 100,000 times, and the average over runs is reported in all plots. Observe in Figure 1 that the expected regret of all agents begins to converge as the number of rounds increases. This is because the agents learn from their mistakes and make better predictions in subsequent rounds. Cooperation among agents also ensures that they learn at almost the same rate. Figure 2 shows the decay of the regret with time.
Fig. 2. Sublinear regret (expected weighted regret per round versus number of rounds) for agents 1, 2, and 3, with $\gamma = 0.2$.
As the number of rounds goes to infinity, the regret per round for all three agents goes to zero.
If we change the value of $\gamma$ from 0.2 to 0.5 while keeping all other simulation parameters unchanged, we obtain Figure 3.
Fig. 3. Expected weighted regret for agents 1, 2, and 3, with $\gamma = 0.5$.
The regret in Figure 3 is higher than in Figure 1 because the agents do not make good use of their own history. That is, they fail to maximally exploit their past beliefs about the true state before exploring a new state. This can be seen from Step 1 of Algorithm 1. It can also be seen that the decay of the regret is slower in Figure 4 than in Figure 2.
Fig. 4. Sublinear regret (expected weighted regret per round) for agents 1, 2, and 3, with $\gamma = 0.5$.
In Figure 5, we use an adjacency matrix $A_1$ in which the self-loop of each agent has a higher weight than the weights assigned to neighboring agents. The matrix is still left-stochastic with its largest eigenvalue equal to one. It can be seen that the algorithm performs better and the expected weighted regret is much smaller for $\gamma = 0.2$. The matrix $A_1$ is
$$A_1 = \begin{bmatrix} 0.5 & 0.3 & 0.2 \\ 0.3 & 0.5 & 0.2 \\ 0.2 & 0.2 & 0.6 \end{bmatrix}.$$
Fig. 5. Expected weighted regret for agents 1, 2, and 3, with $\gamma = 0.2$ and adjacency matrix $A_1$.
V. RELATED WORK
Prediction with expert advice [11], [25]–[27] is a general abstract framework for studying sequential prediction problems, formulated as repeated games between a player and an adversary. For a $T$-round sequential learning problem with $K$ arms (or actions), the Hedge algorithm [25] is known to achieve the optimal regret bound of order $O(\sqrt{T \log K})$ in the full feedback setting, while the INF algorithm [28] achieves the optimal regret bound of order $O(\sqrt{TK})$ in the bandit setting. The famous EXP3 algorithm achieves a slightly worse regret of $O(\sqrt{TK \log K})$ in the bandit setting. The bandit setting is a special case of partial monitoring [13], [14].
The feedback graph model for representing games over an undirected graph was first proposed in [15]. In this setting, an edge from action $i$ to action $j$ implies that the losses of actions $i$ and $j$ are revealed when action $i$ is played. This is a middle ground between the full feedback setting, where the losses of all actions are revealed, and the bandit setting, where only the loss of the played action is revealed. The regret bound of [15] is $O(\sqrt{\alpha T \log K})$ and can be achieved by the ELP algorithm, which is a variant of the EXP3 algorithm. Here, $\alpha$ represents the independence number of the graph. The authors in [16], [29] considered the case of directed graphs and proposed the EXP3-DOM and EXP3-SET algorithms. The upper bound on the regret of EXP3-SET is not tight. An improved regret bound is obtained with EXP3-IX, proposed in [30]. The authors in [17] fully characterized the minimax regret of online learning problems defined over feedback graphs. They categorized the set of all feedback graphs into three distinct classes: strongly observable, weakly observable, and unobservable. The minimax regret for strongly observable feedback graphs is of order $\widetilde{\Theta}(\alpha^{1/2} T^{1/2})$; for weakly observable feedback graphs it is of order $\widetilde{\Theta}(\delta^{1/3} T^{2/3})$, where $\delta$ is the domination number; and for unobservable graphs, where learning is difficult, it is of order $\Theta(T)$. Social learning problems are well formulated using directed feedback graphs.
VI. CONCLUSION
This paper investigated how agents in a graph network can
predict their true state in situations where the true state of the
agents keeps changing arbitrarily with time. We formulated
the problem as an online learning problem and proposed
a non-stochastic multi-armed bandit algorithm for strongly
connected agents to predict their arbitrarily time-varying true
state. We showed theoretically and through simulation that
strongly connected agents can predict their true state with
sublinear regret over the entire round of play. Our findings
are applicable to the behavior of humans in social networks.
REFERENCES
[1] H. Salami, B. Ying, and A. H. Sayed, “Social learning over weakly
connected graphs,” IEEE Transactions on Signal and Information Pro-
cessing over Networks, vol. 3, no. 2, pp. 222–238, 2017.
[2] A. Jadbabaie, P. Molavi, A. Sandroni, and A. Tahbaz-Salehi, “Non-
bayesian social learning,” Games and Economic Behavior, vol. 76, no. 1,
pp. 210–225, 2012.
[3] A. H. Sayed et al., “Adaptation, learning, and optimization over net-
works,” Foundations and Trends® in Machine Learning, vol. 7, no. 4-5,
pp. 311–801, 2014.
[4] M. H. DeGroot, “Reaching a consensus,” Journal of the American
Statistical Association, vol. 69, no. 345, pp. 118–121, 1974.
[5] A. H. Sayed, “Adaptive networks,” Proceedings of the IEEE, vol. 102, no. 4, pp. 460–497, 2014.
[6] S. Shahrampour, S. Rakhlin, and A. Jadbabaie, “Online learning of
dynamic parameters in social networks,” in Advances in Neural Infor-
mation Processing Systems, 2013.
[7] R. M. Frongillo, G. Schoenebeck, and O. Tamuz, “Social learning in
a changing world,” in International Workshop on Internet and Network
Economics. Springer, 2011, pp. 146–157.
[8] K. Dasaratha, B. Golub, and N. Hak, “Social learning in a dynamic environment,” Available at SSRN 3097505, 2018.
[9] G. Moscarini, M. Ottaviani, and L. Smith, “Social learning in a changing
world,” Economic Theory, vol. 11, no. 3, pp. 657–665, 1998.
[10] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire, “Gambling
in a rigged casino: The adversarial multi-armed bandit problem,” in
Proceedings of IEEE 36th Annual Foundations of Computer Science.
IEEE, 1995, pp. 322–331.
[11] N. Littlestone and M. K. Warmuth, “The weighted majority algorithm,” Information and Computation, vol. 108, no. 2, pp. 212–261, 1994.
[12] V. Vovk, “A game of prediction with expert advice,” Journal of Computer and System Sciences, vol. 56, no. 2, pp. 153–173, 1998.
[13] A. György, T. Linder, G. Lugosi, and G. Ottucsák, “The on-line shortest path problem under partial monitoring,” Journal of Machine Learning Research, vol. 8, pp. 2369–2403, 2007.
[14] N. Cesa-Bianchi, G. Lugosi, and G. Stoltz, “Regret minimization under
partial monitoring,” Mathematics of Operations Research, vol. 31, no. 3,
pp. 562–580, 2006.
[15] S. Mannor and O. Shamir, “From bandits to experts: On the value
of side-observations,” in Advances in Neural Information Processing
Systems, 2011, pp. 684–692.
[16] N. Alon, N. Cesa-Bianchi, C. Gentile, S. Mannor, Y. Mansour, and
O. Shamir, “Nonstochastic multi-armed bandits with graph-structured
feedback,” SIAM Journal on Computing, vol. 46, no. 6, pp. 1785–1826,
2017.
[17] N. Alon, N. Cesa-Bianchi, O. Dekel, and T. Koren, “Online learning with
feedback graphs: Beyond bandits,” in Annual Conference on Learning
Theory, vol. 40. Microtome Publishing, 2015.
[18] ——, “Online learning with feedback graphs: Beyond bandits,” arXiv
preprint arXiv:1502.07617, 2015.
[19] K. H. Schlag, “Why imitate, and if so, how?: A boundedly rational
approach to multi-armed bandits,” Journal of Economic Theory, vol. 78,
no. 1, pp. 130–156, 1998.
[20] A. Zhou, Y. Feng, P. Zhou, and J. Xu, “Social intimacy based iot services
mining of massive data,” in 2017 IEEE International Conference on
Data Mining Workshops (ICDMW). IEEE, 2017, pp. 641–648.
[21] T. T. Nguyen and H. W. Lauw, “Dynamic clustering of contextual
multi-armed bandits,” in Proceedings of the 23rd ACM International
Conference on Conference on Information and Knowledge Management.
ACM, 2014, pp. 1959–1962.
[22] S. L. Scott, “A modern bayesian look at the multi-armed bandit,Applied
Stochastic Models in Business and Industry, vol. 26, no. 6, pp. 639–658,
2010.
[23] R. K. Kolla, K. Jagannathan, and A. Gopalan, “Collaborative learning of stochastic bandits over a social network,” IEEE/ACM Transactions on Networking (TON), vol. 26, no. 4, pp. 1782–1795, 2018.
[24] S. Buccapatnam, A. Eryilmaz, and N. B. Shroff, “Multi-armed bandits
in the presence of side observations in social networks,” in 52nd IEEE
Conference on Decision and Control. IEEE, 2013, pp. 7309–7314.
[25] Y. Freund and R. E. Schapire, “A decision-theoretic generalization of
on-line learning and an application to boosting,” Journal of Computer
and System Sciences, vol. 55, no. 1, pp. 119–139, 1997.
[26] N. Cesa-Bianchi, Y. Freund, D. Haussler, D. P. Helmbold, R. E. Schapire,
and M. K. Warmuth, “How to use expert advice,” Journal of the ACM
(JACM), vol. 44, no. 3, pp. 427–485, 1997.
[27] N. Cesa-Bianchi and G. Lugosi, Prediction, Learning, and Games.
Cambridge University Press, 2006.
[28] J.-Y. Audibert and S. Bubeck, “Minimax policies for adversarial and
stochastic bandits,” in COLT, 2009, pp. 217–226.
[29] N. Alon, N. Cesa-Bianchi, C. Gentile, and Y. Mansour, “From bandits to
experts: A tale of domination and independence,” in Advances in Neural
Information Processing Systems, 2013, pp. 1610–1618.
[30] T. Kocák, G. Neu, M. Valko, and R. Munos, “Efficient learning by implicit exploration in bandit problems with side observations,” in Advances in Neural Information Processing Systems, 2014, pp. 613–621.