Time-Varying Truth Prediction in Social Networks Using Online
Learning
Olusola T. Odeyomi1, Hyuck M. Kwon1, and David A. Murrell2
Abstract—This paper shows how agents in a social network
can predict their true state when the true state is arbitrarily
time-varying. We model the social network using graph theory,
where the agents are all strongly connected. We then apply
online learning and propose a non-stochastic multi-armed bandit
algorithm. We obtain a sublinear upper bound on the regret and show
by simulation that all agents make better predictions over
time.
Index Terms—non-Bayesian learning, online learning, multi-
armed bandit, regret, graph theory, strongly connected network.
I. INTRODUCTION
The motivation for this work is to study how humans
(agents) can predict time-varying truth in social networks
among a plethora of false information. We consider time-
varying truth (otherwise called the true state) because in social
networks, information is released and updated almost every
second.
The objective of this work is to help agents predict the true
state, even when that true state is dynamic. To achieve that
objective, this paper models the social learning problem using
graph theory for the case of a strongly connected network and
develops a non-stochastic multi-armed bandit algorithm that
can help agents of the network make a good prediction over
time. The regret incurred by agents is shown to be sublinear.
This means that the regret per round converges to zero as the
number of rounds tends to infinity. We express the upper bound on
this convergence rate in "Big-O" notation.
Non-Bayesian learning is an established graph-theoretic
approach employed by agents to learn the true state when
the true state is fixed [1], [2]. In this approach, each agent
observes a set of private information from the environment
to form an intermediate belief using Bayes’ rule and then
cooperates with neighbors by creating a weighted combination
of its intermediate belief with the old or updated belief from
its connected neighbors to form the agent’s own belief. The
authors in [1], [3] showed that by using the updated beliefs
of neighbors, rather than the old beliefs, the convergence of
agents’ beliefs to the true state improves. They referred to this
type of non-Bayesian learning as diffusion learning.
Most existing literature has focused on learning the true
state when the true state is fixed [2]–[5]. This learning ap-
proach is not practical since the true state of social networks
changes arbitrarily with time [6]–[8]. Moreover, the analyses
1Olusola T. Odeyomi and Hyuck M. Kwon are with the Department of Electrical
Engineering and Computer Science, Wichita State University, Wichita,
KS 67260, USA. Emails: otodeyomi@shockers.wichita.edu
and hyuck.kwon@wichita.edu.
2David A. Murrell is with the Air Force Research Laboratory / RVBY, 3550
Aberdeen Avenue, Kirtland AFB, NM 87117-577, USA.
and results in the existing literature are not applicable for
a dynamic setting. For instance, the convergence of agents’
beliefs to the true state plays a key role when the true state is
fixed, whereas convergence is difficult to achieve when the true
state is dynamic [9]. This is why we consider online learning,
which can handle a dynamic true state in real-life
scenarios.
Online learning is one of the branches of machine learning
where a player (or agent) chooses an action and observes some
feedback that allows him to learn and improve his choices
in subsequent rounds of play. The feedback is either a loss
or reward function. In this paper, we use the loss function.
Losses over the entire duration of play can either be fixed
before the game begins by an oblivious adversary, or chosen
adaptively at the start of each round of play, leveraging the
player's history of play, by a non-oblivious adversary [10].
The types of feedback vary. The most common type is full feedback,
where the losses of all actions are revealed to the player at the
end of each round [11], [12]. That is to say, the player knows the
loss of every action regardless of his/her choice of play. A second
common feedback model is bandit feedback [10], where only the loss
of the action the player chooses is revealed to him. That is to say,
the losses of all unchosen actions are kept secret from the player.
The bandit setting is a special case of the partial monitoring
prediction problem [13], [14], where a player never has full
knowledge of the losses s/he suffered in the past; rather, s/he
receives feedback with limited information. The player must
then balance an exploration-exploitation trade-off to minimize
regret. In other words, the player faces two choices:
exploiting what s/he has learned from previous rounds by
choosing an action that has performed well in the past, or
exploring less frequently chosen actions in the hope of
receiving more informative feedback. When the
total number of actions from which a player can select at
any given round is more than two, it is referred to as the
multi-armed bandit. Full feedback and bandit feedback are
well understood by using a directed feedback graph [15]–[18].
Because of its limited feedback, the bandit setting is the tougher
one. Therefore, in this paper, we employ the multi-armed
bandit setting.
The application of the multi-armed bandit to the social learning
problem is not new and dates back to the work in [19].
There has been increasing work on social learning using the
multi-armed bandit ever since, formulated from different variants
of the multi-armed bandit. One important limitation of these works,
however, is that the loss distribution is assumed to be independent
and identically distributed [20]–[24]. To effectively capture the
behavior of humans in social networks, no restriction should be
placed on the loss distribution. Much of this work also lacks proper
theoretical analysis [20]–[22]. This is the reason for considering
the non-stochastic, or adversarial, multi-armed bandit, first
proposed in [10]. In this multi-armed bandit model, losses are
assigned by an adversary, which may be oblivious or non-oblivious.
In this paper, we use the oblivious-adversary, non-stochastic
multi-armed bandit model.
The contributions of this paper are as follows: (1) we
formulate the social network problem of a dynamic true state
as an online learning problem; (2) we propose a novel non-
stochastic bandit algorithm; and (3) we show that all agents
can learn the truth with sublinear regret.
II. MODEL
Consider a directed graph $G = (V, E)$, where $V = \{1, \cdots, N\}$ represents a set of agents in the network with $|V| = N$. We can assign a pair of non-negative scalar weights $\{a_{jk}, a_{kj}\} \in E$ to the edge joining agents $k \in V$ and $j \in V$. The network is said to be strongly connected if there is a directed path in both directions connecting any two agents and there is at least one self-loop, i.e., $a_{kk} > 0$. It is also possible to have $a_{jk} > 0$ and $a_{kj} = 0$. We define the neighborhood $N_k$ of agent $k$ as the set of agents connected to $k$. Note that agent $k$ also belongs to its own neighborhood. We define the adjacency matrix of a graph as a square matrix whose elements are the weights of the edges linking any two agents. We denote the adjacency matrix by $A$. If the sum of the elements in each column of the adjacency matrix is one, then we say the matrix is left-stochastic, i.e.,

$$a_{jk} \geq 0, \qquad \sum_{j} a_{jk} = 1. \tag{1}$$

A strongly connected network is left-stochastic and has a spectral radius of one, i.e., the eigenvalues are positive and bounded by one. A strongly connected network also obeys the Perron-Frobenius theorem: it has a single eigenvalue at one, while all other eigenvalues are strictly inside the unit disc [3].
In diffusion learning, all agents start from a prior belief. Agent $k$'s prior belief, for instance, is $\mu_{k,0}(\theta)$, which is uniformly distributed over $\theta \in \Theta$. Each agent updates its belief at each time $t \geq 1$, first by observing a realization of some random observational private signals from a finite state space (say $S_{k,t} \in Z_{k,t}$ for agent $k$), which are generated by some known likelihood function $L_k(\cdot|\theta_t^*)$ that depends on the true state $\theta_t^*$. In this paper, the likelihood function of all agents is given as $L(S_{k,t}|\theta)$, i.e.,

$$L(S_{k,t}|\theta) = \theta_t^* + \Gamma_{k,t}(\theta), \quad \forall k \in V, \; t \leq T, \; \theta \in \Theta \tag{2}$$

where $\theta_t^* \in \Theta$ is the true state at time $t$, chosen arbitrarily, and $\Gamma_{k,t}(\theta)$ is a dithered quantization noise. Therefore, the random private observation is a noisy observation of the underlying true state. The private signals are not fully informative, so each agent must cooperate with its neighbors. Each agent uses Bayes' rule to generate an intermediate belief as follows:

$$\psi_{k,t}(\theta) = \frac{\mu_{k,t-1}(\theta) L_k(S_{k,t}|\theta)}{\sum_{\theta' \in \Theta} \mu_{k,t-1}(\theta') L_k(S_{k,t}|\theta')} \tag{3}$$

where $\psi_{k,t}(\theta)$ is the intermediate belief of agent $k$ at round $t$. The agent uses the intermediate belief to compute an intermediate probability $p_{k,t}(\theta)$ as

$$p_{k,t}(\theta) = (1 - \gamma)\mu_{k,t-1}(\theta) + \gamma \psi_{k,t}(\theta). \tag{4}$$

Then, it cooperates with its neighbors to form the probability $P_{k,t}(\theta)$ from which it chooses a state $\theta_t$ at each time $t$. $P_{k,t}(\theta)$ is computed as

$$P_{k,t}(\theta) = \sum_{j \in N_k} a_{jk} p_{j,t}(\theta). \tag{5}$$
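The belief updates in (3)–(5) can be sketched for a single agent and round as follows (a minimal sketch; the likelihood values, neighbor probabilities, and edge weights are illustrative placeholders, not values from the paper):

```python
import numpy as np

def intermediate_belief(mu_prev, likelihood):
    """Bayes update (3): weight the previous belief by the likelihood
    of the observed private signal, then normalize over states."""
    psi = mu_prev * likelihood
    return psi / psi.sum()

M = 5                                    # number of states |Theta|
mu_prev = np.full(M, 1.0 / M)            # uniform prior belief
likelihood = np.array([0.1, 0.3, 0.9, 0.2, 0.1])  # illustrative L_k(S|theta)

psi = intermediate_belief(mu_prev, likelihood)
gamma = 0.2
p = (1 - gamma) * mu_prev + gamma * psi  # mixing step (4)

# Cooperation step (5): combine own and neighbors' intermediate
# probabilities with left-stochastic edge weights a_jk.
a = np.array([0.5, 0.3, 0.2])            # weights a_jk for j in N_k
p_neighbors = np.vstack([p,
                         np.full(M, 1.0 / M),
                         np.full(M, 1.0 / M)])
P = a @ p_neighbors                      # P_{k,t}(theta)
assert np.isclose(P.sum(), 1.0)          # still a probability distribution
```

Because each step mixes probability distributions with weights that sum to one, the output $P_{k,t}$ remains a valid distribution over the states.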
We now set up the network as an online learning problem, where the goal of each agent is to minimize the regret over the entire duration of play. The loss $l_{k,t}(\theta) \in [0, 1]$ is fixed for each agent $k$ before the start of the game. The true state has $l_{k,t}(\theta_t^*) = 0$, while every other state has $l_{k,t}(\theta) = 1$. The expected weighted regret of agent $k$, which compares the performance of our randomized algorithm to the choice of the best fixed state $\theta^\bullet$ for the entire duration $[1, \cdots, T]$, is

$$\mathbb{E}_{F_t}\left[\sum_{t=1}^{T} \sum_{\theta \in \Theta} p_{k,t}(\theta) l_{k,t}(\theta) - \sum_{t=1}^{T} l_{k,t}(\theta^\bullet)\right]. \tag{6}$$

This expectation is taken over the history of all observed private information from time $t = 1$ to time $t \leq T$, represented as the $\sigma$-algebra $F_t$. In Algorithm 1, we introduce the parameters $\eta$ and $\gamma$, which represent the learning rate and the exploration parameter, respectively.
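The regret in (6) can be computed from loss traces as follows (a sketch; the probability and loss arrays are illustrative, and the expectation over $F_t$ is replaced by a single realized trajectory):

```python
import numpy as np

def expected_weighted_regret(p, losses):
    """Regret (6) for one agent: cumulative expected loss of the
    randomized play minus the cumulative loss of the best fixed state.

    p:      (T, M) array; p[t] is the agent's distribution p_{k,t}
    losses: (T, M) array; losses[t, i] = l_{k,t}(theta_i)
    """
    expected_loss = np.sum(p * losses)       # sum_t sum_theta p * l
    best_fixed = np.min(losses.sum(axis=0))  # best single state in hindsight
    return expected_loss - best_fixed

# Illustrative 0/1 losses over T = 4 rounds and M = 3 states,
# with the agent always playing uniformly.
losses = np.array([[1, 0, 1],
                   [1, 0, 1],
                   [0, 1, 1],
                   [1, 0, 1]], dtype=float)
p = np.full((4, 3), 1.0 / 3)
print(expected_weighted_regret(p, losses))   # → 5/3 ≈ 1.667
```

Dividing this quantity by $T$ gives the regret per round plotted in the simulation section.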
Definition 1: The independence number of a graph $G$, denoted by $\alpha(G)$, is the largest size of any subset $B \subseteq V$ not connected by any edges.

Lemma 1: The expected value of $\hat{l}_{k,t}(\theta)$, which is an unbiased estimate of the true loss $l_{k,t}(\theta)$, is given as

$$\mathbb{E}_{\theta_t \sim P_{k,t}(\theta)}\left[\hat{l}_{k,t}(\theta)\right] = l_{k,t}(\theta) \quad \text{and} \quad \mathbb{E}_{\theta_t \sim P_{k,t}(\theta)}\left[\hat{l}_{k,t}(\theta)^2\right] = \frac{l_{k,t}(\theta)^2}{P_{k,t}(\theta)}. \tag{7}$$

Proof sketch:

$$\mathbb{E}_{\theta_t \sim P_{k,t}(\theta)}\left[\hat{l}_{k,t}(\theta)\right] = \frac{l_{k,t}(\theta)}{P_{k,t}(\theta)} P_{k,t}(\theta) = l_{k,t}(\theta),$$

$$\mathbb{E}_{\theta_t \sim P_{k,t}(\theta)}\left[\hat{l}_{k,t}(\theta)^2\right] = \left(\frac{l_{k,t}(\theta)}{P_{k,t}(\theta)}\right)^2 P_{k,t}(\theta) = \frac{l_{k,t}(\theta)^2}{P_{k,t}(\theta)}.$$
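Lemma 1 can be verified numerically by Monte Carlo simulation (a sketch with an illustrative sampling distribution and loss vector, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sampling distribution P_{k,t} over M = 4 states
# and an illustrative 0/1 loss vector.
P = np.array([0.4, 0.3, 0.2, 0.1])
losses = np.array([0.0, 1.0, 1.0, 1.0])
M, n = len(P), 200_000

# Draw theta_t ~ P and form the importance-weighted estimate
# l_hat(theta) = (l(theta) / P(theta)) * 1{theta = theta_t}.
draws = rng.choice(M, size=n, p=P)
l_hat = np.zeros((n, M))
l_hat[np.arange(n), draws] = losses[draws] / P[draws]

# First identity of Lemma 1: the empirical mean approaches l(theta).
print(l_hat.mean(axis=0))          # ≈ losses
# Second identity: the second moment approaches l(theta)^2 / P(theta).
print((l_hat ** 2).mean(axis=0))   # ≈ losses**2 / P
```

The estimator is unbiased precisely because the division by $P_{k,t}(\theta)$ cancels the probability of observing that state's loss.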
Lemma 2: In a strongly connected network with the cardinality of $\Theta$ equal to $M$, if all agents are self-aware such that $\sum_{\theta \in \Theta} p_{k,t}(\theta) \leq 1$ and $p_{k,t}(\theta) \geq \varepsilon$ for all $\theta \in \Theta$, $|\Theta| = M$, then

$$\sum_{\theta \in \Theta} \frac{p_{k,t}(\theta)}{P_{k,t}(\theta)} \leq 4\alpha \ln \frac{4M}{\alpha \varepsilon}, \qquad 0 < \varepsilon \leq \frac{1}{2}. \tag{8}$$

Proof: We employ the result of Lemma 5 in [18]. By using Definition 1, Lemma 1, and Lemma 2, we can show that all agents in a strongly connected network can learn and incur sublinear regret with our proposed algorithm.
Algorithm 1: Online Diffusion Learning for a Strongly Connected Network

Parameters: feedback graph $G = (V, E)$, where $V$ is the set of strongly connected agents and $E$ is the set of edges; learning rate $\eta > 0$; exploration parameter $\gamma \in (0, \frac{1}{2}]$.

Step 0: Initialize $\mu_{k,0}(\theta) = \frac{1}{M}$, $\forall \theta \in \Theta$.
For each round $t \in \{1, \cdots, T\}$:
  Step 1: Compute $p_{k,t}(\theta) = (1 - \gamma)\mu_{k,t-1}(\theta) + \gamma \psi_{k,t}(\theta)$ for each $\theta \in \Theta$.
  Step 2: Compute $P_{k,t}(\theta) = \sum_{j \in N_k} a_{jk} p_{j,t}(\theta)$.
  Step 3: Draw state $\theta_t \sim P_{k,t}$, and incur loss $l_{k,t}(\theta_t) \in [0, 1]$.
  Step 4: Update
    $$\hat{l}_{k,t}(\theta) = \frac{l_{k,t}(\theta)}{P_{k,t}(\theta)} \mathbb{I}\{\theta = \theta_t\}, \quad \forall \theta \in \Theta,$$
    $$\mu_{k,t}(\theta) = \frac{\mu_{k,t-1}(\theta) \exp(-\eta \hat{l}_{k,t}(\theta))}{\sum_{\theta' \in \Theta} \mu_{k,t-1}(\theta') \exp(-\eta \hat{l}_{k,t}(\theta'))}, \quad \forall \theta \in \Theta.$$
end
Theorem 1: The upper bound on the expected weighted regret of Algorithm 1 for a strongly connected network is $O\left(\alpha^{-1} M^2 T^{1/2} \ln M\right)$ when $\gamma = \min\left(8\alpha\eta, \frac{1}{2}\right)$ and $\eta = \left(\frac{M}{\alpha\sqrt{T}}\right)^2$.
Proof sketch: We follow the approach of the proof of Theorem 2 in [18], with some modifications that ensure that the sum of all beliefs is one. Moreover, we remove the restriction that some portion of the graph networks have losses $l_{k,t}(\theta) \leq \frac{1}{\eta}$, used in Lemma 4 in [18]. For the sake of brevity, however, we omit the proof.
III. PROPOSED ALGORITHM
The proposed algorithm is a non-stochastic algorithm that takes a graph network as input. The network geometry is not time-varying, but the true state of the network is time-varying in this paper. Step 0: The belief of each agent $\mu_{k,0}(\theta)$ over the states $\theta \in \Theta$ is initialized as a uniform distribution over $\Theta$ at the start of the algorithm. Step 1: At each round of the algorithm, the probability $p_{k,t}(\theta)$ is computed for all $\theta \in \Theta$. Computing $p_{k,t}(\theta)$ involves the exploration parameter $\gamma$, which trades off using the previous belief $\mu_{k,t-1}(\theta)$ against exploring a new state based on the likelihood function through $\psi_{k,t}(\theta)$. Step 2: Each agent then combines its own probability $p_{k,t}(\theta)$ with those of its neighbors, using the weights of the connecting edges, to obtain a weighted probability $P_{k,t}(\theta)$ over all states $\theta \in \Theta$. Step 3: From this weighted probability, the agent draws a state $\theta_t$ as its prediction of the true state $\theta_t^*$ at time $t$. Step 4: If the prediction is correct, the agent incurs a loss $l_{k,t}(\theta_t^*) = 0$; if the prediction is wrong, the agent incurs a loss $l_{k,t}(\theta) = 1$, $\theta \neq \theta_t^*$. The agent knows whether the state it chose is the true state based on the loss value it receives. However, the agent does not know the losses of the states it did not choose in that round of play. To update its belief, the agent needs the loss over all states. To achieve this, the algorithm computes an estimated loss $\hat{l}_{k,t}(\theta)$ over all states. Based on Lemma 1, the expectation of the estimated loss $\hat{l}_{k,t}(\theta)$ over $\theta_t$ gives the true loss $l_{k,t}(\theta)$. The algorithm then uses the estimated loss and the learning rate $\eta$ to update the belief probability $\mu_{k,t}(\theta)$ over all states. This is an exponential update in which states that incur losses have their belief probability reduced; it is normalized to keep the sum of all beliefs equal to one. The algorithm continues until the final round, or time slot, $T$.
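The steps above can be sketched as a multi-agent implementation (a sketch under the paper's setup; the likelihood callback, the 0/1 loss assignment, and the random true-state sequence are illustrative placeholders, so this run exercises the mechanics rather than reproducing the paper's figures):

```python
import numpy as np

def run_diffusion_bandit(A, M, T, eta, gamma, likelihood_fn, rng):
    """Sketch of Algorithm 1 for N agents over M states.

    A[j, k] = a_jk is the left-stochastic edge weight from agent j
    to agent k; likelihood_fn(t) returns an (N, M) array of the
    likelihoods L_k(S_{k,t} | theta) (a hypothetical callback here).
    """
    N = A.shape[0]
    mu = np.full((N, M), 1.0 / M)           # Step 0: uniform priors
    total_loss = np.zeros(N)
    for t in range(T):
        L = likelihood_fn(t)                # private-signal likelihoods
        psi = mu * L                        # Bayes update, eq. (3)
        psi /= psi.sum(axis=1, keepdims=True)
        p = (1 - gamma) * mu + gamma * psi  # Step 1, eq. (4)
        P = A.T @ p                         # Step 2, eq. (5)

        theta_true = rng.integers(M)        # arbitrary time-varying truth
        losses = np.ones(M)
        losses[theta_true] = 0.0            # true state has zero loss
        for k in range(N):
            Pk = P[k] / P[k].sum()
            theta_t = rng.choice(M, p=Pk)   # Step 3: draw a state
            total_loss[k] += losses[theta_t]
            # Step 4: importance-weighted loss estimate over all states,
            # then the normalized exponential belief update
            l_hat = np.zeros(M)
            l_hat[theta_t] = losses[theta_t] / Pk[theta_t]
            mu[k] *= np.exp(-eta * l_hat)
            mu[k] /= mu[k].sum()
    return total_loss

# Usage with the paper's parameters (likelihoods are placeholders):
rng = np.random.default_rng(1)
A = np.array([[0.2, 0.2, 0.8],
              [0.5, 0.4, 0.1],
              [0.3, 0.4, 0.1]])
noisy = lambda t: rng.uniform(0.1, 1.0, size=(3, 5))
loss = run_diffusion_bandit(A, M=5, T=1000, eta=0.025, gamma=0.2,
                            likelihood_fn=noisy, rng=rng)
```

Note that the cooperation step uses `A.T @ p` because $P_{k,t}(\theta) = \sum_j a_{jk} p_{j,t}(\theta)$ sums over the column $k$ of the left-stochastic adjacency matrix.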
IV. SIMULATION RESULTS

For our simulation, we use a strongly connected graph consisting of three agents. The adjacency matrix of this graph is

$$A = \begin{pmatrix} 0.2 & 0.2 & 0.8 \\ 0.5 & 0.4 & 0.1 \\ 0.3 & 0.4 & 0.1 \end{pmatrix}.$$
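The left-stochastic and spectral claims for this matrix can be checked numerically (a sketch using numpy):

```python
import numpy as np

A = np.array([[0.2, 0.2, 0.8],
              [0.5, 0.4, 0.1],
              [0.3, 0.4, 0.1]])

# Left-stochastic: every column sums to one.
assert np.allclose(A.sum(axis=0), 1.0)

# Perron-Frobenius structure: a single eigenvalue of modulus one,
# all other eigenvalues strictly inside the unit disc.
moduli = np.sort(np.abs(np.linalg.eigvals(A)))[::-1]
assert np.isclose(moduli[0], 1.0)
assert np.all(moduli[1:] < 1.0)
```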
[Figure: expected weighted regret versus number of rounds, for agents 1–3.]
Fig. 1. Expected weighted regret for agents 1, 2 and 3 for γ = 0.2.
Observe that the matrix is left-stochastic, i.e., the sum of the elements in each column is 1. The largest eigenvalue is one, and all other eigenvalues lie strictly within the unit disc. Hence, the matrix obeys the Perron-Frobenius theorem. Assume five states, i.e., $\Theta = \{\theta_1, \cdots, \theta_5\}$; any of these states can be the true state at each time $t$, and the true state is arbitrarily time-varying. The goal of the agents is to predict this true state. We use $\alpha = 1$, $T = 1{,}000$, $\eta = 0.025$, and $\gamma = 0.2$ for the simulation shown in Figures 1 and 2.
The likelihood function for all agents over all states at each
time is computed from [2] with quantized Gaussian noise of
mean zero and variances randomly chosen from one to five.
The algorithm runs 100,000 times, and the average is plotted in all figures. Observe in Figure 1 that the expected regret across all agents begins to converge as the number of rounds increases.
This is because the agents learn from their mistakes and make
better predictions in subsequent rounds. Cooperation among
agents also helps them ensure that they learn at almost the
same rate. Figure 2 shows the decay of the regret with time.
[Figure: expected weighted regret per round versus number of rounds, for agents 1–3.]
Fig. 2. Sublinear regret for agents 1, 2 and 3 for γ = 0.2.
As time goes to infinity, the sublinear regret for all three agents
goes to zero.
If we change the value of γfrom 0.2 to 0.5 while keeping
all other simulation parameters unchanged, we obtain Figure
3. The regret in Figure 3 is higher than in Figure 1 because
[Figure: expected weighted regret versus number of rounds, for agents 1–3.]
Fig. 3. Expected weighted regret for agents 1, 2 and 3 for γ = 0.5.
the agents do not make good use of their own history. That is,
they fail to maximally exploit their past beliefs about the true
state before exploring a new state. This can be seen from Step 1 of Algorithm 1. The decay of the regret is correspondingly slower in Figure 4 than in Figure 2.
[Figure: expected weighted regret per round versus number of rounds, for agents 1–3.]
Fig. 4. Sublinear regret for agents 1, 2 and 3 for γ = 0.5.
In Figure 5, we use an adjacency matrix $A_1$ in which the self-loop of each agent has a higher weight than the weights of neighboring agents. The matrix is still left-stochastic with the largest eigenvalue equal to 1. It can be seen that the algorithm performs better, and the expected weighted regret is much lower, when γ = 0.2. The matrix $A_1$ is represented as:

$$A_1 = \begin{pmatrix} 0.5 & 0.3 & 0.2 \\ 0.3 & 0.5 & 0.2 \\ 0.2 & 0.2 & 0.6 \end{pmatrix}.$$
[Figure: expected weighted regret versus number of rounds, for agents 1–3.]
Fig. 5. Expected weighted regret for agents 1, 2 and 3 for γ = 0.2 with adjacency matrix $A_1$.
V. RELATED WORK
Prediction with expert advice [11], [25]–[27] is a general abstract framework for studying sequential prediction problems, formulated as repeated games between a player and an adversary. For $T$-round sequential learning with $K$ arms (or actions), the Hedge algorithm [25] is known to achieve the optimal regret bound of order $O(\sqrt{T \log K})$ in the full feedback setting, while the INF algorithm [28] achieves the optimal regret bound of order $O(\sqrt{TK})$ in the bandit setting. The famous EXP3 algorithm achieves a slightly worse regret of $O(\sqrt{TK \log K})$ in the bandit setting. The bandit setting is a special case of partial monitoring [13], [14].
The feedback graph model for representing games over an undirected graph was first proposed in [15]. In this setting, an edge from action $i$ to action $j$ implies that the losses of actions $i$ and $j$ are revealed when action $i$ is played. This is a middle ground between the full feedback setting, where the losses of all actions are revealed, and the bandit setting, where only the loss of the played action is revealed. The regret bound of [15] is $O(\sqrt{T \alpha \log K})$ and can be achieved by the ELP algorithm, a variant of the EXP3 algorithm. Here, $\alpha$ represents the independence number of the graph. The authors in [16], [29] considered the case of directed graphs and proposed the EXP3-DOM and EXP3-SET algorithms. The upper bound on the regret of EXP3-SET is not tight. An improved regret bound is obtained by EXP3-IX, proposed in [30].
The authors in [17] fully characterized the minimax regret of online learning problems defined over feedback graphs. They categorized the set of all feedback graphs into three distinct sets: strongly observable, weakly observable, and unobservable. The minimax regret for strongly observable feedback graphs is on the order of $\tilde{\Theta}(\alpha^{1/2} T^{1/2})$; for weakly observable feedback graphs, it is on the order of $\tilde{\Theta}(\delta^{1/3} T^{2/3})$, where $\delta$ is the domination number; and for unobservable graphs, where learning is difficult, it is on the order of $\Theta(T)$. Social learning problems are well formulated using directed feedback graphs.
VI. CONCLUSION
This paper investigated how agents in a graph network can
predict their true state in situations where the true state of the
agents keeps changing arbitrarily with time. We formulated
the problem as an online learning problem and proposed
a non-stochastic multi-armed bandit algorithm for strongly
connected agents to predict their arbitrarily time-varying true
state. We showed theoretically and through simulation that
strongly connected agents can predict their true state with
sublinear regret over the entire duration of play. Our findings
are applicable to the behavior of humans in social networks.
REFERENCES
[1] H. Salami, B. Ying, and A. H. Sayed, “Social learning over weakly
connected graphs,” IEEE Transactions on Signal and Information Pro-
cessing over Networks, vol. 3, no. 2, pp. 222–238, 2017.
[2] A. Jadbabaie, P. Molavi, A. Sandroni, and A. Tahbaz-Salehi, “Non-
bayesian social learning,” Games and Economic Behavior, vol. 76, no. 1,
pp. 210–225, 2012.
[3] A. H. Sayed et al., “Adaptation, learning, and optimization over net-
works,” Foundations and Trends® in Machine Learning, vol. 7, no. 4-5,
pp. 311–801, 2014.
[4] M. H. DeGroot, “Reaching a consensus,” Journal of the American
Statistical Association, vol. 69, no. 345, pp. 118–121, 1974.
[5] A. H. Sayed, “Adaptive networks,” Proceedings of the IEEE, vol. 102,
no. 4, pp. 460–497, 2014.
[6] S. Shahrampour, S. Rakhlin, and A. Jadbabaie, “Online learning of
dynamic parameters in social networks,” in Advances in Neural Infor-
mation Processing Systems, 2013.
[7] R. M. Frongillo, G. Schoenebeck, and O. Tamuz, “Social learning in
a changing world,” in International Workshop on Internet and Network
Economics. Springer, 2011, pp. 146–157.
[8] K. Dasaratha, B. Golub, and N. Hak, “Social learning in a dynamic
environment,” Available at SSRN 3097505, 2018.
[9] G. Moscarini, M. Ottaviani, and L. Smith, “Social learning in a changing
world,” Economic Theory, vol. 11, no. 3, pp. 657–665, 1998.
[10] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire, “Gambling
in a rigged casino: The adversarial multi-armed bandit problem,” in
Proceedings of IEEE 36th Annual Foundations of Computer Science.
IEEE, 1995, pp. 322–331.
[11] N. Littlestone and M. K. Warmuth, “The weighted majority algorithm,”
Information and computation, vol. 108, no. 2, pp. 212–261, 1994.
[12] V. Vovk, “A game of prediction with expert advice,” Journal of Computer
and System Sciences, vol. 56, no. 2, pp. 153–173, 1998.
[13] A. György, T. Linder, G. Lugosi, and G. Ottucsák, “The on-line shortest
path problem under partial monitoring,” Journal of Machine Learning
Research, vol. 8, pp. 2369–2403, 2007.
[14] N. Cesa-Bianchi, G. Lugosi, and G. Stoltz, “Regret minimization under
partial monitoring,” Mathematics of Operations Research, vol. 31, no. 3,
pp. 562–580, 2006.
[15] S. Mannor and O. Shamir, “From bandits to experts: On the value
of side-observations,” in Advances in Neural Information Processing
Systems, 2011, pp. 684–692.
[16] N. Alon, N. Cesa-Bianchi, C. Gentile, S. Mannor, Y. Mansour, and
O. Shamir, “Nonstochastic multi-armed bandits with graph-structured
feedback,” SIAM Journal on Computing, vol. 46, no. 6, pp. 1785–1826,
2017.
[17] N. Alon, N. Cesa-Bianchi, O. Dekel, and T. Koren, “Online learning with
feedback graphs: Beyond bandits,” in Annual Conference on Learning
Theory, vol. 40. Microtome Publishing, 2015.
[18] ——, “Online learning with feedback graphs: Beyond bandits,” arXiv
preprint arXiv:1502.07617, 2015.
[19] K. H. Schlag, “Why imitate, and if so, how?: A boundedly rational
approach to multi-armed bandits,” Journal of Economic Theory, vol. 78,
no. 1, pp. 130–156, 1998.
[20] A. Zhou, Y. Feng, P. Zhou, and J. Xu, “Social intimacy based iot services
mining of massive data,” in 2017 IEEE International Conference on
Data Mining Workshops (ICDMW). IEEE, 2017, pp. 641–648.
[21] T. T. Nguyen and H. W. Lauw, “Dynamic clustering of contextual
multi-armed bandits,” in Proceedings of the 23rd ACM International
Conference on Conference on Information and Knowledge Management.
ACM, 2014, pp. 1959–1962.
[22] S. L. Scott, “A modern bayesian look at the multi-armed bandit,” Applied
Stochastic Models in Business and Industry, vol. 26, no. 6, pp. 639–658,
2010.
[23] R. K. Kolla, K. Jagannathan, and A. Gopalan, “Collaborative learning
of stochastic bandits over a social network,” IEEE/ACM Transactions
on Networking (TON), vol. 26, no. 4, pp. 1782–1795, 2018.
[24] S. Buccapatnam, A. Eryilmaz, and N. B. Shroff, “Multi-armed bandits
in the presence of side observations in social networks,” in 52nd IEEE
Conference on Decision and Control. IEEE, 2013, pp. 7309–7314.
[25] Y. Freund and R. E. Schapire, “A decision-theoretic generalization of
on-line learning and an application to boosting,” Journal of Computer
and System Sciences, vol. 55, no. 1, pp. 119–139, 1997.
[26] N. Cesa-Bianchi, Y. Freund, D. Haussler, D. P. Helmbold, R. E. Schapire,
and M. K. Warmuth, “How to use expert advice,” Journal of the ACM
(JACM), vol. 44, no. 3, pp. 427–485, 1997.
[27] N. Cesa-Bianchi and G. Lugosi, Prediction, Learning, and Games.
Cambridge University Press, 2006.
[28] J.-Y. Audibert and S. Bubeck, “Minimax policies for adversarial and
stochastic bandits,” in COLT, 2009, pp. 217–226.
[29] N. Alon, N. Cesa-Bianchi, C. Gentile, and Y. Mansour, “From bandits to
experts: A tale of domination and independence,” in Advances in Neural
Information Processing Systems, 2013, pp. 1610–1618.
[30] T. Kocák, G. Neu, M. Valko, and R. Munos, “Efficient learning by
implicit exploration in bandit problems with side observations,” in
Advances in Neural Information Processing Systems, 2014, pp. 613–621.