Achieving Context Awareness and Intelligence in
Distributed Cognitive Radio Networks: A Payoff
Propagation Approach
Kok-Lim Alvin Yau1, Peter Komisarczuk2,1 and Paul D. Teal1
1School of Engineering and Computer Science, Victoria University of Wellington, New Zealand
2School of Computing and Technology, Thames Valley University, UK
kok-lim.yau@ecs.vuw.ac.nz, peter.komisarczuk@tvu.ac.uk, paul.teal@ecs.vuw.ac.nz
Abstract-Cognitive Radio (CR) is a next-generation wireless
communication system that exploits underutilized licensed
spectrum to optimize the utilization of the overall radio
spectrum. A Distributed Cognitive Radio Network (DCRN) is a
distributed wireless network established by a number of CR
hosts in the absence of fixed network infrastructure. Context
awareness and intelligence are key characteristics of CR
networks that enable the CR hosts to be aware of their
operating environment in order to make an optimal joint
action. This research aims to achieve context awareness and
intelligence in DCRN using our novel Locally-Confined Payoff
Propagation (LCPP), which is an important feature in Multi-
Agent Reinforcement Learning (MARL). The LCPP
mechanism is suitable to be applied in most applications in
DCRN that require context awareness and intelligence such as
scheduling, congestion control, as well as Dynamic Channel
Selection (DCS), which is the focus of this paper. Simulation
results show that the LCPP mechanism is a promising
approach. The LCPP mechanism converges to an optimal joint
action including networks with cyclic topology. Fast
convergence is possible. The investigation in this paper serve as
an important foundation for future work in this research field.
Keywords-Cognitive radio networks; multi-agent reinforcement learning; payoff propagation
I. INTRODUCTION
Studies sponsored by the Federal Communications
Commission (FCC) pointed out that the current static
spectrum allocation has led to overall low spectrum
utilization where up to 70% of the licensed spectrum
remains unused (called white space) at any one time even in
a crowded area [1]. The white space can be defined by time
and frequency at a particular location. Cognitive Radio (CR)
is a next-generation wireless communication system that
enables unlicensed spectrum users or Secondary Users
(SUs) to use the white space of licensed users’ or Primary
Users' (PUs') spectrum conditional on the interference to the
PU being below an acceptable level. A Distributed
Cognitive Radio Network (DCRN) is a distributed wireless
network comprised of a number of SUs that interact with
each other in a common operating environment in the
absence of fixed network infrastructure such as a base
station.
Context awareness and intelligence, as achieved by the
popular conceptual cognition cycle [2], are key
characteristics of CR networks. Through context awareness,
an SU is aware of its operating environment; and through
intelligence, an SU utilizes the sensed and inferred high
quality white space in an efficient manner without following
a static pre-defined policy. The key terms in this paper are:
1) joint action; and 2) optimal joint action. A joint action is
defined as the actions taken by all the SUs throughout the
DCRN. An optimal joint action is the joint action that
provides ideal and optimal network-wide performance. The
purpose of context awareness and intelligence is to achieve
an optimal joint action in a distributed manner. In this paper,
our objective is to investigate our novel Locally-Confined
Payoff Propagation (LCPP), which is an important
component in Multi-Agent Reinforcement Learning
(MARL) [3], and to apply it to the design of the Dynamic
Channel Selection (DCS) scheme in DCRN. In [3], the
MARL approach has been successfully shown to enable the
learning agents to observe, learn, and carry out their
respective actions as part of the optimal joint action. The
LCPP mechanism is a message exchange mechanism that
helps the SUs to communicate and compute their own
actions.
Our contribution in this paper is to investigate the LCPP mechanism as an approach to achieving context awareness and intelligence in DCRNs. This approach is potentially useful in addressing existing drawbacks associated with one of the widely-used approaches, namely Game Theory (GT). We show that the LCPP mechanism converges to an optimal joint action, including in networks with cyclic topology, and that fast convergence is possible.
The remainder of this paper is organized as follows.
Section II describes the scenario under consideration. Section
III discusses the research problem and challenges. Section IV
presents our system design. Section V presents MARL and
our LCPP mechanism. Section VI presents simulation
results and discussions. Section VII discusses open issues.
Finally, we provide conclusions and future work.
II. A SCENARIO FOR DCS IN DCRN
In DCS, a problem arises as to what is the best strategy to
select an available channel among the licensed channels for
data transmission from an SU node given that the objective
is to maximize network-wide throughput.

Figure 1. Single-hop DCRN. Solid lines indicate communication links; dotted lines indicate interference links.

The DCRN, as
shown in Figure 1, is modeled using an undirected graph G = (V, E), where V is the set of SUs. T_i is the transmitter and R_i is the receiver of an SU communication node pair i; and E is the set of edges between the SUs, including the communication links and interference links. There are U = |V|/2 SU communication node pairs or agents. We refer to a CR host as a single SU node, and to an SU communication node pair as an agent henceforth. Each agent maintains a single set of learned outcomes or knowledge obtained through message exchange; this is necessary because the transmitter and receiver must choose a common channel for data transmission in DCS. An interference edge in E_I exists between non-communicating agents if they are located within transmission range of each other. The agents are distributed in a uniform and random manner in a square region. There are K PUs, PU = [PU_1, ..., PU_K], and each of them uses one of the K distinctive channels of frequency F = [F_1, ..., F_K]. In Figure 1, K = 2 and U = 3. Each channel frequency is characterized by various levels of PU Utilization Level (PUL) and Packet Error Rate (PER); thus, we consider heterogeneous channels. For a particular channel, higher levels of PUL indicate higher levels of PU activity and hence a smaller amount of white space, while higher levels of PER indicate higher levels of packet drop due to interference, channel selective fading, path loss, and other factors. Spatial reuse is possible, where multiple agents are allowed to share a particular channel. The agents infer the PUL and PER of each channel in a distributed manner, and each selects a channel for data transmission individually in order to achieve an optimal joint action so that network-wide throughput is maximized.
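For illustration, the following minimal Python sketch (not part of the paper's C simulator) captures the system model described above: U transmitter-receiver pairs acting as agents, K heterogeneous licensed channels characterized by PUL and PER, and random interference edges. The class and function names (Channel, Agent, make_network) are ours.

import random
from dataclasses import dataclass, field

@dataclass
class Channel:
    freq_id: int
    pul: float   # PU Utilization Level: fraction of time the PU occupies the channel
    per: float   # Packet Error Rate observed on this channel

@dataclass
class Agent:
    agent_id: int                                  # one SU transmitter-receiver pair
    neighbours: set = field(default_factory=set)   # agents sharing an interference edge

def make_network(num_agents=10, num_channels=3, num_interference_links=10, seed=0):
    """Build agents, heterogeneous channels, and random interference edges."""
    rng = random.Random(seed)
    channels = [Channel(k, pul=rng.uniform(0.1, 0.9), per=rng.uniform(0.0, 0.3))
                for k in range(num_channels)]
    agents = [Agent(i) for i in range(num_agents)]
    # Add undirected interference edges between distinct agent pairs.
    while sum(len(a.neighbours) for a in agents) // 2 < num_interference_links:
        i, j = rng.sample(range(num_agents), 2)
        agents[i].neighbours.add(j)
        agents[j].neighbours.add(i)
    return agents, channels

agents, channels = make_network()
print(len(agents), "agents,", len(channels), "channels")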
III. RESEARCH PROBLEM AND CHALLENGES
Game Theory has been the most popular approach for
achieving context awareness and intelligence in CR
networks. GT studies the interaction of multiple SUs whose
objective is to maximize their individual local rewards. To
date, research has focused on one-shot or repetitive games, such as the matrix game and the potential game, and these games have been successfully applied in various
applications as shown in [4] and [5]. There are several
known issues in GT:
• Mis-coordination, where the SUs are not able to converge to an optimal joint action because of severe negative rewards [6]. The SUs converge to a safe joint action instead.
• The SUs might converge to a sub-optimal joint action when multiple optimal joint actions exist [6].
• GT as applied to CR so far requires a complete set of information to compute the Nash equilibrium in DCRN; hence its extensive and successful usage in centralized CR networks.
• GT assumes that all SUs react rationally as game theorists.
• GT assumes a single type of utility function throughout the DCRN, and hence a homogeneous learning mechanism in all the SUs.
Although GT has been successfully applied in CR
networks [4], [5], MARL is a good alternative which
addresses the aforementioned issues associated with GT [3].
However, it should be noted that MARL is a new research
area [7], hence many open issues in MARL are yet to be
addressed including the LCPP mechanism.
In a multi-agent environment, there are three main
challenges. Firstly, an agent's action is dependent on the
other payoff-optimizing agents' actions. Secondly, all agents
must converge to an optimal joint action that provides good
network-wide performance. Thirdly, all agents must infer
the channel heterogeneous characteristics including PUL
and PER that might be different among the agents. In other
words, from the perspective of each agent, the challenge is: "How does an agent infer its channel characteristics and choose its own action such that the joint action converges to an optimal joint action?" The traditional single-agent reinforcement learning approach enables each agent to learn and choose its action in a unilateral fashion, regardless of its neighbour agents' actions in a multi-agent environment. This may cause instability or oscillation because each agent switches its action from time to time [3]; hence we choose to use MARL. In this paper, we show that, using the LCPP
mechanism, which is an important component in MARL, the
SUs converge to an optimal joint action with respect to DCS in
a distributed manner in DCRN. Our focus in this paper is the
payoff message exchange mechanism. In our solution, a
local learning mechanism, such as MARL, is available at
each SU. The local learning mechanism provides each agent
the local reward that an agent could gain for choosing an
action; further explanation is provided in the next section.
IV. SYSTEM DESIGN
In this paper, we model each SU communication node
pair as a learning agent, as shown in Figure 2, because the
transmitter and receiver share a common learned outcome or
knowledge. At a particular time instant, the agent observes
its local operating environment in the form of a local
reward. Due to the limited sensing capability at each agent,
it can only observe its own local reward. The agent
improves the global reward in the next time instant through
carrying out a proper action. The MARL approach is comprised of
two components: Local Learning (LL) and the LCPP
mechanism. The LL mechanism provides knowledge on the
operating environment comprised of multiple agents through
observing the consequences of its prior action in the form of
local reward [3].

Figure 2. Agents (or SU communication node pairs) and their environment.

The LCPP mechanism is a message exchange mechanism to help the learning engine embedded
in each agent to communicate and compute its own action as
part of the optimal joint action using the knowledge
provided by the LL mechanism. In other words, the LCPP
mechanism is a means of communication for the LL
mechanism embedded in each agent. As time progresses, the agents learn to carry out the proper actions to maximize the accumulated global reward. In DCS, the LL mechanism is used to learn the heterogeneous channel characteristics, namely the PUL and PER levels. The global reward is a linear combination or summation of all the local rewards at the agents, while the global payoff is the equivalent summation of the local payoffs generated by each SU. The LCPP mechanism maximizes both the global reward and the global payoff in order to achieve an optimal joint action. Depending on the application, such as DCS, the reward and payoff values represent distinctive network performance metrics such as throughput and successful data packet transmission rate. Thus, maximizing the global reward and the global payoff provides network-wide performance enhancement.
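The agent structure described in this section can be sketched as follows. This is an illustrative simplification under our own naming (DCSAgent, observe_local_reward, global_reward), not the paper's implementation; it only shows how the LL component folds the observed local reward into the agent's knowledge and how the global reward is formed as a summation of local rewards.

class DCSAgent:
    def __init__(self, agent_id, num_channels):
        self.agent_id = agent_id
        self.num_channels = num_channels
        # Local Q-table maintained by the LL mechanism (one entry per channel here;
        # the paper conditions Q on neighbours' actions as well -- see Section V).
        self.q = [0.0] * num_channels
        self.inbox = {}          # payoff messages received from neighbours (LCPP)
        self.action = 0          # currently selected channel

    def observe_local_reward(self, reward, alpha=0.1):
        # LL: fold the observed consequence of the previous action into the Q-table.
        self.q[self.action] += alpha * (reward - self.q[self.action])

def global_reward(agents, local_rewards):
    # Network-wide performance metric: linear combination (sum) of local rewards.
    return sum(local_rewards[a.agent_id] for a in agents)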
V. THE MARL APPROACH
We first describe the Coordination Graph (CG), followed by the LL mechanism, and finally the LCPP mechanism. This paper discusses MARL and the PP mechanism in our context and application scenario; the reader is referred to [3] and [7] for more details.
A. Coordination Graph
In Figure 3, there are U = |V|/2 = 4 agents in the DCRN, and each agent (i.e., a T1-R1 pair) is represented by a single node. An interference edge in E_I exists between a pair of neighbouring agents.

Figure 3. A four-agent graph G. Each agent represents an SU communication node pair. The edges are interference edges that exist between the agents that interfere with each other.

The graph G can be decomposed into smaller, local CGs, each of which is a local view of G for one agent. For instance, the CG of agent 1 is comprised of agents 1, 2, and 3, hence the representation of the local reward or Q-value Q_{i,t}(a_{i,t}, a_{j∈Γ_i,t}) = Q_{1,t}(a_{1,t}, a_{2,t}, a_{3,t}), where a_i ∈ A. The CG defines collaborative relationships, and each relationship is a local payoff message exchange mechanism among the agents. Each agent runs an LL mechanism independently to update its own Q-values. The approximate global Q-value Q_t(a_t) at time t is a linear combination or summation of all the local Q-values at each agent as follows:

Q_t(a_t) = \sum_{i=1}^{U} Q_{i,t}(a_{i,t}, a_{j \in \Gamma_i, t})        (1)

where Γ_i represents all the neighbours of agent i.
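A minimal sketch of equation (1) follows, assuming each agent stores its local Q-value Q_i(a_i, a_{Γ_i}) in a dictionary keyed by the local joint action; the representation and the toy four-agent coordination graph are ours, not the paper's.

import itertools, random

def local_joint_action(i, neighbours, joint_action):
    """Extract (a_i, (a_j for j in Gamma_i)) from the network-wide joint action."""
    return joint_action[i], tuple(joint_action[j] for j in sorted(neighbours))

def global_q(q_tables, neighbourhoods, joint_action):
    """Equation (1): approximate global Q-value = sum of local Q-values."""
    return sum(q_tables[i][local_joint_action(i, neighbourhoods[i], joint_action)]
               for i in range(len(q_tables)))

# Toy coordination graph with U = 4 agents and K = 2 channels (as in Figure 3).
neighbourhoods = {0: {1, 2}, 1: {0, 3}, 2: {0, 3}, 3: {1, 2}}
K = 2
rng = random.Random(1)
q_tables = []
for i in range(4):
    table = {}
    for a_i in range(K):
        for a_nbrs in itertools.product(range(K), repeat=len(neighbourhoods[i])):
            table[(a_i, a_nbrs)] = rng.uniform(5, 15)   # Q-range used in Section VI
    q_tables.append(table)

print(global_q(q_tables, neighbourhoods, joint_action=(0, 1, 1, 0)))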
B. Local Learning Mechanism
The Q-value Q_{i,t}(a_{i,t}, a_{j∈Γ_i,t}), which is the learned knowledge and is maintained in a lookup Q-table with |A| entries at agent i, represents the local reward that the agent can gain for choosing an action a_i ∈ A = F. For example, in DCS, the Q-value represents the throughput, and it is dependent on the local PUL, PER and the joint action a_t taken by all the agents. The joint action affects the Q-value due to the dependency of actions among the agents; for example, two neighbour agents that choose a particular action, specifically a channel, might increase their contention level, and hence reduce their respective Q-values for that action.
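The paper does not specify the LL update rule; the sketch below assumes a standard exponentially weighted (Q-learning style) update of the local Q-value towards the observed throughput-derived reward, which is one plausible realization rather than the paper's actual mechanism.

def ll_update(q_table, local_joint_action, observed_reward, alpha=0.1):
    """Move Q_i(a_i, a_{Gamma_i}) towards the reward observed after acting."""
    old = q_table.get(local_joint_action, 0.0)
    q_table[local_joint_action] = old + alpha * (observed_reward - old)
    return q_table[local_joint_action]

# Example: agent 0 transmitted on channel 1 while its neighbours used (0, 1)
# and measured a throughput-derived reward of 12.4.
q0 = {}
ll_update(q0, (1, (0, 1)), 12.4)
print(q0)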
C. Locally-Confined Payoff Propagation (LCPP) Algorithm

The pseudo-code of our LCPP algorithm is shown in Algorithm 1, and it is embedded in each agent.

Algorithm 1. Pseudo-code of the LCPP algorithm at agent i.

initialize μij = μji = 0 for all j ∈ Γi, gi = 0
{Task 1: Broadcast payoff message to neighbour agents.
 Task 2: Select optimal action.}
if ( my turn to select an optimal action ) then
    R = uniform(0,1)
    if ( R ≤ ε ) then
        a = uniform(1,K)              {select random action}
    else
        for ( all neighbour agents j ∈ Γi )
            compute μij(aj,t)         {Equation (2)}
            if ( μij(aj,t) ≠ μij(aj,t-1) ) then
                include μij(aj,t) in broadcast message
            end if
        end for
        compute a                     {select optimal action, Equations (3) and (4)}
    end if
    return a
end if
{Task 3: Process payoff message.}
if ( receive message ) then
    if ( message == μij(aj,t) ) then
        store this information
    end if
end if

Each agent i constantly sends a reward value or payoff message μ_ij(a_{j,t}) to its neighbour agents j ∈ Γ_i over the edges, as shown in Figure 4. Channel selection is based on an agent's two-hop neighbour agents in DCRNs, hence each μ_ij(a_{j,t}) contains two pieces of information: 1) its own Q-value; and 2) its one-hop neighbours' local Q-values, except that from agent j, as follows:

\mu_{ij}(a_{j,t}) = [ Q_{i,t}(a_{i,t}, a_{j \in \Gamma_i, t}), \sum_{k \in \Gamma_i \setminus j} Q_{k,t}(a_{k,t}, a_{l \in \Gamma_k, t}) ]        (2)

where Γ_i ∖ j represents all the neighbours of agent i except agent j. Using μ_ij(a_{j,t}), agent i informs agent j about the Q-values of itself and its neighbour agents when agent j is taking its own action, so that agent j can evaluate its own action.

Figure 4. Illustration of four agents and payoff message exchange.

The payoff messages are exchanged among the agents until convergence to a fixed optimal point occurs within a finite number of iterations. Before convergence, the messages are an estimation of the fixed optimal point as all incoming messages are yet to converge. Each agent selects its own optimal action, which is part of the optimal joint action, to maximize the local payoff as shown next:

g_i(a_{i,t}) = \max_{a \in A} [ Q_{i,t}(a, a_{j \in \Gamma_i, t}) + \sum_{j \in \Gamma_i} ( Q^{i}_{j,a,t}(a_{j,t}, a_{k \in \Gamma_j, t}) + \sum_{k \in \Gamma_j \setminus i} Q^{i}_{k,a,t}(a_{k,t}, a_{l \in \Gamma_k, t}) ) ]        (3)
where Q^i_{j,a,t}, which is kept at agent i, indicates the Q-value of agent j when agent i is taking action a.

Each agent i determines its optimal action individually as follows:

a^{*}_{i,t} = \arg\max_{a \in A} g_i(a)        (4)

The approximate global payoff g(a_t) at time t is a linear combination or summation of all the local payoffs as follows:

g(a_t) = \sum_{i=1}^{U} [ Q_{i,t}(a_{i,t}, a_{j \in \Gamma_i, t}) + \sum_{j \in \Gamma_i} ( Q^{i}_{j,a_{i,t}}(a_{j,t}, a_{k \in \Gamma_j, t}) + \sum_{k \in \Gamma_j \setminus i} Q^{i}_{k,a_{i,t}}(a_{k,t}, a_{l \in \Gamma_k, t}) ) ]        (5)

The agents would reach a fixed optimal point after a finite number of iterations. Note the difference between the global Q-value, Σ_i Q_i in (1), and the global payoff, Σ_i [ Q_i + Σ_j μ_ji ] in (5). The global Q-value is the total reward received by all the agents in the network, while the global payoff is the total of the local Q-values and the local payoff values exchanged among the agents. Both the global Q-value (1) and the global payoff (5) are performance metrics for the LCPP mechanism, and both converge as the agents converge to an optimal joint action.

Selecting the optimal action at all times does not cater for the actions that are never chosen. Exploitation chooses the optimal action; exploration chooses the other possible actions in A in order to improve the estimates of their Q-values. In the ε-greedy approach [8], an agent performs exploration with a small probability ε, and exploitation with probability 1−ε.
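To make equations (2)-(5) and the ε-greedy rule concrete, the following simplified sketch implements one LCPP iteration. It is illustrative only: the data structures and the stand-in Q-value function q_of are ours, messages are recomputed from the current Q-values at every step, and local Q-values are evaluated at the neighbours' currently selected channels.

import random

def payoff_message(i, j, K, q_of, neighbourhoods, actions):
    """Equation (2): for every channel a_j the receiver j might take, mu_ij(a_j)
    carries (i) agent i's own Q-value and (ii) the summed Q-values of i's one-hop
    neighbours excluding j, both evaluated with j playing a_j."""
    msg = {}
    for a_j in range(K):
        trial = dict(actions); trial[j] = a_j
        own = q_of(i, trial[i], trial)
        others = sum(q_of(k, trial[k], trial) for k in neighbourhoods[i] - {j})
        msg[a_j] = (own, others)
    return msg

def local_payoff(i, a, q_of, neighbourhoods, actions, inbox):
    """Equations (3) and (4): payoff agent i would obtain by selecting channel a,
    using its own Q-value plus the stored message entries mu_ji(a)."""
    trial = dict(actions); trial[i] = a
    received = sum(own + others
                   for own, others in (inbox[i][j][a] for j in inbox[i]))
    return q_of(i, a, trial) + received

def lcpp_step(agents, K, q_of, neighbourhoods, actions, inbox, eps=0.0125, rng=random):
    for i in agents:
        # Task 1: broadcast payoff messages to all neighbour agents.
        for j in neighbourhoods[i]:
            inbox[j][i] = payoff_message(i, j, K, q_of, neighbourhoods, actions)
        # Task 2: epsilon-greedy selection of the next operating channel.
        if rng.random() < eps:
            actions[i] = rng.randrange(K)          # exploration
        else:                                      # exploitation
            actions[i] = max(range(K), key=lambda a: local_payoff(
                i, a, q_of, neighbourhoods, actions, inbox))
    # Equation (5): approximate global payoff of the current joint action.
    return sum(local_payoff(i, actions[i], q_of, neighbourhoods, actions, inbox)
               for i in agents)

# Toy run: four agents on a ring (as in Figure 4), K = 3 channels, and a stand-in
# Q-value that penalizes choosing the same channel as an interfering neighbour.
neighbourhoods = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}
agents = list(neighbourhoods)
def q_of(i, a, acts):
    clashes = sum(1 for j in neighbourhoods[i] if acts[j] == a)
    return 15.0 - 5.0 * clashes
actions = {i: 0 for i in agents}
inbox = {i: {} for i in agents}
for _ in range(50):
    g = lcpp_step(agents, 3, q_of, neighbourhoods, actions, inbox)
print("joint action:", actions, "approximate global payoff:", g)

In this toy setting the agents settle on a conflict-free channel assignment on the ring; with the real LL mechanism, q_of would be replaced by the learned Q-table of Section V.B.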
D. Difference between the Original PP mechanism and the
Locally-Confined PP mechanism
The original PP mechanism [3] has been shown to be
useful in many applications; however, our LCPP mechanism
provides two advantages so that it is suitable to be applied in
DCS in DCRN. Firstly, LCPP is suitable for DCRNs with
cyclic topology because old payoff messages are not added
to the current payoff message. As explained in [3], the payoff message μ_ij(a_{j,t}) increases without bound in a cyclic topology if old payoff values are added into current payoff value computation. The original PP mechanism
requires the agents to compute global payoff in the entire
network from time to time to alleviate this issue. Since old
payoff messages are not added to the current payoff
message in LCPP, our simulation results show that the
global payoff value does not increase without bound. Hence,
LCPP does not require the agents to compute global payoff
in the entire network from time to time. Secondly, in DCS,
an agent selects its channel based on the channel selections
of its two-hop neighbour agents. An agent uses Q-values
from its two-hop neighbour agents only in payoff value
computation in LCPP, while in [3], an agent uses Q-values
from agents that are multiple hops away.
The LCPP mechanism thus modifies the original PP mechanism so that it is suitable to be applied in DCRN. Since the original PP mechanism cannot be directly applied in DCRN, we do not investigate the performance comparison between the original PP mechanism and the LCPP mechanism.
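The boundedness argument can be illustrated with a toy contrast (our simplification, not the paper's code): an accumulation-style update that folds previously received messages back into new ones grows without bound around a cycle, whereas the LCPP-style message is rebuilt from the current Q-values only and therefore stays bounded.

def accumulate_style(prev_incoming, own_q):
    # Accumulation flavour: the outgoing value feeds on incoming values around the cycle.
    return own_q + sum(prev_incoming)

def lcpp_style(current_q_values):
    # LCPP flavour: the outgoing value is rebuilt from current local Q-values only.
    return sum(current_q_values)

msg = 0.0
for step in range(5):
    msg = accumulate_style([msg], own_q=10.0)   # 10, 20, 30, ... keeps growing
print("accumulated:", msg, "| lcpp:", lcpp_style([10.0, 10.0]))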
VI. SIMULATION RESULTS AND DISCUSSIONS
Our objective is to enable each SU to select its channel (action) for data transmission in DCS. The channel selections (joint action) by all the SUs should provide an optimal network-wide throughput (global Q-value or global reward). In this paper, simulations were performed using C programming, rather than a network simulator, so that the simulation results are not dependent on the underlying Medium Access Control (MAC) protocol.
The simulation scenario is discussed in Section II.
Graphical representation of the DCS scheme is shown in
Figure 5. We perform simulations using G = (V, E) with U = |V|/2 = 10 agents and three levels of density with different numbers of interference links in the entire network, |E_I| = {Low = 5, Medium = 10, High = 15}, and cyclic
topology exists. There are K = 3 channels. The Q-value characterizes the channel heterogeneity properties for each channel, including PUL and PER. For a particular channel, the Q-values are different among the agents as each of them observes different levels of PUL and PER. In short, the Q-values are independent and identically distributed (i.i.d.) among the agents and the channels. In the simulation, each Q-value has a range 5 ≤ Q_i(a_i, a_{j∈Γ_i}) ≤ 15. Higher Q-value indicates better reward, and hence higher throughput. At each iteration t, communication node pair i chooses a channel out of the K = 3 channels for data transmission, and it explores with probability ε = 0.0125. A single iteration corresponds to the conventional four-way handshaking mechanism that covers payoff message exchange in RTS and CTS, data packet and ACK transmission.
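The stated simulation parameters can be mirrored in a short setup sketch. The paper's simulator was written in C; this fragment only reflects the parameters above, with our own helper name (setup) and a random topology generator as an assumption.

import random

def setup(density="Medium", seed=0):
    rng = random.Random(seed)
    U, K, eps = 10, 3, 0.0125
    num_links = {"Low": 5, "Medium": 10, "High": 15}[density]
    # Random interference edges among the U agents (cycles may occur).
    neighbourhoods = {i: set() for i in range(U)}
    while sum(len(v) for v in neighbourhoods.values()) // 2 < num_links:
        i, j = rng.sample(range(U), 2)
        neighbourhoods[i].add(j); neighbourhoods[j].add(i)
    # Per-agent, per-channel Q-values drawn independently in the stated range [5, 15].
    q = {(i, k): rng.uniform(5.0, 15.0) for i in range(U) for k in range(K)}
    return neighbourhoods, q, eps

neighbourhoods, q, eps = setup("High")
print(len(neighbourhoods), "agents,",
      sum(map(len, neighbourhoods.values())) // 2, "interference links")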
Figure 6 shows that the global payoff of the low, medium
and high density networks increases and converges to a
fixed point as the number of iterations advances. The high-
density network converges in approximately 9000 iterations,
while medium and low-density networks converge in
approximately 5000 iterations. The global payoff fluctuates
once in a while due to occasional exploration. The global
payoff converges in cyclic topology. The high-density
network has the highest global payoff because the payoff
depends on the number of neighbour agents.
Figure 7 shows that the difference between the optimal
and instantaneous global Q-values of the low, medium and
high density networks decreases to approximately zero. This
indicates the convergence to an optimal joint action. The
instantaneous global Q-value is calculated using (1); and the
optimal global Q-value is the linear combination or
summation of all optimal local Q-values at each agent. The
difference value fluctuates once in a while due to occasional
exploration.
Figure 8 shows that the global payoff of a medium-
density network increases and converges to a fixed point as
the number of iterations advances in the presence of varying
Q-values for the first 2000, 4000 and 6000 iterations. At
each iteration, each Q-value varies with probability 0.5. The results indicate that the stable Q-values provided by the LL mechanism are the key factor for convergence.

Figure 5. Graphical representation of the DCS scheme. Bold lines indicate data transmission between a communication node pair (an agent) over a chosen channel; dotted lines indicate interference links.

Figure 6. Global payoff for low, medium and high density networks.

Figure 7. Difference between the optimal global Q-value and the instantaneous global Q-value for low, medium and high density networks.

Figure 8. Global payoff for a medium density network with varying Q-values for the first 2000, 4000 and 6000 iterations.
Figure 9 shows that the global payoff of the high density network increases and converges to a fixed point as the number of iterations advances.

Figure 9. Global payoff for different ε in a high-density network.

The network converges in approximately 9000 iterations when ε = 0.0125, 11000 iterations when ε = 0.1, and 50 iterations when ε = 0.3. Thus, higher ε values improve convergence rate. However, the global payoff fluctuates when ε = 0.1 and ε = 0.3 due to excessive exploration. This is why ε = 0.0125 has been chosen in our simulations although it has the lowest convergence rate. This key finding shows that it is possible to achieve fast convergence using high values of ε and subsequently to achieve stability using lower values of ε upon reaching the stability state.
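This finding suggests a two-stage exploration schedule: a large ε until the global payoff stabilizes, then the small ε used above. The sketch below is a hedged illustration of such a schedule; the window size and tolerance are illustrative choices of ours and are not taken from the paper.

def epsilon_schedule(payoff_history, eps_fast=0.3, eps_stable=0.0125,
                     window=500, tolerance=0.01):
    """Return the exploration probability for the next iteration."""
    if len(payoff_history) < window:
        return eps_fast
    recent = payoff_history[-window:]
    spread = (max(recent) - min(recent)) / (abs(max(recent)) + 1e-9)
    # Switch to the small epsilon once the recent global payoff is nearly flat.
    return eps_stable if spread < tolerance else eps_fast

print(epsilon_schedule([100.0] * 600))            # stable history -> 0.0125
print(epsilon_schedule([100.0, 150.0] * 300))     # fluctuating history -> 0.3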
VII. OPEN ISSUES
The open issues are relevant to the MARL, LL and LCPP mechanisms respectively. MARL is a new research area [7] suitable for tackling the drawbacks of GT, and these open issues are subject to further investigation. In MARL, it is necessary to detect and respond to irrational agents that take random or suboptimal actions. In LL, there are two issues. Firstly, a mechanism is needed to provide stable Q-values; this includes detecting and responding to any fluctuations in the Q-values, which is critical so that the LCPP mechanism converges to an optimal joint action (see Figure 8). Secondly, a mechanism is needed for heterogeneous learning among the agents, in which each agent represents its Q-values with different performance metrics in a particular DCRN to enable heterogeneous learning entities. In LCPP, there are four issues. Firstly, a mechanism to improve the convergence rate of the global payoff (see Figure 6) and the global Q-value (see Figure 7). Secondly, investigation into the impact of the dynamics of the operating environment on ε, as well as a mechanism to measure the level of dynamics for dynamic adjustment of ε. Thirdly, investigation into the tradeoff between convergence rate and stability through the adjustment of ε. Fourthly, comparison of the amount of overhead between the LCPP and GT approaches.
VIII. CONCLUSIONS AND FUTURE WORK
In this paper, we have presented our Locally-Confined
Payoff Propagation (LCPP) mechanism, which is an
important feature in the Multi-Agent Reinforcement
Learning (MARL) approach to achieve optimal joint action
in a Distributed Cognitive Radio Network (DCRN). We
have shown that the LCPP mechanism converges to an optimal joint action, including in networks with cyclic topology. Fast convergence is possible through the adjustment of the exploration probability ε. In our future
work, we will investigate the open issues raised in this
paper. The investigations in this paper serve as the basis for
future research in DCRN.
REFERENCES
[1] FCC Spectrum Policy Task Force, "Report of the Spectrum Efficiency Working Group," Federal Communications Commission, Tech. Rep. 02-155, Nov. 2002.
[2] J. Mitola III and G. Q. Maguire, "Cognitive radio: Making software radios more personal," IEEE Personal Communications, vol. 6, pp. 13-18, 1999.
[3] J. R. Kok and N. Vlassis, "Collaborative multiagent reinforcement learning by payoff propagation," Journal of Machine Learning Research, vol. 7, pp. 1789-1828, Sep. 2006.
[4] I. Malanchini, M. Cesana, and N. Gatti, "On spectrum selection games in cognitive radio networks," in Proc. IEEE Global Telecommunications Conference (GLOBECOM), Honolulu, HI, Dec. 2009.
[5] H. Li and Z. Han, "Competitive spectrum access in cognitive radio networks: Graphical game and learning," in Proc. IEEE Wireless Communications and Networking Conference (WCNC), Sydney, Australia, Apr. 2010.
[6] S. Kapetanakis, D. Kudenko, and M. J. A. Strens, "Reinforcement learning approaches to coordination in cooperative multi-agent systems," in Adaptive Agents and Multi-Agent Systems, LNCS, Springer, Berlin, 2003.
[7] L. Busoniu, R. Babuska, and B. De Schutter, "A comprehensive survey of multiagent reinforcement learning," IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 38, no. 2, pp. 156-172, Mar. 2008.
[8] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, 1998.