
Achieving Context Awareness and Intelligence in Distributed Cognitive Radio Networks: A Payoff Propagation Approach

March 2011


DOI:10.1109/WAINA.2011.47

Source
DBLP

Conference: 25th IEEE International Conference on Advanced Information Networking and Applications Workshops, WAINA 2011, Biopolis, Singapore, March 22-25, 2011

Authors:

Kok-Lim Alvin Yau

Sunway University

Peter Komisarczuk

Royal Holloway, University of London

Paul Teal

Victoria University of Wellington



Achieving Context Awareness and Intelligence in

Distributed Cognitive Radio Networks: A Payoff

Propagation Approach

Kok-Lim Alvin Yau1, Peter Komisarczuk2,1 and Paul D. Teal1

1School of Engineering and Computer Science, Victoria University of Wellington, New Zealand

2School of Computing and Technology, Thames Valley University, UK

kok-lim.yau@ecs.vuw.ac.nz, peter.komisarczuk@tvu.ac.uk, paul.teal@ecs.vuw.ac.nz

Abstract-Cognitive Radio (CR) is a next-generation wireless communication system that exploits underutilized licensed spectrum to optimize the utilization of the overall radio spectrum. A Distributed Cognitive Radio Network (DCRN) is a distributed wireless network established by a number of CR hosts in the absence of fixed network infrastructure. Context awareness and intelligence are key characteristics of CR networks that enable the CR hosts to be aware of their operating environment in order to make an optimal joint action. This research aims to achieve context awareness and intelligence in DCRN using our novel Locally-Confined Payoff Propagation (LCPP), which is an important feature in Multi-Agent Reinforcement Learning (MARL). The LCPP mechanism is suitable for most applications in DCRN that require context awareness and intelligence, such as scheduling, congestion control, and Dynamic Channel Selection (DCS), which is the focus of this paper. Simulation results show that the LCPP mechanism is a promising approach. The LCPP mechanism converges to an optimal joint action, including in networks with cyclic topology. Fast convergence is possible. The investigation in this paper serves as an important foundation for future work in this research field.

Keywords-Cognitive radio networks; multi-agent reinforcement learning; payoff propagation

I. INTRODUCTION

Studies sponsored by the Federal Communications

Commission (FCC) pointed out that the current static

spectrum allocation has led to overall low spectrum

utilization where up to 70% of the licensed spectrum

remains unused (called white space) at any one time even in

a crowded area [1]. The white space can be defined by time

and frequency at a particular location. Cognitive Radio (CR)

is a next-generation wireless communication system that

enables unlicensed spectrum users or Secondary Users

(SUs) to use the white space of licensed users’ or Primary

Users' (PUs') spectrum conditional on the interference to the

PU being below an acceptable level. A Distributed

Cognitive Radio Network (DCRN) is a distributed wireless

network comprised of a number of SUs that interact with

each other in a common operating environment in the

absence of fixed network infrastructure such as a base

station.

Context awareness and intelligence, as achieved by the

popular conceptual cognition cycle [2], are key

characteristics of CR networks. Through context awareness,

an SU is aware of its operating environment; and through

intelligence, an SU utilizes the sensed and inferred high

quality white space in an efficient manner without following

a static pre-defined policy. The key terms in this paper are:

1) joint action; and 2) optimal joint action. A joint action is

defined as the actions taken by all the SUs throughout the

DCRN. An optimal joint action is the joint action that

provides ideal and optimal network-wide performance. The

purpose of context awareness and intelligence is to achieve

an optimal joint action in a distributed manner. In this paper,

our objective is to investigate our novel Locally-Confined

Payoff Propagation (LCPP), which is an important

component in Multi-Agent Reinforcement Learning

(MARL) [3], and it is applied to design the Dynamic

Channel Selection (DCS) scheme in DCRN. In [3], the

MARL approach has been successfully shown to enable the

learning agents to observe, learn, and carry out their

respective actions as part of the optimal joint action. The

LCPP mechanism is a message exchange mechanism that

helps the SUs to communicate and compute their own

actions.

Our contribution in this paper is to investigate the LCPP mechanism as an approach to achieve context awareness and intelligence in DCRNs. This approach is potentially useful in addressing existing drawbacks associated with one of the widely used approaches, namely Game Theory (GT). We show that the LCPP mechanism converges to an optimal joint action, including in networks with cyclic topology, and that fast convergence is possible.

The remainder of this paper is organized as follows. Section II provides the scenario under consideration. Section III discusses the research problem and challenges. Section IV presents our system design. Section V presents MARL and our LCPP mechanism. Section VI presents simulation results and discussions. Section VII discusses open issues. Finally, we provide conclusions and future work.

II. A SCENARIO FOR DCS IN DCRN

In DCS, a problem arises as to what is the best strategy to select an available channel among the licensed channels for data transmission from an SU node given that the objective is to maximize network-wide throughput. The DCRN, as shown in Figure 1, is modeled using an undirected graph $G = (V, E)$, where $V$ is the set of SUs, each SU communication node pair consists of a transmitter and a receiver, and $E$ is the set of edges between the SUs including the communication links and interference links. There are $U = |V|/2$ SU communication node pairs or agents. We refer to a CR host as a single SU node, and to an SU communication node pair as an agent henceforth. Each agent maintains a single set of learned outcomes or knowledge obtained through message exchange; this is necessary because the transmitter and receiver must choose a common channel for data transmission in DCS. An interference edge exists between non-communicating agents if they are located within transmission range of each other. The agents are distributed in a uniform and random manner in a square region. There are $K$ PUs, $PU = [PU_1, \ldots, PU_K]$, and each of them uses one of the $K$ distinctive channels of frequency $F = [F_1, \ldots, F_K]$. In Figure 1, $K = 2$ and $U = 3$. Each channel frequency is characterized by various levels of PU Utilization Level (PUL) and Packet Error Rate (PER); thus, we consider heterogeneous channels. For a particular channel, higher levels of PUL indicate higher levels of PU activity and hence fewer white spaces, while higher levels of PER indicate higher levels of packet drop rate due to interference, channel selective fading, path loss, and other factors. Spatial reuse is possible where multiple agents are allowed to share a particular channel. The agents infer PUL and PER in each channel in a distributed manner, and select a channel for data transmission individually in order to achieve an optimal joint action so that network-wide throughput is maximized.

Figure 1. Single-hop DCRN with PUs PU1 and PU2 on channels F1 and F2. Solid line indicates communication link, while dotted line indicates interference link.
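The graph model above can be sketched in a few lines. This is a minimal illustration rather than the authors' simulator: the function name, the square side length, the transmission range and the seed are all hypothetical, and each agent is reduced to a single point instead of a transmitter and receiver pair.

```python
import math
import random


def build_interference_graph(num_agents, side, tx_range, seed=0):
    """Place agents uniformly at random in a square and link agents in range.

    All parameter names and values here are illustrative assumptions.
    """
    rng = random.Random(seed)
    # One (x, y) position per agent (SU communication node pair).
    positions = [(rng.uniform(0, side), rng.uniform(0, side))
                 for _ in range(num_agents)]
    neighbours = {i: set() for i in range(num_agents)}
    for i in range(num_agents):
        for j in range(i + 1, num_agents):
            if math.dist(positions[i], positions[j]) <= tx_range:
                # Undirected interference edge between agents i and j.
                neighbours[i].add(j)
                neighbours[j].add(i)
    return positions, neighbours


positions, neighbours = build_interference_graph(
    num_agents=3, side=100.0, tx_range=60.0)
for agent, nbrs in neighbours.items():
    print(agent, sorted(nbrs))
```

The neighbour sets returned here play the role of the interference edges: two agents are neighbours exactly when they are within transmission range of each other.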

III. RESEARCH PROBLEM AND CHALLENGES

Game Theory has been the most popular approach for

achieving context awareness and intelligence in CR

networks. GT studies the interaction of multiple SUs whose

objective is to maximize their individual local rewards. To

date, research has been focusing on one-shot or repetitive

games, such as the matrix game, potential game, etc. and

these games have been successfully applied in various

applications as shown in [4] and [5]. There are several

known issues in GT:

• Mis-coordination, where the SUs are not able to converge to an optimal joint action because of severe negative rewards [6]. The SUs converge to a safe joint action instead.

• The SUs might converge to a sub-optimal joint action when multiple optimal joint actions exist [6].

• GT as applied to CR so far requires a complete set of information to compute the Nash equilibrium in DCRN; hence its extensive and successful usage in centralized CR networks.

• GT assumes that all SUs react rationally as game theorists.

• GT assumes a single type of utility function throughout the DCRN, and hence a homogeneous learning mechanism in all the SUs.

Although GT has been successfully applied in CR

networks [4], [5], MARL is a good alternative which

addresses the aforementioned issues associated with GT [3].

However, it should be noted that MARL is a new research

area [7], hence many open issues in MARL are yet to be

addressed including the LCPP mechanism.

In a multi-agent environment, there are three main

challenges. Firstly, an agent's action is dependent on the

other payoff-optimizing agents' actions. Secondly, all agents

must converge to an optimal joint action that provides good

network-wide performance. Thirdly, all agents must infer

the channel heterogeneous characteristics including PUL

and PER that might be different among the agents. In other

words, from the perspective of each agent, the challenge is

“How does an agent infer its channel characteristics and

choose its own action such that the joint action converges to

an optimal joint action?” The traditional single-agent-based reinforcement learning approach enables each agent to learn and choose its action in a unilateral fashion regardless of its neighbour agents' actions in a multi-agent environment. This may cause instability or oscillation because the agent switches its action from time to time [3]; hence we choose to use MARL. In this paper, we show that, using the LCPP mechanism, which is an important component in MARL, the SUs converge to an optimal joint action with respect to DCS in a distributed manner in DCRN. Our focus in this paper is the payoff message exchange mechanism. In our solution, a local learning mechanism, such as MARL, is available at each SU. The local learning mechanism provides each agent with the local reward that the agent could gain for choosing an action; further explanation is provided in the next section.

IV. SYSTEM DESIGN

In this paper, we model each SU communication node pair as a learning agent, as shown in Figure 2, because the transmitter and receiver share a common learned outcome or knowledge. At a particular time instant, the agent observes its local operating environment in the form of a local reward. Due to the limited sensing capability at each agent, it can only observe its own local reward. The agent improves the global reward in the next time instant through carrying out a proper action. MARL is comprised of two components: Local Learning (LL) and the LCPP mechanism. The LL mechanism provides knowledge on the operating environment comprised of multiple agents through observing the consequences of its prior action in the form of local reward [3]. The LCPP mechanism is a message exchange mechanism to help the learning engine embedded in each agent to communicate and compute its own action as part of the optimal joint action using the knowledge provided by the LL mechanism. In other words, the LCPP mechanism is a means of communication for the LL mechanism embedded in each agent. As time progresses, the agents learn to carry out the proper action to maximize the accumulated global rewards. In DCS, the LL mechanism is used to learn the heterogeneous channel characteristics, namely the PUL and PER levels. The global reward is a linear combination or summation of all the local rewards at each agent, while the global payoff is the equivalent for local payoffs generated by each SU. The LCPP mechanism maximizes both the global reward and global payoff in order to achieve an optimal joint action. Depending on the application, such as DCS, the reward and payoff values indicate distinctive network performance metrics such as throughput and successful data packet transmission rate. Thus, maximizing the global reward and the global payoff provides network-wide performance enhancement.

Figure 2. Agents (or SU communication node pairs) and their environment. Each agent (agent i, agent j) observes its local environment, learns, and decides an action from time t to t+1, receiving a reward in return; the agents exchange payoff messages.

V. THE MARL APPROACH

We first describe the Coordination Graph (CG), followed by the LL mechanism, and finally the LCPP mechanism. This paper discusses MARL and the PP mechanism in our context and application scenario, and the reader is referred to [3] and [7] for more details.

A. Coordination Graph

In Figure 3, there are $U = |V|/2 = 4$ agents in the DCRN, and each agent (i.e. an SU communication node pair) is represented by a single node. An interference edge exists between a pair of neighbouring agents. The graph $G$ can be decomposed into smaller and local CGs which are each a local view of $G$ for each agent. For instance, the CG of agent 1 is comprised of agents 1, 2, and 3, hence the local reward or Q-value $Q_{i,t}(a_{i,t}, a_{j \in \Gamma(i),t})$ is represented as $Q_{1,t}(a_{1,t}, a_{2,t}, a_{3,t})$, where $a_i \in A$. The CG defines collaborative relationships, and each relationship is a local payoff message exchange mechanism among the agents. Each agent runs an LL mechanism independently to update its own Q-values. The approximate global Q-value $Q_t(a_t)$ at time $t$ is a linear combination or summation of all the local Q-values at each agent, as given in (1), where $\Gamma(i)$ represents all the neighbours of agent $i$.
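As a small worked illustration of (1), the sketch below sums local Q-values over the four-agent graph of Figure 3. The neighbour sets are read off the Q-value labels in that figure; the local reward function is purely hypothetical (a unit reward reduced by 0.5 for every neighbour on the same channel) and stands in for whatever the LL mechanism has learned.

```python
# Neighbour sets Gamma(i), read off the Q-value labels of Figure 3.
NEIGHBOURS = {1: (2, 3), 2: (1, 4), 3: (1, 4), 4: (2, 3)}


def local_q(agent, joint_action):
    """Hypothetical local reward: 1.0 minus 0.5 per neighbour on the same channel."""
    own = joint_action[agent]
    clashes = sum(1 for j in NEIGHBOURS[agent] if joint_action[j] == own)
    return 1.0 - 0.5 * clashes


def global_q(joint_action):
    """Equation (1): the global Q-value is the sum of all local Q-values."""
    return sum(local_q(i, joint_action) for i in NEIGHBOURS)


# Joint action in which no two neighbouring agents share a channel.
print(global_q({1: 0, 2: 1, 3: 1, 4: 0}))
# Joint action in which every agent clashes with both of its neighbours.
print(global_q({1: 0, 2: 0, 3: 0, 4: 0}))
```

Each local Q-value depends only on the agent's own action and its neighbours' actions, which is what allows the global value to be evaluated from local views.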

B. Local Learning Mechanism

The Q-value $Q_{i,t}(a_{i,t}, a_{j \in \Gamma(i),t})$, which is the learned knowledge and is maintained in a lookup Q-table with $|A|$ entries at agent $i$, represents the local reward that the agent can gain for choosing an action $a_i \in A = F$. For example, in DCS, the Q-value represents the throughput and it is dependent on the local PUL, PER and the joint action taken by all the agents. The joint action affects the Q-value due to the dependency of actions among the agents; for example, two neighbour agents that choose a particular action, specifically a channel, might increase their contention level, and hence reduce their respective Q-values for the action.

C. Locally-Confined Payoff Propagation (LCPP)

Algorithm

The pseudo-code of our LCPP algorithm is shown in Algorithm 1, and it is embedded in each agent. Each agent $i$ constantly sends a reward value or payoff message $\mu_{ij}(a_{j,t})$ to its neighbour agents $j \in \Gamma(i)$ over the edges as shown in Figure 4. Channel selection is based on an agent's two-hop neighbour agents in DCRNs, hence each $\mu_{ij}(a_{j,t})$ contains two pieces of information: 1) its own Q-value; and 2) its one-hop neighbours' local Q-value except that from agent $j$, as given in (2), where $\Gamma(i) \setminus j$ represents all the neighbours of agent $i$ except agent $j$. Using $\mu_{ij}(a_{j,t})$, agent $i$ informs agent $j$ about the Q-values of itself and its neighbour agents when agent $j$ is taking its own action so that agent $j$ can evaluate its own action.
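The two-part payoff message of (2) can be sketched as follows. This is an illustrative reading under stated assumptions: the function name and the numeric Q-values are hypothetical, the neighbour sets are those of the four-agent graph in Figure 3, and each agent's current local Q-value is taken as already known from the LL mechanism.

```python
# Neighbour sets Gamma(i) of the four-agent graph in Figure 3.
NEIGHBOURS = {1: (2, 3), 2: (1, 4), 3: (1, 4), 4: (2, 3)}


def payoff_message(i, j, q_values, neighbours):
    """Message from agent i to agent j, following (2).

    Returns (Q_i, sum of Q_k over neighbours k of i, excluding j).
    """
    others = sum(q_values[k] for k in neighbours[i] if k != j)
    return (q_values[i], others)


# Hypothetical current local Q-values of the four agents.
q_values = {1: 4.0, 2: 2.5, 3: -1.0, 4: 3.0}

# Agent 1 tells agent 2 its own Q-value plus that of its other neighbour (agent 3).
print(payoff_message(1, 2, q_values, NEIGHBOURS))
# Agent 4 tells agent 3 its own Q-value plus that of its other neighbour (agent 2).
print(payoff_message(4, 3, q_values, NEIGHBOURS))
```

Because the receiver's own Q-value is left out of the sum, a message never feeds an agent's value back to itself, and the information it carries stays within two hops.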

Figure 3. A four-agent graph G. Each agent represents an SU communication node pair. The edges are interference edges that exist between the agents that interfere with each other. The local Q-values shown at the nodes are Q1(a1,a2,a3), Q2(a1,a2,a4), Q3(a1,a3,a4) and Q4(a2,a3,a4).

Figure 4. Illustration of four agents and payoff message exchange.

$$Q_t(a_t) = \sum_{i=1}^{U} Q_{i,t}\big(a_{i,t}, a_{j \in \Gamma(i),t}\big) \tag{1}$$

$$\mu_{ij}(a_{j,t}) = \Big[\, Q_{i,t}\big(a_{i,t}, a_{j \in \Gamma(i),t}\big),\ \sum_{k \in \Gamma(i) \setminus j} Q_{k,t}\big(a_{k,t}, a_{l \in \Gamma(k),t}\big) \Big] \tag{2}$$

The payoff messages are exchanged among the agents until convergence to a fixed optimal point occurs within a finite number of iterations. Before convergence, the messages are an estimation of the fixed optimal point as all incoming messages are yet to converge. Each agent selects its own optimal action, which is part of the optimal joint action, to maximize the local payoff given in (3), where $Q^{i}_{j,a,t}$, which is kept at agent $i$, indicates the Q-value of agent $j$ when agent $i$ is taking action $a$.

Each agent $i$ determines its optimal action individually as given in (4). The approximate global payoff $g(a_t)$ at time $t$ is a linear combination or summation of all local payoffs as given in (5). The agents would reach a fixed optimal point after a finite number of iterations. Note the difference between the global Q-value, $\sum_i Q_i$ in (1), and the global payoff, $\sum_i [Q_i + \sum_j \mu_{ji}]$ in (5). The global Q-value is the total reward received by all the agents in the network, while the global payoff is the total local Q-value and local payoff value exchanged among the agents. Both the global Q-value in (1) and the global payoff in (5) are the performance metrics for the LCPP mechanism, and both converge as the agents reach an optimal joint action.

Selecting the optimal action at all times does not cater for the actions that are never chosen. Exploitation chooses the optimal action. Exploration chooses the other possible actions in $A$ in order to improve the estimates of the other Q-values. In the $\varepsilon$-greedy approach [8], an agent performs exploration with small probability $\varepsilon$, and exploitation with probability $1 - \varepsilon$.
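The action selection step can be sketched as below, combining the local payoff of (3), the argmax of (4) and the exploration rule just described. This is a hedged reading, not the authors' C implementation: the data layout (`own_q`, `stored`), the function names and all numeric values are assumptions made for illustration.

```python
import random


def local_payoff(action, own_q, stored):
    """Local payoff of one candidate action, following the structure of (3).

    own_q[a]     : agent i's own Q-value for action a
    stored[j][a] : (Q_j, sum of Q_k) kept at agent i for neighbour j,
                   valid when agent i takes action a
    """
    total = own_q[action]
    for per_action in stored.values():
        q_j, q_others = per_action[action]
        total += q_j + q_others
    return total


def select_action(own_q, stored, epsilon, rng):
    """Explore with probability epsilon, otherwise take the argmax as in (4)."""
    actions = list(own_q)
    if rng.random() <= epsilon:
        return rng.choice(actions)          # exploration: random channel
    return max(actions, key=lambda a: local_payoff(a, own_q, stored))


# Hypothetical values for one agent with two neighbours (2 and 3) and K = 3 channels.
own_q = {0: 1.0, 1: 2.0, 2: 0.5}
stored = {
    2: {0: (3.0, 1.0), 1: (0.5, 0.0), 2: (1.0, 1.0)},
    3: {0: (0.0, 0.5), 1: (1.0, 0.0), 2: (2.0, 2.0)},
}
print(select_action(own_q, stored, epsilon=0.0125, rng=random.Random(0)))
```

Note that the greedy choice is not the channel with the best own Q-value (channel 1 here) but the one that maximizes the sum over the agent and its reported neighbourhood.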

D. Difference between the Original PP mechanism and the

Locally-Confined PP mechanism

The original PP mechanism [3] has been shown to be useful in many applications; however, our LCPP mechanism provides two advantages that make it suitable for DCS in DCRN. Firstly, LCPP is suitable for DCRNs with cyclic topology because old payoff messages are not added to the current payoff message. As explained in [3], the payoff message $\mu_{ij}(a_{j,t})$ increases without bound in a cyclic topology if old payoff values are added into the current payoff value computation. The original PP mechanism requires the agents to compute the global payoff in the entire network from time to time to alleviate this issue. Since old payoff messages are not added to the current payoff message in LCPP, our simulation results show that the global payoff value does not increase without bound. Hence, LCPP does not require the agents to compute the global payoff in the entire network from time to time. Secondly, in DCS, an agent selects its channel based on the channel selections of its two-hop neighbour agents. An agent uses Q-values from its two-hop neighbour agents only in payoff value computation in LCPP, while in [3], an agent uses Q-values from agents that are multiple hops away.

The LCPP mechanism modifies the original PP mechanism so that it is suitable to be applied in DCRN. Since the original PP mechanism cannot be directly applied in DCRN, we do not compare the performance of the original PP mechanism with that of the LCPP mechanism.

Algorithm 1. Pseudo-code of the LCPP algorithm at agent i.

initialize μij = μji = 0 for all neighbour agents j, gi = 0
{Task: 1. Broadcast payoff message to neighbour agents
       2. Select optimal action}
if ( my turn to select an optimal action )
    R = uniform(0,1)
    if ( R ≤ ε ) then
        a = uniform(1,K) {Select random action}
    else
        for ( all neighbour agents )
            compute μij(aj,t) {Equation (2)}
            if ( μij(aj,t) ≠ μij(aj,t-1) ) then
                include μij(aj,t) in broadcast message
                compute a = a*i,t {Select optimal action, Equations (3) and (4)}
            end if
        end for
    end if
    return a
end if
{Task: 3. Process payoff message}
if ( receive message )
    if ( message == μij(aj,t) ) then store this information
    end if
end if

$$g_i(a_{i,t}) = \max_{a \in A} \Big[\, Q_{i,t}\big(a, a_{j \in \Gamma(i),t}\big) + \sum_{j \in \Gamma(i)} \Big( Q^{i}_{j,a,t}\big(a_{j,t}, a_{k \in \Gamma(j),t}\big) + \sum_{k \in \Gamma(j) \setminus i} Q^{i}_{k,a,t}\big(a_{k,t}, a_{l \in \Gamma(k),t}\big) \Big) \Big] \tag{3}$$

$$a^{*}_{i,t} = \operatorname*{argmax}_{a \in A}\, g_i(a) \tag{4}$$

$$g(a_t) = \sum_{i=1}^{U} \Big[\, Q_{i,t}\big(a_{i,t}, a_{j \in \Gamma(i),t}\big) + \sum_{j \in \Gamma(i)} \Big( Q^{i}_{j,a_{i},t}\big(a_{j,t}, a_{k \in \Gamma(j),t}\big) + \sum_{k \in \Gamma(j) \setminus i} Q^{i}_{k,a_{i},t}\big(a_{k,t}, a_{l \in \Gamma(k),t}\big) \Big) \Big] \tag{5}$$

VI. SIMULATION RESULTS AND DISCUSSIONS

Our objective is to enable each SU to select its channel (action) for data transmission in DCS. The channel selections (joint action) by all the SUs provide an optimal network-wide throughput (global Q-value or global reward). In this paper, simulations were performed using C programming, rather than a network simulator, so that the simulation results are not dependent on the underlying Medium Access Control (MAC) protocol.

The simulation scenario is discussed in Section II. A graphical representation of the DCS scheme is shown in Figure 5. We perform simulation using $G = (V, E)$ with $U = |V|/2 = 10$ and three levels of density with different numbers of interference links in the entire network, $|E_I| = \{\text{Low} = 5, \text{Medium} = 10, \text{High} = 15\}$, and cyclic topology exists. There are $K = 3$ channels. The Q-value characterizes the channel heterogeneity properties for each channel including PUL and PER. For a particular channel, the Q-values are different among the agents as each of them observes different levels of PUL and PER. In short, the Q-values are Independent and Identically Distributed (i.i.d.) among the agents and the channels. In the simulation, each Q-value $Q_i(a_i, a_{j \in \Gamma(i)})$ ranges from $-5$ to $15$. A higher Q-value indicates better reward, and hence higher throughput. At each iteration $t$, communication node pair $i$ chooses a channel out of $K = 3$ channels for data transmission, and it explores with probability $\varepsilon = 0.0125$. A single iteration corresponds to the conventional four-way handshaking mechanism that covers payoff message exchange in RTS and CTS, data packet and ACK transmission.

Figure 6 shows that the global payoff of the low, medium

and high density networks increases and converges to a

fixed point as the number of iterations advances. The high-

density network converges in approximately 9000 iterations,

while medium and low-density networks converge in

approximately 5000 iterations. The global payoff fluctuates

once in a while due to occasional exploration. The global

payoff converges in cyclic topology. The high-density

network has the highest global payoff because the payoff

depends on the number of neighbour agents.

Figure 7 shows that the difference between the optimal

and instantaneous global Q-values of the low, medium and

high density networks decreases to approximately zero. This

indicates the convergence to an optimal joint action. The

instantaneous global Q-value is calculated using (1); and the

optimal global Q-value is the linear combination or

summation of all optimal local Q-values at each agent. The

difference value fluctuates once in a while due to occasional

exploration.
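The convergence metric of Figure 7 can be illustrated with a short sketch. It is a simplification under stated assumptions: each agent's Q-value is indexed by its own channel only (the dependence on neighbours' actions is dropped), the values are drawn uniformly from the simulated range because only i.i.d. sampling and the range are stated, and all names are hypothetical.

```python
import random

U, K = 10, 3               # agents and channels, as in the simulation setup
Q_MIN, Q_MAX = -5.0, 15.0  # Q-value range used in the simulation


def draw_q_table(rng):
    """One Q-value per (agent, channel); the uniform distribution is an assumption."""
    return [[rng.uniform(Q_MIN, Q_MAX) for _ in range(K)] for _ in range(U)]


def optimality_gap(q_table, joint_action):
    """Optimal global Q-value minus the instantaneous global Q-value of (1)."""
    optimal = sum(max(row) for row in q_table)
    instantaneous = sum(row[a] for row, a in zip(q_table, joint_action))
    return optimal - instantaneous


rng = random.Random(1)
q_table = draw_q_table(rng)
random_joint_action = [rng.randrange(K) for _ in range(U)]
best_joint_action = [row.index(max(row)) for row in q_table]

print(optimality_gap(q_table, random_joint_action))  # gap of an arbitrary joint action
print(optimality_gap(q_table, best_joint_action))    # gap once every agent acts optimally
```

A gap that shrinks towards zero over the iterations is what the text above interprets as convergence to an optimal joint action.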

Figure 8 shows that the global payoff of a medium-density network increases and converges to a fixed point as the number of iterations advances in the presence of varying Q-values for the first 2000, 4000 and 6000 iterations. At each iteration, each Q-value varies with probability 0.5. The results indicate that stable Q-values provided by the LL mechanism are the key factor for convergence.

Figure 5. Graphical representation of the DCS scheme. Bold line indicates data transmission between a communication node pair (an agent) over a chosen channel, while dotted line indicates interference link.

Figure 6. Global payoff for low, medium and high density networks.

Figure 7. Difference between optimal global Q-value and instantaneous global Q-value for low, medium and high density networks.

Figure 8. Global payoff for medium density network with varying Q-values for the first 2000, 4000 and 6000 iterations.

Figure 9 shows that the global payoff of the high density network increases and converges to a fixed point as the number of iterations advances. The network converges in approximately 9000 iterations when $\varepsilon = 0.0125$, 11000 iterations when $\varepsilon = 0.1$, and 50 iterations when $\varepsilon = 0.3$. Thus, higher $\varepsilon$ values improve convergence rate. However, the global payoff fluctuates when $\varepsilon = 0.1$ and $\varepsilon = 0.3$ due to excessive exploration. This is why $\varepsilon = 0.0125$ has been chosen in our simulations although it has the lowest convergence rate. This key finding shows that it is possible to achieve fast convergence using high values of $\varepsilon$ and subsequently to achieve stability using lower values of $\varepsilon$ upon reaching the stability state.
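One way to act on this finding is a two-phase exploration schedule, sketched below. It is only an illustration: the two $\varepsilon$ values are taken from the simulations above, but the switching point `switch_at` is a hypothetical tuning parameter (the paper does not specify how or when to switch), set here to the approximate convergence time reported for $\varepsilon = 0.3$.

```python
def epsilon_schedule(iteration, switch_at=50, eps_fast=0.3, eps_stable=0.0125):
    """Two-phase exploration probability.

    Explore aggressively until `switch_at` (a hypothetical tuning knob),
    then fall back to a small epsilon to keep the global payoff stable.
    """
    return eps_fast if iteration < switch_at else eps_stable


for t in (0, 49, 50, 9000):
    print(t, epsilon_schedule(t))
```

In practice the switch would more plausibly be triggered by a convergence test on the payoff than by a fixed iteration count, which is related to the open issues discussed in the next section.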

VII. OPEN ISSUES

The open issues are relevant to the MARL, LL and LCPP mechanisms respectively. MARL is a new research area [7] suitable for tackling the drawbacks of GT. These open issues are subject to further investigation. In MARL, it is necessary to detect and respond to irrational agents that take random or suboptimal actions. In LL, there are two issues. Firstly, a mechanism is needed to provide stable Q-values. This includes detecting and responding to any fluctuations in the Q-values. This is critical so that the LCPP mechanism converges to an optimal joint action (see Figure 8). Secondly, a mechanism is needed for heterogeneous learning among the agents. Each agent represents the Q-values with different performance metrics in a particular DCRN to enable heterogeneous learning entities. In LCPP, there are four issues. Firstly, a mechanism is needed to improve the convergence rate of the global payoff (see Figure 6) and global Q-value (see Figure 7). Secondly, the impact of dynamics in the operating environment on $\varepsilon$ needs investigation, as does a mechanism to measure the level of dynamics for dynamic adjustment of $\varepsilon$. Thirdly, the tradeoff between convergence rate and stability through the adjustment of $\varepsilon$ needs investigation. Fourthly, the amount of overhead of the LCPP and GT approaches needs to be compared.

CONCLUSIONS AND FUTURE WORK

In this paper, we have presented our Locally-Confined Payoff Propagation (LCPP) mechanism, which is an important feature in the Multi-Agent Reinforcement Learning (MARL) approach to achieve optimal joint action in a Distributed Cognitive Radio Network (DCRN). We have shown that the LCPP mechanism converges to an optimal joint action, including in networks with cyclic topology. Fast convergence is possible through the adjustment of the exploration probability $\varepsilon$. In our future work, we will investigate the open issues raised in this paper. The investigations in this paper serve as the basis for future research in DCRN.

REFERENCES

[1] FCC Spectrum Policy Task Force, “Report of the Spectrum Efficiency Working Group,” Federal Communications Commission, Tech. Rep. 02-155, US, Nov. 2002.

[2] J. Mitola III and G. Q. Maguire, “Cognitive radio: Making software radios more personal,” IEEE Personal Commun., 6, pp. 13-18, 1999.

[3] J. R. Kok and N. Vlassis, “Collaborative Multiagent Reinforcement Learning,” J. Mach. Learn. Research, 7, pp. 1789-1828, Sep. 2006.

[4] I. Malanchini, M. Cesana, and N. Gatti, “On Spectrum Selection Games in Cognitive Radio Networks,” IEEE Global Telecom. Conf. (GLOBECOM), Honolulu, HI, Dec. 2009.

[5] H. Li and Z. Han, “Competitive Spectrum Access in Cognitive Radio Networks: Graphical Game and Learning,” IEEE Wireless Commun. and Netw. Conf. (WCNC), Sydney, Australia, April 2010.

[6] S. Kapetanakis, D. Kudenko, and M. J. A. Strens, “Reinforcement learning approaches to coordination in cooperative multi-agent systems,” Adaptive Agents and Multi-Agent Systems, LNCS, Springer Berlin, 2003.

[7] L. Busoniu, R. Babuska, and B. De Schutter, “A comprehensive survey of multiagent reinforcement learning,” IEEE Trans. on Sys., Man, & Cyb., Part C: Appl. & Rev., 38(2), pp. 156-172, Mar. 2008.

[8] R. S. Sutton and A. G. Barto, Reinforcement learning: an introduction. Cambridge, MA: MIT Press, 1998.

Figure 9. Global payoff for different ε in a high-density network.

TRUST AND REPUTATION SCHEME FOR CLUSTERING IN COGNITIVE RADIO NETWORKS

Conference Paper

Full-text available

Nov 2014

Cognitive radio is the next-generation wireless communication network that improves the efficiency of the radio spectrum through exploitation of underutilized licensed spectrum (or white spaces). This paper applies a reinforcement learning-based Trust and Reputation Management (TRM) scheme to cluster-based routing and shows network performance enhancement, including throughput and rewards. Generally speaking, clustering forms logical groups of nodes throughout the entire network, and routing establishes routes on the underlying clustered network which is distributed in nature. Each cluster is comprised of a clusterhead (or the leader of the cluster) and member nodes. TRM is applied as a security measure to allow each node to determine credibility of its neighbouring nodes. Reinforcement Learning (RL) is applied to keep track of the credibility level of a node, and it provides reward based on a node's behaviour; subsequently the reward is applied to select clusterheads. The selection of trusted nodes as clusterheads has been a problem that is of great significance due to the important role played by clusterheads as the local point of process for various applications such as channel sensing and routing. Our simulation results show that the RL-based TRM approach applied to clusterhead selection helps to reduce the effects of attacks from malicious nodes, and this has been shown to increase average throughput and reward rate, as well as to reduce changes of clusterhead in a cluster.

Fuzzy Logic Based Decision System for Context Aware Cognitive Waveform Generation

Article

Jun 2017
WIRELESS PERS COMMUN

Cognitive radio is an intelligent radio that runs the cognitive cycle: it observes, understands, creates knowledge, makes a decision, and modifies the radio parameters for the given objective. A cognitive radio designed for a single purpose may not suit the next generation of heterogeneous networks, where there are multiple QoS requirements on the application/user side, channel conditions vary, and different transmission frequency bands must be supported. There is therefore a need for a cognitive radio that meets multi-scenario requirements, that is, a context-aware cognitive radio communication system for heterogeneous networks. This work presents five transmission-mode cognitive waveforms to handle five different contexts: (1) an energy-efficient QoS CR waveform using a genetic algorithm; (2) a low-data-rate, FBMC-based, subcarrier-level interleave CR waveform; (3) an underlay spatial-coder waveform supporting emergency communication; (4) a hardware-impairment-handling waveform using prewhitened precoding; and (5) an interleave CR waveform based on adaptive training sequence design that handles imperfect channel state. Optimal decision making based on observed values and receiver feedback relies on the accuracy of the observed values, which are not precise. Fuzzy logic is tolerant of such imprecision in the data, so a cognitive engine designed with a fuzzy-logic-based decision system to select the optimal waveform for the given context is presented. The system takes input on spectrum holes from the detecting unit and a database; receiver feedback such as BER, data rate, channel gain, channel imperfection, and SINR from the PR receiver; input from the transmitter about hardware impairment; and input from the user application about the QoS requirement.

Intelligent Wireless Communication System Using Cognitive Radio

Article

Mar 2012

Asma Amraoui

The increasing demand for wireless communication introduces the challenge of efficient spectrum utilization. To address this challenge, cognitive radio (CR) has emerged as the key technology, enabling opportunistic access to the spectrum. CR is a form of wireless communication in which a transceiver can intelligently detect which communication channels are in use and which are not, and instantly move into vacant channels while avoiding occupied ones. This optimizes the use of available radio-frequency (RF) spectrum while minimizing interference to other users. In this paper, we present the state of the art on the use of Multi-Agent Systems (MAS) for spectrum access, using cooperation and competition to solve the problem of spectrum allocation and ensure better management. We then propose a new approach that uses CR to improve wireless communication for a single cognitive radio mobile terminal (CRMT).

Application of Reinforcement Learning in Cognitive Radio Networks: Models and Algorithms

Article

Jun 2014
TSWJ

Cognitive radio (CR) enables unlicensed users to exploit the underutilized spectrum in licensed spectrum whilst minimizing interference to licensed users. Reinforcement learning (RL), which is an artificial intelligence approach, has been applied to enable each unlicensed user to observe and carry out optimal actions for performance enhancement in a wide range of schemes in CR, such as dynamic channel selection and channel sensing. This paper presents new discussions of RL in the context of CR networks. It provides an extensive review on how most schemes have been approached using the traditional and enhanced RL algorithms through state, action, and reward representations. Examples of the enhancements on RL, which do not appear in the traditional RL approach, are rules and cooperative learning. This paper also reviews performance enhancements brought about by the RL algorithms and open issues. This paper aims to establish a foundation in order to spark new research interests in this area. Our discussion has been presented in a tutorial manner so that it is comprehensive to readers outside the specialty of RL and CR.

面向智能体的语义通信：架构与范例

Article

Apr 2021

Dynamic Spectrum Access Techniques

Chapter

Jan 2015

It is now widely recognized that wireless communications systems do not exploit the whole available frequency band. The idea has naturally emerged to develop tools to make better use of the spectrum. Cognitive Radio (CR) is the concept that meets this challenge. CR is a form of wireless communication in which a transmitter/receiver can intelligently detect which communication channels are in use and which are not, and can move to unused channels. This optimizes the use of the available radio frequency spectrum while minimizing interference with other users. CRs must have the ability to learn and adapt their wireless transmission according to the ambient radio environment. The application of Artificial Intelligence (AI) approaches in CR is very promising because they are essential for the implementation of CR network architecture. CRs must be able to coexist to make CR systems practical, which may cause interference to other users. To solve the problem of congestion, CR networks use Dynamic Spectrum Access (DSA). To deal with this problem, the idea of cooperation between users to detect and share spectrum without causing interference is introduced. The authors found a large number of proposed works relating to spectrum access: many use auctions and many use game theory, while fewer use Markov chains. However, some research has been done in this area using Multi-Agent Systems (MAS).

A Study on Multi-agent Systems in Cognitive Radio: AICC 2018

Chapter

Jan 2019

Application of reinforcement learning to wireless sensor networks: models and algorithms

Article

Nov 2015

Wireless sensor network (WSN) consists of a large number of sensors and sink nodes which are used to monitor events or environmental parameters, such as movement, temperature, humidity, etc. Reinforcement learning (RL) has been applied in a wide range of schemes in WSNs, such as cooperative communication, routing and rate control, so that the sensors and sink nodes are able to observe and carry out optimal actions on their respective operating environment for network and application performance enhancements. This article provides an extensive review on the application of RL to WSNs. This covers many components and features of RL, such as state, action and reward. This article presents how most schemes in WSNs have been approached using the traditional and enhanced RL models and algorithms. It also presents performance enhancements brought about by the RL algorithms, and open issues associated with the application of RL in WSNs. This article aims to establish a foundation in order to spark new research interests in this area. Our discussion has been presented in a tutorial manner so that it is comprehensive and applicable to readers outside the specialty of both RL and WSNs.

Cognitive Radio Resource Management Using Multi-Agent Systems, Auctions and Game Theory

Article

Jan 2014

In the last few years, we have witnessed impressive growth in wireless communication due to the popularity of smartphones and other mobile devices. With the emergence of application domains such as sensor networks, smart grid control, and medical wearable and embedded wireless devices, we are seeing increasing demand for unlicensed bandwidth. There has also been increasing interest from the wireless community in the use of game theory and multi-agent systems. Our aims in this article are to summarize the different uses of game theory in wireless networks, with a focus on its use in cognitive radio networks; to discuss how multi-agent systems can be applied to solve the problem of radio resource management; and finally to give the results of our simulations combining auction theory with multi-agent systems for negotiation between agents.

Réseaux de radio cognitive : Allocation des ressources radio et accès dynamique au spectre

Article

Jul 2014

In the first chapter of this report, we provide an overview on mobile and wireless networks, with special focus on the IEEE 802.22 norm, which is a norm dedicated to cognitive radio (CR). Chapter 2 goes into detail about CR and Chapter 3 is devoted to the presentation of the concept of agents and in particular the concept of multi-agent systems (MAS). Finally, Chapter 4 provides a state of the art on the use of artificial intelligence techniques, particularly MAS for radio resource allocation and dynamic spectrum access in the field of CR.

Reinforcement Learning Approaches to Coordination in Cooperative Multi-agent Systems

Conference Paper

Jan 2002
Lect Notes Comput Sci

We report on an investigation of reinforcement learning techniques for the learning of coordination in cooperative multi-agent systems. Specifically, we focus on two novel approaches: one is based on a new action selection strategy for Q-learning [10], and the other is based on model estimation with a shared action-selection protocol. The new techniques are applicable to scenarios where mutual observation of actions is not possible. To date, reinforcement learning approaches for such independent agents did not guarantee convergence to the optimal joint action in scenarios with high miscoordination costs. We improve on previous results [2] by demonstrating empirically that our extension causes the agents to converge almost always to the optimal joint action even in these difficult cases.

On Spectrum Selection Games in Cognitive Radio Networks

Conference Paper

Nov 2009

Cognitive radio networks aim at enhancing spectrum utilization by allowing cognitive devices to opportunistically access vast portions of the spectrum. To reach such ambitious goal, cognitive terminals must be geared with enhanced spectrum management capabilities including the detection of unused spectrum holes (spectrum sensing), the characterization of available bands (spectrum decision), the coordination with other cognitive devices in the access phase (spectrum sharing), and the capability to handover towards other spectrum holes when licensed users kick in or if a better spectrum opportunity becomes available (spectrum mobility). In this paper, a game theoretic framework is proposed to evaluate spectrum management functionalities in cognitive radio networks. The spectrum selection process is cast as a non-cooperative game among secondary users who can opportunistically select the "best" spectrum opportunity, under the tight constraint not to harm primary licensed users. Different quality measures for the spectrum opportunities are considered and evaluated in the game framework, including the spectrum bandwidth, and the spectrum opportunity holding time. The cost of spectrum mobility is also accounted in the analytical framework. Numerical results are reported to assess the quality of the game equilibria.

A Comprehensive Survey of Multiagent Reinforcement Learning

Article

Apr 2008

Multiagent systems are rapidly finding applications in a variety of domains, including robotics, distributed control, telecommunications, and economics. The complexity of many tasks arising in these domains makes them difficult to solve with preprogrammed agent behaviors. The agents must, instead, discover a solution on their own, using learning. A significant part of the research on multiagent learning concerns reinforcement learning techniques. This paper provides a comprehensive survey of multiagent reinforcement learning (MARL). A central issue in the field is the formal statement of the multiagent learning goal. Different viewpoints on this issue have led to the proposal of many different goals, among which two focal points can be distinguished: stability of the agents' learning dynamics, and adaptation to the changing behavior of the other agents. The MARL algorithms described in the literature aim---either explicitly or implicitly---at one of these two goals or at a combination of both, in a fully cooperative, fully competitive, or more general setting. A representative selection of these algorithms is discussed in detail in this paper, together with the specific issues that arise in each category. Additionally, the benefits and challenges of MARL are described along with some of the problem domains where the MARL techniques have been applied. Finally, an outlook for the field is provided.

FCC Spectrum Policy Task Force, Report of the Spectrum Efficiency Working Group, 2002

Article

Nov 2002

Competitive Spectrum Access in Cognitive Radio Networks: Graphical Game and Learning

Conference Paper

Apr 2010

Competitive spectrum access is studied for cognitive radio networks. Based on the assumption of rational secondary users, the spectrum access is modeled as a graphical game, in which the payoff of a secondary user is dependent on only other secondary users that can cause significant interference. The Nash equilibrium in the graphical game is computed by minimizing the sum of regrets. To alleviate the local knowledge of payoffs (each secondary user knows only its own payoff for different channels), a subgradient based iterative algorithm is applied by exchanging information across different secondary users. When the knowledge of payoffs is unknown to secondary users, learning for spectrum access is carried out by employing stochastic approximation (more specifically, the Kiefer-Wolfowitz algorithm). The convergence of both situations is demonstrated by numerical simulations.

Collaborative Multiagent Reinforcement Learning by Payoff Propagation.

Article

Sep 2006

In this article we describe a set of scalable techniques for learning the behavior of a group of agents in a collaborative multiagent setting. As a basis we use the framework of coordination graphs of Guestrin, Koller, and Parr (2002a) which exploits the dependencies between agents to decompose the global payoff function into a sum of local terms. First, we deal with the single-state case and describe a payoff propagation algorithm that computes the individual actions that approximately maximize the global payoff function. The method can be viewed as the decision-making analogue of belief propagation in Bayesian networks. Second, we focus on learning the behavior of the agents in sequential decision-making tasks. We introduce different model-free reinforcement-learning techniques, unitedly called Sparse Cooperative Q-learning, which approximate the global action-value function based on the topology of a coordination graph, and perform updates using the contribution of the individual agents to the maximal global action value. The combined use of an edge-based decomposition of the action-value function and the payoff propagation algorithm for efficient action selection, result in an approach that scales only linearly in the problem size. We provide experimental evidence that our method outperforms related multiagent reinforcement-learning methods based on temporal differences.

Reinforcement Learning of Coordination in Cooperative Multi-Agent Systems

Article

Jan 2002
Lect Notes Comput Sci

We report on an investigation of reinforcement learning techniques for the learning of coordination in cooperative multi-agent systems. Specifically, we focus on a novel action selection strategy for Q-learning. The new technique is applicable to scenarios where mutual observation of actions is not possible. To date, reinforcement learning approaches for such independent agents did not guarantee convergence to the optimal joint action in scenarios with high miscoordination costs. We improve on previous results by demonstrating empirically that our extension causes the agents to converge almost always to the optimal joint action even in these difficult cases.

Reinforcement Learning: An Introduction

Article

Feb 1998

Of several responses made to the same situation, those which are accompanied or closely followed by satisfaction to the animal will, other things being equal, be more firmly connected with the situation, so that, when it recurs, they will be more likely to recur; those which are accompanied or closely followed by discomfort to the animal will, other things being equal, have their connections with that situation weakened, so that, when it recurs, they will be less likely to occur. The greater the satisfaction or discomfort, the greater the strengthening or weakening of the bond. (Thorndike, 1911) The idea of learning to make appropriate responses based on reinforcing events has its roots in early psychological theories such as Thorndike's "law of effect" (quoted above). Although several important contributions were made in the 1950s, 1960s and 1970s by illustrious luminaries such as Bellman, Minsky, Klopf and others (Farley and Clark, 1954; Bellman, 1957; Minsky, 1961; Samuel, 1963; Michie and Chambers, 1968; Grossberg, 1975; Klopf, 1982), the last two decades have witnessed perhaps the strongest advances in the mathematical foundations of reinforcement learning, in addition to several impressive demonstrations of the performance of reinforcement learning algorithms in real world tasks. The introductory book by Sutton and Barto, two of the most influential and recognized leaders in the field, is therefore both timely and welcome. The book is divided into three parts. In the first part, the authors introduce and elaborate on the essential characteristics of the reinforcement learning problem, namely, the problem of learning "policies" or mappings from environmental states to actions so as to maximize the amount of "reward"

Maguire, G.Q.: Cognitive radio: making software radios more personal. IEEE Pers. Commun. 6(4), 13-18

Article

Sep 1999

Software radios are emerging as platforms for multiband multimode personal communications systems. Radio etiquette is the set of RF bands, air interfaces, protocols, and spatial and temporal patterns that moderate the use of the radio spectrum. Cognitive radio extends the software radio with radio-domain model-based reasoning about such etiquettes. Cognitive radio enhances the flexibility of personal services through a radio knowledge representation language. This language represents knowledge of radio etiquette, devices, software modules, propagation, networks, user needs, and application scenarios in a way that supports automated reasoning about the needs of the user. This empowers software radios to conduct expressive negotiations among peers about the use of radio spectrum across fluents of space, time, and user context. With RKRL, cognitive radio agents may actively manipulate the protocol stack to adapt known etiquettes to better satisfy the user's needs. This transforms radio nodes from blind executors of predefined protocols to radio-domain-aware intelligent agents that search out ways to deliver the services the user wants even if that user does not know how to obtain them. Software radio provides an ideal platform for the realization of cognitive radio

Cognitive radio: Making software radios more personal

Jan 1999
13-18

J. Mitola III
G. Q. Maguire

Mitola III, J., and Maguire, G. Q., "Cognitive radio: Making software radios more personal," IEEE Pers. Commun., 6(4), pp. 13-18, 1999.
