A Reinforcement Learning Approach to Dynamic Spectrum Access
in Internet-of-Things Networks
Han Cha and Seong-Lyun Kim
School of Electrical and Electronic Engineering, Yonsei University, Seoul, Korea
Email: {chan, slkim}@ramo.yonsei.ac.kr
Abstract—To support the wireless communication traffic of Internet-of-Things (IoT) systems in terms of massive connectivity, dynamic spectrum access (DSA) is an important issue. This paper proposes a spectrum sensor-aided DSA system based on a reinforcement learning (RL) algorithm that aims at efficient spectrum usage for an IoT network overlaid on an incumbent network. Due to the small form factor of IoT devices, they do not have spectrum sensing capability. To support DSA for IoT devices, we introduce a sensor-aided DSA system that enhances spatial spectrum reusability by means of an RL algorithm. With the RL algorithm, the proposed DSA system provides a self-organizing feature for a massive number of IoT devices. We show the performance of the proposed RL based DSA system for various densities of IoT devices utilizing a slotted ALOHA protocol whose spectrum access probability is learned by the proposed DSA system. We also show that the performance of the proposed RL based DSA system surpasses that of the distributed Carrier Sense Multiple Access with Collision Avoidance (CSMA/CA) protocol for channel access coordination. Finally, we show that the performance of the incumbent user remains consistent when the IoT devices access the spectrum band with the learned spectrum access probability.
Index Terms—Reinforcement learning, dynamic spectrum access, random MAC, Internet-of-Things, area spectral efficiency.
I. INTRODUCTION
In recent years, numerous applications have emerged that require tremendous wireless communication traffic. Machine-to-machine (M2M) communication is one of the most traffic-demanding applications in terms of its huge connectivity requirement, and Internet-of-Things (IoT) technology is a key enabler for realizing M2M communication. With IoT technology, vast numbers of devices are connected through the Internet and exchange information in real time. Supporting these connections wirelessly creates an explosive demand for spectrum resources.
A dynamic spectrum access (DSA) scheme coupled with random access is one of the promising technologies for satisfying this demand in IoT networks. A DSA-aided IoT network uses the spectrum bandwidth more efficiently by exploiting underutilized spectrum left idle by incumbent users. This opportunity-aware spectrum access relies on spectrum sensing functionality, which requires high-performance components: an analog-to-digital converter, a signal processor, an RF unit, a single/double link structure, and so on. It is difficult to equip IoT devices with such a complex architecture because of their low-cost requirement. To overcome this spectrum sensing constraint, dedicated spectrum sensors support the DSA of IoT devices [1], [2].
Fig. 1: A system model of the Internet-of-Things (IoT) network utilizing the spectrum sensor-aided dynamic spectrum access framework. Spectrum sensors provide the spectrum usage of incumbent users, measured at each sensor location, to IoT devices, which utilize this information for spectrum access.
Spectrum sensors collect the aggregate interference at their positions and report the values to a central unit that controls the spectrum access of the IoT devices. Based on the collected interference information, the central unit determines the spectrum access probability of the IoT devices at a given moment. When IoT devices utilize the DSA scheme, the central unit must account for the harmful interference that IoT devices cause to other IoT devices as well as to the incumbent user. Unfortunately, the central unit may have no information about the locations of the IoT devices or of the incumbent user. To determine the spectrum access probability of the IoT devices, the central unit therefore needs to investigate the impact of each IoT device on the entire network.
In this context, the reinforcement learning (RL) technique is suitable for handling unknown information. An RL system learns an action that maximizes a numerical reward by interacting with a random environment iteratively [3]. RL has been applied to various fields of mobile communications: resource allocation for mobile cellular networks [4], [5], cell outage management for dense heterogeneous networks [6], and channel selection for D2D communications [7]. In the cognitive spectrum access area, [8] and [9] consider the dynamic multichannel access scenario for users performing an RL process. In [10] and [9], the cognitive radio system is designed for a single user. In contrast, this paper considers a dynamic spectrum access scenario for a large number of low-capability users while also enhancing spatial spectrum reusability.
Since the numerical reward is stochastic, the central unit has to repeat trials within an RL framework to find a proper access probability for the IoT devices. The rest of the paper is organized as follows. First, we present our system model and an optimization problem that maximizes the area spectral efficiency of the DSA IoT network in Sections II and III. Second, we introduce the reinforcement learning DSA procedure conducted by a central unit in Section IV. Finally, in Section V we investigate the performance of the RL based DSA system for various densities of STXs along the finalized learning step. We also provide a performance comparison between the proposed RL based DSA system and Carrier Sense Multiple Access with Collision Avoidance (CSMA/CA), and we show the impact of the IoT network on the incumbent network with respect to average spectral efficiency.
II. SYSTEM MODEL
A. Network Model
Consider a DSA IoT network where transmitters communicate with receivers over an incumbent user network. We define the incumbent user network as the primary network and the IoT network as the secondary network. The N secondary transmitters (STXs) are governed by a central unit that determines their spectrum access probabilities from the aggregate interference obtained from the spectrum sensors. Each STX, which has infinite backlogged data to transmit, sends its packets to its paired secondary receiver (SRX) through a wireless channel. We denote the activation vector as a = [a_1, a_2, ..., a_N], where the component a_k is one if the kth STX is active and zero otherwise.

The STXs operate according to the value of the activation vector generated by the central unit or by the STX itself. Outside the learning process, each STX generates its own activation value, i.e., determines its packet transmission itself. We assume that the primary transmitters (PTXs) always transmit packets to their primary receivers (PRXs).
B. Channel Model
The transmitted signal experiences path-loss attenuation with exponent α as well as Rayleigh fading with unit mean, i.e., h ~ exp(1). The fading coefficient of the kth secondary pair is h_k. We assume an identical distance d between each paired STX and SRX. The primary and secondary networks share a common wireless channel, interfering with each other. When the kth STX transmits a packet, the signal-to-interference-plus-noise ratio (SINR) γ_k at the paired kth SRX is given by:

\gamma_k(\mathbf{a}) = \frac{P_2 h_k d^{-\alpha}}{\sum_{i \neq k} a_i P_2 h_{ik} d_{ik}^{-\alpha} + I_k + \sigma^2},   (1)

where I_k is the aggregate interference from the primary network at the kth SRX, h_{ik} is the fading coefficient from the ith STX to the kth SRX, d_{ik} is the distance between the ith STX and the kth SRX, P_2 denotes the transmit power of an STX, and σ^2 represents the thermal noise power.
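As a concrete illustration, the per-receiver SINR of Eq. (1) can be computed as follows. This is a minimal Python sketch; the fading matrix `h`, the cross-distance matrix `d_cross`, and the primary interference `I_primary` are assumed inputs, not quantities specified by the paper.

```python
import numpy as np

def sinr(k, a, h, d_pair, d_cross, I_primary, P2=0.2, alpha=3.5, noise=1e-13):
    """Compute the SINR of the k-th SRX per Eq. (1).

    a        : binary activation vector of the N STXs
    h        : N x N fading matrix with h[i, k] ~ Exp(1); h[k, k] is the pair's gain
    d_pair   : STX-SRX pair distance d (identical for all pairs)
    d_cross  : N x N matrix of distances d_ik from STX i to SRX k
    I_primary: aggregate interference I_k from the primary network at SRX k
    """
    signal = P2 * h[k, k] * d_pair ** (-alpha)
    cross = sum(a[i] * P2 * h[i, k] * d_cross[i, k] ** (-alpha)
                for i in range(len(a)) if i != k)
    return signal / (cross + I_primary + noise)
```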
III. PROBLEM FORMULATION
In the DSA IoT network, it is important to increase the number of concurrent transmissions while guaranteeing the quality of each individual transmission. To this end, our prime concern is to find the optimal access probability p* that maximizes the area spectral efficiency (ASE), defined as the sum of data rates per unit bandwidth in a unit area [11]. To find p*, we formulate the following optimization problem:

\mathbf{p}^* = \arg\max_{\mathbf{p}} \; \log_2(1+\beta) \cdot \mathbb{E}\Big[\sum_k \mathbf{1}_{\gamma_k \geq \beta}\Big],   (P1)
\text{s.t. } 0 \leq p_k \leq 1, \; \forall k = 1, \cdots, N,

where N denotes the number of STXs and β > 0 is the target SINR threshold of an SRX. Note that 1_{γ_k ≥ β} is an indicator function yielding one if γ_k ≥ β and zero otherwise. The term E[Σ_k 1_{γ_k ≥ β}] represents the average number of successful transmissions, so the objective function represents the average ASE achieved under the access probabilities of the STXs.
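Since the objective of (P1) is an expectation over random activations and channels, it can be estimated by Monte Carlo averaging. A sketch, assuming a hypothetical `sinr_sample` helper that hides the channel model of Eq. (1):

```python
import numpy as np

def average_ase(p, sinr_sample, beta=2.0, area=2500.0, trials=1000, seed=0):
    """Monte Carlo estimate of the (P1) objective: log2(1+beta) times the
    average number of successful transmissions, normalized per unit area.

    p           : access probability vector of the N STXs
    sinr_sample : callable(a) -> array of per-SRX SINRs for activation a
                  (hypothetical helper standing in for Eq. (1))
    """
    rng = np.random.default_rng(seed)
    successes = 0.0
    for _ in range(trials):
        a = (rng.random(len(p)) < p).astype(int)      # Bernoulli(p) activations
        successes += np.sum(sinr_sample(a) >= beta)   # indicator 1{gamma_k >= beta}
    return np.log2(1.0 + beta) * successes / trials / area
```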
The challenge in finding the optimal access probability of the STXs arises from the stochastic nature of communication channels. The central unit knows neither the locations of the STXs nor those of the PTXs. Given this lack of global information, including the channel coefficients, the problem becomes intractable. The stochastic nature of communication channels is effectively handled by evaluating the objective function repeatedly, which is a key property of the RL framework [12]. Therefore, we propose an RL based DSA system suitable for this stochastic environment.
IV. REINFORCEMENT LEARNING BASED DYNAMIC SPECTRUM ACCESS SYSTEM
In this section we introduce the learning operation of the sensor-aided DSA IoT network, which is performed by the central unit. We first describe the operation of the central unit and then the learning procedure of the proposed RL based DSA system.

A. The Central Unit

Consider a central unit that interacts with the environment along time steps denoted by t, as presented in Fig. 2. The central unit takes an action by controlling the spectrum access of the STXs, and gathers the transmission results of the secondary pairs. From these results, the central unit calculates the benefit of each STX's transmission. The central unit then adjusts the access probabilities of the STXs toward values that produce a higher ASE. This procedure is based on the REINFORCE learning
Fig. 2: The reinforcement learning system. The central unit interacts with the wireless network by controlling the spectrum access of the secondary transmitters.
algorithm [12], which has an advantage in handling stochastic environments. We now describe the learning model.
In every time step, the central unit produces the activation vector a(t) = [a_1(t), a_2(t), ..., a_N(t)], whose components are Bernoulli random variables. The probabilities of these random variables are given by the access probability vector p(t) = [p_1(t), p_2(t), ..., p_N(t)], i.e., a(t) = Bernoulli(p(t)). The access probability vector p(t) is related to the internal state vector w(t) by means of the sigmoid function:

p_k(t) = \frac{1}{1 + e^{-w_k(t)}}, \quad t = 1, 2, \ldots   (2)
The central unit finds the optimal access probability vector p* by learning the internal state vector w(t) according to the following update rules:

w_k(0) = \ln\frac{p_k(0)}{1 - p_k(0)},   (3)
w_k(t+1) = w_k(t) + \alpha(t)\{u(t) - \bar{u}(t)\} G_k(t),   (4)
G_k(t) = \frac{(-1)^{a_k(t)+1}}{1 + e^{w_k(t)(-1)^{a_k(t)+1}}},   (5)
\alpha(t+1) = \alpha - \Delta \cdot t,   (6)

where α(t) is a learning rate that monotonically decreases by ∆, and G_k(t) is the gradient of the log-probability of the realized action a_k(t) with respect to w_k(t). The utility function u(t) and the average utility function ū(t) are defined as follows:

u(t) = \log_2(1+\beta) \cdot \sum_k \mathbf{1}_{\gamma_k \geq \beta}(\mathbf{a}(t)),   (7)
\bar{u}(t+1) = (1-\lambda)\bar{u}(t) + \lambda u(t), \quad 0 < \lambda \leq 1,   (8)

where λ is the proportion with which the utility value at timeslot t is incorporated into the average utility function. We define the baseline as the average utility function, which enables stable performance enhancement of the secondary network during the reinforcement learning procedure.
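One update of Eqs. (4), (5) and (8) can be sketched in a few lines of Python. The utility u(t) and the baseline ū(t) are supplied externally here; the default λ = 0.1 is an illustrative assumption, not a value taken from the paper.

```python
import numpy as np

def reinforce_step(w, a, u, u_bar, lr, lam=0.1):
    """One REINFORCE update of the internal state.

    w     : internal state vector w(t)
    a     : sampled 0/1 activation vector a(t)
    u     : utility u(t) from Eq. (7)
    u_bar : running baseline u_bar(t)
    lr    : learning rate alpha(t)
    Returns the updated (w, u_bar) and the new access probabilities of Eq. (2).
    """
    sign = (-1.0) ** (a + 1)                  # +1 if a_k = 1, -1 if a_k = 0
    G = sign / (1.0 + np.exp(w * sign))       # score-function gradient, Eq. (5)
    w_next = w + lr * (u - u_bar) * G         # Eq. (4)
    u_bar_next = (1 - lam) * u_bar + lam * u  # baseline update, Eq. (8)
    p_next = 1.0 / (1.0 + np.exp(-w_next))    # sigmoid, Eq. (2)
    return w_next, u_bar_next, p_next
```

Note how the sign term rewards an above-baseline utility by pushing p_k up when the STX transmitted (a_k = 1) and down when it stayed silent.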
B. Proposed Reinforcement Learning Procedure

The proposed reinforcement learning procedure comprises the following steps:
Fig. 3: Network topology where the network size is 50 m × 50 m and the number of secondary pairs is 20. We extensively simulate with the number of secondary pairs ranging from 20 to 70. The pentagrams represent spectrum sensors. We number the PTXs so as to distinguish them from each other. Unless otherwise mentioned, the locations of all PTXs, STXs, SRXs, and sensors are fixed.
Algorithm 1 Learning procedure of the central unit
1: Initialize p(0), w(0), a(0) with S_k, k = 1, 2, ..., N.
2: Inform a(0) to the STXs.
3: The STXs transmit according to a(0).
4: for t = 0 to T_l do
5:   Collect 1_{γ_k ≥ β}(a(t)), k = 1, 2, ..., N.
6:   Update u(t) according to Eq. (7).
7:   Update w(t+1) according to Eq. (4).
8:   Update ū(t+1) according to Eq. (8).
9:   Calculate p(t+1) according to Eq. (2).
10:  Produce a(t+1) = Bernoulli(p(t+1)).
11:  Inform a(t+1) to the STXs.
12:  The STXs transmit according to a(t+1).
13: end for
a) Spectrum sensing: In this period, the central unit collects the interference values used to calculate the initial access probabilities of the STXs. The sensor nearest to each STX reports its measured aggregate interference value to the central unit. We denote the aggregate interference measured by the kth sensor as S_k.
b) Initializing: The central unit initializes the access probability and internal state vectors. It determines the initial access probability p(0) of the STXs from the interference levels S_k measured by the sensors, k = 1, 2, ..., N. It then calculates the initial internal state values w(0) from p(0) according to the inverse of Eq. (2).
c) Access probability learning: In this period, the central unit interacts with the environment, i.e., the wireless network, utilizing the REINFORCE learning algorithm. The central unit controls the transmissions of the STXs along the time steps t = 1, 2, ..., T_l. After each transmission, the central unit collects the transmission results of the STXs to assess the benefit of each STX's transmission. The access probability learning procedure of the central unit is described in Algorithm 1. Note that Bernoulli(p) is one with probability p and zero otherwise.
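Putting steps a) through c) together, the loop of Algorithm 1 can be sketched as follows. The `utility` callable is a stand-in for collecting transmission results and evaluating Eq. (7), and the learning-rate decrement follows the α/(T_l + 100) setting of Table I; the λ value is an illustrative assumption.

```python
import numpy as np

def learn_access_probabilities(p0, utility, T_l, alpha0=0.025, lam=0.1, seed=0):
    """Sketch of Algorithm 1: learn the access probabilities over T_l steps.

    p0      : initial access probabilities derived from sensed interference
    utility : callable(a) -> scalar u(t), an assumed stand-in for Eq. (7)
    T_l     : number of learning steps
    """
    rng = np.random.default_rng(seed)
    w = np.log(p0 / (1.0 - p0))                     # inverse of the sigmoid in Eq. (2)
    p = p0.copy()
    u_bar = 0.0
    delta = alpha0 / (T_l + 100)                    # decrement from Table I
    for t in range(T_l):
        a = (rng.random(len(p)) < p).astype(int)    # produce a(t) = Bernoulli(p(t))
        u = utility(a)                              # collect transmission results
        sign = (-1.0) ** (a + 1)
        G = sign / (1.0 + np.exp(w * sign))         # Eq. (5)
        lr = alpha0 - delta * t                     # decaying learning rate, Eq. (6)
        w = w + lr * (u - u_bar) * G                # Eq. (4)
        u_bar = (1 - lam) * u_bar + lam * u         # Eq. (8)
        p = 1.0 / (1.0 + np.exp(-w))                # Eq. (2)
    return p                                        # final probabilities p_l, Eq. (9)
```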
d) Determining the final access probability: After the REINFORCE learning algorithm terminates, the central unit determines the final access probability vector p_l of the STXs as follows:

p_l = p(T_l).   (9)

Note that the access probabilities of the STXs converge to one or zero for large enough T_l. The proof of this convergence behavior is presented in [12]. We utilize this behavior to determine the access probabilities of the STXs in the learning procedure. With the final access probability, the kth STX transmits its packets according to a slotted ALOHA protocol with access probability p_{l,k}.
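The resulting slotted ALOHA operation can be sketched as follows. The `has_packet` queue flags are an illustrative assumption for generality; the paper itself assumes infinite backlogged data, i.e., all flags set to one.

```python
import numpy as np

def slotted_aloha(p_final, n_slots, has_packet, seed=0):
    """Slotted ALOHA with the learned final access probabilities p_l.

    Each STX with a queued packet transmits in a slot independently with its
    learned probability p_{l,k}. Returns an n_slots x N 0/1 decision matrix.
    """
    rng = np.random.default_rng(seed)
    N = len(p_final)
    decisions = np.zeros((n_slots, N), dtype=int)
    for t in range(n_slots):
        draw = rng.random(N) < p_final            # Bernoulli(p_l) per STX
        decisions[t] = draw.astype(int) * has_packet
    return decisions
```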
In the next section, we evaluate the performance of the
proposed spectrum access procedure.
V. SIMULATION RESULTS
A. General Settings
Consider a network within a square of side length L_net = 50 m (Fig. 3). The STXs and sensors are distributed randomly in the network with densities λ_s and λ_ssr, respectively. Unless otherwise mentioned, the sensor density λ_ssr is 0.24 in all simulation cases. The pair distance d of the secondary users is 3 m. We assume that the secondary users use a QAM constellation, i.e., the target SINR threshold of an SRX is 3 dB. Table I summarizes the simulation parameters. As shown in Fig. 3, the locations and number of PTXs are the same in all simulations. We obtain the average ASE of the secondary users by executing slotted ALOHA transmissions of the STX-SRX pairs with the final access probabilities, repeated 100,000 times.
We compare the performance of the proposed RL based DSA system against a reference system with respect to average ASE. We implement the reference system as the Carrier Sense Multiple Access with Collision Avoidance (CSMA/CA) protocol, which is based on Clear Channel Assessment (CCA) and a random back-off mechanism. We assume that the CSMA/CA protocol has a fixed carrier sensing range of double the pair distance d, i.e., 6 m, as a conventional setting [13].

The initial access probability of the STXs is determined by means of the sense-and-predict (SaP) method presented in [14], [15]. The interference level at an STX differs from that at the sensors due to the difference in location. When sensor values are used to obtain the spectrum access probabilities of the STXs, this difference in interference level hinders an accurate decision. The SaP method aims to overcome this difference by predicting the interference level at each STX using the spatial correlation of interference. With the predicted interference
Fig. 4: Final access probabilities when the density of STXs is 0.004. Each line denotes the transition of the final access probability of an STX along T_l. The STXs located in sparse areas that are free from the impact of PTXs have near-to-one final access probabilities; these STXs capture spatial transmission opportunities.
level, the SaP method provides the successful transmission probability of each STX by means of a stochastic geometry approach. We refer to this successful transmission probability as OP.
TABLE I: Simulation parameters

Parameters | Values
Network size (L_net × L_net) | 50 m × 50 m
STX node density (λ_s) | 0.008, 0.012, ..., 0.028
Spectrum sensor node density (λ_ssr) | 0.24
Distance between STX-SRX pair (d) | 3 m
Target SINR threshold (β) | 3 dB
Transmit power of STX (P_2) | 23 dBm
Transmit power of PTX (P_1) | 43 dBm
T_l | 500, 1000, ..., 15000
Initial learning rate (α) | 0.025
Initial access probability (p(0)) | OP [14]
∆ | α/(T_l + 100)
Carrier sensing range for CSMA/CA | 6 m
Distance between PTX and PRX | 10 m
B. Average Area Spectral Efficiency Validation
In Fig. 5, we evaluate the performance of the RL based DSA system for various STX densities. As the density of STXs increases, the average ASE achieved after the learning procedure improves. This is because more STXs are located in sparse areas far away from the PTXs, as seen in Fig. 3. These STXs need only consider the other STXs around them, so more STXs can communicate freely with their paired SRXs. After T_l reaches a value of around 5,000, the performance enhancement slows down in every simulation case. We therefore need to choose a
Fig. 5: The average area spectral efficiency for various STX densities. The higher the density of STXs, the larger the performance gain from the learning procedure.
proper T_l as a trade-off between learning cost and performance gain.
In Fig. 6, the performance of the RL based DSA system surpasses that of CSMA/CA, as mentioned before. If the STXs utilize the CSMA/CA protocol, STXs located in sparse areas lose transmission opportunities while waiting for the CCA procedure. In the RL based DSA system, STXs located in sparse areas have final access probabilities close to 1 (see Fig. 4), which captures spatial transmission opportunities [16]. The STXs with near-to-one final access probabilities transmit their packets whenever a transmission is needed. Therefore, the performance of the RL based DSA system surpasses that of the CSMA/CA protocol.
C. Investigating the Impact on the Primary Network

With the slotted ALOHA protocol utilizing the final access probabilities obtained by the RL based DSA system, the secondary network has only a marginal impact on the primary network, as seen in Fig. 7. Nevertheless, as T_l gets larger, the average spectral efficiency of the primary user slightly deteriorates. As discussed in the previous section, a modest value of T_l yields sufficient performance enhancement of the secondary network. Therefore, selecting a proper value of T_l may help enhance the aggregate performance across the primary and secondary networks, which can be optimized in future work.
VI. CONCLUSION
In this paper, we proposed a reinforcement learning (RL) based dynamic spectrum access (DSA) system that aims at efficient spectrum usage for Internet-of-Things (IoT) networks. We presented the limitations of IoT devices and described the architecture of the sensor-aided DSA IoT network. The main objective of the RL based DSA system is to enhance the spatial spectrum reusability of the IoT network, which is evaluated by the
Fig. 6: Performance comparison between the proposed RL based DSA system and CSMA/CA. The final access probabilities at T_l = 10,000 are used for obtaining the performance of the proposed RL based DSA system.
Fig. 7: Average spectral efficiencies of the PTXs when the density of STXs is 0.004, i.e., the number of STXs is 20. The distance between PTXs and PRXs is 10 m.
average area spectral efficiency (ASE). We investigated the performance of the proposed RL based DSA system for various densities of IoT devices. The longer the learning period, the more the IoT network's performance improves, but the gain becomes marginal beyond a certain learning period. We showed that the impact of the IoT network on the incumbent network is marginal with respect to the incumbent network's average spectral efficiency, although a longer learning period causes slight performance degradation of the incumbent network. Selecting a proper learning period, so as to reduce both the impact on the incumbent network and the learning cost spent on marginal performance gains of the IoT network, remains a future research topic. In addition, a system architecture without a dedicated control channel must be considered to implement the RL based DSA system in real wireless communication networks.
ACKNOWLEDGEMENT
This work was partly supported by Institute for Information
& communications Technology Planning & Evaluation (IITP)
grant funded by the Korea government (MSIT) (No. 2018-0-
00923, Scalable Spectrum Sensing for Beyond 5G Communi-
cation) and IITP grant funded by the MSIT (No.2018-0-00170,
Virtual Presence in Moving Objects through 5G).
REFERENCES
[1] G. A. Akpakwu, B. J. Silva, G. P. Hancke, and A. M. Abu-Mahfouz, "A survey on 5G networks for the Internet of Things: communication technologies and challenges," IEEE Access, vol. 6, pp. 3619–3647, Dec. 2017.
[2] Z. Zhang, W. Zhang, S. Zeadally, Y. Wang, and Y. Liu, "Cognitive radio spectrum sensing framework based on multi-agent architecture for 5G networks," IEEE Wireless Communications, vol. 22, no. 6, pp. 34–39, Dec. 2015.
[3] R. S. Sutton and A. G. Barto, “Reinforcement learning: an introduction,”
Cambridge, MA: MIT Press, Mar. 1998.
[4] G. Alnwaimi, S. Vahid, and K. Moessner, "Dynamic heterogeneous learning games for opportunistic access in LTE-based macro/femtocell deployments," IEEE Transactions on Wireless Communications, vol. 14, no. 4, pp. 2294–2308, Apr. 2015.
[5] F. Bernardo, R. Agust, J. Perez-Romero, and O. Sallent, “An application
of reinforcement learning for efficient spectrum usage in next-generation
mobile cellular networks,” IEEE Transactions on Systems, Man, and
Cybernetics-PART C: Applications and Reviews, vol. 40, no. 4, pp. 477–
484, Jul. 2010.
[6] O. Onireti, A. Zoha, J. Moysen, A. Imran, L. Giupponi, M. A. Imran,
and A. Abu-Dayya, “A cell outage management framework for dense
heterogeneous networks,” IEEE Transactions on Vehicular Technology,
vol. 65, no. 4, pp. 2097–2113, Apr. 2016.
[7] S. Maghsudi and S. Stanczak, "Channel selection for network-assisted D2D communication via no-regret bandit learning with calibrated forecasting," IEEE Transactions on Wireless Communications, vol. 14, no. 3, pp. 1309–1322, Mar. 2015.
[8] R. Bonnefoi, L. Besson, C. Moy, E. Kaufmann, and J. Palicot, "Multi-armed bandit learning in IoT networks: learning helps even in non-stationary settings," in Proc. 12th EAI International Conference on Cognitive Radio Oriented Wireless Networks (CROWNCOM) 2017, Lisbon, Portugal, pp. 173–185, Feb. 2018.
[9] S. Wang, H. Liu, P. H. Gomes, and B. Krishnamachari, “Deep reinforce-
ment learning for dynamic multichannel access in wireless networks,”
IEEE Transactions on Cognitive Communications and Networking,
vol. 4, no. 2, pp. 257–264, Jun. 2018.
[10] V. Raj, I. Dias, T. Tholeti, and S. Kalyani, “Spectrum access in cognitive
radio using a two stage reinforcement learning approach,” IEEE Journal
of Selected Topics in Signal Processing, vol. 12, no. 1, pp. 20–34, Jan.
2018.
[11] D. M. Kim and S. L. Kim, “Exploiting regional differences: A spatially
adaptive random access,” IEEE Transaction on Wireless Communica-
tions, vol. 14, no. 8, pp. 4342–4352, Aug. 2015.
[12] V. V. Phansalkar and M. A. L. Thathachar, "Local and global optimization algorithms for generalized learning automata," Neural Computation, vol. 7, no. 5, pp. 950–973, Sep. 1995.
[13] K. Xu, M. Gerla, and S. Bae, "How effective is the IEEE 802.11 RTS/CTS handshake in ad hoc networks," in Proc. IEEE Global Telecommunications Conference, Taipei, Taiwan, Nov. 2002.
[14] J. Kim, S. W. Ko, H. Cha, and S. L. Kim, "Sense-and-predict: opportunistic MAC based on spatial interference correlation for cognitive radio networks," in Proc. 2017 IEEE International Symposium on Dynamic Spectrum Access Networks (DySPAN), Baltimore, MD, USA, Mar. 2017.
[15] ——, “Testbed verification of spectrum access opportunity detection in
cognitive radio networks,” in Proc. IEEE Asia-Pacific Conference on
Communications (APCC), Perth, Australia, Dec. 2017.
[16] T. Novlan, J. D. Matyjas, B. L. Ng, and J. Zhang, “Spatial spectrum
sensing-based device-to-device cellular networks,” IEEE Transactions on
Wireless Communications, vol. 15, no. 11, pp. 7299–7313, Nov. 2016.