Improved Decentralized Q-learning Algorithm for
Interference Reduction in LTE-Femtocells
Meryem Simsek, Andreas Czylwik
Department of Communication Systems
University of Duisburg-Essen
Bismarckstrasse 81, 47057 Duisburg, Germany
Email: {simsek,czylwik}@nts.uni-due.de
Ana Galindo-Serrano, Lorenza Giupponi
Centre Tecnològic de Telecomunicacions de Catalunya (CTTC)
Barcelona, Spain 08860
Email: {ana.maria.galindo, lorenza.giupponi}@cttc.es
Abstract—Femtocells are receiving considerable interest in mobile communications as a strategy to overcome indoor coverage problems as well as to improve the efficiency of current macrocell systems. Nevertheless, the detrimental factor in such networks is co-channel interference between macrocells and femtocells, as well as among neighboring femtocells, which can dramatically decrease the overall capacity of the network. In this paper we propose a Reinforcement Learning (RL) framework, based on an improved decentralized Q-learning algorithm, for femtocells sharing the macrocell spectrum. Since the major drawback of Q-learning is its slow convergence, we propose a smart initialization procedure. The proposed algorithm is compared with a basic Q-learning algorithm and with power control (PC) algorithms from the literature, e.g., fixed power allocation and received power based PC. The goal is to show the performance improvement and the enhanced convergence.
Index Terms—Femtocell system, interference management,
multi-agent system, decentralized Q-learning.
I. INTRODUCTION
The next generation mobile network (NGMN) aims to efficiently deploy low cost and low power cellular base stations (BSs), known as femtocells, in the subscriber's home environment. NGMN aims to eliminate dead spots in homes or offices and to let multiple users efficiently use limited frequency resources by providing a better wireless environment that enables high capacity data transmission services. However, for co-channel and closed access femtocell deployments, mitigating the interference caused by femtocells to the existing macrocellular network is a major concern. This interference cannot be fully avoided, but it should be reduced as much as possible.
A number of different deployment configurations have been considered for femtocells [2]. Corresponding scenarios are, for example, open or closed access, dedicated or co-channel deployment, and fixed or adaptive downlink transmit power. Especially closed access femtocells which are deployed on the same channel as the macro network are considered the worst case interference scenario. A key requirement for co-channel femtocell deployment is to keep the increase in interference caused by femtocells low enough to ensure a low impact on the performance of the existing macrocellular network, while still ensuring enough transmit power for femto BSs to achieve the target coverage and services. Femtocells that use co-channel allocation with macrocells can considerably increase wireless coverage and system capacity, especially for indoor and cell-edge users. However, this benefit is realized only when the interference between femtocells and macrocells is well managed. Since the algorithm used to control the femtocell transmit power is left as an implementation detail, a variety of models have been analyzed. In the initial analysis in [3], all femto BSs transmit with equal maximum power. This improves indoor coverage while rapidly decreasing the macrocell performance. In [4], for example, the femto BS adjusts its maximum downlink transmit power as a function of air interface measurements to avoid interfering with macrocell user equipments (UEs). Examples of such measurements are the total received interference, the reference signal received power (RSRP) of the most dominant macro BS, etc. This scheme is open loop and will be referred to as the received power based power control (PC) algorithm in this paper. Further PC schemes can be found in [5–7].
Due to the selfish nature of femtocells and uncertainty on
their number and locations, self-organization techniques are
needed. Self-organization will allow femtocells to integrate
themselves into the network of the operator, learn about
their environment (neighbouring cells, interference) and tune
their parameters (power, frequency) accordingly. As a result,
distributed interference management was considered in [8] by
using a powerful learning technique known as Reinforcement
Learning (RL). Here, Q-learning is applied to the distributed
femtocell setting in the form of decentralized Q-learning.
RL [9; 10] describes a learning scenario in which an agent tries to improve its behavior by taking actions in its environment and receiving a reward for performing well or a punishment for failure. Multi-agent RL has been applied in many fields, such as artificial intelligence, to solve multi-agent coordination and collaboration problems, since it is a promising approach for establishing autonomous agents that improve their performance with experience. A fundamental problem of its standard algorithm is that, although many tasks can asymptotically be learned by adopting the Markov Decision Process (MDP) framework, in practice they are not solvable in a reasonable amount of time.
Therefore, in this paper we show an improvement of the Q-learning algorithm presented in [8] by introducing a new initialization method, which leads to enhanced convergence. Based on the Long Term Evolution (LTE) femtocell system level simulation environment we have presented in [11], we show the performance of our proposed algorithm. We compare our results with the performance of algorithms from the literature, such as those in [3], [4] and [8].
The paper is organized as follows: In Section II, we summarize well-known PC algorithms in order to introduce our proposed improved decentralized Q-learning algorithm in Section III. In Section IV, we describe our simulation environment and discuss the simulation results. Finally, we conclude the paper in Section V.
II. POWER CONTROL ALGORITHMS FOR FEMTOCELLS
In this section we describe some PC algorithms that have been proposed for femtocells and that we will use for performance comparison with our proposed algorithm.
We consider K macrocells, where m_K macro UEs are randomly located inside the macro coverage area. The macrocells are deployed in an urban area and coexist with L femtocells. Each femtocell provides service to its m_L associated femto UEs. We consider that the total bandwidth BW is divided into subchannels with bandwidth Δf = 15 kHz. Orthogonal frequency division multiplexing (OFDM) symbols are grouped into resource blocks (RBs). Both macrocells and femtocells operate in the same frequency band and have the same amount R of available RBs. We consider proportional fair scheduling, in which all RBs are allocated to UEs. In this paper we focus on the downlink operation. For simplicity we neglect the time index in the following algorithms.
We denote by p_r^{k,M} and p_r^{l,F} the downlink transmit power of macro BS k and femto BS l in RB r, respectively. The maximum transmit powers of macro and femto BSs are p_max^M and p_max^F, respectively. In the following, transmit powers are denoted by p and the corresponding power levels in dBm by P. Signal-to-interference-plus-noise power ratios (SINR) are denoted by γ and the corresponding values in dB by Γ.
A. Fixed Power Allocation
The fixed power allocation method is the most basic and common power allocation scheme, in which the total transmit power of each BS is equally divided among the subcarriers of the system. Assuming there are 12 subcarriers per RB, the transmit power per RB is:
$$p_r^{k,M} = 12 \cdot \frac{p_{\max}^{M}}{12 \cdot R} \quad \text{and} \quad p_r^{l,F} = 12 \cdot \frac{p_{\max}^{F}}{12 \cdot R}. \qquad (1)$$
Due to the usage of the maximum transmit power, this scheme improves the femto UE throughput with the drawback of interfering with the macro UEs. The macrocellular performance is therefore expected to be reduced.
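As a small illustration of (1), the per-RB transmit power can be computed directly from the maximum transmit power and the number of RBs; the factor of 12 subcarriers per RB cancels out. The following Python sketch is not part of the paper; the function name and the example values (a 20 dBm femto BS with R = 6 RBs, matching the 1.4 MHz bandwidth used later) are ours.

```python
import math

def fixed_power_per_rb_dbm(p_max_dbm, num_rbs):
    """Fixed power allocation of (1): split the maximum transmit power
    equally over all RBs; the 12 subcarriers per RB cancel out."""
    p_max_mw = 10 ** (p_max_dbm / 10.0)            # dBm -> mW
    p_rb_mw = 12.0 * p_max_mw / (12.0 * num_rbs)   # = p_max / R
    return 10.0 * math.log10(p_rb_mw)              # mW -> dBm

# Example: a femto BS with 20 dBm maximum power and R = 6 RBs
print(round(fixed_power_per_rb_dbm(20.0, 6), 1))   # ~12.2 dBm per RB
```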
B. Received Power Based Power Control Algorithm
To address the interference management in heterogeneous
networks with co-channel deployment of macro and femto
cells a received power based PC algorithm was proposed in
[4]. In this algorithm the femto BS l adjusts its maximum
Algorithm 1 Decentralized Q-learning.
Initialize:
  for each s ∈ S, a ∈ A do
    initialize the Q-value representation mechanism Q^l(s, a)
  end for
  evaluate the starting state s ∈ S
Learning:
  loop
    generate a random number r between 0 and 1
    if (r < ε) then
      select an action a ∈ A randomly
    else
      select the action a ∈ A with the minimum Q-value
    end if
    execute a
    receive an immediate cost c
    observe the next state s'
    update the table entry as follows:
      Q^l(s, a) ← (1 − α) Q^l(s, a) + α [c + λ min_{a'} Q^l(s', a')]
    s ← s'
  end loop
In this algorithm, the femto BS l adjusts its maximum downlink transmit power as a function of air interface measurements and sets it according to:

$$P_{\max}^{l,F} = \max\left[\min\left(\alpha \cdot P_{m}^{l} + \beta,\; P_{\max}^{l,F}\right),\; P_{\min}^{l,F}\right], \qquad (2)$$
where P_min^{l,F} is the minimum transmit power level of femto BS l, P_m^l is the received power level from the strongest co-channel macro BS, α = 0.8 and β = 40 dB. The parameter α is a linear scalar that allows altering the slope of the power control mapping curve and the adjustment to different macrocell sizes; β is a parameter expressed in dB that can be used for altering the exact range of P_m^l covered by the dynamic range of the power control. This PC algorithm is open-loop and is promising in allowing an adequate femtocell coverage area without causing significant performance degradation to the macrocells. However, only the maximum transmit power is adapted; no RB-based power adaptation is considered.
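To make the mapping in (2) concrete, the sketch below evaluates it for a single femto BS. It is only an illustration under our own assumptions: the function and parameter names are ours, and the minimum power level P_min^{l,F}, which is not specified numerically above, is set to an arbitrary −10 dBm.

```python
def received_power_based_pc(p_m_dbm, p_max_dbm=20.0, p_min_dbm=-10.0,
                            alpha=0.8, beta_db=40.0):
    """Open-loop femto power cap of (2):
    P_max^{l,F} = max( min(alpha * P_m^l + beta, P_max^{l,F}), P_min^{l,F} ),
    where p_m_dbm is the received power level from the strongest
    co-channel macro BS and p_min_dbm is an assumed value."""
    return max(min(alpha * p_m_dbm + beta_db, p_max_dbm), p_min_dbm)

# Example: the strongest macro BS is received at -60 dBm, so the femto BS
# caps its maximum transmit power at 0.8 * (-60) + 40 = -8 dBm.
print(received_power_based_pc(-60.0))
```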
C. Basic Decentralized Q-learning Algorithm
In the Q-learning algorithm [14], agents learn based on the state of the environment and a cost value. The agents learn by taking actions and using feedback from the environment. The Q-value Q(s, a) in Q-learning is an estimate of the value of future costs if the agent takes a particular action a when it is in a particular state s. By exploring the environment, the agents create a table of Q-values for each state and each possible action. Except when making an exploratory move in the case of an ε-greedy policy, the agents select the action with the minimum Q-value.
The Q-learning algorithm with an ε-greedy policy has three parameters: the learning rate α (0 ≤ α ≤ 1), the discount factor λ (0 ≤ λ ≤ 1) and the ε-greedy parameter, which is usually very small (0.01 ≤ ε ≤ 0.05). The learning rate parameter limits how quickly learning can occur; it controls how quickly the Q-values can change with each state/action transition. If the learning rate is too small, learning will occur very slowly. If the rate is too high, the algorithm might not converge. The discount factor controls the value placed on future costs [9]. If the value is low, immediate costs are optimized, while values closer to 1 cause the learning algorithm to more strongly count future costs. The value of ε is the probability of taking a non-greedy (exploratory) action in the ε-greedy action selection method. A non-zero value of ε ensures that all state/action pairs will be explored as the number of trials goes to infinity. If ε = 0 the algorithm might miss optimal solutions.
The distributed femtocell scenario can be mathematically formulated by means of a stochastic game. We design our basic decentralized Q-learning based on [8]. Let S = {s_{r,1}, s_{r,2}, ..., s_{r,n}} be the set of possible states, and A = {a_{r,1}, a_{r,2}, ..., a_{r,m}} the set of possible actions that each femto BS l may choose with respect to RB r. The interactions between the multi-agent system and the environment at each time instant t corresponding to RB r consist of the following sequence:
- The agent l senses the state s_r^l = s ∈ S.
- Based on s, agent l selects an action a_r^l = a ∈ A.
- As a result, the environment makes a transition to the new state s' ∈ S.
- The transition to the state s' generates a cost c_r^l = c ∈ ℝ for agent l.
- The cost c is fed back to the agent and the process is repeated.
A summary of the Q-learning procedure is given in Algorithm 1.
Within our system model, the SINR at macro UE m allocated in RB r of macrocell k at time t is:

$$\gamma_r^m = \frac{p_r^{k(m),M}\, h_r^{k,m,MM}}{\underbrace{\sum_{j=1,\, j\neq k}^{K} p_r^{j(m),M}\, h_r^{j,m,MM}}_{I_M} + \underbrace{\sum_{l=1}^{L} p_r^{l,F}\, h_r^{l,m,FM}}_{I_F} + \sigma^2}. \qquad (3)$$

Here h_r^{k,m,MM} indicates the link gain between the transmitting macro BS k and its macro UE m; h_r^{j,m,MM} indicates the link gain between the transmitting macro BS j and macro UE m in the macrocell of BS k; h_r^{l,m,FM} indicates the link gain between the transmitting femto BS l and macro UE m of macrocell k; σ² is the noise power. I_M and I_F are the interferences caused by the macro BSs and the femto BSs, respectively.
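A minimal sketch of (3) in linear units is given below; the function and argument names are ours, and the conversion to dB (Γ = 10 log₁₀ γ) is left to the caller.

```python
import numpy as np

def macro_ue_sinr(p_serv, h_serv, p_macro, h_macro, p_femto, h_femto, noise):
    """SINR of (3) at macro UE m on RB r (all quantities in linear scale).

    p_serv, h_serv  : transmit power and link gain of the serving macro BS k
    p_macro, h_macro: powers and link gains of the K-1 interfering macro BSs
    p_femto, h_femto: powers and link gains of the L interfering femto BSs
    noise           : noise power sigma^2
    """
    i_macro = float(np.dot(p_macro, h_macro))   # I_M
    i_femto = float(np.dot(p_femto, h_femto))   # I_F
    return (p_serv * h_serv) / (i_macro + i_femto + noise)
```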
In the following, we sum up the Q-learning algorithm as it was proposed in [8].

State: At time t, for femtocell l and RB r, the state is defined as:

s_r^l = {I_r, P_tot^l},

where I_r specifies the level of aggregated interference generated by the femtocell system in RB r. The set of possible values is based on:

$$I_r = \begin{cases} 0, & \text{if } \Gamma_r^m < \Gamma_{\text{target}} - 2\ \text{dB} \\ 1, & \text{if } \Gamma_{\text{target}} - 2\ \text{dB} \le \Gamma_r^m \le \Gamma_{\text{target}} + 2\ \text{dB} \\ 2, & \text{otherwise,} \end{cases}$$

where Γ_r^m is the instantaneous SINR measured at macro user m for RB r and Γ_target = 20 dB represents the minimum value of SINR that can be perceived by the macro users. P_tot^l = Σ_{r=1}^{R} P_r^l denotes the total transmit power of femtocell l in all RBs at time t. The set of possible values is based on:

$$P_{\text{tot}}^l = \begin{cases} 0, & \text{if } P_{\text{tot}}^l < P_{\max} - 6\ \text{dBm} \\ 1, & \text{if } P_{\max} - 6\ \text{dBm} \le P_{\text{tot}}^l \le P_{\max} + 6\ \text{dBm} \\ 2, & \text{otherwise,} \end{cases}$$

where P_max = 20 dBm is the maximum transmit power that a femto BS can transmit.
Action: The set of possible actions consists of the 60 power levels that a femto BS can assign to RB r. These power levels range from −80 to 20 dBm effective radiated power (ERP), with 1 dBm granularity between 0 dBm and 20 dBm, 2 dBm granularity between −40 dBm and 0 dBm, and 4 dBm granularity between −80 dBm and −40 dBm.
Cost: The cost c_r^l incurred due to the assignment of action a in state s for femtocell l is:

$$c_r^l = \begin{cases} 500, & \text{if } P_{\text{tot}}^l > P_{\max} \\ (\Gamma_r^m - \Gamma_{\text{target}})^2, & \text{otherwise,} \end{cases}$$

where Γ_r^m is the instantaneous SINR value measured at macro UE m allocated at RB r. The rationale behind this cost function is that the total transmit power of each femtocell must not exceed the allowed value P_max, and the SINR at the macro UE m should be close to the selected target Γ_target.
With respect to the Q-learning algorithm, the learning rate
is α = 0.5 and the discount factor is λ = 0.9.
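For concreteness, the following Python sketch mirrors one femto BS agent of the basic algorithm: the state quantization of I_r and P_tot^l, the cost function, ε-greedy action selection and the cost-minimizing update of Algorithm 1. It is only a sketch under our own assumptions; the class and function names are ours, the ε value is picked from the range quoted above, a zero-initialized dictionary stands in for the Q-value representation mechanism, and the interaction with the environment (measuring Γ_r^m and P_tot^l) is omitted.

```python
import random
from collections import defaultdict

# Parameters as stated above for the basic algorithm (epsilon is an assumption).
ALPHA, LAMBDA, EPSILON = 0.5, 0.9, 0.05
GAMMA_TARGET_DB = 20.0
P_MAX_DBM = 20.0

def interference_state(gamma_m_db):
    """I_r component of the state: macro UE SINR quantized around the target."""
    if gamma_m_db < GAMMA_TARGET_DB - 2.0:
        return 0
    if gamma_m_db <= GAMMA_TARGET_DB + 2.0:
        return 1
    return 2

def power_state(p_tot_dbm):
    """P_tot^l component of the state: quantized total femto transmit power."""
    if p_tot_dbm < P_MAX_DBM - 6.0:
        return 0
    if p_tot_dbm <= P_MAX_DBM + 6.0:
        return 1
    return 2

def cost(gamma_m_db, p_tot_dbm):
    """Cost of the basic algorithm: heavy penalty above P_max, otherwise the
    squared deviation of the macro UE SINR from the target."""
    if p_tot_dbm > P_MAX_DBM:
        return 500.0
    return (gamma_m_db - GAMMA_TARGET_DB) ** 2

class BasicQAgent:
    """One femto BS/RB agent running the loop of Algorithm 1 (cost minimization)."""
    def __init__(self, actions):
        self.actions = actions        # candidate power levels in dBm
        self.q = defaultdict(float)   # Q[(state, action)], zero-initialized

    def select_action(self, state):
        if random.random() < EPSILON:                 # exploratory move
            return random.choice(self.actions)
        return min(self.actions, key=lambda a: self.q[(state, a)])

    def update(self, state, action, c, next_state):
        best_next = min(self.q[(next_state, a)] for a in self.actions)
        self.q[(state, action)] = ((1.0 - ALPHA) * self.q[(state, action)]
                                   + ALPHA * (c + LAMBDA * best_next))
```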
III. IMPROVED DECENTRALIZED Q-LEARNING
ALGORITHM
The interactions between the multi-agent system and the
environment at each time instant t, corresponding to RB r for
our improved decentralized Q-learning algorithm, consist of
the following elements:
State: In our improved decentralized Q-learning algorithm we define the state with a finer granularity than in the basic Q-learning algorithm. At time t, for femtocell l and RB r, the state is defined as the quantized value of the SINR:

$$s_r^l = \begin{cases} -10, & \text{if } \Gamma_r^m \le -8\ \text{dB} \\ -6, & \text{if } -8\ \text{dB} < \Gamma_r^m \le -4\ \text{dB} \\ -2, & \text{if } -4\ \text{dB} < \Gamma_r^m \le 0\ \text{dB} \\ \;\;\vdots & \\ 38, & \text{if } \Gamma_r^m > 40\ \text{dB.} \end{cases}$$
Actions: The set of possible actions consists of the 25 power levels that a femto BS can assign to RB r. These power levels range from −80 dBm to 20 dBm ERP with 4 dBm granularity, where 20 dBm is the maximum power that a femto BS can transmit. The quantized power levels that are available for femtocell l in RB r at each time instant t are accordingly defined as:

$$P_{r,Q}^{l,F} \in \{a_{r,1}, \ldots, a_{r,m}\} = \{-80, -76, -72, \ldots, 20\}\ \text{dBm}.$$

Notice that the set of actions has the same granularity as the state.
Cost: For the Q-value update after our proposed initialization algorithm, we use the same cost function as in the basic Q-learning algorithm, in order to show how the performance of Q-learning can be enhanced through the proposed initialization procedure. The cost function that we define for the initialization of the Q-table is introduced in the next subsection.
The major drawback of the Q-learning algorithm is its slow convergence: the learning approach requires all state-action pairs to be visited at least once (and preferably a considerable number of times) during the learning process in order to determine an optimal policy. In order to overcome this drawback, a new initialization procedure for the Q-learning algorithm is proposed, which shows a convergence/performance enhancement. When visiting a state for the first time, we do not only update the Q-value of a single state-action pair, but add estimates of the cost function for all other possible actions of the current state, so that the Q-table is initialized faster.

The major change in our algorithm is not the definition of different states and actions, but the way of initializing the Q-table, as stated before. Accordingly, the learning loop in Algorithm 1 is the same as in the basic decentralized Q-learning algorithm. In the following we therefore focus only on the initialization part.
A. Initialization of the Q-table
Fig. 1 illustrates the proposed initialization procedure of the Q-table. The x-y plane in Fig. 1 represents the actions and the states, where each (x, y) point is a state-action pair. The target SINR, Γ_target, that can be perceived by a macro UE is assumed to be 20 dB. This corresponds to state 18, as shown in Fig. 1.
Assuming that the instantaneous SINR Γ_r^m of macro UE m in RB r corresponds to state s_i in Fig. 1, action a_i is selected according to the Q-learning algorithm (see Algorithm 1). The difference Δs = s_i − 18 from state s_i to our target state 18 can be expressed as Γ_r^m − Γ_target.
Fig. 1. Description of Q-table initialization. (Axes: action [dBm] vs. state [dB]; the figure marks the quadratic cost (Γ − Γ_target)², the target Γ_target, the visited pair (s_i, a_i), the differences Δs and Δa, and the resulting action a_new.)

The interference at macro UE m in RB r is caused by both I_M and I_F. Since we assume a fixed power distribution at the macro BSs, this interference can be reduced by controlling the transmit power at the femto BSs. Increasing the transmit power of the femto BS in RB r will reduce the SINR and lead to a state closer to state 18, while at the same time better coverage and
services at femtocells are obtained. This power increase is illustrated in Fig. 1 as Δa. Neglecting inter-cell interference and additive noise, the optimum action is to increase the femtocell transmit power level by Δs. Therefore, the cost function, which has to be assigned to each state-action pair, has to be minimal for the action which already includes the transmit power level increase by Δs.
The obtained cost value for state s_i and selected action a_i is given by

$$c_r^l\big(P_{r,Q}^{l,F} = a_i\big) = \big(\Gamma_r^m - \Gamma_{\text{target}}\big)^2. \qquad (4)$$
Because of the quadratic cost function in (4), a quadratic increase of the initially estimated cost function with respect to the actions is assumed as well. This initial estimated cost function is depicted in Fig. 1.
The objective of finding an action a_new that obtains zero cost can be expressed as c_r^l(P_{r,Q}^{l,F} = a_new) = 0, where a_new is

a_new = a_i + Δa.
Accordingly, the new cost function can be expressed as

$$c_{r,\text{new}}^l\big(P_{r,Q}^{l,F}\big) = \big(P_{r,Q}^{l,F} - a_{\text{new}}\big)^2 = \big(P_{r,Q}^{l,F} - a_i - (\Gamma_r^m - \Gamma_{\text{target}})\big)^2. \qquad (5)$$

Using this new cost function, the Q-table is filled in state s_i for each action a ∈ A, which corresponds to the quantized power level P_{r,Q}^{l,F} in (5).
B. Q-value Update after Initialization
In the following iterations of the Q-learning algorithm, the Q-values of unvisited states s ∈ S, s ≠ s_i, are initialized in the same way. For states that have been visited before, the basic Q-learning algorithm is used, i.e., the corresponding Q-value Q^l(s, a) is updated as shown in the learning loop of Algorithm 1.
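The initialization of Section III-A can be sketched in Python as follows. The function names, the exact bin edges of the state quantization, and the example values are our own reading of the description above; the learning loop itself is unchanged with respect to Algorithm 1 and is therefore omitted.

```python
import math

GAMMA_TARGET_DB = 20.0
ACTIONS_DBM = list(range(-80, 21, 4))   # quantized femto power levels in dBm (ERP)

def quantized_state(gamma_m_db):
    """State of the improved algorithm: the macro UE SINR quantized into 4 dB
    bins labelled ..., -6, -2, 2, ..., so that the 20 dB target falls into
    state 18 (bin edges are our reading of the quantization above)."""
    if gamma_m_db <= -8.0:
        return -10
    if gamma_m_db > 40.0:
        return 38
    return int(4 * math.ceil(gamma_m_db / 4.0) - 2)

def init_q_row(q_table, state, action_i, gamma_m_db):
    """Q-table initialization of Section III-A: on the first visit to `state`,
    with selected action a_i and measured SINR Gamma_r^m, estimate the cost (5)
    for *all* actions instead of updating a single state-action pair. The
    zero-cost action is a_new = a_i + (Gamma_r^m - Gamma_target)."""
    a_new = action_i + (gamma_m_db - GAMMA_TARGET_DB)
    for a in ACTIONS_DBM:
        q_table[(state, a)] = (a - a_new) ** 2    # c_new of (5)

# Example: first visit to the state of a macro UE measured at 28 dB SINR while
# the femto BS transmits at -40 dBm in this RB. The 8 dB headroom above the
# 20 dB target makes a_new = -32 dBm the cheapest action, i.e., the femto BS
# may raise its power by 8 dB while pushing the macro UE SINR towards the target.
q = {}
init_q_row(q, quantized_state(28.0), -40, 28.0)
print(min(q, key=q.get))   # -> (26, -32)
```

After this first-visit initialization, subsequent visits to the same state update Q^l(s, a) with the basic rule, as described above.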
IV. SIMULATION RESULTS AND DISCUSSION
An urban scenario is considered to validate the proposed Q-learning algorithm and compare its performance with algorithms from the literature. The simulation scenario is depicted in Fig. 2 and is based on the 3GPP TSG (Technical Specification Group) RAN (Radio Access Network) WG4 (Working Group 4) simulation assumptions and parameters that have been proposed in [2] and [13]. The simulation parameters are summarized in TABLE I.
Fig. 2. Simulation scenario (three macro sectors with macro UEs and femtocells, each femtocell serving one femto UE; distances in m).
TABLE I
SIMULATION PARAMETERS.

Parameter                               | Value
Cellular layout                         | Hexagonal grid, 3 sectors per site, reuse 1
Intersite distance                      | 500 m
Femtocell deployment scenario           | Dual-stripe model [12]
Number of sites                         | 1
Number of macro UEs per sector          | 3
Number of femto blocks (FB) per sector  | 1
Number of femto BSs per femto block     | 4
Number of femto UEs per femto BS        | 1
Carrier frequency                       | 2 GHz
System bandwidth                        | 1.4 MHz
Distance dependent path loss            | see [12]
Shadowing standard deviation            | 8 dB
Shadowing correlation                   | 0.5 between cells, 1 between sectors
Max. macro (femto) BS transmit power    | P_max^M = 46 dBm (P_max^F = 20 dBm)
Traffic model                           | Full buffer
Scheduling algorithm                    | Proportional fair
Macro UE speed                          | 3 km/h
Max. distance of macro UE to FB center  | 50 m
Fig. 3. Average throughput of femto UE for 20 times 50 ms (cumulative distribution of the average throughput per femto UE in bps/Hz for Fixed PC, Received Power Based PC, Q-learning Basic and Q-learning New).
System level simulations were performed with the LTE-femtocell system level simulator presented in [11]. For the dual-stripe deployment scenario, twenty random drops were simulated, each with simulation times of 50 ms and 500 ms, which correspond to 50 and 500 transmission time intervals (TTI), respectively. In each of the drops we force the macro UEs to be randomly distributed within a radius of 50 m from the femtocell block center in order to be placed within the coverage area of the femtocells.
For each PC algorithm the same random scenarios and chan-
nels were analyzed. Fig. 3 and Fig. 4 depict the cumulative
distribution function of average throughput after 50 TTIs of
each random scenario for femto and macro UEs, respectively,
for each of the presented PC algorithms. Since the curves for 50 and 500 TTIs show a similar behaviour, we only present results for the 50 TTI simulations. For macro UEs, additional simulations were performed
in the same deployment scenario, but without femtocells.
This curve will be used as a reference in order to show the
performance degradation in macrocells when femtocells are
activated as interferers.
As expected, the fixed PC algorithm shows the highest data rate in the femtocells (Fig. 3). The drawback of this algorithm can be seen in Fig. 4: the key requirement of co-channel femtocells, not to impact the existing macrocell network, cannot be fulfilled by this algorithm. For the received power based PC algorithm, the average macro UE throughput is very close to the baseline curve. Thus, significantly less performance can be obtained for the femtocells than in the case of fixed PC. The worst results were obtained for the basic Q-learning algorithm, since it requires longer periods of learning. Neither macro UEs nor femto UEs can get high data rates. This is due to the bad convergence of this algorithm: within 50 TTIs (and also 500 TTIs) the basic Q-learning algorithm cannot show its effectiveness.
Fig. 4. Average throughput of macro UE for 20 times 50 ms (cumulative distribution of the average throughput per macro UE in bps/Hz for No Femtocell, Fixed PC, Received Power Based PC, Q-learning Basic and Q-learning New, with a zoomed inset).

Using the proposed initialization procedure yields a performance increase for femto and macro UEs compared to the
basic Q-learning algorithm. Our algorithm shows significantly better performance for the femto UEs while at the same time increasing the macro UE performance. Compared to the other algorithms, we can point out that our proposed Q-learning algorithm shows better performance for femto UEs than the received power based PC and lower performance than the fixed PC algorithm. In terms of macro UE performance, our algorithm shows a similar behaviour to the received power based PC, i.e., it is close to the performance of the reference case, while it is better than the fixed PC algorithm. Thus, our proposed Q-learning algorithm shows the best tradeoff between the femto and macro UE average throughput and fits best with the key requirement of co-channel femtocells.
V. CONCLUSION
In this paper we have presented a new decentralized Q-learning approach for interference management in a macrocellular network overlaid by femtocells, aiming to improve the system performance. The main drawback of the Q-learning approach is its slow convergence. To mitigate this drawback, we have introduced a new initialization procedure for Q-learning. We have shown in a 3GPP compliant system level simulation environment that, with respect to the common Q-learning algorithm, our proposal yields significant gains in terms of average UE throughput for both macro and femto UEs. In addition, we have compared our results with a basic fixed PC algorithm, transmitting with maximum power, and with an open-loop received power based PC algorithm. We have shown that our proposal fits best with the key requirement for co-channel femtocells, which is to keep the increase in interference caused by femtocells low enough to ensure a low impact on the performance of the existing macrocellular network, while achieving the target coverage at the femtocells.
REFERENCES
[1] S. R. Saunders, S. Carlaw, A. Giustina, R. R. Bhat, V. S. Rao and R. Siegberg, Femtocells: Opportunities and Challenges for Business and Technology. Great Britain: John Wiley & Sons Ltd., 2009.
[2] 3GPP TS 25.820, ”3rd Generation Partnership Project;
Technical Specification Group Radio Access Network; 3G
Home NodeB Study Item Technical Report (Release 8)”,
V8.2.0 (2008-09).
[3] 3GPP TSG-RAN WG4 R4-070902, ”Initial home NodeB
coexistence simulation results”, Nokia Siemens Networks,
(2007-06)
[4] 3GPP TSG-RAN WG4 R4-094245, ”Interference control
for LTE Rel-9 HeNB cells”, Nokia Siemens Networks,
(2009-11)
[5] 3GPP TSG-RAN WG4 R4-071540, ”LTE Home Node B
downlink simulation results with flexible Home Node B
power”, Nokia Siemens Networks, (2007-10)
[6] 3GPP TSG-RAN WG4 R4-071578, ”Simulation results
of macro-cell and co-channel Home NodeB with power
configuration and open access”, Alcatel-Lucent, (2007-10)
[7] 3GPP TSG-RAN WG4 R4-071621, ”HNB Coexistence
Scenario Evaluation”, Qualcomm Europe, (2007-10)
[8] A. Galindo-Serrano and L. Giupponi, "Distributed Q-learning for Interference Control in OFDMA-based Femtocell Networks", IEEE 71st Vehicular Technology Conference, 2010, pp. 1-5.
[9] R. S. Sutton and A. G. Barto, Reinforcement Learning:
An Introduction, Cambridge, 1998
[10] L. P. Kaelbling, M. L. Littman and A. W. Moore, "Reinforcement learning: A survey", Journal of Artificial Intelligence Research, vol. 4, pp. 237-285, 1996.
[11] M. Simsek et al., "An LTE-femtocell dynamic system level simulator", in Proc. IEEE Smart Antennas (WSA), 2010 International ITG Workshop, Feb. 2010, pp. 66-71.
[12] 3GPP TR 36.814, ”3rd Generation Partnership Project;
Technical Specification Group Radio Access Network;
Further advancements for E-UTRA physical layer aspects
(Release 9)”, V9.0.0 (2010-03).
[13] Femto Forum, (2008, Dec.) Interference Management in UMTS Femtocells [Online]. Available: http://www.femtoforum.org/femto/Files/File/FF UMTS-Interference Management.pdf.
[14] C. J. C. H. Watkins, Learning from Delayed Rewards,
PhD thesis, Cambridge University, England, 1989.