Improved Decentralized Q-learning Algorithm for
Interference Reduction in LTE-Femtocells
Meryem Simsek, Andreas Czylwik
Department of Communication Systems
University of Duisburg-Essen
Bismarckstrasse 81, 47057 Duisburg, Germany
Email: {simsek,czylwik}@nts.uni-due.de
Ana Galindo-Serrano, Lorenza Giupponi
Centre Tecnològic de Telecomunicacions
de Catalunya (CTTC)
Barcelona, Spain 08860
Email: {ana.maria.galindo, lorenza.giupponi}@cttc.es
Abstract—Femtocells are receiving considerable interest in mobile communications as a strategy to overcome indoor coverage problems as well as to improve the efficiency of current macrocell systems. Nevertheless, the detrimental factor in such networks is co-channel interference between macrocells and femtocells, as well as among neighboring femtocells, which can dramatically decrease the overall capacity of the network. In this paper we propose a Reinforcement Learning (RL) framework, based on an improved decentralized Q-learning algorithm, for femtocells sharing the macrocell spectrum. Since the major drawback of Q-learning is its slow convergence, we propose a smart initialization procedure. The proposed algorithm is compared with a basic Q-learning algorithm and power control (PC) algorithms from the literature, e.g., fixed power allocation and received power based PC. The goal is to show the performance improvement and the enhanced convergence.
Index Terms—Femtocell system, interference management,
multi-agent system, decentralized Q-learning.
I. INTRODUCTION
The next generation mobile network (NGMN) aims to efficiently deploy low cost and low power cellular BSs in the subscriber's home environment, known as femtocells. NGMN aims to eliminate dead spots in homes and offices and to let multiple users efficiently use limited frequency resources by providing a better wireless environment that enables high capacity data transmission services. However, for co-channel and closed access femtocell deployments, mitigating the interference that femtocells cause to the existing macrocellular network is a major concern. The interference of femtocell networks cannot be fully avoided, but it should be reduced as much as possible.
A number of different deployment configurations have been considered for femtocells [2]. Corresponding scenarios are, for example, open or closed access, dedicated or co-channel deployment, and fixed or adaptive downlink transmit power. In particular, closed access femtocells deployed on the same channel as the macro network are considered the worst case interference scenario. A key requirement for co-channel femtocell deployment is to keep the increase in interference caused by femtocells low enough to ensure a low impact on the performance of the existing macrocellular network, while still ensuring enough transmit power for the femto BSs to achieve the target coverage and services. Femtocells that use co-channel allocation with macrocells can considerably increase wireless coverage and system capacity, especially for indoor and cell-edge users. However, this benefit is realized only when the interference between femtocells and macrocells is well managed. Since the algorithm used to control the femtocell transmit power is left as an implementation detail, a variety of models have been analyzed. As an initial analysis, in [3] all femto BSs transmit with equal maximum power. This leads to improved indoor coverage while rapidly degrading the macrocell performance. In [4], for example, the femto BS adjusts its maximum downlink transmit power as a function of air interface measurements to avoid interfering with macrocell user equipments (UEs). Examples of such measurements are the total received interference, the reference signal received power (RSRP) of the most dominant macro BS, etc. This scheme is open loop and is referred to as the received power based power control (PC) algorithm in this paper. Further PC schemes can be found in [5–7].
Due to the selfish nature of femtocells and uncertainty on
their number and locations, self-organization techniques are
needed. Self-organization will allow femtocells to integrate
themselves into the network of the operator, learn about
their environment (neighbouring cells, interference) and tune
their parameters (power, frequency) accordingly. As a result,
distributed interference management was considered in [8] by
using a powerful learning technique known as Reinforcement
Learning (RL). Here, Q-learning is applied to the distributed
femtocell setting in the form of decentralized Q-learning.
RL [9; 10] describes a learning scenario in which an agent tries to improve its behavior by taking actions in its environment and receiving a reward for performing well or a punishment for failure. As a comparatively new method, multi-agent RL has been applied in many fields, such as artificial intelligence, to solve multi-agent coordination and collaboration problems, since it is a promising approach for establishing autonomous agents that improve their performance with experience. A fundamental problem of its standard algorithm is that, although many tasks can asymptotically be learned by adopting the Markov Decision Process (MDP) framework, in practice they are not solvable in a reasonable amount of time.
Therefore, in this paper we present an improvement of the Q-learning algorithm proposed in [8] by introducing a new initialization method, which shows an enhanced convergence. Based on the Long Term Evolution (LTE)-femtocell system level simulation environment that we presented in [11], we show the performance of our proposed algorithm. We compare our results with the performance of algorithms from the literature, such as those in [3], [4] and [8].

2011 Wireless Advanced — 978-1-4577-0109-2/11/$26.00 ©2011 IEEE
The paper is organized as follows: In Section II, we
summarize well known PC algorithms in order to introduce
our proposed improved decentralized Q-learning algorithm in
Section III. In the next section (Section IV) we describe our
simulation environment and discuss simulation results. Finally,
we conclude the paper in Section V.
II. POWER CONTROL ALGORITHMS FOR FEMTOCELLS
In this section we describe some PC algorithms that have
been proposed for femtocells and that we will use for perfor-
mance comparison with our proposed algorithm.
We consider K macrocells, where m_K macro UEs are randomly located inside the macro coverage area. The macrocells are deployed in an urban area and coexist with L femtocells. Each femtocell provides service to its m_L associated femto UEs. We consider that the total bandwidth BW is divided into subchannels with bandwidth ∆f = 15 kHz. Orthogonal frequency division multiplexing (OFDM) symbols are grouped into resource blocks (RBs). Both macrocells and femtocells operate in the same frequency band and have the same amount R of available RBs. We consider proportional fair scheduling, in which all RBs are allocated to UEs. In this paper we focus on the downlink operation. For simplicity we neglect the time index in the following algorithms.
We denote with p_r^{k,M} and p_r^{l,F} the downlink transmit power of macro BS k and femto BS l in RB r, respectively. The maximum transmit powers for macro and femto BSs are p_{max}^{M} and p_{max}^{F}, respectively. In the following, transmit powers are denoted with p and the corresponding power levels in dBm with P. Signal-to-interference-plus-noise power ratios (SINR) are denoted with γ and the corresponding values in dB with Γ.
A. Fixed Power Allocation
The fixed power allocation method is the most basic and common power allocation scheme, in which the total transmit power of each BS is equally divided among the subcarriers of the system. Assuming there are 12 subcarriers per RB, the transmit power per RB is:

p_r^{k,M} = 12 \cdot \frac{p_{max}^{M}}{12 \cdot R} \quad \text{and} \quad p_r^{l,F} = 12 \cdot \frac{p_{max}^{F}}{12 \cdot R}.   (1)

Due to the usage of the maximum transmit power, this scheme improves the femto UE throughput with the drawback of interfering with the macro UEs. The macrocellular performance is expected to be reduced.
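Equation (1) reduces to p_max/R per RB, since the 12 subcarriers cancel out. A minimal sketch with illustrative values (0.1 W corresponds to the 20 dBm femto maximum; R = 6 RBs matches a 1.4 MHz LTE carrier):

```python
# Fixed power allocation (Eq. (1)): each BS divides its total maximum
# transmit power equally over the 12*R subcarriers, so each RB of 12
# subcarriers simply gets p_max / R.

def fixed_power_per_rb(p_max_watts, num_rbs):
    """Per-RB transmit power according to Eq. (1), in watts."""
    return 12.0 * p_max_watts / (12.0 * num_rbs)

# Femto BS with p_max = 0.1 W (20 dBm) and R = 6 RBs (1.4 MHz LTE):
p_rb = fixed_power_per_rb(0.1, 6)   # 0.1/6 W per RB
```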
B. Received Power Based Power Control Algorithm
To address the interference management in heterogeneous
networks with co-channel deployment of macro and femto
cells a received power based PC algorithm was proposed in
[4].

Algorithm 1 Decentralized Q-learning.
Initialize:
  for each s ∈ S, a ∈ A do
    initialize the Q-value representation mechanism Q^l(s, a)
  end for
  evaluate the starting state s ∈ S
Learning:
  loop
    generate a random number r between 0 and 1
    if r < ε then
      select an action randomly
    else
      select the action a ∈ A characterized by the minimum Q-value
    end if
    execute a
    receive an immediate cost c
    observe the next state s′
    update the table entry as follows:
      Q^l(s, a) ← (1 − α) Q^l(s, a) + α [c + λ min_a Q^l(s′, a)]
    s ← s′
  end loop

In this algorithm the femto BS l adjusts its maximum downlink transmit power as a function of air interface measurements and sets its maximum transmit power according to:

P'^{l,F}_{max} = \max\left[ \min\left( \alpha \cdot P_m^l + \beta,\; P_{max}^{l,F} \right),\; P_{min}^{l,F} \right],   (2)
where P_{min}^{l,F} is the minimum transmit power level of femto BS l, P_m^l is the received power level from the strongest co-channel macro BS, α = 0.8 and β = 40 dB. The parameter α is a linear scalar that allows altering the slope of the power control mapping curve and adjusting it to different macrocell sizes; β is a parameter expressed in dB that can be used for altering the exact range of P_m^l covered by the dynamic range of the power control. This PC algorithm is open loop and is promising in allowing an adequate femtocell coverage area without causing significant performance degradation for the macrocells. However, only the maximum transmit power is adapted; no RB-based power adaptation is considered.
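As a minimal sketch of Eq. (2) in the dBm domain: the mapped received macro power level α·P_m + β is clipped between the minimum and maximum femto power levels. The minimum level P_min = −80 dBm is an assumption for illustration (the paper does not state its value); α and β are as in the text.

```python
# Received power based PC (Eq. (2)): clip alpha*P_m + beta between the
# femto BS minimum and maximum transmit power levels (all in dBm/dB).
# p_min_dbm = -80 dBm is an assumed value, not taken from the paper.

def adjusted_max_power_dbm(p_m_dbm, p_max_dbm=20.0, p_min_dbm=-80.0,
                           alpha=0.8, beta=40.0):
    """Maximum femto downlink transmit power level in dBm, Eq. (2)."""
    return max(min(alpha * p_m_dbm + beta, p_max_dbm), p_min_dbm)

# Close to a macro BS (strong received power) the femto BS backs off:
print(adjusted_max_power_dbm(-50.0))   # 0.8*(-50) + 40 = 0.0 dBm
# Far from a macro BS the cap saturates at P_max:
print(adjusted_max_power_dbm(-20.0))   # min(24, 20) -> 20.0 dBm
```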
C. Basic Decentralized Q-learning Algorithm
In the Q-learning algorithm [14], agents learn based on the
state of the environment and a cost value. The agents learn by
taking actions and using feedback from the environment. The
Q-value Q(s, a) in Q-learning is an estimation of the value of
future costs if the agent takes a particular action a when it is in
a particular state s. By exploring the environment, the agents
create a table of Q-values for each state and each possible
action. Except, when making an exploratory move in case of
ǫ-greedy policy, the agents select the action with the minimum
Q-value.
The Q-learning algorithm with an ǫ-greedy policy has three
parameters: the learning rate α (0 ≤ α ≤ 1), the discount
factor λ (0 ≤ λ ≤ 1) and the ǫ-greedy parameter, which
is usually very small (0.01 ≤ ǫ ≤ 0.05). The learning rate
parameter limits how quickly learning can occur. The Q-
learning algorithm controls how quickly the Q-values can
change with each state/action change. If the learning rate is too
139
small, learning will occur very slowly. If the rate is too high,
then the algorithm might not converge. The discount factor
controls the value placed on future costs [9]. If the value is
low, immediate costs are optimized, while values closer to 1
cause the learning algorithm to more strongly count future
costs. The value of ǫ is a probability of taking a non-greedy
(exploratory) action in ǫ-greedy action selection method. A
non-zero value of ǫ insures that all state/action pairs will be
explored as the number of trials goes to infinity. If ǫ = 0 the
algorithm might miss optimal solutions.
The distributed femtocell scenario can be mathematically formulated by means of a stochastic game. We design our basic decentralized Q-learning based on [8]. Let S = {s_{r,1}, s_{r,2}, ..., s_{r,n}} be the set of possible states, and A = {a_{r,1}, a_{r,2}, ..., a_{r,m}} the set of possible actions that each femto BS l may choose with respect to RB r. The interactions between the multi-agent system and the environment at each time instant t corresponding to RB r consist of the following sequence:
• The agent l senses the state s_r^l = s ∈ S.
• Based on s, agent l selects an action a_r^l = a ∈ A.
• As a result, the environment makes a transition to the new state s′ ∈ S.
• The transition to the state s′ generates a cost c_r^l = c ∈ ℝ for agent l.
• The cost c is fed back to the agent and the process is repeated.
A summary of the Q-learning procedure is given in Algorithm 1.
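The learning loop of Algorithm 1 can be sketched as follows for a single agent (femto BS). The environment interaction itself is left abstract; the parameter values follow the paper (α = 0.5, λ = 0.9, ε-greedy with a small ε), and note that the update minimizes costs rather than maximizing rewards:

```python
import random
from collections import defaultdict

ALPHA, LAMBDA = 0.5, 0.9          # learning rate and discount factor

def select_action(Q, s, actions, eps=0.05):
    """Epsilon-greedy selection on costs: explore with probability eps,
    otherwise pick the action with the minimum Q-value."""
    if random.random() < eps:
        return random.choice(actions)
    return min(actions, key=lambda a: Q[(s, a)])

def q_update(Q, s, a, cost, s_next, actions):
    """Q(s,a) <- (1-alpha)*Q(s,a) + alpha*[c + lambda*min_a' Q(s',a')]."""
    best_future = min(Q[(s_next, b)] for b in actions)
    Q[(s, a)] = (1 - ALPHA) * Q[(s, a)] + ALPHA * (cost + LAMBDA * best_future)

Q = defaultdict(float)            # Q-table, implicitly zero-initialized
```

Zero-initializing the Q-table via `defaultdict(float)` corresponds to the basic scheme; Section III replaces exactly this initialization step.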
Within our system model, the SINR at macro UE m allocated in RB r of macrocell k at time t is:

\gamma_r^m = \frac{p_r^{k(m),M} \, h_r^{k,m,MM}}{\underbrace{\sum_{j=1,\, j \neq k}^{K} p_r^{j(m),M} \, h_r^{j,m,MM}}_{I_M} + \underbrace{\sum_{l=1}^{L} p_r^{l,F} \, h_r^{l,m,FM}}_{I_F} + \sigma^2}.   (3)

Here h_r^{k,m,MM} denotes the link gain between the transmitting macro BS k and its macro UE m; h_r^{j,m,MM} denotes the link gain between the transmitting macro BS j and macro UE m in the macrocell at BS k; h_r^{l,m,FM} denotes the link gain between the transmitting femto BS l and macro UE m of macrocell k; σ² is the noise power. I_M and I_F are the interference caused by the macro BSs and the femto BSs, respectively.
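Equation (3) can be sketched as a plain sum over interferers in linear units (powers in W, link gains unitless); the numerical values below are purely illustrative:

```python
# SINR at macro UE m in RB r (Eq. (3)): serving macro power times link
# gain, over macro interference I_M, femto interference I_F and noise.

def sinr(p_serving, h_serving, macro_interferers, femto_interferers,
         noise_power):
    """macro_/femto_interferers: iterables of (power, link gain) pairs."""
    i_m = sum(p * h for p, h in macro_interferers)   # I_M
    i_f = sum(p * h for p, h in femto_interferers)   # I_F
    return (p_serving * h_serving) / (i_m + i_f + noise_power)

gamma = sinr(p_serving=1.0, h_serving=1e-9,
             macro_interferers=[(1.0, 1e-11)],
             femto_interferers=[(0.01, 1e-10)],
             noise_power=1e-13)                      # linear SINR, ~90
```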
In the following, we summarize the Q-learning algorithm as it was proposed in [8].
• State: At time t, for femtocell l and RB r, the state is defined as

s_r^l = \{ I_r, P_{tot}^l \},

where I_r specifies the level of aggregated interference generated by the femtocell system in RB r. The set of possible values is based on:

I_r = \begin{cases} 0, & \text{if } \Gamma_r^m < \Gamma_{target} - 2\,\text{dB} \\ 1, & \text{if } \Gamma_{target} - 2\,\text{dB} \le \Gamma_r^m \le \Gamma_{target} + 2\,\text{dB} \\ 2, & \text{otherwise} \end{cases}

where Γ_r^m is the instantaneous SINR measured at macro user m for RB r and Γ_target = 20 dB represents the minimum value of SINR that can be perceived by the macro users. P_{tot}^l = \sum_{r=1}^{R} P_r^l denotes the total transmit power of femtocell l in all RBs at time t. The set of possible values is based on:

P_{tot}^l = \begin{cases} 0, & \text{if } P_{tot}^l < P_{max} - 6\,\text{dBm} \\ 1, & \text{if } P_{max} - 6\,\text{dBm} \le P_{tot}^l \le P_{max} + 6\,\text{dBm} \\ 2, & \text{otherwise} \end{cases}

where P_max = 20 dBm is the maximum transmit power that a femto BS can transmit.
• Action: The set of possible actions are the 60 power levels that a femto BS can assign to RB r. These power levels range from −80 to 20 dBm effective radiated power (ERP), with 1 dBm granularity between 0 dBm and 20 dBm, 2 dBm granularity between −40 dBm and 0 dBm, and 4 dBm granularity between −80 dBm and −40 dBm.
• Cost: The cost c_r^l incurred due to the assignment of action a in state s for femtocell l is:

c_r^l = \begin{cases} 500, & \text{if } P_{tot}^l > P_{max} \\ (\Gamma_r^m - \Gamma_{target})^2, & \text{otherwise} \end{cases}

where Γ_r^m is the instantaneous SINR value measured at macro UE m allocated at RB r. The rationale behind this cost function is that the total transmit power of each femtocell must not exceed the allowed value P_max, while the SINR at macro UE m should be close to the selected target Γ_target.
With respect to the Q-learning algorithm, the learning rate is α = 0.5 and the discount factor is λ = 0.9.
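The state indicators and cost of the basic scheme [8] can be sketched directly from the thresholds above (all values in dB/dBm, as given in the text):

```python
# Basic decentralized Q-learning [8]: state = (interference indicator,
# total-power indicator), cost = squared SINR deviation or a 500 penalty.

GAMMA_TARGET = 20.0   # target SINR at the macro UE, dB
P_MAX = 20.0          # maximum femto BS transmit power, dBm

def interference_indicator(gamma_dB):
    if gamma_dB < GAMMA_TARGET - 2:
        return 0
    if gamma_dB <= GAMMA_TARGET + 2:
        return 1
    return 2

def power_indicator(p_tot_dbm):
    if p_tot_dbm < P_MAX - 6:
        return 0
    if p_tot_dbm <= P_MAX + 6:
        return 1
    return 2

def cost(gamma_dB, p_tot_dbm):
    if p_tot_dbm > P_MAX:
        return 500.0          # power-budget violation penalty
    return (gamma_dB - GAMMA_TARGET) ** 2

state = (interference_indicator(17.0), power_indicator(15.0))  # -> (0, 1)
```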
III. IMPROVED DECENTRALIZED Q-LEARNING
ALGORITHM
The interactions between the multi-agent system and the environment at each time instant t, corresponding to RB r, consist for our improved decentralized Q-learning algorithm of the following elements:
• State: In our improved decentralized Q-learning algorithm we define the state with a finer granularity than in the basic Q-learning algorithm. At time t, for femtocell l and RB r, the state is defined as a quantized value of the SINR:

s_r^l = \begin{cases} -10, & \text{if } \Gamma_r^m \le -8\,\text{dB} \\ -6, & \text{if } -8\,\text{dB} < \Gamma_r^m \le -4\,\text{dB} \\ -2, & \text{if } -4\,\text{dB} < \Gamma_r^m \le 0\,\text{dB} \\ \vdots & \\ 38, & \text{if } \Gamma_r^m > 40\,\text{dB} \end{cases}

• Actions: The set of possible actions are the 25 power levels that a femto BS can assign to RB r. These power levels range from −80 dBm to 20 dBm ERP with 4 dBm granularity, where 20 dBm is the maximum power that a femto BS can transmit. The quantized power levels that are available for femtocell l in RB r at each time instant t are accordingly defined as:

P_{r,Q}^{l,F} \in [a_{r,1}, \ldots, a_{r,m}] = [-80, -76, -72, \ldots, 20] \text{ dBm}.

Notice that the set of actions has the same granularity as the state.
• Cost: For the Q-value update after our proposed initialization algorithm, we use the same cost function as in the basic Q-learning case, in order to show how the performance of Q-learning can be enhanced purely through the proposed initialization procedure. The cost function that we define for the initialization of the Q-table is introduced in the next subsection.
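The state quantization and action set above can be sketched as follows. The closed-form bin formula for the intermediate 4 dB bins is an inference from the listed cases (it reproduces (−8, −4] → −6, (−4, 0] → −2, and maps the target Γ = 20 dB to state 18, as in Fig. 1):

```python
import math

# Improved scheme: SINR mapped to 4 dB-wide bins labelled -10, -6, ...,
# 38, and actions taken from the quantized power levels -80..20 dBm.

def quantize_state(gamma_dB):
    """Map the instantaneous SINR (dB) to its quantized state value."""
    if gamma_dB <= -8.0:
        return -10
    if gamma_dB > 36.0:
        return 38
    # bin (-8, -4] -> -6, (-4, 0] -> -2, ... in 4 dB steps
    return int(4 * math.ceil(gamma_dB / 4.0)) - 2

ACTIONS_DBM = list(range(-80, 21, 4))   # -80, -76, ..., 20 dBm
```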
The major drawback of the Q-learning algorithm is its slow convergence: the traditional learning approach requires all state-action pairs to be visited at least once (and preferably a considerable number of times) during the learning process in order to determine an optimal policy. To overcome this drawback, we propose a new initialization procedure for the Q-learning algorithm, which yields a convergence and performance enhancement. When visiting a state for the first time, we do not only update the Q-value of a single state-action pair, but also add estimates of the cost function for all other possible actions of the current state, so that the Q-table is initialized faster.
The major change in our algorithm is thus not the definition of different states and actions, but the way of initializing the Q-table. Accordingly, the learning loop in Algorithm 1 is the same as in the basic decentralized Q-learning algorithm. In the following we therefore focus only on the initialization part.
A. Initialization of the Q-table
Fig. 1 illustrates the proposed initialization procedure of the Q-table. The x-y-plane in Fig. 1 represents the actions and the states, where each (x, y)-point is a state-action pair. The target SINR Γ_target that should be perceived by a macro UE is assumed to be 20 dB; this corresponds to state 18 as shown in Fig. 1.
Assume that the instantaneous SINR Γ_r^m of macro UE m in RB r corresponds to state s_i in Fig. 1 and that, according to the Q-learning algorithm (see Algorithm 1), action a_i is selected. The difference ∆s = s_i − 18 from state s_i to our target state 18 can then be expressed as Γ_r^m − Γ_target.
The interference at macro UE m in RB r is caused by both I_M and I_F. Since we assume a fixed power distribution at the macro BSs, this interference can be reduced by controlling the transmit power at the femto BSs. Increasing the transmit power of the femto BS in RB r will reduce the SINR and lead to a state closer to state 18, while at the same time better coverage and services at the femtocells are obtained. This power enhancement is illustrated in Fig. 1 as ∆a. Neglecting inter-cell interference and additive noise, the optimum action is to increase the femtocell transmit power level by ∆s. Therefore, the cost function, which has to be assigned to any state-action pair, has to be minimum for the action which already includes the transmit power level increase by ∆s.

[Fig. 1. Description of Q-table initialization: actions in dBm (x-axis) versus states in dB (y-axis), with the quadratic cost (Γ − Γ_target)², the offsets ∆s and ∆a, the target state Γ_target, and the actions a_i and a_new indicated.]
The obtained cost value for state s_i and selected action a_i is given by

c_r^l(P_{r,Q}^{l,F} = a_i) = (\Gamma_r^m - \Gamma_{target})^2.   (4)

Because of the quadratic cost function in (4), a quadratic increase of the initial estimated cost function with respect to the actions is also assumed. This initial estimated cost function is depicted in Fig. 1.
The objective of finding an action a_new that yields zero cost can be expressed as c_r^l(P_{r,Q}^{l,F} = a_new) = 0, where

a_{new} = a_i + \Delta a.

Accordingly, the new cost function can be expressed as

c_{r,new}^l(P_{r,Q}^{l,F}) = (P_{r,Q}^{l,F} - a_{new})^2 = (P_{r,Q}^{l,F} - a_i - (\Gamma_r^m - \Gamma_{target}))^2.   (5)

Using this new cost function, the Q-table is filled in state s_i for each action a ∈ A, which corresponds to the quantized power level P_{r,Q}^{l,F} in (5).
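The row-filling step of Eqs. (4)-(5) can be sketched as follows; the example values (s_i = 26, a_i = −40 dBm, Γ = 28 dB) are illustrative and chosen so that the state matches the SINR bin:

```python
# First-visit initialization of a Q-table row (Eqs. (4)-(5)): instead of
# storing a single cost, every action a of the visited state s_i gets the
# quadratic cost estimate (a - a_new)^2 with a_new = a_i + (Gamma - Gamma_target).

GAMMA_TARGET = 20.0                      # dB
ACTIONS_DBM = list(range(-80, 21, 4))    # quantized power levels, dBm

def initialize_row(Q, s_i, a_i, gamma_dB):
    """Fill Q(s_i, a) for all actions from the single observation (4)."""
    a_new = a_i + (gamma_dB - GAMMA_TARGET)   # estimated zero-cost action
    for a in ACTIONS_DBM:
        Q[(s_i, a)] = (a - a_new) ** 2        # Eq. (5)

Q = {}
initialize_row(Q, s_i=26, a_i=-40, gamma_dB=28.0)
# the estimated cost is minimal at a = -40 + 8 = -32 dBm
```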
B. Q-value Update after Initialization
In the following iterations of the Q-learning algorithm, the Q-values of unvisited states s ∈ S, s ≠ s_i, are initialized in the same way. For states that have been visited before, the basic Q-learning algorithm is used, i.e., the corresponding Q-value Q^l(s, a) is updated as shown in the learning loop of Algorithm 1.
IV. SIMULATION RESULTS AND DISCUSSION
An urban scenario is considered to validate the proposed Q-learning algorithm and to compare its performance with algorithms from the literature. The simulation scenario is depicted in Fig. 2 and is based on the 3GPP TSG (Technical Specification Group) RAN (Radio Access Network) WG4 (Working Group 4) simulation assumptions and parameters proposed in [2] and [13]. The simulation parameters are summarized in TABLE I.
[Fig. 2. Simulation scenario: three macrocell sectors (axes: distance in m) with macro UEs and femtocells, each femtocell serving one femto UE.]
TABLE I
SIMULATION PARAMETERS.

Parameter                                | Value
-----------------------------------------|---------------------------------------------
Cellular layout                          | Hexagonal grid, 3 sectors per site, reuse 1
Intersite distance                       | 500 m
Femtocell deployment scenario            | Dual-stripe model [12]
Number of sites                          | 1
Number of macro UEs per sector           | 3
Number of femto blocks (FB) per sector   | 1
Number of femto BSs per femto block      | 4
Number of femto UEs per femto BS         | 1
Carrier frequency                        | 2 GHz
System bandwidth                         | 1.4 MHz
Distance-dependent path loss             | see [12]
Shadowing standard deviation             | 8 dB
Shadowing correlation (between cells)    | 0.5
Shadowing correlation (between sectors)  | 1
Max. macro (femto) BS transmit power     | P_max^M = 46 dBm (P_max^F = 20 dBm)
Traffic model                            | Full buffer
Scheduling algorithm                     | Proportional fair
Macro UE speed                           | 3 km/h
Max. distance of macro UE to FB center   | 50 m
[Fig. 3. Average throughput of femto UE for 20 times 50 ms: CDF of the average throughput per femto UE in bps/Hz for fixed PC, received power based PC, basic Q-learning and the new Q-learning algorithm.]
System level simulations were performed with the LTE-femtocell system level simulator presented in [11]. For the dual-stripe deployment scenario, twenty random drops were simulated, each with simulation times of 50 ms and 500 ms, corresponding to 50 and 500 transmission time intervals (TTIs), respectively. In each of the drops we force the macro UEs to be randomly distributed within a radius of 50 m around the femtocell block center, in order to be placed within the coverage area of the femtocells.
For each PC algorithm the same random scenarios and channels were analyzed. Fig. 3 and Fig. 4 depict the cumulative distribution function of the average throughput after 50 TTIs of each random scenario for femto and macro UEs, respectively, for each of the presented PC algorithms. Since the 50 TTI and 500 TTI curves show a similar behaviour, we present results for the 50 TTI simulations only. For macro UEs, additional simulations were performed in the same deployment scenario, but without femtocells. This curve is used as a reference in order to show the performance degradation in the macrocells when femtocells are activated as interferers.
As expected, the fixed PC algorithm shows the highest data rate in the femtocells (Fig. 3). The drawback of this algorithm can be seen in Fig. 4: the key requirement of co-channel femtocells, not to impact the existing macrocell network, cannot be fulfilled by this algorithm. For the received power based PC algorithm, the average macro UE throughput is very close to the baseline curve; thus significantly less femtocell performance is obtained than in the case of fixed PC. The worst results were obtained for the basic Q-learning algorithm, since it requires longer periods of learning: neither macro UEs nor femto UEs achieve high data rates. This is due to the slow convergence of this algorithm; within 50 TTIs (and also 500 TTIs) the basic Q-learning algorithm cannot show its effectiveness.
[Fig. 4. Average throughput of macro UE for 20 times 50 ms: CDF of the average throughput per macro UE in bps/Hz (with a zoomed inset) for the no-femtocell reference, fixed PC, received power based PC, basic Q-learning and the new Q-learning algorithm.]

Using the proposed initialization procedure yields a performance increase for femto and macro UEs compared to the basic Q-learning algorithm. Our algorithm shows significantly better performance for the femto UEs while at the same time increasing the macro UE performance. Compared to the other algorithms, our proposed Q-learning algorithm performs better for femto UEs than the received power based PC and worse than the fixed PC algorithm. In terms of macro UE performance, our algorithm behaves similarly to the received power based PC, i.e., it is close to the performance of the reference case, while it is better than the fixed PC algorithm. Thus our proposed Q-learning algorithm offers the best tradeoff between the femto and macro UE average throughput and best fits the key requirement of co-channel femtocells.
V. CONCLUSION
In this paper we have presented a new decentralized Q-learning approach for interference management in a macrocellular network overlaid by femtocells to improve the system performance. The main drawback of the Q-learning approach is its slow convergence; we have shown a mitigation approach for this drawback by introducing a new initialization procedure for Q-learning. In a 3GPP compliant system level simulation environment, we have shown that, with respect to the common Q-learning algorithm, our proposal yields significant gains in terms of average UE throughput for both macro and femto UEs. In addition, we compared our results with a basic fixed PC algorithm, transmitting with maximum power, and an open-loop received power based PC algorithm. We have shown that our proposal best fits the key requirement for co-channel femtocells, which is to keep the increase in interference caused by femtocells low enough to ensure a low impact on the performance of the existing macrocellular network, while achieving the target coverage at the femtocells.
REFERENCES
[1] S. R. Saunders, S. Carlaw, A. Giustina, R. R. Bhat, V. S. Rao and R. Siegberg, Femtocells: Opportunities and Challenges for Business and Technology. Great Britain: John Wiley & Sons Ltd., 2009.
[2] 3GPP TS 25.820, ”3rd Generation Partnership Project;
Technical Specification Group Radio Access Network; 3G
Home NodeB Study Item Technical Report (Release 8)”,
V8.2.0 (2008-09).
[3] 3GPP TSG-RAN WG4 R4-070902, ”Initial home NodeB
coexistence simulation results”, Nokia Siemens Networks,
(2007-06)
[4] 3GPP TSG-RAN WG4 R4-094245, ”Interference control
for LTE Rel-9 HeNB cells”, Nokia Siemens Networks,
(2009-11)
[5] 3GPP TSG-RAN WG4 R4-071540, ”LTE Home Node B
downlink simulation results with flexible Home Node B
power”, Nokia Siemens Networks, (2007-10)
[6] 3GPP TSG-RAN WG4 R4-071578, ”Simulation results
of macro-cell and co-channel Home NodeB with power
configuration and open access”, Alcatel-Lucent, (2007-10)
[7] 3GPP TSG-RAN WG4 R4-071621, ”HNB Coexistence
Scenario Evaluation”, Qualcomm Europe, (2007-10)
[8] A. Galindo-Serrano and L. Giupponi, "Distributed Q-learning for Interference Control in OFDMA-based Femtocell Networks", IEEE 71st Vehicular Technology Conference, 2010, pp. 1-5.
[9] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, Cambridge, 1998.
[10] L. P. Kaelbling, M. L. Littman and A. W. Moore, "Reinforcement learning: A survey", Journal of Artificial Intelligence Research, vol. 4, pp. 237-285, 1996.
[11] M. Simsek et al., "An LTE-femtocell dynamic system level simulator," in Proc. International ITG Workshop on Smart Antennas (WSA), Feb. 2010, pp. 66-71.
[12] 3GPP TR 36.814, ”3rd Generation Partnership Project;
Technical Specification Group Radio Access Network;
Further advancements for E-UTRA physical layer aspects
(Release 9)”, V9.0.0 (2010-03).
[13] Femto Forum, (2008, Dec.) Interference
Management in UMTS Femtocells [Online]. Available:
http://www.femtoforum.org/femto/Files/File/FF
UMTS-Interference Management.pdf.
[14] C. J. C. H. Watkins, Learning from Delayed Rewards,
PhD thesis, Cambridge University, England, 1989.