Received: 24 October 2020 | Revised: 8 March 2021 | Accepted: 18 March 2021
IET Communications
DOI: 10.1049/cmu2.12177
ORIGINAL RESEARCH PAPER
Multi-agent deep reinforcement learning-based energy efficient
power allocation in downlink MIMO-NOMA systems
Sonnam Jo | Chol Jong | Changsop Pak | Hakchol Ri
Telecommunication Research Center, Kim Il Sung
University, Taesong District, Pyongyang 999093,
Democratic People’s Republic of Korea
Correspondence
Sonnam Jo, Telecommunication Research Center, Kim Il Sung University, Taesong District, Pyongyang 999093, Democratic People’s Republic of Korea.
Email: josonnam@yahoo.com
Abstract
NOMA and MIMO are considered promising technologies for meeting the huge access demands and high data-rate requirements of 5G wireless networks. In this paper, the power allocation problem in a downlink MIMO-NOMA system is investigated with the aim of maximizing energy efficiency while guaranteeing the quality of service of all users. Two deep reinforcement learning-based frameworks, referred to as the multi-agent DDPG- and TD3-based power allocation frameworks, are proposed to solve this non-convex and dynamic optimization problem. In particular, taking the current channel conditions as input, each agent of the two multi-agent frameworks dynamically outputs the optimum power allocation policy for all users in its cluster via the DDPG/TD3 algorithm, and an additional actor network is added to the conventional multi-agent model to adjust the power volumes allocated to the clusters and thereby improve overall system performance. Finally, both frameworks refine the entire power allocation policy by updating the weights of the neural networks according to the feedback of the system. Simulation results show that the proposed multi-agent deep reinforcement learning-based power allocation frameworks can significantly improve the energy efficiency of the MIMO-NOMA system under various transmit power limitations and minimum data rates compared with other approaches, including a performance comparison with MIMO-OMA.
1 INTRODUCTION
Due to the explosive growth of Internet-of-Things-based large-scale heterogeneous networks and the emergence of fifth-generation (5G) wireless networks, communication requirements such as low latency, high data rates and massive connectivity have become increasingly stringent [1]. In order to meet
these stringent communication requirements and provide
high-quality communication services, non-orthogonal mul-
tiple access (NOMA) has been proposed. NOMA is widely
recognized as one of the most promising multiple access
technologies for 5G wireless networks [2]. Different from
conventional orthogonal multiple access (OMA), the multiple
users in NOMA systems can be served by the same frequency
time resources via power-domain multiplexing or code-domain
multiplexing, which can increase the network capacity and
fulfil the requirements of low latency, high throughput and
massive connectivity [3].

This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited.
© 2021 The Authors. IET Communications published by John Wiley & Sons Ltd on behalf of The Institution of Engineering and Technology

Research on NOMA has focused mostly on power allocation [4] and user matching [5], and the objectives are most often to improve user fairness [6], spectral efficiency (SE) [7] and energy efficiency (EE) [8]. The
power-domain NOMA can serve multiple users in the same time slot and has become a strong candidate in the development of 5G by allocating different power levels to different users to achieve multiple access [9]; the EE of a system can thus be improved by this technology. The power-domain
NOMA with the use of superposition coding (SC) at the
transmitter side and successive interference cancellation (SIC)
at the receiver side allows users to share the same resources
by multiplexing multiple users’ signals with different allocated
powers. In power-domain NOMA, power is allocated to users according to their channel gains: higher power is assigned to users with worse channel gains and lower power to users with better channel gains, and the user with the better channel gain then removes the interference from the other users by SIC. For the purpose of further
improving the system performance, NOMA is capable of being
combined with various technologies like multiple-input
multiple-output (MIMO) [10–14], device-to-device communication [15–17], deep learning [18–27] and cognitive radio [28, 29]. The
combination of MIMO and NOMA, which is defined as
MIMO-NOMA, is one of the hottest research areas. MIMO
is a conventional technology where the base station (BS) and
the user both have several antennas which can take advantage
of the parallel independent channel according to the charac-
teristics of space division multiplexing [30]. MIMO-NOMA
transmission can achieve high SE and low latency through
power-domain in NOMA and spatial diversity in MIMO [31].
Some research results show that MIMO-NOMA can further
improve the throughput of communication systems [32]. In this
paper we mainly focus on the issue of power allocation in the
power-domain NOMA with MIMO.
Designing the proper resource allocation algorithm is sig-
nificantly important to improve the SE or EE of the MIMO-
NOMA system [33]. Many model-based power allocation algo-
rithms have been proposed to increase the SE or EE in NOMA
systems [34, 35], and some studies have researched the joint
subchannel assignment and power allocation in order to maximize the EE of multicarrier NOMA systems [36, 37]. However, owing to the dynamics and uncertainty inherent in wireless communication systems, the complete knowledge or mathematical models required by these conventional power allocation approaches are often difficult or even impossible to obtain in practice [38]. Moreover, the high computational complexity of these algorithms makes them inefficient or even inapplicable for future communication networks, and dynamic resource allocation under fast-changing channel conditions poses a further challenge. Optimizing energy consumption through efficient resource allocation is therefore an important research issue for NOMA, and intelligent learning methods have been extensively developed to deal with these challenges. Some studies have used deep learning (DL) as a model-free and data-driven approach to reduce
the computational complexity with available training inputs and
outputs [39]. As one main branch of machine learning (ML), DL
is a useful method that can be applied to overcome the resource
allocation problem in the case of multiple users [21, 24,40]. DL
has been used to solve resource allocation problems by train-
ing the neural networks offline with simulated data first, and
then outputting results with the well-trained networks during
the online process [26, 27,41]. However, obtaining the correct
data set or optimal solutions used for training can be difficult
and the training process itself is actually time-consuming.
In mobile environments the accurate channel information
and the complete model of the environment’s evolution are
unknown, therefore, the model-free reinforcement learning
(RL) method has been used to solve stochastic optimization
problems in wireless networks. RL, as another main branch of
ML, can be used in solving the dynamic resources allocation
problems [18, 22,42]. RL is able to provide the solution for the
sequential decision-making problems where the environment is
unknown and the optimal policy is learned through interactions
with the environment [43, 44]. In RL connecting with wireless
networks, the agent (e.g. the base station or user) observes the
environment (e.g. wireless channel) states and discovers which
actions (e.g. decisions of subchannel assignment or power allo-
cation) yield the most numerical reward (e.g. immediate EE) by
trying them, and finally generate the policy of mapping states to
actions. That is, the agent selects an action for the environment,
and the state (e.g. current channel condition) changes after the
environment accepts the action. At the same time, the reward
is generated and then feedbacks to the agent. Finally, the agent
selects the next action based on the immediate EE and the cur-
rent channel condition in order to ensure energy efficiency in
wireless communications. The goal of RL is to adjust the param-
eters dynamically so as to maximize the reward. Besides, instead
of optimizing current benefits only, RL can generate almost
optimal decision policy which maximizes the long-term perfor-
mance of the system through constant interactions. Therefore,
RL has demonstrated its enormous advantages in many fields.
Most existing works that apply the model-free RL framework either discretise the continuous values of network parameters into a finite set of discrete levels or learn a stochastic policy.
Different from most of existing work that uses RL algorithm,
we consider the deterministic policy gradient-based actor-critic
reinforcement learning algorithm to solve power allocation opti-
mization problem with continuous-valued actions and states in
MIMO-NOMA systems.
In this paper, we investigate the dynamic power allo-
cation problem with multi-agent DRL-based methods in a
downlink MIMO-NOMA system. Motivated by the aforemen-
tioned considerations, we propose two multi-agent DRL-based
frameworks (i.e. multi-agent DDPG/TD3-based framework) to
improve the long-term EE of the MIMO-NOMA system while
ensuring minimum data rates of all users. We construct DRL
networks on the basis of the multi-agent model [45], including
the additional actor network which does not have its own critic
network and is updated by the combined influence of all the agents’ critic networks. We refer to these two frameworks as the multi-agent DDPG-based power allocation (MDPA) framework and the multi-agent TD3-based power allocation (MTPA) framework.
The main contributions of this paper are summarized as fol-
lows:
∙We develop two model-free DRL-based power allocation
frameworks to solve the power allocation problem in a down-
link MIMO-NOMA system. They are multi-agent model
frameworks based on DDPG and TD3 algorithms which are
on the basis of the deterministic policy gradient methods with
continuous action spaces, since the stochastic policy gradient
methods cannot properly solve the dynamics and uncertainty
that are inherent in wireless communication systems.
∙For our multi-agent model frameworks, every single agent
dynamically outputs the power allocation policy for all users
in every cluster of the MIMO-NOMA system, and we
also add the additional actor network to the conventional
multi-agent model in order to adjust power volumes allocated
to clusters to improve overall performance of the MIMO-
NOMA system. To the best of our knowledge, no such
method for power allocation in MIMO-NOMA systems has
been studied in the existing literature.
∙We provide the performance analysis of proposed power allo-
cation frameworks based on multi-agent deterministic pol-
icy gradient in two-user scenario in a cluster of the MIMO-
NOMA system and compare the EE of the proposed frame-
works under various power limitations and minimum data
rates with single agent DRL-based power allocations, the dis-
crete DRL-based one and the fixed power allocation strategy.
We also verify the advantage of energy performance of our
proposed frameworks over that of the MIMO-OMA system.
The simulation results show that the TD3-based framework achieves the best performance.
The rest of this paper is organized as follows. The related work is reviewed in Section 2. Section 3 introduces the system model and problem formulation of a downlink MIMO-NOMA system. In Section 4, we propose two multi-agent DRL-based algorithms to solve the dynamic power allocation problem of the MIMO-NOMA system. The simulation results are discussed in Section 5, and we conclude this study in Section 6.
2 RELATED WORK
The conventional RL algorithms suffer from slow convergence
speed and become less efficient for problems with large state
and action spaces. In order to overcome these issues, deep reinforcement learning (DRL), which combines deep learning with RL, has been proposed. One famous algorithm of DRL,
named deep Q-learning [46] that is one of the off-policy meth-
ods, uses a deep Q network (DQN) which applies deep neural
networks as function approximators to conventional RL. DRL
has already been used in many aspects such as power control in
NOMA systems [47, 48], resource allocation in heterogeneous
networks [25, 49] and the Internet of Things (IoT) [50]. DQN uses a replay buffer to store tuples of historical samples, which stabilizes training and makes efficient use of hardware optimization; mini-batches are randomly drawn from the replay buffer to update the weights of the networks during training. However, the main drawback of DQN is that the output
decision can only be discrete, which brings quantization error
for continuous action tasks. Besides, the output dimension of
DQN increases exponentially for multi-action and joint optimization tasks. Some works that introduced the model-free RL framework to solve stochastic optimization problems typically discretise the continuous variables of the studied scenario into a finite set of discrete values (levels), such as quantized energy or power levels. However, such methods destroy the completeness of the continuous space and introduce quantization noise, and are thus incapable of finding the true optimal policy. RL was initially studied only with discrete action spaces,
however, practical problems sometimes require control actions
in a continuous action space [51]. Concerning our problem with
energy efficiency, both the environment state (i.e. wireless chan-
nel condition) and the action (i.e. transmission power) have con-
tinuous spaces. For problems with continuous-valued actions, the policy gradient methods [52] are very effective and, in particular, can learn stochastic policies. In order to overcome
the inefficiency and high variance of evaluating a policy in the
policy gradient methods, the actor-critic learning algorithm is
introduced to solve the optimal decision-making problems with
an infinite horizon [53]. Actor-critic algorithm is a well-known
architecture based on policy gradient theorem which allows
applications in continuous space [54]. In the actor-critic method
[55], the actor is used to generate stochastic actions whereas the
critic is used to estimate the value function and criticizes the pol-
icy that is guiding the actor, and the policy gradient evaluated by
the critic process is used to update policy.
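The replay-buffer mechanism described above can be sketched in a few lines of Python; the capacity, batch size and dummy transitions here are illustrative, not values from the paper:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of (state, action, reward, next_state) transitions."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)   # oldest samples are dropped first

    def store(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # uniform random mini-batch, breaking the temporal correlation of samples
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

# fill the buffer with dummy transitions, then draw one training mini-batch
buf = ReplayBuffer(capacity=100)
for t in range(150):
    buf.store(t, 0.5, 1.0, t + 1)
batch = buf.sample(32)
```

Because sampling is uniform over the stored history, the mini-batches used for the weight updates are approximately independent, which is what stabilizes DQN (and, later, DDPG/TD3) training.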
In [51], authors developed deep deterministic policy gradient
(DDPG) method to improve the trainability of the stochastic
policy gradient method with continuous action space problems.
DDPG is an enhanced version of the deterministic policy gradi-
ent (DPG) algorithm [56] based on the actor-critic architecture,
which uses an actor network to output a deterministic action
and a critic network to evaluate the action [46]. DDPG takes
the advantages of experience replay buffer and target network
strategies from DQN to improve learning stability. By using
the combination of actor-critic algorithm, replay buffer, target
networks and batch normalization, DDPG is able to perform
continuous control, while the training efficiency and stability
are enhanced from original actor-critic network [51]. These fea-
tures make DDPG more efficient for dynamic resource alloca-
tion problems with continuous action space. They also managed
to make it an ideal candidate for application in industrial settings
[42, 57]. A concern, however, is that the overestimation bias of the critic can pass to the policy through the policy gradient in DDPG; moreover, if the actor is updated slowly, the target critics are too similar to the current critics to avoid overestimation bias. Many algorithms have been proposed to address the overestimation bias in Q-learning. Double DQN [58] addresses the overestima-
tion problem by using two independent Q functions. Neverthe-
less, it is difficult to find appropriate weights for different tasks
or environments. The most common failure mode of DDPG is that the learned Q-function begins by overestimating Q-values and eventually the policy breaks down because it exploits the errors in the Q-function. Over-
estimation bias is a property of Q-learning which is caused by
maximization of a noisy value estimate. The value target esti-
mate depends on the target actor and target critic. As the tar-
get actor is constantly updating, the next action used in the
one-step value target estimate is also changing. Therefore, the
update of target actor may cause a big change in target value
estimate, which would cause instability in critic training [59]. In
order to reduce overestimation bias problems, the authors in
[60] extended DDPG to twin delayed deep deterministic policy
gradient algorithm (TD3), which estimates the target Q value
by using the minimum of two target Q values, called clipped double Q-learning. Together with the successful work on DQN, the DPG and DDPG algorithms paved the way to TD3 [60, 61]. TD3
adopts two critics to get a less optimistic estimate of an action
value by taking the minimum between two estimates. In TD3,
there is just one actor which is optimized with respect to the
smaller of two critics. In order to ensure the TD-error remains
small, the actor is updated at a lower frequency than the critic,
which results in higher quality policy updates in practice [59].
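The clipped double-Q target just described can be sketched as follows. The networks are replaced by toy stand-in functions, and all names and numbers are illustrative assumptions rather than the paper's implementation; in a full TD3 loop the actor would additionally be updated less often than the two critics:

```python
import random

def td3_target(reward, next_state, gamma, target_actor, target_q1, target_q2,
               noise_std=0.2, noise_clip=0.5):
    """Clipped double-Q target of TD3: y = r + gamma * min(Q1', Q2')."""
    # target policy smoothing: perturb the target action with clipped noise
    noise = max(-noise_clip, min(noise_clip, random.gauss(0.0, noise_std)))
    a_next = target_actor(next_state) + noise
    # clipped double Q-learning: take the smaller of the two target critics
    q_min = min(target_q1(next_state, a_next), target_q2(next_state, a_next))
    return reward + gamma * q_min

# toy stand-ins for the target networks (illustrative only)
actor = lambda s: 0.5 * s
q1 = lambda s, a: s + a            # one target critic
q2 = lambda s, a: s + a + 0.3      # the other target critic, deliberately higher

y = td3_target(reward=1.0, next_state=2.0, gamma=0.9,
               target_actor=actor, target_q1=q1, target_q2=q2, noise_std=0.0)
# with zero noise: a' = 1.0, min(Q1, Q2) = min(3.0, 3.3) = 3.0, y = 1.0 + 0.9 * 3.0 = 3.7
```

Taking the minimum of the two estimates means the more optimistic critic never sets the target, which is exactly how TD3 curbs the overestimation bias discussed above.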
Since we focus on the issue of power allocation in MIMO-
NOMA systems which can be generally comprised of several
clusters according to users’ propagation channel conditions, we
also investigate the multi-agent RL methods, where single agent
can output the policy for one cluster of the MIMO-NOMA sys-
tem. With the development of DRL, single-agent algorithms have matured and have already addressed plenty of difficult problems, giving researchers more powerful instruments for studying high-dimensional and more complex action spaces. Through further study of single-agent RL, researchers realized that introducing multiple agents can achieve higher performance in complex mission environments [62]. The multi-
agent DDPG algorithm that extends the DDPG algorithm to
the multi-agent domain is proposed for mixed cooperative-
competitive environments in [63], although it did not give further consideration to joint overestimation in multi-agent environments. In [45], the multi-agent TD3 algorithm is proposed to reduce the overestimation error of the critic networks, and the authors showed that it outperforms the multi-agent DDPG algorithm in complex mission environments. These ideas extend naturally to other multi-agent RL algorithms.
3 SYSTEM MODEL AND PROBLEM FORMULATION
3.1 System model
In this paper we consider a downlink multi-user MIMO-NOMA system, in which a BS equipped with $M$ antennas sends data to multiple receivers, each equipped with $N$ antennas. The total number of users in the system is $M \times L$, and they are randomly grouped into $M$ clusters with $L$ ($L \ge 2$) users per cluster. NOMA is applied among the users in the same cluster. For the signal to the $l$-th user in cluster $m$, denoted by UE$_{m,l}$, the BS precodes the superposed signal $x_m = \sum_{l=1}^{L} \sqrt{p_{m,l}}\, x_{m,l}$ using a transmit beamforming matrix $\mathbf{F}$, where $p_{m,l}$ is the power allocated to the $l$-th user in cluster $m$ and $x_{m,l}$ denotes the normalized transmit signal. We consider the composite channel model with both Rayleigh fading and large-scale path loss. In particular,
the channel matrix $\mathbf{H}_{m,l}$ from the BS to UE$_{m,l}$ can be represented as [13]:

$$\mathbf{H}_{m,l} = \frac{\mathbf{G}_{m,l}}{L(d_{m,l})}, \tag{1}$$

$$L(d_{m,l}) = \begin{cases} d_{m,l}^{\zeta}, & d_{m,l} > d_0 \\ d_0^{\zeta}, & d_{m,l} \le d_0 \end{cases}, \tag{2}$$

where $\mathbf{G}_{m,l} \in \mathbb{C}^{N \times M}$ represents the Rayleigh fading channel gain, $L(d_{m,l})$ denotes the path loss of UE$_{m,l}$ located at a distance $d_{m,l}$ (km) from the BS, assumed to be the same at each receive antenna, $d_0$ is the reference distance according to cell size, and $\zeta$ denotes the path loss exponent.
The precoding matrix used at the BS is denoted as $\mathbf{F} \in \mathbb{C}^{M \times M}$, which implies that antenna $m$ serves cluster $m$ with the power $p_m = \sum_{l=1}^{L} p_{m,l}$ for any $m$. At the receive side, UE$_{m,l}$ employs the unit-norm receive detection vector $\mathbf{v}_{m,l} \in \mathbb{C}^{N \times 1}$ to suppress the inter-cluster interference. In order to completely remove inter-cluster interference, the precoding and detection matrices need to satisfy the following constraints: (1) $\mathbf{F} = \mathbf{I}_M$, with $\mathbf{I}_M$ denoting the $M \times M$ identity matrix; (2) $\|\mathbf{v}_{m,l}\|^2 = 1$ and $\mathbf{v}_{m,l}^H \mathbf{H}_{m,l} \mathbf{f}_k = 0$ for any $k \ne m$, where $\mathbf{f}_k$ is the $k$-th column of $\mathbf{F}$ [12]. It should be noted that the number of antennas should satisfy $N \ge M$ to make this feasible. Because of the zero-forcing (ZF)-based detection design, the inter-cluster interference can be removed even when there exist multiple users in a cluster. Note that only a scalar value $|\mathbf{v}_{m,l}^H \mathbf{H}_{m,l} \mathbf{f}_m|^2$ needs to be fed back to the BS from UE$_{m,l}$. For the considered MIMO-NOMA scheme, the BS multiplexes the intended signals for all users at the same frequency and time resource. Therefore, the corresponding transmitted signal from the BS can be expressed as:

$$\mathbf{s} = \mathbf{F}\mathbf{x}, \tag{3}$$
where the information-bearing vector $\mathbf{x} \in \mathbb{C}^{M \times 1}$ can be further written as:

$$\mathbf{x} = \begin{bmatrix} \sqrt{p_{1,1}}\, x_{1,1} + \cdots + \sqrt{p_{1,L}}\, x_{1,L} \\ \vdots \\ \sqrt{p_{M,1}}\, x_{M,1} + \cdots + \sqrt{p_{M,L}}\, x_{M,L} \end{bmatrix}, \tag{4}$$

where $\sum_{m=1}^{M} \sum_{l=1}^{L} p_{m,l} \le p_{\max}$, with $p_{\max}$ denoting the total transmit power at the BS. Accordingly, the observed signal at UE$_{m,l}$ is
given by
$$\mathbf{y}_{m,l} = \mathbf{H}_{m,l}\mathbf{s} + \mathbf{n}_{m,l}, \tag{5}$$

where $\mathbf{n}_{m,l} \sim \mathcal{CN}(0, \sigma^2 \mathbf{I})$ is the independent and identically distributed (i.i.d.) additive white Gaussian noise (AWGN) vector. By applying the detection vector $\mathbf{v}_{m,l}$ to the observed signal, Equation (5) can be expressed as [10]:

$$\mathbf{v}_{m,l}^H \mathbf{y}_{m,l} = \mathbf{v}_{m,l}^H \mathbf{H}_{m,l} \mathbf{f}_m \sum_{l'=1}^{L} \sqrt{p_{m,l'}}\, x_{m,l'} + \sum_{k=1,\, k \ne m}^{M} \mathbf{v}_{m,l}^H \mathbf{H}_{m,l} \mathbf{f}_k x_k + \mathbf{v}_{m,l}^H \mathbf{n}_{m,l}, \tag{6}$$
where the second term is the interference from other clusters, and $x_k$ denotes the $k$-th element of $\mathbf{x}$. Owing to the constraint¹ on the detection vector, i.e. $\mathbf{v}_{m,l}^H \mathbf{H}_{m,l} \mathbf{f}_k = 0$ for any $k \ne m$, Equation (6) can be simplified as:

$$\mathbf{v}_{m,l}^H \mathbf{y}_{m,l} = \mathbf{v}_{m,l}^H \mathbf{H}_{m,l} \mathbf{f}_m \sum_{l'=1}^{L} \sqrt{p_{m,l'}}\, x_{m,l'} + \mathbf{v}_{m,l}^H \mathbf{n}_{m,l}. \tag{7}$$

¹ Due to the specific selection of $\mathbf{F}$, this constraint reduces to $\mathbf{v}_{m,l}^H \tilde{\mathbf{H}}_{m,l} = 0$, where $\tilde{\mathbf{H}}_{m,l} = [\mathbf{h}_{1,ml} \cdots \mathbf{h}_{m-1,ml}\ \mathbf{h}_{m+1,ml} \cdots \mathbf{h}_{M,ml}]$ and $\mathbf{h}_{i,ml}$ is the $i$-th column of $\mathbf{H}_{m,l}$. Therefore, $\mathbf{v}_{m,l}$ can be expressed as $\mathbf{U}_{m,l}\mathbf{w}_{m,l}$, where $\mathbf{U}_{m,l}$ is the matrix consisting of the left singular vectors of $\tilde{\mathbf{H}}_{m,l}$ corresponding to zero singular values, and $\mathbf{w}_{m,l}$ is the maximum ratio combining vector $\mathbf{U}_{m,l}^H \mathbf{h}_{m,ml} / \|\mathbf{U}_{m,l}^H \mathbf{h}_{m,ml}\|$ [11, 12].
Without loss of generality, the effective channel gains are ordered as:

$$|\mathbf{v}_{m,1}^H \mathbf{H}_{m,1} \mathbf{f}_m|^2 \ge \cdots \ge |\mathbf{v}_{m,L}^H \mathbf{H}_{m,L} \mathbf{f}_m|^2. \tag{8}$$
At the receiver, each user conducts SIC to remove the interference from the users with worse channel gains, i.e. the interference from UE$_{m,l+1}, \ldots,$ UE$_{m,L}$ is removed by UE$_{m,l}$. As a result, the achieved data rate at UE$_{m,l}$ is given by

$$\xi_{m,l} = B_{m,l} \log_2\!\left(1 + \frac{p_{m,l}\,|\mathbf{v}_{m,l}^H \mathbf{H}_{m,l} \mathbf{f}_m|^2}{\sum_{k=1}^{l-1} p_{m,k}\,|\mathbf{v}_{m,l}^H \mathbf{H}_{m,l} \mathbf{f}_m|^2 + \sigma^2}\right), \tag{9}$$
where $B_{m,l}$ is the bandwidth assigned to UE$_{m,l}$.
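Equation (9) can be checked with a small numeric sketch. The effective channel gain, bandwidth and noise power below are illustrative placeholders (for simplicity both users are given the same effective gain); with SIC, user $l$ sees residual interference only from the stronger-channel users $1, \ldots, l-1$:

```python
import math

def user_rate(l, powers, gain, bandwidth, noise_power):
    """Achievable rate of user l (1-indexed) in one cluster, as in Eq. (9):
    user l cancels the signals of weaker-channel users l+1..L via SIC,
    so only users 1..l-1 remain as interference."""
    signal = powers[l - 1] * gain
    interference = sum(powers[k] * gain for k in range(l - 1))
    sinr = signal / (interference + noise_power)
    return bandwidth * math.log2(1.0 + sinr)

# two-user cluster: user 1 (better channel, less power), user 2 (worse, more power)
powers = [0.2, 0.8]   # watts, illustrative
gain = 1.0            # |v^H H f|^2, illustrative effective channel gain
rate1 = user_rate(1, powers, gain, bandwidth=1.0, noise_power=0.1)  # no interference
rate2 = user_rate(2, powers, gain, bandwidth=1.0, noise_power=0.1)  # sees user 1's 0.2 W
```

Note that the power ordering matches constraint C4 of the optimization problem in Section 3.2: the user with the worse channel gain receives the larger share of the cluster power.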
3.2 Problem formulation
The total power consumption at the BS comprises two parts: the fixed circuit power consumption $p_c$ and the flexible transmit power $\sum_{m=1}^{M} \sum_{l=1}^{L} p_{m,l} \le p_{\max}$. In this work, we optimize the power allocation coefficient $q_{m,l}$ instead of the power volume $p_{m,l}$, i.e. $q_{m,l} = p_{m,l}/p_m$ is the power allocation coefficient for UE$_{m,l}$ in cluster $m$, where $p_m$ is the power allocated to cluster $m$ in a cell. Thus, Equation (9) can be rewritten as:

$$\xi_{m,l} = B_{m,l} \log_2\!\left(1 + \frac{p_m q_{m,l}\,|\mathbf{v}_{m,l}^H \mathbf{H}_{m,l} \mathbf{f}_m|^2}{\sum_{k=1}^{l-1} p_m q_{m,k}\,|\mathbf{v}_{m,l}^H \mathbf{H}_{m,l} \mathbf{f}_m|^2 + \sigma^2}\right). \tag{10}$$
Similar to [11, 34], we define the EE (energy efficiency) of the system in cluster $m$ as:

$$\eta_{EE}^{m} = \frac{\xi_m}{p_c + p_m}, \tag{11}$$

where $\xi_m = \sum_{l=1}^{L} \xi_{m,l}$ denotes the achievable sum rate in cluster $m$. We aim to maximize the EE of the MIMO-NOMA system when each user has a predefined minimum rate $\xi_{m,l}^{\min}$. Thus, our considered problem can be formulated as:
$$\mathrm{P1}: \max_{p_m,\, q_{m,l}} \sum_{m=1}^{M} \eta_{EE}^{m}, \tag{12}$$

s.t.
C1: $\xi_{m,l} \ge \xi_{m,l}^{\min},\ m \in \{1,\ldots,M\},\ l \in \{1,\ldots,L\}$,
C2: $\sum_{m=1}^{M} p_m \le p_{\max}$,
C3: $\sum_{l=1}^{L} q_{m,l} \le 1$,
C4: $q_{m,i} \le q_{m,j},\ i \le j,\ i,j \in \{1,\ldots,L\}$,
C5: $p_m \ge 0,\ q_{m,l} \in [0,1],\ \forall m, l$,

FIGURE 1 The architecture of wireless network with small-cells and reinforcement learning formulation
where C1 represents the users’ minimum rate requirements, C2 and C3 respectively represent the transmit power constraints in a cell and in a cluster, C4 indicates that a user with a worse channel condition is allocated more power, and C5 gives the inherent constraints on $p_m$ and $q_{m,l}$.
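To make the structure of P1 concrete, the following sketch evaluates the per-cluster objective of Equation (11) and checks constraints C1-C5 for one candidate allocation; all numbers are illustrative placeholders, not values from the paper:

```python
def cluster_ee(rates, p_cluster, p_circuit):
    """Energy efficiency of one cluster, Eq. (11): sum rate over total power."""
    return sum(rates) / (p_circuit + p_cluster)

def feasible(p, q, rates, rate_min, p_max):
    """Check constraints C1-C5 of problem P1 for a candidate allocation
    (rates/rate_min are per-user lists for a single cluster here)."""
    c1 = all(r >= r_min for r, r_min in zip(rates, rate_min))              # C1: min rate
    c2 = sum(p) <= p_max                                                   # C2: cell power
    c3 = all(sum(qm) <= 1.0 for qm in q)                                   # C3: coefficients
    c4 = all(qm[i] <= qm[i + 1] for qm in q for i in range(len(qm) - 1))   # C4: ordering
    c5 = all(pm >= 0.0 for pm in p) and all(0.0 <= c <= 1.0 for qm in q for c in qm)
    return c1 and c2 and c3 and c4 and c5

# one cluster, two users, illustrative numbers
p, q = [1.0], [[0.2, 0.8]]
rates = [1.58, 1.87]
ok = feasible(p, q, rates, rate_min=[0.5, 0.5], p_max=2.0)
ee = cluster_ee(rates, p_cluster=p[0], p_circuit=0.5)   # (1.58 + 1.87) / 1.5 = 2.3
```

Checking feasibility is cheap; the hard part of P1 is the non-convex search over $(p_m, q_{m,l})$ under evolving channels, which is exactly what the DRL frameworks of Section 4 address.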
This optimization problem is non-convex and NP-hard, and
the global optimal solution is usually difficult to obtain in prac-
tice due to the high computational complexity and the randomly
evolving channel conditions. More importantly, the conven-
tional model-based approaches can hardly satisfy the require-
ments of future wireless communication services. Therefore, we
propose two DRL-based frameworks in the following sections
to deal with this problem.
4 DYNAMIC POWER ALLOCATION WITH DRL
4.1 Deep reinforcement learning
formulation for MIMO-NOMA systems
In this section, the optimization problem of power allocation
in a downlink MIMO-NOMA system is modelled as a rein-
forcement learning task, which consists of an agent and envi-
ronment interacting with each other. In Figure 1, we describe
the architecture of wireless network with small-cells which are
ultra-densely deployed and have the same number of antennas as user handsets, or even less. As shown in Figure 1, the
base station is treated as the agent and the wireless channel
of the MIMO-NOMA system is the environment. The action,
transmit power from the BS to users, taken by the DRL con-
troller (at the BS) is based on the state which is the collec-
tive information on channel condition from users. Then at
FIGURE 2 Multi-agent DRL-based MIMO-NOMA power allocation
system model
each step, based on the observed state of the environment,
the agent performs an action from the action space to allo-
cate power to users according to the power allocation policy,
where the policy is learned by the DRL algorithms. With the
obtained transmit power, the optimal power allocation is con-
ducted and the step reward can be computed and fed back to
the agent. This reward is the energy efficiency of the MIMO-
NOMA system. The agent can perform actions, such as power
allocation, optimized for a given channel information through
a repetitive learning of the selection process that maximizes the
reward.
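The interaction loop just described can be sketched as follows. The environment and policy here are toy stand-ins, with the reward playing the role of the per-slot EE; none of this is the paper's simulator:

```python
import random

def policy(state):
    """Stand-in for the learned power-allocation policy (illustrative)."""
    return min(1.0, 0.5 + state)

def env_step(state, action):
    """Toy environment: the reward stands in for the EE achieved in this
    time slot, and the next state for the new channel condition."""
    reward = action * state
    next_state = random.random()
    return reward, next_state

random.seed(0)
state = random.random()          # initially observed channel state
total_reward = 0.0
for t in range(10):              # repeated observe-act-feedback loop
    action = policy(state)       # agent maps channel state to transmit power
    reward, state = env_step(state, action)
    total_reward += reward       # feedback used to improve the policy over time
```

In the actual frameworks the `policy` function is a trained actor network and `env_step` is the MIMO-NOMA channel plus the EE computation of Equation (11).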
In this paper, we propose two multi-agent DRL-based frame-
works (i.e. MDPA and MTPA) for the downlink MIMO-
NOMA system to derive the power allocation decision, which
are based on the DDPG and TD3 algorithms, respectively. The multi-agent DRL-based network comprises $M$ pairs of actor and critic networks, where each pair outputs the power allocation coefficients for all users in a cluster, as well as one additional actor network which decides the power volume allocated
to every cluster. In Figure 2, we show the multi-agent DRL-based power allocation network mechanism with $M$ clusters in the MIMO-NOMA system (the target networks used for soft weight updates are not shown), where actor network $\mu_m$ ($m = 1, \ldots, M$) decides the power allocation coefficients for all users in cluster $m$ and actor network $\mu_0$ decides the power volume allocated to every cluster. Compared with the mechanisms in previous works [42, 48, 57], we not only adopt the off-policy multi-agent method in accordance with the properties of MIMO-NOMA systems, but also create an additional actor network (actor $\mu_0$ in Figure 2) which is updated in cooperation with the other agents’ critic networks to adjust the power volumes allocated to the clusters, improving the overall performance of the MIMO-NOMA system.
In order to better illustrate our algorithm, we first briefly introduce the background of DRL. A general RL model consists of four parts: the state space $\mathcal{S}$, the action space $\mathcal{A}$, the immediate reward $\mathcal{R}$ and the transition probability $\mathcal{P}_{ss'}$ between the current state and the next state. In every time slot (TS) $t$, one or several agents take an action $a_t$ (e.g. decisions of power allocation) under the current state $s_t$ (e.g. the current channel condition), then receive an immediate reward $r_t$ (e.g. the immediate EE) and a new state $s_{t+1}$, which we define as follows [42, 57].
States: The state $s_t \in \mathcal{S}$ in TS $t$ is defined as the current channel gains of all users:

$$s_t = \left(s_t^1, \ldots, s_t^M\right), \tag{13a}$$

$$s_t^m = \left(h_{m,1}(t), \ldots, h_{m,L}(t)\right), \tag{13b}$$

where $s_t^m$ ($m = 1, \ldots, M$) represents the current channel gains of all users in cluster $m$, and $h_{m,l}(t)$ represents the channel gain between the BS and user UE$_{m,l}$ in TS $t$. They are assumed to be obtained at the beginning of the TS. The state space complexity is related to the number of UEs in the system. Since the total number of users in the system is $M \times L$, grouped randomly into $M$ clusters with $L$ users per cluster, the space complexity of $s_t$ is $O(M \times L)$ and that of $s_t^m$ is $O(L)$.
Actions: The action space should contain all power allocation decisions, so the action $a_t$ of TS $t$ is

$$a_t = \left(a_t^1, \ldots, a_t^M\right), \tag{14a}$$

$$a_t^m = \left(p_t^m, q_t^m\right), \tag{14b}$$

$$q_t^m = \left(q_{m,1}(t), \ldots, q_{m,L}(t)\right), \tag{14c}$$

where $p_t^m \in \mathbb{R}^1$ is the power volume allocated to cluster $m$ in a cell, $q_t^m \in \mathbb{R}^L$ is the vector of power allocation coefficients for all users in cluster $m$ and $q_{m,l}(t) \in \mathbb{R}^1$ is the power allocation coefficient for UE$_{m,l}$ in TS $t$. The power allocation decision network outputs a power allocation coefficient for every user and a power volume for every cluster; hence, each user receives its own coefficient's portion of the power volume allocated to the corresponding cluster. Therefore, the action space complexity of the power allocation network is $O(M \times (L + 1))$. The transmit power is a continuous variable with infinitely many possible values, so we develop the MDPA network and the MTPA network to solve the aforementioned problems.
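The state and action structures of Equations (13) and (14) can be sketched with array shapes. The dimensions, random values and the normalization step below are illustrative; in the paper these quantities are produced by the actor networks rather than sampled:

```python
import numpy as np

M, L = 4, 2   # illustrative: M clusters, L users per cluster

# state, Eq. (13): current channel gains of all M*L users; s_t[m] is cluster m's state
rng = np.random.default_rng(0)
s_t = np.abs(rng.standard_normal((M, L)))

# action, Eq. (14): one power volume per cluster plus L coefficients per cluster
p_t = rng.uniform(0.0, 1.0, size=M)           # p_t^m for each cluster
q_t = rng.uniform(0.0, 1.0, size=(M, L))
q_t = q_t / q_t.sum(axis=1, keepdims=True)    # enforce sum_l q_{m,l} <= 1 (C3)

# the power actually allocated to UE_{m,l} is p_t^m * q_{m,l}
per_user_power = p_t[:, None] * q_t

# action-space size per time slot: M * (L + 1) continuous values
action_dim = p_t.size + q_t.size
```

This makes the $O(M \times L)$ state and $O(M \times (L+1))$ action complexities stated above directly visible as array sizes.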
Rewards: We use the EE of the MIMO-NOMA system in cluster $m$ to represent the immediate reward $r_t^m \in \mathcal{R}$ returned after choosing action $a_t$ in state $s_t$, that is

$$r_t^m = \eta_{EE}^{m}. \tag{15}$$

We aim to maximize the long-term cumulative discounted reward defined as:

$$R_t^m = \sum_{i=0}^{\infty} \gamma^i r_{t+i}^m, \tag{16}$$

with discount factor $\gamma \in (0, 1)$.
In order to achieve this goal, a policy $\pi_m$ for cluster $m$, defined as a function mapping from states to actions ($\pi_m : \mathcal{S} \to \mathcal{A}$), is needed. The policy $\pi_m$ acts as a guidance, i.e. it tells the agent which $a_t^m$ should be taken in a specific $s_t^m$ to achieve the expected $R_t^m$; maximizing Equation (16) is thus equivalent to finding the optimal policy, represented as $\pi_m^*$. For a typical RL problem, the Q-value function [43], which describes the expected cumulative reward $R_t^m$ of starting from $s_t^m$, performing action $a_t^m$ and thereafter following policy $\pi_m$, is instrumental in solving RL problems, and is defined as:

$$Q^{\pi_m}\!\left(s_t^m, a_t^m\right) = \mathbb{E}\!\left[R_t^m \mid s_t^m, a_t^m\right] = \mathbb{E}\!\left[\sum_{i=0}^{\infty} \gamma^i r_{t+i}^m \,\Big|\, s_t^m, a_t^m\right] = \mathbb{E}\!\left[r_t^m + \gamma Q^{\pi_m}\!\left(s_{t+1}^m, a_{t+1}^m\right) \,\Big|\, s_t^m, a_t^m\right]. \tag{17}$$

The optimal policy $\pi_m^*$ which maximizes Equation (16) also maximizes Equation (17) for all states and actions in cluster $m$. The corresponding optimal Q-value function following $\pi_m^*$ is given as:

$$Q^{\pi_m^*}\!\left(s_t^m, a_t^m\right) = \mathbb{E}\!\left[r_t^m + \gamma \max_{a_{t+1}^m} Q^{\pi_m^*}\!\left(s_{t+1}^m, a_{t+1}^m\right) \,\Big|\, s_t^m, a_t^m\right]. \tag{18}$$

Once we have the optimal Q-value function of Equation (18), the agent knows how to select actions optimally.
4.2 Multi-agent DDPG-based power
allocation (MDPA) network
We use the off-policy multi-agent DDPG network based on the actor-critic structure to solve the dynamic power allocation problem, where the actor part is an enhanced DPG network and the critic part is a DQN. As mentioned before, the centralized controller receives the channel gains of all users as $s_t = \{s_t^1, \ldots, s_t^M\}$ at the beginning of each TS. With input $s_t$, every DNN named actor network with weights $\mu_k$ ($k = 0, 1, \ldots, M$) outputs a deterministic action rather than a stochastic probability over actions, which removes the further sampling and integration operations required in other actor-critic-based methods [46], that is

$$p_t = \{p_t^1, \ldots, p_t^M\} = \pi_0(s_t; \mu_0), \quad p_t^m = \pi_0^{(m)}(s_t; \mu_0), \tag{19a}$$

$$q_t^m = \pi_m(s_t^m; \mu_m) \quad (m = 1, 2, \ldots, M), \tag{19b}$$

where $\pi_0$ is the policy corresponding to Actor $\mu_0$, which decides the power volume allocated to every cluster in a cell, $\pi_0^{(m)}$ denotes the $m$th element of $\pi_0(s_t; \mu_0)$, and the policy $\pi_m$ decides the power allocation coefficients for all users in cluster $m$.
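To make the two-level actor structure of Equations (19a) and (19b) concrete, the sketch below builds tiny stand-in networks in NumPy. The hidden-layer size, weight initialization and sigmoid output scaling are illustrative assumptions; the paper only specifies that each actor has an extra layer scaling its output to the power bound:

```python
import numpy as np

rng = np.random.default_rng(0)
M, L, p_max = 2, 2, 1.0  # clusters, users per cluster, power limit (Table 1)

def make_weights(n_in, n_hidden, n_out):
    # Small random weights for a one-hidden-layer stand-in network.
    return {"in": rng.normal(size=(n_in, n_hidden)) * 0.1,
            "out": rng.normal(size=(n_hidden, n_out)) * 0.1}

def actor(state, w, scale):
    # tanh hidden layer, then a sigmoid output scaled to the action bound,
    # mimicking the extra scaling layer mentioned in the simulation setup.
    h = np.tanh(state @ w["in"])
    return scale / (1.0 + np.exp(-(h @ w["out"])))

s = rng.normal(size=M * L)                       # channel gains of all users
w0 = make_weights(M * L, 16, M)                  # Actor mu_0: cluster powers
wm = [make_weights(L, 16, L) for _ in range(M)]  # Actor mu_m: coefficients

p = actor(s, w0, p_max)                          # p_t in (0, p_max)^M, Eq. (19a)
q = [actor(s[m * L:(m + 1) * L], wm[m], 1.0) for m in range(M)]  # Eq. (19b)
```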
However, a major challenge of learning with deterministic actions is the exploration of new actions [49]. Fortunately, for an off-policy algorithm such as DDPG, exploration can be treated independently from the learning process. We construct the exploration policy by adding noise to the original output action, similar to the random selection of the $\epsilon$-greedy method in DQN, that is

$$p_t = \pi_0(s_t; \mu_0) + \mathcal{N}_0 \Big|_0^{p_{\max}}, \tag{20}$$

$$q_t^m = \pi_m(s_t^m; \mu_m) + \mathcal{N}_m \Big|_0^1, \quad m = 1, 2, \ldots, M, \tag{21}$$

where $\mathcal{N}_0 \in \mathbb{R}^M$ and $\mathcal{N}_m \in \mathbb{R}^1$ represent the noise and follow a normal distribution. The action $p_t^m$ is restricted to the interval $(0, p_{\max})$ and $q_t^m$ is restricted to the interval $(0, 1)$.
After executing the action $a_t$ and receiving $r_t = \{r_t^1, \ldots, r_t^M\}$, the system moves to the next state $s_{t+1}$. Since the action $a_t$ is generated by a deterministic policy network, we rewrite Equation (17) as:

$$Q^{\pi_m}(s_t^m, a_t^m) = \mathbb{E}\left[r_t^m + \gamma Q^{\pi_m}\left(s_{t+1}^m, \left(\pi_0^{(m)}(s_{t+1}; \mu_0), \pi_m(s_{t+1}^m; \mu_m)\right)\right)\right], \quad m = 1, 2, \ldots, M, \tag{22}$$

$$\nabla_{\mu_m} J(\pi_m) = \mathbb{E}\left[\nabla_{q^m} Q^{\pi_m}(s^m, a^m; \theta_m)\big|_{p^m = \pi_0^{(m)}(s; \mu_0)} \nabla_{\mu_m} \pi_m(s^m; \mu_m)\right] \approx \frac{1}{N_b} \sum_i \nabla_{q^m} Q^{\pi_m}(s^m, a^m; \theta_m)\big|_{s^m = s_i^m,\; p^m = \pi_0^{(m)}(s_i; \mu_0)} \nabla_{\mu_m} \pi_m(s^m; \mu_m)\big|_{s^m = s_i^m}, \tag{23}$$

$$\nabla_{\mu_0} J(\pi_0) = \mathbb{E}\left[\sum_{m=1}^{M} \nabla_{p^m} Q^{\pi_m}(s^m, a^m; \theta_m)\big|_{q^m = \pi_m(s^m; \mu_m)} \nabla_{\mu_0} \pi_0(s; \mu_0)\right] \approx \frac{1}{N_b} \sum_{m=1}^{M} \sum_i \nabla_{p^m} Q^{\pi_m}(s^m, a^m; \theta_m)\big|_{s = s_i,\; q^m = \pi_m(s_i^m; \mu_m)} \nabla_{\mu_0} \pi_0(s; \mu_0)\big|_{s = s_i}, \tag{24}$$
Since it is impractical to calculate the Q value of every step in this way, we also use another DNN, named the critic network, with weights $\theta_m$ ($m = 1, 2, \ldots, M$) to output the approximate Q value $Q^{\pi_m}(s_t^m, a_t^m; \theta_m)$ to evaluate the selected action $a_t^m$. We utilize the experience replay method to allow the networks to benefit from learning across a set of uncorrelated tuples. With $N_b$ random tuples $(s_i, a_i, r_i, s_{i+1})$ sampled from the replay memory, the actor networks update their weights in the direction of larger Q values according to the deterministic policy gradient theorem in [49]; the gradients are given by Equations (23) and (24), where $J(\pi_m)$ and $J(\pi_0)$ respectively represent the expected total reward of following policy $\pi_m$ in all states of cluster $m$ and policy $\pi_0$ in all states in a cell. Then we use the target network architecture from DQN to solve the unstable learning issue caused by using only one network to calculate target Q values and update weights at the same time. We create copies of the actor network and the critic network as $\pi_m(s^m; \bar{\mu}_m)$ and $Q^{\pi_m}(s^m, a^m; \bar{\theta}_m)$ and name them the target actor network and the target critic network. Thus, the target Q value can be generated by these two networks, that is

$$y_i^m = r_i^m + \gamma Q^{\pi_m}\left(s_{i+1}^m, \left(\pi_0^{(m)}(s_{i+1}; \mu_0), \pi_m(s_{i+1}^m; \bar{\mu}_m)\right); \bar{\theta}_m\right), \tag{25}$$

$$z_i^m = r_i^m + \gamma Q^{\pi_m}\left(s_{i+1}^m, \left(\pi_0^{(m)}(s_{i+1}; \bar{\mu}_0), \pi_m(s_{i+1}^m; \mu_m)\right); \bar{\theta}_m\right) \tag{26}$$
ALGORITHM 1 The multi-agent DDPG-based power allocation (MDPA) in downlink MIMO-NOMA

1. Initialize the replay memory and its capacity.
2. Initialize the multi-agent DDPG-based power allocation actor networks $\pi_0(s_t; \mu_0)$, $\pi_m(s_t^m; \mu_m)$ and critic networks $Q^{\pi_m}(s_t^m, a_t^m; \theta_m)$ with random parameters $\mu_0$, $\mu_m$ and $\theta_m$ ($m = 1, 2, \ldots, M$).
3. Initialize the target actor networks $\pi_0(s_t; \bar{\mu}_0)$, $\pi_m(s_t^m; \bar{\mu}_m)$ and target critic networks $Q^{\pi_m}(s_t^m, a_t^m; \bar{\theta}_m)$ with parameters $\bar{\mu}_0 = \mu_0$, $\bar{\mu}_m = \mu_m$ and $\bar{\theta}_m = \theta_m$.
4. Initialize the random processes $\mathcal{N}_0$, $\mathcal{N}_m$ for DDPG action exploration, the terminal TS $T_{\max}$ and the parameter update interval size $C$.
5. The controller at the BS receives the first channel condition information of all users as the initial state $s_1$.
6. for $t = 1, 2, \ldots, T_{\max}$ do
7. The controller selects the power allocation action $a_t^m = (p_t^m, q_t^m)$ according to Equations (20) and (21).
8. The controller broadcasts the power allocation action $a_t^m$ to all users, and users transmit their signals with the specified power.
9. If the action $a_t^m$ for any $m$ satisfies the minimum rate requirement, the controller receives the current EE as reward $r_t^m$. Otherwise, it receives no reward.
10. The controller receives the next state $s_{t+1}$ as users move to their next positions.
11. Store the tuple $(s_t, a_t, r_t, s_{t+1})$ in the replay memory.
12. Randomly sample a mini-batch of $N_b$ tuples $(s_i, a_i, r_i, s_{i+1})$ from the replay memory.
13. The critic networks $Q^{\pi_m}(s_t^m, a_t^m; \theta_m)$ update $\theta_m$ by minimizing the loss function of Equation (27).
14. The actor networks $\pi_m(s_t^m; \mu_m)$ update $\mu_m$ according to Equation (23).
15. The critic networks $Q^{\pi_m}(s_t^m, a_t^m; \theta_m)$ update $\theta_m$ by minimizing the loss function of Equation (28).
16. The actor network $\pi_0(s_t; \mu_0)$ updates $\mu_0$ according to Equation (24).
17. Update the target actor networks $\bar{\mu}_0$, $\bar{\mu}_m$ and target critic networks $\bar{\theta}_m$ according to Equation (29) every $C$ TSs.
18. end for
We use the same method as in DQN to update the critic network weights by minimizing the loss functions defined as:

$$L_m(\theta_m) = \frac{1}{N_b} \sum_i \left(y_i^m - Q^{\pi_m}(s_i^m, a_i^m; \theta_m)\big|_{p_i^m = \pi_0^{(m)}(s_i; \mu_0)}\right)^2, \tag{27}$$

$$L_0 = \frac{1}{N_b} \sum_{m=1}^{M} \sum_i \left(z_i^m - Q^{\pi_m}(s_i^m, a_i^m; \theta_m)\big|_{q_i^m = \pi_m(s_i^m; \mu_m)}\right)^2. \tag{28}$$
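The mini-batch losses of Equations (27) and (28) are plain mean-squared errors between the target Q values and the critic outputs. A minimal sketch, with made-up target and critic values that are not tied to any simulation:

```python
def critic_loss(targets, q_values):
    # Mean squared Bellman error over a mini-batch of N_b samples,
    # the form shared by Equations (27) and (28).
    n = len(targets)
    return sum((y - q) ** 2 for y, q in zip(targets, q_values)) / n

# Illustrative targets y_i (from the target networks) and critic outputs.
loss = critic_loss([1.0, 2.0, 3.0], [1.5, 2.0, 2.0])  # (0.25 + 0 + 1) / 3
```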
Instead of directly copying the weights to the target networks, we update the weights $\bar{\mu}_k$ and $\bar{\theta}_m$ in a soft manner to make sure the weights change slowly, which greatly improves the stability of learning, that is

$$\bar{\theta}_m \leftarrow \tau \theta_m + (1 - \tau) \bar{\theta}_m, \quad m = 1, 2, \ldots, M,$$
$$\bar{\mu}_k \leftarrow \tau \mu_k + (1 - \tau) \bar{\mu}_k, \quad k = 0, 1, \ldots, M, \tag{29}$$

where $0 < \tau < 1$. The MDPA algorithm in a downlink MIMO-NOMA system is summarized in Algorithm 1.
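The soft update of Equation (29) is Polyak averaging of the parameters. In the sketch below the weights are plain floats for illustration (a real implementation iterates over all network parameters); $\tau = 0.01$ matches the value used in the simulations:

```python
def soft_update(target_weights, online_weights, tau=0.01):
    # Equation (29): target <- tau * online + (1 - tau) * target.
    return [tau * w + (1.0 - tau) * wt
            for w, wt in zip(online_weights, target_weights)]

target = [0.0, 0.0]
for _ in range(3):  # repeated soft updates drift slowly towards the online weights
    target = soft_update(target, [1.0, 1.0], tau=0.01)
```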
4.3 Multi-agent TD3-based power
allocation (MTPA) network
Twin-delayed deep deterministic policy gradient (TD3) is an off-policy model for continuous high-dimensional action spaces which has recently achieved breakthroughs in artificial intelligence. RL algorithms characterized as off-policy generally utilize a separate behaviour policy which is independent of the policy being improved. The key advantage of this separation is that the behaviour policy can operate by sampling all actions, while the estimation policy can be deterministic [61]. TD3 was built on the DDPG algorithm to increase stability and performance with consideration of function approximation error [60]. The uniqueness of the TD3 algorithm lies in its combination of three powerful DRL techniques: continuous double deep Q-learning [58], policy gradient [56] and actor-critic [54]. Even though DDPG can sometimes achieve good performance, it is very sensitive to hyper-parameters and other adjustments. Overestimation of the Q value at the beginning of learning the Q-function is a common failure mode of DDPG. The overestimation bias is a property of Q-learning in which the maximization of a noisy value estimate induces a consistent overestimation [60, 64]. This noise is unavoidable given the inaccuracy of the estimator in function approximation settings; therefore, overestimation can occur due to inaccurate action values. TD3 is an algorithm that addresses this problem by introducing three key techniques [60]. The first is clipped double Q networks: TD3 learns two Q-functions and uses the smaller of the two Q values to form the target of the Bellman error. The second is the delayed policy update, where the policy network parameters are updated only after the dual Q-function networks are updated. The third is target policy smoothing, which keeps all action values within a specific range [45].
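The first of these techniques, clipped double-Q learning, can be sketched in one line; the reward, critic estimates and $\gamma = 0.9$ below are illustrative values only:

```python
def clipped_double_q_target(r, q1_next, q2_next, gamma=0.9):
    # TD3's first technique: the Bellman target uses the smaller of the two
    # target-critic estimates, which curbs Q-learning's overestimation bias.
    return r + gamma * min(q1_next, q2_next)

# An optimistic critic (q1 = 5.0) is ignored in favour of the lower estimate.
y = clipped_double_q_target(1.0, 5.0, 4.0)  # 1.0 + 0.9 * 4.0
```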
In this paper, we also use the multi-agent TD3 network to derive the power allocation decision in the MIMO-NOMA system, where the network mechanism is the same as in the above DDPG network. Every actor network $m$ ($m = 1, 2, \ldots, M$) learns two Q-functions to obtain a less optimistic estimate of an action value by taking the minimum between the two estimates, so the two Q-functions with $\theta_m^j$ for every actor network can be expressed as:

$$Q^{\pi_m}(s_t^m, a_t^m; \theta_m^j) = \mathbb{E}\left[r_t^m + \gamma Q^{\pi_m}\left(s_{t+1}^m, \left(\pi_0^{(m)}(s_{t+1}; \mu_0), \pi_m(s_{t+1}^m; \mu_m)\right); \theta_m^j\right)\right], \quad m = 1, 2, \ldots, M, \; j = 1, 2. \tag{30}$$
Similar to our DDPG-based method, we utilize the experience replay memory, and the target Q values $y_i^m$ and $z_i^m$ are given by:

$$y_i^m = r_i^m + \gamma \min_{j=1,2} Q^{\pi_m}\left(s_{i+1}^m, \left(\pi_0^{(m)}(s_{i+1}; \mu_0), q_t^m\right); \bar{\theta}_m^j\right), \tag{31}$$

$$q_t^m = \pi_m(s_{i+1}^m; \bar{\mu}_m) + \epsilon_m \Big|_0^1, \quad m = 1, 2, \ldots, M, \quad \epsilon_m \sim \mathrm{clip}\left(\mathcal{N}(0, \sigma_{\epsilon_m}), 0, 1\right), \tag{32}$$

$$z_i^m = r_i^m + \gamma \min_{j=1,2} Q^{\pi_m}\left(s_{i+1}^m, \left(p_t^m, \pi_m(s_{i+1}^m; \mu_m)\right); \bar{\theta}_m^j\right), \tag{33}$$

$$p_t^m = \pi_0^{(m)}(s_{i+1}; \bar{\mu}_0) + \varepsilon_m \Big|_0^{p_{\max}}, \quad \varepsilon_m \sim \mathrm{clip}\left(\mathcal{N}(0, \sigma_{\varepsilon_m}), 0, p_{\max}\right), \tag{34}$$

where the added noise $\epsilon_m$, $\varepsilon_m$ is clipped to keep the target close to the original action. Then the critic networks are updated by minimizing the loss functions defined as:
$$L_m(\theta_m^1, \theta_m^2) = \frac{1}{N_b} \sum_i \sum_j \left(y_i^m - Q^{\pi_m}(s_i^m, a_i^m; \theta_m^j)\big|_{p_i^m = \pi_0^{(m)}(s_i; \mu_0)}\right)^2, \tag{35}$$

$$L_0 = \frac{1}{N_b} \sum_{m=1}^{M} \sum_i \sum_j \left(z_i^m - Q^{\pi_m}(s_i^m, a_i^m; \theta_m^j)\big|_{q_i^m = \pi_m(s_i^m; \mu_m)}\right)^2, \tag{36}$$
$$\nabla_{\mu_m} J(\pi_m) = \mathbb{E}\left[\nabla_{q^m} Q^{\pi_m}(s^m, a^m; \theta_m^1)\big|_{p^m = \pi_0^{(m)}(s; \mu_0)} \nabla_{\mu_m} \pi_m(s^m; \mu_m)\right] \approx \frac{1}{N_b} \sum_i \nabla_{q^m} Q^{\pi_m}(s^m, a^m; \theta_m^1)\big|_{s^m = s_i^m,\; p^m = \pi_0^{(m)}(s_i; \mu_0)} \nabla_{\mu_m} \pi_m(s^m; \mu_m)\big|_{s^m = s_i^m}, \tag{37}$$

$$\nabla_{\mu_0} J(\pi_0) = \mathbb{E}\left[\sum_{m=1}^{M} \nabla_{p^m} Q^{\pi_m}(s^m, a^m; \theta_m^1)\big|_{q^m = \pi_m(s^m; \mu_m)} \nabla_{\mu_0} \pi_0(s; \mu_0)\right] \approx \frac{1}{N_b} \sum_{m=1}^{M} \sum_i \nabla_{p^m} Q^{\pi_m}(s^m, a^m; \theta_m^1)\big|_{s = s_i,\; q^m = \pi_m(s_i^m; \mu_m)} \nabla_{\mu_0} \pi_0(s; \mu_0)\big|_{s = s_i}. \tag{38}$$
Then we update the actor network for $\pi_m$ in the direction of larger Q values with $\theta_m^1$ by Equations (37) and (38). Finally, we update the weights $\bar{\mu}_0$, $\bar{\mu}_m$ and $\bar{\theta}_m^j$ in a soft manner to make sure the weights change slowly, that is

$$\bar{\mu}_0 \leftarrow \tau \mu_0 + (1 - \tau) \bar{\mu}_0, \quad \bar{\theta}_m^j \leftarrow \tau \theta_m^j + (1 - \tau) \bar{\theta}_m^j, \quad m = 1, 2, \ldots, M, \; j = 1, 2, \tag{39}$$

$$\bar{\mu}_m \leftarrow \tau \mu_m + (1 - \tau) \bar{\mu}_m, \quad m = 1, \ldots, M. \tag{40}$$

The MTPA algorithm in a downlink MIMO-NOMA system is summarized in Algorithm 2.
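The smoothed target action construction of Equation (32) can be sketched as follows. The noise scale, clipping bound and action range are illustrative assumptions in the spirit of the standard TD3 formulation, not the paper's exact settings:

```python
import numpy as np

rng = np.random.default_rng(2)

def smoothed_target_action(mu_next, sigma=0.1, noise_clip=0.5, low=0.0, high=1.0):
    # Target-policy smoothing: clipped Gaussian noise is added to the target
    # actor's output, and the result is clipped back to the valid action
    # range. noise_clip is an illustrative value.
    eps = np.clip(rng.normal(scale=sigma, size=np.shape(mu_next)),
                  -noise_clip, noise_clip)
    return np.clip(np.asarray(mu_next) + eps, low, high)

a = smoothed_target_action([0.3, 0.95])  # stays within [0, 1]
```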
ALGORITHM 2 The multi-agent TD3-based power allocation (MTPA) in downlink MIMO-NOMA

1. Initialize the replay memory and its capacity.
2. Initialize the multi-agent TD3-based power allocation actor networks $\pi_0(s_t; \mu_0)$, $\pi_m(s_t^m; \mu_m)$ and critic networks $Q^{\pi_m}(s_t^m, a_t^m; \theta_m^1)$, $Q^{\pi_m}(s_t^m, a_t^m; \theta_m^2)$ with random parameters $\mu_0$, $\mu_m$, $\theta_m^1$ and $\theta_m^2$ ($m = 1, 2, \ldots, M$).
3. Initialize the target actor networks $\pi_0(s_t; \bar{\mu}_0)$, $\pi_m(s_t^m; \bar{\mu}_m)$ and target critic networks $Q^{\pi_m}(s_t^m, a_t^m; \bar{\theta}_m^1)$, $Q^{\pi_m}(s_t^m, a_t^m; \bar{\theta}_m^2)$ with parameters $\bar{\mu}_0 = \mu_0$, $\bar{\mu}_m = \mu_m$, $\bar{\theta}_m^1 = \theta_m^1$ and $\bar{\theta}_m^2 = \theta_m^2$.
4. Initialize the random processes $\mathcal{N}_0$, $\mathcal{N}_m$ for TD3 action exploration, the terminal TS $T_{\max}$ and the parameter update interval size $C$.
5. The controller at the BS receives the first channel condition information of all users as the initial state $s_1$.
6. for $t = 1, 2, \ldots, T_{\max}$ do
7. The controller selects the power allocation action $a_t^m = (p_t^m, q_t^m)$ according to Equations (20) and (21).
8. The controller broadcasts the power allocation action $a_t^m$ to all users, and users transmit their signals with the specified power.
9. If the action $a_t^m$ for any $m$ satisfies the minimum rate requirement, the controller receives the current EE as reward $r_t^m$. Otherwise, it receives no reward.
10. The controller receives the next state $s_{t+1}$ as users move to their next positions.
11. Store the tuple $(s_t, a_t, r_t, s_{t+1})$ in the replay memory.
12. Randomly sample a mini-batch of $N_b$ tuples $(s_i, a_i, r_i, s_{i+1})$ from the replay memory.
13. The critic networks $Q^{\pi_m}(s_t^m, a_t^m; \theta_m^j)$ update $\theta_m^j$ by minimizing the loss function of Equation (35).
14. if $t \bmod C = 0$ then
15. The actor networks $\pi_m(s_t^m; \mu_m)$ update $\mu_m$ according to Equation (37).
16. Update the target actor networks $\bar{\mu}_m$ according to Equation (40).
17. end if
18. The critic networks $Q^{\pi_m}(s_t^m, a_t^m; \theta_m^j)$ update $\theta_m^j$ by minimizing the loss function of Equation (36).
19. if $t \bmod C = 0$ then
20. The actor network $\pi_0(s_t; \mu_0)$ updates $\mu_0$ according to Equation (38).
21. Update the target actor network $\bar{\mu}_0$ and target critic networks $\bar{\theta}_m^1$, $\bar{\theta}_m^2$ according to Equation (39).
22. end if
23. end for
5 SIMULATION RESULTS

In this section, we present the simulation results of the two multi-agent DRL-based power allocation algorithms (i.e. MDPA and MTPA). The results are simulated in a downlink MIMO-NOMA system, where the BS is located at the centre of the cell and 4 users are randomly distributed in a cell with a radius of 500 m. The specific values of the adopted simulation parameters are summarized in Table 1. If not specified, $T_{\max} = 5000$ TS, $p_{\max} = 1$ W and all users move randomly at a speed of $v = 1$ m/s. Our multi-agent DRL-based framework for simulation has five networks (three actor networks and two critic networks), not counting the target networks.

TABLE 1 Parameter settings used in the simulations

Parameter                        Value
Framework                        PyTorch 1.5
Programming language             Python 3.6
Channel                          MIMO-NOMA/AWGN
Fading                           Rayleigh distribution
Number of transmit antennas      $M = 2$
Number of receive antennas       $N = 2$
Number of users in a cluster     $L = 2$
Channel bandwidth                $B_{m,l} = 10$ MHz
Thermal noise density            $\sigma^2 = -174$ dBm/Hz
Pathloss model                   $114 + 38 \log_{10}(d_{m,l})$
Minimum required data rate       $\xi_{m,l}^{\min} = 10$ bps/Hz

The learning rate is set to 0.01, the discount factor $\gamma = 0.9$, the memory capacity to 200, the weight update interval $C = 10$ and the batch size $N_b = 32$. Specifically, every actor network has an additional layer to scale the output to the power bound limitation. The soft update rate $\tau$ in Equations (29), (39) and (40) is set to 0.01, the noise process in Equations (20) and (21) follows a normal distribution $\mathcal{N}(0, 1)$, and $\sigma_{\epsilon_m} = \sigma_{\varepsilon_m} = 0.1$ in Equations (32) and (34).
In order to compare the performance of the proposed frameworks, we consider several alternative approaches: (1) the single-agent DDPG/TD3-based power allocation methods (SDPA/STPA), which output the power allocation policy for every cluster by independent single agents based on the DDPG/TD3 algorithm to maximize the sum EE of the system; (2) the DQN-based discrete power allocation strategy (QPA), which uses the DQN method to output the allocated power of all users by quantizing the power into 10 levels between 0 and $p_{\max}$ in order to fit the input layer of the DQN, because the transmit power is a continuous variable and the action space of a DQN has to be finite; (3) the fixed power allocation strategy (FPA), which chooses the action that maximizes the sum EE at maximum transmit power by exhaustive search in each TS to ensure a high quality-of-service (QoS); (4) the multi-agent DDPG/TD3-based power allocation methods for MIMO-OMA systems (MTPA-OMA/MDPA-OMA), where the system model for MIMO-OMA is referenced from [14]. We evaluated QPA by exhaustive search in each TS; because it does not rely on global trajectory data, it easily falls into a local optimum. However, such quantization results
in a serious problem, namely the huge action space of possible power selections. In our QPA scenario with 4 users and a quantization level $k = 10$, each user can be allocated the power amount corresponding to one of 10 possible power allocation choices, so the number of selectable actions already reaches $10^4 = 10,000$. It should be noted that this is only a small system; in future-generation wireless networks with a high density of users, the size of the action space grows exponentially as the number of users increases. Such a large action space leads to poor performance, because the DQN agent needs plenty of time to explore the entire action space to find the best power allocation option. Moreover, due to the randomly evolving nature of the communication environment, it may be difficult to choose the option that leads to the best performance. In addition, such quantization also discards some useful information, which is essential for finding the optimal power allocation option. This is also the reason we use the off-policy network based on the actor-critic structure for power allocation tasks.

FIGURE 3 Loss function value of two multi-agent DRL-based frameworks
The convergence complexity of the DRL algorithm depends on the size of the action-state space. For the QPA framework, the size of the output action space is $O(k^{L \times M})$, which increases exponentially with the number of users and the quantization levels of the power allocation coefficients, whereas for our proposed frameworks the size is $O(M \times (L + 1))$. The state space also requires storage, and the corresponding space complexity is $O(M \times L)$ for both QPA and the proposed frameworks. Therefore, the proposed algorithms can improve the energy efficiency of MIMO-NOMA systems with low complexity.
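The two complexity figures quoted above can be checked directly with the quantization level from the QPA example and the Table 1 dimensions:

```python
# Action-space sizes quoted in the text: QPA quantizes each user's power
# into k levels, so its discrete action space has k**(L*M) entries, while
# the proposed continuous frameworks output only M*(L+1) values per TS.
k, L, M = 10, 2, 2  # quantization levels; users per cluster; clusters

qpa_actions = k ** (L * M)        # 10**4 = 10,000 discrete choices
proposed_outputs = M * (L + 1)    # 6 continuous outputs
```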
First, we evaluate the convergence of the proposed two multi-agent frameworks under a fixed learning rate of 0.01, which provided the best learning performance in our simulations. We fix the locations of all users to determine how many TSs are required for our frameworks to find the optimal power allocation policy. As shown in Figure 3, the loss function value decreases and settles to a stable value within 200 TSs for both frameworks, and the value is small enough to accurately predict the Q value.
Then, we compare the proposed two multi-agent algorithms with different power allocation approaches in the random-moving system. It should be noted that all results are averaged with a moving window of 100 TSs to make the comparison clearer. For the objective of sum EE maximization, we investigate the performance of our multi-agent DRL-based frameworks. Figure 4 shows that the MTPA framework achieves better averaged sum EE performance than the other approaches. The DRL-based frameworks are able to dynamically choose the transmit power of all users with continuous control according to their current channel conditions in every TS. Specifically, the averaged sum EE achieved by the MTPA framework over all 5000 TSs is 21.47% higher than the FPA approach, whereas the value for the MDPA framework is 17.3% higher. For the STPA and SDPA frameworks, the values are 15.8% and 12.39% higher, respectively. On the other hand, the QPA framework based on the discrete DRL method is only 11.13% higher. More importantly, as long as the data rate requirement of all users is satisfied, it is unnecessary and energy-wasteful to use maximum power for transmission, since it reduces the EE of the system. This also shows the importance of power allocation for the performance improvement of the NOMA system. We also verify the performance advantage of the proposed algorithms over MIMO-OMA. As shown in Figure 4, the numerical results show that the proposed algorithms outperform their MIMO-OMA counterparts in energy efficiency: in our simulation scenario, MTPA achieves an averaged sum EE 18.04% higher than MTPA-OMA, and MDPA 16.86% higher than MDPA-OMA.

FIGURE 4 Averaged energy efficiency comparison of different algorithms
We investigate the averaged sum EE performance of the two frameworks over 5000 TSs under different transmit power limitations. As shown in Figure 5, the averaged results of the DRL-based frameworks grow and tend to stable values as the maximum available power increases, while the FPA approach increases slightly and then continues to drop, since it always uses full power for signal transmission. This is because, as long as the data rate requirement is satisfied, our algorithms no longer need full power for transmission and can dynamically allocate the power based on the communication conditions to optimize the sum EE no matter how $p_{\max}$ changes. Obviously, our proposed power allocation methods based on the deterministic policy gradient with continuous action spaces outperform the discrete DRL-based approach and the conventional methods under most power limitation conditions, and we find that the MTPA framework achieves the best averaged sum EE performance compared with the other approaches. We also verify the improved performance of the proposed algorithms for MIMO-NOMA over MIMO-OMA.

FIGURE 5 Averaged energy efficiency versus power limitation

FIGURE 6 The variation of the energy efficiency with the minimum required data rate
Figure 6 illustrates how the EE varies with the minimum required data rate $\xi_{m,l}^{\min}$. As shown in Figure 6, our proposed frameworks show outstanding performance compared with the other approaches as well as over MIMO-OMA, even though the EE decreases with $\xi_{m,l}^{\min}$. This situation can be explained by connecting $\xi_{m,l}^{\min}$ with the transmit power, i.e. a higher $\xi_{m,l}^{\min}$ has the same influence on the EE as a higher transmit power. The obtained results indicate that the average system energy efficiency decreases with increasing $\xi_{m,l}^{\min}$, because as $\xi_{m,l}^{\min}$ increases, the system reaches the infeasibility case more rapidly and becomes unable to satisfy $\xi_{m,l}^{\min}$ for all system users.
6 CONCLUSION
In this paper, we have studied the dynamic power allocation problem in a downlink MIMO-NOMA system, and two multi-agent DRL-based frameworks (i.e. the multi-agent DDPG/TD3-based frameworks) have been proposed to improve the long-term EE of the MIMO-NOMA system while ensuring the minimum data rates of all users. Every single agent of the two multi-agent frameworks dynamically outputs the optimum power allocation policy for all users in every cluster by the DDPG/TD3 algorithm, and the additional actor network adjusts the power volumes allocated to the clusters to improve the overall performance of the MIMO-NOMA system. Finally, both frameworks adjust the entire power allocation policy by updating the weights of the neural networks according to the feedback of the system. Compared with other approaches, our multi-agent DRL-based power allocation frameworks can significantly improve the EE of the MIMO-NOMA system under various transmit power limitations and minimum data rates by adjusting the parameters of the networks. We have also verified the energy performance advantage of the proposed algorithms over the MIMO-OMA system. In future work, we will study the joint subchannel selection and power allocation problem for more practical scenarios of MIMO-NOMA systems, applying more powerful DRL approaches.
REFERENCES
1. Chin, W.H., Fan, Z., Haines, R.: Emerging technologies and research chal-
lenges for 5G wireless networks. IEEE Wireless Commun. 21(2), 106–112
(2014)
2. Ding, Z., et al.: A survey on non-orthogonal multiple access for 5G networks: Research challenges and future trends. IEEE J. Sel. Areas Commun. 35(10), 2181–2195 (2017)
3. Wu, Z., et al.: Comprehensive study and comparison on 5G NOMA
schemes. IEEE Access 6, 18511–18519 (2018)
4. Choi, J.: Effective capacity of NOMA and a suboptimal power control pol-
icy with delay QoS. IEEE Trans. Commun. 65(4), 1849–1858 (2017)
5. Yin, Y., et al.: Dynamic user grouping-based NOMA over Rayleigh fading
channels. IEEE Access 7, 110964–110971 (2019)
6. Gui, G., Sari, H., Biglieri, E.: A new definition of fairness for non-
orthogonal multiple access. IEEE Commun. Lett. 23(7), 1267–1271 (2019)
7. Liu, X., et al.: Spectrum resource optimization for NOMA-based cognitive
radio in 5G communications. IEEE Access 6, 24904–24911 (2018)
8. Muhammed, A.J., et al.: Resource allocation for energy efficiency in
NOMA enhanced small cell networks. In: Proceedings of 2019 11th Inter-
national Conference on Wireless Communications and Signal Processing
(WCSP). Xi’an, China, pp. 1–6 (2019)
9. Ding, Z., et al.: On the performance of non-orthogonal multiple access
in 5G systems with randomly deployed users. IEEE Signal Process. Lett.
21(12), 1501–1505 (2014)
10. Cui, J., Ding, Z., Fan, P.: Power minimization strategies in downlink
MIMO-NOMA systems. In: Proceedings of 2017 IEEE International
Conference on Communications (ICC). Paris, France, pp. 1–6 (2017)
11. Zeng, M., et al.: Energy-efficient power allocation for MIMO-NOMA with
multiple users in a cluster. IEEE Access 6, 5170–5181 (2018)
12. Ding, Z., Adachi, F., Poor, H.V.: The application of MIMO to non-
orthogonal multiple access. IEEE Trans. Wireless Commun. 15(1), 537–
552 (2016)
13. Ding, Z., Schober, R., Poor, H.V.: A general MIMO framework for NOMA
downlink and uplink transmission based on signal alignment. IEEE Trans.
Wireless Commun. 15(6), 4438–4454 (2016)
14. Zeng, M., et al.: Capacity comparison between MIMO-NOMA and
MIMO-OMA with multiple users in a cluster. IEEE J. Sel. Areas Com-
mun. 35(10), 2413–2424 (2017)
15. Do, D.T., Nguyen, S.: Device-to-device transmission modes in NOMA
network with and without wireless power transfer.Comput. Commun. 139,
67–77 (2019)
16. Uddin, M.B., Kader, M.F., Shin, S.: Exploiting NOMA in D2D assisted
full-duplex cooperative relaying. Phys. Commun. 38(100914), 1–10
(2020)
17. Najimi, M.: Energy-efficient resource allocation in D2D communica-
tions for energy harvesting-enabled NOMA-based cellular networks. Int.
J. Commun. Syst. 33(2), 1–13 (2020)
18. Li, R., et al.: Tact: A transfer actor-critic learning framework for energy
saving in cellular radio access networks. IEEE Trans. Wireless Commun.
13(4), 2000–2011 (2014)
19. Luo, J., et al.: A deep learning-based approach to power minimization
in multi-carrier NOMA with SWIPT. IEEE Access 7, 17450–17460
(2019)
20. Jang, H.S., Lee, H., Quek, T.Q.S.: Deep learning-based power control for
non-orthogonal random access. IEEE Commun. Lett. 23(11), 2004–2007
(2019)
21. Adjif, M., Habachi, O., Cances, J.P.: Joint channel selection and power con-
trol for NOMA: A multi-armed bandit approach. In: Proceedings of 2019
IEEE Wireless Communications and Networking Conference Workshop
(WCNCW). Marrakech, Morocco, pp. 1–6 (2019)
22. Wei, Y., et al.: Power allocation in HetNets with hybrid energy supply using
actor-critic reinforcement learning. In: Proceedings of GLOBECOM 2017
- 2017 IEEE Global Communications Conference. Singapore, pp. 1–5
(2017)
23. Hasan, M.K., et al.: The role of deep learning in NOMA for 5G and
beyond communications. In: Proceedings of 2020 International Con-
ference on Artificial Intelligence in Information and Communication
(ICAIIC). Fukuoka, Japan, pp. 303–307 (2020)
24. Zhao, Q., Grace, D.: Transfer learning for QoS aware topology man-
agement in energy efficient 5G cognitive radio networks. In: Proceed-
ings of 1st International Conference on 5G for Ubiquitous Connectivity.
Akaslompolo, Finland, pp. 152–157 (2014)
25. Wei, Y., et al.: User scheduling and resource allocation in HetNets with
hybrid energy supply: An actor-critic reinforcement learning approach.
IEEE Trans. Wireless Commun. 17(1), 680–692 (2017)
26. Liu, M., et al.: Resource allocation for NOMA based heterogeneous IoT
with imperfect SIC: A deep learning method. In: Proceedings of 2018
IEEE 29th Annual International Symposium on Personal, Indoor and
Mobile Radio Communications (PIMRC). Bologna, Italy, pp. 1440–1446
(2018)
27. Gui, G., et al.: Deep learning for an effective non-orthogonal multiple
access scheme. IEEE Trans. Veh. Technol. 67(9), 8440–8450 (2018)
28. Ding, Z., Fan, P., Poor, H.V.: Impact of user pairing on 5G nonorthogonal
multiple-access downlink transmissions. IEEE Trans. Veh. Technol. 65(8),
6010–6023 (2016)
29. Liang, W., et al.: User pairing for downlink non-orthogonal multiple access
networks using matching algorithm. IEEE Trans. Commun. 65(12), 5319–
5332 (2017)
30. Paulraj, A.J., et al.: An overview of MIMO communications - A key to
gigabit wireless. Proc. IEEE 92(2), 198–218 (2004)
31. Hanif, M.F., et al.: A minorization-maximization method for optimizing sum rate in non-orthogonal multiple access systems. IEEE Trans. Signal Process. 64(1), 76–88 (2015)
32. Benjebbour, A., Kishiyama, Y.: Combination of NOMA and MIMO: Concept and experimental trials. In: Proceedings of 25th International Conference on Telecommunications (ICT). St. Malo, France, pp. 433–438 (2018)
33. Islam, S.M.R., et al.: Resource allocation for downlink NOMA systems:
Key techniques and open issues. IEEE Wireless Commun. 25, 40–47
(2017)
34. Manglayev, T., Kizilirmak, R., Kho, Y.H.: Optimum power allocation for
non-orthogonal multiple access (NOMA). In: Proceedings of 25th Inter-
national Conference on Telecommunications (ICT). Baku, Azerbaijan,
pp. 1–4 (2016)
35. Zhang, Y., et al.: Energy-efficient transmission design in non-orthogonal
multiple access. IEEE Trans. Veh. Technol. 66, 2852–2857 (2017)
36. Zhang, J., et al.: Joint subcarrier assignment and downlink-uplink time-
power allocation for wireless powered OFDM-NOMA systems. In:
Proceedings of 10th International Conference on Wireless Communica-
tions and Signal Processing (WCSP). Hangzhou, China, pp. 1–7 (2018)
37. Zhao, J., et al.: Joint subchannel and power allocation for NOMA enhanced
D2D communications. IEEE Trans. Commun. 65(11), 5081–5094 (2017)
38. Xiong, Z., et al.: Deep reinforcement learning for mobile 5G and beyond:
Fundamentals, applications, and challenges. IEEE Veh. Technol. Mag.
14(2), 44–52 (2019)
39. Wang, T., et al.: Deep learning for wireless physical layer: Opportunities
and challenges. China Commun. 14(11), 92–111 (2017)
40. Huang, D., et al.: Deep learning based cooperative resource allocation in
5G wireless networks. Mobile Network Appl. 1–8 (2018)
41. Eisen, M., et al.: Learning optimal resource allocations in wireless systems.
IEEE Trans. Signal Process. 67(10), 2775–2790 (2018)
42. Zhang, Y., Wang, X., Xu, Y.: Energy-efficient resource allocation in uplink
NOMA systems with deep reinforcement learning. In: Proceedings of 11th
International Conference on Wireless Communications and Signal Pro-
cessing (WCSP). Xi’an, China, pp. 1–6 (2019)
43. Zhang, H., et al.: Artificial intelligence-based resource allocation in ultra-
dense networks: Applying event-triggered Q-learning algorithms. IEEE
Veh. Technol. Mag. 14(4), 56–63 (2019)
44. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. 2nd
ed., The MIT Press, Cambridge, MA (2018). http://incompleteideas.net/
book/the-book.html
45. Zhang, F., Li, J., Li, Z.: A TD3-based multi-agent deep reinforcement learn-
ing method in mixed cooperation-competition environment. Science 411,
206–215 (2020)
46. Mnih, V., et al.: Human-level control through deep reinforcement learning.
Nature 518(7540), 529–533 (2015)
47. Xiao, L., et al.: Reinforcement learning-based NOMA power allocation in
the presence of smart jamming. IEEE Trans. Veh. Technol. 67(4), 3377–
3389 (2017)
48. Zhang, S., et al.: A dynamic power allocation scheme in power-domain
NOMA using actor-critic reinforcement learning. In: Proceedings of
2018 IEEE/CIC International Conference on Communications in China
(ICCC). Beijing, China, pp. 719–723, (2018)
49. Zhao, N., et al.: Deep reinforcement learning for user association and
resource allocation in heterogeneous networks. In: Proceedings of 2018
IEEE Global Communications Conference (GLOBECOM). Abu Dhabi,
United Arab Emirates, pp. 1–6 (2018)
50. Chu, M., et al.: Reinforcement learning-based multiaccess control and bat-
tery prediction with energy harvesting in IoT systems. IEEE Internet
Things J. 6(2), 2009–2020 (2019)
51. Lillicrap, T., et al.: Continuous control with deep reinforcement learning.
In: Proceedings of 4th International Conference on Learning Representa-
tions. San Juan, Puerto Rico, pp. 1–14 (2016)
52. Grondman, I., et al.: Efficient model learning methods for actor-critic con-
trol. IEEE Trans. Syst. Man Cybern. Cybern. 42(3), 591–602 (2011)
53. Vamvoudakis, K.G., Lewis, F.: Online actor-critic algorithm to solve the
continuous-time infinite horizon optimal control problem. Automatica 46,
878–888 (2010)
54. Sutton, R.S., et al.: Policy gradient methods for reinforcement learning with
function approximation. Adv. Neural Inf. Process. Syst. 12, 1057–1063
(2000)
55. Konda, V., Gao, V.: On actor-critic algorithms. SIAM J. Control Optim.
42(4), 1143–1166 (2000)
56. Silver, D., et al.: Deterministic policy gradient algorithms. In: Proceedings
of 31st International Conference on Machine Learning. Beijing, China, vol.
32, pp. 387–395 (2014)
57. Li, Y., et al.: Energy-aware resource management for uplink non-
orthogonal multiple access: Multi-agent deep reinforcement learning.
Future Gener. Comput. Syst. 105, 684–694 (2020)
58. Van Hasselt, H., Guez, A., Silver, D.: Deep reinforcement learning with
double Q-learning. In: Proceedings of 29th Conference on Artificial Intel-
ligence (AAAI15). Austin, TX, pp. 2094–2100 (2015)
59. Bao, X., et al.: Conservative policy gradient in multi-critic setting. In: Pro-
ceedings of 2019 Chinese Automation Congress (CAC). Hangzhou, China,
pp. 1486–1489 (2019)
60. Fujimoto, S., Hoof, H., Meger, D.: Addressing function approximation
error in actor-critic methods. In: Proceedings of 35th International Con-
ference on Machine Learning. Stockholm, Sweden, pp. 1582–1591 (2018)
61. Dankwa, S., Zheng, W.: Modeling a continuous locomotion behavior of an
intelligent agent using deep reinforcement technique. In: Proceedings of
2019 IEEE 2nd International Conference on Computer and Communica-
tion Engineering Technology-CCET. Beijing, China, pp. 172–175 (2019)
62. Jaderberg, M., et al.: Human-level performance in first-person multiplayer
games with population-based deep reinforcement learning. Science 364,
859–865 (2019)
63. Lowe, R., et al.: Multi-agent actor-critic for mixed cooperative-competitive
environments. In: Proceedings of 31st Conference on Neural Information
Processing Systems. Long Beach, CA, pp. 6382–6393 (2017)
64. Thrun, S., Schwartz, A.: Issues in using function approximation for rein-
forcement learning. In: Proceedings of 4th Connectionist Models. Summer
School Hillsdale, NJ, pp. 1–9 (1993)
How to cite this article: Jo, S., et al.: Multi-agent deep reinforcement learning-based energy efficient power allocation in downlink MIMO-NOMA systems. IET Commun. 15, 1642–1654 (2021). https://doi.org/10.1049/cmu2.12177