Received: 24 October 2020    Revised: 8 March 2021    Accepted: 18 March 2021
DOI: 10.1049/cmu2.12177
ORIGINAL RESEARCH PAPER
Multi-agent deep reinforcement learning-based energy efficient
power allocation in downlink MIMO-NOMA systems
Sonnam Jo | Chol Jong | Changsop Pak | Hakchol Ri
Telecommunication Research Center, Kim Il Sung University, Taesong District, Pyongyang 999093, Democratic People's Republic of Korea

Correspondence
Sonnam Jo, Telecommunication Research Center, Kim Il Sung University, Taesong District, Pyongyang 999093, Democratic People's Republic of Korea.
Email: josonnam@yahoo.com
Abstract
NOMA and MIMO are considered promising technologies for meeting the huge access demands and high data rate requirements of 5G wireless networks. In this paper, the power allocation problem in a downlink MIMO-NOMA system is investigated with the goal of maximizing the energy efficiency while ensuring the quality-of-service of all users. Two deep reinforcement learning (DRL)-based frameworks, referred to as the multi-agent DDPG- and TD3-based power allocation frameworks, are proposed to solve this non-convex and dynamic optimization problem. In particular, with the current channel conditions as input, every agent of the two multi-agent frameworks dynamically outputs the power allocation policy for all users in its cluster via the DDPG/TD3 algorithm, and an additional actor network is added to the conventional multi-agent model to adjust the power volumes allocated to the clusters and thereby improve the overall performance of the system. Finally, both frameworks adjust the entire power allocation policy by updating the weights of the neural networks according to the feedback of the system. Simulation results show that the proposed multi-agent DRL-based power allocation frameworks can significantly improve the energy efficiency of the MIMO-NOMA system under various transmit power limitations and minimum data rates compared with other approaches, including a performance comparison against MIMO-OMA.
1 INTRODUCTION
Due to the explosive growth of Internet-of-Things-based large-scale heterogeneous networks and the emergence of fifth-generation (5G) wireless networks, communication requirements such as low latency, high data rates and massive connectivity have become increasingly strict [1]. In order to meet
these stringent communication requirements and provide
high-quality communication services, non-orthogonal mul-
tiple access (NOMA) has been proposed. NOMA is widely
recognized as one of the most promising multiple access
technologies for 5G wireless networks [2]. Different from
conventional orthogonal multiple access (OMA), the multiple
users in NOMA systems can be served by the same frequency
time resources via power-domain multiplexing or code-domain
multiplexing, which can increase the network capacity and
fulfil the requirements of low latency, high throughput and
massive connection [3]. Research on NOMA has mostly focused
on power allocation [4] and user matching [5], most often with the objectives of improving user fairness [6], spectral efficiency (SE) [7] and energy efficiency (EE) [8]. The
power-domain NOMA can serve multiple users in the same time slot and has become a strong candidate for 5G by allocating different power levels to different users to achieve multiple access [9]; the EE of a system can thus be improved by using this technology. With superposition coding (SC) at the transmitter side and successive interference cancellation (SIC) at the receiver side, power-domain NOMA allows users to share the same resources by multiplexing their signals with different allocated powers. In power-domain NOMA, power is allocated to users according to their channel gains, i.e. higher and lower power levels are respectively assigned to users with worse and better channel gains, and the user with the better channel gain removes the interference from the other users by SIC. To further improve the system performance, NOMA is capable of being
combined with various technologies like multiple-input
multiple-output (MIMO) [10–14], device to device [15–17],
deep learning [18–27] and cognitive radio [28, 29], etc. The
combination of MIMO and NOMA, which is defined as
MIMO-NOMA, is one of the hottest research areas. MIMO
is a conventional technology in which both the base station (BS) and the user have several antennas, allowing the parallel independent channels provided by space-division multiplexing to be exploited [30]. MIMO-NOMA transmission can achieve high SE and low latency through power-domain multiplexing in NOMA and spatial diversity in MIMO [31].
Some research results show that MIMO-NOMA can further
improve the throughput of communication systems [32]. In this
paper we mainly focus on the issue of power allocation in the
power-domain NOMA with MIMO.
Designing a proper resource allocation algorithm is critically important for improving the SE or EE of the MIMO-NOMA system [33]. Many model-based power allocation algo-
rithms have been proposed to increase the SE or EE in NOMA
systems [34, 35], and some studies have researched the joint
subchannel assignment and power allocation in order to max-
imize the EE for multicarrier NOMA systems [36, 37]. Owing to the dynamics and uncertainty inherent in wireless communication systems, the complete knowledge or mathematical models required by these conventional power allocation approaches are often difficult or even impossible to obtain in practice [38]. Moreover, because of their high computational complexity, these algorithms are inefficient or even inapplicable for future communication networks, and dynamic resource allocation under fast-changing channel conditions poses a further challenge. Optimizing energy consumption through efficient resource allocation therefore remains an important research issue for NOMA. Intelligent learning methods have been extensively developed to
less, intelligent learning methods are extensively developed to
deal with this challenge. Some studies have used DL (deep
learning) as a model-free and data-driven approach to reduce
the computational complexity with available training inputs and
outputs [39]. As one main branch of machine learning (ML), DL
is a useful method that can be applied to overcome the resource
allocation problem in the case of multiple users [21, 24,40]. DL
has been used to solve resource allocation problems by train-
ing the neural networks offline with simulated data first, and
then outputting results with the well-trained networks during
the online process [26, 27,41]. However, obtaining the correct
data set or optimal solutions used for training can be difficult
and the training process itself is actually time-consuming.
In mobile environments the accurate channel information
and the complete model of the environment’s evolution are
unknown, therefore, the model-free reinforcement learning
(RL) method has been used to solve stochastic optimization
problems in wireless networks. RL, as another main branch of
ML, can be used in solving the dynamic resources allocation
problems [18, 22, 42]. RL is able to provide solutions to
sequential decision-making problems where the environment is
unknown and the optimal policy is learned through interactions
with the environment [43, 44]. When RL is applied to wireless networks, the agent (e.g. the base station or user) observes the
environment (e.g. wireless channel) states and discovers which
actions (e.g. decisions of subchannel assignment or power allo-
cation) yield the most numerical reward (e.g. immediate EE) by
trying them, and finally generates a policy mapping states to actions. That is, the agent selects an action for the environment, and the state (e.g. current channel condition) changes after the environment accepts the action. At the same time, a reward is generated and fed back to the agent. Finally, the agent
selects the next action based on the immediate EE and the cur-
rent channel condition in order to ensure energy efficiency in
wireless communications. The goal of RL is to adjust the param-
eters dynamically so as to maximize the reward. Besides, instead
of optimizing current benefits only, RL can generate almost
optimal decision policy which maximizes the long-term perfor-
mance of the system through constant interactions. Therefore,
RL has demonstrated its enormous advantages in many fields.
Existing works that apply the model-free RL framework usually discretise the continuous network parameters into a finite set of discrete levels or learn a stochastic policy. Different from most existing work using RL algorithms,
we consider the deterministic policy gradient-based actor-critic
reinforcement learning algorithm to solve power allocation opti-
mization problem with continuous-valued actions and states in
MIMO-NOMA systems.
In this paper, we investigate the dynamic power allo-
cation problem with multi-agent DRL-based methods in a
downlink MIMO-NOMA system. Motivated by the aforemen-
tioned considerations, we propose two multi-agent DRL-based
frameworks (i.e. multi-agent DDPG/TD3-based framework) to
improve the long-term EE of the MIMO-NOMA system while
ensuring minimum data rates of all users. We construct DRL
networks on the basis of the multi-agent model [45], including an additional actor network which does not have its own critic network and is updated by the combined influence of every agent's critic network. We refer to these two frameworks as the multi-
agent DDPG-based power allocation (MDPA) framework and
the multi-agent TD3-based power allocation (MTPA) frame-
work.
The main contributions of this paper are summarized as fol-
lows:
• We develop two model-free DRL-based power allocation frameworks to solve the power allocation problem in a downlink MIMO-NOMA system. They are multi-agent frameworks based on the DDPG and TD3 algorithms, which build on deterministic policy gradient methods with continuous action spaces, since stochastic policy gradient methods cannot properly handle the dynamics and uncertainty inherent in wireless communication systems.

• In our multi-agent frameworks, every single agent dynamically outputs the power allocation policy for all users in one cluster of the MIMO-NOMA system, and we also add an additional actor network to the conventional multi-agent model in order to adjust the power volumes allocated to the clusters and improve the overall performance of the MIMO-NOMA system. To the best of our knowledge, no such method for power allocation in MIMO-NOMA systems has been studied in the existing literature.

• We analyse the performance of the proposed power allocation frameworks based on the multi-agent deterministic policy gradient in a two-user-per-cluster scenario of the MIMO-NOMA system, and compare the EE of the proposed frameworks under various power limitations and minimum data rates with single-agent DRL-based power allocation, a discrete DRL-based scheme and a fixed power allocation strategy. We also verify the energy-performance advantage of the proposed frameworks over the MIMO-OMA system. The simulation results show that the TD3-based framework achieves the best performance.
The rest of this paper is organized as follows. The related
work is reviewed in Section 2. Section 3 introduces the system
model and problem formulation of a downlink MIMO-NOMA
system. In Section 4, we propose two multi-agent DRL-based
algorithms to solve the dynamic power allocation problem of
the MIMO-NOMA system. The simulation results are discussed
in Section 5, and we conclude this study in Section 6.
2 RELATED WORK
The conventional RL algorithms suffer from slow convergence
speed and become less efficient for problems with large state
and action spaces. In order to overcome these issues, the deep
reinforcement learning (DRL), which combines deep learning
with RL, has been proposed. One famous algorithm of DRL,
named deep Q-learning [46] that is one of the off-policy meth-
ods, uses a deep Q network (DQN) which applies deep neural
networks as function approximators to conventional RL. DRL
has already been used in many aspects such as power control in
NOMA systems [47, 48], resource allocation in heterogeneous networks [25, 49] and the Internet of Things (IoT) [50]. DQN uses
a replay buffer to store tuples of historical samples in order to
stabilize training and to make efficient use of hardware optimization, and mini-batches are randomly drawn from the
replay buffer to update the weights of the networks during train-
ing. However, the main drawback of DQN is that the output
decision can only be discrete, which brings quantization error
for continuous action tasks. Besides, the output dimension of
DQN increases exponentially for multi-action and joint opti-
mization tasks. Some works that introduced the model-free RL
framework to solve stochastic optimization problems typically discretise the continuous variables of the scenario under study into a finite set of discrete values (levels), such as quantized energy or power levels. However, such methods destroy the completeness of the continuous space and introduce quantization noise, and thus they are incapable of finding the true optimal
policy. RL was initially studied only with discrete action spaces,
however, practical problems sometimes require control actions
in a continuous action space [51]. Concerning our problem with
energy efficiency, both the environment state (i.e. wireless chan-
nel condition) and the action (i.e. transmission power) have con-
tinuous spaces. For problems with continuous-valued actions, policy gradient methods [52] are very effective and, in particular, can learn stochastic policies. In order to overcome
the inefficiency and high variance of evaluating a policy in the
policy gradient methods, the actor-critic learning algorithm is
introduced to solve the optimal decision-making problems with
an infinite horizon [53]. Actor-critic algorithm is a well-known
architecture based on policy gradient theorem which allows
applications in continuous space [54]. In the actor-critic method
[55], the actor is used to generate stochastic actions whereas the
critic is used to estimate the value function and criticizes the pol-
icy that is guiding the actor, and the policy gradient evaluated by
the critic process is used to update policy.
In [51], the authors developed the deep deterministic policy gradient
(DDPG) method to improve the trainability of the stochastic
policy gradient method with continuous action space problems.
DDPG is an enhanced version of the deterministic policy gradi-
ent (DPG) algorithm [56] based on the actor-critic architecture,
which uses an actor network to output a deterministic action
and a critic network to evaluate the action [46]. DDPG takes
the advantages of experience replay buffer and target network
strategies from DQN to improve learning stability. By using
the combination of actor-critic algorithm, replay buffer, target
networks and batch normalization, DDPG is able to perform
continuous control, while the training efficiency and stability
are enhanced from original actor-critic network [51]. These fea-
tures make DDPG more efficient for dynamic resource alloca-
tion problems with continuous action spaces and make it an ideal candidate for application in industrial settings [42, 57]. However, the overestimation bias of the critic can pass to the policy through the policy gradient in DDPG, and if the actor is updated slowly, the target critics remain too similar to the current critics to avoid overestimation bias. Many algorithms have been proposed to address the overestimation bias in Q-learning. Double DQN [58] addresses the overestimation problem by using two independent Q-functions; nevertheless, it is difficult to find appropriate weights for different tasks or environments. The most common failure mode of DDPG is that the learning of the Q-function begins by overestimating the Q-value and eventually breaks down the policy, because the policy exploits the errors in the Q-function. Over-
estimation bias is a property of Q-learning which is caused by
maximization of a noisy value estimate. The value target esti-
mate depends on the target actor and target critic. As the tar-
get actor is constantly updating, the next action used in the
one-step value target estimate is also changing. Therefore, the
update of target actor may cause a big change in target value
estimate, which would cause instability in critic training [59]. In
order to reduce overestimation bias problems, the authors in
[60] extended DDPG to twin delayed deep deterministic policy
gradient (TD3) algorithm, which estimates the target Q value using the minimum of two target Q values, a technique called clipped double Q-learning. The DPG and DDPG algorithms, together with the successful work on DQN, paved the way to TD3 [60, 61]. TD3
adopts two critics to get a less optimistic estimate of an action
value by taking the minimum between two estimates. In TD3,
there is just one actor which is optimized with respect to the
smaller of two critics. In order to ensure the TD-error remains
small, the actor is updated at a lower frequency than the critic,
which results in higher quality policy updates in practice [59].
Since we focus on the issue of power allocation in MIMO-NOMA systems, which generally comprise several clusters according to the users' propagation channel conditions, we also investigate multi-agent RL methods, where a single agent outputs the policy for one cluster of the MIMO-NOMA system. With the development of DRL, single-agent algorithms have matured and have already addressed plenty of difficult problems, giving researchers more powerful instruments for studying high-dimensional and more complex action spaces. By studying single-agent RL further, researchers realized that introducing multiple agents could achieve higher performance in complex mission environments [62]. The multi-
agent DDPG algorithm that extends the DDPG algorithm to
the multi-agent domain is proposed for mixed cooperative-
competitive environments in [63], although it did not give much consideration to joint overestimation in multi-agent environments. In [45], the multi-agent TD3 algorithm is proposed to reduce the overestimation error of the critic networks, and the authors show that the multi-agent TD3 algorithm performs better than the multi-agent DDPG algorithm in complex mission environments. These ideas can likewise be extended to other multi-agent RL algorithms.
3 SYSTEM MODEL AND PROBLEM FORMULATION
3.1 System model
In this paper we consider a downlink multi-user MIMO-NOMA system, in which the BS is equipped with $M$ antennas and sends data to multiple receivers, each equipped with $N$ antennas. The total number of users in the system is $M \times L$, and they are grouped randomly into $M$ clusters with $L$ ($L \ge 2$) users per cluster. NOMA is applied among the users in the same cluster. For the signal to the $l$-th user in cluster $m$, denoted by $\mathrm{UE}_{m,l}$, the BS precodes the superposed signal $x_m = \sum_{l=1}^{L} \sqrt{p_{m,l}}\,x_{m,l}$ using a transmit beamforming matrix $\mathbf{F}$, where $p_{m,l}$ is the power allocated to the $l$-th user in cluster $m$ and $x_{m,l}$ denotes the normalized transmit signal. We consider a composite channel model with both Rayleigh fading and large-scale path loss. In particular, the channel matrix $\mathbf{H}_{m,l}$ from the BS to $\mathrm{UE}_{m,l}$ can be represented as [13]:

$$\mathbf{H}_{m,l} = \frac{\mathbf{G}_{m,l}}{\sqrt{L(d_{m,l})}}, \qquad (1)$$

$$L(d_{m,l}) = \begin{cases} d_{m,l}^{\zeta}, & d_{m,l} > d_0 \\ d_0^{\zeta}, & d_{m,l} \le d_0 \end{cases}, \qquad (2)$$

where $\mathbf{G}_{m,l} \in \mathbb{C}^{N \times M}$ represents the Rayleigh fading channel gain, $L(d_{m,l})$ denotes the path loss of $\mathrm{UE}_{m,l}$ located at a distance $d_{m,l}$ (km) from the BS, assumed to be the same at each receive antenna, $d_0$ is the reference distance according to the cell size, and $\zeta$ denotes the path loss exponent.
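For concreteness, the composite channel of Equations (1)-(2) can be sampled as in the following NumPy sketch; the reference distance, path-loss exponent and helper name are illustrative assumptions rather than values taken from the paper.

```python
import numpy as np

def sample_channel(N, M, d_ml, d0=0.05, zeta=3.0, rng=None):
    """Sample H_{m,l} = G_{m,l} / sqrt(L(d_{m,l})) as in Equations (1)-(2).

    G_{m,l} is an N x M i.i.d. circularly symmetric complex Gaussian matrix
    (Rayleigh fading), and L(d) is the bounded path-loss law with reference
    distance d0 (km) and exponent zeta (both placeholder values here).
    """
    rng = np.random.default_rng() if rng is None else rng
    G = (rng.standard_normal((N, M)) + 1j * rng.standard_normal((N, M))) / np.sqrt(2)
    L_d = max(d_ml, d0) ** zeta          # path loss, identical at every receive antenna
    return G / np.sqrt(L_d)

H = sample_channel(N=2, M=2, d_ml=0.3)   # e.g. a user 300 m from the BS
```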
The precoding matrix used at the BS is denoted as $\mathbf{F} \in \mathbb{C}^{M \times M}$, which implies that antenna $m$ serves cluster $m$ with power $p_m = \sum_{l=1}^{L} p_{m,l}$ for any $m$. At the receive side, $\mathrm{UE}_{m,l}$ employs the unit-norm receive detection vector $\mathbf{v}_{m,l} \in \mathbb{C}^{N \times 1}$ to suppress the inter-cluster interference. In order to completely remove the inter-cluster interference, the precoding and detection matrices need to satisfy the following constraints: (1) $\mathbf{F} = \mathbf{I}_M$, with $\mathbf{I}_M$ denoting the $M \times M$ identity matrix; (2) $\|\mathbf{v}_{m,l}\|^2 = 1$ and $\mathbf{v}_{m,l}^{H}\mathbf{H}_{m,l}\mathbf{f}_k = 0$ for any $k \ne m$, where $\mathbf{f}_k$ is the $k$-th column of $\mathbf{F}$ [12]. It should be noted that the number of antennas should satisfy $N \ge M$ to make this feasible. Because of the zero-forcing (ZF)-based detection design, the inter-cluster interference can be removed even when there exist multiple users in a cluster. Note that only the scalar value $|\mathbf{v}_{m,l}^{H}\mathbf{H}_{m,l}\mathbf{f}_m|^2$ needs to be fed back to the BS from $\mathrm{UE}_{m,l}$. For the considered MIMO-NOMA scheme, the BS multiplexes the intended signals for all users at the same frequency and time resource. Therefore, the corresponding transmitted signal from the BS can be expressed as:
$$\mathbf{s} = \mathbf{F}\mathbf{x}, \qquad (3)$$

where the information-bearing vector $\mathbf{x} \in \mathbb{C}^{M \times 1}$ can be further written as:

$$\mathbf{x} = \begin{bmatrix} \sqrt{p_{1,1}}\,x_{1,1} + \cdots + \sqrt{p_{1,L}}\,x_{1,L} \\ \vdots \\ \sqrt{p_{M,1}}\,x_{M,1} + \cdots + \sqrt{p_{M,L}}\,x_{M,L} \end{bmatrix}, \qquad (4)$$

where $\sum_{m=1}^{M}\sum_{l=1}^{L} p_{m,l} \le p_{\max}$ and $p_{\max}$ denotes the total transmit power at the BS. Accordingly, the observed signal at $\mathrm{UE}_{m,l}$ is given by

$$\mathbf{y}_{m,l} = \mathbf{H}_{m,l}\mathbf{s} + \mathbf{n}_{m,l}, \qquad (5)$$

where $\mathbf{n}_{m,l} \sim \mathcal{CN}(\mathbf{0},\sigma^{2}\mathbf{I})$ is the independent and identically distributed (i.i.d.) additive white Gaussian noise (AWGN) vector. By applying the detection vector $\mathbf{v}_{m,l}$ to the observed signal, Equation (5) can be expressed as [10]:
$$\mathbf{v}_{m,l}^{H}\mathbf{y}_{m,l} = \mathbf{v}_{m,l}^{H}\mathbf{H}_{m,l}\mathbf{f}_m \sum_{l=1}^{L}\sqrt{p_{m,l}}\,x_{m,l} + \sum_{k=1,\,k\ne m}^{M} \mathbf{v}_{m,l}^{H}\mathbf{H}_{m,l}\mathbf{f}_k x_k + \mathbf{v}_{m,l}^{H}\mathbf{n}_{m,l}, \qquad (6)$$

where the second term is the interference from the other clusters and $x_k$ denotes the $k$-th row of $\mathbf{x}$. Owing to the constraint¹ on the detection vector, i.e. $\mathbf{v}_{m,l}^{H}\mathbf{H}_{m,l}\mathbf{f}_k = 0$ for any $k \ne m$, Equation (6) can be simplified as:

$$\mathbf{v}_{m,l}^{H}\mathbf{y}_{m,l} = \mathbf{v}_{m,l}^{H}\mathbf{H}_{m,l}\mathbf{f}_m \sum_{l=1}^{L}\sqrt{p_{m,l}}\,x_{m,l} + \mathbf{v}_{m,l}^{H}\mathbf{n}_{m,l}. \qquad (7)$$

¹ Due to the specific selection of $\mathbf{F}$, this constraint is further reduced to $\mathbf{v}_{m,l}^{H}\tilde{\mathbf{H}}_{m,l} = \mathbf{0}$, where $\tilde{\mathbf{H}}_{m,l} = [\mathbf{h}_{1,ml} \cdots \mathbf{h}_{m-1,ml}\;\mathbf{h}_{m+1,ml} \cdots \mathbf{h}_{M,ml}]$ and $\mathbf{h}_{i,ml}$ is the $i$-th column of $\mathbf{H}_{m,l}$. Therefore, $\mathbf{v}_{m,l}$ can be expressed as $\mathbf{U}_{m,l}\mathbf{w}_{m,l}$, where $\mathbf{U}_{m,l}$ is the matrix consisting of the left singular vectors of $\tilde{\mathbf{H}}_{m,l}$ corresponding to zero singular values, and $\mathbf{w}_{m,l}$ is the maximum ratio combining vector expressed as $\mathbf{U}_{m,l}^{H}\mathbf{h}_{m,ml}\big/\|\mathbf{U}_{m,l}^{H}\mathbf{h}_{m,ml}\|$ [11, 12].
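The SVD-based construction of the detection vector outlined in footnote 1 can be sketched as follows; this is one possible NumPy implementation under stated assumptions (0-based indexing, a numerical tolerance for "zero" singular values), not the authors' code.

```python
import numpy as np

def zf_detection_vector(H_ml, m):
    """Construct the receive detection vector v_{m,l} described in footnote 1.

    H_ml : (N, M) channel matrix of UE_{m,l}; m : 0-based index of the serving
    column f_m. v = U w, where the columns of U span the left null space of
    H_tilde (H_ml with column m removed) and w is the MRC vector.
    """
    H_tilde = np.delete(H_ml, m, axis=1)           # N x (M-1)
    U_full, s, _ = np.linalg.svd(H_tilde)          # all left singular vectors
    rank = int(np.sum(s > 1e-10))                  # tolerance is an implementation choice
    U = U_full[:, rank:]                           # vectors with zero singular values
    w = U.conj().T @ H_ml[:, m]
    w = w / np.linalg.norm(w)                      # maximum ratio combining vector
    return U @ w                                   # unit norm, and v^H H_tilde = 0
```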
Without loss of generality, the effective channel gains are ordered as:

$$|\mathbf{v}_{m,1}^{H}\mathbf{H}_{m,1}\mathbf{f}_m|^2 \ge \cdots \ge |\mathbf{v}_{m,L}^{H}\mathbf{H}_{m,L}\mathbf{f}_m|^2. \qquad (8)$$

At the receiver, each user conducts SIC to remove the interference from the users with worse channel gains, i.e. the interference from $\mathrm{UE}_{m,l+1},\dots,\mathrm{UE}_{m,L}$ is removed by $\mathrm{UE}_{m,l}$. As a result, the achieved data rate at $\mathrm{UE}_{m,l}$ is given by

$$\xi_{m,l} = B_{m,l}\log_2\!\left(1 + \frac{p_{m,l}\,|\mathbf{v}_{m,l}^{H}\mathbf{H}_{m,l}\mathbf{f}_m|^2}{\sum_{k=1}^{l-1} p_{m,k}\,|\mathbf{v}_{m,l}^{H}\mathbf{H}_{m,l}\mathbf{f}_m|^2 + \sigma^2}\right), \qquad (9)$$

where $B_{m,l}$ is the bandwidth assigned to $\mathrm{UE}_{m,l}$ and $\sigma^2$ is the noise power.
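As an illustration of Equation (9), the following sketch computes the per-user rates of one cluster under the SIC decoding order of Equation (8); the bandwidth and noise-power defaults are placeholders.

```python
import numpy as np

def cluster_rates(p, g, B=10e6, noise_power=1e-12):
    """Per-user rates in one cluster under NOMA with SIC, cf. Equation (9).

    p : allocated powers p_{m,1..L}, ordered so that the effective channel gains
        g = |v^H H f_m|^2 are non-increasing (user 1 strongest, user L weakest).
    After SIC, UE_{m,l} only sees interference from the stronger users k < l.
    """
    p, g = np.asarray(p, float), np.asarray(g, float)
    rates = np.empty_like(p)
    for l in range(len(p)):
        interference = np.sum(p[:l]) * g[l]        # users k = 1..l-1 remain as interference
        sinr = p[l] * g[l] / (interference + noise_power)
        rates[l] = B * np.log2(1.0 + sinr)
    return rates  # bit/s per user
```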
3.2 Problem formulation
The total power consumption at the BS is comprised of two parts: the fixed circuit power consumption $p_c$ and the flexible transmit power $\sum_{m=1}^{M}\sum_{l=1}^{L} p_{m,l} \le p_{\max}$. In this work, we determine the power allocation coefficient $q_{m,l}$ instead of the power volume $p_{m,l}$, i.e. $q_{m,l} = p_{m,l}/p_m$ is the power allocation coefficient for $\mathrm{UE}_{m,l}$ in cluster $m$, where $p_m$ is the power allocated to cluster $m$ in a cell. Thus, Equation (9) can be rewritten as:

$$\xi_{m,l} = B_{m,l}\log_2\!\left(1 + \frac{p_m q_{m,l}\,|\mathbf{v}_{m,l}^{H}\mathbf{H}_{m,l}\mathbf{f}_m|^2}{\sum_{k=1}^{l-1} p_m q_{m,k}\,|\mathbf{v}_{m,l}^{H}\mathbf{H}_{m,l}\mathbf{f}_m|^2 + \sigma^2}\right). \qquad (10)$$
Similar to [11, 34], we define the EE of the system in cluster $m$ as:

$$\eta_{EE}^{m} = \frac{\xi_m}{p_c + p_m}, \qquad (11)$$

where $\xi_m = \sum_{l=1}^{L}\xi_{m,l}$ denotes the achievable sum rate in cluster $m$. We aim to maximize the EE of the MIMO-NOMA system when each user has a predefined minimum rate $\xi_{m,l}^{\min}$. Thus, our considered problem can be formulated as:

$$\mathrm{P1:}\quad \max_{p_m,\,q_{m,l}} \sum_{m=1}^{M} \eta_{EE}^{m}, \qquad (12)$$

$$\text{s.t.}\quad \mathrm{C1:}\ \xi_{m,l} \ge \xi_{m,l}^{\min},\ \forall m \in \{1,\dots,M\},\ l \in \{1,\dots,L\},$$
$$\mathrm{C2:}\ \sum_{m=1}^{M} p_m \le p_{\max},$$
$$\mathrm{C3:}\ \sum_{l=1}^{L} q_{m,l} \le 1,$$
$$\mathrm{C4:}\ q_{m,i} \le q_{m,j},\ \forall i \le j,\ i,j \in \{1,\dots,L\},$$
$$\mathrm{C5:}\ p_m \ge 0,\ q_{m,l} \in [0,1],\ \forall m,l,$$

where C1 represents the users' minimum rate requirements, and C2 and C3 respectively represent the transmit power constraints in a cell and in a cluster. C4 indicates that the user with the worse channel condition is allocated more power, and C5 gives the inherent constraints on $p_m$ and $q_{m,l}$.

FIGURE 1: The architecture of the wireless network with small cells and the reinforcement learning formulation
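Building on the rate sketch above, the per-cluster objective of Equations (10)-(11) and the QoS constraint C1 can be evaluated as follows; the circuit power and other defaults are illustrative, and the function reuses the hypothetical cluster_rates helper defined earlier.

```python
import numpy as np

def cluster_energy_efficiency(p_m, q, g, xi_min, p_c=1.0, B=10e6, noise_power=1e-12):
    """Evaluate the per-cluster EE of Equations (10)-(11) and constraint C1.

    p_m : power allocated to cluster m; q : coefficients q_{m,1..L} (summing to
    at most 1, non-decreasing); g : ordered effective channel gains.
    Returns (eta_EE, feasible). p_c, B and noise_power are placeholder values.
    """
    p = p_m * np.asarray(q, float)                  # per-user powers p_m * q_{m,l}
    rates = cluster_rates(p, g, B, noise_power)     # Equation (10), via the earlier sketch
    eta_ee = rates.sum() / (p_c + p_m)              # Equation (11)
    feasible = bool(np.all(rates >= xi_min))        # constraint C1
    return eta_ee, feasible
```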
This optimization problem is non-convex and NP-hard, and
the global optimal solution is usually difficult to obtain in prac-
tice due to the high computational complexity and the randomly
evolving channel conditions. More importantly, the conven-
tional model-based approaches can hardly satisfy the require-
ments of future wireless communication services. Therefore, we
propose two DRL-based frameworks in the following sections
to deal with this problem.
4 DYNAMIC POWER ALLOCATION WITH DRL
4.1 Deep reinforcement learning formulation for MIMO-NOMA systems
In this section, the optimization problem of power allocation
in a downlink MIMO-NOMA system is modelled as a rein-
forcement learning task, which consists of an agent and envi-
ronment interacting with each other. In Figure 1, we describe
the architecture of wireless network with small-cells which are
ultra-densely deployed and have the same number of anten-
nas as user handsets, or even less. As shown in Figure 1,the
base station is treated as the agent and the wireless channel
of the MIMO-NOMA system is the environment. The action,
transmit power from the BS to users, taken by the DRL con-
troller (at the BS) is based on the state which is the collec-
tive information on channel condition from users. Then at
SONNAM ET AL.1647
FIGURE 2 Multi-agent DRL-based MIMO-NOMA power allocation
system model
each step, based on the observed state of the environment,
the agent performs an action from the action space to allo-
cate power to users according to the power allocation policy,
where the policy is learned by the DRL algorithms. With the
obtained transmit power, the optimal power allocation is con-
ducted and the step reward can be computed and fed back to
the agent. This reward is the energy efficiency of the MIMO-
NOMA system. The agent can perform actions, such as power
allocation, optimized for a given channel information through
a repetitive learning of the selection process that maximizes the
reward.
In this paper, we propose two multi-agent DRL-based frame-
works (i.e. MDPA and MTPA) for the downlink MIMO-
NOMA system to derive the power allocation decision, which
are respectively based on the DDPG and TD3 algorithms. The multi-agent DRL-based network comprises $M$ pairs of actor and critic networks, where each pair outputs the power allocation coefficients for all users in one cluster, as well as one additional actor network which decides the power volume allocated to every cluster. In Figure 2, we show the multi-agent DRL-based power allocation network mechanism with $M$ clusters in the MIMO-NOMA system (the target networks used for soft weight updates are not shown), where actor network $\mu_m$ ($m = 1,\dots,M$) decides the power allocation coefficients for all users in cluster $m$ and actor network $\mu_0$ decides the power volume allocated to every cluster. Compared with the mechanisms in previous works [42, 48, 57], we not only adopt the off-policy multi-agent method in accordance with the properties of MIMO-NOMA systems, but also create an additional actor network (actor $\mu_0$ in Figure 2) which is updated in cooperation with the other agents' critic networks to adjust the power volumes allocated to the clusters, thereby improving the overall performance of the MIMO-NOMA system.
In order to better illustrate our algorithm, we first briefly introduce the background of DRL. A general RL model consists of four parts: the state space $\mathcal{S}$, the action space $\mathcal{A}$, the immediate reward $\mathcal{R}$ and the transition probability $\mathcal{P}_{ss'}$ between the current state and the next state. In every time slot (TS) $t$, one or several agents take an action $a_t$ (e.g. a power allocation decision) under the current state $s_t$ (e.g. the current channel condition), then receive an immediate reward $r_t$ (e.g. the immediate EE) and a new state $s_{t+1}$, which we define as follows [42, 57].
States: The state $s_t$ in TS $t$ is defined as the current channel gains of all users:

$$s_t = \{s_t^1, \dots, s_t^M\}, \qquad (13a)$$
$$s_t^m = \{h_{m,1}(t), \dots, h_{m,L}(t)\}, \qquad (13b)$$

where $s_t^m$ ($m = 1,\dots,M$) represents the current channel gains of all users in cluster $m$, and $h_{m,l}(t)$ represents the channel gain between the BS and user $\mathrm{UE}_{m,l}$ in TS $t$. They are assumed to be obtained at the beginning of the TS. The state-space complexity is related to the number of UEs in the system. Since the total number of users in the system is $M \times L$, grouped randomly into $M$ clusters with $L$ users per cluster, the space complexity of $s_t$ is $O(M \times L)$ and that of $s_t^m$ is $O(L)$.
Actions: The action space should contain all power allocation decisions, so the action $a_t$ in TS $t$ is

$$a_t = \{a_t^1, \dots, a_t^M\}, \qquad (14a)$$
$$a_t^m = (p_t^m, q_t^m), \qquad (14b)$$
$$q_t^m = \{q_{m,1}(t), \dots, q_{m,L}(t)\}, \qquad (14c)$$

where $p_t^m \in \mathbb{R}^1$ is the power volume allocated to cluster $m$ in a cell, $q_t^m \in \mathbb{R}^L$ is the vector of power allocation coefficients for all users in cluster $m$, and $q_{m,l}(t) \in \mathbb{R}^1$ is the power allocation coefficient for $\mathrm{UE}_{m,l}$ in TS $t$. The power allocation decision network outputs power allocation coefficients for every user and a power volume for every cluster; hence, each user is allocated its coefficient's portion of the power volume assigned to the corresponding cluster. Therefore, the action-space complexity of the power allocation network is $O(M \times (L+1))$. The transmit power is a continuous variable, which makes the action space infinite, so we develop the MDPA and MTPA networks to solve the aforementioned problems.
Rewards: We use the EE of the MIMO-NOMA system in cluster $m$ as the immediate reward $r_t^m$ returned after choosing action $a_t$ in state $s_t$, that is

$$r_t^m = \eta_{EE}^m. \qquad (15)$$

We aim to maximize the long-term cumulative discounted reward defined as:

$$R_t^m = \sum_{i=0}^{\infty} \gamma^i r_{t+i}^m, \qquad (16)$$
with discount factor 𝛾∈(0,1).
In order to achieve this goal, a policy $\pi_m$ for cluster $m$, defined as a function mapping states to actions ($\pi_m: \mathcal{S} \to \mathcal{A}$), is needed. The policy $\pi_m$ acts as guidance, i.e. it tells the agent which $a_t^m$ should be taken in a specific $s_t^m$ to achieve the expected $R_t^m$; maximizing Equation (16) is thus equivalent to finding the optimal policy, represented as $\pi_m^*$. For a typical RL problem, the Q value function [43], which describes the expected
cumulative reward $R_t^m$ when starting from $s_t^m$, performing action $a_t^m$ and thereafter following policy $\pi_m$, is instrumental in solving RL problems, and is defined as:

$$Q^{\pi_m}(s_t^m, a_t^m) = \mathbb{E}\big[R_t^m \,\big|\, s_t^m, a_t^m\big] = \mathbb{E}\Big[\sum_{i=0}^{\infty} \gamma^i r_{t+i}^m \,\Big|\, s_t^m, a_t^m\Big] = \mathbb{E}\big[r_t^m + \gamma Q^{\pi_m}(s_{t+1}^m, a_{t+1}^m) \,\big|\, s_t^m, a_t^m\big]. \qquad (17)$$
The optimal policy $\pi_m^*$ which maximizes Equation (16) also maximizes Equation (17) for all states and actions in cluster $m$, and the corresponding optimal Q value function following $\pi_m^*$ is given as:

$$Q^{\pi_m^*}(s_t^m, a_t^m) = \mathbb{E}\big[r_t^m + \gamma \max_{a_{t+1}^m} Q^{\pi_m^*}(s_{t+1}^m, a_{t+1}^m) \,\big|\, s_t^m, a_t^m\big]. \qquad (18)$$
Once we have the optimal Q value function Equation (18),
the agent knows how to select actions optimally.
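To make the agent-environment interaction concrete, the following sketch wraps the state, action and reward definitions of Equations (13)-(15) in a minimal environment; the channel dynamics and the reuse of the cluster_energy_efficiency helper from Section 3 are simplifying assumptions, not the authors' simulator.

```python
import numpy as np

class MimoNomaEnv:
    """Minimal environment sketch for the RL formulation of Section 4.1.

    State  : current effective channel gains of all M x L users, Equation (13).
    Action : (p_1..p_M, q_{1,1}..q_{M,L}), Equation (14).
    Reward : per-cluster EE eta^m_EE, Equation (15); zero when the minimum rate
             requirement C1 is violated, mirroring step 9 of Algorithms 1 and 2.
    """
    def __init__(self, M=2, L=2, xi_min=1e8, rng=None):
        # xi_min = 1e8 bit/s, e.g. 10 bit/s/Hz over a 10 MHz band (placeholder)
        self.M, self.L, self.xi_min = M, L, xi_min
        self.rng = np.random.default_rng() if rng is None else rng
        self.state = self._sample_gains()

    def _sample_gains(self):
        # placeholder channel dynamics: fresh Rayleigh-type gains every time slot
        return self.rng.exponential(1.0, size=(self.M, self.L))

    def step(self, p, q):
        rewards = np.zeros(self.M)
        for m in range(self.M):
            g = np.sort(self.state[m])[::-1]                 # ordered as in Equation (8)
            ee, ok = cluster_energy_efficiency(p[m], q[m], g, self.xi_min)
            rewards[m] = ee if ok else 0.0
        self.state = self._sample_gains()                    # users move to next positions
        return self.state.copy(), rewards
```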
4.2 Multi-agent DDPG-based power
allocation (MDPA) network
We use an off-policy multi-agent DDPG network based on the actor-critic structure to solve the dynamic power allocation problem, where the actor part is an enhanced DPG network and the critic part is a DQN. As mentioned before, the centralized controller receives the channel gains of all users as $s_t = \{s_t^1,\dots,s_t^M\}$ at the beginning of each TS. With input $s_t$, every DNN named actor network with weights $\mu_k$ ($k = 0,1,\dots,M$) outputs a deterministic action rather than a stochastic probability distribution over actions, which removes the further sampling and integration operations required in other actor-critic-based methods [46], that is

$$p_t = \{p_t^1,\dots,p_t^M\} = \pi_0(s_t;\mu_0), \quad p_t^m = \pi_0^{(m)}(s_t;\mu_0), \qquad (19a)$$
$$q_t^m = \pi_m(s_t^m;\mu_m), \quad m = 1,2,\dots,M, \qquad (19b)$$

where $\pi_0$ is the policy corresponding to actor $\mu_0$, which decides the power volume allocated to every cluster in a cell, $\pi_0^{(m)}$ denotes the $m$-th element of $\pi_0(s_t;\mu_0)$, and the policy $\pi_m$ decides the power allocation coefficients for all users in cluster $m$.
However, a major challenge of learning with deterministic actions is the exploration of new actions [49]. Fortunately, for an off-policy algorithm such as DDPG, exploration can be treated independently from the learning process. We construct the exploration policy by adding noise to the original output action, similarly to the random selection of the $\epsilon$-greedy method in DQN, that is

$$p_t = \big[\pi_0(s_t;\mu_0) + \mathcal{N}_0\big]_0^{p_{\max}}, \qquad (20)$$
$$q_t^m = \big[\pi_m(s_t^m;\mu_m) + \mathcal{N}_m\big]_0^{1}, \quad m = 1,2,\dots,M, \qquad (21)$$

where $\mathcal{N}_0 \in \mathbb{R}^M$ and $\mathcal{N}_m \in \mathbb{R}^1$ represent the noise and follow a normal distribution. The action $p_t^m$ is restricted to the interval $(0, p_{\max})$ and $q_t^m$ is restricted to the interval $(0, 1)$.
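A minimal PyTorch sketch of the deterministic actor networks of Equations (19a)-(19b) and the clipped exploration noise of Equations (20)-(21) is given below; the hidden-layer sizes and the sigmoid-plus-scaling output are assumptions, since the paper only states that an extra layer scales the output to the power bound.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Deterministic actor sketch: maps a state to actions in (0, scale).

    For pi_0 the input is the full state s_t (M*L gains) and scale = p_max;
    for pi_m the input is s_t^m (L gains) and scale = 1 (allocation coefficients).
    """
    def __init__(self, state_dim, action_dim, scale, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Sigmoid(),   # output in (0, 1)
        )
        self.scale = scale                                  # extra layer scaling to the bound

    def forward(self, state):
        return self.scale * self.net(state)

def explore(action, sigma, low, high):
    """Add Gaussian exploration noise and clip, as in Equations (20)-(21)."""
    return (action + sigma * torch.randn_like(action)).clamp(low, high)

M, L, p_max = 2, 2, 1.0
actor_0 = Actor(M * L, M, scale=p_max)                 # decides cluster power volumes p_t
actors = [Actor(L, L, scale=1.0) for _ in range(M)]    # decide coefficients q_t^m
```

A softmax output layer for the coefficient actors would enforce constraint C3 exactly; the clipping used here only keeps each coefficient in (0, 1), mirroring Equation (21).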
After executing the action $a_t$ and receiving $r_t = \{r_t^1,\dots,r_t^M\}$, the system moves to the next state $s_{t+1}$. Since the action $a_t$ is generated by a deterministic policy network, we rewrite Equation (17) as:

$$Q^{\pi_m}(s_t^m, a_t^m) = \mathbb{E}\Big[r_t^m + \gamma Q^{\pi_m}\big(s_{t+1}^m, \big(\pi_0^{(m)}(s_{t+1};\mu_0), \pi_m(s_{t+1}^m;\mu_m)\big)\big)\Big], \quad m = 1,2,\dots,M, \qquad (22)$$

$$\nabla_{\mu_m} J(\pi_m) = \mathbb{E}\Big[\nabla_{q^m} Q^{\pi_m}(s^m,a^m;\theta_m)\big|_{p^m=\pi_0^{(m)}(s;\mu_0)}\,\nabla_{\mu_m}\pi_m(s^m;\mu_m)\Big] \approx \frac{1}{N_b}\sum_i \nabla_{q^m} Q^{\pi_m}(s^m,a^m;\theta_m)\big|_{s^m=s_i^m,\,p^m=\pi_0^{(m)}(s_i;\mu_0)}\,\nabla_{\mu_m}\pi_m(s^m;\mu_m)\big|_{s^m=s_i^m}, \qquad (23)$$

$$\nabla_{\mu_0} J(\pi_0) = \mathbb{E}\Big[\sum_{m=1}^{M}\nabla_{p^m} Q^{\pi_m}(s^m,a^m;\theta_m)\big|_{q^m=\pi_m(s^m;\mu_m)}\,\nabla_{\mu_0}\pi_0(s;\mu_0)\Big] \approx \frac{1}{N_b}\sum_{m=1}^{M}\sum_i \nabla_{p^m} Q^{\pi_m}(s^m,a^m;\theta_m)\big|_{s=s_i,\,q^m=\pi_m(s_i^m;\mu_m)}\,\nabla_{\mu_0}\pi_0(s;\mu_0)\big|_{s=s_i}. \qquad (24)$$
Since it is impractical to calculate the Q value of every step in this way, we also use another DNN named critic network with weights $\theta_m$ ($m = 1,2,\dots,M$) to output the approximate Q value $Q^{\pi_m}(s_t^m, a_t^m;\theta_m)$ to evaluate the selected action $a_t^m$. We utilize the experience replay method to allow the networks to benefit from learning across a set of uncorrelated tuples. With $N_b$ random tuples $(s_i, a_i, r_i, s_{i+1})$ sampled from the replay memory $\mathcal{D}$, the actor network for $\pi_m(s_t^m;\mu_m)$ updates its weights in the direction of larger Q values according to the deterministic policy gradient theorem in [49]; the updates are given by Equations (23) and (24), where $J(\pi_m)$ and $J(\pi_0)$ respectively represent the expected total reward of following policy $\pi_m$ in all states of cluster $m$ and policy $\pi_0$ in all states in a cell. We then use the target network architecture from DQN to solve the unstable learning issue caused by using only one network to calculate the target Q values and update the weights at the same time. We create copies of the actor network and the critic network, $\pi_m(s^m;\bar\mu_m)$ and $Q^{\pi_m}(s^m,a^m;\bar\theta_m)$, and name them the target actor network and the target critic network. Thus, the target Q values can be generated by these networks, that is

$$y_i^m = r_i^m + \gamma Q^{\pi_m}\big(s_{i+1}^m, \big(\pi_0^{(m)}(s_{i+1};\mu_0), \pi_m(s_{i+1}^m;\bar\mu_m)\big);\bar\theta_m\big), \qquad (25)$$

$$z_i^m = r_i^m + \gamma Q^{\pi_m}\big(s_{i+1}^m, \big(\pi_0^{(m)}(s_{i+1};\bar\mu_0), \pi_m(s_{i+1}^m;\mu_m)\big);\bar\theta_m\big). \qquad (26)$$
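The experience replay and target-network machinery referred to above can be sketched as follows; the capacity and batch size match the values reported later in Section 5, and the helper names are illustrative.

```python
import copy
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity experience replay shared by MDPA and MTPA."""
    def __init__(self, capacity=200):
        self.buffer = deque(maxlen=capacity)

    def store(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size=32):
        batch = random.sample(self.buffer, batch_size)
        s, a, r, s_next = zip(*batch)
        return s, a, r, s_next

def make_target(network):
    """Create a target network as an exact, frozen copy of the online network."""
    target = copy.deepcopy(network)
    for p in target.parameters():
        p.requires_grad_(False)
    return target
```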
ALGORITHM 1: The multi-agent DDPG-based power allocation (MDPA) in downlink MIMO-NOMA

1. Initialize the replay memory $\mathcal{D}$ with capacity $D$.
2. Initialize the multi-agent DDPG-based power allocation actor networks $\pi_0(s_t;\mu_0)$, $\pi_m(s_t^m;\mu_m)$ and critic networks $Q^{\pi_m}(s_t^m,a_t^m;\theta_m)$ with random parameters $\mu_0$, $\mu_m$ and $\theta_m$ ($m = 1,2,\dots,M$).
3. Initialize the target actor networks $\pi_0(s_t;\bar\mu_0)$, $\pi_m(s_t^m;\bar\mu_m)$ and target critic networks $Q^{\pi_m}(s_t^m,a_t^m;\bar\theta_m)$ with parameters $\bar\mu_0 = \mu_0$, $\bar\mu_m = \mu_m$ and $\bar\theta_m = \theta_m$.
4. Initialize the random processes $\mathcal{N}_0$, $\mathcal{N}_m$ for the DDPG action exploration, the terminal TS $T_{\max}$ and the parameter update interval $C$.
5. The controller at the BS receives the first channel condition information of all users as the initial state $s_1$.
6. for $t = 1,2,\dots,T_{\max}$ do
7.   The controller selects the power allocation action $a_t^m = (p_t^m, q_t^m)$ according to Equations (20), (21).
8.   The controller broadcasts the power allocation action $a_t^m$ to all users, and the users transmit their signals with the specified power.
9.   If the action $a_t^m$ for cluster $m$ satisfies the minimum rate requirement, the controller receives the current EE as reward $r_t^m$; otherwise, it receives no reward.
10.  The controller receives the next state $s_{t+1}$ as the users move to their next positions.
11.  Store the tuple $(s_t, a_t, r_t, s_{t+1})$ in $\mathcal{D}$.
12.  Randomly sample a mini-batch of $N_b$ tuples $(s_i, a_i, r_i, s_{i+1})$ from $\mathcal{D}$.
13.  Critic networks $Q^{\pi_m}(s_t^m,a_t^m;\theta_m)$ update $\theta_m$ by minimizing the loss function of Equation (27).
14.  Actor networks $\pi_m(s_t^m;\mu_m)$ update $\mu_m$ according to Equation (23).
15.  Critic networks $Q^{\pi_m}(s_t^m,a_t^m;\theta_m)$ update $\theta_m$ by minimizing the loss function of Equation (28).
16.  Actor network $\pi_0(s_t;\mu_0)$ updates $\mu_0$ according to Equation (24).
17.  Update the target actor networks $\bar\mu_0$, $\bar\mu_m$ and target critic networks $\bar\theta_m$ according to Equation (29) every $C$ TSs.
18. end for
We use the same method as in DQN to update the critic network weights by minimizing the loss functions defined as:

$$L_m(\theta_m) = \frac{1}{N_b}\sum_i \Big(y_i^m - Q^{\pi_m}(s_i^m, a_i^m;\theta_m)\big|_{p_i^m = \pi_0^{(m)}(s_i;\mu_0)}\Big)^2, \qquad (27)$$

$$L_0 = \frac{1}{N_b}\sum_{m=1}^{M}\sum_i \Big(z_i^m - Q^{\pi_m}(s_i^m, a_i^m;\theta_m)\big|_{q_i^m = \pi_m(s_i^m;\mu_m)}\Big)^2. \qquad (28)$$
Instead of directly copying the weights to the target networks, we update the weights $\bar\mu_k$ and $\bar\theta_m$ in a soft manner to make sure the weights change slowly, which greatly improves the stability of learning, that is

$$\bar\theta_m \leftarrow \tau\theta_m + (1-\tau)\bar\theta_m, \quad m = 1,2,\dots,M,$$
$$\bar\mu_k \leftarrow \tau\mu_k + (1-\tau)\bar\mu_k, \quad k = 0,1,\dots,M, \qquad (29)$$

where $0 < \tau < 1$. The MDPA algorithm in a downlink MIMO-
NOMA system is summarized in Algorithm 1.
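A sketch of one MDPA learning step for agent m is given below, combining the critic loss of Equation (27), the actor gradient of Equation (23) and the soft target update of Equation (29); the critic architecture and the mini-batch tensor layout are assumptions, and the analogous update of the shared actor $\mu_0$ (Equations (24), (26), (28)) follows the same pattern and is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Critic(nn.Module):
    """Critic sketch Q(s^m, a^m; theta_m), with a^m = (p^m, q^m); widths are illustrative."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

def soft_update(target, online, tau=0.01):
    """Soft target update of Equation (29): theta_bar <- tau*theta + (1-tau)*theta_bar."""
    for tp, op in zip(target.parameters(), online.parameters()):
        tp.data.mul_(1.0 - tau).add_(tau * op.data)

def mdpa_update_agent(m, critic, critic_tgt, actor_m, actor_m_tgt, actor_0,
                      critic_opt, actor_opt, batch, gamma=0.9, tau=0.01):
    """One MDPA learning step for agent m (Equations (23), (25), (27), (29)).

    `batch` holds mini-batch tensors (s, s_m, a_m, r_m, s_next, s_m_next).
    """
    s, s_m, a_m, r_m, s_next, s_m_next = batch

    # Target Q value y_i^m of Equation (25): target actor/critic of agent m, current pi_0.
    with torch.no_grad():
        p_next = actor_0(s_next)[..., m:m + 1]                  # pi_0^(m)(s_{i+1}; mu_0)
        a_next = torch.cat([p_next, actor_m_tgt(s_m_next)], dim=-1)
        y = r_m + gamma * critic_tgt(s_m_next, a_next)

    # Critic update by minimising the MSE loss of Equation (27).
    critic_loss = F.mse_loss(critic(s_m, a_m), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor update along the deterministic policy gradient of Equation (23).
    p_now = actor_0(s)[..., m:m + 1].detach()
    actor_loss = -critic(s_m, torch.cat([p_now, actor_m(s_m)], dim=-1)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft update of the target networks, Equation (29).
    soft_update(critic_tgt, critic, tau)
    soft_update(actor_m_tgt, actor_m, tau)
```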
4.3 Multi-agent TD3-based power
allocation (MTPA) network
Twin-delayed deep deterministic policy gradient (TD3) is an off-policy model for continuous high-dimensional spaces which has recently achieved breakthroughs in artificial intelligence. RL algorithms characterized as off-policy generally utilize a separate behaviour policy which is independent of the policy being improved upon. The key advantage of this separation is that the behaviour policy can operate by sampling all actions, while the estimation policy can be deterministic [61]. TD3 was built on the DDPG algorithm to increase stability and performance by taking function approximation error into consideration [60]. The uniqueness of the TD3 algorithm lies in the combination of three powerful DRL techniques: continuous double deep Q-learning [58], policy gradient [56] and actor-critic [54]. Even though DDPG can sometimes achieve good performance, it is very sensitive to hyper-parameters and other types of adjustment. Overestimation of the Q-value at the beginning of the Q-function learning process is the common failure mode of DDPG. Overestimation bias is a property of Q-learning in which the maximization of a noisy value estimate induces a consistent overestimation [60, 64]. This noise is unavoidable given the inaccuracy of the estimator in function approximation settings; therefore, overestimation can occur due to inaccurate action values. TD3 solves this problem by introducing three key techniques [60]. The first is clipped double Q networks: TD3 learns two Q-functions and uses the smaller of the two Q-values to form the target of the Bellman error. The second is the delayed policy update, where the policy network parameters are updated after the two Q-function networks are updated. Thirdly, target policy smoothing keeps all perturbed action values within a specific range [45].
In this paper, we also use the multi-agent TD3 network to derive the power allocation decision in the MIMO-NOMA system, where the network mechanism is the same as in the above DDPG network. Every actor network $m$ ($m = 1,2,\dots,M$) learns two Q-functions to get a less optimistic estimate of an action value by taking the minimum between the two estimates, so the two Q-functions with weights $\theta_m^j$ for every actor network can be expressed as:

$$Q^{\pi_m}(s_t^m, a_t^m;\theta_m^j) = \mathbb{E}\Big[r_t^m + \gamma Q^{\pi_m}\big(s_{t+1}^m, \big(\pi_0^{(m)}(s_{t+1};\mu_0), \pi_m(s_{t+1}^m;\mu_m)\big);\theta_m^j\big)\Big], \quad m = 1,2,\dots,M,\ j = 1,2. \qquad (30)$$
Similar to our DDPG-based method, we utilize the experience replay memory, and the target Q value $y_i^m$ is then given by:

$$y_i^m = r_i^m + \gamma \min_{j=1,2} Q^{\pi_m}\big(s_{i+1}^m, \big(\pi_0^{(m)}(s_{i+1};\mu_0), q_t^m\big);\bar\theta_m^j\big), \qquad (31)$$

$$q_t^m = \big[\pi_m(s_{i+1}^m;\bar\mu_m) + \epsilon_m\big]_0^1, \quad m = 1,2,\dots,M, \quad \epsilon_m \sim \operatorname{clip}\big(\mathcal{N}(0,\sigma_{\epsilon_m}), 0, 1\big), \qquad (32)$$

$$z_i^m = r_i^m + \gamma \min_{j=1,2} Q^{\pi_m}\big(s_{i+1}^m, \big(p_t^m, \pi_m(s_{i+1}^m;\mu_m)\big);\bar\theta_m^j\big), \qquad (33)$$

$$p_t^m = \big[\pi_0^{(m)}(s_{i+1};\bar\mu_0) + \varepsilon_m\big]_0^{p_{\max}}, \quad \varepsilon_m \sim \operatorname{clip}\big(\mathcal{N}(0,\sigma_{\varepsilon_m}), 0, p_{\max}\big), \qquad (34)$$

where the added noise terms $\epsilon_m$, $\varepsilon_m$ are clipped to keep the target close to the original action.
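The clipped double-Q target of Equations (31)-(32) can be sketched as below; the function signature and the noise clipping range follow the equations as written above and are otherwise illustrative.

```python
import torch

def td3_target(r_m, s_m_next, p_next_policy, q_next_policy,
               critic_tgt_1, critic_tgt_2, gamma=0.9, sigma=0.1, high=1.0):
    """Clipped double-Q target with target-policy smoothing, cf. Equations (31)-(32).

    p_next_policy : pi_0^(m)(s_{i+1}; mu_0); q_next_policy : target actor output
    pi_m(s^m_{i+1}; mu_bar_m). The smoothing noise and the perturbed coefficients
    are clipped so the action stays within its valid range.
    """
    with torch.no_grad():
        noise = (sigma * torch.randn_like(q_next_policy)).clamp(0.0, high)
        q_next = (q_next_policy + noise).clamp(0.0, high)
        a_next = torch.cat([p_next_policy, q_next], dim=-1)
        q1 = critic_tgt_1(s_m_next, a_next)
        q2 = critic_tgt_2(s_m_next, a_next)
        return r_m + gamma * torch.min(q1, q2)
```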
The critic networks are then updated by minimizing the loss functions defined as:

$$L_m(\theta_m^1,\theta_m^2) = \frac{1}{N_b}\sum_i\sum_j \Big(y_i^m - Q^{\pi_m}(s_i^m, a_i^m;\theta_m^j)\big|_{p_i^m = \pi_0^{(m)}(s_i;\mu_0)}\Big)^2, \qquad (35)$$

$$L_0 = \frac{1}{N_b}\sum_{m=1}^{M}\sum_i\sum_j \Big(z_i^m - Q^{\pi_m}(s_i^m, a_i^m;\theta_m^j)\big|_{q_i^m = \pi_m(s_i^m;\mu_m)}\Big)^2, \qquad (36)$$
$$\nabla_{\mu_m} J(\pi_m) = \mathbb{E}\Big[\nabla_{q^m} Q^{\pi_m}(s^m,a^m;\theta_m^1)\big|_{p^m=\pi_0^{(m)}(s;\mu_0)}\,\nabla_{\mu_m}\pi_m(s^m;\mu_m)\Big] \approx \frac{1}{N_b}\sum_i \nabla_{q^m} Q^{\pi_m}(s^m,a^m;\theta_m^1)\big|_{s^m=s_i^m,\,p^m=\pi_0^{(m)}(s_i;\mu_0)}\,\nabla_{\mu_m}\pi_m(s^m;\mu_m)\big|_{s^m=s_i^m}, \qquad (37)$$

$$\nabla_{\mu_0} J(\pi_0) = \mathbb{E}\Big[\sum_{m=1}^{M}\nabla_{p^m} Q^{\pi_m}(s^m,a^m;\theta_m^1)\big|_{q^m=\pi_m(s^m;\mu_m)}\,\nabla_{\mu_0}\pi_0(s;\mu_0)\Big] \approx \frac{1}{N_b}\sum_{m=1}^{M}\sum_i \nabla_{p^m} Q^{\pi_m}(s^m,a^m;\theta_m^1)\big|_{s=s_i,\,q^m=\pi_m(s_i^m;\mu_m)}\,\nabla_{\mu_0}\pi_0(s;\mu_0)\big|_{s=s_i}. \qquad (38)$$
We then update the actor network for $\pi_m$ in the direction of larger Q values with $\theta_m^1$ by Equations (37) and (38). Finally, we update the weights $\bar\mu_0$, $\bar\mu_m$ and $\bar\theta_m^j$ in a soft manner to make sure the weights change slowly, that is

$$\bar\mu_0 \leftarrow \tau\mu_0 + (1-\tau)\bar\mu_0, \quad \bar\theta_m^j \leftarrow \tau\theta_m^j + (1-\tau)\bar\theta_m^j, \quad m = 1,2,\dots,M,\ j = 1,2, \qquad (39)$$

$$\bar\mu_m \leftarrow \tau\mu_m + (1-\tau)\bar\mu_m, \quad m = 1,\dots,M. \qquad (40)$$
The MTPA algorithm in a downlink MIMO-NOMA system
is summarized in Algorithm 2.
ALGORITHM 2: The multi-agent TD3-based power allocation (MTPA) in downlink MIMO-NOMA

1. Initialize the replay memory $\mathcal{D}$ with capacity $D$.
2. Initialize the multi-agent TD3-based power allocation actor networks $\pi_0(s_t;\mu_0)$, $\pi_m(s_t^m;\mu_m)$ and critic networks $Q^{\pi_m}(s_t^m,a_t^m;\theta_m^1)$, $Q^{\pi_m}(s_t^m,a_t^m;\theta_m^2)$ with random parameters $\mu_0$, $\mu_m$, $\theta_m^1$ and $\theta_m^2$ ($m = 1,2,\dots,M$).
3. Initialize the target actor networks $\pi_0(s_t;\bar\mu_0)$, $\pi_m(s_t^m;\bar\mu_m)$ and target critic networks $Q^{\pi_m}(s_t^m,a_t^m;\bar\theta_m^1)$, $Q^{\pi_m}(s_t^m,a_t^m;\bar\theta_m^2)$ with parameters $\bar\mu_0 = \mu_0$, $\bar\mu_m = \mu_m$, $\bar\theta_m^1 = \theta_m^1$ and $\bar\theta_m^2 = \theta_m^2$.
4. Initialize the random processes $\mathcal{N}_0$, $\mathcal{N}_m$ for the TD3 action exploration, the terminal TS $T_{\max}$ and the parameter update interval $C$.
5. The controller at the BS receives the first channel condition information of all users as the initial state $s_1$.
6. for $t = 1,2,\dots,T_{\max}$ do
7.   The controller selects the power allocation action $a_t^m = (p_t^m, q_t^m)$ according to Equations (20), (21).
8.   The controller broadcasts the power allocation action $a_t^m$ to all users, and the users transmit their signals with the specified power.
9.   If the action $a_t^m$ for cluster $m$ satisfies the minimum rate requirement, the controller receives the current EE as reward $r_t^m$; otherwise, it receives no reward.
10.  The controller receives the next state $s_{t+1}$ as the users move to their next positions.
11.  Store the tuple $(s_t, a_t, r_t, s_{t+1})$ in $\mathcal{D}$.
12.  Randomly sample a mini-batch of $N_b$ tuples $(s_i, a_i, r_i, s_{i+1})$ from $\mathcal{D}$.
13.  Critic networks $Q^{\pi_m}(s_t^m,a_t^m;\theta_m^j)$ update $\theta_m^j$ by minimizing the loss function of Equation (35).
14.  if $t \bmod C = 0$ then
15.    Actor networks $\pi_m(s_t^m;\mu_m)$ update $\mu_m$ according to Equation (37).
16.    Update the target actor networks $\bar\mu_m$ according to Equation (40).
17.  end if
18.  Critic networks $Q^{\pi_m}(s_t^m,a_t^m;\theta_m^j)$ update $\theta_m^j$ by minimizing the loss function of Equation (36).
19.  if $t \bmod C = 0$ then
20.    Actor network $\pi_0(s_t;\mu_0)$ updates $\mu_0$ according to Equation (38).
21.    Update the target actor network $\bar\mu_0$ and target critic networks $\bar\theta_m^1$, $\bar\theta_m^2$ according to Equation (39).
22.  end if
23. end for
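The overall MTPA training procedure of Algorithm 2, with critics updated every TS and actors/targets only every C TSs, can be outlined as follows; the agent objects and the select_actions / update_actor_0 helpers are hypothetical placeholders standing in for the updates defined by Equations (35)-(40), so this is a structural sketch rather than the authors' implementation.

```python
def mtpa_training_loop(env, agents, actor_0, select_actions, update_actor_0, buffer,
                       T_max=5000, C=10, batch_size=32):
    """Outline of Algorithm 2 with the delayed policy update.

    Each element of `agents` is assumed to expose update_critics / update_actor /
    soft_update_targets (hypothetical methods implementing Equations (35), (37),
    (39)-(40)); update_actor_0 stands for Equations (36) and (38).
    """
    state = env.state                                      # initial state s_1 (step 5)
    for t in range(1, T_max + 1):
        p, q = select_actions(actor_0, agents, state)      # exploration, Equations (20)-(21)
        next_state, rewards = env.step(p, q)               # apply action, observe EE rewards
        buffer.store(state, (p, q), rewards, next_state)   # step 11
        state = next_state

        if len(buffer.buffer) >= batch_size:
            batch = buffer.sample(batch_size)              # step 12
            for agent in agents:
                agent.update_critics(batch)                # critic losses, every TS
            if t % C == 0:                                 # delayed updates (steps 14-22)
                for agent in agents:
                    agent.update_actor(batch)              # actor gradient
                    agent.soft_update_targets()            # soft target updates
                update_actor_0(actor_0, agents, batch)     # shared actor pi_0
```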
5 SIMULATION RESULTS
In this section, we present the simulation results of the two multi-agent DRL-based power allocation algorithms (i.e. MDPA and MTPA). The results are simulated in a downlink MIMO-NOMA system, where the BS is located at the centre of the cell and 4 users are randomly distributed in the cell with a radius of 500 m. The specific values of the adopted simulation parameters are summarized in Table 1. If not specified, $T_{\max} = 5000$ TSs, $p_{\max} = 1$ W and all users move randomly with a speed of $v = 1$ m/s. Our multi-agent DRL-based framework for the simulation has five networks (three actor networks and two critic networks), not counting the target networks.
TABLE 1: Parameter settings used in the simulations

Parameter | Value
Framework | PyTorch 1.5
Programming language | Python 3.6
Channel | MIMO-NOMA/AWGN
Fading | Rayleigh distribution
Number of transmit antennas | M = 2
Number of receive antennas | N = 2
Number of users in a cluster | L = 2
Channel bandwidth | B_{m,l} = 10 MHz
Thermal noise density | σ² = −174 dBm/Hz
Path loss model | 114 + 38 log10(d_{m,l})
Minimum required data rate | ξ^min_{m,l} = 10 bps/Hz
The learning rate is set to 0.01, the discount factor $\gamma = 0.9$, the memory capacity $D = 200$, the weight update interval $C = 10$ and the batch size $N_b = 32$. Specifically, every actor network has an additional layer that scales the output to the power bound limitation. The soft update rate $\tau$ in Equations (29), (39) and (40) is set to 0.01, the noise process in Equations (20) and (21) follows a normal distribution $\mathcal{N}(0,1)$, and $\sigma_{\epsilon_m} = \sigma_{\varepsilon_m} = 0.1$ in Equations (32) and (34).
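For reference, the simulation settings of Table 1 and the hyper-parameters listed above can be collected in a single configuration, as in the sketch below; the key names and the kilometre unit assumed for the path-loss argument are illustrative choices.

```python
import math

# Simulation settings of Table 1 and this section, gathered in one place.
CONFIG = {
    "M": 2, "N": 2, "L": 2,                             # antennas and users per cluster
    "bandwidth_hz": 10e6,                               # B_{m,l}
    "noise_density_dbm_per_hz": -174,                   # thermal noise density sigma^2
    "pathloss_db": lambda d: 114 + 38 * math.log10(d),  # d assumed in km
    "xi_min_bps_per_hz": 10,                            # minimum required data rate
    "p_max_w": 1.0, "T_max": 5000, "user_speed_m_per_s": 1.0,
    "learning_rate": 0.01, "gamma": 0.9, "replay_capacity": 200,
    "update_interval_C": 10, "batch_size": 32, "tau": 0.01,
    "exploration_noise_std": 1.0, "target_noise_std": 0.1,
}
```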
In order to compare the performance of the proposed
frameworks, we consider several alternative approaches: (1)
single agent DDPG/TD3-based power allocation methods
(SDPA/STPA), which output the power allocation policy
for every cluster by independent single agents based on
DDPG/TD3 algorithm to maximize the sum EE of the sys-
tem; (2) DQN-based discrete power allocation strategy (QPA),
which uses DQN method to output the allocated power of all
users by quantizing the power into 10 levels between 0 and pmax
in order to fit the input layer of DQN, because the transmit
power is a continuous variable which is infinite and the action-
space of DQN has to be finite; (3) fixed power allocation strat-
egy (FPA), which chooses the action that can maximize the sum
EE with maximum transmit power by exhaustive searching in
each TS to ensure high quality-of-service (QoS) of the system;
(4) multi-agent DDPG/TD3-based power allocation methods for MIMO-OMA systems (MTPA-OMA/MDPA-OMA), where the system model for MIMO-OMA is taken from [14].
We evaluated QPA by exhaustive searching in each TS; because it does not rely on global trajectory data, it easily falls into a local optimum. Moreover, such quantization results in a serious problem, namely the huge action space of possible selected powers. In our QPA scenario with 4 users and a
quantization level k=10, each user can be allocated the power
amount corresponding to one of 10 possible power allocation
choices, then the number of actions that can be selected has
already reached 104=10,000. It should be noted that it is only
a small system, while in future generation wireless networks
with high density of users, the size of action space grows expo-
nentially as the number of users increases. Such a large action
space leads to poor performance, because the DQN agent needs plenty of time to explore the entire action space to find the best power allocation option. Due to the randomly evolving nature of the communication environment, it may also be difficult to choose the option that leads to the best performance. In addition, such quantization discards some useful information, which is essential for finding the optimal power allocation option. This is also the reason why we use the off-policy network based on the actor-critic structure for power allocation tasks.

FIGURE 3: Loss function value of two multi-agent DRL-based frameworks
The convergence complexity of the DRL algorithm depends on the size of the action-state space. For the QPA framework, the number of output actions is $O(k^{L \times M})$, which increases exponentially with the number of users and the number of quantization levels of the power allocation coefficients, whereas for our proposed frameworks the action dimension is $O(M \times (L+1))$. The state space needs some space to be stored; the corresponding space complexity is $O(M \times L)$ for both QPA and the proposed frameworks. Therefore, the proposed algorithms can improve the energy efficiency of MIMO-NOMA systems with low complexity.
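The difference in action-space size can be checked with a one-line calculation for the simulated setting:

```python
# Discrete vs. continuous action-space size for the Section 5 setting (M = L = 2, k = 10).
M, L, k = 2, 2, 10
qpa_action_count = k ** (L * M)        # QPA (DQN): 10^4 = 10,000 discrete joint actions
proposed_action_dim = M * (L + 1)      # proposed frameworks: 6 continuous action dimensions
print(qpa_action_count, proposed_action_dim)   # -> 10000 6
```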
First, we evaluate the convergence of the proposed two multi-agent frameworks under a fixed learning rate of 0.01, which provided the best learning performance in our simulations. We fix the locations of all users to determine how many TSs are required for our frameworks to find the optimal power allocation policy. As shown in Figure 3, the loss function value decreases and tends to a stable value within 200 TSs for both frameworks, and the value is small enough to accurately predict the Q value.
Then, we compare the proposed two multi-agent algorithms
in the random moving system with different power alloca-
tion approaches. It should be noticed that all results are aver-
aged by taking a moving window of 100 TSs to make the
comparison clearer. For the objective of sum EE maximiza-
tion, we investigate the performance of our multi-agent DRL-
based frameworks. Figure 4 shows that the MTPA framework achieves better averaged sum-EE performance than the other approaches.

FIGURE 4: Averaged energy efficiency comparison of different algorithms

DRL-based frameworks are able to dynamically choose the transmit power of all users with continuous control according to their current channel conditions in
every TS. Specifically, the averaged sum EE achieved by the MTPA framework over all 5000 TSs is 21.47% higher than that of the FPA approach, whereas the value for the MDPA framework is 17.3% higher; for the STPA and SDPA frameworks, the values are 15.8% and 12.39% higher, respectively, and for the QPA framework based on the discrete DRL method it is only 11.13% higher. More importantly, as long as the data rate requirements of all users are satisfied, it is unnecessary and energy-wasteful to use maximum power for transmission, since this reduces the EE of the system. This also shows the importance of power allocation for the performance improvement of the NOMA system. We also verify the performance advantage of the proposed algorithms over MIMO-OMA: as shown in Figure 4, the numerical results show that the proposed algorithms retain their energy-efficiency advantage over MIMO-OMA. In our simulation scenario, MTPA achieves 18.04% higher EE than MTPA-OMA and MDPA achieves 16.86% higher EE than MDPA-OMA.
We investigate the averaged sum EE performance of the two frameworks over 5000 TSs under different transmit power limitations. As shown in Figure 5, the averaged results of the DRL-based frameworks grow and tend to stable values as the maximum available power increases, while the FPA approach increases slightly and then keeps dropping, since it always uses full power for signal transmission. This is because, as long as the data rate requirement is satisfied, our algorithms no longer need full power for transmission and can dynamically allocate the power based on the communication conditions to optimize the sum EE no matter how $p_{\max}$ changes. Our proposed power allocation methods based on the deterministic policy gradient with continuous action spaces clearly outperform the discrete DRL-based approach and the conventional methods under most power limitations, and the MTPA framework achieves the best average sum EE performance compared with the other approaches. We also verify the improved performance of the proposed algorithms for MIMO-NOMA over MIMO-OMA.

FIGURE 5: Averaged energy efficiency versus power limitation

FIGURE 6: The variation of the energy efficiency with the minimum required data rate
Figure 6 illustrates how the EE varies with the minimum required data rate $\xi_{m,l}^{\min}$. As shown in Figure 6, our proposed frameworks show outstanding performance compared with the other approaches as well as over MIMO-OMA, even though the EE decreases with $\xi_{m,l}^{\min}$. This situation can be explained by relating $\xi_{m,l}^{\min}$ to the transmit power, i.e. a lower $\xi_{m,l}^{\min}$ has the same influence on the EE as a higher transmit power budget. The obtained results indicate that the average system energy efficiency decreases with increasing $\xi_{m,l}^{\min}$, because as $\xi_{m,l}^{\min}$ increases, the system reaches the infeasibility case more rapidly and becomes unable to satisfy $\xi_{m,l}^{\min}$ for all the system users.
6 CONCLUSION
In this paper, we have studied the dynamic power allocation problem in a downlink MIMO-NOMA system and proposed two multi-agent DRL-based frameworks (i.e. the multi-agent DDPG/TD3-based frameworks) to improve the long-term EE of the MIMO-NOMA system while ensuring the minimum data rates of all users. Every single agent of the two multi-agent frameworks dynamically outputs the power allocation policy for all users in its cluster via the DDPG/TD3 algorithm, the additional actor network adjusts the power volumes allocated to the clusters to improve the overall performance of the MIMO-NOMA system, and both frameworks adjust the entire power allocation policy by updating the weights of the neural networks according to the feedback of the system. Compared with other approaches, our multi-agent DRL-based power allocation frameworks can significantly improve the EE of the MIMO-NOMA system under various transmit power limitations and minimum data rates by adjusting the network parameters. We have also verified the energy-performance advantage of the proposed algorithms over the MIMO-OMA system. In future work, we plan to study the joint subchannel selection and power allocation problem for more practical MIMO-NOMA scenarios by applying more powerful DRL approaches.
REFERENCES
1. Chin, W.H., Fan, Z., Haines, R.: Emerging technologies and research chal-
lenges for 5G wireless networks. IEEE Wireless Commun. 21(2), 106–112
(2014)
2. Ding, Z., et al.: A survey on non-orthogonal multiple access for 5G networks: Research challenges and future trends. IEEE J. Sel. Areas Commun.
35(10), 2181–2195 (2017)
3. Wu, Z., et al.: Comprehensive study and comparison on 5G NOMA
schemes. IEEE Access 6, 18511–18519 (2018)
4. Choi, J.: Effective capacity of NOMA and a suboptimal power control pol-
icy with delay QoS. IEEE Trans. Commun. 65(4), 1849–1858 (2017)
5. Yin, Y., et al.: Dynamic user grouping-based NOMA over Rayleigh fading
channels. IEEE Access 7, 110964–110971 (2019)
6. Gui, G., Sari, H., Biglieri, E.: A new definition of fairness for non-
orthogonal multiple access. IEEE Commun. Lett. 23(7), 1267–1271 (2019)
7. Liu, X., et al.: Spectrum resource optimization for NOMA-based cognitive
radio in 5G communications. IEEE Access 6, 24904–24911 (2018)
8. Muhammed, A.J., et al.: Resource allocation for energy efficiency in
NOMA enhanced small cell networks. In: Proceedings of 2019 11th Inter-
national Conference on Wireless Communications and Signal Processing
(WCSP). Xi’an, China, pp. 1–6 (2019)
9. Ding, Z., et al.: On the performance of non-orthogonal multiple access
in 5G systems with randomly deployed users. IEEE Signal Process. Lett.
21(12), 1501–1505 (2014)
10. Cui, J., Ding, Z., Fan, P.: Power minimization strategies in downlink
MIMO-NOMA systems. In: Proceedings of 2017 IEEE International
Conference on Communications (ICC). Paris, France, pp. 1–6 (2017)
11. Zeng, M., et al.: Energy-efficient power allocation for MIMO-NOMA with
multiple users in a cluster. IEEE Access 6, 5170–5181 (2018)
12. Ding, Z., Adachi, F., Poor, H.V.: The application of MIMO to non-
orthogonal multiple access. IEEE Trans. Wireless Commun. 15(1), 537–
552 (2016)
13. Ding, Z., Schober, R., Poor, H.V.: A general MIMO framework for NOMA
downlink and uplink transmission based on signal alignment. IEEE Trans.
Wireless Commun. 15(6), 4438–4454 (2016)
14. Zeng, M., et al.: Capacity comparison between MIMO-NOMA and
MIMO-OMA with multiple users in a cluster. IEEE J. Sel. Areas Com-
mun. 35(10), 2413–2424 (2017)
15. Do, D.T., Nguyen, S.: Device-to-device transmission modes in NOMA
network with and without wireless power transfer.Comput. Commun. 139,
67–77 (2019)
16. Uddin, M.B., Kader, M.F., Shin, S.: Exploiting NOMA in D2D assisted
full-duplex cooperative relaying. Phys. Commun. 38(100914), 1–10
(2020)
17. Najimi, M.: Energy-efficient resource allocation in D2D communica-
tions for energy harvesting-enabled NOMA-based cellular networks. Int.
J. Commun. Syst. 33(2), 1–13 (2020)
18. Li, R., et al.: Tact: A transfer actor-critic learning framework for energy
saving in cellular radio access networks. IEEE Trans. Wireless Commun.
13(4), 2000–2011 (2014)
19. Luo, J., et al.: A deep learning-based approach to power minimization
in multi-carrier NOMA with SWIPT. IEEE Access 7, 17450–17460
(2019)
20. Jang, H.S., Lee, H., Quek, T.Q.S.: Deep learning-based power control for
non-orthogonal random access. IEEE Commun. Lett. 23(11), 2004–2007
(2019)
21. Adjif, M., Habachi, O., Cances, J.P.: Joint channel selection and power con-
trol for NOMA: A multi-armed bandit approach. In: Proceedings of 2019
IEEE Wireless Communications and Networking Conference Workshop
(WCNCW). Marrakech, Morocco, pp. 1–6 (2019)
22. Wei, Y., et al.: Power allocation in HetNets with hybrid energy supply using
actor-critic reinforcement learning. In: Proceedings of GLOBECOM 2017
- 2017 IEEE Global Communications Conference. Singapore, pp. 1–5
(2017)
23. Hasan, M.K., et al.: The role of deep learning in NOMA for 5G and
beyond communications. In: Proceedings of 2020 International Con-
ference on Artificial Intelligence in Information and Communication
(ICAIIC). Fukuoka, Japan, pp. 303–307 (2020)
24. Zhao, Q., Grace, D.: Transfer learning for QoS aware topology man-
agement in energy efficient 5G cognitive radio networks. In: Proceed-
ings of 1st International Conference on 5G for Ubiquitous Connectivity.
Akaslompolo, Finland, pp. 152–157 (2014)
25. Wei, Y., et al.: User scheduling and resource allocation in HetNets with
hybrid energy supply: An actor-critic reinforcement learning approach.
IEEE Trans. Wireless Commun. 17(1), 680–692 (2017)
26. Liu, M., et al.: Resource allocation for NOMA based heterogeneous IoT
with imperfect SIC: A deep learning method. In: Proceedings of 2018
IEEE 29th Annual International Symposium on Personal, Indoor and
Mobile Radio Communications (PIMRC). Bologna, Italy, pp. 1440–1446
(2018)
27. Gui, G., et al.: Deep learning for an effective non-orthogonal multiple
access scheme. IEEE Trans. Veh. Technol. 67(9), 8440–8450 (2018)
28. Ding, Z., Fan, P., Poor, H.V.: Impact of user pairing on 5G nonorthogonal
multiple-access downlink transmissions. IEEE Trans. Veh. Technol. 65(8),
6010–6023 (2016)
29. Liang, W., et al.: User pairing for downlink non-orthogonal multiple access
networks using matching algorithm. IEEE Trans. Commun. 65(12), 5319–
5332 (2017)
30. Paulraj, A.J., et al.: An overview of MIMO communications - A key to
gigabit wireless. Proc. IEEE 92(2), 198–218 (2004)
31. Hanif, M.F., et al.: A minorization -maximization method for optimizing
sum rate in non-orthogonal multiple access systems. IEEE Trans. Signal
Process. 64(1), 76–88 (2015)
32. Benjebbour, A., Kishiyama, Y.: Combination of NOMA and MIMO: Con-
cept and experimental trials. In: Proceeedings of 25th International Con-
ference on Telecommunications (ICT). St. Malo, France, pp. 433–438
(2018)
33. Islam, S.M.R., et al.: Resource allocation for downlink NOMA systems:
Key techniques and open issues. IEEE Wireless Commun. 25, 40–47
(2017)
34. Manglayev, T., Kizilirmak, R., Kho, Y.H.: Optimum power allocation for
non-orthogonal multiple access (NOMA). In: Proceedings of 25th Inter-
national Conference on Telecommunications (ICT). Baku, Azerbaijan,
pp. 1–4 (2016)
35. Zhang, Y., et al.: Energy-efficient transmission design in non-orthogonal
multiple access. IEEE Trans. Veh. Technol. 66, 2852–2857 (2017)
36. Zhang, J., et al.: Joint subcarrier assignment and downlink-uplink time-
power allocation for wireless powered OFDM-NOMA systems. In:
1654 SONNAM ET AL.
Proceedings of 10th International Conference on Wireless Communica-
tions and Signal Processing (WCSP). Hangzhou, China, pp. 1–7 (2018)
37. Zhao, J., et al.: Joint subchannel and power allocation for NOMA enhanced
D2D communications. IEEE Trans. Commun. 65(11), 5081–5094 (2017)
38. Xiong, Z., et al.: Deep reinforcement learning for mobile 5G and beyond:
Fundamentals, applications, and challenges. IEEE Veh. Technol. Mag.
14(2), 44–52 (2019)
39. Wang, T., et al.: Deep learning for wireless physical layer: Opportunities
and challenges. China Commun. 14(11), 92–111 (2017)
40. Huang, D., et al.: Deep learning based cooperative resource allocation in
5G wireless networks. Mobile Network Appl. 1–8 (2018)
41. Eisen, M., et al.: Learning optimal resource allocations in wireless systems.
IEEE Trans. Signal Process. 67(10), 2775–2790 (2018)
42. Zhang, Y., Wang, X., Xu, Y.: Energy-efficient resource allocation in uplink
NOMA systems with deep reinforcement learning. In: Proceedings of 11th
International Conference on Wireless Communications and Signal Pro-
cessing (WCSP). Xi’an, China, pp. 1–6 (2019)
43. Zhang, H., et al.: Artificial intelligence-based resource allocation in ultra-
dense networks: Applying event-triggered Q-learning algorithms. IEEE
Veh. Technol. Mag. 14(4), 56–63 (2019)
44. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. 2nd
ed., The MIT Press, Cambridge, MA (2018). http://incompleteideas.net/
book/the-book.html
45. Zhang, F., Li, J., Li, Z.: A TD3-based multi-agent deep reinforcement learn-
ing method in mixed cooperation-competition environment. Science 411,
206–215 (2020)
46. Mnih, V., et al.: Human-level control through deep reinforcement learning.
Nature 518(7540), 529–533 (2015)
47. Xiao, L., et al.: Reinforcement learning-based NOMA power allocation in
the presence of smart jamming. IEEE Trans. Veh. Technol. 67(4), 3377–
3389 (2017)
48. Zhang, S., et al.: A dynamic power allocation scheme in power-domain
NOMA using actor-critic reinforcement learning. In: Proceedings of
2018 IEEE/CIC International Conference on Communications in China
(ICCC). Beijing, China, pp. 719–723, (2018)
49. Zhao, N., et al.: Deep reinforcement learning for user association and
resource allocation in heterogeneous networks. In: Proceedings of 2018
IEEE Global Communications Conference (GLOBECOM). Abu Dhabi,
United Arab Emirates, pp. 1–6 (2018)
50. Chu, M., et al.: Reinforcement learning-based multiaccess control and bat-
tery prediction with energy harvesting in IoT systems. IEEE Internet
Things J. 6(2), 2009–2020 (2019)
51. Lillicrap, T., et al.: Continuous control with deep reinforcement learning.
In: Proceedings of 4th International Conference on Learning Representa-
tions. San Juan, Puerto Rico, pp. 1–14 (2016)
52. Grondman, I., et al.: Efficient model learning methods for actor-critic con-
trol. IEEE Trans. Syst. Man Cybern. Cybern. 42(3), 591–602 (2011)
53. Vamvoudakis, K.G., Lewis, F.: Online actor-critic algorithm to solve the
continuous-time infinite horizon optimal control problem. Automatica 46,
878–888 (2010)
54. Sutton, R.S., et al.: Policy gradient methods for reinforcement learning with
function approximation. Adv. Neural Inf. Process. Syst. 12, 1057–1063
(2000)
55. Konda, V., Gao, V.: On actor-critic algorithms. SIAM J. Control Optim.
42(4), 1143–1166 (2000)
56. Silver, D., et al.: Deterministic policy gradient algorithms. In: Proceedings
of 31st International Conference on Machine Learning. Beijing, China, vol.
32, pp. 387–395 (2014)
57. Li, Y., et al.: Energy-aware resource management for uplink non-
orthogonal multiple access: Multi-agent deep reinforcement learning.
Future Gener. Comput. Syst. 105, 684–694 (2020)
58. Van Hasselt, H., Guez, A., Silver, D.: Deep reinforcement learning with
double Q-learning. In: Proceedings of 29th Conference on Artificial Intel-
ligence (AAAI15). Austin, TX, pp. 2094–2100 (2015)
59. Bao, X., et al.: Conservative policy gradient in multi-critic setting. In: Pro-
ceedings of 2019 Chinese Automation Congress (CAC). Hangzhou, China,
pp. 1486–1489 (2019)
60. Fujimoto, S., Hoof, H., Meger, D.: Addressing function approximation
error in actor-critic methods. In: Proceedings of 35th International Con-
ference on Machine Learning. Stockholm, Sweden, pp. 1582–1591 (2018)
61. Dankwa, S., Zheng, W.: Modeling a continuous locomotion behavior of an
intelligent agent using deep reinforcement technique. In: Proceedings of
2019 IEEE 2nd International Conference on Computer and Communica-
tion Engineering Technology-CCET. Beijing, China, pp. 172–175 (2019)
62. Jaderberg, M., et al.: Human-level performance in first-person multiplayer
games with population-based deep reinforcement learning. Science 364,
859–865 (2019)
63. Lowe, R., et al.: Multi-agent actor-critic for mixed cooperative-competitive
environments. In: Proceedings of 31st Conference on Neural Information
Processing Systems. Long Beach, CA, pp. 6382–6393 (2017)
64. Thrun, S., Schwartz, A.: Issues in using function approximation for rein-
forcement learning. In: Proceedings of 4th Connectionist Models. Summer
School Hillsdale, NJ, pp. 1–9 (1993)
How to cite this article: Jo, S., et al.: Multi-agent deep reinforcement learning-based energy efficient power allocation in downlink MIMO-NOMA systems. IET Commun. 15, 1642–1654 (2021). https://doi.org/10.1049/cmu2.12177