Received: 24 October 2020 | Revised: 8 March 2021 | Accepted: 18 March 2021
IET Communications
DOI: 10.1049/cmu2.12177
ORIGINAL RESEARCH PAPER
Multi-agent deep reinforcement learning-based energy efficient
power allocation in downlink MIMO-NOMA systems
Sonnam Jo | Chol Jong | Changsop Pak | Hakchol Ri
Telecommunication Research Center, Kim Il Sung
University, Taesong District, Pyongyang 999093,
Democratic People’s Republic of Korea
Correspondence
Sonnam Jo, Telecommunication Research Center, Kim Il Sung University, Taesong District, Pyongyang 999093, Democratic People’s Republic of Korea.
Email: josonnam@yahoo.com
Abstract
NOMA and MIMO are considered promising technologies for meeting the huge access demands and high data-rate requirements of 5G wireless networks. In this paper, the power allocation problem in a downlink MIMO-NOMA system is investigated with the aim of maximizing energy efficiency while guaranteeing the quality of service of all users. Two deep reinforcement learning-based frameworks, referred to as the multi-agent DDPG- and TD3-based power allocation frameworks, are proposed to solve this non-convex and dynamic optimization problem. In particular, taking the current channel conditions as input, each agent of the two multi-agent frameworks dynamically outputs the optimum power allocation policy for all users in its cluster via the DDPG/TD3 algorithm, and an additional actor network is added to the conventional multi-agent model to adjust the power volumes allocated to the clusters and thereby improve overall system performance. Finally, both frameworks refine the entire power allocation policy by updating the weights of the neural networks according to the feedback of the system. Simulation results show that the proposed multi-agent deep reinforcement learning-based power allocation frameworks can significantly improve the energy efficiency of the MIMO-NOMA system under various transmit power limitations and minimum data rates compared with other approaches, including a performance comparison with MIMO-OMA.
1 INTRODUCTION
Due to the explosive growth of Internet-of-Things-based large-scale heterogeneous networks and the emergence of fifth-generation (5G) wireless networks, communication requirements such as low latency, high data rates and massive connectivity have become increasingly stringent [1]. In order to meet
these stringent communication requirements and provide
high-quality communication services, non-orthogonal mul-
tiple access (NOMA) has been proposed. NOMA is widely
recognized as one of the most promising multiple access
technologies for 5G wireless networks [2]. Different from
conventional orthogonal multiple access (OMA), the multiple
users in NOMA systems can be served by the same frequency
time resources via power-domain multiplexing or code-domain
multiplexing, which can increase the network capacity and
fulfil the requirements of low latency, high throughput and
massive connectivity [3].

This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited.
© 2021 The Authors. IET Communications published by John Wiley & Sons Ltd on behalf of The Institution of Engineering and Technology

Research on NOMA has focused mostly on power allocation [4] and user matching [5], and the objectives are most often to improve user fairness [6], spectral efficiency (SE) [7] and energy efficiency (EE) [8]. The
power-domain NOMA can serve multiple users in the same time slot and has become a strong candidate in the development of 5G by allocating different power levels to different users to achieve multiple access [9]; the EE of a system can thus be improved by this technology. The power-domain
NOMA with the use of superposition coding (SC) at the
transmitter side and successive interference cancellation (SIC)
at the receiver side allows users to share the same resources
by multiplexing multiple users’ signals with different allocated
powers. In power-domain NOMA, power is allocated to users according to their channel gains: higher power is assigned to users with worse channel gains and lower power to users with better channel gains, and the user with the better channel gain then removes the interference from the other users by SIC. For the purpose of further
improving the system performance, NOMA is capable of being
combined with various technologies like multiple-input
multiple-output (MIMO) [10–14], device-to-device communication [15–17], deep learning [18–27] and cognitive radio [28, 29]. The
combination of MIMO and NOMA, which is defined as
MIMO-NOMA, is one of the hottest research areas. MIMO
is a conventional technology where the base station (BS) and
the user both have several antennas which can take advantage
of the parallel independent channel according to the charac-
teristics of space division multiplexing [30]. MIMO-NOMA
transmission can achieve high SE and low latency through
power-domain in NOMA and spatial diversity in MIMO [31].
Some research results show that MIMO-NOMA can further
improve the throughput of communication systems [32]. In this
paper we mainly focus on the issue of power allocation in the
power-domain NOMA with MIMO.
Designing the proper resource allocation algorithm is sig-
nificantly important to improve the SE or EE of the MIMO-
NOMA system [33]. Many model-based power allocation algo-
rithms have been proposed to increase the SE or EE in NOMA
systems [34, 35], and some studies have researched the joint
subchannel assignment and power allocation in order to maximize the EE of multicarrier NOMA systems [36, 37]. However, owing to the dynamics and uncertainty inherent in wireless communication systems, the complete knowledge or mathematical models required by these conventional power allocation approaches are often difficult or even impossible to obtain in practice [38]. Moreover, the high computational complexity of these algorithms makes them inefficient or even inapplicable for future communication networks, and dynamic resource allocation under fast-changing channel conditions poses a further challenge. Optimizing energy consumption through efficient resource allocation is therefore an important research issue for NOMA, and intelligent learning methods have been extensively developed to deal with these challenges. Some studies have used deep learning (DL) as a model-free and data-driven approach to reduce
the computational complexity with available training inputs and
outputs [39]. As one main branch of machine learning (ML), DL
is a useful method that can be applied to overcome the resource
allocation problem in the case of multiple users [21, 24,40]. DL
has been used to solve resource allocation problems by train-
ing the neural networks offline with simulated data first, and
then outputting results with the well-trained networks during
the online process [26, 27,41]. However, obtaining the correct
data set or optimal solutions used for training can be difficult
and the training process itself is actually time-consuming.
In mobile environments the accurate channel information
and the complete model of the environment’s evolution are
unknown, therefore, the model-free reinforcement learning
(RL) method has been used to solve stochastic optimization
problems in wireless networks. RL, as another main branch of
ML, can be used in solving the dynamic resources allocation
problems [18, 22,42]. RL is able to provide the solution for the
sequential decision-making problems where the environment is
unknown and the optimal policy is learned through interactions
with the environment [43, 44]. In RL connecting with wireless
networks, the agent (e.g. the base station or user) observes the
environment (e.g. wireless channel) states and discovers which
actions (e.g. decisions of subchannel assignment or power allo-
cation) yield the most numerical reward (e.g. immediate EE) by
trying them, and finally generate the policy of mapping states to
actions. That is, the agent selects an action for the environment,
and the state (e.g. current channel condition) changes after the
environment accepts the action. At the same time, the reward
is generated and then feedbacks to the agent. Finally, the agent
selects the next action based on the immediate EE and the cur-
rent channel condition in order to ensure energy efficiency in
wireless communications. The goal of RL is to adjust the param-
eters dynamically so as to maximize the reward. Besides, instead
of optimizing current benefits only, RL can generate almost
optimal decision policy which maximizes the long-term perfor-
mance of the system through constant interactions. Therefore,
RL has demonstrated its enormous advantages in many fields.
Most existing works that apply the model-free RL framework either discretise the continuous values of network parameters into a finite set of discrete levels or learn a stochastic policy.
Different from most of existing work that uses RL algorithm,
we consider the deterministic policy gradient-based actor-critic
reinforcement learning algorithm to solve power allocation opti-
mization problem with continuous-valued actions and states in
MIMO-NOMA systems.
In this paper, we investigate the dynamic power allo-
cation problem with multi-agent DRL-based methods in a
downlink MIMO-NOMA system. Motivated by the aforemen-
tioned considerations, we propose two multi-agent DRL-based
frameworks (i.e. multi-agent DDPG/TD3-based framework) to
improve the long-term EE of the MIMO-NOMA system while
ensuring minimum data rates of all users. We construct DRL
networks on the basis of the multi-agent model [45], including
the additional actor network which does not have its own critic
network and is updated by the combined influence of all the agents’ critic networks. We refer to these two frameworks as the multi-agent DDPG-based power allocation (MDPA) framework and the multi-agent TD3-based power allocation (MTPA) framework.
The main contributions of this paper are summarized as fol-
lows:
∙We develop two model-free DRL-based power allocation
frameworks to solve the power allocation problem in a down-
link MIMO-NOMA system. They are multi-agent model
frameworks based on DDPG and TD3 algorithms which are
on the basis of the deterministic policy gradient methods with
continuous action spaces, since the stochastic policy gradient
methods cannot properly solve the dynamics and uncertainty
that are inherent in wireless communication systems.
∙For our multi-agent model frameworks, every single agent
dynamically outputs the power allocation policy for all users
in every cluster of the MIMO-NOMA system, and we
also add the additional actor network to the conventional
multi-agent model in order to adjust power volumes allocated
to clusters to improve overall performance of the MIMO-
NOMA system. To the best of our knowledge, no such
method for power allocation in MIMO-NOMA systems has
been studied in the existing literature.
∙We provide the performance analysis of proposed power allo-
cation frameworks based on multi-agent deterministic pol-
icy gradient in two-user scenario in a cluster of the MIMO-
NOMA system and compare the EE of the proposed frame-
works under various power limitations and minimum data
rates with single agent DRL-based power allocations, the dis-
crete DRL-based one and the fixed power allocation strategy.
We also verify the advantage of energy performance of our
proposed frameworks over that of the MIMO-OMA system.
The simulation results show that the TD3-based framework achieves the best performance.
The rest of this paper is organized as follows. The related work is reviewed in Section 2. Section 3 introduces the system model and problem formulation of a downlink MIMO-NOMA system. In Section 4, we propose two multi-agent DRL-based algorithms to solve the dynamic power allocation problem of the MIMO-NOMA system. The simulation results are discussed in Section 5, and we conclude this study in Section 6.
2 RELATED WORK
The conventional RL algorithms suffer from slow convergence
speed and become less efficient for problems with large state
and action spaces. In order to overcome these issues, deep reinforcement learning (DRL), which combines deep learning with RL, has been proposed. One famous algorithm of DRL,
named deep Q-learning [46] that is one of the off-policy meth-
ods, uses a deep Q network (DQN) which applies deep neural
networks as function approximators to conventional RL. DRL
has already been used in many aspects such as power control in
NOMA systems [47, 48], resource allocation in heterogeneous
networks [25, 49] and the Internet of Things (IoT) [50]. DQN uses a replay buffer to store tuples of historical samples, which stabilizes training and makes efficient use of hardware optimization; mini-batches are randomly drawn from the replay buffer to update the weights of the networks during training. However, the main drawback of DQN is that the output
decision can only be discrete, which brings quantization error
for continuous action tasks. Besides, the output dimension of
DQN increases exponentially for multi-action and joint optimization tasks. Some works that introduced the model-free RL framework to solve stochastic optimization problems typically discretise the continuous variables of the studied scenario into a finite set of discrete values (levels), such as quantized energy or power levels. However, such methods destroy the completeness of the continuous space and introduce quantization noise, and are thus incapable of finding the true optimal policy. RL was initially studied only with discrete action spaces,
however, practical problems sometimes require control actions
in a continuous action space [51]. Concerning our problem with
energy efficiency, both the environment state (i.e. wireless chan-
nel condition) and the action (i.e. transmission power) have con-
tinuous spaces. For problems with continuous-valued actions, the policy gradient methods [52] are very effective and, in particular, can learn stochastic policies. In order to overcome
the inefficiency and high variance of evaluating a policy in the
policy gradient methods, the actor-critic learning algorithm is
introduced to solve the optimal decision-making problems with
an infinite horizon [53]. Actor-critic algorithm is a well-known
architecture based on policy gradient theorem which allows
applications in continuous space [54]. In the actor-critic method
[55], the actor is used to generate stochastic actions whereas the
critic is used to estimate the value function and criticizes the pol-
icy that is guiding the actor, and the policy gradient evaluated by
the critic process is used to update policy.
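The replay-buffer mechanism described above can be sketched in a few lines of Python; the capacity, batch size and dummy transitions here are illustrative, not values from the paper:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of (state, action, reward, next_state) transitions."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)   # oldest samples are dropped first

    def store(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # uniform random mini-batch, breaking the temporal correlation of samples
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

# fill the buffer with dummy transitions, then draw one training mini-batch
buf = ReplayBuffer(capacity=100)
for t in range(150):
    buf.store(t, 0.5, 1.0, t + 1)
batch = buf.sample(32)
```

Because sampling is uniform over the stored history, the mini-batches used for the weight updates are approximately independent, which is what stabilizes DQN (and, later, DDPG/TD3) training.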
In [51], authors developed deep deterministic policy gradient
(DDPG) method to improve the trainability of the stochastic
policy gradient method with continuous action space problems.
DDPG is an enhanced version of the deterministic policy gradi-
ent (DPG) algorithm [56] based on the actor-critic architecture,
which uses an actor network to output a deterministic action
and a critic network to evaluate the action [46]. DDPG takes
the advantages of experience replay buffer and target network
strategies from DQN to improve learning stability. By using
the combination of actor-critic algorithm, replay buffer, target
networks and batch normalization, DDPG is able to perform
continuous control, while the training efficiency and stability
are enhanced from original actor-critic network [51]. These fea-
tures make DDPG more efficient for dynamic resource alloca-
tion problems with continuous action space. They also managed
to make it an ideal candidate for application in industrial settings
[42, 57]. A concern, however, is that the overestimation bias of the critic can pass to the policy through the policy gradient in DDPG; moreover, if the actor is updated slowly, the target critics are too similar to the current critics to avoid overestimation bias. Many algorithms have been proposed to address the overestimation bias in Q-learning. Double DQN [58] addresses the overestima-
tion problem by using two independent Q functions. Neverthe-
less, it is difficult to find appropriate weights for different tasks
or environments. The most common failure mode of DDPG is that the learned Q-function begins by overestimating Q-values and eventually the policy breaks down because it exploits the errors in the Q-function. Over-
estimation bias is a property of Q-learning which is caused by
maximization of a noisy value estimate. The value target esti-
mate depends on the target actor and target critic. As the tar-
get actor is constantly updating, the next action used in the
one-step value target estimate is also changing. Therefore, the
update of target actor may cause a big change in target value
estimate, which would cause instability in critic training [59]. In
order to reduce overestimation bias problems, the authors in
[60] extended DDPG to twin delayed deep deterministic policy
gradient algorithm (TD3), which estimates the target Q value
by using the minimum of two target Q values, called clipped double Q-learning. Together with the successful work on DQN, the DPG and DDPG algorithms paved the way to TD3 [60, 61]. TD3
adopts two critics to get a less optimistic estimate of an action
value by taking the minimum between two estimates. In TD3,
there is just one actor which is optimized with respect to the
smaller of two critics. In order to ensure the TD-error remains
small, the actor is updated at a lower frequency than the critic,
which results in higher quality policy updates in practice [59].
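The clipped double-Q target just described can be sketched as follows. The networks are replaced by toy stand-in functions, and all names and numbers are illustrative assumptions rather than the paper's implementation; in a full TD3 loop the actor would additionally be updated less often than the two critics:

```python
import random

def td3_target(reward, next_state, gamma, target_actor, target_q1, target_q2,
               noise_std=0.2, noise_clip=0.5):
    """Clipped double-Q target of TD3: y = r + gamma * min(Q1', Q2')."""
    # target policy smoothing: perturb the target action with clipped noise
    noise = max(-noise_clip, min(noise_clip, random.gauss(0.0, noise_std)))
    a_next = target_actor(next_state) + noise
    # clipped double Q-learning: take the smaller of the two target critics
    q_min = min(target_q1(next_state, a_next), target_q2(next_state, a_next))
    return reward + gamma * q_min

# toy stand-ins for the target networks (illustrative only)
actor = lambda s: 0.5 * s
q1 = lambda s, a: s + a            # one target critic
q2 = lambda s, a: s + a + 0.3      # the other target critic, deliberately higher

y = td3_target(reward=1.0, next_state=2.0, gamma=0.9,
               target_actor=actor, target_q1=q1, target_q2=q2, noise_std=0.0)
# with zero noise: a' = 1.0, min(Q1, Q2) = min(3.0, 3.3) = 3.0, y = 1.0 + 0.9 * 3.0 = 3.7
```

Taking the minimum of the two estimates means the more optimistic critic never sets the target, which is exactly how TD3 curbs the overestimation bias discussed above.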
Since we focus on the issue of power allocation in MIMO-
NOMA systems which can be generally comprised of several
clusters according to users’ propagation channel conditions, we
also investigate the multi-agent RL methods, where single agent
can output the policy for one cluster of the MIMO-NOMA sys-
tem. With the development of DRL, single-agent algorithms have matured and have already addressed plenty of difficult problems, giving researchers more powerful instruments for studying high-dimensional and more complex action spaces. Through further study of single-agent RL, researchers realized that introducing multiple agents can achieve higher performance in complex mission environments [62]. The multi-
agent DDPG algorithm that extends the DDPG algorithm to
the multi-agent domain is proposed for mixed cooperative-
competitive environments in [63], although it did not give further consideration to joint overestimation in multi-agent environments. In [45], the multi-agent TD3 algorithm is proposed to reduce the overestimation error of the critic networks, and the authors showed that it outperforms the multi-agent DDPG algorithm in complex mission environments. These ideas extend naturally to other multi-agent RL algorithms.
3 SYSTEM MODEL AND PROBLEM FORMULATION
3.1 System model
In this paper we consider a downlink multi-user MIMO-NOMA system, in which a BS equipped with $M$ antennas sends data to multiple receivers, each equipped with $N$ antennas. The total number of users in the system is $M \times L$, and they are randomly grouped into $M$ clusters with $L$ ($L \ge 2$) users per cluster. NOMA is applied among the users in the same cluster. For the signal to the $l$-th user in cluster $m$, denoted by UE$_{m,l}$, the BS precodes the superposed signal $x_m = \sum_{l=1}^{L} \sqrt{p_{m,l}}\, x_{m,l}$ using a transmit beamforming matrix $\mathbf{F}$, where $p_{m,l}$ is the power allocated to the $l$-th user in cluster $m$ and $x_{m,l}$ denotes the normalized transmit signal. We consider the composite channel model with both Rayleigh fading and large-scale path loss. In particular,
the channel matrix $\mathbf{H}_{m,l}$ from the BS to UE$_{m,l}$ can be represented as [13]:

$$\mathbf{H}_{m,l} = \frac{\mathbf{G}_{m,l}}{L(d_{m,l})}, \tag{1}$$

$$L(d_{m,l}) = \begin{cases} d_{m,l}^{\zeta}, & d_{m,l} > d_0 \\ d_0^{\zeta}, & d_{m,l} \le d_0 \end{cases}, \tag{2}$$

where $\mathbf{G}_{m,l} \in \mathbb{C}^{N \times M}$ represents the Rayleigh fading channel gain, $L(d_{m,l})$ denotes the path loss of UE$_{m,l}$ located at a distance $d_{m,l}$ (km) from the BS, assumed to be the same at each receive antenna, $d_0$ is the reference distance according to cell size, and $\zeta$ denotes the path loss exponent.
The precoding matrix used at the BS is denoted as $\mathbf{F} \in \mathbb{C}^{M \times M}$, which implies that antenna $m$ serves cluster $m$ with the power $p_m = \sum_{l=1}^{L} p_{m,l}$ for any $m$. At the receive side, UE$_{m,l}$ employs the unit-norm receive detection vector $\mathbf{v}_{m,l} \in \mathbb{C}^{N \times 1}$ to suppress the inter-cluster interference. In order to completely remove inter-cluster interference, the precoding and detection matrices need to satisfy the following constraints: (1) $\mathbf{F} = \mathbf{I}_M$, with $\mathbf{I}_M$ denoting the $M \times M$ identity matrix; (2) $\|\mathbf{v}_{m,l}\|^2 = 1$ and $\mathbf{v}_{m,l}^H \mathbf{H}_{m,l} \mathbf{f}_k = 0$ for any $k \ne m$, where $\mathbf{f}_k$ is the $k$-th column of $\mathbf{F}$ [12]. It should be noted that the number of antennas should satisfy $N \ge M$ to make this feasible. Because of the zero-forcing (ZF)-based detection design, the inter-cluster interference can be removed even when there exist multiple users in a cluster. Note that only a scalar value $|\mathbf{v}_{m,l}^H \mathbf{H}_{m,l} \mathbf{f}_m|^2$ needs to be fed back to the BS from UE$_{m,l}$. For the considered MIMO-NOMA scheme, the BS multiplexes the intended signals for all users at the same frequency and time resource. Therefore, the corresponding transmitted signal from the BS can be expressed as:

$$\mathbf{s} = \mathbf{F}\mathbf{x}, \tag{3}$$
where the information-bearing vector $\mathbf{x} \in \mathbb{C}^{M \times 1}$ can be further written as:

$$\mathbf{x} = \begin{bmatrix} \sqrt{p_{1,1}}\, x_{1,1} + \cdots + \sqrt{p_{1,L}}\, x_{1,L} \\ \vdots \\ \sqrt{p_{M,1}}\, x_{M,1} + \cdots + \sqrt{p_{M,L}}\, x_{M,L} \end{bmatrix}, \tag{4}$$

where $\sum_{m=1}^{M} \sum_{l=1}^{L} p_{m,l} \le p_{\max}$, with $p_{\max}$ denoting the total transmit power at the BS. Accordingly, the observed signal at UE$_{m,l}$ is
given by
$$\mathbf{y}_{m,l} = \mathbf{H}_{m,l}\mathbf{s} + \mathbf{n}_{m,l}, \tag{5}$$

where $\mathbf{n}_{m,l} \sim \mathcal{CN}(0, \sigma^2 \mathbf{I})$ is the independent and identically distributed (i.i.d.) additive white Gaussian noise (AWGN) vector. By applying the detection vector $\mathbf{v}_{m,l}$ to the observed signal, Equation (5) can be expressed as [10]:

$$\mathbf{v}_{m,l}^H \mathbf{y}_{m,l} = \mathbf{v}_{m,l}^H \mathbf{H}_{m,l} \mathbf{f}_m \sum_{l'=1}^{L} \sqrt{p_{m,l'}}\, x_{m,l'} + \sum_{k=1,\, k \ne m}^{M} \mathbf{v}_{m,l}^H \mathbf{H}_{m,l} \mathbf{f}_k x_k + \mathbf{v}_{m,l}^H \mathbf{n}_{m,l}, \tag{6}$$
where the second term is the interference from other clusters, and $x_k$ denotes the $k$-th element of $\mathbf{x}$. Owing to the constraint¹ on the detection vector, i.e. $\mathbf{v}_{m,l}^H \mathbf{H}_{m,l} \mathbf{f}_k = 0$ for any $k \ne m$, Equation (6) can be simplified as:

$$\mathbf{v}_{m,l}^H \mathbf{y}_{m,l} = \mathbf{v}_{m,l}^H \mathbf{H}_{m,l} \mathbf{f}_m \sum_{l'=1}^{L} \sqrt{p_{m,l'}}\, x_{m,l'} + \mathbf{v}_{m,l}^H \mathbf{n}_{m,l}. \tag{7}$$

¹ Due to the specific selection of $\mathbf{F}$, this constraint reduces to $\mathbf{v}_{m,l}^H \tilde{\mathbf{H}}_{m,l} = 0$, where $\tilde{\mathbf{H}}_{m,l} = [\mathbf{h}_{1,ml} \cdots \mathbf{h}_{m-1,ml}\ \mathbf{h}_{m+1,ml} \cdots \mathbf{h}_{M,ml}]$ and $\mathbf{h}_{i,ml}$ is the $i$-th column of $\mathbf{H}_{m,l}$. Therefore, $\mathbf{v}_{m,l}$ can be expressed as $\mathbf{U}_{m,l}\mathbf{w}_{m,l}$, where $\mathbf{U}_{m,l}$ is the matrix consisting of the left singular vectors of $\tilde{\mathbf{H}}_{m,l}$ corresponding to zero singular values, and $\mathbf{w}_{m,l}$ is the maximum ratio combining vector $\mathbf{U}_{m,l}^H \mathbf{h}_{m,ml} / \|\mathbf{U}_{m,l}^H \mathbf{h}_{m,ml}\|$ [11, 12].
Without loss of generality, the effective channel gains are ordered as:

$$|\mathbf{v}_{m,1}^H \mathbf{H}_{m,1} \mathbf{f}_m|^2 \ge \cdots \ge |\mathbf{v}_{m,L}^H \mathbf{H}_{m,L} \mathbf{f}_m|^2. \tag{8}$$
At the receiver, each user conducts SIC to remove the interference from the users with worse channel gains, i.e. the interference from UE$_{m,l+1}, \ldots,$ UE$_{m,L}$ is removed by UE$_{m,l}$. As a result, the achieved data rate at UE$_{m,l}$ is given by

$$\xi_{m,l} = B_{m,l} \log_2\!\left(1 + \frac{p_{m,l}\,|\mathbf{v}_{m,l}^H \mathbf{H}_{m,l} \mathbf{f}_m|^2}{\sum_{k=1}^{l-1} p_{m,k}\,|\mathbf{v}_{m,l}^H \mathbf{H}_{m,l} \mathbf{f}_m|^2 + \sigma^2}\right), \tag{9}$$
where $B_{m,l}$ is the bandwidth assigned to UE$_{m,l}$.
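Equation (9) can be checked with a small numeric sketch. The effective channel gain, bandwidth and noise power below are illustrative placeholders (for simplicity both users are given the same effective gain); with SIC, user $l$ sees residual interference only from the stronger-channel users $1, \ldots, l-1$:

```python
import math

def user_rate(l, powers, gain, bandwidth, noise_power):
    """Achievable rate of user l (1-indexed) in one cluster, as in Eq. (9):
    user l cancels the signals of weaker-channel users l+1..L via SIC,
    so only users 1..l-1 remain as interference."""
    signal = powers[l - 1] * gain
    interference = sum(powers[k] * gain for k in range(l - 1))
    sinr = signal / (interference + noise_power)
    return bandwidth * math.log2(1.0 + sinr)

# two-user cluster: user 1 (better channel, less power), user 2 (worse, more power)
powers = [0.2, 0.8]   # watts, illustrative
gain = 1.0            # |v^H H f|^2, illustrative effective channel gain
rate1 = user_rate(1, powers, gain, bandwidth=1.0, noise_power=0.1)  # no interference
rate2 = user_rate(2, powers, gain, bandwidth=1.0, noise_power=0.1)  # sees user 1's 0.2 W
```

Note that the power ordering matches constraint C4 of the optimization problem in Section 3.2: the user with the worse channel gain receives the larger share of the cluster power.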
3.2 Problem formulation
The total power consumption at the BS comprises two parts: the fixed circuit power consumption $p_c$ and the flexible transmit power $\sum_{m=1}^{M} \sum_{l=1}^{L} p_{m,l} \le p_{\max}$. In this work, we optimize the power allocation coefficient $q_{m,l}$ instead of the power volume $p_{m,l}$, i.e. $q_{m,l} = p_{m,l}/p_m$ is the power allocation coefficient for UE$_{m,l}$ in cluster $m$, where $p_m$ is the power allocated to cluster $m$ in a cell. Thus, Equation (9) can be rewritten as:

$$\xi_{m,l} = B_{m,l} \log_2\!\left(1 + \frac{p_m q_{m,l}\,|\mathbf{v}_{m,l}^H \mathbf{H}_{m,l} \mathbf{f}_m|^2}{\sum_{k=1}^{l-1} p_m q_{m,k}\,|\mathbf{v}_{m,l}^H \mathbf{H}_{m,l} \mathbf{f}_m|^2 + \sigma^2}\right). \tag{10}$$
Similar to [11, 34], we define the EE (energy efficiency) of the system in cluster $m$ as:

$$\eta_{EE}^{m} = \frac{\xi_m}{p_c + p_m}, \tag{11}$$

where $\xi_m = \sum_{l=1}^{L} \xi_{m,l}$ denotes the achievable sum rate in cluster $m$. We aim to maximize the EE of the MIMO-NOMA system when each user has a predefined minimum rate $\xi_{m,l}^{\min}$. Thus, our considered problem can be formulated as:
$$\mathrm{P1}: \max_{p_m,\, q_{m,l}} \sum_{m=1}^{M} \eta_{EE}^{m}, \tag{12}$$

s.t.
C1: $\xi_{m,l} \ge \xi_{m,l}^{\min},\ m \in \{1,\ldots,M\},\ l \in \{1,\ldots,L\}$,
C2: $\sum_{m=1}^{M} p_m \le p_{\max}$,
C3: $\sum_{l=1}^{L} q_{m,l} \le 1$,
C4: $q_{m,i} \le q_{m,j},\ i \le j,\ i,j \in \{1,\ldots,L\}$,
C5: $p_m \ge 0,\ q_{m,l} \in [0,1],\ \forall m, l$,

FIGURE 1 The architecture of wireless network with small-cells and reinforcement learning formulation
where C1 represents the users’ minimum rate requirements, C2 and C3 respectively represent the transmit power constraints in a cell and in a cluster, C4 indicates that a user with a worse channel condition is allocated more power, and C5 gives the inherent constraints on $p_m$ and $q_{m,l}$.
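To make the structure of P1 concrete, the following sketch evaluates the per-cluster objective of Equation (11) and checks constraints C1-C5 for one candidate allocation; all numbers are illustrative placeholders, not values from the paper:

```python
def cluster_ee(rates, p_cluster, p_circuit):
    """Energy efficiency of one cluster, Eq. (11): sum rate over total power."""
    return sum(rates) / (p_circuit + p_cluster)

def feasible(p, q, rates, rate_min, p_max):
    """Check constraints C1-C5 of problem P1 for a candidate allocation
    (rates/rate_min are per-user lists for a single cluster here)."""
    c1 = all(r >= r_min for r, r_min in zip(rates, rate_min))              # C1: min rate
    c2 = sum(p) <= p_max                                                   # C2: cell power
    c3 = all(sum(qm) <= 1.0 for qm in q)                                   # C3: coefficients
    c4 = all(qm[i] <= qm[i + 1] for qm in q for i in range(len(qm) - 1))   # C4: ordering
    c5 = all(pm >= 0.0 for pm in p) and all(0.0 <= c <= 1.0 for qm in q for c in qm)
    return c1 and c2 and c3 and c4 and c5

# one cluster, two users, illustrative numbers
p, q = [1.0], [[0.2, 0.8]]
rates = [1.58, 1.87]
ok = feasible(p, q, rates, rate_min=[0.5, 0.5], p_max=2.0)
ee = cluster_ee(rates, p_cluster=p[0], p_circuit=0.5)   # (1.58 + 1.87) / 1.5 = 2.3
```

Checking feasibility is cheap; the hard part of P1 is the non-convex search over $(p_m, q_{m,l})$ under evolving channels, which is exactly what the DRL frameworks of Section 4 address.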
This optimization problem is non-convex and NP-hard, and
the global optimal solution is usually difficult to obtain in prac-
tice due to the high computational complexity and the randomly
evolving channel conditions. More importantly, the conven-
tional model-based approaches can hardly satisfy the require-
ments of future wireless communication services. Therefore, we
propose two DRL-based frameworks in the following sections
to deal with this problem.
4 DYNAMIC POWER ALLOCATION WITH DRL
4.1 Deep reinforcement learning
formulation for MIMO-NOMA systems
In this section, the optimization problem of power allocation
in a downlink MIMO-NOMA system is modelled as a rein-
forcement learning task, which consists of an agent and envi-
ronment interacting with each other. In Figure 1, we describe
the architecture of wireless network with small-cells which are
ultra-densely deployed and have the same number of antennas as user handsets, or even less. As shown in Figure 1, the
base station is treated as the agent and the wireless channel
of the MIMO-NOMA system is the environment. The action,
transmit power from the BS to users, taken by the DRL con-
troller (at the BS) is based on the state which is the collec-
tive information on channel condition from users. Then at
FIGURE 2 Multi-agent DRL-based MIMO-NOMA power allocation
system model
each step, based on the observed state of the environment,
the agent performs an action from the action space to allo-
cate power to users according to the power allocation policy,
where the policy is learned by the DRL algorithms. With the
obtained transmit power, the optimal power allocation is con-
ducted and the step reward can be computed and fed back to
the agent. This reward is the energy efficiency of the MIMO-
NOMA system. The agent can perform actions, such as power
allocation, optimized for a given channel information through
a repetitive learning of the selection process that maximizes the
reward.
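The interaction loop just described can be sketched as follows. The environment and policy here are toy stand-ins, with the reward playing the role of the per-slot EE; none of this is the paper's simulator:

```python
import random

def policy(state):
    """Stand-in for the learned power-allocation policy (illustrative)."""
    return min(1.0, 0.5 + state)

def env_step(state, action):
    """Toy environment: the reward stands in for the EE achieved in this
    time slot, and the next state for the new channel condition."""
    reward = action * state
    next_state = random.random()
    return reward, next_state

random.seed(0)
state = random.random()          # initially observed channel state
total_reward = 0.0
for t in range(10):              # repeated observe-act-feedback loop
    action = policy(state)       # agent maps channel state to transmit power
    reward, state = env_step(state, action)
    total_reward += reward       # feedback used to improve the policy over time
```

In the actual frameworks the `policy` function is a trained actor network and `env_step` is the MIMO-NOMA channel plus the EE computation of Equation (11).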
In this paper, we propose two multi-agent DRL-based frame-
works (i.e. MDPA and MTPA) for the downlink MIMO-
NOMA system to derive the power allocation decision, which
are based on the DDPG and TD3 algorithms, respectively. The multi-agent DRL-based network comprises $M$ pairs of actor and critic networks, where each pair outputs the power allocation coefficients for all users in a cluster, as well as one additional actor network which decides the power volume allocated
to every cluster. In Figure 2, we show the multi-agent DRL-based power allocation network mechanism with $M$ clusters in the MIMO-NOMA system (the target networks used for soft weight updates are not shown), where actor network $\mu_m$ ($m = 1, \ldots, M$) decides the power allocation coefficients for all users in cluster $m$ and actor network $\mu_0$ decides the power volume allocated to every cluster. Compared with the mechanisms in previous works [42, 48, 57], we not only adopt the off-policy multi-agent method in accordance with the properties of MIMO-NOMA systems, but also create an additional actor network (actor $\mu_0$ in Figure 2) which is updated in cooperation with the other agents’ critic networks to adjust the power volumes allocated to the clusters, improving the overall performance of the MIMO-NOMA system.
In order to better illustrate our algorithm, we first briefly introduce the background of DRL. A general RL model consists of four parts: the state space $\mathcal{S}$, the action space $\mathcal{A}$, the immediate reward $\mathcal{R}$ and the transition probability $\mathcal{P}_{ss'}$ between the current state and the next state. In every time slot (TS) $t$, one or several agents take an action $a_t$ (e.g. decisions of power allocation) under the current state $s_t$ (e.g. the current channel condition), then receive an immediate reward $r_t$ (e.g. the immediate EE) and a new state $s_{t+1}$, which we define as follows [42, 57].
States: The state $s_t \in \mathcal{S}$ in TS $t$ is defined as the current channel gains of all users:

$$s_t = \left(s_t^1, \ldots, s_t^M\right), \tag{13a}$$

$$s_t^m = \left(h_{m,1}(t), \ldots, h_{m,L}(t)\right), \tag{13b}$$

where $s_t^m$ ($m = 1, \ldots, M$) represents the current channel gains of all users in cluster $m$, and $h_{m,l}(t)$ represents the channel gain between the BS and user UE$_{m,l}$ in TS $t$. They are assumed to be obtained at the beginning of the TS. The state space complexity is related to the number of UEs in the system. Since the total number of users in the system is $M \times L$, grouped randomly into $M$ clusters with $L$ users per cluster, the space complexity of $s_t$ is $O(M \times L)$ and that of $s_t^m$ is $O(L)$.
Actions: The action space should contain all power allocation decisions, so the action $a_t$ of TS $t$ is

$$a_t = \left(a_t^1, \ldots, a_t^M\right), \tag{14a}$$

$$a_t^m = \left(p_t^m, q_t^m\right), \tag{14b}$$

$$q_t^m = \left(q_{m,1}(t), \ldots, q_{m,L}(t)\right), \tag{14c}$$

where $p_t^m \in \mathbb{R}^1$ is the power volume allocated to cluster $m$ in a cell, $q_t^m \in \mathbb{R}^L$ is the vector of power allocation coefficients for all users in cluster $m$ and $q_{m,l}(t) \in \mathbb{R}^1$ is the power allocation coefficient for UE$_{m,l}$ in TS $t$. The power allocation decision network outputs a power allocation coefficient for every user and a power volume for every cluster; hence, each user receives its own coefficient's portion of the power volume allocated to the corresponding cluster. Therefore, the action space complexity of the power allocation network is $O(M \times (L + 1))$. The transmit power is a continuous variable with infinitely many possible values, so we develop the MDPA network and the MTPA network to solve the aforementioned problems.
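The state and action structures of Equations (13) and (14) can be sketched with array shapes. The dimensions, random values and the normalization step below are illustrative; in the paper these quantities are produced by the actor networks rather than sampled:

```python
import numpy as np

M, L = 4, 2   # illustrative: M clusters, L users per cluster

# state, Eq. (13): current channel gains of all M*L users; s_t[m] is cluster m's state
rng = np.random.default_rng(0)
s_t = np.abs(rng.standard_normal((M, L)))

# action, Eq. (14): one power volume per cluster plus L coefficients per cluster
p_t = rng.uniform(0.0, 1.0, size=M)           # p_t^m for each cluster
q_t = rng.uniform(0.0, 1.0, size=(M, L))
q_t = q_t / q_t.sum(axis=1, keepdims=True)    # enforce sum_l q_{m,l} <= 1 (C3)

# the power actually allocated to UE_{m,l} is p_t^m * q_{m,l}
per_user_power = p_t[:, None] * q_t

# action-space size per time slot: M * (L + 1) continuous values
action_dim = p_t.size + q_t.size
```

This makes the $O(M \times L)$ state and $O(M \times (L+1))$ action complexities stated above directly visible as array sizes.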
Rewards: We use the EE of the MIMO-NOMA system in cluster $m$ to represent the immediate reward $r_t^m \in \mathcal{R}$ returned after choosing action $a_t$ in state $s_t$, that is

$$r_t^m = \eta_{EE}^{m}. \tag{15}$$

We aim to maximize the long-term cumulative discounted reward defined as:

$$R_t^m = \sum_{i=0}^{\infty} \gamma^i r_{t+i}^m, \tag{16}$$

with discount factor $\gamma \in (0, 1)$.
In order to achieve this goal, a policy $\pi_m$ for cluster $m$, defined as a function mapping from states to actions ($\pi_m : \mathcal{S} \to \mathcal{A}$), is needed. The policy $\pi_m$ acts as a guidance, i.e. it tells the agent which $a_t^m$ should be taken in a specific $s_t^m$ to achieve the expected $R_t^m$; maximizing Equation (16) is thus equivalent to finding the optimal policy, represented as $\pi_m^*$. For a typical RL problem, the Q-value function [43], which describes the expected cumulative reward $R_t^m$ of starting from $s_t^m$, performing action $a_t^m$ and thereafter following policy $\pi_m$, is instrumental in solving RL problems, and is defined as:

$$Q^{\pi_m}\!\left(s_t^m, a_t^m\right) = \mathbb{E}\!\left[R_t^m \mid s_t^m, a_t^m\right] = \mathbb{E}\!\left[\sum_{i=0}^{\infty} \gamma^i r_{t+i}^m \,\Big|\, s_t^m, a_t^m\right] = \mathbb{E}\!\left[r_t^m + \gamma Q^{\pi_m}\!\left(s_{t+1}^m, a_{t+1}^m\right) \,\Big|\, s_t^m, a_t^m\right]. \tag{17}$$

The optimal policy $\pi_m^*$ which maximizes Equation (16) also maximizes Equation (17) for all states and actions in cluster $m$. The corresponding optimal Q-value function following $\pi_m^*$ is given as:

$$Q^{\pi_m^*}\!\left(s_t^m, a_t^m\right) = \mathbb{E}\!\left[r_t^m + \gamma \max_{a_{t+1}^m} Q^{\pi_m^*}\!\left(s_{t+1}^m, a_{t+1}^m\right) \,\Big|\, s_t^m, a_t^m\right]. \tag{18}$$

Once we have the optimal Q-value function of Equation (18), the agent knows how to select actions optimally.
4.2 Multi-agent DDPG-based power
allocation (MDPA) network
We use the off-policy multi-agent DDPG network based on the actor-critic structure to solve the dynamic power allocation problem, where the actor part is an enhanced DPG network and the critic part is a DQN. As mentioned before, the centralized controller receives the channel gains of all users as $s_t = \{s_t^1, \ldots, s_t^M\}$ at the beginning of each TS. With input $s_t$, every DNN named actor network with weights $\mu_k$ ($k = 0, 1, \ldots, M$) outputs a deterministic action rather than a stochastic probability over actions, which removes the further sampling and integration operations required in other actor-critic-based methods [46], that is

$$p_t = \{p_t^1, \ldots, p_t^M\} = \pi_0(s_t; \mu_0), \quad p_t^m = \pi_0^{(m)}(s_t; \mu_0), \tag{19a}$$

$$q_t^m = \pi_m(s_t^m; \mu_m) \quad (m = 1, 2, \ldots, M), \tag{19b}$$

where $\pi_0$ is the policy corresponding to Actor $\mu_0$, which decides the power volume allocated to every cluster in a cell, $\pi_0^{(m)}$ denotes the $m$th element of $\pi_0(s_t; \mu_0)$, and the policy $\pi_m$ decides the power allocation coefficients for all users in cluster $m$.
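To make the two-level actor structure of Equations (19a) and (19b) concrete, the sketch below builds tiny stand-in networks in NumPy. The hidden-layer size, weight initialization and sigmoid output scaling are illustrative assumptions; the paper only specifies that each actor has an extra layer scaling its output to the power bound:

```python
import numpy as np

rng = np.random.default_rng(0)
M, L, p_max = 2, 2, 1.0  # clusters, users per cluster, power limit (Table 1)

def make_weights(n_in, n_hidden, n_out):
    # Small random weights for a one-hidden-layer stand-in network.
    return {"in": rng.normal(size=(n_in, n_hidden)) * 0.1,
            "out": rng.normal(size=(n_hidden, n_out)) * 0.1}

def actor(state, w, scale):
    # tanh hidden layer, then a sigmoid output scaled to the action bound,
    # mimicking the extra scaling layer mentioned in the simulation setup.
    h = np.tanh(state @ w["in"])
    return scale / (1.0 + np.exp(-(h @ w["out"])))

s = rng.normal(size=M * L)                       # channel gains of all users
w0 = make_weights(M * L, 16, M)                  # Actor mu_0: cluster powers
wm = [make_weights(L, 16, L) for _ in range(M)]  # Actor mu_m: coefficients

p = actor(s, w0, p_max)                          # p_t in (0, p_max)^M, Eq. (19a)
q = [actor(s[m * L:(m + 1) * L], wm[m], 1.0) for m in range(M)]  # Eq. (19b)
```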
However, a major challenge of learning with deterministic actions is the exploration of new actions [49]. Fortunately, for an off-policy algorithm such as DDPG, exploration can be treated independently from the learning process. We construct the exploration policy by adding noise to the original output action, similar to the random selection of the $\epsilon$-greedy method in DQN, that is

$$p_t = \pi_0(s_t; \mu_0) + \mathcal{N}_0 \Big|_0^{p_{\max}}, \tag{20}$$

$$q_t^m = \pi_m(s_t^m; \mu_m) + \mathcal{N}_m \Big|_0^1, \quad m = 1, 2, \ldots, M, \tag{21}$$

where $\mathcal{N}_0 \in \mathbb{R}^M$ and $\mathcal{N}_m \in \mathbb{R}^1$ represent the noise and follow a normal distribution. The action $p_t^m$ is restricted to the interval $(0, p_{\max})$ and $q_t^m$ is restricted to the interval $(0, 1)$.
After executing the action $a_t$ and receiving $r_t = \{r_t^1, \ldots, r_t^M\}$, the system moves to the next state $s_{t+1}$. Since the action $a_t$ is generated by a deterministic policy network, we rewrite Equation (17) as:

$$Q^{\pi_m}(s_t^m, a_t^m) = \mathbb{E}\left[r_t^m + \gamma Q^{\pi_m}\left(s_{t+1}^m, \left(\pi_0^{(m)}(s_{t+1}; \mu_0), \pi_m(s_{t+1}^m; \mu_m)\right)\right)\right], \quad m = 1, 2, \ldots, M, \tag{22}$$

$$\nabla_{\mu_m} J(\pi_m) = \mathbb{E}\left[\nabla_{q^m} Q^{\pi_m}(s^m, a^m; \theta_m)\big|_{p^m = \pi_0^{(m)}(s; \mu_0)} \nabla_{\mu_m} \pi_m(s^m; \mu_m)\right] \approx \frac{1}{N_b} \sum_i \nabla_{q^m} Q^{\pi_m}(s^m, a^m; \theta_m)\big|_{s^m = s_i^m,\; p^m = \pi_0^{(m)}(s_i; \mu_0)} \nabla_{\mu_m} \pi_m(s^m; \mu_m)\big|_{s^m = s_i^m}, \tag{23}$$

$$\nabla_{\mu_0} J(\pi_0) = \mathbb{E}\left[\sum_{m=1}^{M} \nabla_{p^m} Q^{\pi_m}(s^m, a^m; \theta_m)\big|_{q^m = \pi_m(s^m; \mu_m)} \nabla_{\mu_0} \pi_0(s; \mu_0)\right] \approx \frac{1}{N_b} \sum_{m=1}^{M} \sum_i \nabla_{p^m} Q^{\pi_m}(s^m, a^m; \theta_m)\big|_{s = s_i,\; q^m = \pi_m(s_i^m; \mu_m)} \nabla_{\mu_0} \pi_0(s; \mu_0)\big|_{s = s_i}, \tag{24}$$
Since it is impractical to calculate the Q value of every step in this way, we also use another DNN, named the critic network, with weights $\theta_m$ ($m = 1, 2, \ldots, M$) to output the approximate Q value $Q^{\pi_m}(s_t^m, a_t^m; \theta_m)$ to evaluate the selected action $a_t^m$. We utilize the experience replay method to allow the networks to benefit from learning across a set of uncorrelated tuples. With $N_b$ random tuples $(s_i, a_i, r_i, s_{i+1})$ sampled from the replay memory, the actor networks update their weights in the direction of larger Q values according to the deterministic policy gradient theorem in [49]; the gradients are given by Equations (23) and (24), where $J(\pi_m)$ and $J(\pi_0)$ respectively represent the expected total reward of following policy $\pi_m$ in all states of cluster $m$ and policy $\pi_0$ in all states in a cell. Then we use the target network architecture from DQN to solve the unstable learning issue caused by using only one network to calculate target Q values and update weights at the same time. We create copies of the actor network and the critic network as $\pi_m(s^m; \bar{\mu}_m)$ and $Q^{\pi_m}(s^m, a^m; \bar{\theta}_m)$ and name them the target actor network and the target critic network. Thus, the target Q value can be generated by these two networks, that is

$$y_i^m = r_i^m + \gamma Q^{\pi_m}\left(s_{i+1}^m, \left(\pi_0^{(m)}(s_{i+1}; \mu_0), \pi_m(s_{i+1}^m; \bar{\mu}_m)\right); \bar{\theta}_m\right), \tag{25}$$

$$z_i^m = r_i^m + \gamma Q^{\pi_m}\left(s_{i+1}^m, \left(\pi_0^{(m)}(s_{i+1}; \bar{\mu}_0), \pi_m(s_{i+1}^m; \mu_m)\right); \bar{\theta}_m\right) \tag{26}$$
ALGORITHM 1 The multi-agent DDPG-based power allocation (MDPA) in downlink MIMO-NOMA

1. Initialize the replay memory and its capacity.
2. Initialize the multi-agent DDPG-based power allocation actor networks $\pi_0(s_t; \mu_0)$, $\pi_m(s_t^m; \mu_m)$ and critic networks $Q^{\pi_m}(s_t^m, a_t^m; \theta_m)$ with random parameters $\mu_0$, $\mu_m$ and $\theta_m$ ($m = 1, 2, \ldots, M$).
3. Initialize the target actor networks $\pi_0(s_t; \bar{\mu}_0)$, $\pi_m(s_t^m; \bar{\mu}_m)$ and target critic networks $Q^{\pi_m}(s_t^m, a_t^m; \bar{\theta}_m)$ with parameters $\bar{\mu}_0 = \mu_0$, $\bar{\mu}_m = \mu_m$ and $\bar{\theta}_m = \theta_m$.
4. Initialize the random processes $\mathcal{N}_0$, $\mathcal{N}_m$ for DDPG action exploration, the terminal TS $T_{\max}$ and the parameter update interval size $C$.
5. The controller at the BS receives the first channel condition information of all users as the initial state $s_1$.
6. for $t = 1, 2, \ldots, T_{\max}$ do
7. The controller selects the power allocation action $a_t^m = (p_t^m, q_t^m)$ according to Equations (20) and (21).
8. The controller broadcasts the power allocation action $a_t^m$ to all users, and users transmit their signals with the specified power.
9. If the action $a_t^m$ for any $m$ satisfies the minimum rate requirement, the controller receives the current EE as reward $r_t^m$. Otherwise, it receives no reward.
10. The controller receives the next state $s_{t+1}$ as users move to their next positions.
11. Store the tuple $(s_t, a_t, r_t, s_{t+1})$ in the replay memory.
12. Randomly sample a mini-batch of $N_b$ tuples $(s_i, a_i, r_i, s_{i+1})$ from the replay memory.
13. The critic networks $Q^{\pi_m}(s_t^m, a_t^m; \theta_m)$ update $\theta_m$ by minimizing the loss function of Equation (27).
14. The actor networks $\pi_m(s_t^m; \mu_m)$ update $\mu_m$ according to Equation (23).
15. The critic networks $Q^{\pi_m}(s_t^m, a_t^m; \theta_m)$ update $\theta_m$ by minimizing the loss function of Equation (28).
16. The actor network $\pi_0(s_t; \mu_0)$ updates $\mu_0$ according to Equation (24).
17. Update the target actor networks $\bar{\mu}_0$, $\bar{\mu}_m$ and target critic networks $\bar{\theta}_m$ according to Equation (29) every $C$ TSs.
18. end for
We use the same method as in DQN to update the critic network weights by minimizing the loss functions defined as:

$$L_m(\theta_m) = \frac{1}{N_b} \sum_i \left(y_i^m - Q^{\pi_m}(s_i^m, a_i^m; \theta_m)\big|_{p_i^m = \pi_0^{(m)}(s_i; \mu_0)}\right)^2, \tag{27}$$

$$L_0 = \frac{1}{N_b} \sum_{m=1}^{M} \sum_i \left(z_i^m - Q^{\pi_m}(s_i^m, a_i^m; \theta_m)\big|_{q_i^m = \pi_m(s_i^m; \mu_m)}\right)^2. \tag{28}$$
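The mini-batch losses of Equations (27) and (28) are plain mean-squared errors between the target Q values and the critic outputs. A minimal sketch, with made-up target and critic values that are not tied to any simulation:

```python
def critic_loss(targets, q_values):
    # Mean squared Bellman error over a mini-batch of N_b samples,
    # the form shared by Equations (27) and (28).
    n = len(targets)
    return sum((y - q) ** 2 for y, q in zip(targets, q_values)) / n

# Illustrative targets y_i (from the target networks) and critic outputs.
loss = critic_loss([1.0, 2.0, 3.0], [1.5, 2.0, 2.0])  # (0.25 + 0 + 1) / 3
```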
Instead of directly copying the weights to the target networks, we update the weights $\bar{\mu}_k$ and $\bar{\theta}_m$ in a soft manner to make sure the weights change slowly, which greatly improves the stability of learning, that is

$$\bar{\theta}_m \leftarrow \tau \theta_m + (1 - \tau) \bar{\theta}_m, \quad m = 1, 2, \ldots, M,$$
$$\bar{\mu}_k \leftarrow \tau \mu_k + (1 - \tau) \bar{\mu}_k, \quad k = 0, 1, \ldots, M, \tag{29}$$

where $0 < \tau < 1$. The MDPA algorithm in a downlink MIMO-NOMA system is summarized in Algorithm 1.
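The soft update of Equation (29) is Polyak averaging of the parameters. In the sketch below the weights are plain floats for illustration (a real implementation iterates over all network parameters); $\tau = 0.01$ matches the value used in the simulations:

```python
def soft_update(target_weights, online_weights, tau=0.01):
    # Equation (29): target <- tau * online + (1 - tau) * target.
    return [tau * w + (1.0 - tau) * wt
            for w, wt in zip(online_weights, target_weights)]

target = [0.0, 0.0]
for _ in range(3):  # repeated soft updates drift slowly towards the online weights
    target = soft_update(target, [1.0, 1.0], tau=0.01)
```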
4.3 Multi-agent TD3-based power
allocation (MTPA) network
Twin-delayed deep deterministic policy gradient (TD3) is an off-policy model for continuous high-dimensional action spaces which has recently achieved breakthroughs in artificial intelligence. RL algorithms characterized as off-policy generally utilize a separate behaviour policy which is independent of the policy being improved. The key advantage of this separation is that the behaviour policy can operate by sampling all actions, while the estimation policy can be deterministic [61]. TD3 was built on the DDPG algorithm to increase stability and performance with consideration of function approximation error [60]. The uniqueness of the TD3 algorithm lies in its combination of three powerful DRL techniques: continuous double deep Q-learning [58], policy gradient [56] and actor-critic [54]. Even though DDPG can sometimes achieve good performance, it is very sensitive to hyper-parameters and other adjustments. Overestimation of the Q value at the beginning of learning the Q-function is a common failure mode of DDPG. The overestimation bias is a property of Q-learning in which the maximization of a noisy value estimate induces a consistent overestimation [60, 64]. This noise is unavoidable given the inaccuracy of the estimator in function approximation settings; therefore, overestimation can occur due to inaccurate action values. TD3 is an algorithm that addresses this problem by introducing three key techniques [60]. The first is clipped double Q networks: TD3 learns two Q-functions and uses the smaller of the two Q values to form the target of the Bellman error. The second is the delayed policy update, where the policy network parameters are updated only after the dual Q-function networks are updated. The third is target policy smoothing, which keeps all action values within a specific range [45].
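The first of these techniques, clipped double-Q learning, can be sketched in one line; the reward, critic estimates and $\gamma = 0.9$ below are illustrative values only:

```python
def clipped_double_q_target(r, q1_next, q2_next, gamma=0.9):
    # TD3's first technique: the Bellman target uses the smaller of the two
    # target-critic estimates, which curbs Q-learning's overestimation bias.
    return r + gamma * min(q1_next, q2_next)

# An optimistic critic (q1 = 5.0) is ignored in favour of the lower estimate.
y = clipped_double_q_target(1.0, 5.0, 4.0)  # 1.0 + 0.9 * 4.0
```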
In this paper, we also use the multi-agent TD3 network to derive the power allocation decision in the MIMO-NOMA system, where the network mechanism is the same as in the above DDPG network. Every actor network $m$ ($m = 1, 2, \ldots, M$) learns two Q-functions to obtain a less optimistic estimate of an action value by taking the minimum between the two estimates, so the two Q-functions with $\theta_m^j$ for every actor network can be expressed as:

$$Q^{\pi_m}(s_t^m, a_t^m; \theta_m^j) = \mathbb{E}\left[r_t^m + \gamma Q^{\pi_m}\left(s_{t+1}^m, \left(\pi_0^{(m)}(s_{t+1}; \mu_0), \pi_m(s_{t+1}^m; \mu_m)\right); \theta_m^j\right)\right], \quad m = 1, 2, \ldots, M, \; j = 1, 2. \tag{30}$$
Similar to our DDPG-based method, we utilize the experience replay memory, and the target Q values $y_i^m$ and $z_i^m$ are given by:

$$y_i^m = r_i^m + \gamma \min_{j=1,2} Q^{\pi_m}\left(s_{i+1}^m, \left(\pi_0^{(m)}(s_{i+1}; \mu_0), q_t^m\right); \bar{\theta}_m^j\right), \tag{31}$$

$$q_t^m = \pi_m(s_{i+1}^m; \bar{\mu}_m) + \epsilon_m \Big|_0^1, \quad m = 1, 2, \ldots, M, \quad \epsilon_m \sim \mathrm{clip}\left(\mathcal{N}(0, \sigma_{\epsilon_m}), 0, 1\right), \tag{32}$$

$$z_i^m = r_i^m + \gamma \min_{j=1,2} Q^{\pi_m}\left(s_{i+1}^m, \left(p_t^m, \pi_m(s_{i+1}^m; \mu_m)\right); \bar{\theta}_m^j\right), \tag{33}$$

$$p_t^m = \pi_0^{(m)}(s_{i+1}; \bar{\mu}_0) + \varepsilon_m \Big|_0^{p_{\max}}, \quad \varepsilon_m \sim \mathrm{clip}\left(\mathcal{N}(0, \sigma_{\varepsilon_m}), 0, p_{\max}\right), \tag{34}$$

where the added noise $\epsilon_m$, $\varepsilon_m$ is clipped to keep the target close to the original action. Then the critic networks are updated by minimizing the loss functions defined as:
$$L_m(\theta_m^1, \theta_m^2) = \frac{1}{N_b} \sum_i \sum_j \left(y_i^m - Q^{\pi_m}(s_i^m, a_i^m; \theta_m^j)\big|_{p_i^m = \pi_0^{(m)}(s_i; \mu_0)}\right)^2, \tag{35}$$

$$L_0 = \frac{1}{N_b} \sum_{m=1}^{M} \sum_i \sum_j \left(z_i^m - Q^{\pi_m}(s_i^m, a_i^m; \theta_m^j)\big|_{q_i^m = \pi_m(s_i^m; \mu_m)}\right)^2, \tag{36}$$
$$\nabla_{\mu_m} J(\pi_m) = \mathbb{E}\left[\nabla_{q^m} Q^{\pi_m}(s^m, a^m; \theta_m^1)\big|_{p^m = \pi_0^{(m)}(s; \mu_0)} \nabla_{\mu_m} \pi_m(s^m; \mu_m)\right] \approx \frac{1}{N_b} \sum_i \nabla_{q^m} Q^{\pi_m}(s^m, a^m; \theta_m^1)\big|_{s^m = s_i^m,\; p^m = \pi_0^{(m)}(s_i; \mu_0)} \nabla_{\mu_m} \pi_m(s^m; \mu_m)\big|_{s^m = s_i^m}, \tag{37}$$

$$\nabla_{\mu_0} J(\pi_0) = \mathbb{E}\left[\sum_{m=1}^{M} \nabla_{p^m} Q^{\pi_m}(s^m, a^m; \theta_m^1)\big|_{q^m = \pi_m(s^m; \mu_m)} \nabla_{\mu_0} \pi_0(s; \mu_0)\right] \approx \frac{1}{N_b} \sum_{m=1}^{M} \sum_i \nabla_{p^m} Q^{\pi_m}(s^m, a^m; \theta_m^1)\big|_{s = s_i,\; q^m = \pi_m(s_i^m; \mu_m)} \nabla_{\mu_0} \pi_0(s; \mu_0)\big|_{s = s_i}. \tag{38}$$
Then we update the actor network for $\pi_m$ in the direction of larger Q values with $\theta_m^1$ by Equations (37) and (38). Finally, we update the weights $\bar{\mu}_0$, $\bar{\mu}_m$ and $\bar{\theta}_m^j$ in a soft manner to make sure the weights change slowly, that is

$$\bar{\mu}_0 \leftarrow \tau \mu_0 + (1 - \tau) \bar{\mu}_0, \quad \bar{\theta}_m^j \leftarrow \tau \theta_m^j + (1 - \tau) \bar{\theta}_m^j, \quad m = 1, 2, \ldots, M, \; j = 1, 2, \tag{39}$$

$$\bar{\mu}_m \leftarrow \tau \mu_m + (1 - \tau) \bar{\mu}_m, \quad m = 1, \ldots, M. \tag{40}$$

The MTPA algorithm in a downlink MIMO-NOMA system is summarized in Algorithm 2.
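The smoothed target action construction of Equation (32) can be sketched as follows. The noise scale, clipping bound and action range are illustrative assumptions in the spirit of the standard TD3 formulation, not the paper's exact settings:

```python
import numpy as np

rng = np.random.default_rng(2)

def smoothed_target_action(mu_next, sigma=0.1, noise_clip=0.5, low=0.0, high=1.0):
    # Target-policy smoothing: clipped Gaussian noise is added to the target
    # actor's output, and the result is clipped back to the valid action
    # range. noise_clip is an illustrative value.
    eps = np.clip(rng.normal(scale=sigma, size=np.shape(mu_next)),
                  -noise_clip, noise_clip)
    return np.clip(np.asarray(mu_next) + eps, low, high)

a = smoothed_target_action([0.3, 0.95])  # stays within [0, 1]
```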
ALGORITHM 2 The multi-agent TD3-based power allocation (MTPA) in downlink MIMO-NOMA

1. Initialize the replay memory and its capacity.
2. Initialize the multi-agent TD3-based power allocation actor networks $\pi_0(s_t; \mu_0)$, $\pi_m(s_t^m; \mu_m)$ and critic networks $Q^{\pi_m}(s_t^m, a_t^m; \theta_m^1)$, $Q^{\pi_m}(s_t^m, a_t^m; \theta_m^2)$ with random parameters $\mu_0$, $\mu_m$, $\theta_m^1$ and $\theta_m^2$ ($m = 1, 2, \ldots, M$).
3. Initialize the target actor networks $\pi_0(s_t; \bar{\mu}_0)$, $\pi_m(s_t^m; \bar{\mu}_m)$ and target critic networks $Q^{\pi_m}(s_t^m, a_t^m; \bar{\theta}_m^1)$, $Q^{\pi_m}(s_t^m, a_t^m; \bar{\theta}_m^2)$ with parameters $\bar{\mu}_0 = \mu_0$, $\bar{\mu}_m = \mu_m$, $\bar{\theta}_m^1 = \theta_m^1$ and $\bar{\theta}_m^2 = \theta_m^2$.
4. Initialize the random processes $\mathcal{N}_0$, $\mathcal{N}_m$ for TD3 action exploration, the terminal TS $T_{\max}$ and the parameter update interval size $C$.
5. The controller at the BS receives the first channel condition information of all users as the initial state $s_1$.
6. for $t = 1, 2, \ldots, T_{\max}$ do
7. The controller selects the power allocation action $a_t^m = (p_t^m, q_t^m)$ according to Equations (20) and (21).
8. The controller broadcasts the power allocation action $a_t^m$ to all users, and users transmit their signals with the specified power.
9. If the action $a_t^m$ for any $m$ satisfies the minimum rate requirement, the controller receives the current EE as reward $r_t^m$. Otherwise, it receives no reward.
10. The controller receives the next state $s_{t+1}$ as users move to their next positions.
11. Store the tuple $(s_t, a_t, r_t, s_{t+1})$ in the replay memory.
12. Randomly sample a mini-batch of $N_b$ tuples $(s_i, a_i, r_i, s_{i+1})$ from the replay memory.
13. The critic networks $Q^{\pi_m}(s_t^m, a_t^m; \theta_m^j)$ update $\theta_m^j$ by minimizing the loss function of Equation (35).
14. if $t \bmod C = 0$ then
15. The actor networks $\pi_m(s_t^m; \mu_m)$ update $\mu_m$ according to Equation (37).
16. Update the target actor networks $\bar{\mu}_m$ according to Equation (40).
17. end if
18. The critic networks $Q^{\pi_m}(s_t^m, a_t^m; \theta_m^j)$ update $\theta_m^j$ by minimizing the loss function of Equation (36).
19. if $t \bmod C = 0$ then
20. The actor network $\pi_0(s_t; \mu_0)$ updates $\mu_0$ according to Equation (38).
21. Update the target actor network $\bar{\mu}_0$ and target critic networks $\bar{\theta}_m^1$, $\bar{\theta}_m^2$ according to Equation (39).
22. end if
23. end for
5 SIMULATION RESULTS

In this section, we present the simulation results of the two multi-agent DRL-based power allocation algorithms (i.e. MDPA and MTPA). The results are simulated in a downlink MIMO-NOMA system, where the BS is located at the centre of the cell and 4 users are randomly distributed in a cell with a radius of 500 m. The specific values of the adopted simulation parameters are summarized in Table 1. If not specified, $T_{\max} = 5000$ TS, $p_{\max} = 1$ W and all users move randomly at a speed of $v = 1$ m/s. Our multi-agent DRL-based framework for simulation has five networks (three actor networks and two critic networks), not counting the target networks.

TABLE 1 Parameter settings used in the simulations

Parameter                        Value
Framework                        PyTorch 1.5
Programming language             Python 3.6
Channel                          MIMO-NOMA/AWGN
Fading                           Rayleigh distribution
Number of transmit antennas      $M = 2$
Number of receive antennas       $N = 2$
Number of users in a cluster     $L = 2$
Channel bandwidth                $B_{m,l} = 10$ MHz
Thermal noise density            $\sigma^2 = -174$ dBm/Hz
Pathloss model                   $114 + 38 \log_{10}(d_{m,l})$
Minimum required data rate       $\xi_{m,l}^{\min} = 10$ bps/Hz

The learning rate is set to 0.01, the discount factor $\gamma = 0.9$, the memory capacity to 200, the weight update interval $C = 10$ and the batch size $N_b = 32$. Specifically, every actor network has an additional layer to scale the output to the power bound limitation. The soft update rate $\tau$ in Equations (29), (39) and (40) is set to 0.01, the noise process in Equations (20) and (21) follows a normal distribution $\mathcal{N}(0, 1)$, and $\sigma_{\epsilon_m} = \sigma_{\varepsilon_m} = 0.1$ in Equations (32) and (34).
In order to compare the performance of the proposed frameworks, we consider several alternative approaches: (1) the single-agent DDPG/TD3-based power allocation methods (SDPA/STPA), which output the power allocation policy for every cluster by independent single agents based on the DDPG/TD3 algorithm to maximize the sum EE of the system; (2) the DQN-based discrete power allocation strategy (QPA), which uses the DQN method to output the allocated power of all users by quantizing the power into 10 levels between 0 and $p_{\max}$ in order to fit the input layer of the DQN, because the transmit power is a continuous variable and the action space of a DQN has to be finite; (3) the fixed power allocation strategy (FPA), which chooses the action that maximizes the sum EE at maximum transmit power by exhaustive search in each TS to ensure a high quality-of-service (QoS); (4) the multi-agent DDPG/TD3-based power allocation methods for MIMO-OMA systems (MTPA-OMA/MDPA-OMA), where the system model for MIMO-OMA is referenced from [14]. We evaluated QPA by exhaustive search in each TS; because it does not rely on global trajectory data, it easily falls into a local optimum. However, such quantization results
in a serious problem, namely the huge action space of possible power selections. In our QPA scenario with 4 users and a quantization level $k = 10$, each user can be allocated the power amount corresponding to one of 10 possible power allocation choices, so the number of selectable actions already reaches $10^4 = 10,000$. It should be noted that this is only a small system; in future-generation wireless networks with a high density of users, the size of the action space grows exponentially as the number of users increases. Such a large action space leads to poor performance, because the DQN agent needs plenty of time to explore the entire action space to find the best power allocation option. Moreover, due to the randomly evolving nature of the communication environment, it may be difficult to choose the option that leads to the best performance. In addition, such quantization also discards some useful information, which is essential for finding the optimal power allocation option. This is also the reason we use the off-policy network based on the actor-critic structure for power allocation tasks.

FIGURE 3 Loss function value of two multi-agent DRL-based frameworks
The convergence complexity of the DRL algorithm depends on the size of the action-state space. For the QPA framework, the size of the output action space is $O(k^{L \times M})$, which increases exponentially with the number of users and the quantization levels of the power allocation coefficients, whereas for our proposed frameworks the size is $O(M \times (L + 1))$. The state space also requires storage, and the corresponding space complexity is $O(M \times L)$ for both QPA and the proposed frameworks. Therefore, the proposed algorithms can improve the energy efficiency of MIMO-NOMA systems with low complexity.
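The two complexity figures quoted above can be checked directly with the quantization level from the QPA example and the Table 1 dimensions:

```python
# Action-space sizes quoted in the text: QPA quantizes each user's power
# into k levels, so its discrete action space has k**(L*M) entries, while
# the proposed continuous frameworks output only M*(L+1) values per TS.
k, L, M = 10, 2, 2  # quantization levels; users per cluster; clusters

qpa_actions = k ** (L * M)        # 10**4 = 10,000 discrete choices
proposed_outputs = M * (L + 1)    # 6 continuous outputs
```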
First, we evaluate the convergence of the proposed two multi-agent frameworks under a fixed learning rate of 0.01, which provided the best learning performance in our simulations. We fix the locations of all users to determine how many TSs are required for our frameworks to find the optimal power allocation policy. As shown in Figure 3, the loss function value decreases and settles to a stable value within 200 TSs for both frameworks, and the value is small enough to accurately predict the Q value.
Then, we compare the proposed two multi-agent algorithms with different power allocation approaches in the random-moving system. It should be noted that all results are averaged with a moving window of 100 TSs to make the comparison clearer. For the objective of sum EE maximization, we investigate the performance of our multi-agent DRL-based frameworks. Figure 4 shows that the MTPA framework achieves better averaged sum EE performance than the other approaches. The DRL-based frameworks are able to dynamically choose the transmit power of all users with continuous control according to their current channel conditions in every TS. Specifically, the averaged sum EE achieved by the MTPA framework over all 5000 TSs is 21.47% higher than the FPA approach, whereas the value for the MDPA framework is 17.3% higher. For the STPA and SDPA frameworks, the values are 15.8% and 12.39% higher, respectively. On the other hand, the QPA framework based on the discrete DRL method is only 11.13% higher. More importantly, as long as the data rate requirement of all users is satisfied, it is unnecessary and energy-wasteful to use maximum power for transmission, since it reduces the EE of the system. This also shows the importance of power allocation for the performance improvement of the NOMA system. We also verify the performance advantage of the proposed algorithms over MIMO-OMA. As shown in Figure 4, the numerical results show that the proposed algorithms outperform their MIMO-OMA counterparts in energy efficiency: in our simulation scenario, MTPA achieves an averaged sum EE 18.04% higher than MTPA-OMA, and MDPA 16.86% higher than MDPA-OMA.

FIGURE 4 Averaged energy efficiency comparison of different algorithms
We investigate the averaged sum EE performance of the two frameworks over 5000 TSs under different transmit power limitations. As shown in Figure 5, the averaged results of the DRL-based frameworks grow and tend to stable values as the maximum available power increases, while the FPA approach increases slightly and then continues to drop, since it always uses full power for signal transmission. This is because, as long as the data rate requirement is satisfied, our algorithms no longer need full power for transmission and can dynamically allocate the power based on the communication conditions to optimize the sum EE no matter how $p_{\max}$ changes. Obviously, our proposed power allocation methods based on the deterministic policy gradient with continuous action spaces outperform the discrete DRL-based approach and the conventional methods under most power limitation conditions, and we find that the MTPA framework achieves the best averaged sum EE performance compared with the other approaches. We also verify the improved performance of the proposed algorithms for MIMO-NOMA over MIMO-OMA.

FIGURE 5 Averaged energy efficiency versus power limitation

FIGURE 6 The variation of the energy efficiency with the minimum required data rate
Figure 6 illustrates how the EE varies with the minimum required data rate $\xi_{m,l}^{\min}$. As shown in Figure 6, our proposed frameworks show outstanding performance compared with the other approaches as well as over MIMO-OMA, even though the EE decreases with $\xi_{m,l}^{\min}$. This situation can be explained by connecting $\xi_{m,l}^{\min}$ with the transmit power, i.e. a higher $\xi_{m,l}^{\min}$ has the same influence on the EE as a higher transmit power. The obtained results indicate that the average system energy efficiency decreases with increasing $\xi_{m,l}^{\min}$, because as $\xi_{m,l}^{\min}$ increases, the system reaches the infeasibility case more rapidly and becomes unable to satisfy $\xi_{m,l}^{\min}$ for all system users.
6 CONCLUSION
In this paper, we have studied the dynamic power allocation problem in a downlink MIMO-NOMA system, and two multi-agent DRL-based frameworks (i.e. the multi-agent DDPG/TD3-based frameworks) have been proposed to improve the long-term EE of the MIMO-NOMA system while ensuring the minimum data rates of all users. Every single agent of the two multi-agent frameworks dynamically outputs the optimum power allocation policy for all users in every cluster by the DDPG/TD3 algorithm, and the additional actor network adjusts the power volumes allocated to the clusters to improve the overall performance of the MIMO-NOMA system. Finally, both frameworks adjust the entire power allocation policy by updating the weights of the neural networks according to the feedback of the system. Compared with other approaches, our multi-agent DRL-based power allocation frameworks can significantly improve the EE of the MIMO-NOMA system under various transmit power limitations and minimum data rates by adjusting the parameters of the networks. We have also verified the energy performance advantage of the proposed algorithms over the MIMO-OMA system. In future work, we will study the joint subchannel selection and power allocation problem for more practical scenarios of MIMO-NOMA systems, applying more powerful DRL approaches.
REFERENCES
1. Chin, W.H., Fan, Z., Haines, R.: Emerging technologies and research chal-
lenges for 5G wireless networks. IEEE Wireless Commun. 21(2), 106–112
(2014)
2. Ding, Z., et al.: A survey on non-orthogonal multiple access for 5G networks: Research challenges and future trends. IEEE J. Sel. Areas Commun. 35(10), 2181–2195 (2017)
3. Wu, Z., et al.: Comprehensive study and comparison on 5G NOMA
schemes. IEEE Access 6, 18511–18519 (2018)
4. Choi, J.: Effective capacity of NOMA and a suboptimal power control pol-
icy with delay QoS. IEEE Trans. Commun. 65(4), 1849–1858 (2017)
5. Yin, Y., et al.: Dynamic user grouping-based NOMA over Rayleigh fading
channels. IEEE Access 7, 110964–110971 (2019)
6. Gui, G., Sari, H., Biglieri, E.: A new definition of fairness for non-
orthogonal multiple access. IEEE Commun. Lett. 23(7), 1267–1271 (2019)
7. Liu, X., et al.: Spectrum resource optimization for NOMA-based cognitive
radio in 5G communications. IEEE Access 6, 24904–24911 (2018)
8. Muhammed, A.J., et al.: Resource allocation for energy efficiency in
NOMA enhanced small cell networks. In: Proceedings of 2019 11th Inter-
national Conference on Wireless Communications and Signal Processing
(WCSP). Xi’an, China, pp. 1–6 (2019)
9. Ding, Z., et al.: On the performance of non-orthogonal multiple access
in 5G systems with randomly deployed users. IEEE Signal Process. Lett.
21(12), 1501–1505 (2014)
10. Cui, J., Ding, Z., Fan, P.: Power minimization strategies in downlink
MIMO-NOMA systems. In: Proceedings of 2017 IEEE International
Conference on Communications (ICC). Paris, France, pp. 1–6 (2017)
11. Zeng, M., et al.: Energy-efficient power allocation for MIMO-NOMA with
multiple users in a cluster. IEEE Access 6, 5170–5181 (2018)
12. Ding, Z., Adachi, F., Poor, H.V.: The application of MIMO to non-
orthogonal multiple access. IEEE Trans. Wireless Commun. 15(1), 537–
552 (2016)
13. Ding, Z., Schober, R., Poor, H.V.: A general MIMO framework for NOMA
downlink and uplink transmission based on signal alignment. IEEE Trans.
Wireless Commun. 15(6), 4438–4454 (2016)
14. Zeng, M., et al.: Capacity comparison between MIMO-NOMA and
MIMO-OMA with multiple users in a cluster. IEEE J. Sel. Areas Com-
mun. 35(10), 2413–2424 (2017)
15. Do, D.T., Nguyen, S.: Device-to-device transmission modes in NOMA
network with and without wireless power transfer.Comput. Commun. 139,
67–77 (2019)
16. Uddin, M.B., Kader, M.F., Shin, S.: Exploiting NOMA in D2D assisted
full-duplex cooperative relaying. Phys. Commun. 38(100914), 1–10
(2020)
17. Najimi, M.: Energy-efficient resource allocation in D2D communica-
tions for energy harvesting-enabled NOMA-based cellular networks. Int.
J. Commun. Syst. 33(2), 1–13 (2020)
18. Li, R., et al.: Tact: A transfer actor-critic learning framework for energy
saving in cellular radio access networks. IEEE Trans. Wireless Commun.
13(4), 2000–2011 (2014)
19. Luo, J., et al.: A deep learning-based approach to power minimization
in multi-carrier NOMA with SWIPT. IEEE Access 7, 17450–17460
(2019)
20. Jang, H.S., Lee, H., Quek, T.Q.S.: Deep learning-based power control for
non-orthogonal random access. IEEE Commun. Lett. 23(11), 2004–2007
(2019)
21. Adjif, M., Habachi, O., Cances, J.P.: Joint channel selection and power con-
trol for NOMA: A multi-armed bandit approach. In: Proceedings of 2019
IEEE Wireless Communications and Networking Conference Workshop
(WCNCW). Marrakech, Morocco, pp. 1–6 (2019)
22. Wei, Y., et al.: Power allocation in HetNets with hybrid energy supply using
actor-critic reinforcement learning. In: Proceedings of GLOBECOM 2017
- 2017 IEEE Global Communications Conference. Singapore, pp. 1–5
(2017)
23. Hasan, M.K., et al.: The role of deep learning in NOMA for 5G and
beyond communications. In: Proceedings of 2020 International Con-
ference on Artificial Intelligence in Information and Communication
(ICAIIC). Fukuoka, Japan, pp. 303–307 (2020)
24. Zhao, Q., Grace, D.: Transfer learning for QoS aware topology man-
agement in energy efficient 5G cognitive radio networks. In: Proceed-
ings of 1st International Conference on 5G for Ubiquitous Connectivity.
Akaslompolo, Finland, pp. 152–157 (2014)
25. Wei, Y., et al.: User scheduling and resource allocation in HetNets with
hybrid energy supply: An actor-critic reinforcement learning approach.
IEEE Trans. Wireless Commun. 17(1), 680–692 (2017)
26. Liu, M., et al.: Resource allocation for NOMA based heterogeneous IoT
with imperfect SIC: A deep learning method. In: Proceedings of 2018
IEEE 29th Annual International Symposium on Personal, Indoor and
Mobile Radio Communications (PIMRC). Bologna, Italy, pp. 1440–1446
(2018)
27. Gui, G., et al.: Deep learning for an effective non-orthogonal multiple
access scheme. IEEE Trans. Veh. Technol. 67(9), 8440–8450 (2018)
28. Ding, Z., Fan, P., Poor, H.V.: Impact of user pairing on 5G nonorthogonal
multiple-access downlink transmissions. IEEE Trans. Veh. Technol. 65(8),
6010–6023 (2016)
29. Liang, W., et al.: User pairing for downlink non-orthogonal multiple access
networks using matching algorithm. IEEE Trans. Commun. 65(12), 5319–
5332 (2017)
30. Paulraj, A.J., et al.: An overview of MIMO communications - A key to
gigabit wireless. Proc. IEEE 92(2), 198–218 (2004)
31. Hanif, M.F., et al.: A minorization-maximization method for optimizing sum rate in non-orthogonal multiple access systems. IEEE Trans. Signal Process. 64(1), 76–88 (2015)
32. Benjebbour, A., Kishiyama, Y.: Combination of NOMA and MIMO: Concept and experimental trials. In: Proceedings of 25th International Conference on Telecommunications (ICT). St. Malo, France, pp. 433–438 (2018)
33. Islam, S.M.R., et al.: Resource allocation for downlink NOMA systems:
Key techniques and open issues. IEEE Wireless Commun. 25, 40–47
(2017)
34. Manglayev, T., Kizilirmak, R., Kho, Y.H.: Optimum power allocation for
non-orthogonal multiple access (NOMA). In: Proceedings of 25th Inter-
national Conference on Telecommunications (ICT). Baku, Azerbaijan,
pp. 1–4 (2016)
35. Zhang, Y., et al.: Energy-efficient transmission design in non-orthogonal
multiple access. IEEE Trans. Veh. Technol. 66, 2852–2857 (2017)
36. Zhang, J., et al.: Joint subcarrier assignment and downlink-uplink time-
power allocation for wireless powered OFDM-NOMA systems. In:
Proceedings of 10th International Conference on Wireless Communica-
tions and Signal Processing (WCSP). Hangzhou, China, pp. 1–7 (2018)
37. Zhao, J., et al.: Joint subchannel and power allocation for NOMA enhanced
D2D communications. IEEE Trans. Commun. 65(11), 5081–5094 (2017)
38. Xiong, Z., et al.: Deep reinforcement learning for mobile 5G and beyond:
Fundamentals, applications, and challenges. IEEE Veh. Technol. Mag.
14(2), 44–52 (2019)
39. Wang, T., et al.: Deep learning for wireless physical layer: Opportunities
and challenges. China Commun. 14(11), 92–111 (2017)
40. Huang, D., et al.: Deep learning based cooperative resource allocation in
5G wireless networks. Mobile Network Appl. 1–8 (2018)
41. Eisen, M., et al.: Learning optimal resource allocations in wireless systems.
IEEE Trans. Signal Process. 67(10), 2775–2790 (2018)
42. Zhang, Y., Wang, X., Xu, Y.: Energy-efficient resource allocation in uplink
NOMA systems with deep reinforcement learning. In: Proceedings of 11th
International Conference on Wireless Communications and Signal Pro-
cessing (WCSP). Xi’an, China, pp. 1–6 (2019)
43. Zhang, H., et al.: Artificial intelligence-based resource allocation in ultra-
dense networks: Applying event-triggered Q-learning algorithms. IEEE
Veh. Technol. Mag. 14(4), 56–63 (2019)
44. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. 2nd
ed., The MIT Press, Cambridge, MA (2018). http://incompleteideas.net/
book/the-book.html
45. Zhang, F., Li, J., Li, Z.: A TD3-based multi-agent deep reinforcement learn-
ing method in mixed cooperation-competition environment. Science 411,
206–215 (2020)
46. Mnih, V., et al.: Human-level control through deep reinforcement learning.
Nature 518(7540), 529–533 (2015)
47. Xiao, L., et al.: Reinforcement learning-based NOMA power allocation in
the presence of smart jamming. IEEE Trans. Veh. Technol. 67(4), 3377–
3389 (2017)
48. Zhang, S., et al.: A dynamic power allocation scheme in power-domain
NOMA using actor-critic reinforcement learning. In: Proceedings of
2018 IEEE/CIC International Conference on Communications in China
(ICCC). Beijing, China, pp. 719–723, (2018)
49. Zhao, N., et al.: Deep reinforcement learning for user association and
resource allocation in heterogeneous networks. In: Proceedings of 2018
IEEE Global Communications Conference (GLOBECOM). Abu Dhabi,
United Arab Emirates, pp. 1–6 (2018)
50. Chu, M., et al.: Reinforcement learning-based multiaccess control and bat-
tery prediction with energy harvesting in IoT systems. IEEE Internet
Things J. 6(2), 2009–2020 (2019)
51. Lillicrap, T., et al.: Continuous control with deep reinforcement learning.
In: Proceedings of 4th International Conference on Learning Representa-
tions. San Juan, Puerto Rico, pp. 1–14 (2016)
52. Grondman, I., et al.: Efficient model learning methods for actor-critic con-
trol. IEEE Trans. Syst. Man Cybern. Cybern. 42(3), 591–602 (2011)
53. Vamvoudakis, K.G., Lewis, F.: Online actor-critic algorithm to solve the
continuous-time infinite horizon optimal control problem. Automatica 46,
878–888 (2010)
54. Sutton, R.S., et al.: Policy gradient methods for reinforcement learning with
function approximation. Adv. Neural Inf. Process. Syst. 12, 1057–1063
(2000)
55. Konda, V., Gao, V.: On actor-critic algorithms. SIAM J. Control Optim.
42(4), 1143–1166 (2000)
56. Silver, D., et al.: Deterministic policy gradient algorithms. In: Proceedings
of 31st International Conference on Machine Learning. Beijing, China, vol.
32, pp. 387–395 (2014)
57. Li, Y., et al.: Energy-aware resource management for uplink non-
orthogonal multiple access: Multi-agent deep reinforcement learning.
Future Gener. Comput. Syst. 105, 684–694 (2020)
58. Van Hasselt, H., Guez, A., Silver, D.: Deep reinforcement learning with
double Q-learning. In: Proceedings of 29th Conference on Artificial Intel-
ligence (AAAI15). Austin, TX, pp. 2094–2100 (2015)
59. Bao, X., et al.: Conservative policy gradient in multi-critic setting. In: Pro-
ceedings of 2019 Chinese Automation Congress (CAC). Hangzhou, China,
pp. 1486–1489 (2019)
60. Fujimoto, S., Hoof, H., Meger, D.: Addressing function approximation
error in actor-critic methods. In: Proceedings of 35th International Con-
ference on Machine Learning. Stockholm, Sweden, pp. 1582–1591 (2018)
61. Dankwa, S., Zheng, W.: Modeling a continuous locomotion behavior of an
intelligent agent using deep reinforcement technique. In: Proceedings of
2019 IEEE 2nd International Conference on Computer and Communica-
tion Engineering Technology-CCET. Beijing, China, pp. 172–175 (2019)
62. Jaderberg, M., et al.: Human-level performance in first-person multiplayer
games with population-based deep reinforcement learning. Science 364,
859–865 (2019)
63. Lowe, R., et al.: Multi-agent actor-critic for mixed cooperative-competitive
environments. In: Proceedings of 31st Conference on Neural Information
Processing Systems. Long Beach, CA, pp. 6382–6393 (2017)
64. Thrun, S., Schwartz, A.: Issues in using function approximation for rein-
forcement learning. In: Proceedings of 4th Connectionist Models. Summer
School Hillsdale, NJ, pp. 1–9 (1993)
How to cite this article: Jo, S., et al.: Multi-agent deep reinforcement learning-based energy efficient power allocation in downlink MIMO-NOMA systems. IET Commun. 15, 1642–1654 (2021). https://doi.org/10.1049/cmu2.12177