This paper has been accepted by IEEE Transactions on Cognitive Communications and Networking
Learning from Peers: Deep Transfer Reinforcement Learning for Joint Radio and Cache Resource Allocation in 5G RAN Slicing
Hao Zhou, Graduate Student Member, IEEE, Melike Erol-Kantarci, Senior Member, IEEE, Vincent Poor, Fellow, IEEE
Abstract—Network slicing is a critical technique for 5G communications that covers radio access network (RAN), edge, transport and core slicing. The evolving network architecture requires the orchestration of multiple network resources such as radio and cache resources. In recent years, machine learning (ML) techniques have been widely applied for network management. However, most existing works do not take advantage of the knowledge transfer capability in ML. In this paper, we propose a deep transfer reinforcement learning (DTRL) scheme for joint radio and cache resource allocation to serve 5G RAN slicing. We first define a hierarchical architecture for joint resource allocation. Then we propose two DTRL algorithms: Q-value-based deep transfer reinforcement learning (QDTRL) and action selection-based deep transfer reinforcement learning (ADTRL). In the proposed schemes, learner agents utilize expert agents' knowledge to improve their performance on current tasks. The proposed algorithms are compared with both the model-free exploration bonus deep Q-learning (EB-DQN) and the model-based priority proportional fairness and time-to-live (PPF-TTL) algorithms. Compared with EB-DQN, our proposed DTRL-based method presents 21.4% lower delay for the Ultra Reliable Low Latency Communications (URLLC) slice and 22.4% higher throughput for the enhanced Mobile Broad Band (eMBB) slice, while achieving significantly faster convergence than EB-DQN. Moreover, 40.8% lower URLLC delay and 59.8% higher eMBB throughput are observed with respect to PPF-TTL.
Index Terms—5G, network slicing, edge caching, transfer
reinforcement learning
I. INTRODUCTION
Driven by the increasing traffic demand of diverse mobile
applications, 5G mobile networks are expected to satisfy the
diverse quality of service (QoS) requirements of a wide variety
of services as well as service level agreements of different user
types [1]. Considering the diverse QoS demands of user types
such as enhanced Mobile Broad Band (eMBB) and Ultra Reli-
able Low Latency Communications (URLLC), network slicing
has been proposed to enable flexibility and customization of
5G networks. Based on software defined networks and network function virtualization techniques, physical networks are split into multiple logical network slices [2]. Each slice may include its own controller to manage the available resources.
This work was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) Collaborative Research and Training Experience Program (CREATE) under Grant 497981, the Canada Research Chairs Program, and the U.S. National Science Foundation under Grant CNS-2128448.
H. Zhou and M. Erol-Kantarci are with the School of Electrical Engineering and Computer Science, University of Ottawa, Ottawa, ON K1N 6N5, Canada (emails: {hzhou098, melike.erolkantarci}@uottawa.ca).
H. V. Poor is with the Department of Electrical and Computer Engineering, Princeton University, Princeton, NJ 08544 USA (e-mail: poor@princeton.edu).
As an important part of network slicing, radio access
network (RAN) slicing is more complicated than core and
transport network slicing due to limited bandwidth resources
and fluctuating radio channels. For instance, a two-layer slicing
approach is introduced in [3] for low complexity RAN slicing,
which aims to find a suitable trade-off between slice isola-
tion and efficiency. A puncturing-based scheduling method
is proposed in [4] to allocate resources for the incoming
URLLC traffic and minimize the risk of interrupting eMBB
transmissions. However, the puncturing-based method may
degrade the performance of the eMBB slice, since the URLLC
slice is scheduled on top of ongoing eMBB transmissions (i.e.,
puncturing the current eMBB transmission).
The aforementioned works mainly concentrate on RAN
slicing. While the spectrum is indisputably the most critical
resource of RAN, other resources are equally important to
guarantee the network performance, especially cache resource
[5]. Incorporating caching into the RAN has attracted interest
from both academia and industry, and some novel network
architectures have been proposed to harvest the potential
advantages of edge caching [6]. Indeed, edge caching allows
storing data closer to the users by utilizing the storage capacity
available at the network devices, and it reduces the traffic to the
core network. Moreover, caching can save backhaul capacity and reduce the network delay. Cache placement and
caching strategies have been extensively studied in numerous
works, but the problem is usually addressed disjointly without
considering the RAN. For example, a location customized
caching method is presented in [7] to maximize the cache
hit ratio, and content caching locations are optimized in [8]
by considering both cloud-centric and edge-centric caching.
Augmenting edge devices in RAN with caching will bring
significant improvements for 5G networks. However, this leads
to higher complexity for network management. Network slic-
ing is expected to allocate limited resources such as bandwidth
or caching capacity between slices and fulfill the QoS require-
ments of slices. The complex network dynamics, especially
the stochastic arrival requests of slices, make the underlying
network optimization challenging. Fortunately, machine learn-
ing (ML) techniques offer promising solutions [9]. Applying
a reinforcement learning (RL) scheme can avoid the potential
complexity of defining a dedicated optimization model. For
instance, Q-learning is deployed in [10] to maximize the
network utility of 5G RAN slicing by jointly considering radio
and computation resources. Deep Q-learning (DQN) is used
in [11] for end-to-end network slicing, and double deep Q-
learning (DDQN) is applied for computation offloading within
sliced RANs in [12].
Although learning-based methods such as RL and deep rein-
forcement learning (DRL) have been generally applied for net-
work resource allocation, most existing works do not consider
the possibility of knowledge transfer [10]–[12]. Specifically,
an agent is designed for a specific task in these works, and
it interacts with its environment from scratch, which usually
leads to a lower exploration efficiency and longer convergence
time. Whenever a new task is assigned, the agent needs to
be retrained, even though similar tasks have been completed
before. The poor generalization capability of straightforward
RL methods motivates us to find a learning method with
better generalization and knowledge transfer capability. On
the other hand, humans can reuse the knowledge learned from
previous tasks to solve new tasks more efficiently, and this
capability can be built into ML as well [13]. Such knowledge
transfer and reuse can significantly reduce the need for a large
number of training samples, which is a common issue in many
ML methods. By incorporating knowledge transfer capability
into ML, it is expected to reduce the algorithm design and
training efforts and achieve better performance such as faster
convergence and higher average reward.
In this work, we propose two deep transfer reinforcement
learning (DTRL) based solutions for joint radio and cache re-
source allocation. In particular, we include knowledge transfer
capability in the DDQN framework by defining two different
knowledge transfer functions, and we propose two DTRL-
based algorithms accordingly. The first method is Q-value-
based deep transfer reinforcement learning (QDTRL), and the
second technique is called action selection-based deep transfer
reinforcement learning (ADTRL). Using these schemes, agents
can utilize the knowledge of experts to improve their perfor-
mance on current tasks, and consequently it can reduce the
algorithm training efforts. Furthermore, the current network
optimization schemes are usually defined in a centralized way
[10], [11]. This leads to excessive control overhead where
processing the specific requests of all devices can be a heavy
burden for the central controller. To this end, we propose
a hierarchical architecture for joint resource allocation. The
global resource manager (GRM) is responsible for inter-slice
resource allocation, and then slice resource managers (SRMs)
implement intra-slice resource allocation among associated
user equipment (UEs).
The main contributions of this work are: (1) We define a
hierarchical architecture for joint radio and cache resource
allocation of cellular networks. The GRM will intelligently
allocate resources between slices, then SRMs will distribute
radio and cache resource within each slice based on specific
rules. The proposed hierarchical architecture can reduce the
control overhead and achieve higher flexibility.
(2) We propose two DTRL-based solutions for the inter-
slice resource allocation, namely QDTRL and ADTRL. The
QDTRL can utilize the Q-values of the experts as prior
knowledge to improve the learning process, while the ADTRL
focuses on action selection knowledge. Compared with RL or
DRL, the proposed DTRL solutions show a better knowledge
transfer capability.
We further propose two baseline algorithms, including a
model-free exploration bonus DQN (EB-DQN) algorithm and
a model-based priority proportional fairness and time-to-live
(PPF-TTL) method. The proposed DTRL solutions are com-
pared with these two baseline algorithms via simulations.
The results demonstrate that DTRL-based algorithms perform
better in both network and ML metrics. In particular, ADTRL
has 21.4% lower URLLC delay and 22.4% higher eMBB
throughput than EB-DQN. The simulations also show 40.8%
lower URLLC delay and 59.8% higher eMBB throughput than
the PPF-TTL method. QDTRL and ADTRL also outperform
EB-DQN with significantly faster convergence.
The rest of this work is organized as follows. Section II
presents related work, Section III introduces the background,
and Section IV shows the system model and problem formula-
tion. Section V explains the DTRL-based resource allocation
scheme. The simulations are shown in Section VI, and Section
VII concludes this work.
II. RELATED WORK
Recently, numerous studies have applied artificial intelli-
gence (AI) techniques for resource allocation in 5G networks.
DRL is deployed in [14] for spectrum allocation in integrated
access and backhaul networks with dynamic environments,
and DRL is combined with game theory in [15] for multi-
tenant cross-slice resource allocation. Multi-agent reinforce-
ment learning is used in [16] for distributed dynamic spectrum
access sharing of communication networks. Moreover, a DRL-
based method is proposed in [17] for mobility-aware proactive
resource allocation, which pre-allocates resources to mobile
UEs in time and frequency domains. Finally, [18] proposed
a DQN-based intelligent resource management method to
improve the quality of service for 5G cloud RAN. The afore-
mentioned works show that various ML methods have been
applied for the resource management of wireless networks,
including RL [10], DRL [11], [14], [15], [17], [18], DDQN
[12], multi-agent reinforcement learning [16], and so on. In
our former work [19], we proposed a correlated Q-learning-
based method for the radio resource allocation of 5G RAN,
but the knowledge transfer was still not considered.
In these works, the main motivations for deploying ML
algorithms are the increasing complexity of wireless networks
and the difficulties to build dedicated optimization models.
Indeed, the evolving network architecture, emerging new net-
work performance requirements and increasing device num-
bers make the traditional methods such as convex optimization
more and more complicated. ML, especially RL, offers a good
opportunity to reduce the optimization complexity and make
data-driven decisions. However, a large number of samples
are needed for the learning process, which means a long
TABLE I
COMPARISON OF RL, TL, DRL AND TRL

RL — Features: The agent has no prior knowledge about tasks; it explores new tasks from scratch. Difficulties: Long convergence time, low exploration efficiency. Applications: Tasks with limited state-action space and no prior knowledge.
TL — Features: Improving generalization across different distributions between expert and learner tasks. Difficulties: Negative transfer, automation of task mapping. Applications: Mainly designed for the supervised learning domain, e.g., classification and regression.
DRL — Features: Combining artificial neural networks with the RL architecture; applying neural networks to estimate state-action values. Difficulties: Time-consuming network training and tuning; low sample and exploration efficiency; training stability. Applications: Large state-action space or continuous-action problems.
TRL — Features: Utilizing prior knowledge from experts to improve the performance of learners, such as higher average reward or faster convergence. Difficulties: A transfer function needs to be defined to digest the prior knowledge. Applications: Optimization of tasks with existing prior knowledge.
training time and low system efficiency. Furthermore, after the
long learning process, the agent can only handle one specific
task without any generalization. This learning process has
to be repeated when a new task is assigned. DRL can be
considered as a solution to reduce the number of samples
and improve convergence. However, it is still limited to a
localized domain, and the neural network training can be
time-consuming because of the hyperparameter tuning. To
this end, we propose DTRL-based solutions for the joint
radio and cache resource allocation in 5G RAN. Transfer reinforcement learning (TRL) has been applied in [20] for intra-beam interference management in 5G mmWave networks, but here the resource allocation problem is more complicated due to a much larger state-action space. The proposed scheme
has a satisfying knowledge transfer capability by utilizing the
knowledge of expert agents, and it outperforms EB-DQN by
faster convergence and better network performance.
III. BACKGROUND
Fig. 1. Comparison of TRL and RL.
In this section, we introduce the background of TRL to
distinguish it from RL. As shown in Fig.1, given a task pool,
the interactions between tasks and agents can be described by the Markov decision process (MDP) <S, A, R, T>, where S is the state space, A is the action space, R is the reward function, and T is the transition probability. In RL, each agent works independently to try different actions, arriving at new states and receiving rewards. The learning phase of RL can be defined as:
L_{RL}: s \times K \rightarrow a, r \quad (a \in A),    (1)
where K is the knowledge of this agent, s is the current state, a is the selected action, and r is the reward. Equation (1) indicates that the RL agent utilizes the collected knowledge to select action a and receive reward r under state s.
On the contrary, TRL includes two phases: the knowledge
transfer phase and the learning phase. In the knowledge
transfer phase, as shown in Fig.1, considering task differences,
a mapping function is defined to make the knowledge of
experts digestible for the learner. Then the learner explores
current tasks on its own and forms its own knowledge. The
whole process of TRL can be defined as:
L_{TRL}: s \times M(K_{expert}) \times K_{learner} \rightarrow a, r \quad (a \in A),    (2)
where M is the mapping function, K_{expert} is the knowledge from the experts, and K_{learner} is the knowledge of the learner. In equation (2), the knowledge from experts is utilized in the action selection of the learner, and it is expected to accelerate the learning process L_{TRL}. In TRL, new tasks can be better handled based on the knowledge of experts.
The RL, transfer learning (TL), DRL and TRL techniques
are compared in Table I. Compared with RL, TRL has higher
exploration efficiency and better generalization capability [21].
On the other hand, although DRL is a breakthrough approach
by combining neural networks with RL schemes, the time-
consuming network training is a well-known issue, and the
training stability and generalization capability of DRL can
cause problems [22]. Finally, although TL has been extensively
studied in the ML literature, it mainly focuses on the super-
vised learning domain such as classification and regression
[13]. Compared with TL, TRL can be more complicated
because the knowledge needs to be transferred in the context of
the MDP scheme. Moreover, due to the dedicated components
of MDP, the knowledge may exist in different forms, which
needs to be transferred in different ways [23]. Furthermore,
TRL has shown significant improvements in robot learning.
Inspired by these approaches and previous successful results
[20], we propose DTRL-based schemes for RAN slicing.
In this work, we combine state-of-the-art DDQN with TRL
and propose a DTRL-based joint resource allocation method
for 5G RAN slicing. Compared with conventional DRL that
Fig. 2. Proposed radio and cache resource allocation scheme
explores the task from scratch, the DTRL can extract the prior
knowledge of related tasks and reuse it for target tasks.
IV. SYSTEM MODEL AND PROBLEM FORMULATION
A. Overall Architecture
Fig.2 presents the proposed joint radio and cache resource
allocation scheme. A hierarchical architecture is defined for the
resource allocation of cache-enabled macro cellular networks.
We assume the base station (BS) has the caching capability
to store content items. Our proposed scheme can be used
for any number of slices without loss of generality, but here
we mainly consider two typical slices, namely an eMBB
slice and URLLC slice, to better illustrate the framework.
Firstly, the SRMs collect the QoS requirements of associated
UEs, and the collected information is sent to the GRM.
Then the GRM will intelligently implement the inter-slice
resource allocation to divide the radio and cache resource
between eMBB and URLLC SRMs. Finally, SRMs distribute
resource blocks (RBs) to associated UEs and update cached
content items within the allocated caching capacity. We apply
learning-based methods to realize an intelligent GRM, while
rule-based methods are deployed for SRMs. This hierarchical
architecture can alleviate the burden of GRM, since the GRM
only accounts for slice-level allocation instead of handling all
the UEs directly.
B. Communication Model
In this section, we introduce the communication model.
Firstly, the total delay consists of:
d_{j,m,g} = d^{tx}_{j,m,g} + d^{que}_{j,m,g} + (1 - \beta_{j,g}) d^{back}_{j,m,g},    (3)
where d^{tx}_{j,m,g} is the transmission delay of content item g from BS j to UE m, d^{que}_{j,m,g} refers to the queuing delay that content g waits in the buffer of BS j to be transmitted to UE m, and d^{back}_{j,m,g} is the backhaul delay of fetching content items from the core network. \beta_{j,g} is a binary variable: \beta_{j,g} = 1 when content item g is cached at BS j, and \beta_{j,g} = 0 otherwise.
Equation (3) shows that the total delay is affected by both communication and cache resources. The transmission delay depends on the radio resource allocation, and the edge caching can prevent the backhaul delay. The scheduling efficiency will affect the queuing delay.
The transmission delay depends on the link capacity between the BS and UE:
d^{tx} = \frac{L_m}{P_{j,m}},    (4)
where L_m is the size of the content items required by UE m, and P_{j,m} is the link capacity between BS j and UE m, which is calculated as follows:
P_{j,m} = \sum_{q \in N^{RB}_j} b_q \log\Big(1 + \frac{p_{j,q} x_{j,q,m} g_{j,q,m}}{b_q N_0 + \sum_{j' \in J \setminus j} p_{j',q} x_{j',q,m} g_{j',q,m}}\Big),    (5)
where N^{RB}_j is the set of RBs in BS j, b_q is the bandwidth of RB q, N_0 is the noise power density, p_{j,q} is the transmission power of RB q of BS j, x_{j,q,m} is a binary indicator to denote whether RB q is allocated to UE m, g_{j,q,m} is the channel gain between the BS and UE, and j' \in J \setminus j is the BS set except BS j. In the proposed communication system model, we assume orthogonal frequency-division multiplexing (OFDM) is deployed to avoid intra-cell interference [24], and \sum_{j' \in J \setminus j} p_{j',q} x_{j',q,m} g_{j',q,m} in equation (5) indicates the inter-cell interference of the downlink transmission from other BSs [25].
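For illustration, the delay model can be evaluated numerically. The following minimal Python sketch computes the link capacity of equation (5) and the total delay of equations (3)-(4); the bandwidth, power, channel-gain and delay values are hypothetical, and the logarithm in (5) is assumed to be base 2.

import numpy as np

def link_capacity(b_q, p_srv, g_srv, x_alloc, p_int, g_int, n0):
    # Equation (5): sum over RBs of b_q * log2(1 + SINR), where the SINR
    # denominator contains noise plus inter-cell interference from other BSs.
    cap = 0.0
    for q in range(len(b_q)):
        interference = float(np.sum(p_int[:, q] * g_int[:, q]))
        sinr = p_srv[q] * x_alloc[q] * g_srv[q] / (b_q[q] * n0 + interference)
        cap += b_q[q] * np.log2(1.0 + sinr)
    return cap  # bits per second

def total_delay(l_m, capacity, d_queue, d_backhaul, cached):
    # Equations (3) and (4): transmission + queuing + (1 - beta) * backhaul delay.
    d_tx = l_m / capacity
    return d_tx + d_queue + (0.0 if cached else d_backhaul)

# Hypothetical example: 4 RBs of 180 kHz each and one interfering BS.
rng = np.random.default_rng(0)
b_q = np.full(4, 180e3)                        # RB bandwidth [Hz]
p_srv, g_srv = np.full(4, 1.0), rng.rayleigh(1e-7, 4)
p_int, g_int = np.full((1, 4), 1.0), rng.rayleigh(1e-8, (1, 4))
cap = link_capacity(b_q, p_srv, g_srv, np.ones(4), p_int, g_int, n0=4e-21)
print(total_delay(l_m=36 * 8, capacity=cap, d_queue=1e-3, d_backhaul=5e-3, cached=False))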
C. Slicing-based Caching Model
We will introduce the slicing-based caching model in this
section, in which the time-to-live (TTL) method is used as the
content replacement strategy. The TTL indicates the time that
a content item is stored in a caching system before it is deleted
or replaced. The TTL value will be reset if this content item is
required again, and thus popular content items will live longer.
Although there have been many content replacement strategies,
TTL is selected because: i) this paper mainly focuses on the
inter-slice level resource allocation, and it is reasonable to
apply a well-known caching method for intra-slice caching; ii)
TTL requires no prior knowledge of content popularity, which
is more realistic. Nevertheless, our proposed architecture is
compatible with any other caching methods without loss of
generality. The complexity of the slicing-based caching model
lies in how to effectively divide the limited caching capacity
between slices, which is far more complicated than the original
TTL model. We use N_j to represent the slice set in BS j; slice n contains |M_{j,n}| UEs, and each UE is denoted by m. Each slice has its own content catalog G_{j,n}, and the variable g represents a content item (g \in G_{j,n}). We assume all content items have the same packet size [26].
\phi_{j,n,m,g} represents the request rate of UE m for content item g, which denotes the frequency with which content item g is demanded by UE m. Then we have
\phi_{j,n,m,g} = p_{j,n,m,g} \phi_{j,n,m},    (6)
where \phi_{j,n,m} is the total request rate of UE m, and p_{j,n,m,g} is the request rate distribution of UE m over content items (\sum_{g \in G_{j,n}} p_{j,n,m,g} = 1).
The cache hit ratio indicates the probability of finding a content item in the cache, which can be calculated by [27]:
h_{j,n,g} = 1 - e^{-\sum_{m \in M_{j,n}} \phi_{j,n,m,g} T_{j,n,m,g}},    (7)
where h_{j,n,g} is the cache hit ratio of content item g in slice n, and the TTL of content item g is reset to T_{j,n,m,g} when this content item is requested. Note that the number of contents is large and the request rate of each content item is relatively small. Based on the approximation e^x \approx x + 1 as x \to 0, we approximately have:
h_{j,n,g} = \sum_{m \in M_{j,n}} \phi_{j,n,m,g} T_{j,n,m,g}.    (8)
Meanwhile, the total cache hit ratio is related to the allocated storage capacity of slice n:
\sum_{g \in G_{j,n}} \sum_{m \in M_{j,n}} h_{j,n,m,g} = \frac{C_{j,n}}{C_{j,T}},    (9)
where C_{j,n} is the allocated storage capacity for slice n in BS j, C_{j,T} is the total storage capacity of BS j, and h_{j,n,m,g} is the cache hit ratio of UE m for content item g. We assume content item g has the same popularity for different UEs and thus T_{j,n,m,g} = T_{j,n,g}. Given equations (8) and (9), we have (the proof is given in the appendix):
h_{j,n,g} = \frac{C_{j,n}}{C_{j,T}} \frac{\sum_{m \in M_{j,n}} \phi_{j,n,m,g}}{\sum_{m \in M_{j,n}} \phi_{j,n,m}}.    (10)
Equation (10) indicates that a higher caching capacity C_{j,n} leads to a higher cache hit ratio h_{j,n,g} [26]. As such, popular content items, which are indicated by a higher request rate \phi_{j,n,m,g}, have a higher cache hit ratio.
Then we define a cache hit rate \phi^{hit}_{j,n,m} to describe the frequency of requesting cached content items within the total request rate, which can be calculated by:
\phi^{hit}_{j,n,m} = \sum_{g \in G_{j,n}} \phi_{j,n,m,g} h_{j,n,m,g},    (11)
and the cache miss rate is:
\phi^{miss}_{j,n,m} = \phi_{j,n,m} - \phi^{hit}_{j,n,m},    (12)
which indicates the frequency of requesting non-cached content items within the total request rate.
Finally, the backhaul delay only applies when a content item is not cached at the BS. The backhaul service is presumed to obey an M/M/1 queue, and the average delay is [28]:
d^{back}_{j,m,g} = \frac{1}{B/L_m - \sum_{n \in N_j} \sum_{m \in M_{j,n}} \phi^{miss}_{j,n,m}},    (13)
where B is the backhaul capacity, L_m is the average size of content items, B/L_m denotes the service rate, and \sum_{n \in N_j} \sum_{m \in M_{j,n}} \phi^{miss}_{j,n,m} is the total cache miss rate. Note that the backhaul delay becomes infinite if the total cache miss rate is higher than the service rate.
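A short numerical sketch of the caching model is given below, under the linear TTL approximation of equation (8) and with hypothetical request rates; it computes the per-content hit ratio of (10), the hit and miss rates of (11)-(12), and the M/M/1 backhaul delay of (13), assuming the per-UE hit ratio equals the slice-level ratio.

import numpy as np

def hit_ratio(C_n, C_T, phi_mg):
    # Equation (10): hit ratio of each content item in slice n, proportional to
    # the allocated capacity share and the item's share of the total request rate.
    phi_g = phi_mg.sum(axis=0)              # per-item request rate, summed over UEs
    return (C_n / C_T) * phi_g / phi_mg.sum()

def miss_rate(phi_mg, h_g):
    # Equations (11)-(12): per-UE hit rate and miss rate. The per-UE hit ratio
    # h_{j,n,m,g} is assumed equal to the slice-level ratio h_{j,n,g}.
    phi_hit = (phi_mg * h_g[None, :]).sum(axis=1)
    return phi_mg.sum(axis=1) - phi_hit

def backhaul_delay(B, L_m, total_miss):
    # Equation (13): M/M/1 delay with service rate B/L_m and arrival rate = miss rate.
    service_rate = B / L_m
    if total_miss >= service_rate:
        return np.inf
    return 1.0 / (service_rate - total_miss)

# Hypothetical slice: 5 UEs, 10 content items, Zipf-like popularity.
rng = np.random.default_rng(1)
pop = 1.0 / np.arange(1, 11)
phi_mg = rng.uniform(0.5, 1.5, (5, 1)) * (pop / pop.sum())[None, :]  # requests/s
h_g = hit_ratio(C_n=4, C_T=20, phi_mg=phi_mg)
total_miss = miss_rate(phi_mg, h_g).sum()
print(h_g.round(3), backhaul_delay(B=10e6, L_m=36 * 8, total_miss=total_miss))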
D. Problem Formulation
The objective of the eMBB slice is to maximize the total
throughput, while the URLLC slice aims to minimize the
average delay. It is assumed that the content catalogs of the two
slices are not overlapped. To balance the requirements of the
two slices, the GRM needs to jointly consider the objectives
of the two slices and allocate the radio and cache resource
accordingly. For each BS, the GRM allocates radio and cache
resource by:
\max \;\; w\, b_{embb,avg} + (1 - w)(d_{tar} - d_{urllc,avg}),    (14)
s.t. \;\; (3), (5), (10), (11) and (13),    (14a)
\sum_{n \in N_j} N_{j,n} \le N_{j,T},    (14b)
\sum_{n \in N_j} C_{j,n} \le C_{j,T},    (14c)
\sum_{n \in N_j} \sum_{m \in M_{j,n}} \phi^{miss}_m \le \frac{B}{L_m},    (14d)
\sum_{n \in N_j} \sum_{m \in M_{j,n}} x_{j,q,m} \le 1,    (14e)
\sum_{m \in M_{j,n}} \sum_{q \in N^{RB}_j} x_{j,q,m} \le N_{j,n},    (14f)
\sum_{g \in G_{j,n}} \beta_{j,g} \le C_{j,n},    (14g)
where b_{embb,avg} is the total throughput of the eMBB slice, d_{urllc,avg} is the average latency of the URLLC slice, and d_{tar} is the target delay. Here we use w as a weight factor in (14) to balance the objectives of the two slices and maximize the overall objective. N_j is the slice set of BS j, which consists of the eMBB and URLLC slices. N_{j,n} is the number of RBs that the GRM allocates to slice n, N_{j,T} is the total number of RBs of BS j, and equation (14b) is the radio resource constraint. C_{j,n} is the caching capacity of slice n, C_{j,T} is the total caching capacity of BS j, and (14c) ensures that the allocated caching capacity cannot exceed the upper limit. M_{j,n} is the set of UEs in slice n, \phi^{miss}_m is the miss rate of UE m, and (14d) denotes that the total miss rate should not exceed the backhaul service rate. x_{j,q,m} has been defined in equation (5) as the RB allocation indicator. Equations (14e) and (14f) denote that one RB can only be allocated to one UE, and that the total number of available RBs in slice n is N_{j,n}. Finally, \beta_{j,g} is a binary variable that has been defined in equation (3) to represent whether a content item is cached.
In the defined problem formulation, N_{j,n} and C_{j,n} are the control variables of the GRM, which means the GRM only accounts for the inter-slice resource allocation. x_{j,q,m} and \beta_{j,g} are the control variables of the SRMs. In the SRMs, we apply the classic proportional fairness algorithm for intra-slice RB allocation to determine x_{j,q,m}, since all UEs in the same slice are presumed to be equally important [19]. Meanwhile, cached content items are updated by the TTL rule, which determines \beta_{j,g}.
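Since the GRM decision in (14) is a search over discrete inter-slice splits, a brute-force reference is easy to state. The sketch below enumerates (N_embb, C_embb) pairs under constraints (14b)-(14c) and picks the split with the largest weighted objective; the evaluation function and its numbers are hypothetical placeholders, and the learning-based GRM of Section V replaces this exhaustive search in practice.

def best_split(N_T, C_T, w, d_tar, evaluate):
    """Exhaustive search over inter-slice splits for the objective of (14).

    `evaluate(N_embb, C_embb)` is assumed to return the resulting
    (eMBB throughput, average URLLC delay) for a given split, e.g. from a
    network simulator; it is a placeholder here."""
    best, best_obj = None, float("-inf")
    for n_embb in range(N_T + 1):              # constraint (14b)
        for c_embb in range(C_T + 1):          # constraint (14c)
            b_embb, d_urllc = evaluate(n_embb, c_embb)
            obj = w * b_embb + (1.0 - w) * (d_tar - d_urllc)
            if obj > best_obj:
                best, best_obj = (n_embb, N_T - n_embb, c_embb, C_T - c_embb), obj
    return best, best_obj

# Toy evaluation: throughput grows with eMBB resources, URLLC delay shrinks
# with the resources left to the URLLC slice (purely illustrative numbers).
toy = lambda n, c: (0.5 * n + 0.2 * c, 10.0 / (1 + (100 - n)) + 5.0 / (1 + (20 - c)))
print(best_split(N_T=100, C_T=20, w=0.5, d_tar=2.0, evaluate=toy))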
TABLE II
SUMMARY OF STRATEGIES AND MDP DEFINITIONS

Expert 1 — Strategy: Q-learning based radio resource allocation; no caching capability. Action²: (N_embb, N_urllc).
Expert 2 — Strategy: fixed radio resource allocation; Q-learning based caching capacity allocation. Action²: (C_embb, C_urllc).
Learner 1 (QDTRL) — Strategy: DTRL-based joint radio and cache resource allocation with the Q-value mapping function. Action²: (N_embb, N_urllc, C_embb, C_urllc), where N_embb and C_embb denote the number of RBs and the caching capacity allocated to the eMBB slice, respectively; N_urllc and C_urllc are defined similarly for the URLLC slice.
Learner 2 (ADTRL) — Strategy: DTRL-based joint radio and cache resource allocation with the action selection mapping function. Action²: same as Learner 1.
Baseline 1¹ — Strategy: EB-DQN for joint radio and cache resource allocation without any prior knowledge. Action²: same as Learner 1.
States (all strategies): (s_embb, s_urllc), where s_embb denotes the number of eMBB packets in the queue, and s_urllc is defined similarly.
Instant reward (all strategies): r = w r_embb + (1 - w) r_urllc, where r_embb = \frac{2}{\pi}\tan^{-1}(b_{embb,avg}) and r_urllc = \frac{2}{\pi}\tan^{-1}(d_{tar} - d_{urllc,avg}). w is the weighting factor, r_embb and r_urllc are the rewards of the eMBB and URLLC slices, respectively, b_{embb,avg} and d_{urllc,avg} are the average throughput of the eMBB slice and the average delay of the URLLC slice, and d_{tar} is the target URLLC delay.
¹ We include PPF-TTL as the second baseline. However, since PPF-TTL is not an ML-based method, it is excluded from this table.
² Given the total number of RBs N_{j,T}, N_urllc can be easily calculated once N_embb has been decided. However, we present the action definition using (N_embb, N_urllc) for better readability and scalability if more slices are included, and similarly for the cache resource allocation action (C_embb, C_urllc).
V. DEEP TRANSFER REINFORCEMENT LEARNING BASED RESOURCE ALLOCATION
A. Overall framework
In this section, we introduce the DTRL-based inter-slice
resource allocation, where each BS is considered as an in-
dependent agent to make decisions autonomously. As shown
in Table II, five learning-based strategies are deployed. We
assume experts 1 and 2 apply Q-learning for radio and cache
resource allocation, respectively. Experts are only good at
one of radio or cache resource allocation, but they have
no multi-task knowledge. Then learners 1 and 2 can utilize
knowledge from experts to improve their own performance on
joint radio and cache resource allocation. Based on different
mapping functions, we propose two DTRL-based methods,
namely QDTRL and ADTRL. Finally, we apply EB-DQN as
a learning-based benchmark and the PPF-TTL method as a
model-based baseline. In the following, we will introduce the
experts, learners and baselines.
B. Q-learning based Experts
In this section, we assume the expert agents have learning
experience on one specific task, but they have no knowledge
of other tasks. For expert 1, it uses Q-learning for the radio
resource allocation, and there is no caching capability. For
expert 2, RBs are allocated by the numbers of UEs in
each slice, and Q-learning is used for the caching capacity
allocation.
To transform the problem formulation in equation (14) to the RL context, we first define the MDP (S, A, T, R) for the experts, where S is the state set, A is the action set, T is the transition probability, and R is the reward function. The MDP definitions of the experts are given below:
State: In this work, we intend to coordinate the performance of various slices by inter-slice resource allocation. As such, the state definition should reflect the transmission demand of each slice. The states of experts 1 and 2 are both defined by (s_embb, s_urllc), which indicates the number of packets waiting in the queues of the eMBB and URLLC slices, respectively.
Action of expert 1: Expert 1 only implements radio resource allocation, and consequently the action (N_embb, N_urllc) denotes the number of RBs allocated to the eMBB and URLLC slices.
Action of expert 2: In expert 2, the learning strategy is only applied for caching capacity allocation, and the action (C_embb, C_urllc) indicates the caching capacity allocated to the eMBB and URLLC slices.
It is worth noting that the actions N_embb, N_urllc, C_embb and C_urllc have been defined as control variables in the problem formulation (14), and they are transformed to actions here to serve the Q-learning scheme.
Reward function: The reward functions of experts 1 and 2 are defined by the objectives of the slices:
r = w r_{embb} + (1 - w) r_{urllc},    (15)
where r_{embb} and r_{urllc} are the rewards of the eMBB and URLLC slices, respectively, and w is the weight factor.
For the eMBB slice, obtaining higher throughput leads to a higher reward, and we have:
r_{embb} = \frac{2}{\pi}\tan^{-1}(b_{embb,avg}),    (16)
where b_{embb,avg} is the average throughput of the eMBB slice, and we apply the \tan^{-1} function to normalize the reward (0 < r_{embb} < 1).
For the URLLC slice, a lower delay means a higher reward:
r_{urllc} = \frac{2}{\pi}\tan^{-1}(d_{tar} - d_{urllc,avg}),    (17)
where d_{tar} and d_{urllc,avg} are the target and achieved average delays for the URLLC slice, respectively. Note that both r_{embb} and r_{urllc} are normalized to balance the performance metrics of the two slices.
Moreover, to guarantee the constraints of the problem formulation (14), we apply penalties to the reward when these constraints are violated.
With Q-learning, the agent aims to maximize the long-term expected reward:
V(s_e) = E_\pi\Big(\sum_{i=0}^{\infty} \gamma^i r(s_{e,i}, a_{e,i}) \,\Big|\, s_e = s_{e,0}\Big),    (18)
where V(s_e) is the long-term expected accumulated reward of the experts at state s_e, and here we use the notation e to indicate experts. s_{e,0} is the initial state, r(s_{e,i}, a_{e,i}) denotes the reward of selecting action a_{e,i} at state s_{e,i} in episode i, and \gamma is the discount factor (0 < \gamma < 1).
Then the state-action values of experts 1 and 2 are updated by:
Q^{new}(s_e, a_e) = Q^{old}(s_e, a_e) + \alpha\big(r_e + \gamma \max_{a} Q(s'_e, a) - Q^{old}(s_e, a_e)\big),    (19)
where Q^{old}(s_e, a_e) and Q^{new}(s_e, a_e) are the old and new Q-values, s_e and s'_e are the current and next states of the experts, respectively, a_e is the action, r_e is the reward, and \alpha is the learning rate (0 < \alpha < 1). By updating the Q-values iteratively, the experts will learn the optimal action selections to achieve the best accumulated reward.
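As an illustration of the expert agents, the following minimal tabular sketch implements the update of equation (19) with the normalized reward of (15)-(17); the state encoding, action grid and slice measurements are hypothetical placeholders.

import math
import random
from collections import defaultdict

def reward(b_embb_avg, d_urllc_avg, d_tar, w=0.5):
    # Equations (15)-(17): weighted sum of the arctan-normalized slice rewards.
    r_embb = 2.0 / math.pi * math.atan(b_embb_avg)
    r_urllc = 2.0 / math.pi * math.atan(d_tar - d_urllc_avg)
    return w * r_embb + (1.0 - w) * r_urllc

class ExpertQAgent:
    """Expert 1 style agent: Q-learning over RB splits (N_embb, N_urllc)."""
    def __init__(self, actions, alpha=0.1, gamma=0.5, eps=0.05):
        self.q = defaultdict(float)            # Q-table keyed by (state, action)
        self.actions, self.alpha, self.gamma, self.eps = actions, alpha, gamma, eps

    def act(self, s):
        if random.random() < self.eps:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(s, a)])

    def update(self, s, a, r, s_next):
        # Equation (19): Q <- Q + alpha * (r + gamma * max_a' Q(s', a') - Q).
        target = r + self.gamma * max(self.q[(s_next, a2)] for a2 in self.actions)
        self.q[(s, a)] += self.alpha * (target - self.q[(s, a)])

# Hypothetical usage: states are (queued eMBB, queued URLLC) packet counts.
agent = ExpertQAgent(actions=[(n, 100 - n) for n in range(0, 101, 10)])
s, a = (12, 3), agent.act((12, 3))
r = reward(b_embb_avg=45.0, d_urllc_avg=1.2, d_tar=2.0)
agent.update(s, a, r, s_next=(10, 4))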
C. Deep Transfer Reinforcement Learning based Learners
In this section, we propose two DTRL-based algorithms,
namely QDTRL (learner 1) and ADTRL (learner 2). QDTRL
utilizes the Q-values of the experts as prior knowledge, while
ADTRL uses the action selection experience of experts for
improved performance. Consequently, we define two different
mapping functions to transfer knowledge from experts to
learners 1 and 2, respectively.
In this section, we first define the states, actions and rewards
of learners, then we introduce the DDQN framework for
two learners. Finally, we present the proposed QDTRL and
ADTRL algorithms.
1) MDP and DDQN Framework for Learners
Given the prior knowledge of experts, learners are expected
to solve more complicated problems with a larger state-action
space. To apply TRL, we define the MDP by (S, A, R, T, F), where F is the mapping function. The states, actions and reward functions of the two learners are given below:
State and reward function: As shown in Table II, we assume the learners have the same state and reward definitions as the experts. The main reason is that transfer learning is designed for tasks that share some similarities with existing expert tasks. As such, similar state and reward definitions can reduce the complexity of defining mapping functions, which can better transfer the knowledge from experts to learners.
Action of learners 1 and 2: Compared with the experts, the learners have to jointly consider the radio and cache resource allocation, and then the actions are defined by (N_{embb}, N_{urllc}, C_{embb}, C_{urllc}). Compared with allocating one single resource, the joint resource allocation problem is much more complicated, especially when multiple slices are involved.
Based on the MDP definitions, we introduce our DTRL algorithm. In conventional Q-learning, Q-values are updated by:
Q^{new}(s_l, a_l) = Q^{old}(s_l, a_l) + \alpha\big(r_l + \gamma \max_{a} Q(s'_l, a) - Q^{old}(s_l, a_l)\big),    (20)
where s_l and a_l are the state and action of the learner, respectively, s'_l is the next state, and r_l is the reward. Here we use the notation l to indicate the learner. Q-learning applies a Q-table to record state-action values, and consequently it may suffer from slow convergence when the state-action space is huge. To this end, DQN has been proposed, using deep neural networks to predict Q-values [22].
When the Q-values in equation (20) converge, we have Q^{old}(s_l, a_l) = Q^{new}(s_l, a_l) and Q^{old}(s_l, a_l) = r_l + \gamma \max_{a} Q(s'_l, a). Then, a loss function can be defined for the network training of DQN:
L(w) = Er\big(r_l + \gamma \max_{a} Q(s'_l, a, \bar{w}) - Q(s_l, a_l, w)\big),    (21)
where Er is the loss function representing the error between the prediction results r_l + \gamma \max_{a} Q(s'_l, a, \bar{w}) and the target results Q(s_l, a_l, w). w is the weight of the main network, which predicts the current Q-values Q(s_l, a_l, w), and \bar{w} is the weight of the target network, which predicts the target Q-values Q(s'_l, a, \bar{w}).
In DQN, note that the action selection and evaluation are both implemented by the target network, which is indicated by \max_{a} Q(s'_l, a, \bar{w}). Meanwhile, the target Q-values are calculated by the maximum Q-value of the next state. If the maximize operator is always included in the Q-value calculation, then the Q-values predicted by the neural network tend to be overestimated [29]. To this end, DDQN has been proposed to decouple the action selection and evaluation. The loss function of DDQN is defined as:
L(w) = Er\big(r_l + \gamma Q(s'_l, \arg\max_{a} Q(s'_l, a, w), \bar{w}) - Q(s_l, a_l, w)\big),    (22)
where the main network chooses actions by a_l = \arg\max_{a} Q(s'_l, a, w), and the target network evaluates the action by Q(s'_l, a_l, \bar{w}). By decoupling the action selection and evaluation, DDQN can prevent overestimation and better predict Q-values than DQN.
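The decoupling in (22) amounts to letting the main network pick the greedy next action and the target network score it. A framework-agnostic sketch is given below, with the two networks represented simply as precomputed Q-value arrays.

import numpy as np

def ddqn_targets(q_main_next, q_target_next, rewards, dones, gamma=0.5):
    """Double-DQN targets in the spirit of equation (22).

    q_main_next, q_target_next: arrays of shape (batch, num_actions) with the
    Q-values of the next states predicted by the main and target networks."""
    greedy = np.argmax(q_main_next, axis=1)                    # selection: main net
    evaluated = q_target_next[np.arange(len(greedy)), greedy]  # evaluation: target net
    return rewards + gamma * evaluated * (1.0 - dones)

# Tiny example with 3 transitions and 4 discrete actions.
rng = np.random.default_rng(2)
q_main_next, q_target_next = rng.normal(size=(3, 4)), rng.normal(size=(3, 4))
targets = ddqn_targets(q_main_next, q_target_next,
                       rewards=np.array([0.3, 0.7, 0.1]),
                       dones=np.array([0.0, 0.0, 1.0]))
print(targets)  # compared against Q(s_l, a_l, w) in the loss of (22)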
Fig. 3. Proposed deep transfer reinforcement learning architecture for resource management.
In this work, we deploy the DDQN architecture in the proposed DTRL. For the learner agent, shown in grey in Fig. 3, an action a_l is first selected and sent to the environment. Then a tuple (s_l, a_l, r_l, s'_l) is received from the environment and saved in the experience pool. The learner agent samples a random minibatch from the experience pool. For every tuple (s_l, a_l, r_l, s'_l), the main network predicts Q(s_l, a_l, w) and selects actions by a = \arg\max_{a} Q(s'_l, a, w). The target network evaluates the action by Q(s'_l, a, \bar{w}). Then we utilize the loss function shown in equation (22) for gradient descent to update the weight w of the main network. After several training sessions, the target network copies the weight parameters of the main network. Such a late update of the target network serves as a stable reference for the main network training. Here we deploy the Long Short-Term Memory (LSTM) network as the hidden layers of the main and target networks. As a special recurrent neural network, LSTM can better capture long-term data dependency, which makes it an ideal candidate to handle complicated wireless network environments [30].
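As a sketch of the network structure described above, one possible realization in PyTorch stacks LSTM hidden layers under a linear output head; the layer sizes here are assumptions chosen for illustration rather than the exact simulation settings.

import torch
import torch.nn as nn

class LSTMQNetwork(nn.Module):
    """Main/target Q-network with LSTM hidden layers, as used in the DTRL agent."""
    def __init__(self, state_dim, num_actions, hidden=35, layers=2):
        super().__init__()
        self.lstm = nn.LSTM(state_dim, hidden, num_layers=layers, batch_first=True)
        self.head = nn.Linear(hidden, num_actions)

    def forward(self, state_seq):
        # state_seq: (batch, sequence_length, state_dim); the last hidden output
        # of the sequence is mapped to one Q-value per discrete action.
        out, _ = self.lstm(state_seq)
        return self.head(out[:, -1, :])

main_net = LSTMQNetwork(state_dim=2, num_actions=20)
target_net = LSTMQNetwork(state_dim=2, num_actions=20)
target_net.load_state_dict(main_net.state_dict())   # delayed copy of the weights
q_values = main_net(torch.randn(8, 5, 2))            # batch of 8, sequences of 5 states
print(q_values.shape)                                 # torch.Size([8, 20])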
Finally, it is worth noting that we include two different
mapping functions in the proposed DTRL scheme. The Q-
value mapping function will affect the reward calculation of
the learner agent (indicated by the pink line in Fig.3), while
the action selection mapping function influences the action
selection (shown by the blue line in Fig.3). Accordingly,
we propose two DTRL-based methods, namely QDTRL and
ADTRL, and in the following we will introduce these two
mapping functions and corresponding algorithms.
2) Learner 1: Q-value based Deep Transfer Reinforcement
Learning
In QDTRL, the Q-values of the experts are presumed to be
the prior knowledge of the learner. The main idea behind this
is to encourage learners to select actions that have higher Q-
values in the experts. Considering the task similarities, actions
with a higher Q-value of experts are very likely to bring similar
high rewards for the learner. In particular, we consider the Q-
values of the experts as extra rewards for learners, which is
expected to improve exploration efficiency by selecting actions
with higher potential rewards [31].
Firstly, the loss function in QDTRL is defined by:
L(w) = Er\big(\sigma_1 Q_E(F(s_l), F'(a_l)) + r_l + \gamma Q(s'_l, \arg\max_{a} Q(s'_l, a, w), \bar{w}) - Q(s_l, a_l, w)\big),    (23)
where F and F' are the state and action mapping functions, respectively. Compared with equation (22), the main difference is that \sigma_1 Q_E(F(s_l), F'(a_l)) is involved as an extra reward of selecting action a_l under state s_l. \sigma_1 is the transfer learning rate, which describes the importance of the prior knowledge (0 \le \sigma_1 \le 1). A higher transfer learning rate means the prior knowledge utilization is more important than the learner's own learning process, while a lower value indicates the reverse.
In equation (23), we apply \sigma_1 Q_E(F(s_l), F'(a_l)) to guide the action selection of the learner. However, due to the different state-action spaces, the Q-values of the experts cannot be directly utilized by the learner; thus a function is needed to map the experts' Q-values to the learner's Q-table. The Q-value mapping function consists of state mapping and action mapping, and the Q_E term in equation (23) is generated by:
Q_E(F(s_l), F'(a_l)) = Q_{e,1}(s_{e,1}, a_{e,1}) + Q_{e,2}(s_{e,2}, a_{e,2}),    (24)
where Q_{e,1}, s_{e,1} and a_{e,1} are the Q-value, state and action of expert 1, respectively, and Q_{e,2}, s_{e,2} and a_{e,2} are defined similarly for expert 2. The objective of the mapping functions F and F' is to find the states and actions of the experts that are close to s_l and a_l. Considering the task similarities, we can use the existing decision knowledge of the expert agents to guide the action selection of the learner by finding similar states and actions [32]. F and F' are defined by:
State mapping F: For a given state s_l, considering that the experts and learner 1 have the same state definition, we can always find s_l = s_{e,1} = s_{e,2}. Thus F can be easily defined.
Action mapping F': The goal of F' is to find (a_{e,1}, a_{e,2}) = F'(a_l). Any action a_l, which is defined as a_l = (N_{embb}, N_{urllc}, C_{embb}, C_{urllc}), can be decomposed into the combination of a_{e,1} = (N_{embb}, N_{urllc}) and a_{e,2} = (C_{embb}, C_{urllc}). Then F' can be defined accordingly.
Based on the state and action mapping relationships, given s_l and a_l, we can always find specific Q_{e,1}(s_{e,1}, a_{e,1}) and Q_{e,2}(s_{e,2}, a_{e,2}). Then Q_E(F(s_l), F'(a_l)) can be directly used by learner 1.
Fig. 4. Proposed Q-value-based mapping function for deep transfer reinforcement learning.
The defined Q-value mapping function is summarized in Fig. 4 by steps 1 to 4. First, for any given (s_l, a_l), we find the state-action pairs (s_{e,1}, a_{e,1}) and (s_{e,2}, a_{e,2}) by the state mapping function F and the action mapping function F'. Then, we extract Q_{e,1}(s_{e,1}, a_{e,1}) and Q_{e,2}(s_{e,2}, a_{e,2}) from the expert agents' Q-tables. After that, we generate Q_E(F(s_l), F'(a_l)) by equation (24), which is considered as an extra reward when selecting a_l under s_l. This extra reward is added to r_l, and the new tuple (s_l, a_l, r_l + \sigma_1 Q_E(F(s_l), F'(a_l)), s'_l) is saved in the experience pool. Finally, we implement the gradient descent by equation (23) for the network training.
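The mapping functions F and F' and the augmented reward of equations (23)-(24) reduce to a table lookup plus a sum, as the following sketch shows; the expert Q-tables and their keys are hypothetical.

def q_transfer(s_l, a_l, q_expert1, q_expert2, sigma1=0.7):
    """Q-value mapping of equations (23)-(24).

    State mapping F: the learner state is reused directly for both experts.
    Action mapping F': the joint action (N_embb, N_urllc, C_embb, C_urllc) is
    decomposed into the radio part for expert 1 and the cache part for expert 2."""
    n_embb, n_urllc, c_embb, c_urllc = a_l
    a_e1, a_e2 = (n_embb, n_urllc), (c_embb, c_urllc)
    q_e = q_expert1.get((s_l, a_e1), 0.0) + q_expert2.get((s_l, a_e2), 0.0)
    return sigma1 * q_e          # added to r_l before the tuple enters the pool

# Hypothetical expert Q-tables keyed by (state, action).
q_expert1 = {((12, 3), (70, 30)): 0.8}
q_expert2 = {((12, 3), (15, 5)): 0.6}
s_l, a_l, r_l = (12, 3), (70, 30, 15, 5), 0.4
augmented_reward = r_l + q_transfer(s_l, a_l, q_expert1, q_expert2)
print(augmented_reward)          # 0.4 + 0.7 * (0.8 + 0.6) = 1.38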
3) Learner 2: Action Selection based Deep Transfer Reinforcement Learning
Learners are mainly designed to handle more complicated problems than experts, which usually means larger state or action spaces. For instance, the joint resource allocation problem has a much larger action space than allocating one single resource, which results in longer convergence. To this end, we propose the ADTRL algorithm to improve exploration efficiency by evaluating the potential optimality of actions.
More specifically, we first apply a lax bisimulation metric
to assess the MDP similarities between learners and experts.
Then, we calculate the potential advantage of different actions
in the learner and produce a lower bound for the optimality.
Finally, the optimality metrics of different actions are nor-
malized, and we assume that actions with higher potential
optimality are more likely to be selected in the exploration
phase. In the following, we will introduce the proposed method
in detail.
It is worth noting that TL is mainly applied to learner
tasks that are related to expert tasks. Specifically, it requires
similarities between expert and learner MDPs. Then, we first
introduce the Kantorovich distance K(D)(Y, Z) to describe the similarities between two distributions [33]:
\max_{u_f,\, f=1,2,...,|S|} \;\; \sum_{f=1}^{|S|} (Y(s_f) - Z(s_f)) u_f,
s.t. \;\; u_f - u_k \le D(s_f, s_k), \;\; \forall f, k = 1, 2, ..., |S|,
\;\;\;\; 0 \le u_f \le 1,    (25)
where Y and Z are two probability distributions over s_f \in S, u_f and u_k are internal optimization variables, and D(s_f, s_k) denotes a metric D to assess the distance between s_f and s_k. K(D)(Y, Z) shows the distance between the two probability distributions Y and Z under the metric D on the set S. However, in TL, we focus more on the distance metric between different MDPs, and K(D)(Y, Z) is rewritten as [34]:
\max_{u_f,\, f=1,...,|S_1|;\; v_k,\, k=1,...,|S_2|} \;\; \sum_{f=1}^{|S_1|} Y(s_f) u_f - \sum_{k=1}^{|S_2|} Z(s_k) v_k,
s.t. \;\; u_f - v_k \le D(s_f, s_k),
\;\;\;\; -1 \le u_f \le 1,    (26)
where S_1 and S_2 are two sets with s_f \in S_1 and s_k \in S_2, respectively, and K(D)(Y, Z) evaluates the distance between the two distribution sets S_1 and S_2. Here f = 1, 2, ..., |S_1| means that s_f has |S_1| possible values in set S_1, and the probability distribution function Y of s_f satisfies \sum_{f=1}^{|S_1|} Y(s_f) = 1. s_k, S_2 and the probability distribution function Z can be defined similarly.
Then we include two MDPs <S_e, A_e, T_e, R_e> and <S_l, A_l, T_l, R_l> to identify the difference of their state-action pairs. The lax bisimulation metric is introduced to evaluate the distance between two state-action pairs [35]:
D((s_e, a_e), (s_l, a_l)) := \theta_1 |r_e(s_e, a_e) - r_l(s_l, a_l)| + \theta_2 K(D)(Y(s_e, a_e), Z(s_l, a_l)),    (27)
where \theta_1 and \theta_2 are weight factors. The first term |r_e(s_e, a_e) - r_l(s_l, a_l)| represents the reward distance, and K(D)(Y(s_e, a_e), Z(s_l, a_l)) is the Kantorovich metric for the state-action pairs under the semi-metric D. D is defined by the Hausdorff metric:
D(s_e, s_l) = \max\Big( \max_{a_e \in A_e} \min_{a_l \in A_l} D((s_e, a_e), (s_l, a_l)), \;\; \max_{a_l \in A_l} \min_{a_e \in A_e} D((s_e, a_e), (s_l, a_l)) \Big).    (28)
The Hausdorff distance is the maximum distance from one set to the nearest point of the other set [36], and here we define D(s_e, s_l) to measure the distance between the action sets A_e and A_l under states s_e and s_l.
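For small discrete MDPs, both ingredients of (27)-(28) can be computed directly. The sketch below solves the dual linear program of (26) with SciPy to obtain the Kantorovich term (bounding both dual variables in [-1, 1], which is an assumption), combines it with the reward distance as in (27), and evaluates the Hausdorff reduction of (28) as a max-min over a pairwise distance matrix; all numbers are toy values.

import numpy as np
from scipy.optimize import linprog

def kantorovich(Y, Z, D):
    """Dual LP of equation (26): max sum_f Y_f u_f - sum_k Z_k v_k
    subject to u_f - v_k <= D[f, k], with u, v bounded in [-1, 1]."""
    F, K = len(Y), len(Z)
    c = np.concatenate([-np.asarray(Y), np.asarray(Z)])   # linprog minimizes
    A_ub, b_ub = [], []
    for f in range(F):
        for k in range(K):
            row = np.zeros(F + K)
            row[f], row[F + k] = 1.0, -1.0
            A_ub.append(row)
            b_ub.append(D[f, k])
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(-1.0, 1.0)] * (F + K), method="highs")
    return -res.fun

def lax_bisimulation(r_e, r_l, Y, Z, D_states, theta1=1.0, theta2=0.5):
    # Equation (27): reward distance plus the Kantorovich term over next states.
    return theta1 * abs(r_e - r_l) + theta2 * kantorovich(Y, Z, D_states)

def hausdorff(pair_dist):
    # Equation (28): pair_dist[i, j] = D((s_e, a_e^i), (s_l, a_l^j)).
    return max(pair_dist.min(axis=1).max(), pair_dist.min(axis=0).max())

# Toy example: two next states per MDP, unit ground distance between them.
D_states = np.array([[0.0, 1.0], [1.0, 0.0]])
d = lax_bisimulation(r_e=0.9, r_l=0.7, Y=[0.8, 0.2], Z=[0.3, 0.7], D_states=D_states)
print(d, hausdorff(np.array([[0.2, 0.5], [0.4, 0.1]])))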
To evaluate the potential optimality of selecting a_l under s_l, we include the Bellman optimality to find the state value difference:
|Q^*_l(s_l, a_l) - V^*_e(s_e)|
= |Q^*_l(s_l, a_l) - Q^*_e(s_e, \pi^*(s_e))|
= |Q^*_l(s_l, a_l) - Q^*_e(s_e, a^*_e)|
= |(r_l(s_l, a_l) + \gamma \sum_{s'_l \in S_l} Y(s'_l|s_l, a_l) V_l(s'_l)) - (r_e(s_e, a^*_e) + \gamma \sum_{s'_e \in S_e} Z(s'_e|s_e, a^*_e) V_e(s'_e))|
\le |r_l(s_l, a_l) - r_e(s_e, a^*_e)| + \gamma |\sum_{s'_l \in S_l} Y(s'_l|s_l, a_l) V_l(s') - \sum_{s'_e \in S_e} Z(s'_e|s_e, a^*_e) V_e(s')|
\le |r_l(s_l, a_l) - r_e(s_e, a^*_e)| + \max_{a_l \in A_l} \min_{a_e \in A_e} \big( \gamma |\sum_{s'_l \in S_l} Y(s'_l|s_l, a_l) V_l(s') - \sum_{s'_e \in S_e} Z(s'_e|s_e, a_e) V_e(s')| \big)    (using equations (26) and (28))
= |r_e(s_e, a^*_e) - r_l(s_l, a_l)| + \gamma K(D)(Y(s_e, a^*_e), Z(s_l, a_l))    (setting \theta_1 = 1 and \theta_2 = \gamma in equation (27))
= D((s_l, a_l), (s_e, a^*_e)),    (29)
where Q^*_l(s_l, a_l) is the optimal state-action value of (s_l, a_l), V^*_e(s_e) is the optimal state value of s_e, s'_l is the next state of s_l, and Y(s'_l|s_l, a_l) is the probability of arriving at s'_l by implementing a_l under s_l. Here we use a^*_e = \pi^*(s_e) to represent the action selection of the expert agent. Equation (29) shows that there is an upper bound on the difference between the state-action pairs (s_l, a_l) and (s_e, a^*_e). In the following, we will introduce how to utilize equation (29) to improve the action selection of the learner.
When selecting an action, we usually consider V^*_l(s_l) as a target value for Q^*_l(s_l, a_l), and we have V^*_l(s_l) = \max_{a_l} Q^*_l(s_l, a_l). Then V^*_l(s_l) - Q^*_l(s_l, a_l) can be used to evaluate the potential optimality of a_l by:
V^*_l(s_l) - Q^*_l(s_l, a_l)
= Q^*_l(s_l, \pi^*(s_l)) - Q^*_l(s_l, a_l)
= Q^*_l(s_l, a^*_l) - Q^*_l(s_l, a_l)
= |Q^*_l(s_l, a^*_l) - Q^*_l(s_l, a_l)|
= |Q^*_l(s_l, a^*_l) - V^*_e(s_e) + V^*_e(s_e) - Q^*_l(s_l, a_l)|
\le |Q^*_l(s_l, a^*_l) - V^*_e(s_e)| + |V^*_e(s_e) - Q^*_l(s_l, a_l)|
\le D((s_l, a^*_l), (s_e, a^*_e)) + D((s_l, a_l), (s_e, a^*_e))    (by using equation (29)).    (30)
Equation (30) gives an upper bound for the potential optimality of selecting a_l under s_l, and then a lower bound O_l(s_l, a_l) can be easily found by:
Q^*_l(s_l, a_l) - V^*_l(s_l)
= Q^*_l(s_l, a_l) - Q^*_l(s_l, \pi^*(s_l))
\ge -D((s_l, a^*_l), (s_e, a^*_e)) - D((s_l, a_l), (s_e, a^*_e))
= O_l(s_l, a_l).    (31)
In O_l(s_l, a_l), note that D((s_l, a^*_l), (s_e, a^*_e)) will not affect the a_l selection, since it is a constant value for given s_l and s_e.
Fig. 5. Summary of the proposed action selection mapping method.
Considering that we have two experts in this work, we rewrite O_l(s_l, a_l) as:
O_l(s_l, a_l) = -\sigma_2 \big( D((s_l, a_l), (s_{e,1}, a^*_{e,1})) + D((s_l, a_l), (s_{e,2}, a^*_{e,2})) \big) - (1 - \sigma_2) \big( D((s_l, a^*_l), (s_{e,1}, a^*_{e,1})) + D((s_l, a^*_l), (s_{e,2}, a^*_{e,2})) \big),    (32)
where \sigma_2 is the transfer learning rate of ADTRL (0 \le \sigma_2 \le 1). If \sigma_2 = 0, then O_l(s_l, a_l) becomes a constant value for all a_l \in A_l, which means the transferred knowledge will not affect the action selection of the learner. By contrast, if \sigma_2 = 1, O_l(s_l, a_l) totally depends on the lax bisimulation metric between (s_l, a_l) and (s_e, a^*_e), which indicates that the learner will imitate the action selection of the experts.
In summary, given <S_e, A_e, T_e, R_e> as an expert MDP and <S_l, A_l, T_l, R_l> as a learner MDP, O_l(s_l, a_l) defines the lower bound of the potential optimality of a_l in terms of the distance between (s_l, a_l) and (s_e, a^*_e). Here s_e is considered as the expert state that is closest to s_l. In this work, we assume experts and learners have the same state definitions, and s_e can be easily found accordingly for any s_l. Finally, the probability of choosing a_l is given by:
Pr(a_l|s_l) = \frac{Sig(O_l(s_l, a_l))}{\sum_{a \in A_l} Sig(O_l(s_l, a))},    (33)
where Sig denotes the sigmoid function used for normalization. Equation (33) means that actions with higher potential optimality have a higher chance of being selected, and consequently it improves the exploration efficiency.
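The resulting exploration rule of (31)-(33) can be sketched in a few lines: the potential-optimality lower bound is formed from the two distance terms weighted by the transfer learning rate and then normalized through a sigmoid. The distance values below are hypothetical inputs that would come from the lax bisimulation metric.

import numpy as np

def action_probabilities(d_to_expert, d_star_to_expert, sigma2=0.7):
    """Equations (32)-(33): sigmoid-normalized selection probabilities.

    d_to_expert[i]: sum of the distances D((s_l, a_l^i), (s_e, a_e*)) over experts.
    d_star_to_expert: the same sum evaluated at the learner's greedy action a_l*."""
    O = -sigma2 * np.asarray(d_to_expert) - (1.0 - sigma2) * d_star_to_expert
    sig = 1.0 / (1.0 + np.exp(-O))            # sigmoid of the optimality lower bound
    return sig / sig.sum()

# Hypothetical distances for 5 candidate joint actions; smaller distance to the
# experts' preferred behaviour means a higher chance of being explored.
probs = action_probabilities(d_to_expert=[0.1, 0.4, 0.9, 1.5, 0.2],
                             d_star_to_expert=0.1)
print(probs.round(3), probs.sum())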
Finally, we summarize the proposed action selection map-
ping method in Fig.5. Given the expert and learner MDPs as
input, we first calculate the Kantorovich and Hausdorff metrics
using equations (26) and (28), respectively. Then we calculate
the lax bisimulation metric via equation (27), and evaluate the
potential optimality of actions using equations (30) to (32).
Consequently, the optimality metrics are normalized, and the
action selection probability is produced by applying equation
(33). The proposed QDTRL and ADTRL are summarized in
Algorithm 1 and 2, respectively.
Algorithm 1 QDTRL-based joint resource allocation
1: Initialize: wireless and QDTRL parameters.
2: for TTI = 1 to T_total do
3:   for each BS do
4:     With probability ε, select an action randomly; otherwise, choose the action by a_l = \arg\max_{a} Q(s_l, a, w).
5:     The GRM implements the inter-slice resource allocation as in equation (14).
6:     SRMs distribute radio resources to UEs by proportional fairness, and replace cached content items.
7:     Update the system state, and save (s_l, a_l, r_l, s'_l) to the experience pool. Every C TTIs, sample a minibatch from the experience pool randomly.
8:     Find F(s_l) and F'(a_l) for every s_l and a_l in the minibatch.
9:     Generate the target Q-values:
       Q^{Tar}(s_l, a_l) = r_l, if done;
       Q^{Tar}(s_l, a_l) = \sigma_1 Q_E(F(s_l), F'(a_l)) + r_l + \gamma Q(s'_l, \arg\max_{a} Q(s'_l, a, w), \bar{w}), otherwise.
10:    Update w using gradient descent by minimizing the loss L(w) = Er(Q^{Tar}(s_l, a_l) - Q(s_l, a_l, w)).
11:    Copy w to \bar{w} after several training iterations.
12:   end for
13: end for
14: Output: performance of the network and the learning algorithm.
Algorithm 2 ADTRL-based joint resource allocation
1: Initialize: wireless and ADTRL parameters.
2: for TTI = 1 to T_total do
3:   for each BS do
4:     With probability ε, select the action a_l by using equation (33); otherwise, choose a_l by \arg\max_{a} Q(s_l, a, w).
5:     The GRM implements the inter-slice resource allocation as in equation (14).
6:     SRMs distribute radio resources to UEs by proportional fairness, and replace cached content items.
7:     Update the system state, and save (s_l, a_l, r_l, s'_l) to the experience pool. Every C TTIs, sample a minibatch from the experience pool randomly.
8:     Generate the target Q-values:
       Q^{Tar}(s_l, a_l) = r_l, if done;
       Q^{Tar}(s_l, a_l) = r_l + \gamma Q(s'_l, \arg\max_{a} Q(s'_l, a, w), \bar{w}), otherwise.
9:     Update w using gradient descent by minimizing the loss L(w) = Er(Q^{Tar}(s_l, a_l) - Q(s_l, a_l, w)).
10:    Copy w to \bar{w} after several training iterations.
11:   end for
12: end for
13: Output: performance of the network and the learning algorithm.
D. Baseline: Exploration bonus DQN and PPF-TTL
In this section, we include two baseline algorithms. Firstly, EB-DQN serves as a benchmark to compare our DTRL method with other ML-based algorithms. The MDP definition of EB-DQN is the same as that of DTRL, as shown in Table II. The EB-
DQN agent explores the joint resource allocation task from
scratch, and no prior knowledge is included. In EB-DQN, the
loss function is defined by:
L(w) = Er\big(r + \frac{\Psi}{\psi(s, a)} + \gamma \max_{a} Q(s', a, \bar{w}) - Q(s, a, w)\big),    (34)
where Er has been defined in equation (21) as the loss function of the neural networks, \Psi is an extra reward, and \psi(s, a) is the number of times that (s, a) has been selected. The term \Psi/\psi(s, a) is regarded as an extra bonus for selecting actions that are less visited, and it encourages the agent to better explore the environment. EB-DQN-based joint resource allocation is summarized in Algorithm 3.
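A count-based sketch of the bonus in (34) is shown below: visit counts are tracked per state-action pair and Ψ/ψ(s, a) is added to the reward before the usual target is formed; the discretization of states into hashable keys is an assumption.

from collections import defaultdict

class ExplorationBonus:
    """Count-based bonus of equation (34): Psi / psi(s, a)."""
    def __init__(self, psi_value=0.5):
        self.psi_value = psi_value
        self.counts = defaultdict(int)

    def __call__(self, state, action):
        self.counts[(state, action)] += 1
        return self.psi_value / self.counts[(state, action)]

bonus = ExplorationBonus(psi_value=0.5)
s, a, r = (12, 3), (70, 30, 15, 5), 0.4
shaped_reward = r + bonus(s, a)        # first visit: full bonus of 0.5
shaped_reward_again = r + bonus(s, a)  # repeated visits receive a smaller bonus
print(shaped_reward, shaped_reward_again)   # 0.9 and 0.65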
Algorithm 3 EB-DQN-based joint resource allocation
1: Initialize: wireless and EB-DQN parameters.
2: for TTI = 1 to T_total do
3:   for each BS do
4:     With probability ε, choose an action randomly; otherwise, choose the action by a_l = \arg\max_{a} Q(s_l, a, w).
5:     The GRM implements the inter-slice resource allocation as in equation (14).
6:     SRMs distribute radio resources to UEs by proportional fairness, and replace cached content items by TTL.
7:     Update the system state, and save (s_l, a_l, r_l, s'_l) to the experience pool. Every C TTIs, sample a minibatch from the experience pool randomly.
8:     Generate the target Q-values:
       Q^{Tar}(s_l, a_l) = r_l + \Psi/\psi(s_l, a_l), if done;
       Q^{Tar}(s_l, a_l) = r_l + \Psi/\psi(s_l, a_l) + \gamma \max_{a} Q(s'_l, a, \bar{w}), otherwise.
9:     Update w using gradient descent by minimizing the loss L(w) = Er(Q^{Tar}(s_l, a_l) - Q(s_l, a_l, w)).
10:    Copy w to \bar{w} after several training iterations.
11:   end for
12: end for
13: Output: performance of the network and the learning algorithm.
Algorithm 4 PPF-TTL based joint resource allocation
1: Initialize: wireless network parameters.
2: for TTI = 1 to T_total do
3:   for each BS do
4:     for each RB do
5:       Calculate the estimated transmission rates of the UEs in the queue.
6:       Calculate the proportional fairness metric [37].
7:       Transmit the URLLC packet with the highest proportional fairness metric. If there is no URLLC packet, transmit eMBB packets.
8:     end for
9:     The BS replaces cached content items by the time-to-live rule.
10:   end for
11: end for
12: Output: performance of the network.
On the other hand, to compare the ML methods with model-
based algorithms, we apply a model-based PPF-TTL algo-
rithm. The well-known priority proportional fairness (PPF)
algorithm is applied for radio resource allocation, in which
URLLC packets have a higher priority than eMBB packets
[37]. The RBs will first serve URLLC transmission, then
eMBB traffic will be processed. We deploy the TTL method
for caching, but no slicing and learning are included. The PPF-
TTL method is shown in Algorithm 4.
E. Computational Complexity Analyses
In this section, we analyze the computational complexity of the proposed DTRL-based methods. Firstly, the complexity of the DTRL method is dominated by the training and updating of the LSTM network. The complexity of the LSTM network updating consists of the running time of the recurrent connections and biases and the updating time of the input and output nodes. The computational complexity for updating the LSTM network in DTRL is O(l_{hd} m^2_{lstm} c^2_{lstm}) [30], where l_{hd} is the number of hidden layers, c_{lstm} is the number of memory cells in each block, and m_{lstm} is the number of memory blocks. It is worth noting that only the main network needs to be trained, and the target network can copy the weights from the main network.
On the other hand, the knowledge transfer process also contributes to the complexity. In QDTRL, the knowledge transfer consists of the state and action mapping functions F and F' (indicated by equations (23) and (24)). Accordingly, the time complexity is O(\sum_{q=1}^{N_e} |S_{e,q}||A_{e,q}|), where N_e is the number of experts, and S_{e,q} and A_{e,q} are the state and action sets of the experts, respectively.
In ADTRL, the Kantorovich distance can be considered as an optimal transportation problem, which is computable within strongly polynomial time O(|S|^2 \log(|S|)), where |S| is the total number of possible distributions [38]. Meanwhile, the Hausdorff metric can be computed in nearly linear time [39], which can be neglected compared with the complexity of the Kantorovich distance. Based on equation (31), the total complexity of knowledge transfer in ADTRL is O(|N_e||A_l||S|^2 \log(|S|)), where |A_l| is the size of the learner's action set. In summary, these analyses show that our knowledge transfer process can be efficiently computed, and the complexity is linearly related to the number of experts.
VI. PERFORMANCE EVALUATION
A. Parameter Settings
In this section, we consider six different cases: expert 1,
expert 2, learner 1 (QDTRL), learner 2 (ADTRL), baseline
1 (EB-DQN) and baseline 2 (PPF-TTL). We include 6 adja-
cent gNBs, and each algorithm is implemented in one gNB
randomly. Each gNB contains an eMBB slice and a URLLC
slice. The eMBB slice has 5 UEs, while the URLLC slice has
10 UEs [40]. The radius of each gNB is 300 meters, and the
distance between two adjacent gNBs is 600 meters. For each
gNB, there are 100 RBs in total, which are divided into 13
resource block groups (RBGs) [41]. We assume the caching
capacity is reallocated every 50 TTIs because it takes time to
replace the cached content items. The experts' experience is presumed to be existing knowledge available to the learners.
We deploy LSTM networks with 30 nodes as hidden layers
for the target and main networks in DTRL and EB-DQN.
The network learning rate and the number of layers are
selected by the grid search method. We try different parameter
combinations and find the best performance accordingly. The
simulations include 3000 TTIs, where the first 1500 TTIs
are the exploration period, and the remaining TTIs are the
exploitation period. The simulations are implemented with the MATLAB 5G library, and results are averaged over 15 runs. Other 5G and learning parameters are shown in Table III.
TABLE III
PARAMETER SETTINGS

5G settings: Bandwidth: 20 MHz; 3GPP urban macro network; Number of RBs: 100; Subcarriers in each RB: 12; Subcarrier bandwidth: 15 kHz; Transmission power: 40 dBm (uniformly distributed); TTI size: 2 OFDM symbols; Tx/Rx antenna gain: 15 dB.
Retransmission settings: Max number of retransmissions: 1; Round trip delay: 4 TTIs; Hybrid automatic repeat request.
UE and gNBs: 25 eMBB UEs, 50 URLLC UEs; UE random distribution; Number of gNBs: 5; Inter-gNB distance: 500 m.
Propagation model: 128.1 + 37.6 log(distance (km)); Log-Normal shadowing: 8 dB.
Cache settings: Caching capacity: 20 items; TTL value: 50 TTIs; Contents catalog size: 40 per slice; Contents popularity: Zipf.
Traffic model: URLLC/eMBB traffic: Poisson distribution; Packet size: 36 Bytes.
Learning settings: Network layers: 4; 2 LSTM hidden layers with 35 nodes per hidden layer; Initial learning rate: 0.005; Experience pool size: 150; Training frequency: 30 TTIs; Minibatch size: 30; Discount factor: 0.5; Epsilon value: 0.05; Reward weight: 0.5; Transfer learning rate: 0.7; Ψ value for EB-DQN: 0.5.
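As a minimal sketch of the traffic and popularity models in Table III (per-slice Zipf popularity over a 40-item catalog and Poisson packet arrivals per TTI); the Zipf exponent and mean arrival rate below are placeholders, since the table does not specify them.

```python
import numpy as np

rng = np.random.default_rng(0)

def zipf_popularity(catalog_size=40, exponent=1.0):
    """Normalized Zipf popularity over a per-slice content catalog."""
    ranks = np.arange(1, catalog_size + 1)
    weights = 1.0 / ranks ** exponent
    return weights / weights.sum()

def requests_per_tti(mean_arrivals, popularity):
    """Poisson number of packet arrivals in one TTI, each requesting one content item."""
    n_packets = rng.poisson(mean_arrivals)
    return rng.choice(len(popularity), size=n_packets, p=popularity)

popularity = zipf_popularity()
urllc_requests = requests_per_tti(mean_arrivals=3, popularity=popularity)
```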
B. Performance Analyses of Various Learning Parameters
In this section, we analyze the algorithm performance under
diverse learning parameters. Fig. 6(a) shows the convergence performance of QDTRL, ADTRL and EB-DQN, which is a critical metric for learning algorithms. ADTRL has the fastest convergence, which can be explained by its improved action selection strategy: ADTRL takes advantage of the action selection policies of the experts, which indicate actions with higher potential rewards. This shows that ADTRL applies a more efficient exploration strategy and achieves better performance.
QDTRL also presents a better convergence performance than
EB-DQN. In QDTRL, the Q-values of the experts are extracted
as extra rewards for action selections. It assumes that actions
with higher Q-values in experts can also bring higher rewards
to learners, and the exploration is accelerated. However, other
actions can still be randomly selected for exploration, lowering
the exploration efficiency. On the contrary, in EB-DQN, the
agent has no prior knowledge about current tasks. The agent
starts from scratch to explore its tasks, which leads to a longer
convergence time and a lower average reward.
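As a minimal illustration of this idea (not the exact combination defined by equation (24)), the expert's Q-value, obtained through the state and action mapping functions, can be added to the learner's reward with a weight given by the transfer learning rate σ1:

```python
def qdtrl_shaped_reward(learner_reward, expert_q_value, sigma1=0.7):
    """Illustrative QDTRL-style shaping: the mapped expert Q-value acts as an
    extra reward for the learner, weighted by the transfer learning rate sigma1."""
    return learner_reward + sigma1 * expert_q_value
```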
Then, Fig. 6(b) shows the network performance of EB-DQN against the extra exploration reward Ψ (shown in equation (34)). A higher Ψ value encourages more exploration, while a lower value means more exploitation. The simulations demonstrate that a higher Ψ may hamper the network performance through over-exploration, while a lower Ψ also degrades the URLLC delay and eMBB throughput through under-exploration. Therefore, an appropriate Ψ value is critical to balance exploration and exploitation.
Fig. 6. Convergence and network performance comparison against learning parameters: (a) convergence performance comparison of EB-DQN, QDTRL and ADTRL; (b) network performance of EB-DQN under various extra exploration rewards Ψ; (c) convergence performance of QDTRL under various Q-table sizes of expert agents (a lower value such as 0.3 indicates that the learner agent only has 30% of the expert Q-table); (d) convergence performance of ADTRL under various action selection knowledge sizes of expert agents (the learner agent only has part of the expert agents' action selection knowledge); (e) network performance of QDTRL under various transfer learning rates; (f) network performance of ADTRL under various transfer learning rates.
To investigate how the knowledge transfer can contribute to
the learner agent performance, Fig. 6 (c) shows the QDTRL
performance under various expert Q-table sizes. In particular,
a lower value such as 0.3 means that the learner agent only
has 30% of the expert Q-tables as prior knowledge. Fig. 6 (c)
demonstrates that more prior knowledge can bring better per-
formance for the learner agent, while partial prior knowledge
may lower average rewards. Similarly, Fig. 6 (d) presents the
ADTRL performance using different action selection transfer
sizes. Specifically, a lower value indicates that the learner
agent only has part of the action selection knowledge of expert
agents. Consequently, one can observe that more transferred
knowledge can improve exploration efficiency and produce a
higher average reward for the learner agent.
Finally, note that we have defined transfer learning rates when introducing QDTRL and ADTRL, which represent the importance of the transferred knowledge. The network performance of QDTRL and ADTRL under different transfer learning rates is investigated here. In QDTRL, the transfer learning rate is indicated by σ1 in equation (24).
Fig. 7. Performance comparison under various traffic loads and backhaul capacities: (a) ECDF of URLLC latency (1 Mbps eMBB traffic, 2 Mbps URLLC traffic); (b) URLLC latency against traffic load; (c) eMBB throughput against traffic load; (d) PDR comparison against traffic load; (e) URLLC latency against backhaul capacity; (f) eMBB throughput against backhaul capacity.

Fig. 6(e)
shows that a higher transfer learning rate may lead to better
network performance, which is indicated by a lower URLLC
delay and a higher eMBB throughput. However, a very high transfer learning rate may hinder the exploration of the agent itself, leading to sub-optimal results such as higher delay and lower throughput. A similar trend can be observed in Fig. 6(f) for ADTRL, in which the transfer learning rate is indicated by the variable σ2 in equation (32). A higher σ2 value significantly reduces the exploration complexity and brings better network performance. However, when σ2 ≥ 0.6, the transferred knowledge dominates the learning process, resulting in performance degradation in terms of latency and throughput.
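As a simplified sketch of this trade-off (an illustration of the trend only, not the exact ADTRL rule in equation (32)), a transfer learning rate can be interpreted as the probability of following the expert's advice instead of the learner's own ε-greedy choice:

```python
import random

def select_action(own_best_action, expert_advised_action, action_set,
                  sigma2=0.5, epsilon=0.05):
    """With probability sigma2 follow the expert-advised action; otherwise fall
    back to the learner's own epsilon-greedy choice."""
    if random.random() < sigma2:
        return expert_advised_action
    if random.random() < epsilon:
        return random.choice(action_set)  # random exploration
    return own_best_action
```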
C. Network Performance Analyses
In this section, we compare the network performance of
different algorithms under various traffic loads and backhaul
capacities. The eMBB traffic is fixed to 1 Mbps per cell, and
the URLLC traffic ranges from 1 to 6 Mbps. We first present
the results, then explain the performance of each algorithm.
Fig. 7(a) shows the Empirical Complementary Cumulative Distribution Function (ECCDF) of the URLLC latency with 2 Mbps URLLC traffic, which presents the empirical distribution of packet delays. We zoom into the region where the ECCDF value is higher than 0.1 and the URLLC delay is lower than 1 ms to better show the results. Expert 1 and the PPF-TTL method present the highest delay, which is indicated by their heavy delay distributions in the 0.1-1 interval of the ECCDF axis. By contrast, expert 2 has a lower delay distribution than expert 1. Meanwhile, EB-DQN and ADTRL show comparable delay performance for the URLLC slice. Finally, ADTRL achieves the best delay performance among all algorithms. Fig. 7(b)
and (c) present average URLLC slice delay and eMBB slice
throughput against traffic loads. It shows that both ADTRL
and QDTRL maintain lower delay and higher throughput than
experts and baseline algorithms under various traffic loads.
Expert 2 shows a lower URLLC delay than the PPF-TTL and
expert 1, but its eMBB throughput is much lower than any
other algorithm. In Fig. 7(d), all algorithms have comparable packet drop rates (PDR) except PPF-TTL. In the following, we explain the results of each algorithm.
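For reference, a small sketch of how an ECCDF curve like the one in Fig. 7(a) can be computed from a list of per-packet delays; the delay values below are placeholders.

```python
import numpy as np

def eccdf(delays_ms):
    """Empirical complementary CDF: fraction of packets whose delay exceeds x."""
    x = np.sort(np.asarray(delays_ms, dtype=float))
    y = 1.0 - np.arange(1, x.size + 1) / x.size
    return x, y

x, y = eccdf([0.3, 0.5, 0.5, 0.8, 1.2])  # placeholder delays in ms
# y[i] is the fraction of packets delayed more than x[i] ms
```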
We first analyze the experts’ performance. Expert 1 shows a high delay and a low throughput, because it has
no caching capability. All packets required by UEs have to
be processed by the core network, and the constant backhaul
delay leads to a high URLLC delay and a low eMBB through-
put. On the contrary, expert 2 has a lower delay because we
apply a fixed RB allocation strategy. In particular, RBs are
distributed according to the UE numbers in each slice; thus
the URLLC slice always has more RBs, which leads to a low
URLLC delay. However, the eMBB slice consequently suffers from a low throughput.
For the baseline algorithms, EB-DQN outperforms the PPF-
TTL method because it jointly learns the radio and cache
resource allocation. The eMBB throughput of the PPF-TTL
method decreases significantly with the increasing URLLC
load, which results from the priority settings in this method. In
the PPF method, whenever a new URLLC packet arrives, it is
directly scheduled over the eMBB packets, which unavoidably
degrades the eMBB throughput.
Finally, the proposed QDTRL and ADTRL have the best
overall performance. Due to the novel DTRL-based scheme,
they can leverage the knowledge of experts and further im-
prove their own performance on new tasks. When the URLLC
traffic is 4 Mbps, ADTRL presents a 21.4% lower URLLC
delay and a 22.4% higher eMBB throughput than EB-DQN.
A 40.8% lower URLLC delay and a 59.8% higher eMBB
throughput are also observed compared with the PPF-TTL
method. The simulations show that the proposed DTRL-based
solutions achieve more promising results than the baseline
algorithms.
Moreover, backhaul capacity is one of the main bottlenecks
of 5G RAN. Here we investigate the network performance
under various backhaul capacities, and the results are shown
in Fig. 7(e) and (f). Note that the basic performance of the experts has been shown in the previous results, so here we mainly compare the DTRL-based solutions with the baseline algorithms.
As expected, all algorithms achieve lower delays for
URLLC slice and higher throughput for eMBB slice with
increasing backhaul capacity, because higher capacity will
reduce the backhaul delay. Moreover, QDTRL and ADTRL
still achieve lower URLLC delays and higher eMBB through-
put. Compared with EB-DQN, the satisfying performance of
QDTRL and ADTRL can still be explained by their knowledge
(a) URLLC latency against caching capacity
(b) eMBB throughput against caching capacity
(c) Cached hit ratio against caching capacity
Fig. 8. Network performance comparison under various caching capacities.
transfer strategy. Meanwhile, the worst performance of PPF-
TTL shows that learning-based methods outperform model-
based algorithms by the superior learning capability. When
the backhaul capacity is 15 Mbps, ADTRL presents an 18.9%
lower URLLC delay and a 24.2% higher eMBB throughput
than EB-DQN. Compared with the PPF-TTL method, a 24.7%
lower URLLC delay and a 54.3% higher eMBB throughput are
observed.
D. Content Caching Performance Analyses
In this section, we compare different algorithms under various caching capacities, where the caching capacity is defined as the maximum number of content items that can be stored in the gNBs. As shown
in Fig.8 (a) and (b), a higher caching capacity will reduce
the URLLC latency and increase the eMBB throughput. This is because a higher caching capacity means that more items can be cached in the gNBs, and the average backhaul delay will be
reduced. The simulations show that ADTRL and QDTRL have
the best overall performance. EB-DQN still achieves a lower
URLLC delay and a higher eMBB throughput than the PPF-
TTL algorithm.
Furthermore, we present the cache hit ratios of the eMBB and URLLC slices in Fig. 8(c). The cache hit ratio represents
the proportion of packets that can be found in the cache
server when they are required. A higher cache hit ratio usually
indicates better network performance, which is affected by
the caching capacity and the content replacement strategy.
With the increasing caching capacity, cache hit ratios naturally
increase for all algorithms. As expected, the cache hit ratios of the eMBB and URLLC slices are well maintained in QDTRL
and ADTRL. Compared with ADTRL, the ratios in EB-DQN
and PPF-TTL methods are 19.8% and 31% lower. Note that
the PPF-TTL method only has one curve because we assume
there is no slicing in the PPF-TTL algorithm. Finally, ADTRL,
QDTRL and EB-DQN have similar cache hit ratios when the
caching capacity is 60. This means that most required items can be cached at this capacity, and thus a high cache hit ratio is observed.
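A minimal sketch matching this definition, computing the per-slice cache hit ratio from a stream of (slice, hit) events; the slice names and event format are illustrative.

```python
from collections import Counter

def per_slice_hit_ratio(events):
    """events: iterable of (slice_name, hit) pairs; returns the hit ratio per slice."""
    hits, totals = Counter(), Counter()
    for slice_name, hit in events:
        totals[slice_name] += 1
        hits[slice_name] += int(hit)
    return {s: hits[s] / totals[s] for s in totals}

ratios = per_slice_hit_ratio([("eMBB", True), ("eMBB", False), ("URLLC", True)])
```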
VII. CONCLUSION
Network slicing is a key technique to enhance flexibility in
5G networks, and ML techniques offer promising solutions.
Although widely used reinforcement learning techniques have yielded improved network performance, they suffer from long convergence times and a lack of generalization. In that sense, knowledge transfer emerges as an important approach to improve learning performance. Yet, transfer learning in wireless networks has been explored only very recently and in very few studies. This work presented two novel deep transfer
reinforcement learning-based solutions for the joint radio and
cache resource allocation. The proposed algorithms have been
compared with two baseline algorithms via simulations. These results have shown that the proposed methods achieve better network performance and faster convergence than the benchmark algorithms. In the future, we plan to consider the
knowledge transfer between tasks with different state defini-
tions.
APPENDIX
In the following we prove equation (10). Recalling equations (8) and (9),
\[
h_{j,n,g} = \sum_{m \in M_{j,n}} \phi_{j,n,m,g} T_{j,n,m,g},
\]
\[
\sum_{g \in G_n} \sum_{m \in M_n} h_{j,n,m,g} = \frac{C_{j,n}}{C_{j,T}}.
\]
Then we have
\[
h_{j,n,g} \sum_{g \in G_n} \sum_{m \in M_n} h_{j,n,m,g} = \frac{C_{j,n}}{C_{j,T}} \sum_{m \in M_{j,n}} \phi_{j,n,m,g} T_{j,n,m,g},
\]
which can be easily transformed to
\[
\begin{aligned}
h_{j,n,g} &= \frac{C_{j,n}}{C_{j,T}} \frac{\sum_{m \in M_{j,n}} \phi_{j,n,m,g} T_{j,n,m,g}}{\sum_{g \in G_n} \sum_{m \in M_n} h_{j,n,m,g}} \\
&= \frac{C_{j,n}}{C_{j,T}} \frac{T_{j,n,g} \sum_{m \in M_{j,n}} \phi_{j,n,m,g}}{\sum_{g \in G_n} \sum_{m \in M_n} h_{j,n,m,g}} \quad (\text{using the } T_{j,n,m,g} = T_{j,n,g} \text{ assumption}) \\
&= \frac{C_{j,n}}{C_{j,T}} \frac{T_{j,n,g} \sum_{m \in M_{j,n}} \phi_{j,n,m,g}}{\sum_{g \in G_n} \sum_{m \in M_n} \phi_{j,n,m,g} T_{j,n,m,g}} \quad (\text{using } h_{j,n,m,g} = \phi_{j,n,m,g} T_{j,n,m,g}) \\
&= \frac{C_{j,n}}{C_{j,T}} \frac{\sum_{m \in M_{j,n}} \phi_{j,n,m,g}}{\sum_{g \in G_n} \sum_{m \in M_n} \phi_{j,n,m,g}} \\
&= \frac{C_{j,n}}{C_{j,T}} \frac{\sum_{m \in M_{j,n}} \phi_{j,n,m,g}}{\sum_{m \in M_n} \phi_{j,n,m}},
\end{aligned}
\]
which is equation (10).
ACKNOWLEDGMENT
We would like to thank Dr. Medhat Elsayed for initial
discussions on transfer learning.
REFERENCES

[1] M. Shafi, A. Molisch, P. Smith, T. Haustein, P. Zhu, P. Silva, F. Tufvesson, A. Benjebbour, and G. Wunder, “5G: A Tutorial Overview of Standards, Trials, Challenges, Deployment, and Practice,” IEEE Journal on Selected Areas in Communications, vol. 35, no. 6, pp. 1201-1221, Jun. 2017.
[2] A. Ksentini and N. Nikaein, “Toward Enforcing Network Slicing on RAN: Flexibility and Resource Abstraction,” IEEE Communications Magazine, vol. 55, no. 6, pp. 102-108, Jun. 2017.
[3] D. Marabissi and R. Fantacci, “Highly Flexible RAN Slicing Approach to Manage Isolation, Priority, Efficiency,” IEEE Access, vol. 7, pp. 97130-97142, Jul. 2019.
[4] M. Alsenwi, N. Tran, M. Bennis, A. Bairagi, and C. Hong, “eMBB-URLLC Resource Slicing: A Risk-Sensitive Approach,” IEEE Communications Letters, vol. 23, no. 4, pp. 740-743, Apr. 2019.
[5] L. Li, G. Zhao, and R. S. Blum, “A Survey of Caching Techniques in Cellular Networks: Research Issues and Challenges in Content Placement and Delivery Strategies,” IEEE Communications Surveys & Tutorials, vol. 20, no. 3, pp. 1710-1732, Mar. 2018.
[6] M. Erol-Kantarci, “Cache-At-Relay: Energy-Efficient Content Placement for Next-Generation Wireless Relays,” International Journal of Network Management, vol. 25, no. 6, pp. 454-470, Nov./Dec. 2015.
[7] P. Yang, N. Zhang, S. Zhang, L. Yu, J. Zhang, and X. Shen, “Content Popularity Prediction Towards Location-Aware Mobile Edge Caching,” IEEE Transactions on Multimedia, vol. 21, no. 4, pp. 915-929, Apr. 2019.
[8] J. Kwak, Y. Kim, L. B. Le, and S. Chong, “Hybrid content caching in 5G wireless networks: Cloud versus edge caching,” IEEE Transactions on Wireless Communications, vol. 17, no. 5, pp. 3030-3045, May 2018.
[9] M. Elsayed and M. Erol-Kantarci, “AI-Enabled Future Wireless Networks: Challenges, Opportunities, and Open Issues,” IEEE Vehicular Technology Magazine, vol. 14, no. 3, pp. 70-77, Sep. 2019.
[10] Y. Shi, Y. E. Sagduyu, and T. Erpek, “Reinforcement Learning for Dynamic Resource Optimization in 5G Radio Access Network Slicing,” in Proceedings of the 2020 IEEE 25th International Workshop on CAMAD, Sep. 2020, pp. 1-6.
[11] T. Li, X. Zhu, and X. Liu, “An End-to-End Network Slicing Algorithm Based on Deep Q-Learning for 5G Network,” IEEE Access, vol. 8, pp. 122229-122240, Jul. 2020.
[12] X. Chen, H. Zhang, C. Wu, S. Mao, Y. Ji, and M. Bennis, “Optimized Computation Offloading Performance in Virtual Edge Computing Systems via Deep Reinforcement Learning,” IEEE Internet of Things Journal, vol. 6, no. 3, pp. 4005-4018, Jun. 2019.
[13] S. J. Pan and Q. Yang, “A Survey on Transfer Learning,” IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345-1359, Oct. 2010.
[14] W. Lei, Y. Ye, and M. Xiao, “Deep Reinforcement Learning-Based Spectrum Allocation in Integrated Access and Backhaul Networks,” IEEE Transactions on Cognitive Communications and Networking, vol. 6, no. 3, pp. 970-979, Sep. 2020.
[15] X. Chen, Z. Zhao, C. Wu, M. Bennis, H. Liu, Y. Ji, and H. Zhang, “Multi-Tenant Cross-Slice Resource Orchestration: A Deep Reinforcement Learning Approach,” IEEE Journal on Selected Areas in Communications, vol. 37, no. 10, pp. 2377-2392, Oct. 2019.
[16] H. Albinsaid, K. Singh, S. Biswas, and C. Li, “Multi-agent Reinforcement Learning Based Distributed Dynamic Spectrum Access,” IEEE Transactions on Cognitive Communications and Networking (Early Access), DOI: 10.1109/TCCN.2021.3120996, Oct. 2021.
[17] J. Li, X. Zhang, J. Zhang, J. Wu, Q. Sun, and Y. Xie, “Deep Reinforcement Learning-Based Mobility-Aware Robust Proactive Resource Allocation in Heterogeneous Networks,” IEEE Transactions on Cognitive Communications and Networking, vol. 6, no. 1, pp. 408-421, Mar. 2020.
[18] C. Zhang, M. Dong, and K. Ota, “Fine-Grained Management in 5G: DQL Based Intelligent Resource Allocation for Network Function Virtualization in C-RAN,” IEEE Transactions on Cognitive Communications and Networking, vol. 6, no. 2, pp. 428-435, Jun. 2020.
[19] H. Zhou, M. Elsayed, and M. Erol-Kantarci, “RAN Resource Slicing in 5G Using Multi-Agent Correlated Q-Learning,” in Proceedings of the 2021 IEEE Annual International Symposium on PIMRC, Sep. 2021, pp. 1-6.
[20] M. Elsayed, M. Erol-Kantarci, and H. Yanikomeroglu, “Transfer Reinforcement Learning for 5G New Radio mmWave Networks,” IEEE Transactions on Wireless Communications, vol. 20, no. 5, pp. 2838-2849, May 2021.
[21] M. E. Taylor and P. Stone, “Cross-Domain Transfer for Reinforcement Learning,” in Proceedings of the International Conference on Machine Learning, Jun. 2007, pp. 879-886.
[22] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529-533, Feb. 2015.
[23] M. E. Taylor, P. Stone, and Y. Liu, “Transfer Learning for Reinforcement Learning Domains: A Survey,” Journal of Machine Learning Research, vol. 10, pp. 1633-1685, Sep. 2009.
[24] 3GPP, “5G; NR; Physical channels and modulation (Release 15),” Technical Specification 38.211, 3rd Generation Partnership Project (3GPP), Jul. 2018.
[25] M. Elsayed and M. Erol-Kantarci, “Reinforcement Learning-based Joint Power and Resource Allocation for URLLC in 5G,” in Proceedings of the 2019 IEEE Global Communications Conference, pp. 1-6, Dec. 2021.
[26] P. L. Vo, M. N. Nguyen, T. A. Le, and N. H. Tran, “Slicing the Edge: Resource Allocation for RAN Network Slicing,” IEEE Wireless Communications Letters, vol. 7, no. 6, pp. 970-973, Dec. 2018.
[27] N. C. Fofack, P. Nain, G. Neglia, and D. Towsley, “Analysis of TTL-based cache networks,” in Proceedings of the 6th International ICST Conference on Performance Evaluation Methodologies and Tools, Oct. 2012, pp. 1-10.
[28] T. Han and N. Ansari, “Network Utility Aware Traffic Load Balancing in Backhaul-Constrained Cache-Enabled Small Cell Networks with Hybrid Power Supplies,” IEEE Transactions on Mobile Computing, vol. 16, no. 10, pp. 2819-2832, Oct. 2017.
[29] H. Zhou, A. Aral, I. Brandic, and M. Erol-Kantarci, “Multi-agent Bayesian Deep Reinforcement Learning for Microgrid Energy Management under Communication Failures,” IEEE Internet of Things Journal (Early Access), DOI: 10.1109/JIOT.2021.3131719, Dec. 2021.
[30] F. A. Gers, J. Schmidhuber, and F. Cummins, “Learning to Forget: Continual Prediction with LSTM,” Neural Computation, vol. 12, no. 10, pp. 2451-2471, Oct. 2000.
[31] M. E. Taylor, P. Stone, and Y. Liu, “Transfer Learning via Inter-Task Mappings for Temporal Difference Learning,” Journal of Machine Learning Research, vol. 8, no. 1, pp. 2125-2167, Sep. 2007.
[32] Z. Zhu, K. Lin, A. K. Jain, and J. Zhou, “Transfer Learning in Deep Reinforcement Learning: A Survey,” arXiv:2009.07888, May 2022.
[33] A. L. Gibbs and F. E. Su, “On Choosing and Bounding Probability Metrics,” International Statistical Review, vol. 70, no. 3, pp. 419-435, Dec. 2002.
[34] P. S. Castro and D. Precup, “Using Bisimulation for Policy Transfer in MDPs,” in Proceedings of the 24th AAAI Conference on Artificial Intelligence, Jul. 2010, pp. 1-6.
[35] J. J. Taylor, D. Precup, and P. Panangaden, “Bounding performance loss in approximate MDP homomorphisms,” in Advances in Neural Information Processing Systems 21, Dec. 2008, pp. 1-8.
[36] M. James, Topology, 2nd ed., Prentice Hall, 1999, pp. 280-281.
[37] G. Pocovi, K. Pedersen, and P. Mogensen, “Joint Link Adaptation and Scheduling for 5G Ultra-Reliable Low-Latency Communications,” IEEE Access, vol. 6, pp. 28912-28922, May 2018.
[38] N. Ferns, P. Panangaden, and D. Precup, “Metrics for finite Markov decision processes,” in Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, Jul. 2004, pp. 162-169.
[39] A. A. Taha and A. Hanbury, “An Efficient Algorithm for Calculating the Exact Hausdorff Distance,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 11, pp. 2153-2163, Nov. 2015.
[40] M. Elsayed and M. Erol-Kantarci, “AI-Enabled Radio Resource Allocation in 5G for URLLC and eMBB Users,” in Proceedings of the 2019 IEEE 2nd 5G World Forum (5GWF), Nov. 2019, pp. 1-6.
[41] 3GPP, “NR; Physical Layer Procedures for Data (version 15.2.0),” Technical Specification 38.214, 3rd Generation Partnership Project (3GPP), Jun. 2018.
Hao Zhou is a PhD candidate at the University of Ottawa. He received his B.Eng. and M.Eng. degrees from Huazhong University of Science and Technology, China, in 2016, and Tianjin University, China, in 2019, respectively. He has been working toward his PhD degree at the University of Ottawa since Sep. 2019. His research interests include electric vehicles, microgrid energy trading, resource management, and network slicing in 5G. He is devoted to applying machine learning techniques to smart grid and 5G applications.
Melike Erol-Kantarci is Canada Research Chair
in AI-enabled Next-Generation Wireless Networks
and Associate Professor at the School of Electrical
Engineering and Computer Science at the University
of Ottawa. She is the founding director of the
Networked Systems and Communications Research
(NETCORE) laboratory. She has received numer-
ous awards and recognitions. Dr. Erol-Kantarci is
the co-editor of three books on smart grids, smart
cities and intelligent transportation. She has over
180 peer-reviewed publications. She has delivered
70+ keynotes, plenary talks and tutorials around the globe. She is on the
editorial board of the IEEE Transactions on Cognitive Communications and
Networking, IEEE Internet of Things Journal, IEEE Communications Letters,
IEEE Networking Letters, IEEE Vehicular Technology Magazine and IEEE
Access. She has acted as the general chair and technical program chair for
many international conferences and workshops. Her main research interests
are AI-enabled wireless networks, 5G and 6G wireless communications, smart
grid and Internet of Things. She is an IEEE ComSoc Distinguished Lecturer,
IEEE Senior member and ACM Senior Member.
H. Vincent Poor (S’72, M’77, SM’82, F’87) re-
ceived the Ph.D. degree in EECS from Princeton
University in 1977. From 1977 until 1990, he was
on the faculty of the University of Illinois at Urbana-
Champaign. Since 1990 he has been on the faculty at
Princeton, where he is currently the Michael Henry
Strater University Professor. During 2006 to 2016,
he served as the dean of Princeton’s School of
Engineering and Applied Science. He has also held
visiting appointments at several other universities,
including most recently at Berkeley and Cambridge.
His research interests are in the areas of information theory, machine learning
and network science, and their applications in wireless networks, energy
systems and related fields. Among his publications in these areas is the forthcoming book Machine Learning and Wireless Communications (Cambridge University Press). Dr. Poor is a member of the National Academy of
Engineering and the National Academy of Sciences and is a foreign member
of the Chinese Academy of Sciences, the Royal Society, and other national
and international academies. He received the IEEE Alexander Graham Bell
Medal in 2017.