This paper has been accepted by IEEE Transactions on Cognitive Communications and Networking
Learning from Peers: Deep Transfer Reinforcement Learning for Joint Radio and Cache Resource Allocation in 5G RAN Slicing
Hao Zhou, Graduate Student Member, IEEE, Melike Erol-Kantarci, Senior Member, IEEE, Vincent Poor, Fellow, IEEE
Abstract—Network slicing is a critical technique for 5G communications that covers radio access network (RAN), edge, transport and core slicing. The evolving network architecture requires the orchestration of multiple network resources such as radio and cache resources. In recent years, machine learning (ML) techniques have been widely applied for network management. However, most existing works do not take advantage of the knowledge transfer capability in ML. In this paper, we propose a deep transfer reinforcement learning (DTRL) scheme for joint radio and cache resource allocation to serve 5G RAN slicing. We first define a hierarchical architecture for joint resource allocation. Then we propose two DTRL algorithms: Q-value-based deep transfer reinforcement learning (QDTRL) and action selection-based deep transfer reinforcement learning (ADTRL). In the proposed schemes, learner agents utilize expert agents' knowledge to improve their performance on current tasks. The proposed algorithms are compared with both the model-free exploration bonus deep Q-learning (EB-DQN) and the model-based priority proportional fairness and time-to-live (PPF-TTL) algorithms. Compared with EB-DQN, our proposed DTRL-based method presents 21.4% lower delay for the Ultra Reliable Low Latency Communications (URLLC) slice and 22.4% higher throughput for the enhanced Mobile Broad Band (eMBB) slice, while achieving significantly faster convergence than EB-DQN. Moreover, 40.8% lower URLLC delay and 59.8% higher eMBB throughput are observed with respect to PPF-TTL.
Index Terms—5G, network slicing, edge caching, transfer
reinforcement learning
I. INTRODUCTION
Driven by the increasing traffic demand of diverse mobile
applications, 5G mobile networks are expected to satisfy the
diverse quality of service (QoS) requirements of a wide variety
of services as well as service level agreements of different user
types [1]. Considering the diverse QoS demands of user types
such as enhanced Mobile Broad Band (eMBB) and Ultra Reli-
able Low Latency Communications (URLLC), network slicing
has been proposed to enable flexibility and customization of
5G networks. Based on software defined networks and network function virtualization techniques, physical networks are split into multiple logical network slices [2]. Each slice may include its own controller to manage the available resources.
This work was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) Collaborative Research and Training Experience Program (CREATE) under Grant 497981, the Canada Research Chairs Program, and the U.S. National Science Foundation under Grant CNS-2128448.
H. Zhou and M. Erol-Kantarci are with the School of Electrical Engineering and Computer Science, University of Ottawa, Ottawa, ON K1N 6N5, Canada (emails: {hzhou098, melike.erolkantarci}@uottawa.ca).
H. V. Poor is with the Department of Electrical and Computer Engineering, Princeton University, Princeton, NJ 08544 USA (e-mail: poor@princeton.edu).
As an important part of network slicing, radio access
network (RAN) slicing is more complicated than core and
transport network slicing due to limited bandwidth resources
and fluctuating radio channels. For instance, a two-layer slicing
approach is introduced in [3] for low complexity RAN slicing,
which aims to find a suitable trade-off between slice isola-
tion and efficiency. A puncturing-based scheduling method
is proposed in [4] to allocate resources for the incoming
URLLC traffic and minimize the risk of interrupting eMBB
transmissions. However, the puncturing-based method may
degrade the performance of the eMBB slice, since the URLLC
slice is scheduled on top of ongoing eMBB transmissions (i.e.,
puncturing the current eMBB transmission).
The aforementioned works mainly concentrate on RAN
slicing. While the spectrum is indisputably the most critical
resource of RAN, other resources are equally important to
guarantee the network performance, especially cache resource
[5]. Incorporating caching into the RAN has attracted interest
from both academia and industry, and some novel network
architectures have been proposed to harvest the potential
advantages of edge caching [6]. Indeed, edge caching allows
storing data closer to the users by utilizing the storage capacity
available at the network devices, and it reduces the traffic to the
core network. Moreover, caching can save backhaul capacity and reduce the network delay. Cache placement and
caching strategies have been extensively studied in numerous
works, but the problem is usually addressed disjointly without
considering the RAN. For example, a location customized
caching method is presented in [7] to maximize the cache
hit ratio, and content caching locations are optimized in [8]
by considering both cloud-centric and edge-centric caching.
Augmenting edge devices in RAN with caching will bring
significant improvements for 5G networks. However, this leads
to higher complexity for network management. Network slic-
ing is expected to allocate limited resources such as bandwidth
or caching capacity between slices and fulfill the QoS require-
ments of slices. The complex network dynamics, especially
the stochastic arrival requests of slices, make the underlying
network optimization challenging. Fortunately, machine learn-
ing (ML) techniques offer promising solutions [9]. Applying
a reinforcement learning (RL) scheme can avoid the potential
complexity of defining a dedicated optimization model. For
instance, Q-learning is deployed in [10] to maximize the
network utility of 5G RAN slicing by jointly considering radio
and computation resources. Deep Q-learning (DQN) is used
in [11] for end-to-end network slicing, and double deep Q-
learning (DDQN) is applied for computation offloading within
sliced RANs in [12].
Although learning-based methods such as RL and deep rein-
forcement learning (DRL) have been generally applied for net-
work resource allocation, most existing works do not consider
the possibility of knowledge transfer [10]–[12]. Specifically,
an agent is designed for a specific task in these works, and
it interacts with its environment from scratch, which usually
leads to a lower exploration efficiency and longer convergence
time. Whenever a new task is assigned, the agent needs to
be retrained, even though similar tasks have been completed
before. The poor generalization capability of straightforward
RL methods motivates us to find a learning method with
better generalization and knowledge transfer capability. On
the other hand, humans can reuse the knowledge learned from
previous tasks to solve new tasks more efficiently, and this
capability can be built into ML as well [13]. Such knowledge
transfer and reuse can significantly reduce the need for a large
number of training samples, which is a common issue in many
ML methods. By incorporating knowledge transfer capability
into ML, it is expected to reduce the algorithm design and
training efforts and achieve better performance such as faster
convergence and higher average reward.
In this work, we propose two deep transfer reinforcement
learning (DTRL) based solutions for joint radio and cache re-
source allocation. In particular, we include knowledge transfer
capability in the DDQN framework by defining two different
knowledge transfer functions, and we propose two DTRL-
based algorithms accordingly. The first method is Q-value-
based deep transfer reinforcement learning (QDTRL), and the
second technique is called action selection-based deep transfer
reinforcement learning (ADTRL). Using these schemes, agents
can utilize the knowledge of experts to improve their perfor-
mance on current tasks, and consequently it can reduce the
algorithm training efforts. Furthermore, the current network
optimization schemes are usually defined in a centralized way
[10], [11]. This leads to excessive control overhead where
processing the specific requests of all devices can be a heavy
burden for the central controller. To this end, we propose
a hierarchical architecture for joint resource allocation. The
global resource manager (GRM) is responsible for inter-slice
resource allocation, and then slice resource managers (SRMs)
implement intra-slice resource allocation among associated
user equipment (UEs).
The main contributions of this work are: (1) We define a
hierarchical architecture for joint radio and cache resource
allocation of cellular networks. The GRM will intelligently
allocate resources between slices, then SRMs will distribute
radio and cache resource within each slice based on specific
rules. The proposed hierarchical architecture can reduce the
control overhead and achieve higher flexibility.
(2) We propose two DTRL-based solutions for the inter-
slice resource allocation, namely QDTRL and ADTRL. The
QDTRL can utilize the Q-values of the experts as prior
knowledge to improve the learning process, while the ADTRL
focuses on action selection knowledge. Compared with RL or
DRL, the proposed DTRL solutions show a better knowledge
transfer capability.
We further propose two baseline algorithms, including a
model-free exploration bonus DQN (EB-DQN) algorithm and
a model-based priority proportional fairness and time-to-live
(PPF-TTL) method. The proposed DTRL solutions are com-
pared with these two baseline algorithms via simulations.
The results demonstrate that DTRL-based algorithms perform
better in both network and ML metrics. In particular, ADTRL
has 21.4% lower URLLC delay and 22.4% higher eMBB
throughput than EB-DQN. The simulations also show 40.8%
lower URLLC delay and 59.8% higher eMBB throughput than
the PPF-TTL method. QDTRL and ADTRL also outperform
EB-DQN with significantly faster convergence.
The rest of this work is organized as follows. Section II
presents related work, Section III introduces the background,
and Section IV shows the system model and problem formula-
tion. Section V explains the DTRL-based resource allocation
scheme. The simulations are shown in Section VI, and Section
VII concludes this work.
II. RELATED WORK
Recently, numerous studies have applied artificial intelli-
gence (AI) techniques for resource allocation in 5G networks.
DRL is deployed in [14] for spectrum allocation in integrated
access and backhaul networks with dynamic environments,
and DRL is combined with game theory in [15] for multi-
tenant cross-slice resource allocation. Multi-agent reinforce-
ment learning is used in [16] for distributed dynamic spectrum
access sharing of communication networks. Moreover, a DRL-
based method is proposed in [17] for mobility-aware proactive
resource allocation, which pre-allocates resources to mobile
UEs in time and frequency domains. Finally, [18] proposed
a DQN-based intelligent resource management method to
improve the quality of service for 5G cloud RAN. The afore-
mentioned works show that various ML methods have been
applied for the resource management of wireless networks,
including RL [10], DRL [11], [14], [15], [17], [18], DDQN
[12], multi-agent reinforcement learning [16], and so on. In
our former work [19], we proposed a correlated Q-learning-
based method for the radio resource allocation of 5G RAN,
but the knowledge transfer was still not considered.
In these works, the main motivations for deploying ML
algorithms are the increasing complexity of wireless networks
and the difficulties to build dedicated optimization models.
Indeed, the evolving network architecture, emerging new net-
work performance requirements and increasing device num-
bers make the traditional methods such as convex optimization
more and more complicated. ML, especially RL, offers a good
opportunity to reduce the optimization complexity and make
data-driven decisions. However, a large number of samples
are needed for the learning process, which means a long
TABLE I
COMPARISON OF RL, TL, DRL AND TRL

RL — Features: The agent has no prior knowledge about tasks; it explores new tasks from scratch. Difficulties: Long convergence time, low exploration efficiency. Applications: Tasks with limited state-action space and no prior knowledge.
TL — Features: Improving generalization across different distributions between expert and learner tasks. Difficulties: Negative transfer, automation of task mapping. Applications: Mainly designed for the supervised learning domain, e.g., classification and regression.
DRL — Features: Combining artificial neural networks with the RL architecture; applying neural networks to estimate state-action values. Difficulties: Time-consuming network training and tuning; low sample and exploration efficiency; training stability. Applications: Large state-action space or continuous-action problems.
TRL — Features: Utilizing prior knowledge from experts to improve the performance of learners, such as higher average reward or faster convergence. Difficulties: A transfer function needs to be defined to digest the prior knowledge. Applications: Optimization of tasks with existing prior knowledge.
training time and low system efficiency. Furthermore, after the
long learning process, the agent can only handle one specific
task without any generalization. This learning process has
to be repeated when a new task is assigned. DRL can be
considered as a solution to reduce the number of samples
and improve convergence. However, it is still limited to a
localized domain, and the neural network training can be
time-consuming because of the hyperparameter tuning. To
this end, we propose DTRL-based solutions for the joint
radio and cache resource allocation in 5G RAN. Transfer reinforcement learning (TRL) has been applied in [20] for intra-beam interference management in 5G mmWave networks, but here the resource allocation problem is more complicated due to a much larger state-action space. The proposed scheme
has a satisfying knowledge transfer capability by utilizing the
knowledge of expert agents, and it outperforms EB-DQN by
faster convergence and better network performance.
III. BACKGROUND
Fig. 1. Comparison of TRL and RL.
In this section, we introduce the background of TRL to
distinguish it from RL. As shown in Fig.1, given a task pool,
the interactions between tasks and agents can be described by the Markov decision process (MDP) <S, A, R, T>, where S is the state space, A is the action space, R is the reward function, and T is the transition probability. In RL, each agent works independently to try different actions, arriving at new states and receiving rewards. The learning phase of RL can be defined as:
L_{RL}: s \times K \rightarrow a, r \quad (a \in A),    (1)
where K is the knowledge of this agent, s is the current state, a is the selected action, and r is the reward. Equation (1) indicates that the RL agent utilizes the collected knowledge to select action a and receive reward r under state s.
On the contrary, TRL includes two phases: the knowledge
transfer phase and the learning phase. In the knowledge
transfer phase, as shown in Fig.1, considering task differences,
a mapping function is defined to make the knowledge of
experts digestible for the learner. Then the learner explores
current tasks on its own and forms its own knowledge. The
whole process of TRL can be defined as:
L_{TRL}: s \times M(K_{expert}) \times K_{learner} \rightarrow a, r \quad (a \in A),    (2)
where M is the mapping function, K_{expert} is the knowledge from the experts, and K_{learner} is the knowledge of the learner. In equation (2), the knowledge from experts is utilized in the action selection of the learner, and it is expected to accelerate the learning process L_{TRL}. In TRL, new tasks can be better handled based on the knowledge of experts.
The RL, transfer learning (TL), DRL and TRL techniques
are compared in Table I. Compared with RL, TRL has higher
exploration efficiency and better generalization capability [21].
On the other hand, although DRL is a breakthrough approach
by combining neural networks with RL schemes, the time-
consuming network training is a well-known issue, and the
training stability and generalization capability of DRL can
cause problems [22]. Finally, although TL has been extensively
studied in the ML literature, it mainly focuses on the super-
vised learning domain such as classification and regression
[13]. Compared with TL, TRL can be more complicated
because the knowledge needs to be transferred in the context of
the MDP scheme. Moreover, due to the dedicated components
of MDP, the knowledge may exist in different forms, which
needs to be transferred in different ways [23]. Furthermore,
TRL has shown significant improvements in robot learning.
Inspired by these approaches and previous successful results
[20], we propose DTRL-based schemes for RAN slicing.
In this work, we combine state-of-the-art DDQN with TRL
and propose a DTRL-based joint resource allocation method
for 5G RAN slicing. Compared with conventional DRL that
Fig. 2. Proposed radio and cache resource allocation scheme
explores the task from scratch, the DTRL can extract the prior
knowledge of related tasks and reuse it for target tasks.
IV. SYSTEM MODEL AND PROBLEM FORMULATION
A. Overall Architecture
Fig.2 presents the proposed joint radio and cache resource
allocation scheme. A hierarchical architecture is defined for the
resource allocation of cache-enabled macro cellular networks.
We assume the base station (BS) has the caching capability
to store content items. Our proposed scheme can be used
for any number of slices without loss of generality, but here
we mainly consider two typical slices, namely an eMBB
slice and URLLC slice, to better illustrate the framework.
Firstly, the SRMs collect the QoS requirements of associated
UEs, and the collected information is sent to the GRM.
Then the GRM will intelligently implement the inter-slice
resource allocation to divide the radio and cache resource
between eMBB and URLLC SRMs. Finally, SRMs distribute
resource blocks (RBs) to associated UEs and update cached
content items within the allocated caching capacity. We apply
learning-based methods to realize an intelligent GRM, while
rule-based methods are deployed for SRMs. This hierarchical
architecture can alleviate the burden of GRM, since the GRM
only accounts for slice-level allocation instead of handling all
the UEs directly.
B. Communication Model
In this section, we introduce the communication model.
Firstly, the total delay consists of:
d_{j,m,g} = d^{tx}_{j,m,g} + d^{que}_{j,m,g} + (1 - \beta_{j,g}) d^{back}_{j,m,g},    (3)
where d^{tx}_{j,m,g} is the transmission delay of content item g from BS j to UE m, d^{que}_{j,m,g} refers to the queuing delay that content g waits in the buffer of BS j to be transmitted to UE m, and d^{back}_{j,m,g} is the backhaul delay of fetching content items from the core network. \beta_{j,g} is a binary variable: \beta_{j,g} = 1 when content item g is cached at BS j, and \beta_{j,g} = 0 otherwise.
Equation (3) shows that the total delay is affected by both communication and cache resources. The transmission delay depends on the radio resource allocation, and the edge caching can prevent the backhaul delay. The scheduling efficiency will affect the queuing delay.
The transmission delay depends on the link capacity between the BS and UE:
d^{tx} = \frac{L_m}{P_{j,m}},    (4)
where L_m is the size of the content items required by UE m, and P_{j,m} is the link capacity between BS j and UE m, which is calculated as follows:
P_{j,m} = \sum_{q \in N^{RB}_j} b_q \log\Big(1 + \frac{p_{j,q} x_{j,q,m} g_{j,q,m}}{b_q N_0 + \sum_{j' \in J \setminus j} p_{j',q} x_{j',q,m} g_{j',q,m}}\Big),    (5)
where N^{RB}_j is the set of RBs in BS j, b_q is the bandwidth of RB q, N_0 is the noise power density, p_{j,q} is the transmission power of RB q of BS j, x_{j,q,m} is a binary indicator to denote whether RB q is allocated to UE m, g_{j,q,m} is the channel gain between the BS and UE, and j' \in J \setminus j is the BS set except BS j. In the proposed communication system model, we assume orthogonal frequency-division multiplexing (OFDM) is deployed to avoid intra-cell interference [24], and \sum_{j' \in J \setminus j} p_{j',q} x_{j',q,m} g_{j',q,m} in equation (5) indicates the inter-cell interference of the downlink transmission from other BSs [25].
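For illustration, the delay model can be evaluated numerically. The following minimal Python sketch computes the link capacity of equation (5) and the total delay of equations (3)-(4); the bandwidth, power, channel-gain and delay values are hypothetical, and the logarithm in (5) is assumed to be base 2.

import numpy as np

def link_capacity(b_q, p_srv, g_srv, x_alloc, p_int, g_int, n0):
    # Equation (5): sum over RBs of b_q * log2(1 + SINR), where the SINR
    # denominator contains noise plus inter-cell interference from other BSs.
    cap = 0.0
    for q in range(len(b_q)):
        interference = float(np.sum(p_int[:, q] * g_int[:, q]))
        sinr = p_srv[q] * x_alloc[q] * g_srv[q] / (b_q[q] * n0 + interference)
        cap += b_q[q] * np.log2(1.0 + sinr)
    return cap  # bits per second

def total_delay(l_m, capacity, d_queue, d_backhaul, cached):
    # Equations (3) and (4): transmission + queuing + (1 - beta) * backhaul delay.
    d_tx = l_m / capacity
    return d_tx + d_queue + (0.0 if cached else d_backhaul)

# Hypothetical example: 4 RBs of 180 kHz each and one interfering BS.
rng = np.random.default_rng(0)
b_q = np.full(4, 180e3)                        # RB bandwidth [Hz]
p_srv, g_srv = np.full(4, 1.0), rng.rayleigh(1e-7, 4)
p_int, g_int = np.full((1, 4), 1.0), rng.rayleigh(1e-8, (1, 4))
cap = link_capacity(b_q, p_srv, g_srv, np.ones(4), p_int, g_int, n0=4e-21)
print(total_delay(l_m=36 * 8, capacity=cap, d_queue=1e-3, d_backhaul=5e-3, cached=False))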
C. Slicing-based Caching Model
We will introduce the slicing-based caching model in this
section, in which the time-to-live (TTL) method is used as the
content replacement strategy. The TTL indicates the time that
a content item is stored in a caching system before it is deleted
or replaced. The TTL value will be reset if this content item is
required again, and thus popular content items will live longer.
Although there have been many content replacement strategies,
TTL is selected because: i) this paper mainly focuses on the
inter-slice level resource allocation, and it is reasonable to
apply a well-known caching method for intra-slice caching; ii)
TTL requires no prior knowledge of content popularity, which
is more realistic. Nevertheless, our proposed architecture is
compatible with any other caching methods without loss of
generality. The complexity of the slicing-based caching model
lies in how to effectively divide the limited caching capacity
between slices, which is far more complicated than the original
TTL model. We use N_j to represent the slice set in BS j; slice n contains |M_{j,n}| UEs, and each UE is denoted by m. Each slice has its own content catalog G_{j,n}, and the variable g represents a content item (g \in G_{j,n}). We assume all content items have the same packet size [26].
\phi_{j,n,m,g} represents the request rate of UE m for content item g, which denotes the frequency with which content item g is demanded by UE m. Then we have
\phi_{j,n,m,g} = p_{j,n,m,g} \phi_{j,n,m},    (6)
where \phi_{j,n,m} is the total request rate of UE m, and p_{j,n,m,g} is the request rate distribution of UE m over content items (\sum_{g \in G_{j,n}} p_{j,n,m,g} = 1).
The cache hit ratio indicates the probability of finding a content item in the cache, which can be calculated by [27]:
h_{j,n,g} = 1 - e^{-\sum_{m \in M_{j,n}} \phi_{j,n,m,g} T_{j,n,m,g}},    (7)
where h_{j,n,g} is the cache hit ratio of content item g in slice n, and the TTL of content item g is reset to T_{j,n,m,g} when this content item is requested. Note that the number of contents is large and the request rate of each content item is relatively small. Based on the approximation e^x \approx x + 1 as x \to 0, we approximately have:
h_{j,n,g} = \sum_{m \in M_{j,n}} \phi_{j,n,m,g} T_{j,n,m,g}.    (8)
Meanwhile, the total cache hit ratio is related to the allocated storage capacity of slice n:
\sum_{g \in G_{j,n}} \sum_{m \in M_{j,n}} h_{j,n,m,g} = \frac{C_{j,n}}{C_{j,T}},    (9)
where C_{j,n} is the allocated storage capacity for slice n in BS j, C_{j,T} is the total storage capacity of BS j, and h_{j,n,m,g} is the cache hit ratio of UE m for content item g. We assume content item g has the same popularity for different UEs and thus T_{j,n,m,g} = T_{j,n,g}. Given equations (8) and (9), we have (the proof is given in the appendix):
h_{j,n,g} = \frac{C_{j,n}}{C_{j,T}} \frac{\sum_{m \in M_{j,n}} \phi_{j,n,m,g}}{\sum_{m \in M_{j,n}} \phi_{j,n,m}}.    (10)
Equation (10) indicates that a higher caching capacity C_{j,n} leads to a higher cache hit ratio h_{j,n,g} [26]. As such, popular content items, which are indicated by a higher request rate \phi_{j,n,m,g}, have a higher cache hit ratio.
Then we define a cache hit rate \phi^{hit}_{j,n,m} to describe the frequency of requesting cached content items within the total request rate, which can be calculated by:
\phi^{hit}_{j,n,m} = \sum_{g \in G_{j,n}} \phi_{j,n,m,g} h_{j,n,m,g},    (11)
and the cache miss rate is:
\phi^{miss}_{j,n,m} = \phi_{j,n,m} - \phi^{hit}_{j,n,m},    (12)
which indicates the frequency of requesting non-cached content items within the total request rate.
Finally, the backhaul delay only applies when a content item is not cached at the BS. The backhaul service is presumed to obey an M/M/1 queue, and the average delay is [28]:
d^{back}_{j,m,g} = \frac{1}{B/L_m - \sum_{n \in N_j} \sum_{m \in M_{j,n}} \phi^{miss}_{j,n,m}},    (13)
where B is the backhaul capacity, L_m is the average size of content items, B/L_m denotes the service rate, and \sum_{n \in N_j} \sum_{m \in M_{j,n}} \phi^{miss}_{j,n,m} is the total cache miss rate. Note that the backhaul delay becomes infinite if the total cache miss rate is higher than the service rate.
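A short numerical sketch of the caching model is given below, under the linear TTL approximation of equation (8) and with hypothetical request rates; it computes the per-content hit ratio of (10), the hit and miss rates of (11)-(12), and the M/M/1 backhaul delay of (13), assuming the per-UE hit ratio equals the slice-level ratio.

import numpy as np

def hit_ratio(C_n, C_T, phi_mg):
    # Equation (10): hit ratio of each content item in slice n, proportional to
    # the allocated capacity share and the item's share of the total request rate.
    phi_g = phi_mg.sum(axis=0)              # per-item request rate, summed over UEs
    return (C_n / C_T) * phi_g / phi_mg.sum()

def miss_rate(phi_mg, h_g):
    # Equations (11)-(12): per-UE hit rate and miss rate. The per-UE hit ratio
    # h_{j,n,m,g} is assumed equal to the slice-level ratio h_{j,n,g}.
    phi_hit = (phi_mg * h_g[None, :]).sum(axis=1)
    return phi_mg.sum(axis=1) - phi_hit

def backhaul_delay(B, L_m, total_miss):
    # Equation (13): M/M/1 delay with service rate B/L_m and arrival rate = miss rate.
    service_rate = B / L_m
    if total_miss >= service_rate:
        return np.inf
    return 1.0 / (service_rate - total_miss)

# Hypothetical slice: 5 UEs, 10 content items, Zipf-like popularity.
rng = np.random.default_rng(1)
pop = 1.0 / np.arange(1, 11)
phi_mg = rng.uniform(0.5, 1.5, (5, 1)) * (pop / pop.sum())[None, :]  # requests/s
h_g = hit_ratio(C_n=4, C_T=20, phi_mg=phi_mg)
total_miss = miss_rate(phi_mg, h_g).sum()
print(h_g.round(3), backhaul_delay(B=10e6, L_m=36 * 8, total_miss=total_miss))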
D. Problem Formulation
The objective of the eMBB slice is to maximize the total
throughput, while the URLLC slice aims to minimize the
average delay. It is assumed that the content catalogs of the two
slices are not overlapped. To balance the requirements of the
two slices, the GRM needs to jointly consider the objectives
of the two slices and allocate the radio and cache resource
accordingly. For each BS, the GRM allocates radio and cache
resource by:
\max \;\; w\, b_{embb,avg} + (1 - w)(d_{tar} - d_{urllc,avg}),    (14)
s.t. \;\; (3), (5), (10), (11) and (13),    (14a)
\sum_{n \in N_j} N_{j,n} \le N_{j,T},    (14b)
\sum_{n \in N_j} C_{j,n} \le C_{j,T},    (14c)
\sum_{n \in N_j} \sum_{m \in M_{j,n}} \phi^{miss}_m \le \frac{B}{L_m},    (14d)
\sum_{n \in N_j} \sum_{m \in M_{j,n}} x_{j,q,m} \le 1,    (14e)
\sum_{m \in M_{j,n}} \sum_{q \in N^{RB}_j} x_{j,q,m} \le N_{j,n},    (14f)
\sum_{g \in G_{j,n}} \beta_{j,g} \le C_{j,n},    (14g)
where b_{embb,avg} is the total throughput of the eMBB slice, d_{urllc,avg} is the average latency of the URLLC slice, and d_{tar} is the target delay. Here we use w as a weight factor in (14) to balance the objectives of the two slices and maximize the overall objective. N_j is the slice set of BS j, which consists of the eMBB and URLLC slices. N_{j,n} is the number of RBs that the GRM allocates to slice n, N_{j,T} is the total number of RBs of BS j, and equation (14b) is the radio resource constraint. C_{j,n} is the caching capacity of slice n, C_{j,T} is the total caching capacity of BS j, and (14c) ensures that the allocated caching capacity cannot exceed the upper limit. M_{j,n} is the set of UEs in slice n, \phi^{miss}_m is the miss rate of UE m, and (14d) denotes that the total miss rate should not exceed the backhaul service rate. x_{j,q,m} has been defined in equation (5) as the RB allocation indicator. Equations (14e) and (14f) denote that one RB can only be allocated to one UE, and that the total number of available RBs in slice n is N_{j,n}. Finally, \beta_{j,g} is a binary variable that has been defined in equation (3) to represent whether a content item is cached.
In the defined problem formulation, N_{j,n} and C_{j,n} are the control variables of the GRM, which means the GRM only accounts for the inter-slice resource allocation. x_{j,q,m} and \beta_{j,g} are the control variables of the SRMs. In the SRMs, we apply the classic proportional fairness algorithm for intra-slice RB allocation to determine x_{j,q,m}, since all UEs in the same slice are presumed to be equally important [19]. Meanwhile, cached content items are updated by the TTL rule, which determines \beta_{j,g}.
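Since the GRM decision in (14) is a search over discrete inter-slice splits, a brute-force reference is easy to state. The sketch below enumerates (N_embb, C_embb) pairs under constraints (14b)-(14c) and picks the split with the largest weighted objective; the evaluation function and its numbers are hypothetical placeholders, and the learning-based GRM of Section V replaces this exhaustive search in practice.

def best_split(N_T, C_T, w, d_tar, evaluate):
    """Exhaustive search over inter-slice splits for the objective of (14).

    `evaluate(N_embb, C_embb)` is assumed to return the resulting
    (eMBB throughput, average URLLC delay) for a given split, e.g. from a
    network simulator; it is a placeholder here."""
    best, best_obj = None, float("-inf")
    for n_embb in range(N_T + 1):              # constraint (14b)
        for c_embb in range(C_T + 1):          # constraint (14c)
            b_embb, d_urllc = evaluate(n_embb, c_embb)
            obj = w * b_embb + (1.0 - w) * (d_tar - d_urllc)
            if obj > best_obj:
                best, best_obj = (n_embb, N_T - n_embb, c_embb, C_T - c_embb), obj
    return best, best_obj

# Toy evaluation: throughput grows with eMBB resources, URLLC delay shrinks
# with the resources left to the URLLC slice (purely illustrative numbers).
toy = lambda n, c: (0.5 * n + 0.2 * c, 10.0 / (1 + (100 - n)) + 5.0 / (1 + (20 - c)))
print(best_split(N_T=100, C_T=20, w=0.5, d_tar=2.0, evaluate=toy))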
TABLE II
SUMMARY OF STRATEGIES AND MDP DEFINITIONS

Expert 1 — Strategy: Q-learning based radio resource allocation; no caching capability. Action²: (N_embb, N_urllc).
Expert 2 — Strategy: fixed radio resource allocation; Q-learning based caching capacity allocation. Action²: (C_embb, C_urllc).
Learner 1 (QDTRL) — Strategy: DTRL-based joint radio and cache resource allocation with the Q-value mapping function. Action²: (N_embb, N_urllc, C_embb, C_urllc), where N_embb and C_embb denote the number of RBs and the caching capacity allocated to the eMBB slice, respectively; N_urllc and C_urllc are defined similarly for the URLLC slice.
Learner 2 (ADTRL) — Strategy: DTRL-based joint radio and cache resource allocation with the action selection mapping function. Action²: same as Learner 1.
Baseline 1¹ — Strategy: EB-DQN for joint radio and cache resource allocation without any prior knowledge. Action²: same as Learner 1.
States (all strategies): (s_embb, s_urllc), where s_embb denotes the number of eMBB packets in the queue, and s_urllc is defined similarly.
Instant reward (all strategies): r = w r_embb + (1 - w) r_urllc, where r_embb = \frac{2}{\pi}\tan^{-1}(b_{embb,avg}) and r_urllc = \frac{2}{\pi}\tan^{-1}(d_{tar} - d_{urllc,avg}). w is the weighting factor, r_embb and r_urllc are the rewards of the eMBB and URLLC slices, respectively, b_{embb,avg} and d_{urllc,avg} are the average throughput of the eMBB slice and the average delay of the URLLC slice, and d_{tar} is the target URLLC delay.
¹ We include PPF-TTL as the second baseline. However, since PPF-TTL is not an ML-based method, it is excluded from this table.
² Given the total number of RBs N_{j,T}, N_urllc can be easily calculated once N_embb has been decided. However, we present the action definition using (N_embb, N_urllc) for better readability and scalability if more slices are included, and similarly for the cache resource allocation action (C_embb, C_urllc).
V. DEEP TRANSFER REINFORCEMENT LEARNING BASED RESOURCE ALLOCATION
A. Overall framework
In this section, we introduce the DTRL-based inter-slice
resource allocation, where each BS is considered as an in-
dependent agent to make decisions autonomously. As shown
in Table II, five learning-based strategies are deployed. We
assume experts 1 and 2 apply Q-learning for radio and cache
resource allocation, respectively. Experts are only good at
one of radio or cache resource allocation, but they have
no multi-task knowledge. Then learners 1 and 2 can utilize
knowledge from experts to improve their own performance on
joint radio and cache resource allocation. Based on different
mapping functions, we propose two DTRL-based methods,
namely QDTRL and ADTRL. Finally, we apply EB-DQN as
a learning-based benchmark and the PPF-TTL method as a
model-based baseline. In the following, we will introduce the
experts, learners and baselines.
B. Q-learning based Experts
In this section, we assume the expert agents have learning
experience on one specific task, but they have no knowledge
of other tasks. For expert 1, it uses Q-learning for the radio
resource allocation, and there is no caching capability. For
expert 2, RBs are allocated by the numbers of UEs in
each slice, and Q-learning is used for the caching capacity
allocation.
To transform the problem formulation in equation (14) to the RL context, we first define the MDP (S, A, T, R) for the experts, where S is the state set, A is the action set, T is the transition probability, and R is the reward function. The MDP definitions of the experts are given below:
State: In this work, we intend to coordinate the performance of various slices by inter-slice resource allocation. As such, the state definition should reflect the transmission demand of each slice. The states of experts 1 and 2 are both defined by (s_embb, s_urllc), which indicates the number of packets waiting in the queues of the eMBB and URLLC slices, respectively.
Action of expert 1: Expert 1 only implements radio resource allocation, and consequently the action (N_embb, N_urllc) denotes the number of RBs allocated to the eMBB and URLLC slices.
Action of expert 2: In expert 2, the learning strategy is only applied for caching capacity allocation, and the action (C_embb, C_urllc) indicates the caching capacity allocated to the eMBB and URLLC slices.
It is worth noting that the actions N_embb, N_urllc, C_embb and C_urllc have been defined as control variables in the problem formulation (14), and they are transformed to actions here to serve the Q-learning scheme.
Reward function: The reward functions of experts 1 and 2 are defined by the objectives of the slices:
r = w r_{embb} + (1 - w) r_{urllc},    (15)
where r_{embb} and r_{urllc} are the rewards of the eMBB and URLLC slices, respectively, and w is the weight factor.
For the eMBB slice, obtaining higher throughput leads to a higher reward, and we have:
r_{embb} = \frac{2}{\pi}\tan^{-1}(b_{embb,avg}),    (16)
where b_{embb,avg} is the average throughput of the eMBB slice, and we apply the \tan^{-1} function to normalize the reward (0 < r_{embb} < 1).
For the URLLC slice, a lower delay means a higher reward:
r_{urllc} = \frac{2}{\pi}\tan^{-1}(d_{tar} - d_{urllc,avg}),    (17)
where d_{tar} and d_{urllc,avg} are the target and achieved average delays for the URLLC slice, respectively. Note that both r_{embb} and r_{urllc} are normalized to balance the performance metrics of the two slices.
Moreover, to guarantee the constraints of the problem formulation (14), we apply penalties to the reward when these constraints are violated.
With Q-learning, the agent aims to maximize the long-term expected reward:
V(s_e) = E_\pi\Big(\sum_{i=0}^{\infty} \gamma^i r(s_{e,i}, a_{e,i}) \,\Big|\, s_e = s_{e,0}\Big),    (18)
where V(s_e) is the long-term expected accumulated reward of the experts at state s_e, and here we use the notation e to indicate experts. s_{e,0} is the initial state, r(s_{e,i}, a_{e,i}) denotes the reward of selecting action a_{e,i} at state s_{e,i} in episode i, and \gamma is the discount factor (0 < \gamma < 1).
Then the state-action values of experts 1 and 2 are updated by:
Q^{new}(s_e, a_e) = Q^{old}(s_e, a_e) + \alpha\big(r_e + \gamma \max_{a} Q(s'_e, a) - Q^{old}(s_e, a_e)\big),    (19)
where Q^{old}(s_e, a_e) and Q^{new}(s_e, a_e) are the old and new Q-values, s_e and s'_e are the current and next states of the experts, respectively, a_e is the action, r_e is the reward, and \alpha is the learning rate (0 < \alpha < 1). By updating the Q-values iteratively, the experts will learn the optimal action selections to achieve the best accumulated reward.
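As an illustration of the expert agents, the following minimal tabular sketch implements the update of equation (19) with the normalized reward of (15)-(17); the state encoding, action grid and slice measurements are hypothetical placeholders.

import math
import random
from collections import defaultdict

def reward(b_embb_avg, d_urllc_avg, d_tar, w=0.5):
    # Equations (15)-(17): weighted sum of the arctan-normalized slice rewards.
    r_embb = 2.0 / math.pi * math.atan(b_embb_avg)
    r_urllc = 2.0 / math.pi * math.atan(d_tar - d_urllc_avg)
    return w * r_embb + (1.0 - w) * r_urllc

class ExpertQAgent:
    """Expert 1 style agent: Q-learning over RB splits (N_embb, N_urllc)."""
    def __init__(self, actions, alpha=0.1, gamma=0.5, eps=0.05):
        self.q = defaultdict(float)            # Q-table keyed by (state, action)
        self.actions, self.alpha, self.gamma, self.eps = actions, alpha, gamma, eps

    def act(self, s):
        if random.random() < self.eps:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(s, a)])

    def update(self, s, a, r, s_next):
        # Equation (19): Q <- Q + alpha * (r + gamma * max_a' Q(s', a') - Q).
        target = r + self.gamma * max(self.q[(s_next, a2)] for a2 in self.actions)
        self.q[(s, a)] += self.alpha * (target - self.q[(s, a)])

# Hypothetical usage: states are (queued eMBB, queued URLLC) packet counts.
agent = ExpertQAgent(actions=[(n, 100 - n) for n in range(0, 101, 10)])
s, a = (12, 3), agent.act((12, 3))
r = reward(b_embb_avg=45.0, d_urllc_avg=1.2, d_tar=2.0)
agent.update(s, a, r, s_next=(10, 4))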
C. Deep Transfer Reinforcement Learning based Learners
In this section, we propose two DTRL-based algorithms,
namely QDTRL (learner 1) and ADTRL (learner 2). QDTRL
utilizes the Q-values of the experts as prior knowledge, while
ADTRL uses the action selection experience of experts for
improved performance. Consequently, we define two different
mapping functions to transfer knowledge from experts to
learners 1 and 2, respectively.
In this section, we first define the states, actions and rewards
of learners, then we introduce the DDQN framework for
two learners. Finally, we present the proposed QDTRL and
ADTRL algorithms.
1) MDP and DDQN Framework for Learners
Given the prior knowledge of experts, learners are expected
to solve more complicated problems with a larger state-action
space. To apply TRL, we define the MDP by (S, A, R, T, F), where F is the mapping function. The states, actions and reward functions of the two learners are given below:
State and reward function: As shown in Table II, we assume the learners have the same state and reward definitions as the experts. The main reason is that transfer learning is designed for tasks that share some similarities with existing expert tasks. As such, similar state and reward definitions can reduce the complexity of defining mapping functions, which can better transfer the knowledge from experts to learners.
Action of learners 1 and 2: Compared with the experts, the learners have to jointly consider the radio and cache resource allocation, and then the actions are defined by (N_{embb}, N_{urllc}, C_{embb}, C_{urllc}). Compared with allocating one single resource, the joint resource allocation problem is much more complicated, especially when multiple slices are involved.
Based on the MDP definitions, we introduce our DTRL algorithm. In conventional Q-learning, Q-values are updated by:
Q^{new}(s_l, a_l) = Q^{old}(s_l, a_l) + \alpha\big(r_l + \gamma \max_{a} Q(s'_l, a) - Q^{old}(s_l, a_l)\big),    (20)
where s_l and a_l are the state and action of the learner, respectively, s'_l is the next state, and r_l is the reward. Here we use the notation l to indicate the learner. Q-learning applies a Q-table to record state-action values, and consequently it may suffer from slow convergence when the state-action space is huge. To this end, DQN has been proposed, using deep neural networks to predict Q-values [22].
When the Q-values in equation (20) converge, we have Q^{old}(s_l, a_l) = Q^{new}(s_l, a_l) and Q^{old}(s_l, a_l) = r_l + \gamma \max_{a} Q(s'_l, a). Then, a loss function can be defined for the network training of DQN:
L(w) = Er\big(r_l + \gamma \max_{a} Q(s'_l, a, \bar{w}) - Q(s_l, a_l, w)\big),    (21)
where Er is the loss function representing the error between the prediction results r_l + \gamma \max_{a} Q(s'_l, a, \bar{w}) and the target results Q(s_l, a_l, w). w is the weight of the main network, which predicts the current Q-values Q(s_l, a_l, w), and \bar{w} is the weight of the target network, which predicts the target Q-values Q(s'_l, a, \bar{w}).
In DQN, note that the action selection and evaluation are both implemented by the target network, which is indicated by \max_{a} Q(s'_l, a, \bar{w}). Meanwhile, the target Q-values are calculated by the maximum Q-value of the next state. If the maximize operator is always included in the Q-value calculation, then the Q-values predicted by the neural network tend to be overestimated [29]. To this end, DDQN has been proposed to decouple the action selection and evaluation. The loss function of DDQN is defined as:
L(w) = Er\big(r_l + \gamma Q(s'_l, \arg\max_{a} Q(s'_l, a, w), \bar{w}) - Q(s_l, a_l, w)\big),    (22)
where the main network chooses actions by a_l = \arg\max_{a} Q(s'_l, a, w), and the target network evaluates the action by Q(s'_l, a_l, \bar{w}). By decoupling the action selection and evaluation, DDQN can prevent overestimation and better predict Q-values than DQN.
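The decoupling in (22) amounts to letting the main network pick the greedy next action and the target network score it. A framework-agnostic sketch is given below, with the two networks represented simply as precomputed Q-value arrays.

import numpy as np

def ddqn_targets(q_main_next, q_target_next, rewards, dones, gamma=0.5):
    """Double-DQN targets in the spirit of equation (22).

    q_main_next, q_target_next: arrays of shape (batch, num_actions) with the
    Q-values of the next states predicted by the main and target networks."""
    greedy = np.argmax(q_main_next, axis=1)                    # selection: main net
    evaluated = q_target_next[np.arange(len(greedy)), greedy]  # evaluation: target net
    return rewards + gamma * evaluated * (1.0 - dones)

# Tiny example with 3 transitions and 4 discrete actions.
rng = np.random.default_rng(2)
q_main_next, q_target_next = rng.normal(size=(3, 4)), rng.normal(size=(3, 4))
targets = ddqn_targets(q_main_next, q_target_next,
                       rewards=np.array([0.3, 0.7, 0.1]),
                       dones=np.array([0.0, 0.0, 1.0]))
print(targets)  # compared against Q(s_l, a_l, w) in the loss of (22)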
Fig. 3. Proposed deep transfer reinforcement learning architecture for resource management.
In this work, we deploy the DDQN architecture in the proposed DTRL. For the learner agent, shown in grey in Fig. 3, an action a_l is first selected and sent to the environment. Then a tuple (s_l, a_l, r_l, s'_l) is received from the environment and saved in the experience pool. The learner agent samples a random minibatch from the experience pool. For every tuple (s_l, a_l, r_l, s'_l), the main network predicts Q(s_l, a_l, w) and selects actions by a = \arg\max_{a} Q(s'_l, a, w). The target network evaluates the action by Q(s'_l, a, \bar{w}). Then we utilize the loss function shown in equation (22) for gradient descent to update the weight w of the main network. After several training sessions, the target network copies the weight parameters of the main network. Such a late update of the target network serves as a stable reference for the main network training. Here we deploy the Long Short-Term Memory (LSTM) network as the hidden layers of the main and target networks. As a special recurrent neural network, LSTM can better capture long-term data dependency, which makes it an ideal candidate to handle complicated wireless network environments [30].
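As a sketch of the network structure described above, one possible realization in PyTorch stacks LSTM hidden layers under a linear output head; the layer sizes here are assumptions chosen for illustration rather than the exact simulation settings.

import torch
import torch.nn as nn

class LSTMQNetwork(nn.Module):
    """Main/target Q-network with LSTM hidden layers, as used in the DTRL agent."""
    def __init__(self, state_dim, num_actions, hidden=35, layers=2):
        super().__init__()
        self.lstm = nn.LSTM(state_dim, hidden, num_layers=layers, batch_first=True)
        self.head = nn.Linear(hidden, num_actions)

    def forward(self, state_seq):
        # state_seq: (batch, sequence_length, state_dim); the last hidden output
        # of the sequence is mapped to one Q-value per discrete action.
        out, _ = self.lstm(state_seq)
        return self.head(out[:, -1, :])

main_net = LSTMQNetwork(state_dim=2, num_actions=20)
target_net = LSTMQNetwork(state_dim=2, num_actions=20)
target_net.load_state_dict(main_net.state_dict())   # delayed copy of the weights
q_values = main_net(torch.randn(8, 5, 2))            # batch of 8, sequences of 5 states
print(q_values.shape)                                 # torch.Size([8, 20])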
Finally, it is worth noting that we include two different
mapping functions in the proposed DTRL scheme. The Q-
value mapping function will affect the reward calculation of
the learner agent (indicated by the pink line in Fig.3), while
the action selection mapping function influences the action
selection (shown by the blue line in Fig.3). Accordingly,
we propose two DTRL-based methods, namely QDTRL and
ADTRL, and in the following we will introduce these two
mapping functions and corresponding algorithms.
2) Learner 1: Q-value based Deep Transfer Reinforcement
Learning
In QDTRL, the Q-values of the experts are presumed to be
the prior knowledge of the learner. The main idea behind this
is to encourage learners to select actions that have higher Q-
values in the experts. Considering the task similarities, actions
with a higher Q-value of experts are very likely to bring similar
high rewards for the learner. In particular, we consider the Q-
values of the experts as extra rewards for learners, which is
expected to improve exploration efficiency by selecting actions
with higher potential rewards [31].
Firstly, the loss function in QDTRL is defined by:
L(w) = Er\big(\sigma_1 Q_E(F(s_l), F'(a_l)) + r_l + \gamma Q(s'_l, \arg\max_{a} Q(s'_l, a, w), \bar{w}) - Q(s_l, a_l, w)\big),    (23)
where F and F' are the state and action mapping functions, respectively. Compared with equation (22), the main difference is that \sigma_1 Q_E(F(s_l), F'(a_l)) is involved as an extra reward of selecting action a_l under state s_l. \sigma_1 is the transfer learning rate, which describes the importance of the prior knowledge (0 \le \sigma_1 \le 1). A higher transfer learning rate means the prior knowledge utilization is more important than the learner's own learning process, while a lower value indicates the reverse.
In equation (23), we apply \sigma_1 Q_E(F(s_l), F'(a_l)) to guide the action selection of the learner. However, due to the different state-action spaces, the Q-values of the experts cannot be directly utilized by the learner; thus a function is needed to map the experts' Q-values to the learner's Q-table. The Q-value mapping function consists of state mapping and action mapping, and the Q_E term in equation (23) is generated by:
Q_E(F(s_l), F'(a_l)) = Q_{e,1}(s_{e,1}, a_{e,1}) + Q_{e,2}(s_{e,2}, a_{e,2}),    (24)
where Q_{e,1}, s_{e,1} and a_{e,1} are the Q-value, state and action of expert 1, respectively, and Q_{e,2}, s_{e,2} and a_{e,2} are defined similarly for expert 2. The objective of the mapping functions F and F' is to find the states and actions of the experts that are close to s_l and a_l. Considering the task similarities, we can use the existing decision knowledge of the expert agents to guide the action selection of the learner by finding similar states and actions [32]. F and F' are defined by:
State mapping F: For a given state s_l, considering that the experts and learner 1 have the same state definition, we can always find s_l = s_{e,1} = s_{e,2}. Thus F can be easily defined.
Action mapping F': The goal of F' is to find (a_{e,1}, a_{e,2}) = F'(a_l). Any action a_l, which is defined as a_l = (N_{embb}, N_{urllc}, C_{embb}, C_{urllc}), can be decomposed into the combination of a_{e,1} = (N_{embb}, N_{urllc}) and a_{e,2} = (C_{embb}, C_{urllc}). Then F' can be defined accordingly.
Based on the state and action mapping relationships, given s_l and a_l, we can always find specific Q_{e,1}(s_{e,1}, a_{e,1}) and Q_{e,2}(s_{e,2}, a_{e,2}). Then Q_E(F(s_l), F'(a_l)) can be directly used by learner 1.
Fig. 4. Proposed Q-value-based mapping function for deep transfer reinforcement learning.
The defined Q-value mapping function is summarized in Fig. 4 by steps 1 to 4. First, for any given (s_l, a_l), we find the state-action pairs (s_{e,1}, a_{e,1}) and (s_{e,2}, a_{e,2}) by the state mapping function F and the action mapping function F'. Then, we extract Q_{e,1}(s_{e,1}, a_{e,1}) and Q_{e,2}(s_{e,2}, a_{e,2}) from the expert agents' Q-tables. After that, we generate Q_E(F(s_l), F'(a_l)) by equation (24), which is considered as an extra reward when selecting a_l under s_l. This extra reward is added to r_l, and the new tuple (s_l, a_l, r_l + \sigma_1 Q_E(F(s_l), F'(a_l)), s'_l) is saved in the experience pool. Finally, we implement the gradient descent by equation (23) for the network training.
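The mapping functions F and F' and the augmented reward of equations (23)-(24) reduce to a table lookup plus a sum, as the following sketch shows; the expert Q-tables and their keys are hypothetical.

def q_transfer(s_l, a_l, q_expert1, q_expert2, sigma1=0.7):
    """Q-value mapping of equations (23)-(24).

    State mapping F: the learner state is reused directly for both experts.
    Action mapping F': the joint action (N_embb, N_urllc, C_embb, C_urllc) is
    decomposed into the radio part for expert 1 and the cache part for expert 2."""
    n_embb, n_urllc, c_embb, c_urllc = a_l
    a_e1, a_e2 = (n_embb, n_urllc), (c_embb, c_urllc)
    q_e = q_expert1.get((s_l, a_e1), 0.0) + q_expert2.get((s_l, a_e2), 0.0)
    return sigma1 * q_e          # added to r_l before the tuple enters the pool

# Hypothetical expert Q-tables keyed by (state, action).
q_expert1 = {((12, 3), (70, 30)): 0.8}
q_expert2 = {((12, 3), (15, 5)): 0.6}
s_l, a_l, r_l = (12, 3), (70, 30, 15, 5), 0.4
augmented_reward = r_l + q_transfer(s_l, a_l, q_expert1, q_expert2)
print(augmented_reward)          # 0.4 + 0.7 * (0.8 + 0.6) = 1.38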
3) Learner 2: Action Selection based Deep Transfer Reinforcement Learning
Learners are mainly designed to handle more complicated problems than experts, which usually means larger state or action spaces. For instance, the joint resource allocation problem has a much larger action space than allocating one single resource, which results in longer convergence. To this end, we propose the ADTRL algorithm to improve exploration efficiency by evaluating the potential optimality of actions.
More specifically, we first apply a lax bisimulation metric
to assess the MDP similarities between learners and experts.
Then, we calculate the potential advantage of different actions
in the learner and produce a lower bound for the optimality.
Finally, the optimality metrics of different actions are nor-
malized, and we assume that actions with higher potential
optimality are more likely to be selected in the exploration
phase. In the following, we will introduce the proposed method
in detail.
It is worth noting that TL is mainly applied to learner
tasks that are related to expert tasks. Specifically, it requires
similarities between expert and learner MDPs. Then, we first
introduce the Kantorovich distance K(D)(Y, Z) to describe the similarities between two distributions [33]:
\max_{u_f,\, f=1,2,...,|S|} \;\; \sum_{f=1}^{|S|} (Y(s_f) - Z(s_f)) u_f,
s.t. \;\; u_f - u_k \le D(s_f, s_k), \;\; \forall f, k = 1, 2, ..., |S|,
\;\;\;\; 0 \le u_f \le 1,    (25)
where Y and Z are two probability distributions over s_f \in S, u_f and u_k are internal optimization variables, and D(s_f, s_k) denotes a metric D to assess the distance between s_f and s_k. K(D)(Y, Z) shows the distance between the two probability distributions Y and Z under the metric D on the set S. However, in TL, we focus more on the distance metric between different MDPs, and K(D)(Y, Z) is rewritten as [34]:
\max_{u_f,\, f=1,...,|S_1|;\; v_k,\, k=1,...,|S_2|} \;\; \sum_{f=1}^{|S_1|} Y(s_f) u_f - \sum_{k=1}^{|S_2|} Z(s_k) v_k,
s.t. \;\; u_f - v_k \le D(s_f, s_k),
\;\;\;\; -1 \le u_f \le 1,    (26)
where S_1 and S_2 are two sets with s_f \in S_1 and s_k \in S_2, respectively, and K(D)(Y, Z) evaluates the distance between the two distribution sets S_1 and S_2. Here f = 1, 2, ..., |S_1| means that s_f has |S_1| possible values in set S_1, and the probability distribution function Y of s_f satisfies \sum_{f=1}^{|S_1|} Y(s_f) = 1. s_k, S_2 and the probability distribution function Z can be defined similarly.
Then we include two MDPs <S_e, A_e, T_e, R_e> and <S_l, A_l, T_l, R_l> to identify the difference of their state-action pairs. The lax bisimulation metric is introduced to evaluate the distance between two state-action pairs [35]:
D((s_e, a_e), (s_l, a_l)) := \theta_1 |r_e(s_e, a_e) - r_l(s_l, a_l)| + \theta_2 K(D)(Y(s_e, a_e), Z(s_l, a_l)),    (27)
where \theta_1 and \theta_2 are weight factors. The first term |r_e(s_e, a_e) - r_l(s_l, a_l)| represents the reward distance, and K(D)(Y(s_e, a_e), Z(s_l, a_l)) is the Kantorovich metric for the state-action pairs under the semi-metric D. D is defined by the Hausdorff metric:
D(s_e, s_l) = \max\Big( \max_{a_e \in A_e} \min_{a_l \in A_l} D((s_e, a_e), (s_l, a_l)), \;\; \max_{a_l \in A_l} \min_{a_e \in A_e} D((s_e, a_e), (s_l, a_l)) \Big).    (28)
The Hausdorff distance is the maximum distance from one set to the nearest point of the other set [36], and here we define D(s_e, s_l) to measure the distance between the action sets A_e and A_l under states s_e and s_l.
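For small discrete MDPs, both ingredients of (27)-(28) can be computed directly. The sketch below solves the dual linear program of (26) with SciPy to obtain the Kantorovich term (bounding both dual variables in [-1, 1], which is an assumption), combines it with the reward distance as in (27), and evaluates the Hausdorff reduction of (28) as a max-min over a pairwise distance matrix; all numbers are toy values.

import numpy as np
from scipy.optimize import linprog

def kantorovich(Y, Z, D):
    """Dual LP of equation (26): max sum_f Y_f u_f - sum_k Z_k v_k
    subject to u_f - v_k <= D[f, k], with u, v bounded in [-1, 1]."""
    F, K = len(Y), len(Z)
    c = np.concatenate([-np.asarray(Y), np.asarray(Z)])   # linprog minimizes
    A_ub, b_ub = [], []
    for f in range(F):
        for k in range(K):
            row = np.zeros(F + K)
            row[f], row[F + k] = 1.0, -1.0
            A_ub.append(row)
            b_ub.append(D[f, k])
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(-1.0, 1.0)] * (F + K), method="highs")
    return -res.fun

def lax_bisimulation(r_e, r_l, Y, Z, D_states, theta1=1.0, theta2=0.5):
    # Equation (27): reward distance plus the Kantorovich term over next states.
    return theta1 * abs(r_e - r_l) + theta2 * kantorovich(Y, Z, D_states)

def hausdorff(pair_dist):
    # Equation (28): pair_dist[i, j] = D((s_e, a_e^i), (s_l, a_l^j)).
    return max(pair_dist.min(axis=1).max(), pair_dist.min(axis=0).max())

# Toy example: two next states per MDP, unit ground distance between them.
D_states = np.array([[0.0, 1.0], [1.0, 0.0]])
d = lax_bisimulation(r_e=0.9, r_l=0.7, Y=[0.8, 0.2], Z=[0.3, 0.7], D_states=D_states)
print(d, hausdorff(np.array([[0.2, 0.5], [0.4, 0.1]])))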
To evaluate the potential optimality of selecting a_l under s_l, we include the Bellman optimality to find the state value difference:
|Q^*_l(s_l, a_l) - V^*_e(s_e)|
= |Q^*_l(s_l, a_l) - Q^*_e(s_e, \pi^*(s_e))|
= |Q^*_l(s_l, a_l) - Q^*_e(s_e, a^*_e)|
= |(r_l(s_l, a_l) + \gamma \sum_{s'_l \in S_l} Y(s'_l|s_l, a_l) V_l(s'_l)) - (r_e(s_e, a^*_e) + \gamma \sum_{s'_e \in S_e} Z(s'_e|s_e, a^*_e) V_e(s'_e))|
\le |r_l(s_l, a_l) - r_e(s_e, a^*_e)| + \gamma |\sum_{s'_l \in S_l} Y(s'_l|s_l, a_l) V_l(s') - \sum_{s'_e \in S_e} Z(s'_e|s_e, a^*_e) V_e(s')|
\le |r_l(s_l, a_l) - r_e(s_e, a^*_e)| + \max_{a_l \in A_l} \min_{a_e \in A_e} \big( \gamma |\sum_{s'_l \in S_l} Y(s'_l|s_l, a_l) V_l(s') - \sum_{s'_e \in S_e} Z(s'_e|s_e, a_e) V_e(s')| \big)    (using equations (26) and (28))
= |r_e(s_e, a^*_e) - r_l(s_l, a_l)| + \gamma K(D)(Y(s_e, a^*_e), Z(s_l, a_l))    (setting \theta_1 = 1 and \theta_2 = \gamma in equation (27))
= D((s_l, a_l), (s_e, a^*_e)),    (29)
where Q^*_l(s_l, a_l) is the optimal state-action value of (s_l, a_l), V^*_e(s_e) is the optimal state value of s_e, s'_l is the next state of s_l, and Y(s'_l|s_l, a_l) is the probability of arriving at s'_l by implementing a_l under s_l. Here we use a^*_e = \pi^*(s_e) to represent the action selection of the expert agent. Equation (29) shows that there is an upper bound on the difference between the state-action pairs (s_l, a_l) and (s_e, a^*_e). In the following, we will introduce how to utilize equation (29) to improve the action selection of the learner.
When selecting an action, we usually consider V^*_l(s_l) as a target value for Q^*_l(s_l, a_l), and we have V^*_l(s_l) = \max_{a_l} Q^*_l(s_l, a_l). Then V^*_l(s_l) - Q^*_l(s_l, a_l) can be used to evaluate the potential optimality of a_l by:
V^*_l(s_l) - Q^*_l(s_l, a_l)
= Q^*_l(s_l, \pi^*(s_l)) - Q^*_l(s_l, a_l)
= Q^*_l(s_l, a^*_l) - Q^*_l(s_l, a_l)
= |Q^*_l(s_l, a^*_l) - Q^*_l(s_l, a_l)|
= |Q^*_l(s_l, a^*_l) - V^*_e(s_e) + V^*_e(s_e) - Q^*_l(s_l, a_l)|
\le |Q^*_l(s_l, a^*_l) - V^*_e(s_e)| + |V^*_e(s_e) - Q^*_l(s_l, a_l)|
\le D((s_l, a^*_l), (s_e, a^*_e)) + D((s_l, a_l), (s_e, a^*_e))    (by using equation (29)).    (30)
Equation (30) gives an upper bound for the potential optimality of selecting a_l under s_l, and then a lower bound O_l(s_l, a_l) can be easily found by:
Q^*_l(s_l, a_l) - V^*_l(s_l)
= Q^*_l(s_l, a_l) - Q^*_l(s_l, \pi^*(s_l))
\ge -D((s_l, a^*_l), (s_e, a^*_e)) - D((s_l, a_l), (s_e, a^*_e))
= O_l(s_l, a_l).    (31)
In O_l(s_l, a_l), note that D((s_l, a^*_l), (s_e, a^*_e)) will not affect the a_l selection, since it is a constant value for given s_l and s_e.
Fig. 5. Summary of the proposed action selection mapping method.
Considering that we have two experts in this work, we rewrite O_l(s_l, a_l) as:
O_l(s_l, a_l) = -\sigma_2 \big( D((s_l, a_l), (s_{e,1}, a^*_{e,1})) + D((s_l, a_l), (s_{e,2}, a^*_{e,2})) \big) - (1 - \sigma_2) \big( D((s_l, a^*_l), (s_{e,1}, a^*_{e,1})) + D((s_l, a^*_l), (s_{e,2}, a^*_{e,2})) \big),    (32)
where \sigma_2 is the transfer learning rate of ADTRL (0 \le \sigma_2 \le 1). If \sigma_2 = 0, then O_l(s_l, a_l) becomes a constant value for all a_l \in A_l, which means the transferred knowledge will not affect the action selection of the learner. By contrast, if \sigma_2 = 1, O_l(s_l, a_l) totally depends on the lax bisimulation metric between (s_l, a_l) and (s_e, a^*_e), which indicates that the learner will imitate the action selection of the experts.
In summary, given <S_e, A_e, T_e, R_e> as an expert MDP and <S_l, A_l, T_l, R_l> as a learner MDP, O_l(s_l, a_l) defines the lower bound of the potential optimality of a_l in terms of the distance between (s_l, a_l) and (s_e, a^*_e). Here s_e is considered as the expert state that is closest to s_l. In this work, we assume experts and learners have the same state definitions, and s_e can be easily found accordingly for any s_l. Finally, the probability of choosing a_l is given by:
Pr(a_l|s_l) = \frac{Sig(O_l(s_l, a_l))}{\sum_{a \in A_l} Sig(O_l(s_l, a))},    (33)
where Sig denotes the sigmoid function used for normalization. Equation (33) means that actions with higher potential optimality have a higher chance of being selected, and consequently it improves the exploration efficiency.
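The resulting exploration rule of (31)-(33) can be sketched in a few lines: the potential-optimality lower bound is formed from the two distance terms weighted by the transfer learning rate and then normalized through a sigmoid. The distance values below are hypothetical inputs that would come from the lax bisimulation metric.

import numpy as np

def action_probabilities(d_to_expert, d_star_to_expert, sigma2=0.7):
    """Equations (32)-(33): sigmoid-normalized selection probabilities.

    d_to_expert[i]: sum of the distances D((s_l, a_l^i), (s_e, a_e*)) over experts.
    d_star_to_expert: the same sum evaluated at the learner's greedy action a_l*."""
    O = -sigma2 * np.asarray(d_to_expert) - (1.0 - sigma2) * d_star_to_expert
    sig = 1.0 / (1.0 + np.exp(-O))            # sigmoid of the optimality lower bound
    return sig / sig.sum()

# Hypothetical distances for 5 candidate joint actions; smaller distance to the
# experts' preferred behaviour means a higher chance of being explored.
probs = action_probabilities(d_to_expert=[0.1, 0.4, 0.9, 1.5, 0.2],
                             d_star_to_expert=0.1)
print(probs.round(3), probs.sum())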
Finally, we summarize the proposed action selection map-
ping method in Fig.5. Given the expert and learner MDPs as
input, we first calculate the Kantorovich and Hausdorff metrics
using equations (26) and (28), respectively. Then we calculate
the lax bisimulation metric via equation (27), and evaluate the
potential optimality of actions using equations (30) to (32).
Consequently, the optimality metrics are normalized, and the
action selection probability is produced by applying equation
(33). The proposed QDTRL and ADTRL are summarized in
Algorithm 1 and 2, respectively.
Algorithm 1 QDTRL-based joint resource allocation
1: Initialize: wireless and QDTRL parameters.
2: for TTI = 1 to T_total do
3:   for each BS do
4:     With probability ε, select an action randomly; otherwise, choose the action by a_l = \arg\max_{a} Q(s_l, a, w).
5:     The GRM implements the inter-slice resource allocation as in equation (14).
6:     SRMs distribute radio resources to UEs by proportional fairness, and replace cached content items.
7:     Update the system state, and save (s_l, a_l, r_l, s'_l) to the experience pool. Every C TTIs, sample a minibatch from the experience pool randomly.
8:     Find F(s_l) and F'(a_l) for every s_l and a_l in the minibatch.
9:     Generate the target Q-values:
       Q^{Tar}(s_l, a_l) = r_l, if done;
       Q^{Tar}(s_l, a_l) = \sigma_1 Q_E(F(s_l), F'(a_l)) + r_l + \gamma Q(s'_l, \arg\max_{a} Q(s'_l, a, w), \bar{w}), otherwise.
10:    Update w using gradient descent by minimizing the loss L(w) = Er(Q^{Tar}(s_l, a_l) - Q(s_l, a_l, w)).
11:    Copy w to \bar{w} after several training iterations.
12:   end for
13: end for
14: Output: performance of the network and the learning algorithm.
Algorithm 2 ADTRL-based joint resource allocation
1: Initialize: wireless and ADTRL parameters.
2: for TTI = 1 to T_total do
3:   for each BS do
4:     With probability ε, select the action a_l by using equation (33); otherwise, choose a_l by \arg\max_{a} Q(s_l, a, w).
5:     The GRM implements the inter-slice resource allocation as in equation (14).
6:     SRMs distribute radio resources to UEs by proportional fairness, and replace cached content items.
7:     Update the system state, and save (s_l, a_l, r_l, s'_l) to the experience pool. Every C TTIs, sample a minibatch from the experience pool randomly.
8:     Generate the target Q-values:
       Q^{Tar}(s_l, a_l) = r_l, if done;
       Q^{Tar}(s_l, a_l) = r_l + \gamma Q(s'_l, \arg\max_{a} Q(s'_l, a, w), \bar{w}), otherwise.
9:     Update w using gradient descent by minimizing the loss L(w) = Er(Q^{Tar}(s_l, a_l) - Q(s_l, a_l, w)).
10:    Copy w to \bar{w} after several training iterations.
11:   end for
12: end for
13: Output: performance of the network and the learning algorithm.
D. Baseline: Exploration bonus DQN and PPF-TTL
In this section, we include two baseline algorithms. Firstly, EB-DQN serves as a benchmark to compare our DTRL method with other ML-based algorithms. The MDP definition of EB-DQN is the same as that of DTRL, as shown in Table II. The EB-
DQN agent explores the joint resource allocation task from
scratch, and no prior knowledge is included. In EB-DQN, the
loss function is defined by:
L(w) = Er\big(r + \frac{\Psi}{\psi(s, a)} + \gamma \max_{a} Q(s', a, \bar{w}) - Q(s, a, w)\big),    (34)
where Er has been defined in equation (21) as the loss function of the neural networks, \Psi is an extra reward, and \psi(s, a) is the number of times that (s, a) has been selected. The term \Psi/\psi(s, a) is regarded as an extra bonus for selecting actions that are less visited, and it encourages the agent to better explore the environment. EB-DQN-based joint resource allocation is summarized in Algorithm 3.
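A count-based sketch of the bonus in (34) is shown below: visit counts are tracked per state-action pair and Ψ/ψ(s, a) is added to the reward before the usual target is formed; the discretization of states into hashable keys is an assumption.

from collections import defaultdict

class ExplorationBonus:
    """Count-based bonus of equation (34): Psi / psi(s, a)."""
    def __init__(self, psi_value=0.5):
        self.psi_value = psi_value
        self.counts = defaultdict(int)

    def __call__(self, state, action):
        self.counts[(state, action)] += 1
        return self.psi_value / self.counts[(state, action)]

bonus = ExplorationBonus(psi_value=0.5)
s, a, r = (12, 3), (70, 30, 15, 5), 0.4
shaped_reward = r + bonus(s, a)        # first visit: full bonus of 0.5
shaped_reward_again = r + bonus(s, a)  # repeated visits receive a smaller bonus
print(shaped_reward, shaped_reward_again)   # 0.9 and 0.65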
Algorithm 3 EB-DQN-based joint resource allocation
1: Initialize: wireless and EB-DQN parameters.
2: for TTI = 1 to T_total do
3:   for each BS do
4:     With probability ε, choose an action randomly; otherwise, choose the action by a_l = \arg\max_{a} Q(s_l, a, w).
5:     The GRM implements the inter-slice resource allocation as in equation (14).
6:     SRMs distribute radio resources to UEs by proportional fairness, and replace cached content items by TTL.
7:     Update the system state, and save (s_l, a_l, r_l, s'_l) to the experience pool. Every C TTIs, sample a minibatch from the experience pool randomly.
8:     Generate the target Q-values:
       Q^{Tar}(s_l, a_l) = r_l + \Psi/\psi(s_l, a_l), if done;
       Q^{Tar}(s_l, a_l) = r_l + \Psi/\psi(s_l, a_l) + \gamma \max_{a} Q(s'_l, a, \bar{w}), otherwise.
9:     Update w using gradient descent by minimizing the loss L(w) = Er(Q^{Tar}(s_l, a_l) - Q(s_l, a_l, w)).
10:    Copy w to \bar{w} after several training iterations.
11:   end for
12: end for
13: Output: performance of the network and the learning algorithm.
Algorithm 4 PPF-TTL based joint resource allocation
1: Initialize: wireless network parameters.
2: for TTI = 1 to T_total do
3:   for each BS do
4:     for each RB do
5:       Calculate the estimated transmission rates of the UEs in the queue.
6:       Calculate the proportional fairness metric [37].
7:       Transmit the URLLC packet with the highest proportional fairness metric. If there is no URLLC packet, transmit eMBB packets.
8:     end for
9:     The BS replaces cached content items by the time-to-live rule.
10:   end for
11: end for
12: Output: performance of the network.
On the other hand, to compare the ML methods with model-
based algorithms, we apply a model-based PPF-TTL algo-
rithm. The well-known priority proportional fairness (PPF)
algorithm is applied for radio resource allocation, in which
URLLC packets have a higher priority than eMBB packets
[37]. The RBs will first serve URLLC transmission, then
eMBB traffic will be processed. We deploy the TTL method
for caching, but no slicing and learning are included. The PPF-
TTL method is shown in Algorithm 4.
E. Computational Complexity Analyses
In this section, we analyze the computational complexity of the proposed DTRL-based methods. Firstly, the complexity of the DTRL method is dominated by the training and updating of the LSTM network. The complexity of the LSTM network updating consists of the running time of the recurrent connections and biases and the updating time of the input and output nodes. The computational complexity for updating the LSTM network in DTRL is O(l_{hd} m^2_{lstm} c^2_{lstm}) [30], where l_{hd} is the number of hidden layers, c_{lstm} is the number of memory cells in each block, and m_{lstm} is the number of memory blocks. It is worth noting that only the main network needs to be trained, and the target network can copy the weights from the main network.
On the other hand, the knowledge transfer process also contributes to the complexity. In QDTRL, the knowledge transfer consists of the state and action mapping functions F and F' (indicated by equations (23) and (24)). Accordingly, the time complexity is O(\sum_{q=1}^{N_e} |S_{e,q}||A_{e,q}|), where N_e is the number of experts, and S_{e,q} and A_{e,q} are the state and action sets of the experts, respectively.
In ADTRL, the Kantorovich distance can be considered as an optimal transportation problem, which is computable within strongly polynomial time O(|S|^2 \log(|S|)), where |S| is the total number of possible distributions [38]. Meanwhile, the Hausdorff metric can be computed in nearly linear time [39], which can be neglected compared with the complexity of the Kantorovich distance. Based on equation (31), the total complexity of knowledge transfer in ADTRL is O(|N_e||A_l||S|^2 \log(|S|)), where |A_l| is the size of the learner's action set. In summary, these analyses show that our knowledge transfer process can be efficiently computed, and the complexity is linearly related to the number of experts.
VI. PERFORMANCE EVALUATION
A. Parameter Settings
In this section, we consider six different cases: expert 1,
expert 2, learner 1 (QDTRL), learner 2 (ADTRL), baseline
1 (EB-DQN) and baseline 2 (PPF-TTL). We include 6 adja-
cent gNBs, and each algorithm is implemented in one gNB
randomly. Each gNB contains an eMBB slice and a URLLC
slice. The eMBB slice has 5 UEs, while the URLLC slice has
10 UEs [40]. The radius of each gNB is 300 meters, and the
distance between two adjacent gNBs is 600 meters. For each
gNB, there are 100 RBs in total, which are divided into 13
resource block groups (RBGs) [41]. We assume the caching
capacity is reallocated every 50 TTIs because it takes time to
replace the cached content items. The experts' experience is presumed to be existing knowledge available to the learners.
We deploy LSTM networks with 30 nodes as hidden layers
for the target and main networks in DTRL and EB-DQN.
The network learning rate and the number of layers are
selected by the grid search method. We try different parameter
combinations and find the best performance accordingly. The
simulations include 3000 TTIs, where the first 1500 TTIs
are the exploration period, and the remaining TTIs are the
exploitation period. The simulations are implemented with the MATLAB 5G library, and results are averaged over 15 runs. Other 5G and learning parameters are shown in Table III.
TABLE III
PARAMETER SETTINGS

5G settings: Bandwidth: 20 MHz; 3GPP urban macro network; Number of RBs: 100; Subcarriers in each RB: 12; Subcarrier bandwidth: 15 kHz; Transmission power: 40 dBm (uniformly distributed); TTI size: 2 OFDM symbols; Tx/Rx antenna gain: 15 dB.
Retransmission settings: Max number of retransmissions: 1; Round trip delay: 4 TTIs; Hybrid automatic repeat request.
UE and gNBs: 25 eMBB UEs, 50 URLLC UEs; UE random distribution; Number of gNBs: 5; Inter-gNB distance: 500 m.
Propagation model: 128.1 + 37.6 log(distance (km)); Log-Normal shadowing: 8 dB.
Cache settings: Caching capacity: 20 items; TTL value: 50 TTIs; Contents catalog size: 40 per slice; Contents popularity: Zipf.
Traffic model: URLLC/eMBB traffic: Poisson distribution; Packet size: 36 Bytes.
Learning settings: Network layers: 4; 2 LSTM hidden layers with 35 nodes per hidden layer; Initial learning rate: 0.005; Experience pool size: 150; Training frequency: 30 TTIs; Minibatch size: 30; Discount factor: 0.5; Epsilon value: 0.05; Reward weight: 0.5; Transfer learning rate: 0.7; Ψ value for EB-DQN: 0.5.
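As a minimal sketch of the traffic and popularity models in Table III (per-slice Zipf popularity over a 40-item catalog and Poisson packet arrivals per TTI); the Zipf exponent and mean arrival rate below are placeholders, since the table does not specify them.

```python
import numpy as np

rng = np.random.default_rng(0)

def zipf_popularity(catalog_size=40, exponent=1.0):
    """Normalized Zipf popularity over a per-slice content catalog."""
    ranks = np.arange(1, catalog_size + 1)
    weights = 1.0 / ranks ** exponent
    return weights / weights.sum()

def requests_per_tti(mean_arrivals, popularity):
    """Poisson number of packet arrivals in one TTI, each requesting one content item."""
    n_packets = rng.poisson(mean_arrivals)
    return rng.choice(len(popularity), size=n_packets, p=popularity)

popularity = zipf_popularity()
urllc_requests = requests_per_tti(mean_arrivals=3, popularity=popularity)
```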
B. Performance Analyses of Various Learning Parameters
In this section, we analyze the algorithm performance under
diverse learning parameters. Fig. 6(a) shows the convergence performance of QDTRL, ADTRL and EB-DQN, which is a critical metric for learning algorithms. ADTRL has the fastest convergence, which can be explained by its improved action selection strategy: ADTRL takes advantage of the action selection policies of the experts, which indicate actions with higher potential rewards. This shows that ADTRL applies a more efficient exploration strategy and achieves better performance.
QDTRL also presents a better convergence performance than
EB-DQN. In QDTRL, the Q-values of the experts are extracted
as extra rewards for action selections. It assumes that actions
with higher Q-values in experts can also bring higher rewards
to learners, and the exploration is accelerated. However, other
actions can still be randomly selected for exploration, lowering
the exploration efficiency. On the contrary, in EB-DQN, the
agent has no prior knowledge about current tasks. The agent
starts from scratch to explore its tasks, which leads to a longer
convergence time and a lower average reward.
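As a minimal illustration of this idea (not the exact combination defined by equation (24)), the expert's Q-value, obtained through the state and action mapping functions, can be added to the learner's reward with a weight given by the transfer learning rate σ1:

```python
def qdtrl_shaped_reward(learner_reward, expert_q_value, sigma1=0.7):
    """Illustrative QDTRL-style shaping: the mapped expert Q-value acts as an
    extra reward for the learner, weighted by the transfer learning rate sigma1."""
    return learner_reward + sigma1 * expert_q_value
```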
Then, Fig. 6(b) shows the network performance of EB-DQN against the extra exploration reward Ψ (shown in equation (34)). A higher Ψ value encourages more exploration, while a lower value means more exploitation. The simulations demonstrate that a higher Ψ may hamper the network performance through over-exploration, while a lower Ψ also degrades the URLLC delay and eMBB throughput through under-exploration. Therefore, an appropriate Ψ value is critical to balance exploration and exploitation.
Fig. 6. Convergence and network performance comparison against learning parameters: (a) convergence performance comparison of EB-DQN, QDTRL and ADTRL; (b) network performance of EB-DQN under various extra exploration rewards Ψ; (c) convergence performance of QDTRL under various Q-table sizes of expert agents (a lower value such as 0.3 indicates that the learner agent only has 30% of the expert Q-table); (d) convergence performance of ADTRL under various action selection knowledge sizes of expert agents (the learner agent only has part of the expert agents' action selection knowledge); (e) network performance of QDTRL under various transfer learning rates; (f) network performance of ADTRL under various transfer learning rates.
To investigate how the knowledge transfer can contribute to
the learner agent performance, Fig. 6 (c) shows the QDTRL
performance under various expert Q-table sizes. In particular,
a lower value such as 0.3 means that the learner agent only
has 30% of the expert Q-tables as prior knowledge. Fig. 6 (c)
demonstrates that more prior knowledge can bring better per-
formance for the learner agent, while partial prior knowledge
may lower average rewards. Similarly, Fig. 6 (d) presents the
ADTRL performance using different action selection transfer
sizes. Specifically, a lower value indicates that the learner
agent only has part of the action selection knowledge of expert
agents. Consequently, one can observe that more transferred
knowledge can improve exploration efficiency and produce a
higher average reward for the learner agent.
Finally, note that we have defined transfer learning rates when introducing QDTRL and ADTRL, which represent the importance of the transferred knowledge. The network performance of QDTRL and ADTRL under different transfer learning rates is investigated here. In QDTRL, the transfer learning rate is indicated by σ1 in equation (24).
Fig. 7. Performance comparison under various traffic loads and backhaul capacities: (a) ECDF of URLLC latency (1 Mbps eMBB traffic, 2 Mbps URLLC traffic); (b) URLLC latency against traffic load; (c) eMBB throughput against traffic load; (d) PDR comparison against traffic load; (e) URLLC latency against backhaul capacity; (f) eMBB throughput against backhaul capacity.

Fig. 6(e)
shows that a higher transfer learning rate may lead to better
network performance, which is indicated by a lower URLLC
delay and a higher eMBB throughput. However, a very high transfer learning rate may hinder the exploration of the agent itself, leading to sub-optimal results such as higher delay and lower throughput. A similar trend can be observed in Fig. 6(f) for ADTRL, in which the transfer learning rate is indicated by the variable σ2 in equation (32). A higher σ2 value significantly reduces the exploration complexity and brings better network performance. However, when σ2 ≥ 0.6, the transferred knowledge dominates the learning process, resulting in performance degradation in terms of latency and throughput.
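As a simplified sketch of this trade-off (an illustration of the trend only, not the exact ADTRL rule in equation (32)), a transfer learning rate can be interpreted as the probability of following the expert's advice instead of the learner's own ε-greedy choice:

```python
import random

def select_action(own_best_action, expert_advised_action, action_set,
                  sigma2=0.5, epsilon=0.05):
    """With probability sigma2 follow the expert-advised action; otherwise fall
    back to the learner's own epsilon-greedy choice."""
    if random.random() < sigma2:
        return expert_advised_action
    if random.random() < epsilon:
        return random.choice(action_set)  # random exploration
    return own_best_action
```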
C. Network Performance Analyses
In this section, we compare the network performance of
different algorithms under various traffic loads and backhaul
capacities. The eMBB traffic is fixed to 1 Mbps per cell, and
the URLLC traffic ranges from 1 to 6 Mbps. We first present
the results, then explain the performance of each algorithm.
Fig. 7(a) shows the Empirical Complementary Cumulative Distribution Function (ECCDF) of the URLLC latency with 2 Mbps URLLC traffic, which presents the empirical distribution of packet delays. We zoom into the region where the ECCDF value is higher than 0.1 and the URLLC delay is lower than 1 ms to better show the results. Expert 1 and the PPF-TTL method present the highest delay, which is indicated by their heavy delay distributions in the 0.1-1 interval of the ECCDF axis. By contrast, expert 2 has a lower delay distribution than expert 1. Meanwhile, EB-DQN and ADTRL show comparable delay performance for the URLLC slice. Finally, ADTRL achieves the best delay performance among all algorithms. Fig. 7(b)
and (c) present average URLLC slice delay and eMBB slice
throughput against traffic loads. It shows that both ADTRL
and QDTRL maintain lower delay and higher throughput than
experts and baseline algorithms under various traffic loads.
Expert 2 shows a lower URLLC delay than the PPF-TTL and
expert 1, but its eMBB throughput is much lower than any
other algorithm. In Fig. 7(d), all algorithms have comparable packet drop rates (PDR) except PPF-TTL. In the following, we explain the results of each algorithm.
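For reference, a small sketch of how an ECCDF curve like the one in Fig. 7(a) can be computed from a list of per-packet delays; the delay values below are placeholders.

```python
import numpy as np

def eccdf(delays_ms):
    """Empirical complementary CDF: fraction of packets whose delay exceeds x."""
    x = np.sort(np.asarray(delays_ms, dtype=float))
    y = 1.0 - np.arange(1, x.size + 1) / x.size
    return x, y

x, y = eccdf([0.3, 0.5, 0.5, 0.8, 1.2])  # placeholder delays in ms
# y[i] is the fraction of packets delayed more than x[i] ms
```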
We first analyze the experts’ performance. Expert 1 shows a high delay and a low throughput, because it has
no caching capability. All packets required by UEs have to
be processed by the core network, and the constant backhaul
delay leads to a high URLLC delay and a low eMBB through-
put. On the contrary, expert 2 has a lower delay because we
apply a fixed RB allocation strategy. In particular, RBs are
distributed according to the UE numbers in each slice; thus
the URLLC slice always has more RBs, which leads to a low
URLLC delay. However, the eMBB slice consequently suffers from a low throughput.
For the baseline algorithms, EB-DQN outperforms the PPF-
TTL method because it jointly learns the radio and cache
resource allocation. The eMBB throughput of the PPF-TTL
method decreases significantly with the increasing URLLC
load, which results from the priority settings in this method. In
the PPF method, whenever a new URLLC packet arrives, it is
directly scheduled over the eMBB packets, which unavoidably
degrades the eMBB throughput.
Finally, the proposed QDTRL and ADTRL have the best
overall performance. Due to the novel DTRL-based scheme,
they can leverage the knowledge of experts and further im-
prove their own performance on new tasks. When the URLLC
traffic is 4 Mbps, ADTRL presents a 21.4% lower URLLC
delay and a 22.4% higher eMBB throughput than EB-DQN.
A 40.8% lower URLLC delay and a 59.8% higher eMBB
throughput are also observed compared with the PPF-TTL
method. The simulations show that the proposed DTRL-based
solutions achieve more promising results than the baseline
algorithms.
Moreover, backhaul capacity is one of the main bottlenecks
of 5G RAN. Here we investigate the network performance
under various backhaul capacities, and the results are shown
in Fig. 7(e) and (f). Note that the basic performance of the experts has been shown in the previous results, so here we mainly compare the DTRL-based solutions with the baseline algorithms.
As expected, all algorithms achieve lower delays for
URLLC slice and higher throughput for eMBB slice with
increasing backhaul capacity, because higher capacity will
reduce the backhaul delay. Moreover, QDTRL and ADTRL
still achieve lower URLLC delays and higher eMBB through-
put. Compared with EB-DQN, the satisfying performance of
QDTRL and ADTRL can still be explained by their knowledge
(a) URLLC latency against caching capacity
(b) eMBB throughput against caching capacity
(c) Cached hit ratio against caching capacity
Fig. 8. Network performance comparison under various caching capacities.
transfer strategy. Meanwhile, the worst performance of PPF-
TTL shows that learning-based methods outperform model-
based algorithms by the superior learning capability. When
the backhaul capacity is 15 Mbps, ADTRL presents an 18.9%
lower URLLC delay and a 24.2% higher eMBB throughput
than EB-DQN. Compared with the PPF-TTL method, a 24.7%
lower URLLC delay and a 54.3% higher eMBB throughput are
observed.
D. Content Caching Performance Analyses
In this section, we compare different algorithms under various caching capacities, where the caching capacity is defined as the maximum number of content items that can be stored in the gNBs. As shown
in Fig.8 (a) and (b), a higher caching capacity will reduce
the URLLC latency and increase the eMBB throughput. This is because a higher caching capacity means that more items can be cached in the gNBs, and the average backhaul delay will be
reduced. The simulations show that ADTRL and QDTRL have
the best overall performance. EB-DQN still achieves a lower
URLLC delay and a higher eMBB throughput than the PPF-
TTL algorithm.
Furthermore, we present the cache hit ratios of the eMBB and URLLC slices in Fig. 8(c). The cache hit ratio represents
the proportion of packets that can be found in the cache
server when they are required. A higher cache hit ratio usually
indicates better network performance, which is affected by
the caching capacity and the content replacement strategy.
With the increasing caching capacity, cache hit ratios naturally
increase for all algorithms. As expected, the cache hit ratios of the eMBB and URLLC slices are well maintained in QDTRL
and ADTRL. Compared with ADTRL, the ratios in EB-DQN
and PPF-TTL methods are 19.8% and 31% lower. Note that
the PPF-TTL method only has one curve because we assume
there is no slicing in the PPF-TTL algorithm. Finally, ADTRL,
QDTRL and EB-DQN have similar cache hit ratios when the
caching capacity is 60. This means that most required items can be cached at this capacity, and thus a high cache hit ratio is observed.
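A minimal sketch matching this definition, computing the per-slice cache hit ratio from a stream of (slice, hit) events; the slice names and event format are illustrative.

```python
from collections import Counter

def per_slice_hit_ratio(events):
    """events: iterable of (slice_name, hit) pairs; returns the hit ratio per slice."""
    hits, totals = Counter(), Counter()
    for slice_name, hit in events:
        totals[slice_name] += 1
        hits[slice_name] += int(hit)
    return {s: hits[s] / totals[s] for s in totals}

ratios = per_slice_hit_ratio([("eMBB", True), ("eMBB", False), ("URLLC", True)])
```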
VII. CONCLUSION
Network slicing is a key technique to enhance flexibility in
5G networks, and ML techniques offer promising solutions.
Although widely used reinforcement learning techniques have yielded improved network performance, they suffer from long convergence times and a lack of generalization. In that sense, knowledge transfer emerges as an important approach to improve learning performance. Yet, transfer learning in wireless networks has been explored only very recently and in very few studies. This work presented two novel deep transfer
reinforcement learning-based solutions for the joint radio and
cache resource allocation. The proposed algorithms have been
compared with two baseline algorithms via simulations. These results have shown that the proposed methods achieve better network performance and faster convergence than the benchmark algorithms. In the future, we plan to consider the
knowledge transfer between tasks with different state defini-
tions.
APPENDIX
In the following we prove equation (10). Recalling equations (8) and (9),
\[
h_{j,n,g} = \sum_{m \in M_{j,n}} \phi_{j,n,m,g} T_{j,n,m,g},
\]
\[
\sum_{g \in G_n} \sum_{m \in M_n} h_{j,n,m,g} = \frac{C_{j,n}}{C_{j,T}}.
\]
Then we have
\[
h_{j,n,g} \sum_{g \in G_n} \sum_{m \in M_n} h_{j,n,m,g} = \frac{C_{j,n}}{C_{j,T}} \sum_{m \in M_{j,n}} \phi_{j,n,m,g} T_{j,n,m,g},
\]
which can be easily transformed to
\[
\begin{aligned}
h_{j,n,g} &= \frac{C_{j,n}}{C_{j,T}} \frac{\sum_{m \in M_{j,n}} \phi_{j,n,m,g} T_{j,n,m,g}}{\sum_{g \in G_n} \sum_{m \in M_n} h_{j,n,m,g}} \\
&= \frac{C_{j,n}}{C_{j,T}} \frac{T_{j,n,g} \sum_{m \in M_{j,n}} \phi_{j,n,m,g}}{\sum_{g \in G_n} \sum_{m \in M_n} h_{j,n,m,g}} \quad (\text{using the } T_{j,n,m,g} = T_{j,n,g} \text{ assumption}) \\
&= \frac{C_{j,n}}{C_{j,T}} \frac{T_{j,n,g} \sum_{m \in M_{j,n}} \phi_{j,n,m,g}}{\sum_{g \in G_n} \sum_{m \in M_n} \phi_{j,n,m,g} T_{j,n,m,g}} \quad (\text{using } h_{j,n,m,g} = \phi_{j,n,m,g} T_{j,n,m,g}) \\
&= \frac{C_{j,n}}{C_{j,T}} \frac{\sum_{m \in M_{j,n}} \phi_{j,n,m,g}}{\sum_{g \in G_n} \sum_{m \in M_n} \phi_{j,n,m,g}} \\
&= \frac{C_{j,n}}{C_{j,T}} \frac{\sum_{m \in M_{j,n}} \phi_{j,n,m,g}}{\sum_{m \in M_n} \phi_{j,n,m}},
\end{aligned}
\]
which is equation (10).
ACKNOWLEDGMENT
We would like to thank Dr. Medhat Elsayed for initial
discussions on transfer learning.
REFERENCES

[1] M. Shafi, A. Molisch, P. Smith, T. Haustein, P. Zhu, P. Silva, F. Tufvesson, A. Benjebbour, and G. Wunder, “5G: A Tutorial Overview of Standards, Trials, Challenges, Deployment, and Practice,” IEEE Journal on Selected Areas in Communications, vol. 35, no. 6, pp. 1201-1221, Jun. 2017.
[2] A. Ksentini and N. Nikaein, “Toward Enforcing Network Slicing on RAN: Flexibility and Resource Abstraction,” IEEE Communications Magazine, vol. 55, no. 6, pp. 102-108, Jun. 2017.
[3] D. Marabissi and R. Fantacci, “Highly Flexible RAN Slicing Approach to Manage Isolation, Priority, Efficiency,” IEEE Access, vol. 7, pp. 97130-97142, Jul. 2019.
[4] M. Alsenwi, N. Tran, M. Bennis, A. Bairagi, and C. Hong, “eMBB-URLLC Resource Slicing: A Risk-Sensitive Approach,” IEEE Communications Letters, vol. 23, no. 4, pp. 740-743, Apr. 2019.
[5] L. Li, G. Zhao, and R. S. Blum, “A Survey of Caching Techniques in Cellular Networks: Research Issues and Challenges in Content Placement and Delivery Strategies,” IEEE Communications Surveys & Tutorials, vol. 20, no. 3, pp. 1710-1732, Mar. 2018.
[6] M. Erol-Kantarci, “Cache-At-Relay: Energy-Efficient Content Placement for Next-Generation Wireless Relays,” International Journal of Network Management, vol. 25, no. 6, pp. 454-470, Nov./Dec. 2015.
[7] P. Yang, N. Zhang, S. Zhang, L. Yu, J. Zhang, and X. Shen, “Content Popularity Prediction Towards Location-Aware Mobile Edge Caching,” IEEE Transactions on Multimedia, vol. 21, no. 4, pp. 915-929, Apr. 2019.
[8] J. Kwak, Y. Kim, L. B. Le, and S. Chong, “Hybrid content caching in 5G wireless networks: Cloud versus edge caching,” IEEE Transactions on Wireless Communications, vol. 17, no. 5, pp. 3030-3045, May 2018.
[9] M. Elsayed and M. Erol-Kantarci, “AI-Enabled Future Wireless Networks: Challenges, Opportunities, and Open Issues,” IEEE Vehicular Technology Magazine, vol. 14, no. 3, pp. 70-77, Sep. 2019.
[10] Y. Shi, Y. E. Sagduyu, and T. Erpek, “Reinforcement Learning for Dynamic Resource Optimization in 5G Radio Access Network Slicing,” in Proceedings of the 2020 IEEE 25th International Workshop on CAMAD, Sep. 2020, pp. 1-6.
[11] T. Li, X. Zhu, and X. Liu, “An End-to-End Network Slicing Algorithm Based on Deep Q-Learning for 5G Network,” IEEE Access, vol. 8, pp. 122229-122240, Jul. 2020.
[12] X. Chen, H. Zhang, C. Wu, S. Mao, Y. Ji, and M. Bennis, “Optimized Computation Offloading Performance in Virtual Edge Computing Systems via Deep Reinforcement Learning,” IEEE Internet of Things Journal, vol. 6, no. 3, pp. 4005-4018, Jun. 2019.
[13] S. J. Pan and Q. Yang, “A Survey on Transfer Learning,” IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345-1359, Oct. 2010.
[14] W. Lei, Y. Ye, and M. Xiao, “Deep Reinforcement Learning-Based Spectrum Allocation in Integrated Access and Backhaul Networks,” IEEE Transactions on Cognitive Communications and Networking, vol. 6, no. 3, pp. 970-979, Sep. 2020.
[15] X. Chen, Z. Zhao, C. Wu, M. Bennis, H. Liu, Y. Ji, and H. Zhang, “Multi-Tenant Cross-Slice Resource Orchestration: A Deep Reinforcement Learning Approach,” IEEE Journal on Selected Areas in Communications, vol. 37, no. 10, pp. 2377-2392, Oct. 2019.
[16] H. Albinsaid, K. Singh, S. Biswas, and C. Li, “Multi-agent Reinforcement Learning Based Distributed Dynamic Spectrum Access,” IEEE Transactions on Cognitive Communications and Networking (Early Access), DOI: 10.1109/TCCN.2021.3120996, Oct. 2021.
[17] J. Li, X. Zhang, J. Zhang, J. Wu, Q. Sun, and Y. Xie, “Deep Reinforcement Learning-Based Mobility-Aware Robust Proactive Resource Allocation in Heterogeneous Networks,” IEEE Transactions on Cognitive Communications and Networking, vol. 6, no. 1, pp. 408-421, Mar. 2020.
[18] C. Zhang, M. Dong, and K. Ota, “Fine-Grained Management in 5G: DQL Based Intelligent Resource Allocation for Network Function Virtualization in C-RAN,” IEEE Transactions on Cognitive Communications and Networking, vol. 6, no. 2, pp. 428-435, Jun. 2020.
[19] H. Zhou, M. Elsayed, and M. Erol-Kantarci, “RAN Resource Slicing in 5G Using Multi-Agent Correlated Q-Learning,” in Proceedings of the 2021 IEEE Annual International Symposium on PIMRC, Sep. 2021, pp. 1-6.
[20] M. Elsayed, M. Erol-Kantarci, and H. Yanikomeroglu, “Transfer Reinforcement Learning for 5G New Radio mmWave Networks,” IEEE Transactions on Wireless Communications, vol. 20, no. 5, pp. 2838-2849, May 2021.
[21] M. E. Taylor and P. Stone, “Cross-Domain Transfer for Reinforcement Learning,” in Proceedings of the International Conference on Machine Learning, Jun. 2007, pp. 879-886.
[22] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529-533, Feb. 2015.
[23] M. E. Taylor, P. Stone, and Y. Liu, “Transfer Learning for Reinforcement Learning Domains: A Survey,” Journal of Machine Learning Research, vol. 10, pp. 1633-1685, Sep. 2009.
[24] 3GPP, “5G; NR; Physical channels and modulation (Release 15),” Technical Specification 38.211, 3rd Generation Partnership Project (3GPP), Jul. 2018.
[25] M. Elsayed and M. Erol-Kantarci, “Reinforcement Learning-based Joint Power and Resource Allocation for URLLC in 5G,” in Proceedings of the 2019 IEEE Global Communications Conference, pp. 1-6, Dec. 2021.
[26] P. L. Vo, M. N. Nguyen, T. A. Le, and N. H. Tran, “Slicing the Edge: Resource Allocation for RAN Network Slicing,” IEEE Wireless Communications Letters, vol. 7, no. 6, pp. 970-973, Dec. 2018.
[27] N. C. Fofack, P. Nain, G. Neglia, and D. Towsley, “Analysis of TTL-based cache networks,” in Proceedings of the 6th International ICST Conference on Performance Evaluation Methodologies and Tools, Oct. 2012, pp. 1-10.
[28] T. Han and N. Ansari, “Network Utility Aware Traffic Load Balancing in Backhaul-Constrained Cache-Enabled Small Cell Networks with Hybrid Power Supplies,” IEEE Transactions on Mobile Computing, vol. 16, no. 10, pp. 2819-2832, Oct. 2017.
[29] H. Zhou, A. Aral, I. Brandic, and M. Erol-Kantarci, “Multi-agent Bayesian Deep Reinforcement Learning for Microgrid Energy Management under Communication Failures,” IEEE Internet of Things Journal (Early Access), DOI: 10.1109/JIOT.2021.3131719, Dec. 2021.
[30] F. A. Gers, J. Schmidhuber, and F. Cummins, “Learning to Forget: Continual Prediction with LSTM,” Neural Computation, vol. 12, no. 10, pp. 2451-2471, Oct. 2000.
[31] M. E. Taylor, P. Stone, and Y. Liu, “Transfer Learning via Inter-Task Mappings for Temporal Difference Learning,” Journal of Machine Learning Research, vol. 8, no. 1, pp. 2125-2167, Sep. 2007.
[32] Z. Zhu, K. Lin, A. K. Jain, and J. Zhou, “Transfer Learning in Deep Reinforcement Learning: A Survey,” arXiv:2009.07888, May 2022.
[33] A. L. Gibbs and F. E. Su, “On Choosing and Bounding Probability Metrics,” International Statistical Review, vol. 70, no. 3, pp. 419-435, Dec. 2002.
[34] P. S. Castro and D. Precup, “Using Bisimulation for Policy Transfer in MDPs,” in Proceedings of the 24th AAAI Conference on Artificial Intelligence, Jul. 2010, pp. 1-6.
[35] J. J. Taylor, D. Precup, and P. Panangaden, “Bounding performance loss in approximate MDP homomorphisms,” in Advances in Neural Information Processing Systems 21, Dec. 2008, pp. 1-8.
[36] M. James, Topology, 2nd ed., Prentice Hall, 1999, pp. 280-281.
[37] G. Pocovi, K. Pedersen, and P. Mogensen, “Joint Link Adaptation and Scheduling for 5G Ultra-Reliable Low-Latency Communications,” IEEE Access, vol. 6, pp. 28912-28922, May 2018.
[38] N. Ferns, P. Panangaden, and D. Precup, “Metrics for finite Markov decision processes,” in Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, Jul. 2004, pp. 162-169.
[39] A. A. Taha and A. Hanbury, “An Efficient Algorithm for Calculating the Exact Hausdorff Distance,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 11, pp. 2153-2163, Nov. 2015.
[40] M. Elsayed and M. Erol-Kantarci, “AI-Enabled Radio Resource Allocation in 5G for URLLC and eMBB Users,” in Proceedings of the 2019 IEEE 2nd 5G World Forum (5GWF), Nov. 2019, pp. 1-6.
[41] 3GPP, “NR; Physical Layer Procedures for Data (version 15.2.0),” Technical Specification 38.214, 3rd Generation Partnership Project (3GPP), Jun. 2018.
Hao Zhou is a PhD candidate at the University of Ottawa. He received his B.Eng. and M.Eng. degrees from Huazhong University of Science and Technology, China, in 2016, and Tianjin University, China, in 2019, respectively. He has been working toward his PhD degree at the University of Ottawa since Sep. 2019. His research interests include electric vehicles, microgrid energy trading, resource management, and network slicing in 5G. He is devoted to applying machine learning techniques to smart grid and 5G applications.
Melike Erol-Kantarci is Canada Research Chair
in AI-enabled Next-Generation Wireless Networks
and Associate Professor at the School of Electrical
Engineering and Computer Science at the University
of Ottawa. She is the founding director of the
Networked Systems and Communications Research
(NETCORE) laboratory. She has received numer-
ous awards and recognitions. Dr. Erol-Kantarci is
the co-editor of three books on smart grids, smart
cities and intelligent transportation. She has over
180 peer-reviewed publications. She has delivered
70+ keynotes, plenary talks and tutorials around the globe. She is on the
editorial board of the IEEE Transactions on Cognitive Communications and
Networking, IEEE Internet of Things Journal, IEEE Communications Letters,
IEEE Networking Letters, IEEE Vehicular Technology Magazine and IEEE
Access. She has acted as the general chair and technical program chair for
many international conferences and workshops. Her main research interests
are AI-enabled wireless networks, 5G and 6G wireless communications, smart
grid and Internet of Things. She is an IEEE ComSoc Distinguished Lecturer,
IEEE Senior member and ACM Senior Member.
H. Vincent Poor (S’72, M’77, SM’82, F’87) re-
ceived the Ph.D. degree in EECS from Princeton
University in 1977. From 1977 until 1990, he was
on the faculty of the University of Illinois at Urbana-
Champaign. Since 1990 he has been on the faculty at
Princeton, where he is currently the Michael Henry
Strater University Professor. During 2006 to 2016,
he served as the dean of Princeton’s School of
Engineering and Applied Science. He has also held
visiting appointments at several other universities,
including most recently at Berkeley and Cambridge.
His research interests are in the areas of information theory, machine learning
and network science, and their applications in wireless networks, energy
systems and related fields. Among his publications in these areas is the forthcoming book Machine Learning and Wireless Communications (Cambridge University Press). Dr. Poor is a member of the National Academy of
Engineering and the National Academy of Sciences and is a foreign member
of the Chinese Academy of Sciences, the Royal Society, and other national
and international academies. He received the IEEE Alexander Graham Bell
Medal in 2017.