Received: 1 January 2019 Revised: 8 April 2019 Accepted: 22 May 2019
DOI: 10.1002/ett.3671
SPECIAL ISSUE ARTICLE
An efficient Deep reinforcement learning with extended
Kalman filter for device-to-device communication
underlaying cellular network
Pratap Khuntia Ranjay Hazra
Electronics and Instrumentation
Engineering Department, National
Institute of Technology, Silchar, India
Correspondence
Pratap Khuntia, Electronics and
Instrumentation Engineering
Department, National Institute of
Technology, Silchar-788010, India.
Email: pratap.khuntia7@gmail.com
Abstract
In this paper, a novel Deep Q-learning scheme combined with an extended Kalman filter (EKF) is proposed to solve the channel and power allocation problem for a device-to-device enabled cellular network when the prior traffic information is not known to the base station. Furthermore, this work explores an optimal policy for resource and power allocation between users with the aim of maximizing the sum-rate of the overall system. The proposed work comprises four phases, ie, cell splitting, clustering, a queuing model, and joint channel and power allocation. The implementation of cell splitting with the novel K-means++ clustering technique increases the network coverage, reduces co-channel cell interference, and minimizes the transmission power of nodes, whereas the M/M/C:N queuing model solves the issue of waiting time for users in priority-based data transmission. The difficulty with the Q-learning and Deep Q-learning environment is achieving an optimal policy. This is because of the uncertainty of various parameters associated with the system, especially when the state space is extremely large. In order to improve the robustness of the learner, EKF together with the Deep Q-network is presented in this paper, which incorporates the weight uncertainty of the Q-function as well as the state uncertainty during the transition. Furthermore, the use of the EKF provides an improved version of the loss function that helps the learner achieve an optimal policy. Through numerical simulation, the advantage of our resource sharing scheme over other existing schemes is also verified.
1 INTRODUCTION
The rapid rise in the growth of data traffic is a big challenge for next generation cellular networks. In the near future,
billions of connected devices will be in the global IP network. To fulfill the demand of mobile users and for efficient
management of the scarce spectrum, device-to-device (D2D) communication is a promising technology for the future
integrated next generation cellular networks.
D2D communication allows direct communication between two users in close proximity, without the involvement of
the base station (BS). In an underlaying cellular network, D2D users (D2Ds) reuse radio resources allocated to cellular
users (CUs). The reuse of CU resources by D2D users causes mutual interference.1,2 Therefore, the selection of a suitable resource and power allocation scheme plays a vital role. In order to
reduce interference, a suitable amount of transmission power must be chosen for each CU and D2D. Thus, as a central
entity, BS determines the transmission power of each user and interference level using various scheduling algorithms.3
However, the traditional methods of resource allocation do not provide a preferable optimal outcome if the complete traffic information is not known to the BS a priori. There are various conventional D2D schemes that aim at maximizing the network throughput, such as graph-based methods, fractional frequency reuse,4 the Lagrange multiplier,5 and optimization algorithms, eg, particle swarm optimization and the genetic algorithm (GA).6 A low complexity algorithm called inverse popularity pairing order is presented in the work of Wang and Wu,7 which gives the solution of power allocation and the pairing of the best CU and D2D pair to operate in the same channel. A greedy-based resource allocation scheme is proposed in the work of Wang et al,8 where the D2Ds reuse the resources of a fixed CU taking the interference constraint into account. In the work of Hamdi et al,6 the authors presented a joint channel allocation and power control algorithm for multicast D2D underlay communication. The nonlinear Perron-Frobenius approach is proposed for solving the nonlinear nonconvex fair power control problem, and GA is then applied to solve the integral issue of the joint channel allocation and power control problem; however, GA is more time consuming than other optimization algorithms. Moreover, all of the aforementioned centralized traditional methods suffer from excess transmission overhead to obtain the complete channel state information of the network.
Reinforcement learning (RL) is a machine learning approach that has recently gained popularity in wireless communication, but there is limited work that uses the concept of RL for resource sharing in D2D communication. Maghsudi et al9 proposed a centralized resource and power allocation problem for D2D communication and presented the problem of power control as a learning game; they also applied a Q-learning better-reply dynamics algorithm to learn the utility function and to acquire equilibrium in power allocation. In the work of Nie et al,10 the authors discussed a distributed Q-learning technique for D2D communication that allows the D2D users to explore the transmit power in an independent manner, thereby enhancing the system capacity while maintaining a good quality of service (QOS) for CUs. Asheralieva et al,11 in their work, discussed a joint channel and power allocation problem for D2D pairs operating autonomously in a heterogeneous cellular network. The aim is to maximize the rewards, defined as the difference between the average throughput and the cost of power consumption, while taking SINR constraints into account. They applied a multiagent Q-learning algorithm, where each D2D pair acts as a learning agent whose job is to collect the locally observed information, thereby helping the agent select the best strategy. Khan et al12 employed an online learning algorithm for power allocation, ie, a multiarmed bandit solver, for a D2D enabled cellular network under cross-tier and co-tier interference constraints. However, the basic Q-learning scheme is unstructured in the state space, and the reward achieved by the learner is not optimal. The learner in the Q-learning environment follows a Markov decision process (MDP) that incorporates uncertainty in the state transition probability of the system.
Deep RL is a standard class of RL where a Deep neural network (DNN) is trained to learn either the value function or the optimal policy. Deep learning is a major breakthrough in machine learning, especially in developing games, autonomous driving systems, health care, robotic applications, and many more. Mnih et al13 developed an artificial Deep Q-network (DQN) that successfully learns policies directly from an excessively high-dimensional state space and deployed the DQN on demanding Atari 2600 games. In the work of Mao et al,14 the authors applied the Deep RL technique to machine intelligence, where the proposed system directly learns the resource management between various electronic devices by accumulating the experiences received within a cluster. They used a Deep RL approach to sort out the problem of task packing with multiple resource demands that aims at minimizing the completion time of a task. Le et al15 proposed a Deep RL-based offloading mechanism, where a user learns through online trial and error to offload his computational job to other mobile devices with the help of D2D communication in an underlay cellular network. In the work of Lee et al,16 the authors discussed a transmit power control strategy, where the D2D user autonomously learns to control its transmit power by using a DNN, thereby improving the average sum-rate of D2Ds while restricting the interference from CUs. In the work of Naparstek and Cohen,17 the authors proposed a spectrum assignment process, where multiple users access multiple channels. At the beginning of a time slot, a user transmits the data packet on a selected channel based on a certain transmission probability. The user then checks the successful transmission of data through an acknowledgement signal. To handle this, they used a distributed dynamic Deep Q-learning network. Kim et al18 proposed a distributed architecture for power allocation based on Deep learning, where the D2D user has the ability to determine the transmit power based on its position while also considering the interference toward the cellular network, ie, the BS. To control the interference toward the BS, they formulated a Deep Q-learning cost function using the Lagrangian method. In the work of Meng et al,19 the authors addressed a Deep RL approach for allocating power to the devices in a multicellular network. The approach involves a mathematical analysis of Deep RL, considering online/offline centralized training,
intercell coordination, and execution in a distributed fashion. Huang et al20 discussed a cooperative resource allocation based on Deep learning for D2D enabled 5G networks. The DQN efficiently utilizes the small scale channel information from the dynamic environment to optimize the resources. A Deep multi-user RL method for dynamic spectrum access in a multichannel wireless network is depicted in the work of Naparstek and Cohen.21 In this work, they considered a long short-term memory layer in Deep Q-learning to estimate the precise state information from partial observations. In the work of Huang et al,22 the authors used the Deep Q-learning algorithm for resource allocation in a heterogeneous cellular network. Here, the DQN processes the QOS state of the users, eg, SINR, and takes an action at each time slot. The user gets utility in the form of an immediate reward if the QOS is satisfied. To jointly tackle offloading and resource allocation for a multiuser mobile computing system, Li et al23 exploited Deep Q-learning to minimize the sum cost for delay and energy consumption of user equipment. In the work of Li et al,24 the authors approached a power control strategy for resource sharing in cognitive radios. The primary user regulates its transmit power based on a predefined power control strategy, whereas the secondary user adopts a Deep RL method and develops an intelligent power control policy by interacting with the primary users. Yu et al25 exploited the Deep RL concept to sort out the challenges in wireless multiple access control and addressed the benefit of Deep RL in terms of fast convergence to optimal outcomes and robustness against nonoptimal hyperparameter settings. The work of Chen et al26 addressed a Deep Q-learning method to tackle QOS aware service function chaining in a network function virtualization enabled 5G system. The QOS parameters are throughput, delay, and system bandwidth. Weng et al27 proposed a DNN for speech recognition that uses a back propagation algorithm to make mini-batch gradient descent on the neural network more effective. Ye et al28 analyzed the resource allocation problem for vehicle-to-vehicle (V2V) communication using Deep RL, where the optimal frequency band and transmission power required for setting up communication are determined by each V2V pair itself. The problem with the earlier discussed DQNs is the way they are trained. The Q-network takes a sample of state, action, and reward, and estimates the sample of the next state. However, these samples are correlated, which leads to inefficient learning. The current Q-network parameters decide the policies and estimate the next samples for training the network, but this leads to instability in Deep Q-learning due to bad feedback. Summarizing the aforementioned work on Deep Q-learning, we conclude that the prior work does not consider the uncertainty issue in the state transition and in the weights of the Q-function, which makes it difficult to achieve an optimal policy when the state space of the environment is very large.
Therefore, we propose a novel approach to resource allocation, where the channel assignment and power allocation for each D2D pair are based upon an RL prototype called Deep Q-learning with extended Kalman filter (EKF), an online evaluator of the optimal policy. The proposed approach alleviates the uncertainty in the state transition probability and the weight uncertainty that originates from the DQN. The whole work is divided into four phases. They are as follows.
- Splitting of each cell into three sectors with a 120° coverage pattern. This increases network coverage and minimizes cochannel cell interference.
- Use of a smart and reliable clustering algorithm called K-means++ that reduces the transmission power of nodes. All nodes communicate with their corresponding cluster head (CH) rather than the eNodeB, which is very far away from the nodes. The adoption of such a smart initialization algorithm outperforms the standard K-means algorithm in terms of faster running time and accuracy.
- An M/M/C:N queuing model with a nonpreemptive limited priority scheme that helps in resolving the heavy traffic load issue for priority-based data transmission.
- Deep Q-learning with EKF, used for channel allocation and power allocation simultaneously, which also considers the uncertainty in state transition and the weight uncertainty associated with Q-learning and Deep Q-learning, respectively. We propose to utilize an EKF that processes the uncertain state transition probability of the system and connects this information with the weight uncertainty associated with the DQN, thereby enhancing the robustness of the learner to provide an optimal policy. With the help of the EKF in the DQN, we provide a modified version of the Q-function, which is differentiable and allows us to apply a Taylor series approximation to linearize the nonlinear Q-function. Furthermore, an advanced version of the loss function is derived, which tends to provide optimal weights for uniformity of the network.
The rest of this paper is arranged as follows. The system model is described in Section 2. In Section 3, the problem
formulation for the proposed resource sharing strategy using Deep Q-learning with EKF is discussed. The queuing model
is presented in Section 4 to resolve the traffic requests of different types of users in the network. The simulation results
are verified in Section 5, whereas Section 6 concludes the paper.
2 SYSTEM MODEL
This paper presents a joint resource and power allocation optimization scheme whose objective is to maximize the
throughput of the overall network. In this work, the transmit power of both CUs and D2Ds is optimized by satisfying the
minimum signal-to-interference-plus-noise ratio (SINR) requirement. We consider a sectored cellular network, where each cell is split into three sectors of 120° each, based on the node position (D2D or CU), the angular position of the node, and the radius, ie, the transmission range. In order to manage all the users inside a sector efficiently, we frame the clustering of users using the K-means++ clustering algorithm. The communication between users and the eNodeB in each sector takes place via the specific CH associated with each type of user. Such a clustering mechanism reduces traffic overload at the eNodeB, and each device utilizes less transmit power to communicate with the nearest CH. The traffic offloading process from the eNodeB to the CH is performed intelligently by using the K-means++ clustering algorithm followed by a TDMA scheme. The purpose is to group the users into clusters and allow each device to associate with the CH whose distance from the eNodeB is minimum and whose channel quality toward the eNodeB is the highest. Following this, the management of traffic requests over different discrete time slots is determined using a TDMA scheme. It means that the devices within a cluster use a unique frequency channel and a fixed time slot of each CH for communication. Here, eg, let us assume that x = 1, 2, ..., X frequency channels of CUs are available within the cluster and the nth D2D uses the xth frequency channel of a CU for communicating within an allotted time slot (ti) of each CH. We consider that the CHs operate at unique time slots and the D2Ds use different frequency channels for transmitting data efficiently. As the devices from each cluster use a different time slot and a unique frequency, interference between intersector and intrasector devices is avoided. We thus use the Deep Q-learning with EKF technique for traffic offloading and resource allocation. The role of this intelligent agent is to observe the network states, ie, the location of users, the SINR information of each user, the traffic request, and the interference condition, and then map them to an optimal action through learning experience in every current time slot. The coordination between CH and eNodeB helps in controlling all the users properly. The iterative K-means++ clustering process is as follows. Firstly, the clustering process begins in each sector by randomly selecting the first cluster center from the total available nodes. This is followed by spreading out the centers through an iterative process, which means that, at each iteration, one new center is chosen. The K-means++ procedure is illustrated by the sketch below.
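The algorithm listing from the original layout is not reproduced here; instead, a minimal Python sketch of the K-means++ seeding and clustering step described above is given, assuming the nodes of a sector are available as 2-D coordinates (the function and variable names are illustrative, not taken from the paper).

```python
import numpy as np

def kmeans_pp_init(nodes, k, rng=np.random.default_rng(0)):
    """K-means++ seeding: pick the first center uniformly at random, then draw
    each new center with probability proportional to the squared distance
    from the nearest center already chosen."""
    centers = [nodes[rng.integers(len(nodes))]]
    for _ in range(k - 1):
        d2 = np.min([np.sum((nodes - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(nodes[rng.choice(len(nodes), p=d2 / d2.sum())])
    return np.array(centers)

def kmeans(nodes, k, iters=50):
    """Standard Lloyd iterations on top of the K-means++ seeds; the resulting
    cluster centers play the role of cluster heads (CHs)."""
    centers = kmeans_pp_init(nodes, k)
    for _ in range(iters):
        labels = np.argmin(((nodes[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        centers = np.array([nodes[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return centers, labels

# Example: cluster 60 user positions inside one sector into 4 clusters.
nodes = np.random.default_rng(1).uniform(0, 2000, size=(60, 2))
ch_positions, membership = kmeans(nodes, k=4)
```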
Overall, the cluster formation based upon K-means++ algorithm is faster in run time and improves the quality of the
local optimal solution. Deployment of cell splitting along with the clustering technique increases the network coverage,
enhances overall system capacity, and minimizes the transmission power of nodes due to the fact that CH is now closer
to the nodes in the cluster as compared to the eNodeB. The system model of a sectored cellular network is depicted in
Figure 1.
We assume an underlay D2D communication system, where the devices directly communicate with each other using
the uplink (UL) resources of CUs. In this case, the traffic request is verified at BS followed by implementation of certain
algorithms to allocate a channel and transmit power among the users, but this traditional method of resource allocation
does not work if the complete traffic information is not known to the BS a priori. To tackle this issue, we adopt Deep Q-learning with EKF in the subsequent stage, where a steady interaction between the DQN and the environment results in an optimal policy. Thus, learning through such an interactive process is used to complete the channel assignment and power allocation for D2Ds. In the final stage, to limit heavy traffic situations during channel assignment, we use an M/M/C:N queuing system with a nonpreemptive limited priority model to deal with the traffic requests coming from different types of users.
We consider a sectored multicell scenario, which consists of two types of users, ie, CUs and D2Ds. There are a total of M CUs indexed by C1, C2, ..., Cm, ..., CM, and N D2D pairs denoted by D1, D2, ..., Dn, ..., DN
FIGURE 1 System model of a sectored cellular network
that are distributed randomly within s cell sectors, where m ∈ [1, 2, ..., M] and n ∈ [1, 2, ..., N]. Each sector is deployed by a BS, and these BSs have logical links with the CH of each cluster. The management of traffic requests and the availability of a node are determined using a TDMA scheme, which means that devices from each cluster are allocated a different time slot and frequency for communicating with their respective CH. Each CH uses a distinct time slot to avoid time conflicts between two neighboring CHs. Thus, we use a Deep Q-learning algorithm that decides whether a time slot will be allotted to a CH for relaying the data of the devices associated with that CH. Using Deep learning, each CH learns from its previous learning experience, thus relaying data through the CH on the decided time slot as per the information recorded in the past slots. The estimated Q-value obtained for each CH is different for each time slot. Thus, the availability of a time slot for a CH at any time is determined by comparing their Q-values. We assume that the total number of channels is less than or equal to the number of CUs. Let X UL channels be allocated to M CUs. As there is no free channel available for D2Ds, the D2D pairs communicate using the channels of the CUs on a sharing basis. Each D2D transmitter (D2DTx) selects an appropriate channel x from the available channel set, ie, x ∈ [1, 2, ..., X]. As a result, this leads to interference between D2Ds and CUs within a sector communicating via the same channel. However, the interference between cell sectors is negligible, as they use their own frequency sets and time slots. In order to avoid interference from D2Ds on the cellular link, QOS constraints are chosen for the CUs. Thus, for each CU, an interference boundary is set. This is determined using the SINR margin ($\eta_B$) at the BS for each CU and the UL transmit power ($P_m$) of the CU. Mathematically,
the overall interfering transmit power limit from each D2D can be expressed as follows:
$$\sum_{s=1}^{S}\sum_{n=1}^{N_S}\sum_{x=1}^{X}\left(g^{x}_{n,B}\,q^{x}_{n}+\sigma^{2}\right)\leq\frac{P_{m}\,g^{x}_{m,B}}{\eta_{B}},\qquad x\in C,\tag{1}$$
where $q^{x}_{n}$ represents the transmit power of the nth D2D in sector s over the xth channel. The channel gain at the BS from the D2D transmitter and the channel gain of the CU-BS link are represented as $g^{x}_{n,B}$ and $g^{x}_{m,B}$, respectively, and $\sigma^{2}$ represents the noise power on the xth channel. Let $g^{x}_{n,B}\,q^{x}_{n}+\sigma^{2}=I^{x}_{n}$ denote the interfering transmit power from each D2D user over the xth channel.
The devices in the cellular system use the SINR to measure the quality of the channel. Let
$$\xi^{C}_{m}=\frac{P_{m}\,g^{x}_{m,B}}{q^{x}_{n}\,t_{n}\,g^{x}_{n,B}+\sigma^{2}}\quad\text{and}\quad\xi^{D}_{n}=\frac{q^{x}_{n}\,g^{x}_{n,n}}{P_{m}\,t_{m}\,g^{x}_{m,n}+\sigma^{2}}$$
denote the SINR associated with the CUs and D2Ds over the xth channel. Here, $g^{x}_{n,n}$ represents the channel gain of the nth D2D pair, whereas the interference channel gain of the CU-D2D link is denoted as $g^{x}_{m,n}$. Furthermore, $t_{m}\ (0\leq t_{m}\leq 1)$ and $t_{n}\ (0\leq t_{n}\leq 1)$ are the normalized traffic rates of the CU and D2D node, respectively.
Therefore, the achievable data rate for the whole system, ie, the sum of the data rate of CUs and D2Ds communicating
over the same xth frequency band, is as follows:
$$R=R^{C}_{m}+R^{D}_{n},\tag{2}$$
where
$$R^{C}_{m}=\sum_{s=1}^{S}\sum_{m=1}^{M_S}\sum_{x=1}^{X}B_{RB}\log\left(1+\xi^{C}_{m}\right)\tag{3}$$
and
$$R^{D}_{n}=\sum_{s=1}^{S}\sum_{n=1}^{N_S}\sum_{x=1}^{X}\psi_{m,n}\log\left(1+\xi^{D}_{n}\right).\tag{4}$$
Here, $M_S$ and $N_S$ represent the number of CUs and D2Ds in sector s, $B_{RB}$ denotes the bandwidth corresponding to each channel, and $\psi_{m,n}$ is a binary variable. When $\psi_{m,n}=1$, the mth CU shares its channel with the nth D2D; otherwise, it is zero.
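As an illustration of (2) to (4), the following sketch evaluates the sum-rate for a given channel-sharing indicator; the channel gains, transmit powers, and bandwidth value are placeholder assumptions, and base-2 logarithms are used so that rates come out in bit/s.

```python
import numpy as np

def sum_rate(p_cu, q_d2d, g_cu_bs, g_d2d_bs, g_d2d_d2d, g_cu_d2d,
             psi, t_cu, t_d2d, noise=1e-13, b_rb=180e3):
    """Sum of CU and D2D rates on shared channels, following (2)-(4).
    psi[m, n] = 1 if CU m shares its channel with D2D pair n."""
    rate = 0.0
    for m in range(len(p_cu)):
        for n in range(len(q_d2d)):
            if psi[m, n] != 1:
                continue
            # SINR of CU m at the BS, interfered by D2D n (as defined above).
            sinr_cu = p_cu[m] * g_cu_bs[m] / (q_d2d[n] * t_d2d[n] * g_d2d_bs[n] + noise)
            # SINR of D2D pair n, interfered by CU m.
            sinr_d2d = q_d2d[n] * g_d2d_d2d[n] / (p_cu[m] * t_cu[m] * g_cu_d2d[m, n] + noise)
            rate += b_rb * np.log2(1 + sinr_cu) + np.log2(1 + sinr_d2d)
    return rate
```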
The optimization problem is to maximize the overall system throughput R over time for certain QOS constraints of CUs
and D2Ds that can be expressed as follows:
$$\max\ \sum_{s=1}^{S}\sum_{m=1}^{M_S}\sum_{x=1}^{X}B_{RB}\log\left(1+\xi^{C}_{m}\right)+\sum_{s=1}^{S}\sum_{n=1}^{N_S}\sum_{x=1}^{X}\psi_{m,n}\log\left(1+\xi^{D}_{n}\right)\tag{5}$$
subject to
$$\sum_{s=1}^{S}\sum_{n=1}^{N}\sum_{x=1}^{X}I^{x}_{n}\leq\frac{P_{m}\,g^{x}_{m,B}}{\eta_{B}}\tag{5a}$$
$$q^{x}_{n}\leq P_{\max},\qquad x\in C,\ n\in D\tag{5b}$$
$$\sum_{m=1}^{M}\sum_{n=1}^{N}\psi_{m,n}=1.\tag{5c}$$
The aim of our work is to maximize the objective function under a certain interference threshold constraint set at the BS and the maximum allowable transmit power for D2Ds. The interference threshold, ie, $P_{m}\,g^{x}_{m,B}/\eta_{B}$, is evaluated for each CU to avoid interference between the D2D and CU links. The next step is to manage the resource allocation in an optimal manner such that the throughput achieved by the system becomes maximum. This is achieved by the network itself through an appropriate learning mechanism.
3 DEEP Q-LEARNING WITH EKF FOR RESOURCE ALLOCATION
Reinforcement learning is a process of learning where a situation in an environment state is mapped to an action so that the overall reward becomes maximum. In this section, a powerful learning algorithm called Deep Q-learning combined with EKF is employed to resolve the channel and power allocation issue. Before this, we first discuss the various difficulties associated with the existing schemes and then address these issues in our work. Figure 2 shows the schematic representation of Deep Q-learning with EKF. The system follows a two-stage process, ie, learning and resource assignment. At each time step, the agent receives the current state of the nodes, ie, the positions of D2D, CU, and CH, and the distance between the nodes and the eNodeB. This state information is then fed into the Deep Q-learning + EKF algorithm. The internal architecture of this block is shown in Figure 3, which basically uses a DNN and an EKF. The agent then selects an action, ie, a frequency channel and transmit power, under a policy. Following this, the state of the nodes transits to a new state. The interaction between agent and environment takes place for one episode, ie, Γ = 15 transmission time intervals (TTI), where one TTI = 1 ms. The state transition is based on an uncertain state transition probability set. The agent stores these transitions in a replay memory. The estimated error, which is the difference between the observed Q-function stored at the output of the DNN and the target Q-value computed from the sample of a random mini-batch, is back propagated to learn the existing uncertainty in the weights of the Q-function. This error signal is then used to estimate the gain of the Kalman filter. In this approach, the Kalman filter circulates the detailed information from the target Q-values, which is estimated from the uncertain state transition set and passed on to the set of uncertain weights. This process goes on for all the episodes to achieve a robust policy that assists in assigning channels and power to the users fairly, thereby maximizing the sum-rate of the network.
FIGURE 2 Schematic diagram representation of Deep Q-learning with extended Kalman filter (EKF)
FIGURE 3 Block diagram representation of Deep Q-learning with extended Kalman filter (EKF)
3.1 Q-learning
The various components of the learning model are as follows. The state (St) of the environment represents the location of the CU and D2D, the absolute angle with respect to the eNodeB, the distance between the node and the eNodeB, the channel condition defined by the SINR, the interference level ($I^{x}_{n}$), and the user traffic arrival rate (λ). As the user's location is fixed in each simulation instant, St can be reduced to St = (SINR, $I^{x}_{n}$, λ). These parameters are used for allocating radio resources. The traffic requests from different types of users and their priorities are based on the M/M/C:N queuing model with nonpreemptive limited priority that is discussed in the subsequent section. The set of actions (Ac) at discrete time instant t that the agent may choose consists of a set of channels and power levels assigned to each user associated with a CH in sector $S_i$, so $Ac_t = (x(t), P_x(t))$. The parameter x(t) denotes the selected channel, ie, $x\in\{0,1,\ldots,X\}$, where $X=W/B$, W represents the system bandwidth, and B the bandwidth of each channel. The transmit power $P_x(t)$ is the power level associated with each channel, varies between 0 and $P_{\max}$, and is a very important parameter for interference control. The scalar reward function (R) is determined using the current state St and action Ac, and generally denotes the average system capacity of the network. The last component is the policy π, which is used to maximize the total expected reward over a long run time. A small sketch of this state and action representation follows.
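To make the state and action definitions concrete, the sketch below enumerates the discrete (channel, power level) action set and packs a state vector St = (SINR, I, λ); the channel and power-level counts follow the simulation setup in Section 5, and the helper names are ours.

```python
from itertools import product
import numpy as np

N_CHANNELS = 10                           # one UL channel per CU in Section 5
POWER_LEVELS_MW = [20, 40, 60, 80, 100]   # D2D power levels used in Section 5

# Action = (channel index, transmit power in mW); the agent picks one pair per D2D.
ACTIONS = list(product(range(N_CHANNELS), POWER_LEVELS_MW))

def make_state(sinr, interference, arrival_rate):
    """State vector St = (SINR, I_n^x, lambda) fed to the Q-network."""
    return np.array([sinr, interference, arrival_rate], dtype=np.float32)
```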
In this process, the agent accumulates previous experience and, based on these experiences, learns how to interact with the environment. Considering that the agent is in a discrete time MDP (DTMDP) with continuous state and action space, the reward $R_t$ depends on the present state $St_t$, the action $Ac_t$, and the next landed state $St'_t=St_{t+1}$, so $R_t=R(St_t,Ac_t,St'_t)$. In our case, the throughput in (2) is the reward function. Since the domain of actions is continuous, we evaluate the action from a stochastic policy $\pi(Ac_t\mid St_t)=P(St=St_{t+1}\mid St_t,Ac_t)$, which indicates the mapping of the environment state to the probability of taking an action. The probability distribution $P(St=St_{t+1}\mid St_t,Ac_t)$ for the transition belongs to an uncertainty transition set $p(\cdot\mid St_t,Ac_t)$. A DTMDP estimates and changes the policy based on the value function, which is defined as the cumulative expected reward beginning from the state St and following the assigned policy π. Overall, the learning goal of the agent is to search for an optimal policy that will maximize the cumulative reward over a long run time. From now on, we use the symbol St for $St_t$ and Ac for $Ac_t$. For policy π, the value function associated with state St can be presented recursively as follows:
$$V^{\pi}(St)=\mathbb{E}_{\pi}\left[\sum_{t\geq 0}\gamma^{t}R_{t}\ \middle|\ s_{0}=St,\pi\right]=\mathbb{E}_{\pi}\left[R_{0}+\gamma R_{1}+\gamma^{2}R_{2}+\cdots\ \middle|\ s_{0}=St\right]=\mathbb{E}_{\substack{Ac\sim\pi(\cdot\mid St)\\ St'\sim p(\cdot\mid St,Ac)}}\left[R_{0}+\gamma V^{\pi}(St')\ \middle|\ s_{0}=St\right],\tag{6}$$
where $\mathbb{E}\{\cdot\}$ denotes the expectation operator and γ the discount factor, which takes a value between 0 and 1. The set $p(\cdot\mid St_t,Ac_t)$ represents the probability distribution of the state transition.
The goal is to find an optimal policy $\pi^{*}$ that maximizes the sum of rewards $V^{\pi}(St)$, ie,
$$V^{*}(St)=V^{\pi^{*}}(St)=\max_{\pi}V^{\pi}(St).\tag{7}$$
The optimal value that satisfies Equation (6) becomes
$$V^{*}(St)=V^{\pi^{*}}(St)=\max\ \mathbb{E}_{\substack{Ac\sim\pi(\cdot\mid St)\\ St'\sim p(\cdot\mid St,Ac)}}\left[R_{0}+\gamma V^{*}(St')\right].\tag{8}$$
The difficulty of an MDP is the handling of randomness, both in terms of the initial state during sampling and in terms of the distribution of the next state. Let $V(St)<V(St')$; then the state St' is better than St, as the agent receives more rewards in state St' than in St. The perfectness of the environment is not always known to the agent, thus the information about the immediate reward as well as the next landed state is also not exactly known to the agent. Thus, $\pi^{*}$ may not necessarily be unique. Therefore, we look for another way to evaluate $\pi^{*}$.
The role of Q-learning is to find $\pi^{*}$ without having knowledge of the immediate reward and the distribution of the next state. This is possible by remanipulating Equation (8): under policy π, the value of a given state-action pair is expressed as
$$Q^{\pi}(St,Ac)=\mathbb{E}_{\pi}\left[\sum_{t\geq 0}\gamma^{t}R_{t}\ \middle|\ s_{0}=St,a_{0}=Ac,\pi\right]=R_{0}+\gamma\,\mathbb{E}_{St'\sim p(\cdot\mid St,Ac)}\left[V^{\pi}(St')\right].\tag{9}$$
Thus, the Q-function is basically the expected cumulative reward for taking an action Ac in state St and thereafter following the policy π.
The optimal value of the Q-function, $Q^{\pi^{*}}$, is then expressed as
$$Q^{*}(St,Ac)=Q^{\pi^{*}}(St,Ac)=R_{0}+\gamma\,\mathbb{E}_{St'\sim p(\cdot\mid St,Ac)}\left[V^{*}(St')\right].\tag{10}$$
This gives
$$V^{*}(St)=\max_{Ac}Q^{*}(St,Ac).\tag{11}$$
Therefore, $V^{*}$, which satisfies the Bellman equation (8), can be retrieved from $Q^{*}$. Replacing Equation (11) in (10), we obtain $Q^{*}(St,Ac)$, which is presented as follows:
$$Q^{*}(St,Ac)=R_{0}+\gamma\,\mathbb{E}_{St'\sim p(\cdot\mid St,Ac)}\left[\max_{Ac'}Q^{*}(St',Ac')\right].\tag{12}$$
Thus, the process of obtaining $Q^{*}(St,Ac)$ in Q-learning is an iterative method. In general, the iterative Q-learning update is
$$Q_{t}(St,Ac)=Q_{t-1}(St,Ac)+\alpha\left[R_{t}+\gamma\max_{Ac'}Q_{t-1}(St',Ac')-Q_{t-1}(St,Ac)\right],\tag{13}$$
where α takes a value between 0 and 1 and is called the learning rate of the system. The convergence of $Q(St,Ac)$ to $Q^{*}(St,Ac)$ occurs when the value of α moves toward zero and each (St, Ac) pair is visited infinitely often. However, this is impracticable for an extremely large state space. In such a case, all states are rarely covered, and thus the Q-values for those states may not be updated in the Q-table. Therefore, it takes a longer time to converge.
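A minimal tabular sketch of the iterative update (13) for a toy discretized state and action space is shown below; it only illustrates the update rule itself and is not the resource-allocation agent.

```python
import numpy as np

def q_learning_step(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One application of update (13): move Q(s, a) toward the bootstrapped
    target r + gamma * max_a' Q(s', a')."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

# Toy usage with 8 discrete states and 5 actions (eg, power levels).
Q = np.zeros((8, 5))
Q = q_learning_step(Q, s=2, a=1, r=1.5, s_next=3)
```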
3.2 Deep Q-learning
In order to solve this issue, Deep Q-learning uses a nonlinear function approximator $Q(St,Ac,\theta)$ to estimate the Q-function $Q^{*}(St,Ac)$. The function approximator $Q(St,Ac,\theta)$ is basically a DNN, where θ represents the weights of the function. Therefore, the learning mechanism is called Deep Q-learning. In this learning, the neural network takes the input state vector St = (SINR, $I^{x}_{n}$, λ) and a given set of weights θ to estimate the Q-values in an iterative manner.
In the DQN, the weight θ is adjusted iteratively to improve the Q-function. The agent tries to minimize the Bellman error, which means that the Q-value is made close to the target Q-value. This is defined by means of a loss function represented as follows:
$$\mathrm{Loss}_{i}(\theta_{i})=\mathbb{E}_{D_{i}\sim p(\cdot)}\left[\left(R+\gamma\max_{Ac'}Q(St',Ac';\tilde{\theta})-Q(St,Ac;\theta_{i})\right)^{2}\right].\tag{14}$$
Here, $Z_{i,\tilde{\theta}}(D_{i})=R+\gamma\max_{Ac'}Q(St',Ac';\tilde{\theta})$ is the target Q-value. The symbol $p(\cdot)$ is the probability distribution for the transition $D_{i}=(St,Ac,R,St')$ at each iteration, and the transition is stored in a replay memory D having capacity N. Moreover, $\tilde{\theta}$ represents the fixed weights used to determine the target Q-value, and the weights $\theta_{i}$ are the estimated values at iteration i. The Q-value of each device calculated at different time slots is of high importance because, by comparing these Q-values, the agent learns whether a time slot is suitable for transmission of a message or not. Thus, the availability of a device at any point in time can be known. Furthermore, the reward at any time slot depends on the success or failure of the transmission of the message. Thus, the overhead during message exchange between devices and the CH is determined by the size of the transmitted message and the number of active transmitters at that point in time. If there are K devices active at any point in time, then the overhead is K(K-1) messages, where each message has a size of m bits.
In order to minimize the loss function, we determine the gradient of the loss function with respect to the function parameters $\theta_{i}$, ie,
$$\frac{d\,\mathrm{Loss}_{i}(\theta_{i})}{d\theta_{i}}=\mathbb{E}_{D_{i}\sim p(\cdot)}\left[\left(R+\gamma\max_{Ac'}Q(St',Ac';\tilde{\theta})-Q(St,Ac;\theta_{i})\right)\frac{dQ(St,Ac;\theta_{i})}{d\theta_{i}}\right].\tag{15}$$
The weight update is as follows:
$$\theta_{i+1}\leftarrow\theta_{i}+\alpha\,\mathbb{E}_{D_{i}\sim p(\cdot)}\left[\left(Z_{i,\tilde{\theta}}(D_{i})-Q(St,Ac;\theta_{i})\right)\frac{dQ(St,Ac;\theta_{i})}{d\theta_{i}}\right].\tag{16}$$
Here, the Q-function is trained iteratively so that the Q-value approaches the target $Z_{i,\tilde{\theta}}(D_{i})$ when the Q-function is optimal.
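A hedged TensorFlow sketch of the standard DQN update in (14) to (16) follows: a fixed target network supplies the bootstrapped label Z, and one gradient step adjusts θ on a mini-batch. The network objects, optimizer, and shapes are assumptions for illustration only.

```python
import tensorflow as tf

def dqn_step(q_net, target_net, optimizer, states, actions, rewards, next_states, gamma=0.9):
    """One mini-batch update of (16): minimize (Z - Q(St, Ac; theta))^2,
    where Z = R + gamma * max_a' Q(St', a'; theta_tilde)."""
    # Target Q-value computed with the fixed-weight network (theta_tilde).
    z = rewards + gamma * tf.reduce_max(target_net(next_states), axis=1)
    with tf.GradientTape() as tape:
        q_all = q_net(states)                                    # (batch, n_actions)
        q_sa = tf.gather(q_all, actions, axis=1, batch_dims=1)   # Q(St, Ac; theta)
        loss = tf.reduce_mean(tf.square(tf.stop_gradient(z) - q_sa))
    grads = tape.gradient(loss, q_net.trainable_variables)
    optimizer.apply_gradients(zip(grads, q_net.trainable_variables))
    return loss
```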
3.3 Deep Q-learning with EKF
In the first phase, we discuss the uncertainty issue of state transition probability using Deep Q-learning. Furthermore, the
weight uncertainty issue, which is not solvable in the first phase, is presented in the second phase using EKF. The block
diagram representation of Deep Q-learning with EKF is shown in Figure 3.
In our work, we first present a modified version of the loss function (14) that incorporates the uncertainty present in Q-learning and improves the learning process of the agent. In this process, at each episode, the learner accesses the uncertainty transition set $p(\cdot\mid St_{i},Ac_{i})$ for estimating all the possible next landed states, denoted by the set $\hat{S}(\cdot\mid St_{i},Ac_{i})$. In order to train the Q-network, a mini-batch of transition data $D_{i}$ is fetched from the replay memory instead of consecutive samples. The agent then estimates the modified target $Z^{\mathrm{new}}_{i,\tilde{\theta}}(D_{i})$ given in Equation (18) and updates multiple weights using $Z^{\mathrm{new}}_{i,\tilde{\theta}}(D_{i})$ according to (16). This greatly increases data efficiency.
Using the aforementioned concept, the modified version of the loss function is as follows:
$$\mathrm{Loss}_{i}(\theta_{i})^{\mathrm{new}}=\mathbb{E}_{D_{i}\sim p(\cdot)}\left[\left(R+\gamma\min_{P\in p}\sum_{St'\in\hat{S}}p\left(St'\mid St_{i},Ac_{i}\right)\max_{Ac'}Q(St',Ac';\tilde{\theta})-Q(St,Ac;\theta_{i})\right)^{2}\right],\tag{17}$$
where
$$Z^{\mathrm{new}}_{i,\tilde{\theta}}(D_{i})=R+\gamma\min_{P\in p}\sum_{St'\in\hat{S}}p\left(St'\mid St_{i},Ac_{i}\right)\max_{Ac'}Q(St',Ac';\tilde{\theta}).\tag{18}$$
However, Deep Q-learning with experience replay updates the weights in accordance with the target label $Z^{\mathrm{new}}_{i,\tilde{\theta}}(D_{i})$, but it does not consider the uncertainty of the weights, which is very important for designing any robust model. Thus, we propose to use the EKF with the DQN to handle both the weight uncertainty and the uncertain state transition probability. The robust target in (18) is illustrated by the sketch below.
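The following sketch computes the robust target (18) under the assumption that the uncertainty set is available explicitly as a small list of candidate probability vectors over the estimated next states; the variable names are illustrative.

```python
import numpy as np

def robust_target(reward, q_next_max, candidate_dists, gamma=0.9):
    """Z_new = R + gamma * min_{P in p} sum_{St'} P(St') * max_a' Q(St', a'; theta_tilde).
    q_next_max[j]      : max_a' Q(St'_j, a') for each candidate next state St'_j
    candidate_dists[k] : k-th candidate probability vector over those next states."""
    expected_values = [np.dot(p, q_next_max) for p in candidate_dists]
    return reward + gamma * min(expected_values)

# Toy usage: two candidate distributions over three possible next states.
z_new = robust_target(reward=1.0,
                      q_next_max=np.array([2.0, 1.5, 0.5]),
                      candidate_dists=[np.array([0.6, 0.3, 0.1]),
                                       np.array([0.2, 0.3, 0.5])])
```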
In our work, we apply an EKF over the DQN, which is suitable for learning the behavior of nonlinear data or for evaluating the weights θ of a nonlinear function approximation. The EKF can now be used to process the following nonlinear model:
$$\theta_{i}=\theta_{i-1}+w_{i-1}\tag{19}$$
$$Z_{i}(D_{i})=Q(St,Ac;\theta_{i})+v_{i},\tag{20}$$
where $Z_{i}(D_{i})=R+\gamma\max_{Ac'}Q(St',Ac';\tilde{\theta})$ is the target value at iteration i and $Q(St,Ac;\theta_{i})$ a nonlinear Q-function, whereas $w_{i-1}$ and $v_{i}$ are the process error and the measurement error with covariances $F_{w_{i}}$ and $F_{v_{i}}$, respectively.
The nonlinear Q-function can be linearized by the EKF using the first-order term of the Taylor expansion, as mentioned in the following:
$$Q(St,Ac;\theta_{i})=Q(St,Ac;\breve{\theta})+\left.\frac{dQ(St,Ac;\breve{\theta})}{d\theta_{i}}\right|_{\theta_{i}=\breve{\theta}}\left(\theta_{i}-\breve{\theta}\right)^{T}.\tag{21}$$
Here, $\breve{\theta}$ is the linearization point and basically represents the estimated weight $\theta_{i\mid i-1}$ at iteration i-1. As $\theta_{i}$ is random in nature, the EKF estimates the expected value of the weights, represented by $\breve{\theta}$, based on the transitions observed over a time range.
In the prediction stage, the predicted value of the Q-function is obtained as follows:
$$\begin{aligned}\breve{Z}_{i\mid i-1}(D_{i})&=\mathbb{E}\left[Q(St,Ac;\theta_{i})\mid D_{1:i-1}\right]\\&\overset{(21)}{=}\mathbb{E}\left[Q(St,Ac;\breve{\theta}_{i\mid i-1})+\frac{dQ(St,Ac;\breve{\theta}_{i\mid i-1})}{d\theta_{i}}\left(\theta_{i}-\breve{\theta}_{i\mid i-1}\right)^{T}\ \middle|\ D_{1:i-1}\right]\\&=Q(St,Ac;\breve{\theta}_{i\mid i-1})+\left(\mathbb{E}\left[\theta_{i}\mid D_{1:i-1}\right]-\breve{\theta}_{i\mid i-1}\right)^{T}\frac{dQ(St,Ac;\breve{\theta}_{i\mid i-1})}{d\theta_{i}}\\&=Q(St,Ac;\breve{\theta}_{i\mid i-1})+\left(\breve{\theta}_{i\mid i-1}-\breve{\theta}_{i\mid i-1}\right)^{T}\frac{dQ(St,Ac;\breve{\theta}_{i\mid i-1})}{d\theta_{i}}\\&=Q(St,Ac;\breve{\theta}_{i\mid i-1}),\end{aligned}\tag{22}$$
where $\mathbb{E}[\theta_{i}\mid D_{1:i-1}]\approx\breve{\theta}_{i\mid i-1}=\breve{\theta}_{i-1\mid i-1}$, according to the general conditional expectation rule, ie, $\mathbb{E}[\theta_{i}\mid D_{1:i}]\approx\theta_{i\mid i}$.
Now, the corrected or updated value of the Q-function is as follows:
$$\hat{Z}_{i\mid i-1}(D_{i})\approx Q(St,Ac;\theta_{i})-\breve{Z}_{i\mid i-1}(D_{i})\approx Q(St,Ac;\theta_{i})-Q(St,Ac;\breve{\theta}_{i\mid i-1}).\tag{23}$$
The covariance that relates the error in the weights and the corrected Q-value is as follows:
$$\begin{aligned}F_{\hat{\theta}_{i},\hat{Z}}(D_{i})&\triangleq\mathbb{E}\left[\hat{\theta}_{i\mid i-1}\hat{Z}_{i\mid i-1}(D_{i})\mid D_{1:i-1}\right]\\&=\mathbb{E}\left[\left(\theta_{i}-\breve{\theta}_{i\mid i-1}\right)\left(Q(St,Ac;\theta_{i})-Q(St,Ac;\breve{\theta}_{i\mid i-1})\right)\mid D_{1:i-1}\right]\\&\overset{(21)}{=}\mathbb{E}\left[\left(\theta_{i}-\breve{\theta}_{i\mid i-1}\right)\left(Q(St,Ac;\breve{\theta}_{i\mid i-1})+\frac{dQ(St,Ac;\breve{\theta}_{i\mid i-1})}{d\theta_{i}}\left(\theta_{i}-\breve{\theta}_{i\mid i-1}\right)^{T}-Q(St,Ac;\breve{\theta}_{i\mid i-1})\right)\ \middle|\ D_{1:i-1}\right]\\&=\mathbb{E}\left[\hat{\theta}_{i\mid i-1}\hat{\theta}^{T}_{i\mid i-1}\mid D_{1:i-1}\right]\frac{dQ(St,Ac;\breve{\theta}_{i\mid i-1})}{d\theta_{i}}\\&=F_{i\mid i-1}\,\frac{dQ(St,Ac;\breve{\theta}_{i\mid i-1})}{d\theta_{i}},\end{aligned}\tag{24}$$
where $\theta_{i}-\breve{\theta}_{i\mid i-1}\triangleq\hat{\theta}_{i\mid i-1}$ and $\mathbb{E}\left[\hat{\theta}_{i\mid i-1}\hat{\theta}^{T}_{i\mid i-1}\mid D_{1:i-1}\right]\triangleq F_{i\mid i-1}$ is the conditionally expected error covariance.
The covariance of the updated Q-function is as follows:
$$\begin{aligned}F_{\hat{z}}(D_{i})&=\mathbb{E}\left[\left(\hat{Z}_{i\mid i-1}(D_{i})\right)^{2}\mid D_{1:i-1}\right]+F_{v_{i}}\\&\overset{(23)}{=}\mathbb{E}\left[\left(Q(St,Ac;\theta_{i})-Q(St,Ac;\breve{\theta}_{i\mid i-1})\right)^{2}\mid D_{1:i-1}\right]+F_{v_{i}}\\&\overset{(21)}{=}\mathbb{E}\left[\left(Q(St,Ac;\breve{\theta}_{i\mid i-1})+\frac{dQ(St,Ac;\breve{\theta}_{i\mid i-1})}{d\theta_{i}}\left(\theta_{i}-\breve{\theta}_{i\mid i-1}\right)^{T}-Q(St,Ac;\breve{\theta}_{i\mid i-1})\right)^{2}\ \middle|\ D_{1:i-1}\right]+F_{v_{i}}\\&=\mathbb{E}\left[\left(\frac{dQ(St,Ac;\breve{\theta}_{i\mid i-1})}{d\theta_{i}}\left(\theta_{i}-\breve{\theta}_{i\mid i-1}\right)^{T}\right)^{2}\ \middle|\ D_{1:i-1}\right]+F_{v_{i}}\\&=\left(\frac{dQ(St,Ac;\breve{\theta}_{i\mid i-1})}{d\theta_{i}}\right)^{T}\mathbb{E}\left[\hat{\theta}_{i\mid i-1}\hat{\theta}^{T}_{i\mid i-1}\mid D_{1:i-1}\right]\frac{dQ(St,Ac;\breve{\theta}_{i\mid i-1})}{d\theta_{i}}+F_{v_{i}}\\&=\left(\frac{dQ(St,Ac;\breve{\theta}_{i\mid i-1})}{d\theta_{i}}\right)^{T}F_{i\mid i-1}\,\frac{dQ(St,Ac;\breve{\theta}_{i\mid i-1})}{d\theta_{i}}+F_{v_{i}}.\end{aligned}\tag{25}$$
Now, using the covariances computed in (24) and (25), the gain of the EKF is as follows:
$$KG_{i}\triangleq F_{\hat{\theta}_{i},\hat{Z}}(D_{i})\,F^{-1}_{\hat{z}}(D_{i})=F_{i\mid i-1}\,\frac{dQ(St,Ac;\breve{\theta}_{i\mid i-1})}{d\theta_{i}}\left[\left(\frac{dQ(St,Ac;\breve{\theta}_{i\mid i-1})}{d\theta_{i}}\right)^{T}F_{i\mid i-1}\,\frac{dQ(St,Ac;\breve{\theta}_{i\mid i-1})}{d\theta_{i}}+F_{v_{i}}\right]^{-1}.\tag{26}$$
The measurement update of the weights $\breve{\theta}$ used to train the Q-function with the EKF is defined as follows:
$$\breve{\theta}_{i\mid i}=\breve{\theta}_{i\mid i-1}+\mathbb{E}_{D_{i}}\left[KG_{i}\left(Z_{i,\tilde{\theta}}(D_{i})-Q(St,Ac;\breve{\theta}_{i\mid i-1})\right)\right].\tag{27}$$
The updated covariance error is as follows:
$$F_{i\mid i}=F_{i\mid i-1}-KG_{i}\,F_{\hat{z}}(D_{i})\,KG_{i}^{T}.\tag{28}$$
It is inferred from the EKF gain in (26) that the EKF adapts the learning of every weight, which includes the weight uncertainty features. Now, the EKF estimates the posterior weights using the knowledge of the probability of the prior weights conditioned on the transitions $D_{1:i}$ obtained up to iteration i. This posterior weight update is based on the likelihood of the target value $Z_{i}(D_{i})\mid\theta_{i}$ and the posterior probability distribution of $\theta_{i}\mid D_{1:i-1}$. Further, considering the Gaussian nature of $p(Z_{i}(D_{i})\mid\theta_{i})$ and $p(\theta_{i}\mid D_{1:i-1})$, we define the following distributions:
$$Z_{i}(D_{i})\mid\theta_{i}\sim\mathcal{N}\left(\mathbb{E}\left[Q(St,Ac;\theta_{i})\right],F_{v_{i}}\right)\tag{29}$$
$$\theta_{i}\mid D_{1:i-1}\sim\mathcal{N}\left(\breve{\theta}_{i\mid i-1},F_{i\mid i-1}\right).\tag{30}$$
Using Equations (29) and (30) and the weights $\breve{\theta}$ in (27), we obtain a controlled version of the loss function that builds uniformity into the network, ie,
$$\breve{\mathrm{Loss}}_{i}(\theta_{i})=\frac{1}{2F_{v_{i}}}\,\mathbb{E}\left[\left(Z_{i,\tilde{\theta}}(D_{i})-Q(St,Ac;\theta_{i})\right)^{2}\right]+\frac{1}{2}\left(\theta_{i}-\breve{\theta}_{i\mid i-1}\right)F^{-1}_{i\mid i-1}\left(\theta_{i}-\breve{\theta}_{i\mid i-1}\right)^{T}.\tag{31}$$
Here, the weights are controlled on the basis of the updated covariance error $F_{i\mid i-1}$, and the value of the error variance $F_{v_{i}}$ varies according to the transition $D_{i}=(St,Ac,R,St')$. The value of $F_{v_{i}}$ becomes high if the transition $D_{i}$ is heavily corrupted. In this situation, $F_{v_{i}}$ affects the weights strongly, and thus the error variance $F_{v_{i}}$ acts as a control parameter toward maintaining uniformity in the network.
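The EKF recursion in (24) to (28) can be sketched as follows for a Q-network whose weights are flattened into a single vector; because a single scalar Q(St, Ac; θ) is observed per update, the Jacobian dQ/dθ is a vector and the innovation variance is a scalar. This is a minimal NumPy illustration of the update equations under those assumptions, not the authors' implementation.

```python
import numpy as np

def ekf_weight_update(theta, F, grad_q, q_pred, z_target, F_v):
    """One EKF measurement update for the Q-network weights.
    theta    : current weight estimate theta_{i|i-1}               (d,)
    F        : weight error covariance F_{i|i-1}                   (d, d)
    grad_q   : dQ(St, Ac; theta)/dtheta evaluated at theta         (d,)
    q_pred   : Q(St, Ac; theta_{i|i-1})  (predicted value, Eq. 22)
    z_target : target value Z = R + gamma * max_a' Q(St', a'; theta_tilde)
    F_v      : measurement noise variance (scalar)"""
    g = grad_q.reshape(-1, 1)
    F_z = float(g.T @ F @ g) + F_v                          # innovation variance, Eq. (25)
    K = (F @ g) / F_z                                       # Kalman gain, Eq. (26)
    theta_new = theta + (K * (z_target - q_pred)).ravel()   # weight update, Eq. (27)
    F_new = F - F_z * (K @ K.T)                             # covariance update, Eq. (28)
    return theta_new, F_new

# Toy usage with a 4-parameter "network".
d = 4
theta, F = np.zeros(d), np.eye(d)
theta, F = ekf_weight_update(theta, F, grad_q=np.array([0.1, -0.2, 0.05, 0.3]),
                             q_pred=0.8, z_target=1.2, F_v=0.5)
```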
3.4 Computational complexity
The time complexity of our proposed resource sharing algorithm is as follows. In our algorithm, we have n episodes and each episode has a duration of 15 TTIs. In the algorithm, a total of Ac actions are handled for a network having |S| sectors. The weight update θ in the fully connected DQN depends on the width and the number of hidden layers. Let H(θ) be the total number of parameters associated with the DQN, expressed as $H(\theta)=\sum_{i=0}^{3}(L_{i}+1)L_{i+1}$, where $L_{0}$ represents the number of state inputs in the first layer and $L_{1},L_{2},L_{3}$ represent the numbers of neurons in the hidden layers. Thus, the time complexity of our algorithm is $\mathcal{O}(H(\theta)\cdot n\cdot T\cdot|S|\cdot Ac)$.
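As a quick check of H(θ) for the network used in Section 5 (3 state inputs and hidden layers of 400, 200, and 100 neurons), the parameter count can be computed as below; the output width of 50 assumes the 10-channel × 5-power-level action set and is our assumption, not a figure stated by the authors.

```python
# H(theta) = sum_{i} (L_i + 1) * L_{i+1}, counting one bias per neuron.
layers = [3, 400, 200, 100, 50]   # [inputs, hidden1, hidden2, hidden3, outputs]
H = sum((layers[i] + 1) * layers[i + 1] for i in range(len(layers) - 1))
print(H)  # 106950 parameters for this assumed layout
```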
4 M/M/C:N QUEUING WITH NONPREEMPTIVE LIMITED PRIORITY
In this queuing model, service to low priority users follows the service to j high priority users. The first grade users are referred to as high priority users, whereas the second grade users are treated as low priority users.
TABLE 1 Expected waiting time of two graded users with j = 4

Traffic density (ι)   Waiting time of first grade user, Wq1 (s)   Waiting time of second grade user, Wq2 (s)
0                     0                                           0
0.2                   1.02 × 10⁻⁶                                 0.00241
0.4                   0.00167                                     0.00641
0.6                   0.0052                                      0.01432
0.8                   0.01341                                     0.02705
1.0                   0.02732                                     0.04251
In the M/M/C:N queuing system, the user arrivals follow a Poisson distribution with rate λ per hour, whereas the service times follow an exponential distribution with average service rate μ per hour. The queue length is set to N and C denotes the number of servers. The traffic density of the system is denoted by $\iota=\iota_{1}+\iota_{2}=\frac{\lambda_{1}}{C\mu}+\frac{\lambda_{2}}{C\mu}$, where $\lambda_{1}$ and $\lambda_{2}$ represent the arrival rates of the first and second grade users appearing in the system. The users in the system are separated into two queues, ie, $q_{1}$ and $q_{2}$. The expected queue length of the xth grade user is $L_{qx}$. Therefore,
$$L_{qx}=L_{q1}+L_{q2},\tag{32}$$
where $L_{q1}=\lambda_{1}W_{q1}$ and $L_{q2}=\lambda_{2}W_{q2}$ represent the queue lengths of the first and second grade users. When C → ∞, the system behaves as a queuing system with full priority for the users of grade 1; for C → 0, the second grade users get full priority.
The expected waiting times in the queue for the first and second grade users associated with a stationary system model are expressed as follows:
$$W_{q1}=\frac{C^{2}\mu^{2}\beta L_{qx}+\lambda_{e}\lambda_{2}}{C\mu\left(C\mu\lambda_{2}-\lambda_{e}\lambda_{2}+C\mu\beta\lambda_{1}\right)}\tag{33}$$
$$W_{q2}=\frac{L_{qx}}{\lambda_{2}}-\frac{C^{2}\mu^{2}\beta\lambda_{1}L_{qx}+\lambda_{e}\lambda_{1}\lambda_{2}}{C\mu\lambda_{2}\left(C\mu\lambda_{2}-\lambda_{1}\lambda_{2}+C\mu\beta\lambda_{1}\right)},\tag{34}$$
where $\lambda_{e}=\lambda_{1}+\lambda_{2}$ and β is the service density of the second grade users that change queue.
Table 1 depicts the expected waiting times of the first and second graded users for j = 4. It is inferred from the table that the waiting time of first grade users under the nonpreemptive limited priority scheme is much less than that of second grade users.
This nonpreemptive limited priority queuing scheme avoids message collisions by adjusting the value of j when the expected waiting time of a user in the queue is too high. Thereby, the issue of the plain nonpreemptive priority queuing scheme, where there is a high probability of data collision due to heavy traffic loads at the network, is solved.
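For illustration, a nonpreemptive two-class priority queue with C servers can be simulated with SimPy as sketched below to compare the average waiting times of the two grades; this simplified sketch does not implement the limited-priority parameter j or the finite queue length N, and the rates used are placeholders.

```python
import random
import simpy

C, MU = 2, 30.0                      # servers and per-server service rate (jobs/hour)
LAMBDAS = {1: 20.0, 2: 10.0}         # arrival rates: grade 1 = high priority, grade 2 = low
waits = {1: [], 2: []}

def user(env, servers, grade):
    arrival = env.now
    # PriorityResource serves waiting requests in priority order without preemption.
    with servers.request(priority=grade) as req:
        yield req
        waits[grade].append(env.now - arrival)
        yield env.timeout(random.expovariate(MU))

def source(env, servers, grade):
    while True:
        yield env.timeout(random.expovariate(LAMBDAS[grade]))
        env.process(user(env, servers, grade))

env = simpy.Environment()
servers = simpy.PriorityResource(env, capacity=C)
for g in (1, 2):
    env.process(source(env, servers, g))
env.run(until=1000.0)                # simulate 1000 hours
for g in (1, 2):
    print(f"grade {g}: mean wait = {sum(waits[g]) / len(waits[g]):.5f} h")
```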
5 NUMERICAL RESULTS
We consider a 4000 m × 4000 m cellular network, where the D2Ds and CUs are uniformly distributed. Each cell is split into three sectors, with each node having a coverage pattern of 120°. The simulated system topology and clustering of nodes are shown in Figure 4. For experimental purposes, the proposed work is simulated with 20 D2D pairs and 10 CUs. Therefore, a total of 10 channels are allocated in the cellular network, and the transmission power of D2Ds on each channel is divided into five power levels: 20 mW, 40 mW, 60 mW, 80 mW, and 100 mW. The maximum transmit powers of the CU and D2D users are 24 dBm and 21 dBm, respectively. The SINR threshold required to define the state of a D2D user is 0.005. The discount factor is γ = 0.9 and the ratio of the average user arrival rates is λ1/λ2 = 2/1. The DQN consists of five layers. There are three hidden layers having 400, 200, and 100 neurons, respectively. The total number of episodes is 1000 and each episode has 4000 iterations. We use the rectified linear unit (ReLU) activation function in the hidden layers, defined as R(x) = max(0, x). We implement our Deep Q-learning with EKF in a TensorFlow simulation environment based upon Python, which runs on four GPUs of Nvidia architecture and an Intel CPU having 128 GB of memory.
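A hedged tf.keras sketch of the five-layer network described above (three state inputs, three ReLU hidden layers of 400, 200, and 100 neurons, and one Q-value per action) is given below; the 50-action output assumes the 10 channels × 5 power levels of this setup, and the optimizer choice is illustrative.

```python
import tensorflow as tf

N_ACTIONS = 10 * 5   # 10 shared channels x 5 D2D power levels (assumed action set)

def build_q_network():
    """Fully connected DQN: 3 state inputs -> 400 -> 200 -> 100 -> Q-values."""
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(3,)),          # St = (SINR, I, lambda)
        tf.keras.layers.Dense(400, activation="relu"),
        tf.keras.layers.Dense(200, activation="relu"),
        tf.keras.layers.Dense(100, activation="relu"),
        tf.keras.layers.Dense(N_ACTIONS),           # one Q-value per (channel, power) pair
    ])

q_net = build_q_network()
target_net = build_q_network()
target_net.set_weights(q_net.get_weights())          # initialize theta_tilde = theta
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)
```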
Figure 5 shows the convergence performance of Deep Q-learning with EKF for different mini-batch sizes obtained by updating the gradient. It is inferred from the figure that the convergence is faster when the mini-batch size is small. The mini-batch size in Deep Q-learning with EKF determines the number of experiences used during the transition to learn a policy.
FIGURE 4 System topology
FIGURE 5 Impact of mini-batch size on convergence in Deep Q-learning + EKF (episode vs. system throughput in Mbps)
FIGURE 6 Variation of throughput with respect to the learning rate (learning rate vs. average system throughput in Mbps)
Figure 6 illustrates the average system throughput with respect to the learning rate. We can infer from the figure that the throughput of the proposed Deep Q-learning with EKF and of Deep Q-learning increases with the learning rate and reaches its peak value at learning rates of 0.58 and 0.6, respectively. With a further increase in the learning rate, the throughput performance decreases sharply. In the case of Q-learning, the throughput rises very slowly, attains its maximum at a learning rate of around 0.75, and then decreases sharply. The throughput performance of Deep Q-learning with EKF is better than that of Deep Q-learning, as it learns the best optimal policy by incorporating the state transition probability uncertainty and the weight uncertainty. Therefore, it selects the best possible actions and receives the maximum
throughput over a long run-time. Furthermore, when the learning rate becomes very high, the learner is unable to exploit all the actions, and a higher learning rate leads to a local optimum, thereby degrading the performance of the system.
Figures 7 and 8 depict the average system throughput for the whole system and the average D2D throughput with respect to the number of D2Ds in the system. We can infer from both figures that, with the increase in the number of D2Ds, the average throughput in the case of the proposed Deep Q-learning with EKF outperforms Deep Q-learning and the centralized algorithms, ie, those without a learning mechanism.7,8 This is because the DQN in coordination with the EKF processes the uncertainty in the state transition probability batch-wise while taking the weight uncertainty into account. Thereby, it maximizes the Q-value as well as the throughput under an optimal policy, which is not the case for the standard Deep Q-learning and Q-learning methods. Similarly, the traditional optimization methods fail to provide the best possible results due to the lack of complete traffic information a priori. It can be seen from Figure 7 that the overall system throughput of our proposed Deep Q-learning with EKF is 218.04 Mbps when the number of D2Ds is 8. When the number of D2Ds increases to 16, it provides an appreciable throughput of 295.2 Mbps.
Figure 9 illustrates the variation of the system throughput with respect to the state transition probability. In both schemes, the throughput of the system increases with the transition probability, but our proposed Deep Q-learning with EKF provides a more appreciable throughput as compared to standard Deep Q-learning. This is because Deep Q-learning only considers the state transition probability uncertainty for the weight update and ignores the weight uncertainty in the network. Therefore, the policy it obtains is not an optimal one. As a result, there is a degradation in the throughput performance of Deep Q-learning.
FIGURE 7 Throughput variation over the number of device-to-device (D2D) users for the whole system (number of D2D pairs vs. average system throughput in Mbps)
FIGURE 8 Average device-to-device (D2D) throughput variation over the number of D2D users (number of D2D pairs vs. average D2D throughput in Mbps)
FIGURE 9 Variation of system throughput with respect to the transition probability (transition probability vs. system throughput in Mbps)
FIGURE 10 Error estimation for a learning rate of 0.52 (number of episodes vs. estimated loss or error)
Figure 10 shows the error measured for the different Q-learning algorithms at a fixed learning rate of 0.52. We can observe that our proposed Deep Q-learning with EKF achieves the minimum error in the system. This is because, in the case
of Deep Q-learning, we do not consider the uncertainty that lies in the state transition probability and the weights used for the Q-function. Thus, the probability of selecting a wrong action increases, and thereby the estimated error, which is the difference between the target Q-value and the estimated Q-value, increases. It can also be observed from the figure that, in the case of Deep Q-learning, the error curve fluctuates, whereas, in the case of Deep Q-learning with EKF, it is fairly smooth. This is because Deep Q-learning with EKF considers the uncertainty in the weights of the Q-function. Furthermore, the EKF also regularizes the Q-function by collecting the uncertainty information from the state transition set and then passing it into the uncertain weights of the Q-function.
Figure 11 shows the variation of the Q-value over the number of episodes. We can see that the proposed Deep Q-learning with EKF outperforms the Deep Q-learning algorithm in both cases, ie, case 1 and case 2, with mini-batch sizes of 5 and 10, respectively. This is because, in our proposed algorithm, the Q-value is calculated using Equation (21), where the EKF uses the Taylor series expansion for the linearization of the Q-function. Deep Q-learning with EKF learns from the uncertainty associated with the state transition set and the weights of the Q-function. Thus, it converges to a robust policy that allows scheduling of actions with higher accuracy. Therefore, we achieve a higher Q-value at every time step.
The quantitative analysis of our simulation results is shown in Table 2. We compare the system throughput (when the number of D2Ds is K = 10) and the error performance of our proposed algorithm, ie, Deep Q-learning with EKF (which considers uncertainty in the state transition and the weights of the Q-function), and of Deep Q-learning (which considers only the state transition uncertainty) against other learning schemes (Deep Q-learning16,17,26 and Q-learning10,12) and without-learning schemes (IPO7 and GRA8). It can be observed that our proposed scheme achieves decent performance in terms of system throughput and error.
FIGURE 11 Variation of Q-function vs. the number of episodes (number of episodes vs. average Q-value)
TABLE 2 System-throughput and error performance of different learning and without-learning algorithms

Performance                                   Deep Q with EKF (proposed)   Deep Q-learning (only state uncertainty)   Deep Q-learning16,17,26   Q-learning10,12   IPO7    GRA8
System-throughput (Mbps) (D2D number = 10)    235.28                       202.1                                      81.2                      110.2             120.4   174.59
Loss (or) error                               0.0123                       0.0253                                     0.0526                    0.0720            NA      NA

Abbreviations: D2D, device-to-device; EKF, extended Kalman filter.
6 CONCLUSION
In this work, we use a Deep Q-learning scheme combined with EKF for resource allocation in a D2D enabled cellular network. The DQN combined with EKF incorporates the weight uncertainty of the Q-function as well as the state uncertainty during a transition. We also show that the updated weights from the EKF-based DQN minimize the controlled version of the loss function, which brings uniformity to the network. The novel cell sectoring concept and the queuing model with nonpreemptive limited priority help in reducing the cochannel cell interference and traffic congestion. Furthermore, the proposed queuing model solves the issue of unnecessary waiting of users in priority-based transmission through proper selection of the constraint parameter j. Thus, this novel approach can be applied to other real-time applications where uncertainty is associated with the model.
ORCID
Pratap Khuntia https://orcid.org/0000-0003-1373-9907
REFERENCES
1. Fodor G, Dahlman E, Mildh G, et al. Design aspects of network assisted device-to-device communications. IEEE Commun Mag.
2012;50(3):170-177.
2. Zhang Y, Yang Y, Dai L. Energy efficiency maximization for Device-to-Device communication underlaying cellular networks on multiple
bands. IEEE Access. 2016;4:7682-7691.
3. Janis P, Koviunen V, Ribeiro C, Korhonen J, Doppler K, Hugl K. Interference-aware resource allocation for device-to-device radio
underlaying cellular networks. Paper presented at: VTC Spring 2009 - IEEE 69th Vehicular Technology Conference;2009; Barcelona, Spain.
4. Gupta S, Kumar S, Zhang R, Kalyani S, Giridhar K, Hazo L. Resource allocation for D2D links in the FFR and SFR aided cellular network.
IEEE Trans on Comm. 2016;64(10):4434-4448.
5. Baozhou Y, Qi Z. A QoS-based channel allocation and power control algorithm for device-to-device communication underlaying cellular
networks. J Commun. 2016;11(7):624-631.
6. Hamdi M, Yuan D, Zaied M. GA-based scheme for fair joint channel allocation and power control for underlaying D2D multicast com-
munications. Paper presented at: 2017 13th International Wireless Communications and Mobile Computing Conference (IWCMC); 2017;
Valencia, Spain.
7. Wang L, Wu H. Fast pairing of device-to-device link underlay for spectrum sharing with cellular users. IEEE Commun Lett.
2014;18(10):1803-1806.
8. Wang F, Xu C, Song L, Han Z. Energy-efficient resource allocation for device-to-device underlay communication. IEEE Trans Wirel
Commun. 2015;14(4):2082-2092.
9. Maghsudi S, Stańczak S. Hybrid centralized-distributed resource allocation for device-to-device communication underlaying cellular networks. IEEE Trans Veh Technol. 2016;65(4):2481-2495.
10. Nie S, Fan Z, Zhao M, Gu X, Zhang L. Q-learning based power control algorithm for d2d communication. Paper presented at: 2016 IEEE
27th Annual International Symposium on Personal, Indoor, and Mobile Radio Communications (PIMRC); 2016; Valencia, Spain.
11. Asheralieva A, Miyanaga Y. An autonomous learning-based algorithm for joint channel and power level selection by D2D pairs in
heterogeneous cellular networks. IEEE Trans Commun. 2016;64(9):3994-4012.
12. Khan MI, Alam MM, Le Moullec Y. A multi-armed bandit solver method for adaptive power allocation in device-to-device communication.
Paper presented at: International Workshop on Recent Advances in Cellular Technologies and 5G for IoT; 2018; Leuven, Belgium.
13. Mnih V, Kavukcuoglu K, Silver D, et al. Human level control through deep reinforcement learning. Nature. 2015;518(7540):529-533.
14. Mao H, Alizadeh M, Menache I, Kandula S. Resource management with deep reinforcement learning. In: Proceedings of the 15th ACM
Workshop on Hot Topics in Networks; 2016; Atlanta, GA.
15. Van Le D, Tham CK. A deep reinforcementlearning based offloading scheme in ad-hoc mobile clouds. Paper presented at: IEEE INFOCOM
2018 - IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS); 2018; Honolulu, HI.
16. Lee W, Kim M, Cho DH. Deep learning based transmit power control in underlaid device-to-device communication. IEEE Syst J. 2018.
Early access.
17. Naparstek O, Cohen K. Deep multi-user reinforcement learning for dynamic spectrum access in multi-channel wireless networks. arXiv preprint arXiv:1704.02613. 2018.
18. Kim J, Park J, Noh J, Cho S. Completely distributed power allocation using deep neural network for device to device communication
underlaying LTE. arXiv preprint arXiv:1802.02736. 2018.
19. Meng F, Chen P, Wu L, Cheng J. Power allocation in multi-user cellular networks: Deep reinforcement learning approaches. arXiv preprint arXiv:1901.07159v1. 2019.
20. Huang D, Gao Y, Li Y, et al. Deep learning based cooperative resource allocation in 5G wireless networks. Mob Netw Appl. 2018. https://doi.org/10.1007/s11036-018-1178-9
21. Naparstek O, Cohen K. Deep multi-user reinforcement learning for dynamic spectrum access in multichannel wireless networks. Paper
presented at: GLOBECOM 2017 - 2017 IEEE Global Communications Conference; 2017; Singapore.
22. Zhao N, Liang YC, Niyato D, Pei Y, Jiang Y. Deep reinforcement learning for user association and resource allocation in heterogeneous
networks. Paper presented at: 2018 IEEE Global Communications Conference (GLOBECOM); 2018; Abu Dhabi, UAE.
23. Li J, Gao H, Lv T, Lu Y. Deep reinforcement learning based computation offloading and resource allocation for MEC. Paper presented at:
2018 IEEE Wireless Communications and Networking Conference (WCNC); 2018; Barcelona, Spain.
24. Li X, Fang J, Cheng W, Duan H, Chen Z, Li H. Intelligent power control for spectrum sharing in cognitive radios: a deep reinforcement
learning approach. IEEE Access. 2018;6:25463-25473.
25. Yu Y, Wang T, Liew SC. Deep-reinforcement learning multiple access for heterogeneous wireless networks. Paper presented at: IEEE
International Conference on Communications (ICC); 2018; Kansas City, MO.
26. Chen X, Li Z, Zhang Y, et al. Reinforcement learning based QoS/QoE-aware service function chaining in software-driven 5G slices. arXiv preprint arXiv:1804.02099. 2018.
27. Weng C, Yu D, Watanabe S, Juang BHF. Recurrent deep neural networks for robust speech recognition. Paper presented at: 2014 IEEE
International Conference on Acoustics, Speech and Signal Processing (ICASSP); 2014; Florence, Italy.
28. Ye H, Li GY. Deep reinforcement learning for resource allocation in V2V communications. IEEE Trans Veh Technol. 2019. Early access.
How to cite this article: Khuntia P, Hazra R. An efficient Deep reinforcement learning with extended Kalman filter for device-to-device communication underlaying cellular network. Trans Emerging Tel Tech. 2019;30:e3671. https://doi.org/10.1002/ett.3671