Received: 1 January 2019 Revised: 8 April 2019 Accepted: 22 May 2019
DOI: 10.1002/ett.3671
SPECIAL ISSUE ARTICLE
An efficient Deep reinforcement learning with extended
Kalman filter for device-to-device communication
underlaying cellular network
Pratap Khuntia Ranjay Hazra
Electronics and Instrumentation
Engineering Department, National
Institute of Technology, Silchar, India
Correspondence
Pratap Khuntia, Electronics and
Instrumentation Engineering
Department, National Institute of
Technology, Silchar-788010, India.
Email: pratap.khuntia7@gmail.com
Abstract
In this paper, a novel Deep Q-learning scheme combined with an extended Kalman filter (EKF) is proposed to solve the channel and power allocation problem for a device-to-device enabled cellular network when the prior traffic information is not known to the base station. Furthermore, this work explores an optimal policy for resource and power allocation between users with the aim of maximizing the sum-rate of the overall system. The proposed work comprises four phases, ie, cell splitting, clustering, a queuing model, and joint channel and power allocation. The implementation of cell splitting with the novel K-means++ clustering technique increases the network coverage, reduces co-channel cell interference, and minimizes the transmission power of nodes, whereas the M/M/C:N queuing model solves the issue of waiting time for users in priority-based data transmission. The difficulty with the Q-learning and Deep Q-learning environment is achieving an optimal policy. This is because of the uncertainty of various parameters associated with the system, especially when the state space is extremely large. In order to improve the robustness of the learner, EKF together with the Deep Q-network is presented in this paper, which incorporates the weight uncertainty of the Q-function as well as the state uncertainty during the transition. Furthermore, the use of the EKF provides an improved version of the loss function that helps the learner achieve an optimal policy. Through numerical simulation, the advantage of our resource sharing scheme over other existing schemes is also verified.
1 INTRODUCTION
The rapid rise in the growth of data traffic is a big challenge for next generation cellular networks. In the near future,
billions of connected devices will be in the global IP network. To fulfill the demand of mobile users and for efficient
management of the scarce spectrum, device-to-device (D2D) communication is a promising technology for the future
integrated next generation cellular networks.
D2D communication allows direct communication between two users in close proximity, without the involvement of
the base station (BS). In an underlaying cellular network, D2D users (D2Ds) reuse radio resources allocated to cellular
users (CUs). The reuse of CU resources by D2D users causes mutual interference.1,2 Therefore, the selection of a suitable resource and power allocation scheme plays a vital role. In order to
reduce interference, a suitable amount of transmission power must be chosen for each CU and D2D. Thus, as a central
entity, BS determines the transmission power of each user and interference level using various scheduling algorithms.3
However, the traditional methods of resource allocation do not provide a preferable optimal outcome if the complete traffic information is not known to the BS a priori. There are various conventional D2D schemes that aim at maximizing the network throughput, such as graph-based methods, fractional frequency reuse,4 the Lagrange multiplier,5 and optimization algorithms, eg, particle swarm optimization and the genetic algorithm (GA).6 A low complexity algorithm called inverse popularity pairing order is presented in the work of Wang and Wu,7 which gives the solution of power allocation and the pairing of the best CU and D2D pair to operate in the same channel. A greedy-based resource allocation scheme is proposed in the work of Wang et al,8 where the D2Ds reuse the resources of a fixed CU taking the interference constraint into account. In the work of Hamdi et al,6 the authors presented a joint channel allocation and power control algorithm for multicast D2D underlay communication. The nonlinear Perron-Frobenius approach is proposed for solving the nonlinear nonconvex fair power control problem, and GA is then applied to solve the integral issue of the joint channel allocation and power control problem; however, GA is more time consuming than other optimization algorithms. Moreover, all of the aforementioned centralized traditional methods suffer from excess transmission overhead to obtain the complete channel state information of the network.
Reinforcement learning (RL) is a machine learning approach that has recently gained popularity in wireless communication, but there is limited work that uses the concept of RL for resource sharing in D2D communication. Maghsudi et al9 proposed a centralized resource and power allocation problem for D2D communication and presented the problem of power control as a learning game; they also applied a Q-learning better-reply dynamics algorithm to learn the utility function and to acquire equilibrium in power allocation. In the work of Nie et al,10 the authors discussed a distributed Q-learning technique for D2D communication that allows the D2D users to explore the transmit power in an independent manner, thereby enhancing the system capacity while maintaining a good quality of service (QOS) for CUs. Asheralieva et al,11 in their work, discussed a joint channel and power allocation problem for D2D pairs operating autonomously in a heterogeneous cellular network. The aim is to maximize the rewards, defined as the difference between the average throughput and the cost of power consumption, while taking SINR constraints into account. They applied a multiagent Q-learning algorithm, where each D2D pair acts as a learning agent whose job is to collect the locally observed information, thereby helping the agent select the best strategy. Khan et al12 employed an online learning algorithm for power allocation, ie, a multiarmed bandit solver, for a D2D enabled cellular network under cross-tier and co-tier interference constraints. However, the basic Q-learning scheme is unstructured in the state space, and the reward achieved by the learner is not optimal. The learner in the Q-learning environment follows a Markov decision process (MDP) that incorporates uncertainty in the state transition probability of the system.
Deep RL is a standard class of RL where a Deep neural network (DNN) is trained to learn either the value function or the optimal policy. Deep learning is a major breakthrough in machine learning, especially in developing games, autonomous driving systems, health care, robotic applications, and many more. Mnih et al13 developed an artificial Deep Q-network (DQN) that successfully learns policies directly from an excessively high-dimensional state space and deployed the DQN on demanding Atari 2600 games. In the work of Mao et al,14 the authors applied the Deep RL technique to machine intelligence, where the proposed system directly learns the resource management between various electronic devices by accumulating the experiences received within a cluster. They used a Deep RL approach to sort out the problem of task packing with multiple resource demands that aims at minimizing the completion time of a task. Le et al15 proposed a Deep RL-based offloading mechanism, where a user learns through online trial and error to offload his computational job to other mobile devices with the help of D2D communication in an underlay cellular network. In the work of Lee et al,16 the authors discussed a transmit power control strategy, where the D2D user autonomously learns to control its transmit power by using a DNN, thereby improving the average sum-rate of D2Ds while restricting the interference from CUs. In the work of Naparstek and Cohen,17 the authors proposed a spectrum assignment process, where multiple users access multiple channels. At the beginning of a time slot, a user transmits the data packet on a selected channel based on a certain transmission probability. The user then checks the successful transmission of data through an acknowledgement signal. To handle this, they used a distributed dynamic Deep Q-learning network. Kim et al18 proposed a distributed architecture for power allocation based on Deep learning, where the D2D user has the ability to determine the transmit power based on its position while also considering the interference toward the cellular network, ie, the BS. To control the interference toward the BS, they formulated a Deep Q-learning cost function using the Lagrangian method. In the work of Meng et al,19 the authors addressed a Deep RL approach for allocating power to the devices in a multicellular network. The approach involves a mathematical analysis of Deep RL, considering online/offline centralized training,
intercell coordination, and execution in a distributed fashion. Huang et al20 discussed a cooperative resource allocation based on Deep learning for D2D enabled 5G networks. The DQN efficiently utilizes the small scale channel information from the dynamic environment to optimize the resources. A Deep multi-user RL method for dynamic spectrum access in a multichannel wireless network is depicted in the work of Naparstek and Cohen.21 In this work, they considered a long short-term memory layer in Deep Q-learning to estimate the precise state information from partial observations. In the work of Huang et al,22 the authors used the Deep Q-learning algorithm for resource allocation in a heterogeneous cellular network. Here, the DQN processes the QOS state of the users, eg, SINR, and takes an action at each time slot. The user gets utility in the form of an immediate reward if the QOS is satisfied. To jointly tackle offloading and resource allocation for a multiuser mobile computing system, Li et al23 exploited Deep Q-learning to minimize the sum cost for delay and energy consumption of user equipment. In the work of Li et al,24 the authors approached a power control strategy for resource sharing in cognitive radios. The primary user regulates its transmit power based on a predefined power control strategy, whereas the secondary user adopts a Deep RL method and develops an intelligent power control policy by interacting with the primary users. Yu et al25 exploited the Deep RL concept to sort out the challenges in wireless multiple access control and addressed the benefit of Deep RL in terms of fast convergence to optimal outcomes and robustness against nonoptimal hyperparameter settings. The work of Chen et al26 addressed a Deep Q-learning method to tackle QOS aware service function chaining in a network function virtualization enabled 5G system. The QOS parameters are throughput, delay, and system bandwidth. Weng et al27 proposed a DNN for speech recognition that uses a back propagation algorithm to make mini-batch gradient descent on the neural network more effective. Ye et al28 analyzed the resource allocation problem for vehicle-to-vehicle (V2V) communication using Deep RL, where the optimal frequency band and transmission power required for setting up communication are determined by each V2V pair itself. The problem with the earlier discussed DQNs is the way they are trained. The Q-network takes a sample of state, action, and reward, and estimates the sample of the next state. However, these samples are correlated, which leads to inefficient learning. The current Q-network parameters decide the policies and estimate the next samples for training the network, but this leads to instability in Deep Q-learning due to bad feedback. Summarizing the aforementioned work on Deep Q-learning, we conclude that the prior work does not consider the uncertainty issue in the state transition and in the weights of the Q-function, which makes it difficult to achieve an optimal policy when the state space of the environment is very large.
Therefore, we propose a novel approach to resource allocation, where the channel assignment and power allocation for each D2D pair are based upon an RL prototype called Deep Q-learning with extended Kalman filter (EKF), an online evaluator of the optimal policy. The proposed approach alleviates the uncertainty in the state transition probability and the weight uncertainty that originates from the DQN. The whole work is divided into four phases. They are as follows.
- Splitting of each cell into three sectors with a 120° coverage pattern. This increases network coverage and minimizes cochannel cell interference.
- Use of a smart and reliable clustering algorithm called K-means++ that reduces the transmission power of nodes. All nodes communicate with their corresponding cluster head (CH) rather than the eNodeB, which is very far away from the nodes. The adoption of such a smart initialization algorithm outperforms the standard K-means algorithm in terms of faster running time and accuracy.
- An M/M/C:N queuing model with a nonpreemptive limited priority scheme that helps in resolving the heavy traffic load issue for priority-based data transmission.
- Deep Q-learning with EKF, used for channel allocation and power allocation simultaneously, which also considers the uncertainty in state transition and the weight uncertainty associated with Q-learning and Deep Q-learning, respectively. We propose to utilize an EKF that processes the uncertain state transition probability of the system and connects this information with the weight uncertainty associated with the DQN, thereby enhancing the robustness of the learner to provide an optimal policy. With the help of the EKF in the DQN, we provide a modified version of the Q-function, which is differentiable and allows us to apply a Taylor series approximation to linearize the nonlinear Q-function. Furthermore, an advanced version of the loss function is derived, which tends to provide optimal weights for uniformity of the network.
The rest of this paper is arranged as follows. The system model is described in Section 2. In Section 3, the problem
formulation for the proposed resource sharing strategy using Deep Q-learning with EKF is discussed. The queuing model
is presented in Section 4 to resolve the traffic requests of different types of users in the network. The simulation results
are verified in Section 5, whereas Section 6 concludes the paper.
2 SYSTEM MODEL
This paper presents a joint resource and power allocation optimization scheme whose objective is to maximize the
throughput of the overall network. In this work, the transmit power of both CUs and D2Ds is optimized by satisfying the
minimum signal-to-interference-plus-noise ratio (SINR) requirement. We consider a sectored cellular network, where each cell is split into three sectors of 120° each, based on the node position (D2D or CU), the angular position of the node, and the radius, ie, the transmission range. In order to manage all the users inside a sector efficiently, we frame the clustering of users using the K-means++ clustering algorithm. The communication between users and the eNodeB in each sector takes place via the specific CH associated with each type of user. Such a clustering mechanism reduces traffic overload at the eNodeB, and each device utilizes less transmit power to communicate with the nearest CH. The traffic offloading process from the eNodeB to the CH is performed intelligently by using the K-means++ clustering algorithm followed by a TDMA scheme. The purpose is to group the users into clusters and allow each device to associate with the CH whose distance from the eNodeB is minimum and whose channel quality toward the eNodeB is the highest. Following this, the management of traffic requests over different discrete time slots is determined using a TDMA scheme. It means that the devices within a cluster use a unique frequency channel and a fixed time slot of each CH for communication. Here, eg, let us assume that x = 1, 2, ..., X frequency channels of CUs are available within the cluster and the nth D2D uses the xth frequency channel of a CU for communicating within an allotted time slot (ti) of each CH. We consider that the CHs operate at unique time slots and the D2Ds use different frequency channels for transmitting data efficiently. As the devices from each cluster use a different time slot and a unique frequency, interference between intersector and intrasector devices is avoided. We thus use the Deep Q-learning with EKF technique for traffic offloading and resource allocation. The role of this intelligent agent is to observe the network states, ie, the location of users, the SINR information of each user, the traffic request, and the interference condition, and then map them to an optimal action through learning experience in every current time slot. The coordination between CH and eNodeB helps in controlling all the users properly. The iterative K-means++ clustering process is as follows. Firstly, the clustering process begins in each sector by randomly selecting the first cluster center from the total available nodes. This is followed by spreading out the centers through an iterative process, which means that, at each iteration, one new center is chosen. The K-means++ procedure is illustrated by the sketch below.
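The algorithm listing from the original layout is not reproduced here; instead, a minimal Python sketch of the K-means++ seeding and clustering step described above is given, assuming the nodes of a sector are available as 2-D coordinates (the function and variable names are illustrative, not taken from the paper).

```python
import numpy as np

def kmeans_pp_init(nodes, k, rng=np.random.default_rng(0)):
    """K-means++ seeding: pick the first center uniformly at random, then draw
    each new center with probability proportional to the squared distance
    from the nearest center already chosen."""
    centers = [nodes[rng.integers(len(nodes))]]
    for _ in range(k - 1):
        d2 = np.min([np.sum((nodes - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(nodes[rng.choice(len(nodes), p=d2 / d2.sum())])
    return np.array(centers)

def kmeans(nodes, k, iters=50):
    """Standard Lloyd iterations on top of the K-means++ seeds; the resulting
    cluster centers play the role of cluster heads (CHs)."""
    centers = kmeans_pp_init(nodes, k)
    for _ in range(iters):
        labels = np.argmin(((nodes[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        centers = np.array([nodes[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return centers, labels

# Example: cluster 60 user positions inside one sector into 4 clusters.
nodes = np.random.default_rng(1).uniform(0, 2000, size=(60, 2))
ch_positions, membership = kmeans(nodes, k=4)
```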
Overall, the cluster formation based upon K-means++ algorithm is faster in run time and improves the quality of the
local optimal solution. Deployment of cell splitting along with the clustering technique increases the network coverage,
enhances overall system capacity, and minimizes the transmission power of nodes due to the fact that CH is now closer
to the nodes in the cluster as compared to the eNodeB. The system model of a sectored cellular network is depicted in
Figure 1.
We assume an underlay D2D communication system, where the devices directly communicate with each other using
the uplink (UL) resources of CUs. In this case, the traffic request is verified at BS followed by implementation of certain
algorithms to allocate a channel and transmit power among the users, but this traditional method of resource allocation
does not work if the complete traffic information is not known to the BS a priori. To tackle this issue, we adopt Deep Q-learning with EKF in the subsequent stage, where a steady interaction between the DQN and the environment results in an optimal policy. Thus, learning through such an interactive process is used to complete the channel assignment and power allocation for D2Ds. In the final stage, to limit heavy traffic situations during channel assignment, we use an M/M/C:N queuing system with a nonpreemptive limited priority model to deal with the traffic requests coming from different types of users.
We consider a sectored multicell scenario, which consists of two types of users, ie, CUs and D2Ds. There are a total of M CUs indexed by C1, C2, ..., Cm, ..., CM, and N D2D pairs denoted by D1, D2, ..., Dn, ..., DN
FIGURE 1 System model of a sectored cellular network
that are distributed randomly within s cell sectors, where m ∈ [1, 2, ..., M] and n ∈ [1, 2, ..., N]. Each sector is deployed by a BS, and these BSs have logical links with the CH of each cluster. The management of traffic requests and the availability of a node are determined using a TDMA scheme, which means that devices from each cluster are allocated a different time slot and frequency for communicating with their respective CH. Each CH uses a distinct time slot to avoid time conflicts between two neighboring CHs. Thus, we use a Deep Q-learning algorithm that decides whether a time slot will be allotted to a CH for relaying the data of the devices associated with that CH. Using Deep learning, each CH learns from its previous learning experience, thus relaying data through the CH on the decided time slot as per the information recorded in the past slots. The estimated Q-value obtained for each CH is different for each time slot. Thus, the availability of a time slot for a CH at any time is determined by comparing their Q-values. We assume that the total number of channels is less than or equal to the number of CUs. Let X UL channels be allocated to M CUs. As there is no free channel available for D2Ds, the D2D pairs communicate using the channels of the CUs on a sharing basis. Each D2D transmitter (D2DTx) selects an appropriate channel x from the available channel set, ie, x ∈ [1, 2, ..., X]. As a result, this leads to interference between D2Ds and CUs within a sector communicating via the same channel. However, the interference between cell sectors is negligible, as they use their own frequency sets and time slots. In order to avoid interference from D2Ds on the cellular link, QOS constraints are chosen for the CUs. Thus, for each CU, an interference boundary is set. This is determined using the SINR margin ($\eta_B$) at the BS for each CU and the UL transmit power ($P_m$) of the CU. Mathematically,
the overall interfering transmit power limit from each D2D can be expressed as follows:
$$\sum_{s=1}^{S}\sum_{n=1}^{N_S}\sum_{x=1}^{X}\left(g^{x}_{n,B}\,q^{x}_{n}+\sigma^{2}\right)\leq\frac{P_{m}\,g^{x}_{m,B}}{\eta_{B}},\qquad x\in C,\tag{1}$$
where $q^{x}_{n}$ represents the transmit power of the nth D2D in sector s over the xth channel. The channel gain at the BS from the D2D transmitter and the channel gain of the CU-BS link are represented as $g^{x}_{n,B}$ and $g^{x}_{m,B}$, respectively, and $\sigma^{2}$ represents the noise power on the xth channel. Let $g^{x}_{n,B}\,q^{x}_{n}+\sigma^{2}=I^{x}_{n}$ denote the interfering transmit power from each D2D user over the xth channel.
The devices in the cellular system use the SINR to measure the quality of the channel. Let
$$\xi^{C}_{m}=\frac{P_{m}\,g^{x}_{m,B}}{q^{x}_{n}\,t_{n}\,g^{x}_{n,B}+\sigma^{2}}\quad\text{and}\quad\xi^{D}_{n}=\frac{q^{x}_{n}\,g^{x}_{n,n}}{P_{m}\,t_{m}\,g^{x}_{m,n}+\sigma^{2}}$$
denote the SINR associated with the CUs and D2Ds over the xth channel. Here, $g^{x}_{n,n}$ represents the channel gain of the nth D2D pair, whereas the interference channel gain of the CU-D2D link is denoted as $g^{x}_{m,n}$. Furthermore, $t_{m}\ (0\leq t_{m}\leq 1)$ and $t_{n}\ (0\leq t_{n}\leq 1)$ are the normalized traffic rates of the CU and D2D node, respectively.
Therefore, the achievable data rate for the whole system, ie, the sum of the data rate of CUs and D2Ds communicating
over the same xth frequency band, is as follows:
$$R=R^{C}_{m}+R^{D}_{n},\tag{2}$$
where
$$R^{C}_{m}=\sum_{s=1}^{S}\sum_{m=1}^{M_S}\sum_{x=1}^{X}B_{RB}\log\left(1+\xi^{C}_{m}\right)\tag{3}$$
and
$$R^{D}_{n}=\sum_{s=1}^{S}\sum_{n=1}^{N_S}\sum_{x=1}^{X}\psi_{m,n}\log\left(1+\xi^{D}_{n}\right).\tag{4}$$
Here, $M_S$ and $N_S$ represent the number of CUs and D2Ds in sector s, $B_{RB}$ denotes the bandwidth corresponding to each channel, and $\psi_{m,n}$ is a binary variable. When $\psi_{m,n}=1$, the mth CU shares its channel with the nth D2D; otherwise, it is zero.
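As an illustration of (2) to (4), the following sketch evaluates the sum-rate for a given channel-sharing indicator; the channel gains, transmit powers, and bandwidth value are placeholder assumptions, and base-2 logarithms are used so that rates come out in bit/s.

```python
import numpy as np

def sum_rate(p_cu, q_d2d, g_cu_bs, g_d2d_bs, g_d2d_d2d, g_cu_d2d,
             psi, t_cu, t_d2d, noise=1e-13, b_rb=180e3):
    """Sum of CU and D2D rates on shared channels, following (2)-(4).
    psi[m, n] = 1 if CU m shares its channel with D2D pair n."""
    rate = 0.0
    for m in range(len(p_cu)):
        for n in range(len(q_d2d)):
            if psi[m, n] != 1:
                continue
            # SINR of CU m at the BS, interfered by D2D n (as defined above).
            sinr_cu = p_cu[m] * g_cu_bs[m] / (q_d2d[n] * t_d2d[n] * g_d2d_bs[n] + noise)
            # SINR of D2D pair n, interfered by CU m.
            sinr_d2d = q_d2d[n] * g_d2d_d2d[n] / (p_cu[m] * t_cu[m] * g_cu_d2d[m, n] + noise)
            rate += b_rb * np.log2(1 + sinr_cu) + np.log2(1 + sinr_d2d)
    return rate
```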
The optimization problem is to maximize the overall system throughput R over time for certain QOS constraints of CUs
and D2Ds that can be expressed as follows:
$$\max\ \sum_{s=1}^{S}\sum_{m=1}^{M_S}\sum_{x=1}^{X}B_{RB}\log\left(1+\xi^{C}_{m}\right)+\sum_{s=1}^{S}\sum_{n=1}^{N_S}\sum_{x=1}^{X}\psi_{m,n}\log\left(1+\xi^{D}_{n}\right)\tag{5}$$
subject to
$$\sum_{s=1}^{S}\sum_{n=1}^{N}\sum_{x=1}^{X}I^{x}_{n}\leq\frac{P_{m}\,g^{x}_{m,B}}{\eta_{B}}\tag{5a}$$
$$q^{x}_{n}\leq P_{\max},\qquad x\in C,\ n\in D\tag{5b}$$
$$\sum_{m=1}^{M}\sum_{n=1}^{N}\psi_{m,n}=1.\tag{5c}$$
The aim of our work is to maximize the objective function under a certain interference threshold constraint set at the BS and the maximum allowable transmit power for D2Ds. The interference threshold, ie, $P_{m}\,g^{x}_{m,B}/\eta_{B}$, is evaluated for each CU to avoid interference between the D2D and CU links. The next step is to manage the resource allocation in an optimal manner such that the throughput achieved by the system becomes maximum. This is achieved by the network itself through an appropriate learning mechanism.
3 DEEP Q-LEARNING WITH EKF FOR RESOURCE ALLOCATION
Reinforcement learning is a process of learning where a situation in an environment state is mapped to an action so that the overall reward becomes maximum. In this section, a powerful learning algorithm called Deep Q-learning combined with EKF is employed to resolve the channel and power allocation issue. Before this, we first discuss the various difficulties associated with the existing schemes and then address these issues in our work. Figure 2 shows the schematic representation of Deep Q-learning with EKF. The system follows a two-stage process, ie, learning and resource assignment. At each time step, the agent receives the current state of the nodes, ie, the positions of D2D, CU, and CH, and the distance between the nodes and the eNodeB. This state information is then fed into the Deep Q-learning + EKF algorithm. The internal architecture of this block is shown in Figure 3, which basically uses a DNN and an EKF. The agent then selects an action, ie, a frequency channel and transmit power, under a policy. Following this, the state of the nodes transits to a new state. The interaction between agent and environment takes place for one episode, ie, Γ = 15 transmission time intervals (TTI), where one TTI = 1 ms. The state transition is based on an uncertain state transition probability set. The agent stores these transitions in a replay memory. The estimated error, which is the difference between the observed Q-function stored at the output of the DNN and the target Q-value computed from the sample of a random mini-batch, is back propagated to learn the existing uncertainty in the weights of the Q-function. This error signal is then used to estimate the gain of the Kalman filter. In this approach, the Kalman filter circulates the detailed information from the target Q-values, which is estimated from the uncertain state transition set and passed on to the set of uncertain weights. This process goes on for all the episodes to achieve a robust policy that assists in assigning channels and power to the users fairly, thereby maximizing the sum-rate of the network.
FIGURE 2 Schematic diagram representation of Deep Q-learning with extended Kalman filter (EKF)
FIGURE 3 Block diagram representation of Deep Q-learning with extended Kalman filter (EKF)
3.1 Q-learning
The various components of the learning model are as follows. The state (St) of the environment represents the location of the CU and D2D, the absolute angle with respect to the eNodeB, the distance between the node and the eNodeB, the channel condition defined by the SINR, the interference level ($I^{x}_{n}$), and the user traffic arrival rate (λ). As the user's location is fixed in each simulation instant, St can be reduced to St = (SINR, $I^{x}_{n}$, λ). These parameters are used for allocating radio resources. The traffic requests from different types of users and their priorities are based on the M/M/C:N queuing model with nonpreemptive limited priority that is discussed in the subsequent section. The set of actions (Ac) at discrete time instant t that the agent may choose consists of a set of channels and power levels assigned to each user associated with a CH in sector $S_i$, so $Ac_t = (x(t), P_x(t))$. The parameter x(t) denotes the selected channel, ie, $x\in\{0,1,\ldots,X\}$, where $X=W/B$, W represents the system bandwidth, and B the bandwidth of each channel. The transmit power $P_x(t)$ is the power level associated with each channel, varies between 0 and $P_{\max}$, and is a very important parameter for interference control. The scalar reward function (R) is determined using the current state St and action Ac, and generally denotes the average system capacity of the network. The last component is the policy π, which is used to maximize the total expected reward over a long run time. A small sketch of this state and action representation follows.
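To make the state and action definitions concrete, the sketch below enumerates the discrete (channel, power level) action set and packs a state vector St = (SINR, I, λ); the channel and power-level counts follow the simulation setup in Section 5, and the helper names are ours.

```python
from itertools import product
import numpy as np

N_CHANNELS = 10                           # one UL channel per CU in Section 5
POWER_LEVELS_MW = [20, 40, 60, 80, 100]   # D2D power levels used in Section 5

# Action = (channel index, transmit power in mW); the agent picks one pair per D2D.
ACTIONS = list(product(range(N_CHANNELS), POWER_LEVELS_MW))

def make_state(sinr, interference, arrival_rate):
    """State vector St = (SINR, I_n^x, lambda) fed to the Q-network."""
    return np.array([sinr, interference, arrival_rate], dtype=np.float32)
```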
In this process, the agent accumulates previous experience and, based on these experiences, learns how to interact with the environment. Considering that the agent is in a discrete time MDP (DTMDP) with continuous state and action space, the reward $R_t$ depends on the present state $St_t$, the action $Ac_t$, and the next landed state $St'_t=St_{t+1}$, so $R_t=R(St_t,Ac_t,St'_t)$. In our case, the throughput in (2) is the reward function. Since the domain of actions is continuous, we evaluate the action from a stochastic policy $\pi(Ac_t\mid St_t)=P(St=St_{t+1}\mid St_t,Ac_t)$, which indicates the mapping of the environment state to the probability of taking an action. The probability distribution $P(St=St_{t+1}\mid St_t,Ac_t)$ for the transition belongs to an uncertainty transition set $p(\cdot\mid St_t,Ac_t)$. A DTMDP estimates and changes the policy based on the value function, which is defined as the cumulative expected reward beginning from the state St and following the assigned policy π. Overall, the learning goal of the agent is to search for an optimal policy that will maximize the cumulative reward over a long run time. From now on, we use the symbol St for $St_t$ and Ac for $Ac_t$. For policy π, the value function associated with state St can be presented recursively as follows:
$$V^{\pi}(St)=\mathbb{E}_{\pi}\left[\sum_{t\geq 0}\gamma^{t}R_{t}\ \middle|\ s_{0}=St,\pi\right]=\mathbb{E}_{\pi}\left[R_{0}+\gamma R_{1}+\gamma^{2}R_{2}+\cdots\ \middle|\ s_{0}=St\right]=\mathbb{E}_{\substack{Ac\sim\pi(\cdot\mid St)\\ St'\sim p(\cdot\mid St,Ac)}}\left[R_{0}+\gamma V^{\pi}(St')\ \middle|\ s_{0}=St\right],\tag{6}$$
where $\mathbb{E}\{\cdot\}$ denotes the expectation operator and γ the discount factor, which takes a value between 0 and 1. The set $p(\cdot\mid St_t,Ac_t)$ represents the probability distribution of the state transition.
The goal is to find an optimal policy $\pi^{*}$ that maximizes the sum of rewards $V^{\pi}(St)$, ie,
$$V^{*}(St)=V^{\pi^{*}}(St)=\max_{\pi}V^{\pi}(St).\tag{7}$$
The optimal value that satisfies Equation (6) becomes
$$V^{*}(St)=V^{\pi^{*}}(St)=\max\ \mathbb{E}_{\substack{Ac\sim\pi(\cdot\mid St)\\ St'\sim p(\cdot\mid St,Ac)}}\left[R_{0}+\gamma V^{*}(St')\right].\tag{8}$$
The difficulty of an MDP is the handling of randomness, both in terms of the initial state during sampling and in terms of the distribution of the next state. Let $V(St)<V(St')$; then the state St' is better than St, as the agent receives more rewards in state St' than in St. The perfectness of the environment is not always known to the agent, thus the information about the immediate reward as well as the next landed state is also not exactly known to the agent. Thus, $\pi^{*}$ may not necessarily be unique. Therefore, we look for another way to evaluate $\pi^{*}$.
The role of Q-learning is to find $\pi^{*}$ without having knowledge of the immediate reward and the distribution of the next state. This is possible by remanipulating Equation (8): under policy π, the value of a given state-action pair is expressed as
$$Q^{\pi}(St,Ac)=\mathbb{E}_{\pi}\left[\sum_{t\geq 0}\gamma^{t}R_{t}\ \middle|\ s_{0}=St,a_{0}=Ac,\pi\right]=R_{0}+\gamma\,\mathbb{E}_{St'\sim p(\cdot\mid St,Ac)}\left[V^{\pi}(St')\right].\tag{9}$$
Thus, the Q-function is basically the expected cumulative reward for taking an action Ac in state St and thereafter following the policy π.
The optimal value of the Q-function, $Q^{\pi^{*}}$, is then expressed as
$$Q^{*}(St,Ac)=Q^{\pi^{*}}(St,Ac)=R_{0}+\gamma\,\mathbb{E}_{St'\sim p(\cdot\mid St,Ac)}\left[V^{*}(St')\right].\tag{10}$$
This gives
$$V^{*}(St)=\max_{Ac}Q^{*}(St,Ac).\tag{11}$$
Therefore, $V^{*}$, which satisfies the Bellman equation (8), can be retrieved from $Q^{*}$. Replacing Equation (11) in (10), we obtain $Q^{*}(St,Ac)$, which is presented as follows:
$$Q^{*}(St,Ac)=R_{0}+\gamma\,\mathbb{E}_{St'\sim p(\cdot\mid St,Ac)}\left[\max_{Ac'}Q^{*}(St',Ac')\right].\tag{12}$$
Thus, the process of obtaining $Q^{*}(St,Ac)$ in Q-learning is an iterative method. In general, the iterative Q-learning update is
$$Q_{t}(St,Ac)=Q_{t-1}(St,Ac)+\alpha\left[R_{t}+\gamma\max_{Ac'}Q_{t-1}(St',Ac')-Q_{t-1}(St,Ac)\right],\tag{13}$$
where α takes a value between 0 and 1 and is called the learning rate of the system. The convergence of $Q(St,Ac)$ to $Q^{*}(St,Ac)$ occurs when the value of α moves toward zero and each (St, Ac) pair is visited infinitely often. However, this is impracticable for an extremely large state space. In such a case, all states are rarely covered, and thus the Q-values for those states may not be updated in the Q-table. Therefore, it takes a longer time to converge.
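A minimal tabular sketch of the iterative update (13) for a toy discretized state and action space is shown below; it only illustrates the update rule itself and is not the resource-allocation agent.

```python
import numpy as np

def q_learning_step(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One application of update (13): move Q(s, a) toward the bootstrapped
    target r + gamma * max_a' Q(s', a')."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

# Toy usage with 8 discrete states and 5 actions (eg, power levels).
Q = np.zeros((8, 5))
Q = q_learning_step(Q, s=2, a=1, r=1.5, s_next=3)
```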
3.2 Deep Q-learning
In order to solve this issue, Deep Q-learning uses a nonlinear function approximator $Q(St,Ac,\theta)$ to estimate the Q-function $Q^{*}(St,Ac)$. The function approximator $Q(St,Ac,\theta)$ is basically a DNN, where θ represents the weights of the function. Therefore, the learning mechanism is called Deep Q-learning. In this learning, the neural network takes the input state vector St = (SINR, $I^{x}_{n}$, λ) and a given set of weights θ to estimate the Q-values in an iterative manner.
In the DQN, the weight θ is adjusted iteratively to improve the Q-function. The agent tries to minimize the Bellman error, which means that the Q-value is made close to the target Q-value. This is defined by means of a loss function represented as follows:
$$\mathrm{Loss}_{i}(\theta_{i})=\mathbb{E}_{D_{i}\sim p(\cdot)}\left[\left(R+\gamma\max_{Ac'}Q(St',Ac';\tilde{\theta})-Q(St,Ac;\theta_{i})\right)^{2}\right].\tag{14}$$
Here, $Z_{i,\tilde{\theta}}(D_{i})=R+\gamma\max_{Ac'}Q(St',Ac';\tilde{\theta})$ is the target Q-value. The symbol $p(\cdot)$ is the probability distribution for the transition $D_{i}=(St,Ac,R,St')$ at each iteration, and the transition is stored in a replay memory D having capacity N. Moreover, $\tilde{\theta}$ represents the fixed weights used to determine the target Q-value, and the weights $\theta_{i}$ are the estimated values at iteration i. The Q-value of each device calculated at different time slots is of high importance because, by comparing these Q-values, the agent learns whether a time slot is suitable for transmission of a message or not. Thus, the availability of a device at any point in time can be known. Furthermore, the reward at any time slot depends on the success or failure of the transmission of the message. Thus, the overhead during message exchange between devices and the CH is determined by the size of the transmitted message and the number of active transmitters at that point in time. If there are K devices active at any point in time, then the overhead is K(K-1) messages, where each message has a size of m bits.
In order to minimize the loss function, we determine the gradient of the loss function with respect to the function parameters $\theta_{i}$, ie,
$$\frac{d\,\mathrm{Loss}_{i}(\theta_{i})}{d\theta_{i}}=\mathbb{E}_{D_{i}\sim p(\cdot)}\left[\left(R+\gamma\max_{Ac'}Q(St',Ac';\tilde{\theta})-Q(St,Ac;\theta_{i})\right)\frac{dQ(St,Ac;\theta_{i})}{d\theta_{i}}\right].\tag{15}$$
The weight update is as follows:
$$\theta_{i+1}\leftarrow\theta_{i}+\alpha\,\mathbb{E}_{D_{i}\sim p(\cdot)}\left[\left(Z_{i,\tilde{\theta}}(D_{i})-Q(St,Ac;\theta_{i})\right)\frac{dQ(St,Ac;\theta_{i})}{d\theta_{i}}\right].\tag{16}$$
Here, the Q-function is trained iteratively so that the Q-value approaches the target $Z_{i,\tilde{\theta}}(D_{i})$ when the Q-function is optimal.
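A hedged TensorFlow sketch of the standard DQN update in (14) to (16) follows: a fixed target network supplies the bootstrapped label Z, and one gradient step adjusts θ on a mini-batch. The network objects, optimizer, and shapes are assumptions for illustration only.

```python
import tensorflow as tf

def dqn_step(q_net, target_net, optimizer, states, actions, rewards, next_states, gamma=0.9):
    """One mini-batch update of (16): minimize (Z - Q(St, Ac; theta))^2,
    where Z = R + gamma * max_a' Q(St', a'; theta_tilde)."""
    # Target Q-value computed with the fixed-weight network (theta_tilde).
    z = rewards + gamma * tf.reduce_max(target_net(next_states), axis=1)
    with tf.GradientTape() as tape:
        q_all = q_net(states)                                    # (batch, n_actions)
        q_sa = tf.gather(q_all, actions, axis=1, batch_dims=1)   # Q(St, Ac; theta)
        loss = tf.reduce_mean(tf.square(tf.stop_gradient(z) - q_sa))
    grads = tape.gradient(loss, q_net.trainable_variables)
    optimizer.apply_gradients(zip(grads, q_net.trainable_variables))
    return loss
```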
3.3 Deep Q-learning with EKF
In the first phase, we discuss the uncertainty issue of state transition probability using Deep Q-learning. Furthermore, the
weight uncertainty issue, which is not solvable in the first phase, is presented in the second phase using EKF. The block
diagram representation of Deep Q-learning with EKF is shown in Figure 3.
In our work, we first present a modified version of the loss function (14) that incorporates the uncertainty present in Q-learning and improves the learning process of the agent. In this process, at each episode, the learner accesses the uncertainty transition set $p(\cdot\mid St_{i},Ac_{i})$ for estimating all the possible next landed states, denoted by the set $\hat{S}(\cdot\mid St_{i},Ac_{i})$. In order to train the Q-network, a mini-batch of transition data $D_{i}$ is fetched from the replay memory instead of consecutive samples. The agent then estimates the modified target $Z^{\mathrm{new}}_{i,\tilde{\theta}}(D_{i})$ given in Equation (18) and updates multiple weights using $Z^{\mathrm{new}}_{i,\tilde{\theta}}(D_{i})$ according to (16). This greatly increases data efficiency.
Using the aforementioned concept, the modified version of the loss function is as follows:
$$\mathrm{Loss}_{i}(\theta_{i})^{\mathrm{new}}=\mathbb{E}_{D_{i}\sim p(\cdot)}\left[\left(R+\gamma\min_{P\in p}\sum_{St'\in\hat{S}}p\left(St'\mid St_{i},Ac_{i}\right)\max_{Ac'}Q(St',Ac';\tilde{\theta})-Q(St,Ac;\theta_{i})\right)^{2}\right],\tag{17}$$
where
$$Z^{\mathrm{new}}_{i,\tilde{\theta}}(D_{i})=R+\gamma\min_{P\in p}\sum_{St'\in\hat{S}}p\left(St'\mid St_{i},Ac_{i}\right)\max_{Ac'}Q(St',Ac';\tilde{\theta}).\tag{18}$$
However, Deep Q-learning with experience replay updates the weights in accordance with the target label $Z^{\mathrm{new}}_{i,\tilde{\theta}}(D_{i})$, but it does not consider the uncertainty of the weights, which is very important for designing any robust model. Thus, we propose to use the EKF with the DQN to handle both the weight uncertainty and the uncertain state transition probability. The robust target in (18) is illustrated by the sketch below.
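The following sketch computes the robust target (18) under the assumption that the uncertainty set is available explicitly as a small list of candidate probability vectors over the estimated next states; the variable names are illustrative.

```python
import numpy as np

def robust_target(reward, q_next_max, candidate_dists, gamma=0.9):
    """Z_new = R + gamma * min_{P in p} sum_{St'} P(St') * max_a' Q(St', a'; theta_tilde).
    q_next_max[j]      : max_a' Q(St'_j, a') for each candidate next state St'_j
    candidate_dists[k] : k-th candidate probability vector over those next states."""
    expected_values = [np.dot(p, q_next_max) for p in candidate_dists]
    return reward + gamma * min(expected_values)

# Toy usage: two candidate distributions over three possible next states.
z_new = robust_target(reward=1.0,
                      q_next_max=np.array([2.0, 1.5, 0.5]),
                      candidate_dists=[np.array([0.6, 0.3, 0.1]),
                                       np.array([0.2, 0.3, 0.5])])
```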
In our work, we apply an EKF over the DQN, which is suitable for learning the behavior of nonlinear data or for evaluating the weights θ of a nonlinear function approximation. The EKF can now be used to process the following nonlinear model:
$$\theta_{i}=\theta_{i-1}+w_{i-1}\tag{19}$$
$$Z_{i}(D_{i})=Q(St,Ac;\theta_{i})+v_{i},\tag{20}$$
where $Z_{i}(D_{i})=R+\gamma\max_{Ac'}Q(St',Ac';\tilde{\theta})$ is the target value at iteration i and $Q(St,Ac;\theta_{i})$ a nonlinear Q-function, whereas $w_{i-1}$ and $v_{i}$ are the process error and the measurement error with covariances $F_{w_{i}}$ and $F_{v_{i}}$, respectively.
The nonlinear Q-function can be linearized by the EKF using the first-order term of the Taylor expansion, as mentioned in the following:
$$Q(St,Ac;\theta_{i})=Q(St,Ac;\breve{\theta})+\left.\frac{dQ(St,Ac;\breve{\theta})}{d\theta_{i}}\right|_{\theta_{i}=\breve{\theta}}\left(\theta_{i}-\breve{\theta}\right)^{T}.\tag{21}$$
Here, $\breve{\theta}$ is the linearization point and basically represents the estimated weight $\theta_{i\mid i-1}$ at iteration i-1. As $\theta_{i}$ is random in nature, the EKF estimates the expected value of the weights, represented by $\breve{\theta}$, based on the transitions observed over a time range.
In the prediction stage, the predicted value of the Q-function is obtained as follows:
$$\begin{aligned}\breve{Z}_{i\mid i-1}(D_{i})&=\mathbb{E}\left[Q(St,Ac;\theta_{i})\mid D_{1:i-1}\right]\\&\overset{(21)}{=}\mathbb{E}\left[Q(St,Ac;\breve{\theta}_{i\mid i-1})+\frac{dQ(St,Ac;\breve{\theta}_{i\mid i-1})}{d\theta_{i}}\left(\theta_{i}-\breve{\theta}_{i\mid i-1}\right)^{T}\ \middle|\ D_{1:i-1}\right]\\&=Q(St,Ac;\breve{\theta}_{i\mid i-1})+\left(\mathbb{E}\left[\theta_{i}\mid D_{1:i-1}\right]-\breve{\theta}_{i\mid i-1}\right)^{T}\frac{dQ(St,Ac;\breve{\theta}_{i\mid i-1})}{d\theta_{i}}\\&=Q(St,Ac;\breve{\theta}_{i\mid i-1})+\left(\breve{\theta}_{i\mid i-1}-\breve{\theta}_{i\mid i-1}\right)^{T}\frac{dQ(St,Ac;\breve{\theta}_{i\mid i-1})}{d\theta_{i}}\\&=Q(St,Ac;\breve{\theta}_{i\mid i-1}),\end{aligned}\tag{22}$$
where $\mathbb{E}[\theta_{i}\mid D_{1:i-1}]\approx\breve{\theta}_{i\mid i-1}=\breve{\theta}_{i-1\mid i-1}$, according to the general conditional expectation rule, ie, $\mathbb{E}[\theta_{i}\mid D_{1:i}]\approx\theta_{i\mid i}$.
Now, the corrected or updated value of the Q-function is as follows:
$$\hat{Z}_{i\mid i-1}(D_{i})\approx Q(St,Ac;\theta_{i})-\breve{Z}_{i\mid i-1}(D_{i})\approx Q(St,Ac;\theta_{i})-Q(St,Ac;\breve{\theta}_{i\mid i-1}).\tag{23}$$
The covariance that relates the error in the weights and the corrected Q-value is as follows:
$$\begin{aligned}F_{\hat{\theta}_{i},\hat{Z}}(D_{i})&\triangleq\mathbb{E}\left[\hat{\theta}_{i\mid i-1}\hat{Z}_{i\mid i-1}(D_{i})\mid D_{1:i-1}\right]\\&=\mathbb{E}\left[\left(\theta_{i}-\breve{\theta}_{i\mid i-1}\right)\left(Q(St,Ac;\theta_{i})-Q(St,Ac;\breve{\theta}_{i\mid i-1})\right)\mid D_{1:i-1}\right]\\&\overset{(21)}{=}\mathbb{E}\left[\left(\theta_{i}-\breve{\theta}_{i\mid i-1}\right)\left(Q(St,Ac;\breve{\theta}_{i\mid i-1})+\frac{dQ(St,Ac;\breve{\theta}_{i\mid i-1})}{d\theta_{i}}\left(\theta_{i}-\breve{\theta}_{i\mid i-1}\right)^{T}-Q(St,Ac;\breve{\theta}_{i\mid i-1})\right)\ \middle|\ D_{1:i-1}\right]\\&=\mathbb{E}\left[\hat{\theta}_{i\mid i-1}\hat{\theta}^{T}_{i\mid i-1}\mid D_{1:i-1}\right]\frac{dQ(St,Ac;\breve{\theta}_{i\mid i-1})}{d\theta_{i}}\\&=F_{i\mid i-1}\,\frac{dQ(St,Ac;\breve{\theta}_{i\mid i-1})}{d\theta_{i}},\end{aligned}\tag{24}$$
where $\theta_{i}-\breve{\theta}_{i\mid i-1}\triangleq\hat{\theta}_{i\mid i-1}$ and $\mathbb{E}\left[\hat{\theta}_{i\mid i-1}\hat{\theta}^{T}_{i\mid i-1}\mid D_{1:i-1}\right]\triangleq F_{i\mid i-1}$ is the conditionally expected error covariance.
The covariance of the updated Q-function is as follows:
$$\begin{aligned}F_{\hat{z}}(D_{i})&=\mathbb{E}\left[\left(\hat{Z}_{i\mid i-1}(D_{i})\right)^{2}\mid D_{1:i-1}\right]+F_{v_{i}}\\&\overset{(23)}{=}\mathbb{E}\left[\left(Q(St,Ac;\theta_{i})-Q(St,Ac;\breve{\theta}_{i\mid i-1})\right)^{2}\mid D_{1:i-1}\right]+F_{v_{i}}\\&\overset{(21)}{=}\mathbb{E}\left[\left(Q(St,Ac;\breve{\theta}_{i\mid i-1})+\frac{dQ(St,Ac;\breve{\theta}_{i\mid i-1})}{d\theta_{i}}\left(\theta_{i}-\breve{\theta}_{i\mid i-1}\right)^{T}-Q(St,Ac;\breve{\theta}_{i\mid i-1})\right)^{2}\ \middle|\ D_{1:i-1}\right]+F_{v_{i}}\\&=\mathbb{E}\left[\left(\frac{dQ(St,Ac;\breve{\theta}_{i\mid i-1})}{d\theta_{i}}\left(\theta_{i}-\breve{\theta}_{i\mid i-1}\right)^{T}\right)^{2}\ \middle|\ D_{1:i-1}\right]+F_{v_{i}}\\&=\left(\frac{dQ(St,Ac;\breve{\theta}_{i\mid i-1})}{d\theta_{i}}\right)^{T}\mathbb{E}\left[\hat{\theta}_{i\mid i-1}\hat{\theta}^{T}_{i\mid i-1}\mid D_{1:i-1}\right]\frac{dQ(St,Ac;\breve{\theta}_{i\mid i-1})}{d\theta_{i}}+F_{v_{i}}\\&=\left(\frac{dQ(St,Ac;\breve{\theta}_{i\mid i-1})}{d\theta_{i}}\right)^{T}F_{i\mid i-1}\,\frac{dQ(St,Ac;\breve{\theta}_{i\mid i-1})}{d\theta_{i}}+F_{v_{i}}.\end{aligned}\tag{25}$$
Now, using the covariances computed in (24) and (25), the gain of the EKF is as follows:
$$KG_{i}\triangleq F_{\hat{\theta}_{i},\hat{Z}}(D_{i})\,F^{-1}_{\hat{z}}(D_{i})=F_{i\mid i-1}\,\frac{dQ(St,Ac;\breve{\theta}_{i\mid i-1})}{d\theta_{i}}\left[\left(\frac{dQ(St,Ac;\breve{\theta}_{i\mid i-1})}{d\theta_{i}}\right)^{T}F_{i\mid i-1}\,\frac{dQ(St,Ac;\breve{\theta}_{i\mid i-1})}{d\theta_{i}}+F_{v_{i}}\right]^{-1}.\tag{26}$$
The measurement update of the weights $\breve{\theta}$ used to train the Q-function with the EKF is defined as follows:
$$\breve{\theta}_{i\mid i}=\breve{\theta}_{i\mid i-1}+\mathbb{E}_{D_{i}}\left[KG_{i}\left(Z_{i,\tilde{\theta}}(D_{i})-Q(St,Ac;\breve{\theta}_{i\mid i-1})\right)\right].\tag{27}$$
The updated covariance error is as follows:
$$F_{i\mid i}=F_{i\mid i-1}-KG_{i}\,F_{\hat{z}}(D_{i})\,KG_{i}^{T}.\tag{28}$$
It is inferred from the EKF gain in (26) that the EKF adapts the learning of every weight, which includes the weight uncertainty features. Now, the EKF estimates the posterior weights using the knowledge of the probability of the prior weights conditioned on the transitions $D_{1:i}$ obtained up to iteration i. This posterior weight update is based on the likelihood of the target value $Z_{i}(D_{i})\mid\theta_{i}$ and the posterior probability distribution of $\theta_{i}\mid D_{1:i-1}$. Further, considering the Gaussian nature of $p(Z_{i}(D_{i})\mid\theta_{i})$ and $p(\theta_{i}\mid D_{1:i-1})$, we define the following distributions:
$$Z_{i}(D_{i})\mid\theta_{i}\sim\mathcal{N}\left(\mathbb{E}\left[Q(St,Ac;\theta_{i})\right],F_{v_{i}}\right)\tag{29}$$
$$\theta_{i}\mid D_{1:i-1}\sim\mathcal{N}\left(\breve{\theta}_{i\mid i-1},F_{i\mid i-1}\right).\tag{30}$$
Using Equations (29) and (30) and the weights $\breve{\theta}$ in (27), we obtain a controlled version of the loss function that builds uniformity into the network, ie,
$$\breve{\mathrm{Loss}}_{i}(\theta_{i})=\frac{1}{2F_{v_{i}}}\,\mathbb{E}\left[\left(Z_{i,\tilde{\theta}}(D_{i})-Q(St,Ac;\theta_{i})\right)^{2}\right]+\frac{1}{2}\left(\theta_{i}-\breve{\theta}_{i\mid i-1}\right)F^{-1}_{i\mid i-1}\left(\theta_{i}-\breve{\theta}_{i\mid i-1}\right)^{T}.\tag{31}$$
Here, the weights are controlled on the basis of the updated covariance error $F_{i\mid i-1}$, and the value of the error variance $F_{v_{i}}$ varies according to the transition $D_{i}=(St,Ac,R,St')$. The value of $F_{v_{i}}$ becomes high if the transition $D_{i}$ is heavily corrupted. In this situation, $F_{v_{i}}$ affects the weights strongly, and thus the error variance $F_{v_{i}}$ acts as a control parameter toward maintaining uniformity in the network.
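The EKF recursion in (24) to (28) can be sketched as follows for a Q-network whose weights are flattened into a single vector; because a single scalar Q(St, Ac; θ) is observed per update, the Jacobian dQ/dθ is a vector and the innovation variance is a scalar. This is a minimal NumPy illustration of the update equations under those assumptions, not the authors' implementation.

```python
import numpy as np

def ekf_weight_update(theta, F, grad_q, q_pred, z_target, F_v):
    """One EKF measurement update for the Q-network weights.
    theta    : current weight estimate theta_{i|i-1}               (d,)
    F        : weight error covariance F_{i|i-1}                   (d, d)
    grad_q   : dQ(St, Ac; theta)/dtheta evaluated at theta         (d,)
    q_pred   : Q(St, Ac; theta_{i|i-1})  (predicted value, Eq. 22)
    z_target : target value Z = R + gamma * max_a' Q(St', a'; theta_tilde)
    F_v      : measurement noise variance (scalar)"""
    g = grad_q.reshape(-1, 1)
    F_z = float(g.T @ F @ g) + F_v                          # innovation variance, Eq. (25)
    K = (F @ g) / F_z                                       # Kalman gain, Eq. (26)
    theta_new = theta + (K * (z_target - q_pred)).ravel()   # weight update, Eq. (27)
    F_new = F - F_z * (K @ K.T)                             # covariance update, Eq. (28)
    return theta_new, F_new

# Toy usage with a 4-parameter "network".
d = 4
theta, F = np.zeros(d), np.eye(d)
theta, F = ekf_weight_update(theta, F, grad_q=np.array([0.1, -0.2, 0.05, 0.3]),
                             q_pred=0.8, z_target=1.2, F_v=0.5)
```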
3.4 Computational complexity
The time complexity of our proposed resource sharing algorithm is as follows. In our algorithm, we have n episodes and each episode has a duration of 15 TTIs. In the algorithm, a total of Ac actions are handled for a network having |S| sectors. The weight update θ in the fully connected DQN depends on the width and the number of hidden layers. Let H(θ) be the total number of parameters associated with the DQN, expressed as $H(\theta)=\sum_{i=0}^{3}(L_{i}+1)L_{i+1}$, where $L_{0}$ represents the number of state inputs in the first layer and $L_{1},L_{2},L_{3}$ represent the numbers of neurons in the hidden layers. Thus, the time complexity of our algorithm is $\mathcal{O}(H(\theta)\cdot n\cdot T\cdot|S|\cdot Ac)$.
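As a quick check of H(θ) for the network used in Section 5 (3 state inputs and hidden layers of 400, 200, and 100 neurons), the parameter count can be computed as below; the output width of 50 assumes the 10-channel × 5-power-level action set and is our assumption, not a figure stated by the authors.

```python
# H(theta) = sum_{i} (L_i + 1) * L_{i+1}, counting one bias per neuron.
layers = [3, 400, 200, 100, 50]   # [inputs, hidden1, hidden2, hidden3, outputs]
H = sum((layers[i] + 1) * layers[i + 1] for i in range(len(layers) - 1))
print(H)  # 106950 parameters for this assumed layout
```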
4 M/M/C:N QUEUING WITH NONPREEMPTIVE LIMITED PRIORITY
In this queuing model, service to low priority users follows the service to j high priority users. The first grade users are referred to as high priority users, whereas the second grade users are treated as low priority users.
TABLE 1 Expected waiting time of two graded users with j = 4

Traffic density (ι)   Waiting time of first grade user, Wq1 (s)   Waiting time of second grade user, Wq2 (s)
0                     0                                           0
0.2                   1.02 × 10⁻⁶                                 0.00241
0.4                   0.00167                                     0.00641
0.6                   0.0052                                      0.01432
0.8                   0.01341                                     0.02705
1.0                   0.02732                                     0.04251
In the M/M/C:N queuing system, the user arrivals follow a Poisson distribution with rate λ per hour, whereas the service times follow an exponential distribution with average service rate μ per hour. The queue length is set to N and C denotes the number of servers. The traffic density of the system is denoted by $\iota=\iota_{1}+\iota_{2}=\frac{\lambda_{1}}{C\mu}+\frac{\lambda_{2}}{C\mu}$, where $\lambda_{1}$ and $\lambda_{2}$ represent the arrival rates of the first and second grade users appearing in the system. The users in the system are separated into two queues, ie, $q_{1}$ and $q_{2}$. The expected queue length of the xth grade user is $L_{qx}$. Therefore,
$$L_{qx}=L_{q1}+L_{q2},\tag{32}$$
where $L_{q1}=\lambda_{1}W_{q1}$ and $L_{q2}=\lambda_{2}W_{q2}$ represent the queue lengths of the first and second grade users. When C → ∞, the system behaves as a queuing system with full priority for the users of grade 1; for C → 0, the second grade users get full priority.
The expected waiting times in the queue for the first and second grade users associated with a stationary system model are expressed as follows:
$$W_{q1}=\frac{C^{2}\mu^{2}\beta L_{qx}+\lambda_{e}\lambda_{2}}{C\mu\left(C\mu\lambda_{2}-\lambda_{e}\lambda_{2}+C\mu\beta\lambda_{1}\right)}\tag{33}$$
$$W_{q2}=\frac{L_{qx}}{\lambda_{2}}-\frac{C^{2}\mu^{2}\beta\lambda_{1}L_{qx}+\lambda_{e}\lambda_{1}\lambda_{2}}{C\mu\lambda_{2}\left(C\mu\lambda_{2}-\lambda_{1}\lambda_{2}+C\mu\beta\lambda_{1}\right)},\tag{34}$$
where $\lambda_{e}=\lambda_{1}+\lambda_{2}$ and β is the service density of the second grade users that change queue.
Table 1 depicts the expected waiting times of the first and second graded users for j = 4. It is inferred from the table that the waiting time of first grade users under the nonpreemptive limited priority scheme is much less than that of second grade users.
This nonpreemptive limited priority queuing scheme avoids message collisions by adjusting the value of j when the expected waiting time of a user in the queue is too high. Thereby, the issue of the plain nonpreemptive priority queuing scheme, where there is a high probability of data collision due to heavy traffic loads at the network, is solved.
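For illustration, a nonpreemptive two-class priority queue with C servers can be simulated with SimPy as sketched below to compare the average waiting times of the two grades; this simplified sketch does not implement the limited-priority parameter j or the finite queue length N, and the rates used are placeholders.

```python
import random
import simpy

C, MU = 2, 30.0                      # servers and per-server service rate (jobs/hour)
LAMBDAS = {1: 20.0, 2: 10.0}         # arrival rates: grade 1 = high priority, grade 2 = low
waits = {1: [], 2: []}

def user(env, servers, grade):
    arrival = env.now
    # PriorityResource serves waiting requests in priority order without preemption.
    with servers.request(priority=grade) as req:
        yield req
        waits[grade].append(env.now - arrival)
        yield env.timeout(random.expovariate(MU))

def source(env, servers, grade):
    while True:
        yield env.timeout(random.expovariate(LAMBDAS[grade]))
        env.process(user(env, servers, grade))

env = simpy.Environment()
servers = simpy.PriorityResource(env, capacity=C)
for g in (1, 2):
    env.process(source(env, servers, g))
env.run(until=1000.0)                # simulate 1000 hours
for g in (1, 2):
    print(f"grade {g}: mean wait = {sum(waits[g]) / len(waits[g]):.5f} h")
```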
5 NUMERICAL RESULTS
We consider a 4000 m × 4000 m cellular network, where the D2Ds and CUs are uniformly distributed. Each cell is split into three sectors, with each node having a coverage pattern of 120°. The simulated system topology and clustering of nodes are shown in Figure 4. For experimental purposes, the proposed work is simulated with 20 D2D pairs and 10 CUs. Therefore, a total of 10 channels are allocated in the cellular network, and the transmission power of D2Ds on each channel is divided into five power levels: 20 mW, 40 mW, 60 mW, 80 mW, and 100 mW. The maximum transmit powers of the CU and D2D users are 24 dBm and 21 dBm, respectively. The SINR threshold required to define the state of a D2D user is 0.005. The discount factor is γ = 0.9 and the ratio of the average user arrival rates is λ1/λ2 = 2/1. The DQN consists of five layers. There are three hidden layers having 400, 200, and 100 neurons, respectively. The total number of episodes is 1000 and each episode has 4000 iterations. We use the rectified linear unit (ReLU) activation function in the hidden layers, defined as R(x) = max(0, x). We implement our Deep Q-learning with EKF in a TensorFlow simulation environment based upon Python, which runs on four GPUs of Nvidia architecture and an Intel CPU having 128 GB of memory.
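A hedged tf.keras sketch of the five-layer network described above (three state inputs, three ReLU hidden layers of 400, 200, and 100 neurons, and one Q-value per action) is given below; the 50-action output assumes the 10 channels × 5 power levels of this setup, and the optimizer choice is illustrative.

```python
import tensorflow as tf

N_ACTIONS = 10 * 5   # 10 shared channels x 5 D2D power levels (assumed action set)

def build_q_network():
    """Fully connected DQN: 3 state inputs -> 400 -> 200 -> 100 -> Q-values."""
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(3,)),          # St = (SINR, I, lambda)
        tf.keras.layers.Dense(400, activation="relu"),
        tf.keras.layers.Dense(200, activation="relu"),
        tf.keras.layers.Dense(100, activation="relu"),
        tf.keras.layers.Dense(N_ACTIONS),           # one Q-value per (channel, power) pair
    ])

q_net = build_q_network()
target_net = build_q_network()
target_net.set_weights(q_net.get_weights())          # initialize theta_tilde = theta
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)
```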
Figure 5 shows the convergence performance of Deep Q-learning with EKF for different mini-batch sizes obtained by updating the gradient. It is inferred from the figure that the convergence is faster when the mini-batch size is small. The mini-batch size in Deep Q-learning with EKF determines the number of experiences used during the transition to learn a policy.
FIGURE 4 System topology
FIGURE 5 Impact of mini-batch size on convergence in Deep Q-learning + EKF (episode vs. system throughput in Mbps)
FIGURE 6 Variation of throughput with respect to the learning rate (learning rate vs. average system throughput in Mbps)
Figure 6 illustrates the average system throughput with respect to the learning rate. We can infer from the figure that the throughput of the proposed Deep Q-learning with EKF and of Deep Q-learning increases with the learning rate and reaches its peak value at learning rates of 0.58 and 0.6, respectively. With a further increase in the learning rate, the throughput performance decreases sharply. In the case of Q-learning, the throughput rises very slowly, attains its maximum at a learning rate of around 0.75, and then decreases sharply. The throughput performance of Deep Q-learning with EKF is better than that of Deep Q-learning, as it learns the best optimal policy by incorporating the state transition probability uncertainty and the weight uncertainty. Therefore, it selects the best possible actions and receives the maximum
throughput over a long run-time. Furthermore, when the learning rate becomes very high, the learner is unable to exploit all the actions, and a higher learning rate leads to a local optimum, thereby degrading the performance of the system.
Figures 7 and 8 depict the average system throughput for the whole system and the average D2D throughput with respect to the number of D2Ds in the system. We can infer from both figures that, with the increase in the number of D2Ds, the average throughput in the case of the proposed Deep Q-learning with EKF outperforms Deep Q-learning and the centralized algorithms, ie, those without a learning mechanism.7,8 This is because the DQN in coordination with the EKF processes the uncertainty in the state transition probability batch-wise while taking the weight uncertainty into account. Thereby, it maximizes the Q-value as well as the throughput under an optimal policy, which is not the case for the standard Deep Q-learning and Q-learning methods. Similarly, the traditional optimization methods fail to provide the best possible results due to the lack of complete traffic information a priori. It can be seen from Figure 7 that the overall system throughput of our proposed Deep Q-learning with EKF is 218.04 Mbps when the number of D2Ds is 8. When the number of D2Ds increases to 16, it provides an appreciable throughput of 295.2 Mbps.
Figure 9 illustrates the variation of the system throughput with respect to the state transition probability. In both schemes, the throughput of the system increases with the transition probability, but our proposed Deep Q-learning with EKF provides a more appreciable throughput as compared to standard Deep Q-learning. This is because Deep Q-learning only considers the state transition probability uncertainty for the weight update and ignores the weight uncertainty in the network. Therefore, the policy it obtains is not an optimal one. As a result, there is a degradation in the throughput performance of Deep Q-learning.
FIGURE 7 Throughput variation over the number of device-to-device (D2D) users for the whole system (number of D2D pairs vs. average system throughput in Mbps)
FIGURE 8 Average device-to-device (D2D) throughput variation over the number of D2D users (number of D2D pairs vs. average D2D throughput in Mbps)
FIGURE 9 Variation of system throughput with respect to the transition probability (transition probability vs. system throughput in Mbps)
FIGURE 10 Error estimation for a learning rate of 0.52 (number of episodes vs. estimated loss or error)
Figure 10 shows the error measured for the different Q-learning algorithms at a fixed learning rate of 0.52. We can observe that our proposed Deep Q-learning with EKF achieves the minimum error in the system. This is because, in the case
of Deep Q-learning, we do not consider the uncertainty that lies in the state transition probability and the weights used for the Q-function. Thus, the probability of selecting a wrong action increases, and thereby the estimated error, which is the difference between the target Q-value and the estimated Q-value, increases. It can also be observed from the figure that, in the case of Deep Q-learning, the error curve fluctuates, whereas, in the case of Deep Q-learning with EKF, it is fairly smooth. This is because Deep Q-learning with EKF considers the uncertainty in the weights of the Q-function. Furthermore, the EKF also regularizes the Q-function by collecting the uncertainty information from the state transition set and then passing it into the uncertain weights of the Q-function.
Figure 11 shows the variation of the Q-value over the number of episodes. We can see that the proposed Deep Q-learning with EKF outperforms the Deep Q-learning algorithm in both cases, ie, case 1 and case 2, with mini-batch sizes of 5 and 10, respectively. This is because, in our proposed algorithm, the Q-value is calculated using Equation (21), where the EKF uses the Taylor series expansion for the linearization of the Q-function. Deep Q-learning with EKF learns from the uncertainty associated with the state transition set and the weights of the Q-function. Thus, it converges to a robust policy that allows scheduling of actions with higher accuracy. Therefore, we achieve a higher Q-value at every time step.
The quantitative analysis of our simulation results is shown in Table 2. We compare the system throughput (when the number of D2Ds is K = 10) and the error performance of our proposed algorithm, ie, Deep Q-learning with EKF (which considers uncertainty in the state transition and the weights of the Q-function), and of Deep Q-learning (which considers only the state transition uncertainty) against other learning schemes (Deep Q-learning16,17,26 and Q-learning10,12) and without-learning schemes (IPO7 and GRA8). It can be observed that our proposed scheme achieves decent performance in terms of system throughput and error.
FIGURE 11 Variation of Q-function vs. the number of episodes (number of episodes vs. average Q-value)
TABLE 2 System-throughput and error performance of different learning and without-learning algorithms

Performance                                   Deep Q with EKF (proposed)   Deep Q-learning (only state uncertainty)   Deep Q-learning16,17,26   Q-learning10,12   IPO7    GRA8
System-throughput (Mbps) (D2D number = 10)    235.28                       202.1                                      81.2                      110.2             120.4   174.59
Loss (or) error                               0.0123                       0.0253                                     0.0526                    0.0720            NA      NA

Abbreviations: D2D, device-to-device; EKF, extended Kalman filter.
6 CONCLUSION
In this work, we use a Deep Q-learning scheme combined with EKF for resource allocation in a D2D enabled cellular network. The DQN combined with EKF incorporates the weight uncertainty of the Q-function as well as the state uncertainty during a transition. We also show that the updated weights from the EKF-based DQN minimize the controlled version of the loss function, which brings uniformity to the network. The novel cell sectoring concept and the queuing model with nonpreemptive limited priority help in reducing the cochannel cell interference and traffic congestion. Furthermore, the proposed queuing model solves the issue of unnecessary waiting of users in priority-based transmission through proper selection of the constraint parameter j. Thus, this novel approach can be applied to other real-time applications where uncertainty is associated with the model.
ORCID
Pratap Khuntia https://orcid.org/0000-0003-1373-9907
REFERENCES
1. Fodor G, Dahlman E, Mildh G, et al. Design aspects of network assisted device-to-device communications. IEEE Commun Mag.
2012;50(3):170-177.
2. Zhang Y, Yang Y, Dai L. Energy efficiency maximization for Device-to-Device communication underlaying cellular networks on multiple
bands. IEEE Access. 2016;4:7682-7691.
3. Janis P, Koviunen V, Ribeiro C, Korhonen J, Doppler K, Hugl K. Interference-aware resource allocation for device-to-device radio
underlaying cellular networks. Paper presented at: VTC Spring 2009 - IEEE 69th Vehicular Technology Conference;2009; Barcelona, Spain.
4. Gupta S, Kumar S, Zhang R, Kalyani S, Giridhar K, Hazo L. Resource allocation for D2D links in the FFR and SFR aided cellular network.
IEEE Trans on Comm. 2016;64(10):4434-4448.
5. Baozhou Y, Qi Z. A QoS-based channel allocation and power control algorithm for device-to-device communication underlaying cellular
networks. J Commun. 2016;11(7):624-631.
6. Hamdi M, Yuan D, Zaied M. GA-based scheme for fair joint channel allocation and power control for underlaying D2D multicast com-
munications. Paper presented at: 2017 13th International Wireless Communications and Mobile Computing Conference (IWCMC); 2017;
Valencia, Spain.
7. Wang L, Wu H. Fast pairing of device-to-device link underlay for spectrum sharing with cellular users. IEEE Commun Lett.
2014;18(10):1803-1806.
8. Wang F, Xu C, Song L, Han Z. Energy-efficient resource allocation for device-to-device underlay communication. IEEE Trans Wirel
Commun. 2015;14(4):2082-2092.
9. Maghsudi S, Stańczak S. Hybrid centralized-distributed resource allocation for device-to-device communication underlaying cellular networks. IEEE Trans Veh Technol. 2016;65(4):2481-2495.
10. Nie S, Fan Z, Zhao M, Gu X, Zhang L. Q-learning based power control algorithm for d2d communication. Paper presented at: 2016 IEEE
27th Annual International Symposium on Personal, Indoor, and Mobile Radio Communications (PIMRC); 2016; Valencia, Spain.
11. Asheralieva A, Miyanaga Y. An autonomous learning-based algorithm for joint channel and power level selection by D2D pairs in
heterogeneous cellular networks. IEEE Trans Commun. 2016;64(9):3994-4012.
12. Khan MI, Alam MM, Le Moullec Y. A multi-armed bandit solver method for adaptive power allocation in device-to-device communication.
Paper presented at: International Workshop on Recent Advances in Cellular Technologies and 5G for IoT; 2018; Leuven, Belgium.
13. Mnih V, Kavukcuoglu K, Silver D, et al. Human level control through deep reinforcement learning. Nature. 2015;518(7540):529-533.
14. Mao H, Alizadeh M, Menache I, Kandula S. Resource management with deep reinforcement learning. In: Proceedings of the 15th ACM
Workshop on Hot Topics in Networks; 2016; Atlanta, GA.
15. Van Le D, Tham CK. A deep reinforcementlearning based offloading scheme in ad-hoc mobile clouds. Paper presented at: IEEE INFOCOM
2018 - IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS); 2018; Honolulu, HI.
16. Lee W, Kim M, Cho DH. Deep learning based transmit power control in underlaid device-to-device communication. IEEE Syst J. 2018.
Early access.
17. Naparstek O, Cohen K. Deep multi-user reinforcement learning for dynamic spectrum access in multi-channel wireless networks. arXiv preprint arXiv:1704.02613. 2018.
18. Kim J, Park J, Noh J, Cho S. Completely distributed power allocation using deep neural network for device to device communication
underlaying LTE. arXiv preprint arXiv:1802.02736. 2018.
19. Meng F, Chen P, Wu L, Cheng J. Power allocation in multi-user cellular networks: Deep reinforcement learning approaches. arXiv preprint arXiv:1901.07159v1. 2019.
20. Huang D, Gao Y, Li Y, et al. Deep learning based cooperative resource allocation in 5G wireless networks. Mob Netw Appl. 2018. https://doi.org/10.1007/s11036-018-1178-9
21. Naparstek O, Cohen K. Deep multi-user reinforcement learning for dynamic spectrum access in multichannel wireless networks. Paper
presented at: GLOBECOM 2017 - 2017 IEEE Global Communications Conference; 2017; Singapore.
22. Zhao N, Liang YC, Niyato D, Pei Y, Jiang Y. Deep reinforcement learning for user association and resource allocation in heterogeneous
networks. Paper presented at: 2018 IEEE Global Communications Conference (GLOBECOM); 2018; Abu Dhabi, UAE.
23. Li J, Gao H, Lv T, Lu Y. Deep reinforcement learning based computation offloading and resource allocation for MEC. Paper presented at:
2018 IEEE Wireless Communications and Networking Conference (WCNC); 2018; Barcelona, Spain.
24. Li X, Fang J, Cheng W, Duan H, Chen Z, Li H. Intelligent power control for spectrum sharing in cognitive radios: a deep reinforcement
learning approach. IEEE Access. 2018;6:25463-25473.
25. Yu Y, Wang T, Liew SC. Deep-reinforcement learning multiple access for heterogeneous wireless networks. Paper presented at: IEEE
International Conference on Communications (ICC); 2018; Kansas City, MO.
26. Chen X, Li Z, Zhang Y, et al. Reinforcement learning based QoS/QoE-aware service function chaining in software-driven 5G slices. arXiv preprint arXiv:1804.02099. 2018.
27. Weng C, Yu D, Watanabe S, Juang BHF. Recurrent deep neural networks for robust speech recognition. Paper presented at: 2014 IEEE
International Conference on Acoustics, Speech and Signal Processing (ICASSP); 2014; Florence, Italy.
28. Ye H, Li GY. Deep reinforcement learning for resource allocation in V2V communications. IEEE Trans Veh Technol. 2019. Early access.
How to cite this article: Khuntia P, Hazra R. An efficient Deep reinforcement learning with extended Kalman filter for device-to-device communication underlaying cellular network. Trans Emerging Tel Tech. 2019;30:e3671. https://doi.org/10.1002/ett.3671