Resource Allocation in URLLC with Online
Learning for Mobile Users
Jie Zhang, Chengjian Sun and Chenyang Yang
School of Electronics and Information Engineering,
Beihang University, Beijing, China
Email: {zhang15021245,sunchengjian,cyyang}@buaa.edu.cn
Abstract—Neural networks (NNs) have been applied to solve various problems in ultra-reliable and low-latency communications (URLLC). Given the stringent quality-of-service (QoS) requirement, the time for training and running NNs is not negligible, and ensuring reliability with learning-based solutions is challenging, especially in a dynamic environment. In this paper, we propose an online learning method that fine-tunes NNs trained without supervision to ensure the reliability of URLLC for mobile users. As an example, we consider a joint power and bandwidth allocation problem that minimizes the bandwidth required to satisfy the QoS of each user. A “learning-to-optimize” method with offline training is provided for comparison. Simulation results show that the proposed online learning method achieves system performance comparable to the offline training method, while the time consumed for online training and inference is about 25% of the 1 ms latency bound in the considered setup. Moreover, the online learning method adapts quickly to abrupt changes of the average packet arrival rate and can ensure reliability by setting the required overall packet loss probability slightly conservatively. By contrast, the offline training method yields much worse reliability when the arrival rate varies.
Index Terms—Online training, real-time inference, URLLC
I. INTRODUCTION
Ultra-reliable and low-latency communication (URLLC) continues to be investigated in the sixth-generation mobile systems due to various technical challenges [1].
Resource allocation plays a key role in satisfying the quality of service (QoS) requirement. To use spectrum efficiently in wireless channels, the resources should be allocated frequently to adapt to the time-varying communication environment. To facilitate real-time decisions, the “learning-to-optimize” framework proposed in [2] can be used for URLLC, where a neural network (NN) well-trained in an offline manner learns the mapping from the environmental parameters to the optimal resource allocation.
In general, NNs do not generalize well in dynamic environments, since training samples cannot be gathered from all scenarios, and hence they need to be re-trained. For example, the NNs in [3] and [4] need to be re-trained periodically on timescales of minutes and seconds, respectively. To avoid the time used for generating labels in the offline-training phase required by the framework in [2], which is prohibitive for the functional optimization problems [5] that often appear in URLLC, unsupervised learning can be applied [4, 6, 7]. However, training the NNs still consumes computing time, which has not yet been taken into account and evaluated for URLLC.
This work was supported by the key project of the National Natural Science Foundation of China (NSFC) under Grant 61731002.
URLLC has a stringent end-to-end (E2E) time budget, say 1 ms. Hence, the re-training time of NN-based solutions cannot be ignored, and low-complexity online training methods are urgently needed for URLLC to adapt to environmental variations. Reinforcement learning (RL) is a powerful tool for online learning, and has been used to support URLLC in dynamic environments [8]. However, RL is designed to solve problems formulated as Markov decision processes (MDPs), and hence consumes unnecessary computational resources when solving the non-MDP problems widely existing in URLLC.
In this paper, we propose an online learning method under the framework of unsupervised deep learning for non-MDP problems. We take a downlink resource allocation problem in URLLC as an example, where the transmit power and bandwidth are allocated among users according to their small-scale and large-scale channel gains, respectively. The power allocation policy is learned by an NN, which is trained online together with the optimization of the bandwidth allocation, so as to adapt to the time-varying large-scale channel gains of mobile users; an offline training method is provided for comparison. Simulation results show that, with only a few iterations for each observation of the large-scale channel gains, the online learning method achieves performance comparable to the offline method. The total time used for online training and inference is even shorter than the inference time of the offline method, and both are much shorter than the latency requirement.
II. PROBLEM FORMULATION AND EXISTING SOLUTION
Consider a downlink (DL) orthogonal frequency division multiple access system supporting URLLC, where a base station (BS) equipped with $N_t$ antennas serves $K$ single-antenna mobile users with maximal transmit power $P_{\max}$.
The small-scale channels are time-varying with coherence time $T_s$. Within the duration $T_s$, multiple frames, each with duration $T_f$, are used for DL and uplink (UL) transmission. The duration for DL transmission is $\tau$. The large-scale channels are also time-varying; they can be regarded as unchanged within a duration $T_L$ but vary among durations. The relation of the timescales is shown in Fig. 1.
Fig. 1. Timescales of channel variations and frame duration.
Since the packet size $u$ in URLLC is usually small, the bandwidth required for transmitting each packet is less than the channel coherence bandwidth, i.e., the channel is flat fading. Since the E2E delay requirement in URLLC is typically shorter than $T_s$, the channel is quasi-static and time diversity cannot be exploited. To guarantee transmission reliability, we consider frequency hopping, where each user is assigned different subchannels in adjacent frames. When the frequency interval between adjacent subchannels exceeds the coherence bandwidth, the small-scale channels of a user in different frames are independent.
In URLLC, the blocklength of channel coding is finite due to the short transmission duration. To characterize the impact of decoding errors on reliability, the achievable rate in the finite blocklength regime is required. In quasi-static flat fading channels, when channel state information is available at the BS and a user (say the $k$th user), the achievable rate (in packets/frame) can be accurately approximated by [9],
$$s_k \approx \frac{\tau W_k}{u \ln 2}\left[\ln\left(1+\frac{\alpha_k g_k P_k}{N_0 W_k}\right)-\sqrt{\frac{V_k}{\tau W_k}}\,Q_G^{-1}(\varepsilon_k^c)\right] \quad (1)$$
where $W_k$ and $P_k$ are the bandwidth and transmit power allocated to the $k$th user, $\varepsilon_k^c$ is the decoding error probability, $\alpha_k$ and $g_k$ are the large-scale and small-scale channel gains of the $k$th user, respectively, $N_0$ is the single-sided noise spectral density, $Q_G^{-1}(x)$ is the inverse of the Gaussian Q-function, and $V_k = 1-\frac{1}{\left[1+\alpha_k g_k P_k/(N_0 W_k)\right]^2}$ [9]. If the signal-to-noise ratio (SNR) $\frac{\alpha_k g_k P_k}{N_0 W_k}\geq 5$ dB, $V_k\approx 1$ is accurate [10]. Since a high SNR is required in URLLC, this approximation is accurate.
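As a sanity check, the rate in (1) can be evaluated numerically. The sketch below is illustrative only: the bandwidth, SNR, and decoding error probability are assumed values (not taken from the paper), and $Q_G^{-1}$ is computed by bisection.

```python
import math

def Q(x):
    """Gaussian Q-function."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def Q_inv(eps, lo=0.0, hi=20.0):
    """Invert the (decreasing) Q-function by bisection."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if Q(mid) > eps:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def rate_fbl(W, snr, tau, u, eps_c):
    """Achievable rate in packets/frame from (1), with V_k as in the text."""
    V = 1.0 - 1.0 / (1.0 + snr) ** 2
    return tau * W / (u * math.log(2)) * (
        math.log(1.0 + snr) - math.sqrt(V / (tau * W)) * Q_inv(eps_c)
    )

# Assumed illustrative values: W = 1 MHz, SNR = 10 dB, eps_c = 5e-6;
# tau = 0.05 ms and u = 160 bits follow the simulation setup.
s = rate_fbl(W=1e6, snr=10.0, tau=0.05e-3, u=160.0, eps_c=5e-6)
shannon = 0.05e-3 * 1e6 / (160.0 * math.log(2)) * math.log(11.0)
print(s, shannon)  # the finite-blocklength rate is below the Shannon rate
```

The gap between the two printed values is the finite-blocklength penalty, i.e., the $\sqrt{V_k/(\tau W_k)}\,Q_G^{-1}(\varepsilon_k^c)$ term in (1).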
Packets for each user arrive at the buffer of the BS randomly
and may accumulate into a queue. We consider that the packets
for different users wait in different queues.
The QoS requirements of URLLC can be characterized by a delay bound $D_{\max}$ and an overall packet loss probability $\varepsilon_{\max}$. The delays of UL transmission, backhaul and processing have been studied in the literature [11, 12]. By further removing the time used for online learning and inference, herein $D_{\max}$ is the DL delay, which consists of the queueing delay (denoted as $D_k^q$ for the $k$th user), the transmission delay $D_t$ and the decoding delay $D_c$. $D_t = T_f$ and $D_c$ is a constant [13]. All these delay components are measured in frames. Due to the random packet arrival, $D_k^q$ is random. To ensure the delay requirement, $D_k^q$ should be bounded by $D_{\max}^q \triangleq D_{\max}-D_t-D_c$.
If the queueing delay of a packet exceeds $D_{\max}^q$, the packet becomes useless. The queueing delay violation probability of the $k$th user, denoted as $\varepsilon_k^q \triangleq \Pr\{D_k^q > D_{\max}^q\}$, can be bounded by $\varepsilon_k^q < e^{-\theta_k B_k^E D_{\max}^q} \triangleq \varepsilon_k^{q,UB}$ [14], where $\theta_k$ is the QoS exponent that satisfies $C_k^E \geq B_k^E$, $C_k^E$ is the effective capacity depending on the service process, which can be expressed as $C_k^E = -\frac{1}{\theta_k}\ln \mathbb{E}_g\{e^{-\theta_k s_k}\}$ (packets/frame) [14], and $B_k^E$ is the effective bandwidth depending on the packet arrival process. Here, $\mathbb{E}_g\{\cdot\}$ denotes the expectation over the small-scale channel gains.
The overall reliability requirement can be characterized by $1-(1-\varepsilon_k^c)(1-\varepsilon_k^q)\approx \varepsilon_k^c+\varepsilon_k^q \leq \varepsilon_{\max}$. This approximation is very accurate, because the values of $\varepsilon_k^c$ and $\varepsilon_k^q$ are very small in URLLC. The queueing delay requirement $(D_{\max}^q, \varepsilon_k^q)$ is satisfied if $\varepsilon_k^{q,UB}$ is satisfied. Then, the overall reliability requirement can be ensured if $\varepsilon_{\max} = \varepsilon_k^c + e^{-\theta_k B_k^E D_{\max}^q}$. For simplicity, we assume $\varepsilon_k^c = \varepsilon_{\max}/2$ as in [11]. Then, the QoS exponent $\theta_k$ satisfying $(D_{\max}^q, \varepsilon_k^q)$ can be obtained from $e^{-\theta_k B_k^E D_{\max}^q} = \varepsilon_{\max}/2$, with which the QoS of URLLC can be ensured if $-\frac{1}{\theta_k}\ln \mathbb{E}_g\{e^{-\theta_k s_k}\} \geq B_k^E$. For example, when the packets of the $k$th user arrive according to a Poisson process with average arrival rate $a_k$, $B_k^E = \frac{a_k}{\theta_k}\left(e^{\theta_k}-1\right)$ (packets/frame) and $\theta_k = \ln\left[1-\frac{\ln(\varepsilon_{\max}/2)}{a_k D_{\max}^q}\right]$ [11].
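For the Poisson example, $\theta_k$ and $B_k^E$ are available in closed form. The sketch below plugs in the paper's setup ($\varepsilon_{\max}=10^{-5}$, $a_k=0.2$ packets/frame, $D_{\max}^q = 10-1-1 = 8$ frames) and verifies that the resulting pair indeed satisfies $e^{-\theta_k B_k^E D_{\max}^q}=\varepsilon_{\max}/2$.

```python
import math

def qos_exponent(a, Dq_max, eps_max):
    """QoS exponent for Poisson arrivals, solving e^{-theta*B*Dq} = eps_max/2."""
    return math.log(1.0 - math.log(eps_max / 2.0) / (a * Dq_max))

def effective_bandwidth(a, theta):
    """Effective bandwidth of a Poisson arrival process (packets/frame)."""
    return a / theta * (math.exp(theta) - 1.0)

eps_max = 1e-5
a = 0.2        # average arrival rate, packets/frame
Dq_max = 8     # Dmax - Dt - Dc = 10 - 1 - 1 frames

theta = qos_exponent(a, Dq_max, eps_max)
B = effective_bandwidth(a, theta)
print(theta, B)
print(math.exp(-theta * B * Dq_max))  # equals eps_max/2 = 5e-6
```

Substituting $B_k^E$ into the exponent gives $\theta_k B_k^E D_{\max}^q = a_k(e^{\theta_k}-1)D_{\max}^q = -\ln(\varepsilon_{\max}/2)$, so the check holds exactly by construction.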
To exploit multi-user diversity, we allocate transmit power to each user according to the small-scale channel gains of all users $\mathbf{g} \triangleq \{g_1, \ldots, g_K\}$. To reduce the complexity, the bandwidth is allocated among users according to their large-scale channel gains [4]. To improve the resource efficiency, we minimize the total bandwidth required to ensure the QoS by optimizing the power and bandwidth allocation,
$$\min_{P_k(\mathbf{g}),\,W_k} \; \sum_{k=1}^{K} W_k \quad (2)$$
$$\text{s.t.}\;\; -\frac{1}{\theta_k}\ln \mathbb{E}_g\{e^{-\theta_k s_k}\} \geq B_k^E \quad (2a)$$
$$\sum_{k=1}^{K} P_k \leq P_{\max},\quad P_k \geq 0,\quad W_k \geq 0 \quad (2b)$$
Problem (2) involves two timescales and is a functional optimization problem. Besides, the constraint (2a) does not have a closed-form expression. To solve this challenging problem, an unsupervised deep learning method was proposed in [4] to find $P_k(\mathbf{g})$ and $W_k$, $k=1,\cdots,K$, for every given value of $\boldsymbol{\alpha} \triangleq \{\alpha_1,\ldots,\alpha_K\}$. In particular, problem (2) is first transformed into its primal-dual problem as follows,
$$\max_{h(\mathbf{g}),\,\lambda_k}\;\min_{P_k(\mathbf{g}),\,W_k}\;\sum_{k=1}^{K}\left[W_k+\lambda_k\left(\mathbb{E}_g\{e^{-\theta_k s_k}\}-e^{-\theta_k B_k^E}\right)\right]+\int_{\mathbb{R}_+^K} h(\mathbf{g})\left(\sum_{k=1}^{K}P_k(\mathbf{g})-P_{\max}\right)\mathrm{d}\mathbf{g} \quad (3)$$
$$\text{s.t.}\;\;\sum_{k=1}^{K}P_k(\mathbf{g})\leq P_{\max} \quad (3a)$$
$$P_k(\mathbf{g})\geq 0,\quad W_k\geq 0,\quad h(\mathbf{g})\geq 0,\quad \lambda_k\geq 0 \quad (3b)$$
where the objective is the Lagrangian function of problem (2), and $h(\mathbf{g})$ and $\lambda_k$ are the Lagrange multipliers. Then, $P_k(\mathbf{g})/P_{\max}$ is parameterized as an NN with model parameters $\omega$, denoted as $\mathcal{N}(\mathbf{g};\omega)$. Since the required bandwidth decreases when more power is allocated, the equality in (3a) holds. Thereby, by applying the Softmax function in the output layer, $\mathcal{N}(\mathbf{g};\omega)$ satisfies the maximal transmit power constraint. $W_k$ and the parameterized form of $P_k(\mathbf{g})$ are optimized from,
$$\max_{\lambda_k}\;\min_{\omega,\,W_k}\; L \triangleq \sum_{k=1}^{K}\left[W_k+\lambda_k\left(\mathbb{E}_g\{e^{-\theta_k s_k}\}-e^{-\theta_k B_k^E}\right)\right] \quad (4)$$
$$\text{s.t.}\;\; W_k\geq 0,\quad \lambda_k\geq 0 \quad (4a)$$
By taking the objective function in (4) as the loss function, the optimal power and bandwidth allocation can be found by stochastic gradient methods, as shown in [4].
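The primal-dual structure of (4), descent on the primal variables and ascent on the multipliers, can be sketched in a few lines of NumPy. This is a toy illustration, not the paper's implementation: a softmax-parameterized power split stands in for the NN $\mathcal{N}(\mathbf{g};\omega)$, a simplified concave rate stands in for (1), finite-difference gradients replace backpropagation, and all constants ($K$, $\theta_k$, $B_k^E$, step size) are assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
K, Nb, Pmax, phi = 3, 64, 1.0, 0.05
theta = np.ones(K)                       # QoS exponents (illustrative)
BE = np.array([0.5, 0.7, 0.9])           # effective bandwidths (illustrative)

def softmax(w):
    e = np.exp(w - w.max())
    return e / e.sum()

def empirical_L(omega, W, lam, g):
    """Sample-average Lagrangian of (4) over a batch of channel gains g."""
    P = Pmax * softmax(omega)            # power allocation, sums to Pmax
    s = 10.0 * W * np.log1p(g * P / np.maximum(W, 1e-9))  # simplified rate
    con = np.exp(-theta * s).mean(axis=0) - np.exp(-theta * BE)
    return np.sum(W + lam * con)

def num_grad(x, f, h=1e-5):
    """Finite-difference gradient (a real NN would use backpropagation)."""
    gr = np.zeros_like(x)
    for i in range(x.size):
        d = np.zeros_like(x); d[i] = h
        gr[i] = (f(x + d) - f(x - d)) / (2.0 * h)
    return gr

omega, W, lam = np.zeros(K), np.ones(K), np.ones(K)
for t in range(500):
    g = rng.exponential(1.0, size=(Nb, K))   # Rayleigh fading power gains
    # SGD on the primal variables (omega, W), SGA on the multipliers lambda,
    # with projection onto the non-negative orthant for W and lambda.
    omega = omega - phi * num_grad(omega, lambda o: empirical_L(o, W, lam, g))
    W = np.maximum(W - phi * num_grad(W, lambda w: empirical_L(omega, w, lam, g)), 0.0)
    lam = np.maximum(lam + phi * num_grad(lam, lambda l: empirical_L(omega, W, l, g)), 0.0)

print(W, lam)   # non-negative primal and dual iterates after training
```

The multipliers grow while a QoS constraint is violated, pushing the corresponding bandwidth up, and shrink once it is satisfied, mirroring the max-min interplay in (4).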
III. LEARNING RESOURCE ALLOCATION IN OFFLINE AND ONLINE MANNER
When the large-scale channel gains change, $\mathcal{N}(\mathbf{g};\omega)$ needs to be re-trained and the bandwidth allocation needs to be re-optimized. To avoid the re-training, one can extend the idea in [2] to learn $\mathbf{P}(\mathbf{g},\boldsymbol{\alpha})=[P_1(\mathbf{g},\boldsymbol{\alpha}),\cdots,P_K(\mathbf{g},\boldsymbol{\alpha})]$ and $\mathbf{W}(\boldsymbol{\alpha})=[W_1(\boldsymbol{\alpha}),\cdots,W_K(\boldsymbol{\alpha})]$ with offline-trained NNs. Alternatively, we can learn $\mathbf{P}(\mathbf{g})=[P_1(\mathbf{g}),\cdots,P_K(\mathbf{g})]$ with an online-trained NN and optimize $\mathbf{W}=[W_1,\cdots,W_K]$ by tracking $\boldsymbol{\alpha}$, sampled with period $T_L$, in an online manner.
A. Learning to Optimize Problem (2) with Offline Training
According to the proof in [7], problem (2) can be equivalently transformed into the following form,
$$\max_{h(\mathbf{g},\boldsymbol{\alpha}),\,\lambda_k(\boldsymbol{\alpha})}\;\min_{P_k(\mathbf{g},\boldsymbol{\alpha}),\,W_k(\boldsymbol{\alpha})}\;\bar{L}\triangleq \mathbb{E}_{\boldsymbol{\alpha}}\Bigg\{\sum_{k=1}^{K}\left[W_k(\boldsymbol{\alpha})+\lambda_k(\boldsymbol{\alpha})\left(\mathbb{E}_g\{e^{-\theta_k \hat{s}_k}\}-e^{-\theta_k B_k^E}\right)\right]+\int_{\mathbb{R}_+^K}h(\mathbf{g},\boldsymbol{\alpha})\left(\sum_{k=1}^{K}P_k(\mathbf{g},\boldsymbol{\alpha})-P_{\max}\right)\mathrm{d}\mathbf{g}\Bigg\}$$
$$\text{s.t.}\;\;(3a),\quad P_k(\mathbf{g},\boldsymbol{\alpha})\geq 0,\quad W_k(\boldsymbol{\alpha})\geq 0,\quad h(\mathbf{g},\boldsymbol{\alpha})\geq 0,\quad \lambda_k(\boldsymbol{\alpha})\geq 0$$
where $\mathbb{E}_{\boldsymbol{\alpha}}\{\cdot\}$ denotes the expectation over the large-scale channel gains.
We approximate $P_k(\mathbf{g},\boldsymbol{\alpha})/P_{\max}$, $W_k(\boldsymbol{\alpha})$ and $\lambda_k(\boldsymbol{\alpha})$, $k=1,\cdots,K$, with NNs denoted as $\mathcal{N}_P(\mathbf{g},\boldsymbol{\alpha};\omega_P)$, $\mathcal{N}_W(\boldsymbol{\alpha};\omega_W)$ and $\mathcal{N}_\lambda(\boldsymbol{\alpha};\omega_\lambda)$, respectively. To ensure positive bandwidths and Lagrange multipliers, we choose SoftPlus as the output layer of $\mathcal{N}_W(\boldsymbol{\alpha};\omega_W)$ and $\mathcal{N}_\lambda(\boldsymbol{\alpha};\omega_\lambda)$. To satisfy the maximal transmit power constraint, we choose SoftMax as the output layer of $\mathcal{N}_P(\mathbf{g},\boldsymbol{\alpha};\omega_P)$. Then, by taking the Lagrangian function as the loss function, we can train $\omega_P$ and $\omega_W$ by stochastic gradient descent (SGD) and train $\omega_\lambda$ by stochastic gradient ascent (SGA) as follows,
$$\omega_P(t+1)=\omega_P(t)-\phi_P(t)\nabla_{\omega_P}\hat{L}(t)=\omega_P(t)-\phi_P(t)P_{\max}\nabla_{\omega_P}\mathcal{N}_P(\mathbf{g},\boldsymbol{\alpha};\omega_P)\nabla_P\hat{L}(t)$$
$$\omega_W(t+1)=\omega_W(t)-\phi_W(t)\nabla_{\omega_W}\hat{L}(t)=\omega_W(t)-\phi_W(t)\nabla_{\omega_W}\mathcal{N}_W(\boldsymbol{\alpha};\omega_W)\nabla_W\hat{L}(t)$$
$$\omega_\lambda(t+1)=\omega_\lambda(t)+\phi_\lambda(t)\nabla_{\omega_\lambda}\hat{L}(t)=\omega_\lambda(t)+\phi_\lambda(t)\nabla_{\omega_\lambda}\mathcal{N}_\lambda(\boldsymbol{\alpha};\omega_\lambda)\nabla_\lambda\hat{L}(t)$$
where $\hat{L}(t)\triangleq \frac{1}{N_b}\sum_{n=1}^{N_b}\sum_{k=1}^{K}\left[W_k+\lambda_k\left(e^{-\theta_k \hat{s}_{k,n}^{(t)}}-e^{-\theta_k B_k^E}\right)\right]$, $\hat{s}_{k,n}^{(t)}$ is computed by (1) with a realization of the large-scale channel gain and a realization of the small-scale channel gain, and $N_b$ is the batch size in each iteration. The gradient matrices $\nabla_{\omega_W}\mathcal{N}_W(\boldsymbol{\alpha};\omega_W)$, $\nabla_{\omega_P}\mathcal{N}_P(\mathbf{g},\boldsymbol{\alpha};\omega_P)$ and $\nabla_{\omega_\lambda}\mathcal{N}_\lambda(\boldsymbol{\alpha};\omega_\lambda)$ can be computed through backward propagation, and $\nabla_W\hat{L}(t)$, $\nabla_P\hat{L}(t)$ and $\nabla_\lambda\hat{L}(t)$ can be computed as
$$\nabla_W\hat{L}(t)=\left\{1-\frac{1}{N_b}\sum_{n=1}^{N_b}\lambda_{k,n}^{(t)}\,\theta_k\,\frac{\partial \hat{s}_{k,n}^{(t)}}{\partial W_k}\,e^{-\theta_k \hat{s}_{k,n}^{(t)}},\;k=1,\cdots,K\right\}$$
$$\nabla_P\hat{L}(t)=\left\{-\frac{1}{N_b}\sum_{n=1}^{N_b}\lambda_{k,n}^{(t)}\,\theta_k\,\frac{\partial \hat{s}_{k,n}^{(t)}}{\partial P_k}\,e^{-\theta_k \hat{s}_{k,n}^{(t)}},\;k=1,\cdots,K\right\}$$
$$\nabla_\lambda\hat{L}(t)=\left\{\frac{1}{N_b}\sum_{n=1}^{N_b}\left(e^{-\theta_k \hat{s}_{k,n}^{(t)}}-e^{-\theta_k B_k^E}\right),\;k=1,\cdots,K\right\}$$
B. Learning to Optimize (2) by Online Training and Tracking
When the large-scale channel gains change only slightly, it is unnecessary to retrain the NNs from scratch; the model parameters only need fine-tuning.
To learn $P_k(\mathbf{g})$ and $W_k$ in an online manner, we can use a stochastic gradient method to solve problem (4). In particular, for the $l$th sampled value of $\boldsymbol{\alpha}$, we fine-tune the model parameters $\omega$ (i.e., train $\mathcal{N}(\mathbf{g};\omega)$) and update $W_k$ and $\lambda_k$ over $N$ iterations in the $l$th round as follows,
$$\omega^l(t+1)=\omega^l(t)-\phi(t)\nabla_\omega\hat{L}(t)=\omega^l(t)-\phi(t)P_{\max}\nabla_\omega\mathcal{N}(\mathbf{g};\omega)\nabla_P\hat{L}(t)$$
$$W_k^l(t+1)=\left[W_k^l(t)-\phi(t)\frac{\partial\hat{L}(t)}{\partial W_k}\right]^+$$
$$\lambda_k^l(t+1)=\left[\lambda_k^l(t)+\phi(t)\frac{\partial\hat{L}(t)}{\partial \lambda_k}\right]^+$$
where $[x]^+\triangleq\max\{x,0\}$ keeps the variables non-negative, $\hat{L}(t)$ has the same form as defined in Section III-A but herein $\hat{s}_{k,n}^{(t)}$ is the achievable rate computed with the $n$th realization of $g_k$ given the $l$th sampled value of $\boldsymbol{\alpha}$, and $N_b$ is the batch size in each iteration. When the $(l+1)$th sampled value of $\boldsymbol{\alpha}$ is observed, the $(l+1)$th round of online training is initialized as
$$\omega^{l+1}(1)=\omega^l(N),\quad \lambda_k^{l+1}(1)=\lambda_k^l(N),\quad W_k^{l+1}(1)=W_k^l(N)$$
For the first round, the values of $\omega$, $\lambda_k$ and $W_k$ can be set via pre-training.
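The warm-start schedule above, only $N$ projected gradient iterations per round, initialized from the previous round's iterate, can be illustrated on a toy scalar tracking problem. Everything here is an assumption for illustration: a quadratic per-round loss $(x-\alpha_l)^2$ stands in for $\hat{L}$, and the drift of $\alpha_l$ mimics slowly varying large-scale gains.

```python
import numpy as np

def fine_tune(x, alpha, N, phi=0.3):
    """Run N projected gradient steps on the round's loss (x - alpha)^2."""
    for _ in range(N):
        # gradient is 2(x - alpha); [x]^+ = max{x, 0} keeps x non-negative
        x = max(x - phi * 2.0 * (x - alpha), 0.0)
    return x

N = 5                                   # iterations per round, as in Fig. 3
alphas = 1.0 + 0.05 * np.arange(20)     # slow drift of the environment
x = 0.0                                 # round 1: a (poor) pre-trained start
errors = []
for alpha in alphas:
    x = fine_tune(x, alpha, N)          # warm start from the previous round
    errors.append(abs(x - alpha))

print(errors[0], errors[-1])  # warm-started rounds track alpha more closely
```

Because each round starts near the previous optimum, the residual error after $N$ steps shrinks by roughly the per-step contraction factor raised to the power $N$, which is why a few iterations per observation of $\boldsymbol{\alpha}$ suffice.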
C. Procedure and Online Computational Complexity
In the two-timescale policy, the bandwidth allocation is updated with period $T_L$ when $\boldsymbol{\alpha}$ changes, and the power allocation is updated in each frame $T_f$ when $\mathbf{g}$ changes.
With the offline training method, the functions $P_k(\mathbf{g},\boldsymbol{\alpha})$ and $W_k(\boldsymbol{\alpha})$ can be obtained once the NNs have been trained with random realizations of $\mathbf{g}$ and $\boldsymbol{\alpha}$. We only consider the time used to obtain one value of $\mathbf{P}(\mathbf{g},\boldsymbol{\alpha})$ and $\mathbf{W}(\boldsymbol{\alpha})$ via forward propagation of $\mathcal{N}_P(\mathbf{g},\boldsymbol{\alpha};\omega_P)$ and $\mathcal{N}_W(\boldsymbol{\alpha};\omega_W)$, respectively denoted as $t_P^{\mathrm{off}}$ and $t_W^{\mathrm{off}}$. The fraction of time consumed for inference per unit time is
$$\eta_{\mathrm{off}}=\frac{\frac{T_L}{T_f}t_P^{\mathrm{off}}+t_W^{\mathrm{off}}}{T_L}\times 100\% \quad (5)$$
With online learning, $P_k(\mathbf{g})$ and $W_k$ are obtained for each sampled value of $\boldsymbol{\alpha}$ with $N$ iterations. Denote the time used by each training iteration as $t^{\mathrm{on}}$, and the time used for forward propagation of $\mathcal{N}(\mathbf{g};\omega)$ to obtain one value of $\mathbf{P}(\mathbf{g})$ as $t_P^{\mathrm{on}}$. Then, the fraction of time consumed for online training and resource allocation per unit time is
$$\eta_{\mathrm{on}}=\frac{\frac{T_L}{T_f}t_P^{\mathrm{on}}+N t^{\mathrm{on}}}{T_L}\times 100\% \quad (6)$$
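Plugging the timings reported later in Table III into (5) and (6), with $T_L=100$ ms and $T_f=0.1$ ms so that $T_L/T_f=1000$, reproduces the reported fractions:

```python
# Reproduce eta_off and eta_on in (5)-(6) from the timings in Table III.
TL, Tf = 100.0, 0.1               # ms (TL = 0.1 s, Tf = 0.1 ms)
t_off_P, t_off_W = 0.028, 0.026   # ms, offline inference times
t_on_P, t_on, N = 0.025, 0.086, 11  # ms, online times and iterations

eta_off = ((TL / Tf) * t_off_P + t_off_W) / TL * 100  # -> about 28.03 %
eta_on = ((TL / Tf) * t_on_P + N * t_on) / TL * 100   # -> about 25.95 %
print(eta_off, eta_on)
```

Note that both fractions are dominated by the per-frame power inference term $(T_L/T_f)\,t_P$; the once-per-$T_L$ bandwidth inference and the $N$ online training iterations contribute less than 1 ms per 100 ms.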
IV. SIMULATION RESULTS
In this section, we compare the performance of the offline training method in Section III-A (called Method-B) and the online learning method in Section III-B (called Method-C). Since the optimal solution of problem (2) is not available, we use the method in [4] as a baseline (called Method-A).
Fig. 2. Simulation setup: cell radius $D_1=250$ m, and $D_2=50$ m.
We consider $K=10$ users, which are equally spaced on the OA segment of a road when they start to request the URLLC service, as shown in Fig. 2. $P_{\max}=36$ dBm, $N_t=32$, and $N_0=-173$ dBm/Hz. The path loss model is $35.3+37.6\lg(d_k)$, where $d_k$ is the distance between the $k$th user and the BS, and the shadowing is zero-mean log-normal with 8 dB standard deviation and 50 m correlation distance. The small-scale channels are Rayleigh fading. To show the impact of the speed of environmental change on online learning, we consider four groups of users with different velocities, where the users in each group have the same velocity. $M=100$ test locations (and hence large-scale channel gains) are uniformly selected on the road for evaluation. The remaining simulation parameters are shown in Table I. This simulation setup is used unless otherwise specified.
The fine-tuned hyper-parameters of the NNs are as follows. All the NNs have 4 hidden layers. The numbers of nodes in the hidden layers for Method-B and Method-C are 40 and 20, respectively. For offline training, each sample includes a large-scale channel gain (generated at a randomly selected location in the cell with the path loss model and shadowing) and a small-scale channel gain (generated according to the Rayleigh distribution). For online learning and for testing the offline training method, the large-scale and small-scale channel gains are generated according to the simulation setup.
TABLE I
SIMULATION PARAMETERS

Overall packet loss probability $\varepsilon_{\max}$: $10^{-5}$
Frame duration $T_f$ and DL transmission time $\tau$: 0.1 ms and 0.05 ms
DL delay bound $D_{\max}$: 10 frames (1 ms)
Transmission delay $D_t$ and decoding delay $D_c$: 1 frame each (i.e., 0.1 ms)
Packet size $u$: 20 bytes (160 bits)
Average packet arrival rate $a$: 0.2 packets/frame
Sampling period of large-scale channel gain $T_L$: 0.1 s
To evaluate the performance of the online learning and offline training methods in terms of minimizing the bandwidth and satisfying the constraints, we use the average relative error of the Lagrangian function as a metric,
$$\varepsilon_L^{AX}=\frac{1}{M}\sum_{m=1}^{M}\frac{|\hat{L}_X(m)-\hat{L}_A(m)|}{\hat{L}_A(m)},\quad X=B \text{ or } C,$$
where $\hat{L}_A(m)$ and $\hat{L}_X(m)$ are the Lagrangian functions of Method-A and Method-X obtained at the $m$th test location, respectively. For a fair comparison, we use the Lagrange multipliers of Method-A to compute the Lagrangian functions of the other two methods, such that all methods achieve the same performance in terms of satisfying the constraints.
In Fig. 3, we provide the performance of the online learning
method versus the number of iterations N. The results are
averaged over 50 rounds of independent training. It is shown
that only a few iterations are required for online learning to
achieve the same performance as offline training, even in high-speed scenarios (e.g., $N=5$ for 40 km/h and $N=11$ for 80 km/h).
Fig. 3. Performance of online learning versus $N$.
To evaluate the accuracy of the learning methods, we define the average relative error of the learned bandwidth as
$$\varepsilon_W^{AX}=\frac{1}{KM}\sum_{m=1}^{M}\sum_{k=1}^{K}\frac{|W_k^X(m)-W_k^A(m)|}{W_k^A(m)},\quad X=B \text{ or } C,$$
where $W_k^A(m)$ and $W_k^X(m)$ are the bandwidths allocated to the $k$th user by Method-A and Method-X at the $m$th test location, respectively.
The average relative error of the learned power allocation is defined in the same way. The results are provided in Table II. We can see that the relative errors are small, though the error of the learned power allocation is larger. For a given number of iterations, the accuracy of online training decreases as the velocity increases.
TABLE II
RELATIVE ERRORS OF THE LEARNED BANDWIDTH AND POWER

Method          | Method-B           | Method-C ($N=11$)
Resources       | Bandwidth | Power  | Bandwidth | Power
$v=10$ km/h     | 0.91%     | 3.66%  | 0.27%     | 1.45%
$v=20$ km/h     | 0.91%     | 3.66%  | 0.38%     | 2.02%
$v=40$ km/h     | 0.91%     | 3.66%  | 0.57%     | 2.74%
$v=80$ km/h     | 0.91%     | 3.66%  | 0.81%     | 3.86%
It is noteworthy that, to improve the generalization ability, the offline training method can also be designed to further learn the mapping from other environmental parameters to the optimal solution. Nonetheless, in practice there are always parameters that cannot be learned in this way. To demonstrate what happens in a dynamic environment, we simulate a simple scenario, where the average packet arrival rate $a$ is 0.2 packets/frame from 0 s to 5 s and changes to 0.3 packets/frame from 5 s to 10 s, and the velocities of all users are set to 40 km/h. The overall packet loss probabilities of the worst user achieved by the different learning methods are provided in Fig. 4, where the large-scale channel gain only includes the path loss, to emphasize the impact of the time-varying arrival rate. It is shown that the offline training method yields much worse reliability after $a$ changes, indicating poor generalization to the packet arrival rate. Although the online learning method also cannot ensure $\varepsilon_{\max}=10^{-5}$, due to the errors in estimating $\hat{L}(t)$ and the stochastic gradient method, its overall packet loss probability is always less than $2\varepsilon_{\max}$ and is robust to the abrupt change of $a$. This suggests that the reliability of the online learning method can be ensured by setting $\varepsilon_{\max}$ a little conservatively during training, say as $0.5\times 10^{-5}$.
Fig. 4. Achieved overall reliability with time-varying $\boldsymbol{\alpha}$ and $a$, $N=5$.
In Table III, we compare the computational complexity of the offline and online training methods, where $t_P^{\mathrm{off}}$, $t_W^{\mathrm{off}}$, $t_P^{\mathrm{on}}$ and $t^{\mathrm{on}}$ are measured on an Intel® Core™ i7-8700K CPU @3.70 GHz. Recall from Table I that the sampling period of the large-scale channel gain is $T_L=100$ ms and $T_L/T_f=1000$. Then, $\eta_{\mathrm{on}}=25.95\%$ means that about 26 ms is used for online training of $\mathcal{N}(\mathbf{g};\omega)$, obtaining one optimal value of $\mathbf{W}$ and 1000 optimal values of $\mathbf{P}$. Further recalling the definition of $\eta_{\mathrm{on}}$, this means that on average 0.26 ms is used in every 1 ms (i.e., the time budget for learning and inference is 1/4 of the E2E delay bound). For Method-B, on average 0.28 ms is used for inference in every 1 ms, without considering the time for offline training.
TABLE III
COMPUTATIONAL COMPLEXITY, ALL USERS MOVE AT 80 KM/H

Offline Training Method          | $t_P^{\mathrm{off}}$ | $t_W^{\mathrm{off}}$ | $\eta_{\mathrm{off}}$
Computational Time               | 0.028 ms             | 0.026 ms             | 28.03%

Online Training Method ($N=11$)  | $t_P^{\mathrm{on}}$  | $t^{\mathrm{on}}$    | $\eta_{\mathrm{on}}$
Computational Time               | 0.025 ms             | 0.086 ms             | 25.95%
V. CONCLUSION
In this paper, we jointly optimized the bandwidth and power allocation in URLLC with deep learning, where the NNs are trained either online or offline. Simulation results showed that the proposed online learning method can achieve the system performance of offline training with a few steps of online update and training, and is robust to changes of the environmental parameters. Evaluated on a regular computer, the time consumed by online training of an NN and optimizing the resource allocation is on average one fourth of the E2E delay bound. By setting the overall packet loss probability a little conservatively, the reliability can be guaranteed by the online learning method. In conclusion, our results suggest that online deep learning can be applied for real-time resource allocation in dynamic environments with low online computational complexity.
REFERENCES
[1] C. She, R. Dong, Z. Gu, Z. Hou, Y. Li, W. Hardjawana, C. Yang,
L. Song, and B. Vucetic, “Deep learning for ultra-reliable and low-
latency communications in 6G networks,” IEEE Netw., vol. 34, no. 5,
pp. 219–225, 2020.
[2] H. Sun, X. Chen, Q. Shi, M. Hong, X. Fu, and N. D. Sidiropoulos,
“Learning to optimize: Training deep neural networks for interference
management,” IEEE Trans. Signal Process., vol. 66, no. 20, pp. 5348–
5453, Oct. 2018.
[3] A. Alkhateeb, S. Alex, P. Varkey, Y. Li, Q. Qu, and D. Tujkovic, “Deep
learning coordinated beamforming for highly-mobile millimeter wave
systems,” IEEE Access, vol. 6, pp. 37 328–37 348, May 2018.
[4] C. Sun and C. Yang, “Unsupervised deep learning for ultra-reliable and
low-latency communications,” IEEE GLOBECOM, 2019.
[5] E. Zeidler, Nonlinear functional analysis and its applications: III:
variational methods and optimization. Springer Science & Business
Media, 2013.
[6] M. Eisen, C. Zhang, L. F. Chamon, D. D. Lee, and A. Ribeiro, “Learning
optimal resource allocations in wireless systems,” IEEE Trans. Signal
Process., vol. 67, no. 10, pp. 2775–2790, May 2019.
[7] C. Sun and C. Yang, “Learning to optimize with unsupervised learning:
Training deep neural networks for URLLC,” IEEE PIMRC, 2019.
[8] A. T. Z. Kasgari, W. Saad, M. Mozaffari, and H. V. Poor, “Experi-
enced deep reinforcement learning with generative adversarial networks
(GANs) for model-free ultra reliable low latency communication,” IEEE
Trans. Commun., pp. 1–1, 2020.
[9] W. Yang, G. Durisi, T. Koch, et al., “Quasi-static multiple-antenna fading
channels at finite blocklength,” IEEE Trans. Inf. Theory, vol. 60, no. 7,
pp. 4232–4264, Jul. 2014.
[10] J. G. S. Schiessl and H. Al-Zubaidy, “Delay analysis for wireless fading
channels with finite blocklength channel coding,” ACM MSWiM, 2015.
[11] C. She, C. Yang, and T. Q. S. Quek, “Joint uplink and downlink resource
configuration for ultra-reliable and low-latency communications,” IEEE
Trans. Commun., vol. 66, no. 5, pp. 2266–2280, May 2018.
[12] B. Makki, T. Svensson, G. Caire, and M. Zorzi, “Fast HARQ over finite
blocklength codes: A technique for low-latency reliable communication,”
IEEE Trans. Wireless Commun., vol. 18, no. 1, pp. 194–209, Jan 2019.
[13] M. Condoluci, T. Mahmoodi, E. Steinbach, and M. Dohler, “Soft re-
source reservation for low-delayed teleoperation over mobile networks,”
IEEE Access, vol. 5, pp. 10 445–10 455, May 2017.
[14] J. Tang and X. Zhang, “Quality-of-service driven power and rate
adaptation over wireless links,” IEEE Trans. Wireless Commun., vol. 6,
no. 8, pp. 3058–3068, 2007.