Deep Recurrent Reinforcement Learning for Partially Observable User Association in a Vertical Heterogeneous Network
Hesam Khoshkbari and Georges Kaddoum
Abstract: To ensure ubiquitous connectivity and meet increasing users' demands in next-generation wireless networks, we investigate user association in a three-layer network consisting of a terrestrial base station (TBS), a high-altitude platform station (HAPS), and a satellite. To maintain scalability and reduce frequent channel state information (CSI) exchanges across layers, we assume only the CSI of previously associated links is available. To address our partially observable user association problem, we propose the action-specific deep recurrent Q-network (ADRQN) method. This involves incorporating a long short-term memory (LSTM) layer alongside fully connected layers, utilizing both observation and action vectors in the policy network. We compare our proposed ADRQN method to the exhaustive search, deep Q-network, deep recurrent Q-network, and convex optimization methods and demonstrate the necessity of using the action vector. Finally, we consider the absence of perfect CSI and show that our ADRQN agent outperforms the convex optimization method in this scenario.

Index Terms: HAPS, deep recurrent reinforcement learning, user association, partially observable Markov decision process.
I. INTRODUCTION
Deploying non-terrestrial base stations (NTBSs), such as very low earth orbit satellites (VLEOs) and high-altitude platform stations (HAPSs), is crucial to meet future wireless communication demands and ensure global connectivity. In this regard, several studies have explored the applications and challenges of utilizing HAPSs [1] and VLEOs [2]. However, integrating NTBSs into existing terrestrial networks poses resource allocation challenges. To tackle this, [3], [4] investigated user scheduling and link association in coexisting TBSs and NTBSs, aiming to align each user with the appropriate radio access network. The authors in [5] applied successive convex approximation for power control to mitigate interference in cases of coexisting TBSs and a HAPS. The authors of [6] proposed a beamforming scheme for an integrated satellite-HAPS network to investigate the trade-off between sum-rate maximization and total transmit power minimization. The authors of [7] investigated multicast communication in a satellite-aerial integrated network with rate-splitting multiple access to increase the network's sum-rate. Furthermore, several studies have turned to deep reinforcement learning (DRL) for wireless network resource allocation [8], [9], [10].
In [8], DRL was applied to user association among NTBSs, aiming to maximize the network's sum-rate while reducing handoffs due to NTBS mobility. In [9], DRL was utilized for power control in a HAPS, mitigating interference with a cellular network. The authors of [10] employed reinforcement learning (RL) in a low earth orbit (LEO) satellite network to distribute the load of overloaded LEO satellites to adjacent ones, thereby maximizing the satellites' battery lifetime.
Although instructive, the problem of user association in a three-tier network composed of a terrestrial layer (i.e., a TBS), an aerial layer (i.e., a HAPS), and a space layer (i.e., a satellite) has not been fully explored. Maximizing the sum-rate necessitates connecting each user to the base station (BS) with the highest link quality. To achieve this, the DRL agent requires the channel state information (CSI) of each user for all layers. However, this approach incurs significant overhead in the network and is not scalable, since the amount of required CSI increases with the number of layers. To overcome this issue, we assume that each user knows only the channel between itself and its previously associated BS, which ensures that the required CSI is limited to the number of users and independent of the number of layers. However, this assumption yields a partially observable problem. To solve partially observable Markov decision process (POMDP) problems, the authors in [11] and [12] studied channel association and antenna selection, respectively, where the agent's knowledge is limited to the channels observed in the previous time slot. They modelled the channels as being in a good or bad state, derived transition probabilities, computed the belief using Bayes' rule, and applied RL to solve the problem. In our scenario, with three layers having varying link qualities, the channels cannot be independently modelled with only two states. Additionally, the continuous state space and expansive action space make RL less effective in our case.
In our previous work [13], we employed a deep Q-network (DQN) agent to perform user association between a TBS and a HAPS using global CSI. As stated in [14], HAPSs are seen as supplements to current terrestrial-satellite networks. Thus, we extend our previous work by adding a space layer to our network and operating under the assumption of partial observability. Note that this assumption reduces the CSI sharing overhead to one-third of the global CSI requirement. In this letter, to solve the user association problem in a three-tier network under the partial observability assumption, we use the action-specific deep recurrent Q-network (ADRQN) method [15]. In this case, each user only sends the channel between itself and its previously associated BS to the agent. We compare our method with convex optimization, DQN, and exhaustive search, demonstrating its superior performance over DQN. Additionally, our ADRQN agent outperforms convex optimization under imperfect CSI.
Fig. 1. System model illustration.
II. SYSTEM MODEL AND PROBLEM FORMULATION

This letter aims to solve the user association problem in a three-layer wireless network to maximize its sum-rate while considering the partially observable CSI assumption (see Fig. 1). Specifically, we assume that our DRL agent is located in a TBS equipped with $M$ antennas, where at each time slot $t$, it schedules $U$ single-antenna users to be served by itself, a HAPS with $N$ antennas, or a satellite with $E$ antennas. We make the assumption that each layer in our system model operates at a different frequency band. Considering $\mathcal{K}_t$, $\mathcal{K}'_t$, and $\mathcal{K}''_t$ as the sets of users served by the TBS, the HAPS, and the satellite, respectively, we note that each user can be served by only one BS at a time, i.e., $\mathcal{K}_t \cap \mathcal{K}'_t \cap \mathcal{K}''_t = \emptyset$.
We denote $\mathbf{H}_t \triangleq [\mathbf{h}_{1,t}, \mathbf{h}_{2,t}, \cdots, \mathbf{h}_{U,t}]^T$ as the channel matrix between the TBS and all users, where $\mathbf{h}_{u,t} = [h_{ui,t}]_{i=1}^{M}$ is an $M \times 1$ vector of the channel coefficients between the $u$-th user and the TBS's antennas. We assume the first-order Gauss-Markov channel model for $u \in \mathcal{K}_t$ at time slot $t$ as

$$\mathbf{h}_{u,t} \triangleq \xi_u \mathbf{h}_{u,t-1} + \sqrt{1-\xi_u^2}\,\mathbf{z}_{u,t}, \quad \text{for } u = 1, \ldots, U, \qquad (1)$$

where $\xi_u \in [0,1]$ is the fading correlation factor, $\mathbf{h}_{u,0} \sim \mathcal{CN}(\mathbf{0}_{M\times 1}, \boldsymbol{\sigma}^2_{h,u})$, and $\mathbf{z}_{u,t} \sim \mathcal{CN}(\mathbf{0}_{M\times 1}, \boldsymbol{\sigma}^2_{h,u})$, where $\boldsymbol{\sigma}^2_{h,u} = \sigma^2_{h,u}\mathbf{I}_{M}$. Here, $\sigma^2_{h,u}$ is derived using the path loss model $\sigma^2_{h,u} = \rho d_u^{-3}$.
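To make the channel model concrete, the short Python sketch below generates TBS channels according to (1) with a path-loss-based variance; the function name, the fixed number of time slots, and the unit value of $\rho$ are illustrative assumptions rather than details from the letter.

```python
import numpy as np

def generate_tbs_channels(num_users, num_antennas, distances, xi, rho=1.0, num_slots=10, rng=None):
    """First-order Gauss-Markov TBS channels, one M x 1 vector per user per slot (cf. (1))."""
    rng = np.random.default_rng() if rng is None else rng
    sigma2 = rho * distances ** (-3)                     # path-loss-based channel variance per user
    def cn(var, size):                                   # circularly symmetric complex Gaussian samples
        return np.sqrt(var / 2) * (rng.standard_normal(size) + 1j * rng.standard_normal(size))
    H = np.zeros((num_slots, num_users, num_antennas), dtype=complex)
    for u in range(num_users):
        h = cn(sigma2[u], num_antennas)                  # h_{u,0} ~ CN(0, sigma^2_{h,u} I)
        for t in range(num_slots):
            z = cn(sigma2[u], num_antennas)              # innovation z_{u,t}
            h = xi[u] * h + np.sqrt(1 - xi[u] ** 2) * z  # Gauss-Markov update of (1)
            H[t, u] = h
    return H

# Example: 6 users, 16 TBS antennas, correlation factor 0.9 (values chosen for illustration)
H = generate_tbs_channels(6, 16, distances=np.full(6, 1000.0), xi=np.full(6, 0.9))
```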
Assuming $\mathbf{w}_{u,t} \in \mathbb{C}^{M\times 1}$ is the zero-forcing beamforming (ZFB) vector that cancels out the intra-cell interference between user $u$ and the TBS, the signal-to-interference-plus-noise ratio (SINR) for $u \in \mathcal{K}_t$ is obtained as $\gamma_{u\in\mathcal{K}_t} = \frac{|\mathbf{h}_{u,t}\mathbf{w}_{u,t}|^2}{\hat{\sigma}^2_n}$, where $\hat{\sigma}^2_n$ is the average power of the total received noise. Thus, the sum-rate of users served by the TBS can be derived as

$$R = \sum_{u\in\mathcal{K}_t} \log_2\left(1 + \gamma_{u\in\mathcal{K}_t}\right). \qquad (2)$$
Following the above formulation, we denote $\mathbf{G}_t \triangleq [\mathbf{g}_{1,t}, \mathbf{g}_{2,t}, \cdots, \mathbf{g}_{U,t}]^T$, where $\mathbf{g}_{u,t} = [g_{uj,t}]_{j=1}^{N}$ is an $N \times 1$ vector of channel coefficients between the $u$-th user and the HAPS antennas. Assuming the same line-of-sight (LoS) component for all users $u \in \mathcal{K}'_t$, as explained in [16], we utilize the first-order Gauss-Markov channel model to capture small variations in the channel links as

$$\mathbf{g}_{u,t} \triangleq \mathbf{g}_{LoS} + \xi_u \mathbf{g}_{u,t-1} + \sqrt{1-\xi_u^2}\,\mathbf{z}'_{u,t}, \quad \text{for } u = 1, \ldots, U, \qquad (3)$$

where $\mathbf{g}_{u,0} \sim \mathcal{CN}(\mathbf{0}_{N\times 1}, \boldsymbol{\sigma}'^2_{h,u})$, $\mathbf{g}_{LoS} = (g_{LoS})\mathbf{1}_{N\times 1}$, and $\mathbf{z}'_{u,t} \sim \mathcal{CN}(\mathbf{0}_{N\times 1}, \boldsymbol{\sigma}'^2_{h,u})$, where $\sigma'^2_{h,u}$ is the variance of the channel between the $u$-th user and the HAPS antennas. Assuming the number of users associated with the HAPS is lower than the number of antennas at the HAPS (i.e., $|\mathcal{K}'_t| \leq N$), we use ZFB to eliminate the intra-cell interference. Thus, we can write the sum-rate of users $u \in \mathcal{K}'_t$ as

$$R' = \sum_{u\in\mathcal{K}'_t} \log_2\left(1 + \gamma'_{u\in\mathcal{K}'_t}\right), \qquad (4)$$

where $\gamma'_{u\in\mathcal{K}'_t} = \frac{|\mathbf{g}_{u,t}\mathbf{w}'_{u,t}|^2}{\sigma'^2_n}$. Here, $\mathbf{w}'_{u,t} = [\mathbf{W}'_t]_u = [\hat{\mathbf{G}}^H_t(\hat{\mathbf{G}}_t\hat{\mathbf{G}}^H_t)^{-1}]_u$ is the $u$-th user's zero-forcing precoding vector, and $\sigma'^2_n$ is the average noise power.
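The per-layer rate computations in (2), (4), and (6) share the same zero-forcing structure. The sketch below illustrates it for one layer, assuming the scheduled users' channel vectors are stacked row-wise; the unit-norm precoder normalization is our own assumption, since the letter only states that uniform power allocation is used.

```python
import numpy as np

def zf_sum_rate(G_sched, noise_power):
    """Sum-rate of the users scheduled on one layer under zero-forcing beamforming.

    G_sched: (K x N) matrix whose rows are the scheduled users' channel vectors;
    K <= N is required for the right pseudo-inverse to exist."""
    K, N = G_sched.shape
    assert K <= N, "ZF needs at least as many antennas as scheduled users"
    W = G_sched.conj().T @ np.linalg.inv(G_sched @ G_sched.conj().T)  # right pseudo-inverse, N x K
    W = W / np.linalg.norm(W, axis=0, keepdims=True)                  # unit-norm precoders (assumed normalization)
    rates = []
    for u in range(K):
        signal = np.abs(G_sched[u] @ W[:, u]) ** 2                    # intended-link power after ZF
        rates.append(np.log2(1.0 + signal / noise_power))
    return float(np.sum(rates))
```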
Similarly, we denote $\mathbf{X}_t \triangleq [\mathbf{x}_{1,t}, \mathbf{x}_{2,t}, \cdots, \mathbf{x}_{U,t}]^T$, where $\mathbf{x}_{u,t} = [x_{uj,t}]_{j=1}^{E}$ is an $E \times 1$ vector of channel coefficients between the $u$-th user and the satellite's antennas. Considering the same LoS component for $u \in \mathcal{K}''_t$, the satellite's channels can be written as

$$\mathbf{x}_{u,t} \triangleq \mathbf{x}_{LoS} + \xi'_u \mathbf{x}_{u,t-1} + \sqrt{1-\xi'^2_u}\,\mathbf{z}''_{u,t}, \quad \text{for } u = 1, \ldots, U, \qquad (5)$$

where $\mathbf{x}_{u,0} \sim \mathcal{CN}(\mathbf{0}_{E\times 1}, \boldsymbol{\sigma}''^2_{h,u})$, $\mathbf{x}_{LoS} = (x_{LoS})\mathbf{1}_{E\times 1}$, and $\mathbf{z}''_{u,t} \sim \mathcal{CN}(\mathbf{0}_{E\times 1}, \boldsymbol{\sigma}''^2_{h,u})$. Here, $\sigma''^2_{h,u}$ is the variance of the channel between the $u$-th user and the satellite antennas. To calculate the sum-rate of users $u \in \mathcal{K}''_t$, assuming $|\mathcal{K}''_t| \leq E$, ZFB vectors $\mathbf{w}''_{u,t}$ to eliminate the intra-cell interference, and $\gamma''_{u\in\mathcal{K}''_t} = \frac{|\mathbf{x}_{u,t}\mathbf{w}''_{u,t}|^2}{\sigma''^2_n}$ with $\sigma''^2_n$ being the average noise power, we get

$$R'' = \sum_{u\in\mathcal{K}''_t} \log_2\left(1 + \gamma''_{u\in\mathcal{K}''_t}\right). \qquad (6)$$
Consequently, our optimization problem can be written as

$$\begin{aligned}
\max_{\mathcal{K}_t,\mathcal{K}'_t,\mathcal{K}''_t} \quad & R + R' + R'' \\
\text{s.t.} \quad & |\mathcal{K}'_t| \leq N, \\
& |\mathcal{K}''_t| \leq E, \\
& \mathcal{K}_t \cap \mathcal{K}'_t \cap \mathcal{K}''_t = \emptyset, \qquad (7)
\end{aligned}$$

where the constraint $\mathcal{K}_t \cap \mathcal{K}'_t \cap \mathcal{K}''_t = \emptyset$ ensures that each user can be served by only one BS. Our optimization problem in (7) can be viewed as joint antenna selection at the NTBSs (i.e., choosing the number of active antennas at the NTBSs) and user association between the three layers, which is NP-hard and non-convex. More specifically, we assume our agent only has access to a subset of the CSI at each time slot.
III. OUR PROPOSED DEEP RECURRENT Q-NETWORK ALGORITHM

A DQN agent assumes complete observability of the state space, limiting its performance in POMDP problems. On the other hand, conventional methods assume that the dynamics of the environment, i.e., the transition probabilities, are known and use Bayes' theorem for belief estimation and problem-solving. However, our work involves a continuous state space and three system tiers, rendering a model-based environment representation impractical.
Fig. 2. Integration of the ADRQN agent at the TBS.
To solve this issue, the deep recurrent Q-network (DRQN) [17] method was introduced, which uses a long short-term memory (LSTM) layer alongside fully connected neural networks (FCNNs) in the policy network to estimate the Q-functions. However, the DRQN method overlooks the action, which, as we will show in our simulation results, degrades the performance.
In this letter, we use the ADRQN method, which concatenates the action vector with the observation vector and feeds the result to the policy network to estimate the action-state value function, i.e., $Q(\mathbf{v}_{t-1}, \mathbf{a}_{t-1}, \mathbf{o}_t\,|\,\theta)$ at time slot $t$. Here, $\theta$ denotes the weights of the policy network, in which an LSTM layer is used in addition to the fully connected layers to better approximate the Q-function, $\mathbf{a}_{t-1}$ is the action vector executed at time slot $t-1$, $\mathbf{o}_t$ is the observation vector (we do not have access to the exact state vector $\mathbf{s}_t$), and $\mathbf{v}_{t-1}$ denotes the output of the LSTM layer, which is defined as

$$\mathbf{v}_t = \mathrm{LSTM}(\mathbf{v}_{t-1}, \mathbf{a}_{t-1}, \mathbf{o}_t). \qquad (8)$$
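A minimal PyTorch sketch of such a policy network is shown below. The hidden dimension, the two-layer fully connected head, and the one-hot encoding of the previous action are illustrative choices of ours; the letter does not report the exact architecture sizes.

```python
import torch
import torch.nn as nn

class ADRQNPolicy(nn.Module):
    """LSTM + fully connected Q-network that consumes (previous action, current observation)."""
    def __init__(self, num_users, num_actions, hidden_dim=128):
        super().__init__()
        # the previous action is one-hot encoded over the discrete action space
        self.lstm = nn.LSTM(input_size=num_users + num_actions, hidden_size=hidden_dim, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_actions),          # one Q-value per candidate association vector
        )

    def forward(self, obs_seq, prev_action_seq, hidden=None):
        # obs_seq: (batch, seq_len, num_users); prev_action_seq: (batch, seq_len) action indices
        a_onehot = torch.nn.functional.one_hot(prev_action_seq, self.head[-1].out_features).float()
        x = torch.cat([obs_seq, a_onehot], dim=-1)       # concatenate observation and action, cf. (8)
        v, hidden = self.lstm(x, hidden)                 # v_t: recurrent summary of the history
        return self.head(v), hidden                      # Q(v_{t-1}, a_{t-1}, o_t | theta)
```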
We employ a replay buffer and a target network to increase the stability of our ADRQN agent during training [18]. As depicted in Fig. 2, we store the transition tuple $(\{\mathbf{a}_{t-1}, \mathbf{o}_t\}, \mathbf{a}_t, R(\mathbf{s}_t, \mathbf{a}_t), \mathbf{o}_{t+1})$ in buffer $\mathcal{D}$ at each time slot $t$ and employ a random mini-batch of data to update our ADRQN agent. Additionally, we maintain a target network with weights $\tilde{\theta}$, serving as a delayed copy of our main (policy) network. Following the above explanations, we use the following loss function for updating our main network:

$$L(\theta) = \mathbb{E}_{(\{\mathbf{a}^-,\mathbf{o}\},\,\mathbf{a},\,R(\mathbf{s},\mathbf{a}),\,\mathbf{o}^+)\sim\mathcal{D}}\Big[\big(y - Q(\mathbf{v}^-,\mathbf{a}^-,\mathbf{o}\,|\,\theta)\big)^2\Big], \qquad (9)$$

where $\mathbf{v}^-$ and $\mathbf{a}^-$ are the previous LSTM output and action, respectively, $\mathbf{o}^+$ is the next observation, and $y$ is defined as

$$y = R(\mathbf{s},\mathbf{a}) + \gamma \max_{\mathbf{a}\in\mathcal{A}} Q\big(\mathbf{v},\mathbf{a},\mathbf{o}^+\,|\,\tilde{\theta}\big). \qquad (10)$$
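The loss in (9) and the bootstrapped target in (10) can be evaluated on a sampled mini-batch roughly as follows, assuming the policy-network sketch above; the tensor layout of the batch is our own convention, not one given in the letter.

```python
import torch

def adrqn_td_loss(policy_net, target_net, batch, gamma=0.99):
    """Squared TD error of (9)-(10) for a mini-batch of stored transitions.

    batch: dict of tensors with keys
      'prev_action' (B, T) long, 'obs' (B, T, U), 'action' (B, T) long,
      'reward' (B, T), 'next_obs' (B, T, U)."""
    q_all, _ = policy_net(batch['obs'], batch['prev_action'])            # Q(v^-, a^-, o | theta), (B, T, |A|)
    q_taken = q_all.gather(-1, batch['action'].unsqueeze(-1)).squeeze(-1)
    with torch.no_grad():                                                # target network is not trained directly
        q_next, _ = target_net(batch['next_obs'], batch['action'])       # Q(v, a, o^+ | theta~)
        y = batch['reward'] + gamma * q_next.max(dim=-1).values          # bootstrapped target, cf. (10)
    return torch.mean((y - q_taken) ** 2)                                # empirical version of (9)
```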
In the sequel, we define the action space, observation space, and reward function, and then explain our training procedure.

1) Action: Our action space $\mathcal{A}$ is made up of $U \times 1$ action vectors $\mathbf{a}_i = [a_{1,i}\ a_{2,i}\ \ldots\ a_{U,i}]^T$, where $\mathbf{a}_i$ represents the $i$-th possible user scheduling vector. Here, $a_{u,i} = 0$ denotes that user $u$ is associated with the TBS, $a_{u,i} = 1$ means the HAPS is serving user $u$, and $a_{u,i} = 2$ denotes that user $u$ is associated with the VLEO. Finally, considering the constraints in our optimization problem (7), $\sum_{u=1}^{U} \mathbb{1}[a_{u,i} = 1] \leq N$ and $\sum_{u=1}^{U} \mathbb{1}[a_{u,i} = 2] \leq E$ must be satisfied.
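Since the action space is discrete, the feasible association vectors can simply be enumerated offline. The snippet below sketches this enumeration under the antenna constraints of (7); the function and argument names are ours.

```python
from itertools import product

def feasible_actions(num_users, haps_antennas, sat_antennas):
    """Enumerate association vectors a in {0,1,2}^U that satisfy the constraints of (7):
    at most N users on the HAPS (label 1) and at most E users on the satellite (label 2)."""
    actions = []
    for a in product((0, 1, 2), repeat=num_users):
        if a.count(1) <= haps_antennas and a.count(2) <= sat_antennas:
            actions.append(a)
    return actions

# Example: U = 6 users, N = 2 HAPS antennas, E = 1 satellite antenna
A = feasible_actions(6, 2, 1)
print(len(A))  # size of the discrete action space |A|
```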
2) Reward: In accordance with our notation in Section II, our reward function is defined as the sum-rate of the network, i.e., $R(\mathbf{s},\mathbf{a}) = R + R' + R''$.

3) Observations: At time slot $t$, our $U \times 1$ observation vector $\mathbf{o}_t \in \mathcal{O}$ is defined as $\mathbf{o}_t = [o_{1,t}\ o_{2,t}\ \ldots\ o_{U,t}]^T$, with

$$o_{u,t} = \begin{cases} \|\mathbf{h}_{u,t}\|^2, & \text{if } a_u = 0, \\ \|\mathbf{g}_{u,t}\|^2, & \text{if } a_u = 1, \\ \|\mathbf{x}_{u,t}\|^2, & \text{if } a_u = 2, \end{cases} \qquad (11)$$

where $a_u$ is the $u$-th element of the action vector executed at time slot $t-1$.
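Building the observation vector then reduces to reading out one squared channel norm per user, as sketched below; the array-based interface is an assumption for illustration.

```python
import numpy as np

def build_observation(prev_action, H_t, G_t, X_t):
    """Partial observation of (11): each user reports only the squared norm of the channel
    to the BS it was associated with at the previous time slot.

    prev_action: length-U sequence with entries in {0: TBS, 1: HAPS, 2: satellite}
    H_t, G_t, X_t: per-user channel vectors of the TBS, HAPS, and satellite layers."""
    obs = np.empty(len(prev_action))
    for u, a_u in enumerate(prev_action):
        channel = (H_t[u], G_t[u], X_t[u])[int(a_u)]  # pick the previously associated layer
        obs[u] = np.linalg.norm(channel) ** 2         # ||.||^2 observation entry
    return obs
```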
Training Procedure: We start by populating buffer $\mathcal{D}$ with random values and initializing the main and target networks to be the same, i.e., $\tilde{\theta} = \theta$. To explore different system layouts and prevent overfitting to specific user locations, we run $I_{episode}$ episodes, where the environment is reset at the beginning of each episode, redistributing the users and regenerating the channels as discussed in Section II. We implement the $\varepsilon$-greedy exploration strategy, progressively reducing the exploration probability throughout the training phase, starting with $\varepsilon_1 = 1$ for the initial episode.

At each time slot, the ADRQN agent either executes a random action or feeds $\{\mathbf{a}_{t-1}, \mathbf{o}_t\}$ to the main network and chooses the action with the highest Q-value. Afterwards, the agent receives the reward value $R(\mathbf{s}_t, \mathbf{a}_t)$ and the new observation $\mathbf{o}_{t+1}$. We then store this information in the buffer and, every $\eta$ time slots, sample a mini-batch of data to update the main network. Furthermore, every $\eta'$ time slots, we update the target network by replacing its weights with the main network's weights. Finally, we save the model every $\eta''$ episodes to load it in the testing phase and assess its performance.
IV. SIMULATION RESULTS

A. Benchmarks

1) DRQN: This method is similar to our ADRQN method, the only difference being that DRQN overlooks the action vector, which, as we will show, negatively affects performance.

2) DQN: We consider three scenarios for the DQN method: 1) the DQN-F method, which has access to the global CSI, i.e., $\mathbf{h}$, $\mathbf{g}$, and $\mathbf{x}$; 2) the DQN-P method, which only receives the observations and does not use the actions; and 3) the DQN-PA method, which receives the same input as our ADRQN agent.
3) Exhaustive Search (ES) Selection: This method evaluates all possible actions at each time slot and chooses the one with the highest reward value. Note that this method is impractical and operates independently of states or observations.

4) Random Selection: This method selects actions randomly, independently of states or observations.

5) Convex Optimization (CO): We assume fixed power instead of uniform power allocation and solve our problem with CO for a better comparison. Note that the CO method needs the global CSI to solve (7).

6) Convex Optimization With Uncertainty Bounds (COUB): For further investigation, we add to our imperfect CSI comparisons the method used in [19], which employs uncertainty bounds to solve the convex optimization problem with imperfect CSI. Note that, like the CO method, the COUB method also needs the global CSI to solve (7).
Algorithm 1: Training Algorithm of Our Proposed ADRQN Framework

Input: update frequencies of the main network, the target network, and model saving $\eta$, $\eta'$, $\eta''$; number of episodes and time slots $I_{episode}$, $L$; exploration probability at the $i$-th episode $\varepsilon_i$; final $\varepsilon_i$ value $\varepsilon_{final}$; final $\varepsilon_i$ updating episode $I$;
1: Initialize: buffer $\mathcal{D}$, main network's weights $\theta$ randomly, target network's weights $\tilde{\theta} = \theta$;
2: for $i \leftarrow 1$ to $I_{episode}$ do
3:   Reset the environment and initialize $\mathbf{o}_0$;
4:   Update $\varepsilon_i = 1 - i\,\frac{1-\varepsilon_{final}}{I}$ if $1 < i < I$, and $\varepsilon_i = \varepsilon_{final}$ if $I \leq i \leq I_{episode}$;
5:   for $t \leftarrow 1$ to $L$ do
6:     With probability $\varepsilon_i$, randomly select an action; otherwise, feed $\{\mathbf{a}_{t-1}, \mathbf{o}_t\}$ to the main network and choose $\mathbf{a}_t = \arg\max_{\mathbf{a}} Q(\mathbf{v}_{t-1}, \mathbf{a}_{t-1}, \mathbf{o}_t\,|\,\theta)$;
7:     Execute action $\mathbf{a}_t$, receive the reward $R(\mathbf{s}_t, \mathbf{a}_t)$ and the new observation $\mathbf{o}_{t+1}$, and store $(\{\mathbf{a}_{t-1}, \mathbf{o}_t\}, \mathbf{a}_t, R(\mathbf{s}_t, \mathbf{a}_t), \mathbf{o}_{t+1})$ in $\mathcal{D}$;
8:     if $t\,\%\,\eta = 0$ then
9:       Sample a mini-batch of data and update the main network using (9);
10:    if $t\,\%\,\eta' = 0$ then
11:      Update the target network: $\tilde{\theta} \leftarrow \theta$;
12:  if $i\,\%\,\eta'' = 0$ then
13:    Save the learned model;
Output: Learned ADRQN model $Q(\mathbf{v}, \mathbf{a}, \mathbf{o}\,|\,\theta)$
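For readers who prefer code, the loop below sketches Algorithm 1 in Python using the earlier snippets. The environment interface (reset/step), the Adam optimizer, the mini-batch size, and the make_batch helper that stacks stored transitions into tensors are placeholders we introduce for illustration; the model-saving step (lines 12-13) is omitted for brevity.

```python
import random
import torch
from collections import deque

def train_adrqn(env, policy_net, target_net, num_actions,
                num_episodes=1000, slots_per_episode=10,
                eta=4, eta_p=100, eps_final=0.05, eps_decay_episodes=500, gamma=0.99, lr=1e-3):
    """Sketch of Algorithm 1. `env` is assumed to expose reset() -> o0 and
    step(action_index) -> (reward, next_obs); observations are length-U numpy arrays."""
    optimizer = torch.optim.Adam(policy_net.parameters(), lr=lr)
    buffer = deque(maxlen=10_000)
    target_net.load_state_dict(policy_net.state_dict())          # line 1: theta~ = theta
    for i in range(1, num_episodes + 1):
        obs = env.reset()                                         # line 3
        eps = max(eps_final, 1.0 - i * (1.0 - eps_final) / eps_decay_episodes)  # line 4
        prev_action, hidden = 0, None
        for t in range(1, slots_per_episode + 1):
            if random.random() < eps:                             # line 6: epsilon-greedy exploration
                action = random.randrange(num_actions)
            else:
                o = torch.tensor(obs, dtype=torch.float32).view(1, 1, -1)
                a = torch.tensor([[prev_action]])
                with torch.no_grad():
                    q, hidden = policy_net(o, a, hidden)
                action = int(q.argmax())
            reward, next_obs = env.step(action)                   # line 7
            buffer.append((prev_action, obs, action, reward, next_obs))
            if t % eta == 0 and len(buffer) >= 32:                # lines 8-9: update main network
                batch = make_batch(random.sample(buffer, 32))     # assumed helper that stacks tensors
                loss = adrqn_td_loss(policy_net, target_net, batch, gamma)
                optimizer.zero_grad(); loss.backward(); optimizer.step()
            if t % eta_p == 0:                                    # lines 10-11: refresh target network
                target_net.load_state_dict(policy_net.state_dict())
            prev_action, obs = action, next_obs
```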
B. Numerical Results

During training, we assume $U = 6$ single-antenna users distributed in a cell with a 3 km radius. We consider a TBS with $M = 16$ antennas and a supply power of $P = 10$ dB, a HAPS at an altitude of 18 km equipped with $N = 2$ antennas and a supply power of $P' = 10$ dB, and a VLEO at an altitude of 250 km equipped with $E = 1$ antenna and a supply power of $P'' = 10$ dB. We utilize the aperture antenna model [20] for both the HAPS and the satellite and design it for maximum gain at the cell border. We use the free-space path loss model to generate large-scale fading for the HAPS and the satellite. To account for atmospheric attenuation at the satellite, we utilize the model proposed in [21]. For power allocation, we assume a uniform distribution. For the CO and COUB methods, we allocate $P/(M-N-E)$ dB per TBS user, $P'/N$ dB per HAPS user, and $P''/E$ dB per VLEO user. Providing the full episode to the agent is optimal but computationally demanding, so we limit it to 10 time slots. The training parameters used in Algorithm 1 and other details are listed in Table I.
TABLE I
SIMULATION PARAMETERS

TABLE II
COMPARATIVE RESULTS

Using the abovementioned values for the system model, we evaluate all benchmark methods and compare them with our ADRQN method in Table II. As expected, ES selection and random selection achieve the best and worst results, respectively. Our proposed ADRQN's performance is close to that of DQN-F, which requires the full CSI, i.e., it employs a state vector with $3 \times U$ elements. Moreover, the performances of the ADRQN and CO methods closely align, highlighting ADRQN's efficacy compared to mathematical approaches. It is worth noting that the CO method operates at each time slot and uses global CSI, whereas the ADRQN agent does not require learning or backpropagation during testing. As previously mentioned, accounting for the action vector is vital for our problem. The sum-rate of the DRQN method, which employs only the observation vector, decreases by 0.49 bps compared to that of the ADRQN method. Additionally, comparing the DQN-PA and DQN-P methods reveals that incorporating the action vector boosts the DQN performance by 0.68 bps. This is because, with a partially observable observation vector, employing the action vector helps the agent understand that each layer has a different range of CSI values so that it can interpret them separately.
To investigate the scalability of our ADRQN method, we plot the average sum-rate versus the number of users in Fig. 3, assuming $N = 3$ and $E = 2$. For better visualization, we only plot the ES selection and random selection as the upper and lower bounds, respectively, and the CO method as the principal benchmark. As shown, the sum-rate of all methods increases as the number of users increases, and our proposed ADRQN agent remains very close to the ES selection.
To show the robustness of our proposed ADRQN method, we consider a practical scenario where perfect CSI is unavailable. We assume the imperfect CSI is obtained by adding additive Gaussian noise to the true channels. With $\epsilon_h$ denoting the variance of the channel estimation error, Fig. 4 illustrates the average sum-rate versus $\epsilon_h$. Here, we assume $U = 7$, $M = 16$, $N = 3$, $E = 2$, and $\epsilon_h$ varies between 0.2 and 0.8. Compared to the results shown in Fig. 3, the sum-rate of the ADRQN method decreases by 0.48 bps for $\epsilon_h = 0.2$ and then remains approximately constant. The sum-rate of the CO method, on the other hand, drops to 61.63 bps (from 62.79 bps) for $\epsilon_h = 0.2$, and then decreases by 7.98 bps as $\epsilon_h$ increases to 0.8. Although the sum-rate of the COUB method is similar to that of the ADRQN method for small estimation errors, it decreases for higher $\epsilon_h$ values (above 0.4), for example, to 57.17 bps for $\epsilon_h = 0.8$.
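A simple way to emulate this imperfect CSI in simulation is to perturb each channel vector with complex Gaussian estimation noise, as sketched below; treating $\epsilon_h$ as the per-element noise variance is our assumption.

```python
import numpy as np

def add_csi_error(h, eps_h, rng=None):
    """Imperfect CSI model: the reported channel is the true channel plus additive
    complex Gaussian estimation noise with (assumed per-element) variance eps_h."""
    rng = np.random.default_rng() if rng is None else rng
    noise = np.sqrt(eps_h / 2) * (rng.standard_normal(h.shape) + 1j * rng.standard_normal(h.shape))
    return h + noise
```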
Fig. 3. Average sum-rate versus the number of users.
Fig. 4. Average sum-rate versus $\epsilon_h$.
C. Computational Complexity

The computational complexity of DRL methods is contingent on the dimension of the deep neural network employed as the policy network. If we consider $|\mathcal{A}|$ to be the dimension of the action space, for $U = 6$, $N = 2$, and $E = 1$, the computational complexity of the ADRQN method is $\mathcal{O}(|\mathcal{A}|^{2.58})$, whereas that of the DQN-F method is $\mathcal{O}(|\mathcal{A}|^{2.50})$ and that of the DRQN method is $\mathcal{O}(|\mathcal{A}|^{2.575})$. The computational complexity of the CO method is $\mathcal{O}(|\mathcal{A}|^{2.51})$. Note that once the ADRQN agent has been trained using Algorithm 1, no learning or backpropagation is needed in the testing phase.
V. CONCLUSION

This letter investigated user association in a three-layer network using only a subset of the global CSI, which keeps our proposed method scalable as the number of network layers increases. Moreover, our proposed method does not require transition probabilities, which makes it more practical for solving POMDP problems in wireless networks. We compared our proposed ADRQN method to various benchmarks and demonstrated its superior performance in solving our POMDP problem. We also discussed the importance of using the action vector in addition to the observation vector. Finally, we investigated the imperfect CSI case and showed that the ADRQN agent maintains its performance, while the sum-rate of the CO method drops significantly.
REFERENCES
[1] G. Karabulut Kurt et al., "A vision and framework for the high altitude platform station (HAPS) networks of the future," IEEE Commun. Surveys Tuts., vol. 23, no. 2, pp. 729–779, 2nd Quart., 2021.
[2] N. H. Crisp et al., "The benefits of very low earth orbit for earth observation missions," Progr. Aerosp. Sci., vol. 117, pp. 1–23, Aug. 2020.
[3] Y. Yuan, L. Lei, T. X. Vu, Z. Chang, S. Chatzinotas, and S. Sun, "Adapting to dynamic LEO-B5G systems: Meta-critic learning based efficient resource scheduling," IEEE Trans. Wireless Commun., vol. 21, no. 11, pp. 9582–9595, Nov. 2022.
[4] A. Alsharoa and M.-S. Alouini, "Improvement of the global connectivity using integrated satellite-airborne-terrestrial networks with resource optimization," IEEE Trans. Wireless Commun., vol. 19, no. 8, pp. 5088–5100, Aug. 2020.
[5] A. Alidadi Shamsabadi, A. Yadav, O. Abbasi, and H. Yanikomeroglu, "Handling interference in integrated HAPS-terrestrial networks through radio resource management," IEEE Wireless Commun. Lett., vol. 11, no. 12, pp. 2585–2589, Dec. 2022.
[6] Z. Lin, M. Lin, Y. Huang, T. D. Cola, and W.-P. Zhu, "Robust multi-objective beamforming for integrated satellite and high altitude platform network with imperfect channel state information," IEEE Trans. Signal Process., vol. 67, no. 24, pp. 6384–6396, Dec. 2019.
[7] Z. Lin, M. Lin, T. de Cola, J.-B. Wang, W.-P. Zhu, and J. Cheng, "Supporting IoT with rate-splitting multiple access in satellite and aerial-integrated networks," IEEE Internet Things J., vol. 8, no. 14, pp. 11123–11134, Jul. 2021.
[8] Y. Cao, S.-Y. Lien, and Y.-C. Liang, "Deep reinforcement learning for multi-user access control in non-terrestrial networks," IEEE Trans. Commun., vol. 69, no. 3, pp. 1605–1619, Mar. 2021.
[9] S. Jo, W. Yang, H. K. Choi, E. Noh, H.-S. Jo, and J. Park, "Deep Q-learning-based transmission power control of a high altitude platform station with spectrum sharing," Sensors, vol. 22, no. 4, p. 1630, Feb. 2022.
[10] H. Tsuchida et al., "Efficient power control for satellite-borne batteries using Q-learning in low-earth-orbit satellite constellations," IEEE Wireless Commun. Lett., vol. 9, no. 6, pp. 809–812, Jun. 2020.
[11] Q. Zhao, B. Krishnamachari, and K. Liu, "On myopic sensing for multi-channel opportunistic access: Structure, optimality, and performance," IEEE Trans. Wireless Commun., vol. 7, no. 12, pp. 5431–5440, Dec. 2008.
[12] S. Sharifi, S. Shahbazpanahi, and M. Dong, "A POMDP-based antenna selection for massive MIMO communication," IEEE Trans. Commun., vol. 70, no. 3, pp. 2025–2041, Mar. 2022.
[13] H. Khoshkbari, S. Sharifi, and G. Kaddoum, "User association in a VHetNet with delayed CSI: A deep reinforcement learning approach," IEEE Commun. Lett., vol. 27, no. 8, pp. 2257–2261, Aug. 2023.
[14] E. Cianca et al., "Integrated satellite-HAP systems," IEEE Commun. Mag., vol. 43, no. 12, pp. 33–39, Dec. 2005.
[15] P. Zhu, X. Li, P. Poupart, and G. Miao, "On improving deep reinforcement learning for POMDPs," 2017, arXiv:1704.07978.
[16] E. Falletti, M. Laddomada, M. Mondin, and F. Sellone, "Integrated services from high-altitude platforms: A flexible communication system," IEEE Commun. Mag., vol. 44, no. 2, pp. 85–94, Feb. 2006.
[17] M. Hausknecht and P. Stone, "Deep recurrent Q-learning for partially observable MDPs," 2015, arXiv:1507.06527.
[18] V. Mnih et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[19] K. Pathak, S. S. Kalamkar, and A. Banerjee, "Optimal user scheduling in energy harvesting wireless networks," IEEE Trans. Commun., vol. 66, no. 10, pp. 4622–4636, Oct. 2018.
[20] A. Ibrahim and A. S. Alfa, "Using Lagrangian relaxation for radio resource allocation in high altitude platforms," IEEE Trans. Wireless Commun., vol. 14, no. 10, pp. 5823–5835, Oct. 2015.
[21] R. Wang, P. Ren, D. Xu, and L. Lu, "Stochastic geometry analysis of LEO constellation coverage under atmospheric attenuation," in Proc. IEEE 96th Veh. Technol. Conf. (VTC-Fall), Sep. 2022, pp. 1–5.