IEEE COMMUNICATIONS LETTERS, VOL. 27, NO. 12, DECEMBER 2023 3235
Deep Recurrent Reinforcement Learning for Partially Observable User
Association in a Vertical Heterogenous Network
Hesam Khoshkbari and Georges Kaddoum
Abstract— To ensure ubiquitous connectivity and meet increas-
ing users’ demands in next-generation wireless networks,
we investigate user association in a three-layer network consisting
of a terrestrial base station (TBS), a high-altitude platform
station (HAPS), and a satellite. To maintain scalability and
reduce frequent channel state information (CSI) exchanges across
layers, we assume only the CSI of previously associated links
is available. To address our partially observable user associ-
ation problem, we propose the action-specific deep recurrent
Q-network (ADRQN) method. This involves incorporating a
long short-term memory (LSTM) layer alongside fully connected
layers, utilizing both observation and action vectors in the policy
network. We compare our proposed ADRQN method to the
exhaustive search, deep Q-network, deep recurrent Q-network,
and convex optimization methods and demonstrate the necessity
of using the action vector. Finally, we consider the absence of
perfect CSI and show that our ADRQN agent outperforms the
convex optimization method in this scenario.
Index Terms— HAPS, deep recurrent reinforcement learning,
user association, partially observable Markov decision process.
I. INTRODUCTION
DEPLOYING non-terrestrial base stations (NTBSs), such
as very low earth orbit satellites (VLEOs) and
high-altitude platform stations (HAPSs), is crucial to meet
future wireless communication demands and ensure global
connectivity. In this regard, several studies have explored
the applications and challenges of utilizing HAPSs [1] and
VLEOs [2]. However, integrating NTBSs into existing terres-
trial networks poses resource allocation challenges. To tackle
this, [3],[4] investigated user scheduling and link association
in coexisting TBSs and NTBSs, aiming to align each user
with the appropriate radio access network. The authors in [5]
applied successive convex approximation for power control
to mitigate interference in cases of coexisting TBSs and a
HAPS. The authors of [6] proposed a beamforming scheme
for an integrated satellite-HAPS network to investigate the
trade-off between sum rate maximization and total transmit
power minimization. The authors of [7] investigated multi-
cast communication in a satellite-aerial integrated network
with rate-splitting multiple access to increase the network’s
sum-rate. Furthermore, several studies have turned to deep
reinforcement learning (DRL) for wireless network resource
allocation [8], [9], [10]. In [8], DRL was applied to user
association among NTBSs, aiming to maximize the network's
sum-rate while reducing handoffs due to NTBS mobility.
In [9], DRL was utilized for power control in a HAPS,
mitigating interference with a cellular network. The authors in [10]
employed reinforcement learning (RL) in a low-earth orbit
(LEO) satellite network to distribute the load of overloaded
LEO satellites to adjacent ones, thereby maximizing the
satellites' battery lifetime.
Manuscript received 2 October 2023; revised 1 November 2023; accepted
6 November 2023. Date of publication 8 November 2023; date of current
version 12 December 2023. The associate editor coordinating the review of this
letter and approving it for publication was A. Yadav. (Corresponding author:
Hesam Khoshkbari.)
Hesam Khoshkbari is with the Department of Electrical Engineering,
École de Technologie Supérieure, Montréal, QC H3C 1K3, Canada (e-mail:
hesam.khoshkbari.1@ens.etsmtl.ca).
Georges Kaddoum is with the Department of Electrical Engineering, École
de Technologie Supérieure, Montréal, QC H3C 1K3, Canada, and also with
the Artificial Intelligence and Cyber Systems Research Center, Lebanese
American University (LAU), Beirut 1102-2801, Lebanon.
Digital Object Identifier 10.1109/LCOMM.2023.3331216
Although instructive, the problem of user association in a
three-tier network composed of a terrestrial layer (i.e., a TBS),
an aerial layer (i.e., a HAPS), and a space layer (i.e., a satellite)
has not been fully explored. Maximizing sum-rate necessitates
users connecting to the base station (BS) with the highest link
quality. To achieve this, the DRL agent requires the channel
state information (CSI) of each user for all layers. However,
this approach incurs significant overhead in the network and
is not scalable since the number of required CSIs increases
with the number of layers. To overcome this issue, we assume
that each user knows only the channels between itself and
its previously associated BS, which ensures that the required
CSI is limited to the number of users and independent of the
number of layers. However, this assumption yields a partially
observable problem. To solve partially observable Markov
decision process (POMDP) problems, the authors in [11]
and [12] studied channel association and antenna selection,
respectively, while the agent’s knowledge is limited to the
observed channels in the previous time slot. They modelled
the channels as being in a good state or bad state, derived
transition probabilities, computed the belief using Bayes’ rule,
and applied RL to solve the problem. In our scenario, with
three layers having varying link qualities, channels cannot be
independently modelled with only two states. Additionally, the
continuous state space and expansive action space make RL
less effective in our case.
In our previous work [13], we employed a deep Q-network
(DQN) agent to perform user association between a TBS and
a HAPS using global CSI. As stated in [14], HAPSs are
seen as supplements to current terrestrial-satellite networks.
Thus, we extend our previous work by adding a space layer
to our network and operating under the assumption of partial
observability. This assumption reduces the CSI sharing overhead
to one-third of the global CSI requirement, since each user
reports one channel instead of three. In this letter, to solve the
user association problem in a three-tier network under the partial
observability assumption, we use the action-specific deep recurrent
Q-network (ADRQN) method [15]. In this case, each user
only sends the channel between itself and the previously associated
BS to the agent. We compare our method with convex
optimization, DQN, and exhaustive search, demonstrating its
superior performance over DQN. Additionally, our ADRQN
agent outperforms convex optimization under imperfect CSI.
Fig. 1. System model illustration.
1558-2558 © 2023 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: Bibliothèque ÉTS. Downloaded on December 14,2023 at 04:19:51 UTC from IEEE Xplore. Restrictions apply.
II. SYSTEM MODEL AND PROBLEM FORMULATION
This letter aims to solve the user association problem
in a three-layer wireless network to maximize its sum-rate
while considering the partially observable CSI assumption (see
Fig. 1). Specifically, we assume that our DRL agent is located
in a TBS equipped with $M$ antennas, where at each time slot
$t$, it schedules $U$ single-antenna users to be served by itself,
a HAPS with $N$ antennas, or a satellite with $E$ antennas. We
assume that each layer in our system model
operates at a different frequency band. Considering $\mathcal{K}_t$, $\mathcal{K}'_t$,
and $\mathcal{K}''_t$ as the sets of users served by the TBS, the HAPS,
and the satellite, respectively, we note that each user can be
served by only one BS at a time, i.e., $\mathcal{K}_t \cap \mathcal{K}'_t \cap \mathcal{K}''_t = \emptyset$.
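We denote $\mathbf{H}_t \triangleq [\mathbf{h}_{1,t}, \mathbf{h}_{2,t}, \cdots, \mathbf{h}_{U,t}]^T$ as the channel
matrix between the TBS and all users, where $\mathbf{h}_{u,t} = [h_{ui,t}]_{i=1}^{M}$
is an $M \times 1$ vector of the channel coefficients between the
$u$-th user and the TBS's antennas. We assume the first-order
Gauss-Markov channel model for $u \in \mathcal{K}_t$ at time slot $t$ as
$$\mathbf{h}_{u,t} \triangleq \xi_u \mathbf{h}_{u,t-1} + \sqrt{1-\xi_u^2}\,\mathbf{z}_{u,t}, \quad \text{for } u = 1, \ldots, U, \tag{1}$$
where $\xi_u \in [0,1]$ is the fading correlation factor, $\mathbf{h}_{u,0} \sim \mathcal{CN}(\mathbf{0}_{M\times 1}, \boldsymbol{\sigma}^2_{h,u})$,
and $\mathbf{z}_{u,t} \sim \mathcal{CN}(\mathbf{0}_{M\times 1}, \boldsymbol{\sigma}^2_{h,u})$, where
$\boldsymbol{\sigma}^2_{h,u} = \sigma^2_{h,u}\mathbf{I}_{M\times 1}$. $\sigma^2_{h,u}$ is derived using the following path
loss model: $\sigma^2_{h,u} = \rho d_u^{-3}$.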
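As an illustration of (1), the following is a minimal sketch that generates one user's TBS channel trajectory with the Python standard library only. The function names, the seed handling, and the choice to split the complex variance equally between the real and imaginary parts are assumptions made for this example, not details taken from the letter.

```python
import math
import random


def complex_gaussian(rng, variance):
    """Draw one CN(0, variance) sample (variance/2 per real dimension)."""
    std = math.sqrt(variance / 2.0)
    return complex(rng.gauss(0.0, std), rng.gauss(0.0, std))


def path_loss_variance(rho, distance):
    """Path-loss model: sigma^2_{h,u} = rho * d_u^(-3)."""
    return rho * distance ** -3


def gauss_markov_step(h_prev, xi, variance, rng):
    """One step of (1): h_t = xi * h_{t-1} + sqrt(1 - xi^2) * z_t."""
    scale = math.sqrt(1.0 - xi ** 2)
    return [xi * h + scale * complex_gaussian(rng, variance) for h in h_prev]


def simulate_channel(num_antennas, num_slots, xi, variance, seed=0):
    """Return [h_0, h_1, ..., h_num_slots], each a list of M complex taps."""
    rng = random.Random(seed)
    h = [complex_gaussian(rng, variance) for _ in range(num_antennas)]
    trajectory = [h]
    for _ in range(num_slots):
        h = gauss_markov_step(h, xi, variance, rng)
        trajectory.append(h)
    return trajectory
```

With $\xi_u = 1$ the channel stays frozen at its initial draw, and with $\xi_u = 0$ every slot is an independent draw, which matches the role of $\xi_u$ as a fading correlation factor.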
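Assuming $\mathbf{w}_{u,t} \in \mathbb{C}^{M\times 1}$ as the zero-forcing beamforming
(ZFB) vector used to cancel out the intra-cell interference between user $u$
and the TBS, the signal-to-interference-plus-noise ratio (SINR)
for $u \in \mathcal{K}_t$ is obtained as $\gamma_{u\in\mathcal{K}_t} = \frac{|\mathbf{h}_{u,t}\mathbf{w}_{u,t}|^2}{\hat{\sigma}_n^2}$, where $\hat{\sigma}_n^2$ is the
average power of the total received noise. Thus, the sum-rate
of users served by the TBS can be derived as
$$R = \sum_{u\in\mathcal{K}_t} \log_2\left(1 + \gamma_{u\in\mathcal{K}_t}\right). \tag{2}$$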
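The SINR expression and (2) reduce to a few lines of arithmetic. The sketch below assumes the effective gain $\mathbf{h}_{u,t}\mathbf{w}_{u,t}$ has already been computed for each served user; the function names are hypothetical.

```python
import math


def sinr(effective_gain, noise_power):
    """SINR after zero-forcing: |h w|^2 / sigma_n^2."""
    return abs(effective_gain) ** 2 / noise_power


def sum_rate(sinrs):
    """Sum-rate (2): sum of log2(1 + SINR) over the served users."""
    return sum(math.log2(1.0 + g) for g in sinrs)
```

For example, three users with SINRs of 1, 3, and 7 contribute $\log_2 2 + \log_2 4 + \log_2 8 = 6$ to the sum-rate.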
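Following the above formulation, we denote $\mathbf{G}_t \triangleq [\mathbf{g}_{1,t}, \mathbf{g}_{2,t}, \cdots, \mathbf{g}_{U,t}]^T$,
where $\mathbf{g}_{u,t} = [g_{uj,t}]_{j=1}^{N}$ is an $N \times 1$
vector of channel coefficients between the $u$-th user and
the HAPS antennas. Assuming the same line of sight (LoS)
component for all users $u \in \mathcal{K}'_t$, as explained in [16], we utilize
the first-order Gauss-Markov channel model to capture small
variations in channel links as
$$\mathbf{g}_{u,t} \triangleq \mathbf{g}_{LoS} + \xi_u \mathbf{g}_{u,t-1} + \sqrt{1-\xi_u^2}\,\mathbf{z}'_{u,t}, \quad \text{for } u = 1, \ldots, U, \tag{3}$$
where $\mathbf{g}_{u,0} \sim \mathcal{CN}(\mathbf{0}_{N\times 1}, \boldsymbol{\sigma}'^{2}_{h,u})$, $\mathbf{g}_{LoS} = (g_{LoS})\mathbf{I}_{N\times 1}$, and
$\mathbf{z}'_{u,t} \sim \mathcal{CN}(\mathbf{0}_{N\times 1}, \boldsymbol{\sigma}'^{2}_{h,u})$, where $\boldsymbol{\sigma}'^{2}_{h,u}$ is the variance of
the channel between the $u$-th user and the HAPS antennas.
Assuming the number of users associated with the HAPS does not
exceed the number of antennas at the HAPS (i.e., $|\mathcal{K}'_t| \le N$),
we use ZFB to eliminate the intra-cell interference. Thus,
we can write the sum-rate of users $u \in \mathcal{K}'_t$ as
$$R' = \sum_{u\in\mathcal{K}'_t} \log_2\left(1 + \gamma'_{u\in\mathcal{K}'_t}\right), \tag{4}$$
where $\gamma'_{u\in\mathcal{K}'_t} = \frac{|\mathbf{g}_{u,t}\mathbf{w}'_{u,t}|^2}{\sigma'^{2}_{n}}$. Here, $\mathbf{w}'_{u,t} = [\mathbf{W}'_t]_u = [\hat{\mathbf{G}}_t(\hat{\mathbf{G}}_t^H\hat{\mathbf{G}}_t)^{-1}]_u$
is the $u$-th user's zero-forcing precoding
vector, and $\sigma'^{2}_{n}$ is the average noise power.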
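Similarly, we denote $\mathbf{X}_t \triangleq [\mathbf{x}_{1,t}, \mathbf{x}_{2,t}, \cdots, \mathbf{x}_{U,t}]^T$, where
$\mathbf{x}_{u,t} = [x_{uj,t}]_{j=1}^{E}$ is an $E \times 1$ vector of channel coefficients
between the $u$-th user and the satellite's antennas. Considering
the same LoS component for $u \in \mathcal{K}''_t$, the satellite's channels
can be written as
$$\mathbf{x}_{u,t} \triangleq \mathbf{x}_{LoS} + \xi'_u \mathbf{x}_{u,t-1} + \sqrt{1-\xi'^{2}_u}\,\mathbf{z}''_{u,t}, \quad \text{for } u = 1, \ldots, U, \tag{5}$$
where $\mathbf{x}_{u,0} \sim \mathcal{CN}(\mathbf{0}_{E\times 1}, \boldsymbol{\sigma}''^{2}_{h,u})$, $\mathbf{x}_{LoS} = (x_{LoS})\mathbf{I}_{E\times 1}$, and
$\mathbf{z}''_{u,t} \sim \mathcal{CN}(\mathbf{0}_{E\times 1}, \boldsymbol{\sigma}''^{2}_{h,u})$. Here, $\boldsymbol{\sigma}''^{2}_{h,u}$ is the variance of
the channel between the $u$-th user and the satellite antennas.
To calculate the sum-rate of users $u \in \mathcal{K}''_t$, assuming
$|\mathcal{K}''_t| \le E$, ZFB vector $\mathbf{w}''_{u,t}$ to eliminate the intra-cell interference,
and $\gamma''_{u\in\mathcal{K}''_t} = \frac{|\mathbf{x}_{u,t}\mathbf{w}''_{u,t}|^2}{\sigma''^{2}_{n}}$ with $\sigma''^{2}_{n}$ being the average noise
power, we get
$$R'' = \sum_{u\in\mathcal{K}''_t} \log_2\left(1 + \gamma''_{u\in\mathcal{K}''_t}\right). \tag{6}$$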
Consequently, our optimization problem can be written as
$$\begin{aligned} \max_{\mathcal{K}_t, \mathcal{K}'_t, \mathcal{K}''_t} \quad & R + R' + R'' \\ \text{s.t.} \quad & |\mathcal{K}'_t| \le N \\ & |\mathcal{K}''_t| \le E \\ & \mathcal{K}_t \cap \mathcal{K}'_t \cap \mathcal{K}''_t = \emptyset \end{aligned} \tag{7}$$
where the term $\mathcal{K}_t \cap \mathcal{K}'_t \cap \mathcal{K}''_t = \emptyset$ ensures that each user can
be served by only one BS. Our defined optimization problem
in (7) can be viewed as the joint antenna selection at NTBSs
(i.e., choosing the number of active antennas at NTBSs) and
user association between three layers, which is NP-hard and
non-convex. More specifically, we assume our agent only has
access to a subset of CSIs at each time slot.
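To give a sense of the size of the feasible set of (7), the sketch below enumerates every assignment of $U$ users to the three layers and keeps those that respect the antenna constraints. Encoding an assignment as a tuple of labels 0 (TBS), 1 (HAPS), and 2 (satellite) is an assumption of this example, as are the function and constant names.

```python
from itertools import product

TBS, HAPS, SAT = 0, 1, 2  # illustrative layer labels


def feasible_actions(num_users, haps_antennas, sat_antennas):
    """List all user-to-layer assignments satisfying the constraints of (7)."""
    actions = []
    for a in product((TBS, HAPS, SAT), repeat=num_users):
        # At most N users on the HAPS and at most E users on the satellite.
        if a.count(HAPS) <= haps_antennas and a.count(SAT) <= sat_antennas:
            actions.append(a)
    return actions
```

For $U=6$, $N=2$, and $E=1$, this enumeration keeps 118 of the $3^6 = 729$ possible assignments. Because each user receives exactly one label, the single-BS condition holds by construction.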
III. OUR PROPOSED DEEP RECURRENT
Q-NETWORK ALGORITHM
A DQN agent assumes complete observability of the state
space, limiting its performance in POMDP problems. On the
other hand, conventional methods assume that the dynam-
ics of the environment, i.e., the transition probabilities, are
known and use Bayes’ theorem for belief estimation and
problem-solving. However, our work involves a continuous
state space and three system tiers, rendering a model-based
environment representation impractical. To solve this issue,
the deep recurrent Q-network (DRQN) [17] method has been
introduced, which uses a long short-term memory (LSTM)
layer besides fully connected neural networks (FCNNs) in
the policy network to estimate the Q-functions. However, the
DRQN method overlooks the action, which, as we will show
in our simulation results, decreases the performance.
Fig. 2. Integration of the ADRQN agent at the TBS.
In this letter, we use the ADRQN method, which concatenates
the action vector with the observation vector and feeds
it to the policy network to estimate the action-state value
function, i.e., $Q(\mathbf{v}_{t-1}, \mathbf{a}_{t-1}, \mathbf{o}_t \mid \theta)$ at time slot $t$. Here, $\theta$
denotes the weights of the policy network, for which an LSTM
layer is used in addition to the fully-connected layers to better
approximate the Q-function. $\mathbf{a}_{t-1}$ is the previously executed
action vector at time slot $t-1$, $\mathbf{o}_t$ is the observation vector
(we do not have access to the exact state vector $\mathbf{s}_t$), and $\mathbf{v}_{t-1}$
denotes the output of the LSTM layer, which is defined as
$$\mathbf{v}_t = \mathrm{LSTM}(\mathbf{v}_{t-1}, \mathbf{a}_{t-1}, \mathbf{o}_t). \tag{8}$$
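We employ a buffer and a target network to increase
the stability of our ADRQN agent during training [18].
As depicted in Fig. 2, we store the transition tuple
$(\mathbf{a}_{t-1}, \mathbf{o}_t, \mathbf{a}_t, R(\mathbf{s}_t, \mathbf{a}_t), \mathbf{o}_{t+1})$ in buffer $\mathcal{D}$ at each time slot
$t$ and employ a random mini-batch of data to update our
ADRQN agent. Additionally, we maintain a target network
with weights $\tilde{\theta}$, serving as a delayed copy of our main
(policy) network. Following the above explanations, we use
the following loss function for updating our main network:
$$L(\theta) = \mathbb{E}_{(\{\mathbf{a}^-, \mathbf{o}\}, \mathbf{a}, R(\mathbf{s},\mathbf{a}), \mathbf{o}^+) \sim \mathcal{D}}\left[\left(y - Q(\mathbf{v}^-, \mathbf{a}^-, \mathbf{o} \mid \theta)\right)^2\right], \tag{9}$$
where $\mathbf{v}^-$ and $\mathbf{a}^-$ are the previous LSTM output and action,
respectively, $\mathbf{o}^+$ is the next observation, and $y$ is defined as
$$y = R(\mathbf{s},\mathbf{a}) + \gamma \max_{\mathbf{a}\in\mathcal{A}} Q(\mathbf{v}, \mathbf{a}, \mathbf{o}^+ \mid \tilde{\theta}). \tag{10}$$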
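The target (10) and the loss (9) can be sketched numerically as follows, assuming the Q-values have already been produced by the main and target networks. The tuple layout of a mini-batch item and the function names are assumptions of this example; $\gamma$ is the usual discount factor.

```python
def td_target(reward, next_q_values, gamma):
    """Target (10): y = R(s, a) + gamma * max_a Q_target(v, a, o+)."""
    return reward + gamma * max(next_q_values)


def batch_loss(batch, gamma):
    """Mean squared TD error (9) over a mini-batch.

    Each item is (q_taken, reward, next_q_values): the main network's
    Q-value for the stored action, the stored reward, and the target
    network's Q-values for every action at the next observation.
    """
    errors = [(td_target(r, nq, gamma) - q) ** 2 for q, r, nq in batch]
    return sum(errors) / len(errors)
```

In a full implementation the gradient of this loss would be taken with respect to $\theta$ only, with $\tilde{\theta}$ held fixed.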
In the sequel, we define the action space, observation space,
and reward function, and then explain our training procedure.
1) Action: Our action space $\mathcal{A}$ is made up of $U \times 1$ action
vectors $\mathbf{a}_i = [a_{1,i}\ a_{2,i}\ \ldots\ a_{U,i}]^T$, where $\mathbf{a}_i$ represents the
$i$-th possible user scheduling vector. Here, $a_{u,i} = 0$ denotes
that user $u$ is associated with the TBS, $a_{u,i} = 1$ means that the
HAPS is serving user $u$, and $a_{u,i} = 2$ denotes that user $u$ is
associated with the VLEO. Finally, considering the constraints
in our optimization problem (7), $\sum_{u=1}^{U} [a_{u,i} = 1] \le N$ and
$\sum_{u=1}^{U} [a_{u,i} = 2] \le E$ should be satisfied.
2) Reward: In accordance with our notations in Section II,
our reward function is defined as the sum-rate of the network,
i.e., $R(\mathbf{s}, \mathbf{a}) = R + R' + R''$.
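3) Observations: At time slot $t$, our $U \times 1$ observation
vector $\mathbf{o}_t \in \mathcal{O}$ is defined as $\mathbf{o}_t = [o_{1,t}\ o_{2,t}\ \ldots\ o_{U,t}]^T$, with
$$o_{i,t} = \begin{cases} \|\mathbf{h}_{i,t}\|^2, & \text{if } a_i = 0 \\ \|\mathbf{g}_{i,t}\|^2, & \text{if } a_i = 1 \\ \|\mathbf{x}_{i,t}\|^2, & \text{if } a_i = 2, \end{cases} \tag{11}$$
where $a_i$ is the $i$-th element of the action vector executed at time slot $t-1$.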
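The observation in (11) and the concatenated network input can be sketched as below. This example reads $\|\cdot\|^2$ as the squared Euclidean norm and passes the raw action labels as floats; both are assumptions, and the letter does not specify how the action vector is encoded before concatenation. All function names are hypothetical.

```python
def squared_norm(vec):
    """Squared Euclidean norm of a list of complex channel coefficients."""
    return sum(abs(c) ** 2 for c in vec)


def build_observation(prev_action, h, g, x):
    """Observation (11): per user, the gain of the previously used link.

    prev_action[i] is 0 (TBS), 1 (HAPS) or 2 (satellite); h[i], g[i], x[i]
    are user i's channel vectors to the TBS, the HAPS and the satellite.
    """
    channels = (h, g, x)
    return [squared_norm(channels[a][i]) for i, a in enumerate(prev_action)]


def build_network_input(prev_action, observation):
    """Concatenate the previous action vector with the observation vector."""
    return [float(a) for a in prev_action] + list(observation)
```

Only one of the three channel vectors is read per user, so the agent's input has $2U$ entries regardless of the number of layers.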
Training Procedure: We start by populating buffer $\mathcal{D}$ with
random values and initializing the main and target networks
to be the same, i.e., $\tilde{\theta} = \theta$. To explore different system layouts and
prevent overfitting to specific user locations, we run $I_{episode}$
episodes, where the environment is reset at the beginning
of each episode, redistributing users and generating channels
as discussed in Section II. We implement the $\varepsilon$-greedy
exploration strategy, progressively reducing the exploration
probability throughout the training phase, starting with $\varepsilon_1 = 1$
for the initial episode.
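A minimal sketch of $\varepsilon$-greedy selection with a decaying exploration probability follows. The linear decay from 1 to $\varepsilon_{final}$ over the first $I'$ episodes is an assumption made for illustration; the update rule actually used is the one given in Algorithm 1. The function names are hypothetical.

```python
import random


def epsilon_for_episode(i, eps_final, decay_episodes):
    """Exploration probability for episode i (1-indexed).

    Assumed schedule: linear decay from 1 at i = 1 to eps_final at
    i = decay_episodes, constant afterwards.
    """
    if i >= decay_episodes:
        return eps_final
    fraction = (i - 1) / (decay_episodes - 1)
    return 1.0 - fraction * (1.0 - eps_final)


def epsilon_greedy(q_values, epsilon, rng):
    """Return a random action index with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)
```

Here `q_values` would hold one Q-value per feasible action vector, so the returned index selects a complete user scheduling vector.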
At each time slot of an episode, the ADRQN agent either executes
a random action with probability $\varepsilon_i$ or feeds $\{\mathbf{a}_{t-1}, \mathbf{o}_t\}$ to the
main network and chooses the action with the highest Q-value.
Afterwards, the agent receives the
reward value $R(\mathbf{s}_t, \mathbf{a}_t)$ and the new observation $\mathbf{o}_{t+1}$. We then
store the transition in the buffer, and every $\eta$ time slots,
we sample a mini-batch of transitions to update the main
network. Furthermore, every $\eta'$ time slots, we update the target
network by replacing its weights with the main network's
weights. Finally, we save the model every $\eta''$ episodes so that
it can be loaded in the testing phase to assess its performance.
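The buffer and the two update periods described above can be sketched as follows. The fixed capacity, the class and function names, and the uniform sampling of individual transitions are assumptions of this example; the letter does not state the buffer size or whether sequences are sampled.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity buffer D of transitions (a_prev, o, a, reward, o_next)."""

    def __init__(self, capacity, seed=0):
        self._data = deque(maxlen=capacity)  # oldest transitions drop out first
        self._rng = random.Random(seed)

    def store(self, a_prev, o, a, reward, o_next):
        self._data.append((a_prev, o, a, reward, o_next))

    def sample(self, batch_size):
        """Uniformly sample a mini-batch without replacement."""
        size = min(batch_size, len(self._data))
        return self._rng.sample(list(self._data), size)

    def __len__(self):
        return len(self._data)


def update_flags(t, eta, eta_target):
    """Return (update_main, update_target) for time slot t."""
    return t % eta == 0, t % eta_target == 0
```

Choosing $\eta' > \eta$ keeps the target network a delayed copy of the main network, which is the stabilizing effect the target network is meant to provide.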
IV. SIMULATION RESULTS
A. Benchmarks
1) DRQN: It is similar to our ADRQN method; the
only difference is that the DRQN method overlooks the actions,
which, as we will show, negatively affects the performance.
2) DQN: We assume three scenarios for the DQN method:
1) the DQN-F method, which has access to the global CSI,
i.e., $\mathbf{h}$, $\mathbf{g}$, and $\mathbf{x}$; 2) the DQN-P method, which only receives the
observations and does not use the actions; and 3) the DQN-PA
method, which receives the same input as our ADRQN agent.
3) Exhaustive Search (ES) Selection: Evaluates all possible
actions at each time slot and chooses the one with the highest
reward value. It is noted that this method is impractical and
operates independently of states or observations.
4) Random Selection: Selects actions randomly, indepen-
dently of states or observations.
5) Convex Optimization (CO): We assume fixed power
instead of uniform power allocation and solve our problem
with CO for a better comparison. Note that the CO method
needs the global CSI to solve (7).
6) Convex Optimization With Uncertainty Bounds (COUB):
For further investigation, we add to our imperfect CSI
comparisons a method used in [19] that employs uncertainty bounds
to solve the convex optimization problem with imperfect CSI.
Note that, like the CO method, the COUB method also needs
global CSI to solve (7).
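Algorithm 1 Training Algorithm of Our Proposed ADRQN Framework
Input: update frequencies of the main and target networks and model-saving frequency $\eta$, $\eta'$, $\eta''$; number of episodes and time slots $I_{episode}$, $L$; exploration probability at the $i$-th episode $\varepsilon_i$; final $\varepsilon_i$ value $\varepsilon_{final}$; final $\varepsilon_i$ updating episode $I'$;
Initialize: buffer $\mathcal{D}$, main network's weights $\theta$ randomly, target network's weights $\tilde{\theta} = \theta$;
for $i = 1$ to $I_{episode}$ do
    Reset the environment and initialize $\mathbf{o}_0$;
    Update $\varepsilon_i = \begin{cases} 1 - \frac{1-\varepsilon_{final}}{I'}, & \text{if } 1 < i < I' \\ \varepsilon_{final}, & \text{if } I' < i < I_{episode} \end{cases}$;
    for $t = 1$ to $L$ do
        Randomly select an action with a probability of $\varepsilon_i$; if not, feed $\{\mathbf{a}_{t-1}, \mathbf{o}_t\}$ to the main network and choose $\mathbf{a}_t = \arg\max_{\mathbf{a}} Q(\mathbf{v}_{t-1}, \mathbf{a}_{t-1}, \mathbf{o}_t \mid \theta)$;
        Execute action $\mathbf{a}_t$, receive reward $R(\mathbf{s}_t, \mathbf{a}_t)$ and new observation $\mathbf{o}_{t+1}$, and store $(\{\mathbf{a}_{t-1}, \mathbf{o}_t\}, \mathbf{a}_t, R(\mathbf{s}_t, \mathbf{a}_t), \mathbf{o}_{t+1})$ in $\mathcal{D}$;
        if $t \,\%\, \eta = 0$ then
            Sample a mini-batch of data and update the main network using (9);
        if $t \,\%\, \eta' = 0$ then
            Update target network: $\tilde{\theta} \leftarrow \theta$;
    if $i \,\%\, \eta'' = 0$ then
        Save the learned model;
Output: Learned ADRQN model $Q(\mathbf{v}, \mathbf{a}, \mathbf{o} \mid \theta)$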
B. Numerical Results
During training, we assume U=6 single-antenna users
distributed in a 3 km radius cell. We consider a TBS with
M=16 antennas and a supply power of P= 10 dB, a HAPS
with an altitude of 18 km that is equipped with N=2 antennas
and a supply power of P′=10 dB, and a VLEO with an
altitude of 250 km that is equipped with E=1 antenna and a
supply power of P′′ =10 dB. We utilize the aperture antenna
model [20] for both the HAPS and satellite and design it for
maximum gain at the cell border. We use the free-space path
loss model to generate large-scale fading for the HAPS and
the satellite. To account for atmospheric attenuation in the
satellite, we utilize the model proposed in [21]. For power
allocation, we assume uniform distribution. For the CO and COUB
methods, we allocate $P/(M-N-E)$ dB per TBS user, $P'/N$
dB per HAPS user, and $P''/E$ dB per VLEO user. Providing
the full episode to the agent is optimal but computationally
demanding, so we limit it to 10 time slots. Training parameters
in Algorithm 1 and other details are in Table I.
TABLE I
SIMULATION PARAMETERS
TABLE II
COMPARATIVE RESULTS
Using the abovementioned values for the system model,
we evaluate all benchmark methods and compare our ADRQN
with them in Table II. As expected, ES selection and random
selection achieve the best and worst results, respectively.
Our proposed ADRQN's performance is close to that of the
DQN-F, which requires the full CSI, i.e., it employs the state
vector with $3 \times U$ elements. Moreover, ADRQN and CO
methods' performances closely align, highlighting ADRQN's
efficacy compared to mathematical approaches. It's worth
noting that the CO method operates at each time slot and uses
global CSI, while the ADRQN agent does not require learning
or backpropagation during testing. As previously mentioned,
accounting for the action vector is vital for our problem. The
sum-rate for the DRQN method, which employs only the
observation vector, decreases by 0.49 bps compared to that of
the ADRQN method. Additionally, comparing the DQN-PA
and DQN-P methods reveals that incorporating the action
vector boosts DQN performance by 0.68 bps. This is due to
the fact that, with a partially observable observation vector,
employing the action vector helps the agent understand that
each layer has different ranges of CSI values so that it can
interpret them separately.
To investigate the scalability of our ADRQN method,
we plot the average sum-rate vs the number of users in
Fig. 3, assuming $N=3$ and $E=2$. For better visualization,
we only plot the ES selection and random selection as the
upper and lower bounds, and the CO method as the principal
benchmark. As shown, the sum-rate for all methods increases
as the number of users increases, and our proposed ADRQN
agent remains very close to the ES selection.
To show the robustness of our proposed ADRQN method,
we consider a practical scenario where perfect CSI is unavailable.
We model the imperfect CSI by adding Gaussian noise to the
channels. Denoting $\epsilon_h$ as the variance of the
channel estimation error, Fig. 4 illustrates the average sum-rate
vs $\epsilon_h$. Here, we assume $U=7$, $M=16$, $N=3$, $E=2$,
and $\epsilon_h$ varies between 0.2 and 0.8. Compared to the results
shown in Fig. 3, the sum-rate of the ADRQN method decreases
by 0.48 bps for $\epsilon_h=0.2$ and then remains approximately
constant. The sum-rate of the CO method, on the other hand,
drops to 61.63 bps (from 62.79 bps) for $\epsilon_h=0.2$, and then
decreases by 7.98 bps as $\epsilon_h$ increases to 0.8. Although the
sum-rate of the COUB method is similar to that of the ADRQN
method for small estimation errors, it decreases for higher
$\epsilon_h$ values (more than 0.4), for example, to 57.17 bps for
$\epsilon_h=0.8$.
Fig. 3. Average sum-rate vs numbers of users.
Fig. 4. Averaged sum-rate versus $\epsilon_h$.
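The imperfect CSI model can be sketched as below. It assumes the estimation error is circularly symmetric complex Gaussian and independent across antenna entries, with total variance $\epsilon_h$ per entry; the letter only states that Gaussian noise is added, so these details and the function name are assumptions.

```python
import math
import random


def noisy_csi(h, error_variance, rng):
    """Return h + e, with e ~ CN(0, error_variance) drawn per entry."""
    std = math.sqrt(error_variance / 2.0)
    return [c + complex(rng.gauss(0.0, std), rng.gauss(0.0, std)) for c in h]
```

Setting `error_variance` to zero returns the perfect CSI, which gives a convenient sanity check when sweeping $\epsilon_h$ from 0.2 to 0.8.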
C. Computational Complexity
The computational complexity of DRL methods depends
on the dimension of the deep neural network employed
as the policy network. If we consider $|\mathcal{A}|$ to be the dimension
of the action space, for $U=6$, $N=2$, and $E=1$, the
computational complexity of the ADRQN method is $O(|\mathcal{A}|^{2.58})$,
whereas that of the DQN-F method is $O(|\mathcal{A}|^{2.50})$ and that
of the DRQN method is $O(|\mathcal{A}|^{2.575})$. The computational
complexity of the CO method is $O(|\mathcal{A}|^{2.51})$. Note that once the
ADRQN agent has been trained using Algorithm 1, no learning
or backpropagation is needed in the testing phase.
V. CONCLUSION
This letter investigated user association in a three-layer
network using only a subset of the global CSI, which keeps
our proposed method scalable as the number of network
layers increases. Moreover, our proposed method
does not require transition probabilities, which makes it more
feasible to solve POMDP problems in wireless networks.
We compared our proposed ADRQN method to various benchmarks
and demonstrated its superior performance in solving
our POMDP problem. We also discussed the importance of
using the action vector in addition to the observation vector.
Finally, we investigated the imperfect CSI case and showed
that the ADRQN agent maintains its performance while the
sum-rate of the CO method drops significantly.
REFERENCES
[1] G. Karabulut Kurt et al., “A vision and framework for the high
altitude platform station (HAPS) networks of the future,” IEEE Commun.
Surveys Tuts., vol. 23, no. 2, pp. 729–779, 2nd Quart., 2021.
[2] N. H. Crisp et al., “The benefits of very low earth orbit for earth obser-
vation missions,” Progr. Aerosp. Sci., vol. 117, pp. 1–23, Aug. 2020.
[3] Y. Yuan, L. Lei, T. X. Vu, Z. Chang, S. Chatzinotas, and S. Sun,
“Adapting to dynamic LEO-B5G systems: Meta-critic learning based
efficient resource scheduling,” IEEE Trans. Wireless Commun., vol. 21,
no. 11, pp. 9582–9595, Nov. 2022.
[4] A. Alsharoa and M.-S. Alouini, “Improvement of the global con-
nectivity using integrated satellite-airborne-terrestrial networks with
resource optimization,” IEEE Trans. Wireless Commun., vol. 19, no. 8,
pp. 5088–5100, Aug. 2020.
[5] A. Alidadi Shamsabadi, A. Yadav, O. Abbasi, and H. Yanikomeroglu,
“Handling interference in integrated HAPS-terrestrial networks through
radio resource management,” IEEE Wireless Commun. Lett., vol. 11,
no. 12, pp. 2585–2589, Dec. 2022.
[6] Z. Lin, M. Lin, Y. Huang, T. D. Cola, and W.-P. Zhu, “Robust multi-
objective beamforming for integrated satellite and high altitude platform
network with imperfect channel state information,” IEEE Trans. Signal
Process., vol. 67, no. 24, pp. 6384–6396, Dec. 2019.
[7] Z. Lin, M. Lin, T. de Cola, J.-B. Wang, W.-P. Zhu, and J. Cheng,
“Supporting IoT with rate-splitting multiple access in satellite and
aerial-integrated networks,” IEEE Internet Things J., vol. 8, no. 14,
pp. 11123–11134, Jul. 2021.
[8] Y. Cao, S.-Y. Lien, and Y.-C. Liang, “Deep reinforcement learning
for multi-user access control in non-terrestrial networks,” IEEE Trans.
Commun., vol. 69, no. 3, pp. 1605–1619, Mar. 2021.
[9] S. Jo, W. Yang, H. K. Choi, E. Noh, H.-S. Jo, and J. Park, “Deep
Q-learning-based transmission power control of a high altitude plat-
form station with spectrum sharing,” Sensors, vol. 22, no. 4, p. 1630,
Feb. 2022.
[10] H. Tsuchida et al., “Efficient power control for satellite-borne batteries
using Q-learning in low-earth-orbit satellite constellations,” IEEE Wire-
less Commun. Lett., vol. 9, no. 6, pp. 809–812, Jun. 2020.
[11] Q. Zhao, B. Krishnamachari, and K. Liu, “On myopic sensing for
multi-channel opportunistic access: Structure, optimality, and perfor-
mance,” IEEE Trans. Wireless Commun., vol. 7, no. 12, pp. 5431–5440,
Dec. 2008.
[12] S. Sharifi, S. Shahbazpanahi, and M. Dong, “A POMDP-based antenna
selection for massive MIMO communication,” IEEE Trans. Commun.,
vol. 70, no. 3, pp. 2025–2041, Mar. 2022.
[13] H. Khoshkbari, S. Sharifi, and G. Kaddoum, “User association in a
VHetNet with delayed CSI: A deep reinforcement learning approach,”
IEEE Commun. Lett., vol. 27, no. 8, pp. 2257–2261, Aug. 2023.
[14] E. Cianca et al., “Integrated satellite-HAP systems,” IEEE Commun.
Mag., vol. 43, no. 12, pp. 33–39, Dec. 2005.
[15] P. Zhu, X. Li, P. Poupart, and G. Miao, “On improving deep reinforce-
ment learning for POMDPs,” 2017, arXiv:1704.07978.
[16] E. Falletti, M. Laddomada, M. Mondin, and F. Sellone, “Integrated ser-
vices from high-altitude platforms: A flexible communication system,”
IEEE Commun. Mag., vol. 44, no. 2, pp. 85–94, Feb. 2006.
[17] M. Hausknecht and P. Stone, “Deep recurrent Q-learning for partially
observable MDPs,” 2015, arXiv:1507.06527.
[18] V. Mnih et al., “Human-level control through deep reinforcement learn-
ing,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[19] K. Pathak, S. S. Kalamkar, and A. Banerjee, “Optimal user scheduling
in energy harvesting wireless networks,” IEEE Trans. Commun., vol. 66,
no. 10, pp. 4622–4636, Oct. 2018.
[20] A. Ibrahim and A. S. Alfa, “Using Lagrangian relaxation for radio
resource allocation in high altitude platforms,” IEEE Trans. Wireless
Commun., vol. 14, no. 10, pp. 5823–5835, Oct. 2015.
[21] R. Wang, P. Ren, D. Xu, and L. Lu, “Stochastic geometry anal-
ysis of LEO constellation coverage under atmospheric attenuation,”
in Proc. IEEE 96th Veh. Technol. Conf. (VTC-Fall), Sep. 2022,
pp. 1–5.