
Deep Recurrent Reinforcement Learning for Partially Observable User Association in a Vertical Heterogenous Network

December 2023
IEEE Communications Letters PP(99):1-1


DOI:10.1109/LCOMM.2023.3331216

Authors:

Hesam Khoshkbari

École de Technologie Supérieure

Georges Kaddoum

École de Technologie Supérieure

To ensure ubiquitous connectivity and meet increasing users’ demands in next-generation wireless networks, we investigate user association in a three-layer network consisting of a terrestrial base station (TBS), a high-altitude platform station (HAPS), and a satellite. To maintain scalability and reduce frequent channel state information (CSI) exchanges across layers, we assume only the CSI of previously associated links is available. To address our partially observable user association problem, we propose the action-specific deep recurrent Q-network (ADRQN) method. This involves incorporating a long short-term memory (LSTM) layer alongside fully connected layers, utilizing both observation and action vectors in the policy network. We compare our proposed ADRQN method to the exhaustive search, deep Q-network, deep recurrent Q-network, and convex optimization methods and demonstrate the necessity of using the action vector. Finally, we consider the absence of perfect CSI and show that our ADRQN agent outperforms the convex optimization method in this scenario.


IEEE COMMUNICATIONS LETTERS, VOL. 27, NO. 12, DECEMBER 2023 3235


Index Terms— HAPS, deep recurrent reinforcement learning,

user association, partially observable Markov decision process.

I. INTRODUCTION

DEPLOYING non-terrestrial base stations (NTBSs), such

as very low earth orbit satellites (VLEOs) and

high-altitude platform stations (HAPSs), is crucial to meet

future wireless communication demands and ensure global

connectivity. In this regard, several studies have explored

the applications and challenges of utilizing HAPSs [1] and

VLEOs [2]. However, integrating NTBSs into existing terres-

trial networks poses resource allocation challenges. To tackle

this, [3],[4] investigated user scheduling and link association

in coexisting TBSs and NTBSs, aiming to align each user

with the appropriate radio access network. The authors in [5]

applied successive convex approximation for power control

to mitigate interference in cases of coexisting TBSs and a

HAPS. The authors of [6] proposed a beamforming scheme

for an integrated satellite-HAPS network to investigate the

trade-off between sum rate maximization and total transmit

power minimization. The authors of [7] investigated multi-

cast communication in a satellite-aerial integrated network

with rate-splitting multiple access to increase the network’s

sum-rate. Furthermore, several studies have turned to deep reinforcement learning (DRL) for wireless network resource allocation [8], [9], [10]. In [8], DRL was applied to user association among NTBSs, aiming to maximize the network's sum-rate while reducing handoffs due to NTBS mobility. In [9], DRL was utilized for power control in a HAPS, mitigating interference with a cellular network. The authors in [10] employed reinforcement learning (RL) in a low-earth orbit satellite (LEO) network to distribute the load of overloaded LEO satellites to adjacent ones, thereby maximizing the satellites' battery lifetime.

Manuscript received 2 October 2023; revised 1 November 2023; accepted 6 November 2023. Date of publication 8 November 2023; date of current version 12 December 2023. The associate editor coordinating the review of this letter and approving it for publication was A. Yadav. (Corresponding author: Hesam Khoshkbari.)

Hesam Khoshkbari is with the Department of Electrical Engineering, École de Technologie Supérieure, Montréal, QC H3C 1K3, Canada (e-mail: hesam.khoshkbari.1@ens.etsmtl.ca).

Georges Kaddoum is with the Department of Electrical Engineering, École de Technologie Supérieure, Montréal, QC H3C 1K3, Canada, and also with the Artificial Intelligence and Cyber Systems Research Center, Lebanese American University (LAU), Beirut 1102-2801, Lebanon.

Digital Object Identifier 10.1109/LCOMM.2023.3331216

Although instructive, the problem of user association in a

three-tier network composed of a terrestrial layer (i.e., a TBS),

an aerial layer (i.e., a HAPS), and a space layer (i.e., a satellite)

has not been fully explored. Maximizing sum-rate necessitates

users connecting to the base station (BS) with the highest link

quality. To achieve this, the DRL agent requires the channel

state information (CSI) of each user for all layers. However,

this approach incurs signiﬁcant overhead in the network and

is not scalable since the number of required CSIs increases

with the number of layers. To overcome this issue, we assume

that each user knows only the channels between itself and

its previously associated BS, which ensures that the required

CSI is limited to the number of users and independent of the

number of layers. However, this assumption yields a partially

observable problem. To solve partially observable Markov

decision process (POMDP) problems, the authors in [11]

and [12] studied channel association and antenna selection,

respectively, while the agent’s knowledge is limited to the

observed channels in the previous time slot. They modelled

the channels as being in a good state or bad state, derived

transition probabilities, computed the belief using Bayes’ rule,

and applied RL to solve the problem. In our scenario, with

three layers having varying link qualities, channels cannot be

independently modelled with only two states. Additionally, the

continuous state space and expansive action space make RL

less effective in our case.

In our previous work [13], we employed a deep Q-network (DQN) agent to perform user association between a TBS and a HAPS using global CSI. As stated in [14], HAPSs are seen as supplements to current terrestrial-satellite networks. Thus, we extend our previous work by adding a space layer to our network and operating under the assumption of partial observability. Note that this assumption reduces the CSI sharing overhead to one-third of the global CSI requirement. In this letter, to solve the user association problem in a three-tier network under the partial observability assumption, we use the action-specific deep recurrent Q-network (ADRQN) method [15]. In this case, each user only sends the channel between itself and the previously associated BS to the agent. We compare our method with convex optimization, DQN, and exhaustive search, demonstrating its superior performance over DQN. Additionally, our ADRQN agent outperforms convex optimization under imperfect CSI.

Fig. 1. System model illustration.

II. SYSTEM MODEL AND PROBLEM FORMULATION

This letter aims to solve the user association problem in a three-layer wireless network to maximize its sum-rate while considering the partially observable CSI assumption (see Fig. 1). Specifically, we assume that our DRL agent is located in a TBS equipped with $M$ antennas, where at each time slot $t$, it schedules $U$ single-antenna users to be served by itself, a HAPS with $N$ antennas, or a satellite with $E$ antennas. We assume that each layer in our system model operates at a different frequency band. Considering $\mathcal{K}_t$, $\mathcal{K}'_t$, and $\mathcal{K}''_t$ as the sets of users served by the TBS, the HAPS, and the satellite, respectively, we note that each user can be served by only one BS at a time, i.e., $\mathcal{K}_t \cap \mathcal{K}'_t \cap \mathcal{K}''_t = \emptyset$.

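We denote $\mathbf{H}_t \triangleq [\mathbf{h}_{1,t}, \mathbf{h}_{2,t}, \cdots, \mathbf{h}_{U,t}]^T$ as the channel matrix between the TBS and all users, where $\mathbf{h}_{u,t} = [h_{ui,t}]_{i=1}^{M}$ is an $M \times 1$ vector of the channel coefficients between the $u$-th user and the TBS's antennas. We assume the first-order Gauss-Markov channel model for $u \in \mathcal{K}_t$ at time slot $t$ as

$$\mathbf{h}_{u,t} \triangleq \xi_u \mathbf{h}_{u,t-1} + \sqrt{1-\xi_u^2}\,\mathbf{z}_{u,t}, \quad \text{for } u = 1, \dots, U, \tag{1}$$

where $\xi_u \in [0,1]$ is the fading correlation factor, $\mathbf{h}_{u,0} \sim \mathcal{CN}(\mathbf{0}_{M\times 1}, \boldsymbol{\sigma}^2_{h,u})$, and $\mathbf{z}_{u,t} \sim \mathcal{CN}(\mathbf{0}_{M\times 1}, \boldsymbol{\sigma}^2_{h,u})$, where $\boldsymbol{\sigma}^2_{h,u} = \sigma^2_{h,u}\mathbf{I}_{M\times 1}$. $\sigma^2_{h,u}$ is derived using the following path loss model: $\sigma^2_{h,u} = \rho d^{-3}$.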
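To make the recursion in (1) concrete, the following minimal sketch advances a TBS channel vector by one time slot using only the Python standard library. The function names, the seed, and the values chosen for $\rho$, $d$, $\xi_u$, and $M$ are illustrative assumptions and are not taken from the letter.

```python
import math
import random

def complex_gaussian(rng, variance):
    """Draw one circularly symmetric complex Gaussian sample CN(0, variance)."""
    std = math.sqrt(variance / 2.0)  # variance is split evenly over real and imaginary parts
    return complex(rng.gauss(0.0, std), rng.gauss(0.0, std))

def path_loss_variance(rho, distance):
    """Channel variance from the path loss model sigma^2 = rho * d^(-3)."""
    return rho * distance ** -3

def gauss_markov_step(h_prev, xi, variance, rng):
    """One update of (1): h_t = xi * h_{t-1} + sqrt(1 - xi^2) * z_t, per antenna."""
    scale = math.sqrt(1.0 - xi ** 2)
    return [xi * h + scale * complex_gaussian(rng, variance) for h in h_prev]

# Illustrative values only (assumptions, not the letter's simulation settings).
rng = random.Random(0)
M = 4
var = path_loss_variance(rho=1.0, distance=2.0)
h0 = [complex_gaussian(rng, var) for _ in range(M)]
h1 = gauss_markov_step(h0, xi=0.9, variance=var, rng=rng)
```

With $\xi_u = 1$ the channel is frozen, and with $\xi_u = 0$ every time slot draws an independent channel, which is why the correlation factor controls how informative a previously observed link remains.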

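Assuming $\mathbf{w}_{u,t} \in \mathbb{C}^{M\times 1}$ as the zero-forcing beamforming (ZFB) vector to cancel out the intra-cell interference between user $u$ and the TBS, the signal-to-interference-plus-noise ratio (SINR) for $u \in \mathcal{K}_t$ is obtained as $\gamma_{u\in\mathcal{K}_t} = \frac{|\mathbf{h}_{u,t}\mathbf{w}_{u,t}|^2}{\hat{\sigma}_n^2}$, where $\hat{\sigma}_n^2$ is the average power of the total received noise. Thus, the sum-rate of users served by the TBS can be derived as

$$R = \sum_{u\in\mathcal{K}_t} \log_2\left(1 + \gamma_{u\in\mathcal{K}_t}\right). \tag{2}$$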
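As a small worked example of the SINR expression and (2), the sketch below evaluates the rate for hand-picked numbers. The helper names `sinr` and `sum_rate` and all numeric values are assumptions chosen so the arithmetic can be checked by hand; the beamforming vector here is arbitrary rather than an actual zero-forcing solution.

```python
import math

def sinr(h, w, noise_power):
    """SINR |h w|^2 / noise_power for one user; h and w are equal-length complex lists."""
    inner = sum(hi * wi for hi, wi in zip(h, w))
    return abs(inner) ** 2 / noise_power

def sum_rate(sinrs):
    """Sum-rate of (2): sum over users of log2(1 + SINR)."""
    return sum(math.log2(1.0 + g) for g in sinrs)

h = [1 + 0j, 0 + 1j]
w = [1 + 0j, 0 - 1j]                   # h.w = 1 + (1j * -1j) = 2
gamma = sinr(h, w, noise_power=4.0)    # |2|^2 / 4 = 1
rate = sum_rate([gamma, 3.0])          # log2(2) + log2(4) = 3
```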

Following the above formulation, we denote $\mathbf{G}_t \triangleq [\mathbf{g}_{1,t}, \mathbf{g}_{2,t}, \cdots, \mathbf{g}_{U,t}]^T$, where $\mathbf{g}_{u,t} = [g_{uj,t}]_{j=1}^{N}$ is an $N \times 1$ vector of channel coefficients between the $u$-th user and the HAPS antennas. Assuming the same line of sight (LoS) component for all users $u \in \mathcal{K}'_t$, as explained in [16], we utilize the first-order Gauss-Markov channel model to capture small variations in channel links as

$$\mathbf{g}_{u,t} \triangleq \mathbf{g}_{\mathrm{LoS}} + \xi_u \mathbf{g}_{u,t-1} + \sqrt{1-\xi_u^2}\,\mathbf{z}'_{u,t}, \quad \text{for } u = 1, \dots, U, \tag{3}$$

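where $\mathbf{g}_{u,0} \sim \mathcal{CN}(\mathbf{0}_{N\times 1}, \boldsymbol{\sigma}'^2_{h,u})$, $\mathbf{g}_{\mathrm{LoS}} = (g_{\mathrm{LoS}})\mathbf{I}_{N\times 1}$, and $\mathbf{z}'_{u,t} \sim \mathcal{CN}(\mathbf{0}_{N\times 1}, \boldsymbol{\sigma}'^2_{h,u})$, where $\sigma'^2_{h,u}$ is the variance of the channel between the $u$-th user and the HAPS antennas. Assuming the number of users associated with the HAPS does not exceed the number of antennas at the HAPS (i.e., $|\mathcal{K}'_t| \le N$), we use ZFB to eliminate the intra-cell interference. Thus, we can write the sum-rate of users $u \in \mathcal{K}'_t$ as

$$R' = \sum_{u\in\mathcal{K}'_t} \log_2\left(1 + \gamma'_{u\in\mathcal{K}'_t}\right), \tag{4}$$

where $\gamma'_{u\in\mathcal{K}'_t} = \frac{|\mathbf{g}_{u,t}\mathbf{w}'_{u,t}|^2}{\sigma'^2_n}$. Here, $\mathbf{w}'_{u,t} = [\mathbf{W}'_t]_u = [\hat{\mathbf{G}}_t^{H}(\hat{\mathbf{G}}_t\hat{\mathbf{G}}_t^{H})^{-1}]_u$ is the $u$-th user's zero-forcing precoding vector, and $\sigma'^2_n$ is the average noise power.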

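Similarly, we denote $\mathbf{X}_t \triangleq [\mathbf{x}_{1,t}, \mathbf{x}_{2,t}, \cdots, \mathbf{x}_{U,t}]^T$, where $\mathbf{x}_{u,t} = [x_{uj,t}]_{j=1}^{E}$ is an $E \times 1$ vector of channel coefficients between the $u$-th user and the satellite's antennas. Considering the same LoS component for $u \in \mathcal{K}''_t$, the satellite's channels can be written as

$$\mathbf{x}_{u,t} \triangleq \mathbf{x}_{\mathrm{LoS}} + \xi'_u \mathbf{x}_{u,t-1} + \sqrt{1-\xi'^2_u}\,\mathbf{z}''_{u,t}, \quad \text{for } u = 1, \dots, U, \tag{5}$$

where $\mathbf{x}_{u,0} \sim \mathcal{CN}(\mathbf{0}_{E\times 1}, \boldsymbol{\sigma}''^2_{h,u})$, $\mathbf{x}_{\mathrm{LoS}} = (x_{\mathrm{LoS}})\mathbf{I}_{E\times 1}$, and $\mathbf{z}''_{u,t} \sim \mathcal{CN}(\mathbf{0}_{E\times 1}, \boldsymbol{\sigma}''^2_{h,u})$. Here, $\sigma''^2_{h,u}$ is the variance of the channel between the $u$-th user and the satellite antennas. To calculate the sum-rate of users $u \in \mathcal{K}''_t$, assuming $|\mathcal{K}''_t| \le E$, using ZFB $\mathbf{w}''_{u,t}$ to eliminate the intra-cell interference, and defining $\gamma''_{u\in\mathcal{K}''_t} = \frac{|\mathbf{x}_{u,t}\mathbf{w}''_{u,t}|^2}{\sigma''^2_n}$ with $\sigma''^2_n$ being the average noise power, we get

$$R'' = \sum_{u\in\mathcal{K}''_t} \log_2\left(1 + \gamma''_{u\in\mathcal{K}''_t}\right). \tag{6}$$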

Consequently, our optimization problem can be written as

$$\begin{aligned} \max_{\mathcal{K}_t,\,\mathcal{K}'_t,\,\mathcal{K}''_t} \quad & R + R' + R'' \\ \text{s.t.} \quad & |\mathcal{K}'_t| \le N, \\ & |\mathcal{K}''_t| \le E, \\ & \mathcal{K}_t \cap \mathcal{K}'_t \cap \mathcal{K}''_t = \emptyset, \end{aligned} \tag{7}$$

where the constraint $\mathcal{K}_t \cap \mathcal{K}'_t \cap \mathcal{K}''_t = \emptyset$ ensures that each user can be served by only one BS. The optimization problem defined in (7) can be viewed as joint antenna selection at the NTBSs (i.e., choosing the number of active antennas at the NTBSs) and user association among the three layers, which is NP-hard and non-convex. More specifically, we assume our agent only has access to a subset of CSIs at each time slot.

III. OUR PROPOSED DEEP RECURRENT Q-NETWORK ALGORITHM

A DQN agent assumes complete observability of the state

space, limiting its performance in POMDP problems. On the

other hand, conventional methods assume that the dynam-

ics of the environment, i.e., the transition probabilities, are

known and use Bayes’ theorem for belief estimation and

problem-solving. However, our work involves a continuous

state space and three system tiers, rendering a model-based

environment representation impractical. To solve this issue, the deep recurrent Q-network (DRQN) [17] method has been introduced, which uses a long short-term memory (LSTM) layer in addition to fully connected neural networks (FCNNs) in the policy network to estimate the Q-functions. However, the DRQN method overlooks the action, which, as we will show in our simulation results, decreases the performance.

Fig. 2. Integration of the ADRQN agent at the TBS.

In this letter, we use the ADRQN method, which concatenates the action vector with the observation vector and feeds it to the policy network to estimate the action-state value function, i.e., $Q(\mathbf{v}_{t-1}, \mathbf{a}_{t-1}, \mathbf{o}_t \mid \theta)$ at time slot $t$. Here, $\theta$ denotes the weights of the policy network, for which an LSTM layer is used in addition to the fully connected layers to better approximate the Q-function. $\mathbf{a}_{t-1}$ is the previously executed action vector at time slot $t-1$, $\mathbf{o}_t$ is the observation vector (we do not have access to the exact state vector $\mathbf{s}_t$), and $\mathbf{v}_{t-1}$ denotes the output of the LSTM layer, which is defined as

$$\mathbf{v}_t = \mathrm{LSTM}(\mathbf{v}_{t-1}, \mathbf{a}_{t-1}, \mathbf{o}_t). \tag{8}$$
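As a rough illustration of (8), the following sketch implements a standard LSTM cell in plain Python and feeds it the concatenation of the previous action vector and the current observation. The class name `TinyLSTMCell`, the hidden size, the random initialization, the explicit cell state `c`, and the example numbers are all assumptions for illustration; the letter does not specify them, and a practical agent would rely on a deep learning library and add fully connected layers on top.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

class TinyLSTMCell:
    """Standard LSTM cell with input (i), forget (f), candidate (g), and output (o) gates."""

    def __init__(self, input_size, hidden_size, seed=0):
        rng = random.Random(seed)

        def mat(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

        # Each gate has one weight block acting on [input, previous output].
        self.W = {gate: mat(hidden_size, input_size + hidden_size) for gate in "ifgo"}
        self.b = {gate: [0.0] * hidden_size for gate in "ifgo"}

    def step(self, x, v_prev, c_prev):
        z = list(x) + list(v_prev)
        pre = {gate: [p + b for p, b in zip(matvec(self.W[gate], z), self.b[gate])]
               for gate in "ifgo"}
        i = [sigmoid(p) for p in pre["i"]]
        f = [sigmoid(p) for p in pre["f"]]
        g = [math.tanh(p) for p in pre["g"]]
        o = [sigmoid(p) for p in pre["o"]]
        c_new = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
        v_new = [ok * math.tanh(ck) for ok, ck in zip(o, c_new)]
        return v_new, c_new

# Hypothetical example with U = 3 users: previous action and current observation.
a_prev = [0.0, 1.0, 2.0]
o_t = [0.8, 1.5, 0.3]
cell = TinyLSTMCell(input_size=len(a_prev) + len(o_t), hidden_size=4)
v, c = cell.step(a_prev + o_t, v_prev=[0.0] * 4, c_prev=[0.0] * 4)
```

The returned `v` plays the role of $\mathbf{v}_t$ and is carried to the next time slot together with the cell state.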

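We employ a buffer and a target network to increase the stability of our ADRQN agent during training [18]. As depicted in Fig. 2, we store the transition tuple $(\mathbf{a}_{t-1}, \mathbf{o}_t, \mathbf{a}_t, R(\mathbf{s}_t, \mathbf{a}_t), \mathbf{o}_{t+1})$ in buffer $\mathcal{D}$ at each time slot $t$ and employ a random mini-batch of data to update our ADRQN agent. Additionally, we maintain a target network with weights $\tilde{\theta}$, serving as a delayed copy of our main (policy) network. Following the above explanations, we use the following loss function for updating our main network:

$$L(\theta) = \mathbb{E}_{(\{\mathbf{a}^-, \mathbf{o}\}, \mathbf{a}, R(\mathbf{s},\mathbf{a}), \mathbf{o}^+) \sim \mathcal{D}}\left[\left(y - Q(\mathbf{v}^-, \mathbf{a}^-, \mathbf{o} \mid \theta)\right)^2\right], \tag{9}$$

where $\mathbf{v}^-$ and $\mathbf{a}^-$ are the previous LSTM output and action, respectively, $\mathbf{o}^+$ is the next observation, and $y$ is defined as

$$y = R(\mathbf{s}, \mathbf{a}) + \gamma \max_{\mathbf{a}\in\mathcal{A}} Q(\mathbf{v}, \mathbf{a}, \mathbf{o}^+ \mid \tilde{\theta}). \tag{10}$$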

In the sequel, we deﬁne the action space, observation space,

and reward function, and then explain our training procedure.

1) Action: Our action space $\mathcal{A}$ is made up of $U \times 1$ action vectors $\mathbf{a}_i = [a_{1,i}\ a_{2,i}\ \dots\ a_{U,i}]^T$, where $\mathbf{a}_i$ represents the $i$-th possible user scheduling vector. Here, $a_{u,i} = 0$ denotes that user $u$ is associated with the TBS, while $a_{u,i} = 1$ means the HAPS is serving user $u$, and $a_{u,i} = 2$ denotes that user $u$ is associated with the VLEO. Finally, considering the constraints in our optimization problem (7), $\sum_{u=1}^{U} [a_{u,i} = 1] \le N$ and $\sum_{u=1}^{U} [a_{u,i} = 2] \le E$ should be satisfied.

2) Reward: In accordance with our notations in Section II, our reward function is defined as the sum-rate of the network, i.e., $R(\mathbf{s}, \mathbf{a}) = R + R' + R''$.

3) Observations: At time slot $t$, our $U \times 1$ observation vector $\mathbf{o}_t \in \mathcal{O}$ is defined as $\mathbf{o}_t = [o_{1,t}\ o_{2,t}\ \dots\ o_{U,t}]^T$, with

$$o_{u,t} = \begin{cases} \|\mathbf{h}_{u,t}\|_2, & \text{if } a_u = 0 \\ \|\mathbf{g}_{u,t}\|_2, & \text{if } a_u = 1 \\ \|\mathbf{x}_{u,t}\|_2, & \text{if } a_u = 2, \end{cases} \tag{11}$$

where $a_u$ is the $u$-th element of the action vector at time slot $t-1$.
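The action and observation definitions above can be sketched in a few lines. The code below enumerates every scheduling vector that satisfies the two cardinality constraints of (7) and builds an observation vector from the previous action in the spirit of (11). The names `feasible_actions` and `observation`, the use of the Euclidean norm, and the toy channel values are assumptions for illustration.

```python
import itertools
import math

TBS, HAPS, SAT = 0, 1, 2

def feasible_actions(U, N, E):
    """All length-U vectors over {0, 1, 2} with at most N HAPS users and E satellite users."""
    return [a for a in itertools.product((TBS, HAPS, SAT), repeat=U)
            if a.count(HAPS) <= N and a.count(SAT) <= E]

def norm(vec):
    """Euclidean norm of a complex channel vector."""
    return math.sqrt(sum(abs(c) ** 2 for c in vec))

def observation(prev_action, h, g, x):
    """Per user, observe only the channel norm of the layer chosen in the previous slot."""
    channels = {TBS: h, HAPS: g, SAT: x}
    return [norm(channels[a_u][u]) for u, a_u in enumerate(prev_action)]

# Action space size for the training setup U = 6, N = 2, E = 1.
actions = feasible_actions(U=6, N=2, E=1)

# Toy channels for two users (hypothetical values).
h = [[3 + 4j], [1 + 0j]]
g = [[0j], [0 + 2j]]
x = [[1 + 1j], [2 + 0j]]
obs = observation((TBS, HAPS), h, g, x)  # user 0 sees its TBS link, user 1 its HAPS link
```

For $U=6$, $N=2$, and $E=1$ this enumeration yields 118 feasible actions, and the observation always has $U$ entries regardless of the number of layers, which reflects the reduced CSI exchange discussed in the introduction.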

Training Procedure: We start by populating buffer $\mathcal{D}$ with random values and initializing the main and target networks to be the same, i.e., $\tilde{\theta} = \theta$. To explore different system layouts and prevent overfitting to specific user locations, we run $I_{\text{episode}}$ episodes, where the environment is reset at the beginning of each episode, redistributing users and generating channels as discussed in Section II. We implement the $\varepsilon$-greedy exploration strategy, progressively reducing the exploration probability throughout the training phase, starting with $\varepsilon_1 = 1$ for the initial episode.
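One way to realize a progressively decreasing exploration probability is a linear decay, sketched below together with $\varepsilon$-greedy selection. The linear form, the function names, and the parameter values are assumptions; the exact update rule used in the letter is the one given in Algorithm 1.

```python
import random

def epsilon_schedule(i, eps_final, decay_episodes):
    """Assumed linear decay: 1 at episode 1, eps_final from episode `decay_episodes` onwards.

    Requires decay_episodes > 1.
    """
    if i >= decay_episodes:
        return eps_final
    return 1.0 - (1.0 - eps_final) * (i - 1) / (decay_episodes - 1)

def select_action(q_values, epsilon, rng):
    """Epsilon-greedy: random action index with probability epsilon, else the greedy index."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)

rng = random.Random(0)
eps_per_episode = [epsilon_schedule(i, eps_final=0.1, decay_episodes=10) for i in range(1, 13)]
greedy_choice = select_action([0.1, 0.9, 0.3], epsilon=0.0, rng=rng)
```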

At each time slot, the ADRQN agent either executes a random action or receives $\{\mathbf{a}_{t-1}, \mathbf{o}_t\}$ and chooses the action with the highest Q value. Afterwards, the agent gets the reward value $R(\mathbf{s}_t, \mathbf{a}_t)$ and the new observation $\mathbf{o}_{t+1}$. We then store the information in the buffer, and every $\eta$ time slots, we sample a mini-batch of information to update the main network. Furthermore, every $\eta'$ time slots, we update the target network by replacing its weights with the main network's weights. Finally, we save the model every $\eta''$ episodes to load it in the testing phase and assess its performance.
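The bookkeeping described above (store every transition, update the main network every $\eta$ slots, synchronize the target network every $\eta'$ slots) can be sketched without any neural network. The class `ReplayBuffer`, the function `run_episode`, the placeholder transition, and the chosen values of $L$, $\eta$, and $\eta'$ are hypothetical and only show when each step would fire.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity buffer D with uniform random mini-batch sampling."""

    def __init__(self, capacity, seed=0):
        self.data = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def store(self, transition):
        self.data.append(transition)

    def sample(self, batch_size):
        return self.rng.sample(list(self.data), min(batch_size, len(self.data)))

def run_episode(buffer, L, eta, eta_target):
    """Walk through L time slots and record when each update would fire."""
    events = []
    for t in range(1, L + 1):
        # Placeholder for ({a_prev, o_t}, a_t, reward, o_next); a real agent stores actual data.
        buffer.store(("a_prev", "o_t", "a_t", 0.0, "o_next"))
        if t % eta == 0:
            batch = buffer.sample(4)
            events.append((t, "update_main", len(batch)))
        if t % eta_target == 0:
            events.append((t, "sync_target", None))
    return events

events = run_episode(ReplayBuffer(capacity=100), L=10, eta=2, eta_target=5)
```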

IV. SIMULATION RESULTS

A. Benchmarks

1) DRQN: It is similar to our ADRQN method; the only difference is that the DRQN method overlooks the actions, which, as we will show, negatively affects the performance.

2) DQN: We assume three scenarios for the DQN method: 1) the DQN-F method, which has access to the global CSI, i.e., $\mathbf{h}$, $\mathbf{g}$, and $\mathbf{x}$; 2) the DQN-P method, which only receives the observations and does not use the actions; and 3) the DQN-PA method, which receives the same input as our ADRQN agent.

3) Exhaustive Search (ES) Selection: Evaluates all possible

actions at each time slot and chooses the one with the highest

reward value. It is noted that this method is impractical and

operates independently of states or observations.

4) Random Selection: Selects actions randomly, indepen-

dently of states or observations.

5) Convex Optimization (CO): We assume ﬁxed power

instead of uniform power allocation and solve our problem

with CO for a better comparison. Note that the CO method

needs the global CSI to solve (7).

6) Convex Optimization With Uncertainty Bounds (COUB): For further investigation, we add to our imperfect CSI comparisons a method used in [19] that employs uncertainty bounds to solve the convex optimization problem with imperfect CSI. Note that, like the CO method, the COUB method also needs global CSI to solve (7).
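The exhaustive search and random selection benchmarks can be sketched as follows. The reward used here is a toy additive table of per-user, per-layer rates, which is an assumption made only to keep the example checkable by hand; in the letter the reward is the network sum-rate $R + R' + R''$, which depends jointly on the scheduled sets. All function names are hypothetical.

```python
import itertools
import random

def feasible_actions(U, N, E):
    """Scheduling vectors over {0: TBS, 1: HAPS, 2: satellite} that satisfy (7)."""
    return [a for a in itertools.product((0, 1, 2), repeat=U)
            if a.count(1) <= N and a.count(2) <= E]

def exhaustive_search(actions, reward_fn):
    """Evaluate every feasible action and keep the one with the highest reward."""
    return max(actions, key=reward_fn)

def random_selection(actions, rng):
    """Pick a feasible action uniformly at random, ignoring states and observations."""
    return rng.choice(actions)

# Toy rates[u][layer] for three users (hypothetical values).
rates = [[1.0, 2.0, 5.0],
         [1.0, 3.0, 4.0],
         [2.0, 1.0, 0.5]]

def toy_reward(action):
    return sum(rates[u][layer] for u, layer in enumerate(action))

actions = feasible_actions(U=3, N=1, E=1)
best = exhaustive_search(actions, toy_reward)
rand = random_selection(actions, random.Random(0))
```

Because the search visits every feasible action at every time slot, its cost grows with $|\mathcal{A}|$, which is why it serves only as an upper bound.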


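Algorithm 1 Training Algorithm of Our Proposed ADRQN Framework

Input: update frequencies of the main and target networks and model-saving frequency $\eta$, $\eta'$, $\eta''$; number of episodes and time slots $I_{\text{episode}}$, $L$; exploration probability at the $i$-th episode $\varepsilon_i$; final $\varepsilon_i$ value $\varepsilon_{\text{final}}$; final $\varepsilon_i$ updating episode $I'$;
1 Initialize: buffer $\mathcal{D}$, main network's weights $\theta$ randomly, target network's weights $\tilde{\theta} = \theta$;
2 for $i = 1$ to $I_{\text{episode}}$ do
3     Reset the environment and initialize $\mathbf{o}_0$;
4     Update $\varepsilon_i = \begin{cases} 1 - \frac{1-\varepsilon_{\text{final}}}{I'}, & \text{if } 1 < i < I' \\ \varepsilon_{\text{final}}, & I' < i < I_{\text{episode}} \end{cases}$;
5     for $t = 1$ to $L$ do
6         Randomly select an action with a probability of $\varepsilon_i$; if not, feed $\{\mathbf{a}_{t-1}, \mathbf{o}_t\}$ to the main network and choose $\mathbf{a}_t = \arg\max_{\mathbf{a}} Q(\mathbf{v}_{t-1}, \mathbf{a}_{t-1}, \mathbf{o}_t \mid \theta)$;
7         Execute action $\mathbf{a}_t$, receive reward $R(\mathbf{s}_t, \mathbf{a}_t)$ and new observation $\mathbf{o}_{t+1}$, and store $(\{\mathbf{a}_{t-1}, \mathbf{o}_t\}, \mathbf{a}_t, R(\mathbf{s}_t, \mathbf{a}_t), \mathbf{o}_{t+1})$ in $\mathcal{D}$;
8         if $t \bmod \eta = 0$ then
9             Sample a mini-batch of data and update the main network using (9);
10        if $t \bmod \eta' = 0$ then
11            Update the target network: $\tilde{\theta} \leftarrow \theta$;
12    if $i \bmod \eta'' = 0$ then
13        Save the learned model;
Output: Learned ADRQN model $Q(\mathbf{v}, \mathbf{a}, \mathbf{o} \mid \theta)$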

B. Numerical Results

During training, we assume U=6 single-antenna users

distributed in a 3 km radius cell. We consider a TBS with

M=16 antennas and a supply power of P= 10 dB, a HAPS

with an altitude of 18 km that is equipped with N=2 antennas

and a supply power of P′=10 dB, and a VLEO with an

altitude of 250 km that is equipped with E=1 antenna and a

supply power of P′′ =10 dB. We utilize the aperture antenna

model [20] for both the HAPS and satellite and design it for

maximum gain at the cell border. We use the free-space path

loss model to generate large-scale fading for the HAPS and

the satellite. To account for atmospheric attenuation in the

satellite, we utilize the model proposed in [21]. For power

allocation, we assume uniform distribution. For CO and COUB

methods, we allocate $P/(M-N-E)$ dB per TBS user, $P'/N$ dB per HAPS user, and $P''/E$ dB per VLEO user. Providing the full episode to the agent is optimal but computationally demanding, so we limit it to 10 time slots. Training parameters in Algorithm 1 and other details are in Table I.

Using the abovementioned values for the system model, we evaluate all benchmark methods and compare our ADRQN with them in Table II. As expected, ES selection and random selection achieve the best and worst results, respectively. Our proposed ADRQN's performance is close to that of the DQN-F, which requires the full CSI, i.e., it employs the state vector with $3 \times U$ elements. Moreover, the performances of the ADRQN and CO methods closely align, highlighting ADRQN's efficacy compared to mathematical approaches. It is worth noting that the CO method operates at each time slot and uses global CSI, while the ADRQN agent does not require learning or backpropagation during testing. As previously mentioned, accounting for the action vector is vital for our problem. The sum-rate for the DRQN method, which employs only the observation vector, decreases by 0.49 bps compared to that of the ADRQN method. Additionally, comparing the DQN-PA and DQN-P methods reveals that incorporating the action vector boosts DQN performance by 0.68 bps. This is because, with a partially observable observation vector, the action vector helps the agent understand that each layer has a different range of CSI values, so that it can interpret them separately.

TABLE I. SIMULATION PARAMETERS

TABLE II. COMPARATIVE RESULTS

To investigate the scalability of our ADRQN method, we plot the average sum-rate versus the number of users in Fig. 3, assuming $N=3$ and $E=2$. For better visualization, we only plot the ES selection and random selection as the upper and lower bounds, and the CO method as the principal benchmark. As shown, the sum-rate for all methods increases as the number of users increases, and our proposed ADRQN agent remains very close to the ES selection.

To show the robustness of our proposed ADRQN method, we consider a practical scenario where perfect CSI is unavailable. We model the imperfect CSI by adding Gaussian noise to the channel. Denoting by $\epsilon_h$ the variance of the channel estimation error, Fig. 4 illustrates the average sum-rate versus $\epsilon_h$. Here, we assume $U=7$, $M=16$, $N=3$, $E=2$, and $\epsilon_h$ varies between 0.2 and 0.8. Compared to the results shown in Fig. 3, the sum-rate of the ADRQN method decreases by 0.48 bps for $\epsilon_h = 0.2$ and then remains approximately constant. The sum-rate of the CO method, on the other hand, drops to 61.63 bps (from 62.79 bps) for $\epsilon_h = 0.2$, and then decreases by 7.98 bps as $\epsilon_h$ increases to 0.8. Although the sum-rate of the COUB method is similar to that of the ADRQN method for small estimation errors, it decreases for higher $\epsilon_h$ values (more than 0.4), for example, to 57.17 bps for $\epsilon_h = 0.8$.

Fig. 3. Average sum-rate versus the number of users.

Fig. 4. Average sum-rate versus $\epsilon_h$.

C. Computational Complexity

The computational complexity of DRL methods is contin-

gent on the dimension of the deep neural network employed

as the policy network. If we consider $|\mathcal{A}|$ to be the dimension of the action space, for $U=6$, $N=2$, and $E=1$, the computational complexity of the ADRQN method is $O(|\mathcal{A}|^{2.58})$, whereas that of the DQN-F method is $O(|\mathcal{A}|^{2.50})$ and that of the DRQN method is $O(|\mathcal{A}|^{2.575})$. The computational complexity of the CO method is $O(|\mathcal{A}|^{2.51})$. Note that once the ADRQN agent has been trained using Algorithm 1, no learning or backpropagation is needed in the testing phase.

V. CONCLUSION

This letter investigated user association in a three-layer network using only a subset of the global CSI, which keeps our proposed method scalable as the number of network layers increases. Moreover, our proposed method does not require transition probabilities, which makes it more feasible for solving POMDP problems in wireless networks. We compared our proposed ADRQN method to various benchmarks and demonstrated its superior performance in solving our POMDP problem. We also discussed the importance of using the action vector in addition to the observation vector. Finally, we investigated the imperfect CSI case and showed that the ADRQN agent maintains its performance while the sum-rate of the CO method drops significantly.

REFERENCES

[1] G. Karabulut Kurt et al., “A vision and framework for the high altitude platform station (HAPS) networks of the future,” IEEE Commun. Surveys Tuts., vol. 23, no. 2, pp. 729–779, 2nd Quart., 2021.

[2] N. H. Crisp et al., “The benefits of very low earth orbit for earth observation missions,” Progr. Aerosp. Sci., vol. 117, pp. 1–23, Aug. 2020.

[3] Y. Yuan, L. Lei, T. X. Vu, Z. Chang, S. Chatzinotas, and S. Sun, “Adapting to dynamic LEO-B5G systems: Meta-critic learning based efficient resource scheduling,” IEEE Trans. Wireless Commun., vol. 21, no. 11, pp. 9582–9595, Nov. 2022.

[4] A. Alsharoa and M.-S. Alouini, “Improvement of the global connectivity using integrated satellite-airborne-terrestrial networks with resource optimization,” IEEE Trans. Wireless Commun., vol. 19, no. 8, pp. 5088–5100, Aug. 2020.

[5] A. Alidadi Shamsabadi, A. Yadav, O. Abbasi, and H. Yanikomeroglu, “Handling interference in integrated HAPS-terrestrial networks through radio resource management,” IEEE Wireless Commun. Lett., vol. 11, no. 12, pp. 2585–2589, Dec. 2022.

[6] Z. Lin, M. Lin, Y. Huang, T. D. Cola, and W.-P. Zhu, “Robust multi-objective beamforming for integrated satellite and high altitude platform network with imperfect channel state information,” IEEE Trans. Signal Process., vol. 67, no. 24, pp. 6384–6396, Dec. 2019.

[7] Z. Lin, M. Lin, T. de Cola, J.-B. Wang, W.-P. Zhu, and J. Cheng, “Supporting IoT with rate-splitting multiple access in satellite and aerial-integrated networks,” IEEE Internet Things J., vol. 8, no. 14, pp. 11123–11134, Jul. 2021.

[8] Y. Cao, S.-Y. Lien, and Y.-C. Liang, “Deep reinforcement learning for multi-user access control in non-terrestrial networks,” IEEE Trans. Commun., vol. 69, no. 3, pp. 1605–1619, Mar. 2021.

[9] S. Jo, W. Yang, H. K. Choi, E. Noh, H.-S. Jo, and J. Park, “Deep Q-learning-based transmission power control of a high altitude platform station with spectrum sharing,” Sensors, vol. 22, no. 4, p. 1630, Feb. 2022.

[10] H. Tsuchida et al., “Efficient power control for satellite-borne batteries using Q-learning in low-earth-orbit satellite constellations,” IEEE Wireless Commun. Lett., vol. 9, no. 6, pp. 809–812, Jun. 2020.

[11] Q. Zhao, B. Krishnamachari, and K. Liu, “On myopic sensing for multi-channel opportunistic access: Structure, optimality, and performance,” IEEE Trans. Wireless Commun., vol. 7, no. 12, pp. 5431–5440, Dec. 2008.

[12] S. Sharifi, S. Shahbazpanahi, and M. Dong, “A POMDP-based antenna selection for massive MIMO communication,” IEEE Trans. Commun., vol. 70, no. 3, pp. 2025–2041, Mar. 2022.

[13] H. Khoshkbari, S. Sharifi, and G. Kaddoum, “User association in a VHetNet with delayed CSI: A deep reinforcement learning approach,” IEEE Commun. Lett., vol. 27, no. 8, pp. 2257–2261, Aug. 2023.

[14] E. Cianca et al., “Integrated satellite-HAP systems,” IEEE Commun. Mag., vol. 43, no. 12, pp. 33–39, Dec. 2005.

[15] P. Zhu, X. Li, P. Poupart, and G. Miao, “On improving deep reinforcement learning for POMDPs,” 2017, arXiv:1704.07978.

[16] E. Falletti, M. Laddomada, M. Mondin, and F. Sellone, “Integrated services from high-altitude platforms: A flexible communication system,” IEEE Commun. Mag., vol. 44, no. 2, pp. 85–94, Feb. 2006.

[17] M. Hausknecht and P. Stone, “Deep recurrent Q-learning for partially observable MDPs,” 2015, arXiv:1507.06527.

[18] V. Mnih et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.

[19] K. Pathak, S. S. Kalamkar, and A. Banerjee, “Optimal user scheduling in energy harvesting wireless networks,” IEEE Trans. Commun., vol. 66, no. 10, pp. 4622–4636, Oct. 2018.

[20] A. Ibrahim and A. S. Alfa, “Using Lagrangian relaxation for radio resource allocation in high altitude platforms,” IEEE Trans. Wireless Commun., vol. 14, no. 10, pp. 5823–5835, Oct. 2015.

[21] R. Wang, P. Ren, D. Xu, and L. Lu, “Stochastic geometry analysis of LEO constellation coverage under atmospheric attenuation,” in Proc. IEEE 96th Veh. Technol. Conf. (VTC-Fall), Sep. 2022, pp. 1–5.

Authorized licensed use limited to: Bibliothèque ÉTS. Downloaded on December 14,2023 at 04:19:51 UTC from IEEE Xplore. Restrictions apply.

Partially Cooperative RL for Hybrid Action CRNs With Imperfect CSI

Article

Full-text available

Jan 2024

Cognitive radio networks (CRNs) mitigate spectrum scarcity by leveraging the holes in the licensed spectrum to enable internet of things (IoT) devices to opportunistically access the spectrum. However, IoT devices need to sense the spectrum before they can access it, which is an energy-intensive process and hinders the practical implementation of opportunistic spectrum access for energy-constrained IoT devices. In this context, reinforcement learning-based algorithms that encourage cooperation among IoT devices to eliminate the need for constant sensing are promising candidates for practical CRN implementation. As exciting as the application of reinforcement learning to CRNs is, benchmarking the performance of different algorithms is a huge challenge due to a lack of standardized comparison metrics, especially for hybrid action spaces that comprise both discrete and continuous actions. We propose a hybrid discrete-continuous space deep reinforcement learning algorithm that maximizes the energy efficiency of CRNs by optimizing sensing, cooperation, and transmission by IoT devices.We also analyze the algorithm’s performance by setting the theoretical upper bound for throughput and find that it reaches 99.4% of the theoretical upper bound, while its discrete action-space version reaches 96% and other baseline algorithms range between 70% and 86%.

Deep Reinforcement Learning Approach for HAPS User Scheduling in Massive MIMO Communications

Article

Full-text available

Jan 2023

In this paper, we devise a deep SARSA-learning (DSRL) user scheduling algorithm for a base station (BS) that uses a high-altitude platform station (HAPS) as a backup to serve multiple users in a wireless cellular network. Considering a realistic scenario, we assume that only the outdated channel state information (CSI) of the terrestrial base station (TBS) is available in our defined user scheduling problem. To do so, we define a user scheduling problem using a Markov decision process (MDP) framework, where the objective function is to maximize the sum-rate with a minimum number of active antennas at the HAPS. Our performance analysis shows that the sum-rate obtained with our proposed DSRL algorithm is close to the optimal sum-rate achieved with an exhaustive search method. We also develop a heuristic optimization method to solve the user scheduling problem at the BS. We show that for a scenario where perfect CSI is not available, our proposed DSRL algorithm outperforms the heuristic optimization method.

User Association in a VHetNet with Delayed CSI: A Deep Reinforcement Learning Approach

Article

Aug 2023

Non-terrestrial base stations (NTBSs) must be employed for next-generation wireless networks to provide users with ubiquitous connectivity and a higher data rate. In vertical heterogeneous networks (VHetNets), associating users with either a terrestrial base station (TBS) or an NTBS to maximize the sum-rate of the network while accounting for the resource limitations that exist at the NTBS poses a challenge. Moreover, a practical user association method should be capable of working in a realistic situation in which instantaneous channel state information (CSI) is not available. To solve this problem, we propose a deep Q-learning (DQL) approach in which a satellite is our agent and schedules each user to a TBS or a high-altitude platform station (HAPS) in each time slot using the CSI of the previous time slot. The proposed method achieves results nearly identical to those of the exhaustive search action selection method. Furthermore, we investigate the effect of imperfect CSI and show that our proposed method outperforms the convex optimization user association scheme in the presence of noisy CSI.

Handling Interference in Integrated HAPS-Terrestrial Networks Through Radio Resource Management

Article

Sep 2022

Vertical heterogeneous networks (vHetNets) are promising architectures to bring significant advantages for 6G and beyond mobile communications. High altitude platform station (HAPS), one of the nodes in the vHetNets, can be considered as a complementary platform for terrestrial networks to meet the ever-increasing dynamic capacity demand and provide sustainable wireless networks for the future. However, the problem of interference is the bottleneck for the optimal operation of such an integrated network. Thus, designing efficient interference management techniques is inevitable. In this work, we aim to design a joint power-subcarrier allocation scheme in order to achieve fairness for all users. We formulate the max-min fairness (MMF) optimization problem and develop a rapidly converging iterative algorithm to solve it. Numerical results validate the superiority of the proposed algorithm and show better performance over other conventional network scenarios.

Adapting to Dynamic LEO-B5G Systems: Meta-Critic Learning Based Efficient Resource Scheduling

Article

Nov 2022

Low earth orbit (LEO) satellite-assisted communications have been considered as one of the key elements in beyond 5G systems to provide wide coverage and cost-efficient data services. Such dynamic space-terrestrial topologies impose an exponential increase in the degrees of freedom in network management. In this paper, we address two practical issues for an over-loaded LEO-terrestrial system. The first challenge is how to efficiently schedule resources to serve a massive number of connected users, such that more data and users can be delivered/served. The second challenge is how to make the algorithmic solution more resilient in adapting to dynamic wireless environments. We first propose an iterative suboptimal algorithm to provide an offline benchmark. To adapt to unforeseen variations, we propose an enhanced meta-critic learning algorithm (EMCL), where a hybrid neural network for parameterization and the Wolpertinger policy for action mapping are designed in EMCL. The results demonstrate EMCL’s effectiveness and fast-response capabilities in over-loaded systems and in adapting to dynamic environments compared to previous actor-critic and meta-learning methods.

Deep Q-Learning-Based Transmission Power Control of a High Altitude Platform Station with Spectrum Sharing

Article

Feb 2022
SENSORS-BASEL

A High Altitude Platform Station (HAPS) can facilitate high-speed data communication over wide areas using high-power line-of-sight communication; however, it can significantly interfere with existing systems. Given spectrum sharing with existing systems, the HAPS transmission power must be adjusted to satisfy the interference requirement for incumbent protection. However, excessive transmission power reduction can lead to severe degradation of the HAPS coverage. To solve this problem, we propose a multi-agent Deep Q-learning (DQL)-based transmission power control algorithm to minimize the outage probability of the HAPS downlink while satisfying the interference requirement of an interfered system. In addition, a double DQL (DDQL) is developed to prevent the potential risk of action-value overestimation from the DQL. With a proper state, reward, and training process, all agents cooperatively learn a power control policy for achieving a near-optimal solution. The proposed DQL power control algorithm performs equal or close to the optimal exhaustive search algorithm for varying positions of the interfered system. The proposed DQL and DDQL power control yield the same performance, which indicates that the action-value overestimation does not adversely affect the quality of the learned policy.

A Vision and Framework for the High Altitude Platform Station (HAPS) Networks of the Future

Article

Mar 2021

A High Altitude Platform Station (HAPS) is a network node that operates in the stratosphere at an altitude of around 20 km and is instrumental for providing communication services. Precipitated by technological innovations in the areas of autonomous avionics, array antennas, solar panel efficiency levels, and battery energy densities, and fueled by flourishing industry ecosystems, the HAPS has emerged as an indispensable component of next-generation wireless networks. In this article, we provide a vision and framework for the HAPS networks of the future supported by a comprehensive and state-of-the-art literature review. We highlight the unrealized potential of HAPS systems and elaborate on their unique ability to serve metropolitan areas. The latest advancements and promising technologies in the HAPS energy and payload systems are discussed. The integration of the emerging Reconfigurable Smart Surface (RSS) technology in the communications payload of HAPS systems for providing a cost-effective deployment is proposed. A detailed overview of the radio resource management in HAPS systems is presented along with synergistic physical layer techniques, including Faster-Than-Nyquist (FTN) signaling. Numerous aspects of handoff management in HAPS systems are described. The notable contributions of Artificial Intelligence (AI) in HAPS, including machine learning in the design, topology management, handoff, and resource allocation aspects are emphasized. The extensive overview of the literature we provide is crucial for substantiating our vision that depicts the expected deployment opportunities and challenges in the next 10 years (next-generation networks), as well as in the subsequent 10 years (next-next-generation networks).

Supporting IoT With Rate-Splitting Multiple Access in Satellite and Aerial Integrated Networks

Article

Jan 2021

To satisfy the explosive access demands of internet-of-things (IoT) devices, various kinds of multiple access techniques have received much attention. In this paper, we investigate the multicast communication of a satellite and aerial integrated network (SAIN) with rate-splitting multiple access (RSMA), where both satellite and unmanned aerial vehicle (UAV) components are controlled by a network management center and operate in the same frequency band. Considering a content delivery scenario, the UAV sub-network adopts RSMA to support massive access of IoT devices and achieve the desired performance in terms of interference suppression, spectral efficiency, and hardware complexity. We first formulate an optimization problem to maximize the sum-rate of the considered system subject to the signal-to-interference-plus-noise ratio requirements of IoT devices and per-antenna power constraints at the UAV and satellite. To solve this non-convex optimization problem, we exploit the sequential convex approximation and the first-order Taylor expansion to convert the original optimization problem into a solvable one with a rank-one constraint, and then propose an iterative penalty function based algorithm to solve it. Finally, simulation results verify that the proposed method can effectively suppress the mutual interference and improve the system sum-rate compared to the benchmark schemes.

Stochastic Geometry Analysis of LEO Constellation Coverage under Atmospheric Attenuation

Conference Paper

Sep 2022

A POMDP Based Antenna Selection for Massive MIMO Communication

Article

Nov 2021

We use a partially observable Markov decision process (POMDP) framework to design an optimal antenna selection policy for downlink transmit beamforming at a multi-antenna base station (BS) equipped with only a limited number of RF chains. Assuming that the channel state evolves according to a finite-state Markov process and that only the channel coefficients which correspond to previously selected antennas are available at the BS, we use the POMDP framework for antenna selection with the aim to maximize the long-term expected downlink data rate. To avoid the high computational complexity of the value iteration algorithm, we focus on the myopic policy and prove that in the case of a positively correlated two-state Markov model for the channel over each antenna, the myopic policy is optimal for antenna selection for any number of RF chains. Based on this finding, for general fading channels, we propose to quantize each channel into two levels and apply the myopic policy for antenna selection. Our simulation results show that using this two-state coarse channel quantization for antenna selection results in only a small loss in performance, as compared to the antenna selection technique which uses full channel state information without quantization.

Deep Reinforcement Learning For Multi-User Access Control in Non-Terrestrial Networks

Article

Nov 2020

Non-Terrestrial Networks (NTNs) composed of space-borne (e.g., satellites) and airborne vehicles (e.g., drones and blimps) have recently been proposed by 3GPP as a new paradigm of infrastructures to enhance the capacity and coverage of existing terrestrial wireless networks. The mobility of non-terrestrial base stations (NT-BSs) however leads to a dynamic environment, which imposes unique challenges for handover and throughput optimization particularly in multi-user access control for NTNs. To achieve performance optimization, each terrestrial user equipment (UE) should autonomously estimate the dynamics of moving NT-BSs, which is different from the existing user access control schemes in terrestrial wireless networks. Consequently, new learning schemes for optimum multi-user access control are desired. In this paper, we therefore propose a UE-driven deep reinforcement learning (DRL) based scheme, in which a centralized agent deployed at the backhaul side of NT-BSs is responsible for training the parameter of a deep Q-network (DQN), and each UE independently makes its own access decisions based on the parameter from the trained DQN. With the proposed scheme, each UE is able to access a proper NT-BS intelligently to enhance the long-term system throughput and avoid frequent handovers among NT-BSs. Through comprehensive simulation studies, we justify the performance of the proposed scheme, and show its effectiveness in addressing the fundamental issues in the NTNs deployment.

The benefits of very low earth orbit for earth observation missions

Article

Aug 2020
PROG AEROSP SCI

Very low Earth orbits (VLEO), typically classified as orbits below approximately 450 km in altitude, have the potential to provide significant benefits to spacecraft over those that operate in higher altitude orbits. This paper provides a comprehensive review and analysis of these benefits to spacecraft operations in VLEO, with parametric investigation of those which apply specifically to Earth observation missions. The most significant benefit for optical imaging systems is that a reduction in orbital altitude improves spatial resolution for a similar payload specification. Alternatively mass and volume savings can be made whilst maintaining a given performance. Similarly, for radar and lidar systems, the signal-to-noise ratio can be improved. Additional benefits include improved geospatial position accuracy, improvements in communications link-budgets, and greater launch vehicle insertion capability. The collision risk with orbital debris and radiation environment can be shown to be improved in lower altitude orbits, whilst compliance with IADC guidelines for spacecraft post-mission lifetime and deorbit is also assisted. Finally, VLEO offers opportunities to exploit novel atmosphere-breathing electric propulsion systems and aerodynamic attitude and orbit control methods. However, key challenges associated with our understanding of the lower thermosphere, aerodynamic drag, the requirement to provide a meaningful orbital lifetime whilst minimising spacecraft mass and complexity, and atomic oxygen erosion still require further research. Given the scope for significant commercial, societal, and environmental impact which can be realised with higher performing Earth observation platforms, renewed research efforts to address the challenges associated with VLEO operations are required.

Recommended publications

User Association in a VHetNet with Delayed CSI: A Deep Reinforcement Learning Approach

Deep Reinforcement Learning Approach for HAPS User Scheduling in Massive MIMO Communications

Enhancing Next-Generation Urban Connectivity: Is the Integrated HAPS-Terrestrial Network a Solution?

Impact of Objective Function on Spectral Efficiency in Integrated HAPS-Terrestrial Networks