Deep Reinforcement Learning applied to multi-agent informative path planning in environmental missions
Samuel Yanes Luis, Manuel Perales Esteve, Daniel Gutiérrez Reina and Sergio Toral Marín
Key words: Deep Reinforcement Learning, Information Gathering,
Multiagent, Environmental Monitoring
Abstract Deep Reinforcement Learning algorithms have gained attention lately due to their ability to solve complex decision problems with a model-free and zero-derivative approach. In the case of multi-agent problems, these algorithms can help to easily find efficient cooperative policies in a feasible amount of time. In this chapter, we present the Informative Patrolling Problem, a commonplace task in the conservation of water resources. The approach is presented here as a convenient methodology for the synthesis of cooperative policies that can solve the simultaneous objectives present in the unmanned monitoring of lakes and rivers: maximizing the information collected about water parameters and collision-free routing with multiple surface vehicles. For this mixed objective, a Deep Q-Learning scheme with a convolutional network as a shared fleet policy is proposed. In order to solve the credit assignment problem, an effective multiagent decomposition of the informative reward is proposed, together with a discussion of several other state-of-the-art topics of Reinforcement Learning: noisy networks for enhanced exploration of the state-action domain, the use of visual states, and the shaping of the reward function. This methodology, as is quantitatively demonstrated, allows a significant improvement in water resource monitoring compared to other heuristics.
Samuel Yanes Luis
Dpt. of Electronics, University of Sevilla, Av. de Los Descubrimientos s/n, 41003, Sevilla, Spain,
e-mail: syanes@us.es
Manuel Perales-Esteve
Dpt. of Electronics, University of Sevilla, Av. de Los Descubrimientos s/n, 41003, Sevilla, Spain,
e-mail: mperales@us.es
Daniel Gutiérrez Reina
Dpt. of Electronics, University of Sevilla, Av. de Los Descubrimientos s/n, 41003, Sevilla, Spain,
e-mail: dgutierrezreina@us.es
Sergio Toral Marín
Dpt. of Electronics, University of Sevilla, Av. de Los Descubrimientos s/n, 41003, Sevilla, Spain,
e-mail: storal@us.es
1 Introduction
Conservation of drinking water reserves is a strategic objective for societies. This resource has a direct impact on the health of people, agriculture, and industry in a country. However, due to human-related causes, these large bodies of water are continuously being polluted: untreated discharges, oil spills from boats, toxic algae proliferation, etc., drastically impoverish the quality of the aquatic ecosystem and alter the biological balance of the local fauna and flora. In the case of particularly large water bodies, such as Lake Ypacaraí (Paraguay, 60 km²) or the Mar Menor (Spain, 170 km²), manual samplings for continuous monitoring of the water quality (WQ) variables become strenuous and costly. Several human and material resources are needed, and there is always a risk to operators in contaminated scenarios. An efficient way to collect physicochemical water data is to use autonomous surface vehicles (ASVs) [22] equipped with highly sensitive WQ sensors. These sensors can measure physical parameters such as pH, temperature, oxygen saturation levels, and water turbidity, or chemical parameters such as nitrites, sulphates, and dissolved chlorophyll. All these parameters can be used, by means of a proper data analysis, to infer the biological state of a lake or a river. In such enormous water bodies it is mandatory to use a fleet of multiple vehicles, as the battery budget of one single agent does not provide enough coverage. With multiple vehicles it is easier to obtain a complete set of measurements that truly represents the useful information to be considered for resource conservation. The challenge of the multi-vehicle paradigm lies in the need for an effective coordination policy that allows them to take samples autonomously in the lake following a low-redundancy criterion. The vehicles must coordinate to share the search space and collect as much information as possible. Information, represented by the values of the water variables, will be obtained sequentially as the vehicles acquire the measurements one by one at different locations in the scenario. An effective fleet policy must consider taking samples in those places with the highest information available, which means sampling in places of high uncertainty. The definition of the information depends on the context, but we can define a general formulation of such a task, the Informative Path Planning (IPP) [5],
where the objective is to obtain an optimal acquisition path Ψ* that maximizes the informative value I on a fixed time budget T:

\Psi^* = \arg\max_{\Psi} I(\Psi) \qquad (1)
Fig. 1 Autonomous vehicles
carrying out a surveillance
mission in a recreational lake
in Seville (Spain).
Information I can be defined stochastically in terms of the reduction of uncertainty of each sampled point of the environment [24]. A reasonable initial hypothesis states that every navigable point p in the complete set X behaves like a Gaussian random variable, p ∈ X ∼ N(µ, σ). Then, let Σ[X, X_meas] be the spatial correlation matrix that describes the statistical relationship between the samples; this matrix indicates the level of uncertainty of each point, considering the locations where X_meas has been sampled. This approach implements a Matérn Kernel Function (MKF) [21] to spatially correlate samples as a function of their adjacency, under the acceptable assumption that physically close samples will be more closely related. Given that, the IPP will consist of sequentially deciding the next physical point at which to take a sample, following the information-gain maximization criterion. Following the definition of information gain in information theory, Σ[X, X_meas] defines how informative a point is according to the decrease in entropy of the model formed by the set of points of the lake X and the sample points X_meas [21]. Finally, the IPP problem to be solved here consists of minimizing the total entropy of the scenario H[X|X_meas].
In addition to the informative criterion, the fleet policy must consider in the motion planning any obstacles and non-navigable zones. In this multiagent paradigm, the agents must also be considered as moving obstacles, so the path planning becomes intrinsically dynamic and harder to solve. As the complexity of this problem explodes because of the many feasible paths and restrictions in every realistic scenario, a proper algorithm is needed that can deal with the IPP problem with multiple agents and can scale to different fleet sizes. In this chapter, a Reinforcement Learning (RL) approach is proposed to deal with the high dimension of this problem, inspired by the success of [15], where a Q-Learning algorithm solves hard tasks with proficiency. The RL paradigm tries to solve sequential problems by trial and error. In the RL paradigm, an agent tries to maximize the reward r_t obtained at instant t by taking an action a_t in an environment [25]. The information about the environment, called the state s_t, is usually observable in some form, so
it can be interpreted to maximize the reward. This whole process ends when the agent has synthesized a policy π(s) that maps a state s into the best learned action a (see Figure 2). To adapt the IPP problem to this paradigm, a formulation has been proposed in terms of a Markov Decision Process (MDP), where the aforementioned agent interacts sequentially in a finite control horizon T by taking discrete actions, which physically correspond to movements, that maximize the reward at the end of a mission:
\pi^*(s) = \arg\max_{\pi} \sum_{t=0}^{T} \left[ r\left( s_t, a = \pi(s_t) \right) \right] \qquad (2)
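To make the interaction loop behind Eq. (2) concrete, the following minimal Python sketch rolls out one episode and accumulates the reward. It assumes a hypothetical gym-like environment object with reset and step methods and an arbitrary policy callable; both names are illustrative and not taken from the chapter's implementation.

def run_episode(env, policy, T=100):
    """Roll out one mission of at most T steps and return the accumulated reward."""
    state = env.reset()
    total_reward = 0.0
    for t in range(T):
        action = policy(state)                  # a_t = pi(s_t)
        state, reward, done = env.step(action)  # environment returns s_{t+1}, r_t
        total_reward += reward
        if done:
            break
    return total_reward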
Fig. 2 Basic reinforcement learning scheme. In every RL case, there is at least one agent that interacts through actions a_t with an environment. The environment returns two types of information: the next state (observation s_t) and a reward r_t that evaluates the action in the current state.
For this purpose, the reward function r(s, a) must be tailored to meet the objectives of the path planning, i.e., the most informative, non-redundant, obstacle-free routes for every ASV in the fleet. In the following sections, an appropriate reward function for the multi-agent IPP is presented.
In [15], the use of deep convolutional neural networks (CNNs) was introduced as a novelty in the RL scene to deal with high-dimensional states. This was a methodological revolution for RL. CNNs were used to represent the state-action function Q of the MDP. This deep Q function represents the expected future reward given a current state s and an action a. The capacity of neural networks to fit non-linear and complex behavior allowed the algorithm to supersede other benchmarks. Through many iterations, the neural function is optimized with the given reward signal by randomly interacting with the environment. This algorithm, called Deep Q-Learning, constitutes a robust method to solve sequential decision problems such as the one presented in this chapter. All things considered, this chapter adapts the single-agent Deep Q-Learning (DQL) method to solve the multi-agent IPP, taking advantage of the model-free ability of this algorithm. There is additional
complexity in doing so, as this problem deals with multiple agents at the same time. Multi-agent RL (MARL) involves certain difficulties [9], such as learning scalability, non-injectivity of the reward function, and non-stationarity of the environment. In this chapter, we present an observational method to include all the information in the visual input of the network for better efficiency. Additionally, to find a suitable policy, it is necessary to sufficiently explore the state-action set, whose size increases with each agent included in the fleet. Therefore, classic exploration strategies, such as ε-greedy [34], fail to find the optimal actions for success in the IPP task. To overcome this situation, in this chapter we propose the use of noisy neural networks (noisy-NN) [7]. Noisy NNs are a nondeterministic version of classical neural networks. Using a Gaussian distribution, each parameter introduces some level of noise that can enhance exploration.
This chapter addresses certain research gaps in the literature on multiagent optimization, information gathering, and deep reinforcement learning. With this approach, we propose an entropic criterion to synthesize the most effective routes that maximize the information collected by a fleet of vehicles. While other algorithms designed for entropy minimization have been designed solely for the single-agent paradigm [20], our multiagent formulation by means of DRL results in a very convenient and flexible approach. Another highlight is the use of a global visual state, which allows better scaling to higher numbers of vehicles. Previous works tend to use low resolutions, local observations, and formulations that do not scale well with the number of agents [29, 34, 10]. Our proposal places value on the use of noisy neural networks for multi-agent policies and proposes a global visual observation method for homogeneous fleets. Finally, this chapter's contribution can be summarized as follows:
• The formulation of the IPP from the perspective of multi-agent uncertainty reduction.
• The application of DRL for the multi-agent IPP resolution.
• A policy formulation with noisy neurons and a state that allows for interchangeability between agents.
The chapter is organized as follows. In Section 2, a brief review of the literature is developed to place the reader in the context of the problem to be solved. In Section 3, the IPP is formally described with all its constraints. In Section 4, the methods used to solve the IPP are described, that is, the DRL algorithm, the state, the neural network, etc. In Section 5, the characteristics and results of the experiments are presented. In Section 6, these results are discussed and compared with other heuristics. Finally, in Section 7, the conclusions of this work and future research lines are presented.
2 Related work
The use of ASVs for surveillance or monitoring has been gaining momentum recently due to advances in robotics and vehicle autonomy [4]. The use of small robotic vehicles is not only more efficient but also cheaper compared to the human cost of periodic surveillance missions. In [6] there is a good example of an ASV application for bathymetric measurements of lakes and seas. In this work, a low-cost vehicle is designed to perform bathymetry (depth of lakes and river basins) in an unmanned way. Another example of the use of ASVs is in [16], where a surface vehicle is implemented to monitor disaster zones after hurricanes. The vehicle is equipped with cameras and an underwater sonar camera to search for obstacles in the presence of turbid waters. Another widespread use of ASVs is the conservation of aquatic natural resources, such as lakes and rivers. Here we can distinguish between the different sub-tasks the conservation of such resources involves: I) Patrolling [33, 35, 34], which consists of continuously monitoring the resource as a method of early warning for contamination or sudden changes in healthiness levels. II) Efficient model regression [18, 17, 11], which is to efficiently obtain an accurate WQ model of the waters with one or more vehicles. III) Peak detection [11], which consists in detecting the maxima of contamination in the shortest possible time. IV) Informative coverage [1, 2], which consists of covering a certain area efficiently to acquire as much information as possible. The IPP fits into the set of problems within informative coverage. In Figure 3, typical uses of ASVs in the literature are depicted and organized into three main branches: disaster assistance, topological characterization, and environmental monitoring.
In [33] the non-homogeneous patrolling problem was formulated as a single-agent directed graph problem. The ASV was trained using DRL, resulting in an efficient policy to minimize the average waiting time in lakes with a dissimilar importance criterion. Similarly, the patrolling case was extended in [34], with a different number of vehicles for each simulation and a low state resolution. These works have demonstrated the ability of DRL to synthesize high-performance behaviors for patrolling tasks with ASVs, but also the high dimension and complexity of these problems. In [35] the use of such DRL techniques was compared with Genetic Algorithms (GAs). The scalability was proven to be better with DRL, which motivates this chapter to choose DQL for solving the IPP. ASVs were also used in [18] to obtain a contamination model of the Ypacaraí Lake. This work proposes a Bayesian approach with Gaussian Processes (GPs) for obtaining an accurate WQ scalar map. In the same line, [17] generalized the algorithm for the multi-agent case. The sampling space was divided using a Voronoi-based tessellation method. The main difference between this approach and ours is that, when continuously monitoring, no cost is considered for sampling. In the proposed method, the IPP is solved by taking as many samples as possible to achieve higher accuracy. Another branch of algorithms for monitoring is based on particle swarm optimization (PSO), which was used for the same purpose in [11, 26, 27]. The PSO was applied
Fig. 3 Different applications of ASVs in the literature.
in a fleet of four ASVs for the detection of contamination peaks. This heuristic is a simple and model-free approach to localize global and local maxima that adapts naturally to the multi-agent paradigm. Nonetheless, its performance and behavior heavily depend on the hyperparameters. For informative coverage, some solutions have been proposed by means of GAs [1, 2, 36]. In [1], the informative problem is modeled as a Traveling Salesman Problem (TSP), where the objective is to cover the Ypacaraí Lake in search of green algae blooms. The algorithm resulted in Eulerian circuits that maximize the effective area in a single-agent scenario. In [2] a similar approach is proposed, but this time focused on an online re-optimization once algae blooms are detected. Those algorithms generate quasi-optimal solutions, but they take too long to be applied in a real scenario. Our approach seeks an effective policy for coverage within a fixed time budget. Additionally, in [36], the problem is extended to multiple agents and multiple objectives for obtaining closed patrolling paths. This work uses a graph formulation and analyses the disperse Pareto-optimal solutions of using multiple agents in coverage problems.
When addressing the use of DRL with unmanned vehicles, there are several interesting examples in the literature [12, 37, 32]. In [12], a DRL-based approach is proposed to fulfill complete coverage in cleaning tasks with a multi-form robot. This single-agent application deals with high action dimensionality and uses an algorithm named Actor-Critic with Experience Replay (ACER). The DRL approaches with ASVs usually put the focus on low-level control or trajectory planning [37]. In that work, RL was applied to obtain a motor controller for path tracking with an unmanned underwater vehicle (UUV). The applied algorithm resulted in a suitable and realistic controller that quickly reduces the positioning error. In [32] there is an
example of the use of DRL for collision avoidance. A deep policy trained with DQL is proposed to learn how to modify the control signals in order to avoid moving obstacles in the trajectory. This way, the agent must learn two simultaneous tasks: to avoid the obstacles and to track the reference path. This is similar to our approach, where the agent must perform the information gathering task at the same time as it avoids non-navigable areas. Other single-agent approaches deal with informative tasks instead of pure path tracking [19, 28]. In [19], the use of DQL is proposed for patrolling tasks using a camera in survey scenarios. This approach seeks a policy that optimally patrols the environment by focusing on high-importance zones, but does not deal with obstacle or boundary avoidance. In [28], an aerial vehicle agent is trained to perform informative coverage with take-off and landing restrictions. This hard task is also achieved using DQL. A simple reward function is tailored to reward the correct actions and the information collection.
Ref.  Application                                  Algorithm                    Vehicle  Multiagent
[6]   Topological characterization of lakes        Manual path                  ASV      No
[16]  Hurricane disaster-scenario monitoring       Manual path                  ASV      No
[1]   Informative coverage of lakes                Genetic Algorithm            ASV      No
[36]  Multi-objective non-homogeneous patrolling   Evolutionary Strategy        ASV      Yes
[2]   Adaptive informative coverage of lakes       Genetic Algorithm            ASV      No
[18]  WQ model acquisition of lakes                Bayesian Optimization        ASV      No
[17]  WQ model acquisition of lakes                Bayesian Optimization        ASV      Yes
[11]  Peak detection and model acq.                Particle Swarm Optimization  ASV      Yes
[26]  Peak detection and model acq.                Particle Swarm Optimization  ASV      Yes
[27]  Peak detection and model acq.                Particle Swarm Optimization  ASV      Yes
[33]  Non-homogeneous patrolling                   Deep Q-Learning              ASV      No
[34]  Non-homogeneous patrolling                   Multi-agent Deep Q-Learning  ASV      Yes
[10]  Dynamic informative tracking                 Multi-agent Deep Q-Learning  UAVs     Yes
[30]  Dynamic informative tracking                 Multi-agent Deep Q-Learning  UAVs     Yes
Table 1 Summary of related works using autonomous vehicles for environmental monitoring tasks.
With regard to the multi-agent paradigm, DRL approaches must deal with communication problems between agents, the credit assignment problem, and other issues [9]. In [14] a method is proposed for decentralized learning in hybrid competitive-cooperative scenarios. This approach adapts the single-agent Deep Deterministic Policy Gradient algorithm [13] to comply with decentralized deployment for continuous actions. Our approach differs from this in that our problem is fully cooperative and, as the agents are homologous in actions and in their observational abilities, the learning does not require decentralization. In [10], an algorithm to train two agents to survey wildfires is proposed. This work uses the DQL
algorithm with a centralized policy for different agents and an observation-based formulation of the process. Different from our approach, this work does not directly act on uncertainty, and the agents' observations are only useful for fire-front detection. Our approach, in contrast, directly tries to minimize the uncertainty, which can be a valid approach independently of the process under survey. In [30], a similar approach is solved using different architectures for the DQL algorithm. This work addresses the study of different multi-agent algorithms to monitor wildfires. The conclusions of this work serve this chapter as a guideline for choosing a centralized approach for solving the IPP using DRL. The network proposed in this chapter differs from the one used in [30] because the observation of every agent is fully visual and can include all other agents' information in an image to be processed. This alleviates the need for retraining when the fleet grows in size. Another relevant aspect is that the state in this chapter's proposal gathers the global information of the scenario, whereas in [30] the observation is local to every agent.
3 Statement of the problem
The main objective of this monitoring application is to find a policy that coordinates a fleet of ASVs to efficiently collect information about the WQ measurements. This section explains the stochastic assumptions underlying the information criterion and the statement of the Informative Path Planning problem.
3.1 Information framework
The IPP starts by defining a navigation space X ⊂ R², where every vehicle in the fleet can take a measurement of the water. The subset of gathered sample locations X_meas ⊂ R² is also defined, together with a navigation map M such that M(p) = 1 for every navigable point p := [p_x, p_y] ∈ X. It is a reasonable hypothesis to assume that every possible point that can be sampled behaves like a Gaussian random variable with mean µ and variance c, such that p ∼ N(µ, c). Now, we can assert that the visitable space X behaves as a Multivariate Gaussian Distribution (MGD), where X ∼ N_n(µ̄, Σ), with Σ being the correlation matrix of X.
Here, we can define a function that serves as a surrogate model of the correlation as we take samples. This function will serve as an indicator of the information we have over the navigable space X depending on the measurements taken X_meas. As we expect the WQ distribution to be smooth to a certain level, a
Matérn Kernel Function (MKF) can be used to define the correlation between two samples at two different locations, in a smooth and exponential manner following Eq. (3). This hypothesis makes sense because the WQ parameters cannot change drastically from one point to a nearby one, as has been studied before¹. Consequently, near samples from the same or different agents will be highly correlated, and the reduction of the uncertainty in these locations will be high but redundant. Two parameters model the MKF: i) the parameter ν models the smoothness of the uncertainty decay with the distance between samples p and p′, and ii) the parameter l serves to scale how correlated two measurements (p, p′) are with each other. These values are usually chosen based on prior knowledge of the environment to be monitored [29, 20] or on how intensively the environment is meant to be covered. Figure 4 depicts the effect on the uncertainty when using an MKF with (ν, l) = (1.5, 1).
\mathrm{MKF}(p, p') = \frac{2^{1-\nu}}{\Gamma(\nu)} \left( \frac{\sqrt{2\nu}\,\lVert p - p' \rVert_2}{l} \right)^{\nu} K_{\nu}\!\left( \frac{\sqrt{2\nu}\,\lVert p - p' \rVert_2}{l} \right) \qquad (3)
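A direct implementation of Eq. (3) can be written with SciPy's gamma and modified Bessel functions. The following sketch is only illustrative; the function name and default parameters are assumptions, and in practice an equivalent kernel from a GP library could be used instead.

import numpy as np
from scipy.special import gamma, kv  # kv is the modified Bessel function K_nu

def matern_kernel(p, p_prime, l=1.0, nu=1.5):
    """Matérn correlation between two sample locations, following Eq. (3)."""
    d = np.linalg.norm(np.asarray(p, dtype=float) - np.asarray(p_prime, dtype=float))
    if d == 0.0:
        return 1.0  # a point is fully correlated with itself
    scaled = np.sqrt(2.0 * nu) * d / l
    return (2.0 ** (1.0 - nu) / gamma(nu)) * (scaled ** nu) * kv(nu, scaled)

# With (nu, l) = (1.5, 1), as in Figure 4:
print(matern_kernel([0.0, 0.0], [0.5, 0.0], l=1.0, nu=1.5))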
When every sample at an instant t is evaluated with the kernel function, it is possible to obtain the conditional correlation matrix Σ[X|X_meas] according to the following expression [21]:

\Sigma[X|X_{meas}] = \Sigma[X, X] - \Sigma[X, X_{meas}] \times \Sigma[X_{meas}, X_{meas}]^{-1} \times \Sigma[X, X_{meas}]^{T} \qquad (4)
Now, the monitoring objective involves decreasing the entropy associated with the conditional correlation. The information entropy H[X|X_meas] gives a measure of the uncertainty about the monitoring domain and the randomness of a sample at an arbitrary point in that space. The lower the entropy, the more confidence one has about the scenario. Finally, the entropy can be calculated as [21]:

H[X|X_{meas}] = \frac{1}{2} \log\left( \lvert \Sigma[X|X_{meas}] \rvert \right) + \frac{\dim(X)}{2} \log(2 \pi e) \qquad (5)
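Equations (4)-(6) can be computed directly once the kernel matrices are built. The sketch below assumes that the matrices Sigma_XX = Σ[X, X], Sigma_Xm = Σ[X, X_meas], and Sigma_mm = Σ[X_meas, X_meas] have already been evaluated with the MKF; the small jitter term is a common numerical safeguard, not something prescribed by the chapter.

import numpy as np

def conditional_covariance(Sigma_XX, Sigma_Xm, Sigma_mm, jitter=1e-6):
    """Sigma[X | X_meas], following Eq. (4)."""
    K_inv = np.linalg.inv(Sigma_mm + jitter * np.eye(Sigma_mm.shape[0]))
    return Sigma_XX - Sigma_Xm @ K_inv @ Sigma_Xm.T

def trace_uncertainty(Sigma_cond):
    """A-criterion surrogate of the entropy: the trace of Eq. (6)."""
    return np.trace(Sigma_cond)

def entropy(Sigma_cond):
    """Differential entropy of the conditioned field, as in Eq. (5)."""
    _, logdet = np.linalg.slogdet(Sigma_cond)
    n = Sigma_cond.shape[0]
    return 0.5 * logdet + 0.5 * n * np.log(2.0 * np.pi * np.e)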
It can be seen that by decreasing the determinant of the covariance matrix, the entropy will be reduced. Reducing this determinant can be done in two ways according to [24]: by directly reducing the product of the eigenvalues of Σ[X|X_meas]² (called the D-criterion) or by reducing the sum of the eigenvalues. The latter, called the A-criterion for uncertainty reduction, can be achieved by decreasing the trace of Σ[X|X_meas],

¹ https://marmenor.upct.es/maps/
² As demonstrated in [21], and because Σ is a positive semi-definite matrix, |Σ| = ∏_{i=0}^{dim(X)} λ_i.
which is the sum of its diagonal. By reducing these values, we will be reducing the determinant of Σ[X|X_meas] and, consequently, the entropy of the gathered information:

\mathrm{Tr}(\Sigma[X|X_{meas}]) \leq \mathrm{Tr}(\Sigma) = \sum_{i=0}^{\dim(X)} \lambda_i = \sum_{i=0}^{\dim(X)} \Sigma_{ii} \qquad (6)
Figure 4 depicts how the uncertainty σ(X) of a one-dimensional space, i.e., the diagonal of Σ[X|X_meas], decreases with two independent samples p1 and p2. In the vicinity of those sample points, σ is no longer 1.0, since a smooth correlation is assumed with the MKF.
Fig. 4 Uncertainty conditioning process. When two samples p1 and p2 are taken and incorporated into X_meas, Σ[X|X_meas] is updated according to Eq. (4). It can be observed that the uncertainty σ(X) associated with Σ[X|X_meas] becomes zero at the sample points (assuming no sampling noise) and decreases in their vicinity according to the Matérn function. The entropy H[X|X_meas] of the process decreases accordingly. The intermediate uncertainty between samples implies redundancy in the acquisition; thus, two very near samples make no significant reduction to σ(X).
3.2 Informative Path Planning
Once the information framework is formulated, we can continue by stating the sequential decision problem underlying the IPP. The final objective of the IPP has been reduced to minimizing the total entropy of the information. For this work, we will take the uncertainty, the trace of Σ[X|X_meas], Tr(Σ), as an analogous measurement of the entropy (as mentioned before, they are directly related). The objective is to find a path Ψ = {ψ1, ψ2, ..., ψN} for every agent in a fleet of N agents that minimizes the uncertainty at the end of a mission:

\Psi^* = \arg\min_{\Psi} \mathrm{Tr}(\Sigma, \Psi) \qquad (7)
Given that every agent can move sequentially from a point p_t^j to another p_{t+1}^j, every path ψ is composed of T possible movements. The distance d_meas from one position to the next one is fixed. This way, the speed of every vehicle is fixed to a constant value. Additionally, with every movement a new WQ sample is obtained for every vehicle in the fleet. With regard to obstacles, given the navigation map M, which indicates whether a point p is navigable or not, the solutions will be restricted to those paths that are bounded to navigable zones. This way, the fleet must coordinate to find the best solutions, starting from any arbitrary initial points Ψ_0 = {ψ_1^{t=0}, ..., ψ_N^{t=0}}, that efficiently minimize the uncertainty without any collisions between the vehicles or with the non-navigable zones. This means that taking samples in the very same place that another agent (or the same one) has already measured is, obviously, not desired. Finally, the IPP problem can be stated as:
\underset{\Psi = \{\psi_1, ..., \psi_N\}}{\text{minimize}} \; \mathrm{Tr}\left( \Sigma[X | X_{meas}(\Psi)] \right) \quad \text{subject to} \quad \forall p \in X_{meas}(\Psi): \; M(p) = 1 \qquad (8)
Figure 5 depicts a typical situation where two vehicles intersect their paths, taking redundant samples in between.
4 Methodology
First, this section explains the Markov Decision Process that serves as the sequential framework to implement a Reinforcement Learning approach for the IPP: the state representation, the reward function, and the actions. Secondly, the Deep Q-
Fig. 5 When addressing the multi-agent IPP, it is necessary to consider that two very near samples are redundant, and the closer they are, the less interesting both become. This figure represents two intersecting paths that incur redundancy (lower values of σ(X)). Taking samples at the very borders is also considered useless, as the uncertainty cannot be reduced outside the limits.
Learning approach is presented and its different sub-modules are described: the deep policy, the noisy implementation of the neural network, and the Prioritized Experience Replay mechanism.
4.1 Markov Decision Process
The IPP can be formulated as a Markov Decision Process (MDP) to fulfill the requirements of the Reinforcement Learning theory [25]. In any MDP there is, at least, one agent that can perform an action a_t from a set of valid actions A at instant t. This action interacts with the environment, which produces a state s_{t+1} and a reward r_t. The state represents all the available environment information and, in the strict sense, it is all the information necessary to reconstruct it. The agent can observe the scenario through an observation mechanism o = O(s) that maps the environment state into an observation o. In general terms, the observation has incomplete information of the dynamics of the scenario. When the observation is perfect, s_t ≡ o_t, the MDP is defined to be a Fully-Observable MDP (FOMDP). In the opposite case, it is known as a Partially-Observable MDP (POMDP). Every agent also has a policy a = π(o) that maps an observation into an action. The policy can be deterministic or
stochastic, and defines the strategy the agent follows to obtain the maximum reward possible in a finite action time T:

\pi^*(o) = \arg\max_{\pi} \left[ \sum_{t=0}^{T} r\left( a_t = \pi(o_t), s_t \right) \right] \qquad (9)

The optimal policy π*(o) is reached, and the MDP is considered solved, when there is no strategy that obtains, on average, a better reward.
4.1.1 Agents actions
As mentioned in Subsection 3.2, every agent performs the same type of actions, which are movements in |A| possible directions with a constant distance d_meas. Those actions correspond to the cardinal directions {N, S, E, W, NE, NW, SE, SW}. With a constant speed, the fleet moves in a coordinated manner and every path in Ψ will have the same length d_max. Every action also involves taking a sample at the next resulting point and obtaining a new uncertainty matrix Σ.
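As an illustration of this action set, the following sketch maps the eight discrete actions to fixed-length displacements. The specific action indices and the normalization of diagonal steps are assumptions made for the example, not details taken from the chapter.

import numpy as np

# The eight cardinal/diagonal movements, indexed as discrete actions 0..7 (assumed order).
DIRECTIONS = {0: (0, 1), 1: (0, -1), 2: (1, 0), 3: (-1, 0),
              4: (1, 1), 5: (-1, 1), 6: (1, -1), 7: (-1, -1)}

def next_position(position, action, d_meas=1.0):
    """Move a fixed distance d_meas in the chosen direction and return the next sample point."""
    direction = np.asarray(DIRECTIONS[action], dtype=float)
    direction /= np.linalg.norm(direction)  # keep the step length constant for diagonal moves
    return np.asarray(position, dtype=float) + d_meas * direction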
4.1.2 The state and observation
There is not a unique way to represent the state of this problem. To accomplish the task, the observation made by the agents must be feasible and must include as much information as possible about the real problem state. Here we propose a fully visual observation, similarly to related works such as [33, 15]. This observation is composed of the following elements of the state, as channels of an image of [58 × 38] pixels:
1. The navigation map M: formed by an image with values in {0, 1}. A value of 1 means this position is considered non-navigable.
2. The agent j position: also formed by a binary image of the same size, with a 1 in the position of the agent that is making the observation.
3. The fleet state: another binary image with a 1 in those cells occupied by the other agents' positions, from the perspective of the agent making the observation.
4. The uncertainty map σ(X), which contains the diagonal values of Σ[X|X_meas] and represents the values of the trace according to Eq. 6 (a sketch of how these channels can be stacked is given after this list).
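The sketch below stacks the four channels into a single observation tensor for agent j. It assumes the positions are given as integer grid cells of the [58 x 38] map; the helper name and argument layout are illustrative rather than taken from the chapter's code.

import numpy as np

def build_observation(nav_map, uncertainty, fleet_positions, agent_idx):
    """Stack the four channels described above into a [4, H, W] observation."""
    H, W = nav_map.shape
    agent_channel = np.zeros((H, W), dtype=np.float32)
    fleet_channel = np.zeros((H, W), dtype=np.float32)
    for i, (row, col) in enumerate(fleet_positions):
        if i == agent_idx:
            agent_channel[row, col] = 1.0   # position of the observing agent
        else:
            fleet_channel[row, col] = 1.0   # positions of the other agents
    return np.stack([nav_map.astype(np.float32),
                     agent_channel,
                     fleet_channel,
                     uncertainty.astype(np.float32)])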
Figure 6 depicts the different channels of the state.
Fig. 6 An arbitrary observation o_t of an agent: (a) the navigation map M, which is the same for every agent; (b) the uncertainty values mapped into an image; (c) the other agents, apart from the observer, discretized in an image; (d) the position of the agent making the observation.
4.1.3 The reward function
For the DRL to optimize a useful policy for the information gathering, it is vital to design an effective reward function. The reward function must evaluate how good an action is in a particular state. The total reward at the end of an episode, which is the sum of the consecutive rewards for every agent, will represent how well the fleet performs in the given scenario. The reward must be positive for good actions and negative for bad actions. The goodness or badness of every action is something worth discussing.
For this particular problem, the total minimization of the uncertainty could be a good measurement of the quality of each vehicle movement. Let the fleet action at instant t be ā_t = {a_t^1, a_t^2, ..., a_t^N}, with one action for every agent i:

r(s_t, a_t^j) = \frac{\mathrm{Tr}(\Sigma)_{t-1} - \mathrm{Tr}(\Sigma)_{t}}{\dim(\mathrm{Tr}(\Sigma))} = \frac{\Delta U_t}{\dim(\mathrm{Tr}(\Sigma))} \qquad (10)
This is equivalent to saying that every agent will be rewarded with the total amount of uncertainty reduced by all the fleet movements. Here we can expect the agents to learn to minimize the uncertainty, but there is an intrinsic problem with this direct formulation. A bad action of one agent and a good action of another will be assigned the same credit, so the learning will be harder since agents are partially biased by the others' behavior. This problem, called the credit assignment problem, is
typical of multi-agent games [9], and in this particular case it will be significant, as the process explained in Eq. 4 is difficult to decompose among different agents. Here we propose a solution for this problem using simple knowledge of the scenario dynamics: we can compute the individual contribution to the uncertainty reduction as if the agent were alone. This way, for every agent action a_t^j ∈ ā_t, the individual contribution ΔU_t^j can be computed as:

\Delta U_t^j = \mathrm{Tr}\left( \Sigma[X|X_{meas}]_{t-1} \right) - \mathrm{Tr}\left( \Sigma[X \,|\, X_{meas}^{t-1} \cup p_t^j] \right) \qquad (11)
This is equivalent to considering an individualistic reward that does not take into account the redundancies of the sampling at instant t. Taking two samples in nearby places would be rewarded as if they were unique samples, so the reward signal must consequently be modulated in terms of the redundancy. It is not difficult to see that the closer the agents taking samples, the higher the redundancy. We propose a linear decrease of the individual contribution as a function of the distance of each agent to the others, to enhance the dispersion of the agents through the search space. A clipped linear function can be imposed to do so, as in Figure 7.
Fig. 7 The penalization for the distance is proportional to the mean distance between the agent j and the others. It is imposed that, beyond a maximum distance d_max, the penalization is zero, and vice versa for a distance close to d_min.
To address the avoidance of non-navigable areas, it is also imposed that any movement to an invalid zone will be penalized with c_p < 0. The final reward function for an agent j will be:

r(s_t^j, a_t^j) = \begin{cases} c_p & \text{when } a \text{ involves a collision} \\ \Delta U_t^j + \lambda \times c_d(d_{ji}) & \text{otherwise} \end{cases} \qquad (12)
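The decoupled reward of Eqs. (11)-(12) can be sketched as follows. The clipped linear penalization mirrors Figure 7, while the values of c_p, lambda, d_min, and d_max are placeholders; the chapter does not fix them at this point.

import numpy as np

def distance_penalty(mean_distance, d_min, d_max):
    """Clipped linear penalization c_d of Figure 7: -1 near d_min, 0 beyond d_max."""
    x = (mean_distance - d_min) / (d_max - d_min)
    return -1.0 + float(np.clip(x, 0.0, 1.0))

def decoupled_reward(delta_U_j, mean_dist_j, collided,
                     c_p=-1.0, lam=1.0, d_min=0.0, d_max=10.0):
    """Individual reward of Eq. (12): collision penalty, or the agent's own
    uncertainty reduction (Eq. 11) plus the distance modulation term."""
    if collided:
        return c_p
    return delta_U_j + lam * distance_penalty(mean_dist_j, d_min, d_max)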
4.2 Deep Q Learning
Deep Q-Learning was applied successfully for the first time in [15], where the DeepMind team trained an agent to play several classic ATARI games with superhuman capabilities. The success of this work resides in the use of CNNs to represent the state-action function, which is called the Q function. This function represents, in the sense of the Bellman equations [3], the expected discounted future reward given a state s and an action a:

Q(s_t, a_t) = \mathbb{E}\left[ r(s_t, a_t) + \gamma \sum_{s' \in S} p(s'|s, a) \max_{a' \in A} Q(s', a') \right] \qquad (13)

where γ is a discount factor that weights the importance of future rewards against immediate ones, and p(s′|s, a) is the probability of transitioning to s_{t+1} from the current state and given the current action³.
In DQL, Q is represented by a neural network Q(s, a; θ) (with θ being the parameters) and must be optimized by trial and error using a Stochastic Gradient Descent (SGD) algorithm to find the optimal policy according to Bellman's principle of optimality [3]:

\pi^*(s; \theta) = \arg\max_{a \in A} Q(s, a; \theta) \qquad (14)

The deep policy π(s; θ) is optimized by generating many tuples of experiences <s_t, a_t, r_t, s_{t+1}> and minimizing the Bellman error E:

E(s_t, a_t, r_t, s_{t+1}) = Q(s_t, a_t) - \underbrace{\left[ r_t + \gamma \times \max_{a} Q(s_{t+1}, a) \right]}_{\text{Target value } y(r_t, s_{t+1})} \qquad (15)
This error value, usually called the temporal difference error, represents the prediction error made when estimating the expected future value with Q(s, a). This error is minimized by taking SGD steps with respect to a loss function, so the parameters of Q(s, a; θ) move in the direction that minimizes such a value. This optimization method, as can be seen in Eq. 15, uses the very same function under optimization to predict future values, also called target values. This method, named bootstrapping, involves

³ In a totally deterministic environment, this probability is assumed to be 1.
certain instabilities [25] that can cause the learning to fail completely. In [15, 8], the solution to this comes from the use of two effective methods:
1. The use of a twin target function to predict the Q values at the instant t+1. This function, Q_target(s, a; θ′), is not optimized directly, but instead periodically copies its parameters θ′ from Q(s, a; θ).
2. The use of an Experience Replay (ER). The ER technique consists of saving every experience in a memory and training the neural policy periodically using batches of samples randomly chosen from it. Random sampling reduces the correlation between experiences and therefore the network bias (a minimal update sketch combining both methods is given below).
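A minimal PyTorch sketch of one optimization step, combining the target network of item 1 with a batch sampled from the replay memory of item 2, could look as follows. It implements the Bellman error of Eq. (15) with a mean squared loss; the tensor layout of the batch and the choice of optimizer are assumptions of the example.

import torch
import torch.nn.functional as F

def dql_update(q_net, q_target, optimizer, batch, gamma=0.99):
    """One SGD step on the Bellman error of Eq. (15) using a frozen target network."""
    states, actions, rewards, next_states, dones = batch
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        next_q = q_target(next_states).max(dim=1).values
        targets = rewards + gamma * (1.0 - dones) * next_q   # target value y(r_t, s_{t+1})
    loss = F.mse_loss(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Periodically, the target parameters are copied from the online network:
# q_target.load_state_dict(q_net.state_dict())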
Fig. 8 Deep Q-Learning algorithm scheme, combining the replay memory and the sampling of experience batches.
Finally, this algorithm uses the trial and error method to optimize Q(s, a) so that it can be used following Eq. 14. When addressing the multi-agent problem, the same Q function can be extended to all agents, as they are homologous in actions and in observations. Since the objective in this problem is intended to be fully cooperative and the interactions of every agent are included through the reward, as was explained before, the neural policy can be trained using every individual state to produce one action for every agent. Here, it is worth mentioning that, as every agent's situation is part of the state and the behavior of every agent evolves with learning, the dynamics of the environment become nonstationary. This involves a higher complexity
than classical stationary problems, as in the ATARI cases [15], and, in the end, a
higher need for exploration of the state-action domain.
4.2.1 Prioritized Experience Replay
As mentioned before, the ER consists of sampling batches of previous experiences, saved in a memory buffer M as they occur, to fit the Q(s, a; θ) function. When the experiences are sampled with equal probability, we neglect the fact that some experiences could be more useful for training than others. This situation usually happens when a rarely visited state is sampled and the prediction error is big. Therefore, it makes sense to sample the experiences in a non-uniform fashion. With a Prioritized Experience Replay (PER) [23], every experience in the memory E_i ∈ M has a probability p_i of being sampled for the loss computation. This probability will be proportional to the Bellman error according to Eq. 15. The higher the prediction error, the higher the probability of being sampled in the future for a better fit of the prediction. Nonetheless, it is important to let other, less erratic experiences also be sampled, to avoid bias. This way, the probability of sampling is imposed to depend on a prioritizing exponent α ∈ [0, 1] that modulates the uniformity of the priorities⁴:
p_i = \frac{E(s_i^t, a_i^t, r_i^t, s_i^{t+1})^{\alpha}}{\sum_{k=0}^{|M|} E(s_k^t, a_k^t, r_k^t, s_k^{t+1})^{\alpha}} \qquad (16)
Nonetheless, this sampling is a potential source of bias, because the loss will be much bigger for those experiences with high prediction errors, especially at the beginning of the learning. To modulate the importance of every error in the loss, an importance weight w_i is computed for every experience in the memory. This weight depends on a parameter β that represents the level of compensation of the error in the loss. This way, the importance weight is expressed as

w_i = \left( \frac{1}{|M|} \cdot \frac{1}{p_i} \right)^{\beta} \qquad (17)

and the loss is computed with the following formula:

⁴ α = 0 means fully uniform sampling and vice versa.
\mathcal{L}(B) = \frac{1}{|B|} \sum_{i=0}^{|B|} w_i \cdot \left( Q(s_i, a_i) - y_i \right)^2 \qquad (18)
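Equations (16)-(18) translate into a few lines of NumPy. In this sketch the absolute TD errors play the role of E(.), and the importance weights are normalized by their maximum, which is a common stabilization practice from [23] rather than something stated in the text above.

import numpy as np

def per_probabilities(td_errors, alpha=0.4, eps=1e-6):
    """Sampling probabilities of Eq. (16) from the absolute Bellman errors."""
    priorities = (np.abs(td_errors) + eps) ** alpha
    return priorities / priorities.sum()

def importance_weights(probs, beta=0.6):
    """Importance-sampling weights of Eq. (17), normalized by their maximum."""
    w = (1.0 / (len(probs) * probs)) ** beta
    return w / w.max()

# Usage: sample a batch of indices and weight the squared errors in the loss of Eq. (18).
td_errors = np.random.rand(1000)
probs = per_probabilities(td_errors)
idx = np.random.choice(len(probs), size=64, p=probs)
weights = importance_weights(probs, beta=0.6)[idx]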
4.2.2 Deep Policy
The convolutional policy designed to represent Q(s, a) has three initial convolutional layers to extract the features of the graphical state. This first convolutional head is made up of 64, 32, and 16 filters with the same kernel size (3 × 3). After this, the features are flattened and processed by three linear layers of 256 neurons each. Every neuron uses a Leaky Rectified Linear Unit (Leaky-ReLU) as the activation function. In the final stage of the policy, a dueling architecture is implemented [31]. With the dueling architecture, the value function V(s) is segregated from the advantage values A(s, a) (see Eq. 19). The former represents the expected accumulated reward in the current state. The latter represents the expected action-state value with respect to the current state value.
A(s, a) = Q(s, a) - V(s)

This technique enables a better generalization of action learning in the presence of similar states and, according to [31], the calculation of Q(s, a) will be as follows:

Q(s, a) = V(s) + \left( A(s, a) - \frac{1}{|A|} \sum_{a'} A(s, a') \right) \qquad (19)
Fig. 9 The convolutional neural network implemented to process the state and to estimate the Q function. It is made up of three convolutional blocks for feature extraction and three dense noisy layers. At the end, the dense part is decomposed into a value head and an advantage head.
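A possible PyTorch sketch of the network in Figure 9 is given below. For clarity it uses standard linear layers in the dense trunk; in the actual proposal these would be the noisy layers described next, and details such as padding or the exact flattened feature size are assumptions of the example.

import torch
import torch.nn as nn

class DuelingCNNPolicy(nn.Module):
    """Sketch of the policy of Figure 9: three convolutional blocks, a dense trunk,
    and value/advantage heads combined as in Eq. (19)."""
    def __init__(self, n_channels=4, n_actions=8):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(n_channels, 64, kernel_size=3), nn.LeakyReLU(),
            nn.Conv2d(64, 32, kernel_size=3), nn.LeakyReLU(),
            nn.Conv2d(32, 16, kernel_size=3), nn.LeakyReLU(),
            nn.Flatten(),
        )
        with torch.no_grad():  # infer the flattened feature size from a dummy 58x38 state
            flat = self.features(torch.zeros(1, n_channels, 58, 38)).shape[1]
        self.trunk = nn.Sequential(
            nn.Linear(flat, 256), nn.LeakyReLU(),
            nn.Linear(256, 256), nn.LeakyReLU(),
            nn.Linear(256, 256), nn.LeakyReLU(),
        )
        self.value = nn.Linear(256, 1)              # V(s)
        self.advantage = nn.Linear(256, n_actions)  # A(s, a)

    def forward(self, x):
        h = self.trunk(self.features(x))
        v, a = self.value(h), self.advantage(h)
        return v + a - a.mean(dim=1, keepdim=True)  # Eq. (19)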
In this work, we also propose a possible application of noisy neural networks as a technique to enhance the explorative behavior in multi-agent problems. Noisy neural networks (nNN) have been shown to be useful in complex scenarios, but, as far
as the authors know, their use has been limited to cases with a single agent [7]. With nNN, a stochastic term sampled from a Gaussian distribution is added to every weight and bias. Therefore, every neuron parameter (w_i, b_i) has two distributions ξ̂ = (ξ_w, ξ_b). These distributions are modeled by two parameters each: (µ_w, σ_w) for the weights and (µ_b, σ_b) for the biases. With every evaluation of Q(s, a), new values are sampled from ξ̂ (see Figure 10). This makes the policy intrinsically stochastic, as different evaluations of the very same state will return different values. This effect gives the network an inherent adaptive exploratory capacity. The balance between exploration of the state-action domain and exploitation in the sense of Eq. 14 is incorporated parametrically. In classical DQL implementations, the most common exploration method is the ε-greedy policy, where there is a probability ε of choosing a totally random action. By annealing ε from 1 (at the beginning of the optimization) to near 0, the agent explores. Nevertheless, in multi-agent problems like this one, random exploration could be insufficient. Furthermore, the annealing values must be chosen so that the agent explores enough in the first training stages.
Fig. 10 In (a), a classic neural network layer with weights w̄ and biases b̄. In (b), a noisy layer is presented. For each previous parameter, a stochastic value is added by sampling a Gaussian distribution of mean µ and deviation σ. Note that these parameters are trainable and will modulate the stochasticity of the policy.
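A sketch of a noisy layer in the spirit of Figure 10 is given below. It uses independent Gaussian noise per parameter, which is the simplest variant; [7] also describes a factorized version, and the initialization constants here are assumptions of the example.

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyLinear(nn.Module):
    """Linear layer whose weights and biases are perturbed by learnable Gaussian noise."""
    def __init__(self, in_features, out_features, sigma_init=0.5):
        super().__init__()
        bound = 1.0 / math.sqrt(in_features)
        self.mu_w = nn.Parameter(torch.empty(out_features, in_features).uniform_(-bound, bound))
        self.sigma_w = nn.Parameter(torch.full((out_features, in_features), sigma_init * bound))
        self.mu_b = nn.Parameter(torch.empty(out_features).uniform_(-bound, bound))
        self.sigma_b = nn.Parameter(torch.full((out_features,), sigma_init * bound))

    def forward(self, x):
        eps_w = torch.randn_like(self.sigma_w)  # xi_w ~ N(0, 1), redrawn at every call
        eps_b = torch.randn_like(self.sigma_b)  # xi_b ~ N(0, 1)
        weight = self.mu_w + self.sigma_w * eps_w
        bias = self.mu_b + self.sigma_b * eps_b
        return F.linear(x, weight, bias)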
5 Simulation and results
This section describes the simulation conditions for each experiment and the results obtained. First, we compare the importance of individual assignment in the reward function, as mentioned in Section 4.1.3. Then, the benefits of using noisy networks are explained in a comparison with an ε-greedy exploratory policy. All simulations were performed with a fleet size N of four agents. The navigation map corresponds to the Ypacaraí Lake (Paraguay) as a real case scenario (see Figure 6a). All optimizations were carried out on a double Intel Xeon Silver 4210R 3.4 GHz, 187 GB RAM, and an Nvidia RTX 3090 with 24 GB VRAM. All hyperparameters of the algorithms are summarized in Table 2.
Hyperparameter                                 Value
Learning rate (lr)                             1 × 10⁻⁴
Batch size (|B|)                               64
Memory replay size                             1 × 10⁶
Update frequency for target                    1000 epochs
Discount factor (γ)                            0.99
Loss function                                  Mean Squared Error
SGD optimizer                                  Adam
ε value (if applied)                           [1, 0.05] (annealed)
ε decay interval (% of learning progress)      [0, 33]
Prioritizing exponent (α)                      0.4
Importance exponent (β)                        [0.6, 1] (annealed)
Number of episodes                             50,000

Scenario parameter                             Value
Matérn Kernel lengthscale (l)                  0.75
Matérn Kernel differential parameter (ν)       1.5
Path maximum length (d_max)                    29 km
ASV speed                                      2 m/s
Fleet size N                                   4
Max. collisions permitted per mission          5

Table 2 Hyperparameters and scenario values used in the DRL training experiments.
The hyperparametrization of this algorithm has been conducted following previous values in similar approaches [33, 35]. It has been observed that the selected batch size is adequate for this problem and that decreasing the number of experiences per SGD step only increases the variance. Other parameters, such as those of the Prioritized Experience Replay or the discount factor, are taken from the original papers [23, 15].
5.1 Reward function comparison
In order to validate the reward function in terms of credit assignment, two variants are tested: i) the total uncertainty reduced from one instant to another is rewarded equally to each agent in the sense of Eq. 7 (coupled reward), and ii) every agent receives reward only for its own contribution, without any further considerations (decoupled reward; see Eq. 12). Note that the distance penalization is applied in exactly the same way in both cases.
Fig. 11 Normalized uncertainty Û at the end of the mission along the training process for two variants of the reward: coupled and decoupled.
Figure 11 shows the uncertainty over the training progress for both variants. With the decoupled reward, the algorithm is able to resolve the credit assignment better and obtains a better policy than in the coupled case. There is a 40% improvement on average from one case to the other, which is significant for the exploration objective. The behavior of the coupled-reward agents tends to be more redundant, as can be seen in Figure 12b. As the reward does not properly evaluate the individual contribution of every agent, the policy tends to overestimate its behavior, producing lazy agents. Since a good performance of one agent increases the value of Q for the others, this leads to unequal contributions to the task. The proposed algorithm is able to synthesize good policies, as seen in Figure 12a, where the lake scenario is almost completely covered and the agents tend to move more efficiently across the search space, with far fewer self-intersecting trajectories.
Fig. 12 Resulting paths of the same mission using (a) the decoupled reward and (b) the coupled one, both with a nNN policy.
5.2 Exploration efficiency
When addressing the exploration ability of nNN, it has been observed that the speed of convergence and the performance of the noisy neural network surpass, in this case, the ε-greedy implementation⁵. In Figure 13, the accumulated reward the agents receive on average can be seen for both trainings. The very first steps of the ε-greedy training consist of an exploration phase where the actions are randomly taken. The performance of the training slowly increases as ε decreases over time. This exploration period must be selected and properly tuned. With less exploration (see ε-greedy Policy 2 in the same figure), we have observed a worse performance. On the contrary, the nNN policy needs far fewer episodes to converge to a high-performance policy. Whereas in the ε-greedy case it is necessary to process 16500 episodes before greedily exploiting the policy, the nNN is able to find collision-free paths in 1000 episodes and higher rewards in 5000 episodes. In terms of the uncertainty, the noisy proposal is able to obtain the same performance, if not slightly better (1% better on average), than the more explorative ε-greedy variant. This means that, for the same performance, the ε-greedy algorithm takes 5 times more episodes. Nonetheless, with less exploration, convergence is faster but the performance becomes worse, with higher variation. This indicates that similar policies are encountered, but the enhanced exploration of the nNN allows a more sample-efficient training without any epsilon-decay tuning.

⁵ From this point on, the decoupled reward is selected for its better performance.
Fig. 13 Average accumulated reward (top), mission length (middle), and final uncertainty (bottom) over the training process for the ε-greedy policies and the proposed noisy policy.
5.3 Comparison with other algorithms
In order to validate the performance of the proposed framework, we propose different algorithms against which to compare the policy efficiency. The metrics under study are: i) the average accumulated reward at the end of the episode among the agents (AER), ii) the remaining uncertainty at the end of the episode U_{t=T}, iii) the mean distance between ASVs, and finally, iv) the root mean squared error (RMSE) of the inferred WQ model with respect to the ground truth. For the latter, we propose Gaussian Processes (GPs), as in [18], to perform a regression with the sampled values as they are acquired. The proposed benchmark used to represent the WQ scalar field is a sum of positive and negative peaks built from two inverse Shekel functions⁶. This scalar field places a random number of peaks across the navigable surface to represent the maxima and minima of the water quality parameters, each with a different intensity (see Figure 14).
With regard to the benchmark algorithms, we present different heuristics and optimization methods based on previously proposed approaches:

⁶ See https://deap.readthedocs.io/en/master/api/benchmarks.html for the complete definition.
Fig. 14 Example of a WQ scalar field used for RMSE evaluation by means of a Shekel function.
1. Random Safe Search (RSS): This heuristic generates random paths without collisions of any kind. It is not a good path planning strategy, but it serves as a lower bound to validate the problem complexity.
2. Random Wandering (RW): In this heuristic, agents travel following straight trajectories until an obstacle is reached. Then, the agent chooses a new straight path, different from the last direction, to avoid retracing its steps.
3. Offline Genetic Algorithm Optimization (Offline-GA): This approach is similar to [1], where a GA is used to optimize the paths. For every evaluation scenario, paths are optimized using mutation, crossover, and selection operations to obtain the best collective path. The designed reward is used as the fitness function for each individual (here, an individual refers to the ASVs' paths). To avoid colliding paths, the death penalty is applied to individuals that contain collisions. Note that, for every different starting condition, Offline-GA must re-optimize, which makes this algorithm unfeasible for online deployment. This algorithm will also serve as a performance indicator of how high a reward can be achieved when intensively optimizing every individual case.
4. Receding Horizon Optimization (RH): In the RH approach, the paths are optimized according to the reward function up to a horizon of H steps from the current point, using a copy of the current online scenario. Once the optimization budget is reached, only the first action for every ASV is executed and a new H-step optimization begins. This approach is similar to the Model Predictive Control (MPC) algorithm used in [10], which can be useful in online optimization scenarios (a high-level sketch of this receding-horizon loop is given after this list).
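As an illustration of the receding-horizon baseline, the sketch below uses simple random shooting as a stand-in for the GA-based inner optimizer: candidate H-step joint plans are scored on a copy of the scenario and only the first fleet action of the best plan is executed. The copyable simulator interface and the parameter values are assumptions of the example.

import random

def receding_horizon_step(env, n_agents, H=5, n_candidates=200):
    """Score random H-step joint plans on a scenario copy and return the best first action."""
    best_score, best_plan = float("-inf"), None
    for _ in range(n_candidates):
        sim = env.copy()                       # assumed copyable simulator
        plan = [[random.randrange(8) for _ in range(n_agents)] for _ in range(H)]
        score = 0.0
        for joint_action in plan:
            _, rewards, _ = sim.step(joint_action)
            score += sum(rewards)
        if score > best_score:
            best_score, best_plan = score, plan
    return best_plan[0]                        # only the first fleet action is executed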
Table 3 presents the results of the different approaches. Every algorithm has been evaluated in 10 test scenarios with the same starting conditions and WQ ground truths. In terms of the AER metric, the DRL approach obtains better results than the other algorithms, with a lower deviation, which indicates a more reliable and stable behavior. The final uncertainty after a mission is consequently reduced to 0.053 on average, 14% less than that of the second-best-performing algorithm, Offline-GA. With respect to the estimation error, the DRL approach obtains a better accuracy than any other approach. Although the estimation method could affect this metric, in terms of performance it can be seen that the most informative paths (those with lower uncertainty) obtain better models of the WQ parameters.
Metric      |  AER              |  U(t=T)           |  Distance         |  RMSE
Algorithm   |  Mean    Std.dev. |  Mean    Std.dev. |  Mean    Std.dev. |  Mean    Std.dev.
RSS         |  41.039  6.65     |  0.23    0.019    |  18.64   2.63     |  0.152   0.233
RW          |  70.970  8.90     |  0.13    0.028    |  18.13   5.09     |  0.027   0.035
Offline-GA  |  92.112  4.34     |  0.062   0.005    |  19.71   4.54     |  0.0057  0.005
RH (H=5)    |  77.688  8.50     |  0.117   0.026    |  17.46   4.43     |  0.0452  0.062
RH (H=10)   |  80.114  9.72     |  0.109   0.031    |  18.25   4.45     |  0.0285  0.034
DRL         |  97.013  1.45     |  0.053   0.003    |  18.94   4.46     |  0.0053  0.003
Table 3 Statistical results of evaluating the proposed algorithm and the other benchmarks in 10 scenarios with different starting points and WQ scalar fields.
Figure 15 presents example paths generated by the aforementioned algorithms. The RW heuristic, despite being random, tends to generate long paths across the lake surface, but with unnecessary redundancy and self-intersecting trajectories. This strategy is enhanced in the Offline-GA, where the redundancy is reduced with excellent performance. Nonetheless, since straight paths are easier to optimize, it is difficult to improve beyond them, and large zones of the surface remain unvisited. The DRL approach, on the contrary, not only learns how to resolve almost every scenario, but also learns that visits adjacent to the shores return lower rewards. This generates a shore-aware behavior, and those areas are avoided. The GA is still able to obtain acceptable solutions, but at the cost of a highly intensive computation for every single case. With respect to the receding horizon, we have observed that it is prone to stalling in local minima. With higher prediction horizons, the algorithm has difficulties finding good local sub-solutions, and with lower ones the performance is higher but still sub-optimal.
If we consider the progress of the uncertainty reduction over a mission, we can see in Figure 16 how the RMSE score decreases along with the uncertainty.
Fig. 15 Resulting paths of the proposed algorithms for comparison and the DRL approach: (a) RSS, (b) RW, (c) Offline-GA, (d) RH (H=5), (e) RH (H=10), and (f) DRL.
The DRL approach reaches lower values (14% and 7% improvement with respect to the Offline-GA approach) in less time and with less deviation.
6 Discussion of the results
The use of DRL always involves the design of a reward function as a way to describe the desired behavior of the policy. The results have shown that, in the particular IPP case, it is vital to design a proper reward function in order to achieve good performance. This situation is aggravated when the state-action space grows, which is the case in every multi-agent paradigm.
Fig. 16 Evolution of the remaining uncertainty (top) and the RMSE (bottom) over the mission steps for every algorithm: Noisy DRL, Offline-GA, Random Wanderer, Random Safe Search, and the receding-horizon MPC-GA with H=5 and H=10.
The multi-agent IPP has been shown to need a decoupling of the real information dynamics in order to cope with the credit assignment problem: it is better to consider each agent's individual contribution to the task and to add an additional redundancy term such as the inter-agent distance. According to our simulations, the improvement can reach up to 84% with respect to the coupled baseline. This modification, while it involves more calculations, is absolutely necessary for the DRL to converge to a competitive policy.
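As an illustration only, the snippet below sketches one plausible form of such a decomposition: each agent is credited with the uncertainty reduction attributable to its own sample, plus a separation term based on the distance to the nearest teammate to penalize redundant measurements. The function name, its inputs, and the weight w_dist are hypothetical and do not reproduce the exact reward used in this chapter.

```python
import numpy as np

def decomposed_rewards(delta_sigma, positions, w_dist=0.1):
    """Per-agent reward sketch: individual informative credit plus a spread bonus.
    delta_sigma[i] is the uncertainty reduced by agent i's own sample (assumed given)."""
    rewards = np.asarray(delta_sigma, dtype=float)
    positions = np.asarray(positions, dtype=float)
    for i in range(len(positions)):
        others = np.delete(positions, i, axis=0)
        d_min = np.min(np.linalg.norm(others - positions[i], axis=1))
        rewards[i] += w_dist * d_min          # larger reward when agents spread out
    return rewards

# Example: agents 0 and 1 measure close together, agent 2 far away
print(decomposed_rewards([0.30, 0.25, 0.40], [[1, 1], [2, 1], [30, 25]]))
```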
Regarding the explorative behavior of the two variants under test, ε-greedy and noisy networks, it is clear that the noisy layers can boost the algorithm's efficiency by adjusting the explorative behavior as part of the parameters of the network. This method, while it adds more parameters, avoids the tedious task of fine-tuning the ε-greedy schedule. A decrease in performance has been observed when the exploration is insufficient; conversely, when the exploration phase lasts longer, the policy performance is still not as high as that obtained using nNN. This raises a question about the exploration needs of every complex or multi-agent problem: how much exploration is necessary? Additionally, it could happen that fully random behavior of the agents is ineffective and some form of parametric noise is required (as happens in much more complex games like ATARI Montezuma's Revenge [7]). In this particular case, although the scope is limited, the nNN has proven better both in sample efficiency and in terms of the obtained score.
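For reference, a self-contained PyTorch sketch of a factorized noisy linear layer in the spirit of [7] is shown below; the initialization and the sigma0 value follow the original paper, but this is a generic implementation, not the exact layer of the network used in this chapter.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyLinear(nn.Module):
    """Linear layer with learnable, factorized Gaussian parameter noise [7]."""
    def __init__(self, in_features, out_features, sigma0=0.5):
        super().__init__()
        self.mu_w = nn.Parameter(torch.empty(out_features, in_features))
        self.sigma_w = nn.Parameter(torch.empty(out_features, in_features))
        self.mu_b = nn.Parameter(torch.empty(out_features))
        self.sigma_b = nn.Parameter(torch.empty(out_features))
        self.register_buffer("eps_in", torch.zeros(in_features))
        self.register_buffer("eps_out", torch.zeros(out_features))
        bound = 1.0 / math.sqrt(in_features)
        nn.init.uniform_(self.mu_w, -bound, bound)
        nn.init.uniform_(self.mu_b, -bound, bound)
        nn.init.constant_(self.sigma_w, sigma0 / math.sqrt(in_features))
        nn.init.constant_(self.sigma_b, sigma0 / math.sqrt(in_features))
        self.reset_noise()

    @staticmethod
    def _f(x):                        # f(x) = sign(x) * sqrt(|x|)
        return x.sign() * x.abs().sqrt()

    def reset_noise(self):            # resample the factorized noise (e.g. every step)
        self.eps_in.normal_()
        self.eps_out.normal_()

    def forward(self, x):
        eps_w = torch.outer(self._f(self.eps_out), self._f(self.eps_in))
        eps_b = self._f(self.eps_out)
        return F.linear(x, self.mu_w + self.sigma_w * eps_w,
                        self.mu_b + self.sigma_b * eps_b)
```

Replacing the dense layers of the Q-network with NoisyLinear layers and calling reset_noise() at every interaction removes the need for an ε-greedy schedule, since the exploration intensity is learned through the sigma parameters.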
When addressing the comparison with other approaches, DRL has proven to be a good methodology to cope with the IPP objectives. The algorithm is able to deal with arbitrary initial points and obtains the top performance among the compared approaches. When comparing the results of the DRL-trained policy with the Offline-GA, it can be observed that the scores are better in RMSE and uncertainty, which validates the proposal for online deployment. Although we obtain good performance with the Offline-GA, it is only useful for optimizing scenarios one by one, evaluating several solutions for the same scenario. In the simulations, DRL performs the same number of simulator evaluations, but over different missions and starting points. This way, the policy generalizes the task independently of the fleet situation. The fact that the DRL policy overcomes the Offline-GA performance without any in-place optimization validates its ability to solve the IPP in general.
Another aspect worth discussing is the ability of every algorithm to reduce the model error. It can be seen that the lower the uncertainty, the lower the RMSE. Nonetheless, the lower bound of the error is related not only to the acquisition path Ψ, but also to the regression method used and to the ground truth itself. For this case, a plausible WQ parameter benchmark has been used, following the behavior of WQ parameters in the Mar Menor (Spain) and previous works [18, 17]. Nonetheless, when different benchmarks representing other variable distributions are used, the error could behave differently from our case. In the end, the error reduction depends on a suitable regression method. Once the regression method complies with the dynamics of what we want to measure, the uncertainty criterion serves to reduce this error in a fast and powerful way.
7 Conclusions
In this chapter, a Deep Reinforcement Learning approach has been proposed to deal with the Informative Path Planning problem using multiple surface vehicles. The proposed informative framework uses a kernel function to correlate the samples in order to obtain the level of uncertainty remaining in the scenario under monitoring. Within this problem, the non-navigable zones of the Ypacaraí Lake, used as a testing scenario, are also considered. All these restrictions are considered in a tailored reward function that takes multi-agent credit assignment into consideration. The proposed reward function deals better with the IPP when the individual contributions to the uncertainty reduction are made explicit, compared to using the total reduced uncertainty for every agent. The use of several DRL mechanisms that have been useful in previous works has also been proposed: Prioritized Replay, Advantage networks, and noisy neurons. The latter has been observed to enhance the exploration and sample efficiency of the algorithm with respect to the classic ε-greedy strategy.
In the end, the algorithm has been compared with several other heuristics to validate its performance. The algorithm performs better than any other, with 20% higher rewards with respect to the receding-horizon counterpart and 5% higher than the intensive Offline-GA optimization. As future lines, we intend to study how this problem can be expanded to an arbitrary number of agents in every mission, instead of selecting a fixed fleet size per training. This would result in a fleet-size-aware policy that considers the appearance and disappearance of agents during the course of a mission. Another interesting aspect to investigate is how the action and state formulation affects the learning. It is worth studying whether noisy state inputs could help generalization with different boundaries or with moving obstacles in the middle of the navigable zones.
Acknowledgements This work has been funded by the Spanish "Ministerio de Ciencia, Innovación y Universidades" under the PhD grant FPU-2020 (Formación del Profesorado Universitario) of Samuel Yanes Luis.
References
1. Arzamendia M, Gregor D, Gutierrez-Reina D, Toral S (2019) An evolutionary approach to constrained path planning of an autonomous surface vehicle for maximizing the covered area of Ypacarai Lake. Soft Computing 23(5):1723–1734
2. Arzamendia M, Gutierrez D, Toral S, Gregor D, Asimakopoulou E, Bessis N (2019) Intelligent online learning strategy for an autonomous surface vehicle in lake environments using evolutionary computation. IEEE Intelligent Transportation Systems Magazine 11(4):110–125
3. Bellman RE (2003) Dynamic Programming. Dover Publications, Inc., USA
4. Coley K (2015) Unmanned surface vehicles: The future of data-collection. Ocean Challenge 21:14–15
5. Cover TM, Thomas JA (2006) Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing). Wiley-Interscience, USA
6. Ferreira H, Almeida C, Martins A, Almeida J, Dias N, Dias A, Silva E (2009) Autonomous bathymetry for risk assessment with ROAZ robotic surface vehicle. In: OCEANS 2009-EUROPE, pp 1–6, DOI 10.1109/OCEANSE.2009.5278235
7. Fortunato M, Azar MG, Piot B, Menick J, Osband I, Graves A, Mnih V, Munos R, Hassabis D, Pietquin O, Blundell C, Legg S (2017) Noisy networks for exploration. CoRR abs/1706.10295, URL http://arxiv.org/abs/1706.10295
8. van Hasselt H, Guez A, Silver D (2015) Deep reinforcement learning with double Q-learning. CoRR abs/1509.06461, URL http://arxiv.org/abs/1509.06461
9. Hoen PJt, Tuyls K, Panait L, Luke S, La Poutré JA (2006) An overview of cooperative and competitive multiagent learning. In: Tuyls K, Hoen PJ, Verbeeck K, Sen S (eds) Learning and Adaption in Multi-Agent Systems, Springer Berlin Heidelberg, Berlin, Heidelberg, pp 1–46
10. Julian KD, Kochenderfer MJ (2018) Distributed wildfire surveillance with autonomous aircraft using deep reinforcement learning. CoRR abs/1810.04244, URL http://arxiv.org/abs/1810.04244
11. Kathen MJT, Flores IJ, Reina DG (2021) An informative path planner for a swarm of ASVs based on an enhanced PSO with Gaussian surrogate model components intended for water monitoring applications. Electronics 10(13):1605
12. Krishna Lakshmanan A, Elara Mohan R, Ramalingam B, Vu Le A, Veerajagadeshwar P, Tiwari K, Ilyas M (2020) Complete coverage path planning using reinforcement learning for tetromino based cleaning and maintenance robot. Automation in Construction 112:103078, DOI 10.1016/j.autcon.2020.103078
13. Lillicrap TP, Hunt JJ, Pritzel A, Heess N, Erez T, Tassa Y, Silver D, Wierstra D (2016) Continuous control with deep reinforcement learning. In: Bengio Y, LeCun Y (eds) ICLR, URL http://dblp.uni-trier.de/db/conf/iclr/iclr2016.html#LillicrapHPHETS15
14. Lowe R, Wu Y, Tamar A, Harb J, Abbeel P, Mordatch I (2017) Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments. NIPS'17, Curran Associates Inc., Red Hook, NY, USA
15. Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. Nature 518(7540):529–533, DOI 10.1038/nature14236
16. Murphy RR, Steimle E, Griffin C, Cullins C, Hall M, Pratt K (2008) Cooperative use of unmanned sea surface and micro aerial vehicles at Hurricane Wilma. Journal of Field Robotics 25(3):164–180, DOI 10.1002/rob.20235
17. Peralta F, Reina DG, Toral S, Arzamendia M, Gregor D (2021) A Bayesian optimization approach for multi-function estimation for environmental monitoring using an autonomous surface vehicle: Ypacarai Lake case study. Electronics 10(8):963
18. Peralta Samaniego F, Reina DG, Toral Marín SL, Gregor DO, Arzamendia M (2021) A Bayesian optimization approach for water resources monitoring through an autonomous surface vehicle: The Ypacarai Lake case study. IEEE Access 9(1):9163–9179, DOI 10.1109/ACCESS.2021.3050934
19. Piciarelli C, Foresti GL (2019) Drone patrolling with reinforcement learning. ACM International Conference Proceeding Series (1):1–6, DOI 10.1145/3349801.3349805
20. Popović M, Vidal-Calleja T, Hitz G (2020) An informative path planning framework for UAV-based terrain monitoring. Autonomous Robots 44:889–911, DOI 10.1007/s10514-020-09903-2
21. Rasmussen C, Williams C (2006) Gaussian Processes for Machine Learning. Adaptive Computation and Machine Learning, MIT Press, Cambridge, MA, USA, DOI 10.7551/mitpress/3206.003.0001
22. Sánchez-García J, García-Campos J, Arzamendia M, Reina D, Toral S, Gregor D (2018) A survey on unmanned aerial and aquatic vehicle multi-hop networks: Wireless communications, evaluation tools and applications. Computer Communications 119:43–65, DOI 10.1016/j.comcom.2018.02.002
23. Schaul T, Quan J, Antonoglou I, Silver D (2015) Prioritized experience replay. DOI 10.48550/ARXIV.1511.05952, URL https://arxiv.org/abs/1511.05952
24. Sim R, Roy N (2005) Global A-optimal robot exploration in SLAM. pp 661–666, DOI 10.1109/ROBOT.2005.1570193
25. Sutton RS, Barto AG (2018) Reinforcement Learning: An Introduction. A Bradford Book, Cambridge, MA, USA
26. Ten Kathen MJ, Flores IJ, Reina DG (2021) A comparison of PSO-based informative path planners for autonomous surface vehicles for water resource monitoring. In: 7th International Conference on Machine Learning Technologies (ICMLT 2022), ACM
27. Ten Kathen MJ, Reina DG, Flores IJ (2021) A comparison of PSO-based informative path planners for detecting pollution peaks of the Ypacarai Lake with autonomous surface vehicles. In: International Conference on Optimization and Learning OLA'2022
28. Theile M, Bayerlein H, Nai R, Gesbert D, Caccamo M (2020) UAV coverage path planning under varying power constraints using deep reinforcement learning. In: 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, pp 1444–1449
29. Viseras A, Garcia R (2019) DeepIG: Multi-robot information gathering with deep reinforcement learning. IEEE Robotics and Automation Letters 4(3):3059–3066, DOI 10.1109/LRA.2019.2924839
30. Viseras A, Meißner M, Marchal J (2021) Wildfire front monitoring with multiple UAVs using deep Q-learning. IEEE Access PP:1–1, DOI 10.1109/ACCESS.2021.3055651
31. Wang Z, de Freitas N, Lanctot M (2015) Dueling network architectures for deep reinforcement learning. CoRR abs/1511.06581, URL http://arxiv.org/abs/1511.06581
32. Woo J, Kim N (2020) Collision avoidance for an unmanned surface vehicle using deep reinforcement learning. Ocean Engineering 199:107001, DOI 10.1016/j.oceaneng.2020.107001
33. Yanes S, Reina DG, Toral Marín SL (2020) A deep reinforcement learning approach for the patrolling problem of water resources through autonomous surface vehicles: The Ypacarai Lake case. IEEE Access 6(1):1–1, DOI 10.1109/ACCESS.2020.3036938
34. Yanes S, Reina DG, Marín SLT (2021) A multiagent deep reinforcement learning approach for path planning in autonomous surface vehicles: The Ypacaraí Lake patrolling case. IEEE Access 9:17084–17099
35. Yanes Luis S, Gutiérrez-Reina D, Toral Marin S (2021) A dimensional comparison between evolutionary algorithm and deep reinforcement learning methodologies for autonomous surface vehicles with water quality sensors. Sensors 21(8), DOI 10.3390/s21082862, URL https://www.mdpi.com/1424-8220/21/8/2862
36. Yanes Luis S, Peralta F, Tapia Córdoba A, Rodríguez del Nozal Á, Toral Marín S, Gutiérrez Reina D (2022) An evolutionary multi-objective path planning of a fleet of ASVs for patrolling water resources. Engineering Applications of Artificial Intelligence 112:104852, DOI 10.1016/j.engappai.2022.104852
37. Zhang Q, Lin J, Sha Q, He B, Li G (2020) Deep interactive reinforcement learning for path following of autonomous underwater vehicle. CoRR abs/2001.03359, URL https://arxiv.org/abs/2001.03359