Deep Reinforcement Learning applied to multi-agent informative path planning in environmental missions
Samuel Yanes Luis, Manuel Perales Esteve, Daniel Gutiérrez Reina and Sergio Toral Marín
Key words: Deep Reinforcement Learning, Information Gathering,
Multiagent, Environmental Monitoring
Abstract Deep Reinforcement Learning algorithms have gained attention lately due to their ability to solve complex decision problems with a model-free and zero-derivative approach. In the case of multi-agent problems, these algorithms can help to easily find efficient cooperative policies in a feasible amount of time. In this chapter, we present the Informative Patrolling Problem, a commonplace task in the conservation of water resources. The approach is presented here as a convenient methodology for the synthesis of cooperative policies that can solve the simultaneous objectives present in the unmanned monitoring of lakes and rivers: maximizing the information collected about water parameters and collision-free routing with multiple surface vehicles. For this mixed objective, a Deep Q-Learning scheme with a convolutional network as a shared fleet policy is proposed. In order to solve the credit assignment problem, an effective multiagent decomposition of the informative reward is proposed, together with a discussion of several other state-of-the-art topics of Reinforcement Learning: noisy networks for enhanced exploration of the state-action domain, the use of visual states, and the shaping of the reward function. This methodology, as is quantitatively demonstrated, allows a significant improvement in water resource monitoring compared to other heuristics.
Samuel Yanes Luis
Dpt. of Electronics, University of Sevilla, Av. de Los Descubrimientos s/n, 41003, Sevilla, Spain,
e-mail: syanes@us.es
Manuel Perales-Esteve
Dpt. of Electronics, University of Sevilla, Av. de Los Descubrimientos s/n, 41003, Sevilla, Spain,
e-mail: mperales@us.es
Daniel Gutiérrez Reina
Dpt. of Electronics, University of Sevilla, Av. de Los Descubrimientos s/n, 41003, Sevilla, Spain,
e-mail: dgutierrezreina@us.es
Sergio Toral Marín
Dpt. of Electronics, University of Sevilla, Av. de Los Descubrimientos s/n, 41003, Sevilla, Spain,
e-mail: storal@us.es
1 Introduction
Conservation of drinking water reserves is a strategic objective for societies. This resource has a direct impact on the health of people, agriculture, and industry in a country. However, due to human-related causes, these large bodies of water are continuously being polluted: untreated discharges, oil spills from boats, toxic algae proliferation, etc., drastically impoverish the quality of the aquatic ecosystem and alter the biological balance of the local fauna and flora. In the case of particularly large water bodies, such as Lake Ypacaraí (Paraguay, 60 km²) or the Mar Menor (Spain, 170 km²), manual samplings for continuous monitoring of the water quality (WQ) variables become strenuous and costly. Several human and material resources are needed, and there is always a risk to operators in contaminated scenarios. An efficient way to collect physicochemical water data is to use autonomous surface vehicles (ASVs) [22] equipped with highly sensitive WQ sensors. These sensors can measure physical parameters such as pH, temperature, oxygen saturation levels, and water turbidity, or chemical parameters such as nitrites, sulphates, and dissolved chlorophyll. All these parameters can be used, by means of a proper data analysis, to infer the biological state of a lake or a river. In such enormous water bodies it is mandatory to use a fleet of multiple vehicles, as the battery budget of one single agent does not provide enough coverage. With multiple vehicles it is easier to obtain a complete set of measurements that truly represents the useful information to be considered for resource conservation. The challenge of the multi-vehicle paradigm lies in the need for an effective coordination policy that allows them to take samples autonomously in the lake following a low-redundancy criterion. The vehicles must coordinate to share the search space and collect as much information as possible. Information, represented by the values of the water variables, will be obtained sequentially as the vehicles acquire the measurements one by one at different locations in the scenario. An effective fleet policy must consider taking samples in those places with the highest information available, which means sampling in places of high uncertainty. The definition of the information depends on the context, but we can define a general formulation of such a task, the Informative Path Planning (IPP) [5],
where the objective is to obtain an optimal acquisition path Ψ* that maximizes the informative value I on a fixed time budget T:

\Psi^* = \arg\max_{\Psi} I(\Psi) \qquad (1)
Fig. 1 Autonomous vehicles
carrying out a surveillance
mission in a recreational lake
in Seville (Spain).
Information I can be defined stochastically in terms of the reduction of uncertainty of each sampled point of the environment [24]. A reasonable initial hypothesis states that every navigable point p in the complete set X behaves like a Gaussian random variable, p ∈ X ∼ N(µ, σ). Then, let Σ[X, X_meas] be the spatial correlation matrix that describes the statistical relationship between the samples; this matrix indicates the level of uncertainty of each point, considering the locations where X_meas has been sampled. This approach implements a Matérn Kernel Function (MKF) [21] to spatially correlate samples as a function of their adjacency, under the acceptable assumption that physically close samples will be more closely related. Given that, the IPP will consist of sequentially deciding the next physical point at which to take a sample, following the information-gain maximization criterion. Following the definition of information gain in information theory, Σ[X, X_meas] defines how informative a point is according to the decrease in entropy of the model formed by the set of points of the lake X and the sample points X_meas [21]. Finally, the IPP problem to be solved here consists of minimizing the total entropy of the scenario H[X|X_meas].
In addition to the informative criterion, the fleet policy must consider in the motion planning any obstacles and non-navigable zones. In this multiagent paradigm, the agents must also be considered as moving obstacles, so the path planning becomes intrinsically dynamic and harder to solve. As the complexity of this problem explodes because of the many feasible paths and restrictions in every realistic scenario, a proper algorithm is needed that can deal with the IPP problem with multiple agents and can scale to different fleet sizes. In this chapter, a Reinforcement Learning (RL) approach is proposed to deal with the high dimension of this problem, inspired by the success of [15], where a Q-Learning algorithm solves hard tasks with proficiency. The RL paradigm tries to solve sequential problems by trial and error. In the RL paradigm, an agent tries to maximize the reward r_t obtained at instant t by taking an action a_t in an environment [25]. The information about the environment, called the state s_t, is usually observable in some form, so
it can be interpreted to maximize the reward. This whole process ends when the agent has synthesized a policy π(s) that maps a state s into the best learned action a (see Figure 2). To adapt the IPP problem to this paradigm, a formulation has been proposed in terms of a Markov Decision Process (MDP), where the aforementioned agent interacts sequentially in a finite control horizon T by taking discrete actions, which physically correspond to movements, that maximize the reward at the end of a mission:
\pi^*(s) = \arg\max_{\pi} \sum_{t=0}^{T} \left[ r\left( s_t, a = \pi(s_t) \right) \right] \qquad (2)
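To make the interaction loop behind Eq. (2) concrete, the following minimal Python sketch rolls out one episode and accumulates the reward. It assumes a hypothetical gym-like environment object with reset and step methods and an arbitrary policy callable; both names are illustrative and not taken from the chapter's implementation.

def run_episode(env, policy, T=100):
    """Roll out one mission of at most T steps and return the accumulated reward."""
    state = env.reset()
    total_reward = 0.0
    for t in range(T):
        action = policy(state)                  # a_t = pi(s_t)
        state, reward, done = env.step(action)  # environment returns s_{t+1}, r_t
        total_reward += reward
        if done:
            break
    return total_reward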
Fig. 2 Basic reinforcement learning scheme. In every RL case, there is at least one agent that interacts through actions a_t with an environment. The environment returns two types of information: the next state (observation s_t) and a reward r_t that evaluates the action in the current state.
For this purpose, the reward function r(s, a) must be tailored to meet the objectives of the path planning, i.e., the most informative, non-redundant, obstacle-free routes for every ASV in the fleet. In the following sections, an appropriate reward function for the multi-agent IPP is presented.
In [15], the use of deep convolutional neural networks (CNNs) was introduced as a novelty in the RL scene to deal with high-dimensional states. This was a methodological revolution for RL. CNNs were used to represent the state-action function Q of the MDP. This deep Q function represents the expected future reward given a current state s and an action a. The capacity of neural networks to fit non-linear and complex behavior allowed the algorithm to supersede other benchmarks. Through many iterations, the neural function is optimized with the given reward signal by randomly interacting with the environment. This algorithm, called Deep Q-Learning, constitutes a robust method to solve sequential decision problems such as the one presented in this chapter. All things considered, this chapter adapts the single-agent Deep Q-Learning (DQL) method to solve the multi-agent IPP, taking advantage of the model-free ability of this algorithm. There is additional
complexity in doing so, as this problem deals with multiple agents at the same time. Multi-agent RL (MARL) involves certain difficulties [9], such as learning scalability, non-injectivity of the reward function, and non-stationarity of the environment. In this chapter, we present an observational method to include all the information in the visual input of the network for better efficiency. Additionally, to find a suitable policy, it is necessary to sufficiently explore the state-action set, whose size increases with each agent included in the fleet. Therefore, classic exploration strategies, such as ε-greedy [34], fail to find the optimal actions for success in the IPP task. To overcome this situation, in this chapter we propose the use of noisy neural networks (noisy-NN) [7]. Noisy NNs are a nondeterministic version of classical neural networks. Using a Gaussian distribution, each parameter introduces some level of noise that can enhance exploration.
This chapter addresses certain research gaps in the literature on multiagent optimization, information gathering, and deep reinforcement learning. With this approach, we propose an entropic criterion to synthesize the most effective routes that maximize the information collected by a fleet of vehicles. While other algorithms designed for entropy minimization have been designed solely for the single-agent paradigm [20], our multiagent formulation by means of DRL results in a very convenient and flexible approach. Another highlight is the use of a global visual state, which allows better scaling to higher numbers of vehicles. Previous works tend to use low resolutions, local observations, and formulations that do not scale well with the number of agents [29, 34, 10]. Our proposal places value on the use of noisy neural networks for multi-agent policies and proposes a global visual observation method for homogeneous fleets. Finally, this chapter's contribution can be summarized as follows:
• The formulation of the IPP from the perspective of multi-agent uncertainty reduction.
• The application of DRL for the multi-agent IPP resolution.
• A policy formulation with noisy neurons and a state that allows for interchangeability between agents.
The chapter is organized as follows. In Section 2, a brief review of the literature is developed to place the reader in the context of the problem to be solved. In Section 3, the IPP is formally described with all its constraints. In Section 4, the methods used to solve the IPP are described, that is, the DRL algorithm, the state, the neural network, etc. In Section 5, the characteristics and results of the experiments are presented. In Section 6, these results are discussed and compared with other heuristics. Finally, in Section 7, the conclusions of this work and future research lines are presented.
2 Related work
The use of ASVs for surveillance or monitoring has been gaining momentum recently due to advances in robotics and vehicle autonomy [4]. The use of small robotic vehicles is not only more efficient but also cheaper compared to the human cost of periodic surveillance missions. In [6] there is a good example of an ASV application for bathymetric measurements of lakes and seas. In this work, a low-cost vehicle is designed to perform bathymetry (depth of lakes and river basins) in an unmanned way. Another example of the use of ASVs is in [16], where a surface vehicle is implemented to monitor disaster zones after hurricanes. The vehicle is equipped with cameras and an underwater sonar camera to search for obstacles in the presence of turbid waters. Another widespread use of ASVs is the conservation of aquatic natural resources, such as lakes and rivers. Here we can distinguish between the different sub-tasks the conservation of such resources involves: I) Patrolling [33, 35, 34], which consists of continuously monitoring the resource as a method of early warning for contamination or sudden changes in healthiness levels. II) Efficient model regression [18, 17, 11], which is to efficiently obtain an accurate WQ model of the waters with one or more vehicles. III) Peak detection [11], which consists in detecting the maxima of contamination in the shortest possible time. IV) Informative coverage [1, 2], which consists of covering a certain area efficiently to acquire as much information as possible. The IPP fits into the set of problems within informative coverage. In Figure 3, typical uses of ASVs in the literature are depicted and organized into three main branches: disaster assistance, topological characterization, and environmental monitoring.
In [33] the non-homogeneous patrolling problem was formulated as a single-agent directed graph problem. The ASV was trained using DRL, resulting in an efficient policy to minimize the average waiting time in lakes with a dissimilar importance criterion. Similarly, the patrolling case was extended in [34], with a different number of vehicles for each simulation and a low state resolution. These works have demonstrated the ability of DRL to synthesize high-performance behaviors for patrolling tasks with ASVs, but also the high dimension and complexity of these problems. In [35] the use of such DRL techniques was compared with Genetic Algorithms (GAs). The scalability was proven to be better with DRL, which motivates this chapter to choose DQL for solving the IPP. ASVs were also used in [18] to obtain a contamination model of the Ypacaraí Lake. This work proposes a Bayesian approach with Gaussian Processes (GPs) for obtaining an accurate WQ scalar map. In the same line, [17] generalized the algorithm for the multi-agent case. The sampling space was divided using a Voronoi-based tessellation method. The main difference between this approach and ours is that, when continuously monitoring, no cost is considered for sampling. In the proposed method, the IPP is solved by taking as many samples as possible to achieve higher accuracy. Another branch of algorithms for monitoring is based on particle swarm optimization (PSO), which was used for the same purpose in [11, 26, 27]. The PSO was applied
Fig. 3 Different applications of ASVs in the literature.
in a fleet of four ASVs for the detection of contamination peaks. This heuristic is a simple and model-free approach to localize global and local maxima that adapts naturally to the multi-agent paradigm. Nonetheless, its performance and behavior heavily depend on the hyperparameters. For informative coverage, some solutions have been proposed by means of GAs [1, 2, 36]. In [1], the informative problem is modeled as a Traveling Salesman Problem (TSP), where the objective is to cover the Ypacaraí Lake in search of green algae blooms. The algorithm resulted in Eulerian circuits that maximize the effective area in a single-agent scenario. In [2] a similar approach is proposed, but this time focused on an online re-optimization once algae blooms are detected. Those algorithms generate quasi-optimal solutions, but they take too long to be applied in a real scenario. Our approach seeks an effective policy for coverage within a fixed time budget. Additionally, in [36], the problem is extended to multiple agents and multiple objectives for obtaining closed patrolling paths. This work uses a graph formulation and analyses the disperse Pareto-optimal solutions of using multiple agents in coverage problems.
When addressing the use of DRL with unmanned vehicles, there are several interesting examples in the literature [12, 37, 32]. In [12], a DRL-based approach is proposed to fulfill complete coverage in cleaning tasks with a multi-form robot. This single-agent application deals with high action dimensionality and uses an algorithm named Actor-Critic with Experience Replay (ACER). The DRL approaches with ASVs usually put the focus on low-level control or trajectory planning [37]. In that work, RL was applied to obtain a motor controller for path tracking with an unmanned underwater vehicle (UUV). The applied algorithm resulted in a suitable and realistic controller that quickly reduces the positioning error. In [32] there is an
example of the use of DRL for collision avoidance. A deep policy trained with DQL is proposed to learn how to modify the control signals in order to avoid moving obstacles in the trajectory. This way, the agent must learn two simultaneous tasks: to avoid the obstacles and to track the reference path. This is similar to our approach, where the agent must perform the information gathering task at the same time as it avoids non-navigable areas. Other single-agent approaches deal with informative tasks instead of pure path tracking [19, 28]. In [19], the use of DQL is proposed for patrolling tasks using a camera in survey scenarios. This approach seeks a policy that optimally patrols the environment by focusing on high-importance zones, but does not deal with obstacle or boundary avoidance. In [28], an aerial vehicle agent is trained to perform informative coverage with take-off and landing restrictions. This hard task is also achieved using DQL. A simple reward function is tailored to reward the correct actions and the information collection.
Ref.  Application                                  Algorithm                    Vehicle  Multiagent
[6]   Topological characterization of lakes        Manual path                  ASV      No
[16]  Hurricane disaster-scenario monitoring       Manual path                  ASV      No
[1]   Informative coverage of lakes                Genetic Algorithm            ASV      No
[36]  Multi-objective non-homogeneous patrolling   Evolutionary Strategy        ASV      Yes
[2]   Adaptive informative coverage of lakes       Genetic Algorithm            ASV      No
[18]  WQ model acquisition of lakes                Bayesian Optimization        ASV      No
[17]  WQ model acquisition of lakes                Bayesian Optimization        ASV      Yes
[11]  Peak detection and model acq.                Particle Swarm Optimization  ASV      Yes
[26]  Peak detection and model acq.                Particle Swarm Optimization  ASV      Yes
[27]  Peak detection and model acq.                Particle Swarm Optimization  ASV      Yes
[33]  Non-homogeneous patrolling                   Deep Q-Learning              ASV      No
[34]  Non-homogeneous patrolling                   Multi-agent Deep Q-Learning  ASV      Yes
[10]  Dynamic informative tracking                 Multi-agent Deep Q-Learning  UAVs     Yes
[30]  Dynamic informative tracking                 Multi-agent Deep Q-Learning  UAVs     Yes
Table 1 Summary of related works using autonomous vehicles for environmental monitoring tasks.
With regard to the multi-agent paradigm, DRL approaches must deal with communication problems between agents, the credit assignment problem, and other issues [9]. In [14] a method is proposed for decentralized learning in hybrid competitive-cooperative scenarios. This approach adapts the single-agent Deep Deterministic Policy Gradient algorithm [13] to comply with decentralized deployment for continuous actions. Our approach differs from this in that our problem is fully cooperative and, as the agents are homologous in actions and in their observational abilities, the learning does not require decentralization. In [10], an algorithm to train two agents to survey wildfires is proposed. This work uses the DQL
algorithm with a centralized policy for different agents and an observation-based formulation of the process. Different from our approach, this work does not directly act on uncertainty, and the agents' observations are only useful for fire-front detection. Our approach, in contrast, directly tries to minimize the uncertainty, which can be a valid approach independently of the process under survey. In [30], a similar approach is solved using different architectures for the DQL algorithm. This work addresses the study of different multi-agent algorithms to monitor wildfires. The conclusions of this work serve this chapter as a guideline for choosing a centralized approach for solving the IPP using DRL. The network proposed in this chapter differs from the one used in [30] because the observation of every agent is fully visual and can include all other agents' information in an image to be processed. This alleviates the need for retraining when the fleet grows in size. Another relevant aspect is that the state in this chapter's proposal gathers the global information of the scenario, whereas in [30] the observation is local to every agent.
3 Statement of the problem
The main objective of this monitoring application is to find a policy that coordinates a fleet of ASVs to efficiently collect information about the WQ measurements. This section explains the stochastic assumptions underlying the information criterion and the statement of the Informative Path Planning problem.
3.1 Information framework
The IPP starts by defining a navigation space X ⊂ R², where every vehicle in the fleet can take a measurement of the water. The subset of gathered sample locations X_meas ⊂ R² is also defined, together with a navigation map M such that M(p) = 1 for every navigable point p := [p_x, p_y] ∈ X. It is a reasonable hypothesis to assume that every possible point that can be sampled behaves like a Gaussian random variable with mean µ and variance c, such that p ∼ N(µ, c). Now, we can assert that the visitable space X behaves as a Multivariate Gaussian Distribution (MGD), where X ∼ N_n(µ̄, Σ), with Σ being the correlation matrix of X.
Here, we can define a function that serves as a surrogate model of the correlation as we take samples. This function will serve as an indicator of the information we have over the navigable space X depending on the measurements taken X_meas. As we expect the WQ distribution to be smooth to a certain level, a
Matérn Kernel Function (MKF) can be used to define the correlation between two samples at two different locations, in a smooth and exponential manner following Eq. (3). This hypothesis makes sense because the WQ parameters cannot change drastically from one point to a nearby one, as has been studied before¹. Consequently, near samples from the same or different agents will be highly correlated, and the reduction of the uncertainty in these locations will be high but redundant. Two parameters model the MKF: i) the parameter ν models the smoothness of the uncertainty decay with the distance between samples p and p′, and ii) the parameter l serves to scale how correlated two measurements (p, p′) are with each other. These values are usually chosen based on prior knowledge of the environment to be monitored [29, 20] or on how intensively the environment is meant to be covered. Figure 4 depicts the effect on the uncertainty when using an MKF with (ν, l) = (1.5, 1).
\mathrm{MKF}(p, p') = \frac{2^{1-\nu}}{\Gamma(\nu)} \left( \frac{\sqrt{2\nu}\,\lVert p - p' \rVert_2}{l} \right)^{\nu} K_{\nu}\!\left( \frac{\sqrt{2\nu}\,\lVert p - p' \rVert_2}{l} \right) \qquad (3)
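A direct implementation of Eq. (3) can be written with SciPy's gamma and modified Bessel functions. The following sketch is only illustrative; the function name and default parameters are assumptions, and in practice an equivalent kernel from a GP library could be used instead.

import numpy as np
from scipy.special import gamma, kv  # kv is the modified Bessel function K_nu

def matern_kernel(p, p_prime, l=1.0, nu=1.5):
    """Matérn correlation between two sample locations, following Eq. (3)."""
    d = np.linalg.norm(np.asarray(p, dtype=float) - np.asarray(p_prime, dtype=float))
    if d == 0.0:
        return 1.0  # a point is fully correlated with itself
    scaled = np.sqrt(2.0 * nu) * d / l
    return (2.0 ** (1.0 - nu) / gamma(nu)) * (scaled ** nu) * kv(nu, scaled)

# With (nu, l) = (1.5, 1), as in Figure 4:
print(matern_kernel([0.0, 0.0], [0.5, 0.0], l=1.0, nu=1.5))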
When every sample at an instant t is evaluated with the kernel function, it is possible to obtain the conditional correlation matrix Σ[X|X_meas] according to the following expression [21]:

\Sigma[X|X_{meas}] = \Sigma[X, X] - \Sigma[X, X_{meas}] \times \Sigma[X_{meas}, X_{meas}]^{-1} \times \Sigma[X, X_{meas}]^{T} \qquad (4)
Now, the monitoring objective involves decreasing the entropy associated with the conditional correlation. The information entropy H[X|X_meas] gives a measure of the uncertainty about the monitoring domain and the randomness of a sample at an arbitrary point in that space. The lower the entropy, the more confidence one has about the scenario. Finally, the entropy can be calculated as [21]:

H[X|X_{meas}] = \frac{1}{2} \log\left( \lvert \Sigma[X|X_{meas}] \rvert \right) + \frac{\dim(X)}{2} \log(2 \pi e) \qquad (5)
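Equations (4)-(6) can be computed directly once the kernel matrices are built. The sketch below assumes that the matrices Sigma_XX = Σ[X, X], Sigma_Xm = Σ[X, X_meas], and Sigma_mm = Σ[X_meas, X_meas] have already been evaluated with the MKF; the small jitter term is a common numerical safeguard, not something prescribed by the chapter.

import numpy as np

def conditional_covariance(Sigma_XX, Sigma_Xm, Sigma_mm, jitter=1e-6):
    """Sigma[X | X_meas], following Eq. (4)."""
    K_inv = np.linalg.inv(Sigma_mm + jitter * np.eye(Sigma_mm.shape[0]))
    return Sigma_XX - Sigma_Xm @ K_inv @ Sigma_Xm.T

def trace_uncertainty(Sigma_cond):
    """A-criterion surrogate of the entropy: the trace of Eq. (6)."""
    return np.trace(Sigma_cond)

def entropy(Sigma_cond):
    """Differential entropy of the conditioned field, as in Eq. (5)."""
    _, logdet = np.linalg.slogdet(Sigma_cond)
    n = Sigma_cond.shape[0]
    return 0.5 * logdet + 0.5 * n * np.log(2.0 * np.pi * np.e)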
It can be seen that by decreasing the determinant of the covariance matrix, the entropy will be reduced. Reducing this determinant can be done in two ways according to [24]: by directly reducing the product of the eigenvalues of Σ[X|X_meas]² (called the D-criterion) or by reducing the sum of the eigenvalues. The latter, called the A-criterion for uncertainty reduction, can be achieved by decreasing the trace of Σ[X|X_meas],

¹ https://marmenor.upct.es/maps/
² As demonstrated in [21], and because Σ is a positive semi-definite matrix, |Σ| = ∏_{i=0}^{dim(X)} λ_i.
which is the sum of its diagonal. By reducing these values, we will be reducing the determinant of Σ[X|X_meas] and, consequently, the entropy of the gathered information:

\mathrm{Tr}(\Sigma[X|X_{meas}]) \leq \mathrm{Tr}(\Sigma) = \sum_{i=0}^{\dim(X)} \lambda_i = \sum_{i=0}^{\dim(X)} \Sigma_{ii} \qquad (6)
Figure 4 depicts how the uncertainty σ(X) of a one-dimensional space, i.e., the diagonal of Σ[X|X_meas], decreases with two independent samples p1 and p2. In the vicinity of those sample points, σ is no longer 1.0, since a smooth correlation is assumed with the MKF.
Fig. 4 Uncertainty conditioning process. When two samples p1 and p2 are taken and incorporated into X_meas, Σ[X|X_meas] is updated according to Eq. (4). It can be observed that the uncertainty σ(X) associated with Σ[X|X_meas] becomes zero at the sample points (assuming no sampling noise) and decreases in their vicinity according to the Matérn function. The entropy H[X|X_meas] of the process decreases accordingly. The intermediate uncertainty between samples implies redundancy in the acquisition; thus, two very near samples make no significant reduction to σ(X).
3.2 Informative Path Planning
Once the information framework is formulated, we can continue by stating the sequential decision problem underlying the IPP. The final objective of the IPP has been reduced to minimizing the total entropy of the information. For this work, we will take the uncertainty, the trace of Σ[X|X_meas], Tr(Σ), as an analogous measurement of the entropy (as mentioned before, they are directly related). The objective is to find a path Ψ = {ψ1, ψ2, ..., ψN} for every agent in a fleet of N agents that minimizes the uncertainty at the end of a mission:

\Psi^* = \arg\min_{\Psi} \mathrm{Tr}(\Sigma, \Psi) \qquad (7)
Given that every agent can move sequentially from a point p_t^j to another p_{t+1}^j, every path ψ is composed of T possible movements. The distance d_meas from one position to the next one is fixed. This way, the speed of every vehicle is fixed to a constant value. Additionally, with every movement a new WQ sample is obtained for every vehicle in the fleet. With regard to obstacles, given the navigation map M, which indicates whether a point p is navigable or not, the solutions will be restricted to those paths that are bounded to navigable zones. This way, the fleet must coordinate to find the best solutions, starting from any arbitrary initial points Ψ_0 = {ψ_1^{t=0}, ..., ψ_N^{t=0}}, that efficiently minimize the uncertainty without any collisions between the vehicles or with the non-navigable zones. This means that taking samples in the very same place that another agent (or the same one) has already measured is, obviously, not desired. Finally, the IPP problem can be stated as:
\underset{\Psi = \{\psi_1, ..., \psi_N\}}{\text{minimize}} \; \mathrm{Tr}\left( \Sigma[X | X_{meas}(\Psi)] \right) \quad \text{subject to} \quad \forall p \in X_{meas}(\Psi): \; M(p) = 1 \qquad (8)
Figure 5 depicts a typical situation where two vehicles intersect their paths, taking redundant samples in between.
4 Methodology
First, this section explains the Markov Decision Process that serves as the sequential framework to implement a Reinforcement Learning approach for the IPP: the state representation, the reward function, and the actions. Secondly, the Deep Q-
Fig. 5 When addressing the multi-agent IPP, it is necessary to consider that two very near samples are redundant, and the closer they are, the less interesting both become. This figure represents two intersecting paths that incur redundancy (lower values of σ(X)). Taking samples at the very borders is also considered useless, as the uncertainty cannot be reduced outside the limits.
Learning approach is presented and its different sub-modules are described: the deep policy, the noisy implementation of the neural network, and the Prioritized Experience Replay mechanism.
4.1 Markov Decision Process
The IPP can be formulated as a Markov Decision Process (MDP) to fulfill the requirements of the Reinforcement Learning theory [25]. In any MDP there is, at least, one agent that can perform an action a_t from a set of valid actions A at instant t. This action interacts with the environment, which produces a state s_{t+1} and a reward r_t. The state represents all the available environment information and, in the strict sense, it is all the information necessary to reconstruct it. The agent can observe the scenario through an observation mechanism o = O(s) that maps the environment state into an observation o. In general terms, the observation has incomplete information of the dynamics of the scenario. When the observation is perfect, s_t ≡ o_t, the MDP is defined to be a Fully-Observable MDP (FOMDP). In the opposite case, it is known as a Partially-Observable MDP (POMDP). Every agent also has a policy a = π(o) that maps an observation into an action. The policy can be deterministic or
stochastic, and defines the strategy the agent follows to obtain the maximum reward possible in a finite action time T:

\pi^*(o) = \arg\max_{\pi} \left[ \sum_{t=0}^{T} r\left( a_t = \pi(o_t), s_t \right) \right] \qquad (9)

The optimal policy π*(o) is reached, and the MDP is considered solved, when there is no strategy that obtains, on average, a better reward.
4.1.1 Agents actions
As mentioned in Subsection 3.2, every agent performs the same type of actions, which are movements in |A| possible directions with a constant distance d_meas. Those actions correspond to the cardinal directions {N, S, E, W, NE, NW, SE, SW}. With a constant speed, the fleet moves in a coordinated manner and every path in Ψ will have the same length d_max. Every action also involves taking a sample at the next resulting point and obtaining a new uncertainty matrix Σ.
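As an illustration of this action set, the following sketch maps the eight discrete actions to fixed-length displacements. The specific action indices and the normalization of diagonal steps are assumptions made for the example, not details taken from the chapter.

import numpy as np

# The eight cardinal/diagonal movements, indexed as discrete actions 0..7 (assumed order).
DIRECTIONS = {0: (0, 1), 1: (0, -1), 2: (1, 0), 3: (-1, 0),
              4: (1, 1), 5: (-1, 1), 6: (1, -1), 7: (-1, -1)}

def next_position(position, action, d_meas=1.0):
    """Move a fixed distance d_meas in the chosen direction and return the next sample point."""
    direction = np.asarray(DIRECTIONS[action], dtype=float)
    direction /= np.linalg.norm(direction)  # keep the step length constant for diagonal moves
    return np.asarray(position, dtype=float) + d_meas * direction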
4.1.2 The state and observation
There is not a unique way to represent the state of this problem. To accomplish the task, the observation made by the agents must be feasible and must include as much information as possible about the real problem state. Here we propose a fully visual observation, similarly to related works such as [33, 15]. This observation is composed of the following elements of the state, as channels of an image of [58 × 38] pixels:
1. The navigation map M: formed by an image with values in {0, 1}. A value of 1 means this position is considered non-navigable.
2. The agent j position: also formed by a binary image of the same size, with a 1 in the position of the agent that is making the observation.
3. The fleet state: another binary image with a 1 in those cells occupied by the other agents' positions, from the perspective of the agent making the observation.
4. The uncertainty map σ(X), which contains the diagonal values of Σ[X|X_meas] and represents the values of the trace according to Eq. 6 (a sketch of how these channels can be stacked is given after this list).
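The sketch below stacks the four channels into a single observation tensor for agent j. It assumes the positions are given as integer grid cells of the [58 x 38] map; the helper name and argument layout are illustrative rather than taken from the chapter's code.

import numpy as np

def build_observation(nav_map, uncertainty, fleet_positions, agent_idx):
    """Stack the four channels described above into a [4, H, W] observation."""
    H, W = nav_map.shape
    agent_channel = np.zeros((H, W), dtype=np.float32)
    fleet_channel = np.zeros((H, W), dtype=np.float32)
    for i, (row, col) in enumerate(fleet_positions):
        if i == agent_idx:
            agent_channel[row, col] = 1.0   # position of the observing agent
        else:
            fleet_channel[row, col] = 1.0   # positions of the other agents
    return np.stack([nav_map.astype(np.float32),
                     agent_channel,
                     fleet_channel,
                     uncertainty.astype(np.float32)])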
Figure 6 depicts the different channels of the state.
Fig. 6 An arbitrary observation o_t of an agent: (a) the navigation map M, which is the same for every agent; (b) the uncertainty values mapped into an image; (c) the other agents, apart from the observer, discretized in an image; (d) the position of the agent making the observation.
4.1.3 The reward function
For the DRL to optimize a useful policy for the information gathering, it is vital to design an effective reward function. The reward function must evaluate how good an action is in a particular state. The total reward at the end of an episode, which is the sum of the consecutive rewards for every agent, will represent how well the fleet performs in the given scenario. The reward must be positive for good actions and negative for bad actions. The goodness or badness of every action is something worth discussing.
For this particular problem, the total minimization of the uncertainty could be a good measurement of the quality of each vehicle movement. Let the fleet action at instant t be ā_t = {a_t^1, a_t^2, ..., a_t^N}, with one action for every agent i:

r(s_t, a_t^j) = \frac{\mathrm{Tr}(\Sigma)_{t-1} - \mathrm{Tr}(\Sigma)_{t}}{\dim(\mathrm{Tr}(\Sigma))} = \frac{\Delta U_t}{\dim(\mathrm{Tr}(\Sigma))} \qquad (10)
This is equivalent to saying that every agent will be rewarded with the total amount of uncertainty reduced by all the fleet movements. Here we can expect the agents to learn to minimize the uncertainty, but there is an intrinsic problem with this direct formulation. A bad action of one agent and a good action of another will be assigned the same credit, so the learning will be harder since agents are partially biased by the others' behavior. This problem, called the credit assignment problem, is
typical of multi-agent games [9], and in this particular case it will be significant, as the process explained in Eq. 4 is difficult to decompose among different agents. Here we propose a solution for this problem using simple knowledge of the scenario dynamics: we can compute the individual contribution to the uncertainty reduction as if the agent were alone. This way, for every agent action a_t^j ∈ ā_t, the individual contribution ΔU_t^j can be computed as:

\Delta U_t^j = \mathrm{Tr}\left( \Sigma[X|X_{meas}]_{t-1} \right) - \mathrm{Tr}\left( \Sigma[X \,|\, X_{meas}^{t-1} \cup p_t^j] \right) \qquad (11)
This is equivalent to considering an individualistic reward that does not take into account the redundancies of the sampling at instant t. Taking two samples in nearby places would be rewarded as if they were unique samples, so the reward signal must consequently be modulated in terms of the redundancy. It is not difficult to see that the closer the agents taking samples, the higher the redundancy. We propose a linear decrease of the individual contribution as a function of the distance of each agent to the others, to enhance the dispersion of the agents through the search space. A clipped linear function can be imposed to do so, as in Figure 7.
Fig. 7 The penalization for the distance is proportional to the mean distance between the agent j and the others. It is imposed that, beyond a maximum distance d_max, the penalization is zero, and vice versa for a distance close to d_min.
To address the avoidance of non-navigable areas, it is also imposed that any movement to an invalid zone will be penalized with c_p < 0. The final reward function for an agent j will be:

r(s_t^j, a_t^j) = \begin{cases} c_p & \text{when } a \text{ involves a collision} \\ \Delta U_t^j + \lambda \times c_d(d_{ji}) & \text{otherwise} \end{cases} \qquad (12)
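The decoupled reward of Eqs. (11)-(12) can be sketched as follows. The clipped linear penalization mirrors Figure 7, while the values of c_p, lambda, d_min, and d_max are placeholders; the chapter does not fix them at this point.

import numpy as np

def distance_penalty(mean_distance, d_min, d_max):
    """Clipped linear penalization c_d of Figure 7: -1 near d_min, 0 beyond d_max."""
    x = (mean_distance - d_min) / (d_max - d_min)
    return -1.0 + float(np.clip(x, 0.0, 1.0))

def decoupled_reward(delta_U_j, mean_dist_j, collided,
                     c_p=-1.0, lam=1.0, d_min=0.0, d_max=10.0):
    """Individual reward of Eq. (12): collision penalty, or the agent's own
    uncertainty reduction (Eq. 11) plus the distance modulation term."""
    if collided:
        return c_p
    return delta_U_j + lam * distance_penalty(mean_dist_j, d_min, d_max)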
4.2 Deep Q Learning
Deep Q-Learning was applied successfully for the first time in [15], where the DeepMind team trained an agent to play several classic ATARI games with superhuman capabilities. The success of this work resides in the use of CNNs to represent the state-action function, which is called the Q function. This function represents, in the sense of the Bellman equations [3], the expected discounted future reward given a state s and an action a:

Q(s_t, a_t) = \mathbb{E}\left[ r(s_t, a_t) + \gamma \sum_{s' \in S} p(s'|s, a) \max_{a' \in A} Q(s', a') \right] \qquad (13)

where γ is a discount factor that weights the importance of future rewards against immediate ones, and p(s′|s, a) is the probability of transitioning to s_{t+1} from the current state and given the current action³.
In DQL, Q is represented by a neural network Q(s, a; θ) (with θ being the parameters) and must be optimized by trial and error using a Stochastic Gradient Descent (SGD) algorithm to find the optimal policy according to Bellman's principle of optimality [3]:

\pi^*(s; \theta) = \arg\max_{a \in A} Q(s, a; \theta) \qquad (14)

The deep policy π(s; θ) is optimized by generating many tuples of experiences <s_t, a_t, r_t, s_{t+1}> and minimizing the Bellman error E:

E(s_t, a_t, r_t, s_{t+1}) = Q(s_t, a_t) - \underbrace{\left[ r_t + \gamma \times \max_{a} Q(s_{t+1}, a) \right]}_{\text{Target value } y(r_t, s_{t+1})} \qquad (15)
This error value, usually called the temporal difference error, represents the prediction error made when estimating the expected future value with Q(s, a). This error is minimized by taking SGD steps with respect to a loss function, so the parameters of Q(s, a; θ) move in the direction that minimizes such a value. This optimization method, as can be seen in Eq. 15, uses the very same function under optimization to predict future values, also called target values. This method, named bootstrapping, involves

³ In a totally deterministic environment, this probability is assumed to be 1.
certain instabilities [25] that can cause the learning to fail completely. In [15, 8], the solution to this comes from the use of two effective methods:
1. The use of a twin target function to predict the Q values at the instant t+1. This function, Q_target(s, a; θ′), is not optimized directly, but instead periodically copies its parameters θ′ from Q(s, a; θ).
2. The use of an Experience Replay (ER). The ER technique consists of saving every experience in a memory and training the neural policy periodically using batches of samples randomly chosen from it. Random sampling reduces the correlation between experiences and therefore the network bias (a minimal update sketch combining both methods is given below).
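A minimal PyTorch sketch of one optimization step, combining the target network of item 1 with a batch sampled from the replay memory of item 2, could look as follows. It implements the Bellman error of Eq. (15) with a mean squared loss; the tensor layout of the batch and the choice of optimizer are assumptions of the example.

import torch
import torch.nn.functional as F

def dql_update(q_net, q_target, optimizer, batch, gamma=0.99):
    """One SGD step on the Bellman error of Eq. (15) using a frozen target network."""
    states, actions, rewards, next_states, dones = batch
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        next_q = q_target(next_states).max(dim=1).values
        targets = rewards + gamma * (1.0 - dones) * next_q   # target value y(r_t, s_{t+1})
    loss = F.mse_loss(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Periodically, the target parameters are copied from the online network:
# q_target.load_state_dict(q_net.state_dict())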
Fig. 8 Deep Q-Learning algorithm scheme, combining the replay memory and the sampling of experience batches.
Finally, this algorithm uses the trial and error method to optimize Q(s, a) so that it can be used following Eq. 14. When addressing the multi-agent problem, the same Q function can be extended to all agents, as they are homologous in actions and in observations. Since the objective in this problem is intended to be fully cooperative and the interactions of every agent are included through the reward, as was explained before, the neural policy can be trained using every individual state to produce one action for every agent. Here, it is worth mentioning that, as every agent's situation is part of the state and the behavior of every agent evolves with learning, the dynamics of the environment become nonstationary. This involves a higher complexity
than classical stationary problems, as in the ATARI cases [15], and, in the end, a
higher need for exploration of the state-action domain.
4.2.1 Prioritized Experience Replay
As mentioned before, the ER consists of sampling batches of previous experiences, saved in a memory buffer M as they occur, to fit the Q(s, a; θ) function. When the experiences are sampled with equal probability, we neglect the fact that some experiences could be more useful for training than others. This situation usually happens when a rarely visited state is sampled and the prediction error is big. Therefore, it makes sense to sample the experiences in a non-uniform fashion. With a Prioritized Experience Replay (PER) [23], every experience in the memory E_i ∈ M has a probability p_i of being sampled for the loss computation. This probability will be proportional to the Bellman error according to Eq. 15. The higher the prediction error, the higher the probability of being sampled in the future for a better fit of the prediction. Nonetheless, it is important to let other, less erratic experiences also be sampled, to avoid bias. This way, the probability of sampling is imposed to depend on a prioritizing exponent α ∈ [0, 1] that modulates the uniformity of the priorities⁴:
p_i = \frac{E(s_i^t, a_i^t, r_i^t, s_i^{t+1})^{\alpha}}{\sum_{k=0}^{|M|} E(s_k^t, a_k^t, r_k^t, s_k^{t+1})^{\alpha}} \qquad (16)
Nonetheless, this sampling is a potential source of bias, because the loss will be much bigger for those experiences with high prediction errors, especially at the beginning of the learning. To modulate the importance of every error in the loss, an importance weight w_i is computed for every experience in the memory. This weight depends on a parameter β that represents the level of compensation of the error in the loss. This way, the importance weight is expressed as

w_i = \left( \frac{1}{|M|} \cdot \frac{1}{p_i} \right)^{\beta} \qquad (17)

and the loss is computed with the following formula:

⁴ α = 0 means fully uniform sampling and vice versa.
\mathcal{L}(B) = \frac{1}{|B|} \sum_{i=0}^{|B|} w_i \cdot \left( Q(s_i, a_i) - y_i \right)^2 \qquad (18)
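Equations (16)-(18) translate into a few lines of NumPy. In this sketch the absolute TD errors play the role of E(.), and the importance weights are normalized by their maximum, which is a common stabilization practice from [23] rather than something stated in the text above.

import numpy as np

def per_probabilities(td_errors, alpha=0.4, eps=1e-6):
    """Sampling probabilities of Eq. (16) from the absolute Bellman errors."""
    priorities = (np.abs(td_errors) + eps) ** alpha
    return priorities / priorities.sum()

def importance_weights(probs, beta=0.6):
    """Importance-sampling weights of Eq. (17), normalized by their maximum."""
    w = (1.0 / (len(probs) * probs)) ** beta
    return w / w.max()

# Usage: sample a batch of indices and weight the squared errors in the loss of Eq. (18).
td_errors = np.random.rand(1000)
probs = per_probabilities(td_errors)
idx = np.random.choice(len(probs), size=64, p=probs)
weights = importance_weights(probs, beta=0.6)[idx]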
4.2.2 Deep Policy
The convolutional policy designed to represent Q(s, a) has three initial convolutional layers to extract the features of the graphical state. This first convolutional head is made up of 64, 32, and 16 filters with the same kernel size (3 × 3). After this, the features are flattened and processed by three linear layers of 256 neurons each. Every neuron uses a Leaky Rectified Linear Unit (Leaky-ReLU) as the activation function. In the final stage of the policy, a dueling architecture is implemented [31]. With the dueling architecture, the value function V(s) is segregated from the advantage values A(s, a) (see Eq. 19). The former represents the expected accumulated reward in the current state. The latter represents the expected action-state value with respect to the current state value.
A(s, a) = Q(s, a) - V(s)

This technique enables a better generalization of action learning in the presence of similar states and, according to [31], the calculation of Q(s, a) will be as follows:

Q(s, a) = V(s) + \left( A(s, a) - \frac{1}{|A|} \sum_{a'} A(s, a') \right) \qquad (19)
Fig. 9 The convolutional neural network implemented to process the state and to estimate the Q function. It is made up of three convolutional blocks for feature extraction and three dense noisy layers. At the end, the dense part is decomposed into a value head and an advantage head.
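A possible PyTorch sketch of the network in Figure 9 is given below. For clarity it uses standard linear layers in the dense trunk; in the actual proposal these would be the noisy layers described next, and details such as padding or the exact flattened feature size are assumptions of the example.

import torch
import torch.nn as nn

class DuelingCNNPolicy(nn.Module):
    """Sketch of the policy of Figure 9: three convolutional blocks, a dense trunk,
    and value/advantage heads combined as in Eq. (19)."""
    def __init__(self, n_channels=4, n_actions=8):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(n_channels, 64, kernel_size=3), nn.LeakyReLU(),
            nn.Conv2d(64, 32, kernel_size=3), nn.LeakyReLU(),
            nn.Conv2d(32, 16, kernel_size=3), nn.LeakyReLU(),
            nn.Flatten(),
        )
        with torch.no_grad():  # infer the flattened feature size from a dummy 58x38 state
            flat = self.features(torch.zeros(1, n_channels, 58, 38)).shape[1]
        self.trunk = nn.Sequential(
            nn.Linear(flat, 256), nn.LeakyReLU(),
            nn.Linear(256, 256), nn.LeakyReLU(),
            nn.Linear(256, 256), nn.LeakyReLU(),
        )
        self.value = nn.Linear(256, 1)              # V(s)
        self.advantage = nn.Linear(256, n_actions)  # A(s, a)

    def forward(self, x):
        h = self.trunk(self.features(x))
        v, a = self.value(h), self.advantage(h)
        return v + a - a.mean(dim=1, keepdim=True)  # Eq. (19)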
In this work, we also propose a possible application of noisy neural networks as a technique to enhance the explorative behavior in multi-agent problems. Noisy neural networks (nNN) have been shown to be useful in complex scenarios, but, as far
as the authors know, their use has been limited to cases with a single agent [7]. With nNN, a stochastic term sampled from a Gaussian distribution is added to every weight and bias. Therefore, every neuron parameter (w_i, b_i) has two distributions ξ̂ = (ξ_w, ξ_b). These distributions are modeled by two parameters each: (µ_w, σ_w) for the weights and (µ_b, σ_b) for the biases. With every evaluation of Q(s, a), new values are sampled from ξ̂ (see Figure 10). This makes the policy intrinsically stochastic, as different evaluations of the very same state will return different values. This effect gives the network an inherent adaptive exploratory capacity. The balance between exploration of the state-action domain and exploitation in the sense of Eq. 14 is incorporated parametrically. In classical DQL implementations, the most common exploration method is the ε-greedy policy, where there is a probability ε of choosing a totally random action. By annealing ε from 1 (at the beginning of the optimization) to near 0, the agent explores. Nevertheless, in multi-agent problems like this one, random exploration could be insufficient. Furthermore, the annealing values must be chosen so that the agent explores enough in the first training stages.
Fig. 10 In (a), a classic neural network layer with weights w̄ and biases b̄. In (b), a noisy layer is presented. For each previous parameter, a stochastic value is added by sampling a Gaussian distribution of mean µ and deviation σ. Note that these parameters are trainable and will modulate the stochasticity of the policy.
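A sketch of a noisy layer in the spirit of Figure 10 is given below. It uses independent Gaussian noise per parameter, which is the simplest variant; [7] also describes a factorized version, and the initialization constants here are assumptions of the example.

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyLinear(nn.Module):
    """Linear layer whose weights and biases are perturbed by learnable Gaussian noise."""
    def __init__(self, in_features, out_features, sigma_init=0.5):
        super().__init__()
        bound = 1.0 / math.sqrt(in_features)
        self.mu_w = nn.Parameter(torch.empty(out_features, in_features).uniform_(-bound, bound))
        self.sigma_w = nn.Parameter(torch.full((out_features, in_features), sigma_init * bound))
        self.mu_b = nn.Parameter(torch.empty(out_features).uniform_(-bound, bound))
        self.sigma_b = nn.Parameter(torch.full((out_features,), sigma_init * bound))

    def forward(self, x):
        eps_w = torch.randn_like(self.sigma_w)  # xi_w ~ N(0, 1), redrawn at every call
        eps_b = torch.randn_like(self.sigma_b)  # xi_b ~ N(0, 1)
        weight = self.mu_w + self.sigma_w * eps_w
        bias = self.mu_b + self.sigma_b * eps_b
        return F.linear(x, weight, bias)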
5 Simulation and results
This section describes the simulation conditions for each experiment and the results obtained. First, we compare the importance of individual assignment in the reward function, as mentioned in Section 4.1.3. Then, the benefits of using noisy networks are explained in a comparison with an ε-greedy exploratory policy. All simulations were performed with a fleet size N of four agents. The navigation map corresponds to the Ypacaraí Lake (Paraguay) as a real case scenario (see Figure 6a). All optimizations were carried out on a double Intel Xeon Silver 4210R 3.4 GHz, 187 GB RAM, and an Nvidia RTX 3090 with 24 GB VRAM. All hyperparameters of the algorithms are summarized in Table 2.
Hyperparameter                                 Value
Learning rate (lr)                             1 × 10⁻⁴
Batch size (|B|)                               64
Memory replay size                             1 × 10⁶
Update frequency for target                    1000 epochs
Discount factor (γ)                            0.99
Loss function                                  Mean Squared Error
SGD optimizer                                  Adam
ε value (if applied)                           [1, 0.05] (annealed)
ε decay interval (% of learning progress)      [0, 33]
Prioritizing exponent (α)                      0.4
Importance exponent (β)                        [0.6, 1] (annealed)
Number of episodes                             50,000

Scenario parameter                             Value
Matérn Kernel lengthscale (l)                  0.75
Matérn Kernel differential parameter (ν)       1.5
Path maximum length (d_max)                    29 km
ASV speed                                      2 m/s
Fleet size N                                   4
Max. collisions permitted per mission          5

Table 2 Hyperparameters and scenario values used in the DRL training experiments.
The hyperparametrization of this algorithm has been conducted following previous values in similar approaches [33, 35]. It has been observed that the selected batch size is adequate for this problem and that decreasing the number of experiences per SGD step only increases the variance. Other parameters, such as those of the Prioritized Experience Replay or the discount factor, are taken from the original papers [23, 15].
5.1 Reward function comparison
In order to validate the reward function in terms of credit assignment, two variants are tested: i) the total uncertainty reduced from one instant to another is rewarded equally to each agent in the sense of Eq. 7 (coupled reward), and ii) every agent receives reward only for its own contribution, without any further considerations (decoupled reward; see Eq. 12). Note that the distance penalization is applied in exactly the same way in both cases.
Fig. 11 Normalized uncertainty Û at the end of the mission along the training process for two variants of the reward: coupled and decoupled.
Figure 11 shows the uncertainty over the training progress for both variants. With the decoupled reward, the algorithm is able to resolve the credit assignment better and obtains a better policy than in the coupled case. There is a 40% improvement on average from one case to the other, which is significant for the exploration objective. The behavior of the coupled-reward agents tends to be more redundant, as can be seen in Figure 12b. As the reward does not properly evaluate the individual contribution of every agent, the policy tends to overestimate its behavior, producing lazy agents. Since a good performance of one agent increases the value of Q for the others, this leads to unequal contributions to the task. The proposed algorithm is able to synthesize good policies, as seen in Figure 12a, where the lake scenario is almost completely covered and the agents tend to move more efficiently across the search space, with far fewer self-intersecting trajectories.
Fig. 12 Resulting paths of the same mission using (a) the decoupled reward and (b) the coupled one, both with a nNN policy.
5.2 Exploration efficiency
When addressing the exploration ability of nNN, it has been observed that the speed of convergence and the performance of the noisy neural network surpass, in this case, the ε-greedy implementation⁵. In Figure 13, the accumulated reward the agents receive on average can be seen for both trainings. The very first steps of the ε-greedy training consist of an exploration phase where the actions are randomly taken. The performance of the training slowly increases as ε decreases over time. This exploration period must be selected and properly tuned. With less exploration (see ε-greedy Policy 2 in the same figure), we have observed a worse performance. On the contrary, the nNN policy needs far fewer episodes to converge to a high-performance policy. Whereas in the ε-greedy case it is necessary to process 16500 episodes before greedily exploiting the policy, the nNN is able to find collision-free paths in 1000 episodes and higher rewards in 5000 episodes. In terms of the uncertainty, the noisy proposal is able to obtain the same performance, if not slightly better (1% better on average), than the more explorative ε-greedy variant. This means that, for the same performance, the ε-greedy algorithm takes 5 times more episodes. Nonetheless, with less exploration, convergence is faster but the performance becomes worse, with higher variation. This indicates that similar policies are encountered, but the enhanced exploration of the nNN allows a more sample-efficient training without any epsilon-decay tuning.

⁵ From this point on, the decoupled reward is selected for its better performance.
Fig. 13 Average accumulated reward (top), mission length (middle), and final uncertainty (bottom) over the training process for the ε-greedy policies and the proposed noisy policy.
5.3 Comparison with other algorithms
In order to validate the performance of the proposed framework, we propose different algorithms against which to compare the policy efficiency. The metrics under study are: i) the average accumulated reward at the end of the episode among the agents (AER), ii) the remaining uncertainty at the end of the episode U_{t=T}, iii) the mean distance between ASVs, and finally, iv) the root mean squared error (RMSE) of the inferred WQ model with respect to the ground truth. For the latter, we propose Gaussian Processes (GPs), as in [18], to perform a regression with the sampled values as they are acquired. The proposed benchmark used to represent the WQ scalar field is a sum of positive and negative peaks built from two inverse Shekel functions⁶. This scalar field places a random number of peaks across the navigable surface to represent the maxima and minima of the water quality parameters, each with a different intensity (see Figure 14).
With regard to the benchmark algorithms, we present different heuristics and optimization methods based on previously proposed approaches:

⁶ See https://deap.readthedocs.io/en/master/api/benchmarks.html for the complete definition.
Fig. 14 Example of a WQ scalar field used for RMSE evaluation by means of a Shekel function.
1. Random Safe Search (RSS): This heuristic generates random paths without collisions of any kind. It is not a good path planning strategy, but it serves as a lower bound to validate the problem complexity.
2. Random Wandering (RW): In this heuristic, agents travel following straight trajectories until an obstacle is reached. Then, the agent chooses a new straight path, different from the last direction, to avoid retracing its steps.
3. Offline Genetic Algorithm Optimization (Offline-GA): This approach is similar to [1], where a GA is used to optimize the paths. For every evaluation scenario, paths are optimized using mutation, crossover, and selection operations to obtain the best collective path. The designed reward is used as the fitness function for each individual (here, an individual refers to the ASVs' paths). To avoid colliding paths, the death penalty is applied to individuals that contain collisions. Note that, for every different starting condition, Offline-GA must re-optimize, which makes this algorithm unfeasible for online deployment. This algorithm will also serve as a performance indicator of how high a reward can be achieved when intensively optimizing every individual case.
4. Receding Horizon Optimization (RH): In the RH approach, the paths are optimized according to the reward function up to a horizon of H steps from the current point, using a copy of the current online scenario. Once the optimization budget is reached, only the first action for every ASV is executed and a new H-step optimization begins. This approach is similar to the Model Predictive Control (MPC) algorithm used in [10], which can be useful in online optimization scenarios (a high-level sketch of this receding-horizon loop is given after this list).
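As an illustration of the receding-horizon baseline, the sketch below uses simple random shooting as a stand-in for the GA-based inner optimizer: candidate H-step joint plans are scored on a copy of the scenario and only the first fleet action of the best plan is executed. The copyable simulator interface and the parameter values are assumptions of the example.

import random

def receding_horizon_step(env, n_agents, H=5, n_candidates=200):
    """Score random H-step joint plans on a scenario copy and return the best first action."""
    best_score, best_plan = float("-inf"), None
    for _ in range(n_candidates):
        sim = env.copy()                       # assumed copyable simulator
        plan = [[random.randrange(8) for _ in range(n_agents)] for _ in range(H)]
        score = 0.0
        for joint_action in plan:
            _, rewards, _ = sim.step(joint_action)
            score += sum(rewards)
        if score > best_score:
            best_score, best_plan = score, plan
    return best_plan[0]                        # only the first fleet action is executed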
Table 3 presents the results of the different approaches. Every algorithm has been evaluated in 10 test scenarios with the same starting conditions and WQ ground truths. In terms of the AER metric, the DRL approach obtains better results than the other algorithms, with a lower deviation, which indicates a more reliable and stable behavior. The final uncertainty after a mission is consequently reduced to 0.053 on average, 14% less than that of the second-best-performing algorithm, Offline-GA. With respect to the estimation error, the DRL approach obtains a better accuracy than any other approach. Although the estimation method could affect this metric, in terms of performance it can be seen that the most informative paths (those with lower uncertainty) obtain better models of the WQ parameters.
Metric      |  AER              |  U(t=T)           |  Distance         |  RMSE
Algorithm   |  Mean    Std.dev. |  Mean    Std.dev. |  Mean    Std.dev. |  Mean    Std.dev.
RSS         |  41.039  6.65     |  0.23    0.019    |  18.64   2.63     |  0.152   0.233
RW          |  70.970  8.90     |  0.13    0.028    |  18.13   5.09     |  0.027   0.035
Offline-GA  |  92.112  4.34     |  0.062   0.005    |  19.71   4.54     |  0.0057  0.005
RH (H=5)    |  77.688  8.50     |  0.117   0.026    |  17.46   4.43     |  0.0452  0.062
RH (H=10)   |  80.114  9.72     |  0.109   0.031    |  18.25   4.45     |  0.0285  0.034
DRL         |  97.013  1.45     |  0.053   0.003    |  18.94   4.46     |  0.0053  0.003
Table 3 Statistical results of evaluating the proposed algorithm and the other benchmarks in 10 scenarios with different starting points and WQ scalar fields.
Figure 15 presents example paths generated by the aforementioned algorithms. The RW heuristic, despite being random, tends to generate long paths across the lake surface, but with unnecessary redundancy and self-intersecting trajectories. This strategy is enhanced in the Offline-GA, where the redundancy is reduced with excellent performance. Nonetheless, since straight paths are easier to optimize, it is difficult to improve beyond them, and large zones of the surface remain unvisited. The DRL approach, on the contrary, not only learns how to resolve almost every scenario, but also learns that visits adjacent to the shores return lower rewards. This generates a shore-aware behavior, and those areas are avoided. The GA is still able to obtain acceptable solutions, but at the cost of a highly intensive computation for every single case. With respect to the receding horizon, we have observed that it is prone to stalling in local minima. With higher prediction horizons, the algorithm has difficulties finding good local sub-solutions, and with lower ones the performance is higher but still sub-optimal.
If we consider the progress of the uncertainty reduction over a mission, we can see in Figure 16 how the RMSE score decreases along with the uncertainty.
Fig. 15 Resulting paths of the proposed algorithms for comparison and the DRL approach: (a) RSS, (b) RW, (c) Offline-GA, (d) RH (H=5), (e) RH (H=10), and (f) DRL.
The DRL approach reaches lower values (14% and 7% improvement with respect to the Offline-GA approach) in less time and with less deviation.
6 Discussion of the results
The use of DRL always involves the design of a reward function as a way to describe the desired behavior of the policy. The results have shown that, in the particular IPP case, it is vital to design a proper reward function in order to achieve good performance. This situation is aggravated when the state-action space grows, which is the case in every multi-agent paradigm.
Fig. 16 Evolution of the remaining uncertainty (top) and the RMSE (bottom) over the mission steps for every algorithm: Noisy DRL, Offline-GA, Random Wanderer, Random Safe Search, and the receding-horizon MPC-GA with H=5 and H=10.
The multi-agent IPP has been shown to need a decoupling of the real information dynamics in order to cope with the credit assignment problem: it is better to consider each agent's individual contribution to the task and to add an additional redundancy term such as the inter-agent distance. According to our simulations, the improvement can reach up to 84% with respect to the coupled baseline. This modification, while it involves more calculations, is absolutely necessary for the DRL to converge to a competitive policy.
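As an illustration only, the snippet below sketches one plausible form of such a decomposition: each agent is credited with the uncertainty reduction attributable to its own sample, plus a separation term based on the distance to the nearest teammate to penalize redundant measurements. The function name, its inputs, and the weight w_dist are hypothetical and do not reproduce the exact reward used in this chapter.

```python
import numpy as np

def decomposed_rewards(delta_sigma, positions, w_dist=0.1):
    """Per-agent reward sketch: individual informative credit plus a spread bonus.
    delta_sigma[i] is the uncertainty reduced by agent i's own sample (assumed given)."""
    rewards = np.asarray(delta_sigma, dtype=float)
    positions = np.asarray(positions, dtype=float)
    for i in range(len(positions)):
        others = np.delete(positions, i, axis=0)
        d_min = np.min(np.linalg.norm(others - positions[i], axis=1))
        rewards[i] += w_dist * d_min          # larger reward when agents spread out
    return rewards

# Example: agents 0 and 1 measure close together, agent 2 far away
print(decomposed_rewards([0.30, 0.25, 0.40], [[1, 1], [2, 1], [30, 25]]))
```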
Regarding the explorative behavior of the two variants under test, ε-greedy and noisy networks, it is clear that the noisy layers can boost the algorithm's efficiency by adjusting the explorative behavior as part of the parameters of the network. This method, while it adds more parameters, avoids the tedious task of fine-tuning the ε-greedy schedule. A decrease in performance has been observed when the exploration is insufficient; conversely, when the exploration phase lasts longer, the policy performance is still not as high as that obtained using nNN. This raises a question about the exploration needs of every complex or multi-agent problem: how much exploration is necessary? Additionally, it could happen that fully random behavior of the agents is ineffective and some form of parametric noise is required (as happens in much more complex games like ATARI Montezuma's Revenge [7]). In this particular case, although the scope is limited, the nNN has proven better both in sample efficiency and in terms of the obtained score.
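For reference, a self-contained PyTorch sketch of a factorized noisy linear layer in the spirit of [7] is shown below; the initialization and the sigma0 value follow the original paper, but this is a generic implementation, not the exact layer of the network used in this chapter.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyLinear(nn.Module):
    """Linear layer with learnable, factorized Gaussian parameter noise [7]."""
    def __init__(self, in_features, out_features, sigma0=0.5):
        super().__init__()
        self.mu_w = nn.Parameter(torch.empty(out_features, in_features))
        self.sigma_w = nn.Parameter(torch.empty(out_features, in_features))
        self.mu_b = nn.Parameter(torch.empty(out_features))
        self.sigma_b = nn.Parameter(torch.empty(out_features))
        self.register_buffer("eps_in", torch.zeros(in_features))
        self.register_buffer("eps_out", torch.zeros(out_features))
        bound = 1.0 / math.sqrt(in_features)
        nn.init.uniform_(self.mu_w, -bound, bound)
        nn.init.uniform_(self.mu_b, -bound, bound)
        nn.init.constant_(self.sigma_w, sigma0 / math.sqrt(in_features))
        nn.init.constant_(self.sigma_b, sigma0 / math.sqrt(in_features))
        self.reset_noise()

    @staticmethod
    def _f(x):                        # f(x) = sign(x) * sqrt(|x|)
        return x.sign() * x.abs().sqrt()

    def reset_noise(self):            # resample the factorized noise (e.g. every step)
        self.eps_in.normal_()
        self.eps_out.normal_()

    def forward(self, x):
        eps_w = torch.outer(self._f(self.eps_out), self._f(self.eps_in))
        eps_b = self._f(self.eps_out)
        return F.linear(x, self.mu_w + self.sigma_w * eps_w,
                        self.mu_b + self.sigma_b * eps_b)
```

Replacing the dense layers of the Q-network with NoisyLinear layers and calling reset_noise() at every interaction removes the need for an ε-greedy schedule, since the exploration intensity is learned through the sigma parameters.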
When addressing the comparison with other approaches, DRL has proven to be a good methodology to cope with the IPP objectives. The algorithm is able to deal with arbitrary initial points and obtains the top performance among the compared approaches. When comparing the results of the DRL-trained policy with the Offline-GA, it can be observed that the scores are better in RMSE and uncertainty, which validates the proposal for online deployment. Although we obtain good performance with the Offline-GA, it is only useful for optimizing scenarios one by one, evaluating several solutions for the same scenario. In the simulations, DRL performs the same number of simulator evaluations, but over different missions and starting points. This way, the policy generalizes the task independently of the fleet situation. The fact that the DRL policy overcomes the Offline-GA performance without any in-place optimization validates its ability to solve the IPP in general.
Another aspect worth discussing is the ability of every algorithm to reduce the model error. It can be seen that the lower the uncertainty, the lower the RMSE. Nonetheless, the lower bound of the error is related not only to the acquisition path Ψ, but also to the regression method used and to the ground truth itself. For this case, a plausible WQ parameter benchmark has been used, following the behavior of WQ parameters in the Mar Menor (Spain) and previous works [18, 17]. Nonetheless, when different benchmarks representing other variable distributions are used, the error could behave differently from our case. In the end, the error reduction depends on a suitable regression method. Once the regression method complies with the dynamics of what we want to measure, the uncertainty criterion serves to reduce this error in a fast and powerful way.
7 Conclusions
In this chapter, a Deep Reinforcement Learning approach has been proposed to deal with the Informative Path Planning problem using multiple surface vehicles. The proposed informative framework uses a kernel function to correlate the samples in order to obtain the level of uncertainty remaining in the scenario under monitoring. Within this problem, the non-navigable zones of the Ypacaraí Lake, used as a testing scenario, are also considered. All these restrictions are considered in a tailored reward function that takes multi-agent credit assignment into consideration. The proposed reward function deals better with the IPP when the individual contributions to the uncertainty reduction are made explicit, compared to using the total reduced uncertainty for every agent. The use of several DRL mechanisms that have been useful in previous works has also been proposed: Prioritized Replay, Advantage networks, and noisy neurons. The latter has been observed to enhance the exploration and sample efficiency of the algorithm with respect to the classic ε-greedy strategy.
In the end, the algorithm has been compared with several other heuristics to validate its performance. The algorithm performs better than any other, with 20% higher rewards with respect to the receding-horizon counterpart and 5% higher than the intensive Offline-GA optimization. As future lines, we intend to study how this problem can be expanded to an arbitrary number of agents in every mission, instead of selecting a fixed fleet size per training. This would result in a fleet-size-aware policy that considers the appearance and disappearance of agents during the course of a mission. Another interesting aspect to investigate is how the action and state formulation affects the learning. It is worth studying whether noisy state inputs could help generalization with different boundaries or with moving obstacles in the middle of the navigable zones.
Acknowledgements This work has been funded by the Spanish "Ministerio de Ciencia, Innovación y Universidades" under the PhD grant FPU-2020 (Formación del Profesorado Universitario) of Samuel Yanes Luis.
References
1. Arzamendia M, Gregor D, Gutierrez-Reina D, Toral S (2019) An evolutionary approach to constrained path planning of an autonomous surface vehicle for maximizing the covered area of Ypacarai Lake. Soft Computing 23(5):1723–1734
2. Arzamendia M, Gutierrez D, Toral S, Gregor D, Asimakopoulou E, Bessis N (2019) Intelligent online learning strategy for an autonomous surface vehicle in lake environments using evolutionary computation. IEEE Intelligent Transportation Systems Magazine 11(4):110–125
3. Bellman RE (2003) Dynamic Programming. Dover Publications, Inc., USA
4. Coley K (2015) Unmanned surface vehicles: The future of data-collection. Ocean Challenge 21:14–15
5. Cover TM, Thomas JA (2006) Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing). Wiley-Interscience, USA
6. Ferreira H, Almeida C, Martins A, Almeida J, Dias N, Dias A, Silva E (2009) Autonomous bathymetry for risk assessment with ROAZ robotic surface vehicle. In: OCEANS 2009-EUROPE, pp 1–6, DOI 10.1109/OCEANSE.2009.5278235
7. Fortunato M, Azar MG, Piot B, Menick J, Osband I, Graves A, Mnih V, Munos R, Hassabis D, Pietquin O, Blundell C, Legg S (2017) Noisy networks for exploration. CoRR abs/1706.10295, URL http://arxiv.org/abs/1706.10295
8. van Hasselt H, Guez A, Silver D (2015) Deep reinforcement learning with double Q-learning. CoRR abs/1509.06461, URL http://arxiv.org/abs/1509.06461
9. Hoen PJt, Tuyls K, Panait L, Luke S, La Poutré JA (2006) An overview of cooperative and competitive multiagent learning. In: Tuyls K, Hoen PJ, Verbeeck K, Sen S (eds) Learning and Adaption in Multi-Agent Systems, Springer Berlin Heidelberg, Berlin, Heidelberg, pp 1–46
10. Julian KD, Kochenderfer MJ (2018) Distributed wildfire surveillance with autonomous aircraft using deep reinforcement learning. CoRR abs/1810.04244, URL http://arxiv.org/abs/1810.04244
11. Kathen MJT, Flores IJ, Reina DG (2021) An informative path planner for a swarm of ASVs based on an enhanced PSO with Gaussian surrogate model components intended for water monitoring applications. Electronics 10(13):1605
12. Krishna Lakshmanan A, Elara Mohan R, Ramalingam B, Vu Le A, Veerajagadeshwar P, Tiwari K, Ilyas M (2020) Complete coverage path planning using reinforcement learning for tetromino based cleaning and maintenance robot. Automation in Construction 112:103078, DOI 10.1016/j.autcon.2020.103078
13. Lillicrap TP, Hunt JJ, Pritzel A, Heess N, Erez T, Tassa Y, Silver D, Wierstra D (2016) Continuous control with deep reinforcement learning. In: Bengio Y, LeCun Y (eds) ICLR, URL http://dblp.uni-trier.de/db/conf/iclr/iclr2016.html#LillicrapHPHETS15
14. Lowe R, Wu Y, Tamar A, Harb J, Abbeel P, Mordatch I (2017) Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments. NIPS'17, Curran Associates Inc., Red Hook, NY, USA
15. Mnih V, Kavukcuoglu K, Silver D, et al (2015) Human-level control through deep reinforcement learning. Nature 518(7540):529–533, DOI 10.1038/nature14236
16. Murphy RR, Steimle E, Griffin C, Cullins C, Hall M, Pratt K (2008) Cooperative use of unmanned sea surface and micro aerial vehicles at Hurricane Wilma. Journal of Field Robotics 25(3):164–180, DOI 10.1002/rob.20235
17. Peralta F, Reina DG, Toral S, Arzamendia M, Gregor D (2021) A Bayesian optimization approach for multi-function estimation for environmental monitoring using an autonomous surface vehicle: Ypacarai Lake case study. Electronics 10(8):963
18. Peralta Samaniego F, Reina DG, Toral Marín SL, Gregor DO, Arzamendia M (2021) A Bayesian optimization approach for water resources monitoring through an autonomous surface vehicle: The Ypacarai Lake case study. IEEE Access 9(1):9163–9179, DOI 10.1109/ACCESS.2021.3050934
19. Piciarelli C, Foresti GL (2019) Drone patrolling with reinforcement learning. ACM International Conference Proceeding Series (1):1–6, DOI 10.1145/3349801.3349805
20. Popović M, Vidal-Calleja T, Hitz G (2020) An informative path planning framework for UAV-based terrain monitoring. Autonomous Robots 44:889–911, DOI 10.1007/s10514-020-09903-2
21. Rasmussen C, Williams C (2006) Gaussian Processes for Machine Learning. Adaptive Computation and Machine Learning, MIT Press, Cambridge, MA, USA, DOI 10.7551/mitpress/3206.003.0001
22. Sánchez-García J, García-Campos J, Arzamendia M, Reina D, Toral S, Gregor D (2018) A survey on unmanned aerial and aquatic vehicle multi-hop networks: Wireless communications, evaluation tools and applications. Computer Communications 119:43–65, DOI 10.1016/j.comcom.2018.02.002
23. Schaul T, Quan J, Antonoglou I, Silver D (2015) Prioritized experience replay. DOI 10.48550/ARXIV.1511.05952, URL https://arxiv.org/abs/1511.05952
24. Sim R, Roy N (2005) Global A-optimal robot exploration in SLAM. pp 661–666, DOI 10.1109/ROBOT.2005.1570193
25. Sutton RS, Barto AG (2018) Reinforcement Learning: An Introduction. A Bradford Book, Cambridge, MA, USA
26. Ten Kathen MJ, Flores IJ, Reina DG (2021) A comparison of PSO-based informative path planners for autonomous surface vehicles for water resource monitoring. In: 7th International Conference on Machine Learning Technologies (ICMLT 2022), ACM
27. Ten Kathen MJ, Reina DG, Flores IJ (2021) A comparison of PSO-based informative path planners for detecting pollution peaks of the Ypacarai Lake with autonomous surface vehicles. In: International Conference on Optimization and Learning OLA'2022
28. Theile M, Bayerlein H, Nai R, Gesbert D, Caccamo M (2020) UAV coverage path planning under varying power constraints using deep reinforcement learning. In: 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, pp 1444–1449
29. Viseras A, Garcia R (2019) DeepIG: Multi-robot information gathering with deep reinforcement learning. IEEE Robotics and Automation Letters 4(3):3059–3066, DOI 10.1109/LRA.2019.2924839
30. Viseras A, Meißner M, Marchal J (2021) Wildfire front monitoring with multiple UAVs using deep Q-learning. IEEE Access PP:1–1, DOI 10.1109/ACCESS.2021.3055651
31. Wang Z, de Freitas N, Lanctot M (2015) Dueling network architectures for deep reinforcement learning. CoRR abs/1511.06581, URL http://arxiv.org/abs/1511.06581
32. Woo J, Kim N (2020) Collision avoidance for an unmanned surface vehicle using deep reinforcement learning. Ocean Engineering 199:107001, DOI 10.1016/j.oceaneng.2020.107001
33. Yanes S, Reina DG, Toral Marín SL (2020) A deep reinforcement learning approach for the patrolling problem of water resources through autonomous surface vehicles: The Ypacarai Lake case. IEEE Access 6(1):1–1, DOI 10.1109/ACCESS.2020.3036938
34. Yanes S, Reina DG, Marín SLT (2021) A multiagent deep reinforcement learning approach for path planning in autonomous surface vehicles: The Ypacaraí Lake patrolling case. IEEE Access 9:17084–17099
35. Yanes Luis S, Gutiérrez-Reina D, Toral Marin S (2021) A dimensional comparison between evolutionary algorithm and deep reinforcement learning methodologies for autonomous surface vehicles with water quality sensors. Sensors 21(8), DOI 10.3390/s21082862, URL https://www.mdpi.com/1424-8220/21/8/2862
36. Yanes Luis S, Peralta F, Tapia Córdoba A, Rodríguez del Nozal Á, Toral Marín S, Gutiérrez Reina D (2022) An evolutionary multi-objective path planning of a fleet of ASVs for patrolling water resources. Engineering Applications of Artificial Intelligence 112:104852, DOI 10.1016/j.engappai.2022.104852
37. Zhang Q, Lin J, Sha Q, He B, Li G (2020) Deep interactive reinforcement learning for path following of autonomous underwater vehicle. CoRR abs/2001.03359, URL https://arxiv.org/abs/2001.03359