Environmental Modelling and Software 175 (2024) 105983
Available online 14 February 2024
1364-8152/© 2024 Elsevier Ltd. All rights reserved.
Enhanced watershed model evaluation incorporating hydrologic signatures and consistency within efficient surrogate multi-objective optimization
Wei Xia a,b,c,*, Taimoor Akhtar d, Wei Lu e, Christine A. Shoemaker a,b,c
a Department of Civil and Environmental Engineering, National University of Singapore, 117576, Singapore
b Department of Industrial Systems Engineering and Management, National University of Singapore, 117576, Singapore
c Energy and Environmental Sustainability for Megacities (E2S2) Phase II, Campus for Research Excellence and Technological Enterprise (CREATE), 138602, Singapore
d RWDI Consulting Engineers and Scientists, N1G 4P6, ON, Canada
e Ant Group, 310000, China
ARTICLE INFO
Keywords:
Surrogate model
Multi-objective optimization
Watershed model
Automatic calibration
Hydrology signature
ABSTRACT
This paper presents a new framework for calibrating computationally expensive watershed models with multi-objective optimization methods and hydrological consistency analysis. The analysis evaluates different algorithms' efficiencies for finding watershed model calibration solutions within a limited budget. Two surrogate multi-objective algorithms, GOMORS and ParEGO, are compared to five evolutionary algorithms without surrogates on two watershed models. We test the algorithms' performance with two multi-objective formulations (i.e., threshold-based flow separation and decomposition of the Nash-Sutcliffe Efficiency (NSE)). Results indicate that the surrogate-based GOMORS is the most computationally efficient overall. We also propose a framework to select among the calibration solutions obtained from multi-objective optimization using different hydrologic signatures. GOMORS is assessed for its ability to identify hydrologically acceptable calibrations. The decomposition of NSE is the most effective calibration formulation in terms of hydrologic consistency analysis. In addition, hydrologic signatures could be used effectively to filter non-dominated solutions obtained from multi-objective optimization.
1. Introduction
1.1. Background and motivation
Analyzing trade-offs between conicting objectives for parameter
estimation of watershed models through Multi-Objective (MO) optimi-
zation can identify multiple plausible calibration alternatives that pro-
vide valuable information during the parameter estimation process to
enhance understanding of model adequacy, uncertainty and structural &
data deciencies (Kollat et al., 2012).
However, there are numerous challenges to model calibration via MO optimization. For instance, a major reason multi-objective optimization is not more widely used is that it usually requires many more model simulations (especially if many objectives are included in an optimization formulation) than single objective optimization of parameter goodness of fit, which is difficult for complex simulation models that are computationally expensive. Moreover, the calibration perspective typically includes many potentially conflicting objectives, due to which identification of a suitable MO calibration formulation, where the number of objectives is limited, is a non-trivial task.
Hence, in order to make MO calibration more accessible for modelers who are dealing with distributed and expensive watershed models, it is imperative to i) identify algorithms that are suitable for calibration on a limited budget and ii) identify combinations of calibration objectives that can adequately represent the many conflicting hydrologic model calibration targets within a limited number of objectives (in order to ensure that calibration optimization is not overly complex or computationally difficult). In this study, calibration with a limited budget refers to cases where the number of allowed evaluations for the calibration is relatively small and is constrained by factors including 1) the computing resources (both in computing speed and time) that the calibration can use, 2) the wall-clock time that the person conducting the calibration is willing to allocate or wait for the optimization, and 3) the computing time of each model evaluation.
Hydrologic Signatures (HS) are important streamow characteristics
of the natural ow regimes, such as timing and magnitude of extreme
ows (Shai and Tolson, 2015), that are critical for adequate evaluation
of hydrologic models. Recently, HS were utilized in many studies as
supplements (Chilkoti et al., 2018) or alternatives (Sahraei et al., 2020)
to statistical Goodness of Fit (GOF) measures for investigating whether
streamow components are properly simulated in watershed models.
However, there are many hydrologic signatures (e.g., in our study, about
9 hydrologic signatures are considered), hence it is challenging to use so
many HS directly as the sub-objective functions of MO calibration. It is
thus also important to understand how HS can be incorporated into
calibration frameworks for expensive MO where simulation evaluation
budgets are limited.
1.2. Literature review
Watershed simulation models typically involve i) physical parameters that are difficult to measure directly, and/or ii) conceptual parameters that are impossible to measure. Conceptual parameters stem from considerable simplifications employed in modeling of the natural processes within the watershed. Model calibration is a process that can be employed to adjust values of such parameters, where the aim is to mimic reality, via comparison of model response to historically observed measurements. Traditional calibration methodologies typically involve manual trial-and-error, with expert opinion of a hydrologist within an interactive calibration framework. The value of expert opinion cannot be disregarded; however, manual trial-and-error methods can be extremely time consuming and complicated.
A bulk of prior and contemporary research has focused on automatic
calibration schemes involving single objective optimization (e.g., Boyle
et al., 2000; Madsen et al., 2002; Tolson and Shoemaker, 2007b; Vrugt
et al., 2003). The simulation-optimization problem for a complex
watershed model formulated within the automatic calibration frame-
work is typically non-linear and probably has multiple local minima (i.
e., is multi-modal) (Tang et al., 2006), and various studies have
employed heuristic search techniques seeking the globally optimal
solution.
Watershed calibration based on a single aggregated calibration performance metric could lead to a significant loss of information within the calibration process. The calibration process could be highly sensitive to various factors, especially the proposed objective function used to assess the adequacy of a set of parameter values. Moreover, calibration experts typically consider many criteria in the model calibration process, most of which are not even included in the objective function within an automatic calibration scheme (Shafii and Tolson, 2015).
Prior research efforts (Bekele and Nicklow, 2007; Gupta et al., 1998; Yapo et al., 1998) indicate that significant conflicts and trade-offs might exist between calibration criteria, and that visualization of these trade-offs can assist decision makers in understanding model limits and choosing appropriate calibrations. Numerous performance measures can be employed to quantify the potentially conflicting calibration criteria. Statistical Goodness of Fit (GOF) metrics (e.g., Nash-Sutcliffe Efficiency (NSE) (Nash and Sutcliffe, 1970), bias (e.g. mean absolute error)) that are calculated from streamflow time series and also performance metrics (e.g., relative deviation) calculated from hydrological signatures (HS) (Hingray et al., 2010) are some of the measures that can be used as objective functions in multi-objective optimization formulations.
Many studies have focused on employing multi-objective optimization for watershed model calibration, highlighting the effectiveness of multi-objective analysis in deducing various Pareto optimal calibrations (Gupta et al., 2003; Madsen, 2003; Yapo et al., 1998), providing added insight into model ambiguity resulting from model imperfections and parameter uncertainty and understanding modeling limitations. Pareto optimal calibrations are sets of solutions found by multi-objective optimization that are non-dominated with respect to each other but are superior to the rest of the solutions in the search space. A solution is non-dominated when no other solution found in the multi-objective optimization search is better than it in terms of all objectives (Akhtar and Shoemaker, 2016). Ahmadi et al. (2014) apply multi-objective optimization for calibration of a SWAT model and conclude that MO optimization is more effective than single objective optimization. Wu et al. (2021) use a sequential multicriteria algorithm to iteratively adjust parameter ranges for hydrologic model calibration and uncertainty analysis. They also highlight that the computational requirements of calibrating distributed and semi-distributed hydrologic models are typically very high.
Numerous research contributions have been made in proposing al-
gorithms for multi-objective optimization, within the simulation-
optimization framework (e.g., Coello et al., 2007). Various algorithmic
contributions, within the water resources community have also been
made (Asadzadeh et al., 2014; Maier et al., 2014; Nicklow et al., 2010;
Sahraei et al., 2019; Tang et al., 2006, 2007), with many focusing on
evolutionary strategies within their optimization frameworks.
Multi-objective evolutionary strategies are frequently referred to as MOEAs in contemporary literature. Tang et al. (2006) provide a
comparative analysis of various evolutionary algorithms, in order to
assess their effectiveness in hydrological model calibration.
While prior research and current industrial calibration techniques
indicate the inherent multi-criteria nature of the hydrological model
calibration problem and emphasize the advantages of multi-objective
optimization in model calibration, the computational complexity of
distributed hydrological models poses a huge challenge to the use of
multi-criteria optimization algorithms in the calibration process. It
should be noted that the calibration optimization problem is a compu-
tationally expensive simulation optimization problem, since the objec-
tive function(s) are evaluated via simulations and running watershed
model simulations for each set of parameters considered can take a
signicant length of time.
There is a dearth in prior literature on the effective and efcient use
of MO algorithms for hydrologic model calibration for expensive prob-
lems that have a limited budget of simulation evaluations. Here, by
expensive problems, we refer to optimization problems where each
evaluation is resource-intensive in terms of computing resources and
time to complete the entire optimization process. Identifying algorithms
that are effective with a limited number of model evaluations provides
tools that can both be used for large watersheds and for watershed
models with a lot of spatial detail (both of which are more computa-
tionally expensive and hence the number of model evaluations will be
limited). Moreover, such algorithms are also effective in multi-step/
sequential calibration scenarios, i.e., scenarios where calibration ex-
perts apply automatic algorithms for model calibration in multiple it-
erations, with changes made in parameter choice and range in each
iteration (Xia et al., 2022a; Wu et al., 2021; Franco et al., 2020; Zamani
et al., 2020). Thus, assessment and identication of suitable MO algo-
rithms for expensive watershed problems and limited evaluation bud-
gets is important.
The use of surrogate models within an optimization algorithm can be highly effective in reducing time for computing objectives for multi-objective calibration of complex watershed problems. The terms "surrogate", "response surface" and "meta model" are all used to describe the use of existing information to build a multivariate approximation of the objective function or model simulation, which then guides the optimization search. Surrogate based single-objective optimization algorithms have been widely used in calibration applications in many areas. Most surrogate methods used in these optimization applications include: a) radial basis function (RBF) based methods (Müller et al., 2013; Regis & Shoemaker, 2007, 2013; Wild et al., 2008; Xia and Shoemaker, 2021; Xia et al., 2021), b) Kriging based methods (Gong and Duan, 2017; Jones et al., 1998), and c) artificial neural network (ANN) (Zou et al., 2007) based methods. Popular surrogate optimization applications in water resources include single-objective watershed model calibration (Regis and Shoemaker, 2013), groundwater model calibration (Mugunthan and Shoemaker, 2006), lake water quality and hydrodynamic model calibration (Xia & Shoemaker, 2022a, 2022b; Xia et al., 2022; Xia et al., 2021), earth system model calibration (Lu et al.,
2018; Cheng et al., 2023) and carbon sequestration model calibration
(Espinet and Shoemaker, 2013). Razavi et al. (2012) provided a
comprehensive review of literature on the use of surrogates in water
resources.
Surrogates have also been used in application of multi-objective optimization to complex water resources problems. Baú and Mayer (2006) employ kriging based surrogates within a multi-objective framework for optimal design of groundwater (pump-and-treat) remediation systems. Behzadian et al. (2009) combine a MOEA with an ANN to efficiently deduce optimal sampling locations of pressure loggers for a water distribution system. di Pierro et al. (2009) explore the use of surrogate based MO algorithms including PAREGO with application to water distribution network design. Castelletti et al. (2010) incorporate the use of numerous surrogate methods for efficient multi-objective optimization with application to water quality planning in reservoirs and lakes. Lu et al. (2019) combined polynomial regression models with an iterative algorithm for supporting storage pond design in an urban drainage system. However, the effective use of surrogate-based methods and other efficient MO algorithms for expensive watershed model calibration is relatively unexplored. For instance, there are no prior studies that adequately analyze and compare MO algorithms on a limited model evaluation budget.
In addition to the computational challenges associated with using MO for parameter calibration, the application of MO will yield a set of non-dominated solutions. These solutions are multi-objectively equivalent (i.e., without additional information, it is impossible to designate any of the non-dominated solutions as superior to other non-dominated solutions). Note that this multi-objective equivalence in non-dominated solutions is somewhat different from the well-known concept of equifinality (Beven and Binley, 1992), wherein multiple sets of parameters provide equally "good" or acceptable model performance. The non-dominated solutions achieved in MO represent solutions that are good in one or multiple ways in which the best fit of a model to the data can be defined. The equifinal and non-dominated sets of parameters might overlap but may not be equivalent, as also discussed in Gupta et al. (1998).
Identifying a set of calibration parameters rather than a single best parameter set (as is the case in MO calibration) means that the parameter uncertainty among the non-dominated solutions will be propagated to the simulation output, causing predictive uncertainty. Multiple methodologies have been proposed to address this limitation within the context of predictive uncertainty, for example the informal Bayesian generalized likelihood uncertainty estimation (GLUE) methodology (Beven and Binley, 1992) and various Bayesian Markov chain Monte Carlo (MCMC) methods (e.g., Kuczera and Parent, 1998; Vrugt et al., 2003).
The GLUE method involves generating a large ensemble of feasible parameter sets, commonly based on uniform random sampling, and then assessing the model performance for each set against observed data with a likelihood function. A subjective threshold on model performance is applied to select parameter sets with good model performance, which are known as "behavioral" sets. The model output uncertainty is represented by the likelihood-weighted output across the behavioral sets. Formal Bayesian approaches, in contrast, require the definition of a formal likelihood function, and the model output uncertainty is obtained by evaluating model output for a set of parameters sampled from the posterior parameter distributions. Both GLUE and formal Bayesian approaches often involve a large number of evaluations, with studies in the literature commonly reporting from 60,000 to 100,000 model evaluations. Consequently, GLUE and formal Bayesian approaches are computationally infeasible for problems that are computationally expensive.
The uncertainty approximation from non-dominated solutions does not apply a subjective threshold like GLUE but rather uses the non-dominated solutions obtained by MO. Additionally, no probability is applied when aggregating the model output across the multiple simulation outputs in the final set of non-dominated solutions. However, one issue related to utilizing non-dominated solutions for predictive uncertainty estimation lies in the presence of solutions that may exhibit poor performance in one or a few sub-objectives. These solutions, despite their suboptimal performance in specific aspects, might be included in the non-dominated set due to their superior performance in other sub-objectives. In addition, too many non-dominated solutions in the final solution set will enlarge the model predictive uncertainty. Consequently, it becomes crucial to implement a selection process among these non-dominated solutions to ensure that the selected solutions exhibit high-quality behavior.
There is limited knowledge regarding the best practices for selecting non-dominated solutions in MO calibration. Additional criteria (e.g., Hydrological Signature metrics (Hingray et al., 2010)) beyond the performance metrics used in MO calibration could potentially be used to screen the non-dominated solutions. However, no studies have explored this possibility. Moreover, there is limited knowledge available on how to effectively incorporate HS into model calibration when models are expensive and simulation budgets are limited (e.g., the number of evaluations allowed for the whole optimization search process is less than 1000). For instance, Sahraei et al. (2020) and Shafii and Tolson (2015) investigate different calibration formulations that use HS-based and statistical GOF based objective functions but with budgets in excess of 5000 simulation evaluations (which would take more than 7 days in wall-clock time for solving the MO calibration problem in our case). In addition, using HS directly as objective functions would result in many objectives, which would make the MO calibration too expensive to solve and would result in a larger number of non-dominated solutions. Moreover, the MO algorithms analyzed in this study are not designed for "many-objective" (i.e., 4 or more objective) problems.
1.3. Research contributions
The primary contribution of this study is a suggested framework for multi-objective (MO) hydrologic model calibration analysis on a limited budget of simulation evaluations (defined to be below 600 evaluations in this study, which is around 1 day in computing time for our problem). The efficiency of 7 different surrogate and non-surrogate optimization methods is comprehensively compared on two watershed calibration problems with available long-term observation data sets and using two different MO formulations. We not only compare the solutions among different optimization methods but also assess the solutions from MO calibration with a limited budget (below 600 evaluations) against solutions with a sufficient budget (over 100,000 evaluations) in order to investigate if it is feasible to solve the MO calibration with a limited budget.
We eventually propose an efcient multi-objective calibration
framework to evaluate and lter the non-dominated solutions from MO
calibrations. The framework incorporates hydrologic signatures into the
calibration framework by using pre-dened thresholds of acceptability
on various hydrologic signatures. The framework allows the evaluation
of the solution quality of MO calibration and the selection of high-
quality subset solutions from multiple non-dominated solutions. We
adopt the hydrologic consistency denition and hydrologic fre-
quencyplots from Shai and Tolson (2015) to assess the overall solu-
tions quality from a MO calibration search and propose a new hydrologic
consistency frequency heatmap to assess the calibration quality in terms
of each hydrologic criteria. Our framework enables modellers to un-
derstand the quality of MO calibration and gives insights on model
setup, parameter sensitivity & output uncertainty. Moreover, the
framework we proposed is general and could be used for other water-
shed model calibration problems, and also in sequential calibration
frameworks where model parameters/structure are adjusted in each
calibration iteration (Xia and Shoemaker, 2022a).
2. Calibration assessment framework
The multi-objective optimal calibration problem discussed in this
study can be formulated as a constrained optimization problem:
min_{θ ∈ Ω} F(θ) = [f_1(θ), …, f_k(θ)]^T   (1)
where θ = [θ_1, …, θ_n] ∈ Ω is the vector of n model parameters to be calibrated, Ω is the domain of the solution space defined by the lower and upper bounds of the model parameters (i.e., θ_min and θ_max), and F(θ) is the vector of k calibration objectives, which can be subjectively defined by a calibration expert. In order to evaluate the objective F(θ) for a candidate decision vector θ, a computationally expensive run of the watershed simulation model is to be performed.
Hence, a desirable property of an optimization methodology is the ability to produce good solutions within a limited budget of simulation evaluations. The budget on simulation evaluations depends on the computation time of a model and the total time available for the calibration process. The goal of MO optimization is to find the set of Pareto optimal solutions, which is the set of all points y in the domain Ω for which there is no other point x in Ω such that f_j(x) ≤ f_j(y) for all j from 1 to k (with strict inequality for at least one j). In MO optimization, the Pareto set is approximated by a set of non-dominated solutions obtained after a fixed number of evaluations of F(θ).
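To make the non-dominated filtering concrete, the following minimal sketch (assuming all k objectives are to be minimized, as in Equation (1)) keeps the mutually non-dominated objective vectors from a set of evaluated solutions; it is an illustration, not the implementation used by any of the algorithms compared in this study.

```python
import numpy as np

def dominates(fa, fb):
    """True if objective vector fa dominates fb (all objectives minimized)."""
    return bool(np.all(fa <= fb) and np.any(fa < fb))

def nondominated(F):
    """Return indices of the non-dominated rows of F (n_solutions x k objectives)."""
    n = F.shape[0]
    keep = []
    for i in range(n):
        if not any(dominates(F[j], F[i]) for j in range(n) if j != i):
            keep.append(i)
    return keep

# Example: 3 evaluated solutions with 2 objectives each (to be minimized)
F = np.array([[0.2, 0.9],
              [0.5, 0.5],
              [0.6, 0.6]])   # the last row is dominated by [0.5, 0.5]
print(nondominated(F))        # -> [0, 1]
```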
Multi-objective model calibration has two critical components, namely 1) selection of appropriate objectives for formulation of the MO hydrologic calibration, and 2) utilization of an effective and efficient algorithm for optimization. Section 2.1 discusses the formulations used in this study. Subsequently, Section 2.2 introduces the model calibration case studies used in the study, Section 2.3 briefly describes the algorithms used in the calibration assessment framework, Section 2.4 explains the assessment metrics used in our study, and Section 2.5 describes the approach we used to assess the model parameter and output uncertainty of the final calibration solutions. The entire calibration assessment framework is designed to identify formulations and algorithms that are efficient and effective for calibration of expensive hydrologic/watershed problems within a limited number of model evaluations.
2.1. Calibration formulations
This study utilizes goodness-of-t (GOF) measures as objectives to
create numerous multi-objective calibration formulations. For each
watershed case study tested, all calibration formulations utilized in this
study differ in the choice of objectives used only, whereas choice of
parameters and their respective ranges remain same across all formu-
lations. The GOF measures utilized in the study include Nash-Sutcliffe
Efciency (NSE). Two different multi-objective ow calibration formu-
lations are used in this study (as described below). The rst one con-
siders the high ow separately from the mixture of moderate and low
ow. The second one considers different components of the decomposed
NSE as objective functions.
2.1.1. Formulation 1 - threshold based ow separation (2-objective)
This formulation considers the trade-off between high ow calibra-
tion and moderate/low ow calibration, by utilizing the NSE GOF
measure. The relative importance of different hydrological processes
varies between high ow and low ow situations (Kollat et al., 2012) so
a given set of model parameters might do relatively well at matching
data under high ow conditions and relatively poorly under low ow
conditions (or vice versa). Hence it is reasonable to consider the goals of
getting adequate ts to the data under low and high ow conditions as
two different objectives for which there will be a trade-off.
f_1(θ) = −NSE_HF(θ)
f_2(θ) = −NSE_LF(θ),  where  NSE(θ) = 1 − Σ_{i=1}^{n} (y_obs,i − y_sim,i(θ))² / Σ_{i=1}^{n} (y_obs,i − μ_obs)²   (2)
Equation (2) describes the objectives used in our rst formulation,
where NSEHF(θ)is the NSE of high ows (given parameter vector θ) and
NSELF(θ)is the NSE of low ows. In the above equation, yobs,i and ysim,i(θ)
are the measured and simulated ows (given calibration parameter set
θ) on day i, respectively;
μ
obs is the estimated mean values of measured
ows; and n is the number of simulated days. In all our experiments,
high ows are dened as all ows above the 95-th percentile observed
ow when observations are sorted in ascending order of ow magni-
tude. Low ows are all ows below this threshold so moderate ows are
included in the lowows. When calculating the NSE value, the time
series for high (or low) ow is derived by extracting the corresponding
time steps from both measured and simulated ows, where the
measured ow magnitudes are classied as high (or low) ow. As a
result, the measured and simulated high (or low) ow time series are of
the same length.
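A minimal sketch of how the two objectives of Formulation 1 could be computed from paired daily observed and simulated series is given below. The 95th-percentile split follows the description above; the negation of NSE (so that both objectives are minimized, consistent with Equation (1)) and the function names are illustrative assumptions rather than the authors' code.

```python
import numpy as np

def nse(obs, sim):
    """Nash-Sutcliffe Efficiency of simulated vs. observed flows."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return 1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2)

def formulation1_objectives(obs, sim, pct=95.0):
    """Threshold-based flow separation: f1 on high flows, f2 on low/moderate flows."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    threshold = np.percentile(obs, pct)      # high flows: observed flow above the 95th percentile
    high = obs > threshold                   # time steps classified by the *measured* flow magnitude
    f1 = -nse(obs[high], sim[high])          # negated so the MO problem minimizes both objectives
    f2 = -nse(obs[~high], sim[~high])
    return f1, f2
```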
2.1.2. Formulation 2 - decomposition of NSE of ow (3-objective)
Gupta et al. (2009) highlight that the NSE criterion can be decom-
posed into three components a) linear correlation (represented by f1(θ)
in Equation (3)), b) relative variability (represented by f2(θ)in Equation
(3)), and c) relative bias (represented by f3(θ)in Equation (3)). Each
component focuses on calibrating a different and potentially conicting
aspect of ow. The relative bias component of the criterion tends to
minimize volume balance errors; the relative variability tends to mimic
the ashiness of the hydrograph, inherently focusing on capturing
extreme ows; while the correlation criterion, in combination with
relative variability tends to capture the shape of the hydrograph. This
MO formulation utilizes these three components as independent objec-
tives, as dened in Equation (3) below:
f_1(θ) = [r − 1]²,  where  r = [Σ_{i=1}^{n} y_obs,i · y_sim,i(θ) − n·μ_obs·μ_sim(θ)] / [(n − 1)·σ_obs·σ_sim(θ)]
f_2(θ) = [DDS(θ)]²,  where  DDS(θ) = (σ_obs − σ_sim(θ)) / σ_obs
f_3(θ) = [DDM(θ)]²,  where  DDM(θ) = (μ_obs − μ_sim(θ)) / μ_obs   (3)
where μ_obs and μ_sim(θ) are the estimated means of observed and simulated (given parameter set θ) flows, respectively, and σ_obs and σ_sim(θ) are the estimated standard deviations of measured and simulated flows (given parameter set θ), respectively. All other symbols in Equation (3) are defined in Section 2.1.1 and in Table A2.
Formulation 2 is an objective function set that includes both GOF
measures and hydrologic signatures, since correlation r is purely a GOF
objective, whereas squared relative deviation between observed and
simulated mean (relative bias) and observed and simulated standard
deviation (relative variability) are HS-based objectives. Discharge mean
and standard deviation are key hydrologic signatures that are also
included in the signatures used to compute hydrologic consistency later
in this study (see Table A2 and Section 2.4.2).
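Similarly, the three Formulation 2 objectives of Equation (3) can be computed directly from the observed and simulated series, for example as in the sketch below (using sample means and standard deviations); this is an illustrative implementation, not the one used in the study.

```python
import numpy as np

def formulation2_objectives(obs, sim):
    """NSE decomposition (Gupta et al., 2009): correlation, relative variability, relative bias."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    r = np.corrcoef(obs, sim)[0, 1]                                 # linear correlation
    dds = (obs.std(ddof=1) - sim.std(ddof=1)) / obs.std(ddof=1)     # relative variability deviation
    ddm = (obs.mean() - sim.mean()) / obs.mean()                    # relative bias
    return (r - 1.0) ** 2, dds ** 2, ddm ** 2                       # f1, f2, f3 of Equation (3)
```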
2.2. Case studies
The two Watershed Model case studies used in our MO calibration
framework are derived from the Cannonsville Watershed modeling case
study (Tolson and Shoemaker, 2007a). The Soil and Water Assessment
Tool (SWAT) is used for model development (Arnold et al., 2012). SWAT
is a widely used, physically based, deterministic and semi-distributed
watershed modeling tool (Abbaspour et al., 2004; McDonald et al.,
2019; Wang et al., 2019). A more detailed introduction to SWAT is provided in Supporting Information, Section S1.
2.2.1. Case study I: Cannonsville Watershed
Tolson and Shoemaker (2007b) introduce two scaled variations of the Cannonsville SWAT model, as flow calibration case studies. Case Study I incorporates a computationally expensive calibration model, which constitutes 43 subbasins and predicts flow within the 1178 km² Cannonsville Watershed. The delineation of subbasins was performed using a high-resolution (25 m) digital elevation map obtained from the New York City Department of Environmental Protection (NYCDEP), coupled with stream network definition from US Census TIGER files. Land use information at a 25 m grid resolution was obtained from the NYCDEP and derived from thematic mapper satellite imagery. Soil property inputs were extracted from the State Soils Geographic Database (1:250,000) using an area-weighted averaging approach for each map unit. The climate data inputs, encompassing minimum and maximum temperature, precipitation, solar radiation, and relative humidity, were derived from meticulously measured data. The temporal resolution of the model is daily, and a single simulation run spans approximately 1 min for a 10-year simulation period. For an in-depth understanding of the Cannonsville SWAT model, refer to the comprehensive description provided by Tolson and Shoemaker (2007a). The model calibration exercise is focused on the United States Geological Survey (USGS) Walton flow monitoring location, which drains up to 860 km² of the watershed. Tolson and Shoemaker (2007b) identify 15 parameters (defined in Appendix, Table A1) that are to be calibrated for flow prediction in Case Study I. The values of these 15 parameters are constant for all subbasins and do not vary spatially, following the same setting as in the study by Tolson and Shoemaker (2007b). Case Study I employs a nine-year time period for daily flow calibration at the Walton flow monitoring station.
2.2.2. Case study II: Townbrook Watershed
Townbrook is a sub-watershed within the Cannonsville Watershed, covering an area of around 37 km². Case Study II is derived from the single subbasin Townbrook SWAT model developed by Tolson and Shoemaker (2007b). The model inputs and time step setup of the Townbrook SWAT model are the same as for the Cannonsville SWAT model as described in the section above. The Townbrook SWAT model is a relatively inexpensive model (one simulation run takes up to 10 s for a 10-year modeling period), and hence can be used extensively for algorithm comparison. The model predicts flow within the Townbrook watershed and is employed as a flow calibration case study. The Townbrook sub-watershed is monitored for flow by the United States Geological Survey (USGS Station 01421618), and Case Study II has a 10-year time period for daily flow calibration of 15 parameters of the Townbrook SWAT model.
Given two calibration formulations, and two case studies, we
developed 4 watershed calibration test case studies, as our test suite for
comparative algorithm analysis in the MO framework. The nomencla-
ture used for the test case studies, along with a brief overview of the
problems, is provided for reference in Table 1. FW is an abbreviation for
Full-Watershed, which is a reference to the Cannonsville case study
(Case Study I). SW is an abbreviation for Sub-Watershed, which is a
reference to the Townbrook case study (Case Study II).
2.3. Optimization algorithms
Within the water resources literature, several recent studies have shown that in general, watershed model calibration problems can be highly multi-modal, and existence of false non-dominated Pareto fronts (e.g., locally optimal fronts) can pose a significant threat to the accuracy of the optimization process (Kollat et al., 2012; Reed et al., 2012). Hadka and Reed (2013) state that multi-modality is a severe challenge for most multi-objective evolutionary algorithms (MOEAs). In our analysis, we aim to compare optimization algorithms with varying search capabilities to understand their capabilities in tackling the numerous optimization challenges of watershed calibration within a very limited simulation evaluation budget.
The optimization literature contains various multi-objective optimization algorithms, specifically for tackling simulation optimization problems. These search-based meta-heuristics are primarily multi-objective Evolutionary Algorithms (MOEAs). Coello et al. (2007) suggest that use of MOEAs can be highly beneficial, since the population-based structure of an evolutionary algorithm can be exploited to simultaneously achieve the two goals of i) converging to the Pareto front, and ii) maintaining a diverse set of trade-off solutions.
Various search methodologies within evolutionary optimization can be employed to tackle these two-fold aims of multi-objective optimization. Local search, decomposition of the objective vector into multiple single objective optimization problems, and employing multi-method search are some of these search methodologies. The five multi-objective evolutionary algorithms (MOEAs) used for comparison in this study, i.e., NSGA-II (Deb et al., 2002), MOEA/D (Zhang and Li, 2007), AMALGAM (Vrugt and Robinson, 2007), BORG (Hadka and Reed, 2013) and ε-MOEA (Laumanns et al., 2002), employ different and unique search mechanisms for optimization, and have been used in numerous watershed model calibration applications in the past (Ahmadi et al., 2014; Chilkoti et al., 2018; Ercan and Goodall, 2016; Shafii and Tolson, 2015; Zhang et al., 2013). For instance, Ercan and Goodall (2016) introduced a generic software tool for using NSGA-II in multi-objective and multi-site calibration of SWAT models; Zhang et al. (2013) proposed PP-SWAT, a parallel multi-objective calibration tool designed for parallel and efficient calibration of SWAT models using AMALGAM; and Chilkoti et al. (2018) coupled SWAT with BORG for effective low-flow calibration of semi-distributed hydrologic models. PADDS (Asadzadeh and Tolson, 2013; Sahraei et al., 2019) is another (and non-evolutionary) algorithm that has performed well on MO water problems (Yang et al., 2017). We did not include PADDS in this analysis because Shafii and Tolson (2015) observe that performance of PADDS and AMALGAM is comparable for watershed calibration problems and we do compare to AMALGAM.
Since we are interested in efcient multi-objective calibration of
expensive watershed models, incorporation of surrogate assisted opti-
mization methodologies is important in this analysis. Surrogate assisted
search methods typically employ computationally inexpensive response
surface models (or surrogate models), within the iterative search
process, in order to efciently guide the search towards optimal solu-
tions. Hence, we compare two surrogate based algorithms, ParEGO
(Knowles, 2006) and GOMORS (Akhtar and Shoemaker, 2016), along
with the ve evolutionary algorithms discussed above, for analyzing
their relative effectiveness in MO calibration of expensive hydro-
logic/watershed models on a limited budget of a few hundred simulation
evaluations. In surrogate-assisted optimization methods such as ParEGO
and GOMORS, the surrogate model is constructed by utilizing evalua-
tions already explored by the optimization algorithm. The surrogate
takes the decision variables (e.g., calibration parameter values) as input
and produces the objective function value (e.g., performance metric
calculated using SWAT simulation output for a given set of calibration
parameter values) as output. The tted surrogate model serves as a
computationally inexpensive predictive tool for estimating the objective
function value for a given set of calibration parameter values. This
Table 1
The flow calibration test case suite employed in comparative algorithm analysis, with formulations defined in Section 2.1, case studies defined in Section 2.2, and algorithms introduced in Section 2.3. FW = Cannonsville Full-Watershed and SW = Townbrook Sub-Watershed.

Problem Name | Formulation | Equation No. | Case Study | Objectives | Applied Algorithm(s)
SW-2 | 1. Threshold | 2 | II/SW | 2 | All
SW-3 | 2. NSE-Decom | 3 | II/SW | 3 | All
FW-2 | 1. Threshold | 2 | I/FW | 2 | All
FW-3 | 2. NSE-Decom | 3 | I/FW | 3 | All
predictive capability is then employed to strategically sample evaluation points within the search space, contributing to an enhanced and efficient optimization process.
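As a rough illustration of this surrogate step (fit an inexpensive model to the parameter sets already evaluated, then use its cheap predictions to screen many candidate parameter sets), the sketch below uses SciPy's RBFInterpolator with synthetic data; it is not the RBF implementation inside GOMORS nor the Kriging model inside ParEGO.

```python
import numpy as np
from scipy.interpolate import RBFInterpolator

rng = np.random.default_rng(0)
# X: parameter sets already evaluated by the expensive simulation (n_evals x n_params, scaled to [0, 1])
X = rng.uniform(0.0, 1.0, size=(60, 15))
# Y: corresponding objective values, one column per objective (stand-in values for illustration only)
Y = np.column_stack([np.sum((X - 0.3) ** 2, axis=1),
                     np.sum((X - 0.7) ** 2, axis=1)])

surrogate = RBFInterpolator(X, Y, kernel="cubic")   # RBF model fitted to the past evaluations

# Cheaply predict objectives for many candidate parameter sets and keep the most promising ones
candidates = rng.uniform(0.0, 1.0, size=(5000, 15))
pred = surrogate(candidates)                        # (5000 x 2) predicted objective values
best = candidates[np.argsort(pred[:, 0])[:5]]       # e.g., candidates predicted best on objective 1
```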
ParEGO (Knowles, 2006) uses a Kriging-based surrogate surface for multi-objective optimization, and it is specifically designed for applications involving a very limited evaluation budget. GOMORS (Akhtar and Shoemaker, 2016) is another iterative scheme, which employs Radial Basis Functions (RBF) as a surrogate model to guide multi-objective search towards the optimal set of solutions. Both the Kriging-based surrogate surface and the RBF surrogate are versatile data-driven models that can effectively capture diverse regression relationships between decision variables and objective function values. Their general-purpose nature enables direct application to a range of problems, extending beyond the specific model calibration issues encountered in this study, such as those involving SWAT. GOMORS optimizes RBF surrogates (one surrogate is fitted for each objective) with an MOEA search on the surrogate in each iteration, in order to improve algorithm efficiency. Akhtar and Shoemaker (2016) report that GOMORS outperforms ParEGO and NSGA-II on test problems and on a hypothetical groundwater remediation design problem with a limited evaluation budget (400 evaluations). GOMORS is implemented within pySOT, a Python-based surrogate optimization toolbox designed for implementing single and multi-objective surrogate algorithms (Eriksson et al., 2019).
2.4. Assessment metrics
Calibration solutions obtained from different algorithms and formulations require careful assessment and selection using MO performance metrics. There are numerous MO performance metrics that measure an MO solution with a single number. Coello et al. (2007) provide a comprehensive list of MO performance metrics.
However, from a hydrologic perspective, assessment of calibration solutions may benefit from considerations of various measures including GOF and HS, which may be included in post-optimization calibration assessment (Shafii and Tolson, 2015). Thus, the calibration assessment framework of this study uses both traditional and hydrological metrics for assessment of algorithms and formulations on limited evaluation budgets.
2.4.1. Hypervolume
Hypervolume (Auger et al., 2009) is the traditional MO performance assessment metric used in the analysis of algorithms in this study. Hypervolume incorporates both convergence to the ideal front as well as diversity of solutions on the front and is defined as the total feasible objective space (bounded by reference points) dominated by the estimate of the Pareto front obtained by an algorithm (see Supporting Information Section S2 and Fig. S1 for further illustration). This study uses a normalized version of hypervolume, where the hypervolume dominated by an algorithm solution is divided by the hypervolume dominated by the true Pareto front of the problem. Thus, hypervolume coverage, as defined in this study, is the proportion of feasible objective space dominated by an algorithm (see Supporting Information Section S3 and Fig. S2 for further illustration). A higher value of hypervolume coverage indicates a better solution, and the ideal value is one. Also, since the true Pareto front of an optimization problem is unknown, we use the fitness values of the best solutions from all algorithms and all trials to develop an estimate of the Pareto front for that optimization problem.
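For a two-objective minimization problem, the hypervolume dominated by a non-dominated set relative to a reference point can be computed by summing rectangles, and hypervolume coverage is its ratio to the hypervolume of the reference (estimated Pareto) front; the sketch below uses placeholder fronts and a placeholder reference point, not values from this study.

```python
import numpy as np

def hypervolume_2d(front, ref):
    """Hypervolume dominated by a 2-objective non-dominated set w.r.t. reference point `ref` (minimization)."""
    pts = np.asarray(sorted(front, key=lambda p: p[0]), float)   # sort by first objective
    hv, prev_f2 = 0.0, ref[1]
    for f1, f2 in pts:
        hv += (ref[0] - f1) * (prev_f2 - f2)   # horizontal slab between this point and the previous one
        prev_f2 = f2
    return hv

ref = (1.0, 1.0)                                   # reference point bounding the feasible objective space
algo_front = [(0.2, 0.8), (0.4, 0.5), (0.7, 0.3)]  # non-dominated set from one algorithm (illustrative)
best_front = [(0.1, 0.7), (0.3, 0.4), (0.6, 0.2)]  # estimated Pareto front from all algorithms and trials
coverage = hypervolume_2d(algo_front, ref) / hypervolume_2d(best_front, ref)
print(coverage)                                    # closer to 1 is better
```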
2.4.2. Hydrologic consistency
We consider nine HS in this study's calibration performance analysis (defined in Appendix B) and use relative signature deviations to quantify HS-based performance (for each signature). Relative signature deviations have been used in other studies (Sahraei et al., 2020; Shafii and Tolson, 2015) to quantify signature-based performance and are described (as used in this study) as:
D_xx(θ) = (S_xx^obs − S_xx^sim(θ)) / S_xx^obs   (4)
In the above equation, D_xx(θ) is the relative deviation between the observed signature value S_xx^obs (computed using observed flows) and the simulated signature value S_xx^sim(θ) (computed using simulated flows and the parameter vector θ), where 'xx' is an abbreviation denoting the signature being computed (please refer to Table A2 for signature abbreviations and definitions).
Shai and Tolson (2015) introduced the term hydrologic consistency
when analyzing calibration formulations. Lets assume that calibration
of a hydrologic model requires consideration of N criteria where an
acceptability threshold is dened for each criterion (a criterion could be
for example a GOF measure or an HS deviation). Examples of accept-
ability threshold may include: 1) Kling-Gupta Efciency (KGE) >x and
2) D
xx
(θ) <y, where x and y are user dened thresholds and KGE is a
statistical metric commonly used for the evaluation of hydrological
model performance (Kling et al., 2012). It combines three components:
correlation (Pearsons correlation coefcient), bias (the ratio of the
mean simulated to observed values), and variability (the ratio of the
standard deviation of simulated to observed values). D
xx
(θ) is the rela-
tive deviation between observed and simulated signature value dened
in Equation (4). According to Shai and Tolson (2015) hydrologic
consistency of a calibration solution is dened as the number of satised
criteria, n (out of N), i.e., the number of criteria that are within their
dened acceptability thresholds. This type of analysis of calibration
solutions is common when HS are considered in calibration (Martinez
and Gupta, 2010). Shai and Tolson (2015) also introduced frequency
plots of hydrologic consistency, to illustrate sampling efcacy of
different calibration algorithms and formulations. This study uses the
hydrologic consistency frequency (HCF) plot and HCF heatmap to
analyze the performance of algorithms and formulations on a limited
evaluation budget.
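The hydrologic consistency of a single calibration solution therefore reduces to a count of satisfied criteria. The sketch below combines a KGE computed from the three components listed above with the relative signature deviations of Equation (4); the thresholds shown (KGE > 0.5, |D_xx| < 25%) correspond to the first consistency level used later in Section 3.2, and the function names are illustrative rather than the authors' implementation.

```python
import numpy as np

def kge(obs, sim):
    """Kling-Gupta Efficiency from correlation, bias ratio and variability ratio."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    r = np.corrcoef(obs, sim)[0, 1]
    beta = sim.mean() / obs.mean()               # bias ratio (simulated mean / observed mean)
    alpha = sim.std(ddof=1) / obs.std(ddof=1)    # variability ratio
    return 1.0 - np.sqrt((r - 1) ** 2 + (beta - 1) ** 2 + (alpha - 1) ** 2)

def relative_deviation(s_obs, s_sim):
    """Equation (4): relative deviation of a simulated signature from the observed one."""
    return (s_obs - s_sim) / s_obs

def consistency(obs, sim, observed_signatures, simulated_signatures,
                kge_min=0.5, dev_max=0.25):
    """Number of satisfied criteria: one KGE threshold plus one threshold per hydrologic signature."""
    n_ok = int(kge(obs, sim) > kge_min)
    for name, s_obs in observed_signatures.items():
        d = relative_deviation(s_obs, simulated_signatures[name])
        n_ok += int(abs(d) < dev_max)
    return n_ok
```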
Hydrologic consistency frequency plot. A hydrologic consistency frequency (HCF) plot introduced by Shafii and Tolson (2015) is akin to a probability exceedance plot and essentially plots the number of satisfied criteria (n) on the x-axis, and the proportion of function evaluations of an MO calibration experiment that satisfy at least n criteria on the y-axis. HCF plots are an effective presentation of the hydrologic consistency of a calibration search. An MO calibration experiment is more hydrologically consistent if a higher proportion of its evaluations satisfy more criteria (relative to other calibration optimization experiments). Thus, higher HCF curves are desirable and are better. In our study, we use only the non-dominated solutions from multi-objective optimization for the HCF analysis, which is different from the HCF plot introduced by Shafii and Tolson (2015) where all solutions from calibration are used.
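Given the consistency counts of the non-dominated solutions, an HCF curve is simply the fraction of solutions whose count is at least n, for n = 0, …, N; a short sketch with hypothetical counts follows.

```python
import numpy as np

def hcf_curve(consistency_counts, n_criteria=10):
    """Proportion of non-dominated solutions satisfying at least n criteria, for n = 0..N."""
    counts = np.asarray(consistency_counts)
    return np.array([(counts >= n).mean() for n in range(n_criteria + 1)])

# Hypothetical consistency counts for 8 non-dominated solutions (out of N = 10 criteria)
counts = [3, 5, 6, 6, 7, 8, 8, 9]
print(hcf_curve(counts))   # higher values at large n indicate a more hydrologically consistent search
```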
Hydrologic consistency frequency heatmap. The HCF plot gives the proportion of evaluation points that satisfy multiple criteria but cannot provide information on how each of the different hydrologic criteria is being satisfied. We introduce the HCF heatmap, which shows the frequency
Table 2
The percentage of function evaluations, i.e., relative progress, required by GOMORS to reach the mean best hypervolume achieved by other algorithms after 600 function evaluations. The bracketed numbers (ending in 'x') report speed-up factors of GOMORS relative to other algorithms. All the results are averaged over 20 trials. (A percentage less than 100% indicates the algorithm is slower than GOMORS.)

Algorithm | SW-2 | SW-3 | FW-2 | FW-3
ParEGO | 72% [1.4x] | 52% [1.9x] | 32% [3.1x] | 47% [2.1x]
NSGA-II | 12% [8.6x] | 9% [11x] | 10% [10x] | 7% [15x]
AMALGAM | 58% [1.7x] | 33% [3.0x] | 48% [2.1x] | 30% [3.4x]
MOEA/D | 23% [4.3x] | 23% [4.3x] | 16% [6.1x] | 15% [6.7x]
BORG MOEA | 36% [2.8x] | 29% [3.4x] | 35% [2.8x] | 35% [2.8x]
ε-MOEA | 29% [3.4x] | 36% [2.8x] | 30% [3.3x] | 20% [5.0x]
of each hydrologic criterion satised for the non-dominated solutions
that satised at least N hydrologic criteria. The horizontal axis of the
heatmap is the number of satised criteria (N). The vertical axis plots
different hydrologic criteria dened. The color of each grid in the
heatmap plots the frequency of each hydrologic criterion being satised
among the solutions that satised at least N criteria.
2.5. Model parameter and output uncertainty assessment
In this study, we quantied model parameter and output uncertainty
based on carefully selected non-dominated solutions. The hydrologic
consistency frequency plot and map introduced in Section 2.4 were
applied to the non-dominated solutions identied by the most effective
Multi-Objective (MO) algorithm. Following the analysis of hydrologic
consistency frequency plots and maps, which illustrate the performance
of non-dominated solutions across various hydrologic signatures, the
modeler can establish a threshold (e.g., specifying a minimum number
of hydrologic consistency criteria that must be satised). This threshold
serves as a criterion for selecting a subset of non-dominated solutions
assumed to be of high quality. These selected solutions not only belong
to the non-dominated set but also satisfy a greater number of hydrologic
consistency criteria compared to the remaining non-dominated
solutions.
To investigate parameter uncertainty, we compared the range (upper
and lower bounds of parameter values) of the nal selected non-
dominated solution set with the original parameter range dened for
the optimization search. This comparative analysis allows us to assess
the extent to which the methods proposed in our study contribute to
reducing parameter uncertainty.
Model output uncertainty is derived from these carefully selected
non-dominated solutions and is quantied as the range between the
highest and lowest output at each time step within the ensemble of
selected non-dominated solutions. To evaluate model performance, we
compared the ensemble mean and range of model output from the
selected non-dominated parameter sets with observed data. This
comprehensive approach enables a thorough investigation into both
parameter and output uncertainties, offering insights into the efcacy of
the proposed methodology.
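Both uncertainty measures reduce to simple range operations over the selected solutions, as in the sketch below; the array names and the band-coverage summary are illustrative assumptions rather than outputs of this study.

```python
import numpy as np

def uncertainty_summary(params, sims, obs, lower_bounds, upper_bounds):
    """params: selected non-dominated parameter sets (n_selected x n_params)
    sims:   simulated daily flows for each selected set (n_selected x n_days)
    obs:    observed daily flows (n_days,)"""
    # Parameter uncertainty: range spanned by the selected solutions vs. the original search range
    param_range = params.max(axis=0) - params.min(axis=0)
    range_reduction = 1.0 - param_range / (np.asarray(upper_bounds) - np.asarray(lower_bounds))

    # Output uncertainty: envelope between the highest and lowest simulated flow at each time step
    band_low, band_high = sims.min(axis=0), sims.max(axis=0)
    ensemble_mean = sims.mean(axis=0)
    coverage = np.mean((obs >= band_low) & (obs <= band_high))   # fraction of observations inside the band
    return range_reduction, (band_low, band_high), ensemble_mean, coverage
```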
3. Experimental setup
The experimental setup of this study is divided into two core components. In the first component, we focus on analysis of algorithm performance by comparing the algorithms listed in Section 2.3 on 4 problems, SW-2, SW-3, FW-2, and FW-3. Due to the stochastic nature of all algorithms, multiple trial runs were performed for the above-mentioned problems. We performed 20 trials for each algorithm with 600 function evaluations on each of the 4 problems. Algorithms are analyzed both in terms of traditional optimization efficiency and signature-based hydrologic consistency. Hypervolume coverage and trade-off visualization plots (for 2-objective formulations) are used for traditional algorithm assessment. Hydrologic consistency plots (see Section 2.4.2) are used for hydrologically relevant analysis. Results of algorithm comparison based on hypervolume coverage are discussed in Section 4.1 and results via consistency analysis are discussed in Section 4.3.
The second component of the analysis in this study explores the solution quality of different optimization formulations on a limited evaluation budget. The study employs GOMORS as the optimization algorithm and assesses the quality of calibration solutions in terms of hydrologic consistency. The two different formulations introduced in Section 2.1 are compared in this analysis, with an emphasis on understanding the efficacy and effectiveness of these formulations in producing good calibration alternatives. Efficacy, in this study, is defined as the ability of a formulation to frequently find calibration solutions that have high hydrologic consistency (see Section 2.4.2 for the definition of hydrologic consistency). HCF plots and HCF heatmaps are used to understand and compare efficacy. Effectiveness, in this study, focuses more on comparing the best calibration solutions found by different formulations. Results of the formulation comparison and solution quality assessment are discussed in Section 4.4.
3.1. Algorithm settings
A small trial-and-error exercise was performed to tune population sizes for the multi-objective evolutionary algorithms (MOEAs) NSGA-II, MOEA/D, ε-MOEA and AMALGAM. Since the performance of MOEAs is highly dependent on population size, we ran multiple trials of the algorithms on the SW cases, i.e., SW-2 and SW-3, with population sizes of 20, 50, 100 and 200, and an evaluation budget of 600. We chose 600 evaluations because when the FW-3 problem is running on a desktop (Intel Core i7-6700 CPU @ 3.40 GHz, 16 GB RAM), the runtime is around 1 min for each evaluation; therefore, the computational time is around 10 h for a single optimization scenario of FW running in serial on a single computer, which is affordable under this budget. The initial trial-and-error analysis showed that within the limited evaluation budget of 600, a population size of 20 was desirable for all MOEAs in the case of the 2-objective problems, whereas a population size of 50 was desirable in the case of the 3-objective problems to achieve convergence of fitness. BORG MOEA has an adaptive population size and hence does not require tuning of the population. Given the limited evaluation budget, we changed the initial, minimum, and maximum population size values for BORG MOEA to 16, 10, and 100, respectively. ParEGO's parameter configuration recommended by Knowles (2006) was employed, and the GOMORS parameter configuration recommended by Akhtar and Shoemaker (2016) was used.
3.2. Hydrologic signatures and consistency levels
Hydrologic signatures (HS) are important streamow characteristics
of the natural ow regimes, such as timing and magnitude of extreme
ows (Shai and Tolson, 2015). Table A2 denes nine HS used in this
work. As introduced in Section 2.4.2, hydrologic consistency of a cali-
bration solution is dened as the number of hydrologic criteria that are
satised by a calibration solution. Ten hydrologic criteria (the KGE
metric plus the nine HS) are considered in the hydrologic consistency
analysis of this study. Moreover, satisfaction level of a hydrologic
signature (for a calibration solution) is achieved when absolute relative
deviation (see Section 2.1.1 and Equation (2)) of the signature is within
a user-dened percentage. Satisfaction level with respect to KGE is
achieved when the KGE score of a calibration solution is greater than a
user-dened threshold. Two different user-dened hydrologic consis-
tency levels are considered in the algorithm and formulation analysis of
this study. For the rst hydrologic consistency level, satisfaction level for
KGE is KGE >0.5, and satisfaction level for the absolute value of a HS is
the absolute value of HS <25%. For the second (and higher) hydrologic
consistency level, satisfaction level for KGE is KGE >0.6, and satisfaction
level is the absolute value of HS <15%. These consistency level de-
nitions are arbitrary and can be modied as per the perspective of a
model calibration expert.
4. Results and discussion
4.1. Traditional algorithm comparison: hypervolume coverage
We rst analyze overall efciency of all algorithms by plotting
hypervolume coverage (see Section 2.4.1) values (averaged over mul-
tiple trial runs) against number of function evaluations. These plots are
called progress graphs that are computed for all algorithms on both
watershed calibration problems using Formulations 1 and 2 (see
Table 1), i.e., algorithms are compared on problems SW-2, SW-3, FW-2,
and FW-3. The progress graphs are given in Fig. 1.
Each subgure in Fig. 1 corresponds to one of the four watershed test
problems mentioned above and provides visualizations of the average
hypervolume covered (averaged over multiple trials) as a function of
number of evaluations for all algorithms. A higher hypervolume
coverage indicates a better Pareto front. So the fact that GOMORS
eventually has the highest curve in all cases (as shown in Fig. 1) in-
dicates GOMORS is performing better than the other algorithms on all
four multi objective problems considered (where S or F indicates small
or full watershed and 2 or 3 is the number of objectives. As was
mentioned earlier, hypervolume progress considers both convergence
and diversication; and higher values of the hypervolume metric are
desirable. It is evident from the analysis in Fig. 1 that overall average
performance of GOMORS is better than all other algorithms for all
watershed test problems within a limited watershed simulation evalu-
ation budget of 600 since the blue curves ascend faster and to higher
values. The progress graphs also indicate relative superiority of the two
surrogate algorithms GOMORS and ParEGO over the non-surrogate al-
gorithms for a limited evaluation budget of 600.
While it is evident from Fig. 1 that the average performance of GOMORS is better than other algorithms with regard to hypervolume coverage, we further analyze the relative efficiency of GOMORS (compared to other algorithms) by reporting the percentage of function evaluations required by GOMORS to reach the best hypervolume value (averaged over multiple trials) achieved by other algorithms after 600 function evaluations. This percentage is referred to as relative progress in subsequent discussions and is reported in Table 2. Table 2 also reports the speed-up of GOMORS relative to other algorithms, where the speed-up, calculated after 600 function evaluations, is 600 divided by the number of evaluations required by GOMORS to reach the same hypervolume as the other algorithm obtained after 600 evaluations. For instance, the speed-up of GOMORS relative to AMALGAM is 600/200 = 3.0 times for SW-3, because it only takes 33% as much time for GOMORS to solve the problem as is required by AMALGAM.
All relative progress percentages reported in Table 2 are considerably less than 100%, implying that for every calibration problem, GOMORS obtained equivalent average hypervolume values to all other algorithms in less time. For instance, in comparison to NSGA-II, GOMORS obtained equivalent average hypervolume within roughly 10% of the function evaluations (see the third row of Table 2).
In Fig. 1, a notable observation is that GOMORS seems to converge faster in addressing the FW problem compared to the SW problem, as indicated by the hypervolume coverage progress. This shows that despite the SW problem being of smaller model scale (constituting a sub-watershed of FW), it does not show a reduction in the number of evaluations required for algorithmic convergence. This highlights the inherent complexity of the SW optimization problem and challenges our initial expectation that smaller-scale watershed models inherently demand fewer optimization evaluations for convergence. However, it is unclear
Fig. 1. Progress graphs of MO solutions with plots of best hypervolume coverage values against number of function evaluations, averaged over 20 trials. Each subplot
corresponds to the progress plots of each test case study mentioned in the respective titles: (a) SW-2, (b) SW-3, (c) FW-2, and (d) FW-3. Higher curves are better.
why a smaller watershed model calibration takes more evaluations to converge than the larger watershed model calibration problem. The complexity arises from the fundamental differences in simulation outputs and observation data between the two models, making direct comparisons challenging. The incongruence in optimization convergence metrics between the SW and FW problems stems from the non-comparability of the hypervolume coverage metric across multiple problems. This is because the metric's calculation is relative to the reference and ideal solutions, whose values are estimated from the optimization algorithms' evaluation points within each problem context (see Supporting Information Section S3 for details).
4.2. Statistical significance
In order to analyze the difference in performance between GOMORS,
ParEGO, AMALGAM and Borg MOEA in further detail, the two-sided
Mann-Whitney Rank Sum test (Conover, 1998) was performed over
the hypervolume metric values obtained for each algorithm in multiple
trials. The Rank Sum test is a non-parametric statistical hypothesis test
for deducing whether results obtained from one algorithm in multiple
trial runs are significantly different from results obtained from another
algorithm in multiple trials. The algorithms are compared in pairs and
the Rank Sum Test is performed for all watershed problems after 200,
400 and 600 evaluations of each algorithm are complete. Hence, there
are 36 Rank Sum tests for each algorithm, and 18 for each test problem.
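For illustration, one such pairwise comparison can be carried out with the two-sided Mann-Whitney U (rank sum) test as implemented in scipy (the study follows Conover (1998); the function and array names below are illustrative):

import numpy as np
from scipy.stats import mannwhitneyu

def rank_sum_compare(hv_a, hv_b, alpha=0.1):
    # hv_a, hv_b: hypervolume values of two algorithms from repeated trials
    # (20 per algorithm here), compared after a fixed number of evaluations.
    stat, p = mannwhitneyu(hv_a, hv_b, alternative="two-sided")
    if p >= alpha:
        return p, None                                  # no significant difference at the 10% level
    return p, ("A" if np.median(hv_a) > np.median(hv_b) else "B")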
A summary of the Mann-Whitney Rank Sum Test is provided in
Table 3. We see in Table 3 that GOMORS is better than the alternative
algorithm in 34 out of 36 cases. Table 3 also indicates that none of the
other algorithms is statistically better than GOMORS for any of the 36
combinations of problems (SW-2, SW-3, FW-2, FW-3), algorithms (Par-
EGO, AMALGAM, Borg MOEA) and numbers of evaluations (200, 400,
or 600).
For all cases of the GOMORS versus AMALGAM and GOMORS versus Borg MOEA comparisons, the p-value is very low (p < 0.1), supporting the conclusion that GOMORS is significantly better than both AMALGAM and Borg MOEA. GOMORS also performs better than ParEGO in general, since 10 out of 12 p-values for GOMORS vs ParEGO are below 0.1 (with GOMORS being the superior algorithm), whereas there are no cases where ParEGO performs better than GOMORS (with p < 0.1).
While ParEGO's performance may not be as good as that of GOMORS, it is distinctly better than both AMALGAM and Borg MOEA (as per hypervolume coverage), as indicated by p-values of less than 0.1 (with ParEGO superior) in 14 out of 24 cases. GOMORS, ParEGO, AMALGAM and Borg MOEA all decisively outperform NSGA-II and MOEA/D (see Fig. 1). Hence, NSGA-II and MOEA/D were not included in the rank sum test analysis.
Our analysis highlights that GOMORS frequently outperforms
AMALGAM, Borg MOEA and ParEGO. However, none of the other al-
gorithms outperforms GOMORS on any case study. Also, the fact that the two surrogate methods, GOMORS and ParEGO, outperform the widely used AMALGAM algorithm suggests that there is a distinct advantage to using surrogates in multi-objective optimization for watershed model calibration with a limited number of evaluations.
4.3. Proximity to the approximated Pareto Front
Another critical question is how well the non-dominated front (obtained from an algorithm) compares against the approximate Pareto front (computed from many simulations). Since Pareto fronts of 2-objective problems are easier to visualize, we compare the non-dominated fronts obtained by different algorithms for the 2-objective FW-2 problem (see Table 1) against the approximated Pareto front in Fig. 2. The results are based on 20 trials for each algorithm, i.e., the best and worst trials are the trials with the best or worst hypervolume coverage among all 20 trials for that algorithm. Moreover, since the true Pareto front of the watershed problems analyzed in this study is not known, we use the non-dominated points of all points evaluated by all algorithms for the FW-2 problem (7 algorithms * 20 trials * 600 evaluations), together with a long run of AMALGAM with 50,000 evaluations (approximately 50,000 + 7*20*600 = 134,000 points in total), as the "approximated" Pareto front.
Fig. 2 plots this approximated Pareto front (depicted in red), against the
non-dominated fronts obtained by GOMORS, ParEGO and BORG after
600 function evaluations (from best and worst algorithm trials). We only
compare GOMORS, ParEGO and BORG here, since these algorithms are
the best (as per hypervolume coverage) for the FW-2 problem.
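The "approximated" Pareto front itself is simply the non-dominated subset of the pooled evaluations; a sketch follows (the pairwise O(n^2) filter below is illustrative and would be replaced by a faster sorting-based filter for the full pooled set of roughly 134,000 points):

import numpy as np

def nondominated(points):
    # Non-dominated subset of a set of minimization objective vectors.
    pts = np.asarray(points, dtype=float)
    keep = np.ones(len(pts), dtype=bool)
    for i, p in enumerate(pts):
        dominates_p = np.all(pts <= p, axis=1) & np.any(pts < p, axis=1)
        keep[i] = not dominates_p.any()
    return pts[keep]

# e.g. pool all evaluated objective vectors (7 algorithms x 20 trials x 600
# evaluations plus the 50,000-evaluation AMALGAM run) and filter:
# pareto_approx = nondominated(np.vstack(all_evaluated_objectives))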
It is evident from Fig. 2 that the non-dominated fronts obtained from GOMORS after only 600 evaluations are visually close in proximity to the approximated Pareto front. Moreover, Fig. 2 shows the non-dominated fronts from the best and worst trials of GOMORS, ParEGO and BORG. These results also indicate that even the worst trials of GOMORS are quite good, almost as good as the best results of the other algorithms. Thus, GOMORS is more robust in performance than the other algorithms. The worst front of ParEGO performs very poorly, indicating its lack of reliability across multiple trials when applied to the FW-2 problem.
4.4. Hydrologic consistency analysis of solutions from multi-objective
optimization
The MO calibration assessment so far shows that GOMORS is the best-performing algorithm on a limited evaluation budget in terms of algorithm efficiency. This section compares the hydrologic quality of the two multi-objective formulations, using GOMORS as the optimization algorithm, in order to propose an appropriate formulation for calibration on a limited evaluation budget. As mentioned in prior discussion, hydrologic consistency is appropriate and highly relevant when attempting to understand algorithm performance in hydrologic model calibration.
Table 3
Summary of statistical comparison via the Mann-Whitney Rank Sum Test applied to GOMORS, ParEGO, AMALGAM and Borg MOEA, according to the hypervolume coverage metric with 600 evaluations. The rows correspond to the pairs of algorithms that are compared, and the sub-columns correspond to the number of function evaluations after which the algorithms are compared. Table cells report p-values obtained from the two-sided Rank Sum Test with the null hypothesis that algorithm performances are not different (20 trials). p-values below 0.1 indicate cases where the first listed algorithm performs better than the second listed algorithm at a 10% significance level; there are no cases where the second listed algorithm is significantly better than the first listed algorithm (i.e., with p < 0.1).

Algorithm pair       | Problem SW-2                  | Problem SW-3
                     | 200       400       600      | 200       400       600
GOMORS vs ParEGO     | 0.9138    0.3438    6.3e-03  | 1.2e-02   1.6e-02   8.7e-03
GOMORS vs AMALGAM    | 9.2e-06   1.1e-06   1.3e-05  | 2.1e-07   9.8e-07   4.8e-06
GOMORS vs Borg       | 2.1e-04   1.1e-04   4.8e-06  | 1.1e-06   1.5e-06   2.9e-06
ParEGO vs AMALGAM    | 1.1e-06   2.8e-05   0.1441   | 4.4e-05   2.9e-04   9.4e-03
ParEGO vs Borg       | 6.2e-05   2.7e-03   9.4e-03  | 9.7e-04   1.2e-04   6.5e-04
AMALGAM vs Borg      | 0.1762    0.2674    0.1167   | 0.4819    0.4819    0.3040

Algorithm pair       | Problem FW-2                  | Problem FW-3
                     | 200       400       600      | 200       400       600
GOMORS vs ParEGO     | 8.8e-05   1.1e-03   1.9e-03  | 4.9e-03   0.0699    0.0515
GOMORS vs AMALGAM    | 1.7e-06   6.2e-05   4.4e-04  | 1.7e-05   1.1e-06   4.3e-06
GOMORS vs Borg       | 4.3e-06   2.3e-04   4.8e-04  | 1.3e-05   1.9e-05   1.5e-04
ParEGO vs AMALGAM    | 0.1441    0.9569    0.2235   | 0.5885    4.8e-02   2.0e-02
ParEGO vs Borg       | 0.3302    0.7455    0.9784   | 0.4651    0.0989    0.1045
AMALGAM vs Borg      | 0.5338    0.5338    0.3040   | 0.7660    0.7049    0.4328
The two satisfaction levels defined in Section 3.2, for nine HS (see Table A2) and KGE, are used here to analyze the algorithms. Moreover, hydrologic consistency is visualized via consistency frequency plots (Shafii and Tolson, 2015) (see Figs. 3 and 4).
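For reference, the goodness-of-fit score used in these satisfaction levels is the Kling-Gupta efficiency (Gupta et al., 2009; Kling et al., 2012), which in its original form is

\mathrm{KGE} = 1 - \sqrt{(r - 1)^2 + (\alpha - 1)^2 + (\beta - 1)^2}, \qquad \alpha = \sigma_{\mathrm{sim}} / \sigma_{\mathrm{obs}}, \qquad \beta = \mu_{\mathrm{sim}} / \mu_{\mathrm{obs}},

where r is the linear correlation between simulated and observed flows, and the sigma and mu terms are their standard deviations and means; the exact variant adopted here is the one defined in Section 3.2.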
To analyze the meaningfulness gained from each formulation, Figs. 3 and 4 compare the HCF plots of the two formulations given in Table 1, with GOMORS as the MO algorithm, applied to the SW and FW models, respectively. Fig. 3(a) and Fig. 4(a) show the HCF plots with all solutions from the 20 optimization trials. Fig. 3(b) and Fig. 4(b) show the filtered HCF plots with only the non-dominated solutions identified from each of the 20 trials.
Fig. 3 indicates that Formulation 2 has better hydrologic consistency than Formulation 1 on the SW problem. Formulation 2 is a mixed formulation with one GOF measure and two HS (see Section 2.1.2). Moreover, the performance of Formulation 2 is the best under both satisfaction thresholds (KGE > 0.6 and |HS| < 15%, and KGE > 0.5 and |HS| < 25%) in the SW problem.
For the FW problem, Formulation 2 appears to perform slightly better than Formulation 1 according to the HCF plot with all solutions from the optimization experiments (Fig. 4(a)). However, the difference between Formulations 2 and 1 is not as obvious as in the SW problem. In addition, from the filtered HCF plot using non-dominated solutions (Fig. 4(b)), Formulation 2 generates a higher proportion of non-dominated solutions when a larger number of criteria are satisfied. For instance, Formulation 2 has a higher proportion of non-dominated solutions that satisfy at least 6 criteria at the higher threshold (and at least 8 criteria at the lower threshold) than Formulation 1. In practice, users are likely to be more interested in non-dominated solutions than in all solutions, and in solutions that satisfy a large number of criteria rather than only a few. Hence, formulations that produce a larger proportion of non-dominated solutions satisfying more criteria are preferable.
The HCF plots in Figs. 3 and 4 only give the proportion of evaluation points that satisfy multiple criteria but cannot provide information on how each of the different hydrologic criteria is satisfied. We therefore introduce the HCF heatmap, which shows the frequency with which each hydrologic criterion is satisfied among the non-dominated solutions that satisfy
Fig. 2. Comparison of best and worst non-dominated fronts of (a) GOMORS, (b) ParEGO, and (c) BORG after 600 evaluations against the approximated Pareto front
of problem FW-2 based on 134,000 simulations from which the Pareto Front was calculated. The approximated Pareto front is in red. The best non-dominated front
achieved by the optimization algorithm is in dark blue. The worst non-dominated front is in light blue. The best and worst non-dominated fronts are obtained from
the best and worst performed trials, respectively, among the 20 repeated random trials for each algorithm.
Fig. 3. Hydrologic consistency frequency (HCF) plots for GOMORS when applied to problems SW-2 and SW-3 (see Table 1). Subfigure (a) plots the average (over multiple trials) proportion of evaluated points that satisfy at least N scores (x-axis) as per the different hydrologic satisfaction levels defined in the legend (and in Section 3.2). Subfigure (b) plots the average (over multiple trials) proportion of non-dominated points (according to the relevant formulation) that satisfy at least N scores (x-axis) as per the hydrologic satisfaction levels defined in the legend (and in Section 3.2). A higher curve, i.e., a higher proportion of satisfaction, is better (for all subplots). Note that there are 9 criteria and hence the proportion of evaluations satisfying 10 criteria is zero.
at least N hydrologic criteria (as shown in Fig. 5 for the SW problem and Fig. 6 for the FW problem). Fig. 5 shows that as the number of criteria satisfied (N) increases, the frequency of each individual criterion being satisfied also increases. FMS, DP, and DMD are the three hydrologic criteria with lower frequencies of being satisfied compared to the other hydrologic criteria in all cases (in both SW-2 and SW-3). KGE has the highest frequency of being satisfied amongst all criteria. Satisfaction frequencies for most criteria from SW-3 are generally higher than those from SW-2 (especially for the higher satisfaction threshold (i.e., KGE > 0.6 and |HS| < 15%)).
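The heatmap entries can be computed in the same spirit, assuming criterion satisfaction has already been evaluated for every solution (a sketch with illustrative names):

import numpy as np

def hcf_heatmap(satisfied, nd_mask):
    # satisfied: boolean array (n_solutions, n_criteria), True where a criterion
    # (KGE, RR, FHV, ...) is met; nd_mask: boolean mask of non-dominated solutions.
    sat = np.asarray(satisfied)[np.asarray(nd_mask)]
    counts = sat.sum(axis=1)
    rows = []
    for N in range(1, sat.shape[1] + 1):
        subset = sat[counts >= N]
        rows.append(subset.mean(axis=0) if len(subset) else np.zeros(sat.shape[1]))
    return np.array(rows)   # rows: threshold N; columns: per-criterion satisfaction frequency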
Fig. 4. Hydrologic consistency frequency (HCF) plots for GOMORS when applied to problems FW-2 and FW-3 (see Table 1). Subfigure (a) plots the average (over multiple trials) proportion of evaluated points that satisfy at least N scores (x-axis) as per the different hydrologic satisfaction levels defined in the legend (and in Section 3.2). Subfigure (b) plots the average (over multiple trials) proportion of non-dominated points (according to the relevant formulation) that satisfy at least N scores (x-axis) as per the hydrologic satisfaction levels defined in the legend (and in Section 3.2). A higher curve, i.e., a higher proportion of satisfaction, is better (for all subplots). Note that there are 9 criteria and hence the proportion of evaluations satisfying 10 criteria is zero.
Fig. 5. Hydrologic consistency frequency (HCF) heatmap for GOMORS when applied to problems SW-2 and SW-3 (see Table 1). Subfigures plot the frequency with which each hydrologic criterion is satisfied among the non-dominated evaluated points that satisfy at least N criteria under the hydrologic satisfaction levels defined in Section 3.2. Subfigures (a) and (b) are plots for the SW-2 problem; subfigures (c) and (d) are plots for the SW-3 problem. Deeper color indicates a higher proportion of satisfaction and so is better (for all subplots).
Similar ndings are also seen in the FW problems where FMS and DP
are the criteria that have, comparatively, fewer frequencies of being
satised; and solutions from FW-3 have a higher frequency in terms of
satisfying each criterion. The relatively low frequencies of satisfaction of
DP and FMS can be attributed to structural errors arising from weather
data inaccuracies, parameterization & model setup inadequacies, and
SWAT model structural errors. For instance, multiple studies report that
SWAT tends to underestimate peak ows (Muhammad et al., 2019; Me
et al., 2015). Moreover, the model setup of both the SW & FW problems
utilizes a small set of stations for weather input (Tolson and Shoemaker,
2007a). This may result in weather data induced inaccuracies in model
outputs.
Fig. 7 gives the distributions of each hydrologic criterion for the filtered non-dominated evaluations, i.e., the evaluations that satisfied at least 6 hydrologic criteria. The average value of the hydrologic criteria of these filtered solutions from Objective Function 2 is better (higher KGE and lower hydrologic signature error) than that from Objective Function 1 for almost all hydrologic signatures, except for DAC in FW-3. Moreover, the distribution of signatures for filtered solutions with Objective Function 2 (i.e., SW-3 and FW-3) has less positive/negative bias (i.e., the boxplots for SW-3 and FW-3 typically range between negative and positive values). This implies that the filtered non-dominated solutions for Objective Function 2 have less bias in individual signatures and thus may better represent model uncertainties if these filtered solutions are used as an ensemble of calibration alternatives (similar to equifinality (Her and Seong, 2018; Beven and Binley, 2014)). Section 4.5 analyzes the performance of filtered solutions in further detail for one optimization run of the FW-3 problem.
The ltered solutions generally have a high value of KGE. The
average KGE value of ltered solutions is around 0.75 for SW-2 and SW-
3, respectively. For the FW-2 and FW-3 problems, the KGE value is
around 0.8. The value of all hydrologic signatures except for the FMS,
DP, and DMD in the SW problem and except for the FMS and DP in the
FW problem are generally less than 0.15 (within the red dash line). This
implies that hydrologic criteria could be an effective tool used for non-
dominated solution ltering (e.g., selecting high-quality subset solutions
from a large set of non-dominated solutions).
4.5. Calibration selection/filtering & analysis of simulated hydrographs
"Filtered non-dominated solutions" are further analyzed to assess whether the calibrated (using efficient MO optimization) hydrologic model of the Cannonsville watershed (FW) performs adequately, and also to gain insights on model setup, parameter sensitivity and output uncertainty. Fig. 8 provides an overview of the performance of the filtered non-dominated solutions obtained from the median GOMORS optimization run (median trial out of 20, as per hypervolume) of FW-3. The filtered non-dominated solutions in Fig. 8 are the subset of non-dominated solutions of the median GOMORS trial (for FW-3) that satisfy at least 7 consistency criteria (as per threshold definition 2: KGE > 0.6 and |HS| < 15%).
Fig. 8 also reports the parameter ranges (see Fig. 8(a)) and the output uncertainty derived from "filtered solutions only". "Filtered solutions only" are the subset of all 600 simulation runs (118 out of 600) of the median GOMORS trial (for FW-3) that satisfy at least 7 consistency criteria (as per threshold definition 2: KGE > 0.6 and |HS| < 15%). Consequently, the filtered non-dominated solutions are also a subset of the "filtered solutions only". The purpose of reporting "filtered solutions only" alongside "filtered non-dominated solutions" in Fig. 8 is to illustrate that a smaller ensemble of good-quality calibrations can be obtained by using both filtering via non-domination and filtering via hydrologic consistency satisfaction.
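A sketch of these two filtering steps, assuming the objective vectors and consistency-criteria counts of all 600 evaluations are available (names illustrative):

import numpy as np

def nondominated_mask(objs):
    # Boolean mask of non-dominated rows for minimization objectives.
    objs = np.asarray(objs, dtype=float)
    mask = np.ones(len(objs), dtype=bool)
    for i, p in enumerate(objs):
        mask[i] = not (np.all(objs <= p, axis=1) & np.any(objs < p, axis=1)).any()
    return mask

def filter_calibrations(objs, n_satisfied, min_criteria=7):
    # "Filtered only": evaluations meeting at least min_criteria consistency
    # criteria (118 of 600 in the median FW-3 trial); "filtered ND": those that
    # are additionally non-dominated (21 of 600).
    consistent = np.asarray(n_satisfied) >= min_criteria
    return consistent, consistent & nondominated_mask(objs)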
A total of 21 (out of 39) non-dominated solutions (from the median
FW-3 calibration trial) satisfy at least 7 hydrologic consistency metrics,
Fig. 6. Hydrologic consistency frequency (HCF) heatmap for GOMORS when applied to problems FW-2 and FW-3 (see Table 1). Subfigures plot the frequency with which each hydrologic criterion is satisfied among the non-dominated evaluated points that satisfy at least N criteria under the hydrologic satisfaction levels defined in Section 3.2. Subfigures (a) and (b) are plots for the FW-2 problem; subfigures (c) and (d) are plots for the FW-3 problem. Deeper color indicates a higher proportion of satisfaction and so is better (for all subplots).
and parameter sets & simulation output ensembles of all these solutions are summarized in Fig. 8. Fig. 8(a) shows the normalized parameter values and ranges of all parameters included in the calibration optimization (see Table A1 for parameter descriptions and optimization ranges). The parameter range highlighted in gray corresponds to the "filtered only" solutions, and it is evident that this range is much larger, whereas the parameter ranges of the "filtered non-dominated solutions" (highlighted in yellow) are narrower and better illustrate the sensitivity of the calibration parameters. For instance, GW_DELAY, ALPHA_BF, and CN2_f appear to be more sensitive, as evidenced by the relatively narrow distributions of their values for high-quality calibration solutions, as shown in Fig. 8(a). In contrast, SFTMP and SMTMP appear to be less sensitive, with their values spanning a wider range across the search space. The seemingly high sensitivity of GW_DELAY, which is the percolation lag time (Arnold et al., 2012), may be because the optimization range for this parameter is set too high (0.001-500 days; see Table A1). CN2_f is a multiplicative factor to calibrate the curve number in model sub-catchments and is expected to be a sensitive parameter. In general, insights from parameter values, trends, and ranges for the filtered non-dominated solutions can assist in ascertaining the validity of the calibration process and allow for changes in the calibration setup, if necessary. This is facilitated by the use of an efficient surrogate algorithm for MO calibration.
Fig. 8 also includes three sub-figures (Fig. 8(b-d)) that compare simulated (mean hydrograph of the filtered calibrations and range of hydrographs) and measured hydrographs. Fig. 8-b shows the respective hydrographs for the full simulation period (1994-1999) (including the mean and upper/lower bounds of the ensemble simulations), whereas Fig. 8-c and Fig. 8-d show hydrographs for dry and wet years for a more detailed illustration of the differences between measured and simulated flows. The mean and upper/lower bounds of the hydrographs are calculated from the ensemble simulations as described in Section 2.5. These figures illustrate that the ensemble (of 21 filtered calibrations) simulated hydrographs fit relatively better on non-peak flows. Peak flows are usually underestimated, which is expected and consistent with the prior analysis (see discussion related to Fig. 7). Overall, the mean and ranges of simulated flows capture the measured hydrology adequately, both in terms of timing and overall water balance.
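The ensemble statistics plotted in Fig. 8(b-d) can be produced along the following lines; whether the band is the pointwise minimum/maximum or a quantile envelope is specified in Section 2.5, so the min/max bounds used here are an assumption:

import numpy as np

def ensemble_hydrograph(simulated_flows):
    # simulated_flows: array (n_members, n_timesteps) of daily flows simulated
    # by the filtered ND calibrations; returns the mean hydrograph and bounds.
    sims = np.asarray(simulated_flows, dtype=float)
    return sims.mean(axis=0), sims.min(axis=0), sims.max(axis=0)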
In order to further understand and compare the simulated and measured flows, Fig. 8 also includes measured and simulated Flow Duration Curves (FDC) in Fig. 8-e (where flows are shown on a log scale), and a scatter plot of measured vs. mean simulated flow in Fig. 8-f. These plots also show that, overall, the ensemble simulation outputs compare well against the measured flow.
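A flow duration curve of the kind shown in Fig. 8-e can be computed as sketched below; the Weibull plotting position is an assumption, since the exact convention is not stated in this section:

import numpy as np

def flow_duration_curve(flows):
    # Exceedance probability versus flow (flows are plotted on a log scale in Fig. 8-e).
    q = np.sort(np.asarray(flows, dtype=float))[::-1]       # flows in descending order
    exceedance = np.arange(1, len(q) + 1) / (len(q) + 1)    # Weibull plotting position
    return exceedance, q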
4.6. Further discussion
4.6.1. Predictive uncertainty from MO with HS consistency selection
criteria
Our study introduces a novel framework for watershed model cali-
bration, integrating multi-objective optimization for parameter search,
the selection of non-dominated solutions based on hydrologic consis-
tency criteria, and subsequent evaluation of model parameter and
output uncertainty from the final solutions. It is imperative to delineate
the distinctive features of our approach in comparison to existing
methods for parameter estimation with model uncertainty
Fig. 7. Box-plots depicting distributions of consistency criteria on filtered non-dominated fronts, i.e., non-dominated solutions where at least 6 criteria are satisfied as per the stricter satisfaction definition given in Section 3.2 (KGE > 0.6 and |HS| < 0.15), for GOMORS (all trials) when applied to both formulations of the SW (Townbrook watershed) and FW (Cannonsville watershed) case studies (see Table 1). Subfigure (a) plots the distributions of all consistency criteria on the filtered non-dominated fronts of the SW-2 and SW-3 problems. Subfigure (b) plots the distributions of all consistency criteria on the filtered non-dominated fronts of the FW-2 and FW-3 problems. The horizontal dashed line shows the ideal value of an HS-based criterion, and the dotted lines show the allowed range for HS-based criteria (i.e., |HS| < 0.15). Higher KGE values are desired and the ideal KGE value is 1.
quantification. While our study is not primarily focused on uncertainty quantification, we emphasize that our approach empowers modelers to gain insights into model output uncertainty through the parameter sets obtained via multi-objective optimization and the selection of non-dominated solutions based on hydrologic signature consistency criteria. Our method employs a multi-objective search to identify a set of superior parameters, represented by non-dominated solutions, each excelling in at least one objective defined in the multi-objective problem. Hydrologic consistency criteria are then applied to select a subset of non-dominated solutions meeting high-quality standards, i.e., satisfying multiple hydrologic consistency criteria. Consequently, the final solutions represent robust choices derived from diverse parameter values. In the FW problem, for instance, 21 sets of parameters are identified as final solutions, forming the basis for deriving model output uncertainty. The uncertainty analysis in our study therefore mainly concerns the uncertainty quantification of the final solutions (i.e., the selected non-dominated solutions) obtained by our approach.
Our goal is not to solve the equifinality issue for multi-objective
Fig. 8. Analysis of filtered calibrations: a) Parameter sets and ranges, b-d) Hydrographs of measured flow and the mean and upper/lower bounds of simulated flow for the Cannonsville SWAT model, e) Flow duration curves of measured and simulated flows (mean and upper/lower bounds), and f) Scatter plot of mean simulated flow (mean of all hydrographs of all filtered calibrations) vs. measured flow. Please note that "Filtered Only" (118 out of 600) is the subset of calibrations (derived from the median GOMORS run for the FW-3 problem) that satisfy at least 7 hydrologic consistency criteria, whereas "Filtered ND" is the ensemble/subset of calibrations (21 out of 600; derived from the median GOMORS run for the FW-3 problem) that are both non-dominated and satisfy at least 7 hydrologic consistency criteria. Moreover, the mean simulated flow is the mean hydrograph obtained from the "filtered ND" simulation ensemble.
calibration problems, which is beyond the scope of this paper. However,
it would be worthwhile to discuss how our approach in uncertainty
approximation differs from previous methods, such as the informal
Bayesian approach GLUE and various formal Bayesian Markov chain
Monte Carlo (MCMC) methods, as discussed in the literature section, in
several respects. Firstly, the solutions we included for uncertainty
analysis are achieved through surrogate-assisted multi-objective opti-
mization and hydrologic consistency criteria. In GLUE, a set of behav-
ioral solutions is selected by a subjective threshold value for the
likelihood function (for example, a percentage of the total sampled so-
lutions), leading to model output uncertainty that significantly depends
on the modeler dening this subjective threshold. In the formal Bayesian
approach, a cut-off threshold for behavioral and non-behavioral solu-
tions is based on the underlying parameter probability distribution,
necessitating the definition of a formal likelihood function. Another
distinction lies in the derivation of prediction uncertainty. In GLUE,
prediction uncertainty is obtained from likelihood-weighted predictions
or quantiles with a behavioral parameter set, where the likelihood value
measures the agreement between model predictions and observations.
In formal Bayesian MCMC methods, predictive uncertainty is estimated
by evaluating the model output for parameter sets sampled from the
posterior distribution of model parameters. For the framework we pro-
pose, we derive model output uncertainty by establishing upper and
lower bounds of the model output from the selected non-dominated
solutions (i.e., behavioral solutions). We do not assign separate
weights to different behavioral solutions because these selected non-
dominated solutions are all high quality, and there is no straightfor-
ward way to rank their quality to obtain an appropriate weight. Our
approach does not employ a statistical likelihood function like formal statistical methods, and some subjective decisions are still necessary, such as determining the number of hydrologic consistency criteria to meet for selecting non-dominated solutions. Therefore, the derived model uncertainty is somewhat dependent on the modeler's subjective choices. Nevertheless, our study introduces the hydrologic consistency frequency plots and heatmaps as effective tools for aiding modelers in making these decisions.
In our study, we focused on using multi-objective (MO) calibration,
and the uncertainty analysis also centers around solutions obtained from
MO with multiple criteria. Although our primary focus is not to compare
MO calibration with single-objective calibration, we would like to
discuss some aspects in terms of parameter and simulation uncertainty
associated with MO and single-objective calibration. Compared with
single-objective calibration, MO calibration could help improve the identifiability of model parameters by considering multiple aspects of the system. Single-objective calibration can struggle with parameter identifiability issues, especially if the chosen objective function is not sensitive to certain aspects of the hydrological response. However, MO calibration requires more extensive and diverse datasets than single-objective calibration to adequately capture different aspects of the hydrological system. As the number of objectives increases, it also becomes more challenging for MO algorithms to solve the optimization search problems. The difficulty of finding the true Pareto front within a limited budget can also introduce uncertainty into parameter identification and simulation output. Hence, it is important to have efficient MO methods that can find better solutions in a shorter time, which is what our study aims to achieve. Additionally, the inclusion of multiple criteria in MO can lead to solutions that excel in one or some objectives but perform extremely poorly in others. Among the non-dominated solutions, such extreme solutions could also increase the uncertainty in the model simulation output. Therefore, it is also important to conduct a selection of the non-dominated solutions from MO, exemplified by the hydrologic signature consistency selection approach we proposed.
4.6.2. Versatile application potential beyond current study
In this study, we applied our calibration assessment approach to two
watershed models using SWAT. It is crucial to emphasize that our
approach has broader applicability beyond SWAT and can be extended
to other watershed models. Our proposed approach is inherently general
for two main reasons. Firstly, the surrogate models employed in
surrogate-assistant methods are versatile and can be applied to any
computationally expensive model. This exibility allows our approach
to transcend the connes of a specic model, enabling its utilization
across a spectrum of watershed models. Secondly, the hydrologic con-
sistency criteria, a pivotal component of our approach, only necessitates
simulation output time series. This means that it can be employed to
analyze the tness of runoff simulation time series produced by various
watershed models. The adaptability of this criterion adds to the gener-
alizability of our approach, making it applicable to a wide range of
hydrological models. The generality of our approach is supported by the
adaptability of surrogate models and the inclusivity of hydrologic con-
sistency criteria, rendering it applicable and valuable in the broader
context of hydrological modeling.
5. Conclusions
This study is novel in 1) providing the first comprehensive comparative analysis of both multiple surrogate and non-surrogate MO algorithms for efficient calibration of computationally expensive hydrological models and 2) proposing a framework to evaluate and filter the many non-dominated solutions obtained from multi-objective optimization using hydrologic signatures.
We compared the performance of 7 different MO algorithms, including the surrogate-assisted methods, GOMORS and ParEGO, the multi-method and adaptive evolutionary search methods, AMALGAM and Borg MOEA, and the widely used MOEAs, NSGA-II, ε-MOEA, and MOEA/D. These algorithms are tested on two different watershed models that were developed using the Soil and Water Assessment Tool (SWAT) and with two different multi-objective optimization formulations.
As is also illustrated in our analysis, the selection of both an appro-
priate algorithm and the objective function vector (i.e., formulations) is
extremely important in hydrologic model calibration, especially when
the underlying watershed problems are computationally expensive, and
the number of available evaluations is limited. GOMORS outperforms all
other algorithms on all four watershed calibration problems (two
watershed models with two different formulations) based on a) the
traditional hypervolume metric-based analysis, b) visual comparison of
trade-off curves, and c) statistical testing, within an evaluation budget of
600. The second-best method overall is the surrogate-based ParEGO, which supports the usefulness of surrogates for computationally expensive functions. However, GOMORS significantly outperforms ParEGO as well as the other algorithms, which suggests that the specifics of the surrogate algorithm are also important.
We proposed a new framework that uses hydrologic signatures to evaluate and filter the non-dominated solutions from the multi-objective calibration. The framework uses hydrologic consistency frequency (based on the number of defined hydrologic criteria satisfied) to assess the quality of non-dominated solutions from a MO calibration search. We adopt the hydrologic consistency frequency plot from Shafii and Tolson (2015) to assess the overall quality of solutions from a MO calibration search and propose a new hydrologic consistency frequency heatmap to assess the calibration quality in terms of each hydrologic criterion. These analyses allow us to assess the calibration quality from different MO formulations and to filter the non-dominated solutions to find the high-quality solutions, in order to help make sensible decisions about which set of parameter values should be used.
Our hydrologic consistency analysis indicates that, among the two MO formulations tested, the three-objective decomposition of NSE performs better when tested on the two watersheds analyzed in this study (e.g., solutions from this formulation have a higher probability of satisfying more hydrologic criteria). We also found that, for both watershed models, the calibration solutions overall have relatively low frequencies of satisfaction of DP and FMS (see Table A2 for definitions),
which might be attributed to errors arising from weather data inaccur-
acies, parameterization & model setup inadequacies, and SWAT model
structural errors.
We also analyzed the quality of filtered non-dominated solutions based on the hydrologic consistency analysis. Results show that the filtered non-dominated solutions from GOMORS using the three-objective decomposition of NSE generally capture the measured hydrology adequately, both in terms of timing and overall water balance. This is facilitated by the use of an efficient surrogate algorithm for MO calibration and by the effectiveness of hydrologic criteria for non-dominated solution filtering.
The framework we proposed is general and could be used for other
watershed model calibration problems, which could be investigated in
future work. This framework could also be extended for efficient sequential model calibration, where parameter ranges or model structure may be adjusted iteratively or sequentially in each calibration run/iteration.
Software availability
Name of software: GOMORS_pySOT
Description: A surrogate-assisted Multi-Objective Optimization (MO) strategy, designed for computationally expensive MO problems, e.g., expensive environmental simulation optimization problems, hyperparameter tuning of Deep Neural Networks, etc. The Townbrook watershed SWAT model calibration problem is provided as an example of the use of GOMORS.
Developer: Taimoor Akhtar. Contact: taimoor.akhtar@gmail.com
Program language: Python.
Availability and cost: Free and open source for non-commercial use. Available on GitHub at https://github.com/drkupi/GOMORS_pySOT.
CRediT authorship contribution statement
Wei Xia: Writing - review & editing, Writing - original draft, Visualization, Validation, Software, Methodology, Investigation, Formal analysis, Conceptualization. Taimoor Akhtar: Writing - review & editing, Writing - original draft, Visualization, Validation, Software, Resources, Methodology, Investigation, Formal analysis, Conceptualization. Wei Lu: Writing - review & editing, Methodology, Investigation, Formal analysis. Christine A. Shoemaker: Writing - review & editing, Supervision, Funding acquisition, Conceptualization.
Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Data availability
I have shared the link to the code in the manuscript
Acknowledgments
This research was done primarily at the National University of Singapore (NUS), supported by the National Research Foundation, Prime Minister's Office, Singapore under its Campus for Research Excellence and Technological Enterprise (CREATE) program. Dr. Xia, Dr. Akhtar and Dr. Lu were also supported by Prof. Shoemaker's NUS start-up grant. This work is an extension of research started at Cornell University by Akhtar and Shoemaker with financial support from the Fulbright-HEC Pakistan program and from a USA-NSF grant to Prof. Shoemaker. The data and algorithms used in this study are provided in tables and figures or listed in the references. The model and data for the Townbrook watershed are shared in a Hydroshare repository (Xia et al., 2023) as an example for readers to run the GOMORS code. The data related to the Townbrook and Cannonsville watershed models were obtained by Prof. Bryan A. Tolson and Prof. Shoemaker from the New York State government.
Appendix A. Supplementary data
Supplementary data to this article can be found online at https://doi.org/10.1016/j.envsoft.2024.105983.
Appendix A: Table A1
Description of the SWAT model parameters calibrated in the two models used in this study, showing the parameter range in the original optimization setting and the parameter ranges of the "Filtered Only" and "Filtered ND" solutions. "Filtered Only" (118 out of 600) is the subset of calibrations (derived from the median GOMORS run for the FW-3 problem) that satisfy at least 7 hydrologic consistency criteria, whereas "Filtered ND" (21 out of 600; derived from the median GOMORS run for the FW-3 problem) is the ensemble/subset of calibrations that are both non-dominated and satisfy at least 7 hydrologic consistency criteria.

No. | SWAT Code | Input file  | Definition and unit (if any)                                                                       | Original range | Calibrated range (Filtered Only) | Calibrated range (Filtered ND)
1   | SFTMP     | .bsn        | Snowfall temperature (°C)                                                                          | -5 to 5        | -5.0 to 3.80                     | -5.0 to 0.998
2   | SMTMP     | .bsn        | Snow melt base temperature (°C)                                                                    | -5 to 5        | -5.0 to 5.0                      | -3.6 to 5.0
3   | SMFMX     | .bsn        | Maximum melt rate for snow during the year, where deg C refers to the air temperature (mm H2O/°C-day) | 1.5 to 8    | 1.5 to 8.0                       | 1.5 to 8.0
4   | TIMP      | .bsn        | Snowpack temperature lag factor                                                                    | 0.01 to 1      | 0.04 to 1.0                      | 0.53 to 1
5   | SURLAG    | .bsn        | Surface runoff lag coefficient                                                                     | 1 to 24        | 1 to 24.0                        | 1.0 to 5.65
6   | GW_DELAY  | .gw         | Groundwater delay time (days)                                                                      | 0.001 to 500   | 0.001 to 500                     | 0.001 to 8.40
7   | ALPHA_BF  | .gw         | Alpha factor for groundwater recession curve                                                       | 0.001 to 1     | 0.082 to 1                       | 0.83 to 1
8   | GWWQMN    | .gw         | Threshold depth for shallow aquifer (mm H2O)                                                       | 0.001 to 500   | 0.001 to 500                     | 183.27 to 500
9   | LAT_TIME  | .hru        | Lateral flow travel time (days)                                                                    | 0.001 to 180   | 0.001 to 180                     | 0.001 to 101.40
10  | ESCO      | .bsn & .hru | Soil evaporation compensation factor                                                               | 0.01 to 1      | 0.01 to 0.85                     | 0.01 to 0.26
11  | CN2_f     | .mgt        | CN multiplicative factor                                                                           | 0.75 to 1.25   | 0.75 to 0.99                     | 0.75 to 0.89
12  | DEPTH_f   | .sol        | Depth multiplicative factor                                                                        | 0 to 1         | 0 to 1                           | 0.014 to 1
13  | BD_f      | .sol        | Bulk density multiplicative factor                                                                 | 0 to 1         | 0 to 1                           | 0 to 1
14  | AWC_f     | .sol        | Available water content multiplicative factor                                                      | 0 to 1         | 0.0004 to 1                      | 0.73 to 1
15  | KSAT_f    | .sol        | Saturated hydraulic conductivity multiplicative factor                                             | 0 to 1         | 0 to 1                           | 0 to 0.18
Appendix B: Table A2
Details of the Hydrological Signatures Used in this Study.

Group: Water balance
  RR (Overall runoff to rainfall ratio): RR = Σ_{t=1..N} Q_t / Σ_{t=1..N} P_t, where Q_t and P_t are the runoff and precipitation in time step t.

Group: Flow duration curve (FDC)
  FHV (High-segment volume): FHV = Σ_{h=1..H} Q_h, where h are flow indices located within the high-flow segment (exceedance probabilities lower than 5%) and H is the index of the maximum flow.
  FMS (Mid-segment slope): FMS = log(Q_m1) - log(Q_m2), where m1 and m2 are the lowest (20%) and highest (70%) flow exceedance probabilities located at the two sides of the mid-segment.
  FLV (Low-segment volume): FLV = Σ_{l=1..L} [log(Q_l) - log(Q_L)], where l are flow indices located within the low-flow segment (flow exceedance probabilities higher than 70%) and L is the index of the minimum flow.

Group: Discharge statistics
  DS (Discharge standard deviation): σ = sqrt( Σ_{t=1..N} (Q_t - Q̄)² / N ), the standard deviation of the flow time series.
  DM (Mean discharge): μ = Σ_{t=1..N} Q_t / N, the mean of the flow time series.
  DP (Peak discharge): max(Q_t), the peak of the flow time series.
  DMD (Median discharge): Med(Q_t), the median of the flow time series.
  DAC (Lag-1 autocorrelation coefficient): DAC = Σ_{t=1..N-1} (Q_t - Q̄)(Q_{t+1} - Q̄) / Σ_{t=1..N} (Q_t - Q̄)², where Q̄ is the mean flow.
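For concreteness, a minimal sketch of how a few of these signatures can be computed from daily flow and precipitation series follows (illustrative only; segment limits follow the definitions above, and zero flows would require special handling in the logarithmic terms):

import numpy as np

def example_signatures(q, p):
    # q: flow time series; p: precipitation time series (same length).
    q, p = np.asarray(q, dtype=float), np.asarray(p, dtype=float)
    qbar = q.mean()
    q_desc = np.sort(q)[::-1]                                  # flows in descending order
    exceed = np.arange(1, len(q) + 1) / (len(q) + 1)           # exceedance probabilities
    high = q_desc[exceed < 0.05]                               # high-flow segment
    low = q_desc[exceed > 0.70]                                # low-flow segment
    return {
        "RR": q.sum() / p.sum(),                               # runoff-to-rainfall ratio
        "FHV": high.sum(),                                     # high-segment volume
        "FLV": float(np.sum(np.log(low) - np.log(low.min()))), # low-segment volume
        "DP": q.max(),                                         # peak discharge
        "DMD": float(np.median(q)),                            # median discharge
        "DAC": float(np.sum((q[:-1] - qbar) * (q[1:] - qbar))
                     / np.sum((q - qbar) ** 2)),               # lag-1 autocorrelation
    }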
References
Abbaspour, K.C., Johnson, C.A., van Genuchten, M.T., 2004. Estimating uncertain ow
and transport parameters using a sequential uncertainty tting procedure. Vadose
Zone J. 3 (4), 13401352. https://doi.org/10.2136/vzj2004.1340.
Ahmadi, M., Arabi, M., Ascough, J.C., Fontane, D.G., Engel, B.A., 2014. Toward
improved calibration of watershed models: multisite multiobjective measures of
information. Environ. Model. Software 59, 135145. https://doi.org/10.1016/j.
envsoft.2014.05.012.
Akhtar, T., Shoemaker, C.A., 2016. Multi objective optimization of computationally
expensive multi-modal functions with RBF surrogates and multi-rule selection.
J. Global Optim. 64 (1), 1732. https://doi.org/10.1007/s10898-015-0270-y.
Arnold, J.G., Moriasi, D.N., Gassman, P.W., Abbaspour, K.C., White, M.J., Srinivasan, R., et al., 2012. SWAT: model use, calibration, and validation. Transactions of the ASABE 55 (4), 1491-1508. ISSN 2151-0032.
Asadzadeh, M., Tolson, B., 2013. Pareto archived dynamically dimensioned search with
hypervolume-based selection for multi-objective optimization. Eng. Optim. 45 (12),
14891509. https://doi.org/10.1080/0305215X.2012.748046.
Asadzadeh, M., Tolson, B.A., Burn, D.H., 2014. A new selection metric for multiobjective
hydrologic model calibration. Water Resour. Res. 50 (9), 70827099. https://doi.
org/10.1002/2013WR014970.
Auger, A., Bader, J., Brockhoff, D., Zitzler, E., 2009. Theory of the hypervolume indicator: optimal μ-distributions and the choice of the reference point. Proceedings of the Tenth ACM SIGEVO Workshop on Foundations of Genetic Algorithms, pp. 87-102. https://doi.org/10.1145/1527125.1527138.
Baú, D.a., Mayer, A.S., 2006. Stochastic management of pump-and-treat strategies using
surrogate functions. Adv. Water Resour. 29 (12), 19011917. https://doi.org/
10.1016/j.advwatres.2006.01.008.
Behzadian, K., Kapelan, Z., Savic, D., Ardeshir, A., 2009. Environmental Modelling &
Software Stochastic sampling design using a multi-objective genetic algorithm and
adaptive neural networks. Environ. Model. Software 24 (4), 530541. https://doi.
org/10.1016/j.envsoft.2008.09.013.
Bekele, E., Nicklow, J., 2007. Multi-objective automatic calibration of SWAT using
NSGA-II. J. Hydrol. 341 (34), 165176. https://doi.org/10.1016/j.
jhydrol.2007.05.014.
Beven, K., Binley, A., 1992. The future of distributed models: model calibration and
uncertainty prediction. Hydrol. Process. 6 (3), 279298.
Beven, K., Binley, A., 2014. GLUE: 20 years on. Hydrol. Process. 28 (24), 5897-5918.
Boyle, D.P., Gupta, H.V., Sorooshian, S., 2000. Toward improved calibration of
hydrologic models: combining the strengths of manual and automatic methods.
Water Resour. Res. 36 (12), 3663. https://doi.org/10.1029/2000WR900207.
Castelletti, A., Pianosi, F., Soncini-Sessa, R., Antenucci, J.P., 2010. A multiobjective
response surface approach for improved water quality planning in lakes and
reservoirs. Water Resour. Res. 46 (6), 116. https://doi.org/10.1029/
2009WR008389.
Cheng, Y., Xia, W., Detto, M., Shoemaker, C.A., 2023. A framework to calibrate
ecosystem demography models within Earth system models using parallel surrogate
global optimization. Water Resour. Res. 59 (1), e2022WR032945.
Chilkoti, V., Bolisetti, T., Balachandar, R., 2018. Multi-objective autocalibration of SWAT
model for improved low ow performance for a small snowfed catchment. Hydrol.
Sci. J. 63 (10), 14821501. https://doi.org/10.1080/02626667.2018.1505047.
Coello, C.A.C., Lamont, G.B., Van Veldhuizen, D.A., 2007. Evolutionary Algorithms for
Solving Multi-Objective Problems, vol. 5. Springer, New York, pp. 79104.
Conover, W.J., 1998. Practical Nonparametric Statistics, third ed. Wiley.
Deb, K., Pratap, A., Agarwal, S., Meyarivan, T., 2002. A fast and elitist multi-objective genetic algorithm: NSGA-II. IEEE Trans. Evol. Comput. 6 (2), 182-197.
di Pierro, F., Khu, S., Savic, D., Berardi, L., 2009. Efcient multi-objective optimal design
of water distribution networks on a budget of simulations using hybrid algorithms.
Environ. Model. Software 24, 202213. https://doi.org/10.1016/j.
envsoft.2008.06.008.
Ercan, M.B., Goodall, J.L., 2016. Design and implementation of a general software
library for using NSGA-II with SWAT for multi-objective model calibration. Environ.
Model. Software 84, 112120. https://doi.org/10.1016/J.ENVSOFT.2016.06.017.
Eriksson, D., Bindel, D., Shoemaker, C.A., 2019. pySOT and POAP: an event-driven
asynchronous framework for surrogate optimization. ArXiv Preprint ArXiv:
1908.00420 1, 119 arxiv.org/abs/1908.00420.
Espinet, A.J., Shoemaker, C.a., 2013. Comparison of optimization algorithms for
parameter estimation of multi-phase ow models with application to geological
carbon sequestration. Adv. Water Resour. 54, 133148. https://doi.org/10.1016/j.
advwatres.2013.01.003.
Franco, A.C.L., Oliveira, D.Y. de, Bonumá, N.B., 2020. Comparison of single-site, multi-site and multi-variable SWAT calibration strategies. Hydrol. Sci. J. 65 (14), 2376-2389. https://doi.org/10.1080/02626667.2020.1810252.
Gong, W., Duan, Q., 2017. An adaptive surrogate modeling-based sampling strategy for
parameter optimization and distribution estimation (ASMO-PODE). Environ. Model.
Software 95, 6175. https://doi.org/10.1016/j.envsoft.2017.05.005.
Gupta, H.V., Bastidas, L.A., Vrugt, J.A., Sorooshian, S., 2003. Multiple criteria global
optimization for watershed model calibration. In: Duan, Q., Gupta, H.V.,
Sorooshian, S., Rousseau, A.N., Turcotte, R. (Eds.), Calibration of Watershed Models.
Gupta, H.V., Sorooshian, S., Yapo, P.O., 1998. Toward improved calibration of
hydrologic models: multiple and noncommensurable measures of information. Water
Resour. Res. 34 (4), 751. https://doi.org/10.1029/97WR03495.
W. Xia et al.
Environmental Modelling and Software 175 (2024) 105983
18
Gupta, H.V., Kling, H., Yilmaz, K.K., Martinez, G.F., 2009. Decomposition of the mean squared error and NSE performance criteria: implications for improving hydrological modelling. J. Hydrol. 377 (1-2), 80-91. https://doi.org/10.1016/j.jhydrol.2009.08.003.
Hadka, D., Reed, P., 2013. BORG: an auto-adaptive many-objective evolutionary
computing framework. Evol. Comput. 21 (2), 231259. https://doi.org/10.1162/
EVCO_a_00075.
Her, Y., Seong, C., 2018. Responses of hydrological model equifinality, uncertainty, and performance to multi-objective parameter calibration. J. Hydroinf. 20 (4), 864-885.
Hingray, B., Schaei, B., Mezghani, A., Hamdi, Y., 2010. Signature-based model
calibration for hydrological prediction in mesoscale Alpine catchments. Hydrol. Sci.
J. 55 (6), 10021016. https://doi.org/10.1080/02626667.2010.505572.
Knowles, J., 2006. ParEGO: a hybrid algorithm with on-line landscape approximation for
expensive multiobjective optimization problems. IEEE Trans. Evol. Comput. 10 (1),
5066. https://doi.org/10.1109/TEVC.2005.851274.
Jones, D.R., Schonlau, M., Welch, W.J., 1998. Efcient global optimization of expensive
black-box functions. J. Global Optim. 455492. https://doi.org/10.1023/A:
1008306431147.
Kling, H., Fuchs, M., Paulin, M., 2012. Runoff conditions in the upper Danube basin under an ensemble of climate change scenarios. J. Hydrol. 424-425, 264-277.
Kollat, J.B., Reed, P.M., Wagener, T., 2012. When are multiobjective calibration trade-
offs in hydrologic models meaningful? Water Resour. Res. 48 (3), 119. https://doi.
org/10.1029/2011WR011534.
Kuczera, G., Parent, E., 1998. Monte Carlo assessment of parameter uncertainty in
conceptual catchment models: the Metropolis algorithm. J. Hydrol. 211 (14),
6985.
Laumanns, M., Thiele, L., Deb, K., Zitzler, E., 2002. Combining convergence and diversity
in evolutionary multi-objective optimization. Evol. Comput. 10 (3), 263282.
Lu, D., Ricciuto, D., Stoyanov, M., Gu, L., 2018. Calibration of the E3SM land model
using surrogate-based global optimization. J. Adv. Model. Earth Syst. 10 (6),
13371356.
Lu, W., Qin, X., Yu, J., 2019. On comparison of two-level and global optimization
schemes for layout design of storage ponds. J. Hydrol. 570, 544554.
Madsen, H., 2003. Parameter estimation in distributed hydrological catchment
modelling using automatic calibration with multiple objectives. Adv. Water Resour.
26 (2), 205216. https://doi.org/10.1016/S0309-1708(02)00092-1.
Madsen, H., Wilson, G., Ammentorp, H.C., 2002. Comparison of different automated
strategies for calibration of rainfall-runoff models. J. Hydrol. 261, 4859.
Maier, H.R., Kapelan, Z., Kasprzyk, J.R., Kollat, J.B., Matott, L.S., Cunha, M.C., Dandy, G.
C., Gibbs, M.S., Keedwell, E., Marchi, A., Ostfeld, A., Savic, D., Solomatine, D.P.,
Vrugt, J.A., Zecchin, A.C., Minsker, B.S., Barbour, E.J., Kuczera, G., Pasha, F., et al.,
2014. Evolutionary algorithms and other metaheuristics in water resources: current
status, research challenges and future directions. Environ. Model. Software 62,
271299. https://doi.org/10.1016/j.envsoft.2014.09.013.
Martinez, G.F., Gupta, H.V., 2010. Toward improved identication of hydrological
models: a diagnostic evaluation of the abcdmonthly water balance model for the
conterminous United States. Water Resour. Res. 46 (8), 121. https://doi.org/
10.1029/2009WR008294.
Me, W., Abell, J.M., Hamilton, D.P., 2015. Effects of hydrologic conditions on SWAT model performance and parameter sensitivity for a small, mixed land use catchment in New Zealand. Hydrol. Earth Syst. Sci. 19 (10), 4127-4147.
McDonald, S., Mohammed, I.N., Bolten, J.D., Pulla, S., Meechaiya, C., Markert, A.,
Nelson, E.J., Srinivasan, R., Lakshmi, V., 2019. Web-based decision support system
tools: the Soil and Water Assessment Tool Online visualization and analyses
(SWATOnline) and NASA earth observation data downloading and reformatting tool
(NASAaccess). Environ. Model. Software 120 (December 2018), 104499. https://doi.
org/10.1016/j.envsoft.2019.104499.
Mugunthan, P., Shoemaker, C.A., 2006. Assessing the impacts of parameter uncertainty
for computationally expensive groundwater models. Water Resour. Res. 42 (10)
https://doi.org/10.1029/2005WR004640.
Muhammad, A., Evenson, G.R., Stadnyk, T.A., Boluwade, A., Jha, S.K., Coulibaly, P., 2019. Impact of model structure on the accuracy of hydrological modeling of a Canadian Prairie watershed. J. Hydrol.: Reg. Stud. 21, 40-56.
Müller, J., Shoemaker, C.A., Piché, R., 2013. SO-MI: a surrogate model algorithm for computationally expensive nonlinear mixed-integer black-box global optimization problems. Comput. Oper. Res. 40 (5), 1383-1400.
Nash, J.E., Sutcliffe, J.V., 1970. River ow forecasting through conceptual models part I -
A discussion of principles. J. Hydrol. 10 (3), 282290. https://doi.org/10.1016/
0022-1694(70)90255-6.
Nicklow, J., Reed, P., Savic, D., Dessalegne, T., 2010. State of the art for genetic
algorithms and beyond in water resources planning and management. J. Water
Resour. Plan. Manag. 136 (4), 412432.
Razavi, S., Tolson, B.a., Burn, D.H., 2012. Review of surrogate modeling in water
resources. Water Resour. Res. 48 (7), W07401. https://doi.org/10.1029/
2011WR011527.
Reed, P.M., Hadka, D., Herman, J.D., Kasprzyk, J.R., Kollat, J.B., 2012. Evolutionary
multiobjective optimization in water resources: the past, present, and future. Adv.
Water Resour. https://doi.org/10.1016/j.advwatres.2012.01.005.
Regis, R.G., Shoemaker, C.A., 2013. Combining radial basis function surrogates and
dynamic coordinate search in high-dimensional expensive black-box optimization.
Eng. Optim. 45 (5), 529555. https://doi.org/10.1080/0305215X.2012.687731.
Regis, R.G., Shoemaker, C.A., 2007. Stochastic radial basis function method for the
global optimization of expensive functions. Inf. J. Comput. 19, 497509.
Sahraei, S., Asadzadeh, M., Shafii, M., 2019. Toward effective many-objective optimization: rounded-archiving. Environ. Model. Software 122, 104535. https://doi.org/10.1016/j.envsoft.2019.104535.
Sahraei, S., Asadzadeh, M., Unduche, F., 2020. Signature-based multi-modelling and
multi-objective calibration of hydrologic models: application in ood forecasting for
Canadian Prairies. J. Hydrol. 588 (May), 125095 https://doi.org/10.1016/j.
jhydrol.2020.125095.
Shai, M., Tolson, B.A., 2015. Optimizing Hydrological Consistency by Incorporating
Hydrological Signatures into Model Calibration Objectives. Water Resources
Research, pp. 26162633.
Tang, Y., Reed, P.M., Kollat, J.B., 2007. Parallelization strategies for rapid and robust
evolutionary multiobjective optimization in water resources applications. Adv.
Water Resour. 30 (3), 335353. https://doi.org/10.1016/j.advwatres.2006.06.006.
Tang, Y., Reed, P., Wagener, T., 2006. How effective and efcient are multiobjective
evolutionary algorithms at hydrologic model calibration? Hydrol. Earth Syst. Sci. 10
(2), 289307. https://doi.org/10.5194/hess-10-289-2006.
Tolson, B.A., Shoemaker, C.A., 2007a. Cannonsville reservoir watershed SWAT2000 model development, calibration and validation. J. Hydrol. 337 (1-2), 68-86. https://doi.org/10.1016/j.jhydrol.2007.01.017.
Tolson, B.A., Shoemaker, C.A., 2007b. Dynamically dimensioned search algorithm for
computationally efcient watershed model calibration. Water Resour. Res. 43 (1),
116. https://doi.org/10.1029/2005WR004723.
Vrugt, J.A., Gupta, H.V., Bouten, W., Sorooshian, S., 2003. A Shufed Complex Evolution
Metropolis algorithm for optimization and uncertainty assessment of hydrologic
model parameters. Water Resour. Res. 39 (8) https://doi.org/10.1029/
2002WR001642.
Vrugt, J. a, Robinson, B.a., 2007. Improved evolutionary optimization from genetically
adaptive multimethod search. Proc. Natl. Acad. Sci. U.S.A. 104 (3), 708711.
https://doi.org/10.1073/pnas.0610471104.
Wang, Y., Jiang, R., Xie, J., Zhao, Y., Yan, D., Yang, S., 2019. Soil and water assessment
tool (SWAT) model: a systemic review. J. Coast Res. 93 (sp1), 22. https://doi.org/
10.2112/si93-004.1.
Wild, S.M., Regis, R.G., Shoemaker, C.A., 2008. ORBIT: Optimization by radial basis
function interpolation in trust-region. SIAM J. Sci. Comput. 30 (6), 31973219.
Wu, H., Chen, B., Ye, X., Guo, H., Meng, X., Zhang, B., 2021. An improved calibration and
uncertainty analysis approach using a multicriteria sequential algorithm for
hydrological modeling. Sci. Rep. 11 (1), 16954.
Xia, W., Akhtar, T., Shoemaker, C.A., 2022. A novel objective function DYNO for
automatic multivariable calibration of 3D lake models. Hydrol. Earth Syst. Sci. 26
(13), 36513671. https://doi.org/10.5194/hess-26-3651-2022.
Xia, W., Shoemaker, C., 2021. GOPS: efcient RBF surrogate global optimization
algorithm with high dimensions and many parallel processors including application
to multimodal water quality PDE model calibration. Optim. Eng. 22, 27412777.
Xia, W., Shoemaker, C.A., 2022a. A repetitive parameterization and optimization
strategy for the calibration of complex and computationally expensive process-based
models with application to a 3D water quality model of a tropical reservoir. Water
Resour. Res. 58 (5), e2021WR031054.
Xia, W., Shoemaker, C.A., 2022b. Improving the speed of global parallel optimization on
PDE models with processor afnity scheduling. Comput. Aided Civ. Infrastruct. Eng.
37 (3), 279299.
Xia, W., Shoemaker, C., Akhtar, T., Nguyen, M.T., 2021. Efficient parallel surrogate
optimization algorithm and framework with application to parameter calibration of
computationally expensive three-dimensional hydrodynamic lake PDE models.
Environ. Model. Software 135, 104910.
Xia, W., Akhtar, T., Lu, W., Shoemaker, C.A., 2023. Enhanced watershed model evaluation incorporating hydrologic signatures and consistency within efficient surrogate multi-objective optimization. Hydroshare [Dataset]. http://www.hydroshare.org/resource/77f2f1a6625c4a03b7660a87f55faaa4.
Yapo, P., Gupta, H., Sorooshian, S., 1998. Multi-objective global optimization for
hydrologic models. J. Hydrol. 204 (14), 8397. https://doi.org/10.1016/S0022-
1694(97)00107-8.
Yang, G., Guo, S., Liu, P., Li, L., Liu, Z., 2017. Multiobjective cascade reservoir operation
rules and uncertainty analysis based on PA-DDS algorithm. J. Water Resour. Plann.
Manag. 143 (7), 04017025.
Zamani, M., Shrestha, N.K., Akhtar, T., Boston, T., Daggupati, P., 2020. Advancing model calibration and uncertainty analysis of SWAT models using cloud computing infrastructure: LCC-SWAT. J. Hydroinf., October. https://doi.org/10.2166/hydro.2020.066.
Zhang, Q., Li, H., 2007. MOEA/D: a multiobjective evolutionary algorithm based on decomposition. IEEE Trans. Evol. Comput. 11 (6), 712–731. https://doi.org/10.1109/TEVC.2007.892759.
Zhang, X., Beeson, P., Link, R., Manowitz, D., Izaurralde, R.C., Sadeghi, A., Thomson, A.M., Sahajpal, R., Srinivasan, R., Arnold, J.G., 2013. Efficient multi-objective calibration of a computationally intensive hydrologic model with parallel computing software in Python. Environ. Model. Software 46, 208–218. https://doi.org/10.1016/j.envsoft.2013.03.013.
Zou, R., Lung, W.S., Wu, J., 2007. An adaptive neural network embedded genetic algorithm approach for inverse water quality modeling. Water Resour. Res. 43 (8), 1–13. https://doi.org/10.1029/2006WR005158.