Environmental Modelling and Software 175 (2024) 105983
Available online 14 February 2024
1364-8152/© 2024 Elsevier Ltd. All rights reserved.
Enhanced watershed model evaluation incorporating hydrologic signatures and consistency within efficient surrogate multi-objective optimization
Wei Xia a,b,c,*, Taimoor Akhtar d, Wei Lu e, Christine A. Shoemaker a,b,c
a Department of Civil and Environmental Engineering, National University of Singapore, 117576, Singapore
b Department of Industrial Systems Engineering and Management, National University of Singapore, 117576, Singapore
c Energy and Environmental Sustainability for Megacities (E2S2) Phase II, Campus for Research Excellence and Technological Enterprise (CREATE), 138602, Singapore
d RWDI Consulting Engineers and Scientists, N1G 4P6, ON, Canada
e Ant Group, 310000, China
ARTICLE INFO
Keywords:
Surrogate model
Multi-objective optimization
Watershed model
Automatic calibration
Hydrology signature
ABSTRACT
This paper presents a new framework for calibrating computationally expensive watershed models with multi-objective optimization methods and hydrological consistency analysis. The analysis evaluates different algorithms' efficiencies for finding watershed model calibration solutions within a limited budget. Two surrogate multi-objective algorithms, GOMORS and ParEGO, are compared to five evolutionary algorithms without surrogates on two watershed models. We test the algorithms' performance with two multi-objective formulations (i.e., threshold-based flow separation and decomposition of the Nash-Sutcliffe Efficiency (NSE)). Results indicate that the surrogate-based GOMORS is the most computationally efficient overall. We also propose a framework to select among the calibration solutions obtained from multi-objective optimization using different hydrologic signatures. GOMORS is assessed for its ability to identify hydrologically acceptable calibrations. The decomposition of NSE is the most effective calibration formulation in terms of hydrologic consistency analysis. In addition, hydrologic signatures could be used effectively to filter non-dominated solutions obtained from multi-objective optimization.
1. Introduction
1.1. Background and motivation
Analyzing trade-offs between conicting objectives for parameter
estimation of watershed models through Multi-Objective (MO) optimi-
zation can identify multiple plausible calibration alternatives that pro-
vide valuable information during the parameter estimation process to
enhance understanding of model adequacy, uncertainty and structural &
data deciencies (Kollat et al., 2012).
However, there are numerous challenges to model calibration via MO optimization. For instance, a major reason multi-objective optimization is not more widely used is that it usually requires many more model simulations (especially if many objectives are included in an optimization formulation) than single objective optimization of parameter goodness of fit, which is difficult for complex simulation models that are computationally expensive. Moreover, the calibration perspective typically includes many potentially conflicting objectives, due to which identification of a suitable MO calibration formulation, where the number of objectives is limited, is a non-trivial task.
Hence, in order to make MO calibration more accessible for modelers who are dealing with distributed and expensive watershed models, it is imperative to i) identify algorithms that are suitable for calibration on a limited budget and ii) identify combinations of calibration objectives that can adequately represent the many conflicting hydrologic model calibration targets within a limited number of objectives (in order to ensure that calibration optimization is not overly complex or computationally difficult). In this study, calibration with a limited budget refers to cases where the number of allowed evaluations for the calibration is relatively small and is constrained by factors including 1) the computing resources (both in computing speed and time) that the calibration can use, 2) the wall-clock time that the person conducting the calibration is willing to allocate or wait for the optimization, and 3) the computing time of each model evaluation.
Hydrologic Signatures (HS) are important streamow characteristics
of the natural ow regimes, such as timing and magnitude of extreme
ows (Shai and Tolson, 2015), that are critical for adequate evaluation
of hydrologic models. Recently, HS were utilized in many studies as
supplements (Chilkoti et al., 2018) or alternatives (Sahraei et al., 2020)
to statistical Goodness of Fit (GOF) measures for investigating whether
streamow components are properly simulated in watershed models.
However, there are many hydrologic signatures (e.g., in our study, about
9 hydrologic signatures are considered), hence it is challenging to use so
many HS directly as the sub-objective functions of MO calibration. It is
thus also important to understand how HS can be incorporated into
calibration frameworks for expensive MO where simulation evaluation
budgets are limited.
1.2. Literature review
Watershed simulation models typically involve i) physical parameters that are difficult to measure directly, and/or ii) conceptual parameters that are impossible to measure. Conceptual parameters stem from considerable simplifications employed in modeling of the natural processes within the watershed. Model calibration is a process that can be employed to adjust values of such parameters, where the aim is to mimic reality, via comparison of model response to historically observed measurements. Traditional calibration methodologies typically involve manual trial-and-error, with expert opinion of a hydrologist within an interactive calibration framework. The value of expert opinion cannot be disregarded; however, manual trial-and-error methods can be extremely time consuming and complicated.
A bulk of prior and contemporary research has focused on automatic
calibration schemes involving single objective optimization (e.g., Boyle
et al., 2000; Madsen et al., 2002; Tolson and Shoemaker, 2007b; Vrugt
et al., 2003). The simulation-optimization problem for a complex
watershed model formulated within the automatic calibration frame-
work is typically non-linear and probably has multiple local minima (i.
e., is multi-modal) (Tang et al., 2006), and various studies have
employed heuristic search techniques seeking the globally optimal
solution.
Watershed calibration based on a single aggregated calibration performance metric could lead to a significant loss of information within the calibration process. The calibration process could be highly sensitive to various factors, especially the proposed objective function used to assess the adequacy of a set of parameter values. Moreover, calibration experts typically consider many criteria in the model calibration process, most of which are not even included in the objective function within an automatic calibration scheme (Shafii and Tolson, 2015).
Prior research efforts (Bekele and Nicklow, 2007; Gupta et al., 1998; Yapo et al., 1998) indicate that significant conflicts and trade-offs might exist between calibration criteria, and that visualization of these trade-offs can assist decision makers in understanding model limits and choosing appropriate calibrations. Numerous performance measures can be employed to quantify the potentially conflicting calibration criteria. Statistical Goodness of Fit (GOF) metrics (e.g., Nash-Sutcliffe Efficiency (NSE) (Nash and Sutcliffe, 1970), bias (e.g. mean absolute error)) that are calculated from streamflow time series and also performance metrics (e.g., relative deviation) calculated from hydrological signatures (HS) (Hingray et al., 2010) are some of the measures that can be used as objective functions in multi-objective optimization formulations.
Many studies have focused on employing multi-objective optimization for watershed model calibration, highlighting the effectiveness of multi-objective analysis in deducing various Pareto optimal calibrations (Gupta et al., 2003; Madsen, 2003; Yapo et al., 1998), providing added insight into model ambiguity resulting from model imperfections and parameter uncertainty and understanding modeling limitations. Pareto optimal calibrations are sets of solutions found by multi-objective optimization that are non-dominated with respect to each other but are superior to the rest of the solutions in the search space. A solution is non-dominated when no other solution found in the multi-objective optimization search is better than it in terms of all objectives (Akhtar and Shoemaker, 2016). Ahmadi et al. (2014) apply multi-objective optimization for calibration of a SWAT model and conclude that MO optimization is more effective than single objective optimization. Wu et al. (2021) use a sequential multicriteria algorithm to iteratively adjust parameter ranges for hydrologic model calibration and uncertainty analysis. They also highlight that the computational requirements of calibrating distributed and semi-distributed hydrologic models are typically very high.
Numerous research contributions have been made in proposing al-
gorithms for multi-objective optimization, within the simulation-
optimization framework (e.g., Coello et al., 2007). Various algorithmic
contributions, within the water resources community have also been
made (Asadzadeh et al., 2014; Maier et al., 2014; Nicklow et al., 2010;
Sahraei et al., 2019; Tang et al., 2006, 2007), with many focusing on
evolutionary strategies within their optimization frameworks.
Multi-objective evolutionary strategies are frequently referred to as MOEAs in contemporary literature. Tang et al. (2006) provide a
comparative analysis of various evolutionary algorithms, in order to
assess their effectiveness in hydrological model calibration.
While prior research and current industrial calibration techniques
indicate the inherent multi-criteria nature of the hydrological model
calibration problem and emphasize the advantages of multi-objective
optimization in model calibration, the computational complexity of
distributed hydrological models poses a huge challenge to the use of
multi-criteria optimization algorithms in the calibration process. It
should be noted that the calibration optimization problem is a compu-
tationally expensive simulation optimization problem, since the objec-
tive function(s) are evaluated via simulations and running watershed
model simulations for each set of parameters considered can take a
signicant length of time.
There is a dearth in prior literature on the effective and efcient use
of MO algorithms for hydrologic model calibration for expensive prob-
lems that have a limited budget of simulation evaluations. Here, by
expensive problems, we refer to optimization problems where each
evaluation is resource-intensive in terms of computing resources and
time to complete the entire optimization process. Identifying algorithms
that are effective with a limited number of model evaluations provides
tools that can both be used for large watersheds and for watershed
models with a lot of spatial detail (both of which are more computa-
tionally expensive and hence the number of model evaluations will be
limited). Moreover, such algorithms are also effective in multi-step/
sequential calibration scenarios, i.e., scenarios where calibration ex-
perts apply automatic algorithms for model calibration in multiple it-
erations, with changes made in parameter choice and range in each
iteration (Xia et al., 2022a; Wu et al., 2021; Franco et al., 2020; Zamani
et al., 2020). Thus, assessment and identication of suitable MO algo-
rithms for expensive watershed problems and limited evaluation bud-
gets is important.
The use of surrogate models within an optimization algorithm can be highly effective in reducing time for computing objectives for multi-objective calibration of complex watershed problems. The terms "surrogate", "response surface" and "meta model" are all used to describe the use of existing information to build a multivariate approximation of the objective function or model simulation, which then guides the optimization search. Surrogate based single-objective optimization algorithms have been widely used in calibration applications in many areas. Most surrogate methods used in these optimization applications include: a) radial basis function (RBF) based methods (Müller et al., 2013; Regis & Shoemaker, 2007, 2013; Wild et al., 2008; Xia and Shoemaker, 2021; Xia et al., 2021), b) Kriging based methods (Gong and Duan, 2017; Jones et al., 1998), and c) artificial neural network (ANN) (Zou et al., 2007) based methods. Popular surrogate optimization applications in water resources include single-objective watershed model calibration (Regis and Shoemaker, 2013), groundwater model calibration (Mugunthan and Shoemaker, 2006), lake water quality and hydrodynamic model calibration (Xia & Shoemaker, 2022a, 2022b; Xia et al., 2022; Xia et al., 2021), earth system model calibration (Lu et al.,
2018; Cheng et al., 2023) and carbon sequestration model calibration
(Espinet and Shoemaker, 2013). Razavi et al. (2012) provided a
comprehensive review of literature on the use of surrogates in water
resources.
Surrogates have also been used in application of multi-objective optimization to complex water resources problems. Baú and Mayer (2006) employ kriging based surrogates within a multi-objective framework for optimal design of groundwater (pump-and-treat) remediation systems. Behzadian et al. (2009) combine a MOEA with an ANN to efficiently deduce optimal sampling locations of pressure loggers for a water distribution system. di Pierro et al. (2009) explore the use of surrogate based MO algorithms including PAREGO with application to water distribution network design. Castelletti et al. (2010) incorporate the use of numerous surrogate methods for efficient multi-objective optimization with application to water quality planning in reservoirs and lakes. Lu et al. (2019) combined polynomial regression models with an iterative algorithm for supporting storage pond design in an urban drainage system. However, the effective use of surrogate-based methods and other efficient MO algorithms for expensive watershed model calibration is relatively unexplored. For instance, there are no prior studies that adequately analyze and compare MO algorithms on a limited model evaluation budget.
In addition to the computational challenges associated with using MO for parameter calibration, the application of MO will yield a set of non-dominated solutions. These solutions are multi-objectively equivalent (i.e., without additional information, it is impossible to designate any of the non-dominated solutions as superior to other non-dominated solutions). Note that this multi-objective equivalence in non-dominated solutions is somewhat different from the well-known concept of equifinality (Beven and Binley, 1992), wherein multiple sets of parameters provide equally "good" or acceptable model performance. The non-dominated solutions achieved in MO represent solutions that are good in one or multiple ways in which the best fit of a model to the data can be defined. The equifinal and non-dominated sets of parameters might overlap but may not be equivalent, as also discussed in Gupta et al. (1998).
Identifying a set of calibration parameters rather than a single best parameter set (as is the case in MO calibration) means that the parameter uncertainty among the non-dominated solutions will be propagated to the simulation output, causing predictive uncertainty. Multiple methodologies have been proposed to address this limitation within the context of predictive uncertainty, for example the informal Bayesian generalized likelihood uncertainty estimation (GLUE) methodology (Beven and Binley, 1992) and various Bayesian Markov chain Monte Carlo (MCMC) methods (e.g., Kuczera and Parent, 1998; Vrugt et al., 2003).
The GLUE method involves generating a large ensemble of feasible parameter sets, commonly based on uniform random sampling, and then assessing the model performance for each set against observed data with a likelihood function. A subjective threshold on model performance is applied to select parameter sets with good model performance, which are known as "behavioral" sets. The model output uncertainty is represented by the likelihood-weighted output across the behavioral sets. Formal Bayesian approaches, in contrast, require the definition of a formal likelihood function, and the model output uncertainty is obtained by evaluating model output for a set of parameters sampled from the posterior parameter distributions. Both GLUE and formal Bayesian approaches often involve a large number of evaluations, with studies in the literature commonly reporting from 60,000 to 100,000 model evaluations. Consequently, GLUE and formal Bayesian approaches are computationally infeasible for problems that are computationally expensive.
The uncertainty approximation from non-dominated solutions does not apply a subjective threshold like GLUE but rather uses the non-dominated solutions obtained by MO. Additionally, no probability is applied when aggregating the model output across the multiple simulation outputs in the final set of non-dominated solutions. However, one issue related to utilizing non-dominated solutions for predictive uncertainty estimation lies in the presence of solutions that may exhibit poor performance in one or a few sub-objectives. These solutions, despite their suboptimal performance in specific aspects, might be included in the non-dominated set due to their superior performance in other sub-objectives. In addition, too many non-dominated solutions in the final solution set will enlarge the model predictive uncertainty. Consequently, it becomes crucial to implement a selection process among these non-dominated solutions to ensure that the selected solutions exhibit high-quality behavior.
There is limited knowledge regarding the best practices for selecting non-dominated solutions in MO calibration. Additional criteria (e.g., Hydrological Signature metrics (Hingray et al., 2010)) beyond the performance metrics used in MO calibration could potentially be used to screen the non-dominated solutions. However, no studies have explored this possibility. Moreover, there is limited knowledge available on how to effectively incorporate HS into model calibration when models are expensive and simulation budgets are limited (e.g., the number of evaluations allowed for the whole optimization search process is less than 1000). For instance, Sahraei et al. (2020) and Shafii and Tolson (2015) investigate different calibration formulations that use HS-based and statistical GOF based objective functions but with budgets in excess of 5000 simulation evaluations (which would take more than 7 days in wall-clock time for solving the MO calibration problem in our case). In addition, using HS directly as objective functions would result in many objectives, which would make the MO calibration too expensive to solve and would result in a larger number of non-dominated solutions. Moreover, the MO algorithms analyzed in this study are not designed for "many-objective" (i.e., 4 or more objective) problems.
1.3. Research contributions
The primary contribution of this study is a suggested framework for multi-objective (MO) hydrologic model calibration analysis on a limited budget of simulation evaluations (defined to be below 600 evaluations in this study, which is around 1 day in computing time for our problem). The efficiency of 7 different surrogate and non-surrogate optimization methods is comprehensively compared on two watershed calibration problems with available long-term observation data sets and using two different MO formulations. We not only compare the solutions among different optimization methods but also assess the solutions from MO calibration with a limited budget (below 600 evaluations) against solutions with a sufficient budget (over 100,000 evaluations) in order to investigate if it is feasible to solve the MO calibration with a limited budget.
We eventually propose an efcient multi-objective calibration
framework to evaluate and lter the non-dominated solutions from MO
calibrations. The framework incorporates hydrologic signatures into the
calibration framework by using pre-dened thresholds of acceptability
on various hydrologic signatures. The framework allows the evaluation
of the solution quality of MO calibration and the selection of high-
quality subset solutions from multiple non-dominated solutions. We
adopt the hydrologic consistency denition and hydrologic fre-
quencyplots from Shai and Tolson (2015) to assess the overall solu-
tions quality from a MO calibration search and propose a new hydrologic
consistency frequency heatmap to assess the calibration quality in terms
of each hydrologic criteria. Our framework enables modellers to un-
derstand the quality of MO calibration and gives insights on model
setup, parameter sensitivity & output uncertainty. Moreover, the
framework we proposed is general and could be used for other water-
shed model calibration problems, and also in sequential calibration
frameworks where model parameters/structure are adjusted in each
calibration iteration (Xia and Shoemaker, 2022a).
2. Calibration assessment framework
The multi-objective optimal calibration problem discussed in this
study can be formulated as a constrained optimization problem:
min_{θ ∈ Ω} F(θ) = [f_1(θ), …, f_k(θ)]^T   (1)
where θ = [θ_1, …, θ_n] ∈ Ω is the vector of n model parameters to be calibrated, Ω is the domain of the solution space defined by the lower and upper bounds of the model parameters (i.e., θ_min and θ_max), and F(θ) is the vector of k calibration objectives, which can be subjectively defined by a calibration expert. In order to evaluate the objective F(θ) for a candidate decision vector θ, a computationally expensive run of the watershed simulation model is to be performed.
Hence, a desirable property of an optimization methodology is the ability to produce good solutions within a limited budget of simulation evaluations. The budget on simulation evaluations depends on the computation time of a model and the total time available for the calibration process. The goal of MO optimization is to find the set of Pareto optimal solutions, which is the set of all points y in the domain Ω for which there is no other point x in Ω such that f_j(x) ≤ f_j(y) for all j from 1 to k (with strict inequality for at least one j). In MO optimization, the Pareto set is approximated by a set of non-dominated solutions obtained after a fixed number of evaluations of F(θ).
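To make the non-dominated filtering concrete, the following minimal sketch (assuming all k objectives are to be minimized, as in Equation (1)) keeps the mutually non-dominated objective vectors from a set of evaluated solutions; it is an illustration, not the implementation used by any of the algorithms compared in this study.

```python
import numpy as np

def dominates(fa, fb):
    """True if objective vector fa dominates fb (all objectives minimized)."""
    return bool(np.all(fa <= fb) and np.any(fa < fb))

def nondominated(F):
    """Return indices of the non-dominated rows of F (n_solutions x k objectives)."""
    n = F.shape[0]
    keep = []
    for i in range(n):
        if not any(dominates(F[j], F[i]) for j in range(n) if j != i):
            keep.append(i)
    return keep

# Example: 3 evaluated solutions with 2 objectives each (to be minimized)
F = np.array([[0.2, 0.9],
              [0.5, 0.5],
              [0.6, 0.6]])   # the last row is dominated by [0.5, 0.5]
print(nondominated(F))        # -> [0, 1]
```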
Multi-objective model calibration has two critical components, namely 1) selection of appropriate objectives for formulation of the MO hydrologic calibration, and 2) utilization of an effective and efficient algorithm for optimization. Section 2.1 discusses the formulations used in this study. Subsequently, Section 2.2 introduces the model calibration case studies used in the study, Section 2.3 briefly describes the algorithms used in the calibration assessment framework, Section 2.4 explains the assessment metrics used in our study, and Section 2.5 describes the approach we used to assess the model parameter and output uncertainty of the final calibration solutions. The entire calibration assessment framework is designed to identify formulations and algorithms that are efficient and effective for calibration of expensive hydrologic/watershed problems within a limited number of model evaluations.
2.1. Calibration formulations
This study utilizes goodness-of-t (GOF) measures as objectives to
create numerous multi-objective calibration formulations. For each
watershed case study tested, all calibration formulations utilized in this
study differ in the choice of objectives used only, whereas choice of
parameters and their respective ranges remain same across all formu-
lations. The GOF measures utilized in the study include Nash-Sutcliffe
Efciency (NSE). Two different multi-objective ow calibration formu-
lations are used in this study (as described below). The rst one con-
siders the high ow separately from the mixture of moderate and low
ow. The second one considers different components of the decomposed
NSE as objective functions.
2.1.1. Formulation 1 - threshold based ow separation (2-objective)
This formulation considers the trade-off between high ow calibra-
tion and moderate/low ow calibration, by utilizing the NSE GOF
measure. The relative importance of different hydrological processes
varies between high ow and low ow situations (Kollat et al., 2012) so
a given set of model parameters might do relatively well at matching
data under high ow conditions and relatively poorly under low ow
conditions (or vice versa). Hence it is reasonable to consider the goals of
getting adequate ts to the data under low and high ow conditions as
two different objectives for which there will be a trade-off.
f_1(θ) = −NSE_HF(θ)
f_2(θ) = −NSE_LF(θ),  where  NSE(θ) = 1 − Σ_{i=1}^{n} (y_obs,i − y_sim,i(θ))² / Σ_{i=1}^{n} (y_obs,i − μ_obs)²   (2)
Equation (2) describes the objectives used in our rst formulation,
where NSEHF(θ)is the NSE of high ows (given parameter vector θ) and
NSELF(θ)is the NSE of low ows. In the above equation, yobs,i and ysim,i(θ)
are the measured and simulated ows (given calibration parameter set
θ) on day i, respectively;
μ
obs is the estimated mean values of measured
ows; and n is the number of simulated days. In all our experiments,
high ows are dened as all ows above the 95-th percentile observed
ow when observations are sorted in ascending order of ow magni-
tude. Low ows are all ows below this threshold so moderate ows are
included in the lowows. When calculating the NSE value, the time
series for high (or low) ow is derived by extracting the corresponding
time steps from both measured and simulated ows, where the
measured ow magnitudes are classied as high (or low) ow. As a
result, the measured and simulated high (or low) ow time series are of
the same length.
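A minimal sketch of how the two objectives of Formulation 1 could be computed from paired daily observed and simulated series is given below. The 95th-percentile split follows the description above; the negation of NSE (so that both objectives are minimized, consistent with Equation (1)) and the function names are illustrative assumptions rather than the authors' code.

```python
import numpy as np

def nse(obs, sim):
    """Nash-Sutcliffe Efficiency of simulated vs. observed flows."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return 1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2)

def formulation1_objectives(obs, sim, pct=95.0):
    """Threshold-based flow separation: f1 on high flows, f2 on low/moderate flows."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    threshold = np.percentile(obs, pct)      # high flows: observed flow above the 95th percentile
    high = obs > threshold                   # time steps classified by the *measured* flow magnitude
    f1 = -nse(obs[high], sim[high])          # negated so the MO problem minimizes both objectives
    f2 = -nse(obs[~high], sim[~high])
    return f1, f2
```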
2.1.2. Formulation 2 - decomposition of NSE of ow (3-objective)
Gupta et al. (2009) highlight that the NSE criterion can be decom-
posed into three components a) linear correlation (represented by f1(θ)
in Equation (3)), b) relative variability (represented by f2(θ)in Equation
(3)), and c) relative bias (represented by f3(θ)in Equation (3)). Each
component focuses on calibrating a different and potentially conicting
aspect of ow. The relative bias component of the criterion tends to
minimize volume balance errors; the relative variability tends to mimic
the ashiness of the hydrograph, inherently focusing on capturing
extreme ows; while the correlation criterion, in combination with
relative variability tends to capture the shape of the hydrograph. This
MO formulation utilizes these three components as independent objec-
tives, as dened in Equation (3) below:
f_1(θ) = [r − 1]²,  where  r = [Σ_{i=1}^{n} y_obs,i · y_sim,i(θ) − n·μ_obs·μ_sim(θ)] / [(n − 1)·σ_obs·σ_sim(θ)]
f_2(θ) = [DDS(θ)]²,  where  DDS(θ) = (σ_obs − σ_sim(θ)) / σ_obs
f_3(θ) = [DDM(θ)]²,  where  DDM(θ) = (μ_obs − μ_sim(θ)) / μ_obs   (3)
where μ_obs and μ_sim(θ) are the estimated means of observed and simulated (given parameter set θ) flows, respectively, and σ_obs and σ_sim(θ) are the estimated standard deviations of measured and simulated flows (given parameter set θ), respectively. All other symbols in Equation (3) are defined in Section 2.1.1 and in Table A2.
Formulation 2 is an objective function set that includes both GOF
measures and hydrologic signatures, since correlation r is purely a GOF
objective, whereas squared relative deviation between observed and
simulated mean (relative bias) and observed and simulated standard
deviation (relative variability) are HS-based objectives. Discharge mean
and standard deviation are key hydrologic signatures that are also
included in the signatures used to compute hydrologic consistency later
in this study (see Table A2 and Section 2.4.2).
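Similarly, the three Formulation 2 objectives of Equation (3) can be computed directly from the observed and simulated series, for example as in the sketch below (using sample means and standard deviations); this is an illustrative implementation, not the one used in the study.

```python
import numpy as np

def formulation2_objectives(obs, sim):
    """NSE decomposition (Gupta et al., 2009): correlation, relative variability, relative bias."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    r = np.corrcoef(obs, sim)[0, 1]                                 # linear correlation
    dds = (obs.std(ddof=1) - sim.std(ddof=1)) / obs.std(ddof=1)     # relative variability deviation
    ddm = (obs.mean() - sim.mean()) / obs.mean()                    # relative bias
    return (r - 1.0) ** 2, dds ** 2, ddm ** 2                       # f1, f2, f3 of Equation (3)
```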
2.2. Case studies
The two Watershed Model case studies used in our MO calibration
framework are derived from the Cannonsville Watershed modeling case
study (Tolson and Shoemaker, 2007a). The Soil and Water Assessment
Tool (SWAT) is used for model development (Arnold et al., 2012). SWAT
is a widely used, physically based, deterministic and semi-distributed
watershed modeling tool (Abbaspour et al., 2004; McDonald et al.,
2019; Wang et al., 2019). A more detailed introduction to SWAT is provided in Supporting Information, Section S1.
2.2.1. Case study I: Cannonsville Watershed
Tolson and Shoemaker (2007b) introduce two scaled variations of the Cannonsville SWAT model, as flow calibration case studies. Case Study I incorporates a computationally expensive calibration model, which constitutes 43 subbasins and predicts flow within the 1178 km² Cannonsville Watershed. The delineation of subbasins was performed using a high-resolution (25 m) digital elevation map obtained from the New York City Department of Environmental Protection (NYCDEP), coupled with stream network definition from US Census TIGER files. Land use information at a 25 m grid resolution was obtained from the NYCDEP and derived from thematic mapper satellite imagery. Soil property inputs were extracted from the State Soils Geographic Database (1:250,000) using an area-weighted averaging approach for each map unit. The climate data inputs, encompassing minimum and maximum temperature, precipitation, solar radiation, and relative humidity, were derived from meticulously measured data. The temporal resolution of the model is daily, and a single simulation run spans approximately 1 min for a 10-year simulation period. For an in-depth understanding of the Cannonsville SWAT model, refer to the comprehensive description provided by Tolson and Shoemaker (2007a). The model calibration exercise is focused on the United States Geological Survey (USGS) Walton flow monitoring location, which drains up to 860 km² of the watershed. Tolson and Shoemaker (2007b) identify 15 parameters (defined in Appendix, Table A1) that are to be calibrated for flow prediction in Case Study I. The values of these 15 parameters are constant for all subbasins and do not vary spatially, following the same setting as in the study by Tolson and Shoemaker (2007b). Case Study I employs a nine-year time period for daily flow calibration at the Walton flow monitoring station.
2.2.2. Case study II: Townbrook Watershed
Townbrook is a sub-watershed within the Cannonsville Watershed, covering an area of around 37 km². Case Study II is derived from the single subbasin Townbrook SWAT model developed by Tolson and Shoemaker (2007b). The model inputs and time step setup of the Townbrook SWAT model are the same as for the Cannonsville SWAT model as described in the section above. The Townbrook SWAT model is a relatively inexpensive model (one simulation run takes up to 10 s for a 10-year modeling period), and hence can be used extensively for algorithm comparison. The model predicts flow within the Townbrook watershed and is employed as a flow calibration case study. The Townbrook sub-watershed is monitored for flow by the United States Geological Survey (USGS Station 01421618), and Case Study II has a 10-year time period for daily flow calibration of 15 parameters of the Townbrook SWAT model.
Given two calibration formulations, and two case studies, we
developed 4 watershed calibration test case studies, as our test suite for
comparative algorithm analysis in the MO framework. The nomencla-
ture used for the test case studies, along with a brief overview of the
problems, is provided for reference in Table 1. FW is an abbreviation for
Full-Watershed, which is a reference to the Cannonsville case study
(Case Study I). SW is an abbreviation for Sub-Watershed, which is a
reference to the Townbrook case study (Case Study II).
2.3. Optimization algorithms
Within the water resources literature, several recent studies have shown that in general, watershed model calibration problems can be highly multi-modal, and existence of false non-dominated Pareto fronts (e.g., locally optimal fronts) can pose a significant threat to the accuracy of the optimization process (Kollat et al., 2012; Reed et al., 2012). Hadka and Reed (2013) state that multi-modality is a severe challenge for most multi-objective evolutionary algorithms (MOEAs). In our analysis, we aim to compare optimization algorithms with varying search capabilities to understand their capabilities in tackling the numerous optimization challenges of watershed calibration within a very limited simulation evaluation budget.
The optimization literature contains various multi-objective optimization algorithms, specifically for tackling simulation optimization problems. These search-based meta-heuristics are primarily multi-objective Evolutionary Algorithms (MOEAs). Coello et al. (2007) suggest that use of MOEAs can be highly beneficial, since the population-based structure of an evolutionary algorithm can be exploited to simultaneously achieve the two goals of i) converging to the Pareto front, and ii) maintaining a diverse set of trade-off solutions.
Various search methodologies within evolutionary optimization can be employed to tackle these two-fold aims of multi-objective optimization. Local search, decomposition of the objective vector into multiple single objective optimization problems, and employing multi-method search are some of these search methodologies. The five multi-objective evolutionary algorithms (MOEAs) used for comparison in this study, i.e., NSGA-II (Deb et al., 2002), MOEA/D (Zhang and Li, 2007), AMALGAM (Vrugt and Robinson, 2007), BORG (Hadka and Reed, 2013) and ε-MOEA (Laumanns et al., 2002), employ different and unique search mechanisms for optimization, and have been used in numerous watershed model calibration applications in the past (Ahmadi et al., 2014; Chilkoti et al., 2018; Ercan and Goodall, 2016; Shafii and Tolson, 2015; Zhang et al., 2013). For instance, Ercan and Goodall (2016) introduced a generic software tool for using NSGA-II in multi-objective and multi-site calibration of SWAT models; Zhang et al. (2013) proposed PP-SWAT, a parallel multi-objective calibration tool designed for parallel and efficient calibration of SWAT models using AMALGAM; and Chilkoti et al. (2018) coupled SWAT with BORG for effective low-flow calibration of semi-distributed hydrologic models. PADDS (Asadzadeh and Tolson, 2013; Sahraei et al., 2019) is another (and non-evolutionary) algorithm that has performed well on MO water problems (Yang et al., 2017). We did not include PADDS in this analysis because Shafii and Tolson (2015) observe that performance of PADDS and AMALGAM is comparable for watershed calibration problems and we do compare to AMALGAM.
Since we are interested in efcient multi-objective calibration of
expensive watershed models, incorporation of surrogate assisted opti-
mization methodologies is important in this analysis. Surrogate assisted
search methods typically employ computationally inexpensive response
surface models (or surrogate models), within the iterative search
process, in order to efciently guide the search towards optimal solu-
tions. Hence, we compare two surrogate based algorithms, ParEGO
(Knowles, 2006) and GOMORS (Akhtar and Shoemaker, 2016), along
with the ve evolutionary algorithms discussed above, for analyzing
their relative effectiveness in MO calibration of expensive hydro-
logic/watershed models on a limited budget of a few hundred simulation
evaluations. In surrogate-assisted optimization methods such as ParEGO
and GOMORS, the surrogate model is constructed by utilizing evalua-
tions already explored by the optimization algorithm. The surrogate
takes the decision variables (e.g., calibration parameter values) as input
and produces the objective function value (e.g., performance metric
calculated using SWAT simulation output for a given set of calibration
parameter values) as output. The tted surrogate model serves as a
computationally inexpensive predictive tool for estimating the objective
function value for a given set of calibration parameter values. This
Table 1
The flow calibration test case suite employed in comparative algorithm analysis, with formulations defined in Section 2.1, case studies defined in Section 2.2, and algorithms introduced in Section 2.3. FW = Cannonsville Full-Watershed and SW = Townbrook Sub-Watershed.

Problem Name | Formulation | Equation No. | Case Study | Objectives | Applied Algorithm(s)
SW-2 | 1. Threshold | 2 | II/SW | 2 | All
SW-3 | 2. NSE-Decom | 3 | II/SW | 3 | All
FW-2 | 1. Threshold | 2 | I/FW | 2 | All
FW-3 | 2. NSE-Decom | 3 | I/FW | 3 | All
predictive capability is then employed to strategically sample evaluation points within the search space, contributing to an enhanced and efficient optimization process.
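As a rough illustration of this surrogate step (fit an inexpensive model to the parameter sets already evaluated, then use its cheap predictions to screen many candidate parameter sets), the sketch below uses SciPy's RBFInterpolator with synthetic data; it is not the RBF implementation inside GOMORS nor the Kriging model inside ParEGO.

```python
import numpy as np
from scipy.interpolate import RBFInterpolator

rng = np.random.default_rng(0)
# X: parameter sets already evaluated by the expensive simulation (n_evals x n_params, scaled to [0, 1])
X = rng.uniform(0.0, 1.0, size=(60, 15))
# Y: corresponding objective values, one column per objective (stand-in values for illustration only)
Y = np.column_stack([np.sum((X - 0.3) ** 2, axis=1),
                     np.sum((X - 0.7) ** 2, axis=1)])

surrogate = RBFInterpolator(X, Y, kernel="cubic")   # RBF model fitted to the past evaluations

# Cheaply predict objectives for many candidate parameter sets and keep the most promising ones
candidates = rng.uniform(0.0, 1.0, size=(5000, 15))
pred = surrogate(candidates)                        # (5000 x 2) predicted objective values
best = candidates[np.argsort(pred[:, 0])[:5]]       # e.g., candidates predicted best on objective 1
```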
ParEGO (Knowles, 2006) uses a Kriging-based surrogate surface for multi-objective optimization, and it is specifically designed for applications involving a very limited evaluation budget. GOMORS (Akhtar and Shoemaker, 2016) is another iterative scheme, which employs Radial Basis Functions (RBF) as a surrogate model to guide multi-objective search towards the optimal set of solutions. Both the Kriging-based surrogate surface and the RBF surrogate are versatile data-driven models that can effectively capture diverse regression relationships between decision variables and objective function values. Their general-purpose nature enables direct application to a range of problems, extending beyond the specific model calibration issues encountered in this study, such as those involving SWAT. GOMORS optimizes RBF surrogates (one surrogate is fitted for each objective) with an MOEA search on the surrogate in each iteration, in order to improve algorithm efficiency. Akhtar and Shoemaker (2016) report that GOMORS outperforms ParEGO and NSGA-II on test problems and on a hypothetical groundwater remediation design problem with a limited evaluation budget (400 evaluations). GOMORS is implemented within pySOT, a Python-based surrogate optimization toolbox designed for implementing single and multi-objective surrogate algorithms (Eriksson et al., 2019).
2.4. Assessment metrics
Calibration solutions obtained from different algorithms and formulations require careful assessment and selection using MO performance metrics. There are numerous MO performance metrics that measure an MO solution with a single number. Coello et al. (2007) provide a comprehensive list of MO performance metrics.
However, from a hydrologic perspective, assessment of calibration solutions may benefit from considerations of various measures including GOF and HS, which may be included in post-optimization calibration assessment (Shafii and Tolson, 2015). Thus, the calibration assessment framework of this study uses both traditional and hydrological metrics for assessment of algorithms and formulations on limited evaluation budgets.
2.4.1. Hypervolume
Hypervolume (Auger et al., 2009) is the traditional MO performance assessment metric used in the analysis of algorithms in this study. Hypervolume incorporates both convergence to the ideal front as well as diversity of solutions on the front and is defined as the total feasible objective space (bounded by reference points) dominated by the estimate of the Pareto front obtained by an algorithm (see Supporting Information Section S2 and Fig. S1 for further illustration). This study uses a normalized version of hypervolume, where the hypervolume dominated by an algorithm solution is divided by the hypervolume dominated by the true Pareto front of the problem. Thus, hypervolume coverage, as defined in this study, is the proportion of feasible objective space dominated by an algorithm (see Supporting Information Section S3 and Fig. S2 for further illustration). A higher value of hypervolume coverage indicates a better solution, and the ideal value is one. Also, since the true Pareto front of an optimization problem is unknown, we use the fitness values of the best solutions from all algorithms and all trials to develop an estimate of the Pareto front for that optimization problem.
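For a two-objective minimization problem, the hypervolume dominated by a non-dominated set relative to a reference point can be computed by summing rectangles, and hypervolume coverage is its ratio to the hypervolume of the reference (estimated Pareto) front; the sketch below uses placeholder fronts and a placeholder reference point, not values from this study.

```python
import numpy as np

def hypervolume_2d(front, ref):
    """Hypervolume dominated by a 2-objective non-dominated set w.r.t. reference point `ref` (minimization)."""
    pts = np.asarray(sorted(front, key=lambda p: p[0]), float)   # sort by first objective
    hv, prev_f2 = 0.0, ref[1]
    for f1, f2 in pts:
        hv += (ref[0] - f1) * (prev_f2 - f2)   # horizontal slab between this point and the previous one
        prev_f2 = f2
    return hv

ref = (1.0, 1.0)                                   # reference point bounding the feasible objective space
algo_front = [(0.2, 0.8), (0.4, 0.5), (0.7, 0.3)]  # non-dominated set from one algorithm (illustrative)
best_front = [(0.1, 0.7), (0.3, 0.4), (0.6, 0.2)]  # estimated Pareto front from all algorithms and trials
coverage = hypervolume_2d(algo_front, ref) / hypervolume_2d(best_front, ref)
print(coverage)                                    # closer to 1 is better
```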
2.4.2. Hydrologic consistency
We consider nine HS in this study's calibration performance analysis (defined in Appendix B) and use relative signature deviations to quantify HS-based performance (for each signature). Relative signature deviations have been used in other studies (Sahraei et al., 2020; Shafii and Tolson, 2015) to quantify signature-based performance and are described (as used in this study) as:
D_xx(θ) = (S_xx^obs − S_xx^sim(θ)) / S_xx^obs   (4)
In the above equation, D_xx(θ) is the relative deviation between the observed signature value S_xx^obs (computed using observed flows) and the simulated signature value S_xx^sim(θ) (computed using simulated flows and the parameter vector θ), where 'xx' is an abbreviation denoting the signature being computed (please refer to Table A2 for signature abbreviations and definitions).
Shai and Tolson (2015) introduced the term hydrologic consistency
when analyzing calibration formulations. Lets assume that calibration
of a hydrologic model requires consideration of N criteria where an
acceptability threshold is dened for each criterion (a criterion could be
for example a GOF measure or an HS deviation). Examples of accept-
ability threshold may include: 1) Kling-Gupta Efciency (KGE) >x and
2) D
xx
(θ) <y, where x and y are user dened thresholds and KGE is a
statistical metric commonly used for the evaluation of hydrological
model performance (Kling et al., 2012). It combines three components:
correlation (Pearsons correlation coefcient), bias (the ratio of the
mean simulated to observed values), and variability (the ratio of the
standard deviation of simulated to observed values). D
xx
(θ) is the rela-
tive deviation between observed and simulated signature value dened
in Equation (4). According to Shai and Tolson (2015) hydrologic
consistency of a calibration solution is dened as the number of satised
criteria, n (out of N), i.e., the number of criteria that are within their
dened acceptability thresholds. This type of analysis of calibration
solutions is common when HS are considered in calibration (Martinez
and Gupta, 2010). Shai and Tolson (2015) also introduced frequency
plots of hydrologic consistency, to illustrate sampling efcacy of
different calibration algorithms and formulations. This study uses the
hydrologic consistency frequency (HCF) plot and HCF heatmap to
analyze the performance of algorithms and formulations on a limited
evaluation budget.
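The hydrologic consistency of a single calibration solution therefore reduces to a count of satisfied criteria. The sketch below combines a KGE computed from the three components listed above with the relative signature deviations of Equation (4); the thresholds shown (KGE > 0.5, |D_xx| < 25%) correspond to the first consistency level used later in Section 3.2, and the function names are illustrative rather than the authors' implementation.

```python
import numpy as np

def kge(obs, sim):
    """Kling-Gupta Efficiency from correlation, bias ratio and variability ratio."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    r = np.corrcoef(obs, sim)[0, 1]
    beta = sim.mean() / obs.mean()               # bias ratio (simulated mean / observed mean)
    alpha = sim.std(ddof=1) / obs.std(ddof=1)    # variability ratio
    return 1.0 - np.sqrt((r - 1) ** 2 + (beta - 1) ** 2 + (alpha - 1) ** 2)

def relative_deviation(s_obs, s_sim):
    """Equation (4): relative deviation of a simulated signature from the observed one."""
    return (s_obs - s_sim) / s_obs

def consistency(obs, sim, observed_signatures, simulated_signatures,
                kge_min=0.5, dev_max=0.25):
    """Number of satisfied criteria: one KGE threshold plus one threshold per hydrologic signature."""
    n_ok = int(kge(obs, sim) > kge_min)
    for name, s_obs in observed_signatures.items():
        d = relative_deviation(s_obs, simulated_signatures[name])
        n_ok += int(abs(d) < dev_max)
    return n_ok
```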
Hydrologic consistency frequency plot. A hydrologic consistency frequency (HCF) plot introduced by Shafii and Tolson (2015) is akin to a probability exceedance plot and essentially plots the number of satisfied criteria (n) on the x-axis, and the proportion of function evaluations of an MO calibration experiment that satisfy at least n criteria on the y-axis. HCF plots are an effective presentation of the hydrologic consistency of a calibration search. An MO calibration experiment is more hydrologically consistent if a higher proportion of its evaluations satisfy more criteria (relative to other calibration optimization experiments). Thus, higher HCF curves are desirable and are better. In our study, we use only the non-dominated solutions from multi-objective optimization for the HCF analysis, which is different from the HCF plot introduced by Shafii and Tolson (2015) where all solutions from calibration are used.
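Given the consistency counts of the non-dominated solutions, an HCF curve is simply the fraction of solutions whose count is at least n, for n = 0, …, N; a short sketch with hypothetical counts follows.

```python
import numpy as np

def hcf_curve(consistency_counts, n_criteria=10):
    """Proportion of non-dominated solutions satisfying at least n criteria, for n = 0..N."""
    counts = np.asarray(consistency_counts)
    return np.array([(counts >= n).mean() for n in range(n_criteria + 1)])

# Hypothetical consistency counts for 8 non-dominated solutions (out of N = 10 criteria)
counts = [3, 5, 6, 6, 7, 8, 8, 9]
print(hcf_curve(counts))   # higher values at large n indicate a more hydrologically consistent search
```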
Hydrologic consistency frequency heatmap. The HCF plot gives the proportion of evaluation points that satisfy multiple criteria but cannot provide information on how each of the different hydrologic criteria is being satisfied. We introduce the HCF heatmap, which shows the frequency
Table 2
The percentage of function evaluations, i.e., relative progress, required by GOMORS to reach the mean best hypervolume achieved by other algorithms after 600 function evaluations. The bracketed numbers (ending in 'x') report speed-up factors of GOMORS relative to other algorithms. All the results are averaged over 20 trials. (A percentage less than 100% indicates the algorithm is slower than GOMORS.)

Algorithm | SW-2 | SW-3 | FW-2 | FW-3
ParEGO | 72% [1.4x] | 52% [1.9x] | 32% [3.1x] | 47% [2.1x]
NSGA-II | 12% [8.6x] | 9% [11x] | 10% [10x] | 7% [15x]
AMALGAM | 58% [1.7x] | 33% [3.0x] | 48% [2.1x] | 30% [3.4x]
MOEA/D | 23% [4.3x] | 23% [4.3x] | 16% [6.1x] | 15% [6.7x]
BORG MOEA | 36% [2.8x] | 29% [3.4x] | 35% [2.8x] | 35% [2.8x]
ε-MOEA | 29% [3.4x] | 36% [2.8x] | 30% [3.3x] | 20% [5.0x]
of each hydrologic criterion satised for the non-dominated solutions
that satised at least N hydrologic criteria. The horizontal axis of the
heatmap is the number of satised criteria (N). The vertical axis plots
different hydrologic criteria dened. The color of each grid in the
heatmap plots the frequency of each hydrologic criterion being satised
among the solutions that satised at least N criteria.
2.5. Model parameter and output uncertainty assessment
In this study, we quantied model parameter and output uncertainty
based on carefully selected non-dominated solutions. The hydrologic
consistency frequency plot and map introduced in Section 2.4 were
applied to the non-dominated solutions identied by the most effective
Multi-Objective (MO) algorithm. Following the analysis of hydrologic
consistency frequency plots and maps, which illustrate the performance
of non-dominated solutions across various hydrologic signatures, the
modeler can establish a threshold (e.g., specifying a minimum number
of hydrologic consistency criteria that must be satised). This threshold
serves as a criterion for selecting a subset of non-dominated solutions
assumed to be of high quality. These selected solutions not only belong
to the non-dominated set but also satisfy a greater number of hydrologic
consistency criteria compared to the remaining non-dominated
solutions.
To investigate parameter uncertainty, we compared the range (upper
and lower bounds of parameter values) of the nal selected non-
dominated solution set with the original parameter range dened for
the optimization search. This comparative analysis allows us to assess
the extent to which the methods proposed in our study contribute to
reducing parameter uncertainty.
Model output uncertainty is derived from these carefully selected
non-dominated solutions and is quantied as the range between the
highest and lowest output at each time step within the ensemble of
selected non-dominated solutions. To evaluate model performance, we
compared the ensemble mean and range of model output from the
selected non-dominated parameter sets with observed data. This
comprehensive approach enables a thorough investigation into both
parameter and output uncertainties, offering insights into the efcacy of
the proposed methodology.
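Both uncertainty measures reduce to simple range operations over the selected solutions, as in the sketch below; the array names and the band-coverage summary are illustrative assumptions rather than outputs of this study.

```python
import numpy as np

def uncertainty_summary(params, sims, obs, lower_bounds, upper_bounds):
    """params: selected non-dominated parameter sets (n_selected x n_params)
    sims:   simulated daily flows for each selected set (n_selected x n_days)
    obs:    observed daily flows (n_days,)"""
    # Parameter uncertainty: range spanned by the selected solutions vs. the original search range
    param_range = params.max(axis=0) - params.min(axis=0)
    range_reduction = 1.0 - param_range / (np.asarray(upper_bounds) - np.asarray(lower_bounds))

    # Output uncertainty: envelope between the highest and lowest simulated flow at each time step
    band_low, band_high = sims.min(axis=0), sims.max(axis=0)
    ensemble_mean = sims.mean(axis=0)
    coverage = np.mean((obs >= band_low) & (obs <= band_high))   # fraction of observations inside the band
    return range_reduction, (band_low, band_high), ensemble_mean, coverage
```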
3. Experimental setup
The experimental setup of this study is divided into two core components. In the first component, we focus on analysis of algorithm performance by comparing the algorithms listed in Section 2.3 on 4 problems, SW-2, SW-3, FW-2, and FW-3. Due to the stochastic nature of all algorithms, multiple trial runs were performed for the above-mentioned problems. We performed 20 trials for each algorithm with 600 function evaluations on each of the 4 problems. Algorithms are analyzed both in terms of traditional optimization efficiency and signature-based hydrologic consistency. Hypervolume coverage and trade-off visualization plots (for 2-objective formulations) are used for traditional algorithm assessment. Hydrologic consistency plots (see Section 2.4.2) are used for hydrologically relevant analysis. Results of algorithm comparison based on hypervolume coverage are discussed in Section 4.1 and results via consistency analysis are discussed in Section 4.3.
The second component of the analysis in this study explores the solution quality of different optimization formulations on a limited evaluation budget. The study employs GOMORS as the optimization algorithm and assesses the quality of calibration solutions in terms of hydrologic consistency. The two different formulations introduced in Section 2.1 are compared in this analysis, with an emphasis on understanding the efficacy and effectiveness of these formulations in producing good calibration alternatives. Efficacy, in this study, is defined as the ability of a formulation to frequently find calibration solutions that have high hydrologic consistency (see Section 2.4.2 for the definition of hydrologic consistency). HCF plots and HCF heatmaps are used to understand and compare efficacy. Effectiveness, in this study, focuses more on comparing the best calibration solutions found by different formulations. Results of the formulation comparison and solution quality assessment are discussed in Section 4.4.
3.1. Algorithm settings
A small trial-and-error exercise was performed to tune population sizes for the multi-objective evolutionary algorithms (MOEAs) NSGA-II, MOEA/D, ε-MOEA and AMALGAM. Since the performance of MOEAs is highly dependent on population size, we ran multiple trials of the algorithms on the SW cases, i.e., SW-2 and SW-3, with population sizes of 20, 50, 100 and 200, and an evaluation budget of 600. We chose 600 evaluations because when the FW-3 problem is running on a desktop (Intel Core i7-6700 CPU @ 3.40 GHz, 16 GB RAM), the runtime is around 1 min for each evaluation; therefore, the computational time is around 10 h for a single optimization scenario of FW running in serial on a single computer, which is affordable under this budget. The initial trial-and-error analysis showed that within the limited evaluation budget of 600, a population size of 20 was desirable for all MOEAs in the case of the 2-objective problems, whereas a population size of 50 was desirable in the case of the 3-objective problems to achieve convergence of fitness. BORG MOEA has an adaptive population size and hence does not require tuning of the population. Given the limited evaluation budget, we changed the initial, minimum, and maximum population size values for BORG MOEA to 16, 10, and 100, respectively. ParEGO's parameter configuration recommended by Knowles (2006) was employed, and the GOMORS parameter configuration recommended by Akhtar and Shoemaker (2016) was used.
3.2. Hydrologic signatures and consistency levels
Hydrologic signatures (HS) are important streamow characteristics
of the natural ow regimes, such as timing and magnitude of extreme
ows (Shai and Tolson, 2015). Table A2 denes nine HS used in this
work. As introduced in Section 2.4.2, hydrologic consistency of a cali-
bration solution is dened as the number of hydrologic criteria that are
satised by a calibration solution. Ten hydrologic criteria (the KGE
metric plus the nine HS) are considered in the hydrologic consistency
analysis of this study. Moreover, satisfaction level of a hydrologic
signature (for a calibration solution) is achieved when absolute relative
deviation (see Section 2.1.1 and Equation (2)) of the signature is within
a user-dened percentage. Satisfaction level with respect to KGE is
achieved when the KGE score of a calibration solution is greater than a
user-dened threshold. Two different user-dened hydrologic consis-
tency levels are considered in the algorithm and formulation analysis of
this study. For the rst hydrologic consistency level, satisfaction level for
KGE is KGE >0.5, and satisfaction level for the absolute value of a HS is
the absolute value of HS <25%. For the second (and higher) hydrologic
consistency level, satisfaction level for KGE is KGE >0.6, and satisfaction
level is the absolute value of HS <15%. These consistency level de-
nitions are arbitrary and can be modied as per the perspective of a
model calibration expert.
4. Results and discussion
4.1. Traditional algorithm comparison: hypervolume coverage
We rst analyze overall efciency of all algorithms by plotting
hypervolume coverage (see Section 2.4.1) values (averaged over mul-
tiple trial runs) against number of function evaluations. These plots are
called progress graphs that are computed for all algorithms on both
watershed calibration problems using Formulations 1 and 2 (see
Table 1), i.e., algorithms are compared on problems SW-2, SW-3, FW-2,
and FW-3. The progress graphs are given in Fig. 1.
Each subgure in Fig. 1 corresponds to one of the four watershed test
problems mentioned above and provides visualizations of the average
hypervolume covered (averaged over multiple trials) as a function of
number of evaluations for all algorithms. A higher hypervolume
coverage indicates a better Pareto front. So the fact that GOMORS
eventually has the highest curve in all cases (as shown in Fig. 1) in-
dicates GOMORS is performing better than the other algorithms on all
four multi objective problems considered (where S or F indicates small
or full watershed and 2 or 3 is the number of objectives. As was
mentioned earlier, hypervolume progress considers both convergence
and diversication; and higher values of the hypervolume metric are
desirable. It is evident from the analysis in Fig. 1 that overall average
performance of GOMORS is better than all other algorithms for all
watershed test problems within a limited watershed simulation evalu-
ation budget of 600 since the blue curves ascend faster and to higher
values. The progress graphs also indicate relative superiority of the two
surrogate algorithms GOMORS and ParEGO over the non-surrogate al-
gorithms for a limited evaluation budget of 600.
While it is evident from Fig. 1 that the average performance of GOMORS is better than other algorithms with regard to hypervolume coverage, we further analyze the relative efficiency of GOMORS (compared to other algorithms) by reporting the percentage of function evaluations required by GOMORS to reach the best hypervolume value (averaged over multiple trials) achieved by other algorithms after 600 function evaluations. This percentage is referred to as relative progress in subsequent discussions and is reported in Table 2. Table 2 also reports the speed-up of GOMORS relative to other algorithms, where the speed-up, calculated after 600 function evaluations, is 600 divided by the number of evaluations required by GOMORS to reach the same hypervolume as the other algorithm obtained after 600 evaluations. For instance, the speed-up of GOMORS relative to AMALGAM is 600/200 = 3.0 times for SW-3, because it only takes 33% as much time for GOMORS to solve the problem as is required by AMALGAM.
All relative progress percentages reported in Table 2 are considerably less than 100%, implying that for every calibration problem, GOMORS obtained equivalent average hypervolume values to all other algorithms in less time. For instance, in comparison to NSGA-II, GOMORS obtained equivalent average hypervolume within roughly 10% of the function evaluations (see the third row of Table 2).
In Fig. 1, a notable observation is that GOMORS seems to converge faster in addressing the FW problem compared to the SW problem, as indicated by the hypervolume coverage progress. This shows that despite the SW problem being of smaller model scale (constituting a sub-watershed of FW), it does not show a reduction in the number of evaluations required for algorithmic convergence. This highlights the inherent complexity of the SW optimization problem and challenges our initial expectation that smaller-scale watershed models inherently demand fewer optimization evaluations for convergence. However, it is unclear
Fig. 1. Progress graphs of MO solutions with plots of best hypervolume coverage values against number of function evaluations, averaged over 20 trials. Each subplot
corresponds to the progress plots of each test case study mentioned in the respective titles: (a) SW-2, (b) SW-3, (c) FW-2, and (d) FW-3. Higher curves are better.
why a smaller watershed model calibration takes more evaluations to converge than the larger watershed model calibration problem. The complexity arises from the fundamental differences in simulation outputs and observation data between the two models, making direct comparisons challenging. The incongruence in optimization convergence metrics between the SW and FW problems stems from the non-comparability of the hypervolume coverage metric across multiple problems. This is because the metric's calculation is relative to the reference and ideal solutions, whose values are estimated from the optimization algorithms' evaluation points within each problem context (see Supporting Information Section S3 for details).
4.2. Statistical significance
In order to analyze the difference in performance between GOMORS,
ParEGO, AMALGAM and Borg MOEA in further detail, the two-sided
Mann-Whitney Rank Sum test (Conover, 1998) was performed over
the hypervolume metric values obtained for each algorithm in multiple
trials. The Rank Sum test is a non-parametric statistical hypothesis test
for deducing whether results obtained from one algorithm in multiple
trial runs are significantly different from results obtained from another
algorithm in multiple trials. The algorithms are compared in pairs and
the Rank Sum Test is performed for all watershed problems after 200,
400 and 600 evaluations of each algorithm are complete. Hence, there
are 36 Rank Sum tests for each algorithm, and 18 for each test problem.
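For illustration, one such pairwise comparison can be carried out with the two-sided Mann-Whitney U (rank sum) test as implemented in scipy (the study follows Conover (1998); the function and array names below are illustrative):

import numpy as np
from scipy.stats import mannwhitneyu

def rank_sum_compare(hv_a, hv_b, alpha=0.1):
    # hv_a, hv_b: hypervolume values of two algorithms from repeated trials
    # (20 per algorithm here), compared after a fixed number of evaluations.
    stat, p = mannwhitneyu(hv_a, hv_b, alternative="two-sided")
    if p >= alpha:
        return p, None                                  # no significant difference at the 10% level
    return p, ("A" if np.median(hv_a) > np.median(hv_b) else "B")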
A summary of the Mann-Whitney Rank Sum Test is provided in
Table 3. We see in Table 3 that GOMORS is better than the alternative
algorithm in 34 out of 36 cases. Table 3 also indicates that none of the
other algorithms is statistically better than GOMORS for any of the 36
combinations of problems (SW-2, SW-3, FW-2, FW-3), algorithms (Par-
EGO, AMALGAM, Borg MOEA) and numbers of evaluations (200, 400,
or 600).
For all cases of the GOMORS versus AMALGAM and GOMORS versus Borg MOEA comparisons, the p-value is very low (p < 0.1), supporting the conclusion that GOMORS is significantly better than both AMALGAM and Borg MOEA. GOMORS also performs better than ParEGO in general, since 10 out of 12 p-values for GOMORS vs ParEGO are below 0.1 (with GOMORS being the superior algorithm), whereas there are no cases where ParEGO performs better than GOMORS (with p < 0.1).
While ParEGO's performance may not be as good as that of GOMORS, it is distinctly better than both AMALGAM and Borg MOEA (as per hypervolume coverage), as indicated by p-values of less than 0.1 (with ParEGO superior) in 14 out of 24 cases. GOMORS, ParEGO, AMALGAM and Borg MOEA all decisively outperform NSGA-II and MOEA/D (see Fig. 1). Hence, NSGA-II and MOEA/D were not included in the rank sum test analysis.
Our analysis highlights that GOMORS frequently outperforms
AMALGAM, Borg MOEA and ParEGO. However, none of the other al-
gorithms outperforms GOMORS on any case study. Also, the fact that the two surrogate methods, GOMORS and ParEGO, outperform the widely used AMALGAM algorithm suggests that there is a distinct advantage to using surrogates in multi-objective optimization for watershed model calibration with a limited number of evaluations.
4.3. Proximity to the approximated Pareto Front
Another critical question is how well the non-dominated front (obtained from an algorithm) compares against the approximate Pareto front (computed from many simulations). Since Pareto fronts of 2-objective problems are easier to visualize, we compare the non-dominated fronts obtained by different algorithms for the 2-objective FW-2 problem (see Table 1) against the approximated Pareto front in Fig. 2. The results are based on 20 trials for each algorithm, i.e., the best and worst trials are the trials with the best or worst hypervolume coverage among all 20 trials for that algorithm. Moreover, since the true Pareto front of the watershed problems analyzed in this study is not known, we use the non-dominated points of all points evaluated by all algorithms for the FW-2 problem (7 algorithms * 20 trials * 600 evaluations), together with a long run of AMALGAM with 50,000 evaluations (approximately 50,000 + 7*20*600 = 134,000 points in total), as the "approximated" Pareto front.
Fig. 2 plots this approximated Pareto front (depicted in red), against the
non-dominated fronts obtained by GOMORS, ParEGO and BORG after
600 function evaluations (from best and worst algorithm trials). We only
compare GOMORS, ParEGO and BORG here, since these algorithms are
the best (as per hypervolume coverage) for the FW-2 problem.
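The "approximated" Pareto front itself is simply the non-dominated subset of the pooled evaluations; a sketch follows (the pairwise O(n^2) filter below is illustrative and would be replaced by a faster sorting-based filter for the full pooled set of roughly 134,000 points):

import numpy as np

def nondominated(points):
    # Non-dominated subset of a set of minimization objective vectors.
    pts = np.asarray(points, dtype=float)
    keep = np.ones(len(pts), dtype=bool)
    for i, p in enumerate(pts):
        dominates_p = np.all(pts <= p, axis=1) & np.any(pts < p, axis=1)
        keep[i] = not dominates_p.any()
    return pts[keep]

# e.g. pool all evaluated objective vectors (7 algorithms x 20 trials x 600
# evaluations plus the 50,000-evaluation AMALGAM run) and filter:
# pareto_approx = nondominated(np.vstack(all_evaluated_objectives))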
It is evident from Fig. 2 that the non-dominated fronts obtained from GOMORS after only 600 evaluations are visually close in proximity to the approximated Pareto front. Moreover, Fig. 2 shows the non-dominated fronts from the best and worst trials of GOMORS, ParEGO and BORG. These results also indicate that even the worst trials of GOMORS are quite good, almost as good as the best results of the other algorithms. Thus, GOMORS is more robust in performance than the other algorithms. The worst front of ParEGO performs very poorly, indicating its lack of reliability across multiple trials when applied to the FW-2 problem.
4.4. Hydrologic consistency analysis of solutions from multi-objective
optimization
The MO calibration assessment so far shows that GOMORS is the best-performing algorithm on a limited evaluation budget in terms of algorithm efficiency. This section compares the hydrologic quality of the two multi-objective formulations, using GOMORS as the optimization algorithm, in order to propose an appropriate formulation for calibration on a limited evaluation budget. As mentioned in prior discussion, hydrologic consistency is appropriate and highly relevant when attempting to understand algorithm performance in hydrologic model calibration.
Table 3
Summary of statistical comparison via the Mann-Whitney Rank Sum Test applied to GOMORS, ParEGO, AMALGAM and Borg MOEA, according to the hypervolume coverage metric with 600 evaluations. The rows correspond to the pairs of algorithms that are compared, and the sub-columns correspond to the number of function evaluations after which the algorithms are compared. Table cells report p-values obtained from the two-sided Rank Sum Test with the null hypothesis that algorithm performances are not different (20 trials). p-values below 0.1 indicate cases where the first listed algorithm performs better than the second listed algorithm at a 10% significance level; there are no cases where the second listed algorithm is significantly better than the first listed algorithm (i.e., with p < 0.1).

Algorithm pair       | Problem SW-2                  | Problem SW-3
                     | 200       400       600      | 200       400       600
GOMORS vs ParEGO     | 0.9138    0.3438    6.3e-03  | 1.2e-02   1.6e-02   8.7e-03
GOMORS vs AMALGAM    | 9.2e-06   1.1e-06   1.3e-05  | 2.1e-07   9.8e-07   4.8e-06
GOMORS vs Borg       | 2.1e-04   1.1e-04   4.8e-06  | 1.1e-06   1.5e-06   2.9e-06
ParEGO vs AMALGAM    | 1.1e-06   2.8e-05   0.1441   | 4.4e-05   2.9e-04   9.4e-03
ParEGO vs Borg       | 6.2e-05   2.7e-03   9.4e-03  | 9.7e-04   1.2e-04   6.5e-04
AMALGAM vs Borg      | 0.1762    0.2674    0.1167   | 0.4819    0.4819    0.3040

Algorithm pair       | Problem FW-2                  | Problem FW-3
                     | 200       400       600      | 200       400       600
GOMORS vs ParEGO     | 8.8e-05   1.1e-03   1.9e-03  | 4.9e-03   0.0699    0.0515
GOMORS vs AMALGAM    | 1.7e-06   6.2e-05   4.4e-04  | 1.7e-05   1.1e-06   4.3e-06
GOMORS vs Borg       | 4.3e-06   2.3e-04   4.8e-04  | 1.3e-05   1.9e-05   1.5e-04
ParEGO vs AMALGAM    | 0.1441    0.9569    0.2235   | 0.5885    4.8e-02   2.0e-02
ParEGO vs Borg       | 0.3302    0.7455    0.9784   | 0.4651    0.0989    0.1045
AMALGAM vs Borg      | 0.5338    0.5338    0.3040   | 0.7660    0.7049    0.4328
The two satisfaction levels defined in Section 3.2, for nine HS (see Table A2) and KGE, are used here to analyze the algorithms. Moreover, hydrologic consistency is visualized via consistency frequency plots (Shafii and Tolson, 2015) (see Figs. 3 and 4).
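For reference, the goodness-of-fit score used in these satisfaction levels is the Kling-Gupta efficiency (Gupta et al., 2009; Kling et al., 2012), which in its original form is

\mathrm{KGE} = 1 - \sqrt{(r - 1)^2 + (\alpha - 1)^2 + (\beta - 1)^2}, \qquad \alpha = \sigma_{\mathrm{sim}} / \sigma_{\mathrm{obs}}, \qquad \beta = \mu_{\mathrm{sim}} / \mu_{\mathrm{obs}},

where r is the linear correlation between simulated and observed flows, and the sigma and mu terms are their standard deviations and means; the exact variant adopted here is the one defined in Section 3.2.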
To analyze the meaningfulness gained from each formulation, Figs. 3 and 4 compare the HCF plots of the two formulations given in Table 1, with GOMORS as the MO algorithm, applied to the SW and FW models, respectively. Fig. 3(a) and Fig. 4(a) show the HCF plots with all solutions from the 20 optimization trials. Fig. 3(b) and Fig. 4(b) show the filtered HCF plots with only the non-dominated solutions identified from each of the 20 trials.
Fig. 3 indicates that Formulation 2 has better hydrologic consistency than Formulation 1 on the SW problem. Formulation 2 is a mixed formulation with one GOF measure and two HS (see Section 2.1.2). Moreover, the performance of Formulation 2 is the best under both satisfaction thresholds (KGE > 0.6 and |HS| < 15%, and KGE > 0.5 and |HS| < 25%) in the SW problem.
For the FW problem, Formulation 2 appears to perform slightly better than Formulation 1 according to the HCF plot with all solutions from the optimization experiments (Fig. 4(a)). However, the difference between Formulations 2 and 1 is not as obvious as in the SW problem. In addition, from the filtered HCF plot using non-dominated solutions (Fig. 4(b)), Formulation 2 generates a higher proportion of non-dominated solutions when a larger number of criteria are satisfied. For instance, Formulation 2 has a higher proportion of non-dominated solutions that satisfy at least 6 criteria at the higher threshold (and at least 8 criteria at the lower threshold) than Formulation 1. In practice, users are likely to be more interested in non-dominated solutions than in all solutions, and in solutions that satisfy a large number of criteria rather than only a few. Hence, formulations that produce a larger proportion of non-dominated solutions satisfying more criteria are preferable.
The HCF plots in Figs. 3 and 4 only give the proportion of evaluation points that satisfy multiple criteria but cannot provide information on how each of the different hydrologic criteria is satisfied. We therefore introduce the HCF heatmap, which shows the frequency with which each hydrologic criterion is satisfied among the non-dominated solutions that satisfy
Fig. 2. Comparison of best and worst non-dominated fronts of (a) GOMORS, (b) ParEGO, and (c) BORG after 600 evaluations against the approximated Pareto front
of problem FW-2 based on 134,000 simulations from which the Pareto Front was calculated. The approximated Pareto front is in red. The best non-dominated front
achieved by the optimization algorithm is in dark blue. The worst non-dominated front is in light blue. The best and worst non-dominated fronts are obtained from
the best and worst performed trials, respectively, among the 20 repeated random trials for each algorithm.
Fig. 3. Hydrologic consistency frequency (HCF) plots for GOMORS when applied to problems SW-2 and SW-3 (see Table 1). Subfigure (a) plots the average (over multiple trials) proportion of evaluated points that satisfy at least N scores (x-axis) as per the different hydrologic satisfaction levels defined in the legend (and in Section 3.2). Subfigure (b) plots the average (over multiple trials) proportion of non-dominated points (according to the relevant formulation) that satisfy at least N scores (x-axis) as per the hydrologic satisfaction levels defined in the legend (and in Section 3.2). A higher curve, i.e., a higher proportion of satisfaction, is better (for all subplots). Note that there are 9 criteria and hence the proportion of evaluations satisfying 10 criteria is zero.
at least N hydrologic criteria (as shown in Fig. 5 for the SW problem and Fig. 6 for the FW problem). Fig. 5 shows that as the number of criteria satisfied (N) increases, the frequency of each individual criterion being satisfied also increases. FMS, DP, and DMD are the three hydrologic criteria with lower frequencies of being satisfied compared to the other hydrologic criteria in all cases (in both SW-2 and SW-3). KGE has the highest frequency of being satisfied amongst all criteria. Satisfaction frequencies for most criteria from SW-3 are generally higher than those from SW-2 (especially for the higher satisfaction threshold (i.e., KGE > 0.6 and |HS| < 15%)).
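The heatmap entries can be computed in the same spirit, assuming criterion satisfaction has already been evaluated for every solution (a sketch with illustrative names):

import numpy as np

def hcf_heatmap(satisfied, nd_mask):
    # satisfied: boolean array (n_solutions, n_criteria), True where a criterion
    # (KGE, RR, FHV, ...) is met; nd_mask: boolean mask of non-dominated solutions.
    sat = np.asarray(satisfied)[np.asarray(nd_mask)]
    counts = sat.sum(axis=1)
    rows = []
    for N in range(1, sat.shape[1] + 1):
        subset = sat[counts >= N]
        rows.append(subset.mean(axis=0) if len(subset) else np.zeros(sat.shape[1]))
    return np.array(rows)   # rows: threshold N; columns: per-criterion satisfaction frequency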
Fig. 4. Hydrologic consistency frequency (HCF) plots for GOMORS when applied to problems FW-2 and FW-3 (see Table 1). Subfigure (a) plots the average (over multiple trials) proportion of evaluated points that satisfy at least N scores (x-axis) as per the different hydrologic satisfaction levels defined in the legend (and in Section 3.2). Subfigure (b) plots the average (over multiple trials) proportion of non-dominated points (according to the relevant formulation) that satisfy at least N scores (x-axis) as per the hydrologic satisfaction levels defined in the legend (and in Section 3.2). A higher curve, i.e., a higher proportion of satisfaction, is better (for all subplots). Note that there are 9 criteria and hence the proportion of evaluations satisfying 10 criteria is zero.
Fig. 5. Hydrologic consistency frequency (HCF) heatmap for GOMORS when applied to problems SW-2 and SW-3 (see Table 1). Subfigures plot the frequency with which each hydrologic criterion is satisfied among the non-dominated evaluated points that satisfy at least N criteria under the hydrologic satisfaction levels defined in Section 3.2. Subfigures (a) and (b) are plots for the SW-2 problem; subfigures (c) and (d) are plots for the SW-3 problem. Deeper color indicates a higher proportion of satisfaction and so is better (for all subplots).
Similar ndings are also seen in the FW problems where FMS and DP
are the criteria that have, comparatively, fewer frequencies of being
satised; and solutions from FW-3 have a higher frequency in terms of
satisfying each criterion. The relatively low frequencies of satisfaction of
DP and FMS can be attributed to structural errors arising from weather
data inaccuracies, parameterization & model setup inadequacies, and
SWAT model structural errors. For instance, multiple studies report that
SWAT tends to underestimate peak ows (Muhammad et al., 2019; Me
et al., 2015). Moreover, the model setup of both the SW & FW problems
utilizes a small set of stations for weather input (Tolson and Shoemaker,
2007a). This may result in weather data induced inaccuracies in model
outputs.
Fig. 7 gives the distributions of each hydrologic criterion for the filtered non-dominated evaluations, i.e., the evaluations that satisfied at least 6 hydrologic criteria. The average value of the hydrologic criteria of these filtered solutions from Objective Function 2 is better (higher KGE and lower hydrologic signature error) than that from Objective Function 1 for almost all hydrologic signatures, except for DAC in FW-3. Moreover, the distribution of signatures for filtered solutions with Objective Function 2 (i.e., SW-3 and FW-3) has less positive/negative bias (i.e., the boxplots for SW-3 and FW-3 typically range between negative and positive values). This implies that the filtered non-dominated solutions for Objective Function 2 have less bias in individual signatures and thus may better represent model uncertainties if these filtered solutions are used as an ensemble of calibration alternatives (similar to equifinality (Her and Seong, 2018; Beven and Binley, 2014)). Section 4.5 analyzes the performance of filtered solutions in further detail for one optimization run of the FW-3 problem.
The ltered solutions generally have a high value of KGE. The
average KGE value of ltered solutions is around 0.75 for SW-2 and SW-
3, respectively. For the FW-2 and FW-3 problems, the KGE value is
around 0.8. The value of all hydrologic signatures except for the FMS,
DP, and DMD in the SW problem and except for the FMS and DP in the
FW problem are generally less than 0.15 (within the red dash line). This
implies that hydrologic criteria could be an effective tool used for non-
dominated solution ltering (e.g., selecting high-quality subset solutions
from a large set of non-dominated solutions).
4.5. Calibration selection/filtering & analysis of simulated hydrographs
"Filtered non-dominated solutions" are further analyzed to assess whether the calibrated (using efficient MO optimization) hydrologic model of the Cannonsville watershed (FW) performs adequately, and also to gain insights on model setup, parameter sensitivity and output uncertainty. Fig. 8 provides an overview of the performance of the filtered non-dominated solutions obtained from the median GOMORS optimization run (median trial out of 20, as per hypervolume) of FW-3. The filtered non-dominated solutions in Fig. 8 are the subset of non-dominated solutions of the median GOMORS trial (for FW-3) that satisfy at least 7 consistency criteria (as per threshold definition 2: KGE > 0.6 and |HS| < 15%).
Fig. 8 also reports the parameter ranges (see Fig. 8(a)) and the output uncertainty derived from "filtered solutions only". "Filtered solutions only" are the subset of all 600 simulation runs (118 out of 600) of the median GOMORS trial (for FW-3) that satisfy at least 7 consistency criteria (as per threshold definition 2: KGE > 0.6 and |HS| < 15%). Consequently, the filtered non-dominated solutions are also a subset of the "filtered solutions only". The purpose of reporting "filtered solutions only" alongside "filtered non-dominated solutions" in Fig. 8 is to illustrate that a smaller ensemble of good-quality calibrations can be obtained by using both filtering via non-domination and filtering via hydrologic consistency satisfaction.
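A sketch of these two filtering steps, assuming the objective vectors and consistency-criteria counts of all 600 evaluations are available (names illustrative):

import numpy as np

def nondominated_mask(objs):
    # Boolean mask of non-dominated rows for minimization objectives.
    objs = np.asarray(objs, dtype=float)
    mask = np.ones(len(objs), dtype=bool)
    for i, p in enumerate(objs):
        mask[i] = not (np.all(objs <= p, axis=1) & np.any(objs < p, axis=1)).any()
    return mask

def filter_calibrations(objs, n_satisfied, min_criteria=7):
    # "Filtered only": evaluations meeting at least min_criteria consistency
    # criteria (118 of 600 in the median FW-3 trial); "filtered ND": those that
    # are additionally non-dominated (21 of 600).
    consistent = np.asarray(n_satisfied) >= min_criteria
    return consistent, consistent & nondominated_mask(objs)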
A total of 21 (out of 39) non-dominated solutions (from the median
FW-3 calibration trial) satisfy at least 7 hydrologic consistency metrics,
Fig. 6. Hydrologic consistency frequency (HCF) heatmap for GOMORS when applied to problems FW-2 and FW-3 (see Table 1). Subfigures plot the frequency with which each hydrologic criterion is satisfied among the non-dominated evaluated points that satisfy at least N criteria under the hydrologic satisfaction levels defined in Section 3.2. Subfigures (a) and (b) are plots for the FW-2 problem; subfigures (c) and (d) are plots for the FW-3 problem. Deeper color indicates a higher proportion of satisfaction and so is better (for all subplots).
and parameter sets & simulation output ensembles of all these solutions are summarized in Fig. 8. Fig. 8(a) shows the normalized parameter values and ranges of all parameters included in the calibration optimization (see Table A1 for parameter descriptions and optimization ranges). The parameter range highlighted in gray corresponds to the "filtered only" solutions, and it is evident that this range is much larger, whereas the parameter ranges of the "filtered non-dominated solutions" (highlighted in yellow) are narrower and better illustrate the sensitivity of the calibration parameters. For instance, GW_DELAY, ALPHA_BF, and CN2_f appear to be more sensitive, as evidenced by the relatively narrow distributions of their values for high-quality calibration solutions, as shown in Fig. 8(a). In contrast, SFTMP and SMTMP appear to be less sensitive, with their values spanning a wider range across the search space. The seemingly high sensitivity of GW_DELAY, which is the percolation lag time (Arnold et al., 2012), may be because the optimization range for this parameter is set too high (0.001-500 days; see Table A1). CN2_f is a multiplicative factor to calibrate the curve number in model sub-catchments and is expected to be a sensitive parameter. In general, insights from parameter values, trends, and ranges for the filtered non-dominated solutions can assist in ascertaining the validity of the calibration process and allow for changes in the calibration setup, if necessary. This is facilitated by the use of an efficient surrogate algorithm for MO calibration.
Fig. 8 also includes three sub-figures (Fig. 8(b-d)) that compare simulated (mean hydrograph of the filtered calibrations and range of hydrographs) and measured hydrographs. Fig. 8-b shows the respective hydrographs for the full simulation period (1994-1999) (including the mean and upper/lower bounds of the ensemble simulations), whereas Fig. 8-c and Fig. 8-d show hydrographs for dry and wet years for a more detailed illustration of the differences between measured and simulated flows. The mean and upper/lower bounds of the hydrographs are calculated from the ensemble simulations as described in Section 2.5. These figures illustrate that the ensemble (of 21 filtered calibrations) simulated hydrographs fit relatively better on non-peak flows. Peak flows are usually underestimated, which is expected and consistent with the prior analysis (see discussion related to Fig. 7). Overall, the mean and ranges of simulated flows capture the measured hydrology adequately, both in terms of timing and overall water balance.
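The ensemble statistics plotted in Fig. 8(b-d) can be produced along the following lines; whether the band is the pointwise minimum/maximum or a quantile envelope is specified in Section 2.5, so the min/max bounds used here are an assumption:

import numpy as np

def ensemble_hydrograph(simulated_flows):
    # simulated_flows: array (n_members, n_timesteps) of daily flows simulated
    # by the filtered ND calibrations; returns the mean hydrograph and bounds.
    sims = np.asarray(simulated_flows, dtype=float)
    return sims.mean(axis=0), sims.min(axis=0), sims.max(axis=0)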
In order to further understand and compare the simulated and measured flows, Fig. 8 also includes measured and simulated Flow Duration Curves (FDC) in Fig. 8-e (where flows are shown on a log scale), and a scatter plot of measured vs. mean simulated flow in Fig. 8-f. These plots also show that, overall, the ensemble simulation outputs compare well against the measured flow.
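A flow duration curve of the kind shown in Fig. 8-e can be computed as sketched below; the Weibull plotting position is an assumption, since the exact convention is not stated in this section:

import numpy as np

def flow_duration_curve(flows):
    # Exceedance probability versus flow (flows are plotted on a log scale in Fig. 8-e).
    q = np.sort(np.asarray(flows, dtype=float))[::-1]       # flows in descending order
    exceedance = np.arange(1, len(q) + 1) / (len(q) + 1)    # Weibull plotting position
    return exceedance, q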
4.6. Further discussion
4.6.1. Predictive uncertainty from MO with HS consistency selection
criteria
Our study introduces a novel framework for watershed model cali-
bration, integrating multi-objective optimization for parameter search,
the selection of non-dominated solutions based on hydrologic consis-
tency criteria, and subsequent evaluation of model parameter and
output uncertainty from the final solutions. It is imperative to delineate
the distinctive features of our approach in comparison to existing
methods for parameter estimation with model uncertainty
Fig. 7. Box-plots depicting distributions of consistency criteria on filtered non-dominated fronts, i.e., non-dominated solutions where at least 6 criteria are satisfied as per the stricter satisfaction definition given in Section 3.2 (KGE > 0.6 and |HS| < 0.15), for GOMORS (all trials) when applied to both formulations of the SW (Townbrook watershed) and FW (Cannonsville watershed) case studies (see Table 1). Subfigure (a) plots the distributions of all consistency criteria on the filtered non-dominated fronts of the SW-2 and SW-3 problems. Subfigure (b) plots the distributions of all consistency criteria on the filtered non-dominated fronts of the FW-2 and FW-3 problems. The horizontal dashed line shows the ideal value of an HS-based criterion, and the dotted lines show the allowed range for HS-based criteria (i.e., |HS| < 0.15). Higher KGE values are desired and the ideal KGE value is 1.
quantification. While our study is not primarily focused on uncertainty quantification, we emphasize that our approach empowers modelers to gain insights into model output uncertainty through the parameter sets obtained via multi-objective optimization and the selection of non-dominated solutions based on hydrologic signature consistency criteria. Our method employs a multi-objective search to identify a set of superior parameters, represented by non-dominated solutions, each excelling in at least one objective defined in the multi-objective problem. Hydrologic consistency criteria are then applied to select a subset of non-dominated solutions meeting high-quality standards, i.e., satisfying multiple hydrologic consistency criteria. Consequently, the final solutions represent robust choices derived from diverse parameter values. In the FW problem, for instance, 21 sets of parameters are identified as final solutions, forming the basis for deriving model output uncertainty. The uncertainty analysis in our study therefore mainly concerns the uncertainty quantification of the final solutions (i.e., the selected non-dominated solutions) obtained by our approach.
Our goal is not to solve the equifinality issue for multi-objective
Fig. 8. Analysis of filtered calibrations: a) Parameter sets and ranges, b-d) Hydrographs of measured flow and the mean and upper/lower bounds of simulated flow for the Cannonsville SWAT model, e) Flow duration curves of measured and simulated flows (mean and upper/lower bounds), and f) Scatter plot of mean simulated flow (mean of all hydrographs of all filtered calibrations) vs. measured flow. Please note that "Filtered Only" (118 out of 600) is the subset of calibrations (derived from the median GOMORS run for the FW-3 problem) that satisfy at least 7 hydrologic consistency criteria, whereas "Filtered ND" is the ensemble/subset of calibrations (21 out of 600; derived from the median GOMORS run for the FW-3 problem) that are both non-dominated and satisfy at least 7 hydrologic consistency criteria. Moreover, the mean simulated flow is the mean hydrograph obtained from the "filtered ND" simulation ensemble.
calibration problems, which is beyond the scope of this paper. However,
it would be worthwhile to discuss how our approach in uncertainty
approximation differs from previous methods, such as the informal
Bayesian approach GLUE and various formal Bayesian Markov chain
Monte Carlo (MCMC) methods, as discussed in the literature section, in
several respects. Firstly, the solutions we included for uncertainty
analysis are achieved through surrogate-assisted multi-objective opti-
mization and hydrologic consistency criteria. In GLUE, a set of behav-
ioral solutions is selected by a subjective threshold value for the
likelihood function (for example, a percentage of the total sampled so-
lutions), leading to model output uncertainty that significantly depends
on the modeler dening this subjective threshold. In the formal Bayesian
approach, a cut-off threshold for behavioral and non-behavioral solu-
tions is based on the underlying parameter probability distribution,
necessitating the definition of a formal likelihood function. Another
distinction lies in the derivation of prediction uncertainty. In GLUE,
prediction uncertainty is obtained from likelihood-weighted predictions
or quantiles with a behavioral parameter set, where the likelihood value
measures the agreement between model predictions and observations.
In formal Bayesian MCMC methods, predictive uncertainty is estimated
by evaluating the model output for parameter sets sampled from the
posterior distribution of model parameters. For the framework we pro-
pose, we derive model output uncertainty by establishing upper and
lower bounds of the model output from the selected non-dominated
solutions (i.e., behavioral solutions). We do not assign separate
weights to different behavioral solutions because these selected non-
dominated solutions are all high quality, and there is no straightfor-
ward way to rank their quality to obtain an appropriate weight. Our
approach does not employ a statistical likelihood function like formal statistical methods, and some subjective decisions are still necessary, such as determining the number of hydrologic consistency criteria to meet for selecting non-dominated solutions. Therefore, the derived model uncertainty is somewhat dependent on the modeler's subjective choices. Nevertheless, our study introduces the hydrologic consistency frequency plots and heatmaps as effective tools for aiding modelers in making these decisions.
In our study, we focused on using multi-objective (MO) calibration,
and the uncertainty analysis also centers around solutions obtained from
MO with multiple criteria. Although our primary focus is not to compare
MO calibration with single-objective calibration, we would like to
discuss some aspects in terms of parameter and simulation uncertainty
associated with MO and single-objective calibration. Compared with
single-objective calibration, MO calibration could help improve the identifiability of model parameters by considering multiple aspects of the system. Single-objective calibration can struggle with parameter identifiability issues, especially if the chosen objective function is not sensitive to certain aspects of the hydrological response. However, MO calibration requires more extensive and diverse datasets than single-objective calibration to adequately capture different aspects of the hydrological system. As the number of objectives increases, it also becomes more challenging for MO algorithms to solve the optimization search problems. The difficulty of finding the true Pareto front within a limited budget can also introduce uncertainty into parameter identification and simulation output. Hence, it is important to have efficient MO methods that can find better solutions in a shorter time, which is what our study aims to achieve. Additionally, the inclusion of multiple criteria in MO can lead to solutions that excel in one or some objectives but perform extremely poorly in others. Among the non-dominated solutions, such extreme solutions could also increase the uncertainty in the model simulation output. Therefore, it is also important to conduct a selection of the non-dominated solutions from MO, exemplified by the hydrologic signature consistency selection approach we proposed.
4.6.2. Versatile application potential beyond current study
In this study, we applied our calibration assessment approach to two
watershed models using SWAT. It is crucial to emphasize that our
approach has broader applicability beyond SWAT and can be extended
to other watershed models. Our proposed approach is inherently general
for two main reasons. Firstly, the surrogate models employed in
surrogate-assistant methods are versatile and can be applied to any
computationally expensive model. This exibility allows our approach
to transcend the connes of a specic model, enabling its utilization
across a spectrum of watershed models. Secondly, the hydrologic con-
sistency criteria, a pivotal component of our approach, only necessitates
simulation output time series. This means that it can be employed to
analyze the tness of runoff simulation time series produced by various
watershed models. The adaptability of this criterion adds to the gener-
alizability of our approach, making it applicable to a wide range of
hydrological models. The generality of our approach is supported by the
adaptability of surrogate models and the inclusivity of hydrologic con-
sistency criteria, rendering it applicable and valuable in the broader
context of hydrological modeling.
5. Conclusions
This study is novel in 1) providing the first comprehensive comparative analysis of both multiple surrogate and non-surrogate MO algorithms for efficient calibration of computationally expensive hydrological models and 2) proposing a framework to evaluate and filter the many non-dominated solutions obtained from multi-objective optimization using hydrologic signatures.
We compared the performance of 7 different MO algorithms, including the surrogate-assisted methods, GOMORS and ParEGO, the multi-method and adaptive evolutionary search methods, AMALGAM and Borg MOEA, and the widely used MOEAs, NSGA-II, ε-MOEA, and MOEA/D. These algorithms are tested on two different watershed models that were developed using the Soil and Water Assessment Tool (SWAT) and with two different multi-objective optimization formulations.
As is also illustrated in our analysis, the selection of both an appro-
priate algorithm and the objective function vector (i.e., formulations) is
extremely important in hydrologic model calibration, especially when
the underlying watershed problems are computationally expensive, and
the number of available evaluations is limited. GOMORS outperforms all
other algorithms on all four watershed calibration problems (two
watershed models with two different formulations) based on a) the
traditional hypervolume metric-based analysis, b) visual comparison of
trade-off curves, and c) statistical testing, within an evaluation budget of
600. The second-best method overall is the surrogate-based ParEGO, which supports the usefulness of surrogates for computationally expensive functions. However, GOMORS significantly outperforms ParEGO as well as the other algorithms, which suggests that the specifics of the surrogate algorithm are also important.
We proposed a new framework that uses hydrologic signatures to evaluate and filter the non-dominated solutions from the multi-objective calibration. The framework uses hydrologic consistency frequency (based on the number of defined hydrologic criteria satisfied) to assess the quality of non-dominated solutions from a MO calibration search. We adopt the hydrologic consistency frequency plot from Shafii and Tolson (2015) to assess the overall quality of solutions from a MO calibration search and propose a new hydrologic consistency frequency heatmap to assess the calibration quality in terms of each hydrologic criterion. These analyses allow us to assess the calibration quality from different MO formulations and to filter the non-dominated solutions to find the high-quality solutions, in order to help make sensible decisions about which set of parameter values should be used.
Our hydrologic consistency analysis indicates that, among the two MO formulations tested, the three-objective decomposition of NSE performs better when tested on the two watersheds analyzed in this study (e.g., solutions from this formulation have a higher probability of satisfying more hydrologic criteria). We also found that, for both watershed models, the calibration solutions overall have relatively low frequencies of satisfaction of DP and FMS (see Table A2 for definitions),
which might be attributed to errors arising from weather data inaccur-
acies, parameterization & model setup inadequacies, and SWAT model
structural errors.
We also analyzed the quality of filtered non-dominated solutions based on the hydrologic consistency analysis. Results show that the filtered non-dominated solutions from GOMORS using the three-objective decomposition of NSE generally capture the measured hydrology adequately, both in terms of timing and overall water balance. This is facilitated by the use of an efficient surrogate algorithm for MO calibration and by the effectiveness of hydrologic criteria for non-dominated solution filtering.
The framework we proposed is general and could be used for other
watershed model calibration problems, which could be investigated in
future work. This framework could also be extended for efficient sequential model calibration, where parameter ranges or model structure may be adjusted iteratively or sequentially in each calibration run/iteration.
Software availability
Name of software: GOMORS_pySOT
Description: A surrogate-assisted Multi-Objective Optimization (MO) strategy, designed for computationally expensive MO problems, e.g., expensive environmental simulation optimization problems, hyperparameter tuning of Deep Neural Networks, etc. The Townbrook watershed SWAT model calibration problem is provided as an example of the use of GOMORS.
Developer: Taimoor Akhtar. Contact: taimoor.akhtar@gmail.com
Program language: Python.
Availability and cost: Free and open source for non-commercial use. Available on GitHub at https://github.com/drkupi/GOMORS_pySOT.
CRediT authorship contribution statement
Wei Xia: Writing - review & editing, Writing - original draft, Visualization, Validation, Software, Methodology, Investigation, Formal analysis, Conceptualization. Taimoor Akhtar: Writing - review & editing, Writing - original draft, Visualization, Validation, Software, Resources, Methodology, Investigation, Formal analysis, Conceptualization. Wei Lu: Writing - review & editing, Methodology, Investigation, Formal analysis. Christine A. Shoemaker: Writing - review & editing, Supervision, Funding acquisition, Conceptualization.
Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Data availability
I have shared the link to the code in the manuscript
Acknowledgments
This research was done primarily at the National University of Singapore (NUS), supported by the National Research Foundation, Prime Minister's Office, Singapore under its Campus for Research Excellence and Technological Enterprise (CREATE) program. Dr. Xia, Dr. Akhtar and Dr. Lu were also supported by Prof. Shoemaker's NUS start-up grant. This work is an extension of research started at Cornell University by Akhtar and Shoemaker with financial support from the Fulbright-HEC Pakistan program and from a USA-NSF grant to Prof. Shoemaker. The data and algorithms used in this study are provided in tables and figures or listed in the references. The model and data for the Townbrook watershed are shared in a Hydroshare repository (Xia et al., 2023) as an example for readers to run the GOMORS code. The data related to the Townbrook and Cannonsville watershed models were obtained by Prof. Bryan A. Tolson and Prof. Shoemaker from the New York State government.
Appendix A. Supplementary data
Supplementary data to this article can be found online at https://doi.org/10.1016/j.envsoft.2024.105983.
Appendix A: Table A1
Description of the SWAT model parameters calibrated in the two models used in this study, showing the parameter range in the original optimization setting and the parameter ranges of the "Filtered Only" and "Filtered ND" solutions. "Filtered Only" (118 out of 600) is the subset of calibrations (derived from the median GOMORS run for the FW-3 problem) that satisfy at least 7 hydrologic consistency criteria, whereas "Filtered ND" (21 out of 600; derived from the median GOMORS run for the FW-3 problem) is the ensemble/subset of calibrations that are both non-dominated and satisfy at least 7 hydrologic consistency criteria.

No. | SWAT Code | Input file  | Definition and unit (if any)                                                                       | Original range | Calibrated range (Filtered Only) | Calibrated range (Filtered ND)
1   | SFTMP     | .bsn        | Snowfall temperature (°C)                                                                          | -5 to 5        | -5.0 to 3.80                     | -5.0 to 0.998
2   | SMTMP     | .bsn        | Snow melt base temperature (°C)                                                                    | -5 to 5        | -5.0 to 5.0                      | -3.6 to 5.0
3   | SMFMX     | .bsn        | Maximum melt rate for snow during the year, where deg C refers to the air temperature (mm H2O/°C-day) | 1.5 to 8    | 1.5 to 8.0                       | 1.5 to 8.0
4   | TIMP      | .bsn        | Snowpack temperature lag factor                                                                    | 0.01 to 1      | 0.04 to 1.0                      | 0.53 to 1
5   | SURLAG    | .bsn        | Surface runoff lag coefficient                                                                     | 1 to 24        | 1 to 24.0                        | 1.0 to 5.65
6   | GW_DELAY  | .gw         | Groundwater delay time (days)                                                                      | 0.001 to 500   | 0.001 to 500                     | 0.001 to 8.40
7   | ALPHA_BF  | .gw         | Alpha factor for groundwater recession curve                                                       | 0.001 to 1     | 0.082 to 1                       | 0.83 to 1
8   | GWWQMN    | .gw         | Threshold depth for shallow aquifer (mm H2O)                                                       | 0.001 to 500   | 0.001 to 500                     | 183.27 to 500
9   | LAT_TIME  | .hru        | Lateral flow travel time (days)                                                                    | 0.001 to 180   | 0.001 to 180                     | 0.001 to 101.40
10  | ESCO      | .bsn & .hru | Soil evaporation compensation factor                                                               | 0.01 to 1      | 0.01 to 0.85                     | 0.01 to 0.26
11  | CN2_f     | .mgt        | CN multiplicative factor                                                                           | 0.75 to 1.25   | 0.75 to 0.99                     | 0.75 to 0.89
12  | DEPTH_f   | .sol        | Depth multiplicative factor                                                                        | 0 to 1         | 0 to 1                           | 0.014 to 1
13  | BD_f      | .sol        | Bulk density multiplicative factor                                                                 | 0 to 1         | 0 to 1                           | 0 to 1
14  | AWC_f     | .sol        | Available water content multiplicative factor                                                      | 0 to 1         | 0.0004 to 1                      | 0.73 to 1
15  | KSAT_f    | .sol        | Saturated hydraulic conductivity multiplicative factor                                             | 0 to 1         | 0 to 1                           | 0 to 0.18
Appendix B: Table A2
Details of the Hydrological Signatures Used in this Study.

Group: Water balance
  RR (Overall runoff to rainfall ratio): RR = Σ_{t=1..N} Q_t / Σ_{t=1..N} P_t, where Q_t and P_t are the runoff and precipitation in time step t.

Group: Flow duration curve (FDC)
  FHV (High-segment volume): FHV = Σ_{h=1..H} Q_h, where h are flow indices located within the high-flow segment (exceedance probabilities lower than 5%) and H is the index of the maximum flow.
  FMS (Mid-segment slope): FMS = log(Q_m1) - log(Q_m2), where m1 and m2 are the lowest (20%) and highest (70%) flow exceedance probabilities located at the two sides of the mid-segment.
  FLV (Low-segment volume): FLV = Σ_{l=1..L} [log(Q_l) - log(Q_L)], where l are flow indices located within the low-flow segment (flow exceedance probabilities higher than 70%) and L is the index of the minimum flow.

Group: Discharge statistics
  DS (Discharge standard deviation): σ = sqrt( Σ_{t=1..N} (Q_t - Q̄)² / N ), the standard deviation of the flow time series.
  DM (Mean discharge): μ = Σ_{t=1..N} Q_t / N, the mean of the flow time series.
  DP (Peak discharge): max(Q_t), the peak of the flow time series.
  DMD (Median discharge): Med(Q_t), the median of the flow time series.
  DAC (Lag-1 autocorrelation coefficient): DAC = Σ_{t=1..N-1} (Q_t - Q̄)(Q_{t+1} - Q̄) / Σ_{t=1..N} (Q_t - Q̄)², where Q̄ is the mean flow.
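For concreteness, a minimal sketch of how a few of these signatures can be computed from daily flow and precipitation series follows (illustrative only; segment limits follow the definitions above, and zero flows would require special handling in the logarithmic terms):

import numpy as np

def example_signatures(q, p):
    # q: flow time series; p: precipitation time series (same length).
    q, p = np.asarray(q, dtype=float), np.asarray(p, dtype=float)
    qbar = q.mean()
    q_desc = np.sort(q)[::-1]                                  # flows in descending order
    exceed = np.arange(1, len(q) + 1) / (len(q) + 1)           # exceedance probabilities
    high = q_desc[exceed < 0.05]                               # high-flow segment
    low = q_desc[exceed > 0.70]                                # low-flow segment
    return {
        "RR": q.sum() / p.sum(),                               # runoff-to-rainfall ratio
        "FHV": high.sum(),                                     # high-segment volume
        "FLV": float(np.sum(np.log(low) - np.log(low.min()))), # low-segment volume
        "DP": q.max(),                                         # peak discharge
        "DMD": float(np.median(q)),                            # median discharge
        "DAC": float(np.sum((q[:-1] - qbar) * (q[1:] - qbar))
                     / np.sum((q - qbar) ** 2)),               # lag-1 autocorrelation
    }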
References
Abbaspour, K.C., Johnson, C.A., van Genuchten, M.T., 2004. Estimating uncertain ow
and transport parameters using a sequential uncertainty tting procedure. Vadose
Zone J. 3 (4), 13401352. https://doi.org/10.2136/vzj2004.1340.
Ahmadi, M., Arabi, M., Ascough, J.C., Fontane, D.G., Engel, B.A., 2014. Toward
improved calibration of watershed models: multisite multiobjective measures of
information. Environ. Model. Software 59, 135145. https://doi.org/10.1016/j.
envsoft.2014.05.012.
Akhtar, T., Shoemaker, C.A., 2016. Multi objective optimization of computationally
expensive multi-modal functions with RBF surrogates and multi-rule selection.
J. Global Optim. 64 (1), 1732. https://doi.org/10.1007/s10898-015-0270-y.
Arnold, J.G., Moriasi, D.N., Gassman, P.W., Abbaspour, K.C., White, M.J., Srinivasan, R., et al., 2012. SWAT: model use, calibration, and validation. Transactions of the ASABE 55 (4), 1491-1508. ISSN 2151-0032.
Asadzadeh, M., Tolson, B., 2013. Pareto archived dynamically dimensioned search with
hypervolume-based selection for multi-objective optimization. Eng. Optim. 45 (12),
14891509. https://doi.org/10.1080/0305215X.2012.748046.
Asadzadeh, M., Tolson, B.A., Burn, D.H., 2014. A new selection metric for multiobjective
hydrologic model calibration. Water Resour. Res. 50 (9), 70827099. https://doi.
org/10.1002/2013WR014970.
Auger, A., Bader, J., Brockhoff, D., Zitzler, E., 2009. Theory of the hypervolume indicator: optimal μ-distributions and the choice of the reference point. Proceedings of the Tenth ACM SIGEVO Workshop on Foundations of Genetic Algorithms, pp. 87-102. https://doi.org/10.1145/1527125.1527138.
Baú, D.a., Mayer, A.S., 2006. Stochastic management of pump-and-treat strategies using
surrogate functions. Adv. Water Resour. 29 (12), 19011917. https://doi.org/
10.1016/j.advwatres.2006.01.008.
Behzadian, K., Kapelan, Z., Savic, D., Ardeshir, A., 2009. Environmental Modelling &
Software Stochastic sampling design using a multi-objective genetic algorithm and
adaptive neural networks. Environ. Model. Software 24 (4), 530541. https://doi.
org/10.1016/j.envsoft.2008.09.013.
Bekele, E., Nicklow, J., 2007. Multi-objective automatic calibration of SWAT using
NSGA-II. J. Hydrol. 341 (34), 165176. https://doi.org/10.1016/j.
jhydrol.2007.05.014.
Beven, K., Binley, A., 1992. The future of distributed models: model calibration and
uncertainty prediction. Hydrol. Process. 6 (3), 279298.
Beven, K., Binley, A., 2014. GLUE: 20 years on. Hydrol. Process. 28 (24), 5897-5918.
Boyle, D.P., Gupta, H.V., Sorooshian, S., 2000. Toward improved calibration of
hydrologic models: combining the strengths of manual and automatic methods.
Water Resour. Res. 36 (12), 3663. https://doi.org/10.1029/2000WR900207.
Castelletti, A., Pianosi, F., Soncini-Sessa, R., Antenucci, J.P., 2010. A multiobjective
response surface approach for improved water quality planning in lakes and
reservoirs. Water Resour. Res. 46 (6), 116. https://doi.org/10.1029/
2009WR008389.
Cheng, Y., Xia, W., Detto, M., Shoemaker, C.A., 2023. A framework to calibrate
ecosystem demography models within Earth system models using parallel surrogate
global optimization. Water Resour. Res. 59 (1), e2022WR032945.
Chilkoti, V., Bolisetti, T., Balachandar, R., 2018. Multi-objective autocalibration of SWAT
model for improved low ow performance for a small snowfed catchment. Hydrol.
Sci. J. 63 (10), 14821501. https://doi.org/10.1080/02626667.2018.1505047.
Coello, C.A.C., Lamont, G.B., Van Veldhuizen, D.A., 2007. Evolutionary Algorithms for
Solving Multi-Objective Problems, vol. 5. Springer, New York, pp. 79104.
Conover, W.J., 1998. Practical Nonparametric Statistics, third ed. Wiley.
Deb, K., Pratap, A., Agarwal, S., Meyarivan, T., 2002. A fast and elitist multi-objective genetic algorithm: NSGA-II. IEEE Trans. Evol. Comput. 6 (2), 182-197.
di Pierro, F., Khu, S., Savic, D., Berardi, L., 2009. Efcient multi-objective optimal design
of water distribution networks on a budget of simulations using hybrid algorithms.
Environ. Model. Software 24, 202213. https://doi.org/10.1016/j.
envsoft.2008.06.008.
Ercan, M.B., Goodall, J.L., 2016. Design and implementation of a general software
library for using NSGA-II with SWAT for multi-objective model calibration. Environ.
Model. Software 84, 112120. https://doi.org/10.1016/J.ENVSOFT.2016.06.017.
Eriksson, D., Bindel, D., Shoemaker, C.A., 2019. pySOT and POAP: an event-driven
asynchronous framework for surrogate optimization. ArXiv Preprint ArXiv:
1908.00420 1, 119 arxiv.org/abs/1908.00420.
Espinet, A.J., Shoemaker, C.a., 2013. Comparison of optimization algorithms for
parameter estimation of multi-phase ow models with application to geological
carbon sequestration. Adv. Water Resour. 54, 133148. https://doi.org/10.1016/j.
advwatres.2013.01.003.
Franco, A.C.L., Oliveira, D.Y. de, Bonumá, N.B., 2020. Comparison of single-site, multi-site and multi-variable SWAT calibration strategies. Hydrol. Sci. J. 65 (14), 2376-2389. https://doi.org/10.1080/02626667.2020.1810252.
Gong, W., Duan, Q., 2017. An adaptive surrogate modeling-based sampling strategy for
parameter optimization and distribution estimation (ASMO-PODE). Environ. Model.
Software 95, 6175. https://doi.org/10.1016/j.envsoft.2017.05.005.
Gupta, H.V., Bastidas, L.A., Vrugt, J.A., Sorooshian, S., 2003. Multiple criteria global
optimization for watershed model calibration. In: Duan, Q., Gupta, H.V.,
Sorooshian, S., Rousseau, A.N., Turcotte, R. (Eds.), Calibration of Watershed Models.
Gupta, H.V., Sorooshian, S., Yapo, P.O., 1998. Toward improved calibration of
hydrologic models: multiple and noncommensurable measures of information. Water
Resour. Res. 34 (4), 751. https://doi.org/10.1029/97WR03495.
W. Xia et al.
Environmental Modelling and Software 175 (2024) 105983
18
Gupta, H.V., Kling, H., Yilmaz, K.K., Martinez, G.F., 2009. Decomposition of the mean squared error and NSE performance criteria: implications for improving hydrological modelling. J. Hydrol. 377 (1-2), 80-91. https://doi.org/10.1016/j.jhydrol.2009.08.003.
Hadka, D., Reed, P., 2013. BORG: an auto-adaptive many-objective evolutionary
computing framework. Evol. Comput. 21 (2), 231259. https://doi.org/10.1162/
EVCO_a_00075.
Her, Y., Seong, C., 2018. Responses of hydrological model equifinality, uncertainty, and performance to multi-objective parameter calibration. J. Hydroinf. 20 (4), 864-885.
Hingray, B., Schaei, B., Mezghani, A., Hamdi, Y., 2010. Signature-based model
calibration for hydrological prediction in mesoscale Alpine catchments. Hydrol. Sci.
J. 55 (6), 10021016. https://doi.org/10.1080/02626667.2010.505572.
Knowles, J., 2006. ParEGO: a hybrid algorithm with on-line landscape approximation for
expensive multiobjective optimization problems. IEEE Trans. Evol. Comput. 10 (1),
5066. https://doi.org/10.1109/TEVC.2005.851274.
Jones, D.R., Schonlau, M., Welch, W.J., 1998. Efcient global optimization of expensive
black-box functions. J. Global Optim. 455492. https://doi.org/10.1023/A:
1008306431147.
Kling, H., Fuchs, M., Paulin, M., 2012. Runoff conditions in the upper Danube basin under an ensemble of climate change scenarios. J. Hydrol. 424-425, 264-277.
Kollat, J.B., Reed, P.M., Wagener, T., 2012. When are multiobjective calibration trade-
offs in hydrologic models meaningful? Water Resour. Res. 48 (3), 119. https://doi.
org/10.1029/2011WR011534.
Kuczera, G., Parent, E., 1998. Monte Carlo assessment of parameter uncertainty in
conceptual catchment models: the Metropolis algorithm. J. Hydrol. 211 (14),
6985.
Laumanns, M., Thiele, L., Deb, K., Zitzler, E., 2002. Combining convergence and diversity
in evolutionary multi-objective optimization. Evol. Comput. 10 (3), 263282.
Lu, D., Ricciuto, D., Stoyanov, M., Gu, L., 2018. Calibration of the E3SM land model
using surrogate-based global optimization. J. Adv. Model. Earth Syst. 10 (6),
13371356.
Lu, W., Qin, X., Yu, J., 2019. On comparison of two-level and global optimization
schemes for layout design of storage ponds. J. Hydrol. 570, 544554.
Madsen, H., 2003. Parameter estimation in distributed hydrological catchment
modelling using automatic calibration with multiple objectives. Adv. Water Resour.
26 (2), 205216. https://doi.org/10.1016/S0309-1708(02)00092-1.
Madsen, H., Wilson, G., Ammentorp, H.C., 2002. Comparison of different automated
strategies for calibration of rainfall-runoff models. J. Hydrol. 261, 4859.
Maier, H.R., Kapelan, Z., Kasprzyk, J.R., Kollat, J.B., Matott, L.S., Cunha, M.C., Dandy, G.
C., Gibbs, M.S., Keedwell, E., Marchi, A., Ostfeld, A., Savic, D., Solomatine, D.P.,
Vrugt, J.A., Zecchin, A.C., Minsker, B.S., Barbour, E.J., Kuczera, G., Pasha, F., et al.,
2014. Evolutionary algorithms and other metaheuristics in water resources: current
status, research challenges and future directions. Environ. Model. Software 62,
271299. https://doi.org/10.1016/j.envsoft.2014.09.013.
Martinez, G.F., Gupta, H.V., 2010. Toward improved identication of hydrological
models: a diagnostic evaluation of the abcdmonthly water balance model for the
conterminous United States. Water Resour. Res. 46 (8), 121. https://doi.org/
10.1029/2009WR008294.
Me, W., Abell, J.M., Hamilton, D.P., 2015. Effects of hydrologic conditions on SWAT model performance and parameter sensitivity for a small, mixed land use catchment in New Zealand. Hydrol. Earth Syst. Sci. 19 (10), 4127-4147.
McDonald, S., Mohammed, I.N., Bolten, J.D., Pulla, S., Meechaiya, C., Markert, A.,
Nelson, E.J., Srinivasan, R., Lakshmi, V., 2019. Web-based decision support system
tools: the Soil and Water Assessment Tool Online visualization and analyses
(SWATOnline) and NASA earth observation data downloading and reformatting tool
(NASAaccess). Environ. Model. Software 120 (December 2018), 104499. https://doi.
org/10.1016/j.envsoft.2019.104499.
Mugunthan, P., Shoemaker, C.A., 2006. Assessing the impacts of parameter uncertainty
for computationally expensive groundwater models. Water Resour. Res. 42 (10)
https://doi.org/10.1029/2005WR004640.
Muhammad, A., Evenson, G.R., Stadnyk, T.A., Boluwade, A., Jha, S.K., Coulibaly, P., 2019. Impact of model structure on the accuracy of hydrological modeling of a Canadian Prairie watershed. J. Hydrol.: Reg. Stud. 21, 40-56.
Müller, J., Shoemaker, C.A., Piché, R., 2013. SO-MI: a surrogate model algorithm for computationally expensive nonlinear mixed-integer black-box global optimization problems. Comput. Oper. Res. 40 (5), 1383-1400.
Nash, J.E., Sutcliffe, J.V., 1970. River ow forecasting through conceptual models part I -
A discussion of principles. J. Hydrol. 10 (3), 282290. https://doi.org/10.1016/
0022-1694(70)90255-6.
Nicklow, J., Reed, P., Savic, D., Dessalegne, T., 2010. State of the art for genetic
algorithms and beyond in water resources planning and management. J. Water
Resour. Plan. Manag. 136 (4), 412432.
Razavi, S., Tolson, B.a., Burn, D.H., 2012. Review of surrogate modeling in water
resources. Water Resour. Res. 48 (7), W07401. https://doi.org/10.1029/
2011WR011527.
Reed, P.M., Hadka, D., Herman, J.D., Kasprzyk, J.R., Kollat, J.B., 2012. Evolutionary
multiobjective optimization in water resources: the past, present, and future. Adv.
Water Resour. https://doi.org/10.1016/j.advwatres.2012.01.005.
Regis, R.G., Shoemaker, C.A., 2013. Combining radial basis function surrogates and
dynamic coordinate search in high-dimensional expensive black-box optimization.
Eng. Optim. 45 (5), 529555. https://doi.org/10.1080/0305215X.2012.687731.
Regis, R.G., Shoemaker, C.A., 2007. Stochastic radial basis function method for the
global optimization of expensive functions. Inf. J. Comput. 19, 497509.
Sahraei, S., Asadzadeh, M., Shafii, M., 2019. Toward effective many-objective optimization: rounded-archiving. Environ. Model. Software 122, 104535. https://doi.org/10.1016/j.envsoft.2019.104535.
Sahraei, S., Asadzadeh, M., Unduche, F., 2020. Signature-based multi-modelling and
multi-objective calibration of hydrologic models: application in ood forecasting for
Canadian Prairies. J. Hydrol. 588 (May), 125095 https://doi.org/10.1016/j.
jhydrol.2020.125095.
Shai, M., Tolson, B.A., 2015. Optimizing Hydrological Consistency by Incorporating
Hydrological Signatures into Model Calibration Objectives. Water Resources
Research, pp. 26162633.
Tang, Y., Reed, P.M., Kollat, J.B., 2007. Parallelization strategies for rapid and robust
evolutionary multiobjective optimization in water resources applications. Adv.
Water Resour. 30 (3), 335353. https://doi.org/10.1016/j.advwatres.2006.06.006.
Tang, Y., Reed, P., Wagener, T., 2006. How effective and efcient are multiobjective
evolutionary algorithms at hydrologic model calibration? Hydrol. Earth Syst. Sci. 10
(2), 289307. https://doi.org/10.5194/hess-10-289-2006.
Tolson, B.A., Shoemaker, C.A., 2007a. Cannonsville reservoir watershed SWAT2000 model development, calibration and validation. J. Hydrol. 337 (1-2), 68-86. https://doi.org/10.1016/j.jhydrol.2007.01.017.
Tolson, B.A., Shoemaker, C.A., 2007b. Dynamically dimensioned search algorithm for
computationally efcient watershed model calibration. Water Resour. Res. 43 (1),
116. https://doi.org/10.1029/2005WR004723.
Vrugt, J.A., Gupta, H.V., Bouten, W., Sorooshian, S., 2003. A Shufed Complex Evolution
Metropolis algorithm for optimization and uncertainty assessment of hydrologic
model parameters. Water Resour. Res. 39 (8) https://doi.org/10.1029/
2002WR001642.
Vrugt, J. a, Robinson, B.a., 2007. Improved evolutionary optimization from genetically
adaptive multimethod search. Proc. Natl. Acad. Sci. U.S.A. 104 (3), 708711.
https://doi.org/10.1073/pnas.0610471104.
Wang, Y., Jiang, R., Xie, J., Zhao, Y., Yan, D., Yang, S., 2019. Soil and water assessment
tool (SWAT) model: a systemic review. J. Coast Res. 93 (sp1), 22. https://doi.org/
10.2112/si93-004.1.
Wild, S.M., Regis, R.G., Shoemaker, C.A., 2008. ORBIT: Optimization by radial basis
function interpolation in trust-region. SIAM J. Sci. Comput. 30 (6), 31973219.
Wu, H., Chen, B., Ye, X., Guo, H., Meng, X., Zhang, B., 2021. An improved calibration and
uncertainty analysis approach using a multicriteria sequential algorithm for
hydrological modeling. Sci. Rep. 11 (1), 16954.
Xia, W., Akhtar, T., Shoemaker, C.A., 2022. A novel objective function DYNO for
automatic multivariable calibration of 3D lake models. Hydrol. Earth Syst. Sci. 26
(13), 36513671. https://doi.org/10.5194/hess-26-3651-2022.
Xia, W., Shoemaker, C., 2021. GOPS: efcient RBF surrogate global optimization
algorithm with high dimensions and many parallel processors including application
to multimodal water quality PDE model calibration. Optim. Eng. 22, 27412777.
Xia, W., Shoemaker, C.A., 2022a. A repetitive parameterization and optimization
strategy for the calibration of complex and computationally expensive process-based
models with application to a 3D water quality model of a tropical reservoir. Water
Resour. Res. 58 (5), e2021WR031054.
Xia, W., Shoemaker, C.A., 2022b. Improving the speed of global parallel optimization on
PDE models with processor afnity scheduling. Comput. Aided Civ. Infrastruct. Eng.
37 (3), 279299.
Xia, W., Shoemaker, C., Akhtar, T., Nguyen, M.T., 2021. Efficient parallel surrogate
optimization algorithm and framework with application to parameter calibration of
computationally expensive three-dimensional hydrodynamic lake PDE models.
Environ. Model. Software 135, 104910.
Xia, W., Akhtar, T., Lu, W., Shoemaker, C.A., 2023. Enhanced watershed model evaluation incorporating hydrologic signatures and consistency within efficient surrogate multi-objective optimization. Hydroshare [Dataset]. http://www.hydroshare.org/resource/77f2f1a6625c4a03b7660a87f55faaa4.
Yapo, P., Gupta, H., Sorooshian, S., 1998. Multi-objective global optimization for
hydrologic models. J. Hydrol. 204 (14), 8397. https://doi.org/10.1016/S0022-
1694(97)00107-8.
Yang, G., Guo, S., Liu, P., Li, L., Liu, Z., 2017. Multiobjective cascade reservoir operation
rules and uncertainty analysis based on PA-DDS algorithm. J. Water Resour. Plann.
Manag. 143 (7), 04017025.
Zamani, M., Shrestha, N.K., Akhtar, T., Boston, T., Daggupati, P., 2020. Advancing model calibration and uncertainty analysis of SWAT models using cloud computing infrastructure: LCC-SWAT. J. Hydroinf., October. https://doi.org/10.2166/hydro.2020.066.
Zhang, Q., Li, H., 2007. MOEA/D: a multiobjective evolutionary algorithm based on decomposition. IEEE Trans. Evol. Comput. 11 (6), 712–731. https://doi.org/10.1109/TEVC.2007.892759.
Zhang, X., Beeson, P., Link, R., Manowitz, D., Izaurralde, R.C., Sadeghi, A., Thomson, A.M., Sahajpal, R., Srinivasan, R., Arnold, J.G., 2013. Efficient multi-objective calibration of a computationally intensive hydrologic model with parallel computing software in Python. Environ. Model. Software 46, 208–218. https://doi.org/10.1016/j.envsoft.2013.03.013.
Zou, R., Lung, W.S., Wu, J., 2007. An adaptive neural network embedded genetic algorithm approach for inverse water quality modeling. Water Resour. Res. 43 (8), 1–13. https://doi.org/10.1029/2006WR005158.