IEEE TRANSACTIONS ON CYBERNETICS, VOL. XX, NO. XX, XX 2022 1
Scaling Multiobjective Evolution to Large Data with
Minions: A Bayes-Informed Multitask Approach
Zefeng Chen, Abhishek Gupta, Lei Zhou, and Yew-Soon Ong
Abstract—In an era of pervasive digitalization, the growing volume and variety of data streams pose a new challenge to the efficient running of data-driven optimization algorithms. Targeting scalable multiobjective evolution under large-instance data, this paper proposes the general idea of using subsampled small-data tasks as helpful minions (i.e., auxiliary source tasks) to quickly optimize for large datasets, via an evolutionary multitasking framework. Within this framework, a novel computational resource allocation strategy is designed to enable effective utilization of the minions while guarding against harmful negative transfers. To this end, an inter-task empirical correlation measure is defined and approximated via Bayes' rule, which is then used to allocate resources online in proportion to the inferred degree of source-target correlation. In the experiments, the performance of the proposed algorithm is verified on (1) sample average approximations of benchmark multiobjective optimization problems under uncertainty, and (2) practical multiobjective hyper-parameter tuning of deep neural network models. The results show that the proposed algorithm can obtain up to about 73% speedup relative to existing approaches, demonstrating its ability to efficiently tackle real-world multiobjective optimization involving evaluations on large datasets.
Index Terms—Evolutionary multitasking, data-driven multiobjective optimization, Bayes resource allocation.
I. INTRODUCTION
Optimization problems are ubiquitous. Among them, multiobjective optimization problems (MOOPs) form an important subclass where multiple objective functions (usually conflicting) need to be simultaneously optimized to achieve a trade-off. An MOOP formulation is generally expressed as follows [1]:

min_{x ∈ Ω} F(x) = (f_1(x), ..., f_M(x))    (1)
Corresponding author: Lei Zhou.
This research is supported in part by the Data Science and Artificial Intelligence Research Center (DSAIR), School of Computer Science and Engineering at Nanyang Technological University (NTU), the A*STAR AI3 HTPO seed grant C211118016, the A*STAR Cyber-Physical Production System (CPPS) - Towards Contextual and Intelligent Response Research Program through the RIE2020 IAF-PP Grant A19C1a0018, and the National Natural Science Foundation of China under Grant 62206313.
Zefeng Chen is with the School of Artificial Intelligence, Sun Yat-sen University, China, and also with the School of Computer Science and Engineering, NTU, Singapore (e-mail: chenzef5@mail.sysu.edu.cn; zefeng.chen@ntu.edu.sg).
Lei Zhou is with the School of Computer Science and Engineering, NTU, Singapore (e-mail: lei.zhou@ntu.edu.sg).
Abhishek Gupta is with the Singapore Institute of Manufacturing Technology (SIMTech), Agency for Science, Technology and Research (A*STAR), and the School of Computer Science and Engineering, NTU (e-mail: abhishek_gupta@simtech.a-star.edu.sg; abhishekg@ntu.edu.sg).
Yew-Soon Ong is with the Data Science and Artificial Intelligence Research Centre, School of Computer Science and Engineering, NTU, and also the Chief Artificial Intelligence Scientist of A*STAR Singapore (e-mail: asysong@ntu.edu.sg; ongyewsoon@hq.a-star.edu.sg).
where x is an n-dimensional decision vector and f_i(x) (i = 1, ..., M) denotes the i-th objective function. Ω ⊆ R^n is the decision space (also known as the search space), and its image set S = {F(x) | x ∈ Ω} is called the objective space.
Particularly, in the real world, there exist MOOPs whose objective functions call for computations to be carried out on available data. Examples include multiobjective optimization under uncertainty (i.e., MOOPs involving uncertain parameters whose possible realizations are contained in a dataset) [2]–[4], machine learning use-cases such as multiobjective feature selection [5]–[8], multiobjective AutoML [9], [10], hyper-parameter optimization of multi-task learning models [11], [12], multiobjective neural architecture search [13]–[15], and multi-task feature learning [16], to name just a few. These types of problems can be regarded as data-driven MOOPs, and are formalized as follows:
min_{x ∈ Ω} F(x; D) = (f_1(x; D), ..., f_M(x; D))    (2)

where f_i(x; D) represents the i-th objective function evaluated with dataset D. In general, when D contains a large number of data instances, the computations of f_i(x; D) become expensive. This phenomenon is increasingly common nowadays. The sheer volume and variety of data has increased enormously across many fields, thus posing a significant challenge to the efficient running of data-driven optimization algorithms [17]. In this paper, we specifically focus on MOOPs involving computations with large-instance data (denoted as D_L). Real-world applications of such problems span diverse domains, from decision-support under uncertainty to big data machine learning, as previously listed.
When solving an MOOP, one main difficulty lies in that an
improvement in one objective function is usually accompanied
by performance deterioration in another objective. Thus, a
single optimal solution that can simultaneously optimize all
objectives may not always exist. Instead, the best trade-off
solutions, called the Pareto optimal solutions, are important to
a decision maker. For ease of explaining the Pareto optimality concept, we first present the definition of Pareto dominance tailored for a data-driven MOOP with dataset D¹:

Definition 1. Given two solutions x, y ∈ Ω along with their corresponding objective vectors F(x; D), F(y; D) ∈ R^M, x is said to Pareto dominate y (written as x ≺ y) if and only if (1) for all i ∈ {1, ..., M}, f_i(x; D) ≤ f_i(y; D), and (2) there exists some j ∈ {1, ..., M} such that f_j(x; D) < f_j(y; D).

¹When comparing the Pareto dominance relationship between any two solutions, we assume that their objective function values are computed using the same dataset D.
According to Pareto dominance, if a solution x ∈ Ω is not dominated by any other solution in the decision space Ω, then we call x a Pareto optimal solution. The union of all Pareto optimal solutions is called the Pareto set (PS for short): PS = {x ∈ Ω | ∄ y ∈ Ω s.t. y ≺ x}. The image of the PS in the objective space is called the Pareto front (PF for short).
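The dominance relation of Definition 1 and the resulting non-dominated filtering can be sketched in a few lines. The following is a minimal illustration with our own helper names (not from the paper), assuming minimization and precomputed objective vectors:

```python
# Illustrative sketch of Definition 1 (minimization assumed).
from typing import List, Sequence

def dominates(fx: Sequence[float], fy: Sequence[float]) -> bool:
    """True if objective vector fx Pareto dominates fy: fx is no worse
    in every objective and strictly better in at least one."""
    no_worse = all(a <= b for a, b in zip(fx, fy))
    strictly_better = any(a < b for a, b in zip(fx, fy))
    return no_worse and strictly_better

def non_dominated(objs: List[Sequence[float]]) -> List[int]:
    """Indices of solutions not dominated by any other solution; when
    objs covers the whole search space, this is the Pareto set."""
    return [i for i, fx in enumerate(objs)
            if not any(dominates(fy, fx) for j, fy in enumerate(objs) if j != i)]

objs = [(1.0, 4.0), (2.0, 2.0), (3.0, 3.0), (4.0, 1.0)]
print(non_dominated(objs))  # (3.0, 3.0) is dominated by (2.0, 2.0)
```

Note the quadratic cost of the naive filter; practical MOEAs such as NSGA-II use faster non-dominated sorting, but the comparator is the same.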
In recent decades, evolutionary algorithms (EAs) have demonstrated prowess in solving various kinds of MOOPs by virtue of their implicit parallelism, which allows multiple solutions, characterizing the PS of an MOOP [18], to be obtained in a single run. A variety of multiobjective EAs (MOEAs) have thus been proposed over the years, such as the classical NSGA-II [19], [20], SPEA2 [21] and MOEA/D [22]. Despite their popularity, one major criticism of MOEAs stems from the fact that they usually require a large number of function evaluations to find reasonable approximations to the PS. This vulnerability becomes more apparent when faced with evaluations on large-instance datasets, as computational costs scale deleteriously with the amount of data.
To tackle the aforementioned tractability issue brought about by large-instance datasets, two broad categories of approaches have been considered in the literature, namely, hardware-driven approaches (such as distributed computing [23]–[26], parallel computing [27], [28] and hardware acceleration techniques [29]) and algorithm-centric approaches (also known as software solutions) [30], [31]. On the one hand, hardware-driven approaches are heavily reliant on the amount of available computing resources. On the other hand, algorithmic solutions where fast evaluations are carried out using small subsets of the full data [30], [31] run the risk of misdirecting the evolutionary search. This is because evaluations on small data subsets are not guaranteed to be representative of a solution's true performance. In this paper, we propose a novel strategy to address the shortcoming of the latter category, so as not to fall back on the need for extensive hardware. In particular, we resort to the core idea of evolutionary multitasking (EMT), in which the full dataset together with its smaller subsets can be jointly deployed as synergistically evolving tasks in a single optimization run.
EMT is an emerging search paradigm originally proposed for the simultaneous solving of multiple optimization problems [32]–[34]. It offers a new avenue to further exploit the implicit parallelism of population-based search, taking advantage of latent synergies between distinct tasks through information transfers to accelerate convergence rates in tandem. To date, a variety of EMT algorithms have been proposed in the literature [35], demonstrating potential for wide-ranging real-world applicability [36]. In the specific case of multiobjective evolution with large-instance data, we note that a series of auxiliary small-data tasks can in fact be generated by subsampling the full dataset. It is intuitively expected that at least some of the generated tasks could then share a locally or globally similar fitness landscape with the target task at hand, while being relatively inexpensive to evaluate. It is thus theorized that harnessing such correlated tasks, which we think of as helpful minions, in an EMT framework would assist the target task in quickly converging towards good solutions. What is more, by adaptively allocating greater computational resources to effective minions, a significant boost in overall convergence trends could be achieved.
To sum up, the main contributions of this paper are threefold:
1) A new EMT framework for jointly accommodating large- and small-data tasks is developed for MOEAs to efficiently scale to data-driven MOOPs.
2) An online inter-task empirical correlation measure is proposed for MOOPs and is efficiently approximated by Bayes' rule. The estimate of the empirical correlation is used to adaptively reward more computational resources to inexpensive small-data tasks when they demonstrate beneficial transfers to the target.
3) The performance of the proposed algorithm is verified on MOOPs under uncertainty and the multiobjective hyper-parameter tuning of deep neural network models. The experimental results on diverse benchmark problems and datasets confirm the efficacy of the proposed algorithm with up to 73% speedup.
The remainder of the paper is organized as follows. Section
II presents related work in the literature, while Section III
introduces the preliminaries. Section IV designs, develops and
analyzes the Bayes resource allocation strategy within our
proposed adaptive EMT framework. The experimental results
are provided in Section V. Finally, Section VI concludes this
paper and gives some research directions for future studies.
II. RELATED WORK
A. Multiobjective Evolution under Large-instance Data
In the literature, there are an increasing number of studies
dedicated to tackling the scalability issue faced by evolutionary
computation (EC) techniques for multiobjective optimization
under large-instance data. Many of these studies have however focused on hardware-driven approaches, including distributed computing (i.e., using the MapReduce paradigm), parallel computing and hardware acceleration techniques. For instance, Ferranti et al. [23], [24] and Barsacchi et al. [25] proposed distributed MOEAs based on Apache Spark to generate fuzzy rule-based classifiers. Likewise, the multiobjective evolutionary fuzzy algorithm proposed in [26] for tackling the subgroup discovery task is based on MapReduce. Utilizing a parallel computing environment, Golchin and Liew proposed parallel bi-cluster detection based on the strength Pareto evolutionary algorithm (PBD-SPEA) to conduct bi-clustering [27]. Recently, Karagoz et al. proposed a parallel variant of NSGA-II to address the multiobjective multi-label feature selection problem for the classification of video data [28]. While these representative approaches are able to achieve the goal of efficiency promotion, they do so only by relying heavily on different advanced computing infrastructures.
Different from hardware-driven approaches, there are a relatively smaller number of algorithm-centric techniques that attempt to achieve better efficiency without necessitating parallel/distributed infrastructures. For example, Garcia-Piquer et al. [30] proposed the CAOS evolutionary algorithm for multiobjective clustering, in which the original dataset is divided into several subsets that are alternately used in each
generation of an MOEA. In a subsequent work [31], they
further investigated the performance of three subset building
strategies on large clustering datasets. It is noted that only a
small subset of the full dataset is used in each generation.
This may give rise to a potential risk of misdirecting the
evolutionary search (due to negative transfer of information
from one generation to the next), unless a suitable subset
accurately representative of the full dataset is found. On the
other hand, our method in this paper leverages the EMT
paradigm to simultaneously accommodate the full dataset as
well as its smaller subsets as distinct tasks in each generation
of a single optimization run, hence enabling the learning
of inter-task relationships to control and curb the risk of
misdirection.
Notably, as multiobjective optimization under large-instance
data often involves time-consuming function evaluations, it
can be considered as a type of expensive global optimization
(EGO) problem [17], [37]. In this regard, surrogate-assisted
EC is also a viable technique worth considering [38], [39].
Various types of computationally efficient surrogate models,
such as Gaussian Processes (GP, also known as Kriging model)
[40]–[42], radial basis function networks (RBFN) [43]–[45] or
polynomial regression [46], have been employed to replace a
portion of the original expensive function evaluations in the
evolutionary process. Among existing surrogates, the probabilistic GP is perhaps the most commonly used, including in the classical ParEGO [40] and MOEA/D-EGO [41], due to its ability to capture predictive uncertainties in a principled manner. For instance, in the rapidly growing area of AutoML, GP-based Bayesian Optimization (BO) has become a prominent approach for automatically tuning hyper-parameters of machine learning models exposed to large datasets [11], [47].
Despite recent successes, there are however limitations to the
widespread use of BO algorithms. First, standard BO does
not readily extend to arbitrary solution representations due
to issues of kernel indefiniteness [48]. Further, it is hard to
accumulate enough data to build informative surrogates in
even moderately high-dimensional decision spaces, leading
to the notorious cold start problem [49]. Given the above,
and given the flexibility of EAs in coping with arbitrary
solution representations, this paper focuses on scaling purely
evolutionary multiobjective optimization approaches to
large datasets, from an algorithm-centric perspective (by
means of a novel EMT trick). In future works, further
augmenting the efficacy of evolution with surrogate-assistance
shall be a key research direction.
B. Constructing Auxiliary Tasks in EMT
A number of researchers have looked at utilizing the EMT
paradigm in a manner that one or more auxiliary tasks (helper
tasks) are artificially constructed and assimilated to assist the
solving of the original task at hand [50], [51]. For example, in [52], with the aim of solving difficult single-objective optimization tasks, Ma et al. utilized a technique called multiobjectivization via decomposition to generate helper tasks, each of which is a multiobjectivization of the original single-objective optimization task. Feng et al. [53] constructed multiple auxiliary tasks with simplified search spaces, and used them to promote the solving of a large-scale multiobjective optimization problem. Similarly, in order to solve a high-dimensional feature selection task, [54] and [55] designed several low-dimensional versions of the original task to act as auxiliary tasks. In one of our previous works in evolutionary machine learning [6], we artificially generated a number of static auxiliary tasks based on small subsampled portions of a big training dataset.
In this paper, tailored for scalable data-driven multiobjective
optimization, we also propose to construct small-data auxiliary
tasks through a data subsampling approach. However, unlike
in [6], the auxiliary tasks constructed shall be dynamically
changing (via resampling) during the evolutionary search
process.
C. Online Resource Allocation in Multiobjective EMT
To date, there has been relatively little research on online resource allocation in EMT for MOOPs. The MFEA/D-DRA algorithm proposed by Yao et al. [56] adopts a dynamic resource allocation strategy in which the computational resources are allocated according to the evolution rate of single-objective subproblems (decomposed from each MOOP) in each generation. Although this strategy achieves efficiency enhancements, it is not flexible since it can only be applied to MOEA/D-based EMT algorithms.
In contrast, [57] proposed a generalized resource allocation (GRA for short) framework that can be applied to any kind of EMT algorithm. The GRA is built on an attainment function performance metric and multi-step nonlinear regression, demonstrating the ability to enhance multiobjective EMT algorithms. However, the attainment function used in GRA has high computational complexity, and K multi-step nonlinear regression models (K is the number of tasks) have to be solved before the resources allocated to distinct tasks are determined.
In this paper, we wish to design a novel online resource
allocation strategy that is flexible and efficient. It shall be
applicable to any kind of multiobjective EMT algorithm in an
effective manner, enabling multiobjective optimization where
evaluations are to be carried out on large-instance data.
III. PRELIMINARIES
In this section, we experimentally showcase how the dataset
size could affect the performance of multiobjective evolution,
and then illustrate the general idea of using small datasets
(as helpful minions) to accelerate multiobjective optimization
under large-instance data.
A. Effect of Dataset Size on Multiobjective Evolution
For data-driven MOOPs, the size of the dataset has a critical
impact on the optimization performance.
Here, we consider one typical example: multiobjective optimization under uncertainty, where a finite but large number of data samples representative of the uncertain environment are either drawn from a known probability distribution, or are historically observed [58], [59]. To tackle the uncertainty, a
commonly used method is the sample average approximation (SAA) model [2]. In terms of MOOPs, the mathematical formulation of a target SAA model is given as follows [60]:

min_{x ∈ Ω} F̂_N(x) = (1/N) Σ_{j=1}^{N} F(x; ξ_j)    (3)

where D = {ξ_1, ..., ξ_N} is an i.i.d. set of representative uncertain scenarios. According to the law of large numbers, as N → ∞, F̂_N(x) converges to E[F(x; ξ)] under some regularity conditions [60]. Thus, the SAA model usually requires a large sample size (i.e., a large uncertainty set D_L), so that a more accurate approximation of the expected objective value can be achieved. However, the resultant computational cost would become higher. In contrast, a small uncertainty set D_S shall incur low computational cost, but may in turn introduce high approximation errors.
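The cost/accuracy trade-off of Eq. (3) can be sketched with a toy scalar objective. The helper names and the objective F(x; ξ) = (x + ξ)² below are ours, chosen only to mirror the paper's setting of Gaussian noise added to a decision variable:

```python
# Minimal sketch of the SAA idea in Eq. (3): the objective is an
# average over scenarios, so a large uncertainty set D_L is accurate
# but costly, while a small subsample D_S is cheap but noisier.
import random

def saa_objective(x, scenarios):
    """F_hat_N(x) = (1/N) * sum_j F(x; xi_j), with the toy objective
    F(x; xi) = (x + xi)^2, i.e. noise imposed on the decision variable."""
    return sum((x + xi) ** 2 for xi in scenarios) / len(scenarios)

random.seed(0)
D_L = [random.gauss(0.0, 1.0) for _ in range(1000)]  # large uncertainty set
D_S = random.sample(D_L, 10)                          # small subsampled set

x = 0.5
print(saa_objective(x, D_L))  # close to E[(x + xi)^2] = x^2 + 1
print(saa_objective(x, D_S))  # 100x cheaper per evaluation, but noisier
```

Every evaluation touches all N scenarios, which is exactly why large-instance data makes each fitness evaluation expensive.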
To demonstrate the above claims, we conduct a toy experiment using NSGA-II to solve the well-known 3-objective benchmarks DTLZ5 and DTLZ1 [61], but corrupted with additive Gaussian noise. To be specific, we add independent Gaussian noise to each decision variable of DTLZ5/DTLZ1 to simulate simple scenarios with decision variable uncertainty. We denote these two resultant problems as DTLZ5D and DTLZ1D², respectively. For evaluating the objective functions of each problem during the running of NSGA-II, two types of uncertainty sets are used. One is a relatively large dataset with 1,000 samples drawn from the Gaussian distribution N(0, 1), and the other is a small dataset consisting of only 10 samples extracted uniformly at random from the large dataset. The obtained Hypervolume (HV for short) results³ are displayed in Fig. 1. As can be seen, on DTLZ5D, NSGA-II with small data converges significantly faster than that with large data. This implies that using a small-instance dataset is sufficient for the optimization of this particular problem. As for DTLZ1D on the other hand, NSGA-II with small data progresses quickly at first, but is found to stagnate in a sub-optimal region in the later stage. This shows that NSGA-II with small data cannot reliably obtain good quality solutions. A large-instance dataset is needed for solving DTLZ1D effectively.
B. Using Small-data Tasks to Quickly Evolve Solutions for
Large-instance Data
Let us consider two tasks: one comprises a large-instance dataset ("large-data task") and the other comprises a small-instance dataset uniformly subsampled from the large dataset ("small-data task"). The objective function evaluation of the small-data task would be computationally cheaper than that of the target large-data task. Since the small dataset is a uniform subset of the large dataset, the small-data task can be expected (albeit not guaranteed) to share a degree of similarity in the underlying data distribution and resultant fitness landscape as
²The last "D" represents that the noise is imposed in the decision space.
³Reported HV results are based on performance evaluations on an out-of-sample validation dataset consisting of 10,000 samples, used to reevaluate all solutions obtained in each generation.
Fig. 1. Convergence curves of NSGA-II with large data and small data on two example problems. (a) On DTLZ5D, NSGA-II with small data converges significantly faster than that with large data, implying that a small-instance dataset is sufficient for the optimization. (b) On DTLZ1D, NSGA-II with small data converges fast at first, but gets stuck in the later stage. This indicates that a large-instance dataset is needed for solving DTLZ1D effectively.
the target. Hence, it may be reasonable to use the small-
data task to discover useful solutions for the computationally-
expensive large-data task at a reduced cost. That is, the small-
data task may serve as an auxiliary source task to accelerate
the optimization of the target. Taking this cue, we propose
to jointly deploy both large- and small-data tasks in a single
EMT run. Within the EMT framework, the small-data tasks
are seen as helpful minions, i.e., as computationally-cheaper
auxiliary tasks, to assist the target task in the search for optimal solutions. All tasks thus progress in a synergistic and intertwined manner, transferring useful information (solution prototypes) when available.
Notice in Fig. 1b that the optimization of the small-data
task is trapped in an inferior region in the later stages, which
indicates that it ceases to be helpful to the target thereafter.
In this case, assigning evaluation budget to the small-data
task would imply a waste of computational effort in terms
of progressing the target search. Thus, unlike the majority
of existing EMT algorithms that equally weight all tasks, we
propose to adaptively adjust computational resource allocation
to tasks according to their performance. Concretely, if the
small-data task is assessed to provide beneficial transfers to the
target, we should reward more resources to it so as to quickly
progress the target search at a much lower cost. Otherwise, the
resources allocated to the small-data task should be reduced
to help prevent wastage of computational effort.
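The subsampling that creates such a minion task can be sketched as follows. The helper names are ours; uniform sampling without replacement is the default, and a stratified variant (which the paper prescribes for labeled datasets, in a later footnote) keeps per-class proportions:

```python
# Sketch of constructing a small-data "minion" task by subsampling
# the large dataset. Names and the proportional-allocation rule in the
# stratified variant are our own illustrative choices.
import random
from collections import defaultdict

def uniform_subsample(data, n_small, rng=random):
    """Uniform random sampling without replacement from the large dataset."""
    return rng.sample(data, n_small)

def stratified_subsample(labeled_data, n_small, rng=random):
    """labeled_data: list of (instance, label) pairs. Samples without
    replacement, proportionally from each class (at least one per class)."""
    by_class = defaultdict(list)
    for item, label in labeled_data:
        by_class[label].append((item, label))
    out = []
    for label, items in by_class.items():
        k = max(1, round(n_small * len(items) / len(labeled_data)))
        out.extend(rng.sample(items, min(k, len(items))))
    return out

random.seed(1)
data = list(range(1000))
D_S = uniform_subsample(data, 10)                # cheap minion dataset
labeled = [(i, i % 2) for i in range(100)]       # two balanced classes
strat = stratified_subsample(labeled, 10)        # 5 per class here
```

Resampling D_S periodically, as the proposed framework does, amounts to calling these helpers afresh at every transfer phase.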
IV. PROPOSED ALGORITHM
This section first illustrates our proposed EMT framework. Next, we introduce the definition of an online inter-task empirical correlation measure that is used for adaptive resource allocation. Finally, we present an approach to efficiently approximate the empirical correlation measure using Bayes' rule.
A. Overview of the EMT Framework
The pseudo code of the overall adaptive EMT framework is shown in Algorithm 1. It is worth noting that the proposed framework can be adapted to any existing MOEA by configuring the reproduction operators and mating/environmental selection schemes used.
Let the size of the whole population (including the large- and small-data populations) for the proposed framework be psize. Let the large-instance data be denoted as D_L. N_small number
of instances are randomly sampled without replacement from D_L to form the small-instance dataset D_S⁴. Next, two populations P_L(t=0) (of size psize_L(0)) and P_S(t=0) (of size psize_S(0)) are randomly initialized for the target large-data task T_L and the small-data source task T_S, respectively. The solutions in P_L(t) and P_S(t) are evaluated with D_L and D_S, respectively. P_L(t) and P_S(t) evolve separately via traditional evolutionary operators (Lines 16-18), until the transfer phase is triggered every Δt generations. The whole process is repeated until a predefined stopping criterion is satisfied.
Note that Δt denotes not only the transfer interval, but also the interval for conducting online computational resource allocation between tasks T_L and T_S. During the transfer phase (Lines 6-8), explicit information transfer is conducted across tasks. Specifically, P_L(t) transfers min{num, psize_L(t)} subsampled solutions (reevaluated with D_S) to P_S(t), while P_S(t) transfers min{num, psize_S(t)} subsampled solutions (reevaluated with D_L) to P_L(t). The transferred solutions merge with the existing task-specific population and undergo the process of environmental selection. As stated in Section III-B, the small-data task can help discover good solutions for the large-data task at a significantly lower computational cost. Thus, the useful information transferred from T_S to T_L could accelerate the convergence of the target task to its PS.
After the information transfer, a procedure of resource allocation is conducted to adjust the population sizes of T_S and T_L in an online manner (Lines 9-11). Through adaptively allocating available resources to different tasks, we can not only stimulate faster convergence trends, but also reduce the risks of harmful negative transfer when small-data tasks are not representative of the target.
Importantly, the small-data task T_S in our proposed adaptive EMT framework is dynamic. That is, the small-instance dataset D_S used in T_S is periodically updated by resampling N_small instances from D_L, and all the solutions in P_S(t) are reevaluated with the new D_S (Lines 9-10). Although the generation of a new small-data task introduces extra reevaluation cost, the dynamic property is of significance. Specifically, through these successive random resampling operations, a series of auxiliary small-data tasks can be generated on-the-fly. These randomly generated tasks may share various levels of correlation with the target, since each of the small-instance datasets is a random subset of the large dataset. As such, this increases the chance of having generated some task that is of high relevance for the target search. For the reevaluated small-data population P_S(t) and the large-data population P_L(t), we employ a novel computational resource allocation strategy based on Bayes' rule (i.e., Algorithm 2) to adjust the population sizes of T_S and T_L (Line 11). As all the solutions in the current P_S(t) have been reevaluated with the new D_S, it is reasonable to use the resource allocation strategy conducted on the current P_S(t) and P_L(t) to infer the amount of computational resources made available to the small-data task in the following Δt generations. That is, when the current
⁴For the dataset where each data instance has a class label (such as the datasets used in our experiments on multiobjective hyper-parameter tuning of neural network models), stratified random sampling without replacement is performed to sample N_small instances from D_L to form D_S.
Algorithm 1 Pseudocode of the Adaptive EMT Framework
Input: psize: size of whole population; T_L and T_S: large-data task and small-data task; D_L: large-instance dataset; N_small: size of small-instance dataset; psize_L(0) and psize_S(0): initial population sizes of T_L and T_S; num: number of transferred solutions; Δt: transfer interval;
Output: Non-dominated solutions of T_L;
1: Sample N_small instances from D_L to form a small-instance dataset D_S, and set t = 0;
2: Initialize the population P_L(t) of T_L and the population P_S(t) of T_S, respectively;
3: Evaluate P_L(t) and P_S(t) with D_L and D_S, respectively;
4: while termination criterion is not fulfilled do
5:   if mod(t + 1, Δt) == 0 then
6:     Transfer min{num, psize_L(t)} subsampled solutions from P_L(t) to P_S(t), and reevaluate with D_S;
7:     Transfer min{num, psize_S(t)} subsampled solutions from P_S(t) to P_L(t), and reevaluate with D_L;
8:     Perform environmental selection on P_L(t) and P_S(t) to maintain population sizes of psize_L(t) and psize_S(t), respectively;
9:     Sample N_small instances from D_L to form a new small-instance dataset D_S;
10:    Reevaluate the solutions in P_S(t) with the new D_S;
11:    Conduct the Bayes resource allocation strategy (i.e., Algorithm 2) to obtain psize_L(t+1) and psize_S(t+1);
12:   else
13:     psize_L(t+1) = psize_L(t);
14:     psize_S(t+1) = psize_S(t);
15:   end if
16:   Perform mating selection & reproduction on P_L(t) and P_S(t) to generate the offspring populations O_L(t) (of size psize_L(t+1)) and O_S(t) (of size psize_S(t+1)), respectively;
17:   Evaluate O_L(t) and O_S(t) with D_L and D_S, respectively;
18:   Perform environmental selection on P_L(t) ∪ O_L(t) and P_S(t) ∪ O_S(t) to construct P_L(t+1) (of size psize_L(t+1)) and P_S(t+1) (of size psize_S(t+1)), respectively;
19:   Set t = t + 1;
20: end while
21: Reevaluate the solutions in P_S(t) with D_L;
22: Output solutions in P(t) = P_L(t) ∪ P_S(t) that are non-dominated on T_L.
population P_S(t) is assessed to be effective in providing good solutions for transfer, more resources would be awarded to the small-data task (as it will keep using the same D_S in the following Δt generations). Otherwise, more resources would be allocated to the target until a small-data task that produces beneficial transfers is newly generated.
In the following subsections, the basis and details of the online computational resource allocation mechanism are elaborated. For ease of description, we summarize some important notations and their meanings in Table I.
TABLE I
NOTATIONS USED IN THE DESCRIPTION OF THE BAYES RESOURCE ALLOCATION STRATEGY AND THEIR MEANINGS.

| Notation | Meaning |
| $\preceq_L$ | Symbol of Pareto dominance for T_L (i.e., using the objective functions of T_L to compare different solutions). |
| $\preceq_S$ | Symbol of Pareto dominance for T_S (i.e., using the objective functions of T_S to compare different solutions). |
| $\overline{NS}_L(t)$ | The solutions in $P(t) = P_L(t) \cup P_S(t)$ that are non-dominated on T_L: $\overline{NS}_L(t) = \{x \in P(t) \mid \nexists y \in P(t) \text{ s.t. } y \preceq_L x\}$. |
| $\overline{NS}_S(t)$ | The solutions in P(t) that are non-dominated on T_S: $\overline{NS}_S(t) = \{x \in P(t) \mid \nexists y \in P(t) \text{ s.t. } y \preceq_S x\}$. |
| $\overline{DS}_S(t)$ | The solutions in P(t) that are dominated on T_S: $|\overline{DS}_S(t)| = |P(t)| - |\overline{NS}_S(t)|$. |
| $NS_L(t)$ | The solutions in P_L(t) that are non-dominated on T_L: $NS_L(t) = \{x \in P_L(t) \mid \nexists y \in P_L(t) \text{ s.t. } y \preceq_L x\}$. |
| $NS_S(t)$ | The solutions in P_S(t) that are non-dominated on T_S: $NS_S(t) = \{x \in P_S(t) \mid \nexists y \in P_S(t) \text{ s.t. } y \preceq_S x\}$. |
| $DS_S(t)$ | The solutions in P_S(t) that are dominated on T_S: $|DS_S(t)| = |P_S(t)| - |NS_S(t)|$. |
| $NNS(t)$ | The solutions in $NS_L(t)$ that are also non-dominated on T_S: $NNS(t) = \{x \in NS_L(t) \mid \nexists y \in P(t) \text{ s.t. } y \preceq_S x\}$. |

Note: For the last seven notations, the ones with an overline pertain to the joint population P(t), while the ones without an overline pertain to P_L(t) or P_S(t).
B. Inter-task Empirical Correlation of MOOPs

As indicated in Line 11 of Algorithm 1, we wish to dynamically adjust the amount of resources (in terms of population size⁵) made available to the large- and small-data tasks. To this end, we first consider the following question: how strongly is the small-data task T_S correlated with the target task T_L at a given transfer phase?
In the t-th generation, there are two populations, namely P_S(t) and P_L(t), for T_S and T_L, respectively. Note that in multiobjective evolution, the non-dominated solutions in a population are usually preferred over the dominated ones. The solutions in the joint population P(t) = P_L(t) ∪ P_S(t) that are non-dominated on T_L (denoted as $\overline{NS}_L(t)$) are considered most beneficial for the future optimization of the target task. These solutions are contributed by either the small-data population P_S(t) or the large-data population P_L(t), since the joint population P(t) is composed of P_S(t) and P_L(t). We propose that the proportion of non-dominated solutions contributed by P_S(t) reflects the degree of positive correlation of the small-data task T_S to the target. With this in mind,
we define an online inter-task empirical correlation measure, which is mathematically expressed as follows:

$$\mathrm{Corr}(t) = \frac{|\overline{NS}_L(t) \cap P_S(t)|}{|\overline{NS}_L(t)|} = \frac{|\{x \in P_S(t) \mid \nexists y \in P(t) \text{ s.t. } y \preceq_L x\}|}{|\{x \in P(t) \mid \nexists y \in P(t) \text{ s.t. } y \preceq_L x\}|}. \qquad (4)$$

⁵ Here, we use the population size to control the amount of resources allocated to the large- and small-data tasks. As the whole population size psize is fixed, if the size of the large-data population psize_L(t) increases, then the size of the small-data population psize_S(t) = psize − psize_L(t) decreases, and vice versa. Note that adjusting the sizes of the large- and small-data populations determines how much of the limited computational resources is allocated to each task based on its observed potential, but without any guarantee that the task with the larger population size will deliver better performance.
If the contribution of P_S(t) (i.e., the value of the numerator in Eq. (4)) is large, this suggests that the auxiliary source task T_S produces beneficial transfers to the target task in the t-th generation; hence, more resources could be allocated to T_S to further enhance search efficiency. Otherwise, the resources of T_S are reduced to alleviate its negative impact. We can thus allocate computational resources online in proportion to the degree of source-target correlation defined in Eq. (4).
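For concreteness, the exact form of Eq. (4) can be computed as below, given the objective vectors of the joint population evaluated on T_L (an illustrative sketch with our own naming; minimization is assumed):

```python
def dominates(a, b):
    """True if objective vector a Pareto-dominates b (minimization)."""
    return all(ai <= bi for ai, bi in zip(a, b)) and any(ai < bi for ai, bi in zip(a, b))

def empirical_correlation(f_large, from_small):
    """Exact Corr(t) of Eq. (4).

    f_large    : objective vectors of the joint population P(t) evaluated on T_L
    from_small : parallel flags, True if the solution came from P_S(t)
    """
    nd = [i for i, fi in enumerate(f_large)
          if not any(dominates(fj, fi) for j, fj in enumerate(f_large) if j != i)]
    return sum(from_small[i] for i in nd) / len(nd)
```

The catch, discussed next, is that obtaining f_large for the solutions of P_S(t) requires reevaluating them on the large-data task.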
However, exact computation of Eq. (4) would incur extra evaluations on the large-data task, since all solutions in P_S(t) would need to be reevaluated on T_L. This would impose a heavy overhead merely for the sake of resource allocation. Thus, with the aim of maintaining computational tractability, we design a new Bayes resource allocation strategy that efficiently approximates the empirical correlation. The specific details are presented in the next subsection.
C. Bayes Resource Allocation Strategy

Recall that in our proposed EMT framework, the small-instance dataset is a uniform subset of the large-instance dataset, suggesting that the resultant MOOPs may share locally or globally similar fitness landscapes. We thus make the simplifying assumption stated below, which (as shall be shown) facilitates fast, online approximation of Corr(t) by avoiding extra evaluations on the large dataset.

Assumption 1. The population of the large-data task, the population of the small-data task, and their union share similar underlying probability distributions during the EMT run.
In addition, we highlight the following useful property that will also be utilized in our derivation.

Property 1. Consider the additive form of data-driven MOOPs expressed in Eq. (3). Since the small-instance dataset D_S is a subset of the large-instance dataset D_L, the solutions evaluated on D_L are automatically evaluated on D_S at no extra cost. That is, for all solutions in P_L(t), their evaluation scores on T_S are available.⁶
In Eq. (4), there are two key terms: the denominator $|\overline{NS}_L(t)|$ and the numerator $|\overline{NS}_L(t) \cap P_S(t)|$. To approximate $|\overline{NS}_L(t)|$, we consider the probability that a solution in P(t) is non-dominated on T_L, denoted as $Pr(x \in \overline{NS}_L(t) \mid x \in P(t))$. Then, the expected value of $|\overline{NS}_L(t)|$ can be written as:

$$E[|\overline{NS}_L(t)|] = Pr(x \in \overline{NS}_L(t) \mid x \in P(t)) \cdot |P(t)|, \qquad (5)$$

where the notation E[·] symbolizes statistical expectation.
⁶ Property 1 is not directly applicable to MOOPs of the type in Eq. (15). Hence, for automated machine learning model configuration problems, evaluation scores of P_L(t) on T_S are predicted (for the purpose of Bayes resource allocation) via fast KNN regressions.
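A fast KNN regression of the kind mentioned in this footnote can be sketched as follows (our minimal stand-in, not the authors' implementation; the choice of k and the Euclidean distance are assumptions):

```python
import math

def knn_predict(x_train, y_train, x_query, k=3):
    """Predict a solution's T_S objective scores as the mean over its k nearest
    already-evaluated neighbours (Euclidean distance in decision space)."""
    nearest = sorted(range(len(x_train)),
                     key=lambda i: math.dist(x_query, x_train[i]))[:k]
    n_obj = len(y_train[0])
    return [sum(y_train[i][j] for i in nearest) / k for j in range(n_obj)]
```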
As for $|\overline{NS}_L(t) \cap P_S(t)|$, the candidates in the intersection set $\overline{NS}_L(t) \cap P_S(t)$ may come from solutions in P_S(t) that are either currently non-dominated or even dominated with respect to T_S. Invoking Assumption 1, the expected value of $|\overline{NS}_L(t) \cap P_S(t)|$ can be weakly approximated as:

$$E[|\overline{NS}_L(t) \cap P_S(t)|] = Pr(x \in \overline{NS}_L(t) \mid x \in NS_S(t)) \cdot |NS_S(t)| + Pr(x \in \overline{NS}_L(t) \mid x \in DS_S(t)) \cdot |DS_S(t)|. \qquad (6)$$

It should be noted that both conditional probabilities $Pr(x \in \overline{NS}_L(t) \mid x \in NS_S(t))$ and $Pr(x \in \overline{NS}_L(t) \mid x \in DS_S(t))$ are computed based on the joint population P(t).
Substituting Eqs. (5) and (6) into Eq. (4), we obtain a practical estimate of Corr(t) as:

$$\mathrm{Corr}(t) \approx \frac{Pr(x \in \overline{NS}_L(t) \mid x \in NS_S(t)) \cdot |NS_S(t)|}{Pr(x \in \overline{NS}_L(t) \mid x \in P(t)) \cdot |P(t)|} + \frac{Pr(x \in \overline{NS}_L(t) \mid x \in DS_S(t)) \cdot |DS_S(t)|}{Pr(x \in \overline{NS}_L(t) \mid x \in P(t)) \cdot |P(t)|}. \qquad (7)$$
At this point, in order to avoid extra evaluations on the large-data task, we propose to invert the conditional probabilities $Pr(x \in \overline{NS}_L(t) \mid x \in NS_S(t))$ and $Pr(x \in \overline{NS}_L(t) \mid x \in DS_S(t))$ by resorting to Bayes' rule as follows:

$$Pr(x \in \overline{NS}_L(t) \mid x \in NS_S(t)) = \frac{Pr(x \in NS_S(t) \mid x \in \overline{NS}_L(t)) \cdot Pr(x \in \overline{NS}_L(t) \mid x \in P(t))}{Pr(x \in NS_S(t) \mid x \in P(t))}, \qquad (8)$$

and

$$Pr(x \in \overline{NS}_L(t) \mid x \in DS_S(t)) = \frac{Pr(x \in DS_S(t) \mid x \in \overline{NS}_L(t)) \cdot Pr(x \in \overline{NS}_L(t) \mid x \in P(t))}{Pr(x \in DS_S(t) \mid x \in P(t))}. \qquad (9)$$
Combining Eqs. (7), (8) and (9), the estimate of Corr(t) can be expressed as:

$$\begin{aligned}
\mathrm{Corr}(t) &\approx \frac{Pr(x \in NS_S(t) \mid x \in \overline{NS}_L(t)) \cdot |NS_S(t)|}{Pr(x \in NS_S(t) \mid x \in P(t)) \cdot |P(t)|} + \frac{Pr(x \in DS_S(t) \mid x \in \overline{NS}_L(t)) \cdot |DS_S(t)|}{Pr(x \in DS_S(t) \mid x \in P(t)) \cdot |P(t)|} \\
&= \frac{Pr(x \in NS_S(t) \mid x \in \overline{NS}_L(t)) \cdot |NS_S(t)|}{Pr(x \in NS_S(t) \mid x \in P(t)) \cdot |P(t)|} + \frac{(1.0 - Pr(x \in NS_S(t) \mid x \in \overline{NS}_L(t))) \cdot |DS_S(t)|}{(1.0 - Pr(x \in NS_S(t) \mid x \in P(t))) \cdot |P(t)|}.
\end{aligned} \qquad (10)$$
Observe that the term $Pr(x \in \overline{NS}_L(t) \mid x \in P(t))$ cancels out in Eq. (10). Therefore, the approximation of Corr(t) no longer depends on $Pr(x \in \overline{NS}_L(t) \mid x \in P(t))$, $Pr(x \in \overline{NS}_L(t) \mid x \in NS_S(t))$ or $Pr(x \in \overline{NS}_L(t) \mid x \in DS_S(t))$. These terms are replaced by the probabilities $Pr(x \in NS_S(t) \mid x \in P(t))$ and $Pr(x \in NS_S(t) \mid x \in \overline{NS}_L(t))$.
According to Property 1, we can identify the non-dominated solutions $\overline{NS}_S(t)$ and NNS(t) (whose specific meanings can be seen in Table I) without the need to conduct reevaluations on T_S. Then, $Pr(x \in NS_S(t) \mid x \in P(t))$ can be directly calculated as:

$$Pr(x \in NS_S(t) \mid x \in P(t)) = \frac{|\overline{NS}_S(t)|}{|P(t)|}. \qquad (11)$$

On the other hand, invoking Assumption 1, the value of $Pr(x \in NS_S(t) \mid x \in \overline{NS}_L(t))$ is weakly approximated as:

$$Pr(x \in NS_S(t) \mid x \in \overline{NS}_L(t)) \approx \frac{|NNS(t)|}{|NS_L(t)|}. \qquad (12)$$
Substituting Eqs. (11) and (12) into Eq. (10), we obtain the final formula for the estimate of Corr(t):

$$\begin{aligned}
\mathrm{Corr}(t) &\approx \frac{\frac{|NNS(t)|}{|NS_L(t)|} \cdot |NS_S(t)|}{\frac{|\overline{NS}_S(t)|}{|P(t)|} \cdot |P(t)|} + \frac{\left(1.0 - \frac{|NNS(t)|}{|NS_L(t)|}\right) \cdot |DS_S(t)|}{\left(1.0 - \frac{|\overline{NS}_S(t)|}{|P(t)|}\right) \cdot |P(t)|} \\
&= \frac{|NNS(t)| \cdot |NS_S(t)|}{|\overline{NS}_S(t)| \cdot |NS_L(t)|} + \frac{(|NS_L(t)| - |NNS(t)|) \cdot (|P_S(t)| - |NS_S(t)|)}{(|P(t)| - |\overline{NS}_S(t)|) \cdot |NS_L(t)|} \\
&= \frac{|NNS(t)| \cdot |NS_S(t)|}{|\overline{NS}_S(t)| \cdot |NS_L(t)|} + \frac{(|NS_L(t)| - |NNS(t)|) \cdot (psize_S(t) - |NS_S(t)|)}{(psize_L(t) + psize_S(t) - |\overline{NS}_S(t)|) \cdot |NS_L(t)|}.
\end{aligned} \qquad (13)$$

Algorithm 2 Pseudocode of the Bayes Resource Allocation Strategy
Input: P_L(t): the population of T_L (of size psize_L(t)); P_S(t): the population of T_S (of size psize_S(t)).
Output: psize_L(t+1): the new population size of T_L; psize_S(t+1): the new population size of T_S.
1: Identify the solutions in P_L(t) that are non-dominated on T_L (denoted as NS_L(t));
2: Identify the solutions in P_S(t) that are non-dominated on T_S (denoted as NS_S(t));
3: Using Property 1, identify the solutions in P(t) = P_L(t) ∪ P_S(t) that are non-dominated on T_S (denoted as $\overline{NS}_S(t)$);
4: Identify the solutions in NS_L(t) that are also non-dominated on T_S (denoted as NNS(t));
5: Calculate the value of Corr(t) by Eq. (13);
6: proportion = Corr(t);
7: psize_S(t+1) = |P(t)| × proportion;
8: psize_L(t+1) = |P(t)| × (1 − proportion).
This concludes the inference of Corr(t). We note the following summarizing remarks.

Remark 1: No extra evaluation on T_L or T_S is required in the inference of Corr(t) under Property 1. The online inter-task empirical correlation is efficiently approximated by means of simple manipulations and the Bayes inversion trick.
Remark 2: The proposed strategy utilizes two weak approximations (i.e., Eqs. (6) and (12)). When Assumption 1 is satisfied, these approximations are reasonable. In practice, however, the populations P_L(t) and P_S(t) may not satisfy Assumption 1. In such cases, the numerical estimate in Eq. (13) may exceed the theoretical upper bound of 1.0. Thus, an additional check is needed to appropriately bound the estimated value of Corr(t). In particular, to avoid the elimination of T_L (i.e., a scenario where zero resources are allocated to the target task T_L), we bound the value of proportion = Corr(t) to a fraction close to but smaller than 1.0⁷.
Based on the obtained proportion, the population size allocated to T_S in the subsequent generation of EMT is adjusted as psize_S(t+1) = psize × proportion. Accordingly, the population size of T_L for the next generation becomes psize × (1 − proportion). We provide the pseudocode of the proposed online resource allocation strategy in Algorithm 2.

⁷ In our algorithm implementation, the fraction is set to 9/10.
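Putting Eq. (13), the bound of footnote 7, and lines 5-8 of Algorithm 2 together, the allocation step can be sketched as follows (our own naming; the set cardinalities follow Table I):

```python
def bayes_allocation(nns, nsl, nss, nss_joint, psize_l, psize_s, cap=0.9):
    """Estimate Corr(t) via Eq. (13) and split the total population budget.

    nns       = |NNS(t)|  : solutions of NS_L(t) also non-dominated on T_S
    nsl       = |NS_L(t)| : solutions of P_L(t) non-dominated on T_L
    nss       = |NS_S(t)| : solutions of P_S(t) non-dominated on T_S
    nss_joint = cardinality of the overlined set, i.e., solutions of the joint
                population P(t) that are non-dominated on T_S
    """
    psize = psize_l + psize_s
    term1 = (nns * nss) / (nss_joint * nsl)
    term2 = ((nsl - nns) * (psize_s - nss)) / ((psize - nss_joint) * nsl)
    # Bound the proportion below 1.0 so the target task T_L is never eliminated
    proportion = min(term1 + term2, cap)
    new_psize_s = int(round(psize * proportion))
    return psize - new_psize_s, new_psize_s
```

The cap (9/10 in the paper's implementation) prevents the target task from being starved, while proportion falls to zero whenever NNS(t) is empty and P_S(t) has fully converged, mirroring the guard against negative transfer analyzed in Section IV-D.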
D. Analyzing the Bayes Resource Allocation Strategy

The statement of Assumption 1 is strong, leading to weak approximations. Hence, we here perform a sanity check on the final formula for estimating Corr(t) (i.e., Eq. (13)) to confirm its validity. To this end, two intuitively pleasing results, along with their proofs, are stated as follows.
Result 1. When the large- and small-data populations are mutually divergent (i.e., their underlying probability distributions differ) and the small-data population has converged, Eq. (13) implies that no resources will be allocated to T_S in the next generation.

Proof. Since the small-data population has converged, the solutions in P_S(t) are non-dominated on T_S, implying |NS_S(t)| = psize_S(t). Next, due to the divergence of populations P_L(t) and P_S(t), we have NNS(t) = ∅ and |NNS(t)| = 0. Thus, the value of Corr(t) given by Eq. (13) falls to zero, indicating that no resources will be allocated to T_S since proportion = 0. Notably, this enables the Bayes resource allocation strategy to guard against harmful negative transfers from divergent small-data tasks.
Result 2. If the fitness landscapes of the large- and small-data tasks are perfectly correlated with identical population distributions, then Eq. (13) implies that the resources allocated to T_S are non-decreasing.

Proof. Given the similarity of fitness landscapes and population distributions, the solutions in NS_L(t) would also be non-dominated on T_S, implying |NNS(t)| = |NS_L(t)|. Consequently, Eq. (13) reduces to:

$$\mathrm{Corr}(t) \approx \frac{|NS_S(t)|}{|\overline{NS}_S(t)|} \geq \frac{psize_S(t)}{|P(t)|}, \qquad (14)$$

indicating that the resources allocated to T_S in the subsequent generation satisfy psize_S(t+1) = Corr(t) × |P(t)| ≥ psize_S(t).
The two aforementioned results jointly suggest that it is prudent to set psize_S(0)/|P(0)| to a high value (close to 1) at the start of EMT. This is because the proportion (and hence the resources allocated to T_S) will fall to zero if the source and target tasks diverge, while the proportion will remain high or non-decreasing (thus reaping the most benefit from T_S) if the tasks are closely related.
V. EXPERIMENTAL STUDY
In the experiments, we consider synthetic MOOPs under uncertainty as well as the multiobjective hyper-parameter tuning of deep neural network models as examples to investigate the performance of our proposed algorithm. All code was implemented in Python 3.6, using the Keras package for the experiments with deep neural networks.
A. Experiments on MOOPs under Uncertainty

1) Experimental setup: In the experiments on MOOPs under uncertainty, two suites of benchmark problems are adopted: (1) The first suite consists of the Type I test problems proposed in [58], each of which is imposed with additive uniform noise (satisfying U(−1, 1)) in the decision space. In particular, this suite includes four 2-objective problems (named DGT1M2Px, where x = 1, ..., 4) and two 3-objective problems (named DGT1M3P1 and DGT1M3P2, respectively). Following [58], the number of decision variables for these problems is set to 5. (2) The second suite consists of variants of the well-known DTLZ functions [61] with additive Gaussian noise (satisfying N(0, 1)) imposed in the decision space. We name them DTLZxD (x = 1, ..., 7). Since the DTLZ problems can be scaled to any number of objectives and any number of decision variables, we set their number of objectives to 3 and set their number of decision variables according to [61]. For the aforementioned problems, the large uncertainty set D_L is constructed with 1,000 samples of the noise term. The HV [62] metric is adopted to evaluate algorithm performance, with the reference vector for computing HV set based on the non-dominated solutions obtained by all considered algorithms. When calculating the HV value, we reevaluate all solutions obtained by each algorithm on an out-of-sample validation dataset consisting of 10,000 samples. For each problem in the first/second suite, the samples in the validation dataset are correspondingly drawn from the uniform distribution U(−1, 1) or the Gaussian distribution N(0, 1).
For each algorithm, the stopping criterion is a predefined running time (50 seconds), and the whole population size is set to 100, i.e., psize = 100. For the genetic operators, we use the simulated binary crossover operator (with probability p_c = 0.9 and distribution index η_c = 20) [63] and the polynomial mutation operator (with probability p_m = 1/n and distribution index η_m = 20) [64] to generate offspring.

By default, our proposed algorithm is set with the environmental selection procedure of NSGA-II. For ease of description, hereafter we denote our proposed algorithm with online computational resource allocation as EMT-RA. In addition, we also test a variant of our proposed algorithm without resource allocation, which is hereafter called EMT. For our proposed EMT-RA and EMT, the parameter settings are as follows⁸: num = 0.1 × psize = 10, Δt = 20, N_small = 0.01 × |D_L| = 10. For EMT, we allocate equal amounts of resources to the large-data and small-data tasks. For EMT-RA, we set the initial sizes of the large-data and small-data populations to psize_L(0) = 0.1 × psize = 10 and psize_S(0) = 0.9 × psize = 90, respectively⁹; the sizes of both populations are then adaptively adjusted by our proposed Bayes resource allocation strategy.
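For reference, single-variable sketches of the two variation operators above are given below (simplified relative to full library implementations, which add per-variable crossover probabilities and boundary handling in SBX; variable names are ours):

```python
import random

def sbx_pair(x1, x2, eta_c=20.0):
    """Simulated binary crossover on one variable pair; preserves the parents' mean."""
    u = random.random()
    if u <= 0.5:
        beta = (2.0 * u) ** (1.0 / (eta_c + 1.0))
    else:
        beta = (1.0 / (2.0 * (1.0 - u))) ** (1.0 / (eta_c + 1.0))
    c1 = 0.5 * ((1.0 + beta) * x1 + (1.0 - beta) * x2)
    c2 = 0.5 * ((1.0 - beta) * x1 + (1.0 + beta) * x2)
    return c1, c2

def poly_mutate(x, lo, hi, eta_m=20.0):
    """Polynomial mutation of one variable, clipped to its bounds [lo, hi]."""
    u = random.random()
    if u < 0.5:
        delta = (2.0 * u) ** (1.0 / (eta_m + 1.0)) - 1.0
    else:
        delta = 1.0 - (2.0 * (1.0 - u)) ** (1.0 / (eta_m + 1.0))
    return min(max(x + delta * (hi - lo), lo), hi)
```

A larger distribution index (η = 20 here) concentrates offspring near their parents, which suits the fine-grained local search both tasks rely on.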
2) Comparative results: This subsection compares the performance of the single-task MOEA and our proposed algorithm without/with resource allocation. To clearly demonstrate the generalizability of our proposed framework, we select three types of MOEAs (i.e., NSGA-II, SPEA2 and MOEA/D) to act as the base MOEA used within our framework.

Table II shows the comparative results (in terms of average HV value and standard deviation over 11 runs) on the two suites

⁸ The effect of several important parameters (num, Δt and N_small) on algorithm performance is investigated in the supplementary material.
⁹ Based on Results 1 and 2 in Section IV-D, it is reasonable to allocate more resources to T_S in the initial phase.
TABLE II
AVERAGE AND STANDARD DEVIATION OF HV RESULTS OBTAINED BY DIFFERENT ALGORITHMS ON BENCHMARK MOOPS UNDER UNCERTAINTY.
Each algorithm cell reports Avg (Std).

Problem | Time (s) | NSGA-II | EMT_NSGA-II | EMT-RA_NSGA-II | SPEA2 | EMT_SPEA2 | EMT-RA_SPEA2
DGT1M2P1 | 10 | 0.015 (0.033) | 0.318+ (0.127) | 0.517+ (0.155) | 0.343 (0.122) | 0.626+ (0.015) | 0.614+ (0.041)
DGT1M2P1 | 30 | 0.428 (0.150) | 0.585+ (0.154) | 0.580+ (0.131) | 0.627 (0.045) | 0.638+ (0.014) | 0.632+ (0.033)
DGT1M2P1 | 50 | 0.524 (0.124) | 0.589+ (0.153) | 0.587+ (0.125) | 0.628 (0.042) | 0.633+ (0.021) | 0.634+ (0.027)
DGT1M2P2 | 10 | 0.000 (0.000) | 0.024+ (0.078) | 0.385+ (0.309) | 0.203 (0.188) | 0.664+ (0.091) | 0.658+ (0.066)
DGT1M2P2 | 30 | 0.268 (0.283) | 0.305+ (0.351) | 0.479+ (0.314) | 0.656 (0.092) | 0.671+ (0.097) | 0.686+ (0.058)
DGT1M2P2 | 50 | 0.432 (0.345) | 0.473+ (0.312) | 0.512+ (0.329) | 0.661 (0.087) | 0.667+ (0.084) | 0.684+ (0.062)
DGT1M2P3 | 10 | 0.475 (0.072) | 0.611+ (0.019) | 0.624+ (0.021) | 0.560 (0.077) | 0.617+ (0.036) | 0.620+ (0.022)
DGT1M2P3 | 30 | 0.643 (0.003) | 0.640 (0.001) | 0.645+ (0.001) | 0.607 (0.075) | 0.625+ (0.031) | 0.624+ (0.017)
DGT1M2P3 | 50 | 0.646 (0.000) | 0.641 (0.001) | 0.646 (0.000) | 0.607 (0.076) | 0.631+ (0.025) | 0.625+ (0.021)
DGT1M2P4 | 10 | 0.467 (0.035) | 0.615+ (0.012) | 0.640+ (0.014) | 0.602 (0.039) | 0.640+ (0.011) | 0.645+ (0.015)
DGT1M2P4 | 30 | 0.636 (0.027) | 0.658+ (0.012) | 0.663+ (0.011) | 0.653 (0.008) | 0.648 (0.011) | 0.649 (0.015)
DGT1M2P4 | 50 | 0.652 (0.020) | 0.663+ (0.014) | 0.667+ (0.011) | 0.658 (0.007) | 0.648 (0.011) | 0.647 (0.026)
DGT1M3P1 | 10 | 0.054 (0.055) | 0.213+ (0.091) | 0.515+ (0.070) | 0.802 (0.040) | 0.890+ (0.006) | 0.898+ (0.003)
DGT1M3P1 | 30 | 0.521 (0.105) | 0.545+ (0.094) | 0.584+ (0.014) | 0.898 (0.005) | 0.895 (0.004) | 0.896 (0.004)
DGT1M3P1 | 50 | 0.571 (0.091) | 0.577+ (0.011) | 0.590+ (0.011) | 0.899 (0.004) | 0.898 (0.004) | 0.897 (0.005)
DGT1M3P2 | 10 | 0.309 (0.087) | 0.433+ (0.049) | 0.475+ (0.062) | 0.352 (0.079) | 0.497+ (0.077) | 0.527+ (0.063)
DGT1M3P2 | 30 | 0.525 (0.064) | 0.526 (0.042) | 0.534+ (0.055) | 0.551 (0.025) | 0.549 (0.027) | 0.557+ (0.013)
DGT1M3P2 | 50 | 0.543 (0.054) | 0.528 (0.043) | 0.543 (0.043) | 0.555 (0.020) | 0.548 (0.026) | 0.565+ (0.005)
DTLZ1D | 10 | 0.512 (0.017) | 0.541+ (0.018) | 0.574+ (0.037) | 0.558 (0.017) | 0.624+ (0.021) | 0.705+ (0.029)
DTLZ1D | 30 | 0.665 (0.014) | 0.669+ (0.008) | 0.686+ (0.023) | 0.729 (0.006) | 0.738+ (0.009) | 0.745+ (0.011)
DTLZ1D | 50 | 0.697 (0.011) | 0.689 (0.008) | 0.711+ (0.013) | 0.738 (0.004) | 0.748+ (0.008) | 0.754+ (0.006)
DTLZ2D | 10 | 0.126 (0.025) | 0.264+ (0.026) | 0.345+ (0.014) | 0.037 (0.021) | 0.196+ (0.023) | 0.308+ (0.027)
DTLZ2D | 30 | 0.344 (0.010) | 0.362+ (0.010) | 0.379+ (0.010) | 0.288 (0.026) | 0.360+ (0.013) | 0.347+ (0.021)
DTLZ2D | 50 | 0.384 (0.007) | 0.382 (0.007) | 0.390+ (0.006) | 0.349 (0.014) | 0.377+ (0.007) | 0.367+ (0.017)
DTLZ3D | 10 | 0.047 (0.007) | 0.071+ (0.013) | 0.085+ (0.027) | 0.063 (0.008) | 0.105+ (0.014) | 0.265+ (0.038)
DTLZ3D | 30 | 0.146 (0.020) | 0.203+ (0.025) | 0.218+ (0.029) | 0.296 (0.010) | 0.355+ (0.016) | 0.390+ (0.013)
DTLZ3D | 50 | 0.259 (0.014) | 0.291+ (0.009) | 0.289+ (0.034) | 0.389 (0.008) | 0.390 (0.011) | 0.408+ (0.010)
DTLZ4D | 10 | 0.256 (0.046) | 0.418+ (0.026) | 0.506+ (0.018) | 0.117 (0.051) | 0.306+ (0.034) | 0.523+ (0.019)
DTLZ4D | 30 | 0.517 (0.007) | 0.527+ (0.011) | 0.538+ (0.011) | 0.485 (0.026) | 0.532+ (0.010) | 0.563+ (0.012)
DTLZ4D | 50 | 0.546 (0.006) | 0.544 (0.009) | 0.551+ (0.006) | 0.553 (0.015) | 0.561+ (0.008) | 0.574+ (0.009)
DTLZ5D | 10 | 0.031 (0.008) | 0.078+ (0.007) | 0.116+ (0.006) | 0.009 (0.008) | 0.044+ (0.011) | 0.101+ (0.007)
DTLZ5D | 30 | 0.124 (0.005) | 0.135+ (0.002) | 0.138+ (0.002) | 0.084 (0.009) | 0.108+ (0.007) | 0.118+ (0.006)
DTLZ5D | 50 | 0.138 (0.002) | 0.139 (0.002) | 0.142+ (0.002) | 0.110 (0.009) | 0.124+ (0.005) | 0.126+ (0.007)
DTLZ6D | 10 | 0.000 (0.000) | 0.007+ (0.007) | 0.151+ (0.042) | 0.000 (0.000) | 0.078+ (0.024) | 0.321+ (0.011)
DTLZ6D | 30 | 0.150 (0.040) | 0.264+ (0.020) | 0.298+ (0.020) | 0.281 (0.018) | 0.352+ (0.007) | 0.355+ (0.014)
DTLZ6D | 50 | 0.290 (0.019) | 0.328+ (0.013) | 0.338+ (0.010) | 0.363 (0.007) | 0.368+ (0.008) | 0.365+ (0.014)
DTLZ7D | 10 | 0.000 (0.000) | 0.000 (0.000) | 0.028+ (0.022) | 0.000 (0.000) | 0.000 (0.000) | 0.201+ (0.019)
DTLZ7D | 30 | 0.001 (0.002) | 0.059+ (0.027) | 0.111+ (0.034) | 0.012 (0.007) | 0.244+ (0.024) | 0.295+ (0.013)
DTLZ7D | 50 | 0.063 (0.025) | 0.143+ (0.030) | 0.170+ (0.034) | 0.200 (0.012) | 0.299+ (0.022) | 0.315+ (0.005)
Average rank (Friedman test) | — | 2.7949 | 2.0897 | 1.1154 | 2.7308 | 1.9103 | 1.3590

Note: The symbol + indicates that our proposed algorithm significantly improves upon the baseline algorithm (NSGA-II or SPEA2) at the 0.05 level by the Wilcoxon rank sum test, whereas − indicates the opposite; ≈ marks cases where no significant difference is detected.
of benchmark problems. In each case, the best metric value is highlighted with grey shading. Moreover, we adopt three symbols (i.e., "+", "−" and "≈", whose meanings are given at the bottom of Table II) to mark the results of the Wilcoxon rank sum test with a confidence level of 0.95. In addition, the Friedman test is applied to all HV results to obtain the average ranks of all algorithms.
Firstly, we focus on the comparisons among NSGA-II, EMT_NSGA-II and EMT-RA_NSGA-II. From Table II, we can observe that EMT_NSGA-II and EMT-RA_NSGA-II obtain better performance than NSGA-II on most benchmark problems. Among the 39 cases, EMT_NSGA-II performs significantly better than NSGA-II in 30 cases, while EMT-RA_NSGA-II performs significantly better than NSGA-II in 37 cases. These results demonstrate the effectiveness of using the small-data task in our proposed framework. In terms of the comparison between the two variants of our proposed algorithm, EMT-RA_NSGA-II shows significant improvement over EMT_NSGA-II in 36 cases, demonstrating the effectiveness of our proposed Bayes resource allocation strategy.
Similar results are obtained when using SPEA2 as the base MOEA. In particular, EMT_SPEA2 performs significantly better than SPEA2 in 31 cases, while EMT-RA_SPEA2 obtains significantly better performance than SPEA2 in 34 cases. Moreover, as can be seen from the average ranks obtained by the Friedman test, the overall performance of
Fig. 2. Convergence curves obtained by the baseline algorithm and two variants of our proposed algorithm (i.e., EMT and EMT-RA) on benchmark MOOPs under uncertainty. In each figure, the solid lines indicate the average HV values and the shaded area denotes the 95% confidence interval over multiple runs.
EMT-RA_NSGA-II / EMT-RA_SPEA2 on all considered problems ranks first, while that of EMT_NSGA-II / EMT_SPEA2 ranks second. These results verify the generalizability of our proposed framework.
The convergence trends of average HV obtained by NSGA-II, EMT_NSGA-II and EMT-RA_NSGA-II on all considered benchmark problems are depicted in Fig. 2. In these figures, the solid lines indicate the average HV values and the shaded areas denote the 95% confidence intervals. In addition, for simplicity, we herein write EMT_NSGA-II and EMT-RA_NSGA-II as EMT and EMT-RA, respectively. As can be seen, EMT and EMT-RA obtain faster convergence than NSGA-II on most benchmark problems, indicating that using small-data tasks can indeed help to accelerate the convergence rate. For example, on the DGT1M2P1 problem, the performance difference between EMT/EMT-RA and NSGA-II is significant, which is explained by the fact that all Pareto optimal solutions of the deterministic version of DGT1M2P1 remain robust even under uncertainty (hence the small-data task acts as a good proxy for the large-data task) [58]. On the other hand, for DGT1M2P3, where uncertainty changes the Pareto set, the improvement achieved by EMT-RA is lower (although still noticeable). Next, we focus on the comparison between EMT-RA and EMT. We can observe that EMT-RA converges faster than EMT on all considered problems. This demonstrates that convergence can be further boosted with the help of our proposed Bayes resource allocation strategy. In particular, taking the DGT1M2P2 problem as an example, EMT progresses faster than NSGA-II at the early stage but tends to stagnate later. Such a phenomenon may occur if the small-data task in EMT grows out of usefulness for the target task. By contrast, EMT-RA converges faster by leveraging the Bayes resource allocation strategy, which has the ability to reduce the resources allocated to the small-data task when its usefulness drops. The performance difference between EMT-RA and EMT on DGT1M2P2 further highlights the necessity and effectiveness of online computational resource allocation.
Fig. S-5 in the supplementary material shows the convergence curves of average HV obtained by MOEA/D, EMT_MOEA/D and EMT-RA_MOEA/D on four representative benchmark problems. As can be seen, on the whole, EMT_MOEA/D and EMT-RA_MOEA/D converge faster than MOEA/D. When separately comparing EMT-RA_MOEA/D and EMT_MOEA/D, we find that EMT-RA_MOEA/D always converges faster than EMT_MOEA/D on DGT1M2P3 and DTLZ1D. However, on DGT1M2P1 and DGT1M2P2, EMT-RA_MOEA/D fails to outperform EMT_MOEA/D. This phenomenon may be attributed to the fact that the base MOEA/D has to be run with a set of predefined, evenly distributed weight vectors. When the resources allocated to each task (i.e., the population size of each task) change, the number of weight vectors and the set of weight vectors used for each task have to be regenerated. However, according to the common method for generating weight vectors [65], the number of newly generated weight vectors for each task (which can only take certain particular values) may not be sufficiently close to the new population size of that task (which is an arbitrary number). This may hamper the effectiveness of our proposed Bayes resource allocation strategy to some extent.
3) Analyses on learnt correlation curves, the effect of the number of transferred solutions, dynamic small data, and small data size: Due to space limitations, the related results and discussions are placed in the supplementary material.
B. Experiments on Multiobjective Hyper-parameter Tuning of Deep Neural Network Models

In this subsection, we apply our proposed algorithm to the multiobjective hyper-parameter tuning of neural network models (MOHPT for short), which serves as a representative example from the field of AutoML.

MOHPT belongs to a subclass of data-driven MOOPs in which data is utilized for supervised training of a candidate machine learning model, whose out-of-sample generalization performance provides the objective function scores for automated model configuration. Specifically, MOHPT under large-instance data can be formalized as follows:
$$\min_{x} F(x; D_L, D_{holdout}), \qquad (15)$$

where F(x; D_L, D_holdout) denotes the vector-valued out-of-sample loss achieved by a model parametrized by x, trained on the large-instance dataset D_L¹⁰, and validated on a hold-out dataset D_holdout¹¹.
For MOHPT under large-instance data, massive amounts of energy are usually required for building, training and validating the deep models of today, generating growing concerns about the carbon footprint of deep learning [66], [67]. Developing an efficient solver, such as the one proposed in this paper, could help alleviate the computational bottleneck in hyper-parameter optimization or neural architecture search, thus supporting the environmental sustainability of modern AI [68].
1) Construction of MOHPT problems: We employ a Convolutional Neural Network (CNN) as the underlying ML model, and use it to conduct multi-label/multi-task classification on various datasets. Multi-task learning, as the name suggests, is a learning paradigm where data from multiple tasks (each of which has a performance metric) are combined for joint training with shared model parameters [11], [69]. Multi-label learning considers the problem in which each example is represented by a single data instance associated with a set of labels simultaneously, and the aim is to predict the label sets of unseen instances by analyzing training instances with known label sets [70]. In essence, multi-label learning

¹⁰ For MOHPT under large-instance data, we treat the training of a model as a black box, regardless of whether its training adopts mini-batch optimizers (where a mini-batch of samples from the large-instance dataset is used in each iteration; e.g., mini-batch gradient descent) or not.
¹¹ In practice, multiple hold-out datasets may be used, with the average performance over them computed as the objective function score. Here, however, we only consider the scenario of using one hold-out dataset.
TABLE III
HYPER-PARAMETERS OF CNN.

| No. | Description | Range | Type |
| 1 | Mini-batch size | [2^4, 2^8] | Discrete |
| 2 | Size of convolution window | {1, 3, 5} | Discrete |
| 3 | Number of filters in the convolution layer | [2^3, 2^6] | Discrete |
| 4 | Dropout rate | [0, 0.5] | Continuous |
| 5 | Learning rate | [10^-4, 10^-1] | Continuous |
| 6 | Decay parameter beta_1 used in Adam | [0.8, 0.999] | Continuous |
| 7 | Decay parameter beta_2 used in Adam | [0.99, 0.9999] | Continuous |
| 8 | Parameter epsilon used in Adam | [10^-9, 10^-3] | Continuous |
can be seen as a special form of multi-task learning where each task represents a different label. In [11], researchers demonstrated the feasibility of modeling multi-task learning problems as multiobjective optimization. Thus, we can use multi-label/multi-task learning datasets to conduct the MOHPT experiments.

For the sake of simplicity, we limit the number of convolution layers in the CNN to 2. For training the CNN, we choose the cross-entropy loss with dropout regularization as the loss function and select Adam [71] (run with 10 epochs¹²) as a state-of-the-art CNN optimizer.

The hyper-parameters being optimized and their ranges used in the experiments are listed in Table III. All hyper-parameters are encoded in the range [0, 1], and they are transformed into their corresponding ranges through the transformation method used in [72].
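As an illustration of such a decoding (a plausible sketch of our own; the exact transformation of [72] may differ, e.g., in its rounding and log-scale handling), a vector u in [0, 1]^8 ordered as in Table III could be mapped as:

```python
def decode(u):
    """Map u in [0,1]^8 (ordered as in Table III) to concrete hyper-parameters."""
    return {
        "batch_size": 2 ** round(4 + 4 * u[0]),           # discrete, [2^4, 2^8]
        "conv_window": (1, 3, 5)[min(int(3 * u[1]), 2)],  # discrete choice
        "n_filters": 2 ** round(3 + 3 * u[2]),            # discrete, [2^3, 2^6]
        "dropout": 0.5 * u[3],                            # continuous, [0, 0.5]
        "learning_rate": 10 ** (-4 + 3 * u[4]),           # log scale, [1e-4, 1e-1]
        "beta_1": 0.8 + 0.199 * u[5],                     # [0.8, 0.999]
        "beta_2": 0.99 + 0.0099 * u[6],                   # [0.99, 0.9999]
        "epsilon": 10 ** (-9 + 6 * u[7]),                 # log scale, [1e-9, 1e-3]
    }
```

Mapping the learning rate and epsilon on a log scale keeps the search resolution roughly uniform across their several orders of magnitude.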
To construct an MOHPT problem, we let the classification error for each label/task act as an objective function to be minimized. Two types of real-world datasets are adopted:
1) Five multi-label learning datasets (scene, yeast, Corel5k, delicious and tmc2007_500) downloaded from the Mulan website¹³ [73]. For simplicity, we restrict each of the original datasets to three labels via the following steps: selecting the top three labels (in terms of the number of instances per label) and then deleting the data instances without any label. The final sizes of the resulting datasets are: scene (1,314 instances), yeast (2,136 instances), Corel5k (2,468 instances), delicious (11,305 instances) and tmc2007_500 (24,829 instances). In this way, the MOHPT on each dataset is modeled as a 3-objective optimization problem. In addition, each dataset is split into a training dataset D_L and a hold-out dataset D_holdout with a splitting ratio of 80%/20%.
2) One multi-task learning dataset (i.e., the MultiMNIST dataset). We adopt the construction method introduced in [11] to build the MultiMNIST dataset (a two-task learning version of the MNIST dataset, where the training dataset D_L and hold-out dataset D_holdout contain 60,000 and 10,000 instances, respectively). Hence, the MOHPT on MultiMNIST is a 2-objective optimization problem.
2) Experimental setup: In the MOHPT experiments, our
proposed algorithm is set with the environmental selection
¹²The number of epochs is usually treated as a hyper-parameter to be optimized in MOHPT. However, due to computational resource limitations, we fix the number of epochs to a small value.
¹³http://mulan.sourceforge.net/datasets-mlc.html
procedure of NSGA-II. For ease of description, we denote
our proposed algorithm with/without resource allocation as
EMT-RA and EMT, respectively. For each algorithm, the whole population size is set to psize = 20. For EMT-RA, we set the initial sizes of the large-data and small-data populations to psize_L(0) = 0.5·psize = 10 and psize_S(0) = 0.5·psize = 10, respectively. The other parameters of EMT-RA are set as follows: num = 0.1·psize = 2, Δt = 20, and N_small = (1/5)|D_L|. The large-data task uses the whole training set D_L to train the CNN and then uses D_holdout to evaluate the objective functions; the small-data task uses D_S to train the CNN and likewise uses D_holdout for evaluation. For the training process in the small-data task (or large-data task), a mini-batch of samples is drawn from D_S (or D_L) in each iteration of Adam. In addition, the reference point used for computing HV is set to the all-one vector.
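With the all-one reference point, the hypervolume (HV) for the 2-objective MultiMNIST case can be computed exactly by a simple sweep. This is a minimal illustrative sketch for the bi-objective minimization case only (the 3-objective datasets need a general HV routine, e.g., from an off-the-shelf library):

```python
def hypervolume_2d(front, ref=(1.0, 1.0)):
    """Exact hypervolume of a non-dominated bi-objective front
    (minimization) w.r.t. a reference point, here the all-one vector."""
    # Keep only points that strictly dominate the reference point.
    pts = sorted(p for p in front if p[0] < ref[0] and p[1] < ref[1])
    hv, prev_f2 = 0.0, ref[1]
    for f1, f2 in pts:                      # sweep in increasing f1
        hv += (ref[0] - f1) * (prev_f2 - f2)  # add the new rectangular slab
        prev_f2 = f2
    return hv
```

Since the objectives are classification errors in [0, 1], every attainable point dominates the all-one reference, which makes the HV values across datasets directly comparable.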
3) Effect of the size of small data: Due to space limitation,
we have placed related results in the supplementary material.
4) Comparative results: The convergence curves obtained
by NSGA-II and our proposed EMT-RA on all considered
datasets along with their obtained non-dominated solutions on
MultiMNIST can be found in the supplementary material.
To quantify the performance gain of our proposed EMT-
RA over the baseline algorithm (i.e., NSGA-II), we record
the running time for EMT-RA needed to achieve the same
level of performance (in terms of average HV values) as
NSGA-II. Specifically, for the six datasets with different numbers of instances, we first run NSGA-II with different levels of time budgets (800 seconds for scene, yeast, and Corel5k; 8,000 seconds for delicious and tmc2007_500; 18,000 seconds for MultiMNIST), and record the average HV values obtained by NSGA-II (i.e., 0.642, 0.538, 0.281, 0.277, 0.584, and 0.935 for the six datasets, respectively). Then, we run EMT-RA
and see how much time it needs to reach the corresponding
HV value obtained by NSGA-II on each dataset. The observed
results are summarized in Fig. 3, showing that the running time
consumed by our proposed EMT-RA is significantly less than
that of NSGA-II on all considered datasets. For example, on
MultiMNIST dataset, EMT-RA only needs 9,863 seconds to
achieve the same HV result as NSGA-II with 18,000 seconds.
Furthermore, we show the speedup in terms of running time
obtained by our proposed EMT-RA in Fig. 4. From this figure,
we can observe that the algorithm obtains a 40%–75% speedup on most of the considered datasets. These results on
medium- and large-size datasets demonstrate that EMT-RA
can efficiently deal with the practical multiobjective hyper-
parameter tuning of neural network models.
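The speedup figures quoted here correspond to the fractional running-time saving of EMT-RA relative to the baseline budget. The helper below is an illustrative restatement of that calculation (only the MultiMNIST pair of times is stated explicitly in the text; the remaining per-dataset times are in Fig. 3):

```python
def speedup(t_baseline, t_emt_ra):
    """Fractional running-time saving of EMT-RA over the baseline,
    measured at equal average-HV performance."""
    return 1.0 - t_emt_ra / t_baseline

# MultiMNIST example from the text: 9,863 s for EMT-RA vs. an
# 18,000 s budget for NSGA-II.
print(f"{speedup(18000, 9863):.1%}")  # prints 45.2%
```

Under this definition, the reported 40%–75% range means EMT-RA reaches the same HV in roughly one quarter to three fifths of the baseline's running time.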
VI. CONCLUSION
In this paper, we have put forward a novel evolutionary mul-
titasking (EMT) framework targeting scalable multiobjective
optimization under large-instance data. In this framework, a
series of computationally-cheaper small-data tasks (referred to
as minions) are generated on-the-fly via random subsampling,
with the aim of assisting the target large-data task in the search
for Pareto optimal solutions. Notably, our framework can be wrapped around any multiobjective evolutionary algorithm to address the big data problem.

Fig. 3. Comparison of the running time needed for the baseline and our proposed EMT-RA algorithm to achieve the same level of performance (i.e., reaching an average HV value of 0.642, 0.538, 0.281, 0.277, 0.584 and 0.935, respectively, on the datasets listed along the x-axis). Our proposed algorithm consumes significantly less running time than its baseline.

Fig. 4. Speedup obtained by our proposed EMT-RA algorithm on all considered datasets. About 40%–75% speedup is observed on most datasets.

Its salient feature is an online
computational resource allocation strategy based on Bayes’
rule, which automatically rewards more resources to the in-
expensive small-data tasks when they demonstrate beneficial
transfers to the target. In the empirical studies, we have verified
the performance of EMT with resource allocation through a
series of experiments on multiobjective optimization under un-
certainty as well as the multiobjective hyper-parameter tuning
of deep neural network models, covering different suites of
benchmark problems and various sizes of real-world datasets.
For future work, on the one hand, we shall further comple-
ment our methodology through the incorporation of surrogate-
assistance techniques, and also investigate alternative subsam-
pling approaches to enable the small-instance dataset to better
guide the search on large-instance data. On the other hand,
we shall continue to rigorously verify the performance of
EMT-RA on datasets containing millions (or more) instances,
spanning a much richer variety of data-driven multiobjective
optimization problems of real-world interest.
REFERENCES
[1] S. Mardle and K. M. Miettinen, “Nonlinear multiobjective optimization,
Journal of the Operational Research Society, vol. 51, no. 2, p. 246, 1999.
[2] D. P. Heyman and M. J. Sobel, Stochastic Models in Operations
Research Volume II: Stochastic Optimization. McGraw Hill, New York,
2003.
[3] P. Pandita, I. Bilionis, J. Panchal, B. P. Gautham, A. Joshi, and P. Zagade,
“Stochastic multiobjective optimization on a budget: Application to
multipass wire drawing with quantified uncertainties,” International
Journal for Uncertainty Quantification, vol. 8, no. 3, pp. 233–249, 2018.
[4] B. Wilder, B. Dilkina, and M. Tambe, “Melding the data-decisions
pipeline: Decision-focused learning for combinatorial optimization,” in
Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33,
no. 01, 2019, pp. 1658–1665.
[5] Y. Hu, Y. Zhang, and D. Gong, “Multiobjective particle swarm opti-
mization for feature selection with fuzzy cost,” IEEE Transactions on
Cybernetics, vol. 51, no. 2, pp. 874–888, 2021.
[6] N. Zhang, A. Gupta, Z. Chen, and Y.-S. Ong, “Evolutionary machine
learning with minions: A case study in feature selection,” IEEE Trans-
actions on Evolutionary Computation, vol. 26, no. 1, pp. 130–144, 2022.
[7] J. Luo, D. Zhou, L. Jiang, and H. Ma, “A particle swarm optimization
based multiobjective memetic algorithm for high-dimensional feature
selection,” Memetic Computing, vol. 14, no. 1, pp. 77–93, 2022.
[Online]. Available: https://doi.org/10.1007/s12293- 022-00354- z
[8] Z. Wang, S. Gao, M. Zhou, S. Sato, J. Cheng, and J. Wang, “Information-
theory-based nondominated sorting ant colony optimization for mul-
tiobjective feature selection in classification, IEEE Transactions on
Cybernetics, pp. 1–14, 2022.
[9] X. He, K. Zhao, and X. Chu, “Automl: A survey of the state-of-the-art,”
Knowledge-Based Systems, vol. 212, p. 106622, 2021.
[10] A. Morales-Hernndez, I. V. Nieuwenhuyse, and S. R. Gonzalez, “A
survey on multi-objective hyperparameter optimization algorithms for
machine learning,” arXiv e-prints, 2021.
[11] O. Sener and V. Koltun, “Multi-task learning as multi-objective op-
timization,” in Proceedings of the 32nd International Conference on
Neural Information Processing Systems, 2018, pp. 525–536.
[12] D. Ballabio, “Parsimonious optimization of multitask neural network
hyperparameters,” Molecules, vol. 26, 2021.
[13] T. Elsken, J. H. Metzen, and F. Hutter, “Efficient multi-objective
neural architecture search via lamarckian evolution, arXiv preprint
arXiv:1804.09081, 2018.
[14] Z. Lu, I. Whalen, V. Boddeti, Y. Dhebar, K. Deb, E. Goodman, and
W. Banzhaf, “Nsga-net: neural architecture search using multi-objective
genetic algorithm,” in Proceedings of the Genetic and Evolutionary
Computation Conference, 2019, pp. 419–427.
[15] B. Lyu, S. Wen, K. Shi, and T. Huang, “Multiobjective reinforcement
learning-based neural architecture search for efficient portrait parsing,”
IEEE Transactions on Cybernetics, pp. 1–12, 2021.
[16] Y. Bi, B. Xue, and M. Zhang, “Multitask feature learning as multiob-
jective optimization: A new genetic programming approach to image
classification,” IEEE Transactions on Cybernetics, pp. 1–14, 2022.
[17] Z.-H. Zhou, N. V. Chawla, Y. Jin, and G. J. Williams, “Big data op-
portunities and challenges: Discussions from data analytics perspectives
[discussion forum],” IEEE Computational intelligence magazine, vol. 9,
no. 4, pp. 62–74, 2014.
[18] P. L. Yu, “Cone convexity, cone extreme points, and nondominated
solutions in decision problems with multiobjectives, Journal of Op-
timization Theory and Applications, vol. 14, no. 3, pp. 319–377, 1974.
[19] K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan, A fast and elitist
multiobjective genetic algorithm: Nsga-ii, IEEE transactions on evolu-
tionary computation, vol. 6, no. 2, pp. 182–197, 2002.
[20] L. M. Pang, H. Ishibuchi, and K. Shang, “Nsga-ii with simple modifi-
cation works well on a wide variety of many-objective problems, IEEE
Access, vol. 8, pp. 190 240–190 250, 2020.
[21] E. Zitzler, M. Laumanns, and L. Thiele, “Spea2: Improving the strength
pareto evolutionary algorithm, TIK-report, vol. 103, 2001.
[22] Q. Zhang and H. Li, “Moea/d: A multiobjective evolutionary algorithm
based on decomposition,” IEEE Transactions on evolutionary computa-
tion, vol. 11, no. 6, pp. 712–731, 2007.
[23] A. Ferranti, F. Marcelloni, and A. Segatori, “A multi-objective evolution-
ary fuzzy system for big data,” in 2016 IEEE International Conference
on Fuzzy Systems (FUZZ-IEEE), 2016, pp. 1562–1569.
[24] A. Ferranti, F. Marcelloni, A. Segatori, M. Antonelli, and P. Ducange,
“A distributed approach to multi-objective evolutionary generation of
fuzzy rule-based classifiers from big data,” Information Sciences, vol.
415, pp. 319–340, 2017.
[25] M. Barsacchi, A. Bechini, P. Ducange, and F. Marcelloni, “Optimizing
partition granularity, membership function parameters, and rule bases of
fuzzy classifiers for big data by a multi-objective evolutionary approach,
Cognitive Computation, vol. 11, no. 3, pp. 367–387, 2019.
[26] F. Pulgar-Rubio, A. Rivera-Rivas, M. D. Pérez-Godoy, P. González, C. J. Carmona, and M. J. del Jesus, “MEFASD-BD: Multi-objective evolutionary fuzzy algorithm for subgroup discovery in big data environments - a MapReduce solution,” Knowledge-Based Systems, vol. 117, pp. 70–78, 2017.
[27] M. Golchin and A. W.-C. Liew, “Bi-clustering by multi-objective evolu-
tionary algorithm for multimodal analytics and big data,” in Multimodal
Analytics for Next-Generation Big Data Technologies and Applications.
Springer, 2019, pp. 125–150.
[28] G. N. Karagoz, A. Yazici, T. Dokeroglu, and A. Cosar, “A new frame-
work of multi-objective evolutionary algorithms for feature selection
and multi-label classification of video data,” International Journal of
Machine Learning and Cybernetics, vol. 12, no. 1, pp. 53–71, 2021.
[29] A. Gülcü and Z. Kuş, “Multi-objective simulated annealing for hyper-parameter optimization in convolutional neural networks,” PeerJ Computer Science, vol. 7, p. e338, 2021.
[30] A. Garcia-Piquer, A. Fornells, J. Bacardit, A. Orriols-Puig, and E. Golo-
bardes, “Large-scale experimental evaluation of cluster representations
for multiobjective evolutionary clustering, IEEE transactions on evolu-
tionary computation, vol. 18, no. 1, pp. 36–53, 2013.
[31] A. Garcia-Piquer, J. Bacardit, A. Fornells, and E. Golobardes, “Scaling-
up multiobjective evolutionary clustering algorithms using stratification,
Pattern Recognition Letters, vol. 93, pp. 69–77, 2017.
[32] Y.-S. Ong and A. Gupta, “Evolutionary multitasking: a computer science
view of cognitive multitasking, Cognitive Computation, vol. 8, no. 2,
pp. 125–142, 2016.
[33] A. Gupta, Y.-S. Ong, and L. Feng, “Multifactorial evolution: toward
evolutionary multitasking, IEEE Transactions on Evolutionary Compu-
tation, vol. 20, no. 3, pp. 343–357, 2016.
[34] A. Gupta, Y.-S. Ong, L. Feng, and K. C. Tan, “Multiobjective multifac-
torial optimization in evolutionary multitasking, IEEE transactions on
cybernetics, vol. 47, no. 7, pp. 1652–1665, 2017.
[35] T. Wei, S. Wang, J. Zhong, D. Liu, and J. Zhang, A review on
evolutionary multi-task optimization: Trends and challenges, IEEE
Transactions on Evolutionary Computation, pp. 1–1, 2021.
[36] A. Gupta, L. Zhou, Y.-S. Ong, Z. Chen, and Y. Hou, “Half a dozen
real-world applications of evolutionary multitasking, and more, IEEE
Computational Intelligence Magazine, vol. 17, no. 2, pp. 49–66, 2022.
[37] Y. Jin, H. Wang, T. Chugh, D. Guo, and K. Miettinen, “Data-driven
evolutionary optimization: An overview and case studies,” IEEE Trans-
actions on Evolutionary Computation, vol. 23, no. 3, pp. 442–458, 2018.
[38] Y. Jin, “Surrogate-assisted evolutionary computation: Recent advances
and future challenges,” Swarm and Evolutionary Computation, vol. 1,
no. 2, pp. 61–70, 2011.
[39] L. V. Santana-Quintero, A. A. Montano, and C. A. C. Coello, A
review of techniques for handling expensive functions in evolutionary
multi-objective optimization, Computational intelligence in expensive
optimization problems, pp. 29–59, 2010.
[40] J. Knowles, “Parego: A hybrid algorithm with on-line landscape ap-
proximation for expensive multiobjective optimization problems, IEEE
Transactions on Evolutionary Computation, vol. 10, no. 1, pp. 50–66,
2006.
[41] Q. Zhang, W. Liu, E. Tsang, and B. Virginas, “Expensive multiobjective
optimization by moea/d with gaussian process model,” IEEE Transac-
tions on Evolutionary Computation, vol. 14, no. 3, pp. 456–474, 2009.
[42] T. Chugh, Y. Jin, K. Miettinen, J. Hakanen, and K. Sindhya, “A
surrogate-assisted reference vector guided evolutionary algorithm for
computationally expensive many-objective optimization,” IEEE Trans-
actions on Evolutionary Computation, vol. 22, no. 1, pp. 129–142, 2016.
[43] R. G. Regis, “Evolutionary programming for high-dimensional con-
strained expensive black-box optimization using radial basis functions,
IEEE Transactions on Evolutionary Computation, vol. 18, no. 3, pp.
326–347, 2013.
[44] C. Sun, Y. Jin, R. Cheng, J. Ding, and J. Zeng, “Surrogate-assisted co-
operative swarm optimization of high-dimensional expensive problems,”
IEEE Transactions on Evolutionary Computation, vol. 21, no. 4, pp.
644–660, 2017.
[45] S. Zapotecas Martínez and C. A. Coello Coello, “Moea/d assisted by
rbf networks for expensive multi-objective optimization problems,” in
Proceedings of the 15th annual conference on Genetic and evolutionary
computation, 2013, pp. 1405–1412.
[46] Z. Zhou, Y. S. Ong, M. H. Nguyen, and D. Lim, “A study on polynomial
regression and gaussian process global surrogate model in hierarchical
surrogate-assisted evolutionary algorithm, in 2005 IEEE congress on
evolutionary computation, vol. 3. IEEE, 2005, pp. 2832–2839.
[47] M. Parsa, J. P. Mitchell, C. D. Schuman, R. M. Patton, T. E. Potok,
and K. Roy, “Bayesian multi-objective hyperparameter optimization for
accurate, fast, and efficient neural network accelerator design, Frontiers
in neuroscience, vol. 14, p. 667, 2020.
[48] M. Zaefferer and T. Bartz-Beielstein, “Efficient global optimization with
indefinite kernels,” in Parallel Problem Solving from Nature - PPSN XIV
- 14th International Conference, Edinburgh, UK, September 17-21, 2016,
Proceedings, ser. Lecture Notes in Computer Science, J. Handl, E. Hart,
P. R. Lewis, M. López-Ibáñez, G. Ochoa, and B. Paechter, Eds., vol.
9921. Springer, 2016, pp. 69–79.
[49] S. Park, Y.-D. Kim, and S. Choi, “Hierarchical bayesian matrix fac-
torization with side information,” in Twenty-Third International Joint
Conference on Artificial Intelligence, 2013.
[50] A. Gupta, Y.-S. Ong, and L. Feng, “Insights on Transfer Optimization:
Because Experience is the Best Teacher,” IEEE Transactions on Emerg-
ing Topics in Computational Intelligence, vol. 2, no. 1, pp. 51–64, 2017.
[51] L. Zhang, Y. Xie, J. Chen, L. Feng, C. Chen, and K. Liu, “A study on
multiform multi-objective evolutionary optimization, Memetic Comput-
ing, vol. 13, no. 3, pp. 307–318, sep 2021.
[52] X. Ma, J. Yin, A. Zhu, X. Li, Y. Yu, L. Wang, Y. Qi, and Z. Zhu,
“Enhanced multifactorial evolutionary algorithm with meme helper-
tasks,” IEEE Transactions on Cybernetics, vol. 52, no. 8, pp. 7837–7851,
2022.
[53] Y. Feng, L. Feng, S. Kwong, and K. C. Tan, “A multivariation multifacto-
rial evolutionary algorithm for large-scale multiobjective optimization,”
IEEE Transactions on Evolutionary Computation, vol. 26, no. 2, pp.
248–262, 2022.
[54] K. Chen, B. Xue, M. Zhang, and F. Zhou, An evolutionary multitasking-
based feature selection method for high-dimensional classification,”
IEEE Transactions on Cybernetics, vol. 52, no. 7, pp. 7172–7186, 2022.
[55] ——, “Evolutionary multitasking for feature selection in high-
dimensional classification via particle swarm optimization,” IEEE Trans-
actions on Evolutionary Computation, vol. 26, no. 3, pp. 446–460, 2022.
[56] S. Yao, Z. Dong, X. Wang, and L. Ren, “A Multiobjective multifactorial
optimization algorithm based on decomposition and dynamic resource
allocation strategy,” Information Sciences, vol. 511, pp. 18–35, 2020.
[Online]. Available: https://doi.org/10.1016/j.ins.2019.09.058
[57] T. Wei and J. Zhong, “Towards Generalized Resource Allocation on
Evolutionary Multitasking for Multi-Objective Optimization, IEEE
Computational Intelligence Magazine, vol. 16, no. 4, pp. 20–37, 2021.
[58] K. Deb and H. Gupta, “Introducing robustness in multi-objective opti-
mization,” Evolutionary Computation, vol. 14, no. 4, pp. 463–494, 2006.
[59] C. Liang and S. Mahadevan, “Pareto surface construction for multi-
objective optimization under uncertainty,” Structural and Multidisci-
plinary Optimization, vol. 55, no. 5, pp. 1865–1882, 2017.
[60] A. Shapiro, D. Dentcheva, and A. Ruszczyński, Lectures on stochastic
programming: modeling and theory. SIAM, 2014.
[61] K. Deb, L. Thiele, M. Laumanns, and E. Zitzler, “Scalable multi-
objective optimization test problems, in Evolutionary Computation,
2002. CEC ’02. Proceedings of the 2002 Congress on, vol. 1, May
2002, pp. 825–830.
[62] E. Zitzler and L. Thiele, “Multiobjective evolutionary algorithms: a com-
parative case study and the strength pareto approach, IEEE Transactions
on Evolutionary Computation, vol. 3, no. 4, pp. 257–271, 1999.
[63] K. Deb and R. Agrawal, “Simulated binary crossover for continuous
search space,” Complex Systems, vol. 9, pp. 115–48, April 1995.
[64] K. Deb and M. Goyal, “A combined genetic adaptive search (GeneAS) for engineering design,” Computer Science and Informatics, vol. 26, pp. 30–45, 1999.
[65] I. Das and J. E. Dennis, “Normal-boundary intersection: A new method
for generating the pareto surface in nonlinear multicriteria optimization
problems,” Siam Journal on Optimization, vol. 8, no. 3, pp. 631–657,
1996.
[66] E. Strubell, A. Ganesh, and A. McCallum, “Energy and policy con-
siderations for deep learning in nlp,” arXiv preprint arXiv:1906.02243,
2019.
[67] L. F. W. Anthony, B. Kanding, and R. Selvan, “Carbontracker: Tracking
and predicting the carbon footprint of training deep learning models,”
arXiv preprint arXiv:2007.03051, 2020.
[68] Y.-S. Ong and A. Gupta, “Air 5: Five pillars of artificial intelligence
research,” IEEE Transactions on Emerging Topics in Computational
Intelligence, vol. 3, no. 5, pp. 411–415, 2019.
[69] R. Caruana, “Multitask learning,” Machine Learning, vol. 28, no. 1, pp.
41–75, 1997.
[70] M.-L. Zhang and Z.-H. Zhou, “A review on multi-label learning al-
gorithms,” IEEE Transactions on Knowledge and Data Engineering,
vol. 26, no. 8, pp. 1819–1837, 2014.
[71] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,
arXiv preprint arXiv:1412.6980, 2014.
[72] I. Loshchilov and F. Hutter, “Cma-es for hyperparameter optimization
of deep neural networks,” arXiv preprint arXiv:1604.07269, 2016.
[73] G. Tsoumakas, I. Katakis, and I. Vlahavas, “Mining multi-label data,
in Data mining and knowledge discovery handbook. Springer, 2009,
pp. 667–685.
Zefeng Chen received the B. Sc. degree in Infor-
mation and Computational Science from Sun Yat-
sen University, Guangzhou, China, in 2013, and the
M. Sc. degree in Computer Science and Technol-
ogy from South China University of Technology,
Guangzhou, China, in 2016, and the Ph.D. degree
in Computer Science and Technology from Sun
Yat-sen University, Guangzhou, China, in 2019. He
was a Post-Doctoral Research Fellow working with
Prof. Yew-Soon Ong at the School of Computer
Science and Engineering, Nanyang Technological
University, Singapore, from October 2019 to October 2021. Currently, he
is an Assistant Professor in the School of Artificial Intelligence, Sun Yat-sen
University (SYSU). His current research interests mainly include evolutionary
computation, evolutionary learning and data-driven optimization.
Abhishek Gupta received the PhD degree in En-
gineering Science from the University of Auckland,
New Zealand, in 2014. He is currently a Scientist in
the Singapore Institute of Manufacturing Technol-
ogy, a research institute in Singapore's Agency for
Science, Technology and Research (A*STAR). He
also holds a joint appointment with the School of
Computer Science and Engineering at the Nanyang
Technological University. Abhishek has diverse re-
search experience in computational science. Current-
ly, his main interests lie in the theory and algorithms
of transfer and multitask optimization, neuroevolution, surrogate modeling,
and scientific machine learning. Abhishek is the recipient of the 2019 and
the 2023 IEEE Transactions on Evolutionary Computation Outstanding Paper
Award, for foundational works on evolutionary multitasking. He received the
IEEE Transactions on Emerging Topics in Computational Intelligence 2021
Outstanding Associate Editor Award. He is also editorial board member of
the Complex & Intelligent Systems journal, the Memetic Computing journal,
and the Springer book series on Adaptation, Learning, and Optimization.
Lei Zhou received the B.E. degree from the School
of Computer Science and Technology, Shandong
University, Shandong, China, in 2014, and the Ph.D.
degree from the College of Computer Science,
Chongqing University, Chongqing, China, in 2019.
His current research interests include evolutionary
computations, memetic computing, as well as trans-
fer learning and optimization.
Yew-Soon Ong (M'99–SM'12–F'18) received
the Ph.D. degree in artificial intelligence in com-
plex engineering design from the University of
Southampton, U.K., in 2003. He is a President Chair
Professor in Computer Science at the Nanyang Tech-
nological University (NTU) and concurrently the
Chief Artificial Intelligence Scientist of the Agen-
cy for Science, Technology and Research (A*Star)
Singapore. At NTU, he serves as co-Director of
the Singtel-NTU Cognitive & Artificial Intelligence
Joint Lab. His core research interest is in artificial
and computational intelligence where he has received four IEEE outstanding
paper awards. He was listed as a Thomson Reuters highly cited researcher
and among the World’s Most Influential Scientific Minds. He is the in-
augural Editor-in-Chief of the IEEE Transactions on Emerging Topics in
Computational Intelligence and an associate editor of the IEEE Transactions on Evolutionary Computation, the IEEE Transactions on Neural Networks and Learning Systems, the IEEE Transactions on Cybernetics, and the IEEE Transactions on Artificial Intelligence.
... Secondly, emergent abilities of AGI in models of ETO are still unknown. Most works of ETO [28] are focused on multitasking optimization (MTO). There are 2 works for scaling up in MTO settings. ...
... 1 st one is "Scalable Transfer Evolutionary Optimization: Coping With Big Task Instances", where they handle the evolutionary scenarios beyond many source task instances [29]. Another work [28] is using subsampled smalldata evolutionary tasks as auxiliary source tasks for scaling up with the title of "Scaling Multiobjective Evolution to Large Data With Minions: A Bayes-Informed Multitask Approach". As a new model or paradigm, single-objective to multiobjective optimization or SMO belongs to the third kind of "complex optimization" in the ETO survey [20], which focus on learning from the single-objective tasks. ...
... Those two paradigms actually differ from each other in its nature in some sense, as we have seen in Section 4.1 of "rMeets". Most works of ETO are for MTO, we expect that well-studied MTO and [28,29] many scaling up tools for MTO can help explore emergent abilities better. ...
Article
Full-text available
Towards artificial general intelligence, emergent abilities of large language models (LLMs) are observed wildly especially for well-known GPTs, which are due to scaling up primarily along three factors: training computation, model parameters, and dataset size. Scaling up makes emergence. Inspired by the insights for LLMs case, we scale up the number of auxiliary tasks (from 2 to many) to boost the core task for the model of single-objective to multi-objective optimization (SMO) in evolutionary transfer optimization (ETO), therefore achieve a new version of SMO in the open benchmarks of vehicle routing problems, permutation flow shop scheduling problems and travel salesman problems. We name the new SMO armed with transfer core as a “Thousand-Hand Bodhisattva” with many arms or hands and analyze its emergent abilities.
... M ULTI-TASK evolutionary optimization (MTEO) has stood as an emerging framework to solve complex optimization problems, undergoing remarkable advancements [1,2,3,4] since the multi-factorial evolutionary algorithm (MFEA) was proposed [5]. MTEO aims to address situations where there are multiple related tasks or objectives being optimized simultaneously, with the goal of achieving better performance on each task. ...
Preprint
Multi-Task Evolutionary Optimization (MTEO), an important field focusing on addressing complex problems through optimizing multiple tasks simultaneously, has attracted much attention. While MTEO has been primarily focusing on task similarity, there remains a hugely untapped potential in harnessing the shared characteristics between different domains to enhance evolutionary optimization. For example, real-world complex systems usually share the same characteristics, such as the power-law rule, small-world property, and community structure, thus making it possible to transfer solutions optimized in one system to another to facilitate the optimization. Drawing inspiration from this observation of shared characteristics within complex systems, we set out to extend MTEO to a novel framework - multi-domain evolutionary optimization (MDEO). To examine the performance of the proposed MDEO, we utilize a challenging combinatorial problem of great security concern - community deception in complex networks as the optimization task. To achieve MDEO, we propose a community-based measurement of graph similarity to manage the knowledge transfer among domains. Furthermore, we develop a graph representation-based network alignment model that serves as the conduit for effectively transferring solutions between different domains. Moreover, we devise a self-adaptive mechanism to determine the number of transferred solutions from different domains and introduce a novel mutation operator based on the learned mapping to facilitate the utilization of knowledge from other domains. Experiments on eight real-world networks of different domains demonstrate MDEO superiority in efficacy compared to classical evolutionary optimization. Simulations of attacks on the community validate the effectiveness of the proposed MDEO in safeguarding community security.
... Liu proposed a discriminative reconstruction network for implicit knowledge to transfer implicit knowledge between different LSMOP tasks to accelerate convergence [26]. Chen constructed auxiliary tasks with small data sets to accelerate the evolution of the target LSMOP tasks, and also using a bayes-based approach to measure the relationship between tasks and achieve adaptive resource allocation [27]. Feng constructed multiple helper tasks from simple search spaces and transferred the knowledge to target task to accelerate solving LSMOP [28]. ...
Article
Full-text available
The decomposable feature of operations in the welding shop scheduling scenario results in a vast search space, posing challenges for the design of traditional optimization algorithms. Addressing the multi-objective distributed heterogeneous welding shop scheduling problem (DHWSP), this work introduces a generalized multitasking framework. It establishes an auxiliary task by employing knowledge-and-learning-synergy neighborhood search, thereby enhancing the convergence and diversity of the original task. In this framework, an enhanced competitive swarm optimizer is adopted as the original task for DHWSP. Additionally, knowledge expression and transfer strategies are designed to expedite the comprehensive performance of each task by leveraging knowledge gained from search results. Finally, a memetic algorithm based on the multitasking framework is proposed for DHWSP. The effectiveness of the algorithm is validated through extensive experiments on 20 DHWSP instances. Numerical experimental results indicate that the proposed multitasking framework can significantly improve algorithmic comprehensive performance, demonstrating its efficacy in addressing the multi-objective DHWSP within a complex search space.
... And emergent abilities in the 3 rd battlefield are far from mature so far, whether for SMO or MTO. In MTO, there exist 2 works for scaling up [28] [29]. For SMO, we firstly invent discrete SMO and are familiar with it, therefore, we will explore the emergent abilities of ETO restricted in SMO settings. ...
Article
Full-text available
Towards the mission of building artificial general intelligence, the unpredictable phenomena of emergent abilities in large language models are quite impressive and inspiring especially in GPTs. Emergence are achieved by scaling up three variables: training computation, model parameters, and dataset size. Following the spirit above, we scale up three large factors of duration, gap and population for the new model of single-objective to multi-objective optimization (SMO) in the background of evolutionary transfer optimization (ETO), which serves as a complementary part to our previous work of "tMeets" (scaling up the number of auxiliary tasks to get a "t"housand-hand bodhisattva). We name our paper here as "lMeets" with "large topics" and hope that both tMeets and lMeets could make up the full picture of scaling up or emergent abilities in SMO, which will be tested in vehicle routing problems benchmarks.
Preprint
Full-text available
Single-objective to multi-objective optimization (SMO) is proved to be a new efficient kind of evolutionary transfer optimization (ETO) for both continuous and discrete cases, for both well-known benchmarks and real-world applications , working as a promising artificial general intelligence (AGI) tool and/or system. At problem side, SMO ranges from vehicle routing problem with time windows to vehicle routing problem, which can be further simplified/reduced to travel salesman problem (TSP) here. For algorithmic side, SMO are also developed with global search like genetic algorithms, local search (insert) and memetic algorithms combing those two kinds of search mentioned above, which inspires our extension to fractal search (FS, whose self-similarity exists at all scales or when scaling up, whose diffusion can be Gaussian walk). In this paper or "fMeets" (borrow "f" from fractal), we run heavy computational simulations of "TSP+FS" in well-known TSP benchmark.
Preprint
Full-text available
Production quality during hot rolling in the iron and steel industry serves as a core competence of manufacturing equipment; it is usually difficult to test online and can only be tested offline after the task of rolling thick slabs into thin coils is completed. In order to improve production quality in hot rolling via artificial general intelligence (AGI), we develop a solution method within the computational framework of single-objective to multi-objective optimization (SMO) under evolutionary transfer optimization (ETO). Our solution method strives for "emergent abilities" of AGI similar to what we have seen in large language models. We name the new SMO approach, aimed at intelligent manufacturing equipment and technology and armed with a transfer core for "hot" rolling quality analytics, "hMeets"; it is a direct real-world application of our previous work "tMeets".
Article
Over the last few years, big data have emerged as a paradigm for processing and analyzing a large volume of data. Coupled with other paradigms, such as cloud computing, service computing, and Internet of Things, big data processing takes advantage of the underlying cloud infrastructure, which allows hosting and managing massive amounts of data, while service computing allows processing and delivering various data sources as on-demand services. This synergy between multiple paradigms has led to the emergence of big services, a cross-domain, large-scale, and big-data-centric service model. Apart from the adaptation issues (e.g., the need for high reactivity to changes) inherited from other service models, the massiveness and heterogeneity of big services add a new factor of complexity to the way such a large-scale service ecosystem is managed in case of execution deviations. Indeed, big services are often subject to frequent deviations at both the functional (e.g., service failure, QoS degradation, and IoT resource unavailability) and data (e.g., data source unavailability or access restrictions) levels. Handling these execution problems is beyond the capacity of traditional web/cloud service management tools, and the majority of big service approaches have targeted specific management operations, such as selection and composition. To maintain a moderate state and high quality of their cross-domain execution, big services should be continuously monitored and managed in a scalable and autonomous way. To cope with the absence of self-management frameworks for large-scale services, the goal of this work is to design an autonomic management solution that takes full control of big services in an autonomous and distributed lifecycle process. We combine autonomic computing and big data processing paradigms to endow big services with self-* and parallel processing capabilities.
The proposed management framework takes advantage of the well-known MapReduce programming model and Apache Spark, and manages the big service's related data using knowledge graph technology. We also define a scalable embedding model that allows processing and learning latent big service knowledge in a distributed manner. Finally, a cooperative decision mechanism is defined to trigger non-conflicting management policies in response to the captured deviations of the running big service. Big services' management tasks (monitoring, embedding, and decision), as well as the core modules (autonomic managers' controller, embedding module, and coordinator), are implemented on top of Apache Spark as MapReduce jobs, while the processed data are represented as resilient distributed dataset (RDD) structures. To exploit the shared information exchanged between the workers and the master node (coordinator), and for further resolution of conflicts between management policies, we endowed the proposed framework with a lightweight communication mechanism that allows transferring useful knowledge between the running MapReduce tasks and filtering inappropriate intermediate data (e.g., conflicting actions). The experimental results proved the increased quality of embeddings and the high performance of autonomic managers in a parallel and cooperative setting, thanks to the shared knowledge.
Chapter
When faced with large-instance datasets, existing feature selection methods based on evolutionary algorithms still face the challenge of high computational cost. To address this issue, this paper proposes a scalable evolutionary algorithm for feature selection on large-instance datasets, namely the transfer-learning-based co-surrogate-assisted evolutionary multitask algorithm (cosEMT). First, we tackle feature selection on large-instance datasets via an evolutionary multitasking framework. Co-surrogate models are constructed to measure the similarity between each auxiliary task and the main task, and knowledge transfer between tasks is realized through instance-based transfer learning. Based on the numerical relationship between the relative and absolute numbers of transferable instances, we propose a novel dynamic resource allocation strategy to make more efficient use of limited computational resources and accelerate evolutionary convergence. Meanwhile, an adaptive surrogate model update mechanism is proposed to balance the exploration and exploitation of the base optimizer embedded in the cosEMT framework. Finally, the proposed algorithm is compared with several state-of-the-art feature selection algorithms on twelve large-instance datasets. The experimental results show that the cosEMT framework achieves significant acceleration in convergence speed and obtains high-quality solutions, verifying that cosEMT is a highly competitive method for feature selection on large-instance datasets.
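The similarity-proportional resource allocation described in this abstract can be illustrated in a few lines. This is a toy sketch, not the cosEMT implementation: the function name `allocate_budget`, the similarity scores, and the uniform floor that keeps weakly related tasks from being starved are all illustrative assumptions.

```python
def allocate_budget(similarities, total_evals, floor=0.05):
    """Split an evaluation budget across auxiliary tasks in proportion
    to their estimated similarity to the main task. `floor` is the
    fraction of the budget shared uniformly so that no task is starved
    entirely, even when its similarity estimate is low."""
    n = len(similarities)
    total_sim = sum(similarities)
    budgets = []
    for s in similarities:
        prop = s / total_sim if total_sim > 0 else 1.0 / n
        # Blend a uniform floor with the similarity-proportional share.
        share = floor / n + (1.0 - floor) * prop
        budgets.append(round(total_evals * share))
    return budgets

# Three auxiliary tasks with decreasing similarity to the main task.
budgets = allocate_budget([0.6, 0.3, 0.1], total_evals=1000)
```

Rounding can leave the shares off by an evaluation or two; a real implementation would reconcile the remainder.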
Article
In an era of parallel computing, evolutionary multitasking optimization (EMT) has become a popular optimization paradigm due to its ability to optimize several tasks simultaneously. Common knowledge, when transferred among tasks, can improve the solution quality and efficiency for each component optimization task. Therefore, the performance of traditional EMT algorithms mostly relies on the correlation between tasks. In the field of EMT, a key issue that urgently needs to be solved is the impact of negative transfer when tackling optimization tasks with low correlation. To overcome this shortcoming, this paper proposes a multiobjective EMT algorithm, EMT-GFK. In the proposed algorithm, a union subspace of the optimization tasks is designed to extract compact information. Furthermore, geodesic-flow-kernel-based domain adaptation is applied to learn a nonlinear mapping matrix, which can increase the correlation between tasks. Numerical experiments and result analysis on the MO-MTO test suites demonstrate the effectiveness of the proposed EMT-GFK.
Article
Full-text available
Hyperparameter optimization (HPO) is a necessary step to ensure the best possible performance of Machine Learning (ML) algorithms. Several methods have been developed to perform HPO; most of these are focused on optimizing one performance measure (usually an error-based measure), and the literature on such single-objective HPO problems is vast. Recently, though, algorithms have appeared that focus on optimizing multiple conflicting objectives simultaneously. This article presents a systematic survey of the literature published between 2014 and 2020 on multi-objective HPO algorithms, distinguishing between metaheuristic-based algorithms, metamodel-based algorithms and approaches using a mixture of both. We also discuss the quality metrics used to compare multi-objective HPO procedures and present future research directions.
Article
Full-text available
Until recently, the potential to transfer evolved skills across distinct optimization problem instances (or tasks) was seldom explored in evolutionary computation. The concept of evolutionary multitasking (EMT) fills this gap. It unlocks a population’s implicit parallelism to jointly solve a set of tasks, hence creating avenues for skills transfer between them. Despite it being early days, the idea of EMT has begun to show promise in a range of real-world applications. In the backdrop of recent advances, the contribution of this paper is twofold. First, a review of several application-oriented explorations of EMT in the literature is presented; the works are assimilated into half a dozen broad categories according to their respective application domains. Each of these six categories elaborates fundamental motivations to multitask, and contains a representative experimental study (referred from the literature). Second, a set of recipes is provided showing how problem formulations of general interest, those that cut across different disciplines, could be transformed in the new light of EMT. Our discussions emphasize the many practical use-cases of EMT, and are intended to spark future research towards crafting novel algorithms for real-world deployment.
Article
Full-text available
Feature selection, as one of the dimension reduction methods, is a crucial processing step in dealing with high-dimensional data. It tries to preserve a feature subset representing the whole feature space, aiming to reduce redundancy and increase classification accuracy. Since the two objectives usually conflict with each other, feature selection is modeled as a multiobjective problem. However, the large search space and discrete Pareto front make it difficult for existing evolutionary multiobjective algorithms. Classic evolutionary computation methods, which are often applied to the feature selection problem straightforwardly, gradually expose their inefficiency in the search process. Hence, a particle swarm optimization based multiobjective memetic algorithm for high-dimensional feature selection is designed in this paper to address the above shortcomings. Its basic idea is to model feature selection as a multiobjective optimization problem that simultaneously optimizes the number of features and the classification accuracy under supervision, in which information-entropy-based initialization and adaptive local search are designed to improve search efficiency. Moreover, a new particle velocity update rule considering both the convergence and diversity of solutions is designed to update particles, and a fast discrete nondominated sorting strategy is designed to rank the Pareto solutions. These strategies enable the proposed algorithm to achieve better performance in both the quality and the size of the feature subset. The experimental results show that the proposed algorithm can improve the quality of the Pareto fronts evolved by state-of-the-art algorithms for feature selection.
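The nondominated sorting mentioned above rests on Pareto dominance over the two feature selection objectives, here phrased as minimizing the number of selected features and the classification error. A minimal sketch of extracting the first front, with hypothetical candidate subsets (the paper's actual sorting strategy is more elaborate):

```python
def dominates(a, b):
    """a dominates b if a is no worse than b in both objectives
    (both minimized) and the two points are not identical."""
    return a[0] <= b[0] and a[1] <= b[1] and a != b

def nondominated_front(points):
    """Return the first Pareto front of (n_features, error) pairs."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical candidates: (number of selected features, test error).
candidates = [(5, 0.12), (8, 0.10), (5, 0.15), (12, 0.10), (3, 0.20)]
front = nondominated_front(candidates)
```

Here (5, 0.15) falls out because (5, 0.12) uses the same number of features with lower error, and (12, 0.10) falls out because (8, 0.10) matches its error with fewer features.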
Article
Full-text available
Evolutionary algorithms possess strong problem-solving abilities and have been applied in a wide range of applications. However, they still suffer from a high computational burden and poor generalization ability. To overcome these limitations, numerous studies consider conducting knowledge extraction across distinct optimization task domains. Among these research strands, one representative branch is evolutionary multi-task optimization (EMTO), which aims to resolve multiple optimization tasks simultaneously. The implicit parallelism underlying evolutionary algorithms fits well with the EMTO framework, giving rise to a growing body of EMTO studies. This review presents a detailed exposition of research in the EMTO area. We reveal the core components for designing EMTO algorithms. Subsequently, we organize the works lying at the fusion of EMTO and traditional evolutionary algorithms. By analyzing the associations among diverse strategies in different branches of EMTO, this review uncovers research trends and potentially important directions, with additional interesting real-world applications mentioned.
Article
Full-text available
Neural networks are rapidly gaining popularity in chemical modeling and Quantitative Structure-Activity Relationship (QSAR) studies thanks to their ability to handle multitask problems. However, the outcomes of neural networks depend on the tuning of several hyperparameters, whose small variations can often strongly affect their performance. Hence, optimization is a fundamental step in training neural networks, although in many cases it can be very expensive from a computational point of view. In this study, we compared four of the most widely used approaches for tuning hyperparameters, namely grid search, random search, the tree-structured Parzen estimator, and genetic algorithms, on three multitask QSAR datasets. We mainly focused on parsimonious optimization and thus took into account not only the performance of the neural networks but also the computational time. Furthermore, since the optimization approaches do not directly provide information about the influence of hyperparameters, we applied experimental design strategies to determine their effects on neural network performance. We found that genetic algorithms, the tree-structured Parzen estimator, and random search require on average 0.08% of the hours required by grid search; in addition, the tree-structured Parzen estimator and genetic algorithms provide better results than random search.
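The budget gap between grid search and the sampling-based tuners reported above follows directly from how the two enumerate a hyperparameter space: grid search must evaluate every combination, while random search draws a fixed number of configurations. A minimal sketch with an assumed, illustrative search space (the parameter names and ranges are not from the study):

```python
import itertools
import random

# Assumed, illustrative hyperparameter space (not from the study).
space = {
    "learning_rate": [1e-4, 1e-3, 1e-2, 1e-1],
    "hidden_units": [32, 64, 128, 256],
    "dropout": [0.0, 0.25, 0.5],
    "batch_size": [16, 32, 64, 128],
}

# Grid search evaluates every combination: 4 * 4 * 3 * 4 = 192 runs.
grid_configs = list(itertools.product(*space.values()))

# Random search evaluates a fixed, much smaller sample of the same space.
rng = random.Random(0)
random_configs = [
    {k: rng.choice(v) for k, v in space.items()} for _ in range(20)
]
```

Even in this tiny space, random search at 20 trials spends about a tenth of the grid's budget; the gap widens combinatorially as parameters are added.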
Article
Feature selection (FS) has received significant attention since the use of a well-selected subset of features may achieve better classification performance than the full feature set in many real-world applications. It can be considered a multiobjective optimization problem consisting of two objectives: 1) minimizing the number of selected features and 2) maximizing classification performance. Ant colony optimization (ACO) has shown its effectiveness in FS due to its problem-guided search operator and flexible graph representation. However, there is a lack of effective ACO-based approaches for multiobjective FS that handle the problematic characteristics originating from feature interactions and highly discontinuous Pareto fronts. This article presents an information-theory-based nondominated sorting ACO (called INSA) to address the aforementioned difficulties. First, the probabilistic function in ACO is modified based on information theory to identify the importance of features; second, a new ACO strategy is designed to construct solutions; and third, a novel pheromone updating strategy is devised to ensure the high diversity of tradeoff solutions. INSA's performance is compared with four machine-learning-based methods, four representative single-objective evolutionary algorithms, and six state-of-the-art multiobjective ones on 13 benchmark classification datasets, which consist of both low- and high-dimensional samples. The empirical results verify that INSA is able to obtain solutions with better classification performance using features whose count is similar to or less than those obtained by its peers.
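The information-theoretic ingredient of such approaches can be illustrated with plain mutual information I(X; Y) between a discrete feature X and the class label Y, which a probabilistic construction rule could use as a heuristic weight for that feature. A minimal sketch (the function and data are illustrative, not INSA's implementation):

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """I(X; Y) in bits, estimated from paired samples of two
    discrete variables via their empirical distributions."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        p_xy = c / n
        # Each joint outcome contributes p(x,y) * log2(p(x,y) / (p(x)p(y))).
        mi += p_xy * math.log2(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi

# A binary feature that perfectly predicts a binary label carries 1 bit.
feature = [0, 0, 1, 1]
label = ["a", "a", "b", "b"]
mi = mutual_information(feature, label)
```

A feature independent of the label scores near zero under the same estimator, so ranking features by this quantity favors those that are individually informative.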
Article
Feature learning is a promising approach to image classification. However, it is difficult due to high image variations. When the training data are small, it becomes even more challenging, due to the risk of overfitting. Multitask feature learning has shown the potential for improving generalization. However, existing methods are not effective for handling the case that multiple tasks are partially conflicting. Therefore, for the first time, this article proposes to solve a multitask feature learning problem as a multiobjective optimization problem by developing a genetic programming approach with a new representation to image classification. In the new approach, all the tasks share the same solution space and each solution is evaluated on multiple tasks so that the objectives of all the tasks can be optimized simultaneously using a single population. To learn effective features, a new and compact program representation is developed to allow the new approach to evolving solutions shared across tasks. The new approach can automatically find a diverse set of nondominated solutions that achieve good tradeoffs between different tasks. To further reduce the risk of overfitting, an ensemble is created by selecting nondominated solutions to solve each image classification task. The results show that the new approach significantly outperforms a large number of benchmark methods on six problems consisting of 15 image classification datasets of varying difficulty. Further analysis shows that these new designs are effective for improving the performance. The detailed analysis clearly reveals the benefits of solving multitask feature learning as multiobjective optimization in improving the generalization.
Article
Evolutionary multitasking optimization (EMTO) is an emerging paradigm for solving several problems simultaneously. Due to its flexible framework, EMTO has been naturally applied to multi-objective optimization to exploit synergy among distinct multi-objective problem domains. However, most studies barely take into account the scenario where some problems cannot converge under restrictive computational budgets with the traditional EMTO framework. To dynamically allocate computational resources for multi-objective EMTO problems, this article proposes a generalized resource allocation (GRA) framework by considering both the theoretical grounds of conventional resource allocation and the characteristics of multi-objective optimization. In the proposed framework, a normalized attainment function is designed for better quantifying convergence status, a multi-step nonlinear regression is proposed to serve as a stable performance estimator, and the algorithmic procedure of conventional resource allocation is refined for flexibly adjusting resource allocation intensity and including knowledge transfer information. It has been verified that the GRA framework can enhance the overall performance of the multi-objective EMTO algorithm in solving benchmark problems, complex problems, many-task problems, and a real-world application problem. Notably, the proposed GRA framework served as a crucial component of the winner algorithm in the Competition on Evolutionary Multi-Task Optimization (Multi-objective Optimization Track) at the IEEE 2020 World Congress on Computational Intelligence.
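The flavor of convergence-aware allocation can be sketched with a much simpler stand-in for GRA's estimator: an ordinary least-squares slope over each task's recent indicator history (larger indicator = better), with generations shared in proportion to the improvement rate. All names and numbers here are illustrative assumptions, not the paper's normalized attainment function or multi-step nonlinear regression.

```python
def slope(history):
    """Ordinary least-squares slope of indicator values vs. generation
    index, used here as a crude rate-of-improvement estimator."""
    n = len(history)
    mx = (n - 1) / 2
    my = sum(history) / n
    num = sum((i - mx) * (v - my) for i, v in enumerate(history))
    den = sum((i - mx) ** 2 for i in range(n))
    return num / den

def allocate_generations(histories, total_gens):
    """Share total_gens among tasks in proportion to each task's
    estimated improvement rate; stagnating tasks get nothing extra."""
    rates = [max(slope(h), 0.0) for h in histories]
    total = sum(rates)
    if total == 0.0:
        return [total_gens // len(histories)] * len(histories)
    return [round(total_gens * r / total) for r in rates]

# Task 0 is still improving quickly; task 1 has nearly converged.
gens = allocate_generations([[0.1, 0.3, 0.5], [0.80, 0.81, 0.82]], 100)
```

The still-improving task receives the bulk of the budget, which is the intended behavior when some tasks would otherwise fail to converge under a tight overall budget.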
Article
For solving large-scale multi-objective problems (LSMOPs), transformation-based methods have shown promising search efficiency: they transform the original problem into a new simplified problem and perform the optimization in the simplified spaces instead of the original problem space. Owing to the useful information provided by the simplified search space, performance on LSMOPs has been improved to some extent. However, it is worth noting that the original problem changes after the transformation, so there is no guarantee that the original global or near-global optimum is preserved in the newly generated space. In this paper, we propose to solve LSMOPs via a multi-variation multifactorial evolutionary algorithm. In contrast to existing transformation-based methods, the proposed approach conducts an evolutionary search concurrently on both the original space of the LSMOP and multiple simplified spaces constructed in a multi-variation manner. In this way, useful traits found along the search can be seamlessly transferred from the simplified problem spaces to the original problem space toward efficient problem solving. Besides, since the evolutionary search is also performed in the original problem space, preservation of the original global optimal solution is guaranteed. To evaluate the performance of the proposed framework, comprehensive empirical studies are carried out on a set of LSMOPs with 2-3 objectives and 500-5000 variables. The experimental results highlight the efficiency and effectiveness of the proposed method compared to state-of-the-art methods for large-scale multi-objective optimization.
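The "simplified space" idea that transformation-based methods rely on can be illustrated with a fixed random linear embedding: search proceeds in a low-dimensional space while every candidate is expanded back into the original high-dimensional decision space for evaluation. This is a generic sketch under stated assumptions, not the paper's multi-variation construction; the dimensionalities and the Gaussian projection are illustrative.

```python
import random

random.seed(1)
D, d = 1000, 10  # original and simplified dimensionalities (assumed)

# A fixed random projection defines the simplified (embedded) space.
A = [[random.gauss(0.0, 1.0) for _ in range(d)] for _ in range(D)]

def expand(y):
    """Map a candidate from the d-dimensional simplified space back
    into the D-dimensional original decision space via x = A y."""
    return [sum(A[i][j] * y[j] for j in range(d)) for i in range(D)]

y = [0.1] * d   # a candidate living in the 10-dimensional search space
x = expand(y)   # its image in the original 1000-dimensional space
```

The caveat the abstract raises is visible here: the image of the simplified space is only a d-dimensional slice of the original space, so an optimum lying off that slice is unreachable unless search also runs in the original space.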
Article
This article is dedicated to automatically exploring efficient portrait parsing models that are easily deployed on edge computing or terminal devices. In the interest of the tradeoff between resource cost and performance, we design a multiobjective reinforcement learning (RL)-based neural architecture search (NAS) scheme, which comprehensively balances accuracy, parameters, FLOPs, and inference latency. Under varying hyperparameter configurations, the search procedure emits a set of excellent objective-oriented architectures. The combination of two-stage training with precomputed, memory-resident feature maps effectively reduces the time consumption of the RL-based NAS method, so that we complete approximately 1000 search iterations in two GPU days. To accelerate the convergence of the lightweight candidate architectures, we incorporate knowledge distillation into the training of the search process. This also provides a reasonable evaluation signal to the RL controller that enables it to converge well. In the end, we conduct full training of the outstanding Pareto-optimal architectures, obtaining a series of excellent portrait parsing models (with only approximately 0.3M parameters). Furthermore, we directly transfer the architectures searched on CelebAMask-HQ (portrait parsing) to other portrait and face segmentation tasks. Finally, we achieve state-of-the-art performance of 96.5% mIoU on EG1800 (portrait segmentation) and a 91.6% overall F1-score on HELEN (face labeling). That is, our models significantly surpass manually designed networks in accuracy, with lower resource consumption and higher real-time performance.