IEEE TRANSACTIONS ON CYBERNETICS, VOL. XX, NO. XX, XX 2022 1
Scaling Multiobjective Evolution to Large Data with
Minions: A Bayes-Informed Multitask Approach
Zefeng Chen, Abhishek Gupta, Lei Zhou, and Yew-Soon Ong
Abstract—In an era of pervasive digitalization, the growing volume and variety of data streams pose a new challenge to the efficient running of data-driven optimization algorithms. Targeting scalable multiobjective evolution under large-instance data, this paper proposes the general idea of using subsampled small-data tasks as helpful minions (i.e., auxiliary source tasks) to quickly optimize for large datasets, via an evolutionary multitasking framework. Within this framework, a novel computational resource allocation strategy is designed to enable effective utilization of the minions while guarding against harmful negative transfers. To this end, an inter-task empirical correlation measure is defined and approximated via Bayes' rule, which is then used to allocate resources online in proportion to the inferred degree of source-target correlation. In the experiments, the performance of the proposed algorithm is verified on (1) sample average approximations of benchmark multiobjective optimization problems under uncertainty, and (2) practical multiobjective hyper-parameter tuning of deep neural network models. The results show that the proposed algorithm can obtain up to about 73% speedup relative to existing approaches, demonstrating its ability to efficiently tackle real-world multiobjective optimization involving evaluations on large datasets.
Index Terms—Evolutionary multitasking, data-driven multiobjective optimization, Bayes resource allocation.
I. INTRODUCTION
Optimization problems are ubiquitous. Among them, multiobjective optimization problems (MOOPs) form an important subclass where multiple objective functions (usually conflicting) need to be simultaneously optimized to achieve a trade-off. An MOOP formulation is generally expressed as follows [1]:

min_{x ∈ Ω} F(x) = (f_1(x), ..., f_M(x))    (1)
Corresponding author: Lei Zhou.
This research is supported in part by the Data Science and Artificial Intelligence Research Center (DSAIR), School of Computer Science and Engineering at Nanyang Technological University (NTU), the A*STAR AI3 HTPO seed grant C211118016, the A*STAR Cyber-Physical Production System (CPPS) - Towards Contextual and Intelligent Response Research Program through the RIE2020 IAF-PP Grant A19C1a0018, and the National Natural Science Foundation of China under Grant 62206313.
Zefeng Chen is with the School of Artificial Intelligence, Sun Yat-sen University, China, and also with the School of Computer Science and Engineering, NTU, Singapore (e-mail: chenzef5@mail.sysu.edu.cn; zefeng.chen@ntu.edu.sg).
Lei Zhou is with the School of Computer Science and Engineering, NTU, Singapore (e-mail: lei.zhou@ntu.edu.sg).
Abhishek Gupta is with the Singapore Institute of Manufacturing Technology (SIMTech), Agency for Science, Technology and Research (A*STAR), and the School of Computer Science and Engineering, NTU (e-mail: abhishek_gupta@simtech.a-star.edu.sg; abhishekg@ntu.edu.sg).
Yew-Soon Ong is with the Data Science and Artificial Intelligence Research Centre, School of Computer Science and Engineering, NTU, and also the Chief Artificial Intelligence Scientist of A*STAR Singapore (e-mail: asysong@ntu.edu.sg; ongyewsoon@hq.a-star.edu.sg).
where x is an n-dimensional decision vector and f_i(x) (i = 1, ..., M) denotes the i-th objective function. Ω ⊆ R^n is the decision space (also known as the search space), and its image set S = {F(x) | x ∈ Ω} is called the objective space.
Particularly, in the real world, there exist MOOPs whose objective functions call for computations to be carried out on available data. Examples include multiobjective optimization under uncertainty (i.e., MOOPs involving uncertain parameters whose possible realizations are contained in a dataset) [2]–[4], machine learning use-cases such as multiobjective feature selection [5]–[8], multiobjective AutoML [9], [10], hyper-parameter optimization of multi-task learning models [11], [12], multiobjective neural architecture search [13]–[15], and multi-task feature learning [16], to name just a few. These types of problems can be regarded as data-driven MOOPs, and are formalized as follows:
min_{x ∈ Ω} F(x; D) = (f_1(x; D), ..., f_M(x; D))    (2)

where f_i(x; D) represents the i-th objective function evaluated with dataset D. In general, when D contains a large number of data instances, the computations of f_i(x; D) become expensive. This phenomenon is increasingly common nowadays. The sheer volume and variety of data has increased enormously across many fields, thus posing a significant challenge to the efficient running of data-driven optimization algorithms [17]. In this paper, we specifically focus on MOOPs involving computations with large-instance data (denoted as D_L). Real-world applications of such problems span diverse domains, from decision-support under uncertainty to big data machine learning, as previously listed.
When solving an MOOP, one main difficulty lies in that an
improvement in one objective function is usually accompanied
by performance deterioration in another objective. Thus, a
single optimal solution that can simultaneously optimize all
objectives may not always exist. Instead, the best trade-off
solutions, called the Pareto optimal solutions, are important to
a decision maker. For ease of explaining the Pareto optimality concept, we first present the definition of Pareto dominance tailored for a data-driven MOOP with dataset D¹:

Definition 1. Given two solutions x, y ∈ Ω along with their corresponding objective vectors F(x; D), F(y; D) ∈ R^M, x is said to Pareto dominate y (written as x ≺ y) if and only if (1) for all i ∈ {1, ..., M}, f_i(x; D) ≤ f_i(y; D), and (2) there exists some j ∈ {1, ..., M} such that f_j(x; D) < f_j(y; D).

¹When comparing the Pareto dominance relationship between any two solutions, we assume that their objective function values are computed using the same dataset D.
According to Pareto dominance, if a solution x ∈ Ω is not dominated by any other solution in the decision space Ω, then we call x a Pareto optimal solution. The union of all Pareto optimal solutions is called the Pareto set (PS for short): PS = {x ∈ Ω | ∄ y ∈ Ω s.t. y ≺ x}. The image of the PS in the objective space is called the Pareto front (PF for short).
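The dominance relation of Definition 1 and the resulting non-dominated filtering can be sketched in a few lines. The following is a minimal illustration with our own helper names (not from the paper), assuming minimization and precomputed objective vectors:

```python
# Illustrative sketch of Definition 1 (minimization assumed).
from typing import List, Sequence

def dominates(fx: Sequence[float], fy: Sequence[float]) -> bool:
    """True if objective vector fx Pareto dominates fy: fx is no worse
    in every objective and strictly better in at least one."""
    no_worse = all(a <= b for a, b in zip(fx, fy))
    strictly_better = any(a < b for a, b in zip(fx, fy))
    return no_worse and strictly_better

def non_dominated(objs: List[Sequence[float]]) -> List[int]:
    """Indices of solutions not dominated by any other solution; when
    objs covers the whole search space, this is the Pareto set."""
    return [i for i, fx in enumerate(objs)
            if not any(dominates(fy, fx) for j, fy in enumerate(objs) if j != i)]

objs = [(1.0, 4.0), (2.0, 2.0), (3.0, 3.0), (4.0, 1.0)]
print(non_dominated(objs))  # (3.0, 3.0) is dominated by (2.0, 2.0)
```

Note the quadratic cost of the naive filter; practical MOEAs such as NSGA-II use faster non-dominated sorting, but the comparator is the same.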
In recent decades, evolutionary algorithms (EAs) have demonstrated prowess in solving various kinds of MOOPs by virtue of their implicit parallelism, which allows multiple solutions, characterizing the PS of an MOOP [18], to be obtained in a single run. A variety of multiobjective EAs (MOEAs) have thus been proposed over the years, such as the classical NSGA-II [19], [20], SPEA2 [21] and MOEA/D [22]. Despite their popularity, one major criticism of MOEAs stems from the fact that they usually require a large number of function evaluations to find reasonable approximations to the PS. This vulnerability becomes more apparent when faced with evaluations on large-instance datasets, as computational costs scale deleteriously with the amount of data.
To tackle the aforementioned tractability issue brought about by large-instance datasets, two broad categories of approaches have been considered in the literature, namely, hardware-driven approaches (such as distributed computing [23]–[26], parallel computing [27], [28] and hardware acceleration techniques [29]) and algorithm-centric approaches (also known as software solutions) [30], [31]. On the one hand, hardware-driven approaches are heavily reliant on the amount of available computing resources. On the other hand, algorithmic solutions where fast evaluations are carried out using small subsets of the full data [30], [31] run the risk of misdirecting the evolutionary search. This is because evaluations on small data subsets are not guaranteed to be representative of a solution's true performance. In this paper, we propose a novel strategy to address the shortcoming of the latter category, so as not to fall back on the need for extensive hardware. In particular, we resort to the core idea of evolutionary multitasking (EMT), in which the full dataset together with its smaller subsets can be jointly deployed as synergistically evolving tasks in a single optimization run.
EMT is an emerging search paradigm originally proposed for the simultaneous solving of multiple optimization problems [32]–[34]. It offers a new avenue to further exploit the implicit parallelism of population-based search, taking advantage of latent synergies between distinct tasks through information transfers to accelerate convergence rates in tandem. To date, a variety of EMT algorithms have been proposed in the literature [35], demonstrating potential for wide-ranging real-world applicability [36]. In the specific case of multiobjective evolution with large-instance data, we note that a series of auxiliary small-data tasks can in fact be generated by subsampling the full dataset. It is intuitively expected that at least some of the generated tasks could then share a locally or globally similar fitness landscape with the target task at hand, while being relatively inexpensive to evaluate. It is thus theorized that harnessing such correlated tasks, which we think of as helpful minions, in an EMT framework would assist the target task in quickly converging towards good solutions. What is more, by adaptively allocating greater computational resources to effective minions, a significant boost in overall convergence trends could be achieved.
To sum up, the main contributions of this paper are threefold:
1) A new EMT framework for jointly accommodating large- and small-data tasks is developed for MOEAs to efficiently scale to data-driven MOOPs.
2) An online inter-task empirical correlation measure is proposed for MOOPs and is efficiently approximated by Bayes' rule. The estimate of the empirical correlation is used to adaptively reward more computational resources to inexpensive small-data tasks when they demonstrate beneficial transfers to the target.
3) The performance of the proposed algorithm is verified on MOOPs under uncertainty and the multiobjective hyper-parameter tuning of deep neural network models. The experimental results on diverse benchmark problems and datasets confirm the efficacy of the proposed algorithm with up to 73% speedup.
The remainder of the paper is organized as follows. Section
II presents related work in the literature, while Section III
introduces the preliminaries. Section IV designs, develops and
analyzes the Bayes resource allocation strategy within our
proposed adaptive EMT framework. The experimental results
are provided in Section V. Finally, Section VI concludes this
paper and gives some research directions for future studies.
II. RELATED WORK
A. Multiobjective Evolution under Large-instance Data
In the literature, there are an increasing number of studies
dedicated to tackling the scalability issue faced by evolutionary
computation (EC) techniques for multiobjective optimization
under large-instance data. Many of these studies have however focused on hardware-driven approaches, including distributed computing (i.e., using the MapReduce paradigm), parallel computing and hardware acceleration techniques. For instance, Ferranti et al. [23], [24] and Barsacchi et al. [25] proposed distributed MOEAs based on Apache Spark to generate fuzzy rule-based classifiers. Likewise, the multiobjective evolutionary fuzzy algorithm proposed in [26] for tackling the subgroup discovery task is based on MapReduce. Utilizing a parallel computing environment, Golchin and Liew proposed parallel bi-cluster detection based on the strength Pareto evolutionary algorithm (PBD-SPEA) to conduct bi-clustering [27]. Recently, Karagoz et al. proposed a parallel variant of NSGA-II to address the multiobjective multi-label feature selection problem for the classification of video data [28]. While these representative approaches are able to achieve the goal of efficiency promotion, they do so only by relying heavily on different advanced computing infrastructures.
Different from hardware-driven approaches, there are a relatively smaller number of algorithm-centric techniques that attempt to achieve better efficiency without necessitating parallel/distributed infrastructures. For example, Garcia-Piquer et al. [30] proposed the CAOS evolutionary algorithm for multiobjective clustering, in which the original dataset is divided into several subsets that are alternately used in each
generation of an MOEA. In a subsequent work [31], they
further investigated the performance of three subset building
strategies on large clustering datasets. It is noted that only a
small subset of the full dataset is used in each generation.
This may give rise to a potential risk of misdirecting the
evolutionary search (due to negative transfer of information
from one generation to the next), unless a suitable subset
accurately representative of the full dataset is found. On the
other hand, our method in this paper leverages the EMT
paradigm to simultaneously accommodate the full dataset as
well as its smaller subsets as distinct tasks in each generation
of a single optimization run, hence enabling the learning
of inter-task relationships to control and curb the risk of
misdirection.
Notably, as multiobjective optimization under large-instance
data often involves time-consuming function evaluations, it
can be considered as a type of expensive global optimization
(EGO) problem [17], [37]. In this regard, surrogate-assisted
EC is also a viable technique worth considering [38], [39].
Various types of computationally efficient surrogate models,
such as Gaussian Processes (GP, also known as Kriging model)
[40]–[42], radial basis function networks (RBFN) [43]–[45] or
polynomial regression [46], have been employed to replace a
portion of the original expensive function evaluations in the
evolutionary process. Among existing surrogates, the probabilistic GP is perhaps the most commonly used, including in the classical ParEGO [40] and MOEA/D-EGO [41], due to its ability to capture predictive uncertainties in a principled manner. For instance, in the rapidly growing area of AutoML, GP-based Bayesian Optimization (BO) has become a prominent approach for automatically tuning hyper-parameters of machine learning models exposed to large datasets [11], [47].
Despite recent successes, there are however limitations to the
widespread use of BO algorithms. First, standard BO does
not readily extend to arbitrary solution representations due
to issues of kernel indefiniteness [48]. Further, it is hard to
accumulate enough data to build informative surrogates in
even moderately high-dimensional decision spaces, leading
to the notorious cold start problem [49]. Given the above,
and given the flexibility of EAs in coping with arbitrary
solution representations, this paper focuses on scaling purely
evolutionary multiobjective optimization approaches to
large datasets, from an algorithm-centric perspective (by
means of a novel EMT trick). In future works, further
augmenting the efficacy of evolution with surrogate-assistance
shall be a key research direction.
B. Constructing Auxiliary Tasks in EMT
A number of researchers have looked at utilizing the EMT
paradigm in a manner that one or more auxiliary tasks (helper
tasks) are artificially constructed and assimilated to assist the
solving of the original task at hand [50], [51]. For example, in [52], with the aim of solving difficult single-objective optimization tasks, Ma et al. utilized a technique called multiobjectivization via decomposition to generate helper tasks, each of which is a multiobjectivization of the original single-objective optimization task. Feng et al. [53] constructed multiple auxiliary tasks with simplified search spaces, and used them to promote the solving of a large-scale multiobjective optimization problem. Similarly, in order to solve a high-dimensional feature selection task, [54] and [55] designed several low-dimensional versions of the original task to act as auxiliary tasks. In one of our previous works in evolutionary machine learning [6], we artificially generated a number of static auxiliary tasks based on small subsampled portions of a big training dataset.
In this paper, tailored for scalable data-driven multiobjective
optimization, we also propose to construct small-data auxiliary
tasks through a data subsampling approach. However, unlike
in [6], the auxiliary tasks constructed shall be dynamically
changing (via resampling) during the evolutionary search
process.
C. Online Resource Allocation in Multiobjective EMT
To date, there has been relatively little research on online resource allocation in EMT for MOOPs. The MFEA/D-DRA algorithm proposed by Yao et al. [56] adopts a dynamic resource allocation strategy in which the computational resources are allocated according to the evolution rate of single-objective subproblems (decomposed from each MOOP) in each generation. Although this strategy achieves efficiency enhancements, it is not flexible since it can only be applied to MOEA/D-based EMT algorithms.
In contrast, [57] proposed a generalized resource allocation (GRA for short) framework that can be applied to any kind of EMT algorithm. The GRA is built on an attainment function performance metric and multi-step nonlinear regression, demonstrating the ability to enhance multiobjective EMT algorithms. However, the attainment function used in GRA has high computational complexity, and K multi-step nonlinear regression models (K is the number of tasks) have to be solved before the resources allocated to distinct tasks are determined.
In this paper, we wish to design a novel online resource
allocation strategy that is flexible and efficient. It shall be
applicable to any kind of multiobjective EMT algorithm in an
effective manner, enabling multiobjective optimization where
evaluations are to be carried out on large-instance data.
III. PRELIMINARIES
In this section, we experimentally showcase how the dataset
size could affect the performance of multiobjective evolution,
and then illustrate the general idea of using small datasets
(as helpful minions) to accelerate multiobjective optimization
under large-instance data.
A. Effect of Dataset Size on Multiobjective Evolution
For data-driven MOOPs, the size of the dataset has a critical
impact on the optimization performance.
Here, we consider one typical example: multiobjective optimization under uncertainty, where a finite but large number of data samples representative of the uncertain environment are either drawn from a known probability distribution, or are historically observed [58], [59]. To tackle the uncertainty, a
commonly used method is the sample average approximation (SAA) model [2]. In terms of MOOPs, the mathematical formulation of a target SAA model is given as follows [60]:

min_{x ∈ Ω} F̂_N(x) = (1/N) Σ_{j=1}^{N} F(x; ξ_j)    (3)

where D = {ξ_1, ..., ξ_N} is an i.i.d. set of representative uncertain scenarios. According to the law of large numbers, as N → ∞, F̂_N(x) converges to E[F(x; ξ)] under some regularity conditions [60]. Thus, the SAA model usually requires a large sample size (i.e., a large uncertainty set D_L), so that a more accurate approximation of the expected objective value can be achieved. However, the resultant computational cost would become higher. In contrast, a small uncertainty set D_S shall incur low computational cost, but may in turn introduce high approximation errors.
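The cost/accuracy trade-off of Eq. (3) can be sketched with a toy scalar objective. The helper names and the objective F(x; ξ) = (x + ξ)² below are ours, chosen only to mirror the paper's setting of Gaussian noise added to a decision variable:

```python
# Minimal sketch of the SAA idea in Eq. (3): the objective is an
# average over scenarios, so a large uncertainty set D_L is accurate
# but costly, while a small subsample D_S is cheap but noisier.
import random

def saa_objective(x, scenarios):
    """F_hat_N(x) = (1/N) * sum_j F(x; xi_j), with the toy objective
    F(x; xi) = (x + xi)^2, i.e. noise imposed on the decision variable."""
    return sum((x + xi) ** 2 for xi in scenarios) / len(scenarios)

random.seed(0)
D_L = [random.gauss(0.0, 1.0) for _ in range(1000)]  # large uncertainty set
D_S = random.sample(D_L, 10)                          # small subsampled set

x = 0.5
print(saa_objective(x, D_L))  # close to E[(x + xi)^2] = x^2 + 1
print(saa_objective(x, D_S))  # 100x cheaper per evaluation, but noisier
```

Every evaluation touches all N scenarios, which is exactly why large-instance data makes each fitness evaluation expensive.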
To demonstrate the above claims, we conduct a toy experiment using NSGA-II to solve the well-known 3-objective benchmarks DTLZ5 and DTLZ1 [61], but corrupted with additive Gaussian noise. To be specific, we add independent Gaussian noise to each decision variable of DTLZ5/DTLZ1 to simulate simple scenarios with decision variable uncertainty. We denote these two resultant problems as DTLZ5D and DTLZ1D², respectively. For evaluating the objective functions of each problem during the running of NSGA-II, two types of uncertainty sets are used. One is a relatively large dataset with 1,000 samples drawn from the Gaussian distribution N(0, 1), and the other is a small dataset consisting of only 10 samples extracted uniformly at random from the large dataset. The obtained Hypervolume (HV for short) results³ are displayed in Fig. 1. As can be seen, on DTLZ5D, NSGA-II with small data converges significantly faster than that with large data. This implies that using a small-instance dataset is sufficient for the optimization of this particular problem. As for DTLZ1D on the other hand, NSGA-II with small data progresses quickly at first, but is found to stagnate in a sub-optimal region in the later stage. This shows that NSGA-II with small data cannot reliably obtain good quality solutions. A large-instance dataset is needed for solving DTLZ1D effectively.
B. Using Small-data Tasks to Quickly Evolve Solutions for
Large-instance Data
Let us consider two tasks: one comprises a large-instance dataset ("large-data task") and the other comprises a small-instance dataset uniformly subsampled from the large dataset ("small-data task"). The objective function evaluation of the small-data task would be computationally cheaper than that of the target large-data task. Since the small dataset is a uniform subset of the large dataset, the small-data task can be expected (albeit not guaranteed) to share a degree of similarity in the underlying data distribution and resultant fitness landscape as
²The last "D" represents that the noise is imposed in the decision space.
³Reported HV results are based on performance evaluations on an out-of-sample validation dataset consisting of 10,000 samples, used to reevaluate all solutions obtained in each generation.
Fig. 1. Convergence curves of NSGA-II with large data and small data on two example problems. (a) On DTLZ5D, NSGA-II with small data converges significantly faster than that with large data, implying that a small-instance dataset is sufficient for the optimization. (b) On DTLZ1D, NSGA-II with small data converges fast at first, but gets stuck in the later stage. This indicates that a large-instance dataset is needed for solving DTLZ1D effectively.
the target. Hence, it may be reasonable to use the small-
data task to discover useful solutions for the computationally-
expensive large-data task at a reduced cost. That is, the small-
data task may serve as an auxiliary source task to accelerate
the optimization of the target. Taking this cue, we propose
to jointly deploy both large- and small-data tasks in a single
EMT run. Within the EMT framework, the small-data tasks
are seen as helpful minions, i.e., as computationally-cheaper
auxiliary tasks, to assist the target task in the search for optimal solutions. All tasks thus progress in a synergistic and intertwined manner, transferring useful information (solution prototypes) when available.
Notice in Fig. 1b that the optimization of the small-data
task is trapped in an inferior region in the later stages, which
indicates that it ceases to be helpful to the target thereafter.
In this case, assigning evaluation budget to the small-data
task would imply a waste of computational effort in terms
of progressing the target search. Thus, unlike the majority
of existing EMT algorithms that equally weight all tasks, we
propose to adaptively adjust computational resource allocation
to tasks according to their performance. Concretely, if the
small-data task is assessed to provide beneficial transfers to the
target, we should reward more resources to it so as to quickly
progress the target search at a much lower cost. Otherwise, the
resources allocated to the small-data task should be reduced
to help prevent wastage of computational effort.
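The subsampling that creates such a minion task can be sketched as follows. The helper names are ours; uniform sampling without replacement is the default, and a stratified variant (which the paper prescribes for labeled datasets, in a later footnote) keeps per-class proportions:

```python
# Sketch of constructing a small-data "minion" task by subsampling
# the large dataset. Names and the proportional-allocation rule in the
# stratified variant are our own illustrative choices.
import random
from collections import defaultdict

def uniform_subsample(data, n_small, rng=random):
    """Uniform random sampling without replacement from the large dataset."""
    return rng.sample(data, n_small)

def stratified_subsample(labeled_data, n_small, rng=random):
    """labeled_data: list of (instance, label) pairs. Samples without
    replacement, proportionally from each class (at least one per class)."""
    by_class = defaultdict(list)
    for item, label in labeled_data:
        by_class[label].append((item, label))
    out = []
    for label, items in by_class.items():
        k = max(1, round(n_small * len(items) / len(labeled_data)))
        out.extend(rng.sample(items, min(k, len(items))))
    return out

random.seed(1)
data = list(range(1000))
D_S = uniform_subsample(data, 10)                # cheap minion dataset
labeled = [(i, i % 2) for i in range(100)]       # two balanced classes
strat = stratified_subsample(labeled, 10)        # 5 per class here
```

Resampling D_S periodically, as the proposed framework does, amounts to calling these helpers afresh at every transfer phase.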
IV. PROPOSED ALGORITHM
This section first illustrates our proposed EMT framework. Next, we introduce the definition of an online inter-task empirical correlation measure that is used for adaptive resource allocation. Finally, we present an approach to efficiently approximate the empirical correlation measure using Bayes' rule.
A. Overview of the EMT Framework
The pseudo code of the overall adaptive EMT framework is shown in Algorithm 1. It is worth noting that the proposed framework can be adapted to any existing MOEA by configuring the reproduction operators and mating/environmental selection schemes used.
Let the size of the whole population (including the large- and small-data populations) for the proposed framework be psize. Let the large-instance data be denoted as D_L. N_small number
of instances are randomly sampled without replacement from D_L to form the small-instance dataset D_S⁴. Next, two populations P_L(t=0) (of size psize_L(0)) and P_S(t=0) (of size psize_S(0)) are randomly initialized for the target large-data task T_L and the small-data source task T_S, respectively. The solutions in P_L(t) and P_S(t) are evaluated with D_L and D_S, respectively. P_L(t) and P_S(t) evolve separately via traditional evolutionary operators (Lines 16-18), until the transfer phase is triggered every Δt generations. The whole process is repeated until a predefined stopping criterion is satisfied.
Note that Δt denotes not only the transfer interval, but also the interval for conducting online computational resource allocation between tasks T_L and T_S. During the transfer phase (Lines 6-8), explicit information transfer is conducted across tasks. Specifically, P_L(t) transfers min{num, psize_L(t)} subsampled solutions (reevaluated with D_S) to P_S(t), while P_S(t) transfers min{num, psize_S(t)} subsampled solutions (reevaluated with D_L) to P_L(t). The transferred solutions merge with the existing task-specific population and undergo the process of environmental selection. As stated in Section III-B, the small-data task can help discover good solutions for the large-data task at a significantly lower computational cost. Thus, the useful information transferred from T_S to T_L could accelerate the convergence of the target task to its PS.
After the information transfer, a procedure of resource allocation is conducted to adjust the population sizes of T_S and T_L in an online manner (Lines 9-11). Through adaptively allocating available resources to different tasks, we can not only stimulate faster convergence trends, but also reduce the risks of harmful negative transfer when small-data tasks are not representative of the target.
Importantly, the small-data task T_S in our proposed adaptive EMT framework is dynamic. That is, the small-instance dataset D_S used in T_S is periodically updated by resampling N_small instances from D_L, and all the solutions in P_S(t) are reevaluated with the new D_S (Lines 9-10). Although the generation of a new small-data task introduces extra reevaluation cost, the dynamic property is of significance. Specifically, through these successive random resampling operations, a series of auxiliary small-data tasks can be generated on-the-fly. These randomly generated tasks may share various levels of correlation with the target, since each of the small-instance datasets is a random subset of the large dataset. As such, this increases the chance of having generated some task that is of high relevance for the target search. For the reevaluated small-data population P_S(t) and the large-data population P_L(t), we employ a novel computational resource allocation strategy based on Bayes' rule (i.e., Algorithm 2) to adjust the population sizes of T_S and T_L (Line 11). As all the solutions in the current P_S(t) have been reevaluated with the new D_S, it is reasonable to use the resource allocation strategy conducted on the current P_S(t) and P_L(t) to infer the amount of computational resources made available to the small-data task in the following Δt generations. That is, when the current
⁴For the dataset where each data instance has a class label (such as the datasets used in our experiments on multiobjective hyper-parameter tuning of neural network models), stratified random sampling without replacement is performed to sample N_small instances from D_L to form D_S.
Algorithm 1 Pseudocode of the Adaptive EMT Framework
Input: psize: size of whole population; T_L and T_S: large-data task and small-data task; D_L: large-instance dataset; N_small: size of small-instance dataset; psize_L(0) and psize_S(0): initial population sizes of T_L and T_S; num: number of transferred solutions; Δt: transfer interval;
Output: Non-dominated solutions of T_L;
1: Sample N_small instances from D_L to form a small-instance dataset D_S, and set t = 0;
2: Initialize the population P_L(t) of T_L and the population P_S(t) of T_S, respectively;
3: Evaluate P_L(t) and P_S(t) with D_L and D_S, respectively;
4: while termination criterion is not fulfilled do
5:   if mod(t + 1, Δt) == 0 then
6:     Transfer min{num, psize_L(t)} subsampled solutions from P_L(t) to P_S(t), and reevaluate with D_S;
7:     Transfer min{num, psize_S(t)} subsampled solutions from P_S(t) to P_L(t), and reevaluate with D_L;
8:     Perform environmental selection on P_L(t) and P_S(t) to maintain population sizes of psize_L(t) and psize_S(t), respectively;
9:     Sample N_small instances from D_L to form a new small-instance dataset D_S;
10:    Reevaluate the solutions in P_S(t) with the new D_S;
11:    Conduct the Bayes resource allocation strategy (i.e., Algorithm 2) to obtain psize_L(t+1) and psize_S(t+1);
12:   else
13:     psize_L(t+1) = psize_L(t);
14:     psize_S(t+1) = psize_S(t);
15:   end if
16:   Perform mating selection & reproduction on P_L(t) and P_S(t) to generate the offspring populations O_L(t) (of size psize_L(t+1)) and O_S(t) (of size psize_S(t+1)), respectively;
17:   Evaluate O_L(t) and O_S(t) with D_L and D_S, respectively;
18:   Perform environmental selection on P_L(t) ∪ O_L(t) and P_S(t) ∪ O_S(t) to construct P_L(t+1) (of size psize_L(t+1)) and P_S(t+1) (of size psize_S(t+1)), respectively;
19:   Set t = t + 1;
20: end while
21: Reevaluate the solutions in P_S(t) with D_L;
22: Output solutions in P(t) = P_L(t) ∪ P_S(t) that are non-dominated on T_L.
population P_S(t) is assessed to be effective in providing good solutions for transfer, more resources would be awarded to the small-data task (as it will keep using the same D_S in the following Δt generations). Otherwise, more resources would be allocated to the target until a small-data task that produces beneficial transfers is newly generated.
In the following subsections, the basis and details of the online computational resource allocation mechanism are elaborated. For ease of description, we summarize some important notations and their meanings in Table I.
TABLE I
NOTATIONS USED IN THE DESCRIPTION OF THE BAYES RESOURCE ALLOCATION STRATEGY AND THEIR MEANINGS.

| Notation | Meaning |
| $\preceq_L$ | Symbol of Pareto dominance for T_L (i.e., using the objective functions of T_L to compare different solutions). |
| $\preceq_S$ | Symbol of Pareto dominance for T_S (i.e., using the objective functions of T_S to compare different solutions). |
| $\overline{NS}_L(t)$ | The solutions in $P(t) = P_L(t) \cup P_S(t)$ that are non-dominated on T_L: $\overline{NS}_L(t) = \{x \in P(t) \mid \nexists y \in P(t) \text{ s.t. } y \preceq_L x\}$. |
| $\overline{NS}_S(t)$ | The solutions in P(t) that are non-dominated on T_S: $\overline{NS}_S(t) = \{x \in P(t) \mid \nexists y \in P(t) \text{ s.t. } y \preceq_S x\}$. |
| $\overline{DS}_S(t)$ | The solutions in P(t) that are dominated on T_S: $|\overline{DS}_S(t)| = |P(t)| - |\overline{NS}_S(t)|$. |
| $NS_L(t)$ | The solutions in P_L(t) that are non-dominated on T_L: $NS_L(t) = \{x \in P_L(t) \mid \nexists y \in P_L(t) \text{ s.t. } y \preceq_L x\}$. |
| $NS_S(t)$ | The solutions in P_S(t) that are non-dominated on T_S: $NS_S(t) = \{x \in P_S(t) \mid \nexists y \in P_S(t) \text{ s.t. } y \preceq_S x\}$. |
| $DS_S(t)$ | The solutions in P_S(t) that are dominated on T_S: $|DS_S(t)| = |P_S(t)| - |NS_S(t)|$. |
| $NNS(t)$ | The solutions in $NS_L(t)$ that are also non-dominated on T_S: $NNS(t) = \{x \in NS_L(t) \mid \nexists y \in P(t) \text{ s.t. } y \preceq_S x\}$. |

Note: For the last seven notations, the ones with an overline pertain to the joint population P(t), while the ones without an overline pertain to P_L(t) or P_S(t).
B. Inter-task Empirical Correlation of MOOPs

As indicated in Line 11 of Algorithm 1, we wish to dynamically adjust the amount of resources (in terms of population size⁵) made available to the large- and small-data tasks. To this end, we first consider the following question: how strongly is the small-data task T_S correlated with the target task T_L at a given transfer phase?
In the t-th generation, there are two populations, namely P_S(t) and P_L(t), for T_S and T_L, respectively. Note that in multiobjective evolution, the non-dominated solutions in a population are usually preferred over the dominated ones. The solutions in the joint population P(t) = P_L(t) ∪ P_S(t) that are non-dominated on T_L (denoted as $\overline{NS}_L(t)$) are considered most beneficial for the future optimization of the target task. These solutions are contributed by either the small-data population P_S(t) or the large-data population P_L(t), since the joint population P(t) is composed of P_S(t) and P_L(t). We propose that the proportion of non-dominated solutions contributed by P_S(t) reflects the degree of positive correlation of the small-data task T_S to the target. With this in mind,
we define an online inter-task empirical correlation measure, which is mathematically expressed as follows:

$$\mathrm{Corr}(t) = \frac{|\overline{NS}_L(t) \cap P_S(t)|}{|\overline{NS}_L(t)|} = \frac{|\{x \in P_S(t) \mid \nexists y \in P(t) \text{ s.t. } y \preceq_L x\}|}{|\{x \in P(t) \mid \nexists y \in P(t) \text{ s.t. } y \preceq_L x\}|}. \qquad (4)$$

⁵ Here, we use the population size to control the amount of resources allocated to the large- and small-data tasks. As the whole population size psize is fixed, if the size of the large-data population psize_L(t) increases, then the size of the small-data population psize_S(t) = psize − psize_L(t) decreases, and vice versa. Note that adjusting the sizes of the large- and small-data populations determines how much of the limited computational resources is allocated to each task based on its observed potential, but without any guarantee that the task with the larger population size will deliver better performance.
If the contribution of P_S(t) (i.e., the value of the numerator in Eq. (4)) is large, this suggests that the auxiliary source task T_S produces beneficial transfers to the target task in the t-th generation; hence, more resources could be allocated to T_S to further enhance search efficiency. Otherwise, the resources of T_S are reduced to alleviate its negative impact. We can thus allocate computational resources online in proportion to the degree of source-target correlation defined in Eq. (4).
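For concreteness, the exact form of Eq. (4) can be computed as below, given the objective vectors of the joint population evaluated on T_L (an illustrative sketch with our own naming; minimization is assumed):

```python
def dominates(a, b):
    """True if objective vector a Pareto-dominates b (minimization)."""
    return all(ai <= bi for ai, bi in zip(a, b)) and any(ai < bi for ai, bi in zip(a, b))

def empirical_correlation(f_large, from_small):
    """Exact Corr(t) of Eq. (4).

    f_large    : objective vectors of the joint population P(t) evaluated on T_L
    from_small : parallel flags, True if the solution came from P_S(t)
    """
    nd = [i for i, fi in enumerate(f_large)
          if not any(dominates(fj, fi) for j, fj in enumerate(f_large) if j != i)]
    return sum(from_small[i] for i in nd) / len(nd)
```

The catch, discussed next, is that obtaining f_large for the solutions of P_S(t) requires reevaluating them on the large-data task.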
However, exact computation of Eq. (4) would incur extra evaluations on the large-data task, since all solutions in P_S(t) would need to be reevaluated on T_L. This would impose a heavy overhead merely for the sake of resource allocation. Thus, with the aim of maintaining computational tractability, we design a new Bayes resource allocation strategy that efficiently approximates the empirical correlation. The specific details are presented in the next subsection.
C. Bayes Resource Allocation Strategy

Recall that in our proposed EMT framework, the small-instance dataset is a uniform subset of the large-instance dataset, suggesting that the resultant MOOPs may share locally or globally similar fitness landscapes. We thus make the simplifying assumption stated below, which (as shall be shown) facilitates fast, online approximation of Corr(t) by avoiding extra evaluations on the large dataset.

Assumption 1. The population of the large-data task, the population of the small-data task, and their union share similar underlying probability distributions during the EMT run.
In addition, we highlight the following useful property that will also be utilized in our derivation.

Property 1. Consider the additive form of data-driven MOOPs expressed in Eq. (3). Since the small-instance dataset D_S is a subset of the large-instance dataset D_L, the solutions evaluated on D_L are automatically evaluated on D_S at no extra cost. That is, for all solutions in P_L(t), their evaluation scores on T_S are available.⁶
In Eq. (4), there are two key terms: the denominator $|\overline{NS}_L(t)|$ and the numerator $|\overline{NS}_L(t) \cap P_S(t)|$. To approximate $|\overline{NS}_L(t)|$, we consider the probability that a solution in P(t) is non-dominated on T_L, denoted as $Pr(x \in \overline{NS}_L(t) \mid x \in P(t))$. Then, the expected value of $|\overline{NS}_L(t)|$ can be written as:

$$E[|\overline{NS}_L(t)|] = Pr(x \in \overline{NS}_L(t) \mid x \in P(t)) \cdot |P(t)|, \qquad (5)$$

where the notation E[·] symbolizes statistical expectation.
⁶ Property 1 is not directly applicable to MOOPs of the type in Eq. (15). Hence, for automated machine learning model configuration problems, evaluation scores of P_L(t) on T_S are predicted (for the purpose of Bayes resource allocation) via fast KNN regressions.
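A fast KNN regression of the kind mentioned in this footnote can be sketched as follows (our minimal stand-in, not the authors' implementation; the choice of k and the Euclidean distance are assumptions):

```python
import math

def knn_predict(x_train, y_train, x_query, k=3):
    """Predict a solution's T_S objective scores as the mean over its k nearest
    already-evaluated neighbours (Euclidean distance in decision space)."""
    nearest = sorted(range(len(x_train)),
                     key=lambda i: math.dist(x_query, x_train[i]))[:k]
    n_obj = len(y_train[0])
    return [sum(y_train[i][j] for i in nearest) / k for j in range(n_obj)]
```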
As for $|\overline{NS}_L(t) \cap P_S(t)|$, the candidates in the intersection set $\overline{NS}_L(t) \cap P_S(t)$ may come from solutions in P_S(t) that are either currently non-dominated or even dominated with respect to T_S. Invoking Assumption 1, the expected value of $|\overline{NS}_L(t) \cap P_S(t)|$ can be weakly approximated as:

$$E[|\overline{NS}_L(t) \cap P_S(t)|] = Pr(x \in \overline{NS}_L(t) \mid x \in NS_S(t)) \cdot |NS_S(t)| + Pr(x \in \overline{NS}_L(t) \mid x \in DS_S(t)) \cdot |DS_S(t)|. \qquad (6)$$

It should be noted that both conditional probabilities $Pr(x \in \overline{NS}_L(t) \mid x \in NS_S(t))$ and $Pr(x \in \overline{NS}_L(t) \mid x \in DS_S(t))$ are computed based on the joint population P(t).
Substituting Eqs. (5) and (6) into Eq. (4), we obtain a practical estimate of Corr(t) as:

$$\mathrm{Corr}(t) \approx \frac{Pr(x \in \overline{NS}_L(t) \mid x \in NS_S(t)) \cdot |NS_S(t)|}{Pr(x \in \overline{NS}_L(t) \mid x \in P(t)) \cdot |P(t)|} + \frac{Pr(x \in \overline{NS}_L(t) \mid x \in DS_S(t)) \cdot |DS_S(t)|}{Pr(x \in \overline{NS}_L(t) \mid x \in P(t)) \cdot |P(t)|}. \qquad (7)$$
At this point, in order to avoid extra evaluations on the large-data task, we propose to invert the conditional probabilities $Pr(x \in \overline{NS}_L(t) \mid x \in NS_S(t))$ and $Pr(x \in \overline{NS}_L(t) \mid x \in DS_S(t))$ by resorting to Bayes' rule as follows:

$$Pr(x \in \overline{NS}_L(t) \mid x \in NS_S(t)) = \frac{Pr(x \in NS_S(t) \mid x \in \overline{NS}_L(t)) \cdot Pr(x \in \overline{NS}_L(t) \mid x \in P(t))}{Pr(x \in NS_S(t) \mid x \in P(t))}, \qquad (8)$$

and

$$Pr(x \in \overline{NS}_L(t) \mid x \in DS_S(t)) = \frac{Pr(x \in DS_S(t) \mid x \in \overline{NS}_L(t)) \cdot Pr(x \in \overline{NS}_L(t) \mid x \in P(t))}{Pr(x \in DS_S(t) \mid x \in P(t))}. \qquad (9)$$
Combining Eqs. (7), (8) and (9), the estimate of Corr(t) can be expressed as:

$$\begin{aligned}
\mathrm{Corr}(t) &\approx \frac{Pr(x \in NS_S(t) \mid x \in \overline{NS}_L(t)) \cdot |NS_S(t)|}{Pr(x \in NS_S(t) \mid x \in P(t)) \cdot |P(t)|} + \frac{Pr(x \in DS_S(t) \mid x \in \overline{NS}_L(t)) \cdot |DS_S(t)|}{Pr(x \in DS_S(t) \mid x \in P(t)) \cdot |P(t)|} \\
&= \frac{Pr(x \in NS_S(t) \mid x \in \overline{NS}_L(t)) \cdot |NS_S(t)|}{Pr(x \in NS_S(t) \mid x \in P(t)) \cdot |P(t)|} + \frac{(1.0 - Pr(x \in NS_S(t) \mid x \in \overline{NS}_L(t))) \cdot |DS_S(t)|}{(1.0 - Pr(x \in NS_S(t) \mid x \in P(t))) \cdot |P(t)|}.
\end{aligned} \qquad (10)$$
Observe that the term $Pr(x \in \overline{NS}_L(t) \mid x \in P(t))$ cancels out in Eq. (10). Therefore, the approximation of Corr(t) no longer depends on $Pr(x \in \overline{NS}_L(t) \mid x \in P(t))$, $Pr(x \in \overline{NS}_L(t) \mid x \in NS_S(t))$ or $Pr(x \in \overline{NS}_L(t) \mid x \in DS_S(t))$. These terms are replaced by the probabilities $Pr(x \in NS_S(t) \mid x \in P(t))$ and $Pr(x \in NS_S(t) \mid x \in \overline{NS}_L(t))$.
According to Property 1, we can identify the non-dominated solutions $\overline{NS}_S(t)$ and NNS(t) (whose specific meanings can be seen in Table I) without the need to conduct reevaluations on T_S. Then, $Pr(x \in NS_S(t) \mid x \in P(t))$ can be directly calculated as:

$$Pr(x \in NS_S(t) \mid x \in P(t)) = \frac{|\overline{NS}_S(t)|}{|P(t)|}. \qquad (11)$$

On the other hand, invoking Assumption 1, the value of $Pr(x \in NS_S(t) \mid x \in \overline{NS}_L(t))$ is weakly approximated as:

$$Pr(x \in NS_S(t) \mid x \in \overline{NS}_L(t)) \approx \frac{|NNS(t)|}{|NS_L(t)|}. \qquad (12)$$
Substituting Eqs. (11) and (12) into Eq. (10), we obtain the final formula for the estimate of Corr(t):

$$\begin{aligned}
\mathrm{Corr}(t) &\approx \frac{\frac{|NNS(t)|}{|NS_L(t)|} \cdot |NS_S(t)|}{\frac{|\overline{NS}_S(t)|}{|P(t)|} \cdot |P(t)|} + \frac{\left(1.0 - \frac{|NNS(t)|}{|NS_L(t)|}\right) \cdot |DS_S(t)|}{\left(1.0 - \frac{|\overline{NS}_S(t)|}{|P(t)|}\right) \cdot |P(t)|} \\
&= \frac{|NNS(t)| \cdot |NS_S(t)|}{|\overline{NS}_S(t)| \cdot |NS_L(t)|} + \frac{(|NS_L(t)| - |NNS(t)|) \cdot (|P_S(t)| - |NS_S(t)|)}{(|P(t)| - |\overline{NS}_S(t)|) \cdot |NS_L(t)|} \\
&= \frac{|NNS(t)| \cdot |NS_S(t)|}{|\overline{NS}_S(t)| \cdot |NS_L(t)|} + \frac{(|NS_L(t)| - |NNS(t)|) \cdot (psize_S(t) - |NS_S(t)|)}{(psize_L(t) + psize_S(t) - |\overline{NS}_S(t)|) \cdot |NS_L(t)|}.
\end{aligned} \qquad (13)$$

Algorithm 2 Pseudocode of the Bayes Resource Allocation Strategy
Input: P_L(t): the population of T_L (of size psize_L(t)); P_S(t): the population of T_S (of size psize_S(t)).
Output: psize_L(t+1): the new population size of T_L; psize_S(t+1): the new population size of T_S.
1: Identify the solutions in P_L(t) that are non-dominated on T_L (denoted as NS_L(t));
2: Identify the solutions in P_S(t) that are non-dominated on T_S (denoted as NS_S(t));
3: Using Property 1, identify the solutions in P(t) = P_L(t) ∪ P_S(t) that are non-dominated on T_S (denoted as $\overline{NS}_S(t)$);
4: Identify the solutions in NS_L(t) that are also non-dominated on T_S (denoted as NNS(t));
5: Calculate the value of Corr(t) by Eq. (13);
6: proportion = Corr(t);
7: psize_S(t+1) = |P(t)| × proportion;
8: psize_L(t+1) = |P(t)| × (1 − proportion).
This concludes the inference of Corr(t). We note the following summarizing remarks.

Remark 1: No extra evaluation on T_L or T_S is required in the inference of Corr(t) under Property 1. The online inter-task empirical correlation is efficiently approximated by means of simple manipulations and the Bayes inversion trick.
Remark 2: The proposed strategy utilizes two weak approximations (i.e., Eqs. (6) and (12)). When Assumption 1 is satisfied, these approximations are reasonable. In practice, however, the populations P_L(t) and P_S(t) may not satisfy Assumption 1. In such cases, the numerical estimate in Eq. (13) may exceed the theoretical upper bound of 1.0. Thus, an additional check is needed to appropriately bound the estimated value of Corr(t). In particular, to avoid the elimination of T_L (i.e., a scenario where zero resources are allocated to the target task T_L), we bound the value of proportion = Corr(t) to a fraction close to but smaller than 1.0⁷.
Based on the obtained proportion, the population size allocated to T_S in the subsequent generation of EMT is adjusted as psize_S(t+1) = psize × proportion. Accordingly, the population size of T_L for the next generation becomes psize × (1 − proportion). We provide the pseudocode of the proposed online resource allocation strategy in Algorithm 2.

⁷ In our algorithm implementation, the fraction is set to 9/10.
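Putting Eq. (13), the bound of footnote 7, and lines 5-8 of Algorithm 2 together, the allocation step can be sketched as follows (our own naming; the set cardinalities follow Table I):

```python
def bayes_allocation(nns, nsl, nss, nss_joint, psize_l, psize_s, cap=0.9):
    """Estimate Corr(t) via Eq. (13) and split the total population budget.

    nns       = |NNS(t)|  : solutions of NS_L(t) also non-dominated on T_S
    nsl       = |NS_L(t)| : solutions of P_L(t) non-dominated on T_L
    nss       = |NS_S(t)| : solutions of P_S(t) non-dominated on T_S
    nss_joint = cardinality of the overlined set, i.e., solutions of the joint
                population P(t) that are non-dominated on T_S
    """
    psize = psize_l + psize_s
    term1 = (nns * nss) / (nss_joint * nsl)
    term2 = ((nsl - nns) * (psize_s - nss)) / ((psize - nss_joint) * nsl)
    # Bound the proportion below 1.0 so the target task T_L is never eliminated
    proportion = min(term1 + term2, cap)
    new_psize_s = int(round(psize * proportion))
    return psize - new_psize_s, new_psize_s
```

The cap (9/10 in the paper's implementation) prevents the target task from being starved, while proportion falls to zero whenever NNS(t) is empty and P_S(t) has fully converged, mirroring the guard against negative transfer analyzed in Section IV-D.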
D. Analyzing the Bayes Resource Allocation Strategy

The statement of Assumption 1 is strong, leading to weak approximations. Hence, we here perform a sanity check on the final formula for estimating Corr(t) (i.e., Eq. (13)) to confirm its validity. To this end, two intuitively pleasing results, along with their proofs, are stated as follows.
Result 1. When the large- and small-data populations are mutually divergent (i.e., their underlying probability distributions differ) and the small-data population has converged, Eq. (13) implies that no resources will be allocated to T_S in the next generation.

Proof. Since the small-data population has converged, the solutions in P_S(t) are non-dominated on T_S, implying |NS_S(t)| = psize_S(t). Next, due to the divergence of populations P_L(t) and P_S(t), we have NNS(t) = ∅ and |NNS(t)| = 0. Thus, the value of Corr(t) given by Eq. (13) falls to zero, indicating that no resources will be allocated to T_S since proportion = 0. Notably, this enables the Bayes resource allocation strategy to guard against harmful negative transfers from divergent small-data tasks.
Result 2. If the fitness landscapes of the large- and small-data tasks are perfectly correlated with identical population distributions, then Eq. (13) implies that the resources allocated to T_S are non-decreasing.

Proof. Given the similarity of fitness landscapes and population distributions, the solutions in NS_L(t) would also be non-dominated on T_S, implying |NNS(t)| = |NS_L(t)|. Consequently, Eq. (13) reduces to:

$$\mathrm{Corr}(t) \approx \frac{|NS_S(t)|}{|\overline{NS}_S(t)|} \geq \frac{psize_S(t)}{|P(t)|}, \qquad (14)$$

indicating that the resources allocated to T_S in the subsequent generation satisfy psize_S(t+1) = Corr(t) × |P(t)| ≥ psize_S(t).
The two aforementioned results jointly suggest that it is prudent to set psize_S(0)/|P(0)| to a high value (close to 1) at the start of EMT. This is because the proportion (and hence the resources allocated to T_S) will fall to zero if the source and target tasks diverge, while the proportion will remain high or non-decreasing (thus reaping the most benefit from T_S) if the tasks are closely related.
V. EXPERIMENTAL STUDY
In the experiments, we consider synthetic MOOPs under uncertainty as well as the multiobjective hyper-parameter tuning of deep neural network models as examples to investigate the performance of our proposed algorithm. All code was implemented in Python 3.6, using the Keras package for the experiments with deep neural networks.
A. Experiments on MOOPs under Uncertainty

1) Experimental setup: In the experiments on MOOPs under uncertainty, two suites of benchmark problems are adopted: (1) The first suite consists of the Type I test problems proposed in [58], each of which is imposed with additive uniform noise (satisfying U(−1, 1)) in the decision space. In particular, this suite includes four 2-objective problems (named DGT1M2Px, where x = 1, ..., 4) and two 3-objective problems (named DGT1M3P1 and DGT1M3P2, respectively). Following [58], the number of decision variables for these problems is set to 5. (2) The second suite consists of variants of the well-known DTLZ functions [61] with additive Gaussian noise (satisfying N(0, 1)) imposed in the decision space. We name them DTLZxD (x = 1, ..., 7). Since the DTLZ problems can be scaled to any number of objectives and any number of decision variables, we set their number of objectives to 3 and set their number of decision variables according to [61]. For the aforementioned problems, the large uncertainty set D_L is constructed with 1,000 samples of the noise term. The HV [62] metric is adopted to evaluate algorithm performance, with the reference vector for computing HV set based on the non-dominated solutions obtained by all considered algorithms. When calculating the HV value, we reevaluate all solutions obtained by each algorithm on an out-of-sample validation dataset consisting of 10,000 samples. For each problem in the first/second suite, the samples in the validation dataset are correspondingly drawn from the uniform distribution U(−1, 1) or the Gaussian distribution N(0, 1).
For each algorithm, the stopping criterion is a predefined running time (50 seconds), and the whole population size is set to 100, i.e., psize = 100. For the genetic operators, we use the simulated binary crossover operator (with probability p_c = 0.9 and distribution index η_c = 20) [63] and the polynomial mutation operator (with probability p_m = 1/n and distribution index η_m = 20) [64] to generate offspring.

By default, our proposed algorithm is set with the environmental selection procedure of NSGA-II. For ease of description, hereafter we denote our proposed algorithm with online computational resource allocation as EMT-RA. In addition, we also test a variant of our proposed algorithm without resource allocation, which is hereafter called EMT. For our proposed EMT-RA and EMT, the parameter settings are as follows⁸: num = 0.1 × psize = 10, Δt = 20, N_small = 0.01 × |D_L| = 10. For EMT, we allocate equal amounts of resources to the large-data and small-data tasks. For EMT-RA, we set the initial sizes of the large-data and small-data populations to psize_L(0) = 0.1 × psize = 10 and psize_S(0) = 0.9 × psize = 90, respectively⁹; the sizes of both populations are then adaptively adjusted by our proposed Bayes resource allocation strategy.
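For reference, single-variable sketches of the two variation operators above are given below (simplified relative to full library implementations, which add per-variable crossover probabilities and boundary handling in SBX; variable names are ours):

```python
import random

def sbx_pair(x1, x2, eta_c=20.0):
    """Simulated binary crossover on one variable pair; preserves the parents' mean."""
    u = random.random()
    if u <= 0.5:
        beta = (2.0 * u) ** (1.0 / (eta_c + 1.0))
    else:
        beta = (1.0 / (2.0 * (1.0 - u))) ** (1.0 / (eta_c + 1.0))
    c1 = 0.5 * ((1.0 + beta) * x1 + (1.0 - beta) * x2)
    c2 = 0.5 * ((1.0 - beta) * x1 + (1.0 + beta) * x2)
    return c1, c2

def poly_mutate(x, lo, hi, eta_m=20.0):
    """Polynomial mutation of one variable, clipped to its bounds [lo, hi]."""
    u = random.random()
    if u < 0.5:
        delta = (2.0 * u) ** (1.0 / (eta_m + 1.0)) - 1.0
    else:
        delta = 1.0 - (2.0 * (1.0 - u)) ** (1.0 / (eta_m + 1.0))
    return min(max(x + delta * (hi - lo), lo), hi)
```

A larger distribution index (η = 20 here) concentrates offspring near their parents, which suits the fine-grained local search both tasks rely on.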
2) Comparative results: This subsection compares the performance of the single-task MOEA and our proposed algorithm without/with resource allocation. To clearly demonstrate the generalizability of our proposed framework, we select three types of MOEAs (i.e., NSGA-II, SPEA2 and MOEA/D) to act as the base MOEA used within our framework.

Table II shows the comparative results (in terms of average HV value and standard deviation over 11 runs) on the two suites

⁸ The effect of several important parameters (num, Δt and N_small) on algorithm performance is investigated in the supplementary material.
⁹ Based on Results 1 and 2 in Section IV-D, it is reasonable to allocate more resources to T_S in the initial phase.
TABLE II
AVERAGE AND STANDARD DEVIATION OF HV RESULTS OBTAINED BY DIFFERENT ALGORITHMS ON BENCHMARK MOOPS UNDER UNCERTAINTY.
Each algorithm cell reports Avg (Std).

Problem | Time (s) | NSGA-II | EMT_NSGA-II | EMT-RA_NSGA-II | SPEA2 | EMT_SPEA2 | EMT-RA_SPEA2
DGT1M2P1 | 10 | 0.015 (0.033) | 0.318+ (0.127) | 0.517+ (0.155) | 0.343 (0.122) | 0.626+ (0.015) | 0.614+ (0.041)
DGT1M2P1 | 30 | 0.428 (0.150) | 0.585+ (0.154) | 0.580+ (0.131) | 0.627 (0.045) | 0.638+ (0.014) | 0.632+ (0.033)
DGT1M2P1 | 50 | 0.524 (0.124) | 0.589+ (0.153) | 0.587+ (0.125) | 0.628 (0.042) | 0.633+ (0.021) | 0.634+ (0.027)
DGT1M2P2 | 10 | 0.000 (0.000) | 0.024+ (0.078) | 0.385+ (0.309) | 0.203 (0.188) | 0.664+ (0.091) | 0.658+ (0.066)
DGT1M2P2 | 30 | 0.268 (0.283) | 0.305+ (0.351) | 0.479+ (0.314) | 0.656 (0.092) | 0.671+ (0.097) | 0.686+ (0.058)
DGT1M2P2 | 50 | 0.432 (0.345) | 0.473+ (0.312) | 0.512+ (0.329) | 0.661 (0.087) | 0.667+ (0.084) | 0.684+ (0.062)
DGT1M2P3 | 10 | 0.475 (0.072) | 0.611+ (0.019) | 0.624+ (0.021) | 0.560 (0.077) | 0.617+ (0.036) | 0.620+ (0.022)
DGT1M2P3 | 30 | 0.643 (0.003) | 0.640 (0.001) | 0.645+ (0.001) | 0.607 (0.075) | 0.625+ (0.031) | 0.624+ (0.017)
DGT1M2P3 | 50 | 0.646 (0.000) | 0.641 (0.001) | 0.646 (0.000) | 0.607 (0.076) | 0.631+ (0.025) | 0.625+ (0.021)
DGT1M2P4 | 10 | 0.467 (0.035) | 0.615+ (0.012) | 0.640+ (0.014) | 0.602 (0.039) | 0.640+ (0.011) | 0.645+ (0.015)
DGT1M2P4 | 30 | 0.636 (0.027) | 0.658+ (0.012) | 0.663+ (0.011) | 0.653 (0.008) | 0.648 (0.011) | 0.649 (0.015)
DGT1M2P4 | 50 | 0.652 (0.020) | 0.663+ (0.014) | 0.667+ (0.011) | 0.658 (0.007) | 0.648 (0.011) | 0.647 (0.026)
DGT1M3P1 | 10 | 0.054 (0.055) | 0.213+ (0.091) | 0.515+ (0.070) | 0.802 (0.040) | 0.890+ (0.006) | 0.898+ (0.003)
DGT1M3P1 | 30 | 0.521 (0.105) | 0.545+ (0.094) | 0.584+ (0.014) | 0.898 (0.005) | 0.895 (0.004) | 0.896 (0.004)
DGT1M3P1 | 50 | 0.571 (0.091) | 0.577+ (0.011) | 0.590+ (0.011) | 0.899 (0.004) | 0.898 (0.004) | 0.897 (0.005)
DGT1M3P2 | 10 | 0.309 (0.087) | 0.433+ (0.049) | 0.475+ (0.062) | 0.352 (0.079) | 0.497+ (0.077) | 0.527+ (0.063)
DGT1M3P2 | 30 | 0.525 (0.064) | 0.526 (0.042) | 0.534+ (0.055) | 0.551 (0.025) | 0.549 (0.027) | 0.557+ (0.013)
DGT1M3P2 | 50 | 0.543 (0.054) | 0.528 (0.043) | 0.543 (0.043) | 0.555 (0.020) | 0.548 (0.026) | 0.565+ (0.005)
DTLZ1D | 10 | 0.512 (0.017) | 0.541+ (0.018) | 0.574+ (0.037) | 0.558 (0.017) | 0.624+ (0.021) | 0.705+ (0.029)
DTLZ1D | 30 | 0.665 (0.014) | 0.669+ (0.008) | 0.686+ (0.023) | 0.729 (0.006) | 0.738+ (0.009) | 0.745+ (0.011)
DTLZ1D | 50 | 0.697 (0.011) | 0.689 (0.008) | 0.711+ (0.013) | 0.738 (0.004) | 0.748+ (0.008) | 0.754+ (0.006)
DTLZ2D | 10 | 0.126 (0.025) | 0.264+ (0.026) | 0.345+ (0.014) | 0.037 (0.021) | 0.196+ (0.023) | 0.308+ (0.027)
DTLZ2D | 30 | 0.344 (0.010) | 0.362+ (0.010) | 0.379+ (0.010) | 0.288 (0.026) | 0.360+ (0.013) | 0.347+ (0.021)
DTLZ2D | 50 | 0.384 (0.007) | 0.382 (0.007) | 0.390+ (0.006) | 0.349 (0.014) | 0.377+ (0.007) | 0.367+ (0.017)
DTLZ3D | 10 | 0.047 (0.007) | 0.071+ (0.013) | 0.085+ (0.027) | 0.063 (0.008) | 0.105+ (0.014) | 0.265+ (0.038)
DTLZ3D | 30 | 0.146 (0.020) | 0.203+ (0.025) | 0.218+ (0.029) | 0.296 (0.010) | 0.355+ (0.016) | 0.390+ (0.013)
DTLZ3D | 50 | 0.259 (0.014) | 0.291+ (0.009) | 0.289+ (0.034) | 0.389 (0.008) | 0.390 (0.011) | 0.408+ (0.010)
DTLZ4D | 10 | 0.256 (0.046) | 0.418+ (0.026) | 0.506+ (0.018) | 0.117 (0.051) | 0.306+ (0.034) | 0.523+ (0.019)
DTLZ4D | 30 | 0.517 (0.007) | 0.527+ (0.011) | 0.538+ (0.011) | 0.485 (0.026) | 0.532+ (0.010) | 0.563+ (0.012)
DTLZ4D | 50 | 0.546 (0.006) | 0.544 (0.009) | 0.551+ (0.006) | 0.553 (0.015) | 0.561+ (0.008) | 0.574+ (0.009)
DTLZ5D | 10 | 0.031 (0.008) | 0.078+ (0.007) | 0.116+ (0.006) | 0.009 (0.008) | 0.044+ (0.011) | 0.101+ (0.007)
DTLZ5D | 30 | 0.124 (0.005) | 0.135+ (0.002) | 0.138+ (0.002) | 0.084 (0.009) | 0.108+ (0.007) | 0.118+ (0.006)
DTLZ5D | 50 | 0.138 (0.002) | 0.139 (0.002) | 0.142+ (0.002) | 0.110 (0.009) | 0.124+ (0.005) | 0.126+ (0.007)
DTLZ6D | 10 | 0.000 (0.000) | 0.007+ (0.007) | 0.151+ (0.042) | 0.000 (0.000) | 0.078+ (0.024) | 0.321+ (0.011)
DTLZ6D | 30 | 0.150 (0.040) | 0.264+ (0.020) | 0.298+ (0.020) | 0.281 (0.018) | 0.352+ (0.007) | 0.355+ (0.014)
DTLZ6D | 50 | 0.290 (0.019) | 0.328+ (0.013) | 0.338+ (0.010) | 0.363 (0.007) | 0.368+ (0.008) | 0.365+ (0.014)
DTLZ7D | 10 | 0.000 (0.000) | 0.000 (0.000) | 0.028+ (0.022) | 0.000 (0.000) | 0.000 (0.000) | 0.201+ (0.019)
DTLZ7D | 30 | 0.001 (0.002) | 0.059+ (0.027) | 0.111+ (0.034) | 0.012 (0.007) | 0.244+ (0.024) | 0.295+ (0.013)
DTLZ7D | 50 | 0.063 (0.025) | 0.143+ (0.030) | 0.170+ (0.034) | 0.200 (0.012) | 0.299+ (0.022) | 0.315+ (0.005)
Average rank (Friedman test) | — | 2.7949 | 2.0897 | 1.1154 | 2.7308 | 1.9103 | 1.3590

Note: The symbol + indicates that our proposed algorithm significantly improves upon the baseline algorithm (NSGA-II or SPEA2) at the 0.05 level by the Wilcoxon rank sum test, whereas − indicates the opposite; ≈ marks cases where no significant difference is detected.
of benchmark problems. In each case, the best metric value is highlighted with grey shading. Moreover, we adopt three symbols (i.e., "+", "−" and "≈", whose meanings are given at the bottom of Table II) to mark the results of the Wilcoxon rank sum test with a confidence level of 0.95. In addition, the Friedman test is applied to all HV results to obtain the average ranks of all algorithms.
Firstly, we focus on the comparisons among NSGA-II, EMT_NSGA-II and EMT-RA_NSGA-II. From Table II, we can observe that EMT_NSGA-II and EMT-RA_NSGA-II obtain better performance than NSGA-II on most benchmark problems. Among the 39 cases, EMT_NSGA-II performs significantly better than NSGA-II in 30 cases, while EMT-RA_NSGA-II performs significantly better than NSGA-II in 37 cases. These results demonstrate the effectiveness of using the small-data task in our proposed framework. In terms of the comparison between the two variants of our proposed algorithm, EMT-RA_NSGA-II shows significant improvement over EMT_NSGA-II in 36 cases, demonstrating the effectiveness of our proposed Bayes resource allocation strategy.
Similar results are obtained when using SPEA2 as the base MOEA. In particular, EMT_SPEA2 performs significantly better than SPEA2 in 31 cases, while EMT-RA_SPEA2 obtains significantly better performance than SPEA2 in 34 cases. Moreover, as can be seen from the average ranks obtained by the Friedman test, the overall performance of
Fig. 2. Convergence curves obtained by the baseline algorithm and two variants of our proposed algorithm (i.e., EMT and EMT-RA) on benchmark MOOPs under uncertainty. In each figure, the solid lines indicate the average HV values and the shaded area denotes the 95% confidence interval over multiple runs.
EMT-RA_NSGA-II / EMT-RA_SPEA2 on all considered problems ranks first, while that of EMT_NSGA-II / EMT_SPEA2 ranks second. These results verify the generalizability of our proposed framework.
The convergence trends of average HV obtained by NSGA-II, EMT_NSGA-II and EMT-RA_NSGA-II on all considered benchmark problems are depicted in Fig. 2. In these figures, the solid lines indicate the average HV values and the shaded areas denote the 95% confidence intervals. In addition, for simplicity, we herein write EMT_NSGA-II and EMT-RA_NSGA-II as EMT and EMT-RA, respectively. As can be seen, EMT and EMT-RA obtain faster convergence than NSGA-II on most benchmark problems, indicating that using small-data tasks can indeed help to accelerate the convergence rate. For example, on the DGT1M2P1 problem, the performance difference between EMT/EMT-RA and NSGA-II is significant, which is explained by the fact that all Pareto optimal solutions of the deterministic version of DGT1M2P1 remain robust even under uncertainty (hence the small-data task acts as a good proxy for the large-data task) [58]. On the other hand, for DGT1M2P3, where uncertainty changes the Pareto set, the improvement achieved by EMT-RA is lower (although still noticeable). Next, we focus on the comparison between EMT-RA and EMT. We can observe that EMT-RA converges faster than EMT on all considered problems. This demonstrates that convergence can be further boosted with the help of our proposed Bayes resource allocation strategy. In particular, taking the DGT1M2P2 problem as an example, EMT progresses faster than NSGA-II at the early stage but tends to stagnate later. Such a phenomenon may occur if the small-data task in EMT grows out of usefulness for the target task. By contrast, EMT-RA converges faster by leveraging the Bayes resource allocation strategy, which has the ability to reduce the resources allocated to the small-data task when its usefulness drops. The performance difference between EMT-RA and EMT on DGT1M2P2 further highlights the necessity and effectiveness of online computational resource allocation.
Fig. S-5 in the supplementary material shows the convergence curves of average HV obtained by MOEA/D, EMT_MOEA/D and EMT-RA_MOEA/D on four representative benchmark problems. As can be seen, on the whole, EMT_MOEA/D and EMT-RA_MOEA/D converge faster than MOEA/D. When separately comparing EMT-RA_MOEA/D and EMT_MOEA/D, we find that EMT-RA_MOEA/D always converges faster than EMT_MOEA/D on DGT1M2P3 and DTLZ1D. However, on DGT1M2P1 and DGT1M2P2, EMT-RA_MOEA/D fails to outperform EMT_MOEA/D. This phenomenon may be attributed to the fact that the base MOEA/D has to be run with a set of predefined, evenly distributed weight vectors. When the resources allocated to each task (i.e., the population size of each task) change, the number of weight vectors and the set of weight vectors used for each task have to be regenerated. However, according to the common method for generating weight vectors [65], the number of newly generated weight vectors for each task (which can only take certain particular values) may not be sufficiently close to the new population size of that task (which is an arbitrary number). This may hamper the effectiveness of our proposed Bayes resource allocation strategy to some extent.
3) Analyses on learnt correlation curves, the effect of the number of transferred solutions, dynamic small data, and small data size: Due to space limitations, the related results and discussions are placed in the supplementary material.
B. Experiments on Multiobjective Hyper-parameter Tuning of Deep Neural Network Models

In this subsection, we apply our proposed algorithm to the multiobjective hyper-parameter tuning of neural network models (MOHPT for short), which serves as a representative example from the field of AutoML.

MOHPT belongs to a subclass of data-driven MOOPs in which data is utilized for supervised training of a candidate machine learning model, whose out-of-sample generalization performance provides the objective function scores for automated model configuration. Specifically, MOHPT under large-instance data can be formalized as follows:
$$\min_{x} F(x; D_L, D_{holdout}), \qquad (15)$$

where F(x; D_L, D_holdout) denotes the vector-valued out-of-sample loss achieved by a model parametrized by x, trained on the large-instance dataset D_L¹⁰, and validated on a hold-out dataset D_holdout¹¹.
For MOHPT under large-instance data, massive amounts of energy are usually required for building, training and validating the deep models of today, generating growing concerns about the carbon footprint of deep learning [66], [67]. Developing an efficient solver, such as the one proposed in this paper, could help alleviate the computational bottleneck in hyper-parameter optimization or neural architecture search, thus supporting the environmental sustainability of modern AI [68].
1) Construction of MOHPT problems: We employ a Convolutional Neural Network (CNN) as the underlying ML model, and use it to conduct multi-label/multi-task classification on various datasets. Multi-task learning, as the name suggests, is a learning paradigm where data from multiple tasks (each of which has a performance metric) are combined for joint training with shared model parameters [11], [69]. Multi-label learning considers the problem in which each example is represented by a single data instance associated with a set of labels simultaneously, and the aim is to predict the label sets of unseen instances by analyzing training instances with known label sets [70]. In essence, multi-label learning

¹⁰ For MOHPT under large-instance data, we treat the training of a model as a black box, regardless of whether its training adopts mini-batch optimizers (where a mini-batch of samples from the large-instance dataset is used in each iteration; e.g., mini-batch gradient descent) or not.
¹¹ In practice, multiple hold-out datasets may be used, with the average performance over them computed as the objective function score. Here, however, we only consider the scenario of using one hold-out dataset.
TABLE III
HYPER-PARAMETERS OF CNN.

| No. | Description | Range | Type |
| 1 | Mini-batch size | [2^4, 2^8] | Discrete |
| 2 | Size of convolution window | {1, 3, 5} | Discrete |
| 3 | Number of filters in the convolution layer | [2^3, 2^6] | Discrete |
| 4 | Dropout rate | [0, 0.5] | Continuous |
| 5 | Learning rate | [10^-4, 10^-1] | Continuous |
| 6 | Decay parameter beta_1 used in Adam | [0.8, 0.999] | Continuous |
| 7 | Decay parameter beta_2 used in Adam | [0.99, 0.9999] | Continuous |
| 8 | Parameter epsilon used in Adam | [10^-9, 10^-3] | Continuous |
can be seen as a special form of multi-task learning where each task represents a different label. In [11], researchers demonstrated the feasibility of modeling multi-task learning problems as multiobjective optimization. Thus, we can use multi-label/multi-task learning datasets to conduct the MOHPT experiments.

For the sake of simplicity, we limit the number of convolution layers in the CNN to 2. For training the CNN, we choose the cross-entropy loss with dropout regularization as the loss function and select Adam [71] (run with 10 epochs¹²) as a state-of-the-art CNN optimizer.

The hyper-parameters being optimized and their ranges used in the experiments are listed in Table III. All hyper-parameters are encoded in the range [0, 1], and they are transformed into their corresponding ranges through the transformation method used in [72].
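As an illustration of such a decoding (a plausible sketch of our own; the exact transformation of [72] may differ, e.g., in its rounding and log-scale handling), a vector u in [0, 1]^8 ordered as in Table III could be mapped as:

```python
def decode(u):
    """Map u in [0,1]^8 (ordered as in Table III) to concrete hyper-parameters."""
    return {
        "batch_size": 2 ** round(4 + 4 * u[0]),           # discrete, [2^4, 2^8]
        "conv_window": (1, 3, 5)[min(int(3 * u[1]), 2)],  # discrete choice
        "n_filters": 2 ** round(3 + 3 * u[2]),            # discrete, [2^3, 2^6]
        "dropout": 0.5 * u[3],                            # continuous, [0, 0.5]
        "learning_rate": 10 ** (-4 + 3 * u[4]),           # log scale, [1e-4, 1e-1]
        "beta_1": 0.8 + 0.199 * u[5],                     # [0.8, 0.999]
        "beta_2": 0.99 + 0.0099 * u[6],                   # [0.99, 0.9999]
        "epsilon": 10 ** (-9 + 6 * u[7]),                 # log scale, [1e-9, 1e-3]
    }
```

Mapping the learning rate and epsilon on a log scale keeps the search resolution roughly uniform across their several orders of magnitude.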
To construct an MOHPT problem, we let the classification error for each label/task act as an objective function to be minimized. Two types of real-world datasets are adopted:
1) Five multi-label learning datasets (scene, yeast, Corel5k, delicious and tmc2007_500) downloaded from the Mulan website¹³ [73]. For simplicity, we restrict each of the original datasets to three labels via the following steps: selecting the top three labels (in terms of the number of instances per label) and then deleting the data instances without any label. The final sizes of the resulting datasets are: scene (1,314 instances), yeast (2,136 instances), Corel5k (2,468 instances), delicious (11,305 instances) and tmc2007_500 (24,829 instances). In this way, the MOHPT on each dataset is modeled as a 3-objective optimization problem. In addition, each dataset is split into a training dataset D_L and a hold-out dataset D_holdout with a splitting ratio of 80%/20%.
2) One multi-task learning dataset (i.e., the MultiMNIST dataset). We adopt the construction method introduced in [11] to build the MultiMNIST dataset (a two-task learning version of the MNIST dataset, where the training dataset D_L and hold-out dataset D_holdout contain 60,000 and 10,000 instances, respectively). Hence, the MOHPT on MultiMNIST is a 2-objective optimization problem.
2) Experimental setup: In the MOHPT experiments, our
proposed algorithm is set with the environmental selection
¹²The number of epochs is usually treated as a hyper-parameter to be optimized in MOHPT. However, due to computational resource limitations, we fix the number of epochs to a small value.
¹³http://mulan.sourceforge.net/datasets-mlc.html
procedure of NSGA-II. For ease of description, we denote
our proposed algorithm with/without resource allocation as
EMT-RA and EMT, respectively. For each algorithm, the whole population size is set to psize = 20. For EMT-RA, we set the initial sizes of the large-data and small-data populations to psize_L(0) = 0.5·psize = 10 and psize_S(0) = 0.5·psize = 10, respectively. The other parameters of EMT-RA are set as follows: num = 0.1·psize = 2, Δt = 20, and N_small = (1/5)|D_L|. The large-data task uses the whole training set D_L to train the CNN and then uses D_holdout to evaluate the objective functions; the small-data task uses D_S to train the CNN and likewise uses D_holdout for evaluation. For the training process in the small-data task (or large-data task), a mini-batch of samples is drawn from D_S (or D_L) in each iteration of Adam. In addition, the reference point used for computing HV is set to the all-one vector.
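With the all-one reference point, the hypervolume (HV) for the 2-objective MultiMNIST case can be computed exactly by a simple sweep. This is a minimal illustrative sketch for the bi-objective minimization case only (the 3-objective datasets need a general HV routine, e.g., from an off-the-shelf library):

```python
def hypervolume_2d(front, ref=(1.0, 1.0)):
    """Exact hypervolume of a non-dominated bi-objective front
    (minimization) w.r.t. a reference point, here the all-one vector."""
    # Keep only points that strictly dominate the reference point.
    pts = sorted(p for p in front if p[0] < ref[0] and p[1] < ref[1])
    hv, prev_f2 = 0.0, ref[1]
    for f1, f2 in pts:                      # sweep in increasing f1
        hv += (ref[0] - f1) * (prev_f2 - f2)  # add the new rectangular slab
        prev_f2 = f2
    return hv
```

Since the objectives are classification errors in [0, 1], every attainable point dominates the all-one reference, which makes the HV values across datasets directly comparable.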
3) Effect of the size of small data: Due to space limitation,
we have placed related results in the supplementary material.
4) Comparative results: The convergence curves obtained
by NSGA-II and our proposed EMT-RA on all considered
datasets along with their obtained non-dominated solutions on
MultiMNIST can be found in the supplementary material.
To quantify the performance gain of our proposed EMT-
RA over the baseline algorithm (i.e., NSGA-II), we record
the running time for EMT-RA needed to achieve the same
level of performance (in terms of average HV values) as
NSGA-II. Specifically, for the six datasets with different numbers of instances, we first run NSGA-II with different levels of time budgets (800 seconds for scene, yeast, and Corel5k; 8,000 seconds for delicious and tmc2007_500; 18,000 seconds for MultiMNIST), and record the average HV values obtained by NSGA-II (i.e., 0.642, 0.538, 0.281, 0.277, 0.584, and 0.935 for the six datasets, respectively). Then, we run EMT-RA
and see how much time it needs to reach the corresponding
HV value obtained by NSGA-II on each dataset. The observed
results are summarized in Fig. 3, showing that the running time
consumed by our proposed EMT-RA is significantly less than
that of NSGA-II on all considered datasets. For example, on
MultiMNIST dataset, EMT-RA only needs 9,863 seconds to
achieve the same HV result as NSGA-II with 18,000 seconds.
Furthermore, we show the speedup in terms of running time
obtained by our proposed EMT-RA in Fig. 4. From this figure,
we can observe that the algorithm obtains a 40%–75% speedup on most of the considered datasets. These results on
medium- and large-size datasets demonstrate that EMT-RA
can efficiently deal with the practical multiobjective hyper-
parameter tuning of neural network models.
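The speedup figures quoted here correspond to the fractional running-time saving of EMT-RA relative to the baseline budget. The helper below is an illustrative restatement of that calculation (only the MultiMNIST pair of times is stated explicitly in the text; the remaining per-dataset times are in Fig. 3):

```python
def speedup(t_baseline, t_emt_ra):
    """Fractional running-time saving of EMT-RA over the baseline,
    measured at equal average-HV performance."""
    return 1.0 - t_emt_ra / t_baseline

# MultiMNIST example from the text: 9,863 s for EMT-RA vs. an
# 18,000 s budget for NSGA-II.
print(f"{speedup(18000, 9863):.1%}")  # prints 45.2%
```

Under this definition, the reported 40%–75% range means EMT-RA reaches the same HV in roughly one quarter to three fifths of the baseline's running time.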
VI. CONCLUSION
In this paper, we have put forward a novel evolutionary mul-
titasking (EMT) framework targeting scalable multiobjective
optimization under large-instance data. In this framework, a
series of computationally-cheaper small-data tasks (referred to
as minions) are generated on-the-fly via random subsampling,
with the aim of assisting the target large-data task in the search
for Pareto optimal solutions. Notably, our framework can be wrapped around any multiobjective evolutionary algorithm to address the big data problem.

Fig. 3. Comparison of the running time needed for the baseline and our proposed EMT-RA algorithm to achieve the same level of performance (i.e., reaching an average HV value of 0.642, 0.538, 0.281, 0.277, 0.584 and 0.935, respectively, on the datasets listed along the x-axis). Our proposed algorithm consumes significantly less running time than its baseline.

Fig. 4. Speedup obtained by our proposed EMT-RA algorithm on all considered datasets. About 40%–75% speedup is observed on most datasets.

Its salient feature is an online
computational resource allocation strategy based on Bayes’
rule, which automatically rewards more resources to the in-
expensive small-data tasks when they demonstrate beneficial
transfers to the target. In the empirical studies, we have verified
the performance of EMT with resource allocation through a
series of experiments on multiobjective optimization under un-
certainty as well as the multiobjective hyper-parameter tuning
of deep neural network models, covering different suites of
benchmark problems and various sizes of real-world datasets.
For future work, on the one hand, we shall further comple-
ment our methodology through the incorporation of surrogate-
assistance techniques, and also investigate alternative subsam-
pling approaches to enable the small-instance dataset to better
guide the search on large-instance data. On the other hand,
we shall continue to rigorously verify the performance of
EMT-RA on datasets containing millions (or more) instances,
spanning a much richer variety of data-driven multiobjective
optimization problems of real-world interest.
REFERENCES
[1] S. Mardle and K. M. Miettinen, “Nonlinear multiobjective optimization,
Journal of the Operational Research Society, vol. 51, no. 2, p. 246, 1999.
[2] D. P. Heyman and M. J. Sobel, Stochastic Models in Operations
Research Volume II: Stochastic Optimization. McGraw Hill, New York,
2003.
[3] P. Pandita, I. Bilionis, J. Panchal, B. P. Gautham, A. Joshi, and P. Zagade,
“Stochastic multiobjective optimization on a budget: Application to
multipass wire drawing with quantified uncertainties,” International
Journal for Uncertainty Quantification, vol. 8, no. 3, pp. 233–249, 2018.
[4] B. Wilder, B. Dilkina, and M. Tambe, “Melding the data-decisions
pipeline: Decision-focused learning for combinatorial optimization,” in
Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33,
no. 01, 2019, pp. 1658–1665.
[5] Y. Hu, Y. Zhang, and D. Gong, “Multiobjective particle swarm opti-
mization for feature selection with fuzzy cost,” IEEE Transactions on
Cybernetics, vol. 51, no. 2, pp. 874–888, 2021.
[6] N. Zhang, A. Gupta, Z. Chen, and Y.-S. Ong, “Evolutionary machine
learning with minions: A case study in feature selection,” IEEE Trans-
actions on Evolutionary Computation, vol. 26, no. 1, pp. 130–144, 2022.
[7] J. Luo, D. Zhou, L. Jiang, and H. Ma, “A particle swarm optimization
based multiobjective memetic algorithm for high-dimensional feature
selection,” Memetic Computing, vol. 14, no. 1, pp. 77–93, 2022.
[Online]. Available: https://doi.org/10.1007/s12293- 022-00354- z
[8] Z. Wang, S. Gao, M. Zhou, S. Sato, J. Cheng, and J. Wang, “Information-
theory-based nondominated sorting ant colony optimization for mul-
tiobjective feature selection in classification, IEEE Transactions on
Cybernetics, pp. 1–14, 2022.
[9] X. He, K. Zhao, and X. Chu, “Automl: A survey of the state-of-the-art,”
Knowledge-Based Systems, vol. 212, p. 106622, 2021.
[10] A. Morales-Hernndez, I. V. Nieuwenhuyse, and S. R. Gonzalez, “A
survey on multi-objective hyperparameter optimization algorithms for
machine learning,” arXiv e-prints, 2021.
[11] O. Sener and V. Koltun, “Multi-task learning as multi-objective op-
timization,” in Proceedings of the 32nd International Conference on
Neural Information Processing Systems, 2018, pp. 525–536.
[12] D. Ballabio, “Parsimonious optimization of multitask neural network
hyperparameters,” Molecules, vol. 26, 2021.
[13] T. Elsken, J. H. Metzen, and F. Hutter, “Efficient multi-objective
neural architecture search via lamarckian evolution, arXiv preprint
arXiv:1804.09081, 2018.
[14] Z. Lu, I. Whalen, V. Boddeti, Y. Dhebar, K. Deb, E. Goodman, and
W. Banzhaf, “Nsga-net: neural architecture search using multi-objective
genetic algorithm,” in Proceedings of the Genetic and Evolutionary
Computation Conference, 2019, pp. 419–427.
[15] B. Lyu, S. Wen, K. Shi, and T. Huang, “Multiobjective reinforcement
learning-based neural architecture search for efficient portrait parsing,”
IEEE Transactions on Cybernetics, pp. 1–12, 2021.
[16] Y. Bi, B. Xue, and M. Zhang, “Multitask feature learning as multiob-
jective optimization: A new genetic programming approach to image
classification,” IEEE Transactions on Cybernetics, pp. 1–14, 2022.
[17] Z.-H. Zhou, N. V. Chawla, Y. Jin, and G. J. Williams, “Big data op-
portunities and challenges: Discussions from data analytics perspectives
[discussion forum],” IEEE Computational intelligence magazine, vol. 9,
no. 4, pp. 62–74, 2014.
[18] P. L. Yu, “Cone convexity, cone extreme points, and nondominated
solutions in decision problems with multiobjectives, Journal of Op-
timization Theory and Applications, vol. 14, no. 3, pp. 319–377, 1974.
[19] K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan, A fast and elitist
multiobjective genetic algorithm: Nsga-ii, IEEE transactions on evolu-
tionary computation, vol. 6, no. 2, pp. 182–197, 2002.
[20] L. M. Pang, H. Ishibuchi, and K. Shang, “Nsga-ii with simple modifi-
cation works well on a wide variety of many-objective problems, IEEE
Access, vol. 8, pp. 190 240–190 250, 2020.
[21] E. Zitzler, M. Laumanns, and L. Thiele, “Spea2: Improving the strength
pareto evolutionary algorithm, TIK-report, vol. 103, 2001.
[22] Q. Zhang and H. Li, “Moea/d: A multiobjective evolutionary algorithm
based on decomposition,” IEEE Transactions on evolutionary computa-
tion, vol. 11, no. 6, pp. 712–731, 2007.
[23] A. Ferranti, F. Marcelloni, and A. Segatori, “A multi-objective evolution-
ary fuzzy system for big data,” in 2016 IEEE International Conference
on Fuzzy Systems (FUZZ-IEEE), 2016, pp. 1562–1569.
[24] A. Ferranti, F. Marcelloni, A. Segatori, M. Antonelli, and P. Ducange,
“A distributed approach to multi-objective evolutionary generation of
fuzzy rule-based classifiers from big data,” Information Sciences, vol.
415, pp. 319–340, 2017.
[25] M. Barsacchi, A. Bechini, P. Ducange, and F. Marcelloni, “Optimizing
partition granularity, membership function parameters, and rule bases of
fuzzy classifiers for big data by a multi-objective evolutionary approach,
Cognitive Computation, vol. 11, no. 3, pp. 367–387, 2019.
[26] F. Pulgar-Rubio, A. Rivera-Rivas, M. D. Pérez-Godoy, P. González, C. J. Carmona, and M. J. del Jesus, “MEFASD-BD: Multi-objective evolutionary fuzzy algorithm for subgroup discovery in big data environments - a MapReduce solution,” Knowledge-Based Systems, vol. 117, pp. 70–78, 2017.
[27] M. Golchin and A. W.-C. Liew, “Bi-clustering by multi-objective evolu-
tionary algorithm for multimodal analytics and big data,” in Multimodal
Analytics for Next-Generation Big Data Technologies and Applications.
Springer, 2019, pp. 125–150.
[28] G. N. Karagoz, A. Yazici, T. Dokeroglu, and A. Cosar, “A new frame-
work of multi-objective evolutionary algorithms for feature selection
and multi-label classification of video data,” International Journal of
Machine Learning and Cybernetics, vol. 12, no. 1, pp. 53–71, 2021.
[29] A. Gülcü and Z. Kuş, “Multi-objective simulated annealing for hyper-parameter optimization in convolutional neural networks,” PeerJ Computer Science, vol. 7, p. e338, 2021.
[30] A. Garcia-Piquer, A. Fornells, J. Bacardit, A. Orriols-Puig, and E. Golo-
bardes, “Large-scale experimental evaluation of cluster representations
for multiobjective evolutionary clustering, IEEE transactions on evolu-
tionary computation, vol. 18, no. 1, pp. 36–53, 2013.
[31] A. Garcia-Piquer, J. Bacardit, A. Fornells, and E. Golobardes, “Scaling-
up multiobjective evolutionary clustering algorithms using stratification,
Pattern Recognition Letters, vol. 93, pp. 69–77, 2017.
[32] Y.-S. Ong and A. Gupta, “Evolutionary multitasking: a computer science
view of cognitive multitasking, Cognitive Computation, vol. 8, no. 2,
pp. 125–142, 2016.
[33] A. Gupta, Y.-S. Ong, and L. Feng, “Multifactorial evolution: toward
evolutionary multitasking, IEEE Transactions on Evolutionary Compu-
tation, vol. 20, no. 3, pp. 343–357, 2016.
[34] A. Gupta, Y.-S. Ong, L. Feng, and K. C. Tan, “Multiobjective multifac-
torial optimization in evolutionary multitasking, IEEE transactions on
cybernetics, vol. 47, no. 7, pp. 1652–1665, 2017.
[35] T. Wei, S. Wang, J. Zhong, D. Liu, and J. Zhang, A review on
evolutionary multi-task optimization: Trends and challenges, IEEE
Transactions on Evolutionary Computation, pp. 1–1, 2021.
[36] A. Gupta, L. Zhou, Y.-S. Ong, Z. Chen, and Y. Hou, “Half a dozen
real-world applications of evolutionary multitasking, and more, IEEE
Computational Intelligence Magazine, vol. 17, no. 2, pp. 49–66, 2022.
[37] Y. Jin, H. Wang, T. Chugh, D. Guo, and K. Miettinen, “Data-driven
evolutionary optimization: An overview and case studies,” IEEE Trans-
actions on Evolutionary Computation, vol. 23, no. 3, pp. 442–458, 2018.
[38] Y. Jin, “Surrogate-assisted evolutionary computation: Recent advances
and future challenges,” Swarm and Evolutionary Computation, vol. 1,
no. 2, pp. 61–70, 2011.
[39] L. V. Santana-Quintero, A. A. Montano, and C. A. C. Coello, A
review of techniques for handling expensive functions in evolutionary
multi-objective optimization, Computational intelligence in expensive
optimization problems, pp. 29–59, 2010.
[40] J. Knowles, “Parego: A hybrid algorithm with on-line landscape ap-
proximation for expensive multiobjective optimization problems, IEEE
Transactions on Evolutionary Computation, vol. 10, no. 1, pp. 50–66,
2006.
[41] Q. Zhang, W. Liu, E. Tsang, and B. Virginas, “Expensive multiobjective
optimization by moea/d with gaussian process model,” IEEE Transac-
tions on Evolutionary Computation, vol. 14, no. 3, pp. 456–474, 2009.
[42] T. Chugh, Y. Jin, K. Miettinen, J. Hakanen, and K. Sindhya, “A
surrogate-assisted reference vector guided evolutionary algorithm for
computationally expensive many-objective optimization,” IEEE Trans-
actions on Evolutionary Computation, vol. 22, no. 1, pp. 129–142, 2016.
[43] R. G. Regis, “Evolutionary programming for high-dimensional con-
strained expensive black-box optimization using radial basis functions,
IEEE Transactions on Evolutionary Computation, vol. 18, no. 3, pp.
326–347, 2013.
[44] C. Sun, Y. Jin, R. Cheng, J. Ding, and J. Zeng, “Surrogate-assisted co-
operative swarm optimization of high-dimensional expensive problems,”
IEEE Transactions on Evolutionary Computation, vol. 21, no. 4, pp.
644–660, 2017.
[45] S. Zapotecas Martínez and C. A. Coello Coello, “Moea/d assisted by
rbf networks for expensive multi-objective optimization problems,” in
Proceedings of the 15th annual conference on Genetic and evolutionary
computation, 2013, pp. 1405–1412.
[46] Z. Zhou, Y. S. Ong, M. H. Nguyen, and D. Lim, “A study on polynomial
regression and gaussian process global surrogate model in hierarchical
surrogate-assisted evolutionary algorithm, in 2005 IEEE congress on
evolutionary computation, vol. 3. IEEE, 2005, pp. 2832–2839.
[47] M. Parsa, J. P. Mitchell, C. D. Schuman, R. M. Patton, T. E. Potok,
and K. Roy, “Bayesian multi-objective hyperparameter optimization for
accurate, fast, and efficient neural network accelerator design, Frontiers
in neuroscience, vol. 14, p. 667, 2020.
[48] M. Zaefferer and T. Bartz-Beielstein, “Efficient global optimization with
indefinite kernels,” in Parallel Problem Solving from Nature - PPSN XIV
- 14th International Conference, Edinburgh, UK, September 17-21, 2016,
Proceedings, ser. Lecture Notes in Computer Science, J. Handl, E. Hart,
P. R. Lewis, M. López-Ibáñez, G. Ochoa, and B. Paechter, Eds., vol.
9921. Springer, 2016, pp. 69–79.
[49] S. Park, Y.-D. Kim, and S. Choi, “Hierarchical bayesian matrix fac-
torization with side information,” in Twenty-Third International Joint
Conference on Artificial Intelligence, 2013.
[50] A. Gupta, Y.-S. Ong, and L. Feng, “Insights on Transfer Optimization:
Because Experience is the Best Teacher,” IEEE Transactions on Emerg-
ing Topics in Computational Intelligence, vol. 2, no. 1, pp. 51–64, 2017.
[51] L. Zhang, Y. Xie, J. Chen, L. Feng, C. Chen, and K. Liu, “A study on
multiform multi-objective evolutionary optimization, Memetic Comput-
ing, vol. 13, no. 3, pp. 307–318, sep 2021.
[52] X. Ma, J. Yin, A. Zhu, X. Li, Y. Yu, L. Wang, Y. Qi, and Z. Zhu,
“Enhanced multifactorial evolutionary algorithm with meme helper-
tasks,” IEEE Transactions on Cybernetics, vol. 52, no. 8, pp. 7837–7851,
2022.
[53] Y. Feng, L. Feng, S. Kwong, and K. C. Tan, “A multivariation multifacto-
rial evolutionary algorithm for large-scale multiobjective optimization,”
IEEE Transactions on Evolutionary Computation, vol. 26, no. 2, pp.
248–262, 2022.
[54] K. Chen, B. Xue, M. Zhang, and F. Zhou, An evolutionary multitasking-
based feature selection method for high-dimensional classification,”
IEEE Transactions on Cybernetics, vol. 52, no. 7, pp. 7172–7186, 2022.
[55] ——, “Evolutionary multitasking for feature selection in high-
dimensional classification via particle swarm optimization,” IEEE Trans-
actions on Evolutionary Computation, vol. 26, no. 3, pp. 446–460, 2022.
[56] S. Yao, Z. Dong, X. Wang, and L. Ren, “A Multiobjective multifactorial
optimization algorithm based on decomposition and dynamic resource
allocation strategy,” Information Sciences, vol. 511, pp. 18–35, 2020.
[Online]. Available: https://doi.org/10.1016/j.ins.2019.09.058
[57] T. Wei and J. Zhong, “Towards Generalized Resource Allocation on
Evolutionary Multitasking for Multi-Objective Optimization, IEEE
Computational Intelligence Magazine, vol. 16, no. 4, pp. 20–37, 2021.
[58] K. Deb and H. Gupta, “Introducing robustness in multi-objective opti-
mization,” Evolutionary Computation, vol. 14, no. 4, pp. 463–494, 2006.
[59] C. Liang and S. Mahadevan, “Pareto surface construction for multi-
objective optimization under uncertainty,” Structural and Multidisci-
plinary Optimization, vol. 55, no. 5, pp. 1865–1882, 2017.
[60] A. Shapiro, D. Dentcheva, and A. Ruszczyński, Lectures on stochastic
programming: modeling and theory. SIAM, 2014.
[61] K. Deb, L. Thiele, M. Laumanns, and E. Zitzler, “Scalable multi-
objective optimization test problems, in Evolutionary Computation,
2002. CEC ’02. Proceedings of the 2002 Congress on, vol. 1, May
2002, pp. 825–830.
[62] E. Zitzler and L. Thiele, “Multiobjective evolutionary algorithms: a com-
parative case study and the strength pareto approach, IEEE Transactions
on Evolutionary Computation, vol. 3, no. 4, pp. 257–271, 1999.
[63] K. Deb and R. Agrawal, “Simulated binary crossover for continuous
search space,” Complex Systems, vol. 9, pp. 115–48, April 1995.
[64] K. Deb and M. Goyal, “A combined genetic adaptive search (GeneAS) for engineering design,” Computer Science and Informatics, vol. 26, pp. 30–45, 1999.
[65] I. Das and J. E. Dennis, “Normal-boundary intersection: A new method
for generating the pareto surface in nonlinear multicriteria optimization
problems,” Siam Journal on Optimization, vol. 8, no. 3, pp. 631–657,
1996.
[66] E. Strubell, A. Ganesh, and A. McCallum, “Energy and policy con-
siderations for deep learning in nlp,” arXiv preprint arXiv:1906.02243,
2019.
[67] L. F. W. Anthony, B. Kanding, and R. Selvan, “Carbontracker: Tracking
and predicting the carbon footprint of training deep learning models,”
arXiv preprint arXiv:2007.03051, 2020.
[68] Y.-S. Ong and A. Gupta, “Air 5: Five pillars of artificial intelligence
research,” IEEE Transactions on Emerging Topics in Computational
Intelligence, vol. 3, no. 5, pp. 411–415, 2019.
[69] R. Caruana, “Multitask learning,” Machine Learning, vol. 28, no. 1, pp.
41–75, 1997.
[70] M.-L. Zhang and Z.-H. Zhou, “A review on multi-label learning al-
gorithms,” IEEE Transactions on Knowledge and Data Engineering,
vol. 26, no. 8, pp. 1819–1837, 2014.
[71] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,
arXiv preprint arXiv:1412.6980, 2014.
[72] I. Loshchilov and F. Hutter, “Cma-es for hyperparameter optimization
of deep neural networks,” arXiv preprint arXiv:1604.07269, 2016.
[73] G. Tsoumakas, I. Katakis, and I. Vlahavas, “Mining multi-label data,
in Data mining and knowledge discovery handbook. Springer, 2009,
pp. 667–685.
Zefeng Chen received the B. Sc. degree in Infor-
mation and Computational Science from Sun Yat-
sen University, Guangzhou, China, in 2013, and the
M. Sc. degree in Computer Science and Technol-
ogy from South China University of Technology,
Guangzhou, China, in 2016, and the Ph.D. degree
in Computer Science and Technology from Sun
Yat-sen University, Guangzhou, China, in 2019. He
was a Post-Doctoral Research Fellow working with
Prof. Yew-Soon Ong at the School of Computer
Science and Engineering, Nanyang Technological
University, Singapore, from October 2019 to October 2021. Currently, he
is an Assistant Professor in the School of Artificial Intelligence, Sun Yat-sen
University (SYSU). His current research interests mainly include evolutionary
computation, evolutionary learning and data-driven optimization.
Abhishek Gupta received the PhD degree in En-
gineering Science from the University of Auckland,
New Zealand, in 2014. He is currently a Scientist in
the Singapore Institute of Manufacturing Technol-
ogy, a research institute in Singapore's Agency for
Science, Technology and Research (A*STAR). He
also holds a joint appointment with the School of
Computer Science and Engineering at the Nanyang
Technological University. Abhishek has diverse re-
search experience in computational science. Current-
ly, his main interests lie in the theory and algorithms
of transfer and multitask optimization, neuroevolution, surrogate modeling,
and scientific machine learning. Abhishek is the recipient of the 2019 and
the 2023 IEEE Transactions on Evolutionary Computation Outstanding Paper
Award, for foundational works on evolutionary multitasking. He received the
IEEE Transactions on Emerging Topics in Computational Intelligence 2021
Outstanding Associate Editor Award. He is also editorial board member of
the Complex & Intelligent Systems journal, the Memetic Computing journal,
and the Springer book series on Adaptation, Learning, and Optimization.
Lei Zhou received the B.E. degree from the School
of Computer Science and Technology, Shandong
University, Shandong, China, in 2014, and the Ph.D.
degree from the College of Computer Science,
Chongqing University, Chongqing, China, in 2019.
His current research interests include evolutionary
computations, memetic computing, as well as trans-
fer learning and optimization.
Yew-Soon Ong (M'99–SM'12–F'18) received
the Ph.D. degree in artificial intelligence in com-
plex engineering design from the University of
Southampton, U.K., in 2003. He is a President Chair
Professor in Computer Science at the Nanyang Tech-
nological University (NTU) and concurrently the
Chief Artificial Intelligence Scientist of the Agen-
cy for Science, Technology and Research (A*Star)
Singapore. At NTU, he serves as co-Director of
the Singtel-NTU Cognitive & Artificial Intelligence
Joint Lab. His core research interest is in artificial
and computational intelligence where he has received four IEEE outstanding
paper awards. He was listed as a Thomson Reuters highly cited researcher
and among the World’s Most Influential Scientific Minds. He is the in-
augural Editor-in-Chief of the IEEE Transactions on Emerging Topics in
Computational Intelligence and an associate editor of the IEEE Transactions on Evolutionary Computation, the IEEE Transactions on Neural Networks and Learning Systems, the IEEE Transactions on Cybernetics, and the IEEE Transactions on Artificial Intelligence.
... Secondly, emergent abilities of AGI in models of ETO are still unknown. Most works of ETO [28] are focused on multitasking optimization (MTO). There are 2 works for scaling up in MTO settings. ...
... 1 st one is "Scalable Transfer Evolutionary Optimization: Coping With Big Task Instances", where they handle the evolutionary scenarios beyond many source task instances [29]. Another work [28] is using subsampled smalldata evolutionary tasks as auxiliary source tasks for scaling up with the title of "Scaling Multiobjective Evolution to Large Data With Minions: A Bayes-Informed Multitask Approach". As a new model or paradigm, single-objective to multiobjective optimization or SMO belongs to the third kind of "complex optimization" in the ETO survey [20], which focus on learning from the single-objective tasks. ...
... Those two paradigms actually differ from each other in its nature in some sense, as we have seen in Section 4.1 of "rMeets". Most works of ETO are for MTO, we expect that well-studied MTO and [28,29] many scaling up tools for MTO can help explore emergent abilities better. ...
Article
Full-text available
Towards artificial general intelligence, emergent abilities of large language models (LLMs) are observed wildly especially for well-known GPTs, which are due to scaling up primarily along three factors: training computation, model parameters, and dataset size. Scaling up makes emergence. Inspired by the insights for LLMs case, we scale up the number of auxiliary tasks (from 2 to many) to boost the core task for the model of single-objective to multi-objective optimization (SMO) in evolutionary transfer optimization (ETO), therefore achieve a new version of SMO in the open benchmarks of vehicle routing problems, permutation flow shop scheduling problems and travel salesman problems. We name the new SMO armed with transfer core as a “Thousand-Hand Bodhisattva” with many arms or hands and analyze its emergent abilities.
... M ULTI-TASK evolutionary optimization (MTEO) has stood as an emerging framework to solve complex optimization problems, undergoing remarkable advancements [1,2,3,4] since the multi-factorial evolutionary algorithm (MFEA) was proposed [5]. MTEO aims to address situations where there are multiple related tasks or objectives being optimized simultaneously, with the goal of achieving better performance on each task. ...
Preprint
Multi-Task Evolutionary Optimization (MTEO), an important field focusing on addressing complex problems through optimizing multiple tasks simultaneously, has attracted much attention. While MTEO has been primarily focusing on task similarity, there remains a hugely untapped potential in harnessing the shared characteristics between different domains to enhance evolutionary optimization. For example, real-world complex systems usually share the same characteristics, such as the power-law rule, small-world property, and community structure, thus making it possible to transfer solutions optimized in one system to another to facilitate the optimization. Drawing inspiration from this observation of shared characteristics within complex systems, we set out to extend MTEO to a novel framework - multi-domain evolutionary optimization (MDEO). To examine the performance of the proposed MDEO, we utilize a challenging combinatorial problem of great security concern - community deception in complex networks as the optimization task. To achieve MDEO, we propose a community-based measurement of graph similarity to manage the knowledge transfer among domains. Furthermore, we develop a graph representation-based network alignment model that serves as the conduit for effectively transferring solutions between different domains. Moreover, we devise a self-adaptive mechanism to determine the number of transferred solutions from different domains and introduce a novel mutation operator based on the learned mapping to facilitate the utilization of knowledge from other domains. Experiments on eight real-world networks of different domains demonstrate MDEO superiority in efficacy compared to classical evolutionary optimization. Simulations of attacks on the community validate the effectiveness of the proposed MDEO in safeguarding community security.
... Liu proposed a discriminative reconstruction network for implicit knowledge to transfer implicit knowledge between different LSMOP tasks to accelerate convergence [26]. Chen constructed auxiliary tasks with small data sets to accelerate the evolution of the target LSMOP tasks, and also using a bayes-based approach to measure the relationship between tasks and achieve adaptive resource allocation [27]. Feng constructed multiple helper tasks from simple search spaces and transferred the knowledge to target task to accelerate solving LSMOP [28]. ...
Article
Full-text available
The decomposable feature of operations in the welding shop scheduling scenario results in a vast search space, posing challenges for the design of traditional optimization algorithms. Addressing the multi-objective distributed heterogeneous welding shop scheduling problem (DHWSP), this work introduces a generalized multitasking framework. It establishes an auxiliary task by employing knowledge-and-learning-synergy neighborhood search, thereby enhancing the convergence and diversity of the original task. In this framework, an enhanced competitive swarm optimizer is adopted as the original task for DHWSP. Additionally, knowledge expression and transfer strategies are designed to expedite the comprehensive performance of each task by leveraging knowledge gained from search results. Finally, a memetic algorithm based on the multitasking framework is proposed for DHWSP. The effectiveness of the algorithm is validated through extensive experiments on 20 DHWSP instances. Numerical experimental results indicate that the proposed multitasking framework can significantly improve algorithmic comprehensive performance, demonstrating its efficacy in addressing the multi-objective DHWSP within a complex search space.
... And emergent abilities in the 3 rd battlefield are far from mature so far, whether for SMO or MTO. In MTO, there exist 2 works for scaling up [28] [29]. For SMO, we firstly invent discrete SMO and are familiar with it, therefore, we will explore the emergent abilities of ETO restricted in SMO settings. ...
Article
Full-text available
Towards the mission of building artificial general intelligence, the unpredictable phenomena of emergent abilities in large language models are quite impressive and inspiring especially in GPTs. Emergence are achieved by scaling up three variables: training computation, model parameters, and dataset size. Following the spirit above, we scale up three large factors of duration, gap and population for the new model of single-objective to multi-objective optimization (SMO) in the background of evolutionary transfer optimization (ETO), which serves as a complementary part to our previous work of "tMeets" (scaling up the number of auxiliary tasks to get a "t"housand-hand bodhisattva). We name our paper here as "lMeets" with "large topics" and hope that both tMeets and lMeets could make up the full picture of scaling up or emergent abilities in SMO, which will be tested in vehicle routing problems benchmarks.
Preprint
Full-text available
Single-objective to multi-objective optimization (SMO) is proved to be a new efficient kind of evolutionary transfer optimization (ETO) for both continuous and discrete cases, for both well-known benchmarks and real-world applications , working as a promising artificial general intelligence (AGI) tool and/or system. At problem side, SMO ranges from vehicle routing problem with time windows to vehicle routing problem, which can be further simplified/reduced to travel salesman problem (TSP) here. For algorithmic side, SMO are also developed with global search like genetic algorithms, local search (insert) and memetic algorithms combing those two kinds of search mentioned above, which inspires our extension to fractal search (FS, whose self-similarity exists at all scales or when scaling up, whose diffusion can be Gaussian walk). In this paper or "fMeets" (borrow "f" from fractal), we run heavy computational simulations of "TSP+FS" in well-known TSP benchmark.
Preprint
Full-text available
Production quality during hot rolling in the iron and steel industry serves as a core competence of manufacturing equipment; it is usually difficult to test online and can only be tested offline after the task of rolling thick slabs into thin coils is completed. In order to improve production quality in hot rolling via artificial general intelligence (AGI), we develop a solution method within the computational framework of single-objective to multi-objective optimization (SMO) under evolutionary transfer optimization (ETO). Our solution method strives for "emergent abilities" of AGI similar to what we have seen in large language models. We name the new SMO approach, aimed at intelligent manufacturing equipment and technology and armed with a transfer core for "hot" rolling quality analytics, "hMeets"; it is a direct real-world application of our previous work "tMeets".
Article
Over the last few years, big data have emerged as a paradigm for processing and analyzing a large volume of data. Coupled with other paradigms, such as cloud computing, service computing, and Internet of Things, big data processing takes advantage of the underlying cloud infrastructure, which allows hosting and managing massive amounts of data, while service computing allows processing and delivering various data sources as on-demand services. This synergy between multiple paradigms has led to the emergence of big services, a cross-domain, large-scale, and big-data-centric service model. Apart from the adaptation issues (e.g., the need for high reactivity to changes) inherited from other service models, the massiveness and heterogeneity of big services add a new factor of complexity to the way such a large-scale service ecosystem is managed in case of execution deviations. Indeed, big services are often subject to frequent deviations at both the functional (e.g., service failure, QoS degradation, and IoT resource unavailability) and data (e.g., data source unavailability or access restrictions) levels. Handling these execution problems is beyond the capacity of traditional web/cloud service management tools, and the majority of big service approaches have targeted specific management operations, such as selection and composition. To maintain a moderate state and high quality of their cross-domain execution, big services should be continuously monitored and managed in a scalable and autonomous way. To cope with the absence of self-management frameworks for large-scale services, the goal of this work is to design an autonomic management solution that takes full control of big services in an autonomous and distributed lifecycle process. We combine autonomic computing and big data processing paradigms to endow big services with self-* and parallel processing capabilities.
The proposed management framework takes advantage of the well-known MapReduce programming model and Apache Spark, and manages the big service's related data using knowledge graph technology. We also define a scalable embedding model that allows processing and learning latent big service knowledge in a distributed manner. Finally, a cooperative decision mechanism is defined to trigger non-conflicting management policies in response to the captured deviations of the running big service. Big services' management tasks (monitoring, embedding, and decision), as well as the core modules (autonomic managers' controller, embedding module, and coordinator), are implemented on top of Apache Spark as MapReduce jobs, while the processed data are represented as resilient distributed dataset (RDD) structures. To exploit the shared information exchanged between the workers and the master node (coordinator), and for further resolution of conflicts between management policies, we endowed the proposed framework with a lightweight communication mechanism that allows transferring useful knowledge between the running MapReduce tasks and filtering inappropriate intermediate data (e.g., conflicting actions). The experimental results proved the increased quality of embeddings and the high performance of autonomic managers in a parallel and cooperative setting, thanks to the shared knowledge.
Chapter
When faced with large-instance datasets, existing feature selection methods based on evolutionary algorithms still face the challenge of high computational cost. To address this issue, this paper proposes a scalable evolutionary algorithm for feature selection on large-instance datasets, namely the transfer-learning-based co-surrogate-assisted evolutionary multitask algorithm (cosEMT). First, we tackle feature selection on large-instance datasets via an evolutionary multitasking framework. Co-surrogate models are constructed to measure the similarity between each auxiliary task and the main task, and knowledge transfer between tasks is realized through instance-based transfer learning. Based on the numerical relationship between the relative and absolute numbers of transferable instances, we propose a novel dynamic resource allocation strategy to make more efficient use of limited computational resources and accelerate evolutionary convergence. Meanwhile, an adaptive surrogate model update mechanism is proposed to balance the exploration and exploitation of the base optimizer embedded in the cosEMT framework. Finally, the proposed algorithm is compared with several state-of-the-art feature selection algorithms on twelve large-instance datasets. The experimental results show that the cosEMT framework achieves significant acceleration in convergence speed and obtains high-quality solutions, verifying that cosEMT is a highly competitive method for feature selection on large-instance datasets.
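The similarity-proportional resource allocation described in this abstract can be illustrated in a few lines. This is a toy sketch, not the cosEMT implementation: the function name `allocate_budget`, the similarity scores, and the uniform floor that keeps weakly related tasks from being starved are all illustrative assumptions.

```python
def allocate_budget(similarities, total_evals, floor=0.05):
    """Split an evaluation budget across auxiliary tasks in proportion
    to their estimated similarity to the main task. `floor` is the
    fraction of the budget shared uniformly so that no task is starved
    entirely, even when its similarity estimate is low."""
    n = len(similarities)
    total_sim = sum(similarities)
    budgets = []
    for s in similarities:
        prop = s / total_sim if total_sim > 0 else 1.0 / n
        # Blend a uniform floor with the similarity-proportional share.
        share = floor / n + (1.0 - floor) * prop
        budgets.append(round(total_evals * share))
    return budgets

# Three auxiliary tasks with decreasing similarity to the main task.
budgets = allocate_budget([0.6, 0.3, 0.1], total_evals=1000)
```

Rounding can leave the shares off by an evaluation or two; a real implementation would reconcile the remainder.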
Article
In an era of parallel computing, evolutionary multitasking optimization (EMT) has become a popular optimization paradigm due to its ability to optimize several tasks simultaneously. Common knowledge, when transferred among tasks, can improve the solution quality and efficiency for each component optimization task. Therefore, the performance of traditional EMT algorithms mostly relies on the correlation between tasks. In the field of EMT, a key issue that urgently needs to be solved is the impact of negative transfer when tackling optimization tasks with low correlation. To overcome this shortcoming, this paper proposes a multiobjective EMT algorithm, EMT-GFK. In the proposed algorithm, a union subspace of the optimization tasks is designed to extract compact information. Furthermore, geodesic-flow-kernel-based domain adaptation is applied to learn a nonlinear mapping matrix, which can increase the correlation between tasks. Numerical experiments and result analysis on the MO-MTO test suites demonstrate the effectiveness of the proposed EMT-GFK.
Article
Full-text available
Hyperparameter optimization (HPO) is a necessary step to ensure the best possible performance of Machine Learning (ML) algorithms. Several methods have been developed to perform HPO; most of these are focused on optimizing one performance measure (usually an error-based measure), and the literature on such single-objective HPO problems is vast. Recently, though, algorithms have appeared that focus on optimizing multiple conflicting objectives simultaneously. This article presents a systematic survey of the literature published between 2014 and 2020 on multi-objective HPO algorithms, distinguishing between metaheuristic-based algorithms, metamodel-based algorithms and approaches using a mixture of both. We also discuss the quality metrics used to compare multi-objective HPO procedures and present future research directions.
Article
Full-text available
Until recently, the potential to transfer evolved skills across distinct optimization problem instances (or tasks) was seldom explored in evolutionary computation. The concept of evolutionary multitasking (EMT) fills this gap. It unlocks a population’s implicit parallelism to jointly solve a set of tasks, hence creating avenues for skills transfer between them. Despite it being early days, the idea of EMT has begun to show promise in a range of real-world applications. In the backdrop of recent advances, the contribution of this paper is twofold. First, a review of several application-oriented explorations of EMT in the literature is presented; the works are assimilated into half a dozen broad categories according to their respective application domains. Each of these six categories elaborates fundamental motivations to multitask, and contains a representative experimental study (referred from the literature). Second, a set of recipes is provided showing how problem formulations of general interest, those that cut across different disciplines, could be transformed in the new light of EMT. Our discussions emphasize the many practical use-cases of EMT, and are intended to spark future research towards crafting novel algorithms for real-world deployment.
Article
Full-text available
Feature selection, as one of the dimension reduction methods, is a crucial processing step in dealing with high-dimensional data. It tries to preserve a feature subset representing the whole feature space, aiming to reduce redundancy and increase classification accuracy. Since the two objectives usually conflict with each other, feature selection is modeled as a multiobjective problem. However, the large search space and discrete Pareto front make it difficult for existing evolutionary multiobjective algorithms. Classic evolutionary computation methods, which are often applied to the feature selection problem straightforwardly, gradually expose their inefficiency in the search process. Hence, a particle swarm optimization based multiobjective memetic algorithm for high-dimensional feature selection is designed in this paper to address the above shortcomings. Its basic idea is to model feature selection as a multiobjective optimization problem that simultaneously optimizes the number of features and the classification accuracy under supervision, in which information-entropy-based initialization and adaptive local search are designed to improve search efficiency. Moreover, a new particle velocity update rule considering both the convergence and diversity of solutions is designed to update particles, and a fast discrete nondominated sorting strategy is designed to rank the Pareto solutions. These strategies enable the proposed algorithm to achieve better performance in both the quality and the size of the feature subset. The experimental results show that the proposed algorithm can improve the quality of the Pareto fronts evolved by state-of-the-art algorithms for feature selection.
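The nondominated sorting mentioned above rests on Pareto dominance over the two feature selection objectives, here phrased as minimizing the number of selected features and the classification error. A minimal sketch of extracting the first front, with hypothetical candidate subsets (the paper's actual sorting strategy is more elaborate):

```python
def dominates(a, b):
    """a dominates b if a is no worse than b in both objectives
    (both minimized) and the two points are not identical."""
    return a[0] <= b[0] and a[1] <= b[1] and a != b

def nondominated_front(points):
    """Return the first Pareto front of (n_features, error) pairs."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical candidates: (number of selected features, test error).
candidates = [(5, 0.12), (8, 0.10), (5, 0.15), (12, 0.10), (3, 0.20)]
front = nondominated_front(candidates)
```

Here (5, 0.15) falls out because (5, 0.12) uses the same number of features with lower error, and (12, 0.10) falls out because (8, 0.10) matches its error with fewer features.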
Article
Full-text available
Evolutionary algorithms possess strong problem-solving abilities and have been applied in a wide range of applications. However, they still suffer from a high computational burden and poor generalization ability. To overcome these limitations, numerous studies consider conducting knowledge extraction across distinct optimization task domains. Among these research strands, one representative branch is evolutionary multi-task optimization (EMTO), which aims to resolve multiple optimization tasks simultaneously. The implicit parallelism underlying evolutionary algorithms fits well with the EMTO framework, giving rise to a growing body of EMTO studies. This review presents a detailed exposition of research in the EMTO area. We reveal the core components for designing EMTO algorithms. Subsequently, we organize the works lying at the fusion of EMTO and traditional evolutionary algorithms. By analyzing the associations among diverse strategies in different branches of EMTO, this review uncovers research trends and potentially important directions, with additional interesting real-world applications mentioned.
Article
Full-text available
Neural networks are rapidly gaining popularity in chemical modeling and Quantitative Structure-Activity Relationship (QSAR) studies thanks to their ability to handle multitask problems. However, the outcomes of neural networks depend on the tuning of several hyperparameters, whose small variations can often strongly affect their performance. Hence, optimization is a fundamental step in training neural networks, although in many cases it can be very expensive from a computational point of view. In this study, we compared four of the most widely used approaches for tuning hyperparameters, namely grid search, random search, the tree-structured Parzen estimator, and genetic algorithms, on three multitask QSAR datasets. We mainly focused on parsimonious optimization and thus took into account not only the performance of the neural networks but also the computational time. Furthermore, since the optimization approaches do not directly provide information about the influence of hyperparameters, we applied experimental design strategies to determine their effects on neural network performance. We found that genetic algorithms, the tree-structured Parzen estimator, and random search require on average 0.08% of the hours required by grid search; in addition, the tree-structured Parzen estimator and genetic algorithms provide better results than random search.
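The budget gap between grid search and the sampling-based tuners reported above follows directly from how the two enumerate a hyperparameter space: grid search must evaluate every combination, while random search draws a fixed number of configurations. A minimal sketch with an assumed, illustrative search space (the parameter names and ranges are not from the study):

```python
import itertools
import random

# Assumed, illustrative hyperparameter space (not from the study).
space = {
    "learning_rate": [1e-4, 1e-3, 1e-2, 1e-1],
    "hidden_units": [32, 64, 128, 256],
    "dropout": [0.0, 0.25, 0.5],
    "batch_size": [16, 32, 64, 128],
}

# Grid search evaluates every combination: 4 * 4 * 3 * 4 = 192 runs.
grid_configs = list(itertools.product(*space.values()))

# Random search evaluates a fixed, much smaller sample of the same space.
rng = random.Random(0)
random_configs = [
    {k: rng.choice(v) for k, v in space.items()} for _ in range(20)
]
```

Even in this tiny space, random search at 20 trials spends about a tenth of the grid's budget; the gap widens combinatorially as parameters are added.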
Article
Feature selection (FS) has received significant attention since the use of a well-selected subset of features may achieve better classification performance than the full feature set in many real-world applications. It can be considered a multiobjective optimization problem consisting of two objectives: 1) minimizing the number of selected features and 2) maximizing classification performance. Ant colony optimization (ACO) has shown its effectiveness in FS due to its problem-guided search operator and flexible graph representation. However, there is a lack of effective ACO-based approaches for multiobjective FS that handle the problematic characteristics originating from feature interactions and highly discontinuous Pareto fronts. This article presents an information-theory-based nondominated sorting ACO (called INSA) to address the aforementioned difficulties. First, the probabilistic function in ACO is modified based on information theory to identify the importance of features; second, a new ACO strategy is designed to construct solutions; and third, a novel pheromone updating strategy is devised to ensure the high diversity of tradeoff solutions. INSA's performance is compared with four machine-learning-based methods, four representative single-objective evolutionary algorithms, and six state-of-the-art multiobjective ones on 13 benchmark classification datasets, which consist of both low- and high-dimensional samples. The empirical results verify that INSA is able to obtain solutions with better classification performance using features whose count is similar to or less than those obtained by its peers.
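The information-theoretic ingredient of such approaches can be illustrated with plain mutual information I(X; Y) between a discrete feature X and the class label Y, which a probabilistic construction rule could use as a heuristic weight for that feature. A minimal sketch (the function and data are illustrative, not INSA's implementation):

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """I(X; Y) in bits, estimated from paired samples of two
    discrete variables via their empirical distributions."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        p_xy = c / n
        # Each joint outcome contributes p(x,y) * log2(p(x,y) / (p(x)p(y))).
        mi += p_xy * math.log2(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi

# A binary feature that perfectly predicts a binary label carries 1 bit.
feature = [0, 0, 1, 1]
label = ["a", "a", "b", "b"]
mi = mutual_information(feature, label)
```

A feature independent of the label scores near zero under the same estimator, so ranking features by this quantity favors those that are individually informative.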
Article
Feature learning is a promising approach to image classification. However, it is difficult due to high image variations. When the training data are small, it becomes even more challenging, due to the risk of overfitting. Multitask feature learning has shown the potential for improving generalization. However, existing methods are not effective for handling the case that multiple tasks are partially conflicting. Therefore, for the first time, this article proposes to solve a multitask feature learning problem as a multiobjective optimization problem by developing a genetic programming approach with a new representation to image classification. In the new approach, all the tasks share the same solution space and each solution is evaluated on multiple tasks so that the objectives of all the tasks can be optimized simultaneously using a single population. To learn effective features, a new and compact program representation is developed to allow the new approach to evolving solutions shared across tasks. The new approach can automatically find a diverse set of nondominated solutions that achieve good tradeoffs between different tasks. To further reduce the risk of overfitting, an ensemble is created by selecting nondominated solutions to solve each image classification task. The results show that the new approach significantly outperforms a large number of benchmark methods on six problems consisting of 15 image classification datasets of varying difficulty. Further analysis shows that these new designs are effective for improving the performance. The detailed analysis clearly reveals the benefits of solving multitask feature learning as multiobjective optimization in improving the generalization.
Article
Evolutionary multitasking optimization (EMTO) is an emerging paradigm for solving several problems simultaneously. Due to its flexible framework, EMTO has been naturally applied to multi-objective optimization to exploit synergy among distinct multi-objective problem domains. However, most studies barely take into account the scenario where some problems cannot converge under restrictive computational budgets with the traditional EMTO framework. To dynamically allocate computational resources for multi-objective EMTO problems, this article proposes a generalized resource allocation (GRA) framework by considering both the theoretical grounds of conventional resource allocation and the characteristics of multi-objective optimization. In the proposed framework, a normalized attainment function is designed for better quantifying convergence status, a multi-step nonlinear regression is proposed to serve as a stable performance estimator, and the algorithmic procedure of conventional resource allocation is refined for flexibly adjusting resource allocation intensity and including knowledge transfer information. It has been verified that the GRA framework can enhance the overall performance of the multi-objective EMTO algorithm in solving benchmark problems, complex problems, many-task problems, and a real-world application problem. Notably, the proposed GRA framework served as a crucial component of the winner algorithm in the Competition on Evolutionary Multi-Task Optimization (Multi-objective Optimization Track) at the IEEE 2020 World Congress on Computational Intelligence.
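The flavor of convergence-aware allocation can be sketched with a much simpler stand-in for GRA's estimator: an ordinary least-squares slope over each task's recent indicator history (larger indicator = better), with generations shared in proportion to the improvement rate. All names and numbers here are illustrative assumptions, not the paper's normalized attainment function or multi-step nonlinear regression.

```python
def slope(history):
    """Ordinary least-squares slope of indicator values vs. generation
    index, used here as a crude rate-of-improvement estimator."""
    n = len(history)
    mx = (n - 1) / 2
    my = sum(history) / n
    num = sum((i - mx) * (v - my) for i, v in enumerate(history))
    den = sum((i - mx) ** 2 for i in range(n))
    return num / den

def allocate_generations(histories, total_gens):
    """Share total_gens among tasks in proportion to each task's
    estimated improvement rate; stagnating tasks get nothing extra."""
    rates = [max(slope(h), 0.0) for h in histories]
    total = sum(rates)
    if total == 0.0:
        return [total_gens // len(histories)] * len(histories)
    return [round(total_gens * r / total) for r in rates]

# Task 0 is still improving quickly; task 1 has nearly converged.
gens = allocate_generations([[0.1, 0.3, 0.5], [0.80, 0.81, 0.82]], 100)
```

The still-improving task receives the bulk of the budget, which is the intended behavior when some tasks would otherwise fail to converge under a tight overall budget.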
Article
For solving large-scale multi-objective problems (LSMOPs), transformation-based methods have shown promising search efficiency: they transform the original problem into a new simplified problem and perform the optimization in the simplified spaces instead of the original problem space. Owing to the useful information provided by the simplified search space, performance on LSMOPs has been improved to some extent. However, it is worth noting that the original problem changes after the transformation, so there is no guarantee that the original global or near-global optimum is preserved in the newly generated space. In this paper, we propose to solve LSMOPs via a multi-variation multifactorial evolutionary algorithm. In contrast to existing transformation-based methods, the proposed approach conducts an evolutionary search concurrently on both the original space of the LSMOP and multiple simplified spaces constructed in a multi-variation manner. In this way, useful traits found along the search can be seamlessly transferred from the simplified problem spaces to the original problem space toward efficient problem solving. Besides, since the evolutionary search is also performed in the original problem space, preservation of the original global optimal solution is guaranteed. To evaluate the performance of the proposed framework, comprehensive empirical studies are carried out on a set of LSMOPs with 2-3 objectives and 500-5000 variables. The experimental results highlight the efficiency and effectiveness of the proposed method compared to state-of-the-art methods for large-scale multi-objective optimization.
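The "simplified space" idea that transformation-based methods rely on can be illustrated with a fixed random linear embedding: search proceeds in a low-dimensional space while every candidate is expanded back into the original high-dimensional decision space for evaluation. This is a generic sketch under stated assumptions, not the paper's multi-variation construction; the dimensionalities and the Gaussian projection are illustrative.

```python
import random

random.seed(1)
D, d = 1000, 10  # original and simplified dimensionalities (assumed)

# A fixed random projection defines the simplified (embedded) space.
A = [[random.gauss(0.0, 1.0) for _ in range(d)] for _ in range(D)]

def expand(y):
    """Map a candidate from the d-dimensional simplified space back
    into the D-dimensional original decision space via x = A y."""
    return [sum(A[i][j] * y[j] for j in range(d)) for i in range(D)]

y = [0.1] * d   # a candidate living in the 10-dimensional search space
x = expand(y)   # its image in the original 1000-dimensional space
```

The caveat the abstract raises is visible here: the image of the simplified space is only a d-dimensional slice of the original space, so an optimum lying off that slice is unreachable unless search also runs in the original space.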
Article
This article is dedicated to automatically exploring efficient portrait parsing models that are easily deployed on edge computing or terminal devices. In the interest of the tradeoff between resource cost and performance, we design a multiobjective reinforcement learning (RL)-based neural architecture search (NAS) scheme, which comprehensively balances accuracy, parameters, FLOPs, and inference latency. Under varying hyperparameter configurations, the search procedure emits a set of excellent objective-oriented architectures. The combination of two-stage training with precomputed, memory-resident feature maps effectively reduces the time consumption of the RL-based NAS method, so that we complete approximately 1000 search iterations in two GPU days. To accelerate the convergence of the lightweight candidate architectures, we incorporate knowledge distillation into the training of the search process. This also provides a reasonable evaluation signal to the RL controller that enables it to converge well. In the end, we conduct full training of the outstanding Pareto-optimal architectures, obtaining a series of excellent portrait parsing models (with only approximately 0.3M parameters). Furthermore, we directly transfer the architectures searched on CelebAMask-HQ (portrait parsing) to other portrait and face segmentation tasks. Finally, we achieve state-of-the-art performance of 96.5% mIoU on EG1800 (portrait segmentation) and a 91.6% overall F1-score on HELEN (face labeling). That is, our models significantly surpass manually designed networks in accuracy, with lower resource consumption and higher real-time performance.