Cooperative coevolutionary methods
Nicolás García-Pedrajas, César Hervás-Martínez, and Domingo Ortiz-Boyer
Abstract
This chapter presents Covnet, a cooperative coevolutionary model for evolving artificial neural networks. This
model is based on the idea of coevolving subnetworks that must cooperate to form a solution for a specific problem,
instead of evolving complete networks. The combination of these subnetworks is part of a coevolutionary process. The best
combinations of subnetworks must be evolved together with the coevolution of the subnetworks. Several subpopulations
of subnetworks coevolve cooperatively and genetically isolated. The individuals of every subpopulation are combined
to form whole networks. This is a different approach from most current models of evolutionary neural networks, which
try to develop whole networks. Covnet places as few restrictions as possible on the network structure, allowing the
model to reach a wide variety of architectures during the evolution and to be easily extensible to other kinds of neural
networks. The performance of the model in solving three real problems of classification is compared with a modular
network, the adaptive mixture of experts, and with the results presented in the literature. Covnet has shown better
generalization and produced smaller networks than the adaptive mixture of experts, and has also achieved results at
least comparable with the results in the literature.
Keywords
Automatic design of neural networks, cooperative coevolution, evolutionary computation, genetic algorithms, evolu-
tionary programming.
I. Introduction
In the area of neural network [1] design, one of the main problems is finding suitable architectures for
solving specific problems. The choice of such an architecture is very important, as a network smaller than
needed would be unable to learn, while a network larger than needed would end in over-training.
The problem of finding a suitable architecture and the corresponding weights of the network is a very
complex task (for a very interesting review of the matter the reader can consult [2]). Modular systems are
often used in machine learning as an approach for solving these complex problems. Moreover, in spite of the
fact that small networks are preferred because they usually lead to better performance, the error surfaces of
such networks are more rugged and have few good solutions [3]. In addition, there is much neuropsychological
evidence showing that the brain of humans and other animals consists of modules, which are subdivisions in
identifiable parts, each one with its own purpose and function [4].
The objective of this chapter is to show how Cooperative Coevolution, a recent paradigm within the field of
Evolutionary Computation, can be used to design such modular neural networks. Evolutionary computation
[5] [6] is a set of global optimization techniques that have been widely used in recent years for training and
automatically designing neural networks (see Section III). Some efforts have been made in designing modular
[7] neural networks with these techniques (e.g. [8]), but in almost all of them the design of the networks is
helped by methods outside evolutionary computation, or the application area for those models is limited to
very specific architectures.
This chapter is organised as follows: Section II explains the paradigm of cooperative coevolution; Section
III shows an application of cooperative coevolution to automatic neural network design; Section IV describes
the experiments carried out; and finally Section V states the conclusions of this chapter.
II. Cooperative coevolution
Cooperative coevolution [9] is a recent paradigm in the area of evolutionary computation focused on the
evolution of coadapted subcomponents without external interaction. In cooperative coevolution a number
of species are evolved together. The cooperation among the individuals is encouraged by rewarding the
individuals based on how well they cooperate to solve a target problem. The work on this paradigm has shown
The authors are with the Department of Computing and Numerical Analysis of the University of Córdoba. e-mail: {npedrajas,
chervas, dortiz}@uco.es
that cooperative coevolutionary models present many interesting features, such as specialization through
genetic isolation, generalization and efficiency [10]. Cooperative coevolution approaches the design of modular
systems in a natural way, as the modularity is part of the model. Other models need some a priori knowledge
to decompose the problem by hand. In many cases, either this knowledge is not available or it is not clear
how to decompose the problem.
This chapter describes a cooperative coevolutionary model called Covnet [11]. This model develops subnet-
works instead of whole networks. These modules are combined forming ensembles that constitute a network.
As M. A. Potter and K. A. De Jong [10] have stated, “to apply evolutionary algorithms effectively to increas-
ingly complex problems explicit notions of modularity must be introduced to provide reasonable opportunities
for solutions to evolve in the form of interacting coadapted subcomponents”.
The most distinctive feature of Covnet is the coevolution of modules without the intervention of any agent
external to the evolutionary process and without an external mechanism for combining subnetworks. Also,
the use of an evolutionary algorithm for the evolution of both the weights and the architecture allows the
model to be applied to tasks where no error function can be defined (e.g., game playing [12] or control [13]),
or where the derivatives of the error function cannot be obtained, so that an algorithm based on the
minimisation of that error, like the backpropagation learning rule, cannot be applied.
The most important contributions of Covnet are the following. First, it forms modular artificial neural
networks using cooperative coevolution. Every module must learn how to combine with the other modules
of the evolved network to be useful. Introducing the combination of modules into the evolutionary process
enforces the cooperation among the modules, as independently evolved modules are less likely to combine well
after the evolutionary process has finished.
Second, it develops a method for measuring the fitness of cooperative subcomponents in a coevolutionary
model. This method, based on three different criteria, could be applied to other cooperative coevolutionary
models not related to the evolution of neural networks. The current methods are based, almost exclusively,
on measuring the fitness of the networks where the module appears.
Third, it introduces a new hybrid evolutionary programming algorithm that places very few restrictions on the
subnetworks evolved. This algorithm produces very compact subnetworks, and even the evolved subnetworks
alone achieved very good performance in the test problems, as will be shown in the experimental section.
III. Automatic design of artificial neural networks by means of cooperative coevolution
The automatic design of artificial neural networks has two different approaches: parametric learning and
structural learning. In structural learning, both architecture and parametric information must be learned
through the process of training. Basically, we can consider three models of structural learning: Constructive
algorithms, destructive algorithms, and evolutionary computation.
Constructive algorithms [14] [15] [16] start with a small network (usually a single neuron). This network is
trained until it is unable to continue learning, then new components are added to the network. This process
is repeated until a satisfactory solution is found. These methods are usually trapped in local minima [17] and
tend to produce large networks. Destructive methods, also known as pruning algorithms [18], start with a big
network, that is able to learn but usually ends in over-fitting, and try to remove the connections and nodes
that are not useful. A major problem with pruning methods is measuring the relevance of the structural
components of the network in order to decide whether a connection or node must be removed.
Both methods, constructive and destructive, limit the number of available architectures, thus introducing
constraints in the search space of possible structures that may not be suitable to the problem. Although
these methods have been proved useful in simulated data [19] [20], their application to real problems has been
rather unsuccessful [21] [22] [23].
Evolutionary computation has been widely used to evolve neural network architectures and weights. There
have been many applications for parametric learning [24] and for both parametric and structural learning [25]
[17] [26] [27] [28] [29] [8] [30]. These works fall in two broad categories of evolutionary computation: genetic
algorithms and evolutionary programming.
Genetic algorithms are based on a representation independent of the problem, usually the representation
is a string of binary, integer or real numbers. This representation (the genotype) codifies a network (the
phenotype). This is a dual representation scheme. The ability to create better solutions in a genetic algorithm
relies mainly on the operation of crossover. This operator forms offspring by recombining representational
components from two members of the population.
The benefits of crossover come from the ability of forming connected substrings of the representation that
correspond to above-average solutions [5]. These substrings are called building blocks. Crossover is not effective
in environments where the fitness of an individual of the population is not correlated with the expected
ability of its representational components [31]. Such environments are called deceptive [32]. Deception is
a very important feature of most representations of neural networks, so crossover is usually avoided in
evolutionary neural networks [17].
One of the most important forms of deception arises from the many-to-one mapping from genotypes in the
representation space to phenotypes in the evaluation space. The existence of functionally equivalent networks
with different encodings makes the evolution inefficient, and it is unclear whether crossover would produce
fitter individuals from two members of the population. This problem is usually termed the permutation
problem [33] [34] or the competing conventions problem [35].
Evolutionary programming [36] is, for many authors, the most suited paradigm of evolutionary computation
for evolving artificial neural networks [17]. Evolutionary programming uses a representation natural for the
problem. Once the representation scheme has been chosen, mutation operators specific to the representation
scheme are defined. Evolutionary programming offers a major advantage over genetic algorithms when evolving
artificial neural networks: the representation scheme allows manipulating networks directly, avoiding the
problems associated with a dual representation.
The use of evolutionary learning for designing neural networks dates back no more than two decades (see
[2] or [35] for reviews). However, a lot of work has been done in these two decades, with many different
approaches and working models, for instance [25], [37], or [8]. Evolutionary computation has been used for
learning connection weights and for learning both architecture and connection weights. The main advantage
of evolutionary computation is that it performs a global exploration of the search space, avoiding becoming
trapped in local minima, as usually happens with local search procedures.
G. F. Miller et al. [38] proposed that evolutionary computation is a very good candidate to be used
to search the space of topologies because the fitness function associated with that space is complex, noisy,
non-differentiable, multi-modal and deceptive.
Almost all current models try to develop a global architecture, which is a very complex problem.
Although some attempts have been made at developing modular networks [39] [40], in most cases the modules
are combined only after the evolutionary process has finished, and not following a cooperative coevolutionary
model.
Few authors have devoted their attention to the cooperative coevolution of subnetworks. Some authors
have termed this kind of cooperative evolution (where the individuals must cooperate to achieve a good
performance) symbiotic evolution [41]. More formally, we should speak of mutualism, that is, the cooperation
of two individuals from different species that benefits both organisms.
R. Smalz and M. Conrad [26] developed a cooperative model where there are two populations: a population
of nodes, divided into clusters, and a population of networks that are combinations of neurons, one from each
cluster. Both populations are evolved separately.
B. A. Whitehead and T. D. Choate [29] developed a cooperative-competitive genetic model for Radial-Basis
Function (RBF) neural networks. In this work there is a population of genetically encoded neurons that
evolves both the centers and the widths of the radial basis functions. There is just one network that is formed
by the whole population of RBFs. The major problem, as in our approach, is to assign the fitness to each
node of the population, as the only performance measure available is for the whole network. This is well known
as the “credit apportionment problem”¹ [26] [9]. The credit assignment used by Whitehead and Choate is
restricted to RBF-like networks and very difficult to adapt to other kinds of networks.
D. W. Opitz and J. W. Shavlik [43] developed a model called ADDEMUP (Accurate anD Diverse Ensemble
Maker giving United Predictions). They evolved a population of networks by means of a genetic algorithm
and combined the networks in an ensemble with a linear combination. The competition among the networks
is encouraged with a diversity term added to the fitness of each network.
¹This problem can be traced back to the earliest attempts to apply machine learning to playing the game of checkers by Arthur
Samuel [42] in 1959.
D. E. Moriarty and R. Miikkulainen [30] [41] developed an actual cooperative model, called SANE, that
had some common points with R. Smalz and M. Conrad [26]. In this work they propose two populations: one
of nodes and another of networks that are combinations of the individuals from the population of nodes. Zhao
et al. [44] proposed a framework for cooperative coevolution, and applied that framework to the evolution of
RBF networks. Nevertheless, their work, more than a finished model, is an open proposal that aims at the
definition of the problems to be solved in a cooperative environment.
S.-B. Cho and K. Shimohara [4] developed a modular neural network evolved by means of genetic programming.
Each network is a complex structure formed by different modules which are codified by a tree structure.
X. Yao and Y. Liu [45] use the final population of networks developed using the EPNet [8] model to form
ensembles of neural networks. The combination of these networks produced better results than any isolated
network. Nevertheless, the cooperation among the networks takes place only after the evolutionary process
has finished. So, the model is neither cooperative nor coevolutionary.
A. Covnet: a cooperative coevolutionary model
Covnet is a cooperative coevolutionary model, that is, several species are coevolved together. Each
species is a subnetwork that constitutes a partial solution of a problem; the combination of several individuals
from different species constitutes the network that must be applied to the specific problem. The population
of subnetworks, which are called nodules, is made up of several subpopulations² that evolve independently.
Each one of these subpopulations constitutes a species. The combination of individuals from these different
subpopulations that coevolve together is the key factor of our model.
The evolution of coadapted subcomponents must address four major issues: problem decomposition, inter-
dependence among subcomponents, credit assignment and maintenance of diversity. Cooperative coevolution
gives a framework where these issues could be faced in a natural way. The problem decomposition is intrinsic
in the model. Each population will evolve different species that must cooperate in order to be rewarded with
high fitness values. There is no need for any a priori knowledge to decompose the problem by hand. The
interdependence among the subcomponents comes from the fact that the fitness of each individual depends
on how well the individual works together with the members of other species.
A nodule is made up of a variable number of nodes with free interconnection among them (see Figure 1),
that is, each node could have connections from input nodes, from other nodes of the nodule, and to output
nodes. More formally a nodule could be defined as follows:
Definition 1: (Nodule) A nodule is a subnetwork formed by a set of nodes with free interconnection among
them, the connections of these nodes from the input, and the connections of the nodes to the output. It cannot
have connections with any node belonging to another nodule.
The input and output layers of the nodules are common: they are the input and output layers of the network.
It is important to note that the genotype of the nodule has a one-to-one mapping to the phenotype, as the
many-to-one mapping between them is one of the main sources of deception and the permutation problem
[17].
In the same way we define a network as a combination of nodules. More formally, the definition is as follows:
Definition 2: (Network) A network is the combination of a finite number of nodules. The output of the
network is the sum of the outputs of all the nodules that constitute the network.
In practice all the networks of a population must have the same number of nodules, and this number, N,
is fixed along the evolution.
Fig. 1. Model of a nodule. As a node has only connections to some nodes of the nodule, the connections that are
missing are represented with dashed lines. The nodule is composed of the hidden nodes and the connections of
these nodes from the input and to the output.

Some parameters of the nodule are given by the problem and for that reason they are common to all the
nodules:

  n                         number of inputs
  m                         number of outputs
  x = (1, x_1, ..., x_n)    input vector
  f_output                  transfer function of the output layer

These parameters are fixed for all nodules. The rest of the parameters depend on each nodule:

  h      number of (hidden) nodes of the nodule
  f_i    transfer function of node i
  p_i    partial output of node i (see explanation below)
  y_i    output of node i
  w_i    weight vector of node i

²Each subpopulation evolves independently, so we can talk of subpopulations or species indistinctly, as each
subpopulation will constitute a different species.
As the node has a variable number of connections we have considered, for simplicity, that the connections
that are not present in the node have weight 0, so we can use a weight vector of fixed length for all nodes. A
node could have connections from input nodes, from other nodes and to output nodes. The weight vector is
ordered as follows:
$$w_i = \big(\underbrace{w_{i,0}}_{\text{bias}},\ \underbrace{w_{i,1}, \ldots, w_{i,n}}_{\text{input}},\ \underbrace{w_{i,n+1}, \ldots, w_{i,n+h}}_{\text{hidden}},\ \underbrace{w_{i,n+h+1}, \ldots, w_{i,n+h+m}}_{\text{output}}\big) \qquad (1)$$
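The fixed-length layout of Equation 1 can be sketched in a few lines of Python. This is a hypothetical helper introduced here for illustration, not part of Covnet; `weight_slices` and `new_node_weights` are names we invent:

```python
import numpy as np

def weight_slices(n, m, h):
    """Index ranges of each weight group inside the vector of Equation 1.

    Layout: [bias | n input weights | h hidden weights | m output weights].
    """
    return {
        "bias":   slice(0, 1),                      # w_{i,0}
        "input":  slice(1, 1 + n),                  # w_{i,1} .. w_{i,n}
        "hidden": slice(1 + n, 1 + n + h),          # w_{i,n+1} .. w_{i,n+h}
        "output": slice(1 + n + h, 1 + n + h + m),  # w_{i,n+h+1} .. w_{i,n+h+m}
    }

def new_node_weights(n, m, h):
    """Weight vector of a node with no connections yet.

    Absent connections are simply stored as weight 0, so every node of a
    nodule shares the same vector length 1 + n + h + m.
    """
    return np.zeros(1 + n + h + m)
```

Storing absent connections as zeros is what lets all nodes use a fixed-length vector regardless of how many connections they actually have.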
As there is no restriction in the connectivity of the nodule, the transmission of the impulse along the
connections must be defined in a way that avoids recurrence, as the aim of this work is the cooperative
coevolution of feed-forward neural networks. The transmission has been defined in three steps:
Step 1. Each node generates its output as a function of only the inputs of the nodule (that is, the inputs of
the whole network):
$$p_i = f_i\left( \sum_{j=0}^{n} w_{i,j}\, x_j \right), \qquad (2)$$

this value is called the partial output.
Step 2. These partial outputs are propagated along the connections. Then, each node generates its output as
a function of all its inputs:
$$y_i = f_i\left( \sum_{j=0}^{n} w_{i,j}\, x_j + \sum_{j=1}^{h} w_{i,n+j}\, p_j \right). \qquad (3)$$
Step 3. Finally, the output layer of the nodule generates its output:
$$o_j = f_{\text{output}}\left( \sum_{i=1}^{h} w_{i,n+h+j}\, y_i \right). \qquad (4)$$
These three steps are repeated over all the nodules. The actual output vector of the network is the sum of
the output vectors generated by each nodule.
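The three transmission steps can be sketched as follows. This is a minimal illustration, assuming for simplicity a single transfer function `f` shared by all nodes (the model allows a per-node f_i) and an identity output transfer; the row layout of the weight matrix follows Equation 1:

```python
import numpy as np

def nodule_output(W, x, n, h, m, f=np.tanh, f_output=lambda z: z):
    """Three-step feed-forward pass of one nodule (Equations 2-4).

    W is an (h, 1+n+h+m) matrix: row i holds the weight vector of node i
    laid out as [bias | input | hidden | output]; x is the raw input
    vector (without the leading 1). f and f_output are assumptions of
    this sketch.
    """
    xb = np.concatenate(([1.0], x))                      # prepend bias input x_0 = 1
    # Step 1: partial outputs from the nodule inputs only (Eq. 2)
    p = f(W[:, : n + 1] @ xb)
    # Step 2: full node outputs, adding node-to-node connections (Eq. 3)
    y = f(W[:, : n + 1] @ xb + W[:, n + 1 : n + 1 + h] @ p)
    # Step 3: output layer of the nodule (Eq. 4)
    o = f_output(W[:, n + 1 + h :].T @ y)
    return o  # length-m vector; the network output is the sum over nodules
```

The network output would then be `sum(nodule_output(W_k, x, n, h_k, m) for each nodule k)`, matching Definition 2.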
Defined in this way, a nodule is equivalent to a subnetwork with two hidden layers with the same number of
nodes in both layers. This equivalent model is shown in Figure 2. So, the nodule of Figure 1 could be seen as
the genotype of a nodule whose phenotype is the subnetwork shown in Figure 2. This difference is important,
as the model of Figure 1 considered as a phenotype would be a recurrent network. In this representation,
the mapping from genotype to phenotype is one-to-one, so the deception problem mentioned above does not
appear.
Fig. 2. Equivalent model with two hidden layers. Every connection from an input node represents two connections, as
the input value is used in two steps (see Equations 2 and 3). Every connection from another node of the nodule
represents a connection between the first and second hidden layer (see Equation 3).
As the nodules must coevolve to develop different behaviors, we have N_s independent subpopulations of
nodules³ that evolve separately. The network will always have N_s nodules, each one from a different
subpopulation of nodules. Our task is not only developing cooperative nodules but also obtaining the best
combinations. For that reason we also have a population of networks. This population keeps track of the best
combinations of nodules and evolves as the population of nodules evolves. The whole evolutionary process is
shown in Figure 3.
Species creation is implicit: the subpopulations must coevolve complementary behaviors in order to get
useful networks, as the combination of several nodules with the same behavior when they receive the same
inputs would not produce networks with a good fitness value. So, there is no need to introduce a mechanism
for enforcing diversity that can bias the evolutionary process.
In the next two sections we will explain in depth the two populations and their evolutionary process.
A.1 Nodule population
The nodule population is formed by N_s subpopulations. Each subpopulation consists of a fixed number
of nodules codified directly as subnetworks, that is, we evolve the genotype of Figure 1, which has a one-to-
one mapping to the phenotype of Figure 2. The population is subject to the operations of replication and
mutation. Crossover is not used due to its disadvantages in evolving artificial neural networks [17]. With
these features the algorithm falls in the class of evolutionary programming [36].
There is no limitation on the structure of the nodule or on the connections among the nodes. There is only
one restriction, to avoid unnecessary complexity in the resulting nodules: there can be no connections to an
input node or from an output node.
The algorithm for the generation of a new nodule subpopulation is similar to other models proposed in the
literature, such as GNARL [17], EPNet [8], or the genetic algorithm developed by G. Bebis et al. [37]. The
steps for generating the subpopulations are the following:
The nodules of the initial subpopulation are created randomly. The number of nodes of the nodule, h, is
obtained from a uniform distribution: 0 ≤ h ≤ h_max. Each node is created with a number of connections, c,
taken from a uniform distribution: 0 ≤ c ≤ c_max. The initial value of the weights is uniformly distributed in
the interval [w_min, w_max].
The new subpopulation is generated by replicating the best P% of the former population. The remaining
(100 − P)% is removed and replaced by mutated copies of the best P%. An individual of the best P% is
selected by roulette selection and mutated. This mutated copy substitutes one of the worst (100 − P)%
individuals.
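One generation of a subpopulation could be sketched as below. `mutate` stands in for the structural and parametric mutation operators described next, and this sketch assumes positive fitness values so that roulette (fitness-proportional) selection is well defined:

```python
import numpy as np

def next_subpopulation(pop, fitness, P, mutate, rng):
    """Generate the next subpopulation of nodules.

    The best P% is replicated unchanged; the remaining (100 - P)% is
    replaced by mutated copies of elite individuals chosen by roulette
    selection over the elite's fitness.
    """
    order = np.argsort(fitness)[::-1]                # indices, best first
    n_keep = max(1, int(len(pop) * P / 100.0))
    elite = [pop[i] for i in order[:n_keep]]
    elite_fit = np.asarray([fitness[i] for i in order[:n_keep]], dtype=float)
    probs = elite_fit / elite_fit.sum()              # roulette wheel (fitness > 0 assumed)
    new_pop = list(elite)                            # replicate the best P%
    while len(new_pop) < len(pop):
        parent = elite[rng.choice(n_keep, p=probs)]
        new_pop.append(mutate(parent))               # mutated copy replaces a worst individual
    return new_pop
```

Note the subpopulation size stays fixed: elites survive verbatim and every removed individual is replaced by exactly one mutated elite copy.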
There are two types of mutation: parametric and structural. The severity of the mutation is determined by
the relative fitness, F_r, of the nodule. Given a nodule ν, its relative fitness is defined as:

$$F_r = e^{-\alpha F(\nu)}, \qquad (5)$$

where F(ν) is the fitness value of nodule ν.
Parametric mutation consists of a local search algorithm in the space of weights, a simulated annealing
algorithm [46]. This algorithm performs random steps in the space of weights. Each random step affects all
the weights of the nodule. For every weight, w_ij, of the nodule the following operation is carried out:

$$w_{ij} = w_{ij} + \Delta w_{ij}, \quad \forall w_{ij} \in \nu, \qquad (6)$$

where

$$\Delta w_{ij} \sim N(0, \beta F_r(\nu)), \qquad (7)$$

and β is a positive value that must be set by the user in order to avoid large steps in the space of weights.
The value of β used in all our experiments has been β = 0.75; in any case, Covnet is quite robust with regard
to this parameter.
³In order to maintain a coherent nomenclature we talk of one population of networks and another population of nodules. The
population of nodules is divided into N_s genetically isolated subpopulations that coevolve together.
Fig. 3. Evolutionary process of both populations. The generation of a new population for both populations, networks
and nodules, is shown in detail.
Then, the fitness of the nodule is recalculated and the usual simulated annealing criterion is applied. Let
ΔF be the difference in the fitness function before and after the random step:

If ΔF ≥ 0 the step is accepted.

If ΔF < 0 then the step is accepted with probability

$$P(\Delta F) = e^{\Delta F / T},$$

where T is the current temperature. T starts at an initial value T_0 and is updated at every step, T(t + 1) =
γT(t), 0 < γ < 1. The number of steps of the algorithm that are carried out on each parametric mutation is
very low, as performing many steps is computationally very expensive.
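Assuming the relative fitness of Equation 5 and a nodule encoded as a flat weight vector, the parametric mutation could be sketched as follows; `evaluate` (hypothetical) maps a weight vector to its fitness, higher being better:

```python
import numpy as np

def parametric_mutation(weights, evaluate, alpha=1.0, beta=0.75,
                        T0=1.0, gamma=0.9, steps=5, rng=None):
    """A few simulated-annealing steps in weight space (Eqs. 5-7)."""
    rng = rng or np.random.default_rng()
    w = weights.copy()
    fit = evaluate(w)
    T = T0
    for _ in range(steps):
        Fr = np.exp(-alpha * fit)                            # relative fitness (Eq. 5)
        cand = w + rng.normal(0.0, beta * Fr, size=w.shape)  # perturb every weight (Eqs. 6-7)
        dF = evaluate(cand) - fit                            # change in fitness
        # annealing criterion: always accept improvements, accept
        # deteriorations with probability exp(dF / T)
        if dF >= 0 or rng.random() < np.exp(dF / T):
            w, fit = cand, fit + dF
        T *= gamma                                           # cooling: T(t+1) = gamma * T(t)
    return w
```

The step size βF_r shrinks as the nodule's fitness grows, so well-adapted nodules receive gentler perturbations, which preserves the behavioral link between parent and offspring.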
Parametric mutation is always carried out after structural mutation, as it does not modify the structure of
the network.
Structural mutation is more complex because it implies a modification of the structure of the nodule. The
behavioral link between parents and their offspring must be enforced to avoid generational gaps that produce
inconsistency in the evolution. There are four different structural mutations: addition of a node without
connections, deletion of a node, addition of a connection with 0 weight, and deletion of a connection.
The nodes are added with no connections to enforce the behavioral link with the parent. As many authors
have stated [8] [17], maintaining the behavioral link between parents and their offspring is of the utmost
importance to obtain a useful algorithm.
All the above mutations are made in the mutation operation on the nodule. For each mutation there is a
minimum value, Δ_m, and a maximum value, Δ_M. The number of elements (nodes or connections) involved
in the mutation is calculated as follows:

$$\Delta = \Delta_m + F_r(\nu)(\Delta_M - \Delta_m). \qquad (8)$$

So, before making a mutation the number of elements, Δ, is calculated; if Δ = 0 the mutation is not actually
carried out.
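Equation 8 amounts to interpolating between Δ_m and Δ_M according to the relative fitness of the nodule, so worse nodules (higher F_r) undergo more drastic structural changes. A minimal sketch:

```python
import numpy as np

def mutation_count(fitness, delta_min, delta_max, alpha=1.0):
    """Number of nodes/connections affected by a structural mutation (Eq. 8)."""
    Fr = np.exp(-alpha * fitness)                    # relative fitness (Eq. 5)
    delta = int(round(float(delta_min + Fr * (delta_max - delta_min))))
    return delta                                     # if 0, the mutation is skipped
```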
There is no migration among the subpopulations. So, each subpopulation must develop different behaviors
in its nodules, that is, different species of nodules, in order to compete with the other subpopulations for
conquering its own niche and to cooperate to form networks with high fitness values.
A.2 Network population
The network population is formed by a fixed number of networks. Each network is the combination of one
nodule of each subpopulation of nodules. So the networks are strings of integer numbers of fixed length. The
value of the numbers is not significant as they are just labels of the nodules. The relationship between the
two populations can be seen in Figure 4. It is important to note that, as the chromosome that represents the
network is ordered, the permutation problem we have discussed cannot appear.
Fig. 4. Populations of networks and nodules. Each element of the network is a reference to, or a label of, an individual
of the corresponding subpopulation of nodules. So the network is a vector where the first component refers to a
nodule of subpopulation 1, the second component to a nodule of subpopulation 2, and so on.
The network population is evolved using the steady-state genetic algorithm [47] [48]. This term may lead to
confusion, as it has been proved that this algorithm shows higher variance [49] and is a more aggressive and
selective strategy [50] than the standard genetic algorithm. This algorithm is selected because we need a
population of networks that evolves more slowly than the population of nodules, as the changes in the
population of networks have a major impact on the fitness of the nodules. The steady-state genetic algorithm
avoids the negative effect that a drastic modification of the population of networks could have over the
subpopulations of nodules. As the two populations evolve in synchronous generations, the modifications in
the population of networks are less severe than the modifications in the subpopulations of nodules. It has
also been shown by some works in the area [51] [52] that the steady-state genetic algorithm produces better
solutions and is faster than the standard genetic algorithm.
In a steady-state genetic algorithm one member of the population is changed at a time. In the algorithm we
have implemented, the offspring generated by crossover replace the two worst individuals of the population
instead of replacing their parents. The algorithm allows adding mutation to the model, always at very low rates.
Crossover is made at nodule level, using a standard two-point crossover. So the parents exchange their
nodules to generate their offspring. Mutation is also carried out at nodule level. When a network is mutated
one of its nodules is selected and is substituted by another nodule of the same subpopulation selected by
means of a roulette algorithm.
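A single steady-state step could be sketched as below. The chapter does not detail how parents are chosen, so this sketch simply takes the two best networks as parents (an assumption of the sketch); networks are fixed-length integer strings, one nodule label per subpopulation:

```python
import numpy as np

def steady_state_step(networks, fitness, rng):
    """One step of the steady-state GA on the network population.

    Two parents produce offspring by standard two-point crossover at
    nodule level, and the offspring replace the two worst networks
    instead of their parents.
    """
    order = np.argsort(fitness)                          # worst first
    p1, p2 = networks[order[-1]], networks[order[-2]]    # parents: two best (assumption)
    a, b = sorted(rng.choice(len(p1) + 1, size=2, replace=False))
    c1 = np.concatenate([p1[:a], p2[a:b], p1[b:]])       # two-point crossover:
    c2 = np.concatenate([p2[:a], p1[a:b], p2[b:]])       # parents exchange nodule labels
    networks[order[0]], networks[order[1]] = c1, c2      # replace the two worst
    return networks
```

Because the chromosome positions are ordered (position k always refers to subpopulation k), swapping segments never produces the permutation problem discussed earlier.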
During the generation of the new nodule population some nodules of every subpopulation are removed and
substituted. The removed nodules are also substituted in the networks. This substitution has two advantages:
first, poor performing nodules are removed from the networks and substituted by potentially better ones;
second, the new nodules have the opportunity to participate in the networks immediately after their creation.
A.3 Fitness assignment
The assignment of fitness to networks is straightforward. Each network is assigned a fitness as a function
of its performance in solving a given problem. If the model is applied to classification, the fitness of each
network is the number of patterns of the training set that are correctly classified; if it is applied to regression,
the fitness is the sum of squared errors, and so on.
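For classification, the network fitness described above might be computed as follows; the sketch assumes the predicted class is the argmax of the summed nodule outputs, which is not stated explicitly in the text:

```python
import numpy as np

def network_fitness(outputs, targets):
    """Number of training patterns correctly classified.

    outputs: (patterns, classes) array of network outputs, already summed
    over nodules; targets: array of true class indices.
    """
    return int(np.sum(np.argmax(outputs, axis=1) == targets))
```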
Assigning fitness to the nodules is a much more complex problem. In fact, the assignment of fitness to
the individuals that form a solution in cooperative evolution is one of its key topics. The performance of the
model highly depends on that assignment. A discussion of the matter can be found in the Introduction of [9].
A credit assignment must fulfill the following requirements to be useful:
• It must enforce competition among the subpopulations, to avoid two subpopulations developing similar
responses to the same features of the data.
• It must enforce cooperation: the different subpopulations must develop complementary features that together
solve the problem.
• It must measure the contribution of a nodule to the fitness of the network, and not only the performance
of the networks where the nodule is present. A nodule in a good network must not get a high fitness if its
contribution to the performance of the network is not significant. Likewise, a nodule in a poorly performing
network must not be penalized if its contribution to the fitness of the network is positive. Otherwise, a good
nodule that is temporarily assigned to poorly rated networks could be lost in the evolution of the nodule
subpopulations.
Several methods for calculating the fitness of the nodules have been tried. The best one consists of the weighted
sum of three different criteria. These criteria, used to obtain the fitness of a nodule ν in a subpopulation π,
are the following:
Substitution (σ): k networks are selected using an elitist method, that is, the k best networks of the population.
In these networks the nodule of subpopulation π is substituted by the nodule ν, and the fitness of the network
with the substituted nodule is measured. The fitness assigned to the nodule is the averaged difference in
fitness between the networks with the original nodule and with the nodule ν. This criterion enforces
competition among nodules of the same subpopulation, as it tests whether a nodule can achieve better
performance than the rest of the nodules of its subpopulation.
The interdependencies among nodules could be a major drawback of the substitution criterion, but this does
not mean that the criterion is useless. In any case, the criterion has two important features:
• It encourages the nodules to compete within their subpopulations, rewarding the nodules most compatible
with the nodules of the rest of the subpopulations. This is true even for a distributed representation, because
it has been shown that such a representation is also modular. Moreover, as the nodules have no connections
among them, they are more independent than in a standard network.
• As many of the nodules are competing with their parents, this criterion makes it possible to measure whether
an offspring is able to improve the performance of its parent.
In addition, the neuropsychological evidence discussed above, showing that certain parts of the brain consist
of modules, supports this objective.
Difference (δ): the nodule is removed from all the networks where it is present, and the fitness is measured as
the difference in performance of these networks with and without the nodule. This criterion enforces
competition among subpopulations of nodules, preventing more than one subpopulation from developing the
same behavior. If two subpopulations evolve in the same way, the value of this criterion for their nodules will
be near 0 and the subpopulations will be penalized.
Best k (βk): the fitness is the mean of the fitness values of the best k networks where the nodule ν is present.
Only the best k are selected because the importance of the worst networks of the population must not be
significant. This criterion rewards the nodules in the best networks, and does not penalize a good nodule if it
is present in some poorly performing networks.
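Assuming, for illustration only, that a network is represented as a tuple with one nodule per subpopulation slot and that `fitness` evaluates such a tuple (with nodule removal modeled as a `None` slot), the three criteria can be sketched as:

```python
def substitution(nodule, slot, networks, fitness, k):
    """sigma: place `nodule` in position `slot` of the k best networks and
    average the resulting change in network fitness."""
    best = sorted(networks, key=fitness, reverse=True)[:k]
    diffs = []
    for net in best:
        modified = list(net)
        modified[slot] = nodule
        diffs.append(fitness(modified) - fitness(net))
    return sum(diffs) / len(diffs)

def difference(nodule, slot, networks, fitness):
    """delta: remove the nodule from every network where it appears and
    average the drop in fitness (removal modeled here as a None slot,
    an assumption about the representation)."""
    present = [net for net in networks if net[slot] == nodule]
    if not present:
        return 0.0
    drops = []
    for net in present:
        ablated = list(net)
        ablated[slot] = None
        drops.append(fitness(net) - fitness(ablated))
    return sum(drops) / len(drops)

def best_k(nodule, slot, networks, fitness, k):
    """beta_k: mean fitness of the best k networks containing the nodule."""
    fits = sorted((fitness(net) for net in networks if net[slot] == nodule),
                  reverse=True)[:k]
    return sum(fits) / len(fits) if fits else 0.0
```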
Considered independently, none of these criteria is able to fulfill the three requirements mentioned above.
Nevertheless, their weighted sum has proved to give good performance in the problems used as tests.
Typical values of the weights of the fitness components used in our experiments satisfy λσ ≈ 2λδ ≈ 60λβk.
The values of these coefficients must not only weight the importance of each criterion but also correct the
differences among their ranges.
In order to encourage small nodules we have included a regularization term in the fitness of the nodule.
Let n_n be the number of nodes of the nodule and n_c its number of connections; the effective fitness4, f′_i,
of the nodule is calculated as follows:

f′_i = f_i − ρ_n n_n − ρ_c n_c.  (9)

The values of the coefficients must be in the interval 0 < ρ_n, ρ_c << 1 in order to prevent the regularization
term from introducing a high bias in the learning process.
So, the equation of the effective fitness of the nodule ν of subpopulation π is the following:

f^π_ν = λ_σ σ + λ_δ δ + λ_βk β_k − ρ_n n_n − ρ_c n_c.  (10)

If the expression above is negative for any of the nodules of a subpopulation, then the fitness values of all the
nodules of that subpopulation are shifted, as we have mentioned above, as follows:

f^π_ν = f^π_ν − min{f^π_i : i = 1, …, N},  (11)

where N is the number of nodules of the subpopulation.
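Using the weights and regularization coefficients of Table I, Eqs. (10) and (11) can be sketched as follows (the function and constant names are ours, not the authors'):

```python
# Weights and regularization coefficients taken from Table I.
L_SIGMA, L_DELTA, L_BETA = 3.50, 1.45, 0.05
RHO_N, RHO_C = 0.25, 0.025

def effective_fitness(sigma, delta, beta_k, n_nodes, n_conns):
    """Eq. (10): weighted sum of the three criteria minus the
    regularization term penalizing nodes and connections."""
    return (L_SIGMA * sigma + L_DELTA * delta + L_BETA * beta_k
            - RHO_N * n_nodes - RHO_C * n_conns)

def shift_if_negative(fits):
    """Eq. (11): if any effective fitness in the subpopulation is
    negative, shift the whole subpopulation so the minimum becomes 0."""
    m = min(fits)
    return [f - m for f in fits] if m < 0 else list(fits)
```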
IV. Experiments
The performance of the developed model is tested on ten classification problems, with different characteristics,
from the UCI Machine Learning Repository [53]. In order to get a clear idea of the performance of the model,
we have compared it with a modular network, the adaptive mixture of local experts [54]. Each expert is a
multilayer perceptron (MLP) trained with standard back-propagation [55] and a momentum term. We have
also compared Covnet with the results reported in the literature.
For the design and training of the modular networks we used the NeuralWorks Professional II/Plus
simulator [56]. We also tried some pruning algorithms implemented in the Stuttgart Neural Network
Simulator (SNNS)5 (OBD [57], OBS [22] and Skeletonization [58]), but always with worse results.
4 It is called effective fitness because it is the actual value used as the fitness of the nodule in the generation of a new subpopulation.
5 This package can be obtained by anonymous ftp from ftp://ftp.informatik.uni-stuttgart.de/pub/SNNS.
Covnet has been programmed in C under the Linux operating system. All the tools and programs used
for its development are licensed under the GNU General Public License. Covnet's code6 is also under the
GNU General Public License.
All the parameters of Covnet are common to all the data sets used in the experiments. Such parameters are
shown in Table I. Setting the parameters for each problem specifically improves the performance of Covnet
but using the same parameters for all the problems shows the robustness of the model regarding parameter
setting.
TABLE I
Covnet parameters common to all the experiments
Parameter Value
Number of networks 100
Number of nodules on each subpopulation 40
Networks to replace on each generation 2.0%
Mutation rate on network population 5.0%
Initial value of weights (-0.5, 0.5)
Nodule elitism 70%
Input scaling interval [−2.5, 2.5]
Number of nodule subpopulations 5
Initial maximum number of nodes 3
Initial maximum number of connections 15
Nodule fitness components λσ = 3.50
λδ = 1.45
λβ3 = 0.05
Regularization term ρn = 0.25
ρc = 0.025
Simulated annealing T0 = 5.0
α = 0.95
n = 25
Minimum improvement (stop criterion) 10%
The regularization term is either used with the parameters shown in Table I or removed by setting those
parameters to 0. The second option may be used when no over-training effect is observed and the resulting
networks are small enough for the purposes of a specific task.
The population parameters (number of networks, number of nodule subpopulations and number of
nodules per subpopulation) can take a variety of values. However, increasing the values shown in this chapter
will not improve the performance, and will increase the computational cost of the evolution.
The weights of the nodule fitness subcomponents must be fixed in a way that corrects the differences among
their ranges. The values used in our experiments follow this idea. In a specific problem it could be interesting
to consider one of the subcomponents more important than the others, but that can only be tested by trial
and error.
Regularization parameters must be set as a function of the importance of parsimony in the task at hand.
Increasing the values shown in this chapter will evolve smaller networks, but will also decrease the performance
of the networks as the regularization restriction becomes more critical.
Each set of available data was divided into three sets: 50% of the patterns were used for learning, 25% of
them for validation and the remaining 25% for testing the generalization of the individuals. There are two
exceptions, the Sonar and Vowel problems, as the patterns of these two problems are prearranged in two
subsets due to their specific features. Table II shows a summary of the data sets used.
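A minimal sketch of the 50%/25%/25% split described above; the shuffling and seed handling are our assumptions, not the authors' exact procedure:

```python
import random

def split_patterns(patterns, seed=0):
    """Split a pattern list into 50% training, 25% validation and
    25% test partitions, as described in the text."""
    rng = random.Random(seed)          # reproducible shuffle (assumption)
    shuffled = list(patterns)
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train, n_val = n // 2, n // 4
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])
```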
The populations of Covnet were evolved using the training set and the validation set together, that is, no
validation set was used during the evolution. At the end of the evolution the best network, in terms of training
error, was selected as the result of the evolution. The test set was then used to obtain the generalization
error of this network.
6 The code is available upon request to the authors.
For the training of the modular networks we used cross-validation and early stopping [59].
The networks were trained until the error over the validation set started to grow. Nevertheless, the results
obtained with early stopping were worse than those obtained when the validation set was added to the
training set. Only the results with the latter configuration are shown.
In all the tables we show, for each permutation of the data sets, the averaged error of classification over 30
repetitions, the standard deviation, the best and worst individuals, and the averaged number of nodes and
connections of the best networks of each experiment. The measure of the
error is the following:

E = (1/P) Σ_{i=1}^{P} e_i,  (12)

where P is the number of patterns and e_i is 0 if pattern i is correctly classified, and 1 otherwise.
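Equation (12) translates directly into code; `predict` is again a hypothetical classifier callable standing in for the evaluated network:

```python
def classification_error(predict, patterns):
    """Eq. (12): E = (1/P) * sum(e_i), where e_i is 0 for a correctly
    classified pattern and 1 otherwise."""
    errors = sum(0 if predict(x) == y else 1 for x, y in patterns)
    return errors / len(patterns)
```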
TABLE II
Summary of the data sets. The features of each data set can be C (continuous), B (binary) or
N (nominal).
Cases Features
Data set Train Val Test Classes C B N Description
Cancer 360 175 174 2 9 There are two classes indicating whether the cancer was benign
(65.5% of the cases) or malignant (34.5%).
Card 346 172 172 2 6 4 5 There are two classes, meaning whether the application was
granted (44.5% of the patterns) or denied (55.5%).
Gene 1588 794 793 3 60 This problem consists of two subtasks: recognizing
exon/intron boundaries (referred to as EI sites), and
recognizing intron/exon boundaries (IE sites).
Glass 107 54 53 6 9 - - This data set is from the UCI Machine Learning
Repository. The set contains data from 6 different
types of glass.
Heart 134 68 68 2 6 1 6 This data set comes from the Cleveland Clinic. The goal is the
prediction of the presence or absence of heart disease in those
patients.
Horse 182 91 91 3 13 2 5 This data set is from the UCI Machine Learning Repository.
The aim is to predict the fate of a horse that has a colic: to
survive, to die, or to be euthanized.
Pima 384 192 192 2 8 - - The patterns are divided into two classes that show
whether the patient shows signs of diabetes.
Sonar 104 104 2 60 The task is to train a network to discriminate between sonar
signals bounced off a metal cylinder and those bounced off a
roughly cylindrical rock.
Soybean 342 171 170 19 6 13 16 The task is to recognize 19 different diseases of soybeans.
Vowel 528 462 11 10 Speaker independent recognition of the eleven steady
state vowels of British English.
The results obtained using Covnet and the modular neural network are shown in Table III. In boldface
we show the best results when the difference is statistically significant using a t-test at a confidence level of
95%. We can see that Covnet is able to outperform the modular network in 6 out of 10 datasets, and is
worse only in 2 problems.
The results obtained are good when they are compared with other works using these data sets. Table
IV shows a summary of the results reported in papers devoted to ensembles, modular networks or similar
classification methods. Comparisons must be made cautiously, as the experimental setup is different in
TABLE III
Error rates for the modular network and Covnet for the datasets. The best generalization error
is in boldface for every problem.
Modular network Covnet
Problem Mean Best Worst Mean Best Worst
Cancer 0.0224 0.0115 0.0345 0.0167 0.0105 0.0230
Card 0.1374 0.1163 0.1686 0.1157 0.0930 0.1395
Gene 0.1511 0.1324 0.1702 0.1398 0.1021 0.1425
Glass 0.3904 0.0745 0.1180 0.3723 0.1321 0.3019
Heart 0.1941 0.1324 0.2500 0.1426 0.0882 0.2059
Horse 0.2714 0.2308 0.3077 0.2780 0.2308 0.3407
Pima 0.2299 0.1771 0.2865 0.1990 0.1615 0.2448
Sonar 0.1875 0.1346 0.2506 0.2202 0.2019 0.2404
Soybean 0.2023 0.1235 0.2765 0.1985 0.1235 0.2412
Vowel 0.5821 0.5325 0.6710 0.5788 0.4913 0.5190
different papers. There are also differences in the methods used for estimating the generalization error. Some
of the papers use 10-fold cross-validation, which for some of the problems gives a more optimistic estimation
of the error.
TABLE IV
Results of previous works using the same data sets. We record the results of the best
method among the algorithms tested in each paper.
Data set Reference
Coop [45]1[60]2[61]3[62]1[63]2[64]2[65]2[66]1[67]2[68]2[69]1[70]2[71]1
Cancer 0.0123 0.035 0.038 0.0120 0.0310 0.0272 0.034 0.0263 0.033
Card 0.1217 0.093 0.1398 0.135 0.130 0.0910 0.1300 0.1432 0.1433
Gene 0.1238 0.0503 0.051
Glass 0.2289 0.249 0.3144 0.238 0.2518 0.2277 0.2519 0.226 0.3154 0.329
Heart 0.1196 0.151 0.197 0.166 0.1751 0.1384 0.2045 0.1604 0.1617
Horse 0.2674 0.169 0.1825
Pima 0.1969 0.226 0.244 0.234 0.221 0.223 0.1960 0.2402 0.2372 0.260
Sonar 0.1436 0.2278 0.154 0.1651 0.1529 0.163
Soybean 0.0761 0.070 0.0781 0.0757 0.0633 0.056 0.0568
Vowel 0.4587 0.5171
1 Hold-out.
2 k-fold cross-validation.
3 Best classifier.
V. Conclusions
In this chapter we have shown how a cooperative coevolutionary model for the design of artificial neural
networks can be developed. This model is based on the coevolution of several species of subnetworks (called
nodules in our model) that must cooperate to form networks for solving a given problem. Instead of trying to
evolve whole networks, a task that is not feasible in many problems or ends up with poorly performing neural
networks, we evolve these subnetworks that must cooperate in solving the given task. The nodules coevolve in
several independent subpopulations that evolve to different species. A population of networks that is evolved
by means of a steady-state genetic algorithm keeps track of the best combinations of nodules for solving the
problem.
We have also developed a new method for assigning credit to the individuals of the different species that
cooperate to form a network. This method is based on the combination of three criteria. The criteria enforce
competition within species and cooperation among species. The same idea underlying this method could be
applied to other models of cooperative coevolution.
This model has proved to perform better than standard algorithms in ten real problems of classification.
Moreover, it has shown better results than the methods of training modular neural networks by means of
gradient descent, e.g. the backpropagation learning rule, in 8 out of 10 problems.
Networks evolved by Covnet are very compact and have few sparsely distributed connections. These
networks are appropriate for hardware implementation. Moreover, the robustness to the damage of some
parts of the network is also a very interesting feature for hardware implemented neural networks.
We have also worked on a different point of view: considering the assignment of fitness to the nodules as
a multi-objective problem [72]. The optimization of each criterion would be approached by a multi-objective
evolutionary algorithm [73]. Each of the three criteria discussed above, together with a regularization term,
could be seen as a different objective for optimization.
Acknowledgments
The authors would like to thank Dr. F. Herrera-Triguero, R. Moya-Sánchez and E. Sanz-Tapia for
their help with the final version of this chapter. Part of the work reported in this chapter has been financed
by Project TIC2002-04036-C05-02 of the Spanish CICYT and FEDER funds.
References
[1] S. Haykin, Neural Networks: A Comprehensive Foundation, Macmillan College Publishing Company, New York, NY, 1994.
[2] X. Yao, “Evolving artificial neural networks,” Proceedings of the IEEE, vol. 87, no. 9, pp. 1423–1447, 1999.
[3] Y. Shang and B. W. Wah, “Global optimization for neural networks training,” IEEE Computer, vol. 29, no. 3, pp. 45–54,
1996.
[4] S-B. Cho and K. Shimohara, “Evolutionary learning of modular neural networks with genetic programming,” Applied
Intelligence, vol. 9, pp. 191–200, 1998.
[5] D. E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning, Addison–Wesley, Reading, MA, 1989.
[6] Z. Michalewicz, Genetic Algorithms + Data Structures = Evolution Programs, Springer–Verlag, New York, 1994.
[7] T. Caelli, L. Guan, and W. Wen, “Modularity in neural computing,” Proceedings of the IEEE, vol. 87, no. 9, pp. 1497–1518,
September 1999.
[8] X. Yao and Y. Liu, “A new evolutionary system for evolving artificial neural networks,” IEEE Transactions on Neural
Networks, vol. 8, no. 3, pp. 694–713, May 1997.
[9] M. A. Potter, The Design and Analysis of a Computational Model of Cooperative Coevolution, Ph.D. thesis, George Mason
University, Fairfax, Virginia, 1997.
[10] M. A. Potter and K. A. de Jong, “Cooperative coevolution: An architecture for evolving coadapted subcomponents,”
Evolutionary Computation, vol. 8, no. 1, pp. 1–29, 2000.
[11] N. García-Pedrajas, C. Hervás-Martínez, and J. Muñoz-Pérez, “Covnet: A cooperative coevolutionary model for evolving
artificial neural networks,” IEEE Transactions on Neural Networks, vol. 14, no. 3, pp. 575–596, May 2003.
[12] K. Chellapilla and D. B. Fogel, “Evolving neural networks to play checkers without relying on expert knowledge,” IEEE
Transactions on Neural Networks, vol. 10, no. 6, pp. 1382–1391, November 1999.
[13] Ch-T. Lin and Ch-P. Jou, “Controlling chaos by GA-based reinforcement learning neural network,” IEEE Transactions on
Neural Networks, vol. 10, no. 4, pp. 846–859, July 1999.
[14] S. Gallant, Neural-Network Learning and Expert Systems, MIT Press, Cambridge, MA, 1993.
[15] V. Honavar and V. L. Uhr, “Generative learning structures for generalized connectionist networks,” Inform. Sci., vol. 70,
no. 1/2, pp. 75–108, 1993.
[16] R. Parekh, J. Yang, and V. Honavar, “Constructive neural-network learning algorithms for pattern classification,” IEEE
Transactions on Neural Networks, vol. 11, no. 2, pp. 436–450, March 2000.
[17] P. J. Angeline, G. M. Saunders, and J. B. Pollack, “An evolutionary algorithm that constructs recurrent neural networks,”
IEEE Transactions on Neural Networks, vol. 5, no. 1, pp. 54–65, January 1994.
[18] R. Reed, “Pruning algorithms: A survey,” IEEE Transactions on Neural Networks, vol. 4, pp. 740–747, 1993.
[19] J. Depenau and M. Moller, “Aspects of generalization and pruning,” in Proc. World Congress on Neural Networks, 1994,
vol. III, pp. 504–509.
[20] H. H. Thodberg, “Improving generalization of neural networks through pruning,” International Journal of Neural Systems,
vol. 1, no. 4, pp. 317–326, 1991.
[21] Y. Hirose, K. Yamashita, and S. Hijiya, “Backpropagation algorithm which varies the number of hidden units,” Neural
Networks, vol. 4, pp. 61–66, 1991.
[22] B. Hassibi and D. Stork, “Second order derivatives for network pruning: Optimal brain surgeon,” in Advances in Neural
Information Systems 5, 1993.
[23] R. Kamimura and S. Nakanishi, “Weight-decay as a process of redundancy reduction,” in Proceedings of World Congress on
Neural Networks, 1994, vol. III, pp. 486–489.
[24] A. J. F. van Rooij, L. C. Jain, and R. P. Johnson, Neural Networks Training Using Genetic Algorithms, vol. 26 of Series in
Machine Perception and Artificial Intelligence, World Scientific, Singapore, 1996.
[25] S. V. Odri, D. P. Petrovacki, and G. A. Krstonosic, “Evolutional development of a multilevel neural network,” Neural
Networks, vol. 6, pp. 583–595, 1993.
[26] R. Smalz and M. Conrad, “Combining evolution with credit apportionment: A new learning algorithm for neural nets,”
Neural Networks, vol. 7, no. 2, pp. 341–351, 1994.
[27] V. Maniezzo, “Genetic evolution of the topology and weight distribution of neural networks,” IEEE Transactions on neural
networks, vol. 5, no. 1, pp. 39–53, January 1994.
[28] M. V. Borst, Local structure optimization in evolutionary generated neural networks architectures, Ph.D. thesis, Leiden
University, The Netherlands, August 1994.
[29] B. A. Whitehead and T. D. Choate, “Cooperative–competitive genetic evolution of radial basis function centers and widths
for time series prediction,” IEEE Transactions on Neural Networks, vol. 7, no. 4, pp. 869–880, July 1996.
[30] D. E. Moriarty, Symbiotic Evolution of Neural Networks in Sequential Decision Tasks, Ph.D. thesis, University of Texas at
Austin, 1997, Report AI97-257.
[31] D. E. Goldberg, “Genetic algorithms and Walsh functions: Part 2, deception and its analysis,” Complex Systems, vol. 3, pp.
153–171, 1989.
[32] D. E. Goldberg, “Genetic algorithms and Walsh functions: Part 1, a gentle introduction,” Complex Systems, vol. 3, pp.
129–152, 1989.
[33] R. K. Belew, J. McInerney, and N. N. Schraudolph, “Evolving networks: Using genetic algorithms with connectionist
learning,” Tech. Rep. CS90-174, Computer Science Engineering Department, University of California-San Diego, Feb. 1991.
[34] P. J. B. Hancock, “Genetic algorithms and permutation problems: A comparison of recombination operators for neural net
structure specification,” in Proc. Int. Workshop of Combinations of Genetic Algorithms and Neural Networks (COGANN-92),
D. Whitley and J. D. Schaffer, Eds., Los Alamitos, CA, 1992, pp. 108–122, IEEE Computer Soc. Press.
[35] J. D. Schaffer, L. D. Whitley, and L. J. Eshelman, “Combinations of genetic algorithms and neural networks: A survey of
the state of the art,” in Proceedings of COGANN-92 International Workshop on Combinations of Genetic Algorithms and
Neural Networks, L. D. Whitley and J. D. Schaffer, Eds., Los Alamitos, CA, 1992, pp. 1–37, IEEE Computer Society Press.
[36] D. B. Fogel, Evolving artificial intelligence, Ph.D. thesis, University of California, San Diego, 1992.
[37] G. Bebis, M. Georgiopoulos, and T. Kasparis, “Coupling weight elimination with genetic algorithms to reduce network size
and preserve generalization,” Neurocomputing, vol. 17, pp. 167–194, 1997.
[38] G. F. Miller, P. M. Todd, and S. U. Hedge, “Designing neural networks,” Neural Networks, vol. 4, pp. 53–60, 1991.
[39] Y. Liu and X. Yao, “Ensemble learning via negative correlation,” Neural Networks, vol. 12, no. 10, pp. 1399–1404, December
1999.
[40] B. E. Rosen, “Ensemble learning using decorrelated neural networks,” Connection Science, vol. 8, no. 3, pp. 373–384,
december 1996.
[41] D. E. Moriarty and R. Miikkulainen, “Efficient reinforcement learning through symbiotic evolution,” Machine Learning, vol.
22, pp. 11–32, 1996.
[42] A. L. Samuel, “Some studies in machine learning using the game of checkers,” Journal of Research and Development, vol. 3,
no. 3, pp. 210–229, 1959.
[43] D. W. Opitz and J. W. Shavlik, “Actively searching for an effective neural network ensemble,” Connection Science, vol. 8,
no. 3, pp. 337–353, 1996.
[44] Q. F. Zhao, O. Hammami, K. Kuroda, and K. Saito, “Cooperative co-evolutionary algorithm - How to evaluate a module?,”
in Proc. 1st IEEE Symposium of Evolutionary Computation and Neural Networks, San Antonio, TX, May 2000, pp. 150–157.
[45] X. Yao and Y. Liu, “Making use of population information in evolutionary artificial neural networks,” IEEE Transactions
on Systems, Man, and Cybernetics Part B: Cybernetics, vol. 28, no. 3, pp. 417–425, June 1998.
[46] S. Kirkpatrick, C. D. Gelatt Jr., and M. P. Vecchi, “Optimization by simulated annealing,” Science, vol. 220, no. 4598, pp.
671–680, 1983.
[47] D. Whitley and J. Kauth, “GENITOR: A different genetic algorithm,” in Proceedings of the Rocky Mountain Conference on
Artificial Intelligence, Denver, CO, 1988, pp. 118–130.
[48] D. Whitley, “The GENITOR algorithm and selective pressure,” in Proc 3rd International Conf. on Genetic Algorithms,
Morgan Kaufmann Publishers, Ed., Los Altos, CA, 1989, pp. 116–121.
[49] G. Syswerda, “Uniform crossover in genetic algorithms,” in Proc 3rd Internation Conf. on Genetic Algorithms, Morgan-
Kaufmann, Ed., 1989, pp. 2–9.
[50] D. Goldberg and K. Deb, “A comparative analysis of selection schemes used in genetic algorithms,” in Foundations of
Genetic Algorithms, G. Rawlins, Ed., pp. 94–101. Morgan Kaufmann, 1991.
[51] D. Whitley and T. Starkweather, “GENITOR II: A distributed genetic algorithm,” J. Experimental Theoretical Artificial
Intelligence, pp. 189–214, 1990.
[52] G. Syswerda, “A study of reproduction in generational and steady-state genetic algorithms,” in Foundations of Genetic
Algorithms, G. Rawlins, Ed., pp. 94–101. Morgan Kaufmann, 1991.
[53] S. Hettich, C.L. Blake, and C.J. Merz, “UCI repository of machine learning databases,” 1998,
http://www.ics.uci.edu/mlearn/MLRepository.html.
[54] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton, “Adaptive mixtures of local experts,” Neural Computation, vol.
3, pp. 79–87, 1991.
[55] D. Rumelhart, G. Hinton, and R. J. Williams, “Learning internal representations by error propagation,” in Parallel Distributed
Processing, D. Rumelhart and J. McClelland, Eds., pp. 318–362. MIT Press, Cambridge, MA, 1986.
[56] NeuralWare, Neural Computing: A Technology Handbook for Professional II/Plus, NeuralWare Inc., Pittsburgh, PA, 1993.
[57] Y. Le Cun, J. S. Denker, and S. A. Solla, “Optimal brain damage,” in Advances in Neural Information Processing (2), D. S.
Touretzky, Ed., Denver, CO, 1990, pp. 598–605.
[58] M. C. Mozer and P. Smolensky, “Skeletonization: A technique for trimming the fat from a network via relevance assessment,”
in Advances in Neural Information Processing (1), D. S. Touretzky, Ed., Denver, CO, 1989, pp. 107–155.
[59] W. Finnoff, F. Hergert, and H. G. Zimmermann, “Improving model selection by nonconvergent methods,” Neural Networks,
vol. 6, pp. 771–783, 1993.
[60] G. I. Webb, “Multiboosting: A technique for combining boosting and wagging,” Machine Learning, vol. 40, no. 2, pp.
159–196, August 2000.
[61] G. Zenobi and P. Cunningham, “Using diversity in preparing ensembles of classifiers based on different feature subsets to
minimize generalization error,” in 12th European Conference on Machine Learning (ECML 2001), L. de Raedt and P. Flach,
Eds. 2001, LNAI 2167, pp. 576–587, Springer–Verlag.
[62] C. J. Merz, “Using correspondence analysis to combine classifiers,” Machine Learning, vol. 36, no. 1, pp. 33–58, July 1999.
[63] J. Friedman, T. Hastie, and R. Tibshirani, “Additive logistic regression: A statistical view of boosting,” Annals of Statistics,
vol. 28, no. 2, pp. 337–407, 2000.
[64] Y. Liu, X. Yao, and T. Higuchi, “Evolutionary ensembles with negative correlation learning,” IEEE Transactions on
Evolutionary Computation, vol. 4, no. 4, pp. 380–387, November 2000.
[65] Y. Liu, X. Yao, Q. Zhao, and T. Higuchi, “Evolving a cooperative population of neural networks by minimizing mutual
information,” in Proc. of the 2001 IEEE Congress on Evolutionary Computation, Seoul, Korea, May 2001, pp. 384–389.
[66] Md. M. Islam, X. Yao, and K. Murase, “A constructive algorithm for training cooperative neural network ensembles,” IEEE
Transactions on Neural Networks, vol. 14, no. 4, pp. 820–834, July 2003.
[67] T. G. Dietterich, “An experimental comparison of three methods for constructing ensembles of decision trees: Bagging,
boosting, and randomization,” Machine Learning, vol. 40, pp. 139–157, 2000.
[68] S. Dzeroski and B. Zenko, “Is combining classifiers with stacking better than selecting the best one?,” Machine Learning,
vol. 54, pp. 255–273, 2004.
[69] L. Breiman, “Randomizing outputs to increase prediction accuracy,” Machine Learning, vol. 40, pp. 229–242, 2000.
[70] L. Todorovski and S. Dzeroski, “Combining classifiers with meta decision trees,” Machine Learning, vol. 50, pp. 223–249,
2003.
[71] E. Cant´u-Paz and C. Kamath, “Inducing oblique decision trees with evolutionary algorithms,” IEEE Transactions on
Evolutionary Computation, vol. 7, no. 1, pp. 54–68, February 2003.
[72] N. García-Pedrajas, C. Hervás-Martínez, and J. Muñoz-Pérez, “Multiobjective cooperative coevolution of artificial neural
networks,” Neural Networks, vol. 15, no. 10, pp. 1255–1274, November 2002.
[73] K. Deb, “Evolutionary algorithms for multi-criterion optimization in engineering design,” in Proceedings of Evolutionary
Algorithms in Engineering and Computer Science (EUROGEN’99), Jyväskylä, Finland, 30 May/3 June 1999, pp. 135–161.