Cooperative coevolutionary methods
Nicolás García-Pedrajas, César Hervás-Martínez, and Domingo Ortiz-Boyer
Abstract
This chapter presents Covnet, a cooperative coevolutionary model for evolving artificial neural networks. This
model is based on the idea of coevolving subnetworks that must cooperate to form a solution for a specific problem,
instead of evolving complete networks. The combination of these subnetworks is part of a coevolutionary process. The best
combinations of subnetworks must be evolved together with the coevolution of the subnetworks. Several subpopulations
of subnetworks coevolve cooperatively and genetically isolated. The individuals of every subpopulation are combined
to form whole networks. This is a different approach from most current models of evolutionary neural networks, which
try to develop whole networks. Covnet places as few restrictions as possible on the network structure, allowing the
model to reach a wide variety of architectures during the evolution and to be easily extensible to other kinds of neural
networks. The performance of the model in solving three real problems of classification is compared with a modular
network, the adaptive mixture of experts, and with the results presented in the literature. Covnet has shown better
generalization and produced smaller networks than the adaptive mixture of experts, and has also achieved results at
least comparable with the results in the literature.
Keywords
Automatic design of neural networks, cooperative coevolution, evolutionary computation, genetic algorithms, evolu-
tionary programming.
I. Introduction
In the area of neural network [1] design, one of the main problems is finding suitable architectures for
solving specific problems. The choice of such an architecture is very important, as a network smaller than
needed would be unable to learn, while a network larger than needed would end in over-training.
The problem of finding a suitable architecture and the corresponding weights of the network is a very
complex task (for a very interesting review of the matter the reader can consult [2]). Modular systems are
often used in machine learning as an approach for solving these complex problems. Moreover, in spite of the
fact that small networks are preferred because they usually lead to better performance, the error surfaces of
such networks are more rugged and have few good solutions [3]. In addition, there is much neuropsychological
evidence showing that the brain of humans and other animals consists of modules, which are subdivisions in
identifiable parts, each one with its own purpose and function [4].
The objective of this chapter is to show how Cooperative Coevolution, a recent paradigm within the field of
Evolutionary Computation, can be used to design such modular neural networks. Evolutionary computation
[5] [6] is a set of global optimization techniques that have been widely used in recent years for training and
automatically designing neural networks (see Section III). Some efforts have been made in designing modular
[7] neural networks with these techniques (e.g. [8]), but in almost all of them the design of the networks is
helped by methods outside evolutionary computation, or the application area for those models is limited to
very specific architectures.
This chapter is organised as follows: Section II explains the paradigm of cooperative coevolution; Section
III shows an application of cooperative coevolution to automatic neural network design; Section IV describes
the experiments carried out; and finally Section V states the conclusions of this chapter.
II. Cooperative coevolution
Cooperative coevolution [9] is a recent paradigm in the area of evolutionary computation focused on the
evolution of coadapted subcomponents without external interaction. In cooperative coevolution a number
of species are evolved together. The cooperation among the individuals is encouraged by rewarding the
individuals based on how well they cooperate to solve a target problem. The work on this paradigm has shown
The authors are with the Department of Computing and Numerical Analysis of the University of Córdoba. e-mail: {npedrajas,
chervas, dortiz}@uco.es
that cooperative coevolutionary models present many interesting features, such as specialization through
genetic isolation, generalization and efficiency [10]. Cooperative coevolution approaches the design of modular
systems in a natural way, as the modularity is part of the model. Other models need some a priori knowledge
to decompose the problem by hand. In many cases, either this knowledge is not available or it is not clear
how to decompose the problem.
This chapter describes a cooperative coevolutionary model called Covnet [11]. This model develops subnet-
works instead of whole networks. These modules are combined forming ensembles that constitute a network.
As M. A. Potter and K. A. De Jong [10] have stated, “to apply evolutionary algorithms effectively to increas-
ingly complex problems explicit notions of modularity must be introduced to provide reasonable opportunities
for solutions to evolve in the form of interacting coadapted subcomponents”.
The most distinctive feature of Covnet is the coevolution of modules without the intervention of any agent
external to the evolutionary process and without an external mechanism for combining subnetworks. Also,
the use of an evolutionary algorithm for the evolution of both the weights and the architecture allows the
model to be applied to tasks where no error function can be defined (e.g., game playing [12] or control [13]),
or where the derivatives of the error function cannot be obtained, so that an algorithm based on the
minimisation of that error, like the backpropagation learning rule, cannot be applied.
The most important contributions of Covnet are the following. First, it forms modular artificial neural
networks using cooperative coevolution. Every module must learn how to combine with the other modules
of the evolved network to be useful. Introducing the combination of modules into the evolutionary process
enforces the cooperation among the modules, as independently evolved modules are less likely to combine well
after the evolutionary process has finished.
Second, it develops a method for measuring the fitness of cooperative subcomponents in a coevolutionary
model. This method, based on three different criteria, could be applied to other cooperative coevolutionary
models not related to the evolution of neural networks. The current methods are based, almost exclusively,
on measuring the fitness of the networks where the module appears.
Third, it introduces a new hybrid evolutionary programming algorithm that places very few restrictions on the
subnetworks evolved. This algorithm produces very compact subnetworks, and even the evolved subnetworks
alone achieved very good performance in the test problems, as will be shown in the experimental section.
III. Automatic design of artificial neural networks by means of cooperative coevolution
The automatic design of artificial neural networks has two different approaches: parametric learning and
structural learning. In structural learning, both architecture and parametric information must be learned
through the process of training. Basically, we can consider three models of structural learning: Constructive
algorithms, destructive algorithms, and evolutionary computation.
Constructive algorithms [14] [15] [16] start with a small network (usually a single neuron). This network is
trained until it is unable to continue learning, then new components are added to the network. This process
is repeated until a satisfactory solution is found. These methods are usually trapped in local minima [17] and
tend to produce large networks. Destructive methods, also known as pruning algorithms [18], start with a big
network, that is able to learn but usually ends in over-fitting, and try to remove the connections and nodes
that are not useful. A major problem with pruning methods is measuring the relevance of the structural
components of the network in order to decide whether a connection or node must be removed.
Both methods, constructive and destructive, limit the number of available architectures, thus introducing
constraints in the search space of possible structures that may not be suitable to the problem. Although
these methods have been proved useful in simulated data [19] [20], their application to real problems has been
rather unsuccessful [21] [22] [23].
Evolutionary computation has been widely used to evolve neural network architectures and weights. There
have been many applications for parametric learning [24] and for both parametric and structural learning [25]
[17] [26] [27] [28] [29] [8] [30]. These works fall in two broad categories of evolutionary computation: genetic
algorithms and evolutionary programming.
Genetic algorithms are based on a representation independent of the problem, usually the representation
is a string of binary, integer or real numbers. This representation (the genotype) codifies a network (the
phenotype). This is a dual representation scheme. The ability to create better solutions in a genetic algorithm
relies mainly on the operation of crossover. This operator forms offspring by recombining representational
components from two members of the population.
The benefits of crossover come from the ability of forming connected substrings of the representation that
correspond to above-average solutions [5]. These substrings are called building blocks. Crossover is not effective
in environments where the fitness of an individual of the population is not correlated with the expected
ability of its representational components [31]. Such environments are called deceptive [32]. Deception is
a very important feature of most representations of neural networks, so crossover is usually avoided in
evolutionary neural networks [17].
One of the most important forms of deception arises from the many-to-one mapping from genotypes in the
representation space to phenotypes in the evaluation space. The existence of functionally equivalent networks
with different encodings makes the evolution inefficient, and it is unclear whether crossover would produce
fitter individuals from two members of the population. This problem is usually termed the permutation
problem [33] [34] or the competing conventions problem [35].
Evolutionary programming [36] is, for many authors, the most suited paradigm of evolutionary computation
for evolving artificial neural networks [17]. Evolutionary programming uses a representation natural for the
problem. Once the representation scheme has been chosen, mutation operators specific to the representation
scheme are defined. Evolutionary programming offers a major advantage over genetic algorithms when evolving
artificial neural networks: the representation scheme allows manipulating networks directly, avoiding the
problems associated with a dual representation.
The use of evolutionary learning for designing neural networks dates back no more than two decades (see
[2] or [35] for reviews). However, a lot of work has been done in these two decades, with many different
approaches and working models, for instance [25], [37], or [8]. Evolutionary computation has been used for
learning connection weights and for learning both architecture and connection weights. The main advantage
of evolutionary computation is that it performs a global exploration of the search space, avoiding becoming
trapped in local minima, as usually happens with local search procedures.
G. F. Miller et al. [38] proposed that evolutionary computation is a very good candidate to be used
to search the space of topologies because the fitness function associated with that space is complex, noisy,
non-differentiable, multi-modal and deceptive.
Almost all current models try to develop a global architecture, which is a very complex problem.
Although some attempts have been made at developing modular networks [39] [40], in most cases the modules
are combined only after the evolutionary process has finished, and not following a cooperative coevolutionary
model.
Few authors have devoted their attention to the cooperative coevolution of subnetworks. Some authors
have termed this kind of cooperative evolution (where the individuals must cooperate to achieve a good
performance) symbiotic evolution [41]. More formally, we should speak of mutualism, that is, the cooperation
of two individuals from different species that benefits both organisms.
R. Smalz and M. Conrad [26] developed a cooperative model where there are two populations: a population
of nodes, divided into clusters, and a population of networks that are combinations of neurons, one from each
cluster. Both populations are evolved separately.
B. A. Whitehead and T. D. Choate [29] developed a cooperative-competitive genetic model for Radial-Basis
Function (RBF) neural networks. In this work there is a population of genetically encoded neurons that
evolves both the centers and the widths of the radial basis functions. There is just one network that is formed
by the whole population of RBFs. The major problem, as in our approach, is to assign the fitness to each
node of the population, as the only performance measure available is for the whole network. This is well known
as the “credit apportionment problem”¹ [26] [9]. The credit assignment used by Whitehead and Choate is
restricted to RBF-like networks and very difficult to adapt to other kinds of networks.
D. W. Opitz and J. W. Shavlik [43] developed a model called ADDEMUP (Accurate anD Diverse Ensemble
Maker giving United Predictions). They evolved a population of networks by means of a genetic algorithm
and combined the networks in an ensemble with a linear combination. The competition among the networks
is encouraged with a diversity term added to the fitness of each network.
¹This problem can be traced back to the earliest attempts to apply machine learning to playing the game of checkers by Arthur
Samuel [42] in 1959.
D. E. Moriarty and R. Miikkulainen [30] [41] developed an actual cooperative model, called SANE, that
had some common points with R. Smalz and M. Conrad [26]. In this work they propose two populations: one
of nodes and another of networks that are combinations of the individuals from the population of nodes. Zhao
et al. [44] proposed a framework for cooperative coevolution, and applied that framework to the evolution of
RBF networks. Nevertheless, their work, more than a finished model, is an open proposal that aims at the
definition of the problems to be solved in a cooperative environment.
S.-B. Cho and K. Shimohara [4] developed a modular neural network evolved by means of genetic programming.
Each network is a complex structure formed by different modules which are codified by a tree structure.
X. Yao and Y. Liu [45] use the final population of networks developed using the EPNet [8] model to form
ensembles of neural networks. The combination of these networks produced better results than any isolated
network. Nevertheless, the cooperation among the networks takes place only after the evolutionary process
has finished. So, the model is neither cooperative nor coevolutionary.
A. Covnet: a cooperative coevolutionary model
Covnet is a cooperative coevolutionary model, that is, several species are coevolved together. Each
species is a subnetwork that constitutes a partial solution of a problem; the combination of several individuals
from different species constitutes the network that must be applied to the specific problem. The population
of subnetworks, which are called nodules, is made up of several subpopulations² that evolve independently.
Each one of these subpopulations constitutes a species. The combination of individuals from these different
subpopulations that coevolve together is the key factor of our model.
The evolution of coadapted subcomponents must address four major issues: problem decomposition, inter-
dependence among subcomponents, credit assignment and maintenance of diversity. Cooperative coevolution
gives a framework where these issues could be faced in a natural way. The problem decomposition is intrinsic
in the model. Each population will evolve different species that must cooperate in order to be rewarded with
high fitness values. There is no need for any a priori knowledge to decompose the problem by hand. The
interdependence among the subcomponents comes from the fact that the fitness of each individual depends
on how well the individual works together with the members of other species.
A nodule is made up of a variable number of nodes with free interconnection among them (see Figure 1),
that is, each node could have connections from input nodes, from other nodes of the nodule, and to output
nodes. More formally a nodule could be defined as follows:
Definition 1: (Nodule) A nodule is a subnetwork formed by a set of nodes with free interconnection among
them, the connections of these nodes from the input, and the connections of the nodes to the output. It cannot
have connections with any node belonging to another nodule.
The input and output layers of the nodules are common: they are the input and output layers of the network.
It is important to note that the genotype of the nodule has a one-to-one mapping to the phenotype, as the
many-to-one mapping between them is one of the main sources of deception and the permutation problem
[17].
In the same way we define a network as a combination of nodules. More formally, the definition is as follows:
Definition 2: (Network) A network is the combination of a finite number of nodules. The output of the
network is the sum of the outputs of all the nodules that constitute the network.
In practice all the networks of a population must have the same number of nodules, and this number, N,
is fixed along the evolution.
Fig. 1. Model of a nodule. As a node has only connections to some nodes of the nodule, the connections that are
missing are represented with dashed lines. The nodule is composed of the hidden nodes and the connections of
these nodes from the input and to the output.

Some parameters of the nodule are given by the problem and for that reason they are common to all the
nodules:

  n                         number of inputs
  m                         number of outputs
  x = (1, x_1, ..., x_n)    input vector
  f_output                  transfer function of the output layer

These parameters are fixed for all nodules. The rest of the parameters depend on each nodule:

  h      number of (hidden) nodes of the nodule
  f_i    transfer function of node i
  p_i    partial output of node i (see explanation below)
  y_i    output of node i
  w_i    weight vector of node i

²Each subpopulation evolves independently, so we can talk of subpopulations or species indistinctly, as each
subpopulation will constitute a different species.
As the node has a variable number of connections we have considered, for simplicity, that the connections
that are not present in the node have weight 0, so we can use a weight vector of fixed length for all nodes. A
node could have connections from input nodes, from other nodes and to output nodes. The weight vector is
ordered as follows:
$$w_i = \big(\underbrace{w_{i,0}}_{\text{bias}},\ \underbrace{w_{i,1}, \ldots, w_{i,n}}_{\text{input}},\ \underbrace{w_{i,n+1}, \ldots, w_{i,n+h}}_{\text{hidden}},\ \underbrace{w_{i,n+h+1}, \ldots, w_{i,n+h+m}}_{\text{output}}\big) \qquad (1)$$
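The fixed-length layout of Equation 1 can be sketched in a few lines of Python. This is a hypothetical helper introduced here for illustration, not part of Covnet; `weight_slices` and `new_node_weights` are names we invent:

```python
import numpy as np

def weight_slices(n, m, h):
    """Index ranges of each weight group inside the vector of Equation 1.

    Layout: [bias | n input weights | h hidden weights | m output weights].
    """
    return {
        "bias":   slice(0, 1),                      # w_{i,0}
        "input":  slice(1, 1 + n),                  # w_{i,1} .. w_{i,n}
        "hidden": slice(1 + n, 1 + n + h),          # w_{i,n+1} .. w_{i,n+h}
        "output": slice(1 + n + h, 1 + n + h + m),  # w_{i,n+h+1} .. w_{i,n+h+m}
    }

def new_node_weights(n, m, h):
    """Weight vector of a node with no connections yet.

    Absent connections are simply stored as weight 0, so every node of a
    nodule shares the same vector length 1 + n + h + m.
    """
    return np.zeros(1 + n + h + m)
```

Storing absent connections as zeros is what lets all nodes use a fixed-length vector regardless of how many connections they actually have.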
As there is no restriction in the connectivity of the nodule, the transmission of the impulse along the
connections must be defined in a way that avoids recurrence, as the aim of this work is the cooperative
coevolution of feed-forward neural networks. The transmission has been defined in three steps:
Step 1. Each node generates its output as a function of only the inputs of the nodule (that is, the inputs of
the whole network):
$$p_i = f_i\left( \sum_{j=0}^{n} w_{i,j}\, x_j \right), \qquad (2)$$

this value is called the partial output.
Step 2. These partial outputs are propagated along the connections. Then, each node generates its output as
a function of all its inputs:
$$y_i = f_i\left( \sum_{j=0}^{n} w_{i,j}\, x_j + \sum_{j=1}^{h} w_{i,n+j}\, p_j \right). \qquad (3)$$
Step 3. Finally, the output layer of the nodule generates its output:
$$o_j = f_{\text{output}}\left( \sum_{i=1}^{h} w_{i,n+h+j}\, y_i \right). \qquad (4)$$
These three steps are repeated over all the nodules. The actual output vector of the network is the sum of
the output vectors generated by each nodule.
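The three transmission steps can be sketched as follows. This is a minimal illustration, assuming for simplicity a single transfer function `f` shared by all nodes (the model allows a per-node f_i) and an identity output transfer; the row layout of the weight matrix follows Equation 1:

```python
import numpy as np

def nodule_output(W, x, n, h, m, f=np.tanh, f_output=lambda z: z):
    """Three-step feed-forward pass of one nodule (Equations 2-4).

    W is an (h, 1+n+h+m) matrix: row i holds the weight vector of node i
    laid out as [bias | input | hidden | output]; x is the raw input
    vector (without the leading 1). f and f_output are assumptions of
    this sketch.
    """
    xb = np.concatenate(([1.0], x))                      # prepend bias input x_0 = 1
    # Step 1: partial outputs from the nodule inputs only (Eq. 2)
    p = f(W[:, : n + 1] @ xb)
    # Step 2: full node outputs, adding node-to-node connections (Eq. 3)
    y = f(W[:, : n + 1] @ xb + W[:, n + 1 : n + 1 + h] @ p)
    # Step 3: output layer of the nodule (Eq. 4)
    o = f_output(W[:, n + 1 + h :].T @ y)
    return o  # length-m vector; the network output is the sum over nodules
```

The network output would then be `sum(nodule_output(W_k, x, n, h_k, m) for each nodule k)`, matching Definition 2.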
Defined in this way, a nodule is equivalent to a subnetwork with two hidden layers with the same number of
nodes in both layers. This equivalent model is shown in Figure 2. So, the nodule of Figure 1 could be seen as
the genotype of a nodule whose phenotype is the subnetwork shown in Figure 2. This difference is important,
as the model of Figure 1 considered as a phenotype would be a recurrent network. In this representation,
the mapping from genotype to phenotype is one-to-one, so the deception problem mentioned above does not
appear.
Fig. 2. Equivalent model with two hidden layers. Every connection from an input node represents two connections, as
the input value is used in two steps (see Equations 2 and 3). Every connection from another node of the nodule
represents a connection between the first and second hidden layer (see Equation 3).
As the nodules must coevolve to develop different behaviors, we have N_s independent subpopulations of
nodules³ that evolve separately. The network will always have N_s nodules, each one from a different
subpopulation of nodules. Our task is not only developing cooperative nodules but also obtaining the best
combinations. For that reason we also have a population of networks. This population keeps track of the best
combinations of nodules and evolves as the population of nodules evolves. The whole evolutionary process is
shown in Figure 3.
Species creation is implicit: the subpopulations must coevolve complementary behaviors in order to get
useful networks, as the combination of several nodules with the same behavior when they receive the same
inputs would not produce networks with a good fitness value. So, there is no need to introduce a mechanism
for enforcing diversity that can bias the evolutionary process.
In the next two sections we will explain in depth the two populations and their evolutionary process.
A.1 Nodule population
The nodule population is formed by N_s subpopulations. Each subpopulation consists of a fixed number
of nodules codified directly as subnetworks, that is, we evolve the genotype of Figure 1, which has a one-to-
one mapping to the phenotype of Figure 2. The population is subject to the operations of replication and
mutation. Crossover is not used due to its disadvantages in evolving artificial neural networks [17]. With
these features the algorithm falls in the class of evolutionary programming [36].
There is no limitation on the structure of the nodule or on the connections among the nodes. There is only
one restriction, to avoid unnecessary complexity in the resulting nodules: there can be no connections to an
input node or from an output node.
The algorithm for the generation of a new nodule subpopulation is similar to other models proposed in the
literature, such as GNARL [17], EPNet [8], or the genetic algorithm developed by G. Bebis et al. [37]. The
steps for generating the subpopulations are the following:
The nodules of the initial subpopulation are created randomly. The number of nodes of the nodule, h, is
obtained from a uniform distribution: 0 ≤ h ≤ h_max. Each node is created with a number of connections, c,
taken from a uniform distribution: 0 ≤ c ≤ c_max. The initial value of the weights is uniformly distributed in
the interval [w_min, w_max].
The new subpopulation is generated by replicating the best P% of the former population. The remaining
(100 − P)% is removed and replaced by mutated copies of the best P%. An individual of the best P% is
selected by roulette selection and mutated. This mutated copy substitutes one of the worst (100 − P)%
individuals.
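One generation of a subpopulation could be sketched as below. `mutate` stands in for the structural and parametric mutation operators described next, and this sketch assumes positive fitness values so that roulette (fitness-proportional) selection is well defined:

```python
import numpy as np

def next_subpopulation(pop, fitness, P, mutate, rng):
    """Generate the next subpopulation of nodules.

    The best P% is replicated unchanged; the remaining (100 - P)% is
    replaced by mutated copies of elite individuals chosen by roulette
    selection over the elite's fitness.
    """
    order = np.argsort(fitness)[::-1]                # indices, best first
    n_keep = max(1, int(len(pop) * P / 100.0))
    elite = [pop[i] for i in order[:n_keep]]
    elite_fit = np.asarray([fitness[i] for i in order[:n_keep]], dtype=float)
    probs = elite_fit / elite_fit.sum()              # roulette wheel (fitness > 0 assumed)
    new_pop = list(elite)                            # replicate the best P%
    while len(new_pop) < len(pop):
        parent = elite[rng.choice(n_keep, p=probs)]
        new_pop.append(mutate(parent))               # mutated copy replaces a worst individual
    return new_pop
```

Note the subpopulation size stays fixed: elites survive verbatim and every removed individual is replaced by exactly one mutated elite copy.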
There are two types of mutation: parametric and structural. The severity of the mutation is determined by
the relative fitness, F_r, of the nodule. Given a nodule ν, its relative fitness is defined as:

$$F_r = e^{-\alpha F(\nu)}, \qquad (5)$$

where F(ν) is the fitness value of nodule ν.
Parametric mutation consists of a local search algorithm in the space of weights, a simulated annealing
algorithm [46]. This algorithm performs random steps in the space of weights. Each random step affects all
the weights of the nodule. For every weight, w_ij, of the nodule the following operation is carried out:

$$w_{ij} = w_{ij} + \Delta w_{ij}, \quad \forall w_{ij} \in \nu, \qquad (6)$$

where

$$\Delta w_{ij} \sim N(0, \beta F_r(\nu)), \qquad (7)$$

and β is a positive value that must be set by the user in order to avoid large steps in the space of weights.
The value of β used in all our experiments has been β = 0.75; in any case, Covnet is quite robust with regard
to this parameter.
³In order to maintain a coherent nomenclature we talk of one population of networks and another population of nodules. The
population of nodules is divided into N_s genetically isolated subpopulations that coevolve together.
Fig. 3. Evolutionary process of both populations. The generation of a new population for both populations, networks
and nodules, is shown in detail.
Then, the fitness of the nodule is recalculated and the usual simulated annealing criterion is applied. Let
ΔF be the difference in the fitness function before and after the random step:

If ΔF ≥ 0 the step is accepted.

If ΔF < 0 then the step is accepted with probability

$$P(\Delta F) = e^{\Delta F / T},$$

where T is the current temperature. T starts at an initial value T_0 and is updated at every step, T(t + 1) =
γT(t), 0 < γ < 1. The number of steps of the algorithm that are carried out on each parametric mutation is
very low, as performing many steps is computationally very expensive.
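Assuming the relative fitness of Equation 5 and a nodule encoded as a flat weight vector, the parametric mutation could be sketched as follows; `evaluate` (hypothetical) maps a weight vector to its fitness, higher being better:

```python
import numpy as np

def parametric_mutation(weights, evaluate, alpha=1.0, beta=0.75,
                        T0=1.0, gamma=0.9, steps=5, rng=None):
    """A few simulated-annealing steps in weight space (Eqs. 5-7)."""
    rng = rng or np.random.default_rng()
    w = weights.copy()
    fit = evaluate(w)
    T = T0
    for _ in range(steps):
        Fr = np.exp(-alpha * fit)                            # relative fitness (Eq. 5)
        cand = w + rng.normal(0.0, beta * Fr, size=w.shape)  # perturb every weight (Eqs. 6-7)
        dF = evaluate(cand) - fit                            # change in fitness
        # annealing criterion: always accept improvements, accept
        # deteriorations with probability exp(dF / T)
        if dF >= 0 or rng.random() < np.exp(dF / T):
            w, fit = cand, fit + dF
        T *= gamma                                           # cooling: T(t+1) = gamma * T(t)
    return w
```

The step size βF_r shrinks as the nodule's fitness grows, so well-adapted nodules receive gentler perturbations, which preserves the behavioral link between parent and offspring.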
Parametric mutation is always carried out after structural mutation, as it does not modify the structure of
the network.
Structural mutation is more complex because it implies a modification of the structure of the nodule. The
behavioral link between parents and their offspring must be enforced to avoid generational gaps that produce
inconsistency in the evolution. There are four different structural mutations: addition of a node without
connections, deletion of a node, addition of a connection with 0 weight, and deletion of a connection.
The nodes are added with no connections to enforce the behavioral link with the parent. As many authors
have stated [8] [17], maintaining the behavioral link between parents and their offspring is of the utmost
importance to obtain a useful algorithm.
All the above mutations are made in the mutation operation on the nodule. For each mutation there is a
minimum value, Δ_m, and a maximum value, Δ_M. The number of elements (nodes or connections) involved
in the mutation is calculated as follows:

$$\Delta = \Delta_m + F_r(\nu)(\Delta_M - \Delta_m). \qquad (8)$$

So, before making a mutation the number of elements, Δ, is calculated; if Δ = 0 the mutation is not actually
carried out.
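Equation 8 amounts to interpolating between Δ_m and Δ_M according to the relative fitness of the nodule, so worse nodules (higher F_r) undergo more drastic structural changes. A minimal sketch:

```python
import numpy as np

def mutation_count(fitness, delta_min, delta_max, alpha=1.0):
    """Number of nodes/connections affected by a structural mutation (Eq. 8)."""
    Fr = np.exp(-alpha * fitness)                    # relative fitness (Eq. 5)
    delta = int(round(float(delta_min + Fr * (delta_max - delta_min))))
    return delta                                     # if 0, the mutation is skipped
```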
There is no migration among the subpopulations. So, each subpopulation must develop different behaviors
in its nodules, that is, different species of nodules, in order to compete with the other subpopulations for
conquering its own niche and to cooperate to form networks with high fitness values.
A.2 Network population
The network population is formed by a fixed number of networks. Each network is the combination of one
nodule of each subpopulation of nodules. So the networks are strings of integer numbers of fixed length. The
value of the numbers is not significant as they are just labels of the nodules. The relationship between the
two populations can be seen in Figure 4. It is important to note that, as the chromosome that represents the
network is ordered, the permutation problem we have discussed cannot appear.
Fig. 4. Populations of networks and nodules. Each element of the network is a reference to, or a label of, an individual
of the corresponding subpopulation of nodules. So the network is a vector where the first component refers to a
nodule of subpopulation 1, the second component to a nodule of subpopulation 2, and so on.
The network population is evolved using the steady-state genetic algorithm [47] [48]. This term may lead to
confusion, as it has been proved that this algorithm shows higher variance [49] and is a more aggressive and
selective strategy [50] than the standard genetic algorithm. This algorithm is selected because we need a
population of networks that evolves more slowly than the population of nodules, as the changes in the
population of networks have a major impact on the fitness of the nodules. The steady-state genetic algorithm
avoids the negative effect that a drastic modification of the population of networks could have over the
subpopulations of nodules. As the two populations evolve in synchronous generations, the modifications in
the population of networks are less severe than the modifications in the subpopulations of nodules. It has
also been shown by some works in the area [51] [52] that the steady-state genetic algorithm produces better
solutions and is faster than the standard genetic algorithm.
In a steady-state genetic algorithm one member of the population is changed at a time. In the algorithm we
have implemented, the offspring generated by crossover replace the two worst individuals of the population
instead of replacing their parents. The algorithm allows adding mutation to the model, always at very low rates.
Crossover is made at nodule level, using a standard two-point crossover. So the parents exchange their
nodules to generate their offspring. Mutation is also carried out at nodule level. When a network is mutated
one of its nodules is selected and is substituted by another nodule of the same subpopulation selected by
means of a roulette algorithm.
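A single steady-state step could be sketched as below. The chapter does not detail how parents are chosen, so this sketch simply takes the two best networks as parents (an assumption of the sketch); networks are fixed-length integer strings, one nodule label per subpopulation:

```python
import numpy as np

def steady_state_step(networks, fitness, rng):
    """One step of the steady-state GA on the network population.

    Two parents produce offspring by standard two-point crossover at
    nodule level, and the offspring replace the two worst networks
    instead of their parents.
    """
    order = np.argsort(fitness)                          # worst first
    p1, p2 = networks[order[-1]], networks[order[-2]]    # parents: two best (assumption)
    a, b = sorted(rng.choice(len(p1) + 1, size=2, replace=False))
    c1 = np.concatenate([p1[:a], p2[a:b], p1[b:]])       # two-point crossover:
    c2 = np.concatenate([p2[:a], p1[a:b], p2[b:]])       # parents exchange nodule labels
    networks[order[0]], networks[order[1]] = c1, c2      # replace the two worst
    return networks
```

Because the chromosome positions are ordered (position k always refers to subpopulation k), swapping segments never produces the permutation problem discussed earlier.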
During the generation of the new nodule population some nodules of every subpopulation are removed and
substituted. The removed nodules are also substituted in the networks. This substitution has two advantages:
first, poor performing nodules are removed from the networks and substituted by potentially better ones;
second, the new nodules have the opportunity to participate in the networks immediately after their creation.
A.3 Fitness assignment
The assignment of fitness to networks is straightforward. Each network is assigned a fitness as a function
of its performance in solving a given problem. If the model is applied to classification, the fitness of each
network is the number of patterns of the training set that are correctly classified; if it is applied to regression,
the fitness is the sum of squared errors, and so on.
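For classification, the network fitness described above might be computed as follows; the sketch assumes the predicted class is the argmax of the summed nodule outputs, which is not stated explicitly in the text:

```python
import numpy as np

def network_fitness(outputs, targets):
    """Number of training patterns correctly classified.

    outputs: (patterns, classes) array of network outputs, already summed
    over nodules; targets: array of true class indices.
    """
    return int(np.sum(np.argmax(outputs, axis=1) == targets))
```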
Assigning fitness to the nodules is a much more complex problem. In fact, the assignment of fitness to
the individuals that form a solution in cooperative evolution is one of its key topics. The performance of the
model highly depends on that assignment. A discussion of the matter can be found in the Introduction of [9].
A credit assignment must fulfill the following requirements to be useful:
• It must enforce competition among the subpopulations, to avoid two subpopulations developing similar
responses to the same features of the data.
• It must enforce cooperation: the different subpopulations must develop complementary features that together
solve the problem.
• It must measure the contribution of a nodule to the fitness of the network, and not only the performance
of the networks where the nodule is present. A nodule in a good network must not get a high fitness if its
contribution to the performance of the network is not significant. Likewise, a nodule in a poorly performing
network must not be penalized if its contribution to the fitness of the network is positive. Otherwise, a good
nodule that is temporarily assigned to poorly rated networks could be lost in the evolution of the nodule
subpopulations.
Several methods for calculating the fitness of the nodules have been tried. The best one consists of the weighted
sum of three different criteria. These criteria, used to obtain the fitness of a nodule ν in a subpopulation π,
are the following:
Substitution (σ): k networks are selected using an elitist method, that is, the k best networks of the population.
In these networks the nodule of subpopulation π is substituted by the nodule ν, and the fitness of the network
with the substituted nodule is measured. The fitness assigned to the nodule is the averaged difference in
fitness between the networks with the original nodule and with the nodule ν. This criterion enforces
competition among nodules of the same subpopulation, as it tests whether a nodule can achieve better
performance than the rest of the nodules of its subpopulation.
The interdependencies among nodules could be a major drawback of the substitution criterion, but this does
not mean that the criterion is useless. In any case, the criterion has two important features:
• It encourages the nodules to compete within their subpopulations, rewarding the nodules most compatible
with the nodules of the rest of the subpopulations. This is true even for a distributed representation, because
it has been shown that such a representation is also modular. Moreover, as the nodules have no connections
among them, they are more independent than in a standard network.
• As many of the nodules are competing with their parents, this criterion makes it possible to measure whether
an offspring is able to improve the performance of its parent.
In addition, the neuropsychological evidence discussed above, showing that certain parts of the brain consist
of modules, supports this objective.
Difference (δ): the nodule is removed from all the networks where it is present, and the fitness is measured as
the difference in performance of these networks with and without the nodule. This criterion enforces
competition among subpopulations of nodules, preventing more than one subpopulation from developing the
same behavior. If two subpopulations evolve in the same way, the value of this criterion for their nodules will
be near 0 and the subpopulations will be penalized.
Best k (βk): the fitness is the mean of the fitness values of the best k networks where the nodule ν is present.
Only the best k are selected because the importance of the worst networks of the population must not be
significant. This criterion rewards the nodules in the best networks, and does not penalize a good nodule if it
is present in some poorly performing networks.
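Assuming, for illustration only, that a network is represented as a tuple with one nodule per subpopulation slot and that `fitness` evaluates such a tuple (with nodule removal modeled as a `None` slot), the three criteria can be sketched as:

```python
def substitution(nodule, slot, networks, fitness, k):
    """sigma: place `nodule` in position `slot` of the k best networks and
    average the resulting change in network fitness."""
    best = sorted(networks, key=fitness, reverse=True)[:k]
    diffs = []
    for net in best:
        modified = list(net)
        modified[slot] = nodule
        diffs.append(fitness(modified) - fitness(net))
    return sum(diffs) / len(diffs)

def difference(nodule, slot, networks, fitness):
    """delta: remove the nodule from every network where it appears and
    average the drop in fitness (removal modeled here as a None slot,
    an assumption about the representation)."""
    present = [net for net in networks if net[slot] == nodule]
    if not present:
        return 0.0
    drops = []
    for net in present:
        ablated = list(net)
        ablated[slot] = None
        drops.append(fitness(net) - fitness(ablated))
    return sum(drops) / len(drops)

def best_k(nodule, slot, networks, fitness, k):
    """beta_k: mean fitness of the best k networks containing the nodule."""
    fits = sorted((fitness(net) for net in networks if net[slot] == nodule),
                  reverse=True)[:k]
    return sum(fits) / len(fits) if fits else 0.0
```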
Considered independently, none of these criteria is able to fulfill the three requirements mentioned above.
Nevertheless, their weighted sum has proved to give good performance in the problems used as tests.
Typical values of the weights of the fitness components used in our experiments satisfy λσ ≈ 2λδ ≈ 60λβk.
The values of these coefficients must not only weight the importance of each criterion but also correct the
differences among their ranges.
In order to encourage small nodules we have included a regularization term in the fitness of the nodule.
Let n_n be the number of nodes of the nodule and n_c its number of connections; the effective fitness4, f′_i,
of the nodule is calculated as follows:

f′_i = f_i − ρ_n n_n − ρ_c n_c.  (9)

The values of the coefficients must be in the interval 0 < ρ_n, ρ_c << 1 in order to prevent the regularization
term from introducing a high bias in the learning process.
So, the equation of the effective fitness of the nodule ν of subpopulation π is the following:

f^π_ν = λ_σ σ + λ_δ δ + λ_βk β_k − ρ_n n_n − ρ_c n_c.  (10)

If the expression above is negative for any of the nodules of a subpopulation, then the fitness values of all the
nodules of that subpopulation are shifted, as we have mentioned above, as follows:

f^π_ν = f^π_ν − min{f^π_i : i = 1, …, N},  (11)

where N is the number of nodules of the subpopulation.
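Using the weights and regularization coefficients of Table I, Eqs. (10) and (11) can be sketched as follows (the function and constant names are ours, not the authors'):

```python
# Weights and regularization coefficients taken from Table I.
L_SIGMA, L_DELTA, L_BETA = 3.50, 1.45, 0.05
RHO_N, RHO_C = 0.25, 0.025

def effective_fitness(sigma, delta, beta_k, n_nodes, n_conns):
    """Eq. (10): weighted sum of the three criteria minus the
    regularization term penalizing nodes and connections."""
    return (L_SIGMA * sigma + L_DELTA * delta + L_BETA * beta_k
            - RHO_N * n_nodes - RHO_C * n_conns)

def shift_if_negative(fits):
    """Eq. (11): if any effective fitness in the subpopulation is
    negative, shift the whole subpopulation so the minimum becomes 0."""
    m = min(fits)
    return [f - m for f in fits] if m < 0 else list(fits)
```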
IV. Experiments
The performance of the developed model is tested on ten classification problems, with different characteristics,
from the UCI Machine Learning Repository [53]. In order to get a clear idea of the performance of the model,
we have compared it with a modular network, the adaptive mixture of local experts [54]. Each expert is a
multilayer perceptron (MLP) trained with standard back-propagation [55] and a momentum term. We have
also compared Covnet with the results reported in the literature.
For the design and training of the modular networks we used the NeuralWorks Professional II/Plus
simulator [56]. We also tried some pruning algorithms implemented in the Stuttgart Neural Network
Simulator (SNNS)5 (OBD [57], OBS [22] and Skeletonization [58]), but always with worse results.
4 It is called effective fitness because it is the actual value used as the fitness of the nodule in the generation of a new subpopulation.
5 This package can be obtained by anonymous ftp from ftp://ftp.informatik.uni-stuttgart.de/pub/SNNS.
Covnet has been programmed in C under the Linux operating system. All the tools and programs used
for its development are licensed under the GNU General Public License. Covnet's code6 is also under the
GNU General Public License.
All the parameters of Covnet are common to all the data sets used in the experiments. Such parameters are
shown in Table I. Setting the parameters for each problem specifically improves the performance of Covnet
but using the same parameters for all the problems shows the robustness of the model regarding parameter
setting.
TABLE I
Covnet parameters common to all the experiments
Parameter Value
Number of networks 100
Number of nodules on each subpopulation 40
Networks to replace on each generation 2.0%
Mutation rate on network population 5.0%
Initial value of weights (-0.5, 0.5)
Nodule elitism 70%
Input scaling interval [−2.5, 2.5]
Number of nodule subpopulations 5
Initial maximum number of nodes 3
Initial maximum number of connections 15
Nodule fitness components λσ = 3.50
λδ = 1.45
λβ3 = 0.05
Regularization term ρn = 0.25
ρc = 0.025
Simulated annealing T0 = 5.0
α = 0.95
n = 25
Minimum improvement (stop criterion) 10%
The regularization term is either used with the parameters shown in Table I or removed by setting those
parameters to 0. The second option may be used when no over-training effect is observed and the resulting
networks are small enough for the purposes of a specific task.
The population parameters (number of networks, number of nodule subpopulations and number of
nodules per subpopulation) can take a variety of values. However, increasing the values shown in this chapter
will not improve the performance, and will increase the computational cost of the evolution.
The weights of the nodule fitness subcomponents must be fixed in a way that corrects the differences among
their ranges. The values used in our experiments follow this idea. In a specific problem it could be interesting
to consider one of the subcomponents more important than the others, but that can only be tested by trial
and error.
Regularization parameters must be set as a function of the importance of parsimony in the task at hand.
Increasing the values shown in this chapter will evolve smaller networks, but will also decrease the performance
of the networks as the regularization restriction becomes more critical.
Each set of available data was divided into three sets: 50% of the patterns were used for learning, 25% of
them for validation and the remaining 25% for testing the generalization of the individuals. There are two
exceptions, the Sonar and Vowel problems, as the patterns of these two problems are prearranged in two
subsets due to their specific features. Table II shows a summary of the data sets used.
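A minimal sketch of the 50%/25%/25% split described above; the shuffling and seed handling are our assumptions, not the authors' exact procedure:

```python
import random

def split_patterns(patterns, seed=0):
    """Split a pattern list into 50% training, 25% validation and
    25% test partitions, as described in the text."""
    rng = random.Random(seed)          # reproducible shuffle (assumption)
    shuffled = list(patterns)
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train, n_val = n // 2, n // 4
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])
```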
The populations of Covnet were evolved using the training set and the validation set together, that is, no
validation set was used during the evolution. At the end of the evolution the best network, in terms of training
error, was selected as the result of the evolution. The test set was then used to obtain the generalization
error of this network.
6 The code is available upon request to the authors.
For the training of the modular networks we used cross-validation and early stopping [59].
The networks were trained until the error over the validation set started to grow. Nevertheless, the results
obtained with early stopping were worse than those obtained when the validation set was added to the
training set. Only the results with the latter configuration are shown.
In all the tables we show, for each permutation of the data sets, the averaged error of classification over 30
repetitions, the standard deviation, the best and worst individuals, and the averaged number of nodes and
connections of the best networks of each experiment. The measure of the
error is the following:

E = (1/P) Σ_{i=1}^{P} e_i,  (12)

where P is the number of patterns and e_i is 0 if pattern i is correctly classified, and 1 otherwise.
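Equation (12) translates directly into code; `predict` is again a hypothetical classifier callable standing in for the evaluated network:

```python
def classification_error(predict, patterns):
    """Eq. (12): E = (1/P) * sum(e_i), where e_i is 0 for a correctly
    classified pattern and 1 otherwise."""
    errors = sum(0 if predict(x) == y else 1 for x, y in patterns)
    return errors / len(patterns)
```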
TABLE II
Summary of the data sets. The features of each data set can be C (continuous), B (binary) or
N (nominal).
Cases Features
Data set Train Val Test Classes C B N Description
Cancer 360 175 174 2 9 There are two classes indicating whether the cancer was benign
(65.5% of the cases) or malignant (34.5%).
Card 346 172 172 2 6 4 5 There are two classes, meaning whether the application was
granted (44.5% of the patterns) or denied (55.5%).
Gene 1588 794 793 3 60 This problem consists of two subtasks: recognizing
exon/intron boundaries (referred to as EI sites), and
recognizing intron/exon boundaries (IE sites).
Glass 107 54 53 6 9 - - This data set is from the UCI Machine Learning
Repository. The set contains data from 6 different
types of glass.
Heart 134 68 68 2 6 1 6 This data set comes from the Cleveland Clinic. The goal is the
prediction of the presence or absence of heart disease in those
patients.
Horse 182 91 91 3 13 2 5 This data set is from the UCI Machine Learning Repository.
The aim is to predict the fate of a horse that has a colic: to
survive, to die, or to be euthanized.
Pima 384 192 192 2 8 - - The patterns are divided into two classes that show
whether the patient shows signs of diabetes.
Sonar 104 104 2 60 The task is to train a network to discriminate between sonar
signals bounced off a metal cylinder and those bounced off a
roughly cylindrical rock.
Soybean 342 171 170 19 6 13 16 The task is to recognize 19 different diseases of soybeans.
Vowel 528 462 11 10 Speaker independent recognition of the eleven steady
state vowels of British English.
The results obtained using Covnet and the modular neural network are shown in Table III. In boldface
we show the best results when the difference is statistically significant using a t-test at a confidence level of
95%. We can see that Covnet is able to outperform the modular network in 6 out of 10 datasets, and is
worse only in 2 problems.
The results obtained are good when they are compared with other works using these data sets. Table
IV shows a summary of the results reported in papers devoted to ensembles, modular networks or similar
classification methods. Comparisons must be made cautiously, as the experimental setup is different in
TABLE III
Error rates for the modular network and Covnet for the datasets. The best generalization error
is in boldface for every problem.
Modular network Covnet
Problem Mean Best Worst Mean Best Worst
Cancer 0.0224 0.0115 0.0345 0.0167 0.0105 0.0230
Card 0.1374 0.1163 0.1686 0.1157 0.0930 0.1395
Gene 0.1511 0.1324 0.1702 0.1398 0.1021 0.1425
Glass 0.3904 0.0745 0.1180 0.3723 0.1321 0.3019
Heart 0.1941 0.1324 0.2500 0.1426 0.0882 0.2059
Horse 0.2714 0.2308 0.3077 0.2780 0.2308 0.3407
Pima 0.2299 0.1771 0.2865 0.1990 0.1615 0.2448
Sonar 0.1875 0.1346 0.2506 0.2202 0.2019 0.2404
Soybean 0.2023 0.1235 0.2765 0.1985 0.1235 0.2412
Vowel 0.5821 0.5325 0.6710 0.5788 0.4913 0.5190
different papers. There are also differences in the methods used for estimating the generalization error. Some
of the papers use 10-fold cross-validation, which for some of the problems gives a more optimistic estimation
of the error.
TABLE IV
Results of previous works using the same data sets. We record the results of the best
method among the algorithms tested in each paper.
Data set Reference
Coop [45]1[60]2[61]3[62]1[63]2[64]2[65]2[66]1[67]2[68]2[69]1[70]2[71]1
Cancer 0.0123 0.035 0.038 0.0120 0.0310 0.0272 0.034 0.0263 0.033
Card 0.1217 0.093 0.1398 0.135 0.130 0.0910 0.1300 0.1432 0.1433
Gene 0.1238 0.0503 0.051
Glass 0.2289 0.249 0.3144 0.238 0.2518 0.2277 0.2519 0.226 0.3154 0.329
Heart 0.1196 0.151 0.197 0.166 0.1751 0.1384 0.2045 0.1604 0.1617
Horse 0.2674 0.169 0.1825
Pima 0.1969 0.226 0.244 0.234 0.221 0.223 0.1960 0.2402 0.2372 0.260
Sonar 0.1436 0.2278 0.154 0.1651 0.1529 0.163
Soybean 0.0761 0.070 0.0781 0.0757 0.0633 0.056 0.0568
Vowel 0.4587 0.5171
1 Hold-out.
2 k-fold cross-validation.
3 Best classifier.
V. Conclusions
In this chapter we have shown how a cooperative coevolutionary model for the design of artificial neural
networks can be developed. This model is based on the coevolution of several species of subnetworks (called
nodules in our model) that must cooperate to form networks for solving a given problem. Instead of trying to
evolve whole networks, a task that is not feasible in many problems or ends up with poorly performing neural
networks, we evolve these subnetworks that must cooperate in solving the given task. The nodules coevolve in
several independent subpopulations that evolve to different species. A population of networks that is evolved
by means of a steady-state genetic algorithm keeps track of the best combinations of nodules for solving the
problem.
We have also developed a new method for assigning credit to the individuals of the different species that
cooperate to form a network. This method is based on the combination of three criteria. The criteria enforce
competition within species and cooperation among species. The same idea underlying this method could be
applied to other models of cooperative coevolution.
This model has proved to perform better than standard algorithms in ten real problems of classification.
Moreover, it has shown better results than the methods of training modular neural networks by means of
gradient descent, e.g. the backpropagation learning rule, in 8 out of 10 problems.
Networks evolved by Covnet are very compact and have few sparsely distributed connections. These
networks are appropriate for hardware implementation. Moreover, the robustness to the damage of some
parts of the network is also a very interesting feature for hardware implemented neural networks.
We have also worked on a different point of view: considering the assignment of fitness to the nodules as
a multi-objective problem [72]. The optimization of each criterion would be approached by a multi-objective
evolutionary algorithm [73]. Each of the three criteria discussed above, together with a regularization term,
could be seen as a different objective for optimization.
Acknowledgments
The authors would like to thank Dr. F. Herrera-Triguero, R. Moya-Sánchez and E. Sanz-Tapia for
their help with the final version of this chapter. Part of the work reported in this chapter has been financed
by Project TIC2002-04036-C05-02 of the Spanish CICYT and FEDER funds.
References
[1] S. Haykin, Neural Networks: A Comprehensive Foundation, Macmillan College Publishing Company, New York, NY, 1994.
[2] X. Yao, “Evolving artificial neural networks,” Proceedings of the IEEE, vol. 87, no. 9, pp. 1423–1447, 1999.
[3] Y. Shang and B. W. Wah, “Global optimization for neural networks training,” IEEE Computer, vol. 29, no. 3, pp. 45–54,
1996.
[4] S-B. Cho and K. Shimohara, “Evolutionary learning of modular neural networks with genetic programming,” Applied
Intelligence, vol. 9, pp. 191–200, 1998.
[5] D. E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning, Addison–Wesley, Reading, MA, 1989.
[6] Z. Michalewicz, Genetic Algorithms + Data Structures = Evolution Programs, Springer–Verlag, New York, 1994.
[7] T. Caelli, L. Guan, and W. Wen, “Modularity in neural computing,” Proceedings of the IEEE, vol. 87, no. 9, pp. 1497–1518,
September 1999.
[8] X. Yao and Y. Liu, “A new evolutionary system for evolving artificial neural networks,” IEEE Transactions on Neural
Networks, vol. 8, no. 3, pp. 694–713, May 1997.
[9] M. A. Potter, The Design and Analysis of a Computational Model of Cooperative Coevolution, Ph.D. thesis, George Mason
University, Fairfax, Virginia, 1997.
[10] M. A. Potter and K. A. de Jong, “Cooperative coevolution: An architecture for evolving coadapted subcomponents,”
Evolutionary Computation, vol. 8, no. 1, pp. 1–29, 2000.
[11] N. García-Pedrajas, C. Hervás-Martínez, and J. Muñoz-Pérez, “Covnet: A cooperative coevolutionary model for evolving
artificial neural networks,” IEEE Transactions on Neural Networks, vol. 14, no. 3, pp. 575–596, May 2003.
[12] K. Chellapilla and D. B. Fogel, “Evolving neural networks to play checkers without relying on expert knowledge,” IEEE
Transactions on Neural Networks, vol. 10, no. 6, pp. 1382–1391, November 1999.
[13] Ch-T. Lin and Ch-P. Jou, “Controlling chaos by GA-based reinforcement learning neural network,” IEEE Transactions on
Neural Networks, vol. 10, no. 4, pp. 846–859, July 1999.
[14] S. Gallant, Neural-Network Learning and Expert Systems, MIT Press, Cambridge, MA, 1993.
[15] V. Honavar and V. L. Uhr, “Generative learning structures for generalized connectionist networks,” Inform. Sci., vol. 70,
no. 1/2, pp. 75–108, 1993.
[16] R. Parekh, J. Yang, and V. Honavar, “Constructive neural-network learning algorithms for pattern classification,” IEEE
Transactions on Neural Networks, vol. 11, no. 2, pp. 436–450, March 2000.
[17] P. J. Angeline, G. M. Saunders, and J. B. Pollack, “An evolutionary algorithm that constructs recurrent neural networks,”
IEEE Transactions on Neural Networks, vol. 5, no. 1, pp. 54–65, January 1994.
[18] R. Reed, “Pruning algorithms: A survey,” IEEE Transactions on Neural Networks, vol. 4, pp. 740–747, 1993.
[19] J. Depenau and M. Moller, “Aspects of generalization and pruning,” in Proc. World Congress on Neural Networks, 1994,
vol. III, pp. 504–509.
[20] H. H. Thodberg, “Improving generalization of neural networks through pruning,” International Journal of Neural Systems,
vol. 1, no. 4, pp. 317–326, 1991.
[21] Y. Hirose, K. Yamashita, and S. Hijiya, “Backpropagation algorithm which varies the number of hidden units,” Neural
Networks, vol. 4, pp. 61–66, 1991.
[22] B. Hassibi and D. Stork, “Second order derivatives for network pruning: Optimal brain surgeon,” in Advances in Neural
Information Systems 5, 1993.
[23] R. Kamimura and S. Nakanishi, “Weight-decay as a process of redundancy reduction,” in Proceedings of World Congress on
Neural Networks, 1994, vol. III, pp. 486–489.
[24] A. J. F. van Rooij, L. C. Jain, and R. P. Johnson, Neural Networks Training Using Genetic Algorithms, vol. 26 of Series in
Machine Perception and Artificial Intelligence, World Scientific, Singapore, 1996.
[25] S. V. Odri, D. P. Petrovacki, and G. A. Krstonosic, “Evolutional development of a multilevel neural network,” Neural
Networks, vol. 6, pp. 583–595, 1993.
[26] R. Smalz and M. Conrad, “Combining evolution with credit apportionment: A new learning algorithm for neural nets,”
Neural Networks, vol. 7, no. 2, pp. 341–351, 1994.
[27] V. Maniezzo, “Genetic evolution of the topology and weight distribution of neural networks,” IEEE Transactions on neural
networks, vol. 5, no. 1, pp. 39–53, January 1994.
[28] M. V. Borst, Local structure optimization in evolutionary generated neural networks architectures, Ph.D. thesis, Leiden
University, The Netherlands, August 1994.
[29] B. A. Whitehead and T. D. Choate, “Cooperative–competitive genetic evolution of radial basis function centers and widths
for time series prediction,” IEEE Transactions on Neural Networks, vol. 7, no. 4, pp. 869–880, July 1996.
[30] D. E. Moriarty, Symbiotic Evolution of Neural Networks in Sequential Decision Tasks, Ph.D. thesis, University of Texas at
Austin, 1997, Report AI97-257.
[31] D. E. Goldberg, “Genetic algorithms and Walsh functions: Part 2, deception and its analysis,” Complex Systems, vol. 3, pp.
153–171, 1989.
[32] D. E. Goldberg, “Genetic algorithms and Walsh functions: Part 1, a gentle introduction,” Complex Systems, vol. 3, pp.
129–152, 1989.
[33] R. K. Belew, J. McInerney, and N. N. Schraudolph, “Evolving networks: Using genetic algorithms with connectionist
learning,” Tech. Rep. CS90-174, Computer Science Engineering Department, University of California-San Diego, Feb. 1991.
[34] P. J. B. Hancock, “Genetic algorithms and permutation problems: A comparison of recombination operators for neural net
structure specification,” in Proc. Int. Workshop of Combinations of Genetic Algorithms and Neural Networks (COGANN-92),
D. Whitley and J. D. Schaffer, Eds., Los Alamitos, CA, 1992, pp. 108–122, IEEE Computer Soc. Press.
[35] J. D. Schaffer, L. D. Whitley, and L. J. Eshelman, “Combinations of genetic algorithms and neural networks: A survey of
the state of the art,” in Proceedings of COGANN-92 International Workshop on Combinations of Genetic Algorithms and
Neural Networks, L. D. Whitley and J. D. Schaffer, Eds., Los Alamitos, CA, 1992, pp. 1–37, IEEE Computer Society Press.
[36] D. B. Fogel, Evolving artificial intelligence, Ph.D. thesis, University of California, San Diego, 1992.
[37] G. Bebis, M. Georgiopoulos, and T. Kasparis, “Coupling weight elimination with genetic algorithms to reduce network size
and preserve generalization,” Neurocomputing, vol. 17, pp. 167–194, 1997.
[38] G. F. Miller, P. M. Todd, and S. U. Hedge, “Designing neural networks,” Neural Networks, vol. 4, pp. 53–60, 1991.
[39] Y. Liu and X. Yao, “Ensemble learning via negative correlation,” Neural Networks, vol. 12, no. 10, pp. 1399–1404, December
1999.
[40] B. E. Rosen, “Ensemble learning using decorrelated neural networks,” Connection Science, vol. 8, no. 3, pp. 373–384,
december 1996.
[41] D. E. Moriarty and R. Miikkulainen, “Efficient reinforcement learning through symbiotic evolution,” Machine Learning, vol.
22, pp. 11–32, 1996.
[42] A. L. Samuel, “Some studies in machine learning using the game of checkers,” Journal of Research and Development, vol. 3,
no. 3, pp. 210–229, 1959.
[43] D. W. Opitz and J. W. Shavlik, “Actively searching for an effective neural network ensemble,” Connection Science, vol. 8,
no. 3, pp. 337–353, 1996.
[44] Q. F. Zhao, O. Hammami, K. Kuroda, and K. Saito, “Cooperative co-evolutionary algorithm - How to evaluate a module?,”
in Proc. 1st IEEE Symposium of Evolutionary Computation and Neural Networks, San Antonio, TX, May 2000, pp. 150–157.
[45] X. Yao and Y. Liu, “Making use of population information in evolutionary artificial neural networks,” IEEE Transactions
on Systems, Man, and Cybernetics Part B: Cybernetics, vol. 28, no. 3, pp. 417–425, June 1998.
[46] S. Kirkpatrick, C. D. Gelatt Jr., and M. P. Vecchi, “Optimization by simulated annealing,” Science, vol. 220, no. 4598, pp.
671–680, 1983.
[47] D. Whitley and J. Kauth, “GENITOR: A different genetic algorithm,” in Proceedings of the Rocky Mountain Conference on
Artificial Intelligence, Denver, CO, 1988, pp. 118–130.
[48] D. Whitley, “The GENITOR algorithm and selective pressure,” in Proc 3rd International Conf. on Genetic Algorithms,
Morgan Kaufmann Publishers, Ed., Los Altos, CA, 1989, pp. 116–121.
[49] G. Syswerda, “Uniform crossover in genetic algorithms,” in Proc 3rd Internation Conf. on Genetic Algorithms, Morgan-
Kaufmann, Ed., 1989, pp. 2–9.
[50] D. Goldberg and K. Deb, “A comparative analysis of selection schemes used in genetic algorithms,” in Foundations of
Genetic Algorithms, G. Rawlins, Ed., pp. 94–101. Morgan Kaufmann, 1991.
[51] D. Whitley and T. Starkweather, “GENITOR II: A distributed genetic algorithm,” J. Experimental Theoretical Artificial
Intelligence, pp. 189–214, 1990.
[52] G. Syswerda, “A study of reproduction in generational and steady-state genetic algorithms,” in Foundations of Genetic
Algorithms, G. Rawlins, Ed., pp. 94–101. Morgan Kaufmann, 1991.
[53] S. Hettich, C.L. Blake, and C.J. Merz, “UCI repository of machine learning databases,” 1998,
http://www.ics.uci.edu/mlearn/MLRepository.html.
[54] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton, “Adaptive mixtures of local experts,” Neural Computation, vol.
3, pp. 79–87, 1991.
[55] D. Rumelhart, G. Hinton, and R. J. Williams, “Learning internal representations by error propagation,” in Parallel Distributed
Processing, D. Rumelhart and J. McClelland, Eds., pp. 318–362. MIT Press, Cambridge, MA, 1986.
[56] NeuralWare, Neural Computing: A Technology Handbook for Professional II/Plus, NeuralWare Inc., Pittsburgh, PA, 1993.
[57] Y. Le Cun, J. S. Denker, and S. A. Solla, “Optimal brain damage,” in Advances in Neural Information Processing (2), D. S.
Touretzky, Ed., Denver, CO, 1990, pp. 598–605.
[58] M. C. Mozer and P. Smolensky, “Skeletonization: A technique for trimming the fat from a network via relevance assessment,”
in Advances in Neural Information Processing (1), D. S. Touretzky, Ed., Denver, CO, 1989, pp. 107–155.
[59] W. Finnoff, F. Hergert, and H. G. Zimmermann, “Improving model selection by nonconvergent methods,” Neural Networks,
vol. 6, pp. 771–783, 1993.
[60] G. I. Webb, “Multiboosting: A technique for combining boosting and wagging,” Machine Learning, vol. 40, no. 2, pp.
159–196, August 2000.
[61] G. Zenobi and P. Cunningham, “Using diversity in preparing ensembles of classifiers based on different feature subsets to
minimize generalization error,” in 12th European Conference on Machine Learning (ECML 2001), L. de Raedt and P. Flach,
Eds. 2001, LNAI 2167, pp. 576–587, Springer–Verlag.
[62] C. J. Merz, “Using correspondence analysis to combine classifiers,” Machine Learning, vol. 36, no. 1, pp. 33–58, July 1999.
[63] J. Friedman, T. Hastie, and R. Tibshirani, “Additive logistic regression: A statistical view of boosting,” Annals of Statistics,
vol. 28, no. 2, pp. 337–407, 2000.
[64] Y. Liu, X. Yao, and T. Higuchi, “Evolutionary ensembles with negative correlation learning,” IEEE Transactions on
Evolutionary Computation, vol. 4, no. 4, pp. 380–387, November 2000.
[65] Y. Liu, X. Yao, Q. Zhao, and T. Higuchi, “Evolving a cooperative population of neural networks by minimizing mutual
information,” in Proc. of the 2001 IEEE Congress on Evolutionary Computation, Seoul, Korea, May 2001, pp. 384–389.
[66] Md. M. Islam, X. Yao, and K. Murase, “A constructive algorithm for training cooperative neural network ensembles,” IEEE
Transactions on Neural Networks, vol. 14, no. 4, pp. 820–834, July 2003.
[67] T. G. Dietterich, “An experimental comparison of three methods for constructing ensembles of decision trees: Bagging,
boosting, and randomization,” Machine Learning, vol. 40, pp. 139–157, 2000.
[68] S. Dzeroski and B. Zenko, “Is combining classifiers with stacking better than selecting the best one?,” Machine Learning,
vol. 54, pp. 255–273, 2004.
[69] L. Breiman, “Randomizing outputs to increase prediction accuracy,” Machine Learning, vol. 40, pp. 229–242, 2000.
[70] L. Todorovski and S. Dzeroski, “Combining classifiers with meta decision trees,” Machine Learning, vol. 50, pp. 223–249,
2003.
[71] E. Cant´u-Paz and C. Kamath, “Inducing oblique decision trees with evolutionary algorithms,” IEEE Transactions on
Evolutionary Computation, vol. 7, no. 1, pp. 54–68, February 2003.
[72] N. García-Pedrajas, C. Hervás-Martínez, and J. Muñoz-Pérez, “Multiobjective cooperative coevolution of artificial neural
networks,” Neural Networks, vol. 15, no. 10, pp. 1255–1274, November 2002.
[73] K. Deb, “Evolutionary algorithms for multi-criterion optimization in engineering design,” in Proceedings of Evolutionary
Algorithms in Engineering and Computer Science (EUROGEN’99), Jyväskylä, Finland, 30 May/3 June 1999, pp. 135–161.