Genetic Programming with a Genetic Algorithm for Feature
Construction and Selection
Matthew G. Smith and Larry Bull
Faculty of Computing, Engineering & Mathematical Sciences,
University of the West of England, Bristol BS16 1QY, U.K.
Matt@matt-smith.me.uk, Larry.Bull@uwe.ac.uk
Abstract. The use of machine learning techniques to automatically analyse data for information is
becoming increasingly widespread. In this paper we primarily examine the use of Genetic
Programming and a Genetic Algorithm to pre-process data before it is classified using the C4.5
decision tree learning algorithm. Genetic Programming is used to construct new features from
those available in the data, a potentially significant process for data mining since it gives
consideration to hidden relationships between features. A Genetic Algorithm is used to determine
which such features are the most predictive. Using ten well-known datasets we show that our
approach, in comparison to C4.5 alone, provides marked improvement in a number of cases. We
then examine its use with other well-known machine learning techniques.
1. Introduction
Classification is one of the major tasks in machine learning [17], involving the prediction of class
value based on information about some other attributes. The process is a form of inductive learning
whereby a set of pre-classified training examples are presented to an algorithm which must then
generalise from the training set to correctly categorise unseen examples. One of the most commonly
used forms of classification technique is the decision tree learning algorithm C4.5 [19]. In this paper we
examine the use of Genetic Programming (GP) [14, 4] and a Genetic Algorithm (GA) [10] to improve
the performance of, initially, C4.5 through feature construction and feature selection. Feature
construction is a process that aims to discover hidden relationships between features, inferring new
composite features. In contrast, feature selection is a process that aims to refine the list of features used
thereby removing potential sources of noise and ambiguity. We use GP individuals consisting of a
number of separate trees/automatically defined functions (ADFs) [14] to construct features for C4.5. A
GA is then used to select over the combined set of original and constructed features for a final hybrid
C4.5 classifier system. In an effort to reduce over-fitting, the training data is randomly reordered between
the two stages. Results show that the system is able to outperform standard C4.5 on a number of
datasets held at the UCI repository (http://www.ics.uci.edu/~mlearn/MLRepository.html). We then
show how similar benefits are obtained for the k-nearest neighbour algorithm [1] and Naïve Bayes [11].
Raymer et al. [20] have used ADFs for feature extraction in conjunction with the k-nearest-
neighbour algorithm. Feature extraction replaces an original feature with the result from passing it
through a functional mapping. In Raymer et al.’s approach, as with the algorithm presented here, for
problems with n features individuals consist of n ADFs whose fitness is evaluated using an external
classifier. However, in Raymer et al.'s approach each ADF is based on a single feature and evolved for
that feature only, with the aim of increasing the separation of pattern classes in the feature space; in
our approach each ADF may be constructed from multiple features. Ahluwalia and Bull [2] extended
Raymer et al.’s approach by coevolving the ADFs for each feature and adding an extra coevolving GA
population of feature selectors; extraction and selection occurred simultaneously in n+1 populations
(this is in contrast to the single population in the approach presented here, with (initially) creation and
selection in separate stages). Amit and Geman [3] have explored the joint induction of binary features
and tree classifiers for shape recognition. They grow multiple decision trees where the non-terminal
nodes are binary features based on the spatial relationships of image ‘tags’ and the terminal nodes are
labelled with an estimate of the conditional distribution over the shape classes. For other (early)
examples of evolutionary computation approaches to data mining see [21, 12] for GA-based feature
selection approaches using k-nearest-neighbour, and [9] for an overview of a special edition of the
Journal of Machine Learning Research on Variable and Feature Selection. Vafaie and DeJong [23] have
combined GP and a GA for use with C4.5. They used the GA to perform feature selection for a face
recognition dataset where feature subsets were evaluated through their use by C4.5. GP individuals
were then evolved which contained a variable number of ADFs to construct new features from the
selected subset, again using C4.5. Our approach is very similar to Vafaie and DeJong’s but initially the
feature operations are reversed such that feature construction occurs before selection (and later the two
stages are combined). We find that our approach performs as well or better than Vafaie and DeJong’s.
More recent work using GP to construct features for use by C4.5 includes that of Otero et al. [18].
They use a population of GP trees to evolve a single new feature using information gain as the fitness
measure (this is the criterion used by C4.5 to select attributes to test at each node of the decision tree).
This produces a single feature that attempts to cover as many instances as possible – a feature that aims
to be generally useful and which is appended to the set of original features for use by C4.5. Ekárt and
Márkus [8] use GP to evolve new features that are useful at specific points in the decision tree by
working interactively with C4.5. They do this by invoking a GP algorithm when constructing a new
node in the decision tree – e.g., when a leaf node incorrectly classifies some instances. Information gain
is again used as the fitness criterion but the GP is trained only on those instances relevant at that node
of the tree.
Krawiec [15] also uses GP to construct new features for use by C4.5 but instead of using
information gain as the fitness criterion uses, like the technique presented here, the “so-called wrapper
approach [13] where the evaluation consists of multiple train-and-test experiments carried out [with]
the same inducer that is used to create the final classifier”[15] (the train and test sets used in such cross-
validation are sometimes referred to as train-train and train-test as they are drawn only from the train
set). Krawiec justifies the additional computational expense involved by reporting Kohavi and John's
[13] findings that “although computationally demanding … [the wrapper approach] seems to out-
perform other methods on most tasks”. Krawiec’s algorithm produces a fixed number of new features
(4 in the experiments shown) which replace the original set of features without any subsequent
selection. Krawiec also extends the algorithm with the concept of features that are hidden from the
evolutionary process to preserve them from destruction. These features (2 in the experiments shown)
are selected according to the number of times they appear in the decision trees constructed during
fitness evaluation. While Krawiec’s approach bears some similarity with the algorithm presented here,
there are a number of differences: the fixed number of new features introduces a parameter that must be
altered for each new problem (whilst the algorithm presented here specifies a minimum number of
features, the actual number rises and falls with the number of original features in the dataset); it does not
involve any subset selection (other than the implicit selection of original features by their presence in
the ADFs); nor does it appear to allow for the inclusion of any original features found to be useful.
Song et al. [22] have used Subset Selection with Genetic Programming to avoid overfitting by
individuals to specific subsets, and also to make training on a very large dataset more efficient. Here,
we present the entire train set at all times (there is no attempt to make the training more efficient in the
same manner as Song et al.) but present the set in a different order at different times (so as to create
different train-train and train-test sets during cross-validation) to reduce overfitting by individuals to
the train set.
This paper is arranged as follows: the next section describes the initial approach; section 3 presents
results from its use on a number of well-known datasets and discusses the results. This is followed by
some amendments to the algorithm and further results; finally section 4 presents some conclusions and
future directions.
2. The GAP Algorithm
In this work we have used the WEKA [24] implementation of C4.5, known as J48, to examine the
performance of our Genetic Algorithm and Programming (GAP) approach. This is a wrapper approach,
in which the fitness of individuals is evaluated by performing 10-fold cross validation using the same
inducer as used to create the final classifier: C4.5(J48). The approach consists of two phases:
2.1 Feature Creation
An initial population of 101 genotypes is created at random. Each genotype consists of n trees,
where n is the number of numeric valued attributes in the dataset, subject to a minimum of 7. This
minimum is chosen to ensure that, for datasets with a small number of numeric features, the initial
population contains a sufficient number of compound features. A tree can be either an original feature
or an ADF. That is, a genotype consists of n GP trees (where n is the number of attributes in the
original dataset, but at least 7), each of which may contain 1 or more nodes. The chance of a node
being a leaf node (a primitive attribute) is determined by:
P(leaf) = 1 - 1 / (depth + 1)
Where depth is the depth of the tree at the current node. Hence a root node will have a depth of 1,
and therefore a probability of 0.5 of being a leaf node. Nodes at depth 2 will have a 0.67 probability of
being a leaf node, and so on. If a node is a leaf node, it takes the value of one of the original features
chosen at random. Otherwise, a function is randomly chosen from the set {*, /, +, -, %} and two child
nodes are generated. In this manner there is no absolute limit placed on the depth any one tree may
reach but the average depth is limited. This initialisation method was compared against the well known
ramped half and half method - this is discussed further in section 3.3.
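For concreteness, the following is a minimal Python sketch (ours – the authors' implementation used WEKA and is not given in the paper) of the initialisation scheme just described. The Node class and function names are illustrative, and the duplicate-tree and child-ordering restrictions described in the next paragraph are omitted.

```python
import random

FUNCTIONS = ['*', '/', '+', '-', '%']  # the function set used for internal nodes

class Node:
    """A GP tree node: a leaf holds the index of an original attribute,
    an internal node holds a function symbol and exactly two children."""
    def __init__(self, value, children=None):
        self.value = value
        self.children = children or []

def random_tree(n_attributes, depth=1):
    """Grow a tree. P(leaf) = 1 - 1/(depth + 1): the root (depth 1) is a leaf with
    probability 0.5, nodes at depth 2 with probability 0.67, and so on."""
    if random.random() < 1.0 - 1.0 / (depth + 1):
        return Node(random.randrange(n_attributes))               # leaf: an original attribute
    children = [random_tree(n_attributes, depth + 1) for _ in range(2)]
    return Node(random.choice(FUNCTIONS), children)

def random_genotype(n_numeric_attributes):
    """A genotype holds max(n, 7) trees, each either an original feature or an ADF."""
    n_trees = max(n_numeric_attributes, 7)
    return [random_tree(n_numeric_attributes) for _ in range(n_trees)]
```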
During the initial creation no two trees in a single genotype are allowed to be alike, though this
restriction is not enforced in later stages. Additionally, nodes with '-', '%' or '/' for functions cannot
have child nodes that are equal to each other. In order to enforce this, child nodes within a function '*'
or '+' are ordered alphabetically to enable comparison (e.g. [width + length] will become [length +
width]). For the sake of simplicity, no random constants are used.
Figure 1: Sample Genotype¹

¹ Genotypes have a minimum of 7 trees, but only 4 are shown here due to space constraints. The sample genotype has been constructed using a very simple dataset with 3 attributes – Area, Length and Width.
An individual is evaluated by constructing a new dataset with one feature for each tree in the
genotype. During calculation of instance values the result of any invalid calculation (e.g. division by 0,
or a calculation involving a missing value) is replaced by zero. The resulting dataset is then passed to a
C4.5(J48) classifier (using default parameters), whose performance on the dataset is evaluated using
10-fold cross validation². The percentage correct is then assigned to the individual and used as the
fitness score.

² This cross-validation is performed using the supplied train set only – it does not involve any data later used to test the effectiveness of the algorithm.
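To make the wrapper evaluation concrete, here is a minimal sketch of this fitness function under some assumptions of ours: the data sit in numeric NumPy arrays, scikit-learn's DecisionTreeClassifier stands in for C4.5(J48), the Node class from the earlier sketch is reused, and '%' is read as the modulo operator.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier  # stand-in for C4.5(J48)

def eval_tree(node, row):
    """Evaluate one GP tree on one instance; any invalid calculation
    (division by zero, missing value) yields zero, as in the GAP algorithm."""
    if not node.children:                       # leaf: an original attribute value
        v = row[node.value]
        return 0.0 if v is None or np.isnan(v) else float(v)
    a = eval_tree(node.children[0], row)
    b = eval_tree(node.children[1], row)
    try:
        if node.value == '+':   result = a + b
        elif node.value == '-': result = a - b
        elif node.value == '*': result = a * b
        elif node.value == '/': result = a / b
        else:                   result = a % b  # '%' read here as modulo
    except ZeroDivisionError:
        return 0.0
    return result if np.isfinite(result) else 0.0

def fitness(genotype, X, y, folds=10):
    """Wrapper fitness: build the constructed-feature dataset, then score it with
    10-fold cross-validation of the same kind of classifier used for the final model."""
    X_new = np.array([[eval_tree(tree, row) for tree in genotype] for row in X])
    return 100.0 * cross_val_score(DecisionTreeClassifier(), X_new, y, cv=folds).mean()
```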
Once the initial population has been evaluated, several generations of selection, crossover, mutation
and evaluation are performed. After each evaluation, if the fittest individual in the current generation is
fitter than the fittest so far, a copy of it is set aside and the generation noted. The fittest genotype is
always copied unchanged into the next generation. We use tournament selection to select the parents of
the next generation, with two-point crossover possibly occurring between the ADFs of the two selected
parents (whole trees are exchanged between genotypes)³. There is an additional chance that crossover
will occur between two ADFs at a randomly chosen locus (sub-trees are exchanged between trees at the
same position in each parent). Mutation may occur at any node, whereby a randomly created subtree
replaces the selected node. We also use a form of inversion where the order of the trees between two
randomly chosen loci is reversed.

³ Tests using this algorithm have shown no significant difference between two-point crossover with inversion and uniform crossover. Of ten datasets tested, 5 showed a higher fitness with two-point crossover and inversion and 5 a lower fitness, with no significant difference overall. Similar results were obtained comparing two-point crossover with inversion and uniform crossover with inversion.
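The genotype-level operators might be sketched as follows (again an illustration with names of ours; crossover of sub-trees within ADFs and node mutation are omitted for brevity):

```python
import random

def two_point_crossover(parent_a, parent_b):
    """Exchange whole trees between two genotypes at two randomly chosen loci."""
    child_a, child_b = list(parent_a), list(parent_b)
    span = min(len(child_a), len(child_b))
    i, j = sorted(random.sample(range(span), 2))
    child_a[i:j], child_b[i:j] = child_b[i:j], child_a[i:j]
    return child_a, child_b

def inversion(genotype):
    """Reverse the order of the trees between two randomly chosen loci."""
    g = list(genotype)
    i, j = sorted(random.sample(range(len(g)), 2))
    g[i:j + 1] = list(reversed(g[i:j + 1]))
    return g
```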
The evolutionary process continues until a minimum number of generations have passed and the
fittest genotype has reached a specific age. There is no maximum generation, but in practice very rarely
have more than 50 generations been necessary, and often fewer than 30 are required. This is still a
lengthy process, as performing 10-fold cross validation for each member of the population is very
processor intensive. The extra time required can be justified by the improvement in the results over using,
e.g., a single train and test set (results not shown). Information gain, the fitness criterion employed by
both Otero et al. and Ekárt and Márkus, is much faster but is only applicable to a single feature – it
cannot provide the fitness criterion for a set of features.
Once the termination criteria have been met the fittest individual is used to seed the feature selection
stage.
2.2 Feature Selection
The fittest individual from the feature creation stage (ties broken randomly) is analysed to see if any
of the original features do not appear. If some initial attributes do not appear as elementary (single
node) trees, they are added to the genotype as elementary trees. This genotype (with the same size as
the fittest individual, plus at most the initial number of attributes) replaces the initial individual and is
used as the basis of the second stage.
For the GA a population of 101 bit strings is randomly created. The strings have the same number of
bits as the genotype has trees – there is one bit for every attribute in the extended genotype. The last
member of the population, the 101st bit string, is not randomly created but is initialised to all ones. This
ensures that there are no missing alleles at the start of the selection process.
A new dataset is constructed with one attribute for every tree in the extended genotype. In an
attempt to reduce over-fitting of the data, the instances in the dataset are randomly reordered at this point.
This has the effect of providing a different split of the data for 10-fold cross validation during the
selection stage, giving the algorithm a chance of recognising trees that performed well only on the
particular data partition in the creation stage. As a result of the reordering, it is usually the case that the
fitness score of individuals drops at the start of the selection stage before improving again. Often
individuals in the selection stage fail to reach the same fitness levels as seen in the construction stage,
but solutions should be more robust.
Each bit string is evaluated by taking a copy of the parent dataset and removing every attribute that
has a '0' in the corresponding position in the bit string. As in the feature creation stage, this dataset is
then passed to a C4.5(J48) classifier whose performance on the dataset is evaluated using 10-fold cross
validation. The percentage correct is then assigned to the bit string and used as the fitness score.
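A minimal sketch of this selection stage, under the same assumptions as the earlier fitness sketch (NumPy arrays, a scikit-learn decision tree standing in for C4.5(J48); function names are ours), might be:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier  # stand-in for C4.5(J48)

def reorder(X, y, seed=None):
    """Randomly reorder the instances so the selection stage sees a different 10-fold split."""
    idx = np.random.default_rng(seed).permutation(len(X))
    return X[idx], y[idx]

def initial_bit_strings(n_trees, pop_size=101, seed=None):
    """Random bit strings, plus one all-ones string so no allele is missing at the start."""
    rng = np.random.default_rng(seed)
    population = [rng.integers(0, 2, n_trees).tolist() for _ in range(pop_size - 1)]
    population.append([1] * n_trees)
    return population

def selection_fitness(bit_string, X_extended, y, folds=10):
    """Fitness of a bit string: 10-fold CV accuracy on the columns it keeps."""
    keep = [i for i, bit in enumerate(bit_string) if bit == 1]
    if not keep:                                    # an all-zero string selects nothing
        return 0.0
    scores = cross_val_score(DecisionTreeClassifier(), X_extended[:, keep], y, cv=folds)
    return 100.0 * scores.mean()
```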
If the fittest individual in the current generation has a higher fitness score than the fittest so far (from
the selection stage), or it has the same fitness score and fewer ‘1’s, a copy of it is set aside and the
generation noted. As in the feature creation stage the cycles of selection, crossover, mutation, and
evaluation continue until a minimum number of generations have passed and the fittest genotype has
reached a specific age.
2.3 Experimental Settings
The parameter settings used in the experiments were as follows (except where otherwise noted in the
text):
- Population size: 101 (this is the same for both stages).
- Tournament selection: group size of 8, with a probability of 0.3 of selecting the fittest of the group (otherwise a 'winner' is selected at random from the tournament group). Tournament selection is the same for both stages.
- Crossover: the probability of crossover is 0.6 (this probability applies separately to two-point crossover exchanging complete trees between individuals and to crossover exchanging sub-trees between trees). Two-point crossover is the same for both stages.
- Mutation: there is a probability of 0.008 per node that the node will be replaced by a new sub-tree during the construction stage, and a probability of 0.005 per bit that the bit will be flipped during the selection stage.
- Inversion: inversion occurs with a probability of 0.2 per individual. Inversion is the same for both stages.
- Termination criteria: at least 10 generations have passed, and the fittest individual is at least 6 generations old. The termination criteria are the same for both stages.
Experimentation on varying these parameters (not shown) has found the algorithm to be fairly robust
to their setting.
3. Experimentation
The experiments outlined in this section had a number of goals: 1) to assess the utility of the GAP
algorithm as a feature construction method for C4.5; 2) to compare the effects of the initialisation
method outlined above with ramped half and half; 3) to compare different orders for the create and
select stages; 4) to see if creation and selection could be successfully combined in a single stage; 5) to
see if the GAP algorithm could be successfully applied to classifiers other than C4.5.
We have used ten well-known data sets from the UCI repository to examine the performance of the
GAP algorithm. The UCI datasets were chosen because they consisted entirely of numeric attributes
(though the algorithm can handle some nominal attributes, as long as there are two or more numeric
attributes – it ignores the presence of the nominal attributes, but does pass them to C4.5(J48)). Table 1
shows the details of the ten datasets used here.
Table 1: UCI dataset information.

Dataset | # Numeric features | # Nominal features | # Classes | # Instances | # Instances with missing attributes
BUPA Liver Disorder (Liver) | 6 | 0 | 2 | 345 | 0
Glass Identification (Glass) | 9 | 0 | 6 | 214 | 0
Ionosphere (Iono.) | 34 | 0 | 2 | 351 | 0
New Thyroid (NT) | 5 | 0 | 3 | 215 | 0
Pima Indians Diabetes (Diab.) | 8 | 0 | 2 | 768 | 0
Sonar | 60 | 0 | 2 | 208 | 0
Vehicle | 18 | 0 | 4 | 846 | 0
Wine Recognition (Wine) | 13 | 0 | 3 | 178 | 0
Wisconsin Breast Cancer – New (WBC New) | 30 | 0 | 2 | 569 | 0
Wisconsin Breast Cancer – Original (WBC Orig.) | 9 | 0 | 2 | 699 | 16
For performance comparisons the tests were performed twenty times on each dataset (in which 90%
of the data was used for training and 10% withheld for testing in each run).
3.1 Initial Results
The highest classification score for each dataset is marked with an asterisk in Table 2. The first
column shows the performance of the GAP algorithm on unseen test data, the third column the
performance of C4.5(J48) on test data, and the last column shows the results of the paired t-test. T-test
results that are significant at the 95% confidence level are also marked with an asterisk.
The GAP algorithm out-performs C4.5(J48) on eight out of ten datasets, and provides a significant
improvement on three (Glass Identification, New Thyroid, and Wisconsin Breast Cancer Original),
two of which are significant at the 99% confidence level. There are no datasets on which the GAP
algorithm performs significantly worse than C4.5(J48) alone.
The standard deviation of the GAP algorithm's results does not seem to differ greatly from that of
C4.5(J48); there are five datasets where the GAP algorithm's standard deviation is greater and five
where it is smaller. This is perhaps the most surprising aspect of the results, given that the GAP
algorithm is unlikely to produce the same result twice when presented with exactly the same data,
whereas C4.5(J48) will always give the same result if presented with the same data.
Table 2: Comparative performance of GAP algorithm and C4.5 (J48).

Dataset | GAP | S.D. | C4.5 (J48) | S.D. | Paired t-test
Liver | 65.97 | 11.27 | 66.37* | 8.86 | -0.22
Glass | 73.74* | 9.86 | 68.28 | 8.86 | 3.39*
Iono. | 89.38 | 4.76 | 89.82* | 4.79 | -0.34
NT | 96.27* | 4.17 | 92.31 | 4.14 | 3.02*
Diab. | 73.50* | 4.23 | 73.32 | 5.25 | 0.19
Sonar | 73.98* | 11.29 | 73.86 | 10.92 | 0.05
Vehicle | 72.46* | 4.72 | 72.22 | 3.33 | 0.20
Wine | 94.68* | 5.66 | 93.27 | 5.70 | 0.85
WBC New | 95.62* | 2.89 | 93.88 | 4.22 | 1.87
WBC Orig. | 95.63* | 1.58 | 94.42 | 3.05 | 2.11*
Overall Average | 83.12 | | 81.77 | | 2.91*
3.2 Analysis
We were interested in whether the improvement over C4.5(J48) is simply the result of the selection
stage choosing an improved subset of features and discarding the new constructed features. An analysis
of the attributes output by the GAP classifier algorithm, and the use made of them in C4.5’s decision
trees, shows this is not the case.
As noted above, the results in Table 2 were obtained from twenty runs on each of ten UCI datasets,
i.e. a total of two hundred individual solutions. In those two hundred individuals there are a total of
2,425 trees: 982 ADFs and 1,443 original features - a ratio of roughly two constructed features to three
original features. All but two of the two hundred individuals contained at least one constructed feature.
Table 3 gives details of the average number of ADFs per individual for each dataset (the number of
original features used is not shown).
Table 3: Analysis of the constructed features for each data set.

Dataset | # Features in dataset | Average # features | Average # ADFs | Minimum # ADFs
Liver | 6 | 4.9 | 2.6 | 1
Glass | 9 | 7.1 | 3.1 | 1
Iono. | 34 | 19.7 | 7.6 | 4
NT | 5 | 3.2 | 2.3 | 1
Diab. | 8 | 6.4 | 2.7 | 1
Sonar | 60 | 38.0 | 13.6 | 5
Vehicle | 18 | 16.3 | 5.5 | 2
Wine | 13 | 5.2 | 2.1 | 0
WBC New | 30 | 15.7 | 7.0 | 4
WBC Orig. | 9 | 5.0 | 2.9 | 1
Knowing that the feature selection stage continues as long as there is a reduction in the number of
attributes without reducing the fitness score, we can assume that C4.5(J48) is making good use of all
the attributes in most if not all of the winning individuals. This can be demonstrated by looking in
detail at the attributes in a single solution, and the decision tree created by C4.5(J48).
One of the best performers on the New Thyroid dataset had three trees, two of them ADFs and hence
constructed features. The original dataset contains five features and the class:
a. T3-resin uptake test. (A percentage)
b. Total Serum thyroxin as measured by the isotopic displacement method.
c. Total serum triiodothyronine as measured by radioimmuno assay.
d. Basal thyroid-stimulating hormone (TSH) as measured by radioimmuno assay.
e. Maximal absolute difference of TSH value after injection of 200 micro grams of
thyrotropin-releasing hormone as compared to the basal value.
Class attribute. (1 = normal, 2 = hyper, 3 = hypo)
In the chosen example the newly constructed features were:
“e” becomes Tree0
“((a/d)*b)” becomes Tree1
“(b/a)” becomes Tree2
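Using the attribute names a–e above, the feature construction performed by this solution can be written as the short sketch below (ours, for illustration only), with invalid calculations replaced by zero as the algorithm requires:

```python
import numpy as np

def new_thyroid_features(a, b, c, d, e):
    """Map the five original attributes (as NumPy arrays) onto the three features of the
    example solution; attribute c is not used by this solution."""
    with np.errstate(divide='ignore', invalid='ignore'):
        tree0 = e                                      # "e"
        tree1 = np.where(d != 0, (a / d) * b, 0.0)     # "((a/d)*b)"
        tree2 = np.where(a != 0, b / a, 0.0)           # "(b/a)"
    # Replace any remaining NaN/inf (e.g. from missing values) with zero.
    cleaned = [np.nan_to_num(t, nan=0.0, posinf=0.0, neginf=0.0) for t in (tree0, tree1, tree2)]
    return np.column_stack(cleaned)
```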
The decision tree created by C4.5 using these constructed features is shown in figure 2 (the numbers
after the class prediction indicate the count of instances correctly / incorrectly classified by that node).
Figure 2: New Thyroid Decision tree using constructed features
It is apparent that C4.5(J48) is using the constructed features to classify a large majority of the
instances, and only referring to an original feature (Tree0, or the original feature “e”) to classify 50 of
215 instances. The decision tree is also simpler than that created using the five original features (figure
3) – 11 nodes compared with 17, while using only four of the five original features (the original feature
“c” is not used in the construction of the new feature set).
Figure 3: New Thyroid Decision tree using original attributes
3.3 Comparison of the Initialisation Method with Ramped Half and Half
As mentioned previously, we compared the performance of the initialisation method outlined in
section 2.1 (hereafter referred to as P(leaf)) with the Ramped Half and Half method. As can be seen in
table 4, ramped half and half provides significantly improved fitness scores (with a better fitness in 8 of
10 datasets tested), but this results in a lower average test score (test scores are lower in 7 of 10
datasets).
Table 4: Comparing the Initialisation method with Ramped Half and Half.

Dataset | Train P(leaf) | S.D. | Train RHH | S.D. | Test P(leaf) | S.D. | Test RHH | S.D.
Liver | 75.04 | 2.37 | 78.10 | 1.59 | 65.97 | 11.27 | 66.52 | 9.26
Glass | 78.17 | 1.95 | 80.12 | 2.30 | 73.74 | 9.86 | 68.83 | 11.03
Iono. | 95.99 | 0.87 | 95.89 | 1.04 | 89.38 | 4.76 | 89.62 | 4.99
NT | 98.22 | 0.68 | 98.94 | 0.46 | 96.27 | 4.17 | 94.47 | 4.81
Diab. | 78.15 | 0.96 | 79.65 | 0.92 | 73.50 | 4.23 | 73.82 | 4.43
Sonar | 92.33 | 1.27 | 92.23 | 2.68 | 73.98 | 11.29 | 73.16 | 11.94
Vehicle | 78.82 | 1.21 | 79.33 | 1.76 | 72.46 | 4.72 | 72.28 | 5.00
Wine | 98.47 | 0.80 | 99.13 | 0.68 | 94.68 | 5.66 | 93.86 | 7.49
WBC New | 97.86 | 0.47 | 98.55 | 0.42 | 95.62 | 2.89 | 94.65 | 3.07
WBC Orig. | 97.65 | 0.38 | 98.05 | 0.28 | 95.63 | 1.58 | 95.63 | 2.28
Overall Average | 89.07 | | 90.00 | | 83.12 | | 82.28 |
Comparing the size and number of trees in the solutions generated using the different methods
provides a possible explanation: the solutions generated with ramped half and half have on average
0.88 more trees and 3.6 more nodes per tree than the solutions generated with P(leaf). We speculate
that the extra nodes have the same effect as too many nodes in an artificial neural network, causing
the solutions to overfit the data.
3.4 Computational Effort
As this algorithm employs a wrapper approach it is computationally expensive when compared to
C4.5 alone (though in a deployed application only the best feature set would be used, and the
computational effort would differ from C4.5 only in the time taken to transform the dataset). For
instance, on the New Thyroid dataset the algorithm runs for an average of 23.8 generations (this figure
includes both create and select stages). With a population size of 101 (and assuming an unlikely worst
case where every genotype is unfamiliar and requires fitness evaluation) this means ~2404 genotypes
requiring evaluation. As fitness evaluation involves 10-fold cross-validation on the training data, this
equates to roughly 24038 times the amount of computational effort of C4.5 alone⁵ – something in the
region of 3 minutes per run on a 1.4 GHz PC.

⁵ The computational effort required to transform the dataset is minimal compared to that required to perform 10-fold cross-validation, so it has been ignored for this calculation.
The results achieved with the GAP algorithm compare favourably with those achieved by Bagging
C4.5 [5] with the same computational effort – Bagging C4.5(J48) (using the Weka implementation of
Bagging) with 24038 iterations provides an average performance of 94.40% (and standard deviation of
3.60%) on the New Thyroid dataset (using the same 20 train/test splits as the results shown above)
compared with 96.27% for the GAP algorithm⁶. A paired t-test comparing these two methods gives a
score of 7.57 – significant at the 99% confidence level.

⁶ It should perhaps be noted that Bagging C4.5(J48) with only 10 iterations provides an average performance of 94.27% – the additional iterations add very little value on the New Thyroid dataset.
3.5 The Order of the Create and Select stages
As noted in the introduction, Vafaie and DeJong [23] have used a very similar approach to improve
the performance of C4.5. They use feature selection (GA) followed by feature construction (GP). We
have examined the performance of our algorithm as described above with the two processes occurring
in the opposite order. Results indicate that GAP gives either equivalent (e.g. Wisconsin Breast Cancer)
or better performance (e.g. New Thyroid) (Table 5). We suggest this is due to GAP's potential to
construct new features in a less restricted way, i.e. its ability to use all the original features during the
create stage. For instance, on the New Thyroid dataset the select stage will always remove either
feature a or feature b, thus preventing the create stage from being able to construct the apparently useful
feature "b/a" (see section 3.2). That is, on a number of datasets there is a significant difference (at the
95% confidence level) in the results brought about by changing the order of the stages.
Table 5: Comparison of ordering of Create and Select stages.

Dataset | Create then Select (Test) | S.D. | Select then Create (Test) | S.D.
Liver | 65.97 | 11.27 | 67.42 | 8.23
Glass | 73.74 | 9.86 | 68.75 | 6.36
Iono. | 89.38 | 4.76 | 89.02 | 4.62
NT | 96.27 | 4.17 | 93.67 | 5.82
Diab. | 73.50 | 4.23 | 74.09 | 4.46
Sonar | 73.98 | 11.29 | 73.16 | 9.05
Vehicle | 72.46 | 4.72 | 72.46 | 3.01
Wine | 94.68 | 5.66 | 94.69 | 3.80
WBC New | 95.62 | 2.89 | 95.88 | 3.15
WBC Orig. | 95.63 | 1.58 | 95.13 | 2.16
Overall Average | 83.12 | | 82.43 |
3.6 The Importance of Reordering the Dataset
In section two it was mentioned that the dataset was randomly reordered before the second stage
commenced, providing a different view of the data for 10-fold cross validation during fitness
evaluation and, it was hoped, reducing over-fitting and improving the performance on the test
data. Is this what actually happens? In order to test this hypothesis we turned off the randomisation and
retested the algorithm. The first impression is that there is no important difference between the two sets
of results – there are 5 datasets where not reordering gives a better result and 5 where it is worse.
However, there are now only two (rather than three) datasets on which the algorithm provides a
significant improvement over C4.5(J48) (New Thyroid and Wisconsin Breast Cancer); and, most
importantly, the t-test performed over the 200 runs from all datasets no longer shows a significant
improvement. The results are shown in table 6 (the column for paired t-test shows the results for testing
the algorithm without reordering against C4.5(J48)):
Table 6: Comparative performance of GAP algorithm (with and without reordering) and C4.5 (J48).

Dataset | GAP reorder | S.D. | GAP no reorder | S.D. | C4.5 (J48) | S.D. | Paired t-test
Liver | 65.97 | 11.27 | 66.65 | 7.84 | 66.37 | 8.86 | 0.17
Glass | 73.74 | 9.86 | 69.74 | 9.79 | 68.28 | 8.86 | 0.61
Iono. | 89.38 | 4.76 | 89.77 | 4.24 | 89.82 | 4.79 | -0.04
NT | 96.27 | 4.17 | 97.22 | 3.78 | 92.31 | 4.14 | 3.99
Diab. | 73.50 | 4.23 | 71.74 | 4.34 | 73.32 | 5.25 | -1.37
Sonar | 73.98 | 11.29 | 75.22 | 8.32 | 73.86 | 10.92 | 0.50
Vehicle | 72.46 | 4.72 | 71.94 | 4.43 | 72.22 | 3.33 | -0.28
Wine | 94.68 | 5.66 | 94.08 | 5.09 | 93.27 | 5.70 | 0.55
WBC New | 95.62 | 2.89 | 94.56 | 2.58 | 93.88 | 4.22 | 0.71
WBC Orig. | 95.63 | 1.58 | 95.71 | 1.59 | 94.42 | 3.05 | 2.03
Overall Average | 83.12 | | 82.66 | | 81.77 | | 1.82
3.7 Combining Creation and Selection in a single stage
Having successfully tested the algorithm with two separate stages, we redesigned it to move feature
selection into the construction stage. Feature construction occurs as before but each tree now has a bit
flag associated with it, to determine whether the tree is passed to C4.5(J48) for evaluation. During
crossover each tree retains its associated bit flag, which is subject to the same chance of mutation as
during the second stage (0.005).
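A minimal sketch of such a combined individual (illustrative only, with names of ours; the tree representation follows the earlier sketches) is:

```python
import random

class FlaggedGenotype:
    """Single-stage GAP individual: every GP tree carries a bit flag that says whether
    the tree's feature is passed to the classifier during fitness evaluation."""
    def __init__(self, trees, flags=None):
        self.trees = trees
        self.flags = flags if flags is not None else [1] * len(trees)

    def active_trees(self):
        """Only the flagged trees contribute features to the evaluated dataset."""
        return [tree for tree, flag in zip(self.trees, self.flags) if flag == 1]

    def mutate_flags(self, rate=0.005):
        """Each flag is flipped with the same per-bit probability as in the selection stage."""
        self.flags = [flag ^ 1 if random.random() < rate else flag for flag in self.flags]
```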
Testing the amended algorithm with the same parameter values as before gives a much shorter run
time (not surprisingly, roughly half the time of the two stage algorithm) but with poorer results – an
overall average of 82.20%⁷ (though this is still an improvement over unaided C4.5(J48)).

⁷ It should be noted that the results for some of the datasets have a fairly high standard deviation, and so can show some variation in the results from run to run. For this reason we have taken to using the average result over all 10 datasets as a useful (and briefer!) indicator of the performance of the algorithm.
There are three primary differences between the two versions of the algorithm that may account for
this drop in performance:
1. With a single stage we are asking the algorithm to do the same amount of work in half the
time.
2. There is no longer an opportunity to randomly reorder the dataset between stages.
3. There is no longer an opportunity to reintroduce any original attributes that have been
dropped during the first stage.
There seems no reasonable way to address the third of these differences with a single stage
approach, but the other two can be compensated for. Firstly we can change the termination criteria – by
doubling both the minimum number of generations to 20 and the age of the fittest individual to 12
generations. Doing this does improve the result (to an overall average result of 82.88%) but not
sufficiently to bring it into line with a two stage process.
Additionally we can randomly reorder the dataset. We considered two approaches to this. The first
was to have two versions of the dataset from the start, with the same data but in a different order, and
simply alternate between datasets when evaluating each generation (i.e. the first dataset was used to
evaluate even numbered generations, the second to evaluate odd numbered generations) – this approach
did not seem to improve the results (slightly worse than having no reordering, at 82.24%). The second,
more successful, approach was to reorder the dataset once the termination criteria had been reached. That is,
run as before but when the fittest individual reaches 12 generations old reorder the dataset, re-evaluate
the current generation and reset the fittest individual, then continue until the termination criteria are met
again. The results obtained with a longer run time and randomly reordering the dataset part-way
through are shown in table 7 (the column for paired t-test shows the results for testing the single stage
algorithm against C4.5(J48)):
Table 7: Comparative performance of a single stage and C4.5 (J48).

Dataset | 2 stage | S.D. | 1 stage | S.D. | C4.5 (J48) | S.D. | Paired t-test
Liver | 65.97 | 11.27 | 66.55 | 8.10 | 66.37 | 8.86 | 0.11
Glass | 73.74 | 9.86 | 71.84 | 10.26 | 68.28 | 8.86 | 1.78
Iono. | 89.38 | 4.76 | 90.69 | 4.66 | 89.82 | 4.79 | 0.96
NT | 96.27 | 4.17 | 96.49 | 3.98 | 92.31 | 4.14 | 3.69
Diab. | 73.50 | 4.23 | 73.64 | 5.11 | 73.32 | 5.25 | 0.24
Sonar | 73.98 | 11.29 | 75.89 | 9.00 | 73.86 | 10.92 | 0.80
Vehicle | 72.46 | 4.72 | 72.11 | 4.60 | 72.22 | 3.33 | -0.09
Wine | 94.68 | 5.66 | 96.10 | 4.08 | 93.27 | 5.70 | 1.76
WBC New | 95.62 | 2.89 | 95.71 | 4.39 | 93.88 | 4.22 | 1.71
WBC Orig. | 95.63 | 1.58 | 95.56 | 2.52 | 94.42 | 3.05 | 1.62
Overall Average | 83.12 | | 83.46 | | 81.77 | | 3.62
Although the single stage algorithm out-performs C4.5(J48) on only one dataset at the 95%
confidence level (a t-test of 1.96 or higher), as compared to three datasets for the two stage algorithm, it
outperforms C4.5(J48) on everything but the Vehicle dataset (and then loses only by a very small
margin). It also improves on the performance of the two stage version on seven out of ten datasets,
resulting in an increase of the (already high) overall confidence of improvement over C4.5(J48).
3.8 Applying GAP to other Classification Algorithms
As the version of C4.5 used is part of the Weka package, it is a simple matter to replace C4.5(J48)
with different classifiers and thus test the GAP algorithm with a number of different classification
techniques. We replaced C4.5(J48) with IBk (a k-nearest neighbour classifier [1] with k=1) and Naïve
Bayes (a probability based classifier [11]). Tables 8 and 9 present the results of using GAP with these
algorithms:
Table 8: Results with IBk.

Dataset | GAP | S.D. | IBk | S.D. | Paired t-test
Liver | 60.89 | 7.65 | 62.62 | 8.68 | -0.86
Glass | 73.92 | 9.10 | 68.79 | 9.75 | 2.19
Iono. | 91.38 | 4.00 | 86.95 | 4.53 | 3.62
NT | 95.36 | 3.28 | 96.95 | 3.73 | -1.91
Diab. | 68.96 | 6.27 | 69.90 | 4.27 | -0.64
Sonar | 83.72 | 7.88 | 86.65 | 6.39 | -1.39
Vehicle | 72.46 | 5.42 | 70.03 | 4.10 | 1.87
Wine | 96.00 | 4.88 | 95.44 | 5.81 | 0.52
WBC New | 94.82 | 3.25 | 95.44 | 2.92 | -1.26
WBC Orig. | 95.64 | 2.51 | 95.42 | 2.16 | 0.41
Overall Average | 83.31 | 5.42 | 82.82 | 5.23 | 1.01
Table 8 shows that IBk on its own offers marginally better performance over the ten datasets than
C4.5 (on average only – there are several individual datasets on which C4.5 performs better) and
perhaps offers the GAP algorithm less scope for improvement. There is no significant overall
improvement over IBk at the 95% level and no improvement at all on half the datasets, but there are
two datasets on which GAP does offer a significant improvement (Glass and Ionosphere) and none
where there is a significant drop in performance. Overall the result is competitive with GAP using
C4.5.
Table 9: Results with Naïve Bayes.

Dataset | GAP | S.D. | N.B. | S.D. | Paired t-test
Liver | 71.35 | 8.51 | 54.19 | 9.61 | 5.82
Glass | 61.45 | 7.73 | 48.50 | 14.03 | 4.23
Iono. | 90.60 | 4.38 | 82.37 | 7.31 | 5.09
NT | 97.18 | 3.48 | 97.20 | 3.83 | -0.02
Diab. | 75.77 | 4.83 | 75.13 | 5.00 | 0.59
Sonar | 77.64 | 8.77 | 67.16 | 7.53 | 5.80
Vehicle | 69.03 | 4.96 | 43.98 | 4.61 | 20.69
Wine | 96.40 | 4.80 | 97.99 | 3.85 | -1.47
WBC New | 96.75 | 2.20 | 93.26 | 4.79 | 4.04
WBC Orig. | 95.92 | 2.16 | 96.06 | 1.74 | -0.32
Overall Average | 83.21 | 5.18 | 75.58 | 6.23 | 9.68
By looking at the overall average of table 9 it can be quickly seen that, on its own, Naïve Bayes
cannot compete with C4.5 or IBk over the ten datasets – it is approx. 6% worse on average (though
again there are individual datasets where Naïve Bayes performs better than the other two, e.g., New
Thyroid, Wine). This relatively poor performance gives the GAP algorithm much greater scope for
improvement. In fact, the GAP algorithm brings the results very closely into line with those achieved
using C4.5 and IBk. Further, GAP with Naïve Bayes outperforms both IBk and C4.5 on their own when
averaged over the ten datasets.
While neither of the two new classifiers provides an improvement on GAP's overall result using C4.5,
both results are competitive regardless of the performance of the classifier on its own. It seems as if
there is a ceiling on the overall results achievable with any one classifier. While using any particular
classifier GAP may perform well on some datasets and worse on others, the average seems to settle out
at somewhere near 83% no matter which classifier is employed. That is, GAP appears to provide a
robustness to the classifier techniques used.
3.9 A Rough Comparison to other Algorithms
Table 10 presents a number of published results we have found regarding the same ten UCI datasets
using other machine learning algorithms. The first 3 columns present results for the single stage version
of the algorithm using C4.5(J48), IBk and Naïve Bayes; the next 3 columns present the performance
of those algorithms on their own. Cells in the table are left blank where algorithms were not tested on
the dataset in question. The highest classification score for each dataset is marked with an asterisk.
Table 10: Performance of GAP and other algorithms on the UCI datasets.

Dataset | GAP (J48) | GAP (IBK) | GAP (NB) | C4.5 (J48) | IBk | N.B. | HIDER | XCS | O.F.A. | LSVM | Krawiec
Liver | 66.55 | 60.89 | 71.35* | 66.37 | 62.62 | 54.19 | 64.29 | 67.85 | 57.01 | 68.68 |
Glass | 71.84 | 73.92* | 61.45 | 68.28 | 68.79 | 48.50 | 70.59 | 72.53 | 69.56 | | 66.39
Iono. | 90.69 | 91.38* | 90.60 | 89.82 | 86.95 | 82.37 | | | | 87.75 |
NT | 96.49 | 95.36 | 97.18 | 92.31 | 96.95 | 97.20* | | | | |
Diab. | 73.64 | 68.96 | 75.77 | 73.32 | 69.90 | 75.13 | 74.1 | 68.62 | 69.8 | 78.12* | 76.41
Sonar | 75.89 | 83.72 | 77.64 | 73.86 | 86.65* | 67.16 | 56.93 | 53.41 | | | 79.96
Vehicle | 72.11 | 72.46* | 69.03 | 72.22 | 70.03 | 43.98 | | | | |
Wine | 96.10 | 96.00 | 96.40 | 93.27 | 95.44 | 97.99 | 96.05 | 92.74 | 98.27* | |
WBC New | 95.71 | 94.82 | 96.75* | 93.88 | 95.44 | 93.26 | | | | |
WBC Orig. | 95.56 | 95.64 | 95.92 | 94.42 | 95.42 | 96.06 | 95.71 | 96.27* | | | 94.39
The results for HIDER and XCS were obtained from [7], those for O.F.A. (‘Ordered Fuzzy
ARTMAP’, a neural network algorithm) from [6], LSVM (Lagrangian Support Vector Machines) from
[16] and Krawiec from [15].
The results are by no means an exhaustive list of current machine learning algorithms, nor are they
guaranteed to be the best performing algorithms available, but they give some indication of the relative
performance of our approach – which appears to be very good.
4. Conclusion
In this paper we have presented an approach to improve the classification performance of the well-
known induction algorithm C4.5. We have shown that GP individuals consisting of multiple
trees/ADFs can be used for effective feature creation and that solutions, combined with feature
selection via a GA in either a separate or the same stage, can give significant improvements to the
classification accuracy of C4.5. We have also indicated that randomly reordering the dataset part-way
through the process may help to reduce the problem of over-fitting. We have shown that the same
algorithm can be used successfully with more than one type of classifier. Given that table 10 shows
that using a classifier more appropriate to the dataset improves the results of the GAP algorithm, future
work will look at the possibility of using evolution to select the most appropriate classifier for a
particular problem.
References
1. D. Aha and D. Kibler, “Instance-based learning algorithms”, in Machine Learning vol.6,
1991, pp. 37-66.
2. M. Ahluwalia and L. Bull, “Co-Evolving Functions in Genetic Programming: Classification
using k-nearest neighbour” in GECCO-99: Proceedings of the Genetic and Evolutionary Computation
Conference, W. Banzhaf, J. Daida, G. Eiben, M-H. Garzon, J. Honavar, K. Jakeila, R. Smith (eds).
Morgan Kaufmann: San Mateo, 1999, pp. 947–952.
3. Y. Amit and D. Geman, “Shape Quantization and Recognition With Randomized Trees”, in
Neural Computation vol. 9-7:1545-1588, 1996.
4. W. Banzhaf, P. Nordin, R. E. Keller and F. D. Francone, "Genetic Programming - An
Introduction On the Automatic Evolution of Computer Programs and Its Applications”, Morgan
Kaufmann: San Mateo, 1998.
5. L. Breiman, "Bagging predictors," Machine Learning, vol.24, no.2, pp.123--140, 1996.
6. I. Dagher, M. Georgiopoulos, G.L. Heileman, and G. Bebis, "An Ordering Algorithm for
Pattern Presentation in Fuzzy ARTMAP That Tends to Improve Generalization Performance", IEEE
Transactions on Neural Networks 10(4), 1999, pp. 768-778.
7. P. Dixon, D. Corne, and M. Oates, “A Preliminary Investigation of Modified XCS as a
Generic Data Mining Tool”, in Advances in Learning Classifier Systems, P-L. Lanzi, W. Stolzmann, S.
Wilson (eds). Springer, 2001, pp.133-151.
8. A. Ekárt and A. Márkus, “Using Genetic Programming and Decision Trees for Generating
Structural Descriptions of Four Bar Mechanisms”, to appear in Artificial Intelligence for Engineering
Design, Analysis and Manufacturing, volume 17, issue 3. 2003.
9. I. Guyon and A. Elisseeff, "An Introduction to Variable and Feature Selection", in Journal of
Machine Learning Research 3, 2003, pp. 1157-1182.
10. J. Holland, “Adaptation in Natural and Artificial Systems”. Univ. Michigan, 1975.
11. G. John and P. Langley, “Estimating Continuous Distributions in Bayesian Classifiers”, in
Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann:
San Mateo, 1995, pp. 338-345.
12. J. Kelly and L. Davis, “Hybridizing the Genetic Algorithm and the K Nearest Neighbors
Classification Algorithm”, in Proceedings of the Fourth International Conference on Genetic
Algorithms, R. Belew and L. Booker (eds). Morgan Kaufmann: San Mateo, 1991, pp. 377-383.
13. R. Kohavi and G. John, "Wrappers for feature subset selection", in Artificial Intelligence
vol. 97, no. 1-2, pp. 273-324, 1997.
14. J. Koza, “Genetic Programming”. MIT Press, 1992.
15. K. Krawiec, “Genetic Programming-based Construction of Features for Machine Learning and
Knowledge Discovery Tasks”, in Genetic Programming and Evolvable Machines vol. 3 no. 4, 2002, pp.
329-343.
16. O. Mangasarian and D. Musicant, “Lagrangian support vector machines”, in Journal of
Machine Learning Research 1, 2001, pp.161-177.
17. T. M. Mitchell, "Machine Learning", McGraw-Hill: New York, 1997.
18. F. Otero, M. Silva, A. Freitas, and J. Nievola, "Genetic Programming for Attribute
Construction in Data Mining", in Proceedings of Genetic Programming: 6th European Conference,
EuroGP 2003, Essex, UK, Springer, 2003, pp. 384-393.
19. J. Quinlan, “C4.5: Programs for Machine Learning” Morgan Kaufmann: San Mateo, 1993.
20. M. Raymer, W. Punch, E. Goodman, and L. Kuhn, “Genetic Programming for Improved Data
Mining - Application to the Biochemistry of Protein Interactions”, in Proceedings of the Second
Annual Conference on Genetic Programming, J. Koza, K. Deb, M. Dorigo, D. Fogel, M.Garzon, H. Iba
and R. Riolo (eds), Morgan Kaufmann: San Mateo, 1996, pp. 375-380.
21. W. Siedlecki and J. Sklansky, “On Automatic Feature Selection”, in International Journal of
Pattern Recognition and Artificial Intelligence 2, 1988, pp. 197-220.
22. D. Song, M. I. Heywood and A. Nur Zincir-Heywood, “A Linear Genetic Programming
Approach to Intrusion Detection”, in Genetic and Evolutionary Computation – GECCO-2003, E.
Cantú-Paz et al. (eds), 2003, pp. 2325-2336.
23. H. Vafaie and K. De Jong, “Genetic Algorithms as a Tool for Restructuring Feature Space
Representations”, in Proceedings of the International Conference on Tools with A.I., IEEE Computer
Society Press, 1995.
24. I. Witten and E. Frank, “Data Mining: Practical Machine Learning Tools and Techniques with
Java Implementations”, Morgan Kaufmann: San Mateo, 2000.