Fusion of multiple approximate nearest neighbor classifiers
for fast and efficient classification
P. Viswanath *, M. Narasimha Murty, Shalabh Bhatnagar *
Department of Computer Science and Automation, Indian Institute of Science, Bangalore 560 012, India
Received 8 May 2003; received in revised form 25 February 2004; accepted 25 February 2004
Available online 20 March 2004
Abstract
The nearest neighbor classifier (NNC) is a popular non-parametric classifier. It is a simple classifier with no design phase and shows good performance. Important factors affecting the efficiency and performance of NNC are (i) the memory required to store the training set, (ii) the classification time required to search the nearest neighbor of a given test pattern, and (iii) the curse of dimensionality, due to which it becomes severely biased when the dimensionality of the data is high and only finite samples are available. In this paper we propose (i) a novel pattern synthesis technique to increase the density of patterns in the input feature space, which can reduce the curse of dimensionality effect, (ii) a compact representation of the training set to reduce the memory requirement, (iii) a weak approximate nearest neighbor classifier which has constant classification time, and (iv) an ensemble of the approximate nearest neighbor classifiers where the individual classifiers' decisions are combined by majority vote. The ensemble has a constant classification time upper bound and, according to empirical results, it shows good classification accuracy. A comparison based on empirical results is drawn between our approaches and other related classifiers.
© 2004 Elsevier B.V. All rights reserved.
Keywords: Multi-classifier fusion; Ensemble of classifiers; Nearest neighbor classifier; Pattern synthesis; Approximate nearest neighbor classifier; Compact representation
1. Introduction
The nearest neighbor classifier (NNC) is a very popular non-parametric classifier [1,2]. It is widely used because of its simplicity and good performance. It has no design phase but simply stores the training set. The test pattern is classified to the class of its nearest neighbor in the training set. So the classification time required for the NNC is largely due to reading the entire training set to find the nearest neighbor(s) (we assume that the training set is not preprocessed, e.g., indexed, to reduce the time needed to find the neighbor). Thus two major shortcomings of the classifier are that (i) the entire training set needs to be stored and (ii) the entire training set needs to be searched. To add to this list, when the dimensionality of the data is high, it becomes severely biased with a finite training set due to the curse of dimensionality [2].

Cover and Hart [3] show that the error for the NNC is bounded by twice the Bayes error when the available sample size is infinite. However, in practice, one can never have an infinite number of training samples. With a fixed number of training samples, the error of the NNC tends to increase as the dimensionality of the data gets large. This is called the peaking phenomenon [4,5]. Jain and Chandrasekharan [6] point out that the number of training samples per class should be about 5–10 times the dimensionality of the data. The peaking phenomenon with the NNC is known to be more severe than with parametric classifiers such as Fisher's linear and quadratic classifiers [7,8]. Thus, it is widely believed that the size of the training set needed to achieve a given classification accuracy would be prohibitively large when the dimensionality of the data is high.

Increasing the training set size has two problems. These are: (i) space and time requirements get increased
* Corresponding authors. Tel.: +91803942368; fax: +910803600683.
E-mail addresses: viswanath@csa.iisc.ernet.in (P. Viswanath), mnm@csa.iisc.ernet.in (M. Narasimha Murty), shalabh@csa.iisc.ernet.in (S. Bhatnagar).
Information Fusion 5 (2004) 239–250. doi:10.1016/j.inffus.2004.02.003
and (ii) it may be expensive to get training patterns from the real world. The space requirement problem can be solved to some extent by using a compact representation of the training data, like the PC-tree [9], FP-tree [10], CF-tree [11], etc., or by using editing techniques [1] which reduce the training set size without affecting the performance. The classification time requirement problem can be solved by building an index over the training set, like the R-tree [12], while the curse of dimensionality problem can be tackled by using re-sampling techniques like bootstrapping [13], which are widely studied [14–18]. These remedies are orthogonal, i.e., they have to be applied one after the other (they cannot be combined into a single step). This paper, however, attempts to give a unified solution.

In this paper, we propose a novel bootstrap technique for NNC design, which we call partition based pattern synthesis, that reduces the curse of dimensionality effect. The artificial training set generated by this method can be exponentially larger than the original set (by the original set we mean the given training set). As a result the synthetic patterns cannot be explicitly stored. We propose a compact data structure called the partitioned pattern count tree (PPC-tree), which is a compact representation of the original set and is suitable for performing the synthesis. The classification time requirement problem is solved as follows. Finding an approximate nearest neighbor (NN) is computationally less demanding since it avoids an exhaustive search of the training set. We propose an approximate NN classifier called PPC-aNNC whose classification time is independent of the training set size. PPC-aNNC works directly with the PPC-tree, performs implicit pattern synthesis, and finds an approximate nearest neighbor of the given test pattern from the entire synthetic set. Thus an explicit bootstrap step to generate the artificial training set is avoided. However, PPC-aNNC is a weak classifier having a lower classification accuracy (CA) than NNC. Classification decision fusion of multiple PPC-aNNC's is empirically shown to achieve better CA than the conventional NNC. This ensemble of PPC-aNNC's is based on a simple majority voting technique and is suitable for parallel implementation. The proposed ensemble is a faster and better classifier than NNC and some of the classifiers of its kind. PPC-tree and PPC-aNNC assume discrete valued features. For other domains, the data sets need to be discretized appropriately.
Some of the earlier attempts at combining nearest neighbor (NN) classifiers are as follows. Breiman [19] experimentally demonstrated that combining NN classifiers does not improve performance as compared to that of a single NN classifier. He attributed this behavior to the characteristic of NN classifiers that the addition or removal of a small number of training instances does not change NN classification boundaries significantly. Decision trees and neural networks, he argued, are in this sense less stable than NN classifiers. In his experiments, the component NN classifiers stored a large number of prototypes, so the combination is computationally less efficient as well. Skalak [20] used a few selected prototypes for each component NN classifier and showed that the composite classifier outperforms the conventional NNC. Alpaydin [21] used multiple condensed sets generated by accessing the training set in various random orders. Each individual NNC works with a condensed training set and the final decision is made by taking a majority vote (either simple or weighted) of the individual classifiers. Experimental results show that this improves performance. Kubat and Chen [22] propose an ensemble of several NNCs, such that each independent classifier considers only one of the available features. Class assignment to new patterns is done through weighted majority voting of the individual classifiers. This does not work well in domains where the mutual inter-correlation between pairs of attributes is high. Bay [23] combined multiple NN classifiers where each component uses only a random subset of the features. Experimentally this also is shown to improve performance in most cases.

Hamamoto et al. [18] proposed a bootstrap technique for NNC design which is experimentally shown to perform well. In their approach, each training pattern is replaced by a weighted average (which is the centroid if the weights are equal) of its r nearest neighbors in the training set.
We present experimental results in this paper with six different data sets (having both discrete and continuous valued features), and a comparison is drawn between our approaches and (i) NNC, (ii) k-NNC, (iii) the Naive–Bayes classifier, (iv) NNC based on the bootstrap technique given by Hamamoto et al. [18], (v) voting over multiple condensed nearest neighbors [21] and (vi) the weighted nearest neighbor with feature projection [22].

This paper is organized as follows: partition based pattern synthesis is described in Section 2, the compact data structures in Section 3, PPC-aNNC in Section 4.2, the ensemble of PPC-aNNC's in Section 4.3, experimental results in Section 5 and conclusions in Section 6, respectively.
2. Partition based pattern synthesis
We use the following notation and definitions to describe partition based pattern synthesis and various other concepts throughout this paper.
2.1. Notation and definitions
Set of features:
$F = \{f_1, f_2, \ldots, f_d\}$ is the set of features. Feature $f_i$ takes its value from domain $D_i$ ($1 \le i \le d$).

Pattern:
$X = (x_1, x_2, \ldots, x_d)^T$ is a pattern in $d$-dimensional vector format. $X[f_i]$ is the feature-value of pattern $X$ for feature $f_i$, with $X[f_i] \in D_i$ ($1 \le i \le d$). Thus, $X[f_i] = x_i$ for pattern $X$.

Set of class labels:
$\Omega = \{1, 2, \ldots, c\}$ is the set of class labels. Each training pattern has a class label.

Set of training patterns:
$\mathcal{X}$ is the set of all training patterns. $\mathcal{X}_l$ is the set of training patterns for the class with label $l$, and $\mathcal{X} = \mathcal{X}_1 \cup \mathcal{X}_2 \cup \cdots \cup \mathcal{X}_c$.

Partition:
$\pi_l = \{B_1, B_2, \ldots, B_p\}$ is a partition of $F$ for the class with label $l$, i.e., $B_i \subseteq F$ for all $i$, $\bigcup_i B_i = F$, and $B_i \cap B_j = \emptyset$ if $i \ne j$, for all $i, j$. The set of partitions is $P = \{\pi_l \mid 1 \le l \le c\}$.

Sub-pattern:
A pattern for which zero or more feature-values are absent (missing or unknown) is called a sub-pattern. An absent feature-value is represented by $H$. Thus, if $Y$ is a sub-pattern, then $Y[f_i] \in D_i \cup \{H\}$, $1 \le i \le d$.

Scheme of a sub-pattern:
A sub-pattern $Y$ is said to be of scheme $S$, where $S \subseteq F$, if for $1 \le i \le d$, $Y[f_i] \in D_i$ if $f_i \in S$, and $Y[f_i] = H$ otherwise.

Sub-pattern of a pattern:
$X^S$ is said to be the sub-pattern of pattern $X$ with scheme $S$, provided $X^S[f_i] = X[f_i]$ if $f_i \in S$, and $X^S[f_i] = H$ otherwise.

Set of sub-patterns:
A collection of sub-patterns with all members having the same scheme. A collection of sub-patterns of different schemes is not a set of sub-patterns. We further define the set of sub-patterns for a set of patterns with respect to a scheme as follows. If $\mathcal{W}$ is a set of patterns, then $\mathcal{W}^S$, called the set of sub-patterns of $\mathcal{W}$ with respect to scheme $S$, is $\mathcal{W}^S = \{W^S \mid W \in \mathcal{W}\}$.

Merge operation ($\oplus$):
If $P, Q$ are two sub-patterns of schemes $S_i, S_j$ respectively, then the merge of $P$ and $Q$, written $P \oplus Q$, is a sub-pattern of scheme $S_i \cup S_j$ and is defined only if $S_i \cap S_j = \emptyset$. If $R = P \oplus Q$, then for $1 \le k \le d$, $R[f_k] = P[f_k]$ if $f_k \in S_i$; $R[f_k] = Q[f_k]$ if $f_k \in S_j$; and $R[f_k] = H$ otherwise.

Join operation ($\oplus$):
If $Y_m, Y_n$ are sets of sub-patterns of schemes $S_i, S_j$ respectively, then the join of $Y_m$ and $Y_n$, written $Y_m \oplus Y_n$, is defined only if $S_i \cap S_j = \emptyset$, and $Y_m \oplus Y_n = \{R \mid R = P \oplus Q,\; P \in Y_m,\; Q \in Y_n\}$. The join operation is commutative and associative (this follows directly from the definitions of the join and merge operations). So $Y_m \oplus (Y_n \oplus Y_o) = (Y_m \oplus Y_n) \oplus Y_o$, which is written as $Y_m \oplus Y_n \oplus Y_o$.
2.2. Synthetic pattern generation
The method of synthetic pattern generation is as follows.

(1) Choose an appropriate set of partitions $P = \{\pi_l \mid 1 \le l \le c\}$, where $\pi_l = \{B_1, B_2, \ldots, B_p\}$ is a partition of $F$ for the class with label $l$.
(2) The set of training patterns for the class with label $l$, that is $\mathcal{X}_l$, is replaced by its synthetic counterpart $SP(\mathcal{X}_l)$, where $SP(\mathcal{X}_l) = \mathcal{X}_l^{B_1} \oplus \mathcal{X}_l^{B_2} \oplus \cdots \oplus \mathcal{X}_l^{B_p}$.
(3) Repeat step 2 for each label $l \in \Omega$.

Note 1. The partition can be different for each class. However, we assume $|\pi_l| = p$, a constant, for all $l \in \Omega$. This simplifies the analysis of the classification methods and the cross-validation method described in subsequent sections.

Note 2. If each pattern is seen as an ordered tuple, then $\mathcal{X}_l \subseteq SP(\mathcal{X}_l) \subseteq D_1 \times D_2 \times \cdots \times D_d$.
2.3. Example
This example illustrates the concept of synthetic pattern generation. Let $F = \{f_1, f_2, f_3, f_4\}$, $D_1 = \{red, green, blue\}$, $D_2 = \{2, 3, 4, 5\}$, $D_3 = \{small, big\}$ and $D_4 = \{1.75, 2.04\}$, respectively.

Let $\mathcal{X}_l = \{(red, 3, big, 1.75)^T, (green, 2, small, 1.75)^T\}$ be the training set for the class with label $l$. Also, let the partition for this class be $\pi_l = \{B_1, B_2\}$, where $B_1 = \{f_1, f_3\}$ and $B_2 = \{f_2, f_4\}$. Then $\mathcal{X}_l^{B_1} = \{(red, H, big, H)^T, (green, H, small, H)^T\}$ and $\mathcal{X}_l^{B_2} = \{(H, 3, H, 1.75)^T, (H, 2, H, 1.75)^T\}$, respectively.

The set of synthetic patterns for class $l$, i.e., $SP(\mathcal{X}_l)$, is
$SP(\mathcal{X}_l) = \mathcal{X}_l^{B_1} \oplus \mathcal{X}_l^{B_2} = \{(red, 3, big, 1.75)^T, (red, 2, big, 1.75)^T, (green, 3, small, 1.75)^T, (green, 2, small, 1.75)^T\}$.
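To make the synthesis step concrete, the following Python sketch enumerates $SP(\mathcal{X}_l)$ for one class under the above definitions; the helper names (project, synthesize_class) are ours, and explicit enumeration is only practical for small sets, since the size can grow as $n^p$.

```python
from itertools import product

def project(pattern, block):
    """Sub-pattern of `pattern` restricted to the features in `block`.
    `pattern` maps feature name -> value; features outside `block` are
    simply omitted (they play the role of the absent value)."""
    return tuple((f, pattern[f]) for f in block)

def synthesize_class(patterns, partition):
    """Partition based pattern synthesis for one class (explicit version).
    patterns  : list of dicts (feature name -> value), the original set X_l
    partition : list of blocks, each an ordered list of feature names
    Returns SP(X_l) as a list of dicts; its size can reach n**p."""
    # The sets of distinct sub-patterns X_l^{B_j}, one per block.
    sub_sets = [{project(x, block) for x in patterns} for block in partition]
    synthetic = []
    # Join: pick one sub-pattern per block and merge (blocks are disjoint).
    for combo in product(*sub_sets):
        merged = {}
        for sub in combo:
            merged.update(dict(sub))
        synthetic.append(merged)
    return synthetic

# The example of Section 2.3 yields 4 synthetic patterns:
X_l = [{'f1': 'red',   'f2': 3, 'f3': 'big',   'f4': 1.75},
       {'f1': 'green', 'f2': 2, 'f3': 'small', 'f4': 1.75}]
print(len(synthesize_class(X_l, [['f1', 'f3'], ['f2', 'f4']])))  # -> 4
```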
2.4. A partitioning method
An appropriate partition needs to be chosen for the given classification problem. We present a simple heuristic based method to find a partition. This method is based on pair-wise correlation between the features and is therefore suitable only for domains having numerical feature values. Domain knowledge can also be used to get an appropriate partition. The synthesis method can, however, work with any domain provided a partition is given.

The partitioning method is given below. The objective of this method is to find a partition such that the average correlation between features within a block is high and that between features of different blocks is low. Since this objective is computationally demanding, we give a greedy method which finds only a locally optimal partition.
Find-Partition()
{
Input: (i) Set of features, F = {f_1, ..., f_d}.
       (ii) Pair-wise correlations between features, C = {c[f_i][f_j] = correlation between f_i and f_j | 1 ≤ i, j ≤ d}.
       (iii) p = number of blocks required in the partition, such that p ≤ d.
Output: Partition, π = {B_1, B_2, ..., B_p}.
(1) Mark all features in F as unused.
(2) Find c[f'_1][f'_2], the minimum element in C such that f'_1 ≠ f'_2.
(3) B_1 = {f'_1}, B_2 = {f'_2}.
(4) Mark f'_1, f'_2 as used.
(5) For i = 3 to p
    {
    (i) Choose an unmarked feature f'_i such that (c[f'_i][f'_1] + ... + c[f'_i][f'_{i-1}]) / (i - 1) is minimum, where f'_1, ..., f'_{i-1} are the features already marked as used.
    (ii) B_i = {f'_i}.
    (iii) Mark f'_i as used.
    }
(6) For each unmarked feature f'
    {
    (i) For i = 1 to p: T_i = (Σ_{j=1}^{|B_i|} c[f'][f'_j]) / |B_i|, where f'_1, ..., f'_{|B_i|} are the features in B_i.
    (ii) Find the maximum element of {T_1, T_2, ..., T_p}; let it be T_k.
    (iii) B_k = B_k ∪ {f'}.
    (iv) Mark f' as used.
    }
(7) Output the partition, π = {B_1, ..., B_p}.
}
For each class of training patterns, the above method is used separately. Experiments (Section 5) are done with the number of blocks (i.e., p) being 1, 2, 3 and d, respectively, where d is the total number of features.
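A minimal Python sketch of this greedy heuristic, assuming the pair-wise correlations are supplied as a d × d NumPy matrix; find_partition is our name and the code follows the pseudocode literally, without absolute values or other refinements.

```python
import numpy as np

def find_partition(C, p):
    """Greedy feature-set partitioning in the spirit of Find-Partition.
    C : (d, d) array of pair-wise correlations between features.
    p : required number of blocks (p <= d).
    Returns a list of p blocks, each a list of feature indices."""
    d = C.shape[0]
    if p == 1:                                   # trivial case: one block
        return [list(range(d))]
    unused = set(range(d))

    # Steps 2-4: seed the first two blocks with the least correlated pair.
    i, j = min(((a, b) for a in range(d) for b in range(d) if a != b),
               key=lambda ab: C[ab])
    blocks = [[i], [j]]
    unused -= {i, j}

    # Step 5: each further block is seeded by the unused feature whose
    # average correlation with the seeds chosen so far is minimum.
    for _ in range(2, p):
        seeds = [b[0] for b in blocks]
        f = min(unused, key=lambda g: np.mean([C[g, s] for s in seeds]))
        blocks.append([f])
        unused.remove(f)

    # Step 6: put every remaining feature into the block with which its
    # average correlation is maximum.
    for f in sorted(unused):
        k = max(range(p), key=lambda b: np.mean([C[f, g] for g in blocks[b]]))
        blocks[k].append(f)
    return blocks
```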
3. The data structures
Partition based pattern synthesis can generate a synthetic set of size $O(n^p)$, where $n$ is the original set size and $p$ is the number of blocks in the partition. Hence explicitly storing the synthetic set is very space consuming. In this section we present a compact representation of the original set which is suitable for the synthesis. For large data sets, this representation requires less storage space than the original set. This representation is called the partitioned pattern count tree (PPC-tree).

The partitioned pattern count tree (PPC-tree) is a generalization of the pattern count tree (PC-tree). For the sake of completeness, we first give a brief overview of the PC-tree, details of which can be found in [9]. These data structures are suitable when each feature takes discrete values (which can also be categorical). For continuous valued features, an appropriate discretization needs to be done first. Later, we present a simple discretization process which is used in our experimental studies.
3.1. PC-tree
The PC-tree is a complete and compact representation of the training patterns which belong to a class. An order is imposed on the set of features $F = \{f_1, \ldots, f_d\}$, where $f_i$ denotes the $i$th feature. Patterns belonging to a class are stored in a tree structure (the PC-tree), where each feature occupies a node. Every training pattern is present in a path from the root to a leaf. Two patterns $X, Y$ of a class can share a common node for their respective $n$th feature if $X[f_i] = Y[f_i]$ for $1 \le i \le n$.

A node has, along with a feature value, a count indicating how many patterns share that node. A compact representation of the training set is obtained as many patterns share a common node in the tree. The given training set is represented as the set $\{T_1, T_2, \ldots, T_c\}$, where each element $T_i$ is the PC-tree for the class of training patterns with label $i$.
3.1.1. Example
Let $\{(a, b, c, x, y, z)^T, (a, b, d, x, y, z)^T, (a, e, c, x, y, u)^T, (f, b, c, x, y, v)^T\}$ be the original training set for a class with label $i$. Then the corresponding PC-tree $T_i$ (the same symbol is used for the tree and the root node of the tree) for this training set is shown in Fig. 1. Each node of the tree is of the format (feature-value : count).
3.2. PPC-tree
Let $\mathcal{X}_i$ be the set of original patterns which belong to the class with label $i$. Let $\pi_i = \{B_1, B_2, \ldots, B_p\}$ be a partition of the feature set $F$, where each block $B_j = \{f_{j1}, \ldots, f_{j|B_j|}\}$ (for $1 \le j \le p$) is an ordered set and the $n$th feature of block $B_j$ is $f_{jn}$. Then the PPC-tree for $\mathcal{X}_i$ with respect to $\pi_i$ is $T_i = \{T_{i1}, \ldots, T_{ip}\}$, a set of PC-trees such that $T_{ij}$ is the PC-tree for the set of sub-patterns $\mathcal{X}_i^{B_j}$, for $1 \le j \le p$, where the $H$-valued features (see Section 2.1) are ignored. Each PC-tree $T_{ij}$ corresponds to a class (with label $i$) and to a block ($B_j$ such that $B_j \in \pi_i$) of the partition of that class. The given training set is represented as the set $\{T_1, T_2, \ldots, T_c\}$, where each element $T_i$ is the PPC-tree for the class of training patterns with label $i$, and $T_i = \{T_{i1}, \ldots, T_{ip}\}$ is a set of PC-trees.

A path from the root to a leaf of the PC-tree $T_{ij}$ (excluding the root node) corresponds to a unique sub-pattern with scheme $B_j \in \pi_i$. If $(x_1, x_2, \ldots, x_{|B_j|})$ is a path in $T_{ij}$, then the corresponding sub-pattern is $P$ such that $P[f_{j1}] = x_1, P[f_{j2}] = x_2, \ldots, P[f_{j|B_j|}] = x_{|B_j|}$, and for the remaining features $f$, i.e., $f \in F - B_j$, $P[f] = H$. If $Q_j$ is the sub-pattern corresponding to a path in $T_{ij}$ for $1 \le j \le p$, then $Q = Q_1 \oplus Q_2 \oplus \cdots \oplus Q_p$ is a synthetic pattern in the class with label $i$. Algorithms 1 and 2 give the construction procedures.
Algorithm 1 (Build-PPC-trees())
{Input: (i) The original training set.
        (ii) The partition for each class, i.e., π_1, π_2, ..., π_c.
Output: The set of PPC-trees, T = {T_1, ..., T_c}, where T_i = {T_i1, ..., T_ip} for 1 ≤ i ≤ c. T_ij is the PC-tree for the class with label i and block B_j ∈ π_i.
Assumptions: (i) The number of blocks in each π_i, 1 ≤ i ≤ c, is the same and is equal to p.
             (ii) Each T_ij is empty (i.e., has only the root node) to start with.}
for i = 1 to c do
  for each training pattern X ∈ X_i do
    for j = 1 to p do
      Add-Pattern(T_ij, X)
    end for
  end for
end for
3.2.1. Example
For the example considered in Section 3.1.1, the PPC-tree is shown in Fig. 2, where the partition is $\pi_i = \{B_1, B_2\}$ such that $B_1 = \{f_1, f_2, f_3\}$ and $B_2 = \{f_4, f_5, f_6\}$, respectively. The ordering of features considered for each block is the same as that in Example 3.1.1. Thus the PPC-tree is the set of PC-trees $\{T_{i1}, T_{i2}\}$. $T_{i1}$ is the PC-tree for the set of sub-patterns $\mathcal{X}_i^{B_1} = \{(a, b, c, H, H, H)^T, (a, b, d, H, H, H)^T, (a, e, c, H, H, H)^T, (f, b, c, H, H, H)^T\}$, where the $H$-valued features are ignored. Similarly, $T_{i2}$ is the PC-tree for the set of sub-patterns $\mathcal{X}_i^{B_2}$; see Fig. 2.

Note that the PPC-tree is a more compact representation than the corresponding PC-tree. From the examples, it can be seen that the number of nodes in the PPC-tree is 16, whereas that in the PC-tree is 22. A path from root to leaf of $T_{i1}$ represents a sub-pattern with scheme $B_1$ and that of $T_{i2}$ represents a sub-pattern with scheme $B_2$. Merging the two sub-patterns gives a synthetic pattern according to the partition.
Algorithm 2 (Add-Pattern(PC-tree T_ij, Pattern X))
X' = X^{B_j} such that B_j ∈ π_i   {X' is the sub-pattern of X with scheme B_j ∈ π_i}
Node current-node = T_ij.root
for j = 1 to d do   {d is the dimensionality of X}
  if (X'[f_j] ≠ H) then
    L = list of child nodes of current-node
    if (L is empty) then
      Node new-node = create a new node
      new-node.feature-value = X'[f_j]
      new-node.count = 1
      Make new-node a child of current-node
      current-node = new-node
    else
      if (a node v ∈ L exists such that v.feature-value = X'[f_j]) then
        v.count = v.count + 1
        current-node = v
      else
        Node new-node = create a new node
        new-node.feature-value = X'[f_j]
        new-node.count = 1
        Make new-node a child of current-node
        current-node = new-node
      end if
    end if
  end if
end for
Fig. 1. PC-tree $T_i$ (each node shown as feature-value : count).

Fig. 2. PPC-tree $T_i = \{T_{i1}, T_{i2}\}$: the PC-tree for block 1 and the PC-tree for block 2.
Further, both the PC-tree and the PPC-tree can be incrementally built by scanning the database of patterns only once, and they are suitable for discrete valued features, which could be of categorical type as well.
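The following Python sketch mirrors Algorithms 1 and 2: a PC-tree node stores a feature value and a count, and the sub-pattern of a pattern for one block is inserted by walking (and, where needed, extending) a single root-to-leaf path. The class and function names are ours, not the authors'.

```python
class PCNode:
    """One node of a PC-tree: a feature value plus a pattern count."""
    def __init__(self, value=None):
        self.value = value        # feature value stored at this node
        self.count = 0            # number of patterns sharing this node
        self.children = []        # child PCNode objects

def add_pattern(root, pattern, block):
    """Insert the sub-pattern of `pattern` with scheme `block` (cf. Algorithm 2).
    pattern : dict mapping feature name -> value
    block   : ordered list of feature names (one block of the partition)"""
    node = root
    for f in block:
        v = pattern[f]
        child = next((c for c in node.children if c.value == v), None)
        if child is None:         # no shared prefix: grow the path
            child = PCNode(v)
            node.children.append(child)
        child.count += 1          # one more pattern shares this node
        node = child

def build_ppc_trees(training_sets, partitions):
    """Cf. Algorithm 1: one PPC-tree (a list of PC-trees, one per block) per class.
    training_sets : dict label -> list of patterns (dicts)
    partitions    : dict label -> list of blocks"""
    trees = {}
    for label, patterns in training_sets.items():
        trees[label] = [PCNode() for _ in partitions[label]]
        for x in patterns:
            for tree, block in zip(trees[label], partitions[label]):
                add_pattern(tree, x, block)
    return trees
```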
4. Classification methods with synthetic patterns
We present three classification methods that work with synthetic patterns, viz., NNC(SP), PPC-aNNC, and an ensemble of several PPC-aNNC's.
4.1. NNC(SP)
NNC(SP) is the nearest neighbor classifier with synthetic patterns. Explicit generation of the synthetic set is done first and then NNC is applied. This method is computationally inefficient as the space and classification time requirements are both $O(n^p)$, where $p$ is the number of blocks used in the partition and $n$ is the original training set size. This method is presented for comparison with the other methods that use the synthetic set. The distance measure used by this method is the Euclidean distance.
4.2. PPC-aNNC
PPC-aNNC finds an approximate nearest neighbor of a given test pattern. The distance measure used here is the Manhattan distance (city block distance). The method is suitable for discrete and numeric valued features only. PPC-aNNC is described in Algorithm 3. Let $Q$ be the given test pattern. The quantity $dist_{ij}$ is the distance between the sub-pattern $Q^{B_j}$ and its approximate nearest neighbor in the set $\mathcal{X}_i^{B_j}$ (the set of sub-patterns of $\mathcal{X}_i$ with respect to the scheme $B_j \in \pi_i$), where the $H$-valued features are ignored. The quantity $d_i = \sum_{j=1}^{p} dist_{ij}$ is then the distance between $Q$ and its approximate nearest neighbor in the class with label $i$.

The method progressively finds a path in each $T_{ij}$, starting from the root and ending at a leaf. The ordering of features present in $Q^{B_j}$ must be the same as that of $B_j \in \pi_i$ which was used to construct the PC-tree $T_{ij}$. At each node, it tries to find a child which is nearest to the corresponding feature value in $Q^{B_j}$ (based on the absolute difference between the values) and proceeds to that node. If there is more than one such child, then it proceeds to the child that has the maximum count value. Let the chosen child node be $v$ and the value of the corresponding feature in $Q^{B_j}$ be $q$. Then the distance $dist_{ij}$ is increased by $|v.\text{feature-value} - q|$.

If $Q$ is present in the original training set then PPC-aNNC will find it, and in this case the neighbor obtained is the exact nearest neighbor.
4.2.1. Computational requirements of PPC-aNNC
Let the number of discrete values any feature can take be at most $l$, the dimensionality of each pattern be $d$ and the number of classes be $c$. Then the time complexity of PPC-aNNC is $O(cld)$, since it finds only one path in each of the $c$ PPC-trees (one per class) and at any node it searches only the child-list (of size at most $O(l)$) of that node to find the next node in the path. The path has $d$ nodes. For a given problem, $c$, $l$ and $d$ are constants (i.e., independent of the number of training patterns) that are typically much smaller than the number of training patterns. Thus, the effective time complexity of the method is only $O(1)$. That is, the classification time of PPC-aNNC is constant and is independent of the training set size. However, since it avoids an exhaustive search of the PPC-tree, it can only find an approximate nearest neighbor.
Algorithm 3 (PPC-aNNC(Test Pattern Q))
{Assumption (i): The set of PPC-trees {T_1, ..., T_c}, where T_i = {T_i1, ..., T_ip} for 1 ≤ i ≤ c, is assumed to be already built.
Assumption (ii): π_i = {B_1, B_2, ..., B_p} (1 ≤ i ≤ c) is the partition of the feature set F for the class with label i, and is the same as that used in the construction of the PPC-tree T_i, where each block B_j = {f_j1, ..., f_j|B_j|} (for 1 ≤ j ≤ p) is an ordered set with the nth feature of block B_j being f_jn.}
for each class with label i = 1 to c do
  for each B_j ∈ π_i (1 ≤ j ≤ p) do
    Q' = Q^{B_j}
    current-node = T_ij.root
    dist_ij = 0
    for l = 1 to |B_j| do
      L = list of child nodes of current-node
      Choose the sublist L' of nodes v ∈ L for which |Q'[f_jl] - v.feature-value| is minimum
      Choose a node v ∈ L' such that v.count is maximum   {ties are broken arbitrarily}
      dist_ij = dist_ij + |Q'[f_jl] - v.feature-value|
      current-node = v
    end for
  end for
end for
for i = 1 to c do
  d_i = 0
  for j = 1 to p do
    d_i = d_i + dist_ij
  end for
end for
Find d_x = the minimum element in {d_1, d_2, ..., d_c}
Output (class label of Q = x)
The space requirement of the method is mostly due to the PPC-tree structures. The PPC-trees require $O(n)$ space, where $n$ is the total number of original patterns. For medium to large data sets, empirical studies show that the space requirement is much smaller than that of the conventional vector format (i.e., each pattern represented by a list of feature values). However, for small data sets the space required may increase because of the data structure overhead (the space needed for pointers, etc.).
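A sketch, in the same hypothetical Python setting as the earlier PC-tree sketch, of the greedy descent performed by PPC-aNNC (Algorithm 3): one root-to-leaf walk per class and per block, accumulating a Manhattan-style distance.

```python
def approx_block_distance(root, query, block):
    """Greedy root-to-leaf walk in one PC-tree: at every level move to the
    child whose value is closest to the query's value for that feature
    (largest count breaks ties) and accumulate the absolute difference."""
    node, dist = root, 0.0
    for f in block:
        q = query[f]
        best = min(abs(c.value - q) for c in node.children)
        candidates = [c for c in node.children if abs(c.value - q) == best]
        node = max(candidates, key=lambda c: c.count)
        dist += best
    return dist

def ppc_annc(trees, partitions, query):
    """Approximate-NN classification: one walk per class and per block,
    so the work is O(c*l*d), independent of the training set size."""
    best_label, best_dist = None, float('inf')
    for label, pc_trees in trees.items():
        d_i = sum(approx_block_distance(tree, query, block)
                  for tree, block in zip(pc_trees, partitions[label]))
        if d_i < best_dist:
            best_label, best_dist = label, d_i
    return best_label
```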
4.3. The ensemble of PPC-aNNC’s
PPC-aNNC is a weak classifier since it finds only an approximate nearest neighbor for the test pattern. Partition based pattern synthesis depends on the partition chosen. The PPC-tree, and hence PPC-aNNC, depends not only on the partition chosen for each class, but also on the ordering of features within each block of the partitions. Thus various orderings of features in each block of the partitions result in various PPC-aNNC's. An ensemble of PPC-aNNC's, where the final decision is made by simple majority voting, is empirically shown to perform well. Let there be r component classifiers in the ensemble. Each component classifier is built from a random ordering of the features in each block.

Intuitively, the functioning of the PPC-aNNC's can be explained as follows. While finding the approximate nearest neighbor, PPC-aNNC gives emphasis to the features in a block according to their order. The first feature in a block is emphasized the most and the last feature the least. Notice that if there is only one feature in each block for the partitions of all classes, then PPC-aNNC finds the exact nearest neighbor in the entire synthetic set (generated according to this partitioning). This is because, in this case, all features are emphasized equally (since there is only one feature in each block). Since each PPC-aNNC in the ensemble is based on a random ordering of the features, the emphasis on features given by each of them is quite different from that of the others. Because of this, the errors made by the individual classifiers become significantly uncorrelated, causing the ensemble to perform well.

The ensemble is suitable for parallel implementation with r machines, where each machine implements a different PPC-aNNC. Communication is required only when the test pattern is sent to all individual classifiers and when the majority vote is taken, and therefore it results in a very small overhead. On the other hand, if the ensemble is implemented on a single machine, then the space and time requirements will be r times those of a single PPC-aNNC, which may not be feasible for large data sets.
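A minimal sketch of the fusion step, reusing the hypothetical build_ppc_trees and ppc_annc helpers from the earlier sketches: each of the r components is built with its own random within-block feature ordering, and the majority label wins.

```python
import random
from collections import Counter

def make_component(training_sets, partitions, rng):
    """One PPC-aNNC component built on a random within-block feature order."""
    shuffled = {label: [rng.sample(list(b), len(b)) for b in blocks]
                for label, blocks in partitions.items()}
    trees = build_ppc_trees(training_sets, shuffled)
    return lambda q: ppc_annc(trees, shuffled, q)

def ensemble_classify(components, query):
    """Simple majority vote over the component decisions."""
    votes = Counter(classify(query) for classify in components)
    return votes.most_common(1)[0][0]

# r components, each with a different random ordering of the features:
# rng = random.Random(0)
# components = [make_component(training_sets, partitions, rng) for _ in range(r)]
# predicted_label = ensemble_classify(components, test_pattern)
```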
5. Experiments
5.1. Datasets
We performed experiments with six different datasets, viz., OCR, WINE, VOWEL, THYROID, GLASS and PENDIGITS, respectively. Except for the OCR dataset, all are from the UCI Repository [24]. The OCR dataset is also used in [9,25], while the WINE, VOWEL, THYROID and GLASS datasets are used in [21]. The properties of the datasets are given in Table 1. All the datasets have only numeric valued features. The OCR dataset has binary discrete features, while the others have continuous valued features. Except for the OCR dataset, all datasets are normalized to have zero mean and unit variance for each feature and are subsequently discretized. Let a be a feature value after normalization, and a' be its discrete value. We used the following discretization procedure.

If (a < -0.75) then a' = -1;
Else-If (a < -0.25) then a' = -0.5;
Else-If (a < 0.25) then a' = 0;
Else-If (a < 0.75) then a' = 0.5;
Else a' = 1.
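A small sketch of the normalization and discretization as we read it, i.e., z-scoring each feature and mapping it onto the five symmetric levels {-1, -0.5, 0, 0.5, 1}; computing the statistics from the training portion only is our assumption.

```python
import numpy as np

def discretize(train, test):
    """Z-score each feature (statistics from `train`) and map every value
    onto the five levels {-1, -0.5, 0, 0.5, 1} with cuts at +/-0.25, +/-0.75."""
    mean, std = train.mean(axis=0), train.std(axis=0)
    std[std == 0] = 1.0                    # guard against constant features
    levels = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
    edges = np.array([-0.75, -0.25, 0.25, 0.75])

    def level(a):
        return levels[np.searchsorted(edges, a, side='right')]

    return level((train - mean) / std), level((test - mean) / std)
```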
5.2. Classifiers for comparison
The classifiers chosen for comparison purposes are as follows.

NNC: The test pattern is assigned to the class of its nearest neighbor in the training set. The distance measure used is the Euclidean distance.

k-NNC: A simple extension of NNC, where the most common class among the k nearest neighbors is chosen. The distance measure is the Euclidean distance. Three-fold cross-validation is done to choose the value of k.
Table 1
Properties of the datasets used
Dataset Number of features Number of classes Number of training examples Number of test examples
OCR 192 10 6670 3333
WINE 13 3 100 78
VOWEL 10 11 528 462
THYROID 21 3 3772 3428
GLASS 9 7 100 114
PENDIGITS 16 10 7494 3498
Naive–Bayes classifier (NBC): This is a specialization of the Bayes classifier where the features are assumed to be statistically independent. Further, the features are assumed to be of discrete type. Let $X = (x_1, \ldots, x_d)^T$ be a pattern and $l$ be a class label. Then the class conditional probability is $P(X \mid l) = P(x_1 \mid l) \cdots P(x_d \mid l)$. $P(x_i \mid l)$ is taken as the ratio of the number of patterns in the class with label $l$ whose feature $f_i$ has value $x_i$ to the total number of patterns in that class. The a priori probability for each class is taken as the ratio of the number of patterns in that class to the total training set size. The given test pattern is classified to the class for which the a posteriori probability is maximum. The OCR dataset is used as it is, whereas the other datasets are normalized (to have zero mean and unit variance for each feature) and discretized as done for PPC-aNNC.
NNC with bootstrapped training set (NNC(BS)): We used the bootstrap method given by Hamamoto et al. [18] to generate an artificial training set. The bootstrapping method is as follows. Let $X$ be a training pattern and let $X_1, \ldots, X_r$ be its $r$ nearest neighbors in its class. Then $X' = (\sum_{i=1}^{r} X_i)/r$ is the artificial pattern generated for $X$. In this manner, an artificial pattern is generated for each training pattern. NNC is then done with this new bootstrapped training set. The value of $r$ is chosen by three-fold cross-validation. (A small sketch of this bootstrap appears after this list.)
Voting over multiple condensed nearest neighbors (MCNNC): The condensed nearest neighbor classifier (CNNC) first finds a condensed training set, which is a subset of the training set such that NNC with the condensed set classifies each training pattern correctly. The condensed set is built incrementally. Changing the order in which the training patterns are considered can give a new condensed set. Alpaydin [21] proposed training multiple such subsets and taking a vote over them, thus combining predictions from a set of concept descriptions. Two voting schemes are given: simple voting, where voters have equal weight, and weighted voting, where weights depend on the classifiers' confidence in their predictions. The second scheme is empirically shown to do well, so it is taken for comparison purposes. The paper [21] proposes some additional improvements based on bootstrapping, etc., which are not considered here.
Weighted nearest neighbor with feature projection (wNNFP): This is given by Kubat and Chen in [22]. If d is the number of features, then d individual nearest neighbor classifiers are considered, each classifier taking only one feature into account. That is, d separate projected training sets are formed, each being used by an individual NNC. Weighted majority voting is used to combine the decisions of the individual NNC's. The weights for the individual classifiers are assigned based on their classification accuracies. Three-fold cross-validation is done for this.
NNC with synthetic patterns (NNC(SP)): This method is given in Section 4.1. The parameter P, i.e., the set of partitions, is chosen based on the cross-validation method given in Section 5.3.

PPC-aNNC: This method is given in Section 4.2, and the cross-validation method used to choose the parameter values in Section 5.3.

Ensemble of PPC-aNNC's: This method is given in Section 4.3. The cross-validation method used to choose the parameter values is given in Section 5.3.
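For reference, a sketch of the bootstrap of Hamamoto et al. as described above, with equal weights (the centroid case); whether the pattern itself is counted among its r neighbours is a detail we assume here.

```python
import numpy as np

def hamamoto_bootstrap(X, y, r):
    """Replace every pattern by the mean of its r nearest neighbours taken
    within its own class (here the pattern counts as its own 0-th neighbour)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    X_new = np.empty_like(X)
    for label in np.unique(y):
        idx = np.where(y == label)[0]
        Xc = X[idx]
        # pairwise squared Euclidean distances within the class
        d2 = ((Xc[:, None, :] - Xc[None, :, :]) ** 2).sum(-1)
        nn = np.argsort(d2, axis=1)[:, :r]        # r nearest per pattern
        X_new[idx] = Xc[nn].mean(axis=1)          # centroid of the r neighbours
    return X_new, y
```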
5.3. Validation method
Three-fold cross-validation is used to fix the parameter values for the various classifiers described in this paper. For the methods proposed in this paper, viz., NNC(SP), PPC-aNNC and the ensemble of PPC-aNNC's, we give a detailed cross-validation procedure below.

The training set is randomly divided into three equal non-overlapping subsets. If an equal division is not possible, then one or two randomly chosen training patterns are replicated to get an equal division. Two such subsets are combined to form a training set called a validation training set, and the remaining one is called a validation test set. In this way we get three different validation training sets and corresponding validation test sets. We call these sets val-train-set-1, val-train-set-2, val-train-set-3 for the validation training sets and val-test-set-1, val-test-set-2, val-test-set-3 for the corresponding validation test sets, respectively. For a given set of parameter values, val-train-set-i is used as the training set for the classifier, and the classification accuracy (CA) measured over val-test-set-i is called val-CA-i, where i = 1, 2 or 3. The average value of {val-CA-1, val-CA-2, val-CA-3} is called avg-val-CA and its standard deviation val-SD. The objective of cross-validation is to find a set of parameter values for which avg-val-CA is maximum. val-SD measures the spread of val-CA-i, i = 1, 2 or 3, around avg-val-CA. An exhaustive search over all possible sets of parameter values is computationally expensive, and hence we give a greedy approach for choosing the set of parameter values.
(1) NNC(SP): The parameters used in NNC(SP) are the partitions of the set of features (F), one for each class, which are used for performing partition based pattern synthesis. Let these partitions be represented as a set P = {π_1, π_2, ..., π_c}, where π_i (1 ≤ i ≤ c) is the partition used for the class with label i. Further, let P_p be the set of partitions where each element (i.e., partition) has exactly p blocks. The element from {P_1, P_2, P_3, P_d} (where d = |F|) which gives the maximum avg-val-CA is chosen. P_p for a given p is obtained either from domain knowledge or by using the method given in Section 2.4.

The OCR dataset consists of handwritten images on a two dimensional rectangular grid of size 16 × 12, where for each cell the presence of ink is represented by 1 and its absence by 0. It is known that, for a given class, the values in nearby cells are more highly dependent than those in far apart cells (nearness here is based on physical closeness between the cells). This knowledge is used for obtaining the partitions. An entire image is represented as a 192 dimensional vector, where the first 12 features correspond to the first row of the grid, the second 12 features correspond to the second row of the grid, and so on. Let the set of features in this order be F = {f_1, f_2, ..., f_192}. A partitioning of F with p (where p = 1, 2, 3 or 192) blocks, i.e., {B_1, B_2, ..., B_p}, is obtained in the following manner: the first 192/p features go into block B_1, the next 192/p features into block B_2, and so on.

For the other datasets, viz., WINE, VOWEL, THYROID, GLASS and PENDIGITS, the partitions are obtained by using the method given in Section 2.4.
(2) PPC-aNNC: The parameters are the partitions of the set of features, as used by NNC(SP), together with the ordering of the features in all blocks of each partition. These are chosen as follows. P_p for p = 1, 2, 3 and d is obtained as done for NNC(SP). For each P_p (where p = 1, 2, 3 or d), the features in each block (for each partition) are randomly ordered and avg-val-CA is obtained. 100 such runs (each with a different random ordering of features) are carried out for each P_p. The P_p, along with the ordering of features, for which avg-val-CA is maximum is then chosen.
(3) Ensemble of PPC-aNNC's: The parameters here are (i) the number of component classifiers (r), (ii) the set of partitions P_i = {π_1, ..., π_c} used by each component i (1 ≤ i ≤ r), and (iii) the ordering of features in each block of each partition. These parameters are chosen by restricting the search space as given below.

The set of partitions used by each component classifier is the same (except for the ordering of features). That is, P_1 = P_2 = ... = P_r, and it is chosen as done for PPC-aNNC. With 100 random orderings of features, PPC-aNNC is run and the respective classification accuracies (CA) are obtained. Let avg-CA and max-CA be the average and maximum CA of these 100 runs, respectively. We define a threshold classification accuracy thr-CA = (avg-CA + max-CA)/2. 50 component classifiers are obtained by finding 50 random orderings of features such that each component has CA greater than thr-CA. This process is done to choose good component classifiers. For these 50 component classifiers, their respective CAs and orderings of features are stored. This is done with each pair (val-train-set-i, val-test-set-i) for i = 1, 2 and 3. So we get in total 150 (i.e., 50 × 3) orderings of features along with their respective CAs. This corresponds to the list of orderings. This list is sorted based on the CA values and the best r orderings are chosen to be used in the final ensemble, where r, the number of component classifiers, is chosen as described below.

For each pair (val-train-set-i, val-test-set-i), for i = 1, 2 and 3, we obtain 50 component classifiers as explained above. From these 50 components, we choose m distinct components at random to form an ensemble. The CA of this ensemble is measured and is called val-CA-i_m. The above is done for m = 1 to 50 and for i = 1, 2 and 3. The quantity avg-val-CA_m is the average value of {val-CA-1_m, val-CA-2_m, val-CA-3_m}, and val-SD_m is its standard deviation. The number of components r is chosen such that avg-val-CA_r is the maximum element in {avg-val-CA_1, avg-val-CA_2, ..., avg-val-CA_50}.
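A compact sketch of the final step of this selection: for each candidate ensemble size m the validation accuracies over the three splits are averaged, and the m with the largest avg-val-CA_m is kept. accuracy_of_ensemble is a hypothetical callback standing in for building and evaluating an m-component ensemble on one split.

```python
import numpy as np

def choose_num_components(accuracy_of_ensemble, max_r=50, n_folds=3):
    """Pick the ensemble size r that maximizes avg-val-CA_m.
    accuracy_of_ensemble(m, fold) -> CA of a randomly drawn m-component
    ensemble on validation split `fold` (a hypothetical callback)."""
    avg_ca = np.zeros(max_r + 1)
    for m in range(1, max_r + 1):
        avg_ca[m] = np.mean([accuracy_of_ensemble(m, fold)
                             for fold in range(n_folds)])
    r = int(np.argmax(avg_ca[1:]) + 1)
    return r, avg_ca[r]
```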
5.4. Experimental results
Tables 2 and 3 give the comparison between the classifiers. They show the classification accuracy (CA) for each of the classifiers as a percentage over the respective test sets. The parameter values are chosen by performing cross-validation as described in Section 5.3. Table 3 shows the CAs for the methods proposed by us. Along with the CA values, it also shows the parameter values p (the number of blocks used in the synthesis) and r (the number of components used in the ensemble). The points worth noting are as follows: (i) For large values of n (the number of original training patterns) and p, it may not be feasible to consider the entire synthetic set, which is required in the case of NNC(SP). (ii) If p = d, where d is the total number of features, each feature goes into a separate block (i.e., each block contains only one feature) and therefore only one ordering of features is possible. This means that, in the case of the ensemble of PPC-aNNC's, each component is the same and hence the CA of one component is equal to the CA of the ensemble. (iii) If p = 1, then the synthetic and original sets are the same.
Table 2
A comparison between the classifiers (showing CA (%))
Dataset NNC k-NNC NBC NNC(BS) MCNNC wNNFP
OCR 91.12 92.68 81.01 92.88 91.97 10.02
WINE 94.87 96.15 91.03 97.44 95.00 92.30
VOWEL 56.28 60.17 36.80 57.36 55.97 23.38
THYROID 93.14 94.40 83.96 94.57 92.23 92.71
GLASS 71.93 71.93 60.53 71.93 71.67 53.5
PENDIGITS 96.08 97.54 83.08 97.57 97.25 45.05
The cross-validation results for four datasets (WINE, VOWEL, GLASS and PENDIGITS) for the ensemble of PPC-aNNC's are given in Tables 4–7, respectively, which show the average CA (avg-val-CA_m) and standard deviation (val-SD_m) for various values of m (i.e., the number of components). For the remaining two datasets (OCR and THYROID), results similar to those for WINE, GLASS and PENDIGITS are observed and hence are not presented.
From the results presented, some of the observations are:

(1) The methods given by us (viz., NNC(SP) and the ensemble of PPC-aNNC's) outperform the other methods in the case of the OCR and THYROID datasets. For the remaining datasets, our methods show good performance.
Table 3
A comparison between the classifiers
Dataset      NNC(SP) (# blocks)      PPC-aNNC (# blocks)      Ensemble of PPC-aNNC's (# blocks) (# components)
OCR 93.01 (3) 84.91 (3) 94.15 (3) (45)
WINE 96.15 (2) 89.74 (2) 94.87 (2) (7)
VOWEL 56.28 (1) 43.51 (1) 46.32 (1) (33)
THYROID 97.23 (d) 94.16 (2) 94.66 (2) (49)
GLASS 71.93 (1) 60.53 (1) 67.54 (1) (7)
PENDIGITS 96.08 (1) 90.19 (1) 96.34 (1) (29)
Table 4
Cross validation results for ensemble of PPC-aNNC’s for WINE dataset
# component classifiers Number of blocks
1               2               3               d
1 95.09 (1.38) 96.08 (1.38) 95.09 (1.38) 77.45 (5.00)
7 98.03 (1.38) 99.06 (1.38) 98.03 (1.38) 77.45 (5.00)
10 98.03 (1.38) 99.02 (1.38) 98.03 (2.77) 77.45 (5.00)
20 98.03 (1.38) 98.04 (1.38) 99.01 (1.38) 77.45 (5.00)
30 98.03 (1.38) 98.04 (1.38) 99.01 (1.38) 77.45 (5.00)
40 98.03 (1.38) 98.04 (1.38) 99.01 (1.38) 77.45 (5.00)
50 99.01 (1.38) 98.04 (1.38) 99.01 (1.38) 77.45 (5.00)
Table 5
Cross validation results for ensemble of PPC-aNNC’s for VOWEL dataset
# component classifiers Number of blocks
1               2               3               d
1 82.77 (1.49) 82.57 (3.15) 79.73 (2.83) 23.86 (2.45)
10 91.86 (2.63) 88.06 (2.78) 85.22 (2.41) 23.86 (2.45)
20 91.86 (1.75) 89.58 (2.09) 85.41 (2.28) 23.86 (2.45)
30 92.23 (1.93) 90.15 (2.09) 85.98 (1.93) 23.86 (2.45)
33 93.18 (2.02) 90.72 (2.56) 85.98 (1.93) 23.86 (2.45)
40 92.42 (2.14) 89.58 (2.33) 85.98 (2.09) 23.86 (2.45)
50 92.61 (1.67) 89.96 (2.38) 85.22 (2.45) 23.86 (2.45)
Table 6
Cross validation results for ensemble of PPC-aNNC’s for GLASS dataset
# component classifiers Number of blocks
1               2               3               d
1 67.65 (2.40) 71.57 (5.00) 63.73 (6.04) 46.08 (7.34)
7 76.47 (2.40) 67.65 (6.35) 66.67 (7.34) 46.08 (7.34)
10 70.59 (2.40) 69.61 (6.04) 65.69 (7.34) 46.08 (7.34)
20 72.55 (1.39) 71.57 (5.00) 64.71 (7.20) 46.08 (7.34)
30 73.53 (0.00) 70.59 (4.80) 66.67 (7.34) 46.08 (7.34)
40 73.53 (2.40) 70.59 (4.16) 65.69 (6.04) 46.08 (7.34)
50 71.57 (1.39) 72.55 (6.04) 65.69 (6.04) 46.08 (7.34)
Table 7
Cross validation results for ensemble of PPC-aNNC's for PENDIGITS dataset
# component classifiers    Number of blocks
                           1               2               3               d
1 93.70 (0.32) 93.68 (0.42) 90.28 (0.94) 29.06 (5.00)
10 98.16 (0.26) 96.99 (0.37) 93.15 (1.19) 29.06 (5.00)
20 98.45 (0.18) 97.17 (0.30) 93.76 (1.03) 29.06 (5.00)
29 98.71 (0.14) 97.30 (0.30) 93.79 (1.13) 29.06 (5.00)
30 98.57 (0.27) 97.41 (0.21) 93.79 (1.13) 29.06 (5.00)
40 98.64 (0.14) 97.37 (0.31) 93.78 (0.99) 29.06 (5.00)
50 98.51 (0.19) 97.43 (0.36) 93.78 (1.04) 29.06 (5.00)
(2) The ensemble of PPC-aNNC's performs uniformly better than NBC, wNNFP, and PPC-aNNC, respectively, over all datasets.

(3) It is interesting to note that PPC-aNNC outperforms NBC and wNNFP over all datasets except the WINE dataset.

The actual space requirement for the PPC-tree is, on average, about 60% to 90% of that of the respective original sets for the OCR, THYROID and PENDIGITS datasets. For the other datasets, the actual space requirement is slightly more than that required for the original set. This is because, for small datasets, the data structure overhead is larger than the space saved by the sharing of nodes in the PPC-tree.
6. Conclusions
This paper presented a fusion of multiple approximate nearest neighbor classifiers having a constant ($O(1)$) classification time upper bound and good classification accuracy. Each individual classifier of the ensemble is a weak classifier which works with a synthetic set generated by the novel pattern synthesis technique called partition based pattern synthesis, which reduces the curse of dimensionality effect. Further, explicit generation of the synthetic set is avoided by doing implicit pattern synthesis within the classifier, which works directly with a compact representation of the original training set called the PPC-tree. The proposed ensemble of PPC-aNNC's with parallel implementation is a fast and efficient classifier suitable for large, high dimensional data sets. Since it has a constant classification time upper bound, it is a suitable classifier for online, real-time applications.
7. Future work
A formal explanation for the good behavior of the ensemble of PPC-aNNC's needs to be given. Next, one needs to answer questions such as 'What is a good partition for doing partition based pattern synthesis and how can it be found?' We gave a partitioning method based on pair-wise correlations between features within a class, but this takes into account only linear dependency between the features, so it can fail to capture higher-order dependencies. A general partitioning method which is also computationally efficient, and which can be used for both numerical and categorical features, needs to be found.
Acknowledgements
Research work reported here is supported in part by AOARD Grant F62562-03-P-0318. Thanks to the three anonymous reviewers for constructive comments. Special thanks to B.V. Dasarathy for prompt feedback and many suggestions during the revision that significantly improved the content of the paper.
References
[1] B.V. Dasarathy, Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques, IEEE Computer Society Press, Los Alamitos, California, 1991.
[2] R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, second ed., Wiley-Interscience, John Wiley & Sons, 2000.
[3] T. Cover, P. Hart, Nearest neighbor pattern classification, IEEE Transactions on Information Theory 13 (1) (1967) 21–27.
[4] K. Fukunaga, D. Hummels, Bias of nearest neighbor error estimates, IEEE Transactions on Pattern Analysis and Machine Intelligence 9 (1987) 103–112.
[5] G. Hughes, On the mean accuracy of statistical pattern recognizers, IEEE Transactions on Information Theory 14 (1) (1968) 55–63.
[6] A. Jain, B. Chandrasekharan, Dimensionality and sample size considerations in pattern recognition practice, in: P. Krishnaiah, L. Kanal (Eds.), Handbook of Statistics, vol. 2, North Holland, 1982, pp. 835–855.
[7] K. Fukunaga, D. Hummels, Bayes error estimation using Parzen and k-NN procedures, IEEE Transactions on Pattern Analysis and Machine Intelligence 9 (1987) 634–643.
[8] K. Fukunaga, Introduction to Statistical Pattern Recognition, second ed., Academic Press, 1990.
[9] V. Ananthanarayana, M.N. Murty, D. Subramanian, An incremental data mining algorithm for compact realization of prototypes, Pattern Recognition 34 (2001) 2249–2251.
[10] J. Han, J. Pei, Y. Yin, Mining frequent patterns without candidate generation, in: Proceedings of the ACM SIGMOD International Conference on Management of Data, Dallas, Texas, USA, 2000.
[11] T. Zhang, R. Ramakrishnan, M. Livny, BIRCH: an efficient data clustering method for very large databases, in: Proceedings of the ACM SIGMOD International Conference on Management of Data, 1996.
[12] A. Guttman, R-trees: a dynamic index structure for spatial searching, in: Proceedings of the ACM SIGMOD International Conference on Management of Data, 1984, pp. 47–57.
[13] B. Efron, Bootstrap methods: another look at the jackknife, Annals of Statistics 7 (1979) 1–26.
[14] A. Jain, R. Dubes, C. Chen, Bootstrap technique for error estimation, IEEE Transactions on Pattern Analysis and Machine Intelligence 9 (1987) 628–633.
[15] M. Chernick, V. Murthy, C. Nealy, Application of bootstrap and other resampling techniques: evaluation of classifier performance, Pattern Recognition Letters 3 (1985) 167–178.
[16] S. Weiss, Small sample error rate estimation for k-NN classifiers, IEEE Transactions on Pattern Analysis and Machine Intelligence 13 (1991) 285–289.
[17] D. Hand, Recent advances in error rate estimation, Pattern Recognition Letters 4 (1986) 335–346.
[18] Y. Hamamoto, S. Uchimura, S. Tomita, A bootstrap technique for nearest neighbor classifier design, IEEE Transactions on Pattern Analysis and Machine Intelligence 19 (1) (1997) 73–79.
[19] L. Breiman, Bagging predictors, Machine Learning 24 (1996) 123–140.
[20] D.B. Skalak, Prototype Selection for Composite Nearest Neighbor Classifiers, Ph.D. Thesis, Department of Computer Science, University of Massachusetts Amherst, 1997.
[21] E. Alpaydin, Voting over multiple condensed nearest neighbors, Artificial Intelligence Review 11 (1997) 115–132.
[22] M. Kubat, W.K. Chen, Weighted projection in nearest-neighbor classifiers, in: Proceedings of the First Southern Symposium on Computing, The University of Southern Mississippi, December 4–5, 1998.
[23] S.D. Bay, Combining nearest neighbor classifiers through multiple feature subsets, Intelligent Data Analysis 3 (3) (1999) 191–209.
[24] P.M. Murphy, UCI Repository of Machine Learning Databases [http://www.ics.uci.edu/mlearn/MLRepository.html], Department of Information and Computer Science, University of California, Irvine, CA, 1994.
[25] T.R. Babu, M.N. Murty, Comparison of genetic algorithms based prototype selection schemes, Pattern Recognition 34 (2001) 523–525.
250 P. Viswanath et al. / Information Fusion 5 (2004) 239–250
... According to the literature [1][2][3]application of the classifier combination to solving the practical tasks allows to improve the classification accuracy. The combined decision is supposed to be better (more accurate, more reliable) than the classification decision of the best individual classifier. ...
... However, this approach is inefficient for the high-dimensional feature space. In [2] the heuristic algorithm is applied for the partition of the feature set into several uncorrelated subsets, which because of being locally optimal doesn't guarantee the best result. In this paper we present novel approaches to evolutionary design of the classifier ensembles . ...
Article
Full-text available
This paper 1 presents two novel approaches to evolutionary design of the classifier ensemble. The first one presents the task of one-objective optimization of feature set partitioning together with feature weighting for the construction of the individual classifiers. The second approach deals with multi-objective optimization of classifier ensemble design. The proposed approaches have been tested on two data sets from the machine learning repository and one real data set on transient ischemic attack. The experiments show the advantages of the feature weighting in terms of classification accuracy when dealing with multivariate data sets and the possibility in one run of multi-objective genetic algorithm to get the non-dominated ensembles of different sizes and thereby skip the tedious process of iterative search for the best ensemble of fixed size.
... Very few studies were reported in literature regarding artificial pattern generation. V i s w a n a t h et al. [10,11] proposed a pattern synthesis approach for efficient nearest neighbor classification. A g r a w a l et al. [12] applied prototyping as an intermediate step in the synthetic pattern generation technique to reduce classification time of K nearest neighbour classifier. ...
... • The proposed method is suitable for the datasets having high dimensionality, but not very high dimensionality, as the computational time and the memory resources for finding the correlation (used for partitioning the features) between the features of the data increases with dimensionality. • The experimental results were in good agreement with the results reported by V i s w a n a t h et al. [10,11] on pattern synthesis for nearest neighbour classification. • The figures showed the variation of CA% with variation in the number of the nearest neighbors and demonstrated the profound effect of smoothing of the training patterns on the performance of the SVM classifier. ...
Article
Full-text available
Support Vector Machines (SVMs) have gained prominence because of their high generalization ability for a wide range of applications. However, the size of the training data that it requires to achieve a commendable performance becomes extremely large with increasing dimensionality using RBF and polynomial kernels. Synthesizing new training patterns curbs this effect. In this paper, we propose a novel multiple kernel learning approach to generate a synthetic training set which is larger than the original training set. This method is evaluated on seven of the benchmark datasets and experimental studies showed that SVM classifier trained with synthetic patterns has demonstrated superior performance over the traditional SVM classifier.
... After registration and label propagation for each of the K available atlases, a final unique segmentation (x) ∈ Λ is obtained by merging the information provided by each of the individual atlas-based segmentations into a single segmented image. This approach has been shown to be more accurate than an individual atlas segmentation [316] in the same manner as a combination of classifiers is generally more accurate than a single classifier in many pattern recognition scenarios [325][326][327][328][329][330]. ...
Article
Multimodal fusion in neuroimaging combines data from multiple imaging modalities to overcome the fundamental limitations of individual modalities. Neuroimaging fusion can achieve higher temporal and spatial resolution, enhance contrast, correct imaging distortions, and bridge physiological and cognitive information. In this study, we analyzed over 450 references from PubMed, Google Scholar, IEEE, ScienceDirect, Web of Science, and various sources published from 1978 to 2020. We provide a review that encompasses (1) an overview of current challenges in multimodal fusion (2) the current medical applications of fusion for specific neurological diseases, (3) strengths and limitations of available imaging modalities, (4) fundamental fusion rules, (5) fusion quality assessment methods, and (6) the applications of fusion for atlas-based segmentation and quantification. Overall, multimodal fusion shows significant benefits in clinical diagnosis and neuroscience research. Widespread education and further research amongst engineers, researchers and clinicians will benefit the field of multimodal neuroimaging.
... The proposed method is applied on OCR Handwritten digit data [20] Figure 3. Among 3330 patterns 98.6% of the patterns are recognized correctly. ...
Article
Full-text available
A method has been proposed to classify handwritten Arabic numerals in its compressed form using partitioning approach, Leader algorithm and Neural network. Handwritten numerals are represented in a matrix form. Compressing the matrix representation by merging adjacent pair of rows using logical OR operation reduces its size in half. Considering each row as a partitioned portion , clusters are formed for same partition of same digit separately. Leaders of clusters of partitions are used to recognize the patterns by Divide and Conquer approach using proposed ensemble neural network. Experimental results show that the proposed method recognize the patterns accurately.
... Since we search for similar subgraphs in the query graph, we have also done an extensive literature survey on isomorphic graph detection. Partitioned Pattern Count (PPC) trees [15] and the divide-and-conquer based split search algorithm in feature trees [10] are examples of similar subtree detection done in a top-down way. [17] adopts a bottom-up, greedy way of subquery identification and aims at identifying only the largest similar subgraphs within a query graph. ...
Article
In this paper, we revisit the problem of query optimization in relational DBMS. We propose a scheme to reduce the search space of Dynamic Programming based on reuse of query plans among similar subqueries. The method generates the cover set of similar subgraphs present in the query graph and allows their corresponding subqueries to share query plans among themselves in the search space. Numerous variants of this scheme have been developed for enhanced memory efficiency. Our implementation and experimental study in PostgreSQL show that one of the schemes is better suited to improve the performance of (Iterative) Dynamic Programming.
Chapter
In the process of finding novel patterns, algorithms for mining large datasets face a number of issues. We discuss the issues related to efficiency in data mining. We elaborate on important data mining tasks such as clustering, classification, and association rule mining that are relevant to the content of the book. We discuss popular and representative algorithms of partitional and hierarchical data clustering. In classification, we discuss the nearest-neighbor classifier and the support vector machine; we use both these algorithms extensively in the book. We provide an elaborate discussion on issues in mining large datasets and possible solutions, and we discuss each possible direction in detail. The discussion on clustering includes topics such as incremental clustering with a focus on the leader and BIRCH clustering algorithms, divide-and-conquer clustering algorithms, and clustering based on intermediate representation. The discussion on classification includes topics such as incremental classification and classification based on intermediate abstraction. We further discuss frequent-itemset mining in two directions: divide-and-conquer itemset mining and intermediate abstraction for frequent-itemset mining. Bibliographic notes contain a brief discussion of the significant research contributions in each of the directions discussed in the chapter, along with literature for further study.
Conference Paper
This article describes the structure and functional content of a software system, developed by the authors, for classification of different types of data based on ensemble algorithms. The article also describes the heterogeneous ensemble classification algorithm implemented in the software system. A distinctive feature of the algorithm is the iterative use of single (basic) classifiers on the initial training sample and the inclusion in the ensemble of only those classifiers whose relative error does not exceed a predetermined threshold. The software system was tested on real medical data, and the classification accuracy obtained using the basic classifiers and the heterogeneous ensemble algorithm was compared. Test results showed the effectiveness (increased classification accuracy) of the heterogeneous ensemble algorithm compared with the single classifiers.
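The selection rule described above, retaining a base classifier only if its relative error on the training sample stays below a fixed threshold, might be sketched as follows. The particular scikit-learn base learners, the error measure, and the majority-vote combination are assumptions made for illustration, not details taken from the cited system.

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

def build_heterogeneous_ensemble(X, y, error_threshold=0.2):
    # Fit several different base classifiers on the same training sample and
    # keep only those whose training error does not exceed the threshold.
    candidates = [DecisionTreeClassifier(max_depth=5),
                  KNeighborsClassifier(n_neighbors=3),
                  GaussianNB()]
    ensemble = []
    for clf in candidates:
        clf.fit(X, y)
        error = 1.0 - clf.score(X, y)
        if error <= error_threshold:
            ensemble.append(clf)
    return ensemble

def ensemble_predict(ensemble, X):
    # Combine the retained classifiers by simple majority vote
    # (assumes non-negative integer class labels).
    votes = np.array([clf.predict(X) for clf in ensemble])   # (n_clf, n_samples)
    return np.array([np.bincount(col).argmax() for col in votes.T])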
Conference Paper
To improve the precision and generalization of the ensemble model and the leaching model, a novel selective hierarchical ensemble modeling approach is proposed for leaching rate prediction in this paper. Unlike previous selective ensemble models, the new selective ensemble model is hierarchical: it considers not only the combination of sub-models but also their generation. First, a new multi-model ensemble hybrid model (MEHM) based on the bagging algorithm is proposed. In this model, the sub-models are composed of a data model and a mechanism model, and the data model generates training subsets using the proposed based vector bootstrap sampling algorithm. Afterwards, a new selective multi-model ensemble hybrid model (NSMEHM) based on the binary particle swarm optimization (PSO) algorithm is presented. In this model, the binary PSO algorithm is used to find a group of MEHMs that minimizes the error and maximizes the diversity. Experimental results indicate that the proposed NSMEHM has better prediction performance than the other models.
Article
Generally, the curse of dimensionality leads to a large bias in NNC. In this paper, multiple real-valued NNCs based on different feature subsets are combined by a fuzzy integral so that the bias of NNC in high dimensionality is minimized; the resulting classifier is called FI-MRNNC. In FI-MRNNC, the feature set is partitioned into several low-dimensionality feature subsets, and a fuzzy measure is used to quantify the importance of each feature subset and the interaction between feature subsets in the decision-making process. Guided by the FI-MRNNC's classification accuracy, a GA not only partitions the feature set into several feature subsets but also defines a density value for each corresponding feature subset. Experimental results on some UCI databases illustrate that FI-MRNNC can reduce the bias of NNC, especially in high dimensionality.
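As a rough illustration of combining NN classifiers built on separate feature subsets, the sketch below uses a plain weighted vote, with per-subset weights standing in for the fuzzy densities; the fuzzy integral itself and the GA-based partitioning of the cited paper are not reproduced here.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def subset_nn_ensemble(X, y, feature_subsets, weights, X_test):
    # Train one 1-NN classifier per feature subset and combine them with a
    # weighted vote. The weights play the role of the per-subset importance
    # values (a simplification of the fuzzy measure used in FI-MRNNC).
    n_classes = int(y.max()) + 1
    scores = np.zeros((len(X_test), n_classes))
    for subset, w in zip(feature_subsets, weights):
        clf = KNeighborsClassifier(n_neighbors=1).fit(X[:, subset], y)
        pred = clf.predict(X_test[:, subset])
        scores[np.arange(len(X_test)), pred] += w
    return scores.argmax(axis=1)

# Toy usage: 8 features split into two subsets with unequal importance.
X = np.random.randn(200, 8)
y = (X[:, 0] + X[:, 4] > 0).astype(int)
predictions = subset_nn_ensemble(X, y, [[0, 1, 2, 3], [4, 5, 6, 7]], [0.6, 0.4], X[:20])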
Article
This study examines rain occurrence from passive microwave imagery during typhoons. The dataset consists of 53 typhoons affecting the watershed over 2001-2008. The study employs the nearest neighbor search (NNS) classifier, which is often used for diagnosing forecast problems; the multilayer perceptron (MLP) and logistic regression (LR) are selected as benchmarks. The results show that, for rain/non-rain discrimination, the best performing classifier is NNS according to the AUC measures, and that the use of NNS can effectively improve the AUC measures for diagnosing rain occurrence. Overall, NNS is a relatively effective algorithm compared to other classifiers.
Article
Combining multiple classifiers is an effective technique for improving accuracy. There are many general combining algorithms, such as Bagging, Boosting, or Error Correcting Output Coding, that significantly improve classifiers like decision trees, rule learners, or neural networks. Unfortunately, these combining methods do not improve the nearest neighbor classifier. In this paper, we present MFS, a combining algorithm designed to improve the accuracy of the nearest neighbor (NN) classifier. MFS combines multiple NN classifiers each using only a random subset of features. The experimental results are encouraging: On 25 datasets from the UCI Repository, MFS significantly outperformed several standard NN variants and was competitive with boosted decision trees. In additional experiments, we show that MFS is robust to irrelevant features, and is able to reduce both bias and variance components of error. Keywords: multiple models, combining classifiers, nearest neighbor, feature selection,...
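The MFS idea, several NN classifiers each restricted to a random subset of the features and combined by voting, can be sketched as follows. The subset size, the number of ensemble members, and the scikit-learn 1-NN implementation are illustrative choices rather than the exact settings of the cited paper.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def mfs_fit(X, y, n_members=10, subset_size=None, seed=0):
    # Fit n_members 1-NN classifiers, each on a random feature subset
    # (sampled here without replacement; MFS also allows sampling with replacement).
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    subset_size = subset_size or max(1, d // 2)
    members = []
    for _ in range(n_members):
        subset = rng.choice(d, size=subset_size, replace=False)
        clf = KNeighborsClassifier(n_neighbors=1).fit(X[:, subset], y)
        members.append((subset, clf))
    return members

def mfs_predict(members, X_test):
    # Majority vote over the member predictions (non-negative integer labels).
    votes = np.array([clf.predict(X_test[:, subset]) for subset, clf in members])
    return np.array([np.bincount(col).argmax() for col in votes.T])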
Article
This chapter discusses the role that the relationship between the number of measurements and the number of training patterns plays at various stages in the design of a pattern recognition system. The designer of a pattern recognition system should make every possible effort to obtain as many samples as possible. As the number of samples increases, not only does the designer have more confidence in the performance of the classifier, but also more measurements can be incorporated in the design of the classifier without the fear of peaking in its performance. However, there are many pattern classification problems where either the number of samples is limited or obtaining a large number of samples is extremely expensive. If the designer chooses to take the optimal Bayesian approach, the average performance of the classifier improves monotonically as the number of measurements is increased. Most practical pattern recognition systems employ a non-Bayesian decision rule because the use of optimal Bayesian approach requires knowledge of prior densities, and besides, their complexity precludes the development of real-time recognition systems. The peaking behavior of practical classifiers is caused principally by their nonoptimal use of measurements.