Fusion of multiple approximate nearest neighbor classifiers
for fast and efficient classification
P. Viswanath *, M. Narasimha Murty, Shalabh Bhatnagar *
Department of Computer Science and Automation, Indian Institute of Science, Bangalore 560 012, India
Received 8 May 2003; received in revised form 25 February 2004; accepted 25 February 2004
Available online 20 March 2004
Abstract
The nearest neighbor classifier (NNC) is a popular non-parametric classifier. It is a simple classifier with no design phase and shows good performance. Important factors affecting the efficiency and performance of NNC are (i) the memory required to store the training set, (ii) the classification time required to search the nearest neighbor of a given test pattern, and (iii) the curse of dimensionality, due to which it becomes severely biased when the dimensionality of the data is high and only finite samples are available. In this paper we propose (i) a novel pattern synthesis technique to increase the density of patterns in the input feature space, which can reduce the curse of dimensionality effect, (ii) a compact representation of the training set to reduce the memory requirement, (iii) a weak approximate nearest neighbor classifier which has constant classification time, and (iv) an ensemble of the approximate nearest neighbor classifiers where the individual classifiers' decisions are combined by majority vote. The ensemble has a constant classification time upper bound and, according to empirical results, it shows good classification accuracy. A comparison based on empirical results is drawn between our approaches and other related classifiers.
© 2004 Elsevier B.V. All rights reserved.
Keywords: Multi-classifier fusion; Ensemble of classifiers; Nearest neighbor classifier; Pattern synthesis; Approximate nearest neighbor classifier; Compact representation
1. Introduction
The nearest neighbor classifier (NNC) is a very popular non-parametric classifier [1,2]. It is widely used because of its simplicity and good performance. It has no design phase but simply stores the training set. The test pattern is classified to the class of its nearest neighbor in the training set. So the classification time required for the NNC is largely due to reading the entire training set to find the nearest neighbor(s) (we assume that the training set is not preprocessed, e.g., indexed, to reduce the time needed to find the neighbor). Thus two major shortcomings of the classifier are that (i) the entire training set needs to be stored and (ii) the entire training set needs to be searched. To add to this list, when the dimensionality of the data is high, it becomes severely biased with a finite training set due to the curse of dimensionality [2].

Cover and Hart [3] show that the error for the NNC is bounded by twice the Bayes error when the available sample size is infinite. However, in practice, one can never have an infinite number of training samples. With a fixed number of training samples, the error of the NNC tends to increase as the dimensionality of the data gets large. This is called the peaking phenomenon [4,5]. Jain and Chandrasekharan [6] point out that the number of training samples per class should be about 5–10 times the dimensionality of the data. The peaking phenomenon with the NNC is known to be more severe than with parametric classifiers such as Fisher's linear and quadratic classifiers [7,8]. Thus, it is widely believed that the size of the training set needed to achieve a given classification accuracy would be prohibitively large when the dimensionality of the data is high.

Increasing the training set size has two problems. These are: (i) space and time requirements get increased
* Corresponding authors. Tel.: +91803942368; fax: +910803600683.
E-mail addresses: viswanath@csa.iisc.ernet.in (P. Viswanath), mnm@csa.iisc.ernet.in (M. Narasimha Murty), shalabh@csa.iisc.ernet.in (S. Bhatnagar).
Information Fusion 5 (2004) 239–250. doi:10.1016/j.inffus.2004.02.003
and (ii) it may be expensive to get training patterns from the real world. The space requirement problem can be solved to some extent by using a compact representation of the training data, like the PC-tree [9], FP-tree [10], CF-tree [11], etc., or by using editing techniques [1] which reduce the training set size without affecting the performance. The classification time requirement problem can be solved by building an index over the training set, like the R-tree [12], while the curse of dimensionality problem can be tackled by using re-sampling techniques like bootstrapping [13], which are widely studied [14–18]. These remedies are orthogonal, i.e., they have to be applied one after the other (they cannot be combined into a single step). This paper, however, attempts to give a unified solution.

In this paper, we propose a novel bootstrap technique for NNC design, which we call partition based pattern synthesis, that reduces the curse of dimensionality effect. The artificial training set generated by this method can be exponentially larger than the original set (by the original set we mean the given training set). As a result the synthetic patterns cannot be explicitly stored. We propose a compact data structure called the partitioned pattern count tree (PPC-tree), which is a compact representation of the original set and is suitable for performing the synthesis. The classification time requirement problem is solved as follows. Finding an approximate nearest neighbor (NN) is computationally less demanding since it avoids an exhaustive search of the training set. We propose an approximate NN classifier called PPC-aNNC whose classification time is independent of the training set size. PPC-aNNC works directly with the PPC-tree, performs implicit pattern synthesis, and finds an approximate nearest neighbor of the given test pattern from the entire synthetic set. Thus an explicit bootstrap step to generate the artificial training set is avoided. However, PPC-aNNC is a weak classifier having a lower classification accuracy (CA) than NNC. Classification decision fusion of multiple PPC-aNNC's is empirically shown to achieve better CA than the conventional NNC. This ensemble of PPC-aNNC's is based on a simple majority voting technique and is suitable for parallel implementation. The proposed ensemble is a faster and better classifier than NNC and some of the classifiers of its kind. PPC-tree and PPC-aNNC assume discrete valued features. For other domains, the data sets need to be discretized appropriately.
Some of the earlier attempts at combining nearest neighbor (NN) classifiers are as follows. Breiman [19] experimentally demonstrated that combining NN classifiers does not improve performance as compared to that of a single NN classifier. He attributed this behavior to the characteristic of NN classifiers that the addition or removal of a small number of training instances does not change NN classification boundaries significantly. Decision trees and neural networks, he argued, are in this sense less stable than NN classifiers. In his experiments, the component NN classifiers stored a large number of prototypes, so the combination is computationally less efficient as well. Skalak [20] used a few selected prototypes for each component NN classifier and showed that the composite classifier outperforms the conventional NNC. Alpaydin [21] used multiple condensed sets generated by accessing the training set in various random orders. Each individual NNC works with a condensed training set and the final decision is made by taking a majority vote (either simple or weighted) of the individual classifiers. Experimental results show that this improves performance. Kubat and Chen [22] propose an ensemble of several NNCs, such that each independent classifier considers only one of the available features. Class assignment to new patterns is done through weighted majority voting of the individual classifiers. This does not work well in domains where the mutual inter-correlation between pairs of attributes is high. Bay [23] combined multiple NN classifiers where each component uses only a random subset of the features. Experimentally this also is shown to improve performance in most cases.

Hamamoto et al. [18] proposed a bootstrap technique for NNC design which is experimentally shown to perform well. In their approach, each training pattern is replaced by a weighted average (which is the centroid if the weights are equal) of its r nearest neighbors in the training set.
We present experimental results in this paper with six different data sets (having both discrete and continuous valued features), and a comparison is drawn between our approaches and (i) NNC, (ii) k-NNC, (iii) the Naive–Bayes classifier, (iv) NNC based on the bootstrap technique given by Hamamoto et al. [18], (v) voting over multiple condensed nearest neighbors [21] and (vi) the weighted nearest neighbor with feature projection [22].

This paper is organized as follows: partition based pattern synthesis is described in Section 2, the compact data structures in Section 3, PPC-aNNC in Section 4.2, the ensemble of PPC-aNNC's in Section 4.3, experimental results in Section 5 and conclusions in Section 6, respectively.
2. Partition based pattern synthesis
We use the following notation and definitions to describe partition based pattern synthesis and various other concepts throughout this paper.
2.1. Notation and definitions
Set of features:
$F = \{f_1, f_2, \ldots, f_d\}$ is the set of features. Feature $f_i$ takes its value from domain $D_i$ ($1 \le i \le d$).

Pattern:
$X = (x_1, x_2, \ldots, x_d)^T$ is a pattern in $d$-dimensional vector format. $X[f_i]$ is the feature-value of pattern $X$ for feature $f_i$, with $X[f_i] \in D_i$ ($1 \le i \le d$). Thus, $X[f_i] = x_i$ for pattern $X$.

Set of class labels:
$\Omega = \{1, 2, \ldots, c\}$ is the set of class labels. Each training pattern has a class label.

Set of training patterns:
$\mathcal{X}$ is the set of all training patterns. $\mathcal{X}_l$ is the set of training patterns for the class with label $l$, and $\mathcal{X} = \mathcal{X}_1 \cup \mathcal{X}_2 \cup \cdots \cup \mathcal{X}_c$.

Partition:
$\pi_l = \{B_1, B_2, \ldots, B_p\}$ is a partition of $F$ for the class with label $l$, i.e., $B_i \subseteq F$ for all $i$, $\bigcup_i B_i = F$, and $B_i \cap B_j = \emptyset$ if $i \ne j$, for all $i, j$. The set of partitions is $P = \{\pi_l \mid 1 \le l \le c\}$.

Sub-pattern:
A pattern for which zero or more feature-values are absent (missing or unknown) is called a sub-pattern. An absent feature-value is represented by $H$. Thus, if $Y$ is a sub-pattern, then $Y[f_i] \in D_i \cup \{H\}$, $1 \le i \le d$.

Scheme of a sub-pattern:
A sub-pattern $Y$ is said to be of scheme $S$, where $S \subseteq F$, if for $1 \le i \le d$, $Y[f_i] \in D_i$ if $f_i \in S$, and $Y[f_i] = H$ otherwise.

Sub-pattern of a pattern:
$X^S$ is said to be the sub-pattern of pattern $X$ with scheme $S$, provided $X^S[f_i] = X[f_i]$ if $f_i \in S$, and $X^S[f_i] = H$ otherwise.

Set of sub-patterns:
A collection of sub-patterns with all members having the same scheme. A collection of sub-patterns of different schemes is not a set of sub-patterns. We further define the set of sub-patterns for a set of patterns with respect to a scheme as follows. If $\mathcal{W}$ is a set of patterns, then $\mathcal{W}^S$, called the set of sub-patterns of $\mathcal{W}$ with respect to scheme $S$, is $\mathcal{W}^S = \{W^S \mid W \in \mathcal{W}\}$.

Merge operation ($\oplus$):
If $P, Q$ are two sub-patterns of schemes $S_i, S_j$ respectively, then the merge of $P$ and $Q$, written $P \oplus Q$, is a sub-pattern of scheme $S_i \cup S_j$ and is defined only if $S_i \cap S_j = \emptyset$. If $R = P \oplus Q$, then for $1 \le k \le d$, $R[f_k] = P[f_k]$ if $f_k \in S_i$; $R[f_k] = Q[f_k]$ if $f_k \in S_j$; and $R[f_k] = H$ otherwise.

Join operation ($\oplus$):
If $Y_m, Y_n$ are sets of sub-patterns of schemes $S_i, S_j$ respectively, then the join of $Y_m$ and $Y_n$, written $Y_m \oplus Y_n$, is defined only if $S_i \cap S_j = \emptyset$, and $Y_m \oplus Y_n = \{R \mid R = P \oplus Q,\; P \in Y_m,\; Q \in Y_n\}$. The join operation is commutative and associative (this follows directly from the definitions of the join and merge operations). So $Y_m \oplus (Y_n \oplus Y_o) = (Y_m \oplus Y_n) \oplus Y_o$, which is written as $Y_m \oplus Y_n \oplus Y_o$.
2.2. Synthetic pattern generation
The method of synthetic pattern generation is as follows.

(1) Choose an appropriate set of partitions $P = \{\pi_l \mid 1 \le l \le c\}$, where $\pi_l = \{B_1, B_2, \ldots, B_p\}$ is a partition of $F$ for the class with label $l$.
(2) The set of training patterns for the class with label $l$, that is $\mathcal{X}_l$, is replaced by its synthetic counterpart $SP(\mathcal{X}_l)$, where $SP(\mathcal{X}_l) = \mathcal{X}_l^{B_1} \oplus \mathcal{X}_l^{B_2} \oplus \cdots \oplus \mathcal{X}_l^{B_p}$.
(3) Repeat step 2 for each label $l \in \Omega$.

Note 1. The partition can be different for each class. However, we assume $|\pi_l| = p$, a constant, for all $l \in \Omega$. This simplifies the analysis of the classification methods and the cross-validation method described in subsequent sections.

Note 2. If each pattern is seen as an ordered tuple, then $\mathcal{X}_l \subseteq SP(\mathcal{X}_l) \subseteq D_1 \times D_2 \times \cdots \times D_d$.
2.3. Example
This example illustrates the concept of synthetic pattern generation. Let $F = \{f_1, f_2, f_3, f_4\}$, $D_1 = \{red, green, blue\}$, $D_2 = \{2, 3, 4, 5\}$, $D_3 = \{small, big\}$ and $D_4 = \{1.75, 2.04\}$, respectively.

Let $\mathcal{X}_l = \{(red, 3, big, 1.75)^T, (green, 2, small, 1.75)^T\}$ be the training set for the class with label $l$. Also, let the partition for this class be $\pi_l = \{B_1, B_2\}$, where $B_1 = \{f_1, f_3\}$ and $B_2 = \{f_2, f_4\}$. Then $\mathcal{X}_l^{B_1} = \{(red, H, big, H)^T, (green, H, small, H)^T\}$ and $\mathcal{X}_l^{B_2} = \{(H, 3, H, 1.75)^T, (H, 2, H, 1.75)^T\}$, respectively.

The set of synthetic patterns for class $l$, i.e., $SP(\mathcal{X}_l)$, is
$SP(\mathcal{X}_l) = \mathcal{X}_l^{B_1} \oplus \mathcal{X}_l^{B_2} = \{(red, 3, big, 1.75)^T, (red, 2, big, 1.75)^T, (green, 3, small, 1.75)^T, (green, 2, small, 1.75)^T\}$.
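To make the synthesis step concrete, the following Python sketch enumerates $SP(\mathcal{X}_l)$ for one class under the above definitions; the helper names (project, synthesize_class) are ours, and explicit enumeration is only practical for small sets, since the size can grow as $n^p$.

```python
from itertools import product

def project(pattern, block):
    """Sub-pattern of `pattern` restricted to the features in `block`.
    `pattern` maps feature name -> value; features outside `block` are
    simply omitted (they play the role of the absent value)."""
    return tuple((f, pattern[f]) for f in block)

def synthesize_class(patterns, partition):
    """Partition based pattern synthesis for one class (explicit version).
    patterns  : list of dicts (feature name -> value), the original set X_l
    partition : list of blocks, each an ordered list of feature names
    Returns SP(X_l) as a list of dicts; its size can reach n**p."""
    # The sets of distinct sub-patterns X_l^{B_j}, one per block.
    sub_sets = [{project(x, block) for x in patterns} for block in partition]
    synthetic = []
    # Join: pick one sub-pattern per block and merge (blocks are disjoint).
    for combo in product(*sub_sets):
        merged = {}
        for sub in combo:
            merged.update(dict(sub))
        synthetic.append(merged)
    return synthetic

# The example of Section 2.3 yields 4 synthetic patterns:
X_l = [{'f1': 'red',   'f2': 3, 'f3': 'big',   'f4': 1.75},
       {'f1': 'green', 'f2': 2, 'f3': 'small', 'f4': 1.75}]
print(len(synthesize_class(X_l, [['f1', 'f3'], ['f2', 'f4']])))  # -> 4
```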
2.4. A partitioning method
An appropriate partition needs to be chosen for the given classification problem. We present a simple heuristic based method to find a partition. This method is based on pair-wise correlation between the features and is therefore suitable only for domains having numerical feature values. Domain knowledge can also be used to get an appropriate partition. The synthesis method can, however, work with any domain provided a partition is given.

The partitioning method is given below. The objective of this method is to find a partition such that the average correlation between features within a block is high and that between features of different blocks is low. Since this objective is computationally demanding, we give a greedy method which finds only a locally optimal partition.
Find-Partition()
{
Input: (i) Set of features, F = {f_1, ..., f_d}.
       (ii) Pair-wise correlations between features, C = {c[f_i][f_j] = correlation between f_i and f_j | 1 ≤ i, j ≤ d}.
       (iii) p = number of blocks required in the partition, such that p ≤ d.
Output: Partition, π = {B_1, B_2, ..., B_p}.
(1) Mark all features in F as unused.
(2) Find c[f'_1][f'_2], the minimum element in C such that f'_1 ≠ f'_2.
(3) B_1 = {f'_1}, B_2 = {f'_2}.
(4) Mark f'_1, f'_2 as used.
(5) For i = 3 to p
    {
    (i) Choose an unmarked feature f'_i such that (c[f'_i][f'_1] + ... + c[f'_i][f'_{i-1}]) / (i - 1) is minimum, where f'_1, ..., f'_{i-1} are the features already marked as used.
    (ii) B_i = {f'_i}.
    (iii) Mark f'_i as used.
    }
(6) For each unmarked feature f'
    {
    (i) For i = 1 to p: T_i = (Σ_{j=1}^{|B_i|} c[f'][f'_j]) / |B_i|, where f'_1, ..., f'_{|B_i|} are the features in B_i.
    (ii) Find the maximum element of {T_1, T_2, ..., T_p}; let it be T_k.
    (iii) B_k = B_k ∪ {f'}.
    (iv) Mark f' as used.
    }
(7) Output the partition, π = {B_1, ..., B_p}.
}
For each class of training patterns, the above method is used separately. Experiments (Section 5) are done with the number of blocks (i.e., p) being 1, 2, 3 and d, respectively, where d is the total number of features.
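A minimal Python sketch of this greedy heuristic, assuming the pair-wise correlations are supplied as a d × d NumPy matrix; find_partition is our name and the code follows the pseudocode literally, without absolute values or other refinements.

```python
import numpy as np

def find_partition(C, p):
    """Greedy feature-set partitioning in the spirit of Find-Partition.
    C : (d, d) array of pair-wise correlations between features.
    p : required number of blocks (p <= d).
    Returns a list of p blocks, each a list of feature indices."""
    d = C.shape[0]
    if p == 1:                                   # trivial case: one block
        return [list(range(d))]
    unused = set(range(d))

    # Steps 2-4: seed the first two blocks with the least correlated pair.
    i, j = min(((a, b) for a in range(d) for b in range(d) if a != b),
               key=lambda ab: C[ab])
    blocks = [[i], [j]]
    unused -= {i, j}

    # Step 5: each further block is seeded by the unused feature whose
    # average correlation with the seeds chosen so far is minimum.
    for _ in range(2, p):
        seeds = [b[0] for b in blocks]
        f = min(unused, key=lambda g: np.mean([C[g, s] for s in seeds]))
        blocks.append([f])
        unused.remove(f)

    # Step 6: put every remaining feature into the block with which its
    # average correlation is maximum.
    for f in sorted(unused):
        k = max(range(p), key=lambda b: np.mean([C[f, g] for g in blocks[b]]))
        blocks[k].append(f)
    return blocks
```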
3. The data structures
Partition based pattern synthesis can generate a synthetic set of size $O(n^p)$, where $n$ is the original set size and $p$ is the number of blocks in the partition. Hence explicitly storing the synthetic set is very space consuming. In this section we present a compact representation of the original set which is suitable for the synthesis. For large data sets, this representation requires less storage space than the original set. This representation is called the partitioned pattern count tree (PPC-tree).

The partitioned pattern count tree (PPC-tree) is a generalization of the pattern count tree (PC-tree). For the sake of completeness, we first give a brief overview of the PC-tree, details of which can be found in [9]. These data structures are suitable when each feature takes discrete values (which can also be categorical). For continuous valued features, an appropriate discretization needs to be done first. Later, we present a simple discretization process which is used in our experimental studies.
3.1. PC-tree
The PC-tree is a complete and compact representation of the training patterns which belong to a class. An order is imposed on the set of features $F = \{f_1, \ldots, f_d\}$, where $f_i$ denotes the $i$th feature. Patterns belonging to a class are stored in a tree structure (the PC-tree), where each feature occupies a node. Every training pattern is present in a path from the root to a leaf. Two patterns $X, Y$ of a class can share a common node for their respective $n$th feature if $X[f_i] = Y[f_i]$ for $1 \le i \le n$.

A node has, along with a feature value, a count indicating how many patterns share that node. A compact representation of the training set is obtained as many patterns share a common node in the tree. The given training set is represented as the set $\{T_1, T_2, \ldots, T_c\}$, where each element $T_i$ is the PC-tree for the class of training patterns with label $i$.
3.1.1. Example
Let $\{(a, b, c, x, y, z)^T, (a, b, d, x, y, z)^T, (a, e, c, x, y, u)^T, (f, b, c, x, y, v)^T\}$ be the original training set for a class with label $i$. Then the corresponding PC-tree $T_i$ (the same symbol is used for the tree and the root node of the tree) for this training set is shown in Fig. 1. Each node of the tree is of the format (feature-value : count).
3.2. PPC-tree
Let $\mathcal{X}_i$ be the set of original patterns which belong to the class with label $i$. Let $\pi_i = \{B_1, B_2, \ldots, B_p\}$ be a partition of the feature set $F$, where each block $B_j = \{f_{j1}, \ldots, f_{j|B_j|}\}$ (for $1 \le j \le p$) is an ordered set and the $n$th feature of block $B_j$ is $f_{jn}$. Then the PPC-tree for $\mathcal{X}_i$ with respect to $\pi_i$ is $T_i = \{T_{i1}, \ldots, T_{ip}\}$, a set of PC-trees such that $T_{ij}$ is the PC-tree for the set of sub-patterns $\mathcal{X}_i^{B_j}$, for $1 \le j \le p$, where the $H$-valued features (see Section 2.1) are ignored. Each PC-tree $T_{ij}$ corresponds to a class (with label $i$) and to a block ($B_j$ such that $B_j \in \pi_i$) of the partition of that class. The given training set is represented as the set $\{T_1, T_2, \ldots, T_c\}$, where each element $T_i$ is the PPC-tree for the class of training patterns with label $i$, and $T_i = \{T_{i1}, \ldots, T_{ip}\}$ is a set of PC-trees.

A path from the root to a leaf of the PC-tree $T_{ij}$ (excluding the root node) corresponds to a unique sub-pattern with scheme $B_j \in \pi_i$. If $(x_1, x_2, \ldots, x_{|B_j|})$ is a path in $T_{ij}$, then the corresponding sub-pattern is $P$ such that $P[f_{j1}] = x_1, P[f_{j2}] = x_2, \ldots, P[f_{j|B_j|}] = x_{|B_j|}$, and for the remaining features $f$, i.e., $f \in F - B_j$, $P[f] = H$. If $Q_j$ is the sub-pattern corresponding to a path in $T_{ij}$ for $1 \le j \le p$, then $Q = Q_1 \oplus Q_2 \oplus \cdots \oplus Q_p$ is a synthetic pattern in the class with label $i$. Algorithms 1 and 2 give the construction procedures.
Algorithm 1 (Build-PPC-trees())
{Input: (i) The original training set.
        (ii) The partition for each class, i.e., π_1, π_2, ..., π_c.
Output: The set of PPC-trees, T = {T_1, ..., T_c}, where T_i = {T_i1, ..., T_ip} for 1 ≤ i ≤ c. T_ij is the PC-tree for the class with label i and block B_j ∈ π_i.
Assumptions: (i) The number of blocks in each π_i, 1 ≤ i ≤ c, is the same and is equal to p.
             (ii) Each T_ij is empty (i.e., has only the root node) to start with.}
for i = 1 to c do
  for each training pattern X ∈ X_i do
    for j = 1 to p do
      Add-Pattern(T_ij, X)
    end for
  end for
end for
3.2.1. Example
For the example considered in Section 3.1.1, the PPC-tree is shown in Fig. 2, where the partition is $\pi_i = \{B_1, B_2\}$ such that $B_1 = \{f_1, f_2, f_3\}$ and $B_2 = \{f_4, f_5, f_6\}$, respectively. The ordering of features considered for each block is the same as that in Example 3.1.1. Thus the PPC-tree is the set of PC-trees $\{T_{i1}, T_{i2}\}$. $T_{i1}$ is the PC-tree for the set of sub-patterns $\mathcal{X}_i^{B_1} = \{(a, b, c, H, H, H)^T, (a, b, d, H, H, H)^T, (a, e, c, H, H, H)^T, (f, b, c, H, H, H)^T\}$, where the $H$-valued features are ignored. Similarly, $T_{i2}$ is the PC-tree for the set of sub-patterns $\mathcal{X}_i^{B_2}$; see Fig. 2.

Note that the PPC-tree is a more compact representation than the corresponding PC-tree. From the examples, it can be seen that the number of nodes in the PPC-tree is 16, whereas that in the PC-tree is 22. A path from root to leaf of $T_{i1}$ represents a sub-pattern with scheme $B_1$ and that of $T_{i2}$ represents a sub-pattern with scheme $B_2$. Merging the two sub-patterns gives a synthetic pattern according to the partition.
Algorithm 2 (Add-Pattern(PC-tree T_ij, Pattern X))
X' = X^{B_j} such that B_j ∈ π_i   {X' is the sub-pattern of X with scheme B_j ∈ π_i}
Node current-node = T_ij.root
for j = 1 to d do   {d is the dimensionality of X}
  if (X'[f_j] ≠ H) then
    L = list of child nodes of current-node
    if (L is empty) then
      Node new-node = create a new node
      new-node.feature-value = X'[f_j]
      new-node.count = 1
      Make new-node a child of current-node
      current-node = new-node
    else
      if (a node v ∈ L exists such that v.feature-value = X'[f_j]) then
        v.count = v.count + 1
        current-node = v
      else
        Node new-node = create a new node
        new-node.feature-value = X'[f_j]
        new-node.count = 1
        Make new-node a child of current-node
        current-node = new-node
      end if
    end if
  end if
end for
Fig. 1. PC-tree $T_i$ (each node shown as feature-value : count).

Fig. 2. PPC-tree $T_i = \{T_{i1}, T_{i2}\}$: the PC-tree for block 1 and the PC-tree for block 2.
Further, both the PC-tree and the PPC-tree can be incrementally built by scanning the database of patterns only once, and they are suitable for discrete valued features, which could be of categorical type as well.
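The following Python sketch mirrors Algorithms 1 and 2: a PC-tree node stores a feature value and a count, and the sub-pattern of a pattern for one block is inserted by walking (and, where needed, extending) a single root-to-leaf path. The class and function names are ours, not the authors'.

```python
class PCNode:
    """One node of a PC-tree: a feature value plus a pattern count."""
    def __init__(self, value=None):
        self.value = value        # feature value stored at this node
        self.count = 0            # number of patterns sharing this node
        self.children = []        # child PCNode objects

def add_pattern(root, pattern, block):
    """Insert the sub-pattern of `pattern` with scheme `block` (cf. Algorithm 2).
    pattern : dict mapping feature name -> value
    block   : ordered list of feature names (one block of the partition)"""
    node = root
    for f in block:
        v = pattern[f]
        child = next((c for c in node.children if c.value == v), None)
        if child is None:         # no shared prefix: grow the path
            child = PCNode(v)
            node.children.append(child)
        child.count += 1          # one more pattern shares this node
        node = child

def build_ppc_trees(training_sets, partitions):
    """Cf. Algorithm 1: one PPC-tree (a list of PC-trees, one per block) per class.
    training_sets : dict label -> list of patterns (dicts)
    partitions    : dict label -> list of blocks"""
    trees = {}
    for label, patterns in training_sets.items():
        trees[label] = [PCNode() for _ in partitions[label]]
        for x in patterns:
            for tree, block in zip(trees[label], partitions[label]):
                add_pattern(tree, x, block)
    return trees
```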
4. Classification methods with synthetic patterns
We present three classification methods that work with synthetic patterns, viz., NNC(SP), PPC-aNNC, and an ensemble of several PPC-aNNC's.
4.1. NNC(SP)
NNC(SP) is the nearest neighbor classifier with synthetic patterns. Explicit generation of the synthetic set is done first and then NNC is applied. This method is computationally inefficient as the space and classification time requirements are both $O(n^p)$, where $p$ is the number of blocks used in the partition and $n$ is the original training set size. This method is presented for comparison with the other methods that use the synthetic set. The distance measure used by this method is the Euclidean distance.
4.2. PPC-aNNC
PPC-aNNC finds an approximate nearest neighbor of a given test pattern. The distance measure used here is the Manhattan distance (city block distance). The method is suitable for discrete and numeric valued features only. PPC-aNNC is described in Algorithm 3. Let $Q$ be the given test pattern. The quantity $dist_{ij}$ is the distance between the sub-pattern $Q^{B_j}$ and its approximate nearest neighbor in the set $\mathcal{X}_i^{B_j}$ (the set of sub-patterns of $\mathcal{X}_i$ with respect to the scheme $B_j \in \pi_i$), where the $H$-valued features are ignored. The quantity $d_i = \sum_{j=1}^{p} dist_{ij}$ is then the distance between $Q$ and its approximate nearest neighbor in the class with label $i$.

The method progressively finds a path in each $T_{ij}$, starting from the root and ending at a leaf. The ordering of features present in $Q^{B_j}$ must be the same as that of $B_j \in \pi_i$ which was used to construct the PC-tree $T_{ij}$. At each node, it tries to find a child which is nearest to the corresponding feature value in $Q^{B_j}$ (based on the absolute difference between the values) and proceeds to that node. If there is more than one such child, then it proceeds to the child that has the maximum count value. Let the chosen child node be $v$ and the value of the corresponding feature in $Q^{B_j}$ be $q$. Then the distance $dist_{ij}$ is increased by $|v.\text{feature-value} - q|$.

If $Q$ is present in the original training set then PPC-aNNC will find it, and in this case the neighbor obtained is the exact nearest neighbor.
4.2.1. Computational requirements of PPC-aNNC
Let the number of discrete values any feature can take be at most $l$, the dimensionality of each pattern be $d$ and the number of classes be $c$. Then the time complexity of PPC-aNNC is $O(cld)$, since it finds only one path in each of the $c$ PPC-trees (one per class) and at any node it searches only the child-list (of size at most $O(l)$) of that node to find the next node in the path. The path has $d$ nodes. For a given problem, $c$, $l$ and $d$ are constants (i.e., independent of the number of training patterns) that are typically much smaller than the number of training patterns. Thus, the effective time complexity of the method is only $O(1)$. That is, the classification time of PPC-aNNC is constant and is independent of the training set size. However, since it avoids an exhaustive search of the PPC-tree, it can only find an approximate nearest neighbor.
Algorithm 3 (PPC-aNNC(Test Pattern Q))
{Assumption (i): The set of PPC-trees {T_1, ..., T_c}, where T_i = {T_i1, ..., T_ip} for 1 ≤ i ≤ c, is assumed to be already built.
Assumption (ii): π_i = {B_1, B_2, ..., B_p} (1 ≤ i ≤ c) is the partition of the feature set F for the class with label i, and is the same as that used in the construction of the PPC-tree T_i, where each block B_j = {f_j1, ..., f_j|B_j|} (for 1 ≤ j ≤ p) is an ordered set with the nth feature of block B_j being f_jn.}
for each class with label i = 1 to c do
  for each B_j ∈ π_i (1 ≤ j ≤ p) do
    Q' = Q^{B_j}
    current-node = T_ij.root
    dist_ij = 0
    for l = 1 to |B_j| do
      L = list of child nodes of current-node
      Choose the sublist L' of nodes v ∈ L for which |Q'[f_jl] - v.feature-value| is minimum
      Choose a node v ∈ L' such that v.count is maximum   {ties are broken arbitrarily}
      dist_ij = dist_ij + |Q'[f_jl] - v.feature-value|
      current-node = v
    end for
  end for
end for
for i = 1 to c do
  d_i = 0
  for j = 1 to p do
    d_i = d_i + dist_ij
  end for
end for
Find d_x = the minimum element in {d_1, d_2, ..., d_c}
Output (class label of Q = x)
The space requirement of the method is mostly due to the PPC-tree structures. The PPC-trees require $O(n)$ space, where $n$ is the total number of original patterns. For medium to large data sets, empirical studies show that the space requirement is much smaller than that of the conventional vector format (i.e., each pattern represented by a list of feature values). However, for small data sets the space required may increase because of the data structure overhead (the space needed for pointers, etc.).
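A sketch, in the same hypothetical Python setting as the earlier PC-tree sketch, of the greedy descent performed by PPC-aNNC (Algorithm 3): one root-to-leaf walk per class and per block, accumulating a Manhattan-style distance.

```python
def approx_block_distance(root, query, block):
    """Greedy root-to-leaf walk in one PC-tree: at every level move to the
    child whose value is closest to the query's value for that feature
    (largest count breaks ties) and accumulate the absolute difference."""
    node, dist = root, 0.0
    for f in block:
        q = query[f]
        best = min(abs(c.value - q) for c in node.children)
        candidates = [c for c in node.children if abs(c.value - q) == best]
        node = max(candidates, key=lambda c: c.count)
        dist += best
    return dist

def ppc_annc(trees, partitions, query):
    """Approximate-NN classification: one walk per class and per block,
    so the work is O(c*l*d), independent of the training set size."""
    best_label, best_dist = None, float('inf')
    for label, pc_trees in trees.items():
        d_i = sum(approx_block_distance(tree, query, block)
                  for tree, block in zip(pc_trees, partitions[label]))
        if d_i < best_dist:
            best_label, best_dist = label, d_i
    return best_label
```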
4.3. The ensemble of PPC-aNNC’s
PPC-aNNC is a weak classifier since it finds only an approximate nearest neighbor for the test pattern. Partition based pattern synthesis depends on the partition chosen. The PPC-tree, and hence PPC-aNNC, depends not only on the partition chosen for each class, but also on the ordering of features within each block of the partitions. Thus various orderings of features in each block of the partitions result in various PPC-aNNC's. An ensemble of PPC-aNNC's, where the final decision is made by simple majority voting, is empirically shown to perform well. Let there be r component classifiers in the ensemble. Each component classifier is built from a random ordering of the features in each block.

Intuitively, the functioning of the PPC-aNNC's can be explained as follows. While finding the approximate nearest neighbor, PPC-aNNC gives emphasis to the features in a block according to their order. The first feature in a block is emphasized the most and the last feature the least. Notice that if there is only one feature in each block for the partitions of all classes, then PPC-aNNC finds the exact nearest neighbor in the entire synthetic set (generated according to this partitioning). This is because, in this case, all features are emphasized equally (since there is only one feature in each block). Since each PPC-aNNC in the ensemble is based on a random ordering of the features, the emphasis on features given by each of them is quite different from that of the others. Because of this, the errors made by the individual classifiers become significantly uncorrelated, causing the ensemble to perform well.

The ensemble is suitable for parallel implementation with r machines, where each machine implements a different PPC-aNNC. Communication is required only when the test pattern is sent to all individual classifiers and when the majority vote is taken, and therefore it results in a very small overhead. On the other hand, if the ensemble is implemented on a single machine, then the space and time requirements will be r times those of a single PPC-aNNC, which may not be feasible for large data sets.
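A minimal sketch of the fusion step, reusing the hypothetical build_ppc_trees and ppc_annc helpers from the earlier sketches: each of the r components is built with its own random within-block feature ordering, and the majority label wins.

```python
import random
from collections import Counter

def make_component(training_sets, partitions, rng):
    """One PPC-aNNC component built on a random within-block feature order."""
    shuffled = {label: [rng.sample(list(b), len(b)) for b in blocks]
                for label, blocks in partitions.items()}
    trees = build_ppc_trees(training_sets, shuffled)
    return lambda q: ppc_annc(trees, shuffled, q)

def ensemble_classify(components, query):
    """Simple majority vote over the component decisions."""
    votes = Counter(classify(query) for classify in components)
    return votes.most_common(1)[0][0]

# r components, each with a different random ordering of the features:
# rng = random.Random(0)
# components = [make_component(training_sets, partitions, rng) for _ in range(r)]
# predicted_label = ensemble_classify(components, test_pattern)
```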
5. Experiments
5.1. Datasets
We performed experiments with six different datasets, viz., OCR, WINE, VOWEL, THYROID, GLASS and PENDIGITS, respectively. Except for the OCR dataset, all are from the UCI Repository [24]. The OCR dataset is also used in [9,25], while the WINE, VOWEL, THYROID and GLASS datasets are used in [21]. The properties of the datasets are given in Table 1. All the datasets have only numeric valued features. The OCR dataset has binary discrete features, while the others have continuous valued features. Except for the OCR dataset, all datasets are normalized to have zero mean and unit variance for each feature and are subsequently discretized. Let a be a feature value after normalization, and a' be its discrete value. We used the following discretization procedure.

If (a < -0.75) then a' = -1;
Else-If (a < -0.25) then a' = -0.5;
Else-If (a < 0.25) then a' = 0;
Else-If (a < 0.75) then a' = 0.5;
Else a' = 1.
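A small sketch of the normalization and discretization as we read it, i.e., z-scoring each feature and mapping it onto the five symmetric levels {-1, -0.5, 0, 0.5, 1}; computing the statistics from the training portion only is our assumption.

```python
import numpy as np

def discretize(train, test):
    """Z-score each feature (statistics from `train`) and map every value
    onto the five levels {-1, -0.5, 0, 0.5, 1} with cuts at +/-0.25, +/-0.75."""
    mean, std = train.mean(axis=0), train.std(axis=0)
    std[std == 0] = 1.0                    # guard against constant features
    levels = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
    edges = np.array([-0.75, -0.25, 0.25, 0.75])

    def level(a):
        return levels[np.searchsorted(edges, a, side='right')]

    return level((train - mean) / std), level((test - mean) / std)
```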
5.2. Classifiers for comparison
The classifiers chosen for comparison purposes are as follows.

NNC: The test pattern is assigned to the class of its nearest neighbor in the training set. The distance measure used is the Euclidean distance.

k-NNC: A simple extension of NNC, where the most common class among the k nearest neighbors is chosen. The distance measure is the Euclidean distance. Three-fold cross-validation is done to choose the value of k.
Table 1
Properties of the datasets used
Dataset Number of features Number of classes Number of training examples Number of test examples
OCR 192 10 6670 3333
WINE 13 3 100 78
VOWEL 10 11 528 462
THYROID 21 3 3772 3428
GLASS 9 7 100 114
PENDIGITS 16 10 7494 3498
Naive–Bayes classifier (NBC): This is a specialization of the Bayes classifier where the features are assumed to be statistically independent. Further, the features are assumed to be of discrete type. Let $X = (x_1, \ldots, x_d)^T$ be a pattern and $l$ be a class label. Then the class conditional probability is $P(X \mid l) = P(x_1 \mid l) \cdots P(x_d \mid l)$. $P(x_i \mid l)$ is taken as the ratio of the number of patterns in the class with label $l$ whose feature $f_i$ has value $x_i$ to the total number of patterns in that class. The a priori probability for each class is taken as the ratio of the number of patterns in that class to the total training set size. The given test pattern is classified to the class for which the a posteriori probability is maximum. The OCR dataset is used as it is, whereas the other datasets are normalized (to have zero mean and unit variance for each feature) and discretized as done for PPC-aNNC.
NNC with bootstrapped training set (NNC(BS)): We used the bootstrap method given by Hamamoto et al. [18] to generate an artificial training set. The bootstrapping method is as follows. Let $X$ be a training pattern and let $X_1, \ldots, X_r$ be its $r$ nearest neighbors in its class. Then $X' = (\sum_{i=1}^{r} X_i)/r$ is the artificial pattern generated for $X$. In this manner, an artificial pattern is generated for each training pattern. NNC is then done with this new bootstrapped training set. The value of $r$ is chosen by three-fold cross-validation. (A small sketch of this bootstrap appears after this list.)
Voting over multiple condensed nearest neighbors (MCNNC): The condensed nearest neighbor classifier (CNNC) first finds a condensed training set, which is a subset of the training set such that NNC with the condensed set classifies each training pattern correctly. The condensed set is built incrementally. Changing the order in which the training patterns are considered can give a new condensed set. Alpaydin [21] proposed training multiple such subsets and taking a vote over them, thus combining predictions from a set of concept descriptions. Two voting schemes are given: simple voting, where voters have equal weight, and weighted voting, where weights depend on the classifiers' confidence in their predictions. The second scheme is empirically shown to do well, so it is taken for comparison purposes. The paper [21] proposes some additional improvements based on bootstrapping, etc., which are not considered here.
Weighted nearest neighbor with feature projection (wNNFP): This is given by Kubat and Chen in [22]. If d is the number of features, then d individual nearest neighbor classifiers are considered, each classifier taking only one feature into account. That is, d separate projected training sets are formed, each being used by an individual NNC. Weighted majority voting is used to combine the decisions of the individual NNC's. The weights for the individual classifiers are assigned based on their classification accuracies. Three-fold cross-validation is done for this.
NNC with synthetic patterns (NNC(SP)): This method is given in Section 4.1. The parameter P, i.e., the set of partitions, is chosen based on the cross-validation method given in Section 5.3.

PPC-aNNC: This method is given in Section 4.2, and the cross-validation method used to choose the parameter values in Section 5.3.

Ensemble of PPC-aNNC's: This method is given in Section 4.3. The cross-validation method used to choose the parameter values is given in Section 5.3.
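For reference, a sketch of the bootstrap of Hamamoto et al. as described above, with equal weights (the centroid case); whether the pattern itself is counted among its r neighbours is a detail we assume here.

```python
import numpy as np

def hamamoto_bootstrap(X, y, r):
    """Replace every pattern by the mean of its r nearest neighbours taken
    within its own class (here the pattern counts as its own 0-th neighbour)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    X_new = np.empty_like(X)
    for label in np.unique(y):
        idx = np.where(y == label)[0]
        Xc = X[idx]
        # pairwise squared Euclidean distances within the class
        d2 = ((Xc[:, None, :] - Xc[None, :, :]) ** 2).sum(-1)
        nn = np.argsort(d2, axis=1)[:, :r]        # r nearest per pattern
        X_new[idx] = Xc[nn].mean(axis=1)          # centroid of the r neighbours
    return X_new, y
```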
5.3. Validation method
Three-fold cross-validation is used to fix the parameter values for the various classifiers described in this paper. For the methods proposed in this paper, viz., NNC(SP), PPC-aNNC and the ensemble of PPC-aNNC's, we give a detailed cross-validation procedure below.

The training set is randomly divided into three equal non-overlapping subsets. If an equal division is not possible, then one or two randomly chosen training patterns are replicated to get an equal division. Two such subsets are combined to form a training set called a validation training set, and the remaining one is called a validation test set. In this way we get three different validation training sets and corresponding validation test sets. We call these sets val-train-set-1, val-train-set-2, val-train-set-3 for the validation training sets and val-test-set-1, val-test-set-2, val-test-set-3 for the corresponding validation test sets, respectively. For a given set of parameter values, val-train-set-i is used as the training set for the classifier, and the classification accuracy (CA) measured over val-test-set-i is called val-CA-i, where i = 1, 2 or 3. The average value of {val-CA-1, val-CA-2, val-CA-3} is called avg-val-CA and its standard deviation val-SD. The objective of cross-validation is to find a set of parameter values for which avg-val-CA is maximum. val-SD measures the spread of val-CA-i, i = 1, 2 or 3, around avg-val-CA. An exhaustive search over all possible sets of parameter values is computationally expensive, and hence we give a greedy approach for choosing the set of parameter values.
(1) NNC(SP): The parameters used in NNC(SP) are the partitions of the set of features (F), one for each class, which are used for performing partition based pattern synthesis. Let these partitions be represented as a set P = {π_1, π_2, ..., π_c}, where π_i (1 ≤ i ≤ c) is the partition used for the class with label i. Further, let P_p be the set of partitions where each element (i.e., partition) has exactly p blocks. The element from {P_1, P_2, P_3, P_d} (where d = |F|) which gives the maximum avg-val-CA is chosen. P_p for a given p is obtained either from domain knowledge or by using the method given in Section 2.4.

The OCR dataset consists of handwritten images on a two dimensional rectangular grid of size 16 × 12, where for each cell the presence of ink is represented by 1 and its absence by 0. It is known that, for a given class, the values in nearby cells are more highly dependent than those in far apart cells (nearness here is based on physical closeness between the cells). This knowledge is used for obtaining the partitions. An entire image is represented as a 192 dimensional vector, where the first 12 features correspond to the first row of the grid, the second 12 features correspond to the second row of the grid, and so on. Let the set of features in this order be F = {f_1, f_2, ..., f_192}. A partitioning of F with p (where p = 1, 2, 3 or 192) blocks, i.e., {B_1, B_2, ..., B_p}, is obtained in the following manner: the first 192/p features go into block B_1, the next 192/p features into block B_2, and so on.

For the other datasets, viz., WINE, VOWEL, THYROID, GLASS and PENDIGITS, the partitions are obtained by using the method given in Section 2.4.
(2) PPC-aNNC: The parameters are the partitions of the set of features, as used by NNC(SP), together with the ordering of the features in all blocks of each partition. These are chosen as follows. P_p for p = 1, 2, 3 and d is obtained as done for NNC(SP). For each P_p (where p = 1, 2, 3 or d), the features in each block (for each partition) are randomly ordered and avg-val-CA is obtained. 100 such runs (each with a different random ordering of features) are carried out for each P_p. The P_p, along with the ordering of features, for which avg-val-CA is maximum is then chosen.
(3) Ensemble of PPC-aNNC's: The parameters here are (i) the number of component classifiers (r), (ii) the set of partitions P_i = {π_1, ..., π_c} used by each component i (1 ≤ i ≤ r), and (iii) the ordering of features in each block of each partition. These parameters are chosen by restricting the search space as given below.

The set of partitions used by each component classifier is the same (except for the ordering of features). That is, P_1 = P_2 = ... = P_r, and it is chosen as done for PPC-aNNC. With 100 random orderings of features, PPC-aNNC is run and the respective classification accuracies (CA) are obtained. Let avg-CA and max-CA be the average and maximum CA of these 100 runs, respectively. We define a threshold classification accuracy thr-CA = (avg-CA + max-CA)/2. 50 component classifiers are obtained by finding 50 random orderings of features such that each component has CA greater than thr-CA. This process is done to choose good component classifiers. For these 50 component classifiers, their respective CAs and orderings of features are stored. This is done with each pair (val-train-set-i, val-test-set-i) for i = 1, 2 and 3. So we get in total 150 (i.e., 50 × 3) orderings of features along with their respective CAs. This corresponds to the list of orderings. This list is sorted based on the CA values and the best r orderings are chosen to be used in the final ensemble, where r, the number of component classifiers, is chosen as described below.

For each pair (val-train-set-i, val-test-set-i), for i = 1, 2 and 3, we obtain 50 component classifiers as explained above. From these 50 components, we choose m distinct components at random to form an ensemble. The CA of this ensemble is measured and is called val-CA-i_m. The above is done for m = 1 to 50 and for i = 1, 2 and 3. The quantity avg-val-CA_m is the average value of {val-CA-1_m, val-CA-2_m, val-CA-3_m}, and val-SD_m is its standard deviation. The number of components r is chosen such that avg-val-CA_r is the maximum element in {avg-val-CA_1, avg-val-CA_2, ..., avg-val-CA_50}.
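A compact sketch of the final step of this selection: for each candidate ensemble size m the validation accuracies over the three splits are averaged, and the m with the largest avg-val-CA_m is kept. accuracy_of_ensemble is a hypothetical callback standing in for building and evaluating an m-component ensemble on one split.

```python
import numpy as np

def choose_num_components(accuracy_of_ensemble, max_r=50, n_folds=3):
    """Pick the ensemble size r that maximizes avg-val-CA_m.
    accuracy_of_ensemble(m, fold) -> CA of a randomly drawn m-component
    ensemble on validation split `fold` (a hypothetical callback)."""
    avg_ca = np.zeros(max_r + 1)
    for m in range(1, max_r + 1):
        avg_ca[m] = np.mean([accuracy_of_ensemble(m, fold)
                             for fold in range(n_folds)])
    r = int(np.argmax(avg_ca[1:]) + 1)
    return r, avg_ca[r]
```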
5.4. Experimental results
Tables 2 and 3 give the comparison between the classifiers. They show the classification accuracy (CA) for each of the classifiers as a percentage over the respective test sets. The parameter values are chosen by performing cross-validation as described in Section 5.3. Table 3 shows the CAs for the methods proposed by us. Along with the CA values, it also shows the parameter values p (the number of blocks used in the synthesis) and r (the number of components used in the ensemble). The points worth noting are as follows: (i) For large values of n (the number of original training patterns) and p, it may not be feasible to consider the entire synthetic set, which is required in the case of NNC(SP). (ii) If p = d, where d is the total number of features, each feature goes into a separate block (i.e., each block contains only one feature) and therefore only one ordering of features is possible. This means that, in the case of the ensemble of PPC-aNNC's, each component is the same and hence the CA of one component is equal to the CA of the ensemble. (iii) If p = 1, then the synthetic and original sets are the same.
Table 2
A comparison between the classifiers (showing CA (%))
Dataset NNC k-NNC NBC NNC(BS) MCNNC wNNFP
OCR 91.12 92.68 81.01 92.88 91.97 10.02
WINE 94.87 96.15 91.03 97.44 95.00 92.30
VOWEL 56.28 60.17 36.80 57.36 55.97 23.38
THYROID 93.14 94.40 83.96 94.57 92.23 92.71
GLASS 71.93 71.93 60.53 71.93 71.67 53.5
PENDIGITS 96.08 97.54 83.08 97.57 97.25 45.05
The cross-validation results for four datasets (WINE, VOWEL, GLASS and PENDIGITS) for the ensemble of PPC-aNNC's are given in Tables 4–7, respectively, which show the average CA (avg-val-CA_m) and standard deviation (val-SD_m) for various values of m (i.e., the number of components). For the remaining two datasets (OCR and THYROID), results similar to those for WINE, GLASS and PENDIGITS are observed and hence are not presented.
From the results presented, some of the observations are:

(1) The methods given by us (viz., NNC(SP) and the ensemble of PPC-aNNC's) outperform the other methods in the case of the OCR and THYROID datasets. For the remaining datasets, our methods show good performance.
Table 3
A comparison between the classifiers
Dataset      NNC(SP) (# blocks)      PPC-aNNC (# blocks)      Ensemble of PPC-aNNC's (# blocks) (# components)
OCR 93.01 (3) 84.91 (3) 94.15 (3) (45)
WINE 96.15 (2) 89.74 (2) 94.87 (2) (7)
VOWEL 56.28 (1) 43.51 (1) 46.32 (1) (33)
THYROID 97.23 (d) 94.16 (2) 94.66 (2) (49)
GLASS 71.93 (1) 60.53 (1) 67.54 (1) (7)
PENDIGITS 96.08 (1) 90.19 (1) 96.34 (1) (29)
Table 4
Cross validation results for ensemble of PPC-aNNC’s for WINE dataset
# component classifiers Number of blocks
1               2               3               d
1 95.09 (1.38) 96.08 (1.38) 95.09 (1.38) 77.45 (5.00)
7 98.03 (1.38) 99.06 (1.38) 98.03 (1.38) 77.45 (5.00)
10 98.03 (1.38) 99.02 (1.38) 98.03 (2.77) 77.45 (5.00)
20 98.03 (1.38) 98.04 (1.38) 99.01 (1.38) 77.45 (5.00)
30 98.03 (1.38) 98.04 (1.38) 99.01 (1.38) 77.45 (5.00)
40 98.03 (1.38) 98.04 (1.38) 99.01 (1.38) 77.45 (5.00)
50 99.01 (1.38) 98.04 (1.38) 99.01 (1.38) 77.45 (5.00)
Table 5
Cross validation results for ensemble of PPC-aNNC’s for VOWEL dataset
# component classifiers Number of blocks
1               2               3               d
1 82.77 (1.49) 82.57 (3.15) 79.73 (2.83) 23.86 (2.45)
10 91.86 (2.63) 88.06 (2.78) 85.22 (2.41) 23.86 (2.45)
20 91.86 (1.75) 89.58 (2.09) 85.41 (2.28) 23.86 (2.45)
30 92.23 (1.93) 90.15 (2.09) 85.98 (1.93) 23.86 (2.45)
33 93.18 (2.02) 90.72 (2.56) 85.98 (1.93) 23.86 (2.45)
40 92.42 (2.14) 89.58 (2.33) 85.98 (2.09) 23.86 (2.45)
50 92.61 (1.67) 89.96 (2.38) 85.22 (2.45) 23.86 (2.45)
Table 6
Cross validation results for ensemble of PPC-aNNC’s for GLASS dataset
# component classifiers Number of blocks
1               2               3               d
1 67.65 (2.40) 71.57 (5.00) 63.73 (6.04) 46.08 (7.34)
7 76.47 (2.40) 67.65 (6.35) 66.67 (7.34) 46.08 (7.34)
10 70.59 (2.40) 69.61 (6.04) 65.69 (7.34) 46.08 (7.34)
20 72.55 (1.39) 71.57 (5.00) 64.71 (7.20) 46.08 (7.34)
30 73.53 (0.00) 70.59 (4.80) 66.67 (7.34) 46.08 (7.34)
40 73.53 (2.40) 70.59 (4.16) 65.69 (6.04) 46.08 (7.34)
50 71.57 (1.39) 72.55 (6.04) 65.69 (6.04) 46.08 (7.34)
Table 7
Cross validation results for ensemble of PPC-aNNC's for PENDIGITS dataset
# component classifiers    Number of blocks
                           1               2               3               d
1 93.70 (0.32) 93.68 (0.42) 90.28 (0.94) 29.06 (5.00)
10 98.16 (0.26) 96.99 (0.37) 93.15 (1.19) 29.06 (5.00)
20 98.45 (0.18) 97.17 (0.30) 93.76 (1.03) 29.06 (5.00)
29 98.71 (0.14) 97.30 (0.30) 93.79 (1.13) 29.06 (5.00)
30 98.57 (0.27) 97.41 (0.21) 93.79 (1.13) 29.06 (5.00)
40 98.64 (0.14) 97.37 (0.31) 93.78 (0.99) 29.06 (5.00)
50 98.51 (0.19) 97.43 (0.36) 93.78 (1.04) 29.06 (5.00)
(2) The ensemble of PPC-aNNC's performs uniformly better than NBC, wNNFP, and PPC-aNNC, respectively, over all datasets.

(3) It is interesting to note that PPC-aNNC outperforms NBC and wNNFP over all datasets except the WINE dataset.

The actual space requirement for the PPC-tree is, on average, about 60% to 90% of that of the respective original sets for the OCR, THYROID and PENDIGITS datasets. For the other datasets, the actual space requirement is slightly more than that required for the original set. This is because, for small datasets, the data structure overhead is larger than the space saved by the sharing of nodes in the PPC-tree.
6. Conclusions
This paper presented a fusion of multiple approximate nearest neighbor classifiers having a constant ($O(1)$) classification time upper bound and good classification accuracy. Each individual classifier of the ensemble is a weak classifier which works with a synthetic set generated by the novel pattern synthesis technique called partition based pattern synthesis, which reduces the curse of dimensionality effect. Further, explicit generation of the synthetic set is avoided by doing implicit pattern synthesis within the classifier, which works directly with a compact representation of the original training set called the PPC-tree. The proposed ensemble of PPC-aNNC's with parallel implementation is a fast and efficient classifier suitable for large, high dimensional data sets. Since it has a constant classification time upper bound, it is a suitable classifier for online, real-time applications.
7. Future work
A formal explanation for the good behavior of the ensemble of PPC-aNNC's needs to be given. Next, one needs to answer questions such as 'What is a good partition for doing partition based pattern synthesis and how can it be found?' We gave a partitioning method based on pair-wise correlations between features within a class, but this takes into account only linear dependency between the features, so it can fail to capture higher-order dependencies. A general partitioning method which is also computationally efficient, and which can be used for both numerical and categorical features, needs to be found.
Acknowledgements
Research work reported here is supported in part by AOARD Grant F62562-03-P-0318. Thanks to the three anonymous reviewers for constructive comments. Special thanks to B.V. Dasarathy for prompt feedback and many suggestions during the revision that significantly improved the content of the paper.
References
[1] B.V. Dasarathy, Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques, IEEE Computer Society Press, Los Alamitos, California, 1991.
[2] R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, second ed., Wiley-Interscience, John Wiley & Sons, 2000.
[3] T. Cover, P. Hart, Nearest neighbor pattern classification, IEEE Transactions on Information Theory 13 (1) (1967) 21–27.
[4] K. Fukunaga, D. Hummels, Bias of nearest neighbor error estimates, IEEE Transactions on Pattern Analysis and Machine Intelligence 9 (1987) 103–112.
[5] G. Hughes, On the mean accuracy of statistical pattern recognizers, IEEE Transactions on Information Theory 14 (1) (1968) 55–63.
[6] A. Jain, B. Chandrasekharan, Dimensionality and sample size considerations in pattern recognition practice, in: P. Krishnaiah, L. Kanal (Eds.), Handbook of Statistics, vol. 2, North Holland, 1982, pp. 835–855.
[7] K. Fukunaga, D. Hummels, Bayes error estimation using Parzen and k-NN procedures, IEEE Transactions on Pattern Analysis and Machine Intelligence 9 (1987) 634–643.
[8] K. Fukunaga, Introduction to Statistical Pattern Recognition, second ed., Academic Press, 1990.
[9] V. Ananthanarayana, M.N. Murty, D. Subramanian, An incremental data mining algorithm for compact realization of prototypes, Pattern Recognition 34 (2001) 2249–2251.
[10] J. Han, J. Pei, Y. Yin, Mining frequent patterns without candidate generation, in: Proceedings of the ACM SIGMOD International Conference on Management of Data, Dallas, Texas, USA, 2000.
[11] T. Zhang, R. Ramakrishnan, M. Livny, BIRCH: an efficient data clustering method for very large databases, in: Proceedings of the ACM SIGMOD International Conference on Management of Data, 1996.
[12] A. Guttman, R-trees: a dynamic index structure for spatial searching, in: Proceedings of the ACM SIGMOD International Conference on Management of Data, 1984, pp. 47–57.
[13] B. Efron, Bootstrap methods: another look at the jackknife, Annals of Statistics 7 (1979) 1–26.
[14] A. Jain, R. Dubes, C. Chen, Bootstrap technique for error estimation, IEEE Transactions on Pattern Analysis and Machine Intelligence 9 (1987) 628–633.
[15] M. Chernick, V. Murthy, C. Nealy, Application of bootstrap and other resampling techniques: evaluation of classifier performance, Pattern Recognition Letters 3 (1985) 167–178.
[16] S. Weiss, Small sample error rate estimation for k-NN classifiers, IEEE Transactions on Pattern Analysis and Machine Intelligence 13 (1991) 285–289.
[17] D. Hand, Recent advances in error rate estimation, Pattern Recognition Letters 4 (1986) 335–346.
[18] Y. Hamamoto, S. Uchimura, S. Tomita, A bootstrap technique for nearest neighbor classifier design, IEEE Transactions on Pattern Analysis and Machine Intelligence 19 (1) (1997) 73–79.
[19] L. Breiman, Bagging predictors, Machine Learning 24 (1996) 123–140.
[20] D.B. Skalak, Prototype Selection for Composite Nearest Neighbor Classifiers, Ph.D. Thesis, Department of Computer Science, University of Massachusetts Amherst, 1997.
[21] E. Alpaydin, Voting over multiple condensed nearest neighbors, Artificial Intelligence Review 11 (1997) 115–132.
[22] M. Kubat, W.K. Chen, Weighted projection in nearest-neighbor classifiers, in: Proceedings of the First Southern Symposium on Computing, The University of Southern Mississippi, December 4–5, 1998.
[23] S.D. Bay, Combining nearest neighbor classifiers through multiple feature subsets, Intelligent Data Analysis 3 (3) (1999) 191–209.
[24] P.M. Murphy, UCI Repository of Machine Learning Databases [http://www.ics.uci.edu/mlearn/MLRepository.html], Department of Information and Computer Science, University of California, Irvine, CA, 1994.
[25] T.R. Babu, M.N. Murty, Comparison of genetic algorithms based prototype selection schemes, Pattern Recognition 34 (2001) 523–525.
250 P. Viswanath et al. / Information Fusion 5 (2004) 239–250
... According to the literature [1][2][3]application of the classifier combination to solving the practical tasks allows to improve the classification accuracy. The combined decision is supposed to be better (more accurate, more reliable) than the classification decision of the best individual classifier. ...
... However, this approach is inefficient for the high-dimensional feature space. In [2] the heuristic algorithm is applied for the partition of the feature set into several uncorrelated subsets, which because of being locally optimal doesn't guarantee the best result. In this paper we present novel approaches to evolutionary design of the classifier ensembles . ...
Article
Full-text available
This paper 1 presents two novel approaches to evolutionary design of the classifier ensemble. The first one presents the task of one-objective optimization of feature set partitioning together with feature weighting for the construction of the individual classifiers. The second approach deals with multi-objective optimization of classifier ensemble design. The proposed approaches have been tested on two data sets from the machine learning repository and one real data set on transient ischemic attack. The experiments show the advantages of the feature weighting in terms of classification accuracy when dealing with multivariate data sets and the possibility in one run of multi-objective genetic algorithm to get the non-dominated ensembles of different sizes and thereby skip the tedious process of iterative search for the best ensemble of fixed size.
... Very few studies were reported in literature regarding artificial pattern generation. V i s w a n a t h et al. [10,11] proposed a pattern synthesis approach for efficient nearest neighbor classification. A g r a w a l et al. [12] applied prototyping as an intermediate step in the synthetic pattern generation technique to reduce classification time of K nearest neighbour classifier. ...
... • The proposed method is suitable for the datasets having high dimensionality, but not very high dimensionality, as the computational time and the memory resources for finding the correlation (used for partitioning the features) between the features of the data increases with dimensionality. • The experimental results were in good agreement with the results reported by V i s w a n a t h et al. [10,11] on pattern synthesis for nearest neighbour classification. • The figures showed the variation of CA% with variation in the number of the nearest neighbors and demonstrated the profound effect of smoothing of the training patterns on the performance of the SVM classifier. ...
Article
Full-text available
Support Vector Machines (SVMs) have gained prominence because of their high generalization ability for a wide range of applications. However, the size of the training data that it requires to achieve a commendable performance becomes extremely large with increasing dimensionality using RBF and polynomial kernels. Synthesizing new training patterns curbs this effect. In this paper, we propose a novel multiple kernel learning approach to generate a synthetic training set which is larger than the original training set. This method is evaluated on seven of the benchmark datasets and experimental studies showed that SVM classifier trained with synthetic patterns has demonstrated superior performance over the traditional SVM classifier.
... After registration and label propagation for each of the K available atlases, a final unique segmentation (x) ∈ Λ is obtained by merging the information provided by each of the individual atlas-based segmentations into a single segmented image. This approach has been shown to be more accurate than an individual atlas segmentation [316] in the same manner as a combination of classifiers is generally more accurate than a single classifier in many pattern recognition scenarios [325][326][327][328][329][330]. ...
Article
Multimodal fusion in neuroimaging combines data from multiple imaging modalities to overcome the fundamental limitations of individual modalities. Neuroimaging fusion can achieve higher temporal and spatial resolution, enhance contrast, correct imaging distortions, and bridge physiological and cognitive information. In this study, we analyzed over 450 references from PubMed, Google Scholar, IEEE, ScienceDirect, Web of Science, and various sources published from 1978 to 2020. We provide a review that encompasses (1) an overview of current challenges in multimodal fusion (2) the current medical applications of fusion for specific neurological diseases, (3) strengths and limitations of available imaging modalities, (4) fundamental fusion rules, (5) fusion quality assessment methods, and (6) the applications of fusion for atlas-based segmentation and quantification. Overall, multimodal fusion shows significant benefits in clinical diagnosis and neuroscience research. Widespread education and further research amongst engineers, researchers and clinicians will benefit the field of multimodal neuroimaging.
... The proposed method is applied on OCR Handwritten digit data [20] Figure 3. Among 3330 patterns 98.6% of the patterns are recognized correctly. ...
Article
Full-text available
A method has been proposed to classify handwritten Arabic numerals in its compressed form using partitioning approach, Leader algorithm and Neural network. Handwritten numerals are represented in a matrix form. Compressing the matrix representation by merging adjacent pair of rows using logical OR operation reduces its size in half. Considering each row as a partitioned portion , clusters are formed for same partition of same digit separately. Leaders of clusters of partitions are used to recognize the patterns by Divide and Conquer approach using proposed ensemble neural network. Experimental results show that the proposed method recognize the patterns accurately.
... Since we search for similar subgraphs in the query graph, we have also done an extensive literature survey on isomorphic graph detection. Partitioned Pattern Count (PPC) trees [15] and the divide-and-conquer based split search algorithm in feature trees [10] are examples of similar subtree detection done in a top-down way. [17] adopts a bottom-up, greedy way of subquery identification and aims at identifying only the largest similar subgraphs within a query graph. ...
Article
In this paper, we revisit the problem of query optimization in relational DBMS. We propose a scheme to reduce the search space of Dynamic Programming based on reuse of query plans among similar subqueries. The method generates the cover set of similar subgraphs present in the query graph and allows their corresponding subqueries to share query plans among themselves in the search space. Numerous variants of this scheme have been developed for enhanced memory efficiency. Our implementation and experimental study in PostgreSQL show that one of the schemes is better suited to improve the performance of (Iterative) Dynamic Programming.
Chapter
In the process of finding novel patterns, algorithms for mining large datasets face a number of issues. We discuss the issues related to efficiency in data mining. We elaborate on important data mining tasks such as clustering, classification, and association rule mining that are relevant to the content of the book. We discuss popular and representative algorithms of partitional and hierarchical data clustering. In classification, we discuss the nearest-neighbor classifier and the support vector machine; we use both these algorithms extensively in the book. We provide an elaborate discussion on issues in mining large datasets and possible solutions, and we discuss each possible direction in detail. The discussion on clustering includes topics such as incremental clustering with a focus on the leader and BIRCH clustering algorithms, divide-and-conquer clustering algorithms, and clustering based on intermediate representation. The discussion on classification includes topics such as incremental classification and classification based on intermediate abstraction. We further discuss frequent-itemset mining in two directions: divide-and-conquer itemset mining and intermediate abstraction for frequent-itemset mining. Bibliographic notes contain a brief discussion of the significant research contributions in each of the directions discussed in the chapter, along with literature for further study.
Conference Paper
This article describes the structure and functional content of a software system, developed by the authors, for classification of different types of data based on ensemble algorithms. The article also describes the heterogeneous ensemble classification algorithm implemented in the software system. A distinctive feature of the algorithm is the iterative use of single (basic) classifiers on the initial training sample and the inclusion in the ensemble of only those classifiers whose relative error does not exceed a predetermined threshold. The software system was tested on real medical data, and the classification accuracy obtained using the basic classifiers and the heterogeneous ensemble algorithm was compared. Test results showed the effectiveness (increased classification accuracy) of the heterogeneous ensemble algorithm compared with the single classifiers.
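The selection rule described above, retaining a base classifier only if its relative error on the training sample stays below a fixed threshold, might be sketched as follows. The particular scikit-learn base learners, the error measure, and the majority-vote combination are assumptions made for illustration, not details taken from the cited system.

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

def build_heterogeneous_ensemble(X, y, error_threshold=0.2):
    # Fit several different base classifiers on the same training sample and
    # keep only those whose training error does not exceed the threshold.
    candidates = [DecisionTreeClassifier(max_depth=5),
                  KNeighborsClassifier(n_neighbors=3),
                  GaussianNB()]
    ensemble = []
    for clf in candidates:
        clf.fit(X, y)
        error = 1.0 - clf.score(X, y)
        if error <= error_threshold:
            ensemble.append(clf)
    return ensemble

def ensemble_predict(ensemble, X):
    # Combine the retained classifiers by simple majority vote
    # (assumes non-negative integer class labels).
    votes = np.array([clf.predict(X) for clf in ensemble])   # (n_clf, n_samples)
    return np.array([np.bincount(col).argmax() for col in votes.T])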
Conference Paper
To improve the precision and generalization of the ensemble model and the leaching model, a novel selective hierarchical ensemble modeling approach is proposed for leaching rate prediction in this paper. Unlike previous selective ensemble models, the new selective ensemble model is hierarchical: it considers not only the combination of sub-models but also their generation. First, a new multi-model ensemble hybrid model (MEHM) based on the bagging algorithm is proposed. In this model, the sub-models are composed of a data model and a mechanism model, and the data model generates training subsets using the proposed based vector bootstrap sampling algorithm. Afterwards, a new selective multi-model ensemble hybrid model (NSMEHM) based on the binary particle swarm optimization (PSO) algorithm is presented. In this model, the binary PSO algorithm is used to find a group of MEHMs that minimizes the error and maximizes the diversity. Experimental results indicate that the proposed NSMEHM has better prediction performance than the other models.
Article
Generally, the curse of dimensionality leads to a large bias in NNC. In this paper, multiple real-valued NNCs based on different feature subsets are combined by a fuzzy integral so that the bias of NNC in high dimensionality is minimized; the resulting classifier is called FI-MRNNC. In FI-MRNNC, the feature set is partitioned into several low-dimensionality feature subsets, and a fuzzy measure is used to quantify the importance of each feature subset and the interaction between feature subsets in the decision-making process. Guided by the FI-MRNNC's classification accuracy, a GA not only partitions the feature set into several feature subsets but also defines a density value for each corresponding feature subset. Experimental results on some UCI databases illustrate that FI-MRNNC can reduce the bias of NNC, especially in high dimensionality.
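As a rough illustration of combining NN classifiers built on separate feature subsets, the sketch below uses a plain weighted vote, with per-subset weights standing in for the fuzzy densities; the fuzzy integral itself and the GA-based partitioning of the cited paper are not reproduced here.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def subset_nn_ensemble(X, y, feature_subsets, weights, X_test):
    # Train one 1-NN classifier per feature subset and combine them with a
    # weighted vote. The weights play the role of the per-subset importance
    # values (a simplification of the fuzzy measure used in FI-MRNNC).
    n_classes = int(y.max()) + 1
    scores = np.zeros((len(X_test), n_classes))
    for subset, w in zip(feature_subsets, weights):
        clf = KNeighborsClassifier(n_neighbors=1).fit(X[:, subset], y)
        pred = clf.predict(X_test[:, subset])
        scores[np.arange(len(X_test)), pred] += w
    return scores.argmax(axis=1)

# Toy usage: 8 features split into two subsets with unequal importance.
X = np.random.randn(200, 8)
y = (X[:, 0] + X[:, 4] > 0).astype(int)
predictions = subset_nn_ensemble(X, y, [[0, 1, 2, 3], [4, 5, 6, 7]], [0.6, 0.4], X[:20])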
Article
This study examines rain occurrence from passive microwave imagery during typhoons. The dataset consists of 53 typhoons affecting the watershed over 2001-2008. The study employs the nearest neighbor search (NNS) classifier, which is often used for diagnosing forecast problems; the multilayer perceptron (MLP) and logistic regression (LR) are selected as benchmarks. The results show that, for rain/non-rain discrimination, the best performing classifier is NNS according to the AUC measures, and that the use of NNS can effectively improve the AUC measures for diagnosing rain occurrence. Overall, NNS is a relatively effective algorithm compared to other classifiers.
Article
Combining multiple classifiers is an effective technique for improving accuracy. There are many general combining algorithms, such as Bagging, Boosting, or Error Correcting Output Coding, that significantly improve classifiers like decision trees, rule learners, or neural networks. Unfortunately, these combining methods do not improve the nearest neighbor classifier. In this paper, we present MFS, a combining algorithm designed to improve the accuracy of the nearest neighbor (NN) classifier. MFS combines multiple NN classifiers each using only a random subset of features. The experimental results are encouraging: On 25 datasets from the UCI Repository, MFS significantly outperformed several standard NN variants and was competitive with boosted decision trees. In additional experiments, we show that MFS is robust to irrelevant features, and is able to reduce both bias and variance components of error. Keywords: multiple models, combining classifiers, nearest neighbor, feature selection,...
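The MFS idea, several NN classifiers each restricted to a random subset of the features and combined by voting, can be sketched as follows. The subset size, the number of ensemble members, and the scikit-learn 1-NN implementation are illustrative choices rather than the exact settings of the cited paper.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def mfs_fit(X, y, n_members=10, subset_size=None, seed=0):
    # Fit n_members 1-NN classifiers, each on a random feature subset
    # (sampled here without replacement; MFS also allows sampling with replacement).
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    subset_size = subset_size or max(1, d // 2)
    members = []
    for _ in range(n_members):
        subset = rng.choice(d, size=subset_size, replace=False)
        clf = KNeighborsClassifier(n_neighbors=1).fit(X[:, subset], y)
        members.append((subset, clf))
    return members

def mfs_predict(members, X_test):
    # Majority vote over the member predictions (non-negative integer labels).
    votes = np.array([clf.predict(X_test[:, subset]) for subset, clf in members])
    return np.array([np.bincount(col).argmax() for col in votes.T])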
Article
This chapter discusses the role that the relationship between the number of measurements and the number of training patterns plays at various stages in the design of a pattern recognition system. The designer of a pattern recognition system should make every possible effort to obtain as many samples as possible. As the number of samples increases, not only does the designer have more confidence in the performance of the classifier, but also more measurements can be incorporated in the design of the classifier without the fear of peaking in its performance. However, there are many pattern classification problems where either the number of samples is limited or obtaining a large number of samples is extremely expensive. If the designer chooses to take the optimal Bayesian approach, the average performance of the classifier improves monotonically as the number of measurements is increased. Most practical pattern recognition systems employ a non-Bayesian decision rule because the use of optimal Bayesian approach requires knowledge of prior densities, and besides, their complexity precludes the development of real-time recognition systems. The peaking behavior of practical classifiers is caused principally by their nonoptimal use of measurements.