Large Margin Classifier Based on Hyperdisks
Hakan Cevikalp
Electrical and Electronics Engineering Department
Machine Learning and Computer Vision Laboratory, Eskisehir Osmangazi University
Meselik, Eskisehir, 26480 Turkey. Email:hakan.cevikalp@gmail.com
Abstract—This paper introduces a binary large margin classifier that approximates each class with a hyperdisk constructed from its training samples. For any pair of classes approximated with hyperdisks, there is a corresponding linear separating hyperplane that maximizes the margin between them, and it can be found by solving a convex program that finds the closest pair of points on the hyperdisks. More precisely, the best separating hyperplane is chosen to be the one that is orthogonal to the line segment connecting the closest points on the hyperdisks and at the same time bisects that segment. The method is extended to the nonlinear case by using the kernel trick, and multi-class classification problems are dealt with by constructing and combining several binary classifiers, as in the Support Vector Machine (SVM) classifier. Experiments on several databases show that the proposed method compares favorably to other popular large margin classifiers.
Keywords-classification; convex hull; hyperdisk; kernel methods; large margin classifier; quadratic programming; support vector machines.
I. INTRODUCTION
Large margin classifiers have recently enjoyed increased
attention due to their successful applications in various fields
such as computer vision (visual object classification and detec-
tion), text classification, biometrics, and genetic microarrays
[1,2,3,4]. The most popular large margin classifier, the Support Vector Machine (SVM) [4], is a binary classification method that simultaneously minimizes the empirical classification error and maximizes the geometric margin, which
is defined to be the distance between the best separating
hyperplane and closest samples from the classes. If the classes
are not linearly separable in the original input space, the
data samples are mapped onto a higher-dimensional space
where they become linearly separable, and the best separat-
ing hyperplane is constructed in the mapped space. Finding
such an hyperplane involves the minimization of a convex
quadratic function subject to linear inequality constraints,
and the quadratic optimization problem can be efficiently
solved using sequential minimal optimization [5, 6, 7] or using
minimum enclosing balls [8]. The solution of the quadratic
programming problem leads to a sparse solution that enables
us to evaluate the decision function by using a small number
of samples in the vicinity of the class decision boundaries
(more precisely, the training samples that lie either on the margin or on the "wrong" side of it), called the support vectors. Therefore, the SVM classifier is relatively fast compared to other classification algorithms, which makes it attractive for most pattern classification tasks.
From a geometrical point of view, in the linearly separable case,
SVM classifier approximates each class with a convex hull
and finds the closest points in these hulls [9,10]. Then these
two closest points are connected with a line segment. The
separating hyperplane, that is orthogonal to the line segment
and at the same time bisects the line, is chosen to be the best
separating hyperplane. In other words, the two closest points
on the convex hulls determine the separating hyperplane, and
the SVM margin is merely equivalent to the minimum distance
between the convex hulls that represent classes. However,
convex hulls may not be the best models for approximating classes, especially in high-dimensional spaces, because convex hull approximations tend to be unrealistically tight there: the classes typically extend beyond the convex hulls of their training samples. (It should be noted that even if the original dimensionality of the input space is low, the data samples are mapped to a higher-dimensional feature space through a kernel mapping during the estimation of the nonlinear decision boundaries with SVMs.) For example, for classes that are ellipsoids or boxes in high dimensions and for any placement of any number of samples sub-exponential in the dimension, the volume of their convex hull is exponentially smaller than the volume of the real class region. Similarly, for Gaussians, the convex hull of any probable placement of a sub-exponential number of samples contains exponentially little probability mass. Other
alternative models to the convex hulls may be affine hulls,
hyperspheres, hyperdisks, and hyperellipsoids, which are all
convex geometric models: Affine hulls are linear subspaces
that have been shifted to pass through the centroids of the
classes. The hyperdisk model of a class is the intersection of
the affine hull and the smallest bounding hypersphere of its
training samples [11]. Hyperellipsoids on the other hand are
characterized by the covariance matrix of the class samples
as well as their mean. Different studies [12,11,13,1,14,15]
show that when such convex models are used in “nearest
convex model” type classifiers, convex hulls of samples are
often outperformed by simpler convex models such as affine
hulls or hyperdisks. These results are not surprising due to the
fact that high-dimensional approximations tend to be simple:
For a fixed sample size, the amount of geometric details that
can be resolved usually decreases rapidly as the dimensionality
increases. Note that the methods we just cited are “nearest
convex model” type of classifiers rather than a “large margin
classifier between the convex models”.
This paper introduces a new binary large margin clas-
sifier between the convex models (rather than the nearest
convex model classifier) that approximates each class with
a hyperdisk model. One motivation for replacing nearest-
convex-model approaches with margin-based ones is that for
the nearest convex model classifier, the pairwise decision
boundaries (surfaces equidistant from the two convex models)
are generically at least quadratic or piecewise quadratic in
complexity. Such decision boundaries are more flexible than
linear ones, but in high dimensions when the training data is
scarce this may lead to overfitting, thus damaging general-
ization to unseen examples. Linear margin based approaches
have fewer degrees of freedom, so they are typically less
sensitive to the precise arrangement of the training samples.
For example for an SVM classifier, motions of the SVM
support vectors parallel to the SVM decision surface do not
alter the margin and hence do not invalidate the classifier
(although they might allow an even better one to be found),
whereas they do typically change the piecewise quadratic
decision surface of the equivalent nearest-convex-hull classifier.
The best separating hyperplane between hyperdisks is chosen
to be the one that maximizes the distance between them.
Finding such a hyperplane was first discussed in [3], and a solution based on a linear system and a 2D Newton root-finding process was given there for linearly separable data. Here this problem is formulated as a quadratically constrained quadratic optimization problem, and it is also extended to the nonlinear case by using the kernel trick. To handle multi-class problems, we construct several binary classifiers and combine them by using different techniques (e.g., one-against-one, one-against-rest, etc.) as in SVM.
The rest of the paper is organized as follows: In Section
2, we introduce the proposed method. Section 3 describes the
experimental results. Concluding remarks are given in Section
4.
II. METHOD
A. Motivation and Problem Setting
Consider a binary classification problem with the training data given in the form $\{x_i, y_i\}$, $i = 1, \ldots, n$, where $y_i \in \{-1, +1\}$ and $x_i \in \mathbb{R}^d$. The most popular large margin classifier, SVM, finds
a separating hyperplane that maximizes the margin, which is
defined as the distance between the hyperplane and closest
samples from the classes. To do so, SVM first approximates
each class with a convex hull [9]. A convex hull consists of
all points that can be written as a linear combination of data
points where all coefficients are nonnegative and sum up to
1. More formally, the convex hull of samples $\{x_{ci}\}_{i=1,\ldots,n_c}$ of class $c$ can be written as
$$H_c^{\mathrm{convex}} = \left\{ x = \sum_{i=1}^{n_c} \alpha_i x_{ci} \;\Big|\; \sum_{i=1}^{n_c} \alpha_i = 1, \; \alpha_i \geq 0 \right\}. \qquad (1)$$
Following this approximation, SVM finds the closest points
in these convex hulls. Then, these two points are connected
with a line segment. The plane, orthogonal to the line segment,
that bisects the line is selected to be the separating hyperplane
[9,10]. The convex hull model is the tightest possible convex
approximation to the class samples, and for classes with
more general convex forms, it is typically a substantial under-
approximation.
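To make this geometric view concrete, the closest-points problem between the two convex hulls can be written directly as a small quadratic program. The sketch below is our illustration in Python using numpy and cvxpy, which are assumptions rather than tools mentioned in the paper; it implements the convex-hull formulation of [9,10], not the hyperdisk method proposed here.

```python
import numpy as np
import cvxpy as cp

def closest_points_convex_hulls(Xp, Xm):
    """Closest pair of points on the convex hulls (1) of two classes.
    Xp and Xm are d x n_+ and d x n_- matrices of training samples.
    The SVM separating hyperplane perpendicularly bisects the segment
    joining the two returned points."""
    ap = cp.Variable(Xp.shape[1], nonneg=True)   # convex combination weights
    am = cp.Variable(Xm.shape[1], nonneg=True)
    prob = cp.Problem(cp.Minimize(cp.sum_squares(Xp @ ap - Xm @ am)),
                      [cp.sum(ap) == 1, cp.sum(am) == 1])
    prob.solve()
    return Xp @ ap.value, Xm @ am.value
```

Replacing the nonnegativity constraints with the hyperdisk constraints developed below yields the proposed classifier.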
The large margin classifier using affine hulls on the other
hand approximates each class with an affine hull [1]. An affine
hull of a class $c$ is the smallest affine subspace containing the class samples, and the affine hull of samples $\{x_{ci}\}_{i=1,\ldots,n_c}$ can be written as
$$H_c^{\mathrm{affine}} = \left\{ x = \sum_{i=1}^{n_c} \alpha_i x_{ci} \;\Big|\; \sum_{i=1}^{n_c} \alpha_i = 1 \right\}. \qquad (2)$$
This is an unbounded and hence typically rather loose model
for a class in contrast to the convex hull approximation. Affine
hulls surprisingly often work better than convex hulls, especially in high-dimensional spaces with a limited number of samples [12,1,3]. However, one may have problems with affine hull models
if the classes have similar or intersecting affine hulls, but very
different distributions of samples within their affine hulls.
The hyperdisk is a model between convex and affine hulls,
and it captures the best aspects of each model. The hyperdisk
of a class is the intersection of the affine hull and the smallest
bounding hypersphere of its training samples as illustrated
in Fig. 1, and it maintains the stability of the affine hull
and hypersphere methods while providing better localiza-
tion of the training samples and hence potentially a better
discrimination. The hyperdisk of a class consists of affine combinations of the class samples as before, with an additional constraint $\|\sum_{i=1}^{n_c} \alpha_i x_{ci} - s_c\|^2 \leq r_c^2$. Thus, the hyperdisk of a class can be written as
$$H_c^{\mathrm{disk}} = \left\{ x = \sum_{i=1}^{n_c} \alpha_i x_{ci} \;\Big|\; \sum_{i=1}^{n_c} \alpha_i = 1, \; \Big\| \sum_{i=1}^{n_c} \alpha_i x_{ci} - s_c \Big\|^2 \leq r_c^2 \right\}. \qquad (3)$$
Here, $s_c$ is the center of the bounding hypersphere and $r_c$ is its radius. These hypersphere parameters can be found by solving the following quadratic program [16]
$$\min_{s_c, r_c} \; \left( r_c^2 + \gamma \sum_i \xi_i \right) \quad \text{s.t.} \quad \|x_{ci} - s_c\|^2 \leq r_c^2 + \xi_i, \quad i = 1, \ldots, n_c, \qquad (4)$$
or its dual
$$\min_{\alpha} \; \sum_{i,j} \alpha_i \alpha_j \langle x_{ci}, x_{cj} \rangle - \sum_i \alpha_i \langle x_{ci}, x_{ci} \rangle \quad \text{s.t.} \quad \sum_i \alpha_i = 1, \; 0 \leq \alpha_i \leq \gamma, \quad i, j = 1, \ldots, n_c, \qquad (5)$$
where $\langle \cdot, \cdot \rangle$ denotes the inner product between samples. Here the $\alpha_i$ are Lagrange multipliers and $\gamma \in [0, 1]$ is a ceiling parameter that can be set to a finite value to eliminate overdistant points as outliers. Given the solution, the center of the hypersphere is $s_c = \sum_i \alpha_i x_{ci}$ and the radius is $r_c = \|x_{ci} - s_c\|$ for any $x_{ci}$ with $0 < \alpha_i < \gamma$.
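For illustration, the hypersphere fit can be obtained by handing the dual (5) to an off-the-shelf convex solver. The following sketch uses Python with numpy and cvxpy as an assumed toolchain (the experiments in this paper rely on MOSEK instead), so treat it only as a minimal reference implementation of (5).

```python
import numpy as np
import cvxpy as cp

def bounding_hypersphere(X, gamma=1.0):
    """Smallest bounding hypersphere of the columns of X (d x n_c) via the
    dual QP (5).  Returns the center s_c and radius r_c.  Setting gamma < 1
    softens the sphere so that overdistant points can be treated as outliers."""
    n = X.shape[1]
    norms_sq = np.sum(X ** 2, axis=0)                  # <x_ci, x_ci> terms
    alpha = cp.Variable(n)
    objective = cp.Minimize(cp.sum_squares(X @ alpha)  # sum_ij a_i a_j <x_ci, x_cj>
                            - cp.sum(cp.multiply(norms_sq, alpha)))
    constraints = [cp.sum(alpha) == 1, alpha >= 0, alpha <= gamma]
    cp.Problem(objective, constraints).solve()
    a = alpha.value
    s_c = X @ a                                        # center: s_c = sum_i alpha_i x_ci
    # radius: distance to any sample whose multiplier is strictly between 0 and gamma
    idx = np.where((a > 1e-6) & (a < gamma - 1e-6))[0]
    i = idx[0] if idx.size else int(np.argmax(a))
    r_c = np.linalg.norm(X[:, i] - s_c)
    return s_c, r_c
```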
Our goal is to find the linear separating hyperplane that
yields the maximum margin between hyperdisks of classes.
Fig. 1. Hyperdisk model of a class is the intersection of affine hull and
bounding hypersphere of class samples.
The points $x$ which lie on the separating hyperplane satisfy $\langle w, x \rangle + b = 0$, where $w$ is the normal of the separating hyperplane, $|b|/\|w\|$ is the perpendicular distance from the hyperplane to the origin, and $\|w\|$ is the Euclidean norm of $w$. For any separating hyperplane, all points $x_i$ in the positive class satisfy $\langle w, x_i \rangle + b > 0$ and all points $x_i$ in the negative class satisfy $\langle w, x_i \rangle + b < 0$, so that $y_i(\langle w, x_i \rangle + b) > 0$ for all training data points. Finding the best separating hyperplane maximizing the margin between hyperdisks can be solved by computing the closest points on them. The optimal separating hyperplane will be the one that perpendicularly bisects the line segment connecting the closest points, as in SVM. The offset (also called the threshold), $b$, can be chosen as the distance from the origin to the point halfway between the closest points along the normal $w$. Once the best separating hyperplane is determined, a new sample $x_{\mathrm{test}}$ is classified based on the decision function $f(x_{\mathrm{test}}) = \mathrm{sign}(\langle w, x_{\mathrm{test}} \rangle + b)$.
B. Formulation Based on Quadratically Constrained
Quadratic Optimization
In this setup, we formulate finding the closest points on the hyperdisks as a quadratically constrained quadratic optimization (QCQP) problem. Now let $X_+$ and $X_-$ denote the matrices whose columns are the samples belonging to the positive and negative classes, respectively. We first compute the hypersphere center and radius for both classes. Then, finding the closest points on the hyperdisks of the classes can be written as the following optimization problem
$$\begin{aligned} \min_{\alpha_+, \alpha_-} \;& \|X_+ \alpha_+ - X_- \alpha_-\|^2 \\ \text{s.t.} \;& \sum_i \alpha_{+i} = 1, \quad \sum_j \alpha_{-j} = 1, \quad i = 1, \ldots, n_+, \; j = 1, \ldots, n_-, \\ & \|X_+ \alpha_+ - s_+\|^2 \leq r_+^2, \quad \|X_- \alpha_- - s_-\|^2 \leq r_-^2. \end{aligned} \qquad (6)$$
If we let $\alpha = [\alpha_+^\top \; \alpha_-^\top]^\top$, $y$ be a column vector of the combined labels, and $e$ be a column vector of ones of appropriate dimension, the optimization problem can be written as
$$\begin{aligned} \min_{\alpha} \;& \alpha^\top G \alpha \\ \text{s.t.} \;& \alpha_+^\top e_+ = 1, \quad \alpha_-^\top e_- = 1, \\ & \alpha_+^\top G_+ \alpha_+ - 2\alpha_+^\top X_+^\top s_+ + s_+^\top s_+ \leq r_+^2, \\ & \alpha_-^\top G_- \alpha_- - 2\alpha_-^\top X_-^\top s_- + s_-^\top s_- \leq r_-^2, \end{aligned} \qquad (7)$$
where $G = (yy^\top) \circ \left([X_+ \; X_-]^\top [X_+ \; X_-]\right)$, $G_+ = X_+^\top X_+$, and $G_- = X_-^\top X_-$. Here $\circ$ denotes the element-wise (Hadamard) multiplication of matrices. This is a quadratically constrained quadratic programming problem, and it is convex since the Hessian matrix $G$ of the objective function and the other two Hessian matrices $G_+$ and $G_-$ of the constraints are positive semi-definite.
QCQP problems can be transformed into semi-definite programs (SDPs), that is, optimization problems over the intersection of an affine set and the cone of positive semi-definite matrices [17]. The CVX software (http://cvxr.com/cvx/) uses this approach. However, in our simulations with synthetic data, CVX sometimes failed to find a solution or returned a wrong solution. Therefore we used the MOSEK software (http://www.mosek.com/) in the experiments since it always successfully returned correct solutions on the simulated data. MOSEK transforms the QCQP problem into a second-order cone programming (SOCP) problem, and SOCP problems can be solved in polynomial time by interior point methods more efficiently than SDPs [18]. Recently, more efficient algorithms have been introduced for solving QCQP problems [19,20].
Given the optimal $\alpha = [\alpha_+^\top \; \alpha_-^\top]^\top$, the closest pair of points on the two disks and the normal of the maximum-margin separating hyperplane can be found by using the following equation
$$w = \frac{1}{2}(x_+ - x_-) = \frac{1}{2}(X_+ \alpha_+ - X_- \alpha_-), \qquad (8)$$
where $x_+$ and $x_-$ denote the closest points on the hyperdisks of the positive and negative classes, respectively. The offset $b$ of the separating hyperplane will be
$$b = -\frac{1}{2} w^\top (x_+ + x_-). \qquad (9)$$
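A compact way to prototype the whole construction is to state problem (6) directly in a modeling language and then apply (8) and (9). The sketch below again assumes numpy and cvxpy; it is not the MOSEK-based implementation used for the experiments, only an illustration of the formulation.

```python
import numpy as np
import cvxpy as cp

def hyperdisk_separator(Xp, Xm, sp, rp, sm, rm):
    """Closest points on two hyperdisks, problem (6), and the resulting
    maximum-margin hyperplane (w, b) from (8)-(9).  Xp and Xm are d x n_+
    and d x n_- sample matrices; (sp, rp) and (sm, rm) are the hypersphere
    parameters found beforehand, e.g. with bounding_hypersphere above."""
    ap = cp.Variable(Xp.shape[1])
    am = cp.Variable(Xm.shape[1])
    constraints = [cp.sum(ap) == 1, cp.sum(am) == 1,
                   cp.sum_squares(Xp @ ap - sp) <= rp ** 2,
                   cp.sum_squares(Xm @ am - sm) <= rm ** 2]
    cp.Problem(cp.Minimize(cp.sum_squares(Xp @ ap - Xm @ am)), constraints).solve()
    x_plus, x_minus = Xp @ ap.value, Xm @ am.value   # closest pair of points
    w = 0.5 * (x_plus - x_minus)                      # Eq. (8)
    b = -0.5 * w @ (x_plus + x_minus)                 # Eq. (9): hyperplane bisects the segment
    return w, b

# A new sample x is then labeled by sign(w @ x + b).
```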
If the hyperdisks are close to being linearly separable and they overlap only because of a few outliers, there are several ways to overcome this problem. Firstly, the ceiling parameter $\gamma$ can be set to a value smaller than 1 to find a smaller, more compact hypersphere that does not include the outliers. If this does not resolve the overlap between the hyperdisks, we can introduce lower and upper bounds on the Lagrange coefficients $\alpha_i$ in (7) to shrink the hyperdisks so that they no longer overlap, as in the reduced convex hulls introduced in [9].
In the case of linearly inseparable hyperdisks, we can map the data to a higher-dimensional space where the linear hyperdisks constructed in the mapped space become separable, by using the kernel trick. Extension of the QCQP algorithm to the nonlinear case is easy. Note that the objective function of (7) can be written in terms of the dot products of samples, which allows the use of the kernel trick, i.e., replacing the inner product $\langle x_i, x_j \rangle$ with the kernel function $k(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$.
Now let $\Phi_+ = [\phi(x_1^+), \ldots, \phi(x_{n_+}^+)]$ and $\Phi_- = [\phi(x_1^-), \ldots, \phi(x_{n_-}^-)]$ be the matrices whose columns are the mapped samples belonging to the positive and negative classes, respectively. In the nonlinear case, the hypersphere center of each class (consider the positive class, for example) is also expressed in terms of the mapped samples, i.e., $\phi(s_+) = \Phi_+^s \beta_+$, where $\Phi_+^s = [\phi(x_1^+), \ldots, \phi(x_{l_+}^+)]$ is the matrix whose columns are the mapped samples associated with the nonzero coefficients returned by the hypersphere algorithm, and $\beta_+$ is the vector of nonzero coefficients. Note that $l_+ \leq n_+$. Then, the optimization problem becomes
$$\begin{aligned} \min_{\alpha} \;& \alpha^\top G \alpha \\ \text{s.t.} \;& \alpha_+^\top e_+ = 1, \quad \alpha_-^\top e_- = 1, \\ & \alpha_+^\top K_+ \alpha_+ - 2\alpha_+^\top K_+^s \beta_+ + \beta_+^\top K_s^+ \beta_+ \leq r_+^2, \\ & \alpha_-^\top K_- \alpha_- - 2\alpha_-^\top K_-^s \beta_- + \beta_-^\top K_s^- \beta_- \leq r_-^2, \end{aligned} \qquad (10)$$
where $G = (yy^\top) \circ \left([\Phi_+ \; \Phi_-]^\top [\Phi_+ \; \Phi_-]\right) = (yy^\top) \circ K$, $K_+ = \Phi_+^\top \Phi_+$, $K_- = \Phi_-^\top \Phi_-$, $K_+^s = \Phi_+^\top \Phi_+^s$, $K_-^s = \Phi_-^\top \Phi_-^s$, $K_s^+ = (\Phi_+^s)^\top \Phi_+^s$, and $K_s^- = (\Phi_-^s)^\top \Phi_-^s$. Note that all of the kernel matrices $K$, $K_+$, $K_-$, $K_+^s$, $K_-^s$, $K_s^+$, and $K_s^-$ can be easily computed by using the kernel function $k(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$. Given the solution $\alpha$, the normal of the separating hyperplane is $w = \frac{1}{2} \sum_{i=1}^{n} \alpha_i y_i \phi(x_i)$.
The bias $b$ can be computed using (9). A new sample $x_{\mathrm{test}}$ is classified by using
$$f(x_{\mathrm{test}}) = \mathrm{sign}\left( \frac{1}{2} \sum_{i=1}^{n} \alpha_i y_i k(x_i, x_{\mathrm{test}}) + b \right). \qquad (11)$$
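Evaluating the resulting classifier only requires kernel evaluations between the test sample and the training samples. The following minimal sketch of (11) assumes a Gaussian kernel and that the coefficient vector alpha and the bias b have already been computed from (10) and (9); the function names are ours, not taken from the paper's software.

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    """k(x, z) = exp(-||x - z||^2 / (2 sigma^2)), the kernel used in the experiments."""
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

def predict(x_test, X, y, alpha, b, kernel=gaussian_kernel):
    """Kernelized decision function (11).  X is a d x n matrix of training
    samples, y holds labels in {-1, +1}, and alpha solves problem (10)."""
    score = 0.5 * sum(alpha[i] * y[i] * kernel(X[:, i], x_test)
                      for i in range(X.shape[1]))
    return int(np.sign(score + b))
```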
C. Extension to the Multi-Class Classification Problems
To use the proposed methods in multi-class classification
problems, we can use most of the strategies adopted for
extending binary SVM classifiers to the multi-class case. We
used the two most popular strategies in our experiments: one-against-one (OAO) and one-against-rest (OAR). For a $c$-class classification problem, the OAR strategy trains $c$ binary classifiers, in which each classifier separates one class from the remaining $c-1$ classes. All classifiers need to be trained on the entire training set, and the class label of a test sample is determined according to the highest output of the classifiers in the ensemble. On the other hand, the OAO strategy constructs all possible $c(c-1)/2$ binary classifiers out of the $c$ classes. The decision of the ensemble is made by a max-wins voting scheme, illustrated in the sketch below: each OAO classifier casts one vote for its preferred class, and the final decision is the class with the most votes. In addition to these, one can also use Directed Acyclic Graphs (DAGs) [21] or Binary Decision Trees [22] for multi-class classification.
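As referenced above, the max-wins rule for combining OAO classifiers can be made explicit in a few lines. The pairwise-classifier interface below is hypothetical and only serves to illustrate the voting scheme.

```python
from itertools import combinations

def oao_predict(x, classes, pairwise_clf):
    """Max-wins voting over one-against-one classifiers.
    pairwise_clf[(ci, cj)] is a hypothetical callable returning +1 if x is
    assigned to class ci and -1 if it is assigned to class cj."""
    votes = {c: 0 for c in classes}
    for ci, cj in combinations(classes, 2):
        winner = ci if pairwise_clf[(ci, cj)](x) > 0 else cj
        votes[winner] += 1
    return max(votes, key=votes.get)
```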
III. EXPERIMENTS
We tested the proposed method, the Large Margin Classifier based on HyperDisks (LMC-HD), on a number of datasets and compared it to the SVM classifier and the large margin classifier based on affine hulls (LMC-AH); software is available at http://www2.ogu.edu.tr/mlcv/softwares.html. Both OAR and OAO approaches are used for the multi-class classification problems, and we report the results of whichever yields the better accuracy.
A. Experiments with Linear Large Margin Classifiers
Here we test large margin classifiers on high-dimensional
linearly separable datasets.
Fig. 2. Aligned images of one subject from the AR face database.
1) AR Face Database: The AR Face data set [23] contains
26 frontal images with different facial expressions, illumi-
nation conditions and occlusions for each of 126 subjects,
recorded in two 13-image sessions spaced by 14 days. For
this experiment, we randomly selected 20 male and 20 female
subjects. The images were down-scaled (from 768×576),
aligned so that centers of the two eyes fell at fixed coordinates,
then cropped to size 105×78. Some cropped images are
shown in Fig. 2. Raw pixel values were used as features. The
design parameters are set based on grid search using random
partitions of datasets into training and test set. For training
we randomly selected n = 5, 10, 15, 20, 25 samples for each
individual, keeping the remaining 26 − n for testing. This
process was repeated 15 times, with the final classification
rates being obtained by averaging the 15 results. The results
are plotted in Fig. 3.
Fig. 3. Classification rates as a function of different number of samples per
class on the AR Face Database.
When 5 samples per class are used, large margin classifiers
based on affine hulls and hyperdisks respectively yield 92.64%
and 92.34% classification accuracy, and they significantly out-
perform SVM, which yields 90.24% accuracy. As the number of samples per class is increased, all methods begin to yield similar classification accuracies. These results show that affine hulls and hyperdisks are better models for representing classes in high-dimensional spaces with a limited number of samples.
Fig. 4. Selected 40 objects from the Coil100 database.
2) Coil100 Object Database: The Coil100 dataset (available at www1.cs.columbia.edu/CAVE/software/softlib/coil-100.php) includes 72 views each of 100 different objects taken on a turntable at orientations spaced at 5 degree intervals. We chose 40 objects randomly for the experiments, and these objects are shown in Fig. 4. We used the raw grayscale pixels of the down-sampled 32×32 images as input features, without applying any further visual preprocessing. For training we randomly selected n = 10, 20, ..., 60 images of each object, keeping the remaining 72 − n for testing. The results are given in Fig. 5. As can be seen in the figure, all classification methods yielded the same classification accuracies for this particular database.
Fig. 5. Classification rates for different number of samples per class on the
Coil Database.
B. Experiments with Nonlinear Large Margin Classifiers
In this group of experiments, we tested the kernelized versions of the methods on eight lower-dimensional datasets from the UCI repository (available at http://archive.ics.uci.edu/ml/): Ionosphere, Iris, Image Segmentation (IS), Letter Recognition (LR), Multiple Features (MF) - pixel averages, Pima Indian Diabetes (PID), Wine, and Wisconsin Diagnostic Breast Cancer (WDBC).

TABLE I
LOW-DIMENSIONAL DATABASES SELECTED FROM THE UCI REPOSITORY
Databases  | Number of Classes | Data Set Size | Dimensionality
Ionosphere | 2  | 351   | 34
Iris       | 3  | 150   | 4
IS         | 7  | 2310  | 19
LR         | 26 | 20000 | 16
MF         | 10 | 2000  | 256
PID        | 2  | 768   | 8
Wine       | 3  | 178   | 13
WDBC       | 2  | 569   | 30

TABLE II
CLASSIFICATION RATES (%) ON THE UCI DATASETS
UCI        | LMC-HD     | LMC-AH     | SVM
Ionosphere | 94.01±3.1  | 93.73±3.4  | 92.87±3.2
Iris       | 96.67±2.3  | 94.67±2.9  | 95.33±3.8
IS         | 97.23±0.3  | 95.28±0.7  | 97.10±0.4
LR         | 99.99±0.02 | 99.99±0.02 | 99.64±0.12
MF         | 98.30±0.5  | 98.30±0.5  | 98.00±0.4
PID        | 99.87±0.3  | 99.87±0.3  | 99.87±0.3
Wine       | 98.82±1.6  | 98.82±1.6  | 98.20±1.6
WDBC       | 97.01±0.5  | 96.00±0.8  | 97.36±0.9

The key parameters of
UCI Repository datasets are summarized in Table I. We used
the Gaussian kernels for all datasets. All design parameters are
set based on grid search using random partitions of datasets.
Classification accuracies obtained by 5-fold cross-validation
for UCI databases are given in Table II. Among all tested
methods, the proposed hyperdisk based large margin classifier
achieves the best results except for the WDBC database. For
Ionosphere, LR, MF, and Wine databases, both affine hull and
hyperdisk based classifiers achieve better results than SVM.
On the other hand, for the IS and Iris databases SVM achieves better results than the affine hull based classifier, yet the hyperdisk classifier yields even better results than SVM. There is only
one case where the hyperdisk classifier is outperformed by
SVM. Overall these results show that the hyperdisk model
captures the best aspects of affine hulls and convex hulls. Thus,
the corresponding classifier using hyperdisks either achieves
the best classification accuracy or comparable results to the
other convex class model classifiers using affine or convex
hulls as demonstrated in Table II.
IV. SUMMARY AND CONCLUSION
We investigated the idea of basing large margin classifiers
on hyperdisks of classes as an alternative to the affine and
convex hull classifiers. Given two hyperdisk models, their
corresponding large margin classifier is easily determined by
finding a closest pair of points on these two models and
bisecting the displacement between them. To this end, we
formulated the problem as a convex quadratically constrained
quadratic optimization problem. Extension to the nonlinear
case is realized by using the kernel trick.
The hyperdisk is a model between an affine hull and a convex hull, and it captures the best aspects of both. More precisely, since the affine hull is restricted to lie within a bounding hypersphere in the hyperdisk model, hyperdisks provide better localization of the class samples compared to affine hulls. At the same time, hyperdisks are looser than convex hulls, and this enables us to approximate classes more accurately in high-dimensional spaces. The experimental results verify these points. The hyperdisk classifier always produced either the best classification results or results comparable to the best-performing convex class model classifier. There is not a single case where the hyperdisk classifier is significantly outperformed by the other convex class model classifiers, whereas it significantly outperforms the others on some databases. However, these improvements come at a price. The training time of the hyperdisk classifier is longer than that of the other large margin classifiers since it requires running a quadratic programming algorithm (for finding the hypersphere parameters) followed by the QCQP algorithm. Another limitation is related to real-time efficiency (testing time). The hyperdisk classifier does not return a sparse solution, as is also the case for the affine hull classifier, so its real-time efficiency is low compared to the SVM classifier. However, this limitation can be overcome by running a reduced set algorithm [24,25] that enables us to derive a sparse solution. As future work, we are considering revising the QCQP algorithm so that it returns sparse solutions.
ACKNOWLEDGMENT
This work is supported by the Young Scientists Award Programme (TÜBA-GEBİP/2010-11) of the Turkish Academy of Sciences.
REFERENCES
[1] H. Cevikalp, B. Triggs, H. S. Yavuz, Y. Kucuk, M. Kucuk,
and A. Barkana. Large margin classifiers based on affine hulls.
Neurocomputing, 73:3160–3168, 2010.
[2] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik. Gene selection
for cancer classification using support vector machines. Ma-
chine Learning, 46:389–422, 2002.
[3] H. Cevikalp and B. Triggs. Large margin classifiers based on
convex class models. In International Conference on Computer
Vision Workshops, 2009.
[4] C. Cortes and V. Vapnik. Support vector networks. Machine
Learning, 20:273–297, 1995.
[5] J. Platt. Fast training of support vector machines using sequen-
tial minimal optimization. In Advances in Kernel Methods: Sup-
port Vector Learning, pages 185–208. MIT Press, Cambridge,
1999.
[6] T. Joachims. Making large-scale support vector machine learn-
ing practical. In Advances in Kernel Methods: Support Vector
Learning, pages 169–184. MIT Press, Cambridge, 1999.
[7] R.-E. Fan, P.-H. Chen, and C.-J. Lin. Working set selection using second order information for training SVM. Journal of Machine Learning Research, 6:1889–1918, 2005.
[8] I. W. Tsang, J. T. Kwok, and P.-M. Cheung. Core vector machines: Fast SVM training on very large data sets. Journal of Machine Learning Research, 6:363–392, 2005.
[9] K. P. Bennett and E. J. Bredensteiner. Duality and geometry in SVM classifiers. In ICML, 2000.
[10] D. J. Crisp and C. J. Burges. A geometric interpretation of ν-SVM classifiers. In Neural Information Processing Systems, 1999.
[11] H. Cevikalp, B. Triggs, and R. Polikar. Nearest hyperdisk
methods for high-dimensional classification. In ICML ’08:
Proceedings of the 25th international conference on Machine
learning, pages 120–127, 2008.
[12] P. Vincent and Y. Bengio. K-local hyperplane and convex
distance nearest neighbor algorithms. In NIPS, 2001.
[13] J. Laaksonen. Subspace classifiers in recognition of handwritten
digits. PhD thesis, Helsinki University of Technology, 1997.
[14] H. Cevikalp, D. Larlus, M. Neamtu, B. Triggs, and F. Ju-
rie. Manifold based local classifiers: linear and nonlinear
approaches. Journal of Signal Processing Systems, 61:61–73,
2010.
[15] M. B. Gulmezoglu, V. Dzhafarov, and A. Barkana. The common
vector approach and its relation to principal component analysis.
IEEE Trans. Speech Audio Proc., 9:655–662, 2001.
[16] D. M. J. Tax and R. P. W. Duin. Support vector data description.
Machine Learning, 54:45–66, 2004.
[17] L. Vandenberghe and S. Boyd. Semidefinite programming.
SIAM Review, 38:49–95, 1996.
[18] F. Alizadeh and D. Goldfarb. Second-order cone programming.
Mathematical Programming, 95:3–51, 2003.
[19] C.-M. Tang and J.-B. Jian. A sequential quadratically con-
strained quadratic programming method with an augmented
lagrangian line search function. Journal of Computational and
Applied Mathematics, 220:527–547, 2008.
[20] H. Tuy and N. T. Hoai-Phuong. A robust algorithm for quadratic
optimization under quadratic constraints. Journal of Global
Optimization, 37:557–569, 2007.
[21] J. C. Platt, N. Cristianini, and J. Shawe-taylor. Large margin
dags for multiclass classification. In Advances in Neural
Information Processing Systems, pages 547–553. MIT Press,
2000.
[22] V. Vural and J. G. Dy. A hierarchical method for multi-class
support vector machines. In ICML ’04: Proceedings of the
twenty-first international conference on Machine learning, page
105, New York, NY, USA, 2004. ACM.
[23] A. M. Martinez and R. Benavente. The AR face database.
Technical report, Computer Vision Center, Barcelona, Spain,
1998.
[24] B. Schölkopf, S. Mika, C. J. C. Burges, P. Knirsch, K.-R. Müller, G. Rätsch, and A. J. Smola. Input space versus feature space in kernel-based methods. IEEE Transactions on Neural Networks, 10:1000–1017, 1999.
[25] S. Mika, B. Schölkopf, A. Smola, K.-R. Müller, M. Scholz, and G. Rätsch. Kernel PCA and de-noising in feature spaces. In Neural Information Processing Systems (NIPS), 1998.