SOM: stochastic initialization versus principal
components
Ayodeji A. Akinduko
University of Leicester, Leicester, UK
Evgeny M. Mirkes
University of Leicester, Leicester, UK
Alexander N. Gorban
University of Leicester, Leicester, UK
Abstract
Selection of a good initial approximation is a well-known problem for all iterative
methods of data approximation, from k-means to Self-Organising Maps (SOM)
and manifold learning. The quality of the resulting data approximation depends
on the initial approximation. Principal components are popular as an initial
approximation for many methods of nonlinear dimensionality reduction because
of their convenience and the exact reproducibility of the results. Nevertheless,
the reported results of principal component initialization are conflicting.
In this work, we develop the idea of quasilinear datasets. We demonstrate
on the learning of one-dimensional SOMs (models of principal curves) that for
quasilinear datasets the principal component initialization of self-organizing
maps is systematically better than random initialization, whereas for
essentially nonlinear datasets random initialization may perform better.
Performance is evaluated by the fraction of variance unexplained in numerical
experiments.
1. Introduction
Principal components are popular as an initial approximation for many
methods of nonlinear dimensionality reduction [13, 9, 15] because of their
convenience and the exact reproducibility of the results. The quality of the
resulting data approximation depends on the initial approximation, but a
systematic analysis of this dependence usually requires too much effort, and
the reports are often conflicting.
Email addresses: aaa78@le.ac.uk (Ayodeji A. Akinduko), em322@le.ac.uk (Evgeny M.
Mirkes), ag153@le.ac.uk (Alexander N. Gorban)
In this work, we analyze the initialization of Self-Organizing Maps (SOM).
Originally, Kohonen [14] proposed random initialization of SOM weights, but
recently the principal component initialization (PCI), in which the initial map
weights are chosen from the space of the first principal components, has become
rather popular [4]. Nevertheless, some authors have criticized PCI [3, 20]. For
example, an initialization procedure is expected to perform much better if there
are more nodes in the areas where dense clusters are expected and fewer nodes in
empty areas. In this paper, the performance of the random initialization (RI)
approach is compared to that of PCI for one-dimensional SOM (models of principal
curves). Performance is evaluated by the fraction of variance unexplained.
Datasets were classified into linear, quasilinear and nonlinear [10, 11]. It was
observed that RI systematically performs better for nonlinear datasets; however,
the performance of the PCI approach remains inconclusive for quasilinear datasets.
The Self-Organizing Map (SOM) can be considered a nonlinear generalization
of principal component analysis [22, 23] and has found many applications in
data exploration, especially in data visualization, vector quantization and
dimension reduction. Inspired by biological neural networks, it is a type of
artificial neural network which uses an unsupervised learning algorithm with the
additional property that it preserves the topological mapping from input space
to output space, making it a great tool for visualization of high-dimensional
data in a lower dimension. Originally developed by Kohonen [14] for the
visualization of distributions of metric vectors, SOM has found many
applications. However, as for clustering algorithms [18, 8], the quality of SOM
learning is greatly influenced by the initial conditions: the initial weights of
the map, the neighbourhood function, the learning rate, the sequence of training
vectors, and the number of iterations [14, 19]. Several initialization
approaches have been developed; they can be broadly grouped into two classes:
random initialization and data-analysis-based initialization [3]. Because there
are many possible initial configurations in the random approach, several
attempts are usually made and the best initial configuration is adopted. In the
data-analysis-based approach, statistical data analysis and data classification
methods are used to determine the initial configuration; a popular method is to
select the initial weights from the space spanned by the first linear principal
component (the eigenvector corresponding to the largest eigenvalue of the
empirical covariance matrix). Modifications of the PCA approach have been
proposed [3], and over the years other initialization methods have been
developed; an example is given by Fort et al. [7]. In this paper, we consider
the performance, in terms of the quality of learning of the SOM, of the random
initialization (RI) method (in which the initial weights are taken from the
sample data) and the principal component initialization (PCI) method. The
quality of learning is measured by the fraction of variance unexplained [17].
To ensure an exhaustive study, synthetic two-dimensional datasets distributed
along various shapes are considered, and the map is one-dimensional.
One-dimensional SOMs are important, for example, for the approximation of
principal curves. The experiments were performed using the PCA, SOM and Growing
SOM (GSOM) applet available online [17] and can be reproduced. SOM learning was
performed with the same neighbourhood function and learning rate
for both initialization approaches. Therefore, the two methods are subject to
the same conditions that could influence the outcome of our study. To exclude
the effect of the sequence of training vectors, the applet adopts the batch SOM
learning algorithm [14, 6, 7] described in the next section. For the random
initialization approach, the space of initial weights was sampled, because as
the size $n$ of the data set increases, the number of possible initial
configurations for a given number of nodes $k$ becomes enormous ($n^k$). PCI
was performed using a regular grid on the first principal component with the
same variance as the data [17]. For each data set and initialization approach,
the SOM was trained using three or four different values of $k$. We use a
heuristic classification of datasets into three classes, linear, quasilinear
and essentially nonlinear [10, 11], to organize the case study and to present
the results. We describe the versions of the SOM algorithms used in detail
below, in order to ensure the reproducibility of the case study.
2. Background
2.1. SOM Algorithm
The SOM is an artificial neural network which has a feed-forward structure
with a single computational layer. Each neuron in the map is connected to all
the input nodes. The classical on-line SOM algorithm can be summarised as
follows:
1. Initialization: an initial weight $w_j(0)$ is assigned to all the connections.
2. Competition: all nodes compete for the ownership of the input pattern. Using
the Euclidean distance as the criterion, the neuron with the minimum distance
wins:
$$j^* = \arg\min_{1 \le j \le k} \|x(t) - w_j(t)\|,$$
where $x(t)$ is the input pattern at time $t$, $w_j(t)$ is the $j$-th coding
vector at time $t$, and $k$ is the number of nodes.
3. Cooperation: the winning neuron also excites its neighbouring neurons
(topologically close neurons). The closeness of the $i$-th and $j$-th neurons
is measured by the neighbourhood function $\eta_{ji}(t)$: $\eta_{ii} = 1$, and
$\eta_{ji} \to 0$ for large $|i - j|$.
4. Learning process (adaptation): the winning neuron and its neighbours are
adjusted by the rule
$$w_i(t+1) = w_i(t) + \alpha(t)\,\eta_{j^* i}(t)\,(x(t) - w_i(t)).$$
Hence, the weights of the winning neuron and its neighbours are adjusted towards
the input pattern; however, the neighbours' weights are adjusted by a smaller
amount than the winner's. This action helps to preserve the topology of the map.
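To make the four steps concrete, here is a minimal sketch of a single on-line
update in Python (NumPy); the Gaussian neighbourhood and the constant learning
rate are illustrative assumptions only, not the settings of this case study,
which uses the batch algorithm and the B-spline neighbourhood of Section 2.3.

import numpy as np

def online_som_step(weights, x, alpha=0.1, h=1.0):
    """One on-line SOM update: competition, cooperation, adaptation.

    weights -- (k, d) array of coding vectors w_j(t)
    x       -- (d,) input pattern x(t)
    alpha   -- learning rate alpha(t); a constant here for illustration
    h       -- width of an illustrative Gaussian neighbourhood
    """
    # Competition: the winner j* minimizes the Euclidean distance to x.
    j_star = np.argmin(np.linalg.norm(weights - x, axis=1))
    # Cooperation: neighbourhood function eta_{j* i} on the 1D grid;
    # eta = 1 for the winner and decays to 0 for large |i - j*|.
    i = np.arange(len(weights))
    eta = np.exp(-(i - j_star) ** 2 / (2.0 * h ** 2))
    # Adaptation: move the winner and its neighbours towards x,
    # the neighbours by a smaller amount than the winner.
    return weights + alpha * eta[:, None] * (x - weights)

# Example: 10 nodes in 2D, one input pattern.
rng = np.random.default_rng(0)
w = rng.normal(size=(10, 2))
w = online_som_step(w, rng.normal(size=2))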
2.2. The Batch Algorithm
We use the batch algorithm of SOM learning. This is a version of the SOM
algorithm in which the whole training set is presented to the map before the
weights are adjusted with the net effect over the samples [14, 16, 6]. The
algorithm is given below; a code sketch of one step follows.
1. Set the set of data points associated with each node to the empty set:
$C_i = \emptyset$.
2. Present an input vector $x_s$ and find the winning neuron, that is, the
weight vector closest to the input data:
$$i = \arg\min_{1 \le j \le k} \|x_s - w_j(t)\|, \qquad C_i \leftarrow C_i \cup \{s\}.$$
3. Repeat step 2 for all data points in the training set.
4. Update all the weights as follows:
$$w_i(t+1) = \frac{\sum_{j=1}^{k} \eta_{ij}(t) \sum_{s \in C_j} x_s}{\sum_{j=1}^{k} \eta_{ij}(t)\,|C_j|}, \qquad (1)$$
where $\eta_{ij}(t)$ is the neighbourhood function between the $i$-th and
$j$-th nodes at time $t$, and $k$ is the number of nodes.
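Formula (1) can be transcribed directly into NumPy; the following is a sketch
based on the reconstruction of (1) above, with the neighbourhood matrix passed
in explicitly.

import numpy as np

def batch_som_step(weights, data, eta):
    """One batch SOM step: steps 1-3 assign data to the nearest nodes,
    step 4 applies formula (1).

    weights -- (k, d) coding vectors w_j(t)
    data    -- (n, d) training set
    eta     -- (k, k) neighbourhood matrix eta_{ij}(t)
    """
    k = len(weights)
    # Winner index for every data point (this defines the sets C_j).
    dists = np.linalg.norm(data[:, None, :] - weights[None, :, :], axis=2)
    winners = np.argmin(dists, axis=1)
    # Per-node sums over C_j and the cluster sizes |C_j|.
    sums = np.zeros_like(weights)
    np.add.at(sums, winners, data)                # sum_{s in C_j} x_s
    sizes = np.bincount(winners, minlength=k)     # |C_j|
    # Formula (1): neighbourhood-weighted means.
    num = eta @ sums                              # sum_j eta_ij sum_{s in C_j} x_s
    den = eta @ sizes                             # sum_j eta_ij |C_j|
    den_safe = np.where(den > 0, den, 1.0)        # guard against empty neighbourhoods
    return np.where(den[:, None] > 0, num / den_safe[:, None], weights)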
2.3. SOM learning algorithm used in the case study
Before learning, all $C_i$ are set to the empty set ($C_i = \emptyset$), and
the step counter is set to zero.
1. Associate data points with nodes (form the lists of indices):
$$C_i = \{l : \|x_l - w_i\| \le \|x_l - w_j\| \ \forall j \ne i\}.$$
2. If all sets $C_i$ evaluated at step 1 coincide with the sets from the
previous step of learning, then STOP.
3. Calculate the new values of the coding vectors by formula (1).
4. Increment the step counter by 1.
5. If the step counter is equal to 100, then STOP.
6. Return to step 1.
The neighbourhood function used in this applet has a simple B-spline form with
$h_{\max} = 3$: $\eta_{ij} = 1 - |i-j|/(h_{\max}+1)$ if $|i-j| < h_{\max}$,
and $\eta_{ij} = 0$ if $|i-j| \ge h_{\max}$.
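The B-spline neighbourhood and the two stopping rules translate into a short,
self-contained training loop; the batch step is repeated from the previous
sketch so that this listing runs on its own.

import numpy as np

def bspline_neighbourhood(k, h_max=3):
    """eta_ij = 1 - |i-j|/(h_max+1) if |i-j| < h_max, and 0 otherwise."""
    d = np.abs(np.subtract.outer(np.arange(k), np.arange(k)))
    return np.where(d < h_max, 1.0 - d / (h_max + 1.0), 0.0)

def train_som(weights, data, h_max=3, max_steps=100):
    """Batch SOM learning with the stopping rules of Section 2.3:
    stop when the partition C_i repeats, or after max_steps steps."""
    k = len(weights)
    eta = bspline_neighbourhood(k, h_max)
    prev = None
    for _ in range(max_steps):
        # Step 1: associate every data point with its nearest node.
        dists = np.linalg.norm(data[:, None, :] - weights[None, :, :], axis=2)
        winners = np.argmin(dists, axis=1)
        # Step 2: STOP if the sets C_i did not change.
        if prev is not None and np.array_equal(winners, prev):
            break
        prev = winners
        # Step 3: new coding vectors by formula (1).
        sums = np.zeros_like(weights)
        np.add.at(sums, winners, data)
        sizes = np.bincount(winners, minlength=k)
        den = eta @ sizes
        den_safe = np.where(den > 0, den, 1.0)
        weights = np.where(den[:, None] > 0, (eta @ sums) / den_safe[:, None], weights)
    return weights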
2.4. GSOM
GSOM was developed to identify a suitable map size for the SOM and to improve
the approximation of data [1]. It starts with a minimal number of nodes and
grows new nodes on the boundary based on a heuristic. There are many heuristics
for GSOM growing; our version is optimized for 1D GSOM, the model of principal
curves [17]. The GSOM method is specified by three parameters:
• Neighbourhood radius. This parameter, $h_{\max}$, is used to evaluate the
neighbourhood function $\eta_{ij}$ (the same as for SOM).
• Maximum number of nodes. This parameter restricts the size of the map.
• Stopping threshold. Learning stops when the fraction of variance unexplained
is less than a preselected threshold.
The GSOM algorithm includes learning and growing phases (a skeleton is sketched
below). The learning phase is exactly the SOM learning algorithm; the only
difference is the number of learning steps. For SOM we use 100 batch learning
steps after each learning start or restart, whereas for GSOM we use 20 batch
learning steps in the learning loop.
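A skeleton of the growing loop is sketched below. Since the growth heuristic is
not specified here, the boundary rule in this sketch (linear extrapolation past
the endpoint whose candidate lands closer to the data) is a hypothetical
placeholder, not necessarily the heuristic of the applet [17]; the train and
fvu callables correspond to the batch learning sketch above and the FVU sketch
in Section 2.5.

import numpy as np

def grow_1d_gsom(data, weights, train, fvu, max_nodes=50, threshold=0.05):
    """Skeleton of the 1D GSOM loop: learn (20 batch steps), test the two
    stopping criteria, grow one node on the boundary, repeat.

    train -- callable running the batch learning phase (e.g. 20 steps)
    fvu   -- callable returning the fraction of variance unexplained
    NOTE: the growth rule below is an assumed placeholder heuristic.
    """
    weights = train(weights, data)
    while len(weights) < max_nodes and fvu(data, weights) > threshold:
        # Candidate nodes extrapolated past each end of the broken line.
        head = 2 * weights[0] - weights[1]
        tail = 2 * weights[-1] - weights[-2]
        d_head = np.min(np.linalg.norm(data - head, axis=1))
        d_tail = np.min(np.linalg.norm(data - tail, axis=1))
        # Grow at the boundary whose candidate lies closer to the data.
        if d_head < d_tail:
            weights = np.vstack([head, weights])
        else:
            weights = np.vstack([weights, tail])
        weights = train(weights, data)
    return weights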
2.5. Fraction of Variance Unexplained
In this study, data are approximated by broken lines (SOM and GSOM). The
dimensionless least-squares measure of the error is the Fraction of Variance
Unexplained (FVU). It is defined as the ratio of the sum of squared distances
from the data to the approximating line to the sum of squared distances from
the data to the mean point [17].
The distance from a point $x_i$ to a straight line is the length of the
perpendicular $p_i$ dropped from the point to the line. This definition allows
us to evaluate the FVU for PCA:
$$\mathrm{FVU} = \frac{\sum_{i=1}^{n} p_i^2}{\sum_{i=1}^{n} \|x_i - \bar{x}\|^2},$$
where $\bar{x}$ is the mean point, $\bar{x} = (1/n)\sum_{i=1}^{n} x_i$. For SOM
we need to solve the following problem: for a given array of coding vectors
$\{y_i\}$ ($i = 1, 2, \ldots, k$), we have to calculate the distance from each
data point $x$ to the broken line specified by the sequence of points
$\{y_1, y_2, \ldots, y_k\}$. For this purpose, we calculate the distance from
$x$ to each segment $[y_i, y_{i+1}]$ and find $d(x)$, the minimum of these
distances. Then
$$\mathrm{FVU} = \frac{\sum_{i=1}^{n} d^2(x_i)}{\sum_{i=1}^{n} \|x_i - \bar{x}\|^2}.$$
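Both formulas reduce to elementary geometry: the distance from a point to a
segment is obtained by projecting the point onto the segment and clipping the
projection parameter to [0, 1]. A compact sketch, assuming the nodes are
pairwise distinct:

import numpy as np

def point_segment_dist2(x, a, b):
    """Squared distance from point x to the segment [a, b]."""
    ab = b - a
    t = np.clip(np.dot(x - a, ab) / np.dot(ab, ab), 0.0, 1.0)
    return float(np.sum((a + t * ab - x) ** 2))

def fvu_broken_line(data, nodes):
    """FVU of the broken line through the coding vectors y_1, ..., y_k."""
    d2 = [min(point_segment_dist2(x, nodes[i], nodes[i + 1])
              for i in range(len(nodes) - 1))
          for x in data]
    total = np.sum((data - data.mean(axis=0)) ** 2)  # sum of ||x_i - xbar||^2
    return np.sum(d2) / total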
2.6. Initialization Methods
The objective of this paper is to compare the performance of two different
initialization methods for SOM, using the FVU as the criterion for measuring
the performance, that is, the quality of learning. The two initialization
methods compared are:
• PCA initialization (PCI): the weight vectors are selected from the subspace
spanned by the first principal components. For this study, the weight vectors
are chosen as a regular grid on the first principal component, with the same
variance as the whole dataset. Therefore, given the number of weight vectors
$k$, the behaviour of SOM with PCA initialization is completely deterministic
and results in a single configuration. PCA initialization does not take into
account the distribution of the linear projection results: it can produce
several empty cells and may need a post-processing reconstitution algorithm [3].
However, since the PCA initialization is better organized, the SOM computation
can be made an order of magnitude faster compared to random initialization,
according to Kohonen [14].
• Random initialization (RI): $k$ weight vectors are selected randomly,
independently and equiprobably from the data points. The size of the set of
possible initial configurations for a dataset of size $n$ is $n^k$. Given an
initial configuration, the behaviour of the SOM is completely deterministic.
Sketches of both initializations are given below.
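Both methods admit short implementations. In the PCI sketch below, the grid is
rescaled so that its variance along the first principal component equals the
data variance along that component; this reading of "with the same variance as
the whole dataset" is our assumption.

import numpy as np

def pci_init(data, k):
    """Regular grid of k nodes on the first principal component."""
    xbar = data.mean(axis=0)
    centred = data - xbar
    # First principal component: the leading right singular vector.
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    pc1 = vt[0]
    # Regular grid, rescaled to match the data variance along pc1
    # (assumed interpretation of the variance condition).
    grid = np.linspace(-1.0, 1.0, k)
    grid *= np.sqrt(np.var(centred @ pc1) / np.var(grid))
    return xbar + grid[:, None] * pc1

def ri_init(data, k, rng=None):
    """k coding vectors chosen independently and equiprobably from the
    data points (with replacement, giving n^k possible configurations)."""
    rng = np.random.default_rng() if rng is None else rng
    return data[rng.integers(0, len(data), size=k)].copy()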
2.7. Linear, Quasilinear and Nonlinear models
Data sets can be modelled using linear or nonlinear manifolds of lower
dimension. A class of quasilinear model datasets was identified in [10, 11].
In this study, data sets are classified as linear, quasilinear or nonlinear.
The non-linearity test for PCA helps to determine whether a linear model is
appropriate for modelling a data set [15].
• Linear model. A data set is said to be linear if it can be modelled using a
sequence of linear manifolds of small dimension (in Figure 1d, they can be
approximated by a straight line with sufficient accuracy). Such data can be
easily approximated by the principal components without SOM; we do not study
such data.
• Quasilinear model. A dataset is called quasilinear (in dimension one) if the
principal curve approximating the dataset can be univalently and linearly
projected onto the linear principal component. For this study, the border
cases between nonlinear and quasilinear datasets (like "S" below) are also
classified as quasilinear. See examples in Figure 1.
• Nonlinear model. In this paper, we call essentially nonlinear datasets, which
do not fall into the class of quasilinear datasets, simply nonlinear data. See
examples in Figures 1b, 1c and 1e.
For each test, we found the number of RI SOMs with an FVU less than or equal to
that of the PCI SOM. In the tables, the results are averaged over various types
of pattern smearing (Table 2) and over different pattern models (Table 3).
In eight tests (out of 100), all RI SOMs had an FVU equal to or greater than
that of the PCI SOM: clear C with 10 nodes, scattered C with 10 nodes, clear
circle with 10 nodes, scattered circle with 10 nodes, scattered S with 20 nodes,
scattered and noised spiral with 10 nodes, noised circle with 75 nodes, and
clear spiral with 50 nodes. The histograms are presented in Figure 2.
Table 1: Classification of pattern models (Figure 1).

Etalon      Clear        Scattering   Noise      Noise & scattering
C           quasilinear  quasilinear  nonlinear  quasilinear
Circle      nonlinear    nonlinear    nonlinear  nonlinear
Horseshoe   nonlinear    nonlinear    nonlinear  nonlinear
S           quasilinear  quasilinear  nonlinear  quasilinear
Spiral      nonlinear    nonlinear    nonlinear  nonlinear
Table 2: The results of testing for different kinds of patterns.

Pattern               Average fraction of RI SOMs     Average fraction of RI SOMs
                      with FVU better than for PCI    with FVU better than for GSOM
Clear                 35.00%                          27.95%
Scattered             44.56%                          13.84%
Noised                55.52%                          73.72%
Scattered and noised  64.60%                          64.52%
Table 3: The results of testing for different models.

Pattern model  Average fraction of RI SOMs     Average fraction of RI SOMs
               with FVU better than for PCI    with FVU better than for GSOM
Quasilinear    36.62%                          30.26%
Nonlinear      60.89%                          57.20%
Figure 1: (a) Quasilinear data set; (b, c, e) nonlinear data sets; (d) a border
case between a nonlinear and a quasilinear dataset. The first principal
component approximations are shown (black line). The left column contains clear
patterns, the second column from the left contains scattered patterns, the
second column from the right contains the clear patterns with added noise, and
the right column contains the scattered patterns with added noise.
The results of the tests show that RI SOM may perform better than PCI SOM for
all models and all kinds of patterns. Nevertheless, there exists a small
fraction of patterns for which RI SOM does not outperform PCI SOM.
Let us estimate the number of RI SOMs we need to train in order to obtain an
FVU less than that of PCI with probability 90%. Consider a pattern with a
quasilinear model. In this case, we estimate the probability of obtaining an
RI SOM with an FVU worse than that of the PCI SOM as 63.38% (100 - 36.62). The
probability that 5 RI SOMs all have an FVU not less than that of the PCI SOM is
$0.6338^5 \approx 0.10$. Therefore, it is sufficient to try 5 RI SOMs to obtain
an FVU less than that of the PCI SOM with probability of approximately 90%. All
these numbers are valid for our choice of patterns and their smearing (Figure 1).
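This arithmetic is easy to check in a couple of lines; the same computation
with the nonlinear figures from Table 3 reproduces the estimate quoted in the
Discussion below.

# Table 3, quasilinear patterns: one RI SOM beats PCI with probability 36.62%.
p_fail = 1.0 - 0.3662
print(p_fail ** 5)          # 0.6338^5 ≈ 0.102, so ≈ 90% success within 5 trials

# Nonlinear patterns: one RI SOM beats PCI with probability 60.89%.
print((1.0 - 0.6089) ** 3)  # 0.3911^3 ≈ 0.060, so ≈ 94% success within 3 trials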
3. Discussion
This simple systematic case study demonstrates that the widely accepted
presumption about the advantages of PCI SOM initialization is not universal.
The frequency of RI SOMs with an FVU less than the FVU of the PCI SOM is 61%
for the nonlinear patterns selected as benchmarks for our study (Figure 2).
Figure 2: A typical example of the distribution of RI SOM FVU, in percent of
the PCI FVU (horizontal axis: 60% to 250%; vertical axis: frequency, from 0 to
100). The vertical solid line with a thin arrow above corresponds to the PCI
SOM FVU. The vertical dashed line with a wide arrow above corresponds to the
GSOM FVU. All four histograms show the distribution of RI SOM FVU with 20 SOM
nodes for the spiral pattern: (a) clear spiral, (b) scattered spiral, (c)
noised spiral, and (d) scattered and noised spiral.
This means that three random initializations are sufficient to obtain an FVU
less than or equal to the PCI SOM FVU with probability of about 94% in these
cases. For quasilinear patterns the situation is different, and the performance
of PCI SOM is better. Nevertheless, for the selected quasilinear benchmarks it
is sufficient to try RI SOM five times to obtain an FVU less than that of PCI
SOM with probability 90% (see Figure 2). Of course, there may be many heuristic
rules for further improvement of the initialization, for example, rules that
respect the cluster structure.
The proposed classification of datasets into two classes, quasilinear and
nonlinear, is important for understanding the dynamics of manifold learning and
for the selection of the initial approximation. Linear configurations may be
considered a limit case of quasilinear ones. We defined quasilinear (in
dimension one) datasets using the principal curve and studied one-dimensional
SOMs. In applications, SOMs of higher dimension (two or even three) are used
much more often. Therefore, the next step should be the development of the
concept of quasilinear datasets for higher dimensions of the approximants.
It is possible to generalize this definition to dimension $k > 1$ using the
injectivity of the projection of the $k$-dimensional principal manifold onto
the space of the first $k$ principal components. Nevertheless, it may be
desirable to define the quasilinearity of the data distribution without such a
complex intermediate concept as the "principal manifold". Indeed, SOM is often
considered an approximation of the principal manifold [22, 23], and it is
reasonable to avoid the use of principal manifolds in a definition of
quasilinearity that will be used for the selection of the initial approximation
in manifold learning. Let us operate with the probability distributions directly.
Consider a probability distribution in the data space with probability density
$p(x)$. Assume that there is a gap between the first $k$ eigenvalues of the
correlation matrix and the rest of its spectrum. Then the projector $\Pi_k$ of
the data space onto the space of the first $k$ principal components is defined
unambiguously. This projector is orthogonal with respect to the standard inner
product in the space of the normalized data. We call the distribution $p(x)$
quasilinear in dimension $k$ if the conditional distribution
$$p(x \mid \Pi_k(x) = y)$$
is, for each $y$, either log-concave or zero.
The requirement of log-concavity is motivated by the properties of such
distributions: convolutions of log-concave distributions are log-concave, and
so are their marginal distributions [5]. Therefore, this class of distributions
is much more convenient than the naïve unimodal distributions [2]. Most of the
commonly used parametric distributions are log-concave, and log-concave
distributions necessarily have subexponential tails. Non-parametric maximum
likelihood estimations for log-concave distributions have been developed even
in the multidimensional case [21].
Finally, let us formulate a hypothesis: if the probability distribution is
quasilinear in dimension $k$, then PCI will perform better than RI, at least
for sufficiently large data sets.
References
[1] D. Alahakoon, S. K. Halgamuge, B. Srinivasan, Dynamic Self-Organizing
Maps With Controlled Growth For Knowledge Discovery, IEEE Transac-
tions on Neural Networks 11 (3) (2000), 601–614.
[2] M.Y. An, Log-concave probability distributions: Theory and statisti-
cal testing, Duke University Dept of Economics Working Paper 95-
03, 1997. Available at SSRN: http://ssrn.com/abstract=1933 or
http://dx.doi.org/10.2139/ssrn.1933.
[3] M. Attik, L. Bougrain, F. Alexandre, Self-organizing map initialization,
In: W. Duch, J. Kacprzyk, E. Oja, S. Zadrozny (Eds.): Artificial
Neural Networks: Biological Inspirations. LNCS, vol. 3696. Springer, Berlin
Heidelberg, pp. 357–362, 2005.
[4] A. Ciampi, Y. Lechevallier, Clustering Large, Multi-Level Data Sets: An
Approach Based On Kohonen Self Organizing Maps, In: D.A. Zighed, J.
Komorowski, J. Zytkow (Eds.): PKDD 2000. LNCS (LNAI), vol. 1910,
pp. 353–358, 2000.
[5] S. Dharmadhikari, K. Joag-Dev, Unimodality, Convexity, and Applications,
Academic Press, 1988.
[6] J.-C. Fort, M. Cottrell, P. Létrémy, Stochastic On-Line Algorithm Versus
Batch Algorithm For Quantization And Self Organizing Maps. In: Neural
Networks for Signal Processing 11. Proceedings of the 2001 IEEE Signal
Processing Society Workshop, pp. 43–52, 2001.
[7] J.-C. Fort, P. Létrémy, M. Cottrell, Advantages and drawbacks of the batch
Kohonen algorithm. In: Verleysen, M. (ed.), ESANN’2002 Proceedings,
European Symposium on Artificial Neural Networks, Bruges (Belgium),
pp. 223–230, 2002.
[8] A. P. Ghosh, R. Maitra, A. D. Peterson, Systematic Evaluation Of Dif-
ferent Methods For Initializing The K-Means Clustering Algorithm, IEEE
Transactions on Knowledge and Data Engineering (2010), 522–537.
[9] A.N. Gorban, B. Kégl, D.C. Wunsch, A.Y. Zinovyev (Eds.), Principal Man-
ifolds for Data Visualization and Dimension Reduction. LNCSE, vol. 58.
Springer, Berlin – Heidelberg, 2008.
[10] A.N. Gorban, A.A. Rossiev, Neural Network Iterative Method Of Principal
Curves For Data With Gaps. Journal of Computer and Systems Sciences
International 38(5), 825–830, 1999.
[11] A.N. Gorban, A.A. Rossiev, D.C. Wunsch II, Neural Network Modeling Of
Data With Gaps: Method Of Principal Curves, Carleman's Formula, And
Other. In: USA-NIS Neurocomputing opportunities workshop, Washing-
ton DC (1999), arXiv:cond-mat/0305508
[12] A. N. Gorban, A. Zinovyev, Principal manifolds and graphs in practice:
from molecular biology to dynamical systems, International Journal of Neu-
ral Systems, 20 (3) (2010), 219–232.
[13] K. Kiviluoto, E. Oja, S-map: A Network With A Simple Self-Organization
Algorithm For Generative Topographic Mappings, In: M.I. Jordan, M.J.
Kearns, S.A. Solla (Eds.) Advances in Neural Information Processing Sys-
tems, Vol. 10, pp. 549–555, MIT Press, Cambridge, MA, 1998.
[14] T. Kohonen, Self-Organization and Associative Memory. Springer, Berlin,
1984.
[15] U. Kruger, J. Zhang, L. Xie, Development And Applications Of Nonlinear
Principal Component Analysis - A Review. In: Gorban, A.N., Kégl, B.,
Wunsch, D.C., Zinovyev, A.Y. (eds.), Principal Manifolds for Data Visu-
alization and Dimension Reduction, LNCSE, vol. 58, pp. 1–44. Springer,
Berlin Heidelberg, 2008.
[16] H. Matsushita, Y. Nishio, Batch-Learning Self-Organizing Map With False-
Neighbor Degree Between Neurons. In: Neural Networks, 2008. IJCNN
2008. IEEE World Congress on Computational Intelligence. IEEE Interna-
tional Joint Conference on, pp. 2259–2266, 2008.
[17] E.M. Mirkes, Principal Component Analysis and Self-
Organizing Maps: applet. University of Leicester, 2011.
http://www.math.le.ac.uk/people/ag153/homepage/PCA_SOM/PCA_SOM.html
[18] J.M. Peña, J.A. Lozano, P. Larrañaga, An Empirical Comparison Of Four
Initialization Methods For The K-Means Algorithm. Pattern Recognition
Letters 20 (1999), 1027–1040.
[19] M.-C. Su, T.-K. Liu, H.-T. Chang, Improving The Self-Organizing Feature
Map Algorithm Using An Efficient Initialization Scheme. Tamkang Journal
of Science and Engineering 5 (1) (2002), 35–48.
[20] T. Vatanen, I.T. Nieminen, T. Honkela, T. Raiko, K. Lagus, Controlling
Self-Organization And Handling Missing Values In SOM And GTM, In:
P.A. Estévez, J.C. Príncipe, P. Zegers (Eds.): Advances in Self-Organizing
Maps, Advances in Intelligent Systems and Computing, Vol. 198, pp. 55–64,
2013.
[21] G. Walther, Inference and modeling with log-concave distributions, Statis-
tical Science, 24 (3) (2009), 319–327.
[22] H. Yin, The Self-Organizing Maps: Background, Theories, Extensions
and Applications. In: Fulcher, J. et al. (eds.), Computational Intelli-
gence: A Compendium: Studies in Computational Intelligence, pp. 715–
762. Springer, Berlin Heidelberg, 2008.
[23] H. Yin, Learning Nonlinear Principal Manifolds by Self-Organising Maps,
In: Gorban, A.N., Kégl, B., Wunsch, D.C., Zinovyev, A.Y. (eds.), Principal
Manifolds for Data Visualization and Dimension Reduction, LNCSE, vol.
58, pp. 69–96. Springer, Berlin Heidelberg, 2008.