Clustering of the SOM easily reveals distinct gene expression patterns: Results of a reanalysis of lymphoma study

BioMed Central
BMC Bioinformatics 2002, 3
Open Access
Research article
Junbai Wang*¹, Jan Delabie², Hans Christian Aasheim³, Erlend Smeland³ and Ola Myklebost¹
Address: ¹Department of Tumor Biology, Norwegian Radium Hospital, N0310 Oslo, Norway, ²Department of Pathology, Norwegian Radium
Hospital, N0310 Oslo, Norway and ³Department of Immunology, Norwegian Radium Hospital, N0310 Oslo, Norway
E-mail: Junbai Wang* - junbaiw@radium.uio.no; Jan Delabie - jan.delabie@labmed.uio.no; Hans Aasheim - h.c.asheim@labmed.uio.no;
Erlend Smeland - e.b.smeland@labmed.uio.no; Ola Myklebost - olam@ulrik.uio.no
*Corresponding author
Abstract
Background: A method to evaluate and analyze the massive data generated by series of
microarray experiments is of utmost importance to reveal the hidden patterns of gene expression.
Because of the complexity and the high dimensionality of microarray gene expression profiles, the
dimensional reduction of raw expression data and the feature selection necessary for, for example,
classification of disease samples remain a challenge. To solve the problem we propose a two-level
analysis. First, a self-organizing map (SOM) is used. The SOM is a vector quantization method that
simplifies and reduces the dimensionality of the original measurements and visualizes each
individual tumor sample in a SOM component plane. Next, hierarchical clustering and K-means
clustering are used to identify patterns of gene expression useful for classification of samples.
Results: We tested the two-level analysis on public data from diffuse large B-cell lymphomas. The
analysis easily distinguished major gene expression patterns without the need for supervision: a
germinal center-related, a proliferation, an inflammatory and a plasma cell differentiation-related
gene expression pattern. The first three patterns matched the patterns described in the original
publication using supervised clustering analysis, whereas the fourth one was novel.
Conclusions: Our study shows that by using SOM as an intermediate step to analyze genome-
wide gene expression data, the gene expression patterns can more easily be revealed. The
"expression display" by the SOM component plane summarises the complicated data in a way that
allows the clinician to evaluate the classification options rather than giving a fixed diagnosis.
Published: 24 November 2002
BMC Bioinformatics 2002, 3:36
Received: 14 June 2002
Accepted: 24 November 2002
This article is available from: http://www.biomedcentral.com/1471-2105/3/36
© 2002 Wang et al; licensee BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media
for any purpose, provided this notice is preserved along with the article's original URL.
Background
The development and progression of cancer are accompanied by complex changes in the patterns of
gene expression, which can be revealed by DNA microarray analysis [1]. However, to reliably
identify expression patterns associated with tumor type, prognosis or therapy, hundreds of
samples need to be studied, and powerful data mining tools are needed. Microarray experiments
are generally performed without an a priori hypothesis. Therefore, data mining tools have to be
developed that reveal a maximum of information to generate new hypotheses [9] with minimal
supervision. Hierarchical clustering is a frequently used method [2–4], but has a number of shortcomings
[5,6]. Notably, the most important genes defining the branches of the clustering tree are not
readily recognized, and important patterns can be lost due to the deterministic nature of
clustering or the high dimensionality of the data.
To solve this problem, we propose a two-level analysis [14] for the study of complex gene
expression data. This analysis summarizes the data by the SOM component plane, and then clusters
the SOM to investigate the feature gene expression patterns. The SOM reduces the dimensionality
of the data, and thereby makes it easy to display the data and reveal the gene expression
patterns. Visual inspection of the gene expression patterns in each single case, and comparison
of those patterns between the different cases, allows the identification of common patterns in
gene expression that may have been lost by directly applying hierarchical clustering to the
data. In addition, by K-means clustering of the SOM, genes that have similar expression
patterns, and might therefore be functionally related, may be identified.
Figure 1
Classification of samples by SOM analysis and K-means clustering. a) SOM component planes for 42
DLBCL samples and three DLBCL cell lines (OCILy3, OCILy10 and OCILy1). The SOM map size is (22 × 14)
and the color scale of each SOM component plane represents the mean ratio in each map node; red
indicates high expression and blue indicates low expression. See supplementary information for
full data. b) K-means clustering of the SOM and mean SOM component planes for DLBCL, FL and CLL.
The cluster numbers are given, and the genes contained within each SOM node and K-means cluster are
listed in the web supplement [13]; selected genes from clusters 10, 11 and 1, 7, 9 are listed in Table 1.
Table 1: Selected genes grouped to clusters 1, 7, 9, 10 and 11 of K-means clustering of the SOM. The full list can be found in the web supplement [13].
Cluster No.  Clone ID                     Gene Description
Cluster 1    100                          Ki67 (long type)
             1287099                      Survivin = apoptosis inhibitor = effector cell protease EPR-1
             108294, 1287528              XRCC9 = DNA repair protein
             950690, 824709               Cyclin A
             563130, 824060               Cyclin B1
             1288839, 325880              Tubulin-beta
             1240822, 588637              Actin = cytoskeletal gamma-actin
             683084                       Cyclin E2
             1356512                      Similar to MCM2 = DNA replication licensing factor
             703757                       MPP1 = Putative M phase phosphoprotein 1
             1240595                      Tubulin-alpha
             1341540, 781047              BUB1 = putative mitotic checkpoint protein ser/thr kinase
Cluster 7    789182                       PCNA = proliferating cell nuclear antigen
             1288183, 235938              BAK = BCL-2 family member
             80592                        Syndecan-1
             469256, 1322301              Bag-1 = Bcl-2 interacting anti-apoptotic protein = RAP46 = Glucocorticoid receptor-associated protein
             525540                       BCL-3
             1338456, 364941              C-myc binding protein
             784012                       40S ribosomal protein S21
             324144                       Ribosomal protein S29
             1087015, 1240788             Ribosomal protein S9
             510395                       Ribosomal protein S16
             272185                       Ribosomal protein L27
             1335421                      Similar to ribosomal protein L37a
             1368302                      Ribosomal protein L32
Cluster 9    46778                        BCL-XL
             814478, 1353675              A1 = Bfl-1 = GRs = Bcl-2 related protein
             270770, 1272196              IRF-4 = LSIRF = Mum1 = homologue of Pip = Lymphoid-specific interferon regulatory factor = Multiple myeloma oncogene 1
             1290353                      Similar to TREB and X box binding protein 1
             145093                       MCL1 = myeloid cell differentiation protein
Cluster 10   701606, 1286850, 200814      CD10 = CALLA = Neprilysin = enkephalinase
             1337241, 306139              BCL-7A
             1340526, 712395              BCL-6
             824476, 95093, 1350545       Spi-B transcription factor
             1335782, 13194072, 1338245   Oct-2 = lymphoid-specific octamer binding transcription factor = POU
             278808                       Spi-1 = PU.1 = ets family transcription factor
             50214                        CD86 = B7-2 = CD28 and CTLA-4 counter-receptor 2
Cluster 11   753794                       BLC = BCA-1 = B lymphocyte chemoattractant BLC = CXC chemokine
             1326652                      CD2
             245959                       SDF-1 = Stromal cell-derived factor 1 = chemokine
             159946                       CD14 = monocyte differentiation antigen
             1130062                      CD3E antigen, epsilon polypeptide
             258802, 470615               CD64 = high affinity immunoglobulin gamma FC receptor I A form precursor = FC-gamma
             377560                       CD3 delta = T cell surface glycoprotein
             505569                       T cell receptor beta chain
             23435, 1306024               CD11C = leukocyte adhesion protein p150,95 alpha subunit = integrin alpha-X
             1219244, 57, 1071581         RANTES = chemokine
             472180                       S100 calcium binding protein A4 = Placental calcium binding protein = Calvasculin
             701290                       C-C chemokine receptor 5 = CC CK5
             47509                        Major histocompatibility complex, class II, DN alpha
To test the power of this two-level approach, we applied it to the analysis of a publicly
available gene expression data set of non-Hodgkin's lymphomas, including mostly diffuse large
B-cell lymphoma (DLBCL), follicular lymphoma (FL) and chronic lymphocytic leukaemia (CLL).
K-means clustering of the SOM readily identified four distinct gene expression profiles:
germinal center-related, proliferation, inflammatory and plasma cell differentiation-related
gene expression patterns. All identified gene expression patterns correlated with clinical
survival.
Results
The expression data [10] were filtered and preprocessed as described and subjected to SOM
analysis. The Davies-Bouldin index was used to find the optimum number of 12 clusters in K-means
clustering of the SOM [14]. Figure 1b shows the K-means clustering of the SOM with map size
(22 × 14). The number of map units is $M = 5\sqrt{N}$, where N is the number of genes. After M
has been determined, the map size is set so that the ratio between the number of columns and the
number of rows of map units equals the ratio of the two largest eigenvalues of the training
data, and their product is as close to M as possible [11]. Each hexagonal node of the SOM is a
prototype vector representing local averages of the data, and nearby nodes have similar
prototype vectors. The genes included in each cluster can be found in the supplement [13].
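The map-size heuristic described above can be illustrated with a small sketch. It is not the SOM Toolbox code: the function name `som_map_size` is hypothetical, and the side ratio is passed in as a parameter because the eigenvalue ratio of the lymphoma data is not reported in the text (the value 1.57 used below is simply 22/14, an assumption for illustration).

```python
import math

def som_map_size(n_genes, side_ratio):
    """Return (long_side, short_side) for a map with about 5*sqrt(n_genes) units.

    side_ratio is the desired long/short side ratio; the text derives it from
    the two largest eigenvalues of the training data (here it is an input).
    """
    m_target = 5.0 * math.sqrt(n_genes)  # M = 5 * N^0.5
    short_side = max(1, round(math.sqrt(m_target / side_ratio)))
    long_side = max(1, round(m_target / short_side))
    return long_side, short_side

# 3906 genes remain after preprocessing; 1.57 is an assumed side ratio.
print(som_map_size(3906, 1.57))
```

With 3906 genes the target is about 312 map units, and the assumed ratio leads to a 22 × 14 grid (308 units), which agrees with the map size quoted in the text.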
Through the proposed two-level approach, one may directly observe the gene expression patterns
of the different lymphoma subtypes, i.e. DLBCL, CLL and FL (figure 1b). As can be seen from
figure 1a, DLBCL primarily showed four prominent gene expression patterns, distinguished by gene
clusters 10, 11, 1 and the large group of clusters 7 and 9. More detailed illustrations of the
distinct gene expression patterns are shown in the supplement [13]; a summary of the genes
included in these clusters is given in Table 1. Cluster 10 contains genes known to be expressed
in germinal center B cells, such as FAK, WIP, CD10, CD27, CD38, FMR2, BCL-6 and BCL-7A. Cluster
11 contains genes specifically expressed by T cells (e.g. CD3, CD2, TCR), NK cells (e.g. NK4),
macrophages (e.g. CD14, CD63, CD64, CD115) and lymph node dendritic cells (e.g. S100). Also
included are genes coding for chemokines and chemokine receptors (RANTES, BLC, IP10, SLC, FPR,
STRL33.1 and MIP1), which play a major role in the chemoattraction of inflammatory cells.
Furthermore, DLBCL variably express genes in the adjacent clusters 1, 7 and 9 (figure 1a).
Cluster 1 includes genes associated with proliferation (Ki67, cyclin A, BUB1, cyclin B1,
thymidine kinase), whereas clusters 7 and 9 include genes associated with cell survival (Bcl-XL,
defender against cell death 1, Bfl-1, BAK, Bag-1, MCL1) and plasma cell differentiation (XBP-1,
STAT3, IRF-4, ribosomal proteins) [10].
We subsequently regrouped the DLBCL cases based on the expression of each of the identified gene
expression patterns and studied survival differences between the groups thus formed. We
confirmed the better survival (figure 2a) of those cases expressing genes related to the
germinal center (gene cluster 10), as reported by Alizadeh et al. [10]. Furthermore, we could
show that there is a significantly improved survival (figure 2b) of cases expressing genes
related to inflammation (gene cluster 11). Equally, there is a significantly reduced survival
(figure 2c) of cases expressing genes related to cell proliferation, anti-apoptosis and plasma
cell differentiation (clusters 1, 7, 9). Interestingly, there is also a significant difference
in survival (figure 2d) when cases are subdivided using a combination of gene expression
patterns 10 and 1, 7, 9, in spite of the low number of cases. We were further intrigued by the
genes in clusters 7 and 9, which apparently are related to plasma cell differentiation and are
frequently co-expressed with the genes in cluster 1 (cell proliferation). Hierarchical
clustering of DLBCL using only genes in clusters 7 and 9 (figure 3) revealed an interesting
pattern of mutually exclusively expressed genes, many of which are of utmost importance for
plasma cell differentiation (XBP-1, STAT3, IRF-4), as well as genes coding for ribosomal
proteins, known to be highly expressed in plasma cells. Of interest are the two mutually
exclusive patterns of plasma cell differentiation in DLBCL, suggesting either different pathways
of plasma cell differentiation or different stages of differentiation.
Figure 1b shows the mean SOM component planes of CLL and FL. Typically, for CLL the genes in the
whole lower part of the SOM are highly expressed, while for FL the genes in the lower and middle
left part of the SOM (cluster 10) are highly expressed. Therefore, the most prominent
distinction between CLL and FL lies in the expression of genes that are characteristic of
germinal center B cells (cluster 10), as has also been suggested by Alizadeh et al. [10].
Discussion
When microarray measurements are presented in random order, the patterns of gene expression are
impossible to discern by eye, and methods like hierarchical clustering are frequently used to
sort the measurements in such a way that many patterns can easily be visualized, such as in
figure 3. However, this method suffers from several shortcomings [14], of which the most
important is the loss of information on potentially important patterns in a high-dimensional
gene space. Although the number of measured genes is large, there may be only a few underlying
gene components that account for most of the response variation; for example, only a few linear
combinations of a subset of genes may account for nearly all of the expression variation among
various tumor types. In such a situation, dimension reduction is needed to reduce the
high-dimensional gene space to a low-dimensional gene component space; for instance, principal
component analysis [18] and partial least squares [20] have been applied to the dimension
reduction of microarray data. Thus, we proposed a two-level analysis: first, the gene expression
data are summarized by a large set of prototypes; then, the prototypes are combined to form the
actual clusters. The SOM is a suitable method for data reduction since it creates a set of
prototype vectors representing the gene expression data and carries out a topology-preserving
projection of the prototypes from the high-dimensional gene space onto a low-dimensional map. To
preserve the cluster structure of the original data in a low-dimensional map, we can select as
many prototype vectors as needed, where the number of prototypes equals $5\sqrt{N}$ (N is the
number of genes) [14]. The map follows the probability density function of the data and is very
robust with regard to missing data points [7]. Furthermore, the component plane of the SOM can
be used as a visualization surface for showing different features of the SOM (and thus of the
gene expression data), for example the cluster structure [14]. By clustering the SOM, a good
insight into the cluster structure (and thus into the feature gene expression patterns) can be
obtained.
Figure 2
Clinically distinct DLBCL subgroups defined by gene expression profiling. Kaplan-Meier plots of
overall survival of DLBCL patients grouped on the basis of gene expression profiling in a)
K-means cluster 10; b) K-means cluster 11; c) K-means clusters (1,7,9); d) K-means cluster 10
and clusters (1,7,9).
Figure 3
Selected genes from K-means clusters. Hierarchical clustering of 72 selected genes from K-means
clusters 1, 7 and 9. Depicted are the measurements of gene expression from DLBCL, FL and CLL
samples. The dendrogram is colour coded according to the category of sample studied (see upper
right key). Each row represents a separate cDNA clone on the microarray and each column a
separate mRNA sample. The squares represent the ratio of hybridisation of fluorescent cDNA
probes prepared from each experimental mRNA sample to a reference mRNA sample. These ratios are
a measure of relative gene expression; red indicates high expression, green indicates low
expression and grey indicates missing or excluded data. See supplementary information for full
data [13].
We applied this two-level approach to the analysis of a set of DLBCL samples that have
previously been published. The inspection of the maps obtained through our analysis clearly
reveals four major gene expression patterns. One pattern concerns genes expressed by germinal
center B cells (cluster 10); the second could be called an 'inflammatory' pattern and relates to
genes expressed by T cells and macrophages (cluster 11). The third pattern is an extensive
collection of genes involved in cell proliferation (cluster 1), which seems to be closely linked
to the fourth pattern, anti-apoptosis and plasma cell differentiation-related genes (clusters 7
and 9). This last pattern has not previously been described, whereas the others were also
discovered by Alizadeh et al. [10] using hierarchical clustering only.
The survival data based on the grouping of cases according to the different gene expression
patterns show that all these expression patterns were significantly correlated with survival
(figure 2a, 2b, 2c). When the germinal center B cell gene expression pattern (cluster 10) is
combined with the proliferation/anti-apoptosis/plasma cell differentiation pattern (clusters 1,
7, 9), thus yielding four groups (figure 2d), significant differences in survival are still seen
notwithstanding the low number of cases. It is of particular interest that all but one of the
cases expressing high levels of germinal center B cell genes but low levels of
proliferation/anti-apoptosis/plasma cell genes have a survival beyond 5 years (figure 2d). This
contrasts sharply with the cases expressing low levels of germinal center B cell genes but high
levels of proliferation/anti-apoptosis and plasma cell differentiation genes, of which none
survive beyond 5 years. Although these data need to be confirmed in larger series of cases, a
division of DLBCL according to expression of a combination of genes relating to the germinal
center, proliferation, anti-apoptosis and plasma cell differentiation seems to be very relevant
in predicting prognosis. Why genes related to cell proliferation, anti-apoptosis and plasma cell
differentiation are frequently co-expressed in DLBCL is not known and needs to be further
investigated. It is apparent from our further analysis (figure 3) that there are two mutually
exclusive patterns of gene expression related to plasma cell differentiation. One pattern
contains the transcription factors IRF4 and XBP-1, which have both been shown to be important
for plasma cell differentiation, as well as STAT3, which is part of the IL-6 signaling pathway
involved in plasma cell differentiation [15–17]. The other pattern shows many unknown genes in
addition to genes coding for ribosomal proteins. The latter suggests an expression pattern
related to a later stage of plasma cell differentiation. These patterns are intriguing, but more
studies on normal plasma cell differentiation are needed in order for these patterns to be fully
understood.
In conclusion, we propose a two-level approach for the analysis of gene expression patterns,
where the clustering analysis is carried out on a set of summarized prototype vectors created by
the SOM. When the two-level approach was applied to the DLBCL data set [10], the discovered gene
expression patterns were consistent with the ones originally published. In addition, a novel
pattern of gene expression related to plasma cell differentiation was revealed. Our results
underscore the value of the two-level analysis for discovering gene expression patterns, and the
method should be useful as a part of routine classification of clinical samples once the
suggested subdivisions have been confirmed in larger studies.
Methods
Sources of experimental data
All experimental data, including the survival data of the lymphoma patients, were obtained from
the web supplement to the publication of Alizadeh et al. [10]
[http://llmpp.nih.gov/lymphoma/data.shtml].
Preprocessing of data
The data were cleaned before any data mining. This included flagging and removal of bad
measurements, i.e. measurements where the fluorescent intensity in one channel was less than 1.4
times the local background were discarded [10], and replacement of values for identical probes
(same IMAGE number and gene) with the mean ratio. After cleaning the original data, we were left
with values for 3906 genes from 96 samples, and these ratios were log2-transformed.
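The cleaning steps above can be sketched for a single sample as follows. This is an illustration only, not the authors' code: the record layout (keys `clone_id`, `gene`, `ch1`, `ch1_bg`, `ch2`, `ch2_bg`, `ratio`) is hypothetical, and the sketch assumes that a spot is discarded when either channel falls below 1.4 times its local background.

```python
import math
from collections import defaultdict

def preprocess(spots, min_fold=1.4):
    """Filter weak spots, average identical probes, and log2-transform.

    spots: iterable of dicts with hypothetical keys
           clone_id, gene, ch1, ch1_bg, ch2, ch2_bg, ratio (ratio > 0).
    Returns {(clone_id, gene): log2(mean ratio)}.
    """
    grouped = defaultdict(list)
    for s in spots:
        # Assumption: discard the spot if either channel is too close to background.
        if s["ch1"] < min_fold * s["ch1_bg"] or s["ch2"] < min_fold * s["ch2_bg"]:
            continue
        grouped[(s["clone_id"], s["gene"])].append(s["ratio"])
    # Replace identical probes by their mean ratio, then take log2.
    return {key: math.log2(sum(r) / len(r)) for key, r in grouped.items()}
```

Averaging is done on the raw ratios before the log2 transform, which follows the order in which the steps are listed in the text.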
Hierarchical clustering
Hierarchical clustering [12] is an agglomerative clustering method that usually has the
following steps: 1) Initialization: assign each vector (the series of values from a single
sample) to its own cluster. 2) Compute the distances between all clusters. 3) Merge the two
clusters that are closest to each other. Steps 2 and 3 are repeated until there is only one
cluster left. In this work, log2-transformed ratios were median-centered before clustering,
Pearson correlation was used as the distance measure and the centered average linkage method was
used for merging. Hierarchical clustering was applied to both rows and columns using the Cluster
and TreeView software from Stanford [2].
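The three steps listed above can be sketched in a few lines. This is a simplified illustration and not the Cluster program: the function names are hypothetical, the distance is taken as 1 minus the Pearson correlation, and each cluster is represented by the size-weighted mean profile of its members, which is one possible reading of "average linkage" and is an assumption here.

```python
import math
import statistics

def median_center(profile):
    """Subtract the median from every value of a profile."""
    med = statistics.median(profile)
    return [v - med for v in profile]

def pearson_distance(a, b):
    """1 - Pearson correlation between two equal-length profiles."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 1.0  # treat a flat profile as uncorrelated
    return 1.0 - cov / math.sqrt(var_a * var_b)

def agglomerate(profiles):
    """Return the merge history as a list of (members_a, members_b, distance)."""
    # Step 1: every profile starts in its own cluster.
    clusters = {i: ([i], list(p)) for i, p in enumerate(profiles)}
    merges = []
    while len(clusters) > 1:
        # Step 2: distances between all current clusters (via their mean profiles).
        keys = sorted(clusters)
        d, a, b = min(
            (pearson_distance(clusters[a][1], clusters[b][1]), a, b)
            for i, a in enumerate(keys) for b in keys[i + 1:]
        )
        # Step 3: merge the closest pair; the new mean is weighted by cluster size.
        members_a, mean_a = clusters.pop(a)
        members_b, mean_b = clusters.pop(b)
        na, nb = len(members_a), len(members_b)
        merged_mean = [(na * x + nb * y) / (na + nb) for x, y in zip(mean_a, mean_b)]
        clusters[a] = (members_a + members_b, merged_mean)
        merges.append((members_a, members_b, d))
    return merges
```

The merge history is enough to draw a dendrogram; for data of realistic size an optimized library routine would be preferable to this quadratic-per-merge loop.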
Self-organizing map (SOM) and K-means clustering
The basic SOM [7] consists of m neurons located on a regular low-dimensional grid, usually 1- or
2-dimensional. The lattice of the grid is hexagonal. The basic SOM algorithm is iterative. Each
neuron i has a d-dimensional prototype vector $m_i = [m_{i1}, \ldots, m_{id}]$, where d is the
input vector dimension. Before the training phase, initial values are given to the prototype
vectors; typically, linear initialization is used. At each training step, a sample data vector x
is randomly chosen from the training set, and the distances between x and all the prototype
vectors are computed. The prototype closest to x (the best-matching unit) and its neighbors on
the grid are then moved toward x. During training, the SOM behaves like a flexible net that
folds onto the "cloud" formed by the training data. Because of the neighborhood relations,
neighboring prototypes are pulled in the same direction, and thus the prototype vectors of
neighboring units resemble each other [11]. To inspect the cluster structure of the map, the SOM
component plane (figure 1) was used to show the gene expression features of the various tumor
samples, and also the common gene expression patterns of each tumor type. Each component plane
can be thought of as a slice of the map: it consists of the values of a single vector component
in all map units. It is visualized as a 2-dimensional color image, where the color of a map unit
corresponds to its value. By visualizing the spread of values of that component and comparing
component planes with each other, correlations are revealed as similar patterns in identical
positions of the component planes. Based on this overall view, it is easy to select interesting
component combinations and map units for further investigation. To study interesting groups of
map units more effectively, methods that give good candidates for map unit clusters or groups
are required. Thus, the trained prototype vectors $m_i$ of the SOM are further clustered by
K-means clustering and combined to form the actual clusters; a more detailed description of
clustering of the SOM can be found in an earlier paper [14].
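A single sequential training step and the extraction of a component plane can be sketched as below. This is a minimal illustration under stated assumptions, not the SOM Toolbox implementation: the function names and the learning-rate parameter `alpha` are hypothetical, the neighborhood is Gaussian, and the grid is treated as rectangular (row-major unit order) rather than hexagonal to keep the distance computation short.

```python
import math

def best_matching_unit(prototypes, x):
    """Index of the prototype closest to x (squared Euclidean distance)."""
    return min(
        range(len(prototypes)),
        key=lambda i: sum((m - v) ** 2 for m, v in zip(prototypes[i], x)),
    )

def train_step(prototypes, coords, x, alpha, sigma):
    """One sequential SOM update: pull the BMU and its grid neighbors toward x.

    coords[i] is the grid position of unit i; prototypes is modified in place.
    """
    bmu = best_matching_unit(prototypes, x)
    for i, m in enumerate(prototypes):
        grid_d2 = sum((p - q) ** 2 for p, q in zip(coords[i], coords[bmu]))
        h = math.exp(-grid_d2 / (2.0 * sigma ** 2))  # Gaussian neighborhood weight
        prototypes[i] = [mk + alpha * h * (xk - mk) for mk, xk in zip(m, x)]
    return bmu

def component_plane(prototypes, k, n_rows, n_cols):
    """Values of vector component k in every map unit, arranged as the map grid."""
    return [[prototypes[r * n_cols + c][k] for c in range(n_cols)]
            for r in range(n_rows)]
```

In the setting of this paper each data vector is a gene and each vector component is a sample, so `component_plane(prototypes, k, 14, 22)` would correspond to the "expression display" of sample k, under the assumption of 14 rows and 22 columns.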
K-means clustering is a partitional clustering method; it classifies the data into k groups,
which together satisfy the requirements of a partition: (1) each group must contain at least one
object; (2) each object must belong to exactly one group. To select the best k among different
partitions, each of these can be evaluated using some kind of validity index. In our
calculations, we used the Davies-Bouldin index [11], which is based on the ratio between
within-cluster distance and between-cluster distance; low values indicate good clustering
results for spherical clusters. Because no unified theory for determining the number of clusters
has been fully developed and accepted, the selection of the optimal number of clusters remains
an active research field [19,21]. Thus, the Davies-Bouldin index used here is only a guideline
to estimate the best clustering among the partitionings with different numbers of clusters. Some
problems need to be noted when clustering the SOM by K-means clustering, due to the properties
of the algorithm: it searches not only for spherical clusters but also for clusters with roughly
equal numbers of samples, so a non-spherical cluster may not be properly recognized as one
cluster; and as the number of clusters is increased, the number of samples in each cluster
decreases, which makes the algorithm more sensitive to outliers. Therefore, the results obtained
by K-means clustering have to be carefully verified [14].
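The selection of k by a validity index can be sketched as follows. This is a plain-Python illustration, not the SOM Toolbox routine: the function names, the random initialization from data points and the fixed seed are assumptions, and the index is computed in its common form (mean over clusters of the worst ratio of summed within-cluster scatter to between-center distance).

```python
import math
import random

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean(points):
    return [sum(col) / len(points) for col in zip(*points)]

def kmeans(points, k, n_iter=100, seed=0):
    """Basic K-means; returns (labels, centers)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # assumption: initialize from k data points
    labels = None
    for _ in range(n_iter):
        new_labels = [min(range(k), key=lambda j: dist(p, centers[j])) for p in points]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = mean(members)
    return labels, centers

def davies_bouldin(points, labels, centers):
    """Davies-Bouldin index; lower values indicate compact, well-separated clusters.

    Assumes all centers are distinct (otherwise a division by zero occurs).
    """
    k = len(centers)
    scatter = []
    for j in range(k):
        members = [p for p, l in zip(points, labels) if l == j]
        scatter.append(
            sum(dist(p, centers[j]) for p in members) / len(members) if members else 0.0
        )
    return sum(
        max((scatter[i] + scatter[j]) / dist(centers[i], centers[j])
            for j in range(k) if j != i)
        for i in range(k)
    ) / k
```

In the two-level analysis the `points` would be the trained SOM prototype vectors rather than the raw gene profiles; one would run `kmeans` for a range of k and keep the partition with the lowest index, treating the result as a guideline as noted above.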
In this work, SOM and K-means clustering were carried out with the SOM Toolbox in MATLAB [11].
The SOM was trained using the batch version of the algorithm on the raw expression data. All
prototype vectors were linearly initialized in the subspace spanned by the two eigenvectors with
the greatest eigenvalues computed from the training data. The SOM was trained in two phases: a
rough training phase with a large initial neighborhood width and a fine-tuning phase with a
small initial neighborhood width. The neighborhood width decreased linearly to 1; the
neighborhood function was Gaussian. The training lengths of the two phases were 1 and 4 epochs,
and the initial neighborhood widths 3 and 1, respectively.
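The two-phase batch training schedule can be sketched as below. This is an illustration under stated assumptions, not the SOM Toolbox code: the function names are hypothetical, the grid distance is plain Euclidean distance between unit coordinates, and the default `phases` tuple encodes the values given in the text as (epochs, initial width, final width).

```python
import math

def linear_widths(start, end, n_epochs):
    """Neighborhood width for each epoch, moving linearly from start to end."""
    if n_epochs == 1:
        return [float(start)]
    return [start + (end - start) * t / (n_epochs - 1) for t in range(n_epochs)]

def batch_epoch(prototypes, coords, data, sigma):
    """One batch epoch: each prototype becomes a neighborhood-weighted mean of the data."""
    def sq(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    # Best-matching unit of every data vector, using the prototypes before the update.
    bmus = [min(range(len(prototypes)), key=lambda i: sq(prototypes[i], x)) for x in data]
    updated = []
    for i, old in enumerate(prototypes):
        # Gaussian weight of each data vector, based on grid distance to its BMU.
        weights = [math.exp(-sq(coords[i], coords[b]) / (2.0 * sigma ** 2)) for b in bmus]
        total = sum(weights)
        if total == 0.0:
            updated.append(list(old))
            continue
        updated.append([sum(w * x[k] for w, x in zip(weights, data)) / total
                        for k in range(len(old))])
    return updated

def train(prototypes, coords, data, phases=((1, 3.0, 1.0), (4, 1.0, 1.0))):
    """Rough phase (1 epoch, width 3) followed by fine-tuning (4 epochs, width 1)."""
    for n_epochs, start, end in phases:
        for sigma in linear_widths(start, end, n_epochs):
            prototypes = batch_epoch(prototypes, coords, data, sigma)
    return prototypes
```

Unlike the sequential rule, the batch rule has no learning rate: all data vectors are presented before the prototypes are replaced, so the result does not depend on the order of the data.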
Survival analysis
The statistical treatment of survival times is known as survival analysis. From a set of
observed survival times from a sample of individuals, we can estimate the proportion of the
population of such people who would survive a given length of time in the same circumstances.
The method yields a graph, the Kaplan-Meier survival curve, which is drawn as a "step function"
that changes at every distinct survival time. The observation times of the surviving subjects
are indicated by ticks on the survival curve, which shows their survival times at a glance
(figure 2). To compare the survival experience of two or more groups of subjects, we used the
logrank test. The logrank test is a hypothesis test of the null hypothesis that the groups being
compared are samples from the same population as regards survival experience. It involves
calculating the observed and expected numbers of failures in separate time intervals and summing
these; comparing the result to a $\chi^2$ distribution with k-1 degrees of freedom, where there
are k groups of observations, gives the P value [9]. The plotting of Kaplan-Meier survival
curves and the logrank test of significance were implemented in MATLAB.
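The two calculations described above can be sketched as follows. This is not the authors' MATLAB code: the function names are hypothetical, events are coded 1 for death and 0 for a censored (still surviving) subject, and the logrank statistic is computed in the simple form that sums (observed - expected)² / expected over the groups. The P value helper covers only the two-group case (1 degree of freedom).

```python
import math

def kaplan_meier(times, events):
    """Return [(t, S(t))] at every distinct death time (events[i] == 1 means death)."""
    curve, surv = [], 1.0
    for t in sorted({t for t, e in zip(times, events) if e}):
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if u == t and e)
        surv *= 1.0 - deaths / at_risk  # step down at each death time
        curve.append((t, surv))
    return curve

def logrank(groups):
    """groups: list of (times, events) pairs. Returns (chi-square, degrees of freedom)."""
    all_times = [t for times, _ in groups for t in times]
    death_times = sorted({t for times, events in groups
                          for t, e in zip(times, events) if e})
    observed = [0.0] * len(groups)
    expected = [0.0] * len(groups)
    for t in death_times:
        n_total = sum(1 for u in all_times if u >= t)
        d_total = sum(1 for times, events in groups
                      for u, e in zip(times, events) if u == t and e)
        for g, (times, events) in enumerate(groups):
            n_g = sum(1 for u in times if u >= t)
            observed[g] += sum(1 for u, e in zip(times, events) if u == t and e)
            expected[g] += d_total * n_g / n_total  # deaths expected under the null
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    return chi2, len(groups) - 1

def p_value_1df(chi2):
    """Upper-tail probability of a chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))
```

For more than two groups the statistic is still valid, but the P value would need the chi-square distribution with k-1 degrees of freedom, which this sketch does not provide.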
Authors' contributions
Junbai Wang carried out the data mining studies, performed the microarray data analysis,
implemented the MATLAB code for survival analysis and drafted the manuscript. Jan Delabie
carried out the biological studies of the discovered gene expression patterns, participated in
data analysis and drafted part of the manuscript. Hans Christian Aasheim and Erlend Smeland
participated in validation of the microarray analysis. Ola Myklebost conceived of the study, and
participated in its design and coordination.
Acknowledgements
This work was supported by the Norwegian Cancer Society [http://www.kreft.no].
References
1. DeRisi J, Penland L, Brown PO, Bittner ML, Meltzer PS, Ray M, Chen
Y, Su YA, Trent JM: Use of a cDNA microarray to analyze gene
expression patterns in human cancer. Nat Genet 1996, 14:457-
460
2. Eisen MB, Spellman PT, Brown PO, Botstein D: Cluster analysis
and display of genome-wide expression patterns. Proc Natl
Acad Sci USA 1998, 95:14863-14868
3. Iyer VR, Eisen MB, Ross DT, Schuler G, Moore T, Lee JC, Trent JM,
Staudt LM, Hudson J Jr, Boguski MS, Lashkari D, Shalon D, Botstein
D, Brown PO: The transcriptional program in the response of
human fibroblasts to serum. Science 1999, 283:83-87
4. Perou CM, Sorlie T, Eisen MB, van de Rijn M, Jeffrey SS, Rees CA, Pol-
lack JR, Ross DT, Johnsen H, Akslen LA, Fluge O, Pergamenschikov A,
Williams C, Zhu SX, Lonning PE, Borresen-Dale AL, Brown PO, Bot-
stein D: Molecular portraits of human breast tumours. Nature
2000, 406:747-752
5. Tamayo P, Slonim D, Mesirov J, Zhu Q, Kitareewan S, Dmitrovsky E,
Lander ES, Golub TR: Interpreting patterns of gene expression
with self-organizing maps: methods and application to he-
matopoietic differentiation. Proc Natl Acad Sci USA 1999,
96:2907-2912
6. Kaufman L, Rousseeuw PJ: Finding groups in data: an introduction
to cluster analysis. (Edited by: Kaufman L. Brussels) John Wiley &
Sons 1991
7. Kohonen T: Self-organizing maps. (Edited by: Lotsch HKV) Berlin,
Springer 1997, 117
8. Toronen P, Kolehmainen M, Wong G, Castren E: Analysis of gene
expression data using self-organizing maps. FEBS Letters 1999,
451:142-146
9. Altman DG: Practical statistics for medical research. (Edited by:
Altman DG) London, Chapman and Hall 1991
10. Alizadeh AA, Eisen MB, Davis RE, Ma C, Lossos IS, Rosenwald A,
Boldrick JC, Sabet H, Tran T, Yu X, Powell JI, Yang L, Marti GE, Moore
T, Hudson J Jr, Lu L, Lewis DB, Tibshirani R, Sherlock G, Chan WC,
Greiner TC, Weisenburger DD, Armitage JO, Warnke R, Staudt LM:
Distinct types of diffuse large B-cell lymphoma identified by
gene expression profiling. Nature 2000,
403:503-511
11. Vesanto J: SOM-Based data visualization methods. Intelligent
Data Analysis journal 1999
12. Everitt BS: Cluster Analysis. (Edited by: Edward Arnold) London, John
Wiley & Sons 1987
13. Wang J, Delabie J, Aasheim HC, Smeland E, Myklebost O:
Supplementary information for "Reanalysis of
global gene expression patterns from Diffuse Large B-Cell
Lymphoma by a two-level strategy reveals novel subtypes"
2001 [http://matrise.uio.no/supDLBCL/Supview.html]
14. Vesanto J, Alhoniemi E: Clustering of the self-organizing map.
IEEE TNN 2000, 11(3):586-600
15. Reimold AM, Iwakoshi NN, Manis J, Vallabhajosyula P, Szomolanyi-
Tsuda E, Gravallese EM, Friend D, Grusby MJ, Alt F, Glimcher LH:
Plasma cell differentiation requires the transcription factor
XBP-1. Nature 2001, 412:300-307
16. Hirano T, Ishihara K, Hibi M: Roles of STAT3 in mediating the
cell growth, differentiation and survival signals relayed
through the IL-6 family of cytokine receptors. Oncogene 2000,
19:2548-2556
17. Mittrucker HW, Matsuyama T, Grossman A, Kundig TM, Potter J,
Shahinian A, Wakeham A, Patterson B, Ohashi PS, Mak TW: Re-
quirement for the transcription factor LSIRF/IRF4 for ma-
ture B and T lymphocyte function. Science 1997, 275:540-543
18. Khan J, Wei JS, Ringner M, Saal LH, Ladanyi M, Westermann F,
Berthold F, Schwab M, Antonescu CR, Peterson C, Meltzer PS:
Classification and diagnostic prediction of cancers using gene
expression profiling and artificial neural networks. Nat Med 2001,
7:673-679
19. Horimoto K, Toh H: Statistical estimation of cluster bounda-
ries in gene expression profile data. Bioinformatics 2001,
17(12):1143-1151
20. Nguyen DV, Rocke DM: Tumor classification by partial least
squares using microarray gene expression data. Bioinformatics
2002, 18(1):39-50
21. Fukunaga K: Introduction to statistical pattern recognition.
(Edited by: Rheinboldt W) Boston, Academic Press 1990