Clustering of the SOM easily reveals distinct gene expression patterns: Results of a reanalysis of lymphoma study

December 2002
BMC Bioinformatics 3(1):36

DOI:10.1186/1471-2105-3-36

Source
PubMed

License
CC BY 4.0

Authors:

Junbai Wang

Oslo University Hospital

Jan Delabie

University of Toronto

Hans-Christian Aasheim

Kristiania University College

Erlend Smeland

Oslo University Hospital

BioMed Central

Research article

Clustering of the SOM easily reveals distinct gene expression patterns: results of a reanalysis of lymphoma study

Junbai Wang*, Jan Delabie, Hans Christian Aasheim, Erlend Smeland and Ola Myklebost

Address: Department of Tumor Biology, Norwegian Radium Hospital, N0310 Oslo, Norway; Department of Pathology, Norwegian Radium Hospital, N0310 Oslo, Norway; and Department of Immunology, Norwegian Radium Hospital, N0310 Oslo, Norway

E-mail: Junbai Wang* - junbaiw@radium.uio.no; Jan Delabie - jan.delabie@labmed.uio.no; Hans Aasheim - h.c.asheim@labmed.uio.no; Erlend Smeland - e.b.smeland@labmed.uio.no; Ola Myklebost - olam@ulrik.uio.no

*Corresponding author

Abstract

Background: Methods to evaluate and analyze the massive data generated by series of microarray experiments are of utmost importance for revealing hidden patterns of gene expression. Because of the complexity and high dimensionality of microarray gene expression profiles, the dimensional reduction of raw expression data and the feature selection necessary for, for example, classification of disease samples remain a challenge. To solve the problem we propose a two-level analysis. First, a self-organizing map (SOM) is used. The SOM is a vector quantization method that simplifies and reduces the dimensionality of the original measurements and visualizes each individual tumor sample in a SOM component plane. Next, hierarchical clustering and K-means clustering are used to identify patterns of gene expression useful for classification of samples.

Results: We tested the two-level analysis on public data from diffuse large B-cell lymphomas. The analysis easily distinguished major gene expression patterns without the need for supervision: a germinal center-related, a proliferation, an inflammatory and a plasma cell differentiation-related gene expression pattern. The first three patterns matched the patterns described in the original publication using supervised clustering analysis, whereas the fourth one was novel.

Conclusions: Our study shows that by using SOM as an intermediate step to analyze genome-wide gene expression data, the gene expression patterns can more easily be revealed. The "expression display" by the SOM component plane summarises the complicated data in a way that allows the clinician to evaluate the classification options rather than giving a fixed diagnosis.

Background

The development and progression of cancer are accompanied by complex changes in the patterns of gene expression, which can be revealed by DNA microarray analysis [1]. However, to reliably identify expression patterns associated with tumor type, prognosis or therapy, hundreds of samples need to be studied, and powerful data mining tools are needed. Microarray experiments are generally performed without an a priori hypothesis. Therefore, data mining tools have to be developed that reveal a maximum of information to generate new hypotheses [9] with minimal supervision. Hierarchical clustering is a frequently used method [2–4], but has a number of shortcomings [5,6]. Notably, the most important genes defining the branches of the clustering tree are not readily recognized, and important patterns can be lost due to the deterministic nature of clustering or the high dimensionality of the data.

To solve this problem, we propose a two-level analysis [14] for the study of complex gene expression data. This analysis summarizes the data by the SOM component plane, and then clusters the SOM to investigate the feature gene expression patterns. The SOM reduces the dimensionality of the data, and thereby allows the data to be displayed easily and the gene expression patterns to be revealed. Visual inspection of the gene expression patterns in each single case, and comparison of those patterns between the different cases, allows the identification of common patterns in gene expression that might have been lost by directly applying hierarchical clustering to the data. In addition, by K-means clustering of the SOM, genes that have similar expression patterns, and might therefore be functionally related, may be identified.

Published: 24 November 2002

BMC Bioinformatics 2002, 3:36

Received: 14 June 2002

Accepted: 24 November 2002

This article is available from: http://www.biomedcentral.com/1471-2105/3/36

for any purpose, provided this notice is preserved along with the article's original URL.

Figure 1

Classification of samples by SOM analysis and K-means clustering. a) SOM component planes are shown for 42 DLBCL samples and three DLBCL cell lines (OCILy3, OCILy10 and OCILy1). The SOM map size is 22 × 14; the color scale of each SOM component plane represents the mean ratio in each map node, where red indicates high expression and blue indicates low expression. See supplementary information for full data. b) K-means clustering of the SOM and mean SOM component planes for DLBCL, FL and CLL. The cluster numbers are given, and the genes contained within each SOM node and K-means cluster are listed in the web supplement [13]; selected genes from clusters 10, 11 and 1, 7, 9 are listed in Table 1.

Table 1: Selected genes grouped to clusters 1, 7, 9, 10 and 11 of the K-means clustering of the SOM. The full list can be found in the web supplement [13].

Cluster No. | Clone ID | Gene Description
Cluster 1 | 100 | Ki67 (long type)
Cluster 1 | 1287099 | Survivin = apoptosis inhibitor = effector cell protease EPR-1
Cluster 1 | 108294, 1287528 | XRCC9 = DNA repair protein
Cluster 1 | 950690, 824709 | Cyclin A
Cluster 1 | 563130, 824060 | Cyclin B1
Cluster 1 | 1288839, 325880 | Tubulin-beta
Cluster 1 | 1240822, 588637 | Actin = cytoskeletal gamma-actin
Cluster 1 | 683084 | Cyclin E2
Cluster 1 | 1356512 | Similar to MCM2 = DNA replication licensing factor
Cluster 1 | 703757 | MPP1 = Putative M phase phosphoprotein 1
Cluster 1 | 1240595 | Tubulin-alpha
Cluster 1 | 1341540, 781047 | BUB1 = putative mitotic checkpoint protein ser/thr kinase
Cluster 7 | 789182 | PCNA = proliferating cell nuclear antigen
Cluster 7 | 1288183, 235938 | BAK = BCL-2 family member
Cluster 7 | 80592 | Syndecan-1
Cluster 7 | 469256, 1322301 | Bag-1 = Bcl-2 interacting anti-apoptotic protein = RAP46 = Glucocorticoid receptor-associated protein
Cluster 7 | 525540 | BCL-3
Cluster 7 | 1338456, 364941 | C-myc binding protein
Cluster 7 | 784012 | 40S ribosomal protein S21
Cluster 7 | 324144 | Ribosomal protein S29
Cluster 7 | 1087015, 1240788 | Ribosomal protein S9
Cluster 7 | 510395 | Ribosomal protein S16
Cluster 7 | 272185 | Ribosomal protein L27
Cluster 7 | 1335421 | Similar to ribosomal protein L37a
Cluster 7 | 1368302 | Ribosomal protein L32
Cluster 9 | 46778 | BCL-XL
Cluster 9 | 814478, 1353675 | A1 = Bfl-1 = GRs = Bcl-2 related protein
Cluster 9 | 270770, 1272196 | IRF-4 = LSIRF = Mum1 = homologue of Pip = Lymphoid-specific interferon regulatory factor = Multiple myeloma oncogene 1
Cluster 9 | 1290353 | Similar to TREB and X box binding protein 1
Cluster 9 | 145093 | MCL1 = myeloid cell differentiation protein
Cluster 10 | 701606, 1286850, 200814 | CD10 = CALLA = Neprilysin = enkephalinase
Cluster 10 | 1337241, 306139 | BCL-7A
Cluster 10 | 1340526, 712395 | BCL-6
Cluster 10 | 824476, 95093, 1350545 | Spi-B transcription factor
Cluster 10 | 1335782, 13194072, 1338245 | Oct-2 = lymphoid-specific octamer binding transcription factor = POU
Cluster 10 | 278808 | Spi-1 = PU.1 = ets family transcription factor
Cluster 10 | 50214 | CD86 = B7-2 = CD28 and CTLA-4 counter-receptor 2
Cluster 11 | 753794 | BLC = BCA-1 = B lymphocyte chemoattractant BLC = CXC chemokine
Cluster 11 | 1326652 | CD2
Cluster 11 | 245959 | SDF-1 = Stromal cell-derived factor 1 = chemokine
Cluster 11 | 159946 | CD14 = monocyte differentiation antigen
Cluster 11 | 1130062 | CD3E antigen, epsilon polypeptide
Cluster 11 | 258802, 470615 | CD64 = high affinity immunoglobulin gamma FC receptor I A form precursor = FC-gamma
Cluster 11 | 377560 | CD3 delta = T cell surface glycoprotein
Cluster 11 | 505569 | T cell receptor beta chain
Cluster 11 | 23435, 1306024 | CD11C = leukocyte adhesion protein p150,95 alpha subunit = integrin alpha-X
Cluster 11 | 1219244, 57, 1071581 | RANTES = chemokine
Cluster 11 | 472180 | S100 calcium binding protein A4 = Placental calcium binding protein = Calvasculin
Cluster 11 | 701290 | C-C chemokine receptor 5 = CC CK5
Cluster 11 | 47509 | Major histocompatibility complex, class II, DN alpha

To test the power of this two-level approach, we applied it to the analysis of a publicly available gene expression data set of non-Hodgkin's lymphomas, including mostly diffuse large B-cell lymphoma (DLBCL), follicular lymphoma (FL) and chronic lymphocytic leukaemia (CLL). K-means clustering of the SOM readily identifies four distinct gene expression profiles: germinal center-related, proliferation, inflammatory and plasma cell differentiation-related gene expression patterns. All identified gene expression patterns correlate with clinical survival.

Results

The expression data [10] were filtered and preprocessed as described in Methods and subjected to SOM. The Davies-Bouldin index was used to select 12 as the optimum number of clusters in K-means clustering of the SOM [14]. Figure 1b shows the K-means clustering of the SOM with map size 22 × 14. The number of map units was chosen as $M = 5N^{0.5}$, where N is the number of genes; after M has been determined, the map size is determined by setting the ratio between the number of columns and the number of rows of map units equal to the ratio of the two biggest eigenvalues of the training data, such that their product is as close to M as possible [11]. Each hexagonal node of the SOM is a prototype vector representing a local average of the data, and nearby nodes have similar prototype vectors. The genes included in each cluster can be found in the supplement [13].
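The map-size heuristic can be illustrated with a short Python sketch. It is not the SOM Toolbox code; the function name `som_map_size` and the eigenvalue ratio passed to it are assumptions made for illustration (the paper does not report the eigenvalues), and only N = 3906 comes from the text.

```python
import math

def som_map_size(n_items, eigen_ratio):
    """Return (columns, rows) of a SOM grid from the heuristic in the text.

    n_items     -- N, the number of genes (training vectors)
    eigen_ratio -- ratio of the two largest eigenvalues of the training
                   data (hypothetical value below, not from the paper)
    """
    target_units = 5 * math.sqrt(n_items)  # M = 5 * N^0.5
    # Solve columns / rows = eigen_ratio with columns * rows close to M.
    rows = max(1, round(math.sqrt(target_units / eigen_ratio)))
    columns = max(1, round(target_units / rows))
    return columns, rows

# N = 3906 genes as in the paper; the ratio 22 / 14 is an assumed input.
columns, rows = som_map_size(3906, eigen_ratio=22 / 14)
print(columns, rows, columns * rows)
```

With N = 3906 the target is about 312 map units, and under the assumed ratio the sketch yields a 22 × 14 grid (308 units), the map size reported in the paper.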

Through the proposed two-level approach, one may directly observe the gene expression patterns of the different lymphoma subtypes, i.e. DLBCL, CLL and FL (figure 1b). As can be seen from figure 1a, DLBCL primarily showed four prominent gene expression patterns, distinguished by gene clusters 10, 11, 1 and the large group formed by clusters 7 and 9. More detailed illustrations of the distinct gene expression patterns are shown in the supplement [13], and a summary of the genes included in these clusters is given in Table 1. Cluster 10 contains genes known to be expressed in germinal center B cells, such as FAK, WIP, CD10, CD27, CD38, FMR2, BCL-6 and BCL-7A. Cluster 11 contains genes specifically expressed by T cells (among others CD3, CD2, TCR), NK cells (among others NK4), macrophages (among others CD14, CD63, CD64, CD115) and lymph node dendritic cells (among others S100). Also included are genes coding for chemokines and chemokine receptors (RANTES, BLC, IP10, SLC, FPR, STRL33.1 and MIP1), which play a major role in the chemoattraction of inflammatory cells. Furthermore, DLBCL variably express genes in the adjacent clusters 1, 7 and 9 (figure 1a). Cluster 1 includes genes associated with proliferation (Ki67, cyclin A, BUB1, cyclin B1, thymidine kinase), whereas clusters 7 and 9 include genes associated with cell survival (Bcl-XL, defender against cell death 1, Bfl-1, BAK, Bag-1, MCL1) and plasma cell differentiation (XBP-1, STAT3, IRF-4, ribosomal proteins) [10].

We subsequently regrouped the DLBCL cases based on the expression of each of the identified gene expression patterns and studied survival differences between the groups thus formed. We confirmed the better survival (figure 2a) of those cases expressing genes related to the germinal center (gene cluster 10), as reported by Alizadeh et al. [10]. We could furthermore show that there is a significantly improved survival (figure 2b) of cases expressing genes related to inflammation (gene cluster 11). Equally, there is a significantly reduced survival (figure 2c) of cases expressing genes related to cell proliferation, anti-apoptosis and plasma cell differentiation (clusters 1, 7, 9). Interestingly, there is also a significant difference in survival (figure 2d) when cases are subdivided using a combination of gene expression patterns 10 and 1, 7, 9, in spite of the low number of cases. We were further intrigued by the clusters of genes in groups 7 and 9, which apparently are related to plasma cell differentiation and are frequently co-expressed with the genes in cluster 1 (cell proliferation). Hierarchical clustering of DLBCL using only genes in clusters 7 and 9 (figure 3) revealed an interesting pattern of mutually exclusively expressed genes, many of which are of utmost importance for plasma cell differentiation (XBP-1, STAT3, IRF-4), as well as genes coding for ribosomal proteins, known to be highly expressed in plasma cells. Of interest are the two mutually exclusive patterns of plasma cell differentiation in DLBCL, suggesting either different pathways of plasma cell differentiation or different stages of differentiation.

Figure 1b shows the mean SOM component planes of CLL and FL. Typically for CLL, the genes in the whole lower part of the SOM are highly expressed, while for FL the genes in the lower and middle left part of the SOM (cluster 10) are highly expressed. Therefore, the most prominent distinction between CLL and FL lies in the expression of genes that are characteristic of germinal center B cells (cluster 10), as has also been suggested by Alizadeh et al. [10].

Discussion

When microarray measurements are presented in random order, the patterns of gene expression are impossible to discern by eye, and methods like hierarchical clustering are frequently used to sort the measurements in such a way that many patterns can easily be visualized, such as in figure 3. However, this method suffers from several shortcomings [14], of which the most important is the loss of information on potentially important patterns in a high-dimensional gene space. Although the number of measured genes is large, there may only be a few underlying gene components that account for most of the response variation; for example, only a few linear combinations of a subset of genes may account for nearly all of the expression variation among various tumor types. In such a situation, dimension reduction is needed to reduce the high-dimensional gene space to a low-dimensional gene component space; for instance, principal component analysis [18] and partial least squares [20] have been applied to the dimension reduction of microarray data. We therefore proposed a two-level analysis: first the gene expression data are summarized by a large set of prototypes; then the prototypes are further combined to form the actual clusters. SOM is a suitable method for data reduction since it creates a set of prototype vectors representing the gene expression data and carries out a topology-preserving projection of the prototypes from the high-dimensional gene space onto a low-dimensional map. To preserve the cluster structure of the original data in a low-dimensional map, we can select as many prototype vectors as needed; here the number of prototypes equals $5N^{0.5}$ (N is the number of genes) [14]. The map follows the probability density function of the data and is very robust with regard to missing data points [7]. Furthermore, the component plane of the SOM can be used as a visualization surface for showing different features of the SOM (and thus of the gene expression data), for example the cluster structure [14]. By clustering the SOM, a good insight into the cluster structure (and thus into the feature gene expression patterns) can be obtained.

Figure 2

Clinically distinct DLBCL subgroups defined by gene expression profiling. a) Kaplan-Meier plot of overall survival of DLBCL patients grouped on the basis of gene expression profiling in K-means cluster 10. b) Kaplan-Meier plot of overall survival of DLBCL patients grouped on the basis of gene expression profiling in K-means cluster 11. c) Kaplan-Meier plot of overall survival of DLBCL patients grouped on the basis of gene expression profiling in K-means cluster (1,7,9). d) Kaplan-Meier plot of overall survival of DLBCL patients grouped on the basis of gene expression profiling in K-means cluster 10 and cluster (1,7,9).

Figure 3

Selected genes from K-means clusters. Hierarchical clustering of 72 selected genes from K-means clusters 1, 7 and 9. Depicted are the measurements of gene expression from DLBCL, FL and CLL samples. The dendrogram is colour coded according to the category of sample studied (see upper right key). Each row represents a separate cDNA clone on the microarray and each column a separate mRNA sample. The squares represent the ratio of hybridisation of fluorescent cDNA probes prepared from each experimental mRNA sample to a reference mRNA sample. These ratios are a measure of relative gene expression: red indicates high expression, green indicates low expression and grey indicates missing or excluded data. See supplementary information for full data [13].

We applied this two-level approach to the analysis of a set of DLBCL samples that have previously been published. Inspection of the maps obtained through our analysis clearly reveals four major gene expression patterns. One pattern concerns genes expressed by germinal center B cells (cluster 10); the second could be called an 'inflammatory' pattern and relates to genes expressed by T cells and macrophages (cluster 11). The third pattern is an extensive collection of genes involved in cell proliferation (cluster 1), which seems to be closely linked to the fourth pattern, anti-apoptosis and plasma cell differentiation-related genes (clusters 7, 9). This last pattern has not previously been described, whereas the others were also discovered by Alizadeh et al. [10] by using hierarchical clustering only.

The survival data based on the grouping of cases according to the different gene expression patterns show that all these expression patterns were significantly correlated with survival (figure 2a, 2b, 2c). When the germinal center B cell gene expression pattern (cluster 10) is combined with the proliferation/anti-apoptosis/plasma cell differentiation pattern (clusters 1, 7, 9), thus yielding four groups (figure 2d), significant differences in survival are still seen notwithstanding the low number of cases. It is of particular interest that all but one of the cases expressing high levels of germinal center B cell genes but low levels of proliferation/anti-apoptosis/plasma cell genes survive beyond 5 years (figure 2d). This contrasts sharply with the cases expressing low levels of germinal center B cell genes but high levels of proliferation/anti-apoptosis and plasma cell differentiation genes, none of which survive beyond 5 years. Although these data need to be confirmed in larger series of cases, a division of DLBCL according to expression of a combination of genes relating to the germinal center, proliferation, anti-apoptosis and plasma cell differentiation seems to be very relevant in predicting prognosis. Why genes related to cell proliferation, anti-apoptosis and plasma cell differentiation are frequently co-expressed in DLBCL is not known and needs to be further investigated. It is apparent from our further analysis (figure 3) that there are two mutually exclusive patterns of gene expression related to plasma cell differentiation. One pattern contains the transcription factors IRF4 and XBP-1, which have both been shown to be important for plasma cell differentiation, as well as STAT3, which is part of the IL-6 signaling pathway involved in plasma cell differentiation [15–17]. The other pattern shows many unknown genes in addition to genes coding for ribosomal proteins. The latter suggests an expression pattern related to a later stage of plasma cell differentiation. These patterns are intriguing, but more studies on normal plasma cell differentiation are needed for them to be fully understood.

In conclusion, we propose a two-level approach for the analysis of gene expression patterns, where the clustering analysis is carried out on a set of summarized prototype vectors created by SOM. When the two-level approach was applied to the DLBCL data set [10], the discovered gene expression patterns were consistent with the ones originally published. In addition, a novel pattern of gene expression related to plasma cell differentiation was revealed. Our results underscore the value of the two-level analysis for discovering gene expression patterns, and the method should be useful as a part of routine classification of clinical samples once the suggested subdivisions have been confirmed in large studies.

Methods

Sources of experimental data

All experimental data, including the survival data of the lymphoma patients, were obtained from the web supplement to the publication of Alizadeh et al. [10] [http://llmpp.nih.gov/lymphoma/data.shtml].

Preprocessing of data

The data were cleaned before any data mining was done. Cleaning included flagging and removal of bad measurements, i.e. measurements where the fluorescent intensity in one channel was less than 1.4 times the local background were discarded [10], and replacement of the values for identical probes (same IMAGE number and gene) with their mean ratio. After cleaning the original data, we were left with values for 3906 genes from 96 samples, and these ratios were log2-transformed.
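The cleaning steps can be sketched in Python as follows. This is an illustration, not the authors' code: the function names and the spot layout `(red, green, red_background, green_background)` are hypothetical, and two choices are assumptions where the text leaves room for interpretation: a spot is discarded when either channel falls below 1.4 times its local background, and the ratio is taken as red over green.

```python
import math

def filtered_ratio(red, green, red_bg, green_bg, factor=1.4):
    """Return red / green, or None if the spot fails the intensity filter."""
    # Assumption: the spot is bad if EITHER channel is below factor * background.
    if red < factor * red_bg or green < factor * green_bg:
        return None
    return red / green

def probe_log_ratio(spots, factor=1.4):
    """log2 of the mean ratio over all good spots of one probe.

    spots -- list of (red, green, red_bg, green_bg) tuples for identical
             probes; returns None when no spot passes the filter.
    """
    ratios = [filtered_ratio(*spot, factor=factor) for spot in spots]
    kept = [r for r in ratios if r is not None]
    if not kept:
        return None
    return math.log2(sum(kept) / len(kept))

# Three hypothetical spots of one probe; the last one is too dim and is dropped.
spots = [(1400, 700, 100, 100), (600, 300, 100, 100), (100, 1000, 100, 100)]
print(probe_log_ratio(spots))
```

Returning `None` for discarded measurements keeps missing values explicit, which matters because later steps must tolerate gaps in the data.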

Hierarchical clustering

Hierarchical clustering [12] is an agglomerative clustering method that usually has the following steps: 1) Initialization: assign each vector (the series of values from a single sample) to its own cluster. 2) Computation of the distance between all clusters. 3) Merging of the two clusters that are closest to each other. Steps 2 and 3 are repeated until there is only one cluster left. In this work, log2-transformed ratios were median-centered before clustering, Pearson correlation was used as the distance metric and the centered average linkage method was used for merging. Hierarchical clustering was applied to both rows and columns using the Cluster and TreeView software from Stanford [2].

Self-organizing map (SOM) and K-means clustering

The basic SOM [7] consists of m neurons located on a regular low-dimensional grid, usually 1- or 2-dimensional. The lattice of the grid is hexagonal. The basic SOM algorithm is iterative. Each neuron i has a d-dimensional prototype vector $m_i = [m_{i1}, \ldots, m_{id}]$, where d is the input vector dimension. Before the training phase, initial values are given to the prototype vectors, typically by linear initialization. At each training step, a sample data vector x is randomly chosen from the training set and the distances between x and all the prototype vectors are computed. The closest prototype (the best-matching unit) and its neighbors on the grid are then moved toward x. During training, the SOM behaves like a flexible net that folds onto the "cloud" formed by the training data. Because of the neighborhood relations, neighboring prototypes are pulled in the same direction, and thus prototype vectors of neighboring units resemble each other [11]. To inspect the cluster structure of the map, the SOM component plane (figure 1) was used to show the gene expression features of various tumor samples, and also the common gene expression patterns of each tumor type. Each component plane can be thought of as a slice of the map: it consists of the values of a single vector component in all map units. It is visualized as a 2-dimensional color image, where the color of a map unit corresponds to its value. By visualizing the spread of values of that component and comparing component planes with each other, correlations are revealed as similar patterns in identical positions of the component planes. Based on the overall view, it is easy to select interesting component combinations and map units for further investigation. To study interesting groups of map units more effectively, methods that give good candidates for map unit clusters or groups are required. Thus, the trained prototype vectors of the SOM are further clustered by K-means clustering and combined to form the actual clusters; a more detailed description of clustering of the SOM can be found in an earlier paper [14].
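A single training step and the extraction of a component plane can be sketched as follows. This is a simplified illustration under stated assumptions, not the SOM Toolbox implementation: it uses the sequential update rule rather than the batch algorithm used in this work, measures grid distance on plain coordinates rather than a hexagonal lattice, and the names `train_step`, `component_plane`, `learning_rate` and `width` are hypothetical.

```python
import math

def train_step(prototypes, grid, x, learning_rate, width):
    """Move the best-matching unit and its grid neighbors toward x (in place).

    prototypes -- list of d-dimensional prototype vectors, one per map unit
    grid       -- list of map coordinates, one per map unit
    Returns the index of the best-matching unit.
    """
    # Best-matching unit: the prototype closest to x in data space.
    bmu = min(range(len(prototypes)), key=lambda i: math.dist(prototypes[i], x))
    for i, proto in enumerate(prototypes):
        # Gaussian neighborhood, measured on the map grid, not in data space.
        h = math.exp(-math.dist(grid[i], grid[bmu]) ** 2 / (2 * width ** 2))
        prototypes[i] = [m + learning_rate * h * (v - m) for m, v in zip(proto, x)]
    return bmu

def component_plane(prototypes, k):
    """Values of vector component k (e.g. one tumor sample) in all map units."""
    return [proto[k] for proto in prototypes]

# Two map units side by side, with 2-dimensional prototype vectors.
prototypes = [[0.0, 0.0], [1.0, 1.0]]
grid = [(0, 0), (0, 1)]
bmu = train_step(prototypes, grid, [1.0, 1.0], learning_rate=0.5, width=1.0)
print(bmu, prototypes, component_plane(prototypes, 0))
```

In the setting of the paper each training vector is one gene measured across the samples, so component k of the prototypes, laid out on the grid and colored by value, gives the component plane of sample k.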

K-means clustering is a partitional clustering method: it classifies the data into k groups, which together satisfy the requirements of a partition: (1) Each group must contain at least one object. (2) Each object must belong to only one group. To select the best k among different partitions, each of these can be evaluated using some kind of validity index. In our calculations, we used the Davies-Bouldin index [11], which is based on the ratio between within-cluster distance and between-cluster distance; low values indicate good clustering results for spherical clusters. Because no unified theory for determining the number of clusters has been fully developed and accepted, the selection of the optimal number of clusters remains an active research field [19,21]. Thus, the Davies-Bouldin index used here is only a guideline to estimate the best clustering among the partitionings with different numbers of clusters. Some problems need to be noted when clustering the SOM by K-means clustering, due to the properties of the algorithm: it searches not only for spherical clusters but also for clusters with roughly equal numbers of samples, so a non-spherical cluster may not be properly recognized as one cluster; and as the number of clusters is increased, the number of samples in each cluster decreases, which makes the algorithm more sensitive to outliers. Therefore, the results obtained by K-means clustering have to be verified carefully [14].
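For illustration, the Davies-Bouldin index of a given partition can be computed as in the Python sketch below. It follows the commonly used definition (for each cluster, the worst ratio of summed within-cluster scatter to centroid distance, averaged over clusters) with Euclidean distance; the exact variant implemented in the SOM Toolbox may differ in detail, and the function name is hypothetical.

```python
import math

def davies_bouldin(points, labels):
    """Davies-Bouldin index of a partition; lower is better.

    points -- list of equal-length coordinate lists
    labels -- cluster label of each point (at least two distinct labels)
    """
    groups = {}
    for point, label in zip(points, labels):
        groups.setdefault(label, []).append(point)
    # Centroid and mean distance-to-centroid (scatter) of each cluster.
    centroids = {label: [sum(col) / len(members) for col in zip(*members)]
                 for label, members in groups.items()}
    scatter = {label: sum(math.dist(p, centroids[label]) for p in members) / len(members)
               for label, members in groups.items()}
    total = 0.0
    for i in groups:
        # Worst-case similarity of cluster i to any other cluster.
        total += max((scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
                     for j in groups if j != i)
    return total / len(groups)

# Four toy points forming two tight, well-separated pairs.
points = [[0, 0], [0, 2], [10, 0], [10, 2]]
good = davies_bouldin(points, [0, 0, 1, 1])  # pairs kept together
bad = davies_bouldin(points, [0, 1, 0, 1])   # pairs split apart
print(good, bad)
```

To choose k, one would run K-means for each candidate number of clusters and keep the partition with the lowest index, bearing in mind that the index is a guideline only.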

In this work, SOM and K-means clustering were carried out with the SOM Toolbox for MATLAB [11]. The SOM was trained on the raw expression data using the batch version of the algorithm. All prototype vectors were linearly initialized in the subspace spanned by the two eigenvectors with the greatest eigenvalues computed from the training data. The SOM was trained in two phases: a rough training phase with a large initial neighborhood width and a fine-tuning phase with a small initial neighborhood width. The neighborhood width decreased linearly to 1; the neighborhood function was Gaussian. The training lengths of the two phases were 1 and 4 epochs and the initial neighborhood widths 3 and 1, respectively.

Survival analysis

The statistical treatment of survival times is known as survival analysis. From a set of observed survival times from a sample of individuals, we can estimate the proportion of the population of such people who would survive a given length of time in the same circumstances. The method yields a graph, the Kaplan-Meier survival curve, which is drawn as a "step function" that changes at every distinct survival time. The times of the survival observations are indicated by ticks on the survival curve, which show at a glance the survival times of the surviving subjects (figure 2). To compare the survival experience of two or more groups of subjects, we calculate the logrank test. The logrank test is a hypothesis test of the null hypothesis that the groups being compared are samples from the same population as regards survival experience. It involves calculating the observed and expected numbers of failures in separate time intervals and summing these; comparing the result to a χ² distribution with k-1 degrees of freedom, where k is the number of groups of observations, gives the P value [9]. The plotting of Kaplan-Meier survival curves and the logrank test of the significance level (P value) were implemented in MATLAB.
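The two calculations can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the MATLAB code used in this work: the logrank statistic is taken in its simple form, the sum over groups of (O − E)²/E with O and E the observed and expected numbers of failures, and the P value is computed only for two groups (one degree of freedom), where the χ² tail probability reduces to a complementary error function. The function names `kaplan_meier` and `logrank_two_groups` are hypothetical, and every group is assumed to be at risk at one or more failure times.

```python
from math import erfc, sqrt


def kaplan_meier(times, events):
    """Kaplan-Meier estimate as a list of (time, survival) steps.
    events[i] is 1 for an observed failure and 0 for a censored time."""
    curve, surv = [], 1.0
    for t in sorted({t for t, e in zip(times, events) if e}):
        at_risk = sum(1 for u in times if u >= t)
        failures = sum(1 for u, e in zip(times, events) if u == t and e)
        surv *= 1.0 - failures / at_risk  # step down at each failure time
        curve.append((t, surv))
    return curve


def logrank_two_groups(times, events, groups):
    """Logrank statistic sum((O - E)^2 / E) and its P value for two groups."""
    labels = sorted(set(groups))
    if len(labels) != 2:
        raise ValueError("this sketch handles exactly two groups")
    observed = {g: 0 for g in labels}
    expected = {g: 0.0 for g in labels}
    for t in sorted({t for t, e in zip(times, events) if e}):
        at_risk = {g: sum(1 for u, h in zip(times, groups)
                          if u >= t and h == g) for g in labels}
        failed = {g: sum(1 for u, e, h in zip(times, events, groups)
                         if u == t and e and h == g) for g in labels}
        n_total = sum(at_risk.values())
        d_total = sum(failed.values())
        for g in labels:
            observed[g] += failed[g]
            # Failures expected in group g if both groups share one hazard.
            expected[g] += d_total * at_risk[g] / n_total
    stat = sum((observed[g] - expected[g]) ** 2 / expected[g] for g in labels)
    p_value = erfc(sqrt(stat / 2.0))  # chi-square tail, 1 degree of freedom
    return stat, p_value
```

For more than two groups the same observed and expected counts apply, but the P value then requires the χ² distribution with k-1 degrees of freedom from a statistics library.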

Authors' contributions

Junbai Wang carried out the data mining studies, performed the microarray data analysis, implemented the MATLAB code for survival analysis and drafted the manuscript. Jan Delabie carried out the biological studies of the discovered gene expression patterns, participated in the data analysis and drafted part of the manuscript. Hans Christian Aasheim and Erlend Smeland participated in the validation of the microarray analysis. Ola Myklebost conceived the study and participated in its design and coordination.

Acknowledgements

This work was supported by the Norwegian Cancer Society [http://www.kreft.no].

References

1. DeRisi J, Penland L, Brown PO, Bittner ML, Meltzer PS, Ray M, Chen Y, Su YA, Trent JM: Use of a cDNA microarray to analyze gene expression patterns in human cancer. Nat Genet 1996, 14:457-460

2. Eisen MB, Spellman PT, Brown PO, Botstein D: Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA 1998, 95:14863-14868

3. Iyer VR, Eisen MB, Ross DT, Schuler G, Moore T, Lee JC, Trent JM, Staudt LM, Hudson J Jr, Boguski MS, Lashkari D, Shalon D, Botstein D, Brown PO: The transcriptional program in the response of human fibroblasts to serum. Science 1999, 283:83-87

4. Perou CM, Sorlie T, Eisen MB, van de Rijn M, Jeffrey SS, Rees CA, Pollack JR, Ross DT, Johnsen H, Akslen LA, Fluge O, Pergamenschikov A, Williams C, Zhu SX, Lonning PE, Borresen-Dale AL, Brown PO, Botstein D: Molecular portraits of human breast tumours. Nature 2000, 406:747-752

5. Tamayo P, Slonim D, Mesirov J, Zhu Q, Kitareewan S, Dmitrovsky E, Lander ES, Golub TR: Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc Natl Acad Sci USA 1999, 96:2907-2912

6. Kaufman L, Rousseeuw PJ: Finding groups in data, An introduction to cluster analysis. (Edited by: Kaufman L. Brussels) John Wiley & Sons 1991

7. Kohonen T: Self-organizing maps. (Edited by: Lotsch HKV) Berlin, Springer 1997, 117

8. Toronen P, Kolehmainen M, Wong G, Castren E: Analysis of gene expression data using self-organizing maps. FEBS Letters 1999, 451:142-146

9. Altman DG: Practical statistics for medical research. (Edited by: Altman DG) London, Chapman and Hall 1991

10. Alizadeh AA, Eisen MB, Davis RE, Ma C, Lossos IS, Rosenwald A, Boldrick JC, Sabet H, Tran T, Yu X, Powell JI, Yang L, Marti GE, Moore T, Hudson J Jr, Lu L, Lewis DB, Tibshirani R, Sherlock G, Chan WC, Greiner TC, Weisenburger DD, Armitage JO, Warnke R, Staudt LM: Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 2000, 403:503-511

11. Vesanto J: SOM-Based data visualization methods. Intelligent Data Analysis journal 1999

12. Everitt BS: Cluster Analysis. (Edited by: Edward Arnold) London, John Wiley & Sons 1987

13. Junbai Wang, Jan Delabie, Hans Christian Aasheim, Erlend Smeland, Ola Myklebost: Supplementary information for "Reanalysis of global gene expression patterns from Diffuse Large B-Cell Lymphoma by a two-level strategy reveals novel subtypes" 2001 [http://matrise.uio.no/supDLBCL/Supview.html]

14. Vesanto J, Alhoniemi E: Clustering of the self-organizing map. IEEE TNN 2000, 11(3):586-600

15. Reimold AM, Iwakoshi NN, Manis J, Vallabhajosyula P, Szomolanyi-Tsuda E, Gravallese EM, Friend D, Grusby MJ, Alt F, Glimcher LH: Plasma cell differentiation requires the transcription factor XBP-1. Nature 2001, 412:300-307

16. Hirano T, Ishihara K, Hibi M: Roles of STAT3 in mediating the cell growth, differentiation and survival signals relayed through the IL-6 family of cytokine receptors. Oncogene 2000, 19:2548-2556

17. Mittrucker HW, Matsuyama T, Grossman A, Kundig TM, Potter J, Shahinian A, Wakeham A, Patterson B, Ohashi PS, Mak TW: Requirement for the transcription factor LSIRF/IRF4 for mature B and T lymphocyte function. Science 1997, 275:540-543

18. Khan J, Wei JS, Ringner M, Saal LH, Ladanyi M, Westermann F, Berthold F, Schwab M, Antonescu CR, Peterson C, Meltzer PS: Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nat Med 2001, 7:673-679

19. Horimoto K, Toh H: Statistical estimation of cluster boundaries in gene expression profile data. Bioinformatics 2001, 17(12):1143-1151

20. Nguyen DV, Rocke DM: Tumor classification by partial least squares using microarray gene expression data. Bioinformatics 2002, 18(1):39-50

21. Fukunaga K: Introduction to statistical pattern recognition. (Edited by: Rheinboldt W) Boston, Academic Press 1990

