
Using the self organizing map for clustering of text documents

July 2009
Expert Systems with Applications 36(5):9584-9591


DOI:10.1016/j.eswa.2008.07.082

Source
DBLP

Authors:

Vish P Kallimani

Universiti Teknologi PETRONAS

Lam Hong Lee

Simplify Networks, Malaysia

An increasing number of computational and statistical approaches have been used for text classification, including nearest-neighbor classification, naïve Bayes classification, support vector machines, decision tree induction, rule induction, and artificial neural networks. Among these approaches, naïve Bayes classifiers have been widely used because of their simplicity. Due to the simplicity of the Bayes formula, the naïve Bayes classification algorithm requires a relatively small amount of training data and less time in both the training and classification stages than other classifiers. However, a major shortcoming of this technique is that the classifier picks the highest probability category as the one to which the document is annotated. Doing this is tantamount to classifying using only one dimension of a multi-dimensional data set. The main aim of this work is to utilize the strengths of the self organizing map (SOM) to overcome the inadvertent dimensionality reduction that results from using only the Bayes formula to classify. Combining the hybrid system with new ranking techniques further improves the performance of the proposed document classification approach. This work describes the implementation of an enhanced hybrid classification approach which affords better classification accuracy through the use of two familiar algorithms: the naïve Bayes classification algorithm, which is used to vectorize the document using a probability distribution, and the self organizing map (SOM) clustering algorithm, which is used as the multi-dimensional unsupervised classifier.

[Figures: The SOM model. Classification accuracy of the naïve Bayes algorithms and hybrid algorithms with and without HRKE. Cluster formation according to distance measurement (round robin).]



Dino Isa, V.P. Kallimani, Lam Hong Lee

Faculty of Engineering and Computer Science, University of Nottingham, Malaysia Campus, 43500 Semenyih, Malaysia

Keywords: Bayesian; Self organizing maps; Clusters similarity


1. Introduction

Document classification can be defined as the task of learning methods for categorizing collections of electronic documents into their automatically annotated classes, based on their contents. For several decades now, document classification in the form of text classification systems has been widely implemented in numerous applications such as spam filtering (Cunningham, Nowlan, Delany, & Haahr, 2003; Delany, Cunningham, & Coyle, 2005; Delany, Cunningham, Tsymbal, & Coyle, 2004; O'Brien & Vogel, 2002; Provost, 1999; Sahami, Dumais, Heckerman, & Horvitz, 1998; Wei, 2003), e-mail categorization (Kamens, 2005; Xia, Liu, & Guthrie, 2005; Brucher, Knowlmayer, & Mittermayer, 2002), formation of knowledge repositories (Hartley, Isa, Kallimani, & Lee, 2006), and ontology mapping (Su, 2002). An increasing number of statistical approaches have been developed for document classification, including k-nearest-neighbor classification (Han, Karypis, & Kumar, 1999), naïve Bayes classification (McCallum & Nigam, 2003), support vector machines (Chakrabarti, Roy, & Soundalgekar, 2003; Joachims, 1998), maximum entropy (Nigam, Lafferty, & McCallum, 1999), decision tree induction, rule induction, and artificial neural networks.

Each of the document classification schemes mentioned above has unique properties. The decision tree induction and rule induction algorithms are simple to understand and interpret after a brief explanation. However, these algorithms do not work well when the number of distinguishing features is large (Quinlan, 1993). The k-nearest-neighbor (k-NN) algorithm is easy to implement and has shown its effectiveness in a variety of problem domains (Han et al., 1999). However, a major drawback of the k-NN algorithm is that it is computationally intensive, especially as the size of the training set grows (Han et al., 1999). Support vector machines (SVMs) can be used as discriminative document classifiers, and these have been shown to be more accurate in classification tasks. The good generalization property of the SVM is due to the implementation of structural risk minimization, which entails finding a hyper-plane that guarantees the lowest classification error (Vapnik, 1995). An ability to learn that is independent of the dimensionality of the feature space (Joachims, 1998) is also an advantage of the SVM. However, the use of SVMs in many real world applications is relatively complex, due to their convoluted training and categorizing algorithms as compared to the naïve Bayes classifier (Chakrabarti et al., 2003; Isa, Prasad, Lee, & Kallimani, 2007; Kim, Rim, Yook, & Lim, 2002).

Among these approaches, the naïve Bayes text classifier has been widely used because of its simplicity in both the training and classifying stages, although this generative method has been reported to be less accurate than discriminative methods such as the SVM (Chakrabarti et al., 2003; Joachims, 1998). However, some researchers have shown that the naïve Bayes classification approach provides an intuitively simple text generation model and performs surprisingly well in many other domains, under specific "ideal" conditions (McCallum & Nigam, 2003). A naïve Bayes classifier is a simple probabilistic classifier based on Bayes' theorem with strong independence assumptions, but these assumptions severely limit its applicability (Flach, Gyftodimos, & Lachiche, 2002). In real life applications, the probability values associated with an event are seldom "independent". For example, even tossing a coin will not have the expected 50:50 chance of a result being either "heads" or "tails", due to factors associated with machining the coin, the different surface textures, different environments, and the different ways and methods used to toss the coin, among other things. If we are lucky, these factors even out over time; if we are not, then the naïve Bayes formula will misclassify frequently. However, depending on the precise nature of the probability model, naïve Bayes classifiers can be trained very efficiently and require a relatively small amount of training data to estimate the parameters necessary for classification. Because independent variables are assumed, only the variances of the variables for each class need to be determined and not the entire covariance matrix.

*Corresponding author. Tel.: +60 3 89248141; fax: +60 3 89248017. E-mail addresses: Dino.Isa@nottingham.edu.my (D. Isa), VP.Kallimani@nottingham.edu.my (V.P. Kallimani), kcx4lhl@nottingham.edu.my (L.H. Lee). Tel.: +60 3 89248116. Tel.: +60 3 89248141.

Naïve Bayes classification is a probabilistic inference approach which has been implemented in mail repositories to remove spam e-mails (Cunningham et al., 2003; Delany et al., 2004, 2005; O'Brien & Vogel, 2002; Provost, 1999; Sahami et al., 1998; Wei, 2003). In this work, the traditional naïve Bayes classification approach is implemented to classify electronic documents into one or more categories, by calculating the probabilistic distribution of the text body of the document in a vector space of features. In the context of classification, Bayes' theorem states that the probability of a particular document being annotated to a particular category, given that the document contains certain words, is equal to the probability of finding those words in that particular category, times the probability that any document is annotated to that category, divided by the probability of finding those words in any document, as illustrated in the equation below:

$$\Pr(\mathrm{Category} \mid \mathrm{Word}) = \frac{\Pr(\mathrm{Word} \mid \mathrm{Category})\,\Pr(\mathrm{Category})}{\Pr(\mathrm{Word})}$$

Each document contains words which are given probabilities based on their number of occurrences within that particular kind of document. Naïve Bayes classification is predicated on the idea that electronic documents can be classified based on the probability that certain keywords will correctly assign a text document to its annotated category. At the basic level, a naïve Bayes classifier examines a set of text documents that have been organized into categories, and compares the content of all categories in order to build a list of words and their occurrences. This list of word occurrences is used to classify new documents into their right categories, according to the probability of occurrence of certain words in the document (Fig. 1b).

The naïve Bayes classifier is attractive compared to other classification methods because of its simplicity: it can "make do" with a small amount of training data to estimate the parameters necessary for classification. The Bayesian classification approach arrives at the correct classification as long as the correct category gives the highest probability value compared to the others. A category's probability does not have to be estimated very well. In other words, the overall classifier is robust enough to ignore serious deficiencies in its underlying naïve probability model (Haykin, 1999). However, the major drawback of the Bayesian classification approach is its relatively low classification performance compared to discriminative algorithms, due to its "single dimensional" nature (classifying by the highest probability category). Therefore, much active research has been carried out to improve the naïve Bayes classifier, enhancing this approach through techniques which add a method of ranking the potential candidates through a tournament structure in the classification task (Isa, Lee, & Kallimani, in press; McCallum & Nigam, 2003).

The self organizing map (SOM) is a clustering method which clusters data based on a similarity measure related to the calculation of Euclidean distances. The idea of this principle is to find a winner-takes-all neuron, that is, the most closely matching case. The SOM was proposed by Kohonen, and is based on the idea that systems can be designed to emulate the collective co-operation of the neurons in the human brain. Collectivism can be realized by feedback and thus can also be realized in the network, where many neighboring neurons react collectively upon being activated by events. If neurons are activated in the learning process, the neighboring neurons are also affected. The network structure is defined by synapses and, after a phase of self organization, has a total arrangement similar to that of the input data of the event space (Negnevitsky, 2002). Consequently, the SOM is an established paradigm in AI and cognitive modeling, being the basis of unsupervised learning. This unsupervised machine learning method is widely used in data mining, visualization of complex data, image processing, speech recognition, process control, diagnostics in industry and medicine, and natural language processing (Michalski, Bratko, & Kubat, 1999).

In summary, the simplicity of implementation of the naïve Bayes classifier is in stark contrast with its poor performance in classification tasks. In this work, this poor performance is improved by using the SOM as the multi-dimensional classifier and the Bayes formula as the feature extractor or vectorizer. Our previous work introduced some specialized algorithms to improve the performance of the naïve Bayes classifier when handling different types of knowledge domains, and thus guarantee a lower error rate compared to using only Bayes' theorem to classify (Isa et al., in press). A so-called flat ranking classification algorithm and a series of tournament structure ranking algorithms have been designed and implemented. Besides this, additional techniques are introduced with the hope of obtaining higher classification accuracy, such as the high relevance keywords extraction (HRKE) facility and the automatically computed document dependent (ACDD) weighting factors (not covered in this paper), in order to overcome some of the weaknesses of the traditional naïve Bayes classification algorithms (Isa et al., in press). We have implemented here a practical system which uses the Bayes formula and various ranking algorithms along with the SOM to automatically classify any electronic document for any database, with 100% accuracy.

2. The hybrid classification approach

We propose, design, implement and evaluate a hybrid classification approach that integrates naïve Bayes classification (with tournament ranking methods) and the SOM, utilizing the simplicity of naïve Bayes to vectorize raw text data based on probability values and the SOM to automatically cluster the vectorized data. The naïve Bayes classifier vectorizes the raw text documents into numerical values, so that the SOM can use the vectorized data in both the training and the categorizing stages. The structure of the proposed classification approach is illustrated in Fig. 1a.

In the training stage of the classifier, the training dataset, which contains well organized training documents, is used by the front-end naïve Bayes classifier. After the naïve Bayes classifier has been trained, each training document is vectorized by the trained naïve Bayes classifier through the calculation of the posterior probability value for each existing category based on the Bayes formula. For example, the probability value for a document X to be annotated to a category C is computed as Pr(C|X). Assuming that we have a category list Cat1, Cat2, Cat3, Cat4, Cat5, ..., CatN, each document has N associated probability values; document X will have Pr(Cat1|X), Pr(Cat2|X), Pr(Cat3|X), Pr(Cat4|X), Pr(Cat5|X), ..., Pr(CatN|X). All the probability values for a document are combined to construct a multi-dimensional numerical array. In this way, all the documents in the training dataset are vectorized into multi-dimensional numerical values, which are used to construct a separate vectorized training dataset for the SOM.

In the classification stage, the categorizing processes are similar to those used for text document vectorization in the training stage. The input to the trained naïve Bayes classifier during the classifying stage is the raw text document which is to be classified. The output from the naïve Bayes classifier, which is vectorized data in the format of multi-dimensional numerical probability values (an "address"), is used as the input to the SOM for the final classification step (Fig. 1b). In this example the address sent to the SOM interface program is 5311, relating to the 50%, 30% and 10% probabilities.
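The text does not state how the probability values are encoded into the address 5311. One reading that reproduces the example is a single digit per category equal to the probability in tenths. The sketch below assumes that encoding, and also assumes that the fourth category carries 10%; the function name `to_address` is hypothetical.

```python
def to_address(probabilities):
    """Encode each probability as one digit (tenths), capped at 9.

    Assumption: this digit-per-category encoding is a guess that
    happens to reproduce the "5311" example; the paper does not
    specify the actual encoding.
    """
    return "".join(str(min(9, round(p * 10))) for p in probabilities)


# Probabilities from the example; the fourth value is an assumption.
address = to_address([0.5, 0.3, 0.1, 0.1])
print(address)
```

Under these assumptions the printed address is 5311. A real SOM input would more likely be the numeric vector itself rather than a digit string.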

2.1. The naïve Bayes classification approach

Our proposed naïve Bayes classifier (Isa et al., in press) performs its classification tasks starting with the initial step of analyzing the text document by extracting words from it. To perform this analysis, a simple word extraction algorithm is used to extract each individual word from the document to generate a list of words. This list is used when the naïve Bayes classifier calculates the probability of each word being annotated to a particular category. The list of words is constructed with the assumption that the input document contains words $w_1, \ldots, w_{n-1}, w_n$, and the length of the document (its number of words) is n.

This list of words is then used to generate a calculation table of words and their probabilities of being annotated to each category for the input text document. The table consists of a column of words and columns holding the probability of each word belonging to each category. The column "Word" is filled with the words extracted from the input document. The values in the probability columns for each category are calculated by the naïve Bayes classifier in the following stage. Table 1 illustrates the use of this method for an input document.

Before the naïve Bayes classifier performs the calculation of word probabilities for each category, it needs to be trained with a set of well categorized documents. Each individual word in all documents from the same category is extracted and recorded in a list of word occurrences, using a simple data structure algorithm based on the computation of the frequency of word occurrence.

Based on the list of word occurrences, the trained classifier is able to calculate the posterior probability values of the words extracted from the input document individually, using the formula for the posterior probability derived from Bayes' theorem and shown below, since each single word from the input document contributes to the probability values of the document being annotated to every existing category:

$$\Pr(\mathrm{Category} \mid \mathrm{Word}) = \frac{\Pr(\mathrm{Word} \mid \mathrm{Category})\,\Pr(\mathrm{Category})}{\Pr(\mathrm{Word})}$$

Fig. 1a. Proposed hybrid approach block diagram.

Fig. 1b. An example of the vectorization results using the naïve Bayes classifier. The document has a 50% chance of being related to the subtopic "Calculus". [Bar chart: probability per subtopic (Calculus, Algebra, Geometry, Graph Theory); bar labels 50%, 30%, 10%.]

The equation above shows that by observing the value of Word, the posterior probability Pr(Category|Word), which represents the probability of the state of nature being a particular category given that feature value, can be calculated based on the Bayes formula. The prior probability Pr(Category) can be computed from the equation below:

$$\Pr(\mathrm{Category}) = \frac{\text{Total of Words in Category}}{\text{Total of Words in Training Dataset}} = \frac{\text{Size of Category}}{\text{Size of Training Dataset}}$$

Meanwhile, the evidence Pr(Word), which is also known as the normalizing constant, is calculated by using the equation:

$$\Pr(\mathrm{Word}) = \frac{\sum \text{occurrence of Word in every Category}}{\sum \text{occurrence of all words in every Category}}$$

The total occurrence of a particular word in every category can be retrieved by searching the relevant data in the training database, namely the lists of word occurrences for every category, generated from the analysis of all training files in the particular category during the initial training stage. The sum of occurrences of all words in every category can also be calculated from the lists of word occurrences for every category.

To calculate the likelihood of a particular category with respect to a particular word, the lists of word occurrences in the training database are consulted to retrieve the occurrence of the particular word in the particular category, and the sum of all words in that particular category. This information contributes to the value of Pr(Word|Category) through the equation:

$$\Pr(\mathrm{Word} \mid \mathrm{Category}) = \frac{\text{occurrence of Word in Category}}{\sum \text{occurrence of all words in Category}}$$

Based on the derived Bayes' formula for text classification, with the value of the prior probability Pr(Category), the likelihood Pr(Word|Category), and the evidence Pr(Word), the posterior probability Pr(Category|Word) of each word in the input document being annotated to each category can be measured. The posterior probability of each word annotated to each category is then used to fill the corresponding cells in the table, as illustrated in Table 1. After all the "Probability" cells have been filled, the overall probability for an input document to be annotated to a particular category is calculated by dividing the sum of each "Probability" column by the total number of words in the input document, n, as shown in the equation below:

$$\Pr(\mathrm{Category} \mid \mathrm{Document}) = \Pr(\mathrm{Category} \mid w_1, \ldots, w_{n-1}, w_n) = \frac{1}{n} \sum_{i=1}^{n} \Pr(\mathrm{Category} \mid w_i)$$

where $w_1, \ldots, w_{n-1}, w_n$ are the words which are extracted from the input document.
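The calculations above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `train`, `posterior` and `vectorize` are hypothetical, documents are assumed to be already tokenized into word lists, and words never seen in training are assumed to contribute a probability of zero.

```python
from collections import Counter


def train(docs_by_category):
    """Build one list of word occurrences (a Counter) per category."""
    return {cat: Counter(w for doc in docs for w in doc)
            for cat, docs in docs_by_category.items()}


def posterior(word, category, counts):
    """Pr(Category|Word) = Pr(Word|Category) * Pr(Category) / Pr(Word)."""
    total_words = sum(sum(c.values()) for c in counts.values())
    words_in_cat = sum(counts[category].values())
    word_everywhere = sum(c[word] for c in counts.values())
    if word_everywhere == 0:
        return 0.0  # assumption: unseen words contribute nothing
    prior = words_in_cat / total_words                    # Pr(Category)
    likelihood = counts[category][word] / words_in_cat    # Pr(Word|Category)
    evidence = word_everywhere / total_words              # Pr(Word)
    return likelihood * prior / evidence


def vectorize(doc_words, counts):
    """Average the per-word posteriors over the n words of the document,
    giving one value per category (categories in sorted order)."""
    n = len(doc_words)
    if n == 0:
        return [0.0] * len(counts)
    return [sum(posterior(w, cat, counts) for w in doc_words) / n
            for cat in sorted(counts)]


# Toy training data (invented for illustration only).
counts = train({
    "cars": [["engine", "wheel"], ["wheel", "road"]],
    "boats": [["sail", "engine"]],
})
print(vectorize(["engine", "wheel"], counts))
```

With these definitions the three factors reduce to the word's count in the category divided by its count over all categories, so for any word seen in training the per-word posteriors sum to one across categories.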

Typically, the ordinary naïve Bayes classifier determines the right category of an input document based on the Bayes Classification Rule, by referring to the associated probability values calculated by the trained classifier based on the Bayes formula. The right category is taken to be the category which has the highest posterior probability value, Pr(Category|Document) (Kontakanen, Myllymaki, Silander, & Tirri, 1997). It is this simplicity that is both an advantage (simple algorithm) and a disadvantage (poor generalization). Since the ordinary naïve Bayes classification algorithm has been reported to be one of the poorest performing classifiers, we propose SOM clustering for the purposes of increasing generalization and classification accuracy.
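As a small illustration of why this rule is described as "single dimensional", the sketch below applies it to the probabilities of the Fig. 1b example. The function name `bayes_rule` is hypothetical, and the fourth probability (10%) is an assumption, since the figure text lists only three values.

```python
def bayes_rule(vector, categories):
    """Return the category with the highest posterior probability.

    Everything in the vector except the position of its maximum
    is discarded by this rule.
    """
    best = max(range(len(vector)), key=lambda i: vector[i])
    return categories[best]


categories = ["Calculus", "Algebra", "Geometry", "Graph Theory"]
vector = [0.5, 0.3, 0.1, 0.1]  # last value is assumed
print(bayes_rule(vector, categories))  # prints "Calculus"
```

The hybrid approach instead passes the whole vector on to the SOM, so the 30% and 10% components still influence where the document lands on the map.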

2.2. Clustering through the use of the self organizing map

Knowledge discovery tasks can be broken down into two general steps: pre-processing and classification. In the pre-processing step, data is transformed into a format which can be processed by a classifier. The self organizing map (SOM) can be used to carry out the classification tasks effectively, especially for the analysis and visualization of a variety of economic, financial, scientific, and manufacturing data sets (Petrushin, 2005; Wang, 2001). The first step in designing the SOM is to decide which prominent features are to be used in order to cluster the data into groups effectively. The criterion for selecting the main features plays an important role in ensuring that the SOM clusters properly and thus supports goal based decision making. Traditionally, statistical cluster analysis is an important step in improving feature extraction and is done iteratively. An alternative to these statistical methods is the SOM (Vesanto, Alhonieme, Himberg, Kiviluoto, & Pervienen, 1999).

Kohonen's principle of topographic map formation states that the spatial location of an output neuron in the topographic map corresponds to a particular feature of the input pattern. The SOM model, which is shown in Fig. 2, provides a map which places a fixed number of input patterns from an input layer into the so-called Kohonen layer (Kriegel, Brechesen, Kroger, Pfeifle, & Schbert, 2003; Wang, 2001). The system learns through self organization of random neurons whose weights are attached to the layers of neurons. These weights are altered at every epoch during the training session. The change depends upon the similarity or neighborhood between the input pattern and the map pattern (Michalski et al., 1999). The topographic feature maps reduce the dimensions of the data to two, simplifying viewing and interpretation.

In the SOM, certain trends in clustering can be observed by changing some of the training parameters. After the incremental training of the map, the application saves the weight vectors of the map, and these weights can be used as the starting weights. Once the training is over, the output mean and variance of each cluster are reported. Furthermore, the location of each cluster is also reported. Speed is a big concern in SOM clustering. By reducing dimensionality through the use of a probability distribution over categories in feature space, instead of a raw word occurrence count, we reduce computation time in both training and classification. The concerns associated with data pre-processing before the training starts and the final drawing of the map once the training is over (Pal & Shiu, 2004) are also addressed by our hybrid system.

Table 1
Table of words occurrence and probabilities

| Word | Probability category 1 | Probability category 2 | Probability category 3 | ... | Probability category k-1 | Probability category k |
|---|---|---|---|---|---|---|
| $w_1$ | $\Pr(C_1 \mid w_1)$ | $\Pr(C_2 \mid w_1)$ | $\Pr(C_3 \mid w_1)$ | ... | $\Pr(C_{k-1} \mid w_1)$ | $\Pr(C_k \mid w_1)$ |
| $w_2$ | $\Pr(C_1 \mid w_2)$ | $\Pr(C_2 \mid w_2)$ | $\Pr(C_3 \mid w_2)$ | ... | $\Pr(C_{k-1} \mid w_2)$ | $\Pr(C_k \mid w_2)$ |
| $w_3$ | $\Pr(C_1 \mid w_3)$ | $\Pr(C_2 \mid w_3)$ | $\Pr(C_3 \mid w_3)$ | ... | $\Pr(C_{k-1} \mid w_3)$ | $\Pr(C_k \mid w_3)$ |
| ... | ... | ... | ... | ... | ... | ... |
| $w_{n-1}$ | $\Pr(C_1 \mid w_{n-1})$ | $\Pr(C_2 \mid w_{n-1})$ | $\Pr(C_3 \mid w_{n-1})$ | ... | $\Pr(C_{k-1} \mid w_{n-1})$ | $\Pr(C_k \mid w_{n-1})$ |
| $w_n$ | $\Pr(C_1 \mid w_n)$ | $\Pr(C_2 \mid w_n)$ | $\Pr(C_3 \mid w_n)$ | ... | $\Pr(C_{k-1} \mid w_n)$ | $\Pr(C_k \mid w_n)$ |
| Total | $\sum \Pr(C_1 \mid W)$ | $\sum \Pr(C_2 \mid W)$ | $\sum \Pr(C_3 \mid W)$ | ... | $\sum \Pr(C_{k-1} \mid W)$ | $\sum \Pr(C_k \mid W)$ |
| Probability of input document | $\frac{\sum \Pr(C_1 \mid W)}{n}$ | $\frac{\sum \Pr(C_2 \mid W)}{n}$ | $\frac{\sum \Pr(C_3 \mid W)}{n}$ | ... | $\frac{\sum \Pr(C_{k-1} \mid W)}{n}$ | $\frac{\sum \Pr(C_k \mid W)}{n}$ |

The SOM is trained iteratively. In each training step, a sample vector x from the input data set is chosen randomly and the distance between x and all the weight vectors of the SOM is calculated using a Euclidean distance measure. The neuron whose weight vector is closest to the input vector x is called the Best Matching Unit (BMU), denoted c. The distance between x and the BMU's weight vector satisfies the equation below:

$$\lVert x - m_c \rVert = \min_i \{ \lVert x - m_i \rVert \}$$

where $\lVert \cdot \rVert$ is the distance measure, typically Euclidean distance, and $m_i$ is the weight vector of unit i. After the BMU is found, the weight vectors of the SOM are updated so that the BMU is moved closer to the input vector in the input space (Miyamoto, 2007). The topological neighbors of the BMU are treated similarly. The update rule for the weight vector of unit i is

$$m_i(t+1) = m_i(t) + \alpha(t)\, h_{ci}(t)\, [x(t) - m_i(t)]$$

where x(t) is a vector randomly drawn from the input data set, $\alpha(t)$ is the learning rate, and t denotes time (Deboeck & Kohonen, 1998). The function $h_{ci}(t)$ is the neighborhood kernel around the winner unit c. The dataset of manufacturing details is fed into the input layer of the SOM. The learning parameter is selected between 0.0 and 0.9, and the SOM is trained. The training runs for on the order of 100,000 epochs in order to obtain a trained map. These training datasets are coded with reference to their prominent features (Wang, 2001).
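The BMU search and the update rule can be sketched in plain Python as follows. Several choices here are assumptions that the text does not specify: a Gaussian neighborhood kernel on a rectangular grid, a linear decay of the learning rate and neighborhood width, random initial weights, and the names `bmu_index`, `train_step` and `train_som`.

```python
import math
import random


def bmu_index(x, weights):
    """Index c of the unit whose weight vector is closest to x (Euclidean)."""
    return min(range(len(weights)), key=lambda i: math.dist(x, weights[i]))


def train_step(x, weights, grid, alpha, sigma):
    """Apply m_i(t+1) = m_i(t) + alpha * h_ci * (x - m_i(t)) to every unit."""
    c = bmu_index(x, weights)
    for i, m in enumerate(weights):
        # Squared grid distance between unit i and the winner c.
        d2 = (grid[i][0] - grid[c][0]) ** 2 + (grid[i][1] - grid[c][1]) ** 2
        h = math.exp(-d2 / (2 * sigma ** 2))  # assumed Gaussian kernel
        weights[i] = [mj + alpha * h * (xj - mj) for mj, xj in zip(m, x)]
    return c


def train_som(data, rows, cols, steps, alpha0=0.5, sigma0=1.0, seed=0):
    """Train a rows x cols map on data (a list of equal-length vectors)."""
    rng = random.Random(seed)
    dim = len(data[0])
    grid = [(r, c) for r in range(rows) for c in range(cols)]
    weights = [[rng.random() for _ in range(dim)] for _ in grid]
    for t in range(steps):
        frac = 1.0 - t / steps          # assumed linear decay schedule
        x = rng.choice(data)            # sample drawn at random
        train_step(x, weights, grid, alpha0 * frac, max(sigma0 * frac, 0.1))
    return weights, grid
```

A call such as `train_som(address_vectors, 10, 10, 100000)` would match the order of magnitude of training steps mentioned above, where `address_vectors` stands for the probability vectors produced by the naïve Bayes front end.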

In this work, the naïve Bayes classifier and the SOM are trained independently: the Bayes classifier with raw documents, and the SOM with the address vectors arising from the execution of the Bayes classifier. When a new document is to be identified and categorized, the naïve Bayes classifier is executed and outputs an "address" consisting of the probability distribution of the document being annotated to the various pre-defined categories. At this point various enhancements are added via tournament ranking algorithms (and the HRKE facility) with the intention of improving on the plain vanilla naïve Bayes classifier (we call this plain naïve Bayes option "Flat Ranking"). This address is then fed into the SOM interface program, which is executed to find the best matching unit (BMU), the neuron corresponding to the document most closely related to the input document described by the address given by the naïve Bayes classifier. The top five similar documents are then retrieved and presented to the user. The original document is then assigned the same class as the other documents closest to the BMU. In this way, a multi-dimensional classification system is obtained: the SOM adds robustness and increases the generalization of the overall approach, while the naïve Bayes classifier provides a way to vectorize the documents and reduce dimensionality, resulting in faster training and classification. In summary, this combination gives "enough" generalization (multi-dimensional as opposed to the single dimensional classification of naïve Bayes alone), but not so much as to make the classifier overfit and detrimentally increase training and classification time.
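The lookup described above can be sketched as follows. This is one possible reading rather than the authors' SOM interface program: the function name `classify` is hypothetical, the "top five similar documents" are taken to be the training documents whose vectors lie closest to the BMU's weight vector, and the final class is assumed to be decided by a majority vote among them.

```python
import math
from collections import Counter


def classify(address, weights, train_vectors, train_labels, top_k=5):
    """Find the BMU for an address, then label it from nearby documents."""
    # Best matching unit: the weight vector closest to the address.
    bmu = min(weights, key=lambda m: math.dist(address, m))
    # Rank training documents by distance to the BMU's weight vector.
    ranked = sorted(range(len(train_vectors)),
                    key=lambda i: math.dist(train_vectors[i], bmu))
    nearest = ranked[:top_k]
    # Assumption: majority vote among the retrieved documents.
    label = Counter(train_labels[i] for i in nearest).most_common(1)[0][0]
    return label, nearest


# Toy data (invented for illustration only).
weights = [[0.9, 0.1], [0.1, 0.9]]
train_vectors = [[0.8, 0.2], [0.85, 0.15], [0.2, 0.8]]
train_labels = ["Cars", "Cars", "Boats"]
label, nearest = classify([0.7, 0.3], weights, train_vectors,
                          train_labels, top_k=2)
print(label, nearest)
```

Returning the indices of the retrieved documents alongside the label mirrors the text, where the similar documents are presented to the user as well as used for classification.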

3. Experimental results

The objective of this evaluation is to determine whether our proposed approach results in better classification accuracy and performance when compared to the naïve Bayes classifier (both with and without tournament ranking methods). As mentioned in the sections above, the hybrid approach utilizes the simplicity and low requirements of the naïve Bayes classifier as a raw text document vectorizer, and uses the SOM to cluster the vectorized documents into groups. The evaluations are made by comparing the classification accuracy of the ordinary naïve Bayes classifier, along with some specialized techniques which were presented in our previous work (Isa et al., in press), with that of the hybrid approach proposed in this paper, which combines the naïve Bayes vectorizer (plus specialized techniques) with SOM clustering. In particular, these specialized techniques include naïve Bayes with flat ranking, in which the system computes the probability distribution by considering all categories in a single round of competition. The round robin version, on the other hand, calculates the probability distribution by considering only two categories at a time, iteratively, until each category has competed with all the other categories; the winner is determined by the highest winning score. The single elimination method entails finding a winner that has not lost even once within the competition. The HRKE algorithm culls words such as "a", "the", etc., which have a low effect on the classification task because they appear in every document. The algorithms mentioned above determine the right category for input documents by referring to the associated probability values calculated by the trained classifier based on the Bayes formula. The right category is represented by the category which has the highest posterior probability value, Pr(Category|Document).
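The round robin idea can be illustrated with the sketch below. The details of the authors' ranking algorithms are given in their earlier work and are not reproduced here, so this is only a simplified stand-in: each pairwise match is assumed to renormalize two precomputed category scores, the winner of a match earns one point, and ties go to the first category of the pair. The function name `round_robin` and the scores are hypothetical.

```python
from itertools import combinations


def round_robin(scores):
    """Play every pair of categories once; return (winner, win counts).

    scores maps each category to a non-negative score for one document.
    """
    wins = {cat: 0 for cat in scores}
    for a, b in combinations(scores, 2):
        total = scores[a] + scores[b]
        if total == 0:
            continue  # neither category has support; no point awarded
        p_a = scores[a] / total  # two-category probability of a
        wins[a if p_a >= 0.5 else b] += 1
    return max(wins, key=wins.get), wins


# Invented scores for one document over the four vehicle categories.
scores = {"Aircrafts": 0.1, "Boats": 0.2, "Cars": 0.6, "Trains": 0.1}
winner, wins = round_robin(scores)
print(winner, wins)
```

With fixed scores this simplified version always agrees with picking the highest score; a pairwise scheme only behaves differently when each match recomputes the probabilities from the two categories' own word statistics, which is the part left out here.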

A dataset of vehicle characteristics is tested in the prototype system to evaluate classification performance in handling the case of four categories which have low degrees of similarity. Our selected dataset contains four categories of vehicles: Aircrafts, Boats, Cars, and Trains. All four categories are easily differentiated and every category has a set of unique keywords. We have collected 110 documents for each category, for a total of 440 documents in the entire dataset. Fifty documents from each category are extracted randomly to build the training dataset for the classifier. The other 60 documents from each category are used as the testing dataset to test the classifier.
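
The split described above (50 randomly chosen training documents and 60 test documents per category) could be reproduced along the following lines. This is a minimal sketch: the function name, the seed and the placeholder document identifiers are assumptions, not part of the original system.

```python
import random

def split_per_category(docs_by_category, n_train=50, seed=0):
    """Randomly pick n_train documents per category for training; the rest are test."""
    rng = random.Random(seed)
    train, test = {}, {}
    for category, docs in docs_by_category.items():
        shuffled = docs[:]              # copy so the caller's list is not reordered
        rng.shuffle(shuffled)
        train[category] = shuffled[:n_train]
        test[category] = shuffled[n_train:]
    return train, test

# Placeholder corpus: 110 document identifiers for each of the four categories.
corpus = {c: [f"{c}_{i}" for i in range(110)]
          for c in ("Aircrafts", "Boats", "Cars", "Trains")}
train, test = split_per_category(corpus)

n_train_total = sum(len(v) for v in train.values())   # 4 x 50
n_test_total = sum(len(v) for v in test.values())     # 4 x 60
```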

Fig. 2. The SOM model.

Initially, we perform the experiment by implementing different pure naïve Bayes classification algorithms: the naïve Bayes with flat ranking algorithm, the naïve Bayes with round robin tournament ranking algorithm, the naïve Bayes with single elimination tournament ranking algorithm and the naïve Bayes with high relevance keywords extraction (HRKE) algorithm. These classification algorithms have been discussed in our previous work (Isa et al., in press) and have also been briefly described above. The algorithms mentioned above determine the right category for input documents by referring to the associated probability values calculated by the trained classifier based on the Bayes formula. According to the Bayes Classification Rule, the right category is represented by the category which has the highest posterior probability value, Pr(Category|Document).
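
To make the Bayes Classification Rule concrete, the sketch below computes a posterior distribution Pr(Category|Document) for a toy corpus and then selects the category with the highest value. It assumes a multinomial naïve Bayes model with add-one (Laplace) smoothing and whitespace tokenization; the source does not state these details, and all names are hypothetical.

```python
import math
from collections import Counter

def train_naive_bayes(docs_by_category):
    """Collect word counts and log priors per category."""
    vocab = {w for docs in docs_by_category.values() for d in docs for w in d.split()}
    total_docs = sum(len(docs) for docs in docs_by_category.values())
    model = {}
    for category, docs in docs_by_category.items():
        counts = Counter(w for d in docs for w in d.split())
        model[category] = {
            "log_prior": math.log(len(docs) / total_docs),
            "counts": counts,
            "total": sum(counts.values()),
        }
    return model, vocab

def posterior_vector(model, vocab, document):
    """Return Pr(Category|Document) for every category (values sum to 1)."""
    log_scores = {}
    for category, m in model.items():
        score = m["log_prior"]
        for w in document.split():
            if w in vocab:  # unseen words are ignored (assumption)
                score += math.log((m["counts"][w] + 1) / (m["total"] + len(vocab)))
        log_scores[category] = score
    top = max(log_scores.values())      # subtract the maximum for numerical stability
    unnormalized = {c: math.exp(s - top) for c, s in log_scores.items()}
    z = sum(unnormalized.values())
    return {c: v / z for c, v in unnormalized.items()}

# Toy training data (invented for illustration).
training = {
    "Boats": ["sail hull deck", "hull anchor sail"],
    "Cars": ["engine wheel road", "wheel brake road"],
}
model, vocab = train_naive_bayes(training)
vector = posterior_vector(model, vocab, "sail anchor hull")
predicted = max(vector, key=vector.get)   # Bayes Classification Rule
```

The dictionary `vector` is the kind of probability distribution that the hybrid approach passes on as a document vector, whereas `predicted` is all that the pure naïve Bayes classifier keeps.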

To evaluate the hybrid approach proposed in this paper, we implement the naïve Bayes classification algorithms mentioned above in the front-end to vectorize raw text data into the associated probability values calculated based on the Bayes formula. In the right category determination stage, we do not follow the ordinary naïve Bayes classifier, which implements the Bayes Classification Rule and selects the category with the highest posterior probability value, Pr(Category|Document). Instead, we implement the SOM to cluster the vectorized documents into groups. The detailed processes of the SOM in performing the right category determination steps have been discussed in Section 2.2.

By submitting the entire training data matrix to the SOM, a U-matrix, which represents the discovery of the SOM, is obtained as shown in Fig. 3. Each hexagon represents a node on the map, or a processing unit of the output layer of the neural network. The shade of each node on this map indicates a pattern among the attributes which are used for this map. In this experiment, the map is clustered into five clusters. Two hundred and forty test vectors (60 documents per category) are chosen for testing purposes. Table 5 illustrates the details of the data and cluster nodes and their distributions.
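
For readers unfamiliar with the U-matrix, the following sketch shows one common way of computing it: each node is assigned the mean Euclidean distance between its weight vector and those of its grid neighbours, so that high values mark cluster borders. It assumes a rectangular grid with four neighbours per node for brevity, whereas the map in Fig. 3 is hexagonal; the function name and the toy codebook are hypothetical.

```python
import math

def u_matrix(codebook):
    """Mean distance from each node's weights to its 4-connected grid neighbours.

    codebook[r][c] is the weight vector of the node at row r, column c.
    The grid is assumed to contain at least two nodes.
    """
    rows, cols = len(codebook), len(codebook[0])
    u = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            dists = [math.dist(codebook[r][c], codebook[rr][cc])
                     for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                     if 0 <= rr < rows and 0 <= cc < cols]
            u[r][c] = sum(dists) / len(dists)
    return u

# Toy 2 x 3 map with 4-dimensional weights: the right-hand column differs
# from the rest, so the border between columns 1 and 2 shows up as high values.
toy = [[[0.0] * 4, [0.0] * 4, [1.0] * 4],
       [[0.0] * 4, [0.0] * 4, [1.0] * 4]]
u = u_matrix(toy)
```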

In this experiment, we present a large data matrix for the dataset with 200 training documents. The dataset has four dimensions of information, which are categorized by different methods as shown in Table 4. The listing of results shows the performance obtained in conjunction with the SOM. The data vectorized by the front-end naïve Bayes vectorizer are taken as the input data to the SOM for further clustering.

Different maps are created for different levels and the best matching unit (BMU) is calculated by using the Euclidean distance (Deboeck & Kohonen, 1998). The results show an improvement in the recognition rate of the case retrieval process. The visualization of the trained data is shown in Fig. 3. The minimum, maximum and average values of the attributes of the neuron map are shown in Table 3. The training parameters are chosen as shown in Table 2: an initial radius of 3 and a final radius of 1 in rough training, whereas in fine tuning both the initial and the final radius are 1.

Fig. 3. Map-visualization of clusters.

Table 2
Training parameters

                 Training    Fine tune
Map size         10          7
Radius initial   3           1
Radius final     1           1
Final length     30,000      20
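
The BMU search and the two training phases can be sketched as follows. The radii follow Table 2 (3 to 1 in rough training, 1 to 1 in fine tuning), but several details are assumptions made for this illustration: a rectangular 10 x 7 grid (one reading of the "Map size" row), a Gaussian neighbourhood, constant learning rates of 0.5 and 0.05, sequential training on toy vectors, and a much shorter rough phase than the 30,000 steps listed in Table 2. All function names are hypothetical.

```python
import math
import random

def best_matching_unit(codebook, x):
    """Return (row, col) of the node whose weights are closest to x (Euclidean)."""
    cells = ((r, c) for r in range(len(codebook)) for c in range(len(codebook[0])))
    return min(cells, key=lambda rc: math.dist(codebook[rc[0]][rc[1]], x))

def train_phase(codebook, data, steps, radius_start, radius_end, lr, rng):
    """One training phase; the radius changes linearly from start to end."""
    for t in range(steps):
        frac = t / max(steps - 1, 1)
        radius = radius_start + (radius_end - radius_start) * frac
        x = rng.choice(data)
        br, bc = best_matching_unit(codebook, x)
        for r, row in enumerate(codebook):
            for c, w in enumerate(row):
                # Gaussian neighbourhood on grid coordinates (assumed kernel).
                h = math.exp(-((r - br) ** 2 + (c - bc) ** 2) / (2 * radius ** 2))
                for i in range(len(w)):
                    w[i] += lr * h * (x[i] - w[i])

rng = random.Random(0)
# 10 x 7 grid of randomly initialised 4-dimensional weight vectors.
codebook = [[[rng.random() for _ in range(4)] for _ in range(7)] for _ in range(10)]
# Toy posterior vectors standing in for vectorized documents.
data = [[0.9, 0.05, 0.03, 0.02], [0.05, 0.9, 0.03, 0.02]]

train_phase(codebook, data, steps=300, radius_start=3, radius_end=1, lr=0.5, rng=rng)  # rough
train_phase(codebook, data, steps=20, radius_start=1, radius_end=1, lr=0.05, rng=rng)  # fine tune
print(best_matching_unit(codebook, data[0]), best_matching_unit(codebook, data[1]))
```

Each update moves the BMU and its grid neighbours towards the presented vector, which is what lets similar document vectors settle on nearby nodes of the map.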


Table 4 shows that the recognition rate is poor when the SOM is trained in combination with the single elimination tournament ranking: it is only 56.66%. A similar situation is found when the hybrid approach is used with the flat ranking naïve Bayes classifier, where the recognition rate is only 58.00%.

The hybrid approach with the naïve Bayes vectorizer, when enhanced by the high relevance keywords extraction facility, shows good results, with a 98.33% recognition rate, a slight improvement over the classification rate of the pure naïve Bayes classifier. The highest result of 100% recognition is obtained for the hybrid approach with the naïve Bayes vectorizer using the round robin ranking method. Table 2 above provides the parameters chosen in training the SOM. The results are tested for 30,000 epochs of training cycle. An initial radius of 3 is considered in rough training and a radius of 1 is considered in the final training. Table 3 provides the statistical values, including the minimum and maximum values, for each one of the six vectorization techniques presented here.
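
The kind of per-dimension summary reported in Table 3 (average, maximum, minimum, variance and standard deviation) can be computed as sketched below. This is illustrative only: it assumes the statistics are taken over the weight vectors of the map nodes, one column per dimension, and uses population variance; the source does not say whether population or sample variance was used. The function name and toy values are hypothetical.

```python
import statistics

def map_statistics(weight_vectors):
    """Summarise each dimension (column) of a list of equal-length weight vectors."""
    columns = list(zip(*weight_vectors))
    return [
        {
            "average": statistics.mean(col),
            "maximum": max(col),
            "minimum": min(col),
            "variance": statistics.pvariance(col),   # population variance (assumed)
            "std_dev": statistics.pstdev(col),
        }
        for col in columns
    ]

# Two toy 2-dimensional weight vectors.
stats = map_statistics([[1.0, 2.0], [3.0, 4.0]])
```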

With reference to Figs. 3a and b, we see that the clustering is better defined and has more detail, as indicated by the smaller amount of yellows and greens, as opposed to Figs. 3c and d, where there is a spread of yellows and greens throughout the map. Table 5 shows the relative spread of colors between the round robin and (round robin + HRKE) techniques, displaying the relationship between non-primary colors and classification error.

We further draw the conclusion that the best performing algorithms, i.e., the round robin and (flat + HRKE) combinations, provide better delineation between clusters, as highlighted by the minimal non-primary green and yellow colors. This may be seen as due to an action akin to "filtering" out the noise in the input data to the SOM by means of the HRKE facility and through the iterative competition afforded by the round robin tournament method. Out of 240 test documents (represented by vectors) which have been presented to the organized map, trained using different documents from the same database, all 240 vectors were recognized accurately. This combination of the naïve Bayes vectorizer with the round robin tournament ranking structure and the SOM gives 100% classification accuracy for the data sets we have tested.
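
The "filtering" effect attributed to HRKE can be illustrated with a deliberately simplified stand-in: dropping every word that occurs in all documents, which is the property the text gives for words such as "a" and "the". The actual HRKE algorithm is defined in the previous work and may differ; the function name and sample sentences below are hypothetical.

```python
def remove_ubiquitous_words(documents):
    """Drop words that appear in every document of a non-empty collection."""
    token_lists = [d.lower().split() for d in documents]
    ubiquitous = set.intersection(*(set(tokens) for tokens in token_lists))
    return [" ".join(w for w in tokens if w not in ubiquitous)
            for tokens in token_lists]

docs = ["the boat has a sail", "the car has a wheel", "the train has a track"]
filtered = remove_ubiquitous_words(docs)
```

Here "the", "has" and "a" occur in all three documents and are removed, leaving only the words that discriminate between categories.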

We see from Table 4, however, that the SOM classification accuracy for (round robin + HRKE) has dropped from 100% (without HRKE) to 82%. This is because implementing HRKE artificially reduces the effectiveness of multi-dimensional classification (clustering) through the elimination of words such as "a" and "the" (stop words), which are indirectly considered by the SOM in its classification task: the vectorizer calculates the probability distribution using all words in the document, including the stop words, and the elimination of these words affects the input address to the SOM. For the naïve Bayes classifier, on the other hand, the elimination of these stop words is less detrimental because only the highest probability is taken as the right category, i.e. a single dimension is used for classification instead of a multi-dimensional approach.
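
The difference between single-dimension and multi-dimensional classification can be seen in a small numerical example. The two probability vectors below are invented for illustration: both have their maximum in the first position, so the Bayes Classification Rule treats them identically, yet they are far apart in the four-dimensional space in which the SOM operates.

```python
import math

# Invented posterior vectors over (Aircrafts, Boats, Cars, Trains).
doc_a = [0.97, 0.01, 0.01, 0.01]   # confidently the first category
doc_b = [0.40, 0.35, 0.15, 0.10]   # first category only by a narrow margin

def argmax(vector):
    """Index of the largest component (what the pure classifier keeps)."""
    return max(range(len(vector)), key=vector.__getitem__)

same_decision = argmax(doc_a) == argmax(doc_b)
distance = math.dist(doc_a, doc_b)   # what a distance-based map still sees
```

Any change to the remaining components, such as the one caused by removing stop words before vectorization, therefore moves a document's position on the map even when the highest-probability category is unchanged.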

Table 3
Statistics of the maps for different techniques

           RR (100%)                               RR + HRKE (82%)
Distance   D1        D2        D3        D4        D1        D2        D3        D4
Average    1.376223  0.368416  1.949174  0.647098  1.458709  0.693082  1.548414  0.621877
Maximum    2.34731   1.57367   2.57733   2.17383   2.46863   2.09739   2.49728   2.41367
Minimum    0.740616  5.78E-09  1.28699   1.12E-06  0.736841  2.05E-05  0.762738  1.5E-07
Variance   0.22436   0.186627  0.135178  0.503128  0.292686  0.56394   0.230363  0.56791
Std Dev    0.491255  0.432003  0.367665  0.709315  0.554666  0.75096   0.479961  0.753598

           Flat (58%)                              Flat and HRKE (98.33%)
Distance   D1        D2        D3        D4        D1        D2        D3        D4
Average    0.331384  0.243228  0.355904  0.268384  0.293388  0.434992  0.245219  0.457635
Maximum    0.504098  0.421988  0.538897  0.537959  0.583702  1.21695   0.471976  1.19018
Minimum    0.234698  0.17502   0.257436  0.192912  0.085879  0.109249  0.085964  0.179638
Variance   0.006616  0.006526  0.006062  0.010637  0.020681  0.131244  0.009334  0.108925
Std Dev    0.081341  0.080782  0.07786   0.103135  0.146401  0.361481  0.099921  0.330275

           SE (57%)                                SE + HRKE (57%)
Distance   D1        D2        D3        D4        D1        D2        D3        D4
Average    0.320067  0.249916  0.358831  0.266348  0.320067  0.25086   0.358789  0.266276
Maximum    0.499788  0.412554  0.534841  0.527001  0.499788  0.412554  0.534841  0.527001
Minimum    0.237894  0.176811  0.259863  0.195361  0.237894  0.176811  0.256949  0.195361
Variance   0.005397  0.006509  0.006549  0.009856  0.005397  0.006585  0.006557  0.009861
Std Dev    0.081941  0.08068   0.080925  0.09928   0.081941  0.081148  0.080977  0.099305

Table 4
Classification accuracy of the naïve Bayes algorithms and hybrid algorithms with and without HRKE

Tournament ranking enhancements         Pure naïve Bayes               With SOM
                                        No HRKE (%)   With HRKE (%)    No HRKE (%)   With HRKE (%)
Flat ranking                            81.25         96.25            58.00         98.33
Round robin tournament ranking          69.58         79.5             100           82
Single elimination tournament ranking   64.58         76.82            56.66         57

Table 5
Cluster formation according to distance measurement (round robin)

Color code   No. of neurons   % Distribution   Values
Dark red     6                8                2.15–2.36
Red          5                7                1.85–2.10
Yellow       5                7                1.65–1.80
Green        13               18               1.45–1.40
Light blue   17               24               1.14–1.40
Blue         26               36               0.741–1.10

4. Conclusion

A hybrid text document classification approach is proposed. Through the implementation of an enhanced naïve Bayes classification method at the front-end for raw text data vectorization, in conjunction with a SOM at the back-end to determine the right cluster for the input documents, better generalization, lower training and classification time, and a 100% classification accuracy can be obtained.

References

Brucher, H., Knolmayer, G., & Mittermayer, M. A. (2002). Document classification methods for organizing explicit knowledge. University of Bern, Institute of Information System, Research Group Information Engineering, Engehaldenstrasse 8, CH-3012 Bern, Switzerland.

Chakrabarti, S., Roy, S., & Soundalgekar, M. V. (2003). Fast and accurate text classification via multiple linear discriminant projection. The VLDB Journal: The International Journal on Very Large Data Bases, 170–185.

Cunningham, P., Nowlan, N., Delany, S. J., & Haahr, M. (2003). A case-based approach in spam filtering that can track concept drift. In The ICCBR'03 workshop on long-lived CBR systems, Trondheim, Norway.

Deboeck, G., & Kohonen, T. (1998). Visual explorations in finance with self organizing maps. Springer-Verlag.

Delany, S. J., Cunningham, P., & Coyle, L. (2005). An assessment of case-based reasoning for spam filtering. Artificial Intelligence Review Journal, 24(3–4), 359–378.

Delany, S. J., Cunningham, P., Tsymbal, A., & Coyle, L. (2004). A case-based technique for tracking concept drift in spam filtering. Journal of Knowledge Based Systems, 18(4–5), 187–195.

Flach, P. A., Gyftodimos, E., & Lachiche, N. (2002). Probabilistic reasoning with terms. Strasbourg: University of Bristol, Louis Pasteur University.

Han, E. H., Karypis, G., & Kumar, V. (1999). Text categorization using weight adjusted k-nearest neighbour classification. Department of Computer Science and Engineering, Army HPC Research Center, University of Minnesota.

Hartley, M., Isa, D., Kallimani, V. P., & Lee, L. H. (2006). A domain knowledge preserving in process engineering using self-organizing concept, ICAIET 06. Sabah, Malaysia: Kota Kinabalu.

Haykin, S. (1999). Neural Networks. A Comprehensive Foundation (2nd ed.). Prentice Hall.

Isa, D., Prasad, R., Lee, L. H., & Kallimani, V. P. (2007). Pipeline defect prediction using support vector machines. In CSECS 2007, Cairo, Egypt.

Isa, D., Lee, L. H., & Kallimani, V. P. (in press). A polychotomizer for case-based reasoning beyond the traditional Bayesian classification approach. Journal of Computer and Information Science.

Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. In Machine learning: ECML-98, 10th European conference on machine learning (pp. 137–142).

Kamens, B. (2005). Bayesian filtering: Beyond binary classification. Fog Creek Software, Inc.

Kim, S. B., Rim, H. C., Yook, D. S., & Lim, H. S. (2002). Effective methods for improving naïve Bayes text classifiers. In Proceeding of the 7th Pacific rim international conference on artificial intelligence (Vol. 2417).

Kontkanen, P., Myllymaki, P., Silander, T., & Tirri, H. (1997). A Bayesian approach for retrieving relevant cases. In Proceedings of the EXPERSYS-97 conference, Sunderland, UK (pp. 67–72).

Kriegel, H. P., Brecheisen, S., Kroger, P., Pfeifle, M., & Schubert, M. (2003). Using sets of feature vectors for similarity search on voxelized CAD objects. In Proceedings of the ACM SIGMOD 2003 international conference on management of data, San Diego, 2003.

McCallum, A., & Nigam, K. (2003). A comparison of event models for naïve Bayes text classification. Journal of Machine Learning Research, 3, 1265–1287.

Michalski, R. S., Bratko, I., & Kubat, M. (1999). Machine learning and data mining methods and applications. Wiley.

Miyamoto, S. (2007). Data clustering algorithms for Information Systems. Berlin: Springer-Verlag.

Negnevitsky, M. (2002). Artificial intelligence. A guide to intelligent systems. Addison Wesley.

Nigam, K., Lafferty, J., & McCallum, A. (1999). Using maximum entropy for text classification. In IJCAI-99 workshop on machine learning for information filtering (pp. 61–67).

O'Brien, C., & Vogel, C. (2002). Spam filters: Bayes vs Chisquared; letters vs words. In Proceedings of the 1st international symposium on Information and communication technologies.

Pal, S. K., & Shiu, C. K. (2004). Foundation of soft case-based reasoning. Wiley.

Petrushin, V. A. (2005). Mining rare and frequent events in multi-camera surveillance video using self organizing maps. In Proceeding of the 11th ACM SIGKDD international conference on knowledge discovery in data mining.

Provost, J. (1999). Naïve-Bayes vs Rule-Learning in Classification of E-mail. Department of Computer Science, The University of Austin.

Quinlan, J. R. (1993). C4.5: programs for machine learning. San Mateo, CA: Morgan Kaufmann.

Sahami, M., Dumais, S., Heckerman, D., & Horvitz, E. (1998). A Bayesian approach to filtering junk e-mail. In AAAI-98 workshop on learning for text categorization.

Su, X. (2002). A text categorization perspective for ontology mapping. Norway: Department of Computer and Information Science, Norwegian University of Science and Technology.

Vapnik, V. N. (1995). The nature of statistical learning theory. New York: Springer.

Vesanto, J., Alhoniemi, E., Himberg, J., Kiviluoto, K., & Parviainen, J. (1999). Self organising maps for data mining in Matlab. In The SOM toolbox, simulation news Europe (Vol. 25, p. 54).

Wang, S. H. (2001). Cluster Analysis using a validated Self organizing method: Cases of problem identification. International Journal of Intelligent systems in Accounting, Finance and Management, 127.

Wei, K. (2003). A naïve Bayes spam filter. Faculty of Computer Science, University of Berkeley.

Xia, Y., Liu, W., & Guthrie, L. (2005). Email categorization with tournament methods. In Proceeding international conference on application of natural language.
