Knowl Inf Syst
DOI 10.1007/s10115-011-0412-6
REGULAR PAPER
Text document clustering using global term context vectors
Argyris Kalogeratos · Aristidis Likas
Department of Computer Science, University of Ioannina, 45110 Ioannina, Greece
e-mail: akaloger@cs.uoi.gr (A. Kalogeratos); arly@cs.uoi.gr (A. Likas)
Received: 13 May 2010 / Revised: 20 December 2010 / Accepted: 6 May 2011
© Springer-Verlag London Limited 2011
Abstract Despite the advantages of the traditional vector space model (VSM) representation, there are known deficiencies concerning the term independence assumption. The high dimensionality and sparsity of the text feature space and phenomena such as polysemy and synonymy can only be handled if a way is provided to measure term similarity. Many approaches have been proposed that map document vectors onto a new feature space where learning algorithms can achieve better solutions. This paper presents the global term context vector-VSM (GTCV-VSM) method for text document representation. It is an extension to VSM that: (i) captures local contextual information for each term occurrence in the term sequences of documents; (ii) combines the local contexts of the occurrences of a term to define the global context of that term; (iii) constructs a proper semantic matrix using the global contexts of all terms; and (iv) uses this matrix to linearly map traditional VSM (Bag of Words, BOW) document vectors onto a semantically smoothed feature space where problems such as text document clustering can be solved more efficiently. We present an experimental study demonstrating the improvement of clustering results when the proposed GTCV-VSM representation is used compared with traditional VSM-based approaches.
Keywords Text mining · Document clustering · Semantic matrix · Data projection
1 Introduction
The text document clustering procedure aims toward automatically partitioning a given
collection of unlabeled text documents into a (usually predefined) number of groups, called
clusters, such that similar documents are assigned to the same cluster while dissimilar
documents are assigned to different clusters. This is a task that discovers the underlying
structure in a set of data objects and enables the efficient organization and navigation in large
text collections.
The challenging characteristics of the text document clustering problem are related to the
complexity of the natural language. Text documents are represented in high dimensional and
sparse (HDS) feature spaces, due to their large term vocabularies (the number of different
terms of a document collection or text features in general). In an HDS feature space, the
difference between the distance of two similar objects and the distance of two dissimilar
objects is relatively small [4]. This phenomenon prevents clustering methods from achieving
good data partitions. Moreover, the text semantics, e.g., term correlations, are mostly implicit
and non-trivial, hence difficult to extract without prior knowledge for a specific problem.
The traditional document representation is the vector space model (VSM) [28], where each document is represented by a vector of weights corresponding to text features. Many variations of VSM have been proposed [17] that differ in what they consider as a feature or 'term'. The most common approach is to consider different words as distinct terms, which is the widely known Bag Of Words (BOW) model. An extension is the Bag Of Phrases (BOP) model [23] that extracts a set of informative phrases or word n-grams (n consecutive words). Especially for noisy document collections, e.g., containing many spelling errors, or collections whose language is not known in advance, it is often better to use VSM to model the distribution of character n-grams in documents. Herein, we consider word features and we refer to them as terms; however, the procedures we describe can be directly extended to more complex features.
Despite the simplicity of the popular word-based VSM version, there are common language phenomena that it cannot handle. More specifically, it cannot distinguish the different senses of a polysemous word in different contexts or realize the common sense between synonyms. It also fails to recognize multi-word expressions (e.g., 'Olympic Games'). These deficiencies are in part due to the over-simplistic assumption of term independence, where each dimension of the HDS feature space is considered to be orthogonal to the others; this makes the classic VSM model incapable of capturing the complex language semantics. The VSM representations of documents can be improved by examining the relations between terms either at a low level, such as term co-occurrence frequency, or at a higher semantic similarity level.
Among the popular approaches is Latent Semantic Indexing (LSI) [7], which solves an eigenproblem using Singular Value Decomposition (SVD) to determine a proper feature space onto which data are projected. Concept Indexing [16] computes a k-partition by clustering the documents and then uses the centroid vectors of the clusters as the axes of the reduced space. Similarly, Concept Decomposition [8] approximates the term-by-document data matrix in a least-squares fashion using centroid vectors. A simpler but quite efficient method is the Generalized vector space model (GVSM) [31]. GVSM represents documents in the document similarity space, i.e., each document is represented as a vector containing its similarities to the rest of the documents in the collection. The Context Vector Model (CVM-VSM) [5] is a VSM-extension that describes the semantics of each term by introducing a term context vector that stores its similarities to the other terms. The similarity between terms is based on a document-wise term co-occurrence frequency. The term context vectors are then used to map document vectors into a feature space of equal size to the original, but less sparse. The ontology-based VSM approaches [13,15] map the terms of the original space onto a feature space defined by a hierarchically structured thesaurus, called ontology. Ontologies provide information about the words of a language and their possible semantic relations; thus, an efficient mapping can
disambiguate the word senses in the context of each document. The main disadvantage is
that, in most cases, the ontologies are static and rather generic knowledge bases, which may
cause heavy semantic smoothing of the data. A special text representation problem is related
to very short texts [14,25].
In this work, we present the Global Term Context Vector-VSM (GTCV-VSM) representation, an entirely corpus-based extension to the traditional VSM that incorporates contextual information for each vocabulary term. First, the local context for each term occurrence in the term sequences of documents is captured and represented in vector space by exploiting the idea of the Locally Weighted Bag of Words [18]. Then, all the local contexts of a term are combined to form its global context vector. The global context vectors constitute a semantic matrix that efficiently maps the traditional VSM document vectors onto a semantically richer feature space of the same dimensionality as the original. As indicated by our experimental study, in the new space, superior clustering solutions are achieved using well-known clustering algorithms such as spherical k-means [8] or spectral clustering [24].
The rest of this paper is organized as follows. Section 2 provides some background on document representation using the vector space model. In Sect. 3, we describe recent approaches for representing a text document using histograms that describe the local context at each location of the document-term sequence. In Sect. 4, we present our proposed approach for document representation. The experimental results are presented in Sect. 5, and finally, in Sect. 6, we provide conclusions and directions for future work.
2 Document representation in vector space
In order to apply any clustering algorithm, the raw collection of N text documents must be first preprocessed and represented in a suitable feature space. A standard approach is to eliminate trivial words (e.g., stopwords) and words that appear in a small number of documents. Then, stemming [26] is applied, which aims to replace each word by its corresponding word stem. The V derived word stems constitute the collection's term vocabulary, denoted as V = {ν_1, ..., ν_V}. Thus, a text document, which is a finite term sequence of T vocabulary terms, is denoted as d^{seq} = ⟨d^{seq}(1), ..., d^{seq}(T)⟩, with d^{seq}(i) ∈ V. For example, the phrase 'The dog ate a cat and a mouse!' is a sequence d^{seq} = ⟨dog, ate, cat, mouse⟩.
2.1 The bag of words model
According to the typical VSM approach, the Bag of Words (BOW) model, a document is represented by a vector d ∈ R^V, where each word term ν_i of the vocabulary is associated with a single vector dimension. The most popular weighting scheme is the normalized tf × idf that introduces the inverse document frequency as an external weight to enforce the terms that have discrimination power and appear in a small number of documents. For the ν_i vocabulary term, it is computed as idf_i = log(N / df_i), where N denotes the total number of documents and df_i denotes the document frequency, i.e., the number of documents that contain term ν_i. Thus, the normalized tf × idf BOW vector is a mapping of the term sequence d^{seq} defined as follows

$bow: d^{seq} \mapsto d = h \cdot (tf_1\, idf_1, \ldots, tf_V\, idf_V) \in \mathbb{R}^V,$    (1)

where normalization is performed with respect to the Euclidean norm using the coefficient h. The document collection can then be represented using the N document vectors as rows in
the Document-Term matrix D, which is an N × V matrix whose rows and columns are indexed by the documents and the vocabulary terms, respectively.
In the VSM, there are several alternatives to quantify the semantic similarity between document pairs. Among them, Cosine similarity has been shown to be an effective measure [11], and for a pair of document vectors d_i and d_j it is given by

$sim_{cos}(d_i, d_j) = \frac{d_i^{\top} d_j}{\|d_i\|_2\, \|d_j\|_2} \in [0, 1].$    (2)

Unit similarity value implies that the two documents are described by identical distributions of term frequencies. Note that this is equal to the dot product d_i^{\top} d_j if the document vectors are normalized in the unit positive V-dimensional hypersphere.
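As an illustration (not part of the original paper), a minimal NumPy sketch of Eqs. 1 and 2 could look as follows; the function names and the dense-array representation are our assumptions.

```python
import numpy as np

def bow_tfidf(tf, df, n_docs):
    """Normalized tf-idf BOW vector (Eq. 1); tf and df are length-V arrays of
    term frequencies tf_i and document frequencies df_i."""
    idf = np.log(n_docs / df)                 # idf_i = log(N / df_i)
    d = tf * idf                              # tf_i * idf_i
    norm = np.linalg.norm(d)                  # Euclidean norm; coefficient h = 1 / norm
    return d / norm if norm > 0 else d

def cosine_similarity(d_i, d_j):
    """Cosine similarity between two document vectors (Eq. 2)."""
    return float(d_i @ d_j) / (np.linalg.norm(d_i) * np.linalg.norm(d_j))
```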
2.2 Extensions to VSM
The BOW model, despite having a series of advantages, such as generality and simplicity, cannot model efficiently the rich semantic content of text. The Bag Of Phrases model uses phrases of two or three consecutive words as features. Its disadvantage is that, as phrases become longer, they obtain superior semantic value, but at the same time, they become statistically inferior with respect to single-word representations [19]. A category of methods developed to tackle this difficulty recognizes the frequent wordsets (unordered itemsets) in a document collection [3,10], while the method proposed in [20] exploits the frequent word subsequences (ordered) that are stored in a Generalized Suffix Tree (GST) for each document.
Modern variations of VSM are used to tackle the difficulties occurring due to HDS spaces, by projecting the document vectors onto a new feature space called concept space. Each concept is represented as a concept vector of relations between the concept and the vocabulary terms. Generally, this approach of document mapping can be expressed as

$VSM': d \mapsto d' = S d \in \mathbb{R}^{V'}, \quad V' \le V,$    (3)

where the V' × V matrix S stores the concept vectors as rows. This projection matrix is also known as semantic matrix. The Cosine similarity between two normalized document images in the concept space can be computed as a dot product

$sim^{(cos)}_{sem}(d_i, d_j) = (\widehat{S d_i})^{\top} (\widehat{S d_j}) = (h^S_i\, S d_i)^{\top} (h^S_j\, S d_j) = h^S_i\, h^S_j\, d_i^{\top} S^{\top} S\, d_j,$    (4)

where the scalar normalization coefficient for each document is h^S_i = 1 / ||S d_i||_2. The similarity defined in Eq. 4 can be interpreted in two ways: (i) as a dot product of the document images S d_i and S d_j that both belong to the new space R^{V'} and (ii) as a composite measure that takes into account the pairwise correlations between the original features expressed by the matrix S^{\top} S.
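A minimal sketch of this generic linear mapping (our illustration; S is any semantic matrix with concept vectors as rows):

```python
import numpy as np

def map_documents(D, S):
    """Project BOW document vectors (rows of D, N x V) onto the concept space
    defined by the semantic matrix S (V' x V), as in Eq. 3: d' = S d."""
    return D @ S.T

def semantic_cosine(d_i, d_j, S):
    """Cosine similarity between two document images in the concept space (Eq. 4)."""
    x, y = S @ d_i, S @ d_j
    return float(x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
```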
There is a variety of methods proposing alternative ways to define the semantic matrix, though many of them are based on the above linear mapping. The widely used Latent Semantic Indexing (LSI) [7] projects the document vectors onto a space spanned by the eigenvectors corresponding to the V' largest eigenvalues of the matrix D^{\top} D. The eigenvectors are extracted by means of Singular Value Decomposition (SVD) on matrix D, and they capture the latent semantic information of the feature space. In this case, each eigenvector is a different concept vector and V' is a user parameter much smaller than V, while there is also a considerable computational cost to perform the SVD. In Concept Indexing [16], the concept
vectors are the centroids of a V'-partition obtained by applying document clustering. In [9], statistical information such as the covariance matrix is combined with traditional mapping approaches into latent space (LSI, PCA) to compose a hybrid vector mapping.
A computationally simpler alternative that utilizes the Document-Term matrix D as a semantic matrix is the Generalized vector space model (GVSM) [31], i.e., S_{gvsm} = D, and the image of a document is given by d' = D d. By examining the product D d ∈ R^{N×1}, we can conclude that a GVSM-projected document vector d' has lower dimensionality if N < V. Moreover, if both d and D are properly normalized, then the image vector d' contains the Cosine similarities between the document vector d and each of the N documents in the collection. This observation implies that the GVSM works in the document similarity space by considering each document as a different concept. On the other hand, the respective product S_{gvsm}^{\top} S_{gvsm} = D^{\top} D (used in Eq. 4) is a V × V Term Similarity Matrix whose r-th row contains the dot-product similarities between term ν_r and the rest of the vocabulary terms. Note that terms become more similar as their corresponding normalized frequency distributions over the N documents become more alike. Based on the GVSM model, it is proposed in [1] to build local semantic matrices for each cluster during document clustering.
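A brief sketch of the GVSM mapping (our illustration; it assumes D holds tf-idf document vectors as rows):

```python
import numpy as np

def gvsm_images(D):
    """GVSM: the Document-Term matrix itself acts as the semantic matrix
    (S_gvsm = D), so the image of each document is its vector of similarities
    to all N documents of the collection."""
    Dn = D / np.linalg.norm(D, axis=1, keepdims=True)  # row-normalize the documents
    return Dn @ Dn.T                                   # N x N matrix of document images
```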
A rather different approach, proposed in [5] for information retrieval, is the Context Vector Model (CVM-VSM) where, instead of a few concise concept vectors, the context in which each of the V vocabulary terms appears in the data set is computed, called the term context vector (tcv). This model computes a V × V matrix S_{cvm} containing the term context vectors as rows. Each tcv_i vector aims to capture the V pairwise similarities of term ν_i to the rest of the vocabulary terms. Such similarity is computed using a co-occurrence frequency measure. Each matrix element [S_{cvm}]_{ij} stores the similarity between terms ν_i and ν_j computed as

$[S_{cvm}]_{ij} = \begin{cases} 1, & i = j \\ \dfrac{\sum_{r=1}^{N} tf_{ri}\, tf_{rj}}{\sum_{r=1}^{N} \left( tf_{ri} \cdot \sum_{q=1, q \ne i}^{V} tf_{rq} \right)}, & i \ne j. \end{cases}$    (5)

Note that this measure is not symmetric, generally [S_{cvm}]_{ij} ≠ [S_{cvm}]_{ji}, due to the denominator that normalizes the pairwise similarity to [0, 1] with respect to the total 'amount' of similarity between term ν_i and the other vocabulary terms. The rows of matrix S_{cvm} can be normalized with respect to the Euclidean norm, and each document image is then computed as the centroid of the normalized context vectors of all terms appearing in that document

$cvm: d \mapsto d' = \sum_{i=1}^{V} tf_i \cdot tcv_i,$    (6)

where tf_i is the frequency of term ν_i. The motivation for using term context vectors is to capture the semantic content of a document based on the co-occurrence frequency of terms in the same document, averaged over the whole corpus. The CVM-VSM representation is less sparse than BOW. Moreover, weights such as idf can be incorporated into the transformed document vectors computed using Eq. 6. In [5], several more complicated weighting alternatives were tested in the context of information retrieval; in our text document clustering experiments, they did not perform better than the standard idf weights.
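The sketch below (ours; it assumes a dense N × V term-frequency matrix TF and ignores numerical edge cases such as all-zero denominators) illustrates Eqs. 5 and 6.

```python
import numpy as np

def cvm_semantic_matrix(TF):
    """Term context vectors of CVM-VSM (Eq. 5); TF[r, i] is the frequency of
    term i in document r."""
    co_occ = TF.T @ TF                                  # sum_r tf_ri * tf_rj
    row_tot = TF.sum(axis=1)                            # sum_q tf_rq per document
    denom = (TF * (row_tot[:, None] - TF)).sum(axis=0)  # sum_r tf_ri * sum_{q != i} tf_rq
    S = co_occ / denom[:, None]                         # asymmetric normalization
    np.fill_diagonal(S, 1.0)
    return S / np.linalg.norm(S, axis=1, keepdims=True) # Euclidean-normalize the rows

def cvm_map(TF, S_cvm):
    """Document images (Eq. 6): tf-weighted sum of the term context vectors."""
    return TF @ S_cvm
```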
In a higher semantic level than term co-occurrences, additional information for vocabulary terms provided by ontologies has also been exploited to compute the term similarities and to construct a proper semantic matrix. WordNet [22] and Wikipedia [30] have been used for this purpose in [6,15] and [29], respectively.
2.3 Discussion
Summarizing the properties of the above-mentioned vector-based document representations: in the traditional BOW approach, the dimensions of the term feature space are considered to be independent of each other. Such an assumption is very simplistic, since there exist semantic relations among terms that are ignored. The VSM-extensions aim to achieve semantic smoothing, a process that redistributes the term weights of a vector model, or maps data to a new feature space, by taking into account the correlations between terms. For instance, if the term 'child' appears in a document, then it could be assumed that the term 'kid' is also related to the specific document, or even terms like 'boy', 'girl', 'toy'. The resulting representation model is also a VSM, but the document vectors become less sparse and the independence of features is mitigated in an indirect way. The smoothing is usually achieved by a linear mapping of the data vectors to a new feature space using a semantic matrix S. It is convenient to think that the new document vector d' = S d contains the dot product similarities between the original BOW vector d and the rows of the semantic matrix S.
A basic difference between the various semantic smoothing methods is related to the dimension of the new feature space, which is determined by the number V' of row vectors of matrix S. In case their number is less than the size V of the vocabulary, such vectors are called concept vectors and are usually produced using the LSI method. Each concept vector has a distribution of weights associated with the V original terms that define their contribution to the corresponding concept. Of course, the resulting representation of the smoothed vector d' is less interpretable than the original, and there is always the problem of determining the proper number of concept vectors.
An alternative approach for semantic smoothing assumes that each row vector of matrix S is associated with one vocabulary term. Unlike a concept vector that describes abstract semantics of higher level, here the elements of each vector describe the relation of this term to the other terms. Those relations constitute the so-called term context, thus the respective vector is called a term context vector. Each element of the mapped vector d' will contain the dot product similarity between document d and the corresponding term context vector, i.e., for each term ν_i, the element d'_i provides the degree to which the original document d contains the term ν_i and its context, instead of just its frequency as happens in the BOW representation. Note also that in the BOW representation, a dot product would give zero similarity for two documents that do not have common terms. On the contrary, the dot product between a document vector and the context vector of a term ν_i that does not appear in that document may give a non-zero similarity. This happens if the document contains at least one term ν_j with non-zero weight in the context of term ν_i. For this reason, the smoothed representation d' is usually less sparse than d and retains the interpretability of its dimensions. Moreover, concept-based methods may be applied on the new representations.
The motivation of our work is to establish the importance of term context vectors and to define an efficient way to compute them. The CVM-VSM method computes the term context based on term co-occurrence frequency at the document level. It does not take into account the sequential nature of text and thus ignores the local distance of terms when computing term context. On the other hand, the GTCV-VSM proposed in this work extends the previous approach by considering term context at three levels: (i) it uses the notion of local term context vector (ltcv) to model the context around the location in the text sequence where a term appears. These vectors are computed using a local smoothing kernel as suggested in the LoWBOW approach [18], which is described in the next section. The kernel takes into account the distance at which other terms appear around the sequence location under consideration. (ii) It computes the document term context vector (dtcv) for each term, which summarizes the term context at the document level, and (iii) it computes the final global term context vector (gtcv) for each term, representing the overall term context at corpus level. The gtcv vectors constitute the rows of the semantic matrix S. Thus, the intuition behind the GTCV-VSM approach is to capture the local term context from term sequences and then to construct a representation for global term context by averaging ltcvs at the document and corpus level.
3 Utilizing local contextual information
A text document can be considered as a finite term sequence of its T consecutive terms, denoted as d^{seq} = ⟨d^{seq}(1), ..., d^{seq}(T)⟩, but, except for Bag of Phrases, the previously mentioned VSM-extensions ignore this property. A category of methods has been proposed aiming to capture local information directly from the term sequence of a document. The representation proposed in [27] first considers a segmentation of the sequence that is done by sliding a window of n terms along the sequence and computing a local BOW vector for each of the overlapping segments. All these local BOW vectors constitute the document representation called Local Word Bag (LWB). To compute the similarity between a pair of documents, the authors introduce a variant of the VG-Pyramid Matching Kernel [12] that maps the two sets of local BOW vectors to a multi-resolution histogram and computes a weighted histogram intersection.
Another approach for text representation, presented in [18], is the Locally Weighted Bag of Words (LoWBOW) that preserves local contextual information of text documents by effective modeling of the text sequential structure. At first, a number of L equally distant locations are defined in the term sequence. Each sequence location i, i = 1, ..., L is then associated with a local histogram, which is a point in the multinomial simplex P_{V-1}, where V is the number of vocabulary terms. More specifically, for (V-1) ≥ 0, the P_{V-1} space is the (V-1)-dimensional subset of R^V that contains all probability vectors (histograms) over V objects (for a discussion on the multinomial simplex see the Appendix of [18])

$P_{V-1} = \left\{ H \in \mathbb{R}^V : H_i \ge 0,\ i = 1, \ldots, V, \ \text{and} \ \sum_{i=1}^{V} H_i = 1 \right\}.$    (7)
Contrary to LWB, in LoWBOW the local histogram is computed using a smoothing kernel to weight the contribution of terms appearing around the referenced location in the term sequence and to assign more importance to closely neighboring terms. Denote as H_{δ(d^{seq}(t))} the trivial term histogram over the V terms whose probability mass is concentrated only at the term that occurs at location t in d^{seq}

$\left[ H_{\delta(d^{seq}(t))} \right]_i = \begin{cases} 1, & i = d^{seq}(t) \\ 0, & i \ne d^{seq}(t) \end{cases}, \quad i = 1, \ldots, V;$    (8)

then the locally smoothed histogram at a location μ in the d^{seq} term sequence is computed as in [18]

$lowbow(d^{seq}, \mu) = \sum_{t=1}^{T} H_{\delta(d^{seq}(t))}\, K_{\mu,\sigma}(t),$    (9)
where T is the length of d^{seq}. K_{μ,σ}(t) denotes the weight for location t in the sequence, given by a discrete Gaussian weighting kernel function of mean value μ and standard deviation σ. Specifically, the weighting function is a Gaussian probability density function restricted in [1, T] and renormalized so that Σ_{t=1}^{T} K_{μ,σ}(t) = 1. It is easy to verify that the result of the histogram smoothing of Eq. 9 is also a histogram.
It must be noted that for σ = 0, the lowbow histogram (Eq. 9) coincides with the trivial histogram H_{δ(d^{seq}(μ))}, where all the probability mass is concentrated at the term at location μ. As σ grows, part of the probability mass is transferred to the terms occurring near location μ. In this way, the lowbow histogram at location μ is enriched with information about the terms occurring in the neighborhood of μ. The smoothing parameter σ adjusts the 'locality' of term semantics that is taken into account by the model. Thus, instead of mining unordered local vectors as in [27], the LoWBOW approach embeds the term sequence of a document in the P_{V-1} simplex. The sequence of the L locally smoothed histograms (denoted as lowbow histograms) forms a curve in the (V-1)-dimensional simplex (denoted as the LoWBOW curve).
Figure 1 illustrates the LoWBOW curves generated for a toy example and describes the role of parameter σ. In this figure, we aim to illustrate (i) the LoWBOW curve representation, i.e., the curve that corresponds to a sequence of histograms (local context vectors), where each local context vector is computed at a specific location of the sequence and corresponds to a point in the (V-1)-dimensional simplex; and (ii) the impact of the smoothing coefficient σ on the computed local context vectors. It is illustrated that the increase in smoothing makes the lowbow histograms (points of the curve) more similar. This can also be verified by observing that as smoothing increases, the curve becomes more concentrated around a central location of the simplex. For σ = ∞, all histograms become similar to the BOW representation and the curve reduces to a single point. On the contrary, for σ = 0, the histograms correspond to simplex corners.
A similarity measure between LoWBOW curves has been proposed in [18] that assumes a sequential correspondence between two documents and computes the sum of the similarities between the L pairs of LoWBOW histograms. Obviously, it is expected for this similarity measure to underestimate the thematic similarity between documents that follow a different order in the presentation of similar semantic content.
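To make the local smoothing concrete, here is a minimal sketch (ours, not taken from [18]; it assumes the document is given as a sequence of 0-based vocabulary indices) of the lowbow histogram of Eqs. 8 and 9.

```python
import numpy as np

def lowbow_histogram(d_seq, mu, sigma, V):
    """Locally smoothed histogram at location mu of the term sequence d_seq (Eq. 9).
    d_seq is an integer array of vocabulary indices; V is the vocabulary size."""
    d_seq = np.asarray(d_seq)
    T = len(d_seq)
    if sigma == 0:
        K = np.zeros(T)
        K[mu] = 1.0                                  # degenerate kernel: trivial histogram (Eq. 8)
    else:
        t = np.arange(T)
        K = np.exp(-0.5 * ((t - mu) / sigma) ** 2)   # Gaussian kernel restricted to the sequence
        K /= K.sum()                                 # renormalize so the weights sum to 1
    h = np.zeros(V)
    np.add.at(h, d_seq, K)                           # sum_t H_delta(d_seq(t)) * K(t)
    return h                                         # a point in the simplex P_{V-1}
```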
4 A semantic matrix based on global term context vectors
In this section, we present the global term context vector-VSM (GTCV-VSM) approach for capturing the semantics of the original term feature space of a document collection. The method computes the contextual information of each vocabulary term, which is subsequently utilized in order to create a semantic matrix. In analogy with CVM-VSM, our approach reduces data sparsity but not dimensionality. The interpretability of the derived vector dimensions remains as strong as in the BOW model, as the value of each dimension of the mapped vector corresponds to one vocabulary term. Methods that reduce data dimensionality could also be applied on the new representations at a subsequent phase. Compared with CVM-VSM, GTCV-VSM generalizes the way the term context is computed by taking into account the distance between terms in the term sequence of each document. This is achieved by exploiting the idea of LoWBOW to describe the local contextual information at a certain location in a term sequence. It must be noted that our method borrows from the LoWBOW approach only the way the local histogram is computed at each location of the term sequence and does not make use of the LoWBOW curve representation.
Fig. 1 A toy example of a term sequence over three different terms ν1, ν2, ν3 (vocabulary size: V = 3). The subfigures (a-d) present LoWBOW curves in the (V-1)-dimensional simplex for increasing values of the parameter σ, which induces more smoothing to the curve. Each point of the curve corresponds to a local histogram computed at a sequence location. The more a term affects the local context at a location in the sequence, the more the curve point (the lowbow histogram related to that location) moves toward the respective corner of the simplex. For σ = 0, local histograms correspond to simplex corners; thus, the curve moves from corner to corner of the simplex. Two different sampling rates for the LoWBOW representation are illustrated: sampling at every term location in the sequence (dashed line), which is our strategy to collect contextual information for each term, and sampling every two terms (solid line). d For σ = ∞, the LoWBOW curve reduces to a single point that coincides with the BOW histogram of the sequence. In d, we present as 'stars' the average ltcv histograms for each term (dtcv histograms) for the three different values of σ and α = 0.6 for all terms. As the value of σ increases, the dtcv histograms of all terms become more similar, tending to coincide with the BOW representation
More specifically, we define the local term context vector (ltcv) as a histogram associated with the exact occurrence of term d^{seq}(ℓ) at location ℓ in a sequence d^{seq}. Hence, one ltcv vector is computed at every location in the term sequence, i.e., ℓ = 1, ..., T. Note that GTCV-VSM does not preserve any curve representation. This means that we are not interested in the temporal order of the local term context vectors. The ltcv(d^{seq}, ℓ) is a modified lowbow(d^{seq}, ℓ) probability vector that represents contextual information around location ℓ, while adjusting explicitly the self-weight α_{d^{seq}(ℓ)} of the reference term appearing
Fig. 2 Various weight distributions for the neighboring terms around a reference term occurring in the middle of a term sequence of length 50. The distributions are obtained by varying the value of parameter α in Eq. 10, and they define the contribution of each term to the context of the specific reference term. The scale value of the local kernel is set to σ = 5, while the self-weight α is set to 0.05 (left), 0.10 (middle), 0.2 (right)
at location ℓ:

$[ltcv(d^{seq}, \ell)]_i = \begin{cases} \alpha_{d^{seq}(\ell)}, & i = d^{seq}(\ell) \\ \left(1 - \alpha_{d^{seq}(\ell)}\right) \cdot \dfrac{idf_i \cdot [lowbow(d^{seq}, \ell)]_i}{\sum_{j=1, j \ne i}^{V} idf_j \cdot [lowbow(d^{seq}, \ell)]_j}, & i \ne d^{seq}(\ell). \end{cases}$    (10)
The self-weight (0 ≤ α_{d^{seq}(ℓ)} ≤ 1) adjusts the relative importance between contextual information (computed using the lowbow histogram) and the self-representation of each term. Figure 2 illustrates an example of how the value of parameter α affects the local term weighting around a reference term in a sequence. When the parameter σ of the Gaussian smoothing kernel is set to zero, or α = 1, the ltcv(d^{seq}, ℓ) reduces to the trivial histogram H_{δ(d^{seq}(ℓ))} (see Eq. 8). The other extreme is an infinite σ value, where for small α values, all the ltcv computed in a document d become similar to the tf histogram of that document.
The latter observation is the reason for considering an explicit self-weight in Eq. 10, because a flat smoothing kernel obtained for a large σ value can make a lowbow vector have an improperly low self-weight for the reference term. For example, if a term appears once in a document, then the lowbow vector with σ = ∞ at that location would contain very low weight for that term. Generally, the value of α_ν determines how much the context vector of term ν should be dominated by the self-weight of term ν. In our method, we set this parameter independently for each individual term as a function of its idf_ν component

$\alpha_{\nu} = \lambda + (1 - \lambda) \cdot \left( 1 - \frac{idf_{\nu}}{\log N} \right) \in [0, 1],$    (11)

where λ is a lower bound for all α_ν, ν = 1, ..., V (in our experiments we used λ = 0.2). The rationale for the above equation is that for terms with high document frequency (i.e., low idf_ν), we assign high α_ν values that suppress the local context in the respective context vectors. In other words, the context is considered more important for terms that occur in fewer documents. In Fig. 3a, we present an example illustrating the ltcv vectors of the two term sequences presented in Fig. 3c.
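The following sketch (ours; it reuses lowbow_histogram from the sketch in Sect. 3 and transcribes Eqs. 10 and 11 as printed, including the per-element denominator) computes a local term context vector.

```python
import numpy as np

def self_weight(idf_v, n_docs, lam=0.2):
    """Term self-weight alpha_v (Eq. 11): frequent terms (low idf) receive a
    larger self-weight, which suppresses their context."""
    return lam + (1.0 - lam) * (1.0 - idf_v / np.log(n_docs))

def ltcv(d_seq, ell, sigma, idf, n_docs, lam=0.2):
    """Local term context vector at location ell of d_seq (Eq. 10)."""
    V = len(idf)
    ref = d_seq[ell]                                  # index of the reference term
    alpha = self_weight(idf[ref], n_docs, lam)
    w = idf * lowbow_histogram(d_seq, ell, sigma, V)  # idf_j * [lowbow]_j
    out = np.zeros(V)
    for i in range(V):
        denom = w.sum() - w[i]                        # sum over j != i, as printed in Eq. 10
        if denom > 0:
            out[i] = (1.0 - alpha) * w[i] / denom
    out[ref] = alpha                                  # the reference term keeps weight alpha
    return out
```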
(Figure 3 panels: a Local term context histograms (columns) for documents A and B over their term sequences; b Averaged term context histograms (columns) over the vocabulary; c the two term sequences. The 13-term stem vocabulary shown is: advanc (v1), electron (v2), commun (v3), help (v4), conduct (v5), busi (v6), interoper (v7), problem (v8), applic (v9), profession (v10), product (v11), commerc (v12), secur (v13).)
Fig. 3 An example of how ltcv histograms are used to summarize the overall context in which a term appears in the two term sequences of (c), using Eq. 14. a The term sequences (x-axis) of documents A and B are presented and the corresponding local term context vectors are illustrated as gray-scaled columns. Those vectors are computed at every location in the sequence using a Gaussian smoothing kernel with σ = 1 and α = 0.6 for all terms. Brighter intensity at cell i, j indicates a higher contribution of the term ν_i to the local context of the term appearing at location j in the sequence. b The resulting transposed semantic matrix (S^T), where the gray-scaled columns illustrate the global contextual information for each vocabulary term computed by averaging the respective local context histograms (Eq. 13). c The two initial term sequences (the stem of each non-trivial term is emphasized). Assuming the same idf weight for each vocabulary term, the table presents the BOW vector, the transformed vector d' using Eq. 14, as well as the effect of semantic smoothing (diff = BOW - d') on the document vectors. The redistribution of term weights that results from the proposed mapping is such that low-frequency terms gain weight against the more frequent ones. Note also that the similarity between the two documents is 0.756 for the BOW model and 0.896 for the GTCV-VSM
We further define the document term context vector (dtcv) as a probability vector that summarizes the context of a specific term at the document level by averaging the ltcv histograms corresponding to the occurrences of this term in the document. More specifically, suppose that a term ν appears no_{ν,i} > 0 times in the term sequence d_i^{seq} (i.e., in the i-th document), which is of length T_i. Then, the dtcv of this term ν for document i is computed as:

$dtcv_{\nu}\left(d_i^{seq}\right) = \frac{1}{no_{\nu,i}} \sum_{j=1}^{no_{\nu,i}} ltcv\left(d_i^{seq}, \ell_i(j)\right),$    (12)
where ℓ_i(j) is an integer value in [1, ..., T_i] denoting the location of the j-th occurrence of ν in d_i^{seq}.
Next, the global term context vector (gtcv) is defined for a vocabulary term ν so as to represent the overall contextual information for all appearances of ν in the corpus of all N term sequences (documents):

$gtcv(\nu) = h_{gtcv}(\nu) \sum_{i=1}^{N} tf_{i,\nu}\, dtcv_{\nu}\left(d_i^{seq}\right).$    (13)

The coefficient h_{gtcv}(ν) normalizes the vector gtcv(ν) with respect to the Euclidean norm, and tf_{i,ν} is the frequency of the term ν in the i-th document. Thus, the gtcv(ν) of term ν is computed using a weighted average of the document context vectors dtcv_ν(d_i^{seq}) obtained for each document i in which term ν appears. In contrast to the LoWBOW curve approach, which focuses on the sequence of local histograms that describe the writing structure of a document, our method focuses on the extraction of the global semantic context of a term by averaging the local contextual information at all the corpus locations where this term appears.
Finally, the extracted global contextual information is used to construct the V × V semantic matrix S_{gtcv}, where each row ν is the gtcv(ν) vector of the corresponding vocabulary term ν. Figure 1d provides an example illustrating the dtcv_ν(d_i^{seq}) vectors for each document (the points denoted as 'stars'). Figure 3b illustrates the final gtcv vectors obtained by averaging the document-level contexts for each vocabulary term.
To map a document using the proposed global term context vector-VSM approach, we compute the vector d' where each element ν is the Cosine similarity between the BOW representation d of the document and the global term context vector gtcv(ν):

$gtcv: d \mapsto d' = S_{gtcv}\, d, \quad d' \in \mathbb{R}^{V}.$    (14)

Note that the transformed document vector d' is V-dimensional and retains the interpretability, since each dimension still corresponds to a unique vocabulary term. Moreover, if σ = 0 and α > 0, then S_{gtcv} d = d. Looking at Eq. 4, the product S_{gtcv}^{\top} S_{gtcv} essentially computes a Term Similarity Matrix where the similarity between two terms is based on the distribution of term weights in their respective global term context vectors, i.e., on the similarity of their global context histograms. The table of Fig. 3c illustrates the effect of redistribution (compared with BOW) of the term weights (semantic smoothing) in the transformed document vectors achieved by the proposed mapping.
The procedure of representing the input documents using GTCV-VSM takes place in the preprocessing phase. Let T_i be the length of the i-th document and V_i its vocabulary. Let also V be the size of the whole corpus vocabulary. Then, the cost to compute one ltcv vector at a location of the term sequence using Eq. 10, and to add its V_i non-zero dimensions to the respective dtcv, is O(T_i + V_i). This is done T_i times, and the final dtcv of each different term of the document is added to the respective gtcv row. Thus, denoting by T̄_i and V̄_i the average length and vocabulary size of the documents in a corpus, the cost of constructing the semantic matrix can be expressed as O(N · T̄_i · (T̄_i + 2 · V̄_i)). However, since V̄_i ≤ T̄_i ≪ V, the overall computational cost of the GTCV-VSM is determined by the O(N · V²) cost of the matrix multiplication of the mapping of Eq. 14.
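Putting the pieces together, a minimal end-to-end sketch of Eqs. 12-14 (ours; it reuses the ltcv function above, assumes each document is given as an integer index sequence, and uses σ = 10, λ = 0.2 only as illustrative defaults):

```python
import numpy as np

def gtcv_semantic_matrix(corpus_seqs, idf, sigma=10.0, lam=0.2):
    """Build the V x V semantic matrix S_gtcv: average the ltcv vectors of each
    term within a document (Eq. 12), accumulate the tf-weighted dtcv vectors
    over the corpus (Eq. 13), and Euclidean-normalize each row."""
    V, N = len(idf), len(corpus_seqs)
    S = np.zeros((V, V))
    for d_seq in corpus_seqs:
        d_seq = np.asarray(d_seq)
        sums = np.zeros((V, V))
        counts = np.zeros(V)
        for ell in range(len(d_seq)):
            v = d_seq[ell]
            sums[v] += ltcv(d_seq, ell, sigma, idf, N, lam)
            counts[v] += 1
        present = counts > 0
        dtcv = sums[present] / counts[present][:, None]   # Eq. 12: per-document average
        S[present] += counts[present][:, None] * dtcv     # Eq. 13: tf_{i,v}-weighted sum
    norms = np.linalg.norm(S, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    return S / norms                                       # rows are the gtcv vectors

def gtcv_map(D_bow, S_gtcv):
    """Eq. 14: map normalized BOW vectors (rows of D_bow) to the smoothed space."""
    return D_bow @ S_gtcv.T
```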
Table 1 Characteristics of text document collections

Name  Topics                                                                      Classes   N      Balance    V      V̄_i   T̄_i
D1    20-NGs: graphics, windows.x, motor, baseball, space, mideast                   6     2,000   200/400    4,343  48.8  110
D2    20-NGs: atheism, autos, baseball, electronics, med, mac, motor, politics.misc  7     3,500   500/500    6,442  52.6  108
D3    20-NGs: atheism, christian, guns, mideast                                      4     1,600   400/400    4,080  62    131
D4    20-NGs: forsale, autos, baseball, motor, hockey                                5     1,250   250/250    4,762  44.1  104
D5    Reuters-21578: acq, corn, crude, earn, grain, interest, money-fx, ship,       10     9,979   237/3,964  5,613  39.1   76
      trade, wheat

N denotes the number of documents, V is the size of the global vocabulary and V̄_i the average document vocabulary, Balance is the ratio of the smallest to the largest class, and T̄_i is the average length of the term sequences of documents
5 Clustering experiments
Our experimental setup was based on five different data sets: D1-D4 are subsets of the 20-Newsgroups,¹ while D5 is the Mod Apte split [2] version of the Reuters-21578² benchmark document collection, where the 10 classes with the largest number of training examples are kept. The characteristics of these data sets are presented in Table 1. The preprocessing of the data sets included the removal of all tags, headers, and metadata from the documents, while we applied word stemming and discarded terms appearing in fewer than five documents. It is worth mentioning how we preprocessed the term sequences of documents. We considered a dummy term that replaced in the sequences all the low-frequency terms that were discarded, so as to maintain the relative distance between the terms that remained in each sequence. For similar reasons, two dummy terms were considered at the end of every sentence, denoted by punctuation characters (e.g., '.', '?', '!'). The dummy term is ignored when constructing the final data vectors.
¹ http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz
² http://www.daviddlewis.com/resources/testcollections/reuters21578/reuters21578.tar.gz
For each data set, we have considered several data mappings, and after each mapping, the spherical k-means (spk-means) [8] and spectral clustering (spectral-c) [24] algorithms were applied to cluster the mapped document vectors into the k predefined number of clusters corresponding to the different topics (classes) in a collection. In contrast to k-means, which is based on the Euclidean distance [21], spk-means uses the Cosine similarity and maximizes the Cohesion of the clusters C = {c_1, ..., c_k}

$Cohesion(C) = \sum_{j=1}^{k} \sum_{d_i \in c_j} u_j^{\top} d_i,$    (15)

where u_j is the normalized centroid of cluster c_j with respect to the Euclidean norm.
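A minimal sketch of spherical k-means (ours; the random initialization and the fixed iteration cap are assumptions, not details taken from [8]):

```python
import numpy as np

def spherical_kmeans(X, k, n_iter=100, seed=0):
    """Spherical k-means on the rows of X (documents); assignments and the
    unit-normalized centroids greedily increase the Cohesion of Eq. 15."""
    rng = np.random.default_rng(seed)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)      # unit-length documents
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        labels = np.argmax(X @ centroids.T, axis=1)       # assign by cosine similarity
        new_centroids = np.vstack([
            X[labels == j].sum(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)])
        norms = np.linalg.norm(new_centroids, axis=1, keepdims=True)
        new_centroids /= np.where(norms == 0, 1.0, norms) # renormalize the centroids
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```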
Spectral clustering projects the document vectors onto a subspace that is spanned by the k largest eigenvectors of the Laplacian matrix L computed from the N × N similarity matrix A of pairwise Cosine similarities between documents. More specifically, the Laplacian matrix is computed as L = D^{-1/2} A D^{-1/2}, where D is a diagonal matrix whose i-th diagonal element contains the sum of the i-th row of similarities, D_{ii} = Σ_{j=1}^{N} A_{ij}. The next step is the construction of an N × k matrix X = {x_i : i = 1, ..., k} whose columns correspond to the k largest eigenvectors of L. The standard k-means algorithm is then used to cluster the rows of matrix X after they are normalized to unit length in Euclidean space, where the i-th row is the vector representation of the i-th document in the new feature space.
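A sketch of this spectral embedding (ours; the final k-means step is delegated to any standard implementation):

```python
import numpy as np

def spectral_embedding(X, k):
    """Compute the N x k spectral embedding described above: cosine similarity
    matrix A, normalized Laplacian L = D^{-1/2} A D^{-1/2}, the k largest
    eigenvectors, and row normalization; standard k-means is then run on the rows."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    A = Xn @ Xn.T                                    # pairwise cosine similarities
    d = A.sum(axis=1)                                # row sums D_ii
    d_inv_sqrt = 1.0 / np.sqrt(d)
    L = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(L)                   # eigenvalues in ascending order
    U = eigvecs[:, -k:]                              # k largest eigenvectors as columns
    return U / np.linalg.norm(U, axis=1, keepdims=True)
```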
Clustering evaluation was based on the supervised measure Normalized Mutual Information (NMI) and the F1-measure. We denote as n_i^{gt} the number of documents of class i, n_j the size of cluster j, n_{ij} the number of documents belonging to class i that are clustered in cluster j, and C^{gt} the grouping based on the ground truth labels of documents c_1^{gt}, ..., c_k^{gt} (true classes). Let us further denote p(c_i^{gt}) = n_i^{gt}/N and p(c_j) = n_j/N the probabilities of arbitrarily selecting a document from the data set that belongs to class c_i^{gt} and to cluster c_j, respectively, and p(c_i^{gt}, c_j) = n_{ij}/N the joint probability of arbitrarily selecting a document from the data set that belongs to cluster c_j and is of class c_i^{gt}. Then, the [0,1]-Normalized MI measure is computed by dividing the Mutual Information by the maximum between the cluster and class entropy:

$NMI(C^{gt}, C) = \frac{\sum_{c_i^{gt} \in C^{gt}} \sum_{c_j \in C} p\left(c_i^{gt}, c_j\right) \log \frac{p\left(c_i^{gt}, c_j\right)}{p\left(c_i^{gt}\right) p\left(c_j\right)}}{\max\left\{ H\left(C^{gt}\right), H(C) \right\}}.$    (16)

When C and C^{gt} are independent, the value of NMI equals zero, while it equals one if these partitions contain identical clusters.
The F1-measure is the harmonic mean of the precision and recall measures of the clustering solution:

$F_1 = \frac{2 \cdot precision \cdot recall}{precision + recall}.$    (17)

Higher values of F1 in [0,1] indicate better clustering solutions.
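For reference, a small sketch of the NMI computation of Eq. 16 (ours; it expects two integer label arrays of equal length):

```python
import numpy as np

def entropy(labels):
    """Entropy of a labeling, used in the denominator of Eq. 16."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log(p)))

def nmi(labels_true, labels_pred):
    """Normalized Mutual Information (Eq. 16): MI divided by the maximum of
    the class entropy and the cluster entropy."""
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    n = len(labels_true)
    mi = 0.0
    for c in np.unique(labels_true):
        for j in np.unique(labels_pred):
            p_ij = np.sum((labels_true == c) & (labels_pred == j)) / n
            if p_ij > 0:
                p_c = np.sum(labels_true == c) / n
                p_j = np.sum(labels_pred == j) / n
                mi += p_ij * np.log(p_ij / (p_c * p_j))
    return mi / max(entropy(labels_true), entropy(labels_pred))
```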
Tables 2, 3, 5, and 6 present the results of the experiments conducted for each collection. Specifically, we compared the classic BOW representation, the GVSM, the proposed GTCV-VSM method (with λ = 0.2 in Eq. 11), which represents the documents as described in Eq. 14, and the CVM-VSM as proposed in [5], where document vectors are computed based on Eq. 6 with idf weights. More specifically, for each collection, each representation method was tested for 100 runs of spk-means (Tables 2, 3, 4) and spectral-c (Tables 5, 6, 7). To provide fair comparative results, for each document collection, all methods were initialized using the same random document seeds. The average of all runs (avg), the average of the worst 10% of the clustering solutions (avg10%), and the best values are reported for each performance measure. The worst 10% concerns the 10% of the solutions with the lowest Cohesion, while the best clustering solution is that having the maximum Cohesion in the 100 runs (for spectral-c the sum of squared distances is considered for this purpose). The best result for each data set is emphasized in each of the avg and best columns. Moreover, in Fig. 4, we present the average clustering performance of spk-means with respect to the value of the λ parameter of Eq. 11 where, although not best for all cases, the value 0.2 we used seems to be a reasonable choice for all the data sets we have considered. Note that a similar effect was observed for the spectral-c method.
In order to illustrate the statistical significance of the obtained results, the well-known t-test was applied for each data set to determine the significance of the performance difference between our method and the compared methods. We have considered the case where σ = 10 for the Gaussian kernel for all data sets. Within a confidence interval of 95% and for the value of degrees of freedom equal to 198 (for two sets of 100 experiments each),
Table 2 NMI values of the clustering solution for VSM (BOW), GVSM, CVM-VSM and the proposed GTCV-VSM (for several values of σ) document representations using the spk-means algorithm

Method  σ   | D1: avg best avg10% | D2: avg best avg10% | D3: avg best avg10% | D4: avg best avg10% | D5: avg best avg10%
BOW     -   | 0.722 0.821 0.594 | 0.748 0.829 0.638 | 0.537 0.548 0.379 | 0.625 0.779 0.505 | 0.552 0.562 0.535
GTCV    1   | 0.749 0.854 0.601 | 0.767 0.845 0.638 | 0.544 0.564 0.372 | 0.667 0.793 0.515 | 0.570 0.578 0.561
GTCV    2   | 0.756 0.871 0.631 | 0.765 0.852 0.657 | 0.563 0.574 0.396 | 0.670 0.832 0.539 | 0.572 0.580 0.561
GTCV    5   | 0.773 0.881 0.687 | 0.777 0.864 0.662 | 0.577 0.602 0.400 | 0.688 0.851 0.539 | 0.589 0.633 0.578
GTCV    10  | 0.777 0.886 0.685 | 0.781 0.873 0.672 | 0.590 0.621 0.424 | 0.684 0.849 0.540 | 0.590 0.630 0.580
GTCV    30  | 0.761 0.879 0.659 | 0.776 0.863 0.653 | 0.579 0.590 0.369 | 0.683 0.842 0.518 | 0.576 0.612 0.568
GTCV    inf | 0.760 0.862 0.631 | 0.772 0.862 0.639 | 0.574 0.586 0.366 | 0.681 0.840 0.521 | 0.576 0.610 0.566
GVSM    -   | 0.752 0.832 0.611 | 0.747 0.822 0.637 | 0.556 0.576 0.419 | 0.670 0.827 0.547 | 0.575 0.580 0.573
CVM     -   | 0.750 0.841 0.612 | 0.754 0.851 0.659 | 0.547 0.604 0.400 | 0.672 0.824 0.541 | 0.578 0.581 0.575
Table 3 F1-measure values of the spk-means clustering solution for the different representation methods

Method  σ   | D1: avg best avg10% | D2: avg best avg10% | D3: avg best avg10% | D4: avg best avg10% | D5: avg best avg10%
BOW     -   | 0.779 0.920 0.685 | 0.780 0.901 0.645 | 0.703 0.706 0.570 | 0.735 0.918 0.558 | 0.675 0.697 0.646
GTCV    1   | 0.806 0.940 0.688 | 0.790 0.921 0.650 | 0.709 0.713 0.576 | 0.755 0.920 0.561 | 0.691 0.695 0.677
GTCV    2   | 0.814 0.946 0.688 | 0.792 0.924 0.674 | 0.721 0.728 0.580 | 0.764 0.938 0.598 | 0.698 0.714 0.672
GTCV    5   | 0.828 0.953 0.722 | 0.817 0.929 0.665 | 0.736 0.737 0.597 | 0.773 0.948 0.611 | 0.712 0.751 0.681
GTCV    10  | 0.832 0.954 0.733 | 0.820 0.936 0.603 | 0.737 0.739 0.603 | 0.773 0.947 0.581 | 0.712 0.749 0.681
GTCV    30  | 0.814 0.950 0.747 | 0.794 0.929 0.657 | 0.725 0.727 0.576 | 0.766 0.944 0.579 | 0.698 0.746 0.666
GTCV    inf | 0.813 0.942 0.689 | 0.792 0.926 0.651 | 0.722 0.728 0.576 | 0.765 0.944 0.581 | 0.698 0.744 0.666
GVSM    -   | 0.790 0.923 0.705 | 0.783 0.903 0.640 | 0.706 0.71 0.576 | 0.750 0.943 0.591 | 0.687 0.720 0.672
CVM     -   | 0.765 0.941 0.672 | 0.790 0.930 0.672 | 0.708 0.725 0.576 | 0.751 0.934 0.604 | 0.685 0.716 0.669
Table 4 The p and t values of the statistical significance t-test of the difference in spk-means performance using GTCV-VSM (σ = 10) and the compared representation methods, with respect to the two evaluation measures

GTCV (σ=10) vs | D1: p-val t-val    | D2: p-val t-val    | D3: p-val t-val    | D4: p-val t-val    | D5: p-val t-val
BOW (NMI)      | 0.011×10^-6  5.98  | 0.075×10^-3  4.05  | 0.025×10^-6  5.81  | 0.080×10^-8  6.45  | 0.0000  12.8
GVSM (NMI)     | 0.00008      2.68  | 0.081×10^-3  4.02  | 0.050×10^-3  4.15  | 0.085        1.73  | 0.056×10^-5  5.17
CVM (NMI)      | 0.0051       2.83  | 0.0010       3.33  | 0.052×10^-4  4.65  | 0.1659       1.39  | 0.077×10^-3  4.04
BOW (F1)       | 0.020×10^-5  5.39  | 0.050×10^-2  3.54  | 0.046×10^-2  3.56  | 0.0010       3.32  | 0.0000  12.8
GVSM (F1)      | 0.037×10^-3  4.22  | 0.00021      3.11  | 0.067×10^-2  3.45  | 0.0329       2.15  | 0.0000  9.06
CVM (F1)       | 0.081×10^-3  4.02  | 0.06×10^-8   6.50  | 0.0027       3.04  | 0.0314       2.18  | 0.0000  9.31

Values of p smaller than the significance level of 0.05 (5%) indicate significant superiority of GTCV-VSM
Table 5 NMI values of the clustering solution for VSM (BOW), GVSM, CVM-VSM and the proposed GTCV-VSM (for several values of σ) document representations using the spectral clustering algorithm

Method  σ   | D1: avg best avg10% | D2: avg best avg10% | D3: avg best avg10% | D4: avg best avg10% | D5: avg best avg10%
BOW     -   | 0.753 0.761 0.750 | 0.781 0.788 0.737 | 0.569 0.585 0.555 | 0.718 0.780 0.631 | 0.558 0.559 0.506
GTCV    1   | 0.770 0.774 0.769 | 0.790 0.795 0.750 | 0.614 0.626 0.600 | 0.735 0.779 0.642 | 0.560 0.561 0.516
GTCV    2   | 0.781 0.785 0.760 | 0.790 0.794 0.757 | 0.625 0.632 0.601 | 0.752 0.789 0.649 | 0.562 0.564 0.523
GTCV    5   | 0.794 0.804 0.790 | 0.833 0.853 0.763 | 0.639 0.640 0.619 | 0.768 0.827 0.669 | 0.579 0.600 0.557
GTCV    10  | 0.807 0.814 0.801 | 0.833 0.853 0.761 | 0.645 0.648 0.620 | 0.758 0.819 0.661 | 0.581 0.589 0.558
GTCV    30  | 0.791 0.796 0.769 | 0.807 0.832 0.743 | 0.613 0.613 0.609 | 0.755 0.797 0.647 | 0.567 0.582 0.535
GTCV    inf | 0.774 0.782 0.767 | 0.794 0.794 0.722 | 0.619 0.619 0.610 | 0.749 0.793 0.637 | 0.560 0.568 0.530
GVSM    -   | 0.756 0.770 0.702 | 0.794 0.830 0.747 | 0.593 0.595 0.586 | 0.722 0.780 0.637 | 0.548 0.554 0.513
CVM     -   | 0.761 0.768 0.751 | 0.801 0.823 0.760 | 0.605 0.606 0.590 | 0.728 0.794 0.642 | 0.557 0.566 0.519
Table 6 F1-measure values of the spectral clustering solution for the different representation methods

Method  σ   | D1: avg best avg10% | D2: avg best avg10% | D3: avg best avg10% | D4: avg best avg10% | D5: avg best avg10%
BOW     -   | 0.801 0.811 0.780 | 0.819 0.822 0.767 | 0.710 0.723 0.701 | 0.808 0.911 0.697 | 0.666 0.669 0.654
GTCV    1   | 0.811 0.819 0.809 | 0.822 0.832 0.772 | 0.729 0.741 0.728 | 0.834 0.915 0.722 | 0.694 0.703 0.663
GTCV    2   | 0.818 0.823 0.806 | 0.837 0.841 0.779 | 0.733 0.746 0.732 | 0.865 0.922 0.725 | 0.689 0.703 0.652
GTCV    5   | 0.837 0.840 0.818 | 0.887 0.927 0.792 | 0.744 0.756 0.737 | 0.870 0.930 0.740 | 0.716 0.727 0.647
GTCV    10  | 0.840 0.842 0.826 | 0.890 0.925 0.788 | 0.754 0.759 0.742 | 0.865 0.929 0.736 | 0.710 0.725 0.654
GTCV    30  | 0.823 0.826 0.809 | 0.856 0.886 0.769 | 0.726 0.735 0.725 | 0.864 0.925 0.705 | 0.704 0.701 0.642
GTCV    inf | 0.814 0.817 0.806 | 0.826 0.832 0.734 | 0.728 0.735 0.729 | 0.859 0.922 0.703 | 0.692 0.686 0.653
GVSM    -   | 0.756 0.770 0.702 | 0.826 0.901 0.780 | 0.709 0.714 0.724 | 0.823 0.916 0.705 | 0.642 0.657 0.654
CVM     -   | 0.761 0.768 0.779 | 0.831 0.897 0.791 | 0.725 0.725 0.723 | 0.825 0.916 0.713 | 0.673 0.678 0.654
Table 7 The p and t values of the statistical significance t-test of the difference in spectral clustering performance using GTCV-VSM (σ = 10) and the compared representation methods, with respect to the two evaluation measures

GTCV (σ=10) vs | D1: p-val t-val | D2: p-val t-val    | D3: p-val t-val | D4: p-val t-val    | D5: p-val t-val
BOW (NMI)      | 0.0000  27.3    | 0.0000       13.8  | 0.0000  620     | 0.026×10^-4  4.85  | 0.0000  8.03
GVSM (NMI)     | 0.0000  16.7    | 0.0000       7.51  | 0.0000  130     | 0.129×10^-5  4.99  | 0.0000  12.1
CVM (NMI)      | 0.0000  19.3    | 0.150×10^-8  6.35  | 0.0000  138     | 0.316×10^-3  3.67  | 0.0000  8.83
BOW (F1)       | 0.0000  24.1    | 0.0000       11.4  | 0.0000  875     | 0.123×10^-4  4.48  | 0.0000  19.1
GVSM (F1)      | 0.0000  15.1    | 0.0000       7.53  | 0.0000  410     | 0.113×10^-2  3.31  | 0.0000  30.7
CVM (F1)       | 0.0000  18.7    | 0.0000       7.11  | 0.0000  268     | 0.115×10^-3  3.94  | 0.0000  14.1

Values of p smaller than the significance level of 0.05 (5%) indicate significant superiority of GTCV-VSM
Fig. 4 The effect of varying the parameter λ on the spk-means clustering performance for each data set. Eq. 11 is used to determine the term self-weight α_ν when computing the ltcv histograms

the critical value for t is t_c = 1.972 (p_c = 5% for the p value). This means that if the computed t ≥ t_c, then the null hypothesis is rejected (p ≤ 5%, respectively), i.e., our method is superior; otherwise, the null hypothesis is accepted. As can be observed from the results of the statistical tests for spk-means presented in Table 4, the performance superiority of GTCV-VSM is clearly significant in four out of five data sets with respect to all other methods. For data set D4, the tests indicate that GTCV-VSM, although still better than BOW, has a less significant difference in performance compared with GVSM and CVM-VSM. Table 7 provides the respective t-test results for the spectral-c method where, also due to the lower standard deviation of the results using all document representation methods, the GTCV-VSM demonstrates significantly better results than the compared representations.
The experimental results indicate that our method outperforms the traditional BOW approach in all cases, even for small values of the smoothing parameter σ (e.g., σ = 1 or 2). This substantiates our rationale that the clustering procedure is assisted by the proposed semantic
smoothing, which takes into account the local contextual information associated with a term occurrence. GTCV-VSM requires moderate values of the parameter σ to achieve better performance. The same is observed for the quality (in terms of NMI or F1) of the best solution (i.e., the one with maximum Cohesion) found in the 100 runs, where moderate values of σ (i.e., σ = 5 or 10) result in better GTCV-VSM performance. Moreover, the clustering results for a wide range of values of the smoothing parameter σ indicate that the method is quite robust to the specification of this parameter. GTCV-VSM behaves similarly to BOW when a low value is set for σ, while when this value becomes very high, the discriminative information of the global term context vectors is reduced. This was demonstrated using the spk-means and spectral clustering methods. Among them, the latter in all cases except D5 presented better average clustering solutions in terms of both evaluation measures NMI and F1, while interestingly, spk-means was superior in terms of the best clustering solutions in most cases (with the exception of D3) despite operating in a feature space of a much larger size.
6 Conclusions
We have presented the global term context vector-VSM (GTCV-VSM) document representation, an extension to the vector space model that determines a proper feature space onto which the typical VSM document vector representations are projected. Our approach is entirely corpus-based and operates in the preprocessing phase in a sequence of four steps: (i) it captures local contextual information associated with each term occurrence in the term sequences of documents; (ii) it summarizes the local context vectors of each term into the respective global term context vector; (iii) it constructs the semantic matrix for a problem using the global term context vectors; and finally, (iv) it projects the documents using the semantic matrix. The proposed approach achieves semantic smoothing by reducing data sparsity, while retaining the original dimensionality. The derived representation maintains the initial interpretability, since each dimension is associated with a single vocabulary term. In the experimental document clustering study, we compared the proposed representation with the typical VSM, the Generalized-VSM and CVM-VSM, using Cosine similarity. The statistical analysis of the obtained results indicates that GTCV-VSM assists well-known clustering algorithms, such as spherical k-means and spectral clustering, to achieve better clustering solutions compared with other representation methods.
Our plans for future work are to investigate the potential of combining the local and global contextual information associated with terms, to explore ways of building compact concept vectors, to efficiently project the transformed document vectors in feature spaces of lower dimensionality, and to perform a systematic study of procedures that could efficiently compute the α_ν parameters (Eq. 11) for each vocabulary term, which could improve the global term context vectors. Finally, we aim at examining the proposed representation for document classification.
References
1. AlSumait L, Domeniconi C (2008) Text clustering with local semantic kernels. In: Berry M, Castellanos
M (eds) Survey of text mining II. Springer, London, pp 219–232
2. Apté C, Damerau F, Weiss SM (1994) Towards language independent automated learning of text categorization models. In: SIGIR '94: proceedings of the 17th annual international ACM SIGIR conference on research and development in information retrieval. Springer, New York, pp 23–30
3. Beil F, Ester M, Xu X (2002) Frequent term-based text clustering. In: KDD ’02: proceedings of the 8th
ACM SIGKDD international conference on knowledge discovery and data mining. ACM, New York,
pp 436–442. doi:10.1145/775047.775110
4. Beyer K, Goldstein J, Ramakrishnan R, Shaft U (1999) When is “nearest neighbor” meaningful?
In: ICDT ’99: proceedings of the 7th international conference on database theory. Springer, London,
pp 217–235
5. Billhardt H, Borrajo D, Maojo V (2002) A context vector model for information retrieval. J Am Soc Inf
Sci Technol 53(3):236–249. doi:10.1002/asi.10032
6. Chen C, Tseng F, Liang T (2010) An integration of fuzzy association rules and wordnet for document
clustering. Knowl Inf Syst (available online). doi:10.1007/s10115-010-0364- 2
7. Deerwester S, Dumais S, Furnas G, Landauer T, Harshman R (1990) Indexing by latent semantic analysis.
J Am Soc Inf Sci 41:391–407
8. Dhillon I, Modha D (2001) Concept decompositions for large sparse text data using clustering. Mach
Learn 42(1):143–175. doi:10.1023/A:1007612920971
9. Farahat A, Kamel M (2010) Statistical semantics for enhancing document clustering. Knowledge and
Information Systems (available online). doi:10.1007/s10115-010-0367-z
10. Fung B, Wang K, Ester M (2003) Hierarchical document clustering using frequent itemsets. In: Proceed-
ings of SIAM international conference on data mining
11. Ghosh J, Strehl A (2006) Similarity-based text clustering: a comparative study. In: Kogan J, Nicholas C,
Teboulle M (eds) Grouping multidimensional data. Springer, Berlin, pp 73–97
12. Grauman K, Darrell T (2007) The pyramid match kernel: efficient learning with sets of features. J Mach
Learn Res 8:725–760. doi:10.1145/361219.361220
13. Hotho A, Maedche E, Staab S (2001) Ontology-based text document clustering. Knstliche Intell 4:48–54
14. Hu X, Sun N, Zhang C, Chua T (2009) Exploiting internal and external semantics for the clustering
of short texts using world knowledge. In: Proceeding of the 18th ACM conference on information and
knowledge management. ACM, New York, CIKM ’09, pp 919–928. doi:10.1145/1645953.1646071
15. Jing J, Zhou L, Ng M, Huang Z (2006) Ontology-based distance measure for text clustering. In: Proceed-
ings SIAM SDM workshop on text mining
16. Karypis G, Han E (2000) Concept indexing: a fast dimensionality reduction algorithm with applications
to document retrieval and categorization. In: Technical report TR-00-0016. University of Minnesota
17. Keikha M, Razavian N, Oroumchian F, Razi H (2008) Document representation and quality of text: an
analysis. In: Berry M, Castellanos M (eds) Survey of text mining II. Springer, London, pp 219–232
18. Lebanon G, Mao Y, Dillon J (2007) The locally weighted bag of words framework for document repre-
sentation. J Mach Learn Res 8:2405–2441
19. Lewis D (1992) An evaluation of phrasal and clustered representations on a text categorization task.
In: SIGIR ’92: Proceedings of the 15th annual international ACM SIGIR conference on research and
development in information retrieval. ACM, New York, pp 37–50. doi:10.1145/133160.133172
20. Li Y, Chung S, Holt J (2008) Text document clustering based on frequent word meaning sequences. Data
Knowl Eng 64(1):381–404. doi:10.1016/j.datak.2007.08.001
21. McQueen J (1967) Some methods for classification and analysis of multivariate observations. In:
Proceedings of 5th Berkley symposium on mathematical statistics and probability. pp 281–297
22. Miller G, Beckwith R, Fellbaum C, Gross D, Miller K (1990) Wordnet: an on-line lexical database. Int J
Lexicogr 3:235–244
23. Mladenic D (1998) Machine learning on non-homogeneous, distributed text data. PhD thesis, University
of Ljubljana, Faculty of Computer and Information Science
24. Ng A, Jordan M, Weiss Y (2001) On spectral clustering: analysis and an algorithm. Adv Neural Inf
Process Syst 14:849–864
25. Ni X, Quan X, Lu Z, Wenyin L, Hua B (2010) Short text clustering by finding core terms. Knowl Inf Syst
1–21. doi:10.1007/s10115-010-0299-7
26. Porter M (1997) An algorithm for suffix stripping. In: Jones K, Willett P (eds) Readings in information
retrieval. Morgan Kaufmann Publishers, San Francisco, pp 313–316
27. Pu W, Liu N, Yan S, Yan J, Xie K, Chen Z (2007) Local word bag model for text categorization. In:
ICDM ’07: proceedings of the 2007 7th IEEE international conference on data mining. IEEE Computer
Society, Washington, pp 625–630. doi:10.1109/ICDM.2007.69
28. Salton G, Wong A, Yang C (1975) A vector space model for automatic indexing. Commun ACM
18(11):613–620. doi:10.1145/361219.361220
29. Wang P, Domeniconi C (2008) Building semantic kernels for text classification using wikipedia. In:
KDD ’08: proceeding of the 14th ACM SIGKDD international conference on knowledge discovery and
data mining. ACM, New York, pp 713–721. doi:10.1145/1401890.1401976
30. Wikipedia (2004) Wikipedia, the free encyclopedia. http://en.wikipedia.org/
123
A. Kalogeratos, A. Likas
31. Wong S, Ziarko W, Wong P (1985) Generalized vector spaces model in information retrieval. In:
SIGIR ’85: proceedings of the 8th annual international ACM SIGIR conference on research and devel-
opment in information retrieval. ACM, New York, pp 18–25. doi:10.1145/253495.253506
Author Biographies
Argyris Kalogeratos received the B.Sc. and M.Sc. degrees in Computer Science from the University of Ioannina, Ioannina, Greece, in 2006 and 2008, respectively. Currently, he is pursuing the Ph.D. degree in the Department of Computer Science, University of Ioannina. His research interests include machine learning, data clustering, text representation, and mining.

Aristidis Likas received the Diploma degree in electrical engineering and the Ph.D. degree in electrical and computer engineering from the National Technical University of Athens, Greece, in 1990 and 1994, respectively. Since 1996, he has been with the Department of Computer Science, University of Ioannina, Greece, where he is currently an Associate Professor. His research interests include machine learning, data mining, multimedia content analysis and bioinformatics.