Knowl Inf Syst
DOI 10.1007/s10115-011-0412-6
REGULAR PAPER
Text document clustering using global term context vectors
Argyris Kalogeratos · Aristidis Likas
Department of Computer Science, University of Ioannina, 45110 Ioannina, Greece
e-mail: akaloger@cs.uoi.gr (A. Kalogeratos); arly@cs.uoi.gr (A. Likas)
Received: 13 May 2010 / Revised: 20 December 2010 / Accepted: 6 May 2011
© Springer-Verlag London Limited 2011
Abstract Despite the advantages of the traditional vector space model (VSM) representation, there are known deficiencies concerning the term independence assumption. The high dimensionality and sparsity of the text feature space and phenomena such as polysemy and synonymy can only be handled if a way is provided to measure term similarity. Many approaches have been proposed that map document vectors onto a new feature space where learning algorithms can achieve better solutions. This paper presents the global term context vector-VSM (GTCV-VSM) method for text document representation. It is an extension to VSM that: (i) captures local contextual information for each term occurrence in the term sequences of documents; (ii) combines the local contexts of the occurrences of a term to define the global context of that term; (iii) constructs a proper semantic matrix using the global contexts of all terms; and (iv) uses this matrix to linearly map traditional VSM (Bag of Words, BOW) document vectors onto a semantically smoothed feature space where problems such as text document clustering can be solved more efficiently. We present an experimental study demonstrating the improvement of clustering results when the proposed GTCV-VSM representation is used compared with traditional VSM-based approaches.
Keywords Text mining · Document clustering · Semantic matrix · Data projection
1 Introduction
The text document clustering procedure aims toward automatically partitioning a given
collection of unlabeled text documents into a (usually predefined) number of groups, called
clusters, such that similar documents are assigned to the same cluster while dissimilar
documents are assigned to different clusters. This is a task that discovers the underlying
structure in a set of data objects and enables the efficient organization and navigation in large
text collections.
The challenging characteristics of the text document clustering problem are related to the
complexity of the natural language. Text documents are represented in high dimensional and
sparse (HDS) feature spaces, due to their large term vocabularies (the number of different
terms of a document collection or text features in general). In an HDS feature space, the
difference between the distance of two similar objects and the distance of two dissimilar
objects is relatively small [4]. This phenomenon prevents clustering methods from achieving
good data partitions. Moreover, the text semantics, e.g., term correlations, are mostly implicit
and non-trivial, hence difficult to extract without prior knowledge for a specific problem.
The traditional document representation is the vector space model (VSM) [28], where each document is represented by a vector of weights corresponding to text features. Many variations of VSM have been proposed [17] that differ in what they consider as a feature or 'term'. The most common approach is to consider different words as distinct terms, which is the widely known Bag Of Words (BOW) model. An extension is the Bag Of Phrases (BOP) model [23] that extracts a set of informative phrases or word n-grams (n consecutive words). Especially for noisy document collections, e.g., containing many spelling errors, or collections whose language is not known in advance, it is often better to use VSM to model the distribution of character n-grams in documents. Herein, we consider word features and we refer to them as terms; however, the procedures we describe can be directly extended to more complex features.
Despite the simplicity of the popular word-based VSM version, there are common language phenomena that it cannot handle. More specifically, it cannot distinguish the different senses of a polysemous word in different contexts or realize the common sense between synonyms. It also fails to recognize multi-word expressions (e.g., 'Olympic Games'). These deficiencies are in part due to the over-simplistic assumption of term independence, where each dimension of the HDS feature space is considered to be orthogonal to the others; this makes the classic VSM model incapable of capturing the complex language semantics. The VSM representations of documents can be improved by examining the relations between terms either at a low level, such as term co-occurrence frequency, or at a higher semantic similarity level.
Among the popular approaches is Latent Semantic Indexing (LSI) [7], which solves an eigenproblem using Singular Value Decomposition (SVD) to determine a proper feature space onto which data are projected. Concept Indexing [16] computes a k-partition by clustering the documents and then uses the centroid vectors of the clusters as the axes of the reduced space. Similarly, Concept Decomposition [8] approximates the term-by-document data matrix in a least-squares fashion using centroid vectors. A simpler but quite efficient method is the Generalized vector space model (GVSM) [31]. GVSM represents documents in the document similarity space, i.e., each document is represented as a vector containing its similarities to the rest of the documents in the collection. The Context Vector Model (CVM-VSM) [5] is a VSM-extension that describes the semantics of each term by introducing a term context vector that stores its similarities to the other terms. The similarity between terms is based on a document-wise term co-occurrence frequency. The term context vectors are then used to map document vectors into a feature space of equal size to the original, but less sparse. The ontology-based VSM approaches [13,15] map the terms of the original space onto a feature space defined by a hierarchically structured thesaurus, called ontology. Ontologies provide information about the words of a language and their possible semantic relations; thus, an efficient mapping can
disambiguate the word senses in the context of each document. The main disadvantage is
that, in most cases, the ontologies are static and rather generic knowledge bases, which may
cause heavy semantic smoothing of the data. A special text representation problem is related
to very short texts [14,25].
In this work, we present the Global Term Context Vector-VSM (GTCV-VSM) representation, an entirely corpus-based extension to the traditional VSM that incorporates contextual information for each vocabulary term. First, the local context for each term occurrence in the term sequences of documents is captured and represented in vector space by exploiting the idea of the Locally Weighted Bag of Words [18]. Then, all the local contexts of a term are combined to form its global context vector. The global context vectors constitute a semantic matrix that efficiently maps the traditional VSM document vectors onto a semantically richer feature space of the same dimensionality as the original. As indicated by our experimental study, in the new space, superior clustering solutions are achieved using well-known clustering algorithms such as spherical k-means [8] or spectral clustering [24].
The rest of this paper is organized as follows. Section 2 provides some background on document representation using the vector space model. In Sect. 3, we describe recent approaches for representing a text document using histograms that describe the local context at each location of the document-term sequence. In Sect. 4, we present our proposed approach for document representation. The experimental results are presented in Sect. 5, and finally, in Sect. 6, we provide conclusions and directions for future work.
2 Document representation in vector space
In order to apply any clustering algorithm, the raw collection of N text documents must be first preprocessed and represented in a suitable feature space. A standard approach is to eliminate trivial words (e.g., stopwords) and words that appear in a small number of documents. Then, stemming [26] is applied, which aims to replace each word by its corresponding word stem. The V derived word stems constitute the collection's term vocabulary, denoted as V = {ν_1, ..., ν_V}. Thus, a text document, which is a finite term sequence of T vocabulary terms, is denoted as d^{seq} = ⟨d^{seq}(1), ..., d^{seq}(T)⟩, with d^{seq}(i) ∈ V. For example, the phrase 'The dog ate a cat and a mouse!' is a sequence d^{seq} = ⟨dog, ate, cat, mouse⟩.
2.1 The bag of words model
According to the typical VSM approach, the Bag of Words (BOW) model, a document is represented by a vector d ∈ R^V, where each word term ν_i of the vocabulary is associated with a single vector dimension. The most popular weighting scheme is the normalized tf × idf that introduces the inverse document frequency as an external weight to enforce the terms that have discrimination power and appear in a small number of documents. For the ν_i vocabulary term, it is computed as idf_i = log(N / df_i), where N denotes the total number of documents and df_i denotes the document frequency, i.e., the number of documents that contain term ν_i. Thus, the normalized tf × idf BOW vector is a mapping of the term sequence d^{seq} defined as follows

$bow: d^{seq} \mapsto d = h \cdot (tf_1\, idf_1, \ldots, tf_V\, idf_V) \in \mathbb{R}^V,$    (1)

where normalization is performed with respect to the Euclidean norm using the coefficient h. The document collection can then be represented using the N document vectors as rows in
the Document-Term matrix D, which is an N × V matrix whose rows and columns are indexed by the documents and the vocabulary terms, respectively.
In the VSM, there are several alternatives to quantify the semantic similarity between document pairs. Among them, Cosine similarity has been shown to be an effective measure [11], and for a pair of document vectors d_i and d_j it is given by

$sim_{cos}(d_i, d_j) = \frac{d_i^{\top} d_j}{\|d_i\|_2\, \|d_j\|_2} \in [0, 1].$    (2)

Unit similarity value implies that the two documents are described by identical distributions of term frequencies. Note that this is equal to the dot product d_i^{\top} d_j if the document vectors are normalized in the unit positive V-dimensional hypersphere.
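As an illustration (not part of the original paper), a minimal NumPy sketch of Eqs. 1 and 2 could look as follows; the function names and the dense-array representation are our assumptions.

```python
import numpy as np

def bow_tfidf(tf, df, n_docs):
    """Normalized tf-idf BOW vector (Eq. 1); tf and df are length-V arrays of
    term frequencies tf_i and document frequencies df_i."""
    idf = np.log(n_docs / df)                 # idf_i = log(N / df_i)
    d = tf * idf                              # tf_i * idf_i
    norm = np.linalg.norm(d)                  # Euclidean norm; coefficient h = 1 / norm
    return d / norm if norm > 0 else d

def cosine_similarity(d_i, d_j):
    """Cosine similarity between two document vectors (Eq. 2)."""
    return float(d_i @ d_j) / (np.linalg.norm(d_i) * np.linalg.norm(d_j))
```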
2.2 Extensions to VSM
The BOW model, despite having a series of advantages, such as generality and simplicity, cannot model efficiently the rich semantic content of text. The Bag Of Phrases model uses phrases of two or three consecutive words as features. Its disadvantage is that, as phrases become longer, they obtain superior semantic value, but at the same time, they become statistically inferior with respect to single-word representations [19]. A category of methods developed to tackle this difficulty recognizes the frequent wordsets (unordered itemsets) in a document collection [3,10], while the method proposed in [20] exploits the frequent word subsequences (ordered) that are stored in a Generalized Suffix Tree (GST) for each document.
Modern variations of VSM are used to tackle the difficulties occurring due to HDS spaces, by projecting the document vectors onto a new feature space called concept space. Each concept is represented as a concept vector of relations between the concept and the vocabulary terms. Generally, this approach of document mapping can be expressed as

$VSM': d \mapsto d' = S d \in \mathbb{R}^{V'}, \quad V' \le V,$    (3)

where the V' × V matrix S stores the concept vectors as rows. This projection matrix is also known as semantic matrix. The Cosine similarity between two normalized document images in the concept space can be computed as a dot product

$sim^{(cos)}_{sem}(d_i, d_j) = (\widehat{S d_i})^{\top} (\widehat{S d_j}) = (h^S_i\, S d_i)^{\top} (h^S_j\, S d_j) = h^S_i\, h^S_j\, d_i^{\top} S^{\top} S\, d_j,$    (4)

where the scalar normalization coefficient for each document is h^S_i = 1 / ||S d_i||_2. The similarity defined in Eq. 4 can be interpreted in two ways: (i) as a dot product of the document images S d_i and S d_j that both belong to the new space R^{V'} and (ii) as a composite measure that takes into account the pairwise correlations between the original features expressed by the matrix S^{\top} S.
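A minimal sketch of this generic linear mapping (our illustration; S is any semantic matrix with concept vectors as rows):

```python
import numpy as np

def map_documents(D, S):
    """Project BOW document vectors (rows of D, N x V) onto the concept space
    defined by the semantic matrix S (V' x V), as in Eq. 3: d' = S d."""
    return D @ S.T

def semantic_cosine(d_i, d_j, S):
    """Cosine similarity between two document images in the concept space (Eq. 4)."""
    x, y = S @ d_i, S @ d_j
    return float(x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
```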
There is a variety of methods proposing alternative ways to define the semantic matrix, though many of them are based on the above linear mapping. The widely used Latent Semantic Indexing (LSI) [7] projects the document vectors onto a space spanned by the eigenvectors corresponding to the V' largest eigenvalues of the matrix D^{\top} D. The eigenvectors are extracted by means of Singular Value Decomposition (SVD) on matrix D, and they capture the latent semantic information of the feature space. In this case, each eigenvector is a different concept vector and V' is a user parameter much smaller than V, while there is also a considerable computational cost to perform the SVD. In Concept Indexing [16], the concept
vectors are the centroids of a V'-partition obtained by applying document clustering. In [9], statistical information such as the covariance matrix is combined with traditional mapping approaches into latent space (LSI, PCA) to compose a hybrid vector mapping.
A computationally simpler alternative that utilizes the Document-Term matrix D as a semantic matrix is the Generalized vector space model (GVSM) [31], i.e., S_{gvsm} = D, and the image of a document is given by d' = D d. By examining the product D d ∈ R^{N×1}, we can conclude that a GVSM-projected document vector d' has lower dimensionality if N < V. Moreover, if both d and D are properly normalized, then the image vector d' contains the Cosine similarities between the document vector d and each of the N documents in the collection. This observation implies that the GVSM works in the document similarity space by considering each document as a different concept. On the other hand, the respective product S_{gvsm}^{\top} S_{gvsm} = D^{\top} D (used in Eq. 4) is a V × V Term Similarity Matrix whose r-th row contains the dot-product similarities between term ν_r and the rest of the vocabulary terms. Note that terms become more similar as their corresponding normalized frequency distributions over the N documents become more alike. Based on the GVSM model, it is proposed in [1] to build local semantic matrices for each cluster during document clustering.
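A brief sketch of the GVSM mapping (our illustration; it assumes D holds tf-idf document vectors as rows):

```python
import numpy as np

def gvsm_images(D):
    """GVSM: the Document-Term matrix itself acts as the semantic matrix
    (S_gvsm = D), so the image of each document is its vector of similarities
    to all N documents of the collection."""
    Dn = D / np.linalg.norm(D, axis=1, keepdims=True)  # row-normalize the documents
    return Dn @ Dn.T                                   # N x N matrix of document images
```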
A rather different approach, proposed in [5] for information retrieval, is the Context Vector Model (CVM-VSM) where, instead of a few concise concept vectors, the context in which each of the V vocabulary terms appears in the data set is computed, called the term context vector (tcv). This model computes a V × V matrix S_{cvm} containing the term context vectors as rows. Each tcv_i vector aims to capture the V pairwise similarities of term ν_i to the rest of the vocabulary terms. Such similarity is computed using a co-occurrence frequency measure. Each matrix element [S_{cvm}]_{ij} stores the similarity between terms ν_i and ν_j computed as

$[S_{cvm}]_{ij} = \begin{cases} 1, & i = j \\ \dfrac{\sum_{r=1}^{N} tf_{ri}\, tf_{rj}}{\sum_{r=1}^{N} \left( tf_{ri} \cdot \sum_{q=1, q \ne i}^{V} tf_{rq} \right)}, & i \ne j. \end{cases}$    (5)

Note that this measure is not symmetric, generally [S_{cvm}]_{ij} ≠ [S_{cvm}]_{ji}, due to the denominator that normalizes the pairwise similarity to [0, 1] with respect to the total 'amount' of similarity between term ν_i and the other vocabulary terms. The rows of matrix S_{cvm} can be normalized with respect to the Euclidean norm, and each document image is then computed as the centroid of the normalized context vectors of all terms appearing in that document

$cvm: d \mapsto d' = \sum_{i=1}^{V} tf_i \cdot tcv_i,$    (6)

where tf_i is the frequency of term ν_i. The motivation for using term context vectors is to capture the semantic content of a document based on the co-occurrence frequency of terms in the same document, averaged over the whole corpus. The CVM-VSM representation is less sparse than BOW. Moreover, weights such as idf can be incorporated into the transformed document vectors computed using Eq. 6. In [5], several more complicated weighting alternatives were tested in the context of information retrieval; in our text document clustering experiments, they did not perform better than the standard idf weights.
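The sketch below (ours; it assumes a dense N × V term-frequency matrix TF and ignores numerical edge cases such as all-zero denominators) illustrates Eqs. 5 and 6.

```python
import numpy as np

def cvm_semantic_matrix(TF):
    """Term context vectors of CVM-VSM (Eq. 5); TF[r, i] is the frequency of
    term i in document r."""
    co_occ = TF.T @ TF                                  # sum_r tf_ri * tf_rj
    row_tot = TF.sum(axis=1)                            # sum_q tf_rq per document
    denom = (TF * (row_tot[:, None] - TF)).sum(axis=0)  # sum_r tf_ri * sum_{q != i} tf_rq
    S = co_occ / denom[:, None]                         # asymmetric normalization
    np.fill_diagonal(S, 1.0)
    return S / np.linalg.norm(S, axis=1, keepdims=True) # Euclidean-normalize the rows

def cvm_map(TF, S_cvm):
    """Document images (Eq. 6): tf-weighted sum of the term context vectors."""
    return TF @ S_cvm
```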
In a higher semantic level than term co-occurrences, additional information for vocabulary terms provided by ontologies has also been exploited to compute the term similarities and to construct a proper semantic matrix. WordNet [22] and Wikipedia [30] have been used for this purpose in [6,15] and [29], respectively.
2.3 Discussion
Summarizing the properties of the above-mentioned vector-based document representations: in the traditional BOW approach, the dimensions of the term feature space are considered to be independent of each other. Such an assumption is very simplistic, since there exist semantic relations among terms that are ignored. The VSM-extensions aim to achieve semantic smoothing, a process that redistributes the term weights of a vector model, or maps data to a new feature space, by taking into account the correlations between terms. For instance, if the term 'child' appears in a document, then it could be assumed that the term 'kid' is also related to the specific document, or even terms like 'boy', 'girl', 'toy'. The resulting representation model is also a VSM, but the document vectors become less sparse and the independence of features is mitigated in an indirect way. The smoothing is usually achieved by a linear mapping of the data vectors to a new feature space using a semantic matrix S. It is convenient to think that the new document vector d' = S d contains the dot product similarities between the original BOW vector d and the rows of the semantic matrix S.
A basic difference between the various semantic smoothing methods is related to the dimension of the new feature space, which is determined by the number V' of row vectors of matrix S. In case their number is less than the size V of the vocabulary, such vectors are called concept vectors and are usually produced using the LSI method. Each concept vector has a distribution of weights associated with the V original terms that define their contribution to the corresponding concept. Of course, the resulting representation of the smoothed vector d' is less interpretable than the original, and there is always the problem of determining the proper number of concept vectors.
An alternative approach for semantic smoothing assumes that each row vector of matrix S is associated with one vocabulary term. Unlike a concept vector that describes abstract semantics of higher level, here the elements of each vector describe the relation of this term to the other terms. Those relations constitute the so-called term context, thus the respective vector is called a term context vector. Each element of the mapped vector d' will contain the dot product similarity between document d and the corresponding term context vector, i.e., for each term ν_i, the element d'_i provides the degree to which the original document d contains the term ν_i and its context, instead of just its frequency as happens in the BOW representation. Note also that in the BOW representation, a dot product would give zero similarity for two documents that do not have common terms. On the contrary, the dot product between a document vector and the context vector of a term ν_i that does not appear in that document may give a non-zero similarity. This happens if the document contains at least one term ν_j with non-zero weight in the context of term ν_i. For this reason, the smoothed representation d' is usually less sparse than d and retains the interpretability of its dimensions. Moreover, concept-based methods may be applied on the new representations.
The motivation of our work is to establish the importance of term context vectors and to define an efficient way to compute them. The CVM-VSM method computes the term context based on term co-occurrence frequency at the document level. It does not take into account the sequential nature of text and thus ignores the local distance of terms when computing term context. On the other hand, the GTCV-VSM proposed in this work extends the previous approach by considering term context at three levels: (i) it uses the notion of local term context vector (ltcv) to model the context around the location in the text sequence where a term appears. These vectors are computed using a local smoothing kernel as suggested in the LoWBOW approach [18], which is described in the next section. The kernel takes into account the distance at which other terms appear around the sequence location under consideration. (ii) It computes the document term context vector (dtcv) for each term, which summarizes the term context at the document level, and (iii) it computes the final global term context vector (gtcv) for each term, representing the overall term context at corpus level. The gtcv vectors constitute the rows of the semantic matrix S. Thus, the intuition behind the GTCV-VSM approach is to capture the local term context from term sequences and then to construct a representation for global term context by averaging ltcvs at the document and corpus level.
3 Utilizing local contextual information
A text document can be considered as a finite term sequence of its T consecutive terms, denoted as d^{seq} = ⟨d^{seq}(1), ..., d^{seq}(T)⟩, but, except for Bag of Phrases, the previously mentioned VSM-extensions ignore this property. A category of methods has been proposed aiming to capture local information directly from the term sequence of a document. The representation proposed in [27] first considers a segmentation of the sequence that is done by sliding a window of n terms along the sequence and computing a local BOW vector for each of the overlapping segments. All these local BOW vectors constitute the document representation called Local Word Bag (LWB). To compute the similarity between a pair of documents, the authors introduce a variant of the VG-Pyramid Matching Kernel [12] that maps the two sets of local BOW vectors to a multi-resolution histogram and computes a weighted histogram intersection.
Another approach for text representation, presented in [18], is the Locally Weighted Bag of Words (LoWBOW) that preserves local contextual information of text documents by effective modeling of the text sequential structure. At first, a number of L equally distant locations are defined in the term sequence. Each sequence location i, i = 1, ..., L is then associated with a local histogram, which is a point in the multinomial simplex P_{V-1}, where V is the number of vocabulary terms. More specifically, for (V-1) ≥ 0, the P_{V-1} space is the (V-1)-dimensional subset of R^V that contains all probability vectors (histograms) over V objects (for a discussion on the multinomial simplex see the Appendix of [18])

$P_{V-1} = \left\{ H \in \mathbb{R}^V : H_i \ge 0,\ i = 1, \ldots, V, \ \text{and} \ \sum_{i=1}^{V} H_i = 1 \right\}.$    (7)
Contrary to LWB, in LoWBOW the local histogram is computed using a smoothing kernel to weight the contribution of terms appearing around the referenced location in the term sequence and to assign more importance to closely neighboring terms. Denote as H_{δ(d^{seq}(t))} the trivial term histogram over the V terms whose probability mass is concentrated only at the term that occurs at location t in d^{seq}

$\left[ H_{\delta(d^{seq}(t))} \right]_i = \begin{cases} 1, & i = d^{seq}(t) \\ 0, & i \ne d^{seq}(t) \end{cases}, \quad i = 1, \ldots, V;$    (8)

then the locally smoothed histogram at a location μ in the d^{seq} term sequence is computed as in [18]

$lowbow(d^{seq}, \mu) = \sum_{t=1}^{T} H_{\delta(d^{seq}(t))}\, K_{\mu,\sigma}(t),$    (9)
where T is the length of d^{seq}. K_{μ,σ}(t) denotes the weight for location t in the sequence, given by a discrete Gaussian weighting kernel function of mean value μ and standard deviation σ. Specifically, the weighting function is a Gaussian probability density function restricted in [1, T] and renormalized so that Σ_{t=1}^{T} K_{μ,σ}(t) = 1. It is easy to verify that the result of the histogram smoothing of Eq. 9 is also a histogram.
It must be noted that for σ = 0, the lowbow histogram (Eq. 9) coincides with the trivial histogram H_{δ(d^{seq}(μ))}, where all the probability mass is concentrated at the term at location μ. As σ grows, part of the probability mass is transferred to the terms occurring near location μ. In this way, the lowbow histogram at location μ is enriched with information about the terms occurring in the neighborhood of μ. The smoothing parameter σ adjusts the 'locality' of term semantics that is taken into account by the model. Thus, instead of mining unordered local vectors as in [27], the LoWBOW approach embeds the term sequence of a document in the P_{V-1} simplex. The sequence of the L locally smoothed histograms (denoted as lowbow histograms) forms a curve in the (V-1)-dimensional simplex (denoted as the LoWBOW curve).
Figure 1 illustrates the LoWBOW curves generated for a toy example and describes the role of parameter σ. In this figure, we aim to illustrate (i) the LoWBOW curve representation, i.e., the curve that corresponds to a sequence of histograms (local context vectors), where each local context vector is computed at a specific location of the sequence and corresponds to a point in the (V-1)-dimensional simplex; and (ii) the impact of the smoothing coefficient σ on the computed local context vectors. It is illustrated that the increase in smoothing makes the lowbow histograms (points of the curve) more similar. This can also be verified by observing that as smoothing increases, the curve becomes more concentrated around a central location of the simplex. For σ = ∞, all histograms become similar to the BOW representation and the curve reduces to a single point. On the contrary, for σ = 0, the histograms correspond to simplex corners.
A similarity measure between LoWBOW curves has been proposed in [18] that assumes a sequential correspondence between two documents and computes the sum of the similarities between the L pairs of LoWBOW histograms. Obviously, it is expected for this similarity measure to underestimate the thematic similarity between documents that follow a different order in the presentation of similar semantic content.
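To make the local smoothing concrete, here is a minimal sketch (ours, not taken from [18]; it assumes the document is given as a sequence of 0-based vocabulary indices) of the lowbow histogram of Eqs. 8 and 9.

```python
import numpy as np

def lowbow_histogram(d_seq, mu, sigma, V):
    """Locally smoothed histogram at location mu of the term sequence d_seq (Eq. 9).
    d_seq is an integer array of vocabulary indices; V is the vocabulary size."""
    d_seq = np.asarray(d_seq)
    T = len(d_seq)
    if sigma == 0:
        K = np.zeros(T)
        K[mu] = 1.0                                  # degenerate kernel: trivial histogram (Eq. 8)
    else:
        t = np.arange(T)
        K = np.exp(-0.5 * ((t - mu) / sigma) ** 2)   # Gaussian kernel restricted to the sequence
        K /= K.sum()                                 # renormalize so the weights sum to 1
    h = np.zeros(V)
    np.add.at(h, d_seq, K)                           # sum_t H_delta(d_seq(t)) * K(t)
    return h                                         # a point in the simplex P_{V-1}
```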
4 A semantic matrix based on global term context vectors
In this section, we present the global term context vector-VSM (GTCV-VSM) approach for capturing the semantics of the original term feature space of a document collection. The method computes the contextual information of each vocabulary term, which is subsequently utilized in order to create a semantic matrix. In analogy with CVM-VSM, our approach reduces data sparsity but not dimensionality. The interpretability of the derived vector dimensions remains as strong as in the BOW model, as the value of each dimension of the mapped vector corresponds to one vocabulary term. Methods that reduce data dimensionality could also be applied on the new representations at a subsequent phase. Compared with CVM-VSM, GTCV-VSM generalizes the way the term context is computed by taking into account the distance between terms in the term sequence of each document. This is achieved by exploiting the idea of LoWBOW to describe the local contextual information at a certain location in a term sequence. It must be noted that our method borrows from the LoWBOW approach only the way the local histogram is computed at each location of the term sequence and does not make use of the LoWBOW curve representation.
Fig. 1 A toy example of a term sequence over three different terms ν1, ν2, ν3 (vocabulary size: V = 3). The subfigures (a-d) present LoWBOW curves in the (V-1)-dimensional simplex for increasing values of the parameter σ, which induces more smoothing to the curve. Each point of the curve corresponds to a local histogram computed at a sequence location. The more a term affects the local context at a location in the sequence, the more the curve point (the lowbow histogram related to that location) moves toward the respective corner of the simplex. For σ = 0, local histograms correspond to simplex corners; thus, the curve moves from corner to corner of the simplex. Two different sampling rates for the LoWBOW representation are illustrated: sampling at every term location in the sequence (dashed line), which is our strategy to collect contextual information for each term, and sampling every two terms (solid line). d For σ = ∞, the LoWBOW curve reduces to a single point that coincides with the BOW histogram of the sequence. In d, we present as 'stars' the average ltcv histograms for each term (dtcv histograms) for the three different values of σ and α = 0.6 for all terms. As the value of σ increases, the dtcv histograms of all terms become more similar, tending to coincide with the BOW representation
More specifically, we define the local term context vector (ltcv) as a histogram associated with the exact occurrence of term d^{seq}(ℓ) at location ℓ in a sequence d^{seq}. Hence, one ltcv vector is computed at every location in the term sequence, i.e., ℓ = 1, ..., T. Note that GTCV-VSM does not preserve any curve representation. This means that we are not interested in the temporal order of the local term context vectors. The ltcv(d^{seq}, ℓ) is a modified lowbow(d^{seq}, ℓ) probability vector that represents contextual information around location ℓ, while adjusting explicitly the self-weight α_{d^{seq}(ℓ)} of the reference term appearing
Fig. 2 Various weight distributions for the neighboring terms around a reference term occurring in the middle of a term sequence of length 50. The distributions are obtained by varying the value of parameter α in Eq. 10, and they define the contribution of each term to the context of the specific reference term. The scale value of the local kernel is set to σ = 5, while the self-weight α is set to 0.05 (left), 0.10 (middle), 0.2 (right)
at location ℓ:

$[ltcv(d^{seq}, \ell)]_i = \begin{cases} \alpha_{d^{seq}(\ell)}, & i = d^{seq}(\ell) \\ \left(1 - \alpha_{d^{seq}(\ell)}\right) \cdot \dfrac{idf_i \cdot [lowbow(d^{seq}, \ell)]_i}{\sum_{j=1, j \ne i}^{V} idf_j \cdot [lowbow(d^{seq}, \ell)]_j}, & i \ne d^{seq}(\ell). \end{cases}$    (10)
The self-weight (0 ≤ α_{d^{seq}(ℓ)} ≤ 1) adjusts the relative importance between contextual information (computed using the lowbow histogram) and the self-representation of each term. Figure 2 illustrates an example of how the value of parameter α affects the local term weighting around a reference term in a sequence. When the parameter σ of the Gaussian smoothing kernel is set to zero, or α = 1, the ltcv(d^{seq}, ℓ) reduces to the trivial histogram H_{δ(d^{seq}(ℓ))} (see Eq. 8). The other extreme is an infinite σ value, where for small α values, all the ltcv computed in a document d become similar to the tf histogram of that document.
The latter observation is the reason for considering an explicit self-weight in Eq. 10, because a flat smoothing kernel obtained for a large σ value can make a lowbow vector have an improperly low self-weight for the reference term. For example, if a term appears once in a document, then the lowbow vector with σ = ∞ at that location would contain very low weight for that term. Generally, the value of α_ν determines how much the context vector of term ν should be dominated by the self-weight of term ν. In our method, we set this parameter independently for each individual term as a function of its idf_ν component

$\alpha_{\nu} = \lambda + (1 - \lambda) \cdot \left( 1 - \frac{idf_{\nu}}{\log N} \right) \in [0, 1],$    (11)

where λ is a lower bound for all α_ν, ν = 1, ..., V (in our experiments we used λ = 0.2). The rationale for the above equation is that for terms with high document frequency (i.e., low idf_ν), we assign high α_ν values that suppress the local context in the respective context vectors. In other words, the context is considered more important for terms that occur in fewer documents. In Fig. 3a, we present an example illustrating the ltcv vectors of the two term sequences presented in Fig. 3c.
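The following sketch (ours; it reuses lowbow_histogram from the sketch in Sect. 3 and transcribes Eqs. 10 and 11 as printed, including the per-element denominator) computes a local term context vector.

```python
import numpy as np

def self_weight(idf_v, n_docs, lam=0.2):
    """Term self-weight alpha_v (Eq. 11): frequent terms (low idf) receive a
    larger self-weight, which suppresses their context."""
    return lam + (1.0 - lam) * (1.0 - idf_v / np.log(n_docs))

def ltcv(d_seq, ell, sigma, idf, n_docs, lam=0.2):
    """Local term context vector at location ell of d_seq (Eq. 10)."""
    V = len(idf)
    ref = d_seq[ell]                                  # index of the reference term
    alpha = self_weight(idf[ref], n_docs, lam)
    w = idf * lowbow_histogram(d_seq, ell, sigma, V)  # idf_j * [lowbow]_j
    out = np.zeros(V)
    for i in range(V):
        denom = w.sum() - w[i]                        # sum over j != i, as printed in Eq. 10
        if denom > 0:
            out[i] = (1.0 - alpha) * w[i] / denom
    out[ref] = alpha                                  # the reference term keeps weight alpha
    return out
```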
(Figure 3 panels: a Local term context histograms (columns) for documents A and B over their term sequences; b Averaged term context histograms (columns) over the vocabulary; c the two term sequences. The 13-term stem vocabulary shown is: advanc (v1), electron (v2), commun (v3), help (v4), conduct (v5), busi (v6), interoper (v7), problem (v8), applic (v9), profession (v10), product (v11), commerc (v12), secur (v13).)
Fig. 3 An example of how ltcv histograms are used to summarize the overall context in which a term appears in the two term sequences of (c), using Eq. 14. a The term sequences (x-axis) of documents A and B are presented and the corresponding local term context vectors are illustrated as gray-scaled columns. Those vectors are computed at every location in the sequence using a Gaussian smoothing kernel with σ = 1 and α = 0.6 for all terms. Brighter intensity at cell i, j indicates a higher contribution of the term ν_i to the local context of the term appearing at location j in the sequence. b The resulting transposed semantic matrix (S^T), where the gray-scaled columns illustrate the global contextual information for each vocabulary term computed by averaging the respective local context histograms (Eq. 13). c The two initial term sequences (the stem of each non-trivial term is emphasized). Assuming the same idf weight for each vocabulary term, the table presents the BOW vector, the transformed vector d' using Eq. 14, as well as the effect of semantic smoothing (diff = BOW - d') on the document vectors. The redistribution of term weights that results from the proposed mapping is such that low-frequency terms gain weight against the more frequent ones. Note also that the similarity between the two documents is 0.756 for the BOW model and 0.896 for the GTCV-VSM
We further define the document term context vector (dtcv) as a probability vector that summarizes the context of a specific term at the document level by averaging the ltcv histograms corresponding to the occurrences of this term in the document. More specifically, suppose that a term ν appears no_{ν,i} > 0 times in the term sequence d_i^{seq} (i.e., in the i-th document), which is of length T_i. Then, the dtcv of this term ν for document i is computed as:

$dtcv_{\nu}\left(d_i^{seq}\right) = \frac{1}{no_{\nu,i}} \sum_{j=1}^{no_{\nu,i}} ltcv\left(d_i^{seq}, \ell_i(j)\right),$    (12)
where ℓ_i(j) is an integer value in [1, ..., T_i] denoting the location of the j-th occurrence of ν in d_i^{seq}.
Next, the global term context vector (gtcv) is defined for a vocabulary term ν so as to represent the overall contextual information for all appearances of ν in the corpus of all N term sequences (documents):

$gtcv(\nu) = h_{gtcv}(\nu) \sum_{i=1}^{N} tf_{i,\nu}\, dtcv_{\nu}\left(d_i^{seq}\right).$    (13)

The coefficient h_{gtcv}(ν) normalizes the vector gtcv(ν) with respect to the Euclidean norm, and tf_{i,ν} is the frequency of the term ν in the i-th document. Thus, the gtcv(ν) of term ν is computed using a weighted average of the document context vectors dtcv_ν(d_i^{seq}) obtained for each document i in which term ν appears. In contrast to the LoWBOW curve approach, which focuses on the sequence of local histograms that describe the writing structure of a document, our method focuses on the extraction of the global semantic context of a term by averaging the local contextual information at all the corpus locations where this term appears.
Finally, the extracted global contextual information is used to construct the V × V semantic matrix S_{gtcv}, where each row ν is the gtcv(ν) vector of the corresponding vocabulary term ν. Figure 1d provides an example illustrating the dtcv_ν(d_i^{seq}) vectors for each document (the points denoted as 'stars'). Figure 3b illustrates the final gtcv vectors obtained by averaging the document-level contexts for each vocabulary term.
To map a document using the proposed global term context vector-VSM approach, we compute the vector d' where each element ν is the Cosine similarity between the BOW representation d of the document and the global term context vector gtcv(ν):

$gtcv: d \mapsto d' = S_{gtcv}\, d, \quad d' \in \mathbb{R}^{V}.$    (14)

Note that the transformed document vector d' is V-dimensional and retains the interpretability, since each dimension still corresponds to a unique vocabulary term. Moreover, if σ = 0 and α > 0, then S_{gtcv} d = d. Looking at Eq. 4, the product S_{gtcv}^{\top} S_{gtcv} essentially computes a Term Similarity Matrix where the similarity between two terms is based on the distribution of term weights in their respective global term context vectors, i.e., on the similarity of their global context histograms. The table of Fig. 3c illustrates the effect of redistribution (compared with BOW) of the term weights (semantic smoothing) in the transformed document vectors achieved by the proposed mapping.
The procedure of representing the input documents using GTCV-VSM takes place in the preprocessing phase. Let T_i be the length of the i-th document and V_i its vocabulary. Let also V be the size of the whole corpus vocabulary. Then, the cost to compute one ltcv vector at a location of the term sequence using Eq. 10, and to add its V_i non-zero dimensions to the respective dtcv, is O(T_i + V_i). This is done T_i times, and the final dtcv of each different term of the document is added to the respective gtcv row. Thus, denoting by T̄_i and V̄_i the average length and vocabulary size of the documents in a corpus, the cost of constructing the semantic matrix can be expressed as O(N · T̄_i · (T̄_i + 2 · V̄_i)). However, since V̄_i ≤ T̄_i ≪ V, the overall computational cost of the GTCV-VSM is determined by the O(N · V²) cost of the matrix multiplication of the mapping of Eq. 14.
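Putting the pieces together, a minimal end-to-end sketch of Eqs. 12-14 (ours; it reuses the ltcv function above, assumes each document is given as an integer index sequence, and uses σ = 10, λ = 0.2 only as illustrative defaults):

```python
import numpy as np

def gtcv_semantic_matrix(corpus_seqs, idf, sigma=10.0, lam=0.2):
    """Build the V x V semantic matrix S_gtcv: average the ltcv vectors of each
    term within a document (Eq. 12), accumulate the tf-weighted dtcv vectors
    over the corpus (Eq. 13), and Euclidean-normalize each row."""
    V, N = len(idf), len(corpus_seqs)
    S = np.zeros((V, V))
    for d_seq in corpus_seqs:
        d_seq = np.asarray(d_seq)
        sums = np.zeros((V, V))
        counts = np.zeros(V)
        for ell in range(len(d_seq)):
            v = d_seq[ell]
            sums[v] += ltcv(d_seq, ell, sigma, idf, N, lam)
            counts[v] += 1
        present = counts > 0
        dtcv = sums[present] / counts[present][:, None]   # Eq. 12: per-document average
        S[present] += counts[present][:, None] * dtcv     # Eq. 13: tf_{i,v}-weighted sum
    norms = np.linalg.norm(S, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    return S / norms                                       # rows are the gtcv vectors

def gtcv_map(D_bow, S_gtcv):
    """Eq. 14: map normalized BOW vectors (rows of D_bow) to the smoothed space."""
    return D_bow @ S_gtcv.T
```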
Table 1 Characteristics of text document collections

Name  Topics                                                                      Classes   N      Balance    V      V̄_i   T̄_i
D1    20-NGs: graphics, windows.x, motor, baseball, space, mideast                   6     2,000   200/400    4,343  48.8  110
D2    20-NGs: atheism, autos, baseball, electronics, med, mac, motor, politics.misc  7     3,500   500/500    6,442  52.6  108
D3    20-NGs: atheism, christian, guns, mideast                                      4     1,600   400/400    4,080  62    131
D4    20-NGs: forsale, autos, baseball, motor, hockey                                5     1,250   250/250    4,762  44.1  104
D5    Reuters-21578: acq, corn, crude, earn, grain, interest, money-fx, ship,       10     9,979   237/3,964  5,613  39.1   76
      trade, wheat

N denotes the number of documents, V is the size of the global vocabulary and V̄_i the average document vocabulary, Balance is the ratio of the smallest to the largest class, and T̄_i is the average length of the term sequences of documents
5 Clustering experiments
Our experimental setup was based on five different data sets: D1-D4 are subsets of the 20-Newsgroups,¹ while D5 is the Mod Apte split [2] version of the Reuters-21578² benchmark document collection, where the 10 classes with the largest number of training examples are kept. The characteristics of these data sets are presented in Table 1. The preprocessing of the data sets included the removal of all tags, headers, and metadata from the documents, while we applied word stemming and discarded terms appearing in fewer than five documents. It is worth mentioning how we preprocessed the term sequences of documents. We considered a dummy term that replaced in the sequences all the low-frequency terms that were discarded, so as to maintain the relative distance between the terms that remained in each sequence. For similar reasons, two dummy terms were considered at the end of every sentence, denoted by punctuation characters (e.g., '.', '?', '!'). The dummy term is ignored when constructing the final data vectors.
¹ http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz
² http://www.daviddlewis.com/resources/testcollections/reuters21578/reuters21578.tar.gz
For each data set, we have considered several data mappings, and after each mapping, the spherical k-means (spk-means) [8] and spectral clustering (spectral-c) [24] algorithms were applied to cluster the mapped document vectors into the k predefined number of clusters corresponding to the different topics (classes) in a collection. In contrast to k-means, which is based on the Euclidean distance [21], spk-means uses the Cosine similarity and maximizes the Cohesion of the clusters C = {c_1, ..., c_k}

$Cohesion(C) = \sum_{j=1}^{k} \sum_{d_i \in c_j} u_j^{\top} d_i,$    (15)

where u_j is the normalized centroid of cluster c_j with respect to the Euclidean norm.
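A minimal sketch of spherical k-means (ours; the random initialization and the fixed iteration cap are assumptions, not details taken from [8]):

```python
import numpy as np

def spherical_kmeans(X, k, n_iter=100, seed=0):
    """Spherical k-means on the rows of X (documents); assignments and the
    unit-normalized centroids greedily increase the Cohesion of Eq. 15."""
    rng = np.random.default_rng(seed)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)      # unit-length documents
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        labels = np.argmax(X @ centroids.T, axis=1)       # assign by cosine similarity
        new_centroids = np.vstack([
            X[labels == j].sum(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)])
        norms = np.linalg.norm(new_centroids, axis=1, keepdims=True)
        new_centroids /= np.where(norms == 0, 1.0, norms) # renormalize the centroids
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```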
Spectral clustering projects the document vectors onto a subspace that is spanned by the k largest eigenvectors of the Laplacian matrix L computed from the N × N similarity matrix A of pairwise Cosine similarities between documents. More specifically, the Laplacian matrix is computed as L = D^{-1/2} A D^{-1/2}, where D is a diagonal matrix whose i-th diagonal element contains the sum of the i-th row of similarities, D_{ii} = Σ_{j=1}^{N} A_{ij}. The next step is the construction of an N × k matrix X = {x_i : i = 1, ..., k} whose columns correspond to the k largest eigenvectors of L. The standard k-means algorithm is then used to cluster the rows of matrix X after they are normalized to unit length in Euclidean space, where the i-th row is the vector representation of the i-th document in the new feature space.
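A sketch of this spectral embedding (ours; the final k-means step is delegated to any standard implementation):

```python
import numpy as np

def spectral_embedding(X, k):
    """Compute the N x k spectral embedding described above: cosine similarity
    matrix A, normalized Laplacian L = D^{-1/2} A D^{-1/2}, the k largest
    eigenvectors, and row normalization; standard k-means is then run on the rows."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    A = Xn @ Xn.T                                    # pairwise cosine similarities
    d = A.sum(axis=1)                                # row sums D_ii
    d_inv_sqrt = 1.0 / np.sqrt(d)
    L = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(L)                   # eigenvalues in ascending order
    U = eigvecs[:, -k:]                              # k largest eigenvectors as columns
    return U / np.linalg.norm(U, axis=1, keepdims=True)
```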
Clustering evaluation was based on the supervised measure Normalized Mutual Information (NMI) and the F1-measure. We denote as n_i^{gt} the number of documents of class i, n_j the size of cluster j, n_{ij} the number of documents belonging to class i that are clustered in cluster j, and C^{gt} the grouping based on the ground truth labels of documents c_1^{gt}, ..., c_k^{gt} (true classes). Let us further denote p(c_i^{gt}) = n_i^{gt}/N and p(c_j) = n_j/N the probabilities of arbitrarily selecting a document from the data set that belongs to class c_i^{gt} and to cluster c_j, respectively, and p(c_i^{gt}, c_j) = n_{ij}/N the joint probability of arbitrarily selecting a document from the data set that belongs to cluster c_j and is of class c_i^{gt}. Then, the [0,1]-Normalized MI measure is computed by dividing the Mutual Information by the maximum between the cluster and class entropy:

$NMI(C^{gt}, C) = \frac{\sum_{c_i^{gt} \in C^{gt}} \sum_{c_j \in C} p\left(c_i^{gt}, c_j\right) \log \frac{p\left(c_i^{gt}, c_j\right)}{p\left(c_i^{gt}\right) p\left(c_j\right)}}{\max\left\{ H\left(C^{gt}\right), H(C) \right\}}.$    (16)

When C and C^{gt} are independent, the value of NMI equals zero, while it equals one if these partitions contain identical clusters.
The F1-measure is the harmonic mean of the precision and recall measures of the clustering solution:

$F_1 = \frac{2 \cdot precision \cdot recall}{precision + recall}.$    (17)

Higher values of F1 in [0,1] indicate better clustering solutions.
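For reference, a small sketch of the NMI computation of Eq. 16 (ours; it expects two integer label arrays of equal length):

```python
import numpy as np

def entropy(labels):
    """Entropy of a labeling, used in the denominator of Eq. 16."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log(p)))

def nmi(labels_true, labels_pred):
    """Normalized Mutual Information (Eq. 16): MI divided by the maximum of
    the class entropy and the cluster entropy."""
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    n = len(labels_true)
    mi = 0.0
    for c in np.unique(labels_true):
        for j in np.unique(labels_pred):
            p_ij = np.sum((labels_true == c) & (labels_pred == j)) / n
            if p_ij > 0:
                p_c = np.sum(labels_true == c) / n
                p_j = np.sum(labels_pred == j) / n
                mi += p_ij * np.log(p_ij / (p_c * p_j))
    return mi / max(entropy(labels_true), entropy(labels_pred))
```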
Tables 2, 3, 5, and 6 present the results of the experiments conducted for each collection. Specifically, we compared the classic BOW representation, the GVSM, the proposed GTCV-VSM method (with λ = 0.2 in Eq. 11), which represents the documents as described in Eq. 14, and the CVM-VSM as proposed in [5], where document vectors are computed based on Eq. 6 with idf weights. More specifically, for each collection, each representation method was tested for 100 runs of spk-means (Tables 2, 3, 4) and spectral-c (Tables 5, 6, 7). To provide fair comparative results, for each document collection, all methods were initialized using the same random document seeds. The average of all runs (avg), the average of the worst 10% of the clustering solutions (avg10%), and the best values are reported for each performance measure. The worst 10% concerns the 10% of the solutions with the lowest Cohesion, while the best clustering solution is that having the maximum Cohesion in the 100 runs (for spectral-c the sum of squared distances is considered for this purpose). The best result for each data set is emphasized in each of the avg and best columns. Moreover, in Fig. 4, we present the average clustering performance of spk-means with respect to the value of the λ parameter of Eq. 11 where, although not best for all cases, the value 0.2 we used seems to be a reasonable choice for all the data sets we have considered. Note that a similar effect was observed for the spectral-c method.
In order to illustrate the statistical significance of the obtained results, the well-known t-test was applied for each data set to determine the significance of the performance difference between our method and the compared methods. We have considered the case where σ = 10 for the Gaussian kernel for all data sets. Within a confidence interval of 95% and for the value of degrees of freedom equal to 198 (for two sets of 100 experiments each),
Table 2 NMI values of the clustering solution for VSM (BOW), GVSM, CVM-VSM and the proposed GTCV-VSM (for several values of σ) document representations using the spk-means algorithm

Method  σ   | D1: avg best avg10% | D2: avg best avg10% | D3: avg best avg10% | D4: avg best avg10% | D5: avg best avg10%
BOW     -   | 0.722 0.821 0.594 | 0.748 0.829 0.638 | 0.537 0.548 0.379 | 0.625 0.779 0.505 | 0.552 0.562 0.535
GTCV    1   | 0.749 0.854 0.601 | 0.767 0.845 0.638 | 0.544 0.564 0.372 | 0.667 0.793 0.515 | 0.570 0.578 0.561
GTCV    2   | 0.756 0.871 0.631 | 0.765 0.852 0.657 | 0.563 0.574 0.396 | 0.670 0.832 0.539 | 0.572 0.580 0.561
GTCV    5   | 0.773 0.881 0.687 | 0.777 0.864 0.662 | 0.577 0.602 0.400 | 0.688 0.851 0.539 | 0.589 0.633 0.578
GTCV    10  | 0.777 0.886 0.685 | 0.781 0.873 0.672 | 0.590 0.621 0.424 | 0.684 0.849 0.540 | 0.590 0.630 0.580
GTCV    30  | 0.761 0.879 0.659 | 0.776 0.863 0.653 | 0.579 0.590 0.369 | 0.683 0.842 0.518 | 0.576 0.612 0.568
GTCV    inf | 0.760 0.862 0.631 | 0.772 0.862 0.639 | 0.574 0.586 0.366 | 0.681 0.840 0.521 | 0.576 0.610 0.566
GVSM    -   | 0.752 0.832 0.611 | 0.747 0.822 0.637 | 0.556 0.576 0.419 | 0.670 0.827 0.547 | 0.575 0.580 0.573
CVM     -   | 0.750 0.841 0.612 | 0.754 0.851 0.659 | 0.547 0.604 0.400 | 0.672 0.824 0.541 | 0.578 0.581 0.575
Table 3 F1-measure values of the spk-means clustering solution for the different representation methods

Method  σ   | D1: avg best avg10% | D2: avg best avg10% | D3: avg best avg10% | D4: avg best avg10% | D5: avg best avg10%
BOW     -   | 0.779 0.920 0.685 | 0.780 0.901 0.645 | 0.703 0.706 0.570 | 0.735 0.918 0.558 | 0.675 0.697 0.646
GTCV    1   | 0.806 0.940 0.688 | 0.790 0.921 0.650 | 0.709 0.713 0.576 | 0.755 0.920 0.561 | 0.691 0.695 0.677
GTCV    2   | 0.814 0.946 0.688 | 0.792 0.924 0.674 | 0.721 0.728 0.580 | 0.764 0.938 0.598 | 0.698 0.714 0.672
GTCV    5   | 0.828 0.953 0.722 | 0.817 0.929 0.665 | 0.736 0.737 0.597 | 0.773 0.948 0.611 | 0.712 0.751 0.681
GTCV    10  | 0.832 0.954 0.733 | 0.820 0.936 0.603 | 0.737 0.739 0.603 | 0.773 0.947 0.581 | 0.712 0.749 0.681
GTCV    30  | 0.814 0.950 0.747 | 0.794 0.929 0.657 | 0.725 0.727 0.576 | 0.766 0.944 0.579 | 0.698 0.746 0.666
GTCV    inf | 0.813 0.942 0.689 | 0.792 0.926 0.651 | 0.722 0.728 0.576 | 0.765 0.944 0.581 | 0.698 0.744 0.666
GVSM    -   | 0.790 0.923 0.705 | 0.783 0.903 0.640 | 0.706 0.71 0.576 | 0.750 0.943 0.591 | 0.687 0.720 0.672
CVM     -   | 0.765 0.941 0.672 | 0.790 0.930 0.672 | 0.708 0.725 0.576 | 0.751 0.934 0.604 | 0.685 0.716 0.669
Table 4 The p and t values of the statistical significance t-test of the difference in spk-means performance using GTCV-VSM (σ = 10) and the compared representation methods, with respect to the two evaluation measures

GTCV (σ=10) vs | D1: p-val t-val    | D2: p-val t-val    | D3: p-val t-val    | D4: p-val t-val    | D5: p-val t-val
BOW (NMI)      | 0.011×10^-6  5.98  | 0.075×10^-3  4.05  | 0.025×10^-6  5.81  | 0.080×10^-8  6.45  | 0.0000  12.8
GVSM (NMI)     | 0.00008      2.68  | 0.081×10^-3  4.02  | 0.050×10^-3  4.15  | 0.085        1.73  | 0.056×10^-5  5.17
CVM (NMI)      | 0.0051       2.83  | 0.0010       3.33  | 0.052×10^-4  4.65  | 0.1659       1.39  | 0.077×10^-3  4.04
BOW (F1)       | 0.020×10^-5  5.39  | 0.050×10^-2  3.54  | 0.046×10^-2  3.56  | 0.0010       3.32  | 0.0000  12.8
GVSM (F1)      | 0.037×10^-3  4.22  | 0.00021      3.11  | 0.067×10^-2  3.45  | 0.0329       2.15  | 0.0000  9.06
CVM (F1)       | 0.081×10^-3  4.02  | 0.06×10^-8   6.50  | 0.0027       3.04  | 0.0314       2.18  | 0.0000  9.31

Values of p smaller than the significance level of 0.05 (5%) indicate significant superiority of GTCV-VSM
Table 5 NMI values of the clustering solution for VSM (BOW), GVSM, CVM-VSM and the proposed GTCV-VSM (for several values of σ) document representations using the spectral clustering algorithm

Method  σ   | D1: avg best avg10% | D2: avg best avg10% | D3: avg best avg10% | D4: avg best avg10% | D5: avg best avg10%
BOW     -   | 0.753 0.761 0.750 | 0.781 0.788 0.737 | 0.569 0.585 0.555 | 0.718 0.780 0.631 | 0.558 0.559 0.506
GTCV    1   | 0.770 0.774 0.769 | 0.790 0.795 0.750 | 0.614 0.626 0.600 | 0.735 0.779 0.642 | 0.560 0.561 0.516
GTCV    2   | 0.781 0.785 0.760 | 0.790 0.794 0.757 | 0.625 0.632 0.601 | 0.752 0.789 0.649 | 0.562 0.564 0.523
GTCV    5   | 0.794 0.804 0.790 | 0.833 0.853 0.763 | 0.639 0.640 0.619 | 0.768 0.827 0.669 | 0.579 0.600 0.557
GTCV    10  | 0.807 0.814 0.801 | 0.833 0.853 0.761 | 0.645 0.648 0.620 | 0.758 0.819 0.661 | 0.581 0.589 0.558
GTCV    30  | 0.791 0.796 0.769 | 0.807 0.832 0.743 | 0.613 0.613 0.609 | 0.755 0.797 0.647 | 0.567 0.582 0.535
GTCV    inf | 0.774 0.782 0.767 | 0.794 0.794 0.722 | 0.619 0.619 0.610 | 0.749 0.793 0.637 | 0.560 0.568 0.530
GVSM    -   | 0.756 0.770 0.702 | 0.794 0.830 0.747 | 0.593 0.595 0.586 | 0.722 0.780 0.637 | 0.548 0.554 0.513
CVM     -   | 0.761 0.768 0.751 | 0.801 0.823 0.760 | 0.605 0.606 0.590 | 0.728 0.794 0.642 | 0.557 0.566 0.519
Table 6 F1-measure values of the spectral clustering solution for the different representation methods

Method  σ   | D1: avg best avg10% | D2: avg best avg10% | D3: avg best avg10% | D4: avg best avg10% | D5: avg best avg10%
BOW     -   | 0.801 0.811 0.780 | 0.819 0.822 0.767 | 0.710 0.723 0.701 | 0.808 0.911 0.697 | 0.666 0.669 0.654
GTCV    1   | 0.811 0.819 0.809 | 0.822 0.832 0.772 | 0.729 0.741 0.728 | 0.834 0.915 0.722 | 0.694 0.703 0.663
GTCV    2   | 0.818 0.823 0.806 | 0.837 0.841 0.779 | 0.733 0.746 0.732 | 0.865 0.922 0.725 | 0.689 0.703 0.652
GTCV    5   | 0.837 0.840 0.818 | 0.887 0.927 0.792 | 0.744 0.756 0.737 | 0.870 0.930 0.740 | 0.716 0.727 0.647
GTCV    10  | 0.840 0.842 0.826 | 0.890 0.925 0.788 | 0.754 0.759 0.742 | 0.865 0.929 0.736 | 0.710 0.725 0.654
GTCV    30  | 0.823 0.826 0.809 | 0.856 0.886 0.769 | 0.726 0.735 0.725 | 0.864 0.925 0.705 | 0.704 0.701 0.642
GTCV    inf | 0.814 0.817 0.806 | 0.826 0.832 0.734 | 0.728 0.735 0.729 | 0.859 0.922 0.703 | 0.692 0.686 0.653
GVSM    -   | 0.756 0.770 0.702 | 0.826 0.901 0.780 | 0.709 0.714 0.724 | 0.823 0.916 0.705 | 0.642 0.657 0.654
CVM     -   | 0.761 0.768 0.779 | 0.831 0.897 0.791 | 0.725 0.725 0.723 | 0.825 0.916 0.713 | 0.673 0.678 0.654
Table 7 The p and t values of the statistical significance t-test of the difference in spectral clustering performance using GTCV-VSM (σ = 10) and the compared representation methods, with respect to the two evaluation measures

GTCV (σ=10) vs | D1: p-val t-val | D2: p-val t-val    | D3: p-val t-val | D4: p-val t-val    | D5: p-val t-val
BOW (NMI)      | 0.0000  27.3    | 0.0000       13.8  | 0.0000  620     | 0.026×10^-4  4.85  | 0.0000  8.03
GVSM (NMI)     | 0.0000  16.7    | 0.0000       7.51  | 0.0000  130     | 0.129×10^-5  4.99  | 0.0000  12.1
CVM (NMI)      | 0.0000  19.3    | 0.150×10^-8  6.35  | 0.0000  138     | 0.316×10^-3  3.67  | 0.0000  8.83
BOW (F1)       | 0.0000  24.1    | 0.0000       11.4  | 0.0000  875     | 0.123×10^-4  4.48  | 0.0000  19.1
GVSM (F1)      | 0.0000  15.1    | 0.0000       7.53  | 0.0000  410     | 0.113×10^-2  3.31  | 0.0000  30.7
CVM (F1)       | 0.0000  18.7    | 0.0000       7.11  | 0.0000  268     | 0.115×10^-3  3.94  | 0.0000  14.1

Values of p smaller than the significance level of 0.05 (5%) indicate significant superiority of GTCV-VSM
Fig. 4 The effect of varying the parameter λ on the spk-means clustering performance for each data set. Eq. 11 is used to determine the term self-weight α_ν when computing the ltcv histograms

the critical value for t is t_c = 1.972 (p_c = 5% for the p value). This means that if the computed t ≥ t_c, then the null hypothesis is rejected (p ≤ 5%, respectively), i.e., our method is superior; otherwise, the null hypothesis is accepted. As can be observed from the results of the statistical tests for spk-means presented in Table 4, the performance superiority of GTCV-VSM is clearly significant in four out of five data sets with respect to all other methods. For data set D4, the tests indicate that GTCV-VSM, although still better than BOW, has a less significant difference in performance compared with GVSM and CVM-VSM. Table 7 provides the respective t-test results for the spectral-c method where, also due to the lower standard deviation of the results using all document representation methods, the GTCV-VSM demonstrates significantly better results than the compared representations.
The experimental results indicate that our method outperforms the traditional BOW approach in all cases, even for small values of the smoothing parameter σ (e.g., σ = 1 or 2). This substantiates our rationale that the clustering procedure is assisted by the proposed semantic
smoothing, which takes into account the local contextual information associated with a term occurrence. GTCV-VSM requires moderate values of the parameter σ to achieve better performance. The same is observed for the quality (in terms of NMI or F1) of the best solution (i.e., the one with maximum Cohesion) found in the 100 runs, where moderate values of σ (i.e., σ = 5 or 10) result in better GTCV-VSM performance. Moreover, the clustering results for a wide range of values of the smoothing parameter σ indicate that the method is quite robust to the specification of this parameter. GTCV-VSM behaves similarly to BOW when a low value is set for σ, while when this value becomes very high, the discriminative information of the global term context vectors is reduced. This was demonstrated using the spk-means and spectral clustering methods. Among them, the latter in all cases except D5 presented better average clustering solutions in terms of both evaluation measures NMI and F1, while interestingly, spk-means was superior in terms of the best clustering solutions in most cases (with the exception of D3) despite operating in a feature space of a much larger size.
6 Conclusions
We have presented the global term context vector-VSM (GTCV-VSM) document representation, an extension to the vector space model that determines a proper feature space onto which the typical VSM document vector representations are projected. Our approach is entirely corpus-based and operates in the preprocessing phase in a sequence of four steps: (i) it captures local contextual information associated with each term occurrence in the term sequences of documents; (ii) it summarizes the local context vectors of each term into the respective global term context vector; (iii) it constructs the semantic matrix for a problem using the global term context vectors; and finally, (iv) it projects the documents using the semantic matrix. The proposed approach achieves semantic smoothing by reducing data sparsity, while retaining the original dimensionality. The derived representation maintains the initial interpretability, since each dimension is associated with a single vocabulary term. In the experimental document clustering study, we compared the proposed representation with the typical VSM, the Generalized-VSM and CVM-VSM, using Cosine similarity. The statistical analysis of the obtained results indicates that GTCV-VSM assists well-known clustering algorithms, such as spherical k-means and spectral clustering, to achieve better clustering solutions compared with other representation methods.
Our plans for future work are to investigate the potential of combining the local and global contextual information associated with terms, to explore ways of building compact concept vectors, to efficiently project the transformed document vectors in feature spaces of lower dimensionality, and to perform a systematic study of procedures that could efficiently compute the α_ν parameters (Eq. 11) for each vocabulary term, which could improve the global term context vectors. Finally, we aim at examining the proposed representation for document classification.
References
1. AlSumait L, Domeniconi C (2008) Text clustering with local semantic kernels. In: Berry M, Castellanos
M (eds) Survey of text mining II. Springer, London, pp 219–232
2. Apté C, Damerau F, Weiss SM (1994) Towards language independent automated learning of text categorization models. In: SIGIR '94: proceedings of the 17th annual international ACM SIGIR conference on research and development in information retrieval. Springer, New York, pp 23–30
3. Beil F, Ester M, Xu X (2002) Frequent term-based text clustering. In: KDD ’02: proceedings of the 8th
ACM SIGKDD international conference on knowledge discovery and data mining. ACM, New York,
pp 436–442. doi:10.1145/775047.775110
4. Beyer K, Goldstein J, Ramakrishnan R, Shaft U (1999) When is “nearest neighbor” meaningful?
In: ICDT ’99: proceedings of the 7th international conference on database theory. Springer, London,
pp 217–235
5. Billhardt H, Borrajo D, Maojo V (2002) A context vector model for information retrieval. J Am Soc Inf
Sci Technol 53(3):236–249. doi:10.1002/asi.10032
6. Chen C, Tseng F, Liang T (2010) An integration of fuzzy association rules and wordnet for document
clustering. Knowl Inf Syst (available online). doi:10.1007/s10115-010-0364- 2
7. Deerwester S, Dumais S, Furnas G, Landauer T, Harshman R (1990) Indexing by latent semantic analysis.
J Am Soc Inf Sci 41:391–407
8. Dhillon I, Modha D (2001) Concept decompositions for large sparse text data using clustering. Mach
Learn 42(1):143–175. doi:10.1023/A:1007612920971
9. Farahat A, Kamel M (2010) Statistical semantics for enhancing document clustering. Knowledge and
Information Systems (available online). doi:10.1007/s10115-010-0367-z
10. Fung B, Wang K, Ester M (2003) Hierarchical document clustering using frequent itemsets. In: Proceed-
ings of SIAM international conference on data mining
11. Ghosh J, Strehl A (2006) Similarity-based text clustering: a comparative study. In: Kogan J, Nicholas C,
Teboulle M (eds) Grouping multidimensional data. Springer, Berlin, pp 73–97
12. Grauman K, Darrell T (2007) The pyramid match kernel: efficient learning with sets of features. J Mach
Learn Res 8:725–760. doi:10.1145/361219.361220
13. Hotho A, Maedche E, Staab S (2001) Ontology-based text document clustering. Knstliche Intell 4:48–54
14. Hu X, Sun N, Zhang C, Chua T (2009) Exploiting internal and external semantics for the clustering
of short texts using world knowledge. In: Proceeding of the 18th ACM conference on information and
knowledge management. ACM, New York, CIKM ’09, pp 919–928. doi:10.1145/1645953.1646071
15. Jing J, Zhou L, Ng M, Huang Z (2006) Ontology-based distance measure for text clustering. In: Proceed-
ings SIAM SDM workshop on text mining
16. Karypis G, Han E (2000) Concept indexing: a fast dimensionality reduction algorithm with applications
to document retrieval and categorization. In: Technical report TR-00-0016. University of Minnesota
17. Keikha M, Razavian N, Oroumchian F, Razi H (2008) Document representation and quality of text: an
analysis. In: Berry M, Castellanos M (eds) Survey of text mining II. Springer, London, pp 219–232
18. Lebanon G, Mao Y, Dillon J (2007) The locally weighted bag of words framework for document repre-
sentation. J Mach Learn Res 8:2405–2441
19. Lewis D (1992) An evaluation of phrasal and clustered representations on a text categorization task.
In: SIGIR ’92: Proceedings of the 15th annual international ACM SIGIR conference on research and
development in information retrieval. ACM, New York, pp 37–50. doi:10.1145/133160.133172
20. Li Y, Chung S, Holt J (2008) Text document clustering based on frequent word meaning sequences. Data
Knowl Eng 64(1):381–404. doi:10.1016/j.datak.2007.08.001
21. McQueen J (1967) Some methods for classification and analysis of multivariate observations. In:
Proceedings of 5th Berkley symposium on mathematical statistics and probability. pp 281–297
22. Miller G, Beckwith R, Fellbaum C, Gross D, Miller K (1990) Wordnet: an on-line lexical database. Int J
Lexicogr 3:235–244
23. Mladenic D (1998) Machine learning on non-homogeneous, distributed text data. PhD thesis, University
of Ljubljana, Faculty of Computer and Information Science
24. Ng A, Jordan M, Weiss Y (2001) On spectral clustering: analysis and an algorithm. Adv Neural Inf
Process Syst 14:849–864
25. Ni X, Quan X, Lu Z, Wenyin L, Hua B (2010) Short text clustering by finding core terms. Knowl Inf Syst
1–21. doi:10.1007/s10115-010-0299-7
26. Porter M (1997) An algorithm for suffix stripping. In: Jones K, Willett P (eds) Readings in information
retrieval. Morgan Kaufmann Publishers, San Francisco, pp 313–316
27. Pu W, Liu N, Yan S, Yan J, Xie K, Chen Z (2007) Local word bag model for text categorization. In:
ICDM ’07: proceedings of the 2007 7th IEEE international conference on data mining. IEEE Computer
Society, Washington, pp 625–630. doi:10.1109/ICDM.2007.69
28. Salton G, Wong A, Yang C (1975) A vector space model for automatic indexing. Commun ACM
18(11):613–620. doi:10.1145/361219.361220
29. Wang P, Domeniconi C (2008) Building semantic kernels for text classification using wikipedia. In:
KDD ’08: proceeding of the 14th ACM SIGKDD international conference on knowledge discovery and
data mining. ACM, New York, pp 713–721. doi:10.1145/1401890.1401976
30. Wikipedia (2004) Wikipedia, the free encyclopedia. http://en.wikipedia.org/
123
A. Kalogeratos, A. Likas
31. Wong S, Ziarko W, Wong P (1985) Generalized vector spaces model in information retrieval. In:
SIGIR ’85: proceedings of the 8th annual international ACM SIGIR conference on research and devel-
opment in information retrieval. ACM, New York, pp 18–25. doi:10.1145/253495.253506
Author Biographies
Argyris Kalogeratos received the B.Sc. and M.Sc. degrees in Computer Science from the University of Ioannina, Ioannina, Greece, in 2006 and 2008, respectively. Currently, he is pursuing the Ph.D. degree in the Department of Computer Science, University of Ioannina. His research interests include machine learning, data clustering, text representation, and mining.

Aristidis Likas received the Diploma degree in electrical engineering and the Ph.D. degree in electrical and computer engineering from the National Technical University of Athens, Greece, in 1990 and 1994, respectively. Since 1996, he has been with the Department of Computer Science, University of Ioannina, Greece, where he is currently an Associate Professor. His research interests include machine learning, data mining, multimedia content analysis and bioinformatics.