Mining Knowledge from Corpora: An Application to Retrieval and Indexing

Lina F. SOUALMIA (a, 1), Badisse DAHAMNA (b), Stéfan DARMONI (b)
(a) LIM&Bio EA 3969, SMBH Léonard de Vinci, Paris XIII University, Bobigny, France.
(b) CISMeF team and LITIS EA 4051, University of Rouen, France
Abstract. The present work aims at discovering new associations between medical
concepts to be exploited as input in retrieval and indexing. Material and Methods:
The association rule method is applied to documents. The process is carried out on
three major document categories referring to e-health information consumers:
health professionals, students and lay people. Association rule evaluation is
based on statistical measures combined with domain knowledge. Results:
Association rules represent existing relations between medical concepts (60.62%)
and new knowledge (54.21%). Based on observations, 463 expert rules are defined
by medical librarians for retrieval and indexing. Conclusions: Association rules
bear out existing relations, produce new knowledge and support users and indexers
in document retrieval and indexing.
Keywords. Data Analysis; Indexing; Terminology; Data Mining; MeSH.
Introduction
The Internet is a major source of biomedical knowledge. As access to structured
medical information is difficult with directories or general search engines, many
applications have been developed [1]. Since 1995, CISMeF (acronym of Catalog and
Index of French-speaking Medical Sites) [2] has been selecting institutional and
educational resources for patients, students and health professionals. It references
36,247 e-documents by using Medical Subject Headings (MeSH) [3]. Among the many
sources that support users, such as morphological bases and dynamic and contextual search
tools [4], the MeSH structure is exploited. To complete these sources, we propose to mine
e-documents to discover new associations between medical concepts by data mining.
1. Material and Methods
1.1. Medical Subject Headings, Resource Types and Metaterms
The MeSH thesaurus is used by the National Library of Medicine for indexing
biomedical resources. Its core is a hierarchical structure that consists of sets of
descriptors: at the top level general headings (e.g. diseases) and deeper more specific
headings (e.g. brain infarction). The 2007 version contains over 24,357 main headings
(e.g. hepatitis) and 83 subheadings (e.g. diagnosis). Together with a main heading, a
subheading can be used to specify a particular aspect. For example, the pair
[hepatitis/diagnosis] specifies the diagnosis aspect of hepatitis.

1 Corresponding Author: Lina Soualmia, LIM&Bio EA 3969, 74 rue Marcel Cachin, 93017 Bobigny, France;
E-mail: lina.soualmia@gmail.com.
eHealth Beyond the Horizon – Get IT There
S.K. Andersen et al. (Eds.)
IOS Press, 2008
© 2008 Organizing Committee of MIE 2008. All rights reserved.

MeSH was originally used to index biomedical scientific articles for the MEDLINE
database. In order to customize it to the field of e-health resources, resource types have
been introduced [2]. CISMeF resource types are an extension of MEDLINE publication
types (e.g. clinical guidelines). Each document in CISMeF is described with a set of
MeSH main headings, subheadings and CISMeF resource types. Each main heading,
[main heading/subheading] pair and resource type is allotted a ‘minor’ or ‘major’
weight, according to the importance of the concept it refers to in the resource. Major
terms are marked by a star (*).
1.2. Data Mining
Knowledge extraction from databases, or data mining in computer science [5], consists
in discovering additional information from large structured sets of data. This
knowledge can be used to make predictions about new data or to explain existing data.
One of the objectives of the extraction process is the generation of association rules. It
proceeds in several steps: data and context preparation (selection of objects and items),
extraction of frequent itemsets (compared to a minimum support threshold), generation
of the most informative rules using a data mining algorithm, and finally interpretation and
deduction of new knowledge [6]. An extraction context is a triplet C=(O, I, R) where
O is the set of objects, I is the set of all the items and R is a binary relation between O
and I.
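
As a minimal sketch of this definition, the triplet C=(O, I, R) can be held in a small Python structure. The class name `Context`, its method names and the toy documents are illustrative assumptions and are not taken from the original work.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Context:
    """Extraction context C = (O, I, R); all names here are illustrative."""
    objects: frozenset   # O: the set of objects
    items: frozenset     # I: the set of all items
    relation: frozenset  # R: set of (object, item) pairs

    def items_of(self, obj):
        """Items related to one object."""
        return frozenset(i for (o, i) in self.relation if o == obj)

    def objects_with(self, itemset):
        """Objects related to every item of the itemset."""
        wanted = frozenset(itemset)
        return frozenset(o for o in self.objects if wanted <= self.items_of(o))


# Hypothetical toy data: three documents and their descriptors.
pairs = {
    ("doc1", "hepatitis"), ("doc1", "hepatitis vaccines"),
    ("doc2", "hepatitis"),
    ("doc3", "influenza"),
}
ctx = Context(
    objects=frozenset(o for o, _ in pairs),
    items=frozenset(i for _, i in pairs),
    relation=frozenset(pairs),
)
print(sorted(ctx.objects_with({"hepatitis"})))  # ['doc1', 'doc2']
```

Storing R as explicit pairs mirrors the definition directly; a real implementation would more likely index items per object for speed.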
1.2.1. Association Rules
A data mining system may generate several thousand or even several million
frequent association rules, and only some of them are interesting. An association rule is
interesting if it is easily understandable by the users, valid for new data, useful, or if it
confirms a hypothesis. It is expressed as
$i_1 \wedge i_2 \wedge \dots \wedge i_k \Rightarrow i_{k+1} \wedge \dots \wedge i_n$
and states that if an object has the items $\{i_1, i_2, \dots, i_k\}$ it also tends to have the
items $\{i_{k+1}, \dots, i_n\}$.
Support represents the rule utility. It corresponds to the proportion of objects which
contain at the same time the antecedent and the consequent:
$\mathrm{Support} = |\{i_1, i_2, \dots, i_n\}|$,
where $|X|$ denotes the number of objects containing all the items of $X$.
Confidence represents precision and corresponds to the proportion of objects that
contain the consequent of the rule among those containing the antecedent:
$\mathrm{Confidence} = |\{i_1, i_2, \dots, i_n\}| \, / \, |\{i_1, i_2, \dots, i_k\}|$.
Two rule types are distinguished: exact rules, having Confidence=100%, i.e. verified by all
the objects of the database that contain the antecedent, and approximative rules, having
Confidence<100%.
1.2.2. A-Close for Mining e-Documents
The problem of the relevance and usefulness of extracted association rules is of
primary importance because real-life databases lead to several thousand or even
millions of association rules whose confidence measures are high, and among which
there are many redundancies, i.e. rules conveying the same information. Two
bases for association rules are defined by A-Close [7]. These bases are generating sets for all
valid non-redundant association rules; they are thus smaller and composed of rules with minimal
antecedents and maximal consequents, i.e. the most relevant association rules. We adapt
A-Close to the case of an e-health document database by considering conceptual
indexing: the set of objects O is the set of indexed documents; the set of items I is the
set of MeSH descriptors; the relation R represents the indexing relation between an
object and an item, i.e. between a document and a descriptor.
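
The notion of closed itemset that underlies this family of algorithms can be sketched as follows. This is not the A-Close algorithm itself, only the closure operator it builds on: the closure of an itemset is the set of items shared by every document that contains it. The function names and the toy documents are assumptions made for illustration.

```python
def closure(transactions, itemset):
    """Items common to all objects that contain the itemset."""
    itemset = frozenset(itemset)
    covering = [frozenset(t) for t in transactions if itemset <= frozenset(t)]
    if not covering:
        # No object contains the itemset: by convention its closure is
        # the full set of items.
        return frozenset().union(*map(frozenset, transactions))
    return frozenset.intersection(*covering)


def is_closed(transactions, itemset):
    """An itemset is closed when it equals its own closure."""
    return closure(transactions, itemset) == frozenset(itemset)


# Hypothetical toy collection: documents described by MeSH descriptors.
docs = [
    {"appendicitis", "appendectomy", "child"},
    {"appendicitis", "appendectomy"},
    {"appendicitis"},
]
print(sorted(closure(docs, {"appendectomy"})))  # ['appendectomy', 'appendicitis']
print(is_closed(docs, {"appendectomy"}))        # False
```

Here the small itemset {appendectomy} has the larger closure {appendectomy, appendicitis}, which corresponds to an exact rule with a minimal antecedent (appendectomy) and a maximal consequent (appendicitis).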
1.2.3. Processing Collections of Documents
End-users are categorised in CISMeF into three main types: health professionals, medical
students, and patients and lay people. Rather than extracting knowledge referring to the
main medical specialties as in [4], we consider the three major resource types
guidelines*, education* and patients* and two kinds of itemsets: the set of major main
headings (MH*) and the set of major [main heading/subheading] pairs (MH/SH*).
Table 1. Description of the collections of documents.
Resources    Documents  Items    Min  Max  Mean
Guidelines*  2,727      MH*      1    64   5.21
                        MH/SH*   1    70   6.12
Patients*    3,272      MH*      0    25   1.63
                        MH/SH*   0    30   1.82
Education*   3,610      MH*      0    25   2.22
                        MH/SH*   0    34   2.73
2. Results
2.1. Mining e-Documents
For all contexts, minimum support was fixed to minsup=20 and minimum confidence
to minconf=70% for the approximative association rules (Table 2). We obtain
association rules between major MH* (resp. MH/SH* pairs). For the major resource
types patients* and education*, all (100%) association rules are between two MH*
(resp. MH/SH* pairs), i.e. one descriptor in the antecedent and one descriptor in the
consequent. For guidelines*, 24% of the rules are between more than two descriptors.
Characteristics of the documents may explain these results: the average number of descriptors
ranges from 1.63 to 2.22 for patients* and education*, whereas it ranges from 5.21 to 6.12
for guidelines*.
Table 2. Number of rules, exact rules (ER), approximative rules (AR) and pairs.
Context: item = MH*
Resources    Rules  ER (Conf=1)   AR (Conf ≥ 0.7)  Pairs
Guidelines*  50     12 (24%)      38 (76%)         38 (76%)
Patients*    20     9 (45%)       11 (55%)         20 (100%)
Education*   23     6 (26.09%)    17 (73.91%)      23 (100%)
Context: item = [MH/SH]*
Resources    Rules  ER (Conf=1)   AR (Conf ≥ 0.7)  Pairs
Guidelines*  39     8 (20.51%)    31 (79.49%)      35 (76%)
Patients*    19     8 (42.1%)     11 (57.9%)       19 (100%)
Education*   25     13 (52%)      12 (48%)         25 (100%)
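
To make the role of the two thresholds concrete, the sketch below enumerates rules between two descriptors by brute force and separates exact from approximative rules. It is a deliberately naive illustration and not the A-Close algorithm; the function name, the toy documents and the reduced threshold minsup=2 are assumptions.

```python
from collections import Counter
from itertools import permutations


def pair_rules(transactions, minsup, minconf):
    """Rules a -> b between two items that satisfy both thresholds."""
    item_count = Counter()
    pair_count = Counter()
    for t in transactions:
        items = set(t)
        item_count.update(items)
        # Every ordered pair (a, b) with a != b, counted once per object.
        pair_count.update(permutations(sorted(items), 2))
    rules = []
    for (a, b), n in pair_count.items():
        conf = n / item_count[a]
        if n >= minsup and conf >= minconf:
            rules.append((a, b, n, conf))
    return sorted(rules)


# Hypothetical toy collection (the experiments above use minsup=20).
docs = [{"hepatitis", "hepatitis vaccines"}] * 3 + [{"hepatitis"}]
rules = pair_rules(docs, minsup=2, minconf=0.7)
exact = [r for r in rules if r[3] == 1.0]
approximative = [r for r in rules if r[3] < 1.0]
print(exact)          # [('hepatitis vaccines', 'hepatitis', 3, 1.0)]
print(approximative)  # [('hepatitis', 'hepatitis vaccines', 3, 0.75)]
```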
Another experiment is carried out in the context of documents with the resource
type guidelines* to obtain more complete association rules: we consider the descriptors
MH and MH/SH pairs without allotted minor or major weight. An average of 12
descriptors, with a minimum of 1 and a maximum of 301, compose the
documents (Table 3). As A-Close works on databases with a maximum of 12 items, we
have added a constraint on the number of descriptors. To avoid long generation times
and to have interpretable association rules, we added the maximum size of the closed
itemsets as a new parameter of the algorithm.
Table 3. Description of the documents of the Guidelines* collection.
Resources    Docs   Items  Min  Max  Mean
Guidelines*  2,727  MH     1    111  10.08
                    MH/SH  1    301  13.54
We obtain a high number of association rules with a minimum support threshold
minsup=20 and a minimum confidence threshold minconf=70% (Table 4) but only
0.95% (respectively 1.92%) are between two MH (respectively between two MH/SH
pairs). By reducing the confidence from 1 to 0.7 the number of rules between MH
(respectively between MH/SH) grows by a factor of 5 (respectively 4.42).
Table 4. Association rules between MH and MH/SH in the context Guidelines*.
Items  Rules   ER (Conf=1)     AR (Conf ≥ 0.7)  Pairs
MH     35,454  6,990 (19.71%)  28,464 (80.29%)  338 (0.95%)
MH/SH  27,011  6,102 (22.6%)   20,909 (77.4%)   520 (1.92%)
2.2. Association Rules Evaluation
As defined, an interesting association rule confirms a hypothesis or states a new
hypothesis [6]. We propose here to combine background domain knowledge with
simple statistical measures traditionally used in association rule mining for evaluation.
We consider several cases of interesting association rules according to relations
between MeSH descriptors [3]. As these relations are defined between two main
headings and between two subheadings, we consider only the association rules between
two elements. Hence, an interesting existing association rule could associate: a
(in)direct son and its father (FS); two descriptors that belong to the same hierarchy
(same (in)direct father) (B); two descriptors with a See Also relation (SA). These rules
are automatically classified thanks to the MeSH structure. The other rules that satisfy
minsup and minconf are then considered as «new» interesting association rules.
Exact association rules, except for the patients* collection, are mostly new interesting
rules: from 62.5% to 99.86%. Therefore, existing rules are mainly from the patients*
collection: 77.77% for MH* and 75% for MH/SH*. Approximative rules, except for the
guidelines* collection with items MH and MH/SH pairs, are mostly existing interesting
rules: from 58.07% to 78.73%. New interesting rules are between MH and MH/SH
from the collection guidelines*: 99.73% for MH and 99.52% for MH/SH.
Subjective interest measures are based on the expert knowledge about the data, i.e.
here the medical librarian. New interesting rules for the contexts MH* and MH/SH*
pairs are evaluated manually. 93.75% (resp. 84.78%) of the interesting new rules with
confidence=1 (resp. confidence ≥ 0.7) between major descriptors are validated.
Table 5. Association rules evaluation according to MeSH structure (FS, B and SA: existing knowledge).
Exact rules: Confidence=1
Resources    Items   FS          B           SA          New
Patients*    MH*     0 (0%)      5 (55.55%)  2 (22.22%)  2 (22.22%)
             MH/SH*  0 (0%)      5 (62.5%)   1 (12.5%)   2 (25%)
Education*   MH*     1 (16.66%)  1 (16.66%)  0 (0%)      4 (66.67%)
             MH/SH*  1 (7.69%)   0 (0%)      1 (7.69%)   11 (87.62%)
Guidelines*  MH*     0 (0%)      0 (0%)      4 (33.33%)  8 (66.67%)
             MH/SH*  1 (12.5%)   1 (12.5%)   1 (12.5%)   5 (62.5%)
             MH      0 (0%)      2 (0.03%)   8 (0.14%)   6,980 (99.86%)
             MH/SH   6 (0.1%)    4 (0.06%)   7 (0.11%)   6,085 (99.73%)
Approximative rules: Confidence ≥ 0.7
Resources    Items   FS          B           SA          New
Patients*    MH*     2 (18.18%)  2 (18.18%)  4 (36.36%)  3 (27.27%)
             MH/SH*  2 (18.18%)  2 (18.18%)  3 (27.27%)  7 (36.36%)
Education*   MH*     2 (11.76%)  6 (35.29%)  3 (17.64%)  6 (35.29%)
             MH/SH*  2 (16.76%)  3 (25%)     2 (16.76%)  5 (41.66%)
Guidelines*  MH*     2 (5.26%)   7 (18.42%)  10 (26.31%) 12 (31.57%)
             MH/SH*  3 (9.67%)   3 (9.67%)   9 (29.03%)  13 (41.93%)
             MH      12 (0.04%)  37 (0.13%)  30 (0.10%)  28,382 (99.73%)
             MH/SH   25 (0.12%)  50 (0.23%)  27 (0.13%)  20,807 (99.52%)
Figure 1. Some examples of new interesting rules validated by the expert
pain/drug therapy opioids analgesics/administration and dosage
breast cancer/diagnosis mammography
aids/prevention and control condom
influenza vaccines influenza/prevention and control
Turner syndrome child human growth hormone growth disorders
obstetric delivery pregnancy
prostate cancer/surgery biopsy prostatectomy
amniocentesis prenatal diagnosis chorionic villi sampling
2.3. Indexing Correction and Expert Rules
Documents are manually indexed and, according to the indexing policy, the more
precise descriptor should be used, i.e. the one at the lower level in the hierarchy. However,
1,466 documents contain descriptors that have a father-son relation and 478 documents are
indexed by subheadings that are related to each other while being associated with the same
keyword. For example, a document is indexed by trisomy and chromosome aberrations, whereas
trisomy is a chromosome aberration. This may explain the proportion of existing
associations. Corrections should be proposed to the indexers.
The main lesson learned from association rule extraction and evaluation is the
modeling and formalisation of rules between [main heading/subheading] pairs based on
observations. The pattern of the rule hepatitis/prevention and control → hepatitis vaccines is
used to model dysentery bacillary/prevention and control → shigella vaccines. 463 rules are
modeled. Formalization concerns different cases and contexts for retrieval and
indexing. The rule MH1/SH1 → MH2 states that MH1/SH1 should be replaced by the
main heading MH2. For example, abdomen/radiography → radiography abdominal. The
rule MH1/SH1 ++ MH2/SH2 states that the pair MH2/SH2 should be added to the pair
MH1/SH1. For example, appendectomy ++ appendicitis/surgery states that the pair
appendicitis/surgery should be added to queries (or to document description when indexing)
containing the main heading appendectomy.
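
The two expert rule patterns described in Section 2.3, replacement of a pair by a main heading and addition of a pair ("++"), can be sketched as simple lookups applied to a query or to a document description. The data structures and the function name below are hypothetical; only the two example rules come from the text.

```python
# A term is a (main heading, subheading) tuple; the subheading may be None.
REPLACE_RULES = {
    ("abdomen", "radiography"): ("radiography abdominal", None),
}
ADD_RULES = {
    ("appendectomy", None): [("appendicitis", "surgery")],
}


def apply_expert_rules(terms):
    """Apply replacement rules, then addition rules, keeping term order."""
    result = []
    for term in terms:
        term = REPLACE_RULES.get(term, term)
        if term not in result:
            result.append(term)
        for extra in ADD_RULES.get(term, []):
            if extra not in result:
                result.append(extra)
    return result


query = [("abdomen", "radiography"), ("appendectomy", None)]
print(apply_expert_rules(query))
# [('radiography abdominal', None), ('appendectomy', None), ('appendicitis', 'surgery')]
```

Keeping the rules as data rather than code would let medical librarians extend the rule set without touching the expansion logic.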
3. Discussion and Future Work
There is increasing activity in text mining in the genomic domain [8]. In [9], co-occurrences
between Gene Ontology terms are analyzed and association rules are
mined to identify pairs of related GO terms. Association rules are more complete than
co-occurrence measures between pairs of concepts, but one of the challenging issues is
the overabundance of associations that may be discovered, as in [10]. A-Close generates
all the valid non-redundant association rules composed of minimal antecedents and
maximal consequents. Evaluation is processed in two steps: first, the selection of the
most informative rules and, second, the classification of the rules according to the MeSH
taxonomy structure to filter existing associations. Only the most frequent rules that are
not classified are presented to the expert for a final evaluation. This method combines
statistical measures and background domain knowledge.
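
The second evaluation step, classifying a rule between two descriptors as father-son (FS), same hierarchy (B), See Also (SA) or new, can be sketched as below. The toy hierarchy and See Also pair are hypothetical and are not the real MeSH tree; the sketch also assumes a single parent per descriptor, whereas MeSH descriptors may have several.

```python
# Hypothetical toy structure: child -> parent, plus unordered See Also pairs.
PARENT = {
    "trisomy": "chromosome aberrations",
    "monosomy": "chromosome aberrations",
}
SEE_ALSO = {frozenset({"mammography", "breast cancer"})}


def ancestors(term, parent):
    """All (in)direct fathers of a term, nearest first."""
    found = []
    while term in parent:
        term = parent[term]
        found.append(term)
    return found


def classify(a, b, parent=PARENT, see_also=SEE_ALSO):
    """Return 'FS', 'B', 'SA' or 'New' for a rule between a and b."""
    anc_a, anc_b = ancestors(a, parent), ancestors(b, parent)
    if a in anc_b or b in anc_a:
        return "FS"   # one descriptor is an (in)direct father of the other
    if set(anc_a) & set(anc_b):
        return "B"    # both share an (in)direct father
    if frozenset((a, b)) in see_also:
        return "SA"   # linked by a See Also relation
    return "New"      # not explained by the structure: shown to the expert


print(classify("trisomy", "chromosome aberrations"))  # FS
print(classify("trisomy", "monosomy"))                # B
print(classify("condom", "aids"))                     # New
```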
Association rules are used in retrieval for query expansion (automatic and
interactive) and for enriching users’ queries with new knowledge [4]. As exact rules
(respectively approximative rules) state that the antecedent and the consequent are at
the same time in all (respectively some) documents, this kind of rule should be used in
automatic (respectively interactive) query expansion. However, these expansions work
only in the case of queries that return documents. Association rules link conceptual
structures of the documents, i.e. descriptors organised in hierarchies on which it is
possible to make specialization and generalization. We plan to generate generalized
association rules and to examine how our approach works with other data collections
such as MEDLINE. Association rules and expert rules can be translated into
automata for the automatic indexing of raw text documents. Finally,
formalised association rules could improve the power of reasoning based on MeSH-OWL [11].
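
The distinction between automatic and interactive expansion can be sketched as follows: consequents of exact rules are added to the query directly, while consequents of approximative rules are only proposed to the user. The function name, the rule format and the confidence value 0.8 are assumptions made for illustration.

```python
def expand_query(query_terms, rules):
    """Split rule consequents into automatic additions and suggestions.

    rules: iterable of (antecedent, consequent, confidence) triples.
    """
    query_terms = set(query_terms)
    automatic, suggested = set(), set()
    for antecedent, consequent, conf in rules:
        if antecedent in query_terms and consequent not in query_terms:
            if conf == 1.0:
                automatic.add(consequent)   # exact rule
            else:
                suggested.add(consequent)   # approximative rule
    return automatic, suggested - automatic


# Hypothetical rules; the confidence values are invented for the example.
rules = [
    ("hepatitis vaccines", "hepatitis", 1.0),
    ("breast cancer/diagnosis", "mammography", 0.8),
]
auto, ask = expand_query({"hepatitis vaccines", "breast cancer/diagnosis"}, rules)
print(sorted(auto))  # ['hepatitis']
print(sorted(ask))   # ['mammography']
```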
References
[1] Abad Garcia F et al. A comparative study of six European databases of medically oriented Web
resources. J Med Libr Assoc. 2005;93(4):467-479.
[2] Douyère M et al. Enhancing the MeSH thesaurus to retrieve French online health resources in a quality-
controlled gateway. Health Info Libr J. 2004;21(4):253-261.
[3] Nelson SJ et al. Relationships in MeSH. Kluwer Publishers 2001;171-84.
[4] Soualmia LF, Darmoni SJ. Combining knowledge-based methods to refine and expand queries in
medicine. LN in Computer Science 2004;3055:243-255.
[5] Agrawal R, Srikant R. Fast algorithms for mining association rules in large databases. Very Large Data
Bases 1994;478-499.
[6] Fayyad UM et al. Advances in knowledge discovery and data mining. Am. Ass. Artificial Intelligence
Press 1996;601-611.
[7] Lakhal L et al. Efficient mining of association rules using closed itemset lattices. Information Systems
1999;24:25-46.
[8] Ananiadou S, McNaught J. Text mining for biology and biomedicine. Artech House publishers 2005.
[9] Bodenreider O et al. Non lexical approaches to identifying associative relations in the Gene ontology.
Pacific Symposium on Biocomputing 2005;10:91-102.
[10] Berardi M et al. A data mining approach to PubMed query refinement. DEXA 2004. IEEE Computer
Society; 401-405.
[11] Soualmia LF et al. Representing the MeSH in OWL: towards a semi-automatic migration. KR-Med
2004;72-80.