
Mining Knowledge from Corpora: An Application to Retrieval and Indexing

February 2008
Studies in Health Technology and Informatics 136:467-72


DOI:10.3233/978-1-58603-864-9-467

Source
PubMed

Authors:

Lina F Soualmia

Université de Rouen Normandie

Badisse Dahamna

Centre Hospitalier Universitaire Rouen

Stefan J Darmoni

Université de Rouen




Lina F. SOUALMIA, Badisse DAHAMNA, Stéfan DARMONI

LIM&Bio EA 3969, SMBH Léonard de Vinci, Paris XIII University, Bobigny, France.

CISMeF team and LITIS EA 4051, University of Rouen, France.

Abstract. The present work aims at discovering new associations between medical concepts to be exploited as input in retrieval and indexing. Material and Methods: Association rules method is applied to documents. The process is carried out on three major document categories referring to e-health information consumers: health professionals, students and lay people. Association rules evaluation is founded on statistical measures combined with domain knowledge. Results: Association rules represent existing relations between medical concepts (60.62%) and new knowledge (54.21%). Based on observations, 463 expert rules are defined by medical librarians for retrieval and indexing. Conclusions: Association rules bear out existing relations, produce new knowledge and support users and indexers in document retrieval and indexing.

Keywords. Data Analysis; Indexing; Terminology; Data Mining; MeSH.

Introduction

The Internet is a major source of biomedical knowledge. As access to structured medical information is difficult with directories or general search engines, many applications have been developed [1]. Since 1995, CISMeF (acronym of Catalog and Index of French-speaking Medical Sites) [2] has been selecting institutional and educational resources for patients, students and health professionals. It references 36,247 e-documents by using Medical Subject Headings (MeSH) [3]. Among the many sources used to support users, such as morphological bases and dynamic and contextual search tools [4], the MeSH structure is exploited. To complement these sources, we propose to mine e-documents to discover new associations between medical concepts by data mining.

1. Material and Methods

1.1. Medical Subject Headings, Resource Types and Metaterms

The MeSH thesaurus is used by the National Library of Medicine for indexing biomedical resources. Its core is a hierarchical structure that consists of sets of descriptors: at the top level general headings (e.g. diseases) and, deeper, more specific headings (e.g. brain infarction). The 2007 version contains over 24,357 main headings (e.g. hepatitis) and 83 subheadings (e.g. diagnosis). Together with a main heading, a subheading can be used to specify a particular aspect. For example, the pair [hepatitis/diagnosis] specifies the diagnosis aspect of hepatitis.

1 Corresponding Author: Lina Soualmia, LIM&Bio EA 3969, 74 rue Marcel Cachin, 93017 Bobigny, France; E-mail: lina.soualmia@gmail.com.

eHealth Beyond the Horizon – Get IT There, S.K. Andersen et al. (Eds.), IOS Press, 2008.

MeSH was originally used to index biomedical scientific articles for the MEDLINE database. In order to customize it to the field of e-health resources, resource types have been introduced [2]. CISMeF resource types are an extension of MEDLINE publication types (e.g. clinical guidelines). Each document in CISMeF is described with a set of MeSH main headings, subheadings and CISMeF resource types. Each main heading, [main heading/subheading] pair and resource type is allotted a ‘minor’ or ‘major’ weight, according to the importance of the concept it refers to in the resource. Major terms are marked by a star (*).

1.2. Data Mining

Knowledge extraction from databases, or data mining in computer science [5], consists in discovering additional information from large structured sets of data. This knowledge can be used to make predictions about new data or to explain existing data. One of the objectives of the extraction process is the generation of association rules. It proceeds in several steps: data and context preparation (selection of objects and items), extraction of frequent itemsets (compared to a minimum support threshold), generation of the most informative rules using a data mining algorithm, and finally interpretation and deduction of new knowledge [6]. An extraction context is a triplet C = (O, I, R) where O is the set of objects, I is the set of all the items and R is a binary relation between O and I.
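The extraction context and the frequent-itemset step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the document identifiers, descriptors, threshold value and function names (`support_count`, `frequent_itemsets`) are hypothetical, and the brute-force enumeration stands in for a real mining algorithm.

```python
from itertools import combinations

# Toy extraction context C = (O, I, R). The relation R is stored as
# "object -> set of items related to it". All values are illustrative.
R = {
    "d1": {"hepatitis", "diagnosis"},
    "d2": {"hepatitis", "vaccines"},
    "d3": {"hepatitis", "diagnosis", "vaccines"},
    "d4": {"influenza", "vaccines"},
}
O = set(R)                     # set of objects
I = set().union(*R.values())   # set of all items


def support_count(itemset, relation):
    """Number of objects related to every item of `itemset`."""
    return sum(1 for items in relation.values() if itemset <= items)


def frequent_itemsets(relation, minsup, max_size=2):
    """Brute-force search for itemsets whose support count reaches minsup."""
    items = sorted(set().union(*relation.values()))
    found = {}
    for size in range(1, max_size + 1):
        for combo in combinations(items, size):
            count = support_count(set(combo), relation)
            if count >= minsup:
                found[frozenset(combo)] = count
    return found


frequent = frequent_itemsets(R, minsup=2)
```

With this toy context, `influenza` occurs in a single object and is therefore discarded, whereas `{hepatitis, diagnosis}` is kept.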

1.2.1. Association Rules

A data mining system may generate several thousand or even several million frequent association rules, of which only some are interesting. An association rule is interesting if it is easily understandable by the users, valid for new data, useful, or if it confirms a hypothesis. It is expressed as $i_1 \wedge i_2 \wedge \dots \wedge i_k \rightarrow i_{k+1} \wedge \dots \wedge i_n$ and states that if an object has the items $\{i_1, i_2, \dots, i_k\}$ it tends also to have the items $\{i_{k+1}, \dots, i_n\}$.

Support represents the rule utility. It corresponds to the proportion of objects which contain both the antecedent and the consequent: Support $= |\{i_1, i_2, \dots, i_n\}|$.

Confidence represents precision and corresponds to the proportion of objects that contain the consequent of the rule among those containing the antecedent: Confidence $= |\{i_1, i_2, \dots, i_n\}| / |\{i_1, i_2, \dots, i_k\}|$. Two rule types are distinguished: exact rules, having Confidence=100%, i.e. verified in all the objects of the database, and approximative rules.
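As a worked illustration of the two measures, the following sketch computes support and confidence on a handful of toy documents. The documents and the helper names `support` and `confidence` are assumptions made for the example; support is taken here as a number of documents.

```python
def support(itemset, docs):
    """Number of documents that contain every item of `itemset`."""
    return sum(1 for d in docs if itemset <= d)


def confidence(antecedent, consequent, docs):
    """Share of the documents containing the antecedent that also contain the consequent."""
    covered = support(antecedent, docs)
    if covered == 0:
        raise ValueError("antecedent occurs in no document")
    return support(antecedent | consequent, docs) / covered


# Toy documents, each described by a set of descriptors (illustrative only).
docs = [
    {"hepatitis/prevention and control", "hepatitis vaccines"},
    {"hepatitis/prevention and control", "hepatitis vaccines"},
    {"hepatitis/prevention and control"},
    {"hepatitis vaccines", "child"},
]

antecedent = {"hepatitis/prevention and control"}
consequent = {"hepatitis vaccines"}

rule_support = support(antecedent | consequent, docs)          # 2 documents
rule_confidence = confidence(antecedent, consequent, docs)     # 2 out of 3
exact_confidence = confidence({"child"}, consequent, docs)     # 1 out of 1
```

In this toy set the first rule is approximative (confidence 2/3), while `child → hepatitis vaccines` is exact because every document containing the antecedent also contains the consequent.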

1.2.2. A-Close for Mining e-Documents

The problem of the relevance and the usefulness of extracted association rules is of primary importance because real-life databases lead to several thousand or even several million association rules whose confidence measures are high, and among which there are many redundancies, i.e. rules conveying the same information. Two bases for association rules are defined by A-Close [7]. These bases are generating sets for all valid non-redundant association rules; they are thus smaller and are composed of rules with minimal antecedents and maximal consequents, i.e. the most relevant association rules. We adapt A-Close to the case of an e-health document database by considering conceptual indexing: the set of objects O is the set of indexed documents; the set of items I is the set of MeSH descriptors; the relation R represents the indexing relation between an object and an item, i.e. between a document and a descriptor.
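The notion of closed itemset that underlies this family of algorithms can be illustrated with a small sketch. It is not an implementation of A-Close: it only shows, on hypothetical documents and with hypothetical helper names, how the closure of an itemset yields an exact rule with a small antecedent and the largest possible consequent.

```python
# Toy context: each document is the set of descriptors that index it.
# Descriptors are illustrative and not taken from the CISMeF catalogue.
docs = [
    {"trisomy", "chromosome aberrations", "prenatal diagnosis"},
    {"trisomy", "chromosome aberrations"},
    {"chromosome aberrations", "prenatal diagnosis"},
    {"amniocentesis", "prenatal diagnosis"},
]


def closure(itemset, docs):
    """Items shared by every document that contains `itemset`."""
    covering = [d for d in docs if itemset <= d]
    if not covering:
        raise ValueError("itemset occurs in no document")
    return set.intersection(*covering)


def is_closed(itemset, docs):
    """An itemset is closed when its closure adds no item."""
    return closure(itemset, docs) == set(itemset)


def exact_rule(generator, docs):
    """Rule generator -> (closure minus generator); its confidence is 1 by construction."""
    consequent = closure(generator, docs) - set(generator)
    return (set(generator), consequent) if consequent else None


rule = exact_rule({"trisomy"}, docs)
```

Here every document indexed by `trisomy` is also indexed by `chromosome aberrations`, so the sketch returns the exact rule `trisomy → chromosome aberrations`; `prenatal diagnosis` is already closed and produces no rule.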

1.2.3. Processing Collections of Documents

End-users are categorised in CISMeF into three main types: health professionals, medical students, and patients and lay people. Rather than extracting knowledge referring to the main medical specialties as in [4], we consider the three major resource types guidelines*, education* and patients* and two kinds of itemsets: the set of major main headings (MH*) and the set of major [main heading/subheading] pairs (MH/SH*).

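Table 1. Description of the collections of documents.

Resources | Documents | Items | Min | Max | Mean
Guidelines* | 2,727 | MH* | 1 | 64 | 5.21
Guidelines* | 2,727 | MH/SH* | 1 | 70 | 6.12
Patients* | 3,272 | MH* | 0 | 25 | 1.63
Patients* | 3,272 | MH/SH* | 0 | 30 | 1.82
Education* | 3,610 | MH* | 0 | 25 | 2.22
Education* | 3,610 | MH/SH* | 0 | 34 | 2.73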

2. Results

2.1. Mining e-Documents

For all contexts, the minimum support was fixed to minsup=20 and the minimum confidence to minconf=70% for the approximative association rules (Table 2). We obtain association rules between major MH* (resp. MH/SH* pairs). For the major resource types patients* and education*, all (100%) association rules are between two MH* (resp. MH/SH* pairs), i.e. one descriptor in the antecedent and one descriptor in the consequent. For guidelines*, 24% of the rules are between more than two descriptors. Characteristics of the documents may explain these results: the average number of descriptors ranges from 1.63 to 2.22 for patients* and education*, whereas it ranges from 5.21 to 6.12 for guidelines*.
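The thresholding and the distinction between exact rules, approximative rules and pairs can be sketched as follows. The `Rule` structure, the helper names and the candidate rules are hypothetical, and reading minsup=20 as a number of documents is an assumption of this example.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    antecedent: frozenset
    consequent: frozenset
    support: int        # documents containing antecedent and consequent
    confidence: float   # between 0 and 1


MINSUP = 20     # assumed to be a number of documents
MINCONF = 0.7   # minconf=70%


def is_valid(rule):
    """Keep rules that reach both thresholds."""
    return rule.support >= MINSUP and rule.confidence >= MINCONF


def rule_type(rule):
    """Exact rules have confidence 1; the other valid rules are approximative."""
    return "exact" if rule.confidence == 1.0 else "approximative"


def is_pair(rule):
    """A pair has one descriptor in the antecedent and one in the consequent."""
    return len(rule.antecedent) == 1 and len(rule.consequent) == 1


# Invented candidates, used only to exercise the filters.
candidates = [
    Rule(frozenset({"a"}), frozenset({"b"}), 25, 1.0),
    Rule(frozenset({"a", "c"}), frozenset({"d"}), 40, 0.8),
    Rule(frozenset({"e"}), frozenset({"f"}), 12, 0.9),   # below minsup
    Rule(frozenset({"g"}), frozenset({"h"}), 30, 0.5),   # below minconf
]

valid = [r for r in candidates if is_valid(r)]
summary = [(rule_type(r), is_pair(r)) for r in valid]
```

Two of the four invented candidates survive: one exact rule between two descriptors and one approximative rule involving more than two descriptors.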

Table 2. Number of rules, exact rules (ER), approximative rules (AR) and pairs.

 | Context: item=MH* |  |  |  | Context: item=[MH/SH]* |  |  | 
Resources | Rules | ER (Conf=1) | AR (Conf≥0.7) | Pairs | Rules | ER (Conf=1) | AR (Conf≥0.7) | Pairs
Guidelines* | 50 | (24%) | (76%) |  |  | (20.51%) | (79.49%) | (76%)
Patients* | 20 | (45%) | (55%) | (100%) |  | (42.1%) | (57.9%) | (100%)
Education* | 23 | (26.09%) | (73.91%) | (100%) |  | (52%) | (48%) | (100%)

Another experiment is carried out in the context of documents with the resource type guidelines* to obtain more complete association rules: we consider the descriptors MH and MH/SH pairs regardless of their minor or major weight. The documents are composed of an average of 12 descriptors, with a minimum of 1 and a maximum of 301 (Table 3). As A-Close works on databases with a maximum of 12 items, we have added a constraint on the number of descriptors. To avoid long generation times and to obtain interpretable association rules, we added the maximum size of the closed itemsets as a new parameter of the algorithm.

Table 3. Description of the documents of the Guidelines* collection.

Docs | Items | Min | Max | Mean
Guidelines* 2,727 | MH | 1 | 111 | 10.08
Guidelines* 2,727 | MH/SH | 1 | 301 | 13.54

We obtain a high number of association rules with a minimum support threshold minsup=20 and a minimum confidence threshold minconf=70% (Table 4), but only 0.95% (respectively 1.92%) are between two MH (respectively between two MH/SH pairs). By reducing the confidence from 1 to 0.7, the number of rules between MH (respectively between MH/SH) grows by a factor of 5 (respectively 4.42).

Table 4. Association rules between MH and MH/SH in the context Guidelines*.

Items | Rules | ER (Conf=1) | AR (Conf≥0.7) | Pairs
MH | 35,454 | 6,990 (19.71%) | 28,464 (80.29%) | 338 (0.95%)
MH/SH | 27,011 | 6,102 (22.6%) | 20,909 (77.4%) | 520 (1.92%)

2.2. Association Rules Evaluation

As defined, an interesting association rule confirms a hypothesis or states a new hypothesis [6]. We propose here to combine background domain knowledge with the simple statistical measures traditionally used in association rule mining for evaluation. We consider several cases of interesting association rules according to the relations between MeSH descriptors [3]. As these relations are defined between two main headings and between two subheadings, we consider only the association rules between two elements. Hence, an interesting existing association rule could associate: an (in)direct son and its father (FS); two descriptors that belong to the same hierarchy (same (in)direct father) (B); two descriptors with a See Also relation (SA). These rules are automatically classified thanks to the MeSH structure. The other rules that satisfy minsup and minconf are then considered as «new» interesting association rules.

Exact association rules, except for the patients* collection, are mostly new interesting rules: from 62.5% to 99.86%. Existing rules are therefore mainly from the patients* collection: 77.77% for MH* and 75% for MH/SH*. Approximative rules, except for the guidelines* collection with items MH and MH/SH pairs, are mostly existing interesting rules: from 58.07% to 78.73%. New interesting rules are between MH and MH/SH from the guidelines* collection: 99.73% for MH and 99.52% for MH/SH.

Table 5. Association rules evaluation according to MeSH structure.

Resources | Items | Exact rules (Confidence=1): existing knowledge FS, B, SA, then New; approximative rules (Confidence≥0.7): existing knowledge FS, B, SA, then New
Guidelines* | MH* | 33.33%  66.67%  5.26%  18.42%  26.31%  31.57%
Guidelines* | MH/SH* | 12.5%  62.5%  9.67%  29.03%  41.93%
Guidelines* | MH | 0.03%  0.14%  6,980 (99.86%)  0.04%  0.13%  0.10%  28,382 (99.73%)
Guidelines* | MH/SH | 0.1%  0.06%  0.11%  6,085 (99.73%)  0.12%  0.23%  0.13%  20,807 (99.52%)
Patients* | MH* | 55.55%  22.22%  18.18%  36.36%  27.27%
Patients* | MH/SH* | 62.5%  12.5%  25%  18.18%  27.27%  36.36%
Education* | MH* | 16.66%  66.67%  11.76%  35.29%  17.64%  35.29%
Education* | MH/SH* | 7.69%  87.62%  16.76%  25%  16.76%  41.66%

Subjective interest measures are based on the expert knowledge about the data, i.e. here the medical librarian. New interesting rules for the contexts MH* and MH/SH* pairs are evaluated manually. 93.75% (resp. 84.78%) of the interesting new rules with confidence=1 (resp. confidence≥0.7) between major descriptors are validated.

pain/drug therapy → opioids analgesics/administration and dosage
breast cancer/diagnosis → mammography
aids/prevention and control → condom
influenza vaccines → influenza/prevention and control
Turner syndrome  child  human growth hormone  growth disorders
obstetric delivery pregnancy
prostate cancer/surgery biopsy  prostatectomy
amniocentesis  prenatal diagnosis  chorionic villi sampling

Figure 1. Some examples of new interesting rules validated by the expert.

2.3. Indexing Correction and Expert Rules

Documents are manually indexed and, according to the indexing policy, the most precise descriptor should be used, i.e. the one at the lowest level in the hierarchy. However, 1, documents contain descriptors that have a father-son relation and 478 documents are indexed by subheadings that have a relation while associated to the same keyword. For example, a document is indexed by trisomy and chromosome aberrations, whereas trisomy is a chromosome aberration. This may explain the proportion of existing associations. Corrections should be proposed to the indexers.

The main outcome of the association rule extraction and evaluation experiments is the modeling and formalisation of rules between [main heading/subheading] pairs based on observations. The pattern of the rule hepatitis/prevention and control → hepatitis vaccines is used to model dysentery bacillary/prevention and control → shigella vaccines. 463 rules are modeled. Formalization concerns different cases and contexts for retrieval and indexing. The rule MH1/SH → MH2 states that MH1/SH should be replaced by the main heading MH2. For example abdomen/radiography → radiography abdominal. The rule MH1/SH → MH2/SH states that the pair MH2/SH should be added to the pair MH1/SH. The rule appendectomy → appendicitis/surgery states that the pair appendicitis/surgery should be added to queries (or to document description when indexing) containing the main heading appendectomy.
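A minimal sketch of how such expert rules might be applied to a query or to a document description is given below. The two rule tables reuse the examples quoted in this section; the data layout and the function name `apply_expert_rules` are assumptions, not the CISMeF implementation.

```python
# "Replace" rules: a [main heading/subheading] pair is replaced by a main heading.
REPLACE_RULES = {
    "abdomen/radiography": "radiography abdominal",
}

# "Add" rules: when the main heading is present, the listed pairs are added.
ADD_RULES = {
    "appendectomy": ["appendicitis/surgery"],
}


def apply_expert_rules(terms):
    """Return the terms after replacement rules, then addition rules, without duplicates."""
    result = []
    for term in terms:
        term = REPLACE_RULES.get(term, term)
        if term not in result:
            result.append(term)
    for term in list(result):
        for extra in ADD_RULES.get(term, []):
            if extra not in result:
                result.append(extra)
    return result


expanded_query = apply_expert_rules(["appendectomy"])
corrected_indexing = apply_expert_rules(["abdomen/radiography", "child"])
```

The same function serves both uses described in this section: enriching a query that contains `appendectomy`, and normalising a document description that contains `abdomen/radiography`.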

3. Discussion and Future Work

There is increasing activity in text mining in the genomic domain [8]. In [9], co-occurrences between Gene Ontology terms are analyzed and association rules are mined to identify pairs of related GO terms. Association rules are more complete than co-occurrence measures between pairs of concepts, but one of the challenging issues is the overabundance of associations that may be discovered, as in [10]. A-Close generates all the valid non-redundant association rules composed of minimal antecedents and maximal consequents. Evaluation proceeds in two steps: first, the selection of the most informative rules and, second, the classification of the rules according to the MeSH taxonomy structure to filter existing associations. Only the most frequent rules that are not classified are presented to the expert for a final evaluation. This method combines statistical measures and background domain knowledge.
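The classification step can be sketched on a toy hierarchy. The `PARENT` and `SEE_ALSO` tables below are illustrative stand-ins rather than actual MeSH content, each term is given a single parent for simplicity (MeSH allows several), and the order in which the cases are tested is an assumption.

```python
# Toy hierarchy: child -> direct parent (illustrative, single parent per term).
PARENT = {
    "trisomy": "chromosome aberrations",
    "chromosome aberrations": "diseases",
    "hepatitis": "diseases",
    "mammography": "diagnostic techniques",
}

# Toy See Also relations, stored as unordered pairs.
SEE_ALSO = {frozenset({"hepatitis", "hepatitis vaccines"})}


def ancestors(term):
    """Direct and indirect fathers of `term`, nearest first."""
    found = []
    while term in PARENT:
        term = PARENT[term]
        found.append(term)
    return found


def classify(a, b):
    """Label a rule between two descriptors as FS, B, SA or New."""
    anc_a, anc_b = ancestors(a), ancestors(b)
    if a in anc_b or b in anc_a:
        return "FS"    # (in)direct son and its father
    if set(anc_a) & set(anc_b):
        return "B"     # same (in)direct father
    if frozenset({a, b}) in SEE_ALSO:
        return "SA"    # See Also relation
    return "New"       # candidate for expert evaluation


labels = [
    classify("trisomy", "chromosome aberrations"),
    classify("trisomy", "hepatitis"),
    classify("hepatitis", "hepatitis vaccines"),
    classify("breast cancer", "mammography"),
]
```

Only the rules labelled `New` would then be passed on to the expert, which is the filtering role the structure plays in the evaluation.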

Association rules are used in retrieval for query expansion (automatic and interactive) and for enriching users’ queries with new knowledge [4]. As exact rules (respectively approximative rules) state that the antecedent and the consequent occur together in all (respectively some) documents, these rules should be used in automatic (respectively interactive) query expansion. However, these expansions work only in the case of queries that return documents. Association rules link conceptual structures of the documents, i.e. descriptors organised in hierarchies on which it is possible to perform specialization and generalization. We plan to generate generalized association rules and to examine how our approach works with other data collections such as MEDLINE. Association rules and expert rules can be translated into automata for the automatic indexing of raw text documents. Finally, formalised association rules could improve the power of reasoning based on MeSH-OWL [11].

References

[1] Abad Garcia F et al. A comparative study of six European databases of medically oriented Web resources. J Med Libr Assoc. 2005;93(4):467-479.

[2] Douyère M et al. Enhancing the MeSH thesaurus to retrieve French online health resources in a quality-controlled gateway. Health Info Libr J. 2004;21(4):253-261.

[3] Nelson SJ et al. Relationships in MeSH. Kluwer Publishers 2001;171-84.

[4] Soualmia LF, Darmoni SJ. Combining knowledge-based methods to refine and expand queries in medicine. LN in Computer Science 2004;3055:243-255.

[5] Agrawal R, Srikant R. Fast algorithms for mining association rules in large databases. Very Large Data Bases 1994;478-499.

[6] Fayyad UM et al. Advances in knowledge discovery and data mining. Am. Ass. Artificial Intelligence Press 1996;601-611.

[7] Lakhal L et al. Efficient mining of association rules using closed itemset lattices. Information Systems 1999;24:25-46.

[8] Ananiadou S, Mc Naught J. Text mining for biology and biomedicine. Artech House publishers 2005.

[9] Bodenreider O et al. Non lexical approaches to identifying associative relations in the Gene ontology. Pacific Symposium on Biocomputing 2005;10:91-102.

[10] Berardi M et al. A data mining approach to PubMed query refinement. DEXA 2004. IEEE Computer Society; 401-405.

[11] Soualmia LF et al. Representing the MeSH in OWL: towards a semi-automatic migration. KR-Med 2004;72-80.
