2012 9th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD 2012)
Automatic Construction of a Domain-independent
Knowledge Base from Heterogeneous Data Sources
Mohammed Maree, Saadat M. Alhashmi
School of Information Technology
Monash University, Sunway Campus
Kuala Lumpur, Malaysia
Mohammed Belkhatir
Lyon Institute of Technology
University of Lyon I
Lyon, France
Andre Hawit
Mixberry Media Inc.
Burlingame, CA,
USA
Abstract— Manual construction and maintenance of general-purpose
knowledge bases form a major limiting factor in
their full adoption, use, and reuse in practical settings. In this
paper, we present KnowBase, a system for automatic knowledge
base construction from heterogeneous data sources including
domain-specific ontologies, general-purpose ontologies, plain
texts, and image and video captions, which are automatically
extracted from WebPages. In our approach, several information
extraction techniques are integrated to automatically create,
enrich, and keep the knowledge base up to date. Consequently,
knowledge represented by the produced knowledge base can be
employed in several application domains. In our experiments, we
used the produced knowledge base as an external resource to
align heterogeneous ontologies from the environmental and
agricultural domains. The produced results demonstrate the
effectiveness of the used knowledge base in finding corresponding
entities between the used ontologies.
Keywords- Knowledge Base; Information Extraction; Pattern
Acquisition; Merging; Heterogeneous Data Sources; Experimental
Validation
I. INTRODUCTION
Recently, several approaches have been proposed to
automatically build knowledge bases. These approaches either
rely on a single data source such as Wikipedia (http://www.wikipedia.org/) to create the
knowledge base [1] or build knowledge bases for specific
domains such as the medical, biomedical and terrorism
domains [2][3][4]. In this paper, we present KnowBase, a
system for automatic knowledge base construction from
heterogeneous data sources including domain-specific
ontologies, general-purpose ontologies, plain texts, and image
and video captions, which are automatically extracted from
WebPages. To obtain domain-specific ontologies, we use the
Swoogle semantic Web search engine [5]. For each domain, we
submit queries including keywords that are related to that
domain and download the returned ontologies. We obtain
general-purpose ontologies from online ontology repositories
and libraries on the Web. Plain texts are automatically
extracted from the Web using a Web crawler. The extracted
texts from relevant websites are then processed using several
Natural Language Processing (NLP) techniques. To identify
and extract image and video captions from WebPages, we use
the DOM Tree-based webpage segmentation algorithm that is
proposed in [6].
Figure 1. Example of the Output of the DOM Tree-based Webpage Segmentation Algorithm
The segmentation process is based on the Document Object
Model (DOM) Tree structure of WebPages. In this algorithm,
multimedia documents on the Web are classified into three
categories: Listed, Semi-listed, and Unlisted documents. For
every extracted multimedia document, the segmentation
method only searches the surrounding region, making it more
efficient and scalable for large websites that contain a huge
amount of multimedia documents. As shown in Figure 1,
given a webpage as input, the algorithm processes the DOM
tree of the webpage and extracts the segments as output.
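As a rough illustration of caption extraction from the DOM tree (this is not the algorithm of [6], which additionally classifies documents as Listed, Semi-listed, or Unlisted), the following minimal Java sketch uses the jsoup HTML parser to collect, for each image element, the text found in its immediately surrounding region; class names and the sample HTML are assumptions made for the example.

// CaptionSketch.java: a simplified illustration of extracting an image's surrounding
// text from the DOM tree. It is not the segmentation algorithm of [6]; it only looks
// at a fixed number of ancestor levels around each <img> element.
// Requires the jsoup library (org.jsoup) on the classpath.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class CaptionSketch {

    // Returns the text of the closest ancestor region that contains some text,
    // climbing at most maxLevels levels up the DOM tree.
    static String surroundingText(Element img, int maxLevels) {
        Element region = img.parent();
        for (int level = 0; region != null && level < maxLevels; level++) {
            String text = region.text().trim();
            if (!text.isEmpty()) {
                return text;
            }
            region = region.parent();
        }
        return "";
    }

    public static void main(String[] args) {
        String html = "<html><body><div class='figure'>"
                    + "<img src='tree.jpg' alt='An olive tree'>"
                    + "<p>Figure: An olive tree in an agricultural field.</p>"
                    + "</div></body></html>";
        Document doc = Jsoup.parse(html);
        for (Element img : doc.select("img")) {
            System.out.println("alt text : " + img.attr("alt"));
            System.out.println("caption  : " + surroundingText(img, 3));
        }
    }
}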
Due to the heterogeneity of the used resources, different
information extraction techniques are required. These
techniques are integrated to build a coherent structure of the
knowledge base. For instance, to extract facts from plain texts,
and image and video captions, we utilize several NLP,
statistical-based and named-entity recognition techniques. On
the other hand, we employ other extraction and merging
techniques to populate the knowledge base with knowledge
triples (entity-relation-concept), which are obtained from the
downloaded domain-specific and general-purpose ontologies. In this
context, an entity refers to a concept or an instance of a concept.
Details of these techniques are presented in Section 3A. The
main contributions of our work are summarized as follows:
- Exploiting heterogeneous data sources for automatic knowledge base construction.
- Combining several information extraction techniques and aggregating their output to construct and enrich the knowledge base.
The rest of this paper is organized as follows. Section 2
presents the details of some other knowledge base construction
and population systems. The theoretical framework for
automatic construction and update of domain-independent
knowledge bases is presented in Section 3. In this section, we
also present the details of the methods that are used in the
proposed system. Section 4 discusses the experiments that
were carried out to construct the knowledge base, as well as to
exploit the constructed knowledge base in aligning
heterogeneous ontologies. Section 5 presents the conclusions
and outlines the future work.
II. RELATED WORK
In this section, we discuss the related work in terms of two
different aspects: (a) knowledge base construction approaches
and (b) issues related to automatically identifying and
extracting entities and facts from heterogeneous data sources.
A. Knowledge Base Construction
Knowledge base construction has always been at the heart
of the Semantic Web (SW) technology. However, with the
continuous expansion of the Web, this task has become more
difficult [7]. To address this issue, several knowledge base
construction systems have been proposed [8, 9, 10, 11]. Some
of these systems rely on human input to enrich the knowledge
base and keep it up-to-date. Examples of the knowledge
bases produced by such systems are Freebase (http://www.freebase.com/) and
True Knowledge (http://www.trueknowledge.com/). On the other hand, other systems rely on a single
data source to create the knowledge base or create knowledge
bases that are related to a particular domain. For instance,
Geifman and Rubin proposed to model and store knowledge
about age-related phenotypic patterns and events in an Age-
Phenome Knowledge Base (APK) [3]. Another example is the
terrorism knowledge base, which contains all relevant
knowledge about terrorist groups, their members, leaders,
affiliations, and full descriptions of specific terrorist events [4].
This knowledge base was integrated into Cyc [12], which is a
general-purpose ontology that captures knowledge from
multiple domains. In our approach, we not only aim at avoiding
the effort required by users to manually maintain and update
the knowledge base, but also at populating and extending the
knowledge base from heterogeneous data sources.
B. Information Extraction
The task of automatically identifying and extracting
entities and facts from heterogeneous data sources is a
prevalent problem. For each data source, we need to identify
and utilize different information extraction methods. The
ultimate goal of these methods is to automatically construct
and populate the knowledge base with entities, facts (these are
automatically extracted from plain texts and image and video
captions from WebPages), and knowledge triples, which are
automatically extracted from online ontologies. For instance,
to extract information from plain texts on the Web,
TextRunner [13] employs heuristics to produce extractions in
the form of a tuple t = (ei, ri,j , ej), where ei and ej are strings
meant to denote entities, and ri,j is a string meant to denote a
relationship between them. Similar systems are GRAZER [14]
and KnowItAll [15]. In GRAZER, the inputs are seed facts for
given entities, which are automatically generated using
specialized wrappers. KnowItAll used bootstrapping to extract
patterns and facts simultaneously from text [14]. The relevant
pages are retrieved from a search engine via a query composed
of keywords in a pattern. The initial seed set contains a few
hand-generated patterns. In our approach, we combine several
NLP, statistical-based and named-entity recognition
techniques to extract facts, entities and their attributes from
textual information on the Web.
Another data source that can be exploited for knowledge
base construction is online ontologies. For example, the authors
of [16] propose to construct ontologies from online domain-
specific ontologies. To do this, they submit queries to the Swoogle
SW search engine and download relevant ontologies from the
list of the returned results. In addition, they employ ranking and
segmentation techniques to rank the returned ontologies and
extract segments from them. To address the issue of overlap
between the extracted segments, the authors propose to merge
overlapping segments into a single representation. We build on
this work to automatically download domain-specific and
general-purpose ontologies from the Swoogle SW search engine
and other ontology repositories and libraries on the Web.
Considering domain-specific ontologies, we merge them using
the merging techniques proposed in our previous work [17]. In
this context, for each domain of interest, we merge its relevant
ontologies using semantic-, name- and statistical-based ontology
merging techniques. Then, we extract knowledge triples from
the merged domain-specific ontologies and other general-
purpose ontologies to populate the knowledge base.
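As an illustration of how such knowledge triples (entity-relation-concept) might be pulled out of a downloaded OWL/RDF ontology, the sketch below iterates over RDF statements with Apache Jena. This is our own simplification, not the extraction pipeline used by KnowBase, and the file name "ontology.owl" is a placeholder.

// TripleExtractor.java: illustrative extraction of (entity, relation, concept) triples
// from a downloaded ontology file using Apache Jena. This is a simplified sketch,
// not the actual KnowBase extraction pipeline; "ontology.owl" is a placeholder path.
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Statement;
import org.apache.jena.rdf.model.StmtIterator;

public class TripleExtractor {
    public static void main(String[] args) {
        Model model = ModelFactory.createDefaultModel();
        model.read("ontology.owl");   // reads RDF/XML (or Turtle, etc.) from the given path/URL

        StmtIterator it = model.listStatements();
        while (it.hasNext()) {
            Statement stmt = it.nextStatement();
            String subject = stmt.getSubject().getLocalName();
            String relation = stmt.getPredicate().getLocalName();
            // Only resource objects are kept here; literals (labels, comments) would be
            // handled separately, e.g. as lexical information about the entity.
            if (stmt.getObject().isResource()) {
                String object = stmt.getObject().asResource().getLocalName();
                System.out.println(subject + " - " + relation + " - " + object);
            }
        }
    }
}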
III. THEORETICAL BASES AND SYSTEM DESCRIPTION
A knowledge base is a repository of facts about entities and
their relationships. A fact is a triple with an entity-relation-entity
structure. Entities are related through different
types of semantic and taxonomic relations. These relations are
automatically extracted from multiple heterogeneous data
sources such as plain texts, image and video captions, and
online ontologies. A detailed discussion of these data sources
and the techniques that we use in our system is given in the
following sections.
A. Knowledge Base Construction from Online Domain-
specific Ontologies
Among the data sources that we exploit to construct the
knowledge base are online domain-specific ontologies. A
domain-specific ontology can be formally defined as:
Definition 1: A domain-specific ontology Ω is a 4-tuple ⟨C, R, I, A⟩ where:
- C = {(ci), i ∈ [1, Card(C)]} represents the set of domain concepts of the ontology.
- R = {(ri), i ∈ [1, Card(R)]} represents the set of semantic relations holding between the ontology concepts.
- I is the set of instances or individuals.
- A is the set of axioms verifying A = {(ri, cj, ck)} s.t. i ∈ [1, Card(R)], j, k ∈ [1, Card(C)], cj, ck ∈ C and ri ∈ R.
We associate to the ontology a logic-based translation towards a
predicate calculus PC, based on C, R, I, A and a set of predicate
symbols P ⊇ C ∪ R. Predicates linked to C are monadic while
those linked to R are dyadic.
Φo: C ∪ A ∪ {Ω} → PC
Φo(c) = c(xc), where xc ∈ I
Φo(a) = r(ci, cj), where r ∈ R and ci, cj ∈ C
Φo(Ω) = ⋀_{ci ∈ C} Φo(ci) ∧ ⋀_{ai ∈ A} Φo(ai)
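For illustration (this toy ontology is ours and is not taken from the downloaded ontologies), let C = {Organization, Corporate_Body}, R = {is-a}, I = {x1}, and A = {(is-a, Corporate_Body, Organization)}. The translation then yields:
Φo(Organization) = Organization(x1)
Φo(Corporate_Body) = Corporate_Body(x1)
Φo((is-a, Corporate_Body, Organization)) = is-a(Corporate_Body, Organization)
Φo(Ω) = Organization(x1) ∧ Corporate_Body(x1) ∧ is-a(Corporate_Body, Organization)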
To address the semantic heterogeneity problem (i.e.
conceptual and terminological differences between domain-
specific ontologies), we employ the merging techniques
proposed in [17]. Formally, a merging algorithm can be defined
as:
Definition 2: Merging: Given two domain-specific ontologies
Ω1 and Ω2, the merging operation finds semantic
correspondences between their concepts and produces a single
merged ontology Ωmerged as output. Semantic correspondences
between both ontologies are 4-tuples ⟨Cid, Ci, Cj, r⟩ such that:
- Cid is a unique identifier of the correspondence.
- Ci ∈ Ω1, Cj ∈ Ω2 are corresponding concepts of the input ontologies.
- r ∈ R is a semantic relation holding between both elements Ci and Cj.
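A minimal data-structure sketch of such a correspondence is shown below. The class, field names, and the naive label-equality matcher are illustrative assumptions; they are not the merging techniques of [17].

// Correspondence.java: a minimal sketch of the 4-tuple <Cid, Ci, Cj, r> of Definition 2.
// The naive label-equality matcher is a stand-in, not the merging approach of [17].
import java.util.ArrayList;
import java.util.List;

public class Correspondence {
    final String id;       // Cid: unique identifier of the correspondence
    final String conceptI; // Ci: concept label from ontology O1
    final String conceptJ; // Cj: concept label from ontology O2
    final String relation; // r: semantic relation, e.g. "equivalent-to" or "is-a"

    Correspondence(String id, String ci, String cj, String r) {
        this.id = id; this.conceptI = ci; this.conceptJ = cj; this.relation = r;
    }

    // Placeholder matcher: aligns concepts whose normalized labels are identical.
    static List<Correspondence> matchByLabel(List<String> onto1, List<String> onto2) {
        List<Correspondence> out = new ArrayList<>();
        int id = 0;
        for (String ci : onto1) {
            for (String cj : onto2) {
                if (ci.replace('_', ' ').equalsIgnoreCase(cj.replace('_', ' '))) {
                    out.add(new Correspondence("c" + (id++), ci, cj, "equivalent-to"));
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> o1 = List.of("Corporate_Body", "Organization");
        List<String> o2 = List.of("organization", "Social_Group");
        for (Correspondence c : matchByLabel(o1, o2)) {
            System.out.println(c.id + ": " + c.conceptI + " " + c.relation + " " + c.conceptJ);
        }
    }
}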
To represent the knowledge base we use an Entity-
Relationship graph. Such a graph connects different entities E
from multiple domains through different types of relations R.
In this context, a finite set of entity labels is represented by
the graph nodes and directed links between those nodes are
used to represent R. This can be formally represented as
follows:
Definition 3: Knowledge Base: A knowledge base KB is a
structure KB := (CKB, RKB, IKB) where:
- CKB is the set of concepts that are defined in KB. Generally, CKB := {C ∪ Cmiss}, where Cmiss is the set of concepts that are not defined in Ω but are defined in KB. This is due to the fact that KB covers information across multiple domains and is not limited to a particular domain as Ω is. However, we may find that C includes concepts {C1, C2, . . . , Cn} which are not defined in CKB. For this particular case, we utilize statistical-based techniques to enrich KB with the set of concepts {C1, C2, . . . , Cn}.
- RKB is the set of semantic and taxonomic relations that are defined to relate the concepts in CKB.
- IKB is the set of instances of the concepts that are defined in KB.
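As a rough illustration of how such an Entity-Relationship graph could be held in memory, the following sketch stores concepts, instances, and labelled relations. It is our own simplification, not the KnowBase data model, and all class and method names are assumptions.

// KnowledgeGraph.java: a simplified in-memory Entity-Relationship graph for the
// structure (CKB, RKB, IKB) of Definition 3. Illustrative sketch only.
import java.util.*;

public class KnowledgeGraph {
    // Directed, labelled edges: source entity -> (relation label -> target entities)
    private final Map<String, Map<String, Set<String>>> edges = new HashMap<>();
    private final Set<String> concepts = new HashSet<>();   // CKB
    private final Set<String> instances = new HashSet<>();  // IKB
    private final Set<String> relations = new HashSet<>();  // RKB (relation labels)

    public void addConcept(String c)  { concepts.add(c); }
    public void addInstance(String i) { instances.add(i); }

    // Adds a knowledge triple (entity - relation - entity) to the graph.
    public void addTriple(String source, String relation, String target) {
        relations.add(relation);
        edges.computeIfAbsent(source, k -> new HashMap<>())
             .computeIfAbsent(relation, k -> new HashSet<>())
             .add(target);
    }

    public boolean isDefined(String entity) {
        return concepts.contains(entity) || instances.contains(entity);
    }

    public Set<String> related(String source, String relation) {
        return edges.getOrDefault(source, Map.of())
                    .getOrDefault(relation, Set.of());
    }

    public static void main(String[] args) {
        KnowledgeGraph kb = new KnowledgeGraph();
        kb.addConcept("Organization");
        kb.addConcept("Corporate Body");
        kb.addTriple("Corporate Body", "is-a", "Organization");
        System.out.println(kb.related("Corporate Body", "is-a")); // [Organization]
    }
}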
B. Knowledge Base Construction from Plain Texts
In our approach, we divide plain texts into two categories.
The first category consists of plain texts that are automatically
extracted from Web documents. These are extracted using a
Web crawler. The second category consists of image and
video captions, which are used to describe the content of these
types of multimedia documents on the Web. As discussed in
Section 1, in order to identify and extract the captions of such
multimedia documents we use the DOM Tree-based Webpage
segmentation algorithm that is proposed in [6]. To process
texts from both categories we first utilize several NLP
techniques such as stopword removal [18], tokenization [19],
and Part Of Speech (POS) tagging [20]. Then, we extract
named entities through employing GATE [21], which is a
syntactical pattern matching entity recognizer enriched with
gazetteers. Although the coverage of GATE is limited to a
certain number of named-entities, additional rules can be
defined in order to expand its coverage. However, the process
of manually enriching GATE’s rules can be a difficult and
time-consuming task. Therefore, we exploit the constructed
knowledge base as a supplementary source of named-entity
recognition. In this context, entities that are not defined in
GATE are submitted to the knowledge base to find whether
they are defined in it or not. After extracting named entities
from texts, we utilize a statistical-based semantic relatedness
measure to compute the degree of semantic relatedness
between the extracted entities. This measure is based on the
Normalized Retrieval Distance (NRD) function [17]. This
function is formally defined as follows:
Definition 4: Normalized Retrieval Distance (NRD): is an
adapted form of the Normalized Google Distance (NGD) [22]
function that measures the semantic relatedness between pairs
of entities (such as concepts or instances): Given two entities
E_miss and E_in, the Normalized Retrieval Distance between
E_miss and E_in can be obtained as follows:
NRD(E_miss, E_in) = (max{log f(E_miss), log f(E_in)} - log f(E_miss, E_in)) / (log M - min{log f(E_miss), log f(E_in)})   (1)
where:
- E_miss is an entity that is recognized by GATE but not defined in the knowledge base, KB.
- E_in is an entity that exists in the knowledge base, KB.
- f(E_miss) is the number of hits for the search term E_miss.
- f(E_in) is the number of hits for the search term E_in.
- f(E_miss, E_in) is the number of hits for the search terms E_miss and E_in together.
- M is the number of WebPages indexed by the search engine.
Unlike the NGD function, the NRD function returns
different semantic relatedness measures according to several
search engines (Google, AltaVista, Yahoo!). Therefore, we
sum up all NRD values for each candidate entity. This
summation represents an aggregated decision made by several
search engines on the degree of semantic relatedness between
the entities E_miss and E_in. The returned semantic relatedness
measures indicate whether or not two entities are strongly
related, based on a threshold value v = 0.5. Therefore,
entities with semantic relatedness measures > v will be
considered for further processing by the Semantic Relation
Extractor (SRE) function. This function takes as input pairs of
entities with strong semantic relatedness measures and
produces as output the suggested semantic relation(s) between
them based on a set of pre-defined lexico-syntactic patterns.
Details on this function can be found in [17].
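The following sketch shows how equation (1) and the aggregation over several search engines could be computed from raw hit counts. It is a minimal illustration: the hit counts are hypothetical stand-ins, and in practice they would be retrieved from each search engine rather than hard-coded.

// NrdCalculator.java: illustrative computation of the Normalized Retrieval Distance
// (equation 1) and its aggregation over several search engines. The hit counts in
// main() are hypothetical; real values would come from the search engines.
public class NrdCalculator {

    // fMiss: hits for E_miss; fIn: hits for E_in; fBoth: hits for both terms;
    // indexed: number of Web pages indexed by the search engine (M).
    static double nrd(double fMiss, double fIn, double fBoth, double indexed) {
        double logMiss = Math.log(fMiss);
        double logIn = Math.log(fIn);
        double logBoth = Math.log(fBoth);
        double numerator = Math.max(logMiss, logIn) - logBoth;
        double denominator = Math.log(indexed) - Math.min(logMiss, logIn);
        return numerator / denominator;
    }

    public static void main(String[] args) {
        // Assumed hit counts for one entity pair on three different search engines:
        // {f(E_miss), f(E_in), f(E_miss, E_in), M}
        double[][] hits = {
            {4.2e6, 9.8e8, 2.1e6, 5.0e10},
            {3.7e6, 7.5e8, 1.8e6, 3.0e10},
            {5.1e6, 8.9e8, 2.4e6, 4.0e10}
        };
        double sum = 0.0;
        for (double[] h : hits) {
            sum += nrd(h[0], h[1], h[2], h[3]);
        }
        System.out.println("Aggregated NRD = " + sum);
        // Entity pairs whose aggregated relatedness passes the threshold (v = 0.5 in the
        // paper) are then handed to the Semantic Relation Extractor (SRE) of [17].
    }
}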
C. Automatic Knowledge Base Enrichment
Enrichment of the knowledge base is performed at two
parallel stages. In the first stage, entities (concepts or their
instances) of the merged domain-specific ontologies and other
general-purpose ontologies are submitted to the knowledge
base. If any of the entities does not exist in the knowledge
base, then it is transferred to the second stage, wherein
the entity is automatically extracted and added to the Entity-
Relationship graph of the knowledge base based on its
context. A context represents all concepts that exist along the
semantic path(s) of each entity. To illustrate this step, we take
an example of a merged ontology that describes the
organizations domain.
Figure 2. Part of a Merged Ontology about the Organizations Domain.
Concepts are related through is-a Transitive Relation
In Figure 2, we see that the contexts of the concept
“Corporate_Body” are:
1. {“Organization”, “Body”, “Gathering”, “Psychological
Feature”, “Abstraction”, “Abstract Entity”, “Entity”}
2. {“Organization”, “Social Group”, “Group”,
“Abstraction”, “Abstract Entity”, “Entity”}
We call each of these contexts a semantic path. For instance,
to enrich the knowledge base with the entity “Corporate
Body”, we first submit it to the knowledge base to find
whether it is already defined in it or not. Assuming that this
entity is missing from the knowledge base, we traverse the
hierarchy of the semantic paths of this entity and extract the
concepts in each path. Then, in ascending order, we attempt
to find whether the parents of the missing entity exist in the
knowledge base or not. This loop is repeated until we reach
the root node in the graph. Accordingly, we consider the
following cases for enriching the knowledge base:
A) If the parent p of the missing entity e (e.g. “Corporate
Body” is a missing entity in our example) is also missing
from the knowledge base KB, then we extract the segment
of the semantic path from the merged ontology Ωmerged
that consists of the triple: e-relation-p. For example,
assuming the concept “Organization” is also missing
from KB, we extract the triple: Corporate Body is-a
Organization from Ωmerged. Then, the Entity-Relationship
graph of KB is updated by adding two new nodes
(“Corporate Body” and “Organization”) and linking them
through the (is-a) relation.
B) If the parent p of the missing entity e is already defined in
KB, then we attach e directly to p in the hierarchy of
KB. Here, it is important to mention that p might have
different meanings (senses) in KB. Therefore, it is
important to disambiguate the meaning of p before linking
it to e. To do this, we compare the contexts of the entity e
to the contexts of p in KB. Accordingly, e will be linked
to p based on the similarity between their contexts. For
instance, suppose we have the following senses for the entity
“Organization” in KB:
1. Organization: A group of people who work together.
2. Organization: An organized structure for classifying.
Figure 3. Senses of the Concept Organization in KB
We compare the context of “Organization” from KB to the
context of the same concept from Ωmerged. We find that the
most similar contexts are context No. 2 from Figure 2
({“Organization”, “Social Group”, “Group”, “Abstraction”,
“Abstract Entity”, “Entity”}) and context No. 1 from Figure 3.
Therefore, we link “Corporate Body” to the semantic path that
represents the first sense of “Organization” in KB. The result
of this step is shown in Figure 4.
Figure 4. Adding a New Concept to KB
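The enrichment procedure of cases A and B can be summarized as the following sketch. It reuses the illustrative KnowledgeGraph class shown in Section III (our own sketch, not the KnowBase implementation), and the set-overlap context similarity is an assumed stand-in for the comparison described above.

// KbEnricher.java: illustrative sketch of the enrichment loop of Section III.C.
// Assumes the KnowledgeGraph sketch given earlier; the set-overlap similarity used
// for sense disambiguation is a simplifying assumption.
import java.util.*;

public class KbEnricher {

    // Walks one semantic path (the entity followed by its ancestors up to the root)
    // and adds the missing segment of that path to the knowledge base.
    static void enrich(KnowledgeGraph kb, List<String> semanticPath) {
        for (int i = 0; i < semanticPath.size() - 1; i++) {
            String entity = semanticPath.get(i);
            String parent = semanticPath.get(i + 1);
            if (kb.isDefined(entity)) {
                break;                              // the rest of the path is already known
            }
            kb.addConcept(entity);
            kb.addTriple(entity, "is-a", parent);   // link e to its parent p
            if (kb.isDefined(parent)) {
                // Case B: p exists in KB. Before linking, the right sense of p would be
                // chosen by comparing the contexts of e and p, e.g. via contextOverlap.
                break;
            }
            // Case A: p is also missing; it is added in the next iteration together
            // with its own is-a link further up the path.
        }
        String root = semanticPath.get(semanticPath.size() - 1);
        if (!kb.isDefined(root)) {
            kb.addConcept(root);
        }
    }

    // Simple context similarity: size of the intersection of two concept sets.
    static int contextOverlap(Set<String> contextA, Set<String> contextB) {
        Set<String> common = new HashSet<>(contextA);
        common.retainAll(contextB);
        return common.size();
    }

    public static void main(String[] args) {
        KnowledgeGraph kb = new KnowledgeGraph();
        kb.addConcept("Entity");
        kb.addConcept("Group");
        // Semantic path of "Corporate Body" as read off Figure 2.
        List<String> path = List.of("Corporate Body", "Organization", "Social Group",
                                    "Group", "Abstraction", "Abstract Entity", "Entity");
        enrich(kb, path);
        System.out.println(kb.related("Corporate Body", "is-a"));      // [Organization]
        System.out.println(contextOverlap(
                new HashSet<>(path),
                Set.of("Organization", "Social Group", "Group")));     // 3
    }
}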
IV. EXPERIMENTAL RESULTS
In the following sections, we discuss experiments in terms
of two different aspects. First, we discuss the experiments that
we carried out to construct and populate the knowledge base.
Second, we experimentally demonstrate the effectiveness of
employing the knowledge that is represented by the constructed
knowledge base in real application domains. We implemented
all solutions in Java, and experiments were performed on a PC
with a dual-core 3.0 GHz CPU and 4 GB of RAM. The
operating system that was used is openSUSE 11.1.
A. Automatic Construction and Population of the Knowledge
Base
In this section we discuss the experiments that we carried
out to automatically construct and populate the knowledge
base. The sources that we used for this purpose are 500 text
documents obtained from the Web, 17,855 image and video
captions extracted from WebPages, 35 domain-specific
ontologies downloaded using the Swoogle SW search engine, and
6 general-purpose ontologies downloaded from online ontology
repositories and libraries. To obtain the text documents, we
developed a script to query general-purpose search engines
such as Google and AltaVista about several concepts from
different domains. Examples of these domains are Sport,
Medicine, Programming Languages, and Universities. We
manually selected the top-10 results from the lists of results
returned by each search engine. To obtain the image and video
captions, we used the DOM Tree-based Webpage segmentation
algorithm (described in Section 1). To download online
domain-specific ontologies, we submitted queries to the Swoogle
search engine and selected the ontologies that are relevant to each
query’s intent. Then, for each domain, we merged its relevant
ontologies using the merging techniques (described in Section
2). The total size of both types of the used ontologies is 8.621
GB. The current version of the constructed knowledge base
consists of 2,404,485 entities (384,051 concepts and 2,020,434
instances). It is important to mention that the produced
knowledge base is still evolving and we update it on a
continual basis.
B. Using the Produced Knowledge Base in Real Application
Domains
In this section, we describe the experiments that we carried
out to validate our proposal of employing knowledge
represented by the produced knowledge base in several
application domains. Certainly, it can be employed for other
purposes such as semantic-based indexing of multimedia
documents on the Web, computing the degree of semantic
relatedness, query reformulation, document clustering, and
ontology alignment and mapping, and so on. However, in these
experiments, we used it as an external resource to find
alignments between heterogeneous domain-specific
ontologies. In this context, we attempted to find alignments
(i.e. correspondences) between the concepts and instances of
three real-world heavyweight ontologies (GEMET,
AGROVOC, and NAL). Details on these ontologies are listed
in [23]. To compute precision and recall, we used the official
gold-standard alignments [24] that are provided by the OAEI
2007 environment task organizers. These sample alignments
are classified into different domains such as alignments in the
chemistry, geography and agriculture domains. We used the
sample alignments to compute the precision and recall of our
system in each domain. A comparison between the results of
our system, S1 and the gold standard reference alignments is
shown in Table 1.
TABLE I. USING THE PRODUCED KNOWLEDGE BASE TO FIND
ALIGNMENTS BETWEEN HEAVYWEIGHT ONTOLOGIES

Task                | # of Matches produced by our system | # of Matches produced by S1 | # of Matches in the Ref. Alignments
GEMET-AGROVOC
Chemistry-Precision | 14 out of 14   | 14 out of 14   | 14
Geography-Precision | 23 out of 23   | 23 out of 23   | 23
Geography-Recall    | 87 out of 87   | 87 out of 87   | 87
Agriculture-Recall  | 61 out of 61   | 61 out of 61   | 61
Misc-Precision      | 28 out of 28   | 28 out of 28   | 28
Tax-Precision       | 21 out of 21   | 21 out of 21   | 21
NAL-AGROVOC
Chemistry-Precision | 141 out of 141 | 141 out of 141 | 141
Geography-Precision | 58 out of 58   | 58 out of 58   | 58
Misc-Precision      | 231 out of 231 | 231 out of 231 | 231
Tax-Precision       | 10 out of 10   | 10 out of 10   | 10
Eur-Recall          | 62 out of 62   | 62 out of 62   | 62
Geography-Recall    | 58 out of 58   | 58 out of 58   | 58
GEMET-NAL
Chemistry-Precision | 30 out of 30   | 30 out of 30   | 30
Geography-Precision | 17 out of 17   | 17 out of 17   | 17
Misc-Precision      | 29 out of 29   | 29 out of 29   | 29
Tax-Precision       | 15 out of 15   | 15 out of 15   | 15
Agriculture-Recall  | 61 out of 61   | 61 out of 61   | 61
Geography-Recall    | 77 out of 77   | 77 out of 77   | 77
As shown in Table 1, despite the heterogeneity of the
alignment tasks and domains, we were able to find the same
number of equivalent concepts as in the reference alignments
provided in the gold standard.
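The comparison against the gold standard boils down to counting how many of the produced correspondences appear in the reference alignment. A minimal sketch of this bookkeeping is given below; the correspondences shown are hypothetical examples, not entries from the OAEI 2007 data.

// AlignmentScorer.java: illustrative computation of precision and recall of a set of
// produced correspondences against a reference (gold-standard) alignment.
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class AlignmentScorer {

    static double precision(Set<String> produced, Set<String> reference) {
        if (produced.isEmpty()) return 0.0;
        Set<String> correct = new HashSet<>(produced);
        correct.retainAll(reference);
        return (double) correct.size() / produced.size();
    }

    static double recall(Set<String> produced, Set<String> reference) {
        if (reference.isEmpty()) return 0.0;
        Set<String> correct = new HashSet<>(produced);
        correct.retainAll(reference);
        return (double) correct.size() / reference.size();
    }

    public static void main(String[] args) {
        // Each correspondence is encoded as "conceptA=conceptB" for simplicity.
        Set<String> reference = new HashSet<>(List.of(
                "soil=soils", "olive tree=olive trees", "acid rain=acid rain"));
        Set<String> produced = new HashSet<>(List.of(
                "soil=soils", "olive tree=olive trees", "acid rain=acid rain"));
        System.out.printf("precision = %.2f, recall = %.2f%n",
                precision(produced, reference), recall(produced, reference));
    }
}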
V. CONCLUSIONS AND FUTURE WORK
In this paper, we presented an automatically constructed
domain-independent knowledge base, developed by our
system KnowBase. Unlike traditional knowledge base
construction and population approaches, we used
heterogeneous data sources to create the knowledge base. In
addition, we avoided the human effort that is required to
control and update the knowledge base. To do this, we
employed several NLP, statistical-based and named-entity
recognition techniques. We aggregated the outputs of these
techniques to automatically populate and enrich the
knowledge base. The current version of the produced
knowledge base consists of 2,404,485 entities (384,051
concepts and 2,020,434 instances). We employed knowledge
represented by this knowledge base to find alignments
between heterogeneous ontologies in the environmental and
agricultural domains. We experimentally demonstrated that we
were able to find the same number of alignments between the
entities of the used ontologies as in the reference alignments
which are provided in the gold standard and in S1. In
future work, we plan to integrate other existing publicly
available knowledge bases into our knowledge base. In addition,
we plan to exploit additional ontologies from online ontology
repositories on the Web to enrich and expand the coverage of
our knowledge base.
REFERENCES
[1] Finin, T., Syed, Z., Mayfield, Z., McNamee, P., and Piatko, C.: Using
wikitology for cross-document entity coreference resolution. In
Proceedings of the AAAI Spring Symposium on Learning by Reading
and Learning to Read, pp. 29--35, (2009).
[2] Wishart, D.S., Knox, C., Guo, A., Eisner, R., Young, N., Gautam, B.,
Hau, D.D., Psychogios, N., Dong, E., Bouatra, S., Mandal, R.,
Sinelnikov, I., Xia, J., Jia, L., Cruz, J.A., Lim, E., Sobsey, C.A.,
Shrivastava, S., Huang, P., Liu, P., Fang, L., Peng, J., Fradette, R.,
Cheng, D., Tzur, D., Clements, M., Lewis, A., Souza, A.D., Zuniga, A.,
Dawe, M., Xiong, Y., Clive, D., Greiner, R., Nazyrova, A.,
Shaykhutdinov, R., Li, L., Vogel, H.J., Forsythe, I.J.: HMDB: a
knowledgebase for the human metabolome. Nucleic Acids Research, pp.
603--610, (2009)
[3] Geifman, N., and Rubin, E.: Towards an Age-Phenome Knowledge-
base. BMC Bioinformatics, 12:229, doi:10.1186/1471-2105-12-229.
(2011)
[4] Deaton, C., Shepard, B., Klein, C., Mayans, C., Summers, B., Brusseau,
A., and Witbrock, M.: The comprehensive terrorism knowledge base in
Cyc. In Proceedings of the 2005 International Conference on
Intelligence Analysis, (2005)
[5] Ding, L., Finin, T., Joshi, A., Pan, R., Cost, R.S., Peng, Y., Reddivari,
P., Doshi, V. and Sachs, J.: Swoogle: A semantic web search and
metadata engine. In Proc. 13th ACM Conf. on Information and
Knowledge Management, Nov. pp. 652--659, (2004)
[6] Fauzi, F., Belkhatir, M., Hong, J.: Webpage Segmentation for Extracting
Images and Their Surrounding Contextual Information. In ACM
Multimedia’09, Beijing, China. 649--652, (2009)
[7] Gregory, M., McGrath, L., Bell, E., O’Hara, K., and Domico, K.:
Domain Independent Knowledge Base Population From Structured and
Unstructured Data Sources. In Proc. of the 24th International Florida
Artificial Intelligence Research Society Conference. pp. 251--256,
(2011)
[8] Suchanek, F. M., Kasneci, G., Weikum, G.: YAGO: A Core of Semantic
Knowledge Unifying WordNet and Wikipedia. In Proc. of the 16th
International World Wide Web (WWW) conference. pp. 697--706,
(2007)
[9] Hoffart, J., Suchanek, M. F., Berberich, K., Kelham, E. L., de Melo, G.,
and Weikum., G.: YAGO2: Exploring and Querying World Knowledge
in Time, Space, Context, and Many Languages. In Proc. of the 20th
International World Wide Web Conference (WWW 2011) Hyderabad,
India, pp. 229--232, (2011)
[10] Etzioni, O., Cafarella, M.J., Downey, D., Kok, S., Popescu, A., Shaked,
T., Soderland, S., Weld, D.S., Yates, A.: Web-scale information
extraction in knowitall: (preliminary results). In WWW, pp. 100--110,
(2004)
[11] Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., and Ives,
Z. G.: DBpedia: A nucleus for a web of open data. In The Semantic
Web, 6th International Semantic Web Conference, 2nd Asian Semantic
Web Conference, ISWC 2007 + ASWC 2007, Busan, Korea, volume
4825 of Lecture Notes in Computer Science, pp. 722--735. (2007)
[12] Lenat, D. B.: Cyc: a large-scale investment in knowledge infrastructure.
Communications of the ACM, 38(11), pp. 33--38, (1995)
[13] Etzioni. O., Banko, M., Soderland, S., Weld, S. D.: Open Information
Extraction from the Web. Communications of the ACM, Vol 51, No.12,
pp. 68--74, (2008)
[14] Zhao, S. and Betz, J.: Corroborate and Learn Facts from the Web. In
KDD '07: Proceedings of the 13th ACM SIGKDD International
Conference on Knowledge discovery and data mining, pp. 995--1003.
ACM, (2007)
[15] Etzioni, O., Cafarella, M., Downey, D., Popescu, A.-M., Shaked, T.,
Soderland, S., Weld, D. S., and Yates, A.: Unsupervised named-entity
extraction from the Web: An experimental study. Artificial Intelligence,
165(1): pp. 91--134, (2005)
[16] Alani, H.: Ontology Construction from Online Ontologies. Proceedings
of the 15th international conference on World Wide Web, WWW 2006,
Edinburgh, Scotland, UK. pp. 491--495, (2006)
[17] Maree, M., and Belkhatir, M.: A Coupled Statistical/Semantic
Framework for Merging Heterogeneous domain-Specific Ontologies. In
Proc. of the 22nd International Conference on Tools with Artificial
Intelligence (ICTAI’10), Arras, France, Vol. 2. pp. 159--166, (2010)
[18] Croft, B., Metzler, D. and Strohman, T.: Search Engines: Information
Retrieval in Practice. Addison-Wesley Publishing Company, USA,
(2009)
[19] Cavnar, W. B. and Trenkle, J. M.: N-gram-based text categorization. In
Proceedings of SDAIR-94, 3rd Annual Symposium on Document
Analysis and Information Retrieval. Las Vegas, US, pp. 161--175,
(1994)
[20] Roth, D.: Learning to resolve natural language ambiguities: a unified
approach. In Proceedings of AAAI-98, 15th Conference of the American
Association for Artificial Intelligence. Madison, US, pp. 806--813,
(1998)
[21] Cunningham, H., Maynard, D., Bontcheva, K., and Tablan, V.: GATE:
a framework and graphical development environment for robust NLP
tools and applications. Proc. of the 40th Anniversary Meeting of the
Association for Computational Linguistics, Phil.,USA, (2002)
[22] Cilibrasi R., Vitanyi P.: The Google Similarity Distance. IEEE
Transactions on knowledge and data engineering. 19(3), pp. 370--383,
(2007)
[23] Zhong, Q., Li, H., Li, J., Xie, G., Tang, J., Zhou, L., and Pan, Y.: “A
Gauss Function Based Approach for Unbalanced Ontology Matching”.
SIGMOD’09, pp. 669-680, (2009)
[24] http://oaei.ontologymatching.org/2007/results/environemt/gold_standard
Discovering semantic correspondences between ontology elements is a crucial task for merging heterogeneous ontologies. Most ontology merging tools use several methods to aggregate and combine similarity measures. In addition, some of the ontology merging systems exploit external resources such as, Linguistic Knowledge Bases (e.g. WordNet) to support this task. However, the quality of their results is subjected to the limitations of the exploited knowledge base. In this paper, we present a framework that exploits multiple knowledge bases that cover information in multiple domains for: i) Indentifying and correcting incorrect semantic relations between the concepts of domain-specific ontologies. This is a primary step before ontology merging; ii) Merging domain-specific ontologies; and iii) Handling the issue of missing background knowledge in the exploited knowledge bases by utilizing statistical techniques. An experimental instantiation of the framework and comparisons with state-of-the-art syntactic and semantic-based systems validate our proposal.