2012 9th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD 2012)
Automatic Construction of a Domain-independent
Knowledge Base from Heterogeneous Data Sources
Mohammed Maree, Saadat M. Alhashmi
School of Information Technology
Monash University, Sunway Campus
Kuala Lumpur, Malaysia
Mohammed Belkhatir
Lyon Institute of Technology
University of Lyon I
Lyon, France
Andre Hawit
Mixberry Media Inc.
Burlingame, CA,
USA
Abstract— Manual construction and maintenance of general-purpose
knowledge bases form a major limiting factor in
their full adoption, use, and reuse in practical settings. In this
paper, we present KnowBase, a system for automatic knowledge
base construction from heterogeneous data sources including
domain-specific ontologies, general-purpose ontologies, plain
texts, and image and video captions, which are automatically
extracted from WebPages. In our approach, several information
extraction techniques are integrated to automatically create,
enrich, and keep the knowledge base up to date. Consequently,
knowledge represented by the produced knowledge base can be
employed in several application domains. In our experiments, we
used the produced knowledge base as an external resource to
align heterogeneous ontologies from the environmental and
agricultural domains. The produced results demonstrate the
effectiveness of the used knowledge base in finding corresponding
entities between the used ontologies.
Keywords- Knowledge Base; Information Extraction; Pattern
Acquisition; Merging; Heterogeneous Data Sources; Experimental
Validation
I. INTRODUCTION
Recently, several approaches have been proposed to
automatically build knowledge bases. These approaches either
rely on a single data source such as Wikipedia (http://www.wikipedia.org/) to create the
knowledge base [1] or build knowledge bases for specific
domains such as the medical, biomedical and terrorism
domains [2][3][4]. In this paper, we present KnowBase, a
system for automatic knowledge base construction from
heterogeneous data sources including domain-specific
ontologies, general-purpose ontologies, plain texts, and image
and video captions, which are automatically extracted from
WebPages. To obtain domain-specific ontologies, we use the
Swoogle semantic Web search engine [5]. For each domain, we
submit queries including keywords that are related to that
domain and download the returned ontologies. We obtain
general-purpose ontologies from online ontology repositories
and libraries on the Web. Plain texts are automatically
extracted from the Web using a Web crawler. The extracted
texts from relevant websites are then processed using several
Natural Language Processing (NLP) techniques. To identify
and extract image and video captions from WebPages, we use
the DOM Tree-based webpage segmentation algorithm that is
proposed in [6].
Figure 1. Example of the Output of the DOM Tree-based Webpage Segmentation Algorithm
The segmentation process is based on the Document Object
Model (DOM) Tree structure of WebPages. In this algorithm,
multimedia documents on the Web are classified into three
categories: Listed, Semi-listed, and Unlisted documents. For
every extracted multimedia document, the segmentation
method only searches the surrounding region, making it more
efficient and scalable for large websites that contain a huge
amount of multimedia documents. As shown in Figure 1,
given a webpage as input, the algorithm processes the DOM
tree of the webpage and extracts the segments as output.
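As a rough illustration of caption extraction from the DOM tree (this is not the algorithm of [6], which additionally classifies documents as Listed, Semi-listed, or Unlisted), the following minimal Java sketch uses the jsoup HTML parser to collect, for each image element, the text found in its immediately surrounding region; class names and the sample HTML are assumptions made for the example.

// CaptionSketch.java: a simplified illustration of extracting an image's surrounding
// text from the DOM tree. It is not the segmentation algorithm of [6]; it only looks
// at a fixed number of ancestor levels around each <img> element.
// Requires the jsoup library (org.jsoup) on the classpath.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class CaptionSketch {

    // Returns the text of the closest ancestor region that contains some text,
    // climbing at most maxLevels levels up the DOM tree.
    static String surroundingText(Element img, int maxLevels) {
        Element region = img.parent();
        for (int level = 0; region != null && level < maxLevels; level++) {
            String text = region.text().trim();
            if (!text.isEmpty()) {
                return text;
            }
            region = region.parent();
        }
        return "";
    }

    public static void main(String[] args) {
        String html = "<html><body><div class='figure'>"
                    + "<img src='tree.jpg' alt='An olive tree'>"
                    + "<p>Figure: An olive tree in an agricultural field.</p>"
                    + "</div></body></html>";
        Document doc = Jsoup.parse(html);
        for (Element img : doc.select("img")) {
            System.out.println("alt text : " + img.attr("alt"));
            System.out.println("caption  : " + surroundingText(img, 3));
        }
    }
}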
Due to the heterogeneity of the used resources, different
information extraction techniques are required. These
techniques are integrated to build a coherent structure of the
knowledge base. For instance, to extract facts from plain texts,
and image and video captions, we utilize several NLP,
statistical-based and named-entity recognition techniques. On
the other hand, we employ other extraction and merging
techniques to populate the knowledge base with knowledge
triples (entity-relation-concept), which are obtained from the
downloaded domain-specific and general-purpose ontologies. In this
context, an entity refers to a concept or an instance of a concept.
Details of these techniques are presented in Section 3A. The
main contributions of our work are summarized as follows:
- Exploiting heterogeneous data sources for automatic knowledge base construction.
- Combining several information extraction techniques and aggregating their output to construct and enrich the knowledge base.
The rest of this paper is organized as follows. Section 2
presents the details of some other knowledge base construction
and population systems. The theoretical framework for
automatic construction and update of domain-independent
knowledge bases is presented in Section 3. In this section, we
also present the details of the methods that are used in the
proposed system. Section 4 discusses the experiments that
were carried out to construct the knowledge base, as well as to
exploit the constructed knowledge base in aligning
heterogeneous ontologies. Section 5 presents the conclusions
and outlines the future work.
II. RELATED WORK
In this section, we discuss the related work in terms of two
different aspects: (a) knowledge base construction approaches
and (b) issues related to automatically identifying and
extracting entities and facts from heterogeneous data sources.
A. Knowledge Base Construction
Knowledge base construction has always been at the heart
of the Semantic Web (SW) technology. However, with the
continuous expansion of the Web, this task has become more
difficult [7]. To address this issue, several knowledge base
construction systems have been proposed [8, 9, 10, 11]. Some
of these systems rely on human input to enrich the knowledge
base and keep it up-to-date. Examples of the knowledge
bases produced by such systems are Freebase (http://www.freebase.com/) and
True Knowledge (http://www.trueknowledge.com/). On the other hand, other systems rely on a single
data source to create the knowledge base or create knowledge
bases that are related to a particular domain. For instance,
Geifman and Rubin proposed to model and store knowledge
about age-related phenotypic patterns and events in an Age-
Phenome Knowledge Base (APK) [3]. Another example is the
terrorism knowledge base, which contains all relevant
knowledge about terrorist groups, their members, leaders,
affiliations, and full descriptions of specific terrorist events [4].
This knowledge base was integrated into Cyc [12], which is a
general-purpose ontology that captures knowledge from
multiple domains. In our approach, we not only aim at avoiding
the effort required by users to manually maintain and update
the knowledge base, but also at populating and extending the
knowledge base from heterogeneous data sources.
B. Information Extraction
The task of automatically identifying and extracting
entities and facts from heterogeneous data sources is a
prevalent problem. For each data source, we need to identify
and utilize different information extraction methods. The
ultimate goal of these methods is to automatically construct
and populate the knowledge base with entities, facts (these are
automatically extracted from plain texts and image and video
captions from WebPages), and knowledge triples, which are
automatically extracted from online ontologies. For instance,
to extract information from plain texts on the Web,
TextRunner [13] employs heuristics to produce extractions in
the form of a tuple t = (ei, ri,j , ej), where ei and ej are strings
meant to denote entities, and ri,j is a string meant to denote a
relationship between them. Similar systems are GRAZER [14]
and KnowItAll [15]. In GRAZER, the inputs are seed facts for
given entities, which are automatically generated using
specialized wrappers. KnowItAll used bootstrapping to extract
patterns and facts simultaneously from text [14]. The relevant
pages are retrieved from a search engine via a query composed
of keywords in a pattern. The initial seed set contains a few
hand-generated patterns. In our approach, we combine several
NLP, statistical-based and named-entity recognition
techniques to extract facts, entities and their attributes from
textual information on the Web.
Another data source that can be exploited for knowledge
base construction is online ontologies. For example, the authors
of [16] propose to construct ontologies from online domain-
specific ontologies. To do this, they submit queries to the Swoogle
SW search engine and download relevant ontologies from the
list of the returned results. In addition, they employ ranking and
segmentation techniques to rank the returned ontologies and
extract segments from them. To address the issue of overlap
between the extracted segments, the authors propose to merge
overlapping segments into a single representation. We build on
this work to automatically download domain-specific and
general-purpose ontologies from the Swoogle SW search engine
and other ontology repositories and libraries on the Web.
Considering domain-specific ontologies, we merge them using
the merging techniques proposed in our previous work [17]. In
this context, for each domain of interest, we merge its relevant
ontologies using semantic-, name- and statistical-based ontology
merging techniques. Then, we extract knowledge triples from
the merged domain-specific ontologies and other general-
purpose ontologies to populate the knowledge base.
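As an illustration of how such knowledge triples (entity-relation-concept) might be pulled out of a downloaded OWL/RDF ontology, the sketch below iterates over RDF statements with Apache Jena. This is our own simplification, not the extraction pipeline used by KnowBase, and the file name "ontology.owl" is a placeholder.

// TripleExtractor.java: illustrative extraction of (entity, relation, concept) triples
// from a downloaded ontology file using Apache Jena. This is a simplified sketch,
// not the actual KnowBase extraction pipeline; "ontology.owl" is a placeholder path.
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Statement;
import org.apache.jena.rdf.model.StmtIterator;

public class TripleExtractor {
    public static void main(String[] args) {
        Model model = ModelFactory.createDefaultModel();
        model.read("ontology.owl");   // reads RDF/XML (or Turtle, etc.) from the given path/URL

        StmtIterator it = model.listStatements();
        while (it.hasNext()) {
            Statement stmt = it.nextStatement();
            String subject = stmt.getSubject().getLocalName();
            String relation = stmt.getPredicate().getLocalName();
            // Only resource objects are kept here; literals (labels, comments) would be
            // handled separately, e.g. as lexical information about the entity.
            if (stmt.getObject().isResource()) {
                String object = stmt.getObject().asResource().getLocalName();
                System.out.println(subject + " - " + relation + " - " + object);
            }
        }
    }
}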
III. THEORETICAL BASES AND SYSTEM DESCRIPTION
A knowledge base is a repository of facts about entities and
their relationships. A fact is a triple with an entity-relation-entity
structure. Entities are related through different
types of semantic and taxonomic relations. These relations are
automatically extracted from multiple heterogeneous data
sources such as plain texts, image and video captions, and
online ontologies. A detailed discussion of these data sources
and the techniques that we use in our system is given in the
following sections.
A. Knowledge Base Construction from Online Domain-
specific Ontologies
Among the data sources that we exploit to construct the
knowledge base are online domain-specific ontologies. A
domain-specific ontology can be formally defined as:
Definition 1: A domain-specific ontology Ω is a 4-tuple ⟨C, R, I, A⟩ where:
- C = {(ci), i ∈ [1, Card(C)]} represents the set of domain concepts of the ontology.
- R = {(ri), i ∈ [1, Card(R)]} represents the set of semantic relations holding between the ontology concepts.
- I is the set of instances or individuals.
- A is the set of axioms verifying A = {(ri, cj, ck)} s.t. i ∈ [1, Card(R)], j, k ∈ [1, Card(C)], cj, ck ∈ C and ri ∈ R.
We associate to the ontology a logic-based translation towards a
predicate calculus PC, based on C, R, I, A and a set of predicate
symbols P ⊇ C ∪ R. Predicates linked to C are monadic while
those linked to R are dyadic.
Φo: C ∪ A ∪ {Ω} → PC
Φo(c) = c(xc), where xc ∈ I
Φo(a) = r(ci, cj), where r ∈ R and ci, cj ∈ C
Φo(Ω) = ⋀_{ci ∈ C} Φo(ci) ∧ ⋀_{ai ∈ A} Φo(ai)
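For illustration (this toy ontology is ours and is not taken from the downloaded ontologies), let C = {Organization, Corporate_Body}, R = {is-a}, I = {x1}, and A = {(is-a, Corporate_Body, Organization)}. The translation then yields:
Φo(Organization) = Organization(x1)
Φo(Corporate_Body) = Corporate_Body(x1)
Φo((is-a, Corporate_Body, Organization)) = is-a(Corporate_Body, Organization)
Φo(Ω) = Organization(x1) ∧ Corporate_Body(x1) ∧ is-a(Corporate_Body, Organization)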
To address the semantic heterogeneity problem (i.e.
conceptual and terminological differences between domain-
specific ontologies), we employ the merging techniques
proposed in [17]. Formally, a merging algorithm can be defined
as:
Definition 2: Merging: Given two domain-specific ontologies
Ω1 and Ω2, the merging operation finds semantic
correspondences between their concepts and produces a single
merged ontology Ωmerged as output. Semantic correspondences
between both ontologies are 4-tuples ⟨Cid, Ci, Cj, r⟩ such that:
- Cid is a unique identifier of the correspondence.
- Ci ∈ Ω1, Cj ∈ Ω2 are corresponding concepts of the input ontologies.
- r ∈ R is a semantic relation holding between both elements Ci and Cj.
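A minimal data-structure sketch of such a correspondence is shown below. The class, field names, and the naive label-equality matcher are illustrative assumptions; they are not the merging techniques of [17].

// Correspondence.java: a minimal sketch of the 4-tuple <Cid, Ci, Cj, r> of Definition 2.
// The naive label-equality matcher is a stand-in, not the merging approach of [17].
import java.util.ArrayList;
import java.util.List;

public class Correspondence {
    final String id;       // Cid: unique identifier of the correspondence
    final String conceptI; // Ci: concept label from ontology O1
    final String conceptJ; // Cj: concept label from ontology O2
    final String relation; // r: semantic relation, e.g. "equivalent-to" or "is-a"

    Correspondence(String id, String ci, String cj, String r) {
        this.id = id; this.conceptI = ci; this.conceptJ = cj; this.relation = r;
    }

    // Placeholder matcher: aligns concepts whose normalized labels are identical.
    static List<Correspondence> matchByLabel(List<String> onto1, List<String> onto2) {
        List<Correspondence> out = new ArrayList<>();
        int id = 0;
        for (String ci : onto1) {
            for (String cj : onto2) {
                if (ci.replace('_', ' ').equalsIgnoreCase(cj.replace('_', ' '))) {
                    out.add(new Correspondence("c" + (id++), ci, cj, "equivalent-to"));
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> o1 = List.of("Corporate_Body", "Organization");
        List<String> o2 = List.of("organization", "Social_Group");
        for (Correspondence c : matchByLabel(o1, o2)) {
            System.out.println(c.id + ": " + c.conceptI + " " + c.relation + " " + c.conceptJ);
        }
    }
}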
To represent the knowledge base we use an Entity-
Relationship graph. Such a graph connects different entities E
from multiple domains through different types of relations R.
In this context, a finite set of entity labels is represented by
the graph nodes and directed links between those nodes are
used to represent R. This can be formally represented as
follows:
Definition 3: Knowledge Base: A knowledge base KB is a
structure KB := (CKB, RKB, IKB) where:
- CKB is the set of concepts that are defined in KB. Generally, CKB := {C ∪ Cmiss}, where Cmiss is the set of concepts that are not defined in Ω but are defined in KB. This is due to the fact that KB covers information across multiple domains and is not limited to a particular domain as Ω is. However, we may find that C includes concepts {C1, C2, . . . , Cn} which are not defined in CKB. For this particular case, we utilize statistical-based techniques to enrich KB with the set of concepts {C1, C2, . . . , Cn}.
- RKB is the set of semantic and taxonomic relations that are defined to relate the concepts in CKB.
- IKB is the set of instances of the concepts that are defined in KB.
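As a rough illustration of how such an Entity-Relationship graph could be held in memory, the following sketch stores concepts, instances, and labelled relations. It is our own simplification, not the KnowBase data model, and all class and method names are assumptions.

// KnowledgeGraph.java: a simplified in-memory Entity-Relationship graph for the
// structure (CKB, RKB, IKB) of Definition 3. Illustrative sketch only.
import java.util.*;

public class KnowledgeGraph {
    // Directed, labelled edges: source entity -> (relation label -> target entities)
    private final Map<String, Map<String, Set<String>>> edges = new HashMap<>();
    private final Set<String> concepts = new HashSet<>();   // CKB
    private final Set<String> instances = new HashSet<>();  // IKB
    private final Set<String> relations = new HashSet<>();  // RKB (relation labels)

    public void addConcept(String c)  { concepts.add(c); }
    public void addInstance(String i) { instances.add(i); }

    // Adds a knowledge triple (entity - relation - entity) to the graph.
    public void addTriple(String source, String relation, String target) {
        relations.add(relation);
        edges.computeIfAbsent(source, k -> new HashMap<>())
             .computeIfAbsent(relation, k -> new HashSet<>())
             .add(target);
    }

    public boolean isDefined(String entity) {
        return concepts.contains(entity) || instances.contains(entity);
    }

    public Set<String> related(String source, String relation) {
        return edges.getOrDefault(source, Map.of())
                    .getOrDefault(relation, Set.of());
    }

    public static void main(String[] args) {
        KnowledgeGraph kb = new KnowledgeGraph();
        kb.addConcept("Organization");
        kb.addConcept("Corporate Body");
        kb.addTriple("Corporate Body", "is-a", "Organization");
        System.out.println(kb.related("Corporate Body", "is-a")); // [Organization]
    }
}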
B. Knowledge Base Construction from Plain Texts
In our approach, we divide plain texts into two categories.
The first category consists of plain texts that are automatically
extracted from Web documents. These are extracted using a
Web crawler. The second category consists of image and
video captions, which are used to describe the content of these
types of multimedia documents on the Web. As discussed in
Section 1, in order to identify and extract the captions of such
multimedia documents we use the DOM Tree-based Webpage
segmentation algorithm that is proposed in [6]. To process
texts from both categories we first utilize several NLP
techniques such as stopword removal [18], tokenization [19],
and Part Of Speech (POS) tagging [20]. Then, we extract
named entities through employing GATE [21], which is a
syntactical pattern matching entity recognizer enriched with
gazetteers. Although the coverage of GATE is limited to a
certain number of named-entities, additional rules can be
defined in order to expand its coverage. However, the process
of manually enriching GATE’s rules can be a difficult and
time-consuming task. Therefore, we exploit the constructed
knowledge base as a supplementary source of named-entity
recognition. In this context, entities that are not defined in
GATE are submitted to the knowledge base to find whether
they are defined in it or not. After extracting named entities
from texts, we utilize a statistical-based semantic relatedness
measure to compute the degree of semantic relatedness
between the extracted entities. This measure is based on the
Normalized Retrieval Distance (NRD) function [17]. This
function is formally defined as follows:
Definition 4: Normalized Retrieval Distance (NRD): is an
adapted form of the Normalized Google Distance (NGD) [22]
function that measures the semantic relatedness between pairs
of entities (such as concepts or instances): Given two entities
E_miss and E_in, the Normalized Retrieval Distance between
E_miss and E_in can be obtained as follows:
NRD(E_miss, E_in) = (max{log f(E_miss), log f(E_in)} - log f(E_miss, E_in)) / (log M - min{log f(E_miss), log f(E_in)})   (1)
where:
- E_miss is an entity that is recognized by GATE but not defined in the knowledge base, KB.
- E_in is an entity that exists in the knowledge base, KB.
- f(E_miss) is the number of hits for the search term E_miss.
- f(E_in) is the number of hits for the search term E_in.
- f(E_miss, E_in) is the number of hits for the search terms E_miss and E_in together.
- M is the number of WebPages indexed by the search engine.
Unlike the NGD function, the NRD function returns
different semantic relatedness measures according to several
search engines (Google, AltaVista, Yahoo!). Therefore, we
sum up all NRD values for each candidate entity. This
summation represents an aggregated decision made by several
search engines on the degree of semantic relatedness between
the entities E_miss and E_in. The returned semantic relatedness
measures indicate whether or not two entities are strongly
related, based on a threshold value v = 0.5. Therefore,
entities with semantic relatedness measures > v will be
considered for further processing by the Semantic Relation
Extractor (SRE) function. This function takes as input pairs of
entities with strong semantic relatedness measures and
produces as output the suggested semantic relation(s) between
them based on a set of pre-defined lexico-syntactic patterns.
Details on this function can be found in [17].
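The following sketch shows how equation (1) and the aggregation over several search engines could be computed from raw hit counts. It is a minimal illustration: the hit counts are hypothetical stand-ins, and in practice they would be retrieved from each search engine rather than hard-coded.

// NrdCalculator.java: illustrative computation of the Normalized Retrieval Distance
// (equation 1) and its aggregation over several search engines. The hit counts in
// main() are hypothetical; real values would come from the search engines.
public class NrdCalculator {

    // fMiss: hits for E_miss; fIn: hits for E_in; fBoth: hits for both terms;
    // indexed: number of Web pages indexed by the search engine (M).
    static double nrd(double fMiss, double fIn, double fBoth, double indexed) {
        double logMiss = Math.log(fMiss);
        double logIn = Math.log(fIn);
        double logBoth = Math.log(fBoth);
        double numerator = Math.max(logMiss, logIn) - logBoth;
        double denominator = Math.log(indexed) - Math.min(logMiss, logIn);
        return numerator / denominator;
    }

    public static void main(String[] args) {
        // Assumed hit counts for one entity pair on three different search engines:
        // {f(E_miss), f(E_in), f(E_miss, E_in), M}
        double[][] hits = {
            {4.2e6, 9.8e8, 2.1e6, 5.0e10},
            {3.7e6, 7.5e8, 1.8e6, 3.0e10},
            {5.1e6, 8.9e8, 2.4e6, 4.0e10}
        };
        double sum = 0.0;
        for (double[] h : hits) {
            sum += nrd(h[0], h[1], h[2], h[3]);
        }
        System.out.println("Aggregated NRD = " + sum);
        // Entity pairs whose aggregated relatedness passes the threshold (v = 0.5 in the
        // paper) are then handed to the Semantic Relation Extractor (SRE) of [17].
    }
}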
C. Automatic Knowledge Base Enrichment
Enrichment of the knowledge base is performed at two
parallel stages. In the first stage, entities (concepts or their
instances) of the merged domain-specific ontologies and other
general-purpose ontologies are submitted to the knowledge
base. If any of the entities does not exist in the knowledge
base, then it is transferred to the second stage, wherein
the entity is automatically extracted and added to the Entity-
Relationship graph of the knowledge base based on its
context. A context represents all concepts that exist along the
semantic path(s) of each entity. To illustrate this step, we take
an example of a merged ontology that describes the
organizations domain.
Figure 2. Part of a Merged Ontology about the Organizations Domain.
Concepts are related through is-a Transitive Relation
In Figure 2, we see that the contexts of the concept
“Corporate_Body” are:
1. {“Organization”, “Body”, “Gathering”, “Psychological
Feature”, “Abstraction”, “Abstract Entity”, “Entity”}
2. {“Organization”, “Social Group”, “Group”,
“Abstraction”, “Abstract Entity”, “Entity”}
We call each of these contexts a semantic path. For instance,
to enrich the knowledge base with the entity “Corporate
Body”, we first submit it to the knowledge base to find
whether it is already defined in it or not. Assuming that this
entity is missing from the knowledge base, we traverse the
hierarchy of the semantic paths of this entity and extract the
concepts in each path. Then, in ascending order, we attempt
to find whether the parents of the missing entity exist in the
knowledge base or not. This loop is repeated until we reach
the root node in the graph. Accordingly, we consider the
following cases for enriching the knowledge base:
A) If the parent p of the missing entity e (e.g. “Corporate
Body” is a missing entity in our example) is also missing
from the knowledge base KB, then we extract the segment
of the semantic path from the merged ontology Ωmerged
that consists of the triple: e-relation-p. For example,
assuming the concept “Organization” is also missing
from KB, we extract the triple: Corporate Body is-a
Organization from Ωmerged. Then, the Entity-Relationship
graph of KB is updated by adding two new nodes
(“Corporate Body” and “Organization”) and linking them
through the (is-a) relation.
B) If the parent p of the missing entity e is already defined in
KB, then we attach e directly to p in the hierarchy of
KB. Here, it is important to mention that p might have
different meanings (senses) in KB. Therefore, it is
important to disambiguate the meaning of p before linking
it to e. To do this, we compare the contexts of the entity e
to the contexts of p in KB. Accordingly, e will be linked
to p based on the similarity between their contexts. For
instance, suppose we have the following senses for the entity
“Organization” in KB:
1. Organization: A group of people who work together.
2. Organization: An organized structure for classifying.
Figure 3. Senses of the Concept Organization in KB
We compare the context of “Organization” from KB to the
context of the same concept from Ωmerged. We find that the
most similar contexts are context No. 2 from Figure 2
({“Organization”, “Social Group”, “Group”, “Abstraction”,
“Abstract Entity”, “Entity”}) and context No. 1 from Figure 3.
Therefore, we link “Corporate Body” to the semantic path that
represents the first sense of “Organization” in KB. The result
of this step is shown in Figure 4.
Figure 4. Adding a New Concept to KB
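The enrichment procedure of cases A and B can be summarized as the following sketch. It reuses the illustrative KnowledgeGraph class shown in Section III (our own sketch, not the KnowBase implementation), and the set-overlap context similarity is an assumed stand-in for the comparison described above.

// KbEnricher.java: illustrative sketch of the enrichment loop of Section III.C.
// Assumes the KnowledgeGraph sketch given earlier; the set-overlap similarity used
// for sense disambiguation is a simplifying assumption.
import java.util.*;

public class KbEnricher {

    // Walks one semantic path (the entity followed by its ancestors up to the root)
    // and adds the missing segment of that path to the knowledge base.
    static void enrich(KnowledgeGraph kb, List<String> semanticPath) {
        for (int i = 0; i < semanticPath.size() - 1; i++) {
            String entity = semanticPath.get(i);
            String parent = semanticPath.get(i + 1);
            if (kb.isDefined(entity)) {
                break;                              // the rest of the path is already known
            }
            kb.addConcept(entity);
            kb.addTriple(entity, "is-a", parent);   // link e to its parent p
            if (kb.isDefined(parent)) {
                // Case B: p exists in KB. Before linking, the right sense of p would be
                // chosen by comparing the contexts of e and p, e.g. via contextOverlap.
                break;
            }
            // Case A: p is also missing; it is added in the next iteration together
            // with its own is-a link further up the path.
        }
        String root = semanticPath.get(semanticPath.size() - 1);
        if (!kb.isDefined(root)) {
            kb.addConcept(root);
        }
    }

    // Simple context similarity: size of the intersection of two concept sets.
    static int contextOverlap(Set<String> contextA, Set<String> contextB) {
        Set<String> common = new HashSet<>(contextA);
        common.retainAll(contextB);
        return common.size();
    }

    public static void main(String[] args) {
        KnowledgeGraph kb = new KnowledgeGraph();
        kb.addConcept("Entity");
        kb.addConcept("Group");
        // Semantic path of "Corporate Body" as read off Figure 2.
        List<String> path = List.of("Corporate Body", "Organization", "Social Group",
                                    "Group", "Abstraction", "Abstract Entity", "Entity");
        enrich(kb, path);
        System.out.println(kb.related("Corporate Body", "is-a"));      // [Organization]
        System.out.println(contextOverlap(
                new HashSet<>(path),
                Set.of("Organization", "Social Group", "Group")));     // 3
    }
}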
IV. EXPERIMENTAL RESULTS
In the following sections, we discuss experiments in terms
of two different aspects. First, we discuss the experiments that
we carried out to construct and populate the knowledge base.
Second, we experimentally demonstrate the effectiveness of
employing the knowledge that is represented by the constructed
knowledge base in real application domains. We implemented
all solutions in Java, and experiments were performed on a PC
with a dual-core 3.0 GHz CPU and 4 GB of RAM. The
operating system that was used is openSUSE 11.1.
A. Automatic Construction and Population of the Knowledge
Base
In this section we discuss the experiments that we carried
out to automatically construct and populate the knowledge
base. The sources that we used for this purpose are 500 text
documents obtained from the Web, 17,855 image and video
captions extracted from WebPages, 35 domain-specific
ontologies downloaded using the Swoogle SW search engine, and
6 general-purpose ontologies downloaded from online ontology
repositories and libraries. To obtain the text documents, we
developed a script to query general-purpose search engines
such as Google and AltaVista about several concepts from
different domains. Examples of these domains are Sport,
Medicine, Programming Languages, and Universities. We
manually selected the top-10 results from the lists of results
returned by each search engine. To obtain the image and video
captions, we used the DOM Tree-based Webpage segmentation
algorithm (described in Section 1). To download online
domain-specific ontologies, we submitted queries to the Swoogle
search engine and selected the ontologies that are relevant to each
query’s intent. Then, for each domain, we merged its relevant
ontologies using the merging techniques (described in Section
2). The total size of both types of the used ontologies is 8.621
GB. The current version of the constructed knowledge base
consists of 2,404,485 entities (384,051 concepts and 2,020,434
instances). It is important to mention that the produced
knowledge base is still evolving and we update it on a
continual basis.
B. Using the Produced Knowledge Base in Real Application
Domains
In this section, we describe the experiments that we carried
out to validate our proposal of employing knowledge
represented by the produced knowledge base in several
application domains. Certainly, it can be employed for other
purposes such as semantic-based indexing of multimedia
documents on the Web, computing the degree of semantic
relatedness, query reformulation, document clustering, and
ontology alignment and mapping, and so on. However, in these
experiments, we used it as an external resource to find
alignments between heterogeneous domain-specific
ontologies. In this context, we attempted to find alignments
(i.e. correspondences) between the concepts and instances of
three real-world heavyweight ontologies (GEMET,
AGROVOC, and NAL). Details on these ontologies are listed
in [23]. To compute precision and recall, we used the official
gold-standard alignments [24] that are provided by the OAEI
2007 environment task organizers. These sample alignments
are classified into different domains such as alignments in the
chemistry, geography and agriculture domains. We used the
sample alignments to compute the precision and recall of our
system in each domain. A comparison between the results of
our system, S1 and the gold standard reference alignments is
shown in Table 1.
TABLE I. USING THE PRODUCED KNOWLEDGE BASE TO FIND
ALIGNMENTS BETWEEN HEAVYWEIGHT ONTOLOGIES

Task                | # of Matches produced by our system | # of Matches produced by S1 | # of Matches in the Ref. Alignments
GEMET-AGROVOC
Chemistry-Precision | 14 out of 14   | 14 out of 14   | 14
Geography-Precision | 23 out of 23   | 23 out of 23   | 23
Geography-Recall    | 87 out of 87   | 87 out of 87   | 87
Agriculture-Recall  | 61 out of 61   | 61 out of 61   | 61
Misc-Precision      | 28 out of 28   | 28 out of 28   | 28
Tax-Precision       | 21 out of 21   | 21 out of 21   | 21
NAL-AGROVOC
Chemistry-Precision | 141 out of 141 | 141 out of 141 | 141
Geography-Precision | 58 out of 58   | 58 out of 58   | 58
Misc-Precision      | 231 out of 231 | 231 out of 231 | 231
Tax-Precision       | 10 out of 10   | 10 out of 10   | 10
Eur-Recall          | 62 out of 62   | 62 out of 62   | 62
Geography-Recall    | 58 out of 58   | 58 out of 58   | 58
GEMET-NAL
Chemistry-Precision | 30 out of 30   | 30 out of 30   | 30
Geography-Precision | 17 out of 17   | 17 out of 17   | 17
Misc-Precision      | 29 out of 29   | 29 out of 29   | 29
Tax-Precision       | 15 out of 15   | 15 out of 15   | 15
Agriculture-Recall  | 61 out of 61   | 61 out of 61   | 61
Geography-Recall    | 77 out of 77   | 77 out of 77   | 77
As shown in Table 1, despite the heterogeneity of the
alignment tasks and domains, we were able to find the same
number of equivalent concepts as in the reference alignments
provided in the gold standard.
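The comparison against the gold standard boils down to counting how many of the produced correspondences appear in the reference alignment. A minimal sketch of this bookkeeping is given below; the correspondences shown are hypothetical examples, not entries from the OAEI 2007 data.

// AlignmentScorer.java: illustrative computation of precision and recall of a set of
// produced correspondences against a reference (gold-standard) alignment.
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class AlignmentScorer {

    static double precision(Set<String> produced, Set<String> reference) {
        if (produced.isEmpty()) return 0.0;
        Set<String> correct = new HashSet<>(produced);
        correct.retainAll(reference);
        return (double) correct.size() / produced.size();
    }

    static double recall(Set<String> produced, Set<String> reference) {
        if (reference.isEmpty()) return 0.0;
        Set<String> correct = new HashSet<>(produced);
        correct.retainAll(reference);
        return (double) correct.size() / reference.size();
    }

    public static void main(String[] args) {
        // Each correspondence is encoded as "conceptA=conceptB" for simplicity.
        Set<String> reference = new HashSet<>(List.of(
                "soil=soils", "olive tree=olive trees", "acid rain=acid rain"));
        Set<String> produced = new HashSet<>(List.of(
                "soil=soils", "olive tree=olive trees", "acid rain=acid rain"));
        System.out.printf("precision = %.2f, recall = %.2f%n",
                precision(produced, reference), recall(produced, reference));
    }
}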
V. CONCLUSIONS AND FUTURE WORK
In this paper, we presented an automatically constructed
domain-independent knowledge base, developed by our
system KnowBase. Unlike traditional knowledge base
construction and population approaches, we used
heterogeneous data sources to create the knowledge base. In
addition, we avoided the human effort that is required to
control and update the knowledge base. To do this, we
employed several NLP, statistical-based and named-entity
recognition techniques. We aggregated the outputs of these
techniques to automatically populate and enrich the
knowledge base. The current version of the produced
knowledge base consists of 2,404,485 entities (384,051
concepts and 2,020,434 instances). We employed knowledge
represented by this knowledge base to find alignments
between heterogeneous ontologies in the environmental and
agricultural domains. We experimentally demonstrated that we
were able to find the same number of alignments between the
entities of the used ontologies as in the reference alignments
which are provided in the gold standard and in S1. In
future work, we plan to integrate other existing publicly
available knowledge bases into our knowledge base. In addition,
we plan to exploit additional ontologies from online ontology
repositories on the Web to enrich and expand the coverage of
our knowledge base.
REFERENCES
[1] Finin, T., Syed, Z., Mayfield, Z., McNamee, P., and Piatko, C.: Using
wikitology for cross-document entity coreference resolution. In
Proceedings of the AAAI Spring Symposium on Learning by Reading
and Learning to Read, pp. 29--35, (2009).
[2] Wishart, D.S., Knox, C., Guo, A., Eisner, R., Young, N., Gautam, B.,
Hau, D.D., Psychogios, N., Dong, E., Bouatra, S., Mandal, R.,
Sinelnikov, I., Xia, J., Jia, L., Cruz, J.A., Lim, E., Sobsey, C.A.,
Shrivastava, S., Huang, P., Liu, P., Fang, L., Peng, J., Fradette, R.,
Cheng, D., Tzur, D., Clements, M., Lewis, A., Souza, A.D., Zuniga, A.,
Dawe, M., Xiong, Y., Clive, D., Greiner, R., Nazyrova, A.,
Shaykhutdinov, R., Li, L., Vogel, H.J., Forsythe, I.J.: HMDB: a
knowledgebase for the human metabolome. Nucleic Acids Research, pp.
603--610, (2009)
[3] Geifman, N., and Rubin, E.: Towards an Age-Phenome Knowledge-
base. BMC Bioinformatics, 12:229, doi:10.1186/1471-2105-12-229.
(2011)
[4] Deaton, C., Shepard, B., Klein, C., Mayans, C., Summers, B., Brusseau,
A., and Witbrock, M.: The comprehensive terrorism knowledge base in
Cyc. In Proceedings of the 2005 International Conference on
Intelligence Analysis, (2005)
[5] Ding, L., Finin, T., Joshi, A., Pan, R., Cost, R.S., Peng, Y., Reddivari,
P., Doshi, V. and Sachs, J.: Swoogle: A semantic web search and
metadata engine. In Proc. 13th ACM Conf. on Information and
Knowledge Management, Nov. pp. 652--659, (2004)
[6] Fauzi, F., Belkhatir, M., Hong, J.: Webpage Segmentation for Extracting
Images and Their Surrounding Contextual Information. In ACM
Multimedia’09, Beijing, China. 649--652, (2009)
[7] Gregory, M., McGrath, L., Bell, E., O’Hara, K., and Domico, K.:
Domain Independent Knowledge Base Population From Structured and
Unstructured Data Sources. In Proc. of the 24th International Florida
Artificial Intelligence Research Society Conference. pp. 251--256,
(2011)
[8] Suchanek, F. M., Kasneci, G., Weikum, G.: YAGO: A Core of Semantic
Knowledge Unifying WordNet and Wikipedia. In Proc. of the 16th
International World Wide Web (WWW) conference. pp. 697--706,
(2007)
[9] Hoffart, J., Suchanek, M. F., Berberich, K., Kelham, E. L., de Melo, G.,
and Weikum., G.: YAGO2: Exploring and Querying World Knowledge
in Time, Space, Context, and Many Languages. In Proc. of the 20th
International World Wide Web Conference (WWW 2011) Hyderabad,
India, pp. 229--232, (2011)
[10] Etzioni, O., Cafarella, M.J., Downey, D., Kok, S., Popescu, A., Shaked,
T., Soderland, S., Weld, D.S., Yates, A.: Web-scale information
extraction in knowitall: (preliminary results). In WWW, pp. 100--110,
(2004)
[11] Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., and Ives,
Z. G.: DBpedia: A nucleus for a web of open data. In The Semantic
Web, 6th International Semantic Web Conference, 2nd Asian Semantic
Web Conference, ISWC 2007 + ASWC 2007, Busan, Korea, volume
4825 of Lecture Notes in Computer Science, pp. 722--735. (2007)
[12] Lenat, D. B.: Cyc: a large-scale investment in knowledge infrastructure.
Communications of the ACM, 38(11), pp. 33--38, (1995)
[13] Etzioni. O., Banko, M., Soderland, S., Weld, S. D.: Open Information
Extraction from the Web. Communications of the ACM, Vol 51, No.12,
pp. 68--74, (2008)
[14] Zhao, S. and Betz, J.: Corroborate and Learn Facts from the Web. In
KDD '07: Proceedings of the 13th ACM SIGKDD International
Conference on Knowledge discovery and data mining, pp. 995--1003.
ACM, (2007)
[15] Etzioni, O., Cafarella, M., Downey, D., Popescu, A.-M., Shaked, T.,
Soderland, S., Weld, D. S., and Yates, A.: Unsupervised named-entity
extraction from the Web: An experimental study. Artificial Intelligence,
165(1): pp. 91--134, (2005)
[16] Alani, H.: Ontology Construction from Online Ontologies. Proceedings
of the 15th international conference on World Wide Web, WWW 2006,
Edinburgh, Scotland, UK. pp. 491--495, (2006)
[17] Maree, M., and Belkhatir, M.: A Coupled Statistical/Semantic
Framework for Merging Heterogeneous domain-Specific Ontologies. In
Proc. of the 22nd International Conference on Tools with Artificial
Intelligence (ICTAI’10), Arras, France, Vol. 2. pp. 159--166, (2010)
[18] Croft, B., Metzler, D. and Strohman, T.: Search Engines: Information
Retrieval in Practice. Addison-Wesley Publishing Company, USA,
(2009)
[19] Cavnar, W. B. and Trenkle, J. M.: N-gram-based text categorization. In
Proceedings of SDAIR-94, 3rd Annual Symposium on Document
Analysis and Information Retrieval. Las Vegas, US, pp. 161--175,
(1994)
[20] Roth, D.: Learning to resolve natural language ambiguities: a unified
approach. In Proceedings of AAAI-98, 15th Conference of the American
Association for Artificial Intelligence. Madison, US, pp. 806--813,
(1998)
[21] Cunningham, H., Maynard, D., Bontcheva, K., and Tablan, V.: GATE:
a framework and graphical development environment for robust NLP
tools and applications. Proc. of the 40th Anniversary Meeting of the
Association for Computational Linguistics, Phil.,USA, (2002)
[22] Cilibrasi R., Vitanyi P.: The Google Similarity Distance. IEEE
Transactions on knowledge and data engineering. 19(3), pp. 370--383,
(2007)
[23] Zhong, Q., Li, H., Li, J., Xie, G., Tang, J., Zhou, L., and Pan, Y.: “A
Gauss Function Based Approach for Unbalanced Ontology Matching”.
SIGMOD’09, pp. 669-680, (2009)
[24] http://oaei.ontologymatching.org/2007/results/environemt/gold_standard
Discovering semantic correspondences between ontology elements is a crucial task for merging heterogeneous ontologies. Most ontology merging tools use several methods to aggregate and combine similarity measures. In addition, some of the ontology merging systems exploit external resources such as, Linguistic Knowledge Bases (e.g. WordNet) to support this task. However, the quality of their results is subjected to the limitations of the exploited knowledge base. In this paper, we present a framework that exploits multiple knowledge bases that cover information in multiple domains for: i) Indentifying and correcting incorrect semantic relations between the concepts of domain-specific ontologies. This is a primary step before ontology merging; ii) Merging domain-specific ontologies; and iii) Handling the issue of missing background knowledge in the exploited knowledge bases by utilizing statistical techniques. An experimental instantiation of the framework and comparisons with state-of-the-art syntactic and semantic-based systems validate our proposal.