978-1-4673-0024-7/10/$26.00 ©2012 IEEE 1483
2012 9th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD 2012)
Automatic Construction of a Domain-independent
Knowledge Base from Heterogeneous Data Sources
Mohammed Maree, Saadat M. Alhashmi
School of Information Technology
Monash University, Sunway Campus
Kuala Lumpur, Malaysia
Mohammed Belkhatir
Lyon Institute of Technology
University of Lyon I
Lyon, France
Andre Hawit
Mixberry Media Inc.
Burlingame, CA,
USA
Abstract— Manual construction and maintenance of general-purpose knowledge bases form a major limiting factor in their full adoption, use and reuse in practical settings. In this
paper, we present KnowBase, a system for automatic knowledge
base construction from heterogeneous data sources including
domain-specific ontologies, general-purpose ontologies, plain
texts, and image and video captions, which are automatically
extracted from WebPages. In our approach, several information
extraction techniques are integrated to automatically create,
enrich, and keep the knowledge base up to date. Consequently,
knowledge represented by the produced knowledge base can be
employed in several application domains. In our experiments, we
used the produced knowledge base as an external resource to
align heterogeneous ontologies from the environmental and
agricultural domains. The produced results demonstrate the
effectiveness of the used knowledge base in finding corresponding
entities between the used ontologies.
Keywords- Knowledge Base; Information Extraction; Pattern
Acquisition; Merging; Heterogeneous Data Sources; Experimental
Validation
I. INTRODUCTION
Recently, several approaches have been proposed to
automatically build knowledge bases. These approaches either
rely on a single data source such as Wikipedia1 to create the
knowledge base [1] or build knowledge bases for specific
domains such as the medical, biomedical and terrorism
domains [2][3][4]. In this paper, we present KnowBase, a
system for automatic knowledge base construction from
heterogeneous data sources including domain-specific
ontologies, general-purpose ontologies, plain texts, and image
and video captions; which are automatically extracted from
WebPages. To obtain domain-specific ontologies, we use
Swoogle semantic Web search engine [5]. For each domain, we
submit queries including keywords that are related to that
domain and download the returned ontologies. We obtain
general-purpose ontologies from online ontology repositories
and libraries on the Web. Plain texts are automatically
extracted from the Web using a Web crawler. The extracted
texts from relevant websites are then processed using several
Natural Language Processing (NLP) techniques. To identify
and extract image and video captions from WebPages, we use
the DOM Tree-based webpage segmentation algorithm that is
proposed in [6].
1 http://www.wikipedia.org/
Figure 1. Example of the Output of the DOM Tree-based Webpage Segmentation Algorithm
The segmentation process is based on the Document Object
Model (DOM) Tree structure of WebPages. In this algorithm,
multimedia documents on the Web are classified into three
categories: Listed, Semi-listed, and Unlisted documents. For
every extracted multimedia document, the segmentation
method only searches the surrounding region, making it more
efficient and scalable for large websites that contain huge
numbers of multimedia documents. As shown in Figure 1,
given a webpage as input, the algorithm processes the DOM
tree of the webpage and extracts the segments as output.
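The surrounding-region idea can be illustrated with a minimal sketch; the HTML snippet, tag names, and parent-lookup heuristic below are our own illustrative assumptions, not the actual algorithm of [6]:

```python
# Minimal sketch (not the implementation of [6]): given a webpage's DOM
# tree, look only in the region surrounding each <img> node for caption
# text, instead of scanning the whole page.
import xml.etree.ElementTree as ET

HTML = """<div>
  <p>Unrelated paragraph.</p>
  <figure>
    <img src="cat.jpg"/>
    <figcaption>A cat sleeping on a sofa</figcaption>
  </figure>
</div>"""

def extract_captions(html: str) -> list[str]:
    root = ET.fromstring(html)
    captions = []
    # Map each node to its parent so we can inspect only the surrounding region.
    parent = {child: p for p in root.iter() for child in p}
    for img in root.iter("img"):
        region = parent.get(img, root)  # the segment enclosing the image
        text = " ".join(t.strip() for t in region.itertext() if t.strip())
        if text:
            captions.append(text)
    return captions

print(extract_captions(HTML))  # -> ['A cat sleeping on a sofa']
```

Because only the segment around each image is inspected, the cost grows with the number of images rather than with the size of the whole page, which is the property that makes the approach scalable.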
Due to the heterogeneity of the used resources, different
information extraction techniques are required. These
techniques are integrated to build a coherent structure of the
knowledge base. For instance, to extract facts from plain texts,
and image and video captions, we utilize several NLP,
statistical-based and named-entity recognition techniques. On
the other hand, we employ other extraction and merging
techniques to populate the knowledge base with knowledge
triples (entity2-relation-concept), which are obtained from the
downloaded domain-specific and general-purpose ontologies.
2 Entity in this context refers to a concept or an instance of a concept
Details of these techniques are presented in Section 3A. The
main contributions of our work are summarized as follows:
• Exploiting heterogeneous data sources for automatic
knowledge base construction.
• Combining several information extraction techniques and
aggregating their output to construct and enrich the
knowledge base.
The rest of this paper is organized as follows. Section 2
presents the details of some other knowledge base construction
and population systems. The theoretical framework for
automatic construction and update of domain-independent
knowledge bases is presented in Section 3. In this section, we
also present the details of the methods that are used in the
proposed system. Section 4 discusses the experiments that
were carried out to construct the knowledge base, as well as to
exploit the constructed knowledge base in aligning
heterogeneous ontologies. Section 5 presents the conclusions
and outlines the future work.
II. RELATED WORK
In this section, we discuss the related work in terms of two
different aspects: (a) knowledge base construction approaches
and (b) issues related to automatically identifying and
extracting entities and facts from heterogeneous data sources.
A. Knowledge Base Construction
Knowledge base construction has always been at the heart
of the Semantic Web (SW) technology. However, with the
continuous expansion of the Web, this task has become more
difficult [7]. To address this issue, several knowledge base
construction systems have been proposed [8, 9, 10, 11]. Some
of these systems rely on human input to enrich the knowledge
base and keep it up-to-date. Examples of the produced
knowledge bases by such systems are Freebase3 and True
Knowledge4. On the other hand, other systems rely on a single
data source to create the knowledge base or create knowledge
bases that are related to a particular domain. For instance,
Geifman and Rubin proposed to model and store knowledge
about age-related phenotypic patterns and events in an Age-
Phenome Knowledge Base (APK) [3]. Another example is the
terrorism knowledge base, which contains all relevant
knowledge about terrorist groups, their members, leaders,
affiliations, and full descriptions of specific terrorist events [4].
This knowledge base was integrated into Cyc [12], which is a
general-purpose ontology that captures knowledge from
multiple domains. In our approach, we not only aim at avoiding
the effort required by users to manually maintain and update
the knowledge base, but also at populating and extending the
knowledge base from heterogeneous data sources.
B. Information Extraction
The task of automatically identifying and extracting
entities and facts from heterogeneous data sources is a
prevalent problem.
3 http://www.freebase.com/
4 http://www.trueknowledge.com/
For each data source, we need to identify
and utilize different information extraction methods. The
ultimate goal of these methods is to automatically construct
and populate the knowledge base with entities, facts (these are
automatically extracted from plain texts and image and video
captions from WebPages), and knowledge triples, which are
automatically extracted from online ontologies. For instance,
to extract information from plain texts on the Web,
TextRunner [13] employs heuristics to produce extractions in
the form of a tuple t = (e_i, r_ij, e_j), where e_i and e_j are strings
meant to denote entities, and r_ij is a string meant to denote a
relationship between them. Similar systems are GRAZER [14]
and KnowItAll [15]. In GRAZER, the inputs are seed facts for
given entities, which are automatically generated using
specialized wrappers. KnowItAll used bootstrapping to extract
patterns and facts simultaneously from text [15]. The relevant
pages are retrieved from a search engine via a query composed
of keywords in a pattern. The initial seed set contains a few
hand-generated patterns. In our approach, we combine several
NLP, statistical-based and named-entity recognition
techniques to extract facts, entities and their attributes from
textual information on the Web.
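A TextRunner-style extraction of (e_i, r_ij, e_j) tuples can be sketched with a single hand-written lexico-syntactic pattern; the pattern and relation strings below are illustrative assumptions, not the heuristics of [13]:

```python
# Toy open-information-extraction sketch: a hand-written pattern pulls
# (entity, relation, entity) triples out of plain text. Real systems such
# as TextRunner learn their extractors instead of hard-coding them.
import re

PATTERN = re.compile(
    r"([A-Z][a-zA-Z]+)\s+(is a|is located in|works for)\s+([A-Z][a-zA-Z]+)"
)

def extract_triples(text: str):
    return [(e1, rel, e2) for e1, rel, e2 in PATTERN.findall(text)]

print(extract_triples("Paris is located in France. Lyon is located in France."))
# -> [('Paris', 'is located in', 'France'), ('Lyon', 'is located in', 'France')]
```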
Another data source that can be exploited for knowledge
base construction is online ontologies. For example, the authors
of [16] propose to construct ontologies from online domain-
specific ontologies. To do this, they submit queries to Swoogle
SW search engine and download relevant ontologies from the
list of the returned results. In addition, they employ ranking and
segmentation techniques to rank the returned ontologies and
extract segments from them. To address the issue of overlap
between the extracted segments, the authors propose to merge
overlapping segments into a single representation. We build on
this work to automatically download domain-specific and
general-purpose ontologies from Swoogle SW search engine
and other ontology repositories and libraries on the Web.
Considering domain-specific ontologies, we merge them using
the merging techniques proposed in our previous work [17]. In
this context, for each domain of interest, we merge its relevant
ontologies using semantic, name and statistical based ontology
merging techniques. Then, we extract knowledge triples from
the merged domain-specific ontologies and other general-
purpose ontologies to populate the knowledge base.
III. THEORETICAL BASES AND SYSTEM DESCRIPTION
A knowledge base is a repository of facts about entities and
their relationships. A fact is a triple consisting of entity-
relation-entity structure. Entities are related through different
types of semantic and taxonomic relations. These relations are
automatically extracted from multiple heterogeneous data
sources such as plain texts, image and video captions, and
online ontologies. A detailed discussion on these data sources
and the techniques that we use in our system are given in the
next sections.
A. Knowledge Base Construction from Online Domain-
specific Ontologies
Among the data sources that we exploit to construct the
knowledge base are online domain-specific ontologies. A
domain-specific ontology can be formally defined as:
Definition 1: A domain-specific ontology Ω is a 4-tuple 〈C,
R, I, A〉 where:
• C = {c_i : i ∈ [1, Card(C)]} represents the set of domain
concepts of the ontology.
• R = {r_i : i ∈ [1, Card(R)]} represents the set of semantic
relations holding between the ontology concepts.
• I is the set of instances or individuals.
• A is the set of axioms verifying A = {(r_i, c_j, c_k)} s.t.
i ∈ [1, Card(R)], j, k ∈ [1, Card(C)], c_j, c_k ∈ C and r_i ∈ R.
We associate to the ontology a logic-based translation towards a
predicate calculus PC, based on C, R, I, A and a set of predicate
symbols P ⊇C∪R. Predicates linked to C are monadic while
those linked to R are dyadic.
Φo : C ∪ A ∪ {Ω} → PC
Φo(c) = c(x_c), where x_c ∈ I
Φo(a) = r(c_i, c_j), where r ∈ R and c_i, c_j ∈ C
Φo(Ω) = (∧_{c_i ∈ C} Φo(c_i)) ∧ (∧_{a_i ∈ A} Φo(a_i))
To address the semantic heterogeneity problem (i.e.
conceptual and terminological differences between domain-
specific ontologies), we employ the merging techniques
proposed in [17]. Formally, a merging algorithm can be defined
as:
Definition 2: Merging: Given two domain-specific ontologies
Ω1 and Ω2, the merging operation finds semantic
correspondences between their concepts and produces a single
merged ontology Ωmerged as output. Semantic correspondences
between both ontologies are 4-tuples 〈Cid, Ci, Cj, r〉 such that:
• Cid is a unique identifier of the correspondence.
• Ci ∈ Ω1, Cj ∈ Ω2 are corresponding concepts of the input
ontologies.
• r ∈ R is a semantic relation holding between both
elements Ci and Cj.
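A toy instance of Definition 2 can be sketched as follows; the purely name-based matching pass is our own simplification, since the merging framework of [17] also uses semantic and statistical evidence:

```python
# Sketch of Definition 2: emit 4-tuple correspondences <Cid, Ci, Cj, r>
# between two ontologies. Only name-based (case-insensitive) equivalence
# is shown here; this is an illustrative simplification of [17].
from dataclasses import dataclass

@dataclass
class Correspondence:
    cid: int   # unique identifier of the correspondence
    c_i: str   # concept from ontology 1
    c_j: str   # concept from ontology 2
    r: str     # semantic relation between c_i and c_j

def name_based_merge(onto1: list[str], onto2: list[str]) -> list[Correspondence]:
    matches, cid = [], 0
    lower2 = {c.lower(): c for c in onto2}
    for c in onto1:
        if c.lower() in lower2:
            matches.append(Correspondence(cid, c, lower2[c.lower()], "equivalent"))
            cid += 1
    return matches

out = name_based_merge(["Soil", "Water", "Crop"], ["water", "crop", "climate"])
print([(m.c_i, m.c_j) for m in out])  # -> [('Water', 'water'), ('Crop', 'crop')]
```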
To represent the knowledge base we use an Entity-
Relationship graph. Such a graph connects different entities E
from multiple domains through different types of relations R.
In this context, a finite set of entity labels are represented by
the graph nodes and directed links between those nodes are
used to represent R. This can be formally represented as
follows:
Definition 3: Knowledge Base: A knowledge base KB is a
structure KB:= (CKB, RKB, IKB) where:
• CKB: is the set of concepts that are defined in KB.
Generally, CKB:={C ∪ Cmiss}, where Cmiss is the set of
concepts that are not defined in Ω but are defined in KB.
This is due to the fact that KB covers information across
multiple domains and is not limited to a particular domain
as Ω. However, we may find that C includes concepts
{C1, C2, . . . , Cn} which are not defined in CKB. For this
particular case, we utilize statistical-based techniques to
enrich KB with the set of concepts {C1, C2, . . . , Cn}.
• RKB: is the set of semantic and taxonomic relations that are
defined to relate the concepts in CKB.
• IKB: is the set of instances of the concepts that are defined
in KB.
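The structure KB := (C_KB, R_KB, I_KB) of Definition 3 can be held in an entity-relationship graph; the class below is a minimal sketch under our own storage assumptions (adjacency sets keyed by relation), not the system's actual data layout:

```python
# Minimal sketch of KB := (C_KB, R_KB, I_KB) from Definition 3, stored as
# an entity-relationship graph: nodes are entity labels, directed links
# carry relation labels.
from collections import defaultdict

class KnowledgeBase:
    def __init__(self):
        self.concepts = set()               # C_KB: concept labels (graph nodes)
        self.instances = defaultdict(set)   # I_KB: concept -> its instances
        self.relations = defaultdict(set)   # R_KB: (subject, relation) -> objects

    def add_triple(self, subj: str, rel: str, obj: str) -> None:
        """Add an entity-relation-entity fact as a directed labeled edge."""
        self.concepts.update({subj, obj})
        self.relations[(subj, rel)].add(obj)

    def add_instance(self, concept: str, instance: str) -> None:
        self.concepts.add(concept)
        self.instances[concept].add(instance)

kb = KnowledgeBase()
kb.add_triple("Corporate Body", "is-a", "Organization")
kb.add_instance("Organization", "Monash University")
print("Organization" in kb.concepts)  # -> True
```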
B. Knowledge Base Construction from Plain Texts
In our approach, we divide plain texts into two categories.
The first category consists of plain texts that are automatically
extracted from Web documents. These are extracted using a
Web crawler. The second category consists of image and
video captions, which are used to describe the content of these
types of multimedia documents on the Web. As discussed in
Section 1, in order to identify and extract the captions of such
multimedia documents we use the DOM Tree-based Webpage
segmentation algorithm that is proposed in [6]. To process
texts from both categories we first utilize several NLP
techniques such as stopword removal [18], tokenization [19],
and Part Of Speech (POS) tagging [20]. Then, we extract
named entities through employing GATE [21], which is a
syntactical pattern matching entity recognizer enriched with
gazetteers. Although the coverage of GATE is limited to a
certain number of named-entities, additional rules can be
defined in order to expand its coverage. However, the process
of manually enriching GATE’s rules can be difficult and
time-consuming. Therefore, we exploit the constructed
knowledge base as a supplementary source of named-entity
recognition. In this context, entities that are not defined in
GATE are submitted to the knowledge base to find whether
they are defined in it or not. After extracting named entities
from texts, we utilize a statistical-based semantic relatedness
measure to compute the degree of semantic relatedness
between the extracted entities. This measure is based on the
Normalized Retrieval Distance (NRD) function [17]. This
function is formally defined as follows:
Definition 4: Normalized Retrieval Distance (NRD): is an
adapted form of the Normalized Google Distance (NGD) [22]
function that measures the semantic relatedness between pairs
of entities (such as concepts or instances): Given two entities
E_miss and E_in, the Normalized Retrieval Distance between
E_miss and E_in can be obtained as follows:
NRD(E_miss, E_in) = [ max{log f(E_miss), log f(E_in)} − log f(E_miss, E_in) ] / [ log M − min{log f(E_miss), log f(E_in)} ]   (1)
Where,
• E_miss is an entity that is recognized by GATE but not
defined in the knowledge base, KB.
• E_in is an entity that exists in the knowledge base, KB.
• f(E_miss) is the number of hits for the search term, E_miss.
• f(E_in) is the number of hits for the search term, E_in.
• f(E_miss , E_in) is the number of hits for the search terms,
E_miss and E_in.
• M is the number of WebPages indexed by the search
engine.
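Eq. (1) translates directly into code; the hit counts below are hard-coded stand-ins for real search-engine responses (an assumption for illustration only):

```python
# Sketch of the Normalized Retrieval Distance from Eq. (1). In the real
# system f() comes from search-engine hit counts; here the counts are toy
# values chosen for illustration.
import math

def nrd(f_miss: int, f_in: int, f_both: int, M: int) -> float:
    """NRD between E_miss and E_in given hit counts and index size M."""
    num = max(math.log(f_miss), math.log(f_in)) - math.log(f_both)
    den = math.log(M) - min(math.log(f_miss), math.log(f_in))
    return num / den

# The two terms co-occur frequently, so the distance comes out small.
print(round(nrd(f_miss=9000, f_in=8000, f_both=6000, M=10**10), 3))  # -> 0.029
```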
Unlike the NGD function, the NRD function returns
different semantic relatedness measures according to several
search engines (Google, AltaVista, Yahoo!). Therefore, we
sum up all NRD values for each candidate entity. This
summation represents an aggregated decision made by several
search engines on the degree of semantic relatedness between
the entities E_miss and E_in . The returned semantic relatedness
measures indicate whether or not two entities are strongly
related, based on a threshold value v = 0.5. Therefore,
entities with semantic relatedness measures > v will be
considered for further processing by the Semantic Relation
Extractor (SRE) function. This function takes as input pairs of
entities with strong semantic relatedness measures and
produces as output the suggested semantic relation(s) between
them based on a set of pre-defined lexico-syntactic patterns.
Details on this function can be found in [17].
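An SRE-style step can be sketched as follows; the two Hearst-style patterns below are illustrative assumptions, not the pattern set defined in [17]:

```python
# Hedged sketch of the Semantic Relation Extractor step: entity pairs that
# passed the relatedness threshold are matched against lexico-syntactic
# (Hearst-style) patterns to suggest a relation. Patterns are illustrative.
import re

PATTERNS = [
    (re.compile(r"(\w[\w ]*?) such as (\w[\w ]*)"), "hypernym-of"),
    (re.compile(r"(\w[\w ]*?) is a kind of (\w[\w ]*)"), "hyponym-of"),
]

def suggest_relation(sentence: str):
    """Return (entity1, relation, entity2) for the first matching pattern."""
    for pattern, relation in PATTERNS:
        m = pattern.search(sentence)
        if m:
            return (m.group(1).strip(), relation, m.group(2).strip())
    return None

print(suggest_relation("organizations such as universities"))
# -> ('organizations', 'hypernym-of', 'universities')
```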
C. Automatic Knowledge Base Enrichment
Enrichment of the knowledge base is performed at two
parallel stages. In the first stage, entities (concepts or their
instances) of the merged domain-specific ontologies and other
general-purpose ontologies are submitted to the knowledge
base. If any of the entities does not exist in the knowledge
base, then it will be transferred to the second stage, wherein
the entity is automatically extracted and added to the Entity-
Relationship graph of the knowledge base based on its
context. A context represents all concepts that exist at the
semantic path(s) of each entity. To illustrate this step, we take
an example of a merged ontology that describes the
organizations domain.
Figure 2. Part of a Merged Ontology about the Organizations Domain.
Concepts are related through is-a Transitive Relation
In Figure 2, we see that the contexts of the concept
“Corporate_Body” are:
1. {“Organization”, “Body”, “Gathering”, “Psychological
Feature”, “Abstraction”, “Abstract Entity”, “Entity”}
2. {“Organization”, “Social Group”, “Group”,
“Abstraction”, “Abstract Entity”, “Entity”}
We call each of these contexts a semantic path. For instance,
to enrich the knowledge base with the entity “Corporate
Body”, we first submit it to the knowledge base to find
whether it is already defined in it or not. Assuming that this
entity is missing from the knowledge base, we traverse the
hierarchy of the semantic paths of this entity and extract the
concepts in each path. Then, in ascending order, we attempt
to find whether the parents of the missing entity exist in the
knowledge base or not. This loop is repeated until we reach
the root node in the graph. Accordingly, we consider the
following cases for enriching the knowledge base:
A) If the parent p of the missing entity e (e.g. “Corporate
Body” is a missing entity in our example) is also missing
from the knowledge base KB, then we extract the segment
of the semantic path from the merged ontology Ωmerged
that consists of the triple: e-relation-p. For example,
assuming the concept “Organization” is also missing
from KB, we extract the triple: Corporate Body is-a
Organization from Ωmerged. Then, the Entity-Relationship
graph of KB will be updated by adding two new nodes
(“Corporate Body” & “Organization”) and linking them
through (is-a) relation.
B) If the parent p of the missing entity e is already defined in
the KB, then we attach e directly to p in the hierarchy of
the KB. Here, it is important to mention that p might have
different meanings (senses) in KB. Therefore, it is
important to disambiguate the meaning of p before linking
it to e. To do this, we compare the contexts of the entity e
to the contexts of p in KB. Accordingly, e will be linked
to p based on the similarity between their contexts. For
instance, if we have the following senses for the entity
“Organization” in KB:
1. Organization: A group of people who work
together.
2. Organization: An organized structure for
classifying.
Figure 3. Senses of the Concept Organization in KB
We compare the context of “Organization” from KB to
context of the same concept from Ωmerged. We find that the
most similar contexts are context No. 2 from Figure 2
({“Organization”, “Social Group”, “Group”, “Abstraction”,
“Abstract Entity”, “Entity”}) and context No. 1 from Figure 3.
Therefore, we link “Corporate Body” to the semantic path that
represents the first sense of “Organization” in KB. The result
of this step is shown in Figure 4.
Figure 4. Adding a New Concept to KB
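The context-comparison step just described can be sketched as a set-overlap computation; the Jaccard score is our own assumption here, since the paper does not fix a specific context-similarity function:

```python
# Sketch of sense disambiguation by context comparison: the missing entity
# is linked to the sense of its parent whose KB context (semantic path)
# overlaps most with the entity's own context. Jaccard overlap is an
# illustrative choice of similarity measure.
def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b)

# Context of "Corporate Body" from the merged ontology (Figure 2, path 2).
entity_context = {"Organization", "Social Group", "Group",
                  "Abstraction", "Abstract Entity", "Entity"}

# Toy contexts for the two senses of "Organization" in KB (Figure 3).
kb_senses = {
    1: {"Organization", "Social Group", "Group", "Entity"},
    2: {"Organization", "Knowledge Structure", "Structure"},
}

best = max(kb_senses, key=lambda s: jaccard(entity_context, kb_senses[s]))
print(best)  # -> 1 (the "group of people who work together" sense)
```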
IV. EXPERIMENTAL RESULTS
In the following sections, we discuss experiments in terms
of two different aspects. First, we discuss the experiments that
we carried out to construct and populate the knowledge base.
Second, we experimentally demonstrate the effectiveness of
employing the knowledge that is represented by the constructed
knowledge base in real application domains. We implemented
all solutions in Java and experiments were performed on a PC
with a dual-core 3.0 GHz CPU and 4 GB of RAM. The
operating system that was used is OpenSuse 11.1.
A. Automatic Construction and Population of the Knowledge
Base
In this section we discuss the experiments that we carried
out to automatically construct and populate the knowledge
base. The sources that we used for this purpose are 500 text
documents obtained from the Web, 17855 image and video
captions extracted from WebPages, 35 domain-specific
ontologies downloaded using Swoogle SW search engine, and
6 general purpose ontologies downloaded from online ontology
repositories and libraries. To obtain the text documents, we
developed a script to query general-purpose search engines
such as Google and AltaVista about several concepts from
different domains. Examples of these domains are (Sport,
Medicine, Programming Languages, and Universities). We
manually selected the top-10 results from the lists of returned
results by each search engine. To obtain the image and video
captions, we used the DOM Tree-based Webpage segmentation
algorithm (described in Section 1). To download online
domain-specific ontologies, we submitted queries to Swoogle
search engine and selected those that are relevant to each
query’s intent. Then, for each domain, we merged its relevant
ontologies using the merging techniques (described in Section
2). The total size of both types of the used ontologies is 8.621
GB. The current version of the constructed knowledge base
consists of 2,404,485 entities (384,051 concepts and 2,020,434
instances). It is important to mention that the produced
knowledge base is still evolving and we update it continually.
B. Using the Produced Knowledge Base in Real Application
Domains
In this section, we describe the experiments that we carried
out to validate our proposal of employing knowledge
represented by the produced knowledge base in several
application domains. Certainly, it can be employed for other
purposes such as semantic-based indexing of multimedia
documents on the Web, computing the degree of semantic
relatedness, query reformulation, document clustering, and
ontology alignment and mapping, etc. However, in these
experiments, we used it as an external resource to find
alignments between heterogeneous domain-specific
ontologies. In this context, we attempted to find alignments
(i.e. correspondences) between the concepts and instances of
the three real-world heavyweight ontologies (GEMET,
AGROVOC, and NAL). Details on these ontologies are listed
in [23]. To compute precision and recall, we used the official
gold-standard alignments [24] that are provided by the OAEI
2007 environment task organizers. These sample alignments
are classified into different domains such as alignments in the
chemistry, geography and agriculture domains. We used the
sample alignments to compute the precision and recall of our
system in each domain. A comparison between the results of
our system, S1 and the gold standard reference alignments is
shown in Table 1.
TABLE I. USING THE PRODUCED KNOWLEDGE BASE TO FIND ALIGNMENTS BETWEEN HEAVYWEIGHT ONTOLOGIES

Task                  | Matches by our system | Matches by S1   | # in Ref. Alignments
GEMET-AGROVOC
  Chemistry-Precision | 14 out of 14          | 14 out of 14    | 14
  Geography-Precision | 23 out of 23          | 23 out of 23    | 23
  Geography-Recall    | 87 out of 87          | 87 out of 87    | 87
  Agriculture-Recall  | 61 out of 61          | 61 out of 61    | 61
  Misc-Precision      | 28 out of 28          | 28 out of 28    | 28
  Tax-Precision       | 21 out of 21          | 21 out of 21    | 21
NAL-AGROVOC
  Chemistry-Precision | 141 out of 141        | 141 out of 141  | 141
  Geography-Precision | 58 out of 58          | 58 out of 58    | 58
  Misc-Precision      | 231 out of 231        | 231 out of 231  | 231
  Tax-Precision       | 10 out of 10          | 10 out of 10    | 10
  Eur-Recall          | 62 out of 62          | 62 out of 62    | 62
  Geography-Recall    | 58 out of 58          | 58 out of 58    | 58
GEMET-NAL
  Chemistry-Precision | 30 out of 30          | 30 out of 30    | 30
  Geography-Precision | 17 out of 17          | 17 out of 17    | 17
  Misc-Precision      | 29 out of 29          | 29 out of 29    | 29
  Tax-Precision       | 15 out of 15          | 15 out of 15    | 15
  Agriculture-Recall  | 61 out of 61          | 61 out of 61    | 61
  Geography-Recall    | 77 out of 77          | 77 out of 77    | 77
As shown in Table 1, despite the heterogeneity of the
alignment tasks and domains, we were able to find the same
number of equivalent concepts as in the reference alignments
provided in the gold standard.
V. CONCLUSIONS AND FUTURE WORK
In this paper, we presented an automatically constructed
domain-independent knowledge base, developed by our
system KnowBase. Unlike traditional knowledge base
construction and population approaches, we used
heterogeneous data sources to create the knowledge base. In
addition, we avoided the human effort that is required to
control and update the knowledge base. To do this, we
employed several NLP, statistical-based and named-entity
recognition techniques. We aggregated the outputs of these
techniques to automatically populate and enrich the
knowledge base. The current version of the produced
knowledge base consists of 2,404,485 entities (384,051
concepts and 2,020,434 instances). We employed knowledge
represented by this knowledge base to find alignments
between heterogeneous ontologies in the environmental and
agricultural domains. We experimentally demonstrated that we
were able to find the same number of alignments between the
entities of the used ontologies as in the reference alignments
which are provided in the gold standard and in S1. In the
future work, we plan to integrate other existing publicly
available knowledge bases to our knowledge base. In addition,
we plan to exploit additional ontologies from online ontology
repositories on the Web to enrich and expand the coverage of
our knowledge base.
REFERENCES
[1] Finin, T., Syed, Z., Mayfield, Z., McNamee, P., and Piatko, C.: Using
wikitology for cross-document entity coreference resolution. In
Proceedings of the AAAI Spring Symposium on Learning by Reading
and Learning to Read, pp. 29--35, (2009).
[2] Wishart, D.S., Knox, C., Guo, A., Eisner, R., Young, N., Gautam, B.,
Hau, D.D., Psychogios, N., Dong, E., Bouatra, S., Mandal, R.,
Sinelnikov, I., Xia, J., Jia, L., Cruz, J.A., Lim, E., Sobsey, C.A.,
Shrivastava, S., Huang, P., Liu, P., Fang, L., Peng, J., Fradette, R.,
Cheng, D., Tzur, D., Clements, M., Lewis, A., Souza, A.D., Zuniga, A.,
Dawe, M., Xiong, Y., Clive, D., Greiner, R., Nazyrova, A.,
Shaykhutdinov, R., Li, L., Vogel, H.J., Forsythe, I.J.: HMDB: a
knowledgebase for the human metabolome. Nucleic Acids Research, pp.
603--610, (2009)
[3] Geifman, N., and Rubin, E.: Towards an Age-Phenome Knowledge-
base. BMC Bioinformatics, 12:229, doi:10.1186/1471-2105-12-229.
(2011)
[4] Deaton, C., Shepard, B., Klein, C., Mayans, C., Summers, B., Brusseau,
A., and Witbrock, M.: The comprehensive terrorism knowledge base in
Cyc. In Proceedings of the 2005 International Conference on
Intelligence Analysis, (2005)
[5] Ding, L., Finin, T., Joshi, A., Pan, R., Cost, R.S., Peng, Y., Reddivari,
P., Doshi, V. and Sachs, J.: Swoogle: A semantic web search and
metadata engine. In Proc. 13th ACM Conf. on Information and
Knowledge Management, Nov. pp. 652--659, (2004)
[6] Fauzi, F., Belkhatir, M., Hong, J.: Webpage Segmentation for Extracting
Images and Their Surrounding Contextual Information. In ACM
Multimedia’09, Beijing, China. 649--652, (2009)
[7] Gregory, M., McGrath, L., Bell, E., O’Hara, K., and Domico, K.:
Domain Independent Knowledge Base Population From Structured and
Unstructured Data Sources. In Proc. of the 24th International Florida
Artificial Intelligence Research Society Conference. pp. 251--256,
(2011)
[8] Suchanek, F. M., Kasneci, G., Weikum, G.: YAGO: A Core of Semantic
Knowledge Unifying WordNet and Wikipedia. In Proc. of the 16th
International World Wide Web (WWW) conference. pp. 697--706,
(2007)
[9] Hoffart, J., Suchanek, M. F., Berberich, K., Kelham, E. L., de Melo, G.,
and Weikum., G.: YAGO2: Exploring and Querying World Knowledge
in Time, Space, Context, and Many Languages. In Proc. of the 20th
International World Wide Web Conference (WWW 2011) Hyderabad,
India, pp. 229--232, (2011)
[10] Etzioni, O., Cafarella, M.J., Downey, D., Kok, S., Popescu, A., Shaked,
T., Soderland, S., Weld, D.S., Yates, A.: Web-scale information
extraction in knowitall: (preliminary results). In WWW, pp. 100--110,
(2004)
[11] Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., and Ives,
Z. G.: DB-pedia: A nucleus for a web of open data. In The Semantic
Web, 6th International Semantic Web Conference, 2nd Asian Semantic
Web Conference, ISWC 2007 + ASWC 2007, Busan, Korea, volume
4825 of Lecture Notes in Computer Science, pp. 722--735. (2007)
[12] Lenat, D. B.: Cyc: a large-scale investment in knowledge infrastructure.
Communications of the ACM, 38(11), pp. 33--38, (1995)
[13] Etzioni. O., Banko, M., Soderland, S., Weld, S. D.: Open Information
Extraction from the Web. Communications of the ACM, Vol 51, No.12,
pp. 68--74, (2008)
[14] Zhao, S. and Betz, J.: Corroborate and Learn Facts from the Web. In
KDD '07: Proceedings of the 13th ACM SIGKDD International
Conference on Knowledge discovery and data mining, pp. 995--1003.
ACM, (2007)
[15] Etzioni, O., Cafarella, M., Downey, D., Popescu, A.-M., Shaked, T.,
Soderland, S., Weld, D. S., and Yates, A.: Unsupervised named-entity
extraction from the Web: An experimental study. Artificial Intelligence,
165(1): pp. 91--134, (2005)
[16] Alani, H.: Ontology Construction from Online Ontologies. Proceedings
of the 15th international conference on World Wide Web, WWW 2006,
Edinburgh, Scotland, UK. pp. 491--495, (2006)
[17] Maree, M., and Belkhatir, M.: A Coupled Statistical/Semantic
Framework for Merging Heterogeneous domain-Specific Ontologies. In
Proc. of the 22nd International Conference on Tools with Artificial
Intelligence (ICTAI’10), Arras, France, Vol. 2. pp. 159--166, (2010)
[18] Croft, B., Metzler, D. and Strohman, T.: Search Engines: Information
Retrieval in Practice. Addison-Wesley Publishing Company, USA,
(2009)
[19] Cavnar, W. B. and Trenkle, J. M.: N-gram-based text categorization. In
Proceedings of SDAIR-94, 3rd Annual Symposium on Document
Analysis and Information Retrieval. Las Vegas, US, pp. 161--175,
(1994)
[20] Roth, D.: Learning to resolve natural language ambiguities: a unified
approach. In Proceedings of AAAI-98, 15th Conference of the American
Association for Artificial Intelligence. Madison, US, pp. 806--813,
(1998)
[21] Cunningham, H., Maynard, D., Bontcheva, K., and Tablan, V.: GATE:
a framework and graphical development environment for robust NLP
tools and applications. Proc. of the 40th Anniversary Meeting of the
Association for Computational Linguistics, Phil.,USA, (2002)
[22] Cilibrasi R., Vitanyi P.: The Google Similarity Distance. IEEE
Transactions on knowledge and data engineering. 19(3), pp. 370--383,
(2007)
[23] Zhong, Q., Li, H., Li, J., Xie, G., Tang, J., Zhou, L., and Pan,Y: “A
Gauss Function Based Approach for Unbalanced Ontology Matching”.
SIGMOD’09, pp. 669-680, (2009)
[24] http://oaei.ontologymatching.org/2007/results/environemt/gold_standard