Exploring Term Networks for Semantic Search over RDF Knowledge Graphs

Edgard Marx¹,³, Konrad Höffner¹, Saeedeh Shekarpour¹, Axel-Cyrille Ngonga Ngomo¹,
Jens Lehmann², and Sören Auer²
¹ AKSW, University of Leipzig, Germany
² Computer Science Institute, University of Bonn / Fraunhofer IAIS
³ Instituto de Pesquisa e Desenvolvimento Albert Schirmer
⁴ Knoesis Center, USA
Abstract.
Information retrieval approaches are considered a key technology to empower lay users
to access the Web of Data. A large number of related approaches such as Question Answering
and Semantic Search have been developed to address this problem. While Question Answering
promises more accurate results by returning a specific answer, Semantic Search engines are
designed to retrieve the best top-K ranked resources. In this work, we propose *path, a
Semantic Search approach that explores term networks for querying RDF knowledge graphs.
The adequacy of the approach is evaluated employing benchmark datasets against
state-of-the-art Question Answering as well as Semantic Search systems. The results show
that *path achieves a better F1-score than the currently best performing Semantic Search
system.
1 Introduction
The growth of Semantic Web technologies has led to the publication of large volumes
of data. Approximately 10 000 Resource Description Framework (RDF)⁵ datasets are
available via public data portals.⁶ However, retrieving the desired information from
these datasets still poses a significant challenge. Lay users cannot be expected to
familiarize themselves with the underlying query languages and modeling structures.
A major challenge is the efficient retrieval of the resource that best represents the
user’s intent via natural language (NL) keyword queries. Relying solely on off-the-shelf
triple stores or document retrieval may lead to poor performance or precision (see
Section 5). To address this problem, we propose an approach for Semantic Search over RDF
knowledge graphs that explores their Term Network. A Term Network (see Section 4) is a
graph whose vertices are labeled with terms. Overall, our contributions are as follows:
- We develop a new formal model for Semantic Search (SemS) based on Term Networks;
- We present a ranking method that increases the precision of retrieving RDF data;
- We compare our approach with state-of-the-art SemS techniques on the QALD-4 [17]
  benchmark and show that we achieve a higher F1-score.
⁵ http://www.w3.org/RDF
⁶ http://lodstats.aksw.org/
The rest of this paper is organized as follows: The related work is reviewed in
Section 2. Section 3 defines the preliminaries. Section 4 describes the *path model.
Section 5 outlines the evaluation and discusses the results. Finally, Section 6 concludes
with an outlook on potential future work.
2 Related Work
Information retrieval (IR) over Linked Data is an active and diverse research field.
Existing work targets different environments and diverges in complexity and precision.
The related work can be mainly categorized into two types of approaches that recover
information from Linked Data Knowledge Graphs (KGs, see Definition 1): (1) approaches
using conventional IR techniques and (2) approaches answering natural language questions.
Traditional IR systems lack the ability to deal with complex queries, but they are usually
more time efficient. Wang et al. [19] show that pure traditional IR engines are faster
than the combination of a triple store with a full-text index. However, both models
explore the semantics of an NL query for delivering the response by applying statistical
measures and heuristics to the KG. Semantic Search (SemS) approaches aim to retrieve the
top-k ranked resources for a given NL input query.
Swoogle [3] introduces a modified version of PageRank that takes into account the
types of the links between ontologies. Sindice [10], Falcons [2] and Sig.ma [16] explore
traditional document retrieval to index and locate relevant sources and/or resources.
Sindice is a search engine that can retrieve documents containing a given statement.
Falcons uses a built-in mechanism for entity ranking, while Sig.ma allows the use
of constraints to query for particular classes and/or properties. In all cases, the
structure and semantics are not taken into account during the matching phase. YAHOO! BNC [4]
used a local, per-property term frequency as well as a global term frequency. It also
applied a boost based on the number of matched query terms. Umass [4] explored existing
ranking functions applied to four field types: (1) title; (2) name; (3) dbo:title; and
(4) all others. The fields were weighted separately, with a specific boost applied to each
of them. Later, Blanco et al. [1] proposed a modified version of the BM25F ranking function
adapted for RDF data. The function was applied to a horizontal pairwise index structure
composed of the subject and its property values. The most important feature of the
proposed structure is the possibility to assign different weights to predicates. The
proposed adaptation is implemented in the GlimmerY! engine and is shown to be time
efficient as well as to outperform other state-of-the-art methods in ranking RDF resources.
Recently, Virgilio et al. [18] introduced a distributed technique for SemS on RDF data
using MapReduce. The method uses a distributed index of RDF paths. The proposed
strategy returns the best top-k answers in the first k generated results. The retrieval is
done by evaluating the paths containing the terms of the query using two strategies: (1)
Linear and (2) Monotonic. The Linear strategy uses only the highly ranked path(s).
As a consequence, it does not produce an optimum solution but has linear complexity
with respect to the size of the matched entities. The Monotonic strategy uses all matched
paths and, thus, produces better results. Intuitively, measuring all suitable paths from
all entities is less time efficient. Please refer to the work of Mangold [8] for a more
detailed analysis of SemS approaches.
One of the biggest challenges in SemS methods lies in evaluating the relatedness
between the terms in a KG and an NL query. Document retrieval engines rely on term
frequency weighting, which is based on the assumption that the more frequently a term
occurs, the more related it is to the topic of the document [7]. While good retrieval
performance needs to take the frequency into account, it suffers from frequent yet
unspecific words such as “the”, “a” or “in”. Inverse document frequency corrects this by
diminishing the weight of words that occur frequently in the corpus, leading to the
combined term frequency-inverse document frequency (tf-idf) [15] used to score documents
for a query.
3 Preliminaries
We begin by introducing a formal definition of the RDF model. Thereafter, we introduce
fundamental concepts that are required for a full understanding of the rest of the paper.
RDF⁷ is a standard for describing Web resources. A resource can refer to any physical or
conceptual thing, such as a Web site, a person or a device. The RDF data model expresses
statements about resources in the form of subject-predicate-object triples. The subject
denotes a resource; the predicate expresses a property (of the subject) or a relationship
(between subject and object); the object is either a resource or a literal. Resources are
identified with IRIs, a generalization of URIs, while literals are used to identify values
such as numbers and dates by means of a lexical representation.
Definition 1 (RDF Knowledge Graph, KG). Formally, let \(K\) be a finite RDF knowledge
graph (KG). \(K\) can be regarded as a set of triples
\((s, p, o) \in (I \cup B) \times P \times (I \cup L \cup B)\), where \(R = I \cup B\) is
the set of all RDF resources \(r \in R\) in the KG, \(I\) is the set of all IRIs, \(B\) is
the set of all blank nodes, and \(B \cap I = \emptyset\). \(P\) is the set of all
predicates, \(P \subseteq I\). \(L\) is the set of all literals, \(L \subset \Sigma^*\) and
\(L \cap I = \emptyset\), where \(\Sigma\) is the Unicode alphabet. \(E\) is the set of all
entities, \(E = I \cup B \setminus P\). An RDFTerm \(\varphi\) refers to any edge label
\(p \in P\) or vertex in the KG, \(\varphi \in (I \cup B \cup L)\). A KG is modeled as a
directed labeled graph \(G = (V, D)\), where \(V = E \cup L\),
\(D \subseteq E \times (E \cup L)\) and the labeling function⁸ of the edges is a mapping
\(\lambda : D \to P\). We disregard literal language tags and data types.
Figure 1 shows an excerpt of a KG where a literal vertex \(v_i \in L\) is illustrated by
a rectangle and a resource vertex \(v_i \in R\) by an oval. Each edge between two vertices
corresponds to a triple, where the first vertex is called the subject, the labeled edge
the predicate and the second vertex the object. For example, the edge labeled rdfs:label
from e2 to "Mona Lisa" corresponds to the triple <e2, rdfs:label, "Mona Lisa">.
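The graph view of Definition 1 can be sketched in a few lines of Python. This is only an illustration: the triple list is an assumed reading of Figure 1, and the convention that literals keep their quotes, as well as the names `TRIPLES`, `is_literal` and `to_graph`, are hypothetical and not part of the paper.

```python
# A few triples in the spirit of Figure 1; the exact edge set is an assumption.
TRIPLES = [
    ("e2", "rdfs:label", '"Mona Lisa"'),
    ("e2", "dbo:artist", "e1"),
    ("e1", "rdfs:label", '"Leonardo da Vinci"'),
    ("e1", "rdf:type", "r1"),
    ("r1", "rdfs:label", '"Person"'),
]


def is_literal(term: str) -> bool:
    # Toy convention for this sketch only: literals carry their quotes.
    return term.startswith('"')


def to_graph(triples):
    """Build G = (V, D) and the edge labeling lambda: D -> P."""
    vertices = {s for s, _, _ in triples} | {o for _, _, o in triples}
    edges = {(s, o) for s, _, o in triples}
    labeling = {(s, o): p for s, p, o in triples}
    return vertices, edges, labeling


V, D, lam = to_graph(TRIPLES)
print(lam[("e2", "e1")])  # dbo:artist
```

Storing the labeling as a dictionary keyed by (subject, object) mirrors the mapping from D to P, at the cost of allowing only one predicate per vertex pair in this toy version.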
In this work, we address the problem of SemS systems that aim to retrieve the top-k
ranked entities representing the intention behind an NL user query.
Definition 2 (Natural Language Query). An NL query \(q \in \Sigma^*\) is a user-given
keyword string expressing a factual information need.
⁷ https://www.w3.org/TR/REC-rdf-syntax/
⁸ Not to be confused with rdfs:label.
[Figure 1: KG excerpt with the resource vertices r1, e1 and e2, the literal vertices
"Leonardo da Vinci", "Person", "Mona Lisa", "artist" and "type", and the edges p1 to p7
labeled rdfs:label, dbo:artist and rdf:type.]
Fig. 1. An excerpt of a KG. The labels of the rdfs:label properties were omitted for
simplification.
4 Approach
For many years, scientists from diverse fields of cognitive science, including psychology,
neuroscience, philosophy, linguistics and artificial intelligence, have tried to explain
and reproduce the human cognition system. While diverse theories have been developed, a
commonly shared idea is that knowledge is organized as a network [12]. Hudson [6] goes
further and states that grammar is organized as a network as well. According to Hudson’s
work, the syntactic structure of a sentence consists of a network of dependencies between
single terms. Thus, everything that needs to be said about the syntactic structure of a
sentence can be represented in such a network. Hudson explores Saussure’s [13] idea that
“language is a system of interdependent terms in which the value of each term results
solely from the simultaneous presence of the others”. He also discusses the
psycholinguistic evidence for the use of spreading activation in supporting knowledge
reasoning. However, according to Hudson, the main challenge is finding out how the
activation occurs in mathematical terms [6]. Our intuition is that, since the KG contains
a network of terms formed by the labels (e.g. rdfs:label) of its RDFTerms (properties,
classes and entities), these terms can be used for querying.
Definition 3 (Term). A term⁹ can be a word or a phrase used to describe a thing or to
express a concept [11]. In this work, we consider any literal (\(l \in L\)) in a KG to be
a term.
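Definition 4 (RDFTerm Label). The term associated with an RDFTerm \(\varphi\), denoted by
\(L(\varphi)\), is \(\varphi\) itself if it is a literal and the label of \(\varphi\)
otherwise. Considering rdfs:label¹⁰ as the labeling property:
\[ \mathit{label}(r) := \{\, l \in L \mid (r, \text{rdfs:label}, l) \in K \,\} \]
\[ L(\varphi) := \begin{cases} \{\varphi\} & \text{if } \varphi \in L, \\ \mathit{label}(\varphi) & \text{otherwise.} \end{cases} \]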
⁹ Not to be confused with an RDFTerm.
¹⁰ Other labeling properties may also be used.
Although there is no evidence that previous works were influenced by Hudson’s theory,
there are models that make use of the KG in order to evaluate the answer [20, 14].
Figure 1 shows a set of literals associated with the resources in the KG sample. Each
resource has a set of terms \(\mathit{LR}(r)\). These terms are called Resource-Associated
Terms and are defined as follows:
Definition 5 (Resource-Associated Terms). The set of terms associated with a resource
\(r\), denoted by \(\mathit{LR}(r)\), is the union of all literals as well as the labels
of each property and object in the triples in which \(r\) is the subject.
\[ \mathit{LR}(r) := \{\, l \in L \mid \exists (r, p, o) \in K : \exists \varphi \in \{p, o\} : l \in L(\varphi) \,\} \]
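As a minimal sketch, the label function, the RDFTerm label and the resource-associated terms can be written in Python as below. The triple set is an assumed reading of Figure 1 (including a label for rdfs:label itself), and the `Lit` wrapper, `KG`, `label`, `L` and `LR` are illustrative names rather than the authors' implementation.

```python
from typing import NamedTuple


class Lit(NamedTuple):
    """Wrapper marking a literal, so it is never confused with an IRI."""
    value: str


RDFS_LABEL = "rdfs:label"

# Hypothetical toy KG following the running example.
KG = {
    ("e2", RDFS_LABEL, Lit("Mona Lisa")),
    ("e2", "dbo:artist", "e1"),
    ("e1", RDFS_LABEL, Lit("Leonardo da Vinci")),
    ("e1", "rdf:type", "r1"),
    ("r1", RDFS_LABEL, Lit("Person")),
    ("dbo:artist", RDFS_LABEL, Lit("artist")),
    ("rdf:type", RDFS_LABEL, Lit("type")),
    (RDFS_LABEL, RDFS_LABEL, Lit("label")),
}


def label(r):
    """label(r): all literals l such that (r, rdfs:label, l) is in the KG."""
    return {o.value for s, p, o in KG
            if s == r and p == RDFS_LABEL and isinstance(o, Lit)}


def L(phi):
    """L(phi): the literal itself, or the labels of a non-literal RDFTerm."""
    return {phi.value} if isinstance(phi, Lit) else label(phi)


def LR(r):
    """LR(r): terms of every property and object in triples with subject r."""
    terms = set()
    for s, p, o in KG:
        if s == r:
            terms |= L(p) | L(o)
    return terms


print(sorted(LR("e2")))
```

For e2 this yields the four terms "label", "Mona Lisa", "artist" and "Leonardo da Vinci" of the running example.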
Example 1 (Resource-Associated Terms). Considering the KG depicted in Figure 1, the
triples having the entity e2 as subject are as follows:
1. e2 rdfs:label "Mona Lisa".
2. e2 dbo:artist e1.
The associated terms for e2 are:
\[ \mathit{LR}(e2) = \{\text{"label"}, \text{"Mona Lisa"}, \text{"artist"}, \text{"Leonardo da Vinci"}\} \]
Definition 6 (Term Network). A Term Network (TN) is a graph whose vertices are labeled
with terms.
A KG can be converted to a TN by visiting all vertices and edges and executing the
following operations (Fig. 2 shows the TN for Example 1):
1. Labeling edges and non-literal vertices with a copy of their respective labels, as
   defined by the labeling property rdfs:label;
2. Converting edges to vertices.
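The two operations can be sketched as follows. This is a simplified illustration: `term_network`, `LABELS` and `TRIPLES` are hypothetical names, every resource is assumed to have exactly one label, and literals are represented by plain strings that have no entry in the label table.

```python
def term_network(triples, labels):
    """Return the TN as a list of directed (term, term) edges."""
    def term(x):
        # Literals have no entry in `labels` and are used as their own term.
        return labels.get(x, x)

    edges = []
    for s, p, o in triples:
        # Step 1 relabels s, p and o; step 2 turns the predicate into a
        # vertex that sits between the subject and the object.
        edges.append((term(s), term(p)))
        edges.append((term(p), term(o)))
    return edges


# Hypothetical labels and triples for the entity e2 of the running example.
LABELS = {
    "e2": "Mona Lisa",
    "e1": "Leonardo da Vinci",
    "rdfs:label": "label",
    "dbo:artist": "artist",
}
TRIPLES = [("e2", "rdfs:label", "Mona Lisa"), ("e2", "dbo:artist", "e1")]

for edge in term_network(TRIPLES, LABELS):
    print(edge)
```

Because vertices are identified by their terms in this sketch, two resources with the same label would collapse into one vertex, which is one source of the ambiguity that a TN can exhibit.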
[Figure 2: TN with the vertices "Mona Lisa", "artist", "Leonardo da Vinci", "label" and
"Mona Lisa".]
Fig. 2. Representation of a TN extracted from the triples that have e2 as subject from
the KG depicted in Fig. 1.
The TN of a KG is connected, and its paths can have cycles as well as arbitrary
length. In order to simplify the TN and eliminate its ambiguity, the proposed model
works on a simplified version of the TN extracted from a structure called Semantic
Connected Component (SCC), defined as follows:
Definition 7 (Semantic Connected Component). The Semantic Connected Component (SCC) of
an entity \(e\) in an RDF graph \(G\) under a consequence relation \(\models\) is defined
as
\[ \mathit{SCC}_{G,\models}(e) := \{(e, p, o) \mid G \models \{(e, p, o)\}\} \cup \{(p, \text{rdfs:label}, l) \in G\} \cup \{(o, \text{rdfs:label}, l) \in G\}. \]
If the graph and the consequence relation are clear from the context, we use the shorter
notation \(\mathit{SCC}(e)\). Within this paper, we use the RDFS entailment consequence
relation as defined in its specification¹¹.
Example 2 (Semantic Connected Component). For instance, by RDFS entailment, the entity
dbr:Australia is a dbo:PopulatedPlace. The inference is due to dbr:Australia being typed
as dbo:Country, which is a subclass of dbo:PopulatedPlace. Considering the running
example, the SCC of the entity e2 is
\(\mathit{SCC}(e2) = (\{e2, e1, \text{"Mona Lisa"}\}, \{p5, p4\})\).
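A minimal sketch of the SCC construction is given below. It deliberately ignores RDFS entailment, so only explicitly stated triples are considered; the toy KG, the `Lit` wrapper and the function name `scc` are assumptions made for illustration.

```python
from typing import NamedTuple


class Lit(NamedTuple):
    """Wrapper marking a literal value."""
    value: str


RDFS_LABEL = "rdfs:label"

# Hypothetical toy KG following the running example (no inferred triples).
KG = {
    ("e2", RDFS_LABEL, Lit("Mona Lisa")),
    ("e2", "dbo:artist", "e1"),
    ("e1", RDFS_LABEL, Lit("Leonardo da Vinci")),
    ("e1", "rdf:type", "r1"),
    ("r1", RDFS_LABEL, Lit("Person")),
    ("dbo:artist", RDFS_LABEL, Lit("artist")),
    ("rdf:type", RDFS_LABEL, Lit("type")),
}


def scc(kg, e):
    """Triples with subject e plus the label triples of their predicates
    and non-literal objects."""
    own = {t for t in kg if t[0] == e}
    related = {p for _, p, _ in own}
    related |= {o for _, _, o in own if not isinstance(o, Lit)}
    label_triples = {t for t in kg if t[0] in related and t[1] == RDFS_LABEL}
    return own | label_triples


for triple in sorted(scc(KG, "e2"), key=str):
    print(triple)
```

Under these assumptions the result for e2 contains four triples: its own two triples and the labels of dbo:artist and e1. The type triple of e1 is not part of it, since e1 is only an object here.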
[Figure 3: SCC of e2 with the vertices e2, e1, "Mona Lisa", "artist" and "Leonardo da
Vinci", connected by one dbo:artist edge and three rdfs:label edges.]
Fig. 3. Representation of the SCC of the entity e2 extracted from the KG depicted in Fig. 1.
The structure used for retrieval and ranking is called Semantic Unit (SU). The SU is a
tree in which the nodes below the root are labeled with tokens and have only one child.
Tokens are substrings extracted from another string; they are formally defined as
follows.
Definition 8 (Token). A token \(t \in T\) is the result of a tokenizing function
\(T : \Sigma^* \to \Sigma^{**}\), which converts a string to a set of tokens.
The sub-trees of the root node of the SU form a set of paths starting from the resource
to which the SCC is associated, see Fig. 4. The SU is defined as follows:
Definition 9 (Semantic Unit (SU)). The Semantic Unit is a tree where:
- The root node is an entity;
- All vertices in the root node sub-trees have only one child; and
- Vertices in the root node sub-trees are labeled with tokens.
Example 3 (Semantic Unit (SU)). Considering the running example, the SU of the entity e2
is
\[ \mathit{SU}(e2) = (\{e2, v1, v2, v3, v4, v5, v6, v7\}, \{(e2, v1), (e2, v5), (v1, v2), (v2, v3), (v3, v4), (v5, v6), (v6, v7)\}) \]
and is depicted in Fig. 4.¹²
An SCC can be converted into an SU as follows:
¹¹ http://www.w3.org/TR/rdf-mt/
¹² The output of the tokenizer used in this example consists of lowercase lexemes from a
literal.
[Figure 4: SU tree rooted at e2 with two branches: artist (v1), leonardo (v2), da (v3),
vinci (v4), and label (v5), mona (v6), lisa (v7).]
Fig. 4. Representation of the SU of the entity e2 extracted from the KG depicted in Fig. 1.
1. Converting the sub-trees starting from the root node of the SCC into a TN;
2. Converting the literal vertices to a graph where there is an edge starting from each
   token to its subsequent one, defined as follows:
\[ G(l) := (T(l), D(l)) \]
\[ D(l) := \{\, (t_1, t_2) \in T(l) \mid \exists i \in \mathbb{N} : (\pi_i(T(l)) = t_1) \wedge (\pi_{i+1}(T(l)) = t_2) \,\} \]
Example 4 (Literal to graph). Converting the term "mona lisa" to a graph.
G("mona lisa") = ({"mona","lisa"},{("mona","lisa")})
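The conversion of a literal into a token graph can be sketched as follows. The whitespace tokenizer with lowercasing is an assumption standing in for the tokenizing function T, and `tokenize` and `literal_graph` are illustrative names.

```python
def tokenize(literal):
    """Assumed tokenizer: lowercase the literal and split on whitespace."""
    return tuple(literal.lower().split())


def literal_graph(literal):
    """Return (vertices, edges) with an edge from each token to the next."""
    tokens = tokenize(literal)
    vertices = set(tokens)
    edges = {(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)}
    return vertices, edges


print(literal_graph("Leonardo da Vinci"))
```

Keeping the tokens in a tuple preserves their order, which is what the projection over consecutive positions in the definition of D(l) relies on.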
In the following sections, we first describe how SUs are retrieved from the KG using
the query terms. We then discuss how they can be ranked efficiently.
4.1 Retrieving
The idea is to select the SUs that share at least one term with the query. For instance,
one possible solution for {"mona", "lisa", "artist"} is the co-occurrence of all terms in
an SU. The next possible solution is the co-occurrence of two of the three terms, and so
on. Thus, it is necessary to check for the existence of the query terms in different
paths. For example, one SU may contain the token "artist" and another the tokens
("mona", "lisa"), see Example 5.
Example 5 (Retrieving "Mona Lisa artist"). In the KG in Fig. 1, the SCC containing the
answer for the query {"mona", "lisa", "artist"} is SCC(e2), which can be retrieved by a
simple lookup with a SPARQL query.
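The selection step can be illustrated with an in-memory inverted index. This is a stand-in for the SPARQL lookup mentioned above, not the lookup itself; the token sets in `SU_TOKENS` and the functions `build_index` and `retrieve` are hypothetical.

```python
from collections import defaultdict

# Hypothetical token sets of two SUs, following the running example.
SU_TOKENS = {
    "e2": {"label", "mona", "lisa", "artist", "leonardo", "da", "vinci"},
    "e1": {"label", "leonardo", "da", "vinci", "type", "person"},
}


def build_index(su_tokens):
    """Map every token to the entities whose SU contains it."""
    index = defaultdict(set)
    for entity, tokens in su_tokens.items():
        for token in tokens:
            index[token].add(entity)
    return index


def retrieve(index, query_tokens):
    """Return candidate entities with the query tokens each one shares."""
    hits = defaultdict(set)
    for token in query_tokens:
        for entity in index.get(token, ()):
            hits[entity].add(token)
    # Entities covering more query tokens come first; names break ties.
    return sorted(hits.items(), key=lambda kv: (-len(kv[1]), kv[0]))


index = build_index(SU_TOKENS)
print(retrieve(index, {"mona", "lisa", "artist"}))
```

Sorting by the number of shared tokens reflects the order of preference described above: full co-occurrence first, then two of three terms, and so on.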
Query and Resource Labels Analysis. Information retrieval systems for RDF are commonly
designed to support full or keyword NL queries. However, converting keywords to full
queries is a more challenging task. The *path query approach is designed to deal with
keyword or full queries by converting the latter into keyword queries. The conversion of
an NL input query into a tuple of keywords consists of applying two known techniques, in
order: (1) lowercasing and (2) lemmatization. In order to increase the number of matched
SUs, the same analysis is applied to the SU labels.
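The two analysis steps can be sketched as below. The lemma table `LEMMAS` is a tiny hypothetical lookup used only to keep the example self-contained; a real system would rely on a full lemmatizer, and the regular-expression tokenization is likewise an assumption.

```python
import re

# Hypothetical lemma table; it only covers the words used in this sketch.
LEMMAS = {"artists": "artist", "paintings": "painting", "painted": "paint"}


def analyze(text):
    """Lowercase the input, split it into word tokens and lemmatize them."""
    tokens = re.findall(r"\w+", text.lower())
    return tuple(LEMMAS.get(token, token) for token in tokens)


print(analyze("Mona Lisa Artists"))
```

Applying the same `analyze` function to queries and to SU labels keeps both sides in the same normalized form, which is the point of the last sentence above.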
After extracting the SUs, the SCC of the SU’s entity is used for ranking.
4.2 Ranking
Document retrieval approaches are not suitable for RDF because the most important
feature of RDF is not the terms, but the relations among the concepts underlying its graph
structure. The challenge in adapting the ranking method is measuring the relatedness
between the resources in the target KG and the input query terms. As a query rarely
matches the resource-associated terms exactly, both are first converted into tokens.
Thereafter, the proposed ranking assumes that the probability of a resource being part
of an answer correlates with the number of matched tokens between the query and the
resource-associated terms. For instance, a query containing “birth date” should be more
related to the property dbo:birthDate than to the property dbo:deathDate or dbpprop:date.
The strength is measured by the number of query tokens matching the resource tokens.
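Definition 10 (Resource Matching). A resource matching is a function
\(\mathit{MT} : T \to 2^R\) that maps query tokens \(T = \{t_1, t_2, t_3, \ldots, t_n\}\)
to resources, formally defined by \(\mathit{MT}(t)\), where \(\delta\) is a string
dissimilarity function and \(\theta \in [0, 1] \subset \mathbb{R}\):
\[ \mathit{MT}(t) := \{\, r \in R \mid \exists t' \in T(\mathit{LR}(r)) : \delta(t, t') < \theta \,\} \]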
Example 6 (Resource Matching). Let
\(T(q) = \{\text{"mona"}, \text{"lisa"}, \text{"artist"}\}\). According to Fig. 1, the
tokens are mapped to: \(\mathit{MT}(\text{"mona"}) = \{e2\}\),
\(\mathit{MT}(\text{"lisa"}) = \{e2\}\), \(\mathit{MT}(\text{"artist"}) = \{p4\}\).
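A threshold-based matching in the sense of Definition 10 can be sketched as follows. The paper does not fix the dissimilarity function here, so this sketch assumes one derived from Python's `difflib.SequenceMatcher` ratio; the threshold of 0.2, the token sets and the restriction to the two entities e1 and e2 are hypothetical choices.

```python
from difflib import SequenceMatcher

# Hypothetical tokens of the resource-associated terms of e1 and e2.
RESOURCE_TOKENS = {
    "e2": {"label", "mona", "lisa", "artist", "leonardo", "da", "vinci"},
    "e1": {"label", "leonardo", "da", "vinci", "type", "person"},
}


def delta(a, b):
    """Assumed dissimilarity: 0.0 for equal strings, up to 1.0 otherwise."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()


def match_resources(token, resource_tokens, theta=0.2):
    """Resources with at least one token closer to `token` than theta."""
    return {
        resource
        for resource, tokens in resource_tokens.items()
        if any(delta(token, t) < theta for t in tokens)
    }


print(match_resources("mona", RESOURCE_TOKENS))
print(match_resources("vinchi", RESOURCE_TOKENS))
```

With a non-zero threshold, a slightly misspelled token such as "vinchi" still reaches the resources that carry "vinci", while exact tokens behave like a plain lookup.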
As the knowledge base is a graph, the resources and literal values are connected by paths
formed by edges and vertices, see Fig. 3.
Example 7 (Path). In the SCC shown in Fig. 3, there are two paths starting from the
entity e2: \(\gamma_1 = ((e2, \text{"Mona Lisa"}))\) and \(\gamma_2 = ((e2, e1))\).
Furthermore, the resources on a path from one resource to another are labeled (e.g. via
rdfs:label). Therefore, it is possible to explore the terms associated with an entity’s
paths to determine its relevance.
Definition 11 (Path Terms). Path terms are the set of all literals in the path
\(\gamma\), defined as follows:
\[ \mathit{LP}(\gamma) := \{\, l \mid \exists \varphi \in \gamma : l \in L(\varphi) \,\} \]
Example 8 (Path Terms). For Example 7, the sets of associated terms for the two given
paths are \(\mathit{LP}(\gamma_1) = \{\text{"label"}, \text{"Mona Lisa"}\}\) and
\(\mathit{LP}(\gamma_2) = \{\text{"artist"}, \text{"Leonardo da Vinci"}\}\).
Thus, the relevance score of an entity depends on the number of matched terms in its
associated paths. The higher the number of matched terms, the higher the relevance of
the entity. Furthermore, if a term matches multiple paths of an entity, it is only attributed
to the path with the highest number of matched terms. The relevance score of an entity
is the sum of all individual path scores; it is measured by the Semantic Weight Model
(SWM), which is formally defined as follows.
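Definition 12 (Semantic Weight Model (SWM)). Each token \(t\) in \(T(q)\) is first mapped
to the paths of the SCC \(S\). The set of matched tokens from a path \(\gamma\) is returned
by the function \(\mathit{TP}(\gamma, q)\). A path match of an SCC \(S\) is evaluated by
the function \(\mathit{MTP}(\gamma, q, S)\) using a path weighting function
\(w : D^+ \to \mathbb{R}\).
\[ \mathit{TP}(\gamma, q) := \{\, t \in T(\mathit{LP}(\gamma)) \mid \exists t' \in T(q) : \delta(t, t') < \theta \,\} \]
\[ \mathit{MTP}(\gamma, q, S) := \{\, t \in \mathit{TP}(\gamma, q) \mid \forall \gamma' \in D(S)^+ : w(\gamma)\,|\mathit{TP}(\gamma, q)| \geq w(\gamma')\,|\mathit{TP}(\gamma', q)| \,\} \]
The final score of an SCC \(S\) is the sum of its \(n\) path scores and is measured by the
function \(\mathit{score}(S)\), as follows:
\[ \mathit{score}(S) = \sum_{\gamma \in D(S)^+} \begin{cases} w(\gamma)\,|\mathit{TP}(\gamma, q)| & \text{if } \mathit{MTP}(\gamma, q, S) \neq \emptyset, \\ 0 & \text{otherwise.} \end{cases} \]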
In case there are terms matching multiple paths and the paths have an equal number of
matched terms and an equal score, only one of the path scores is added to the SCC score.
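The scoring idea can be sketched in Python under one possible reading of the prose description: every query token is credited once, to the path with the highest weighted match count among the paths containing it, and contributes that path's weight. This is an interpretation rather than the authors' implementation. Exact token equality stands in for the dissimilarity test, and the weights, path data and the name `swm_score` are hypothetical.

```python
# Hypothetical weights respecting w(type) > w(label) > w(other).
WEIGHTS = {"type": 1.2, "label": 1.1, "other": 1.0}


def swm_score(paths, query_tokens, weights=WEIGHTS):
    """paths maps a path id to (kind, token set); returns the SCC score."""
    matched = {pid: toks & query_tokens for pid, (_, toks) in paths.items()}
    strength = {
        pid: weights[paths[pid][0]] * len(tokens)
        for pid, tokens in matched.items()
    }
    score = 0.0
    for token in sorted(query_tokens):
        candidates = [pid for pid, tokens in matched.items() if token in tokens]
        if candidates:
            # The token counts only for its strongest path; ties are broken
            # by path id so that a single path is credited.
            best = max(candidates, key=lambda pid: (strength[pid], pid))
            score += weights[paths[best][0]]
    return score


# Hypothetical tokenized path terms of SCC(e2) and SCC(e1).
SCC_E2 = {
    "g1": ("label", {"label", "mona", "lisa"}),
    "g2": ("other", {"artist", "leonardo", "da", "vinci"}),
}
SCC_E1 = {
    "g1": ("label", {"label", "leonardo", "da", "vinci"}),
    "g2": ("type", {"type", "person"}),
}
query = {"mona", "lisa", "artist"}
print(swm_score(SCC_E2, query), swm_score(SCC_E1, query))
```

With these assumed weights, the SCC of e2 collects two label-path matches and one match on another property, while the SCC of e1 matches no query token and scores zero.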
The SWM assigns different weights based on the RDF properties on the path. This
means that the weight of a term in a path is determined by the type of the properties
(label, is-a relation, other) on that path, and it acts as a tiebreaker for paths with an
equal number of tokens. The weight hierarchy of paths is constructed to allow the
exploration of the KG by querying entities by type, label, predicates and objects. Since
terms extracted from resources can overlap, a disambiguation method is needed.
Weighting: We first explain the rationale behind the defined weights and then illustrate
it with examples.
Is-a relation. Tokens can occur in different paths of an SCC. In particular, a token in
an is-a relation property can also occur in other properties. However, an entity label
references the entity itself, while an is-a relation references a class of entities.
Consequently, if a query intends to select a specific class of entities, other entities
can be retrieved by mistake. Thus, it is important to provide an efficient method to
disambiguate between classes and entities. To alleviate this problem, the weight of paths
containing an is-a relation property is set higher than that of other paths. The selection
of a specific entity can then be done by building a more precise query, because besides
the entity’s label, other properties can be used to disambiguate. For instance, if a class
and an entity have the same label, the user can add a term from another property of the
entity. Therefore, the highest weight is assigned to paths with an is-a relation property
\(\gamma_t\), i.e. the paths containing rdf:type.
Entity label. The second highest weight is assigned to labeling property paths
\(\gamma_l\), i.e. the paths containing the rdfs:label property, which are assigned higher
values than other property paths \(\gamma_o\). Entities can be referenced multiple times in
a KG, but when a query contains an entity label, it is more likely that it is looking for
the entity itself than for its references (an object instance). Therefore, to prevent
entities that merely reference an entity from being ranked higher than the entity itself,
the weight of a path with a labeling property is set higher than that of a path with
another property. Despite the different weights, we still want a higher number of matched
tokens to score higher in practical cases, i.e. \(n + 1\) matched tokens should score
higher than \(n\) matched tokens for reasonably low \(n\):
\[ (n+1)\,w(\gamma_t) > (n+1)\,w(\gamma_l) > (n+1)\,w(\gamma_o) > n\,w(\gamma_t) > n\,w(\gamma_l) > n\,w(\gamma_o) \tag{1} \]
In the following, the model is explained using examples.
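To see what "reasonably low n" means for a concrete choice of weights, inequality (1) can be checked numerically. The binding constraint is (n+1) w(γo) > n w(γt), which holds as long as n < w(γo) / (w(γt) − w(γo)). The weights below are hypothetical and only serve as an illustration; `chain_holds` is an illustrative name.

```python
from fractions import Fraction


def chain_holds(n, w_t, w_l, w_o):
    """True if inequality (1) holds for n matched tokens."""
    return ((n + 1) * w_t > (n + 1) * w_l > (n + 1) * w_o
            > n * w_t > n * w_l > n * w_o)


# Hypothetical weights; exact fractions avoid floating-point rounding.
W_T, W_L, W_O = Fraction(12, 10), Fraction(11, 10), Fraction(1)
valid = [n for n in range(1, 10) if chain_holds(n, W_T, W_L, W_O)]
print(valid)  # [1, 2, 3, 4]
```

For these weights the bound is n < 1 / 0.2 = 5, so the chain holds up to four matched tokens. Choosing weights closer together extends the range of n for which one more matched token always outweighs a better property type.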
Case 1: Querying by entity label. For the query “Rio de Janeiro”, the SWM should
consider the DBpedia entity dbpedia:Rio de Janeiro as the best answer, although the
DBpedia entity dbpedia:Tom Jobim has the DBpedia property dbpprop:birthPlace referencing
the entity dbpedia:Rio de Janeiro. For the term “The” in a query, the model will consider
the entities dbpedia:The Simpsons and dbpedia:The Beatles as possible answers rather than
the DBpedia property dbpprop:The GIP.
Case 2: Querying by is-a relation. Considering the query “place”, the implemented SWM
will prefer the data type dbo:Place over the property dbo:place.
Case 3: Querying by other properties. Let us consider the case in which the query is
“birth place” rather than “place” as in the previous example. As the number of matching
terms in the property dbo:birthPlace is higher than for the data type dbo:Place, the
weight of dbo:birthPlace will be higher than that of the data type.
5 Experimental Evaluation
We evaluate the performance of *path in comparison to the state-of-the-art SemS
system as well as QA systems in terms of Precision, Recall and F-measure. To the best of
our knowledge, this is the first time that the precision of both approaches is measured
on the same benchmark.
Benchmark. Several benchmarks can be used to measure the precision of our approach,
including benchmarks from the initiatives SemSearch [4]¹³ and QA Over Linked Data
(QALD)¹⁴. SemSearch is based on user queries extracted from the YAHOO! search log, with
an average of 2.2 words per query. QALD provides both QA and keyword search benchmarks
for RDF data that aim to evaluate the extrinsic behavior of systems. The QALD benchmarks
are the most suitable for our evaluation due to the wide variety of queries they contain
and because they make use of DBpedia, a very large and diverse dataset. In this work, we
use the openQA framework [9] with the newest version of the QALD benchmark compatible
with the framework, QALD version 4 (QALD-4) [17]. The proposed approach was compared with
GlimmerY! because it is the best performing SemS system and it is open-source, which
allows us to evaluate its performance.
¹³ http://km.aifb.kit.edu/ws/semsearch10/
¹⁴ http://greententacle.techfak.uni-bielefeld.de/~cunger/qald/
Results. Table 1 shows the performance of *path in comparison to GlimmerY!, the
state-of-the-art SemS system [1], and all participating QA systems in the multilingual
challenge of the QALD-4 benchmark.
System      P     R     F1    Approach
Xser        0.71  0.72  0.72  QA
gAnswer     0.37  0.37  0.37  QA
CASIA       0.40  0.32  0.36  QA
Intui3      0.25  0.23  0.24  QA
ISOFT       0.26  0.21  0.23  QA
*path       0.19  0.19  0.19  SemS
RO FII      0.12  0.12  0.12  QA
GlimmerY!   0.07  0.07  0.07  SemS
Table 1. Precision (P), Recall (R) and F-measure (F1) achieved by different SemS and QA
systems in the QALD-4 Multilingual Challenge. The systems are GlimmerY!, *path, SINA, TBSL
and all QALD-4 participating systems.
Discussion. The proposed approach is faster than the best SemS system participating in
SemSearch’10. The main reason is that GlimmerY! builds an index without reasoning, which
imposes constraints on its precision (Table 1). The index without reasoning is a core
limitation of GlimmerY!, since the user cannot query by using terms from properties or
from entity objects. For instance, in Case 1 in Section 4.2, GlimmerY! fails to retrieve
dbpedia:Tom Jobim because the terms of the entity dbpedia:Rio de Janeiro belonging to the
property dbpprop:birthPlace are not indexed. The same occurs for the data type in Case 2,
where the type is also given by a non-literal object. However, the F-measure of *path is
considerably lower (by 0.52) than that of the best performing QA system in QALD-4. The
reason is that *path does not target complex queries, i.e., queries that require
aggregations, restrictions or solution modifiers to be answered.
6 Conclusion, Limitations & Future Work
We have presented a novel ranking method for SemS over KGs. The results of an
experimental study show a significant improvement in comparison to the state-of-the-art
SemS system. Furthermore, the approach achieves precision comparable to that of QA
systems. There are a few challenges not addressed in the current implementation, such as
complex queries [5]. In future work, we plan to improve the precision of this approach
by addressing the mentioned challenges. Furthermore, we plan to investigate indexing
techniques. We see this work as the first step of a larger research agenda for SemS
over Linked Data.
Acknowledgements: This work was supported by a grant from the EU H2020 Framework
Programme provided for the projects Big Data Europe (GA no. 644564) and HOBBIT (GA
no. 688227), and by CNPq under the program Ciências Sem Fronteiras.
References
1.
Blanco, R., Mika, P., Vigna, S.: Effective and efficient entity search in RDF data. In: The
Semantic Web–ISWC 2011. Springer, Berlin Heidelberg (2011)
2.
Cheng, G., Qu, Y.: Searching Linked Objects with Falcons: Approach, Implementation and
Evaluation. Int. J. Semantic Web Inf. Syst. 5(3), 49–70 (2009)
3.
Ding, L., Finin, T., Joshi, A., Pan, R., Cost, R.S., Peng, Y., Reddivari, P., Doshi, V.C., Sachs,
J.: Swoogle: A Search and Metadata Engine for the Semantic Web. In: Proceedings of
the Thirteenth ACM Conference on Information and Knowledge Management (CIKM). pp.
652–659. ACM (2004)
4.
Halpin, H., Herzig, D.M., Mika, P., Blanco, R., Pound, J., Thompson, H.S., Tran, D.T.:
Evaluating Ad-hoc Object Retrieval. In: Proceedings of the International Workshop on Evalu-
ation of Semantic Technologies (IWEST 2010). 9th International Semantic Web Conference
(ISWC2010), Shanghai, PR China (November 2010)
5.
H
¨
offner, K., Walter, S., Marx, E., Usbeck, R., Lehmann, J., Ngonga Ngomo, A.C.: Survey
on challenges of Question Answering in the Semantic Web. Submitted to the Semantic Web
Journal (2016)
6.
Hudson, R.A.: Language networks: The new word grammar. Oxford linguistics, Oxford
University Press (2007)
7.
Luhn, H.P.: A statistical approach to mechanized encoding and searching of literary informa-
tion. IBM Journal of research and development 1(4), 309–317 (1957)
8.
Mangold, C.: A survey and classification of semantic search approaches. International Journal
of Metadata, Semantics and Ontologies 2(1), 23–34 (2007)
9.
Marx, E., Usbeck, R., Ngomo Ngonga, A.C., H
¨
offner, K., Lehmann, J., Auer, S.: Towards an
open Question Answering architecture. In: SEMANTiCS 2014 (2014)
10.
Oren, E., Delbru, R., Catasta, M., Cyganiak, R., Stenzhorn, H., Tummarello, G.: Sindice.com:
a document-oriented lookup index for open linked data. IJMSO 3(1), 37–52 (2008)
11.
Pearsall, J., Hanks, P., Soanes, C., Stevenson, A. (eds.): Oxford Dictionary of English (Kindle
Edition) (2010)
12. Reisburg, D.: Cognition: Exploring the science of the mind. Norton, New York (1997)
13.
de Saussure, F.: Course in General Linguistics. McGraw-Hill, New York (1959), translated by
Wade Baskin
14.
Shekarpour, S., Marx, E., Ngomo, A.C.N., Auer, S.: SINA: Semantic interpretation of user
queries for Question Answering on interlinked data. Journal of Web Semantics 30, 39–51
(2015)
15.
Sparck Jones, K.: A statistical interpretation of term specificity and its application in retrieval.
Journal of documentation 28(1), 11–21 (1972)
16.
Tummarello, G., Cyganiak, R., Catasta, M., Danielczyk, S., Delbru, R., Decker, S.: Sig.ma:
Live views on the web of data. J. Web Sem. 8(4), 355–364 (2010)
17.
Unger, C., Forascu, C., Lopez, V., Ngomo, A.C.N., Cabrio, E., Cimiano, P., Walter, S.: Ques-
tion Answering over Linked Data (QALD-4). In: Working Notes for CLEF 2014 Conference
(2014)
18. Virgilio, R.D., Maccioni, A.: Distributed Keyword Search over RDF via MapReduce. In: The
Semantic Web: Trends and Challenges. pp. 208–223. Springer, Berlin Heidelberg, Germany
(2014)
19. Wang, H., Liu, Q., Penin, T., Fu, L., Zhang, L., Tran, T., Yu, Y., Pan, Y.: Semplore: A scalable
IR approach to search the Web of Data. Journal of Web Semantics 7(3) (Sep 2009)
20. Zhang, L., Liu, Q., Zhang, J., Wang, H., Pan, Y., Yu, Y.: Semplore: An IR approach to scalable
hybrid query of Semantic Web data. In: The Semantic Web: 6th International Semantic Web
Conference. Springer, Berlin Heidelberg, Germany (2007)