Conference Paper

Exploring Term Networks for Semantic Search over RDF Knowledge Graphs

November 2016
Communications in Computer and Information Science

DOI:10.1007/978-3-319-49157-8_22

Conference: Research Conference on Metadata and Semantics Research

Authors:

Edgard Marx

Hochschule für Technik, Wirtschaft und Kultur Leipzig

Konrad Höffner

University of Leipzig

Saeedeh Shekarpour

University of Dayton

Axel-Cyrille Ngonga Ngomo

Universität Paderborn

Information retrieval approaches are considered as a key technology to empower lay users to access the Web of Data. A large number of related approaches such as Question Answering and Semantic Search have been developed to address this problem. While Question Answering promises more accurate results by returning a specific answer, Semantic Search engines are designed to retrieve the best top-\(K\) ranked resources. In this work, we propose *path, a Semantic Search approach that explores term networks for querying RDF knowledge graphs. The adequacy of the approach is evaluated employing benchmark datasets against state-of-the-art Question Answering as well as Semantic Search systems. The results show that *path achieves better F\(_1\)-score than the currently best performing Semantic Search system.

Edgard Marx (1,3), Konrad Höffner (1), Saeedeh Shekarpour (1), Axel-Cyrille Ngonga Ngomo (1), Jens Lehmann (2), and Sören Auer (2)

(1) AKSW, University of Leipzig, Germany

(2) Computer Science Institute, University of Bonn / Fraunhofer IAIS

(3) Instituto de Pesquisa e Desenvolvimento Albert Schirmer

(4) Knoesis Center, USA

1 Introduction

The growth of Semantic Web technologies has led to the publication of large volumes of data. Approximately 10 000 Resource Description Framework (RDF)⁵ datasets are available via public data portals.⁶ However, retrieving the desired information from these datasets still poses a significant challenge. Lay users cannot be expected to familiarize themselves with the underlying query languages and modeling structures. A major challenge is the efficient retrieval of the resource that best represents the user's intent via natural language (NL) keyword queries. Relying solely on off-the-shelf triple stores or document retrieval may lead to poor performance or precision (see Section 5). To address this problem, we propose an approach for Semantic Search over RDF knowledge graphs that explores their Term Networks. A Term Network (see Section 4) is a graph whose vertices are labeled with terms. Overall, our contributions are as follows:

– We develop a new formal model for Semantic Search (SemS) based on Term Networks;

– We present a ranking method that increases the precision of retrieving RDF data;

– We compare our approach with state-of-the-art SemS techniques on the QALD-4 [17] benchmark and show that we achieve a higher F\(_1\)-score.

⁵ http://www.w3.org/RDF

⁶ http://lodstats.aksw.org/

The rest of this paper is organized as follows: Related work is reviewed in Section 2. Section 3 defines the preliminaries. Section 4 describes the *path model. Section 5 outlines the evaluation and discusses the results. Finally, Section 6 concludes with an outlook on potential future work.

2 Related Work

Information retrieval (IR) over Linked Data is an active and diverse research field, with many existing approaches designed for different environments and diverging in complexity and precision. The related work can mainly be categorized into two types of approaches that recover information from Linked Data Knowledge Graphs (KGs, see Definition 1): (1) those using conventional IR techniques and (2) those answering natural language questions. Traditional IR systems lack the ability to deal with complex queries, but they are usually faster. Wang et al. [19] show that pure traditional IR engines are faster than the combination of a triple store with a full-text index. However, both models explore the semantics of an NL query for delivering the response by applying statistical measures and heuristics to the KG. Semantic Search (SemS) approaches aim to retrieve the top-k ranked resources for a given NL input query. Swoogle [3] introduces a modified version of PageRank that takes into account the types of the links between ontologies. Sindice [10], Falcons [2] and Sig.ma [16] explore traditional document retrieval to index and locate relevant sources and/or resources. Sindice is a search engine that can retrieve documents containing a given statement. Falcons uses a built-in ranking mechanism for entity ranking, while Sig.ma allows the use of constraints to query for particular classes and/or properties. In all cases, the structure and semantics are not taken into account during the matching phase. YAHOO! BNC [ ] used a local, per-property term frequency as well as a global term frequency. It also applied a boost based on the number of matched query terms. Umass [ ] explored existing ranking functions applied to four field types: (1) title; (2) name; (3) dbo:title; and (4) all others. The fields were weighted separately, with a specific boost applied to each of them. Later, Blanco et al. [1] proposed a modified version of the BM25F ranking function adapted for RDF data. The function was applied to a horizontal pairwise index structure composed of the subject and its property values. The most important feature of the proposed structure is the possibility to assign different weights to predicates. The proposed adaptation is implemented in the Glimmer engine and is shown to be time-efficient as well as to outperform other state-of-the-art methods in ranking RDF resources. Recently, Virgilio et al. [18] introduced a distributed technique for SemS on RDF data using MapReduce. The method uses a distributed index of RDF paths. The proposed strategy returns the best top-k answers in the first k generated results. The retrieval is done by evaluating the paths containing the terms of the query using one of two strategies: (1) Linear and (2) Monotonic. The Linear strategy uses only the highest-ranked path(s). As a consequence, it does not produce an optimal solution but has linear complexity with respect to the number of matched entities. The Monotonic strategy uses all matched paths and thus produces better results. Intuitively, measuring all suitable paths from all entities is less time-efficient. Please refer to the work of Mangold [8] for a more detailed analysis of SemS approaches.

One of the biggest challenges in SemS methods lies in evaluating the relatedness between the terms in a KG and an NL query. Document retrieval engines rely on term frequency weighting, which is based on the assumption that the more frequently a term occurs, the more related it is to the topic of the document [ ]. While good retrieval performance requires taking the frequency into account, term frequency alone suffers from frequent yet unspecific words such as "the", "a" or "in". Inverse document frequency corrects this by diminishing the weight of words that occur frequently in the corpus, leading to the combined term frequency–inverse document frequency (tf-idf) [ ] for scoring documents for a query.

3 Preliminaries

We begin by introducing a formal definition of the RDF model. Thereafter, we introduce fundamental concepts that are required for a full understanding of the rest of the paper.

RDF⁷ is a standard for describing Web resources. A resource can refer to any physical or conceptual thing, such as a Web site, a person or a device. The RDF data model expresses statements about resources in the form of subject-predicate-object triples. The subject denotes a resource; the predicate expresses a property (of the subject) or a relationship (between subject and object); the object is either a resource or a literal. Resources are identified with IRIs, a generalization of URIs, while literals are used to identify values such as numbers and dates by means of a lexical representation.

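Definition 1 (RDF Knowledge Graph, KG). Formally, let \(K\) be a finite RDF knowledge graph (KG). \(K\) can be regarded as a set of triples \((s, p, o) \in (\mathcal{I} \cup \mathcal{B}) \times \mathcal{P} \times (\mathcal{I} \cup \mathcal{L} \cup \mathcal{B})\), where \(\mathcal{R} = \mathcal{I} \cup \mathcal{B}\) is the set of all RDF resources \(r \in \mathcal{R}\) in the KG, \(\mathcal{I}\) is the set of all IRIs, \(\mathcal{B}\) is the set of all blank nodes with \(\mathcal{B} \cap \mathcal{I} = \emptyset\), \(\mathcal{P}\) is the set of all predicates with \(\mathcal{P} \subseteq \mathcal{I}\), and \(\mathcal{L}\) is the set of all literals with \(\mathcal{L} \subset \Sigma^*\) and \(\mathcal{L} \cap \mathcal{I} = \emptyset\), where \(\Sigma\) is the Unicode alphabet. \(\mathcal{E}\) is the set of all entities, \(\mathcal{E} = (\mathcal{I} \cup \mathcal{B}) \setminus \mathcal{P}\). An RDFTerm \(\varphi\) refers to any edge label⁸ \(p \in \mathcal{P}\) or vertex in the KG, \(\varphi \in (\mathcal{I} \cup \mathcal{B} \cup \mathcal{L})\). A KG is modeled as a directed labeled graph \(G = (V, D)\), where \(V = \mathcal{E} \cup \mathcal{L}\), \(D \subseteq \mathcal{E} \times (\mathcal{E} \cup \mathcal{L})\), and the labeling function of the edges is a mapping \(\lambda : D \to \mathcal{P}\). We disregard literal language tags and data types.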

Figure 1 shows an excerpt of a KG where a literal vertex \(v_i \in \mathcal{L}\) is illustrated by a rectangle and a resource vertex \(v_i \in \mathcal{R}\) by an oval. Each edge between two vertices corresponds to a triple, where the first vertex is called the subject, the labeled edge the predicate and the second vertex the object. For example, the edge labeled rdfs:label from e2 to "Mona Lisa" corresponds to the triple <e2, rdfs:label, "Mona Lisa">.
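The graph view of Definition 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the identifiers (e1, e2, dbo:Person) and the exact set of triples are made up to resemble Fig. 1, literals are marked by surrounding quotes, and each pair of vertices is assumed to be connected by at most one predicate.

```python
# Toy KG modeled after Fig. 1. The identifiers (e1, e2, ...) and the exact
# triples are assumptions for illustration; literals are written in quotes.
K = {
    ("e2", "rdfs:label", '"Mona Lisa"'),
    ("e2", "dbo:artist", "e1"),
    ("e1", "rdfs:label", '"Leonardo da Vinci"'),
    ("e1", "rdf:type", "dbo:Person"),
}


def is_literal(term):
    """Literals are marked by surrounding double quotes in this sketch."""
    return term.startswith('"')


def graph_view(triples):
    """Return (V, D, lam): vertices, directed edges and the edge labeling."""
    vertices = {s for s, _, _ in triples} | {o for _, _, o in triples}
    edges = {(s, o) for s, _, o in triples}
    # Assumes at most one predicate per (subject, object) pair, so that the
    # labeling is a function from edges to predicates.
    lam = {(s, o): p for s, p, o in triples}
    return vertices, edges, lam


V, D, lam = graph_view(K)
literal_vertices = {v for v in V if is_literal(v)}
print(lam[("e2", '"Mona Lisa"')])  # rdfs:label
```

Storing the KG as a plain set of triples keeps the sketch close to the set-based definition; a real system would more likely rely on a triple store or an RDF library.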

In this work, we address the problem of SemS systems that aim to retrieve the top-k

ranked entities representing the intention behind an NL user query.

Definition 2 (Natural Language Query). An NL query \(q \in \Sigma^*\) is a user-given keyword string expressing a factual information need.

⁷ https://www.w3.org/TR/REC-rdf-syntax/

⁸ Not to be confused with rdfs:label.

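[Figure 1 (image not reproduced): the labels shown include the literals "Leonardo da Vinci", "Person", "Mona Lisa", "type" and "artist" and the edge labels rdfs:label, dbo:artist and rdf:type.]

Fig. 1. An excerpt of a KG. The labels of the rdfs:label properties were omitted for simplification.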

4 Approach

For many years, scientists from the most diverse fields of cognitive science, including psychology, neuroscience, philosophy, linguistics and artificial intelligence, have tried to explain and reproduce the human cognition system. While diverse theories have been developed, a commonly shared idea is that knowledge is organized as a network [ ]. Hudson et al. [6] go further and state that grammar is organized as a network as well. According to Hudson's work, the syntactic structure of a sentence consists of a network of dependencies between single terms. Thus, everything that needs to be said about the syntactic structure of a sentence can be represented in such a network. Hudson explores Saussure's [13] idea that "language is a system of interdependent terms in which the value of each term results solely from the simultaneous presence of the others". He also discusses the psycholinguistic evidence for the use of spreading activation in supporting knowledge reasoning. However, according to Hudson et al., the main challenge is finding out how the activation occurs in mathematical terms [ ]. Our intuition is that, since the KG contains a network of terms formed by the labels (e.g. rdfs:label) of the RDFTerms (properties, classes and entities), these terms can be used for querying.

Definition 3 (Term). A term⁹ can be a word or a phrase used to describe a thing or to express a concept [11]. In this work, we consider any literal (\(l \in \mathcal{L}\)) in a KG to be a term.

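Definition 4 (RDFTerm Label). A term associated with an RDFTerm \(\varphi\), denoted by \(L(\varphi)\), is the literal itself if \(\varphi\) is a literal, and the label of \(\varphi\) otherwise. Considering rdfs:label¹⁰ as the labeling property:

\[ \mathit{label}(r) := \{\, l \in \mathcal{L} \mid (r, \texttt{rdfs:label}, l) \in K \,\} \]

\[ L(\varphi) := \begin{cases} \{\varphi\} & \text{if } \varphi \in \mathcal{L}, \\ \mathit{label}(\varphi) & \text{otherwise.} \end{cases} \]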

⁹ Not to be confused with an RDFTerm.

¹⁰ Other labeling properties may also be used.

Although there is no evidence that the previous works were influenced by Hudson's theory, there are models that make use of the KG in order to evaluate the answer [ ]. Figure 1 shows a set of literals associated with the resources in the KG sample. Each resource contains a set of terms \(\mathit{LR}(r)\). These terms are called Resource-Associated Terms and are defined as follows:

Definition 5 (Resource-Associated Terms). The set of terms associated with a resource \(r\), denoted by \(\mathit{LR}(r)\), is the union of all literals as well as the labels of each property and object in the triples in which \(r\) is the subject.

\[ \mathit{LR}(r) := \{\, l \in \mathcal{L} \mid \exists (r, p, o) \in K : \exists \varphi \in \{p, o\} : l \in L(\varphi) \,\} \]
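The RDFTerm label and the resource-associated terms can be computed directly from the set of triples. The following is a minimal sketch on a toy KG resembling Fig. 1. The triples are assumptions made for illustration, in particular the label triples for dbo:artist and for rdfs:label itself, which are not drawn in the figure; literals are marked by surrounding quotes.

```python
RDFS_LABEL = "rdfs:label"

# Toy KG modeled after Fig. 1; literals are written in quotes. The labels of
# the properties (including rdfs:label itself) are assumed to be present.
K = {
    ("e2", RDFS_LABEL, '"Mona Lisa"'),
    ("e2", "dbo:artist", "e1"),
    ("e1", RDFS_LABEL, '"Leonardo da Vinci"'),
    ("dbo:artist", RDFS_LABEL, '"artist"'),
    (RDFS_LABEL, RDFS_LABEL, '"label"'),
}


def is_literal(term):
    return term.startswith('"')


def label(r):
    """All rdfs:label literals of the resource r."""
    return {o.strip('"') for s, p, o in K
            if s == r and p == RDFS_LABEL and is_literal(o)}


def L(phi):
    """RDFTerm label: the literal itself, or the labels of a resource."""
    return {phi.strip('"')} if is_literal(phi) else label(phi)


def LR(r):
    """Resource-associated terms: labels of every property and object of r."""
    terms = set()
    for s, p, o in K:
        if s == r:
            terms |= L(p) | L(o)
    return terms


print(sorted(LR("e2")))
# ['Leonardo da Vinci', 'Mona Lisa', 'artist', 'label']
```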

Example 1 (Resource-Associated Terms). Considering the KG depicted in Figure 1, the triples having the entity e2 as subject are as follows:

1. e2 rdfs:label "Mona Lisa".

2. e2 dbo:artist e1.

The associated terms for e2 are: LR(e2) = {"label", "Mona Lisa", "artist", "Leonardo da Vinci"}.

Definition 6 (Term Network). A Term Network (TN) is a graph whose vertices are labeled with terms.

A KG can be converted to a TN by visiting all vertices and edges and executing the following operations (Fig. 2 shows the TN for Example 1):

1. Labeling edges and non-literal vertices with a copy of their respective labels defined by the labeling property rdfs:label;

2. Converting edges to vertices.

[Figure 2 (image not reproduced): vertices labeled "Mona Lisa", "artist", "Leonardo da Vinci", "label" and "Mona Lisa".]

Fig. 2. Representation of a TN extracted from the triples that have e2 as subject from the KG depicted in Fig. 1.
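The two conversion steps can be sketched as follows. This is an illustrative reading, not the authors' implementation: the label table, the vertex identifiers and the choice of directed edges in the resulting TN are assumptions, and every resource is assumed to carry exactly one label.

```python
RDFS_LABEL = "rdfs:label"

# Assumed labels of the non-literal RDFTerms (one label per term).
LABELS = {
    "e2": "Mona Lisa",
    "e1": "Leonardo da Vinci",
    "dbo:artist": "artist",
    RDFS_LABEL: "label",
}

# The triples with e2 as subject (Example 1); literals are written in quotes.
TRIPLES = [
    ("e2", RDFS_LABEL, '"Mona Lisa"'),
    ("e2", "dbo:artist", "e1"),
]


def term_of(x):
    """Step 1: a literal is its own term, other RDFTerms use their label."""
    return x.strip('"') if x.startswith('"') else LABELS[x]


def to_term_network(triples):
    """Return (vertex_terms, edges) of the TN for the given triples."""
    vertex_terms = {}  # vertex id -> term used as vertex label
    edges = []
    for i, (s, p, o) in enumerate(triples):
        edge_vertex = ("edge", i)  # step 2: the edge becomes a vertex
        vertex_terms[s] = term_of(s)
        vertex_terms[edge_vertex] = term_of(p)
        vertex_terms[o] = term_of(o)
        edges += [(s, edge_vertex), (edge_vertex, o)]
    return vertex_terms, edges


vertex_terms, edges = to_term_network(TRIPLES)
print(sorted(vertex_terms.values()))
# ['Leonardo da Vinci', 'Mona Lisa', 'Mona Lisa', 'artist', 'label']
```

The five vertex labels agree with those visible in Fig. 2; vertices keep separate identifiers because two different vertices may carry the same term.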

The TN of a KG is connected and its paths can have cycles as well as an arbitrary length. In order to simplify the TN and eliminate its ambiguity, the proposed model works on a simplified version of the TN extracted from a structure called Semantic Connected Component (SCC), defined as follows:

Definition 7 (Semantic Connected Component). The Semantic Connected Component (SCC) of an entity \(e\) in an RDF graph \(G\) under a consequence relation \(\models\) is defined as

\[ \mathit{SCC}_{G,\models}(e) := \{ (e, p, o) \mid G \models \{(e, p, o)\} \} \cup \{ (p, \texttt{rdfs:label}, l) \in G \} \cup \{ (o, \texttt{rdfs:label}, l) \in G \}. \]

If the graph and the consequence relation are clear from the context, we use the shorter notation \(\mathit{SCC}(e)\). Within this paper, we use the RDFS entailment consequence relation as defined in its specification¹¹.

Example 2 (Semantic Connected Component). For instance, by RDFS entailment, the entity dbr:Australia is a dbo:PopulatedPlace. The inference is due to dbr:Australia being typed as dbo:Country, which is a subclass of dbo:PopulatedPlace. Considering the running example, the SCC of the entity e2 is SCC(e2) = ({e2, e1, "Mona Lisa"}, {p5, p4}).

[Figure 3 (image not reproduced): the labels shown are "Mona Lisa", "artist" and "Leonardo da Vinci" together with the edge labels rdfs:label and dbo:artist.]

Fig. 3. Representation of the SCC of the entity e2 extracted from the KG depicted in Fig. 1.
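Extracting an SCC can be sketched as a simple filter over the triples. Note the simplifying assumption: the sketch uses plain membership in the graph as the consequence relation, so no RDFS entailment is applied, and the toy triples are again made up to resemble Fig. 1.

```python
RDFS_LABEL = "rdfs:label"

# Toy KG modeled after Fig. 1 (assumed triples; literals written in quotes).
K = {
    ("e2", RDFS_LABEL, '"Mona Lisa"'),
    ("e2", "dbo:artist", "e1"),
    ("e1", RDFS_LABEL, '"Leonardo da Vinci"'),
    ("e1", "rdf:type", "dbo:Person"),
    ("dbo:artist", RDFS_LABEL, '"artist"'),
    ("dbo:Person", RDFS_LABEL, '"Person"'),
    (RDFS_LABEL, RDFS_LABEL, '"label"'),
}


def scc(e, triples):
    """Triples with e as subject plus the label triples of their p and o.

    Simplification: graph membership stands in for the consequence
    relation, i.e. no entailed triples are added.
    """
    outgoing = {t for t in triples if t[0] == e}
    related = {p for _, p, _ in outgoing} | {o for _, _, o in outgoing}
    label_triples = {t for t in triples
                     if t[1] == RDFS_LABEL and t[0] in related}
    return outgoing | label_triples


component = scc("e2", K)
print(len(component))  # 5
```

With an RDFS reasoner in place, the first set would also contain inferred triples such as additional rdf:type statements.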

The structure used for retrieval and ranking is called Semantic Unit (SU). The SU is a tree where the nodes starting from its root node are labeled with tokens and have only one child. Tokens are sub-strings extracted from another string; they are formally defined as follows.

Definition 8 (Token). A token \(t \in \mathcal{T}\) is the result of a tokenizing function \(T : \Sigma^* \to \Sigma^{**}\), which converts a string to a set of tokens.

The root node sub-trees of the SU form a set of paths starting from the resource to

which the SCC is associated, see Fig. 4. The SU is deﬁned as follows:

Definition 9 (Semantic Unit (SU)). The Semantic Unit is a tree where:

– The root node is an entity;

– All vertices in the root node sub-trees have only one child; and

– Vertices in the root node sub-trees are labeled with tokens.

Example 3 (Semantic Unit (SU)). Considering the running example, the SU of the entity e2 is SU(e2) = ({e2, …, v7}, {(e2, v1), (e2, v5), (v1, v2), (v2, v3), (v3, v4), (v5, v6), (v6, v7)})¹² and is depicted in Fig. 4.

An SCC can be converted into an SU as follows:

¹¹ http://www.w3.org/TR/rdf-mt/

¹² The output of the tokenizer used in this example are lowercase lexemes from a literal.

[Figure 4 (image not reproduced): token vertices "label", "mona", "lisa", "artist", "leonardo" and "vinci".]

Fig. 4. Representation of the SU of the entity e2 extracted from the KG depicted in Fig. 1.

1. Converting the sub-trees starting from the root node of the SCC into a TN;

2. Converting the literal vertices to a graph where there is an edge starting from each token to its subsequent one, defined as follows:

\[ G(l) := (T(l), D(l)) \]

\[ D(l) := \{\, (t_1, t_2) \in T(l) \times T(l) \mid \exists i \in \mathbb{N} : (\pi_i(T(l)) = t_1) \wedge (\pi_{i+1}(T(l)) = t_2) \,\} \]

Example 4 (Literal to graph). Converting the term "mona lisa" to a graph.

G("mona lisa") = ({"mona","lisa"},{("mona","lisa")})
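The literal-to-graph conversion of Example 4 can be reproduced with a short Python sketch. The tokenizer here is an assumption: it lowercases the literal and splits on whitespace, which is only one possible choice for the tokenizing function.

```python
def tokenize(literal):
    """Assumed tokenizer: lowercase the literal and split on whitespace."""
    return literal.lower().split()


def literal_graph(literal):
    """Return (tokens, edges) with an edge from each token to its successor."""
    tokens = tokenize(literal)
    edges = {(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)}
    return set(tokens), edges


print(literal_graph("Mona Lisa"))
# ({'mona', 'lisa'}, {('mona', 'lisa')}); the order inside the sets may vary
```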

In the following sections, we start by describing how we retrieve SUs in the KG using the query terms. Later, we discuss how we can efficiently rank them.

4.1 Retrieving

The idea is to select the SUs that have at least one term in common with the query terms. For instance, one possible solution for {"mona", "lisa", "artist"} is the co-occurrence of all terms in an SU. The next possible solution is the co-occurrence of two of the three terms, and so on. Thus, it is necessary to check for the existence of the query terms in different paths. For example, one SU may contain the token "artist" and another the tokens ("mona", "lisa"), see Example 5.

Example 5 (Retrieving "Mona Lisa artist"). In the KG in Fig. 1, the SCC containing the answer for the query {"mona", "lisa", "artist"} is SCC(e2), which can be retrieved by a simple lookup with a SPARQL query.

Query and Resource Labels Analysis. Information retrieval systems for RDF are commonly designed to support full or keyword NL queries. However, converting keywords to full queries is a more challenging task. The *path query approach is designed to deal with keyword or full queries by converting the latter into keyword queries. The conversion of an NL input query to a tuple of keywords consists of applying two known techniques in order: (1) lowercasing and (2) lemmatization. In order to increase the number of matched SUs, the same analysis is applied to the SU labels.

After extracting the SUs, the SCC of the SU’s entity is used for ranking.

4.2 Ranking

Document retrieval approaches are not suitable for RDF because the most important feature of RDF is not the terms, but the relations between the concepts underlying its graph structure. The challenge of adapting the ranking method is measuring the relatedness between the resources in the target KG and the input query terms. As a query rarely matches the resource-associated terms exactly, both are first converted into tokens. Thereafter, the proposed ranking assumes that the probability of a resource being part of an answer correlates with the number of matched tokens between the query and the resource-associated terms. For instance, a query containing "birth date" should be more related to the property dbo:birthDate than to the properties dbo:deathDate or dbpprop:date. The strength is measured by the number of query tokens matching the resource tokens.

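Definition 10 (Resource Matching). A resource matching is a function \(\mathit{MT} : \mathcal{T} \to 2^{\mathcal{R}}\) that maps query tokens \(\mathcal{T} = \{t_1, t_2, t_3, \dots, t_n\}\) to resources, formally defined by \(\mathit{MT}(t)\), where \(\delta\) is a string dissimilarity function and \(\theta \in [0, 1] \subset \mathbb{R}\):

\[ \mathit{MT}(t) := \{\, r \in \mathcal{R} \mid \exists t' \in T(\mathit{LR}(r)) : \delta(t, t') < \theta \,\} \]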

Example 6 (Resource Matching). Let T(q) = {"mona", "lisa", "artist"}. According to Fig. 1, the tokens are mapped to: MT("mona") = {e2}, MT("lisa") = {e2}, MT("artist") = {p4}.
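Resource matching with a dissimilarity threshold can be sketched as below. Several parts are assumptions rather than choices stated in the text: the dissimilarity function is taken as one minus the similarity ratio of Python's `difflib.SequenceMatcher`, the threshold is set to 0.2, the tokenizer lowercases and splits on whitespace, and the resource-associated terms of the toy resources are given as a precomputed table.

```python
from difflib import SequenceMatcher

# Assumed, precomputed resource-associated terms LR(r) for a toy KG that
# resembles Fig. 1 (p4 stands for the property dbo:artist).
LR = {
    "e2": {"label", "Mona Lisa", "artist", "Leonardo da Vinci"},
    "e1": {"label", "Leonardo da Vinci", "type", "Person"},
    "p4": {"label", "artist"},
}


def tokenize(term):
    """Assumed tokenizer: lowercase and split on whitespace."""
    return term.lower().split()


def delta(a, b):
    """Assumed dissimilarity in [0, 1]: 0 means identical strings."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()


def MT(t, theta=0.2):
    """Resources having at least one token closer than theta to t."""
    return {r for r, terms in LR.items()
            if any(delta(t, t2) < theta
                   for term in terms for t2 in tokenize(term))}


for token in ("mona", "lisa", "artist"):
    print(token, sorted(MT(token)))
```

On this toy table, "mona" and "lisa" map to e2 only. Applied literally, the sketch returns both e2 and p4 for "artist", because "artist" is also among the resource-associated terms assumed for e2.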

As the knowledge base is a graph, the resources and literal values are connected by paths

formed by edges and vertices, see Fig. 3.

Example 7 (Path). In the SCC shown in Fig. 3, there are two paths starting from the entity e2, as follows: γ1 = ((e2, "Mona Lisa")) and γ2 = ((e2, e1)).

Furthermore, resources belonging to a path between one resource to another are

labeled (e.g.

rdfs:label

). Therefore, it is possible to explore the terms associated to

the entity’s paths to determine its relevance.

Definition 11 (Path terms). Path terms are the set of all literals in the path \(\gamma\), defined as follows:

\[ \mathit{LP}(\gamma) := \{\, l \mid \exists \varphi \in \gamma : l \in L(\varphi) \,\} \]

Example 8 (Path terms). For Example 7, the sets of associated terms for the two given paths are as follows: LP(γ1) = {"label", "Mona Lisa"} and LP(γ2) = {"artist", "Leonardo da Vinci"}.

Thus, the relevance score of an entity depends on the number of matched terms in its

associated paths. The higher the number of matched terms, the higher the relevance of

the entity. Furthermore, if a term matches multiple paths of an entity, it is only attributed

to the path with the highest number of matched terms. The relevance score of an entity

is the sum of all individual path scores; it is measured by the Semantic Weight Model

(SWM), which is formally deﬁned as follows.

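Definition 12 (Semantic Weight Model (SWM)). Each token \(t \in T(q)\) is first mapped to the paths of the SCC \(S\). The set of matched tokens from a path \(\gamma\) is returned by the function \(\mathit{TP}(\gamma, q)\). A path match of an SCC \(S\) is evaluated by the function \(\mathit{MTP}(\gamma, q, S)\) using a path weighting function \(w : D^+ \to \mathbb{R}\).

\[ \mathit{TP}(\gamma, q) := \{\, t \in T(\mathit{LP}(\gamma)) \mid \exists t' \in T(q) : \delta(t, t') < \theta \,\} \]

\[ \mathit{MTP}(\gamma, q, S) := \{\, t \in \mathit{TP}(\gamma, q) \mid \forall \gamma' \in D(S)^+ : w(\gamma)\,|\mathit{TP}(\gamma, q)| \ge w(\gamma')\,|\mathit{TP}(\gamma', q)| \,\} \]

The final score of an SCC \(S\) is the sum of its path scores and is measured by the function \(\mathit{score}(S)\), as follows:

\[ \mathit{score}(S) = \sum_{\gamma \in D(S)^+} \begin{cases} w(\gamma)\,|\mathit{TP}(\gamma, q)| & \text{if } \mathit{MTP}(\gamma, q, S) \neq \emptyset, \\ 0 & \text{otherwise.} \end{cases} \]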

In case there are terms matching multiple paths and the paths have equal number of

matched terms and equal score, only one of the path scores is added to the SCC score.
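The scoring can be illustrated with a small Python sketch. It follows one possible reading of the prose above, not necessarily the authors' implementation: paths are visited in descending order of their weighted match count, and a path contributes its score only if it matches at least one query token that no higher-scoring path has matched. The weights, the exact-match comparison (in place of a dissimilarity threshold) and the path tables are all assumptions.

```python
# Hypothetical weights that respect w(type) > w(label) > w(other).
WEIGHTS = {"type": 1.2, "label": 1.1, "other": 1.0}

# Assumed paths of two SCCs as (kind of property, path terms LP) pairs.
PATHS_E2 = [("label", {"label", "Mona Lisa"}),
            ("other", {"artist", "Leonardo da Vinci"})]
PATHS_E1 = [("label", {"label", "Leonardo da Vinci"}),
            ("type", {"type", "Person"})]


def tokenize(text):
    """Assumed tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def TP(path_terms, query_tokens):
    """Query tokens found on a path (exact match as a simplification)."""
    path_tokens = {t for term in path_terms for t in tokenize(term)}
    return path_tokens & set(query_tokens)


def score(paths, query):
    """Sum the scores of paths that are the best match for some token."""
    query_tokens = tokenize(query)
    scored = []
    for kind, terms in paths:
        matched = TP(terms, query_tokens)
        scored.append((WEIGHTS[kind] * len(matched), matched))
    scored.sort(key=lambda item: item[0], reverse=True)
    total, attributed = 0.0, set()
    for path_score, matched in scored:
        if matched - attributed:  # some token is not yet attributed
            total += path_score
            attributed |= matched
    return total


print(round(score(PATHS_E2, "Mona Lisa artist"), 2))  # 3.2
print(score(PATHS_E1, "Mona Lisa artist"))            # 0.0
```

Under these assumptions the SCC of e2 scores 2 × 1.1 for the label path plus 1 × 1.0 for the artist path, while the SCC of e1 matches no query token.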

The SWM assigns different weights based on the RDF properties on the path. This

means that the weight of a term in a path is determined by the type of the properties (label,

is-a relation, other) on that path and it acts as a tiebreaker for the paths with equal number

of tokens. The weight hierarchy of paths is constructed to allow the exploration of the

KG by querying entities by type, label, predicates and objects. Since terms extracted

from resources can have overlaps, there is a need for providing a disambiguation method.

Weighting: In the following, we first explain the rationale behind the defined weights and then illustrate it with examples.

Is-a relation. The problem is that tokens can exist in different paths of an SCC. Consequently, a token in an is-a relation property can also exist in other properties. However, a property such as an entity label references the entity itself, while an is-a relation references classes of entities. In this case, if a query intends to select a specific class of entities, other entities can be retrieved by mistake. Thus, it is important to provide an efficient method to disambiguate between classes and entities. To alleviate this problem, the weight of the paths containing an is-a relation property is set higher than that of other paths. The selection of a specific entity can then be done by building a more precise query, because besides the entity's label, other properties can be used to disambiguate. For instance, in case a class and an entity have the same label, the user can add a term from another property of the entity. Therefore, the highest weight is assigned to paths with an is-a relation property \(\gamma_t\), i.e. the paths containing rdf:type.

Entity label. The second highest weight is assigned to labeling property paths \(\gamma_l\), i.e. the paths containing the rdfs:label property, and those are assigned higher values than other property paths \(\gamma_o\). Entities can be referenced multiple times in a KG, but when a query contains an entity label, it is more likely that it is looking for the entity than for its references (an object instance). Therefore, to prevent entities with references from being ranked higher than the entity itself, the weight of a path with a labeling property is set higher than that of a path with another property. Despite the different weights, we still want a higher number of matched tokens to score higher in practical cases, i.e. \(n + 1\) matched tokens should score higher than \(n\) matched tokens for reasonably low \(n\), as expressed by Inequality (1). In the following, the model is explained using examples.

\[ (n+1)\,w(\gamma_t) > (n+1)\,w(\gamma_l) > (n+1)\,w(\gamma_o) > n\,w(\gamma_t) > n\,w(\gamma_l) > n\,w(\gamma_o) \tag{1} \]
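Inequality (1) bounds how far the three weights may lie apart: the binding condition is \((n+1)\,w(\gamma_o) > n\,w(\gamma_t)\), which holds only for \(n < w(\gamma_o) / (w(\gamma_t) - w(\gamma_o))\). The following sketch checks the full chain for a hypothetical set of weights (1.2, 1.1 and 1.0; these values are not taken from the paper) and uses exact fractions to avoid floating-point artifacts.

```python
from fractions import Fraction


def satisfies_eq1(w_t, w_l, w_o, n):
    """Check the chain of strict inequalities in (1) for a given n."""
    chain = [(n + 1) * w_t, (n + 1) * w_l, (n + 1) * w_o,
             n * w_t, n * w_l, n * w_o]
    return all(a > b for a, b in zip(chain, chain[1:]))


def max_n(w_t, w_l, w_o, limit=1000):
    """Largest n (up to limit) such that (1) holds for all 1..n."""
    largest = 0
    for n in range(1, limit + 1):
        if not satisfies_eq1(w_t, w_l, w_o, n):
            break
        largest = n
    return largest


# Hypothetical weights: type 1.2, label 1.1, other 1.0.
w_t, w_l, w_o = Fraction(6, 5), Fraction(11, 10), Fraction(1)
print(max_n(w_t, w_l, w_o))  # 4
```

With these weights the chain holds up to four matched tokens; for n = 5, six tokens on an ordinary path (6 × 1.0) no longer outscore five tokens on a type path (5 × 1.2). Bringing the weights closer together raises this bound.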

Case 1: Querying by entity label. For the query "Rio de Janeiro", the SWM should consider the DBpedia entity dbpedia:Rio de Janeiro as the best answer, although the DBpedia entity dbpedia:Tom Jobim has the DBpedia property dbpprop:birthPlace referencing the entity dbpedia:Rio de Janeiro. For the term "The" in a query, the model will consider as possible answers the entities dbpedia:The Simpsons and dbpedia:The Beatles rather than the DBpedia property dbpprop:The GIP.

Case 2: Querying by is-a relation. Considering the query "place", the implemented SWM will prefer the data type dbo:Place instead of the property dbo:place.

Case 3: Querying by other properties. Let us consider the case that the query is "birth place" rather than "place" as in the previous example. As the number of matching terms in the property dbo:birthPlace is higher than for the data type dbo:Place, the weight of dbo:birthPlace will consequently be higher than that of the data type.

5 Experimental Evaluation

We evaluate the performance of *path in comparison to the state-of-the-art SemS system as well as QA systems in terms of Precision, Recall and F-measure. To the best of our knowledge, this is the first time that the precision of both approaches is measured on the same benchmark.

Benchmark. Several benchmarks can be used to measure the precision of our approach, including benchmarks from the initiatives SemSearch [ ]¹³ and QA Over Linked Data (QALD)¹⁴. SemSearch is based on user queries extracted from the YAHOO! search log, with an average of 2.2 words per query. QALD provides both QA and keyword search benchmarks for RDF data that aim to evaluate the extrinsic behavior of systems. The QALD benchmarks are the most suitable for our evaluation due to the wide variety of queries they contain and also because they make use of DBpedia, a very large and diverse dataset. In this work, we use the openQA framework [9] over the newest version of the QALD benchmark compatible with the framework, QALD version 4 (QALD-4) [17]. The proposed approach was compared with Glimmer because it is the best performing SemS system and it is open-source, which allows us to evaluate its performance.

¹³ http://km.aifb.kit.edu/ws/semsearch10/

¹⁴ http://greententacle.techfak.uni-bielefeld.de/~cunger/qald/

Results. Table 1 shows the performance of *path in comparison to Glimmer, the state-of-the-art SemS system [1], and all participating QA systems in the multilingual challenge of the QALD-4 benchmark.

System    | P    | R    | F1   | Approach
Xser      | 0.71 | 0.72 | 0.72 | QA
gAnswer   | 0.37 | 0.37 | 0.37 | QA
CASIA     | 0.40 | 0.32 | 0.36 | QA
Intui3    | 0.25 | 0.23 | 0.24 | QA
ISOFT     | 0.26 | 0.21 | 0.23 | QA
*path     | 0.19 | 0.19 | 0.19 | SemS
RO FII    | 0.12 | 0.12 | 0.12 | QA
GlimmerY! | 0.07 | 0.07 | 0.07 | SemS

Table 1. Precision (P), Recall (R) and F-measure (F1) achieved by different SemS and QA systems in the QALD-4 Multilingual Challenge. The systems are Glimmer, *path, SINA, TBSL and all QALD-4 participating systems.

Discussion. The proposed approach is faster than the best SemS system participating in SemSearch'10. The main reason is that Glimmer builds an index without reasoning, which imposes constraints on the precision (Table 1). The index without reasoning is a core limitation of Glimmer, since the user cannot query by using terms from properties or from entity objects. For instance, in Case 1 in Section 4.2, Glimmer fails to retrieve dbpedia:Tom Jobim because the terms of the entity dbpedia:Rio de Janeiro belonging to the property dbpprop:birthPlace are not indexed. The same occurs for the data type in Case 2, where the type is also given by a non-literal object. However, the F-measure of *path is considerably lower (0.52) in comparison with the best performing QA system in QALD-4. This drawback arises because *path does not target the treatment of complex queries, i.e., queries that require the use of aggregations, restrictions or solution modifiers to be answered.

6 Conclusion, Limitations & Future Work

We have presented a novel ranking method for SemS over KGs. The results of an experimental study show a significant improvement in comparison to the state-of-the-art SemS system. Furthermore, the approach achieves comparable precision when compared with QA systems. There are a few challenges not addressed in the current implementation, such as complex queries [ ]. In future work, we plan to improve the precision of this approach by addressing the mentioned challenges. Furthermore, we plan to investigate indexing techniques. We see this work as the first step of a larger research agenda for SemS over Linked Data.

Acknowledgements: This work was supported by a grant from the EU H2020 Framework Programme provided for the projects Big Data Europe (GA no. 644564) and HOBBIT (GA no. 688227), and by CNPq under the program Ciências Sem Fronteiras.

References

Blanco, R., Mika, P., Vigna, S.: Effective and efﬁcient entity search in RDF data. In: The

Semantic Web–ISWC 2011. Springer, Berlin Heidelberg (2011)

Cheng, G., Qu, Y.: Searching Linked Objects with Falcons: Approach, Implementation and

Evaluation. Int. J. Semantic Web Inf. Syst. 5(3), 49–70 (2009)

Ding, L., Finin, T., Joshi, A., Pan, R., Cost, R.S., Peng, Y., Reddivari, P., Doshi, V.C., Sachs,

J.: Swoogle: A Search and Metadata Engine for the Semantic Web. In: Proceedings of

the Thirteenth ACM Conference on Information and Knowledge Management (CIKM). pp.

652–659. ACM (2004)

Halpin, H., Herzig, D.M., Mika, P., Blanco, R., Pound, J., Thompson, H.S., Tran, D.T.:

Evaluating Ad-hoc Object Retrieval. In: Proceedings of the International Workshop on Evalu-

ation of Semantic Technologies (IWEST 2010). 9th International Semantic Web Conference

(ISWC2010), Shanghai, PR China (November 2010)

offner, K., Walter, S., Marx, E., Usbeck, R., Lehmann, J., Ngonga Ngomo, A.C.: Survey

on challenges of Question Answering in the Semantic Web. Submitted to the Semantic Web

Journal (2016)

Hudson, R.A.: Language networks: The new word grammar. Oxford linguistics, Oxford

University Press (2007)

Luhn, H.P.: A statistical approach to mechanized encoding and searching of literary informa-

tion. IBM Journal of research and development 1(4), 309–317 (1957)

Mangold, C.: A survey and classiﬁcation of semantic search approaches. International Journal

of Metadata, Semantics and Ontologies 2(1), 23–34 (2007)

Marx, E., Usbeck, R., Ngomo Ngonga, A.C., H

offner, K., Lehmann, J., Auer, S.: Towards an

open Question Answering architecture. In: SEMANTiCS 2014 (2014)

10.

Oren, E., Delbru, R., Catasta, M., Cyganiak, R., Stenzhorn, H., Tummarello, G.: Sindice.com:

a document-oriented lookup index for open linked data. IJMSO 3(1), 37–52 (2008)

11. Pearsall, J., Hanks, P., Soanes, C., Stevenson, A. (eds.): Oxford Dictionary of English (Kindle Edition) (2010)

12. Reisberg, D.: Cognition: Exploring the science of the mind. Norton, New York (1997)

13. de Saussure, F.: Course in General Linguistics. McGraw-Hill, New York (1959), translated by Wade Baskin

14. Shekarpour, S., Marx, E., Ngomo, A.C.N., Auer, S.: SINA: Semantic interpretation of user queries for Question Answering on interlinked data. Journal of Web Semantics 30, 39–51 (2015)

15. Sparck Jones, K.: A statistical interpretation of term specificity and its application in retrieval. Journal of documentation 28(1), 11–21 (1972)

16. Tummarello, G., Cyganiak, R., Catasta, M., Danielczyk, S., Delbru, R., Decker, S.: Sig.ma: Live views on the web of data. J. Web Sem. 8(4), 355–364 (2010)

17. Unger, C., Forascu, C., Lopez, V., Ngomo, A.C.N., Cabrio, E., Cimiano, P., Walter, S.: Question Answering over Linked Data (QALD-4). In: Working Notes for CLEF 2014 Conference (2014)

18. Virgilio, R.D., Maccioni, A.: Distributed Keyword Search over RDF via MapReduce. In: The Semantic Web: Trends and Challenges. pp. 208–223. Springer, Berlin Heidelberg, Germany (2014)

19. Wang, H., Liu, Q., Penin, T., Fu, L., Zhang, L., Tran, T., Yu, Y., Pan, Y.: Semplore: A scalable IR approach to search the Web of Data. Journal of Web Semantics 7(3) (Sep 2009)

20. Zhang, L., Liu, Q., Zhang, J., Wang, H., Pan, Y., Yu, Y.: Semplore: An IR approach to scalable hybrid query of Semantic Web data. In: The Semantic Web: 6th International Semantic Web Conference. Springer, Berlin Heidelberg, Germany (2007)
