Query Rewriting for an Incremental Search
in Heterogeneous Linked Data Sources
Ana I. Torre-Bastida1, Jesús Bermúdez2, Arantza Illarramendi2,
Eduardo Mena3, and Marta González1
1 Tecnalia Research & Innovation
{isabel.torre,marta.gonzalez}@tecnalia.com
2 Departamento de Lenguajes y Sistemas Informáticos, UPV-EHU
{a.illarramendi,jesus.bermudez}@ehu.es
3 Departamento de Informática e Ingeniería de Sistemas, Univ. Zaragoza
emena@unizar.es
Abstract. Nowadays, the number of linked data sources available on the Web is considerable. In this scenario, users are interested in frameworks that help them to query those heterogeneous data sources in a friendly way, thus avoiding the need to be aware of the technical details related to the heterogeneity and variety of data sources. With this aim, we present a system that implements an innovative query approach that obtains results to user queries in an incremental way. It sequentially accesses different datasets, expressed with possibly different vocabularies. Our approach enriches previous answers each time a different dataset is accessed. Mapping axioms between datasets are used for rewriting the original query and so obtaining new queries expressed with terms in the vocabularies of the target dataset. These rewritten queries may be semantically equivalent or they could result in a certain semantic loss; in this case, an estimation of the loss of information incurred is presented.
Keywords: Semantic Web, Linked Open Data Sources, SPARQL query,
vocabulary mapping, query rewriting.
1 Introduction
In recent years an increasing number of open RDF data sources have been emerging, partly due to the existence of techniques to convert non-RDF data sources into RDF ones, supported by initiatives like the Linking Open Data (LOD) project1 with the aim of creating a "Web of Data". The Linked Open Data cloud diagram2 shows the datasets that have been published in Linked Data format (around 338 datasets by 20133), and this diagram is continuously growing. Moreover, although those datasets follow the same representation format, they can deal with heterogeneous vocabularies to name the resources.
1 http://www.w3.org/wiki/SweoIG/TaskForces/CommunityProjects/LinkingOpenData
2 http://lod-cloud.net/state/
3 http://datahub.io/lv/group/lodcloud?tags%3Dno-vocab-mappings
In that scenario, users find it difficult to take advantage of the contents of many of those datasets because they get lost in their quantity and variety. For example, a user who is only familiar with the BNE (Biblioteca Nacional de España)4 dataset vocabularies could be interested in accessing the BNB (British National Bibliography)5, DBpedia6, or VEROIA (Public Library of Veroia)7 datasets in order to find more information. However, not being familiar with the vocabularies of those datasets may dissuade him from trying to query them.
So, taking into account the significant volume of Linked Data being published on the Web, numerous research efforts have been oriented to find new ways to exploit this Web of Data. Those efforts can be broadly classified into three main categories: Linked Data browsers, Linked Data search engines, and domain-specific Linked Data applications [1]. The proposal presented in this paper can be considered under the category of Linked Data search engines and, more particularly, under human-oriented Linked Data search engines, where we can find other approaches such as Falcons8 and SWSE9, amongst others. Nevertheless, the main difference of our proposal with respect to existing engines lies in the fact that it provides the possibility of obtaining a broader response to a query formulated by a user by combining the following two aspects: (1) an automatic navigation through different datasets, one by one, using mappings defined among datasets; and (2) a controlled rewriting (generalization/specialization) of the query formulated by the user according to the vocabularies managed by the target dataset.
In summary, the novel contribution of this paper is the development of a system that provides the following main advantages:
- A greater number of datasets at the user's disposal. Using our system the user can gain access to more datasets without bothering to know the existence of those datasets or the heterogeneous vocabularies that they manage. The system manages the navigation through the different datasets.
- Incremental answer enrichment. By accessing different datasets the user can obtain more information of interest. For that, the system manages existing mapping axioms between datasets.
- Exact or approximate answers. If the system is not capable of obtaining a semantically equivalent rewriting for the query formulated by the user, it will try to obtain a related query by generalizing/specializing that query and it will provide information about the loss in precision and/or recall with respect to the original query.
In the rest of the paper, we first present some related works in section 2. Then, we introduce an overview of the query processing approach in section 3. We follow with a detailed explanation of the query rewriting algorithm and with a brief presentation of how the information loss is measured in sections 4 and 5. Finally, we end with some conclusions in section 6.
4 BNE - (http://datos.bne.es/sparql)
5 BNB - (http://bnb.data.bl.uk/sparql)
6 DBpedia - (http://wiki.dbpedia.org/Datasets)
7 VEROIA - (http://libver.math.auth.gr/sparql)
8 http://ws.nju.edu.cn/falcons/objectsearch/index.jsp
9 http://swse.deri.org/
2 Related Works
With the growth of the Semantic Web, SPARQL query processing over heterogeneous data sources has become an active research field. Some systems (such as DARQ [10] and FedX [12]) consider federated approaches over distributed data sources with the ultimate goal of virtual integration. One main difference with our proposal is that they focus on top-down strategies where the relevant sources are known, while in our proposal the sources are discovered during the query processing. Nevertheless, one main drawback for query processing over heterogeneous data is that existing mapping axioms are scarce and very simple (most of them are of the owl:sameAs type).
Even closer to our approach are the works related to SPARQL query rewriting. Some of them, such as [7] and [2], support the query rewriting with mapping axioms described by logic rules that are applied to the triple patterns that compose the query; [7] uses a quite expressive specific mapping language based on Description Logics and [2] uses less expressive Horn clause-like rules. In both cases, the mapping language is much more expressive than what is usually found in dataset metadata (for instance, VoID10 linksets) and the approach does not seem to scale up well due to the hard work needed to define that kind of mapping.
Another approach to query rewriting is query relaxation, which consists of reformulating the triple patterns of a query to retrieve more results without excessive loss in precision. Examples of that approach are [6] and [3]. Each work presents a different methodology for defining some types of relaxation: [6] uses vocabulary inference on triple patterns and [3] uses a statistical language modeling technique that allows them to compute the similarity between two entities. Both of them define a ranking model for the presentation of the query results. Although our proposal shares with them the goal of providing more results to the user, they are focused more on generalizing the query, while we are focused on rewriting the query trying to preserve the meaning of the original query as much as possible, and so generalizing or specializing parts of the query when necessary in the new context.
The query rewriting problem is also considered in [5], but we differ in the way we face it. In our proposal we cope with existing mapping axioms that relate different vocabularies, and we make the most of them in the query rewriting process. In contrast, [5] disregards such mapping axioms and looks for results in the target dataset by evaluating similarity with an Entity Relevance Model (ERM) calculated with the results of the original query. The calculation of the ERM is based on the number of word occurrences in the results obtained, which are later used as keywords for evaluating similarity. The strength of this method turns into its weakness in some scenarios, because there are datasets that make abundant use of codes to identify their entities and those strings do not help as keywords.
10 http://vocab.deri.ie/void
Finally, query rewriting has also been extensively considered in the area of ontology matching [4]. A distinguishing aspect of our system is the measurement of information loss. In order to compute it, we adapt the approach presented in [11] and further elaborated in [9] to estimate the information loss incurred when a term is substituted by an expression.
3 An Overview of the Query Processing Approach
In this section we first present some terminology that will be used throughout the rest of the paper. Then we show the main steps followed by the system to provide an answer to a query formulated by the user. Finally, we present a brief specification of the main components of the system that are involved in the process of providing the answer.
With respect to the terminology used, we consider datasets that are modeled as RDF graphs. An RDF graph is a set of RDF triples. An RDF triple is a statement formed by a subject, a predicate and an object [8]. Elements in a triple are represented by IRIs, and objects may also be represented by literals. We use term for any element in a triple. Each dataset is described with terms from a declared vocabulary set. We use target dataset for the dataset over which the query is going to be evaluated, and target vocabulary set for its declared vocabulary set.
SPARQL queries11 are made of graph patterns. A graph pattern is a query expression made of a set of triple patterns. A triple pattern is a triple where any of its elements may be a variable. When a triple pattern of a query is expressed with terms of the target vocabulary set, we say that the triple pattern is adequate for the target dataset. When every triple pattern of a query is adequate for a target dataset, we say that the query is adequate for that target dataset.
The original query is expressed with terms from a source vocabulary set. Let us call T the set of terms used by the original query. As long as every term in T belongs to a vocabulary in the target vocabulary set, the original query is adequate for the target dataset and the query can be properly processed over that dataset. However, if there were terms in T not appearing in the target vocabulary set, the triple patterns of the original query including any such term should be rewritten into appropriate graph patterns, with terms taken from the target vocabulary set, in order to obtain an adequate query for the target dataset.
Terms in T appearing in synonymy mapping axioms (i.e. expressed with any of the properties owl:sameAs, owl:equivalentClass, owl:equivalentProperty) with a term in the target vocabulary set can be directly replaced by the synonym term. Those terms in T not appearing in the target vocabulary set and not appearing in synonymy mapping axioms with terms in the target vocabulary set are called conflicting terms. Since there is no guarantee of enough synonymy mapping axioms between the source and target vocabulary sets to allow a semantics-preserving rewriting of the original query into an adequate query for the target vocabulary, we must cope with query rewritings that incur some loss of information. The goal of the query rewriting algorithm is to replace every triple pattern including conflicting terms with a graph pattern adequate for the target dataset.
11 http://www.w3.org/TR/rdf-sparql-query/
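To make the adequacy check concrete, the following minimal Python sketch (our own illustration, not the system's actual code) partitions the terms of a query into adequate, synonym-replaceable and conflicting ones. The data structures, and the synonymy axiom assumed for foaf:maker in the running example of Section 3.1, are illustrative assumptions.

# Minimal sketch: classify the terms T of a query against a target vocabulary
# set, given the available synonymy axioms. All names are illustrative.
def partition_terms(query_terms, target_vocabulary, synonym_axioms):
    """synonym_axioms maps a source term to an equivalent target term
    (owl:sameAs, owl:equivalentClass or owl:equivalentProperty)."""
    adequate, replaceable, conflicting = set(), {}, set()
    for t in query_terms:
        if t in target_vocabulary:
            adequate.add(t)                     # already adequate for the target dataset
        elif synonym_axioms.get(t) in target_vocabulary:
            replaceable[t] = synonym_axioms[t]  # direct one-to-one replacement
        else:
            conflicting.add(t)                  # needs the rewriting algorithm of Section 4
    return adequate, replaceable, conflicting

# Running example of Section 3.1, with DBpedia as the target dataset:
terms = {"foaf:maker", "dblp:Tim_Berners-Lee"}
target = {"dbpedia-owl:author", "dbpedia:Tim_Berners-Lee"}
synonyms = {"dblp:Tim_Berners-Lee": "dbpedia:Tim_Berners-Lee",
            "foaf:maker": "dbpedia-owl:author"}   # assumed synonymy axiom for illustration
print(partition_terms(terms, target, synonyms))
# -> both query terms end up as replaceable; nothing is conflicting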
3.1 Main Query Processing Steps
The query which we will use as a running example is "Give me resources whose author is Tim Berners-Lee". The steps followed to answer that query are presented next:
1. The user formulates the query using a provided GUI. For that, he uses terms that belong to a vocabulary that he is familiar with (for example, the DBLP and FOAF vocabularies in this case). Notice that the user is not required to know the SPARQL language for RDF; he only needs to know the terms dblp:Tim_Berners-Lee and foaf:maker from the DBLP and FOAF vocabularies. The system produces the following query:
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dblp: <http://dblp.l3s.de/d2r/resource/authors/>
SELECT ?resource WHERE { ?resource foaf:maker dblp:Tim_Berners-Lee }
2. The system asks the user for the name of a dataset in which he is interested in finding the answer. If the user does not provide any specific name, then the system shows the user different possible datasets that belong to the same domain (e.g., the bibliographic domain). If the user does not select any of them, then the system selects one. Following the previous example, we assume that the user selects the DBpedia dataset among those presented by the system.
3. The system first tries to find the query terms in the selected dataset. If it finds them, it runs the query processing. Otherwise the system tries to rewrite the query formulated by the user into another equivalent query using mapping axioms. At this point two different situations may happen:
(a) The system finds synonymy mapping axioms, defined between the source and target vocabularies, that allow it to rewrite each term of the query into an equivalent term in the target vocabulary (for instance, mapping axioms of the type dblp:Tim_Berners-Lee owl:sameAs dbpedia:Tim_Berners-Lee). Following the previous example, the property foaf:maker is replaced with dbpedia-owl:author. The rewritten query is the following:
PREFIX dbpedia: <http://dbpedia.org/resource/>
PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
SELECT ?resource WHERE { ?resource dbpedia-owl:author dbpedia:Tim_Berners-Lee }
Then the system obtains the answer by querying the DBpedia dataset and shows the answer to the user through the GUI. The results obtained by the considered query are:
http://dbpedia.org/resource/Tabulator
http://dbpedia.org/resource/Weaving_the_Web:_The_Original_Design_and_Ultimate_Destiny_of_the_World_Wide_Web_by_its_inventor
(b) The system does not find synonymy mapping axioms for every term in the original query. In this case, each triple pattern including a conflicting term is replaced with a graph pattern until an adequate query is obtained. In sections 4 and 5 we present the algorithm used for the rewriting and an example that illustrates its behaviour, respectively.
4. The system asks the user if he is interested in querying another dataset. If the answer is No, the process ends. If the answer is Yes, the process returns to step 2.
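The following sketch illustrates how steps 2 to 4 could be wired together. It is only an illustration: is_adequate and rewrite_query stand for the components described in Section 3.2 and Section 4 and are not a real API; only the endpoint call uses a real library (SPARQLWrapper).

# Illustrative sketch of the incremental loop (steps 2-4).
from SPARQLWrapper import SPARQLWrapper, JSON

def run_query(query, endpoint_url):
    endpoint = SPARQLWrapper(endpoint_url)
    endpoint.setQuery(query)
    endpoint.setReturnFormat(JSON)
    return endpoint.query().convert()["results"]["bindings"]

def incremental_search(original_query, candidate_datasets, is_adequate, rewrite_query):
    answers = []
    for name, endpoint_url in candidate_datasets:        # step 2: one dataset at a time
        if is_adequate(original_query, name):            # step 3: query terms found as-is
            query = original_query
        else:
            query = rewrite_query(original_query, name)  # step 3(a)/(b): rewrite via mapping axioms
        if query is None:
            continue                                     # no adequate rewriting was obtained
        answers.extend(run_query(query, endpoint_url))   # enrich the previously obtained answers
        # step 4: in the actual system the user decides whether to try another dataset
    return answers

# Example endpoint list (DBpedia is one of the datasets used in this paper):
# candidates = [("DBpedia", "http://dbpedia.org/sparql")]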
3.2 System Modules
In order to accomplish the steps presented in the previous subsection, the system relies on the following modules:
- Input/Output Module. This module manages a GUI that facilitates, on the one hand, the task of querying the datasets using some predefined forms; and, on the other hand, presents the obtained answer with a friendly appearance.
- Rewriting Module. This module is in charge of two main tasks: Query analysis and Query rewriting. The Query analysis consists of parsing the query formulated by the user and obtaining a tree model. For this task, the Query Analyzer module implemented with ARQ12 is used. In this task the datasets that belong to the domain considered in the query are also selected. Concerning Query rewriting, we have developed an algorithm (explained in section 4) that rewrites the query expressed using a source vocabulary into an adequate query. The algorithm makes use of mapping axioms expressed as RDF triples, which can be obtained through SPARQL endpoints or RDF dumps (see the sketch after this list). The mapping axioms we are considering in this paper are those triples whose subject and object are from different vocabularies and whose predicate is one of the following terms: owl:sameAs, rdfs:subClassOf, rdfs:subPropertyOf, owl:equivalentClass, and owl:equivalentProperty. Future work will consider a broader set of properties for the mapping axioms.
- Evaluation Module. Taking into account that different rewritings could be possible for a query, the goal of this module is to evaluate those different rewritings and to select the one that incurs the least information loss. For that it handles some defined metrics (see section 5.1) and the information stored in the VoID statistics of the considered datasets.
- Processing Module. Once the best query rewriting is selected, this module is in charge of obtaining the answer for the query by accessing the corresponding dataset.
12 Apache Jena/ARQ (http://jena.apache.org/documentation/query/index.html)
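As an illustration of how the Rewriting Module could collect mapping axioms from a SPARQL endpoint, the following sketch retrieves triples whose predicate is one of the five properties listed above and whose subject and object belong to different namespaces. The endpoint, the result limit and the namespace heuristic are assumptions of this sketch, not part of the described system.

# Illustrative sketch: fetch candidate mapping axioms from a SPARQL endpoint.
from SPARQLWrapper import SPARQLWrapper, JSON

MAPPING_QUERY = """
PREFIX owl:  <http://www.w3.org/2002/07/owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?s ?p ?o WHERE {
  VALUES ?p { owl:sameAs rdfs:subClassOf rdfs:subPropertyOf
              owl:equivalentClass owl:equivalentProperty }
  ?s ?p ?o .
}
LIMIT 10000
"""

def namespace(iri):
    # rough stand-in for "vocabulary of the term"
    return iri.rsplit("#", 1)[0] if "#" in iri else iri.rsplit("/", 1)[0]

def fetch_mapping_axioms(endpoint_url):
    endpoint = SPARQLWrapper(endpoint_url)
    endpoint.setQuery(MAPPING_QUERY)
    endpoint.setReturnFormat(JSON)
    rows = endpoint.query().convert()["results"]["bindings"]
    # keep only axioms whose subject and object come from different vocabularies
    return [(r["s"]["value"], r["p"]["value"], r["o"]["value"])
            for r in rows
            if namespace(r["s"]["value"]) != namespace(r["o"]["value"])]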
4 Query Rewriting Algorithm
In this section we present the query rewriting algorithm. Its foundation is a graph traversal algorithm that looks for the nearest terms (belonging to the target dataset) of a conflicting term.
We follow two guiding principles for the replacement of conflicting terms: (1) a term can be replaced with the conjunction of its directly subsuming terms; (2) a term can be replaced with the disjunction of its directly subsumed terms. These guiding principles are recursively followed until adequate expressions are obtained.
A distinguishing feature of our working scenario is that the source and target vocabulary sets are not necessarily fully integrated. Notice that datasets are totally independent from one another and our system is only allowed to access them through their particular web services (SPARQL endpoint or programmatic interface). Therefore, our system depends only on the declared vocabulary sets and the published mapping axioms. Inferred relationships between terms are not taken into account unless the target system provides them. We are aware of the limitations of this consideration, but we think that it is quite a realistic scenario nowadays.
In the following, we present the algorithm that obtains an adequate query expression for the target dataset with the minimum loss of information with respect to the original query Q, measured by our proposed metrics.
First of all, the original query Q is decomposed into triple patterns, which in turn are decomposed into the collection of terms T. This step is represented in line 4 of the displayed listing of the algorithm. Notice that variables are not included in T; variables are maintained unchanged in the rewritten query. Nor are literal values included in T; literal values are processed by domain-specific transformer functions that take into account structure, units and measurement systems.
Then, for each term in T, a collection of expressions is constructed and gathered with the term. Each expression represents a possible substitution of the triple pattern including the conflicting term with a graph pattern adequate for the target dataset (see lines 5 to 10 in the algorithm). Considering the expressions associated with each term, the set of all possible adequate queries is constructed (line 12), the information loss of each query is measured, and the query with the least loss is selected (line 14).
The core of the algorithm is the Rewrite routine (line 7), which examines the source and target vocabularies, with their respective mapping axioms, in order to discover possible substitutions for a given term in a source vocabulary. Let us consider terms in a vocabulary as nodes in a graph and relationships between terms (specifically rdfs:subClassOf, rdfs:subPropertyOf, owl:equivalentClass, owl:equivalentProperty, and owl:sameAs) as directed labeled edges between nodes. Notice that, due to mapping axioms between two vocabularies, we can consider those vocabularies as parts of the same graph. The Rewrite routine performs a variation of a Breadth First Search traversal from a conflicting term, looking for its frontier of terms that belong to a target vocabulary. A term f belongs to the frontier of a term t if it satisfies the following three conditions: (a) f belongs to a target vocabulary, (b) there is a trail from t to f, and (c) there is no other term g (different from f) belonging to a target vocabulary in that trail. A trail from a node t to a node f is a sequence of edges that connects t and f independently of the direction of the edges. For instance, t rdfs:subClassOf r, f rdfs:subClassOf r is a trail from t to f.
Although a trail admits traversing edges in whatever direction, our algorithm keeps track of the pair formed by each node in the trail and the direction of the edge followed during the traversal, since that is crucial information for producing the adequate expressions for substitution. Notice that we are interested in obtaining a conjunction expression with the directly subsuming terms, and a disjunction expression with the directly subsumed terms. For that reason, different routines are used to traverse the graph. In line 28 of the algorithm, directSuper(t) is the routine in charge of traversing the edges leaving t. In line 30 of the algorithm, directSub(t) is the routine in charge of traversing the edges entering t. Whenever a synonym of a term in a target vocabulary is found (line 25), such information is added to a queue (line 26) that stores the result of the Rewrite routine.
Termination of our algorithm is guaranteed because the graph traversal prevents the processing of a previously visited node (avoiding cycles) and, furthermore, a natural threshold parameter is established in order to limit the maximum distance of a visited node in the graph from the conflicting term.
1  // Returns an adequate query for ontoTarget,
2  // produced by a rewriting of Q with the least loss of information
3  QUERY SELECTION(Q, ontoSource, ontoTarget) return Query
4    terms = DecomposeQuery(Q);  // terms is the set of terms in Q
5    for each term in terms do
6    {
7      rewritingExpressions = REWRITE(term, ontoSource, ontoTarget);
8      // stores the term together with its adequate rewriting expressions
9      termsRewritings.add(term, rewritingExpressions);
10   }
11   // Constructs queries from the expressions obtained for each term
12   possibleQueries = ConstructQuery(Q, termsRewritings);
13   // Selects and returns the query that provides less loss of information
14   return LeastLoss(Q, possibleQueries);
15
16
17 // Constructs a queue of adequate expressions for term in ontoTarget
18 REWRITE(term, ontoSource, ontoTarget) return Queue<Expression>
19   resultQueue = new Queue();
20   traverseQueue = new Queue();
21   traverseQueue.add(term);
22   while not traverseQueue.isEmpty() do
23   {
24     t = traverseQueue.remove();
25     if has_synonym(t, ontoTarget) then
26       resultQueue.add(map(t, ontoTarget));
27     else  // t is a conflicting term
28     {  ceiling = directSuper(t);
29        traverseQueue.enqueueAll(ceiling);
30        floor = directSub(t);
31        traverseQueue.enqueueAll(floor);  }
32   }
33   return resultQueue;
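A Python counterpart of the Rewrite routine is sketched below. The merged vocabulary graph is assumed to be given as three dictionaries, and the direction bookkeeping is simplified to the first edge direction taken from the conflicting term; these choices are ours and are not the system's implementation.

# Illustrative Python counterpart of the REWRITE routine above.
#   synonym_of[t] -> equivalent term via owl:sameAs / owl:equivalent* axioms
#   super_of[t]   -> directly subsuming terms (subClassOf / subPropertyOf edges leaving t)
#   sub_of[t]     -> directly subsumed terms (edges entering t)
from collections import deque

MAX_DISTANCE = 3  # natural threshold bounding the search around the conflicting term

def rewrite(term, synonym_of, super_of, sub_of, target_vocabulary):
    """Breadth-first search for the frontier of target-vocabulary terms around `term`.
    Returns (target_term, direction) pairs, where direction records whether the
    frontier term was reached going up ('super') or down ('sub'), so the caller can
    build a conjunction of subsumers or a disjunction of subsumees."""
    results = []
    visited = {term}
    queue = deque([(term, None, 0)])      # (node, first direction taken, distance)
    while queue:
        t, direction, distance = queue.popleft()
        synonym = synonym_of.get(t)
        if synonym in target_vocabulary:  # frontier term found: stop expanding this trail
            results.append((synonym, direction))
            continue
        if t in target_vocabulary and t != term:
            results.append((t, direction))
            continue
        if distance >= MAX_DISTANCE:      # the threshold also guarantees termination
            continue
        for parent in super_of.get(t, ()):    # candidates for a conjunction of subsumers
            if parent not in visited:
                visited.add(parent)
                queue.append((parent, direction or "super", distance + 1))
        for child in sub_of.get(t, ()):       # candidates for a disjunction of subsumees
            if child not in visited:
                visited.add(child)
                queue.append((child, direction or "sub", distance + 1))
    return results

With the mapping axioms listed in Section 5.2, a call such as rewrite("bdi:Publication", ...) would reach dbpedia:Work through subsuming edges and dbpedia:WrittenWork and dbpedia:Website through subsumed edges, mirroring the two rewritings discussed there.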
5 Estimation of Information Loss
In this section we describe how we measure the loss of information caused by the rewriting of the original query. We also explain in detail a use case in which these rewritings are needed to achieve an adequate query.
5.1 Measuring the Loss of Information
The system measures the loss of information using a composite measure adapted from [11]. This measure is based on the combination of the precision and recall metrics from the Information Retrieval literature. We measure the proportion of retrieved data that is relevant (precision) and the proportion of relevant data that is retrieved (recall).
To calculate these metrics, we use dataset metadata published as VoID statistics. There are VoID statements that inform us of the number of entities of a class or the number of pairs of resources related by a property in a certain dataset. For instance, in the :DBpedia dataset, the class dbpedia:Book has 26198 entities and there are 4102 triples with the property dbpedia:notableWorks.
:DBpedia a void:Dataset ;
    void:classPartition [
        void:class dbpedia:Book ;
        void:entities 26198 ;
    ] ;
    void:propertyPartition [
        void:property dbpedia:notableWorks ;
        void:triples 4102 ;
    ] .
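Such statistics can be read programmatically. The sketch below uses rdflib to extract the partition counts from a VoID description; the file name and the assumption that the description is available as a local Turtle file are ours.

# Illustrative sketch: load the extension counts used in Section 5 from a VoID description.
from rdflib import Graph

g = Graph()
g.parse("dbpedia_void.ttl", format="turtle")   # VoID description of the target dataset (assumed file)

PARTITION_COUNTS = """
PREFIX void: <http://rdfs.org/ns/void#>
SELECT ?cls ?entities ?prop ?triples WHERE {
  { ?ds void:classPartition [ void:class ?cls ; void:entities ?entities ] }
  UNION
  { ?ds void:propertyPartition [ void:property ?prop ; void:triples ?triples ] }
}
"""

ext = {}
for cls, entities, prop, triples in g.query(PARTITION_COUNTS):
    if cls is not None:
        ext[str(cls)] = int(entities)    # e.g. dbpedia:Book -> 26198
    else:
        ext[str(prop)] = int(triples)    # e.g. dbpedia:notableWorks -> 4102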
Given a conflicting term ct, we define Ext(ct) as the extension of ct, that is, the collection of relevant instances for that term. Let us call Rewr(ct) an expression obtained by the rewriting of a conflicting term ct, and Ext(Rewr(ct)) the extension of the rewritten expression, that is, the retrieved instances for that expression.
We define Ext(ct) as the number of entities (resp. triples) registered for ct in the dataset (this value should be obtained from the metadata statistics). In the case of Ext(Rewr(ct)), we cannot expect a registered value in the metadata. Instead we calculate an estimation for an interval of values [Ext(Rewr(ct)).low, Ext(Rewr(ct)).high] which bounds the minimum and the maximum cardinality of the expression extension. Those values are used for the calculation of our measures of precision and recall. However, due to the lack of space and the intricacy of the different cases that must be taken into account, we do not present a detailed explanation of that calculation here.
Let us just say that the precision and recall of a rewriting of a conflicting term ct are measured with intervals [Precision(ct).low, Precision(ct).high], where Precision(ct).low = L(Ext(ct), Ext(Rewr(ct)).low, Ext(Rewr(ct)).high) and Precision(ct).high = H(Ext(ct), Ext(Rewr(ct)).low, Ext(Rewr(ct)).high) are functional values calculated after a careful analysis of the diverse semantic relationships between ct and Rewr(ct). Offered only as a hint, consider that the functions are variations on the following formulae, presented in [9]:

$$\mathrm{Precision}(ct) = \frac{Ext(ct) \cap Ext(Rewr(ct))}{Ext(Rewr(ct))} \qquad \mathrm{Recall}(ct) = \frac{Ext(ct) \cap Ext(Rewr(ct))}{Ext(ct)}$$
In order to provide the user with a certain capacity for expressing preferences on precision or recall, we introduce a real-valued parameter α (0 ≤ α ≤ 1) for tuning the function that calculates the loss of information due to the rewriting of a conflicting term. Again, this measure is expressed as an interval of values:

$$Loss(ct).low = 1 - \frac{1}{\alpha\left(\frac{1}{Precision(ct).high}\right) + (1-\alpha)\left(\frac{1}{Recall(ct).high}\right)} \qquad (1)$$

$$Loss(ct).high = 1 - \frac{1}{\alpha\left(\frac{1}{Precision(ct).low}\right) + (1-\alpha)\left(\frac{1}{Recall(ct).low}\right)} \qquad (2)$$
Finally, many functions can be considered for the calculation of the loss of information incurred by the rewriting of the entire original query Q. We are aware that more research and experimentation is needed to select the most appropriate one for our task. Nevertheless, for the sake of this paper, let us use a very simple and effective one, such as the maximum among the set of values that represent the losses.

$$Loss(Q).low = \max\{\,Loss(ct).low \mid ct \text{ is a conflicting term in } Q\,\} \qquad (3)$$

$$Loss(Q).high = \max\{\,Loss(ct).high \mid ct \text{ is a conflicting term in } Q\,\} \qquad (4)$$
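Equations (1)-(4) translate directly into code. The sketch below is our own illustration, with interval values represented as (low, high) pairs; when fed with the precision and recall intervals of the worked example in Section 5.2 it reproduces, up to rounding, the loss values reported there.

# Illustrative implementation of equations (1)-(4); the (low, high) pair
# representation of intervals is an assumption of this sketch.
def loss_interval(precision, recall, alpha=0.5):
    """Loss interval for one conflicting term, given precision and recall intervals."""
    p_low, p_high = precision
    r_low, r_high = recall
    low = 1 - 1 / (alpha * (1 / p_high) + (1 - alpha) * (1 / r_high))   # equation (1)
    high = 1 - 1 / (alpha * (1 / p_low) + (1 - alpha) * (1 / r_low))    # equation (2)
    return low, high

def query_loss(term_loss_intervals):
    """Loss interval for the whole query Q, equations (3) and (4)."""
    return (max(low for low, _ in term_loss_intervals),
            max(high for _, high in term_loss_intervals))

# With the intervals of the example in Section 5.2 (alpha = 0.5), this yields
# roughly [0.9974, 0.9974] for the dbpedia:Work rewriting and [0, 0.0935] for
# the db:WrittenWork / db:Website rewriting, matching the reported values.
print(loss_interval((0.0012960, 0.0012977), (1.0, 1.0)))
print(loss_interval((1.0, 1.0), (0.828969, 1.0)))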
5.2 Rewriting Example
This section describes in detail an example of the process followed by our system in a case where loss of information is incurred during the rewriting process. Consider that the system is trying to answer the original query shown in figure 1, which is expressed with terms of the proprietary bdi vocabulary, and that the user decides to commit the query to the DBpedia dataset. Some of the mapping axioms at the disposal of the system are as follows:
bdi:Document rdfs:subClassOf dbpedia:Work .
bdi:Publication rdfs:subClassOf bdi:Document .
dbpedia:WrittenWork rdfs:subClassOf bdi:Publication .
dbpedia:Website rdfs:subClassOf bdi:Publication .
dbpedia:Miguel_de_Cervantes owl:sameAs bdi:Miguel_de_Cervantes .
dbpedia:notableWork owl:sameAs bdi:isAuthor .
During the process, two possible rewritings are generated, as shown in figure 1. The one on the left is due to the pair of mapping axioms that specify that dbpedia:Work is a superclass of the conflicting term bdi:Publication; and the one on the right is due to a pair of mapping axioms that specify that dbpedia:WrittenWork and dbpedia:Website are subclasses of bdi:Publication (see those terms in the shaded boxes of figure 1).
The calculation of the information loss for each rewriting is as follows. Notice that the only conflicting term in this case is bdi:Publication. Firstly, the extensions of the conflicting term and of the rewriting expressions are calculated.
Ext(bdi:Publication) = 503;
Ext(dbpedia:Work) = 387599;
Fig. 1. Rewriting expressions generated
Ext(dbpedia:WrittenWork ∪ dbpedia:Website).low = min[40016, 2438] = 2438;
Ext(dbpedia:WrittenWork ∪ dbpedia:Website).high = 40016 + 2438 = 42454.
Secondly, precision and recall are calculated, taking into account the relationships between the conflicting term and its rewriting expressions.
With respect to Rewr(bdi:Publication) = dbpedia:Work:
[Precision.low = 0.0012960; Precision.high = 0.0012977; Recall = 1]
With respect to Rewr(bdi:Publication) = db:WrittenWork ∪ db:Website:
[Precision = 1; Recall.low = 0.828969; Recall.high = 1]
Then, the loss of information interval for bdi:Publication with a parameter α = 0.5 (meaning equal preference on precision and recall) is calculated.
With respect to Rewr(bdi:Publication) = dbpedia:Work:
[Loss(bdi:Publication).low = 0.997408; Loss(bdi:Publication).high = 0.997412]
With respect to Rewr(bdi:Publication) = db:WrittenWork ∪ db:Website:
[Loss(bdi:Publication).low = 0; Loss(bdi:Publication).high = 0.093511]
Considering the above information loss intervals, the system will choose the second option (replacing bdi:Publication with db:WrittenWork ∪ db:Website), as the loss of information is estimated to be between 0% and 9% (i.e., very low, even with the possibility of being 0%, that is, no loss of information). However, the first option (replacing bdi:Publication with dbpedia:Work) is estimated to incur a big loss of information (about 99.7%), which is something that could be expected: dbpedia:Work references many works that are not publications. Anyway, in the absence of the second option, the first one (despite returning many references to works that are not publications) also returns the publications included in dbpedia:Work, which could satisfy the user. The alternative, not dealing with imprecise answers, would be to return nothing when a semantics-preserving rewriting of the query into a new dataset cannot be achieved.
6 Conclusions
For this new era of the Web of Data, we present in this paper a proposal that offers users the possibility of querying heterogeneous Linked Data sources in a friendly way. This means that users do not need to take notice of the technical details associated with the heterogeneity and variety of the existing datasets. The proposal gives the opportunity to enrich the answer to the query incrementally, by visiting different datasets one by one, without needing to know the particular features of each dataset. The main component of the proposal is an algorithm that rewrites queries formulated by the users, using their preferred vocabularies, into other ones expressed using the vocabularies of the datasets visited. This algorithm makes extensive use of mapping axioms already defined in the datasets. Rewriting while preserving the query semantics is often difficult; for that reason, the algorithm also handles rewritings with some loss of information.
Experiments are being carried out to tune the loss estimation formulae.
Acknowledgements. This work is jointly supported by the TIN2010-21387-CO2-01 project and the Iñaki Goenaga (FCT-IG) Technology Centres Foundation.
References
1. Bizer, C., Heath, T., Berners-Lee, T.: Linked data - the story so far. International Journal on Semantic Web and Information Systems (IJSWIS) 5(3), 1–22 (2009)
2. Correndo, G., Salvadores, M., Millard, I., Glaser, H., Shadbolt, N.: SPARQL query rewriting for implementing data integration over linked data. In: Proceedings of the 2010 EDBT/ICDT Workshops, p. 4. ACM (2010)
3. Elbassuoni, S., Ramanath, M., Weikum, G.: Query relaxation for entity-relationship search. In: The Semantic Web: Research and Applications, pp. 62–76 (2011)
4. Euzenat, J., Shvaiko, P.: Ontology Matching, vol. 18. Springer, Heidelberg (2007)
5. Herzig, D., Tran, T.: One query to bind them all. In: COLD 2011, CEUR Workshop Proceedings, vol. 782 (2011)
6. Hurtado, C., Poulovassilis, A., Wood, P.: Query relaxation in RDF. Journal on Data Semantics X, 31–61 (2008)
7. Makris, K., Gioldasis, N., Bikakis, N., Christodoulakis, S.: Ontology mapping and SPARQL rewriting for querying federated RDF data sources. In: Meersman, R., Dillon, T., Herrero, P. (eds.) OTM 2010. LNCS, vol. 6427, pp. 1108–1117. Springer, Heidelberg (2010)
8. Manola, F., Miller, E., McBride, B.: RDF Primer. W3C Recommendation (February 10, 2004)
9. Mena, E., Kashyap, V., Illarramendi, A., Sheth, A.: Imprecise answers on highly open and distributed environments: An approach based on information loss for multi-ontology based query processing. International Journal of Cooperative Information Systems (IJCIS) 9(4), 403–425 (2000)
10. Quilitz, B., Leser, U.: Querying distributed RDF data sources with SPARQL. In: Bechhofer, S., Hauswirth, M., Hoffmann, J., Koubarakis, M. (eds.) ESWC 2008. LNCS, vol. 5021, pp. 524–538. Springer, Heidelberg (2008)
11. Salton, G.: Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley (1989)
12. Schwarte, A., Haase, P., Hose, K., Schenkel, R., Schmidt, M.: FedX: Optimization techniques for federated query processing on linked data. In: Aroyo, L., Welty, C., Alani, H., Taylor, J., Bernstein, A., Kagal, L., Noy, N., Blomqvist, E. (eds.) ISWC 2011, Part I. LNCS, vol. 7031, pp. 601–616. Springer, Heidelberg (2011)