Query Rewriting for an Incremental Search
in Heterogeneous Linked Data Sources
Ana I. Torre-Bastida1, Jesús Bermúdez2, Arantza Illarramendi2,
Eduardo Mena3, and Marta González1
1 Tecnalia Research & Innovation
{isabel.torre,marta.gonzalez}@tecnalia.com
2 Departamento de Lenguajes y Sistemas Informáticos, UPV-EHU
{a.illarramendi,jesus.bermudez}@ehu.es
3 Departamento de Informática e Ingeniería de Sistemas, Univ. Zaragoza
emena@unizar.es
Abstract. Nowadays, the number of linked data sources available on the Web is considerable. In this scenario, users are interested in frameworks that help them to query those heterogeneous data sources in a friendly way, thus avoiding the need to be aware of the technical details related to the heterogeneity and variety of data sources. With this aim, we present a system that implements an innovative query approach that obtains results to user queries in an incremental way. It sequentially accesses different datasets, expressed with possibly different vocabularies. Our approach enriches previous answers each time a different dataset is accessed. Mapping axioms between datasets are used for rewriting the original query and so obtaining new queries expressed with terms in the vocabularies of the target dataset. These rewritten queries may be semantically equivalent or they could result in a certain semantic loss; in this case, an estimation of the loss of information incurred is presented.
Keywords: Semantic Web, Linked Open Data Sources, SPARQL query,
vocabulary mapping, query rewriting.
1 Introduction
In recent years an increasing number of open RDF data sources have been emerging, partly due to the existence of techniques to convert non-RDF data sources into RDF ones, supported by initiatives like the Linking Open Data (LOD) project1 with the aim of creating a "Web of Data". The Linked Open Data cloud diagram2 shows the datasets that have been published in Linked Data format (around 338 datasets by 20133), and this diagram is continuously growing. Moreover, although those datasets follow the same representation format, they can deal with heterogeneous vocabularies to name the resources.
1 http://www.w3.org/wiki/SweoIG/TaskForces/CommunityProjects/LinkingOpenData
2 http://lod-cloud.net/state/
3 http://datahub.io/lv/group/lodcloud?tags%3Dno-vocab-mappings
In that scenario, users find it difficult to take advantage of the contents of many of those datasets because they get lost in their quantity and variety. For example, a user who is only familiar with the BNE (Biblioteca Nacional de España)4 dataset vocabularies could be interested in accessing the BNB (British National Bibliography)5, DBpedia6, or VEROIA (Public Library of Veroia)7 datasets in order to find more information. However, not being familiar with the vocabularies of those datasets may dissuade him from trying to query them.
So, taking into account the significant volume of Linked Data being published on the Web, numerous research efforts have been oriented to find new ways to exploit this Web of Data. Those efforts can be broadly classified into three main categories: Linked Data browsers, Linked Data search engines, and domain-specific Linked Data applications [1]. The proposal presented in this paper can be considered under the category of Linked Data search engines and, more particularly, under human-oriented Linked Data search engines, where we can find other approaches such as Falcons8 and SWSE9, amongst others. Nevertheless, the main difference of our proposal with respect to existing engines lies in the fact that it provides the possibility of obtaining a broader response to a query formulated by a user by combining the following two aspects: (1) an automatic navigation through different datasets, one by one, using mappings defined among datasets; and (2) a controlled rewriting (generalization/specialization) of the query formulated by the user according to the vocabularies managed by the target dataset.
In summary, the novel contribution of this paper is the development of a system that provides the following main advantages:
- A greater number of datasets at the user's disposal. Using our system the user can gain access to more datasets without bothering to know the existence of those datasets or the heterogeneous vocabularies that they manage. The system manages the navigation through the different datasets.
- Incremental answer enrichment. By accessing different datasets the user can obtain more information of interest. For that, the system manages existing mapping axioms between datasets.
- Exact or approximate answers. If the system is not capable of obtaining a semantically equivalent rewriting for the query formulated by the user, it will try to obtain a related query by generalizing/specializing that query and it will provide information about the loss in precision and/or recall with respect to the original query.
In the rest of the paper, we first present some related works in section 2. Then, we introduce an overview of the query processing approach in section 3. We follow with a detailed explanation of the query rewriting algorithm and with a brief presentation of how the information loss is measured in sections 4 and 5. Finally, we end with some conclusions in section 6.
4 BNE - (http://datos.bne.es/sparql)
5 BNB - (http://bnb.data.bl.uk/sparql)
6 DBpedia - (http://wiki.dbpedia.org/Datasets)
7 VEROIA - (http://libver.math.auth.gr/sparql)
8 http://ws.nju.edu.cn/falcons/objectsearch/index.jsp
9 http://swse.deri.org/
2 Related Works
With the growth of the Semantic Web, SPARQL query processing over heterogeneous data sources has become an active research field. Some systems (such as DARQ [10] and FedX [12]) consider federated approaches over distributed data sources with the ultimate goal of virtual integration. One main difference with our proposal is that they focus on top-down strategies where the relevant sources are known, while in our proposal the sources are discovered during the query processing. Nevertheless, one main drawback for query processing over heterogeneous data is that existing mapping axioms are scarce and very simple (most of them are of the owl:sameAs type).
Even closer to our approach are the works related to SPARQL query rewriting. Some of them, such as [7] and [2], support the query rewriting with mapping axioms described by logic rules that are applied to the triple patterns that compose the query; [7] uses a quite expressive specific mapping language based on Description Logics and [2] uses less expressive Horn clause-like rules. In both cases, the mapping language is much more expressive than what is usually found in dataset metadata (for instance, VoID10 linksets) and the approach does not seem to scale up well due to the hard work needed to define that kind of mapping.
Another approach to query rewriting is query relaxation, which consists of reformulating the triple patterns of a query to retrieve more results without excessive loss in precision. Examples of that approach are [6] and [3]. Each work presents a different methodology for defining some types of relaxation: [6] uses vocabulary inference on triple patterns and [3] uses a statistical language modeling technique that allows them to compute the similarity between two entities. Both of them define a ranking model for the presentation of the query results. Although our proposal shares with them the goal of providing more results to the user, they are focused more on generalizing the query, while we are focused on rewriting the query trying to preserve the meaning of the original query as much as possible, and so generalizing or specializing parts of the query when necessary in the new context.
The query rewriting problem is also considered in [5], but we differ in the way we face it. In our proposal we cope with existing mapping axioms that relate different vocabularies, and we make the most of them in the query rewriting process. In contrast, [5] disregards such mapping axioms and looks for results in the target dataset by evaluating similarity with an Entity Relevance Model (ERM) calculated with the results of the original query. The calculation of the ERM is based on the number of word occurrences in the results obtained, which are later used as keywords for evaluating similarity. The strength of this method turns into its weakness in some scenarios, because there are datasets that make abundant use of codes to identify their entities and those strings do not help as keywords.
10 http://vocab.deri.ie/void
Finally, query rewriting has also been extensively considered in the area of ontology matching [4]. A distinguishing aspect of our system is the measurement of information loss. In order to compute it, we adapt the approach presented in [11] and further elaborated in [9] to estimate the information loss incurred when a term is substituted by an expression.
3 An Overview of the Query Processing Approach
In this section we first present some terminology that will be used throughout the rest of the paper. Then we show the main steps followed by the system to provide an answer to a query formulated by the user. Finally, we present a brief specification of the main components of the system that are involved in the process of providing the answer.
With respect to the terminology used, we consider datasets that are modeled as RDF graphs. An RDF graph is a set of RDF triples. An RDF triple is a statement formed by a subject, a predicate and an object [8]. Elements in a triple are represented by IRIs, and objects may also be represented by literals. We use term for any element in a triple. Each dataset is described with terms from a declared vocabulary set. We use target dataset for the dataset over which the query is going to be evaluated, and target vocabulary set for its declared vocabulary set.
SPARQL queries11 are made of graph patterns. A graph pattern is a query expression made of a set of triple patterns. A triple pattern is a triple where any of its elements may be a variable. When a triple pattern of a query is expressed with terms of the target vocabulary set, we say that the triple pattern is adequate for the target dataset. When every triple pattern of a query is adequate for a target dataset, we say that the query is adequate for that target dataset.
The original query is expressed with terms from a source vocabulary set. Let us call T the set of terms used by the original query. As long as every term in T belongs to a vocabulary in the target vocabulary set, the original query is adequate for the target dataset and the query can be properly processed over that dataset. However, if there were terms in T not appearing in the target vocabulary set, the triple patterns of the original query including any such term should be rewritten into appropriate graph patterns, with terms taken from the target vocabulary set, in order to obtain an adequate query for the target dataset.
Terms in T appearing in synonymy mapping axioms (i.e. expressed with any of the properties owl:sameAs, owl:equivalentClass, owl:equivalentProperty) with a term in the target vocabulary set can be directly replaced by the synonym term. Those terms in T not appearing in the target vocabulary set and not appearing in synonymy mapping axioms with terms in the target vocabulary set are called conflicting terms. Since there is no guarantee of enough synonymy mapping axioms between the source and target vocabulary sets to allow a semantics-preserving rewriting of the original query into an adequate query for the target vocabulary, we must cope with query rewritings that incur some loss of information. The goal of the query rewriting algorithm is to replace every triple pattern including conflicting terms with a graph pattern adequate for the target dataset.
11 http://www.w3.org/TR/rdf-sparql-query/
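To make the adequacy check concrete, the following minimal Python sketch (our own illustration, not the system's actual code) partitions the terms of a query into adequate, synonym-replaceable and conflicting ones. The data structures, and the synonymy axiom assumed for foaf:maker in the running example of Section 3.1, are illustrative assumptions.

# Minimal sketch: classify the terms T of a query against a target vocabulary
# set, given the available synonymy axioms. All names are illustrative.
def partition_terms(query_terms, target_vocabulary, synonym_axioms):
    """synonym_axioms maps a source term to an equivalent target term
    (owl:sameAs, owl:equivalentClass or owl:equivalentProperty)."""
    adequate, replaceable, conflicting = set(), {}, set()
    for t in query_terms:
        if t in target_vocabulary:
            adequate.add(t)                     # already adequate for the target dataset
        elif synonym_axioms.get(t) in target_vocabulary:
            replaceable[t] = synonym_axioms[t]  # direct one-to-one replacement
        else:
            conflicting.add(t)                  # needs the rewriting algorithm of Section 4
    return adequate, replaceable, conflicting

# Running example of Section 3.1, with DBpedia as the target dataset:
terms = {"foaf:maker", "dblp:Tim_Berners-Lee"}
target = {"dbpedia-owl:author", "dbpedia:Tim_Berners-Lee"}
synonyms = {"dblp:Tim_Berners-Lee": "dbpedia:Tim_Berners-Lee",
            "foaf:maker": "dbpedia-owl:author"}   # assumed synonymy axiom for illustration
print(partition_terms(terms, target, synonyms))
# -> both query terms end up as replaceable; nothing is conflicting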
3.1 Main Query Processing Steps
The query which we will use as a running example is "Give me resources whose author is Tim Berners-Lee". The steps followed to answer that query are presented next:
1. The user formulates the query using a provided GUI. For that, he uses terms that belong to a vocabulary that he is familiar with (for example, the DBLP and FOAF vocabularies in this case). Notice that the user is not required to know the SPARQL language for RDF; he only needs to know the terms dblp:Tim_Berners-Lee and foaf:maker from the DBLP and FOAF vocabularies. The system produces the following query:
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dblp: <http://dblp.l3s.de/d2r/resource/authors/>
SELECT ?resource WHERE { ?resource foaf:maker dblp:Tim_Berners-Lee }
2. The system asks the user for the name of a dataset in which he is interested in finding the answer. If the user does not provide any specific name, then the system shows the user different possible datasets that belong to the same domain (e.g., the bibliographic domain). If the user does not select any of them, then the system selects one. Following the previous example, we assume that the user selects the DBpedia dataset among those presented by the system.
3. The system first tries to find the query terms in the selected dataset. If it finds them, it runs the query processing. Otherwise the system tries to rewrite the query formulated by the user into another equivalent query using mapping axioms. At this point two different situations may happen:
(a) The system finds synonymy mapping axioms, defined between the source and target vocabularies, that allow it to rewrite each term of the query into an equivalent term in the target vocabulary (for instance, mapping axioms of the type dblp:Tim_Berners-Lee owl:sameAs dbpedia:Tim_Berners-Lee). Following the previous example, the property foaf:maker is replaced with dbpedia-owl:author. The rewritten query is the following:
PREFIX dbpedia: <http://dbpedia.org/resource/>
PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
SELECT ?resource WHERE { ?resource dbpedia-owl:author dbpedia:Tim_Berners-Lee }
Then the system obtains the answer by querying the DBpedia dataset and shows the answer to the user through the GUI. The results obtained by the considered query are:
http://dbpedia.org/resource/Tabulator
http://dbpedia.org/resource/Weaving_the_Web:_The_Original_Design_and_Ultimate_Destiny_of_the_World_Wide_Web_by_its_inventor
(b) The system does not find synonymy mapping axioms for every term in the original query. In this case, each triple pattern including a conflicting term is replaced with a graph pattern until an adequate query is obtained. In sections 4 and 5 we present the algorithm used for the rewriting and an example that illustrates its behaviour, respectively.
4. The system asks the user if he is interested in querying another dataset. If the answer is No, the process ends. If the answer is Yes, the process returns to step 2.
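The following sketch illustrates how steps 2 to 4 could be wired together. It is only an illustration: is_adequate and rewrite_query stand for the components described in Section 3.2 and Section 4 and are not a real API; only the endpoint call uses a real library (SPARQLWrapper).

# Illustrative sketch of the incremental loop (steps 2-4).
from SPARQLWrapper import SPARQLWrapper, JSON

def run_query(query, endpoint_url):
    endpoint = SPARQLWrapper(endpoint_url)
    endpoint.setQuery(query)
    endpoint.setReturnFormat(JSON)
    return endpoint.query().convert()["results"]["bindings"]

def incremental_search(original_query, candidate_datasets, is_adequate, rewrite_query):
    answers = []
    for name, endpoint_url in candidate_datasets:        # step 2: one dataset at a time
        if is_adequate(original_query, name):            # step 3: query terms found as-is
            query = original_query
        else:
            query = rewrite_query(original_query, name)  # step 3(a)/(b): rewrite via mapping axioms
        if query is None:
            continue                                     # no adequate rewriting was obtained
        answers.extend(run_query(query, endpoint_url))   # enrich the previously obtained answers
        # step 4: in the actual system the user decides whether to try another dataset
    return answers

# Example endpoint list (DBpedia is one of the datasets used in this paper):
# candidates = [("DBpedia", "http://dbpedia.org/sparql")]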
3.2 System Modules
In order to accomplish the steps presented in the previous subsection, the system relies on the following modules:
- Input/Output Module. This module manages a GUI that facilitates, on the one hand, the task of querying the datasets using some predefined forms; and, on the other hand, presents the obtained answer with a friendly appearance.
- Rewriting Module. This module is in charge of two main tasks: Query analysis and Query rewriting. The Query analysis consists of parsing the query formulated by the user and obtaining a tree model. For this task, the Query Analyzer module implemented with ARQ12 is used. In this task the datasets that belong to the domain considered in the query are also selected. Concerning Query rewriting, we have developed an algorithm (explained in section 4) that rewrites the query expressed using a source vocabulary into an adequate query. The algorithm makes use of mapping axioms expressed as RDF triples, which can be obtained through SPARQL endpoints or RDF dumps (see the sketch after this list). The mapping axioms we are considering in this paper are those triples whose subject and object are from different vocabularies and whose predicate is one of the following terms: owl:sameAs, rdfs:subClassOf, rdfs:subPropertyOf, owl:equivalentClass, and owl:equivalentProperty. Future work will consider a broader set of properties for the mapping axioms.
- Evaluation Module. Taking into account that different rewritings could be possible for a query, the goal of this module is to evaluate those different rewritings and to select the one that incurs the least information loss. For that it handles some defined metrics (see section 5.1) and the information stored in the VoID statistics of the considered datasets.
- Processing Module. Once the best query rewriting is selected, this module is in charge of obtaining the answer for the query by accessing the corresponding dataset.
12 Apache Jena/ARQ (http://jena.apache.org/documentation/query/index.html)
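As an illustration of how the Rewriting Module could collect mapping axioms from a SPARQL endpoint, the following sketch retrieves triples whose predicate is one of the five properties listed above and whose subject and object belong to different namespaces. The endpoint, the result limit and the namespace heuristic are assumptions of this sketch, not part of the described system.

# Illustrative sketch: fetch candidate mapping axioms from a SPARQL endpoint.
from SPARQLWrapper import SPARQLWrapper, JSON

MAPPING_QUERY = """
PREFIX owl:  <http://www.w3.org/2002/07/owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?s ?p ?o WHERE {
  VALUES ?p { owl:sameAs rdfs:subClassOf rdfs:subPropertyOf
              owl:equivalentClass owl:equivalentProperty }
  ?s ?p ?o .
}
LIMIT 10000
"""

def namespace(iri):
    # rough stand-in for "vocabulary of the term"
    return iri.rsplit("#", 1)[0] if "#" in iri else iri.rsplit("/", 1)[0]

def fetch_mapping_axioms(endpoint_url):
    endpoint = SPARQLWrapper(endpoint_url)
    endpoint.setQuery(MAPPING_QUERY)
    endpoint.setReturnFormat(JSON)
    rows = endpoint.query().convert()["results"]["bindings"]
    # keep only axioms whose subject and object come from different vocabularies
    return [(r["s"]["value"], r["p"]["value"], r["o"]["value"])
            for r in rows
            if namespace(r["s"]["value"]) != namespace(r["o"]["value"])]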
4 Query Rewriting Algorithm
In this section we present the query rewriting algorithm. Its foundation is a graph traversal algorithm that looks for the nearest terms (belonging to the target dataset) of a conflicting term.
We follow two guiding principles for the replacement of conflicting terms: (1) a term can be replaced with the conjunction of its directly subsuming terms; (2) a term can be replaced with the disjunction of its directly subsumed terms. These guiding principles are recursively followed until adequate expressions are obtained.
A distinguishing feature of our working scenario is that the source and target vocabulary sets are not necessarily fully integrated. Notice that datasets are totally independent from one another and our system is only allowed to access them through their particular web services (SPARQL endpoint or programmatic interface). Therefore, our system depends only on the declared vocabulary sets and the published mapping axioms. Inferred relationships between terms are not taken into account unless the target system provides them. We are aware of the limitations of this consideration, but we think that it is quite a realistic scenario nowadays.
In the following, we present the algorithm that obtains an adequate query expression for the target dataset with the minimum loss of information with respect to the original query Q, measured by our proposed metrics.
First of all, the original query Q is decomposed into triple patterns, which in turn are decomposed into the collection of terms T. This step is represented in line 4 of the displayed listing of the algorithm. Notice that variables are not included in T; variables are maintained unchanged in the rewritten query. Nor are literal values included in T; literal values are processed by domain-specific transformer functions that take into account structure, units and measurement systems.
Then, for each term in T, a collection of expressions is constructed and gathered with the term. Each expression represents a possible substitution of the triple pattern including the conflicting term with a graph pattern adequate for the target dataset (see lines 5 to 10 in the algorithm). Considering the expressions associated with each term, the set of all possible adequate queries is constructed (line 12), the information loss of each query is measured, and the query with the least loss is selected (line 14).
The core of the algorithm is the Rewrite routine (line 7), which examines the source and target vocabularies, with their respective mapping axioms, in order to discover possible substitutions for a given term in a source vocabulary. Let us consider terms in a vocabulary as nodes in a graph and relationships between terms (specifically rdfs:subClassOf, rdfs:subPropertyOf, owl:equivalentClass, owl:equivalentProperty, and owl:sameAs) as directed labeled edges between nodes. Notice that, due to mapping axioms between two vocabularies, we can consider those vocabularies as parts of the same graph. The Rewrite routine performs a variation of a Breadth First Search traversal from a conflicting term, looking for its frontier of terms that belong to a target vocabulary. A term f belongs to the frontier of a term t if it satisfies the following three conditions: (a) f belongs to a target vocabulary, (b) there is a trail from t to f, and (c) there is no other term g (different from f) belonging to a target vocabulary in that trail. A trail from a node t to a node f is a sequence of edges that connects t and f independently of the direction of the edges. For instance, t rdfs:subClassOf r, f rdfs:subClassOf r is a trail from t to f.
Although a trail admits traversing edges in whatever direction, our algorithm keeps track of the pair formed by each node in the trail and the direction of the edge followed during the traversal, since that is crucial information for producing the adequate expressions for substitution. Notice that we are interested in obtaining a conjunction expression with the directly subsuming terms, and a disjunction expression with the directly subsumed terms. For that reason, different routines are used to traverse the graph. In line 28 of the algorithm, directSuper(t) is the routine in charge of traversing the edges leaving t. In line 30 of the algorithm, directSub(t) is the routine in charge of traversing the edges entering t. Whenever a synonym of a term in a target vocabulary is found (line 25), such information is added to a queue (line 26) that stores the result of the Rewrite routine.
Termination of our algorithm is guaranteed because the graph traversal prevents the processing of a previously visited node (avoiding cycles) and, furthermore, a natural threshold parameter is established in order to limit the maximum distance of a visited node in the graph from the conflicting term.
1  // Returns an adequate query for ontoTarget,
2  // produced by a rewriting of Q with the least loss of information
3  QUERY SELECTION(Q, ontoSource, ontoTarget) return Query
4    terms = DecomposeQuery(Q);  // terms is the set of terms in Q
5    for each term in terms do
6    {
7      rewritingExpressions = REWRITE(term, ontoSource, ontoTarget);
8      // stores the term together with its adequate rewriting expressions
9      termsRewritings.add(term, rewritingExpressions);
10   }
11   // Constructs queries from the expressions obtained for each term
12   possibleQueries = ConstructQuery(Q, termsRewritings);
13   // Selects and returns the query that provides less loss of information
14   return LeastLoss(Q, possibleQueries);
15
16
17 // Constructs a queue of adequate expressions for term in ontoTarget
18 REWRITE(term, ontoSource, ontoTarget) return Queue<Expression>
19   resultQueue = new Queue();
20   traverseQueue = new Queue();
21   traverseQueue.add(term);
22   while not traverseQueue.isEmpty() do
23   {
24     t = traverseQueue.remove();
25     if has_synonym(t, ontoTarget) then
26       resultQueue.add(map(t, ontoTarget));
27     else  // t is a conflicting term
28     {  ceiling = directSuper(t);
29        traverseQueue.enqueueAll(ceiling);
30        floor = directSub(t);
31        traverseQueue.enqueueAll(floor);  }
32   }
33   return resultQueue;
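A Python counterpart of the Rewrite routine is sketched below. The merged vocabulary graph is assumed to be given as three dictionaries, and the direction bookkeeping is simplified to the first edge direction taken from the conflicting term; these choices are ours and are not the system's implementation.

# Illustrative Python counterpart of the REWRITE routine above.
#   synonym_of[t] -> equivalent term via owl:sameAs / owl:equivalent* axioms
#   super_of[t]   -> directly subsuming terms (subClassOf / subPropertyOf edges leaving t)
#   sub_of[t]     -> directly subsumed terms (edges entering t)
from collections import deque

MAX_DISTANCE = 3  # natural threshold bounding the search around the conflicting term

def rewrite(term, synonym_of, super_of, sub_of, target_vocabulary):
    """Breadth-first search for the frontier of target-vocabulary terms around `term`.
    Returns (target_term, direction) pairs, where direction records whether the
    frontier term was reached going up ('super') or down ('sub'), so the caller can
    build a conjunction of subsumers or a disjunction of subsumees."""
    results = []
    visited = {term}
    queue = deque([(term, None, 0)])      # (node, first direction taken, distance)
    while queue:
        t, direction, distance = queue.popleft()
        synonym = synonym_of.get(t)
        if synonym in target_vocabulary:  # frontier term found: stop expanding this trail
            results.append((synonym, direction))
            continue
        if t in target_vocabulary and t != term:
            results.append((t, direction))
            continue
        if distance >= MAX_DISTANCE:      # the threshold also guarantees termination
            continue
        for parent in super_of.get(t, ()):    # candidates for a conjunction of subsumers
            if parent not in visited:
                visited.add(parent)
                queue.append((parent, direction or "super", distance + 1))
        for child in sub_of.get(t, ()):       # candidates for a disjunction of subsumees
            if child not in visited:
                visited.add(child)
                queue.append((child, direction or "sub", distance + 1))
    return results

With the mapping axioms listed in Section 5.2, a call such as rewrite("bdi:Publication", ...) would reach dbpedia:Work through subsuming edges and dbpedia:WrittenWork and dbpedia:Website through subsumed edges, mirroring the two rewritings discussed there.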
5 Estimation of Information Loss
In this section we describe how we measure the loss of information caused by the rewriting of the original query. We also explain in detail a use case in which these rewritings are needed to achieve an adequate query.
5.1 Measuring the Loss of Information
The system measures the loss of information using a composite measure adapted from [11]. This measure is based on the combination of the precision and recall metrics from the Information Retrieval literature. We measure the proportion of retrieved data that is relevant (precision) and the proportion of relevant data that is retrieved (recall).
To calculate these metrics, we use dataset metadata published as VoID statistics. There are VoID statements that inform us of the number of entities of a class or the number of pairs of resources related by a property in a certain dataset. For instance, in the :DBpedia dataset, the class dbpedia:Book has 26198 entities and there are 4102 triples with the property dbpedia:notableWorks.
:DBpedia a void:Dataset ;
    void:classPartition [
        void:class dbpedia:Book ;
        void:entities 26198 ;
    ] ;
    void:propertyPartition [
        void:property dbpedia:notableWorks ;
        void:triples 4102 ;
    ] .
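Such statistics can be read programmatically. The sketch below uses rdflib to extract the partition counts from a VoID description; the file name and the assumption that the description is available as a local Turtle file are ours.

# Illustrative sketch: load the extension counts used in Section 5 from a VoID description.
from rdflib import Graph

g = Graph()
g.parse("dbpedia_void.ttl", format="turtle")   # VoID description of the target dataset (assumed file)

PARTITION_COUNTS = """
PREFIX void: <http://rdfs.org/ns/void#>
SELECT ?cls ?entities ?prop ?triples WHERE {
  { ?ds void:classPartition [ void:class ?cls ; void:entities ?entities ] }
  UNION
  { ?ds void:propertyPartition [ void:property ?prop ; void:triples ?triples ] }
}
"""

ext = {}
for cls, entities, prop, triples in g.query(PARTITION_COUNTS):
    if cls is not None:
        ext[str(cls)] = int(entities)    # e.g. dbpedia:Book -> 26198
    else:
        ext[str(prop)] = int(triples)    # e.g. dbpedia:notableWorks -> 4102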
Given a conflicting term ct, we define Ext(ct) as the extension of ct, that is, the collection of relevant instances for that term. Let us call Rewr(ct) an expression obtained by the rewriting of a conflicting term ct, and Ext(Rewr(ct)) the extension of the rewritten expression, that is, the retrieved instances for that expression.
We define Ext(ct) as the number of entities (resp. triples) registered for ct in the dataset (this value should be obtained from the metadata statistics). In the case of Ext(Rewr(ct)), we cannot expect a registered value in the metadata. Instead we calculate an estimation for an interval of values [Ext(Rewr(ct)).low, Ext(Rewr(ct)).high] which bounds the minimum and the maximum cardinality of the expression extension. Those values are used for the calculation of our measures of precision and recall. However, due to the lack of space and the intricacy of the different cases that must be taken into account, we do not present a detailed explanation of that calculation here.
Let us just say that the precision and recall of a rewriting of a conflicting term ct are measured with intervals [Precision(ct).low, Precision(ct).high], where Precision(ct).low = L(Ext(ct), Ext(Rewr(ct)).low, Ext(Rewr(ct)).high) and Precision(ct).high = H(Ext(ct), Ext(Rewr(ct)).low, Ext(Rewr(ct)).high) are functional values calculated after a careful analysis of the diverse semantic relationships between ct and Rewr(ct). Offered only as a hint, consider that the functions are variations on the following formulae, presented in [9]:

$$\mathrm{Precision}(ct) = \frac{Ext(ct) \cap Ext(Rewr(ct))}{Ext(Rewr(ct))} \qquad \mathrm{Recall}(ct) = \frac{Ext(ct) \cap Ext(Rewr(ct))}{Ext(ct)}$$
In order to provide the user with a certain capacity for expressing preferences on precision or recall, we introduce a real-valued parameter α (0 ≤ α ≤ 1) for tuning the function that calculates the loss of information due to the rewriting of a conflicting term. Again, this measure is expressed as an interval of values:

$$Loss(ct).low = 1 - \frac{1}{\alpha\left(\frac{1}{Precision(ct).high}\right) + (1-\alpha)\left(\frac{1}{Recall(ct).high}\right)} \qquad (1)$$

$$Loss(ct).high = 1 - \frac{1}{\alpha\left(\frac{1}{Precision(ct).low}\right) + (1-\alpha)\left(\frac{1}{Recall(ct).low}\right)} \qquad (2)$$
Finally, many functions can be considered for the calculation of the loss of information incurred by the rewriting of the entire original query Q. We are aware that more research and experimentation is needed to select the most appropriate one for our task. Nevertheless, for the sake of this paper, let us use a very simple and effective one, such as the maximum among the set of values that represent the losses.

$$Loss(Q).low = \max\{\,Loss(ct).low \mid ct \text{ is a conflicting term in } Q\,\} \qquad (3)$$

$$Loss(Q).high = \max\{\,Loss(ct).high \mid ct \text{ is a conflicting term in } Q\,\} \qquad (4)$$
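Equations (1)-(4) translate directly into code. The sketch below is our own illustration, with interval values represented as (low, high) pairs; when fed with the precision and recall intervals of the worked example in Section 5.2 it reproduces, up to rounding, the loss values reported there.

# Illustrative implementation of equations (1)-(4); the (low, high) pair
# representation of intervals is an assumption of this sketch.
def loss_interval(precision, recall, alpha=0.5):
    """Loss interval for one conflicting term, given precision and recall intervals."""
    p_low, p_high = precision
    r_low, r_high = recall
    low = 1 - 1 / (alpha * (1 / p_high) + (1 - alpha) * (1 / r_high))   # equation (1)
    high = 1 - 1 / (alpha * (1 / p_low) + (1 - alpha) * (1 / r_low))    # equation (2)
    return low, high

def query_loss(term_loss_intervals):
    """Loss interval for the whole query Q, equations (3) and (4)."""
    return (max(low for low, _ in term_loss_intervals),
            max(high for _, high in term_loss_intervals))

# With the intervals of the example in Section 5.2 (alpha = 0.5), this yields
# roughly [0.9974, 0.9974] for the dbpedia:Work rewriting and [0, 0.0935] for
# the db:WrittenWork / db:Website rewriting, matching the reported values.
print(loss_interval((0.0012960, 0.0012977), (1.0, 1.0)))
print(loss_interval((1.0, 1.0), (0.828969, 1.0)))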
5.2 Rewriting Example
This section describes in detail an example of the process followed by our system in a case where loss of information is incurred during the rewriting process. Consider that the system is trying to answer the original query shown in figure 1, which is expressed with terms of the proprietary bdi vocabulary, and that the user decides to commit the query to the DBpedia dataset. Some of the mapping axioms at the disposal of the system are as follows:
bdi:Document rdfs:subClassOf dbpedia:Work .
bdi:Publication rdfs:subClassOf bdi:Document .
dbpedia:WrittenWork rdfs:subClassOf bdi:Publication .
dbpedia:Website rdfs:subClassOf bdi:Publication .
dbpedia:Miguel_de_Cervantes owl:sameAs bdi:Miguel_de_Cervantes .
dbpedia:notableWork owl:sameAs bdi:isAuthor .
During the process, two possible rewritings are generated, as shown in figure 1. The one on the left is due to the pair of mapping axioms that specify that dbpedia:Work is a superclass of the conflicting term bdi:Publication; and the one on the right is due to a pair of mapping axioms that specify that dbpedia:WrittenWork and dbpedia:Website are subclasses of bdi:Publication (see those terms in the shaded boxes of figure 1).
The calculation of the information loss for each rewriting is as follows. Notice that the only conflicting term in this case is bdi:Publication. Firstly, the extensions of the conflicting term and of the rewriting expressions are calculated.
Ext(bdi:Publication) = 503;
Ext(dbpedia:Work) = 387599;
Fig. 1. Rewriting expressions generated
Ext(dbpedia:WrittenWork ∪ dbpedia:Website).low = min[40016, 2438] = 2438;
Ext(dbpedia:WrittenWork ∪ dbpedia:Website).high = 40016 + 2438 = 42454.
Secondly, precision and recall are calculated, taking into account the relationships between the conflicting term and its rewriting expressions.
With respect to Rewr(bdi:Publication) = dbpedia:Work:
[Precision.low = 0.0012960; Precision.high = 0.0012977; Recall = 1]
With respect to Rewr(bdi:Publication) = db:WrittenWork ∪ db:Website:
[Precision = 1; Recall.low = 0.828969; Recall.high = 1]
Then, the loss of information interval for bdi:Publication with a parameter α = 0.5 (meaning equal preference on precision and recall) is calculated.
With respect to Rewr(bdi:Publication) = dbpedia:Work:
[Loss(bdi:Publication).low = 0.997408; Loss(bdi:Publication).high = 0.997412]
With respect to Rewr(bdi:Publication) = db:WrittenWork ∪ db:Website:
[Loss(bdi:Publication).low = 0; Loss(bdi:Publication).high = 0.093511]
Considering the above information loss intervals, the system will choose the second option (replacing bdi:Publication with db:WrittenWork ∪ db:Website), as the loss of information is estimated to be between 0% and 9% (i.e., very low, even with the possibility of being 0%, that is, no loss of information). However, the first option (replacing bdi:Publication with dbpedia:Work) is estimated to incur a big loss of information (about 99.7%), which is something that could be expected: dbpedia:Work references many works that are not publications. Anyway, in the absence of the second option, the first one (despite returning many references to works that are not publications) also returns the publications included in dbpedia:Work, which could satisfy the user. The alternative, not dealing with imprecise answers, would be to return nothing when a semantics-preserving rewriting of the query into a new dataset cannot be achieved.
6 Conclusions
For this new era of the Web of Data, we present in this paper a proposal that offers users the possibility of querying heterogeneous Linked Data sources in a friendly way. This means that users do not need to take notice of the technical details associated with the heterogeneity and variety of the existing datasets. The proposal gives the opportunity to enrich the answer to the query incrementally, by visiting different datasets one by one, without needing to know the particular features of each dataset. The main component of the proposal is an algorithm that rewrites queries formulated by the users, using their preferred vocabularies, into other ones expressed using the vocabularies of the datasets visited. This algorithm makes extensive use of mapping axioms already defined in the datasets. Rewriting while preserving the query semantics is often difficult; for that reason, the algorithm also handles rewritings with some loss of information.
Experiments are being carried out to tune the loss estimation formulae.
Acknowledgements. This work is jointly supported by the TIN2010-21387-CO2-01 project and the Iñaki Goenaga (FCT-IG) Technology Centres Foundation.
References
1. Bizer, C., Heath, T., Berners-Lee, T.: Linked data - the story so far. International Journal on Semantic Web and Information Systems (IJSWIS) 5(3), 1–22 (2009)
2. Correndo, G., Salvadores, M., Millard, I., Glaser, H., Shadbolt, N.: SPARQL query rewriting for implementing data integration over linked data. In: Proceedings of the 2010 EDBT/ICDT Workshops, p. 4. ACM (2010)
3. Elbassuoni, S., Ramanath, M., Weikum, G.: Query relaxation for entity-relationship search. In: The Semantic Web: Research and Applications, pp. 62–76 (2011)
4. Euzenat, J., Shvaiko, P.: Ontology Matching, vol. 18. Springer, Heidelberg (2007)
5. Herzig, D., Tran, T.: One query to bind them all. In: COLD 2011, CEUR Workshop Proceedings, vol. 782 (2011)
6. Hurtado, C., Poulovassilis, A., Wood, P.: Query relaxation in RDF. Journal on Data Semantics X, 31–61 (2008)
7. Makris, K., Gioldasis, N., Bikakis, N., Christodoulakis, S.: Ontology mapping and SPARQL rewriting for querying federated RDF data sources. In: Meersman, R., Dillon, T., Herrero, P. (eds.) OTM 2010. LNCS, vol. 6427, pp. 1108–1117. Springer, Heidelberg (2010)
8. Manola, F., Miller, E., McBride, B.: RDF Primer. W3C Recommendation (February 10, 2004)
9. Mena, E., Kashyap, V., Illarramendi, A., Sheth, A.: Imprecise answers on highly open and distributed environments: An approach based on information loss for multi-ontology based query processing. International Journal of Cooperative Information Systems (IJCIS) 9(4), 403–425 (2000)
10. Quilitz, B., Leser, U.: Querying distributed RDF data sources with SPARQL. In: Bechhofer, S., Hauswirth, M., Hoffmann, J., Koubarakis, M. (eds.) ESWC 2008. LNCS, vol. 5021, pp. 524–538. Springer, Heidelberg (2008)
11. Salton, G.: Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley (1989)
12. Schwarte, A., Haase, P., Hose, K., Schenkel, R., Schmidt, M.: FedX: Optimization techniques for federated query processing on linked data. In: Aroyo, L., Welty, C., Alani, H., Taylor, J., Bernstein, A., Kagal, L., Noy, N., Blomqvist, E. (eds.) ISWC 2011, Part I. LNCS, vol. 7031, pp. 601–616. Springer, Heidelberg (2011)