Semantic Web 1 (2012) 1–5
IOS Press
DBpedia – A Large-scale, Multilingual
Knowledge Base Extracted from Wikipedia
Editor(s): Name Surname, University, Country
Solicited review(s): Name Surname, University, Country
Open review(s): Name Surname, University, Country
Jens Lehmann a,*, Robert Isele g, Max Jakob e, Anja Jentzsch d, Dimitris Kontokostas a,
Pablo N. Mendes f, Sebastian Hellmann a, Mohamed Morsey a, Patrick van Kleef c, Sören Auer a,
Christian Bizer b
* Corresponding author. E-mail: lehmann@informatik.uni-leipzig.de
a University of Leipzig, Institute of Computer Science, AKSW Group, Augustusplatz 10, D-04009 Leipzig, Germany
E-mail: {lastname}@informatik.uni-leipzig.de
b University of Mannheim, Research Group Data and Web Science, B6-26, D-68159 Mannheim
E-mail: chris@informatik.uni-mannheim.de
c OpenLink Software, 10 Burlington Mall Road, Suite 265, Burlington, MA 01803, U.S.A.
E-mail: pkleef@openlinksw.com
d Hasso-Plattner-Institute for IT-Systems Engineering, Prof.-Dr.-Helmert-Str. 2-3, D-14482 Potsdam, Germany
E-mail: mail@anjajentzsch.de
e Neofonie GmbH, Robert-Koch-Platz 4, D-10115 Berlin, Germany
E-mail: max.jakob@neofonie.de
f Kno.e.sis - Ohio Center of Excellence in Knowledge-enabled Computing, Wright State University, Dayton, USA
E-mail: pablo@knoesis.org
g Brox IT-Solutions GmbH, An der Breiten Wiese 9, D-30625 Hannover, Germany
E-mail: mail@robertisele.com
Abstract.
The DBpedia community project extracts structured, multilingual knowledge from Wikipedia and makes it freely
available on the Web using Semantic Web and Linked Data technologies. The project extracts knowledge from 111 different
language editions of Wikipedia. The largest DBpedia knowledge base which is extracted from the English edition of Wikipedia
consists of over 400 million facts that describe 3.7 million things. The DBpedia knowledge bases that are extracted from the other
110 Wikipedia editions together consist of 1.46 billion facts and describe 10 million additional things. The DBpedia project maps
Wikipedia infoboxes from 27 different language editions to a single shared ontology consisting of 320 classes and 1,650 properties.
The mappings are created via a world-wide crowd-sourcing effort and enable knowledge from the different Wikipedia editions to
be combined. The project publishes regular releases of all DBpedia knowledge bases for download and provides SPARQL query
access to 14 out of the 111 language editions via a global network of local DBpedia chapters. In addition to the regular releases,
the project maintains a live knowledge base which is updated whenever a page in Wikipedia changes. DBpedia sets 27 million
RDF links pointing into over 30 external data sources and thus enables data from these sources to be used together with DBpedia
data. Several hundred data sets on the Web publish RDF links pointing to DBpedia themselves and thus make DBpedia one of
the central interlinking hubs in the Linked Open Data (LOD) cloud. In this system report, we give an overview of the DBpedia
community project, including its architecture, technical implementation, maintenance, internationalisation, usage statistics and
applications.
Keywords: Knowledge Extraction, Wikipedia, Multilingual Knowledge Bases, Linked Data, RDF
1570-0844/12/$27.50 © 2012 – IOS Press and the authors. All rights reserved
1. Introduction
Wikipedia is the 6th most popular website (http://www.alexa.com/topsites, retrieved in October 2013), the most widely used encyclopedia, and one of the finest examples of truly collaboratively created content. There are official Wikipedia editions in 287 different languages, ranging in size from a couple of hundred articles up to 3.8 million articles in the English edition (see http://meta.wikimedia.org/wiki/List_of_Wikipedias). Besides free text, Wikipedia articles consist of different types of structured data such as infoboxes, tables, lists, and categorization data. Wikipedia currently offers only free-text search capabilities to its users, which makes it very difficult to find, for example, all rivers that flow into the Rhine and are longer than 100 miles, or all Italian composers that were born in the 18th century.
The DBpedia project builds a large-scale, multilin-
gual knowledge base by extracting structured data from
Wikipedia editions in 111 languages. This knowledge
base can be used to answer expressive queries such as
the ones outlined above. Being multilingual and covering a wide range of topics, the DBpedia knowledge
base is also useful within further application domains
such as data integration, named entity recognition, topic
detection, and document ranking.
The DBpedia knowledge base is widely used as a
testbed in the research community and numerous appli-
cations, algorithms and tools have been built around or
applied to DBpedia. DBpedia is served as Linked Data
on the Web. Since it covers a wide variety of topics
and sets RDF links pointing into various external data
sources, many Linked Data publishers have decided
to set RDF links pointing to DBpedia from their data
sets. Thus, DBpedia has developed into a central inter-
linking hub in the Web of Linked Data and has been
a key factor for the success of the Linked Open Data
initiative.
The structure of the DBpedia knowledge base is
maintained by the DBpedia user community. Most
importantly, the community creates mappings from
Wikipedia information representation structures to the
DBpedia ontology. This ontology – which will be ex-
plained in detail in Section 3 – unifies different tem-
plate structures, both within single Wikipedia language
editions and across currently 27 different languages.
The maintenance of different language editions of DB-
pedia is spread across a number of organisations. Each
organisation is responsible for the support of a certain
language. The local DBpedia chapters are coordinated
by the DBpedia Internationalisation Committee.
The aim of this system report is to provide a descrip-
tion of the DBpedia community project, including the
architecture of the DBpedia extraction framework, its
technical implementation, maintenance, internationali-
sation, usage statistics, as well as some popular DBpedia applications. This system report is a com-
prehensive update and extension of previous project de-
scriptions in [1] and [5]. The main advances compared to these articles are:
– The concept and implementation of the extraction based on a community-curated DBpedia ontology.
– The wide internationalisation of DBpedia.
– A live synchronisation module which processes updates in Wikipedia as well as the DBpedia ontology and allows third parties to keep their copies of DBpedia up-to-date.
– A description of the maintenance of public DBpedia services and statistics about their usage.
– An increased number of interlinked data sets which can be used to further enrich the content of DBpedia.
– The discussion and summary of novel third party applications of DBpedia.
Overall, DBpedia has undergone 7 years of continu-
ous evolution. Table 17 provides an overview of the
project’s timeline.
The system report is structured as follows: In the next
section, we describe the DBpedia extraction framework,
which forms the technical core of DBpedia. This is
followed by an explanation of the community-curated
DBpedia ontology with a focus on multilingual sup-
port. In Section 4, we explicate how DBpedia is syn-
chronised with Wikipedia with just very short delays
and how updates are propagated to DBpedia mirrors
employing the DBpedia Live system. Subsequently, we
give an overview of the external data sets that are in-
terlinked from DBpedia or that set RDF links pointing
to DBpedia themselves (Section 5). In Section 6, we
provide statistics on the usage of DBpedia and describe
the maintenance of a large scale public data set. Within
Section 7, we briefly describe several use cases and
applications of DBpedia in a variety of different areas.
Finally, we report on related work in Section 8 and give
an outlook on the further development of DBpedia in
Section 9.
Fig. 1. Overview of DBpedia extraction framework.
2. Extraction Framework
Wikipedia articles consist mostly of free text, but also comprise various types of structured information
in the form of wiki markup. Such information includes
infobox templates, categorisation information, images,
geo-coordinates, links to external web pages, disam-
biguation pages, redirects between pages, and links
across different language editions of Wikipedia. The
DBpedia extraction framework extracts this structured
information from Wikipedia and turns it into a rich
knowledge base. In this section, we give an overview
of the DBpedia knowledge extraction framework.
2.1. General Architecture
Figure 1 shows an overview of the technical frame-
work. The DBpedia extraction is structured into four
phases:
Input: Wikipedia pages are read from an external source. Pages can either be read from a Wikipedia dump or directly fetched from a MediaWiki installation using the MediaWiki API.
Parsing: Each Wikipedia page is parsed by the wiki parser. The wiki parser transforms the source code of a Wikipedia page into an Abstract Syntax Tree.
Extraction: The Abstract Syntax Tree of each Wikipedia page is forwarded to the extractors. DBpedia offers extractors for many different purposes, for instance, to extract labels, abstracts or geographical coordinates. Each extractor consumes an Abstract Syntax Tree and yields a set of RDF statements.
Output: The collected RDF statements are written to a sink. Different formats, such as N-Triples, are supported.
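To make the data flow between these four phases concrete, the following Python sketch wires a page source, a parser, a set of extractors and an output sink together. All class names (DumpSource, WikiParser, LabelExtractor, NTriplesSink) are illustrative placeholders and the parsing is deliberately trivial; the actual framework is implemented in Scala and is considerably more elaborate.

# Minimal sketch of the four-phase extraction flow (Input -> Parsing ->
# Extraction -> Output). All names are illustrative placeholders.

class DumpSource:
    """Input phase: yields raw wiki source of pages, e.g. from a dump file."""
    def __init__(self, pages):
        self.pages = pages          # list of (title, wiki_markup) tuples
    def read(self):
        yield from self.pages

class WikiParser:
    """Parsing phase: turns wiki markup into a (very simplified) syntax tree."""
    def parse(self, title, markup):
        return {"title": title, "lines": markup.splitlines()}

class LabelExtractor:
    """Extraction phase: one extractor, producing RDF-like triples."""
    def extract(self, tree):
        subject = "dbr:" + tree["title"].replace(" ", "_")
        return [(subject, "rdfs:label", '"%s"' % tree["title"])]

class NTriplesSink:
    """Output phase: collects the statements produced by all extractors."""
    def __init__(self):
        self.triples = []
    def write(self, triples):
        self.triples.extend(triples)

def run_pipeline(source, parser, extractors, sink):
    for title, markup in source.read():
        tree = parser.parse(title, markup)
        for extractor in extractors:
            sink.write(extractor.extract(tree))

if __name__ == "__main__":
    sink = NTriplesSink()
    run_pipeline(DumpSource([("Berlin", "'''Berlin''' is the capital ...")]),
                 WikiParser(), [LabelExtractor()], sink)
    print(sink.triples)   # [('dbr:Berlin', 'rdfs:label', '"Berlin"')]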
2.2. Extractors
The DBpedia extraction framework employs various
extractors for translating different parts of Wikipedia
pages to RDF statements. A list of all available extrac-
tors is shown in Table 1. DBpedia extractors can be
divided into four categories:
Mapping-Based Infobox Extraction: The mapping-based infobox extraction uses manually written mappings that relate infoboxes in Wikipedia to terms in the DBpedia ontology. The mappings also specify a datatype for each infobox property and thus help the extraction framework to produce high quality data. The mapping-based extraction will be described in detail in Section 2.4.
Raw Infobox Extraction: The raw infobox extraction provides a direct mapping from infoboxes in Wikipedia to RDF. As the raw infobox extraction does not rely on explicit extraction knowledge in the form of mappings, the quality of the extracted data is lower. The raw infobox data is useful if a specific infobox has not been mapped yet and thus is not available in the mapping-based extraction.
Feature Extraction: The feature extraction uses a number of extractors that are specialized in extracting a single feature from an article, such as a label or geographic coordinates.
Statistical Extraction: Some NLP related extractors aggregate data from all Wikipedia pages in order to provide data that is based on statistical measures of page links or word counts, as further described in Section 2.6.
2.3. Raw Infobox Extraction
The type of Wikipedia content that is most valuable for the DBpedia extraction is the infobox. Infoboxes are
frequently used to list an article’s most relevant facts
as a table of attribute-value pairs on the top right-hand
side of the Wikipedia page (for right-to-left languages
on the top left-hand side). Infoboxes that appear in a
Wikipedia article are based on a template that specifies
a list of attributes that can form the infobox. A wide
range of infobox templates are used in Wikipedia. Com-
mon examples are templates for infoboxes that describe
persons, organisations or automobiles. As Wikipedia’s
infobox template system has evolved over time, dif-
ferent communities of Wikipedia editors use differ-
ent templates to describe the same type of things (e.g. Infobox city japan, Infobox swiss town and Infobox town de). In addition, different templates use different names for the same attribute (e.g. birthplace and placeofbirth). As many
Wikipedia editors do not strictly follow the recommen-
dations given on the page that describes a template,
attribute values are expressed using a wide range of
different formats and units of measurement. An excerpt
of an infobox that is based on a template for describing
automobiles is shown below:
{{Infobox automobile
| name = Ford GT40
| manufacturer = [[Ford Advanced Vehicles]]
| production = 1964-1969
| engine = 4181cc
(...)
}}
In this infobox, the first line specifies the infobox
type and the subsequent lines specify various attributes
of the described entity.
An excerpt of the extracted data is as follows:
dbr:Ford_GT40
dbp:name "Ford GT40"@en;
dbp:manufacturer dbr:Ford_Advanced_Vehicles;
dbp:engine 4181;
dbp:production 1964;
(...).
This extraction output has weaknesses: The resource is not associated to a class in the ontology, and parsed values are cleaned up and assigned a datatype based on heuristics. In particular, the raw infobox extractor searches for values in the following order: dates, coordinates, numbers, links and strings as default. Thus, the datatype assignment for the same property in different resources is non-deterministic. The engine value, for example, is extracted as a number, but if another instance of the template used "cc4181" it would be extracted as a string. This behaviour makes querying for properties in the dbp namespace inconsistent. These problems can be overcome by the mapping-based infobox extraction presented in the next subsection.
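The order of the datatype heuristic can be illustrated with the following Python sketch; the regular expressions are simplified stand-ins for the framework's actual parsers and only serve to show why the same property can receive different datatypes.

import re

# Simplified sketch of the raw infobox value heuristic: try dates, then
# coordinates, then numbers, then wiki links, and fall back to plain strings.
DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")
COORD = re.compile(r"^-?\d+(\.\d+)? +-?\d+(\.\d+)?$")
NUMBER = re.compile(r"^-?\d+(\.\d+)?")
LINK = re.compile(r"^\[\[([^\]|]+)(\|[^\]]+)?\]\]$")

def guess_value(raw):
    raw = raw.strip()
    if DATE.match(raw):
        return ("xsd:date", raw)
    if COORD.match(raw):
        return ("georss:point", raw)
    number = NUMBER.match(raw)
    if number:
        return ("xsd:double", float(number.group(0)))
    link = LINK.match(raw)
    if link:
        return ("object", "dbr:" + link.group(1).replace(" ", "_"))
    return ("xsd:string", raw)

# "4181cc" starts with a number and is extracted as 4181.0, while "cc4181"
# falls through to a plain string -- the non-determinism described above.
print(guess_value("4181cc"))   # ('xsd:double', 4181.0)
print(guess_value("cc4181"))   # ('xsd:string', 'cc4181')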
2.4. Mapping-Based Infobox Extraction
In order to homogenize the description of informa-
tion in the knowledge base, in 2010 a community ef-
fort was initiated to develop an ontology schema and
mappings from Wikipedia infobox properties to this
ontology. The alignment between Wikipedia infoboxes
and the ontology is performed via community-provided
mappings that help to normalize name variations in
properties and classes. Heterogeneity in the Wikipedia
infobox system, like using different infoboxes for the
same type of entity or using different property names for
the same property (cf. Section 2.3), can be alleviated in
this way. This significantly increases the quality of the
raw Wikipedia infobox data by typing resources, merg-
ing name variations and assigning specific datatypes to
the values.
This effort is realized using the DBpedia Mappings Wiki (http://mappings.dbpedia.org), a MediaWiki installation set up to enable users
to collaboratively create and edit mappings. These map-
pings are specified using the DBpedia Mapping Lan-
guage. The mapping language makes use of MediaWiki
templates that define DBpedia ontology classes and
properties as well as template/table to ontology map-
pings. A mapping assigns a type from the DBpedia on-
tology to the entities that are described by the corre-
sponding infobox. In addition, attributes in the infobox
are mapped to properties in the DBpedia ontology. In
the following, we show a mapping that maps infoboxes that use the Infobox automobile template to the DBpedia ontology; the full mapping is shown after Table 1.
Table 1
Overview of the DBpedia extractors (cf. Table 16 for a complete list of prefixes).
Name | Description | Example
abstract | Extracts the first lines of the Wikipedia article. | dbr:Berlin dbo:abstract "Berlin is the capital city of (...)" .
article categories | Extracts the categorization of the article. | dbr:Oliver_Twist dc:subject dbr:Category:English_novels .
category label | Extracts labels for categories. | dbr:Category:English_novels rdfs:label "English novels" .
category hierarchy | Extracts information about which concept is a category and how categories are related using the SKOS vocabulary. | dbr:Category:World_War_II skos:broader dbr:Category:Modern_history .
disambiguation | Extracts disambiguation links. | dbr:Alien dbo:wikiPageDisambiguates dbr:Alien_(film) .
external links | Extracts links to external web pages related to the concept. | dbr:Animal_Farm dbo:wikiPageExternalLink <http://books.google.com/?id=RBGmrDnBs8UC> .
geo coordinates | Extracts geo-coordinates. | dbr:Berlin georss:point "52.5006 13.3989" .
grammatical gender | Extracts grammatical genders for persons. | dbr:Abraham_Lincoln foaf:gender "male" .
homepage | Extracts links to the official homepage of an instance. | dbr:Alabama foaf:homepage <http://alabama.gov/> .
image | Extracts the first image of a Wikipedia page. | dbr:Berlin foaf:depiction <http://.../Overview_Berlin.jpg> .
infobox | Extracts all properties from all infoboxes. | dbr:Animal_Farm dbo:date "March 2010" .
interlanguage | Extracts interwiki links. | dbr:Albedo dbo:wikiPageInterLanguageLink dbr-de:Albedo .
label | Extracts the article title as label. | dbr:Berlin rdfs:label "Berlin" .
lexicalizations | Extracts information about surface forms and their association with concepts (only N-Quad format). | dbr:Pine sptl:lexicalization lx:pine_tree ls:Pine_pine_tree . lx:pine_tree rdfs:label "pine tree" . ls:Pine_pine_tree sptl:pUriGivenSf "0.941" .
mappings | Extraction based on mappings of Wikipedia infoboxes to the DBpedia ontology. | dbr:Berlin dbo:country dbr:Germany .
page ID | Extracts page ids of articles. | dbr:Autism dbo:wikiPageID "25" .
page links | Extracts all links between Wikipedia articles. | dbr:Autism dbo:wikiPageWikiLink dbr:Human_brain .
persondata | Extracts information about persons represented using the PersonData template. | dbr:Andre_Agassi foaf:birthDate "1970-04-29" .
PND | Extracts PND (Personennamendatei) data about a person. | dbr:William_Shakespeare dbo:individualisedPnd "118613723" .
redirects | Extracts redirect links between articles in Wikipedia. | dbr:ArtificialLanguages dbo:wikiPageRedirects dbr:Constructed_language .
revision ID | Extracts the revision ID of the Wikipedia article. | dbr:Autism <http://www.w3.org/ns/prov#wasDerivedFrom> <http://en.wikipedia.org/wiki/Autism?oldid=495234324> .
thematic concept | Extracts 'thematic' concepts, the centres of discussion for categories. | dbr:Category:Music skos:subject dbr:Music .
topic signatures | Extracts topic signatures. | dbr:Alkane sptl:topicSignature "carbon alkanes atoms" .
wiki page | Extracts links to corresponding articles in Wikipedia. | dbr:AnAmericanInParis foaf:isPrimaryTopicOf <http://en.wikipedia.org/wiki/AnAmericanInParis> .
{{TemplateMapping
|mapToClass = Automobile
|mappings =
{{PropertyMapping
| templateProperty = name
| ontologyProperty = foaf:name }}
{{PropertyMapping
| templateProperty = manufacturer
| ontologyProperty = manufacturer }}
{{DateIntervalMapping
| templateProperty = production
| startDateOntologyProperty =
productionStartDate
| endDateOntologyProperty =
productionEndDate }}
{{IntermediateNodeMapping
| nodeClass = AutomobileEngine
| correspondingProperty = engine
| mappings =
{{PropertyMapping
| templateProperty = engine
| ontologyProperty = displacement
| unit = Volume }}
{{PropertyMapping
| templateProperty = engine
| ontologyProperty = powerOutput
| unit = Power }}
}}
(...)
}}
The RDF statements that are extracted from the pre-
vious infobox example are shown below. As we can see,
the production period is correctly split into a start year
and an end year and the engine is represented by a dis-
tinct RDF node. It is worth mentioning that all values
are canonicalized to basic units. For example, in the
engine
mapping we state that
engine
is a Volume
and thus, the extractor converts “4181cc” (cubic cen-
timeters) to cubic meters (“0.004181”). Additionally,
there can exist multiple mappings on the same property that search for different datatypes or different units, for example a number with "PS" (metric horsepower) as a suffix for the engine's power output.
dbr:Ford_GT40
rdf:type dbo:Automobile;
rdfs:label "Ford GT40"@en;
dbo:manufacturer
dbr:Ford_Advanced_Vehicles;
dbo:productionStartYear
"1964"ˆˆxsd:gYear;
dbo:productionEndYear "1969"ˆˆxsd:gYear;
dbo:engine [
rdf:type AutomobileEngine;
dbo:displacement "0.004181";
]
(...) .
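The canonicalisation of values to basic units can be illustrated with a small Python sketch; the two conversion factors below cover only the units mentioned in this example and are not the framework's actual unit registry.

import re

# Illustrative unit canonicalisation: raw values are converted to base units
# before they are emitted, e.g. cubic centimetres to cubic metres and
# metric horsepower (PS) to watts.
TO_BASE_UNIT = {
    "cc": ("Volume", 1e-6),      # 1 cc = 1e-6 m^3
    "PS": ("Power", 735.49875),  # 1 PS = 735.49875 W
}

def canonicalise(raw):
    match = re.match(r"^\s*([\d.]+)\s*([A-Za-z]+)\s*$", raw)
    if not match:
        return None
    value, unit = float(match.group(1)), match.group(2)
    if unit not in TO_BASE_UNIT:
        return None
    dimension, factor = TO_BASE_UNIT[unit]
    return dimension, value * factor

print(canonicalise("4181cc"))   # ('Volume', 0.004181)
print(canonicalise("300 PS"))   # ('Power', 220649.625)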
The DBpedia Mapping Wiki is not only used to map
different templates within a single language edition
of Wikipedia to the DBpedia ontology, but is used to
map templates from all Wikipedia language editions
to the shared DBpedia ontology. Figure 2 shows how
the infobox properties author and συγγραφέας – author in Greek – are both being mapped to the global identifier dbo:author. That means, in turn, that in-
formation from all language versions of DBpedia can
be merged and DBpedias for smaller languages can be
augmented with knowledge from larger DBpedias such
as the English edition. Conversely, the larger DBpe-
dia editions can benefit from more specialized knowl-
edge from localized editions, such as data about smaller
towns which is often only present in the corresponding
language edition [43].
Besides hosting of the mappings and DBpedia on-
tology definition, the DBpedia Mappings Wiki offers
various tools which support users in their work:
Mapping Syntax Validator: The mapping syntax validator checks for syntactic correctness and highlights inconsistencies such as missing property definitions.
Extraction Tester: The extraction tester linked on each mapping page tests a mapping against a set of example Wikipedia pages. This gives direct feedback about whether a mapping works and how the resulting data is structured.
Mapping Tool: The DBpedia Mapping Tool is a graphical user interface that supports users in creating and editing mappings.
2.5. URI Schemes
For every Wikipedia article, the framework intro-
duces a number of URIs to represent the concepts de-
scribed on a particular page. Up to 2011, DBpedia pub-
lished URIs only under the
http://dbpedia.org
domain. The main namespaces were:
– http://dbpedia.org/resource/ (prefix dbr) for representing article data. There is a one-to-one mapping between a Wikipedia page and a DBpedia resource based on the article title. For example, for the Wikipedia article on Berlin (http://en.wikipedia.org/wiki/Berlin), DBpedia will produce the URI dbr:Berlin. Exceptions to this rule appear when intermediate nodes are extracted by the mapping-based infobox extractor as unique URIs (e.g., the engine mapping example in Section 2.4).
– http://dbpedia.org/property/ (prefix dbp) for representing properties extracted from the raw infobox extraction (cf. Section 2.3), e.g. dbp:population.
– http://dbpedia.org/ontology/ (prefix dbo) for representing the DBpedia ontology (cf. Section 2.4), e.g. dbo:populationTotal.
Fig. 2. Depiction of the mapping from the Greek (left) and English Wikipedia templates (right) about books to the same DBpedia Ontology class (middle) [24].
Although data from other Wikipedia language editions were extracted, they used the same namespaces. This was achieved by exploiting the Wikipedia inter-language links (http://en.wikipedia.org/wiki/Help:Interlanguage_links). For every page in a language other than English, the page was extracted only if the page contained an inter-language link to an English page. In that case, using the English link, the data was extracted under the English resource name (i.e. dbr:Berlin).
Recent DBpedia internationalisation developments showed that this approach omitted valuable data [24]. Thus, starting from the DBpedia 3.7 release (a list of all DBpedia releases is provided in Table 17), two types of data sets were generated. The localized data sets contain all things that are described in a specific language. Within these data sets, things are identified with language-specific URIs such as http://<lang>.dbpedia.org/resource/ for article data and http://<lang>.dbpedia.org/property/ for property data. In addition, we produce a canonicalized data set for each language. The canonicalized data sets only contain things for which a corresponding page in the English edition of Wikipedia exists. Within all canonicalized data sets, the same thing is identified with the same URI from the generic language-agnostic namespace http://dbpedia.org/resource/.
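The naming convention for localized and canonicalized identifiers can be summarised by the small helper below; it is a sketch of the URI scheme described in this section (spaces in article titles become underscores) and glosses over the full IRI-encoding rules applied by the framework.

def dbpedia_uris(title, lang):
    """Derive the localized and the canonical DBpedia resource URI for an
    article title, following the scheme sketched above."""
    local_name = title.replace(" ", "_")
    localized = "http://%s.dbpedia.org/resource/%s" % (lang, local_name)
    canonical = "http://dbpedia.org/resource/%s" % local_name
    return localized, canonical

print(dbpedia_uris("Berlin", "de"))
# ('http://de.dbpedia.org/resource/Berlin', 'http://dbpedia.org/resource/Berlin')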
2.6. NLP Extraction
DBpedia provides a number of data sets which have
been created to support Natural Language Processing
(NLP) tasks [
33
]. Currently, four datasets are extracted:
topic signatures,grammatical gender,lexicalizations
and thematic concept. While the topic signatures and
the grammatical gender extractors primarily extract data
from the article text, the lexicalizations and thematic
concept extractors make use of the wiki markup.
DBpedia entities can be referred to using many dif-
ferent names and abbreviations. The Lexicalization data
set provides access to alternative names for entities
and concepts, associated with several scores estimating
the association strength between name and URI. These
scores distinguish more common names for specific
entities from rarely used ones and also show how am-
biguous a name is with respect to all possible concepts
that it can mean.
The topic signatures data set enables the description
of DBpedia resources based on unstructured informa-
tion, as compared to the structured factual data provided
by the mapping-based and raw extractors. We build a
Vector Space Model (VSM) where each DBpedia re-
source is a point in a multidimensional space of words.
Each DBpedia resource is represented by a vector, and
each word occurring in Wikipedia is a dimension of
this vector. Word scores are computed using the tf-idf
weight, with the intention of measuring how strong the association between a word and a DBpedia resource is.
Note that word stems are used in this context in order to
generalize over inflected words. We use the computed
weights to select the strongest related word stems for
each entity and build topic signatures [30].
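The following Python sketch shows the general idea on a toy corpus; it uses whole words instead of word stems and a plain tf-idf formula, so it only approximates the actual topic signature computation.

import math
from collections import Counter

# Toy corpus: one bag of words per DBpedia resource (invented for illustration).
docs = {
    "dbr:Alkane":  "carbon atoms alkanes carbon bonds hydrogen".split(),
    "dbr:Benzene": "carbon ring aromatic bonds hydrogen".split(),
    "dbr:Berlin":  "capital germany city river spree".split(),
}

def topic_signature(resource, docs, k=3):
    """Return the k words with the highest tf-idf weight for a resource."""
    tf = Counter(docs[resource])
    n_docs = len(docs)
    def tfidf(word):
        df = sum(1 for words in docs.values() if word in words)
        return tf[word] * math.log(n_docs / df)
    return sorted(tf, key=tfidf, reverse=True)[:k]

print(topic_signature("dbr:Alkane", docs))  # ['atoms', 'alkanes', 'carbon']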
There are two more Feature Extractors related to Nat-
ural Language Processing. The thematic concepts data
set relies on Wikipedia’s category system to capture the
idea of a ‘theme’, a subject that is discussed in its arti-
cles. Many of the categories in Wikipedia are linked to
an article that describes the main topic of that category.
We rely on this information to mark DBpedia entities
and concepts that are ‘thematic’, that is, they are the
centre of discussion for a category.
The grammatical gender data set uses a simple heuris-
tic to decide on a grammatical gender for instances of
the class Person in DBpedia. While parsing an article
in the English Wikipedia, if there is a mapping from
an infobox in this article to the class
dbo:Person
,
we record the frequency of gender-specific pronouns in
their declined forms (Subject, Object, Possessive Ad-
jective, Possessive Pronoun and Reflexive) – i.e. he,
him, his, himself (masculine) and she, her, hers, herself
(feminine). Grammatical genders for DBpedia entities
are assigned based on the dominating gender in these
pronouns.
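A condensed Python version of this heuristic is given below; the pronoun lists follow the description above, while the simple majority decision is a simplifying assumption.

import re
from collections import Counter

MASCULINE = {"he", "him", "his", "himself"}
FEMININE = {"she", "her", "hers", "herself"}

def grammatical_gender(article_text):
    """Assign a gender based on the dominating gender-specific pronouns."""
    tokens = re.findall(r"[a-z]+", article_text.lower())
    counts = Counter(t for t in tokens if t in MASCULINE | FEMININE)
    masc = sum(counts[t] for t in MASCULINE)
    fem = sum(counts[t] for t in FEMININE)
    if masc == fem == 0:
        return None
    return "male" if masc > fem else "female"

print(grammatical_gender("After his election he moved to Washington ..."))  # male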
2.7. Summary of Other Recent Developments
In this section we summarize the improvements of
the DBpedia extraction framework since the publication
of the previous DBpedia overview article [
5
] in 2009.
One of the major changes on the implementation level
is that the extraction framework was rewritten in Scala in 2010 (Table 17 provides an overview of the project's evolution through time), resulting in an improvement of the efficiency of the extractors by an order of magnitude compared to the previous PHP-based framework. This is crucial for the DBpedia Live project, which will be explained in Section 4. The new, more modular framework also allows extracting data from tables in Wikipedia pages and supports extraction from multiple MediaWiki templates per page. Another significant change was the
creation and utilization of the DBpedia Mappings Wiki
as described earlier. Further significant changes include
the mentioned NLP extractors and the introduction of
URI schemes.
In addition, there were several smaller improvements
and general maintenance: Overall, over the past four
years, the parsing of the MediaWiki markup improved
considerably, which led to better overall coverage, for
example, concerning references and parser functions.
In addition, the collection of MediaWiki namespace
identifiers for many languages is now performed semi-
automatically leading to a high accuracy of detection.
This concerns common title prefixes such as User, File,
Template, Help, Portal etc. in English that indicate
pages that do not contain encyclopedic content and
would produce noise in the data. They are important for
specific extractors as well, for instance, the category hi-
erarchy data set is produced from pages of the Category
namespace. Furthermore, the output of the extraction
system now supports more formats and several compli-
ance issues regarding URIs, IRIs, N-Triples and Turtle
were fixed.
The individual data extractors have been improved as
well in both number and quality in many areas. The ab-
stract extraction was enhanced producing more accurate
plain text representations of the beginning of Wikipedia
article texts. More diverse and more specific datatypes now exist (e.g. many currencies and XSD datatypes such as xsd:gYearMonth, xsd:positiveInteger, etc.) and for a number of classes and properties, specific datatypes were added (e.g. inhabitants/km² for the population density of populated places and m³/s for the discharge of rivers). Many issues related to data parsers
were resolved and the quality of the
owl:sameAs
data set for multiple language versions was increased
by an implementation that takes bijective relations into
account.
There are also further extractors, e.g. for Wikipedia
page IDs and revisions. Moreover, redirect and disam-
biguation extractors were introduced and improved. For
the redirect data, the transitive closure is computed
while taking care of catching cycles in the links. The
redirects also help regarding infobox coverage in the
mapping-based extraction by resolving alternative tem-
plate names. Moreover, in the PHP framework, if an
infobox value pointed to a redirect, this redirection was
not properly resolved and thus resulted in RDF links
that led to URIs which did not contain any further in-
formation. Resolving redirects affected approximately
15% of all links, and hence increased the overall inter-
connectivity of resources in the DBpedia ontology.
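The resolution of redirect chains can be pictured with the following Python sketch, which follows a chain until a non-redirect target is reached and stops if a cycle is detected; the actual framework computes the full transitive closure over the extracted redirect data set.

def resolve_redirect(start, redirects):
    """Follow a chain of redirects to its final target, catching cycles.

    `redirects` maps a redirecting page to the page it redirects to.
    Returns the final target, or None if the chain runs into a cycle."""
    seen = {start}
    current = start
    while current in redirects:
        current = redirects[current]
        if current in seen:        # cycle detected, e.g. A -> B -> A
            return None
        seen.add(current)
    return current

# Illustrative (hypothetical) redirect chain:
redirects = {
    "ArtificialLanguages": "Constructed_languages",
    "Constructed_languages": "Constructed_language",
}
print(resolve_redirect("ArtificialLanguages", redirects))  # Constructed_language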
Fig. 3. Snapshot of a part of the DBpedia ontology.
Finally, a new heuristic to increase the connectedness of DBpedia instances was introduced. If an infobox
contains a string value that is not linked to another
Wikipedia article, the extraction framework searches
for hyperlinks in the same Wikipedia article that have
the same anchor text as the infobox value string. If such
a link exists, the target of that link is used to replace
the string value in the infobox. This method further
increases the number of object property assertions in
the DBpedia ontology.
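A simplified rendering of this heuristic in Python is shown below; the wiki-link regular expression is a rough approximation of MediaWiki syntax and the example article text is invented for illustration.

import re

WIKI_LINK = re.compile(r"\[\[([^\]|]+)\|([^\]]+)\]\]|\[\[([^\]|]+)\]\]")

def link_for_anchor(article_markup, infobox_value):
    """Replace an unlinked infobox string by the target of an article link
    whose anchor text equals the string, if such a link exists."""
    for target, anchor, plain in WIKI_LINK.findall(article_markup):
        if plain:                       # [[Target]] - anchor equals target
            target, anchor = plain, plain
        if anchor.strip() == infobox_value.strip():
            return "dbr:" + target.strip().replace(" ", "_")
    return infobox_value                # keep the literal if nothing matches

article = "The car was designed by [[Carroll Shelby|Shelby]] in the 1960s."
print(link_for_anchor(article, "Shelby"))   # dbr:Carroll_Shelby
print(link_for_anchor(article, "Ford"))     # Ford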
Orthogonal to the previously mentioned improve-
ments, there have been various efforts to assess the qual-
ity of the DBpedia datasets. [26] developed a framework for estimating the quality of DBpedia, in which a sample of 75 resources was analysed. A more comprehensive effort was performed in [48] by providing a distributed web-based interface [25] for quality assessment. In this study, 17 data quality problem types were analysed by 58 users covering 521 resources in DBpedia.
3. DBpedia Ontology
The DBpedia ontology consists of 320 classes which
form a subsumption hierarchy and are described by
1,650 different properties. With a maximal depth of
5, the subsumption hierarchy is intentionally kept
rather shallow which fits use cases in which the on-
tology is visualized or navigated. Figure 3 depicts
a part of the DBpedia ontology, indicating the rela-
tions among the top ten classes of the DBpedia on-
tology, i.e. the classes with the highest number of
instances. The complete DBpedia ontology can be
browsed online at http://mappings.dbpedia.org/server/ontology/classes/.
Fig. 4. Growth of the DBpedia ontology (classes and properties over time).
The DBpedia ontology is maintained and extended
by the community in the DBpedia Mappings Wiki. Fig-
ure 4 depicts the growth of the DBpedia ontology over
time. While the number of classes is not growing too
much due to the already good coverage of the initial
version of the ontology, the number of properties in-
creases over time due to the collaboration on the DBpe-
dia Mappings Wiki and the addition of more detailed
information to infoboxes by Wikipedia editors.
3.1. Mapping Statistics
As of April 2013, there exist mapping communities
for 27 languages, 23 of which are active. Figure 5 shows
Fig. 5. Mapping coverage statistics for all mapping-enabled languages.
statistics for the coverage of these mappings in DBpe-
dia. Figures (a) and (c) refer to the absolute number
of template and property mappings that are defined for
every DBpedia language edition. Figures (b) and (d) de-
pict the percentage of the defined template and property
mappings compared to the total number of available
templates and properties for every Wikipedia language
edition. Figures (e) and (g) show the occurrences (in-
stances) that the defined template and property map-
pings have in Wikipedia. Finally, figures (f) and (h)
give the percentage of the mapped template and property occurrences compared to the total template and property occurrences in a Wikipedia language edition.
It can be observed in the figure that the Portuguese
DBpedia language edition is the most complete regard-
ing mapping coverage. Other language editions such
as Bulgarian, Dutch, English, Greek, Polish and Span-
ish have mapped templates covering more than 50%
of total template occurrences. In addition, almost all
languages have covered more than 20% of property oc-
currences, with Bulgarian and Portuguese reaching up
to 70%.
The mapping activity of the ontology enrichment
process along with the editing of the ten most active
mapping language communities is depicted in Figure 6.
It is interesting to notice that the high mapping ac-
tivity peaks coincide with the DBpedia release dates.
Fig. 7. English property mappings occurrence frequency (both axes
are in log scale)
For instance, the DBpedia 3.7 version was released in September 2011, and the 2nd and 3rd quarters of that
year have a very high activity compared to the 4th quar-
ter. In the last two years (2012 and 2013), most of the
DBpedia mapping language communities have defined
their own chapters and have their own release dates.
Thus, recent mapping activity shows less fluctuation.
Finally, Figure 7 shows the English property map-
pings occurrence frequency. Both axes are in log scale
and represent the number of property mappings (x axis)
that have exactly y occurrences (y axis). The occurrence
frequency follows a long tail distribution. Thus, a low
number of property mappings have a high number of
occurrences and a high number of property mappings
have a low number of occurrences.
3.2. Instance Data
The DBpedia 3.8 release contains localized versions
of DBpedia for 111 languages which have been ex-
tracted from the Wikipedia edition in the correspond-
ing language. For 20 of these languages, we report in
this section the overall number of entities being de-
scribed by the localized versions as well as the num-
ber of facts (i.e. statements) that have been extracted
from infoboxes describing these things. Afterwards, we
report on the number of instances of popular classes
within the 20 DBpedia versions as well as the concep-
tual overlap between the languages.
Table 2 shows the overall number of things, ontol-
ogy and raw-infobox properties, infobox statements and
type statements for the 20 languages. The column head-
ings have the following meaning: LD = Localized data
sets (see Section 2.5); CD = Canonicalized data sets
(see Section 2.5); all = Overall number of instances in
the data set, including instances without infobox data;
with MD = Number of instances for which mapping-
based infobox data exists; Raw Properties = Number
of different properties that are generated by the raw
infobox extractor; Mapping Properties = Number of
different properties that are generated by the mapping-
based infobox extractor; Raw Statements = Number of
statements (facts) that are generated by the raw infobox
extractor; Mapping Statements = Number of statements
(facts) that are generated by the mapping-based infobox
extractor.
It is interesting to see that the English version of DB-
pedia describes about three times more instances than
the second and third largest language editions (French,
German). Comparing the first column of the table with
the second and third reveals which portion of the in-
stances of a specific language correspond to instances
in the English version of DBpedia and which portion
of the instances is described by clean, mapping-based
infobox data. The difference between the number of
properties in the raw infobox data set and the cleaner
mapping-based infobox data set (columns 4 and 5) re-
sults on the one hand from multiple Wikipedia infobox
properties being mapped to a single ontology property.
On the other hand, it reflects the number of mappings
that have been so far created in the Mapping Wiki for a
specific language.
Table 3 reports the number of instances for a set of
popular classes from the third and fourth hierarchy level
of the ontology within the canonicalized DBpedia data
sets for each language. The indented classes are sub-
classes of the superclasses set in bold. The zero val-
ues in the table indicate that no infoboxes have been
mapped to a specific ontology class within the corre-
sponding language so far. Again, the English version of
DBpedia covers by far the most instances.
Table 4 shows, for the canonicalized, mapping-based
data set, how many instances are described in multi-
ple languages. The Instances column contains the total
number of instances per class across all 20 languages,
the second column contains the number of instances
that are described only in a single language version, the
next column contains the number of instances that are
contained in two languages but not in three or more lan-
guages, etc. For example, 12,936 persons are described
in five languages but not in six or more languages. The
number 871,630 for the class Person means that all
20 language versions together describe 871,630 differ-
ent persons. The number is higher than the number of
persons described in the canonicalized English infobox
Fig. 6. Mapping community activity for (a) ontology and (b) 10 most active language editions
Table 2
Basic statistics about Localized DBpedia Editions.
Lang. Inst. LD all Inst. CD all Inst. with MD CD Raw Prop. CD Map. Prop. CD Raw Statem. CD Map. Statem. CD
en 3,769,926 3,769,926 2,359,521 48,293 1,313 65,143,840 33,742,015
de 1,243,771 650,037 204,335 9,593 261 7,603,562 2,880,381
fr 1,197,334 740,044 214,953 13,551 228 8,854,322 2,901,809
it 882,127 580,620 383,643 9,716 181 12,227,870 4,804,731
es 879,091 542,524 310,348 14,643 476 7,740,458 4,383,206
pl 848,298 538,641 344,875 7,306 266 7,696,193 4,511,794
ru 822,681 439,605 123,011 13,522 76 6,973,305 1,389,473
pt 699,446 460,258 272,660 12,851 602 6,255,151 4,005,527
ca 367,362 241,534 112,934 8,696 183 3,689,870 1,301,868
cs 225,133 148,819 34,893 5,564 334 1,857,230 474,459
hu 209,180 138,998 63,441 6,821 295 2,506,399 601,037
ko 196,132 124,591 30,962 7,095 419 1,035,606 417,605
tr 187,850 106,644 40,438 7,512 440 1,350,679 556,943
ar 165,722 103,059 16,236 7,898 268 635,058 168,686
eu 132,877 108,713 41,401 2,245 19 2,255,897 532,709
sl 129,834 73,099 22,036 4,235 470 1,213,801 222,447
bg 125,762 87,679 38,825 3,984 274 774,443 488,678
hr 109,890 71,469 10,343 3,334 158 701,182 151,196
el 71,936 48,260 10,813 2,866 288 206,460 113,838
data set (763,643) listed in Table 3, since there are
infoboxes in non-English articles describing a person
without a corresponding infobox in the English article
describing the same person. Summing up columns 2 to
10+ for the
Person
class, we see that 195,263 persons
are described in two or more languages. The large dif-
ference of this number compared to the total number of
871,630 persons is due to the much smaller size of the
localized DBpedia versions compared to the English
one (cf. Table 2).
3.3. Internationalisation Community
The introduction of the mapping-based infobox
extractor alongside live synchronisation approaches
in [20] allowed the international DBpedia community
to easily define infobox-to-ontology mappings. As a
result of this development, there are presently mappings for 27 languages: Arabic (ar), Bulgarian (bg), Bengali (bn), Catalan (ca), Czech (cs), German (de), Greek (el), English (en), Spanish (es), Estonian (et), Basque (eu), French (fr), Irish (ga), Hindi (hi), Croatian (hr), Hungarian (hu), Indonesian (id), Italian (it), Japanese (ja), Korean (ko), Dutch (nl), Polish (pl), Portuguese (pt), Russian (ru), Slovene (sl), Turkish (tr) and Urdu (ur). The DBpedia 3.7 release (http://blog.dbpedia.org/2011/09/11/) in September 2011 was the first DBpedia release to use the localized I18n (Internationalisation) DBpedia extraction framework [24].
Table 3
Number of instances per class within 10 localized DBpedia versions.
Class en it pl es pt fr de ru ca hu
Person 763,643 145,060 70,708 65,337 43,057 62,942 33,122 18,620 7,107 15,529
Athlete 185,126 47,187 30,332 19,482 14,130 21,646 31,237 0 721 4,527
Artist 61,073 12,511 16,120 25,992 10,571 13,465 0 0 2,004 3,821
Politician 23,096 0 8,943 5,513 3,342 0 0 12,004 1,376 760
Place 572,728 141,101 182,727 132,961 116,660 80,602 131,766 67,932 73,078 18,324
Popul.Place 387,166 138,077 167,034 121,204 109,418 72,252 79,410 63,826 72,743 15,535
Building 60,514 1,270 2,946 3,570 803 921 83 43 0 527
River 24,267 0 1,383 2 4,149 3,333 6,707 3,924 0 565
Organisation 192,832 4,142 12,193 11,710 10,949 17,513 16,973 1,598 1,399 3,993
Company 44,516 4,142 2,566 975 1,903 5,832 7,200 0 440 618
Educ.Inst. 42,270 0 599 1,207 270 1,636 2,171 1,010 0 115
Band 27,061 0 2,993 0 4,476 3,868 5,368 0 263 802
Work 333,269 51,918 32,386 36,484 33,869 39,195 18,195 34,363 5,240 9,777
Music.Work 159,070 23,652 14,987 15,911 17,053 16,206 0 6,588 697 4,704
Film 71,715 17,210 9,467 9,396 8,896 9,741 15,038 12,045 3,859 2,320
Software 27,947 5,682 3,050 4,833 3,419 5,733 2,368 0 606 857
Table 4
Cross-language overlap: Number of instances that are described in multiple languages.
Class Instances 1 2 3 4 5 6 7 8 9 10+
Person 871,630 676,367 94,339 42,382 21,647 12,936 8,198 5,295 3,437 2,391 4,638
Place 643,260 307,729 150,349 45,836 36,339 20,831 13,523 20,808 31,422 11,262 5,161
Organisation 206,670 160,398 22,661 9,312 5,002 3,221 2,072 1,421 928 594 1,061
Work 360,808 243,706 54,855 23,097 12,605 8,277 5,732 4,007 2,911 1,995 3,623
At the time of writing, DBpedia chapters for 14 lan-
guages have been founded: Basque, Czech, Dutch, En-
glish, French, German, Greek, Italian, Japanese, Ko-
rean, Polish, Portuguese, Russian and Spanish.
10
Be-
sides providing mappings from infoboxes in the corre-
sponding Wikipedia editions, DBpedia chapters organ-
ise a local community and provide hosting for data sets
and associated services.
While at the moment chapters are defined by ownership of the IP and server of the subdomain A record (e.g. http://ko.dbpedia.org) given by the DBpedia maintainers, the DBpedia Internationalisation Committee (http://wiki.dbpedia.org/Internationalization) is establishing its structure, and each language edition has a representative with a vote in elections. In some cases (e.g. Greek, http://el.dbpedia.org, and Dutch, http://nl.dbpedia.org) the existence of a local DBpedia chapter has had a positive effect on the creation of localized LOD clouds [24].
In the weeks leading to a new release, the DBpe-
dia project organises a mapping sprint, where commu-
nities from each language work together to improve
mappings, increase coverage and detect bugs in the ex-
traction process. The progress of the mapping effort
is tracked through statistics on the number of mapped
templates and properties, as well as the number of
times these templates and properties occur in Wikipedia.
These statistics provide an estimate of the coverage of
each Wikipedia edition in terms of how many entities
will be typed and how many properties from those en-
tities will be extracted. Therefore, they can be used
by each language edition to prioritize properties and
templates with higher impact on the coverage.
The mapping statistics have also been used as a way
to promote a healthy competition between language
editions. A sprint page was created with bar charts that
show how close each language is to achieving to-
tal coverage (as shown in Figure 5), and line charts
showing the progress over time highlighting when one
language is overtaking another in their race for higher
coverage. The mapping sprints have served as a great
motivator for the crowd-sourcing efforts, as can be
noted from the increase in the number of mapping con-
tributions in the weeks leading to a release.
4. Live Synchronisation
Wikipedia articles are continuously revised at a very
high rate, e.g. the English Wikipedia, in June 2013,
had approximately 3.3 million edits per month, which is equal to 77 edits per minute (http://stats.wikimedia.org/EN/SummaryEN.htm). This high change
frequency leads to DBpedia data quickly being out-
dated, which in turn leads to the need for a methodology
to keep DBpedia in synchronisation with Wikipedia.
As a consequence, the DBpedia Live system was de-
veloped, which works on a continuous stream of up-
dates from Wikipedia and processes that stream on the
fly [20,36]. It allows extracted data to stay up-to-date
with a small delay of at most a few minutes. Since the
English Wikipedia is the largest among all Wikipedia
editions with respect to the number of articles and the
number of edits per month, it was the first language
DBpedia Live supported (http://live.dbpedia.org). Meanwhile, DBpedia Live for Dutch (http://live.nl.dbpedia.org) was developed.
4.1. DBpedia Live System Architecture
In order for live synchronisation to be possible, we
need access to the changes made in Wikipedia. The
Wikimedia foundation kindly provided us access to
their update stream using the OAI-PMH protocol [27].
This protocol allows a programme to pull page updates
in XML via HTTP. A Java component, serving as a
proxy, constantly retrieves new updates and feeds them
to the DBpedia Live framework. This proxy is nec-
essary to decouple the stream from the framework to
simplify maintenance of the software. The live extrac-
tion workflow uses this update stream to extract new
knowledge upon relevant changes in Wikipedia articles.
The overall architecture of DBpedia Live is indicated
in Figure 8. The major components of the system are
as follows:
Fig. 8. Overview of DBpedia Live extraction framework.
– Local Wikipedia Mirror: A local copy of a Wikipedia language edition is installed which is kept in real-time synchronisation with its live version using the OAI-PMH protocol. Keeping a local Wikipedia mirror allows us to exceed any access limits posed by Wikipedia.
– Mappings Wiki: The DBpedia Mappings Wiki, described in Section 2.4, serves as secondary input source. Changes of the mappings wiki are also consumed via an OAI-PMH stream. Note that a single mapping change can affect a high number of DBpedia resources.
– DBpedia Live Extraction Manager: This is the core component of the DBpedia Live extraction architecture. The manager takes feeds of pages for re-processing as input and applies all the enabled extractors. After processing a page, the extracted triples are a) inserted into a backend triple store (in our case Virtuoso [10]), updating the old triples, and b) saved as changesets into a compressed N-Triples file structure.
– Synchronisation Tool: This tool allows third parties to keep DBpedia Live mirrors up-to-date by harvesting the produced changesets.
4.2. Features of DBpedia Live
The core components of the DBpedia Live Extraction
framework provide the following features:
Mapping-Affected Pages: The update of all pages
that are affected by a mapping change.
Unmodified Pages: The update of unmodified
pages at regular intervals.
Changesets Publication: The publication of triple-
changesets.
Synchronisation Tool: A synchronisation tool for
harvesting updates to DBpedia Live mirrors.
Data Isolation: Separation of data from different sources.
Mapping-Affected Pages: Whenever an infobox map-
ping change occurs, all the Wikipedia pages that use
that infobox are reprocessed. Taking Figure 2 as an
example, if a new property mapping is introduced (e.g. dbo:translator) or an existing one (e.g. dbo:illustrator) is
updated or deleted, then all entities belonging to the
class dbo:Book are reprocessed. Thus, upon a mapping
change, we identify all the affected Wikipedia pages
and feed them for reprocessing.
Unmodified Pages: Extraction framework improve-
ments or activation / deactivation of DBpedia extractors
might never be applied to rarely modified pages. To
overcome this problem, we obtain a list of the pages
which have not been processed over a period of time
(30 days in our case) and feed that list to the DBpedia
Live extraction framework for reprocessing. This feed
has a lower priority than the update or the mapping
affected pages feed and ensures that all articles reflect a
recent state of the output of the extraction framework.
Publication of Changesets: Whenever a Wikipedia
article is processed, we get two disjoint sets of triples. A
set for the added triples, and another set for the deleted
triples. We write those two sets into N-Triples files,
compress them, and publish the compressed files as
changesets. If another DBpedia Live mirror wants to
synchronise with the DBpedia Live endpoint, it can just
download those files, decompress and integrate them.
Synchronisation Tool: The synchronisation tool en-
ables a DBpedia Live mirror to stay in synchronisation
with our live endpoint. It downloads the changeset files
sequentially, decompresses them and updates the target
SPARQL endpoint via insert and delete operations.
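A minimal Python sketch of this synchronisation step is shown below; it assumes that the added and removed triples of one changeset have already been downloaded and decompressed into strings, and it posts them to a SPARQL 1.1 Update endpoint over HTTP. The endpoint URL and file names in the usage comment are placeholders.

import urllib.request
import urllib.parse

def apply_changeset(endpoint_url, added_ntriples, removed_ntriples):
    """Apply one changeset to a SPARQL 1.1 Update endpoint.

    `added_ntriples` / `removed_ntriples` are strings in N-Triples syntax,
    e.g. the content of a decompressed added/removed changeset file pair."""
    update = ""
    if removed_ntriples.strip():
        update += "DELETE DATA { %s };\n" % removed_ntriples
    if added_ntriples.strip():
        update += "INSERT DATA { %s };\n" % added_ntriples
    data = urllib.parse.urlencode({"update": update}).encode("utf-8")
    request = urllib.request.Request(endpoint_url, data=data, method="POST")
    with urllib.request.urlopen(request) as response:
        return response.status

# Hypothetical usage against a local mirror:
# apply_changeset("http://localhost:8890/sparql",
#                 open("000001.added.nt").read(),
#                 open("000001.removed.nt").read())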
Data Isolation: In order to keep the data isolated,
DBpedia Live keeps different sources of data in
different SPARQL graphs. Data from the article
update feeds are contained in the graph with the URI http://live.dbpedia.org, static data (i.e. links to the LOD cloud) are kept in http://static.dbpedia.org, and the DBpedia ontology is stored in http://dbpedia.org/ontology. All data is also accessible under the http://dbpedia.org graph for combined queries. Next versions of DBpe-
dia Live will also separate data from the raw infobox
extraction and mapping-based infobox extraction.
5. Interlinking
DBpedia is interlinked with numerous external data
sets following the Linked Data principles. In this sec-
tion, we give an overview of the number and types
of outgoing links that point from DBpedia into other
data sets, as well as the external data sets that set links
pointing to DBpedia resources.
5.1. Outgoing Links
Similar to the DBpedia ontology, DBpedia also fol-
lows a community approach for adding links to other
third party data sets. The DBpedia project maintains
a link repository (https://github.com/dbpedia/dbpedia-links) for which conventions for adding
linksets and linkset metadata are defined. The adher-
ence to those guidelines is supervised by a linking com-
mittee. Linksets which are added to the repository are
used for the subsequent official DBpedia release as well
as for DBpedia Live. Table 5 lists the linksets created
by the DBpedia community as of April 2013. The first
column names the data set that is the target of the links.
The second and third column contain the predicate that
is used for linking as well as the overall number of links
that is set between DBpedia and the external data set.
The last column names the tool that was used to gener-
ate the links. The value S refers to Silk, L to LIMES,
C to custom script and a missing entry means that the
dataset is copied from the previous releases and not
regenerated.
An example of the usage of links is the combination of data about European Union project funding (FTS) [32] and data about countries in DBpedia. The query below compares funding per year (from FTS) and country with the gross domestic product of a country (from DBpedia). It can be executed against the endpoint http://fts.publicdata.eu/sparql; results are available at http://bit.ly/1c2mIwQ.
SELECT * { {
  SELECT ?ftsyear ?ftscountry (SUM(?amount) AS ?funding) {
    ?com rdf:type fts-o:Commitment .
    ?com fts-o:year ?year .
    ?year rdfs:label ?ftsyear .
    ?com fts-o:benefit ?benefit .
    ?benefit fts-o:detailAmount ?amount .
    ?benefit fts-o:beneficiary ?beneficiary .
    ?beneficiary fts-o:country ?country .
    ?country owl:sameAs ?ftscountry .
  } } {
  SELECT ?dbpcountry ?gdpyear ?gdpnominal {
    ?dbpcountry rdf:type dbo:Country .
    ?dbpcountry dbp:gdpNominal ?gdpnominal .
    ?dbpcountry dbp:gdpNominalYear ?gdpyear .
  } }
  FILTER ((?ftsyear = str(?gdpyear)) &&
          (?ftscountry = ?dbpcountry)) }
Table 5
Data sets linked from DBpedia
Data set Predicate Count Tool
Amsterdam Museum owl:sameAs 627 S
BBC Wildlife Finder owl:sameAs 444 S
Book Mashup rdf:type, owl:sameAs 9 100
Bricklink dc:publisher 10 100
CORDIS owl:sameAs 314 S
Dailymed owl:sameAs 894 S
DBLP Bibliography owl:sameAs 196 S
DBTune owl:sameAs 838 S
Diseasome owl:sameAs 2 300 S
Drugbank owl:sameAs 4 800 S
EUNIS owl:sameAs 3 100 S
Eurostat (Linked Stats) owl:sameAs 253 S
Eurostat (WBSG) owl:sameAs 137
CIA World Factbook owl:sameAs 545 S
flickr wrappr dbp:hasPhotoCollection 3 800 000 C
Freebase owl:sameAs 3 600 000 C
GADM owl:sameAs 1 900
GeoNames owl:sameAs 86 500 S
GeoSpecies owl:sameAs 16 000 S
GHO owl:sameAs 196 L
Project Gutenberg owl:sameAs 2 500 S
Italian Public Schools owl:sameAs 5 800 S
LinkedGeoData owl:sameAs 103 600 S
LinkedMDB owl:sameAs 13 800 S
MusicBrainz owl:sameAs 23 000
New York Times owl:sameAs 9 700
OpenCyc owl:sameAs 27 100 C
OpenEI (Open Energy) owl:sameAs 678 S
Revyu owl:sameAs 6
Sider owl:sameAs 2 000 S
TCMGeneDIT owl:sameAs 904
UMBEL rdf:type 896 400
US Census owl:sameAs 12 600
WikiCompany owl:sameAs 8 300
WordNet dbp:wordnet type 467 100
YAGO2 rdf:type 18 100 000
Sum 27 211 732
In addition to providing outgoing links on the instance level, DBpedia also sets links on the schema level, pointing from the DBpedia ontology to equivalent terms in other schemas. Links to other schemata can be set by the community within the DBpedia Mappings Wiki by using owl:equivalentClass in class templates and owl:equivalentProperty in datatype or object property templates, respectively. In particular, in 2011 Google, Microsoft, and Yahoo! announced their collaboration on Schema.org, a collection of vocabularies for marking up content on web pages. The DBpedia 3.8 ontology contains 45 equivalent class and 31 equivalent property links pointing to http://schema.org terms.
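For illustration, such schema-level links can be retrieved from the public endpoint with a query along the following lines (a sketch added here, assuming that filtering on the Schema.org namespace is an adequate way to select the relevant targets):

PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT ?dbpediaClass ?schemaOrgClass
WHERE {
  # classes in the DBpedia ontology declared equivalent to Schema.org terms
  ?dbpediaClass owl:equivalentClass ?schemaOrgClass .
  FILTER (STRSTARTS(STR(?schemaOrgClass), "http://schema.org/"))
}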
5.2. Incoming Links
DBpedia is being linked to from a variety of data sets. The overall number of incoming links, i.e. links pointing to DBpedia from other data sets, is 39,007,478 according to the Data Hub.19 However, those counts are entered by users and may not always be valid and up-to-date.
In order to identify actually published and online data sets that link to DBpedia, we used Sindice [39]. The Sindice project crawls RDF resources on the web and indexes those resources. In Sindice, a data set is defined by the second-level domain name of the entity’s URI, e.g. all resources available at the domain fu-berlin.de are considered to belong to the same data set. A triple is considered to be a link if the data sets of subject and object are different. Furthermore, the Sindice data we used for the analysis only considers authoritative entities: the data set of the subject of a triple must match the domain it was retrieved from, otherwise it is not considered. Sindice computes a graph summary [8] over all resources it stores. With the help of the Sindice team, we examined this graph summary to obtain all links pointing to DBpedia. As shown in Table 7, Sindice knows about 248 data sets linking to DBpedia. 70 of those data sets link to DBpedia via owl:sameAs, but other link predicates are also very common, as is evident from that table. In total, Sindice has indexed 4 million links pointing to DBpedia. Table 6 lists the 10 data sets which set the most links to DBpedia.
It should be noted that the data in Sindice is not complete; for instance, it does not contain all data sets that are catalogued by the DataHub20. On the other hand, it crawls RDFa snippets and converts microformats, which are not captured by the DataHub. Despite this incompleteness, the relative comparison of different datasets can still give us insights. Therefore, we analysed the link structure of all Sindice datasets using the Sindice cluster. Table 8 shows the datasets with the most incoming links. Those are authorities in the network structure of the
19 See http://wiki.dbpedia.org/Interlinking for details.
20 http://datahub.io/
Table 6
Top 10 data sets in Sindice ordered by the number of links to DBpedia.
Data set Link Predicate Count Link Count
okaboo.com 4 2,407,121
tfri.gov.tw 57 837,459
naplesplus.us 2 212,460
fu-berlin.de 7 145,322
freebase.com 108 138,619
geonames.org 3 125,734
opencyc.org 3 19,648
geospecies.org 10 16,188
dbrec.net 3 12,856
faviki.com 5 12,768
Table 7
Sindice summary statistics for incoming links to DBpedia.
Metric Value
Total links: 3,960,212
Total distinct data sets: 248
Total distinct predicates: 345
Table 8
Top 10 datasets by incoming links in Sindice.
domain datasets links
purl.org 498 6,717,520
dbpedia.org 248 3,960,212
creativecommons.org 2,483 3,030,910
identi.ca 1,021 2,359,276
l3s.de 34 1,261,487
rkbexplorer.com 24 1,212,416
nytimes.com 27 1,174,941
w3.org 405 658,535
geospecies.org 13 523,709
livejournal.com 14,881 366,025
web of data and DBpedia is currently ranked second in
terms of incoming links.
6. DBpedia Usage Statistics
DBpedia is served on the web in three forms: First,
it is provided in the form of downloadable data sets
where each data set contains the results of one of the
extractors listed in Table 1. Second, DBpedia is served
via a public SPARQL endpoint and, third, it provides dereferenceable URIs according to the Linked Data principles. In this section, we explore some of the statistics gathered during the hosting of DBpedia over the last two years.
Fig. 9. The download count (in thousands) and download volume (in TB) of the English language edition of DBpedia, per quarter of 2012.
6.1. Download Statistics for the DBpedia Data Sets
DBpedia covers more than 100 languages, but those languages vary with respect to their download popularity as well. The top five languages with respect to download volume are English, Chinese, German, Catalan, and French, respectively. The download count and download volume of the English language edition are shown in Figure 9. To host the DBpedia data set downloads, a bandwidth of approximately 6 TB per month is currently needed.
Furthermore, DBpedia consists of several data sets which vary with respect to their download popularity. The download count and the download volume of each data set during the year 2012 are depicted in Figure 10. In those statistics we filtered out all IP addresses that requested a file more than 1,000 times per month.21 Pagelinks are the most downloaded data set, although they are not semantically rich, as they do not reveal which type of link exists between two resources. Presumably, they are used for network analysis or for providing relevant links in user interfaces, and they are downloaded more often because they are not provided via the official SPARQL endpoint.
6.2. Public Static DBpedia SPARQL Endpoint
The main public DBpedia SPARQL endpoint22 is hosted using the Virtuoso Universal Server (Enterprise Edition) version 6.4 software in a 4-node cluster configuration. This cluster setup provides parallelization of query execution, even when the cluster nodes are on the same machine, as splitting a query over several nodes allows better use of parallel threads on modern
multi-core CPUs on standard commodity hardware.
Virtuoso supports horizontal scale-out, either by re-
distributing the existing cluster nodes onto multiple ma-
21 The IP address was only filtered for that specific file and month in those cases.
22 http://dbpedia.org/sparql
Fig. 10. The download count and download volume (in GB) of the DBpedia data sets.
Table 9
Hardware of the machines serving the public SPARQL endpoint.
DBpedia Configuration
3.3 - 3.4 AMD Opteron 8220 2.80Ghz, 4 Cores, 32GB
3.5 - 3.7 Intel Xeon E5520 2.27Ghz, 8 Cores, 48GB
3.8 Intel Xeon E5-2630 2.30GHz, 8 Cores, 64GB
chines, or by adding several separate clusters with a
round robin HTTP front-end. This allows the cluster
setup to grow in line with desired response times for
an RDF data set collection. As the size of the DBpedia
data set increased and its use by the Linked Data com-
munity grew, the project migrated to increasingly pow-
erful, but still moderately priced, hardware as shown in
Table 9. The Virtuoso instance is configured to process queries within a 1,200 second timeout window and a maximum result set size of 50,000 rows. It provides OFFSET and LIMIT support for paging alongside the ability to produce partial results.
The log files used in the following analysis excluded
traffic generated by:
1. clients that have been temporarily rate limited after a burst period,
2. clients that have been banned after misuse,
3. applications, spiders and other crawlers that are blocked after frequently hitting the rate limit or generally use too many resources.
Virtuoso supports HTTP Access Control Lists
(ACLs) which allow the administrator to rate limit cer-
tain IP addresses or whole IP ranges. A maximum num-
ber of requests per second (currently 15) as well as
a bandwidth limit per request (currently 10MB) are
enforced. If the client software can handle compression, replies are compressed to further save bandwidth.

Table 11
Number of unique sites accessing DBpedia endpoints.
DBpedia   Avg/Day   Median   Stdev   Maximum
3.3       5,824     5,977    1,046   7,865
3.4       6,656     6,704    947     8,312
3.5       9,392     9,432    1,643   12,584
3.6       10,810    10,810   1,612   14,798
3.7       17,077    17,091   6,782   33,288
3.8       14,513    14,493   2,783   20,510
Exception rules can be configured for multiple clients
hidden behind a NAT firewall (appearing as a single IP
address) or for temporary requests for higher rate limits.
When a client hits an ACL limit, the system reports an appropriate HTTP status code23 like 509 ("Bandwidth Limit Exceeded") and quickly drops the connection. The system further uses an iptables-based firewall for permanent blocking of clients identified by their IP addresses.
6.3. Public Static Endpoint Statistics
The statistics presented in this section were extracted from reports generated by Webalizer v2.21.24 Table 10 and Table 11 show various DBpedia SPARQL endpoint usage statistics for the DBpedia 3.3 to 3.8 releases. Note that the usage of all endpoints mentioned in Table 12 is counted. The Avg/Day column represents the average number of hits (resp. visits/sites) per day, followed by
23 http://en.wikipedia.org/wiki/List_of_HTTP_status_codes
24 http://www.webalizer.org
Table 10
Number of endpoint hits (top) and visits (bottom).

Hits:
DBpedia   Avg/Day      Median       Stdev        Maximum
3.3       733,811      711,306      188,991      1,319,539
3.4       1,212,549    1,165,893    351,226      2,371,657
3.5       1,122,612    1,035,444    386,454      2,908,720
3.6       1,328,355    1,286,750    256,945      2,495,031
3.7       2,085,399    1,930,728    1,057,398    8,873,473
3.8       2,910,410    2,717,775    1,085,640    7,678,490

Visits:
DBpedia   Avg/Day   Median   Stdev    Maximum
3.3       9,750     9,775    2,036    13,126
3.4       11,329    11,473   1,546    14,198
3.5       16,592    16,902   2,969    23,129
3.6       19,471    17,664   5,691    56,064
3.7       23,972    22,262   10,741   127,869
3.8       16,851    16,711   2,960    27,416
the Median and Standard Deviation. The last column
shows the maximum number of hits (resp. visits/sites)
that was recorded on a single day for each data set
version. Visits (i.e. sessions of subsequent queries from
the same client) are determined by a floating 30 minute
time window. All requests from behind a NAT firewall
are logged under the same external IP address and are
therefore counted towards the same visit if they occur
within the 30 minute interval.
Table 10 shows the increasing popularity of DBpedia.
There is a distinct dip in hits to the SPARQL endpoint in
DBpedia 3.5, which is partially due to more strict initial
limits for bot-related traffic which were later relaxed.
The sudden drop of visits between the 3.7 and the 3.8
data sets can be attributed to:
1. applications starting to use their own private DBpedia endpoint,
2. blocking of apps that were abusing the DBpedia endpoint,
3. uptake of the language-specific DBpedia endpoints and DBpedia Live.
6.4. Query Types and Trends
The DBpedia server is not only a SPARQL endpoint,
but also serves as a Linked Data Hub returning re-
sources in a number of different formats. For each data set we randomly selected 14 days' worth of log files and processed those in order to show the various services called. Table 12 shows the number of hits to the various
endpoints.
The /resource endpoint uses the Accept: line in the HTTP header sent by the client to return an HTTP 30x status code that redirects the client to either the /page (HTML-based) or /data (formats like RDF/XML or Turtle) equivalent of the article. Clients also frequently mint their own URLs for the /page or /data version of an article, or download the raw data directly. This explains why the count of /page and /data hits in the table is larger than the number of hits on the
Table 12
Hits per service to http://dbpedia.org in thousands.
Endpoint 3.3 3.4 3.5 3.6 3.7 3.8
/data 1,355 2,681 2,230 2,547 3,714 4,246
/ontology 80 168 142 178 116 97
/page 2,634 4,703 1,810 1,687 3,787 7,377
/property 231 311 137 176 176 128
/resource 2,976 4,080 2,554 2,309 4,436 7,143
/sparql 2,090 4,177 2,266 9,112 15,902 15,475
other 252 420 342 277 579 695
total 9,619 16,541 9,434 16,286 28,710 35,142
Fig. 11. Traffic to the Linked Data service versus the SPARQL endpoint.
/resource endpoint. The /ontology and /property end-
points return meta information about the DBpedia on-
tology. While all of these endpoints themselves may
use SPARQL queries to generate various page content,
these requests are handled by the internal Virtuoso en-
gine directly and do not show up as extra calls to the
/sparql endpoint in our analysis.
Figure 11 shows the percentages of traffic hits that
were generated by the main endpoints. As we can see,
the usage of the SPARQL endpoint has doubled from
about 22 percent in 2009 to about 44 percent in 2013.
However, this still means that 56 percent of traffic hits
are directed to the Linked Data service.
In Table 13, we focussed on the calls to the /sparql
endpoint and counted the number of statements per type.
Table 13
Hits per statement type in thousands.
Statement 3.3 3.4 3.5 3.6 3.7 3.8
ask 56 269 360 326 159 677
construct 55 31 14 11 22 38
describe 11 8 4 7 62 111
select 1891 3663 1658 8030 11204 13516
unknown 78 206 229 738 4455 1134
total 2090 4177 2266 9112 15902 15475
Table 14
Trends in SPARQL select (rounded values in %).
Statement 3.3 3.4 3.5 3.6 3.7 3.8
distinct 19.5 11.4 17.3 19.4 13.3 25.4
filter 45.7 13.7 31.8 25.3 29.2 36.1
functions 8.8 6.9 23.5 21.3 25.5 25.9
geo 27.7 7.0 39.6 6.7 9.3 9.2
group 0.0 0.0 0.0 0.0 0.0 0.2
limit 4.6 6.5 11.6 10.5 7.7 7.8
optional 30.6 23.8 47.3 26.3 16.7 17.2
order 2.3 1.3 1.9 3.2 1.2 1.6
union 3.3 3.2 26.4 11.3 12.1 20.6
As the log files only record the full SPARQL query on
a GET request, all the POST requests are counted as
unknown.
Finally, we analyzed each SPARQL query and counted the use of keywords and constructs like:
– DISTINCT
– FILTER
– FUNCTIONS like CONCAT, CONTAINS, ISIRI
– use of GEO objects
– GROUP BY
– LIMIT / OFFSET
– OPTIONAL
– ORDER BY
– UNION
For the GEO objects we counted the use of SPARQL
PREFIX geo: and wgs84*: declarations and usage in
property tags. Table 14 shows the use of various key-
words as a percentage of the total select queries made
to the /sparql endpoint for the sample sets. In general,
we observed that queries became more complex over
time indicating an increasing maturity and higher ex-
pectations of the user base.
6.5. Statistics for DBpedia Live
Since its official release at the end of June 2011,
DBpedia Live attracted a steadily increasing number
of users. Furthermore, more users tend to use the syn-
chronisation tool to synchronise their own DBpedia
Live mirrors. This leads to an increasing number of live update requests, i.e. changeset downloads. Figure 12 indicates the number of daily SPARQL and synchronisation requests sent to the DBpedia Live endpoint in the period between August 2012 and January 2013.
Fig. 12. Number of daily requests sent to the DBpedia Live endpoint for a) SPARQL queries and b) synchronisation requests, from August 2012 until January 2013.
7. Use Cases and Applications
Due to DBpedia’s coverage of various domains as
well as its steady growth as a hub on the Web of Data,
the data sets provided by DBpedia can serve many purposes. Such applications include improving the search and exploration of Wikipedia data, providing data for applications and mashups, as well as text analysis and annotation tools.
7.1. Natural Language Processing
DBpedia can support many tasks in Natural Language Processing (NLP) [33]. For that purpose, DBpedia includes a number of specialized data sets25. For instance, the lexicalizations data set can be used to estimate the ambiguity of phrases, to help select unambiguous identifiers for ambiguous phrases, or to provide alternative names for entities, just to mention a few examples. Topic signatures can be useful in tasks such as query expansion or document summarization, and have
25 http://wiki.dbpedia.org/Datasets/NLP
been successfully employed to classify ambiguously described images as good depictions of DBpedia entities [13]. The thematic concepts data set of resources can be used for creating a corpus from Wikipedia to be used as training data for topic classifiers, among other things (see below). The grammatical gender data set can, for example, be used to add a gender feature in co-reference resolution.
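As a minimal sketch of obtaining alternative names (added here for illustration and using redirect labels as a simpler stand-in for the lexicalizations data set; the choice of dbr:Berlin is arbitrary):

PREFIX dbo:  <http://dbpedia.org/ontology/>
PREFIX dbr:  <http://dbpedia.org/resource/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?surfaceForm
WHERE {
  # titles of pages redirecting to dbr:Berlin serve as alternative surface forms
  ?redirect dbo:wikiPageRedirects dbr:Berlin ;
            rdfs:label ?surfaceForm .
  FILTER (lang(?surfaceForm) = "en")
}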
7.1.1. Annotation: Entity Disambiguation
An important use case for NLP is annotating texts
or other content with semantic information. Named
entity recognition and disambiguation – also known as
key phrase extraction and entity linking tasks – refers
to the task of finding real world entities in text and
linking them to unique identifiers. One of the main
challenges in this regard is ambiguity: an entity name,
or surface form, may be used in different contexts to
refer to different concepts. Many different methods
have been developed to resolve this ambiguity with
fairly high accuracy [22].
As DBpedia reflects a vast amount of structured real
world knowledge obtained from Wikipedia, DBpedia
URIs can be used as identifiers for the majority of do-
mains in text annotation. Consequently, interlinking text
documents with Linked Data enables the Web of Data
to be used as background knowledge within document-
oriented applications such as semantic search or faceted
browsing (cf. Section 7.3).
Many applications performing this task of annotating
text with entities in fact use DBpedia entities as targets.
For example, DBpedia Spotlight [34] is an open source tool26 including a free web service that detects mentions of DBpedia resources in text. It uses the lexicalizations in conjunction with the topic signatures data set as a context model in order to be able to disambiguate found mentions. The main advantage of this system is its comprehensiveness and flexibility, allowing one to configure it based on quality measures such as prominence, contextual ambiguity, topical pertinence and disambiguation confidence, as well as the DBpedia ontology. The resources that should be annotated can be specified by a list of resource types or by more complex relationships within the knowledge base, described as SPARQL queries.
There are numerous other NLP APIs that link entities in text to DBpedia: AlchemyAPI27, Semantic API from Ontos28, Open Calais29 and Zemanta30, among others. Furthermore, the DBpedia ontology has been used for training named entity recognition systems (without disambiguation) in the context of the Apache Stanbol project31.
26 http://spotlight.dbpedia.org/
27 http://www.alchemyapi.com/
A related project is ImageSnippets32, which is a system for annotating images. It uses DBpedia as one of its main datasets for unambiguously identifying entities depicted within an image.
Tag disambiguation. Similar to linking entities in text to DBpedia, user-generated tags attached to multimedia content such as music, photos or videos can also be connected to the Linked Data hub. This has previously been implemented by letting the user resolve ambiguities. For example, Faviki33 suggests a set of DBpedia entities coming from Zemanta’s API and lets the user choose the desired one. Alternatively, similar disambiguation techniques as mentioned above can be utilized to choose entities from tags automatically [14]. The BBC34 [23] employs DBpedia URIs for tagging their programmes. Short clips and full episodes are tagged using two different tools while utilizing DBpedia to benefit from global identifiers that can be easily integrated with other knowledge bases.
7.1.2. Question Answering
DBpedia provides a wealth of human knowledge across different domains and languages, which makes it an excellent target for question answering and keyword search approaches. One of the most prominent efforts in this area is the DeepQA project, which resulted in the IBM Watson system [12]. The Watson system won a $1 million prize in Jeopardy! and relies on several data sets including DBpedia35. DBpedia is also the primary target for several QA systems in the Question Answering over Linked Data (QALD) workshop series36. Several QA systems, such as TBSL [44], PowerAqua [31], FREyA [9] and QAKiS [6], have been applied to DBpedia using the QALD benchmark questions. DBpedia is interesting as a test case for such
28 http://www.ontos.com/
29 http://www.opencalais.com/
30 http://www.zemanta.com/
31 http://stanbol.apache.org/
32 http://www.imagesnippets.com
33 http://www.faviki.com/
34 http://bbc.co.uk
35 http://www.aaai.org/Magazine/Watson/watson.php
36 http://greententacle.techfak.uni-bielefeld.de/~cunger/qald/
systems. Due to its large schema and data size as well
as its topic diversity, it provides significant scientific
challenges. In particular, it would be difficult to provide
capable QA systems for DBpedia based only on simple
patterns or via domain specific dictionaries, because of
its size and broad coverage. Therefore, a question answering system that is able to reliably answer questions over DBpedia could be seen as a truly intelligent system. In the latest QALD series, question
answering benchmarks also exploit national DBpedia
chapters for multilingual question answering.
Similarly, the slot filling task in natural language
processing poses the challenge of finding values for a
given entity and property from mining text. This can
be viewed as question answering with static questions
but changing targets. DBpedia can be exploited for fact
validation or training data in this task, as was done by
the Watson team [4] and others [28].
7.2. Digital Libraries and Archives
In the case of libraries and archives, DBpedia could offer information on a broad range of domains. In particular, DBpedia could provide:
– Context information for bibliographic and archive records: Background information such as an author’s demographics, a film’s homepage or an image could be used to enhance user interaction.
– Stable and curated identifiers for linking: DBpedia is a hub of Linked Open Data. Thus, (re-)using commonly used identifiers could ease integration with other libraries or knowledge bases.
– A basis for a thesaurus for subject indexing: The broad range of Wikipedia topics in addition to the stable URIs could form the basis for a global classification system.
Libraries have already invested both in Linked Data and in Wikipedia (and transitively in DBpedia) through the realization of the Virtual International Authority File (VIAF) project.37 Recently, it was announced that VIAF added a total of 250,000 reciprocal authority links to Wikipedia.38 These links are already harvested by DBpedia Live and will also be included in the next static DBpedia release. This creates a huge opportunity for libraries that use VIAF to get connected to DBpedia and the LOD cloud in general.
37 http://viaf.org
38 Accessed on 12/02/2013: http://www.oclc.org/research/news/2012/12-07a.html
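To sketch how such VIAF links could be consumed once they are available in DBpedia (a hedged assumption about the eventual owl:sameAs representation, not an official query), VIAF identifiers attached to DBpedia resources could be retrieved as follows:

PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT ?dbpediaResource ?viafId
WHERE {
  # owl:sameAs links whose targets lie in the VIAF namespace
  ?dbpediaResource owl:sameAs ?viafId .
  FILTER (STRSTARTS(STR(?viafId), "http://viaf.org/viaf/"))
}
LIMIT 100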
7.3. Knowledge Exploration
Since DBpedia spans many domains and has a di-
verse schema, many knowledge exploration tools either
used DBpedia as a testbed or were specifically built
for DBpedia. We give a brief overview of tools and
structure them in categories:
Facet Based Browsers. An award-winning39 facet-based browser used the Neofonie search engine to combine facts in DBpedia with full-text from Wikipedia in order to compute hierarchical facets [15]. Another facet-based browser, which allows users to create complex graph structures of facets in a visually appealing interface and to filter them, is gFacet [16]. A generic SPARQL-based facet explorer, which also uses a graph-based visualisation of facets, is LODLive [7]. The OpenLink built-in facet-based browser40 is an interface that enables developers to explore DBpedia, compute aggregations over facets and view the underlying SPARQL queries.
Search and Querying. The DBpedia Query Builder41 allows developers to easily create simple SPARQL queries, more specifically sets of triple patterns, via intelligent autocompletion. The autocompletion functionality ensures that only URIs that lead to solutions are suggested to the user. The RelFinder [17] tool provides an intuitive interface for exploring the neighborhood of and connections between resources specified by the user. For instance, the user can view the shortest paths connecting certain persons in DBpedia. SemLens [18] allows users to create statistical analysis queries and explore correlations in RDF data and DBpedia in particular.
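As a hedged sketch of the kind of query such exploration tools issue internally (an illustration added here, not the actual RelFinder implementation; the two persons and the restriction to paths of length two are arbitrary choices):

PREFIX dbr: <http://dbpedia.org/resource/>
SELECT DISTINCT ?intermediate ?p1 ?p2
WHERE {
  # resources lying on a directed path of length two between the two persons
  dbr:Albert_Einstein ?p1 ?intermediate .
  ?intermediate ?p2 dbr:Niels_Bohr .
}
LIMIT 50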
Spatial Applications. DBpedia Mobile [3] is a location-aware client, which renders a map of nearby locations from DBpedia, provides icons for schema classes and supports more than 30 languages from various DBpedia language editions. It can follow RDF links to other data sets linked from DBpedia and supports powerful SPARQL filters to restrict the viewed data.
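The following SPARQL sketch (an illustration of such a spatial filter added here, not code taken from DBpedia Mobile; the coordinate bounds are example values around central Berlin) selects nearby DBpedia resources:

PREFIX geo:  <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?place ?label ?lat ?long
WHERE {
  ?place geo:lat ?lat ;
         geo:long ?long ;
         rdfs:label ?label .
  # bounding-box filter plus an English label restriction
  FILTER (?lat > 52.50 && ?lat < 52.54 &&
          ?long > 13.35 && ?long < 13.42 &&
          lang(?label) = "en")
}
LIMIT 100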
7.4. Applications of the Extraction Framework:
Wiktionary Extraction
Wiktionary is one of the biggest collaboratively cre-
ated lexical-semantic and linguistic resources, avail-
39 http://blog.dbpedia.org/2009/11/20/german-government-proclaims-faceted-wikipedia-search-one-of-the-365-best-ideas-in-germany/
40 http://dbpedia.org/fct/
41 http://querybuilder.dbpedia.org/
able in 171 languages (of which approximately 147 can be considered active42). It contains information about hundreds of spoken and even ancient languages. In the case of the English Wiktionary, there are nearly 3 million detailed descriptions of words covering several domains43. Such descriptions provide, for a lexical word, a hierarchical disambiguation to its language, part of speech, sometimes etymologies, synonyms, hyponyms, hyperonyms, example sentences, and most prominently senses.
Due to its fast-changing nature, together with the fragmentation of the project into Wiktionary language editions (WLE) with independent layout rules, a configurable mediator/wrapper approach is taken for its automated transformation into a structured knowledge base. The workflow of this dedicated Wiktionary extractor, which is part of the Wiktionary2RDF [19] project, is as follows: For every WLE to be transformed, an XML configuration file is provided as input. This configuration is used by the Wiktionary extractor, invoked by the DBpedia extraction framework, to first generate a schema reflecting the configured page structure (wrapper part). After this, these language-specific schemas are converted to a global schema (mediator part) and later serialized to RDF.
To enable non-programmers (the community of adopters and domain experts) to tailor and maintain the WLE wrappers themselves, a simple XML dialect was created to encode the page structure to be parsed and to declare triple patterns that define how the resulting RDF should be built. The described setup is run against Wiktionary dumps. The resulting data set is open in every aspect and hosted as Linked Data.44 Statistics are shown in Table 15.
8. Related Work
8.1. Cross Domain Community Knowledge Bases
8.1.1. Wikidata
In March 2012, Wikimedia Germany e.V. started the development of Wikidata45. Wikidata is a free knowledge base about the world that can be read and
42 http://s23.org/wikistats/wiktionaries_html.php
43 See http://en.wiktionary.org/wiki/semantic for a simple example page.
44 http://wiktionary.dbpedia.org/
45 http://wikidata.org/
edited by humans and machines alike. It provides data
in all languages of the Wikimedia projects, and allows
for central access to the data in a similar vein as Wiki-
media Commons does for multimedia files. Things de-
scribed in the Wikidata knowledge base are called items
and can have labels, descriptions and aliases in all lan-
guages. Wikidata does not aim at offering a single truth about things, but at providing statements given in a particular context. Rather than stating that Berlin has a
population of 3.5 million, Wikidata contains the state-
ment about Berlin’s population being 3.5 million as of
2011 according to the German statistical office. Thus,
Wikidata can offer a variety of statements from differ-
ent sources and dates. As there are potentially many
different statements for a given item and property, ranks
can be added to statements to define their status (pre-
ferred, normal or deprecated). The initial development was divided into three phases:
– The first phase (interwiki links) created an entity base for the Wikimedia projects. This provides a better alternative to the previous interlanguage link system.
– The second phase (infoboxes) gathered infobox-related data for a subset of the entities, with the explicit goal of augmenting the infoboxes that are currently widely used with data from Wikidata.
– The third phase (lists) will expand the set of properties beyond those related to infoboxes, and will provide ways of exploiting this data within and outside the Wikimedia projects.
At the time of writing of this article, the development
of the third phase is ongoing.
Wikidata already contains 11.95 million items and 348 properties that can be used to describe them. Since March 2013, the Wikidata extension has been live on all Wikipedia language editions; thus, their pages can be linked to items in Wikidata and can include data from Wikidata.
Wikidata also offers a Linked Data interface46 as well as regular RDF dumps of all its data. The planned collaboration with Wikidata is outlined in Section 9.
8.1.2. Freebase
Freebase47 is a graph database, which also extracts structured data from Wikipedia and makes it available in RDF. Both DBpedia and Freebase link to each other
46 http://meta.wikimedia.org/wiki/Wikidata/Development/LinkedDataInterface
47 http://www.freebase.com/
Table 15
Statistical comparison of extractions for different languages.
language #words #triples #resources #predicates #senses
en 2,142,237 28,593,364 11,804,039 28 424,386
fr 4,657,817 35,032,121 20,462,349 22 592,351
ru 1,080,156 12,813,437 5,994,560 17 149,859
de 701,739 5,618,508 2,966,867 16 122,362
and provide identifiers based on those for Wikipedia
articles. They both provide dumps of the extracted data,
as well as APIs or endpoints to access the data and
allow their communities to influence the schema of the
data. There are, however, also major differences be-
tween both projects. DBpedia focuses on being an RDF
representation of Wikipedia and serving as a hub on the
Web of Data, whereas Freebase uses several sources to
provide broad coverage. The store behind Freebase is the GraphD [35] graph database, which allows metadata to be stored efficiently for each fact. This graph store is
append-only. Deleted triples are marked and the system
can easily revert to a previous version. This is neces-
sary, since Freebase data can be directly edited by users,
whereas information in DBpedia can only indirectly be
edited by modifying the content of Wikipedia or the
Mappings Wiki. From an organisational point of view,
Freebase is mainly run by Google, whereas DBpedia is
an open community project. In particular in focus areas
of Google and areas in which Freebase includes other
data sources, the Freebase database provides a higher
coverage than DBpedia.
8.1.3. YAGO
One of the projects that pursues similar goals to DBpedia is YAGO48 [42]. YAGO is identical to DBpedia in that each article in Wikipedia becomes an entity in YAGO. Based on this, it uses the leaf categories in the Wikipedia category graph to infer type information about an entity. One of its key features is to link this type information to WordNet. WordNet synsets are represented as classes and the extracted types of entities may become subclasses of such a synset. In the YAGO2 system [21], declarative extraction rules were introduced, which can extract facts from different parts of Wikipedia articles, e.g. infoboxes and categories, as well as other sources. YAGO2 also supports spatial and temporal dimensions for facts at the core of its system.
One of the main differences between DBpedia and
YAGO in general is that DBpedia tries to stay very
close to Wikipedia and provide an RDF version of its
48 http://www.mpi-inf.mpg.de/yago-naga/yago/
content. YAGO focuses on extracting a smaller number
of relations compared to DBpedia to achieve very high
precision and consistent knowledge. The two knowl-
edge bases offer different type systems: whereas the
DBpedia ontology is manually maintained, YAGO is
backed by WordNet and Wikipedia leaf categories.
Due to this, YAGO contains many more classes than
DBpedia. Another difference is that the integration of
attributes and objects in infoboxes is done via mappings
in DBpedia and, therefore, by the DBpedia community
itself, whereas this task is facilitated by expert-designed
declarative rules in YAGO2.
The two knowledge bases are connected, e.g. DBpedia offers the YAGO type hierarchy as an alternative to the DBpedia ontology, and sameAs links are provided in both directions. While the underlying systems are very different, both projects share similar aims and positively complement and influence each other.
8.2. Knowledge Extraction from Wikipedia
Since its official start in 2001, Wikipedia has always
been the target of automatic extraction of information
due to its easy availability, open license and encyclo-
pedic knowledge. A large number of parsers, scraper
projects and publications exist. In this section, we re-
strict ourselves to approaches that are either notable, re-
cent or pertinent to DBpedia. MediaWiki.org maintains
an up-to-date list of software projects
49
, who are able to
process wiki syntax, as well as a list of data extraction
extensions50 for MediaWiki.
JWPL (Java Wikipedia Library, [49]) is an open-source, Java-based API that provides access to information provided by the Wikipedia API (redirects, categories, articles and link structure). JWPL contains a MediaWiki Markup parser that can be used to further analyze the contents of a Wikipedia page. Data is also
49 http://www.mediawiki.org/wiki/Alternative_parsers
50 http://www.mediawiki.org/wiki/Extension_Matrix/data_extraction
provided as XML dump and is incorporated in the lexi-
cal resource UBY51 for language tools.
Several different approaches to extracting knowledge from Wikipedia are presented in [37]. Features such as anchor texts, interlanguage links, category links and redirect pages are utilized, e.g., for word-sense disambiguation, synonyms, translations, taxonomic relations, and abbreviation or hypernym resolution, respectively. Apart from this, link structures are used to build the Wikipedia Thesaurus Web service52. Additional projects that exploit the mentioned features are listed on the Special Interest Group on Wikipedia Mining (SIGWP) Web site53.
An earlier approach to improve the quality of the infobox schemata and contents is described in [47]. The presented methodology encompasses a three-step process of preprocessing, classification and extraction. During preprocessing, refined target infobox schemata are created by applying statistical methods, and training sets are extracted based on real Wikipedia data. After assigning a class and the corresponding target schema (classification), the training sets are used to extract target infobox values from the document’s text by applying machine learning algorithms.
The idea of using structured data from certain markup structures was also applied to other user-driven Web encyclopedias. In [38] the authors describe their effort of building an integrated Chinese Linking Open Data (CLOD) source based on the Chinese Wikipedia and the two widely used and large encyclopedias Baidu Baike54 and Hudong Baike55. Apart from utilizing MediaWiki and HTML Markup for the actual extraction, the Wikipedia interlanguage links were used to link the CLOD source to the English DBpedia.
A more generic approach to achieving better cross-lingual knowledge linkage beyond the use of Wikipedia interlanguage links is presented in [45]. Focusing on wiki knowledge bases, the authors introduce their solution based on structural properties like similar linkage structures, the assignment to similar categories and similar interests of the authors of wiki documents in the considered languages. Since this approach is language-feature-agnostic, it is not restricted to certain languages.
51 http://www.ukp.tu-darmstadt.de/data/lexical-resources/uby/
52 http://sigwp.org/en/index.php/Wikipedia_Thesaurus
53 http://sigwp.org/en/
54 http://baike.baidu.com/
55 http://www.hudong.com/
KnowItAll56 is a web-scale knowledge extraction effort, which is domain-independent and uses generic extraction rules, co-occurrence statistics and Naive Bayes classification [11]. Cyc [29] is a large common sense knowledge base, which is now partially released as OpenCyc and also available as an OWL ontology. OpenCyc is linked to DBpedia, which provides an ontological embedding in its comprehensive structures. WikiTaxonomy [40] is a large taxonomy derived from categories in Wikipedia by classifying categories as instances or classes and deriving a subsumption hierarchy. The KOG system [46] refines existing Wikipedia infoboxes based on machine learning techniques using both SVMs and a more powerful joint-inference approach expressed in Markov Logic Networks. KYLIN [47] is a system which autonomously extracts structured data from Wikipedia and uses self-supervised linking.
9. Conclusions and Future Work
In this system report, we presented an overview of recent advances of the DBpedia community project. The technical innovations described in this article included in particular: (1) the extraction based on the community-curated DBpedia ontology, (2) the live synchronisation of DBpedia with Wikipedia and DBpedia mirrors through update propagation, and (3) the facilitation of the internationalisation of DBpedia. As a result, we demonstrated that in the past four years DBpedia has matured and improved significantly in terms of coverage, usability, and data quality.
With DBpedia, we also aim to provide a proof-of-concept and blueprint for the feasibility of large-scale knowledge extraction from crowd-sourced content repositories. There are a large number of further crowd-sourced content repositories and DBpedia has already had an impact on their structured data publishing and interlinking. Two examples are Wiktionary, whose Wiktionary extraction [19] has meanwhile become part of DBpedia, and LinkedGeoData [41], which aims to implement similar data extraction, publishing and linking strategies for OpenStreetMap.
In the future, we see in particular the following di-
rections for advancing the DBpedia project:
Multilingual data integration and fusion. An area which is still largely unexplored is the integration and fusion between different DBpedia language editions.
56 http://www.cs.washington.edu/research/knowitall/
Non-English DBpedia editions offer better and
different coverage of local culture. When we are able to
precisely identify equivalent, overlapping and comple-
mentary parts in different DBpedia language editions,
we can reach significantly increased coverage. On the
other hand, comparing the values of a specific prop-
erty between different language editions will help us
to spot extraction errors as well as wrong or outdated
information in Wikipedia.
Community-driven data quality improvement. In the
future, we also aim to engage a larger community of
DBpedia users in feedback loops, which help us to
identify data quality problems and corresponding defi-
ciencies of the DBpedia extraction framework. By con-
stantly monitoring the data quality and integrating im-
provements into the mappings to the DBpedia ontology
as well as fixes into the extraction framework, we aim to
demonstrate that the Wikipedia community is not only
capable of creating the largest encyclopedia, but also
the most comprehensive and structured knowledge base.
With the DBpedia quality evaluation campaign [48] we made a first step in this direction.
Inline extraction. Currently DBpedia extracts infor-
mation primarily from templates. In the future, we
envision also extracting semantic information from typed links. Typed links are a feature of Semantic MediaWiki, which was backported and implemented as a very lightweight extension for MediaWiki57. If this
extension is deployed at Wikipedia installations, this
opens up completely new possibilities for more fine-
grained and non-invasive knowledge representations
and extraction from Wikipedia.
Collaboration between Wikidata and DBpedia.
While DBpedia provides a comprehensive and current
view on entity descriptions extracted from Wikipedia,
Wikidata offers a variety of factual statements from
different sources and dates. One of the richest sources
of DBpedia are Wikipedia infoboxes, which are struc-
tured but at the same time heterogeneous and non-
standardized (thus making the extraction error prone in
certain cases). The aim of Wikidata is to populate in-
foboxes automatically from a centrally managed, high-
quality fact database. In this regard, both projects com-
plement each other and there are several ongoing col-
laboration activities. In future versions, DBpedia will
include more raw data provided by Wikidata and add
57 http://www.mediawiki.org/wiki/Extension:LightweightRDFa
services such as Linked Data/SPARQL endpoints, RDF
dumps, linking and ontology mapping for Wikidata.
Feedback for Wikipedia. A promising prospect is that
DBpedia can help to identify misrepresentations, errors
and inconsistencies in Wikipedia. In the future, we plan
to provide more feedback to the Wikipedia community
about the quality of Wikipedia. This can, for instance, be achieved in the form of sanity checks implemented as SPARQL queries on the DBpedia Live endpoint, which identify data quality issues and are executed at certain intervals. For example, a query could check that the birth date of a person is always before the death date, or spot outliers that differ significantly from the range of the majority of the other values. In case a Wikipedia editor makes a mistake or typo when adding such information to a page, this could be automatically identified and provided as feedback to Wikipedians.
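A minimal sketch of such a sanity check, assuming the standard dbo:birthDate and dbo:deathDate properties and an endpoint that supports date comparison, as Virtuoso does (illustrative only, not a query actually run by the project):

PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?person ?birth ?death
WHERE {
  ?person dbo:birthDate ?birth ;
          dbo:deathDate ?death .
  # flag persons whose recorded death date precedes their birth date
  FILTER (?death < ?birth)
}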
Integrate DBpedia and NLP. Recent advances
(cf. Section 7) show huge potential for employing
Linked Data background knowledge in various Natural
Language Processing (NLP) tasks. One very promising
research avenue in this regard is to employ DBpedia
as structured background knowledge for named en-
tity recognition and disambiguation. Currently, most
approaches use statistical information such as co-
occurrence for named entity disambiguation. However,
co-occurrence is not always easy to determine (it depends on training data) or to update (it requires recomputation).
With DBpedia and in particular DBpedia Live, we have
comprehensive and evolving background knowledge
comprising information on the relationship between a
large number of real-world entities. Consequently, we
can employ this information for deciding to what entity
a certain surface form should be mapped.
Acknowledgment
We would like to thank and acknowledge the support
of the following people and organisations to DBpedia:
– all Wikipedia contributors
– all DBpedia Mappings Wiki contributors
– OpenLink Software for providing and maintaining the server infrastructure for the main DBpedia endpoint
– Kingsley Idehen for SPARQL and Linked Data hosting and community support
– Christopher Sahnwaldt for DBpedia development and release management
– Claus Stadler for DBpedia development
– Paul Kreis for DBpedia development
– people who helped contributing data for certain parts of the article:
  – Instance Data Analysis: Volha Bryl (working at University of Mannheim)
  – Sindice analysis: Stéphane Campinas, Szymon Danielczyk, and Gabriela Vulcu (working at DERI)
  – Freebase: Shawn Simister and Tom Morris (working at Freebase)
This work was supported by grants from the Euro-
pean Union’s 7th Framework Programme provided for
the projects LOD2 (GA no. 257943), GeoKnow (GA
no. 318159) and Dicode (GA no. 257184).
Appendix
Table 16
List of namespace prefixes.
Prefix Namespace
dbo http://dbpedia.org/ontology/
dbp http://dbpedia.org/property/
dbr http://dbpedia.org/resource/
dbr-de http://de.dbpedia.org/resource/
dc http://purl.org/dc/elements/1.1/
foaf http://xmlns.com/foaf/0.1/
geo http://www.w3.org/2003/01/geo/wgs84_pos#
georss http://www.georss.org/georss/
ls http://spotlight.dbpedia.org/scores/
lx http://spotlight.dbpedia.org/lexicalizations/
owl http://www.w3.org/2002/07/owl#
rdf http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs http://www.w3.org/2000/01/rdf-schema#
skos http://www.w3.org/2004/02/skos/core#
sptl http://spotlight.dbpedia.org/vocab/
xsd http://www.w3.org/2001/XMLSchema#
References
[1]
S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and
Z. Ives. DBpedia: A Nucleus for a Web of Open Data. In
Proceedings of the 6th International Semantic Web Conference
(ISWC), volume 4825 of Lecture Notes in Computer Science,
pages 722–735. Springer, 2008.
Table 17
DBpedia timeline.
Year Month Event
2006 Nov Start of infobox extraction from Wikipedia
2007 Mar DBpedia 1.0 release
Jun ESWC DBpedia article [2]
Nov DBpedia 2.0 release
Nov ISWC DBpedia article [1]
Dec DBpedia 3.0 release candidate
2008 Feb DBpedia 3.0 release
Aug DBpedia 3.1 release
Nov DBpedia 3.2 release
2009 Jan DBpedia 3.3 release
Sep JWS DBpedia article [5]
2010 Feb Information extraction framework in Scala
Mappings Wiki release
Mar DBpedia 3.4 release
Apr DBpedia 3.5 release
May Start of DBpedia Internationalization effort
2011 Feb DBpedia Spotlight release
Mar DBpedia 3.6 release
Jul DBpedia Live release
Sep DBpedia 3.7 release (with I18n datasets)
2012 Aug DBpedia 3.8 release
Sep   Publication of DBpedia Internationalization article [24]
2013 Sep DBpedia 3.9 release
[2]
S. Auer and J. Lehmann. What have Innsbruck and Leipzig
in common? extracting semantics from wiki content. In Pro-
ceedings of the ESWC (2007), volume 4519 of Lecture Notes in
Computer Science, pages 503–517, Berlin / Heidelberg, 2007.
Springer.
[3]
C. Becker and C. Bizer. Exploring the geospatial Semantic Web
with DBpedia Mobile. J. Web Sem, 7(4):278–286, 2009.
[4]
D. Bikel, V. Castelli, R. Florian, and D.-J. Han. Entity linking
and slot filling through statistical processing and inference rules.
In Proceedings TAC Workshop, 2009.
[5]
C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cy-
ganiak, and S. Hellmann. DBpedia - a crystallization point for
the Web of Data. Journal of Web Semantics, 7(3):154–165,
2009.
[6]
E. Cabrio, J. Cojan, A. P. Aprosio, B. Magnini, A. Lavelli, and
F. Gandon. QAKiS: an open domain QA system based on
relational patterns. In ISWC-PD; International Semantic Web
Conference (Posters & Demos), volume 914 of CEUR Workshop
Proceedings. CEUR-WS.org, 2012.
[7]
D. V. Camarda, S. Mazzini, and A. Antonuccio. Lodlive, ex-
ploring the Web of Data. In V. Presutti and H. S. Pinto, editors,
I-SEMANTICS 2012 - 8th International Conference on Seman-
tic Systems, I-SEMANTICS ’12, Graz, Austria, September 5-7,
2012, pages 197–200. ACM, 2012.
[8]
S. Campinas, T. E. Perry, D. Ceccarelli, R. Delbru, and G. Tum-
marello. Introducing RDF graph summary with application to
assisted SPARQL formulation. In Database and Expert Systems
Applications (DEXA), 2012 23rd International Workshop on,
pages 261–266. IEEE, 2012.
[9]
D. Damljanovic, M. Agatonovic, and H. Cunningham. FREyA:
An interactive way of querying linked data using natural lan-
guage. In The Semantic Web: ESWC 2011 Workshops, volume
7117, pages 125–138. Springer, 2012.
[10]
O. Erling and I. Mikhailov. RDF support in the Virtuoso DBMS.
In S. Auer, C. Bizer, C. Müller, and A. V. Zhdanova, editors,
CSSW, volume 113 of LNI, pages 59–68. GI, 2007.
[11]
O. Etzioni, M. Cafarella, D. Downey, S. Kok, A.-M. Popescu,
T. Shaked, S. Soderland, D. S. Weld, and A. Yates. Web-scale
information extraction in KnowitAll. In Proceedings of the 13th
international conference on World Wide Web, pages 100–110.
ACM, 2004.
[12]
D. Ferrucci, E. Brown, J. Chu-Carroll, J. Fan, D. Gondek, A. A.
Kalyanpur, A. Lally, J. W. Murdock, E. Nyberg, J. Prager, et al.
Building Watson: An overview of the DeepQA project. AI
magazine, 31(3):59–79, 2010.
[13]
A. García-Silva, M. Jakob, P. N. Mendes, and C. Bizer. Mul-
tipedia: Enriching DBpedia with multimedia information. In
Proceedings of the sixth international conference on Knowledge
capture, K-CAP ’11, pages 137–144, New York, NY, USA,
2011. ACM.
[14]
A. García-Silva, M. Szomszor, H. Alani, and O. Corcho. Preliminary results in tag disambiguation using DBpedia. In 1st International Workshop in Collective Knowledge Capturing and Representation (CKCaR), California, USA, 2009.
[15]
R. Hahn, C. Bizer, C. Sahnwaldt, C. Herta, S. Robinson,
M. Bürgle, H. Düwiger, and U. Scheel. Faceted Wikipedia
Search. In W. Abramowicz and R. Tolksdorf, editors, Business
Information Systems, 13th International Conference, BIS 2010,
Berlin, Germany, May 3-5, 2010. Proceedings, volume 47 of
Lecture Notes in Business Information Processing, pages 1–11.
Springer, 2010.
[16]
P. Heim, T. Ertl, and J. Ziegler. Facet Graphs: Complex seman-
tic querying made easy. In Proceedings of the 7th Extended
Semantic Web Conference (ESWC 2010), volume 6088 of LNCS,
pages 288–302, Berlin/Heidelberg, 2010. Springer.
[17]
P. Heim, S. Hellmann, J. Lehmann, S. Lohmann, and T. Stege-
mann. RelFinder: Revealing relationships in RDF knowledge
bases. In T.-S. Chua, Y. Kompatsiaris, B. Mérialdo, W. Haas,
G. Thallinger, and W. Bailer, editors, Semantic Multimedia, 4th
International Conference on Semantic and Digital Media Tech-
nologies, SAMT 2009, Graz, Austria, December 2-4, 2009, Pro-
ceedings, volume 5887 of Lecture Notes in Computer Science,
pages 182–187. Springer, 2009.
[18]
P. Heim, S. Lohmann, D. Tsendragchaa, and T. Ertl. SemLens:
Visual analysis of semantic data with scatter plots and seman-
tic lenses. In C. Ghidini, A.-C. N. Ngomo, S. N. Lindstaedt,
and T. Pellegrini, editors, Proceedings the 7th International
Conference on Semantic Systems, I-SEMANTICS 2011, Graz,
Austria, September 7-9, 2011, ACM International Conference
Proceeding Series, pages 175–178. ACM, 2011.
[19]
S. Hellmann, J. Brekle, and S. Auer. Leveraging the Crowd-
sourcing of Lexical Resources for Bootstrapping a Linguistic
Data Cloud. In JIST, 2012.
[20]
S. Hellmann, C. Stadler, J. Lehmann, and S. Auer. DBpedia
Live Extraction. In Proc. of 8th International Conference on On-
tologies, DataBases, and Applications of Semantics (ODBASE),
volume 5871 of Lecture Notes in Computer Science, pages
1209–1223, 2009.
[21]
J. Hoffart, F. M. Suchanek, K. Berberich, and G. Weikum.
YAGO2: a spatially and temporally enhanced knowledge base
from Wikipedia. Artif. Intell, 194:28–61, 2013.
[22]
H. Ji, R. Grishman, H. T. Dang, K. Griffitt, and J. Ellis.
Overview of the TAC 2010 knowledge base population track.
In Third Text Analysis Conference (TAC 2010), 2010.
[23]
G. Kobilarov, T. Scott, Y. Raimond, S. Oliver, C. Sizemore,
M. Smethurst, C. Bizer, and R. Lee. Media meets Semantic
Web – how the BBC uses DBpedia and linked data to make
connections. In The semantic web: research and applications,
pages 723–737. Springer, 2009.
[24]
D. Kontokostas, C. Bratsas, S. Auer, S. Hellmann, I. Antoniou,
and G. Metakides. Internationalization of Linked Data: The
case of the Greek DBpedia Edition. Web Semantics: Science,
Services and Agents on the World Wide Web, 15(0):51 – 61,
2012.
[25]
D. Kontokostas, A. Zaveri, S. Auer, and J. Lehmann.
TripleCheckMate: A Tool for Crowdsourcing the Quality As-
sessment of Linked Data. In Proceedings of the 4th Conference
on Knowledge Engineering and Semantic Web, 2013.
[26]
P. Kreis. Design of a quality assessment framework for the
DBpedia knowledge base. Master’s thesis, Freie Universität Berlin, 2011.
[27]
C. Lagoze, H. V. de Sompel, M. Nelson, and S. Warner.
The open archives initiative protocol for metadata har-
vesting.
http://www.openarchives.org/OAI/
openarchivesprotocol.html, 2008.
[28] J. Lehmann, S. Monahan, L. Nezda, A. Jung, and Y. Shi. LCC
approaches to knowledge base population at TAC 2010. In
Proceedings TAC Workshop, 2010.
[29]
D. B. Lenat. CYC: A large-scale investment in knowledge
infrastructure. Communications of the ACM, 38(11):33–38,
1995.
[30]
C.-Y. Lin and E. H. Hovy. The automated acquisition of topic
signatures for text summarization. In Proceedings of the 18th
conference on Computational linguistics, pages 495–501, 2000.
[31]
V. Lopez, M. Fernández, E. Motta, and N. Stieler. PowerAqua:
Supporting users in querying and exploring the Semantic Web.
Semantic Web, 3(3):249–265, 2012.
[32]
M. Martin, C. Stadler, P. Frischmuth, and J. Lehmann. Increas-
ing the financial transparency of European Commission project
funding. Semantic Web Journal, Special Call for Linked Dataset
descriptions, 2013.
[33]
P. N. Mendes, M. Jakob, and C. Bizer. DBpedia for NLP - a
multilingual cross-domain knowledge base. In Proceedings
of the International Conference on Language Resources and
Evaluation (LREC), Istanbul, Turkey, 2012.
[34]
P. N. Mendes, M. Jakob, A. García-Silva, and C. Bizer. DBpedia
Spotlight: Shedding light on the web of documents. In Proceed-
ings of the 7th International Conference on Semantic Systems
(I-Semantics), Graz, Austria, 2011.
[35]
S. M. Meyer, J. Degener, J. Giannandrea, and B. Michener.
Optimizing schema-last tuple-store queries in GraphD. In A. K.
Elmagarmid and D. Agrawal, editors, SIGMOD Conference,
pages 1047–1056. ACM, 2010.
[36]
M. Morsey, J. Lehmann, S. Auer, C. Stadler, and S. Hell-
mann. DBpedia and the Live Extraction of Structured Data
from Wikipedia. Program: electronic library and information
systems, 46:27, 2012.
[37]
K. Nakayama, M. Pei, M. Erdmann, M. Ito, M. Shirakawa,
T. Hara, and S. Nishio. Wikipedia mining: Wikipedia as a corpus
for knowledge extraction. In Annual Wikipedia Conference
(Wikimania), 2008.
[38]
X. Niu, X. Sun, H. Wang, S. Rong, G. Qi, and Y. Yu. Zhishi.me:
Weaving chinese linking open data. In 10th International Con-
ference on The semantic web - Volume Part II, pages 205–220,
Berlin, Heidelberg, 2011. Springer-Verlag.
[39]
E. Oren, R. Delbru, M. Catasta, R. Cyganiak, H. Stenzhorn,
and G. Tummarello. Sindice.com: A document-oriented lookup
index for open linked data. Int. J. of Metadata and Semantics
and Ontologies, 3:37–52, Nov. 10 2008.
[40]
S. P. Ponzetto and M. Strube. WikiTaxonomy: A large scale
knowledge resource. In M. Ghallab, C. D. Spyropoulos,
N. Fakotakis, and N. M. Avouris, editors, ECAI, volume 178
of Frontiers in Artificial Intelligence and Applications, pages
751–752. IOS Press, 2008.
[41]
C. Stadler, J. Lehmann, K. Höffner, and S. Auer. LinkedGeo-
Data: A Core for a Web of Spatial Open Data. Semantic Web
Journal, 3(4):333–354, 2012.
[42]
F. M. Suchanek, G. Kasneci, and G. Weikum. YAGO: A core
of semantic knowledge. In C. L. Williamson, M. E. Zurko,
P. F. Patel-Schneider, and P. J. Shenoy, editors, WWW, pages
697–706. ACM, 2007.
[43]
E. Tacchini, A. Schultz, and C. Bizer. Experiments with
Wikipedia cross-language data fusion. In Proceedings of the
5th Workshop on Scripting and Development for the Semantic
Web, ESWC. Citeseer, 2009.
[44]
C. Unger, L. Bühmann, J. Lehmann, A.-C. Ngonga Ngomo,
D. Gerber, and P. Cimiano. Template-based Question Answer-
ing over RDF Data. In Proceedings of the 21st international
conference on World Wide Web, pages 639–648, 2012.
[45]
Z. Wang, J. Li, Z. Wang, and J. Tang. Cross-lingual Knowledge
Linking Across Wiki Knowledge Bases. In Proceedings of
the 21st international conference on World Wide Web, pages
459–468, New York, NY, USA, 2012. ACM.
[46]
F. Wu and D. Weld. Automatically refining the Wikipedia
Infobox Ontology. In Proceedings of the 17th World Wide Web
Conference, 2008.
[47]
F. Wu and D. S. Weld. Autonomously semantifying Wikipedia.
In Proceedings of the 16th Conference on Information and
Knowledge Management, pages 41–50. ACM, 2007.
[48]
A. Zaveri, D. Kontokostas, M. A. Sherif, L. B
¨
uhmann,
M. Morsey, S. Auer, and J. Lehmann. User-driven Quality
Evaluation of DBpedia. In To appear in Proceedings of 9th
International Conference on Semantic Systems, I-SEMANTICS
’13, Graz, Austria, September 4-6, 2013. ACM, 2013.
[49]
T. Zesch, C. M
¨
uller, and I. Gurevych. Extracting lexical seman-
tic knowledge from Wikipedia and Wiktionary. In Proceedings
of the 6th International Conference on Language Resources
and Evaluation, Marrakech, Morocco, May 2008. electronic
proceedings.