Semantic Web 1 (2012) 1–5
IOS Press
DBpedia – A Large-scale, Multilingual
Knowledge Base Extracted from Wikipedia
Editor(s): Name Surname, University, Country
Solicited review(s): Name Surname, University, Country
Open review(s): Name Surname, University, Country
Jens Lehmann a,*, Robert Isele g, Max Jakob e, Anja Jentzsch d, Dimitris Kontokostas a,
Pablo N. Mendes f, Sebastian Hellmann a, Mohamed Morsey a, Patrick van Kleef c, Sören Auer a,
Christian Bizer b
* Corresponding author. E-mail: lehmann@informatik.uni-leipzig.de
a University of Leipzig, Institute of Computer Science, AKSW Group, Augustusplatz 10, D-04009 Leipzig, Germany
E-mail: {lastname}@informatik.uni-leipzig.de
b University of Mannheim, Research Group Data and Web Science, B6-26, D-68159 Mannheim
E-mail: chris@informatik.uni-mannheim.de
c OpenLink Software, 10 Burlington Mall Road, Suite 265, Burlington, MA 01803, U.S.A.
E-mail: pkleef@openlinksw.com
d Hasso-Plattner-Institute for IT-Systems Engineering, Prof.-Dr.-Helmert-Str. 2-3, D-14482 Potsdam, Germany
E-mail: mail@anjajentzsch.de
e Neofonie GmbH, Robert-Koch-Platz 4, D-10115 Berlin, Germany
E-mail: max.jakob@neofonie.de
f Kno.e.sis - Ohio Center of Excellence in Knowledge-enabled Computing, Wright State University, Dayton, USA
E-mail: pablo@knoesis.org
g Brox IT-Solutions GmbH, An der Breiten Wiese 9, D-30625 Hannover, Germany
E-mail: mail@robertisele.com
Abstract.
The DBpedia community project extracts structured, multilingual knowledge from Wikipedia and makes it freely
available on the Web using Semantic Web and Linked Data technologies. The project extracts knowledge from 111 different
language editions of Wikipedia. The largest DBpedia knowledge base which is extracted from the English edition of Wikipedia
consists of over 400 million facts that describe 3.7 million things. The DBpedia knowledge bases that are extracted from the other
110 Wikipedia editions together consist of 1.46 billion facts and describe 10 million additional things. The DBpedia project maps
Wikipedia infoboxes from 27 different language editions to a single shared ontology consisting of 320 classes and 1,650 properties.
The mappings are created via a world-wide crowd-sourcing effort and enable knowledge from the different Wikipedia editions to
be combined. The project publishes regular releases of all DBpedia knowledge bases for download and provides SPARQL query
access to 14 out of the 111 language editions via a global network of local DBpedia chapters. In addition to the regular releases,
the project maintains a live knowledge base which is updated whenever a page in Wikipedia changes. DBpedia sets 27 million
RDF links pointing into over 30 external data sources and thus enables data from these sources to be used together with DBpedia
data. Several hundred data sets on the Web publish RDF links pointing to DBpedia themselves and thus make DBpedia one of
the central interlinking hubs in the Linked Open Data (LOD) cloud. In this system report, we give an overview of the DBpedia
community project, including its architecture, technical implementation, maintenance, internationalisation, usage statistics and
applications.
Keywords: Knowledge Extraction, Wikipedia, Multilingual Knowledge Bases, Linked Data, RDF
1570-0844/12/$27.50 © 2012 – IOS Press and the authors. All rights reserved
1. Introduction
Wikipedia is the 6th most popular website (http://www.alexa.com/topsites, retrieved in October 2013), the most widely used encyclopedia, and one of the finest examples of truly collaboratively created content. There are official Wikipedia editions in 287 different languages, ranging in size from a couple of hundred articles up to 3.8 million articles in the English edition (see http://meta.wikimedia.org/wiki/List_of_Wikipedias). Besides free text, Wikipedia articles consist of different types of structured data such as infoboxes, tables, lists, and categorization data. Wikipedia currently offers only free-text search capabilities to its users, which makes it very difficult to find, for example, all rivers that flow into the Rhine and are longer than 100 miles, or all Italian composers that were born in the 18th century.
The DBpedia project builds a large-scale, multilin-
gual knowledge base by extracting structured data from
Wikipedia editions in 111 languages. This knowledge
base can be used to answer expressive queries such as
the ones outlined above. Being multilingual and covering a wide range of topics, the DBpedia knowledge
base is also useful within further application domains
such as data integration, named entity recognition, topic
detection, and document ranking.
The DBpedia knowledge base is widely used as a
testbed in the research community and numerous appli-
cations, algorithms and tools have been built around or
applied to DBpedia. DBpedia is served as Linked Data
on the Web. Since it covers a wide variety of topics
and sets RDF links pointing into various external data
sources, many Linked Data publishers have decided
to set RDF links pointing to DBpedia from their data
sets. Thus, DBpedia has developed into a central inter-
linking hub in the Web of Linked Data and has been
a key factor for the success of the Linked Open Data
initiative.
The structure of the DBpedia knowledge base is
maintained by the DBpedia user community. Most
importantly, the community creates mappings from
Wikipedia information representation structures to the
DBpedia ontology. This ontology – which will be ex-
plained in detail in Section 3 – unifies different tem-
plate structures, both within single Wikipedia language
editions and across currently 27 different languages.
The maintenance of different language editions of DB-
pedia is spread across a number of organisations. Each
organisation is responsible for the support of a certain
language. The local DBpedia chapters are coordinated
by the DBpedia Internationalisation Committee.
The aim of this system report is to provide a descrip-
tion of the DBpedia community project, including the
architecture of the DBpedia extraction framework, its
technical implementation, maintenance, internationali-
sation, usage statistics, as well as some popular DBpedia applications. This system report is a com-
prehensive update and extension of previous project de-
scriptions in [1] and [5]. The main advances compared to these articles are:
– The concept and implementation of the extraction based on a community-curated DBpedia ontology.
– The wide internationalisation of DBpedia.
– A live synchronisation module which processes updates in Wikipedia as well as the DBpedia ontology and allows third parties to keep their copies of DBpedia up-to-date.
– A description of the maintenance of public DBpedia services and statistics about their usage.
– An increased number of interlinked data sets which can be used to further enrich the content of DBpedia.
– The discussion and summary of novel third party applications of DBpedia.
Overall, DBpedia has undergone 7 years of continu-
ous evolution. Table 17 provides an overview of the
project’s timeline.
The system report is structured as follows: In the next
section, we describe the DBpedia extraction framework,
which forms the technical core of DBpedia. This is
followed by an explanation of the community-curated
DBpedia ontology with a focus on multilingual sup-
port. In Section 4, we explicate how DBpedia is syn-
chronised with Wikipedia with just very short delays
and how updates are propagated to DBpedia mirrors
employing the DBpedia Live system. Subsequently, we
give an overview of the external data sets that are in-
terlinked from DBpedia or that set RDF links pointing
to DBpedia themselves (Section 5). In Section 6, we
provide statistics on the usage of DBpedia and describe
the maintenance of a large scale public data set. Within
Section 7, we briefly describe several use cases and
applications of DBpedia in a variety of different areas.
Finally, we report on related work in Section 8 and give
an outlook on the further development of DBpedia in
Section 9.
Fig. 1. Overview of DBpedia extraction framework.
2. Extraction Framework
Wikipedia articles consist mostly of free text, but also comprise various types of structured information
in the form of wiki markup. Such information includes
infobox templates, categorisation information, images,
geo-coordinates, links to external web pages, disam-
biguation pages, redirects between pages, and links
across different language editions of Wikipedia. The
DBpedia extraction framework extracts this structured
information from Wikipedia and turns it into a rich
knowledge base. In this section, we give an overview
of the DBpedia knowledge extraction framework.
2.1. General Architecture
Figure 1 shows an overview of the technical frame-
work. The DBpedia extraction is structured into four
phases:
Input: Wikipedia pages are read from an external source. Pages can either be read from a Wikipedia dump or directly fetched from a MediaWiki installation using the MediaWiki API.
Parsing: Each Wikipedia page is parsed by the wiki parser. The wiki parser transforms the source code of a Wikipedia page into an Abstract Syntax Tree.
Extraction: The Abstract Syntax Tree of each Wikipedia page is forwarded to the extractors. DBpedia offers extractors for many different purposes, for instance, to extract labels, abstracts or geographical coordinates. Each extractor consumes an Abstract Syntax Tree and yields a set of RDF statements.
Output: The collected RDF statements are written to a sink. Different formats, such as N-Triples, are supported.
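To make the data flow between these four phases concrete, the following Python sketch wires a page source, a parser, a set of extractors and an output sink together. All class names (DumpSource, WikiParser, LabelExtractor, NTriplesSink) are illustrative placeholders and the parsing is deliberately trivial; the actual framework is implemented in Scala and is considerably more elaborate.

# Minimal sketch of the four-phase extraction flow (Input -> Parsing ->
# Extraction -> Output). All names are illustrative placeholders.

class DumpSource:
    """Input phase: yields raw wiki source of pages, e.g. from a dump file."""
    def __init__(self, pages):
        self.pages = pages          # list of (title, wiki_markup) tuples
    def read(self):
        yield from self.pages

class WikiParser:
    """Parsing phase: turns wiki markup into a (very simplified) syntax tree."""
    def parse(self, title, markup):
        return {"title": title, "lines": markup.splitlines()}

class LabelExtractor:
    """Extraction phase: one extractor, producing RDF-like triples."""
    def extract(self, tree):
        subject = "dbr:" + tree["title"].replace(" ", "_")
        return [(subject, "rdfs:label", '"%s"' % tree["title"])]

class NTriplesSink:
    """Output phase: collects the statements produced by all extractors."""
    def __init__(self):
        self.triples = []
    def write(self, triples):
        self.triples.extend(triples)

def run_pipeline(source, parser, extractors, sink):
    for title, markup in source.read():
        tree = parser.parse(title, markup)
        for extractor in extractors:
            sink.write(extractor.extract(tree))

if __name__ == "__main__":
    sink = NTriplesSink()
    run_pipeline(DumpSource([("Berlin", "'''Berlin''' is the capital ...")]),
                 WikiParser(), [LabelExtractor()], sink)
    print(sink.triples)   # [('dbr:Berlin', 'rdfs:label', '"Berlin"')]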
2.2. Extractors
The DBpedia extraction framework employs various
extractors for translating different parts of Wikipedia
pages to RDF statements. A list of all available extrac-
tors is shown in Table 1. DBpedia extractors can be
divided into four categories:
Mapping-Based Infobox Extraction: The mapping-based infobox extraction uses manually written mappings that relate infoboxes in Wikipedia to terms in the DBpedia ontology. The mappings also specify a datatype for each infobox property and thus help the extraction framework to produce high quality data. The mapping-based extraction will be described in detail in Section 2.4.
Raw Infobox Extraction: The raw infobox extraction provides a direct mapping from infoboxes in Wikipedia to RDF. As the raw infobox extraction does not rely on explicit extraction knowledge in the form of mappings, the quality of the extracted data is lower. The raw infobox data is useful if a specific infobox has not been mapped yet and thus is not available in the mapping-based extraction.
Feature Extraction: The feature extraction uses a number of extractors that are specialized in extracting a single feature from an article, such as a label or geographic coordinates.
Statistical Extraction: Some NLP related extractors aggregate data from all Wikipedia pages in order to provide data that is based on statistical measures of page links or word counts, as further described in Section 2.6.
2.3. Raw Infobox Extraction
The type of Wikipedia content that is most valuable for the DBpedia extraction is the infobox. Infoboxes are
frequently used to list an article’s most relevant facts
as a table of attribute-value pairs on the top right-hand
side of the Wikipedia page (for right-to-left languages
on the top left-hand side). Infoboxes that appear in a
Wikipedia article are based on a template that specifies
a list of attributes that can form the infobox. A wide
range of infobox templates are used in Wikipedia. Com-
mon examples are templates for infoboxes that describe
persons, organisations or automobiles. As Wikipedia’s
infobox template system has evolved over time, dif-
ferent communities of Wikipedia editors use differ-
ent templates to describe the same type of things (e.g. Infobox city japan, Infobox swiss town and Infobox town de). In addition, different templates use different names for the same attribute (e.g. birthplace and placeofbirth). As many
Wikipedia editors do not strictly follow the recommen-
dations given on the page that describes a template,
attribute values are expressed using a wide range of
different formats and units of measurement. An excerpt
of an infobox that is based on a template for describing
automobiles is shown below:
{{Infobox automobile
| name = Ford GT40
| manufacturer = [[Ford Advanced Vehicles]]
| production = 1964-1969
| engine = 4181cc
(...)
}}
In this infobox, the first line specifies the infobox
type and the subsequent lines specify various attributes
of the described entity.
An excerpt of the extracted data is as follows:
dbr:Ford_GT40
dbp:name "Ford GT40"@en;
dbp:manufacturer dbr:Ford_Advanced_Vehicles;
dbp:engine 4181;
dbp:production 1964;
(...).
This extraction output has weaknesses: The resource is not associated to a class in the ontology, and parsed values are cleaned up and assigned a datatype based on heuristics. In particular, the raw infobox extractor searches for values in the following order: dates, coordinates, numbers, links and strings as default. Thus, the datatype assignment for the same property in different resources is non-deterministic. The engine value, for example, is extracted as a number, but if another instance of the template used "cc4181" it would be extracted as a string. This behaviour makes querying for properties in the dbp namespace inconsistent. These problems can be overcome by the mapping-based infobox extraction presented in the next subsection.
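The order of the datatype heuristic can be illustrated with the following Python sketch; the regular expressions are simplified stand-ins for the framework's actual parsers and only serve to show why the same property can receive different datatypes.

import re

# Simplified sketch of the raw infobox value heuristic: try dates, then
# coordinates, then numbers, then wiki links, and fall back to plain strings.
DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")
COORD = re.compile(r"^-?\d+(\.\d+)? +-?\d+(\.\d+)?$")
NUMBER = re.compile(r"^-?\d+(\.\d+)?")
LINK = re.compile(r"^\[\[([^\]|]+)(\|[^\]]+)?\]\]$")

def guess_value(raw):
    raw = raw.strip()
    if DATE.match(raw):
        return ("xsd:date", raw)
    if COORD.match(raw):
        return ("georss:point", raw)
    number = NUMBER.match(raw)
    if number:
        return ("xsd:double", float(number.group(0)))
    link = LINK.match(raw)
    if link:
        return ("object", "dbr:" + link.group(1).replace(" ", "_"))
    return ("xsd:string", raw)

# "4181cc" starts with a number and is extracted as 4181.0, while "cc4181"
# falls through to a plain string -- the non-determinism described above.
print(guess_value("4181cc"))   # ('xsd:double', 4181.0)
print(guess_value("cc4181"))   # ('xsd:string', 'cc4181')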
2.4. Mapping-Based Infobox Extraction
In order to homogenize the description of informa-
tion in the knowledge base, in 2010 a community ef-
fort was initiated to develop an ontology schema and
mappings from Wikipedia infobox properties to this
ontology. The alignment between Wikipedia infoboxes
and the ontology is performed via community-provided
mappings that help to normalize name variations in
properties and classes. Heterogeneity in the Wikipedia
infobox system, like using different infoboxes for the
same type of entity or using different property names for
the same property (cf. Section 2.3), can be alleviated in
this way. This significantly increases the quality of the
raw Wikipedia infobox data by typing resources, merg-
ing name variations and assigning specific datatypes to
the values.
This effort is realized using the DBpedia Mappings Wiki (http://mappings.dbpedia.org), a MediaWiki installation set up to enable users
to collaboratively create and edit mappings. These map-
pings are specified using the DBpedia Mapping Lan-
guage. The mapping language makes use of MediaWiki
templates that define DBpedia ontology classes and
properties as well as template/table to ontology map-
pings. A mapping assigns a type from the DBpedia on-
tology to the entities that are described by the corre-
sponding infobox. In addition, attributes in the infobox
are mapped to properties in the DBpedia ontology. In
the following, we show a mapping that maps infoboxes that use the Infobox automobile template to the DBpedia ontology; the full mapping is shown after Table 1.
Table 1
Overview of the DBpedia extractors (cf. Table 16 for a complete list of prefixes).
Name | Description | Example
abstract | Extracts the first lines of the Wikipedia article. | dbr:Berlin dbo:abstract "Berlin is the capital city of (...)" .
article categories | Extracts the categorization of the article. | dbr:Oliver_Twist dc:subject dbr:Category:English_novels .
category label | Extracts labels for categories. | dbr:Category:English_novels rdfs:label "English novels" .
category hierarchy | Extracts information about which concept is a category and how categories are related using the SKOS vocabulary. | dbr:Category:World_War_II skos:broader dbr:Category:Modern_history .
disambiguation | Extracts disambiguation links. | dbr:Alien dbo:wikiPageDisambiguates dbr:Alien_(film) .
external links | Extracts links to external web pages related to the concept. | dbr:Animal_Farm dbo:wikiPageExternalLink <http://books.google.com/?id=RBGmrDnBs8UC> .
geo coordinates | Extracts geo-coordinates. | dbr:Berlin georss:point "52.5006 13.3989" .
grammatical gender | Extracts grammatical genders for persons. | dbr:Abraham_Lincoln foaf:gender "male" .
homepage | Extracts links to the official homepage of an instance. | dbr:Alabama foaf:homepage <http://alabama.gov/> .
image | Extracts the first image of a Wikipedia page. | dbr:Berlin foaf:depiction <http://.../Overview_Berlin.jpg> .
infobox | Extracts all properties from all infoboxes. | dbr:Animal_Farm dbo:date "March 2010" .
interlanguage | Extracts interwiki links. | dbr:Albedo dbo:wikiPageInterLanguageLink dbr-de:Albedo .
label | Extracts the article title as label. | dbr:Berlin rdfs:label "Berlin" .
lexicalizations | Extracts information about surface forms and their association with concepts (only N-Quad format). | dbr:Pine sptl:lexicalization lx:pine_tree ls:Pine_pine_tree . lx:pine_tree rdfs:label "pine tree" . ls:Pine_pine_tree sptl:pUriGivenSf "0.941" .
mappings | Extraction based on mappings of Wikipedia infoboxes to the DBpedia ontology. | dbr:Berlin dbo:country dbr:Germany .
page ID | Extracts page ids of articles. | dbr:Autism dbo:wikiPageID "25" .
page links | Extracts all links between Wikipedia articles. | dbr:Autism dbo:wikiPageWikiLink dbr:Human_brain .
persondata | Extracts information about persons represented using the PersonData template. | dbr:Andre_Agassi foaf:birthDate "1970-04-29" .
PND | Extracts PND (Personennamendatei) data about a person. | dbr:William_Shakespeare dbo:individualisedPnd "118613723" .
redirects | Extracts redirect links between articles in Wikipedia. | dbr:ArtificialLanguages dbo:wikiPageRedirects dbr:Constructed_language .
revision ID | Extracts the revision ID of the Wikipedia article. | dbr:Autism <http://www.w3.org/ns/prov#wasDerivedFrom> <http://en.wikipedia.org/wiki/Autism?oldid=495234324> .
thematic concept | Extracts 'thematic' concepts, the centres of discussion for categories. | dbr:Category:Music skos:subject dbr:Music .
topic signatures | Extracts topic signatures. | dbr:Alkane sptl:topicSignature "carbon alkanes atoms" .
wiki page | Extracts links to corresponding articles in Wikipedia. | dbr:AnAmericanInParis foaf:isPrimaryTopicOf <http://en.wikipedia.org/wiki/AnAmericanInParis> .
{{TemplateMapping
|mapToClass = Automobile
|mappings =
{{PropertyMapping
| templateProperty = name
| ontologyProperty = foaf:name }}
{{PropertyMapping
| templateProperty = manufacturer
| ontologyProperty = manufacturer }}
{{DateIntervalMapping
| templateProperty = production
| startDateOntologyProperty =
productionStartDate
| endDateOntologyProperty =
productionEndDate }}
{{IntermediateNodeMapping
| nodeClass = AutomobileEngine
| correspondingProperty = engine
| mappings =
{{PropertyMapping
| templateProperty = engine
| ontologyProperty = displacement
| unit = Volume }}
{{PropertyMapping
| templateProperty = engine
| ontologyProperty = powerOutput
| unit = Power }}
}}
(...)
}}
The RDF statements that are extracted from the pre-
vious infobox example are shown below. As we can see,
the production period is correctly split into a start year
and an end year and the engine is represented by a dis-
tinct RDF node. It is worth mentioning that all values
are canonicalized to basic units. For example, in the
engine
mapping we state that
engine
is a Volume
and thus, the extractor converts “4181cc” (cubic cen-
timeters) to cubic meters (“0.004181”). Additionally,
there can exist multiple mappings on the same property that search for different datatypes or different units, for example a number with "PS" (metric horsepower) as a suffix for the engine's power output.
dbr:Ford_GT40
rdf:type dbo:Automobile;
rdfs:label "Ford GT40"@en;
dbo:manufacturer
dbr:Ford_Advanced_Vehicles;
dbo:productionStartYear
"1964"ˆˆxsd:gYear;
dbo:productionEndYear "1969"ˆˆxsd:gYear;
dbo:engine [
rdf:type AutomobileEngine;
dbo:displacement "0.004181";
]
(...) .
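The canonicalisation of values to basic units can be illustrated with a small Python sketch; the two conversion factors below cover only the units mentioned in this example and are not the framework's actual unit registry.

import re

# Illustrative unit canonicalisation: raw values are converted to base units
# before they are emitted, e.g. cubic centimetres to cubic metres and
# metric horsepower (PS) to watts.
TO_BASE_UNIT = {
    "cc": ("Volume", 1e-6),      # 1 cc = 1e-6 m^3
    "PS": ("Power", 735.49875),  # 1 PS = 735.49875 W
}

def canonicalise(raw):
    match = re.match(r"^\s*([\d.]+)\s*([A-Za-z]+)\s*$", raw)
    if not match:
        return None
    value, unit = float(match.group(1)), match.group(2)
    if unit not in TO_BASE_UNIT:
        return None
    dimension, factor = TO_BASE_UNIT[unit]
    return dimension, value * factor

print(canonicalise("4181cc"))   # ('Volume', 0.004181)
print(canonicalise("300 PS"))   # ('Power', 220649.625)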
The DBpedia Mapping Wiki is not only used to map
different templates within a single language edition
of Wikipedia to the DBpedia ontology, but is used to
map templates from all Wikipedia language editions
to the shared DBpedia ontology. Figure 2 shows how
the infobox properties author and συγγραφέας – author in Greek – are both being mapped to the global identifier dbo:author. That means, in turn, that in-
formation from all language versions of DBpedia can
be merged and DBpedias for smaller languages can be
augmented with knowledge from larger DBpedias such
as the English edition. Conversely, the larger DBpe-
dia editions can benefit from more specialized knowl-
edge from localized editions, such as data about smaller
towns which is often only present in the corresponding
language edition [43].
Besides hosting of the mappings and DBpedia on-
tology definition, the DBpedia Mappings Wiki offers
various tools which support users in their work:
Mapping Syntax Validator: The mapping syntax validator checks for syntactic correctness and highlights inconsistencies such as missing property definitions.
Extraction Tester: The extraction tester linked on each mapping page tests a mapping against a set of example Wikipedia pages. This gives direct feedback about whether a mapping works and how the resulting data is structured.
Mapping Tool: The DBpedia Mapping Tool is a graphical user interface that supports users in creating and editing mappings.
2.5. URI Schemes
For every Wikipedia article, the framework intro-
duces a number of URIs to represent the concepts de-
scribed on a particular page. Up to 2011, DBpedia pub-
lished URIs only under the
http://dbpedia.org
domain. The main namespaces were:
– http://dbpedia.org/resource/ (prefix dbr) for representing article data. There is a one-to-one mapping between a Wikipedia page and a DBpedia resource based on the article title. For example, for the Wikipedia article on Berlin (http://en.wikipedia.org/wiki/Berlin), DBpedia will produce the URI dbr:Berlin. Exceptions to this rule appear when intermediate nodes are extracted by the mapping-based infobox extractor as unique URIs (e.g., the engine mapping example in Section 2.4).
– http://dbpedia.org/property/ (prefix dbp) for representing properties extracted from the raw infobox extraction (cf. Section 2.3), e.g. dbp:population.
– http://dbpedia.org/ontology/ (prefix dbo) for representing the DBpedia ontology (cf. Section 2.4), e.g. dbo:populationTotal.
Fig. 2. Depiction of the mapping from the Greek (left) and English Wikipedia templates (right) about books to the same DBpedia Ontology class (middle) [24].
Although data from other Wikipedia language editions were extracted, they used the same namespaces. This was achieved by exploiting the Wikipedia inter-language links (http://en.wikipedia.org/wiki/Help:Interlanguage_links). For every page in a language other than English, the page was extracted only if the page contained an inter-language link to an English page. In that case, using the English link, the data was extracted under the English resource name (i.e. dbr:Berlin).
Recent DBpedia internationalisation developments showed that this approach omitted valuable data [24]. Thus, starting from the DBpedia 3.7 release (a list of all DBpedia releases is provided in Table 17), two types of data sets were generated. The localized data sets contain all things that are described in a specific language. Within these data sets, things are identified with language-specific URIs such as http://<lang>.dbpedia.org/resource/ for article data and http://<lang>.dbpedia.org/property/ for property data. In addition, we produce a canonicalized data set for each language. The canonicalized data sets only contain things for which a corresponding page in the English edition of Wikipedia exists. Within all canonicalized data sets, the same thing is identified with the same URI from the generic language-agnostic namespace http://dbpedia.org/resource/.
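The naming convention for localized and canonicalized identifiers can be summarised by the small helper below; it is a sketch of the URI scheme described in this section (spaces in article titles become underscores) and glosses over the full IRI-encoding rules applied by the framework.

def dbpedia_uris(title, lang):
    """Derive the localized and the canonical DBpedia resource URI for an
    article title, following the scheme sketched above."""
    local_name = title.replace(" ", "_")
    localized = "http://%s.dbpedia.org/resource/%s" % (lang, local_name)
    canonical = "http://dbpedia.org/resource/%s" % local_name
    return localized, canonical

print(dbpedia_uris("Berlin", "de"))
# ('http://de.dbpedia.org/resource/Berlin', 'http://dbpedia.org/resource/Berlin')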
2.6. NLP Extraction
DBpedia provides a number of data sets which have
been created to support Natural Language Processing
(NLP) tasks [
33
]. Currently, four datasets are extracted:
topic signatures,grammatical gender,lexicalizations
and thematic concept. While the topic signatures and
the grammatical gender extractors primarily extract data
from the article text, the lexicalizations and thematic
concept extractors make use of the wiki markup.
DBpedia entities can be referred to using many dif-
ferent names and abbreviations. The Lexicalization data
set provides access to alternative names for entities
and concepts, associated with several scores estimating
the association strength between name and URI. These
scores distinguish more common names for specific
entities from rarely used ones and also show how am-
biguous a name is with respect to all possible concepts
that it can mean.
The topic signatures data set enables the description
of DBpedia resources based on unstructured informa-
tion, as compared to the structured factual data provided
by the mapping-based and raw extractors. We build a
Vector Space Model (VSM) where each DBpedia re-
source is a point in a multidimensional space of words.
Each DBpedia resource is represented by a vector, and
each word occurring in Wikipedia is a dimension of
this vector. Word scores are computed using the tf-idf
weight, with the intention of measuring how strong the association between a word and a DBpedia resource is.
Note that word stems are used in this context in order to
generalize over inflected words. We use the computed
weights to select the strongest related word stems for
each entity and build topic signatures [30].
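The following Python sketch shows the general idea on a toy corpus; it uses whole words instead of word stems and a plain tf-idf formula, so it only approximates the actual topic signature computation.

import math
from collections import Counter

# Toy corpus: one bag of words per DBpedia resource (invented for illustration).
docs = {
    "dbr:Alkane":  "carbon atoms alkanes carbon bonds hydrogen".split(),
    "dbr:Benzene": "carbon ring aromatic bonds hydrogen".split(),
    "dbr:Berlin":  "capital germany city river spree".split(),
}

def topic_signature(resource, docs, k=3):
    """Return the k words with the highest tf-idf weight for a resource."""
    tf = Counter(docs[resource])
    n_docs = len(docs)
    def tfidf(word):
        df = sum(1 for words in docs.values() if word in words)
        return tf[word] * math.log(n_docs / df)
    return sorted(tf, key=tfidf, reverse=True)[:k]

print(topic_signature("dbr:Alkane", docs))  # ['atoms', 'alkanes', 'carbon']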
There are two more Feature Extractors related to Nat-
ural Language Processing. The thematic concepts data
set relies on Wikipedia’s category system to capture the
idea of a ‘theme’, a subject that is discussed in its arti-
cles. Many of the categories in Wikipedia are linked to
an article that describes the main topic of that category.
We rely on this information to mark DBpedia entities
and concepts that are ‘thematic’, that is, they are the
centre of discussion for a category.
The grammatical gender data set uses a simple heuris-
tic to decide on a grammatical gender for instances of
the class Person in DBpedia. While parsing an article
in the English Wikipedia, if there is a mapping from
an infobox in this article to the class
dbo:Person
,
we record the frequency of gender-specific pronouns in
their declined forms (Subject, Object, Possessive Ad-
jective, Possessive Pronoun and Reflexive) – i.e. he,
him, his, himself (masculine) and she, her, hers, herself
(feminine). Grammatical genders for DBpedia entities
are assigned based on the dominating gender in these
pronouns.
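A condensed Python version of this heuristic is given below; the pronoun lists follow the description above, while the simple majority decision is a simplifying assumption.

import re
from collections import Counter

MASCULINE = {"he", "him", "his", "himself"}
FEMININE = {"she", "her", "hers", "herself"}

def grammatical_gender(article_text):
    """Assign a gender based on the dominating gender-specific pronouns."""
    tokens = re.findall(r"[a-z]+", article_text.lower())
    counts = Counter(t for t in tokens if t in MASCULINE | FEMININE)
    masc = sum(counts[t] for t in MASCULINE)
    fem = sum(counts[t] for t in FEMININE)
    if masc == fem == 0:
        return None
    return "male" if masc > fem else "female"

print(grammatical_gender("After his election he moved to Washington ..."))  # male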
2.7. Summary of Other Recent Developments
In this section we summarize the improvements of
the DBpedia extraction framework since the publication
of the previous DBpedia overview article [
5
] in 2009.
One of the major changes on the implementation level
is that the extraction framework was rewritten in Scala in 2010 (Table 17 provides an overview of the project's evolution through time), resulting in an improvement of the efficiency of the extractors by an order of magnitude compared to the previous PHP-based framework. This is crucial for the DBpedia Live project, which will be explained in Section 4. The new, more modular framework also allows extracting data from tables in Wikipedia pages and supports extraction from multiple MediaWiki templates per page. Another significant change was the
creation and utilization of the DBpedia Mappings Wiki
as described earlier. Further significant changes include
the mentioned NLP extractors and the introduction of
URI schemes.
In addition, there were several smaller improvements
and general maintenance: Overall, over the past four
years, the parsing of the MediaWiki markup improved
considerably, which led to better overall coverage, for
example, concerning references and parser functions.
In addition, the collection of MediaWiki namespace
identifiers for many languages is now performed semi-
automatically leading to a high accuracy of detection.
This concerns common title prefixes such as User, File,
Template, Help, Portal etc. in English that indicate
pages that do not contain encyclopedic content and
would produce noise in the data. They are important for
specific extractors as well, for instance, the category hi-
erarchy data set is produced from pages of the Category
namespace. Furthermore, the output of the extraction
system now supports more formats and several compli-
ance issues regarding URIs, IRIs, N-Triples and Turtle
were fixed.
The individual data extractors have been improved as
well in both number and quality in many areas. The ab-
stract extraction was enhanced producing more accurate
plain text representations of the beginning of Wikipedia
article texts. More diverse and more specific datatypes now exist (e.g. many currencies and XSD datatypes such as xsd:gYearMonth, xsd:positiveInteger, etc.) and for a number of classes and properties, specific datatypes were added (e.g. inhabitants/km² for the population density of populated places and m³/s for the discharge of rivers). Many issues related to data parsers
were resolved and the quality of the
owl:sameAs
data set for multiple language versions was increased
by an implementation that takes bijective relations into
account.
There are also further extractors, e.g. for Wikipedia
page IDs and revisions. Moreover, redirect and disam-
biguation extractors were introduced and improved. For
the redirect data, the transitive closure is computed
while taking care of catching cycles in the links. The
redirects also help regarding infobox coverage in the
mapping-based extraction by resolving alternative tem-
plate names. Moreover, in the PHP framework, if an
infobox value pointed to a redirect, this redirection was
not properly resolved and thus resulted in RDF links
that led to URIs which did not contain any further in-
formation. Resolving redirects affected approximately
15% of all links, and hence increased the overall inter-
connectivity of resources in the DBpedia ontology.
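The resolution of redirect chains can be pictured with the following Python sketch, which follows a chain until a non-redirect target is reached and stops if a cycle is detected; the actual framework computes the full transitive closure over the extracted redirect data set.

def resolve_redirect(start, redirects):
    """Follow a chain of redirects to its final target, catching cycles.

    `redirects` maps a redirecting page to the page it redirects to.
    Returns the final target, or None if the chain runs into a cycle."""
    seen = {start}
    current = start
    while current in redirects:
        current = redirects[current]
        if current in seen:        # cycle detected, e.g. A -> B -> A
            return None
        seen.add(current)
    return current

# Illustrative (hypothetical) redirect chain:
redirects = {
    "ArtificialLanguages": "Constructed_languages",
    "Constructed_languages": "Constructed_language",
}
print(resolve_redirect("ArtificialLanguages", redirects))  # Constructed_language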
Fig. 3. Snapshot of a part of the DBpedia ontology.
Finally, a new heuristic to increase the connectedness of DBpedia instances was introduced. If an infobox
contains a string value that is not linked to another
Wikipedia article, the extraction framework searches
for hyperlinks in the same Wikipedia article that have
the same anchor text as the infobox value string. If such
a link exists, the target of that link is used to replace
the string value in the infobox. This method further
increases the number of object property assertions in
the DBpedia ontology.
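A simplified rendering of this heuristic in Python is shown below; the wiki-link regular expression is a rough approximation of MediaWiki syntax and the example article text is invented for illustration.

import re

WIKI_LINK = re.compile(r"\[\[([^\]|]+)\|([^\]]+)\]\]|\[\[([^\]|]+)\]\]")

def link_for_anchor(article_markup, infobox_value):
    """Replace an unlinked infobox string by the target of an article link
    whose anchor text equals the string, if such a link exists."""
    for target, anchor, plain in WIKI_LINK.findall(article_markup):
        if plain:                       # [[Target]] - anchor equals target
            target, anchor = plain, plain
        if anchor.strip() == infobox_value.strip():
            return "dbr:" + target.strip().replace(" ", "_")
    return infobox_value                # keep the literal if nothing matches

article = "The car was designed by [[Carroll Shelby|Shelby]] in the 1960s."
print(link_for_anchor(article, "Shelby"))   # dbr:Carroll_Shelby
print(link_for_anchor(article, "Ford"))     # Ford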
Orthogonal to the previously mentioned improve-
ments, there have been various efforts to assess the qual-
ity of the DBpedia datasets. [26] developed a framework for estimating the quality of DBpedia, in which a sample of 75 resources was analysed. A more comprehensive effort was performed in [48] by providing a distributed web-based interface [25] for quality assessment. In this study, 17 data quality problem types were analysed by 58 users covering 521 resources in DBpedia.
3. DBpedia Ontology
The DBpedia ontology consists of 320 classes which
form a subsumption hierarchy and are described by
1,650 different properties. With a maximal depth of
5, the subsumption hierarchy is intentionally kept
rather shallow which fits use cases in which the on-
tology is visualized or navigated. Figure 3 depicts
a part of the DBpedia ontology, indicating the rela-
tions among the top ten classes of the DBpedia on-
tology, i.e. the classes with the highest number of
instances. The complete DBpedia ontology can be
browsed online at http://mappings.dbpedia.org/server/ontology/classes/.
Fig. 4. Growth of the DBpedia ontology (classes and properties over time).
The DBpedia ontology is maintained and extended
by the community in the DBpedia Mappings Wiki. Fig-
ure 4 depicts the growth of the DBpedia ontology over
time. While the number of classes is not growing too
much due to the already good coverage of the initial
version of the ontology, the number of properties in-
creases over time due to the collaboration on the DBpe-
dia Mappings Wiki and the addition of more detailed
information to infoboxes by Wikipedia editors.
3.1. Mapping Statistics
As of April 2013, there exist mapping communities
for 27 languages, 23 of which are active. Figure 5 shows
Fig. 5. Mapping coverage statistics for all mapping-enabled languages.
statistics for the coverage of these mappings in DBpe-
dia. Figures (a) and (c) refer to the absolute number
of template and property mappings that are defined for
every DBpedia language edition. Figures (b) and (d) de-
pict the percentage of the defined template and property
mappings compared to the total number of available
templates and properties for every Wikipedia language
edition. Figures (e) and (g) show the occurrences (in-
stances) that the defined template and property map-
pings have in Wikipedia. Finally, figures (f) and (h)
give the percentage of the mapped template and property occurrences compared to the total template and property occurrences in a Wikipedia language edition.
It can be observed in the figure that the Portuguese
DBpedia language edition is the most complete regard-
ing mapping coverage. Other language editions such
as Bulgarian, Dutch, English, Greek, Polish and Span-
ish have mapped templates covering more than 50%
of total template occurrences. In addition, almost all
languages have covered more than 20% of property oc-
currences, with Bulgarian and Portuguese reaching up
to 70%.
The mapping activity of the ontology enrichment
process along with the editing of the ten most active
mapping language communities is depicted in Figure 6.
It is interesting to notice that the high mapping ac-
tivity peaks coincide with the DBpedia release dates.
Fig. 7. English property mappings occurrence frequency (both axes
are in log scale)
For instance, the DBpedia 3.7 version was released in September 2011, and the 2nd and 3rd quarters of that
year have a very high activity compared to the 4th quar-
ter. In the last two years (2012 and 2013), most of the
DBpedia mapping language communities have defined
their own chapters and have their own release dates.
Thus, recent mapping activity shows less fluctuation.
Finally, Figure 7 shows the English property map-
pings occurrence frequency. Both axes are in log scale
and represent the number of property mappings (x axis)
that have exactly y occurrences (y axis). The occurrence
frequency follows a long tail distribution. Thus, a low
number of property mappings have a high number of
occurrences and a high number of property mappings
have a low number of occurrences.
3.2. Instance Data
The DBpedia 3.8 release contains localized versions
of DBpedia for 111 languages which have been ex-
tracted from the Wikipedia edition in the correspond-
ing language. For 20 of these languages, we report in
this section the overall number of entities being de-
scribed by the localized versions as well as the num-
ber of facts (i.e. statements) that have been extracted
from infoboxes describing these things. Afterwards, we
report on the number of instances of popular classes
within the 20 DBpedia versions as well as the concep-
tual overlap between the languages.
Table 2 shows the overall number of things, ontol-
ogy and raw-infobox properties, infobox statements and
type statements for the 20 languages. The column head-
ings have the following meaning: LD = Localized data
sets (see Section 2.5); CD = Canonicalized data sets
(see Section 2.5); all = Overall number of instances in
the data set, including instances without infobox data;
with MD = Number of instances for which mapping-
based infobox data exists; Raw Properties = Number
of different properties that are generated by the raw
infobox extractor; Mapping Properties = Number of
different properties that are generated by the mapping-
based infobox extractor; Raw Statements = Number of
statements (facts) that are generated by the raw infobox
extractor; Mapping Statements = Number of statements
(facts) that are generated by the mapping-based infobox
extractor.
It is interesting to see that the English version of DB-
pedia describes about three times more instances than
the second and third largest language editions (French,
German). Comparing the first column of the table with
the second and third reveals which portion of the in-
stances of a specific language correspond to instances
in the English version of DBpedia and which portion
of the instances is described by clean, mapping-based
infobox data. The difference between the number of
properties in the raw infobox data set and the cleaner
mapping-based infobox data set (columns 4 and 5) re-
sults on the one hand from multiple Wikipedia infobox
properties being mapped to a single ontology property.
On the other hand, it reflects the number of mappings
that have been so far created in the Mapping Wiki for a
specific language.
Table 3 reports the number of instances for a set of
popular classes from the third and fourth hierarchy level
of the ontology within the canonicalized DBpedia data
sets for each language. The indented classes are sub-
classes of the superclasses set in bold. The zero val-
ues in the table indicate that no infoboxes have been
mapped to a specific ontology class within the corre-
sponding language so far. Again, the English version of
DBpedia covers by far the most instances.
Table 4 shows, for the canonicalized, mapping-based
data set, how many instances are described in multi-
ple languages. The Instances column contains the total
number of instances per class across all 20 languages,
the second column contains the number of instances
that are described only in a single language version, the
next column contains the number of instances that are
contained in two languages but not in three or more lan-
guages, etc. For example, 12,936 persons are described
in five languages but not in six or more languages. The
number 871,630 for the class Person means that all
20 language versions together describe 871,630 differ-
ent persons. The number is higher than the number of
persons described in the canonicalized English infobox
Fig. 6. Mapping community activity for (a) ontology and (b) 10 most active language editions
Table 2
Basic statistics about Localized DBpedia Editions.
Lang. Inst. LD all Inst. CD all Inst. with MD CD Raw Prop. CD Map. Prop. CD Raw Statem. CD Map. Statem. CD
en 3,769,926 3,769,926 2,359,521 48,293 1,313 65,143,840 33,742,015
de 1,243,771 650,037 204,335 9,593 261 7,603,562 2,880,381
fr 1,197,334 740,044 214,953 13,551 228 8,854,322 2,901,809
it 882,127 580,620 383,643 9,716 181 12,227,870 4,804,731
es 879,091 542,524 310,348 14,643 476 7,740,458 4,383,206
pl 848,298 538,641 344,875 7,306 266 7,696,193 4,511,794
ru 822,681 439,605 123,011 13,522 76 6,973,305 1,389,473
pt 699,446 460,258 272,660 12,851 602 6,255,151 4,005,527
ca 367,362 241,534 112,934 8,696 183 3,689,870 1,301,868
cs 225,133 148,819 34,893 5,564 334 1,857,230 474,459
hu 209,180 138,998 63,441 6,821 295 2,506,399 601,037
ko 196,132 124,591 30,962 7,095 419 1,035,606 417,605
tr 187,850 106,644 40,438 7,512 440 1,350,679 556,943
ar 165,722 103,059 16,236 7,898 268 635,058 168,686
eu 132,877 108,713 41,401 2,245 19 2,255,897 532,709
sl 129,834 73,099 22,036 4,235 470 1,213,801 222,447
bg 125,762 87,679 38,825 3,984 274 774,443 488,678
hr 109,890 71,469 10,343 3,334 158 701,182 151,196
el 71,936 48,260 10,813 2,866 288 206,460 113,838
data set (763,643) listed in Table 3, since there are
infoboxes in non-English articles describing a person
without a corresponding infobox in the English article
describing the same person. Summing up columns 2 to
10+ for the
Person
class, we see that 195,263 persons
are described in two or more languages. The large dif-
ference of this number compared to the total number of
871,630 persons is due to the much smaller size of the
localized DBpedia versions compared to the English
one (cf. Table 2).
3.3. Internationalisation Community
The introduction of the mapping-based infobox
extractor alongside live synchronisation approaches
in [20] allowed the international DBpedia community
to easily define infobox-to-ontology mappings. As a
result of this development, there are presently mappings for 27 languages: Arabic (ar), Bulgarian (bg), Bengali (bn), Catalan (ca), Czech (cs), German (de), Greek (el), English (en), Spanish (es), Estonian (et), Basque (eu), French (fr), Irish (ga), Hindi (hi), Croatian (hr), Hungarian (hu), Indonesian (id), Italian (it), Japanese (ja), Korean (ko), Dutch (nl), Polish (pl), Portuguese (pt), Russian (ru), Slovene (sl), Turkish (tr) and Urdu (ur). The DBpedia 3.7 release (http://blog.dbpedia.org/2011/09/11/) in September 2011 was the first DBpedia release to use the localized I18n (Internationalisation) DBpedia extraction framework [24].
Table 3
Number of instances per class within 10 localized DBpedia versions.
Class en it pl es pt fr de ru ca hu
Person 763,643 145,060 70,708 65,337 43,057 62,942 33,122 18,620 7,107 15,529
Athlete 185,126 47,187 30,332 19,482 14,130 21,646 31,237 0 721 4,527
Artist 61,073 12,511 16,120 25,992 10,571 13,465 0 0 2,004 3,821
Politician 23,096 0 8,943 5,513 3,342 0 0 12,004 1,376 760
Place 572,728 141,101 182,727 132,961 116,660 80,602 131,766 67,932 73,078 18,324
Popul.Place 387,166 138,077 167,034 121,204 109,418 72,252 79,410 63,826 72,743 15,535
Building 60,514 1,270 2,946 3,570 803 921 83 43 0 527
River 24,267 0 1,383 2 4,149 3,333 6,707 3,924 0 565
Organisation 192,832 4,142 12,193 11,710 10,949 17,513 16,973 1,598 1,399 3,993
Company 44,516 4,142 2,566 975 1,903 5,832 7,200 0 440 618
Educ.Inst. 42,270 0 599 1,207 270 1,636 2,171 1,010 0 115
Band 27,061 0 2,993 0 4,476 3,868 5,368 0 263 802
Work 333,269 51,918 32,386 36,484 33,869 39,195 18,195 34,363 5,240 9,777
Music.Work 159,070 23,652 14,987 15,911 17,053 16,206 0 6,588 697 4,704
Film 71,715 17,210 9,467 9,396 8,896 9,741 15,038 12,045 3,859 2,320
Software 27,947 5,682 3,050 4,833 3,419 5,733 2,368 0 606 857
Table 4
Cross-language overlap: Number of instances that are described in multiple languages.
Class Instances 1 2 3 4 5 6 7 8 9 10+
Person 871,630 676,367 94,339 42,382 21,647 12,936 8,198 5,295 3,437 2,391 4,638
Place 643,260 307,729 150,349 45,836 36,339 20,831 13,523 20,808 31,422 11,262 5,161
Organisation 206,670 160,398 22,661 9,312 5,002 3,221 2,072 1,421 928 594 1,061
Work 360,808 243,706 54,855 23,097 12,605 8,277 5,732 4,007 2,911 1,995 3,623
At the time of writing, DBpedia chapters for 14 lan-
guages have been founded: Basque, Czech, Dutch, En-
glish, French, German, Greek, Italian, Japanese, Ko-
rean, Polish, Portuguese, Russian and Spanish.
10
Be-
sides providing mappings from infoboxes in the corre-
sponding Wikipedia editions, DBpedia chapters organ-
ise a local community and provide hosting for data sets
and associated services.
While at the moment chapters are defined by ownership of the IP and server of the subdomain A record (e.g. http://ko.dbpedia.org) given by the DBpedia maintainers, the DBpedia Internationalisation Committee (http://wiki.dbpedia.org/Internationalization) is establishing its structure, and each language edition has a representative with a vote in elections. In some cases (e.g. Greek, http://el.dbpedia.org, and Dutch, http://nl.dbpedia.org) the existence of a local DBpedia chapter has had a positive effect on the creation of localized LOD clouds [24].
In the weeks leading to a new release, the DBpe-
dia project organises a mapping sprint, where commu-
nities from each language work together to improve
mappings, increase coverage and detect bugs in the ex-
traction process. The progress of the mapping effort
is tracked through statistics on the number of mapped
templates and properties, as well as the number of
times these templates and properties occur in Wikipedia.
These statistics provide an estimate of the coverage of
each Wikipedia edition in terms of how many entities
will be typed and how many properties from those en-
tities will be extracted. Therefore, they can be used
by each language edition to prioritize properties and
templates with higher impact on the coverage.
The mapping statistics have also been used as a way
to promote a healthy competition between language
editions. A sprint page was created with bar charts that
show how close each language is to achieving to-
tal coverage (as shown in Figure 5), and line charts
showing the progress over time highlighting when one
language is overtaking another in their race for higher
coverage. The mapping sprints have served as a great
motivator for the crowd-sourcing efforts, as can be
noted from the increase in the number of mapping con-
tributions in the weeks leading to a release.
4. Live Synchronisation
Wikipedia articles are continuously revised at a very
high rate, e.g. the English Wikipedia, in June 2013,
had approximately 3.3 million edits per month, which is equal to 77 edits per minute (http://stats.wikimedia.org/EN/SummaryEN.htm). This high change
frequency leads to DBpedia data quickly being out-
dated, which in turn leads to the need for a methodology
to keep DBpedia in synchronisation with Wikipedia.
As a consequence, the DBpedia Live system was de-
veloped, which works on a continuous stream of up-
dates from Wikipedia and processes that stream on the
fly [20,36]. It allows extracted data to stay up-to-date
with a small delay of at most a few minutes. Since the
English Wikipedia is the largest among all Wikipedia
editions with respect to the number of articles and the
number of edits per month, it was the first language
DBpedia Live supported (http://live.dbpedia.org). Meanwhile, DBpedia Live for Dutch (http://live.nl.dbpedia.org) was developed.
4.1. DBpedia Live System Architecture
In order for live synchronisation to be possible, we
need access to the changes made in Wikipedia. The
Wikimedia foundation kindly provided us access to
their update stream using the OAI-PMH protocol [27].
This protocol allows a programme to pull page updates
in XML via HTTP. A Java component, serving as a
proxy, constantly retrieves new updates and feeds them
to the DBpedia Live framework. This proxy is nec-
essary to decouple the stream from the framework to
simplify maintenance of the software. The live extrac-
tion workflow uses this update stream to extract new
knowledge upon relevant changes in Wikipedia articles.
The overall architecture of DBpedia Live is indicated
in Figure 8. The major components of the system are
as follows:
Fig. 8. Overview of DBpedia Live extraction framework.
– Local Wikipedia Mirror: A local copy of a Wikipedia language edition is installed which is kept in real-time synchronisation with its live version using the OAI-PMH protocol. Keeping a local Wikipedia mirror allows us to exceed any access limits posed by Wikipedia.
– Mappings Wiki: The DBpedia Mappings Wiki, described in Section 2.4, serves as secondary input source. Changes of the mappings wiki are also consumed via an OAI-PMH stream. Note that a single mapping change can affect a high number of DBpedia resources.
– DBpedia Live Extraction Manager: This is the core component of the DBpedia Live extraction architecture. The manager takes feeds of pages for re-processing as input and applies all the enabled extractors. After processing a page, the extracted triples are a) inserted into a backend triple store (in our case Virtuoso [10]), updating the old triples, and b) saved as changesets into a compressed N-Triples file structure.
– Synchronisation Tool: This tool allows third parties to keep DBpedia Live mirrors up-to-date by harvesting the produced changesets.
4.2. Features of DBpedia Live
The core components of the DBpedia Live Extraction
framework provide the following features:
Mapping-Affected Pages: The update of all pages
that are affected by a mapping change.
Unmodified Pages: The update of unmodified
pages at regular intervals.
Changesets Publication: The publication of triple-
changesets.
Synchronisation Tool: A synchronisation tool for
harvesting updates to DBpedia Live mirrors.
Data Isolation: Separation of data from different sources.
Mapping-Affected Pages: Whenever an infobox map-
ping change occurs, all the Wikipedia pages that use
that infobox are reprocessed. Taking Figure 2 as an
example, if a new property mapping is introduced (e.g. dbo:translator) or an existing one (e.g. dbo:illustrator) is
updated or deleted, then all entities belonging to the
class dbo:Book are reprocessed. Thus, upon a mapping
change, we identify all the affected Wikipedia pages
and feed them for reprocessing.
Unmodified Pages: Extraction framework improve-
ments or activation / deactivation of DBpedia extractors
might never be applied to rarely modified pages. To
overcome this problem, we obtain a list of the pages
which have not been processed over a period of time
(30 days in our case) and feed that list to the DBpedia
Live extraction framework for reprocessing. This feed
has a lower priority than the update or the mapping
affected pages feed and ensures that all articles reflect a
recent state of the output of the extraction framework.
Publication of Changesets: Whenever a Wikipedia
article is processed, we get two disjoint sets of triples. A
set for the added triples, and another set for the deleted
triples. We write those two sets into N-Triples files,
compress them, and publish the compressed files as
changesets. If another DBpedia Live mirror wants to
synchronise with the DBpedia Live endpoint, it can just
download those files, decompress and integrate them.
Synchronisation Tool: The synchronisation tool en-
ables a DBpedia Live mirror to stay in synchronisation
with our live endpoint. It downloads the changeset files
sequentially, decompresses them and updates the target
SPARQL endpoint via insert and delete operations.
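A minimal Python sketch of this synchronisation step is shown below; it assumes that the added and removed triples of one changeset have already been downloaded and decompressed into strings, and it posts them to a SPARQL 1.1 Update endpoint over HTTP. The endpoint URL and file names in the usage comment are placeholders.

import urllib.request
import urllib.parse

def apply_changeset(endpoint_url, added_ntriples, removed_ntriples):
    """Apply one changeset to a SPARQL 1.1 Update endpoint.

    `added_ntriples` / `removed_ntriples` are strings in N-Triples syntax,
    e.g. the content of a decompressed added/removed changeset file pair."""
    update = ""
    if removed_ntriples.strip():
        update += "DELETE DATA { %s };\n" % removed_ntriples
    if added_ntriples.strip():
        update += "INSERT DATA { %s };\n" % added_ntriples
    data = urllib.parse.urlencode({"update": update}).encode("utf-8")
    request = urllib.request.Request(endpoint_url, data=data, method="POST")
    with urllib.request.urlopen(request) as response:
        return response.status

# Hypothetical usage against a local mirror:
# apply_changeset("http://localhost:8890/sparql",
#                 open("000001.added.nt").read(),
#                 open("000001.removed.nt").read())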
Data Isolation: In order to keep the data isolated,
DBpedia Live keeps different sources of data in
different SPARQL graphs. Data from the article
update feeds are contained in the graph with the URI http://live.dbpedia.org, static data (i.e. links to the LOD cloud) are kept in http://static.dbpedia.org, and the DBpedia ontology is stored in http://dbpedia.org/ontology. All data is also accessible under the http://dbpedia.org graph for combined queries. Next versions of DBpe-
dia Live will also separate data from the raw infobox
extraction and mapping-based infobox extraction.
5. Interlinking
DBpedia is interlinked with numerous external data
sets following the Linked Data principles. In this sec-
tion, we give an overview of the number and types
of outgoing links that point from DBpedia into other
data sets, as well as the external data sets that set links
pointing to DBpedia resources.
5.1. Outgoing Links
Similar to the DBpedia ontology, DBpedia also fol-
lows a community approach for adding links to other
third party data sets. The DBpedia project maintains
a link repository (https://github.com/dbpedia/dbpedia-links) for which conventions for adding
linksets and linkset metadata are defined. The adher-
ence to those guidelines is supervised by a linking com-
mittee. Linksets which are added to the repository are
used for the subsequent official DBpedia release as well
as for DBpedia Live. Table 5 lists the linksets created
by the DBpedia community as of April 2013. The first
column names the data set that is the target of the links.
The second and third column contain the predicate that
is used for linking as well as the overall number of links
that is set between DBpedia and the external data set.
The last column names the tool that was used to gener-
ate the links. The value S refers to Silk, L to LIMES,
C to custom script and a missing entry means that the
dataset is copied from the previous releases and not
regenerated.
An example of the usage of links is the combination of data about European Union project funding (FTS) [32] and data about countries in DBpedia. The query below compares funding per year (from FTS) and country with the gross domestic product of a country (from DBpedia). It can be executed against the endpoint http://fts.publicdata.eu/sparql; results are available at http://bit.ly/1c2mIwQ.
SELECT * { {
  SELECT ?ftsyear ?ftscountry (SUM(?amount) AS ?funding) {
    ?com rdf:type fts-o:Commitment .
    ?com fts-o:year ?year .
    ?year rdfs:label ?ftsyear .
    ?com fts-o:benefit ?benefit .
    ?benefit fts-o:detailAmount ?amount .
    ?benefit fts-o:beneficiary ?beneficiary .
    ?beneficiary fts-o:country ?country .
    ?country owl:sameAs ?ftscountry .
  } } {
  SELECT ?dbpcountry ?gdpyear ?gdpnominal {
    ?dbpcountry rdf:type dbo:Country .
    ?dbpcountry dbp:gdpNominal ?gdpnominal .
    ?dbpcountry dbp:gdpNominalYear ?gdpyear .
  } }
  FILTER ((?ftsyear = str(?gdpyear)) &&
          (?ftscountry = ?dbpcountry)) }
Table 5
Data sets linked from DBpedia
Data set Predicate Count Tool
Amsterdam Museum owl:sameAs 627 S
BBC Wildlife Finder owl:sameAs 444 S
Book Mashup rdf:type, owl:sameAs 9 100
Bricklink dc:publisher 10 100
CORDIS owl:sameAs 314 S
Dailymed owl:sameAs 894 S
DBLP Bibliography owl:sameAs 196 S
DBTune owl:sameAs 838 S
Diseasome owl:sameAs 2 300 S
Drugbank owl:sameAs 4 800 S
EUNIS owl:sameAs 3 100 S
Eurostat (Linked Stats) owl:sameAs 253 S
Eurostat (WBSG) owl:sameAs 137
CIA World Factbook owl:sameAs 545 S
flickr wrappr dbp:hasPhotoCollection 3 800 000 C
Freebase owl:sameAs 3 600 000 C
GADM owl:sameAs 1 900
GeoNames owl:sameAs 86 500 S
GeoSpecies owl:sameAs 16 000 S
GHO owl:sameAs 196 L
Project Gutenberg owl:sameAs 2 500 S
Italian Public Schools owl:sameAs 5 800 S
LinkedGeoData owl:sameAs 103 600 S
LinkedMDB owl:sameAs 13 800 S
MusicBrainz owl:sameAs 23 000
New York Times owl:sameAs 9 700
OpenCyc owl:sameAs 27 100 C
OpenEI (Open Energy) owl:sameAs 678 S
Revyu owl:sameAs 6
Sider owl:sameAs 2 000 S
TCMGeneDIT owl:sameAs 904
UMBEL rdf:type 896 400
US Census owl:sameAs 12 600
WikiCompany owl:sameAs 8 300
WordNet dbp:wordnet type 467 100
YAGO2 rdf:type 18 100 000
Sum 27 211 732
In addition to providing outgoing links on the instance level, DBpedia also sets links on the schema level, pointing from the DBpedia ontology to equivalent terms in other schemas. Links to other schemata can be set by the community within the DBpedia Mappings Wiki by using owl:equivalentClass in class templates and owl:equivalentProperty in datatype or object property templates, respectively. In particular, in 2011 Google, Microsoft, and Yahoo! announced their collaboration on Schema.org, a collection of vocabularies for marking up content on web pages. The DBpedia 3.8 ontology contains 45 equivalent class and 31 equivalent property links pointing to http://schema.org terms.
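For illustration, such schema-level links can be retrieved from the public endpoint with a query along the following lines (a sketch added here, assuming that filtering on the Schema.org namespace is an adequate way to select the relevant targets):

PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT ?dbpediaClass ?schemaOrgClass
WHERE {
  # classes in the DBpedia ontology declared equivalent to Schema.org terms
  ?dbpediaClass owl:equivalentClass ?schemaOrgClass .
  FILTER (STRSTARTS(STR(?schemaOrgClass), "http://schema.org/"))
}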
5.2. Incoming Links
DBpedia is being linked to from a variety of data sets. The overall number of incoming links, i.e. links pointing to DBpedia from other data sets, is 39,007,478 according to the Data Hub.19 However, those counts are entered by users and may not always be valid and up-to-date.
In order to identify actually published and online data sets that link to DBpedia, we used Sindice [39]. The Sindice project crawls RDF resources on the web and indexes those resources. In Sindice, a data set is defined by the second-level domain name of the entity’s URI, e.g. all resources available at the domain fu-berlin.de are considered to belong to the same data set. A triple is considered to be a link if the data sets of subject and object are different. Furthermore, the Sindice data we used for the analysis only considers authoritative entities: the data set of the subject of a triple must match the domain it was retrieved from, otherwise it is not considered. Sindice computes a graph summary [8] over all resources it stores. With the help of the Sindice team, we examined this graph summary to obtain all links pointing to DBpedia. As shown in Table 7, Sindice knows about 248 data sets linking to DBpedia. 70 of those data sets link to DBpedia via owl:sameAs, but other link predicates are also very common, as is evident from that table. In total, Sindice has indexed 4 million links pointing to DBpedia. Table 6 lists the 10 data sets which set the most links to DBpedia.
It should be noted that the data in Sindice is not complete; for instance, it does not contain all data sets that are catalogued by the DataHub20. On the other hand, it crawls RDFa snippets and converts microformats, which are not captured by the DataHub. Despite this incompleteness, the relative comparison of different datasets can still give us insights. Therefore, we analysed the link structure of all Sindice datasets using the Sindice cluster. Table 8 shows the datasets with the most incoming links. Those are authorities in the network structure of the
19 See http://wiki.dbpedia.org/Interlinking for details.
20 http://datahub.io/
Table 6
Top 10 data sets in Sindice ordered by the number of links to DBpedia.
Data set Link Predicate Count Link Count
okaboo.com 4 2,407,121
tfri.gov.tw 57 837,459
naplesplus.us 2 212,460
fu-berlin.de 7 145,322
freebase.com 108 138,619
geonames.org 3 125,734
opencyc.org 3 19,648
geospecies.org 10 16,188
dbrec.net 3 12,856
faviki.com 5 12,768
Table 7
Sindice summary statistics for incoming links to DBpedia.
Metric Value
Total links: 3,960,212
Total distinct data sets: 248
Total distinct predicates: 345
Table 8
Top 10 datasets by incoming links in Sindice.
domain datasets links
purl.org 498 6,717,520
dbpedia.org 248 3,960,212
creativecommons.org 2,483 3,030,910
identi.ca 1,021 2,359,276
l3s.de 34 1,261,487
rkbexplorer.com 24 1,212,416
nytimes.com 27 1,174,941
w3.org 405 658,535
geospecies.org 13 523,709
livejournal.com 14,881 366,025
web of data and DBpedia is currently ranked second in
terms of incoming links.
6. DBpedia Usage Statistics
DBpedia is served on the web in three forms: First,
it is provided in the form of downloadable data sets
where each data set contains the results of one of the
extractors listed in Table 1. Second, DBpedia is served
via a public SPARQL endpoint and, third, it provides dereferenceable URIs according to the Linked Data principles. In this section, we explore some of the statistics gathered during the hosting of DBpedia over the last two years.
Fig. 9. The download count (in thousands) and download volume (in TB) of the English language edition of DBpedia, per quarter of 2012.
6.1. Download Statistics for the DBpedia Data Sets
DBpedia covers more than 100 languages, but those languages vary with respect to their download popularity as well. The top five languages with respect to download volume are English, Chinese, German, Catalan, and French, respectively. The download count and download volume of the English language edition are shown in Figure 9. To host the DBpedia data set downloads, a bandwidth of approximately 6 TB per month is currently needed.
Furthermore, DBpedia consists of several data sets which vary with respect to their download popularity. The download count and the download volume of each data set during the year 2012 are depicted in Figure 10. In those statistics we filtered out all IP addresses that requested a file more than 1,000 times per month.21 Pagelinks are the most downloaded data set, although they are not semantically rich, as they do not reveal which type of link exists between two resources. Presumably, they are used for network analysis or for providing relevant links in user interfaces, and they are downloaded more often because they are not provided via the official SPARQL endpoint.
6.2. Public Static DBpedia SPARQL Endpoint
The main public DBpedia SPARQL endpoint22 is hosted using the Virtuoso Universal Server (Enterprise Edition) version 6.4 software in a 4-node cluster configuration. This cluster setup provides parallelization of query execution, even when the cluster nodes are on the same machine, as splitting a query over several nodes allows better use of parallel threads on modern
multi-core CPUs on standard commodity hardware.
Virtuoso supports horizontal scale-out, either by re-
distributing the existing cluster nodes onto multiple ma-
21 The IP address was only filtered for that specific file and month in those cases.
22 http://dbpedia.org/sparql
Fig. 10. The download count and download volume (in GB) of the DBpedia data sets.
Table 9
Hardware of the machines serving the public SPARQL endpoint.
DBpedia Configuration
3.3 - 3.4 AMD Opteron 8220 2.80Ghz, 4 Cores, 32GB
3.5 - 3.7 Intel Xeon E5520 2.27Ghz, 8 Cores, 48GB
3.8 Intel Xeon E5-2630 2.30GHz, 8 Cores, 64GB
chines, or by adding several separate clusters with a
round robin HTTP front-end. This allows the cluster
setup to grow in line with desired response times for
an RDF data set collection. As the size of the DBpedia
data set increased and its use by the Linked Data com-
munity grew, the project migrated to increasingly pow-
erful, but still moderately priced, hardware as shown in
Table 9. The Virtuoso instance is configured to process queries within a 1,200 second timeout window and a maximum result set size of 50,000 rows. It provides OFFSET and LIMIT support for paging alongside the ability to produce partial results.
The log files used in the following analysis excluded
traffic generated by:
1. clients that have been temporarily rate limited after a burst period,
2. clients that have been banned after misuse,
3. applications, spiders and other crawlers that are blocked after frequently hitting the rate limit or generally use too many resources.
Virtuoso supports HTTP Access Control Lists
(ACLs) which allow the administrator to rate limit cer-
tain IP addresses or whole IP ranges. A maximum num-
ber of requests per second (currently 15) as well as
a bandwidth limit per request (currently 10MB) are
enforced. If the client software can handle compression, replies are compressed to further save bandwidth.

Table 11
Number of unique sites accessing DBpedia endpoints.
DBpedia   Avg/Day   Median   Stdev   Maximum
3.3       5,824     5,977    1,046   7,865
3.4       6,656     6,704    947     8,312
3.5       9,392     9,432    1,643   12,584
3.6       10,810    10,810   1,612   14,798
3.7       17,077    17,091   6,782   33,288
3.8       14,513    14,493   2,783   20,510
Exception rules can be configured for multiple clients
hidden behind a NAT firewall (appearing as a single IP
address) or for temporary requests for higher rate limits.
When a client hits an ACL limit, the system reports an appropriate HTTP status code23 like 509 ("Bandwidth Limit Exceeded") and quickly drops the connection. The system further uses an iptables-based firewall for permanent blocking of clients identified by their IP addresses.
6.3. Public Static Endpoint Statistics
The statistics presented in this section were extracted from reports generated by Webalizer v2.21.24 Table 10 and Table 11 show various DBpedia SPARQL endpoint usage statistics for the DBpedia 3.3 to 3.8 releases. Note that the usage of all endpoints mentioned in Table 12 is counted. The Avg/Day column represents the average number of hits (resp. visits/sites) per day, followed by
23 http://en.wikipedia.org/wiki/List_of_HTTP_status_codes
24 http://www.webalizer.org
Table 10
Number of endpoint hits (top) and visits (bottom).

Hits:
DBpedia   Avg/Day      Median       Stdev        Maximum
3.3       733,811      711,306      188,991      1,319,539
3.4       1,212,549    1,165,893    351,226      2,371,657
3.5       1,122,612    1,035,444    386,454      2,908,720
3.6       1,328,355    1,286,750    256,945      2,495,031
3.7       2,085,399    1,930,728    1,057,398    8,873,473
3.8       2,910,410    2,717,775    1,085,640    7,678,490

Visits:
DBpedia   Avg/Day   Median   Stdev    Maximum
3.3       9,750     9,775    2,036    13,126
3.4       11,329    11,473   1,546    14,198
3.5       16,592    16,902   2,969    23,129
3.6       19,471    17,664   5,691    56,064
3.7       23,972    22,262   10,741   127,869
3.8       16,851    16,711   2,960    27,416
the Median and Standard Deviation. The last column
shows the maximum number of hits (resp. visits/sites)
that was recorded on a single day for each data set
version. Visits (i.e. sessions of subsequent queries from
the same client) are determined by a floating 30 minute
time window. All requests from behind a NAT firewall
are logged under the same external IP address and are
therefore counted towards the same visit if they occur
within the 30 minute interval.
Table 10 shows the increasing popularity of DBpedia.
There is a distinct dip in hits to the SPARQL endpoint in
DBpedia 3.5, which is partially due to more strict initial
limits for bot-related traffic which were later relaxed.
The sudden drop of visits between the 3.7 and the 3.8
data sets can be attributed to:
1. applications starting to use their own private DBpedia endpoint,
2. blocking of apps that were abusing the DBpedia endpoint,
3. uptake of the language-specific DBpedia endpoints and DBpedia Live.
6.4. Query Types and Trends
The DBpedia server is not only a SPARQL endpoint,
but also serves as a Linked Data Hub returning re-
sources in a number of different formats. For each data set we randomly selected 14 days' worth of log files and processed those in order to show the various services called. Table 12 shows the number of hits to the various
endpoints.
The /resource endpoint uses the Accept: line in the HTTP header sent by the client to return an HTTP 30x status code that redirects the client to either the /page (HTML-based) or /data (formats like RDF/XML or Turtle) equivalent of the article. Clients also frequently mint their own URLs for the /page or /data version of an article, or download the raw data directly. This explains why the count of /page and /data hits in the table is larger than the number of hits on the
Table 12
Hits per service to http://dbpedia.org in thousands.
Endpoint 3.3 3.4 3.5 3.6 3.7 3.8
/data 1,355 2,681 2,230 2,547 3,714 4,246
/ontology 80 168 142 178 116 97
/page 2,634 4,703 1,810 1,687 3,787 7,377
/property 231 311 137 176 176 128
/resource 2,976 4,080 2,554 2,309 4,436 7,143
/sparql 2,090 4,177 2,266 9,112 15,902 15,475
other 252 420 342 277 579 695
total 9,619 16,541 9,434 16,286 28,710 35,142
Fig. 11. Traffic to the Linked Data service versus the SPARQL endpoint.
/resource endpoint. The /ontology and /property end-
points return meta information about the DBpedia on-
tology. While all of these endpoints themselves may
use SPARQL queries to generate various page content,
these requests are handled by the internal Virtuoso en-
gine directly and do not show up as extra calls to the
/sparql endpoint in our analysis.
Figure 11 shows the percentages of traffic hits that
were generated by the main endpoints. As we can see,
the usage of the SPARQL endpoint has doubled from
about 22 percent in 2009 to about 44 percent in 2013.
However, this still means that 56 percent of traffic hits
are directed to the Linked Data service.
In Table 13, we focussed on the calls to the /sparql
endpoint and counted the number of statements per type.
Table 13
Hits per statement type in thousands.
Statement 3.3 3.4 3.5 3.6 3.7 3.8
ask 56 269 360 326 159 677
construct 55 31 14 11 22 38
describe 11 8 4 7 62 111
select 1891 3663 1658 8030 11204 13516
unknown 78 206 229 738 4455 1134
total 2090 4177 2266 9112 15902 15475
Table 14
Trends in SPARQL select (rounded values in %).
Statement 3.3 3.4 3.5 3.6 3.7 3.8
distinct 19.5 11.4 17.3 19.4 13.3 25.4
filter 45.7 13.7 31.8 25.3 29.2 36.1
functions 8.8 6.9 23.5 21.3 25.5 25.9
geo 27.7 7.0 39.6 6.7 9.3 9.2
group 0.0 0.0 0.0 0.0 0.0 0.2
limit 4.6 6.5 11.6 10.5 7.7 7.8
optional 30.6 23.8 47.3 26.3 16.7 17.2
order 2.3 1.3 1.9 3.2 1.2 1.6
union 3.3 3.2 26.4 11.3 12.1 20.6
As the log files only record the full SPARQL query on
a GET request, all the POST requests are counted as
unknown.
Finally, we analyzed each SPARQL query and counted the use of keywords and constructs like:
– DISTINCT
– FILTER
– FUNCTIONS like CONCAT, CONTAINS, ISIRI
– use of GEO objects
– GROUP BY
– LIMIT / OFFSET
– OPTIONAL
– ORDER BY
– UNION
For the GEO objects we counted the use of SPARQL
PREFIX geo: and wgs84*: declarations and usage in
property tags. Table 14 shows the use of various key-
words as a percentage of the total select queries made
to the /sparql endpoint for the sample sets. In general,
we observed that queries became more complex over
time indicating an increasing maturity and higher ex-
pectations of the user base.
6.5. Statistics for DBpedia Live
Since its official release at the end of June 2011,
DBpedia Live attracted a steadily increasing number
of users. Furthermore, more users tend to use the syn-
chronisation tool to synchronise their own DBpedia
Live mirrors. This leads to an increasing number of live update requests, i.e. changeset downloads. Figure 12 indicates the number of daily SPARQL and synchronisation requests sent to the DBpedia Live endpoint in the period between August 2012 and January 2013.
Fig. 12. Number of daily requests sent to the DBpedia Live endpoint for a) SPARQL queries and b) synchronisation requests, from August 2012 until January 2013.
7. Use Cases and Applications
Due to DBpedia’s coverage of various domains as
well as its steady growth as a hub on the Web of Data,
the data sets provided by DBpedia can serve many purposes. Such applications include improving the search and exploration of Wikipedia data, providing data for applications and mashups, as well as text analysis and annotation tools.
7.1. Natural Language Processing
DBpedia can support many tasks in Natural Language Processing (NLP) [33]. For that purpose, DBpedia includes a number of specialized data sets25. For instance, the lexicalizations data set can be used to estimate the ambiguity of phrases, to help select unambiguous identifiers for ambiguous phrases, or to provide alternative names for entities, just to mention a few examples. Topic signatures can be useful in tasks such as query expansion or document summarization, and have
25 http://wiki.dbpedia.org/Datasets/NLP
been successfully employed to classify ambiguously described images as good depictions of DBpedia entities [13]. The thematic concepts data set of resources can be used for creating a corpus from Wikipedia to be used as training data for topic classifiers, among other things (see below). The grammatical gender data set can, for example, be used to add a gender feature in co-reference resolution.
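As a minimal sketch of obtaining alternative names (added here for illustration and using redirect labels as a simpler stand-in for the lexicalizations data set; the choice of dbr:Berlin is arbitrary):

PREFIX dbo:  <http://dbpedia.org/ontology/>
PREFIX dbr:  <http://dbpedia.org/resource/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?surfaceForm
WHERE {
  # titles of pages redirecting to dbr:Berlin serve as alternative surface forms
  ?redirect dbo:wikiPageRedirects dbr:Berlin ;
            rdfs:label ?surfaceForm .
  FILTER (lang(?surfaceForm) = "en")
}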
7.1.1. Annotation: Entity Disambiguation
An important use case for NLP is annotating texts
or other content with semantic information. Named
entity recognition and disambiguation – also known as
key phrase extraction and entity linking tasks – refers
to the task of finding real world entities in text and
linking them to unique identifiers. One of the main
challenges in this regard is ambiguity: an entity name,
or surface form, may be used in different contexts to
refer to different concepts. Many different methods
have been developed to resolve this ambiguity with
fairly high accuracy [22].
As DBpedia reflects a vast amount of structured real
world knowledge obtained from Wikipedia, DBpedia
URIs can be used as identifiers for the majority of do-
mains in text annotation. Consequently, interlinking text
documents with Linked Data enables the Web of Data
to be used as background knowledge within document-
oriented applications such as semantic search or faceted
browsing (cf. Section 7.3).
Many applications performing this task of annotating
text with entities in fact use DBpedia entities as targets.
For example, DBpedia Spotlight [34] is an open source tool26 including a free web service that detects mentions of DBpedia resources in text. It uses the lexicalizations in conjunction with the topic signatures data set as a context model in order to be able to disambiguate found mentions. The main advantage of this system is its comprehensiveness and flexibility, allowing one to configure it based on quality measures such as prominence, contextual ambiguity, topical pertinence and disambiguation confidence, as well as the DBpedia ontology. The resources that should be annotated can be specified by a list of resource types or by more complex relationships within the knowledge base, described as SPARQL queries.
There are numerous other NLP APIs that link entities in text to DBpedia: AlchemyAPI27, Semantic API from Ontos28, Open Calais29 and Zemanta30, among others. Furthermore, the DBpedia ontology has been used for training named entity recognition systems (without disambiguation) in the context of the Apache Stanbol project31.
26 http://spotlight.dbpedia.org/
27 http://www.alchemyapi.com/
A related project is ImageSnippets32, which is a system for annotating images. It uses DBpedia as one of its main datasets for unambiguously identifying entities depicted within an image.
Tag disambiguation. Similar to linking entities in text to DBpedia, user-generated tags attached to multimedia content such as music, photos or videos can also be connected to the Linked Data hub. This has previously been implemented by letting the user resolve ambiguities. For example, Faviki33 suggests a set of DBpedia entities coming from Zemanta’s API and lets the user choose the desired one. Alternatively, similar disambiguation techniques as mentioned above can be utilized to choose entities from tags automatically [14]. The BBC34 [23] employs DBpedia URIs for tagging their programmes. Short clips and full episodes are tagged using two different tools while utilizing DBpedia to benefit from global identifiers that can be easily integrated with other knowledge bases.
7.1.2. Question Answering
DBpedia provides a wealth of human knowledge across different domains and languages, which makes it an excellent target for question answering and keyword search approaches. One of the most prominent efforts in this area is the DeepQA project, which resulted in the IBM Watson system [12]. The Watson system won a $1 million prize in Jeopardy! and relies on several data sets including DBpedia35. DBpedia is also the primary target for several QA systems in the Question Answering over Linked Data (QALD) workshop series36. Several QA systems, such as TBSL [44], PowerAqua [31], FREyA [9] and QAKiS [6], have been applied to DBpedia using the QALD benchmark questions. DBpedia is interesting as a test case for such
28 http://www.ontos.com/
29 http://www.opencalais.com/
30 http://www.zemanta.com/
31 http://stanbol.apache.org/
32 http://www.imagesnippets.com
33 http://www.faviki.com/
34 http://bbc.co.uk
35 http://www.aaai.org/Magazine/Watson/watson.php
36 http://greententacle.techfak.uni-bielefeld.de/~cunger/qald/
systems. Due to its large schema and data size as well
as its topic diversity, it provides significant scientific
challenges. In particular, it would be difficult to provide
capable QA systems for DBpedia based only on simple
patterns or via domain specific dictionaries, because of
its size and broad coverage. Therefore, a question answering system that is able to reliably answer questions over DBpedia could be seen as a truly intelligent system. In the latest QALD series, question
answering benchmarks also exploit national DBpedia
chapters for multilingual question answering.
Similarly, the slot filling task in natural language
processing poses the challenge of finding values for a
given entity and property from mining text. This can
be viewed as question answering with static questions
but changing targets. DBpedia can be exploited for fact
validation or training data in this task, as was done by
the Watson team [4] and others [28].
7.2. Digital Libraries and Archives
In the case of libraries and archives, DBpedia could offer information on a broad range of domains. In particular, DBpedia could provide:
– Context information for bibliographic and archive records: Background information such as an author’s demographics, a film’s homepage or an image could be used to enhance user interaction.
– Stable and curated identifiers for linking: DBpedia is a hub of Linked Open Data. Thus, (re-)using commonly used identifiers could ease integration with other libraries or knowledge bases.
– A basis for a thesaurus for subject indexing: The broad range of Wikipedia topics in addition to the stable URIs could form the basis for a global classification system.
Libraries have already invested both in Linked Data and in Wikipedia (and transitively in DBpedia) through the realization of the Virtual International Authority File (VIAF) project.37 Recently, it was announced that VIAF added a total of 250,000 reciprocal authority links to Wikipedia.38 These links are already harvested by DBpedia Live and will also be included in the next static DBpedia release. This creates a huge opportunity for libraries that use VIAF to get connected to DBpedia and the LOD cloud in general.
37 http://viaf.org
38 Accessed on 12/02/2013: http://www.oclc.org/research/news/2012/12-07a.html
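To sketch how such VIAF links could be consumed once they are available in DBpedia (a hedged assumption about the eventual owl:sameAs representation, not an official query), VIAF identifiers attached to DBpedia resources could be retrieved as follows:

PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT ?dbpediaResource ?viafId
WHERE {
  # owl:sameAs links whose targets lie in the VIAF namespace
  ?dbpediaResource owl:sameAs ?viafId .
  FILTER (STRSTARTS(STR(?viafId), "http://viaf.org/viaf/"))
}
LIMIT 100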
7.3. Knowledge Exploration
Since DBpedia spans many domains and has a di-
verse schema, many knowledge exploration tools either
used DBpedia as a testbed or were specifically built
for DBpedia. We give a brief overview of tools and
structure them in categories:
Facet Based Browsers. An award-winning39 facet-based browser used the Neofonie search engine to combine facts in DBpedia with full-text from Wikipedia in order to compute hierarchical facets [15]. Another facet-based browser, which allows users to create complex graph structures of facets in a visually appealing interface and to filter them, is gFacet [16]. A generic SPARQL-based facet explorer, which also uses a graph-based visualisation of facets, is LODLive [7]. The OpenLink built-in facet-based browser40 is an interface that enables developers to explore DBpedia, compute aggregations over facets and view the underlying SPARQL queries.
Search and Querying. The DBpedia Query Builder41 allows developers to easily create simple SPARQL queries, more specifically sets of triple patterns, via intelligent autocompletion. The autocompletion functionality ensures that only URIs that lead to solutions are suggested to the user. The RelFinder [17] tool provides an intuitive interface for exploring the neighborhood of and connections between resources specified by the user. For instance, the user can view the shortest paths connecting certain persons in DBpedia. SemLens [18] allows users to create statistical analysis queries and explore correlations in RDF data and DBpedia in particular.
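As a hedged sketch of the kind of query such exploration tools issue internally (an illustration added here, not the actual RelFinder implementation; the two persons and the restriction to paths of length two are arbitrary choices):

PREFIX dbr: <http://dbpedia.org/resource/>
SELECT DISTINCT ?intermediate ?p1 ?p2
WHERE {
  # resources lying on a directed path of length two between the two persons
  dbr:Albert_Einstein ?p1 ?intermediate .
  ?intermediate ?p2 dbr:Niels_Bohr .
}
LIMIT 50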
Spatial Applications. DBpedia Mobile [3] is a location-aware client, which renders a map of nearby locations from DBpedia, provides icons for schema classes and supports more than 30 languages from various DBpedia language editions. It can follow RDF links to other data sets linked from DBpedia and supports powerful SPARQL filters to restrict the viewed data.
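The following SPARQL sketch (an illustration of such a spatial filter added here, not code taken from DBpedia Mobile; the coordinate bounds are example values around central Berlin) selects nearby DBpedia resources:

PREFIX geo:  <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?place ?label ?lat ?long
WHERE {
  ?place geo:lat ?lat ;
         geo:long ?long ;
         rdfs:label ?label .
  # bounding-box filter plus an English label restriction
  FILTER (?lat > 52.50 && ?lat < 52.54 &&
          ?long > 13.35 && ?long < 13.42 &&
          lang(?label) = "en")
}
LIMIT 100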
7.4. Applications of the Extraction Framework:
Wiktionary Extraction
Wiktionary is one of the biggest collaboratively cre-
ated lexical-semantic and linguistic resources, avail-
39 http://blog.dbpedia.org/2009/11/20/german-government-proclaims-faceted-wikipedia-search-one-of-the-365-best-ideas-in-germany/
40 http://dbpedia.org/fct/
41 http://querybuilder.dbpedia.org/
able in 171 languages (of which approximately 147 can be considered active42). It contains information about hundreds of spoken and even ancient languages. In the case of the English Wiktionary, there are nearly 3 million detailed descriptions of words covering several domains43. Such descriptions provide, for a lexical word, a hierarchical disambiguation to its language, part of speech, sometimes etymologies, synonyms, hyponyms, hyperonyms, example sentences, and most prominently senses.
Due to its fast-changing nature, together with the fragmentation of the project into Wiktionary language editions (WLE) with independent layout rules, a configurable mediator/wrapper approach is taken for its automated transformation into a structured knowledge base. The workflow of this dedicated Wiktionary extractor, which is part of the Wiktionary2RDF [19] project, is as follows: For every WLE to be transformed, an XML configuration file is provided as input. This configuration is used by the Wiktionary extractor, invoked by the DBpedia extraction framework, to first generate a schema reflecting the configured page structure (wrapper part). After this, these language-specific schemas are converted to a global schema (mediator part) and later serialized to RDF.
To enable non-programmers (the community of adopters and domain experts) to tailor and maintain the WLE wrappers themselves, a simple XML dialect was created to encode the page structure to be parsed and to declare triple patterns that define how the resulting RDF should be built. The described setup is run against Wiktionary dumps. The resulting data set is open in every aspect and hosted as Linked Data.44 Statistics are shown in Table 15.
8. Related Work
8.1. Cross Domain Community Knowledge Bases
8.1.1. Wikidata
In March 2012, Wikimedia Germany e.V. started the development of Wikidata45. Wikidata is a free knowledge base about the world that can be read and
42 http://s23.org/wikistats/wiktionaries_html.php
43 See http://en.wiktionary.org/wiki/semantic for a simple example page.
44 http://wiktionary.dbpedia.org/
45 http://wikidata.org/
edited by humans and machines alike. It provides data
in all languages of the Wikimedia projects, and allows
for central access to the data in a similar vein as Wiki-
media Commons does for multimedia files. Things de-
scribed in the Wikidata knowledge base are called items
and can have labels, descriptions and aliases in all lan-
guages. Wikidata does not aim at offering a single truth about things, but at providing statements given in a particular context. Rather than stating that Berlin has a
population of 3.5 million, Wikidata contains the state-
ment about Berlin’s population being 3.5 million as of
2011 according to the German statistical office. Thus,
Wikidata can offer a variety of statements from differ-
ent sources and dates. As there are potentially many
different statements for a given item and property, ranks
can be added to statements to define their status (pre-
ferred, normal or deprecated). The initial development was divided into three phases:
– The first phase (interwiki links) created an entity base for the Wikimedia projects. This provides a better alternative to the previous interlanguage link system.
– The second phase (infoboxes) gathered infobox-related data for a subset of the entities, with the explicit goal of augmenting the infoboxes that are currently widely used with data from Wikidata.
– The third phase (lists) will expand the set of properties beyond those related to infoboxes, and will provide ways of exploiting this data within and outside the Wikimedia projects.
At the time of writing of this article, the development
of the third phase is ongoing.
Wikidata already contains 11.95 million items and 348 properties that can be used to describe them. Since March 2013, the Wikidata extension has been live on all Wikipedia language editions; thus, their pages can be linked to items in Wikidata and can include data from Wikidata.
Wikidata also offers a Linked Data interface46 as well as regular RDF dumps of all its data. The planned collaboration with Wikidata is outlined in Section 9.
8.1.2. Freebase
Freebase47 is a graph database, which also extracts structured data from Wikipedia and makes it available in RDF. Both DBpedia and Freebase link to each other
46 http://meta.wikimedia.org/wiki/Wikidata/Development/LinkedDataInterface
47 http://www.freebase.com/
Table 15
Statistical comparison of extractions for different languages.
language #words #triples #resources #predicates #senses
en 2,142,237 28,593,364 11,804,039 28 424,386
fr 4,657,817 35,032,121 20,462,349 22 592,351
ru 1,080,156 12,813,437 5,994,560 17 149,859
de 701,739 5,618,508 2,966,867 16 122,362
and provide identifiers based on those for Wikipedia
articles. They both provide dumps of the extracted data,
as well as APIs or endpoints to access the data and
allow their communities to influence the schema of the
data. There are, however, also major differences be-
tween both projects. DBpedia focuses on being an RDF
representation of Wikipedia and serving as a hub on the
Web of Data, whereas Freebase uses several sources to
provide broad coverage. The store behind Freebase is the GraphD [35] graph database, which allows metadata to be stored efficiently for each fact. This graph store is
append-only. Deleted triples are marked and the system
can easily revert to a previous version. This is neces-
sary, since Freebase data can be directly edited by users,
whereas information in DBpedia can only indirectly be
edited by modifying the content of Wikipedia or the
Mappings Wiki. From an organisational point of view,
Freebase is mainly run by Google, whereas DBpedia is
an open community project. In particular in focus areas
of Google and areas in which Freebase includes other
data sources, the Freebase database provides a higher
coverage than DBpedia.
8.1.3. YAGO
One of the projects that pursues similar goals to DBpedia is YAGO48 [42]. YAGO is identical to DBpedia in that each article in Wikipedia becomes an entity in YAGO. Based on this, it uses the leaf categories in the Wikipedia category graph to infer type information about an entity. One of its key features is to link this type information to WordNet. WordNet synsets are represented as classes and the extracted types of entities may become subclasses of such a synset. In the YAGO2 system [21], declarative extraction rules were introduced, which can extract facts from different parts of Wikipedia articles, e.g. infoboxes and categories, as well as other sources. YAGO2 also supports spatial and temporal dimensions for facts at the core of its system.
One of the main differences between DBpedia and
YAGO in general is that DBpedia tries to stay very
close to Wikipedia and provide an RDF version of its
48 http://www.mpi-inf.mpg.de/yago-naga/yago/
content. YAGO focuses on extracting a smaller number
of relations compared to DBpedia to achieve very high
precision and consistent knowledge. The two knowl-
edge bases offer different type systems: whereas the
DBpedia ontology is manually maintained, YAGO is
backed by WordNet and Wikipedia leaf categories.
Due to this, YAGO contains many more classes than
DBpedia. Another difference is that the integration of
attributes and objects in infoboxes is done via mappings
in DBpedia and, therefore, by the DBpedia community
itself, whereas this task is facilitated by expert-designed
declarative rules in YAGO2.
The two knowledge bases are connected, e.g. DBpedia offers the YAGO type hierarchy as an alternative to the DBpedia ontology, and sameAs links are provided in both directions. While the underlying systems are very different, both projects share similar aims and positively complement and influence each other.
8.2. Knowledge Extraction from Wikipedia
Since its official start in 2001, Wikipedia has always
been the target of automatic extraction of information
due to its easy availability, open license and encyclo-
pedic knowledge. A large number of parsers, scraper
projects and publications exist. In this section, we re-
strict ourselves to approaches that are either notable, re-
cent or pertinent to DBpedia. MediaWiki.org maintains
an up-to-date list of software projects
49
, who are able to
process wiki syntax, as well as a list of data extraction
extensions50 for MediaWiki.
JWPL (Java Wikipedia Library, [49]) is an open-source, Java-based API that provides access to information provided by the Wikipedia API (redirects, categories, articles and link structure). JWPL contains a MediaWiki Markup parser that can be used to further analyze the contents of a Wikipedia page. Data is also
49 http://www.mediawiki.org/wiki/Alternative_parsers
50 http://www.mediawiki.org/wiki/Extension_Matrix/data_extraction
provided as XML dump and is incorporated in the lexi-
cal resource UBY51 for language tools.
Several different approaches to extracting knowledge from Wikipedia are presented in [37]. Features such as anchor texts, interlanguage links, category links and redirect pages are utilized, e.g., for word-sense disambiguation, synonyms, translations, taxonomic relations, and abbreviation or hypernym resolution, respectively. Apart from this, link structures are used to build the Wikipedia Thesaurus Web service52. Additional projects that exploit the mentioned features are listed on the Special Interest Group on Wikipedia Mining (SIGWP) Web site53.
An earlier approach to improve the quality of the infobox schemata and contents is described in [47]. The presented methodology encompasses a three-step process of preprocessing, classification and extraction. During preprocessing, refined target infobox schemata are created by applying statistical methods, and training sets are extracted based on real Wikipedia data. After assigning a class and the corresponding target schema (classification), the training sets are used to extract target infobox values from the document’s text by applying machine learning algorithms.
The idea of using structured data from certain markup structures was also applied to other user-driven Web encyclopedias. In [38] the authors describe their effort of building an integrated Chinese Linking Open Data (CLOD) source based on the Chinese Wikipedia and the two widely used and large encyclopedias Baidu Baike54 and Hudong Baike55. Apart from utilizing MediaWiki and HTML Markup for the actual extraction, the Wikipedia interlanguage links were used to link the CLOD source to the English DBpedia.
A more generic approach to achieving better cross-lingual knowledge linkage beyond the use of Wikipedia interlanguage links is presented in [45]. Focusing on wiki knowledge bases, the authors introduce their solution based on structural properties like similar linkage structures, the assignment to similar categories and similar interests of the authors of wiki documents in the considered languages. Since this approach is language-feature-agnostic, it is not restricted to certain languages.
51 http://www.ukp.tu-darmstadt.de/data/lexical-resources/uby/
52 http://sigwp.org/en/index.php/Wikipedia_Thesaurus
53 http://sigwp.org/en/
54 http://baike.baidu.com/
55 http://www.hudong.com/
KnowItAll56 is a web-scale knowledge extraction effort, which is domain-independent and uses generic extraction rules, co-occurrence statistics and Naive Bayes classification [11]. Cyc [29] is a large common sense knowledge base, which is now partially released as OpenCyc and also available as an OWL ontology. OpenCyc is linked to DBpedia, which provides an ontological embedding in its comprehensive structures. WikiTaxonomy [40] is a large taxonomy derived from categories in Wikipedia by classifying categories as instances or classes and deriving a subsumption hierarchy. The KOG system [46] refines existing Wikipedia infoboxes based on machine learning techniques using both SVMs and a more powerful joint-inference approach expressed in Markov Logic Networks. KYLIN [47] is a system which autonomously extracts structured data from Wikipedia and uses self-supervised linking.
9. Conclusions and Future Work
In this system report, we presented an overview of recent advances of the DBpedia community project. The technical innovations described in this article included in particular: (1) the extraction based on the community-curated DBpedia ontology, (2) the live synchronisation of DBpedia with Wikipedia and DBpedia mirrors through update propagation, and (3) the facilitation of the internationalisation of DBpedia. As a result, we demonstrated that in the past four years DBpedia has matured and improved significantly in terms of coverage, usability, and data quality.
With DBpedia, we also aim to provide a proof-of-concept and blueprint for the feasibility of large-scale knowledge extraction from crowd-sourced content repositories. There are a large number of further crowd-sourced content repositories and DBpedia has already had an impact on their structured data publishing and interlinking. Two examples are Wiktionary, whose Wiktionary extraction [19] has meanwhile become part of DBpedia, and LinkedGeoData [41], which aims to implement similar data extraction, publishing and linking strategies for OpenStreetMap.
In the future, we see in particular the following di-
rections for advancing the DBpedia project:
Multilingual data integration and fusion. An area which is still largely unexplored is the integration and fusion between different DBpedia language editions.
56 http://www.cs.washington.edu/research/knowitall/
Non-English DBpedia editions offer better and
different coverage of local culture. When we are able to
precisely identify equivalent, overlapping and comple-
mentary parts in different DBpedia language editions,
we can reach significantly increased coverage. On the
other hand, comparing the values of a specific prop-
erty between different language editions will help us
to spot extraction errors as well as wrong or outdated
information in Wikipedia.
Community-driven data quality improvement. In the
future, we also aim to engage a larger community of
DBpedia users in feedback loops, which help us to
identify data quality problems and corresponding defi-
ciencies of the DBpedia extraction framework. By con-
stantly monitoring the data quality and integrating im-
provements into the mappings to the DBpedia ontology
as well as fixes into the extraction framework, we aim to
demonstrate that the Wikipedia community is not only
capable of creating the largest encyclopedia, but also
the most comprehensive and structured knowledge base.
With the DBpedia quality evaluation campaign [48] we made a first step in this direction.
Inline extraction. Currently DBpedia extracts infor-
mation primarily from templates. In the future, we
envision also extracting semantic information from typed links. Typed links are a feature of Semantic MediaWiki, which was backported and implemented as a very lightweight extension for MediaWiki57. If this
extension is deployed at Wikipedia installations, this
opens up completely new possibilities for more fine-
grained and non-invasive knowledge representations
and extraction from Wikipedia.
Collaboration between Wikidata and DBpedia.
While DBpedia provides a comprehensive and current
view on entity descriptions extracted from Wikipedia,
Wikidata offers a variety of factual statements from
different sources and dates. One of the richest sources
of DBpedia are Wikipedia infoboxes, which are struc-
tured but at the same time heterogeneous and non-
standardized (thus making the extraction error prone in
certain cases). The aim of Wikidata is to populate in-
foboxes automatically from a centrally managed, high-
quality fact database. In this regard, both projects com-
plement each other and there are several ongoing col-
laboration activities. In future versions, DBpedia will
include more raw data provided by Wikidata and add
57 http://www.mediawiki.org/wiki/Extension:LightweightRDFa
services such as Linked Data/SPARQL endpoints, RDF
dumps, linking and ontology mapping for Wikidata.
Feedback for Wikipedia. A promising prospect is that
DBpedia can help to identify misrepresentations, errors
and inconsistencies in Wikipedia. In the future, we plan
to provide more feedback to the Wikipedia community
about the quality of Wikipedia. This can, for instance, be achieved in the form of sanity checks implemented as SPARQL queries on the DBpedia Live endpoint, which identify data quality issues and are executed at certain intervals. For example, a query could check that the birth date of a person is always before the death date, or spot outliers that differ significantly from the range of the majority of the other values. In case a Wikipedia editor makes a mistake or typo when adding such information to a page, this could be automatically identified and provided as feedback to Wikipedians.
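A minimal sketch of such a sanity check, assuming the standard dbo:birthDate and dbo:deathDate properties and an endpoint that supports date comparison, as Virtuoso does (illustrative only, not a query actually run by the project):

PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?person ?birth ?death
WHERE {
  ?person dbo:birthDate ?birth ;
          dbo:deathDate ?death .
  # flag persons whose recorded death date precedes their birth date
  FILTER (?death < ?birth)
}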
Integrate DBpedia and NLP. Recent advances
(cf. Section 7) show huge potential for employing
Linked Data background knowledge in various Natural
Language Processing (NLP) tasks. One very promising
research avenue in this regard is to employ DBpedia
as structured background knowledge for named en-
tity recognition and disambiguation. Currently, most
approaches use statistical information such as co-
occurrence for named entity disambiguation. However,
co-occurrence is not always easy to determine (it depends on training data) or to update (it requires recomputation).
With DBpedia and in particular DBpedia Live, we have
comprehensive and evolving background knowledge
comprising information on the relationship between a
large number of real-world entities. Consequently, we
can employ this information for deciding to what entity
a certain surface form should be mapped.
Acknowledgment
We would like to thank and acknowledge the support
of the following people and organisations to DBpedia:
– all Wikipedia contributors
– all DBpedia Mappings Wiki contributors
– OpenLink Software for providing and maintaining the server infrastructure for the main DBpedia endpoint
– Kingsley Idehen for SPARQL and Linked Data hosting and community support
– Christopher Sahnwaldt for DBpedia development and release management
– Claus Stadler for DBpedia development
– Paul Kreis for DBpedia development
– people who helped contributing data for certain parts of the article:
  – Instance Data Analysis: Volha Bryl (working at University of Mannheim)
  – Sindice analysis: Stéphane Campinas, Szymon Danielczyk, and Gabriela Vulcu (working at DERI)
  – Freebase: Shawn Simister and Tom Morris (working at Freebase)
This work was supported by grants from the Euro-
pean Union’s 7th Framework Programme provided for
the projects LOD2 (GA no. 257943), GeoKnow (GA
no. 318159) and Dicode (GA no. 257184).
Appendix
Table 16
List of namespace prefixes.
Prefix Namespace
dbo http://dbpedia.org/ontology/
dbp http://dbpedia.org/property/
dbr http://dbpedia.org/resource/
dbr-de http://de.dbpedia.org/resource/
dc http://purl.org/dc/elements/1.1/
foaf http://xmlns.com/foaf/0.1/
geo http://www.w3.org/2003/01/geo/wgs84_pos#
georss http://www.georss.org/georss/
ls http://spotlight.dbpedia.org/scores/
lx http://spotlight.dbpedia.org/lexicalizations/
owl http://www.w3.org/2002/07/owl#
rdf http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs http://www.w3.org/2000/01/rdf-schema#
skos http://www.w3.org/2004/02/skos/core#
sptl http://spotlight.dbpedia.org/vocab/
xsd http://www.w3.org/2001/XMLSchema#
References
[1]
S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and
Z. Ives. DBpedia: A Nucleus for a Web of Open Data. In
Proceedings of the 6th International Semantic Web Conference
(ISWC), volume 4825 of Lecture Notes in Computer Science,
pages 722–735. Springer, 2008.
Table 17
DBpedia timeline.
Year Month Event
2006 Nov Start of infobox extraction from Wikipedia
2007 Mar DBpedia 1.0 release
Jun ESWC DBpedia article [2]
Nov DBpedia 2.0 release
Nov ISWC DBpedia article [1]
Dec DBpedia 3.0 release candidate
2008 Feb DBpedia 3.0 release
Aug DBpedia 3.1 release
Nov DBpedia 3.2 release
2009 Jan DBpedia 3.3 release
Sep JWS DBpedia article [5]
2010 Feb Information extraction framework in Scala
Mappings Wiki release
Mar DBpedia 3.4 release
Apr DBpedia 3.5 release
May Start of DBpedia Internationalization effort
2011 Feb DBpedia Spotlight release
Mar DBpedia 3.6 release
Jul DBpedia Live release
Sep DBpedia 3.7 release (with I18n datasets)
2012 Aug DBpedia 3.8 release
Sep   Publication of DBpedia Internationalization article [24]
2013 Sep DBpedia 3.9 release
[2]
S. Auer and J. Lehmann. What have Innsbruck and Leipzig
in common? extracting semantics from wiki content. In Pro-
ceedings of the ESWC (2007), volume 4519 of Lecture Notes in
Computer Science, pages 503–517, Berlin / Heidelberg, 2007.
Springer.
[3]
C. Becker and C. Bizer. Exploring the geospatial Semantic Web
with DBpedia Mobile. J. Web Sem, 7(4):278–286, 2009.
[4]
D. Bikel, V. Castelli, R. Florian, and D.-J. Han. Entity linking
and slot filling through statistical processing and inference rules.
In Proceedings TAC Workshop, 2009.
[5]
C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cy-
ganiak, and S. Hellmann. DBpedia - a crystallization point for
the Web of Data. Journal of Web Semantics, 7(3):154–165,
2009.
[6]
E. Cabrio, J. Cojan, A. P. Aprosio, B. Magnini, A. Lavelli, and
F. Gandon. QAKiS: an open domain QA system based on
relational patterns. In ISWC-PD; International Semantic Web
Conference (Posters & Demos), volume 914 of CEUR Workshop
Proceedings. CEUR-WS.org, 2012.
[7]
D. V. Camarda, S. Mazzini, and A. Antonuccio. Lodlive, ex-
ploring the Web of Data. In V. Presutti and H. S. Pinto, editors,
I-SEMANTICS 2012 - 8th International Conference on Seman-
tic Systems, I-SEMANTICS ’12, Graz, Austria, September 5-7,
2012, pages 197–200. ACM, 2012.
[8]
S. Campinas, T. E. Perry, D. Ceccarelli, R. Delbru, and G. Tum-
marello. Introducing RDF graph summary with application to
assisted SPARQL formulation. In Database and Expert Systems
Applications (DEXA), 2012 23rd International Workshop on,
pages 261–266. IEEE, 2012.
[9]
D. Damljanovic, M. Agatonovic, and H. Cunningham. FREyA:
An interactive way of querying linked data using natural lan-
guage. In The Semantic Web: ESWC 2011 Workshops, volume
7117, pages 125–138. Springer, 2012.
[10]
O. Erling and I. Mikhailov. RDF support in the Virtuoso DBMS.
In S. Auer, C. Bizer, C. Müller, and A. V. Zhdanova, editors,
CSSW, volume 113 of LNI, pages 59–68. GI, 2007.
[11]
O. Etzioni, M. Cafarella, D. Downey, S. Kok, A.-M. Popescu,
T. Shaked, S. Soderland, D. S. Weld, and A. Yates. Web-scale
information extraction in KnowitAll. In Proceedings of the 13th
international conference on World Wide Web, pages 100–110.
ACM, 2004.
[12]
D. Ferrucci, E. Brown, J. Chu-Carroll, J. Fan, D. Gondek, A. A.
Kalyanpur, A. Lally, J. W. Murdock, E. Nyberg, J. Prager, et al.
Building Watson: An overview of the DeepQA project. AI
magazine, 31(3):59–79, 2010.
[13]
A. García-Silva, M. Jakob, P. N. Mendes, and C. Bizer. Mul-
tipedia: Enriching DBpedia with multimedia information. In
Proceedings of the sixth international conference on Knowledge
capture, K-CAP ’11, pages 137–144, New York, NY, USA,
2011. ACM.
[14]
A. García-Silva, M. Szomszor, H. Alani, and O. Corcho. Preliminary results in tag disambiguation using DBpedia. In 1st International Workshop in Collective Knowledge Capturing and Representation (CKCaR), California, USA, 2009.
[15]
R. Hahn, C. Bizer, C. Sahnwaldt, C. Herta, S. Robinson,
M. Bürgle, H. Düwiger, and U. Scheel. Faceted Wikipedia
Search. In W. Abramowicz and R. Tolksdorf, editors, Business
Information Systems, 13th International Conference, BIS 2010,
Berlin, Germany, May 3-5, 2010. Proceedings, volume 47 of
Lecture Notes in Business Information Processing, pages 1–11.
Springer, 2010.
[16]
P. Heim, T. Ertl, and J. Ziegler. Facet Graphs: Complex seman-
tic querying made easy. In Proceedings of the 7th Extended
Semantic Web Conference (ESWC 2010), volume 6088 of LNCS,
pages 288–302, Berlin/Heidelberg, 2010. Springer.
[17]
P. Heim, S. Hellmann, J. Lehmann, S. Lohmann, and T. Stege-
mann. RelFinder: Revealing relationships in RDF knowledge
bases. In T.-S. Chua, Y. Kompatsiaris, B. Mérialdo, W. Haas,
G. Thallinger, and W. Bailer, editors, Semantic Multimedia, 4th
International Conference on Semantic and Digital Media Tech-
nologies, SAMT 2009, Graz, Austria, December 2-4, 2009, Pro-
ceedings, volume 5887 of Lecture Notes in Computer Science,
pages 182–187. Springer, 2009.
[18]
P. Heim, S. Lohmann, D. Tsendragchaa, and T. Ertl. SemLens:
Visual analysis of semantic data with scatter plots and seman-
tic lenses. In C. Ghidini, A.-C. N. Ngomo, S. N. Lindstaedt,
and T. Pellegrini, editors, Proceedings the 7th International
Conference on Semantic Systems, I-SEMANTICS 2011, Graz,
Austria, September 7-9, 2011, ACM International Conference
Proceeding Series, pages 175–178. ACM, 2011.
[19]
S. Hellmann, J. Brekle, and S. Auer. Leveraging the Crowd-
sourcing of Lexical Resources for Bootstrapping a Linguistic
Data Cloud. In JIST, 2012.
[20]
S. Hellmann, C. Stadler, J. Lehmann, and S. Auer. DBpedia
Live Extraction. In Proc. of 8th International Conference on On-
tologies, DataBases, and Applications of Semantics (ODBASE),
volume 5871 of Lecture Notes in Computer Science, pages
1209–1223, 2009.
[21]
J. Hoffart, F. M. Suchanek, K. Berberich, and G. Weikum.
YAGO2: a spatially and temporally enhanced knowledge base
from Wikipedia. Artif. Intell, 194:28–61, 2013.
[22]
H. Ji, R. Grishman, H. T. Dang, K. Griffitt, and J. Ellis.
Overview of the TAC 2010 knowledge base population track.
In Third Text Analysis Conference (TAC 2010), 2010.
[23]
G. Kobilarov, T. Scott, Y. Raimond, S. Oliver, C. Sizemore,
M. Smethurst, C. Bizer, and R. Lee. Media meets Semantic
Web – how the BBC uses DBpedia and linked data to make
connections. In The semantic web: research and applications,
pages 723–737. Springer, 2009.
[24]
D. Kontokostas, C. Bratsas, S. Auer, S. Hellmann, I. Antoniou,
and G. Metakides. Internationalization of Linked Data: The
case of the Greek DBpedia Edition. Web Semantics: Science,
Services and Agents on the World Wide Web, 15(0):51 – 61,
2012.
[25]
D. Kontokostas, A. Zaveri, S. Auer, and J. Lehmann.
TripleCheckMate: A Tool for Crowdsourcing the Quality As-
sessment of Linked Data. In Proceedings of the 4th Conference
on Knowledge Engineering and Semantic Web, 2013.
[26]
P. Kreis. Design of a quality assessment framework for the
DBpedia knowledge base. Master’s thesis, Freie Universität Berlin, 2011.
[27]
C. Lagoze, H. V. de Sompel, M. Nelson, and S. Warner.
The open archives initiative protocol for metadata har-
vesting.
http://www.openarchives.org/OAI/
openarchivesprotocol.html, 2008.
[28] J. Lehmann, S. Monahan, L. Nezda, A. Jung, and Y. Shi. LCC
approaches to knowledge base population at TAC 2010. In
Proceedings TAC Workshop, 2010.
[29]
D. B. Lenat. CYC: A large-scale investment in knowledge
infrastructure. Communications of the ACM, 38(11):33–38,
1995.
[30]
C.-Y. Lin and E. H. Hovy. The automated acquisition of topic
signatures for text summarization. In Proceedings of the 18th
conference on Computational linguistics, pages 495–501, 2000.
[31]
V. Lopez, M. Fernández, E. Motta, and N. Stieler. PowerAqua:
Supporting users in querying and exploring the Semantic Web.
Semantic Web, 3(3):249–265, 2012.
[32]
M. Martin, C. Stadler, P. Frischmuth, and J. Lehmann. Increas-
ing the financial transparency of European Commission project
funding. Semantic Web Journal, Special Call for Linked Dataset
descriptions, 2013.
[33]
P. N. Mendes, M. Jakob, and C. Bizer. DBpedia for NLP - a
multilingual cross-domain knowledge base. In Proceedings
of the International Conference on Language Resources and
Evaluation (LREC), Istanbul, Turkey, 2012.
[34]
P. N. Mendes, M. Jakob, A. García-Silva, and C. Bizer. DBpedia
Spotlight: Shedding light on the web of documents. In Proceed-
ings of the 7th International Conference on Semantic Systems
(I-Semantics), Graz, Austria, 2011.
[35]
S. M. Meyer, J. Degener, J. Giannandrea, and B. Michener.
Optimizing schema-last tuple-store queries in GraphD. In A. K.
Elmagarmid and D. Agrawal, editors, SIGMOD Conference,
pages 1047–1056. ACM, 2010.
[36]
M. Morsey, J. Lehmann, S. Auer, C. Stadler, and S. Hell-
mann. DBpedia and the Live Extraction of Structured Data
from Wikipedia. Program: electronic library and information
systems, 46:27, 2012.
[37]
K. Nakayama, M. Pei, M. Erdmann, M. Ito, M. Shirakawa,
T. Hara, and S. Nishio. Wikipedia mining: Wikipedia as a corpus
for knowledge extraction. In Annual Wikipedia Conference
(Wikimania), 2008.
[38]
X. Niu, X. Sun, H. Wang, S. Rong, G. Qi, and Y. Yu. Zhishi.me:
Weaving chinese linking open data. In 10th International Con-
ference on The semantic web - Volume Part II, pages 205–220,
Berlin, Heidelberg, 2011. Springer-Verlag.
[39]
E. Oren, R. Delbru, M. Catasta, R. Cyganiak, H. Stenzhorn,
and G. Tummarello. Sindice.com: A document-oriented lookup
index for open linked data. Int. J. of Metadata and Semantics
and Ontologies, 3:37–52, Nov. 10 2008.
[40]
S. P. Ponzetto and M. Strube. WikiTaxonomy: A large scale
knowledge resource. In M. Ghallab, C. D. Spyropoulos,
N. Fakotakis, and N. M. Avouris, editors, ECAI, volume 178
of Frontiers in Artificial Intelligence and Applications, pages
751–752. IOS Press, 2008.
[41]
C. Stadler, J. Lehmann, K. Höffner, and S. Auer. LinkedGeo-
Data: A Core for a Web of Spatial Open Data. Semantic Web
Journal, 3(4):333–354, 2012.
[42]
F. M. Suchanek, G. Kasneci, and G. Weikum. YAGO: A core
of semantic knowledge. In C. L. Williamson, M. E. Zurko,
P. F. Patel-Schneider, and P. J. Shenoy, editors, WWW, pages
697–706. ACM, 2007.
[43]
E. Tacchini, A. Schultz, and C. Bizer. Experiments with
Wikipedia cross-language data fusion. In Proceedings of the
5th Workshop on Scripting and Development for the Semantic
Web, ESWC. Citeseer, 2009.
[44]
C. Unger, L. Bühmann, J. Lehmann, A.-C. Ngonga Ngomo,
D. Gerber, and P. Cimiano. Template-based Question Answer-
ing over RDF Data. In Proceedings of the 21st international
conference on World Wide Web, pages 639–648, 2012.
[45]
Z. Wang, J. Li, Z. Wang, and J. Tang. Cross-lingual Knowledge
Linking Across Wiki Knowledge Bases. In Proceedings of
the 21st international conference on World Wide Web, pages
459–468, New York, NY, USA, 2012. ACM.
[46]
F. Wu and D. Weld. Automatically refining the Wikipedia
Infobox Ontology. In Proceedings of the 17th World Wide Web
Conference, 2008.
[47]
F. Wu and D. S. Weld. Autonomously semantifying Wikipedia.
In Proceedings of the 16th Conference on Information and
Knowledge Management, pages 41–50. ACM, 2007.
[48]
A. Zaveri, D. Kontokostas, M. A. Sherif, L. B
¨
uhmann,
M. Morsey, S. Auer, and J. Lehmann. User-driven Quality
Evaluation of DBpedia. In To appear in Proceedings of 9th
International Conference on Semantic Systems, I-SEMANTICS
’13, Graz, Austria, September 4-6, 2013. ACM, 2013.
[49]
T. Zesch, C. M
¨
uller, and I. Gurevych. Extracting lexical seman-
tic knowledge from Wikipedia and Wiktionary. In Proceedings
of the 6th International Conference on Language Resources
and Evaluation, Marrakech, Morocco, May 2008. electronic
proceedings.