All content in this area was uploaded by Georgia D. Solomou on Apr 30, 2014
The Use of SKOS Vocabularies in Digital Repositories:
The DSpace Case
Georgia Solomou
High Performance Information Systems Laboratory
Computer Engineering and Informatics Dpt.
University of Patras
Patras-Rio, Greece
solomou@hpclab.ceid.upatras.gr
Theodore Papatheodorou
High Performance Information Systems Laboratory
Computer Engineering and Informatics Dpt.
University of Patras
Patras-Rio, Greece
tsp@hpclab.ceid.upatras.gr
Abstract—Thesauri are concept schemes that help in efficiently
characterizing and retrieving items from digital libraries. SKOS
is a data model that provides a standardized way to represent
thesauri -and controlled vocabularies in general- using Resource
Description Framework. A digital repository system that can
inherently ingest and handle thesauri, although not in SKOS
format, is DSpace. SKOS support in DSpace is implemented
thanks to an add-on, provided by the University of Minho. Our
initial objective was to apply this add-on to a running DSpace
instance. We then tested this updated DSpace installation using a
real vocabulary: the Thesaurus of Greek Terms, which we took on
the task of converting to SKOS. As a final step, we tried to tackle
the problems that arose and to propose solutions, which are mostly
based on Semantic Web techniques.
Digital Libraries; Semantic Web; SKOS; Thesauri; Controlled
Vocabularies; DSpace;
I. INTRODUCTION
More and more cultural and educational institutions rely on
well-known digital library systems in order to store,
manage and disseminate their digital assets. These systems
often implement facilities that render their content more
knowledge-intensive and thus suitable for exploitation by
Semantic Web applications.
An extremely popular system implemented for handling
digital collections is DSpace. On top of DSpace many
institutional repositories have been built worldwide, serving
museums, libraries, national archives, etc. A significant feature
of DSpace is the ability to characterize its items using a
predefined set of keywords, namely a controlled vocabulary.
The adoption of a structured controlled vocabulary by a
digital library system is of great importance: It helps in
properly characterizing the ingested content and plays a
fundamental role in effectively indexing, searching and
retrieving information. Compared to other types of controlled
vocabularies, thesauri are a more powerful choice as they allow
for the explicit declaration of relationships between concepts.
All types of controlled vocabularies, in order to be usable
and exchangeable between computer and web applications,
need to be expressed as machine-readable data. A standardized,
interoperable and machine-understandable way of representing
controlled vocabularies and using them within the framework
of the Semantic Web is SKOS.
SKOS (Simple Knowledge Organization System) [5] is a
data model for expressing the basic structure of all concept
schemes, like thesauri. It is actually a practical application of
RDF [7] (and RDFS) and its main objective is to enable easy
publication of controlled structured vocabularies for the
Semantic Web. SKOS is a W3C recommendation, hence a
standardized Web technology. Thesauri expressed in SKOS
can potentially provide added-value to Semantic Web
applications.
In this work we focus on SKOS, its prevalence
among digital repositories, and the process of
converting thesauri to SKOS (SKOSification). In particular, in
section II we present SKOSified thesauri and related tools that
attest to the wide applicability of SKOS. In section III we discuss
existing methods for converting controlled vocabularies
to SKOS and then show how we adopted this standard for the
Thesaurus of Greek Terms. In section IV we examine the case
of incorporating a SKOSified vocabulary in DSpace.
Conclusions and future work follow.
II. SKOS USAGE
At this moment, there are many running projects in the
field of culture that involve SKOS. For example, ATHENA,¹
the European digital library “Europeana”,² STERNA³
(Semantic Web-based Thematic European Reference Network
Application), STAR⁴ (Semantic Technologies for
Archaeological Resources) and many more projects adopt this
model as a means to provide more knowledge-intensive data.
¹ http://www.athenaeurope.org/
² http://www.europeana.eu/
³ http://www.sterna-net.eu/
⁴ http://hypermedia.research.glam.ac.uk/kos/star/
In addition to these projects, many more institutions
responsible for publishing thesauri show particular interest in
adopting the SKOS standard. Besides, the continuously
increasing number of tools that are built around SKOS (like
editors, validators and converters) encourages and
facilitates the SKOSification process.
A. Thesauri in SKOS
The need to migrate knowledge organization systems into
SKOS has long been recognized by organizations that deal with
controlled vocabularies. Some of them have already deployed
an official SKOSified version of their structured vocabularies
whereas others are on the way to do so.
1) Popular SKOSified vocabularies
In this section, we mention some known thesauri that have
been converted to SKOS. These concept schemes are available
to the public and apply to many different areas of knowledge:
• LCSH - Library of Congress Subject Headings [11].
It is a very popular subject heading system maintained
by the US Library of Congress. It offers an online
catalogue where users can search and browse
thousands of terms. Through this online catalogue,
users are able to also obtain the SKOS version of each
selected term.
• AGROVOC - The Food and Agriculture Organization
Thesaurus.⁵ It is a multilingual
thesaurus that provides terminology for all subject
fields in agriculture, forestry, fisheries, food and
related domains. For each concept its corresponding
SKOS description is also available, expressed in the
RDF/XML serialization syntax.
• UKAT - UK Archival Thesaurus.⁶ It is a subject
thesaurus intended for indexing and
searching in the UK archive sector. Its SKOSified
version is provided as a single XML file that can be
directly downloaded from the UKAT official page.
• GEMET - General Multilingual Environmental
Thesaurus.⁷ A general thesaurus which defines a core
terminology for the environment. It is available as a
web service and can be accessed online; its SKOS
format, though, is not browseable but can be
obtained in the form of an XML file.
• AAT - Getty Arts and Architecture Thesaurus.⁸ It is
a structured vocabulary used for characterizing any
type of cultural material, as well as items of art and
architecture. Although the AAT thesaurus can be
browsed online, its SKOSification is still in a draft
stage, which is why it has not yet been
incorporated into the online catalogue.
• WordNet.⁹ It is a lexical database for the English
language that may be considered a thesaurus:
the expressed concepts are interlinked by various semantic
relations. For WordNet 2.0 a partial conversion to
SKOS has been proposed.
⁵ http://aims.fao.org/website/Search-AGROVOC
⁶ http://isegserv.itd.rl.ac.uk/skos/ukat/
⁷ http://isegserv.itd.rl.ac.uk/skos/gemet/
⁸ http://www.getty.edu/research/conducting_research/vocabularies/aat/
The aforementioned SKOS implementations are some
representative examples and definitely not the only existing
attempts. In addition to these vocabularies, there are many
others, more or less significant, that try to successfully
accomplish their migration to SKOS. Through this work, we
chose to further analyze the case of the Thesaurus of Greek
Terms, a controlled vocabulary implemented in Greece and
meant to be used by Greek institutions.
2) The Thesaurus of Greek Terms
The National Documentation Center of Greece (EKT) is the
national infrastructure for scientific documentation and online
information. It is an institution responsible for publishing and
handling the first official thesaurus in Greece, the Thesaurus of
Greek Terms (TGT).
The TGT thesaurus is structured as a controlled vocabulary
that allows representation of both vertical (hierarchical) and
horizontal associations between concepts. It is composed of
5227 bilingual (Greek, English) terms that cover a broad field
of knowledge. Its aim is to facilitate institutions in Greece, like
libraries, museums and information centers, in characterizing
and managing their digital material.
Despite the thesaurus’ notable presence in Greece, EKT
has not yet proceeded with the SKOSification of this product.
For this reason, in section III we first propose, and afterwards
utilize, a SKOSified version of the TGT thesaurus. The
conversion process is described in section III.B.
B. SKOS Specific Tools
Apart from the increasing number of SKOSified thesauri,
the wide acceptance of SKOS also becomes evident from the
number of tools that are built around it. In what follows, we
briefly present the best-known such tools; they fall into
two basic categories: editors and validators.
• ThManager.¹⁰ It is an open-source tool implemented
in Java that aims at facilitating the creation and
visualization of SKOS vocabularies.
• SKOSEd [6]. It is a plug-in for Protégé 4 (an OWL
ontology editor) that augments the latter with the
ability to create and modify SKOS thesauri. SKOSEd is
accompanied by the SKOS API,¹¹ a programmatic API
implemented in Java that can be utilized for building
SKOS-based applications.
• PoolParty [10]. It is a commercial system suitable for
editing SKOS vocabularies and for managing any type
of thesauri. It is built upon Semantic Web technologies
and, among other things, allows for thesauri management via
easy-to-use GUIs. PoolParty also offers a SKOS
thesaurus consistency checker service that validates the
submitted vocabularies for their alignment with the
SKOS recommendation [9].
• W3C Validation Service.¹² It is an experimental
online SKOS validator, provided by W3C.
• The MONDECA SKOS Reader.¹³ This tool
helps users to easily navigate and browse a SKOS
thesaurus, provided that this thesaurus is published as
an accessible SKOS file. It produces readable
versions of the imported files and can display
concepts in various orders (e.g., hierarchically or
alphabetically).
⁹ http://isegserv.itd.rl.ac.uk/skos/WordNet.zip
¹⁰ http://thmanager.sourceforge.net/
¹¹ http://skosapi.sourceforge.net/
III. THE SKOSIFICATION PROCESS
The process of converting thesauri to SKOS, as we will see
in this section, is not standardized and depends on the nature of
the candidate vocabulary.
A. Existing Methods
A notable SKOSification method is proposed in [1] by a
team at the VU University of Amsterdam. They apply their
method to some well-known thesauri, like MeSH and GTAA.
However, although the proposed method behaves well for
controlled vocabularies that are based on older thesauri
recommendations (e.g., the ISO or ANSI/NISO standards),
vocabularies with non-standard features cannot be handled.
Apart from the aforementioned case, which tries to propose
a structured method for bringing thesauri to SKOS, some other
attempts have also been published, aiming to fit the needs of a
particular vocabulary. For example, [12] is a technical report
dealing with the conversion of the TheSoz thesaurus
(Thesaurus of the Social Sciences) to SKOS. Moreover, [11]
describes the SKOSification of the Library of Congress Subject
Headings. Both works present a procedure that requires manual
effort for implementing the mapping from the original
thesaurus’ elements to SKOS notions. In both cases, an
appropriate XSL transformation is finally applied, which
accomplishes the migration to SKOS.
B. SKOSifying the TGT Thesaurus
Having all these in mind, we proceeded with the
SKOSification of the TGT thesaurus. The latter follows the
structure of any usual subject thesaurus (see Fig. 1). It makes
use of hierarchical (<BT>, <NT>, <MT>), associative (<RT>)
and equivalence (<UF>) relations. In addition, for each term its
English translation is provided (<ET>), as well as its
correspondence to the Dewey Decimal Classification system
(<dewey>).
¹² http://www.w3.org/2004/02/skos/validation
¹³ http://client2.mondeca.com/mondecalabs/skosReader.html
Figure 1. The TGT thesaurus in its original XML format.
First, we manually mapped thesaurus elements to SKOS
notions, paying particular attention to what the SKOS
specification dictates. This mapping is summarized in Table I.
We then constructed an XSL transformation able to convert
TGT to the desired SKOS format, taking into account the
mapping in Table I. The final SKOSified version of TGT can
be accessed online.¹⁴
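The mapping can also be illustrated outside XSLT. The following is a minimal Python sketch, not the authors' XSL transformation, that converts a single <TERM> element into a skos:Concept according to Table I. The base URI, the URI scheme built from labels, and the function names are assumptions made only for this illustration; the sample term is the one shown in Fig. 1.

```python
import xml.etree.ElementTree as ET

SKOS = "http://www.w3.org/2004/02/skos/core#"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
BASE = "http://example.org/tgt/"  # hypothetical base URI, not from the paper

# TGT relation element -> SKOS property, following Table I.
RELATIONS = {"MT": "broaderTransitive", "BT": "broader",
             "NT": "narrower", "RT": "related"}


def concept_uri(label):
    # Hypothetical URI scheme: base URI plus the label, spaces replaced.
    return BASE + label.strip().replace(" ", "_")


def term_to_skos(term):
    """Convert one TGT <TERM> element into a skos:Concept element."""
    label = term.findtext("CONTEXT").strip()
    concept = ET.Element(f"{{{SKOS}}}Concept",
                         {f"{{{RDF}}}about": concept_uri(label)})

    def add_literal(prop, text, lang=None):
        el = ET.SubElement(concept, f"{{{SKOS}}}{prop}")
        el.text = text.strip()
        if lang:
            el.set(XML_LANG, lang)

    add_literal("prefLabel", label, "el")
    # The first <ET> is the preferred translation, later ones are alternatives.
    for i, et in enumerate(term.findall("ET")):
        add_literal("prefLabel" if i == 0 else "altLabel", et.text, "en")
    for uf in term.findall("UF"):
        add_literal("altLabel", uf.text, "el")
    for sn in term.findall("SN"):
        add_literal("definition", sn.text)
    for dewey in term.findall("dewey"):
        add_literal("notation", dewey.text)
    # Relations point to other concepts through rdf:resource.
    for tag, prop in RELATIONS.items():
        for rel in term.findall(tag):
            ET.SubElement(concept, f"{{{SKOS}}}{prop}",
                          {f"{{{RDF}}}resource": concept_uri(rel.text)})
    return concept


SAMPLE = """<TERM>
  <CONTEXT>αστικά δικαστήρια</CONTEXT>
  <USER>EKT</USER>
  <MT>Νομικές Επιστήμες</MT>
  <ET>civil courts</ET>
  <BT>δικαστήρια</BT>
  <NT>ειρηνοδικεία</NT>
  <NT>πρωτοδικεία</NT>
  <UF>βλ. πολιτικά δικαστήρια</UF>
  <RT>πολιτική δικονομία</RT>
  <SN>some description</SN>
  <dewey>347</dewey>
</TERM>"""

ET.register_namespace("skos", SKOS)
ET.register_namespace("rdf", RDF)
concept = term_to_skos(ET.fromstring(SAMPLE))
print(ET.tostring(concept, encoding="unicode"))
```

The <USER> element is deliberately ignored, since Table I maps it to no SKOS notion.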
IV. THE DSPACE CASE
DSpace is an open-source digital repository system. It is
responsible for the efficient description, preservation,
management, and distribution of any kind of digital material. It
is popular because it supports an extensible core metadata
schema, based on the well-known Dublin Core specification
[4]. Furthermore, DSpace is multilingual, adaptable to
administrator’s needs and able to incorporate novel features.
One such feature is the utilization of controlled vocabularies so
as to better characterize and manipulate its items.
TABLE I. MAPPING TO SKOS ELEMENTS

XML element   Function                                  SKOS notion
<TERM>        The described term                        <skos:Concept>
<USER>        Thesaurus’ owner                          -
<CONTEXT>     Term’s label                              <skos:prefLabel lang="el">
<MT>          Microthesauri term                        <skos:broaderTransitive>
<ET> (a)      English translation                       <skos:prefLabel lang="en">
<ET>          Alternative English translation           <skos:altLabel lang="en">
<BT>          Broader term                              <skos:broader>
<NT>          Narrower term                             <skos:narrower>
<RT>          Related term                              <skos:related>
<UF>          Opposite of the Used Instead (USE) term   <skos:altLabel lang="el">
<SN>          A short description                       <skos:definition>
<DEWEY>       A number indicating the correspondence    <skos:notation>
              to Dewey system

a. The first occurrence of the <ET> element is considered the preferred translation.
¹⁴ http://swig.hpclab.ceid.upatras.gr/SKOS?action=AttachFile&do=get&target=ekt_to_skos.rdf
<TERM>
  <CONTEXT>αστικά δικαστήρια</CONTEXT>
  <USER>EKT</USER>
  <MT>Νομικές Επιστήμες</MT>
  <ET>civil courts</ET>
  <BT>δικαστήρια</BT>
  <NT>ειρηνοδικεία</NT>
  <NT>πρωτοδικεία</NT>
  <UF>βλ. πολιτικά δικαστήρια</UF>
  <RT>πολιτική δικονομία</RT>
  <SN>some description</SN>
  <dewey>347</dewey>
</TERM>
Figure 2. HTML node tree.
A. Controlled Vocabularies in DSpace
DSpace supports controlled vocabularies in order to provide
and restrict a set of keywords that end-users utilize for
describing, searching and browsing items. These keywords are
organized in the form of a tree (taxonomy) which becomes
available to the end-user during the search or submission
process (see Fig. 2).
Controlled vocabularies are fed to DSpace as simple XML
files: there is one such file per ingested vocabulary. But in
contrast to the general multilingual philosophy of DSpace, the
controlled vocabulary facility lacks multilingual
support. To solve this, we have augmented DSpace with
the ability to support any number of translations for each
vocabulary. Each vocabulary’s translation is fed to, and thus
handled by, the system as a separate XML file. Consequently,
when end-users select a language for the DSpace interface, they
automatically choose the translation in which all available
controlled vocabularies will appear.
In order for a controlled vocabulary to be recognized by the
DSpace system, the former should be structured according to a
specific format (“DSpace node schema”). This means that in
order to make DSpace able to support vocabularies in different
formats -other than the “DSpace node schema”- an
intermediate transformation is needed.
According to the DSpace node schema, all information
about a vocabulary term is enclosed in a <node> element
(see Fig. 3). Only hierarchical relationships (narrower in
meaning) can be expressed, using the sub-element
<isComposedBy>. Moreover, a simple annotation
mechanism is provided by the optional sub-element
<hasNote>.
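To make the structure concrete, here is a small Python sketch that builds a node-schema fragment from nested tuples. It only uses the three element names mentioned above; the helper name, the note text and the relative order of <isComposedBy> and <hasNote> inside a <node> are assumptions of this sketch, so the actual DSpace schema should be consulted before relying on it.

```python
import xml.etree.ElementTree as ET


def build_node(node_id, label, children=(), note=None):
    """Return a <node> element.

    children is a sequence of (id, label[, children[, note]]) tuples
    describing the narrower terms of this node.
    """
    node = ET.Element("node", {"id": node_id, "label": label})
    if children:
        # Narrower terms are nested inside a single <isComposedBy>.
        composed = ET.SubElement(node, "isComposedBy")
        for child in children:
            composed.append(build_node(*child))
    if note:
        # Optional annotation; its position is an assumption of this sketch.
        ET.SubElement(node, "hasNote").text = note
    return node


# Labels taken from Fig. 3; the note text is purely illustrative.
root = build_node(
    "acmccs98", "ACMCCS98",
    children=[
        ("A.", "General Literature", [
            ("A.0", "GENERAL"),
            ("A.1", "INTRODUCTORY AND SURVEY", (), "illustrative note"),
        ]),
    ],
)
print(ET.tostring(root, encoding="unicode"))
```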
B. Support of SKOS in DSpace
The Odisseia Research at the University of Minho in
Portugal has implemented an add-on for version 1.4.2 of
DSpace [3] which augments this digital repository system with
the ability to support controlled vocabularies expressed in
SKOS. In particular, the provided add-on makes the following
changes:
<node id="acmccs98" label="ACMCCS98">
  <isComposedBy>
    <node id="A." label="General Literature">
      <isComposedBy>
        <node id="A.0" label="GENERAL"/>
        <node id="A.1" label="INTRODUCTORY AND SURVEY"/>
        …
      </isComposedBy>
    </node>
    …
  </isComposedBy>
</node>
Figure 3. The DSpace node schema.
• Enhances the DSpace inherent node schema so as to
manipulate related and preferred (use-instead) terms.
• Allows support for SKOSified thesauri.
• Offers the ability to assign a different vocabulary to each
DSpace Community.
Apart from the last feature, which is beyond the scope of
this work, we have successfully accommodated the first two in
the latest version of DSpace (DSpace 1.6). The modifications
we had to make were minor and did not affect the way
DSpace handles SKOSified vocabularies.
The add-on actually alters the controlled vocabulary
ingestion process in DSpace. It adds an intermediate
transformation step, implemented by an appropriate XSLT file
(see Fig. 4). As a result, the support of SKOSified vocabularies
is finally achieved through two successive XSL
transformations:
• The first applies to the original SKOS file and
produces a valid DSpace node schema.
• The second is responsible for converting the inherent
node schema to an HTML node tree (taxonomy).
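As a rough illustration of the second step, the following Python sketch turns a node-schema document into a nested HTML list. It is not a port of vocabulary2html.xsl: the use of <ul>/<li> elements and the function name are assumptions, and the markup that DSpace actually produces may differ.

```python
import xml.etree.ElementTree as ET


def node_to_html(node):
    """Render a <node> and its narrower terms as a nested <li> element."""
    li = ET.Element("li")
    li.text = node.get("label")
    composed = node.find("isComposedBy")
    if composed is not None:
        ul = ET.SubElement(li, "ul")
        for child in composed.findall("node"):
            ul.append(node_to_html(child))
    return li


# Node schema fragment based on Fig. 3.
NODE_SCHEMA = """<node id="acmccs98" label="ACMCCS98">
  <isComposedBy>
    <node id="A." label="General Literature">
      <isComposedBy>
        <node id="A.0" label="GENERAL"/>
        <node id="A.1" label="INTRODUCTORY AND SURVEY"/>
      </isComposedBy>
    </node>
  </isComposedBy>
</node>"""

tree = node_to_html(ET.fromstring(NODE_SCHEMA))
print(ET.tostring(tree, encoding="unicode"))
```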
This approach, however, has a number of problems,
which became evident when we tried to import a real
SKOS thesaurus into a working DSpace instance enhanced with
this add-on. These problems are explained in the
following section.
C. The TGT Thesaurus in DSpace
After we had successfully SKOSified the TGT thesaurus,
we attempted to incorporate it in a DSpace 1.6 working
instance. The result was not satisfactory as we faced two kinds
of problems, concerning the construction of the HTML node
tree:
1. Some terms appeared in the wrong place of the
taxonomy (wrong depth level or repetitions of terms).
2. A number of terms, although present in the SKOS file,
were missing from the tree hierarchy.
Figure 4. The controlled vocabulary ingestion process: SKOS vocabulary (RDF/XML) → XSL transformation (vocabularySKOS2node.xsl) → DSpace node schema (XML) → XSL transformation (vocabulary2html.xsl) → HTML node tree, used by the submission process and the subject search.
The main reason behind these problems was the inability of
the provided XSL transformations to deal with every possible
relationship among the described concepts (e.g., there was no
provision for broader terms). These problematic
transformations, combined with the non-exhaustive (but not
semantically inconsistent) implementation of TGT, in which
not every possible relationship is asserted, made the situation
even worse. In particular, we noticed that thesaurus terms
that do not exist as stand-alone concepts and are only
referenced through a relation by another concept (hierarchical,
associative or equivalence) fail to appear in the final
HTML tree. If the TGT thesaurus were complete, the problem of
“missing” terms would probably not exist.
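One way to spot such terms before ingestion is to compare the concepts that are described in the SKOS file with those that are merely referenced. The Python sketch below assumes that relations are serialized as rdf:resource attributes on the child elements of each skos:Concept; the function name and the example URIs are hypothetical.

```python
import xml.etree.ElementTree as ET

SKOS = "http://www.w3.org/2004/02/skos/core#"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def referenced_but_undescribed(rdf_xml):
    """Return URIs that are targets of a relation but never described."""
    root = ET.fromstring(rdf_xml)
    described, referenced = set(), set()
    for concept in root.iter(f"{{{SKOS}}}Concept"):
        described.add(concept.get(f"{{{RDF}}}about"))
        for prop in concept:
            target = prop.get(f"{{{RDF}}}resource")
            if target is not None:
                referenced.add(target)
    return referenced - described


SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <skos:Concept rdf:about="http://example.org/tgt/courts">
    <skos:narrower rdf:resource="http://example.org/tgt/civil_courts"/>
  </skos:Concept>
  <skos:Concept rdf:about="http://example.org/tgt/civil_courts">
    <skos:narrower rdf:resource="http://example.org/tgt/county_courts"/>
    <skos:related rdf:resource="http://example.org/tgt/civil_procedure"/>
  </skos:Concept>
</rdf:RDF>"""

missing = referenced_but_undescribed(SAMPLE)
print(sorted(missing))
```

In this example the two URIs that appear only as relation targets are reported, which corresponds to the terms that would be absent from the HTML tree.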
To tackle the aforementioned issues, we initially modified
the XSL transformation that is responsible for converting the
imported SKOS files into the DSpace node schema. In this
way, we managed to handle the first problem, whereas the
second remains unsolved.
More specifically, the transformation implemented by the
provided add-on was replaced by a new one, which was
deployed so as to carefully manipulate all relationships
between concepts. As a result we had neither repetitions of
terms nor wrong placement of them. DSpace users are now
presented with an accurate taxonomy, at least as far as its
structure is concerned. Top concepts appear as top categories in
the constructed tree and each sub-term lies beneath them. In
other words, each term is placed under its broader concept. A
part of the taxonomy tree that is produced by the SKOSified
TGT thesaurus when imported in DSpace 1.6 is shown in Fig. 5.
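The placement rule can be sketched in Python as follows. This is not the modified XSL transformation itself but an illustration of the same idea under stated assumptions: concepts are given as a dictionary keyed by hypothetical identifiers, both broader and narrower assertions are merged into one parent-to-children map, relations to undescribed concepts are ignored, and each concept is emitted only once.

```python
from collections import defaultdict
import xml.etree.ElementTree as ET


def build_taxonomy(concepts, root_id="tgt", root_label="TGT"):
    """Build a DSpace-style node tree where each term sits under its broader one."""
    children = defaultdict(set)
    has_parent = set()
    for cid, info in concepts.items():
        # A broader assertion on the child and a narrower assertion on the
        # parent describe the same edge, so both feed the same map.
        for parent in info.get("broader", ()):
            if parent in concepts:
                children[parent].add(cid)
                has_parent.add(cid)
        for child in info.get("narrower", ()):
            if child in concepts:
                children[cid].add(child)
                has_parent.add(child)

    root = ET.Element("node", {"id": root_id, "label": root_label})
    top = ET.SubElement(root, "isComposedBy")
    placed = set()

    def emit(container, cid):
        if cid in placed:  # prevents repetitions and guards against cycles
            return
        placed.add(cid)
        node = ET.SubElement(container, "node",
                             {"id": cid, "label": concepts[cid]["label"]})
        pending = sorted(c for c in children[cid] if c not in placed)
        if pending:
            composed = ET.SubElement(node, "isComposedBy")
            for child in pending:
                emit(composed, child)

    # Top concepts are those without any broader concept.
    for cid in sorted(set(concepts) - has_parent):
        emit(top, cid)
    return root


CONCEPTS = {
    "courts": {"label": "δικαστήρια"},
    "civil_courts": {"label": "αστικά δικαστήρια",
                     "broader": ["courts"], "narrower": ["county_courts"]},
    "county_courts": {"label": "ειρηνοδικεία"},
}
taxonomy = build_taxonomy(CONCEPTS)
print(ET.tostring(taxonomy, encoding="unicode"))
```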
D. Proposed Solutions
A possible and more effective solution to the problem of
“missing” terms would be to consider thesauri as ontologies.
Besides, SKOS itself is defined as an ontology in the Web
Ontology Language (OWL [2]). Such a consideration
would allow for programmatic access to the thesaurus’
elements. In particular, by exploiting the OWL API, a simpler
way to construct the vocabulary’s node tree would be possible,
in contrast to the more complex one offered by the XSL
transformation. Furthermore, the new constructs offered by the
latest version of OWL (OWL 2) would allow for the expression
of richer semantic conditions in SKOS, as stated in [8].
Another gain in handling the SKOS thesaurus as an OWL
ontology is the possibility to apply OWL reasoners (like
FaCT++ or Pellet). Such a reasoning-based approach allows for
inferencing and thus for the automatic handling of non-asserted
relationships. Consequently, an inference-based classification
and rendering of the thesaurus could be achieved, resulting in
no “missing” terms in the tree hierarchy.
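The kind of inference meant here can be illustrated with a deliberately small Python sketch. It is not an OWL reasoner and covers only two facts from the SKOS data model: skos:narrower is the inverse of skos:broader, and skos:related is symmetric. The triple representation and the identifiers are assumptions of this sketch; a reasoner such as Pellet or FaCT++ would derive these and many other entailments from the ontology itself.

```python
def infer_inverses(triples):
    """Add the broader/narrower inverses and the symmetric related pairs."""
    inferred = set(triples)
    for subject, prop, obj in triples:
        if prop == "broader":
            inferred.add((obj, "narrower", subject))
        elif prop == "narrower":
            inferred.add((obj, "broader", subject))
        elif prop == "related":
            inferred.add((obj, "related", subject))
    return inferred


# Only one direction of each relationship is asserted.
ASSERTED = {
    ("civil_courts", "broader", "courts"),
    ("civil_courts", "narrower", "county_courts"),
    ("civil_courts", "related", "civil_procedure"),
}
closure = infer_inverses(ASSERTED)
for triple in sorted(closure - ASSERTED):
    print(triple)
```

With the inverse statements available, a term such as county_courts gains an explicit broader link and no longer depends on being described as a stand-alone concept to find its place in the hierarchy.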
V. CONCLUSIONS AND FUTURE WORK
In this work we tried to show the importance of SKOS, as a
data model able to transfer knowledge organization systems to
the Semantic Web in a quick and simplified manner. Many
popular thesauri have either already migrated to SKOS or are
working on this task.
Following this trend in library science, we took on the
task of converting the TGT thesaurus, a controlled vocabulary
targeted for use by Greek institutions, to SKOS. Afterwards,
we tried to see how such a SKOSified thesaurus can be
exploited by a digital repository system, which manages any
kind of educational and cultural content. During the SKOSified
vocabulary ingestion process we faced several problems,
mostly originating from shortcomings of the applied
add-on. Another cause of these problems was the
non-exhaustive description of the thesaurus. To fix some of these
issues, we successfully modified the provided XSL
transformation. Nevertheless, the problem of “missing” terms
remains unsolved.
As future work we intend to utilize Semantic Web
technologies in order to manipulate controlled vocabularies in a
more refined way. By handling thesauri as
ontologies, our final objective is to make them manageable by
OWL reasoners. In this way, we expect at least to overcome the
problem of “missing” terms in the produced HTML hierarchy,
given that inferred relationships can fill in missing descriptions
in a thesaurus.
Figure 5. The TGT thesaurus in DSpace 1.6.
REFERENCES
[1] M. van Assem, V. Malaisé, A. Miles, and G. Schreiber, “A Method to
Convert Thesauri to SKOS”, Proc. 3rd European Semantic Web
Conference (ESWC 2006), Springer Berlin/Heidelberg, 2006, pp. 95-106,
doi: 10.1007/11762256_10
[2] S. Bechhofer, F. van Harmelen, J. Hendler, I. Horrocks, D.
McGuinness, P. Patel-Schneider, and L. Stein, “OWL Web
Ontology Language: Reference”, W3C Recommendation, 2004.
http://www.w3.org/TR/owl-ref/
[3] S. Costa, M. Ferreira, and A. Alice, “Controlled-Vocabulary Add-on
Patch for DSpace 1.4.2”, 2007.
http://sourceforge.net/tracker/index.php?func=detail&aid=1833347&group_id=19984&atid=319984
[4] DCMI Usage Board, “DCMI Metadata Terms”, DCMI
Recommendation, 2008.
http://dublincore.org/documents/dcmi-terms/
[5] A. Isaac, and E. Summers, (eds), “SKOS Simple Knowledge
Organization System Primer”, W3C Proposed Recommendation,
2009. http://www.w3.org/TR/2009/WD-skos-primer-20090615/
[6] S. Jupp, S. Bechhofer, and R. Stevens, “A Flexible API and Editor for
SKOS”, Proc. 7th International Semantic Web Conference (ISWC
2008), 2008.
[7] G. Klyne, and J. J. Carroll, (eds), “Resource Description Framework
(RDF): Concepts and Abstract Syntax”, W3C Recommendation, 2004.
http://www.w3.org/TR/2004/REC-rdf-concepts-20040210/
[8] D. Koutsomitropoulos, and G. Solomou, “SKOS in OWL 2”, 2009.
http://swig.hpclab.ceid.upatras.gr/SKOS
[9] A. Miles, and S. Bechhofer, “SKOS Simple Knowledge Organization
System Reference”, W3C Recommendation, 2009.
http://www.w3.org/TR/skos-reference/
[10] T. Schandl, and A. Blumauer, “PoolParty: SKOS Thesaurus
Management Utilizing Linked Data”, Proc. Extended Semantic Web
Conference (ESWC 2010), Springer Berlin/Heidelberg, 2010, pp. 421-425,
doi: 10.1007/978-3-642-13489-0_36
[11] E. Summers, A. Isaac, C. Redding, and D. Krech, “LCSH, SKOS and
Linked Data”, Proc. International Conference on Dublin Core and
Metadata Applications (DC 2008), 2008.
[12] B. Zapilko, and Y. Sure, “Converting the TheSoz to SKOS”, GESIS
Technical Report 2009/07, GESIS - Leibniz Institute for the Social
Sciences, 2009.