Modified XML data model  

Source publication
Article
Full-text available
We present an XML-based data model that is deployed in a system for querying corpora with multiple layers of linguistic annotation. The model is based on the simple but effective idea of leaving each layer of annotation intact at annotation time and only relating the layers to each other at query time. Queries select parts of the layers or of the...
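The core idea lends itself to a minimal illustration. The following Python sketch (all names, the character-offset anchors, and the join predicate are assumptions made for illustration, not the system's actual data model) shows two annotation layers that are kept intact and only related to each other when a query asks for it:

```python
# Hypothetical sketch: each layer keeps its own stand-off spans over the
# primary data (the signal); layers are only joined when a query relates them.
from dataclasses import dataclass

@dataclass
class Span:
    start: int   # character offset into the primary data
    end: int
    label: str   # e.g. a part-of-speech tag or a syntactic category

def overlaps(a: Span, b: Span) -> bool:
    """True if two stand-off spans cover at least one common character."""
    return a.start < b.end and b.start < a.end

def relate(layer_a: list[Span], layer_b: list[Span]) -> list[tuple[Span, Span]]:
    """Join two independent annotation layers at query time via their anchors."""
    return [(a, b) for a in layer_a for b in layer_b if overlaps(a, b)]

# Two layers over the same signal "Peter sleeps", left intact until queried:
tokens = [Span(0, 5, "NNP"), Span(6, 12, "VBZ")]
phrases = [Span(0, 12, "S")]
print(relate(tokens, phrases))   # each token paired with the overlapping phrase
```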

Similar publications

Article
Full-text available
Efficient evaluation of XML query languages has become a crucial issue for XML exchange and integration.
Article
Full-text available
This paper provides an objective evaluation of the performance impacts of binary XML encodings, using a fast stream-based XQuery processor as our representative application. Instead of proposing one binary format and comparing it against standard XML parsers, we investigate the individual effects of several binary encoding techniques that are share...
Article
Full-text available
The popularity of multi-core systems has made software parallelization an important way to improve performance. As a mainstream XML query language, XQuery is the core of XML processing. It is critical to take full advantage of multi-core computing to improve XML processing performance through parallelization of XQuery. However, usually it is di...
Article
Full-text available
Providing services by integrating information available in web resources is one of the main goals of a mediation architecture. In this paper, we consider the standard wrapper-mediator architecture under the following hypothesis: (i) the information exchanged between wrappers and the mediator consists of XML documents, (ii) wrappers have limited r...
Article
Full-text available
We propose an XQuery cost model that is able to estimate the performance gain of source-level transformation. The cost of major language constructs, including FLWOR, quantified, path, element construction, and predicate expressions, is captured. The evaluation of optimization using existing real engines suffers from problems, such as lack of applic...

Citations

... However, they do not report on the creation of the annotation. Similarly, Eckart and Teich [14] focus on querying and representation only. Rehm et al. [27] report response times of up to 3 hours for typical queries. ...
... This section examines whether eXist [18] with the AnnoLab extensions [12], MonetDB/XQuery [4,2,19], and Galax/GalaTex [13,9] can provide the facilities given in section 4. The products have been chosen because they implement XQuery, they are freely available, and they support features like stand-off annotations and full-text search - though not necessarily at the same time. An overview of other XQuery implementations, their capabilities, and how they fit in can be found in [22]. ...
... AnnoLab [12] provides additional functions for working with stand-off annotations. These functions allow nodes to be related to each other based on their stand-off anchors, e.g. by testing for overlap or containment. ...
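A small, hypothetical sketch of the kind of anchor-based overlap and containment tests described in this snippet (the function names and the (start, end) tuple representation are assumptions, not the actual extension API):

```python
# Illustrative stand-off anchor predicates: overlap and containment tests
# over (start, end) character offsets into the annotated signal.
Anchor = tuple[int, int]

def overlaps(a: Anchor, b: Anchor) -> bool:
    """Nodes overlap if their anchored regions share at least one character."""
    return a[0] < b[1] and b[0] < a[1]

def contains(outer: Anchor, inner: Anchor) -> bool:
    """An anchored node contains another if it fully covers its region."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]

sentence = (0, 12)            # e.g. a syntax node spanning "Peter sleeps"
tokens = [(0, 5), (6, 12)]    # token nodes from a separate layer
print([t for t in tokens if contains(sentence, t)])   # both tokens
```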
Article
Full-text available
XML has become the de-facto standard for representing linguistically annotated corpora. It seems safe to assume that storing and querying an XML-encoded, annotated corpus in an XML database is a straightforward procedure. In reality, however, it is not. This article aims to provide guidelines for deciding whether to use an XML database and how to choose a suitable product. To this end we examine the following questions: Which aspects should be considered before choosing to store an XML-encoded annotated corpus in an XML database? Which facilities does a database need to provide in order to be suitable for storing and querying annotated corpora? Do current XML databases offer these facilities, and, if not, can they be added?
... As each annotation layer is contained in one XML document, a corpus represents a special form of a multi-rooted tree, i. e., a collection of trees that do not share nodes except for the leaves that contain the annotated primary data. AnnoLab (Eckart and Teich, 2007) is an XML/XQuery-based corpus query and management framework that was specifically designed to deal with multi-rooted trees. To avoid problems regarding projectiveness and overlapping segments, AnnoLab uses a stand-off adaptation of the XML data-model. ...
Conference Paper
Full-text available
We present an approach for querying collections of heterogeneous linguistic corpora that are annotated on multiple layers using arbitrary XML-based markup languages. An OWL ontology provides a homogenising view on the conceptually different markup languages so that a common querying framework can be established using the method of ontology-based query expansion. In addition, we present a highly flexible web-based graphical interface that can be used to query corpora with regard to several different linguistic properties such as, for example, syntactic tree fragments. This interface can also be used for ontology-based querying of multiple corpora simultaneously.
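The ontology-based query expansion mentioned in this abstract can be sketched roughly as follows; the concept names, element names, and mapping table are invented for illustration and do not reflect the actual ontology or markup schemes:

```python
# Hypothetical ontology-based query expansion: an ontology concept is mapped
# to the element names used by different XML markup schemes, and a query over
# the concept is expanded into one scheme-specific query per markup language.
CONCEPT_TO_ELEMENTS = {
    "Token": {"tiger": "t", "tei": "w", "custom": "token"},
    "Sentence": {"tiger": "s", "tei": "s", "custom": "sentence"},
}

def expand(concept: str) -> list[str]:
    """Expand a concept-level query into one XPath per known markup scheme."""
    return [f"//{elem}" for elem in CONCEPT_TO_ELEMENTS[concept].values()]

print(expand("Token"))   # ['//t', '//w', '//token']
```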
... For the following example [9] assume an alignment layer en de.align (see figure 3); its segments refer to two signals de (Deutsch, German) and en (English). Another layer en.pos contains token elements that have a pos feature (part-of-speech data for en). ...
... As each annotation layer is contained in one XML document, a corpus represents a special form of a multi-rooted tree, i. e., a collection of trees that do not share nodes except the leaves containing annotated data. AnnoLab [9] is an XML/XQuery-based corpus query and management framework designed to deal with multi-rooted trees. An abstract data-model for corpus annotation was synthesized from various approaches (e. g., [4], [12], [14]) and consists of four tiers: (i) signal tier (annotated data), (ii) structure tier (annotation structure), (iii) feature tier (annotation features), (iv) location tier (a mapping between signal and structure tiers). ...
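The four-tier model summarised in the snippet above can be sketched roughly as follows (all field names and the example layer are assumptions made for illustration, not the framework's actual classes):

```python
# Hypothetical four-tier annotation layer: signal (annotated data), structure
# (annotation structure), feature (annotation features), location (mapping
# between structure and signal).
from dataclasses import dataclass, field

@dataclass
class Node:                                  # structure tier
    id: str
    children: list["Node"] = field(default_factory=list)

@dataclass
class Layer:
    signal: str                              # signal tier: the annotated data
    root: Node                               # structure tier
    features: dict[str, dict[str, str]]      # feature tier: node id -> features
    locations: dict[str, tuple[int, int]]    # location tier: node id -> span

# An "en.pos"-style layer with one token node carrying a pos feature:
en_pos = Layer(
    signal="Peter sleeps",
    root=Node("s1", [Node("tok1")]),
    features={"tok1": {"pos": "NNP"}},
    locations={"tok1": (0, 5)},
)
print(en_pos.features["tok1"]["pos"],
      en_pos.signal[slice(*en_pos.locations["tok1"])])   # NNP Peter
```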
Article
Full-text available
We present an approach for querying collections of heterogeneous linguistic corpora that are annotated on multiple layers using arbitrary XML-based markup languages. An OWL ontology is used to homogenise the conceptually different markup languages so that a common querying framework can be established.
... The back-end hosts the JSP files and related data. It accesses two different databases, the corpus database and the system database, as well as a set of ontologies and additional components. The corpus database is an XML database, extended by the AnnoLab system (Eckart and Teich, 2007), in which all resources and metadata are stored. The system database is a relational database that contains all data about user accounts, resources (i. ...
Article
Full-text available
We present SPLICR, the Web-based Sustainability Platform for Linguistic Corpora and Resources. The system is aimed at people who work in Linguistics or Computational Linguistics: a comprehensive database of metadata records can be explored in order to find language resources that could be appropriate for one's specific research needs. SPLICR also provides a graphical interface that enables users to query and to visualise corpora. The project in which the system is developed aims at sustainably archiving the ca. 60 language resources that have been constructed in three collaborative research centres. Our project has two primary goals: (a) To process and to archive sustainably the resources so that they are still available to the research community in five, ten, or even 20 years' time. (b) To enable researchers to query the resources both on the level of their metadata as well as on the level of linguistic annotations. In more general terms, our goal is to enable solutions that leverage the interoperability, reusability, and sustainability of heterogeneous collections of language resources.
Conference Paper
Data models and encoding formats for syntactically annotated text corpora need to deal with syntactic ambiguity; underspecified representations are particularly well suited for the representation of ambiguous data because they allow for high informational efficiency. We discuss the issue of being informationally efficient, and the trade-off between efficient encoding of linguistic annotations and complete documentation of linguistic analyses. The main topic of this article is a data model and an encoding scheme based on LAF/GrAF (Ide and Romary, 2006; Ide and Suderman, 2007) which provides a flexible framework for encoding underspecified representations. We show how a set of dependency structures and a set of TiGer graphs (Brants et al., 2002) representing the readings of an ambiguous sentence can be encoded, and we discuss basic issues in querying corpora which are encoded using the framework presented here.
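The underspecified encoding of an ambiguous sentence described in this abstract can be illustrated with a rough sketch; it mirrors the idea of storing alternative readings over shared nodes, not the actual LAF/GrAF XML serialisation, and all node and edge labels are invented:

```python
# Hypothetical underspecified encoding: two alternative dependency readings of
# a PP-attachment ambiguity are stored as separate edge sets over the same
# token nodes, so shared structure is encoded only once.
tokens = {"t1": "saw", "t2": "man", "t3": "telescope"}   # shared nodes

shared = [("t1", "t2", "object")]                        # edges in both readings
readings = {
    "r1": [("t1", "t3", "instrument")],   # "with the telescope" attaches to the verb
    "r2": [("t2", "t3", "modifier")],     # "with the telescope" attaches to the noun
}

def edges(reading: str):
    """Resolve one reading: the shared edges plus the reading-specific ones."""
    return shared + readings[reading]

for r in readings:
    print(r, edges(r))
```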
Article
Full-text available
This paper describes salient aspects of the OntoSem lexicon of English, a lexicon whose semantic descriptions can either be grounded in a language-independent ontology, rely on extra-ontological expressive means, or exploit a combination of the two. The variety of descriptive means, as well as the conceptual complexity of semantic description to begin with, necessitates that OntoSem lexicons be compiled primarily manually. However, once a semantic description is created for a lexeme in one language, it can be reused in others, often with little or no modification. Said differently, the challenge in building a semantic lexicon is describing semantics; once the semantics are described, it is relatively straightforward to connect given meanings to the appropriate head words in other languages. In this paper we provide a brief overview of the OntoSem lexicon and processing environment, orient our approach to lexical semantics among others in the field, and describe in more detail what we mean by the largely language-independent lexicon. Finally, we suggest reasons why our resources might be of interest to the larger community.
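The reuse idea described in this abstract can be illustrated with a purely hypothetical sketch; the lexeme identifiers, concept name, and case roles are invented examples, not OntoSem data:

```python
# Hypothetical cross-lingual reuse of a semantic description: once a lexeme's
# semantics are described (here, a link to an ontology concept plus roles),
# head words in other languages can simply be linked to the same description.
SEMANTICS = {"buy-1": {"concept": "BUY", "case_roles": ["agent", "theme"]}}

LEXICONS = {
    "en": {"buy": "buy-1"},
    "de": {"kaufen": "buy-1"},   # reuses the same semantic description
}

def describe(lang: str, word: str) -> dict:
    """Look up a head word and return its shared semantic description."""
    return SEMANTICS[LEXICONS[lang][word]]

print(describe("en", "buy") is describe("de", "kaufen"))   # True: one description
```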