User-System Cooperation in Document Annotation
based on Information Extraction
Fabio Ciravegna1, Alexiei Dingli1, Daniela Petrelli2 and Yorick Wilks1
1Department of Computer Science, University of Sheffield, Regent Court, 211 Portobello
Street, S1 4DP, Sheffield, UK, email {fabio|alexiei|yorick}@dcs.shef.ac.uk
2Department of Information Studies, University of Sheffield, Regent Court, 211 Portobello
Street, S1 4DP, Sheffield, UK, email D.Petrelli@shef.ac.uk
Abstract. The process of document annotation for the Semantic Web is
complex and time consuming, as it requires a great deal of manual annotation.
Information extraction from texts (IE) is a technology used by some very recent
systems to reduce the burden of annotation. The integration of IE systems in
annotation tools is quite a new development, and there is still a need to consider
the impact of the IE system on the whole annotation process. In this paper we
first discuss a number of requirements for the use of IE as support for
annotation. We then present and discuss a model of interaction that addresses
these issues, together with Melita, an annotation framework that implements a
methodology for active annotation for the Semantic Web based on IE. Finally
we present an experiment that quantifies the gain obtained by using IE to
support human annotators.
1. Introduction
The effort behind the Semantic Web (SW) is to add information to web documents in
order to give access to knowledge instead of unstructured material, allowing knowledge to be
managed in an automatic way. Much effort has been spent in developing
methodologies for enriching documents, mainly requiring manual insertion of
annotations. It is reasonable to expect users to manually annotate new documents up to
a certain degree, but annotation is a slow, time-consuming process that involves high
costs. Therefore it is vital for the Semantic Web to produce automatic or semi-
automatic methods for document enrichment, either to help in annotating new
documents or to extract additional information from existing unannotated or partially
annotated documents. Information Extraction from texts (IE) can provide the
backbone for such tools. IE is an automatic method for locating important facts in
electronic documents. In the SW context, IE can be used for document annotation
either in an automatic way (via unsupervised extraction of information) or in a semi-
automatic way (e.g. as support for human annotators in locating relevant facts in
documents via information highlighting).
IE is an area of Natural Language Processing with a long history. Its development has
been mainly driven by the MUC conferences, a series of competitive exercises
supported by DARPA. One of the main issues in IE is the way in which applications are
defined. The main constraint in the MUC conferences is that applications are to be
developed in a short time (e.g. one month). The MUCs represent a scenario in which
the cost of a new application is not considered important: while the development
time was bounded, no upper bound was put on either the amount of personnel needed
for the application or the skills required [1]. As a result, most of the systems could be
ported by IE experts only.
The Semantic Web represents a completely different scenario, one where cost is the
central issue. The rapid and uncontrolled growth of the Web in recent years is mainly due to
the simplicity and effectiveness of HTML. Everyone can make their own pages available
at nearly no cost (the cost of a PC and a telephone line) with very limited skills
(i.e. mainly the ability to use a web editor). If we want the Semantic Web to
become the widespread evolution of the current Web, we have to provide
methodologies with the same type of requirement: portability with limited skills and
no (or very limited) cost. The requirement extends to all the tools necessary
for building the SW. If IE is to be used for annotation, it must be usable at no cost
(exactly as web browsers are free) and with limited skills. The kind of IE technology that
requires experts in IE can be afforded only by big companies and/or big service
providers (e.g. search engine companies) and can be used for generic indexing.
AeroDAML [2] is an example of a tool that requires an expert to adapt the system to
new applications and that is used for very generic IE for the Web (e.g. named entity
recognition). The situation is different in scenarios with distributed agents that
provide local services: consider, for example, a university department wanting to provide a SW
service for its Web pages. In this case it will need to define a specific indexing
service itself. The available budget here is very low and the available skills are
quite limited (e.g. a student acting as would-be web designer and a system manager). No
experts in IE can be envisaged here, nor does the budget allow hiring an expensive
external company. From an IE perspective on the SW, there is a clear need to allow
users with no knowledge of IE to build applications (e.g. specialized annotation
services for their set of pages).
Adaptive IE systems (IES) use Machine Learning to learn how to adapt to new
applications/domains using only annotated corpora [3][4][5]. They can be adapted to
provide annotations for the SW: they monitor the annotations inserted by the user and
learn how to reproduce them. When equivalent cases are encountered, annotations are
automatically inserted by the IES and users just have to check them. Some new
annotation tools for the Semantic Web are starting to include adaptive IE as support for
annotation. At the Open University, the MnM annotation tool [6] interfaces with both
the UMass IE tools (www-nlp.cs.umass.edu/software/badger.html) and Sheffield's
Amilcare (www.dcs.shef.ac.uk/~fabio/Amilcare.html). At the University of Karlsruhe the
Ontomat annotizer [7] interfaces with Sheffield's Amilcare. The current methodology
of interaction between annotation tool and IES is still quite simplistic, and this also influences
the way in which users and annotation system interact. Generally a batch
interaction mode is adopted, i.e., the user annotates a batch of texts and the IE tool is
trained on the whole batch. Then annotation is started on another batch of texts and
the IE system proposes annotations to users when cases similar to those found in the
training batches are recognized. Although the use of adaptive IE constitutes quite an
improvement with respect to the completely manual annotation approach, in our
opinion the tremendous potential of adaptive IE technologies is not fully
exploited. We believe that it is time to consider the way in which the interaction can
be organized in order to both maximize effectiveness in the annotation process and
minimize the burden of annotating/correcting on the user's side. We expect that such a
change will also influence the user-annotation tool interaction style by moving from a
simplistic user-system interaction to real user-system collaboration, where
collaboration means working together for a common goal, with all partners
contributing their own capabilities and skills. We propose two user-centered criteria
as measures of the appropriateness of this collaboration: timeliness and intrusiveness
of the IE process. The first measures the ability to react to user annotation: how
quickly the system learns from user annotations. The second represents the degree to
which the system bothers the user, for example because it requires CPU time (and
therefore stops the user's annotation activity) or because it suggests wrong annotations.
Timeliness: when the IE system (IES) is trained on blocks of texts, there is a time
gap between the moment in which annotations are inserted by the user and the
moment in which they are used by the system for learning. User and system work in
strict sequence, one after the other. This sequential scheduling hampers true
collaboration. If a batch of texts contains many similar documents, users may spend a
considerable amount of time annotating similar documents without receiving
feedback from the IES, for the simple reason that no learning is scheduled for that
moment. The IES does not support the user, nor is the user effort very useful,
since similar cases are of very little use to the learner: they cannot offer the
variety of phenomena that empowers learning. The bigger the batch of texts,
the worse the lack of timeliness becomes. A true collaboration implies (re)training
of the system after every annotated text released by the user. Training
can take a considerable amount of CPU time, and can therefore stop the annotation
session for a while. A positive collaboration requires that the user's time not be
constrained by the IES training time (otherwise intrusiveness increases). We believe
that intelligent scheduling is needed to keep learning timely without increasing intrusiveness.
Intrusiveness: the IE system can bother users in a number of ways, for example by
proposing annotations generated by unreliable rules (e.g. rules induced from an insufficient
number of cases). A positive collaboration requires enabling users to tune the
proactivity of the IE system in order to avoid intrusiveness.
In this paper we present an IE-based annotation methodology for the Semantic
Web that takes into account the problems of timeliness and intrusiveness mentioned
above. Moreover, we quantitatively evaluate the support provided by IE in a
simulated text annotation experiment.
2. Towards a new interaction model
We propose an interaction model that aims at providing non-intrusive and timely
support for users during the annotation process. In this section we describe the way in
which user and system interact and discuss how such requirements are met by our
model.
2.1. User-system interaction
We split the annotation process into two main phases from the IES point of view: (1)
training and (2) active annotation with revision. In user terms the first corresponds to
unassisted annotation, while the latter mainly requires correction of annotations
proposed by the IES.
During training users annotate texts without any contribution from the IES. Here
the IES uses the user annotations to train its learner. During this phase the IES is
constantly inducing rules. We can define two sub-phases: (a) bootstrapping and (b)
training with verification. During bootstrapping the only IES task is to learn from the
user annotations. The length of this sub-phase can vary, depending on the minimum
number of examples needed for a minimum of training. During the second
sub-phase, the user continues with the unassisted annotation, but the IES behaviour
changes, as it uses its induced rules to silently compete with the user in annotating the
document. The IES automatically compares its annotations with those inserted by the
user and calculates its accuracy. Missing annotations or mistakes are used to retrain
the learners. The training phase ends when the IES accuracy reaches the user's
preferred level of pro-activity. It is then possible to move to the next phase:
active annotation.
The active annotation with revision phase is heavily based on the IES
suggestions, and the user's main task is correcting and integrating the suggested
annotations (i.e. removing and adding annotations). Human actions are fed back
to the IES for retraining. This is the phase where the real system-user cooperation
takes place: the system helps the user in annotating; the user feeds back the mistakes
to help the system perform better.

Figure 1. The training with verification sub-phase. In this figure Amilcare is used as an example of adaptive IES.

In user terms this is where the added value of the
IES becomes apparent, because it heavily reduces the amount of annotation to insert
manually. This supervision task is much more convenient in terms of both cognition and
action: correcting annotations is simpler than annotating bare text, less time
consuming, and also likely to be less error prone.
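The two-phase cycle just described can be summarised in a short sketch. This is a minimal illustration only, written under the assumption of a hypothetical AdaptiveIES interface (train_on, annotate, accuracy) and user callbacks; it is not the actual Melita/Amilcare API.

```python
# Minimal sketch of the two-phase annotation cycle (assumed interfaces,
# not the actual Melita/Amilcare API).

def annotation_cycle(corpus, ies, user_annotate, user_revise,
                     min_bootstrap_docs=5, target_accuracy=0.75):
    accuracy = 0.0
    for i, text in enumerate(corpus):
        if i < min_bootstrap_docs or accuracy < target_accuracy:
            # Training phase: unassisted annotation by the user.
            gold = user_annotate(text)
            if i >= min_bootstrap_docs:
                # Training with verification: the IES silently competes with
                # the user and measures its own accuracy on the same text.
                accuracy = ies.accuracy(ies.annotate(text), gold)
        else:
            # Active annotation with revision: the user corrects and
            # integrates the suggestions proposed by the IES.
            gold = user_revise(text, ies.annotate(text))
        # In both phases the released annotations are used for retraining.
        ies.train_on(text, gold)
```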
2.2. Coping with Intrusiveness
The design of the interaction model aims to limit intrusiveness of the IES in a number
of ways. First of all, the IES does not require any specific annotation interface or any
specific adaptation by the user. It integrates into the usual user environment and
provides suggestions in a way that is both familiar and intuitive for the user. To some
extent users could even be unaware that the IES is working for them.
Secondly, intrusiveness as a side effect of proactivity is addressed, especially during
active annotation with revision, when the IES can bother users with unreliable
annotations. The requirement here is to enable users to tune the IES behaviour so that
the level of suggestions is appropriate. Some IES provide internal tuning methods for
balancing features such as precision and recall, or the minimum number of cases to be
covered in order to accept a rule for annotation. Such tuning methodologies are
designed for IE experts, since they require deep knowledge of the underlying IE
system. This is especially problematic because the user's goal is tuning the level of
intrusiveness in the annotation process, and very often there is no obvious
counterpart in the IES tuning methodology. For example, Amilcare allows users to
modify error thresholds for rules, the number of cases a rule must cover to be accepted,
and the balance of precision and recall in rule tuning: none of these corresponds directly to
tuning the level of intrusiveness (even if a large part of it relies on the precision/recall
balance). Moreover, the acceptable level of intrusiveness is subjective: some users
might like to receive suggestions largely regardless of their correctness, while
others do not want to be bothered unless suggestions are absolutely reliable. A user-
friendly interaction methodology requires enabling the user to select the
appropriate level of intrusiveness without having to cope with the complexity of tuning an
adaptive IE system. In our model the annotation interface bridges the qualitative
view of users (e.g. a request to be more/less active or accurate) with the specific IES
settings (e.g. changed error thresholds), as also suggested in [8]. This is important
because the annotation interface is a tool designed for specific user classes and
therefore able to elicit tuning requirements by using the correct terminology for the
specific context.

Figure 2. The active annotation with revision phase.
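As an illustration of this bridging, the sketch below maps a qualitative "more/less proactive" request onto internal learner settings. The setting names (max_rule_error, min_covered_cases, recall_bias) are purely illustrative assumptions and do not correspond to actual Amilcare parameters.

```python
# Illustrative mapping from a qualitative user request to IES settings.
# The setting names are assumptions, not real Amilcare parameters.

def clamp(x, lo, hi):
    return max(lo, min(hi, x))

def adjust_proactivity(settings, more_proactive):
    step = 1 if more_proactive else -1
    # Tolerating a higher rule error rate, requiring fewer supporting cases
    # and biasing towards recall all make the IES suggest more often.
    settings["max_rule_error"] = clamp(settings["max_rule_error"] + 0.05 * step, 0.0, 1.0)
    settings["min_covered_cases"] = max(1, settings["min_covered_cases"] - step)
    settings["recall_bias"] = clamp(settings["recall_bias"] + 0.1 * step, 0.0, 1.0)
    return settings
```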
Finally, the IES training requires CPU time, and this can slow down or even stop the
user activity. For this reason most current systems use a batch mode of training,
so as to limit training to specific moments (e.g. coffee time). As explained above, the
batch approach presents timeliness problems. We propose background learning to
provide timely support without intrusiveness. If we observe how time is spent in the
annotation process (select a document, manually annotate the document, save the
annotation), we notice that most of the user time is spent in the manual annotation
process. This is the right moment to train the IES in the background without the user
noticing it. In principle it is possible to treat every annotation event in the interface as
a request to train on a specific example, but this requires the ability to retract
annotations in case of user errors, making the interaction with the IES quite complex.
In our approach the IES works in the background with two parallel and asynchronous
processes. While the user annotates document n+1 the system learns the annotations
inserted in document n (i.e. the last one annotated). At the same time (i.e. as a separate
process) the IES applies the rules induced in the previous learning sessions (i.e. from
document 1 to document n-1) in order to extract information from document n (either
for suggesting annotations during active annotation or in order to silently test its
accuracy during unassisted annotation). The advantage is that there is no idle time for
the user, as the annotation of a document generally requires a great deal more time
than training on a single text.
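A minimal sketch of this scheduling is given below, assuming a hypothetical IES object with annotate and train_on methods and a user_annotate callback; it only illustrates the idea of training in the background while the user works on the next document.

```python
# Sketch of background training while the user annotates the next document.
# The ies object and user_annotate callback are assumed, not real APIs.
import queue
import threading

def background_annotation_loop(corpus, ies, user_annotate):
    released = queue.Queue()              # documents released by the user

    def trainer():
        while True:
            item = released.get()
            if item is None:              # sentinel: annotation session over
                break
            doc, gold = item
            ies.train_on(doc, gold)       # learn from the last annotated document

    worker = threading.Thread(target=trainer, daemon=True)
    worker.start()

    for doc in corpus:
        # Rules induced in previous learning sessions are applied before the
        # document is shown (suggestions, or a silent accuracy test).
        suggestions = ies.annotate(doc)
        gold = user_annotate(doc, suggestions)   # user time dominates here
        released.put((doc, gold))                # hand over for retraining

    released.put(None)
    worker.join()
```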
2.3. Coping with Timeliness
Timeliness means just-in-time learning from previous user annotations. Timeliness is
not fully obtained with the above interaction methodology: the IES annotation
capability always refers to rules learned from the entire annotated corpus except the
last document. This means that the IES is not able to help when two similar
documents are annotated in sequence. From the user's point of view such a situation is
equivalent to training on batches of two texts, and in this respect the collaboration between
the system and the user fails to be effective. We believe that timeliness is a matter
of perception on the user's side, not an absolute feature; therefore the only important
thing is that users perceive it. Considering that in many applications the order in
which documents are annotated is unimportant, in such cases it is possible to organize
the annotation order so as to avoid presenting similar documents in
sequence and therefore to hide the small lack of timeliness. In order to implement
such a feature we need a measure of similarity of texts from the annotation point of
view. The IES can be used to work out such a measure. At the end of each learning
session all the induced rules are applied to the unannotated part of the corpus so as to
identify two main subsets: texts where the available rules fire (i.e. annotations can be
added: the positive subset) and texts where they do not fire at all (uncovered texts:
the negative subset). Each text in the positive subset can be associated with a score given
by the number of annotations that can be added. The score can be used as an
approximation of similarity among texts: inserted annotations mean similarity with
respect to the part of the corpus annotated so far; no inserted annotations mean actual
difference. Such information can be used to make the timeliness more effective: a
completely uncovered document is always followed by a fairly well covered document. In
this way a difference between successive documents is very likely, and therefore the
probability that similar documents are presented in turn within the batch of two (i.e.
the blindness window of the system) is very low. Incidentally, this strategy also
tackles another major problem in annotation, i.e. user boredom, which can make the
user's productivity and effectiveness fall over time. Presenting users with
radically different documents avoids the boredom that comes from coping with very
similar documents in sequence.
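The scheduling heuristic can be sketched as follows; it is an illustration only, and assumes a hypothetical ies.annotate call that returns the annotations the current rules would insert in a document.

```python
# Sketch of the document-ordering heuristic: alternate uncovered texts with
# well-covered ones so that similar documents rarely appear in sequence.
# ies.annotate is an assumed stand-in for applying the induced rules.

def schedule_documents(unannotated, ies):
    scored = [(len(ies.annotate(doc)), doc) for doc in unannotated]
    negative = [doc for score, doc in scored if score == 0]   # uncovered texts
    positive = [doc for score, doc in                          # covered texts
                sorted(scored, key=lambda x: x[0], reverse=True) if score > 0]

    ordering = []
    while negative or positive:
        if negative:
            ordering.append(negative.pop(0))   # a completely uncovered document...
        if positive:
            ordering.append(positive.pop(0))   # ...is followed by a well-covered one
    return ordering
```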
In the next section a first implementation of the proposed interaction model is
described. We introduce both the IES used (Amilcare) and the annotation interface
(Melita). Finally we discuss how the current implementation meets the requirements
described above.
3. Adaptive IE in Amilcare
The model above requires an adaptive IES that cooperates closely with the user. In our
implementation we have used Amilcare. Amilcare is a tool for adaptive Information
Extraction from text (IE) designed to support active annotation of documents for
the Semantic Web. In its standard version it performs IE by enriching texts with XML
annotations, i.e. the system marks the extracted information with XML annotations. In
the Semantic Web version, in which it is supposed to interact with an annotation
tool, it leaves the text unchanged and returns the extracted information as a
triple <annotation, startPosition, endPosition>, so as to let the annotation tool decide how
to actually annotate the text. The only knowledge required for porting Amilcare to
new applications or domains is the ability to manually annotate the information to
be extracted in a training corpus. No knowledge of IE is necessary.
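The returned triples can be represented with a very small data structure; the sketch below is illustrative only (field names and the highlight call are assumptions, not the Amilcare or Melita API).

```python
# Illustrative representation of the <annotation, startPosition, endPosition>
# triples returned by the SW version; names are assumptions, not the real API.
from dataclasses import dataclass

@dataclass
class Annotation:
    tag: str      # e.g. "speaker", taken from the user-defined tag set
    start: int    # character offset where the extracted text begins
    end: int      # character offset where it ends

def apply_suggestions(text, triples, interface):
    # The annotation tool decides how to render each suggestion,
    # e.g. by colouring the corresponding span of the document.
    for a in triples:
        interface.highlight(a.tag, text[a.start:a.end], a.start, a.end)
```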
Adaptation starts with the definition of a tag-set for annotation, possibly organized as
an ontology where tags are associated with concepts and relations. Then users have to
manually annotate a corpus for training the learner. An annotation interface is to be
connected to Amilcare for annotating texts, e.g. using XML-based mark-up. As
mentioned, Amilcare has been integrated with a number of annotation tools so far,
including MnM [6] and Ontomat [7]. For example, MnM automatically converts the user
annotations into XML tags to train the learner. Amilcare's learner induces rules that
are able to reproduce such annotation. Amilcare can work in two modes: training,
used to adapt to a new application, and extraction, used to actually annotate texts. In
both modes, Amilcare first preprocesses texts using Annie, the shallow IE
system included in the Gate package ([9], www.gate.ac.uk). Annie performs text
tokenization (segmenting texts into words), sentence splitting (identifying sentences),
part-of-speech tagging (lexical disambiguation), gazetteer lookup (dictionary lookup)
and Named Entity Recognition (e.g. proper name spotting and classification).
When operating in training mode, Amilcare induces rules for information
extraction. The learner is based on (LP)2, a covering algorithm for supervised learning
of IE rules based on Lazy NLP [10][11]. This is a wrapper induction methodology
[12] that, unlike other wrapper induction approaches, uses linguistic information for
rule generalization. The learner starts by inducing wrapper-like rules that make no use of
linguistic information, where rules are sets of conjunctive conditions on adjacent
words. Then the linguistic information provided by Annie is used as the basis for rule
generalization: conditions on words are substituted with conditions on the linguistic
information (e.g. conditions matching the lexical category, the class provided
by the gazetteer, etc. [11]). All the generalizations are tested in parallel and the best k
generalizations are kept for IE. The idea is that linguistic-based generalization is
used only when the use of NLP information is reliable or effective. The measure of
reliability here is not linguistic correctness (which non-expert users cannot judge), but
effectiveness in extracting information using linguistic information as opposed to
using shallower approaches. Lazy NLP-based learners learn the best strategy
for each piece of information/context separately. For example, they may decide that using the
result of a part-of-speech tagger is the best strategy for recognizing the speaker in
seminar announcements, but not for spotting the seminar location. This strategy is quite
effective for analysing documents with mixed genres, quite a common situation in
web documents [13].
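To make the generalization step concrete, the sketch below shows a rule as a sequence of conditions on adjacent tokens and a generalised variant that conditions on linguistic features instead of surface words. It is a simplified illustration of the idea, not the actual (LP)2 data structures, and the token fields are assumptions.

```python
# Simplified illustration of wrapper-like rules and their linguistic
# generalisation (not the actual (LP)2 implementation; token fields assumed).

def rule_matches(rule, tokens, position):
    """rule: list of {field: value} conditions on consecutive tokens."""
    if position + len(rule) > len(tokens):
        return False
    for offset, condition in enumerate(rule):
        token = tokens[position + offset]
        if any(token.get(field) != value for field, value in condition.items()):
            return False
    return True

# A wrapper-like rule conditions on surface words only ...
word_rule = [{"word": "at"}, {"word": "4"}, {"word": "pm"}]
# ... while a generalised variant replaces word conditions with conditions on
# the linguistic information attached by the preprocessor (part of speech,
# gazetteer class), so it also covers "at 3 pm", "at 11 am", and so on.
generalised_rule = [{"word": "at"}, {"pos": "CD"}, {"gazetteer": "time_unit"}]
```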
The learner induces two types of rules: tagging rules and correction rules. A
tagging rule is composed of a left hand side, containing a pattern of conditions on a
connected sequence of words, and a right hand side that is an action inserting an XML
tag in the text (in the SW version no tag is actually inserted in the text; as mentioned, a
triple <annotation, startPosition, endPosition> is returned to the external annotation
interface). Correction rules correct imprecision, i.e. they shift misplaced tags to the
correct position. They are learnt from the mistakes made when attempting to re-annotate
the training corpus using the induced tagging rules. The output of the training phase is
a collection of rules for IE that is associated with the specific scenario. When working in
extraction mode, Amilcare receives as input a (collection of) text(s) with the
associated scenario (including the rules induced during the training phase). It
preprocesses the text(s) using Annie and then applies its rules, returning the
original text with the added annotations (or just the annotation triples in the SW
version).
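A correction rule can be pictured as a learned condition plus a boundary shift; the sketch below is only an illustration of the idea under assumed representations, not the internal (LP)2 format.

```python
# Illustration of correction rules shifting a misplaced boundary
# (assumed representations, not the internal (LP)2 format).

def apply_corrections(annotation, tokens, corrections):
    """annotation: {"tag": str, "start": int, "end": int} in token indices.
    corrections: (condition, shift) pairs learnt from re-annotation mistakes."""
    for condition, shift in corrections:
        if condition(tokens, annotation):
            annotation["end"] += shift   # e.g. extend a <time> tag over a trailing "pm"
            break
    return annotation

# Example of a learnt condition: the token right after the proposed end is "pm".
extend_over_pm = (lambda toks, a: a["end"] < len(toks) and toks[a["end"]] == "pm", 1)
```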
With Amilcare it is possible to define automatic or semi-automatic services for the
SW with limited skills (the ability to annotate texts) and limited cost (the
number of texts to be annotated for training is, as we will see, quite limited). For
example, the university department mentioned in the introduction could ask the
student creating the pages to annotate them. Amilcare would learn in the
background without requiring any specific adaptation except the definition of the
annotation set (necessary in any case for defining SW services). This is the reason
why some annotation tools include Amilcare as support for annotation.
4. The Melita framework
Melita is an ontology-based demonstrator for text annotation. The goal of Melita is
not to produce yet another annotation interface, but to demonstrate how it is possible
to actively interact with the IES in order to meet the requirements of timeliness and
tuneable pro-activity mentioned above. Melita's main control panel is depicted in
figure 3. It is composed of two main areas:
1. The ontology (left), representing the annotations that can be inserted; annotations
are associated with concepts and relations. A specific color is associated with each node
in the ontology (e.g. "speaker" is depicted in blue).
2. The document to be annotated (center-right). Annotations are inserted by selecting a
portion of text with the mouse and then clicking on a node in the ontology. Inserted
annotations are shown by turning the background of the annotated text portion to
the color associated with the node in the hierarchy (e.g. the background of the portion
of text representing a speaker becomes blue).
Melita does not differ in appearance from other annotation interfaces such as the
Gate annotation tool, MnM or Ontomat. This is because, as mentioned, it is a
demonstrator showing how a typical annotation interface could interact with the IES.
The novelty of Melita is the possibility of (1) tuning the IES so as to provide the desired
level of proactivity and (2) scheduling texts so as to provide timeliness in annotation
learning. The typical annotation cycle in Melita follows the two-phase cycle based on
training and active annotation described in the previous section. Users may not be
aware of the difference between the two phases. They will just notice that at some
point the annotation system starts suggesting annotations, and that they have a way
to influence when and in which modalities this happens.
Figure 3. The Melita annotation interface.
4.1. Suggesting Annotations
There are two ways in which Melita can suggest annotations to users, according to the
reliability of such suggestions. Suggestions Amilcare is quite sure about are presented
in the document panel in a way similar to the annotations inserted by the user: the
background of the text where the information has been found turns
into the specific annotation colour (e.g. grey for speaker in figure 3). The difference
with respect to the actual user annotations is that a darker border surrounds them, so
that they can be easily spotted for user checking. For example, in figure 3 the location "SEI
Auditorium" highlighted in red is a reliable suggestion from Amilcare, while "12 PM" is
a user-defined annotation. Suggestions Amilcare is less sure about are
presented in a different way: the background is left unchanged (white), but a
coloured border (the same colour as the potential annotation, e.g. grey for speaker)
surrounds the text. For example, "11 am" (at the text centre in figure 3) is a suggestion
of this type. They are easy for the user to spot, but they are marked as unreliable. The
difference in presentation corresponds to a difference in the suggestions' semantics:
reliable annotations are supposed to be correct, and a user action is required to remove
them if they are wrong; less reliable annotations are supposed to be just suggestions
to the user, and an action is required to confirm them, otherwise they will not be saved
with the text in the end. We believe that both annotation types are useful, as they
clearly communicate to the user which suggestions are to be trusted and which are
just reasonable guesses. Reasonable guesses are presented for two reasons. First of all,
they represent a situation in which the learner requires user feedback: removing such
information sends a clear message to the learner that the guess is wrong and therefore
the rules are to be changed. Secondly, from the user's point of view guesses are very often
useful because, although often imprecise, they tend to correctly identify the
area in which the information is present even if the information is not correctly
identified (e.g. in "at <time>3:00</time> pm" the annotation is imprecise, since "pm"
should be part of the time, but it is still useful to focus the user's attention on the place
where the correct annotation should go). Note that reliability can vary for different
pieces of information. For example, a system can become quite reliable in a short time
in recognizing some information (e.g. seminar start time) while requiring more training
examples for others (e.g. speaker). In this case there will be a moment in which the
suggested annotations for the time are reliably inserted (i.e. with coloured
background) while the annotations for the speaker are less reliable (presented with
a coloured border only).
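The choice between the two presentation modes can be pictured as a simple comparison between the estimated accuracy for a given annotation type and the two user-set thresholds described in the next subsection; this is a hedged sketch, and the accuracy estimate and default values are assumptions.

```python
# Sketch of how a suggestion's presentation mode could follow from its
# estimated accuracy and two user-set thresholds (values are assumptions).

def presentation_mode(estimated_accuracy,
                      reliable_threshold=0.75, tentative_threshold=0.43):
    if estimated_accuracy >= reliable_threshold:
        return "reliable"    # coloured background; the user removes it if wrong
    if estimated_accuracy >= tentative_threshold:
        return "tentative"   # coloured border only; kept only if the user confirms it
    return "suppressed"      # below both thresholds: not shown at all
```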
4.2. Balancing Proactivity
Users must be empowered to customize the strategy above, participating in the
definition of what counts as reliable information and what does not. In addition, some very
unreliable suggestions should not be presented at all, and again we want to empower the
user to say which of them are not to be presented. This means that users must be empowered to
control proactivity (and therefore intrusiveness). In Melita, users can customize the
behaviour of the IES, i.e. tune the IES's level of proactivity, by using a special
slidebar (fig. 4). It allows the user to set two thresholds that divide the accuracy space into three
areas: the first threshold defines the minimum accuracy the IES must be able to
reach in order to start considering annotations as reliable. The second threshold
defines the minimum accuracy the system must reach before it starts presenting less
reliable suggestions. In the example in figure 4 the system will consider annotations reliable (and
therefore suggest them with a coloured background) when the annotation accuracy is greater
than 75%. Annotations that do not reach 75% reliability are still suggested (using the
coloured border only) if they reach at least 43% reliability. When accuracy is less
than 43% the IES does not suggest at all. There is a general default that can be
customised and holds for all the nodes in the ontology, and that can be overridden for
specific nodes by using the same kind of window. Changing the default for specific
annotations (e.g. "speaker") is useful because users can have different feelings about
intrusiveness for different kinds of information. Note that users do not need to know
in detail what a given percentage means: they can easily reason from a qualitative point of
view, given the current IES behaviour. If the user feels that the IES is not proactive enough,
s/he can decide to lower (one of the) two thresholds. If the system is intrusive the user can
decide to raise them. To turn off all the system suggestions it is just necessary to
raise both thresholds above 100%. Moreover, the further a threshold is moved in either
direction, the greater the effect on the IES. It is important that the
thresholds are independent, because users can have different feelings about intrusiveness
for the different suggestion modes. The same slidebar also shows the average
accuracy currently reached by the IES in annotating a specific information type: a
blue filler mark grows from the bottom (around 10% in figure 4), representing the
distribution of accuracy of the potential suggestions for the specific annotation. Such
information can be used in tuning proactivity: to reduce intrusiveness, raise a threshold
above the average; to increase proactivity, move a threshold below it.

Figure 4. The slidebar used to customize intrusiveness.
5. An Experiment on IE’s Effectiveness
We performed a number of experiments to demonstrate how fast the IES
converges to an active annotation status and to quantify its contribution to the
annotation task, i.e. its ability to suggest correctly. We selected a subset of the
Computer Science Jobs announcement corpus, manually annotated by M. E. Califf
[14]. This is a corpus used for evaluating adaptive IE algorithms on semi-structured
texts [15]. The subtask we selected was to recognise, in a set of 250 news posts about
job offers for computer scientists: the city, country and state in which the job is
offered, the company offering the job, the actual recruiter, the required knowledge
of both computer languages and platforms, and the offered salary. We believe that
this task can be considered representative for the Semantic Web.
In our experiment the annotation in the corpus was used to simulate human
annotation. We evaluated the potential contribution of the IE system at regular
intervals during corpus tagging, i.e. after the annotation of 5, 10, 20, 25, 30, 50, 62,
75, 100 and 150 documents (each subset fully including the previous one). Each time
we tested the accuracy of the IES on the following 100 texts in the corpus (so when
training on 25 texts, the test was also performed on the following 25 texts, later used
for training on 50). The ability to suggest on the test corpus was measured in terms of
precision and recall. Recall here approximates the probability that the user receives a
suggestion when tagging a new document. Precision represents the probability that
such a suggestion is correct. Results are shown in figure 5 at the end of the paper. On
the X-axis the number of documents provided for training is shown. On the Y-axis
precision, recall and F-measure (a balanced average of precision and recall) are presented.
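The evaluation protocol can be summarised with the sketch below. It is a hedged reconstruction of the procedure just described: the ies_factory, ies.train and ies.annotate calls are hypothetical stand-ins, annotations are assumed to be (tag, start, end) tuples, and the scoring follows the convention that partially overlapping matches count as half correct.

```python
# Sketch of the incremental evaluation: train on a growing prefix of the
# corpus and score suggestions on the following 100 texts.  All interfaces
# and the (tag, start, end) annotation format are assumptions.

def count_matches(suggested, gold):
    # Exact span matches count 1; a partial overlap with the same tag counts 0.5.
    score = 0.0
    for tag, start, end in suggested:
        if (tag, start, end) in gold:
            score += 1.0
        elif any(t == tag and start < e and s < end for t, s, e in gold):
            score += 0.5
    return score

def incremental_evaluation(corpus, ies_factory,
                           checkpoints=(5, 10, 20, 25, 30, 50, 62, 75, 100, 150),
                           test_window=100):
    results = {}
    for k in checkpoints:
        ies = ies_factory()
        ies.train(corpus[:k])                    # simulate k manually annotated texts
        correct = proposed = expected = 0.0
        for doc in corpus[k:k + test_window]:
            suggested = ies.annotate(doc.text)
            correct += count_matches(suggested, doc.gold)
            proposed += len(suggested)
            expected += len(doc.gold)
        precision = correct / proposed if proposed else 0.0
        recall = correct / expected if expected else 0.0
        f_measure = (2 * precision * recall / (precision + recall)
                     if precision + recall else 0.0)
        results[k] = (precision, recall, f_measure)
    return results
```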
The maximum support comes in annotating city, country, state and posting date.
This is not surprising, as they present quite regular fillers. Other experiments on other
corpora have shown that an equivalent gain can also be obtained for annotations
requiring time expressions as fillers. After training on only 10 texts, the system is
potentially able to propose 253 instances of cities (out of 303 present in the corpus):
228 are correct, 22 are wrong and 3 are partially correct (i.e. the proposed and correct
annotations partially overlap; these count as half correct in calculating precision and
recall), with 72 missing, leading to precision of 90% and recall of 75% (see figure 5 and
table 1). This is possible because of Amilcare's ability to
generalize over both the text context and the gazetteer information provided by Annie,
which includes a list of locations. Please note that the recognition of city, state and
country is not a simple Named Entity Recognition task. The system must not only
recognise the name of a place, but also recognise that such a place is the location of the
job. There are other locations in the texts that are irrelevant (e.g. in the address of
the recruiter) and only the job location must be recognised. This implies the ability to
recognise the context in which the location name appears. The same applies to the
posting date: there are many other dates in the texts and only the correct one must be
identified. The situation is more complex for other fields such as recruiter or
company, where 80% F-measure is reached after 100 texts. These annotations are
much more difficult to learn than expressions whose fillers are either very regular (e.g.
time or date expressions) or can be listed in a gazetteer (we did not have a suitable list
of companies), because their regularity is much less direct. We performed the same
type of analysis on another corpus used for adaptive IE, the CMU seminar announcements
corpus, in which 483 emails are manually annotated with the speaker, starting time, ending
time and location of seminars (www.isi.edu/~muslea/RISE/), and found analogous
results.
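As a quick check of how these scores are computed (with partially correct annotations counting as half correct), the city figures reported above after 10 training texts reproduce the stated precision and recall:

```python
# City annotations after training on 10 texts (figures from the text above):
# 253 proposed out of 303 present, of which 228 correct and 3 partially correct.
correct, partial, proposed, present = 228, 3, 253, 303
precision = (correct + 0.5 * partial) / proposed   # ~0.91, reported as 90
recall = (correct + 0.5 * partial) / present       # ~0.76, reported as 75
```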
Table 1. Amount of training texts needed for reaching at least 75% precision and 50% recall

Tag        Texts needed for training   Prec   Rec   F-measure
city       10                          90     75    82
country    10                          81     92    86
state      5                           79     87    83
company    100                         91     72    86
recruiter  30                          81     50    62
language   50                          80     59    68
platform   50                          77     52    62
salary     5                           75     54    62
post_date  5                           97     100   98
The above experiments show that the contribution of the IES can be quite high.
Reliable annotation can be obtained with limited training, especially when adopting
high-precision IES configurations. In the case of the job announcement task, our
experiments show that it is possible to move from bootstrapping to active annotation
after annotating a very limited number of texts. In table 1 we show the amount of
training needed for moving to active annotation for each type of information, given a
minimum user requirement of 75% precision. This shows that the IES contribution
heavily reduces the burden of manual annotation and that such reduction is
particularly relevant and immediate in the case of quite regular information (e.g., known
location names). In user terms this means that it is possible to focus the activity on
annotating more complex pieces of information (e.g. company and recruiter),
avoiding being bothered by easy and repetitive ones (such as locations). With some
more training cases the IES is also able to contribute to annotating the complex cases.
6. Conclusions and future work
IES can strongly support users in the annotation task, relieving them of a great deal
of the annotation burden. Our experiments show that such help is particularly strong
and immediate for repetitive or regular cases, allowing the expensive and
time-consuming user activity to be focused on more complex cases. In our experiment we
have quantified such support for a task about job announcements. Despite these
positive results, we claim that the mere quantitative support is not enough. An
interaction methodology between annotation interface, user and IES is necessary in
order to reduce intrusiveness and maintain timeliness of support. The methodology
proposed in this paper addresses these concerns, as:
1. It fits into the usual user environment without imposing particular requirements
on the annotation interface used to train the IES (reduced intrusiveness).
2. It maximizes the cooperation between user and IES: users insert annotations in
texts as part of their normal work and at the same time they train the IES. The IES
in turn simplifies the user's work by inserting annotations similar to those inserted
by the user in other documents; this collaboration is made timely and effective by
the fact that the IES is retrained after each document annotation.
3. The modality in which the IES suggests new annotations is fully tuneable
and therefore easily adaptable to specific user needs/preferences (intrusiveness
is kept under control).
4. It allows the IES to be trained in a timely way without disrupting the user's pace with
learning sessions consuming a large amount of CPU time (and therefore either stopping
or slowing down the annotation process).
There are two open issues that arise from our experience. On the one hand, the
effect on the user of excellent IES performance after a small amount of annotation is
still to be considered. For example, when P=90, R=75 is reached after only 10 texts (as
for city in the jobs announcement task), users could be tempted to rely on the
IES suggestions only, avoiding any further action apart from correction. This would
be bad not only for the quality of document annotation, but also for the IES
effectiveness. As a matter of fact, each newly annotated document is used for further
training. Rules are developed using the existing annotations and are tested on the whole
corpus to check against false positives (i.e. the rest of the corpus is considered a set
of negative examples). A corpus with a significant number of missing annotations
provides a significant number of (false) negative examples that disorient the learner,
degrading its effectiveness and therefore producing worse future annotation. The
full extent of the problem is still to be analysed. We are currently considering
applying strategies such as randomly removing annotations in order to test the user's
attention. On the other hand, the time saved by using an IES is still to be quantified.
The experiments above seem to suggest a strong reduction in annotation time, but we
intend to actually measure the improvement in experiments with real users.
Acknowledgements
This work was carried out within the AKT project (http://www.aktors.org), sponsored
by the UK Engineering and Physical Sciences Research Council (grant
GR/N15764/01). AKT involves the Universities of Aberdeen, Edinburgh, Sheffield,
Southampton and the Open University. Its objectives are to develop technologies to
cope with the six main challenges of knowledge management: acquisition, modelling,
retrieval/extraction, reuse, publication and maintenance. Thanks to Enrico Motta,
Mattia Lanzoni, John Domingue, Steffen Staab and Siegfried Handschuh for a
number of useful discussions. Thanks to the Gate group for providing Annie
(www.gate.ac.uk) and for help in integrating it into Amilcare.
Bibliography
1. F. Ciravegna, A. Lavelli, G. Satta: ‘Bringing information extraction out of the
labs: the Pinocchio Environment', in ECAI2000, Proc. of the 14th European
Conference on Artificial Intelligence, ed., W. Horn, Amsterdam, 2000. IOS Press
2. P. Kogut and W. Holmes: “Applying Information Extraction to Generate DAML
Annotations from Web Pages”, K-CAP 2001 Workshop Knowledge Markup &
Semantic Annotation, Victoria B.C., Canada (2001).
3. M. E. Califf, D. Freitag, N. Kushmerick and I. Muslea (eds.): AAAI-99
Workshop on Machine Learning for Information Extraction, Orlando Florida
(1999), http://www.isi.edu/~muslea/RISE/ML4IE/
4. R. Basili, F. Ciravegna, R. Gaizauskas (eds.) ECAI2000 Workshop on Machine
Learning for IE, Berlin (2000), www.dcs.shef.ac.uk/~fabio/ecai-workshop.html
5. F. Ciravegna, N. Kushmerick, R. Mooney and I. Muslea (eds.), IJCAI-2001
Workshop on Adaptive Text Extraction and Mining held in conjunction with the
17th International Conference on Artificial Intelligence, Seattle, (2001),
http://www.smi.ucd.ie/ATEM2001/
6. M. Vargas-Vera, Enrico Motta, J. Domingue, M. Lanzoni, A. Stutt and F.
Ciravegna: “MnM: Ontology driven semi-automatic or automatic support for
semantic markup”, Proc. of the 13th International Conference on Knowledge
Engineering and Knowledge Management, EKAW02, Sigüenza, Spain (2002).
7. S. Handschuh, S. Staab and F. Ciravegna: “S-CREAM - Semi-automatic
CREAtion of Metadata”, Proc. of the 13th International Conference on
Knowledge Engineering and Knowledge Management, EKAW02, Sigüenza,
Spain, (2002).
8. F. Ciravegna and D. Petrelli: “User Involvement in Adaptive Information
Extraction: Position Paper” in Proceedings of the IJCAI-2001 Workshop on
Adaptive Text Extraction and Mining held in conjunction with the 17th
International Conference on Artificial Intelligence, Seattle (2001).
9. D. Maynard, V. Tablan, H. Cunningham, C. Ursu, H. Saggion, K. Bontcheva and
Y. Wilks: “Architectural Elements of Language Engineering Robustness”,
Journal of Natural Language Engineering, Special Issue on Robust Methods in
Analysis of Natural Language Data, forthcoming in 2002.
10. F. Ciravegna: "Adaptive Information Extraction from Text by Rule Induction and
Generalisation" in Proceedings of 17th International Joint Conference on
Artificial Intelligence (2001).
11. F. Ciravegna: "(LP)2, an Adaptive Algorithm for Information Extraction from
Web-related Texts" in Proceedings of the IJCAI-2001 Workshop on Adaptive
Text Extraction and Mining held in conjunction with the 17th International
Conference on Artificial Intelligence (IJCAI-01), Seattle, August, 2001
12. N. Kushmerick, D. Weld and R. Doorenbos: `Wrapper induction for information
extraction', Proc. of 15th International Conference on Artificial Intelligence,
Japan (1997).
13. F. Ciravegna: “Challenges in Information Extraction from Text for Knowledge
Management”, IEEE Intelligent Systems and Their Applications, 16-6,
November, (2001).
14. M. E. Califf: ‘Relational Learning Techniques for Natural Language’ IE, Ph.D.
thesis, Univ. Texas, Austin, (1998), www.cs.utexas.edu/users/mecaliff
15. D. Freitag and N. Kushmerick, `Boosted wrapper induction’, in R. Basili, F.
Ciravegna, R. Gaizauskas (eds.) ECAI2000 Workshop on Machine Learning for
Information Extraction, Berlin, 2000, www.dcs.shef.ac.uk/~fabio/ecai-workshop.html.
Figure 5. The learning curves for the different information types in the job task (city, company, country, post date, recruiter, state, platform, language, salary): precision, recall and F-measure (%) plotted against the number of training texts (0-150).