Answering Engineers’ Questions Using Semantic Annotations
Paper Number: 06
Question-Answering (QA) systems have proven to be helpful especially to those who feel
uncomfortable entering keywords, sometimes extended with search symbols such as +, *, etc. In
developing such systems, the main focus has been on the enhanced retrieval performance of searches,
and recent trends in QA systems centre on the extraction of exact answers. However, when their
usability was evaluated, some users indicated that they found it difficult to accept the answers due to
the absence of supporting context and rationale. Current approaches to address this problem include
providing answers with linking paragraphs or with summarising extensions. Both methods are
believed to be sufficient to answer questions seeking the names of objects or quantities that have only
a single answer. However, neither method addresses the situation when an answer requires the
comparison and integration of information appearing in multiple documents or in several places in a
single document. This paper argues that coherent answer generation is crucial for such questions and
that the key to this coherence is to analyse texts to a level beyond sentence annotations. To
demonstrate this idea, a prototype has been developed based on Rhetorical Structure Theory and a
preliminary evaluation has been carried out. The evaluation indicates that users prefer to see the
extended answers that can be generated using such semantic annotations, provided that additional
context and rationale information are made available.
Keywords: Information retrieval, question-answering, semantic annotations, natural language
processing, Rhetorical Structure Theory
1 Introduction
Electronic documents are one of the most common information sources in organisations and
approximately 90% of organisational memory exists in the form of text-based documents. It has been
reported that 35% of users find it difficult to access information contained in these documents and at
least 60% of the information that is critical to these organisations is not accessible using typical search
tools (80-20 software, 2003). There are two main problems. The first is that there is simply too much
information to be searched. The second is that differences exist between the indexing approaches used
in search engines and the way people perceive and access the contents of documents. This means that
most users find searching for relevant information difficult since it is not possible for them to enter
keywords in a sufficiently precise form for them to be used effectively by current search engines.
Current retrieval systems accept queries from users in the form of a few keywords and retrieve a long
list of matching documents. Users then have to sift through the documents to locate the information
they are looking for. For simple fact-based queries, e.g. What material should be used for this turbine
blade? most users can enter satisfactory keywords that rapidly find the required answers. However,
keyword-based systems cannot cope with questions involving: (1) comparing, e.g. What are the
advantages and disadvantages of using aluminium compared with steel for this saucepan? (2)
reasoning, e.g. How safe are commercial flights? and (3) extracting answers from different documents
and fusing them into a complete answer. In order to obtain useful answers to these types of question,
users currently have to expend considerable time and effort.
In previous research in this domain, automatic query expansion and taxonomy-based searches have
been proposed. Query expansion improves on keyword searching for short questions that require only
a few documents to be located to provide the answers. Taxonomy-based searches require the
hierarchical organisation of domain concepts. This relieves the user of having to enter accurate
keywords as information is searchable by selecting concepts. However, considerable effort is required
to create the classifications and maintain the hierarchy. For example, Yahoo 1 employs around 50
subject experts to maintain its directories and indexes. Few organisations can afford to adopt
such a strategy, and current automatic classifiers only achieve 60-80% accuracy (Mukherjee & Mao,
2004). This means that in automatic classification around a third of the documents will be missed or
misclassified. Manual classifications are subjective, and often based on the few sentences which are
deemed important for individual indexers. Clearly when information is sought that does not match the
indexing, such a taxonomy-based approach does not help.
Offering users the facility to enter their queries in natural language might greatly enhance current
search engine interfaces and be particularly helpful for less experienced users who are not adept at
advanced keyword searches. Recent research into natural language-based retrieval systems has mainly
pursued a Question-Answering (QA) approach. QA systems have successfully retrieved short answers
to natural language questions instead of directing users to a number of documents that might contain
the answers. Typically, QA uses a combination of Information Retrieval (IR) and Natural Language
Processing (NLP) techniques. IR techniques are used to pinpoint a subset of documents and to locate
parts of those documents that are related to the questions. NLP techniques are used for extracting brief
answers. There is great interest in developing robust and reliable QA systems to exploit the enormous
quantity of information available on-line in order to answer simple questions such as: Who is the
president of the USA? This is relatively easy since straightforward NLP techniques, such as pattern-
matching, are sufficient to answer it. The numerous occurrences and multiple reformulations of the
same information available on the Web greatly increase the chance of finding answers that are
syntactically similar to the question (Brill et al., 2001). On the Intranets run by organisations, the
quantity of information, although large, is much less than on the Web, and the number of occurrences
and reformulations is far smaller.
1 http://www.yahoo.com
Unlike users searching on the Web, those in organisations are likely to ask questions that are not easily
answered by simply looking up syntactic similarities in databases. Such answers can be considered
complex and may need to be inferred from different parts of a single text or from multiple texts. The
initial question posed by a user may be ill-formed, i.e. too broad or too specific, making it difficult for
the retrieval system to interpret and hence further interaction with the user is often necessary.
Answering such complex questions has received little attention within QA research. To answer such
questions, the issues of correctly interpreting the question and presenting the answer must both be
addressed. When presenting answers to complex questions, it is not sufficient just to present the
answer, i.e. the user needs additional supporting information with which to assess the trustworthiness
of the answer. For example, hemlock poisoning and drinking hemlock can both be considered correct
answers to the question: How did Socrates die? (Burger et al., 2001). Users who have some
background knowledge about poisoning might appreciate a brief answer, as they do not want to read
through a long text to extract the answer themselves. Other users with less background knowledge
might prefer to see where the answers came from and want to read more text explaining the answers
before accepting them. Searching precisely for How did Socrates die? on the Web using Google
produces around 748,000 results. It is clear that, for a question such as this, answers with varying
formulations appear in numerous documents or in many parts of a single document. Answers
appearing as multiple instances need to be fused efficiently in order to reduce repeated information.
Apart from what is simply stated in the question, the user’s real intention might have been to know the
reason why Socrates chose to die by poisoning. For questions that are ill-formed, it is important that
answers are extended with related information that increases a user’s understanding of the answers. It
is therefore necessary to research suitable ways of presenting answers in a clear and coherent manner,
and providing sufficient supporting information to allow users to decide whether or not to trust the
answers.
A combination of two approaches, both using semantic relations, is therefore proposed for presenting
clear and coherent answers. First, duplicate information is removed. Second, answers are synthesised
from multiple occurrences, and then justified by adding supporting information. In order to achieve
this, semantic analysis is necessary of both the questions and the texts from which the answers are to
be extracted. Figure 1 shows an example of how these ideas might be implemented2. The initial
question posed by the user, i.e. What triggered the engine fire alarm in Boeing 727-217?, was aimed
at understanding the cause or causes of the fire alarm going off. The question itself is ambiguous
since it specifies neither the engine nor the date of the flight. Assuming that the system correctly
understands the question, it can return the failure of the number 2 engine starter as the cause.
However, for complex incidents such as this one, it is difficult to pinpoint particular causes and regard
them as independent of the remaining information. That is, there might be more than one cause for a
single incident, and some causes may depend on other causes. For example, in the example above
there are other contributing causes, e.g. the start valve had re-opened because of a short circuit or the
engine starter had failed due to over-speeding. There could also be consequent effects, e.g. residual
smoke and fire damage to the structure surrounding the number 2 engine. For presenting answers like
this, it is necessary to consider the actual information needs of the user at the knowledge level. For
example, when faced with an unexpected observation (problem), engineers first assess whether or not
the problem is serious and requires diagnosis. Diagnosis normally proceeds by finding reasons or
causes that impact on the observation. Once the causes are identified, then it is likely that solutions
are required to prevent recurrences. The impact of making various hypothetical changes is likely to be
assessed, along with the advantages and disadvantages of the various solutions proposed. This
example demonstrates that the information needs for users in specific organisations are complex,
requiring not only sophisticated retrieval processing but also the presentation of retrieval results in as
natural a form as possible. Successful synthesis and presentation of such answers depend on the
ability to compare information on a semantic-level such that it produces a chain of semantic relations.
2 An example text is from http://www.tsb.gc.ca/en/reports/air/1996/a96o0125/a96o0125.asp
Figure 1. An example of generating a coherent and justified answer
A prototype of semantic-based QA system implementing these ideas has been developed. The
underlying approach is based on identifying various discourse relationships between two spans, such
as cause-effect and elaboration. These types of relationship are derived from a computational
linguistic theory known as Rhetorical Structure Theory (RST). This theory defines a set of rhetorical
relations and uses them to describe how the sentences are combined to form a coherent text (Mann &
Thompson, 1988). As such, RST analysis discovers relationships within a sentence or among
sentences. Since sentences are not usually comprehensible when isolated, this approach provides a
more sophisticated content analysis. These annotations are then used to remove duplicate information
and synthesise answers from multiple occurrences. Finally these answers are justified by adding
supporting information. As information is compared at the semantic-level rather than at the string
level, it is possible to determine whether a causal link exists between two events. This paper mainly
addresses questions related to causal inference and describes a prototype system to test the ideas. The
proposed system is targeted at the engineering area, however the methodology is generic and can be
applied to other domains.
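The causal inference described above can be sketched as a chain of cause-effect relations. The relation tuples and the `causal_chain` function below are illustrative assumptions, not the prototype's implementation; the event strings paraphrase the starter-failure example from Figure 1.

```python
# Hypothetical cause-effect pairs assumed to be extracted from RST annotations.
# Each tuple reads (cause, effect); the events paraphrase the Boeing 727-217 case.
relations = [
    ("short circuit in the engine wiring harness",
     "number 2 engine start valve re-opened"),
    ("number 2 engine start valve re-opened",
     "number 2 engine starter over-sped"),
    ("number 2 engine starter over-sped",
     "failure of the number 2 engine starter"),
    ("failure of the number 2 engine starter",
     "engine fire alarm triggered"),
]

def causal_chain(effect, relations):
    """Walk cause-effect pairs backwards from an observed effect to its root cause."""
    causes = {eff: cause for cause, eff in relations}
    chain = [effect]
    while chain[-1] in causes:
        chain.append(causes[chain[-1]])
    return list(reversed(chain))

print(" -> ".join(causal_chain("engine fire alarm triggered", relations)))
```

Comparing events at the semantic level (rather than as strings) is what would let a real system recognise that two differently worded spans describe the same link in this chain.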
2 Literature Review
Users find QA systems helpful as they do not need to go through each retrieved document to extract
the information they need. Until recently, most QA systems only functioned on specifically created
collections and on limited types of questions, but some attempts have been made to scale systems to
open domains like the Web (Kwok et al., 2001). Experiments show that in comparison to search
engines, e.g. Google, QA systems significantly reduce the effort to obtain answers. AskJeeves3 and
3 www.ask.com
Question: What triggered the engine fire alarm in Boeing 727-217?

Answer: The failure of the number 2 engine starter caused a fire, as the investigation revealed
residual smoke and fire damage to the structure surrounding the number 2 engine.

Extended answer: The engine start valve master switch did not protect the complete circuit, →
{causing a short circuit in the engine wiring harness, so a new voltage must subsequently have been
available}, → allowing the number 2 engine start valve to re-open, → {causing the number 2 engine
starter to over-speed, because it was being rotated by the air turbine with no load on the starter},
→ causing the failure of the starter, as evidenced by a two- by three-inch hole in the side of the
starter gear case and the air turbine having come out through the retaining screen.
Brainboost4 are examples of Internet search engines with a QA interface, but neither provides
fully-fledged QA capabilities. AskJeeves relies on hand-crafted question templates that enable
automatic answer searches, and returns lists of documents instead of intelligently extracting brief
answers. Brainboost supplies answers in plain English, but the correctness of its answers is limited to
specific questions only, and for many questions neither relevant texts nor exact answers are found.
Currently, developments in QA have focused on improving system performance through more
advanced algorithms for extracting exact answers (Voorhees, 2002). A project organised by the US
National Institute of Standards and Technology (NIST) has established benchmarks for evaluating QA
systems. Two new QA system response requirements were introduced in 2002: (1) to return an exact
answer; and (2) to return only one answer. Previous requirements had allowed systems to return five
candidate answers, and the answers could be between 50 and 250 bytes in length. This demonstrates
that current QA systems are focusing on retrieving exact answers to factual questions. For these
systems, performances of over 80% correct answers have been reported. However, user evaluations
consistently highlight that usability is hindered by the absence of the context information that
would allow users to evaluate the trustworthiness of an answer. For example, user studies conducted
by Lin et al. (2003) suggest that users prefer to see the answer in a paragraph rather than as an exact
answer, even for a simple question like: Who was the first man on the Moon?
In comparison to open-domain QA systems, e.g. on the Web, domain-specific QA systems have the
following additional characteristics (Diekema et al., 2004; Hickl et al., 2004; Nyberg et al., 2005):

- a limited amount of data is available in most cases;
- domain-specific terminologies have to be dealt with;
- user questions are complex.
4 www.brainboost.com
Shallow text processing methods are mostly used for QA systems on the Web due to Web redundancy,
which means that similar information is stated in a variety of ways and repeated in different locations.
However, in the engineering domain, suitable data can be scarce and answers to some questions might
only be found in a few documents and these may exhibit linguistic variations from the questions.
Therefore, intensive NLP techniques that can analyse unstructured texts using semantics and domain
models are more appropriate. Domain ontologies and thesauri are required to define domain-specific
terminologies. Hai and Kosseim (2004) used information in a manually created thesaurus to rank
candidate answers by annotating the special terms occurring both in the queries and candidate
answers. They also used a concept hierarchy for measuring similarities between a document and a
query. Ontologies have also been used for expanding terms in the questions and clarifying ambiguous
terms (Nyberg et al., 2004). Since ontologies can be regarded as storing information as triples, e.g.
person – work-for – organisation, users can submit questions linked to such classes and relations in
natural language (Lopez et al., 2005). For example, the question: Is John an employee of IBM? can be
answered by recognising: (1) John is a person; (2) IBM is an organisation; and (3) employee is
inferred from ‘someone who works for an organisation’. Questions other than factual ones need
special attention and a profile of the user can help to improve system performance (Diekema et al.
2004). Suitable ways of presenting answers and how much information should be provided must also
be determined. To address these problems, some researchers proposed interactive QA. To that end,
some QA systems rephrase the questions submitted to confirm whether or not users’ information needs
have been correctly identified (Lin et al., 2003). Advanced dialog implementations have also been
suggested. However, Hickl et al. (2004) argue that the decomposition of user questions into simpler
ones with which answer types are associated could be a more practical solution than a dialog
interaction.
Generally, semantic annotations are treated as a similar task to named-entity recognition that identifies
domain concepts and their associations in a single sentence (Aunimo & Kuuskoski, 2005). This paper
extends the notion of semantic annotation to include discourse relations that identify what information
is generated from the extended sequences of sentences. This goes beyond the meanings of individual
sentences by using their context to explain how the meaning conveyed by one sentence relates to the
meaning conveyed by another. A discourse model is essential for constructing computer systems
capable of interpreting and generating natural language texts. Such models have been used: to assess
student essays with respect to their writing skills; to summarise scientific papers; to extend the
answers to a user’s question with important sentences; and to generate personalised texts customised
for individual reading abilities (Bosma, 2005; Burstein et al., 2003; Teufel, 2001; Williams & Reiter,
2003).
3 Engineering Taxonomy
Retrieval systems in engineering need to employ domain-specific terminologies that differentiate
between specific and general terms. Specific terms are essential to understand users’ questions and
characterise documents in relation to those questions. Some general terms have specific meanings in
engineering. For example the term shoulder has multiple meanings in a dictionary, and in most cases,
it means the part of the body between the neck and the upper arm. However, in engineering, it can
refer to a locating upstand on a shaft. Domain taxonomies arrange such terms into a hierarchy. An
example of an engineering taxonomy is the Engineering Design Integrated Taxonomy (EDIT) and this
taxonomy is used throughout this paper. It consists of four root concepts (Ahmed, 2005):

- The design process, i.e. a description of the different tasks undertaken at each stage of product
  development, e.g. conceptual design, detail design, brainstorming.
- The physical product to be produced, e.g. assemblies, sub-assemblies and components, using
  part-of relations. For example, a motor and the shaft of a motor.
- The functions that must be fulfilled by the particular component or assembly. For example,
  one of the functions of a compressor disc is to secure the compressor blade and one of the
  functions of a cup is to contain liquid.
- The issues, namely the considerations, that a designer must take into account when carrying
  out a design process, e.g. considering the unit costs or production processes.
A detailed description of the development of a generic methodology to develop engineering design
taxonomies that was used for EDIT can be found in (Ahmed et al., 2005).
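As a rough illustration, the four EDIT root concepts could be held in a small lookup structure. The child terms below are only the examples mentioned in the text; the structure and function names are our own and EDIT itself is of course far larger and hierarchical.

```python
# Illustrative fragment of the EDIT root concepts as a flat dictionary.
# Real EDIT arranges many more terms into a deeper hierarchy.
edit_taxonomy = {
    "design process": ["conceptual design", "detail design", "brainstorming"],
    "physical product": ["assembly", "sub-assembly", "component"],
    "function": ["secure compressor blade", "contain liquid"],
    "issue": ["unit cost", "production process"],
}

def root_of(term, taxonomy):
    """Return the root concept under which a term is classified, if any."""
    for root, children in taxonomy.items():
        if term in children:
            return root
    return None

print(root_of("brainstorming", edit_taxonomy))  # -> design process
```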
4 The Proposed Method
In general, a document can be encoded with various semantics, e.g. customer reviews or causal
accounts of engineering failures, and accessed by users who have very different interests. For
example in the case of product reviews by customers, negative and positive customer opinions are the
main messages for market researchers. On the other hand, designers are more interested in design-
related issues, comments and problems associated with engineering failures. It would therefore be
beneficial to include those semantics that facilitate searching for information in a way that reflects the
interests of the users. For example, for a designer whose task is to reduce fan noise, guidance on how
to minimise aerodynamic noise should be retrieved. On the other hand, if that designer is more
interested in using a specific method for noise reduction, then documents describing the methods
along with their advantages or disadvantages are more useful. With keyword-based indexing, it is not
feasible to extract such semantics since most natural language texts have annotations that are too basic
and no explicit descriptions of the concepts are available. Annotations are formal notes attached to
specific spans of text. Their complexity and representation depend on the mark-up language used.
The proposed method works as follows: (1) a document is annotated with a set of relations derived
from RST; (2) the document is classified with EDIT indexes; (3) the document is parsed using NLP
indexing techniques; (4) the RST-annotated document is converted into predicate-argument forms for
effective answer extraction; and (5) a user question is analysed using the same NLP technique. Steps
(1), (2), and (3) can proceed independently, but step (1) must precede step (4).
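The five-step flow can be sketched as below. All function names and stub bodies are placeholders of our own, not the prototype's code; only the ordering constraint between steps (1) and (4) is taken from the text.

```python
# Sketch of the proposed indexing pipeline. Each stub stands in for a
# substantial component (RST annotator, EDIT classifier, NLP parser).

def annotate_rst(document):          # step (1): RST annotation
    return {"text": document, "rst": []}

def classify_edit(document):         # step (2): EDIT indexing
    return {"edit_indexes": []}

def parse_nlp(text):                 # steps (3)/(5): NLP parsing of document or question
    return {"tokens": text.split()}

def to_predicate_argument(rst_doc):  # step (4): consumes step (1)'s output
    return [("predicate", rst_doc["rst"])]

def index_document(document):
    rst_doc = annotate_rst(document)        # (1)
    edit = classify_edit(document)          # (2) independent of (1)
    parsed = parse_nlp(document)            # (3) independent of (1)
    preds = to_predicate_argument(rst_doc)  # (4) must follow (1)
    return {"rst": rst_doc, "edit": edit, "parsed": parsed, "predicates": preds}

index = index_document("The starter failed as a result of chafing.")
```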
4.1 Semantic annotations based on RST
Discourse Analysis (DA) is crucial for constructing computer systems capable of interpreting and
generating natural language texts. DA studies the structure of texts beyond sentence and clause levels,
and structures the information extracted from the texts with semantic relations. It is based on the idea
that well-formed texts exhibit some degree of coherence that can be demonstrated through discourse
connectivity, i.e. logical consistency and semantic continuity between events or concepts. This is in
contrast with most keyword-based indexing that exclusively addresses the sub-sentence level, omitting
the fact that sentences are inter-connected to create a whole text. In order to establish a more robust
and linguistically informed approach to identify important entities and their relations, a deeper
understanding is necessary.
Annotating a text with a discourse structure requires advanced text processing, linguistic resources
such as taxonomies, and, possibly, manual intervention by experts. It certainly increases the work
required to develop QA systems. However, if QA systems are only targeted at certain domains, where
a limited number of texts has to be searched, and experts are available to assist, then detailed linguistic
analysis is feasible. DA generates a discourse structure by defining discourse units, either at sentence
or clause level, and assigning discourse relations between the units. Discourse structures can reveal
various text features and attempts have been made to use them to identify important sentences that are
key to understanding the contents of documents (Kim et al., 2006b; Marcu, 1999). Discourse
structures can be used to compare units in multiple documents in order to evaluate similarities and
differences in their meanings, as well as to detect anomalies, duplications and contradictions.
Rhetorical relations are central constructs in RST and convey an author’s intended meaning by
presenting two text spans side by side. These relations are used to indicate why each of the spans was
included by the author and to identify the nucleus spans that are central to the purpose of the
communication. Satellite spans depend on the nucleus spans and provide supporting information.
Nucleus spans are comprehensible independently of the satellites. For example, consider the
following two text spans: (1) Given that the clutch was functional, and (2) it is unlikely that the engine
was driving the starter. A condition relation is identified, with span (2) being the nucleus. Satellite
span (1) is only used to define the condition in which the situation in span (2) occurs. These two spans
are coherent since the person who reads them can establish their relationship.
Rhetorical relations between spans are constrained in three ways: (1) constraints on a nucleus; (2)
constraints on a satellite; and (3) constraints on the link between a nucleus and a satellite. They are
elaborated in terms of the intended effect on the text reader. If an author presents an argument in a
text that is identified as an evidence relation, then it is clear that the author was intending to increase a
reader’s belief in the claim represented in a nucleus span by presenting supporting evidence in a
satellite span. Such relations are identified by applying a recursive procedure to a text until all
relevant units are represented in an RST structure (Taboada & Mann, 2006). The procedure has to be
recursive because the intended communication effect may need to be expressed in a complex unit that
includes other relations. The results of such analyses are RST structures typically represented as
trees, with one top-level relation encompassing other relations at lower levels.
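Such a recursive nucleus/satellite tree might be represented as follows. The class layout is an assumption on our part; the condition example reuses the clutch/starter spans quoted above.

```python
# A minimal recursive representation of an RST structure: a Relation node
# whose nucleus or satellite may itself be another Relation.
from dataclasses import dataclass
from typing import Union

@dataclass
class Span:
    text: str

@dataclass
class Relation:
    kind: str
    nucleus: Union[Span, "Relation"]
    satellite: Union[Span, "Relation"]

tree = Relation(
    kind="condition",
    nucleus=Span("it is unlikely that the engine was driving the starter"),
    satellite=Span("Given that the clutch was functional"),
)

def nucleus_text(node):
    """Recursively follow nuclei down to the span central to the communication."""
    while isinstance(node, Relation):
        node = node.nucleus
    return node.text
```

Following nuclei recursively reflects the RST idea that nucleus spans remain comprehensible when their satellites are stripped away.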
It is difficult to determine the correct number of relations to be used and their types. In the simplest
domains only two relation types may be required, whereas some complex domains may require over
400 (Hovy, 1993). Hovy argued that taxonomies with numerous relation types represent sub-types of
taxonomies with fewer types. Some relation types are difficult to distinguish, e.g. elaboration and
example. If there are too many types, inconsistencies of annotation are likely. If there are too few, it
may not be possible to capture all the different types of discourse. Mann and Thompson (1988), for
example, listed 33 relation types to annotate a wide range of English texts. To reduce inconsistencies
of annotation, our method combines similar relation types and eliminates those that do not appear
frequently. A preliminary examination with sample engineering domain data from aircraft incident
reports (see Section 5.1) resulted in the following nine types: background, cause-effect, condition,
contrast, elaboration, evaluation, means, purpose, and solutionhood. Each of them is described
below, along with an example taken from the sample domain data, i.e. aircraft incident reports, using
(N) to indicate a nucleus span and (S) a satellite span.
Background
This type of relation is used to increase a reader’s background understanding (S) of the nucleus span
(N).
(S) While the helicopter was approximately 25 feet above ground level en route to Tobin Lake,
Saskatchewan, to pick up a bucket of water
(N) the engine fire warning light came on and the pilot saw smoke coming out of the engine
cowling.
Cause-Effect
This type of relation is used to link the cause in the nucleus span to the effect in the satellite
span or vice versa.
(N-S) Analysis of the fuel hose indicates that the steel braid strands failed
(S-N) as a result of chafing.
Condition
This type of relation is used to show the condition (S) under which a hypothetical situation (N) might
be realised.
(S) Given that the clutch was functional,
(N) it is unlikely that the engine was driving the starter.
Contrast
This type of relation is used to contrast incompatibilities between situations, opinions, or events and
there is no distinction between (N) and (S).
(N-S) It is considered likely that the fire was momentarily suppressed
(N-S) but because of the constant supply of fuel and ignition, it re-ignited after the retardant was
spent.
Elaboration
This type of relation is used to elaborate (S) on the situation in (N).
(N) The variable inlet guide vane actuator (VIGVA) hose, which provides fuel pressure to open
the variable guide vanes, was found pinched between the top of the starter/generator and the
impeller housing assembly.
(S) Further inspection of the pinched fuel hose revealed a hole through the steel braiding and
inner lining.
Evaluation
This type of relation is used to provide an evaluation (S) of the statement in (N).
(N) The second option, the engine start valve master switch,
(S) does not provide a positive indication to the flight crew of the start valve operation.
Means
This type of relation is used to explain the means (S) by which (N) is realised.
(N) The rest of the fire was extinguished
(S) using a fire truck that arrived on the site.
Purpose
This type of relation is used to describe the purpose (S) achieved through (N).
(N) At 13.5 flight hours prior to the occurrence, the starter/generator had been removed
(S) to accommodate the replacement of the starter/generator seal, then re-installed.
Solutionhood
This type of relation is used to link the problem (S) with the solution (N).
(N) The number 2 engine start control valve and starter were replaced, and the aircraft was
returned to service.
(S) It was determined that the number 2 engine starter had failed.
Figure 2 shows a screenshot of the RST annotation of the sample domain data. A software tool,
RSTTool, is used to complete the annotation (O’Donnell, 2000). RSTTool offers a graphical interface
with which annotators segment a given text into text spans and specify relation types between them. A
Perl program written by the first author automatically extracts the RST annotations stored by
RSTTool. In the box at the bottom of Figure 2 can be seen the
decomposed text spans with the individual spans identified by square brackets. Above can be seen the
corresponding RST analysis tree.
[C-GRYC had been modified by the previous owner,] [Da Services Ltd,] [to incorporate an engine start valve
master switch]. [The modification was accepted by Transport Canada when the aircraft was imported into Canada
in 1992]. [The engine start valve master switch was put into the electrical circuit between the engine start switches
and the start valve cutout switches on the engine starter]. [It provides protection for the start circuit up to the start
valve cutout switch].
Figure 2. Screenshot of RST analysis using RSTTool (O’Donnell, 2000)
It is common to use discourse connectives (or cue phrases) for automatic discovery of discourse
relations from texts. For example, by detecting the word but, a contrast relation between two adjacent
texts can be identified. This approach is easy to implement but can lead to a low coverage, i.e. the
ratio of correctly discovered discourse relations to the total number of discourse relations. A study by
Taboada and Mann (2006) showed that the levels of success using cue phrases ranged from 4% for the
‘summary’ relation to over 90% for the ‘concession’ relation. In order to improve the coverage,
machine learning methods have been used. Marcu et al. (2002) used Naive Bayesian probability to
generate lexical pairs that can identify relation types without relying on cue phrases. For example, the
approach can extract a ‘contrast’ when one text contains good words and another bad words, even
when but does not appear. While this approach performs well, the assumption that the lexical pairs are independent of each other means that a very large number of training sentences may be required, sometimes over 1,000,000. Although a low presence of cue phrases can lead to many
undiscovered relations, they can serve as a reference for annotators. Discourse text spans are inserted:
(1) at every period, semicolon, colon, or comma; and (2) at every cue phrase listed in Table 1.
Annotators first refer to the cue phrases to test whether the corresponding relation types can be used
for a given text. If no direct match is identified, then they select the closest one using their judgement.
Table 1 summarises cue phrases extracted from Knott and Dale (1995) and Williams and Reiter
(2003).
RST-annotated texts are converted into a predicate-argument structure, i.e. predicate(tag1:argument1, …, tagn:argumentn). Predicates represent the main verbs in sentences and tags include subjects, objects and prepositions. For example, consider the following sentence: Analysis of the fuel hose indicates that the steel braid strands failed as a result of chafing. For this sentence the ‘evidence’ relation type is used to annotate it as follows: evidence((indicate(subject:analysis of the fuel hose)), (fail(subject:the steel braid strand, pp:as a result of chafing))).
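As an illustrative sketch of this encoding (the paper gives no implementation; the helper names below are assumptions), the predicate-argument structure can be rendered as follows:

```python
def predicate(verb, **tags):
    """Render a predicate with its tag:argument pairs, e.g. indicate(subject:...)."""
    args = ", ".join(f"{tag}:{arg}" for tag, arg in tags.items())
    return f"{verb}({args})"

def relation(rel_type, *preds):
    """Wrap predicate strings in an RST relation type."""
    return f"{rel_type}({', '.join(f'({p})' for p in preds)})"

# Encode: "Analysis of the fuel hose indicates that the steel braid
# strands failed as a result of chafing." under the 'evidence' relation.
annotation = relation(
    "evidence",
    predicate("indicate", subject="analysis of the fuel hose"),
    predicate("fail", subject="the steel braid strand", pp="as a result of chafing"),
)
print(annotation)
```

Running this reproduces the annotation string given in the text above.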
Table 1. Cue phrases for identifying relation types
Relation types Cue phrases
Background With, probably
Cause-Effect Because, since, as, as a consequence, as a result, thus,
therefore, due to, lead to, consequently
Condition as long as, if…then, if, so long as, unless, until
Contrast although, by contrast, even though, however, though, whereas,
while
Elaboration also, in addition, in particular, for example, in general
Evaluation with, so, but, which, even so
Means by, with, using
Purpose in order to, for the purpose of
Solutionhood proposed solution, options
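Cue-phrase matching against Table 1 can be sketched as below; the dictionary reproduces only a subset of the table, and whole-word matching is an assumption rather than the annotators' procedure:

```python
import re

# Subset of the cue phrases in Table 1, mapped to relation types.
CUE_PHRASES = {
    "cause-effect": ["because", "since", "as a result", "therefore", "due to"],
    "contrast": ["although", "but", "however", "whereas", "while"],
    "condition": ["if", "unless", "until"],
    "elaboration": ["also", "in addition", "for example"],
    "purpose": ["in order to", "for the purpose of"],
}

def detect_relations(text):
    """Return the relation types whose cue phrases occur in the text."""
    found = []
    for rel, cues in CUE_PHRASES.items():
        if any(re.search(rf"\b{re.escape(cue)}\b", text, re.IGNORECASE)
               for cue in cues):
            found.append(rel)
    return found

print(detect_relations("The starter failed; however, the fire warning light remained on"))
```

As the text notes, such surface matching is easy to implement but gives low coverage: relations signalled without any cue phrase are missed entirely.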
4.2 Semantic-based QA Description
4.2.1 Term indexing
In general, it is difficult to extract good index terms due to inherent ambiguity in natural language
texts. A term in a text, i.e. an alpha-numeric expression, can have different meanings depending on
the domain in which it is being used, and a term can appear more frequently in one domain than in
another. Publicly accessible dictionaries, e.g. WordNet (Miller et al., 1993), are good resources for obtaining the meanings of terms, both manually and automatically. For example, according to
WordNet, blade has nine meanings. One definition is: especially a leaf of grass or the broad portion
of a leaf as distinct from the petiole. However, another in the engineering domain is: flat surface that
rotates and pushes against air or water. Terms can also be used in different domains with the same
meaning. For example, certification does not have a different meaning in the engineering domain.
Most keyword-based search systems index a document with a list of keywords ranked with relevance
weightings. Whereas these keywords might be sufficient to describe superficially the contents of a
document, it is difficult to interpret the true message if their precise meanings are not established.
NLP, on the other hand, produces a rich representation of a document at a conceptual level. To
achieve human-like language processing, NLP includes a range of computational techniques for
analysing and representing natural texts at one or more levels of linguistic analysis (Liddy, 1998). It is
common to categorise such techniques into the six levels listed below, each of which has a different
analysis capability and implementation complexity (Allen, 1987). The application of NLP to a text can be implemented at the simplest level, e.g. the morphological level, and then extended into a fully-fledged pragmatic analysis that shows a superior understanding but requires large resources and
extensive background information. In this paper, NLP processing includes the first five levels, i.e. it
excludes the pragmatic level.
Morphological level: component analysis of words, including prefixes, suffixes and roots, e.g.
using is stemmed into use.
Lexical level: word level analysis including a lexical meaning and a Part-Of-Speech (POS)
analysis, e.g. apple is a kind of fruit and is tagged as Noun.
Syntactic level: analysis of words in a sentence in order to determine the grammatical structure
of the sentence.
Semantic level: interpretation of the possible meanings of a sentence, including the
customisation of the meanings for given domains.
Discourse level: interpretation of the structure and the meaning conveyed from a group of
sentences.
Pragmatic level: understanding the purposeful use of language in situations particularly those
aspects of language which require world knowledge.
Figure 3 shows the steps of the indexing process. The text in the box at the bottom of Figure 2 is used
as an example.
Step 1: Pre-processing (paragraph identification, sentence decomposition, term identification)
Step 2: Syntactic parse (POS tagging, phrase identification)
Step 3: Lexical look-up (term normalisation, acronym identification)
Step 4: Term weighting (Okapi method)
Figure 3. Steps of the indexing process
Step 1: Pre-processing
One paragraph is identified in the example text, which is then decomposed into four sentences. The
first sentence is:
C-GRYC had been modified by the previous owner, DA Services Ltd, to incorporate an engine start
valve master switch.
Terms are identified as words delimited by white space or a full stop.
Step 2: Syntactic parse
The Apple Pie Parser (Sekine & Grishman, 2001) is used for a syntactic parse that tags POS and
identifies phrases. POS identifies not what a word is, but how it is used. It is useful to extract the
meanings of words since the same word can be used as a verb or a noun in a single sentence or in
different sentences. In a traditional grammar, POS classifies a word into eight categories: verb, noun,
adjective, adverb, conjunctive, pronoun, preposition and interjection. The Apple Pie Parser refers to
the grammars defined in the Penn Treebank to determine the POSs (Marcus et al., 1993). For
example, the first word C-GRYC is tagged as NNPX, i.e. singular proper noun. The remaining POS tags for the sentence above are shown below:
POS taggings: C-GRYC/NNPX had/VBD been/VBN modified/VBN by/IN the/DT previous/JJ
owner/NN DA/NNPX Services/NNPS Ltd/NNP to/TOINF incorporate/VB an/DT engine/NN start/NN
valve/NN master/NN switch/NN.
Phrase identification groups words grammatically, e.g. into Noun Phrases (NPs) such as { the previous
owner DA Services Ltd} and {an engine start valve master switch}.
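Phrase identification over the tag sequence can be sketched as below; this is a simplified chunker that groups maximal runs of noun-phrase tags, whereas the real system relies on the Apple Pie Parser's grammar:

```python
# POS-tagged tokens from the example sentence, with the tags shown above.
tagged = [
    ("C-GRYC", "NNPX"), ("had", "VBD"), ("been", "VBN"), ("modified", "VBN"),
    ("by", "IN"), ("the", "DT"), ("previous", "JJ"), ("owner", "NN"),
    ("DA", "NNPX"), ("Services", "NNPS"), ("Ltd", "NNP"),
    ("to", "TOINF"), ("incorporate", "VB"),
    ("an", "DT"), ("engine", "NN"), ("start", "NN"), ("valve", "NN"),
    ("master", "NN"), ("switch", "NN"),
]

# Tags that may participate in a noun phrase.
NP_TAGS = {"DT", "JJ", "NN", "NNS", "NNP", "NNPS", "NNPX"}

def noun_phrases(tokens):
    """Group maximal runs of NP tags into noun phrases; keep multi-word runs only."""
    phrases, current = [], []
    for word, tag in tokens + [(None, None)]:  # sentinel flushes the last run
        if tag in NP_TAGS:
            current.append(word)
        else:
            if len(current) > 1:
                phrases.append(" ".join(current))
            current = []
    return phrases

print(noun_phrases(tagged))
```

On this sentence the chunker recovers the two NPs quoted in the text, {the previous owner DA Services Ltd} and {an engine start valve master switch}.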
Step 3: Lexical look-up
Each POS-tagged word is compared with WordNet definitions to achieve term normalisation.
Acronym identification extends an acronym found in a text fragment with its full definition. An
example of term normalisation is:
modified → modify
and of acronym identification is:
DA → Dan-Air.
Step 4: Term weighting
Although it is possible to analyse the full contents of a document, this becomes computationally
expensive when the documents are large. For an effective retrieval, it is desirable to extract only those
portions of a document that are useful and to transform them into special formats. Text indexing
determines the central properties of the content of a text in order to differentiate relevant portions of
text from irrelevant ones. The quality of each index term is evaluated to determine if it is an effective
identifier of the text content. A relative importance weighting is then assigned to each index term. A
common approach is to index a document divided into paragraph-sized units. In this paper, the Okapi
algorithm is used (Franz & Roukos, 1994; Robertson et al., 1995). It weights a term $t_k$ in a paragraph $p_j$ as follows:

$$w_{jk} = \frac{c_{jk}}{0.5 + 1.5 \cdot \frac{len(p_j)}{ave\_len} + c_{jk}} \cdot \log \frac{N - n + 0.5}{n + 0.5} \qquad \text{Equation (1)}$$

where $c_{jk}$ is the frequency of the term $t_k$ in the paragraph $p_j$, $N$ is the total number of paragraphs in the dataset, $n$ is the number of paragraphs whose contents contain the term $t_k$, $len(p_j)$ is the total frequency of all terms present in the paragraph $p_j$, and $ave\_len$ is the average number of terms per paragraph. Using this term weighting method, the example sentence is stored in a vector model, i.e. each term is associated with its calculated weighting.
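The Okapi weighting can be sketched in code as follows; it follows the classic Okapi formulation of Robertson et al. (1995), and the paragraph statistics below are made-up toy values:

```python
import math

def okapi_weight(c_jk, len_pj, ave_len, N, n):
    """Okapi term weight: a length-normalised term frequency multiplied by
    an inverse paragraph frequency."""
    tf = c_jk / (0.5 + 1.5 * (len_pj / ave_len) + c_jk)
    idf = math.log((N - n + 0.5) / (n + 0.5))
    return tf * idf

# Toy example: a term appearing twice in a 30-term paragraph,
# in 3 of 24 paragraphs (average paragraph length 25 terms).
w = okapi_weight(c_jk=2, len_pj=30, ave_len=25, N=24, n=3)
print(round(w, 3))
```

Terms that are frequent within a paragraph but rare across the dataset receive the highest weights, which is the behaviour the indexing step relies on.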
4.2.2 Domain knowledge in QA
An engineering taxonomy such as EDIT is a useful means to identify domain-specific terms in a
document. The successful extraction of domain-specific terms can improve the accuracy of QA. For
example, the answer to the question What material should be used for this turbine blade? is more easily
identified if Titanium is marked-up as a type of material. Among the four root concepts defined in
EDIT, only two are used: Issue and Product. These two concepts exhibit different characteristics.
According to Ahmed (Ahmed, 2005), issue categories are considerations designers must take into
account when carrying out a design process. These can be the descriptions of problems arising during
a product’s lifecycle or new design requirements to be satisfied. In contrast, product categories
comprise a hierarchical list of product names, decomposing an overall technical product or system into
smaller and smaller elements. Different techniques are therefore needed to handle them in the
documents. For issue categories, any technique that automatically classifies a document into pre-
defined categories is suitable. For product categories, the technique of Named-Entity (NE)
recognition is used. In the QA method proposed in this paper, the techniques developed by Kim et al.
(2006a; 2006b) are used. The technique for classifying issue categories is described in (Kim et al.,
2006b) and the one for classifying product categories, using probability-based NE identifiers, is
described in (Kim et al., 2006a).
4.2.3 QA overview
Figure 4 shows the overall architecture of the proposed QA system. State-of-the-art QA systems can
achieve an accuracy of up to 80%, as demonstrated by recent tests undertaken using TREC datasets,
which mainly consist of newspaper documents (Voorhees, 2002). However, this level of performance
is not expected to be repeated in other environments. The questions in the above tests were carefully
constructed, i.e. no misspellings, and they were mostly factual and based on a single interaction with a
user, i.e. no dialogue.
The prototype system proposed in this paper does not aim at achieving better accuracy in question
analysis or in finding answers. Instead, its main objective is to demonstrate the efficiency of RST-
based annotations for coherent answer generation, i.e. Answer Generation, see step (5) in Figure 4.
Figure 4. Overall architecture of the proposed QA system
Each of the steps in Figure 4 will now be described.
Step 1: Question Analysis
The Question Analysis Module decomposes a question into three parts: (1) Question Word; (2)
Question Focus; and (3) Question Attribute. The Question Word indicates a potential answer type,
e.g. where, when, etc. The Question Focus is a word, or a sequence of words, that describes the user’s
information needs that are expressed in the question. The Question Attribute is the part of the question
that remains after removing the Question Word and the Question Focus. It is used to rank candidate
answers in a decreasing order of relevance. An example is given below.
Question: What were the consequences of the vibration of the starter/generator?
(1.1) Syntactic parse
POS: What|WP were|VBD the|DT consequences|NNS of|IN the|DT vibration|NN of|IN the|DT
starter|NN /|SYM generator|NN
Phrase identification: NPL{the consequences} PP{the vibration of the starter/generator}
(1.2) Question Word {what}, Question Focus {consequences}, Question Attribute {the vibration of
the starter/generator}
(1.3) EDIT indexes: <Product category=‘Starter_Ducting’> starter <Product
category=‘Electrical_Generator’> generator <Issue category=‘Vibrations’> vibration
(1.4) Relation type: effect
(1.5) Answer format: cause-effect(Question Attribute, <Answer>)
The Question Focus, i.e. consequences, for the example question above is matched with the effect in
the cause-effect relation type. Therefore, the possible answers should be the effects of the events
described in the Question Attribute. For an automatic matching, a semantic similarity between the
Question Focus and the relation type is computed using the method proposed by Resnik (1995). This
method is based on the number of edges in a semantic hierarchy, e.g. WordNet, encountered between
two terms when locating them in the hierarchy.
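The three-way decomposition can be sketched with simple surface heuristics; these heuristics are assumptions for illustration only, since the actual module works from the syntactic parse shown above:

```python
def analyse_question(question):
    """Split a wh-question into (Question Word, Question Focus, Question Attribute)
    using the pattern '<wh-word> <aux> the <focus> of <attribute>'."""
    words = question.rstrip("?").split()
    q_word = words[0].lower()
    rest = words[1:]
    # Skip the auxiliary/copula verb and a leading determiner.
    if rest and rest[0].lower() in {"is", "are", "was", "were", "did", "does", "do"}:
        rest = rest[1:]
    if rest and rest[0].lower() == "the":
        rest = rest[1:]
    # The Focus is the head noun before 'of'; the remainder is the Attribute.
    if "of" in rest:
        cut = rest.index("of")
        return q_word, " ".join(rest[:cut]), " ".join(rest[cut + 1:])
    return q_word, rest[0] if rest else "", " ".join(rest[1:])

print(analyse_question("What were the consequences of the vibration of the starter/generator?"))
```

On the example question this recovers the same three parts listed in (1.2) above.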
Step 2: Answer Retrieval
The Answer Retrieval Module uses the Question Attribute identified by the Question Analysis Module
to select paragraphs that might contain candidate answers. A cosine-based similarity calculation is
used for ranking the selected paragraphs in order of relevance to the keywords that appear in the
Question Attribute.
$$sim(q, p_j) = \frac{\sum_{i=1}^{t} w_i \cdot w_{ij}}{\sqrt{\sum_{i=1}^{t} w_i^2 \cdot \sum_{i=1}^{t} w_{ij}^2}} \qquad \text{Equation (2)}$$

where $p_j$ is a given paragraph, $q$ is a Question Attribute, $w_i$ is the weight of the term $t_i$ in the Question Attribute, and $w_{ij}$ is the weight of the term $t_i$ in the paragraph $p_j$. The similarity value is normalised by the total weights of the common words.
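The cosine-based ranking can be sketched over term-weight vectors as follows; the weights below are illustrative toy values, not output of the indexing step:

```python
import math

def cosine_sim(q_weights, p_weights):
    """Cosine similarity between the Question Attribute and a paragraph,
    both represented as {term: weight} vectors."""
    common = set(q_weights) & set(p_weights)
    numerator = sum(q_weights[t] * p_weights[t] for t in common)
    q_norm = math.sqrt(sum(w * w for w in q_weights.values()))
    p_norm = math.sqrt(sum(w * w for w in p_weights.values()))
    if q_norm == 0 or p_norm == 0:
        return 0.0
    return numerator / (q_norm * p_norm)

q = {"vibration": 0.8, "starter": 0.6, "generator": 0.5}
p = {"vibration": 0.7, "starter": 0.4, "fire": 0.9}
print(round(cosine_sim(q, p), 3))
```

Paragraphs are then ranked in decreasing order of this similarity to the Question Attribute.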
Step 3: Answer Extraction and Step 4: Answer Scoring
The Answer Extraction Module examines the paragraphs selected in the Answer Retrieval Module in
order to select text spans from which candidate answers can be extracted. The EDIT indexes,
Question Focus, and Question Attribute are used to determine whether the text spans contain the
answer. In doing so, it is necessary to measure how well they are related to the question. An overall
similarity between the text spans and the question is computed by summing the following three similarity
scores: (1) the score reflecting whether a given text span is classified with the EDIT indexes; (2) the
score reflecting whether a given text span contains the Question Focus; and (3) the score reflecting the
degree of similarity between a given text span and the Question Attribute. They are summed as
follows:
$$s(ts_{ij}) = \alpha \cdot s\_edit(ts_i) + \beta \cdot s\_rst(ts_i) + (1 - \alpha - \beta) \cdot s\_attr(ts_i) \qquad \text{Equation (3)}$$

where $s(ts_{ij})$ is the score of the text span $ts_i$ in the paragraph $p_j$, and $\alpha$ $(0 \le \alpha \le 1)$ and $\beta$ $(0 \le \beta \le 1)$ are used to normalise the score $s(ts_{ij})$ to lie between 0 and 1. $s\_edit(ts_i)$ is defined as:

$$s\_edit(ts_i) = \frac{1}{N} \sum_{k \in positions} s(ind_{ik}) \qquad \text{Equation (4)}$$

where $s(ind_{ik})$ is a Boolean indicating whether or not a given text span is classified with an EDIT index number $ind_{ik}$, $positions$ is the set of matches against the EDIT indexes returned by the Question Analysis Module, and $N$ is the total number of elements in this set. $s\_rst(ts_i)$ is a Boolean variable that is true if the RST annotation for the text span matches the annotation for the question. $s\_attr(ts_i)$ is defined as:

$$s\_attr(ts_i) = \frac{1}{M} \sum_{m=1}^{n} s(t_{im}) \qquad \text{Equation (5)}$$

where $s(t_{im})$ is a Boolean indicating whether or not a given term $t_m$ in the text span $ts_i$ is matched with a term in the Question Attribute, and $M$ is the total number of terms in the Question Attribute. The scored text spans are sorted in decreasing order by value and those above a pre-defined threshold are selected.
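Combining the three similarity scores can be sketched as below; the function names and Boolean inputs are illustrative assumptions, and the weight values mirror those used later in the pilot study:

```python
def s_edit(matches, N):
    """Fraction of the question's EDIT indexes that also classify the
    text span (matches is a list of Booleans)."""
    return sum(matches) / N

def s_attr(term_matches, M):
    """Fraction of Question Attribute terms matched by terms in the span."""
    return sum(term_matches) / M

def span_score(edit_matches, N, rst_match, attr_matches, M,
               alpha=0.3, beta=0.2):
    """Weighted sum of the EDIT, RST, and attribute similarity scores."""
    return (alpha * s_edit(edit_matches, N)
            + beta * float(rst_match)
            + (1 - alpha - beta) * s_attr(attr_matches, M))

# A span matching 1 of 2 EDIT indexes, the question's relation type,
# and 3 of 4 Question Attribute terms:
score = span_score([True, False], N=2, rst_match=True,
                   attr_matches=[True, True, True, False], M=4)
print(round(score, 3))
```

Spans are then sorted by this score and those above the threshold are passed to the Answer Generation Module.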
Step 5: Answer Generation
For coherent answer generation, duplicate sentences or clauses should be removed. Two sentences or clauses are considered duplicates if they are exactly equivalent or if they differ only in their level of generality. For
example, the following two clauses are equivalent: the starter had failed and the failure of the number
2 engine starter. On the other hand, sentence 1 below can be replaced by sentence 2 without loss of
information because sentence 2 subsumes the information in sentence 1.
Sentence 1: It is probable that a short circuit in the engine wiring harness allowed the number 2
engine start valve to re-open, causing the number 2 engine starter to over speed and subsequently fail.
Sentence 2: It is probable that a short circuit in the engine wiring harness allowed the number 2
engine start valve to re-open, causing the number 2 engine starter to over speed and subsequently fail,
resulting in an engine fire.
Automatic text summarisation systems employ various approaches to compare similar sentences that
have different wordings (Mani & Maybury, 1999). In general, these systems use the following two
steps to produce summaries from a document:
Step 1 identifies and extracts important sentences to be included in a summary.
Step 2 synthesises the extracted sentences to form a summary.
There are two common methods of synthesis in step 2. The non-extractive summary method
suppresses repeated sentences either by extracting a subset of the repetitions or by selecting common
terms. It then reformulates the reduced number of sentences to produce the summary (Barzilay et al.,
1999; Jing & McKeown, 2000). The extractive summary method focuses on the extraction of
important sentences and assembles them sequentially to produce the summary. The objective of these
automatic summarisation systems is to create a shortened version of an original text in order to reduce
the time spent reading and comprehending it. The objective of our proposed approach, on the other
hand, is to extend and synthesize text spans to allow the generation of coherent answers.
In the proposed approach, two text spans are compared to determine whether or not both are similar
using Equation (2). Text spans that have higher similarity values than the pre-defined threshold are
excluded. The algorithm of the Answer Generation Module is shown below as pseudo-code.
Variable definition:
answerList: chains of ‘cause and effect’ to be generated by the Answer Generation Module
textspanList: a list of text spans returned by the Answer Scoring Module
relationlinkedList: a list of text spans linked through the ‘cause and effect’ relation obtained
from the RST annotations.
t: one text span being examined and extracted from the textspanList
thresh: a pre-defined threshold used to compare similarity between two text spans
t1: temp variable, t2: temp variable
Initialisation:
    answerList ← empty
Repeat (
    t ← retrieve(textspanList), thresh ← 0.8, t1 ← empty, t2 ← empty
    IF (answerList does not contain t)
    THEN { build chains of ‘cause and effect’ for t and merge them with answerList
        foreach t1 in relationlinkedList(t)
        {
            foreach t2 in answerList
            {
                compute similarity between t1 and t2
                IF (similarity > thresh) { NOTHING }
                ELSE { update(answerList, t1) }
            }
        }
    }
    ELSE { NOTHING }
    remove(textspanList, t)
)
Until (textspanList is EMPTY)
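The algorithm above can be sketched as runnable Python; a simple Jaccard word overlap stands in for the Equation (2) similarity, and the data structures are assumptions:

```python
def jaccard(a, b):
    """Placeholder similarity: word overlap between two text spans."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def generate_answer(textspans, related, thresh=0.8):
    """Build 'cause and effect' chains, skipping near-duplicate spans.

    textspans: scored spans from the Answer Scoring Module.
    related:   maps a span to the spans linked to it by 'cause and effect'
               relations in the RST annotations.
    """
    answer_list = []
    for t in textspans:
        if t in answer_list:
            continue
        for t1 in related.get(t, [t]):
            # Keep t1 only if it is not too similar to anything already kept.
            if all(jaccard(t1, t2) <= thresh for t2 in answer_list):
                answer_list.append(t1)
    return answer_list

spans = ["the starter had failed", "the starter had failed"]
links = {"the starter had failed": ["the starter had failed",
                                    "an engine fire resulted"]}
print(generate_answer(spans, links))
```

The duplicate input span is suppressed, while the linked effect span extends the chain, which is the behaviour the module is designed to produce.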
5 Pilot study
This section presents a preliminary evaluation of the prototype QA system. The evaluation tests the
following two hypotheses: (1) the proposed system is efficient at extracting and presenting answers to
causal questions using relation types, and (2) the presentation of the synthesised answers helps users to
understand the retrieved results. The first hypothesis was evaluated by comparing the performance of
the prototype system against that of a standard QA system. The standard system is different from the
prototype system in the following ways:
Step 1: the Answer Format is revised as <Answer> Question Focus Question Attribute. This format
indicates that the system should find text spans that have syntactic variations with the Question Focus
and semantic similarities with the Question Attribute. Using the example question in Section 4.2.3, a
potential answer text span can be <Answer> is the consequence of <Question Attribute>.
Step 2: is not revised.
Step 3 and Step 4: Equation (3) is revised as: $s(ts_{ij}) = s\_attr(ts_i)$.
Step 5: is not used.
With the standard system, multiple instances were extracted without synthesising them. This
comparison examined whether the answer generation method described in Section 4.2 avoids repeated
information and generates coherent answers. The second hypothesis was evaluated by measuring user
performance in a simple trial.
5.1 Trial dataset
For the trial, three official aircraft incident investigation reports were downloaded from websites5.
Although the incidents happened to three different aircraft types (Boeing 727-217, Bell 205A-1
helicopter, and Boeing 737-8AS), the incidents share a common cause, i.e. an in-flight engine fire.
Although the reports were written by different incident investigation teams, they share a broadly
similar terminology, e.g. emergency landing, engine starter valve, etc. After removing embedded
HTML tags and images, the average document length was 1820 words, or 24 paragraphs. Each
document was first indexed as described in Section 4.2.1, and RST annotations were applied by the
first author using RSTTool. A total of 194 relations were annotated. The total numbers for each relation type were: Background = 20, Cause-Effect = 45, Condition = 16, Contrast = 30, Elaboration =
32, Evaluation = 24, Means = 10, Purpose = 12, and Solutionhood = 5.
5.2 The trial
Six Engineering graduate students and two members of the Engineering Department staff of
Cambridge University participated in the trial. A brief introduction to the trial and trial dataset was
given to the participants. Each participant was asked to answer multiple questions and their
performance and accuracy were measured. The trial consisted of reviewing the answers to three
questions, i.e. one for each incident report. These answers were split into two groups. The first group
was extracted and synthesized using the prototype QA system and the second was extracted using a
standard QA system. For a fixed period of time, the participants were instructed to read the answers
on-line for both systems (see Figure 5), and follow links to the original document if desired. After
this, the participants were given a further list of questions related to the answers they had just read.
5 (1) http://www.tsb.gc.ca/en/reports/air/1996/a96o0125/a96o0125.asp,
(2) http://www.tsb.gc.ca/en/reports/air/2002/A02C0114/A02C0114.asp,
(3) http://www.aaib.gov.uk/sites/aaib/cms_resources/dft_avsafety_pdf_029538.pdf
Their answers to these questions were used to test their understanding of the answers they had just
seen. In order to avoid the evaluation problems caused by the inclusion of incorrect answers, both
groups of answers were examined in order to verify that they were all true.
Figure 5. An example screenshot of the proposed system
Table 2 shows the three initial questions, along with the associated questions that were used to test the
users’ understanding.
Table 2: The three original questions along with their associated questions
Question 1: What triggered the engine fire alarm on the Boeing 727-217?
1. On which engine of the Boeing 727-217 was the fire alarm observed?
2. What were the consequences of the failure of the number 2 engine starter?
3. Why did the number 2 engine starter overspeed?
4. Did the starter valve of the number 2 engine close after the engine was started?
5. Why did the number 2 engine starter valve re-open?
6. How can we determine if the starter valve is open?
Question 2: What triggered the engine fire alarm on the Bell 205A-1 helicopter?
1. Why did the starter/generator start to vibrate?
2. What were the consequences of the vibration of the starter/generator?
3. Why was the hold-down nut at the 12 o’clock position left-out?
4. Was the engine fire alarm activated due to the abrasion of the cooling fan?
Question 3: What triggered the shut-down of the engine 2 on the Boeing 737-8AS?
1. Does this incident have the same engineering problem as the Boeing 727-217?
2. Did the failure of No 4 bearing in the number 2 engine contribute to the event?
3. What were the consequences of the presence of the engine vibration?
5.3 Trial results
Answers to the three original questions shown in Table 2 were extracted and synthesized using the method described in Section 4.2. The answers were then compared to the set of answers prepared in Section 5.2. The threshold for the Answer Retrieval Module, i.e. the value for Equation (2), was set to 0.5, meaning that the paragraphs with cosine-similarity values over 0.5 were selected. The values of $\alpha$ and $\beta$ specified in Equation (3) were set to 0.3 and 0.2, respectively. The threshold for the Answer Scoring Module, i.e. the value for Equation (3), was set to 0.2.
The results are shown using two tables, i.e. Table 3 and Table 4. Table 3 summarises the results of the
‘cause and effect’ chains generated by the proposed QA. Table 4 compares the performance of the
proposed QA on retrieving correct text spans with that of the standard QA. Precision and recall were
used to measure the performance. In this paper, precision is defined as the number of right text spans retrieved as a percentage of the total number of retrieved text spans. Recall is the number of right text spans retrieved as a percentage of the total number of right text spans.
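These definitions amount to the following short sketch, with made-up counts for illustration:

```python
def precision(retrieved_right, retrieved_total):
    """Fraction of retrieved text spans that are right."""
    return retrieved_right / retrieved_total

def recall(retrieved_right, total_right):
    """Fraction of all right text spans that were retrieved."""
    return retrieved_right / total_right

# Toy example: 6 of 9 retrieved spans are right, out of 7 right spans overall.
print(round(precision(6, 9), 2), round(recall(6, 7), 2))
```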
Table 3. The overview of ‘cause and effect’ chains generated by the proposed QA
Num. of paragraphs Num. of text spans Depth of the chains
Question 1 17 12 4
Question 2 8 9 4
Question 3 11 6 2
The second column in Table 3 specifies the number of paragraphs returned by the Answer Retrieval
Module and the third one specifies the number of text spans returned by the Answer Extraction and
Scoring Modules. The fourth one specifies the depth of the chains, i.e. the number of cause and effect
nodes along the longest path from the root node down to the farthest leaf node in the chains. For
example, for Question 1, one ‘cause and effect’ chain with a depth of 4 was generated by
synthesising and extending the 12 text spans extracted from the 17 paragraphs.
The following shows examples of correct and incorrect text spans for Question 1.
Correct text spans:
(1) The failure of the number 2 engine starter resulted in an engine fire.
(2) The hazard associated with an engine fire caused by a starter failure was recognized and
addressed in AWD 83-01-05 R2.
(3) It is probable that a short circuit in the engine wiring harness allowed the number 2 engine start
valve to re-open, causing the number 2 engine starter to over speed and subsequently fail, resulting
in an engine fire.
Incorrect text spans:
(1) Because of the engine’s proximity to the elevator and rudder control systems, a severe in-flight fire
in the number 2 engine is potentially more serious than a fire in either the number 1 or 3 engine.
(2) Fire damage to the engine component wiring precluded any significant testing of the wiring
harness.
(3) Two fire bottles were discharged into the number 2 engine compartment; however, the fire
warning light remained on.
Table 4. Comparison of two QA systems for the task of retrieving correct text spans
STANDARD QA PROPOSED QA
Precision Recall Precision Recall
Question 1 0.45 0.5 0.67 1
Question 2 0.6 0.67 0.78 0.78
Question 3 0.67 0.57 0.83 0.79
Average 0.57 0.58 0.76 0.86
As shown in Table 4, on average, the proposed QA achieved 76% precision and 86% recall when
retrieving text spans for three questions. On the other hand, the standard QA achieved 57% precision
and 58% recall. This suggests that the proposed QA has considerable potential for extracting and
synthesizing answers to causal questions. The task of retrieving text spans is similar to the sentence
selection task in automatic text summarisation systems.
The text summarisation systems referred to earlier in this paper are by Barzilay et al. (1999) and by
Jing and McKeown (2000). In the context of multi-document summarisation, Barzilay et al. (1999)
focused on the generation of paraphrasing rules that were used to compare semantic similarity between
two sentences. They tested the rules for the task of identifying common phrases among multiple
sentences. The automatically generated common phrases were then reviewed by human judges. The
reviews identified 39 common phrases, of which the system correctly identified 29. In addition, the identified phrases contained 69% of the correct subjects and 74% of the correct main verbs.
On average, the system achieved 72% accuracy.
Jing and McKeown (2000) carried out three evaluations. The first tested whether the automatic
summarisation system could identify a phrase in the original text that corresponds to the selected
phrase in a human-written abstract. When tested with 10 documents, the automatic system achieved
82% precision and 79% recall on average. The second evaluation tested whether the automatic system
could remove extraneous sentences, i.e. sentence reduction. The result showed that 81% of the
reduction decisions made by the system agreed with those of humans. The third evaluation tested
whether the automatic system could generate coherent summaries. The system achieved 6.1 points out
of 10, i.e. 61% accuracy for generating coherent summaries. Only the first evaluation focused on the
sentence selection.
The performance of our proposed QA when retrieving correct text spans with 76% of precision and
86% recall is slightly better than the work by Barzilay et al. (1999), i.e. 72% accuracy, and comparable
to the work by Jing & McKeown (2000), i.e. 82% precision and 79% recall.
On average, the users in the first group, i.e. those who read the answers given by the proposed QA, incorrectly answered two of the 13 questions, whereas the users in the second group incorrectly answered five. On average, the users in the first group completed the trial
within 19 minutes, and the users in the second group completed the trial within 25 minutes. Five of
the 13 questions were correctly answered by all the users in the first group, whereas just one question
was correctly answered by all the users in the second group. All users in the second group incorrectly
answered question number 6 ‘how can we determine if the starter valve is open’.
Although the preliminary results are encouraging, it is difficult to draw firm conclusions from this trial
for the following reasons: (1) the low number of users in the two groups; and (2) the number of causal
relations in the trial dataset was small. Users in the first group expressed the opinion that the
synthesized chains of ‘cause and effect’ description were helpful in understanding the causes of the
three incidents.
6 Conclusion and further work
Researchers in computational linguistics have speculated that the relation types defined in RST can
improve the performance of QA systems when answering complex questions. The class of causal
reasoning questions, either predictive or diagnostic, is one that we have shown might be better
answered using these relation types. The reason for this is that the majority of causal questions can be
answered in multiple ways, i.e. it is difficult to pinpoint particular causes and regard them as
independent of the remaining information. Generally, identifying the causes of a specific event
involves creating chains of ‘cause and effect’ relations. Without a deep understanding of all the
relevant information contained in a document, it is not possible to derive such causal chains
automatically. It is still not known how users would like such causal chains to be presented, and it is
not suggested that the interface proposed in this paper is necessarily the best. The contribution of this
paper is the demonstration of a method for synthesizing causal information into coherent answers.
The source information can be scattered over different parts of a single document or over multiple
documents. The pilot study indicated that the proposed QA was more efficient at extracting and
synthesizing answers when compared with standard QA, i.e. a 19 percentage point increase in precision and a 28 percentage point increase in recall. The pilot study also indicated that the synthesized chains
36
of ‘cause and effect’ descriptions were helpful not only for quickly understanding the direct causes of
the three incidents but also for being aware of related contexts along with the rationales for the causes
of the incidents.
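The chaining of ‘cause and effect’ relations described above can be illustrated with a minimal sketch. The relation representation, function names and incident phrases below are hypothetical, not the prototype's actual data model: each RST ‘cause’ relation links a satellite (the cause) to a nucleus (the effect), and walking the index backwards from a queried event yields a linear causal chain.

```python
# A minimal sketch of causal-chain synthesis over RST 'cause' annotations.
# Relation dictionaries and event labels are illustrative assumptions.

def build_cause_index(relations):
    """Map each effect span (nucleus) to the spans annotated as its causes."""
    index = {}
    for rel in relations:
        if rel["type"] == "cause":
            index.setdefault(rel["nucleus"], []).append(rel["satellite"])
    return index

def causal_chain(event, index, seen=None):
    """Trace one chain of causes back from `event` (depth-first, cycle-safe)."""
    seen = seen or set()
    chain = [event]
    for cause in index.get(event, []):
        if cause not in seen:
            seen.add(cause)
            chain = causal_chain(cause, index, seen) + chain
            break  # follow a single chain to keep the answer linear
    return chain

relations = [
    {"type": "cause", "satellite": "valve stuck open", "nucleus": "starter overspeed"},
    {"type": "cause", "satellite": "starter overspeed", "nucleus": "starter disintegration"},
]
index = build_cause_index(relations)
print(" -> ".join(causal_chain("starter disintegration", index)))
# -> valve stuck open -> starter overspeed -> starter disintegration
```

The relevant relations may be scattered across documents; the index simply accumulates them regardless of their source.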
The main objective is to improve the understanding of the answers generated by QA systems. An
answer is considered to be coherent if duplicate expressions are eliminated and if it is appropriately
extended with additional information. This additional information should help users verify the
answers and increase their awareness of relevant domain information. Using RST annotations, it has
been shown that it is feasible to compare and integrate the information at a semantic level. This leads
to a way of presenting answers in a more natural manner. A pilot trial demonstrated that the answers
generated by the prototype QA system led to more rapid and improved understanding of those
answers.
Further work is planned with the aim of improving the performance of the prototype system in three
ways. First, since engineers have varying levels of domain expertise, the system should consider the
preferences and profiles of individuals. Inexperienced engineers might have very broad information
requests and prefer to explore the domain, whereas experienced engineers might have detailed
information requests aimed at refining their existing knowledge. Novice engineers require more
background information, probably assembled using ‘elaboration’ or ‘background’ relation types.
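A profile-sensitive answer extension of the kind suggested above might be sketched as follows. The function, profile labels and satellite texts are illustrative assumptions, not part of the prototype: novices receive ‘background’ and ‘elaboration’ satellites appended to the core answer, while experts receive only the core answer.

```python
# A hypothetical sketch of profile-driven answer extension using RST satellites.

def extend_answer(core_answer, satellites, expertise):
    """Append the RST satellites appropriate to a user's expertise level."""
    wanted = {"novice": ("background", "elaboration"), "expert": ()}[expertise]
    extras = [text for rel_type, text in satellites if rel_type in wanted]
    return " ".join([core_answer] + extras)

satellites = [
    ("background", "The starter motor spins the engine before ignition."),
    ("elaboration", "The valve is held open by regulated air pressure."),
]
print(extend_answer("The starter valve is open.", satellites, "novice"))
print(extend_answer("The starter valve is open.", satellites, "expert"))
```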
Second, synthesising sentences extracted from different documents is crucial to generate answers that
are longer than one sentence. When writing a sequence of linked sentences, authors often replace
noun phrases by pronouns, or by shortened forms of the phrase, in subsequent sentences, e.g. ‘the
number 2 engine starter’ becomes ‘it’ or ‘the starter’. Coreference (anaphora) resolution is the process
of determining which expressions refer to the same entity and is a key issue in computational sentence
synthesis. However, the main focus of research in this area has been on the resolution of personal
pronouns, e.g. ‘he’, ‘him’ and ‘his’. Various techniques have been proposed for automatic coreference
identification, and it is planned to extend the prototype QA system by adapting them.
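One simple heuristic for the shortened-form case (distinct from the published techniques the text refers to) is to accept a candidate antecedent when the shortened phrase shares its head noun with the full phrase and its content words are a subset of the full phrase's words. The function name and examples below are illustrative assumptions.

```python
# An illustrative heuristic for matching a shortened noun phrase against its
# full antecedent; not a substitute for full coreference resolution.

def is_shortened_form(short, full):
    """True if `short` could abbreviate `full`: same head noun, words a subset."""
    stop = {"the", "a", "an"}
    s_content = [w for w in short.lower().split() if w not in stop]
    f_content = [w for w in full.lower().split() if w not in stop]
    return (bool(s_content)
            and s_content[-1] == f_content[-1]     # same head noun
            and set(s_content) <= set(f_content))  # no extra content words

print(is_shortened_form("the starter", "the number 2 engine starter"))  # True
print(is_shortened_form("the valve", "the number 2 engine starter"))    # False
```

Pronoun resolution (‘it’, ‘he’) needs more context than this string-level check and would rely on the adapted techniques mentioned above.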
Third, the crucial issue of automatic RST annotation will be addressed since this is essential for the
practical application of the system. Kim et al. (2004) have applied a machine learning algorithm, i.e.
Inductive Logic Programming (ILP), to analyse documents created using the Design Rationale editor
(DRed). This enabled the automatic identification of the relation types (Bracewell & Wallace, 2003,
Bracewell et al., 2004). Tests have demonstrated approximately 80% accuracy. This high figure can
be attributed partly to the structure of the DRed documents in the dataset. These documents are
carefully structured using an argumentation model derived from that of IBIS (Kunz & Rittel, 1970).
The documents comprise linked textual elements of a predefined set of types. These element types
include ‘issue’, ‘answer’ and ‘argument’. The links between them are directed but untyped. This
algorithm will be extended to deal with other types of documents, e.g. Web pages and unstructured
texts.
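The DRed document structure described above, typed textual elements joined by directed but untyped links, can be sketched as a small data model. The class and field names are illustrative assumptions; the closing comment merely suggests one kind of feature a learner such as ILP might exploit.

```python
# A minimal sketch of the DRed/IBIS-derived structure: typed elements
# ('issue', 'answer', 'argument') with directed, untyped links.

from dataclasses import dataclass, field

@dataclass
class Element:
    kind: str                                  # 'issue', 'answer' or 'argument'
    text: str
    links: list = field(default_factory=list)  # directed, untyped links

issue = Element("issue", "Why did the starter disintegrate?")
answer = Element("answer", "The starter valve stuck open.")
argument = Element("argument", "Overspeed is consistent with the debris found.")
issue.links.append(answer)
answer.links.append(argument)

# A learner could use (source kind, target kind) pairs as one feature when
# inferring an RST relation type for each untyped link:
pairs = [(issue.kind, linked.kind) for linked in issue.links]
print(pairs)  # [('issue', 'answer')]
```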
The main objective of this research was to answer more complex questions than current QA systems
are capable of answering. There are five modules in the architecture of the proposed QA system:
Question Analysis; Answer Retrieval; Answer Extraction; Answer Score; and Answer Generation.
The Question Analysis Module analyses the question in terms of the Question Word, Question Focus
and Question Attribute. The next three modules retrieve, extract and score answers from documents
that have been manually annotated and semi-manually indexed. The manual annotation is based on
nine of the 33 relation types defined in RST. The semi-manual indexing uses the issue and product
categories of the EDIT engineering taxonomy. The main contribution of this research lies in the fifth
module. This module synthesises causal information into coherent answers, drawing information from
both different parts of a single document and from multiple documents. A prototype implementation
shows promise, but additional testing is required. Further developments are proposed that will: (1)
allow the system to take into account the preferences and profiles of users; (2) extend the system to
include coreference identification; and (3) eliminate the manual annotation of documents. As with all
computer support systems, the interface is critical and here further empirical research is needed.
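The five-module data flow summarised above can be sketched schematically. The function bodies below are placeholders, not the prototype's algorithms; only the pipeline shape and the Question Word/Focus/Attribute decomposition mirror the text.

```python
# A schematic sketch of the five-module QA pipeline; all internals are
# placeholder assumptions, only the data flow follows the architecture.

def analyse_question(question):
    """Question Analysis: split into Question Word, Focus and Attribute."""
    word, _, rest = question.partition(" ")
    return {"word": word, "focus": rest, "attribute": None}

def retrieve(analysis, annotated_docs):
    """Answer Retrieval over RST-annotated, taxonomy-indexed documents."""
    return annotated_docs

def extract(analysis, docs):
    """Answer Extraction: pull candidate segments from retrieved documents."""
    return [seg for doc in docs for seg in doc["segments"]]

def score(analysis, candidates):
    """Answer Score: rank candidates (placeholder ranking)."""
    return sorted(candidates, key=len)

def generate(candidates):
    """Answer Generation: synthesize candidates into one coherent answer."""
    return " ".join(candidates)

docs = [{"segments": ["The starter valve stuck open.", "This caused overspeed."]}]
analysis = analyse_question("Why did the starter disintegrate?")
answer = generate(score(analysis, extract(analysis, retrieve(analysis, docs))))
print(answer)
```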
Acknowledgements
This work was funded by the University Technology Partnership for Design, which is a collaboration
between Rolls-Royce, BAE SYSTEMS and the Universities of Cambridge, Sheffield and
Southampton. We also thank S. Banerjee and T. Pedersen for software implementing the similarity
measure proposed by Resnik (1995).
References
80-20 Software. (2003). 80-20 Retriever Enterprise Edition. http://www.80-20.com/brochures/Personal
Email Search Solution.pdf
Ahmed, S. (2005). Encouraging Reuse of Design Knowledge: A Method to Index Knowledge. Design
Studies Journal, 26(6), 565-592.
Ahmed, S., Kim, S., & Wallace, K. M. (2005). A methodology for creating ontologies for engineering
design. Proc. ASME 2005 Int. Design Engineering Technical Conf. on Computers and Information in
Engineering, DETC 2005-84729. U.S.A.
Allen, J. (1987). Natural Language Understanding. Benjamin/Cummings Publishing Company, Inc.
Aunimo, L., & Kuuskoski, R. (2005). Question Answering Using Semantic Annotation. Proc. Cross
Language Evaluation Forum (CLEF). Austria.
Barzilay, R., McKeown, K. R., & Elhadad, M. (1999). Information Fusion in the Context of Multi-
Document Summarization. Proc. 37th Annual Meeting of the Association for Computational
Linguistics, pp. 550-557. U.S.A.
Bosma, W. (2005). Extending Answers using Discourse Structure. Proc. Workshop on Crossing
Barriers in Text Summarization Research in RANLP, pp. 2-9. Bulgaria.
Bracewell, R. H., Ahmed, S., & Wallace, K. M. (2004). DRed and design folders: a way of capturing,
storing and passing on knowledge generated during design projects. Proc. Design Automation Conf.,
ASME. U.S.A.
Bracewell, R. H., & Wallace, K. M. (2003). A tool for capturing design rationale. Proc. 14th Int. Conf.
on Engineering Design, pp. 185-186. Stockholm.
Brill, E., Lin, J., Banko, M., Dumais, S. T., & Ng, A.Y. (2001). Data-intensive question answering.
Proc. Tenth Text REtrieval Conf. (TREC 2001), pp. 183-189. U.S.A.
Burger, J., Cardie, C., Chaudhri, V., Gaizauskas, R., et al. (2001). Issues, Tasks and Program
Structures to Roadmap Research in Question & Answering (QA). NIST.
Burstein, J., Marcu, D., & Knight, K. (2003). Finding the write stuff: Automatic identification of
discourse structure in student essays. IEEE Intelligent Systems, Jan/Feb, 32-39.
Diekema, A. R., Yilmazel, O., Chen, J., Harwell, S., He, L., & Liddy, E. D. (2004). Finding Answers
to Complex Questions. In New Directions in Question Answering (Maybury, M. T., Ed.), pp. 141-152.
AAAI-MIT Press.
Franz, M., & Roukos, S. (1994). TREC-6 Ad-Hoc Retrieval. Proc. of the Sixth Text REtrieval Conf.
(TREC-6), pp. 511-516.
Hai, D., & Kosseim, L. (2004). The Problem of Precision in Restricted-Domain Question-Answering:
Some Proposed Methods of Improvement. Proc. Workshop on Question Answering in Restricted
Domains in ACL, pp. 8-15. Barcelona.
Hickl, A., Lehmann, J., Williams, J., & Harabagiu, S. (2004). Experiments with Interactive Question
Answering in Complex Scenarios. Proc. North American Chapter of the Association for
Computational Linguistics annual meeting (HLT-NAACL), U.S.A.
Hovy, E. H. (1993). Automated Discourse Generation Using Discourse Structure Relations. Artificial
Intelligence, 63(1-2), 341-385.
Jing, H., & McKeown, K. R. (2000). Cut and Paste Based Text Summarization. Proc. 1st Meeting of
the North American Chapter of the Association for Computational Linguistics (HLT-NAACL), pp.
178-185. U.S.A.
Kim, S., Bracewell, R.H., & Wallace, K.M. (2004). From discourse analysis to answering design
questions. Proc. Int. Workshop on the Application of Language and Semantic Technologies to support
Knowledge Management Processes, pp. 43-49. U.K.
Kim, S., Ahmed, S., & Wallace, K. M. (2006a). Improving document accessibility through ontology-
based information sharing. Proc. Int. Symposium series on Tools and Methods of Competitive
Engineering, pp. 923-934. Slovenia.
Kim, S., Bracewell, R.H., Ahmed, S., & Wallace, K. M. (2006b). Semantic Annotation to Support
Automatic Taxonomy Classification. Proc. Int. Design Conference (Design 2006), pp. 1171-1178.
Croatia.
Knott, A., & Dale, R. (1995). Using linguistic phenomena to motivate a set of coherence relations.
Discourse Processes, 18(1), 35-62.
Kunz, W., & Rittel, H. W. J. (1970). Issues as Elements of Information Systems. Working Paper 131.
Center for Planning and Development Research, Berkeley, U.S.A.
Kwok, C., Etzioni, O., & Weld, D. S. (2001). Scaling Question Answering to the Web. Proc. of the
10th Int. Conf. on World Wide Web, pp. 150-161. Hong Kong.
Liddy, E. D. (1998). Enhanced Text Retrieval Using Natural Language Processing. Bulletin of the
American Society for Information Science and Technology, 24(4), 14-16.
Lin, J., Quan, D., Sinha, V., Bakshi, K., Huynh, D., Katz, B., & Karger, D. R. (2003). What Makes a
Good Answer? The Role of Context in Question Answering. Proc. of the IFIP TC13 Ninth Int. Conf.
On Human-Computer Interaction, Switzerland.
Lopez, V., Pasin, M., & Motta, E. (2005). AquaLog: An Ontology-Portable Question Answering
System for the Semantic Web. Proc. of the Second European Semantic Web Conference (ESWC), pp.
546-562. Greece.
Mani, I., & Maybury, M. (1999). Advances in Automatic Text Summarisation. The MIT Press.
Mann, W., & Thompson, S. (1988). Rhetorical structure theory: Toward a functional theory of text
organization. Text, 8(3), 243-281.
Marcu, D., & Echihabi, A. (2002). An unsupervised approach to recognising discourse relations. Proc.
of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 368-375. U.S.A.
Marcu, D. (1999). Discourse trees are good indicators of importance in text. In Advances in Automatic
Text Summarization (Mani, I., & Maybury, M., Eds.). MIT Press.
Marsh, J.R., & Wallace, K. (1997). Observations on the Role of Design Experience. Proc. WDK
Annual Workshop, Switzerland.
Miller, G. A., Beckwith, R. W., Fellbaum, C., Gross, D., & Miller, K. (1993). Introduction to WordNet:
An on-line lexical database. International Journal of Lexicography, 3(4), 235-312.
Mukherjee, R., & Mao, J. (2004). Enterprise Search: Tough Stuff. ACM Queue, 2(2), 36-46.
Nyberg, E., Mitamura, T., Frederking, R., Pedro, V., Bilotti, M., Schlaikjer, A., & Hannan, K. (2005).
Extending the JAVELIN QA System with Domain Semantics. Proc. of the Workshop on Question
Answering in Restricted Domains at AAAI, U.S.A.
O'Donnell, M. (2000). RSTTool 2.4 -- A Markup Tool for Rhetorical Structure Theory. Proc. of the
Int. Natural Language Generation Conference (INLG'2000), pp. 253-256. Israel.
Resnik, P. (1995). Using Information Content to Evaluate Semantic Similarity in a Taxonomy. Proc.
14th Int. Joint Conf. on Artificial Intelligence, pp. 448-453.
Robertson, S. E., Walker, S., Jones, S., & Hancock-Beaulieu, M. G. (1995). Okapi at TREC-3. Proc. of
the Third Text REtrieval Conference (TREC-3), NIST Special Publication 500-225.
Salton, G. (1989). Advanced Information-Retrieval Models. In Automatic Text Processing (Salton, G.
Ed.), chapter 10. Addison-Wesley Publishing Company.
Sekine, S., & Grishman, R. (2001). A Corpus-Based Probabilistic Grammar with Only Two Non-
Terminals. Proc. Fourth Int. Workshop on Parsing Technologies, pp. 216-223. Czech Republic.
Taboada, M., & Mann, W. (2006). Rhetorical Structure Theory: Looking back and Moving ahead.
Discourse Studies, 8(3) (to appear).
Teufel, S. (2001). Task-based evaluation of summary quality: Describing relationships between
scientific papers. Proc. Int. Workshop on Automatic Summarization at NAACL, U.S.A.
Voorhees, E. M. (2002). Overview of the TREC 2002 Question Answering Track. Proc. of the Text
REtrieval Conference (TREC).
Williams, S., & Reiter, E. (2003). A corpus analysis of discourse relations for natural language
generation. Proc. of Corpus Linguistics, pp. 899-908. U.K.