ArticlePDF Available

Nonfactoid Question Answering as Query-Focused Summarization With Graph-Enhanced Multihop Inference

Authors:

Abstract and Figures

Nonfactoid question answering (QA) is one of the most extensive yet challenging applications and research areas in natural language processing (NLP). Existing methods fall short of handling the long-distance and complex semantic relations between the question and the document sentences. In this work, we propose a novel query-focused summarization method, namely a graph-enhanced multihop query-focused summarizer (GMQS), to tackle the nonfactoid QA problem. Specifically, we leverage graph-enhanced reasoning techniques to elaborate the multihop inference process in nonfactoid QA. Three types of graphs with different semantic relations, namely semantic relevance, topical coherence, and coreference linking, are constructed for explicitly capturing the question-document and sentence-sentence interrelationships. Relational graph attention network (RGAT) is then developed to aggregate the multirelational information accordingly. In addition, the proposed method can be adapted to both extractive and abstractive applications as well as be mutually enhanced by joint learning. Experimental results show that the proposed method consistently outperforms both existing extractive and abstractive methods on two nonfactoid QA datasets, WikiHow and PubMedQA, and possesses the capability of performing explainable multihop reasoning.
Content may be subject to copyright.
IEEE TRANSITIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 1
Non-factoid Question Answering as Query-focused
Summarization with Graph-enhanced Multi-hop
Inference
Yang Deng, Wenxuan Zhang, Weiwen Xu, Ying Shen, and Wai Lam
Abstract—Non-factoid question answering (QA) is one of
the most extensive yet challenging applications and research
areas in natural language processing (NLP). Existing methods
fall short of handling the long-distance and complex semantic
relations among the question and the document sentences. In this
work, we propose a novel query-focused summarization method,
namely Graph-enhanced Multi-hop Query-focused Summarizer
(GMQS), to tackle the non-factoid QA problem. Specifically, we
leverage graph-enhanced reasoning techniques to elaborate the
multi-hop inference process in non-factoid QA. Three types of
graphs with different semantic relations, namely semantic rele-
vance, topical coherence, and coreference linking, are constructed
for explicitly capturing the question-document and sentence-
sentence interrelationships. Relational Graph Attention Network
(RGAT) is then developed to aggregate the multi-relational
information accordingly. In addition, the proposed method can
be adapted to both extractive and abstractive applications as
well as be mutually enhanced by joint learning. Experimental
results show that the proposed method consistently outperforms
both existing extractive and abstractive methods on two non-
factoid QA datasets, WikiHow and PubMedQA, and possesses
the capability of performing explainable multi-hop reasoning.
Index Terms—Non-factoid Question Answering, Query-focused
Summarization, Graph Neural Network, Multi-hop Reasoning
I. INTRODUCTION
NON-FACTOID Question Answering (QA) has received
a significant amount of attention recently due to its
board applications on a variety of real-world Community-
based Question Answering (CQA) sites, such as Quora, Stack-
OverFlow, and Amazon Q&A. Different from factoid QA [1],
which can be simply answered by a short text span or a
single sentence without detailed information, e.g., “Who is the
author of Harry Potter?”, the answers for non-factoid questions
are supposed to be more informative, involving some detailed
analysis, like opinions and explanations, to explain or justify
the final answers, such as questions in community QA [2], [3]
or explainable QA [4], [5]. Non-factoid QA contains a wider
The work described in this article is substantially supported by a grant from
the Research Grant Council of the Hong Kong Special Administrative Region,
China (Project Code: 14200719), the National Natural Science Foundation
of China (No.61602013), and the Shenzhen General Research Project (No.
JCYJ20190808182805919). (Corresponding author: Ying Shen)
Yang Deng, Weiwen Xu, and Wai Lam are with the Department of System
Engineering and Engineering Management, The Chinese University of Hong
Kong, Hong Kong 999077. (E-mail: {ydeng, wwxu, wlam}@se.cuhk.edu.hk)
Wenxuan Zhang is with with DAMO Academy, Alibaba Group. (E-mail:
saike.zwx@ alibaba-inc.com)
Ying Shen is with the School of Intelligent Systems Engineering,
Sun Yat-sen University, Gunagzhou 510275, China. (E-mail: sheny76@
mail.sysu.edu.cn)
range of open-ended questions, including “How” or “Why”
questions, yes-no questions. For example, “How to tube feed
a puppy?” or Are human coronaviruses uncommon in patients
with gastrointestinal illness?” cannot be answered without the
context from the document, as the example in Figure 1&6.
In practice, non-factoid QA requires the capability of merg-
ing multiple sparse and diverse information from different
sentences across the whole supporting document or evidences
together to form a concise and complete answer. Document
summarization methods have been adopted as an effective
way to summarize salient information, which can also be
adopted to provide a concise answer for the given question
in the context of non-factoid question answering [7], [8].
Essentially, the key to tackling the non-factoid QA problem
is to measure the relevance degree between the question and
candidate answer sentences [9], [10]. This leads to a variety
of researches that elaborate the semantic interactions between
the question and candidate answer sentences, from Siamese
Neural Models [11], [12] to Compare-Aggregate Models [13],
[14]. However, traditional document summarization methods,
when being applied on non-factoid QA [8], [15], fall short
of capturing the important semantic interactions between the
question and the document sentence.
To achieve this, we investigate the non-factoid QA problem
as a query-focused summarization problem, as they share a
similar goal to produce a concise but informative summary,
driven by a specific query. In the past studies, query-focused
summarization was mainly explored by traditional information
retrieval methods [2], [7], [16], which heavily rely on hand-
crafted features or tedious multi-stage pipelines. Inspired by
the promising performance of deep learning models on other
NLP tasks, several efforts have been made on developing deep
learning based models [3], [17]–[19] to summarize the source
document with the guidance of specific queries. However,
most of them focus on capturing the semantically relevant
information with the query to produce the summary, while
failing to provide informative and logical answers due to the
overlook of two crucial characteristics in non-factoid QA:
The long-distance interrelationships among the document
sentences make it difficult to fetch all the necessary infor-
mation for constructing the final answer.
The complex semantic relations attach great importance to
the reasoning procedure and the explainability of the answer.
As the example shown in Figure 1, given the specific
question, there are several highlighted sentences required
IEEE TRANSITIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2
Question: Are human coronaviruses uncommon in patients with gastrointestinal illness?
Document: <S>Coronaviruses infect numerous animal species causing a variety of illnesses including respiratory, neurologic and enteric
disease. <S>Human coronaviruses (HCoV) are mainly associated with respiratory tract disease but have been implicated in enteric disease.
<S>To investigate the frequency of coronaviruses in stool samples from children and adults with gastrointestinal illness by RT-PCR.
<S>Clinical samples submitted for infectious diarrhea testing were collected from December 2007 through March 2008. <S>RNA extraction
and RT-PCR was performed for stools negative for Clostridium difficile using primer sets against HCoV-229E, HCoV-OC43, HCoV-
NL63, and HCoV-HKU1. <S>Clinical data from samples positive for coronaviruses were reviewed and recorded. <S>Samples from 479
patients were collected including 151 pediatric (< or = 18 years), and 328 adults (>18 years). <S>Of these samples, 4 patients (1.3%, 2 adult; 2
pediatric) screened positive for the presence of a coronavirus. <S>All detected coronaviruses were identified as HCoV-HKU1. <S>No stools
screened positive for either HCoV-229E, HCoV-NL63 or HCoV-OC43. <S>All HCoV-HKU1 positive samples occurred between mid-
January to mid-February. <S>Clinical manifestations from HCoV-HKU1 positive patients included diarrhea, emesis and respiratory
complaints. <S>Three (75%) patients were admitted to the hospital with a median length of stay of 6 days. <S>
Answer: Coronaviruses as a group are not commonly identified in stool samples of patients presenting with gastrointestinal illness. HCoV-
HKU1 can be identified in stool samples from children and adults with gastrointestinal disease, with most individuals having respiratory
findings as well. No stool samples screened positive for HCoV-NL63, HCoV-229E, or HCoV-OC43.
Fig. 1. An example from PubMedQA [6]. The highlighted sentences illustrate the inference process when humans answer the given question. Italic represents
direct matching sentences from the question. Underlined and :::::::::::
wavy-underlined represent sentences inferred by 2nd-hop and 3rd-hop reasoning, respectively,
to justify the answer.
to be concentrated for conducting summarization so as to
generate the answer. Besides, one-time inference sometimes
is insufficient for collecting all the required information for
producing a complete answer. It leads to the necessity of
measuring the importance of each sentence, instead of regard-
ing the source text as an undifferentiated whole. Inspired by
recent advances in factoid QA studies [20], [21], one intuitive
approach to address the long-distance interrelationship issue is
to employ multi-hop reasoning, which enables to collect all the
important justifications or evidences that contribute to the final
answer. Recently, [22] develops a multi-hop inference module
for non-factoid QA, based on the semantic relevance degree
among the document sentences. Despite its effectiveness, the
multi-hop reasoning patterns are implicitly obtained from a
single relation, i.e., semantic relevance. There are two other
semantic relations that have been identified to be useful in
studying the interrelationship among the document sentences
in summarization: topical coherence [23], [24] and coreference
linking [25], [26]. On one hand, despite the content transition
in the multi-hop inference process, the latent topic concerning
the given question is supposed to be coherent. On the other
hand, resolving coreference across the whole document can
bridge the long-distance relationship between different sen-
tences that are discussing the same object.
Fortunately, graph structures have the natural advantages
of exploiting both structural and semantic information to rea-
son over multi-hop relational paths. Existing graph-enhanced
multi-hop reasoning techniques are basically proposed for fac-
toid QA [26]–[29], which aims to construct entity graphs for
linking the mentioned entities among sentences. Then, graph
neural networks [30], such as GCN [31], [32], GAT [33]–[35],
are employed to model the multi-hop information transition.
However, in non-factoid QA, the semantic relationships among
sentences are more complicated. Such multiple relations be-
tween textual units are expected to be fully utilized in a unified
graph for detecting salient information and performing explicit
reasoning.
In this work, we tackle the non-factoid QA problem
by proposing a novel query-focused summarization method,
namely Graph-enhanced Multi-hop Query-focused Summa-
rizer (GMQS). In specific, we investigate graph-based rea-
soning techniques to conduct the multi-hop inference for
collecting the key information from the document towards the
given question. Three types of graphs with different semantic
relations, namely Semantic Relevance,Topical Coherence,
and Coreference Linking, are constructed for explicitly cap-
turing the question-document and sentence-sentence interre-
lationships. Relational Graph Attention Network (RGAT) is
then developed to aggregate the multi-relational information
accordingly. In addition, the multi-hop relational information
can then be utilized under either extractive or abstractive
application to produce a summary as the answer to the given
non-factoid question. We empirically show that the proposed
method outperforms existing baselines on non-factoid QA with
a promising capability of multi-hop reasoning.
A preliminary study was published as a conference pa-
per [22]. We substantially enhance the method with three
main improvements: 1) We propose a new graph-enhanced
multi-hop reasoning model for non-factoid QA. 2) We develop
an adaptive relational graph attention network with a multi-
relational graph structure for modeling the complex sentence
relations. 3) We unify the extractive and abstractive query-
focused summarization into one Transformer-based architec-
ture. In addition, we conduct extensive experiments to validate
the proposed method from various aspects, such as automatic
and human evaluation for both extractive and abstractive sce-
narios, the contribution of different components, and detailed
analyses of the multi-hop reasoning process. Overall, the
proposed GMQS method substantially improves MSG [22]
with better performance, training efficiency, and explainability.
The main contributions are summarized as follows:
We propose a novel query-focused summarization method
to tackle the non-factoid question answering problem, which
leverages graph-enhanced reasoning techniques to elaborate
the multi-hop inference for summarizing the key information
to form the answer to the given non-factoid questions.
We identify three types of semantic relations, namely seman-
tic relevance, topical coherence, and coreference linking,
for explicitly modeling the question-document and sentence-
sentence relationships. Relational Graph Attention Network
is developed to aggregate the multi-relational information.
The proposed method unifies the extractive and abstractive
query-focused summarization into one architecture, which
can jointly improve the summarization performance of non-
IEEE TRANSITIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 3
factoid QA and be adaptively used for different applications.
Experimental results on two non-factoid QA datasets,
namely WikiHowQA and PubMedQA, show that the pro-
posed method substantially and consistently outperforms
several strong baselines.
II. RE LATE D WOR KS
A. Non-factoid Question Answering
Different from factoid QA that can be tackled by extracting
answer spans [1], [36], generating short sentences [37] or
returning a Boolean answer [38], non-factoid QA aims at
producing relatively informative and complete answers. Most
non-factoid QA studies focus on information retrieval (IR)
based methods, such as answer sentence selection [9] or an-
swer ranking [10], [39], by measuring the semantic relevance
degree between the question and candidate answers or answer
sentences [12]–[14], including Siamese architecture [11], [12]
and Compare-Aggregate framework [13]. In the Siamese ar-
chitecture [11], [12], the same encoder is used to learn the
vector representations for the input sentences (both questions
and answers), individually. In order to enhance the interaction
between the representational learning of the question and
answer, various attention mechanisms [40]–[42] are proposed
to attend the correlated and important information for a better
relevance measurement. Furthermore, the Compare-Aggregate
architecture [13], [14], [43] captures more interactions be-
tween two sentences, by aggregating comparison signals from
low-level elements into high-level representations.
Inspired by the successful applications of text generation
on other NLP tasks, some recent studies [3], [4], [44] adopt
generation-based methods to generate natural sentences as
the answer in non-factoid QA. In specific, several efforts
have been made on tackling long-answer generative ques-
tion answering over supporting documents, which targets on
questions that require detailed explanations [4]. This kind
of QA problem contains a large proportion of non-factoid
questions, such as “how” or “why” type questions [3], [45].
Besides, some studies aim at generating a conclusion for
the concerned question [5], [6]. [4] proposes a multi-task
Seq2Seq model with the concatenation of the question and
support documents to generate long-form answers. [46] and
[5] incorporate some background knowledge into Seq2Seq
model for generating natural answers to why questions and
conclusion-centric questions.
However, existing studies on non-factoid QA typically focus
on capturing the question-related content from the document.
In this paper, we tackle the non-factoid QA as a query-focused
summarization problem, which aims to further merge sparse
and diverse information from different sentences across the
whole document to form a concise but complete answer.
B. Query-focused Summarization
Early works on query-focused summarization mainly in-
vestigate the approach to extracting query-related sentences
to construct the summary [19], [47], [48], which are later
improved by exploiting sentence compression on the extracted
sentences [23], [49]. Recently, some data-driven neural ab-
stractive models are proposed to generate the natural form
of summaries with respect to the given query [17], [18], [50].
However, current studies on query-focused abstractive summa-
rization are restricted by the lack of large-scale datasets [18],
[51]. To overcome this challenge, researchers explore the
utilities of weak supervision [52] and domain adaptation [53]
techniques by leveraging external resources from some related
tasks, or unsupervised learning [16], [54].
In the light of both the capability and limitation of query-
focused summarization studies, some researchers spark a new
pave of query-focused summarization in non-factoid QA [2],
[7], [55], which requires the ability of reasoning or inference
in summarization, not merely relevance measurement, and
also preserves remarkable testbeds of large-scale datasets.
Similar to traditional summarization, according to the type of
summary, query-focused summarization studies in non-factoid
QA can also be categorized into extractive [2], [7], [15], [55]
and abstractive summarization [3], [22], [56]. In this paper, we
investigate the capability of multi-hop reasoning for adapting
query-focused summarization methods into non-factoid QA.
C. Multi-hop Reasoning in QA
One of the challenges for applying neural models on QA
systems is that it is required to preserve the capability of
reasoning for the aggregation of multiple evidence facts in
order to answer complex natural language questions [28], [57].
Many attempts have been made on learning to provide evi-
dence or justifications for a human-understandable explanation
of the multi-hop inference process in factoid QA [20], [21],
[58], [59], where the inferred evidences are only treated as
the middle steps for finding the answer. However, in non-
factoid QA, the intermediate output is also important to form
a complete answer, which requires a bridge between the multi-
hop inference and summarization [22].
Performing explicit multi-hop reasoning on graph structure
has been demonstrated to be an effective approach for multi-
hop factoid QA [26]–[29], [60] and some other text generation
tasks [61], [62]. The multi-hop reasoning modules in these
works mainly focus on linking entities among sentences. In
this work, we investigate the utility of graph-enhanced multi-
hop inference to capture three types of semantic relations in
non-factoid QA systems.
D. Text Summarization
The methods for text summarization are generally catego-
rized into extractive and abstractive approaches. The extractive
methods [63], [64] produce a summary by extracting salient
sentences from the source document, while the abstractive
methods [65]–[67] generate a summary from the vocabulary
based on the understanding of the document. In addition,
researchers attempt to take advantages of both extractive
and abstractive methods by using hybrid techniques, such as
joint learning [68], [69], extract-then-abstract [70], [71]. On
the other hand, many efforts have been made on exploiting
the utilities of graph structures to capture relations between
textual units for benefiting summarization [25], [72], [73].
IEEE TRANSITIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 4
Question
Shared Transformer Encoder
Answer 1
Answer 1
Sentence n
Document
Document
Multi-Head
Attention
Qs
VqKq
Question
QqKsVs
HqHsn
˜
Hq˜
Hsn
MeanPool
MeanPool
hqhsn
sem
h(l)
i𝒩i𝚜𝚎𝚖 𝒩i𝚝𝚘𝚙 𝒩i𝚌𝚘𝚛
Graph Attention
+
h(l+1)
i
Question
Attention
Document
Attention
Sentence
Attention
+
=𝚊𝚋𝚜 +𝚎𝚡𝚝
˜
Hd
Masked
Multi-Head Attention
Add & Norm
Answer (shifted)
Multi-Head Attention
Add & Norm
Multi-Head Attention
Add & Norm
Feed Forward
Decoder
Graph-enhanced Multi-hop
Inference Module
Sentence Extractor
Summary
Generator
Cat
Os
Relational Graph Attention Network
st
Projection
Layer
˜
Hq
˜
Hd
Osst
Final Distribution
pd
pq
pv
(a) Overview of GMQS
(b) Summary Generator
(c) Graph-enhanced Multi-hop Inference Module
Graph Attention
Graph Attention
Add & Norm
Feed Forward
Add & Norm
Multi-Head
Attention
Add & Norm
Feed Forward
Add & Norm
Intra- and Inter-sentence Encoder
Fig. 2. Overview of GMQS.
Pretrained language models, such as BERT [74], BART [75],
recently emerge for achieving impressive improvements in text
summarization. In this work, we make the first attempt of
jointly learning the extractive and abstractive query-focused
summarization.
III. PROB LE M DEFI NI TI ON
The input of both extractive and abstractive query-
focused summarization contains a sequence of words
{wq
1, wq
2, ..., wq
mq, ...}for the query qand a sequence of words
{wd
1, wd
2, ..., wd
md, ...}for the document d, where mqand md
are the word indexes. The sequence of words in a document
can also be represented as a sequence of sentences s=
{s1, s2, ..., sn, ...}, where nis the sentence index. The goal of
both extractive and abstractive query-focused summarization
is to produce a summary y, based on the query qand the
document d. Without the loss of generality, we refer the term
“query” as “question” and the term “summary” as “answer”
in the following description of non-factoid QA.
Non-factoid Question Answering as Extractive Query-
focused Summarization: The output of extractive query-
focused summarization is a sequence of predicted probability
{˜ys}for each sentence in the document d, where ˜ys
nrepresents
the probability of the n-th sentence been extracted into the
answer y. The goal is to learn a sentence-level sequence
labeling model fe(·)to determine which sentences should be
included to form the final answer:
fe(q, d) = {˜ys
1,˜ys
2, ..., ˜ys
n, ...}.(1)
Non-factoid Question Answering as Abstractive Query-
focused Summarization: The output of abstractive query-
focused summarization is a sequence of predicted probability
of vocabulary distribution Ptat each time-step t. The goal is
to learn an auto-regressive sequence-to-sequence model fa(·)
to generate new sentences to form the final answer:
fa(q, d, y<t) = Pt.(2)
IV. MET HO D
We introduce the proposed method, namely Graph-enhanced
Multi-hop Query-focused Summarizer (GMQS), for non-
factoid question answering. Figure 2 depicts the overall ar-
chitecture of GMQS, which contains four main components:
Intra- and Inter-sentence Encoder reads the sentences of
both the question and document by capturing semantic rela-
tionships from sentences themselves as well as interactions
between the question and document.
Graph-enhanced Multi-hop Inference Module elaborates
a multi-relational graph structure to perform multi-hop rea-
soning over the whole document by taking into account three
types of semantic relations.
Sentence Extractor scores each sentence in the document
according to the learned sentence representation.
Summary Generator produces the abstractive summary as
the answer to the given question.
A. Intra- and Inter-sentence Encoder
Unlike the encoder of traditional summarization models,
which only needs to establish explicit representations for
a single sentence, query-focused summarization is further
required to capture the interaction between the question and
the document. To achieve this, the encoder is designed to be
capable of modeling intra- and inter-sentence interactions.
We adopt multi-head self-attention module from Trans-
former [76] as the basic unit for encoding the raw text into se-
mantic sentence representations. The multi-head attention unit
IEEE TRANSITIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 5
is denoted as MHAtt(Q, K, V ), where Q, K, V are query, key,
and value, respectively. Each multi-head attention unit consists
of three components: (i) The Scale Dot-Product Attention to
apply attention weights upon the value vector with size of dh;
(ii) The feed-forward network with ReLU activation, which
is defined as FFN(·); (iii) The layer normalization, which is
defined as LayerNorm(·). Generally, a multi-head attention
unit can be represented as:
Vatt =softmax QKT/pdhV(3)
MHAtt(Q, K, V ) = LayerNorm(FFN(Vatt ) + V).(4)
Given the question qand the document dthat consists of
a sequence of sentences s={s1, s2, ..., sn}, we first use the
self-attention to compute the representations of the question
and each document sentence separately:
Hq=MHAtt(E(q), E(q), E (q)),(5)
Hsn=MHAtt(E(sn), E(sn), E (sn)),(6)
where E(·)is the embeddings of the input text, which is the
concatenation of word and position embeddings. Such intra-
sentence interaction attends the important information within
the question and each individual document sentence.
After obtaining the encoded representations for all the
input sequences, we perform the cross-attention to capture the
semantically relevant information between the question and
each document sentence:
˜
Hq=1
NXN
n=1 MHAtt(Hsn, Hq, Hq),(7)
˜
Hsn=MHAtt(Hq, Hsn, Hsn),(8)
where ˜
Hqand ˜
Hsnare the attentive representations for the
word sequences of the question and each document sentence,
respectively. Then, meaning pooling operation is applied to
obtain the final encoded sentence representations:
hq=MeanPool(˜
Hq), hsn=MeanPool(˜
Hsn).(9)
B. Graph-enhanced Multi-hop Inference Module
Graph-enhanced Multi-hop Inference Module measures the
degree of importance of each sentence in the document for
producing the answer, through a multi-hop reasoning pro-
cedure, which is based on the graph structure and three
types of semantic and linguistic relations, namely Semantic
Relevance,Topical Coherence, and Co-reference Linking.
1) Multiple Semantic Relations: We first introduce the three
types of semantic and linguistic relations as the backbone of
the Graph-enhanced Multi-hop Inference Module:
(1) Semantic Relevance. There are two kinds of semantic
relevance to be considered for the multi-hop inference in non-
factoid QA. The first one is the relevance degree between
the question and each sentence in the document, which is
also the essential measurement in answer sentence selection
studies [12], [13]. The other one is the information-consistency
between the concerned sentence and those highly weighted
sentences from the previous hops [22]. Therefore, motivated
by Maximal Absolute Relevance (MAR) measurement in [22],
we elaborate the relation of semantic relevance between: (i)
the question and each sentence in the document, and (ii) the
sentence and the most similar sentence in the document.
(2) Topical Coherence. Despite the content transition in the
multi-hop inference process, the concerned latent topic is
supposed to be coherent for collecting the information to
answer the given question [23], [24]. To capture the relation
of topical coherence, we leverage LDA topic model [77] to
identify the latent topic of each sentence in the document.
The sentences estimated with the same latent topic are taken
into consideration for modeling the topical coherence.
(3) Coreference Linking. Resolving long-term coreference
is of great importance in multi-hop question answering [26],
since the question is often concerning about some certain
objects. Instead of implicitly modeling the long-term coref-
erence, we employ a state-of-the-art coreference resolution
tool, NeuralCoref, to link the coreference objects among the
question and all sentences in the document.
2) Multi-relational Graph Construction: To facilitate the
reasoning process, it requires to model and aggregate the
complex relations with multiple hops of refinement. To this
end, we construct a multi-relational graph to represent the rela-
tional information obtained from different relational inference
units. The multi-relational graph is denoted as G= (N,E,R),
with nodes ni N , labeled edges (i.e., relations) between
node niand njas (ni, r, nj) E, where r R is the
relation type between two nodes. We treat the question q,
each document sentence snas a node in G, with the total
number of nodes as 1 + |s|. We initialize each node with their
corresponding encoded sentence representations hobtained
from the encoder described in Section IV-A.
To represent the multi-relational information obtained from
all the relational inference units, we employ different adja-
cency matrices for the graph G. Specifically, the relation types
between two nodes is denoted as r R ={sem,top,cor},
representing the relations of Semantic Relevance,Topical
Coherence, and Coreference Linking, respectively. Three
adjacency matrices can thus be constructed for G:
Asem
i,j =
1,if ni=q, njs,
1,if nis, nj= arg max
njs\ni
Sim(ni, nj),
0,otherwise,
(10)
Atop
i,j =(1,if ni, nj N ,LDA(ni) = LDA(nj),
0,otherwise,(11)
Acor
i,j =(1,if ni, nj N ,CorefN(ni, nj)=,
0,otherwise,(12)
where Sim(·)denotes the semantic similarity function, which
is based on tf-idf cosine similarity between sentences to
capture lexical similarity. LDA(·)denotes the predicted latent
topic by the LDA topic model. CorefN(·)represents the
shared coreference clusters between two sentences, which is
resolved from all the sentences in N.
3) Multi-hop Information Aggregation: In order to capture
the information from multiple semantic relations with a multi-
hop inference process, we investigate the utilities of two kinds
of graph neural networks, namely Relational Graph Convo-
IEEE TRANSITIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 6
lutional Network (R-GCN) and Relational Graph Attention
Network (R-GAT).
Relational Graph Convolutional Network. R-GCN [78]
has the capability of aggregating multiple relations between
entities in a knowledge graph for the link prediction task,
which can also be extended to model the multiple semantic
relations for the multi-hop information aggregation in non-
factoid QA. For a node niin G, the multi-relational informa-
tion is aggregated from its neighboring nodes:
h(l+1)
i=σXr∈R Xj∈N r
ib
Ar
i,j W(l)
rh(l)
j,(13)
where h(l)
iis the hidden state of the node niat the l-th
layer of the network, Nr
idenotes the neighboring indices
of the node niunder the relation r(including node ni
itself), W(l)
rR|N dhare trainable parameters representing
the transformation from neighboring nodes and from the
node niitself. σ(·)denotes the activation function, such as
ReLU(x) = max(0, x).b
Ar
i,j is a normalization constant, such
as b
Ar
i,j = 1/|N r
i|in [78]. To avoid the scale changing of the
feature representation, we apply a symmetric normalization
transformation:
b
Ar=D1/2
rArD1/2
r, r {sem,top,cor},(14)
where Aris the adjacency matrix described in Section IV-B2
under the relation r R,Dris the corresponding degree
matrix of Aras Drii =PjAr
i,j .
Relational Graph Attention Network. Despite the success
of considering multi-relational information in the graph, R-
GCN also inherits some limitations from the original GCN.
As opposed to GCN, Graph Attention Network (GAT) [33] is
proposed to assign different importance to neighbors of the
node, instead of using the fixed or pre-defined edge weights.
Motivated by the advantages of GAT and R-GCN, we further
extend R-GCN to be Relational Graph Attention Network (R-
GAT) for enhancing the multi-hop inference process.
Following the graph attention mechanism proposed in [33],
the attention weight αi,j indicates the importance of node j’s
features to node i. For each relation r R, we compute the
relation-specific attention weights αr
i,j as:
αr
i,j =
exp LeakyReLU( b
Ar
i,j ω
r[Wrhi||Wrhj])
P
k∈N r
i
exp LeakyReLU( b
Ar
i,kω
r[Wrhi||Wrhk]),
(15)
where ωrR2d
hand WrRd
h×dhare parameters to be
learnt for relation r.|| denotes the concatenation operation.
The LeakyReLU activation function is applied for nonlinearity.
The graph attention mechanism can be extended to employ
multi-head attention, similar to [76]. Specifically, Kinde-
pendent attention weights can be calculated based on Equa-
tion (15), resulting in the following output node representation
for the next layer:
h(l+1)
i=σ
X
r∈R
1
K
K
X
k=1 X
j∈N r
i
αr,k,(l)
i,j b
Ar
i,j W(l)
r,kh(l)
j
,(16)
where αr,k,(l)
i,j are normalized attention coefficients computed
by the k-th head of attention for relation r, and Wr,k
Rd
h×dhis the corresponding linear transformation matrix to be
learnt. In particular, we denote the output node representations
in the last layer of the graph neural network as oqand osnfor
the question and each document sentence, respectively:
oq=h(LG)
q, osn=h(LG)
sn,(17)
where LGis the number of graph layers. And the number of
graph layers can be regarded as the number of reasoning hops,
since each graph layer only consider the relation between two
adjacent sentences in the graph, while multiple graph layers
can collectively measure the interrelations among multi-hop
connected sentences in the graph.
C. Sentence Extractor
After obtaining the sentence vectors from Graph-enhanced
Multi-hop Inference Module, we build a summarization-
specific classifier to extract summaries based on the multi-hop
inference results. The classifier contains a linear transforma-
tion and the sigmoid function:
˜ys=σ(W
eos+be),(18)
where σ(·)denotes the sigmoid function, WeRd
h×2and
beR2are parameters to be learnt. The extractive query-
based summarization is based on the ranked ˜ysto extract
sentences.
D. Summary Generator
We obtain the token-level representations ˜
Hqand ˜
Hsn
from the encoding phase, and the sentence-level document
representation oqand osnvia the graph-enhanced multi-
hop inference module for the question and each document
sentence, respectively.
Similar to the encoder, we adopt Transformer decoder layer
for decoding. The difference is that the decoder takes into
account two sources of information, including the question
and the document. For each decoder layer:
Xa=MHAtt(E(a), E(a), E (a)),(19)
Xc=MHAtt([ ˜
Hq|| ˜
Hd], Xa, Xa),(20)
Sdec =FFN(Xc),(21)
where E(a)denotes the masked answer embedding, and Sdec
is the hidden states produced by the Transformer decoder
layer. We concatenate all the token-level document sentence
representations to be the token-level document representations
as ˜
Hd=||n˜
Hsn.
Let stdenote the hidden state of the decoder at the t-th step.
The attention for each word in the question and the document,
αq
tand αd
t, are generated by:
eqj
t=ωq
t
Ttanh(Wq˜
Hqj+Wqsst+bq),(22)
αq
t=softmax(eq
t),(23)
edi
t=ωd
t
Ttanh(Wd˜
Hdi+Wdsst+bd),(24)
αd
t=softmax(ed
t),(25)
IEEE TRANSITIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 7
where WqRd
h×dh,Wqs Rd
h×dh,WdRd
h×dh,Wds
Rd
h×dh,ωq
tRd
h,ωd
tRd
h,bqRd
h,bdRd
hare
parameters to be learned.
Then, we incorporate the multi-hop inference results Os=
{os1, ..., osn}to compute the dynamic multi-hop reasoning
gate βtfor each sentence in the document:
βt=σ(ωs
t
Ttanh(WsOs+Wssst+bs)),(26)
where WsRd
h×dh,Wss Rd
h×dh,ωs
tRd
h,bsRd
h
are parameters to be learned. We re-weight the word-level
document attention scores αdwith a soft multi-hop reasoning
gate βto attend important justification sentences along with
the decoding process:
ˆαdi
t=αdi
tβt,disk
Piαdi
tβt,disk
.(27)
Thus, the re-weighted word-level document attention ˆαdnat-
urally blends with the results from the multi-hop inference
module to enhance the influence of those important justifica-
tion sentences.
Finally, we extend the basic pointer-generator network [65]
to be a multi-pointer architecture to generate answers with
the dynamic multi-hop reasoning flow as well as handle the
out-of-vocabulary (OOV) issue. Such approach enables GMQS
to copy words from the question as well as be aware of
the differential importance degree of different sentences in
the document. The attention weights αq
tand ˆαd
tare used
to compute context vectors cq
tand cd
tas the probability
distribution over the source words:
cq
t=˜
HT
qαq
t, cd
t=˜
HT
dˆαd
t.(28)
The context vector aggregates the information from the
source text for the current step. We concatenate the context
vector with the decoder state stand pass through a linear
layer to generate the answer representation hs
t:
hs
t=W1[st||cq
t||cd
t] + b1,(29)
where W1Rd
h×3dhand b1Rd
hare parameters to be
learned.
Then, the probability distribution Pvover the fixed vo-
cabulary is obtained by passing the answer representation hs
t
through a softmax layer:
Pv(yt) = softmax(W2hs
t+b2),(30)
where W2R|Vdhand b2R|V|are parameters to be
learned, and |V|denotes the vocabulary size.
The final probability distribution of ytis obtained from three
views of word distributions:
Pq(yt) = Xi:wi=wαqi
t, P d(yt) = Xi:wi=wˆαdi
t,(31)
Pall(yt)=[Pv(yt), P q(yt), P d(yt)],(32)
ρ=softmax(Wρ[st||cq
t||cd
t] + bρ),(33)
Pt(yt) = ρ·Pall(yt),(34)
where WρR3×dhand bρR3are parameters to be learned,
ρis the multi-pointer scalar to determine the weight of each
view of the probability distribution.
TABLE I
STATIST IC S OF DATASE T.
Dataset WikiHow PubMedQA
(train/dev/test)
#Samples 168K / 6K / 6K 169K / 21K / 21K
Avg QLen 7.00 / 7.02 / 7.01 16.3 / 16.4 / 16.3
Avg DLen 582 / 580 / 584 238 / 238 / 239
Avg ALen 62.2 / 62.2 / 62.2 41.0 / 41.0 / 40.9
Avg #Sents/Doc 20.7 / 20.7 / 20.6 9.32 / 9.31 / 9.33
E. Training Procedure
After obtaining ˜ysfrom the sentence extractor, we use the
cross entropy as the objective function for extractive query-
focused summarization:
Lext =1
NXN
n=1 (ys
nlog ˜ys
n+ (1 ys
n) log (1 ˜ys
n)) ,
(35)
where ys
nis the ground-truth label of the n-th sentence been
extracted into the answer y.
With Pt(yt)from the summary generator, we train the
abstractive query-focused summarization to minimize the neg-
ative log-likelihood:
Labs =1
TXT
t=1 log Pt(yt),(36)
where yis the ground-truth answer.
In order to mutually enhance both extractive and abstractive
summarization, the proposed model can be jointly trained by:
L=Labs +λLext,(37)
where λ0is a hyper-parameter for balancing the ratio
between two losses.
V. EXPERIMENTAL SET UP
A. Dataset & Evaluation Metrics
We evaluate the proposed method on two non-factoid QA
datasets with abstractive answers, namely WikiHow [79] and
PubMedQA [6]. WikiHow is an abstractive summarization
dataset collected from a community-based QA website, Wiki-
How1, in which each sample consists of a non-factoid question,
a long article, and the abstractive summary as the answer to
the given question. An actual sample is presented in Fig. 6.
PubMedQA is a conclusion-based biomedical QA dataset
collected from PubMed2abstracts, in which each instance
is composed of a question, a context, and an abstractive
answer which is the summarized conclusion of the context
corresponding to the question. An actual sample is presented in
Fig. 1. The statistics of the WikiHow and PubMedQA datasets
are shown in Table I. We adopt ROUGE F1 (R1, R2, RL) for
automatically evaluating the summarized answers.
1https://www.wikihow.com
2https://www.ncbi.nlm.nih.gov/pubmed/
IEEE TRANSITIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 8
B. Compared Methods
There are four results of our method, GMQS, as follows:
GMQS-ext and GMQS-abs only use the single-task learn-
ing loss, i.e., Eq. (35) or Eq. (36), to train an extractive or
abstractive summarizer, respectively.
GMQS-ext-joint and GMQS-abs-joint use the joint learn-
ing loss, i.e., Eq. (37), to train the overall framework, but
adopt the output from the sentence extractor or the summary
generator, respectively.
To make comprehensive comparisons, we compare our
method to three different groups of state-of-the-art methods,
including non-factoid question answering, traditional summa-
rization, and query-focused summarization methods.
As for non-factoid QA methods, we adopt both retrieval-
and generation-based methods for comparisons, where the
retrieval-based methods perform a sentence-level classification
task to determine whether the sentence should be selected:
Compare-Aggregate Model (CA) [13] aggregates the com-
parison results in small units of two sentences;
COALA [14] selects answers via the comparison of all
question-answer aspects;
BERT [80] adopts the pairwise fine-tuning to perform
answer sentence selection;
HGN [27] adopts a hierarchical graph to model different
levels of granularity for multi-task learning in factoid QA.
In our case, we apply it as an answer selection model;
MHPGM [26] uses multiple hops of bidirectional attention
and a pointer-generator decoder to read and reason within
a long passage for generating the answer;
S2S-MT [4] uses a multi-task Seq2Seq model with the
concatenation of question and support document;
QPGN [3] is a question-driven pointer-generator network
with co-attention between the question and document.
As for traditional summarization methods, we also adopt
both extractive and abstractive methods as well as hybrid meth-
ods for comparisons, where the question and the document are
concatenated as the input for these methods:
NeuralSum [63] performs extractive summarization as a
sequence labeling task;
NeuSum [64] jointly learns to score and select sentences
for extractive summarization;
PGN [65] copies words from the article via pointing, and
produces novel words by the generator;
CopyTransformer [66] incorporates the copy mechanism
into the Transformer [76] for abstractive summarization;
UnifiedSum [68] is a unified model combining sentence-
level and word-level attentions to take advantage of both
extractive and abstractive summarization approaches;
MGSum [69] uses a multi-granularity interaction network
to encode input documents and unifies extractive and ab-
stractive summarization into one architecture.
BERTSum [74] is a BERT-based general framework en-
compassing both extractive and abstractive summarization,
namely BERTSumExt and BERTSumAbs.
Similarly, we compare the proposed method to both extrac-
tive and abstractive query-focused summarization methods:
MMR [47] applies classical Maximal Marginal Relevance
algorithm for query-based summarization;
AttSum [19] applies the attention mechanism to simulate
the human-like reading when a query is given;
HSCM [55] integrates the hierarchical interaction informa-
tion between the question and document into a sequential
extractive summarization model;
QS [17] utilizes the query information into the pointer-
generation network;
SD2[18] combines a query-based attention model and a
diversity-based attention model;
MSG [22] incorporates multi-hop reasoning into question-
driven summarization.
C. Implementation Details
Following the general settings [76], we apply a six-layer
encoder and a two-layer decoder for all Transformer based
models. The input embedding size and the hidden size are
set to be 512. The word embeddings are randomly initialized.
The size of the Transformer FFN inner representation size is
set to be 2048, and ReLU is used as the activation function.
The learning rate and the dropout rate are set to be 0.0001
and 0.1, respectively. During training, the batch size is set
to be 32, while at the inference phase, we use beam search
with a beam size of 10. For each model, we all train for 20
epochs. We adopt the NLTK package [81] for sentence and
word tokenization. The maximum length of each sentence
and the maximum number of sentences in each document
are set to be 32 and 16, respectively. As for the extractive
summarization setting, we follow previous studies [63], [64] to
select top-3 scored sentences to construct the summary. As for
the abstractive summarization setting, we also follow previous
studies [22] to restrict the length of the generated summary
within the range of 30 and 100. λis set to 0.5, which is tuned
on the validation set. For the graph construction, GenSim3
is adopted to implement the Tf-idf and LDA models, while
NeuralCoref 4is adopted as the coreference resolution tool.
VI. RE SU LTS & AN ALYS IS
A. Overall Performance on Extractive Methods
Table II presents the experimental results of extractive
methods on WikiHow and PubMedQA datasets. Among the
baseline methods, extractive summarization methods perform
better than answer sentence selection methods on WikiHow.
Even the heuristic unsupervised method, LEAD3, achieves a
better performance than these sophisticated answer sentence
selection methods on WikiHow. However, all kinds of base-
lines have a similar performance on PubMedQA. As known
from the dataset statistics in Table I, the average question
length in WikiHow is relatively short, where the inadequate
information in the question restricts the interactive context
modeling between question and answer sentences. Overall,
the proposed method, GMQS, substantially and consistently
3https://radimrehurek.com/gensim/
4https://github.com/huggingface/neuralcoref
IEEE TRANSITIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 9
TABLE II
EXP ERI ME NTAL RE SU LTS ON E XTR ACT IV E MET HO DS.
Model WikiHow PubMedQA
R1 R2 RL R1 R2 RL
LEAD3 26.0 7.2 24.3 30.9 9.8 21.2
CA [13] 24.5 6.0 22.6 31.2 9.6 24.5
COALA [14] 26.1 6.2 23.7 31.6 9.8 25.6
BERT [80] 27.1 6.6 24.1 32.0 10.2 25.9
HGN [27] 26.3 6.3 23.9 31.5 9.8 25.5
NeuralSum [63] 26.7 6.4 24.0 30.9 9.7 22.4
NeuSum [64] 26.5 6.2 23.8 31.0 9.7 22.5
MGSum-ext [69] 27.4 7.1 24.4 32.0 10.5 26.1
BERTSumExt [74] 27.7 7.4 25.0 32.2 10.4 26.3
MMR [47] 26.8 6.1 23.6 30.1 9.0 24.4
AttSum [19] 26.4 6.3 24.0 31.2 9.8 25.3
HSCM [55] 27.2 7.0 24.7 32.3 10.1 26.0
GMQS-ext 28.6 7.9 26.1 33.2 11.8 27.6
GMQS-ext-joint 29.0 8.1 26.4 33.5 11.9 27.7
outperforms all the extractive methods, including answer sen-
tence selection, traditional and query-focused summarization
methods, by a noticeable margin on the two datasets. Even
training from scratch, GMQS can achieve competitive perfor-
mance with BERT-based methods, including BERT for answer
sentence selection and extractive summarization. This result
demonstrates the superiority of the proposed graph-enhanced
multi-hop inference method on identifying the important sen-
tences with salient as well as question-related information
for extractive non-factoid QA. In addition, the joint learning
with abstractive summarization further improves the extraction
performance of GMQS.
B. Overall Performance on Abstractive Methods
Experimental results of abstractive methods are summarized
in Table III. There are several notable observations as follows:
(1) Compared with extractive methods, all kinds of ab-
stractive methods perform with more promising results, which
indicates that answers for non-factoid questions include sparse
and diverse information from different sentences across the
whole supporting document or evidences. It is not enough to
simply extract or select original sentences from the document.
(2) MSG and the proposed GMQS, which both consider
the interrelationships among different document sentences by
multi-hop reasoning, outperform other baseline methods with
a substantial margin. This result shows that the multi-hop
inference attaches great importance in non-factoid QA. GMQS
further improves the performance over MSG by capturing
more comprehensive semantic relationships during the multi-
hop inference process.
(3) As for the performance boosting by the joint learning,
the extractive learning makes more contribution to the abstrac-
tive learning than the reverse, since the learned importance
degree of each sentence casts a direct impact on the generated
sentences, according to Equation (27).
In both extractive and abstrative scenario, the proposed
GMQS method substantially and consistently outperforms
those strong baselines, which demonstrates not only the effec-
TABLE III
EXP ERI ME NTAL RE SU LTS ON A BST RAC TI VE ME TH ODS .
Model WikiHow PubMedQA
R1 R2 RL R1 R2 RL
LEAD3 26.0 7.2 24.3 30.9 9.8 21.2
MHPGM [26] 28.0 9.4 27.1 34.0 12.5 28.4
S2S-MT [4] 28.6 9.6 27.5 33.2 12.2 27.8
QPGN [3] 28.8 9.7 27.7 34.2 12.8 28.7
PGN [65] 28.5 9.2 26.5 32.9 11.5 28.1
CopyTransformer [66] 30.2 10.0 28.8 35.0 11.3 27.8
Unified [68] 30.0 9.9 28.7 35.7 12.1 29.0
MGSum-abs [69] 30.4 10.4 29.4 37.0 13.9 30.0
BERTSumAbs [74] 30.4 10.2 29.1 37.5 15.0 30.3
QS [17] 28.8 9.9 27.6 32.6 11.1 26.7
SD2[18] 27.7 7.9 25.8 32.3 10.5 26.0
MSG (3-Hop) [22] 30.5 10.5 29.3 37.2 14.8 30.2
GMQS-abs 31.5 11.2 30.7 38.1 15.3 31.0
GMQS-abs-joint 32.2 11.6 31.2 38.8 15.7 31.6
TABLE IV
HUM AN EVAL UATIO N RES ULTS . TH E FLEI SS KAP PA OF TH E
AN NOTATIO NS I S 0.42, WHICH INDICATES MODERATE AGREEMENT”.
Model Info. Conc. Read. Corr.
COALA 3.05 2.15 3.85 3.01
MGSum-ext 3.19 2.21 4.01 3.14
HSCM 3.33 2.09 3.87 3.32
GMQS-ext-joint 3.41 2.14 3.95 3.56
QPGN 3.53 3.45 3.61 3.30
MGSum-abs 3.98 4.10 4.12 3.48
MSG 4.07 3.75 3.80 3.72
GMQS-abs-joint 4.21 4.02 4.14 3.89
tiveness of the graph-enhanced multi-hop inference on non-
factoid QA, but also its promising applicability.
C. Human Evaluation
We conduct human evaluation to evaluate the generated
answer from four aspects: (1) Informativity: how rich is
the generated answer in information? (2) Conciseness: how
concise the generated answer is? (3) Readability: how fluent
and coherent the generated answer is? (4) Correctness: how
well does the generated answer respond to the given question?
We randomly sample 50 questions from two datasets and
compare their answers produced by three extractive (COALA,
MGSum-ext, HSCM) and three abstractive summarization
methods (QPGN, MGSum-abs, and MSG). Three annota-
tors are asked to score each generated answer with 1 to 5
(higher the better). Results are presented in Table IV. These
annotators are all well-educated research assistants with a
background of NLP and are all native speakers. The ground-
truth answers are provided for evaluating the Correctness of
the genereted answers. As for both extractive and abstractive
methods, GMQS substantially outperforms existing methods
on producing informative and correct answers, and preserving
high-level conciseness and readability as well.
D. Ablation Study
1) Comparisons on Multi-hop Inference Module: In order
to validate the superiority of the proposed graph-enhanced
IEEE TRANSITIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 10
(a) WikiHow (b) PubMedQA (c) Impact of # Pronouns
Fig. 3. Impact of different semantic relations.
TABLE V
COMPARISONS ON MULTI-HOP INFERENCE MODULES.
Model WikiHow PubMedQA
R1 R2 RL R1 R2 RL
GMQS-ext-joint 29.0 8.1 26.4 33.5 11.9 27.7
- w/ RGCN 28.4 7.7 25.9 33.3 11.7 27.5
- w/ MHPGM 28.0 7.5 25.0 32.3 11.0 26.6
- w/ MSG 28.1 7.4 25.0 32.4 11.3 26.9
- w/ HGN 27.9 7.4 24.9 32.2 11.0 26.5
- w/o Multi-hop 27.7 7.3 24.7 32.0 10.8 26.3
GMQS-abs-joint 32.2 11.6 31.2 38.8 15.7 31.6
- w/ RGCN 31.6 11.1 30.7 38.6 15.4 31.4
- w/ MHPGM 31.3 10.8 30.3 37.5 14.4 30.4
- w/ MSG 31.5 10.8 30.3 38.2 14.8 30.8
- w/ HGN 31.0 10.6 29.9 37.6 14.4 30.5
- w/o Multi-hop 30.9 10.6 29.8 37.2 14.1 30.3
multi-hop inference module, we conduct comparisons with
other alternative multi-hop inference components as follows:
We first substitute RGAT with RGCN [78] for the aggrega-
tion of multi-relational information, i.e., w/ RGCN.
Another way is to use the self-attention layer [76] to con-
struct a fully-connected sentence graph for node represen-
tation learning, which is similar to the multi-hop reasoning
module in MHPGM [26], i.e., w/ MHPGM.
We alsp adopt the multi-hop inference module proposed in
MSG [22], which elaborates the semantic relevance between
the question and each document sentence as well as among
all the document sentences, i.e., w/ MSG.
The last one is to adapt the Hierarchical Graph Network
(HGN) from [27] into non-factoid QA, which aggregates
different granularity of information for multi-hop inference,
i.e., w/ HGN.
We also consider the situation when the multi-hop inference
module is discarded, i.e., w/o Multi-hop.
The comparison results are presented in Table V. For all
kinds of multi-hop inference modules, they contribute to better
performance on both extractive and abstractive results more
or less, showing the necessity of the multi-hop reasoning
on non-factoid QA. The constructed multi-relational graph
further enables the multi-hop inference module to capture
diverse and complex interrelationships among sentences, lead-
ing to a higher performance of using RGCN and RGAT for
graph representational learning. Overall, the proposed RGAT
achieves the best performance among these alternative multi-
hop inference modules.
2) Impact of Different Semantic Relations: To elaborate
the multi-hop inference upon different reasoning paths, we
model the multiple semantic relations between the question
and the document sentences as well as among the document
sentence. Thus, we examine the effect of each semantic
relation during the multi-hop inference procedure in terms of
discarding each one of these relational graphs. We present
the ablation studies on both the extractive and abstractive
results in Figure 3, where “w/o semantic”, “w/o topical” and
“w/o coreference” denote the GMQS-joint models without
the semantic relevance, topical coherence, and coreference
linking relation when constructing the multi-relational graph,
respectively. Besides, “all relation” refers to the performance
of the model with all three relations. We can see that all of
the semantic relations contribute to the final performance and
discarding any of them leads to a decrease of performance.
This result illustrates the importance of explicitly modeling
the complex relations among the question and the document
sentences for non-factoid QA. The topical coherence and
coreference linking relations attach more importance to the
final performance, while the semantic relevance relation affects
the performance the least as the intra-/inter-sentence encoder
may capture such information to a certain extent. In addition,
we observe that the coreference linking relation is more effec-
tive in the WikiHow dataset. Since there are more pronouns
in the WikiHow dataset, the multi-hop reasoning relies more
on coreference resolution to link the relation among different
sentences in the document. However, as for the PubMedQA
dataset with professional medical documents, the mentioned
entities are clearly stated without using pronouns in the source
document, so that the coreference relation might be less
effective. To better verify this observation, we statistically
present the performance in terms of the number of pronouns
in the source document in Fig. VI-C. It can be observed that
the it is harder to achieve a high performance in cases with a
larger number of pronouns, while the coreference relation is
more effectively in these cases.
E. Analysis of Multi-hop Reasoning
1) Impact of the Number of Hops: In the proposed graph-
enhanced multi-hop inference module, the number of RGAT
layers corresponds to the number of reasoning hops. To
investigate the impact of the number of hops on the model
performance, the experimental results on varying the num-
ber of RGAT layers are shown in Figure 4. We can see
that, as expected, the performance of the model begins with
growth when increasing the number of hops for reasoning.
In particular, even using one hop of inference can make a
noticeable contribution to both the extractive and abstractive
IEEE TRANSITIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 11
(a) WikiHow (b) PubMedQA
Fig. 4. Impact of different number of hops.
(a) WikiHow (b) PubMedQA
Fig. 5. Training efficiency analysis.
This indicates the importance of considering the complex interrelationships among the document sentences. However, the performance barely changes on WikiHow and even slightly decreases on PubMedQA when we further increase the number of RGAT layers. A possible reason is that the number of parameters also grows as we adopt more reasoning hops, leading to over-fitting. This is a common phenomenon in GNN applications and has also been observed in other NLP tasks that require multi-hop reasoning, such as knowledge graph completion [34] and multi-choice QA [?].
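In implementation terms, varying the number of hops amounts to stacking that many relational graph attention layers. A minimal sketch, reusing the hypothetical RGATLayer from the earlier snippet:

```python
# Sketch: the number of reasoning hops equals the number of stacked RGAT
# layers; RGATLayer is the hypothetical layer sketched earlier.
import torch.nn as nn

class MultiHopInference(nn.Module):
    def __init__(self, dim: int, num_hops: int = 3, num_relations: int = 3):
        super().__init__()
        self.layers = nn.ModuleList(
            RGATLayer(dim, num_relations) for _ in range(num_hops)
        )

    def forward(self, h, adj):
        for layer in self.layers:   # each layer performs one reasoning hop
            h = layer(h, adj)
        return h
```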
2) Training Efficiency Analysis: To better understand the training process of the graph-enhanced multi-hop inference module, we illustrate the test performance curves of GMQS with different multi-hop inference modules as well as without the multi-hop inference module. Figure 5 shows the learning curves of the ROUGE-1 F1 score during training on the WikiHow and PubMedQA datasets, respectively.
On the PubMedQA dataset, the proposed GMQS and the RGCN variant quickly converge to the optimal value after about 12 epochs, with the RGCN variant even slightly faster than GMQS. In contrast, the other multi-hop variants, e.g., MSG, take almost 20 epochs to converge, and the non-multi-hop model is the slowest. This is because the multi-relational graph structure serves as prior knowledge that assists the multi-hop reasoning, which accelerates the learning process. Besides, the RGAT has more parameters to train than the RGCN, which may cause a slight slowdown but also yields better performance. This result also shows that the multi-hop inference module enables the model to capture the important and salient information in the document more quickly. A similar conclusion holds for the WikiHow dataset.
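Learning curves of this kind can be reproduced by scoring the model's test outputs after every epoch. The sketch below uses the rouge-score package for ROUGE-1 F1; train_one_epoch and generate_answers are hypothetical hooks standing in for the actual training and decoding code.

```python
# Sketch: tracking the ROUGE-1 F1 learning curve across training epochs.
# train_one_epoch and generate_answers are hypothetical hooks; only the
# scoring part relies on a real library (the rouge-score package).
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)

def rouge1_f1(pairs):
    """pairs: iterable of (predicted answer, reference answer) strings."""
    scores = [scorer.score(ref, pred)["rouge1"].fmeasure for pred, ref in pairs]
    return sum(scores) / len(scores)

def track_learning_curve(model, test_set, num_epochs, train_one_epoch, generate_answers):
    curve = []
    for _ in range(num_epochs):
        train_one_epoch(model)  # hypothetical: one pass over the training data
        curve.append(rouge1_f1(generate_answers(model, test_set)))
    return curve
```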
3) Case Study: We present a case study in Figure 6 with answers generated by the proposed method and several baseline methods, including MSG, MGSum, and QPGN, to intuitively compare these methods. In the question and document, italic, underlined, and wavy-underlined sentences mark those highly weighted in the 1st-hop, 2nd-hop, and 3rd-hop inference by GMQS, respectively, while highlighted sentences mark those that are supposed to be involved in the final answer. In the reference answer and the answers produced by the different methods, italic, underlined, and wavy-underlined sentences mark those related to the 1st-hop, 2nd-hop, and 3rd-hop sentences from the document, respectively, while highlighted sentences are those that precisely answer the given question, i.e., similar to the reference answer. The remaining regular sentences are incorrect or irrelevant to the given question.
We observe that this case probably requires more than three hops of reasoning to infer the answer, since multiple steps are needed to answer the given question. We can still evaluate how the proposed GMQS handles such a case from the perspective of 3-hop inference. Compared to the reference answer, GMQS captures most of the useful information and generates a good summary for answering the question, using either the extractive or the abstractive method. Due to the length limitation in the experimental setup, the extractive result (GMQS-ext-joint) only extracts a certain number of sentences carrying the most important information from different hops of inference. The abstractive result (GMQS-abs-joint) successfully incorporates the key information to form the final answer. In contrast, MGSum and QPGN introduce unnecessary or incorrect information into the summarized answers.
Compared with MSG (3-Hop), which is also capable of multi-hop inference, the answer generated by GMQS covers more of the required information from the source document. This result indicates that modeling semantic relevance alone is inadequate for producing a comprehensive answer to the given non-factoid question. The proposed graph-enhanced multi-hop inference method makes it possible to explicitly explain the inferred reasoning paths that produce the final answer. For this case, we visualize the multi-relational graph over the highlighted sentences during the multi-hop inference process in Figure 7. It can be observed that Sentence 13 is not computed to be semantically relevant to the other highlighted sentences. However, it is computed to be topically coherent with the question and is also linked to Sentence 9 by the coreference of “milk replacer”.
F. Error Analysis
We conduct an error analysis on the generated answers selected for human evaluation (Section VI-C). Table VI summarizes the four most frequent error types and their error rates. In general, missing information and redundant information are the most common errors in the answers generated by both the extractive and abstractive GMQS methods. Compared with GMQS-ext, GMQS-abs largely avoids incoherence errors in the generated answers. However, due to hallucination, a typical flaw of generation methods, GMQS-abs suffers more from incorrect information.
Question: How to tube feed a puppy?
Document:
1. You will need a 12 cc syringe, a soft rubber feeding tube, and a 16-inch urethral catheter with a diameter of 5 French (for small dogs) and 8 French (for large dogs).
2. These are the items you will use to create your feeding tube device.
3. You will also need puppy milk replacer that contains goats milk, like ESBILAC®.
4. You can also buy an already assembled feeding tube from your local veterinary office or pet store.
5. You will need to determine the puppy’s weight so that you know how much milk replacer to give him.
6. Place him on a scale to determine his weight.
7. For every ounce of the puppy’s weight, give him 1 cc or ml of the milk replacer.
8. Add one extra cc to be careful.
9. You will want to heat the milk replacer up so that it is easier on the puppy’s stomach.
10. Place the milk into the microwave for three to five seconds so that it reaches a lukewarm temperature.
11. Draw the milk up until you have the measured amount of milk, plus one extra cc.
12. The extra cc will be used to ensure that puppy doesn’t get any air bubbles, which could cause bloating or gas pain.
13. Once the syringe has drawn up all of the milk replacer, press down gently until a tiny drop comes out of the syringe.
14. Doing this will ensure that the syringe is working properly.
15. You will need to attach the end of the rubber feeding tube to the end of the syringe.
16. To do this, place the tip of the rubber tube up against the side of the puppy’s bottom, or last, rib, and run the tube from there to the tip of the pup’s nose.
17. Pinch the tube where it touches the puppy’s nose and make a mark there with a permanent marker.
Reference Answer: Gather your supplies. Weigh the puppy. Measure out the correct amount of milk into a microwaveable bowl. Use the
syringe to suck up the milk replacer. Attach the feeding tube to the syringe. Measure out the length of the tube you will insert into the
puppy’s mouth.
GMQS-ext-joint: For every ounce of the puppy’s weight, give him 1 cc or ml of the milk replacer. You will want to heat the milk
replacer up so that it is easier on the puppy’s stomach. Once the syringe has drawn up all of the milk replacer, press down gently
until a tiny drop comes out of the syringe. You will need to attach the end of the rubber feeding tube to the end of the syringe.
GMQS-abs-joint: Gather your supplies. Measure the puppy’s weight. Place the milk in the microwave. Fill the syringe with milk
replacer. Attach the syringe to the rubber tube. Insert the syringe into the puppy’s mouth.
MSG (3-Hop): Gather your materials. Measure your puppy’s weight. Heat the milk replacer. Attach the rubber feeding tube to the end
of the syringe. Insert the end of the milk replacer into the milk replacer. Insert the syringe into the puppy’s mouth.
MGSum-abs: Gather your supplies. Measure the puppy’s weight. Add the milk replacer to the puppy’s weight. Place the syringe in the
microwave. Remove the syringe from the syringe.
QPGN: Gather your supplies. Measure the puppy’s weight. Place the milk replacer on the puppy’s stomach. Place the milk replacer on the
puppy’s stomach. Press the milk replacer into the milk replacer.
Fig. 6. Case study from WikiHow.
Fig. 7. Visualization of the multi-relational graph (nodes: Sentences 5, 6, 7, 9, 10, 13, 15, and 17; edge types: semantic, topical, and coreference; highlighted across 1-hop, 2-hop, and 3-hop inference).
TABLE VI
ERROR ANALYSIS
Error Type       GMQS-ext-joint   GMQS-abs-joint
Missing Info.    74%              64%
Redundant Info.  86%              62%
Incorrect Info.  12%              52%
Incoherence      44%              16%
VII. CONCLUSIONS AND FUTURE WORK
In this work, we study the non-factoid QA problem by proposing a novel query-focused summarization method, namely Graph-enhanced Multi-hop Query-focused Summarizer (GMQS). Specifically, we investigate graph-based reasoning techniques to perform multi-hop reasoning for collecting key information from documents to answer the given question. Three types of graphs with different semantic relations, namely semantic relevance, topical coherence, and coreference linking, are constructed to explicitly capture the relationships between the question and each document sentence as well as among the document sentences. The Relational Graph Attention Network (RGAT) is then developed to aggregate the multi-relational information accordingly. In addition, the proposed method can be applied to both extractive and abstractive settings. Extensive experimental results show that the proposed method outperforms existing baselines on non-factoid QA and exhibits promising multi-hop reasoning capabilities.
It is noteworthy that the performance of the proposed framework depends to a great extent on the construction of the semantic graphs. In the future, we would like to explore more informative graph representations, such as knowledge graphs and AMR graphs, and leverage them to further improve performance. In doing so, finer-grained relations, such as word-level or entity-level interactions, can be investigated to improve the multi-hop inference. It is also worth exploring the deeper connections between multi-hop reasoning and the graph structure, and studying more sophisticated graph neural network architectures for the representational learning of multi-relational graphs.
REFERENCES
[1] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang, “Squad: 100, 000+
questions for machine comprehension of text,” in EMNLP, 2016, pp.
2383–2392.
[2] H. Song, Z. Ren, S. Liang, P. Li, J. Ma, and M. de Rijke, “Summarizing
answers in non-factoid community question-answering,” in WSDM,
2017, pp. 405–414.
[3] Y. Deng, W. Lam, Y. Xie, D. Chen, Y. Li, M. Yang, and Y. Shen,
“Joint learning of answer selection and answer summary generation in
community question answering,” in AAAI, 2020, pp. 7651–7658.
[4] A. Fan, Y. Jernite, E. Perez, D. Grangier, J. Weston, and M. Auli, “ELI5:
long form question answering,” in ACL, 2019, pp. 3558–3567.
[5] M. Nakatsuji and S. Okui, “Conclusion-supplement answer generation
for non-factoid questions,” in AAAI, 2020, pp. 8520–8527.
[6] Q. Jin, B. Dhingra, Z. Liu, W. W. Cohen, and X. Lu, “Pubmedqa:
A dataset for biomedical research question answering,” in EMNLP-
IJCNLP, 2019, pp. 2567–2577.
[7] E. Yulianti, R. Chen, F. Scholer, W. B. Croft, and M. Sanderson,
“Document summarization for answering non-factoid queries,” IEEE
Trans. Knowl. Data Eng., vol. 30, no. 1, pp. 15–28, 2018.
[8] M. Keikha, J. H. Park, and W. B. Croft, “Evaluating answer passages
using summarization measures,” in SIGIR, 2014, pp. 963–966.
[9] Y. Yang, W. Yih, and C. Meek, “Wikiqa: A challenge dataset for open-
domain question answering,” in EMNLP, 2015, pp. 2013–2018.
[10] P. Nakov, L. Màrquez, W. Magdy, A. Moschitti, J. R. Glass, and B. Randeree, “Semeval-2015 task 3: Answer selection in community question answering,” in SemEval@NAACL-HLT, 2015, pp. 269–281.
[11] A. Severyn and A. Moschitti, “Learning to rank short text pairs with convolutional deep neural networks,” in SIGIR, 2015, pp. 373–382.
[12] M. Tan, C. N. dos Santos, B. Xiang, and B. Zhou, “Improved representation learning for question answer matching,” in ACL, 2016.
[13] S. Wang and J. Jiang, “A compare-aggregate model for matching text sequences,” in ICLR, 2017.
[14] A. Rücklé, N. S. Moosavi, and I. Gurevych, “COALA: A neural coverage-based approach for long answer selection with small data,” in AAAI, 2019, pp. 6932–6939.
[15] L. Wang, H. Raghavan, C. Cardie, and V. Castelli, “Query-focused
opinion summarization for user-generated content,” in COLING, 2014,
pp. 1660–1669.
[16] G. Feigenblat, H. Roitman, O. Boni, and D. Konopnicki, “Unsupervised
query-focused multi-document summarization using the cross entropy
method,” in SIGIR, 2017, pp. 961–964.
[17] J. Hasselqvist, N. Helmertz, and M. Kågebäck, “Query-based abstractive summarization using neural networks,” CoRR, vol. abs/1712.06100, 2017.
[18] P. Nema, M. M. Khapra, A. Laha, and B. Ravindran, “Diversity driven attention model for query-based abstractive summarization,” in ACL, 2017, pp. 1063–1072.
[19] Z. Cao, W. Li, S. Li, F. Wei, and Y. Li, “Attsum: Joint learning of
focusing and summarization with neural attention,” in COLING, 2016,
pp. 547–556.
[20] V. Yadav, S. Bethard, and M. Surdeanu, “Quick and (not so) dirty:
Unsupervised selection of justification sentences for multi-hop question
answering,” in EMNLP-IJCNLP, 2019, pp. 2578–2589.
[21] ——, “Unsupervised alignment-based iterative evidence retrieval for
multi-hop question answering,” in ACL, 2020.
[22] Y. Deng, W. Zhang, and W. Lam, “Multi-hop inference for question-driven summarization,” in EMNLP, 2020, pp. 6734–6744.
[23] Y. Li and S. Li, “Query-focused multi-document summarization: Com-
bining a topic model with graph-based semi-supervised learning,” in
COLING, 2014, pp. 1197–1207.
[24] Y. Gao, Y. Xu, H. Huang, Q. Liu, L. Wei, and L. Liu, “Jointly
learning topics in sentence embedding for document summarization,”
IEEE Trans. Knowl. Data Eng., vol. 32, no. 4, pp. 688–699, 2020.
[25] W. Li, X. Xiao, J. Liu, H. Wu, H. Wang, and J. Du, “Leveraging graph to improve abstractive multi-document summarization,” in ACL, 2020, pp. 6232–6243.
[26] L. Bauer, Y. Wang, and M. Bansal, “Commonsense for generative multi-
hop question answering tasks,” in EMNLP, 2018, pp. 4220–4230.
[27] Y. Fang, S. Sun, Z. Gan, R. Pillai, S. Wang, and J. Liu, “Hierarchical
graph network for multi-hop question answering,” in EMNLP, 2020, pp.
8823–8838.
[28] W. Xu, Y. Deng, H. Zhang, D. Cai, and W. Lam, “Exploiting reasoning
chains for multi-hop science question answering,” in Findings of ACL:
EMNLP, M. Moens, X. Huang, L. Specia, and S. W. Yih, Eds., 2021,
pp. 1143–1156.
[29] Q. Lang, X. Liu, and W. Jia, “AFS graph: Multidimensional axiomatic fuzzy set knowledge graph for open-domain question answering,” IEEE Trans. Neural Networks Learn. Syst., 2022.
[30] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. S. Yu, “A comprehensive survey on graph neural networks,” IEEE Trans. Neural Networks Learn. Syst., vol. 32, no. 1, pp. 4–24, 2021.
[31] T. N. Kipf and M. Welling, “Semi-supervised classification with graph
convolutional networks, in ICLR, 2017.
[32] S. Jiang, Q. Chen, X. Liu, B. Hu, and L. Zhang, “Multi-hop graph
convolutional network with high-order chebyshev approximation for text
reasoning,” in ACL/IJCNLP, 2021, pp. 6563–6573.
[33] P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, “Graph attention networks,” in ICLR, 2018.
[34] G. Wang, R. Ying, J. Huang, and J. Leskovec, “Multi-hop attention
graph neural networks,” in IJCAI, 2021, pp. 3089–3096.
[35] J. Ma, J. Liu, Y. Wang, J. Li, and T. Liu, “Relation-aware fine-grained reasoning network for textbook question answering,” IEEE Trans. Neural Networks Learn. Syst., vol. 34, no. 1, pp. 15–27, 2023.
[36] Q. Liu, X. Geng, H. Huang, T. Qin, J. Lu, and D. Jiang, “Mgrc: An
end-to-end multigranularity reading comprehension model for question
answering,” IEEE Trans. Neural Networks Learn. Syst., 2021.
[37] T. Nguyen, M. Rosenberg, X. Song, J. Gao, S. Tiwary, R. Majumder,
and L. Deng, “MS MARCO: A human generated machine reading
comprehension dataset,” in NeurIPS, 2016.
[38] Q. Liu, X. Geng, Y. Wang, E. Cambria, and D. Jiang, “Disentangled retrieval and reasoning for implicit question answering,” IEEE Trans. Neural Networks Learn. Syst., 2022.
[39] W. Zhang, Y. Deng, and W. Lam, “Answer ranking for product-related
questions via multiple semantic relations modeling,” in SIGIR, 2020, pp.
569–578.
[40] Y. Deng, Y. Xie, Y. Li, M. Yang, W. Lam, and Y. Shen, “Contextualized knowledge-aware attentive neural network: Enhancing answer selection with knowledge,” ACM Trans. Inf. Syst., vol. 40, no. 1, pp. 2:1–2:33, 2022.
[41] Y. Deng, Y. Shen, M. Yang, Y. Li, N. Du, W. Fan, and K. Lei, “Knowledge as a bridge: Improving cross-domain answer selection with external knowledge,” in COLING, 2018, pp. 3295–3305.
[42] Y. Shen, Y. Deng, M. Yang, Y. Li, N. Du, W. Fan, and K. Lei,
“Knowledge-aware attentive neural network for ranking question answer
pairs,” in SIGIR, 2018, pp. 901–904.
[43] Z. Wang, W. Hamza, and R. Florian, “Bilateral multi-perspective match-
ing for natural language sentences,” in IJCAI, 2017, pp. 4144–4150.
[44] Y. Deng, Y. Li, W. Zhang, B. Ding, and W. Lam, “Toward personal-
ized answer generation in e-commerce via multi-perspective preference
modeling,” ACM Trans. Inf. Syst., vol. 40, no. 4, pp. 87:1–87:28, 2022.
[45] R. Ishida, K. Torisawa, J. Oh, R. Iida, C. Kruengkrai, and J. Kloetzer,
“Semi-distantly supervised neural model for generating compact answers
to open-domain why questions,” in AAAI, 2018, pp. 5803–5811.
[46] R. Iida, C. Kruengkrai, R. Ishida, K. Torisawa, J. Oh, and J. Kloetzer,
“Exploiting background knowledge in compact answer generation for
why-questions,” in AAAI, 2019, pp. 142–151.
[47] J. J. Lin, N. Madnani, and B. J. Dorr, “Putting the user in the loop: In-
teractive maximal marginal relevance for query-focused summarization,”
in HLT-NAACL, 2010, pp. 305–308.
[48] C. Shen and T. Li, “Learning to rank for query-focused multi-document
summarization,” in ICDM, 2011, pp. 626–634.
[49] L. Wang, H. Raghavan, V. Castelli, R. Florian, and C. Cardie, “A sen-
tence compression based framework to query-focused multi-document
summarization,” in ACL, 2013, pp. 1384–1394.
[50] T. Ishigaki, H. Huang, H. Takamura, H. Chen, and M. Okumura, “Neural query-biased abstractive summarization using copying mechanism,” in ECIR, 2020, pp. 174–181.
[51] T. Baumel, R. Cohen, and M. Elhadad, “Topic concentration in query
focused summarization datasets,” in AAAI, 2016, pp. 2573–2579.
[52] M. T. R. Laskar, E. Hoque, and J. X. Huang, “WSL-DS: Weakly supervised learning with distant supervision for query focused multi-document abstractive summarization,” in COLING, 2020, pp. 5647–5654.
[53] Y. Xu and M. Lapata, “Coarse-to-fine query focused multi-document
summarization,” in EMNLP, 2020, pp. 3632–3645.
[54] M. Singh, A. Mishra, Y. Oualil, K. Berberich, and D. Klakow, “Long-
span language models for query-focused unsupervised extractive text
summarization,” in ECIR, 2018, pp. 657–664.
[55] Y. Deng, W. Zhang, Y. Li, M. Yang, W. Lam, and Y. Shen, “Bridging hierarchical and sequential context modeling for question-driven extractive answer summarization,” in SIGIR, 2020, pp. 1693–1696.
[56] N. Zhang, S. Deng, J. Li, X. Chen, W. Zhang, and H. Chen, “Sum-
marizing chinese medical answer with graph convolution networks and
question-focused dual attention,” in Findings of ACL: EMNLP, 2020,
pp. 15–24.
[57] Z. Yang, P. Qi, S. Zhang, Y. Bengio, W. W. Cohen, R. Salakhutdinov, and
C. D. Manning, “Hotpotqa: A dataset for diverse, explainable multi-hop
question answering,” in EMNLP, 2018, pp. 2369–2380.
[58] Y. Feldman and R. El-Yaniv, “Multi-hop paragraph retrieval for open-
domain question answering,” in ACL, 2019, pp. 2296–2309.
[59] K. Nishida, K. Nishida, M. Nagata, A. Otsuka, I. Saito, H. Asano, and J. Tomita, “Answering while summarizing: Multi-task learning for multi-hop QA with evidence extraction,” in ACL, 2019, pp. 2335–2345.
[60] L. Qiu, Y. Xiao, Y. Qu, H. Zhou, L. Li, W. Zhang, and Y. Yu,
“Dynamically fused graph network for multi-hop reasoning,” in ACL,
2019, pp. 6140–6150.
[61] S. Moon, P. Shah, A. Kumar, and R. Subba, “Opendialkg: Explainable
conversational reasoning with attention-based walks over knowledge
graphs,” in ACL, 2019, pp. 845–854.
[62] H. Ji, P. Ke, S. Huang, F. Wei, X. Zhu, and M. Huang, “Language generation with multi-hop reasoning on commonsense knowledge graph,” in EMNLP, 2020, pp. 725–736.
[63] J. Cheng and M. Lapata, “Neural summarization by extracting sentences
and words,” in ACL, 2016.
[64] Q. Zhou, N. Yang, F. Wei, S. Huang, M. Zhou, and T. Zhao, “Neural
document summarization by jointly learning to score and select sen-
tences,” in ACL, 2018, pp. 654–663.
[65] A. See, P. J. Liu, and C. D. Manning, “Get to the point: Summarization with pointer-generator networks,” in ACL, 2017, pp. 1073–1083.
[66] S. Gehrmann, Y. Deng, and A. M. Rush, “Bottom-up abstractive
summarization,” in EMNLP, 2018, pp. 4098–4109.
[67] M. Yang, C. Li, Y. Shen, Q. Wu, Z. Zhao, and X. Chen, “Hierarchical human-like deep neural networks for abstractive text summarization,” IEEE Trans. Neural Networks Learn. Syst., vol. 32, no. 6, pp. 2744–2757, 2021.
[68] W. T. Hsu, C. Lin, M. Lee, K. Min, J. Tang, and M. Sun, “A unified model for extractive and abstractive summarization using inconsistency loss,” in ACL, 2018, pp. 132–141.
[69] H. Jin, T. Wang, and X. Wan, “Multi-granularity interaction network
for extractive and abstractive multi-document summarization,” in ACL,
2020, pp. 6244–6254.
[70] Y. Chen and M. Bansal, “Fast abstractive summarization with reinforce-selected sentence rewriting,” in ACL, 2018, pp. 675–686.
[71] J. Pilault, R. Li, S. Subramanian, and C. Pal, “On extractive and
abstractive neural document summarization with transformer language
models,” in EMNLP, 2020, pp. 9308–9319.
[72] D. Wang, P. Liu, Y. Zheng, X. Qiu, and X. Huang, “Heterogeneous graph neural networks for extractive document summarization,” in ACL, 2020, pp. 6209–6219.
[73] H. Zhang, C. Wang, Z. Wang, Z. Duan, B. Chen, M. Zhou, R. Henao,
and L. Carin, “Learning hierarchical document graphs from multilevel
sentence relations,” IEEE Trans. Neural Networks Learn. Syst., 2021.
[74] Y. Liu and M. Lapata, “Text summarization with pretrained encoders,” in EMNLP-IJCNLP, 2019, pp. 3728–3738.
[75] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy,
V. Stoyanov, and L. Zettlemoyer, “BART: denoising sequence-to-
sequence pre-training for natural language generation, translation, and
comprehension,” in ACL, 2020, pp. 7871–7880.
[76] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in NeurIPS, 2017, pp. 5998–6008.
[77] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent Dirichlet allocation,” J. Mach. Learn. Res., vol. 3, pp. 993–1022, 2003.
[78] M. S. Schlichtkrull, T. N. Kipf, P. Bloem, R. van den Berg, I. Titov,
and M. Welling, “Modeling relational data with graph convolutional
networks,” in ESWC, 2018, pp. 593–607.
[79] M. Koupaee and W. Y. Wang, “Wikihow: A large scale text summariza-
tion dataset,” CoRR, vol. abs/1810.09305, 2018.
[80] J. Devlin, M. Chang, K. Lee, and K. Toutanova, “BERT: pre-training of
deep bidirectional transformers for language understanding,” in NAACL-
HLT, 2019, pp. 4171–4186.
[81] S. Bird, E. Klein, and E. Loper, Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O’Reilly Media, Inc., 2009.
Yang Deng is now working toward the PhD degree
in the Department of System Engineering and En-
gineering Management, The Chinese University of
Hong Kong. He received the BS degree from Beijing
University of Posts and Telecommunications and the
MS degree from Peking University. His research
interests include Natural Language Processing, In-
formation Retrieval, and Deep Learning.
Wenxuan Zhang is currently a research scientist
at Alibaba DAMO Academy. He received the PhD
degree from The Chinese University of Hong Kong.
His research interests include Natural Language Pro-
cessing and Deep Learning. He has published several
papers in top-tier conferences in these areas. He
has also been serving on the program committee
of several international conferences and journals,
including ACL, EMNLP, AAAI, SIGKDD, WSDM
etc.
Weiwen Xu is now working toward the PhD de-
gree in the Department of System Engineering and
Engineering Management, The Chinese University
of Hong Kong. He received the BS degree from
University of Electronic Science and Technology of
China. His research interests include Natural Lan-
guage Processing, Information Retrieval, and Deep
Learning.
Ying Shen is now an Associate Professor in the School of Intelligent Systems Engineering, Sun Yat-Sen University. She received her Ph.D. degree from the University of Paris Ouest Nanterre La Défense (France), specialized in Computer Science. She received her Erasmus Mundus Master degree in Natural Language Processing from the University of Franche-Comté (France) and the University of Wolverhampton (England). Her research interests include Natural Language Processing and deep learning.
Wai Lam received a Ph.D. in Computer Science
from the University of Waterloo. He obtained his
BSc. and M.Phil. degrees from The Chinese Uni-
versity of Hong Kong. After completing his Ph.D.
degree, he conducted research at Indiana University
Purdue University Indianapolis (IUPUI) and the Uni-
versity of Iowa. He joined The Chinese University
of Hong Kong, where he is currently a professor.
His research interests include text mining, natu-
ral language processing, and intelligent information
retrieval. He has published extensively in top-tier
conferences and journals in these areas.
Self-attention mechanism in graph neural networks (GNNs) led to state-of-the-art performance on many graph representation learning tasks. Currently, at every layer, attention is computed between connected pairs of nodes and depends solely on the representation of the two nodes. However, such attention mechanism does not account for nodes that are not directly connected but provide important network context. Here we propose Multi-hop Attention Graph Neural Network (MAGNA), a principled way to incorporate multi-hop context information into every layer of attention computation. MAGNA diffuses the attention scores across the network, which increases the receptive field for every layer of the GNN. Unlike previous approaches, MAGNA uses a diffusion prior on attention values, to efficiently account for all paths between the pair of disconnected nodes. We demonstrate in theory and experiments that MAGNA captures large-scale structural information in every layer, and has a low-pass effect that eliminates noisy high-frequency information from graph data. Experimental results on node classification as well as the knowledge graph completion benchmarks show that MAGNA achieves state-of-the-art results: MAGNA achieves up to 5.7% relative error reduction over the previous state-of-the-art on Cora, Citeseer, and Pubmed. MAGNA also obtains the best performance on a large-scale Open Graph Benchmark dataset. On knowledge graph completion MAGNA advances state-of-the-art on WN18RR and FB15k-237 across four different performance metrics.