ORIGINAL ARTICLE
A multi-level methodology for the automated translation of a coreference resolution dataset: an application to the Italian language
Aniello Minutolo¹ · Raffaele Guarasci¹ · Emanuele Damiano¹ · Giuseppe De Pietro¹ · Hamido Fujita²,³,⁴ · Massimo Esposito¹
Received: 18 January 2022 / Accepted: 18 July 2022 / Published online: 19 September 2022
© The Author(s) 2022
Abstract
In the last decade, the demand for readily accessible corpora has touched all areas of natural language processing, including coreference resolution. However, it is one of the least considered sub-fields in recent developments, and almost all existing resources are only available for the English language. To overcome this lack, this work proposes a methodology for creating a corpus for coreference resolution in Italian by exploiting annotated resources in other languages. Starting from OntoNotes, the methodology translates and refines English utterances to obtain utterances respecting Italian grammar, dealing with language-specific phenomena and preserving coreferences and mentions. A quantitative and qualitative evaluation is performed to assess the well-formedness of the generated utterances, considering readability, grammaticality, and acceptability indexes. The results confirm the effectiveness of the methodology in generating a good coreference resolution dataset from an existing one. The quality of the dataset is also assessed by training a coreference resolution model based on the BERT language model, achieving promising results. Even though the methodology has been tailored to the English and Italian languages, it has a general basis easily extendable to other languages, requiring only the adaptation of a small number of language-dependent rules to generalize most of the linguistic phenomena of the language under examination.
Keywords Coreference resolution · Corpus creation · Automated translation · Cross-language · Natural language processing · Linguistic phenomena
1 Introduction
Coreference resolution (henceforth CR) has a long history in natural language processing (NLP); knowing who is being talked about in a text has always been a fascinating challenge for scholars. Although it is not a new task, CR is still debated [1], demonstrating its usefulness concerning practical and theoretical issues. Indeed, coreference information has been used in various NLP tasks, such as text summarization [2], also with reference to low-resource languages [3]. Moreover, it has been the object of study for theoretical issues in linguistics [4], focusing on the interpretation of syntactic phenomena like null subjects and pronouns. Over the last decades, many approaches for CR have succeeded one another, ranging from simple rule-based systems to machine- and deep-learning approaches [5, 6] and reinforcement learning-based solutions [7]. These approaches
✉ Raffaele Guarasci
raffaele.guarasci@cnr.it

1 Institute for High Performance Computing and Networking of the National Research Council of Italy (ICAR-CNR), Via Pietro Castellino 111, 80131 Naples, Italy
2 Faculty of Information Technology, Ho Chi Minh City University of Technology (HUTECH), Ho Chi Minh City, Vietnam
3 Andalusian Research Institute in Data Science and Computational Intelligence (DaSCI), University of Granada, Granada, Spain
4 Faculty of Software and Information Science, Iwate Prefectural University, Iwate, Japan
Neural Computing and Applications (2022) 34:22493–22518
https://doi.org/10.1007/s00521-022-07641-3
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
have also been applied to specific domains [8].
The history and developments in this field have led to the creation of numerous corpora specifically annotated for coreference-related tasks. These range from the earliest modestly sized, manually created corpora [9] to progressively larger resources, built to satisfy the ever-increasing data needs of machine learning approaches [10] and capable of covering multiple languages or specific domains. Evaluation campaigns such as SemEval [11] and CoNLL 2012 [10] have contributed to the proliferation of available datasets. However, despite its long tradition, CR is one of the sub-fields of NLP that has seen the slowest progress [1] during the last decade, dominated by the exponential growth of machine learning. In addition, the vast amount of resources available for the English language is not matched by a similar number for other languages. Datasets in languages other than English are mainly limited to preexisting treebanks to which a specific coreference annotation level has been added.
As for the language under investigation in this work, Italian, there are only a few outdated annotated corpora [12–14], which suffer from limited size, excessive domain-dependence, and the lack of a shared annotation standard scheme. Hence, only a handful of approaches for CR have been developed.
Motivated by this issue, this paper describes an innovative cross-lingual methodology for creating a CR dataset in a low-resource language starting from a resource-rich one; the languages considered here are Italian and English, respectively. In particular, an Italian dataset for CR has been generated starting from OntoNotes [15], which is currently considered the de facto standard for the evaluation of coreference tasks in English since the CoNLL shared tasks of 2011 and 2012.
The methodology is divided into two distinct steps. First, a multi-level translation process is applied to the English sentences extracted from the OntoNotes dataset for CR. This step aims to translate sentences while preserving the mentions they contain, without losing in translation the tokens composing the mentions, their positions, or the verbal agreements involving them. Second, a language refinement step has been introduced. This step manages language-dependent phenomena to produce output sentences compliant with Italian grammar by applying language-specific rules derived from theoretical linguistics. These rules perform deletions and substitutions without losing information about mentions: the original coreference annotation is preserved while avoiding sentences that would sound unnatural or ungrammatical in Italian. This step is necessary where there is a significant discrepancy between the two languages, in this case Italian and English, concerning syntactic constructions involving personal pronouns, which are often used in different ways.
Concerning evaluation, the results have been assessed both quantitatively and qualitatively. From the quantitative point of view, the readability of the produced sentences has been calculated using the Flesch–Kincaid index adapted for the Italian language [16]. This metric has been supplemented with a qualitative analysis carried out by native speakers using indicators from theoretical linguistics, such as grammaticality and acceptability. Grammaticality refers to a sentence's well-formedness from a syntactic point of view, e.g., whether the structure and order of the constituents are maintained. The concept of acceptability, instead, relates to how semantically meaningful the sentence is considered according to the annotators' judgments. Together, these two indicators allow assessing the quality of translated sentences from the perspectives of both grammatical correctness and meaningfulness for a native speaker. The quality of the dataset has also been assessed by training a CR baseline model based on BERT [17]; the results have then been compared with the ones obtained by the same model on the English version of the OntoNotes dataset.
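The readability scoring described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the coefficients are those commonly reported for the Flesch–Vacca adaptation of the Flesch index to Italian, and the syllable counter is a crude vowel-group heuristic assumed here only for brevity.

```python
import re

def count_syllables_it(word: str) -> int:
    """Rough Italian syllable estimate: count maximal groups of vowels.
    A heuristic for illustration only, not a full phonological rule set."""
    return max(1, len(re.findall(r"[aeiouàèéìòù]+", word.lower())))

def flesch_vacca(text: str) -> float:
    """Flesch reading ease adapted for Italian (Flesch-Vacca):
    F = 206 - 0.65 * (syllables per 100 words) - (words per sentence)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-zA-Zàèéìòù]+", text)
    syllables = sum(count_syllables_it(w) for w in words)
    syllables_per_100_words = 100.0 * syllables / len(words)
    words_per_sentence = len(words) / len(sentences)
    return 206.0 - 0.65 * syllables_per_100_words - words_per_sentence
```

Higher scores indicate easier text; short, common words in short sentences score high, while long technical words drive the score down.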
The paper is organized as follows. Section 2 reviews the state of the art of datasets created for CR, describing both the datasets created for English and those for other languages. Section 3 outlines the research motivations and contributions of the proposal. In Sect. 4, the methodology adopted for building the dataset starting from the original English resource is reported; this section describes the two macro-steps of translation and linguistic refinement used to achieve a translated text that preserves mentions and coreferences. Section 5 discusses the results obtained, describing the evaluation process, both quantitative and qualitative, and outlining the performance achieved by a BERT-based CR model trained on the generated dataset. Finally, Sect. 6 concludes the work.
2 Related work
The datasets developed over the years for CR are of various kinds: generalist, domain-specific, and multilingual datasets characterized by different criteria and annotation schemes have been created. The vast majority of the resources, as in all NLP fields, have been made for the English language, but there have also been developments in other languages in recent years.
It is worth noting that almost all resources cover both coreference and anaphora resolution, since both are part of the entity resolution family. The clear distinction in terminology between the two concepts is still debated in the literature: according to some studies, anaphora is a subset of coreference, while others claim that coreference is part of anaphora. In this paper, the resources available for coreference will be listed, although these are almost always valid for anaphora resolution as well. Notice that, as regards terminology, this work adopts the same definition of coreference as the OntoNotes schema: coreference is not limited to noun phrases [18] but includes pronouns, heads of verb phrases, and named entities as potential mentions.
Starting from these premises, this section first surveys and highlights the main characteristics of existing datasets for CR in English. Subsequently, CR resources for languages other than English are described, specifically outlining the ones for Italian.
2.1 CR resources for English
The MUC corpora are the first datasets manually created by human annotators that also target evaluation purposes. MUC-6 [19] and MUC-7 [20] are based on North American news corpora (extracted from the Wall Street Journal), and they are small in size (318 annotated articles). Although now rarely used due to their limited domain and size, they are still considered valid baselines for comparison. MUC has its own evaluation metrics and an SGML-based annotation format.
The GNOME corpus [21], instead, was created with a specific cross-domain scope. It includes texts from three domains (museum labels, pharmaceutical leaflets, and tutorial dialogues) and has an annotation level for discourse and semantic information. GNOME has also been used in conjunction with other datasets to create the ARRAU corpus [22], which includes corpora from different domains such as news-wire, dialogues, and fiction. The annotation scheme is the MMAX2 format, which uses hierarchical XML files at the document and sentence level.
Then, there are corpora developed for specific coreference-related sub-tasks. The character identification corpus [23] focuses on the task of speaker-linking in multi-party conversations extracted from transcriptions of TV shows. ECB+ [24] is another task-specific corpus, devoted to topic-based event CR, a topic that has gained much attention in the literature in recent years.
Other corpora developed for cross-domain purposes exploit freely available online resources. The GUM corpus [25] is a multilayer, CoNLL-labeled corpus containing conversational, instructional, and news texts extracted from the web. WikiCoref [26] is composed of annotated Wikipedia articles, whose entities are linked to an external knowledge repository for the mentions. Both corpora use the OntoNotes schema for the annotation. It is worth noting that the English Penn Treebank [27] has also been used for purposes related to coreference tasks; indeed, it was annotated with coreference links as part of the OntoNotes project [15].
There are also coreference corpora specifically developed for a single domain. For instance, NP4E [28] is a small corpus based only on the security and terrorism genres, annotated using the MMAX2 format for the event coreference task. In addition, the healthcare domain has received special attention, so numerous biomedical corpora have been created. Starting from the GENIA corpus [29], which contains 2000 MEDLINE abstracts, numerous other resources have been developed, such as the Genia Treebank [30], Genia event annotation [31], and the MedCo coreference annotation [32]. These resources have been the focus of the BioNLP-2011 shared task on Protein CR [33]. A different approach is proposed by CRAFT [34] and by its successor, the HANNAPIN corpus [35]; these resources contain fully annotated biochemical articles for CR. In the pharmacological field, the DrugNerAR corpus [36] has been developed with the aim of resolving anaphora for the extraction of drug–drug interactions in the pharmacological literature.
2.2 CR resources for other languages
The first corpus that also deals with languages other than English is ACE [37]. Initially based only on the journalistic domain, it aims to be heterogeneous and domain-independent and is annotated for different languages (such as English, Chinese, and Arabic). The covered domains range from news-wire articles to conversational telephone speech and broadcast conversations.
OntoNotes 5.0 [38] was the dataset involved in SemEval 2010 [39] and CoNLL 2012 [10], with the aim of modeling CR for multiple languages. It was created to classify mentions into equivalence classes according to the entity to which they refer. OntoNotes is mostly based on news articles; it includes three different languages and is annotated using a CoNLL-like format. It is still the most widely used corpus for evaluation in the literature.
Another parallel corpus, available in two languages (English and German), is ParCor [40]. It includes data extracted from specific genres (TEDx talks and Bookshop publications) and focuses on a particular purpose: parallel pronoun CR across languages in a machine translation context.
Concerning the Italian language, there are very few datasets currently used for the coreference task. VENEX [12] is a corpus that combines two different corpus-annotation initiatives: SI-TAL [41], focused on the creation of a corpus of written Italian from financial newspapers, and IPAR [42], a collection of spoken task-oriented dialogues. VENEX uses MATE as its annotation scheme and MMAX for the markup.
Another coreference resource is I-CAB [13], a small dataset built on news documents taken from the regional newspaper L'Adige; texts are annotated using a scheme derived from the ACE corpus. The most recent corpus developed for Italian is LiveMemories [14]. It collects two genres of text, blog sites and Wikipedia pages, related to the history, geography, and culture of the region of Trentino-Alto Adige/Südtirol. The annotation follows the ARRAU guidelines adapted for the Italian language. Table 1 compares the sizes of these corpora.
These resources present several limitations. First, they are related to specific domains: both the I-CAB and LiveMemories corpora contain only texts related to the region of Trentino/Südtirol (newspaper articles, and Wikipedia pages and blog sites, respectively). The VENEX corpus is more heterogeneous, since it includes articles from financial newspapers as well as dialogues. Second, they adopt different annotation methods. The VENEX annotation scheme implements the scheme proposed in MATE,¹ and its markup scheme is the simplified form of standoff adopted in the MMAX annotation tool. I-CAB is annotated with a scheme inspired by the ACE corpus, while LiveMemories combines annotation methods from the ARRAU corpus for English [22] and the VENEX project.
3 Research objectives and contribution
The main objective of this work is to propose a cross-lingual methodology for the creation of a CR dataset by integrating automatic translation and rule-based refinement to transfer existing resources from a source language to a target language.
As highlighted in Sect. 2.1, the most recent datasets for coreference tasks are based on previously developed resources or treebanks to which an additional level of specific annotation has been added. This approach is practical for languages with a great richness of materials, but it cannot be adapted to languages like Italian, which are often overlooked in many NLP tasks due to limited resources. Translating resources already developed in other resource-rich languages can address this shortcoming, provided that the same methodological accuracy used in creating the original dataset is maintained.
Translating existing datasets into other languages offers many advantages, considerably reducing creation time compared to building a resource from scratch. However, this approach is not entirely straightforward: a fully automatic machine translation cannot be sufficiently accurate in adapting the original text to the linguistic features of the target language.
Therefore, as an element of novelty, the proposed methodology includes a language refinement step derived from theoretical linguistics, particularly concerning aspects of syntax. This step manages language-dependent phenomena to produce sentences that comply with the target language grammar and are perceived as correct according to native speakers' judgements.
Despite this language-dependent refinement step, the proposed methodology is reproducible: it can be extended to other languages by developing a set of language-dependent refinement rules that generalize most of the linguistic phenomena of the language under examination. In addition, starting from existing resources makes it possible to obtain parallel corpora, useful for subsequent cross-lingual analysis.
From an application perspective, the proposed methodology has been used to create, to the best of our knowledge, the first medium-scale Italian dataset for CR that also respects properties of interoperability, domain independence, and compliance with annotation standards. Indeed, the Italian language does not benefit from many resources and, as highlighted in Sect. 2.2, existing material is outdated and restricted to the VENEX [12], I-CAB [13], and LiveMemories [14] corpora.
It is worth noting that both the excessive specificity of their application domains and their lack of a shared annotation standard scheme make interoperability between existing Italian resources extremely complicated. On the contrary, the corpus generated with the proposed methodology is comparable, in size and annotation criteria, with OntoNotes, which is currently considered the essential resource for the field [15]. The opportunity to compare with OntoNotes, the de facto standard for evaluating coreference tasks since the CoNLL shared tasks of 2011 and 2012, could open exciting perspectives for multilingual analysis.
The quality of the generated dataset is also assessed concerning the possibility of using it to train a deep learning model for CR in Italian. To this aim, a baseline model is generated on the dataset by adopting a state-of-the-art deep learning architecture proposed for the same task in English.

Table 1 Size comparison of coreference corpora

Corpus        Language   Size (k words)
OntoNotes     English    1450
VENEX         Italian    40
I-CAB         Italian    250
LiveMemories  Italian    250

¹ http://www.andreasmengel.de/pubs/mdag.pdf
4 Methodology for the creation of the dataset
The proposed cross-lingual methodology has been developed starting from the multilingual coreference annotation of the OntoNotes dataset, first proposed by [10]. It is structured in two macro-steps, as highlighted in Fig. 1. First, a coreference dataset is automatically translated from a source language into a target one, preserving mentions and their positions in texts. In detail, OntoNotes is used as the input coreference dataset expressed in English, and Italian is selected as the target language. A pipeline has been realized to perform this translation process.
In detail, first a CR dataset in the source language, denoted with α, is obtained from the source corpus by preserving documents, partitions, utterances, and mentions, but discarding irrelevant information and mentions whose tokens are contained in other mentions. Then, the dataset β1 is obtained from the dataset α by discarding unwanted utterances, i.e., utterances lacking verbs or composed of too few or too many tokens. Successively, the dataset β2 is obtained from the dataset β1 by removing unwanted mentions, i.e., mentions that can easily lead to ambiguities and inaccuracies in their translation. Afterwards, the dataset β3 is obtained from the dataset β2 by removing all mention clusters within each partition resulting in inconsistency. Finally, the CR dataset γ in the target language is obtained from the dataset β3 by translating its utterances and mentions through an intelligent token replacement/resolution procedure guided by the set class(id_m), which contains an estimation of the typology, gender, and number of the real-world entities referred to by each mention within the dataset β3.
Second, a novel refinement based on theoretical linguistics is applied to improve the naturalness of the output text in the target language. In particular, a series of rewriting rules based on principles of theoretical linguistics is applied to obtain a more readable and fluent Italian text from the original English text. The rules are structured so as to ensure the widest coverage of the most frequent phenomena in the sentences; they have then been automatically applied to the whole dataset. Such rules are the most innovative aspect of the methodology: through the use of solid theoretical principles, they enhance the accuracy of a machine translation process on a specific task, producing output sentences as close as possible to those a native speaker of the target language would produce.
In detail, first the dataset δ is obtained from the dataset γ by refining its utterances and mentions through a set of language-dependent refinement rules based on principles of theoretical linguistics, improving the naturalness and readability of the output text in the target language. Then, the final output corpus is obtained from the dataset δ by rewriting, where needed, pronouns and adjectives within utterances and mentions to improve their compliance with the target language's grammatical constraints concerning agreement, inflexion, and subject-object roles.
In the following, the characteristics of the input coreference dataset and the two macro-steps of the methodology are explained in detail.
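The chain of datasets α → β1 → β2 → β3 → γ → δ described above can be sketched as a sequence of dataset transformations. The record layout and the predicate hooks below are illustrative assumptions, not the authors' implementation; each hook stands in for one of the filtering, translation, or refinement procedures detailed in the following sections.

```python
def build_dataset(alpha, keep_utterance, keep_mention, keep_cluster,
                  translate, refine):
    """Chain the macro-steps of the methodology over a list of utterance
    records (dicts with at least 'tokens' and 'mentions' keys)."""
    # β1: discard unwanted utterances (no verbs, too few or too many tokens).
    beta1 = [u for u in alpha if keep_utterance(u)]
    # β2: discard mentions that would translate ambiguously.
    beta2 = [{**u, "mentions": [m for m in u["mentions"] if keep_mention(m)]}
             for u in beta1]
    # β3: drop mentions belonging to inconsistent clusters.
    beta3 = [{**u, "mentions": [m for m in u["mentions"]
                                if keep_cluster(m["id"])]}
             for u in beta2]
    # γ: translate utterances while preserving mention spans.
    gamma = [translate(u) for u in beta3]
    # δ: apply language-dependent refinement rules for the target grammar.
    delta = [refine(u) for u in gamma]
    return delta
```

With identity hooks the function reduces to plain filtering, which makes it easy to test each stage in isolation before plugging in real components.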
4.1 Source corpus
The starting corpus in the source language is OntoNotes [15], a dataset containing primarily texts extracted from the news domain, initially developed for the shared tasks on modeling unrestricted coreference at CoNLL 2011 [43] and CoNLL 2012 [10].
OntoNotes turns out to be an obligatory choice for many reasons. First of all, despite its lack of heterogeneity, it has a remarkable diffusion in the field, having become the standard benchmark dataset used for CR. Even the most recent systems perform their evaluation entirely on OntoNotes [44], although numerous other resources have been created for different domains. OntoNotes also offers a considerable advantage in terms of size. As pointed out in Table 1, the corpora currently available for the Italian language are considerably smaller. Size is a significant issue, primarily as it affects the possibility of using a corpus as the training set for a machine learning model.
Another reason lies in the annotation schema. As pointed out by several studies, one of the critical issues in corpus creation and annotation for the coreference task is the definition of the unit of text to be chosen as a mention of an entity. This definition can depend on syntactic and semantic factors and involves several controversial problems discussed in theoretical linguistics. The coreference annotations of OntoNotes do not use the text (tokens) as a base layer; rather, they rely on a morpho-syntactically annotated layer. This is possible because OntoNotes was built on a hand-tagged treebank before the coreference dataset was created. The coreference portion of OntoNotes is not limited to noun phrases or a limited set of entity types: the aim of the project was to annotate linguistic coreference using the most literal interpretation of the text at a very high degree of consistency, even if it meant departing from a particular linguistic theory [43].
The OntoNotes dataset is divided into three distinct subsets (Train, Dev, and Test), which can be used for training, developing, and testing a neural coreference model. The subsets Train, Dev, and Test are arranged into sets of documents composed of an ordered list of non-overlapping partitions of ordered utterances. Statistics on the dataset are reported in Table 2. Moreover, the distributions of the number of tokens and mentions per utterance in the OntoNotes dataset are reported in Fig. 2.

Fig. 1 The main steps of the proposed methodology
4.2 Translation
The translation step aims to extract, process, and correctly translate a dataset for CR, operating on both utterances and the mentions contained in them. As mentioned above, the input dataset is OntoNotes, chosen as the most suitable resource for this work; it should be noted, however, that any dataset for CR could be used. The source and target languages are English and Italian, even though almost all the considerations and procedures described in the following are valid for, or could be adapted to, other languages.
In more detail, this step first extracts from the dataset the set of linguistic information necessary for the translation. Second, the dataset is simplified by removing utterances, mentions, and mention clusters not meeting specific selection criteria. Third, unique replacement tokens are identified and positioned in place of the mentions in the original utterances to preserve, after the translation, the tokens composing the mentions, their positions, and the verbal agreements involving them. Lastly, the translation into the target language is performed. Mentions initially substituted by replacement tokens are also translated and reinserted in place of their corresponding translated replacement tokens, avoiding ambiguities due to multiple mentions made of the same token(s) in the same utterance. In the following, more details are given about the whole translation process, breaking it down into six sub-steps, namely (1) data preparation, (2) utterances simplification, (3) mentions simplification, (4) mentions clusters simplification, (5) referred entities estimation, and (6) utterances translation and tokens replacement.
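The replacement-token idea behind sub-steps (3) and (6) can be illustrated as follows: each mention span is swapped for a unique placeholder token expected to survive machine translation unchanged, the utterance and the mentions are translated separately, and the placeholders are then resolved back. The `translate` callable is a hypothetical stand-in for any MT service, and the placeholder format is an assumption for illustration, not the authors' component.

```python
def translate_preserving_mentions(tokens, mentions, translate):
    """Replace each mention span with a unique placeholder, translate the
    utterance, then re-insert the separately translated mentions.
    `mentions` maps a mention id to its inclusive (start, end) token span."""
    text_tokens = list(tokens)
    placeholders = {}
    # Walk mentions right-to-left so earlier spans keep their indexes.
    for mention_id, (start, end) in sorted(mentions.items(),
                                           key=lambda kv: -kv[1][0]):
        tag = f"MNT{mention_id}X"  # unique token, unlikely to be altered by MT
        placeholders[tag] = " ".join(tokens[start:end + 1])
        text_tokens[start:end + 1] = [tag]
    translated = translate(" ".join(text_tokens))
    # Resolve each placeholder with the separately translated mention.
    for tag, surface in placeholders.items():
        translated = translated.replace(tag, translate(surface))
    return translated
```

Because each placeholder is unique, two mentions made of the same tokens in the same utterance can no longer be confused after translation, which is exactly the ambiguity the procedure is designed to avoid.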
4.2.1 Data preparation
This step consists of a preliminary process that extracts from the source dataset the information necessary to perform the subsequent translation.
In detail, let D be the set of documents in the source dataset, let P(d) = [P_1, P_2, ..., P_n] denote the ordered list of non-overlapping partitions of utterances composing a document d ∈ D, and let S(P) = [u_1, u_2, ..., u_l] denote the ordered list of utterances contained in a partition P ∈ P(d). This step creates, for each utterance u ∈ S(P), a quadruple u′ = (t(u), p(u), m(u), s(u)), where t(u) and p(u) are, respectively, the list of tokens composing u and their Penn Treebank POS (part-of-speech) tags, m(u) is the set of mentions built by selecting only the ones, possibly existing in u, containing no tokens of other mentions, and s(u) is the label associated with the speaker of u.
An example of how the quadruple u′ is built is reported in Fig. 3. Only the mentions ‘it’ and ‘China’ are selected, whereas the mention ‘an important city in China called Yichang’ is discarded since it contains tokens of a shorter mention, i.e., ‘China.’
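The selection of m(u), which keeps only mentions containing no tokens of other mentions, can be sketched as a span-containment filter; the (start, end) span representation is an assumption for illustration.

```python
def select_mentions(mentions):
    """Keep only mentions that do not contain the tokens of any other
    (shorter) mention; spans are inclusive (start, end) token indexes."""
    def contains(outer, inner):
        return (outer != inner
                and outer[0] <= inner[0] and inner[1] <= outer[1])
    return [m for m in mentions
            if not any(contains(m["span"], other["span"])
                       for other in mentions if other is not m)]
```

Applied to the example above, the span of ‘an important city in China called Yichang’ contains the span of ‘China’, so the longer mention is dropped while ‘China’ and ‘it’ survive.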
Table 2 OntoNotes statistics

Measure                                       Train    Dev     Test
Total documents count                         1940     222     222
Partitions per document                       1.44     1.55    1.57
Maximum number of partitions in a document    23       21      28
Total partitions count                        2802     343     348
Utterances per partition                      26.83    28      27.24
Maximum number of utterances in a partition   188      127     140
Total utterances count                        75,172   9603    9479
Utterances containing mentions                60,246   7420    7472
Maximum number of tokens in an utterance      210      186     151
Tokens per utterance                          17.28    16.98   17.89
Mentions per utterance                        2.07     1.99    2.09
Maximum number of mentions in an utterance    25       19      18
Coreference clusters per partition            12.54    13.25   13.024
Total coreference clusters count              35,143   4546    4532

Each mention m = (id_m, s_m, e_m) is a triple where id_m indicates the identifier of the referred real-world entity, while s_m and e_m are the start and end indexes indicating the position of the tokens composing the mention in t(u) and of their POS tags in p(u). Distinct mentions m_i and m_j are clustered when they refer to the same real-world entity, i.e., id_{m_i} = id_{m_j}, and only if they belong to the same document partition. More
formally, given I the set of unique identifiers assigned to the real-world entities referred to in a partition P of a document d, a cluster is defined as follows:

C(P ∈ P(d), id ∈ I) = ⋃_{u ∈ S(P)} { (id_m, s_m, e_m) ∈ m(u) : id_m = id }
Summarizing, starting from a source dataset containing n distinct documents d_1, d_2, ..., d_n, this step produces the following dataset a:

a = ⋃_{i=1}^{n} ⋃_{P ∈ P(d_i)} ⋃_{u ∈ S(P)} { (t(u), p(u), m(u), s(u)) }
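As a concrete illustration, the extraction of a quadruple can be sketched as follows. This is not the authors' code: the token, POS, and mention data are invented to mirror the example of Fig. 3, and mentions are represented as hypothetical (entity_id, start, end) spans.

```python
# Illustrative sketch of building the quadruple (t(u), p(u), m(u), s(u)) for one
# utterance: only mentions whose span contains no other mention's tokens are kept.

def innermost_mentions(mentions):
    kept = []
    for m in mentions:
        _, s, e = m
        # discard m if another mention lies entirely inside its span
        contains_other = any(m2 != m and s <= m2[1] and m2[2] <= e for m2 in mentions)
        if not contains_other:
            kept.append(m)
    return kept

tokens = ["an", "important", "city", "in", "China", "called", "Yichang"]
pos = ["DT", "JJ", "NN", "IN", "NNP", "VBN", "NNP"]
mentions = [(7, 0, 6), (3, 4, 4)]  # the whole noun phrase, and 'China' inside it
quadruple = (tokens, pos, innermost_mentions(mentions), "speaker_1")
print(quadruple[2])  # only the inner mention 'China' survives: [(3, 4, 4)]
```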
As an example, Fig. 4 reports a document partition within the dataset a and its associated mention clusters. It is worth noting that mentions belonging to different document partitions are assumed to refer to different real-world entities, i.e., the identifiers of real-world entities expire from one partition to another; thus, the mention clusters belonging to a partition are disjoint from those belonging to another partition.
4.2.2 Utterances filtering
This step is essentially devised to elaborate the dataset a so as to discard undesired utterances. In particular, first of all, given an utterance u ∈ a, u is discarded or not in accordance with the criteria reported in Table 3.

Fig. 2 Distributions of number of tokens and mentions per utterance in OntoNotes

This criterion derives from the consideration that, on the one hand, utterances containing no verbs or composed of only a few tokens should be discarded since they usually show missing or wrong grammatical dependencies. From a strictly linguistic point of view, verbless sentences are more likely to be noun phrases than well-formed sentences. On the other hand, overly long utterances often present a complex syntax that is difficult to understand even for a native-speaking human. The minimum and maximum thresholds used for selecting the utterances to be preserved have been chosen based on the syntactic capacity limitation of human working memory and of computational language models for the correct understanding of the complex syntactic relations of a well-formed sentence [45].
A clarification on the terminology used is needed. For this work, the terms utterance and sentence can be considered equivalent, although this is not precisely true in theoretical linguistics. OntoNotes only refers to utterances, which is why short sentences have been discarded in the proposed methodology. As mentioned above, short sentences tend not to be well-formed precisely because they are not technically sentences conveying a complete meaning. They are utterances, smaller units of speech which do not necessarily have a unit of meaning or a semantic structure.
Thus, the dataset b₁ is generated as follows:

b₁ = a \ {u : u ∈ a ∧ u is discarded}
As an example, in Fig. 5 the same document partition shown in Fig. 4 is considered, where the utterance u₀ is discarded since it contains zero verbs. The utterances u₂, u₃ and u₅ are removed since they are composed of twenty-eight tokens, resulting in intricate, not completely clear syntactic dependencies that are hard to understand.
4.2.3 Mentions simplification
The dataset b₂ is generated by removing the undesired mentions from the dataset b₁, i.e., mentions that can easily lead to ambiguities and inaccuracies in their translation. To this end, mentions composed of either single or multiple tokens are evaluated by computing their dependency trees and using the roots to select the ones to be preserved, on the basis only of the POS tags that can allow for estimating the gender and number of the referred real-world entities (the estimation is performed in a subsequent step).
It is worth noting that dependency tree roots coincide with the mentions themselves when these are made of single tokens.
More formally, given a mention m ∈ b₁, denoted with r_{t(m)} the root of the dependency tree of the tokens t(m) composing m, m is discarded or not in accordance with the criteria reported in Table 4.
In particular, on the one hand, single-token mentions, as well as multi-token mentions containing zero verbs, whose dependency parse root is a Personal pronoun in third person, a Possessive pronoun in third person, a Determiner, a Noun, or a Proper noun, are preserved. In the other cases, they are discarded (note that, according to various studies [46], from 70 to 90% of mentions are pronouns).
This choice is motivated by the fact that these kinds of mentions can enable the identification of the gender and number of the real-world entities referred to by the mentions themselves, which, as a core idea of the proposed methodology, can support the preservation of the verbal agreements between the translated mentions and the other tokens within the translated utterance.
On the other hand, multi-token mentions containing one or more verbs are also discarded: their dependency trees can easily be wrong, raising further ambiguities and inaccuracies in the process. Then, the dataset b₂ is generated as follows:

b₂ = b₁ \ {m = (id_m, s_m, e_m) : m ∈ b₁ ∧ m is discarded}

Fig. 3 Example of utterance contained in the dataset a
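A minimal sketch of the mention filter of Table 4 follows. It is an illustration only, assuming that the dependency root's POS tag, a verb count, and a third-person flag are already available from a parser; the exact tag set used by the authors is not stated in the text.

```python
# A mention survives when it contains no verbs and its dependency-tree root is a
# third-person personal/possessive pronoun, a determiner, a noun, or a proper noun.

KEEP_ROOT_TAGS = {"PRP", "PRP$", "DT", "NN", "NNS", "NNP", "NNPS"}

def keep_mention(root_tag, verb_count, third_person=True):
    if verb_count > 0:
        return False                                    # Table 4, last row
    if root_tag in {"PRP", "PRP$"} and not third_person:
        return False                                    # only third-person pronouns
    return root_tag in KEEP_ROOT_TAGS

print(keep_mention("NNP", 0))  # 'China' -> True
print(keep_mention("CD", 0))   # '1940' is a Numeral -> False
```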
An example related to the document partition shown above is reported in Fig. 6.
In particular, the mention 'China' within the utterance u₀ is preserved since it is a Proper noun. On the contrary, '1940' is discarded since it is a Numeral. Moreover, the mentions 'Taihang Mountain' and 'the Hundred Regiments Offensive' within the utterance u₁ are preserved since their dependency trees exhibit as root, highlighted in bold, a Proper noun.
Fig. 4 Example of document partition and its mentions clusters
Table 3 The criteria followed for evaluating whether or not to preserve an utterance

IF u contains   and                THEN u is
1+ verbs        5 < card(u) < 21   Preserved
1+ verbs        Anything else      Discarded
0 verbs         Whatever           Discarded
Fig. 5 Example of utterances from the dataset a not included in the dataset b₁
4.2.4 Mentions clusters simplification
The dataset b₃ is generated by removing from the dataset b₂ all mention clusters within each partition that have become inconsistent after the previous utterance or mention removals. More formally, a mentions cluster C is discarded or not according to the criteria shown in Table 5.
In detail, a mentions cluster C is preserved when: (1) it is composed of at least two mentions; (2) it contains at least one mention whose dependency tree exhibits as root a Noun or a Proper noun. This second condition is meant to force the cluster to contain at least one mention capable of introducing a referred real-world entity. Clusters with zero elements after the previous removals are automatically discarded since they are meaningless. Then, the dataset b₃ is generated as follows:

b₃ = b₂ \ {m ∈ C : C ∈ b₂ ∧ C is discarded}
The same example document partition shown above is reported in Fig. 7, where some clusters are discarded.
In particular, the mentions cluster C₅ is preserved since it contains two elements, and one of them is the mention 'the Japanese army,' whose dependency tree root is a Noun. On the contrary, the clusters C₀, C₁, C₃, and C₆ are discarded since their cardinality is less than two. For instance, the cluster C₀ becomes inconsistent after the previous removal of the utterance u₀ in the considered partition. The distribution of tokens and mentions per utterance in the dataset b₃ is reported in Fig. 8.
4.2.5 Referred entities estimation
This step aims to estimate the typology, gender, and number of the real-world entity referred to by a mention. This information will be used, in the next step, to determine unique replacement tokens to be positioned in place of the mentions to improve the overall translation while also preserving the verbal agreement.
More formally, given a mention m = (id_m, s_m, e_m) within an utterance u_s ∈ P, with P a document partition, this step is in charge of estimating the class class(id_m) for each m ∈ b₃, where class(id_m) is defined as the triple (type(id_m), gender(id_m), number(id_m)).
In detail, denoted with t_t(m) the ordered list of tokens obtained after the translation of t(m) into the target language, class(id_m) is estimated by means of the following sequence of steps: (1) r_{t(m)} is used to determine all the values of the triple (type(id_m), gender(id_m), number(id_m)); (2) in case some values of the triple cannot be determined from r_{t(m)}, r_{t_t(m)} in the target language is used; (3) finally, in case some values of the triple cannot be determined from either r_{t(m)} or r_{t_t(m)}, they are approximated referring to other mentions m′ ∈ {C(P, id_m) − m} belonging to the same cluster.
More precisely, in the case when r_{t(m)} is a Personal pronoun or a Possessive pronoun, class(id_m) is estimated as reported in Table 6.
In the last three rows, the gender of class(id_m) cannot be determined immediately, and the other mentions belonging to the same cluster are used to approximate it.
In the case when the token r_{t(m)} is a Noun or a Proper noun, the gender and number of class(id_m) cannot be directly deduced if the source language is English, since this information is not typically reported in the POS tags. Then, gender and number are derived from the POS tag generated for the token r_{t_t(m)} in the target language, if reported; otherwise, the other mentions belonging to the same cluster are used to approximate them.
Furthermore, in the case when the token r_{t(m)} is a Determiner, the only option left is to approximate both gender and number by referring to the other mentions belonging to the same cluster.
The estimation of the gender (number) of class(id_m) from the other mentions m′ ∈ {C(P, id_m) − m} belonging to the same cluster is performed by calculating the most frequent gender (number), giving more weight to the genders (numbers) suggested by pronouns than to those suggested by nouns. More formally, the gender and number of class(id_m) are determined as reported in Table 7 and Table 8, respectively.

Table 4 The criteria followed for evaluating whether or not to preserve a mention m

IF m is                     and r_{t(m)} is                        THEN m is
Single token                A Personal pronoun in third person     Preserved
Single token                A Possessive pronoun in third person   Preserved
Single token                A Determiner                           Preserved
Single token                A Noun or a Proper noun                Preserved
Single token                Anything else                          Discarded
Multi-token with 0 verbs    A Personal pronoun in third person     Preserved
Multi-token with 0 verbs    A Possessive pronoun in third person   Preserved
Multi-token with 0 verbs    A Determiner                           Preserved
Multi-token with 0 verbs    A Noun or a Proper noun                Preserved
Multi-token with 0 verbs    Anything else                          Discarded
Multi-token with 1+ verbs   Whatever                               Discarded
As an example, consider two utterances u₁ = 'Lora Owens is the stepmother of Albert Owens.' and u₂ = 'She joins us now by phone.' belonging to the same document partition, and the mentions cluster C₆ = {m₁ ∈ u₁, m₂ ∈ u₂}, where m₁ = 'Lora Owens' and m₂ = 'She.' The class(id_{m₂}) can easily be determined as equal to (human, female, singular), whereas, on the contrary, no information can be inferred for class(id_{m₁}) by evaluating the mention m₁ alone. Thus, it can be estimated on the basis of the values of the other mention m₂ belonging to C₆. Roughly speaking, since the cluster C₆ contains one pronoun suggesting that the referred real-world entity is a female human, this information can be extended also to the other mention to estimate its class.
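The weighted majority vote of Table 7 can be sketched as follows. This is an illustrative reconstruction: the paper only states that pronoun evidence weighs more than noun evidence and that 'male' is the corpus-driven default, so the concrete weights (2 vs 1) are an assumption.

```python
# Estimate the gender of a cluster's referred entity from its mentions,
# weighting pronoun-derived evidence (2) above noun-derived evidence (1).

PRONOUN_GENDER = {"she": "female", "her": "female", "hers": "female", "herself": "female",
                  "he": "male", "him": "male", "his": "male", "himself": "male"}

def estimate_gender(cluster_tokens, noun_genders=()):
    score = {"female": 0, "male": 0}
    for tok in cluster_tokens:                 # pronoun evidence, assumed weight 2
        g = PRONOUN_GENDER.get(tok.lower())
        if g:
            score[g] += 2
    for g in noun_genders:                     # noun evidence, assumed weight 1
        score[g] += 1
    if score["female"] > score["male"]:
        return "female"
    return "male"                              # default is 'male', as in Table 7

print(estimate_gender(["Lora", "Owens", "she"]))  # 'female'
```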
4.2.6 Utterances translation and tokens replacement
This step is devised to perform a sequence of three actions on each utterance u_s ∈ b₃ expressed in the source language, namely tokens replacement, utterances translation, and tokens resolution.

Fig. 6 Example of mentions from the dataset b₁ not included in the dataset b₂

Table 5 The criteria followed for evaluating whether or not to preserve a cluster C

IF            and                                           THEN C is
card(C) ≥ 2   ∃ m ∈ C : r_{t(m)} is a Proper noun or Noun   Preserved
card(C) ≤ 1   Whatever                                      Discarded

Fig. 7 Example of clusters from the dataset b₂ not included in the dataset b₃
First, tokens replacement consists in evaluating, for each mention m ∈ u_s, the triple class(id_m), in order to select a unique token m′ ∉ u_s to be positioned in place of m and, as a result, generate the utterance u′_s. It is worth noting that u′_s = u_s in the case when no mention is contained in u_s.
Replacement tokens are randomly extracted from predefined lists of unique tokens built such that, on the one hand, they exhibit the same type, gender, and number as class(id_m) and, on the other hand, their representations in the source and target language are the same, i.e., r_{t(m′)} = r_{t_t(m′)}. This choice increases the chance that replacement tokens appear unchanged within a translated utterance.
As an example, the utterance u_s = 'Lora Owens is the stepmother of Mary White, she joins us now by phone.' contains three mentions m₁ = 'Lora Owens,' m₂ = 'Mary White,' and m₃ = 'she.' In the hypothesis that class(m₁) = class(m₂) = class(m₃) = (human, female, singular), three replacement tokens m′₁ = 'Gabriella,' m′₂ = 'Serena,' and m′₃ = 'Sabrina' are selected from a list of women's names whose representations in the source and target language are the same. These tokens are positioned in place of m₁, m₂, and m₃ and, as a result, the utterance u′_s = 'Gabriella is the stepmother of Serena, Sabrina joins us now by phone.' is generated.
Second, utterances translation consists, on the one hand, in generating the utterance u′_t by translating u′_s into the target language and, on the other hand, in verifying for each m′ ∈ u′_s its existence in u′_t. In the case when ∃ m′ ∈ u′_s : m′ ∉ u′_t, the token replacement is performed again for the utterance u_s and a distinct token for m is selected.

Fig. 8 Distribution of tokens and mentions per utterance in the dataset b₃

Table 6 The estimation of class(id_m) for Personal and Possessive pronouns

IF r_{t(m)} is equal to                THEN class(id_m) is
'she,' 'her,' 'hers,' or 'herself'     (human, female, singular)
'he,' 'him,' 'his,' or 'himself'       (human, male, singular)
'it,' 'its,' or 'itself'               (thing, ?, singular)
'they,' 'their,' or 'theirs'           (thing, ?, plural)
'them' or 'themselves'                 (thing, ?, plural)

Table 7 The estimation of the gender of class(id_m) for a mentions cluster

IF within the cluster              THEN gender(id_m) is
Female pronouns ≥ male pronouns    Female
Male pronouns ≥ female pronouns    Male
Female nouns ≥ male nouns          Female
Male nouns ≥ female nouns          Male
Otherwise                          Male

The default setting is 'male' due to its frequency of occurrence in the corpus.
As an example, for the utterance u′_s = 'Gabriella is the stepmother of Serena, Sabrina joins us now by phone.', the utterance u′_t = 'Gabriella è la matrigna di Serena, Sabrina si unisce a noi ora per telefono.' is generated.
Third, tokens resolution consists in generating, for each mention m ∈ u_s, the tokens r_{t_t(m)} by translating r_{t(m)} into the target language. Moreover, the utterance u_t is also generated from u′_t by resolving each m′ within u′_t through the positioning of the tokens r_{t_t(m)} in place of m′. It is worth noting that u_t = u′_t in the case when no replacement token is contained in u′_t.
As an example, given the utterance u′_t = 'Gabriella è la matrigna di Serena, Sabrina si unisce a noi ora per telefono.', the replacement tokens m′₁ = 'Gabriella,' m′₂ = 'Serena,' and m′₃ = 'Sabrina' are resolved on the basis of the translated tokens r_{t_t(m₁)} = 'Lora Owens,' r_{t_t(m₂)} = 'Maria Bianca,' and r_{t_t(m₃)} = 'lei' and, as a result, the utterance u_t = 'Lora Owens è la matrigna di Maria Bianca, lei si unisce a noi ora per telefono.' is generated.
As a result of this step, the dataset c is generated.
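The three actions above can be sketched end to end on the worked example. This toy reconstruction is not the authors' pipeline: the MT lookup table stands in for a real machine-translation system, and the placeholder names mirror the example.

```python
# Toy sketch of tokens replacement -> utterances translation -> tokens resolution.

MT = {
    "Gabriella is the stepmother of Serena, Sabrina joins us now by phone.":
        "Gabriella è la matrigna di Serena, Sabrina si unisce a noi ora per telefono.",
    "Lora Owens": "Lora Owens",   # proper name left unchanged by translation
    "Mary White": "Maria Bianca",
    "she": "lei",
}

def translate(text):  # stand-in for a machine-translation call
    return MT[text]

utterance = "Lora Owens is the stepmother of Mary White, she joins us now by phone."
replacements = {"Gabriella": "Lora Owens", "Serena": "Mary White", "Sabrina": "she"}

replaced = utterance
for placeholder, mention in replacements.items():   # 1. tokens replacement
    replaced = replaced.replace(mention, placeholder, 1)
translated = translate(replaced)                     # 2. utterances translation
resolved = translated
for placeholder, mention in replacements.items():   # 3. tokens resolution
    resolved = resolved.replace(placeholder, translate(mention))
print(resolved)
# Lora Owens è la matrigna di Maria Bianca, lei si unisce a noi ora per telefono.
```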
4.3 Linguistic refinement
This step is in charge of applying to the dataset c a set of language-dependent refinement rules based on principles of theoretical linguistics to improve the naturalness and readability of the output text in the target language.
It is necessary to make some minor clarifications about the differences between the two languages under analysis from a linguistic point of view. Italian and English have multiple differences, beginning with their origin, the variability of constituents in word order, and greater or lesser morphological richness. First of all, English is a Germanic language with rigid word order and extremely small inflectional variation [47]; its fixed subject–verb order structure implies a mandatory explicit subject. By contrast, Italian belongs to the Romance subgroup of Italic languages, characterized by high verbal inflection [48] and great freedom in the order of constituents [49]. Such morphological richness leads to a different configuration of the syntactic structures involving pronouns. In particular, it results in the omission of the subject pronoun. As pointed out by recent studies, this misalignment produces difficulties in the translation process, since the missing pronoun is challenging to reproduce and affects the order of dependencies in the sentence [50, 51]. For that reason, from a practical point of view, the refinement rules for the target language have been focused on improving the use of personal and possessive pronouns and, in addition, of possessive and demonstrative adjectives.
Indeed, generally speaking, personal and possessive pronouns often represent the primary part of speech used to co-refer to an entity, as reported in [46, 52]. Moreover, also for the dataset c, a greater distribution of pronouns as single-token coreferences is observed and confirmed, as reported in Table 9.
In more detail, in Italian two specific phenomena typically occur that alter the use of pronouns and adjectives with respect to English, namely the null subject and the agreement and inflexion of morphemes.
The null-subject phenomenon permits an independent utterance to lack an explicit subject. Such truncated utterances have an implied or suppressed subject that can be determined from the context. In particular, null subject languages, like Italian, express person, number, and/or gender agreement with the verb inflexion, making a subject noun phrase redundant. It is worth noting that the lack of an explicit subject does not make an utterance ungrammatical, but it is often perceived as less natural by native speakers. As an example, in the utterance 'Giovanni andò a far visita a degli amici. Per la strada, [egli] comprò del vino' ('John went to visit some friends. On the way, [he] bought some wine'), the subject pronoun 'egli' ('he') is suppressed in Italian. This phenomenon is not present in English, and the strategy of coreference annotation used in OntoNotes for the pronouns is difficult to completely match with a language belonging to a different linguistic family, such as Italian, where pronouns can be omitted when used as the subject of an utterance. Notice that translation involving null-subject languages is still a heavily debated issue in the literature because of the difficulty of representing dropped pronouns [51]. In recent years many studies have addressed the problem, proposing different solutions for different languages [53–55], including Italian [56].
On the other hand, agreement is a morpho-syntactic phenomenon in which the gender and number of the subject and/or objects of a verb must also be indicated by the verbal inflexion.

Table 8 The estimation of the number of class(id_m) for a mentions cluster

IF within the cluster                   THEN number(id_m) is
Singular pronouns ≥ plural pronouns     Singular
Plural pronouns ≥ singular pronouns     Plural
Singular nouns ≥ plural nouns           Singular
Plural nouns ≥ singular nouns           Plural
Otherwise                               Singular

As an example, consider the utterance
'Quello è andato' ('That one is gone'), where the singular masculine subject pronoun 'quello' ('that one') agrees with the past participle 'andato' ('gone') of the verb 'andare' ('go') to which it refers. The past participle, indeed, presents a singular masculine inflexion, as highlighted by the suffix -o, the same as that of the pronoun 'quello.'
In English, pronouns and adjectives do not exhibit any inflection, and thus their agreement with the verbs is not expressed. On the contrary, they must be in concordance with the verbal forms in Italian. As a result, after translating both pronouns and adjectives from English to Italian, their agreement with the verbs must be verified and enforced if it is not respected.
In summary, this step of linguistic refinement is meant to further refine the dataset c by removing non-mandatory subject pronouns and rewriting pronouns and adjectives to ensure correct agreement and inflexion. In the following, more details are given, breaking this step down into two sub-steps, namely (1) Subject Pronouns Deletion and (2) Pronouns and Adjectives Rewrite.
As a result of this step, the dataset d is generated.
4.3.1 Subject pronouns deletion
This step aims to properly handle the null subject phenomenon for the pronouns occurring in the dataset c after the translation into the target language, i.e., Italian. It is in charge of evaluating the utterances within c to (1) delete personal pronouns assuming the subject role in them; (2) move any mention associated with a deleted subject pronoun onto the verb in dependency relation with it.
More formally, given an utterance u ∈ c, denoted with DT(u) its dependency tree, with t_i and t_j the ith and jth elements of the list of tokens t(u), with d(t_j, t_i) ∈ DT(u) a dependency relation from t_j to t_i, and with label(d) the label associated with the typed dependency relation d, the criteria followed for performing the pronouns deletion are reported in Table 10.
In detail, first, a personal pronoun is identified as the subject of a clause contained in an utterance u ∈ c by verifying whether it is connected with a verb through a direct grammatical dependency, typed as subject, in the corresponding dependency tree. Each personal pronoun labeled as subject can be removed.
If no mention is placed on the subject pronoun to be deleted, it is simply removed from the utterance. On the contrary, in case a mention is positioned on it, the mention is moved onto the verbal constituent it is dependent on, as calculated in the corresponding dependency tree, following the approach proposed in the MATE Guidelines [57] and the LiveMemories Corpus [14].
As an example, in the utterance '[Egli] ha detto alla gente che [lei] era una brava cuoca' ('[He] has told people that [she] was a good cook'), the personal pronouns 'Egli' and 'lei' act as subjects of their clauses and can be omitted. The deletion of the subject pronouns 'Egli' and 'lei' generates the shift of the mentions placed on them onto the verbal constituents 'ha' ('has') and 'era' ('was') on which they are dependent.
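The deletion-and-shift of Table 10 can be sketched as follows. This is an illustrative reconstruction, not the authors' code: dependencies are supplied by hand as a hypothetical {dependent: (head, label)} map, whereas a real implementation would obtain them from a parser, and the single-clause case is assumed.

```python
# Delete personal pronouns in a 'subject' dependency with a verb/aux; a mention
# sitting on a deleted pronoun is moved onto the governing verbal constituent.

def delete_subject_pronouns(tokens, pos, deps, mentions):
    """deps: {dependent: (head, label)}; mentions: mutable [id, start, end] lists."""
    subjects = [i for i, tag in enumerate(pos)
                if tag == "PRON" and deps.get(i, (None, None))[1] == "subject"]
    for i in sorted(subjects, reverse=True):
        head = deps[i][0]
        for m in mentions:
            if m[1] == m[2] == i:          # mention placed on the pronoun:
                m[1] = m[2] = head         # move it onto the governing verb
        del tokens[i], pos[i]
        for m in mentions:                 # token indexes after i shift left by one
            m[1] -= m[1] > i
            m[2] -= m[2] > i
    return tokens, mentions

tokens = ["Egli", "ha", "detto", "alla", "gente"]
pos = ["PRON", "AUX", "VERB", "ADP", "NOUN"]
deps = {0: (1, "subject"), 1: (2, "aux")}
print(delete_subject_pronouns(tokens, pos, deps, [[3, 0, 0]]))
# (['ha', 'detto', 'alla', 'gente'], [[3, 0, 0]])  -> mention now sits on 'ha'
```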
4.3.2 Pronouns and adjectives rewrite
This step aims to evaluate each utterance to identify pronouns and adjectives that can be rewritten to improve their compliance with the Italian language concerning the agreement, inflexion, and subject/object role grammatical constraints.
The first set of rules operates on personal pronouns in clauses, verifying and correcting (1) their agreement in number with verbs, in case they assume the role of subjects, and (2) the correspondence between the syntactic role (subject or object) and the inflected form (first or second person singular). More formally, given an utterance u ∈ c, denoted with number(t ∈ t(u)) the number of a token t, indicating whether t is expressed, or is assigned to be, in its singular or plural form, the criteria adopted to rewrite personal pronouns are reported in Table 11.
In detail, in the first rule, a personal pronoun t_i is identified as the subject of a clause contained in an utterance u ∈ c by verifying whether it is connected with a verb t_j through a direct grammatical dependency d(t_i, t_j), typed as subject, in the corresponding dependency tree. Then, the agreement in number between the subject pronoun and the corresponding verb is verified and possibly corrected. As an example, the utterance 'Tu siete nella stanza' ('You are in the room') contains the second person singular personal pronoun 'Tu' ('You') in disagreement with the plural form of the verb 'siete' ('are'). Thus, the personal pronoun is rewritten in the plural form as 'Voi.'
In the next two rules, personal pronouns in the first or second person singular are checked for being preceded by a preposition while wrongly assuming the form of the subject pronoun, i.e., 'io' ('I') and 'tu' ('you'), and corrected with the corresponding form for the object role, i.e., 'me' ('me') and 'te' ('you').

Table 9 Distribution of most frequent single-token coreference POS in the dataset c

Part-of-speech   Percentage
Pronouns         33.6
Proper nouns     10.6
Nouns            8.6
Determiners      6.8
Verbs            1.08
Adverbs          1.04
Adjectives       0.8
In the last two rules, personal pronouns in the first or second person singular are checked for being preceded by the conjunction 'che' ('that') while wrongly presenting their object pronoun form, i.e., 'me' ('me') and 'te' ('you'), and corrected with the corresponding subject form, i.e., 'io' ('I') and 'tu' ('you'). As an example, the utterance 'Non credono che me sia pronto' ('They do not think me am ready') wrongly uses the pronoun 'me' in its object role. Thus, it is rewritten as 'io' ('I') since it has a subject role in the clause introduced by the conjunction 'che' ('that').
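The last four rules of Table 11 can be sketched as a single pass over the tokens. This is an illustration only: the preposition list is a hypothetical subset of Italian simple prepositions, and a real implementation would use POS tags rather than word lists.

```python
# Rewrite subject-form pronouns after a preposition into object forms
# ('io' -> 'me'), and object forms after the conjunction 'che' into
# subject forms ('me' -> 'io'), as in Table 11.

SUBJ_TO_OBJ = {"io": "me", "tu": "te"}
OBJ_TO_SUBJ = {"me": "io", "te": "tu"}
PREPOSITIONS = {"a", "di", "da", "con", "per", "su", "in"}  # assumed word list

def rewrite_pronouns(tokens):
    out = list(tokens)
    for i in range(1, len(out)):
        prev, cur = out[i - 1].lower(), out[i].lower()
        if prev in PREPOSITIONS and cur in SUBJ_TO_OBJ:
            out[i] = SUBJ_TO_OBJ[cur]
        elif prev == "che" and cur in OBJ_TO_SUBJ:
            out[i] = OBJ_TO_SUBJ[cur]
    return out

print(rewrite_pronouns(["Non", "credono", "che", "me", "sia", "pronto"]))
# ['Non', 'credono', 'che', 'io', 'sia', 'pronto']
```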
The second set of rewrite rules evaluates the agreement in gender and number between possessive and demonstrative adjectives and the noun they refer to (typically the noun immediately before or after them), following the criteria reported in Table 12.
In detail, in the first rule, each possessive adjective t_i within an utterance u ∈ c is identified as connected with a noun t_j by means of a direct grammatical dependency d(t_i, t_j), typed as possessive determiner, in the corresponding dependency tree. Then, the agreement in gender and number between the possessive adjective and the corresponding noun is verified and possibly corrected. As an example, the utterance 'Mia padre lavora in banca' ('My father works in a bank') contains the possessive adjective 'Mia' ('My') with the feminine suffix '-a' in disagreement with the masculine singular noun 'padre' ('father') (while the corresponding 'my' in English has no inflection). Thus, it is rewritten as 'mio' with the masculine suffix '-o.'
In the second rule, each demonstrative adjective t_i within an utterance u ∈ c is recognized as related to a noun t_j if the latter occurs at most four tokens forward and is connected through a direct grammatical dependency d(t_i, t_j), typed as a generic determiner, in the corresponding dependency tree.
Then, the agreement in gender and number between the demonstrative adjective and the corresponding noun is verified and possibly corrected. Moreover, the suffix of the demonstrative adjective t_i is also checked and modified on the basis of the initial letters of the token t_{i+1} immediately following t_i in u, as reported in Table 13.
The thresholds concerning the minimum and maximum token number and the distance of demonstratives are inspired by recent studies [58] that have quantitatively estimated the syntactic capacity limitation of human working memory and of computational language models for the correct understanding of the complex syntactic relations of a well-formed sentence.
As an example, the utterance 'Quella avviso è stato redatto nelle ultime 24 ore.' ('That notice has been drafted in the last 24 hours.') contains the demonstrative adjective 'Quella' ('That') with the feminine suffix '-a' in disagreement with the masculine singular noun 'avviso' ('notice'). Thus, the demonstrative adjective is rewritten as 'Quello' with the masculine suffix '-o.' Moreover, since the token following the demonstrative starts with a vowel, 'Quello' is further replaced with its elided form ('Quell'').
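The form selection of Table 13 for the masculine singular demonstrative can be sketched as follows; this is an illustrative reconstruction of the table's conditions, not the authors' code.

```python
# Choose the masculine singular demonstrative form from the first letters of
# the following token, per Table 13: vowel -> "Quell'", s+consonant/ps/gn/x/y/z
# -> "Quello", other consonants -> "Quel".

VOWELS = set("aeiou")

def masc_singular_demonstrative(next_token):
    t = next_token.lower()
    if t[0] in VOWELS:
        return "Quell'"
    if (t[0] == "s" and len(t) > 1 and t[1] not in VOWELS) or \
       t[:2] in {"ps", "gn"} or t[0] in {"x", "y", "z"}:
        return "Quello"
    return "Quel"

print(masc_singular_demonstrative("avviso"))    # "Quell'"
print(masc_singular_demonstrative("studente"))  # "Quello"
print(masc_singular_demonstrative("libro"))     # "Quel"
```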
Finally, the last typology of rewriting rule evaluates whether a demonstrative is used as a pronoun and replaces it with a neuter term, following the criteria reported in Table 14.
In detail, this rule evaluates whether a demonstrative t_i within an utterance u ∈ c is related to a noun t_j within a span of at most four tokens. In the negative case, it is assumed to work as a pronoun and, thus, it can be replaced by a neuter term, preventing possible agreement errors in long-distance syntactic dependencies. As an example, the utterance 'Quella è stato fatto nelle ultime 24 ore.' ('That has been done in the last 24 hours.') contains the demonstrative 'Quella' ('That'), which is not connected with a noun in a span of at most four tokens. Thus, it is replaced by the neuter demonstrative 'Ciò' ('That').

Table 10 The criteria followed for performing pronouns deletion

IF t_i ∈ t(u) is   and t_j ∈ t(u) is   and                                         and                         THEN
Personal pronoun   aux or verb         ∃ d(t_j, t_i) ∈ DT(u): label(d) = subject   ∃ m ∈ m(u): s_m = e_m = i   s_m = e_m = j; t(u) = t(u) − t_i
Personal pronoun   aux or verb         ∃ d(t_j, t_i) ∈ DT(u): label(d) = subject   otherwise                   t(u) = t(u) − t_i

Table 11 The criteria followed for rewriting personal pronouns

IF t_i ∈ t(u) is           and t_j ∈ t(u) is   and                                         THEN
Personal pronoun           aux or verb         ∃ d(t_j, t_i) ∈ DT(u): label(d) = subject   number(t_i) is number(t_j)
Personal pronoun 'io/tu'   Preposition         j = i − 1                                   t_i is 'me/te'
Personal pronoun 'me/te'   Conjunction 'che'   j = i − 1                                   t_i is 'io/tu'
Notice that all the rules aimed at rewriting, deleting, and modifying mentions do not affect the complexity of the language under consideration. The rules do not simplify syntactic phenomena or grammar, but they try to respect the syntax of the target language (Italian) without losing the information on mentions and coreferences present in the source language (English).
5 Results and evaluation
The dataset d obtained after applying the proposed methodology is described in detail in the following in terms of statistics and output format.
Moreover, it is also analyzed both quantitatively and qualitatively to assess the naturalness of its utterances by investigating, first, the change in their readability from a to d and, second, their well-formedness concerning syntactic (grammaticality) and semantic (acceptability) aspects.
Finally, its goodness is also assessed concerning the possibility of being used to train a deep learning model for CR in Italian.
5.1 Dataset description
Table 15 reports an overview of the obtained dataset d, showing the total number of utterances (utts) and the impact of the linguistic refinements, which affect a high percentage of utterances (about 64%, as indicated by refined utts). Table 15 shows that most of the changes are related to pronouns. In particular, the row 'subject pronouns deleted being mentions' indicates that the number of deleted subject pronouns (9848 in total) exceeds the number of applications of the rewriting rules, covering both pronouns and adjectives (9045 in total). Adjectives are involved to a lesser extent in both kinds of rules, as seen from the last three rows of Table 15.
Concerning the linguistic rules applied to generate d, the ones that have found the most application instances are deletions, with the consequent shifts of the coreference onto the verb. This result is reasonably expected since there is a transition from a language with a mandatorily expressed subject to a pro-drop language in which the subject pronoun is systematically missing.
As already mentioned, this has been one of the most
challenging tasks from both a theoretical and a practical
point of view. The transition from a language with an
explicit subject (English) to a pro-drop language (Italian) is
not limited to a deletion process. In fact, it is widespread
for the subject pronoun to be labeled as a mention in
the original dataset, so it has almost always been necessary
to shift the mention without compromising the dependencies
and syntactic structure of the sentence.
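The deletion-with-shift refinement can be illustrated with a minimal sketch. The data layout (a token list plus (start, end) mention spans) is an assumption for illustration, not the authors' code; the point is that after deleting the pronoun the verb slides into its slot, so a mention annotated exactly on the pronoun keeps its indices.

```python
# A minimal sketch (assumed data layout, not the authors' implementation) of
# the pro-drop refinement: the subject pronoun is deleted and the mention
# annotated on it is shifted onto the verb, which slides into the pronoun's
# position, so the (start, end) indices can stay unchanged.

def drop_subject_pronoun(tokens, mentions, pron_idx):
    """Delete the subject pronoun at pron_idx and shift mention spans."""
    del tokens[pron_idx]          # the verb now occupies pron_idx
    shifted = []
    for start, end in mentions:
        if start == end == pron_idx:
            shifted.append((pron_idx, pron_idx))   # now covers the verb
        else:
            # any mention beyond the deleted token moves one slot left
            s = start - 1 if start > pron_idx else start
            e = end - 1 if end > pron_idx else end
            shifted.append((s, e))
    return tokens, shifted

tokens = ["Esso", "era", "facile", "da", "gestire"]
mentions = [(0, 0)]                       # 'Esso' is a mention
tokens, mentions = drop_subject_pronoun(tokens, mentions, 0)
print(tokens[0], mentions)                # era [(0, 0)]
```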
The dataset d has been structured for release in
both CoNLL and JSON formats.[2] Both formats preserve
morpho-grammatical information on the parts of speech of
each element of the utterance. The CoNLL annotation is helpful
to enable an easy interface with the tools and models typically
used in CR (see Fig. 9).
Table 12 The criteria followed for rewriting possessive and demonstrative adjectives

| IF t_i ∈ t(u) is | and t_j ∈ t(u) is | and | THEN |
| Possessive adjective | Noun | ∃ d(t_j, t_i) ∈ DT(u): label(d) = possessive determiner | gender(t_i) is gender(t_j); number(t_i) is number(t_j) |
| Demonstrative adjective | Noun | ∃ d(t_j, t_i) ∈ DT(u): label(d) = determiner ∧ i < j ≤ i + 4 | gender(t_i) is gender(t_j); number(t_i) is number(t_j); suffix(t_i) is set based on t_z[0] and t_z[1], where z = i + 1 |
Table 13 The criteria followed for rewriting possessive and demonstrative adjectives

| Gender(t_j) is | Number(t_j) is | First letter of t_{i+1} is | Modified form of t_i is |
| Masculine | Singular/Plural | Any vowel | Quell' / Quegli |
| Masculine | Singular/Plural | S + consonant, PS, GN, X, Y, Z | Quello / Quegli |
| Masculine | Singular/Plural | Other consonants | Quel / Quei |
| Feminine | Singular/Plural | Any vowel | Quella / Quelle |
| Feminine | Singular/Plural | Any consonant | Quell' / Quelle |
[2] The dataset will be made available upon request at https://nlpit.na.icar.cnr.it/nlp4it.
Neural Computing and Applications (2022) 34:22493–22518 22509
123
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
The JSON version of the dataset is enriched with additional
information. As shown in Fig. 10, it keeps track of the changes
involving utterances and mentions in the generation of the
dataset d, highlighting the data impacted by subject
deletion, pronoun and adjective rewriting, and mention
shifting.
For instance, the original utterance shown in Fig. 10 is
'Esso era facile da gestire una volta che tutti capivano' (It
was easy to manage once everybody understood), which
is modified by a deletion rule as shown in 'modified
text'. The rewritten utterance has a readability score equal
to 79.26; it drops the subject pronoun 'Esso' (It), with a
shift of the mention from 'Esso', as can be seen in 'corefs
old', to the verb 'era' (was). Despite the deletion, the
indices indicating the position of the mention ('start' and
'end') remain unchanged, because the verb takes the
position of the deleted subject pronoun. Notice that this
shifting process, which moves the coreference to the verb, is
consistent with linguistic theory: the centering role of the
verbal phrase within the sentence reflects theoretical
aspects inherent in the hierarchical dependencies of the
sentence constituents.
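Under the assumption that the JSON fields mirror the description above, a record like the one in Fig. 10 can be sketched as follows; the field names ("text", "modified_text", "readability", "corefs_old", "corefs_new", "start", "end") are assumptions based on that description, not the released schema.

```python
# A hedged sketch of a JSON record as described for Fig. 10; field names
# are assumptions, not the released schema.
import json

record = {
    "text": "Esso era facile da gestire una volta che tutti capivano",
    "modified_text": "era facile da gestire una volta che tutti capivano",
    "readability": 79.26,
    "corefs_old": [{"mention": "Esso", "start": 0, "end": 0}],
    # after the deletion the verb 'era' takes the pronoun's position,
    # so the indices are unchanged
    "corefs_new": [{"mention": "era", "start": 0, "end": 0}],
}
print(json.dumps(record, ensure_ascii=False, indent=2))
```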
5.2 Readability assessment
The first evaluation of the resulting dataset d is performed
quantitatively with respect to the criterion of readability.
In natural language, readability is defined as the ease
with which a reader can understand a written text. It
depends on lexical factors (i.e., the complexity of the vocabulary
used) and syntactic factors (i.e., the presence of nested
subordinate clauses). Several readability scores exist in the
literature, which provide a way to assess a written text's
quality automatically.
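For illustration, a rough computation in the spirit of the Flesch-Vacca index (commonly reported as 206 - 0.65*S - P, with S the syllables per 100 words and P the average number of words per sentence) can be sketched as follows. The syllable counter is a crude vowel-group heuristic, so the numbers are indicative only; this is not the exact procedure used in the paper.

```python
# A rough sketch of a Flesch-Vacca-style readability score; the syllable
# counter approximates Italian syllables as maximal vowel groups, so treat
# the result as indicative only.
import re

def count_syllables(word):
    # approximate Italian syllables as maximal vowel groups
    return max(1, len(re.findall(r"[aeiouàèéìòù]+", word.lower())))

def flesch_vacca(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-zàèéìòù]+", text.lower())
    syllables = sum(count_syllables(w) for w in words)
    s_per_100 = 100.0 * syllables / len(words)   # S: syllables per 100 words
    p = len(words) / len(sentences)              # P: words per sentence
    return 206 - 0.65 * s_per_100 - p

score = flesch_vacca("Era facile da gestire una volta che tutti capivano.")
print(round(score, 2))
```

Higher values correspond to easier text, matching the bands used in Table 16.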
To this aim, for this work, both the readability score
based on the Flesch-Vacca index [16] and the Flesch reading
ease test, the former being the adaptation of the latter to the
Italian language, have been calculated for the utterances
within the datasets d and a.
Table 16 shows the readability scores for the dataset a (in
English) and the dataset d (in Italian). The percentage of
utterances falling into each readability range is presented
in each row.
The table shows that the readability of the dataset d,
expressed in the target language, is comparable to that of the
dataset a, expressed in the source language. This result
suggests that the proposed methodology has not significantly
altered the overall readability of the utterances.
Moreover, there is a noticeable increase in the class grouping
utterances with scores above 80 (an improvement of 4.6
percentage points). As can be noted, there is also a significant
drop in the percentage of utterances with readability
between 40 and 60, a band typically judged inconsistently.
This result is an expected outcome, since the greater the
readability, the greater the agreement between the annotators [59].
However, even if this readability assessment gives a
rough idea of the validity of the proposed methodology, it
is not without limitations, since the readability scores used
are still debated in the literature [60]. For instance,
polysyllabic words significantly affect the score, and the metrics
weigh the lexicon more heavily than the syntax.
Furthermore, a readable utterance is characterized by a
linear syntax and a simple vocabulary, but it can still contain
infelicities that make it ill-formed.
5.3 Grammaticality and acceptability assessment
The second evaluation of the resulting dataset d is performed
qualitatively, to overcome the limitations of these
readability scores, by considering the criteria of grammaticality
and acceptability.
These criteria have a long history in theoretical
linguistics [61]. In detail, grammaticality refers to utterances that
are correct from a syntactic and structural point of view
according to the annotator's judgements; by contrast,
acceptability assesses whether an utterance is semantically
valid according to the annotator's judgements. In other
words, grammaticality is not necessarily associated with
semantic correctness or acceptability. Rather, it refers to a
well-formed utterance, i.e., one which conforms to Italian
Table 14 The criteria followed for rewriting demonstrative pronouns

| IF t_i ∈ t(u) is | and t_j ∈ t(u) is | and | THEN |
| Demonstrative pronoun | Noun | ∄ d(t_j, t_i) ∈ DT(u): label(d) = determiner ∧ i < j ≤ i + 4 | t_i is 'Ciò' |
Table 15 The impact of the linguistic refinements over the dataset d

| | Train | Test | Dev |
| utts | 44,073 | 5415 | 5363 |
| Refined utts | 28,216 | 3512 | 3471 |
| Subject pronouns | 34,974 | 3904 | 3893 |
| Subject pronouns being mentions | 14,511 | 1871 | 1517 |
| Subject pronouns deleted being mentions | 8764 | 1111 | 973 |
| Pronouns and adjectives | 37,611 | 6816 | 6728 |
| Pronouns and adjectives being mentions | 14,623 | 1887 | 1528 |
| Pronouns and adjectives rewritten | 7346 | 866 | 853 |
grammar rules. By contrast, acceptability may consider
aspects that can only be inferred by a native speaker,
such as cohesion or the naturalness of the utterance.
Therefore, an utterance may be perfectly valid from a
structural point of view but not be semantically
comprehensible. As an example, the utterance 'Major League
Baseball ha preso 76 dei suoi pipistrelli e li ha radiografati
per il sughero.' ('Major League Baseball has taken 76 of
its bats and X-rayed them for corkage.') is grammatical
because all the constituents are in the right place and it
does not violate structural constraints. Still, no native
speaker would perceive it as meaningful. In this
case, the error lies in the translation of specialized terms related
to the sports domain. In particular, the term 'bat' is
ambiguous because it can refer either to the object (the wooden
club used in baseball to hit the ball) or to the
animal (as in the incorrect translation 'pipistrello').
The second assessment concerning these two criteria has
been carried out by considering a sample of 1000 instances
extracted from the dataset d, with 200 utterances for each
readability class reported in Table 16. The extraction has
not been performed entirely at random, but has been
guided by nonprobability sampling, which is more suitable
for qualitative data. The assessment has involved three
human native speakers, who were asked to manually and
independently label that sample by specifying, for each
utterance, both its grammaticality and acceptability.
The overall agreement between these three raters
concerning their annotations of grammaticality and
acceptability has been measured using the Observed Agreement
index [62]. This index gives a good approximation of
annotators' agreement in contexts with many annotators,
also offering robustness against imperfect (textual) data
[63]. The index calculates the number of generated utterances
with majority agreement and reports that number
as a percentage of the total number of utterances annotated
by all the annotators. Grammaticality and acceptability
have been elicited using a forced-choice binary task [64],
following most of the linguistic methodology in this area
[65].
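One common formulation of observed agreement, the average proportion of agreeing rater pairs per item, can be sketched as follows; the exact index of [62] may be defined differently, so this is illustrative only.

```python
# A small sketch of a pairwise observed-agreement computation over binary
# (forced-choice) labels from three raters; this is one common formulation,
# not necessarily the exact index used in the paper.
from itertools import combinations

def observed_agreement(labels_per_item):
    """Average proportion of agreeing rater pairs per item."""
    totals = []
    for labels in labels_per_item:
        pairs = list(combinations(labels, 2))
        agreeing = sum(1 for a, b in pairs if a == b)
        totals.append(agreeing / len(pairs))
    return sum(totals) / len(totals)

# three raters, binary grammaticality judgements on four utterances
ratings = [[1, 1, 1], [1, 1, 0], [0, 0, 0], [1, 0, 0]]
print(observed_agreement(ratings))
```

Unanimous items contribute 1.0 and two-against-one items contribute 1/3, so the score rewards full consensus.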
Table 17 shows the percentage of agreement between
annotators for each readability class.
The total agreement value has been measured as equal to
0.78 for grammaticality and 0.73 for acceptability.
According to the grid for the interpretation of the
coefficients proposed by [66], the values obtained indicate
'substantial agreement' concerning both grammaticality
and acceptability.
Humans' judgements seem to be consistent with the
readability scores; a higher readability corresponds to a better
agreement, thus a lower presence of ill-formed utterances.
The agreement among the raters regarding grammaticality
increases progressively (from 0.77 to 0.80) across the
readability classes. This phenomenon can be explained by
the fact that readability is essentially based on the utterance
structure, i.e., syntax, which is the object of the
grammaticality judgement.

Fig. 9 Example of CoNLL format

Fig. 10 Example of JSON format
The situation is not different as regards acceptability.
First, an unsurprising slight worsening of the scores for the
lower classes has been highlighted; lower agreement
between annotators is quite common, especially in
semantic tasks [67]. However, moving on to the classes
containing the most readable utterances, the values are
comparable to those of grammaticality. The utterances
considered most readable are also those that create the
least disagreement among the annotators, with the
highest percentage of acceptable utterances.
In summary, the grammaticality and acceptability
assessment has shown that the proposed
methodology can generate utterances that respect
syntactic well-formedness and are perceived as natural by
native speakers, with a good level of agreement. The use of
linguistic refinement rules helps reduce phenomena that
could violate grammatical constraints (as in the case of
rewrite rules) or the perceived naturalness of the sentence (as
in the case of the null subject).
5.4 Linguistic and qualitative assessment
As a further evaluation, factors other than
readability and annotators' judgements have been
considered.
The utterances contained in the sample have been analyzed
at different levels of linguistic analysis, including
lexical, morphological, and syntactic features. The considered
factors range from lexical richness to the complexity of the
sentences, the presence of subordinates, and the vocabulary used.
They are summarized in Table 18. The values in Table 18 show
that syntactic complexity undergoes a progressive
simplification from the class comprising the least readable
sentences (<20) to the most readable ones (>80). Sentences
are shorter (they move from an average length of
12.9 tokens to 7.9), and subordinating conjunctions are
halved to the benefit of increased coordinating ones.
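Two of the lexical features in Table 18 can be sketched under simple assumptions. The content-word classes used for lexical density below are illustrative, as is the toy POS-tagged input; the paper's tool chain is not specified here.

```python
# A sketch of two lexical features from Table 18: type-token ratio
# (distinct tokens / total tokens) and lexical density (content words /
# total tokens). The content-POS set and the tagged example are
# illustrative assumptions.

CONTENT_POS = {"NOUN", "VERB", "ADJ", "ADV"}  # assumed content classes

def type_token_ratio(tokens):
    return len(set(t.lower() for t in tokens)) / len(tokens)

def lexical_density(tagged):
    """tagged: list of (token, pos) pairs."""
    content = sum(1 for _, pos in tagged if pos in CONTENT_POS)
    return content / len(tagged)

tagged = [("i", "DET"), ("seguaci", "NOUN"), ("tornarono", "VERB"),
          ("a", "ADP"), ("casa", "NOUN")]
print(type_token_ratio([t for t, _ in tagged]))  # 1.0 (all tokens distinct)
print(round(lexical_density(tagged), 2))         # 0.6
```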
Second, this trend of syntactic and lexical simplification
has been visually inspected by qualitatively examining
some examples extracted from the dataset.
Table 19 collects a set of utterances for each readability
class. The table is structured to visualize all possible
combinations of raters' judgements. The first column is
dedicated to the different readability classes; the second one
shows the id of each utterance. After that, two columns
indicate whether the utterance has been evaluated as grammatical
(G) or acceptable (A) by the human raters. The examples in
Table 19 show that utterances in the less readable classes
tend toward hypotaxis, with the presence of various types
of subordinate clauses, whereas the highly readable classes
prefer elementary one-verb sentences. This outcome occurs in
both well-formed and ill-formed utterances.
For instance, the utterance having Id = 1d, 'Il governo
degli Stati Uniti pensa che i radicali, commentatori
antiamericani e religiosi sono diventati ospiti frequenti
all'emittente televisiva al Jazeera.' ('The US government
believes that radical, anti-American and religious
commentators have become frequent guests at the Al Jazeera
television station.'), has a readability score lower than 20,
so it is challenging to read, but it is perceived as
grammatical and acceptable by the raters, even if it has a
subordinate clause introduced by 'che' ('that') and a long-distance
dependency between the plural masculine noun
'commentatori' ('commentators') and the noun with the role of subject
predicate 'ospiti' ('guests').
A similar syntactic structure is provided by the utterance
in the class 20-40 having Id = 2a, 'Quindi Michelle le
autorità davvero credere che questo testimone per quanto
riguarda la discarica credibile, non essi?' ('So Michelle,
do the authorities really to think this witness regarding the
landfill [is] credible, not they?'), which is full of errors
that make it ungrammatical and difficult for a native
speaker to understand. In detail, the verb appears in its
infinitive form 'credere' ('to think') and is not inflected
in agreement with the subject noun 'autorità' ('authorities').
Moreover, there is no verb connected to the subject
complement 'credibile' ('credible'), and there is a noun
Table 16 Comparison of readability scores before and after linguistic refinement

| Score | a (sentence %) | d (sentence %) | Description |
| >80 | 38.9 | 43.5 | Very easy to read |
| 80-60 | 33.7 | 36.05 | Fairly easy to read |
| 60-40 | 17.4 | 15.4 | Fairly difficult to read |
| 40-20 | 7.2 | 3.8 | Difficult to read |
| <20 | 2.06 | 1.1 | Extremely difficult to read |
Table 17 Annotator agreement for different readability classes

| | <20 | 20-40 | 40-60 | 60-80 | >80 | Total |
| Grammaticality | 0.77 | 0.78 | 0.70 | 0.80 | 0.80 | 0.78 |
| Acceptability | 0.64 | 0.64 | 0.75 | 0.83 | 0.81 | 0.73 |
phrase 'non essi' ('not they') at the end of the utterance,
completely disconnected from the syntactic structure.
In the higher readability classes, subordinate clauses are reduced
and coordination prevails in the syntactic structure. However,
this syntactic simplification does not necessarily
correspond to greater comprehensibility. As mentioned
above, readability tests evaluate the complexity of the
lexicon and the structure of the utterance. Shorter utterances
with no subordinates are not always what human raters
consider semantically meaningful or grammatically
correct.
For instance, the utterances having Id = 5c, 'Poi i
seguaci tornò a casa' ('Then the followers has gone
home'), and Id = 5d, 'La grazia di Dio sia con te' ('God's
grace be with you'), have a similar one-verb structure
without any type of syntactic or lexical complexity.
However, the utterance 5c contains a grammatical infelicity,
with a wrong agreement between the 3rd person singular
verb 'tornò' ('has gone') and the plural subject
noun 'seguaci' ('followers').
In summary, readability scores obtained automatically
have proven to be consistent with the raters’ judgments,
allowing sentences to be grouped into classes that are in
line with grammaticality and acceptability. However, it
should be noted that there are numerous other linguistic
variables affecting readability that are independent of the
metrics used, but this is outside the scope of this work.
5.5 Effectiveness assessment as training dataset
The last evaluation has been performed to assess the
goodness of the generated dataset concerning the possibility
of being used to train a deep learning model for CR
in Italian. To this aim, a baseline model has been trained on
the dataset by adopting a state-of-the-art deep learning
architecture proposed for the same task in English. In
detail, the coreference model proposed by [44] has been
used,[3] exploiting BERT in its base (cased) version.[4]
This choice is justified by the fact that this model has
proven to be effective in the CR task in English, as shown
in [68-70].
To the best of our knowledge, no other available
implementation exists for the particular CR task in Italian.
In detail, the architecture of BERT is characterized by
12 encoder layers, known as Transformer Blocks, 12
attention heads (Self-Attention, as introduced in [71]),
and feedforward networks with a hidden size of 768.
Each training session has been fixed at 24 epochs, with a
learning rate varying from 0.1 to 0.00001. More
architectural details and training hyperparameters are reported
in Table 20. All experiments have been performed on a
deep learning workstation with 40 Intel(R) Xeon(R)
E5-2630 v4 CPUs @ 2.20 GHz, 256 GB of RAM, and 4
GeForce GTX 1080 Ti GPUs; the operating system is
Ubuntu Linux 16.04.7 LTS.

Table 18 Different features affecting the readability on the sample considered

| Feature | <20 | 20-40 | 40-60 | 60-80 | >80 |
Lexical features
| Average length (tokens) | 12.9 | 13.7072 | 12.1082 | 11.4072 | 7.9 |
| Type-token ratio | 0.8 | 0.545 | 0.531 | 0.513 | 0.65 |
| Lexical density | 0.585 | 0.545 | 0.531 | 0.513 | 0.536 |
| Nouns | 16.30% | 15.10% | 12.40% | 13.40% | 10.10% |
| Proper nouns | 5.60% | 6.10% | 5.00% | 6.00% | 5.30% |
Morphologic features
| Adjectives | 5.90% | 5.70% | 5.20% | 3.70% | 3.90% |
| Verbs | 18.00% | 20.20% | 22.40% | 20.20% | 22.20% |
| Conjunctions | 4.30% | 4.80% | 5.50% | 4.70% | 5.80% |
| Coordinating conjunctions | 59.60% | 60.70% | 52.20% | 69.20% | 76.10% |
| Subordinating conjunctions | 40.40% | 39.30% | 47.80% | 30.80% | 23.90% |
| Average number of clauses per utterance | 1.816 | 1.985 | 2.01 | 1.723 | 1.507 |
| Independent clauses | 71.20% | 69.10% | 68.20% | 76.90% | 92.20% |
| Subordinate clauses | 28.80% | 30.90% | 31.80% | 23.10% | 7.80% |
Syntactic features
| Average word number per clause | 7.104 | 6.897 | 6.007 | 6.603 | 5.215 |
| Average DPT depth | 4.595 | 4.813 | 4.456 | 4.436 | 3.015 |
| Average depth of noun phrase | 1.133 | 1.131 | 1.134 | 1.142 | 1.064 |
| Average depth of subordinate chain | 1.345 | 1.29 | 1.265 | 1.115 | 1.167 |
| Average length of dependency relations | 1.913 | 1.906 | 1.848 | 1.77 | 1.795 |

[3] https://github.com/lxucs/coref-hoi
[4] https://huggingface.co/dbmdz/bert-base-italian-xxl-cased

Using the training split of the created
dataset, the results have been derived by averaging the
performance of the coreference model over five repetitions
and finally reporting the arithmetic mean of the results,
rounded to the second decimal place. Table 21 reports the
results obtained with three different metrics: MUC [72],
B3 [73] and CEAFφ4 [74].
MUC provides a good measure of the interpretability
achieved by the model, which indicates the goodness in the
prediction of mentions and coreference links among them.
Table 19 Visual examples of sentences for each readability class and raters' judgements

| Class | Id | G | A | Utterance |
| <20 | 1a | - | + | Quattro esplosioni strappare attraverso la metropolitana vedere (Four explosions rip through the metro see) |
| <20 | 1b | - | + | Deputati dell'opposizione stanno esprimendo suo malcontento (Opposition deputies are expressing his dissatisfaction.) |
| <20 | 1c | + | - | Occasionalmente, il danno cromosoma lordo era visibile (Occasionally, gross chromosome damage was visible.) |
| <20 | 1d | + | + | Il governo degli Stati Uniti pensa che i radicali, commentatori antiamericani e religiosi sono diventati ospiti frequenti all'emittente televisiva al Jazeera (The US government believes that radical, anti-American and religious commentators have become frequent guests at the al Jazeera television station.) |
| 20-40 | 2a | - | - | Quindi Michelle le autorità davvero credere che questo testimone per quanto riguarda la discarica credibile, non essi? (So Michelle, do the authorities really think this witness regarding the landfill is credible?) |
| 20-40 | 2b | - | + | Essi ha risposto, Si, Signore, crediamo (They replied, Yes, Lord, we believe.) |
| 20-40 | 2c | + | - | riadattamento per quanto riguarda gli americanismi (readjustment with regard to Americanisms) |
| 20-40 | 2d | + | + | chiaramente crediamo che Davis stava resistendo (We clearly believe that Davis was resisting.) |
| 40-60 | 3a | - | - | sua politica sono incorporati nella scrittura Di suo e suo scrivere è prima di tutto una celebrazione della libertà (his politics are embedded in his writing and his writing is first and foremost a celebration of freedom) |
| 40-60 | 3b | - | + | Poi trascorrere più tempo con Loro, e incoraggiare Loro per ottenere più esercizio fisico e prendersi cura di Stessi (Then spend more time with them and encourage them to get more exercise and take care of themselves.) |
| 40-60 | 3c | + | - | e che include Cinquanta centesimi troppo (and that includes Fifty Cents Too) |
| 40-60 | 3d | + | + | In primo luogo, Stoccolma ha speso 180 milioni di dollari per i miglioramenti dei trasporti prima dell' l'esperimento (First, Stockholm spent 180 million dollars on transport improvements before the experiment) |
| 60-80 | 4a | - | - | Il video suona anche le voci di coloro che si accanto al cadavere parlando tra loro (The video also plays the voices of those next to the corpse talking to each other) |
| 60-80 | 4b | - | + | Avranno bisogno di un' enorme somma di denaro per portare i bambini Loro in città (They will need a huge amount of money to bring their children to the city.) |
| 60-80 | 4c | + | - | Ho chiesto a i tuoi seguaci di forzare lo spirito malvagio fuori (I have asked your followers to force the evil spirit out.) |
| 60-80 | 4d | + | + | Questo è l'insegnamento che avete sempre sentito: dobbiamo amarci l'un l'altro (This is the lesson you have always heard: we must love one another.) |
| >80 | 5a | - | - | Lui ha messo le mani di suo su Suo, e subito Lei è riuscita a stare dritta. (He put his hands on hers, and immediately she managed to stand up straight.) |
| >80 | 5b | - | + | la cosa con il Golan per dare esso indietro di non dare esso indietro non lo so (the thing with the Golan to give it back not to give it back I do not know.) |
| >80 | 5c | + | - | Poi i seguaci tornò a casa. (Then the followers went home.) |
| >80 | 5d | + | + | La grazia di Dio sia con te. (God's grace be with you) |
However, MUC lacks discriminability, i.e., the capability
to distinguish between good and bad decisions. On the
contrary, B3 and CEAFφ4 lack interpretability, but they
measure discriminability. Since none of the metrics is
reliable if taken individually, it is common practice to use
the average of the three as the overall metric.
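As a sanity check, the averaging just described can be reproduced from the recall and precision values reported in Table 21; each F1 is the standard harmonic mean of recall and precision, and the overall score is their plain average.

```python
# Recomputing the combined score from the per-metric results in Table 21:
# each F1 is the harmonic mean of recall and precision, and the overall
# score is the plain average of the three F1 values.

def f1(recall, precision):
    return 2 * precision * recall / (precision + recall)

muc = f1(73.44, 79.56)
b3 = f1(64.19, 70.83)
ceaf = f1(59.25, 72.24)
avg = (muc + b3 + ceaf) / 3
print(round(muc, 2), round(b3, 2), round(ceaf, 2), round(avg, 2))
```

Up to rounding, this reproduces the reported per-metric F1 values and an overall average of about 69.6.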
As shown in Table 21, MUC has achieved the best
precision and recall. CEAFφ4, instead, has the lowest scores,
especially concerning recall (about 59.25). B3 provides
scores quite similar to those obtained with CEAFφ4.
On average, the model has achieved an F1 of about 69.60,
which is comparable with the averaged F1 obtained by the
same model on the English version of the OntoNotes
dataset (about 73.9).
As an example, the sentences extracted from the dataset and
shown in Table 22 present cases of correct mention
predictions and wrong ones. Predictions are indicated in bold,
while the mentions to which the predictions refer are shown in
small caps.
Concerning the analysis of the typology of errors, in the
first sentence '[Essi] hanno scritto oggi' (They wrote
today) the correctly predicted mention occurs as a pronoun in
the English text and has been shifted onto the verb in the
Italian one, due to the drop of the subject pronoun 'Essi'
(They). The second example presents a linear subject-verb-object
sentence with an explicit subject. In this case,
the proper noun acting as part of the subject occurs in a prepositional
phrase, 'L'ex avvocato di Clinton' (Clinton's former
lawyer), and it is correctly predicted. Moving to the analysis
of incorrectly recognized predictions, it is possible to
note that a more complex syntax affects the predictions.
For instance, in the first example (first sentence of the
wrong-prediction row) the utterance contains a dative
construction with a clitic pronoun 'Ci' (literally 'us') preceding
the mention 'riferivamo' (were referring) and an enclitic
form merged with the verb as the suffix -lo for the
coreference 'farlo' (to do that). Finally, in the last
example, BERT misses the correct assignment when the
mention occurs as an indirect object introduced by a
preposition, 'a questo' (about this).
In spite of special cases such as those described above
(clitics, convoluted syntax), these results have shown the
effectiveness of the proposed methodology, providing a
new dataset for CR in Italian and setting a baseline for
future developments of this line of research.
6 Conclusions and future work
This work presents a methodology for creating a dataset for
CR in Italian starting from a resource originally designed for
English. This approach can guarantee a quality comparable
to manual annotation while reducing the time and effort it
requires. Starting from OntoNotes, the methodology
has been articulated in two macro-steps.
The first macro-step is focused on the generation of a
corpus in the target language. This step first extracts from
OntoNotes the information of interest, such as documents,
partitions, utterances, and mentions, while discarding
irrelevant information and mentions whose tokens are
contained in other mentions. Then, utterances and mentions
are translated through an intelligent token replacement/
resolution procedure guided by the estimation of the
typology, gender, and number of the real-world entities
referred to by each mention. The second macro-step is
focused on linguistic refinement. This step first tries to
correct all the infelicities introduced in the translation concerning
aspects of the Italian language not present in English (i.e.,
gender and number agreement). Then, it attempts to make
the translated utterances more natural as perceived by a native
speaker (null subject).
The well-formedness and naturalness of the generated
dataset have been confirmed by means of a quantitative and
qualitative assessment, which has evaluated readability on
all the utterances of the final dataset, and grammaticality
and acceptability on a sample of 1000 utterances extracted
from five different readability classes by three human
native speakers. A correlation between the readability score
and the raters' judgements has also been highlighted, with
utterances featuring poor readability showing the highest
disagreement among human raters for both grammaticality
and acceptability. The goodness of the dataset has also
been assessed by training a CR model based on BERT,
Table 20 Hyper-parameters

| Hyperparameter | Value |
| Epochs | 24 |
| Dropout | 0.3 |
| Learning rate | from 0.1 down to 0.00001 |
| Loss | Marginalized |
| Feature embedding size | 20 |
| Max span width | 30 |
| Max training sentences | 6 |
| Max segment length | 256 |
| Dimensions hidden state | 256 |
| Number of attention heads | 12 |
| Number of hidden layers | 12 |
| Hidden size | 768 |
| Parameters | 110 M |
| Vocabulary size | 32,102 |
achieving promising results and thus fixing a reference
point in terms of performance for future comparisons.
It is worth noting that, for this work, English has been
considered as the source language and Italian as the target
one, due to the high and limited number of existing
resources for them, respectively. However, the
methodology is not strictly dependent on these two
languages and can be easily applied to other languages by
only adapting a small set of linguistic rules.
From a methodological perspective, even if the quality
of the final dataset is appreciable, it leaves room for some
future improvements. First, a more extensive list of
refinement rules regarding other linguistic phenomena of
the Italian language will be considered to enhance the
naturalness of the translated utterances. Second, utterances
with more complex syntactic structures will be handled to
improve readability, grammaticality and acceptability.
From an applicative perspective, the dataset will be used to
train novel and better performing models for the task of CR
in Italian.
Data availability The dataset described in this study will be available
at the address https://nlpit.na.icar.cnr.it/nlp4it/#/datasets/.
Declaration

Conflict of interest The authors declare that they have no known
competing financial interests or personal relationships that could have
appeared to influence the work reported in this paper.
Open Access This article is licensed under a Creative Commons
Attribution 4.0 International License, which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as
long as you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons licence, and indicate
if changes were made. The images or other third party material in this
article are included in the article’s Creative Commons licence, unless
indicated otherwise in a credit line to the material. If material is not
included in the article’s Creative Commons licence and your intended
use is not permitted by statutory regulation or exceeds the permitted
use, you will need to obtain permission directly from the copyright
holder. To view a copy of this licence, visit http://creativecommons.
org/licenses/by/4.0/.
References
1. Sukthanker R, Poria S, Cambria E, Thirunavukarasu R (2020)
Anaphora and coreference resolution: a review. Inform Fusion
59:139–162
2. Antunes J, Lins RD, Lima R, Oliveira H, Riss M, Simske SJ
(2018) Automatic cohesive summarization with pronominal
anaphora resolution. Comput Speech Lang 52:141–164
3. Sikdar UK, Ekbal A, Saha S (2016) A generalized framework for
anaphora resolution in Indian languages. Knowl Based Syst
109:147–159
4. Blackwell SE (2001) Testing the Neo-Gricean pragmatic theory
of anaphora: the influence of consistency constraints on inter-
pretations of coreference in Spanish. J Pragmat 33(6):901–941
5. Lee C, Jung S, Park C-E (2017) Anaphora resolution with pointer
networks. Pattern Recogn Lett 95:1–7
6. Stylianou N, Vlahavas I (2021) A neural entity coreference res-
olution review. Expert Syst Appl 168:114466
7. Clark K, Manning CD (2016) Deep reinforcement learning for
mention-ranking coreference models. arXiv preprint arXiv:1609.08667
8. Zheng J, Chapman WW, Crowley RS, Savova GK (2011)
Coreference resolution: a review of general methodologies and
applications in the clinical domain. J Biomed Inform
44(6):1113–1122
9. Hirschman L, Chinchor N (1997) MUC-7 proceedings. Science
Applications International Corporation. See www.muc.saic.com
10. Pradhan S, Moschitti A, Xue N, Uryupina O, Zhang Y (2012)
Conll-2012 shared task: modeling multilingual unrestricted
coreference in ontonotes. In: Joint conference on EMNLP and
CoNLL-shared task, pp 1–40
Table 21 Results achieved with a BERT-based CR model

| Metric | R | P | F1 |
| MUC | 73.44 | 79.56 | 76.38 |
| B3 | 64.19 | 70.83 | 67.34 |
| CEAFφ4 | 59.25 | 72.24 | 65.10 |

avg F1: 69.60
Table 22 Examples of correct and wrong predictions (bold) with respect to mentions (small caps)

Correctly predicted:
- HANNO SCRITTO oggi / Lei non ha condiviso le note con loro
  (THEY wrote today / She did not share the notes with them)
- L'ex avvocato di CLINTON / Sono mosse accuse contro di lui
  (CLINTON'S former lawyer / Allegations are made against him)

Wrong predicted:
- CI RIFERIVAMO a esso / è sempre difficile farlo
  (WE were referring to it / it is always difficult to do that)
- Ho pensato a lungo a QUESTO / Molti criticano ciò
  (I thought about this for a long time / Many people criticise this)

In brackets the English text
22516 Neural Computing and Applications (2022) 34:22493–22518
123
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
11. Recasens M, Hovy E (2011) BLANC: implementing the Rand index for coreference evaluation. Nat Lang Eng 17(4):485–510
12. Poesio M, Delmonte R, Bristot A, Chiran L, Tonelli S (2004) The Venex corpus of anaphora and deixis in spoken and written Italian. University of Essex
13. Magnini B, Pianta E, Girardi C, Negri M, Romano L, Speranza M, Bartalesi V, Sprugnoli R (2006) I-CAB: the Italian Content Annotation Bank. In: 5th international conference on language resources and evaluation (LREC 2006), pp 963–968
14. Rodríguez KJ, Delogu F, Versley Y, Stemle EW, Poesio M (2010) Anaphoric annotation of Wikipedia and blogs in the Live Memories corpus. In: Proceedings of LREC, pp 157–163
15. Hovy E, Marcus M, Palmer M, Ramshaw L, Weischedel R (2006) OntoNotes: the 90% solution. In: Proceedings of the human language technology conference of the NAACL, companion volume: short papers, pp 57–60
16. Franchina V, Vacca R (1986) Adaptation of Flesch readability index on a bilingual text written by the same author both in Italian and English languages. Linguaggi 3:47–49
17. Devlin J, Chang MW, Lee K, Toutanova K (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805
18. Pradhan SS, Ramshaw L, Weischedel R, MacBride J, Micciulla L (2007) Unrestricted coreference: identifying entities and events in OntoNotes. In: International conference on semantic computing (ICSC 2007). IEEE, pp 446–453
19. Grishman R, Sundheim BM (1996) Message understanding
conference-6: a brief history. In: COLING 1996 volume 1: The
16th international conference on computational linguistics
20. Chinchor NA (1998) Overview of MUC-7/MET-2. Technical report, Science Applications International Corp, San Diego
21. Poesio M (2004) Discourse annotation and semantic annotation in the GNOME corpus. In: Proceedings of the workshop on discourse annotation, pp 72–79
22. Poesio M, Artstein R et al (2008) Anaphoric annotation in the ARRAU corpus. In: LREC
23. Chen YH, Choi JD (2016) Character identification on multiparty
conversation: Identifying mentions of characters in TV shows. In:
Proceedings of the 17th annual meeting of the special interest
group on discourse and dialogue, pp 90–100
24. Cybulska A, Vossen P (2014) Guidelines for ECB+ annotation of events and their coreference. Technical report NWR-2014-1, VU University Amsterdam
25. Zeldes A, Zhang S (2016) When annotation schemes change rules help: a configurable approach to coreference resolution beyond OntoNotes. In: Proceedings of the workshop on coreference resolution beyond OntoNotes (CORBON 2016), pp 92–101
26. Ghaddar A, Langlais P (2016) WikiCoref: an English coreference-annotated corpus of Wikipedia articles. In: Proceedings of the tenth international conference on language resources and evaluation (LREC'16), pp 136–142
27. Marcus MP, Santorini B, Marcinkiewicz MA (1993) Building a large annotated corpus of English: the Penn Treebank. Comput Linguist 19(2):313–330
28. Hasler L, Orasan C, Naumann K (2006) NPs for events: experiments in coreference annotation. In: Proceedings of the fifth international conference on language resources and evaluation (LREC'06)
29. Kim J-D, Ohta T, Tateisi Y, Tsujii J (2003) GENIA corpus—a semantically annotated corpus for bio-textmining. Bioinformatics 19(suppl 1):i180–i182
30. Tateisi Y, Yakushiji A, Ohta T, Tsujii J (2005) Syntax annotation for the GENIA corpus. In: Companion volume to the proceedings of the conference including posters/demos and tutorial abstracts
31. Kim J-D, Ohta T, Tsujii J (2008) Corpus annotation for mining
biomedical events from literature. BMC Bioinform 9(1):10
32. Su J, Yang X, Hong H, Tateisi Y, Tsujii J (2008) Coreference resolution in biomedical texts: a machine learning approach. In: Dagstuhl seminar proceedings. Schloss Dagstuhl-Leibniz-Zentrum für Informatik
33. Kim J-D, Pyysalo S, Ohta T, Bossy R, Nguyen N, Tsujii J (2011) Overview of BioNLP shared task 2011. In: Proceedings of BioNLP shared task 2011 workshop, pp 1–6
34. Cohen KB, Johnson HL, Verspoor K, Roeder C, Hunter LE
(2010) The structural and content aspects of abstracts versus
bodies of full text journal articles are different. BMC Bioinform
11(1):492
35. Batista-Navarro RT, Ananiadou S (2011) Building a coreference-annotated corpus from the domain of biochemistry. In: Proceedings of BioNLP 2011 workshop, pp 83–91
36. Segura-Bedmar I, Crespo M, de Pablo C, Martínez P (2009) DrugNerAR: linguistic rule-based anaphora resolver for drug-drug interaction extraction in pharmacological documents. In: Proceedings of the third international workshop on data and text mining in bioinformatics, pp 19–26
37. Doddington GR, Mitchell A, Przybocki MA, Ramshaw LA, Strassel SM, Weischedel RM (2004) The automatic content extraction (ACE) program: tasks, data, and evaluation. In: LREC, vol 2. Lisbon, pp 837–840
38. Weischedel R, Palmer M, Marcus M, Hovy E, Pradhan S, Ramshaw L, Xue N, Taylor A, Kaufman J, Franchini M et al (2013) OntoNotes release 5.0 LDC2013T19. Linguistic Data Consortium, Philadelphia, p 23
39. Recasens M, Màrquez L, Sapena E, Martí MA, Taulé M, Hoste V, Poesio M, Versley Y (2010) SemEval-2010 task 1: coreference resolution in multiple languages. In: Proceedings of the 5th international workshop on semantic evaluation, pp 1–8
40. Guillou L, Hardmeier C, Smith A, Tiedemann J, Webber B (2014) ParCor 1.0: a parallel pronoun-coreference corpus to support statistical MT. In: 9th international conference on language resources and evaluation (LREC), May 26–31, 2014, Reykjavik, Iceland. European Language Resources Association, pp 3191–3198
41. Montemagni S, Barsotti F, Battista M, Calzolari N, Corazzari O,
Zampolli A, Fanciulli F, Massetani M, Raffaelli R, Basili R et al
(2003) The Italian syntactic-semantic treebank: architecture,
annotation, tools and evaluation
42. Bristot A, Chiran L, Delmonte R (2000) Verso un'annotazione XML di dialoghi spontanei per l'analisi sintattico-semantica. XI Giornate di Studio GFS, Multimodalità e Multimedialità nella comunicazione, pp 42–50
43. Pradhan S, Ramshaw L, Marcus M, Palmer M, Weischedel R, Xue N (2011) CoNLL-2011 shared task: modeling unrestricted coreference in OntoNotes. In: Proceedings of the fifteenth conference on computational natural language learning: shared task, pp 1–27
44. Lee K, He L, Lewis M, Zettlemoyer L (2017) End-to-end neural
coreference resolution. In: Proceedings of the 2017 conference on
empirical methods in natural language processing, pp 188–197
45. Lakretz Y, Hupkes D, Vergallito A, Marelli M, Baroni M,
Dehaene S (2020) Exploring processing of nested dependencies
in neural-network language models and humans. arXiv preprint
arXiv:2006.11098
46. Kabadjov MA (2007) A comprehensive evaluation of anaphora
resolution and discourse-new classification. PhD thesis, Citeseer
47. Liu H (2010) Dependency direction as a means of word-order
typology: a method based on dependency treebanks. Lingua
120(6):1567–1578. https://doi.org/10.1016/j.lingua.2009.10.001
48. Tsarfaty R, Seddah D, Goldberg Y, Kuebler S, Versley Y, Candito M, Foster J, Rehbein I, Tounsi L (2010) Statistical parsing of morphologically rich languages (SPMRL): what, how and whither. In: Proceedings of the NAACL HLT 2010 first workshop on
statistical parsing of morphologically-rich languages. Association for Computational Linguistics, Los Angeles, pp 1–12. https://www.aclweb.org/anthology/W10-1401
49. Liu H, Xu C (2012) Quantitative typological analysis of Romance
languages. Poznan Stud Contemp Linguist 48(4):597–625.
https://doi.org/10.1515/psicl-2012-0027
50. Wang L, Tu Z, Zhang X, Liu S, Li H, Way A, Liu Q (2017) A
novel and robust approach for pro-drop language translation.
Mach Transl 31(1–2):65–87
51. Wang L, Tu Z, Shi S, Zhang T, Graham Y, Liu Q (2018)
Translating pro-drop languages with reconstruction models. In:
McIlraith SA, Weinberger KQ (eds) Proceedings of the thirty-
second AAAI conference on artificial intelligence, (AAAI-18),
the 30th innovative applications of artificial intelligence (IAAI-
18), and the 8th AAAI symposium on educational advances in
artificial intelligence (EAAI18). AAAI Press, New Orleans,
pp 4937–4945. https://www.aaai.org/ocs/index.php/AAAI/
AAAI18/paper/view/16187
52. Evans R (2001) Applying machine learning toward an automatic
classification of it. Literary Linguist Comput 16(1):45–58
53. Yin Q, Zhang Y, Zhang W, Liu T, Wang WY (2018) Zero pronoun resolution with attention-based neural network. In: Proceedings of the 27th international conference on computational linguistics, pp 13–23
54. Gopal M, Jha GN (2017) Zero pronouns and their resolution in
Sanskrit texts. In: The international symposium on intelligent
systems technologies and applications. Springer, pp 255–267
55. Aloraini A, Poesio M et al (2020) Cross-lingual zero pronoun
resolution
56. Guarasci R, Silvestri S, De Pietro G, Fujita H, Esposito M (2022) BERT syntactic transfer: a computational experiment on Italian, French and English languages. Comput Speech Lang 71:101261
57. McKelvie D, Isard A, Mengel A, Baun Møller M, Grosse M, Klein M (2001) The MATE workbench—an annotation tool for XML coded speech corpora. Speech Commun 33(1):97–112. https://doi.org/10.1016/S0167-6393(00)00071-6
58. Lakretz Y, Dehaene S, King J-R (2020) What limits our capacity to process nested long-range dependencies in sentence comprehension? Entropy 22(4):446
59. Dell’Orletta F, Wieling M, Venturi G, Cimino A, Montemagni S
(2014) Assessing the readability of sentences: which corpora and
features? In: Proceedings of the ninth workshop on innovative use
of NLP for building educational applications, pp 163–173
60. Crossley SA, Skalicky S, Dascalu M, McNamara DS, Kyle K (2017) Predicting text comprehension, processing, and familiarity in adult readers: new approaches to readability formulas. Discourse Process 54(5–6):340–359
61. Sprouse J (2018) Acceptability judgments and grammaticality,
prospects and challenges. Syntactic structures after 60 years: the
impact of the Chomskyan revolution in linguistics, vol 129,
pp 195–224
62. Goodman LA, Kruskal WH (1954) Measures of association for cross classifications. J Am Stat Assoc 49(268):732–764
63. Bobicev V, Sokolova M (2017) Inter-annotator agreement in
sentiment analysis: machine learning perspective. In: RANLP,
pp 97–102
64. Sprouse J, Schütze CT, Almeida D (2013) A comparison of informal and formal acceptability judgments using a random sample from Linguistic Inquiry 2001–2010. Lingua 134:219–248. https://doi.org/10.1016/j.lingua.2013.07.002
65. Langsford S, Perfors A, Hendrickson AT, Kennedy LA, Navarro DJ (2018) Quantifying sentence acceptability measures: reliability, bias, and variability. Glossa J Gen Linguist 3(1):37. https://doi.org/10.5334/gjgl.396
66. Landis JR, Koch GG (1977) The measurement of observer agreement for categorical data. Biometrics 33(1):159–174
67. Aroyo L, Welty C (2015) Truth is a lie: crowd truth and the seven
myths of human annotation. AI Mag 36(1):15–24
68. Joshi M, Levy O, Zettlemoyer L, Weld D (2019) BERT for coreference resolution: baselines and analysis. In: Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP). Association for Computational Linguistics, Hong Kong, China, pp 5803–5808. https://doi.org/10.18653/v1/D19-1588
69. Joshi M, Chen D, Liu Y, Weld DS, Zettlemoyer L, Levy O (2020) SpanBERT: improving pre-training by representing and predicting spans. Trans Assoc Comput Linguist 8:64–77
70. Xu L, Choi JD (2020) Revealing the myth of higher-order inference in coreference resolution. arXiv preprint arXiv:2009.12013
71. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez
AN, Kaiser L, Polosukhin I (2017) Attention is all you need. In:
Advances in neural information processing systems,
pp 5998–6008
72. Vilain M, Burger JD, Aberdeen J, Connolly D, Hirschman L
(1995) A model-theoretic coreference scoring scheme. In: Sixth
message understanding conference (MUC-6): proceedings of a
conference held in Columbia, Maryland, November 6–8, 1995
73. Bagga A, Baldwin B (1998) Algorithms for scoring coreference chains. In: Proceedings of the linguistic coreference workshop at the first conference on language resources and evaluation (LREC), Granada, Spain, May 1998
74. Luo X (2005) On coreference resolution performance metrics. In:
Proceedings of human language technology conference and
conference on empirical methods in natural language processing,
pp 25–32
Publisher’s Note Springer Nature remains neutral with regard to
jurisdictional claims in published maps and institutional affiliations.