Figure 3 - uploaded by Trevor Cohn
Validity of phrase pairs according to the phrase extraction heuristic. Only the leftmost phrase pair is valid; the others are inconsistent with the alignment or have an unaligned word on a boundary, respectively, as indicated by a cross.


Source publication
Article
Full-text available
Automatic paraphrasing is an important component in many natural language processing tasks. In this paper we present a new parallel corpus with paraphrase annotations. We adopt a definition of paraphrase based on word-alignments and show that it yields high inter-annotator agreement. As Kappa is suited to nominal data, we employ an alternative agre...

Contexts in source publication

Context 1
... our purposes we wish to be maximally conservative in how we process the data, and therefore we do not extract phrase pairs with unaligned words on their boundaries. Figure 3 illustrates the types of phrase pairs our extraction heuristic permits. Here, the pair "and reached ↔ and arrived at" is consistent with the word alignment. ...
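As a minimal sketch of what such an extraction heuristic checks (the function and span names below are illustrative, not the paper's code): a phrase pair is kept only if no alignment link crosses the phrase boundary and, under the conservative setting described above, no boundary word is unaligned.

```python
def consistent(alignment, src_span, tgt_span):
    """Keep (src_span, tgt_span) only if no alignment link crosses the
    phrase boundary. Spans are inclusive (start, end) index pairs;
    `alignment` is a set of (src_idx, tgt_idx) links."""
    (s1, s2), (t1, t2) = src_span, tgt_span
    has_link_inside = False
    for i, j in alignment:
        in_src = s1 <= i <= s2
        in_tgt = t1 <= j <= t2
        if in_src != in_tgt:          # a link crosses the phrase boundary
            return False
        has_link_inside = has_link_inside or in_src
    return has_link_inside            # at least one link must fall inside


def no_unaligned_boundary(alignment, src_span, tgt_span):
    """The conservative extra condition described above: the words on
    both edges of both spans must themselves be aligned."""
    src_aligned = {i for i, _ in alignment}
    tgt_aligned = {j for _, j in alignment}
    (s1, s2), (t1, t2) = src_span, tgt_span
    return {s1, s2} <= src_aligned and {t1, t2} <= tgt_aligned


# "and reached <-> and arrived at": and->and, reached->{arrived, at}
links = {(0, 0), (1, 1), (1, 2)}
assert consistent(links, (0, 1), (0, 2))
assert no_unaligned_boundary(links, (0, 1), (0, 2))
```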
Context 2
... phrase extraction procedure distinguishes between two types of phrase pairs: atomic, i.e., the smallest possible phrase pairs, and composite, which can be created by combining smaller phrase pairs. For example, the phrase pair "and reached ↔ and arrived at" in Figure 3 is composite, as it can be decomposed into "and ↔ and" and "reached ↔ arrived at". Table 2 shows the atomic and composite phrase pairs extracted from the possible alignments produced by annotators A and B for the sentence pair in Figure 2. ...
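A hedged sketch of the atomic/composite distinction (illustrative names; for brevity it only checks monotone two-way splits, which suffices for the example above): a pair is composite if it can be split into two smaller pairs that were themselves extracted.

```python
def is_composite(pair, extracted):
    """A phrase pair is composite if it splits into two smaller extracted
    pairs; otherwise it is atomic. Pairs are ((s1, s2), (t1, t2))
    inclusive index spans; `extracted` is the set of all extracted pairs.
    Only monotone splits are tried here; a fuller check would also try
    the reordered (swapped) split."""
    (s1, s2), (t1, t2) = pair
    for si in range(s1, s2):            # split point on the source side
        for ti in range(t1, t2):        # split point on the target side
            left = ((s1, si), (t1, ti))
            right = ((si + 1, s2), (ti + 1, t2))
            if left in extracted and right in extracted:
                return True
    return False


# "and reached <-> and arrived at" = ((0, 1), (0, 2)) is composite:
# it splits into "and <-> and" ((0, 0), (0, 0)) and
# "reached <-> arrived at" ((1, 1), (1, 2)).
extracted = {((0, 0), (0, 0)), ((1, 1), (1, 2)), ((0, 1), (0, 2))}
assert is_composite(((0, 1), (0, 2)), extracted)
```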

Similar publications

Conference Paper
Full-text available
The NLP community has shown a renewed interest in deeper semantic analyses, among them automatic recognition of relations between pairs of words in a text. We present an evaluation task designed to provide a framework for comparing different approaches to classifying semantic relations between nominals in a sentence. This is part of SemEval, the 4t...
Conference Paper
Full-text available
The goal of this paper is to provide an annotation scheme for compounds based on generative lexicon theory (GL, Pustejovsky, 1995; Bassac and Bouillon, 2001). This scheme has been tested on a set of compounds automatically extracted from the Europarl corpus (Koehn, 2005) both in Italian and French. The motivation is twofold. On the one hand, it sho...

Citations

... The writer's competence can also increase the interest in reading and in understanding the context to be paraphrased (Reynolds, 1995). Previous researchers have suggested expanding the network of semantically related words available to each researcher, for example by using synonyms, antonyms, and associated words (Baba, 2009; Keck, 2006; Cohn, Callison-Burch, & Lapata, 2008). The lexical perspective defines paraphrase by the kinds of lexical change that can occur within a phrase or sentence and that yield new paraphrases. ...
Article
The lack of knowledge about techniques for avoiding plagiarism is an obstacle to students and lecturers collaborating on scientific publications. The purpose of this community service is to increase the use of paraphrasing techniques when writing scientific publications. Specifically, this community service consists of training on how paraphrasing techniques are used in scientific publications to avoid high levels of plagiarism. The activity was carried out as online training using the "google meet" application, with 12 researchers from College "A" and 6 researchers from College "B". The pretest-posttest method was used to evaluate the results of the training activities. The evaluation results showed an increase in the understanding of students and lecturers at the two tertiary institutions regarding paraphrasing techniques for reducing plagiarism.
... To evaluate the lexicon-grammar resources within eSPERTo, we are using the 801 European Portuguese sentences of the Gold CLUE4Paraphrasing corpus. The size of this corpus is comparable to that of the corpus built by Cohn et al. (2008) to develop and evaluate paraphrase systems for English. It offers the opportunity to compare paraphrases in European Portuguese and Brazilian Portuguese. ...
... Finally, those 240 sentences will all be evaluated by each author to assess inter-annotator agreement. We can then opt either to randomly choose two of the annotators and calculate the agreement statistic proposed by Cohn et al. (2008), or to calculate the agreement among more than two annotators as discussed by Artstein & Poesio (2008). ...
Article
Full-text available
This paper presents a new linguistic resource for the generation of paraphrases in Portuguese, based on the lexicon-grammar framework. The resource components include: (i) a lexicon-grammar based dictionary of 2100 predicate nouns co-occurring with the support verb ser de ‘be of’, such as in ser de uma ajuda inestimável ‘be of invaluable help’; (ii) a lexicon-grammar based dictionary of 6000 predicate nouns co-occurring with the support verb fazer ‘do’ or ‘make’, such as in fazer uma comparação ‘make a comparison’; and (iii) a lexicon-grammar based dictionary of about 5000 human intransitive adjectives co-occurring with the copula verbs ser and/or estar ‘be’, such as in ser simpático ‘be kind’ or estar entusiasmado ‘be enthusiastic’. A set of local grammars explore the properties described in linguistic resources, enabling a variety of text transformation tasks for paraphrasing applications. The paper highlights the different complementary and synergistic components and integration efforts, and presents some preliminary evaluation results on the inclusion of such resources in the eSPERTo paraphrase generation system.
... Unlike FastAlign, which is trained on bitext alone, DiscAlign is pre-trained on bitext and fine-tuned on gold-standard alignments. For this task, a DiscAlign model was pre-trained with 141 million sentences of ParaBank data (Hu et al., 2019b) and fine-tuned on a 713-sentence subset of the Edinburgh++ corpus (Cohn et al., 2008). Both DiscAlign and FastAlign have been successfully used for cross-lingual word alignment, with DiscAlign outperforming FastAlign on Arabic-English and Chinese-English alignment by a large margin (Stengel-Eskin et al., 2019). ...
Article
Full-text available
We introduce a novel paraphrastic augmentation strategy based on sentence-level lexically constrained paraphrasing and discriminative span alignment. Our approach allows for the large-scale expansion of existing datasets or the rapid creation of new datasets using a small, manually produced seed corpus. We demonstrate our approach with experiments on the Berkeley FrameNet Project, a large-scale language understanding effort spanning more than two decades of human labor. With four days of training data collection for a span alignment model and one day of parallel compute, we automatically generate and release to the community 495,300 unique (Frame, Trigger) pairs in diverse sentential contexts, a roughly 50-fold expansion atop FrameNet v1.7. The resulting dataset is intrinsically and extrinsically evaluated in detail, showing positive results on a downstream task.
... Unlike FastAlign, which is trained on bitext alone, DiscAlign is pre-trained on bitext and fine-tuned on gold-standard alignments. For this task, a DiscAlign model was pre-trained with 141 million sentences of ParaBank data (Hu et al., 2019b) and fine-tuned on a 713-sentence subset of the Edinburgh++ corpus (Cohn et al., 2008). Both DiscAlign and FastAlign have been successfully used for cross-lingual word alignment, with DiscAlign outperforming FastAlign on Arabic-English and Chinese-English alignment by a large margin (Stengel-Eskin et al., 2019). ...
Preprint
We introduce a novel paraphrastic augmentation strategy based on sentence-level lexically constrained paraphrasing and discriminative span alignment. Our approach allows for the large-scale expansion of existing resources, or the rapid creation of new resources from a small, manually-produced seed corpus. We illustrate our framework on the Berkeley FrameNet Project, a large-scale language understanding effort spanning more than two decades of human labor. Based on roughly four days of collecting training data for the alignment model and approximately one day of parallel compute, we automatically generate 495,300 unique (Frame, Trigger) combinations annotated in context, a roughly 50x expansion atop FrameNet v1.7.
... Past work has focused on lexical and short phrasal alignments, in part because most existing corpora consist of mostly word-level alignments. Yao et al. (2013b) report that 95% of alignments in the MSR RTE (Brockett, 2007) and Edinburgh++ (Cohn et al., 2008) corpora are single-token, lexical paraphrases, and phrases of four or more words make up less than 1% of MSR RTE and 3% of Edinburgh++. ...
... Thadani et al. (2012) added dependency arc edits to MANLI's phrase edits, again improving the system's performance. Interestingly, Thadani et al. used both the sure and possible alignments in the Edinburgh++ corpus (Cohn et al., 2008) and showed that training on both gave better performance than training only on sure alignments on this corpus, but no subsequent monolingual alignment systems have taken advantage of possible alignments until we do so this work. ...
... Specifically, two annotators aligned 132 sentences to their infoboxes. We used the Yawat annotation tool (Germann, 2008) and followed the alignment guidelines (and evaluation metrics) used in Cohn et al. (2008). The inter-annotator agreement using macro-averaged f-score was 0.72 (we treated one annotator as the reference and the other one as hypothetical system output). ...
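A small sketch of that agreement computation, under the stated convention of treating one annotator's alignment links as the reference and the other's as system output (a set representation of links is assumed; this is not the authors' code):

```python
def f_score(reference, hypothesis):
    """Precision/recall/F1 between two sets of alignment links."""
    if not reference or not hypothesis:
        return 0.0
    tp = len(reference & hypothesis)
    precision = tp / len(hypothesis)
    recall = tp / len(reference)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f(sentence_pairs):
    """Macro-average: compute F1 per sentence, then take the mean.
    `sentence_pairs` is a list of (links_A, links_B) set pairs,
    one per annotated sentence."""
    scores = [f_score(a, b) for a, b in sentence_pairs]
    return sum(scores) / len(scores)
```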
Article
A core step in statistical data-to-text generation concerns learning correspondences between structured data representations (e.g., facts in a database) and associated texts. In this paper we aim to bootstrap generators from large scale datasets where the data (e.g., DBPedia facts) and related texts (e.g., Wikipedia abstracts) are loosely aligned. We tackle this challenging task by introducing a special-purpose content selection mechanism. We use multi-instance learning to automatically discover correspondences between data and text pairs and show how these can be used to enhance the content signal while training an encoder-decoder architecture. Experimental results demonstrate that models trained with content-specific objectives improve upon a vanilla encoder-decoder which solely relies on soft attention.
... In order to calculate precision and recall, which are well-known metrics for evaluating retrieval (classification) performance, one set of annotations should be considered as the gold standard [92]. In this study, we advocate a voting system, as we have four annotators/raters, two of whom are bio-curators. ...
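One plausible reading of such a voting scheme, sketched below with hypothetical names (the exact tie-breaking rule is an assumption, not stated in the quoted text): an item enters the gold standard on a strict majority of the four raters, and a 2-2 tie is resolved by the two bio-curators.

```python
from collections import Counter


def vote_gold(annotations, curators):
    """Hypothetical majority vote over raters' annotation sets.
    `annotations` maps rater name -> set of annotated items; `curators`
    names the two bio-curators. An item is gold on a strict majority,
    and an exact tie is accepted only if both curators marked it."""
    n = len(annotations)                           # here: 4 raters
    counts = Counter(item for items in annotations.values()
                     for item in items)
    gold = set()
    for item, votes in counts.items():
        if votes > n / 2:                          # strict majority (3+ of 4)
            gold.add(item)
        elif votes == n / 2 and all(item in annotations[c]
                                    for c in curators):
            gold.add(item)                         # 2-2 tie: curators decide
    return gold
```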
Article
Full-text available
Background: Automatic identification of term variants or acceptable alternative free-text terms for gene and protein names from the millions of biomedical publications is a challenging task. Ontologies, such as the Cardiovascular Disease Ontology (CVDO), capture domain knowledge in a computational form and can provide context for gene/protein names as written in the literature. This study investigates: 1) if word embeddings from Deep Learning algorithms can provide a list of term variants for a given gene/protein of interest; and 2) if biological knowledge from the CVDO can improve such a list without modifying the word embeddings created. Methods: We have manually annotated 105 gene/protein names from 25 PubMed titles/abstracts and mapped them to 79 unique UniProtKB entries corresponding to gene and protein classes from the CVDO. Using more than 14 M PubMed articles (titles and available abstracts), word embeddings were generated with CBOW and Skip-gram. We setup two experiments for a synonym detection task, each with four raters, and 3672 pairs of terms (target term and candidate term) from the word embeddings created. For Experiment I, the target terms for 64 UniProtKB entries were those that appear in the titles/abstracts; Experiment II involves 63 UniProtKB entries and the target terms are a combination of terms from PubMed titles/abstracts with terms (i.e. increased context) from the CVDO protein class expressions and labels. Results: In Experiment I, Skip-gram finds term variants (full and/or partial) for 89% of the 64 UniProtKB entries, while CBOW finds term variants for 67%. In Experiment II (with the aid of the CVDO), Skip-gram finds term variants for 95% of the 63 UniProtKB entries, while CBOW finds term variants for 78%. Combining the results of both experiments, Skip-gram finds term variants for 97% of the 79 UniProtKB entries, while CBOW finds term variants for 81%. Conclusions: This study shows performance improvements for both CBOW and Skip-gram on a gene/protein synonym detection task by adding knowledge formalised in the CVDO and without modifying the word embeddings created. Hence, the CVDO supplies context that is effective in inducing term variability for both CBOW and Skip-gram while reducing ambiguity. Skip-gram outperforms CBOW and finds more pertinent term variants for gene/protein names annotated from the scientific literature.
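As a hedged sketch of the underlying synonym-detection step (the model file name is hypothetical; the calls are gensim's standard KeyedVectors API): nearest neighbours in the embedding space serve as candidate term variants for a gene/protein name.

```python
from gensim.models import KeyedVectors

# Hypothetical file: Skip-gram vectors trained on PubMed
# titles/abstracts, saved in word2vec binary format.
vectors = KeyedVectors.load_word2vec_format("pubmed_skipgram.bin",
                                            binary=True)


def candidate_variants(term, topn=10):
    """Return the nearest neighbours of a gene/protein name as
    candidate term variants, to be judged by raters (and, in the
    increased-context experiment, filtered against the CVDO)."""
    if term not in vectors:
        return []
    return vectors.most_similar(positive=[term], topn=topn)


# Illustrative only: candidate_variants("brca1") might return
# pairs like ("brca2", 0.81), ("breast-cancer-1", 0.74), ...
```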
... Without context, witch and she would not be considered paraphrases, while witch and sorceress probably would. Cohn et al. (2008) annotated words and phrases in the context of sentences in order to analyze the nature of paraphrases and corresponding corpora. In our work, the lowest paraphrase level is the event element level which are seen in context of their sentence. ...
... On the SemEval 2015 Task 1 data (Xu et al., 2014), which is based on the Twitter Paraphrase Corpus (TPC) - this means the tweets are roughly equivalent to 'sentences' - the IAA measured in terms of F-measure is .82. For the phrase level, Cohn et al. (2008) report an F1 IAA between .71 and .76. They also report IAA on the word level, which is between .74 and .79. ...
... Similarly, Sammons et al. (2010) took existing textual entailment corpora that are classified according to classes including paraphrases and classified the arguments according to paraphrase classes. Cohn et al. (2008) performed an annotation on all three levels in parallel, by using existing sentential paraphrase corpora such as the Microsoft Paraphrase Corpus (MSPC) and adding the other two layers on top of those. ...
... The paraphrasing task is done by adjusting the text and expression within linguistically identical sentences. The observer is asked to find every element of the text that retains the possible meaning [13]. ...