Figure 3 - uploaded by Trevor Cohn
Validity of phrase pairs according to the phrase extraction heuristic. Only the leftmost phrase pair is valid; the others are inconsistent with the alignment or have an unaligned word on a boundary, respectively, as indicated by a cross.


Source publication
Article
Full-text available
Automatic paraphrasing is an important component in many natural language processing tasks. In this paper we present a new parallel corpus with paraphrase annotations. We adopt a definition of paraphrase based on word-alignments and show that it yields high inter-annotator agreement. As Kappa is suited to nominal data, we employ an alternative agre...

Contexts in source publication

Context 1
... our purposes we wish to be maximally conservative in how we process the data, and therefore we do not extract phrase pairs with unaligned words on their boundaries. Figure 3 illustrates the types of phrase pairs our extraction heuristic permits. Here, the pair "and reached ↔ and arrived at" is consistent with the word alignment. ...
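As a minimal sketch of what such an extraction heuristic checks (the function and span names below are illustrative, not the paper's code): a phrase pair is kept only if no alignment link crosses the phrase boundary and, under the conservative setting described above, no boundary word is unaligned.

```python
def consistent(alignment, src_span, tgt_span):
    """Keep (src_span, tgt_span) only if no alignment link crosses the
    phrase boundary. Spans are inclusive (start, end) index pairs;
    `alignment` is a set of (src_idx, tgt_idx) links."""
    (s1, s2), (t1, t2) = src_span, tgt_span
    has_link_inside = False
    for i, j in alignment:
        in_src = s1 <= i <= s2
        in_tgt = t1 <= j <= t2
        if in_src != in_tgt:          # a link crosses the phrase boundary
            return False
        has_link_inside = has_link_inside or in_src
    return has_link_inside            # at least one link must fall inside


def no_unaligned_boundary(alignment, src_span, tgt_span):
    """The conservative extra condition described above: the words on
    both edges of both spans must themselves be aligned."""
    src_aligned = {i for i, _ in alignment}
    tgt_aligned = {j for _, j in alignment}
    (s1, s2), (t1, t2) = src_span, tgt_span
    return {s1, s2} <= src_aligned and {t1, t2} <= tgt_aligned


# "and reached <-> and arrived at": and->and, reached->{arrived, at}
links = {(0, 0), (1, 1), (1, 2)}
assert consistent(links, (0, 1), (0, 2))
assert no_unaligned_boundary(links, (0, 1), (0, 2))
```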
Context 2
... phrase extraction procedure distinguishes between two types of phrase pairs: atomic, i.e., the smallest possible phrase pairs, and composite, which can be created by combining smaller phrase pairs. For example, the phrase pair "and reached ↔ and arrived at" in Figure 3 is composite, as it can be decomposed into "and ↔ and" and "reached ↔ arrived at". Table 2 shows the atomic and composite phrase pairs extracted from the possible alignments produced by annotators A and B for the sentence pair in Figure 2. ...
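A hedged sketch of the atomic/composite distinction (illustrative names; for brevity it only checks monotone two-way splits, which suffices for the example above): a pair is composite if it can be split into two smaller pairs that were themselves extracted.

```python
def is_composite(pair, extracted):
    """A phrase pair is composite if it splits into two smaller extracted
    pairs; otherwise it is atomic. Pairs are ((s1, s2), (t1, t2))
    inclusive index spans; `extracted` is the set of all extracted pairs.
    Only monotone splits are tried here; a fuller check would also try
    the reordered (swapped) split."""
    (s1, s2), (t1, t2) = pair
    for si in range(s1, s2):            # split point on the source side
        for ti in range(t1, t2):        # split point on the target side
            left = ((s1, si), (t1, ti))
            right = ((si + 1, s2), (ti + 1, t2))
            if left in extracted and right in extracted:
                return True
    return False


# "and reached <-> and arrived at" = ((0, 1), (0, 2)) is composite:
# it splits into "and <-> and" ((0, 0), (0, 0)) and
# "reached <-> arrived at" ((1, 1), (1, 2)).
extracted = {((0, 0), (0, 0)), ((1, 1), (1, 2)), ((0, 1), (0, 2))}
assert is_composite(((0, 1), (0, 2)), extracted)
```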

Similar publications

Conference Paper
Full-text available
The NLP community has shown a renewed interest in deeper semantic analyses, among them automatic recognition of relations between pairs of words in a text. We present an evaluation task designed to provide a framework for comparing different approaches to classifying semantic relations between nominals in a sentence. This is part of SemEval, the 4t...
Conference Paper
Full-text available
The goal of this paper is to provide an annotation scheme for compounds based on generative lexicon theory (GL, Pustejovsky, 1995; Bassac and Bouillon, 2001). This scheme has been tested on a set of compounds automatically extracted from the Europarl corpus (Koehn, 2005) both in Italian and French. The motivation is twofold. On the one hand, it sho...

Citations

... The writer's competence can also increase the interest in reading and in understanding the context to be paraphrased (Reynolds, 1995). Previous researchers have suggested expanding the network of semantically related words available to each researcher, for example by using synonyms, antonyms, and associated words (Baba, 2009; Keck, 2006; Cohn, Callison-Burch, & Lapata, 2008). The lexical perspective defines paraphrase by the kinds of lexical change that can occur within a phrase or sentence and that yield new paraphrases. ...
Article
The lack of knowledge about techniques for avoiding plagiarism is an obstacle to students and lecturers collaborating on scientific publications. The purpose of this community service is to increase the use of paraphrasing techniques when writing scientific publications. Specifically, this community service consists of training on how paraphrasing techniques are used in scientific publications to avoid high levels of plagiarism. The activity was carried out as online training using the "google meet" application, with 12 researchers from College "A" and 6 researchers from College "B". The pretest-posttest method was used to evaluate the results of the training activities. The evaluation results showed an increase in the understanding of students and lecturers at the two tertiary institutions regarding paraphrasing techniques for reducing plagiarism.
... To evaluate the lexicon-grammar resources within eSPERTo, we are using the 801 European Portuguese sentences of the Gold CLUE4Paraphrasing corpus. The size of this corpus is comparable to that of the corpus built by Cohn et al. (2008) to develop and evaluate paraphrase systems for English. It offers the opportunity to compare paraphrases in European Portuguese and Brazilian Portuguese. ...
... Finally, those 240 sentences will all be evaluated by each author to assess inter-annotator agreement. We can then opt either to randomly choose two of the annotators and calculate the agreement statistic proposed by Cohn et al. (2008), or to calculate the agreement among more than two annotators as discussed by Artstein & Poesio (2008). ...
Article
Full-text available
This paper presents a new linguistic resource for the generation of paraphrases in Portuguese, based on the lexicon-grammar framework. The resource components include: (i) a lexicon-grammar based dictionary of 2100 predicate nouns co-occurring with the support verb ser de ‘be of’, such as in ser de uma ajuda inestimável ‘be of invaluable help’; (ii) a lexicon-grammar based dictionary of 6000 predicate nouns co-occurring with the support verb fazer ‘do’ or ‘make’, such as in fazer uma comparação ‘make a comparison’; and (iii) a lexicon-grammar based dictionary of about 5000 human intransitive adjectives co-occurring with the copula verbs ser and/or estar ‘be’, such as in ser simpático ‘be kind’ or estar entusiasmado ‘be enthusiastic’. A set of local grammars explore the properties described in linguistic resources, enabling a variety of text transformation tasks for paraphrasing applications. The paper highlights the different complementary and synergistic components and integration efforts, and presents some preliminary evaluation results on the inclusion of such resources in the eSPERTo paraphrase generation system.
... Unlike FastAlign, which is trained on bitext alone, DiscAlign is pre-trained on bitext and fine-tuned on gold-standard alignments. For this task, a DiscAlign model was pre-trained with 141 million sentences of ParaBank data (Hu et al., 2019b) and fine-tuned on a 713-sentence subset of the Edinburgh++ corpus (Cohn et al., 2008). Both DiscAlign and FastAlign have been successfully used for cross-lingual word alignment, with DiscAlign outperforming FastAlign on Arabic-English and Chinese-English alignment by a large margin (Stengel-Eskin et al., 2019). ...
Article
Full-text available
We introduce a novel paraphrastic augmentation strategy based on sentence-level lexically constrained paraphrasing and discriminative span alignment. Our approach allows for the large-scale expansion of existing datasets or the rapid creation of new datasets using a small, manually produced seed corpus. We demonstrate our approach with experiments on the Berkeley FrameNet Project, a large-scale language understanding effort spanning more than two decades of human labor. With four days of training data collection for a span alignment model and one day of parallel compute, we automatically generate and release to the community 495,300 unique (Frame, Trigger) pairs in diverse sentential contexts, a roughly 50-fold expansion atop FrameNet v1.7. The resulting dataset is intrinsically and extrinsically evaluated in detail, showing positive results on a downstream task.
... Unlike FastAlign, which is trained on bitext alone, DiscAlign is pre-trained on bitext and fine-tuned on gold-standard alignments. For this task, a DiscAlign model was pre-trained with 141 million sentences of ParaBank data (Hu et al., 2019b) and fine-tuned on a 713-sentence subset of the Edinburgh++ corpus (Cohn et al., 2008). Both DiscAlign and FastAlign have been successfully used for cross-lingual word alignment, with DiscAlign outperforming FastAlign on Arabic-English and Chinese-English alignment by a large margin (Stengel-Eskin et al., 2019). ...
Preprint
We introduce a novel paraphrastic augmentation strategy based on sentence-level lexically constrained paraphrasing and discriminative span alignment. Our approach allows for the large-scale expansion of existing resources, or the rapid creation of new resources from a small, manually-produced seed corpus. We illustrate our framework on the Berkeley FrameNet Project, a large-scale language understanding effort spanning more than two decades of human labor. Based on roughly four days of collecting training data for the alignment model and approximately one day of parallel compute, we automatically generate 495,300 unique (Frame, Trigger) combinations annotated in context, a roughly 50x expansion atop FrameNet v1.7.
... Past work has focused on lexical and short phrasal alignments, in part because most existing corpora consist of mostly word-level alignments. Yao et al. (2013b) report that 95% of alignments in the MSR RTE (Brockett, 2007) and Edinburgh++ (Cohn et al., 2008) corpora are single-token, lexical paraphrases, and phrases of four or more words make up less than 1% of MSR RTE and 3% of Edinburgh++. ...
... Thadani et al. (2012) added dependency arc edits to MANLI's phrase edits, again improving the system's performance. Interestingly, Thadani et al. used both the sure and possible alignments in the Edinburgh++ corpus (Cohn et al., 2008) and showed that training on both gave better performance than training only on sure alignments on this corpus, but no subsequent monolingual alignment systems have taken advantage of possible alignments until we do so this work. ...
... Specifically, two annotators aligned 132 sentences to their infoboxes. We used the Yawat annotation tool (Germann, 2008) and followed the alignment guidelines (and evaluation metrics) used in Cohn et al. (2008). The inter-annotator agreement using macro-averaged f-score was 0.72 (we treated one annotator as the reference and the other one as hypothetical system output). ...
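A small sketch of that agreement computation, under the stated convention of treating one annotator's alignment links as the reference and the other's as system output (a set representation of links is assumed; this is not the authors' code):

```python
def f_score(reference, hypothesis):
    """Precision/recall/F1 between two sets of alignment links."""
    if not reference or not hypothesis:
        return 0.0
    tp = len(reference & hypothesis)
    precision = tp / len(hypothesis)
    recall = tp / len(reference)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f(sentence_pairs):
    """Macro-average: compute F1 per sentence, then take the mean.
    `sentence_pairs` is a list of (links_A, links_B) set pairs,
    one per annotated sentence."""
    scores = [f_score(a, b) for a, b in sentence_pairs]
    return sum(scores) / len(scores)
```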
Article
A core step in statistical data-to-text generation concerns learning correspondences between structured data representations (e.g., facts in a database) and associated texts. In this paper we aim to bootstrap generators from large scale datasets where the data (e.g., DBPedia facts) and related texts (e.g., Wikipedia abstracts) are loosely aligned. We tackle this challenging task by introducing a special-purpose content selection mechanism. We use multi-instance learning to automatically discover correspondences between data and text pairs and show how these can be used to enhance the content signal while training an encoder-decoder architecture. Experimental results demonstrate that models trained with content-specific objectives improve upon a vanilla encoder-decoder which solely relies on soft attention.
... In order to calculate precision and recall, which are well-known metrics for evaluating retrieval (classification) performance, one set of annotations should be considered as the gold standard [92]. In this study, we advocate a voting system, as we have four annotators/raters, two of whom are bio-curators. ...
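One plausible reading of such a voting scheme, sketched below with hypothetical names (the exact tie-breaking rule is an assumption, not stated in the quoted text): an item enters the gold standard on a strict majority of the four raters, and a 2-2 tie is resolved by the two bio-curators.

```python
from collections import Counter


def vote_gold(annotations, curators):
    """Hypothetical majority vote over raters' annotation sets.
    `annotations` maps rater name -> set of annotated items; `curators`
    names the two bio-curators. An item is gold on a strict majority,
    and an exact tie is accepted only if both curators marked it."""
    n = len(annotations)                           # here: 4 raters
    counts = Counter(item for items in annotations.values()
                     for item in items)
    gold = set()
    for item, votes in counts.items():
        if votes > n / 2:                          # strict majority (3+ of 4)
            gold.add(item)
        elif votes == n / 2 and all(item in annotations[c]
                                    for c in curators):
            gold.add(item)                         # 2-2 tie: curators decide
    return gold
```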
Article
Full-text available
Background: Automatic identification of term variants or acceptable alternative free-text terms for gene and protein names from the millions of biomedical publications is a challenging task. Ontologies, such as the Cardiovascular Disease Ontology (CVDO), capture domain knowledge in a computational form and can provide context for gene/protein names as written in the literature. This study investigates: 1) if word embeddings from Deep Learning algorithms can provide a list of term variants for a given gene/protein of interest; and 2) if biological knowledge from the CVDO can improve such a list without modifying the word embeddings created. Methods: We have manually annotated 105 gene/protein names from 25 PubMed titles/abstracts and mapped them to 79 unique UniProtKB entries corresponding to gene and protein classes from the CVDO. Using more than 14 M PubMed articles (titles and available abstracts), word embeddings were generated with CBOW and Skip-gram. We setup two experiments for a synonym detection task, each with four raters, and 3672 pairs of terms (target term and candidate term) from the word embeddings created. For Experiment I, the target terms for 64 UniProtKB entries were those that appear in the titles/abstracts; Experiment II involves 63 UniProtKB entries and the target terms are a combination of terms from PubMed titles/abstracts with terms (i.e. increased context) from the CVDO protein class expressions and labels. Results: In Experiment I, Skip-gram finds term variants (full and/or partial) for 89% of the 64 UniProtKB entries, while CBOW finds term variants for 67%. In Experiment II (with the aid of the CVDO), Skip-gram finds term variants for 95% of the 63 UniProtKB entries, while CBOW finds term variants for 78%. Combining the results of both experiments, Skip-gram finds term variants for 97% of the 79 UniProtKB entries, while CBOW finds term variants for 81%. Conclusions: This study shows performance improvements for both CBOW and Skip-gram on a gene/protein synonym detection task by adding knowledge formalised in the CVDO and without modifying the word embeddings created. Hence, the CVDO supplies context that is effective in inducing term variability for both CBOW and Skip-gram while reducing ambiguity. Skip-gram outperforms CBOW and finds more pertinent term variants for gene/protein names annotated from the scientific literature.
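As a hedged sketch of the underlying synonym-detection step (the model file name is hypothetical; the calls are gensim's standard KeyedVectors API): nearest neighbours in the embedding space serve as candidate term variants for a gene/protein name.

```python
from gensim.models import KeyedVectors

# Hypothetical file: Skip-gram vectors trained on PubMed
# titles/abstracts, saved in word2vec binary format.
vectors = KeyedVectors.load_word2vec_format("pubmed_skipgram.bin",
                                            binary=True)


def candidate_variants(term, topn=10):
    """Return the nearest neighbours of a gene/protein name as
    candidate term variants, to be judged by raters (and, in the
    increased-context experiment, filtered against the CVDO)."""
    if term not in vectors:
        return []
    return vectors.most_similar(positive=[term], topn=topn)


# Illustrative only: candidate_variants("brca1") might return
# pairs like ("brca2", 0.81), ("breast-cancer-1", 0.74), ...
```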
... Without context, witch and she would not be considered paraphrases, while witch and sorceress probably would. Cohn et al. (2008) annotated words and phrases in the context of sentences in order to analyze the nature of paraphrases and corresponding corpora. In our work, the lowest paraphrase level is the event element level which are seen in context of their sentence. ...
... On the SemEval 2015 Task 1 data (Xu et al., 2014), which is based on the Twitter Paraphrase Corpus (TPC) - this means the tweets are roughly equivalent to 'sentences' - the IAA measured in terms of F-measure is .82. For the phrase level, Cohn et al. (2008) report an F1 IAA between .71 and .76. They also report IAA on the word level, which is between .74 and .79. ...
... Similarly, Sammons et al. (2010) took existing textual entailment corpora that are classified according to classes including paraphrases and classified the arguments according to paraphrase classes. Cohn et al. (2008) performed an annotation on all three levels in parallel, by using existing sentential paraphrase corpora such as the Microsoft Paraphrase Corpus (MSPC) and adding the other two layers on top of those. ...
... The paraphrasing task is done by adjusting the text and expression within linguistically identical sentences. The observer is asked to find every element of the text that retains the possible meaning [13]. ...