Dependency treelet translation: The convergence of statistical and example-based machine-translation?

Abstract

We describe a novel approach to machine translation that combines the strengths of the two leading corpus-based approaches: Phrasal SMT and EBMT. We use a syntactically informed decoder and reordering model based on the source dependency tree, in combination with conventional SMT models to incorporate the power of phrasal SMT with the linguistic generality available in a parser. We show that this approach significantly outperforms a leading string-based Phrasal SMT decoder and an EBMT system. We present results from two radically different language pairs, and investigate the sensitivity of this approach to parse quality by using two distinct parsers and oracle experiments. We also validate our automated BLEU scores with a small human evaluation.
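As a rough illustration of the unit this approach is named after, the sketch below enumerates treelets of a source dependency tree, assuming a treelet is any connected subgraph of that tree. The function name `treelets`, the head-index encoding, and the `max_size` cap are hypothetical choices made for this example; they are not taken from the paper.

```python
def treelets(heads, max_size):
    """Enumerate connected subgraphs (treelets) of a dependency tree.

    heads[i] is the index of token i's head, or -1 for the root.
    Returns a set of frozensets of token indices, each with at most
    max_size nodes. (Illustrative encoding, not the authors' own.)
    """
    children = {i: [] for i in range(len(heads))}
    for child, head in enumerate(heads):
        if head >= 0:
            children[head].append(child)

    def rooted(node, budget):
        # All treelets whose topmost node is `node`, using at most `budget` nodes.
        results = [frozenset([node])]
        if budget <= 1:
            return results
        for child in children[node]:
            extended = []
            for partial in results:
                room = budget - len(partial)
                if room <= 0:
                    continue
                # Attach every treelet rooted at this child that still fits.
                for sub in rooted(child, room):
                    extended.append(partial | sub)
            results.extend(extended)
        return results

    found = set()
    for node in range(len(heads)):
        found.update(rooted(node, max_size))
    return found


# "the old man ate": the -> man, old -> man, man -> ate, ate is the root.
heads = [2, 2, 3, -1]
small = treelets(heads, 3)
```

Because a treelet only needs to be connected in the tree, `{the, man}` qualifies even though the two words are not adjacent in the string, whereas `{the, old}` does not, since those words are linked only through `man`.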
... At the same time, we have known that it is possible to translate English geographical names efficiently because of the development of machine translation [4]. However, geographical names are composed of a group of phrases without clear grammatical constraints, so the lexical structures at different scales are complex and diverse. ...
Article
In recent years, with increasing international communication and cooperation, the consensus of toponymic information among different countries has become increasingly important. A large number of English geographical names are in urgent need of translation into Chinese, but there are few studies on machine translation of geographical names at present. Therefore, this paper proposes a method of automatically translating English geographical names into Chinese. First, the lexical structure of the geographic names is analyzed to divide the whole name into two parts, the special name and the general name, in an approach based on the statistical template model that implements pointwise mutual information and a directed acyclic graph data structure on the extracted names from different categories of a geographical name corpus. Second, the two parts of the geographic names are translated. The general name can be directly translated via methods of free translation. For the transliteration of the special name, the phonetic symbols are generated based on the cyclic neural network, and then, the syllables are divided based on the minimum entropy and converted into Chinese characters. Finally, the two parts of Chinese characters are combined, and criteria are prepared to evaluate the translation reliability according to the translation process to realize automatic quality inspection and screening of geographical names. As the experimental results show, the method is effective in the translation process of English geographic names into Chinese. This method can be easily extended to other languages such as Arabic.
... Also, Sanguinetti et al. (2014) present a catena-related approach for syntactic alignments in multilingual treebanks. In translation research, catenae are best known as "treelets" (Quirk & Menezes 2006). We employ catenae, which have already been used in NLP applications, to model the interface between the treebank and the lexicon. ...
Article
The paper focuses on the modeling of multiword expressions (MWE) in Bulgarian-English parallel news corpora (SETimes; CSLI dataset and PennTreebank dataset). Observations were made on alignments in which at least one multiword expression was used per language. The multiword expressions were classified with respect to the PARSEME lexicon-based (WG1) and treebank-based (WG2) classifications. The non-MWE counterparts of MWEs are also considered. Our approach is data-driven because the data of this study were retrieved from parallel corpora and not from bilingual dictionaries. The survey shows that the predominant translation relation between Bulgarian and English is MWE-to-word, and that this relation does not exclude other translation options. To formalize our observations, a catenae-based modeling of the parallel pairs is proposed.
... There are many types of translation systems that have been built in the past, for example: syntax-based translation systems (Yamada and Knight 2001), phrase-based SMT systems (Och and Ney 2002; Koehn et al. 2003), hierarchical phrase-based SMT systems (Chiang 2005, 2007), and syntactic phrase-based SMT systems (Quirk et al. 2005; Quirk and Menezes 2006). ...
Article
Differences in domains of language use between training data and test data have often been reported to result in performance degradation for phrase-based machine translation models. Throughout the past decade or so, a large body of work aimed at exploring domain-adaptation methods to improve system performance in the face of such domain differences. This paper provides a systematic survey of domain-adaptation methods for phrase-based machine-translation systems. The survey starts out with outlining the sources of errors in various components of phrase-based models due to domain change, including lexical selection, reordering and optimization. Subsequently, it outlines the different research lines to domain adaptation in the literature, and surveys the existing work within these research lines, discussing how these approaches differ and how they relate to each other.
Article
It has been estimated that over a billion people are using or learning English as a second or foreign language, and the numbers are growing not only for English but for other languages as well. These language learners provide a burgeoning market for tools that help identify and correct learners' writing errors. Unfortunately, the errors targeted by typical commercial proofreading tools do not include those aspects of a second language that are hardest to learn. This volume describes the types of constructions English language learners find most difficult: constructions containing prepositions, articles, and collocations. It provides an overview of the automated approaches that have been developed to identify and correct these and other classes of learner errors in a number of languages. Error annotation and system evaluation are particularly important topics in grammatical error detection because there are no commonly accepted standards. Chapters in the book describe the options available to researchers, recommend best practices for reporting results, and present annotation and evaluation schemes. The final chapters explore recent innovative work that opens new directions for research. It is the authors' hope that this volume will continue to contribute to the growing interest in grammatical error detection by encouraging researchers to take a closer look at the field and its many challenging problems.
Article
Statistical machine translation (SMT) systems perform poorly when they are applied to new target domains. Our goal is to explore domain adaptation approaches and techniques for improving the translation quality of domain-specific SMT systems. However, translating texts from a specific domain (e.g., medicine) is full of challenges. The first challenge is ambiguity: words or phrases have different meanings in different contexts. The second is language style, since texts from different genres differ in syntax, length, and structural organization. The third is the out-of-vocabulary (OOV) word problem: in-domain training data are often scarce, with low terminology coverage. In this thesis, we explore the state-of-the-art domain adaptation approaches and propose effective solutions to address those problems.
Book
This volume provides an overview of the field of Hybrid Machine Translation (MT) and presents some of the latest research conducted by linguists and practitioners from different multidisciplinary areas. Nowadays, most important developments in MT are achieved by combining data-driven and rule-based techniques. These combinations typically involve hybridization of different traditional paradigms, such as the introduction of linguistic knowledge into statistical approaches to MT, the incorporation of data-driven components into rule-based approaches, or statistical and rule-based pre- and post-processing for both types of MT architectures. The book is of interest primarily to MT specialists, but also – in the wider fields of Computational Linguistics, Machine Learning and Data Mining – to translators and managers of translation companies and departments who are interested in recent developments concerning automated translation tools.
Chapter
This chapter contains a general introduction to the topic of the present book. It presents the current challenges of Machine Translation (MT), in particular for languages where only a limited amount of specialised resources is readily available. To that end, a comprehensive review of the state-of-the-art in MT is performed. Focus is placed on related work on MT methodologies that are portable to new language pairs, and issues such as stability and extensibility are emphasised. It is widely accepted that language portability necessitates an algorithmic approach to extract information from large corpora in an unsupervised manner. This includes both Statistical MT (SMT) and Example-based MT (EBMT). Here, a review of the strengths and shortcomings of the different approaches is performed, in terms of the a priori externally-provided linguistic knowledge and required specialised resources. This review leads to the concept of the proposed MT methodology.
Conference Paper
In this paper, we propose a graph-based translation model which takes advantage of discontinuous phrases. The model segments a graph which combines bigram and dependency relations into subgraphs and produces translations by combining translations of these subgraphs. Experiments on Chinese‐English and German‐English tasks show that our system is significantly better than the phrase-based model. By explicitly modeling the graph segmentation, our system gains further improvement.
Article
We describe a series of five statistical models of the translation process and give algorithms for estimating the parameters of these models given a set of pairs of sentences that are translations of one another. We define a concept of word-by-word alignment between such pairs of sentences. For any given pair of such sentences each of our models assigns a probability to each of the possible word-by-word alignments. We give an algorithm for seeking the most probable of these alignments. Although the algorithm is suboptimal, the alignment thus obtained accounts well for the word-by-word relationships in the pair of sentences. We have a great deal of data in French and English from the proceedings of the Canadian Parliament. Accordingly, we have restricted our work to these two languages; but we feel that because our algorithms have minimal linguistic content they would work well on other pairs of languages. We also feel, again because of the minimal linguistic content of our algorithms, that it is reasonable to argue that word-by-word alignments are inherent in any sufficiently large bilingual corpus.
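To make the idea of estimating word-translation parameters from sentence pairs concrete, here is a minimal sketch of expectation-maximization training in the style of the simplest of these models (commonly called IBM Model 1). It is a simplification under stated assumptions: there is no NULL word, initialization is uniform, and the names `train_model1` and `best_alignment` are hypothetical, not taken from the paper.

```python
from collections import defaultdict


def train_model1(pairs, iterations=10):
    """Estimate t[(f, e)] = P(f | e) from (f_sentence, e_sentence) pairs.

    Simplified sketch: no NULL word, uniform initialization.
    """
    f_vocab = {f for f_sent, _ in pairs for f in f_sent}
    t = defaultdict(lambda: 1.0 / len(f_vocab))
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (f, e) links
        total = defaultdict(float)  # expected count of links from e
        for f_sent, e_sent in pairs:
            for f in f_sent:
                # E-step: share one unit of probability for f among the e words.
                norm = sum(t[(f, e)] for e in e_sent)
                for e in e_sent:
                    frac = t[(f, e)] / norm
                    count[(f, e)] += frac
                    total[e] += frac
        # M-step: renormalize the expected counts per conditioning word e.
        t = defaultdict(float, {(f, e): c / total[e] for (f, e), c in count.items()})
    return t


def best_alignment(t, f_sent, e_sent):
    """For each f word, pick the index of the most probable e word."""
    return [max(range(len(e_sent)), key=lambda j: t[(f, e_sent[j])]) for f in f_sent]


# Toy corpus (made up for illustration).
corpus = [
    (["das", "haus"], ["the", "house"]),
    (["das", "buch"], ["the", "book"]),
    (["ein", "buch"], ["a", "book"]),
]
table = train_model1(corpus)
links = best_alignment(table, ["das", "haus"], ["the", "house"])
```

Under this model each word links independently of the others, so taking the per-word maximum yields the most probable alignment exactly; the suboptimal search the abstract mentions becomes necessary only for the richer models in the series.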
Article
In this work, the use of a phrasal lexicon for statistical machine translation is proposed, and the relation between data acquisition costs and translation quality for different types and sizes of language resources has been analyzed. The language pairs are Spanish-English and Catalan-English, and the translation is performed in all directions. The phrasal lexicon is used to increase as well as to replace the original training corpus. The augmentation of the phrasal lexicon with the help of additional monolingual language resources containing morpho-syntactic information has been investigated for the translation with scarce training material. Using the augmented phrasal lexicon as additional training data, a reasonable translation quality can be achieved with only 1000 sentence pairs from the desired domain.
Article
Translation systems that automatically extract transfer mappings (rules or examples) from bilingual corpora have been hampered by the difficulty of achieving accurate alignment and acquiring high quality mappings. We describe an algorithm that uses a best-first strategy and a small alignment grammar to significantly improve the quality of the transfer mappings extracted. For each mapping, frequencies are computed and sufficient context is retained to distinguish competing mappings during translation. Variants of the algorithm are run against a corpus containing 200K sentence pairs and evaluated based on the quality of resulting translations.
Article
We introduce (1) a novel stochastic inversion transduction grammar formalism for bilingual language modeling of sentence-pairs, and (2) the concept of bilingual parsing with a variety of parallel corpus analysis applications. Aside from the bilingual orientation, three major features distinguish the formalism from the finite-state transducers more traditionally found in computational linguistics: it skips directly to a context-free rather than finite-state base, it permits a minimal extra degree of ordering flexibility, and its probabilistic formulation admits an efficient maximum-likelihood bilingual parsing algorithm. A convenient normal form is shown to exist. Analysis of the formalism's expressiveness suggests that it is particularly well suited to modeling ordering shifts between languages, balancing needed flexibility against complexity constraints. We discuss a number of examples of how stochastic inversion transduction grammars bring bilingual constraints to bear upon problematic corpus analysis tasks such as segmentation, bracketing, phrasal alignment, and parsing.
Article
In the last ten to fifteen years there has been a significant amount of research in Machine Translation within a ‘new’ paradigm of empirical approaches, often labelled collectively as ‘Example-based’ approaches. The first manifestation of this approach caused some surprise and hostility among observers more used to different ways of working, but the techniques were quickly adopted and adapted by many researchers, often creating hybrid systems. This paper reviews the various research efforts within this paradigm reported to date, and attempts a categorisation of different manifestations of the general approach. This paper first appeared in 1999 in Machine Translation 14:113-157. It has been updated with a small number of revisions, and references to more recent work.