Conference Paper

Using Bilingual Segments in Generating Word-to-word Translations

Authors:
  • Prasanna School of Public Health

Abstract

We argue that bilingual lexicons automatically extracted from parallel corpora, whose entries have since been validated by linguists and classified as correct or incorrect, should constitute a specific parallel corpus. In this paper, we propose to use word-to-word translations to learn morph-units (comprising bilingual stems and suffixes) from those bilingual lexicons for two language pairs, L1-L2 and L1-L3, in order to induce a bilingual lexicon for the language pair L2-L3, and also to learn morph-units for this third language pair. The applicability of bilingual morph-units in L1-L2 and L1-L3 is examined from the perspective of pivot-based lexicon induction for the language pair L2-L3 with L1 as bridge. While the lexicon is derived by transitivity, the correspondences are identified based on previously learnt bilingual stems and suffixes rather than surface translation forms. The induced pairs are validated using a binary classifier trained on morphological and similarity-based features using an existing, automatically acquired, manually validated bilingual translation lexicon for the language pair L2-L3. Specifically, we discuss the use of English (EN)-French (FR) and English (EN)-Portuguese (PT) lexicons of word-to-word translations in generating word-to-word translations for the language pair FR-PT with EN as pivot language. Generated translations are first filtered using an SVM-based FR-PT classifier and are then manually validated.
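The idea of pivoting on bilingual stems and suffixes, rather than on surface forms, can be illustrated with a minimal sketch. It assumes toy EN-FR and EN-PT morph-unit tables; every entry, table name and function name below is hypothetical and is not taken from the paper's lexicons or implementation.

```python
from itertools import product

# Hypothetical toy morph-unit tables keyed by the EN (pivot) stem or suffix.
EN_FR_STEMS = {"declar": {"déclar"}}
EN_PT_STEMS = {"declar": {"declar"}}
EN_FR_SUFFIXES = {"ed": {"é", "ée"}, "ing": {"ant"}}
EN_PT_SUFFIXES = {"ed": {"ado", "ada"}, "ing": {"ando"}}


def pivot(left, right):
    """Pair every left-side unit with every right-side unit sharing a pivot key."""
    pairs = set()
    for key in left.keys() & right.keys():
        pairs.update(product(left[key], right[key]))
    return pairs


def induce_candidates():
    """Build FR-PT word candidates by joining pivoted stems with pivoted suffixes."""
    stem_pairs = pivot(EN_FR_STEMS, EN_PT_STEMS)
    suffix_pairs = pivot(EN_FR_SUFFIXES, EN_PT_SUFFIXES)
    return {
        (fr_stem + fr_suffix, pt_stem + pt_suffix)
        for (fr_stem, pt_stem), (fr_suffix, pt_suffix) in product(stem_pairs, suffix_pairs)
    }


candidates = induce_candidates()
for fr_word, pt_word in sorted(candidates):
    print(fr_word, "-", pt_word)
```

In this toy setting the candidate set contains plausible pairs such as déclarant - declarando, but also gender-mismatched pairs such as déclarée - declarado, which is the kind of over-generation a subsequent classification and manual validation step would be expected to filter.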

Conference Paper
We discuss approaches for improving bilingual lexicon coverage by automatically suggesting translations for Out-Of-Vocabulary (OOV) terms, employing existing validated bilingual lexicon entries. Resource-poor languages such as Hindi, Konkani and Sanskrit, characterized by highly inflectional morphology, were employed in our experiments. Known surface translations are mined for morphological similarities, and the bilingual morphemes thus learnt are used in suggesting word-to-word and phrase translations. In addition, word-to-word translations are generated for the language pair Hindi-Sanskrit by pivoting bilingual stems and suffixes, with Konkani and English as bridge languages, the former a morphologically rich language and the latter a morphologically poor one.
Conference Paper
Bilingual dictionaries are vital in many areas of natural language processing, but such resources are rarely available for lower-density language pairs, especially for those that are closely related. Pivot-based induction consists of using a third language to bridge a language pair. As an approach to creating new dictionaries, it can generate wrong translations due to polysemy and ambiguous words. In this paper we propose a constraint-based approach to pivot-based dictionary induction for the case of two closely related languages. In order to take word senses into account, we use an approach based on semantic distances, in which possibly missing translations are considered, and each instance of induction is encoded as an optimization problem to generate a new dictionary. Evaluations show that the proposal achieves 83.7% accuracy and approximately 70.5% recall, thus outperforming the baseline pivot-based method.
Conference Paper
High quality bilingual dictionaries are rarely available for lower-density language pairs, especially for those that are closely related. Using a third language as a pivot to link two other languages is a well-known solution, and usually requires only two input bilingual dictionaries to automatically induce the new one. This approach, however, produces many incorrect translation pairs because the dictionary entries are normally not transitive due to polysemy and ambiguous words in the pivot language. Utilizing the complete structures of the input bilingual dictionaries positively influences the result since dropped meanings can be countered. Moreover, an additional input dictionary may provide more complete information for calculating the semantic distance between word senses, which is key to suppressing wrong sense matches. This paper proposes an extended constraint optimization model for inducing new dictionaries of closely related languages from multiple input dictionaries, and its formalization based on Integer Linear Programming. Evaluations indicated that the proposal not only outperforms the baseline method, but also shows improvements in performance and scalability as more dictionaries are utilized.
Conference Paper
High quality machine-readable dictionaries are very useful, but such resources are rarely available for lower-density language pairs, especially for those that are closely related. In this paper, we propose a heuristic framework that aims at inducing a one-to-one mapping dictionary for a closely related language pair from available dictionaries in which a distant language is involved. The key insight of the framework is the ability to create heuristics by using the distant language as pivot, to incorporate given heuristics, and to provide an iterative induction mechanism into which human interaction can potentially be integrated. An experiment based on basic syntactic and semantic heuristics resulted in up to 85.2% correctness in the target dictionary, with correctness for the major part reaching 95.3%, which shows that a high quality dictionary can be created automatically with our framework.
Article
We describe a series of five statistical models of the translation process and give algorithms for estimating the parameters of these models given a set of pairs of sentences that are translations of one another. We define a concept of word-by-word alignment between such pairs of sentences. For any given pair of such sentences each of our models assigns a probability to each of the possible word-by-word alignments. We give an algorithm for seeking the most probable of these alignments. Although the algorithm is suboptimal, the alignment thus obtained accounts well for the word-by-word relationships in the pair of sentences. We have a great deal of data in French and English from the proceedings of the Canadian Parliament. Accordingly, we have restricted our work to these two languages; but we feel that because our algorithms have minimal linguistic content they would work well on other pairs of languages. We also feel, again because of the minimal linguistic content of our algorithms, that it is reasonable to argue that word-by-word alignments are inherent in any sufficiently large bilingual corpus.
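As a rough illustration of estimating word-translation parameters from sentence pairs and then reading off a word-by-word alignment, the following sketch assumes the simplest setting, in which every alignment position is equally likely a priori and no empty (NULL) source word is modelled. The function names and the two-sentence toy corpus are hypothetical; this is not the authors' implementation and covers none of the more elaborate models.

```python
from collections import defaultdict


def train_lexical_model(pairs, iterations=10):
    """Estimate t(f | e) by EM from tokenised (English, French) sentence pairs."""
    french_vocab = {f for _, fs in pairs for f in fs}
    t = defaultdict(lambda: 1.0 / len(french_vocab))  # uniform initialisation
    for _ in range(iterations):
        count = defaultdict(float)   # expected count of (f, e) links
        total = defaultdict(float)   # expected count of links leaving e
        for es, fs in pairs:
            for f in fs:
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    fraction = t[(f, e)] / norm
                    count[(f, e)] += fraction
                    total[e] += fraction
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t


def best_alignment(es, fs, t):
    """For each French word, return the English word with the highest t(f | e)."""
    return [max(es, key=lambda e: t[(f, e)]) for f in fs]


PAIRS = [
    (["the", "house"], ["la", "maison"]),
    (["the", "flower"], ["la", "fleur"]),
]
t = train_lexical_model(PAIRS)
alignment = best_alignment(["the", "house"], ["la", "maison"], t)
print(alignment)
```

Under this simplified model the most probable alignment can be found by choosing the best English word independently for each French word. Richer models with distortion or fertility parameters do not decompose this way, which is consistent with the abstract's remark that the search algorithm is suboptimal.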
Conference Paper
An A-C bilingual dictionary can be inferred by merging A-B and B-C dictionaries using B as pivot. However, polysemous pivot words often produce wrong translation candidates. This paper analyzes two methods for pruning wrong candidates: one based on exploiting the structure of the source dictionaries, and the other based on distributional similarity computed from comparable corpora. As both methods depend exclusively on easily available resources, they are well suited to less-resourced languages. We studied whether these two techniques complement each other, given that they are based on different paradigms. We also investigated how to combine them, looking for the combination best suited to various application scenarios.
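The merging step and the effect of a polysemous pivot word can be shown with a small sketch. The dictionaries below are hypothetical toy data (A = French, B = English, C = Portuguese), and recording the set of linking pivot words is only an illustrative device; it is not one of the two pruning methods analyzed in the paper.

```python
from collections import defaultdict

# Hypothetical toy dictionaries: A -> B and B -> C, with B as the pivot language.
A_TO_B = {"banque": {"bank"}, "rive": {"bank", "shore"}}
B_TO_C = {"bank": {"banco", "margem"}, "shore": {"margem"}}


def merge_via_pivot(a_to_b, b_to_c):
    """Return {(a, c): set of pivot words b that link a to c}."""
    links = defaultdict(set)
    for a, pivots in a_to_b.items():
        for b in pivots:
            for c in b_to_c.get(b, ()):
                links[(a, c)].add(b)
    return dict(links)


candidates = merge_via_pivot(A_TO_B, B_TO_C)
for (a, c), pivots in sorted(candidates.items()):
    print(a, "-", c, "via", sorted(pivots))
```

Because "bank" is polysemous, the merge yields the wrong candidates banque - margem and rive - banco alongside the correct ones. In this toy data the correct pair rive - margem is the only one supported by two pivot words, which hints at why structural information in the source dictionaries can help with pruning.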
Conference Paper
Recently the LATL has undertaken the development of a multilingual translation system based on a symbolic parsing technology and on a transfer-based translation model. A crucial component of the system is the lexical database, notably the bilingual dictionaries containing the information for the lexical transfer from one language to another. As the number of necessary bilingual dictionaries is a quadratic function of the number of languages considered, we face the problem of obtaining a large number of dictionaries. In this paper we discuss a solution to derive a bilingual dictionary by transitivity using existing ones and to check the generated translations in a parallel corpus. Our first experiments concern the generation of two bilingual dictionaries, and the quality of the entries is very promising. The number of generated entries could, however, be improved, and we conclude the paper with the possible ways we plan to explore.
Conference Paper
Machine Translation tasks must tackle the ever-increasing sizes of parallel corpora, requiring space and time efficient solutions to support them. Several approaches were developed based on full-text indices, such as suffix arrays, with important time and space achievements. However, for supporting bilingual tasks, the search time efficiency of such indices can be improved using an extra layer for the text alignment. Additionally, their space requirements can be significantly reduced using more compact indices. We propose a search procedure on top of a compact bilingual framework that improves bilingual search response time, while having a space efficient representation of aligned parallel corpora.
Conference Paper
In this paper, we address term translation extraction from indexed aligned parallel corpora, using a pair of association measures combined by a voting scheme to scale down translation pairs according to their degree of internal cohesiveness, and we evaluate the results obtained. The precision obtained is clearly much better than that reported in related work for the very low range of occurrences we have dealt with, and is comparable with the best results obtained in word translation.
Conference Paper
Bilingual lexicons are fundamental resources. Modern automated lexicon generation methods usually require parallel corpora, which are not available for most language pairs. Lexicons can be generated using non-parallel corpora or a pivot language, but such lexicons are noisy. We present an algorithm for generating a high quality lexicon from a noisy one, which only requires an independent corpus for each language. Our algorithm introduces non-aligned signatures (NAS), a cross-lingual word context similarity score that avoids the over-constrained and inefficient nature of alignment-based methods. We use NAS to eliminate incorrect translations from the generated lexicon. We evaluate our method by improving the quality of noisy Spanish-Hebrew lexicons generated from two pivot English lexicons. Our algorithm substantially outperforms other lexicon generation methods.
Conference Paper
This paper proposes a method of constructing a dictionary for a pair of languages from bilingual dictionaries between each of the languages and a third language. Such a method would be useful for language pairs for which wide-coverage bilingual dictionaries are not available, but it suffers from spurious translations caused by the ambiguity of intermediary third-language words. To eliminate spurious translations, the proposed method uses the monolingual corpora of the first and second languages, whose availability is not as limited as that of parallel corpora. Extracting word associations from the corpora of both languages, the method correlates the associated words of an entry word with its translation candidates. It then selects translation candidates that have the highest correlations with a certain percentage or more of the associated words. The method has the following features. First, it produces a domain-adapted bilingual dictionary. Second, the resulting bilingual dictionary, which provides not only translations but also associated words supporting each translation, enables contextually based selection of translations. Preliminary experiments using the EDR Japanese-English and LDC Chinese-English dictionaries together with Mainichi Newspaper and Xinhua News Agency corpora demonstrate that the proposed method is viable. The recall and precision could be improved by optimizing the parameters.
Article
This issue's expert guest column is by Eric Allender, who has just taken over the Structural Complexity Column in the Bulletin of the EATCS. Regarding "Journals to Die For" (SIGACT News Complexity Theory Column 16), Joachim von zur Gathen, ...
Measuring Spelling Similarity for Cognate Identification
  • Luís Gomes
  • José Gabriel Pereira Lopes
Luís Gomes and José Gabriel Pereira Lopes. 2011. Measuring Spelling Similarity for Cognate Identification. In Progress in Artificial Intelligence - 15th Portuguese Conference on Artificial Intelligence, EPIA 2011, pages 624-633, Lisbon, Portugal, October. Springer.
Identification of bilingual segments for translation generation
  • Kavitha Karimbi Mahesh
  • Luís Gomes
  • José Gabriel Pereira Lopes
Kavitha Karimbi Mahesh, Luís Gomes, and José Gabriel Pereira Lopes. 2014a. Identification of bilingual segments for translation generation. In Advances in Intelligent Data Analysis XIII, volume 8819 of LNCS, pages 167-178. Springer International Publishing.
Sampling-based multilingual alignment
  • Adrien Lardilleux
  • Yves Lepage
Adrien Lardilleux and Yves Lepage. 2009. Sampling-based multilingual alignment. In Proceedings of Recent Advances in Natural Language Processing, pages 214-218.
Automatic construction of a transfer dictionary considering directionality
  • Kyonghee Paik
  • Satoshi Shirai
  • Hiromi Nakaiwa
Kyonghee Paik, Satoshi Shirai, and Hiromi Nakaiwa. 2004. Automatic construction of a transfer dictionary considering directionality. In Proceedings of the Workshop on Multilingual Linguistic Ressources, pages 31-38. Association for Computational Linguistics.
Construction of a bilingual dictionary intermediated by a third language
  • Kumiko Tanaka
  • Kyoji Umemura
Kumiko Tanaka and Kyoji Umemura. 1994. Construction of a bilingual dictionary intermediated by a third language. In Proceedings of the 15th Conference on Computational Linguistics - Volume 1, COLING '94, pages 297-303, Stroudsburg, PA, USA. Association for Computational Linguistics.
Pivot-based multilingual dictionary building using wiktionary
  • Judit Ács
Judit Ács. 2014. Pivot-based multilingual dictionary building using wiktionary. In LREC, pages 1938-1942. ELRA.