Figure 1 - uploaded by Adam Pease
SUMO mapping to wordnets 

Source publication
Article
Full-text available
This paper introduces a recently initiated project that focuses on building a lexical resource for Modern Standard Arabic based on the widely used Princeton WordNet for English (Fellbaum, 1998). Our aim is to develop a linguistic resource with a deep formal semantic foundation in order to capture the richness of Arabic as described in Elkateb (2005...

Context in source publication

Context 1
... given an English synset, all corresponding Arabic variants (if any) will be selected; given an Arabic word, all its senses are determined, and for each of them the corresponding English synset is encoded. The Arabic synsets will be extended with hypernym relations to form a closed semantic hierarchy. SUMO will be used to maximize the semantic consistency of the hyponymy links. This will represent the core wordnet, which is a semantic basis for further extension. The work is mostly done manually.

When a new Arabic verb is added, extensions are made from verbal entries, including verbal derivatives, nominalizations, verbal nouns, and so on. We also consider the most productive forms of deriving broken plurals. This is done by applying lexical and morphological rules iteratively. The database is further extended downward from the CBCs. First, a layer of hyponyms is chosen based on maximal connectivity, relevance, and generality.

Two major pre-processing steps are required: preparation and extension. Preparation entails compiling lexical and morphological rules and processing available bilingual resources, from which we construct a homogeneous bilingual dictionary containing information on each Arabic/English word pair. This information includes the Arabic root, the POS, the relative frequencies, and the sources supporting the pairing. The Arabic words in these bilingual resources must also be normalized and lemmatized while maintaining vowels and diacritics.

We next apply 17 heuristic procedures, previously used for EWN, to the bilingual dictionary in order to derive candidate Arabic word/English synset mappings. Each mapping includes the Arabic word and root, the English synset, the POS, the relative frequencies, a mapping score, the absolute depth in AWN, the number of gaps between the synset and the top of the AWN hierarchy, and attested tokens of the pair. The Arabic word/English synset pairs constitute the input to a manual validation process.
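The preparation-and-mapping pipeline just described can be sketched roughly as follows. Everything here — the toy dictionary entries, the synset index, and the single scoring heuristic — is an illustrative assumption standing in for the project's normalized bilingual resources and its 17 EWN-derived procedures.

```python
from collections import defaultdict

# Toy bilingual dictionary: Arabic word -> English translations, with
# the number of bilingual resources supporting each pairing.
# (Invented data; the real project merges several normalized resources.)
BILINGUAL = {
    "shai": {"tea": 3},
    "kitab": {"book": 4, "volume": 1},
}

# Toy Princeton WordNet index: English word -> synset identifiers.
PWN_INDEX = {
    "tea": ["tea.n.01", "tea.n.02"],
    "book": ["book.n.01"],
    "volume": ["book.n.01", "volume.n.03"],
}

def candidate_mappings(bilingual, pwn_index):
    """Score candidate Arabic-word/English-synset pairs.

    One simple heuristic (an assumption, standing in for the 17 EWN
    procedures): a synset reachable through several translations, or
    through a well-attested translation, gets a higher score.
    """
    scores = defaultdict(float)
    for ar_word, translations in bilingual.items():
        for en_word, support in translations.items():
            synsets = pwn_index.get(en_word, [])
            for syn in synsets:
                # Spread the support over the English word's senses,
                # so ambiguous translations contribute less per synset.
                scores[(ar_word, syn)] += support / len(synsets)
    return sorted(scores.items(), key=lambda kv: -kv[1])

for (ar, syn), score in candidate_mappings(BILINGUAL, PWN_INDEX):
    print(f"{ar} -> {syn}: {score:.2f}")
```

The ranked pairs would then feed the manual validation step described above.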
We proceed by chunks of related units (sets of related WN synsets, e.g. hyponymy chains, and sets of related Arabic words, i.e., words sharing the same root) instead of individual units (i.e., synsets, senses, words). Finally, AWN will be completed by filling in the gaps in its structure, covering specific domains, adding terminology and named entities, etc. Each synset construction step is followed by a validation phase, where formal consistency is checked and coverage is evaluated in terms of frequency of occurrence and domain distribution. The total coverage of AWN will be around 10,000 synsets.

Tools to be developed for AWN include a lexicographer's interface modeled on the EWN interface, with added facilities for Arabic script. Because AWN is to be aligned not just to PWN but to every wordnet aligned to PWN – either directly or indirectly through an Interlingual Index or the ontology – the database design supports multiple languages. The user interface will be explicitly multilingual and indifferent to the direction of alignment between the conceptual structures of the two languages. In addition to search and browsing facilities for the end users of the completed database, lexicographers require an editing interface. A variety of legacy components are available, each with its relative advantages. The editor's interface will communicate with the database server using the Simple Object Access Protocol (SOAP), allowing multiple lexicographers at different sites to maintain a common database.

The database structure comprises four principal entity types: item, word, form and link. Items are conceptual entities, including synsets, ontology classes and instances. An item has a unique identifier and descriptive information such as a gloss. Items lexicalized in different languages are distinct. A word entity is a word sense, where the word's citation form is associated with an item via its identifier.
A form is an entity that contains lexical information (not merely inflectional variation). The forms are the root and/or the broken plural form, where applicable. A link relates two items and has a type such as "equivalence," "subsuming," etc. Links interconnect sense items, e.g., a PWN synset to an AWN synset, a synset to a SUMO concept, etc. This data model has been specified in XML as an interchange format, but is also implemented in a MySQL database hosted by one of the partners.

A large ontology providing the semantic underpinning for AWN concepts will be built on SUMO, a formal ontology currently comprising about 1,000 terms and 4,000 definitional statements. SUMO is provided in a first-order logic language called the Standard Upper Ontology Knowledge Interchange Format (SUO-KIF) and is also translated into the OWL semantic web language. SUMO has natural language generation templates and a multilingual lexicon that allow statements in SUO-KIF and SUMO to be expressed in multiple languages. Synsets map either to a general SUMO term or to a term that is directly equivalent to the given synset (Figure 1). New formal terms will be defined to cover a greater number of equivalence mappings, and the definitions of the new terms will in turn depend upon existing fundamental concepts in SUMO. The process of formalizing definitions will generate feedback as to whether word senses in AWN need to be divided or combined and how glosses may be clarified. Wordnets in other languages linked by synset number will benefit, too. The Sigma ontology development environment will be updated to handle a similar presentation of Unicode-based character sets, including Arabic.

The Interlingual Index (ILI) connecting EWN wordnets is a condensed set of more or less universal concepts linking synsets across languages via multiple exhaustive equivalence relations. In EuroWordNet and BalkaNet, English PWN has been used to express equivalence relations across the different languages.
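The four entity types can be illustrated with simple records. The field names and the example identifiers (including the SUMO term name) are assumptions for illustration, not the project's actual XML interchange format or MySQL schema.

```python
from dataclasses import dataclass

@dataclass
class Item:
    """A conceptual entity: a synset, an ontology class, or an instance."""
    item_id: str
    gloss: str

@dataclass
class Word:
    """A word sense: a citation form tied to an item by its identifier."""
    citation_form: str
    item_id: str

@dataclass
class Form:
    """Lexical (not merely inflectional) information: root, broken plural."""
    word: str
    root: str = ""
    broken_plural: str = ""

@dataclass
class Link:
    """A typed relation between two items."""
    source_id: str
    target_id: str
    link_type: str  # e.g. "equivalence", "subsuming"

# Example: link an AWN synset to a PWN synset and to a SUMO concept.
# All identifiers below are hypothetical.
tea_awn = Item("awn:shai.n.1", "Arabic synset for tea")
tea_pwn = Item("pwn:tea.n.01", "a beverage made from tea leaves")
tea_sumo = Item("sumo:Tea", "hypothetical SUMO term for tea")

links = [
    Link(tea_awn.item_id, tea_pwn.item_id, "equivalence"),
    Link(tea_awn.item_id, tea_sumo.item_id, "equivalence"),
]
```

Because a link is just a typed pair of item identifiers, the same table can hold synset-to-synset equivalences and synset-to-ontology mappings alike, which matches the multilingual database design described above.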
By providing many SUMO definitions and terms that correspond to Arabic synsets, we will create the opportunity to use SUMO as the ILI for all wordnets that are currently related to PWN. This is illustrated in Figure 2. If the Arabic word sense for shai is exhaustively defined by relations to SUMO terms, this definition can replace an equivalence relation (er1) that is currently encoded between the Arabic synset shai and a synset tea in PWN. Note that the relations from shai to the SUMO terms need to be exhaustive, which may require multiple relations of different types (sr1 (subsumption), r2, r3) to multiple SUMO terms. If there are also equivalence relations from other languages (e.g. Dutch and Spanish) to the same PWN synset, then these relations guarantee the linkage of the synsets in those languages to the same SUMO definition. Besides providing a formal semantic framework, SUMO can thus also be used to map synsets across languages, even when there is no English equivalent. By composing formal definitions for the non-English synsets, SUMO as an ILI will not only be less biased toward English but will also have more expressive power.

Constructing AWN presents challenges not encountered by established wordnets. These include the script on the one hand and the morphological properties of Semitic languages, centered around roots, on the other. The foundations for meeting these challenges have been laid. An innovation with significant consequences for wordnet development is the proposal to substitute SUMO for English WN as the ILI.

This work was supported by the United States Central Intelligence Agency.

References

Beesley, K. (2001) Finite-State Morphological Analysis and Generation of Arabic at Xerox, ACL/EACL 2001, July 6th, Toulouse, France: 1-8.

Black, W., Elkateb, S., Rodriguez, H., Alkhalifa, M., Vossen, P., Pease, A. and Fellbaum, C. (2006)
Introducing the Arabic WordNet Project, in Proceedings of the Third International WordNet Conference, Sojka, Choi, Fellbaum and Vossen (eds.).

Black, W. J., and Elkateb, S. (2004) A Prototype English-Arabic Dictionary Based on WordNet, Proceedings of the 2nd Global WordNet Conference, GWC2004, Czech Republic: 67-74.

Buckwalter, T. (2002) Arabic Morphological Analysis, http://www.qamus.org/morphology.htm

De Roeck, A., and Al-Fares, W. (2000) A Morphologically Sensitive Clustering Algorithm for Identifying Arabic Roots, Proceedings of the 38th Annual Meeting of the ACL, Hong Kong: 199-206.

Dyvik, H. (2003) Translations as a semantic knowledge source: word alignment and wordnet, Section for Linguistic Studies scientific papers, University of Bergen.

Dyvik, H. (2002) Translations as Semantic Mirrors: From Parallel Corpus to Wordnet, Section for Linguistic Studies scientific papers, University of Bergen.

Elkateb, S. and Black, W. J. (2001) Towards the Design of English-Arabic Terminological Knowledge Base, Proceedings of ACL 2000, Toulouse, France: 113-118.

Elkateb, S. and Black, W. J. (2004) A Bilingual Dictionary with Enriched Lexical Information, Proceedings of NEMLAR 2004, Cairo, Egypt, Arabic Language Tools and Resources: 79-84.

Elkateb, S. (2005) Design and implementation of an English Arabic dictionary/editor. PhD thesis, The University of Manchester, United Kingdom.

Farreres, J. (2005) Creation of wide-coverage domain-independent ontologies. PhD thesis, Universitat Politècnica de Catalunya.

Fellbaum, C. (ed.) (1998) WordNet: An Electronic Lexical Database. Cambridge, MA: MIT Press.

Niles, I., and Pease, A. (2001) Towards a Standard Upper Ontology. In: Proceedings of FOIS 2001, Ogunquit, Maine: 2-9.

Pease, A. (2000) Standard Upper Ontology Knowledge Interchange Format. Web document.

Pease, A. (2003) The Sigma Ontology Development Environment, in Working Notes of the IJCAI-2003 Workshop on Ontology and Distributed Systems,
Volume 71 of the CEUR Workshop Proceedings series.

Pustejovsky, J. (1995) The Generative Lexicon. Cambridge, MA: MIT Press.

Tufis, D. (ed.) (2004) Special Issue on the BalkaNet Project. Romanian Journal of Information Science and Technology, Vol. 7, nos. 1-2.

Vossen, P. (ed.) (1998) ...

Similar publications

Article
In this paper we define an evolution mechanism with formal semantics using the metamodeling methodology [Geisler et al.98] based on dynamic logic. A remarkable feature of the metamodeling methodology is the ability to define the relation of intentional and extensional entities within one level, allowing not only for the description of structural re...
Conference Paper
Ontology development and maintenance are complex tasks, so automatic tools are essential for a successful integration between the modeller's intention and the formal semantics in an ontology. In this paper we present a methodology for ontology evolution specifically designed for being used in ontology design tools. It exploits the ontology graphica...

Citations

... In order to perform fine-grained processing of Arabic text, our annotation tool combines state-of-the-art NLP resources such as the Buckwalter Arabic Morphological Analyzer (BAMA) [28] for lexico-morphological analysis, Arabic WordNet (AWN) [29] for word disambiguation, and the FrameNet database for semantic role analysis. ...
... After this, many other researchers started working on WordNets for different languages. The French WordNet WOLF (Sagot and Fiser, 2008), Arabic WordNet (AWN) (ElKateb et al., 2006), Polish WordNet (Derwojedowa et al., 2008), Japanese WordNet (Isahara et al., 2008), Finnish WordNet FinnWordNet (Linden and Carlson, 2010), Norwegian WordNet (Fjeld and Nygaard, 2009) and Danish WordNet (Pedersen et al., 2009) are a few examples of these works. There are also projects that link WordNets of different languages to create a multilingual WordNet, such as EuroWordNet (EWN) (Vossen, 2007), MultiWordNet (Pianta et al., 2002) and BalkaNet (Tufis et al., 2004). ...
Conference Paper
Wordnets have been popular tools for providing and representing the semantic and lexical relations of languages. They are useful for various purposes in NLP studies. Many researchers have created WordNets for different languages. For Turkish, there are two WordNets, namely the Turkish WordNet of BalkaNet and KeNet. In this paper, we present new WordNets for Turkish, each based on one of the first 9 editions of the Turkish dictionary, starting from the 1944 edition. These WordNets are historical in nature and have implications for Modern Turkish. They were developed by extending KeNet, which was created based on the 2005 and 2011 editions of the Turkish dictionary. In this paper, we explain the steps in creating these 9 new WordNets for Turkish, discuss the challenges in the process and report comparative results about the WordNets.
... Arabic WordNet (AWN) is one of the resources that have been developed to address these characteristics. It is based on the Princeton WordNet (PWN) and includes 23,481 words and 11,269 synsets (Elkateb et al., 2006). AWN is often used in concept-based IR, which allows the retrieval of documents based on the meaning expressed by the terms of the query. ...
Article
In this paper, the authors propose and adapt a new concept-based approach to query expansion in the context of Arabic information retrieval. The purpose is to represent the query by a set of weighted concepts in order to better identify the user's information need. First, concepts are extracted from the initially retrieved documents by the Pseudo-Relevance Feedback method; they are then integrated into a weighted semantic tree in order to detect more of the information contained in the related concepts connected by semantic relations to the primary concepts. The authors use the "Arabic WordNet" as a resource to extract and disambiguate concepts and to build the semantic tree. Experimental results demonstrate an improvement of about 10% in MAP (Mean Average Precision), using the open-source Lucene as the IR system on a collection formed from Arabic BBC news.
... However, we added another source to search for more synonyms, WikiSynonyms, which relies on extracting synonyms from Wikipedia pages. If the detected language is Arabic, we implement synonym extraction using WikiSynonyms and Arabic WordNet (AWN) [28]. AWN is an XML file acting as a database containing Arabic words along with their synsets, antonyms, and hyponyms. ...
Article
Much research has been conducted on data augmentation (DA) techniques, i.e., transformations applied to a given training set to produce more data synthetically. While DA is often used in computer vision and speech recognition, it is not very common in Natural Language Processing, especially in the Named Entity Recognition (NER) task, i.e., identifying named entities. To the best of our knowledge, it has also not been applied to NER on challenging Code-Switching (CS) data, i.e., text containing more than one language in the same sentence. This paper presents several practical and easy-to-implement data augmentation techniques to improve Arabic NER, especially on CS data, based on word embedding substitution, a modified version of the Easy Data Augmentation technique, and back-translation. We demonstrate through several experiments on an available AR-EN CS dataset that the proposed methods boost the performance of the NER task on CS data, with an F-score increase of 1.5%. The proposed DA methods also eliminate the time and effort of collecting and labeling new data for low-resource NER tasks.
... For example, {tree} is a hyponym of {plant} and {plant} is a hypernym of {tree}. The PWN has also been extended to other languages, including Arabic [28]. The Arabic WordNet is used in this study as a standard resource for providing synonyms of student-answer words without changing their meanings, which improves the accuracy of the similarity between the student answer and the model answer. ...
Article
The manual process of scoring short answers to Arabic essay questions is exhausting, susceptible to error, and consumes the instructor's time and resources. This paper explores the longest common subsequence (LCS) algorithm as a string-based text similarity measure for effectively scoring short answers to Arabic essay questions. To achieve this, the longest common subsequence is modified by developing weight-based measurement techniques and is implemented together with Arabic WordNet for scoring Arabic short answers. The experiments, conducted on a dataset of 330 students' answers, reported a Root Mean Square Error (RMSE) of 0.81 and a Pearson correlation r of 0.94. These findings show improved accuracy compared to similar studies. Moreover, statistical analysis has shown that the proposed method scores students' answers similarly to a human estimator.
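A minimal form of the similarity measure this abstract describes — longest-common-subsequence length over word tokens, normalized by answer length — might look like the following sketch. The tokenization and normalization are assumptions for illustration, not the paper's weight-based techniques.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences
    (classic dynamic programming, O(len(a) * len(b)))."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

def similarity(student_answer, model_answer):
    """Normalized LCS similarity over whitespace tokens — a simple
    stand-in for the paper's weighted measure. A WordNet step could
    first replace student tokens with model-answer synonyms."""
    s, m = student_answer.split(), model_answer.split()
    if not s or not m:
        return 0.0
    return lcs_length(s, m) / max(len(s), len(m))

print(similarity("the capital of France is Paris",
                 "Paris is the capital of France"))
```

Because LCS preserves token order, a reordered but otherwise identical answer scores below 1.0, which is the behavior a weight-based refinement would then tune.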
... By way of an example, color is a hypernym of red, while carmine and sanguine belong to the same synset as red; hence, they are considered synonymous. Similarly, Arabic WordNet (AWN) [85,96] and Hindi WordNet [86] were used for the foreign-foreign word connections in the network, whereas an English-to-foreign dictionary was used to generate the English-foreign word connections. ...
Article
The ever-increasing number of Internet users and online services, such as Amazon, Twitter and Facebook, has rapidly motivated people not just to transact using the Internet but also to voice their opinions about products, services, policies, etc. Sentiment analysis is a field of study that extracts and analyzes public views and opinions. However, current research in this field mainly focuses on building systems and resources for the English language. The primary objective of this study is to examine existing research on building sentiment lexicon systems and to classify the methods with respect to non-English datasets. Additionally, the study reviews the tools used to build sentiment lexicons for non-English languages, ranging from machine translation to graph-based methods. Shortcomings of the approaches are highlighted, along with recommendations to improve the performance of each approach and areas for further study and research.
... Most query expansion methods utilize a knowledge resource such as WordNet. WordNet is a global lexical database that organizes terms with identical meanings into sets called synsets [12]. These synsets are connected to each other through pre-defined lexical relations. ...
... Arabic WordNet was constructed by adapting the EuroWordNet construction process [13]. Arabic WordNet contains 11,269 concepts [12], compared with English WordNet, which contains 155,287 concepts [14]. Arabic WordNet is commonly used for query expansion, where appropriate senses are linked to the original query to provide the desired conceptual information. ...
Article
Abstract: In fact, most information retrieval systems retrieve documents based on keyword matching, and thus fail to retrieve documents that have a similar meaning but syntactically different keywords (form). One of the well-known approaches to overcoming this limitation is query expansion (QE). There are several approaches in the query expansion field, such as the statistical approach, which depends on term frequency to generate expansion features; nevertheless, it does not consider meaning or term dependency. In addition, there are other approaches, such as the semantic approach, which depends on a knowledge base with a limited number of terms and relations. In this paper, the researchers propose a hybrid approach to query expansion that utilizes both the statistical and semantic approaches. To select the optimal terms for query expansion, the researchers propose an effective weighting method based on particle swarm optimization (PSO). A system prototype was implemented as a proof of concept, and its accuracy was evaluated. The experiments were carried out on a real dataset. The experimental results confirm that the proposed approach enhances the accuracy of query expansion.
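The hybrid statistical-plus-semantic idea in this abstract can be sketched minimally as follows. The fixed blend weight `alpha` is an assumption standing in for the PSO-learned weights, and the synonym and frequency data are invented.

```python
def expand_query(query_terms, synonyms, term_freq, alpha=0.5, top_k=3):
    """Score expansion candidates by blending a semantic signal
    (the candidate appears in a query term's synonym set) with a
    statistical signal (normalized corpus frequency).

    alpha is a fixed blend weight; the paper instead learns weights
    with particle swarm optimization.
    """
    max_freq = max(term_freq.values(), default=1)
    scores = {}
    for term in query_terms:
        for cand in synonyms.get(term, []):
            if cand in query_terms:
                continue  # never re-add an original query term
            semantic = 1.0                          # came from a synset
            statistical = term_freq.get(cand, 0) / max_freq
            scores[cand] = alpha * semantic + (1 - alpha) * statistical
    best = sorted(scores, key=scores.get, reverse=True)[:top_k]
    return list(query_terms) + best

# Invented synonym sets and corpus frequencies, for illustration only:
syns = {"car": ["automobile", "vehicle"], "fast": ["quick", "rapid"]}
freq = {"automobile": 40, "vehicle": 80, "quick": 10, "rapid": 5}
print(expand_query(["car", "fast"], syns, freq))
```

Frequent synonyms outrank rare ones here, which is exactly the kind of trade-off a learned weighting would optimize instead of hard-coding.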
... The ArSL dictionary is limited to approximately 3200 signs, which causes the out-of-vocabulary (OOV) problem. We circumvent this problem by using synonyms of the OOV words from Arabic WordNet (AWN) [55]. AWN is a semantic database of Arabic words that are grouped into sets of synonyms. ...
Article
Arabic sign language (ArSL) is a full natural language that is used by the deaf in Arab countries to communicate within their community. Unfamiliarity with this language increases the isolation of deaf people from society. This language has a different structure, word order, and lexicon than Arabic. The translation between ArSL and Arabic is a full machine translation challenge, because the two languages have different structures and grammars. In this work, we propose a rule-based machine translation system to translate Arabic text into ArSL. The proposed system performs morphological, syntactic, and semantic analysis on an Arabic sentence to translate it into a sentence with the grammar and structure of ArSL. To transcribe ArSL, we propose a gloss system that can be used to represent ArSL. In addition, we develop a parallel corpus in the health domain, which consists of 600 sentences and will be freely available to researchers. We evaluate our translation system on this corpus and find that it provides an accurate translation for more than 80% of the translated sentences. The article is available from Springer: http://link.springer.com/article/10.1007/s10209-018-0622-8
... The development of efficient Arabic NLP systems is therefore important. For instance, after its release, Arabic WordNet (AWN) [9,13] quickly gained attention and became known in the Arabic NLP community as one of the exceptional and freely available lexical and semantic resources [1]. AWN is a lexical database for Modern Standard Arabic (MSA) [4] in which words that share a common meaning are grouped together in so-called synsets. ...
Chapter
Arabic WordNet (AWN) is a lexical database, freely available, and a useful resource for Natural Language Processing (NLP) research and applications (Information Retrieval, Machine Translation...). The project was built following the methods developed for Princeton WordNet (PWN) and EuroWordNet (EWN). However, this database needs more attention in order to improve NLP applications. Compared with other wordnets, AWN has relatively poor content at both the quantity and quality levels. This paper concentrates on the quality aspect, especially the antonym relations. The authors therefore propose a pattern-based approach to extend these relations, using an Arabic corpus and a corpus analysis tool. The proposed method relies on two steps: pattern definition and automatic antonym pair extraction. The evaluation of this approach has given good results.
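A pattern-based extraction step of the kind this abstract describes can be sketched with regular expressions over a corpus. The patterns and sentences below are invented English stand-ins; the paper defines its own patterns over an Arabic corpus.

```python
import re

# Hypothetical lexico-syntactic patterns that often signal antonymy.
# (English stand-ins for the Arabic patterns the authors define.)
PATTERNS = [
    re.compile(r"\bfrom (\w+) to (\w+)\b"),
    re.compile(r"\bneither (\w+) nor (\w+)\b"),
    re.compile(r"\bwhether (\w+) or (\w+)\b"),
]

def extract_antonym_candidates(corpus_sentences):
    """Return word pairs matched by any antonymy pattern.

    These are only candidates: a later validation step (e.g. checking
    against existing AWN relations) would separate true antonyms from
    noise such as 'from Paris to London'.
    """
    pairs = set()
    for sentence in corpus_sentences:
        for pattern in PATTERNS:
            for a, b in pattern.findall(sentence.lower()):
                if a != b:
                    pairs.add(tuple(sorted((a, b))))  # order-independent
    return pairs

corpus = [
    "Prices moved from high to low within a week.",
    "The outcome was neither good nor bad.",
]
print(extract_antonym_candidates(corpus))
```

The two-step structure — fixed patterns, then automatic pair extraction — mirrors the method the abstract outlines, with validation left as the manual or AWN-based follow-up.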
... The Arabic WordNet (AWN) [10] is a lexical database of the Arabic language that follows the development process of the Princeton English WordNet and EuroWordNet. In building AL-TERp, we need to know whether two words are synonyms. ...
Conference Paper
This paper presents AL-TERp (Arabic Language Translation Edit Rate - Plus), an extended version of the machine translation evaluation metric TER-Plus that supports the Arabic language. This metric takes into account valuable linguistic features of Arabic, such as synonyms and stems, and correlates well with human judgments. The development of such a tool will bring high benefits to the building of machine translation systems from other languages into Arabic, whose quality remains below expectations, specifically for evaluation and optimization tasks.