arXiv:cs/0408061v1 [cs.CL] 26 Aug 2004
An electronic dictionary as a basis for NLP tools: The
Greek case
Ch. Tsalidis (1), A. Vagelatos (2) and G. Orphanos (1)
(1) Neurosoft S.A.
24 Kofidou Street
GR-14231 Athens, Greece
(tsalidis,orphan)@neurosoft.gr
(2) R.A. Computer Technology Institute
13 Eptachalkou Street
GR-11851 Athens, Greece
vagelat@cti.gr
Résumé - Abstract
The existence of a Dictionary in electronic form for Modern Greek (MG) is mandatory if one
is to process MG at the morphological and syntactic levels since MG is a highly inflectional
language with marked stress and a spelling system with many characteristics carried over from
Ancient Greek. Moreover, such a tool becomes necessary if one is to create efficient and sophis-
ticated NLP applications with substantial linguistic backing and coverage. The present paper
will focus on the deployment of such an electronic dictionary for Modern Greek, which was
built in two phases: first it was constructed to be the basis for a spelling correction schema
and then it was reconstructed in order to become the platform for the deployment of a wider
spectrum of NLP tools.
Mots-clefs – Keywords
Lexique, morphologie
Lexicon, morphology
1 Introduction
Electronic dictionaries have become one of the most indispensable language resources for those
involved in all aspects of natural language processing (NLP). Large-scale language resources
(text and speech corpora, lexicons, grammars) have already been developed or are under
development for an increasing number of natural languages.
Over the past ten years, our computational linguistics (CL) team has been conducting applied
research toward the development of NLP applications and resources for Modern Greek (MG).
The first step toward this goal was the design and development of a spelling corrector for
Modern Greek. This corrector was based on a morphology lexicon. Later on, this morphology
lexicon served as a basis for a number of research projects as well as NLP applications.
By the end of this project, we came to realise the need for a large-scale electronic dictionary,
which could be the backbone for a wider range of more advanced NLP systems as well as a
valuable research tool. Such a dictionary should contain information for each entry (lemma) at
the various linguistic levels (phonology, morphosyntax, syntax and semantics) and should enable
linking between entries through various lexical and semantic relations: synonymy, antonymy,
hyponymy, etc.
In this paper we present a historical overview of the research activities of our CL team regarding
the development of an electronic dictionary. First, the deployment of a morphology lexicon is
described, together with the various NLP tools that were based on it. Then the “second phase” of
our research work is presented: the devising of a new coding scheme and the deployment of an
electronic dictionary, along with some supporting tools and NLP applications. Finally, we give
our conclusions.
2 How we got here
Back in 1991 our research unit at the Computer Technology Institute (CTI - www.cti.gr) undertook
a project to develop a spelling correction system for Modern Greek. This led to the formation
of a “computational linguistics (CL) team”, which started to develop a lexicon to
be the basis of the whole project.
The main goals for the lexicon construction (which focused mostly on the spelling corrector,
the final target) were identified as [Vagelatos et al., 1994]: (a) storage economy,
(b) speed efficiency, (c) dictionary coverage and (d) an optimum correction schema. Within this
framework, the linguistic analysis of MG led to the construction of a description language
(which we called Greek Word Description Language - GWDL) that described both the inflectional
morphology and the marked stress of MG [Vagelatos et al., 1994]. As a result, one and a
half years later, a lexicon of about 80,000 entries had been developed. The possible word
forms produced from these entries have been calculated to exceed one million. In each entry, the
stem(s) of a lexeme was (were) combined with the appropriate GWDL morphological rule(s) in
order to produce all the corresponding word forms.
The primary storage mechanism used to access words in the lexicon was the “Compressed
Trie” [Knuth, 1973], which served as an index to the database of words. This structure was
relatively small (about 700Kb) compared to the data needed to represent the entire lexicon. Thus,
a large part of it (or all of it, if the computer had enough memory) could be loaded into main
memory. The Compressed Trie contained only the part of a word’s stem necessary to distinguish
this word from the stems of all other words having the same prefix [Vagelatos et al., 1994].
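The indexing idea can be illustrated with a minimal Python sketch. It is not the original
implementation: the nested-dictionary trie, the function names, the END marker and the three
sample stems are assumptions made purely for illustration. Each stem is indexed only up to the
shortest prefix that separates it from every other stem, and a lookup verifies the full stem
against the database record that the index points to.

```python
END = "$"  # hypothetical marker key under which a node stores a record id


def common_prefix_len(a, b):
    """Length of the longest common prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def distinguishing_prefix(stems, i):
    """Shortest prefix of stems[i] shared with no other stem.

    `stems` must be sorted and unique, so only the two neighbours matter.
    If the stem is itself a prefix of another stem, the whole stem is used.
    """
    shared = 0
    if i > 0:
        shared = max(shared, common_prefix_len(stems[i], stems[i - 1]))
    if i + 1 < len(stems):
        shared = max(shared, common_prefix_len(stems[i], stems[i + 1]))
    return stems[i][: shared + 1]


def build_index(database):
    """database: list of unique stems; the list position is the record id."""
    order = sorted(range(len(database)), key=lambda r: database[r])
    stems = [database[r] for r in order]
    trie = {}
    for i, record_id in enumerate(order):
        node = trie
        for ch in distinguishing_prefix(stems, i):
            node = node.setdefault(ch, {})
        node[END] = record_id
    return trie


def lookup(trie, database, stem):
    """Follow the index, then confirm the match against the full record."""
    node = trie
    for ch in stem:
        if END in node and database[node[END]] == stem:
            return node[END]
        if ch not in node:
            return None
        node = node[ch]
    record_id = node.get(END)
    if record_id is not None and database[record_id] == stem:
        return record_id
    return None


database = ["κεφαλ", "κερ", "γραφ"]  # illustrative unstressed stems
index = build_index(database)
print(lookup(index, database, "κεφαλ"))
```

In this toy index the stem “γραφ” occupies a single node, because no other stem starts with
“γ”, while “κεφαλ” and “κερ” are stored only up to their third letter. The index therefore
stays much smaller than the data it points to, which is the property described above.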
The correction schema was based on the well-known error categorization [Vagelatos et al., 1994]
into (a) orthographical errors, which are cognitive errors owing to the substitution of a deviant
spelling for a correct one, when the author does not know the correct spelling of a word or
misconceives it, and (b) typographical errors, which are motoric errors caused by hitting the
wrong sequence of keys.
Additionally, in MG we faced another error type, namely stress position errors, e.g. “κέφαλι”
/kεfali/ (head) instead of “κεφάλι”. The correction of this error type is based on the lexicon
structure: words are stored in the lexicon without stress, and stress is encoded in the rule part
of each entry. Every word is searched without the stress; as soon as an entry has been matched,
the stress rules apply. If the stress is in a different position, then there is probably a stress
position error and the word found is suggested as an alternative.
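The lookup described above can be sketched in a few lines of Python. This is only an
illustration under stated assumptions: a plain dictionary from unstressed forms to correctly
stressed forms stands in for the rule part of each entry, and the names strip_stress and
check_stress are hypothetical.

```python
import unicodedata

ACUTE = "\u0301"  # combining acute accent: the Greek stress mark after NFD


def strip_stress(word):
    """Remove the stress mark so the word can be searched in the lexicon."""
    decomposed = unicodedata.normalize("NFD", word)
    return unicodedata.normalize("NFC", decomposed.replace(ACUTE, ""))


def check_stress(word, lexicon):
    """lexicon: unstressed form -> correctly stressed form (assumed stand-in
    for the stress rules attached to each entry)."""
    word = unicodedata.normalize("NFC", word)
    correct = lexicon.get(strip_stress(word))
    if correct is None:
        return ("unknown", None)       # no entry matched
    if correct == word:
        return ("ok", word)            # stress is where the entry puts it
    return ("stress_error", correct)   # same letters, different stress position


# Illustrative one-word lexicon.
lexicon = {strip_stress(w): unicodedata.normalize("NFC", w) for w in ["κεφάλι"]}
print(check_stress("κέφαλι", lexicon))
```

Searching on the unstressed form means that a misplaced stress never prevents the entry from
being found; the stress comparison happens only after the match.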
The lexicon was the heart of the spelling correction system (which was later incorporated in
the Greek version of the Microsoft Office suite). Nevertheless, the lexicon served not only as
the basis of this system but also in a number of other research applications, such as stylistic
analysis of poetic works [Vagelatos, Stamison et al., 1994], computer assisted language learning
(CALL) [Stamison et al., 1995] and word stemming [Vagelatos, Peleki, Christodoulakis, 1994].
3 The new dictionary
The importance of a more advanced and complete electronic dictionary, which would serve as
a basis for a far wider variety of NLP systems, became evident after the completion of the
above-mentioned “first period” of our NLP research team. At that time, it was decided to devote
manpower and time to rebuilding the existing lexicon with the following goals in
mind: the development of a new coding scheme able to incorporate appropriate annotation (the
information that has to be associated with each entry) and the reconstruction of the lexicon’s
data, based on a more appropriate (i.e. corpus-based) methodology, taking into account a
corpus that had been developed for this purpose.
Within this framework, the main requirements for the new dictionary were identified as: (a)
each lexicon entry should constitute a lemma, (b) a lemma can contain one or more lexemes (a
lexeme is the representative of a cluster of morphological variations (word forms) of the same
word), (c) all word forms of a lemma must be defined, (d) all word forms must be correctly
hyphenated, (e) the various morphological parts of each word form (i.e. stem, suffix, prefix,
etc.) must be identifiable, (f) stressing must be handled in an easy and efficient way, (g)
a mechanism for incorporating simple (property) information as well as compound (structured)
information should be supported, (h) full support for the meanings of a lemma, as in printed
lexicography, must be provided, (i) a powerful intra-lemma reference mechanism should exist in
order to represent “wordnet” links.
3.1 Lexicon’s meta-language schema
In order to fulfill the requirements of the lexicon, a coding scheme was devised to represent all
this information and a special tool was implemented to permit the easy and efficient editing of
a lemma’s information. A lemma is defined as a set of lexemes:
lemma → name [lexeme]
lexeme → name morphology semantics
where (in the formulas presented in this paper) [a] means one or more instances, {a} means
zero or more instances, a | b means a or b but not both, and a? means zero or one instance
of a. As the above formula shows, a lexeme definition contains three parts: the name, which
identifies the lexeme; the morphology, which describes how the lexeme’s word
forms are constructed from their constituent parts; and the semantics, which holds the meaning
information that can accompany a lemma.
The elementary units of word forms are the letters of the MG alphabet. Words, however, are
usually divided into more complex parts called morphemes, and it is morphemes that we treat as
the basic units of word forms. We distinguish four types of morphemes: prefix, stem, infix and
inflection (suffix). The first three types constitute the lexical-morpheme of the lexeme, while
the fourth type is also known as the inflectional-morpheme. Formally, a lexeme’s morphology is
defined as:
morphology → lexical-morpheme, inflectional-morpheme stress
lexical-morpheme → [prefix | stem | infix]
inflectional-morpheme → [inflection]
stress → [final | penultimate | antepenultimate]
There are no stress characters inside the morphemes, while the position of the stress (final,
penultimate, antepenultimate position) is attached as shown in the above formulas. Each mor-
pheme must also be hyphenated so the derived word forms contain hyphenation information.
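To make the scheme more concrete, the following Python sketch shows one way the morphology part
of a lexeme could be held in memory and expanded into stressed, hyphenated word forms. It is a
sketch under explicit assumptions, not the coding scheme's actual format: the Morphology class,
the pairing of each inflection with one stress position, the use of “-” as the hyphenation
mark and the rule of accenting the last vowel of the stressed syllable are all illustrative
choices.

```python
import unicodedata
from dataclasses import dataclass

VOWELS = set("αεηιουω")
POSITION = {"final": -1, "penultimate": -2, "antepenultimate": -3}


@dataclass
class Morphology:
    lexical: list      # hyphenated, unstressed prefix/stem/infix parts
    inflections: list  # hyphenated, unstressed inflections
    stress: list       # one stress position per inflection (assumed pairing)


def stress_syllable(syllable):
    """Accent the last vowel of the syllable (a deliberate simplification)."""
    for i in range(len(syllable) - 1, -1, -1):
        if syllable[i] in VOWELS:
            marked = syllable[: i + 1] + "\u0301" + syllable[i + 1:]
            return unicodedata.normalize("NFC", marked)
    raise ValueError(f"no vowel in syllable {syllable!r}")


def word_forms(morphology):
    """Return (hyphenated form, plain form) for every inflection."""
    base = "".join(morphology.lexical)
    forms = []
    for suffix, position in zip(morphology.inflections, morphology.stress):
        syllables = (base + suffix).split("-")
        k = POSITION[position]
        syllables[k] = stress_syllable(syllables[k])
        forms.append(("-".join(syllables), "".join(syllables)))
    return forms


# Illustrative entry: stem "κε-φα-λ" with three inflections.
kefali = Morphology(
    lexical=["κε-φα-λ"],
    inflections=["ι", "ιου", "ια"],
    stress=["penultimate", "final", "penultimate"],
)
for hyphenated, plain in word_forms(kefali):
    print(hyphenated, plain)
```

Because the morphemes carry no stress characters, the same stem serves every word form, and
the hyphenation marks stored in the morphemes are enough to locate the syllable that receives
the stress.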
3.2 Supporting tools
In order to automate and simplify the coding of lexical information, various tools were
constructed. One of them, LexEdit, is a lightweight lexicon editor that was used for the initial
definition of the lexicon entries. Figure 1 shows a typical LexEdit screen with processed
lexical entries. The left pane of the application window lists the sections that
hold the lemmas of the lexicon, one section for each letter of the MG alphabet. The “iota”
(γιώτα ι) section is selected, and the right pane shows part of the lemmas starting with
the character “iota”.
3.3 The data
The set of lemmas included in the lexicon was collected through research based on the
most important MG dictionaries ([Kriaras, 1996], [Babiniotis, 1999], [Tegopoulos, Fytrakis, 1993])
and on a corpus of MG texts.
The selection of lemmas is particularly demanding and laborious work, which, with respect to
MG, is further burdened by the recent past of bilingualism (official - “kathareyousa” vs demotic
- “dimotiki” languages). What is found in the dictionaries has to be checked from different
points of view. Since none of the above dictionaries is based on text corpora, doubts
arise for many entries as to whether certain words, word forms or meanings exist in MG (rather
than in dialects or Ancient Greek).
Figure 1: LexEdit tool
4 NLP Tools
The new language tools that have been deployed are the result of more than three years of
systematic work, both at the research level (in the areas of lexicography and NLP) and at the
level of development of specialised electronic lexicons and computer systems for checking,
correction and text hyphenation. Moreover, they are based on the redesign and reconstruction of
the lexicon described above. The tools, which can be found on the Web [Neurosoft, 2003], are
updated twice a year.
Thus far the following NLP tools have been implemented, based on the lexicon that was de-
scribed above: a new spelling correction system and a hyphenator for MG.
The new spelling correction system includes approximately 90,000 lemmas (over 1,200,000
word forms). It was checked against a corpus of documents with more than 100,000,000 words.
It also includes approximately 200,000 English words, so it supports spell checking of both
English and Greek.
The search engine for the suggestions has been improved substantially. A number of new methods
for correcting errors have been added, based on optical recognition (e.g., A - ,
T - Γ, ΛΛ - M, α - σ) and phonetic similarities (e.g., άνχος - άγχος). Also, the methods for
correcting phonetic equivalences have been enhanced (e.g., έβρεση - εύρεση) and have been
enriched with methods that handle the usual spelling errors: missing letter, added letter,
transposed letter and wrong letter. Although the processing requirements have increased,
the new spelling engine is faster than the previous version.
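The four usual error types named above (missing, added, transposed and wrong letter) can be
illustrated with a generic single-edit candidate generator in Python. This is a textbook
technique shown as a sketch, not the engine described in this paper; the function names, the
unstressed alphabet and the three-word lexicon are assumptions, and the optical and phonetic
methods are not modelled.

```python
# Unstressed lowercase letters; final sigma is left out for brevity (assumption).
GREEK = "αβγδεζηθικλμνξοπρστυφχψω"


def one_edit_candidates(word, alphabet=GREEK):
    """All strings that undo a single typing error in `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    # Missing letter: put a letter back at every position.
    missing = {a + c + b for a, b in splits for c in alphabet}
    # Added letter: drop one letter.
    added = {a + b[1:] for a, b in splits if b}
    # Transposed letters: swap two adjacent letters.
    transposed = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    # Wrong letter: replace one letter by another.
    wrong = {a + c + b[1:] for a, b in splits if b for c in alphabet}
    return (missing | added | transposed | wrong) - {word}


def suggest(word, lexicon):
    """Keep only the candidates that are entries of the lexicon."""
    return sorted(one_edit_candidates(word) & lexicon)


# Illustrative lexicon of unstressed forms.
lexicon = {"κεφαλι", "γραφω", "αγχος"}
print(suggest("κεφλι", lexicon))
```

Candidates are generated on unstressed forms here, in line with the lexicon structure described
in Section 2, where stress is kept in the rule part of each entry rather than in the stored word.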
The hyphenator uses rules as well as the dictionary in order to achieve the best possible precision
in hyphenation. The rules fall into two categories: those that were handcrafted according
to the rules of hyphenation in MG grammar, and those that were produced automatically
from the hyphenation information incorporated in the lexicon. The rules of the second category
enable the hyphenator to cope effectively with 26 vowel combinations, which in some words
split during syllabification and in others do not.
Additionally, the verbal types that are liable to produce hyphenation errors when the
hyphenation rules are applied have been incorporated in a list of exceptions. This list
contains about 2,700 word forms containing vowel combinations whose syllabification
leads to sense ambiguity.
5 Conclusions
The above presentation has hopefully succeeded in establishing an awareness of what is involved
in developing an electronic dictionary for Modern Greek. We have presented
the two main development phases of this dictionary, cited limitations and
presented a number of research and applied projects. More importantly, we have stressed
and explained those features that characterize MG and which, in our view, make the
dictionary in electronic form a necessary tool in all kinds of natural language processing.
References
[Kriaras, 1996] KRIARAS E. (1996), The New Greek Dictionary, (in Greek).
[Knuth, 1973] KNUTH D. (1973), The Art of Computer Programming, Volume 3: Sorting and
Searching, Addison-Wesley, Reading, Mass.
[Babiniotis, 1999] BABINIOTIS G. (1999), The Dictionary of Modern Greek, (in Greek).
[Neurosoft, 2003] Neurosoft S.A. (2003), Language tools. WWW site:
http://www.neurosoft.gr/download/main.asp. Accessed on 21-Dec-2003.
[Stamison et al., 1995] STAMISON-ATMATZIDI M., VAGELATOS A., CHRISTODOULAKIS D.
(1995), Teaching English Engineering Terminology in a Hypermedia Environment,
Computer Assisted Language Learning, An International Journal, vol. 8, n. 2-3.
[Tegopoulos, Fytrakis, 1993] TEGOPOULOS G., FYTRAKIS A. (1993), The Dictionary of
Modern Greek, (in Greek).
[Vagelatos et al., 1994] VAGELATOS A., TRIANTOPOULOU T., TSALIDIS C.,
CHRISTODOULAKIS D. (1994), A Spelling Correction System for Modern Greek,
International Journal on Artificial Intelligence & Tools, vol. 3, n. 4.
[Vagelatos, Stamison et al., 1994] VAGELATOS A., STAMISON-ATMATZIDI M.,
TRIADOPOULOU TH., FARMAKI V., CHRISTODOULAKIS D. (1994), Analysis of the
Literary Style of Poet A. Sikelianos - A Computer Based Approach, CONSENSUS EX
MACHINA, Joint International Conference, ALLC-ACH94, Sorbonne, Paris.
[Vagelatos, Peleki, Christodoulakis, 1994] VAGELATOS A., PELEKI F., CHRISTODOULAKIS
D. (1994), Word stemming with the use of a computer, 15th meeting of the linguistic
department of the Aristoteleion University of Thessaloniki, Thessaloniki, Greece, (in Greek).