An electronic dictionary as a basis for NLP tools: The Greek case

August 2004

cs.CL/0408061

Source
DBLP

Authors:

Christos Tsalidis

University of Patras

Aristides Vagelatos

Computer Technology Institute and Press


arXiv:cs/0408061v1 [cs.CL] 26 Aug 2004

An electronic dictionary as a basis for NLP tools: The Greek case

Ch. Tsalidis (1), A. Vagelatos (2) and G. Orphanos (1)

(1) Neurosoft S.A.

24 Kofidou Street

GR-14231 Athens, Greece

(tsalidis,orphan)@neurosoft.gr

(2) R.A. Computer Technology Institute

13 Eptachalkou Street

GR-11851 Athens, Greece

vagelat@cti.gr

Résumé - Abstract

The existence of a Dictionary in electronic form for Modern Greek (MG) is mandatory if one is to process MG at the morphological and syntactic levels since MG is a highly inflectional language with marked stress and a spelling system with many characteristics carried over from Ancient Greek. Moreover, such a tool becomes necessary if one is to create efficient and sophisticated NLP applications with substantial linguistic backing and coverage. The present paper will focus on the deployment of such an electronic dictionary for Modern Greek, which was built in two phases: first it was constructed to be the basis for a spelling correction schema and then it was reconstructed in order to become the platform for the deployment of a wider spectrum of NLP tools.

Mots-clefs – Keywords

Lexique, morphologie

Lexicon, morphology

Tsalidis, Vagelatos, Orphanos

1 Introduction

Electronic dictionaries have become one of the most indispensable language resources for those involved in all aspects of natural language processing (NLP). Large-scale language resources (text and speech corpora, lexicons, grammars) have already been developed or are under development for an increasing number of natural languages.

Over the past ten years, our computational linguistics (CL) team has been conducting applied research toward the development of NLP applications and resources for Modern Greek (MG). The first step toward this goal was the design and development of a spelling corrector for Modern Greek. This corrector was based on a morphology lexicon. Later on, this morphology lexicon served as a basis for a number of research projects as well as NLP applications.

By the end of this project, we came to realise the need for a large-scale electronic dictionary, which could be the backbone for a wider range of more advanced NLP systems as well as a valuable research tool. Such a dictionary should contain information for each entry (lemma) at the various linguistic levels (phonology, morphosyntax, syntax and semantics), and should enable linking between entries through various lexical and semantic relations: synonymy, antonymy, hyponymy, etc.

In this paper we present a historical overview of the research activities of our CL team regarding the development of an electronic dictionary. First, the deployment of a morphology lexicon is described, together with the various NLP tools that were based on it. Then the “second phase” of our research work is presented: the design of a new coding scheme, and the deployment of an electronic dictionary along with some supporting tools and NLP applications. Finally, we give our conclusions.

2 How we got here

Back in 1991, our research unit at the Computer Technology Institute (CTI - www.cti.gr) undertook a project to develop a spelling correction system for Modern Greek. This led to the formation of a “computational linguistics (CL) team”, which started to develop a lexicon as the basis of the whole project.

The main goals for the lexicon construction (which focused mostly on the spelling corrector, the final target) were identified as [Vagelatos et al., 1994]: (a) storage economy, (b) speed efficiency, (c) dictionary coverage and (d) an optimum correction schema. Under this framework, the linguistic analysis of MG led to the construction of a description language (which we called the Greek Word Description Language, GWDL) that described both the inflectional morphology and the marked stress of MG [Vagelatos et al., 1994]. As a result, a year and a half later, a lexicon containing about 80,000 entries had been developed. The possible word forms produced from these entries have been calculated to exceed one million. In each entry, the stem(s) of a lexeme was (were) combined with the appropriate GWDL morphological rule(s) in order to produce all the corresponding word forms.
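The combination of a stem with a morphological rule can be illustrated with a minimal Python sketch. The rule table below is hypothetical and is not GWDL syntax, which this paper does not reproduce; the rule name `neuter-i` and the `Entry` class are assumptions made for illustration. Each rule lists endings together with the syllable, counted from the end of the word, that carries the stress.

```python
from dataclasses import dataclass

# Hypothetical rule table (not actual GWDL): rule name -> (ending, stressed syllable).
# Syllables are counted from the end: 1 = final, 2 = penultimate, 3 = antepenultimate.
RULES = {
    "neuter-i": [("ι", 2), ("ιου", 1), ("ια", 2), ("ιων", 1)],
}

@dataclass(frozen=True)
class Entry:
    stem: str  # stored without any stress mark
    rule: str  # key into RULES

def word_forms(entry):
    """Return (unstressed word form, stress position) pairs for one entry."""
    return [(entry.stem + ending, stress) for ending, stress in RULES[entry.rule]]

# Toy entry for "κεφάλι" (head): the stem carries no stress, the rule does.
forms = word_forms(Entry(stem="κεφαλ", rule="neuter-i"))
```

Storing the stress position in the rule rather than in the stem is what lets one short entry stand for several word forms.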

The primary storage mechanism used to access words in the lexicon was the “Compressed Trie” [Knuth, 1973]. It was used as an index to the database of the words. This structure was relatively small (about 700Kb) compared to the data needed to represent the entire lexicon. Thus, a large part of it (or the whole, if the computer had enough memory) could be loaded into main memory. The Compressed Trie contained the part of a word's stem necessary to distinguish this word from the stems of all other words having the same prefix [Vagelatos et al., 1994].
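The idea of keeping only the distinguishing part of each stem can be sketched in Python as follows. This is an illustration of the principle under simplifying assumptions, not the structure described in [Knuth, 1973] nor the original implementation; the function names and the toy stems are hypothetical.

```python
def common_prefix_len(a, b):
    """Length of the longest common prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def distinguishing_prefixes(stems):
    """Map each stem to the shortest prefix that separates it from all others.

    In sorted order, the longest prefix a stem shares with any other stem is
    shared with one of its two neighbours, so only those need to be compared.
    A stem that is itself a prefix of another stem is kept whole.
    """
    ordered = sorted(set(stems))
    keys = {}
    for i, stem in enumerate(ordered):
        shared = 0
        for j in (i - 1, i + 1):
            if 0 <= j < len(ordered):
                shared = max(shared, common_prefix_len(stem, ordered[j]))
        keys[stem] = stem[:shared + 1]
    return keys

# Toy, unstressed stems chosen for illustration only.
keys = distinguishing_prefixes(["κεφαλ", "κεφαλαι", "κερ", "γιωτ"])
```

The index then only needs the short keys, while the full entries stay in the word database.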

The correction schema was based on the well-known error categorization [Vagelatos et al., 1994] into (a) orthographical errors, which are cognitive errors caused by substituting a deviant spelling for a correct one when the author does not know the correct spelling of a word or misconceives it, and (b) typographical errors, which are motoric errors caused by hitting the wrong sequence of keys.

Additionally, in MG we faced another error type, namely stress position errors, e.g. “κέφαλι” /kefali/ (head) instead of “κεφάλι”. The correction of this error type is based on the lexicon structure: words are stored in the lexicon without stress, and stress is encoded in the rule part of each entry. Every word is searched for without its stress; as soon as an entry has been matched, the stress rules are applied. If the stress is in a different position, then there is probably a stress position error and the word found is suggested as an alternative.
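The stress check described above can be sketched in Python with the standard `unicodedata` module. The one-word `LEXICON` dictionary is a hypothetical stand-in for the real lexicon, in which the stressed form would be derived from the stem and its rule rather than stored directly.

```python
import unicodedata

TONOS = "\u0301"  # combining acute accent: the Greek stress mark after NFD

def strip_stress(word):
    """Remove stress marks so the word can be looked up in unstressed form."""
    decomposed = unicodedata.normalize("NFD", word)
    return unicodedata.normalize("NFC", decomposed.replace(TONOS, ""))

# Hypothetical toy lexicon: unstressed key -> correctly stressed word form.
LEXICON = {"κεφαλι": "κεφάλι"}

def check_stress(word):
    """Return ("ok", None), ("stress-error", suggestion) or ("unknown", None)."""
    correct = LEXICON.get(strip_stress(word))
    if correct is None:
        return ("unknown", None)
    correct = unicodedata.normalize("NFC", correct)
    if unicodedata.normalize("NFC", word) == correct:
        return ("ok", None)
    return ("stress-error", correct)
```

For example, `check_stress("κέφαλι")` finds the entry through its unstressed key and proposes the form with the stress on the penultimate syllable.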

The lexicon was the heart of the spelling correction system (which was later incorporated in the Greek version of the Microsoft Office suite). Nevertheless, the lexicon served not only as the basis of this system but also in a number of other research applications, such as stylistic analysis of poetic works [Vagelatos, Stamison et al., 1994], computer assisted language learning (CALL) [Stamison et al., 1995] and word stemming [Vagelatos, Peleki, Christodoulakis, 1994].

3 The new dictionary

The importance of a more advanced and complete electronic dictionary, which would serve as a basis for a far wider variety of NLP systems, became evident after the completion of the above-mentioned “first period” of our NLP research team. At that time, it was decided to devote manpower and time to rebuilding the existing lexicon with two goals in mind: the development of a new coding scheme able to incorporate appropriate annotation (the information that has to be associated with each entry), and the reconstruction of the lexicon's data based on a more appropriate (i.e. corpus-based) methodology, taking into account a corpus that had been built for this purpose.

Within this framework, the main requirements for the new dictionary were identified as follows:

(a) each lexicon entry should constitute a Lemma;

(b) a lemma can contain one or more Lexemes (a lexeme is the representative of a cluster of morphological variations (word forms) of the same word);

(c) all word forms of a lemma must be defined;

(d) all word forms must be correctly hyphenated;

(e) the various morphological parts of each word form (i.e. stem, suffix, prefix, etc.) must be identifiable;

(f) stressing must be handled in an easy and efficient way;

(g) a mechanism to incorporate simple (property) information as well as compound (structured) information should be supported;

(h) full support for the meanings of a lemma, as in printed lexicography, must be provided;

(i) a powerful intra-lemma reference mechanism should exist in order to represent “wordnet” links.

3.1 Lexicon’s meta-language schema

In order to fulfill the requirements of the lexicon, a coding scheme was devised to represent all this information, and a special tool was implemented to permit easy and efficient editing of a lemma's information. A lemma is defined as a set of lexemes:

lemma → name [lexeme]

lexeme → name morphology semantics

where (in the formulas presented in this paper) [a] means one or more instances, {a} means zero or more instances, a | b means a or b but not both, and a? means zero or one instance of a. As the above formula shows, a lexeme definition contains three parts: the name, which identifies the lexeme; the morphology, which describes how the lexeme's word forms are constructed from their constituent parts; and the semantics, which holds the meaning information that can accompany a lemma.

The most elementary units of word forms are the letters of the MG alphabet. Words, however, are usually divided into more complex parts called morphemes, which constitute the basic units of word forms. We distinguish four types of morphemes: prefix, stem, infix and inflection (suffix). The first three types constitute the lexical-morpheme of the lexeme, while the fourth type is also known as the inflectional-morpheme. Formally, a lexeme's morphology is defined as:

morphology → lexical-morpheme, inflectional-morpheme stress

lexical-morpheme → [prefix | stem | infix]

inflectional-morpheme → [inflection]

stress → [final | penultimate | antepenultimate]

There are no stress characters inside the morphemes; the position of the stress (final, penultimate or antepenultimate) is attached as shown in the above formulas. Each morpheme must also be hyphenated, so that the derived word forms contain hyphenation information.
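The morphology definition can be mirrored in a small Python sketch showing how a stressed, hyphenated word form might be derived from stress-free, hyphenated morphemes. The `Morphology` class, the use of "-" as the hyphenation mark, and the restriction to a single inflection and a single stress value are assumptions made for illustration. Placing the accent on the last vowel of the stressed syllable is a simplification that does not cover every MG vowel combination.

```python
import unicodedata
from dataclasses import dataclass

VOWELS = "αεηιουω"
STRESS_INDEX = {"final": -1, "penultimate": -2, "antepenultimate": -3}

@dataclass
class Morphology:
    lexical: list    # hyphenated prefix/stem/infix morphemes, no stress marks
    inflection: str  # hyphenated inflectional morpheme (suffix)
    stress: str      # "final", "penultimate" or "antepenultimate"

def stressed_syllables(m):
    """Join the morphemes, split into syllables and accent the stressed one."""
    syllables = ("".join(m.lexical) + m.inflection).split("-")
    i = STRESS_INDEX[m.stress]
    syllable = syllables[i]
    pos = max(syllable.rfind(v) for v in VOWELS)  # simplification: last vowel
    if pos < 0:
        raise ValueError("no vowel in syllable: " + syllable)
    accented = syllable[:pos + 1] + "\u0301" + syllable[pos + 1:]
    syllables[i] = unicodedata.normalize("NFC", accented)
    return syllables

m = Morphology(lexical=["κε-φα-λ"], inflection="ι", stress="penultimate")
syllables = stressed_syllables(m)
```

Joining `syllables` with "-" gives the hyphenated form, and joining them with the empty string gives the plain word form, so both come from the same entry.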

3.2 Supporting tools

In order to automate and simplify the coding of lexical information, various tools were constructed. One of them, LexEdit, is a lightweight lexicon editor that was used for the initial definition of the lexicon entries. Figure 1 shows a typical LexEdit screen with processed lexical entries. The left pane of the application window shows the sections that hold the lemmas of the lexicon, one section for each letter of the MG alphabet. The “iota” (γιώτα - ι) section is selected, and the right pane shows some of the lemmas starting with the character “iota”.

3.3 The data

The set of lemmas included in the lexicon was collected through research based on the most important MG dictionaries ([Kriaras, 1996], [Babiniotis, 1999], [Tegopoulos, Fytrakis, 1993]) and on a corpus of MG texts.

The selection of lemmas is particularly exacting and laborious work which, with respect to MG, is further burdened by the recent past of bilingualism (official “katharevousa” vs. demotic “dimotiki”). What is found in the dictionaries should be checked from different points of view. Since none of the above dictionaries is based on text corpora, doubts arise for many entries as to whether certain words, word forms or meanings exist in MG (and not only in dialects or Ancient Greek).

Figure 1: LexEdit tool

4 NLP Tools

The new language tools that have been deployed are the result of more than three years of systematic work, both at the research level (in the areas of lexicography and NLP) and at the level of development of specialised electronic lexicons and computer systems for text checking, correction and hyphenation. Moreover, they are based on the redesign and reconstruction of the lexicon described above. The tools, which can be found on the Web [Neurosoft, 2003], are updated twice a year.

Thus far, the following NLP tools have been implemented, based on the lexicon described above: a new spelling correction system and a hyphenator for MG.

The new spelling correction system includes approximately 90,000 lemmas (over 1,200,000 word forms). It was checked against a corpus of documents containing more than 100,000,000 words. It also includes approximately 200,000 English words, and thus supports spell checking of both English and Greek.

The search engine for suggestions has been improved substantially. A number of new error correction methods have been added, based on optical recognition (e.g., Α - Δ, Τ - Γ, ΛΛ - Μ, α - σ) and phonetic similarities (e.g., άνχος - άγχος). The methods for correcting phonetic equivalences have also been enhanced (e.g., έβρεση - εύρεση) and enriched with methods that handle the usual spelling errors: missing letter, added letter, transposed letter and wrong letter. Although the processing requirements have increased, the new spelling engine is faster than the previous version.
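The four usual error types (missing, added, transposed and wrong letter) correspond to the generic single-edit candidate technique, sketched below in Python. This is not the suggestion engine described in this paper; the two-word lexicon is a toy example, the function names are hypothetical, and words are handled in unstressed form for simplicity.

```python
ALPHABET = "αβγδεζηθικλμνξοπρστυφχψως"

def single_edit_candidates(word):
    """All strings that are one edit away from an unstressed word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    # Undo an added letter by deleting one character.
    deletes = {a + b[1:] for a, b in splits if b}
    # Undo a transposition by swapping two adjacent characters.
    transposes = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    # Undo a wrong letter by replacing one character.
    replaces = {a + c + b[1:] for a, b in splits if b for c in ALPHABET}
    # Restore a missing letter by inserting one character.
    inserts = {a + c + b for a, b in splits for c in ALPHABET}
    return deletes | transposes | replaces | inserts

def suggest(word, lexicon):
    """Lexicon words reachable from the misspelling with a single edit."""
    return sorted(single_edit_candidates(word) & lexicon)

TOY_LEXICON = {"αγχος", "κεφαλι"}  # unstressed forms, for illustration only
```

With this toy lexicon, `suggest("ανχος", TOY_LEXICON)` replaces the wrong letter and proposes "αγχος", matching the άνχος - άγχος example above in unstressed form.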

The hyphenator uses rules as well as the dictionary in order to achieve the best possible precision in hyphenation. The rules fall into two categories: those that were handcrafted according to the hyphenation rules of MG grammar, and those that were produced automatically from the hyphenation information incorporated in the lexicon. The rules of the second category enable the hyphenator to cope effectively with 26 vowel combinations, which are split during syllabification in some words and not in others.

Additionally, the word forms that are liable to produce hyphenation errors when the hyphenation rules are applied have been incorporated in a list of exceptions. This list contains about 2,700 word forms containing vowel combinations whose syllabification leads to sense ambiguity.
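The interplay of exception list, lexicon and rules can be sketched in Python as a simple lookup order. The lookup order, the table contents and the single fallback rule (a lone consonant between two vowels goes with the following vowel) are assumptions made for illustration; the actual rule set and the exception list are not reproduced in this paper, so `EXCEPTIONS` is left empty.

```python
import re

V = "αεηιουωάέήίόύώ"        # vowels, with and without the stress mark
C = "βγδζθκλμνξπρστφχψς"    # consonants
# Zero-width match between a vowel and a consonant+vowel pair (V-CV).
BREAK = re.compile(f"(?<=[{V}])(?=[{C}][{V}])")

EXCEPTIONS = {}                    # word form -> hyphenation (not reproduced here)
LEXICON = {"κεφάλι": "κε-φά-λι"}   # toy entry: hyphenation stored with the word

def hyphenate(word):
    """Try the exception list, then the lexicon, then the fallback rule."""
    for table in (EXCEPTIONS, LEXICON):
        if word in table:
            return table[word]
    return BREAK.sub("-", word)
```

Here `hyphenate("κεφάλι")` is answered from the lexicon, while a word missing from both tables, such as "καλός", falls through to the rule.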

5 Conclusions

The above presentation has hopefully succeeded in establishing an awareness of what is involved in developing an electronic dictionary for Modern Greek. We have presented the two main development phases of this dictionary, cited limitations, and presented numerous research and applied projects. More importantly, we have stressed and explained those features that characterize MG and which, in our view, make a dictionary in electronic form a necessary tool in all kinds of natural language processing.

References

[Kriaras, 1996] KRIARAS E. (1996), The New Greek Dictionary, (in Greek).

[Knuth, 1973] KNUTH D. (1973), The Art of Computer Programming, Volume 3: Sorting and Searching, Addison-Wesley, Reading, Mass.

[Babiniotis, 1999] BABINIOTIS G. (1999), The Dictionary of Modern Greek, (in Greek).

[Neurosoft, 2003] Neurosoft S.A. (2003). Language tools. WWW site: http://www.neurosoft.gr/download/main.asp. Accessed on 21-Dec-2003.

[Stamison et al., 1995] STAMISON-ATMATZIDI M., VAGELATOS A., CHRISTODOULAKIS D. (1995), Teaching English Engineering Terminology in a Hypermedia Environment, Computer Assisted Language Learning, An International Journal, vol. 8, n. 2-3.

[Tegopoulos, Fytrakis, 1993] TEGOPOULOS G., FYTRAKIS A. (1993), The Dictionary of Modern Greek, (in Greek).

[Vagelatos et al., 1994] VAGELATOS A., TRIANTOPOULOU T., TSALIDIS C., CHRISTODOULAKIS D. (1994), A Spelling Correction System for Modern Greek, International Journal on Artificial Intelligence & Tools, vol. 3, n. 4.

[Vagelatos, Stamison et al., 1994] VAGELATOS A., STAMISON-ATMATZIDI M., TRIADOPOULOU TH., FARMAKI V., CHRISTODOULAKIS D. (1994), Analysis of the Literary Style of Poet A. Sikelianos - A Computer Based Approach, CONSENSUS EX MACHINA, Joint International Conference, ALLC-ACH94, Sorbonne, Paris.

[Vagelatos, Peleki, Christodoulakis, 1994] VAGELATOS A., PELEKI F., CHRISTODOULAKIS D. (1994), Word stemming with the use of a computer, 15th meeting of the linguistic department of the Aristoteleion University of Thessaloniki, Thessaloniki, Greece, (in Greek).
