
Supervised and Unsupervised Learning of Arabic Morphology

Author: Alexander Clark

Abstract

The broken plural in Arabic is a canonical example of nonconcatenative morphology. We discuss the supervised and unsupervised learning of this type of transduction using different techniques based on stochastic transducers trained with the Expectation-Maximisation algorithm. We first discuss a basic method for supervised learning with these transducers, and then a more advanced memory-based technique that uses a distance derived from the Fisher kernel of the model. We then discuss how these algorithms can be employed for unsupervised learning, modelling the alignment between the strings as a hidden variable.
Arabic Computational Morphology
Text, Speech and Language Technology
VOLUME 38
Series Editors
Nancy Ide, Vassar College, New York
Jean Véronis, Université de Provence and CNRS, France
Editorial Board
Harald Baayen, Max Planck Institute for Psycholinguistics, The Netherlands
Kenneth W. Church, Microsoft Research Labs, Redmond WA, USA
Judith Klavans, Columbia University, New York, USA
David T. Barnard, University of Regina, Canada
Dan Tufis, Romanian Academy of Sciences, Romania
Joaquim Llisterri, Universitat Autonoma de Barcelona, Spain
Stig Johansson, University of Oslo, Norway
Joseph Mariani, LIMSI-CNRS, France
Arabic Computational Morphology
Knowledge-based and Empirical Methods
Edited by
Abdelhadi Soudi
Ecole Nationale de l'Industrie Minérale, Rabat, Morocco
Antal van den Bosch
Tilburg University, The Netherlands
Günter Neumann
Deutsches Forschungszentrum für Künstliche Intelligenz, Saarbrücken, Germany
A C.I.P. Catalogue record for this book is available from the Library of Congress.
ISBN 978-1-4020-6045-8 (HB)
ISBN 978-1-4020-6046-5 (e-book)
Published by Springer,
P.O. Box 17, 3300 AA Dordrecht, The Netherlands.
www.springer.com
Printed on acid-free paper
All Rights Reserved
© 2007 Springer
No part of this work may be reproduced, stored in a retrieval system, or transmitted
in any form or by any means, electronic, mechanical, photocopying, microfilming, recording
or otherwise, without written permission from the Publisher, with the exception
of any material supplied specifically for the purpose of being entered
and executed on a computer system, for exclusive use by the purchaser of the work.
Contents
Preface vii
Part 1: Introduction
1. Arabic Computational Morphology: Knowledge-based
and Empirical Methods 3
Abdelhadi Soudi, Günter Neumann and Antal van den Bosch
2. On Arabic Transliteration 15
Nizar Habash, Abdelhadi Soudi and Timothy Buckwalter
3. Issues in Arabic Morphological Analysis 23
Timothy Buckwalter
Part 2: Knowledge-Based Methods
4. A Syllable-based Account of Arabic Morphology 45
Lynne Cahill
5. Inheritance-Based Approach to Arabic Verbal Root-and-Pattern
Morphology 67
Salah R. Al-Najem
6. Arabic Computational Morphology: A Trade-off Between Multiple
Operations and Multiple Stems 89
Violetta Cavalli-Sforza and Abdelhadi Soudi
7. Grammar-Lexis Relations in the Computational Morphology of Arabic 115
Joseph Dichy and Ali Farghaly
Part 3: Empirical Methods
8. Learning to Identify Semitic Roots 143
Ezra Daya, Dan Roth and Shuly Wintner
9. Automatic Processing of Modern Standard Arabic Text 159
Mona Diab, Kadri Hacioglu and Daniel Jurafsky
10. Supervised and Unsupervised Learning of Arabic Morphology 181
Alexander Clark
11. Memory-based Morphological Analysis and Part-of-speech
Tagging of Arabic 201
Antal van den Bosch, Erwin Marsi, and Abdelhadi Soudi
Part 4: Integration of Arabic Morphology
in Larger Applications
12. Light Stemming for Arabic Information Retrieval 221
Leah S. Larkey, Lisa Ballesteros and Margaret E. Connell
13. Adapting Morphology for Arabic Information Retrieval 245
Kareem Darwish and Douglas W. Oard
14. Arabic Morphological Representations for Machine Translation 263
Nizar Habash
15. Arabic Morphological Generation and its Impact on the Quality
of Machine Translation to Arabic 287
Ahmed Guessoum and Rached Zantout
Index 303
Preface
One of the advantages of having worked in a field for twenty years is that you have an opportunity to watch research areas grow from infancy into maturity. The present book represents a marriage of two such fields: computational morphology and Arabic computational linguistics.

In the mid 1980s, Koskenniemi had just published his landmark (1983) thesis on Two-Level Morphology. Prior to Koskenniemi there had of course been work on computational morphology dating back all the way to the 1960s, but the field had never been a major focus of research. Koskenniemi changed that by taking the computational framework of finite-state transducers, proposed by Ron Kaplan and Martin Kay, and making it actually work in a real system. Koskenniemi provided a practical implementation of a well-defined computational model, and this in turn led to an explosion of work in finite-state and other approaches to morphology over the ensuing twenty years. Data-driven methods, which gained popularity in the late 1980s, took a few years to make an impact on computational morphology, but in the last decade there has been a significant amount of work, particularly in the area of self-organizing methods for morphological induction.

In the mid 1980s, with notable exceptions like early work by Beesley, there was next to nothing being done on Arabic. There simply were not the resources, nor were there very many people who both had the linguistic training and knowledge of Arabic, as well as training in natural language processing. All of this has changed. Now there are quite a few resources for Arabic, including the roughly 400 million words of Modern Standard Arabic newswire text in the Arabic Gigaword corpus, the Penn Arabic Treebank, the Prague Dependency Treebank, Tim Buckwalter's publicly available morphological analyzer, as well as a growing set of resources for Colloquial Arabic, including the Egyptian, Levantine, Iraqi and Gulf dialects. As evidenced by the contributors to this volume, there are now a large number of computational linguists with a knowledge of Arabic. And perhaps most importantly, there is a widespread interest in the community as a whole in Arabic language processing.

Like all good marriages, the union of computational morphology with Arabic language processing is one fraught with complexity; for Arabic seems almost to have been specially engineered to maximize the difficulties for automatic processing. The famous Semitic "root-and-pattern" morphology defies a straightforward implementation in terms of morpheme concatenation, and this has spawned a wide variety of different computational solutions, many of which are represented in various chapters in this volume. Students of writing systems have speculated that this root-and-pattern morphology was ultimately responsible for the second interesting and difficult property of Arabic (and several other Semitic languages), namely that the writing system is impoverished in that a fair amount of phonological information is simply missing in the script. In its normal everyday use, the script systematically fails to represent not only most vowel information (is DRS /darasa/, /durisa/, ...?), but also information on consonant gemination (is KTB /kataba/ or /kattaba/ ...?), as well as both vowel and nunation information in the nominal case system (is WLD /waladu/, /waladun/, /waladin/, ...?). If this weren't enough, as Tim Buckwalter shows, the advent of Unicode has failed to standardize Arabic encoding, so that in dealing with real texts, one has to be prepared to do a fair amount of low-level normalization; to some extent the differences reflect regional variants (such as the use of alif maqṣūra for /ya/ in Egyptian texts), but in other cases they reflect the fact that for all of its attempts at rigid design, Unicode still allows for a fair amount of "wiggle room": the same issue comes up in the encoding of South Asian languages using Brahmi-derived scripts.

The chapters in this volume attest both to the wide variety and to the sophistication of the work being done on the computational analysis of Arabic morphology, both in terms of approaches to morphological analysis, as well as in applications of such work to other areas such as machine translation and information retrieval. To be sure, part of the reason for the increased interest in Arabic language processing is due to greater funding opportunities for work on Arabic, and this in turn has been fueled by various important political events of the past few years. But it would be shortsighted to view this as the sole justification for an increased interest in Arabic. Forms of Arabic are spoken by roughly 250 million people in an area spanning North Africa to the Persian Gulf. It is the official language of over 20 countries. It is a significant minority language of a number of sub-Saharan African countries. And there is a large expatriate population spread throughout the world. Arabic is thus one of the world's major languages. History, especially that of the last hundred years, has not been kind to the Arabic-speaking peoples, and they have not had an economic clout proportional to their population. This is bound to change sooner or later, and there will be an increasing need for tools that allow one to use Arabic in the digital world as easily as one can now use English.

The work represented in this book is an important milestone along the path towards that goal. I commend the editors—Abdelhadi Soudi, Antal van den Bosch, and Günter Neumann—and all of the contributing authors on its publication.

I wish to thank Elabbas Benmamoun for helpful feedback on an earlier draft of this preface.

Richard Sproat
University of Illinois at Urbana-Champaign
PART I
Introduction
1
Arabic Computational Morphology: Knowledge-based
and Empirical Methods
Abdelhadi Soudi¹, Günter Neumann² and Antal van den Bosch³
¹ Ecole Nationale de l'Industrie Minérale, Rabat, Morocco (asoudi@gmail.com)
² Deutsches Forschungszentrum für Künstliche Intelligenz, Saarbrücken, Germany (Neumann@dfki.de)
³ Tilburg University, The Netherlands (Antal.vdnBosch@uvt.nl)
1.1 Overview
The morphology of Arabic poses special challenges to computational natural
language processing systems. The exceptional degree of ambiguity in the writing
system, the rich morphology, and the highly complex word formation process of
roots and patterns all contribute to making computational approaches to Arabic very
challenging. Indeed, many computational linguists across the world have taken up
this challenge over time, and we have been able to engage many of the researchers
with a track record in this research area to contribute to this book.
The book’s subtitle aims to reflect that widely different computational approaches
to the Arabic morphological system have been proposed. These accounts fall
into two main paradigms: the knowledge-based and the empirical. Since morpho-
logical knowledge plays an essential role in any higher-level understanding and
processing of Arabic text, the book also features a part on the integration of Arabic
morphology in larger applications, namely Information Retrieval (IR) and Machine
Translation (MT).
The book is unique in the following ways:
- It is the first comprehensive text that covers both knowledge-based and data-driven approaches to Arabic morphology;
- It provides broad but rigorous coverage of the computational techniques for the processing of Arabic morphology, as well as a detailed discussion of the linguistic approaches on which each computational treatment is based;
- In contrast to previously published books in the area, this book includes contributions in which authors demonstrate how their approaches
to morphology improve the performance of Natural Language Processing applications, namely IR and MT, including experiments and results;
- While the book focuses primarily on Arabic computational morphology, the authors do show how their approaches could be extended to other Semitic languages;
- It brings together original and extended contributions from the most distinguished actors in the knowledge-based and empirical paradigms, as well as in Arabic MT and IR.
First, we offer a brief roadmap for the book. The book is opened by a Preface
by Professor Richard Sproat who has a long-standing experience in computational
morphology. Chapter 2 introduces the transliteration scheme used in this book to
represent Arabic words for readers who cannot read the Arabic script and presents
guidelines for pronouncing Arabic given this transliteration. Chapter 3 provides
a review of the salient issues in Arabic computational morphology. Among the
issues discussed are: the status of non-standard Arabic characters (e.g., Persian
characters), problems in orthography relating to non-standard uses of Arabic
characters, problems in orthography that affect tokenization, defining standard vs.
non-standard orthography, defining contemporary morphological features (e.g., is
the energetic verb form archaic, or simply rare?), designing a maintainable system
(e.g., what level of specialized knowledge is required to make new entries in the
lexicon?), and the need to extend the analysis to written colloquial Arabic, in view
of its increasing and widespread use on the Internet.
1.2 Knowledge-based Approaches
Benefiting from the findings of modern phonology and morphology, the contri-
butions in Part 2 of the book, “Knowledge-based methods”, present computa-
tional treatments built on solid linguistic grounds. This part of the book brings
together aspects of phonology, morphology and computational morphology with
the aim of providing a linguistically tractable account of Arabic morphology. The
following four major linguistic frameworks are presented in the four chapters of
this part:
Syllable-based Morphology (SBM): In SBM, morphological realisations are defined
in terms of their syllable structure. Although most work in syllable-based
morphology has addressed European languages (especially the Germanic languages),
the theory was always intended to apply to all languages. One of the language
groups that appears on the surface to offer the biggest challenge to this theory is the
Semitic language group. In Chapter 4, Cahill presents a syllable-based analysis of
Arabic morphology which demonstrates that, not only is such an analysis possible
for Semitic languages, but the resulting analysis is not significantly different from
syllable-based analyses of European languages such as English and German. While
the account presented in this chapter does not require the separation of morphemes
à la McCarthy as is described in the previous chapter, the organisation of the lexicon
reflects this separation: information about the pattern¹, root and vowel inflections is provided by separate nodes. Interestingly, this chapter also captures the dependencies between the Arabic binyanim with only a few equations, using DATR's inheritance techniques.

¹ Also called measure or binyan (singular of binyanim).
Root-and-Pattern Morphology: The type of account of Arabic morphology
that is generally accepted by (computational) linguists is that proposed by
McCarthy (1979, 1981). In his proposal, stems are formed by a derivational combi-
nation of a root morpheme and a vowel melody. The two are arranged according
to canonical patterns. Roots are said to interdigitate with patterns to form stems.
McCarthy’s analysis differs from Harris’ (1941) in abstracting out or autosegmental-
izing the vowels from the pattern and placing them on a separate tier of the analysis.
Rules of association then match consonants with C slots and vowels with V slots to
form the abstract stem.
Harris’ segmental analysis consisted of:
Root: k t b “notion of writing”
Pattern _a_a_
Stem katab “wrote”
McCarthy (ibid.) autosegmentalizes the vowels from the pattern, as is shown below (the root consonants and the vowel melody sit on separate tiers, linked to the C and V slots of the pattern):

    Root tier:            k   t   b
                          |   |   |
    Pattern tier:         C V C V C
                             \ /
    Vocalization tier:        a
Note that the difference between the segmental analysis and the autosegmental
analysis is not just in the notation. The autosegmental approach is introduced to
capture some linguistic phenomena in Arabic, such as Spreading, a process that
involves consonant copying over intervening phonemes.
McCarthy’s autosegmental approach is reflected in most of the computational
attempts to model Arabic morphology, especially in the systems written within
finite-state morphology (Beesley, 1990, 1996; Kay, 1987; Kiraz, 1994, 2000). Since, to
our knowledge, no recent attempts have been made to improve the results of already
published work on Arabic Finite-state morphology, there is no contribution in this
book within this framework. However, it would be useful to briefly review one of
the largest systems ever built for Arabic morphology on the basis of finite-state
technology, namely the Arabic morphology system implemented using Xerox finite-
state technology.
Using Xerox lexical and rule compilers, Beesley (1998) argues that the interdigitation of Semitic roots and patterns is simply an intersection process. It is argued that triliteral roots are represented as ?*C?*C?*C?*, where ?* denotes zero or more concatenations of ? (= any symbol) and C represents any consonant. The root drs "the notion of studying", for example, can be represented as ?*d?*r?*s?*. The intersection of this representation with the pattern CaCaC produces daras "studied":
(1) [drs & CaCaC]
    (where the square brackets delimit the stem and the symbol & denotes the intersection of drs and CaCaC)

    Abstract intersected level: daras

The voweling of the pattern can also be abstracted:

(2) Abstract lexical level:     [drs & CVCVC] [a]
    Abstract intersected level: daras
For each root and pattern, a rule of the form in (1) is generated automatically. The generated rule is compiled into its corresponding transducer.² The latter maps the string [drs & CaCaC] into daras.
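To make the intersection idea concrete, here is a minimal Python sketch (our illustration, not Beesley's actual Xerox implementation): a root is turned into a regular expression mirroring ?*d?*r?*s?*, and interdigitation fills the C slots of a pattern with the root's radicals.

    import re

    def root_to_regex(root):
        # ?*d?*r?*s?* : each radical may be separated by any other symbols
        return ".*" + ".*".join(re.escape(c) for c in root) + ".*"

    def interdigitate(root, pattern):
        """Fill the C slots of a pattern (e.g. CaCaC or CaCuC) with the
        radicals of the root, keeping all other pattern symbols literal."""
        radicals = iter(root)
        return "".join(next(radicals) if s == "C" else s for s in pattern)

    stem = interdigitate("drs", "CaCaC")
    assert stem == "daras"
    # daras is indeed in the intersection: it matches ?*d?*r?*s?*
    assert re.fullmatch(root_to_regex("drs"), stem)
    assert interdigitate("qwl", "CaCuC") == "qawul"

The same toy function yields qawul from qwl and CaCuC, the intermediate form used in the hollow-verb example below.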
In order to develop these ideas further, let us consider how hollow verbs are
treated. Hollow verbs have a weak middle radical (cf. Chapter 6 for details). In
Beesley’s system, qwl “the notion of saying” is the underlying spelling of the root.
This is what appears in the root dictionary. As for all roots, the dictionary lists
all the forms that the root can take. For form 1, it is also necessary to specify the
stem vowels. Thus, in the Xerox system, the underlying form 1 perfective pattern
for qwl is CaCuC. The underlying interdigitated stem is therefore qawul. The third
person masculine singular suffix -ais added, and the underlying form qawul+a is
yielded:
(3) Levels of derivation of qaAla “he said”
Upper level: [qwl & CaCuC] + a
Intermediate level: qawul+a
Fully voweled: qaAla
The dictionary is first compiled into a finite-state transducer that recognizes strings like [qwl & CaCuC]+a. An algorithm then interdigitates the root and pattern into qawul and creates a transducer that maps between [qwl & CaCuC]+a and qawul+a. The mapping from the intermediate level to the fully voweled level is performed by alternation rules that map qawul+a to qaAla. These alternation rules are rather complex, especially for the handling of w and y, but they are normal finite-state alternation rules. All the rules are compiled into a transducer, as is the lexicon. The two are then combined via a finite-state operation called composition (denoted .o. in regular expressions). In the case at hand, w is deleted in the surface word and the vowel a is lengthened (and spelled with Alif A).
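The effect of such a rule cascade can be illustrated with a toy rewrite sequence in Python (a rough illustrative approximation of the real alternation rules, which are considerably more sophisticated); applying the rules in order corresponds to composing their transducers:

    import re

    # Toy alternation cascade mapping the intermediate level to the surface.
    rules = [
        (r"awu", "aA"),  # weak middle radical: /awu/ surfaces as long aA
        (r"\+", ""),     # remove the morpheme boundary
    ]

    def apply_cascade(form, rules):
        for pattern, replacement in rules:
            form = re.sub(pattern, replacement, form)
        return form

    assert apply_cascade("qawul+a", rules) == "qaAla"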
² A finite-state transducer is like an ordinary finite-state automaton except that it considers two strings rather than one. In a transducer, the arc labels are symbol pairs of the form x:y. The first member of the pair (the upper symbol), x, belongs to the input string, and the second member (the lower symbol), y, is part of the output string. If the members are identical, the pair is written as a single symbol.
The result, after composition of the rules, is a single two-level finite-state trans-
ducer that maps directly between strings like the following:
(4) Upper level: [qwl & CaCuC] + a
Fully voweled: qaAla
The intermediate level disappears in the composition. Thus, if we look up qaAla, we get [qwl & CaCuC]+a, and if we generate from [qwl & CaCuC]+a, we get qaAla. Generation is just the reverse of analysis: these transducers are inherently bi-directional. Bi-directionality is maintained by a direct mapping of each root and pattern pair to their respective surface realizations.
In Beesley’s system, the intersection mechanism requires the application of a rule
conveying a linguistic phenomenon to every single stem of the language. According
to Kiraz’ (2000) computational evaluation of the lexical compilation of Beesley’s
system, the intersection approach needs mrules of the form in (1) above (i.e., m
intersections) to be compiled into their respective transducers (where r<< m<
(r×v×p),r=roots, v=vocalisms, p=patterns). That is, mis far greater than r,but
less than rtimes vtimes p.
Another problem is that a full recompilation is required for new dictionary entries.
Beesley (personal communication) pointed out that, in the Arabic system (and most
others at Xerox), there is a premium placed on fast runtime performance.
Although the existing finite-state accounts have tried to show that finite-state and
two-level morphology techniques are adequate for Arabic morphology, they fall
short of capturing linguistic generalizations. There are two interesting points not
dealt with in these models. The first point relates to the syncretism cases exhibited
in the Arabic verbal and noun systems.³ Chapters 5 and 6 of this book provide linguistic evidence that an adequate theory of morphology must incorporate rules of referral in order to account for some kinds of inflectional syncretism⁴ (cf. these chapters for further details). The second point relates to the dependencies between the Arabic verbal patterns: some patterns can be derived from other patterns (see Chapters 4 and 5).
Adopting McCarthy’s multilinear formalization of Arabic morphology, Al-Najem,
introduces an inheritance-based approach that computationally captures the general-
izations, dependencies and syncretisms existing in Arabic morphology in Chapter 5.
For this, Al-Najem uses the lexical knowledge representation language DATR which
enables the definition of inheritance networks in a relatively simple way.5
Lexeme-based Morphology (LBM): LBM supports the claim that the stem is the only morphologically relevant form of a lexeme. Chapter 6, by Cavalli-Sforza and Soudi, provides linguistic evidence that the stem is the phonological domain of realization
rules. It is demonstrated that this claim is appropriate and necessary for capturing the linguistic facts in a computational account. On the one hand, the authors have done this by showing the advantages Lexeme-based Morphology has over the Lexical Morpheme Hypothesis. On the other hand, the authors have provided a computational implementation that allows them to test the approach with a non-fragmented lexicon (i.e., without sub-lexicons: a sub-lexicon for vocalism, a sub-lexicon for roots and another sub-lexicon for patterns). In this approach, the stem and operations on
the stem become the focus of the representation. Thus, the approach differs from the
previous computational representations of Arabic morphology that have essentially granted equal status to all the constituents of an Arabic word (the root, the pattern and the vocalism) by placing them in separate lexicons.

³ Syncretism refers to any instance of what Carstairs (1987:91) calls "systematic inflectional homonymy".
⁴ A referral is defined as the stipulation "that certain combinations of features have the same realization as certain others" (Zwicky 1985:372).
⁵ http://www.cogs.susx.ac.uk/lab/nlp/datr/datr.html
Stem-based Arabic Lexicon with Grammar and Lexis Specifications: The central claim of this approach is that stem-grounded lexical databases, with entries associated with grammar and lexis specifications, constitute the most appropriate organisation for the storage of pertinent information for Arabic. In Chapter 7, Dichy and
Farghaly provide an in-depth discussion of the role of grammar-lexis relations in the
computational morphology of Arabic. After presenting the limits of the pattern and
root representation as well as other approaches to Arabic morphology, the authors
argue that entries associated with a finite set of morphosyntactic w-specifiers can
guarantee a complete coverage of data within the boundaries of the word-form.
The contents of this chapter are based on two experiences in Arabic NLP development: that of DIINAR.1 ("DIctionnaire INformatisé de l'ARabe"), a comprehensive Arabic lexical resource of around 121,000 lemma-entries, and that of the lexical database and analyzers embedded in the SYSTRAN Arabic-English translator, a fully automatic transfer system (Dichy et al., 2002).
1.3 Empirical Approaches
A major benefit of the knowledge-based methods is that the rules and constraints
for recognizing and classifying the internal structure of words are defined on a
precise linguistic basis. Thus, under the assumption that the set of morphological rules and constraints defines a linguistically consistent system, if a word can be morphologically analysed, we can be sure that the resulting structure is correct.
Of course, this requires the computational basis of the analysis to be sound and
complete; however, since we also assume this for the computational basis of the
empirical methods, we do not consider this as a unique feature of knowledge-based
methods. Furthermore, it is also often assumed that the modelled linguistic system is domain independent, that is, valid and applicable in any domain. As a consequence, the ultimate goal is to implement a linguistic knowledge base (in our case, a morphological system) that covers all possible allowable structures and only these. However, this requires not only that all possible allowable structures are known by the linguist, but also that they can be formalized and implemented consistently, preferably with as little redundancy as possible.
As everybody who has implemented large-scale real-life NLP components might
have experienced, such a strict perspective has at least the following drawbacks:
- Ambiguity: The aim of formalizing all possible allowable structures means that an NLP component is expected to return all possible analyses (or readings) for a given input for further processing (by a human or another NLP component), as long as the system has no decision criteria for ranking or selecting among the alternative readings relative to a given domain or application.
- Coverage: Even if a linguistic domain is completely understood, it is extremely challenging to provide a complete implementation of all phenomena from scratch, because it might still be unclear how to represent a certain linguistic entity using the implementation formalism at hand, or because not all possible ways and constraints (e.g., about the nature of the input data) needed to properly embed the NLP component into a larger application context are known.
As a way of addressing these kinds of problems, empirical methods have been explored and developed in computational linguistics since the 1990s, cf. Cardie and Mooney (1999). Empirical methods employ machine learning techniques to extract linguistic knowledge automatically from natural language data, rather than requiring the system developer to encode the requisite knowledge manually. Since these methods are by definition data-driven, they also learn how to weight alternative solutions and how to predict useful information (e.g., a missing class label) for unknown entities through a rigorous statistical analysis of the data. In the beginning of the development of the new field of empirical methods for NLP, the proposed approaches were often considered as alternatives or even competitors to the corresponding knowledge-based methods, cf. Magerman (1995). However, in recent years the trend has become to consider both approaches as complementary to each other, and new ways of integrating knowledge-based and empirical methods are envisaged and actively explored.
Part 3 of the book, “Empirical methods”, presents four accounts of data-driven
processing models of Arabic morphology. The contributions reflect key advances in
the field of machine learning and statistical models applied to natural language.
The Part’s first chapter, Chapter 8, by Daya, Roth, and Wintner,acts as a bridge to
the second part, Knowledge-based methods, by addressing the question whether the
performance of machine-learning-based models of morphology can be boosted by
constraining them according to externally coded linguistic knowledge. As argued in
Chapter 8, this is indeed the case. The task studied in Chapter 8 is the identification
of the root of a wordform; a hard task central to morphological analysis. The chapter
describes experiments performed with SNoW (Carlson et al., 1999), based on the
Winnow classifier (Littlestone, 1988). In this chapter another bridge is made: results obtained on Arabic are combined with results obtained on Hebrew.
Subsequently, in Chapter 9, by Diab, Hacioglu, and Jurafsky, a completely machine-learning-based account of Arabic morpho-syntax is presented using support vector machines (Cortes and Vapnik, 1995). In this approach, morphology is seen as an integral part of the larger problem of part-of-speech tagging and constituent chunking. Rather than full morphological analysis, the authors focus on clitic segmentation, while a subsequent part-of-speech tagger assigns detailed morpho-syntactic tags to the segmented token sequences. Using the recently available annotated data, i.e., the Arabic Treebank distributed via the Linguistic Data Consortium, this chapter sets a standard for an integrative machine-learning approach to shallow Arabic parsing.
Chapters 10, by Clark, and 11, by Van den Bosch, Marsi, and Soudi, both present
memory-based models, the one in Chapter 10 being essentially unsupervised, and
the one in Chapter 11 supervised, drawing on the same Arabic Treebank data as
used in Chapter 9. Compared to the other chapters in this part, in Chapter 10 Clark
provides a counterpoint in showing the potential strength of unsupervised machine
learning techniques, i.e., techniques for which un-annotated training corpora suffice.
Clark shows how stochastic transducers, trained with the unsupervised expectation-maximization algorithm (Dempster, Laird, and Rubin, 1977), can learn to map base forms to inflected forms, even if the examples are not paired for particular words, but are merely collected into two unordered sets of base forms and inflected forms.
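To make the underlying machinery concrete, the following is a minimal Python sketch of the basic idea (our illustration, not Clark's actual model): a memoryless stochastic edit transducer in the style of Ristad and Yianilos, trained with EM on (base form, inflected form) pairs. The alignment between the two strings is the hidden variable; the forward-backward recursions sum over all alignments.

    from collections import defaultdict

    def forward(x, y, p):
        """alpha[i][j] = total probability of generating x[:i] and y[:j]."""
        a = [[0.0] * (len(y) + 1) for _ in range(len(x) + 1)]
        a[0][0] = 1.0
        for i in range(len(x) + 1):
            for j in range(len(y) + 1):
                if i > 0:
                    a[i][j] += a[i - 1][j] * p[(x[i - 1], "")]            # deletion
                if j > 0:
                    a[i][j] += a[i][j - 1] * p[("", y[j - 1])]            # insertion
                if i > 0 and j > 0:
                    a[i][j] += a[i - 1][j - 1] * p[(x[i - 1], y[j - 1])]  # substitution
        return a

    def backward(x, y, p):
        """beta[i][j] = total probability of generating x[i:] and y[j:]."""
        b = [[0.0] * (len(y) + 1) for _ in range(len(x) + 1)]
        b[len(x)][len(y)] = 1.0
        for i in range(len(x), -1, -1):
            for j in range(len(y), -1, -1):
                if i < len(x):
                    b[i][j] += p[(x[i], "")] * b[i + 1][j]
                if j < len(y):
                    b[i][j] += p[("", y[j])] * b[i][j + 1]
                if i < len(x) and j < len(y):
                    b[i][j] += p[(x[i], y[j])] * b[i + 1][j + 1]
        return b

    def em_step(pairs, p):
        """One EM iteration: expected edit-operation counts, then renormalize."""
        counts = defaultdict(float)
        for x, y in pairs:
            a, b = forward(x, y, p), backward(x, y, p)
            z = a[len(x)][len(y)]                       # likelihood of the pair
            for i in range(len(x) + 1):
                for j in range(len(y) + 1):
                    if i < len(x):
                        counts[(x[i], "")] += a[i][j] * p[(x[i], "")] * b[i + 1][j] / z
                    if j < len(y):
                        counts[("", y[j])] += a[i][j] * p[("", y[j])] * b[i][j + 1] / z
                    if i < len(x) and j < len(y):
                        counts[(x[i], y[j])] += a[i][j] * p[(x[i], y[j])] * b[i + 1][j + 1] / z
        total = sum(counts.values())
        return defaultdict(lambda: 1e-6, {op: c / total for op, c in counts.items()})

    # Usage sketch on two singular/broken-plural pairs (Buckwalter-style strings):
    pairs = [("kitAb", "kutub"), ("qalam", "AqlAm")]
    p = defaultdict(lambda: 0.01)   # near-uniform initial model
    for _ in range(20):
        p = em_step(pairs, p)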
In Chapter 11, Van den Bosch, Marsi and Soudi use a similar but supervised memory-based approach (Daelemans and Van den Bosch, 2005), which, as in Chapter 9, integrates the task of morphological analysis with part-of-speech tagging. Unlike Chapter 9, which focuses on clitic segmentation, Arabic morphological analysis is here defined as encompassing the segmentation and tagging of all morphemes in a wordform, including spelling changes. This is then formulated as a lattice generation task; the memory-based classification approach generates the lattice in overlapping fragments. Analysis, in other words, is reduced to sequences of classifications that each encode character-by-character segmentation codes, part-of-speech information, and spelling information; post-processing is subsequently needed to construct a lattice that ideally neither overgenerates nor undergenerates too much. Finally, the authors show that their morphological analysis module can be integrated into a part-of-speech tagger.
The issues played out in Part 3 are, in sum:
(i) The integration and co-learnability of morphological analysis with part-of-
speech tagging and shallow parsing (constituent chunking).
(ii) The limitations of single-step morphological analysis "by classification".
(iii) The relation of analysis, which is easily rephrased as a classification task
learnable by machine learning algorithms, to generation, which is much harder
to model in the standard machine learning representation frameworks.
(iv) The role of linguistic knowledge in features or constraints used in the super-
vised learning of morphological analysis.
(v) The transferability of results attained with machine-learning algorithms
between models trained on Hebrew and on Arabic.
1.4 Applications of Arabic Computational Morphology
In the two parts 2 and 3, morphological processing has been consideredmainly from
a strict isolated or modular point of perspective. In the final part 4, “Integration of
Arabic morphology into larger applications”, morphological processing is mainly
considered as a sub-component of a large-scale end-to-end NLP system, viz. systems
for performing Information Retrieval (IR) or Machine Translations (MT). Under
such a system—integration perspective, it has to be carefully exploited how effec-
tively a morphological component – which might have been mainly developed
with a generic “plug-and-play” design goal – can be embedded into such a larger
application environment. In particular, the specific input and output requirements
of the application dictate not only constraints on the representation of structure,
but also on the needed depth of the structural analysis. For example, in a full—
text search engine, the main task of the morphological component might be the
part-of-speech tagging and a stem-based segmentation, whereby in the context of
a MT system, additionally morphosyntactic information has to be computed in
order to support the parsing and generation engines. Furthermore, the interplay with
other components which are applied on the same input level has to be considered
carefully. Consider, for example, the case of Named Entity recognition (NER),
i.e., the classification of token sequences as a person name, a location name,
a date expression, etc. Here, it depends on the larger application environment whether NER is seen as a pre-processor to morphology, as a post-processor, or whether the two are treated as mutually independent processes. This choice has a direct influence on the scalability and robustness of both components. For example, if NER acts as a pre-processor, then its performance will doubtless influence the performance of the morphology.
The chapters in Part 4 focus on aspects of Arabic morphology as an integral
part of IR and MT. Recently, further large-scale applications have come into focus, e.g., cross-lingual open-domain question answering (cf. Al-Maskari and Sanderson, 2006; Awadallah and Rauber, 2006; Hammo et al., 2002) and information extraction (cf. Abuleil, 2004; Florian et al., 2004; Maloney and Niv, 1998), but also the recently launched pilot evaluation of entity translation as part of the ACE (Automatic Content Extraction) program.⁶
Part 4, “Integration of Arabic morphology into larger applications”, addresses
key issues relevant to the deployment of Arabic morphological information in
information retrieval and machine translation. One of the salient issues discussed
(and empirically tested through evaluation) is whether a root-based or a stem-
based approach to Arabic morphology would allow effective information retrieval.
Chapter 12 demonstrates that light stemming allows remarkably good information
retrieval without providing correct morphological analyses. In this context, the
authors have addressed the question as to why, given the complexity of Arabic
morphology, a morphological analyzer does not perform better than a simple
stemmer. It might be the case that for future content-based retrieval systems
⁶ http://www.nist.gov/speech/tests/ace/index.htm
morphological components are at least as important as simpler stemmers in order to
support semantic information annotation and access.
Chapter 13 presents a rapid method of developing a shallow statistical Arabic
morphological analyzer. The analyzer is concerned with generating possible roots
and stems of a given Arabic word along with a probability estimate of deriving the
word from each of the possible roots. The use of the generatedroots and stems along
with their probability estimates as index terms is evaluated in an information retrieval
application and the results are compared to index terms generated from a rule-based
Arabic morphology tool and an Arabic light stemmer.
With respect to the employment of Arabic morphology in MT, two key points are
addressed:
(i) Multiple representations of morphological information in resources for MT: Arabic resources (morphological systems, dictionaries, etc.) often use various morphological representations (e.g., lexeme, stem, root) that are not necessarily compatible with each other; hence the need for machine translation researchers to relate Arabic resources in differing morphological representations. Chapter 14 describes the different representations used by many resources and their usability in different machine translation approaches (symbolic, statistical and hybrid) for Arabic as source language and as target language.
A framework addressing this compatibility issue in a specific hybrid MT system
is discussed in detail. In this context, the lexeme-and-feature level of represen-
tation is motivated.
(ii) The role of Arabic morphology generation in MT: Chapter 15 investigates
the impact of Arabic Morphological Generation on the quality of English
to Arabic Machine Translation. To this end, the authors have translated
thousands of sentences from English to Arabic using an online MT system
and have categorized the main morphological information/errors relevant
to Arabic MT into various types of features. A detailed scrutiny of the
translated sentences, with a focus on morphological information/errors, has
revealed that the morphological information captures various linguistic aspects (syntactic, pragmatic, etc.) and hence heavily affects the quality of the translation.
1.5 A BLARK for Arabic

Data-driven processing models of Arabic morphology as well as natural language processing applications involving Arabic rely heavily on the existence of Language Resources (LRs). During a recent EU project on Arabic language resources, NEMLAR (Network for Euro-Mediterranean Language Resources), two surveys on the availability of Arabic LRs and on industrial requirements were carried out. Several tools and LRs have been identified.⁷ Interestingly, the project also worked out a BLARK (Basic LAnguage Resource Kit) for Arabic.⁸ In this context, three
important issues are discussed:
(i) Availability: the availability of LRs depends on three key factors: (1) accessibility (an LR that exists but is only available company-internally, an LR that exists and is freely usable for precompetitive research, or an LR that exists and is freely usable for both precompetitive research and product development); (2) affordability (resources at a very high cost should not be listed as fully available); (3) customizability (the degree of manipulability of resources).
(ii) Quality: the soundness, standard-compliance and interoperability of the LR
with other LRs.
(iii) Quantity: what counts as a sufficiently large LR (lexicon, corpus etc.).
Based on this BLARK, several LRs have been developed.⁹

⁷ http://www.nemlar.org
⁸ The BLARK initiative was initially launched in the Netherlands with the aim of setting up an organized infrastructure for Dutch-based Human Language Technology.
⁹ The BLARK for Arabic as well as the tools and the resources that have been identified and developed are available at the NEMLAR website.
References
Abuleil, S. (2004). Extracting names from Arabic text for question-answering systems. In
Proceedings of RIAO’2004, pp. 638–647, France, 2004.
Al-Maskari, A. and Sanderson, M. (2006). The affect of machine translation on the perfor-
mance of Arabic-English QA system. In Proceedings of the EACL’2006 Workshop on Multi-
lingual Question Answering, MLQA06, pp. 9–14, Trento, Italy.
Awadallah, R., and Rauber, A. (2006). Web-based multiple choice question answering for
English and Arabic questions. In Proceedings of the 28th European Conference on Infor-
mation Retrieval, ECIR 2006, pp. 515–518, London, UK.
Beesley, K. (1990). Finite-state description of Arabic morphology. In Proceedings of the
Second Cambridge Conference: Bilingual Computing in Arabic and English.
Beesley, K. (1996). Arabic finite-state morphological analysis and generation. In Proceedings of COLING '96, Vol. 1, pp. 89–94.
Beesley, K. (1998). Consonant spreading in Arabic stems. In Proceedings of COLING’98,
pp. 117–123.
Cardie, C., and Mooney, R. (1999). Guest editors' introduction: Machine learning and natural language. Machine Learning, 11:1–3, pp. 1–5.
Carlson, A., Cumby, C., Rosen, J., and Roth, D. (1999). SNoW user guide. Technical Report
UIUCDCS-R-99-2101, Cognitive Computation Group, Computer Science Department,
University of Illinois.
Carstairs, A. (1987). Allomorphy in Inflexion. Groom Helm, London.
Cortes, C., and Vapnik, V. (1995). Support-vector networks. Machine Learning, 20, pp. 273–297.
Daelemans, W., and Van den Bosch, A. (2005). Memory-based Language Processing.
Cambridge, UK: Cambridge University Press.
Dempster, A., Laird, N., and Rubin, D. (1977). Maximum likelihood from incomplete data
via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological),
39:1, pp. 1–38.
Dichy, J., Braham, A., Ghazali, S., and Hassoun, A. (2002). La base de connaissances linguis-
tiques DIINAR.1 (DIctionnaire INformatisé de l’ARabe, version 1). In A. Braham (Ed.),
Proceedings of the International Symposium on The Processing of Arabic (April 18–20,
2002). Université de la Manouba, Tunis.
Florian, R., Hassan, H., Ittycheriah, A., Jing, H., Kambhatla, N., Luo, X., Nicolov, N., and
Roukos, S. (2004). A statistical model for multilingual entity detection and tracking. In
Proceedings of NAACL/HLT-2004, pp. 1–8, Boston, MA, USA.
Hammo, B., Abu-Salem, H., Lytinen, S., and Evens, M. (2002) QARAB: A question
answering system to support the Arabic language. In Proceedings of the ACL-02 Workshop
on Computational Approaches to Semitic Languages, pp. 55–65.
Harris, Z.S. (1941). The linguistic structure of Hebrew. Journal of the American Oriental Society, 62, pp. 143–167.
Kay, M. (1987). Non-concatenative finite-state morphology. In Proceedings of the Third
Conference of the European Chapter of the Association for Computational Linguistics,
Copenhagen, Denmark, pp. 2–10.
Kiraz, G. (1994). Multi-tape two-level morphology: A case study in Semitic non-linear morphology. In Proceedings of COLING '94, Vol. 1, pp. 180–186.
Kiraz, G. (2000). A multi-tiered nonlinear morphology using multi-tape finite state automata: A case study on Syriac and Arabic. Computational Linguistics, 26:1, pp. 77–105.
Littlestone, N. (1988). Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2, pp. 285–318.
McCarthy, J.A. (1979). Formal Problems in Semitic Phonology and Morphology. Doctoral dissertation, MIT.
McCarthy, J.A. (1981). A prosodic theory of nonconcatenative morphology. Linguistic Inquiry, 12, pp. 373–418.
Magerman, D.M. (1995). Statistical decision-tree models for parsing. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, ACL-95, pp. 276–283. Cambridge, MA, USA.
Maloney, J., and Niv, M. (1998). TAGARAB: A fast, accurate Arabic name recognizer.
In Proceedings of the Workshop on Computational Approaches to Semitic Languages,
pp. 8–15. Montreal, Canada.
Zwicky, A. (1985). How to describe inflection. In Proceedings of the Berkeley Linguistics Society, pp. 372–386.
2
On Arabic Transliteration
Nizar Habash¹, Abdelhadi Soudi² and Timothy Buckwalter³
¹ Center for Computational Learning Systems, Columbia University, United States (habash@cs.columbia.edu)
² Ecole Nationale de l'Industrie Minérale, Morocco (asoudi@gmail.com)
³ Linguistic Data Consortium, University of Pennsylvania, United States (timbuck2@ldc.upenn.edu)

Abstract: This chapter introduces the transliteration scheme used to represent Arabic characters in this book. The scheme is a one-to-one transliteration of the Arabic script that is complete, easy to read, and consistent with Arabic computer encodings. We present guidelines for Arabic pronunciation using this transliteration scheme and discuss various idiosyncrasies of Arabic orthography.

2.1 Introduction

In this chapter, we introduce the transliteration scheme used in this book to represent Arabic words for readers who cannot read the Arabic script. We follow the definition of the terms transcription and transliteration given by Beesley (1998): the term transcription denotes an orthography that characterizes the phonology or morpho-phonology of a language, whereas the term transliteration denotes an orthography using carefully substituted orthographical symbols in a one-to-one, fully reversible mapping with that language's customary orthography. This specific definition of transliteration is sometimes called a "strict transliteration" or "orthographical transliteration" (Beesley, 1998).¹
For Arabic script, as used in writing Modern Standard Arabic, there are many ways to define the orthographic symbol set. The basic Arabic alphabet has 28 letters and eight diacritical marks. However, there are eight additional symbols that can be treated as separate letters and/or special combinations of letter and additional diacritics. One example is the Hamza (همزة hamzaħ), which can be a separate letter (ء) or can combine with other letters: أ, ؤ, ئ. As a result, it is possible to define an orthographic symbol set of Arabic where the Hamza is not just a letter but also a diacritic with a limited number of combinations. In fact, standard computer encodings of Arabic, such as CP1256, ISO-8859, and Unicode², do not do that. They all consider the additional eight symbols as separate letters.³
The Buckwalter Arabic transliteration (Buckwalter, 2001) is a transliteration
scheme that follows the standard encoding choices made for representing Arabic
characters for computers. The Buckwalter transliteration has been used in many
publications in natural language processing and in resources developed at the
Linguistic Data Consortium (LDC). The main advantages of the Buckwalter
transliteration are that it is a strict transliteration (i.e., one-to-one) and that it is written
in ASCII characters. However, the Buckwalter transliteration is not always intuitively
easy to read. We address this problem in our transliteration scheme by extending the
Buckwalter transliteration scheme to include non-ASCII characters whose
pronunciation is easier to remember. For example, instead of Buckwalter’s non-
intuitive * for Ϋ /ð/, v for Ι /ș/ and K for the diacritic ˳˰ /in/, we use ð, ș, and ƭ,
respectively. Buckwalter transliteration choices that imitate Arabic pronunciations are
kept unchanged, e.g., b for Ώ /b/ and k for ϙ /k/. Since non-ASCII characters are less
accessible through standard American keyboards (as opposed to ASCII characters),
this is clearly a trade-off between typing/coding simplicity and ease of readability.
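As an illustration of this design (our sketch, not part of the chapter), the following Python snippet remaps the Buckwalter symbols that differ from this book's scheme, using the symbols as given in Table 2.1; everything else passes through unchanged.

    # Converting Buckwalter-transliterated text to this book's scheme.
    # Both schemes are strict one-to-one transliterations, so the
    # mapping is fully reversible.
    BOOK_FROM_BUCKWALTER = {
        "|": "Ā",  # Alif with Madda
        ">": "Â",  # Hamza on Alif
        "&": "ŵ",  # Hamza on Waw
        "<": "Ǎ",  # Hamza under Alif
        "}": "ŷ",  # Hamza on Ya
        "p": "ħ",  # Ta Marbuta
        "v": "θ",
        "*": "ð",
        "$": "š",
        "Z": "Ď",
        "E": "ε",
        "g": "γ",
        "Y": "ý",  # Alif Maqsura
        "F": "ã",  # Fathatan
        "N": "ũ",  # Dammatan
        "K": "ĩ",  # Kasratan
        "o": ".",  # Sukun
    }

    def buckwalter_to_book(text):
        return "".join(BOOK_FROM_BUCKWALTER.get(c, c) for c in text)

    assert buckwalter_to_book("kitaAbN") == "kitaAbũ"   # 'a book [nom.]'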
To our knowledge, there has been no earlier attempt to create a one-to-one transliteration of the Arabic script that is complete and easy to read, and that is consistent with Arabic computer encodings. Almost all of the previously created schemes to represent Arabic characters for western readers have focused on representing phonology and morphology (transcription) or on mixing phonology and orthography, making exceptions for the transliteration of morphemes such as the definite article ("Arabic Transliteration," 2006; Buckwalter, 2001). Two transliteration schemes, ISO 233 ("ISO 233," 2006) and El-Dahdah's transliteration (El-Dahdah, 1992), get close to achieving our goal except that they are not consistent with computer encodings. For example, in ISO 233, the Hamza (همزة hamzaħ) is treated as a diacritic that combines with other characters, whereas all standard encodings of Arabic treat it as an inseparable letter part.

In Section 2.2 we introduce our transliteration scheme for Arabic script as used in Modern Standard Arabic. In Section 2.3 we present guidelines for pronunciation of Arabic using this transliteration scheme and address various idiosyncrasies of Arabic orthography.

¹ We do not consider non-one-to-one schemes because of their potential to add ambiguity (Beesley, 1998) or to become excessively cumbersome to deal with, e.g., the SATTS transliteration ("Standard Arabic Technical Transliteration System," 2006).
² Unicode actually implements both approaches, but the use of the Hamza as a diacritic in Unicode is not that common, to our knowledge.
³ Arabic letters also have multiple shapes (allographs) that fully depend on their position in the word, e.g., the letter ع takes the forms عـ, ـعـ, ـع, and ع in its initial, middle, final and standalone positions, respectively. There are also additional ligatures, such as لا for ل + ا. We do not discuss the possibility of defining an orthographic symbol set that considers allographs and ligatures as base symbols, since Arabic's simple graphotactic rules are abstracted away in all of its standard encodings. Considering the sub-letter dots of Arabic as separate symbols is not discussed either, for the same reason.
2.2 This Book’s Arabic Transliteration Scheme
Table 2.1 provides the full list of Arabic transliterations used in this book. The first three columns contain the symbols for the characters in Arabic script and contrast our transliteration with the Buckwalter transliteration; cases where our transliteration differs from Buckwalter's can be seen where the two columns disagree. The last four columns present English-glossed examples in Arabic script, our transliteration, and a phonological transcription. Since Arabic words can be written fully diacritized, partially diacritized or non-diacritized, a transliteration should preserve fully how an Arabic word is constructed. This includes preserving all possible ambiguities. For example, the Arabic word كتب ktb could represent one of many diacritized words with different meanings and pronunciations: the noun كُتُب kutub 'books' or the verb كَتَب katab 'he wrote', among others. Of course, most naturally occurring Arabic text is not diacritized; however, in this book, diacritized transliterations are always used for readability unless the point is to discuss diacritization ambiguity. In Table 2.1, we show examples in fully diacritized transliteration to contrast with the corresponding transcriptions, but the Arabic text examples are not fully diacritized.⁴

Table 2.1. This book's Arabic transliteration scheme with examples

Arabic  Ours  Buckwalter    Example    Transliteration    Transcription    Gloss
ء       '     '             سماء       samaA'             /samā'/          sky
آ       Ā     |             آمن        Āmana              /'āmana/         he believed
أ       Â     >             سأل        saÂala             /sa'ala/         he asked
ؤ       ŵ     &             مؤتمر      muŵtamar           /mu'tamar/       conference
إ       Ǎ     <             إنترنت     Ǎintarnit          /'intarnit/      internet
ئ       ŷ     }             سائل       saAŷil             /sā'il/          liquid
ا       A     A             كان        kaAna              /kāna/           he was
ب       b     b             بريد       bariyd             /barīd/          mail
ة       ħ     p             مكتبة      maktabaħ           /maktaba/        library
                                       maktabaħũ          /maktabatun/     a library [nom.]
ت       t     t             تنافس      tanaAfus           /tanāfus/        competition
ث       θ     v             ثلاثة      θalaAθaħ           /θalāθa/         three
ج       j     j             جميل       jamiyl             /jamīl/          beautiful
ح       H     H             حادّ        HaAd~              /Hādd/           sharp
خ       x     x             خوذة       xuwðaħ             /xūða/           helmet
د       d     d             دليل       daliyl             /dalīl/          guide
ذ       ð     *             ذهب        ðahab              /ðahab/          gold
ر       r     r             رفيع       rafiyε             /rafīε/          thin
ز       z     z             زينة       ziynaħ             /zīna/           decoration
س       s     s             سماء       samaA'             /samā'/          sky
ش       š     $             شريف       šariyf             /šarīf/          honest
ص       S     S             صوت        Sawt               /Sawt/           sound
ض       D     D             ضرير       Dariyr             /Darīr/          blind
ط       T     T             طويل       Tawiyl             /Tawīl/          tall
ظ       Ď     Z             ظلم        Ďulm               /Ďulm/           injustice
ع       ε     E             عمل        εamal              /εamal/          work
غ       γ     g             غريب       γariyb             /γarīb/          strange
ف       f     f             فيلم       fiylm              /fīlm/           movie
ق       q     q             قادر       qaAdir             /qādir/          capable
ك       k     k             كريم       kariym             /karīm/          generous
ل       l     l             لذيذ       laðiyð             /laðīð/          delicious
م       m     m             مدير       mudiyr             /mudīr/          manager
ن       n     n             نور        nuwr               /nūr/            light
ه       h     h             هول        hawl               /hawl/           devastation
و       w     w             وصل        waSl               /waSl/           receipt
ى       ý     Y             على        εalaý              /εala/           on
ي       y     y             تين        tiyn               /tīn/            figs
ـَ       a     a             دَهَنَ        dahana             /dahana/         he painted
ـُ       u     u             دُهِنَ        duhina             /duhina/         it was painted
ـِ       i     i             دُهِنَ        duhina             /duhina/         it was painted
ـً       ã     F             كتاباً      kitaAbAã           /kitāban/        a book [acc.]
ـٌ       ũ     N             كتابٌ       kitaAbũ            /kitābun/        a book [nom.]
ـٍ       ĩ     K             كتابٍ       kitaAbĩ            /kitābin/        a book [gen.]
ـّ †     ~     ~             كَسَّرَ        kas~ara            /kassara/        he smashed
ـْ ‡     .     o             مَسْجِد      mas.jid, masjid    /masjid/         mosque
ـ §     _     _             مَسـْجِد     mas._jid           /masjid/         mosque

† Shadda (شدة šad~aħ) is a symbol marking consonant doubling.
‡ Sukun (سكون sukuwn) is a symbol marking lack of vowel. It can be used for contrastive purposes in the transliteration; however, it is not required in this book, in the interest of readability.
§ Tatweel (تطويل taTwiyl) or Kashida (كشيدة kašiydaħ) is an orthographic elongation symbol with no phonetic value.

⁴ In this book, there are very few cases that slightly deviate from our transliteration scheme: (i) A special transcription variant is used where the one-to-one transliteration of the Arabic script interferes with the author's explanation of some linguistic phenomena. For example, in Chapter 4, "A Syllable-based Account of Arabic Morphology", the author represents vowel lengthening and gemination by vowel doubling and consonant doubling, respectively; the use of transliteration to represent these phenomena would interfere with the syllabification process. In some cases, the authors use a phonetic transcription. (ii) Snapshots from LDC resources that use the Buckwalter transliteration are presented in the Buckwalter transliteration. This is done in some of the chapters in the third, empirical part of the book.
2.3 Pronunciation Guidelines
Arabic script, as used in Modern Standard Arabic, is mostly a phonemic system with one-to-one mappings of sounds to letters and diacritics. When fully diacritized, Arabic is almost perfectly phonologically reproducible by readers given the following few rules and exceptions:

1. For most consonants, there is no issue in mapping letters to sounds. Some are easier for English speakers than others. The transcription and transliteration are the same for these cases. Table 2.2 describes how to pronounce these consonants.

Table 2.2. Arabic consonant pronunciation

Arabic  Transliteration  Pronunciation
ب       b                Boy
ت       t                Toy
ث       θ                Three
ج       j                Jordan
ح       H                Voiceless pharyngeal fricative. Sounds like a sharp h.
خ       x                Scottish Loch; Yiddish Chutzpa
د       d                Door
ذ       ð                The
ر       r                Road
ز       z                Zoo
س       s                Sue
ش       š                Shoe
ص       S                Emphatic s
ض       D                Emphatic d
ط       T                Emphatic t
ظ       Ď                Emphatic ð
ع       ε                Voiced pharyngeal fricative. Sounds like a sharp a.
غ       γ                Parisian French r
ف       f                Film
ق       q                Uvular stop. Sounds like a deep k.
ك       k                Kite
ل       l                Cool
م       m                Man
ن       n                Man
ه       h                Hot
و       w                Would
ي       y                Yoke

(Emphasis is a bass effect, giving an acoustic impression of hollow resonance to the basic sounds.)
On Arabic Transliteration 19
Table 2.3. Hamza (glottal stop) forms
Arabic ˯ ΁ ΃ ΅ · Ή
Transliteration ' Ɩ Â ǒ Ӽ ǔ
2. The consonant Hamza (ΓΰϤϫ hamzaƫ) has multiple forms in Arabic script.
There are complex rules for Hamza spelling that depend on its vocalic con-
text. For a reader, however, all of these forms are pronounced the same way: a
glottal stop as in the value of ‘tt’ in the London Cockney pronunciation of
bottle. Table 2.3 relates the different forms of Hamza in Arabic script and our
transliteration. The form of the transliteration is intended to evoke the form
used in the Arabic variant as much as possible. For instance, a circumflex is
used with A (΍), w (ϭ) and y (ϱ) to create their corresponding hamzated forms
 (΃), ǒ (΅) and ǔ (Ή).
3. Arabic has three short vowel diacritics that are represented using a, u and i. Arabic also has three nunation diacritics. These are short vowels pronounced followed by an /n/. They are not nasalized vowels. Nunation is a mark of nominal indefiniteness in Standard Arabic. Finally, Arabic has a consonant doubling diacritic, which doubles the previous consonant, and a diacritic marking the absence of a vowel. Table 2.4 lists these diacritics, their names, and corresponding transliteration and transcription values. Diacritics are largely restricted to religious texts and Arabic language school textbooks. In this respect, the Arabic writing system depends on the background knowledge of the reader to accurately pronounce the written word—much as a reader in English needs to decide on the basis of context whether "read" is pronounced /rīd/ (present tense) or /rɛd/ (past tense).
Table 2.4. Arabic diacritics

Diacritic  Name                     Transliteration  Transcription
ـَ          فتحة fatHaħ              a                /a/
ـُ          ضمة Dam~aħ               u                /u/
ـِ          كسرة kasraħ              i                /i/
ـً          تنوين فتح tanwiyn fatH   ã                /an/
ـٌ          تنوين ضم tanwiyn Dam~    ũ                /un/
ـٍ          تنوين كسر tanwiyn kasr   ĩ                /in/
ـّ          شدة šad~aħ               ~                Double previous consonant
ـْ          سكون sukuwn              .                No vowel
4. Long vowels and diphthongs in Arabic are indicated by a combination of a short vowel and a consonant. Table 2.5 lists the various Arabic long vowels and diphthongs together with their transliteration and transcription.

Table 2.5. Long vowels and diphthongs

Arabic           ـَا            ـُو            ـِي            ـَو     ـَي
Transliteration  aA            uw            iy            aw     ay
Transcription    /ā/ (long a)  /ū/ (long u)  /ī/ (long i)  /aw/   /ay/
5. The letter Alif (ا A) is used to (a) hold vowels at the beginning of words, (b) represent the long vowel /ā/, and (c) mark a couple of morphophonemic symbols in which Alif is not pronounced (see note 8).
6. The /tā' marbūTa/ (تاء مربوطة tA' marbuwTaħ), ة ħ, is typically a feminine ending. It can only appear at the end of a word and can only be followed by a diacritic. In Standard Arabic it is pronounced as /t/ when followed by a diacritic; otherwise it is silent.5
7. The /alif maqSūra/ (ألف مقصورة Âlif maqSuwraħ), ى ý, is a dotless Yā (ي y). In Standard Arabic, it is silent and always follows a short vowel a at the end of a word. For example, روى rawaý 'to tell a story' is pronounced /rawa/.6
8. There are a few exceptions to the guidelines above:
a. The Arabic definite article, ال Al /al/, is a prefix that assimilates to the first consonant in the noun it modifies if this consonant is an alveolar or dental sound (except for ج j). This set of letters is called the Sun Letters. They include ت t, ث θ, د d, ذ ð, ر r, ز z, س s, ش š, ص S, ض D, ط T, ظ Ď, ل l, and ن n. For example, the word الشمس Alšams 'the sun' is pronounced /aššams/, not */alšams/. The rest of the consonants are called Moon Letters; the definite article is not assimilated with them. For example, the word القمر Alqamar 'the moon' is pronounced /alqamar/, not */aqqamar/.
b. A silent Alif appears in the morpheme +وا +uwA /ū/, which indicates masculine plural conjugation in verbs. Another silent Alif appears after some nunated nouns, e.g., كتاباً kitaAbAã /kitāban/. In some poetic readings, this Alif can be produced as the long vowel /ā/: /kitābā/.
c. A common odd spelling is that of the proper name عمرو ςamrw /ςamr/ 'Amr', where the final w is silent.
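Since the scheme is a strict one-to-one mapping, it can be realized as a simple character table. The following is a minimal illustrative sketch in Python (the table name and function are ours, not part of the scheme itself), showing only a few of the code points from the tables above:

    # A few illustrative mappings from the transliteration tables above.
    HSB = {
        '\u0621': "'",   # hamza
        '\u0622': 'Ā',   # madda + alif
        '\u0623': 'Â',   # hamza on alif
        '\u0624': 'ŵ',   # hamza on waw
        '\u0625': 'Ǎ',   # hamza under alif
        '\u0626': 'ŷ',   # hamza on ya
        '\u0629': 'ħ',   # ta' marbuTa
        '\u0634': 'š',   # shin
        '\u0639': 'ς',   # 'ayn
    }

    def transliterate(text):
        """Map each Arabic character to its transliteration symbol, if known."""
        return ''.join(HSB.get(ch, ch) for ch in text)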
5 In modern dialects of Arabic, the /tā' marbūTa/ is always silent except when the noun ending with it is part of an /'iDāfa/ (إضافة ǍiDAfaħ) compound, in which case it is pronounced as /t/.
6 In some Arab countries such as Egypt, a common orthographic variation is to use ى ý for the letter ي y in word-final position. Orthographic variation in Arabic is further discussed in Chapter 3.
2.4 Conclusion
In this chapter, we presented the transliteration scheme used in the rest of this
book. This transliteration is a one-to-one easy-to-read complete transliteration of
the Arabic script consistent with Arabic computer encodings. We also presented
guidelines for pronouncing Arabic given this transliteration. We hope that this
transliteration scheme will become a standard to follow in the natural language
processing research community working on Arabic.
Acknowledgements
We would like to thank Ali Farghaly, Joseph Dichy and Owen Rambow for
helpful discussions.
References
Arabic Transliteration. (2006, April 6). In Wikipedia, The Free Encyclopedia. Retrieved June 19, 2006, from http://en.wikipedia.org/wiki/Arabic_transliteration
Beesley, K. (1998). Romanization, Transcription and Transliteration. Retrieved June 19, 2006, from the Xerox Research Center Europe web site: http://www.xrce.xerox.com/competencies/content-analysis/arabic/info/romanization.html
Buckwalter, T. (2001). Arabic Transliteration. Retrieved June 19, 2006, from http://www.qamus.org/transliteration.htm
El-Dahdah, A. (1992). Dictionary of Universal Arabic Grammar. Beirut: Librairie du Liban.
ISO 233. (2006, June 18). In Wikipedia, The Free Encyclopedia. Retrieved June 19, 2006, from http://en.wikipedia.org/wiki/ISO_233
Standard Arabic Technical Transliteration System. (2006, April 6). In Wikipedia, The Free Encyclopedia. Retrieved June 19, 2006, from http://en.wikipedia.org/wiki/SATTS
3
Issues in Arabic Morphological Analysis
Timothy Buckwalter
Linguistic Data Consortium, University of Pennsylvania
timbuck2@ldc.upenn.edu

Abstract: The salient issues facing contemporary Arabic morphological analysis are summarized as predominantly orthographic in nature, although the issue of how to integrate morphological analysis of the dialects into the existing morphological analysis of Modern Standard Arabic is identified as the primary challenge of the next decade. Issues of orthography that impact morphological analysis stem in part from the successful deployment of the Unicode standard and the subsequent increase in usage of the expanded Arabic character set, including what are properly Persian and Urdu characters. Additional orthographic issues impacting morphological analysis arise from the persistent and widespread variation in the spelling of letters such as hamza and tā' marbūTa, and the increasing lack of differentiation between word-final yā and alif maqSūra. The tokenization of Arabic input strings is also affected by orthography, as typists often neglect to insert a space after words that end with a non-connector letter. An increasing number of archaic morphological features and dated lexical items can be observed in Web-based Islamic publications and cannot be overlooked in contemporary analysis. Finally, the accuracy and completeness of current Arabic morphological analysis can be questioned in light of the almost complete absence of annotation for lexically-determined features of gender, number, and humanness.

3.1 Introduction

This chapter is a review of issues in Arabic morphological analysis that have gained prominence in just the last decade as a result of specific worldwide advancements in information science and technology. Among these developments are the successful deployment and widespread use of the Unicode character set, which has greatly extended the set of available characters for representing Arabic electronically, thus impacting the orthography of the language. Thanks to the proliferation of personal computers and the widespread success of Web publishing, Arabic texts are now authored primarily in electronic format and are often published without passing first through the traditional scrutiny of copy editors and other standardizing and normalizing filters. These raw published texts are today readily available for computational analysis, and they reveal various features that are relevant to morphological analysis. The most salient feature is orthographic variation, and much of it derives from purely mechanical factors, such as the manner in which specific letter combinations are displayed on different computer platforms. Other forms of orthographic variation are less artificial and more a reflection of true idiosyncratic and regional spelling tendencies, although the design of specific Arabic glyphs, such as hamza in combination with various characters functioning as orthographic props or "chairs," appears to influence typists' habits as well.

The greatest impact that personal computers and Web publishing have had on Arabic morphological analysis today is the change they are bringing about in the language itself. What Hans Wehr (1979) called "modern written Arabic" has long been synonymous with Modern Standard Arabic. Today, however, modern written Arabic includes increasing quantities of dialectal Arabic. A simple Web search of high-frequency dialectal words will yield thousands of Web pages in which dialectal and standard Arabic commingle in written form as they do in spoken form in real life. This emerging modern written Arabic manifests usage ranging from informal to formal and makes appropriate use of both dialectal Arabic and MSA—and some hybrid forms—to reflect changing social registers. Although the orthography of the dialects is far from standardized, increasing popular use and dissemination on the Web are resulting in observable and measurable orthographic conventions.

In the discussion that follows we will review in more detail these basic issues that affect Arabic morphological analysis, beginning with issues involving the nature of the input text which impact the pre-processing phase of morphological analysis, such as tokenization and normalization, and concluding with a brief discussion of the nature of the analysis itself, especially the set of lexical and grammatical features in use today. All textual examples that we cite come from actual corpus data.
3.2 Expansion of the Arabic Character Set
The implementation of the Unicode character set on systems used for authoring
Arabic texts for publication on the Web has dramatically expanded the set of char-
acters available for representing Arabic electronically, and this has had an impact,
both positive and negative, on the orthography of the language, which ultimately
impacts morphological analysis as well. The positive impact is seen in more accu-
rate representations of non-Arabic sounds, and in some cases this facilitates dis-
ambiguation of what would otherwise be homographs, such as with the word /vān/
(the name "Van" or the type of vehicle commonly called a "van"). This word is
now occasionally spelled ڤان VAn (V = ڤ = U+06A4). Normally it would be
spelled فان fAn, which would result in an additional possible analysis /fa-'inna/.
The negative impact of easy access to numerous extended Arabic characters in the
Unicode character set is seen in new and unpredictable orthographic variation that
is linguistically unjustified and hard to detect visually—in fact, much of this varia-
tion can only be detected electronically. Before we examine some individual cases
of anomalous orthography resulting from the richness of Unicode, we will present
some preliminary and basic facts on the Arabic character set.
The basic and minimal character set for representing Arabic textual data in electronic format was defined in the 1980s in the ASMO 449 code page (see Table 3.1), which identified a minimal character set of 36 Arabic letters (0x21-0x3A, 0x41-0x4A), along with a supplementary set of eight short vowels and diacritics (0x4B-0x52) whose usage has always been treated as optional. The fact that these eight short vowels and diacritics needed to be represented with zero-width glyphs may have deterred some early developers from implementing them and dealing with the technical difficulties of displaying them correctly. Even today, successful Web browsers such as Mozilla Firefox will first solve complex aspects of Arabic display, such as bidirectional rendering and multiple character encoding schemes, but will postpone solving the problem of displaying the short vowels and diacritics correctly. These characters are currently displayed in their own dedicated spaces rather than as zero-width glyphs positioned above or below the preceding character (see Figure 3.1).

Fig. 3.1. Short vowels and diacritics as displayed in Mozilla Firefox version 1.0

Table 3.1. ASMO 449 Arabic character set

The eight Arabic short vowels and diacritics have been excluded from input methods designed for portable devices such as mobile phones, which lack space for their keypad display. When cell phones were first Arabized for short text messaging in the late 1990s, the author was assigned the task of designing an Arabic telephone keypad layout for the T9 predictive text input method, whereby tapping the number sequence 5–9–3–8, for example, would automatically spell out the word سلام slAm. (The same numeric sequence spells out the words سكان skAn and صلاة SlAħ, but they are used less frequently than سلام slAm, which is displayed first when using this input method.) Although the keypad layout that we proposed clearly showed that there was insufficient room to display the eight short vowels and diacritics (see Figure 3.2), the cell phone manufacturers replied that this was not a problem because their consumer research had shown that customers did not need to use short vowels and diacritics. Although it can be debated whether the Arabic language can survive without these eight short vowels and diacritics—and character frequency counts of newswire corpora and informal writing show clearly that short vowels and diacritics play only a marginal role in the language—computerization has provided easier access to these characters. Nevertheless, their usage remains associated with specific genres of writing, such as poetry and religious texts, which by their very nature require extensive, if not full, vocalization.

Fig. 3.2. T9 Arabic keypad

Contemporary electronic storage and transmission of Arabic textual data continues to make use of the basic set of 36 characters defined in ASMO 449, although four non-standard Arabic characters were introduced into popular use in the 1990s via platform-specific 8-bit encodings such as Mac Arabic (see Table 3.2) that aimed at providing word processing capabilities for other Arabic-alphabet based languages, chiefly Persian. The four non-standard letters that are occasionally used alongside the standard Arabic character set (see Table 3.3) typically represent sounds that are considered non-native to Arabic, although usage may vary from one region to another in the Arabic-speaking world. For example, whereas in Egypt چ J (U+0686) would represent the non-Egyptian Arabic /j/, as in the name چورچ JwrJ /jūrj/ and the loan word جراچ jrAJ /garāj/, in Iraq the letter چ J (U+0686) would be used to represent the sound /č/, as in the name چلبي Jlby /čalabī/ and the dialectal word شلونچ šlwnJ /šlōnič/. The remaining characters – پ P (U+067E), ڤ V (U+06A4), and گ G (U+06AF) – are used to represent the sounds /p/, /v/, and /g/, respectively. The گ G (U+06AF) is not used in Egypt because /g/ is already the normal pronunciation of ج j (U+062C) in that region of the Arab world.

Table 3.2. Mac Arabic codepage

It is important to note that before these non-standard characters were introduced, Arabic orthography simply made use of their standard counterparts: the letters ب U+0628, ج U+062C, ف U+0641, and ك U+0643 (see Table 3.3). The corresponding non-standard characters پ U+067E, چ U+0686, ڤ U+06A4, and گ U+06AF were adopted not necessarily because the existing orthography was deemed to be inadequate, but simply because the emerging technology made the
new characters more accessible to the common user. A key and obvious factor in determining the use of these characters is the display, or lack thereof, of these extended characters on the text input keyboard itself. It should be noted that although three of these extended Arabic characters (پ U+067E, چ U+0686, and گ U+06AF) are included in the Windows 1256 codepage, they have not been mapped directly to the Arabic keyboard and must be entered via the awkward sequence of Alt + 4-digit number representing their decimal value in the codepage.

Table 3.3. Non-standard Arabic characters

Non-standard character    Standard counterpart
پ P U+067E                ب b U+0628
چ J U+0686                ج j U+062C
ڤ V U+06A4                ف f U+0641
گ G U+06AF                ك k U+0643
Today's Unicode-enabled platforms and word processors have made the entire extended Arabic alphabet character set (U+0600 to U+06FF) available to users, and this has resulted in occasional, and relatively isolated, odd electronic encodings of Arabic text on Internet publications. For example, a character frequency count that we recently conducted using UTF-8 data posted during 2004–2005 on the CNN Arabic website (arabic.cnn.com) showed statistically significant usage of ک U+06A9 (ARABIC LETTER KEHEH), which is used for Persian and Urdu texts. When examining the source data we discovered that this extended character was being used in lieu of ك U+0643 (ARABIC LETTER KAF), and for no apparent reason, because the same text contained as many instances of ordinary KAF as it did of the Persian/Urdu KEHEH. The text data in Figure 3.3 contains both characters, and because the display glyphs are practically identical, the underlying encoding difference is hard to detect visually. The words containing KEHEH have been underlined.

Fig. 3.3. Arabic text with KAF (U+0643) and KEHEH (U+06A9)

Whereas 8-bit encoding formats such as Mac Arabic and Windows 1256 usually ensured that Arabic would be represented with the basic set of 36 characters and eight optional diacritics plus no more than four non-standard Arabic characters, today's multi-byte encoding allows authors and publishers of electronic texts to make direct use of even the very glyphs reserved for rendering on output devices (monitors and printers) and not intended for text storage and text interchange—the so-called Presentation Forms. These glyphs include the initial, medial, final, and separate shapes of each letter for Arabic (U+FE70 – U+FEFF), for Persian, Urdu, and other Arabic-script languages (U+FB50 – U+FBFF), and a large inventory of Arabic ligatures (U+FC00 – U+FDFF).

Arabic Web pages that make use of Arabic presentation forms are quite rare because use of presentation forms requires encoding each line of text in reverse order, which carries with it the assumption that all lines end in carriage return and do not wrap. This so-called "visual" encoding of Arabic, which makes use of display characters or glyphs—as opposed to "logical" encoding, which makes use of abstract characters, not their display glyphs—was used briefly in the early days of Web publishing in Arabic, and has survived in a few legacy Web sites, such as the Al-Ahram mirror site originally created for non-Arabic enabled browsers (www.ahram-eg.com). Although the Al-Ahram example involves visual encoding using a proprietary font and presentation forms, we found at least one example on the Web involving Unicode presentation forms, where the Arabic names of Old Testament books were encoded logically but with presentation forms (see Figure 3.4; observed April 2005 at http://bible.gospelcom.net/versions/).

Fig. 3.4. Arabic text encoded logically (left-to-right) with presentation forms

The easy availability of Unicode extended characters and presentation forms is producing anomalies in Arabic orthography that appear unexpectedly and in a variety of forms that can often be detected only through statistical analyses, such as character frequency counts of electronic texts. These anomalies must be resolved, i.e., corrected or normalized, before the text can be submitted to morphological analysis.
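As a rough illustration of such a statistical check, the following minimal Python sketch (ours; the set name, function name, and choice of blocks scanned are assumptions, not from this chapter) counts characters from the Arabic blocks that fall outside the basic inventory described above:

    from collections import Counter

    # Basic inventory: letters and diacritics U+0621-U+0652, tatweel U+0640,
    # plus the four extended letters discussed above.
    BASIC = {chr(c) for c in range(0x0621, 0x0653)} | {'\u0640', '\u067E', '\u0686', '\u06A4', '\u06AF'}

    def flag_extended(text):
        """Count Arabic-block characters that are outside the basic set."""
        arabic = [ch for ch in text
                  if ('\u0600' <= ch <= '\u06FF'
                      or '\uFB50' <= ch <= '\uFDFF'    # presentation forms A
                      or '\uFE70' <= ch <= '\uFEFF')]  # presentation forms B
        return {ch: n for ch, n in Counter(arabic).items() if ch not in BASIC}

    # KEHEH (U+06A9) would be flagged; ordinary KAF (U+0643) would not.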
3.3 Orthographic Variation

Orthographic variation poses a challenge to morphological analysis simply because variation in surface orthography expands the inventory of unique character sequences that constitute valid input words. The existence of multiple surface forms for each word creates a need for additional methods for matching surface forms with what is usually a single canonical form in the morphological analysis lexicon. This matching of surface orthography with system-internal canonical forms is done either through orthography-specific two-level morphological rules (Beesley, 2001) or through a combination of pattern matching and generation of orthographic variants (Buckwalter, 2004a). We would like to distinguish two types of Arabic orthographic variation: (1) normal variation, which is caused by the perception among writers and typists of what is orthographically correct or at least permissible, and (2) mechanical variation, which is caused by how specific characters and character combinations are displayed on different computer platforms, which in turn affects a writer or typist's choice of keyboard input characters. We will begin by discussing normal orthographic variation.

Certain types of orthographic variation in Arabic can be observed in all regions and countries where Arabic is written. An example of this variation is the tendency to regard stem-initial hamza + alif (U+0623, U+0625) and madda + alif (U+0622) as instances of alif + optional diacritics, which means that bare alif (U+0627) can substitute for all three (see Table 3.4). In stem-medial and stem-final positions these same characters are treated as optional diacritics only if their removal does not result in ambiguity (see Tables 3.5 and 3.6). The orthographic variation observed in Table 3.6 challenges one's definition of typographical error: although the variants with bare alif (U+0627) are unambiguous to readers, most Arabic spellcheckers would flag these variants as errors. From a morphological analysis perspective these unambiguous words should be labeled simply as instances of sub-standard orthography, and the ability to analyze them should be regarded as basic robust parsing.

Table 3.4. hamza + alif (U+0623, U+0625) and madda + alif (U+0622) in stem-initial position

الاول = الأول    ان = أن / إن    وان = وأن / وإن    الى = إلى
اخر = آخر    الا = إلا / ألا    اما = أما / إما    الان = الآن

Table 3.5. hamza + alif (U+0623, U+0625) and madda + alif (U+0622) in stem-medial and stem-final positions – ambiguity prevents variation in orthography

سأل / سال    بدأ / بدا    كأن / كان
قرآن / قران    هنأك / هناك    بأن / بان

Table 3.6. hamza + alif (U+0623, U+0625) and madda + alif (U+0622) in stem-medial and stem-final positions – lack of ambiguity allows for variation in orthography

متاخر = متأخر    بشان = بشأن    تاسيس = تأسيس    بدات = بدأت
تاييد = تأييد    رات = رأت    تستانف = تستأنف    لتاكيد = لتأكيد
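The character-level part of such variant matching can be sketched in a few lines of Python (an illustration under our own naming, not the code of either analyzer cited above); in practice one generates variants for lexicon lookup rather than blindly rewriting the input:

    # Reduce hamzated/madda alifs to bare alif (U+0627) so that all the
    # surface variants in Table 3.4 map to a single lookup key.
    ALIF_VARIANTS = str.maketrans({'\u0622': '\u0627',   # madda + alif
                                   '\u0623': '\u0627',   # hamza above alif
                                   '\u0625': '\u0627'})  # hamza below alif

    def normalize_alif(word):
        """Map all hamzated/madda alifs in the word to bare alif."""
        return word.translate(ALIF_VARIANTS)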
The general perception that the two dots on tā' marbūTa (ة U+0629) are diacritics is seen in this letter's orthographic variation with word-final and separate hā' (ه U+0647; see Table 3.7), although this use of hā as a substitute for tā' marbūTa is restricted primarily to informal written Arabic. Our impression is that there is a tendency to write hā in contexts where the tā' marbūTa would not be pronounced, such as in noun-adjective phrases (e.g., مدرسه ثانويه /madrasa θānawiyya/), but to preserve the two dots of the tā' marbūTa in iDāfa (genitive or possessive) constructions (e.g., مدرسة البنات /madrasatu l-banāt/). We are unaware of any corpus-based research conducted in this area of orthography.

Table 3.7. tā' marbūTa (U+0629) spelled as hā' (U+0647)

العربيه = العربية    خليفه = خليفة
المتحده = المتحدة    سعاده = سعادة
غزه = غزة

The most significant orthographic variation that occurs in Arabic today involves the so-called Egyptian spelling of word-final yā (ي U+064A), which in Egypt, and in various regions where Egyptian spelling predominates, is typically spelled as undotted yā (see Table 3.8).

Table 3.8. Word-final undotted yā

فى = في    التى = التي
الذى = الذي    هى = هي
أى = أي    دولى = دولي

In the Unicode character set this word-final undotted yā is represented by U+06CC (ARABIC LETTER FARSI YEH). The Unicode standard (2003, p. 59) states that this letter "yeh" is written with dots in initial and medial positions, in which case it maps to Arabic yā (U+064A), and that in final and separate positions it maps to alif maqSūra (U+0649). Systems using 8-bit encoding schemes have implemented this undotted yā in two different ways, which we will refer to by their codepages: Mac Arabic and Windows 1256 (both of which antedated the Unicode standard). The Mac Arabic implementation came first, and it reflects the basic definition of Farsi "yeh" found in the Unicode documentation: the letter is dotted in initial and medial positions, and undotted in final and separate positions. However, because the Mac Arabic codepage also needed to provide a word-final dotted yā, which is needed outside the areas where Egyptian orthography predominates, two overlapping character-to-glyph mappings were created: the character known as yā (U+064A) was assigned four glyphs (initial, medial, final and separate), all dotted, and the character known as alif maqSūra was also assigned four glyphs, two dotted (initial and medial) and two undotted (final and separate). The fact that one can type either yā or alif maqSūra to create initial or medial dotted yā is seen as a flexible feature by typists, but this has resulted in mechanical orthographic variation (see below), because words such as رئيس /ra'īs/ can now be spelled two different ways electronically (see Table 3.9), and both spellings are attested on the Web with significant Google scores.

Table 3.9. Two electronic representations of the word رئيس

Unicode character sequence         Google score (April 2005)
U+0631 U+0626 U+0649 U+0633        967,000
U+0631 U+0626 U+064A U+0633        6,870

Arabic text data that is created on the Mac platform will probably have instances of alif maqSūra in word-initial and word-medial positions, and when this text is ported to platforms where alif maqSūra is used only in word-final position these orthographically anomalous words become quite obvious (see Figure 3.5). For morphological analysis it is clear that alif maqSūra and yā must be treated as two different characters, regardless of what glyphs are used to display either according to Egyptian or non-Egyptian preferences. Therefore, all instances of initial and medial alif maqSūra need to be mapped to yā.

Fig. 3.5. Arabic text with alif maqSūra in initial and medial positions
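The mapping just described is straightforward to implement. A minimal Python sketch (ours; a production system would also need to skip trailing diacritics when deciding what counts as "final" position):

    def fix_alif_maqsura(word):
        """Replace alif maqSura (U+0649) with ya (U+064A) everywhere
        except in final position, where it is legitimate."""
        if len(word) < 2:
            return word
        return word[:-1].replace('\u0649', '\u064A') + word[-1]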
The orthographic variation noted above in so-called Egyptian spelling of undotted word-final yā has undergone an interesting additional and unexpected development since the mid 1990s: the use of word-final dotted yā (U+064A) has been introduced gradually, but it has been extended to all words that should be spelled with alif maqSūra (U+0649), such as متي /matta/, موسي /mūsa/, الأعلي /al-'aςla/ and أخري /'uxra/. Coupled with the continuing normal use of undotted word-final yā, this unfortunate practice of dotting alif maqSūra has resulted in orthographic ambiguity for both yā and alif maqSūra. The upshot for morphological analysis is that, when processing text that comes from Egypt or any region under the influence of Egyptian orthography, one needs to keep in mind that word-final yā may actually represent alif maqSūra, and vice versa (which is the normal case).

There is one additional form of mechanical orthographic variation that should be mentioned, and that is the tendency of certain typists to reverse the normal alif + fatHatān sequence (U+0627 U+064B) because fatHatān + alif appears to "look better," although the reverse could also be argued, as seen in the following orthographic word pairs: أيضاً / أيضًا, شيئاً / شيئًا, فلاناً / فلانًا, أحداً / أحدًا. The fact that glyph design and implementation influences people's typing habits can also be observed in several types of hamza + hamza-chair combinations, such as word-final hamza-on-yā (U+0626; see Table 3.10), and hamza-on-wāw (U+0624; see Table 3.11).

Table 3.10. Word-final hamza-on-yā orthographic variation

مبادىء → مبادئ    طوارىء → طوارئ
فوجىء → فوجئ    الهادىء → الهادئ
خاطىء → خاطئ    لاجىء → لاجئ

Table 3.11. hamza-on-wāw orthographic variation

موءتمر → مؤتمر    موءكدا → مؤكدا
مسوءولين → مسؤولين    موءخرا → مؤخرا
موءسسات → مؤسسات    شوءون → شؤون

The arrows in the tables show the direction in which orthographic normalization should be implemented. In both cases normalization involves mapping two-character sequences to single characters: U+0649 U+0621 to U+0626, and U+0648 U+0621 to U+0624. Note the relatively small size of the hamza glyph placed on the wāw chair, which makes it difficult to read.
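Both mappings reduce a two-character sequence to a single precomposed character; a Python sketch (the function name is ours) of the normalizations indicated by the arrows:

    def normalize_hamza_chairs(word):
        """Map alif-maqSura + hamza to hamza-on-ya, and waw + hamza
        to hamza-on-waw, as in Tables 3.10 and 3.11."""
        return (word.replace('\u0649\u0621', '\u0626')   # U+0649 U+0621 -> U+0626
                    .replace('\u0648\u0621', '\u0624'))  # U+0648 U+0621 -> U+0624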
A final case of orthographic variation that merits discussion is the neglected area of "run-on" words, by which we mean the observed habit of writing certain types of two-word combinations without intervening or separating space. We regard this as a tokenization problem and we discuss it in full in the section that follows.

3.4 Tokenization of Arabic Words

By tokenization we mean the process of identifying minimal orthographic units or "words" that can be submitted for individual morphological analysis. The working definition of an Arabic word token is straightforward: Arabic words consist of one or more contiguous "alphabetic" characters (i.e., the set of 36 characters, hamza through yā, or Unicode U+0621 through U+064A), the set of eight short vowels and diacritics (U+064B through U+0652), the lengthening character (U+0640), and the set of four extended characters associated mainly with Persian usage (پ U+067E, چ U+0686, ڤ U+06A4, and گ U+06AF). Additional extended Arabic characters may enter the Arabic orthographic mainstream as various non-Arab ethnic groups increase their publishing presence and influence on the Web.
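This working definition translates directly into a character-class pattern; a minimal Python sketch (ours):

    import re

    # Letters + diacritics (U+0621-U+0652), tatweel (U+0640),
    # and the four extended letters named above.
    ARABIC_TOKEN = re.compile(r'[\u0621-\u0652\u0640\u067E\u0686\u06A4\u06AF]+')

    def tokenize(text):
        """Return maximal runs of Arabic word characters as tokens."""
        return ARABIC_TOKEN.findall(text)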
For tokenization to work correctly a certain amount of pre-processing is necessary in order to deal with the use of alphabetical characters in non-alphabetical contexts, such as use of the letter rā' (U+0631) as numeric comma (see Figure 3.6), and use of the lengthening character (U+0640) as punctuation (e.g., hyphen or em-dash; see Figure 3.7). Short vowels and diacritics are occasionally used in creative ways as punctuation (in the same manner that Latin lower case "o" is used for bullets), and because of their zero-width display characteristics these Arabic diacritics also have a tendency to become detached from the words they were intended to accompany. Isolated short vowels and diacritics must be treated as punctuation or excluded from the morphological analysis input altogether.

Fig. 3.6. Arabic letter rā' used as numeric comma

Fig. 3.7. Arabic lengthening character used as punctuation

The current approach to Arabic tokenization has overlooked the problem of "run-on" words, which is the writing of two words without intervening space, a condition that typically occurs when the first word ends with any of the thirteen "non-connector" letters: ء آ أ ؤ إ ا ة د ذ ر ز و ى (U+0621-U+0625, U+0627, U+0629, U+062F-U+0632, U+0648-U+0649). Depending on the glyph design (i.e., the perceived width) of the non-connecting letter, the person composing the text may feel free not to insert a space between the non-connector and the first letter of the next word. These "run-on" words are difficult to detect visually, but their presence in digital data is detected immediately and usually flagged as a typographical error. These mistakes are relatively frequent, as witnessed by their Google scores (see Table 3.12).

Table 3.12. Run-on words and their frequencies

                 Google score
Run-on word      April 2006    April 2005    March 2004
مديرعام          17,100        4,420         846
وزيرالخارجية     16,200        1,270         719
ملياردولار       658           352           162
الدكتورمحمد      919           493           158
وقدتم            703           358           130

The most frequent "run-on" words in Arabic are combinations of the high-frequency function words لا /lā/ and ما /mā/ – which end in the non-connector alif – with following perfect or imperfect verbs, such as لايزال /lā-yazāl/, مايرام /mā-yurām/, and مازال /mā-zāl/. The لا /lā/ of "absolute negation" concatenates freely with nouns, as in لابد /lā-budda/ and لاشك /lā-šakka/. It can be argued that these are lexicalized collocations, but their spelling with an intervening space (لا يزال, ما زال and لا بد) is generally more frequent than their spelling as single word units (see Table 3.13).

Table 3.13. 1-word and 2-word frequencies of run-on words

                 Google score (4/2006)        Google score (4/2005)
Run-on word      as 2 words    as 1 word      as 2 words    as 1 word
لايمكن           3,540,000     717,000        412,000       44,300
ماهو             4,420,000     1,850,000      792,000       87,100
لابد             2,210,000     2,010,000      188,000       276,000
لاتزال           1,120,000     192,000        106,000       15,500
ماحدث            1,190,000     155,000        97,500        7,680
مايتعلق          1,160,000     210,000        96,500        8,470
لاشك             928,000       356,000        83,500        30,300
مازالت           910,000       749,000        83,400        67,800
لاريب            170,000       30,500         12,700        3,760

Proper noun phrases, such as عبدالله /ςabdallah/, عبدالرحمن /ςabdurraHmān/, جارالله /jārallah/ and نورالدين /nūruddīn/, are also written with or without intervening space. These name compounds may be regarded as lexicalized units, but syntactically they are also iDāfa constructions, and should be treated accordingly as two separate word tokens. Some run-on words can have more than one reading, although this is extremely rare, as in فقدتم fqdtm, which could be read as two words, /fa-qad tamma/, or as a single word, /faqadtum/. The proper solution to this problem is to pre-process input strings and decouple run-on words (Buckwalter, 2004b).
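One way to realize such decoupling is to try splitting an unanalyzable token after each non-connector letter and keep only the splits in which both halves are valid words. The Python sketch below is ours, not the cited method; the lexicon-lookup predicate is assumed and not shown:

    # The thirteen non-connector letters after which a space may be omitted.
    NON_CONNECTORS = set('\u0621\u0622\u0623\u0624\u0625\u0627\u0629'
                         '\u062F\u0630\u0631\u0632\u0648\u0649')

    def split_runon(token, is_word):
        """Yield (left, right) splits at non-connectors where both
        halves pass the lexicon lookup is_word()."""
        for i in range(1, len(token)):
            if token[i - 1] in NON_CONNECTORS and is_word(token[:i]) and is_word(token[i:]):
                yield token[:i], token[i:]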
3.5 Archaic Lexical Items and Morphological Features

Morphological analysis of Arabic typically means the analysis of Modern Standard Arabic, which implies the exclusion of archaic vocabulary, orthography and morphological features associated exclusively with the literature and religious texts of the Classical period of Arabic literature. Also excluded from MSA are written forms of the vernacular, although this situation is changing rapidly and will probably become the greatest challenge in contemporary Arabic NLP (see below, "Integrating the Dialects in Arabic Morphological Analysis"). It is not uncommon for MSA texts dealing with religious topics, or with political topics in a religious context, to include quotations from the Qur'an and the Hadith, and these may be a source of archaic lexical items and rare morphological features. Although the orthography of religious quotations could be archaic as well, it is customary to use contemporary orthography (e.g. صلاة SlAħ, rather than صلوة Slwħ).

An example of an archaic lexical item that was used in an MSA context not too long ago and disseminated widely in the media is the word ضيزى Dyzý, which is not found in any dictionary of Modern Standard Arabic. This word comes from Qur. 53:22 (al-Najm) /tilka 'iðan qismatun Dīza/, "This, therefore, is an unjust division." This verse was alluded to in a taped speech by Usama bin Laden which was broadcast widely by the media in November 2002. The implication of this event for morphological analysis of contemporary Arabic is that MSA can be expected to include occasional quotations from the established corpus of traditional religious texts (i.e., Qur'an and Hadith), and that it is advisable to extend the lexical coverage of morphological analysis to such texts, especially since corpus-based lexicography is able to detect the usage and frequency of these archaic lexical items. The phrase قسمة ضيزى qsmħ Dyzý /qismatun Dīza/, for example, is now relatively well attested on the Web.

Certain archaic morphological features are restricted to religious texts and one does not find these features used in new MSA contexts. An example of this kind of archaic feature is the use of direct and indirect object pronoun clitics, as in the word زوجناكها zwjnAkhA /zawwajnākahā/, from Qur. 33:37 (al-Ahzab), "We gave her to you as a wife." Because these morphological features are limited to specific lexical items in their source religious texts, they should probably be treated as exceptions in the morphological analysis lexicon.

We will conclude by discussing two verbal features that have been categorized as archaic. The first concerns the usage of two alternative forms for the jussive mood of the doubled verb: the short assimilated form of the stem (e.g., يمر ymr /yamurra/) or the long unassimilated form (e.g., يمرر ymrr /yamrur/). Some textbooks and references cite only the assimilated form, يمر ymr /yamurra/, and this is correct because the unassimilated form is not attested in contemporary Arabic and is now considered archaic (Badawi et al., 2004, p. 65). The second feature concerns use of the energetic form, which is rare but not archaic, and requires some attention because it is typically confused in morphological analysis with the feminine plural form. Badawi et al. (2004, pp. 441–2) provide several citations, and we encountered the following citation during the morphological annotation and POS tagging of the first 700,000 words of newswire in the Penn Arabic Treebank (Maamouri et al., 2004). The actual citation is: لا ينخدعن أحد بأنه صديق أو حليف لأمريكا /lā yanxadiςanna 'aHadun bi-'anna-hu Sadīqun 'aw Halīfun li-'amrīkā/. Native informants assure us that the energetic is not uncommon, and can be heard in forceful statements such as لا يقومن أحد /lā yaqūmanna 'aHad/ and لا تقولن /lā taqūlanna/. Until we have tagged a substantially larger corpus we cannot assess adequately the extent to which the energetic form is used in contemporary Arabic.

3.6 Lexicon Design and Maintenance

After all aspects of morphological analysis have been adequately addressed, the only way to improve the quality of the analysis is by improving the lexicon. The lexicon can be enhanced in terms of its lexical coverage, by adding new words and
new meanings to old words, and also by increasing the level of grammatical detail that is described. We are familiar with two major different types of morphological analysis lexicons: the Xerox lexicon (Beesley, 2001), whose entries are based on root and pattern morphemes, and our own lexicon (Buckwalter, 2004a), whose entries make use of word stems. In the argument over which method represents the correct approach to analyzing a Semitic language such as Arabic, it should be mentioned that although root and pattern morphology is pervasive in the language, approximately seven percent of the entries in the lexicon contain no discernable pattern morpheme (and thus no discernable root morpheme, although Arabs are often capable of extracting root candidates from many non-Semitic words), and that these words must be treated with a stem-based approach. It should also be mentioned that if root and pattern morpheme information is encoded appropriately in the lexicon, it can be reported in the analysis output, regardless of which approach one uses in the analysis. The major difference, of course, is that in a system based on roots and patterns, these morphemes are used in the analysis mechanism itself. The combination, or interdigitation, of these two morphemes produces the equivalent of a stem entry in the lexicon, but not necessarily the equivalent of that stem's canonical surface orthographic form—hence the need for two-level morphology.

Lexicon entries in a two-level morphology system represent not the familiar normalized surface orthography of traditional Arabic dictionary entries, but rather their abstract lexical level. For example, the words maktaba, majalla, and maqāla are entered in the Xerox lexicon as maktaba, *majlala, and *maqwala, respectively. Furthermore, because root and pattern morphemes are entered in separate dictionaries, the actual lexicon entries consist of root morphemes (k-t-b, j-l-l, and q-w-l, in this example) and pattern morphemes (mafςala, in this example). In the Xerox lexicon Arabic prefixes and suffixes are assigned individual entries in lexicons that group items belonging to the same morpheme class, i.e., items that exhibit the same morphotactic behavior. Hence, in addition to the main lexicons of roots and pattern morphemes, there are various lexicons for morpheme classes such as prepositions, conjunctions, verb subject and object markers, the definite article, noun case endings, and verb mood markers. The morphotactic constraints are implemented via rules that state in which sequence the various lexicons can be accessed by the morphological analysis engine (Beesley et al., 1989).
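The interdigitation of a root with a pattern can be pictured as filling consonant slots in a template. A toy Python sketch (the C-slot notation is our own rendering of the pattern mafςala, not the Xerox formalism):

    def interdigitate(root, pattern):
        """Fill the C slots of a pattern template with the root consonants."""
        consonants = iter(root)
        return ''.join(next(consonants) if ch == 'C' else ch for ch in pattern)

    # k-t-b + maCCaCa -> 'maktaba'; j-l-l -> the abstract lexical form 'majlala'
    print(interdigitate('ktb', 'maCCaCa'))  # maktaba
    print(interdigitate('jll', 'maCCaCa'))  # majlala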
Our design of a stem-based system of morphological analysis was motivated in part by the complexities we experienced with developing the Xerox lexicon (which was known as the Alpnet lexicon at the time), and with the intricacies in defining the morphotactic constraints. Therefore, in designing our own system we pursued an alternate design that (1) simplified lexicon management by representing lexical items according to their normalized surface orthography, and (2) greatly simplified the morphotactic component of the system by representing all valid concatenations of prefixes and suffixes in the lexicon entries themselves. Maintaining three lexicons—of prefixes, stems, and suffixes—is vastly simpler than maintaining a dozen or more. But the greatest advantage in not using a two-level morphology approach is that lexicon entries are not orthographic abstractions but rather the familiar
canonical forms of printed dictionaries. This means that dictionary maintenance need not require a thorough knowledge of Arabic derivational morphology, which few native speakers learn as well as non-native Arabists, who are often known for their ability to cite the form number of any given verb, regular or irregular, followed by its respective active and passive participles and verbal noun.

3.7 Integrating the Dialects in Arabic Morphological Analysis

Modern written Arabic is exhibiting a growing influx of dialectal Arabic, largely as the result of unedited and uncensored publication on the Web. Significant quantities of dialectal Arabic in written form can readily be found on the Web by searching for high-frequency dialectal words (see Table 3.14). Whereas some dialectal words are common to all major dialects (e.g., اللي /illī/), a search for groups of words used only in a given dialect will lead to Web pages with text written mostly in that dialect. For example, a search for the words بدي /biddī/, شو /šū/, ليش /lēš/ and هلق /halla'/ can be used to locate written samples of Levantine Arabic, and a search for النهارده /an-nahār-da/, حاقول /H-a'ūl/, دلوقتى /di-l-wa'tī/ and محدش /ma-Hadš/ can be used to find Web pages with Egyptian Arabic. The text on these Web pages typically contains a mixture of colloquial Arabic and MSA. If the text is a transcription of speech, such as an interview from a television show, it often becomes clear that some sections of the transcript could be sufficiently formal to warrant the use of MSA case endings in the morphological analysis, but that other sections would sound awkward (too formal) with these features present, and that sections abounding in colloquial lexical items and constructions would preclude the presence of almost all case endings (see Figure 3.8). The main point is that these samples of modern written Arabic reflect the full range of registers one hears in modern spoken Arabic, without any clear-cut division between dialect and MSA.

Fig. 3.8. Sample of modern written Arabic (www.almanara.org/Audio/2-2-05.htm)

Table 3.14. High-frequency dialectal words

The morphological analysis of Arabic dialects is complicated by the relative absence of orthographic standards and by orthographic variation among different dialects. However, the rising use of dialectal Arabic on the Web is changing this situation and some orthographic conventions are taking shape. For example, it can already be observed that while Egyptians prefer to spell the word /ba'ūl/ "I say" with the alif subject marker (باقول bAqwl), Levantine speakers prefer to write it without the alif (بقول bqwl). The verbal paradigms of the dialects differ radically from MSA paradigms (e.g., imperative forms such as قول /qūl/, روح /rūH/, شوف /šūf/, and خلي /xallī/) and include additional affixes that increase the complexity of the morphotactic component of morphological analysis (e.g., ماقلتلكش /ma-'ulti-llak-š/ and حيبقالك /Ha-yib'ā-lak/). In addition, the dialectal lexicon makes use of many MSA lexical items, such as خلاص /xalāS/, لازم /lāzim/, ممكن /mumkin/, ماشي /māšī/, باين /bāyin/, يعني /yaςnī/, مال /māl/, صار /Sār/, and راح /rāH/, but with very different dialect-specific meanings and grammatical functions. Special attention needs to be given to homographs—items read one way in MSA and another way in the dialects—such as نص /naSS, nuSS/, عليا /ςulya, ςalayya/, and بقدر /bi-qadrin, ba'dar/. In order to sort out dialectal and MSA features in the analysis, it may be necessary to maintain separate lexicons and analysis modules for each dialect. The
analysis of MSA data that occurs in a predominantly dialectal context is also not free of complications. For example, should it be vocalized internally according to MSA canonical lexical forms or according to how it is pronounced? What is the vocalization of the MSA word منطقة, for example, in view of its numerous pronunciations: /minTaqa, manTiqa, munTiqa/? These are just a few of the questions that arise when dialectal Arabic commingles with what is traditionally labeled as MSA. In fact, the presence of dialectal Arabic in a text immediately brings into question whether the remaining non-dialectal portion of the text should bear any case endings and mood markers at all in its corresponding morphological analysis.

3.8 Adequacy and Accuracy of Current Morphological Analysis

By "current morphological analysis" we mean the output of the two systems with which we are familiar: the Xerox two-level morphology system (Beesley, 2001) and our own (Buckwalter, 2004a). (Other systems are described in this publication, but we have not yet had access to adequate output data samples in order to evaluate them.) The Buckwalter system has received considerable exposure and scrutiny because of its use in the Penn Arabic Treebank project (Maamouri et al., 2004). Some incisive criticism of both the Xerox and Buckwalter systems has been provided by Otakar Smrž (in prep.), of the Prague Arabic Dependency Treebank research group at Charles University, especially in areas concerning the lexicon's coverage of gender, number, and humanness features. Essentially, the Xerox and Buckwalter analyzer lexicons provide the default gender and number part-of-speech labels for noun suffixes based on their form without any regard to their actual semantic value or function in the word. For example, the suffix ة (U+0629) is labeled FEM_SG regardless of whether it occurs in the word مدرسة /madrasa/ (fem. sg. non-human), خليفة /xalīfa/ (masc. sg. human) or مغاربة /maγāriba/ (masc. pl. human). This can only be remedied by systematically entering the necessary information on gender, number and humanness for each lexical item.
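In a stem lexicon this amounts to storing the features on the entry itself rather than deriving them from the suffix form. A hypothetical Python sketch (entry format and names are ours, not the actual analyzer lexicon) for the three examples above:

    # Lexically determined features, recorded per entry.
    LEXICON_FEATURES = {
        'madrasaħ':  {'gender': 'fem',  'number': 'sg', 'human': False},  # fem. sg. non-human
        'xalīfaħ':   {'gender': 'masc', 'number': 'sg', 'human': True},   # masc. sg. human
        'maγāribaħ': {'gender': 'masc', 'number': 'pl', 'human': True},   # masc. pl. human
    }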
The set of grammatical features that are currently encoded in the Buckwalter morphological analyzer lexicon have been to a great extent determined by the requirements of the Penn Arabic Treebank project. The Buckwalter system was originally designed for simple word identification, as a first step towards generating lemmatized concordances for use in lexicography. Since its adoption for use in treebanking it has undergone considerable change, mainly in the addition of POS tags, with fairly precise distinctions applied especially to function words (Maamouri et al., 2004). However, it is surprising that many of the traditional grammatical categories that are discussed in all treatments of Arabic morphology have not been needed for successful treebanking, at least as defined in the phrase-structure Penn Treebank model. It is anticipated that forthcoming pedagogical
applications of the Buckwalter system will soon require the implementation of the following traditional grammatical labels:

• Gender, number, and humanness.
• Active and passive participles and verbal nouns. These categories are already labeled and cross-linked in the Xerox system, although verbal nouns of Form I need to be linked explicitly to their respective verb.
• Elative. This category needs to be linked to its related adjectival form (e.g. أكبر /'akbar/ → كبير /kabīr/).
• Instance noun, unit noun, and collective noun.
• Verb features such as transitive, intransitive, grammatical collocations, etc.

3.9 Conclusion

We expect the next decade of Arabic morphological analysis to be challenged by added complexity in the input data, as the orthography undergoes unpredictable usage of Unicode characters beyond the basic set required for Arabic, and as the written language itself is transformed by a steady influx of dialectal forms, forcing morphological analysis to deal with diglossic texts. The output analysis itself will be enriched by the complexities and challenges of the input data, and new grammatical features will be added to increase the level of detail and accuracy of the description. Developments in automated syntactic analysis will also influence the priorities that are followed in developing and improving morphological analysis algorithms and lexicons. Arabic morphological analysis as a discipline is still in its early stages.

References

E. Badawi, M.G. Carter, and A. Wallace. 2004. Modern Written Arabic: A Comprehensive Grammar. Routledge, London.
Kenneth R. Beesley. 2001. Finite-state Morphological Analysis and Generation of Arabic at Xerox Research: Status and Plans in 2001. In EACL 2001 Workshop Proceedings on Arabic Language Processing: Status and Prospects, pp. 1–8, Toulouse, France, July 2001.
Kenneth R. Beesley, S. Newton, and T. Buckwalter. 1989. Two-Level Finite-State Analysis of Arabic Morphology. In Proceedings of the Seminar on Bilingual Computing in Arabic and English, no pagination, University of Cambridge, U.K., September 1989.
T. Buckwalter. 2004a. Buckwalter Arabic Morphological Analyzer Version 2.0. Linguistic Data Consortium, catalog number LDC2004L02 and ISBN 1-58563-324-0.
T. Buckwalter. 2004b. Issues in Arabic Orthography and Morphology Analysis. In Proceedings of the Workshop on Computational Approaches to Arabic Script-based Languages, COLING 2004, pp. 31–34, Geneva, August 2004.
M. Maamouri, A. Bies, T. Buckwalter, and W. Mekki. 2004. The Penn Arabic Treebank: Building a Large-Scale Annotated Arabic Corpus. Paper presented at the NEMLAR International Conference on Arabic Language Resources and Tools, Cairo, Sept. 22–23, 2004.
Otakar Smrž. In prep. Functional Arabic Morphology: Formal System and Implementation. Ph.D. thesis, Charles University in Prague.
The Unicode Consortium. 2003. The Unicode Standard, Version 4.0. Boston: Addison-Wesley.
H. Wehr. 1979. A Dictionary of Modern Written Arabic. 4th edition, edited by J. Milton Cowan. Wiesbaden: Harrassowitz.
PART II
Knowledge-Based Methods
4
A Syllable-based Account of Arabic Morphology
Lynne Cahill
Natural Language Technology Group, University of Brighton
lynneca@sussex.ac.uk
Abstract: Syllable-based morphology is an approach to morphology that considers syllables to be the
primary concept in morphological description. The theory proposes that, other than simple
affixation, morphological processes or operations are best defined in terms of the resulting
syllabic structure, with syllable constituents (onset, peak, coda) being defined according to
the morphosyntactic status of the form. Although most work in syllable-based morphology
has addressed European languages (especially the Germanic languages) the theory was
always intended to apply to all languages. One of the language groups that appears on the
surface to offer the biggest challenge to this theory is the Semitic language group. In this
chapter a syllable-based analysis of Arabic morphology is presented which demonstrates
that, not only is such an analysis possible for Semitic languages, but the resulting analysis
is not significantly different from syllable-based analyses of European languages such as
English and German.
4.1 Introduction
Approaches to the morphology of the semitic languages have tended to assume
that different mechanisms are required to account for a system which, on the
face of it, appears very different from typical European morphology such as affix-
ation and ablaut. However, the syllable-based approach to morphology does not
require the radically differentiated accounts of such morphological systems. In this
approach to morphology, morphological realisations are defined in terms of their
syllable structure, with the values of syllabic constituents defined according to a
range of possible factors including morphosyntactic features, phonological context
and lexical information. It transpires that defining semitic morphology in terms of
Syllable Based Morphology (henceforth SBM) requires very similar mechanisms
to those required for defining the morphology of European languages. The chief
difference between the mechanisms required to define, for example, the various
ablaut processes in German and English, and those required for semitic morphology,
is one of degree rather than nature. That is, semitic languages may require more
distinct constituents to be defined for each inflected form, but the actual definitions are identical.
It should be stressed that SBM was developed with the intention of being able to
define morphological alternations of a wide variety of types from a wide variety of
languages from across the world. The theory assumes a model of lexical represen-
tation that defines all word forms in terms of their syllable structure.
In this chapter, we present an account of the triliteral verbal morphology of
Classical Arabic verbs. We do not set out to provide a comprehensive account of
the verbal morphological system of Arabic, as many aspects do not differ in any
significant way from the systems of European languages (for example, the affixation
processes marking person and number). We begin by giving a brief description of the
theory of syllable-based morphology, with examples from English and German. We
then outline the Arabic data which we aim to cover. Next, we describe our syllable-
based account of Arabic. We then compare this account with previous accounts of
English and German, finally giving our conclusions.
4.2 Syllable-based Morphology
The theory of syllable-based morphology is described in Cahill and Gazdar (1997,
1999a, 1999b). The fundamental assumption is that morphological alternations can
all be defined in terms of the syllabic structure of a stem. A stem is assumed to
consist of a linear string of syllables, and syllables within a string are identified
by simple indexing from either end. Syllables are simply numbered from one end
or the other. So, for English and German, both of which exhibit extensive suffix-
ation, together with adaptations to the right-hand end of stems, we assume a system
of counting from the rightmost, or final, syllable. Their syllable strings therefore
have the final syllable as syl1, the penultimate syllable as syl2 and so on. This
ignores higher level structures such as feet or tone groups, as well as lower level
structures such as mora. However, as argued in Cahill (1990), the indexing appears
to be sufficiently powerful to define in elegant terms a wide range of morpho-
logical alternations from languages as diverse as English, Bontoc, Sanskrit and
Arabic.
The internal structure of a syllable is that given in Pike and Pike (1947), which
we take to be relatively uncontroversial. That is, we assume that a syllable consists
of an onset and a rhyme. An onset consists of a number of consonants from 0 to 3
(the exact number possible depends on the phonotactic constraints of the language
in question). A rhyme consists of a peak (or nucleus) and a coda. A peak is either
a vowel or a syllabic consonant. We assume here that long vowels and diphthongs
have the same syllabic status as short vowels, although this is not an assumption that
will lead to correct analyses of precise phonotactic constraints in many languages.
However, this assumption has no implications for the analysis of Arabic presented
here and so it is unnecessary to discuss this further. A coda is a number of
consonants from 0 to 4 (again, dependent on the phonotactic constraints of the
language).
Another assumption of the account presented here is that the syllable structure
definitions are embedded within a lexicon structured as a default inheritance
hierarchy. For this we use the lexical knowledge representation language DATR
(Evans and Gazdar, 1996). This enables the definition of default inheritance networks
in a relatively simple and elegant way. In this chapter, we focus on those parts of
DATR which are necessary for the exposition of the account of Arabic morphology.
It should be stressed that the use of DATR is not essential to the definition of the
morphology in a syllable-based way, but merely a convenient way of expressing the
information.
DATR allows us to define nodes in a hierarchy which are linked by inheritance
paths. So a very simple inheritance network might be something like:
Mammal:
<legs> == 4
<fur> == yes
<young> == live.
Human:
<> == Mammal
<legs> == 2.
Platypus:
<> == Mammal
<young> == egg.
Here we define a few simple facts about mammals. We then define two subtypes of
mammal, both of which inherit by default from Mammal (via the <> paths). They
each have one piece of information that is not inherited from the node for Mammal,
but which needs to override the default value from Mammal. These equations consist
of a path (enclosed in angle brackets) on one side and a value on the other side (of
the ==). The path consists of zero or more attributes.
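For illustration, these definitions support queries such as the following (the mammal hierarchy is just an expository example of ours, so the output format is only indicative):

Human:<legs> = 2.
Human:<fur> = yes.
Platypus:<young> = egg.

Here Human:<fur> is obtained by default inheritance from Mammal, while Human:<legs> and Platypus:<young> are the local overrides.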
We assume that lexemes are typically the bottom-most nodes in the hierarchy, and
they inherit information from nodes above them in the hierarchy. A fairly typical
account of English verbs, for example, would involve nodes representing the inflec-
tional subclasses (sing-sang-sung, bring-brought etc.) each of which would in turn
inherit from a Verb node. The kind of information that is involved depends very
much on the intended application, but might be semantic, syntactic, morphological
or phonological. In the case of syllable-based morphology, we are interested in
morphology and (to a limited degree) phonology.
The most simple type of morphological alternation is affixation. This is handled
in a slightly different way from other alternations, but only in the sense that it is
adding material to the stem structure, rather than making adaptations to the existing
structure. Affixation is treated simply as concatenation of two or more strings of
syllables. Reduplicative affixation is treated in the same way, but has elements of the
affix determined (in whole or in part) by elements of the stem.1
1 It can be argued that reduplication is in fact just an extreme example of context sensitivity
such as that exhibited by the English plural suffix -s. In the case of the English suffix, the
voicing feature of the suffix is determined by the final segment of the stem, whereas in
the case of reduplication, every feature of the segments of the affix is determined by some
element of the stem.
An affix is assumed to have the same structure as a stem, i.e. a linear sequence of
syllables. In practice, most affixes are monosyllabic, or even single segments, but the
assumed structure is capable of defining all types of affixation.
Other types of morphological alternation involve defining different values for
various constituents of syllables within the stem. A vowel alternation such as the
umlaut seen in German (Haus – Häuser) or any of the ablaut alternations seen in
English (meet – met, sing – sang, choose – chose) can be defined by specifying
different values for the peak in the stem. For example, the lexical entry for meet has
the following values defined by default2:
Meet:<> == Verb
<syl1 onset> == m
<syl1 peak> == i:
<syl1 coda> == t.
This defines the peak by default to have the value /i:/ in all forms of the verb. The
peak value for the past tense form can then be defined as follows:
<syl1 peak past> == E
This states that the peak of the past tense and participle forms is /ε/. Note that the
DATR language allows us to define both forms by means of underspecification. That
is, we can further specify either the tense or participle forms by adding this attribute
to the definitional path:
<syl1 peak past tense> == E
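Under DATR's default inference, the most specific matching path wins and unspecified extensions fall back on the more general definition. Hypothetical queries (the attributes present and participle are our own illustrative choices) would thus resolve as:

Meet:<syl1 peak present tense> = i:.
Meet:<syl1 peak past tense> = E.
Meet:<syl1 peak past participle> = E.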
Any constituent of the syllable string can be defined in this way, dependent on the
morphosyntactic features that the form realises. We can use two main types of rule:
rules of realisation and rules of referral. The rules above are all rules of realisation.
That is, they define explicitly how the form is realised (phonologically). Rules of
referral, on the other hand, define relations between forms. They involve rules such
as the following default rule for German verb forms:
<phn form second> == <phn form third>
This rule states that (by default) all second person forms are the same as third person
forms. Rules of referral can refer to the local node (as above) or to the global node:
<phn form second> == "<phn form third>"
This will refer back to the original node (or lexeme) that was queried. These notions
will be explained further as they become relevant for the Arabic account.
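The distinction can be sketched with two hypothetical nodes of our own (not part of the account itself):

A:
<phn form second> == <phn form third>
<phn form third> == a.

B:
<> == A
<phn form third> == b.

Querying B:<phn form second> follows local inheritance into A and returns a; had A used the quoted path "<phn form third>" instead, the query would return b, the value at the original query node B.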
4.3 The Arabic Data
The data we will cover in this chapter is from Standard Arabic. We will include the
forms for the various different binyanim, as well as the perfective and imperfective
active and passive forms. We will not address bi- and quadriliteral roots, even though
the latter do occur in Classical Arabic. The aim of this chapter is to demonstrate that
there is nothing about Arabic morphology that requires special mechanisms, rather
than to present a fully comprehensive account of the morphological system of Arabic.
2 Note that, in the DATR code, we make use of the SAMPA computer-readable phonetic
alphabet (Wells, 1989).
We include only minimal coverage of the inflections for person and number.
As they consist primarily of simple affixation, they do not present any problems for
our account. The data we aim to account for is shown in Table 4.1. The full set of
forms as generated by the SBM account is shown in Appendix B. The data we cover
here is from a single verb, “to write”. The forms provided in Table 4.1 are not all
actual forms, as not all of the different binyanim actually occur for all verb roots.
We, therefore, do not provide glosses for these forms. The meanings of the different
binyanim are all related to the stem meaning. For example, the third binyan carries
the meaning “to correspond”.
The most important observation about the Arabic verbal system is that there appear
to be three morphemes that combine to produce a single form. The first of these
is the root morpheme, which consists of a skeleton of consonants. In most cases,
these are three consonant, or triliteral, roots. These three consonants usually appear
in all forms of the verb in question and usually appear in the same order. The second
morpheme is the vowel component, which represents the inflection for the word
form (active/passive etc.). The final morpheme involved is the CV template, which
defines the arrangement of the consonants and vowels. So, a form like kattab, which
is a perfective active form of the verb "to write", consists of the stem morpheme
ktb, the verbal inflection aa and the template morpheme CVCCVC. In order to fully
specify the form, the ordering of the Cs and Vs needs to be stated. There are many
accounts in the literature of ways of ensuring that the correct consonants get mapped
to the correct slots (the eighth binyan flop rule is a particularly nice example!)
(McCarthy, 1981; McCarthy and Prince, 1990). However, our account does not require
any special rules to ensure this, as the consonant and vowel values are defined for
all forms in exactly the same way.

Table 4.1. Complete set of verb stems for k-t-b "to write" (from McCarthy, 1981, 381)

Binyan  Perfective            Imperfective            Participle
        Active     Passive    Active      Passive     Active       Passive
I       katab      kutib      aktub       uktab       kaatib       maktuub
II      kattab     kuttib     ukattib     ukattab     mukattib     mukattab
III     kaatab     kuutib     ukaatib     ukaatab     mukaatib     mukaatab
IV      Paktab     Puktib     uPaktib     uPaktab     muPaktib     muPaktab
V       takattab   tukuttib   atakattab   utakattab   mutakattib   mutakattab
VI      takaatab   takuutib   atakaatab   utakaatab   mutakaatib   mutakaatab
VII     nkatab     nkutib     ankatib     unkatab     munkatib     munkatab
VIII    ktatab     ktutib     aktatib     uktatab     muktatib     muktatab
IX      ktabab                aktabib                 muktabib
X       staktab    stuktib    astaktib    ustaktab    mustaktib    mustaktab
XI      ktaabab               aktaabib                muktaabib
XII     ktawtab               aktawtib                muktawtib
XIII    ktawwab               aktawwib                muktawwib
XIV     ktanbab               aktanbib                muktanbib
XV      ktanbay               aktanbiy                muktanbiy
The system appears to lend itself to such a separation of the morphemes. The root
morpheme provides the underlying sense of the forms, the vowel morpheme provides
the inflectional form and the template provides the derived form, or binyan. This in
turn provides more information about the interpretation of the sense. The account
presented here, while not requiring a separation of morphemes, retains this separation
of the kinds of information provided. The organisation of the lexicon reflects the
separation, with information about the binyan provided by a set of nodes designed
for that purpose, information about the root provided by what we would consider the
true lexeme nodes and information about the vowel inflections provided by a set of
inflectional nodes accessed by all verbs in the lexicon. This organisation is analogous
to the organisation of verbs in English and German, with the slight exception of the
binyan nodes. These, however, can be viewed as similar to derived forms of words
that use fully productive and transparent processes, such as the un- prefix in English.
4.4 A Syllable-based Account of Arabic
In this section we will present our account of Arabic morphology in three sections.
The first section will describe the overall approach, and in particular, the definition
of the lexeme entries. The second section will look at the verbal inflections realised
as vowel patterns. The third section will describe the derivation of the different
binyanim.
4.4.1 The Overall Approach
First, we must examine the verbal forms and establish exactly which parts are
determined by what. A fully inflected form of an Arabic verb may consist of prefixes, a
stem and suffixes. The suffixes are person and number agreement markers while the
prefixes may indicate things like conjunctions. The stem is comprised of arrange-
ments of consonants and vowels which indicate the root, the tense and the mood.
We will not look in any detail at the agreement prefixes and suffixes, but we will
concentrate on the stem, to which these affixes may attach.
The stem in question consists of the root consonants, the tense/mood vowels and
in some cases prefixes that indicate tense and mood.3In our account, we distinguish
between agreement prefixes and tense prefixes, which always come closest to the
root and form part of the stem to which the agreement affixes attach.
The basic lexeme entries in our account of Arabic define the consonantal roots.
However, in our theory, all roots, stems and affixes must be defined as syllable
3 As will be explained below, we choose to define elements that come before the initial root
consonant as prefixes, rather than defining a different syllable structure.
Fig. 4.1. The default structure of the form katab (a tree in which the root node dominates syl2 and syl1; k is the onset of syl2, t the onset of syl1 and b the coda of syl1, with the peaks and the first coda unspecified)
strings. We must therefore define a default syllable structure for triliteral stems, with
default positions for the stem consonants. We assume the most simple structure, as
exemplified by the stem katab. The standard phonotactics of Arabic require syllables
of CV (preferred) or CVC structure. Our root must be divided into two syllables,
because there are three consonants, and syllables in Arabic may have a maximum
of two consonants. The syllable boundary must be before the middle consonant,
as syllables with a VC structure are not allowed. This therefore gives us a syllable
structure as in Figure 4.1. Note that this structure has no values specified for either
peak, or for the coda of the first syllable. There is an underlying assumption in
our theory that any constituents whose values remain unspecified when all infor-
mation (from the lexeme, inflection and derivation parts) is taken into account are 0.
Note also that we count the syllables from the right-hand end, so the final syllable
is labelled as syl1, the penultimate syllable is syl2 and so on. For the majority of
forms, this is not significant as they all have two syllables. However, there are forms
(binyanim 5 and 6) which add a syllable. As they both add the syllable at the front, it
makes more sense to count from the right-hand end so as not to disrupt the syllables
that are shared (in large part) by all binyanim. It should be noted that we opt for
an account that involves several cases of incomplete (illegal) syllables, for example,
syllables that lack a peak. As we will discuss in Section 4.5, this assumption of a
resyllabification process is required in virtually any phonologically based account of
morphology in any language.
The underlying syllable structure is defined by default for all languages with the
following three nodes:
Null:
<> == .
Syllable:
<> == Null
<phn root> == <phn syl1>
<phn $yll form> == "<phn $yll onset>" "<phn $yll rhyme>"
<phn $yll rhyme> == "<phn $yll peak>" "<phn $yll coda>".
Disyllable:
<> == Syllable
<phn root> == <phn syl2> <phn syl1>.
The first of these simply provides the ultimate default value, as discussed above, as
being a zero realisation. The second node defines the default syllable structure, as
well as the default value for a root as being monosyllabic. The obvious default for
English and German is monosyllabic. This is not necessarily the same for Arabic,
as we know that verbs and nouns virtually all have disyllabic (or at least, triliteral)
roots. However, as there are plenty of monosyllables in Arabic (e.g. function words),
the default value is valid, as these specifications can be given at the nodes for nouns
and verbs. Attributes that begin with a $ symbol are variables. The variable $yll here
ranges over the values syl1, syl2, syl3 etc., indicating that the statements here apply
to any syllable.
The statements simply say that the phonological form of a syllable is the value of
the onset followed by the value of the rhyme. The rhyme, in turn, is the value of the
peak followed by the value of the coda.
For Arabic, we can define the default verb structure to be disyllabic, with the
following definition as one part of the information defined for verbs:
Verb:
<> == Disyllable
The values to be specified for the lexeme katab can then be defined simply as follows:
Katab:
<> == Verb
<phn syl2 onset> == k
<phn syl1 onset> == t
<phn syl1 coda> == b.
In fact, however, we abstract from this to define the default root for verbs as follows4:
<phn syl2 onset> == Root:<c1>
<phn syl1 onset> == Root:<c2>
<phn syl1 coda> == Root:<c3>
Then, for katab, we need only:
<c1> == k
<c2> == t
<c3> == b
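With these definitions in place, queries on the lexeme resolve through the Verb node; for instance (illustrative queries, assuming Katab inherits from Verb as above):

Katab:<phn syl2 onset> = k.
Katab:<phn syl1 onset> = t.
Katab:<phn syl1 coda> = b.

Because Root refers back to whichever lexeme is queried, the same three equations at Verb serve every triliteral verb in the lexicon.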
The nature of the default inheritance we assume means that any of the values defined
in any of these nodes can be redefined (overridden). This will be seen extensively in
the definition of the derived forms (or binyanim) below.
4 The node name Root here refers to whichever lexeme node is being queried.
So, to sum up, at the top of the inheritance hierarchy we have a number of nodes
that define the overall structure of the account. The very top is identical to the hierar-
chies for English and German. This part defines the default structure of syllables,
and defines roots as being monosyllables by default. This part of the hierarchy also
defines a word as consisting of a root and a suffix, again by default.
The next part of the hierarchy, the Verb node, defines the default values for Arabic
verbs, including the default definition of verbs as disyllabic as already mentioned.
This performs two main functions. The first of these is to define default values for
the vowels for some of the inflected forms. The second is to map definitions relating
to the fifteen different binyanim to individual nodes providing these definitions, for
example5:
<bin1> == "Bin1:<>"
The next level of the analysis is the set of binyan nodes, which define the structure
(position of the consonants) and the vowels for the various binyan forms. These
nodes also define tense prefixes, which consist of either a single vowel or a consonant
and a vowel.
Finally, the lowest level of the hierarchy comprises the lexeme nodes. For our
purposes, these nodes consist of nothing more than the three consonants of the root,
but in a lexicon for a real application, they would include syntactic and semantic
information relating to that lexeme.
4.4.2 The Vowel Inflections
In many languages, including English and German, many morphological alternations
are characterised by vowel changes or ablaut. The main difference between these
ablaut processes and the vowel patterns seen in Arabic morphology is that typically
English and German ablaut processes only affect a single vowel (or peak) in a stem.
In fact, most of the words that undergo ablaut processes in these languages are
monosyllabic anyway. However, we do have one instance of multiple vowel change
in English, with the word woman/women. Although orthographically only the second
vowel changes, phonologically both vowels change:
/wUm@n/ – /wImIn/
Of course, we would not want to claim that a single example means that this is a
common thing in English morphology, but the fact that it exists at all does at least
demonstrate that any comprehensive account of English morphology will need to
account for such phenomena. We could, of course, simply say that this is a lexical
exception which cannot (or should not) be accounted for in a rule-governed way.
However, we feel that this is a get-out that is not really satisfactory. The forms in
question are not random suppletive forms, but simply forms that exhibit an unusual
set of changes.
5 The quotes here indicate that the node Bin1 now becomes the global node. This is why
the equations above require reference to the original query node via Root.
It is worth stressing at this point that, although cases of disyllabic stems where both
vowels exhibit alternations are rare, cases of more than one constituent in the stem
changing, or other combinations of alternation types are not unusual in English and
German. German has nouns whose plural is formed with a combination of umlaut
and a suffix (Haus – Häuser). English has nouns whose plural is formed with a
combination of a suffix and a change in the final consonant (wife – wives).
So we need to define, for the main verbal inflections, two vowel slots. In the most
basic form, the perfective active form of binyan I, the two vowels are both /a/. We
can define these values very simply, as follows:
Verb:
<> == Disyllable
<phn syl2 peak> == a
<phn syl1 peak> == a
These lines tell us simply that unless otherwise specified, the vowels in both syllables
have the value /a/.
For some specific inflections, we give the different default vowels as follows:
<phn syl2 peak perf pass> == u
<phn syl1 peak perf pass> == i
<phn syl1 peak imp act> == i
<phn syl1 peak part act> == i
These lines tell us that the vowels for the perfective passive are /u/ and /i/, giving
kutib. The imperfective active has the second vowel specified here as /i/ and the active
participle form specifies the second vowel as /i/. These values are merely defaults,
and do not necessarily represent any individual form. In fact, for binyan I, the first
vowel in each of the last two cases is zero (the forms aktub and uktab).
In addition to the vowel specifications, we also define here the tense prefixes:
<tense prefix imp act> == a
<tense prefix imp pass> == u
<tense prefix part> == m u
These definitions complete the default values.
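Taken together, these defaults already generate the binyan I perfective stems; compare the output listed in Appendix B:

Write:<bin1 mor word perf act> = k a t a b.
Write:<bin1 mor word perf pass> = k u t i b.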
4.4.3 The Binyanim
The different binyanim are defined in terms of the positions of the vowels and
consonants. The first few are all defined in their own terms, but there is a certain amount of
inheritance possible, especially in the later binyanim. The exact inheritance structure
we use is given in Figure 4.2.
This figure shows that binyanim 1 – 5, 7 and 8 all inherit from the Verb node, with
binyan 10 inheriting from binyan 4 and binyan 6 inheriting from 5. Binyanim 9 and
12 inherit from 8, binyan 13 inherits from 12, 11 and 14 inherit from 9 and finally
binyan 15 inherits from binyan 14. With these inheritance patterns it is possible to
Fig. 4.2. The inheritance structure of the binyanim: Verb at the top, with Bin1–Bin5, Bin7 and Bin8 inheriting directly from Verb; Bin10 inherits from Bin4, Bin6 from Bin5, Bin9 and Bin12 from Bin8, Bin11 and Bin14 from Bin9, Bin13 from Bin12, and Bin15 from Bin14
define all of the binyanim forms with relatively few equations. Let us now look at
how we do this.
The forms for binyan 1 are as follows:
perfective imperfective participle
active passive active passive active passive
katab kutib aktub uktab kaatib maktuub
The binyan I form is the most basic (semantically), the most widely used and, as
we might expect from the more frequently used areas of a morphological system,
the one with the most irregularities. We analyse the forms as follows. Any elements
that occur before the initial root consonant (in this case, /k/) are considered to be
prefixes. Thus the imperfective forms have prefixes /a/ and /u/, and the participle
passive form has the prefix /ma/. The three consonants all occur only once, and
in the correct order, so we do not need to say any more about those. The vowels,
on the other hand, show a large amount of variation. We see the following six
patterns:
a a
u i
0 u
0 a
aa i
0 uu
The first pair is the default value for verbs anyway, so does not need to be specified
here (see below). The second pair, likewise, is specified at the Verb node. The third
pair is specified here, and the zero value for the fourth pair is also covered in the same
definition, specifying the zero for both imperfective forms (by underspecification).
The /a/ of this pair is inherited from Verb. The participle active form has /aa/ as its first vowel. The final
pair above is given explicitly in full. In addition to the vowel pairs above, we need
to define the prefixes. For this, we need to specify two that are different from the
definitions given at Verb. These are the two participle forms. For the other prefixes,
we just need to inherit the Verb definitions.
The whole node for binyan I forms is given below:
Bin1:
<> == Verb
<phn syl1 peak imp act> == u
<phn syl2 peak imp> ==
<phn syl2 peak part pass> ==
<phn syl2 peak part act> == a a
<phn syl1 peak part pass> == u u
<tense prefix part act> ==
<tense prefix part pass> == m a.
There are a number of points that deserve some comment here. The first line here
says (uncontroversially) that this node inherits by default from the Verb node. The
next line tells us that in the imperfective active form, the peak of the second syllable
is /u/. The next line tells us that the first peak of this (and the imperfective passive
form) is zero. These lines, together with the default values above give us the form
/aktub/ for the first binyan imperfective active forms. There are five more equations,
all of which relate to the participle forms. The participle passive form has a null
vowel in the first syllable (like the imperfective forms) and could indeed be defined with
a rule of referral:
<phn syl2 peak part pass> == <phn syl2 peak imp>
We have not opted to do this, as it does not save anything, and is not a mapping that
we want to apply in any other context.
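The node above yields the remaining binyan I forms listed in Appendix B, for example:

Write:<bin1 mor word imp act> = a k t u b.
Write:<bin1 mor word part pass> = m a k t u u b.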
The other binyan definitions are much simpler to describe. Binyan 2 needs the
middle consonant to be doubled. We do this by specifying that the coda of the initial
syllable is the second root consonant. The prefix of the imperfective active form is
also defined to be /u/.
Bin2:
<> == Verb
<phn syl2 coda> == Root:<c2>
<tense prefix imp act> == u.
The third binyan form has the first vowel doubled and defines the prefix of the imper-
fective active form to be the same as that for Binyan 2. The doubled vowel is defined
as two “copies” of the vowel as defined at the Verb node. It is arguable whether this
is actually a sensible way to define this value. Would it not be simpler and more
efficient to just state that its value is /aa/? The answer is that, in this case, this would
be a more sensible approach, but in a situation where the same mapping applied to
more than one value for that path, the generalisation would be worth making. The
fourth binyan has a switch of consonants, with the glottal stop /P/ becoming the onset
of the first syllable throughout and the initial root consonant becoming the coda of
the initial syllable. This gives forms like /Paktab/ and /Puktib/. Although we might
want to consider the glottal stop part of a prefix, as it comes before the initial root
consonant, this analysis does not fit well with the remainder of the forms. If we take
the approach we have taken, we can inherit all of the other definitions from the nodes
higher up in the hierarchy.
Bin3:
<> == Verb
<phn syl2 peak> == Verb Verb
<tense prefix imp act> == Bin2.
Bin4:
<> == Verb
<phn syl2 onset> == ’?’
<phn syl2 coda> == Root:<c1>
<tense prefix imp act> == Bin2.
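The resulting perfective active stems can be checked against the output in Appendix B:

Write:<bin2 mor word perf act> = k a t t a b.
Write:<bin3 mor word perf act> = k a a t a b.
Write:<bin4 mor word perf act> = ? a k t a b.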
Binyan 10 inherits from binyan 4, with two differences. The first is that the initial
onset is occupied by the two consonants /st/ rather than by the glottal stop. The second
is the prefix for the imperfective active form, which comes directly from Verb, rather
than from binyan 4.
Bin10:
<> == Bin4
<phn syl2 onset> == s t
<tense prefix imp act> == Verb.
The binyan 5 forms are interesting, in that they appear at first glance to require an
additional prefix of /ta/:
perfective imperfective participle
active passive active passive active passive
takattab tukuttib atakattab utakattab mutakattib mutakattab
However, it can be seen from the imperfective and participle forms that this “prefix”
attaches to the root, after the tense prefixes. We therefore account for this set of forms
by defining this root as a trisyllabic root, with the additional syllable consisting of
/t/, together with the vowel of the second syllable. There are only two other small
elements that need definition here: the final peak of the imperfective active form is
/a/ (rather than the /i/ that we might expect from the other binyanim), and the coda
of the second syllable is the same as for binyan 2. There might be an argument that
we should make binyan 5 inherit from binyan 2. This is a valid position, but we
have not chosen to do this, because, although many forms are virtually the same as
the binyan 2 forms, with the additional syllable, there are two forms which require
different specifications, and so the definitions would not benefit from being inherited
from binyan 2.
Bin5:
<> == Verb
<phn root> == Trisyllable
<phn syl3 onset> == t
<phn syl3 peak> == "<phn syl2 peak>"
<phn syl1 peak imp act> == a
<phn syl2 coda> == Bin2.
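For example (cf. Appendix B):

Write:<bin5 mor word perf act> = t a k a t t a b.
Write:<bin5 mor word imp act> = a t a k a t t a b.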
Binyan 6 similarly looks very much like binyan 3, but with the extra syllable.
However, we define this as inheriting from binyan 5, with three small differences.
The peak of the additional syllable is always /a/, rather than being dependent on
the second peak. The middle consonant is not doubled here, so the coda value is
referred back to the Verb node, and the second peak is doubled, which it inherits
from binyan 3.
Bin6:
<> == Bin5
<phn syl2 coda> == Verb
<phn syl3 peak> == a
<phn syl2 peak> == Bin3.
The seventh binyan forms are also defined as trisyllabic, with the extra syllable
being defined as having only a coda, /n/. This binyan picks up the default values
from Verb for the vowels (which are significantly different from the binyan 1
forms).
Bin7:
<> == Verb
<phn root> == Trisyllable
<phn syl3 coda> == n.
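This yields, for example (cf. Appendix B):

Write:<bin7 mor word perf act> = n k a t a b.
Write:<bin7 mor word imp pass> = u n k a t a b.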
Binyan 8 is the start of a more interesting set of inheritances involving seven of the
remaining sets of forms. Binyan 8 itself is very simply defined as inheriting from
Verb, with just one difference: the second consonant is added to the first onset (after
the first consonant). This gives forms like /ktatab/.6 Binyan 9 inherits this
information, but changes the onset of the second syllable to be the third root consonant,
rather than the second. The forms for binyan 11 are similarly related to binyan
9 forms, but with the single difference that the initial vowel is doubled, which is
inherited from binyan 3.
6 The onset of this form is not a legal onset in Arabic syllables. However, these forms will
have other affixes added and a process of resyllabification will result in the surface forms
having legal syllable structures.
Bin8:
<> == Verb
<phn syl2 onset> == Verb Root:<c2>.
Bin9:
<> == Bin8
<phn syl1 onset> == Root:<c3>.
Bin11:
<> == Bin9
<phn syl2 peak> == Bin3.
Binyanim 14 and 15 also inherit from binyan 9. The forms for binyan 14 simply
add the consonant /n/ as the initial coda (/ktanbab/) and for 15, the final coda is also
changed to be /y/ (/ktanbay/).
Bin14:
<> == Bin9
<phn syl2 coda> == n.
Bin15:
<> == Bin14
<phn syl1 coda> == y.
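The effect of this inheritance chain is visible in the perfective active outputs (cf. Appendix B), each node changing just one constituent of the form it inherits:

Write:<bin8 mor word perf act> = k t a t a b.
Write:<bin9 mor word perf act> = k t a b a b.
Write:<bin14 mor word perf act> = k t a n b a b.
Write:<bin15 mor word perf act> = k t a n b a y.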
The forms for binyanim 12 and 13 are also inherited from binyan 8. The binyan 12
forms differ from those of binyan 8 only in having a /w/ as the first coda (/ktawtab/),
while 13 changes the second consonant also to /w/ (/ktawwab/).
Bin12:
<> == Bin8
<phn syl2 coda> == w.
Bin13:
<> == Bin12
<phn syl1 onset> == w.
4.5 Comparison Between Accounts of Arabic and English
and German
One of the most striking features of this account of the verbal system of Arabic is that
the equations used look, for the most part, no different from the equations required to
define English and German morphology. The comprehensive accounts of the verbal
morphology of English and German that have been implemented in a syllable-based
framework have structures that are very similar at the top. That is, they define a
default structure with syllables counted from the right-hand end of the stem. They
define a default structure consisting of a root and a suffix (with possible prefixes, in
German). Many definitions for the sub-regular classes of verbs in both English and
German define values for peaks.
One aspect of the English and German accounts that does not occur in (this
fragment of) the Arabic verbal system is the question of “active affixes” (Cahill and
Gazdar, 1999), that is, affixes whose precise realisation may vary depending
on morphosyntactic or phonological context. Although our account does not look
at such details in the Arabic system, and so this may well turn out to have similar
features, what is interesting is that the areas of the Arabic account that appear to
differ most from the English or German, namely the binyan definitions, do resemble
the active affix definitions more than the stem definitions. The definition of different
consonants and different syllable positions for consonants does not happen very
much within the stems in English and German, but it does happen within the affixes.
For example, German verb suffixes are defined as varying between -e, -te, -et, -est,
-test etc.
One aspect of the Arabic account which may appear to present problems is also
a potential problem for the accounts of English and German. This is the question
of resyllabification. The roots, affixes and final forms are all defined in terms of
full syllable structures, but in many cases, the final syllable structure is different
from the structures at earlier stages of the derivation. To cite a simple example, the
English word inform has the consonant /m/ as the final coda. However, when it has
the suffix -ation added, the /m/ moves to the onset of the following syllable. This
is a very simple example of the kind of thing that will always have to be dealt with
in any account of morphology that assumes underlying phonological structures. We
therefore feel that having to account for situations like that in binyan 7, where we
have introduced a syllable that consists of only a coda /n/, is no more problematic
than dealing with the suffixation example above.
4.6 Conclusions
We have presented an account of Arabic verbal morphology within the theory
of syllable-based morphology. This approach allows us to use precisely the same
mechanisms as used for defining the morphological systems of languages such as
English and German. In addition to this fact, it can be seen from the code defining
all the different binyanim (see Appendix A) that these widely varying, but obviously
related, sets of forms can be defined very simply with only a few equations within a
fairly simple inheritance network.
We have not provided accounts of the other variations found within Semitic
languages, notably the quadriliteral and biliteral roots and the so-called weak forms,
where one of the root consonants is either /y/ or /w/. However, we are confident
that accounts of these forms could be simply defined within the overall framework
defined above. The quadriliteral and biliteral forms display form sets similar to those
of the triliteral roots. The quadriliteral roots have a restricted set of binyanim, all of
which have positions for all four consonants. The biliteral roots simply have to
specify which of the two root consonants takes the place of the (missing) third
consonant in each form. The situation with weak forms is actually very much the
kind of phenomenon that syllable-based morphology has been used to address within
English and German morphology, namely phonologically-conditioned variation.
Again, to take a simple example, the English plural suffix -s has three different
phonetic realisations, depending on the phonological form of the final coda of the
stem. This kind of variation is equivalent to the variation seen in weak forms in
Arabic.
In conclusion, we believe that this account of Arabic, in addition to being a
linguistically interesting and computationally efficient way of representing the verbal
system of Arabic, demonstrates the wide applicability of the theory of syllable-based
morphology.
References
Cahill, L. 1990. “Syllable-based morphology”, In COLING-90, Vol. 3, pp. 48–53, Helsinki.
Cahill, L. and G. Gazdar. 1997. “The inflectional phonology of German adjectives, determiners
and pronouns”, In Linguistics 35:2, pp. 211–245.
Cahill, L. and G. Gazdar. 1999a. “The PolyLex architecture: multilingual lexicons for related
languages”, In Traitement Automatique des Langues, 40:2, pp. 5–23.
Cahill, L. and G. Gazdar. 1999b. “German noun inflection”, In Journal of Linguistics 35:1,
pp. 211–245.
Evans, R. and G. Gazdar. 1996. “DATR: A language for lexical knowledge representation”, In
Computational Linguistics 22:2, pp. 167–216.
McCarthy, J. 1981. “A prosodic theory of nonconcatenative morphology”, In Linguistic Inquiry
12, pp. 373–418.
McCarthy, J. and A. Prince. 1990. “Prosodic morphology and templatic morphology”, In
M. Eid and J. McCarthy (eds) Perspectives on Arabic Linguistics: Papers from the Second
Symposium, pp. 1–54.
Pike, K.L. and E.V. Pike. 1947. “Immediate constituents of Mazateco syllables”, In International
Journal of American Linguistics 13, pp. 78–91.
Wells, J. 1989. “Computer-coded phonemic notation of individual languages of the
European Community”, In Journal of the International Phonetic Association, 19:1,
pp. 31–54.
Appendix A: The DATR Code
The code listed here is available in the DATR archive at www.datr.org.
% The structure of syllables and syllable sequences
# vars $yll: syl1 syl2 syl3.
Null:
<> == .
Syllable:
<> == Null
<phn root> == <phn syl1>
<phn $yll form> == "<phn $yll onset>" "<phn $yll rhyme>"
<phn $yll rhyme> == "<phn $yll peak>" "<phn $yll coda>"
<phn $yll coda> == "<phn $yll body>" "<phn $yll tail>".
Disyllable:
<> == Syllable
<phn root> == <phn syl2> <phn syl1>.
Trisyllable:
<> == Syllable
<phn root> == <phn syl3> <phn syl2> <phn syl1>.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% The (default) structure of affixes and words
Affix:
<> == Syllable.
Word:
<> == Syllable
<mor word> == "<phn root form>" "<mor suffix>".
Verb:
<> == Disyllable
<mor word> == "<agr prefix>" "<tense prefix>"
"<phn root form>" "<agr suffix>"
% <agr prefix> and <agr suffix> are not defined
% further here - they will be realised as 0.
% basic root structure
<phn syl2 onset> == Root:<c1>
<phn syl1 onset> == Root:<c2>
<phn syl1 coda> == Root:<c3>
<phn syl2 peak> == a
<phn syl1 peak> == a
% default vowels for different tenses
<phn syl2 peak perf pass> == u
<phn syl1 peak perf pass> == i
<phn syl1 peak imp act> == i
<phn syl1 peak part act> == i
% prefixes for different tenses
<tense prefix imp act> == a
<tense prefix imp pass> == u
<tense prefix part> == m u
% mapping different binyan forms to binyan nodes.
<bin1> == "Bin1:<>"
<bin2> == "Bin2:<>"
<bin3> == "Bin3:<>"
<bin4> == "Bin4:<>"
<bin5> == "Bin5:<>"
<bin6> == "Bin6:<>"
<bin7> == "Bin7:<>"
<bin8> == "Bin8:<>"
<bin9> == "Bin9:<>"
<bin10> == "Bin10:<>"
<bin11> == "Bin11:<>"
<bin12> == "Bin12:<>"
<bin13> == "Bin13:<>"
<bin14> == "Bin14:<>"
<bin15> == "Bin15:<>".
% Binyan I - the most basic and with the most
% "irregularities"
Bin1:
<> == Verb
<phn syl2 coda imp act> == "<c1>"
<phn syl1 peak imp act> == u
<phn syl2 peak imp> ==
<phn syl2 peak part pass> ==
<phn syl2 peak part act> == a a
<phn syl1 peak part pass> == u u
<tense prefix part act> ==
<tense prefix part pass> == m a.
% Binyan II - doubled middle consonant, one prefix
% difference
Bin2:
<> == Verb
<phn syl2 coda> == Root:<c2>
<tense prefix imp act> == u.
% Binyan III - doubled first vowel, prefix difference
% as above
Bin3:
<> == Verb
<phn syl2 peak> == Verb Verb
<tense prefix imp act> == Bin2.
% Binyan IV - c1 moved and /?/ added in its place,
% prefix as above
Bin4:
<> == Verb
<phn syl2 onset> == ’?’
<phn syl2 coda> == Root:<c1>
<tense prefix imp act> == Bin2.
% Binyan V - trisyllable, has extra syllable’s
% values defined, different final peak, consonant doubling
% as for Bin II.
Bin5:
<> == Verb
<phn root> == Trisyllable
<phn syl3 onset> == t
<phn syl3 peak> == "<phn syl2 peak>"
<phn syl1 peak imp act> == a
<phn syl2 coda> == Bin2.
% Binyan VI - as Bin V but no consonant doubling,
% first peak fixed, second doubled as Bin III.
Bin6:
<> == Bin5
<phn syl2 coda> == Verb
<phn syl3 peak> == a
<phn syl2 peak> == Bin3.
% Binyan VII - trisyllable, coda only defined for
% first syllable
Bin7:
<> == Verb
<phn root> == Trisyllable
<phn syl3 coda> == n.
% Binyan VIII - adds second consonant to first onset.
Bin8:
<> == Verb
<phn syl2 onset> == Verb Root:<c2>.
% Binyan IX - as Bin VIII but the second onset is the third,
% not the second, consonant
Bin9:
<> == Bin8
<phn syl1 onset> == Root:<c3>.
% Binyan X - as Bin IV, but with /st/ in place
% of first consonant
Bin10:
<> == Bin4
<phn syl2 onset> == s t
<tense prefix imp act> == Verb.
% Binyan XI - as Bin IX, but vowel doubled as in Bin III
Bin11:
<> == Bin9
<phn syl2 peak> == Bin3.
% Binyan XII - as Bin VIII, first coda /w/
Bin12:
<> == Bin8
<phn syl2 coda> == w.
% Binyan XIII - as Bin XII, second onset also /w/
Bin13:
<> == Bin12
<phn syl1 onset> == w.
% Binyan XIV - as Bin IX, first coda /n/
Bin14:
<> == Bin9
<phn syl2 coda> == n.
% Binyan XV - as Bin XIV, final consonant /y/
Bin15:
<> == Bin14
<phn syl1 coda> == y.
Write:
<> == Verb
<c1> == k
<c2> == t
<c3> == b.
Appendix B: The Output of the Theory
Write:<bin1 mor word perf act> = k a t a b.
Write:<bin2 mor word perf act> = k a t t a b.
Write:<bin3 mor word perf act> = k a a t a b.
Write:<bin4 mor word perf act> = ? a k t a b.
Write:<bin5 mor word perf act> = t a k a t t a b.
Write:<bin6 mor word perf act> = t a k a a t a b.
Write:<bin7 mor word perf act> = n k a t a b.
Write:<bin8 mor word perf act> = k t a t a b.
Write:<bin9 mor word perf act> = k t a b a b.
Write:<bin10 mor word perf act> = s t a k t a b.
Write:<bin11 mor word perf act> = k t a a b a b.
Write:<bin12 mor word perf act> = k t a w t a b.
Write:<bin13 mor word perf act> = k t a w w a b.
Write:<bin14 mor word perf act> = k t a n b a b.
Write:<bin15 mor word perf act> = k t a n b a y.
Write:<bin1 mor word perf pass> = k u t i b.
Write:<bin2 mor word perf pass> = k u t t i b.
Write:<bin3 mor word perf pass> = k u u t i b.
Write:<bin4 mor word perf pass> = ? u k t i b.
Write:<bin5 mor word perf pass> = t u k u t t i b.
Write:<bin6 mor word perf pass> = t a k u u t i b.
Write:<bin7 mor word perf pass> = n k u t i b.
Write:<bin8 mor word perf pass> = k t u t i b.
Write:<bin10 mor word perf pass> = s t u k t i b.
Write:<bin1 mor word imp act> = a k t u b.
Write:<bin2 mor word imp act> = u k a t t i b.
Write:<bin3 mor word imp act> = u k a a t i b.
Write:<bin4 mor word imp act> = u ? a k t i b.
Write:<bin5 mor word imp act> = a t a k a t t a b.
Write:<bin6 mor word imp act> = a t a k a a t a b.
Write:<bin7 mor word imp act> = a n k a t i b.
Write:<bin8 mor word imp act> = a k t a t i b.
Write:<bin9 mor word imp act> = a k t a b i b.
Write:<bin10 mor word imp act> = a s t a k t i b.
Write:<bin11 mor word imp act> = a k t a a b i b.
Write:<bin12 mor word imp act> = a k t a w t i b.
Write:<bin13 mor word imp act> = a k t a w w i b.
Write:<bin14 mor word imp act> = a k t a n b i b.
Write:<bin15 mor word imp act> = a k t a n b i y.
Write:<bin1 mor word imp pass> = u k t a b.
Write:<bin2 mor word imp pass> = u k a t t a b.
Write:<bin3 mor word imp pass> = u k a a t a b.
Write:<bin4 mor word imp pass> = u ? a k t a b.
Write:<bin5 mor word imp pass> = u t a k a t t a b.
Write:<bin6 mor word imp pass> = u t a k a a t a b.
Write:<bin7 mor word imp pass> = u n k a t a b.
Write:<bin8 mor word imp pass> = u k t a t a b.
Write:<bin10 mor word imp pass> = u s t a k t a b.
Write:<bin1 mor word part act> = k a a t i b.
Write:<bin2 mor word part act> = m u k a t t i b.
Write:<bin3 mor word part act> = m u k a a t i b.
Write:<bin4 mor word part act> = m u ? a k t i b.
Write:<bin5 mor word part act> = m u t a k a t t i b.
Write:<bin6 mor word part act> = m u t a k a a t i b.
Write:<bin7 mor word part act> = m u n k a t i b.
Write:<bin8 mor word part act> = m u k t a t i b.
Write:<bin9 mor word part act> = m u k t a b i b.
Write:<bin10 mor word part act> = m u s t a k t i b.
Write:<bin11 mor word part act> = m u k t a a b i b.
Write:<bin12 mor word part act> = m u k t a w t i b.
Write:<bin13 mor word part act> = m u k t a w w i b.
Write:<bin14 mor word part act> = m u k t a n b i b.
Write:<bin15 mor word part act> = m u k t a n b i y.
Write:<bin1 mor word part pass> = m a k t u u b.
Write:<bin2 mor word part pass> = m u k a t t a b.
Write:<bin3 mor word part pass> = m u k a a t a b.
Write:<bin4 mor word part pass> = m u ? a k t a b.
Write:<bin5 mor word part pass> = m u t a k a t t a b.
Write:<bin6 mor word part pass> = m u t a k a a t a b.
Write:<bin7 mor word part pass> = m u n k a t a b.
Write:<bin8 mor word part pass> = m u k t a t a b.
Write:<bin10 mor word part pass> = m u s t a k t a b.
5
Inheritance-based Approach to Arabic Verbal
Root-and-Pattern Morphology
Salah R. Al-Najem
Department of Arabic, Faculty of Arts, Kuwait University, P.O. Box 23558, Safat, Kuwait (alnajem@arts.kuniv.edu.kw)
Abstract: This chapter introduces a computational approach to the derivation and inflection of Arabic
verbs. The approach attempts to capture generalizations, dependencies, and syncretisms
existing in Arabic verbal morphology in a compact, non-redundant manner. The approach
is represented by a computational implementation in the DATR lexical knowledge
representation language. Generalizations will be captured through the use of the Default
Inheritance technique. Dependencies will be handled through the use of the Multiple
Inheritance technique. The Default Inference technique will be used to capture syncretisms.
5.1 Introduction
Arabic morphology is a system governed by a number of generalizations, depen-
dencies, and syncretisms that explain the overt aspects of regularity in the deriva-
tional and inflectional structure of Arabic. A generalization is a statement about the
facts of a language, which holds true in all cases or in nearly all cases (Trask 1993).
A dependency is a case in which a (dependent) form is derived from another
form. Finally, a syncretism is the case in which two or more morphemes that are
morphosyntactically distinct appear identical in form.
In the context of implementing Arabic morphology computationally, these gener-
alizations, dependencies, and syncretisms should be taken into consideration.
Capturing such generalizations, dependencies, and syncretisms in a computational
implementation of Arabic morphology can save that implementation from redun-
dancy and can make it more compact, which is a vital issue in linguistics and
computer science. This chapter introduces a computational approach to Arabic
verbal morphology in which these generalizations, dependencies, and syncretisms
are used to systematize Arabic in a concise manner. The approach is repre-
sented by an implementation written in the DATR lexical knowledge representation
language.
5.2 Linguistic Data
Verbs in Arabic are formed from roots consisting of three or four letters
(known as radical letters). From these roots, verbal stems are constructed using
a number of canonical forms known as measures. Measures are sequences of
consonants and vowels that represent word structure. They may also contain
stem derivational affixes. Each measure is normally associated with perfective
(active and passive), imperfective (active and passive), and imperative patterns,
which are used to form perfective, imperfective, and imperative verbal stems.
The perfective verbs indicate a completed act, while imperfective verbs denote
an unfinished act, which is just beginning or in progress. Measures that are
intransitive or express a state of being do not normally associate with passive
voice inflection. The stems formed using the above patterns are used to
construct verbs through prefixing and/or suffixing inflectional prefixes and/or
suffixes.
Verbs in Arabic are either triliteral (having three radical letters) or quadriliteral
(having four). The triliteral verbal stems are formed using fifteen verbal measures
while the quadriliteral verbal stems use four. Table 5.1 shows seven of these measures
with examples that demonstrate how verbal stems are constructed using them. The
stems are given in this table without inflectional prefixes and suffixes. For the
triliteral verbs, the root ksr (breaking) will be used as an example root, while the
root dHrj (rolling) will be used for the quadriliteral verbs. Roman numbers are used
for reference with the prefix Q denoting a quadriliteral verbal measure. It should be
noted that not all measures are applicable with every root. The roots ksr and dHrj,
for example, do not occur with some measures. For instance, *Âaksar (Measure IV)
is unacceptable in Standard Arabic while kasar (Measure I) is acceptable. It should
also be noted that some stems undergo certain additional phonological processes. In
addition, in this table, I will use CV arrays to represent the perfective, imperfective,
and imperative stems (patterns) of each measure.1

Table 5.1. The verbal measures

No   Measure    Active         Passive        Active Imperfective   Passive
                Perfective     Perfective     and Imperative        Imperfective
I    faςal      C1aC2aC3       C1uC2iC3       C1C2iC3               C1C2aC3
                kasar          kusir          ksir                  ksar
II   faςal      C1aC2aC3       C1uC2iC3       C1aC2iC3              C1aC2aC3
                kasar          kusir          kasir                 kasar
III  faAςal     C1aAC2aC3      C1uwC2iC3      C1aAC2iC3             C1aAC2aC3
                kaAsar         kuwsir         kaAsir                kaAsar
IV   Âafςal     ÂaC1C2aC3      ÂuC1C2iC3      ÂaC1C2iC3             ÂaC1C2aC3
                Âaksar         Âuksir         Âaksir                Âaksar
V    tafaςal    taC1aC2aC3     tuC1uC2iC3     taC1aC2aC3            taC1aC2aC3
                takasar        tukusir        takasar               takasar
QI   faςlal     C1aC2C3aC4     C1uC2C3iC4     C1aC2C3iC4            C1aC2C3aC4
                daHraj         duHrij         daHrij                daHraj
QII  tafaςlal   taC1aC2C3aC4   tuC1uC2C3iC4   taC1aC2C3aC4          taC1aC2C3aC4
                tadaHraj       tuduHrij       tadaHraj              tadaHraj
Measure I (the stem kasar) is considered the basic measure (stem) from which
the other 14 triliteral verbal stems are derived. The derivation is done through various
modifications of the meaning associated with that basic stem. Similarly, measure QI
(the stem daHraj) can be considered a basic form (stem) from which the remaining
three quadriliteral verbal stems are derived.
So far, we have considered the derivation of verbs. We turn now to verb inflection.
The inflection of verbs in Arabic is mainly achieved through the use of prefixes and
suffixes denoting person, number, and gender. These non-stem prefixes and suffixes
are affixed to the perfective, imperfective, and imperative stems, which are produced
using the verbal measures from roots as shown in Table 5.1. The reader should note
that further changes in stem structure, such as radical letter rejection, vowel rejection,
and radical letter transformation, are also applied in the inflection process when we
deal with some roots.
As mentioned above, Arabic verbs have perfective, imperfective, and imper-
ative stems. The perfective verbs indicate a completed act, while imperfective
verbs denote an unfinished act which is just commencing or in progress. Table 5.2
shows a partial inflection paradigm illustrating the perfective active, imper-
fective active, and imperative inflection for the second person masculine and
feminine.
Table 5.2. The perfective, imperfective, and imperative paradigm
Perfective Imperfective Imperative
2nd Person Singular
Mas kasar-ta ta-ksir-u ksir2
Fem kasar-ti ta-ksir-iyn ksir-iy
2nd Person Dual
Mas kasar-tumaA ta-ksir-aAn ksir-aA
Fem kasar-tumaA ta-ksir-aAn ksir-aA
2nd Person Plural
Mas kasar-tum ta-ksir-uwn ksir-uw
Fem kasar-tuna ta-ksir-na ksir-na
1 So a measure can actually produce up to four different stems.
2 The imperfective stem ksir has the surface form ˇAiksir with the epenthetic ˇAi prefixed to it.
There are a number of generalizations that can be captured in relation to Arabic
verbal inflections. These generalizations are:
(1) Generalizations about Verbal Inflectional Paradigms:
a) Perfective active and passive inflectional paradigms are constructed by
suffixing a perfective inflectional suffix (like: {-tu}) to a perfective stem (like
kasar). There are no perfective inflectional prefixes.
b) Imperfective active and passive inflectional paradigms are constructed by
prefixing an imperfective inflectional prefix (like:{ya-}) to an imperfective
stem (like: staksir) and suffixing an imperfective inflectional suffix (like {-u}).
c) Imperative inflectional paradigms are constructed by suffixing an imperative
inflectional suffix (like: {-aA}) to an imperative stem (like: staksir). There
are no imperative inflectional prefixes.
d) The imperative stems are the same as the imperfective active stems.
e) The imperfective active prefixes end with the vowel ‘a.’ The imperfective
passive prefixes are the same as the imperfective active prefixes except that
the prefix vowel becomes ‘u.’
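Generalizations (1d) and (1e) lend themselves naturally to default inheritance. The following is a minimal DATR sketch of our own (the node and path names are hypothetical, not those of the actual implementation described in this chapter):

Verbal:
<stem imperative> == "<stem imperfective active>"
<prefix imperfective active> == "<agr cons>" a
<prefix imperfective passive> == "<agr cons>" u.

SecondMas:
<> == Verbal
<agr cons> == t.

% SecondMas:<prefix imperfective active> = t a.  (i.e. ta-)
% SecondMas:<prefix imperfective passive> = t u.  (i.e. tu-)

Here the imperative stem is referred to the imperfective active stem (1d), and the passive prefix differs from the active one only in its final vowel (1e).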
Besides the previous generalizations relating to the formation of verbal paradigms,
there is a sub-generalization related to the inflectional paradigms that correspond to
the verbal measures II, III, IV, and QI. The inflectional paradigms that correspond to
these measures are constructed in the same way that the verbal inflectional paradigms
mentioned in (1) are except that the imperfective active prefix will end with the vowel
‘u’ instead of ‘a.’ In other words, the vowel ‘a’ of the imperfective active prefixes is
changed to ‘u’ when these prefixes are used with the imperfective active stems that
consist of exactly two heavy syllables. A heavy syllable is a syllable with a branching
nucleus (CVV) or a branching rime (CVC). These dual heavy syllable stems are
those imperfective active stems produced using the measures, II, III, IV, and QI, such
as the stem kasir (Measure II).
In addition, there are dependencies between verbal measures. There are, in fact,
overt dependencies between the basic verbal measure I and most of the derived
triliteral verbal measures. There are also overt dependencies between the basic
quadriliteral verbal measure QI and most of the derived quadriliteral verbal measures.
We can capture these dependencies by means of the following generalizations:
(2) Dependencies between Verbal Measures:
a) The perfective active and passive patterns (stems) of the triliteral derived
measures are constructed with the perfective active and passive patterns
(stems) of the basic triliteral measure I (C1aC2aC33 and C1uC2iC3) by
3 The final syllable vowel (a) of the Measure I perfective active pattern (C1aC2aC3) is changed
to ‘u’ or ‘i’ according to stem root (see Table 5.1). For example, the perfective active stems
formed from the roots Hsn and ςlm using Measure I are Hasun (C1aC2uC3) and ςalim
(C1aC2iC3), not *Hasan and *ςalam.
changing the pattern (stem) portion that precedes the final C2VC3 syllable of
these derived patterns (stems).
The imperfective active and passive patterns of the triliteral derived measures
are constructed with the imperfective active and passive patterns of the basic
triliteral measure I (C1C2iC34 and C1C2aC3) by changing the pattern portion
that precedes the final C2VC3 syllable of these derived patterns.
b) The perfective active and passive patterns of the quadriliteral derived measures
are derived from the perfective active and passive patterns of the basic
quadriliteral measure QI (C1aC2C3aC4 and C1uC2C3iC4) by changing the
pattern portion that precedes the final C3VC4 syllable of these derived patterns.
The imperfective active and passive patterns of the quadriliteral derived
measures are derived from the imperfective active and passive patterns of the
basic quadriliteral measure QI (C1aC2C3iC4 and C1aC2C3aC4) by changing
the pattern portion that precedes the final C3VC4 syllable of these derived
patterns.
The changes include the gemination of radical letters, the deletion of vowels,
and the insertion of affixes. For example, the Measure II perfective active
pattern C1aC2aC3 (such as kasar) and the Measure VII perfective active
pattern nC1aC2aC3 (such as nkasar) are derived from the perfective active
pattern of the basic triliteral measure I C1aC2aC3 (such as kasar), but at the
same time geminating the second radical letter in C1aC2aC3 (kasar) and
prefixing the derivational stem-prefix {n} in nC1aC2aC3 (nkasar). Notice
here that the derived patterns share with the basic pattern the same final
C2VC3 syllable (sar).
c) The ta-prefixed verbal measures (V, VI, and QII) also change the vowel of the
final CVC syllable of the basic Measure I/QI active-voice imperfective pattern
(C1C2iC3/C1aC2C3iC4) from ‘i’ to ‘a’ in the derived patterns. For example,
we can get takasar (Imperf Act, Measure V) and tadaHraj (Imperf Act,
Measure QII) but not *takasir or *tadaHrij.
The verbal inflectional paradigms also exhibit a number of syncretisms. To show
these syncretisms, let us consider the perfective and imperfective paradigms. First,
Table 5.3 provides the perfective paradigm formed from the root ksr using
Measure II. As the table shows, there are a number of syncretisms present in the
suffixes:
{-tu}, which indicates perfective first singular
{-naA}, which indicates perfective first dual or plural
{-tumaA}, which indicates perfective second dual
4 The final syllable vowel (i) of the Measure I imperfective active pattern (C1C2iC3) is changed to ‘u’ or ‘a’ according to the stem root (see Table 5.1). For example, the imperfective active stems formed from the roots ktb and ftH using Measure I are ktub (C1C2uC3) and ftaH (C1C2aC3), not *ktib and *ftiH.
Table 5.3. The perfective paradigm
Mas Fem
Perfective Active
Singular
1st kasar-tu kasar-tu
2nd kasar-ta kasar-ti
3rd kasar-a kasar-at
Dual
1st kasar-naA kasar-naA
2nd kasar-tumaA kasar-tumaA
3rd kasar-aA kasar-ataA
Plural
1st kasar-naA kasar-naA
2nd kasar-tum kasar-tuna
3rd kasar-uw kasar-na
Perfective Passive
Singular
1st kusir-tu kusir-tu
2nd kusir-ta kusir-ti
3rd kusir-a kusir-at
Dual
1st kusir-naA kusir-naA
2nd kusir-tumaA kusir-tumaA
3rd kusir-aA kusir-ataA
Plural
1st kusir-naA kusir-naA
2nd kusir-tum kusir-tuna
3rd kusir-uw kusir-na
There is also a partial syncretism expressed by the incorporated (connected) subject
pronoun ‘aA,’ which indicates the dual number. In this context, the incorporated
pronoun ‘aA’ is used as a complete suffix, as in {-aA}, and as part of suffixes,
as in {-ataA} and {-tumaA}. The reader will note that the morphosyntactic feature
lists, which correspond to the three suffixes, share the dual number. Another
partial syncretism is expressed by the incorporated subject pronoun ‘at,’ which
indicates the third person singular feminine and which appears as a complete suffix
({-at}) and as part of a suffix ({-ataA}).⁵ A third partial syncretism in the perfective
paradigm is expressed by ‘t,’ which indicates the second person in the suffixes
{-tumaA}, {-ta}, {-ti}, {-tuna}, and {-tum}. All these suffixes share the second
person.
5 Notice that ‘at’ constitutes a third person dual feminine suffix when it is used with the dual ‘aA’ in the suffix {-ataA}.
Table 5.4. The imperfective paradigm
Mas Fem
Imperfective Active
Singular
1st Âa-ksir-u Âa-ksir-u
2nd ta-ksir-u ta-ksir-iyn
3rd ya-ksir-u ta-ksir-u
Dual
1st na-ksir-u na-ksir-u
2nd ta-ksir-aAn ta-ksir-aAn
3rd ya-ksir-aAn ta-ksir-aAn
Plural
1st na-ksir-u na-ksir-u
2nd ta-ksir-uwn ta-ksir-na
3rd ya-ksir-uwn ya-ksir-na
Imperfective Passive
Singular
1st Âu-ksar-u Âu-ksar-u
2nd tu-ksar-u tu-ksar-iyn
3rd yu-ksar-u tu-ksar-u
Dual
1st nu-ksar-u nu-ksar-u
2nd tu-ksar-aAn tu-ksar-aAn
3rd yu-ksar-aAn tu-ksar-aAn
Plural
1st nu-ksar-u nu-ksar-u
2nd tu-ksar-uwn tu-ksar-na
3rd yu-ksar-uwn yu-ksar-na
Syncretisms in the imperfective paradigm can be seen by considering the
imperfective paradigm formed from the root ksr using Measure I, listed in Table 5.4.
As this table shows, there are a number of syncretisms present in the prefixes:
{ÂV-}, which indicates imperfective first singular
{nV-}, which indicates imperfective first person
{tV-}, which generally indicates imperfective second person
{yV-}, which indicates imperfective third person.
Additionally, there are syncretisms among the imperfective suffixes, which are
evident in the following:
{-u}, which corresponds to half the imperfective morphosyntactic feature lists
{-aAn}, which indicates imperfective dual
{-uwn}, which indicates imperfective masculine plural
{-na}, which indicates imperfective feminine plural
The above paradigm also shows that the incorporated subject pronouns ‘aA’ (dual)
and ‘uw’ (Mas Plural) have been used as part of the imperfective suffixes {-aAn}
and {-uwn}. Hence, they represent a kind of partial syncretism here.
Finally, we turn to the imperative paradigm. Table 5.5 lists the imperative inflectional
paradigm formed from the root ksr using Measure II.
Table 5.5. The imperative paradigm
         Mas        Fem
Singular
2nd      kasir      kasir-iy
Dual
2nd      kasir-aA   kasir-aA
Plural
2nd      kasir-uw   kasir-na
In this paradigm, we observe that there is a syncretism represented by the suffix
{-aA}, which indicates the imperative dual. We also observe, when we compare this
paradigm with the perfective paradigm, that there are syncretisms between the
imperative and perfective paradigms represented by the suffixes (incorporated
subject pronouns) {-uw} (Mas Plural), {-na} (Fem Plural), and {-aA} (Dual), which
correspond to the following morphosyntactic feature lists:
{-uw}: Imp 2nd Plural Mas, Perf 3rd Plural Mas
{-na}: Imp 2nd Plural Fem, Perf 3rd Plural Fem
{-aA}: Imp 2nd Dual Mas/Fem, Perf 3rd Dual Mas
When we compare the imperative paradigm with the imperfective paradigm, we
see that the incorporated subject pronouns ‘aA’ and ‘uw’ represent a kind of partial
syncretism if we consider the imperative suffixes {-aA} and {-uw} on the one hand
and the imperfective suffixes {-aAn} and {-uwn} on the other. The incorporated
subject pronouns ‘aA’ and ‘uw’ in the imperative and imperfective suffixes indicate
Dual and Mas Plural, respectively. We also notice that the imperative suffix
(incorporated subject pronoun) {-na} (Imp 2nd Plural Fem) represents a syncretism
with the imperfective suffix {-na} (Imperf 3rd Plural Fem). The two suffixes indicate
Plural Fem.
5.3 The Computational Approach
In a good computational approach to Arabic verbal morphology, the above general-
izations, dependencies, and syncretisms should be employed. Capturing such gener-
alizations, dependencies, and syncretisms in a computational implementation of
Arabic morphology can save the system from redundancy and make it more compact.
To show how such generalizations, dependencies, and syncretisms can be used, in
this chapter, I will introduce a computational approach to Arabic verbal morphology,
which is constructed with DATR.⁶ DATR is a lexical knowledge representation
language based on the inheritance technique.⁷ The approach will capture the above
generalizations, dependencies, and syncretisms of Arabic verbal morphology in a
compact, non-redundant manner. The approach adopts the multilinear formalization
of Arabic morphology introduced by McCarthy (1981, 1982), which organizes
Arabic verbal stem structure using multiple layers of representation known as tiers.
This approach has also been applied in Al-Najem (1998) to capture generalizations,
dependencies, and syncretisms in Arabic nominal morphology using DATR.
6 For more details about the computational approach introduced in this chapter and for details on other computational approaches to Arabic morphology, see Al-Najem (1998).
7 For an introduction to DATR, see Evans (1990) and Evans and Gazdar (1990, 1996).
The abovementioned generalizations are handled in the implementation by
encoding each generalization once, in generalization nodes higher in the
inheritance hierarchy (inheritance network). Then, using default inheritance, other
nodes lower in the inheritance hierarchy inherit, by default, information from these
generalization nodes and override, in some cases, some of the inherited default
information. Thus, nodes lower in the inheritance hierarchy will inherit the
generalization information, which has been encoded once higher in the hierarchy,
without re-encoding this information.
The dependencies between verbal measures will be systematized through stem
partitioning and through the multiple inheritance technique. In this context, a node
in this implementation that represents a derived (dependent) measure will inherit a
shared stem portion (the final CVC syllable) from the node representing its
corresponding original measure. The node representing the derived form defines the
remaining portion, which precedes the final CVC shared portion.
Syncretisms in inflectional paradigms are handled mainly using a default inference
technique. Using default inference, the implementation can infer, from
a single (usually underspecified) statement about an inflectional paradigm, more
statements about that paradigm. These statements are implicitly inferred, by default,
from the original single statement without re-encoding them explicitly.
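As an analogy only (this is Python, not DATR), the default-inference mechanism can be pictured as a longest-defined-prefix lookup over paths; the node contents below are our abridgement of the suffix data in Table 5.4:

# Analogy for DATR default inference: a query path falls back to the longest
# defined prefix, so one underspecified path stands in for many specific ones.
VERB_INFL_AFF = {
    ("mor", "suf", "imperf", "act"): "u",   # underspecified default suffix
    ("mor", "suf", "imperf", "act", "second", "sing", "fem"): "iyn",
}

def lookup(node, path):
    for i in range(len(path), -1, -1):      # try ever-shorter prefixes
        if path[:i] in node:
            return node[path[:i]]
    raise KeyError(path)

# <mor suf imperf act first sing mas> is inferred from the default path:
print(lookup(VERB_INFL_AFF, ("mor", "suf", "imperf", "act", "first", "sing", "mas")))   # u
print(lookup(VERB_INFL_AFF, ("mor", "suf", "imperf", "act", "second", "sing", "fem")))  # iyn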
The following is a demonstration of how generalizations are handled in the
implementation. The abovementioned generalizations about verb inflection will be
encoded in our implementation in the node Verb, which contains the following
information:

Verb:
<$vform syn cat> == verb
<$vform mor stem> == "<$vform p l>" "<$vform p r>"
<$vform mor stem imp> == "<$vform mor stem imperf act>"
<$vform mor> == "<$vform mor stem>" Verb_Infl_Aff:<mor suf>
<$vform mor imperf> == Verb_Infl_Aff:<mor pre imperf> Imperf_Pre_V:<a> "<$vform mor stem imperf>" Verb_Infl_Aff:<mor suf imperf>
<$vform mor imperf pass> == Verb_Infl_Aff:<mor pre imperf pass> Imperf_Pre_V:<u> "<$vform mor stem imperf pass>" Verb_Infl_Aff:<mor suf imperf pass>.
Notice here that the imperfective prefix, in this implementation, consists of an
imperfective prefix consonant inherited from the node Verb_Infl_Aff and an
imperfective prefix vowel inherited from the node Imperf_Pre_V. This allows
for the generalization that the imperfective active prefix vowel is normally ‘a’ but
is changed to ‘u’ when imperfective active prefixes are used with imperfective
active stems of Measures II, III, IV, and QI. In addition, as previously indicated,
the passive imperfective prefixes are the same as the active imperfective prefixes
except that the vowel ‘u’ is used instead of ‘a.’ The imperfective prefix begins with
a consonant (‘n’, ‘Â’, ‘y’, or ‘t’), which does not vary with the measure or the
voice (imperfective active/passive). The variation in the imperfective prefixes
affects only the prefix vowel (‘a’ or ‘u’), as we have already seen, according to the
measure and the voice. Thus, instead of explicitly encoding the prefixes with the
two vowel variants, which would yield two sets of prefixes, {na, Âa, ya, ta}
and {nu, Âu, yu, tu}, we have encoded only the consonants of the prefixes in the
node Verb_Infl_Aff; the vowel is chosen by inheritance from the node
Imperf_Pre_V.
The sixth statement of Verb partially overrides the fifth statement, declaring
that the imperfective passive inflectional paradigms are constructed by prefixing an
imperfective passive prefix ending with the vowel ‘u’ to an imperfective passive
stem and suffixing an imperfective suffix. The fifth and sixth statements, using
default inference, provide the imperfective active and passive paradigms. In other
words, they allow the generation of all 36 active and passive imperfective verb
forms.
The node Verb, which encodes generalizations about verbal paradigm formation,
occurs high in the inheritance hierarchy. It will be inherited, by default, several times
by other nodes lower in the inheritance hierarchy without re-encoding the
generalizations of Verb again in those nodes. Default inheritance permits us to encode
these linguistic generalizations in a compact, non-redundant manner. Note also that
the use of default inference in the last three statements of Verb is crucial to allow
the generation of 78 verb forms, that is, virtually all the Arabic morphology verb
forms.
The lower nodes inheriting from Verb represent measures, and these nodes
produce the inflectional stems⁸ that correspond to each measure from radical letters
inherited globally from root nodes. These stems will be used in the formation of
the inflectional paradigms.
8 Examples of these inflectional stems are the imperfective active and imperfective passive stems.
A difficulty could arise here. The node Verb states that
the (default) way of constructing the imperfective active paradigm is by prefixing an
imperfective active prefix ending with the vowel ‘a’ to an imperfective active stem
and suffixing an imperfective suffix. This holds for most of the measures. However,
some measures construct the imperfective active paradigm by prefixing an
imperfective active prefix ending with the vowel ‘u’ instead of ‘a.’ These are Measures
II, III, IV, and QI, as previously stated. The use of the vowel ‘u’ in imperfective
active prefixes represents a sub-generalization (sub-regularity) emerging from the
generalizations (regularities) of the node Verb, which encoded the generalizations
previously stated in (1). So, we need a way to define this sub-generalization in a node
that can be inherited by the specific nodes representing Measures II, III, IV, and QI. This
is achieved by defining an intermediate node⁹ (Verb2), which inherits, by default,
all the information of the default (regular) node Verb but overrides the inherited
statement that encodes the imperfective active paradigm by changing the original
prefix vowel ‘a’ to ‘u.’
9 The use of intermediate nodes is a common way adopted in the DATR literature to capture sub-regularities. See Corbett and Fraser (1993) for example.
Verb2:
<> == Verb
<$vform mor imperf> == Verb_Infl_Aff:<mor pre imperf>
Imperf_Pre_V:<u> "<$vform mor stem imperf>"
Verb_Infl_Aff:<mor suf imperf>.
Thus, while the node Verb represents the default way of constructing the verbal
paradigms, Verb2 represents the sub-generalization (sub-regularity) that is related
to the vowel of the imperfective active prefixes used with the imperfective active
stems of Measures II, III, IV, and QI.
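The division of labour between Verb and Verb2 can be mimicked in ordinary object-oriented terms (an analogy we add here, not the chapter's DATR):

# Object-oriented analogy for Verb and Verb2: the subclass inherits everything
# and overrides only the imperfective active prefix vowel.
class Verb:
    imperf_act_prefix_vowel = "a"
    def imperf_act(self, prefix_cons, stem, suffix):
        return prefix_cons + self.imperf_act_prefix_vowel + stem + suffix

class Verb2(Verb):                 # Measures II, III, IV, and QI
    imperf_act_prefix_vowel = "u"

print(Verb().imperf_act("y", "ksir", "u"))    # yaksiru  (Measure I)
print(Verb2().imperf_act("y", "kasir", "u"))  # yukasiru (Measure II)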
So far, we have considered how generalizations about verb formation can
be encoded in our implementation. Now, we turn to generalizations relating to the
dependencies between verbal measures. Recall in this context, as stated in (2), that
the patterns of most of the triliteral verbal measures are normally derived from
the corresponding patterns of the triliteral verbal measure I by means of changing
the pattern portion that precedes the final C2VC3syllable of these derived patterns.
In addition, most of the patterns of the verbal quadriliteral measures are derived
from the corresponding patterns of the verbal quadriliteral measure QI by changing
the pattern portion that precedes the final C3VC4 syllable of these derived patterns.
This means that to capture the dependencies between the verbal measures, we need
to split the verbal stem (pattern) into two parts: the final CVC syllable, which
is normally invariably shared between the stems (patterns) of the derived and the
original measures, and a remainder, which is the part of the stem (pattern) preceding
the final CVC syllable.¹⁰ To achieve this, the verbal stem in our implementation will
consist of two parts: a right part that represents the final CVC syllable of the stem
and a left part that represents the remainder of the stem preceding the final CVC
syllable.
10 This remainder is not always a single complete free-standing syllable. It may become part of a syllabic structure through the insertion of the epenthetic ‘ÂV,’ as in (Âi)nC1aC2aC3 (Measure VII, Perf Act), which yields the syllabic structure CVCCV. It may also become part of a syllabic structure by the prefixation of an inflectional prefix like {ya-}, as in (ya)C1C2iC3 (Measure I, Imperf Act), which yields the syllabic structure CVC.
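A minimal Python sketch of this stem partitioning (our illustration; the helper names are invented) shows how derived measures redefine only the left part while reusing the shared right part:

# Stem = left part + shared right part (the final CVC syllable).
def right_part(c2, c3, vowel):               # e.g. "sar" for ksr, perf act
    return c2 + vowel + c3

LEFT_PARTS = {                               # perfective active left parts
    "faςal":  lambda c1: c1 + "a",           # Measure I:    ka-
    "faAςal": lambda c1: c1 + "aA",          # Measure III:  kaA-
    "ftaςal": lambda c1: c1 + "t" + "a",     # Measure VIII: kta-
}

c1, c2, c3 = "k", "s", "r"
for measure, left in LEFT_PARTS.items():
    print(measure, "->", left(c1) + right_part(c2, c3, "a"))
# faςal -> kasar, faAςal -> kaAsar, ftaςal -> ktasar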
To demonstrate how the dependencies between verbal measures are captured in
our implementation, we take the following example, involving the dependencies
between Measures III (faAςal) and VIII (ftaςal) on the one hand and the basic verbal
Measure I (faςal) on the other hand. The following is a definition of the node Faςal,
which represents the basic verbal Measure I:
% The CV-patterns of the Measure I (faςal)
Faςal:
<> == Verb
<faςal p l> == C:<1> V:<>
<faςal p l imperf> == C:<1>
<faςal p r> == C:<2> V:<*> C:<3>.
The node V below encodes the perfective and imperfective vocalisms, and the node
C associates the consonants of a root, inherited globally from the nodes representing
roots, with the C slots of the pattern tier:

%The Vocalism Tier
V:
<perf act> == Perf_Act:<>
<perf pass> == Perf_Pass:<>
<*perf act> == <perf act *>
<*perf pass> == <perf pass *>
<imperf act> == Imperf_Act:<>
<imperf pass> == Imperf_Pass:<>
<*imperf act> == <imperf act *>

% "<* ...>" represents a terminal non-default vowel

%Vocalism Generalizations
Perf_Act:
<> == a.

Perf_Pass:
<> == u
<*> == i.

Imperf_Act:
<> == a
<*> == i.

Imperf_Pass:
<> == a.

% The association of the root tier elements (radical
% letters) to the C slots of the pattern tier
# vars $n: 1 2 3 4.

C:
<$n> == "<$n>".
The node Faςal encodes the right parts (final CVC syllables) and the left parts,
which correspond to the perfective and imperfective stems (patterns) of Measure I.
Notice also that we have not encoded the right and left parts of the imperative stems,
since the imperative and imperfective active stems are the same, as indicated by the
statement (generalization):
<$vform mor stem imp> == "<$vform mor stem imperf act>"
of the node Verb, which is inherited, by default, by the current node. This gener-
alization about imperative stems has been encoded once in Verb higher in the
inheritance hierarchy, and it is then inherited by the measure nodes lower in the
hierarchy. The right and left parts of the perfective and imperfective stems are
combined together to form complete perfective and imperfective stems by one
statement encoded once in the node Verb higher in the inheritance hierarchy,
which is inherited, by default, by Faςal and the measure nodes lower in the
hierarchy:
<$vform mor stem> == "<$vform p l>" "<$vform p r>"
This line states that a verbal stem consists of a left part followed by a right part.
Default inference will fill the gaps, allowing the formation of all the perfective and
imperfective stems. For instance, this single, short, and general line will (implicitly)
stand for longer more specific lines like the following, which constructs a perfective
active stem:
<$vform mor stem perf act> == "<$vform p l perf act>"
"<$vform p r perf act>"
Then, the node FaAςal inherits the right parts (final CVC syllables) of Faςal, but
it defines its left parts in a different way:

% The CV-patterns of the Measure III (faAςal)
FaAςal:
<> == Verb2
<faAςal p r> == Faςal:<faςal p r>
<faAςal p l> == C:<1> V:<> V:<>.
The underspecified line:
<faAςal p l> == C:<1> V:<> V:<>
states, using default inference, that the (default) left part of all the perfective and
imperfective stems (patterns) of FaAςal is constructed as the first radical letter of
a root followed by a long vowel. The vowel is inherited from the vocalism node V.
In addition, the node FaAςal inherits, by default, the generalizations encoded in
Verb2.¹¹ The reader will therefore notice that, in this node, we adopt multiple
inheritance by allowing the node FaAςal to inherit from two nodes, Verb2 and
Faςal.
11 As previously seen, the node Verb2 is a sub-generalization intermediate node, which inherits, by default, the information of the generalization node Verb.
Thus, in the node FaAςal, we have captured the dependency between Measures
III (faAςal) and I (faςal) using multiple inheritance. We inherited the
generalizations about verbs from Verb2 and inherited the invariant right parts (final
CVC syllables) that are shared with Measure I from the node Faςal. Measure
III (faAςal) is derived from I (faςal) except that the left parts of the stems of
faςal are changed to C1aA (faA) (and C1uw in the perfective passive). In other
words, the dependency generally involves the lengthening of the first vowel. Notice
here that since we have encoded verb generalizations once in the higher node
Verb, from which the intermediate node Verb2 inherits, we do not need to re-
encode these generalizations again in the node FaAςal. We have simply inherited
these generalizations, by default, from Verb2. Notice also that we have not re-
encoded the right parts (the final CVC syllables) of the stems of FaAςal since
we encoded these parts once in Faςal and passed them on from this node.
Defining the left parts of FaAςal is accomplished in a compact non-redundant
manner by using default inference through underspecifying the left part using the
statement:
<faAςal p l> == C:<1> V:<> V:<>
This means that this statement, using default inference, will represent all the left
parts of all the perfective and imperfective stems of FaAςal. Thus, the dependency
between Measures III (faAςal) and I (faςal) has been captured by means of multiple
inheritance in a compact, non-redundant manner, without the need to repeat encoding
information which has previously been encoded in higher nodes.
In a similar fashion, the dependency between Measure VIII (ftaςal) and I (faςal)
can be captured:

% The CV-patterns of the Measure VIII (ftaςal)
Ftaςal:
<> == Verb
<ftaςal p r> == Faςal:<faςal p r>
<ftaςal p l> == C:<1> Affix:<t> V:<>.
This node illustrates that the stems of Measure VIII (ftaςal) are derived from the
stems of Measure I (faςal) by inserting the infix {t} after the first radical letter in the
left part of the stems of ftaςal. The node Affix defines derivational affixes used in
the CV-pattern tier and is defined below:
Affix:
<t> == t
<n> == n
<st> == s t
<Â> == Â
<w> == w.
The affixation of derivational affixes is indicated in the pattern tier by inheritance
descriptors like Affix:<t>, which indicates (inherits) the reflexive infix {t}.
The other dependencies between verbal measures are captured using a similar
method to that used with the dependencies between the previous verbal measures.
The inheritance hierarchy in Fig. 5.1 demonstrates how the dependencies between
verbal measures are captured in our implementation.
[Fig. 5.1. An inheritance hierarchy showing dependencies between verbal measures]
Note that some derived measures inherit from multiple nodes (multiple
inheritance). Multiple inheritance is indicated in the diagram by a derived measure
node that inherits from more than one node (receiving more than one IN arrow).
So far, we have considered how generalizations relating to verb formation and
generalizations about the dependencies between verbal measures are captured in our
implementation. We will now discuss syncretisms in inflectional paradigms and how
these are encoded in our DATR implementation.
The verbal inflectional paradigms are handled in our implementation by means of
the node Verb_Infl_Aff. In this node, we define paths representing morphosyn-
tactic feature lists and assign these paths values representing inflectional affixes
(prefixes and suffixes) that correspond to the paths’ morphosyntactic feature lists,
as we will see shortly. The inflectional paradigms themselves are built through the
generalization nodes Verb and Verb2 using statements like the following state-
ments taken from the definition of the node Verb:
<$vform mor> == "<$vform mor stem>" Verb_Infl_Aff:<mor suf>
<$vform mor imperf> == Verb_Infl_Aff:<mor pre imperf> Imperf_Pre_V:<a> "<$vform mor stem imperf>" Verb_Infl_Aff:<mor suf imperf>
<$vform mor imperf pass> == Verb_Infl_Aff:<mor pre imperf pass> Imperf_Pre_V:<u> "<$vform mor stem imperf pass>" Verb_Infl_Aff:<mor suf imperf pass>
These three statements, by inheriting the affixes from Verb_Infl_Aff and
especially through default inference, allow the construction of the whole verbal
paradigm of Arabic morphology in a compact, non-redundant manner, as we have
already seen from consideration of the node Verb.
In our implementation, syncretisms in paradigms are handled mainly using default
inference. Default inference represents a useful means of expressing generalizations
and encoding repeated information in a compact non-redundant manner. It saves us
from re-encoding information which can be inferred, by default, from one (usually
underspecified) path. Consider in this context the statement:
<mor suf imperf act> == u
of the node Verb_Infl_Aff, which is the node where inflectional verbal affixes
of the inflectional paradigms were encoded. From this single statement, which is
underspecified for person, number, and gender, we can infer that, roughly, the default
suffix of the imperfective active paradigm is {-u}. This is because the suffix {-u} is
associated with half of the morphosyntactic feature lists of the imperfective active
paradigm. Then, we partially override this statement to encode the other imperfective
active suffixes, which are associated with the remaining morphosyntactic feature
lists of the imperfective active paradigm. From the previous single path (<mor suf
imperf act>), we can infer, by default, other more specific paths, which include
the following:
<mor suf imperf act first sing mas>
<mor suf imperf act first dual mas>
<mor suf imperf act first plural mas>
These paths are implicitly inferred, by default, from the previous single underspec-
ified path without re-encoding them explicitly. This is a better method than others
like local inheritance from (referral to) a specific local path, as in the following:
<mor suf imperf act first sing mas> == u
<mor suf imperf act first dual mas> == <mor suf imperf act first sing mas>
<mor suf imperf act first plural mas> == <mor suf imperf act first sing mas>

which is not as compact as default inference.
Thus, the previous example clearly shows how default inference represents a
useful means to capture syncretisms in paradigms in a compact non-redundant
manner. Using a single short (usually underspecified) path, we can infer, by default,
multiple longer more specific paths without re-encoding these inferred paths and
their values explicitly.
As previously noted, there are some partial syncretisms in the verbal inflec-
tional paradigms, which can be captured in the implementation. Partial syncretisms
mainly concern incorporated subject pronouns, which materialize as part of suffixes
or as complete suffixes. To handle partial syncretisms in our implementation, we
encoded (generalizations about) such incorporated pronouns once in a node called
Part_Sync:
Part_Sync:
<dual> == a A
<plural mas> == u w
<plural fem> == n a
<sing fem> == a t
<second> == t.
Using this node, such incorporated pronouns will not be re-encoded explicitly in
the node Verb_Infl_Aff, which encodes the syncretism in verbal inflectional
paradigms. Instead, we refer to the node Part_Sync to inherit those incorporated
pronouns, which represent partial syncretisms. An example that shows this is the
following statement from the node Verb_Infl_Aff:¹²

<mor suf imperf act $2nd_3rd $dual_plural> == Part_Sync:<$dual_plural> n

12 The variables $2nd_3rd and $dual_plural have the following definitions, respectively:
# vars $2nd_3rd: second third.
# vars $dual_plural: dual plural.
Through this statement, the incorporated pronouns ‘aA’ (Dual) and ‘uw’ (Plural
Mas) are inherited from Part_Sync. The incorporated pronouns ‘aA’ and ‘uw’ are
parts of the suffixes {-aAn} and {-uwn} associated with the morphosyntactic feature
lists:
{-aAn}:
Imperf Act 2nd Dual Mas, Imperf Act 3rd Dual Mas, Imperf Act 2nd Dual Fem, and
Imperf Act 3rd Dual Fem.
{-uwn}:
Imperf Act 2nd Plural Mas, and Imperf Act 3rd Plural Mas.
The previous statement encodes this using the node Part_Sync and default
inference. This statement is partially overridden to encode the other suffixes which
are associated with the imperfective active second singular feminine in addition to the
imperfective active second and third plural feminine. At the same time, the
incorporated pronouns ‘aA’ and ‘uw’ (and other incorporated pronouns) also form complete
suffixes, as can be seen in the following statement from the node Verb_Infl_Aff:

<mor suf perf act third> == Part_Sync:<>
Using this single statement, the incorporated pronouns ‘aA’ and ‘uw’, in addition
to other incorporated pronouns, will form complete perfective active third person
suffixes corresponding to morphosyntactic feature lists, including Perf Act 3rd Dual
Mas and Perf Act 3rd Plural Mas. The single underspecified path of that statement
will stand for paths such as:

<mor suf perf act third dual mas> == Part_Sync:<dual mas>
<mor suf perf act third plural mas> == Part_Sync:<plural mas>

in addition to others, using default inference.
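The effect of Part_Sync can be paraphrased in a few lines of Python (an illustration we add; the dictionary and function names stand in for the DATR node and paths):

# Incorporated subject pronouns encoded once, reused both as whole suffixes
# and as parts of suffixes.
PART_SYNC = {"dual": "aA", "plural mas": "uw", "plural fem": "na",
             "sing fem": "at", "second": "t"}

def imperf_act_suffix(number_gender):        # {-aAn}, {-uwn}: pronoun + n
    return PART_SYNC[number_gender] + "n"

def perf_act_third_suffix(number_gender):    # {-aA}, {-uw}, {-na}: pronoun alone
    return PART_SYNC.get(number_gender) or PART_SYNC[number_gender.split()[0]]

print(imperf_act_suffix("dual"))             # aAn
print(perf_act_third_suffix("plural mas"))   # uw
print(perf_act_third_suffix("dual mas"))     # aA (falls back to the dual entry)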
In addition to the above nodes, which define how verbal stems and inflectional
paradigms are constructed, the implementation contains other nodes known as the
lexical nodes. The lexical nodes are those representing roots and defining the radical
letters of these roots. The radical letters will be inherited, using global inheritance, by
other nodes in the implementation representing verbal measures to form verbs. The
lexical nodes also inherit (select) verbal measures, which can be used to form verbs
using the roots represented by those lexical nodes. These nodes are the query nodes
of our implementation, and they represent the initial (global) context of inheritance.
The queries that will be entered into the implementation will be queries about these
lexical nodes. An example of the lexical nodes is the node KSR, which represents the
root ksr.
KSR:
<c1> == k
<c2> == s
<c3> == r
% Verb forms
<faςal> == Faςal
<nfaςal> == Nfaςal
<tafaςal> == Tafaςal
So, using a node like this one and through the use of the other abovementioned nodes,
we can enter queries like the following and obtain their corresponding results:
|? datr_theorem ('KSR', [faςal, mor, imperf, act, third, dual, mas]).
yaksiraAn
|? datr_theorem ('KSR', [faςal, mor, perf, act, first, sing, mas]).
kasartu
|? datr_theorem ('KSR', [tafaςal, mor, perf, act, third, sing, mas]).
takasara
|? datr_theorem ('KSR', [faςal, mor, perf, pass, third, sing, mas]).
kusira
|? datr_theorem ('KSR', [faςal, mor, imp, second, sing, fem]).
kasiriy
DATR was originally designed for generation (synthesis) only, not for recognition
(analysis). Information is obtained from DATR theories via queries that cause DATR
to generate results in forms such as theorem dumps.
5.4 Conclusion
In this chapter, we have computationally systematized generalizations, dependencies,
and syncretisms existing in Arabic verbal morphology in a compact, non-redundant
manner. Generalizations have been handled through the use of DATR default inher-
itance, as seen in the node Verb. Dependencies have been handled through the use
of DATR multiple inheritance, as seen in nodes like Faςal and FaAςal. Finally,
syncretisms have been handled through the use of DATR default inference, as seen
in the node Verb_Infl_Aff. As has been mentioned earlier, capturing such
generalizations, dependencies, and syncretisms in a computational implementation
of Arabic morphology can save that implementation from redundancy and can make
it more compact.
References
Al-Najem, Salah. “Computational Approaches to Arabic Morphology.” Diss. Essex University, 1998.
Corbett, Greville and Norman Fraser. “Network Morphology: A DATR Account of Russian Nominal Inflection.” Journal of Linguistics 29 (1993): 113–142.
Evans, Roger. “An Introduction to the Sussex PROLOG DATR System.” The DATR Papers. Eds. Roger Evans and Gerald Gazdar. Brighton: University of Sussex, 1990. 63–70.
Evans, Roger and Gerald Gazdar, eds. The DATR Papers. Vol. 1. Brighton: University of Sussex, 1990.
Evans, Roger and Gerald Gazdar. “DATR: A Language for Lexical Knowledge Representation.” Computational Linguistics 22.2 (1996): 167–216.
McCarthy, John. “A Prosodic Theory of Nonconcatenative Morphology.” Linguistic Inquiry 12 (1981): 373–418.
McCarthy, John. “Formal Problems in Semitic Phonology and Morphology.” Diss. Indiana University, 1982.
Trask, R. L. A Dictionary of Grammatical Terms in Linguistics. London: Routledge, 1993.
6
Arabic Computational Morphology: A Trade-off
Between Multiple Operations and Multiple Stems
Violetta Cavalli-Sforza1 and Abdelhadi Soudi2
1 Language Technologies Institute, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15217, U.S.A. (violetta@cs.cmu.edu)
2 Ecole Nationale de l’Industrie Minerale, Avenue Hadj Ahmed Cherkaoui, B.P. 753, Agdal, Rabat, Morocco. (asoudi@gmail.com)
Abstract: We present a computational approach to Arabic morphology description that draws from
Lexeme-Based Morphology (Aronoff, 1994; Beard, 1995), giving priority to stems and
granting a subordinate status to inflectional prefixes and suffixes. Although the morphology
of Arabic is non-concatenative, we make the process of generating inflected forms concate-
native by separating the generation of stems from that of other inflectional affixes. Our
approach is implemented in an extension of the MORPHĒ tool (Leavitt, 1994), which has
been enhanced in order to provide a representational formalism that embodies Lexeme-
Based Morphology theory and minimizes the number of rules required for the description
of Arabic morphology.
6.1 Introduction
The morphology of Arabic poses special problems for computational natural
language processing systems. The exceptional degree of ambiguity in the writing
system, the rich morphology, and the unique word formation process of roots
and patterns all contribute to making computational approaches to Arabic very
challenging. The non-concatenative morphology of Arabic has spurred the
development of sophisticated formalisms and computational engines, and has also
produced brute-force approaches. Among the more elegant formalisms, the finite-state
models proposed by (Koskenniemi, 1983), (Kay, 1987), (Beesley, 1996, 1998) and
(Kiraz, 1994, 1998, 2000), based on the autosegmental approach of (McCarthy,
1979, 1981), distinguish themselves for their effective computational implemen-
tation, but have essentially granted equal status to all the constituents of an
Arabic word (the root, the pattern and the vocalism) by placing them in separate
lexicons. Our approach differs from these in giving priority to stems in accordance
with the theory of Lexeme-Based Morphology (Aronoff, 1994; Beard, 1995) and
not treating inflectional prefixes and suffixes as fully-fledged dictionary entries.
Using the stem as a basis for morphological transformations we linearize, i.e.,
make concatenative, the process of morphology generation, by separating the
generation of the correct stem from the generation of inflectional prefixes and
suffixes. Based on linguistic evidence, we place regularly predictable morphological
transformations in rules and non-predictable lexeme-specific information in the
lexicon.
We describe the implementation of our approach in the context of MORPHĒ, a
morphological description tool (Leavitt, 1994), which we have enhanced with the
objective of providing a representational formalism that embodies Lexeme-Based
Morphology theory and minimizes the number of inflectional rules required for the
description of Arabic morphology. Ongoing enhancements to the tool, primarily
the introduction of inheritance, reduce the representation of the two-step linearization
model to a single step, making the morphological description more elegant and
increasing system performance.
6.2 Lexeme-Based Morphology
In this section, we briefly outline the premises of the theory of Lexeme-Based
Morphology (LBM) (Aronoff, 1994; Beard, 1995) on which our computational
model is based. LBM comprises several claims about the nature of morphology and
its relation to syntax. The most central claim is the collection of theses that fall under
the rubric of the Separation Hypothesis. A standard morphological analysis would
claim that a word such as ‘cats’ consists of two morphemes: the base ‘cat’ and the
plural morpheme ‘-s’. The base and the plural morpheme would have the following
representations:
(i) cat: <[+ Noun], /kaet/, CAT>, where the first entry represents a set of grammatical
features, the second a phonological representation and the third a set of semantic
features.
(ii) –s: <[+ Noun], +PL, /z/, PL>
The basic thesis is that all morphemes, be they lexemes such as [cat] or grammatical
morphemes such as the plural morpheme, are lexical entries consisting of
grammatical, phonological and semantic information, that is, minimal grammatical
units of analysis.
In a lexeme-based model, however, only lexemes (vocabulary items belonging to
the major lexical categories of verb, noun, adjective and adverb) and free morphemes
(e.g. detached pronouns) are minimal grammatical elements. Inflectional or deriva-
tional morphemes – suffixes, prefixes, infixes and reduplication – are not themselves
grammatical elements. Instead, these are merely the phonological expression of
operations that apply to basic grammatical elements. For example, given the forms
‘cat’ and ‘cats’, we would say that there is a lexeme [cat] that has two word forms
‘cat’ and ‘cats’ and that the description of the singular/plural of [cat] is a grammatical
word. It follows that affixes and other grammatical morphemes differ from major
class lexical stems. The proposed model of Arabic morphology is compatible
with this conclusion in that it does not include affixes and lexemes in the same
component.
6.3 Arabic Verbal Morphology
The most puzzling problem in the study of Arabic is its verbal system, which is
very rich in forms. Arabic verbs are based on three or four radicals (the letters that
constitute the skeleton of the verb). An Arabic verb can be conjugated according
to one of the traditionally recognized Forms or patterns. There are 15 triliteral
patterns (i.e., with 3 radicals), of which at least 9 are in common use, and 4 less
common quadriliteral patterns (i.e., with 4 radicals), some of which are quite rare. Within
each pattern, an entire paradigm is found: two aspects/tenses (perfect and imperfect),
two voices (active and passive) and five moods (indicative, subjunctive, jussive,
imperative and energetic). In addition to prefixation and suffixation, inflectional
and derivational processes may cause stems to undergo infixational modification
in the presence of different syntactic features as well as certain consonants. Verb
roots in Arabic can be classified as shown in Fig. 6.1. A primary distinction is
made between weak and strong verbs. Weak verbs have a weak consonant (‘w’ or
‘y’) as one of their radicals. A handful of doubly weak verbs contain two weak
radicals.
In this section, we address two main issues in Arabic verbal morphology. The first
relates to weak (suppletive) verbs, which undergo stem changes.
[Fig. 6.1. Classification of Arabic triliteral verbal roots and mood/tense system: triliteral roots divide into strong (regular, hamzated, doubled radical) and weak (weak initial radical (assimilated), weak middle radical (hollow), weak final radical (defective)); tenses divide into preterit (perfect) and present (imperfect), with indicative, subjunctive, jussive, imperative, and energetic moods, active and passive voice, and the participle.]
Weak consonants
are quite common in triliteral verb patterns, but they do not always cause
morphological irregularities. They are seldom present in quadriliteral patterns except
in the second radical position, where they do not give rise to morphological
irregularities (Badawi et al., 2004). The second issue relates to apophony attested in
Arabic Form 1 verbs. We conclude by describing our generation model and how
Arabic verbal morphology is implemented in an enhanced version of MORPHĒ
(Leavitt, 1994; Cavalli-Sforza & Soudi, 2003).
6.3.1 Weak Verbs
We focus on triliteral hollow verbs, but a comparable analysis is applicable to other
types of weak verbs as well. In both perfect and imperfect tenses, depending on the
person, number and gender, hollow verbs are realized by two stems: one long, with
a weak middle radical (glide), and one short, where the glide disappears and there is
a vowel change.
Hollow Verb Classes:
1. Verbs of the pattern CawaC have the perfective stem patterns CuC and CaAC
and the imperfective stem patterns CuC and CuwC. For example, zaAr (from
[zawar]) “to visit” has the perfective stems zur and zaAr and the imperfective
stems zur and zuwr. E.g.:
PERFECT: zurtu “I visited” and zaArat “she visited”
IMPERFECT: yazurna “they (fem.) visit” and yazuwru “he visits”
2. Verbs of the pattern CawiC have the perfective stem patterns CiC and CaAC
and the imperfective stem patterns CaC and CaAC. For example, naAm (from
[nawim]) “to sleep” has the perfective stems nim and naAm and the imperfective
stems nam and naAm. E.g.:
PERFECT: nimtu “I slept” and naAmat “she slept”
IMPERFECT: yanamna “they (fem.) sleep” and yanaAmu “he sleeps”
3. Verbs of the pattern CayaC have the perfective stem patterns CiC and CaAC and
the imperfective stem patterns CiC and CiyC. For example, baAς (from [bayaς])
“to sell” has the perfective stems biς and baAς and the imperfective stems biς and
biyς. E.g.:
PERFECT: biςtu “I sold” and baAςat “she sold”
IMPERFECT: yabiςna “they (fem.) sell” and yabiyςu “he sells”
4. Verbs of the pattern CayiC have the perfective stem patterns CiC and CaAC and
the imperfective stem patterns CaC and CaAC. For example, haAb (from [hayib])
“to fear” has the perfective stems hib and haAb and the imperfective stems hab
and haAb. E.g.:
PERFECT: hibtu “I feared” and haAbat “she feared”
IMPERFECT: yahabna “they (fem.) fear” and yahaAbu “he fears”
In the generation of the perfective and imperfective, classes 2 and 4 can be
merged because they have the same perfect and imperfect stems. A different grouping
is necessary for the passive participle: verbs of classes 1 and 2 behave similarly, as
do verbs of classes 3 and 4. E.g.:
Class 1: [zawar] → mazuwr “visited” and Class 2: [nawil] → manuwl “obtained”
Class 3: [bayaς] → mabiyς “sold” and Class 4: [hayib] → mahiyb “feared”.
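The four classes can be summarized in a short Python sketch (our illustration, not the chapter's implementation; the class table simply restates the patterns above):

# Derive the short and long stems of a hollow verb from its underlying
# [CVwVC]/[CVyVC] form, per the four classes in Section 6.3.1.
import re

# (perfective short vowel, imperfective short vowel, imperfective long nucleus)
HOLLOW_CLASSES = {
    "awa": ("u", "u", "uw"),  # class 1: zawar -> zur/zaAr, zur/zuwr
    "awi": ("i", "a", "aA"),  # class 2: nawim -> nim/naAm, nam/naAm
    "aya": ("i", "i", "iy"),  # class 3: bayaς -> biς/baAς, biς/biyς
    "ayi": ("i", "a", "aA"),  # class 4: hayib -> hib/haAb, hab/haAb
}

def hollow_stems(lexeme: str):
    """Return the (short, long) perfective and imperfective stem pairs."""
    m = re.match(r"^(.)(awa|awi|aya|ayi)(.)$", lexeme)
    if not m:
        raise ValueError("not a hollow-verb lexeme")
    c1, cls, c3 = m.groups()
    perf_v, imperf_v, imperf_long = HOLLOW_CLASSES[cls]
    return {"perf":   (c1 + perf_v + c3, c1 + "aA" + c3),
            "imperf": (c1 + imperf_v + c3, c1 + imperf_long + c3)}

print(hollow_stems("zawar"))  # {'perf': ('zur', 'zaAr'), 'imperf': ('zur', 'zuwr')}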
For details on the origin of the glides and the second vowel in the examples above,
see (Soudi et al., 2001, 2002).
To deal with weak verbs, one might list separate stems in the lexicon or one
might derive the stems by rules. The latter solution is preferable because suppletion
operates at the stem level and involves a single root consonant. It also permits
capturing generalizations by providing a unified treatment of strong and weak verbs.
Table 6.1 shows that there are cases of syncretism (opportunities for general-
ization) in the perfective conjugation paradigms of both strong and hollow verbs.
Some combinations of features have the same realization as some others; equiva-
lently, certain inflected verbs have the same word form as others. The number of
rules required to capture generalizations depends on how we handle syncretism in
the paradigms of strong and hollow verbs: at the whole word form (the stem and
the affixes) or simply at the stem level. In the perfective paradigm for non-hollow
verbs in Table 6.1, there are three evident instances of syncretism. Such instances are
expressed by the following rules of referral:
(i) The first person singular masculine and feminine word forms are identical;
(ii) The first person dual masculine or feminine and first person plural masculine or
feminine word forms are identical;
(iii) The second person dual masculine and second person dual feminine word forms
are identical.
Given these instances of syncretism, the 18 word forms resulting from the 18 person-
number-gender combinations collapse to 13 forms.
Table 6.1. Perfective active conjugation of katab “to write” and zawar “to visit”
                      Strong Verb                Hollow Verb
Person  Number    Masculine    Feminine      Masculine   Feminine
1st     singular  katab-tu     katab-tu      zur-tu      zur-tu
        dual      katab-naA    katab-naA     zur-naA     zur-naA
        plural    katab-naA    katab-naA     zur-naA     zur-naA
2nd     singular  katab-ta     katab-ti      zur-ta      zur-ti
        dual      katab-tumaA  katab-tumaA   zur-tumaA   zur-tumaA
        plural    katab-tum    katab-tuna    zur-tum     zur-tuna
3rd     singular  katab-a      katab-at      zaAr-a      zaAr-at
        dual      katab-aA     katab-ataA    zaAr-aA     zaAr-ataA
        plural    katab-uwA    katab-na      zaAr-uwA    zur-na
Strong verbs show syncretism in the person-number-gender suffixes, with the
stem remaining unchanged throughout. Whether we deal with syncretism at the
word form level or simply at the stem level, the 13 forms are derivable by 13
rules that add the appropriate suffix to an invariable stem. Hollow verbs and other
weak verbs, however, show two kinds of syncretism: in the person-number-gender
suffixes and in the stems. By considering syncretism at the whole word form level,
we would need 13 more rules (an additional rule for every person, number and
gender combination yielding a distinct surface form), for a total of 26 rules, to
account for the stem changes in Table 6.1. However, by handling syncretism at
the stem level, that is, by separating stem generation from suffix generation, we
reduce the 13 additional rules to 3. In the hollow verb paradigm shown in Table 6.1,
the first person and second person word forms have the same stem (i.e., zur).
To capture this generalization most economically, we: 1) postulate a {±3rd
person} feature so that the first and second person can form a coherent class with its
associated stem change rule; 2) postulate a default rule that applies to all third person-
number-gender combinations; and 3) provide another, more specific, overriding
rule, identical to the rule in 1), to account for the third person plural feminine
combination.
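A compact way to read these three rules is the following sketch (our paraphrase in code; the function name is invented):

# Stem-level rules of referral for the hollow perfective: a default rule gives
# the long stem; the non-3rd-person rule and a more specific 3rd-plural-
# feminine override both give the short stem.
def hollow_perfective_stem(short, long, person, number, gender):
    if person in (1, 2):                        # rule 1: {-3rd person} class
        return short
    if number == "plural" and gender == "fem":  # rule 3: overrides the default
        return short
    return long                                 # rule 2: default for 3rd person

assert hollow_perfective_stem("zur", "zaAr", 1, "singular", "mas") == "zur"
assert hollow_perfective_stem("zur", "zaAr", 3, "plural", "fem") == "zur"
assert hollow_perfective_stem("zur", "zaAr", 3, "dual", "fem") == "zaAr"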
Our lexeme-based approach to syncretism corroborates the conclusions in
(Zwicky, 1985) and (Stump, 1993), namely that an adequate theory of morphology
must incorporate rules of referral in order to account for some kinds of inflectional
syncretism. Such a theory also claims that syncretisms do not always encompass
whole word forms and that two or more rules of referral may participate in the
definition of a single instance of syncretism.
6.3.2 Apophony in Strong Verbs
In this subsection, we briefly address the issue of apophony (vowel alternation) in
Arabic triliteral Form 1 strong verbs. The perfective stems are typically faςal,faςil
and faςul. The first vowel is invariably ‘a’, in the active voice; the second vowel (V2),
called stem vowel, is variable. The imperfective stems are fςal,fςil and fςul.Inthe
imperfect, the first vowel drops out (is replaced by lack of vowel ‘.’) and the stem
vowel varies. The basic question is whether the perfective-imperfective apophonic
alternations are predictable or must be lexically recorded.
Table 6.2 shows that when the stem vowel is ‘a’ in the perfective, the imperfective
vowel can be ‘i’, ‘u’, or occasionally ‘a’ (the latter occurs mostly in the presence
of certain neighboring radicals). When the stem vowel is ‘i’ in the perfective, the
imperfective vowel is usually ‘a’ and occasionally ‘i’. When the stem vowel is ‘u’
in the perfective, the imperfective vowel is always ‘u’. Occasionally a verb will have
multiple imperfective forms and, rarely, have multiple perfective forms each with its
own imperfective form (e.g. Hasib “to assume, to compute”).
(Guerssel & Lowenstamm, 1996) provides a different account for Arabic apophony,
neither of which permits predicting vowel alternations reliably. In our approach
we provide information on the imperfective vowel for strong verbs in the lexicon.
Table 6.2. Apophonic alternations attested in Arabic
Stem Vowel   Perfective Stem   Imperfective Stem   Gloss
a            jalas             jlis                to sit down
a            sakan             skun                to live
a            katab             ktub                to write
a            qaTaς             qTaς                to cut
a            fataH             ftaH                to open
a            Hasab             Hsub                to assume
i            fahim             fham                to understand
i            fariH             fraH                to be happy
i            samiς             smaς                to hear
i            Hasib             Hsib, Hsab          to assume
u            sahul             shul                to become easy
u            karum             krum                to be generous
u            šaruf             šruf                to be noble
6.3.3 Generation Model and Implementation
In order to capture the generalizations related to the conjugation of strong verbs
and in view of the irregularities exhibited in the paradigms of the different types
and classes of weak verbs, the process of Arabic verbal morphology generation
should have two steps: first, stem changes are processed, then inflectional prefixes
and suffixes. This is indeed the approach we have taken in our work, as illustrated in
Section 6.3.3.3 below. Sections 6.3.3.1 and 6.3.3.2 introduce the morphological tool
and its enhancements.
6.3.3.1 The Basic MORPHĒ Tool
The MORPHĒ tool (Leavitt, 1994), written in Common Lisp, is based on two major
constructs: 1) a Morphological Form Hierarchy (MFH), which relates and
distinguishes morphological forms, and 2) transformational rules, attached to leaf nodes
of the hierarchy, that operate on the input representation to produce a surface form.
To provide a morphological description of a language, one must specify a MFH and
the corresponding rules. ‘Compiling’ the morphological description in MORPHĒ
produces a Common Lisp program, optionally compiled into object code. The
compiled morphological description is used at runtime to generate surface forms.
Runtime Input and Output MORPHĒ’s runtime input is a Feature Structure (FS),
which describes the lexical item that MORPHĒ must transform. A FS is implemented
as a recursive Lisp list. Each element of the FS is a Feature-Value Pair (FVP), where
the value can be atomic or complex. A complex value is itself a FS. For example, the
FS for generating the Arabic zurtu “I visited” is:

((LEXEME "zawar") (CAT V) (FORM 1) (IMPV HOL)
 (TENSE PERF) (MOOD IND) (VOICE ACT) (NUMBER SG) (PERSON 1))   (6.1)
The feature LEXEME identifies the ‘base’ form of the lexical item to be transformed
and is used as the basis for rule matching. The FVPs in a FS come from one of
two sources. Static features, such as CAT (part of speech) and LEXEME, come from
the lexicon, which can also contain morphological and syntactic features. Dynamic
features, such as tense and number, are set by MORPHĒ’s caller. MORPHĒ’s output
is a string.
The Morphological Form Hierarchy The Morphological Form Hierarchy (MFH)
or tree describes the relationship of all morphological forms to each other. The root
of the MFH binds all subtrees together. Each internal node of the tree specifies a
piece of the FS that is common to that entire subtree. The leaf nodes of the tree
correspond to distinct morphological forms in the language. Each node in the tree
below the root is built by using a MORPH-FORM declaration that specifies the name
of the node, the name of its parent node and the conjunction or disjunction of FVPs
that define the node and distinguish it from its parent and siblings. For example, the
declaration
(MORPH-FORM V-STEM-F1-ACT-PERF-1/2 V-STEM-F1-ACT-PERF
    (PERSON (*OR* 1 2)))   (6.2)

says that the node V-STEM-F1-ACT-PERF-1/2 is a child of V-STEM-F1-ACT-PERF
and is reached if the person feature has value 1 or 2.
Transformational Rules A rule attached to each leaf node of the MFH effects the
desired morphological transformations for that node. A rule consists of one or more
mutually exclusive clauses. The ‘if’ part of a clause is a regular expression, which
is matched against the string value of the feature LEXEME. The ‘then’ part includes
zero or more operators, applied in the given order. Operators include addition,
deletion, and replacement of prefixes, infixes, and suffixes. Applying the relevant
clause produces the transformed LEXEME string. For example, the transformational
rule that produces the zur part of zurtu “I visited” is:

Short Stem Rule:
(MORPH-RULE V-STEM-F1-ACT-PERF-1/2                        (6.3)
  ("ˆ%{CONS}(awa)%{CONS}$"    (RI *1* "u"))  ;; CLASS 1
  ("ˆ%{CONS}(a[wy]i)%{CONS}$" (RI *1* "i"))  ;; CLASSES 2 & 4
  ("ˆ%{CONS}(aya)%{CONS}$"    (RI *1* "i"))  ;; CLASS 3
)

The syntax %{VAR} is used to indicate a variable whose possible values are defined
outside the rule, for example:

(VARIABLE CONS (... "b" " " "t" " " "j" "H" "x" "d" ...))   (6.4)
Enclosing a part of a regular expression in parentheses associates it with a numbered
register, so that an operator can access it for substitution. In the above rule, the first
clause says that if the value of the LEXEME feature is a consonant, followed by the
string “awa”, followed by another consonant, MORPHĒ should apply the Replace
Infix (RI) operator and substitute “awa” with “u”. Hence "zawar" becomes “zur”.
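For readers more familiar with mainstream regular-expression engines, a rough Python analogue of this first clause might look as follows (our sketch, with an abbreviated consonant class; note that MORPHĒ captures the infix in register *1*, whereas here the consonants are captured instead):

# Rough analogue of the first Short Stem Rule clause: match
# consonant-"awa"-consonant and replace the infix "awa" with "u".
import re

CONS = "[bdtjHxkmnrsz]"   # abbreviated stand-in for the CONS variable

def short_stem_class1(lexeme: str) -> str:
    return re.sub(rf"^({CONS})awa({CONS})$", r"\1u\2", lexeme)

print(short_stem_class1("zawar"))  # -> zur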
Runtime Process Logic In generation, the MFH acts as a discrimination network.
The lexical item to be inflected is pushed down through the hierarchy, by matching
features in its FS against the features defining each subtree, until a leaf is reached. At
that point, MORPHĒ first checks the irregular-forms lexicon for an entry indexed
by the name of the leaf node (as specified by the MORPH-FORM declaration) and the
value of the LEXEME feature in the FS. If an irregular form is found, MORPHĒ
outputs it; otherwise it tries to apply a rule attached to the node. If no rule is found
or no clause of the applicable rule matches, MORPHĒ returns the value of LEXEME
unchanged. An input can also fall through the MFH because the combination of FVPs
does not lead to a leaf node, in which case the value of LEXEME does not undergo
any changes.
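The traversal just described amounts to a simple recursive descent; the sketch below (our reconstruction, with an invented node structure, omitting the irregular-forms lookup) shows the control flow:

# The MFH as a discrimination network: match the FS against the FVPs that
# define each child subtree until a leaf's rule (or a fall-through) is reached.
def descend(node, fs):
    for fvps, child in node["children"]:
        if all(fs.get(feat) in values for feat, values in fvps.items()):
            return descend(child, fs)
    return node.get("rule")        # None means LEXEME passes through unchanged

leaf = {"children": [], "rule": "V-STEM-F1-ACT-PERF-1/2 rule"}
perf = {"children": [({"PERSON": (1, 2)}, leaf)]}
print(descend(perf, {"PERSON": 1}))  # V-STEM-F1-ACT-PERF-1/2 rule
print(descend(perf, {"PERSON": 3}))  # None (falls through)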
Limitations of the Basic MORPHĒ Tool A major limitation of the basic MORPHĒ
system is that it currently cannot be used to perform morphological analysis.
MORPHĒ was developed in the context of a Knowledge-Based Machine Translation
system (Nyberg & Mitamura, 1992) whose primary application was translation from
English into other languages. MORPHĒ was used for generating target language
morphology but not for morphological analysis of English. Hence, while MORPHĒ
was planned to perform both analysis and generation, the analyzer was never fully
implemented. Whereas the enhancements described in the following section and in
Section 6.4.3.1 have addressed several other limitations in the original
implementation of MORPHĒ, the addition of an analysis capability will need to wait until after
the set of enhancements currently in progress (Section 6.5) is completed.
6.3.3.2 MORPHĒ Enhancements to Support Verbal Morphology
Our initial use of the original MORPHĒ tool to describe Arabic verbal morphology
(Cavalli-Sforza et al., 2000; Soudi et al., 2001) and the realization that our approach
fits in well with the LBM theory (Aronoff, 1994; Beard, 1995) prompted us to
enhance MORPHĒ in a number of ways. Some of the new features in the enhanced
MORPHĒ system (EMORPHĒ) are designed to explicitly support the LBM theory
approach. Others are primarily intended to facilitate the management of a large
morphology description. Cumulatively, the enhancements increase modularity and
decrease redundancy. All of the enhancements described in this section operate at
morphology compilation time and do not affect the MORPHĒ runtime process logic.
Default Rules In the original MORPHĒ system (Leavitt, 1994), a transformational
rule could only be attached to leaf nodes. In EMORPHĒ, it is also possible to attach
a rule to a pre-leaf node, where it acts as a default rule: if the input FS matches up
to the pre-leaf node (the general case) but does not match any of its children (the
special cases), the default rule is applied instead. Default rules reduce the number
of leaf nodes and avoid the (possibly) complex specification of the complement to
special cases. For example, considering hollow verb morphology in the perfect tense,
one can attach a default rule that generates a third person long stem (e.g., zaAr) to
a pre-leaf node and use a more specific rule, attached to a leaf node, for the third
person feminine plural to generate a short stem (e.g., zur).
Rule Equivalencing The original MORPHĒ system required the MFH to be a tree.
If several leaf nodes required the same transformational rule, the rule had to be
duplicated. In contrast, EMORPHĒ can avoid rule duplication in one of two ways:
Implicit equivalencing. Different paths in the MFH (i.e., different FVP
sequences) can lead to the same node and share information attached to that node.
Effectively, the MFH becomes a graph instead of a tree.
Explicit equivalencing. Distinct nodes reached by different paths in the MFH can
be explicitly declared to share the same rule by using the declaration syntax:

(MORPH-EQUIVALENCE <REFERENCE NODE NAME> <EQUIVALENT NODE LIST>)   (6.5)

<EQUIVALENT NODE LIST> is a list of one or more names of actual nodes that
share a common rule; <REFERENCE NODE NAME> is used to attach the rule and
can be either the name of an actual node in the hierarchy or a new virtual node. Used
in combination with careful design and default rules, rule equivalencing reduces
rule duplication, highlights syncretisms, and embodies the rules of referral of LBM
theory.
Other Enhancements The original MORPHĒ tool only allowed the use of ROOT
as the name of the base form of the lexical item on which transformational rules act,
and expected the morphology description to be a single file, which was extremely
unwieldy for extensive morphology descriptions. EMORPHĒ allows the user to
specify, at compilation time, the desired feature name (e.g. LEXEME, STEM, ROOT)
and allows a morphology description to be split across multiple files. The only
restriction is that a file can make external references only to previously declared
information.
6.3.3.3 Arabic Verbal Morphology in MORPHĒ
The linguistic results of Sections 6.3.1 and 6.3.2 suggest that the generation of Arabic
verbal morphology be performed in two steps. Figure 6.2 sketches the Arabic MFH
and the division of the verb subtree into stem changes and prefix/suffix additions. It
also partly fleshes out the perfect tense subtrees for strong and hollow verbs of Form
1 (i.e., pattern CVCVC) and shows some of the features used to traverse the MFH.
MORPHĒ is first called with the feature GEN (generate) set to STEM. After
MORPHĒ has traversed the nodes branching from (GEN STEM), the required stem
is returned and temporarily substituted for the value of the LEXEME feature.
Fig. 6.2. The basic MFH and partially detailed perfective subtrees
The second call to MORPHĒ, with the feature GEN set to PSFIX (prefixation and
suffixation), applies further rules to the previously computed stem in order to add
inflectional prefixes and/or suffixes. The output is a fully inflected verb. (The use of
two branches of the tree for the two steps is a constraint of MORPHĒ's current
implementation, which does not support multiple trees.)
To demonstrate how the system works, we present the case of hollow verbs
(Section 6.3.1) in the perfective and imperfective, relating it to the generation of
strong verbs.
Strong and Hollow Verbs in the Perfective As shown in Table 6.1, unlike regular
strong verbs, which do not undergo any stem changes in the perfect active voice,
hollow verbs use a long stem with a middle Âalif (i.e., ‘A’) for third person singular
and dual (masculine and feminine) and for third person plural masculine (e.g. daAm
“to last”). The remaining person-number-gender combinations take a short stem
whose voweling depends on the underlying root of the verb. These syncretisms
are evident in the MFH shown in Figure 6.2: the MFH has a branch for first and
second person and a branch for third person. A short stem rule is attached to the
first/second person leaf node and a long stem rule is attached to the third person pre-
leaf node. The node for third person plural feminine, which requires a short stem,
uses a MORPH-EQUIVALENCE (represented by the arrow) to refer to the node/rule
for first/second person and explicitly embodies a rule of referral. All other cases
of syncretism are treated implicitly by using only the shared features that determine
the stem to specify the node and its corresponding rule. The MORPHĒ declarations
representing these morphological forms and their relationship are given below.
(MORPH-FORM V-STEM-F1-ACT-PERF-1/2 V-STEM-F1-ACT-PERF (6.6)
    (PERSON (*OR* 1 2)))
(MORPH-FORM V-STEM-F1-ACT-PERF-3 V-STEM-F1-ACT-PERF (6.7)
    (PERSON 3))
(MORPH-FORM V-STEM-F1-ACT-PERF-3-PL-F V-STEM-F1-ACT-PERF-3 (6.8)
    (NUMBER PLURAL) (GENDER F))
(MORPH-EQUIVALENCE V-STEM-F1-ACT-PERF-1/2 (6.9)
    (V-STEM-F1-ACT-PERF-3-PL-F))
The short stem rule, used by first and second persons and by the third person plural
feminine, was given in (6.3); (6.10) shows the default long stem rule for the third
person.
Long Stem Rule:
(MORPH-RULE V-STEM-F1-ACT-PERF-3 (6.10)
    ("^%{CONS}A([WY][AI])%{CONS}$" (RI *1* "A")))
As mentioned in Section 6.3.1, hollow verb classes 2 and 4 can be merged because
they have the same perfect (and imperfect) stems. Inside the short stem change rule,
the four different classes of hollow verbs are treated as three separate conditions by
matching on the middle radical and the adjacent vowels and replacing them with the
appropriate vowel. The long stem is the same for all classes; therefore, only one clause
is necessary. Strong verbs do not match any clauses in the rules and fall through
with no stem changes. Note that there are different ways of expressing the same
syncretisms but, in all cases, only 2 stem change rules need to be specified instead of
18 separate and redundant ones.
An example using the verb nawim “to sleep” illustrates how the generation process
works. Assume the input FS is:
((LEXEME “nawim”) (CAT V) (FORM 1) (IMPV HOL) (6.11)
(VOICE ACT) (TENSE PERF) (PERSON 1) (NUMBER SG))
In the first call to MORPHĒ, the (GEN STEM) subtree is traversed until the
node V-STEM-F1-ACT-PERF-1/2 is reached. MORPHĒ matches the second
clause of rule (6.3) (the short stem rule), and returns the stem nim for use in
the second call. In the second call, the (GEN PSFIX) subtree is traversed until
the node labelled V-PSFIX-PERF-1-SG (corresponding to the (PERSON 1)
(NUMBER SG) FVP combination) is reached. MORPHĒ then applies the rule
attached to this node, namely:

(MORPH-RULE V-PSFIX-PERF-1-SG ("" (+S "tu"))) (6.12)

This rule adds the suffix tu to the stem nim and MORPHĒ returns the fully inflected
verb nim.tu “I slept”. The paths through the stem and psfix subtrees taken in
generating this example are shown with thicker lines in Figure 6.2.
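The two-call regime just illustrated can be summarized in a few lines (a sketch under the same assumptions as above; generate is the hypothetical routine given earlier and the FS is modelled as a plain dict):

    def inflect(fs, mfh_root, irregular_lexicon):
        # First pass: compute the (possibly changed) stem.
        stem_fs = dict(fs, GEN='STEM')
        stem = generate(mfh_root, stem_fs, irregular_lexicon)
        # Second pass: add inflectional prefixes/suffixes to that stem.
        psfix_fs = dict(fs, GEN='PSFIX', LEXEME=stem)
        return generate(mfh_root, psfix_fs, irregular_lexicon)

For the FS in (6.11), the first pass returns the short stem nim and the second suffixes tu, yielding nim.tu.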
Strong and Hollow Verbs in the Imperfective Figure 6.3 shows the imperfect
subtree for strong and hollow verbs of Form 1 (pattern CVCVC). Strong verbs are
treated by three rules branching on the middle radical vowel, given as the value of
IMPV. The consonant-vowel pattern of the computed stem is shown. For example,
for katab “to write”, the lexicon contains the FVP (IMPV u) and the imperfect
stem would be ktub in the pattern CCuC. The imperfect vowel is stored in the
lexicon because it is not always determined by the perfect vowel, as is explained in
Section 6.3.2 (though, in the presence of certain second and third radicals, the stem
vowel is more precisely determined).
For hollow verbs, the imperfective vowel depends on the middle weak radical and
the vowel immediately following it in the underlying stem. As for the perfective, it
is computed by transformational rules, and is not stored in the lexicon. Hence the
feature IMPV is only used to distinguish hollow and other kinds of verbs from strong
verbs. To show the syncretisms present in this inflectional paradigm, while avoiding
visual clutter, in Figure 6.3 we have used the labels “short stem” and “long stem”
(e.g., for nawim “to sleep” the stems would be nam and naAm respectively) instead
of using arrows to represent the MORPH-EQUIVALENCE declarations actually used
in the morphology description.

Fig. 6.3. Imperfect stem change subtree for strong and hollow verbs of form 1
The hollow verb subtree shown in Figure 6.3 is not as small for the imperfect as
it is for the perfect, since the stem depends not only on the mood but also on the
person, gender, and number. It is still advantageous to decouple stem changes from
prefixation and suffixation, since only two stem change rules are needed for a given
voice, and prefix and suffix rules are largely shared with other verb forms/patterns
and types of verbs.
6.4 The Arabic Noun System
There are three number categories for Arabic nouns (including adjectives): singular
(mufrad), dual (muθan), and plural (jamς). The plural is further divided into
sound (Aljamςu AlsaAlimu), the use of which is practically confined (at least in
the masculine) to participles and nouns indicating profession and habit, and broken
(Aljamςu Almukasaru) types. Broken plurals are then divided into “plurals of
paucity” (jamςu Alqilai), denoting three to ten items, and “plurals of multiplicity”
(jamςu Alkaθrai), denoting more than ten items. There are four forms of the plural
of paucity and at least 23 forms of the plural of multiplicity (Abu Al-Suud, 1971).
Several singulars have more than one plural form. There are also underived nouns
with plural or collective sense (usually indicating a group of animals or plants). These
are treated as singular but may form a ‘singulative’ (Âismu AlwaH.dai), indicating
an individual of the group, by attachment of the tA’ marbuwTa suffix.
In this section, we provide both linguistic and statistical evidence against gener-
ating broken plurals from the singular or the root. We propose instead a multiple-
stem approach to nouns with a broken plural pattern that dispenses with the complex
rules required to account for the highly allomorphic broken plural system. For sound
nouns we specify the suffix type in the lexicon (Soudi et al., 2002).
6.4.1 The Arabic Broken Plural System
The Arabic broken plural system is highly allomorphic: for a given singular pattern,
two different plural forms may be equally frequent, and there may be no way to
predict which of the two a particular singular will take. For some singulars, as many
as three further statistically minor patterns are also possible. The range of allomorphy
is, in general, from two to five. For example, a singular noun with the pattern
CVCC would have one or two of the following plural patterns: CuCuwC, ÂaCCaAC,
CiCaAC or ÂaCCuC. Examples showing the broken plural of the singular pattern
CVCC are as follows:

    singular    plural              gloss
    wazn        ÂawzaAn             measure
    kalb        kilaAb              dog
    ςayn        ςuyuwn, Âaςyun      eye          (6.13)
To evaluate the statistical productivity of the Arabic plural system, we used, following
Ratcliffe (1992, 1998), Levy's (1971) study, reproduced in Table 6.4, which
includes statistical information on common plural types, based on Wehr's (1980)
dictionary. Singulars based on quadriliteral roots are not included. An older
study by Murtonen (1964), based on the dictionary of Lane (1893), gives
somewhat different results and makes different assumptions, but both demon-
strate that while the association between singular and plural forms is not
random, there is no way to predict exactly which plural pattern a singular
will take.
The left-most column in the table indicates the singular patterns and the top
row indicates the most frequent plural forms. There are two numbers at every
co-ordinate where the singular line crosses the plural one. The first indicates the
percentage of the particular singular in relation to all singulars of a given plural.
The second indicates the percentage of the particular plural as a proportion of
all the plurals taken by a given singular type. By way of example, the intersection
of the singular pattern CaCC and the broken plural pattern CuCuwC in
Table 6.4 shows the numbers 73/49. These numbers indicate that 73 percent of all the
singulars taking the plural pattern CuCuwC are of the pattern CaCC, and that
49 percent of all singulars with the pattern CaCC will have CuCuwC as their plural
pattern.
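Computationally, the table amounts to a mapping from (singular pattern, plural pattern) pairs to the two conditional percentages (a hypothetical encoding, shown for the one cell just discussed):

    # distribution[(sing, pl)] = (% of pl's singulars that are sing,
    #                             % of sing's plurals that are pl)
    distribution = {('CaCC', 'CuCuwC'): (73, 49)}

The asymmetry of the two numbers is the point: even where a plural pattern is dominated by one singular pattern, that singular pattern may still select other plurals about half of the time, so no deterministic rule can be read off the table.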
The allomorphy exhibited by the Arabic broken plural system can be handled by
providing the broken plural pattern in the lexicon and a series of rules that operate on
the singular noun to generate the plural noun in the morphological component. These
rules would act at the internal level to convert the singular stem to the plural stem
and at the external level to add the inflectional affixes (e.g., Case affixes: nominative,
genitive or accusative suffixes). Alternatively, one could provide the singular and
plural stems in the lexicon and then have inflectional morphology act on these stems
(Soudi et al., 2002).
The first approach would obviously involve several rules, since nouns with a
broken plural pattern have in general complex stem alternants. The multiple-stem
approach is more promising. Nouns with a broken pattern commonly display two
major stem alternants: the singular/dual allostem and the plural allostem. To capture
the fact that there are two forms and that these forms are systematically distributed,
the lexeme is given an inventory of two stems, labeled by Number. For generation,
only one plural stem suffices, since there is no good criterion for selecting among
multiple stems.
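A minimal rendering of such a two-stem lexeme entry (a sketch; the feature names LEXEME and BPSTEM are those used in the EMORPHĒ examples of Section 6.4.3, while the dict layout is ours):

    # Two allostems, distributed by Number: rajul 'man' / rijaAl 'men'.
    rajul = {
        'LEXEME': 'rajul',    # singular/dual allostem
        'BPSTEM': 'rijaAl',   # broken plural allostem
    }

Inflectional morphology then merely selects the appropriate stem and adds the external affixes, as Section 6.4.3 shows.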
6.4.2 Nouns and Inflection
In this section, we look at the inflection of the Arabic noun system and consider some
syncretism cases in the noun inflection of Arabic. Tables 6.5 and 6.6 show that the
accusative and genitive Cases are realized homonymously in the sound plural but not
in the broken plural. (The definite article Al is not included in the table.)
In the relevant literature, the main morphological distinction in declension is that
between the broken plurals and the rest. The examples in Table 6.5, however, show
Table 6.4. Singular/plural distribution based on (Levy, 1971)

    Plural:    ÂaCCuC  CuCuwC  ÂaCCaAC  CiCaAC  CiCaC  CuCaC  -aAt (sfp)  CawaACiC
    Singular
    CaCC       82/6    73/49   26/27    29/12
    CiCC       12/3    13/23   23/67    5/7
    CuCC       6/2     6/16    17/73    5/9
    CvCvC      6/9     33/85   5/6
    CaCCa      18/18   14/5    6/3      43/74
    CiCCa      1/4     86/84   2/12
    CuCCa      6/8     6/15    94/77    6/8
    CaCvCa     5/18    13/82
    CaACiCa    3/4     1/2     3/2      5/16    58/84
    CaACiC     2/3     42/24
    CuCuwCa    1/86
    CaCaACa    7/74
    CuCaACa    4/87
    CiCaACa    4/44
    CaCuwCa    0/14
    CaCiyCa
    CuCuwC     1/100
    CaCaAC     4/27
    CuCaAC     1/16
    CiCaAC     6/20
    CaCuwC     21/16
    CaCiyC

    (continued)

    Plural:    CuCCaAC  CuCCaC  CaCaCah  CuCaCa  CaCaACiC  aCCiCa  CuCuC
    Singular
    CaACiC     100/26   100/10  98/14    100/11  5/2
    CuCuwCa    0/14
    CaCaACa    4/26
    CuCaACa    1/13
    CiCaACa    11/56
    CaCuwCa    2/86
    CaCiyCa    70/95    7/5
    CuCuwC
    CaCaAC     1/5      22/52   6/13
    CuCaAC     8/48
    CiCaAC     1/1      48/45   38/34
    CaCuwC     3/20     3/11    16/63
    CaCiyC     6/4      18/17   29/11
Table 6.5. Paradigm of word forms of the sound nouns muςalim “instructor” and HayawaAn “animal”

                              Sound Plural Masculine         Sound Plural Feminine
    Definiteness  Case        Singular     Plural            Singular     Plural
    Indefinite    Nominative  muςalim˜u    muςalimuwna       HayawaAn˜u   HayawanaAt˜u
                  Accusative  muςalimã     muςalimiyna       HayawaAnã    HayawanaAt˜i
                  Genitive    muςalim˜i    muςalimiyna       HayawaAn˜i   HayawanaAt˜i
    Definite      Nominative  muςalimu     muςalimuwna       HayawaAnu    HayawanaAtu
                  Accusative  muςalima     muςalimiyna       HayawaAna    HayawanaAti
                  Genitive    muςalimi     muςalimiyna       HayawaAni    HayawanaAti
that there is another fundamental distinction between two types of sound plurals,
related to the spell-out of the Number marker and the Case marker. The exponent of
Plural in sound plural masculines is uwn (as in muςalimuwna) while in the sound
plural feminine it is aAt (as in HayawanaAtu), using the nominative Case marker as
the default Case for sound nouns. Case is realized in the two types of sound plural in
different positions with respect to the Plural marker (uwna, iyna vs. aAtu, aAti).
6.4.3 Arabic Noun Morphology in MORPHĒ
The Arabic plural noun system imposes different demands on the morphological
representation than the verb system. As discussed above, a minority of nouns form
their plural by regular processes of suffixation, but the majority of nouns have one or
more broken plural forms, whose pattern is not predictable from the singular pattern.
Therefore, drawing from Lexeme-Based Morphology (Aronoff, 1994; Beard, 1995),
we choose to give priority to stems and store the broken plural stem in the lexicon.
6.4.3.1 MORPHĒ Enhancements to Support Noun Morphology
With the broken plural stem of nouns in the lexicon, we can employ the same
two-stage approach we used for verbal morphology to generate inflected nouns,
after introducing a further enhancement to MORPHĒ: allomorph substitutions. As
Table 6.6. Paradigm of word forms of the broken noun rajul “man”

    Definiteness  Case        Singular   Plural
    Indefinite    Nominative  rajul˜u    rijaAl˜u
                  Accusative  rajulã     rijaAlã
                  Genitive    rajul˜i    rijaAl˜i
    Definite      Nominative  rajulu     rijaAlu
                  Accusative  rajula     rijaAla
                  Genitive    rajuli     rijaAli
mentioned in Section 6.3.3, transformational rules act upon the value of a feature
that, in EMORPHĒ, is specified at morphology compilation time (e.g. LEXEME). An
allomorph substitution attached to a leaf node indicates that a different feature in the
FS should be the source of the string to be transformed. The syntax of an allomorph
substitution declaration is:

(MORPH-ALLOMORPH <NODE NAME> <FEATURE NAME>) (6.14)

The addition of allomorph substitutions does modify somewhat the runtime process
logic of the original MORPHĒ system, as described in Section 6.3.3.1, by introducing
the allomorph substitution step. The resulting logic is shown in Figure 6.4.
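Relative to the earlier runtime sketch, the allomorph substitution step amounts to one extra check before a rule is applied (hypothetical Python again; node.allomorph_feature stands for the feature named in a MORPH-ALLOMORPH declaration):

    # Inside generate(), replacing the rule-application step:
    source = fs['LEXEME']
    feature = getattr(node, 'allomorph_feature', None)
    if feature is not None and feature in fs:
        # e.g. BPSTEM for broken plurals (Section 6.4.3.2)
        source = fs[feature]
    if getattr(node, 'rule', None) is not None:
        result = node.rule.apply(source)
        if result is not None:
            return result
    return source

If the alternate feature is absent from the FS, as for sound plurals, the substitution is simply ignored and the value of LEXEME is used.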
6.4.3.2 The Noun Morphological Form Hierarchy
Figure 6.5 shows a portion of the computational implementation of noun generation
in EMORPHĒ. The MFH is fully fleshed out only for nominative indefinite noun
inflection, but the accusative and genitive cases and definite nouns are handled in
a parallel fashion. Definite nouns, with FVP (DEF +), have a different suffix and
the prefix Al which, depending on the initial consonant of the noun, may either
have a sukuwn or undergo assimilation and cause gemination (shadda, ‘~’) of the
initial consonant. Figure 6.5 shows node names for the subset of nodes referenced
in the following discussion but only the distinguishing FVPs for the remainder of
the nodes.
As for verbs, the noun subtree of the MFH is split into two subtrees, one for
producing the correct stem and one for adding the necessary suffixes. Correspondingly,
generation of a fully inflected noun requires two calls to EMORPHĒ: the first,
to generate the required stem, temporarily adds the FVP (GEN STEM) to the FS;
the second, to generate the appropriate inflectional prefixes and suffixes, adds (GEN
PSFIX).
Fig. 6.4. EMORPHĒ's runtime process logic

Fig. 6.5. The noun Morphological Form Hierarchy (MFH)

For singular and dual nouns, there is no stem change; therefore, no branches for
(NUMBER SG) or (NUMBER DL) appear in the MFH under the (GEN STEM)
node. The FS just falls through the (GEN STEM) subtree in the first call, and
processing through the (GEN PSFIX) subtree proceeds normally during the second
call. For plural nouns, an allomorph substitution is attached to the node N-STEM-PL using
the declaration:
(MORPH-ALLOMORPH N-STEM-PL BPSTEM) (6.15)
It specifies that, if node N-STEM-PL is reached, EMORPHĒ should look in the
FS for a feature named BPSTEM, which, if present, should be used in subsequent
processing instead of the value of the LEXEME feature. The second call
to EMORPHĒ traverses the (GEN PSFIX) subtree, using the rule of referral
attached to N-PSFIX-PL-DEF--NOM (represented by the thick arrow) to add
inflectional suffixes to broken plurals and regular transformational rules attached
to N-PSFIX-PL-DEF--NOM-UN and N-PSFIX-PL-DEF--NOM-AT for sound
plurals. Specific examples of processing are presented below.
Noun with a Sound Plural
Input FS: ((LEXEME “mudaris”) (SP UN) (NUMBER PL) (CASE NOM) (DEF +))
In the first call to EMORPHĒ, the (GEN STEM) subtree is traversed using the
feature (NUMBER PL) to reach the leaf node N-STEM-PL. Since no BPSTEM
feature is found in the FS, the allomorph substitution is ignored and "mudaris"
is returned for use in the second call. Then the (GEN PSFIX) subtree is traversed
until the node labeled N-PSFIX-PL-DEF+-NOM-SPUN is reached. The
transformational rule in (6.16) produces the result Almudarisuwna “the teachers”.

(MORPH-RULE N-PSFIX-PL-DEF+-NOM-SPUN (6.16)
    (""
     (+P "AL")
     (+S "UWNA")
    ))
The condition part of the rule is the empty string “”, which matches any input. The
operator +P prefixes the definite article Al and the operator +S suffixes uwna, a
portmanteau morpheme for the Number and Case exponents.
Noun with a Broken Plural
Input FS: ((LEXEME “rajul”) (BPSTEM “rijaAl”) (NUMBER PL) (CASE NOM) (DEF -))
In the first call to EMORPHĒ, the (GEN STEM) subtree is traversed, reaching
the leaf node N-STEM-PL. The value of the feature BPSTEM – rijaAl “men” –
is retrieved from the FS via the allomorph substitution attached to that node
and is returned for use in the second call. Traversing the (GEN PSFIX)
subtree on the second call, the node N-PSFIX-PL-DEF--NOM is reached. No
SP feature is found in the feature structure, so EMORPHĒ defaults to the rule
attached to N-PSFIX-PL-DEF--NOM, which refers to the rule attached to node
N-PSFIX-SG-DEF--NOM:

(MORPH-RULE N-PSFIX-SG-DEF--NOM (6.17)
    (""
     (+S "N")
    ))

In Figure 6.5, the thick arrow represents the explicit equivalence:

(MORPH-EQUIVALENCE N-PSFIX-SG-DEF--NOM (N-PSFIX-PL-DEF--NOM)) (6.18)

It says that the suffix for an indefinite nominative plural noun is the same as that for
an indefinite nominative singular noun. The equivalence works together with default
rules, allowing the default rule to be equivalenced to a rule on a different node.
6.5 Work in Progress
While EMORPHĒ significantly enhances the expressiveness and convenience of
the original MORPHĒ, it still falls short of providing an optimally concise and
elegant framework for generating Arabic morphology. It can be further enhanced
both to improve runtime efficiency and to more elegantly capture regularities in
morphological descriptions. In the two-stage approach used to generate the fully
inflected form of a lexical item, redundant work is performed in checking the features
in the input FS twice, once for determining the appropriate stem, then again for
determining the prefix and suffix. Since the stem subtree is relatively shallow, at least
for nouns and for sound and hollow verbs, the cost in time is not high, but avoiding it
altogether would be better. In addition, the two-stage process complicates extending
EMORPHĒ to perform analysis.
Work in progress addresses this issue by introducing inheritance of rules and other
information associated with internal nodes of the MFH, thereby allowing the MFH
to be traversed only once while collecting and adjusting the information required
to generate the inflected form. The final version of EMORPHĒ will allow internal
nodes of the hierarchy to have allomorph substitutions attached to them, and will
allow explicit equivalences to use internal nodes of the MFH as reference nodes
(implicit equivalencing of subtrees is already supported). In the remainder of this
section we sketch, mostly via figures, how the enhancements in progress will affect
the representation of Arabic morphology in EMORPHĒ.
Figure 6.6 shows a portion of the MFH for Arabic verb morphology when stem
changes and prefix/suffix additions are merged into a single hierarchy using inheri-
tance. For lack of space, the feature names GENDER, NUMBER, and PERSON have
been abbreviated to GEN, NUM, and PER respectively.
Fig. 6.6. Partially detailed perfective verb MFH with inheritance

As an example of how processing works, consider again the feature structure for
obtaining the Arabic zurtu “I visited” given by (6.1) in Section 6.3.3.1. The short
stem rule in (6.3), which yields zur, is attached to the internal node labeled with
(PERSON 1). Other parts of the MFH where this rule is required make reference
to it through explicit equivalencing. As the FS is pushed down through the MFH
and reaches the nodes labeled with (PERSON 1), the stem rule is picked up first.
Then, as the FS is discriminated down to the more specific leaf node labeled with
(NUMBER SG), the rule that adds the suffix tu is also picked up. Both rules are
applied to produce the final form zurtu. FSs containing the FVP (PERSON 2) are
processed similarly. For FSs containing the FVP (PERSON 3), the MFH shows a
long stem rule attached to the node labeled (PERSON 3), but if the FS also contains
(NUMBER PL) (GENDER F) the short stem rule referenced via an explicit equiv-
alence overrides the long stem rule.
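The single-pass traversal can be sketched as follows (illustrative Python, not the EMORPHĒ implementation; rules collected on the way down are kept in per-kind slots so that a deeper node, or an explicit equivalence, overrides an inherited rule of the same kind):

    def inflect_with_inheritance(root, fs):
        rules = {}                        # kind -> rule, e.g. 'stem', 'psfix'
        node = root
        while True:
            for rule in getattr(node, 'rules', []):
                rules[rule.kind] = rule   # deeper nodes override shallower ones
            child = next((c for c in node.children
                          if all(fs.get(f) == v for f, v in c.fvps)), None)
            if child is None:
                break
            node = child
        # Apply the collected rules in order: stem change, then prefix/suffix.
        form = fs['LEXEME']
        for kind in ('stem', 'psfix'):
            if kind in rules:
                form = rules[kind].apply(form) or form
        return form

For zurtu, the short stem rule is collected at the (PERSON 1) node and the +tu rule at the (NUMBER SG) leaf; for a third person feminine plural FS, the short stem rule reached via explicit equivalence would displace the inherited long stem rule in the 'stem' slot.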
Noun morphology can also be represented more succinctly with inheritance.
Figure 6.7 shows an MFH similar to that shown in Figure 6.5 (Section 6.4.3.2)
but with allomorph substitution for broken plurals attached high in the tree and
overridden at leaf nodes for sound plurals, and prefixation/suffixation information at
the leaf and pre-leaf nodes. For lack of space, the feature CASE and its values NOM,
GEN,ACC, have been abbreviated to CS,N,G,andA, respectively. We note that it
is possible to rearrange the noun MFH to exploit inheritance more fully, and we are
currently examining this option.
Figures 6.6 and 6.7 show how ongoing enhancements to EMORPHĒ can linearize
a non-linear morphology. Stem changes common to two or more word forms are
a non-linear morphology. Stem changes common to two or more word forms are
effected by transformational rules attached higher in the MFH and overridden, if
necessary, at lower levels. Form-specific prefixation/suffixation rules are attached to
nodes lower in the hierarchy, usually leaf or pre-leaf nodes (in the case of default
rules).

Fig. 6.7. Partially detailed noun MFH with inheritance

What the figures do not show are the design and implementation implications
of adding inheritance to EMORPHĒ, which recall issues associated with object-oriented
programming languages. We also leave discussion of these for future work.
6.6 Conclusions and Future Work
This chapter has presented our approach to generating Arabic morphology from
a combination of empirical, theoretical and implementation perspectives. We
have reviewed the linguistic framework – Lexeme-Based Morphology (LBM) –
that has accompanied the development of the approach, provided and analyzed
morphological data in its support, and described its implementation within the
Enhanced MORPHĒ (EMORPHĒ) system. EMORPHĒ presents a number of
significant extensions with respect to the original MORPHĒ system, targeted
at making the system both more suitable for expressing linguistic phenomena
found in Arabic and more convenient to use. The result is a morphological
description tool that elegantly and concisely captures the transformations undergone
by a lexeme during inflection and highlights the characteristics and regularities
of Arabic morphology while distinguishing among different behaviors.
In concluding, we examine our implementation of Arabic morphology generation
in EMORPHĒ, and the tool itself, from both linguistic and computational
viewpoints.
From a linguistic viewpoint, our approach reflects an empirically and theoreti-
cally motivated decision to share the information required to generate fully inflected
forms between the lexicon – in the form of multiple stems – and the morphology
description – in the form of allomorph substitutions and transformational rules
operating on those stems. Allomorph substitutions support alternate stems for Arabic
broken plurals. Acting in combination with the broken plural stems in the lexicon,
they indicate that a change in the stem is expected, but cannot be reliably predicted,
and therefore one must look to lexeme-specific information in the lexicon. In
contrast, transformational rules express regular inflectional operations shared by a
class of verbs and inflected forms and are lexeme-independent. The specification
of transformational rules applicable to particular (classes of) inflected forms is
further aided by default rules and explicit equivalencing. Default rules support
the representation of morphological operations that are shared by several inflected
forms, allowing operations associated with particular combinations of features to be
specified as exceptions, while explicit equivalencing vividly displays the syncretisms
present in inflectional paradigms.
From a computational viewpoint, EMORPHĒ supports the development of
morphological descriptions that are significantly more modular and concise than
those possible with the original MORPHĒ tool, and therefore easier to work with
and maintain. Multiple morphology files allow a complex morphology description
to be broken down into coherent modules with clear relationships to the remaining
modules and the sole restriction that references to constructs outside the module
require the targets of those references to have been previously declared. Explicit
equivalencing allows transformational rules to be attached to only one node in
the morphological form hierarchy and be referred to elsewhere, thereby reducing
the need for rule duplication in several places in the hierarchy. Syntax changes to
transformational rules, related to the introduction of inheritance, will completely
eliminate rule redundancy. Inheritance of information from internal morphological
hierarchy nodes, in combination with the existing implicit equivalencing
mechanism, will permit leaf nodes representing fully inflected forms to inherit
information associated with entire subtrees of the morphological hierarchy and allow
EMORPHĒ to encapsulate in a single processing pass the two-stage linearization
that underlies our approach to generating Arabic morphology. While inheritance
will necessarily render compilation of a morphology description more complex and
time-consuming, it will result in faster runtime performance and will simplify the
extension of EMORPHĒ to perform morphological analysis. Work on inheritance-based
EMORPHĒ is underway (Cavalli-Sforza & Soudi, 2006).
At present, our coverage of Arabic morphology is limited to nouns (excluding
nouns that do not nunate) and some verb forms (sound and hollow verbs). Current
efforts on extending coverage are driving many of the in-progress enhancements of
the system. We expect that the fully-fledged EMORPHĒ tool with inheritance will
not only treat isolated verb and noun forms but also be able to represent proclitics
(e.g., the prepositions bi, li, and ka; the conjunctions wa and fa; and the future particle
sa) and enclitics (e.g. the suffixed possessive and direct object pronouns). We also
note that, while to date we have focused exclusively on inflectional morphology
of Arabic, the same framework can be used to describe derivational morphology
as well.
Another line of future work with respect to EMORPHĒ will be its extension to perform
morphological analysis as well as generation. The currently working version of the
tool does not immediately lend itself to reversibility. In generation, a specific FS
uniquely describes the path through (E)MORPHĒ's MFH but, in analysis, the path
that leads from the input surface form (a string) to the FS output is not necessarily
unique. The system must search through several potential paths, discard most
of them, and eventually arrive at one or more analyses that are consistent with
the input surface form. Hence morphological analysis requires embedding a search
strategy into the process. Modifications to support analysis await the completion of
inheritance-related work.
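A naive realization of such a search (purely illustrative; EMORPHĒ implements nothing of the kind yet) is generate-and-test: enumerate candidate feature structures and keep those that regenerate the observed surface form.

    def analyse(surface, mfh_root, irregular_lexicon, candidate_fss):
        # Brute-force sketch, reusing the hypothetical inflect() from above.
        return [fs for fs in candidate_fss
                if inflect(fs, mfh_root, irregular_lexicon) == surface]

A realistic analyzer would instead invert the transformational rules and prune candidate paths through the MFH as soon as a partial match fails.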
References
Abu Al-Suud, A. (1971). Al-Faisal fi Alwaan Al-Jumuu [The Distinction among the Colors of
the Plurals]. Cairo: Daar Al-maarif.
Aronoff, M. (1994). Morphology by Itself: Stems and Inflectional Classes. Cambridge, MA:
MIT Press.
Badawi, E., Carter, M.G. & Gully, A. (2004). Modern Written Arabic: A Comprehensive
Grammar. New York: Routledge.
Beard, R. (1995). Lexeme-Morpheme Base Morphology: A General Theory of Inflection and
Word Formation. Albany: State University of New York Press.
Arabic Computational Morphology 113
Beesley, K. (1996). Arabic Finite-State Morphological Analysis and Generation. In
Proceedings of COLING-96 (Vol. 1, pp. 89–94).
Beesley, K. (1998). Consonant Spreading in Arabic Stems. In Proceedings of COLING-98
(pp. 117–123).
Cavalli-Sforza, V. & Soudi A. (2003). Enhancements to a Morphological Generator to Capture
Arabic Morphology. In Proceedings of the Eighth International Symposium on Social
Communication (pp. 565–570), Center of Applied Linguistics, Santiago de Cuba.
Cavalli-Sforza, V. & Soudi, A. (2006). IMORPHĒ: An Inheritance and Equivalence Based
Morphology Description Compiler. In Proceedings of the Fifth International Conference on
Language Resources and Evaluation (LREC-06, pp. 13–18), Genova, Italy.
Cavalli-Sforza, V., Soudi, A. & Mitamura, T. (2000). Arabic Morphology Generation Using a
Concatenative Strategy. In Proceedings of the 1st Meeting of the North American Chapter
of the Association for Computational Linguistics (NAACL-00, pp. 86–93), Seattle.
Guerssel, M. & Lowenstamm, J. (1996). Ablaut in Classical Arabic Measure 1 Active Verbal
Forms. In Lecarme, J., Lowenstamm, J. & Shlonsky, U. (Eds.), Studies in Afroasiatic
Grammar (pp. 123–134). The Hague: Holland Academic Graphics.
Kay, M. (1987). Non-concatenative Finite-state Morphology. In Proceedings of the Third
Conference of the European Chapter of the Association for Computational Linguistics
(pp. 2–10), Copenhagen, Denmark.
Kiraz, G. (1994). Multi-tape Two-level Morphology: A Case study in Semitic Non-Linear
Morphology. In Proceedings of COLING-94 (Vol. 1, pp. 180–186).
Kiraz, G. (1998). Arabic Computational Morphology in the West. In Proceedings of the 6th
International Conference and Exhibition on Multi-lingual Computing, Cambridge.
Kiraz, G. (2000). A Multi-tiered Nonlinear Morphology using Multi-tape Finite State
Automata: A Case Study on Syriac and Arabic. Computational Linguistics, 26(1), 77–105.
Koskenniemi, K. (1983). Two-level morphology: A General Computational Model for Word-
Form Recognition and Production. Ph.D. dissertation, University of Helsinki.
Lane, E.W. (1863–93). An Arabic-English Lexicon (8 volumes). London: Williams and
Norgate.
Leavitt, J.R. (1994). MORPHĒ: A Morphological Rule Compiler. Technical Report
CMU-CMT-94-MEMO.
Levy, M.M. (1971). The Plural of the Noun in Modern Standard Arabic. Doctoral Dissertation,
University of Michigan.
McCarthy, J. (1979). On Stress and Syllabification. Linguistic Inquiry, 10, 443–465.
McCarthy, J. (1981). A Prosodic Theory of Nonconcatenative Morphology. Linguistic
Inquiry, 12, 373–418.
Murtonen, A. (1964). Broken Plurals, the Origin and Development of the System. Leiden:
E.J. Brill.
Nyberg, E. & Mitamura, T. (1992). The KANT system: Fast, accurate, high-quality translation
in practical domains. Proceedings of COLING-92 (pp. 1254–1258).
Ratcliffe, R.R. (1992). The Broken Plural Problem in Arabic, Semitic and Afroasiatic: A
Solution Based on the Diachronic Application of Prosodic Analysis. Ph.D. Dissertation,
Yale University.
Ratcliffe, R.R. (1998). The ’Broken’ Plural Problem in Arabic and Comparative Semitic:
Allomorphy and Analogy in Non-concatenative Morphology. Amsterdam/Philadelphia:
John Benjamins.
Soudi, A., Cavalli-Sforza, V. & Jamari, A. (2001). A Computational Lexeme-based Treatment
of Arabic Morphology. In Proceedings of the ACL-01 Workshop on Arabic Language
Processing: Status and Prospects (pp. 155–162), Toulouse, France.
114 Cavalli-Sforza and Soudi
Soudi, A., Cavalli-Sforza, V. & Jamari, A. (2002). The Arabic Noun System Generation. In
Proceedings of the International Symposium on The Processing of Arabic (pp. 69–87),
University of Manouba, Tunisia.
Stump, G.T. (1993). On Rules of Referral. Language, 69(3), 449–479.
Wehr, H. (1980). A Dictionary of Modern Written Arabic (Milton Cowan, J., Ed., 4th ed.).
Ithaca, NY: Spoken Language Services.
Zwicky, A. (1985). How to Describe Inflection. Berkeley Linguistic Society, 372–386.
7
Grammar-Lexis Relations in the Computational
Morphology of Arabic
Joseph Dichy1 and Ali Farghaly2
1 Université Lumière-Lyon 2, ICAR research lab (CNRS/Lyon 2), 86, rue Pasteur,
69365 Lyon Cedex 07 – France
Joseph.Dichy@univ-lyon2.fr
2 Oracle USA, 400 Oracle Parkway, Redwood Shores, California 94065 – USA
Ali.Farghaly@oracle.com
Abstract: Grammar-lexis rules and relations ensuring correct insertion of major lexical entries (nouns,
verbs and deverbals) play an essential part in the computational morphology of Arabic. This
chapter, which is based on the experiences of the DIINAR.1 Arabic lexical resource and re-
lated software, and on that of the first version of the SYSTRAN Arabic-English MT system,
outlines previous approaches to the computational morphology of the language (Section 2):
root and pattern (briefly recalled); lexeme-based; machine learning and statistical; stems
based on roots and patterns; and, finally, the stem-based approach, including root and pattern
as well as grammar-lexis information. The latter, which is the most compliant with the
requirements of machine translation and other high-level applications, is further developed in
Section 3. The authors go on to present the structure of the Arabic word-form and a mapping of
rules and relations accounting for grammar-lexis relations operating within the boundaries of
that complex unit. In the Word-Formatives Grammar, rules and relations involving the lexi-
cal nucleus of the word-form play a crucial part and are formalised in a computational per-
spective. The stem either coincides with, or is the core of the nucleus, because lexical entries
include two overall categories: in the first, stem and entry coincide; in the second, the lexical
entry corresponds to a morphological compound encompassing the stem and a lexicalized
extension (in most cases, a suffix which is part of the entry). Correct relations between the
lexical nucleus and the other formatives included in the word-form are ensured through
morphosyntactic specifiers associated to each entry of the lexical database. These relations,
which have been included in the DIINAR.1 database, are both finite in number and exhaus-
tive in coverage. They also allow computational morphology and other applications to rely
on a good restriction of the generated lexica: only cliticized or affixed formatives that can ef-
fectively be associated with a given lexical nucleus are added, and ‘illegal’ ones are ruled
out. In the DIINAR.1 resource, the effective number of inflected word-forms is 7,774,938
(about nine times less than one would obtain through ‘blind’ generation). A comprehensive
mapping of examples is given. Their compatibility with applications going beyond computa-
tional morphology is also outlined.
7.1 Introduction
The present chapter is fundamentally concerned with the role, which will be
shown to be crucial, of grammar-lexis relations in the computational morphology
of the written form of Modern Standard Arabic (henceforth ‘Arabic’). Computa-
tional morphology is the component of the linguistic engineering of the language
that deals with the analysis and/or generation of the grammatical and lexical mor-
phemes encompassed in the boundaries of the word-form, the structure of which
proves, in Arabic, to be that of a complex unit. The contents of this contribution
are based on two experiences in Arabic NLP development, that of the DIINAR.1
Arabic lexical resource and related analyzers and software, and that of the lexical
database and analyzers embedded in the SYSTRAN Arabic-English machine
translation system.
DIINAR.1 (DIctionnaire INformatisé de l’ARabe, version 1), Arabic acronym
MaȢaAliy (MuȢjam Al-Ȣarabiy~aƫ Al-Ɩliy~, /MuȢjam al-Ȣarabiyya(t)
al-’Ɨliyy/),1 is a comprehensive Arabic lexical resource of
around 120,000 lemma-entries operating at word-form level. It has been completed
in close cooperation by IRSIT in Tunis (A. Braham and S. Ghazali), and in
France, by the Lumière-Lyon 2 University (J. Dichy) and ENSSIB (M. Hassoun).
The main related software are the word-form (or morphological) analyzer developed
by M. Ghenima (1998), which was followed by R. Ouersighni’s AraParse syntactic
analyzer (2001, 2002), R. Zaafrani’s Al-MuȢal~im computer-aided
learning system (2002) and R. Abbès’s AraMorph morphological analyzer and
AraConc concordance software (2004) – all of which have been devised to support
the analysis of unvowelled Arabic script, and the generation, when needed, of
fully vowelled written word-forms.2
SYSTRAN’s Arabic-English MT system is a fully automatic transfer system.
A first version was developed at SYSTRAN’s offices in San Diego and
Paris between 2002 and 2004 by a team of computational linguists and lexicographers
including Jean Sénellart, Ali Farghaly, Dina Abu Qaoud, Mats Attnas and
Sylvie Poirier.
Both experiences show that grammar-lexis relations are associated with actual
lexical entries, and can, consequently, only be implemented in a stem-based lexical
resource (including root and pattern information), as opposed to a resource
founded on pure root-and-pattern combination (Dichy & Farghaly, 2003).
1 Whenever needed, a simplified and more traditionally phonological transcription has been
added between slashes (//) to the very comprehensive and in many cases original
transliteration system reflecting Arabic script introduced in the present volume.
2 See Dichy, Braham, Ghazali & Hassoun (2002); Abbès, Dichy & Hassoun (2004).
Availability: through ELDA, European Evaluation and Language Resources
Distribution Agency, 55, rue Brillat-Savarin, 75013 Paris – www.elda.org. Contact:
joseph.dichy@univ-lyon2.fr
Section 7.2 begins with a short survey of different approaches to the treatment
of Arabic morphology, presenting them from both theoretical and computational
viewpoints.
The authors go on (Section 7.3) to present a typology of grammar-lexis relations,
which are formalised in a computational perspective. They recall the structure of
the word-form in Arabic, focusing on the far less familiar fact that two fields can
be distinguished within that unit (presented in Figure 7.2):
[a] the lexical stem, or nucleus formative (NF) – except in word-forms that only
include grammatical morphemes –, and
[b] extension formatives (EF-s), which are bound grammatical morphemes.
Rules and relations involved in what can be called a Word-Formatives Grammar
(WFG) belong to three general types: [a] NF ↔ EF and [b] EF ↔ EF rules and
relations, to which [c] NFa – NFb morphological derivation links must be added.
Rules and/or relations are typified and exemplified, with the purpose of presenting
a mapping of grammar-lexis relations at stake in the computational morphology of
Arabic.
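As a purely illustrative rendering of this decomposition (the class and field names below are ours, not DIINAR.1's), an Arabic word-form can be modelled as a lexical nucleus flanked by extension formatives:

    # Hypothetical model of the Arabic word-form: one NF plus EF-s.
    class WordForm:
        def __init__(self, proclitics, prefixes, nucleus, suffixes, enclitics):
            self.proclitics = proclitics  # EF-s, e.g. conjunctions, prepositions
            self.prefixes = prefixes      # EF-s: inflectional prefixes
            self.nucleus = nucleus        # NF: the lexical stem (or stem + lexicalized extension)
            self.suffixes = suffixes      # EF-s: inflectional suffixes
            self.enclitics = enclitics    # EF-s, e.g. object or possessive pronouns

Type [a] rules then constrain which EF-s may co-occur with a given NF, type [b] rules constrain co-occurrence among EF-s, and type [c] links relate nuclei derived from one another.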
7.2 Arabic Morphology: Theoretical
and Computational Perspectives
The first module of a lexical resource is based on a morphological description of the
well-formed internal structure of morphemes and words in the language under
consideration. Grammar-lexis relations are thus dependent on what constitutes the basic
units in the morphology, and how these units interact with other morphological
entities to form higher and more complex word-form and syntactic structures. In
this section, we give a brief account of Arabic morphology, recalling, from both
theoretical and computational perspectives, some of the approaches that have dealt
with the complexity of that component of the language.
Arabic morphology has received a great deal of attention from engineers and
computational linguists, particularly since the early eighties. Pioneering work on the
computational morphology of Arabic goes back to the 1970s (Hlal, 1979, 1985a). The
retrieval of the consonantal root from fully inflected words represented a challenge
both from a theoretical point of view (Farghaly, 1987, 1994; McCarthy, 1981) and
from a computational perspective that has, under certain conditions, proved liable
to bring forth crucial theoretical advances and a better coverage of linguistic data,
which we will endeavour to illustrate.
7.2.1 Arabic Morphology from a Theoretical Perspective
The notion of the morpheme as a meaningful string of segments delimited by the
morpheme boundary symbol “+”, and containing no internal morpheme boundary,
is challenged by the facts of Arabic morphology, which exhibits properties that
can be briefly recalled as follows:
Roots are, strictly speaking, built of three or four consonants. Each root
dominates a clustering of Arabic lexical morphemes around a semantic field,
which can be single, subdivided or multiple.
Certain changes in nouns, verbs or adjectives based on these consonantal
roots yield derivatives. Some vowel and syllabic patterns seem, subsequently,
to be associated with a constant set of meanings.
Traditional treatment of Arabic morphology – especially in computational
morphology – sometimes remains taxonomic, abstracting away from the par-
ticular root and citing or generating all possible patterns.
These questions have been presented in the preceding chapters. Let us nevertheless
recall a few points. McCarthy (1981) revisited the view according to which an
Arabic verb of form I, for example, is better analysed as consisting of two separate
linguistic units: a consonantal root and a vocalism (Cantineau, 1950a, 1950b). He
proposes that each should be assigned to a different tier. Together, they make a
prosodic template. McCarthy also mentions the fact that there are certain con-
straints that apply to the root: some Arabic roots, for instance, may reduplicate the
second radical as in šad~a, ͉Ϊ η ‘to pull’ and haz~a, ͉ΰ ϫ ‘to shake’, but never the
first.3 (Such facts have been described at length in medieval Arabic linguistic trea-
tises.) Founding their discussion on the facts of Arabic morphology and other lan-
guages, McCarthy and Prince (1996) argue that a templatic morphology based on
prosodic theory can better accommodate the properties of the non-concatenative
nature of Arabic morphology and that of some other languages. Farghaly (1994)
suggests that the Arabic lexicon may consist of underspecified entries to represent
the discontinuous nature of Arabic morphemes. Farghaly (1987) argues that an
adequate description of Arabic morphology has to recognize three levels: (a) that
of the root, which is neither pronounceable nor belongs to any grammatical cate-
gory, (b) that of the stem, which is pronounceable and has to be a member of the
word classes of the language, and (c) that of the inflected word, where inflectional
affixes are attached observing well defined rules to form the majority of actual
Arabic words.
In the same period, many crucial theoretical and descriptive developments
founded on other approaches occurred in the computational morphology of the
language.
3 See, for instance, Al-suyuwTiy~ (d. around 1505), Al-muzhir – a
medieval linguistic treatise known to most readers with a general knowledge of Arabic
grammar or linguistics, the title of which cannot be relevantly translated.
7.2.2 Arabic Morphology in a Computational and Theoretical Perspective
The fact that Arabic word formation involves not only attaching prefixes and suf-
fixes to stems, but also a large number of infixes with many morphophonemic
processes, makes recovering the root and analyzing the internal structure of Arabic
words a real challenge for both computer processing and linguistic theory and de-
scription. Linguists, engineers and computational linguists took up the task of the
analysis and/or generation of Arabic words early on. In this section we present a
brief description of the main approaches in the treatment of Arabic morphology.
7.2.2.1 The Root and Pattern Approach
The ‘root and pattern’ approach has already been presented in preceding chapters,
and also in Dichy and Farghaly (2003). We will therefore focus very shortly, in
this subsection, on historical aspects. The ‘discovery’ of consonantal Semitic roots
by Western Semiticists goes back to the French traveller and
Orientalist Constantin Volney (XVIIIth–XIXth centuries; Rousseau, 1987). Linguistic
discussion of the question, including many references, can be found for Arabic in Dichy (1990,
1993), and in Cassuto & Larcher (2000) for Semitic studies in general.4 The partly
traditional notions of ‘root’ and ‘pattern’ should by no means be abandoned, but
they need to be limited and submitted to the constraints of formal definition (the
set of which is proposed in (Dichy, 2003)). Decisive psycho-cognitive evidence
has been given on roots and patterns in Hebrew (Bentin & Frost, 1995; Frost,
Forster & Deutsch, 1997), and on the role of roots in the recognition of Arabic
written words (Grainger et al., 2003). In the second half of the XXth century (on the
whole, after Cantineau (1950a, 1950b)), most linguists and grammarians of Arabic
and related Semitic languages – in the West and in Arab countries alike – came to
regard consonantal roots and patterns as basic linguistic components of the
morphological description of the languages under consideration. Most researchers and
linguistic engineers posited patterns, which are called in Arabic ÂawzaAn /’awzƗn/
(originally: ‘weight, measure, balance, poetic meter’), as presenting formal
definitions of well-formed Arabic words. These patterns were – and in many pro-
jects still are – considered as applicable to any root to generate Arabic lexical en-
tries. D. Cohen (1961) gave a very elegant formulation of this view, which has
later been described as a ‘neo-Leibnitzian myth’ (Dichy, 1993). It is nevertheless
crucial to note that some of the forms which could be generated by patterns may
have never existed in the Arabic language, and represent, as such, lexical gaps in
the Arabic lexicon, which can be used to coin new words as needed, instead of
4 On roots and patterns in the medieval Hebraic tradition, see, among many others, Zwiep
(1996); in modern Hebrew dictionaries, Cassuto (2000); in the Arabic tradition,
Troupeau (1984); also Roman (1999, pp. 198–205, “Brève histoire de la langue arabe”),
which includes a strong refutation of the conjecture on roots as non-ordered consonant
triples formulated by Ibn Jinniyy (IVth/Xth century), or as bi-consonantal ‘roots’ or
‘etymons’ constituted of non-ordered pairs, taken up in the XXth century by G. Bohas.
borrowing foreign words that may violate the morphological and/or phonological
rules of Arabic (Fassi Fehri, 1997).
The pioneering work of D. Cohen (1961) introduced a sophisticated representa-
tion of Arabic word-form structure, some revisited essentials of which are still in
force today (Section 7.3 below). Hlal (1979, 1985a), Geith and El Saadany (1987)
and many others designed computer systems for the analysis and/or generation of
Arabic words relying heavily on the traditional description of Arabic morphology
in terms of roots and patterns. The main approach, which has been followed with
some variations, was to compile a dictionary of Arabic roots and dictionaries of
affixes while maintaining a distinction between prefixes, suffixes and infixes, or to
build lexicons of roots and patterns, to which lists of pre- or suffixed elements
were added. Continuous look-up of elements that could belong to any of these
classes is then supposed to yield an analysis of Arabic word-forms.
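A toy rendering of the root-and-pattern combination such systems rely on (our sketch, not any of the cited implementations) interdigitates the radicals of a root into the consonant slots of a pattern:

    def interdigitate(root, pattern):
        # Replace each 'C' slot in the pattern with the next root radical.
        radicals = iter(root)
        return ''.join(next(radicals) if ch == 'C' else ch for ch in pattern)

    interdigitate('ktb', 'CaCaC')    # -> 'katab'  (form I perfect stem)
    interdigitate('ktb', 'CaACiC')   # -> 'kaAtib' (active participle pattern)

Applying every pattern to every root is precisely the 'blind' generation criticized above: it also produces forms that have never existed in the language.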
7.2.2.2 The Lexeme-based Approach
Soudi et al. (2001) propose adopting a lexeme-based morphology, and describe
MORPHĒ, a morphological rule compiler, for its implementation. The lex-
eme is an abstract concept representing a lexical meaning. Word-forms that share
the same lexical meaning are related to a lexeme as members. For example,
WORK is a verbal lexeme that includes as members: work, works, worked and
working. All four word-forms share the lexical meaning (‘working’). The varia-
tions among them are grammatical, such as past tense versus non past, etc., but not
lexical.
An interesting question is: where does the Arabic root fit in a lexeme-based
theory? Can we regard the root as a lexeme? The root represents a broad semantic
field. In a Lexeme-based model (Aronoff, 1994) all the word forms of a lexeme
belong to the same word class whereas the words generated by a particular Arabic
root belong to various word classes. Clearly, two different verbs like kataba
‘to write’ and Âiktataba /’iktataba/ ‘to enter one’s name, to subscribe, to
contribute, to invest in’ respectively belong to two different lexemes, although they
are clearly related to the same root. This root also includes the verb Âistaktaba
/’istaktaba/ ‘to get someone to write’, ‘to dictate to someone’, which partly
shares the same meaning as kataba, but has a different argument structure. This
implies that roots should not be regarded as lexemes à la Aronoff, which raises the
question of what is exactly a ‘lexeme’ in Arabic. One possible answer (Soudi
et al., 2001) is that it can, as is the case in English, be defined as an abstract concept
covering all the different grammatical forms of a given stem. Thus katabnaA
‘we wrote’ – yaktubu ‘he writes’ – sayaktubu ‘he will write’, etc.,
respectively belong to one lexeme since they all share the same lexical meaning
and they only vary in tense, which is a grammatical feature. This otherwise
efficient lemmatization procedure nevertheless leaves unanswered the question of
the grouping of lexemes sharing the same root in a ‘morphological family’, or the
issue of the derivational role of patterns, as well as pattern-to-pattern derivational
links, within a given root. Such a grouping of lexemes had already been outlined in
Hassoun (1987) and further described in Dichy and Hassoun (1989).
7.2.2.3 The Machine Learning and Statistical Approach
Machine learning approaches to building NLP systems have become very popular
in recent years. While rule-based NLP systems are usually time-consuming and
require solid linguistic expertise, machine learning techniques are deemed to be
fast, inexpensive, and to require only large corpora. Surprisingly, machine learning
techniques produce impressive results in a very short time and without the need
for expensive linguistic knowledge (Forster et al., 2003) – although doubts could
be raised in the case of languages for which heavy rule-based computational lin-
guistic work has been conducted prior to the use of statistic-based methods. The
fact is, one does not witness purely statistical systems, but rather mixed statistical
and rule-based approaches (such as Dien et al., 2003). As has been mentioned in
the final discussions of the IXth Machine Translation Summit (New Orleans,
September 2003), purely statistical methods may not yield the same results at all
for less-studied languages.
For languages like Arabic where solid computational linguistic knowledge and
elaborate language resources (lexica, annotated corpora, tree-banks, etc.) are still
rare (Nikkhou, 2004), statistical approaches nevertheless came to the rescue when
national-security needs required processing millions of documents in Arabic with
comparatively little R&D funding, as against ‘big’ languages such as English,
Spanish, French or German. The underlying assumption here is that linguistic
knowledge is present in linguistic data and that machine learning techniques can
extract this knowledge by going through cycles of training and retraining until the
system ‘learns’ the language. This
assumption is immediately limited by the fact that researchers usually mention the
existence of supervised learning modes, where the training data are annotated,
thus facilitating the learning process (for instance, Schafer & Yarowsky (2003)).
Effective annotation of corpora needs to be based on heavy previous linguistic
development, traditional grammar being, for such a purpose, very far from being
state-of-the-art, especially in Arabic, where traditional medieval grammar has not
been sufficiently revisited in the light of modern linguistics. One can nevertheless
mention unsupervised learning techniques, which are very important when anno-
tated corpora of the language are unavailable.
The recent availability of parallel corpora for Arabic-English prompted many
researchers to use machine learning techniques to extract all kinds of linguistic in-
formation. For example, Diab and Resnik (2001) describe how they used a parallel
corpus for word sense disambiguation under the assumption that different mean-
ings of the same word in the source language will be translated into distinct words
in the target language.
Rogati et al. (2003) followed an unsupervised learning approach to develop a
prototype Arabic stemmer. Their objective is to build a language-independent
stemmer. The model they use is based on statistical machine
translation using an English stemmer and a small parallel corpus for training pur-
poses. Their main approach is to remove prefixes and suffixes from Arabic words.
Although they did not remove infixes or deal with morpho-phonological transfor-
mations, they report an improvement of 22–38% on average over unstemmed text,
and 96% of the performance of a proprietary stemmer built using rules, affix lists
and human annotation.
7.2.2.4 The Stem-based Approach
The above approaches aim at analysing and/or generating Arabic word-forms. The
problem in any Arabic NLP system, such as tagging, document categorization,
automated summarization, speech recognition, machine translation, etc., is that it
is not enough just to recognize or generate forms. In NLP programs aimed at
effective application results, important information needs to be associated with each
morpheme and lexical entry. There is information coming from the morphological
level, such as gender, number, person, mood and tense, definiteness, part of speech
(POS), etc. There are also syntactic features, such as the count/mass distinction in
nouns, subcategorization frames, the type of subject or object a verb takes, etc. One
will also have to add semantic information, such as categorizing a noun as referring
to human/non-human or animate/inanimate entities, or to place and/or time, etc.
The more elaborate the information associated with the lexical entry, the more
sophisticated the grammar becomes, and the more powerful the NLP system as a
whole turns out to be. In machine translation applications, for instance, such so-
phisticated linguistic information cannot be done without. It can, on the other
hand, never be associated with an Arabic root or with a pattern, because neither
Arabic root nor pattern belongs to word classes (the terms refer to linguistic ab-
straction, not to actual parts of speech). However, combined roots and patterns
may form the nucleus of nouns, verbs and adjectives. The linguistic information
under consideration, including indispensable grammar-lexis relations for Arabic
NLP applications, can only be associated to stems, since a stem, by definition,
belongs to a syntactic category, and never to roots or to a mere combination or
root and pattern (Dichy & Farghaly, 2003). In the case of a homograph (a very
frequent case in standard vowel-free Arabic writing), a given stem could belong to
several syntactic categories.
The stem-based approach to Arabic morphology reduces the complexity of Ara-
bic word structure, eliminates large numbers of lexical gaps, and makes it possible to
associate relevant and specific morphological, syntactic and semantic features with
each entry. Figure 7.1 shows a small subset of the morphological information asso-
ciated with the lexical entry of an Arabic verb in the SYSTRAN monolingual Arabic
dictionary, built as a component of the Arabic-English translator.
7.2.2.5 Stems, Based on Roots and Patterns
One of the most advanced treatments of Arabic morphology using both the root-
and-pattern and stem approaches is the work done at Xerox Research Centre in
France by K. Beesley and his colleagues (Beesley, 2001). Beesley’s implementa-
tion is based on the insights of Karttunen (1994) that morphotactics and variations
in morphology can be expressed in regular expressions and can then be compiled
into finite state automata which are very efficient, fast and bidirectional. It elabo-
rates on Kimmo Koskenniemi’s two-level morphology (Beesley & Karttunen,
2003; Karttunen & Beesley, 2005), on the basis of a partial revisiting of
McCarthy’s representation (Beesley, 1989), and integrates a first version of Tim
Buckwalter’s lexicon (Buckwalter, 2002).
The approach should therefore not be described as founded on mere root and
pattern combination: in fact, it includes as an essential step the checking of candi-
date entries generated from root and pattern ‘merging’, and pre- or suffixes com-
bination, against existing lexical entries, as attested by a reference dictionary such
as Hans Wehr (1979), and fully takes into consideration the complexity of Arabic
morphology. The Xerox Arabic Lexicon included, four years ago (Beesley, 2001):
4,930 roots, 400 patterns, and 90,000 stems based on roots and patterns. The latter
correspond to 70,000 root-pattern intersections on the lower side of the two-level
morphological representation, the differences depending on information associated
with stems on the higher side (see Beesley & Karttunen (2003)), which clearly shows
that, in this approach, morphological and word-form grammar-lexis information is
associated with stems. The figure of 90,000 stems also shows that the blind combina-
tion of roots and patterns (4,930 roots × 400 patterns = 1,972,000 root-pattern vir-
tual links) has been severely restricted.
[perfect=زَرَعَ], [imperfect=يَزْرَعُ], [imperative=اِزْرَعْ], [passperf=زُرِعَ], [passimperf=يُزْرَعُ]
Fig. 7.1. A sample of SYSTRAN’s monolingual dictionary entry of the Arabic verb zaraȢa,
زرع ‘to plant’
7.2.2.6 Stem-based Lexical Resources, Including Root-Pattern
and Grammar-lexis Information
Another advanced treatment of Arabic morphology was initiated in France in the
early 1980s, in what was known as the SAMIA project5 (Desclés et al., 1983;
Dichy, 1984, 1987; Dichy & Hassoun, 1989; Hassoun, 1987), and has been going on
since. It has led to the completion, in collaboration with a Tunisian research centre
(IRSIT, now IT.COM), of the DIINAR.1 Arabic lexical resource. Morphological
analyzers drawing on this resource have been completed on a parallel basis. The
approach can be described as deliberately stem-based, including root and pattern
information on the one hand, and a comprehensive coverage of word-form
grammar-lexis relations on the other. This makes it closer to the requirements of
5 SAMIA is the acronym for “Synthèse et Analyse Morphosyntaxiques Informatisées de
l’Arabe”.
machine translation, as has been illustrated, for many language pairs, by
the SYSTRAN engines.
In this approach, representations of word-form structures and of word-level
grammar-lexis relations are very explicit. This is due to the database structure of
DIINAR.1, the subsequent declarative programming of the associated software
(Abbès, 2004; Ghenima, 1998; Ouersighni, 2001; Zaafrani, 2002), and also, to the
comprehensiveness of the coverage of Arabic morpho-lexical data.
The concepts and methods at stake in that representation of Arabic computa-
tional morphology, which is centred on grammar-lexis relations, are presented in
some detail in the following section.
7.3 Mapping the Arabic Lexicon: Word-form Structure, Rules
and Grammar-lexis Relations in Arabic
7.3.1 Structure of the Word-form in Arabic (A Brief Reminder)
Word-form units in Arabic feature a complex, albeit very regular, structure.
Standard word-forms comprise one lexical nucleus and one only.6 On the right
and left sides of that nucleus, specified sets of bound morphemes can be found,
in either affixed or cliticized position (Cohen, 1961; Desclés, 1983; Dichy, 1990,
1997; Dichy & Hassoun, 1989).
The structure encompasses:7
– proclitics (PCL), which consist of mono-consonantal conjunctions (such as
wa-, وَ ‘and’, li-, لِ ‘in order to’), prepositions (e.g. bi-, بِ ‘in’, ‘at’ or ‘by’, li-, لِ
‘for’), the pre-verb sa-, سَ (indicating the future), the article Al- /’al/, الـ, etc.;
– a prefix (PRF). The category, after D. Cohen’s representation of the word-
form (1961), only includes the prefixes of the imperfective, such as Âa- /’a/,
أَ, morpheme of the 1st person sing., etc.;
– a stem. Stems are divided into two general categories (Dichy, 1984):
Type 1 stems: this first subset consists of major lexical categories that are
liable to be represented in terms of a PATTERN and of a 3-consonant or 4-
consonant ROOT. (Major lexical categories encompass nouns, adjectives,
6 Poly-lexical entries are, in Arabic, either composed of more than one word-form (e.g.
Al-quruwn Al-wusTý, /’al-qurūn al-wusTā/, القرون الوسطى ‘the Middle Ages’) or reduced
by the morphological system of the language to a mono-lexical unit, e.g. qarwasaTiy~,
قروسطي ‘medieval’. The meaning of the Arabic lexicographical term naHt, نحت
‘coinage’, which describes the phenomenon, refers to the above reduction, which brings
the compound to comply with (a) the model of 3-consonant or 4-consonant roots, and
(b) the structure of the mono-lexical word-form (Dichy, 2003).
7 Hebrew word-forms feature similar complex structures (Sampson, 1985, pp. 90–91); for
a psycholinguistic approach, see Frost, Deutsch & Forster (2000).
verbs and deverbals.)8 By convention, the terms ‘root’ and ‘pattern’ will
henceforth be presented in small capital letters, referring to the formal defini-
tion above (Dichy, 2003). A ROOT is an ordered triple of consonants (3-C)
or, by extension of the system, a quadruple (4-C).9 A PATTERN is, in short,
a template of syllables, the consonants of which are those of the 3-C
or 4-C ROOT, with the addition of mono-consonantal affixes (belonging to
mono-consonantal roots10), such as the t ‘echo-morpheme’ (Roman, 1990).
Consider for instance the stem takab~ar, تكبَّر ‘to be haughty’. This stem
can be analysed into the 3-C ROOT /k-b-r/ and the PATTERN
/taR1aR2R2aR3/ (تَفَعَّل), which includes the mono-consonantal root /t/ and the
1st, 2nd and 3rd consonants of the 3-C ROOT (respectively: R1 = k, R2 = b,
R3 = r). It is crucial to remember that Type 1 stems include all the verbs and
deverbals of the language (Dichy, 1984).
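To make the ROOT-and-PATTERN combination concrete, here is a minimal sketch in Python, assuming an ASCII template notation in which R1, R2, R3 mark the radical slots (the function name and notation are illustrative, not part of the DIINAR.1 formalism):

    def interdigitate(root, pattern):
        # Insert the radicals of a 3-consonant ROOT into a PATTERN template.
        # Placeholders R1, R2, R3 mark the radical slots; all other characters
        # (vowels, affixal consonants such as the /t/ 'echo-morpheme') are
        # copied verbatim.
        r1, r2, r3 = root
        return pattern.replace("R1", r1).replace("R2", r2).replace("R3", r3)

    # takab~ar 'to be haughty': ROOT /k-b-r/ + PATTERN /taR1aR2R2aR3/;
    # the doubled R2 yields the geminated b (written b~ in this chapter).
    assert interdigitate(("k", "b", "r"), "taR1aR2R2aR3") == "takabbar"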
Type 2 stems: the second subset of stems contains only nouns that cannot
be represented in terms of PATTERN and ROOT, such as: /’ismāȢīl/, إسماعيل
‘Ishmael’, or fiyziyaA', فيزياء ‘physics’. There are no verbs in this category of
stems (a corollary of the fact, which has just been mentioned, that all verbs
belong to Type 1 stems);
– suffixes (SUF), such as verbal inflexions, nominal cases, the nominal femi-
nine ending +aƫ, /a(t)/, ـَة, etc.;
– enclitics (ECL). In Arabic, enclitics are complement pronouns. Some verbs
can have a double ECL, for example: Ȣal~am+tum-uw-niy-haA, /Ȣallam+tum-
ū-nī-hā/, علّمتمونيها ‘you [plur. masc.] have taught-me-it’ (this /uw/ sequence
8 The term ‘deverbal’ refers to what could also be called ‘verbal-nominal forms’, i.e.,
nominal forms that include syntactic-semantic verbal features, such as transitivity, etc.
These are, in Arabic, the infinitive form, maSdar, مصدر, the active participle, ΀ism Al-
faAȢil, /ism al-fāȢil/, اسم الفاعل, and the passive participle, ΀ism Al-mafȢuwl, /ism al-
mafȢūl/, اسم المفعول. Note that other subcategories have been included in deverbals in the
DIINAR.1 lexical resource (see § 7.3.3.2, Figure 7.4), following the categorisation of
traditional Arabic grammar. This has proven not to be consistent beyond morphological
analysis. Concerning the three subcategories above, research conducted in the
DIINAR.1 project has shown that traditional Arabic grammatical terminology obscures
the fact that the forms in consideration are liable to be either deverbals or nouns. Con-
sider for instance the sentence ÂanaA saAkin fiy ruwmaA, /’anā sākin fī rūmā/, أنا ساكن في روما
‘I’m living in Rome’. The active participle saAkin (ساكن) admits suffixed plural
forms, e.g., naHnu saAkinuwn (masc. ساكنون) or saAkinaAt (fem. ساكنات) fiy ruwmaA, but
excludes the ‘broken plural’ form suk~aAn, سكّان, which refers to the meaning of ‘in-
habitant(s)’. The former is a deverbal; the latter (saAkin, plur. suk~aAn, ساكن، ج سكّان)
has undergone a nominalization process, i.e. has left the deverbal category to become a
‘purely’ nominal lexical entry (see § 7.3.5.1[b]). Such cases require two distinct entries,
each associated with its own grammar-lexis specifiers.
9 5-consonant so-called ‘roots’, included for instance in ΀iȢranfaza, /’iȢranfaza/, اعرنفز
‘to almost die from cold’, which can only be found in ancient poetry or medieval dic-
tionaries, have been neglected.
10 Mono-consonantal roots in Arabic and other Semitic languages have been brought to
light by Roman (1990, 1999).
only appears with the plural masculine form of the subject pronoun when an
ECL pronoun is attached).
Figure 7.2 (Dichy, 1997) illustrates this structure in the case of Type 1 stems
(the conventions used in its lower part are explained immediately below):

Fig. 7.2. Arabic word-form structure (with ROOT-and-PATTERN stems) – لتقابلوه

Conventions (not already encountered here): ## = ‘word boundary’; # = ‘clitic
boundary’; + = ‘pre- or suffix boundary’. NF = ‘nucleus formative’ (referring to
the lexical nucleus of the word-form); EF = ‘extension formative’ (referring to
bound grammatical morphemes); aEF, pEF = ‘ante-positioned’ or ‘post-positioned’
EF. The set of aEF-s includes {PCL, PRF}, that of pEF-s comprises {SUF, ECL}.
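As a rough illustration of this structure (a sketch only, with assumed field names, not the representation used in DIINAR.1), the word-form can be modelled as a record holding one nucleus and the optional extension formatives on either side:

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class WordForm:
        # One lexical nucleus (NF) with ante-positioned EF-s {PCL, PRF}
        # and post-positioned EF-s {SUF, ECL}.
        proclitics: List[str] = field(default_factory=list)  # PCL
        prefix: Optional[str] = None          # PRF (imperfective prefixes only)
        stem: str = ""                        # NF: Type 1 or Type 2 stem
        suffix: Optional[str] = None          # SUF (inflexion, case ending, ...)
        enclitics: List[str] = field(default_factory=list)   # ECL (pronouns)

    # wa-bi-maktab-i-naA 'and by our office' (discussed in section 7.3.5.1):
    w = WordForm(proclitics=["wa", "bi"], stem="maktab",
                 suffix="i", enclitics=["naA"])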
7.3.2 Word Formatives, Word Specifiers and Word Formatives Grammar
The Word Formatives Grammar (WFG) accounts for the rules and relations
that ensure correct combination of formatives within the boundaries of the word-
form (Dichy, 1987). This grammar includes morpho-phonological transformation
rules, and various contextual rules, which will be outlined below (Subsection
7.3.4). Phonological transformations were not accounted for in the morphological
analysis of vowelled Arabic words initiated by Cohen (1961) or Hlal (1979,
1985a). They have on the other hand been included in the approach developed in
the SAMIA project for the analysis or the generation of vowel-free word-forms,
and the subsequent elaboration of the DIINAR.1 lexical database. One of the postu-
lates of this approach is that linguistic formatives must be specified in terms of
morpho-syntactic rules and relations according to the syntagmatic extension of the unit
they are inserted in (Dichy, 1987, 1997). Owing to the structure of the Arabic word-
form, one is brought to give special attention to rules and grammar-lexis relations, ac-
counting, in short, for insertion rules operating within the scope of that syntagmatic
unit. The following concepts and conventions have been subsequently adopted:
Word formatives, i.e., morphemes considered in the frame of the word-form
structure, are associated with grammar-lexis word-specifiers (w-specifiers).
Sentence formatives need to be associated with s-specifiers, and text forma-
tives, with t-specifiers.
This can be considered as an overall framework. It could easily be shown that the
three types of specifiers above involve different types of phenomena (Dichy,
2005). Specifiers involving word and sentence formatives can be described as
morphosyntactic specifiers.
7.3.3 Grammar-Lexis Relations in the Processing of Written Arabic
7.3.3.1 Multiple Analyses at Word-form and Sentence Level
Let us recall that, in the morphological analysis of Arabic, the complex operation
referred to as the ‘segmentation’ of the word-form into morphemes (or formatives)
is rendered all the more difficult because standard writing is ‘unvowelled’ or ‘vowel-
free’, i.e. bare of the secondary diacritical signs indicating short vowels (HarakaAt,
حركات), consonant doubling (šad~a, شدّة), diacritical case-endings (tanwiyn, تنوين),
etc. This has been presented in previous chapters. It has been shown in some de-
tail, quite a few years ago (Desclés, 1983; Dichy, 1984), that the resulting homo-
graphs entail a high number of potential analyses for a substantial percentage
of word-forms (Abbès, 2004; Ghenima, 1998).11 This is indeed the case be-
cause computational morphology, when it is not included in a syntactic analyzer
(Ditters, 1992; Ouersighni, 2001, 2002), deals with word-forms context-free.
Ambiguity due to multiple analyses should therefore not be considered a
problem in itself: morphological and morpho-syntactic analyzers aim at assigning
to word-forms all the analyses that comply with the rules and lexicon of the
language, and only those. It is on the other hand necessary to restrict the combina-
tion of word-formatives to forms that are ‘legal’ according to the morpho-
syntactic system (including the morphotactics of the writing system) and the lexi-
con of the language. This is also required to prevent the number of analyses per
word-form from climbing much higher than is allowed by the language and its writing
11 This could also be tested, in addition to the morphological analyzers drawing on the
DIINAR.1 resource (Abbès, 2004; Ghenima, 1998; Zaafrani, 2002), with the morpho-
logical analyzer put on the Internet by the Xerox European Research Centre (Beesley,
2001).
system. Non-existing analyses should therefore be ruled out: the vowel-free word-
form ÂȢlnt, أعلنت, for instance, should not be analysed as *ÂaȢluntu, */’aȢluntu/,
أَعْلُنْتُ or *ÂaȢlunat, */’aȢlunat/, أَعْلُنَتْ (no meaning in either case), but – among
other forms – as ÂaȢlantu, /’aȢlantu/, أَعْلَنْتُ or ÂaȢlanat, /’aȢlanat/, أَعْلَنَتْ ‘I’ or
‘she declared publicly’.
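The ruling-out of non-existing analyses can be pictured as a filter over candidate segmentations, sketched below with a toy stem list and suffix list standing in for a real lexical resource and word-formatives grammar (the data and function names are illustrative assumptions):

    # Keep only those candidate (stem, suffix) analyses of a vowel-free form
    # whose stem is an attested lexical entry and whose suffix is a legal
    # formative.
    ATTESTED_STEMS = {"ÂaȢlan"}        # form-IV stem of 'to declare publicly'
    LEGAL_SUFFIXES = {"tu", "at"}      # perfect, 1st sing. / 3rd sing. fem.

    def legal_analyses(candidates):
        return [(stem, suf) for (stem, suf) in candidates
                if stem in ATTESTED_STEMS and suf in LEGAL_SUFFIXES]

    candidates = [("ÂaȢlun", "tu"), ("ÂaȢlun", "at"),   # *ÂaȢluntu, *ÂaȢlunat
                  ("ÂaȢlan", "tu"), ("ÂaȢlan", "at")]   # ÂaȢlantu, ÂaȢlanat
    print(legal_analyses(candidates))  # only the two legal analyses survive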
7.3.3.2 Restricting Generated Lexica
Ruling out what could be described as ‘morphological noise’ is an equally crucial
issue when it comes to restricting the number of entries of a lexical resource.
Let us consider a few figures:
1) In the DIINAR.1 lexical resource, the numbers of combined proclitics and suf-
fixes which are effectively in use in Modern Standard Arabic, and those of pre-
fixes and enclitics, are shown below (Abbès, Dichy & Hassoun, 2004):

Proclitics (combined): 64
Prefixes: 8
Suffixes (combined): 67
Enclitics: 13

Fig. 7.3. Number of EF-s in DIINAR.1

Comments:
(a) Prefixes do not combine (see § 7.3.1 above).
(b) Enclitics may combine in doubly transitive verbs, which seldom occurs in
present-day Arabic, where one of the complements is usually preceded by a
preposition; e.g.: Ancient Arabic manaH+tu-ka-hu, منحتكه ‘I have given_you_it’
is currently realized as manaH+tu-hu la-ka, منحته لك ‘I have given_it to_you’.
(c) In the above numbers, combinations of extension formatives (EF-s) have been
restricted to effective use. Ancient Arabic proclitic combinations, such as
Âa-fa-bi-ka-Al-, /’a-fa-bi-ka-’l/, أ/فـ/بـ/كـ/الـ ‘interrogative-then-by-such_as-
the (generic article)’, or bi-ka, ‘by-such_as’, and a few others, have not
been included.
(d) In Figure 7.3, suffixes that only include secondary diacritics (traditionally
called ‘vowel-signs’, HarakaAt, حركات), i.e. basic case-endings in nouns and
a subset of mode markers in verbs, have not been included. This ensures a
‘lower-hypothesis’ interpretation of the calculations presented in the demon-
stration below.
2) The number of lemma-entries belonging to major lexical categories is the
following:

Fig. 7.4. Number of lemma-entries in the DIINAR.1 lexical resource
Comments:
In the DIINAR.1 lexical resource, following traditional Arabic grammar, adjec-
tives have been included in nominal stems, and two morphological subcategories
have been added to deverbals. These are, as shown in Figure 7.4: (a) ‘analogous
adjectives’ (صفات مشبهة) and (b) ‘nouns of time and place’ (أسماء المكان والزمان). Both
categorisations, which remained acceptable in the context of computational mor-
phology, have proved inconsistent when extending grammar-lexis relations to syn-
tactic features. Clearly, (a) are adjectives and (b) are nouns. This bias can be cor-
rected in the related analyzers and generators by modifying the (sub)category in
the specifiers associated with lexical entries.
Let us now engage in a bit of ab absurdo reasoning. On the basis of Figure 7.3,
blind combination of bound grammatical morphemes would give:

64 × 8 × 67 × 13 = 445,952 ‘virtual’ extension formatives (EF-s).

Unconstrained combination with the total number of stems in Figure 7.4 leads to a
generated lexicon of ‘virtual’ word-forms of:

445,952 EF-s × 129,258 stems = 57,642,863,616 ‘virtual’ word-forms.

Limiting the figures to inflected forms, the combination would still yield:

8 prefixes × 67 suffixes × 129,258 stems = 69,282,288 ‘virtual’ forms.
In the DIINAR.1 resource, the effective number of inflected word-forms is
7,774,938 (Abbès, Dichy and Hassoun, 2004, which includes a breakdown accord-
ing to lexical categories), i.e. 11.22% of the ‘virtual’ figure above.
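The figures above can be checked with a few lines of Python (the numbers are taken from Figures 7.3 and 7.4 and the text; the script merely reproduces the arithmetic):

    pcl, prf, suf, ecl = 64, 8, 67, 13
    stems = 129_258

    virtual_efs = pcl * prf * suf * ecl    # 445,952 'virtual' EF-s
    virtual_forms = virtual_efs * stems    # 57,642,863,616 'virtual' word-forms
    inflected = prf * suf * stems          # 69,282,288 'virtual' inflected forms

    effective = 7_774_938                  # attested inflected forms in DIINAR.1
    print(f"{effective / inflected:.2%}")  # -> 11.22%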
As for the lexical nuclei of word-forms, knowing that the number of ROOTS in
DIINAR.1 is 6,546, with an estimated number of 400 PATTERNS, the figure would be:

6,546 ROOTS × 400 PATTERNS = 2,618,400 ROOT-PATTERN virtual links.
This contrasts with 119,693 existing lemma-entries (and 129,258 existing stems
including ‘broken plurals’, proper names being, for obvious reasons, left out). We
have seen in § 7.2.2.5 that the Buckwalter-Xerox lexicon includes 90,000 entries,
based on 4,930 ROOTS and 400 PATTERNS, the blind combination of which would
have led to 1,972,000 ROOT-PATTERN virtual links.
Remarkably – even though sources and research contexts did in fact differ – the
ratios of overall entries per ROOT in the two lexical resources are nearly identical:

DIINAR.1 lexical database: 119,693 entries / 6,546 ROOTS = 18.28
Buckwalter-Xerox lexicon: approx. 90,000 entries / 4,930 ROOTS = 18.25.
One is therefore brought to the conclusion that the ‘virtual’ results above are not
only absurdly enormous, they are also blurred: for lack of explicit decision proce-
dures, there would be no way in which a given analysed or generated ‘form’
could, or could not, be deemed part of the language. Restricting generated lexica
through rules involving grammar-lexis relations associated with actual lexical
entries is therefore necessary both for computational generation and analysis, and
in the building
relations that are valid within the boundaries of Arabic word-forms.
7.3.4 General Types of Rules and Relations
in the Word-Formatives Grammar
7.3.4.1 The Three Types of Contextual Relations Involved in the WFG
Rules involving word-formatives (NF and EF-s) are based on three types of rela-
tions (Dichy, 1987): ‘⇒’ (‘entails’), ‘⇏’ (‘excludes’), and ‘**’ (‘is compatible with’
or ‘admits’), the third of which is attached to the opposed pair of the first two as an
‘elsewhere’ relation of a special kind, directly connected to ambiguity in language-
analysis processes. In generation, all ‘compatibility’ (or ‘admit’) relations can in fact
be rewritten in terms of either ‘entail’ or ‘exclude’ rules. ‘Compatibility’ relations
are mostly useful in the formalization of recognition rules, when ambiguity is at
stake. They express relations that only appear in analysis.12
It is essential to note that automatic analysis and generation of linguistic data
are not to be considered as reverse processes (Desclés, 1983; Dichy, 1984, 1990,
1997).
12 Developments on ambiguity in Arabic NLP have been presented in Dichy (1990, 2000).
Statistics on ambiguity in ‘unvowelled’ written Arabic are given in Abbès (2004). For a
general reference on ambiguity in Arabic, see Arar (2003), and A. Farghaly’s contribu-
tion on “Lexical Ambiguity in Arabic Machine Translation Systems” in the same vol-
ume.
7.3.4.2 Rules Related to the Two General Fields of the Word-form Structure
Grammar-lexis relations are connected to the Word-Formatives Grammar
(WFG), which accounts for the rules and relations involved in Arabic word-form
structures (Dichy, 1987, 1990). In the well-known representation recalled in the
upper half of Figure 7.2 (see § 7.3.1 above), two general types of word-formatives
can be distinguished:
Formatives pertaining to the bound grammatical morphemes of the language,
and encompassed within the boundaries of the word-form, are called EF-s
(Extension Formatives).
The lexical nucleus of the word form (except in word-forms that only include
grammatical morphemes) is called a NF (Nucleus Formative).
The WFG includes, accordingly, three types of rules and/or relations, which are
directly attached to the triangle featured in the lower part of Figure 7.2. By con-
vention, in ‘PCL → SUF’, the arrow ‘→’ can be read either as ‘determines’ (⇒
‘entails’ or ⇏ ‘excludes’) or as ‘is compatible with’ (**), with reference to the
three types of relations mentioned in § 7.3.4.1. The two-headed arrow ‘↔’ is read in
the same way, with the addition of ‘reciprocity’ (‘and vice-versa’).
Types of rules and relations involved are:
[a] EF ↔ EF contextual rules and relations, such as PCL → SUF rules, e.g.:

PCL = Prep. {bi#, li#} → SUF = {+i, +in, +a, +n, +iyna, +iy, +ayni, +ay}

which can be phrased as: ‘if the proclitic is a preposition (i.e., a member of
the set between braces), it follows (or: this entails) that the suffix is one of the
indirect (or genitive, majruwr, مجرور) case suffixes’, the set of which is listed
between braces. The selection of the correct case-ending from the list is per-
formed through different types of morphological and syntactic rules.
[b] NF ↔ EF rules and relations. A simple example involves the major lexical
category to which the NF belongs, such as a PCL → NF-category rule. The above
rule, for instance, needs to be completed by the following:

PCL = Prep. → NF = Noun.
[c] NF – NF relations are morphological derivation links, which are, in a great
number of cases, not rule-predictable. Consider, in nouns, singular – ‘broken
plural’ links, for instance: sing. kitaAb, كِتاب – plur. kutub, كُتُب ‘book(s)’ vs
sing. sinaAn, سِنان – plur. Âasin~aƫ, /’asinna(t)/, أَسِنّة ‘spearhead(s)’, sing.
HimaAr, حِمار – plur. Hamiyr, حَمير, Humur, حُمُر and ÂaHmiraƫ, /’aHmira(t)/,
أَحْمِرة ‘donkey(s)’. In these nouns, the PATTERN of the singular is /R1iR2āR3/,
فِعال; the PATTERNS of ‘broken plurals’ appear to be sometimes /R1uR2uR3/,
فُعُل, sometimes /’aR1R2iR3a(t)/, أَفْعِلة, and sometimes another ‘broken plural’
form, including, in some cases, a suffixation plural form, e.g.: qiTaAr, قِطار
‘train’, shows two plural forms, quTur, قُطُر, which pertains to the
/R1uR2uR3/, فُعُل PATTERN of ‘broken plurals’, and qiTaAraAt, قِطارات, which
is constructed with the suffix +aAt, ـات. (A schematic rendering of these
three rule types in code is sketched after this list.)
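The following sketch renders the three rule types schematically in Python; the relation names follow § 7.3.4.1, while the rule encoding itself is an illustrative assumption, not the SAMIA-DIINAR.1 format:

    from enum import Enum

    class Rel(Enum):
        ENTAILS = "entails"            # the '⇒' relation
        EXCLUDES = "excludes"          # the '⇏' relation
        ADMITS = "is compatible with"  # '**' (useful in analysis only)

    RULES = [
        # [a] EF <-> EF: a prepositional proclitic entails a genitive suffix.
        ("PCL in {bi#, li#}", Rel.ENTAILS,
         "SUF in {+i, +in, +a, +n, +iyna, +iy, +ayni, +ay}"),
        # [b] NF <-> EF: a prepositional proclitic entails a nominal nucleus.
        ("PCL = Prep.", Rel.ENTAILS, "NF category = Noun"),
        # [c] NF - NF: derivation links are stored with the lexical entry
        # (w-specifiers), since they are mostly not rule-predictable.
        ("kitaAb (sing.)", Rel.ENTAILS, "kutub ('broken plural')"),
    ]

    for lhs, rel, rhs in RULES:
        print(f"{lhs} {rel.value} {rhs}")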
7.3.5 Rules of the WFG and Grammar-Lexis Specifiers
Rules and relations involved in the Word Formatives Grammar entail the need, for
the entries of a lexical database, to be associated with grammar-lexis relations.
The latter are essentially attached, in computational morphology, to types [b] and
[c] above. As mentioned above, they are called, in the SAMIA-DIINAR.1 ap-
proach, word-specifiers (w-specifiers).
7.3.5.1 The Two Types of NF ↔ EF Contextual Rules and Relations
A crucial point in the overall structure of the Arabic lexicon, which seems to have
been widely overlooked, appeared in the elaboration of the DIINAR.1 resource.
Grammar-lexis relations concerned with the nucleus are liable to involve, in addi-
tion to compositional combinations of nucleus and extension formatives, non-
compositional ones, i.e. ‘frozen’ or ‘lexicalized’ combinations (Dichy, 1984,
1990, 1997). Let us consider these two types:
[a] NF ↔ EF compositional relations, and related w-specifiers
Compositional NF ↔ EF combinations are, on the whole, easy to grasp, al-
though they include a few tricky aspects (see § 7.3.5.3). A simple example is that
of Stem → SUF rules, e.g.:

Stem = diptote → SUF = {u, a, i}
This rule can be rephrased as: ‘a stem whose declension is diptote (mamnuwȢ
mina AlS~arf, /mamnūȢ mina S-Sarf/, ممنوع من الصرف) entails case-endings be-
longing to the listed set’. An additional rule restricts the occurrence of SUF /i/ with
diptote nouns and adjectives to construct-state syntactic structures (΀iDaAfaƫ,
/’iDāfa(t)/, إضافة), e.g.: min maȢaAlimi Al-ȢaASimaƫ, /min maȢālimi l-Ȣāṣima(t)/,
من معالمِ العاصمة ‘from the monuments of the capital’. Other suffixes are accounted
for in different rules.
Many other examples could be given: nouns with ‘broken plurals’ often ex-
clude (⇏) suffixed plural forms; intransitive verbs exclude ECL complement pro-
nouns; verbs that only admit non-human complements exclude a subset of the
ECL-s, such as -hum, 3rd person plur. masc., or -ki, 2nd person sing. fem., which
can only refer to human entities.
[b] NF ↔ EF non-compositional lexicalized relations, and related w-specifiers
In Arabic, as in other Semitic languages of the same family, in addition to
ROOT and PATTERN derivation, one finds lexical derivation by means, essentially,
of suffixation.13 With Type 1 stems (§ 7.3.1), featuring ROOT and PATTERN combi-
nation, the two means of derivation are liable to add up. Leaving aside, in the pre-
sent contribution, phrasal compound expressions in which a given syntactic struc-
ture is frozen (as in majmaȢ Ȣilmiy~, مجمع علمي ‘Science Academy’, jamȢ Al-
maȢluwmaAt, جمع المعلومات ‘data gathering’, or ÂaHaATa fulaAnAã ȢilmAã bi-,
/’aHāTa fulānan Ȣilman bi-/, أحاط فلاناً علماً بـ ‘to inform someone of’), one can dis-
tinguish between two types of lexical entries based on strictly morphological means
(Dichy, 1997):
[1] ‘Simple’ lexical entries coincide with the stem (or nucleus), e.g.: mak-
tab, مكتب ‘office’, ‘bureau’. The entry can, as expected, be inserted in a
word-form, such as wa-bi-maktab-i-naA, وبمكتبنا ‘and by our office’
(‘and-by-office-genitive case /i/-of us’).
[2] Morphological compound entries (as opposed to phrasal compounds)
feature a ‘lexical freezing’ of the combination of a given nucleus with a
given extension formative. The morphological compound is coded in the
database as a full entry of its own (a w-specifier is, in addition, associated
with the stem, in order to account for occurrences that have not under-
gone a ‘lexical freezing’ of the NF ↔ EF relation). The lexical entry
SuHuf-iy, صُحُفي ‘journalist’, for instance, is a compound entry:
(a) it does not coincide with the stem, or nucleus, it encompasses. The
stem is, here, SuHuf, صُحُف (otherwise meaning ‘sheets’, ‘papers’),
which features a combination of the ROOT /S-H-f/ and the PATTERN
/R1uR2uR3/ (فُعُل);
(b) it includes on the other hand the extension formative +iy~ (yaA' Al-
n~isbaƫ, /yā’ al-nisba(t)/, ياء النسبة, the ‘relative adjective or noun’
morpheme). An analogous example, going as far back as the IInd
century of the Hijra/VIIIth century CE, is kutub+iy~, كُتُبيّ ‘librarian’
(in the medieval meaning of the word).
A given word can either be a frozen morphological compound
or result from composition: jaAmiȢaƫ, /jāmiȢa(t)/, جامعة can be
analysed either as the morphological compound jaAmiȢ+aƫ,
/jāmiȢ+a(t)/, meaning ‘mosque’, which is linked in the lexicon
with the ‘broken plural’ jawaAmiȢ, جوامع, or as the feminine of the
active participle jaAmiȢ+aƫ, /jāmiȢ+a(t)/, meaning ‘collecting’,
i.e. ‘she who collects’, or ‘compiles’, or ‘brings together’. This is a
very frequent phenomenon.
Morphological compounds can, of course, also be inserted in a
word-form, e.g. wa-bi-SuHuf+iy~+i-naA, وبصُحُفيِّنا ‘and by our
journalist’ (= ‘and_by_journalist_genitive case /i/_of us’).
13 In this representation, affixed elements included in PATTERNS, such as /ma/ in mawȢid,
موعد ‘promise’, ‘pledge’ (or ‘appointment’, ‘date’), are not considered as prefixes or
suffixes (following Cohen (1961) and Desclés (1983), as well as traditional Arabic
grammar) – see § 7.3.1.
Another example is found in proper names, such as Al-Kuwayt,
الكويت, in which the lexicalized EF is the proclitic article Al-.
The above distinction is methodologically crucial. It gives additional interpretative
evidence to the demonstration presented in § 7.3.3.2, according to which deriva-
tion of ‘virtual’ lexical entries from ROOT and PATTERN is not a sufficient basis for
the Arabic lexicon. But the issue goes much further: ROOT and PATTERN deriva-
tion is complemented by ‘external’ derivation, i.e. by the lexicalization of the NF
↔ EF combination. The latter is only found in nouns (Dichy, 1984) and cannot be
predicted by rules, because the process described above only occurs to answer the
need, for the lexicon of a given language, to build a new entry when a newly en-
countered entity requires a name. This is correlated with the non-compositional
nature of the NF ↔ EF relation. It follows, from a computational perspective, that
morphological compounds can only be recognized or generated with the help of a
lexical resource.
7.3.5.2 Morphological NF – NF Derivation Links, and Related W-Specifiers
Another type of relation is NF – NF linking combinations, also called, in Semitic
studies, ‘internal’ derivation, because these links feature a variation in PATTERN,
the ROOT remaining constant. Such derivations have to be encoded as w-specifiers,
whenever the morphological link is not strictly rule-predictable, which occurs in a
majority of stems (Dichy, 1987, 1990; Hassoun, 1987).
This is the case, for instance, in a wide number of singular ↔ ‘broken plural’
links in nouns or adjectives (exemplified in § 7.3.4.2[c]), as well as in most ‘per-
fective’ ↔ ‘imperfective’ links (maADƭ, /mādin/ ↔ muDaAriȢ, /muDāriȢ/, ماضٍ
↔ مضارع) in verbs of ‘simple’ PATTERNS (Al-fiȢl Al-mujar~ad, الفعل المجرّد). In the
same subcategory of verbs, the verb ↔ infinitive-form link (fiȢl ↔ maSdar,
فعل ↔ مصدر), and other similar ones (such as verb ↔ analogous adjective,
fiȢl ↔ Sifaƫ mušab~ahaƫ, /Sifa(t) mušabbaha(t)/, فعل ↔ صفة مشبّهة), also need to
be encoded lexically.
Restrictions on conjugation paradigms, such as the passive or the imperative,
which can be described in terms of semantic rules (based on features such as agen-
tivity, and human/non human complements, etc., cf. Ammar & Dichy (1999), pp.
17, 19–20), also pertain to NF ↔ NF links. In a lexical resource, they need to be
encoded as w-specifiers, because the entries are not actual linguistic signs (with
‘signifiant’ and ‘signifié’ features), but mere chains of characters associated with a
set of linguistic specifiers (Dichy, 1997).
7.3.5.3 Morphological Derivation Links Including a Morphological Compound
In many cases, morphological compound entries also feature a morphological
derivation link. For example, the morphological compound madras+aƫ,
/madras+a(t)/, مدرسة ‘school’ is associated with the broken plural form madaAris,
مدارس. By contrast, the analogous entry maktab+aƫ, /maktab+a(t)/, مكتبة ‘library’,
has a suffixed plural form maktaba+At, مكتبات. This is due to the fact that the ex-
pected broken plural makaAtib, مكاتب is already associated in the lexicon with
maktab, مكتب ‘office’, ‘bureau’ (which is a ‘simple’ lexical entry – see
§ 7.3.5.1[b] above).
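How such entries and their non-predictable NF – NF links might be stored can be sketched as follows (field names and structure are illustrative, not the DIINAR.1 schema):

    # Each entry carries its own plural link as a w-specifier, since the
    # choice between a broken and a suffixed plural is not rule-predictable.
    LEXICON = {
        "maktab":   {"gloss": "office, bureau", "plural": ("broken",   "makaAtib")},
        "maktabaƫ": {"gloss": "library",        "plural": ("suffixed", "maktabaAt")},
        "madrasaƫ": {"gloss": "school",         "plural": ("broken",   "madaAris")},
    }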
7.4 Conclusion: Exhaustive Coverage of Morphological
Features, and the Question of What Lies Beyond
The third section of this contribution has presented the main types of grammar-
lexis relations that can be observed at word-form level in Arabic, and produced
evidence for the crucial need for a comprehensive mapping of rules and relations
involving the lexical nucleus of words in the computational morphology of the
language.
To make this presentation clearer, three questions remain to be answered:
(1) Are the rules and relations in consideration finite in number? (2) What is the
actual extent of their coverage of word-form structures, relations and rules? (3)
And what lies ahead, beyond word-level analysis? These issues are crucial from a
computational perspective, because they are concerned with, respectively, the fea-
sibility of the task of associating a whole lexicon with w-specifiers, the reliability
of the software drawing on the lexical resource in consideration, and the compati-
bility of the results achieved in computational morphology with further develop-
ments involving sentence and text analyses.
7.4.1 Finiteness in Number of W-Specifiers, and the Feasibility
of their Association with the Entries of an Entire Lexical Database
Arabic EF ↔ EF rules belong to finite sets for obvious reasons: EF-s are finite in
number, and a finite set of relations is at stake, because EF-s belong to the
grammatical morphemes of the language.
Although lexical NF-s belong to open sets, the finiteness of NF ↔ EF w-
specifiers, i.e. of the number of w-specifiers to be associated with the entire set of
lexical entries, can be demonstrated as follows (Dichy, 1997):
(a) Because EF-s belong to finite sets, it follows that a finite set of features con-
straining NF ↔ EF relations can be established.
(b) This finite set of features, which corresponds to morpho-syntactic w-
specifiers, can in turn be associated with every entry of a lexical database, i.e.,
with each NF.
In other words, both the morphological grammar-lexis relations and the task of
assigning word-level specifiers to every entry of a lexical resource can be shown
to be finite. Though lengthy, the task can consequently be performed in a limited
period of time.
7.4.2 Finiteness and Exhaustiveness in the Coverage of Data
at Morphological Level, and the Reliability of Resulting Lexical
Resources and Analyzers
Because grammar-lexis relations are liable to be embedded in finite sets of w-
specifiers, it follows that the coverage of data at word (or morphological) level can
be exhaustive. In other terms, finite sets of w-specifiers can produce exhaustive
coverage of data within the boundaries of the word-form. This ensures a very high
level of reliability in the software and analyzers drawing on a language resource
such as DIINAR.1.
The above demonstration naturally goes beyond the mere case of the Arabic
language. The criteria of finiteness and exhaustiveness in linguistic sets of features
were introduced by Mel’čuk (1982) in the general context of morphological
description, and later taken up by him in lexicography. On the other hand, the
demonstration also partly explains, in our view, why the morphological and morpho-
tactic rules and relations accounting for word-form generation and/or analysis can be
implemented very effectively using finite-state transducers (Beesley & Karttunen,
2003; Karttunen, 1994).
7.4.3 Beyond Computational Morphology: Grammar-lexis Relations
at Sentence and Text Levels
We now come to our concluding remark, which deals with the usefulness, in
sentence and text analysis, of the morphosyntactic specifiers included in rules and
relations operating at word-form level. Grammar-lexis relations at sentence level (s-
specifiers), and furthermore at text level (t-specifiers) will, for obvious reasons,
require different approaches. In the related resources, contextual relations need to
be established between categories and sets of morphemes on the one hand, and
sets of denotative and referential semantic categories and features on the other.
The mapping of grammar-lexis relations already performed in the finite domain of
computational morphology can nevertheless be considered substantial progress
towards the lexical resources needed at sentence and text level, owing to the high
degree of complexity of word-form structures in Arabic, compared to, say, French or
English. W-specifiers already include a number of semantically related syntactic
features needed at higher levels of analysis, such as (in)transitivity (including re-
lated prepositional structures) in verbs, the type of plural according to the lexical sub-
category a given nominal form belongs to, and many other lexical categorisations
and descriptions, which have been exemplified in various sections of this chapter.
References
Abbès, R. (2004). La conception et la réalisation d’un concordancier électronique pour
l’arabe. Thèse de doctorat en sciences de l’information, ENSSIB/INSA, Lyon.
Abbès, R., Dichy, J. & Hassoun, M. (2004). The Architecture of a Standard Arabic lexical
database: some figures, ratios and categories from the DIINAR.1 source program. In
Proceedings of the COLING-04 Workshop on Computational Approaches to Arabic
Script-based Languages (pp. 15–22), Geneva.
Ammar, S. & Dichy, J. (1999). Les verbes arabes. Paris: Hatier. Fully Arabic version, with
specific introduction: Al-’afȢƗlu l-Ȣarabiyya, الأفعال العربية (same publisher and year).
Arar, M. (2003). DƗhiratu l-labsi fƯ l-Ȣarabiyya [The phenomenon of ambiguity in Arabic,
ظاهرة اللبس في العربية]. Amman: Dār Wā’il.
Aronoff, M. (1994). Morphology by Itself: Stems and Inflectional Classes. Cambridge, MA:
MIT Press.
Beesley, K. (1989). Computer Analysis of Arabic Morphology: A two-level approach with
detours. In Comrie, B. & Eid, M. (Eds.) (1991), Perspectives on Arabic Linguistics III:
Papers from the Third Annual Symposium on Arabic Linguistics (pp. 155–172).
Amsterdam: John Benjamins.
Beesley, K. (2001). Finite-state morphological analysis and generation of Arabic at Xerox
research: Status and plans in 2001. In Proceedings of the ACL-01 Workshop on Arabic
Language Processing: Status and Prospects (pp. 1–8), Toulouse, France.
Beesley, K. & Karttunen, L. (2003). Finite State Morphology. Stanford, CA: CSLI Publications.
Bentin, S. & Frost, R. (1995). Morphological factors in visual word identification in
Hebrew. In Feldman L.B., (Ed.), Morphological aspects of language processing
(pp. 271–292). Hillsdale, NJ: Erlbaum.
Buckwalter, T. (2002). Buckwalter Arabic Morphological Analyzer Version 1.0. Lin-
guistic Data Consortium, Philadelphia. LDC catalog number LDC2002L49 and ISBN
1-58563-257-0.14
Cantineau, J. (1950a). La notion de ‘schème’ et son altération dans diverses langues sémiti-
ques. In Semitica, 3, 73–83.
Cantineau, J. (1950b). Racines et schèmes. In Mélanges offerts à William Marçais. Paris :
Maisonneuve.
Cassuto, P. (2000). Le classement dans les dictionnaires de l’hébreu. In Cassuto, P. &
Larcher, P. (Eds.), La sémitologie, aujourd’hui (pp. 133–158).
Cassuto, P. & Larcher, P. (Eds.). (2000). La sémitologie, aujourd’hui. Travaux du Cercle
linguistique d’Aix-en-Provence n°16. Publications de l’Université de Provence.
Cohen, D. (1961). Essai d'une analyse automatique de l'arabe. T.A. informations. Reprod. in
Cohen, D. Études de linguistique sémitique et arabe (pp. 49–78). The Hague/Paris:
Mouton.
Desclés, J.-P., dir. (1983). (H. Abaab, J.-P. Desclés, J. Dichy, D.E. Kouloughli, M.S. Ziadah).
Conception d’un synthétiseur et d’un analyseur morphologiques de l’arabe, en vue
d’une utilisation en Enseignement assisté par Ordinateur. Rapport rédigé à la demande
du Ministère des Affaires étrangères.
Diab, M. & Resnik, P. (2001). An unsupervised method for word sense tagging using paral-
lel corpora. In Proceedings of the 40th Annual Meeting of the Association for Compu-
tational Linguistics (pp. 255–262), Philadelphia, PA.
Dichy, J. (1984). Vers un modèle d’analyse automatique du mot graphique non-vocalisé en
arabe. Presented at the Conference on “Communication entre langues européennes et
14 Retrieved December 16, 2006, from http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?
catalogId=LDC2002L49
langues orientales”, Montvillargenne, Oise. Revised version in Dichy, J. & Hassoun, M.
(Eds.), (1989), pp. 92–158.
Dichy, J. (1987). The SAMIA Research Program, Year Four, Progress and Prospects. In
Processing Arabic Report 2 (pp. 1–26). T.C.M.O., Nijmegen University, Netherlands.
Dichy, J. (1990). L'écriture dans la représentation de la langue : la lettre et le mot en
arabe. Doctorat d'État, Université Lumière Lyon 2, Lyon.
Dichy, J. (1993). Deux grands ‘mythes scientifiques’ relatifs au système d’écriture de
l’arabe. In Savoir, images, mirages, Journées d’Études arabes, Special issue of
l’Arabisant (pp. 32–33). Paris: Association Française des Arabisants.
Dichy, J. (1997). Pour une lexicomatique de l’arabe : l’unité lexicale simple et l’inventaire
fini des spécificateurs du domaine du mot. Meta 42, 291–306. Presses de l’Université
de Montréal.
Dichy, J. (2000). Morphosyntactic Specifiers to be associated to Arabic Lexical Entries -
Methodological and Theoretical Aspects. In Proceedings of ACIDA 2000 (Vol. ‘Cor-
pora and Natural Language Processing’, pp. 55–60), Monastir, Tunisia.
Dichy, J. (2003). Sens des schèmes et sens des racines en arabe: le principe de figement
lexical (PFL) et ses effets sur le lexique d’une langue sémitique. In Rémi-Giraud, S. &
Panier, L., dir., La polysémie ou l’empire des sens (pp. 189–211). Lyon: Presses Uni-
versitaires de Lyon.
Dichy, J. (2005). Spécificateurs engendrés par les traits [±animé], [±humain], [±concret] et
structures d’arguments en arabe et en français. In Béjoint, H. & Maniez, F. (Eds.), De
la mesure dans les termes, Actes du colloque en hommage à Philippe Thoiron (pp.
151–181). Lyon: Presses Universitaires de Lyon.
Dichy, J. Braham, A., Ghazali, S. & Hassoun, M. (2002). La base de connaissances linguis-
tiques DIINAR.1 (DIctionnaire INformatisé de l’ARabe, version 1). In Braham, A.
(Ed.), Proceedings of the International Symposium on the Processing of Arabic, Uni-
versité de la Manouba, Tunisia.
Dichy, J. & Farghaly, A. (2003). Roots and Patterns vs. Stems plus Grammar-Lexis Speci-
fications: on what basis should a multilingual lexical database centred on Arabic be
built? In Proceedings of the IXth MT Summit Workshop on Machine Translation for
Semitic Languages: Issues and Approaches (pp. 1–8), New Orleans.
Dichy, J. & Hassoun, M. (Eds.) (1989). Simulation de modèles linguistiques et Enseigne-
ment Assisté par Ordinateur de l’arabe – Travaux SAMIA I. Paris: Conseil Internatio-
nal de la Langue Française.
Dien, D., Kiem, H. & Hovy, E. (2003). BTL: a Hybrid Model for English-Vietnamese Ma-
chine Translation. In Proceedings of the IXth MT Summit (pp. 87–94), New Orleans.
Ditters, E. (1992). A Formal Approach to Arabic Syntax: The Noun phrase and the Verb
Phrase. Ph.D. dissertation, Catholic University of Nijmegen, Netherlands.
Farghaly, A. (1987). Three Level Morphology. Paper presented at the Arabic Morphology
Workshop, Linguistic Summer Institute, Stanford, CA.
Farghaly, A. (1994). Discontinuity in the Lexicon: A Case from Arabic Morphology. In In-
ternational Conference on Arabic Linguistics, The American University in Cairo,
Cairo, Egypt.
Fassi-Fehri, A. (1997). Al-MaȢjama wa-t-taxTƯT – NaDarƗt jadƯda fƯ qaDƗyƗ l-luȖa l-
Ȣarabiyya [Lexicography and language planning: Arabic language matters reconsid-
ered, المعجمة والتخطيط – نظرات جديدة في قضايا اللغة العربية]. Casablanca, Morocco: Al-Markaz
al-thaqāfiyy al-Ȣarabiyy.
Forster, G., Grandrabur, S., Langlais, P., Plamondon, P., Russel, G. & Simard, M. (2003).
Statistical Machine Translation: Rapid Development with limited Resources. In Pro-
ceedings of the IXth MT Summit (pp. 110–117), New Orleans.
Frost, R., Deutsch, A. & Forster, K.I. (2000). Decomposing morphologically complex
words in a non linear morphology. Journal of Experimental Psychology: Learning,
Memory and Cognition, 26, 751–65.
Frost, R., Forster, K.I. & Deutsch, A. (1997). What can we learn from the morphology of
Hebrew? A masked priming investigation of morphological representation. Journal of
Experimental Psychology: Learning, Memory and Cognition, 23, 829–856.
Geith, M. & El-Saadany, T. (1987). Arabic morphological analyzer on a personal computer.
Presented at the Arabic Morphology Workshop, Linguistic Summer Institute, Stanford,
CA.
Ghenima, M. (1998). Analyse morpho-syntaxique en vue de la voyellation assistée par or-
dinateur des textes écrits en arabe. Ph.D. dissertation, ENSSIB/Université Lyon 2.
Grainger, J., Dichy, J., El-Halfaoui, M. & Bamhamed, M. (2003). Approche expérimentale
de la reconnaissance du mot écrit en arabe. In Jaffré, J.-P. (Ed.), Dynamiques de
l’écriture: approches pluridisciplinaires. Faits de langue, 22, 77–86.
Hans Wehr. (1979). A Dictionary of Modern Written Arabic. 4th edition, edited by J. Milton
Cowan. Wiesbaden: Harrassowitz.
Hassoun, M. (1987). Conception d’un dictionnaire pour le traitement automatique de
l’arabe dans différents contextes d’application. Ph.D. dissertation (thèse d’État),
Université Lyon 1.
Hlal, Y. (1979). Méthode d'apprentissage pour l'analyse morphosyntaxique (expérimentée
dans le cas de l'arabe et du français). Ph.D. dissertation, Université Paris-Sud, Centre
d'Orsay.
Hlal, Y. (1985a). Morphology and syntax of the Arabic language. Arab School of Sciences
and Technology: Informatics 4C, 1–8.
Hlal, Y. (1985b). Morphological analysis of Arabic speech. In Workshop Papers
Kuwait/Proceedings of Kuwait Conference on Computer Processing of the Arabic
Language (Section 13, pp. 273–294).
Karttunen, L. (1994). Constructing Lexical Transducers. In Proceedings of COLING-94
(pp. 406–411), Kyoto, Japan.
Karttunen, L. & Beesley, K.R. (2005). Twenty-five years of finite-state morphology. In
Arppe, A., Carlson, L., Lindén, K., Piitulainen, J., Suominen, M., Vainio, M., Westerlund,
H. & Yli-Jyrä, A. (Eds.), Inquiries into Words, Constraints and Contexts. Festschrift
for Kimmo Koskenniemi on his 60th Birthday (2005). CSLI Studies in Computa-
tional Linguistics ONLINE, pp. 71–83. Copestake, A. (Series Ed.). Stanford, CA:
CSLI Publications.
McCarthy, J. (1981). A Prosodic Theory of Nonconcatenative Morphology. Linguistic In-
quiry, 12, 373–418.
McCarthy, John J. & Prince, Alan S. (1996). Prosodic morphology. Technical report 32,
Rutgers University Center for Cognitive science.
Mel’čuk, I. A. (1982). Towards a Language of Linguistics: A System of Formal Notions for
Theoretical Morphology. München: Wilhelm Fink Verlag.
Nikkhou, M. (Ed.) (2004). NEMLAR International Conference on Arabic Language Re-
sources and Tools, Cairo. Paris: ELDA.
Ouersighni, R. (2001). A major offshoot of the DIINAR-MBC project: AraParse, a mor-
pho-syntactic analyzer of unvowelled Arabic texts. In ACL-01 Workshop on Arabic
Language Processing: Status and Prospects (pp. 66–72), Toulouse, France.
Ouersighni, R. (2002). La conception et la réalisation d’un système d’analyse morpho-
syntaxique robuste pour l’arabe: utilisation pour la détection et le diagnostic des fau-
tes d’accord. Ph.D. dissertation, ENSSIB/Université Lyon 2.
Rogati, M., McCarley, S. & Yang, Y. (2003). Unsupervised Learning of Arabic Stemming
Using a Parallel Corpus. In Proceedings of the 41st Annual Meeting of the Association
for Computational Linguistics (pp. 391–398), Sapporo, Japan.
Roman, A. (1990). Grammaire de l’arabe. Paris: P.U.F., coll. “Que sais-je?”.
Roman, A. (1999). La création lexicale en arabe, ressources et limites de la nomination
dans une langue humaine naturelle. Presses Universitaires de Lyon.
Rousseau, J. (1987). La découverte de la racine en sémitique par l’idéologue Volney. His-
toriographia Linguistica, 14(3), 341–365.
Sampson, G. (1985). Writing systems. Stanford University Press.
Schafer, C. & Yarowsky, D. (2003). A Two-Level Syntax-Based Approach to Arabic-
English Statistical Machine Translation. In Proceedings of the IXth MT Summit Work-
shop on Machine Translation for Semitic Languages: Issues and Approaches (pp. 45–52),
New Orleans.
Soudi, A., Cavalli-Sforza, V. & Jamari, A. (2001). A Computational Lexeme-Based Treat-
ment of Arabic Morphology. In ACL-01 Workshop on Arabic Language Processing:
Status and Prospects (pp. 155–162), Toulouse, France.
Troupeau, G. (1984). La notion de ‘racine’ chez les grammairiens arabes anciens. In
Auroux, S., Glatiny, M., Joly, A., Nicolas, A. & Rosier, I. (Eds.), Matériaux pour une
histoire des théories linguistiques, pp. 239–245. Presses Universitaires de Lille.
Zaafrani, R. (2002). Développement d’un environnement interactif d’apprentissage avec
ordinateur de l’arabe langue étrangère. Ph.D. dissertation, ENSSIB/Université
Lyon 2.
Zwiep, I. E. (1996). The Hebrew linguistic tradition of the Middle Ages. Histoire Épistémologie
Langage, 18(1), 41–61.
PART III
Empirical Methods
8
Learning to Identify Semitic Roots
Ezra Daya¹, Dan Roth² and Shuly Wintner³
¹ Department of Computer Science, University of Haifa and ClearForest Ltd.,
Ezra.Daya@ClearForest.com
² Department of Computer Science, University of Illinois at Urbana-Champaign,
danr@cs.uiuc.edu
³ Department of Computer Science, University of Haifa, shuly@cs.haifa.ac.il
Abstract: The morphology of Semitic languages is unique in the sense that the major word-formation
mechanism is an inherently non-concatenative process of interdigitation, whereby two
morphemes, a root and a pattern, are interwoven. Identifying the root of a given word in
a Semitic language is an important task, in some cases a crucial part of morphological
analysis. It is also a non-trivial task, which many humans find challenging. We present a
machine learning approach to the problem of extracting roots of Semitic words. Given the
large number of potential roots (thousands), we address the problem as one of combining
several classifiers, each predicting the value of one of the root’s consonants. We show that
when these predictors are combined by enforcing some fairly simple linguistic constraints,
high accuracy, which compares favorably with human performance on this task, can be
achieved.
8.1 Introduction
The standard account of word-formation processes in Semitic languages describes
words as combinations of two morphemes: a root and a pattern.1 The root consists of
consonants only, by default three (although longer roots are known), called radicals.
The pattern is a combination of vowels and, possibly, consonants too, with ‘slots’ into
which the root consonants can be inserted. Words are created by interdigitating roots
into patterns: the first radical is inserted into the first consonantal slot of the pattern,
the second radical fills the second slot and the third fills the last slot. See Shimron
(2003) for a survey of linguistic and psycho-linguistic issues in Semitic root-based
morphology.
Identifying the root of a given word is an important task. Although existing
morphological analyzers for Hebrew only provide a lexeme (which is a combination
1 An additional morpheme, vocalization, is used to abstract the pattern further; for the
present purposes, this distinction is irrelevant.
of a root and a pattern), for other Semitic languages, notably Arabic, the root is
an essential part of any morphological analysis simply because traditional dictio-
naries are organized by root, rather than by lexeme. Furthermore, roots are known to
carry some meaning, albeit vague. We believe that this information can be useful for
computational applications.
We present a machine learning approach, augmented by limited linguistic
knowledge, to the problem of identifying the roots of Semitic words. To the
best of our knowledge, this is the first application of machine learning to this
problem (but see also Marsi, van den Bosch, and Soudi, 2005; Habash and
Rambow, 2005). While there exist programs that can extract the roots of words
in Arabic (Beesley, 1998a; Beesley, 1998b) and Hebrew (Choueka, 1990), they
are all dependent on labor-intensive construction of large-scale lexicons which are
components of full-scale morphological analyzers. Note that Buckwalter (2002)’s
Arabic morphological analyzer only uses “word stems – rather than root and
pattern morphemes – to identify lexical items. (The information on root and pattern
morphemes could be added to each stem entry if this were desired.)” The challenge
of our work is to automate this process, avoiding the bottleneck of having to
laboriously list the root and pattern of each lexeme in the language, and thereby
gain insights that can be used for more detailed morphological analysis of Semitic
languages.
As we show in Section 8.2, identifying roots is a non-trivial problem even
for humans, due to the complex nature of Semitic derivational and inflectional
morphology and the peculiarities of the orthography. From a machine learning
perspective, this is an interesting test case of interactions among different yet inter-
dependent classifiers. We focus on Hebrew in the first part of this chapter; after
presenting the linguistic data in Section 8.3, we discuss a simple baseline learning
approach (Section 8.4) and then propose two methods for combining the results of
interdependent classifiers (Section 8.5), one which is purely statistical and one which
incorporates linguistic constraints, demonstrating the improvement of the hybrid
approach. Then, the same technique is applied to Arabic in Section 8.6 and we
demonstrate comparable improvements. We conclude with suggestions for future
research. An early version of this work was published as Daya, Roth, and Wintner
(2004).
8.2 Linguistic Background
In this section we refer to Hebrew only, although much of the description is
valid for other Semitic languages as well. As an example of root-and-pattern
morphology, consider the Hebrew roots g.d.l, k.t.b and r.$.m and the patterns
haCCaCa, hitCaCCut and miCCaC, where the ‘C’s indicate the slots. When the roots
combine with these patterns the resulting lexemes are hagdala, hitgadlut, migdal,
haktaba, hitkatbut, miktab, har$ama, hitra$mut, mir$am, respectively. After the root
combines with the pattern, some morpho-phonological alternations take place, which
may be non-trivial: for example, the hitCaCCut pattern triggers assimilation when
the first consonant of the root is t or d: thus, d.r.$+hitCaCCut yields hiddar$ut. The
same pattern triggers metathesis when the first radical is s or $: s.d.r+hitCaCCut
yields histadrut rather than the expected *hitsadrut. Semi-vowels such as w or
y in the root are frequently combined with the vowels of the pattern, so that
q.w.m+haCCaCa yields haqama, etc. Frequently, root consonants such as w or y are
altogether missing from the resulting form.
These matters are complicated further due to two sources: first, the standard
Hebrew orthography leaves most of the vowels unspecified. It does not explicate a
and e vowels, does not distinguish between o and u vowels and leaves many of the i
vowels unspecified. Furthermore, the single letter w is used both for the vowels o and
u and for the consonant v, whereas i is similarly used both for the vowel i and for the
consonant y. On top of that, the script dictates that many particles, including four of
the most frequent prepositions, the definite article, the coordinating conjunction and
some subordinating conjunctions all attach to the words which immediately follow
them. Thus, a form such as mhgr can be read as a lexeme (“immigrant”), as m-hgr
“from Hagar” or even as m-h-gr “from the foreigner”. Note that there is no determin-
istic way to tell whether the first m of the form is part of the pattern, the root or a
prefixing particle (the preposition m “from”).
The Hebrew script has 22 letters, all of which can be considered consonants.
The number of tri-consonantal roots is thus theoretically bounded by 22³, although
several phonological constraints limit this number to a much smaller value. For
example, while roots whose second and third radicals are identical abound in Semitic
languages, roots whose first and second radicals are identical are extremely rare
(see McCarthy (1981) for a theoretical explanation). To estimate the number of roots
in Hebrew we compiled a list of roots from two sources: a dictionary (Even-Shoshan,
1993) and the verb paradigm tables of Zdaqa (1974). The union of these yields a list
of 2152 roots.2
While most Hebrew roots are regular, many belong to weak paradigms, which
means that root consonants undergo changes in some patterns. Examples include i
or n as the first root consonant, w or i as the second, i as the third and roots whose
second and third consonants are identical. For example, consider the pattern hCCCh.
Regular roots such as p.s.q yield forms such as hpsqh. However, the irregular roots
n.p.l, i.c.g, q.w.m and g.n.n in this pattern yield the seemingly similar forms hplh,
hcgh, hqmh and hgnh, respectively. Note that in the first and second examples, the
first radical (n or i) is missing, in the third the second radical (w) is omitted and in
the last example one of the two identical radicals is omitted. Consequently, a form
such as hC1C2h can have any of the roots n.C1.C2, C1.w.C2, C1.i.C2, C1.C2.C2 and
even, in some cases, i.C1.C2.
While the Hebrew script is highly ambiguous, ambiguity is somewhat reduced for
the task we consider here, as many of the possible lexemes of a given form share
2 Only tri-consonantal roots are counted. Ornan (2003) mentions 3407 roots, whereas the
number of roots in Arabic is estimated to be 10,000 (Darwish, 2002).
the same root. Still, in order to correctly identify the root of a given word, context
must be taken into consideration. For example, the form $mnh has more than a dozen
readings, including the adjective “fat” (feminine singular), which has the root $.m.n,
and the verb “count”, whose root is m.n.i, preceded by a subordinating conjunction.
In the experiments we describe below we ignore context completely, so our results
are handicapped by design.
8.3 Data and Methodology
We take a machine learning approach to the problem of determining the root of a
given word. For training and testing, a Hebrew linguist manually tagged a corpus
of 15,000 words (a set of newspaper articles). Of these, only 9752 were annotated
with root information; the reason for the gap is that some Hebrew words, mainly
borrowed but also some frequent words such as prepositions, do not have roots; we
further eliminated 168 roots with more than three consonants and were left with 5242
annotated word types, exhibiting 1043 different roots. Table 8.1 shows the distri-
bution of word types according to root ambiguity.
Table 8.2 provides the distribution of the roots of the 5242 word types in our
corpus according to root type, where Ci is the i-th radical (note that some roots may
belong to more than one group, so the total is greater than 100%).
To ensure statistical reliability, in all the experiments discussed in the
remainder of this chapter (unless otherwise mentioned) we performed 10-fold cross
validation runs for every classification task during evaluation. We also divided the
test corpus into two sets: a development set of 4800 words and a held-out set
of 442 words. Only the development set was used for parameter tuning. A given
example is a word type with all its (manually tagged) possible roots. In the exper-
iments we describe below, our system produces one or more root candidates for
each example. For each example, we define tp as the number of candidates correctly
produced by the system; fp as the number of candidates which are not correct
roots; and fn as the number of correct roots the system did not produce. As usual,
we define precision as tp/(tp + fp) and recall as tp/(tp + fn); we then compute the
F-measure for each example (with β = 1) and (macro-) average to obtain the
system's overall F-measure.
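As a minimal illustration of this metric (our own sketch, not part of the original chapter; the function names are ours), the per-example F-measure and its macro-average can be computed as follows:

def f_measure(gold_roots, predicted_roots):
    # Per-example precision, recall and F-measure (beta = 1).
    gold, pred = set(gold_roots), set(predicted_roots)
    tp = len(gold & pred)   # candidates correctly produced
    fp = len(pred - gold)   # candidates that are not correct roots
    fn = len(gold - pred)   # correct roots the system did not produce
    precision = tp / (tp + fp) if pred else 0.0
    recall = tp / (tp + fn) if gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def macro_f(examples):
    # Macro-average the per-example F-measure over the whole test set.
    return sum(f_measure(g, p) for g, p in examples) / len(examples)

# A word with the single gold root $.m.n, predicted correctly:
print(f_measure(["$.m.n"], ["$.m.n"]))   # 1.0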
To estimate the difficulty of this task, we asked six human subjects to perform it.
Subjects were asked to identify all the possible roots of all the words in a list of 200
words (without context), randomly chosen from the test corpus. All subjects were
computer science graduates, native Hebrew speakers with no linguistic background.
The average precision of humans on this task is 83.52%, and with recall at 80.27%,
F-measure is 81.86%.

Table 8.1. Root ambiguity in the corpus

Number of roots     1      2     3    4
Number of words     4886   335   18   3

Table 8.2. Distribution of root paradigms

Paradigm   Number   Percentage (%)
C1=i       414      7.90
C1=w       28       0.53
C1=n       419      7.99
C2=i       297      5.66
C2=w       517      9.86
C3=h       18       0.19
C3=i       677      12.92
C2=C3      445      8.49
Regular    3061     58.41

Two main reasons for the low performance of humans are the
lack of context and the ambiguity of some of the weak paradigms.
8.4 A Machine Learning Approach
To establish a baseline, we first performed two experiments with simple baseline
classifiers. In all the experiments described in this chapter we use SNoW (Roth,
1998) as the learning environment, with winnow as the update rule (using perceptron
yielded comparable results). SNoW is a multi-class classifier that is specifically
tailored for learning in domains in which the potential number of information sources
(features) taking part in decisions is very large. It works by learning a sparse network
of linear functions over a pre-defined or incrementally learned feature space. SNoW
has already been used successfully as the learning vehicle in a large collection of
natural language related tasks, including POS tagging, shallow parsing, information
extraction tasks, etc., and compared favorably with other classifiers (Florian, 2002;
Punyakanok and Roth, 2001; Roth, 1998). Typically, SNoW is used as a classifier,
and predicts using a winner-take-all mechanism over the activation values of the
target classes. However, in addition to the prediction, it provides a reliable confidence
level in the prediction, which enables its use in an inference algorithm that combines
predictors to produce a coherent inference.
8.4.1 Feature Types
All the experiments we describe in this work share the same features and differ only
in the target classifiers. The features that are used to characterize a word are both
grammatical and statistical (a sketch of the feature extraction follows the list):
Location of letters (e.g., the third letter of the word is b). We limit word length
to 20, thus obtaining 440 features of this type (recall that the size of the alphabet
is 22).
148 Daya et al.
Bigrams of letters, independently of their location (e.g., the substring gd occurs
in the word). This yields 484 features.
Prefixes (e.g., the word is prefixed by k$h “when the”). We have 292 features of
this type, corresponding to 17 prefixes and sequences thereof.
Suffixes (e.g., the word ends with im, a plural suffix). There are 26 such features.
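As an illustration, the feature extraction can be sketched as follows. This is our own rendering, not the authors' code, and the prefix and suffix inventories shown are tiny placeholders for the actual 17 prefixes (and sequences thereof) and 26 suffixes:

PREFIXES = ["h", "w", "k$h"]    # placeholder particle sequences
SUFFIXES = ["im", "wt", "h"]    # placeholder inflectional endings

def extract_features(word, max_len=20):
    feats = set()
    # Location of letters: one feature per (position, letter) pair.
    for i, ch in enumerate(word[:max_len]):
        feats.add("pos_%d_%s" % (i, ch))
    # Letter bigrams, independently of their location.
    for i in range(len(word) - 1):
        feats.add("bigram_" + word[i:i + 2])
    # Prefix and suffix features from pre-defined inventories.
    for p in PREFIXES:
        if word.startswith(p):
            feats.add("prefix_" + p)
    for s in SUFFIXES:
        if word.endswith(s):
            feats.add("suffix_" + s)
    return feats

print(sorted(extract_features("hgdlh")))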
8.4.2 Direct Prediction
In the first of the two experiments, referred to as Experiment A, we trained a classifier
to learn roots as a single unit. The two obvious drawbacks of this approach are the
large set of targets and the sparseness of the training data. Of course, defining a multi-
class classification task with 2152 targets, when only half of them are manifested
in the training corpus, does not leave much hope for ever learning to identify the
missing targets.
In Experiment A, the macro-average precision of ten-fold cross validation runs
of this classification problem is 45.72%; recall is 44.37%, yielding an F-score of
45.03%. In order to demonstrate the inadequacy of this method, we repeated the
same experiment with a different organization of the training data. We chose 30 roots
and collected all their occurrences in the corpus into a test file. We then trained the
classifier on the remainder of the corpus and tested on the test file. As expected, the
accuracy was close to 0%.
8.4.3 Decoupling the Problem
In the second experiment, referred to as Experiment B, we separated the problem
into three different tasks. We trained three classifiers to learn each of the root conso-
nants in isolation and then combined the results straightforwardly, conjoining the
decisions of the three classifiers. This is still a multi-class classification but the
number of targets in every classification task is only 22 (the number of letters in the
Hebrew alphabet) and data sparseness is no longer a problem. As we show below,
each classifier achieves much better generalization, but the clear limitation of this
method is that it completely ignores interdependencies between different targets: the
decision on the first radical is completely independent of the decision on the second
and the third.
We observed a difference between recognizing the first and third radicals and
recognizing the second one, as can be seen in Table 8.3. These results correspond
well to our linguistic intuitions: the most difficult cases for humans are those in
which the second radical is w or i, and those where the second and the third conso-
nants are identical. Combining the three classifiers using logical conjunction yields
an F-measure of 52.84%. Here, repeating the same experiment with the organization
of the corpus such that testing is done on unseen roots yielded 18.1% accuracy.

Table 8.3. Accuracy of SNoW identifying the correct radical

             C1      C2      C3      root
Precision:   82.25   72.29   81.85   53.60
Recall:      80.13   70.00   80.51   52.09
F-measure:   81.17   71.13   81.18   52.84
To demonstrate the difficulty of the problem, we conducted yet another exper-
iment. Here, we trained the system as above but we tested it on different words whose
roots were known to be in the training set. The results of experiment A here were
46.35%, whereas experiment B was accurate in 57.66% of the cases. Evidently, even
when testing only on previously seen roots, both naïve methods are unsuccessful.
8.5 Combining Interdependent Classifiers
Evidently, simply combining the results of the three classifiers leaves much room
for improvement. Therefore we explore other ways for combining these results. We
can rely on the fact that SNoW provides insight into the decisions of the classi-
fiers – it lists not only the selected target, but rather all candidates, with an associated
confidence measure. Apparently, the correct radical is chosen among SNoW’s top-n
candidates with reasonable accuracy, as the data in Table 8.3 reveal.
This observation calls for a different way of combining the results of the classifiers
which takes into account not only the first candidate but also others, along with their
confidence scores.
8.5.1 HMM Combination
We considered several ways, e.g., via HMMs, of appealing to the sequential nature
of the task (C1 followed by C2, followed by C3). Not surprisingly, direct applications
of HMMs are too weak to provide satisfactory results, as suggested by the following
discussion. The approach we eventually opted for uses the predictive power of
a classifier to estimate more accurate state probabilities.
Given the sequential nature of the data and the fact that our classifier returns a
distribution over the possible outcomes for each radical, a natural approach is to
combine SNoW’s outcomes via a Markovian approach. Variations of this approach
are used in the context of several NLP problems, including POS tagging (Schütze
and Singer, 1994), shallow parsing (Punyakanok and Roth, 2001) and named entity
recognition (Tjong Kim Sang and De Meulder, 2003).
Formally, we assume that the confidence supplied by the classifier is the proba-
bility of a state (radical) c given the observation o (the word), P(c|o). This infor-
mation can be used in the HMM framework by applying Bayes rule to compute

P(o|c) = P(c|o) P(o) / P(c),
where P(o) and P(c) are the probabilities of observing o and being at c, respectively.
That is, instead of estimating the observation probability P(o|c)directly from training
data, we compute it from the classifiers’ output. Omitting details (see Punyakanok
and Roth, 2001), we can now combine the predictions of the classifiers by finding
the most likely root for a given observation, as
r = argmax P(c1 c2 c3 | o, θ),

where θ is a Markov model that, in this case, can be easily learned from the super-
vised data. Clearly, given the short root and the relatively small number of values
of ci that are supported by the outcomes of SNoW, there is no need to use dynamic
programming here and a direct computation is possible.
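A sketch of this direct computation, in our own notation (the probability tables are assumed to be given as dictionaries; none of the names below come from the chapter):

import itertools
import math

def best_root(conf, prior, trans, init):
    # conf[i]: dict radical -> P(c|o) from the i-th classifier (i = 0, 1, 2)
    # prior:   dict radical -> P(c); trans: dict (prev, cur) -> P(cur|prev)
    # init:    dict radical -> initial state probability
    best, best_score = None, float("-inf")
    for c1, c2, c3 in itertools.product(conf[0], conf[1], conf[2]):
        # Bayes-flip the classifier outputs: P(o|c) is proportional to
        # P(c|o)/P(c), since P(o) is constant across candidates.
        score = (math.log(init[c1]) + math.log(conf[0][c1] / prior[c1])
                 + math.log(trans.get((c1, c2), 1e-9))
                 + math.log(conf[1][c2] / prior[c2])
                 + math.log(trans.get((c2, c3), 1e-9))
                 + math.log(conf[2][c3] / prior[c3]))
        if score > best_score:
            best, best_score = (c1, c2, c3), score
    return best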
However, perhaps not surprisingly given the difficulty of the problem, this model
turns out to be too simplistic. In fact, performance deteriorated. We conjecture that
the static probabilities (the model) are too biased and cause the system to abandon
good choices obtained from SNoW in favor of worse candidates whose global
behavior is better.
For example, the root &.b.d was correctly generated by SNoW as the best
candidate for the word &obdim, but since P(C3=b|C2=b), which is 0.1, is higher
than P(C3=d|C2=b), which is 0.04, the root &.b.b was produced instead. Note that
in the above example the root &.b.b cannot possibly be the correct root of &obdim
since no pattern in Hebrew contains the letter d, which must therefore be part of the
root. It is this kind of observation that motivates the addition of linguistic knowledge
as a vehicle for combining the results of the classifiers. An alternative approach,
which we intend to investigate in the future, is the introduction of higher-level classi-
fiers which take into account interactions between the radicals (Punyakanok and
Roth, 2001).
8.5.2 Adding Linguistic Constraints
The experiments discussed in Section 8.4 are completely devoid of linguistic
knowledge. In particular, experiment B inherently assumes that any sequence of three
consonants can be the root of a given word. This is obviously not the case: with very
few exceptions, all radicals must be present in any inflected form (in fact, only w, i,
n and, in an exceptional case, l can be deleted when roots combine with patterns). We
therefore trained the classifiers to consider as targets only letters that occurred in the
observed word, plus w, i, n and l, rather than any of the alphabet letters. The average
number of targets is now 7.2 for the first radical, 5.7 for the second and 5.2 for the
third (compared to 22 each in the previous setup).
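Computing the restricted target set is straightforward; a one-function sketch (ours):

def candidate_targets(word, deletable=("w", "i", "n", "l")):
    # Targets for each radical classifier in the sequential model:
    # only letters occurring in the word, plus the deletable w, i, n and l.
    return sorted(set(word) | set(deletable))

print(candidate_targets("hpsqh"))   # ['h', 'i', 'l', 'n', 'p', 'q', 's', 'w']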
In this model, known as the sequential model (Even-Zohar and Roth, 2001), the
performance of SNoW improved slightly, as can be seen in Table 8.4 (compare to
Table 8.3). Combining the results in the straightforward way yields an F-measure
of 58.89%, a small improvement over the 52.84% performance of the basic method.
This new result should be considered the baseline. In what follows we always employ the
sequential model for training and testing the classifiers, using the same constraints.
However, we employ more linguistic knowledge for a more sophisticated combi-
nation of the classifiers.
Table 8.4. Accuracy of SNoW identifying the correct radical, sequential model

             C1      C2      C3      root
Precision:   83.06   72.52   83.88   59.83
Recall:      80.88   70.20   82.50   57.98
F-measure:   81.96   71.34   83.18   58.89
8.5.3 Combining Classifiers Using Linguistic Knowledge
SNoW provides a ranking on all possible roots. We now describe the use of linguistic
constraints to re-rank this list. We implemented a function which uses knowledge
pertaining to word-formation processes in Hebrew in order to estimate the likelihood
of a given candidate being the root of a given word. The function practically classifies
the candidate roots into one of three classes: good candidates, which are likely to be
the root of the word; bad candidates, which are highly unlikely; and average cases.
The decision of the function is based on the observation that when a root is regular
it either occurs in a word consecutively or with a single w or i between any two of
its radicals. The scoring function checks, given a root and a word, whether this is the
case. Furthermore, the suffix of the word, after matching the root, must be a valid
Hebrew suffix (there is only a small number of such suffixes in Hebrew). If both
conditions hold, the scoring function returns a high value. Then, the function checks
if the root is an unlikely candidate for the given word. For example, if the root is
regular its consonants must occur in the word in the same order they occur in the
root. If this is not the case, the function returns a low value. We also make use in this
function of our pre-compiled list of roots. A root candidate which does not occur in
the list is assigned the low score. In all other cases, a middle value is returned.
The actual values that the function returns were chosen empirically by counting
the number of occurrences of each class in the training data. For example, “good”
candidates make up 74.26% of the data, hence the value the function returns for
“good” roots is set to 0.7426. Similarly, the middle value is set to 0.2416 and the low
value to 0.0155.
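A rough sketch of such a scoring function, under our own simplifications: the suffix inventory and root list below are placeholders, only the leftmost occurrence of the root is checked, and the out-of-order check for regular roots is omitted:

import re

GOOD, MIDDLE, LOW = 0.7426, 0.2416, 0.0155      # empirical class frequencies
VALID_SUFFIXES = {"", "h", "im", "wt"}          # placeholder suffix inventory
ROOT_LIST = {("p", "s", "q"), ("n", "p", "l")}  # placeholder pre-compiled roots

def score(root, word):
    # Classify a candidate root for a word as good / average / bad.
    if root not in ROOT_LIST:
        return LOW          # roots outside the list get the low score
    c1, c2, c3 = root
    # A likely root occurs consecutively, or with a single w or i between
    # any two of its radicals, followed by a valid suffix.
    pattern = (re.escape(c1) + "[wi]?" + re.escape(c2)
               + "[wi]?" + re.escape(c3) + "(.*)$")
    m = re.search(pattern, word)
    if m and m.group(1) in VALID_SUFFIXES:
        return GOOD
    return MIDDLE           # everything else is an average case

print(score(("p", "s", "q"), "hpsqh"))    # GOOD: p.s.q occurs, suffix h
print(score(("n", "p", "l"), "hipltm"))   # MIDDLE: the first radical is missing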
As an example, consider hipltm, whose root is n.p.l (note that the first n is missing
in this form). Here, the correct candidate will be assigned the middle score while p.l.t
and l.t.m will score high.
In addition to the scoring function we implemented a simple edit distance function
which returns, for a given root and a given word, the inverse of the edit distance
between the two. For example, for hipltm, the (correct) root n.p.l scores 1/4 whereas
p.l.t scores 1/3. When the edit distance is 0, an empirically chosen high value is used.
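A sketch of this inverse edit distance (ours; the value returned when the distance is 0 is an arbitrary placeholder):

def edit_distance(a, b):
    # Standard Levenshtein distance via dynamic programming.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def inverse_edit_distance(root, word, zero_value=2.0):
    # Inverse of the edit distance; an empirically chosen high value for 0.
    d = edit_distance(root, word)
    return zero_value if d == 0 else 1.0 / d

print(inverse_edit_distance("npl", "hipltm"))   # 0.25, i.e. 1/4 as above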
We then run SNoW on the test data and rank the results of the three classifiers
globally, where the order is determined by the product of the three different classi-
fiers. This induces an order on roots, which are combinations of the decisions of three
independent classifiers. Each candidate root is assigned three scores: the product of
the confidence measures of the three classifiers; the result of the scoring function;
and the inverse edit distance between the candidate and the observed word. We rank
the candidates according to the product of the three scores (i.e., we give each score
an equal weight in the final ranking).
In order to determine which of the candidates to produce for each example, we
experimented with two methods. First, the system produced the top-i candidates for
a fixed value of i. The results on the development set are given in Table 8.5.

Table 8.5. Performance of the system when producing top-i candidates

             i=1     i=2     i=3     i=4
Precision    82.02   46.17   32.81   25.19
Recall       79.10   87.83   92.93   94.91
F-measure    80.53   60.52   48.50   39.81
Obviously, since most words have only one root, precision drops dramatically
when the system produces more than one candidate. This calls for a better threshold,
facilitating a non-fixed number of outputs for each example. We observed that in the
“difficult” examples, the top ranking candidates are assigned close scores, whereas
in the easier cases, the top candidate is usually scored much higher than the next
one. We therefore decided to produce all those candidateswhose scores are not much
lower than the score of the top ranking candidate. The drop in the score, δ,wasdeter-
mined empirically on the development set. The results are listed in Table 8.6, where
δvaries from 0.1 to 0.8 (δis actually computed on the log of the actual score, to
avoid underflow).
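The final ranking and the δ cutoff can be sketched as follows (our own naming; the three scores are combined with equal weight and δ is applied in log space, as described):

import math

def select_candidates(scored, delta=0.4):
    # scored: list of (root, confidence_product, heuristic_score, inv_edit_dist)
    ranked = sorted(((root, math.log(conf * heur * inv))
                     for root, conf, heur, inv in scored),
                    key=lambda item: item[1], reverse=True)
    top = ranked[0][1]
    # Keep every candidate whose log score is within delta of the top one.
    return [root for root, s in ranked if top - s <= delta]

# A clear winner plus a distant runner-up: only one root is produced.
print(select_candidates([(("k", "t", "b"), 0.5, 0.7426, 0.50),
                         (("k", "t", "r"), 0.1, 0.0155, 0.25)]))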
These results show that choosing δ = 0.4 produces the highest F-measure. With
this value for δ, results for the held-out data are presented in Table 8.7. The results
clearly demonstrate the added benefit of the linguistic knowledge. In fact, our results
are slightly better than average human performance (repeated here for convenience).
Interestingly, even when testing the system on a set of roots which do not occur in
the training corpus (see Section 8.4), we obtain an F-score of 65.60%. This result
demonstrates the robustness of our method.

Table 8.6. Performance of the system, producing candidates scoring no more than δ below the top score

             δ=0.1   0.2     0.3     0.4     0.5     0.6     0.7     0.8
Precision    81.81   80.97   79.93   78.86   77.31   75.48   73.71   71.80
Recall       81.06   82.74   84.03   85.52   86.49   87.61   88.72   89.70
F-measure    81.43   81.85   81.93   82.06   81.64   81.10   80.52   79.76

Table 8.7. Results: performance of the system on held-out data

             Held-out data   Humans
Precision:   80.90           83.52
Recall:      88.16           80.27
F-measure:   84.38           81.86
It must be noted that the scoring function alone is not a function for extracting
roots from Hebrew words. First, it only scores a given root candidate against a given
word, rather than yield a root given a word. While we could have used it exhaus-
tively on all possible roots in this case, in a general setting of a number of classifiers
the number of classes might be too high for this solution to be practical. Second,
the function only produces three different values; when given a number of candidate
roots it may return more than one root with the highest score. In the extreme case,
when called with all 22³ potential roots, it returns on the average more than 11 candi-
dates which score highest (and hence are ranked equally).
Similarly, the additional linguistic knowledge is not merely eliminating illegit-
imate roots from the ranking produced by SNoW. Using the linguistic constraints
encoded in the scoring function only to eliminate roots, while maintaining the
ranking proposed by SNoW, yields much lower accuracy. Clearly, our linguistically
motivated scoring does more than elimination, and actually re-ranks the roots. It
is only the combination of the classifiers with the linguistically motivated scoring
function which boosts the performance on this task.
8.5.4 Error Analysis
Looking at the questionnaires filled in by our subjects (Section 8.3), it is obvious that
humans have problems identifying the correct roots in two general cases: when the
root paradigm is weak (i.e., when the root is irregular) and when the word can be
read in more than one way and the subject chooses only one (presumably, the most
prominent one). Our system suffers from similar problems: first, its performance on
the regular paradigms is far superior to its overall performance; second, it sometimes
cannot distinguish between several roots which are in principle possible, but only
one of which happens to be the correct one.
To demonstrate the first point, we evaluated the performance of the system on a
different organization of the data. We tested separately words whose roots are all
regular, vs. words all of whose roots are irregular. We also tested words which have
at least one regular root (mixed). The results are presented in Table 8.8, and clearly
demonstrate the difficulty of the system on the weak paradigms, compared to almost
95% on the easier, regular roots.
Table 8.8. Error analysis: performance of the system on different cases

                   Regular   Irregular   Mixed
Number of words    2598      2019        2781
Precision:         92.79     60.02       92.54
Recall:            96.92     73.45       94.28
F-measure:         94.81     66.06       93.40
Table 8.9. Error analysis: the weak paradigms

Paradigm   F-measure
C1=i       70.57
C1=n       71.97
C2=i/w     76.33
C3=i       58.00
C2=C3      47.42
A more refined analysis reveals differences between the various weak paradigms.
Table 8.9 lists F-measure for words of which the roots are irregular, classified by
paradigm. As can be seen, the system has great difficulty in the cases of C2=C3 and
C3=i.
Finally, we took a closer look at some of the errors, and in particular at cases
where the system produces several roots where fewer (usually only one) are correct.
Such cases include, for example, the word hkwtrt (“the title”), of which the root is
the regular k.t.r; but the system produces, in addition, also w.t.r, mistaking the k to
be a prefix. Errors of this kind are most difficult to fix.
However, in many cases the system’s errors are relatively easy to overcome.
Consider, for example, the word hmtndbim (“the volunteers”) whose root is the
irregular n.d.b. Our system produces as many as five possible roots for this word:
n.d.b, i.t.d, d.w.b, i.h.d, i.d.d. Clearly some of these could be eliminated. For example,
i.t.d should not be produced, because if this were the root, nothing could explain the
presence of the b in the word; i.h.d should be excluded because of the location of
the h. Similar phenomena abound in the errors the system makes; they indicate that
a more careful design of the scoring function can yield still better results, and this is
the direction we intend to pursue in the future.
8.6 Extension to Arabic
Although Arabic and Hebrew have very similar morphological systems, both being
Semitic languages, the task of learning roots in Arabic is more difficult than in
Hebrew, for the following reasons:
There are 28 letters in Arabic which are represented using approximately 40
characters in Buckwalter (2002)’s transliteration of Modern Standard Arabic
orthography. Thus, the learning problem is more complicated due to the increased
number of targets (potential root radicals) as well as the number of characters
available in a word.
The number of roots in Arabic is significantly higher. We pre-compiled a list
of 3822 three-letter roots from Buckwalter’s list of roots, 2517 of which occur
in our corpus. According to our lists, Arabic has almost twice as many roots as
Hebrew.
Table 8.10. Arabic root ambiguity in the corpus

Number of roots     1       2      3     4     5    6
Number of words     28741   2258   664   277   48   3
Not only is the number of roots high, the number of patterns in Arabic is also
much higher than in Hebrew.
While in Hebrew the only possible letters which can intervene between root
radicals in a word are i and w, in Arabic there are more possibilities. The possible
intervening letter sequences between C1 and C2 are y, w, A, t and wA, and between
C2 and C3 they are y, w, A and Aˆy.
We applied the same methods discussed above to the problem of learning (Modern
Standard) Arabic roots. For training and testing, we produced a corpus of 31991 word
types (we used Buckwalter (2002)’s morphological analyzer to analyze a corpus
of 152666 word tokens from which our annotated corpus was produced). Table 8.10
shows the distribution of word types according to root ambiguity.
We then trained naïve classifiers to identify each radical of the root in isolation,
using features of the same categories as for Hebrew. Despite the rather pessimistic
starting point, each classifier provides satisfying results, as shown in Table 8.11,
probably owing to the significantly larger training corpus. The first three columns
present the results of each of the three classifiers, and the fourth column is a straight-
forward combination of the three classifiers.
The classifiers were combined using linguistic knowledge pertaining to word-
formation processes in Arabic, by implementing a function that estimates the
likelihood of a given candidate being the root of a given word. The function actually
checks the following cases (a sketch in code follows the list):
If a root candidate is indeed the root of a given word, then we expect it to occur
in the word consecutively or with one of {y, w, A, t, wA} intervening between C1
and C2, or with one of {y, w, A, Aˆy} between C2 and C3 (or both).
If a root candidate does not occur in our pre-compiled list of roots, it cannot be a
root of any word in the corpus.
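A sketch of these two checks in code (ours). The chapter prints the last intervening sequence between C2 and C3 as Aˆy; rendering it as the plain string "Ay" below is our assumption, and the example word is invented for illustration:

BETWEEN_C1_C2 = ["", "y", "w", "A", "t", "wA"]
BETWEEN_C2_C3 = ["", "y", "w", "A", "Ay"]   # "Ay" stands in for the printed Aˆy

def may_be_root(root, word, root_list):
    # Shallow linguistic check for an Arabic root candidate.
    if root not in root_list:
        return False    # candidates outside the pre-compiled list are ruled out
    c1, c2, c3 = root
    for x in BETWEEN_C1_C2:
        for y in BETWEEN_C2_C3:
            if c1 + x + c2 + y + c3 in word:
                return True
    return False

print(may_be_root(("k", "t", "b"), "wktAbhm", {("k", "t", "b")}))   # True: k-t-A-b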
Of course, this is an over-simplistic account of the linguistic facts, but it serves our
purpose of using very limited and very shallow linguistic constraints on the combi-
nation of specialized “expert” classifiers. Table 8.12 shows the final results.
Table 8.11. Accuracy of SNoW identifying the correct radical in Arabic

             C1      C2      C3      root
Precision:   86.02   70.71   82.95   54.08
Recall:      89.84   80.29   88.99   68.10
F-measure:   87.89   75.20   85.86   60.29
Table 8.12. Results: Arabic root identification

Precision:   78.21%
Recall:      82.80%
F-measure:   80.44%
The Arabic results are slightly worse than the Hebrew ones. One reason is that
the root inventory of Arabic is much larger than that of Hebrew (3822 vs. 2152 roots),
which leaves more room for incorrect root selection. Another reason may be that
Arabic word formation is a more complicated process, for example in allowing more
characters to occur between the root letters, as previously mentioned.
This may have caused the scoring function to wrongly tag some root candidates as
possible roots.
8.7 Conclusions
We have shown that combining machine learning with limited linguistic knowledge
can produce state-of-the-art results on a difficult morphological task, the identifi-
cation of roots of Semitic words. Our best result, over 80% precision for both Hebrew
and Arabic, was obtained using simple classifiers for each of the root’s consonants,
and then combining the outputs of the classifiers using a linguistically motivated,
yet extremely coarse and simplistic, scoring function. This result is comparable to
average human performance on this task.
This work can be improved in a variety of ways. We intend to spend more effort
on feature engineering. As is well-known from other learning tasks, fine-tuning of
the feature set can produce additional accuracy; we expect this to be the case in
this task, too. In particular, introducing features that capture contextual information
is likely to improve the results. Similarly, our scoring function is simplistic and
we believe that it can be improved. We also intend to improve the edit-distance
function such that the cost of replacing characters reflects phonological and ortho-
graphic constraints (Kruskal, 1999).
In another track, there are various other ways in which different inter-related
classifiers can be combined. Here we only used a simple multiplication of the three
classifiers’ confidence measures, which is then combined with the linguistically
motivated functions. We intend to investigate more sophisticated methods for this
combination, including higher-order machine learning techniques.
Finally, we plan to extend these results to more complex cases of learning tasks
with a large number of targets, in particular such tasks in which the targets are struc-
tured. We are currently working on morphological disambiguation in languages with
non-trivial morphology, which can be viewed as a POS tagging problem with a large
number of tags on which structure can be imposed using the various morphological
and morpho-syntactic features that morphological analyzers produce. We intend to
investigate this problem for Hebrew and Arabic in the near future.
Acknowledgments
This work was supported by The Caesarea Edmond Benjamin de Rothschild
Foundation Institute for Interdisciplinary Applications of Computer Science. Dan
Roth is supported by NSF grants CAREER IIS-9984168, ITR IIS-0085836, and ITR
IIS-0085980. We thank Meira Hess and Liron Ashkenazi for annotating the corpus
and Alon Lavie and Ido Dagan for useful comments.
References
Beesley, Kenneth R. 1998a. Arabic morphological analysis on the internet. In Proceedings of
the 6th International Conference and Exhibition on Multi-lingual Computing, Cambridge,
April.
Beesley, Kenneth R. 1998b. Arabic morphology using only finite-state operations. In Michael
Rosner, editor, Proceedings of the Workshop on Computational Approaches to Semitic
languages, pages 50–57, Montreal, Quebec, August. COLING-ACL’98.
Buckwalter, Tim. 2002. Buckwalter Arabic morphological analyzer. Linguistic Data
Consortium (LDC) catalog number LDC2002L49 and ISBN 1-58563-257-0.
Choueka, Yaacov. 1990. MLIM - a system for full, exact, on-line grammatical analysis of
Modern Hebrew. In Yehuda Eizenberg, editor, Proceedings of the Annual Conference on
Computers in Education, p. 63, Tel Aviv, April. In Hebrew.
Darwish, Kareem. 2002. Building a shallow Arabic morphological analyzer in one day. In
Mike Rosner and Shuly Wintner, editors, Computational Approaches to Semitic Languages,
an ACL’02 Workshop, pp. 47–54, Philadelphia, PA, July.
Daya, Ezra, Dan Roth, and Shuly Wintner. 2004. Learning Hebrew roots: Machine learning
with linguistic constraints. In Proceedings of EMNLP’04, pp. 357–364, Barcelona, Spain,
July.
Even-Shoshan, Abraham. 1993. HaMillon HaXadash (The New Dictionary). Kiryat Sefer,
Jerusalem. In Hebrew.
Even-Zohar, Y. and Dan Roth. 2001. A sequential model for multi class classification.
In EMNLP-2001, the SIGDAT Conference on Empirical Methods in Natural Language
Processing, pp. 10–19.
Florian, Radu. 2002. Named entity recognition as a house of cards: Classifier stacking. In
Proceedings of CoNLL-2002, pp. 175–178. Taiwan.
Habash, Nizar and Owen Rambow. 2005. Arabic tokenization, part-of-speech tagging and
morphological disambiguation in one fell swoop. In Proceedings of the 43rd Annual
Meeting of the Association for Computational Linguistics (ACL’05), pp. 573–580, Ann
Arbor, Michigan, June. Association for Computational Linguistics.
Kruskal, Joseph. 1999. An overview of sequence comparison. In David Sankoff and Joseph
Kruskal, editors, Time Warps, String Edits and Macromolecules: The Theory and Practice
of Sequence Comparison. CSLI Publications, Stanford, CA, pp. 1–44. Reprint, with a
foreword by John Nerbonne.
Marsi, Erwin, Antal van den Bosch, and Abdelhadi Soudi. 2005. Memory-based morpho-
logical analysis generation and part-of-speech tagging of Arabic. In Proceedings of the
ACL Workshop on Computational Approaches to Semitic Languages, pp. 1–8, Ann Arbor,
Michigan, June. Association for Computational Linguistics.
McCarthy, John J. 1981. A prosodic theory of nonconcatenative morphology. Linguistic
Inquiry, 12(3):373–418.
Ornan, Uzzi. 2003. The Final Word. Haifa, Israel: University of Haifa Press. In Hebrew.
Punyakanok, Vasin and Dan Roth. 2001. The use of classifiers in sequential inference. In
NIPS-13; The 2000 Conference on Advances in Neural Information Processing Systems 13,
pp. 995–1001. MIT Press.
Roth, Dan. 1998. Learning to resolve natural language ambiguities: A unified approach. In
Proceedings of AAAI-98 and IAAI-98, pp. 806–813, Madison, Wisconsin.
Schütze, H. and Y. Singer. 1994. Part-of-speech tagging using a variable memory Markov
model. In Proceedings of the 32nd Annual Meeting of the Association for Computational
Linguistics, pp. 181–187.
Shimron, Joseph, editor. 2003. Language Processing and Acquisition in Languages of Semitic,
Root-Based, Morphology. Number 28 in Language Acquisition and Language Disorders.
John Benjamins.
Tjong Kim Sang, Erik F. and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared
task: Language-independent named entity recognition. In Walter Daelemans and Miles
Osborne, editors, Proceedings of CoNLL-2003, pp. 142–147. Edmonton, Canada.
Zdaqa, Yizxaq. 1974. Luxot HaPoal (The Verb Tables). Jerusalem: Kiryath Sepher. In Hebrew.
9
Automatic Processing of Modern Standard Arabic Text
Mona Diab1, Kadri Hacioglu2 and Daniel Jurafsky3
1 Center for Computational Learning Systems, Columbia University, mdiab@cs.columbia.edu
2 Center for Spoken Language Research, University of Colorado, Boulder, hacioglu@colorado.edu
3 Linguistics Department, Stanford University, jurafsky@stanford.edu
Abstract: To date, there are no fully automated systems addressing the community’s need for
fundamental language processing tools for Arabic text. In this chapter, we present a
Support Vector Machine (SVM) based approach to automatically tokenize (segmenting
off clitics), part-of-speech (POS) tag and annotate Base Phrase Chunks (BPC) in Modern
Standard Arabic (MSA) text. We adapt highly accurate tools that have been developed
for English text and apply them to Arabic text. Using standard evaluation metrics, we
report that the (SVM-TOK) tokenizer achieves an Fβ=1 score of 99.1, the (SVM-POS)
tagger achieves an accuracy of 96.6%, and the (SVM-BPC) chunker yields an Fβ=1 score
of 91.6
9.1 Introduction
The Arabic language is receiving growing attention in the NLP community, due
both to its socio-political importance and to the NLP challenges presented by its
dialect differences, diglossia, complex morphology, and non-transparent orthogra-
phy. But like most languages, Arabic is lacking in annotated resources and tools.
Fully automated fundamental NLP tools such as tokenizers, part of speech taggers,
parsers, and semantic role labelers are still not available for Arabic.
In this chapter, we propose solutions for Modern Standard Arabic (MSA) on a
crucial subset of these NLP tasks: clitic tokenization and basic lemmatization, part
of speech tagging and base phrase chunking. Our algorithms draw heavily on the
Arabic Tree Bank (ATB) (Maamouri et al., 2004).
We use the word tokenization, or clitic tokenization to refer to the process of
segmenting clitics from stems. In Arabic, prepositions, conjunctions, and some
pronouns are cliticized (orthographically and phonologically fused) onto stems.
Separating conjunctions from the following noun, for example, is a key first step
in parsing.
exactly which morphemes are segmented off (Habash & Sadat, 2006). In this
chapter we define the input to tokenization as space-delimited surface forms of
words, and we define the output of the tokenizer as a sequence of clitics and deri-
vational lemma forms.
A lemma is a citation form that exists as a dictionary entry. True lemmas re-
quire full morphological parsing, because a lemma would need to have inflectional
morphology (such as plural markers) segmented off. Instead of producing lemmas,
we produce derivational lemmas, which may have inflections, but which look
more like lemmas than like surface wordform stems in other ways. When lemmas
appear in context, some of the letters at the morpheme boundaries change due to
morphophonological rules. Our derivational lemmas reverse these morphopho-
nological rules, resulting in forms that look more like dictionary entries, although
they have inflectional endings.
We use the term Part of Speech (POS) tagging to refer to the process of anno-
tating these segmented words with parts of speech drawn from the ATB Arabic
reduced part of speech tagset (Maamouri et al., 2004).
Finally, Base Phrase Chunking (BPC) is the process of annotating the input
sequence of tokens with non-recursive base phrases such as noun phrases, adjecti-
val phrases, verb phrases, preposition phrases, and so on.
For each of these tasks, we adopt a unified supervised machine learning per-
spective using Support Vector Machines (SVMs) trained on the ATB. We lever-
age off of already existing algorithms for English. The results are comparable to
state-of-the-art results on English text when trained on similar sized data.
The remainder of this chapter is organized as follows: Section 9.2 sketches
relevant aspects of the Arabic language and the ATB. Section 9.3 briefly discusses
previous approaches to our tasks; Section 9.4 describes our approach; Sections 9.5
through 9.8 describe our four classifiers (tokenizer, lemmatizer, part-of-speech
tagger, and base-phrase chunker) and their performance; we then conclude in Sec-
tion 9.9 with some observations about the tasks and future directions.
9.2 The Arabic Language
Arabic is a Semitic language with rich templatic morphology. Templatic morphol-
ogy refers to a type of morphology that is based on having roots and patterns
(templates) to identify derivational lemmas in the language. An Arabic word may
be composed of a stem (consisting of a consonantal root and a template/pattern
that encodes mainly the vocalic distribution of a word in addition to other system-
atic information), plus affixes and clitics. The affixes include inflectional markers
for tense, gender, and/or number. The clitics include some (but not all) preposi-
tions, conjunctions, determiners, possessive and object pronouns. Some clitics are
proclitics (attaching to the beginning of a stem) and some enclitics (attaching to
the end of a stem). The following is an example of the different morphological
segments in the word وبحسناتهم, read in transliteration as wbHsnAthm,1 which
means and by their virtues. Arabic is read from right to left, hence the directional
switch in the English gloss, as shown in Table 9.1.
The set of possible proclitics comprises the prepositions (b, l, k), meaning
(by/with, to, as), respectively; the conjunctions (w, f), meaning (and, then), respec-
tively; and the definite article or determiner (Al), meaning (the). Arabic words
may have a conjunction and a preposition and a determiner cliticizing to the be-
ginning of a word. The set of possible enclitics comprises the object and posses-
sive pronouns. The set contains some overlaps as illustrated in Table 9.2.
The tokens of a word as defined in this chapter are the (optional) proclitics, the
stem (together with its inflectional affixes), the (optional) enclitic, and the (op-
tional) punctuation mark.

Table 9.1. Some morphemes in an MSA word

                  Enclitic   Affix   Stem     Proclitic   Proclitic
MSA               هم         ات      حسن      ب           و
Transliteration   hm         At      Hsn      b           w
Gloss             their      s       virtue   by          and

Table 9.2. List of possible MSA pronoun enclitics

MSA    Translit.   Number   Gender       Object    Possessive
ي      y           sing.    masc./fem.             my/mine
ني     ny          sing.    masc./fem.   me
نا     nA          plur.    masc./fem.   ours      our
ك      k           sing.    masc./fem.   you       yours
كما    kmA         dual     masc./fem.   you       yours
كن     kn          plur.    fem.         you       yours
كم     km          plur.    masc./fem.   you       yours
ه      h           sing.    masc.        him/it    his/its
ها     hA          sing.    fem.         her/it    hers/its
هما    hmA         dual     masc./fem.   their     theirs
هم     hm          plur.    masc.        their     theirs
هن     hn          plur.    fem.         their     theirs

1 All transliteration in this chapter is rendered in the transliteration scheme described in
Chapter 2 of this book.
As suggested earlier, the tokenization process in this chapter does not corre-
spond to complete morphological segmentation, because inflectional affixes are
not segmented from the stem. Arabic has an intricate inflectional system. Verbs
are marked for mood, tense, aspect, number, gender and voice. Nouns and adjec-
tives are marked for number, gender, case and definiteness. Adjectives are marked
for number and gender. Some of these features, like case (nominative, accusative,
genitive, etc.) are determined contextually from the syntactic configuration in
which a word appears. Since we are not addressing syntactic parsing in this chap-
ter, we will not address the full processing of inflectional morphology. However,
in the POS tagging task, nouns are marked with number information and verbs are
marked with some tense and voice information as encoded in the tag set.
9.3 Related Work
To our knowledge, there exist no unified systems that deal with raw Arabic text
processing from tokenization to base phrase chunking. However, various systems
perform subsets of these tasks.
The current standard approach to Arabic tokenization and POS tagging, as
adopted in the ATB, relies on manually choosing the appropriate analysis from
among the multiple analyses rendered by the Buckwalter Arabic Morphological
Analyzer (BAMA).2 BAMA is a sophisticated finite-state, rule-based morphologi-
cal analyzer. Morphological analysis may be characterized as the process of seg-
menting a surface word form into its component derivational and inflectional
morphemes. In a language such as Arabic, which exhibits both inflectional and
derivational morphology, the morphological tags tend to be fine grained amount-
ing to a large number of tags — BAMA has 139 distinct morphological labels —
in contrast to POS tags which are typically coarser grained. Using BAMA, the
choice of an appropriate morphological analysis entails clitic tokenization as well
assignment of a POS tag. It is worth noting that BAMA is not a disambiguator; it
only renders analyses for each word in the text.
For tokenization, the two most recent specialized systems are those of Lee et al.
(Lee et al., 2003) and Darwish (Darwish, 2002). Darwish proposes a rule-based
system for tokenizing Arabic text into a prefix–stem/root–suffix sequence. His
system presupposes a single prefix. It reduces the stem to the root form based on a
manually created list of word root pairs. The approach exploits look-up tables for
a list of prefixes and suffixes. His approach yields an accuracy of 92.7%.
On the other hand, Lee et al. (Lee et al., 2003), propose a partially supervised
n-gram language model approach for Arabic segmentation. Their segmentation
aims at maximizing the correspondence between English and Arabic tokens for
the purpose of statistical machine translation. Their approach relies on look-up ta-
bles for stems, prefixes and suffixes. Their notion of a prefix includes both clitics
such as conjunctions and prepositions, and affixal inflectional morphemes such as
the first person A in Akrr (I repeat). Also for suffixes, they consider the plural
marker At on feminine nouns and adjectives, as well as wn and yn on masculine
nouns and adjectives as suffixes that are segmented out. In their approach, they
start off with a supervised segmenter trained on tables derived from the manually
2
http://www.ldc.upenn.edu/myl/morph/buckwalter.html
created ATB1 (101k tokens). Then they use this basic supervised segmenter to
acquire more data in an unsupervised bootstrapping method from a large unanno-
tated corpus. Their approach yields a segmentation accuracy of 97%. The ap-
proaches of Lee and of Darwish are not consistent with the ATB style tokenization
since they segment off inflectional morphemes.
Most recently, for morphological disambiguation, Habash & Rambow (2005)
propose a system MADA that relies on the output of BAMA to render the appro-
priate full morphological features for all words in MSA text. MADA learns 10
different features, namely: basic POS tag (15 tags), presence of a conjunction,
presence of a particle, presence of a pronoun, presence of a determiner, gender,
number, person, voice and aspect. The features are learned independently using
SVM-based learning. Then, MADA disambiguates choosing the most fitting set of
feature values from the space of possible features using a decision tree algorithm.
MADA achieves a morphological disambiguation accuracy of 95.6%.
Khoja (2001) reports preliminary results on a hybrid, statistical and rule based,
POS tagger/morphological disambiguator, APT. APT yields 90% accuracy on a
tag set of 131 tags. The tag set is more akin to the morphological tags used in
BAMA. APT is a two-step hybrid system with rules and a Viterbi algorithm for
statistically determining the appropriate tag. The first step is a dictionary look-up
which assigns a word all possible tags listed in the dictionary. The second step is a
stochastic decision process where an appropriate tag is chosen from the possible
tags assigned.
Though they perform the same task, MADA and APT are not directly compara-
ble since they use different tag sets. However they are similar in that they both ren-
der detailed tag sets that account for both inflectional and derivational morphology.
9.4 Our SVM-based Approach
Various machine learning approaches have been applied to part-of-speech tagging
and base-phrase chunking, by casting them as classification tasks. Given a set of
features extracted from the linguistic context, a classifier predicts the part-of-
speech or base-phrase-chunk class of a token. Support Vector Machines (SVMs)
(Vapnik, 1995) are one such supervised machine learning algorithm, with the ad-
vantages of discriminative training, robustness and a capability to handle a large
number of (overlapping) features with good generalization performance. Conse-
quently, SVMs have been applied in many NLP tasks with great success: text
categorization (Joachims, 1998); syntactic chunking and shallow parsing (Kudo &
Matsumoto, 2000); semantic role labeling (Hacioglu & Ward, 2003). We take this
perspective a step further from previous studies in that we also cast the tokeniza-
tion problem as a classification task. We adopt a unified tagging perspective, using
the same SVM experimental setup which comprises a standard SVM as a multi-
class classifier (Allwein et al., 2000). The difference between the tasks lies in the
input, context and features. The features utilized in our approach are not explicitly
language dependent, except for the lemmatization features in the tokenization step. The
following subsections illustrate the different tasks and their corresponding features
and tag sets. We use a sequence model on the SVMs to take advantage of the con-
text of the items being compared in a vertical manner in addition to the encoded
features in the horizontal input of the vectors. Accordingly, in our different tasks,
we define the notion of context to be a window of fixed size around the segment in
focus for training and disambiguation.
9.4.1 Data
Our SVM classifiers are supervised, hence the need for annotated training data.
We use three corpora from the ATB: ATB1 version 3, ATB2 version 2, and ATB3
version 2. They comprise articles from different newswires: Agence France Presse
newswire; Al-Hayat newspaper distributed by Ummah; and An Nahar news
agency. The articles mainly consist of political, economic and sports news texts.
ATB1 comprises 734 news articles (140k words). ATB2 comprises 501 stories
from the Ummah Arabic newswire (140k words). ATB3 consists of 600 stories
(340k words). Therefore, the total number of words for ATB1, ATB2, ATB3 is
750k words after clitic segmentation, corresponding to 1900 news articles.
We performed various corpus cleanups, including removing spurious characters
and consistency-checking of the syntactic trees using David Chiang's cat-tree
(personal communication).
We formatted the corpus such that each line corresponds to a single tree. We also
created a standard split for the corpus which could be useful for the community at
large for other NLP or linguistic tasks. Each of the three ATB corpora is split into
~10% development data, ~80% training data and ~10% test data, maintaining
document/article boundaries.3 The development and training data are randomized on
the document level. The respective splits for the different ATB corpora are concate-
nated; thus the development set for ATB1 is concatenated to the development set for
ATB2 and ATB3, etc. Table 9.3 shows the size of the resulting splits.
The ATB data is distributed in a Latin-based ASCII representation using the
Buckwalter transliteration scheme.4 We use the unvocalized version of the ATB
for all the experiments. All the data is derived from the syntactic trees in the ATB.
Table 9.3. Size of the data splits used in our experiments

Data          Tokens   Sentences
Development   70188    2304
Training      594683   18970
Test          69665    2337
3 Data may be obtained from the first author upon request.
4 http://www.ldc.upenn.edu/myl/morph/buckwalter.html
9.4.2 SVM Setup
We use a standard SVM5 with a polynomial kernel of degree 2 and C = 1. These
parameters were the best-yielding setting as determined on the development data.
9.4.3 Evaluation Metric
Standard metrics of accuracy (Acc), precision (Prec), recall (Rec), and F-score
(Fβ=1), are reported on the test data for all our evaluations. We use the CoNLL
standard shared task evaluation tools for all tasks.6
9.5 Clitic Tokenization
We approach clitic tokenization as a one-of-nine classification task, in which each
letter in a word is tagged with a label indicating its morphological identity. For the
purposes of this chapter, we do not tokenize the proclitic determiner Al (the) since
it is not tokenized separately in the ATB. Therefore, a word may have p (0 ≤ p ≤ 2)
proclitics and e (0 ≤ e ≤ 1) enclitics from the lists in Section 9.2. We model the task
as a chunking problem, where each word is divided into a maximum of 4 chunks:
a possible proclitic prefix1, followed by a possible proclitic prefix2, then the stem
followed by a possible enclitic suffix.⁷ We adopt an inside-outside-beginning
(IOB) chunking approach (Ramshaw & Marcus, 1995). We break the surface
words into, potentially, four regions. The first letter in each delimited chunk is la-
beled with a begin chunk marker B. The following letters in a designated chunk
are labeled with an I marker indicating that they are inside a chunk. The O marker
indicates spaces between surface form words. Table 9.4 illustrates an example of
the input word described in Table 9.1 in the IOB representation.
Input: A sequence of transliterated Arabic letters processed from left-to-right
with break markers for word boundaries - space delimiters.
Context: A fixed-size window of –5/+5 characters centered at the character
in focus.
Features: All characters and previous tag decisions within the context.
Tag Set: The tag set is {B-PRE1, I-PRE1, B-PRE2, I-PRE2, B-WORD, I-
WORD, B-SUFF, I-SUFF, O} where I denotes the inside of a segment and
B denotes the beginning of a segment. In principle, the tag set allows for
⁵ http://cl.aist-nara.ac.jp/~taku-ku/software/yamcha
⁶ http://cnts.uia.ac.be/conll2003/ner/bin/conlleval
⁷ We refer to the cliticized portion of the surface form as a stem rather than a lemma since, as we will see, the process of clitic tokenization does not necessarily result in lemmas.
Table 9.4. IOB example
MSA Transliteration IOB Tag
Ϯ w B-PRE1
Β b B-PRE2
Τ H B-WORD
γ s I-WORD
Ϩ n I-WORD
Ύ A I-WORD
Η t I-WORD
ϫ h B-SUFF
Ϣ m I-SUFF
internal structure for PRE1 and PRE2. In practice PRE1 and PRE2 are
mostly proclitic tags which, in unvocalized text, are usually only one character
long.⁸ SUFF is an enclitic, and WORD is the derivational stem plus
any inflectional affixes, and/or the determiner Al. Both latter categories
could have internal structure.
Example: Figure 9.1 illustrates a training example with the IOB tags and the
feature sets. All the training and test data is represented using the Buckwalter
transliteration scheme.
[Figure 9.1 appears here: a column of characters and IOB tags for the word wbHsnAthm, with the previous tags, current tag and forward context marked around the focus character A.]
Fig. 9.1. Sequence SVM training for TOK. The current position is A; context features are the previous 5 characters and their tags, and the following 5 characters
⁸ There are very few cases that occur with I-PRE1 or I-PRE2, such as ΑϻΪ lAbd (it is necessary), where the ATB sometimes separates the lA from bd.
In Figure 9.1, the character in focus is A, and the context features are the five
preceding character segments: w B-PRE1, b B-PRE2, H B-WORD, s I-WORD
and n I-WORD and the five following character segments: t, h, m, BRK, and n.
Note that n in this case happens to be the character segment beginning the follow-
ing word. The classifier is run left-to-right, and so the labels of the 5 following
segments are not used as features to the classifier. The class label of the focus
character is presented to the classifier during supervised training, but not during
testing. Once the models are learned over the training data, the test data is disam-
biguated where each word is split into the respective chunks of proclitic/prefix,
stem and enclitic/suffix.
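The windowed feature extraction described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the actual experiments used YamCha's feature templates, and the function name and padding convention here are assumptions.

```python
def tok_features(chars, tags, i, window=5):
    """Build features for the character at position i: the surrounding
    characters in a -5/+5 window, plus the (already predicted) tags of the
    preceding characters. 'BRK' stands for a word-boundary marker."""
    feats = {}
    for k in range(-window, window + 1):
        j = i + k
        feats[f"char[{k}]"] = chars[j] if 0 <= j < len(chars) else "PAD"
    for k in range(-window, 0):            # only past tags are available
        j = i + k
        feats[f"tag[{k}]"] = tags[j] if 0 <= j < len(tags) else "PAD"
    return feats

# Example: the word wbHsnAthm, with the focus character A (cf. Figure 9.1).
chars = ["BRK", "w", "b", "H", "s", "n", "A", "t", "h", "m", "BRK"]
tags  = ["O", "B-PRE1", "B-PRE2", "B-WORD", "I-WORD", "I-WORD"]
print(tok_features(chars, tags, 6))
```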
While the SVM classifier is generally successful at tokenization, any purely seg-
mentation-based approach does not deal well with allomorphy and morphophonol-
ogy. Allomorphs are different realizations of the same morpheme that occur in dif-
ferent surface contexts. Morphophonology is the general term for changes in the
phonemes (and letters) in a morpheme that occur when morphemes are combined.
Undoing these morphophonological changes is part of lemmatization, the task of
recovering the lemma that underlies the surface word form. True lemmatization also
involves parsing off inflectional morphology, so our goal in this chapter is only a
subset of lemmatization: producing word forms that are as close as possible to lem-
mas barring the further processing of inflectional morphology.
Our SVM segmenter’s inability to model these morphophonological changes
led to three classes of errors, requiring further processing. Two of these are de-
scribed here; one is reserved for the next section.
• Allomorphs of the definite article in the context of the proclitic preposition l (to)
The definite article Al (the) has a distinct allomorph after the preposition l (to); it
appears simply as l. For instance, ϖ΍ήόϠϠ llȢrAq (TO THE IRAQ, to Iraq), after the
preposition clitic l is segmented off by our tokenizer, produces the form lȢrAq.
The resulting form is now ambiguous between to Iraq and the Iraq, because the
initial l is the allomorph of the definite marker Al. In this context (after l,
to) the phrase can only mean the Iraq. The full definite article lemma thus
needs to be restored in such contexts; lȢrAq becomes AlȢrAq.
• Allomorphs of Alif maqsura, ý, in closed-class words before enclitics
In closed-class words such as the prepositions ϰϠϋ Ȣlý (on) or ϯΪϠ ldý (at), the
word-final ý appears as a surface y in the context of enclitics. Therefore, as a
result of clitic tokenization, ldyhm would result in ldy +hm. It is worth noting that
ldy on its own (not in the context of an enclitic) corresponds to with me or at my
place, exactly chez moi in French. Thus, to produce the correct lemma for the
prepositions in the context of an enclitic, the final y needs to be restored to a ý.
The required lemmatization for these two allomorphy cases is completely
predictable from the context; therefore both cases are handled deterministically in
our system with hand-written rules. In the case of the determiner, we only change an
initial l to Al if it was preceded by an l before clitic tokenization. In the case of the
word-final ý, we change a closed-class word-final y to ý in the context of an
enclitic pronoun. Some examples of the closed-class words are Ȣlý (on), Alý (to),
ldý (at), mdý (extent).
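Since both repairs are deterministic, they can be sketched as two rewrite rules. The sketch below is illustrative, not the authors' code: it assumes Buckwalter encoding (E for Ȣ, Y for ý), a flag recording whether an l preceded the stem before clitic segmentation, and an abbreviated closed-class list.

```python
# Abbreviated list of closed-class stems whose final Y surfaces as y
# before enclitics: Ely (on), Aly (to), ldy (at), mdy (extent).
CLOSED_CLASS_FINAL_Y = {"Ely", "Aly", "ldy", "mdy"}

def restore_definite_article(stem, preceded_by_l):
    """After the preposition l is segmented off, an initial l that is the
    allomorph of the definite article Al is restored to Al."""
    if preceded_by_l and stem.startswith("l"):
        return "A" + stem
    return stem

def restore_alif_maqsura(stem, has_enclitic):
    """In closed-class words, a final y that surfaced before an enclitic
    is restored to Alif maqsura (Y in Buckwalter encoding)."""
    if has_enclitic and stem in CLOSED_CLASS_FINAL_Y:
        return stem[:-1] + "Y"
    return stem

print(restore_definite_article("lErAq", preceded_by_l=True))  # -> AlErAq
print(restore_alif_maqsura("ldy", has_enclitic=True))         # -> ldY
```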
9.5.1 Tokenization Results
The ATB text comes in pre-lemmatized form. In order to increase the robustness
of our system, we chose to un-lemmatize the ATB text, to create more natural
input. We un-lemmatized the ATB via hand-written rules. For example, the
preposition Alý (to) followed by the enclitic hm (them) is written as the lemma Alý
in the ATB. However, in naturally occurring text it would be the stem Aly with the
enclitic suffix hm. Therefore, we wrote a rule transforming the word-final ý of
closed-class words into y in the context of an enclitic suffix. We acquired the
token boundaries for training from the ATB.
Table 9.5 presents the results obtained using SVM-TOK, compared against two
rule-based baseline approaches, RULE and RULE+DICT.
RULE marks a prefix if a word starts with one of the five proclitic letters described
in Section 9.2. A suffix is marked if a word ends with any of the object or
possessive pronoun enclitics mentioned in Section 9.2. A small set of
17 function words that start with the proclitic letters is explicitly excluded.
RULE+DICT applies the tokenization rules in RULE if the token does not
occur in a dictionary. The dictionary used comprises the 47,261 unique
unvocalized word entries obtained from the first column of Buckwalter’s
dictStem, freely available with the BAMA distribution. In some cases, dictionary
entries retain inflectional morphology and clitics.
9.5.2 Tokenization Discussion
Performance of SVM-TOK is very high, with an Fβ=1 of 99.1. The task, however, is
quite easy, and SVM-TOK is only about 5 F-score points better (absolute) than the
better baseline, RULE+DICT. While RULE+DICT could certainly be improved
with larger dictionaries, the largest dictionary will still have coverage problems if
it is not aligned well with the corpus at hand. Hence, there will remain a role for a
data-driven approach such as SVM-TOK. Table 9.6 illustrates the performance of
SVM-TOK per chunk.
Table 9.5. Results (%) of SVM-TOK compared against RULE and RULE+DICT

System      Acc   Prec  Rec   Fβ=1
SVM-TOK     99.8  99.1  99    99.1
RULE        96.8  86.3  91.1  88.6
RULE+DICT   98.3  93.7  93.7  93.7
Table 9.6. Results (%) for SVM-TOK per chunk category

Chunk  Prec  Rec   Fβ=1
PRE1   98.5  98.2  98.3
PRE2   97.4  85.2  92
WORD   99.3  99.3  99.3
SUFF   96.7  96.4  96.5

The results overall for all the chunks are in the high 90s, except for PRE2. Table
9.7 further illustrates a confusion matrix for the different classes in SVM-TOK.
The empty cells in Table 9.7 indicate no confusion between these classes. The
first column contains the gold tag, i.e. the manually assigned tag in the treebank.
The table reads as the percentage of time a class from column 1 is confused with
one of the other classes. Therefore, B-PRE1 is incorrectly tagged as a B-WORD
1.8% of the time. The largest number we see in this matrix is the percentage of
time B-PRE2 is confused with the beginning of a word, B-WORD. This is
expected, since B-PRE2 has two ambiguous boundaries. An example of the
errors is the word wbAlm. The correct segmentation is w#bAlm (and Palm), but
SVM-TOK hypothesized an extra B-PRE2 chunk, so the resulting segmentation is
w#b#Alm (and with pain). This is a result of the fact that bAlm (Palm)
is a proper name which was not seen in the training data, while the word Alm
(pain) occurs frequently as a noun in the training data. Most of the problems
observed are problems with proper names. Proper names that begin with
characters that could be proclitics will always pose a problem for this task unless
we incorporate some form of Named Entity Tagging into the process.
The results obtained in this study are not significantly different from those
obtained using data from ATB1v2 alone (Diab et al., 2004). The results in our
previous study, with one fifth of the data, were the same, Fβ=1=99.1. We attribute
this finding to a hypothesized ceiling effect on performance, given the relatively
impoverished set of features being used. Unless we include richer morphological
features, we do not anticipate a significant improvement in performance.
Our results are not directly comparable to Darwish's work (Darwish, 2002) nor
to the segmentation task of Lee (Lee et al., 2003), since both include inflectional
morpheme segmentation.
Table 9.7. SVM-TOK classes confusion matrix (%)
Class B-PRE2 B-WORD I-WORD B-SUFF I-SUFF
B-PRE1 1.8
B-PRE2 11.4 3.4
B-WORD 0.18 0.24
I-WORD 0.06 0.05 0.03
B-SUFF 0.17 3.4
I-SUFF 2.8
9.6 Singular Noun Feminine Marker Restoration
We saw above that some allomorphy problems with our segmenter could be dealt
with by hand-written rules. We turn now to a more difficult allomorphy issue. The
ending for singular feminine noun and adjective lemmas is a word-final tā'
marbūTa, ƫ. When a singular feminine noun is followed by an enclitic pronoun,
the word-final tā' marbūTa surfaces as a regular t.⁹ This noun-final t needs to be
converted back to the feminine marker as it serves as an important clue for the
POS tagging step in processing.
This nominal feminine marker is difficult to disambiguate, since a t can be
followed by an enclitic in multiple cases: a singular feminine noun followed by
a possessive pronoun, or a verb followed by an object pronoun. Only in the
nominal case does the final t need to be lemmatized to a tā' marbūTa, ƫ. At this
stage of our processing, however, we do not have access to POS information.
Indeed, we need to restore the tā' marbūTa, ƫ, onto feminine nouns in order to aid
the POS tagging step.
In our running example, the stem Hsn if viewed as a lemma corresponds to the
adjective meaning good but in our example it is a noun meaning virtue as it is
pluralized and followed by a possessive enclitic pronoun. The lemma form of the
noun is Hsnƫ. The translation of and by their virtue would be wbHsnthm. After
running our tokenization step, the word will be rendered Hsnt +hm. The word
Hsnt without vocalization is a verb meaning to be good or to improve or she has
improved and in the context of the enclitic, this construction could be translated as
she/it improved them, where the enclitic would be the object pronoun. However,
the context of the proclitics preceding the word makes the only possible POS
associated with Hsnt a noun interpretation. But Hsnt does not exist as a noun
in MSA. Therefore, Hsnt should be converted to its lemma form Hsnƫ.¹⁰
We deal with the lemmatization of the feminine marker as another classification
problem. Feminine marker restoration (or feminine lemmatization) is hence a one-
of-three classification task. Each tokenized segment resulting from SVM-TOK is
labeled with a class indicating whether it has a word-final t, and if so, whether it
should be restored to an ƫ. Therefore, the sequence model has the following form:
Input: A sequence of tokens processed from left-to-right.
Context: A window of -2/+2 tokens centered at the focus token.
Features: Where the features for our tokenization classifier were position-
sensitive, the features for the feminine lemmatization classifier are instead bag-of-
n-grams features. Every character n-gram, for all n ≤ 4, that occurs in the focus
token is used as a feature. Additional features include the 5 tokens themselves,
and feminine lemma tag decisions for the previous tokens within the context.

⁹ Adjectives do not cliticize for enclitics.
¹⁰ The plural forms of feminine nouns do not undergo tā' marbūTa restoration, as plural lemmas typically end with At. Since we do not address inflectional morphology, the word HsnAt in our running example is left as is, i.e. we do not reduce it further to Hsnƫ and the plural marker At.
Tag Set: Three classes: KTT (keep the word-final t), NTT (not a word-final t),
and CTP (convert the word-final t to a ƫ, which is a p in Buckwalter encoding).
Example: Table 9.8 illustrates an example of the data representation for
training.
Table 9.8 shows five words and their corresponding n-gram-based features. The
verb AfAdt (declared/reported/announced) has a word-final t that should be kept
as is, hence the class label KTT. HSylt (result), on the other hand, is a feminine
singular noun that should have a word-final ƫ, since it is followed by an enclitic.
Therefore, it is labeled with the class CTP. In the ATB, the lemma form is rendered
in the parsed trees; however, for training purposes we deterministically convert the
noun-final ƫ to a regular t in the context of an enclitic possessive pronoun. As
illustrated in the example, the rest of the words, hA, nhAyƫ, rsmyƫ (it/she, ending,
official, respectively), do not end in a t in the first place; therefore they are labeled
NTT. nhAyƫ is a noun, but it is not in the context of an enclitic and is therefore left
intact. We refer to this module as SVM-LEM.
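The n-gram feature extraction can be sketched as follows; the sketch follows the prefix and suffix n-grams listed in Table 9.8, while the prose above says "every character n-gram", so the actual template may have been richer:

```python
def lem_features(token, max_n=4):
    """Character n-gram features for the focus token: the prefix and
    suffix n-grams up to length 4, as listed in Table 9.8."""
    feats = []
    for n in range(1, min(max_n, len(token)) + 1):
        feats.append(token[:n])    # prefix n-gram
        feats.append(token[-n:])   # suffix n-gram
    return feats

print(lem_features("AfAdt"))
# ['A', 't', 'Af', 'dt', 'AfA', 'Adt', 'AfAd', 'fAdt']
```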
9.6.1 Lemmatization Results
Again, for training we need to un-lemmatize the ATB: feminine nouns ending with
an ƫ in the ATB in the context of a suffix enclitic possessive pronoun are converted
to their stem forms. For example, the noun and its ensuing possessive pronoun,
Hsnth (his virtue) is listed in the ATB as Hsnƫ plus the suffix enclitic 3rd person
masculine singular h. For training purposes, we automatically convert the lemma
form Hsnƫ, as produced by the treebank annotators, to the stem form Hsnt.
Results shown here assume gold clitic tokenization. Table 9.9 shows the results
obtained with SVM-LEM against a simple random baseline, Rand-LEM. Rand-
LEM randomly converts word final t for nouns to ƫ. Rand-LEM has access to
basic category POS information such as noun, verb, adjective, but does not have
access to gender or number information. Hence in the case of Rand-LEM, the
system would convert a t final noun randomly to a ƫ. POS information is not one
of the features used by SVM-LEM.
Table 9.8. Training example for SVM-based feminine marker restoration module

Translit.  Features                        Class
AfAdt      A Af AfA AfAd t dt Adt fAdt     KTT
HSylt      H HS HSy HSyl t lt ylt Sylt     CTP
hA         h hA A hA                       NTT
nhAyƫ      n nh nhA nhAy ƫ yƫ Ayƫ hAyƫ     NTT
rsmyƫ      r rs rsm rsmy ƫ yƫ myƫ smyƫ     NTT
Table 9.9. SVM-LEM results (%) compared against a random baseline Rand-LEM
System Acc
Rand-LEM 92.16
SVM-LEM 99.84
9.6.2 Lemmatization Discussion
SVM-LEM significantly outperforms Rand-LEM, resulting in close to perfect
performance. This performance is an indication of the ease of the task. All the
remaining errors are due to confusion between proper nouns and verbal forms with a
feminine ending.
SVM-LEM correctly converts the word-final t to ƫ 99.2% of the time. However,
it overgeneralizes and converts some verb-final t's as well. SVM-LEM confuses
KTT with CTP 3.5% of the time. The confusion typically occurs in the context
of an enclitic object pronoun with a verb that has a word-final t, such as
AqAmt (held) and sqTt (slipped), as well as with proper names such as Ayfyrt.
9.7 Part of Speech Tagging
Part-of-Speech Tagging (POS) is the problem of assigning each token its correct part-
of-speech class such as verb, noun, adjective, etc. In this chapter, we used the ATB
Reduced Tag Set (RTS). The RTS is distributed with the ATB documentation from
the LDC and is illustrated in Table 9.10. This tag set reflects number for nouns and
some tense information for verbs. In the RTS, gender, definiteness, and case informa-
tion are lost for nouns and verbs, number information is lost for adjectives, and mood,
future tense and aspect information are lost for verbs. RTS comprises 24 POS tags,
reduced from the 139 morphosyntactic tags produced by the Buckwalter morphologi-
cal analyzer, BAMA. RTS was created for two reasons: to render parsing tractable,
and to enhance the compatibility between the English and Arabic Treebanks.
We model this task as a 1-of-24 classification task, where the class labels are
drawn from RTS. This is a sequence model similar to the SVM-TOK and SVM-
LEM. The model can be described as follows:
Input: A sequence of tokens processed from left-to-right.
Context: A window of –2/+2 tokens centered at the focus token.
Features: Every character n-gram, n ≤ 4, that occurs in the focus token; the 5
tokens themselves; their 'type' from the set {alpha, numeric}, indicating whether a
number is included in the token of interest; and POS tag decisions for previous
tokens within the context. The feature set is thus similar to the feature set adopted
for the SVM-LEM process, modulo the type feature.
Tag Set: The 24 tags are listed in Table 9.10.
Table 9.10. The ATB Reduced Tag Set (RTS) and their meanings

Tag    Meaning                 Tag      Meaning
CC     Conjunction             PRP$     Possessive Pronoun
CD     Number                  RB       Adverb
RP     Particle                UH       Interjection
DT     Determiner              VB       Imperative Verb
FW     Foreign word            VBD      Perfective Verb
IN     Preposition             VBN      Passive Verb
JJ     Adjective               VBP      Imperfective Verb
NN     Singular Noun           WP       Interrogative particle
NNS    Plural Noun             WRB      Interrogative adverb
NNP    Singular Proper Noun
NNPS   Plural Proper Noun
PRP    Pronoun                 NO_FUNC  Unknown
Example: Table 9.11 illustrates an input training example for the POS tagging task.
9.7.1 Part-of-speech Tagging Results
Results reported here assume gold tokenization and lemmatization. Table 9.12
shows the results obtained with the SVM-POS compared against those obtained
with a simple baseline, Baseline-POS. Baseline-POS assigns a test token the most
frequent POS tag associated with it as observed in the training data. If the token does
not occur in the training data, the token is assigned the NN tag as a default tag.
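A minimal sketch of this baseline, as a reconstruction (names are ours):

```python
from collections import Counter, defaultdict

def train_baseline(tagged_corpus):
    """tagged_corpus: iterable of (token, tag) pairs from the training data.
    Returns a map from each token to its most frequent POS tag."""
    counts = defaultdict(Counter)
    for token, tag in tagged_corpus:
        counts[token][tag] += 1
    return {tok: c.most_common(1)[0][0] for tok, c in counts.items()}

def baseline_tag(model, token):
    # Unseen tokens default to NN, as described above.
    return model.get(token, "NN")

model = train_baseline([("AfAdt", "VBD"), ("nhAyp", "NN"), ("AfAdt", "VBD")])
print(baseline_tag(model, "AfAdt"), baseline_tag(model, "xyz"))  # VBD NN
```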
Table 9.11. Training example illustrating the alpha vs. numeric type feature

Translit  Features                              Class
AfAdt     A Af AfA AfAd t dt Adt fAdt Alp       VBD
HSylt     H HS HSy HSyl t lt ylt Sylt Alp       NN
hA        h hA A hA Alp                         PRP$
nhAyƫ     n nh nhA nhAy ƫ yƫ Ayƫ hAyƫ Alp       NN
rsmyƫ     r rs rsm rsmy ƫ yƫ myƫ smyƫ Alp       JJ
l         l l Alp                               IN
100       1 10 100 0 00 100 Num                 NN
9.7.2 Part-of-speech Tagging Discussion
The performance of SVM-POS is significantly better than the baseline. Table 9.13
further illustrates the accuracy per POS tag together with their respective most
confusable POS tag.
Closely observing Table 9.13, we note that the worst results are those
for NO_FUNC, followed by VB and VBN. We experimented with changing
NO_FUNC to NNP, and the results went up to 97.6%. There are only 3 imperative
verbs, VB, in the test corpus, compared to 12 in the training corpus. However, the
VB category is highly confusable with NN due to similar contextual clues. VBN,
the passive verb, is confused with the imperfective verb, VBP, 21% of the time. With
the lack of diacritics marking passivization, VBN is a very hard category to predict.
Table 9.12. POS tagging results comparing Baseline-POS against SVM-POS

System        Acc%
Baseline-POS  92.2
SVM-POS       96.6
Table 9.13. Results per POS category and their respective confusable POS tag
Tag Acc % Highest Confusion Tag Confusion %
CC 99.9 NN 0.01
CD 97.1 NN 1.6
DT 100
IN 99.6 RP 0.18
JJ 93.6 NN 4.3
NNPS 100
NNP 91.6 NN 6.5
NNS 97.3 NN 1.3
NN 97 JJ 1.3
NO_FUNC 9.5 NN 29.5
PRP$ 98.2 PRP 1.8
PRP 97 PRP$ 2.7
PUNC 100
RB 92.9 IN 4
RP 90.1 IN 5.9
UH 76 WP 8
VBD 92.5 NN 4.7
VBN 54.1 VBP 21
VBP 96 VBD 1.9
VB 37.5 NN 50
WP 98.3 IN 1.2
WRB 74.1 IN 17.7
In passive forms, the initial character is marked with a Damma diacritic (the
consonant or long vowel is followed by a short vowel u), indicating passivization.
Hence, in newspapers for instance, even though diacritics and short vowels are
generally missing, one of the exceptions is the addition of the short vowel u onto
verbs to mark passives. The confusion arises from the fact that MSA is a pro-
drop language, which permits the absence of an overt subject; hence the lack of a
subject is not necessarily a signal for a passive construction.
RB and WRB are consistently confused with IN. Upon closer inspection of the
data, we note that the closed word classes in these two categories are the ones
consistently annotated as IN. The RB Htý (even) and the WRBs ȢndmA (at the
time), bȢdmA (after the time), and TAlmA (as long as) are all marked as IN.
Another two confusable categories are the NN and JJ POS tags. These two are
canonically confusable POS categories in many languages. The problem is more
pronounced in MSA, since any JJ may be an NN depending on context. In classical
Arabic, there is no real notion of the adjective as a stand-alone category. This is
evidenced by the fact that any adjective (a word describing or modifying a noun)
in Arabic can be used as a noun or a proper noun. This is reflected in the manual
annotations in the Treebank, where the annotators are not consistent due to this
fundamental confusion between the two categories: we see, for instance, that the
word for United in United States of America or United Nations is tagged sometimes
as a noun and sometimes as an adjective in the training data. This inconsistency in
annotation renders the two categories highly confusable.
Comparing the results to those obtained in our previous work (Diab et al.,
2004), we note a 1% overall improvement (from 95.5% to 96.6% accuracy),
even though we are using 4 times the data in these experiments. This indicates a
ceiling effect on the performance of such a language-independent approach that
does not exploit any of the language-specific rich morphological features available
in the data. Our results are not comparable to the work in (Khoja, 2001) or
(Habash & Rambow, 2005), since both studies use different evaluation metrics and
different tag sets.
9.8 Base Phrase Chunking
In this task, we use a setup similar to that of Kudo and Matsumoto (2000), where 11
types of chunked phrases are recognized using a phrase IOB tagging scheme: In-
side (I) a phrase, Outside (O) a phrase, and Beginning (B) of a phrase. The 11
chunk phrase types are: ADJP (Adjectival Phrase), ADVP (Adverbial Phrase),
CONJP (Conjunctive Phrase), INTJ (Interjective Phrase) such as oh, LST
(enumerated list) such as first, second, etc., PP (Prepositional Phrase), PRT
(Particle Phrase), NP (Noun Phrase), S (Sentential Unit), SBAR (Subordinate
Clause), and VP (Verb Phrase). Thus, in principle, the task is a one-of-23
classification task (since there are I and B tags for each chunk phrase type, and a
single O tag). However, in practice we only have 19 classes, since four of the I
phrases are not instantiated. Three of the
four cases constitute singleton words, INTJ, LST, and PRT. The fourth chunk is the
S category marking only the beginning of an S chunk. The 19 classes are as follows:
B-ADJP, I-ADJP, B-ADVP, I-ADVP, B-CONJP, I-CONJP, B-INTJ, B-LST, B-PP,
I-PP, B-PRT, B-NP, I-NP, B-S, B-SBAR, I-SBAR, B-VP, I-VP, O. The training
data is derived from the ATB using the ChunkLink software (Buchholz et al.,
1999).¹¹ ChunkLink flattens the tree to a sequence of base (non-recursive) phrase
chunks. For example, a token occurring at the beginning of a noun phrase is labeled
as B-NP. Table 9.14 illustrates the tagging scheme.
Input: A sequence of (word, POS tag) pairs.
Context: A window of –2/+2 tokens centered at the focus token.
Features: Word and POS tags that fall in the context along with previous IOB
tags within the context.
Tag Set: The tag set comprises 19 tags: B-ADJP, I-ADJP, B-ADVP, I-ADVP,
B-CONJP, I-CONJP, B-INTJ, B-LST, B-PP, I-PP, B-PRT, B-NP, I-NP, B-S,
B-SBAR, I-SBAR, B-VP, I-VP, O
9.8.1 Base Phrase Chunking Results
Reported results assume gold POS tags and gold tokenization.
9.8.2 Base Phrase Discussion
Table 9.14. Base phrase chunking tagging scheme

Tags             O    B-VP  B-NP  I-NP
MSA              Ϯ    ΖϠΎϗ  ΚϮή   ΰΘή΍Ϯη
Transliteration  w    qAlt  Rw    šwArtz
Gloss            And  Said  Ruth  Schwartz

¹¹ http://ilk.uvt.nl/~sabine/chunklink
The overall performance of SVM-BPC is Fβ=1 = 91.6, with an accuracy of 93.4%.
These results are interesting in light of the state of the art for English BPC,
which stands at an Fβ=1 score of 93.48 on five of the chunk types listed here,
compared against a baseline of 77.7, in the CoNLL 2000 shared task (Tjong Kim
Sang & Buchholz, 2000). Per-class F-scores are displayed in Table 9.15. The best
results obtained are for VP and PP, yielding Fβ=1 scores of 97.8 and 98.5,
respectively. There is room for improvement in the other categories, however. We
used the ChunkLink software as is, without special modification for the ATB. We
believe that adding features such as definiteness and gender should aid performance,
especially for the ADJP and NP categories. If we exclude the PRT, LST, INTJ and S
categories from the set of chunks to be discovered, the results go up to an Fβ=1 of 92.
Table 9.15. Results (%) for SVM-BPC per chunk
BPC Prec Rec Fβ=1
ADJP 72.6 56.1 63.3
ADVP 77.3 69 73.9
CONJP 81.8 85.7 83.7
INTJ 37.5 33.3 35.3
LST 0 0 0
NP 89.1 90.1 89.6
PP 98.4 98.7 98.5
PRT 92.8 93.2 93
S 50 26.1 34.3
SBAR 89.3 90 89.7
VP 98.8 96.8 97.8
Total 91.8 91.5 91.6
To further analyze the performance of SVM-BPC, we created a confusion
matrix for the 19 different classes. There are only two cases of LST in the test data
and they are always confused with NP. There are 9 cases of INTJ in the test data
and 3 of them are confused with an O class while another 3 are confused with an
NP class. Finally, the S class is at a low Fβ=1 of 34.3. An S tag is confused with an
SBAR in most cases (17 out of 23 times). This reflects some inconsistency in the
ATB annotations of the S and SBAR categories. Moreover, the S tag is by default a
recursive tag, which requires more context than the window of +/–2 utilized in our
current set-up.
9.9 Conclusions & Future Directions
We have presented a unified machine-learning approach using SVMs to solve the
problem of automatically annotating Arabic text with tags at different levels: clitic
tokenization at the morphological level (including the restoration of the word-final
feminine marker for singular nouns in the context of possessive enclitics); POS
tagging at the lexical level; and BP chunking at the syntactic level. The adopted
framework is language independent and yields highly accurate results for each
task. The task of clitic tokenization is performed at an Fβ=1 score of 99.1. Feminine
marker restoration is performed with 99.8% accuracy. Part-of-speech tagging
achieves 96.6% accuracy. The POS tagging results are not very far below the
best English POS tagging performance of 97.5% (Toutanova et al., 2003). Base-
phrase chunking achieves an Fβ=1 score of 91.6. To the best of our knowledge,
these are the best results reported for these tasks in Arabic natural language
processing. However, there is ample room for improvement using language-specific
features for the different processing levels. Compared to our previous study, with a
quarter of the data size used in these current experiments, we do not observe any
large improvements in any of the modules. This is an indication of a ceiling effect
when no language-specific information is included as features in our models. One
of the goals of this study was to create a language-independent baseline/lower
bound, to investigate how well these approaches can perform with little
knowledge of the language.
We are currently exploring means of incorporating the rich morphological
features of Arabic to gain performance. We are considering more features for the
clitic tokenization module, and entertaining the idea of automatically discovering
inflectional boundaries. We are varying the POS tag set to include definiteness,
gender and number information for the different tags. Finally, we are investigating
means of giving more depth to the BPC module, i.e. we are interested in more
recursive structure and what that entails for syntactic parsing in a discriminative
framework.
Acknowledgments
We would like to thank Antal van den Bosch and Nizar Habash for valuable
comments on earlier versions of this manuscript. This work was partially
supported by ARDA AQUAINT and by the National Science Foundation via a
KDD Supplement to NSF CISE/IRI/Interactive Systems Award IIS-9978025, and,
for the first and third authors, by the Defense Advanced Research Projects
Agency (DARPA) under Contract No. HR0011-06-C-0023.
References
Allwein, E. L., Schapire, R. E. & Singer, Y. (2000). Reducing multiclass to binary: A unifying
approach for margin classifiers. Journal of Machine Learning Research, 1, 113–141.
Buchholz, S., Veenstra, J. & Daelemans, W. (1999). Cascaded grammatical relation
assignment. In Proceedings of EMNLP/VLC (pp. 239–246).
Darwish, K. (2002). Building a shallow Arabic morphological analyser in one day. In
Proceedings of the ACL-02 Workshop on Computational Approaches to Semitic
Languages (pp. 47–54), Philadelphia, PA.
Diab, M., Hacioglu, K. & Jurafsky, D. (2004). Automatic tagging of Arabic text: From raw text
to base phrase chunks. In Proceedings of North American Association for Computational
Linguistics (NAACL, pp. 149–152).
Habash, N. & Sadat, F. (2006). Arabic preprocessing schemes for statistical machine
translation. In Proceedings of the North American chapter of the Association for
Computational Linguistics (NAACL, pp. 49–52).
Habash, N. & Rambow, O. (2005). Arabic Tokenization, Part-of-Speech tagging and
morphological disambiguation in one fell swoop. In Proceedings of the Association for
Computational Linguistics (ACL, pp. 573–580).
Hacioglu, K. & Ward, W. (2003). Target word detection and semantic role chunking using
support vector machines. In Proceedings of Human Language Technology and North
American Association for Computational Linguistics (HLT-NAACL, pp. 25–27).
Joachims, T. (1998). Text categorization with support vector machines: Learning with
many relevant features. In Proceedings of the 10th European Conference on Machine
Learning (ECML, pp. 137–142).
Khoja, S. (2001). APT: Arabic part-of-speech tagger. In Proceedings of the North
American Association for Computational Linguistics Student Workshop (pp. 20–25).
Lee, Y.-S., Papineni, K., Roukos, S., Emam, O. & Hassan, H. (2003). Language model based
Arabic word segmentation. In Proceedings of the 41st Meeting of the Association for
Computational Linguistics (pp. 399–406).
Maamouri, M., Bies, A. & Buckwalter, T. (2004). The Penn Arabic treebank: Building a
large-scale annotated Arabic corpus. In NEMLAR Conference on Arabic Language
Resources and Tools, Cairo, Egypt.
Ramshaw, L. A. & Marcus, M. P. (1995). Text chunking using transformation-based
learning. In Proceedings of the Association for Computational Linguistics Workshop
on Very Large Corpora (pp. 82–94).
Tjong Kim Sang, E. & Buchholz, S. (2000). Introduction to the CoNLL-2000 shared task:
Chunking. In Proceedings of the 4th Conference on Computational Natural Language
Learning (CoNLL, pp. 127–132).
Toutanova, K., Klein, D., Manning, C. & Singer, Y. (2003). Feature-Rich part-of-speech
tagging with a cyclic dependency network. In Proceedings of Human Language
Technology and North American Association for Computational Linguistics (HLT-
NAACL, pp. 252–259).
Vapnik, V. (1995). The Nature of Statistical Learning Theory. New York: Springer Verlag.
Kudo, T. & Matsumoto, Y. (2000). Use of support vector learning for chunk identification.
In Proceedings of the 4th Conference on Computational Natural Language Learning
(CoNLL, pp. 142–144).
10
Supervised and Unsupervised Learning
of Arabic Morphology
Alexander Clark
Department of Computer Science, Royal Holloway, University of London, Egham, Surrey TW20 0EX,
United Kingdom
alexc@cs.rhul.ac.uk
Abstract: The broken plural in Arabic is a canonical example of nonconcatenative morphology.
We discuss the supervised and unsupervised learning of this type of transduction
using different techniques, based on the use of stochastic transducers, trained with the
Expectation-Maximisation algorithm. A basic method for supervised learning using the
transducers is discussed and then a more advanced technique using a memory-based
learning technique with a distance derived from the Fisher kernel of the model. We then
discuss how these algorithms can be employed for unsupervised learning, modelling the
alignment between the strings as a hidden variable.
10.1 Introduction
In this chapter we discuss the problem of learning morphology, with particular
reference to the problem of non-concatenative morphology best exemplified by the
Arabic broken plural. We shall look at two formulations of the problem: one where
we have to learn the mapping or transduction from a base form to an inflected form
(learning from a set of pairs), and a second where we must learn from two sets of
words, one of which consists of base forms and one of which is composed of inflected
forms.
Existing approaches to these problems have focussed on the acquisition of
concatenative morphology – that is to say languages where the vast majority of the
morphological processes are based on the operation of the concatenation of strings.
For example, in English the present participle of verbs is formed by concatenating the
stem with the suffix “-ing”; “walk” becomes “walking”. Under the assumption that
a language uses exclusively concatenative morphology, the problem of learning the
morphology of the language reduces to finding a suitable segmentation of the strings
of the language. This approach has successfully been pursued by several researchers,
most notably Goldsmith (2001). However, many languages use non-concatenative
processes to a greater or lesser extent: even English has a residual process of vowel
change or ablaut for some irregular past tenses (ring/rang, for example). If one is inter-
ested in studying this phenomenon, or rather the learnability of this phenomenon,
then the obvious example to seek out is the Arabic broken plural, which we shall
discuss below. This is in many respects the canonical example of non-concatenative
morphology. Its great complexity means that an algorithm that can learn this mapping
is likely to be able to learn similar mappings in other languages that have a less
exaggerated form of the same process.
The motivation for studying these problems is to examine the sorts of algorithms
that can learn in an unsupervised way from naturally occurring data in the hope
that this will cast light on first language acquisition. Thus our interest in Arabic
arises from the particular challenges that it presents for learning algorithms. We
have chosen to explore the potential of algorithms that do not exploit any putative
innate domain-specific knowledge of language. Thus we shall restrict ourselves to
general-purpose learning algorithms, that are given no further information about
the problem, and where the only learning biases are those that are implicitly
defined by the learning algorithms themselves. Methodologically it is appropriate to
exhaust the possibilities of such algorithms before proceeding to less parsimonious
hypotheses.
As has been known for some time, finite-state methods are in large part adequate
to model morphological processes (Kaplan & Kay, 1994). A standard methodology
is that of two-level morphology (Koskenniemi, 1983), which is capable of handling
the complexity of Finnish, though it needs substantial extensions to handle non-
concatenative languages such as Arabic (Kiraz, 1994). These models are primarily
concerned with the mapping from deep lexical strings, that is to say from strings that
include abstract morphemes, to surface strings. Here we discuss algorithms that learn
transductions between pairs of uninflected and inflected words, that is to say between
pairs of surface forms. This characterisation of a morphological process as a trans-
duction from the lemma form to the inflected form is overly simplistic for a number
of reasons. Firstly, in Arabic as in many other languages the inflected form is not
purely phonologically specified, but is at least in part lexically specified. Secondly,
the same phonological principles operate in many different areas, and modelling each
transduction separately will inevitably fail to capture important
generalisations.
The base of our technique is a simple finite state model, but one that is stochastic.
We use a simple (generative) maximum likelihood technique to train the models,
starting from a randomly initialised model. This technique, while able to model the
individual transductions, turns out not to work very well on the task of selecting
which transduction should be applied, particularly with the mixture of regular and
irregular forms that is so characteristic of morphological processes in general. It
thus turns out to be necessary to use some more sophisticated learning technique
to achieve good results, at least with the comparatively small training sets that we
use here. Though there are a number of possible techniques that could be applied
in this case, we have chosen to apply one that has been used before in morphology.
Memory-based learning techniques, based on principles of non-parametric density
estimation, are a powerful form of machine learning well-suited to natural language
tasks. A particular strength is their ability to model both general rules and specific
exceptions in a single framework (Van den Bosch & Daelemans, 1999).
They have generally only been used in supervised learning techniques where a
class label or tag has been associated to each feature vector. Given these manual
or semi-automatic class labels, a set of features and a predefined distance function,
new instances are classified according to the class label of the closest instance.
However these approaches are not a complete solution to the problem of learning
morphology, since they do not directly produce the transduction. The problem must
first be converted into an appropriate feature-based representation and classified in
some way. The techniques presented here operate directly on sequences of atomic
symbols, using a much less articulated representation, and much less input infor-
mation.
While the work presented here is far from a complete analysis of the problem of
learning Arabic morphology, it does establish that it is possible to learn even the most
complex transductions without needing to add any domain-specific biases, but rather
using biases implicit in general-purpose learning algorithms.
We shall start by presenting the basic device we use to model these transduc-
tions. Section 10.2 describes the sort of stochastic transducer we shall use, as
well as the learning or training algorithms we use, and the inferencing process.
We then discuss in Section 10.3 the use of Fisher kernels and information
geometry. In Section 10.4 we describe the data we use, and present some exper-
iments. We then move on to discussing the problem of semi-supervised learning
(Section 10.5).
10.2 Stochastic Transducers
The modelling device we use in this chapter is the stochastic finite state transducer.
We will describe the basic model and then discuss the particular models that we use
in more detail. Formally, the transducers we use are a tuple $(A, B, S, p, q, s_0, s_f)$,
where $A$ and $B$ are finite non-empty sets that are the input and output alphabets,
$S$ is a finite set of states, and $s_0$ and $s_f$ are the initial and final states
respectively. $p$ and $q$ are the transition and output functions. $p$ is a function
from $S \times S$ to $[0, 1]$ such that for every state $s$

$$\sum_{s' \in S} p(s' \mid s) = 1 \qquad (1)$$

This represents the probability that the transducer will make a transition from state
$s$ to state $s'$. $q$ is an output function from
$S \times (A \cup \{\epsilon_A\}) \times (B \cup \{\epsilon_B\})$ to $[0, 1]$ such that
for every state $s$

$$\sum_{a \in A \cup \{\epsilon_A\}} \; \sum_{b \in B \cup \{\epsilon_B\}} q(a, b \mid s) = 1 \qquad (2)$$

where $\epsilon_A$ and $\epsilon_B$ are the empty strings over $A$ and $B$. This
represents the probability that, given that the model is in state $s$, it will output
the symbol $a$ on the input stream
and the symbol $b$ on the output stream. These normalisation requirements ensure
that this defines a probability distribution over $A^* \times B^*$.¹
There are a number of ways in which this differs from the finite state trans-
ducer as normally defined. First, these transducers are stochastic: they define a joint
distribution over the input and output strings, not a conditional distribution.
Secondly, they are non-deterministic. We have in no way restricted these so that the
underlying sequence of states is determined by the input or the output or even both.
Thus in general there may be exponentially many possible sequences of state transi-
tions given fixed input and output strings, or indeed an infinite number, though we
will later disallow this possibility. Thirdly, there is a symmetry in the way we have
defined this. Though the transductions are clearly not symmetric, since we might
have disjoint input and output alphabets, for every transducer $T$ from $A$ to $B$ it is
trivial to define a transducer $T'$ from $B$ to $A$ that defines the same distributions,
but with input and output swapped. Thus instead of using the terms input and output
for the two streams of symbols generated by the model, we shall sometimes use the
terms left and right streams. Fourthly, since we have added the possibility of
$\epsilon$ inputs and outputs, it is possible for the input and output strings to have
different lengths. Finally, note that we have attached the output function to states,
rather than to transitions. This brings the formalism much closer to that of a Hidden
Markov Model. Indeed, these models are sometimes called Pair Hidden Markov
Models (Durbin, Eddy, Krogh, & Mitchison, 1998).
The first modification we will make is that we consider the input and output
alphabets to be identical. In other applications of these learning techniques, for
example text-to-speech, one might have an input alphabet of letters, and an
output alphabet of phonemes, but in this domain we are interested in phoneme-
to-phoneme transductions. As a consequence of this we will also reduce the
parameter space. We will only allow output functions of the following three
types:
q11 functions which output the same character on both streams.
q10 functions which output a character on the left stream, and no character on the
right stream.
q01 functions which output a character on the right stream, and no character on
the left stream.
Thus we have removed two possibilities – producing an epsilon on both strings, and
producing different characters simultaneously on both streams. Of course, this latter
effect can be achieved by having a q10 output followed by a q01 output, but this will
require two distinct states, rather than one. The effect of these changes is to bias
the process towards the identity transduction. This means that randomly generated
transducers will tend to assign a high probability to pairs of strings that are similar,
in the sense that they have a small Levenshtein edit distance (Levenshtein, 1966)
from each other.

¹ An additional assumption is, strictly speaking, necessary: namely that every state reachable from the initial state can reach the final state with nonzero probability.
The normalisation requirement for the output functions then becomes that for
every state $s$

$$\sum_{c \in A} \left( q_{11}(c \mid s) + q_{10}(c \mid s) + q_{01}(c \mid s) \right) = 1$$
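As a concrete illustration of this parameterization, the following sketch builds a randomly initialised model with one transition distribution and one three-way output distribution per state; the class layout and names are assumptions, not the author's implementation:

```python
import random

class StochasticFST:
    """Sketch of the restricted pair model: each state has transition
    probabilities p(s'|s) and an output distribution spread over q11
    (copy one symbol to both streams), q10 (emit on the left only) and
    q01 (emit on the right only)."""

    def __init__(self, n_states, alphabet, seed=1):
        rng = random.Random(seed)
        self.alphabet = list(alphabet)
        self.n_states = n_states
        # p[s] is the distribution over successor states of s.
        self.p = [self._random_dist(n_states, rng) for _ in range(n_states)]
        # q[s] maps (kind, symbol) -> probability, kind in {'11','10','01'};
        # a single normalisation constraint per state, as in the text.
        self.q = []
        for _ in range(n_states):
            keys = [(k, c) for k in ("11", "10", "01") for c in self.alphabet]
            self.q.append(dict(zip(keys, self._random_dist(len(keys), rng))))

    @staticmethod
    def _random_dist(k, rng):
        xs = [rng.random() for _ in range(k)]
        z = sum(xs)
        return [x / z for x in xs]

m = StochasticFST(6, "abknuU")
assert abs(sum(m.q[0].values()) - 1.0) < 1e-9  # output normalisation holds
```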
Before we discuss the algorithm for learning the models, it is worth illustrating how
a model of this class could be used to represent a particular transduction. We shall
consider an abstraction of one of the forms of the Arabic broken plural, namely the
mapping from words with a skeleton of $C_1aC_2C_3$ to $C_1uC_2uwC_3$, where
$C_i$ represents a consonant; for example, from bank to bunuwk. The long vowels
are represented as single symbols internally, so "uw" is represented by "U". This
can be represented by a transducer with 6 emitting states. The transducer proceeds
through the states in order; the transitions in this case do not depend on the data at
all. The first state is a $q_{11}$ output which will output the same consonant on the
left and right. Next there is a $q_{10}$ state which outputs an a on the left, followed
by a $q_{01}$ state which outputs a u on the right. The fourth state is again a
$q_{11}$ output that copies the second consonant, followed by a $q_{01}$ state that
produces the long U, and finally the sixth state copies the final consonant.
If we label the states $q_1, \ldots, q_6$, together with an initial state $q_0$ and a
final state $q_f$, we can write $p(i, j)$ for the probability that the model will make
a transition from state $q_i$ to state $q_j$. The transition probabilities of our simple
model are $p(0, 1) = \cdots = p(i, i+1) = \cdots = p(6, f) = 1$, and zero otherwise;
and if there are $N$ consonants we can define the output functions, for a consonant
$c$, to be $q_{11}(c \mid 1) = q_{11}(c \mid 4) = q_{11}(c \mid 6) = 1/N$,
$q_{10}(a \mid 2) = 1$ and $q_{01}(u \mid 3) = q_{01}(U \mid 5) = 1$, and zero
otherwise. It is easy to see that this model defines a distribution over pairs of strings
that is nonzero only for the $N^3$ pairs of the form
$(C_1aC_2C_3, C_1uC_2uwC_3)$, where each $C_i$ is a consonant.
Though these models may be capable of representing these transductions, the
important issue is whether there are algorithms capable of producing adequate
models given data. The similarity of these models to Hidden Markov Models
naturally suggests learning algorithms. While the general problem of learning HMMs
is computationally hard (Abe & Warmuth, 1992), we can use a variant of the
expectation-maximisation algorithm (Baum & Petrie, 1966) to train them. This
requires an extension of the usual dynamic programming algorithms to work with
all possible alignments of input and output strings (Casacuberta, 1995; Clark, 2001;
Ristad & Yianilos, 1997).
We define the forward and backward probabilities as follows. Given two strings
$u_1, \ldots, u_l$ and $v_1, \ldots, v_m$, we define the forward probability
$\alpha_s(i, j)$ as the probability that the model will start from $s_0$, output
$u_1, \ldots, u_i$ on the left stream and $v_1, \ldots, v_j$ on the right stream, and
be in state $s$; and the backward probability $\beta_s(i, j)$ as the probability that,
starting from state $s$, it will output $u_{i+1}, \ldots, u_l$ on the left and
$v_{j+1}, \ldots, v_m$ on the right and then terminate, i.e. end in state $s_f$.
We can calculate these using the following recurrence relations:

$$\alpha_s(i, j) = \sum_{s'} \alpha_{s'}(i, j-1)\, p(s \mid s')\, q_{01}(v_j \mid s)
+ \sum_{s'} \alpha_{s'}(i-1, j)\, p(s \mid s')\, q_{10}(u_i \mid s)
+ \sum_{s'} \alpha_{s'}(i-1, j-1)\, p(s \mid s')\, q_{11}(u_i, v_j \mid s)$$

$$\beta_s(i, j) = \sum_{s'} \beta_{s'}(i, j+1)\, p(s' \mid s)\, q_{01}(v_{j+1} \mid s')
+ \sum_{s'} \beta_{s'}(i+1, j)\, p(s' \mid s)\, q_{10}(u_{i+1} \mid s')
+ \sum_{s'} \beta_{s'}(i+1, j+1)\, p(s' \mid s)\, q_{11}(u_{i+1}, v_{j+1} \mid s')$$
where, in these models, $q_{11}(u_i, v_j \mid s)$ is zero unless $u_i$ is equal to
$v_j$. Instead of the normal two-dimensional trellis discussed in standard works on
HMMs, which has one dimension corresponding to the current state and one
corresponding to the position, we have a three-dimensional trellis, with a dimension
for the position in each string. With these modifications, we can use all of the
standard HMM algorithms. In particular, we can use this as the basis of a parameter
estimation algorithm using expectation-maximisation: we use the forward and
backward probabilities to calculate the expected number of times each transition is
taken, and at each iteration we set the new values of the parameters to be the
appropriately normalised sums of these expectations.
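A sketch of the forward computation over this three-dimensional trellis is given below (self-contained and illustrative; the toy check uses a one-state copying model):

```python
def forward(u, v, n_states, p, q11, q10, q01, start=0):
    """alpha[i][j][s]: probability of having emitted u[:i] on the left and
    v[:j] on the right and being in state s. p[s][s0] = p(s|s0); the q
    tables map (symbol, state) to probabilities, with q11 requiring the
    same symbol on both streams."""
    l, m = len(u), len(v)
    alpha = [[[0.0] * n_states for _ in range(m + 1)] for _ in range(l + 1)]
    alpha[0][0][start] = 1.0
    for i in range(l + 1):
        for j in range(m + 1):
            for s in range(n_states):
                for s0 in range(n_states):
                    if j > 0:   # emit v[j-1] on the right only (q01)
                        alpha[i][j][s] += alpha[i][j-1][s0] * p[s][s0] * q01.get((v[j-1], s), 0.0)
                    if i > 0:   # emit u[i-1] on the left only (q10)
                        alpha[i][j][s] += alpha[i-1][j][s0] * p[s][s0] * q10.get((u[i-1], s), 0.0)
                    if i > 0 and j > 0 and u[i-1] == v[j-1]:  # copy (q11)
                        alpha[i][j][s] += alpha[i-1][j-1][s0] * p[s][s0] * q11.get((u[i-1], s), 0.0)
    return alpha

# Toy check: a single state that copies 'a' or 'b' with probability 0.5 each.
p = [[1.0]]
print(forward("ab", "ab", 1, p, {("a", 0): 0.5, ("b", 0): 0.5}, {}, {})[2][2][0])  # 0.25
```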
This maximum likelihood training approach can be used to iteratively improve a
model. It can form the basis of a learning algorithm: we start with a randomly
initialised model which we then train. This approach to learning has a number of
serious flaws, most importantly the fact that the likelihood surface is not convex,
and as a result there are numerous local maxima, some of which may be very poor
models. Thus with some simple models, for example probabilistic context-free
grammars, it has been known for some time that this combination is quite ineffective.
In our particular case we have further problems: since we are maximising the joint
probability, it is not necessarily even the case that the global maximum does in fact
model the transduction correctly. Still less is it necessarily the case that the local
maximum we in fact find, which may or may not coincide with the global maximum,
will model these transductions properly.
However, we have found that in fact this approach works extremely well for simple
transductions. Experimenting on the transduction $(C_1aC_2C_3 \rightarrow C_1uC_2UC_3)$
with a 10-state model, we find that over 90% of the time the model converges to a
model that defines the correct transduction. In a few pathological cases the model converges
to a local maximum with a substantially lower log likelihood. On a data set of sixty
examples, manually extracted from one of our data sets (PN) described below, the
log likelihood of the model converges as shown in Figure 10.1.

[Figure 10.1 appears here: negative log likelihood (NLL) plotted against the number of iterations, for runs converging to correct transductions ("good.dat") and to an incorrect transduction ("bad.dat").]
Fig. 10.1. The incorrect transduction (marked with crosses) ends up with a significantly worse
log likelihood than the correct transductions (solid lines). The figure shows negative log
likelihood plotted against the number of iterations. (Many correct transductions have been
removed to improve legibility.)
The other component of a solution is to be able to do inference with the models:
in this context this means that, given an FST and a string $u$, we need to find the
string $v$ that maximises $p(u, v)$. This is equivalent to the task of finding the most
likely string generated by an HMM, which is NP-hard (Casacuberta & de la Higuera,
2000), but it is possible to sample from the conditional distribution $p(v \mid u)$,
which allows an efficient stochastic computation. If we consider only what is output
on the left stream, the FST is equivalent to an HMM with null transitions
corresponding to the $q_{01}$ transitions of the FST. We can remove these using
standard techniques and then use this to calculate the left backward probabilities for
a particular string $u$: $\beta^L_s(i)$, defined as the probability that, starting from
state $s$, the FST generates $u_{i+1}, \ldots, u_l$ on the left and terminates. Then
if one samples from the FST, but weights each transition by the appropriate left
backward probability, it will be equivalent to sampling from the conditional
distribution $p(v \mid u)$. We can then find the string $v$ that is most likely given
$u$ by generating randomly from $p(v \mid u)$. After we have generated a number
of strings, we can sum $p(v \mid u)$ for all the observed strings; if the difference
between this sum and 1 is less than the maximum value of $p(v \mid u)$, we know
we have found the most likely $v$. In practice, the distributions we are interested in
often have a $v$ with $p(v \mid u) > 0.5$; in this case we immediately know that
we have found the maximum.
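The stopping criterion can be sketched as follows; sample_conditional and cond_prob stand for the weighted sampler and conditional probability described above, and are assumed interfaces rather than the author's code:

```python
import random

def most_likely_output(u, sample_conditional, cond_prob, max_samples=100):
    """Sample v ~ p(v|u) repeatedly; once the unseen probability mass
    (1 minus the mass of the distinct strings seen so far) is smaller
    than the best p(v|u) found, that v is provably the mode."""
    seen = {}
    for _ in range(max_samples):
        v = sample_conditional(u)
        if v not in seen:
            seen[v] = cond_prob(v, u)
        best_v, best_p = max(seen.items(), key=lambda kv: kv[1])
        if 1.0 - sum(seen.values()) < best_p:
            return best_v
    return max(seen.items(), key=lambda kv: kv[1])[0]

# Toy demonstration with a fixed conditional distribution.
dist = {"bunuwk": 0.7, "banuwk": 0.2, "bnk": 0.1}
rng = random.Random(0)
sample = lambda u: rng.choices(list(dist), weights=list(dist.values()))[0]
print(most_likely_output("bank", sample, lambda v, u: dist[v]))  # bunuwk
```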
The ability to model transductions is not enough, though. A complete model
must be able to decide which transduction to use. Having a very large model turns
out not to perform well. This is partly because the training maximises the joint
likelihood, rather than the conditional likelihood. It is also extremely inefficient, since
the number of parameters is quadratic in the number of states. One way to improve
the efficiency is to use a mixture of models as discussed in Clark (2001), each corre-
sponding to a morphological paradigm, or particular transduction. The productivity
of each paradigm can be directly modelled, and the class of each lexical item can
be parametrised. While this works after a fashion, there are a number of criticisms
that could be made of this approach. First, many of the models produced merely
memorise a single pair of strings, which is extremely inefficient. Secondly, although
the model correctly models the productivity of some morphological classes, it models
this directly; a more satisfactory approach would be to have this arise naturally as
an emergent property of other aspects of the model. Thirdly, these models may not
be able to account for some psycholinguistic evidence that appears to require some
form of proximity or similarity. Finally, getting adequate performance, even on very
simple models, requires a moderately complex smoothing algorithm. This could be
alleviated by having a representation for the symbols that is based on phonological
features, but in line with our knowledge-light approach we would like to avoid this,
particularly if the features are not purely acoustic.
In the next section we shall present a technique that addresses these problems.
10.3 Fisher Kernels and Information Geometry
The method used is a simple application of the information geometry approach intro-
duced by Jaakkola & Haussler (1999) in the field of bioinformatics. The central
idea is to use a generative model to define a finite-dimensional representation of
a symbol sequence; in particular a representation that is sensitive to the properties
of the data being modelled. Given a generative model for a string, one can use the
sufficient statistics of those generative models as features. The vector of sufficient
statistics can be thought of as a finite-dimensional representation of the sequence
in terms of the model. This transformation from an unbounded sequence of atomic
symbols to a finite-dimensional real vector is very powerful and allows the use of
Support Vector Machine techniques for classification. Jaakkola and Haussler (1999)
recommend that instead of the sufficient statistics, the Fisher scores are
used, together with an inner product derived from the Fisher information matrix of
the model. This has the satisfying consequence that the results are then invariant
when the parametrisation is changed.
when the parametrization is changed. The Fisher scores are defined for a data point
xand a particular model as
Ui
x=log p(x;θ)
∂θi
(3)
Supervised and Unsupervised Learning of Arabic Morphology 189
The partial derivative of the log likelihood is easy to calculate as a byproduct of the
E-step of the EM algorithm, and has the value for HMMs (Jaakkola, Diekhans, &
Haussler, 2000) of
Ui
x=E[zi|x]
θi
E[sj|x](4)
where ziis the indicator variable for the parameter i,andsjis the indicator value for
the state jwhere zileaves state j; the last term reflects the constraint that the sum of
the parameters must be one.
The kernel function is defined as

$$K(x, y) = U_x^{\top} I_{\theta}^{-1} U_y \qquad (5)$$

where $I_{\theta}$ is the Fisher information matrix.
This kernel function thus defines a distance between elements:

$$d(x, y) = \left( K(x, x) - 2K(x, y) + K(y, y) \right)^{1/2} \qquad (6)$$
This distance in the feature space then defines a pseudo-distance in the example
space.
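Using the unscaled sufficient statistics as sparse feature vectors, the kernel and the induced pseudo-distance of Equation 6 can be sketched as follows (an illustrative reconstruction; the feature dictionaries map parameter indices to expected counts):

```python
import math

def kernel(x_feats, y_feats):
    """Inner product of unscaled sufficient-statistic vectors, stored as
    sparse dicts mapping parameter index -> expected count."""
    return sum(x_feats[k] * y_feats[k] for k in x_feats.keys() & y_feats.keys())

def distance(x_feats, y_feats):
    """The pseudo-distance of Equation 6 induced by the kernel."""
    d2 = kernel(x_feats, x_feats) - 2 * kernel(x_feats, y_feats) + kernel(y_feats, y_feats)
    return math.sqrt(max(d2, 0.0))

print(distance({"t1": 1.0, "t2": 0.5}, {"t1": 1.0}))  # 0.5
```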
The name information geometry, which is sometimes used to describe this
approach, derives from a geometrical interpretation of this kernel. For a parametric
model with $k$ free parameters, the set of all these models will form a smooth
$k$-dimensional manifold in the space of all distributions. The curvature of this
manifold can be described by a Riemannian tensor; this tensor is just the expected
Fisher information for that model.
In spite of this compelling geometric explanation, there are difficulties with using
this approach directly. First, the Fisher information matrix cannot be calculated
directly; secondly, in natural language applications, unlike in bioinformatics
applications, we have the perennial problem of data sparsity, which means that
unlikely events occur frequently. This causes the scaling in the Fisher scores to give
extremely high weights to these rare events, which can skew the results. Accordingly,
this work uses the unscaled sufficient statistics.
The Fisher kernel can be compared to other string-based kernels. An advantage of
the Fisher kernel is that it can be sensitive to global properties of the strings, whereas
the string kernels can only be sensitive to substrings.
10.3.1 Details
Given a transducer that models the transduction from uninflected to inflected words,
we can extract the sufficient statistics from the model in two ways. We can consider
the statistics of the joint model $p(u,v \mid \Theta)$ or the statistics of the conditional model
$p(v \mid u, \Theta)$. Here we have used the conditional model, since we are interested primarily
in the change of the stem, and not the parts of the stem that remain unchanged. It
is thus possible to use either the features of the joint model or of the conditional
model, and it is also possible to either scale the features or not, by dividing by the
parameter value as in Equation 4. The second term in Equation 4, corresponding to
the normalization, can be neglected.
Based on the experiments reported previously (Clark, 2002) we have chosen the
unscaled conditional sufficient statistics for the rest of the experiments presented
here, which are calculated thus:
$$C_i((u,v)) = E[z_i \mid (u,v)] - E[z_i \mid u] \qquad (7)$$
Given an input string $u$ we want to find the string $v$ such that the pair $(u,v)$ is very
close to some element of the training data. We can do this in a number of different
ways. Clearly, if $u$ is already in the training set then the distance will be minimized
by choosing $v$ to be one of the outputs that is stored for input $u$; the distance in this
case will be zero. Otherwise we sample repeatedly (here we have taken 100 samples)
from the conditional distribution of each of the submodels. In practice this seems to
give good results, though there are more principled criteria that could be applied.
We are using a k-nearest-neighbor rule with $k = 1$, since there are irregular
words that have completely idiosyncratic inflected forms. It would be possible to
use a larger value of $k$, which might help with robustness, particularly if the token
frequency was also used, since irregular words tend to be more common.
In summary, the algorithm proceeds as follows (a sketch in code follows the list):

1. We train a small FST on the pairs of strings using the EM algorithm.
2. We derive from this model a distance function between two pairs of strings that is sensitive to the properties of this transduction.
3. We store all of the observed pairs of strings.
4. Given a new word, we sample repeatedly from the conditional distribution to get a set of possible outputs.
5. We select the output such that the input/output pair is closest to one of the observed pairs.
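The following Python sketch puts the five steps together; the model object, with sample and features methods standing in for the trained transducer and the statistics of Equation 7, is a placeholder of our own, not part of the original system.

def mbl_inflect(train_pairs, u, model, n_samples=100):
    """Sketch of the memory-based inflection procedure described above.
    `model.sample(u)` is assumed to draw an output string from the conditional
    distribution; `model.features(u, v)` is assumed to return the feature
    vector for a pair (e.g. the statistics of Equation 7)."""
    # If the input is already stored, return a stored output (distance zero).
    stored = [v for (u2, v) in train_pairs if u2 == u]
    if stored:
        return stored[0]
    # Otherwise sample candidate outputs from the conditional distribution.
    candidates = {model.sample(u) for _ in range(n_samples)}
    def dist(f, g):
        return sum((a - b) ** 2 for a, b in zip(f, g)) ** 0.5
    # 1-nearest-neighbour: pick the candidate whose (u, v) pair lies closest
    # to some stored training pair under the feature-space distance.
    best, best_d = None, float("inf")
    for v in candidates:
        f = model.features(u, v)
        d = min(dist(f, model.features(u2, v2)) for (u2, v2) in train_pairs)
        if d < best_d:
            best, best_d = v, d
    return best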
10.4 Experiments
The data sets used in the experiments are summarized in Table 10.1. We have
included a standard data set for a problem from English morphology to provide a
comparison. This data (Ling, 1994) is drawn from the English past tense, a problem
Table 10.1. Summary of the data sets. LING is drawn from Ling (1994), PN from Plunkett
& Nakisa (1997) and MP from McCarthy & Prince (1990)

Label  Language / task        Total size  Train  Test
LING   English past tense     1394        1251   140
PN     Arabic plural          859         773    86
MP     Arabic broken plural   3261        2633   293
that has exerted a fascination out of all proportion to its interest. It consists of pairs
of strings of the base form and past form in UNIBET phonetic transcription. We use
two Arabic data sets. The first is a data set prepared for Plunkett & Nakisa (1997). It
consists of pairs of singular and plural nouns, in Modern Standard Arabic, randomly
selected from the standard Wehr dictionary in a fully vocalized ASCII transcription.
It has a mixture of broken and sound plurals, and has been simplified by removing
forms of the broken plural that occurred below a certain frequency threshold. The
second, prepared by McCarthy & Prince (1990), consists solely of broken plurals,
with all variant plurals added.
10.4.1 Evaluation
We used 10-fold cross-validation on all of these data sets. We compared the performance
of the models when used directly to model the transduction via the
conditional likelihood (CL) with that of the MBL approach with the unscaled
conditional features. Based on these results, we chose the unscaled conditional
features; subsequent experiments confirmed that these performed best.
The results are summarized in Table 10.2. Run-times for these experiments were
from about 1 hour to 1 week on a standard workstation.² There are a few results to
which these can be directly compared; on the LING data set, Mooney & Califf (1995)
report figures of approximately 90% using a logic program that learns decision lists
for suffixes. For the Arabic data sets, Plunkett & Nakisa (1997) do not present results
on modelling the transduction on words not in the training set; however they report
scores on the easier task of merely selecting the class of output of 63.8% (0.64%)
using a neural network classifier. The data is classified according to the type of the
plural, and is mapped onto a syllabic skeleton, with each phoneme represented as
a bundle of phonological features. We have observed in further experiments that
the MBL approach significantly outperforms the conditional likelihood method over
a wide range of experiments.
Table 10.2. Results. CV is the degree of cross-validation, Models gives the number of
components in the mixture, CL gives the percentage correct using the conditional
likelihood evaluation, and MBLSS using memory-based learning with sufficient statistics,
with the standard deviation in brackets

Data set  CV  Models  States  Iterations  CL          MBLSS
LING      10  1       10      10          61.3 (4.0)  85.8 (2.4)
          10  2       10      10          72.1 (2.0)  79.3 (3.3)
PN        10  1       10      10          0.6 (0.8)   15.4 (3.8)
          10  5       10      10          9.2 (2.9)   31.0 (6.1)
          10  5       10      50          11.3 (3.3)  35.0 (5.3)
MP        10  5       10      10          1.6 (0.6)   16.7 (1.8)
² 1 GHz Pentium processor.
The performance on the training data is a further difference: the MBL approach scores close to 100%, whereas
the CL approach scores only a little better than it does on the test data. It is certainly possible to make
the conditional likelihood method work rather better than it does in this chapter by
paying careful attention to the convergence criteria of the models to avoid overfitting,
and by smoothing the models carefully. In addition, some sort of model size selection
must be used. A major advantage of the MBL approach is that it works well without
requiring extensive tuning of the parameters.
In terms of the absolute quality of the results, this depends to a great extent on
how phonologically predictable the process is. When it is completely predictable,
the performance approaches 100%; similarly, a large majority of the less frequent
words in English are completely regular, and accordingly the performance on the
English past tense data is very good. However, in other cases, where the morphology
is very irregular, the performance will be poor. In particular, with the Arabic data
sets, the PN data set is very small compared to the complexity of the process being
learned, and the MP data set is rather noisy, with a large number of erroneous
transcriptions.
10.5 Semi-supervised Learning
We now move on from the task of learning from pairs of words, to a slightly less
idealised task that approaches the problem of completely unsupervised learning.
A number of different approaches to the unsupervised learning of morphology
have been presented in the past few years (Goldsmith, 2001; Schone & Jurafsky,
2000). Though they achieve impressive results, they share a common failing: an
a priori limitation on the form of the morphological transductions that can be
modelled, restricted to simple concatenation, and often only suffixation. This is
clearly undesirable, since many non-Indo-European languages, and some Indo-
European ones as well, notably German, use other inflectional processes. A related
limitation is that these approaches can only learn regular morphology. Though these
systems perform well within their limitations, a general language learning system
cannot make these sorts of assumptions. Supervised learning algorithms on the
other hand, are capable of learning irregularities and non-concatenative morphology
(Clark, 2001; Mooney & Califf, 1995; Rumelhart & McClelland, 1986). What is
desirable is to have a means for turning a supervised acquisition model into an
unsupervised acquisition model. In this section, we discuss a general algorithm for
doing this, and show how this can be applied using the stochastic transducers we
have been using. In this case we will not be using the Fisher kernel method.
This work is closely related to that of Yarowsky & Wicentowski (2000). They are
concerned with integrating diverse sources of information; here we are concerned
with the correct application of a single source of information, namely the surface
forms of each word. As they say (Yarowsky & Wicentowski, 2000, p. 207):
But for many languages, and to a quite practical degree, inflectional morphological
analysis and generation can be viewed primarily as an alignment task
on a broad coverage word list
Accordingly our approach is to use the stochastic model of the joint probability of
the pair of strings, that we have used up to now, and to treat the alignment between
the strings as a further hidden variable which can be modelled again with the EM
algorithm. A clean and well-motivated treatment of this will allow integration of this
into more complex and broader language acquisition systems. We do not present a
solution to the general problem of unsupervised learning of morphology here: rather
we consider a slightly easier task, that we call partially supervised learning: this is
where the learner is presented with two sets of strings, and must work out what the
relationship is between them.
10.5.1 Perfect Situation
We will start with an artificially simple situation. Let us suppose we have two sets
of strings $U$ and $V$ of the same size. We wish to align them, i.e. find a bijection
between them, and simultaneously train a model on the aligned data. We can model
this as a stochastic process in two stages: first we generate $n$ pairs of strings,
and then we generate a permutation of the second set that shuffles them. We can
model the permutation as a hidden variable $X$ that takes one of the $n!$ permutations
as its value. We can consider this as an $n \times n$ permutation matrix such that
$X_{ij} = 1$ if the $i$th element of $U$ is aligned with the $j$th element of $V$, and is zero
otherwise.
$$p(U,V) = \sum_X p(X)\, p(U,V \mid X) \qquad (8)$$
Since we have no reason to prefer one permutation rather than another, we set
$p(X) = \frac{1}{n!}$. The probability given the alignment is just the product of the probabilities of the
matching pairs,
$$p(U,V \mid X) = \prod_i p(u_i, v_{X(i)}) \qquad (9)$$
If we consider the matrix $P$ which has as its $i,j$ element $p(u_i, v_j)$, the sum over the $n!$
permutations is called in linear algebra the permanent of the matrix (Bhatia, 1996).
$$\mathrm{Per}(P) \overset{\mathrm{def}}{=} \sum_{\sigma} \prod_i P_{i\sigma(i)} \qquad (10)$$

where $\sigma$ ranges over all the permutations of $n$ elements.
It is similar to the more familiar determinant but without the alternating signs. One
problem is that it is not possible to calculate this efficiently (Barvinok, 1999). There
is a great deal of active research in this area, though, as it is possible to encode various
combinatorial problems into an appropriate matrix.³
³ For example, finding the number of maximum matchings in a bipartite graph.
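As an illustration only, the permanent can be computed by brute force for tiny matrices; the following sketch enumerates all $n!$ permutations, which is exactly the intractability the text refers to.

from itertools import permutations
from math import prod

def permanent(P):
    """Brute-force permanent (Equation 10): like the determinant, but
    summing the products over all permutations without alternating signs.
    Exponential in n, hence only feasible for tiny matrices."""
    n = len(P)
    return sum(prod(P[i][sigma[i]] for i in range(n))
               for sigma in permutations(range(n)))

print(permanent([[1, 2], [3, 4]]))  # 1*4 + 2*3 = 10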
To train the model with the EM algorithm we need to be able to calculate the
posterior expectation of $X_{ij}$ given the data and the model. Since $X_{ij}$ is one or zero,
the expectation equals the probability that $i$ is aligned with $j$.
$$p(X_{ij} \mid U,V) = \frac{\sum_{X : X_{ij}=1} p(U,V \mid X)}{\sum_X p(U,V \mid X)} \qquad (11)$$
The denominator of this fraction is the permanent of the matrix $P$, i.e. the sum over
all $n!$ permutations. The numerator is the sum of the $(n-1)!$ of those permutations
that have $X_{ij} = 1$. This is the product of $P_{ij}$ with the permanent of the $ij$-minor⁴
of $P$.
Thus we can write
$$E[X_{ij} \mid U,V] = \frac{P_{ij}\, \mathrm{Per}(P^{ij})}{\mathrm{Per}(P)} \qquad (12)$$
We can consider these posterior probabilities as a matrix, which will be doubly
stochastic. This map from the matrix of probabilities to the matrix of posteriors is
sometimes called the Bregman map (Bregman, 1967). Though intractable to compute
exactly, we can approximate it under certain circumstances using the technique of
Sinkhorn balancing (Sinkhorn, 1964), as advocated by Beichl & Sullivan (1999).
This converges rapidly (Soules, 1991), giving an overall complexity of $O(n^3)$, which is
tractable for matrices with dimensions of around 1000, such as we use here. The method of
Sinkhorn balancing is intuitively quite straightforward; we want to scale a positive
matrix so it is doubly stochastic. If we normalise the row sums, we will have a
matrix that is row-stochastic, i.e. has its row sums equal to unity, but not necessarily
column-stochastic. If we then normalise the column sums, we will have a matrix
that is column-stochastic but probably not row-stochastic. If we continue in this
way, alternating the normalisation of the rows and of the columns, we converge to a
doubly stochastic matrix which under certain circumstances is an approximation to
the Bregman map. We now have the basis for an algorithm.
1. Choose a random model.
2. Calculate the matrix of $p(u_i, v_j)$.
3. Estimate the matrix of the posteriors, using Sinkhorn balancing.
4. Train the model on every pair $(u_i, v_j)$, weighting by the value of the posterior probability.
5. Repeat from step 2, until the posterior probability matrix is (very close to) a
permutation matrix.
Theoretically we know this will converge by the EM algorithm and empirically we
observe that the matrix of posteriors rapidly converges to a permutation matrix.
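A minimal NumPy sketch of Sinkhorn balancing as just described; the matrix values are invented for illustration, and this is not the chapter's own code.

import numpy as np

def sinkhorn(P, n_iter=1000, tol=1e-9):
    """Approximate the matrix of posteriors E[X_ij | U, V] (Equation 12) by
    alternately normalising the rows and the columns of a positive matrix P
    until it is (nearly) doubly stochastic."""
    M = np.array(P, dtype=float)
    for _ in range(n_iter):
        M /= M.sum(axis=1, keepdims=True)   # make rows sum to one
        M /= M.sum(axis=0, keepdims=True)   # make columns sum to one
        if np.allclose(M.sum(axis=1), 1.0, atol=tol):
            break
    return M

# A matrix of pair probabilities p(u_i, v_j); the strong diagonal means the
# posterior should converge towards the identity permutation.
P = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8]])
print(sinkhorn(P).round(2))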
⁴ The $ij$-minor of a matrix is the matrix formed by removing the row and column containing
the $ij$ element, to form an $(n-1)$ by $(n-1)$ matrix.
10.5.2 Imperfection
This is obviously a highly artificial situation. More interesting cases are where we
have two sets that are not the same size, with two subsets that we want to align.
More formally, we have two sets of strings $U$ and $V$, of size $m$ and $n$ respectively, with
subsets $U' \subseteq U$ and $V' \subseteq V$, both of size $k$. We then have three models: a model for
$U$, a model for $V$, and a model for a joint distribution over $U \times V$. We then have
a hidden variable $X$ which corresponds to the selection of the subsets and the bijection
between $U'$ and $V'$. Given a value of $X$ we can write the total likelihood function as:

$$p(U,V \mid X) = \prod_{u \in U \setminus U'} p_U(u) \prod_{v \in V \setminus V'} p_V(v) \prod_{(u,v) \in X} p_M(u,v)^{\alpha} \qquad (13)$$
We now have a certain degree of flexibility in how we define p(X); we can make it
depend on the size of the sets that are selected. We can use this to make some of the
calculations more tractable.
This algorithm allows one to trade off the gain from aligning two words against not
aligning them, by including models for the information in the individual sequences
(Allison, Powell, & Dix, 1999). We can tweak it if need be by raising the joint model
probability to a power $\alpha \in [1,2]$. This will have the effect of making it less likely
to align words; if $\alpha$ is 1, then the model is likely to align words even when they have little
relation, since $p_M(u,v)$ can in general be at least $p_U(u)\,p_V(v)$ for related $u$ and $v$.
Conversely, values of $\alpha$ close to 2 will mean that the model will only align the words
if there is a very strong link between them. Thus $\alpha$ is a tunable parameter that allows
us to adjust the recall/precision trade-off.
Suppose $U$ has $m$ elements and $V$ has $n$ elements; then we have an $m \times n$ matrix
of the joint probabilities, which we can call $P_M$. We can also define an $m \times m$ diagonal
matrix, corresponding to the probabilities according to the model of $U$, which we
can call $P_U$, and an $n \times n$ diagonal matrix for the probabilities of $V$, $P_V$. If we also
create a matrix of ones of the appropriate size, $P_1$, then we can form an $(m+n)$-square
matrix thus:

$$M = \begin{pmatrix} P_U & P_1 \\ P_M & P_V \end{pmatrix} \qquad (14)$$
Then every permutation of this matrix corresponds to a particular choice of the
alignment, and we can use exactly the same techniques for estimating the posterior
probabilities on this matrix as we did before. There is one substantive difference,
which is that because of the block of ones in the top right, alignments that align
$k$ elements of $U$ and $V$ together will have a "bonus" factor of $k!$, corresponding to the $k!$
paths through the $k \times k$ submatrix of $P_1$. This is generally good, since we want to
encourage the algorithm to align as much as possible. We can accommodate this
formally by making $p(X)$ in our generative model be proportional to $k!$. We then
Table 10.3. Matrix of probabilities for U = {cat, dog, fox} and V = {cats, dogs}

pU(cat)        0              0              1          1
0              pU(dog)        0              1          1
0              0              pU(fox)        1          1
pM(cat,cats)   pM(dog,cats)   pM(fox,cats)   pV(cats)   0
pM(cat,dogs)   pM(dog,dogs)   pM(fox,dogs)   0          pV(dogs)
train all three models, weighting the probabilities by the appropriate values from the
posterior matrix.
A simple example from English will clarify: suppose $U = \{cat, dog, fox\}$ and
$V = \{cats, dogs\}$. Table 10.3 shows the resulting composite matrix.
Now each permutation of this matrix will correspond to a particular alignment, and
the probability given that alignment will be the product of the appropriate elements
of the matrix. The identity matrix will correspond to none being aligned, and will
have probability equal to the product of the elements along the diagonal of the matrix
in Table 10.3, i.e.
$$p(U,V \mid X) = p_U(\mathit{cat})\, p_U(\mathit{dog})\, p_U(\mathit{fox})\, p_V(\mathit{cats})\, p_V(\mathit{dogs}) \qquad (15)$$
If cat and dog are aligned correctly (so $k = 2$), there are $2!$ matrices, which are shown
here:

$$\begin{pmatrix} 0&0&0&1&0 \\ 0&0&0&0&1 \\ 0&0&1&0&0 \\ 1&0&0&0&0 \\ 0&1&0&0&0 \end{pmatrix} \qquad (16)$$

and also

$$\begin{pmatrix} 0&0&0&0&1 \\ 0&0&0&1&0 \\ 0&0&1&0&0 \\ 1&0&0&0&0 \\ 0&1&0&0&0 \end{pmatrix} \qquad (17)$$
each of which will have probability
$$p(U,V \mid X) = p_M(\mathit{cat},\mathit{cats})\, p_M(\mathit{dog},\mathit{dogs})\, p_U(\mathit{fox}) \qquad (18)$$
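To illustrate the composite matrix construction, the following sketch builds the matrix of Equation 14 for the cat/dog/fox example of Table 10.3, with the exponent α applied to the joint block; all the probability values below are invented for illustration.

import numpy as np

def composite_matrix(P_U, P_V, P_M, alpha=1.0):
    """Build the (m+n)-square matrix of Equation 14: diagonal blocks for
    the single-set models, a block of ones, and the joint-model block
    (oriented as in Table 10.3) raised to the tunable exponent alpha."""
    m, n = len(P_U), len(P_V)
    M = np.zeros((m + n, m + n))
    M[:m, :m] = np.diag(P_U)                # p_U: words of U left unaligned
    M[:m, m:] = 1.0                         # block of ones
    M[m:, :m] = np.array(P_M).T ** alpha    # joint model
    M[m:, m:] = np.diag(P_V)                # p_V: words of V left unaligned
    return M

# Hypothetical probabilities for U = {cat, dog, fox}, V = {cats, dogs}.
P_U = [0.01, 0.01, 0.01]           # p_U(cat), p_U(dog), p_U(fox)
P_V = [0.01, 0.01]                 # p_V(cats), p_V(dogs)
P_M = [[0.005, 1e-6],              # p_M(cat, cats), p_M(cat, dogs)
       [1e-6, 0.005],              # p_M(dog, cats), p_M(dog, dogs)
       [1e-6, 1e-6]]               # p_M(fox, ...)
print(composite_matrix(P_U, P_V, P_M, alpha=1.5).round(6))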
10.5.3 Experiments with Arabic
We prepared two data sets from Plunkett & Nakisa (1997); first, we prepared a data
set (PN1) merely by collecting all of the singulars into one set and all of the plurals
Supervised and Unsupervised Learning of Arabic Morphology 197
into another. This is clearly unrealistically simple, and so we also prepared a data set
by randomly removing half of the singulars and half of the plurals, to see whether the
algorithm could correctly match up the pairs in the absence of the missing data. This produced
a data set (PN2) of size 443/430, with 215 possible pairs to be aligned. The results of
these tests are summarised in Table 10.4. The experiments on the second data
set were run with various values of the exponent α to see the effect of the exponent
on precision and recall. As expected, the algorithm performed well on the initial data
set, aligning with high accuracy and precision. On the imperfect data set, PN2, the
effect of the exponent is quite marked. With the exponent at 1.0, the precision is
very poor, but as the exponent is increased the precision increases rapidly. Given a
larger data set, it would be possible to repeatedly apply the algorithm to the same
data, removing at each time the pairs already aligned, thus potentially combining a
number of high precision models into a high recall one.
The small number of states employed is for efficiency purposes; it is clearly
too small to allow correct modelling of these processes.
There is very little work that is directly comparable: there has been no prior
work on the unsupervised learning of Arabic. However we can compare the results
in English to (Yarowsky & Wicentowski, 2000). They use a number of different
sources of information, and have results ranging from 31.3% using only Levenshtein
distance after 1 iteration to 99.2 % for the final model combining all sources of
information including frequency and semantic information. De Roeck & Al-Fares
(2000) present an algorithm for identifying Arabic roots that uses a language-specific
distance function; we note also the work of Rogati, McCarley, & Yang (2003), who
use a small aligned bilingual corpus to learn a stemmer. Similarly, Yarowsky &
Wicentowski (2000) use a weighted edit distance as one component of their model,
also performing a Viterbi approximation to the EM re-estimation; for a particular
language it will always be possible to perform well with a simpler algorithm using
prior knowledge about the language in question. This is clearly not an option in
cognitive modelling, since a model must work with all languages without language-specific
information.
The algorithms presented here are comparatively slow in their naive form since
we have to compute all the elements of the matrix, so the complexity is O(|U||V|).
A simple optimisation could be used to avoid having to compute pairs that are
obviously not related. Of course, went is radically different from go, and yet went
Table 10.4. Semi-supervised learning results. The second set of results on the PN2 data set
shows the effect the exponent has on the precision and recall

Data  States  α     U    V    Pairs  Correct  Incorrect  Precision  Recall
PN1   10      1.0   859  859  859    820      27         96.8       95.5
PN2   10      1.0   443  430  215    157      405        38.8       73.8
PN2   10      1.5   443  430  215    140      25         84.8       65.1
PN2   10      1.75  443  430  215    78       2          97.5       36.2
PN2   10      2.0   443  430  215    2        0          100.0      0.9
is the correct past tense, so this approach will introduce errors. Handling this sort
of complete stem suppletion seems likely to require other sorts of information:
frequency and semantic information are the obvious candidates. Since suppletion
tends to occur infrequently and with very frequent words, these should suffice.
This may allow an understanding of the prevalence of phonological transparency in
natural languages.
10.6 Conclusion
We have discussed the learning of Arabic morphology in a knowledge-light framework,
motivated primarily by the study of first language acquisition. The interesting non-concatenative
broken plural is an excellent test bed for theories that aim to account
for the acquisition of morphology. English is simply too trivial to be a suitable
test bed, since almost any algorithm can work; Arabic requires more sophisticated
techniques. We do not put this forward directly as a cognitive model, but only indirectly
as an exemplar of a class of algorithms that may be capable of learning complex
morphological transductions, and thus are at least potentially of explanatory value.
We have studied the learnability under two different assumptions. The first is
where the learner is presented with a set of pairs of base and inflected form. This is
implausibly easy, as the most that we can hope for is that an early learner will be able to
identify word classes (Clark, 2003). Accordingly we also study what we call partially
or semi-supervised learning where the learner is presented with a pair of sets of
words and must also align them. We show how stochastic transducers can be used to
learn under both of these two situations, and how a memory based learning technique
together with an information geometry based approach can enhance performance.
The underlying unpredictability of the morphological processes that we study does
mean that the final accuracy of the system is inevitably rather low, since the extent
to which the choice of plural is phonologically specified is an upper bound on the
performance.
Acknowledgements
I am grateful to Alexander Barvinok for pointing out to me the work of Beichl and
Sullivan. I am also grateful to John McCarthy and Ramin Nakisa for allowing me
to use their data. Thanks also to Bill Keller, Eric Gaussier and others for helpful
comments. This work was originally done as part of the EU TMR network Learning
Computational Grammars.
References
Abe, N., & Warmuth, M. K. (1992). On the Computational Complexity of Approximating Distributions by Probabilistic Automata. Machine Learning, 9, 205–260.
Allison, L., Powell, D., & Dix, T. I. (1999). Compression and Approximate Matching. The Computer Journal, 42(1), 1–10.
Barvinok, A. I. (1999). Polynomial time algorithms to approximate permanents and mixed discriminants within a simple exponential factor. Random Structures and Algorithms, 14, 29–61.
Baum, L. E., & Petrie, T. (1966). Statistical Inference for Probabilistic Functions of Finite State Markov Chains. Annals of Mathematical Statistics, 37, 1554–1563.
Beichl, I., & Sullivan, F. (1999). Approximating the Permanent via Importance Sampling with Application to the Dimer Covering Problem. Journal of Computational Physics, 149(1), 128–147.
Bhatia, R. (1996). Matrix Analysis. Berlin: Springer Verlag.
Bregman, L. M. (1967). Proof of Convergence of Sheleikhovskii's Method for a Problem with Transportation Constraints. Zh. vychisl. Mat. mat. Fiz., 147(7).
Casacuberta, F. (1995). Probabilistic Estimation of Stochastic Regular Syntax-directed Translation Schemes. In Proceedings of the VIth Spanish Symposium on Pattern Recognition and Image Analysis, pp. 201–207.
Casacuberta, F., & de la Higuera, C. (2000). Computational Complexity of Problems on Probabilistic Grammars and Transducers. In Oliveira, A. L. (Ed.), Grammatical Inference: Algorithms and Applications, pp. 15–24. Berlin: Springer Verlag.
Clark, A. (2001). Learning Morphology with Pair Hidden Markov Models. In Proc. of the Student Workshop at the 39th Annual Meeting of the Association for Computational Linguistics, pp. 55–60, Toulouse, France.
Clark, A. (2002). Memory-Based Learning of Morphology with Stochastic Transducers. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 513–520.
Clark, A. (2003). Combining Distributional and Morphological Information for Part of Speech Induction. In Proceedings of the Tenth Annual Meeting of the European Association for Computational Linguistics (EACL 2003), pp. 59–66.
De Roeck, A. N., & Al-Fares, W. (2000). A Morphologically Sensitive Clustering Algorithm for Identifying Arabic Roots. In COLING-2000, pp. 199–206.
Durbin, R., Eddy, S., Krogh, A., & Mitchison, G. (1998). Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge, UK: Cambridge University Press.
Goldsmith, J. A. (2001). Unsupervised Learning of the Morphology of a Natural Language. Computational Linguistics, 27(2), 153–198.
Jaakkola, T. S., Diekhans, M., & Haussler, D. (2000). A Discriminative Framework for Detecting Remote Protein Homologies. Journal of Computational Biology, 7(1/2), 95–114.
Jaakkola, T., & Haussler, D. (1999). Exploiting Generative Models in Discriminative Classifiers. In Kearns, M. S., Solla, S. A., & Cohn, D. A. (Eds.), Advances in Neural Information Processing Systems 11, pp. 487–493. San Mateo, CA: Morgan Kaufmann Publishers.
Kaplan, R. M., & Kay, M. (1994). Regular Models of Phonological Rule Systems. Computational Linguistics, 20(3), 331–378.
Kiraz, G. (1994). Multi-tape Two-level Morphology. In COLING-94, pp. 180–186.
Koskenniemi, K. (1983). A Two-level Morphological Processor. Ph.D. thesis, University of Helsinki.
Levenshtein, V. (1966). Binary Codes Capable of Correcting Deletions, Insertions and Reversals. Soviet Physics Doklady, 10(8), 707–710.
Ling, C. X. (1994). Learning the Past Tense of English Verbs: The Symbolic Pattern Associator vs. Connectionist Models. Journal of Artificial Intelligence Research, 1, 209–229.
McCarthy, J., & Prince, A. (1990). Foot and Word in Prosodic Morphology: The Arabic Broken Plural. Natural Language and Linguistic Theory, 8, 209–284.
Mooney, R. J., & Califf, M. E. (1995). Induction of First-Order Decision Lists: Results on Learning the Past Tense of English Verbs. Journal of Artificial Intelligence Research, 3, 1–24.
Plunkett, K., & Nakisa, R. C. (1997). A Connectionist Model of the Arabic Plural System. Language and Cognitive Processes, 12(5/6), 807–836.
Ristad, E. S., & Yianilos, P. N. (1997). Finite Growth Models. Tech. rep. CS-TR-533-96, Department of Computer Science, Princeton University. Revised in 1997.
Rogati, M., McCarley, S., & Yang, Y. (2003). Unsupervised Learning of Arabic Stemming Using a Parallel Corpus. In Proceedings of ACL, pp. 391–398.
Rumelhart, D. E., & McClelland, J. L. (1986). On Learning Past Tenses of English Verbs. In Rumelhart, D. E., & McClelland, J. L. (Eds.), Parallel Distributed Processing, Vol. 2, pp. 216–271. Cambridge, MA: MIT Press.
Schone, P., & Jurafsky, D. (2000). Knowledge-free Induction of Morphology Using Latent Semantic Analysis. In Proceedings of CoNLL-2000 and LLL-2000, pp. 67–72, Lisbon, Portugal.
Sinkhorn, R. (1964). A Relation between Arbitrary Positive Matrices and Doubly Stochastic Matrices. Annals of Mathematical Statistics, 35(2), 876–879.
Soules, G. W. (1991). The Rate of Convergence of Sinkhorn Balancing. Linear Algebra and Its Applications, 150(3), 3–40.
van den Bosch, A., & Daelemans, W. (1999). Memory-Based Morphological Analysis. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pp. 285–292.
Yarowsky, D., & Wicentowski, R. (2000). Minimally Supervised Morphological Analysis by Multimodal Alignment. In Proceedings of ACL 2000, pp. 207–216, Hong Kong.
11
Memory-based Morphological Analysis
and Part-of-speech Tagging of Arabic
Antal van den Bosch1, Erwin Marsi1, and Abdelhadi Soudi2
1ILK / Dept. of Language and Information Science, Faculty of Arts, Tilburg University, P.O. Box 90153,
NL-5000 LE Tilburg, The Netherlands
{Antal.vdnBosch,E.C.Marsi}@uvt.nl
2Ecole Nationale de l’Industrie Minérale, Rabat, Morocco
asoudi@gmail.com
Abstract: We explore the application of memory-based learning to morphological analysis and part-of-speech tagging of written Arabic, based on data from the Arabic Treebank. Morphological analysis is performed as a letter-by-letter classification task. Classification is performed by the k-nearest neighbor algorithm. Each classification produces a trigram of position-bound operations, each encoding segmentation, part-of-speech information, and letter transformations. The overlapping operation trigrams generated on the basis of an input word are converted into a lattice, from which all morphological analyses of the word are generated. Part-of-speech tagging is carried out separately from the morphological analyzer. A memory-based modular tagger is developed with a subtagger for known words and one for unknown words. On words not seen in training, the morphological analyzer attains a peak F-score of 0.47, while the tagger produces 66.4% correct tags. On all words, including words seen in training, the combination assigns a correct part-of-speech tag and generates all morphological analyses to about 91% of word tokens in running text.
11.1 Introduction
Memory-based learning has been successfully applied to morphological analysis
and part-of-speech tagging in Western and Eastern-European languages (Daelemans
et al., 1996; Van den Bosch and Daelemans, 1999; Zavrel and Daelemans, 1999).
With the release of the Arabic Treebank by the Linguistic Data Consortium, a
large corpus has become available for Arabic that can act as training material for
machine-learning algorithms. The data facilitates machine-learned part-of-speech
taggers, tokenizers, and shallow parsing units such as chunkers (Diab, Hacioglu, and
Jurafsky, 2004); cf. Chapter 9.
However, as argued and illustrated throughout this book, Arabic offers special
challenges for data-driven and knowledge-based approaches alike. An Arabic word
may be composed of a stem consisting of a consonantal root and a pattern, and may
furthermore contain affixes and clitics. Arabic verbs, for instance, can be conjugated
according to one of the traditionally recognized patterns. There are 15 triliteral forms,
of which at least 9 are common. They represent very subtle differences. Within each
conjugation pattern, an entire paradigm is found: two tenses (perfect and imperfect),
two voices (active and passive) and five moods (indicative, subjunctive, jussive,
imperative, and energetic). Arabic nouns show a comparably rich and complex
morphological structure.
In this chapter we explore the use of memory-based learning for morphological
analysis and part-of-speech (POS) tagging of written Arabic. The next section
summarizes the principles of memory-based learning. Section 11.3 describes the data
used throughout the study for both tasks. The subsequent three sections describe our
work on memory-based morphological analysis (Section 11.4) and its integration
with part-of-speech tagging (Section 11.5). The final Section 11.6 contains a short
discussion of related work and offers an overall conclusion.
11.2 Memory-based Learning
Memory-based learning, also known as instance-based, example-based, or lazy
learning (Aha, Kibler, and Albert, 1991; Daelemans, Van den Bosch, and Zavrel,
1999), based on the k-nearest neighbor classifier (Cover and Hart, 1967), is a super-
vised inductive learning algorithm for learning classification tasks. Memory-based
learning treats a set of labeled (pre-classified) training instances as points in a
multi-dimensional feature space, and stores them as such in an instance base in
memory. Thus, in contrast to most other machine learning algorithms, it performs
no abstraction, which naturally allows it to deal with productive but low-frequency
exceptions (Daelemans, Van den Bosch, and Zavrel, 1999).
An instance consists of a fixed-length vector of $n$ feature-value pairs, and an information
field containing the classification of that particular feature-value vector. After
the instance base is stored, new (test) instances are classified by matching them to
all instances in the instance base, and by calculating with each match the distance,
given by a distance function $\Delta(X,Y)$ between the new instance $X$ and the memory
instance $Y$. The most primitive distance function is the "overlap" function
$\Delta(X,Y) = \sum_{i=1}^{n} w_i\, \delta(x_i, y_i)$, where $n$ is the number of features, $w_i$ is a weight for feature $i$,
and the distance per feature is $\delta(x_i, y_i) = 0$ if $x_i = y_i$, else 1. In this chapter we use
the somewhat more complex Modified Value Difference Metric (MVDM) distance
function (Cost and Salzberg, 1993; Stanfill and Waltz, 1986), which determines the
similarity of pairs of values of a feature by looking at the conditional probabilities
(estimated on co-occurrence counts) of the two values conditioned on the classes:
$\delta(v_1, v_2) = \sum_{i} |P(C_i \mid v_1) - P(C_i \mid v_2)|$.
In this chapter the weight (importance) of a feature $i$, $w_i$, is estimated by computing
its gain ratio, $GR_i$. To compute a feature's GR, we first compute its information gain
$IG_i$, which is the difference in uncertainty (entropy) within the set of cases between
the situations without and with knowledge of the value of that feature:
$IG_i = H(C) - \sum_{v \in V_i} P(v) \times H(C \mid v)$, where $C$ is the set of class labels, $V_i$ is the set of values for
feature $i$, and $H(C) = -\sum_{c \in C} P(c) \log_2 P(c)$ is the entropy of the class labels. The
probabilities are estimated from frequency counts in the training set. To derive the
GR, the feature's IG is divided by the entropy of the feature values, the split info
$si_i = -\sum_{v \in V_i} P(v) \log_2 P(v)$: $GR_i = IG_i / si_i$.
Classification in memory-based learning is performed by the k-NN algorithm
that searches for the $k$ 'nearest neighbors' according to the $\Delta(X,Y)$ function. The
majority class of the $k$ nearest neighbors then determines the class of the new case.
With symbolic feature values, distance ties can occur when two nearest neighbors
mismatch with the test instance on the same feature value, while all three instances
have different values. In the k-NN implementation¹ we used, equidistant neighbors
are taken as belonging to the same $k$, so this implementation is effectively a k-nearest
distance classifier. This implies that when $k = 1$, more than one nearest neighbor
may be found, all at the same distance to the test instance.
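A minimal sketch of the weighted overlap metric and the k-nearest-distance classification just described (our own illustration with invented toy data; the chapter's experiments use the TiMBL implementation):

from collections import Counter

def overlap_distance(x, y, weights):
    """Weighted overlap metric: sum of feature weights w_i where x_i != y_i."""
    return sum(w for xi, yi, w in zip(x, y, weights) if xi != yi)

def knn_classify(instance, instance_base, weights, k=1):
    """k-nearest-distance classification: all neighbours at the k nearest
    *distances* vote, so exact ties are kept rather than broken."""
    dists = sorted(set(overlap_distance(instance, x, weights)
                       for x, _ in instance_base))
    kept = set(dists[:k])
    votes = Counter(label for x, label in instance_base
                    if overlap_distance(instance, x, weights) in kept)
    return votes.most_common(1)[0][0]

# Toy instance base: (feature vector, class); uniform feature weights here,
# where gain-ratio weights would be used in practice.
base = [(("a", "b", "c"), "X"), (("a", "b", "d"), "X"), (("z", "z", "z"), "Y")]
print(knn_classify(("a", "b", "z"), base, weights=[1.0, 1.0, 1.0], k=1))  # "X"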
11.3 Data
Our point of departure is the Arabic Treebank 1 (ATB1), version 2.0, distributed by
LDC², more specifically the "after treebank" POS-tagged data, consisting of 166,068
tagged words. Tokens are vocalized and morphologically analyzed by means of Tim
Buckwalter’s Arabic Morphological Analyzer (Buckwalter, 2002). An example is
given in Figure 11.1. The input token (INPUT STRING) is transliterated (LOOK-UP
WORD) according to Buckwalter’s transliteration system. All possible vocalizations
and their morphological analyses are listed (SOLUTION). The portion of the solution
relevant to our experiments is the segmentation, with plus markers (+) as segment
boundary markers, and slashes (/) as delimiters between the character strings and
the part-of-speech information related to that string. For example, the segmented
string kutub/NOUN+a/CASE_DEF_ACC represents the substring kutub being a
noun, and a carrying the morpho-syntactic function CASE_DEF_ACC (underscores
occur within multi-word tags, and have no relation with segmentation). The resulting
analyses have been manually annotated with a preceding star (*), marking the correct
solution in the given context.
Throughout the corpus the number of analyses listed per word is not constant,
probably as a result of additions and/or deletions by the annotators. As our goal
is to predict all possible analyses for a given word, we first created a lexicon that
maps every word to all analyses encountered and their respective frequencies; see
Figure 11.2 for an example. In the course of this process, we removed all vowels,
since we aim at analyzing unvoweled words, producing unvoweled analyses. Vowel
generation ultimately implies stem identification, a complication which we do not
address here; our goal is the segmentation of unvoweled strings and an appropriate
¹ All experiments with memory-based learning were performed with TiMBL, version
5.1 (Daelemans et al., 2004), available from http://ilk.uvt.nl.
² LDC: http://www.ldc.upenn.edu/.
INPUT STRING:
LOOK-UPWORD: ktb
Comment:
INDEX:P2W20
SOLUTION 1: (kataba) [katab-u_1] katab/PV+a/PVSUFF_SUBJ:3MS
(GLOSS): write + he/it [verb]
* SOLUTION 2: (kutiba) [katab-u_1] kutib/PV_PASS+a/PVSUFF_SUBJ:3MS
(GLOSS): be written/be fated/be destined + he/it [verb]
SOLUTION 3: (kutub) [kitAb_1] kutub/NOUN
(GLOSS): books
SOLUTION 4: (kutubu) [kitAb_1] kutub/NOUN+u/CASE_DEF_NOM
(GLOSS): books + [def.nom.]
SOLUTION 5: (kutuba) [kitAb_1] kutub/NOUN+a/CASE_DEF_ACC
(GLOSS): books + [def.acc.]
SOLUTION 6: (kutubi) [kitAb_1] kutub/NOUN+i/CASE_DEF_GEN
(GLOSS): books + [def.gen.]
SOLUTION 7: (kutubN) [kitAb_1] kutub/NOUN+N/CASE_INDEF_NOM
(GLOSS): books + [indef.nom.]
SOLUTION 8: (kutubK) [kitAb_1] kutub/NOUN+K/CASE_INDEF_GEN
(GLOSS): books + [ indef.gen.]
SOLUTION 9: (ktb) [DEFAULT] ktb/NOUN_PROP
(GLOSS): NOT_IN_LEXICON
SOLUTION 10: (katb) [DEFAULT] ka/PREP+tb/NOUN_PROP
(GLOSS): like/such as + NOT_IN_LEXICON
Fig. 11.1. Example token from ATB1 according to Buckwalter's transliteration (cf. Table 11.1
in Chapter 2)
identification of the morpho-syntactic function of each of the parts, so that the output
can later be integrated with a part-of-speech tagger (see Section 11.5).
Also, we chose to re-attach clitic tokens (e.g. determiners and prepositions) to their
host, storing them together as a single word form. The lexicon derived from the full
corpus, excluding punctuation markers and words without a stored solution, contains
16,626 unique words and 113,105 analyses; an average of 6.8 analyses per word.
ktb
==> ktb/PV + /PVSUFF_SUBJ:3MS + 7
==> k/PREP + tb/NOUN_PROP + 7
==> ktb/PV_PASS + /PVSUFF_SUBJ:3MS + 7
==> ktb/NOUN + /CASE_DEF_ACC + 7
==> ktb/NOUN_PROP + 7
==> ktb/NOUN + /CASE_DEF_NOM + 8
==> ktb/NOUN + K/CASE_INDEF_GEN + 7
==> ktb/NOUN + 7
==> ktb/NOUN + N/CASE_INDEF_NOM + 7
==> ktb/NOUN + /CASE_DEF_GEN + 8
Fig. 11.2. Example of a preprocessed lexical entry for the word ktb, carrying ten
morphological analyses, using Buckwalter's transliteration (cf. Table 11.1 in Chapter 2)
To evaluate our system, we need data which can be regarded as realistic test
material, including a typical amount of unknown words, representing any new
document of text the system may be applied to; specifically, we require any
news article of the type that the ATB1 corpus is composed of. To this purpose, we split
the complete part-of-speech tagged ATB1 corpus into eleven partitions of near-equal
size. The data is shuffled randomly at the article level. One of these eleven partitions
is set apart as a held-out set for later use, described further below. Subsequently, ten
pairs of training and test sets are generated using the ten remaining 1/11th partitions.
Each training set consists of a concatenation of nine of the ten partitions, while each
tenth partition is taken as the test set accompanying the training set. Repeating this
ten times, we thus create ten overlapping training sets that each consist of 9/11th
of the original ATB1 tagged corpus, and ten corresponding non-overlapping test sets
that each represent a 1/11 portion of the original ATB1 corpus. With this 10-fold
cross-validation setup, we use the same training-test set partitions for training both
the morphological analysis system as well as the part-of-speech tagger described in
the next section.
The eleventh held-out set, the remaining 1/11th portion of the shuffled ATB1
corpus, is used in both experimental sections to estimate the generalization performance
of both modules individually, and also to estimate the joint generalization
performance of the part-of-speech tagger and morphological analyzer together. We
use this single held-out split for methodological reasons; both the morphological
analyzer and the part-of-speech tagger are tested using a range of algorithmic
parameter values on the 10-fold partitionings (e.g., the value of kof the k-nearest
neighbor classifier). Since an estimated generalization performance of both modules
cannot be based on the optimized test performance in a range of experiments, a fully
unknown test set is needed; hence the eleventh set.
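The partitioning scheme can be sketched as follows (a hypothetical helper of our own, assuming the corpus is available as a list of articles):

import random

def make_partitions(articles, n_parts=11, seed=0):
    """Shuffle at the article level and split into eleven near-equal parts:
    one held-out set plus ten folds for cross-validation."""
    arts = list(articles)
    random.Random(seed).shuffle(arts)
    parts = [arts[i::n_parts] for i in range(n_parts)]
    held_out, folds = parts[0], parts[1:]
    # Each test set is one fold; its training set is the other nine folds.
    splits = [(sum(folds[:i] + folds[i + 1:], []), folds[i])
              for i in range(len(folds))]
    return held_out, splits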
We conclude the section with some lexical statistics. Taken separately, each 1/11th
partition contains about 15,100 tokens and about 4,010 unique words. The ten 9/11th
training sets on average contain about 135,900 tokens, and about 15,100 unique
words. Importantly, the number of unknown word tokens in the ten test sets that do
not occur in their respective training sets is 977 on average, most of them occurring
once (76%): about 21% of all unique words (i.e., word types). Since most unknown
words have a low frequency, they only represent about 6.5% of all tokens in a test
set.
be noted. As will be argued in the subsequent sections, we define morphological
analysis as a task on word types, and part-of-speech tagging as a task on word tokens.
Henceforth, we will refer to “words” when we mean word tokens, and “word types”
when we mean word types.
11.4 Morphological Analysis
The goal of the memory-based morphological analysis system we describe here is
to generate no more and no less than all possible morphological analyses of an
unvoweled input word that has not occurred before in the analyzer’s training material.
For all occurrences of a word we want to generate the same analyses; hence, this
is a task at the word type level. An analysis consists of a proper segmentation of
clitics, stems, and suffixes, brought to their canonical orthographic form, and with
the correct identification of the morpho-syntactic part-of-speech tag of each segment.
This means our system is literally a morphological analysis generator for word
types. To measure its ability to generate no more and no less than all possible
analyses, we employ the dual concepts of precision (the percentage of generated
analyses that is actually correct) and recall (the percentage of target analyses that the
analyzer was able to generate). Also, as said before, we focus solely on the capability
of the analyzer to generate analyses for word types it has not seen before, henceforth
referred to as unknown words. We assume that a typical morphological analysis
system has a lexicon at hand, allowing the system to reproduce all morphological
analyses of known word types flawlessly. The problem is, of course, that typically
a non-trivial portion of all words in a text are unknown words. In our experiments
an unknown set of texts contains about 6.5% unknown words, or 24% of the word
types.
In this section we first describe how we created the data used in our experiments.
We then describe the experiments performed on these data, focusing our analyses on
the precision and recall scores on unknown words. We conclude the section with a
critical discussion of our results.
The experiments on training a morphological analysis system need an additional
processing step, namely the extraction of a lexicon from each training set. Each
lexicon contains for each word type in the training set all possible morphological
analyses. This is done as described above. The same procedure is followed for the
test set, except that here we ignore the words already in the training set; we focus
on unknown words only. Hence, each test set is converted to a small lexicon of all
unknown words that do not occur in the corresponding training set, listing for each
unknown word the one or more analyses which they actually get in the annotated
corpus. The goal of our morphological system is to generate precisely these
analyses.
11.4.1 Creating Instances for Morphological Analysis
The lexical entries generated by preprocessing both the training set and the test set
of each partition are converted to instances suitable to memory-based learning of
the mapping from words to their analyses (Van den Bosch and Daelemans, 1999).
Instances represent input (the orthographic word) and their corresponding output (the
morphological analysis). Since instances need to be of a fixed length and since they
need to be general enough to generalize from known to unknown words, instances
do not map entire words to entire analyses (which would render them case-specific),
but rather represent partial fixed-width snapshots of words mapping to subsequences
of the analysis. More specifically, the mapping is broken down into smaller
letter-by-letter classification tasks.
The input of each instance, consisting of a fixed number of features, is created
by sliding a window over the input word, resulting in one instance for each letter.
Using a 5–1–5 window yields 11 features, i.e. the input letter in focus, plus the five
preceding and five following letters. The equality mark (=) is used as a filler symbol
for positions beyond the beginning or end of the word. To illustrate this, consider the
seventh analysis of the word ktb (the notion of writing) in Figure 11.2,
ktb/NOUN+K/CASE_INDEF_GEN, representing a stem (ktb, noun = kutub
"books"), followed by a suffix (K) carrying the CASE_INDEF_GEN function. The K
suffix is not realized in the surface form. This three-letter word with this particular
analysis then results in the three instances displayed in Figure 11.3.
The class of the first instance, _-/NOUN-/NOUN, is a trigram (with the
character - as the delimiter), marking the fact that the letter k is the first letter of a
noun stem, and that the second letter t is also inside the same noun stem. The class
of the second instance, /NOUN-/NOUN-DK/NOUN/CASE_INDEF_GEN, signals
in the rightmost part of the trigram that the third letter, b, marks the end of the
noun stem and also carries the unrealized CASE_INDEF_GEN function; at the same
time, the DK code denotes that a K was deleted. Hence, to reconstruct the analysis,
a K needs to be reinserted. In general, the class codes making up the three parts
of the class trigram always encode the part-of-speech tag of the morpheme the
corresponding letter belongs to. Optionally this tag is preceded by a code representing
the insertion, deletion, or replacement of one or more letters that are required
to generate the underlying forms of the morphemes in the analysis, coded by the
letters I,D,andR, respectively, followed by the letters themselves. Segmentation
is encoded implicitly; whenever a letter is associated to another part-of-speech
tag than its preceding neighbor, then a morpheme segmentation boundary exists
between them. Based on this complex information, a complete analysis can be
constructed.
In principle, unigram classes could already be used for this purpose. If predicted
correctly, an analysis would follow from the straightforward concatenation of
position-specific classes. However, we are faced with an average of 6.8 analyses per
word. The example word ktb has ten analyses, as illustrated in Figure 11.2.
The first letter is associated within these ten analyses with five different unigram
classes: PV, PREP, PV_PASS, NOUN, and NOUN_PROP. The second letter is
linked to four tags: PV, NOUN, PV_PASS, and NOUN_PROP. Finally, the third
letter is associated with nine different tags. If predicted in isolation, the system
would have the task of picking the correct ten analyses from the maximal cartesian
product of 5 × 4 × 9 = 180 combined analyses. Due to their redundant overlap,
the trigrams offer the key to this search. Since the k-nearest distances classifier
is used, it will always produce all ambiguous analyses at the same distance.
Hence, all 5, 4, and 9 position-specific overlapping trigrams will be present in the
= = = = = k t b = = =    _-/NOUN-/NOUN
= = = = k t b = = = =    /NOUN-/NOUN-DK/NOUN/CASE_INDEF_GEN
= = = k t b = = = = =    /NOUN-DK/NOUN/CASE_INDEF_GEN-_

Fig. 11.3. Instances for the analyses of the word ktb in Figure 11.2
Fig. 11.4. The lattice formed by the overlapping trigrams generated by three consecutive
position-specific classifications, using all nearest neighbors at distance k = 1. The lattice
contains exactly the ten possible paths encoding the ten morphological analyses of ktb.
[The diagram shows a start node and an end node connected by three columns of class nodes,
including PV, PV_PASS, NOUN, PREP, and NOUN_PROP in the first two columns, and
suffix-bearing classes such as PV/PVSUFF_SUBJ:3MS, NOUN/CASE_DEF_NOM,
NOUN/CASE_DEF_ACC, NOUN/CASE_DEF_GEN, DK/NOUN/CASE_INDEF_GEN, and
DN/NOUN/CASE_INDEF_NOM in the third.]
ideal case that the classifier maps to the correct nearest neighbors. The trigrams
then span up a lattice, illustrated in Figure 11.4, in which exactly ten paths are
possible.
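The windowing step that produces the instances of Figure 11.3 can be sketched as follows (our own illustration, not the system's code); the class trigrams are copied from the figure above.

def make_instances(word, classes, left=5, right=5, filler="="):
    """Slide a 5-1-5 window over the word (cf. Figure 11.3): one instance
    per letter, padded with '=' beyond the word boundaries. `classes` is
    the per-letter class trigram sequence for one analysis."""
    padded = filler * left + word + filler * right
    instances = []
    for i, letter in enumerate(word):
        features = tuple(padded[i:i + left + 1 + right])
        instances.append((features, classes[i]))
    return instances

# The three instances for ktb with the analysis of Figure 11.3.
classes = ["_-/NOUN-/NOUN",
           "/NOUN-/NOUN-DK/NOUN/CASE_INDEF_GEN",
           "/NOUN-DK/NOUN/CASE_INDEF_GEN-_"]
for feats, cls in make_instances("ktb", classes):
    print("".join(feats), cls)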
While this example is perfect, less complete lattices are possible when a trigram
does not match with one of the trigrams generated on the next letter. This will never
occur with known words, but as said, we focus on unknown words, and then it is quite
possible that a nearest-neighbor classification of the first letter of the word yields a
completely different trigram than the nearest-neighbor classification of the second
letter. In these cases, paths in the lattice do not connect to the end node, since they are
generated by a trigram classification that does not match on the right-hand side, or do
not start in the start node, since they are generated by a trigram that does not match on
the left-hand side. We apply the hard constraint that a valid path, encoding a complete
analysis, must lead from the start node to the end node; hence, a mismatching trigram
generates at least one invalid path. This also means that completely mismatching
trigrams may lead to a lattice without valid paths, in which case no morphological
analysis is generated.
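The path-validity check can be sketched as follows, with trigrams represented as (left, focus, right) class tuples and "_" marking the word boundary; this is our own illustration with hypothetical class names, not TiMBL's implementation.

def valid_paths(trigram_sets):
    """Given, for each letter position, the set of predicted class trigrams
    (left, focus, right), return all valid paths through the lattice: paths
    in which adjacent trigrams overlap, starting and ending at '_'."""
    paths = [[t] for t in trigram_sets[0] if t[0] == "_"]       # start node
    for pos in range(1, len(trigram_sets)):
        paths = [p + [t] for p in paths for t in trigram_sets[pos]
                 if p[-1][1] == t[0] and p[-1][2] == t[1]]      # overlap check
    return [p for p in paths if p[-1][2] == "_"]                # end node

# Trigrams for a three-letter word with a single noun analysis.
t1 = {("_", "NOUN", "NOUN")}
t2 = {("NOUN", "NOUN", "NOUN/CASE_DEF_ACC")}
t3 = {("NOUN", "NOUN/CASE_DEF_ACC", "_")}
print(valid_paths([t1, t2, t3]))  # exactly one valid path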
Figure 11.5 displays an actual lattice generated for the unknown word AHtfAl
(celebration); eight of the 24 intermediate nodes are part of invalid paths.
Two of these paths do not connect to either the start or the end node; these are verbal
analyses generated while focusing on the fourth letter, f. Eventually, six paths
Fig. 11.5. The lattice formed by the overlapping trigrams generated by six consecutive
position-specific classifications, using the five k-nearest distances, generated for the unknown
word AHtfAl (celebration). Grey nodes in the lattice represent nodes in invalid paths,
i.e. they are not used for the generation of analyses. "N/C_D" stands for NOUN/CASE_DEF.
[The diagram shows start and end nodes connected by six columns of class nodes, including
NOUN_PROP, NOUN, ADJ, R{A/NOUN, R{A/ADJ, PV, IV_PASS, and case-bearing classes
such as N/C_D_NOM, N/C_D_GEN, N/C_D_ACC, and DN/NOUN.]
are valid, i.e., six analyses can be generated. Five of them are correct; the analysis
with DN/NOUN misses a CASE_INDEF_NOM tag, and there is also an analysis with
CASE_INDEF_GEN that cannot be generated from this lattice.
11.4.2 Evaluation
Performance is evaluated on the generated complete analyses (i.e. all valid paths in
a generated lattice), where an analysis is considered complete if all of its part-of-speech
and spelling change information is fully correct. Note again that the analyses
concern word types; the morphological analyzer is trained on word types from
the training set, and is applied to unknown word types in the test set. Since we
are concerned with measuring the degree of undergeneration or overgeneration of
analysis generation, we quantitatively measure the performance of our system in
terms of precision, recall, and F-score (Van Rijsbergen, 1979). Precision (1), the
ratio of True Positives (TP) over the total number of true and False Positives (FP),
measures to what extent overgeneration of analyses occurs, whereas recall (2), the
ratio of true positives over the sum of true positives and False Negatives (FN),
measures the amount of undergeneration. The F-score (3), the harmonic average
of precision and recall, represents an overall fit between real and predicted
analyses.
$$\text{precision} = \frac{TP}{TP + FP} \qquad (1)$$

$$\text{recall} = \frac{TP}{TP + FN} \qquad (2)$$

$$\text{F-score} = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \qquad (3)$$
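For concreteness, the three measures in code, applied to the k = 1 operating point reported in the next section (precision 0.70, recall 0.29):

def precision(tp, fp):
    return tp / (tp + fp)        # Equation 1

def recall(tp, fn):
    return tp / (tp + fn)        # Equation 2

def f_score(p, r):
    return 2 * p * r / (p + r)   # Equation 3

print(round(f_score(0.70, 0.29), 2))  # -> 0.41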
11.4.3 Experiments
We performed 10-fold cross-validation experiments, measuring the precision and
recall on the generated analyses for the unknown words in the test sets. Figure 11.6
displays the precision-recall curve for different values of the k-parameter of the
k-nearest neighbor classifier. Starting at k = 1, when the nearest-neighbor classifier just
uses the nearest-distance instances in memory to generate the analysis, a precision
level of 0.70 is attained; this means that about seven out of ten generated analyses
are correct. The downside is that at k = 1, recall is only 0.29, meaning that the
system is only able to generate slightly under one in three of the analyses it should
generate. Recall can be increased to 0.42 with higher values of k, peaking at k = 10
(precision at that point is down to 0.40), but slightly decreasing with k > 10. The
peak F-score of 0.47 (the highest harmonic mean of precision and recall) is attained
at k = 3.
Clearly this performance is not impressive. We have to keep in mind, however,
that the task is not an easy one: the evaluation concerns unknown words that were
not in the learner’s training material.
Fig. 11.6. Precision-recall curve on the generation of morphological analyses, with
increasing values of k, from 1 up to 30. The F-isolines represent the lines in the space with a
particular value of F in steps of 0.2. [The plot shows precision on the y-axis (0 to 1) against
recall on the x-axis (0 to 0.6), with points labelled k = 1, k = 5, and k = 30, and F-isolines
at F = 0.2, 0.4, and 0.6.]
11.5 Integration with Part-of-speech Tagging
We employ MBT, a memory-based tagger-generator and tagger (Daelemans et al.,
1996), to produce a Part-of-Speech (POS) tagger based on the ATB1 corpus.³ We first
describe how we prepared the corpus data. We then describe how we generated the
tagger (a two-module tagger with a module for known words and one for unknown
words), and subsequently we report on the accuracies obtained on test material by
the generated tagger.
11.5.1 Data Preparation
While the morphological analyzer attempts to generate all possible analyses for a
given word, the goal of POS tagging is to select one of these analyses as the appro-
priate one given the context, as the annotators of the ATB1 corpus did manually
with the “*” marker. We developed a POS tagger that is trained to predict, for any
given word, a concatenation of the POS tags of its morphemes. This is essentially
the morphological analysis of a word in which segmentation information and letter
transformations are discarded. Figure 11.7 shows part of a sentence where for each
word the respective tag is given in the second column.
Concatenation is encoded by the delimiter +. Some words have no solution at all,
but annotator comments usually tag them as proper nouns – hence we gave all of
these words a NOUN_PROP tag.
Due to the concatenation of all morpho-syntactic information, the tag set is quite
substantial: 306 different tags occur in the ten folds extracted from the ATB1 corpus.
The five most frequent tags are PREP (13%), PUNC (10%), NOUN_PROP (7%),
CONJ (4%), and DET+NOUN+CASE_DEF_GEN (4%). About 10% of the tags occur
w CONJ
bdA VERB_PERFECT
styfn NOUN_PROP
knt NOUN_PROP
nHylA ADJ+NSUFF_MASC_SG_ACC_INDEF
jdA ADV
, PUNC
AlA ADV
>n FUNC_WORD
h PRON_3MS
Agtsl VERB_PERFECT
w CONJ
. . .
Fig. 11.7. Part of an ATB1 sentence with words (left) and their respective POS tags (right),
according to Buckwalter’s transliteration (cf. Table 11.1 in Chapter 2)
3In our experiments we used the MBT software package, version 2 (Daelemans et al., 2003),
available from http://ilk.uvt.nl/.
only once (and will therefore never be predicted correctly), and about 33% of all tags
occur less than 10 times.
We used the same partitionings of the ATB1 corpus as used to develop the
morphological analyzer; we also ran a 10-fold cross-validation experiment. Evalu-
ation differs, though: we measure the percentage of word tokens that are given the
correct tag, i.e. a straightforward accuracy score. We also split this score into one
on known words (already encountered in the training data) and unknown words (not
encountered in the training data, i.e., the same words as focused on when evalu-
ating the output of the morphological analysis system). Different from morphological
analysis, we cannot assume that we are able to perform perfectly on words that are
already known from a training set; in part-of-speech tagging, having seen a word in
training only means that the tagger knows some of the tags (hopefully, all of the tags)
that a word may have in different contexts. The tagger still needs to decide which tag
is appropriate for every word token.
11.5.2 Memory-based Tagger Generator
Memory-based tagging is based on the idea that words occurring in similar
contexts will have the same POS tag. A particular instantiation, MBT, was proposed
by Daelemans et al. (1996). MBT has three modules. First, it has a lexicon module
which stores for all words occurring in the provided training corpus their possible
POS tags (tags which occur below a certain threshold, default 5%, are ignored).
Second, it generates two distinct taggers; one for known words, and one for unknown
words. The known-word tagger can obviously benefit from the lexicon, just as a
morphological analyzer could. If a word has one unique tag in the training set, it will
likely have the same tag when reoccurring in test material. If a word has a handful of
possible tags, then the tagger’s task is reduced to selecting the appropriate one from
this limited list.
The input on which the known-word tagger bases its prediction for a given focus
word consists of the following set of features and parameter settings: (1) The word
itself, in a local context of the two preceding words and one subsequent word. Only
the 200 most frequent words are represented as themselves; other words are reduced
to a generic string – cf. (Daelemans et al., 2003) for details. (2) The possible tags of
the focus word, plus the possible tags of the next word, and the disambiguated tags
of two words to the left (which are available because the tagger operates from the
beginning to the end of the sentence). The known-words tagger is based on a k-NN
classifier with k=15, the Modified Value Difference Metric (MVDM) distance
function, inverse-linear distance weighting, and GR feature weighting. These settings
were manually optimized on one of the test partitions.
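As an illustration, the instance encoding for the known-words tagger might look as follows. This is a sketch under our own naming, not MBT's internal representation; lexicon is assumed to map a word to the string of its possible tags (its "ambitag").

def known_word_instance(words, left_tags, lexicon, i):
    # Builds the feature vector described above for focus position i:
    # two words of left context, the focus word, one word of right context,
    # the disambiguated tags of the two preceding words, and the possible
    # tags of the focus word and the next word.
    def w(j):
        return words[j] if 0 <= j < len(words) else '_'
    def t(j):
        return left_tags[j] if 0 <= j < i else '_'
    def amb(j):
        return lexicon.get(words[j], '_') if 0 <= j < len(words) else '_'
    return [w(i - 2), w(i - 1), w(i), w(i + 1),
            t(i - 2), t(i - 1), amb(i), amb(i + 1)]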
The unknown-word tagger obviously does not know the possible tags the unknown
word can have. Instead, it attempts to derive as much information as possible from
the surface form of the word, by using its suffix and prefix letters as features. The
following set of features and parameters are used: (1) The three prefix letters and
the four suffix letters of the focus word (possibly encompassing the whole word);
(2) The possible tags of the next word, and the disambiguated tags of two words
to the left. The unknown-words tagger is based on a k-NN classifier with k=19,
the Modified Value Difference Metric (MVDM) distance function, inverse-linear
distance weighting, and GR feature weighting – again, manually tuned on one test set.
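A corresponding sketch for the unknown-words tagger replaces the focus word and its ambitag by surface letters (again, our own illustrative encoding, not MBT's):

def unknown_word_instance(words, left_tags, lexicon, i):
    # Three prefix letters and four suffix letters of the focus word
    # (padded with '_' when the word is shorter), plus the disambiguated
    # tags of the two preceding words and the ambitag of the next word.
    word = words[i]
    prefix = [word[j] if j < len(word) else '_' for j in range(3)]
    suffix = [word[-(j + 1)] if j < len(word) else '_' for j in range(4)]
    t2 = left_tags[i - 2] if i >= 2 else '_'
    t1 = left_tags[i - 1] if i >= 1 else '_'
    amb_next = lexicon.get(words[i + 1], '_') if i + 1 < len(words) else '_'
    return prefix + suffix + [t2, t1, amb_next]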
11.5.3 Evaluating the Tagger
Table 11.1 lists the average accuracy (percentage of correctly tagged test words) of
MBT as measured in the 10-fold cross-validation experiment. The general accuracy,
91.5%, is reasonable; known words are tagged with an accuracy of 93.3%. The
unknown-words tagger has a lower accuracy than the known-words tagger, but it is
able to tag 66.4% of the 6.5% unknown test words correctly nevertheless.
11.5.4 Integrating Morphological Analysis and Part-of-speech Tagging
While morphological analysis and POS tagging are ends in their own right, the
usual function of the two modules in higher-level natural language processing or
text mining systems is that they jointly determine for each word in a text the appro-
priate single morpho-syntactic analysis. In our setup, this amounts to predicting the
solution that is preceded by “*” in the original ATB 1 data. For this purpose, the
POS tag predicted by MBT, as described in the previous section, serves to select the
morphological analysis that is compatible with this tag.
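The selection step itself is simple. A sketch, assuming each generated analysis is available as a list of (segment, POS) pairs; the function names here are ours, not the authors':

def pos_signature(analysis):
    # Concatenate the POS tags of the segments, discarding segmentation
    # positions and letter transformations.
    return '+'.join(pos for segment, pos in analysis)

def select_analysis(predicted_tag, analyses):
    # Return the first generated analysis compatible with the tagger's
    # prediction, or None when no analysis matches.
    for analysis in analyses:
        if pos_signature(analysis) == predicted_tag:
            return analysis
    return None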
As a concluding experiment, we trained MBT with optimized settings on the
complete concatenated ten folds of our original experiment, and tested the tagger
on the eleventh held-out partition. As can be expected with somewhat more training
data available, we observe a slightly higher overall tagging accuracy on the held-out
data of 92.0%, with 93.4% on known words, and 71.0% on the 672 unknown words
in this partition.
Subsequently we also traineda morphological analyzer on the full concatenated ten
folds, and tested it on the held-out set, with the k=10 setting that yielded the highest
recall in the 10-fold cross-validation experiment. Recall is more important than F-
score or precision, since higher recall will generally improve the likelihood of a match
with the part-of-speech tags. Training the analyzer on the full ten folds and testing
on the held-out set yields a precision of 0.41, a recall of 0.43,and an F-score of 0.42.
Table 11.1. POS tagging accuracies on known
words, unknown words, and all test data (%
correctly tagged test words), with standard devia-
tions
Accuracy (% correct POS tags)
Known words Unknown words All words
93.3 66.4 91.5
±0.6±2.1±0.6
Finally, we computed how well the generated morphological analyses for
unknown words could be integrated with the predicted tags for these words. We
first computed an upper bound score by comparing the generated analyses against
the gold standard annotated tags of all unknown words. It turned out that in 77.5%
of all unknown words one of the analyses generated actually matches with the
correct gold-standard part-of-speech tag. To illustrate this, consider the unknown
word yxfy, which is tagged as IV3MS+IV+IVSUFF_MOOD:I. The three analyses
generated for this word, reduced to only the concatenation of the part-of-speech
tags (leaving out letter transformations and the specific segmentation position
information) are
1. NOUN_PROP
2. IV3MS+IV_PASS
3. IV3MS+IV+IVSUFF_MOOD:I
Since one of the analyses matches the gold-standard tag, this word is counted
as one of the 69.1% of the unknown words of which the full analysis could be
generated, i.e., of which the solution marked with a * in the ATB1 corpus could be
identified.
The actual agreement with the predicted tag is 75.0%, which is also high; however,
the most relevant score is the intersection between the cases where both the tagger
is correct, and one of the morphological analyses matches with the tag. In 82.2% of
the cases in which the predicted tag is correct, a matching morphological analysis is
also generated. Measured at the level of all unknown words, including the words that
receive an incorrect tag, we find that of all unknown words in the held-out set 58.1%
are assigned both a correct tag and a completely correct and matching morphological
analysis, including segmentation and letter transformations.
11.6 Discussion and Conclusion
In this chapter we investigated the application of memory-based learning (k-nearest
neighbor classification) to morphological analysis and POS tagging of written
Arabic, using the ATB1 corpus as training and testing material. The morphological
analyzer, when optimized on recall, was shown to attain a precision of 0.41, a
recall of 0.43, and an F-score of 0.42 on unknown word types in held-out data
when predicting all aspects of the analysis: part-of-speech tags of the segments, the
positions of the segmentations, and all letter transformations between the surface
form and the analysis. The POS tagger, in turn, attained an accuracy of 66.4% on
unknown words, and 91.5% on all words (including known words) in held-out data.
A combination of the two which selects one full morpho-syntactic analysis out of
the generated analyses, matching the part-of-speech predicted by the tagger, yields a
joint accuracy of 58.1% fully correctly predicted tags and corresponding full analysis
for unknown words.
Extrapolating this number to a score on all words in a text, i.e. including the known
words, we assume (safely) that our training-set-based lexicon always produces a
matching analysis for all known words, for which we have observed the 93.3%
accuracy of the part-of-speech tagger. Since known words make up 93.5% of test
data, on average, we estimate that we can generate correct tags with complete
morphological analyses for (0.935 ×0.933) +(0.065 ×0.581) =91.0% of all words
in unseen text.
The application of machine learning methods to Arabic morphology and POS
tagging appears to be somewhat limited and recent, compared to the vast descriptive
and rule-based literature particularly on morphology (Beesley, 1990; Beesley, 1998;
Kay, 1987; Kiraz, 1994; Soudi, 2002). We are not aware of any other machine-learning
approach to Arabic morphology. POS tagging, on the other hand, seems to have
attracted some focus. Freeman (2001) describes initial work in developing a POS
tagger based on transformational error-driven learning (i.e. the Brill tagger), but does
not provide performance analyses. Khoja (2001) reports a 90% accurate morpho-
syntactic statistical tagger that uses the Viterbi algorithm to select a maximally-likely
part-of-speech tag sequence over a sentence. In Chapter 9 of this book, Diab et al.
describe a part-of-speech tagger based on support vector machines that is trained on
tokenized data, reporting a tagging accuracy of 95.5%.
The use of trigrams in the output space has been proposed by Van den Bosch
and Daelemans (2006), and demonstrated on morphological analysis of Dutch and
English words by Van den Bosch, Schuurman, and Vandeghinste (2006). In this
chapter, trigram classes are used differently, however; here, the problem is not only
in optimizing the class label prediction, but also in limiting the overgeneration of
analyses.
We make two final remarks. First, memory-based morphological analysis of
Arabic words is feasible, but its main limitation is its inability to recognize the stem
of an unknown word, and consequently the appropriate vowel insertions. Also, its
guess on the possible POS tags of an unknown word turned out to be less useful
in our tagging approach than using the raw prefix and suffix letters of the words
themselves, as witnessed by the scores on unknown words of the POS subtagger
specialized in unknown words. Second, memory-based POS tagging of written
Arabic text appears to be successful, because performance is comparable to that
for other languages. The POS tagging task as we define it, is deliberately separated
from the problem of vowel insertion, which is in effect the problem of stem identi-
fication. We therefore consider the automatic identification of stems as a component
of full morpho-syntactic analysis of written Arabic an important issue for future
research.
Acknowledgements
The work of the first two authors is funded by the Netherlands Organisation for
Scientific Research (NWO). The authorswish to thank Walter Daelemans and Sander
Canisius for discussions and feedback.
References
Aha, D. W., D. Kibler, and M. Albert. 1991. Instance-based learning algorithms. Machine
Learning, 6:37–66.
Beesley, K. 1990. Finite-state description of Arabic morphology. In Proceedings of the
Second Cambridge Conference: Bilingual Computing in Arabic and English, no pagination.
Beesley, K. 1998. Consonant spreading in Arabic stems. In Proceedings of the 36th
Annual Meeting of the Association for Computational Linguistics and 17th International
Conference on Computational Linguistics, Montréal, Quebec, Canada, pp. 117–123.
Buckwalter, T. 2002. Buckwalter Arabic morphological analyzer version 1.0. Technical
Report LDC2002L49, Linguistic Data Consortium. Available from http://www.ldc.upenn.edu/.
Cost, S. and S. Salzberg. 1993. A weighted nearest neighbour algorithm for learning with
symbolic features. Machine Learning, 10:57–78.
Cover, T. M. and P. E. Hart. 1967. Nearest neighbor pattern classification. Institute of
Electrical and Electronics Engineers Transactions on Information Theory, 13:21–27.
Daelemans, W., A. Van den Bosch, and J. Zavrel. 1999. Forgetting exceptions is harmful
in language learning. Machine Learning, Special issue on Natural Language Learning,
34:11–41.
Daelemans, W., J. Zavrel, P. Berck, and S. Gillis. 1996. MBT: A memory-based part of speech
tagger generator. In E. Ejerhed and I. Dagan, editors, Proceedings of the Fourth Workshop
on Very Large Corpora, pp. 14–27. ACL SIGDAT.
Daelemans, W., J. Zavrel, A. Van den Bosch, and K. Van der Sloot. 2003. MBT: Memory
based tagger, version 2.0, reference guide. Technical Report ILK 03-13, ILK Research
Group, Tilburg University.
Daelemans, W., J. Zavrel, K. Van der Sloot, and A. Van den Bosch. 2004. TiMBL: Tilburg
Memory Based Learner, version 5.1.0, reference guide. Technical Report ILK 04-02, ILK
Research Group, Tilburg University.
Diab, M., K. Hacioglu, and D. Jurafsky. 2004. Automatic tagging of Arabic text: From
raw text to base phrase chunks. In Proceedings of HLT-NAACL 2004, pp. 149–152,
Boston, MA.
Freeman, A. 2001. Brill’s POS tagger and a morphology parser for Arabic. In ACL/EACL-
2001 Workshop on Arabic Language Processing: Status and Prospects, Toulouse, France.
Available on: http://www.elsnet.org/acl2001-arabic.html.
Kay, M. 1987. Non-concatenative finite-state morphology. In Proceedings of the third
Conference of the European Chapter of the Association for Computational Linguistics,
pp. 2–10, Copenhagen, Denmark.
Khoja, S. 2001. APT: Arabic part-of-speech tagger. In Proceedings of the Student Workshop
at NAACL-2001, pp. 20–25.
Kiraz, G. 1994. Multi-tape two-level morphology: A case study in semitic non-linear
morphology. In Proceedings of COLING’94, volume 1, pp. 180–186.
Soudi, A. 2002. A Computational Lexeme-based Treatment of Arabic Morphology. Ph.D.
thesis, Mohamed V University (Morocco) and Carnegie Mellon University (USA).
Stanfill, C. and D. Waltz. 1986. Toward memory-based reasoning. Communications of the
ACM, 29(12):1213–1228, December.
Van den Bosch, A. and W. Daelemans. 1999. Memory-based morphological analysis. In
Proceedings of the 37th Annual Meeting of the ACL, pp. 285–292, San Francisco, CA.
Morgan Kaufmann.
Van den Bosch, A. and W. Daelemans. 2006. Improving sequence segmentation learning
by predicting trigrams. In Proceedings of the Ninth Conference on Natural Language
Learning, CoNLL-2005, pp. 80–87, Ann Arbor, MI.
Van den Bosch, A., I. Schuurman, and V. Vandeghinste. 2006. Transferring PoS-tagging and
lemmatization tools from spoken to written Dutch corpus development. In Proceedings of
the Fifth International Conference on Language Resources and Evaluation, LREC-2006,
Trento, Italy.
Van Rijsbergen, C. J. 1979. Information Retrieval. Buttersworth, London.
Zavrel, J. and W. Daelemans. 1999. Recent advances in memory-based part-of-speech tagging.
In VI Simposio Internacional de Comunicacion Social, pp. 590–597.
PART IV
Integration of Arabic Morphology in Larger Applications
12
Light Stemming for Arabic Information Retrieval
Leah S. Larkey, Lisa Ballesteros and Margaret E. Connell
Chiliad Publishing, 44 Belchertown Rd, Amherst, MA 01002
LarkeyLeah@gmail.com
Computer Science Dept., Mt. Holyoke College, South Hadley, MA 01075
lballest@mtholyoke.edu
Univ. of Massachusetts, Dept. of Computer Science, Amherst, MA 01003
connell@cs.umass.edu
Abstract: Computational Morphology is an urgent problem for Arabic Natural Language Process-
ing, because Arabic is a highly inflected language. We have found, however, that a full
solution to this problem is not required for effective information retrieval. Light stem-
ming allows remarkably good information retrieval without providing correct morpho-
logical analyses. We developed several light stemmers for Arabic, and assessed their ef-
fectiveness for information retrieval using standard TREC data. We have also compared
light stemming with several stemmers based on morphological analysis. The light
stemmer, light10, outperformed the other approaches. It has been included in the Lemur
toolkit, and is becoming widely used in Arabic information retrieval.
12.1 Introduction
The central problem of Information Retrieval (IR) is to find documents that satisfy
a user’s information need, usually expressed in the form of a query. This active re-
search area has seen great progress in recent decades, which everyone has experi-
enced in searching the internet. Initially, most IR research was carried out in
English and fueled by the annual Text Retrieval Conferences (TREC) sponsored
by NIST (the National Institute of Standards and Technology). NIST has accumu-
lated large amounts of standard data (text collections, queries, and relevance
judgments) so that IR researchers can compare their techniques on common data
sets. More recently, IR research involving other languages has flourished. TREC
now includes multilingual data and in recent years, other organizations sponsor
similar annual evaluations for European languages (CLEF) and Asian languages
(NTCIR) (Chinese, Japanese, and Korean). Arabic began to be included in the
TREC cross-lingual track in 2001, and in the TDT (topic detection and tracking)
evaluations in 2001 [47]. The availability of standard Arabic data sets from the
NIST and the Linguistic Data Consortium (LDC) has in turn spurred a huge
acceleration in progress in information retrieval and other natural language proc-
essing applications involving Arabic.
Any discussion of multilingual retrieval requires a distinction between mono-
lingual retrieval in multiple languages, and cross-lingual or cross-language re-
trieval. In monolingual retrieval, queries are issued in the same language as the
documents in the collection being searched. In cross-lingual retrieval, queries are
issued in a different language than the documents in the collection. A central prob-
lem in both monolingual and cross-lingual streams of IR research is the vocabu-
lary mismatch problem. The same information need can be expressed using differ-
ent terminology (the disease bilharzia is also called schistosomiasis), or a key
term can have different spellings (e.g. theater versus theatre), or may be inflected
differently in the query than in the relevant documents. We discuss in Section 12.1.1
the kinds of variability that lead to a particularly acute vocabulary mismatch prob-
lem in Arabic information retrieval. Morphological variation in IR has generally
been handled by stemming, an unsophisticated but effective approach to morphol-
ogy which we discuss in Section 12.1.2.
12.1.1 Arabic Morphology and Orthography
The morphological and orthographic complexity of Arabic (see Chapter 3 of this
volume) makes it particularly difficult to develop natural language processing ap-
plications for Arabic information retrieval.
Distributional analyses of Arabic newspaper text show empirically that there is
more lexical variability in Arabic than in the European languages for which most
IR and NLP work has been performed. Arabic text has more words occurring only
once and more distinct words than English text samples of comparable size.1 The
token to type ratio (mean number of occurrences over all distinct words in the
sample) is smaller for Arabic texts than for comparably sized English texts [28].
For information retrieval, this abundance of forms, lexical variability, and or-
thographic variability, all result in a greater likelihood of mismatch between the
form of a word in a query and the forms found in documents relevant to the query.
In cross-language retrieval there is an additional serious mismatch problem be-
tween query terms and the forms found in the bilingual dictionaries that are used
in cross-language retrieval.
To deal with this variability, most morphological analyzers attempt to insert
missing short vowels and other diacritics. However for information retrieval, the
opposite approach is more typical. Text is normalized by removing many diacrit-
ics and short vowels.
1 We use the term word in the simple sense of text segmented at white space or punctuation,
without any morphological analysis.
12.1.2 Stemming in Information Retrieval
Stemming is another one of many tools besides normalization that is used in in-
formation retrieval to combat this vocabulary mismatch problem. Stemmers
equate or conflate certain variant forms of the same word like (paper, papers) and
(fold, folds, folded, folding…). In this work, we use the term stemming to refer to
any process which conflates related forms or groups forms into equivalence
classes, including but not restricted to suffix stripping. In this section we review
general approaches to stemming over many languages. We focus on Arabic in the
next section. Most approaches fall into two classes: affix removal and statistical
stemming.
12.1.2.1 Affix Removal
In English and many other western European languages, stemming is primarily a
process of suffix removal [41][50]. Such stemmers do not conflate irregular forms
such as (goose, geese) and (swim, swam). These stemmers are generally tailored
for each specific language. Their design requires some linguistic expertise in the
language and an understanding of the needs of information retrieval. Stemmers
have been developed for a wide range of languages including Malay [54], Latin
[29], Indonesian [8], Swedish [12], Dutch [35], German [44], French [45], Slovene
[49], and Turkish [21]. The effectiveness of stemming across languages is varied
and influenced by many factors. A reasonable summary is that stemming doesn’t
hurt retrieval; it either makes little difference or it improves effectiveness by a
small amount [31]. Stemming is considered to aid recall more than precision [35].
That is, stemming allows a search engine to find more relevant documents, but
may not improve its ability to rank the best documents at the top of the list. Stem-
ming appears to have a larger positive effect when queries and/or documents are
short [36], and when the language is highly inflected [48][49], suggesting that
stemming should improve Arabic information retrieval.
12.1.2.2 Statistical Techniques
Statistical methods can provide a more language-independent approach to confla-
tion. Related words can be grouped based on various string-similarity measures.
Such approaches often involve n-grams. Equivalence classes can be formed from
words that share word-initial letter n-grams or a threshold proportion of n-grams
throughout the word, or by refining these classes with clustering techniques. This
kind of statistical stemming has been shown to be effective for many languages,
including English, Turkish, and Malay [21][23][24][46].
Statistical techniques have widely been applied to automatic morphological
analysis in computational linguistics [9][17][22][26][27][30][32][34]. For exam-
ple, Goldsmith finds the best set of frequently occurring stems and suffixes using
an information theoretic measure [26]. Oard et al. consider the most frequently oc-
curring word-final n-grams (1, 2, 3, and 4-grams) to be suffixes [46].
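As a toy illustration of conflation by shared word-initial n-grams (illustrative code only, not any of the cited systems):

from collections import defaultdict

def ngram_classes(words, n=4):
    # Group words that share their first n letters into one equivalence class.
    classes = defaultdict(set)
    for word in words:
        classes[word[:n]].add(word)
    return classes

# ngram_classes(['folds', 'folded', 'folding', 'paper', 'papers'])
# -> {'fold': {'folds', 'folded', 'folding'}, 'pape': {'paper', 'papers'}}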
Stem classes can also be built or refined using co-occurrence analysis, which
Xu and Croft proposed as a promising language-independent approach to stemming
[55]. Stemmers make two kinds of errors. Weak stemmers fail to conflate related
forms that should be grouped together. Strong stemmers tend to form larger
stem classes in which unrelated forms are erroneously conflated. Most stemmers
fall between these two extremes and make both kinds of errors. Xu and Croft em-
ploy a corpus analysis approach which is particularly suited to splitting up stem
classes created by strong stemmers. The stem classes are reclustered based on a
co-occurrence measure, which is language independent in that it can be applied to
any set of stem classes. Xu and Croft applied their technique to effectively stem
English and Spanish and obtained two important results. First, one can refine an
already good stemmer by co-occurrence analysis and improve retrieval effective-
ness. Second, one can start with a strong, crude stemmer like an n-gram stemmer
and use co-occurrence analysis to yield stem classes that work as well as a sophis-
ticated stemmer. They demonstrated an improvement in retrieval effectiveness for
English and Spanish after clustering conventional and n-gram based stem classes.
12.1.3 Stemming and Morphological Analysis in Arabic
for Information Retrieval
The factors described in Section 12.1.1 make Arabic very difficult to stem. The issue
of whether roots or stems are the desired level of analysis for IR has been one com-
plication that has given rise to additional approaches to stemming for Arabic be-
sides affix removal and the statistical stemming approaches described above.
Other approaches include manual dictionary construction, morphological analysis,
and new statistical methods involving parallel corpora.
12.1.3.1 Manual Construction of Dictionaries
Early work on Arabic stemming used manually constructed dictionaries. Al-
Kharashi and Evens worked with small text collections, for which they manually
built dictionaries of roots and stems for each word to be indexed [4]. This approach
is obviously impractical for realistically sized corpora.
12.1.3.2 Affix Removal
The affix removal approach is generally called light stemming when applied to
Arabic, referring to a process of stripping off a small set of prefixes and/or suf-
fixes, without trying to deal with infixes, or recognize patterns and find roots.
Light stemming was used for Arabic by some authors without details in work prior
to ours [3][18]. No explicit lists of strippable prefixes and/or suffixes or algorithms
had been published at the time we did this research. Our light stemmer, light10, strips
off initial ϭ (w = “and”), some prepositions, definite articles (ϝΎϓ ˬϝΎϛ ˬϝΎΑ ˬϝ΍ϭ ˬϝ΍ˬ Ϟϟ),
and suffixes (ϩ ˬΔϳ ˬϪϳ ˬϦϳ ˬϥϭ ˬΕ΍ ˬϥ΍ ˬΎϫ ˬΓ ˬϱ). (More detail can be found
in Section 12.2.2.) It was designed to strip off strings that were frequently found as
prefixes or suffixes, but infrequently found at the beginning or ending of stems. It
was not intended to be exhaustive. Darwish introduced the Al-Stem light stemmer
at TREC 2002 [16], and demonstrated that it was less effective than light10. Chen
and Gey [13] introduced a light stemmer similar to light10, but that removed more
prefixes and suffixes. It was shown to be more effective than Al-Stem, but was not
directly compared to light10.
Although light stemming can correctly conflate many variants of words into
large stem classes, it can fail to conflate other forms that should go together. For
example, broken (irregular) plurals for nouns and adjectives do not get conflated
with their singular forms, and past tense verbs do not get conflated with their pre-
sent tense forms, because they retain some affixes and internal differences. In spite
of the simplicity and shortcomings of light stemming, more sophisticated ap-
proaches have not proven to be more effective for information retrieval.
12.1.3.3 Statistical Stemming
Although n-gram systems have been used for many different languages, one
would not expect them to perform well on infixing languages like Arabic. How-
ever, Mayfield et al. have developed a system that combines word-based and 6-
gram based retrieval, which performs remarkably well for many languages [43],
including Arabic [42]. De Roeck and Al-Fares [18] used clustering on Arabic
words to find classes sharing the same root. Their clustering was based on
morphological similarity, using a string similarity metric tailored to Arabic
morphology, which was applied after removing “a small number of obvious
affixes.” They evaluated the technique by comparing the derived clusters to
“correct” classes. They did not assess the performance in an information
retrieval context.
We applied Xu and Croft’s co-occurrence method to Arabic [55]. We assumed that
initial n-gram based stem classes were probably not the right starting point for
languages like Arabic. However, co-occurrence or other clustering techniques can
be applied to Arabic without using n-grams. Instead, we formed classes of words
that mapped onto the same string if vowels were removed, and used co-occurrence
measures to split these classes further. The co-occurrence method did not work as
well for Arabic as it did for English and Spanish. It did not produce a stemmer that
worked as well as light10. We found that while one could improve a mediocre
stemmer with this technique, its effectiveness was still far from the level attained
by a high-quality stemmer like light10. And the light10 stemmer could not be fur-
ther improved by co-occurrence analyses. Perhaps this is because of the big stem-
ming effect in Arabic compared to English or Spanish.
A promising new class of statistical stemmers makes use of parallel corpora.
Chen and Gey [13] used a parallel English-Arabic corpus and an English stemmer to
cluster Arabic words into stem classes based on their mappings to English stem
classes. Rogati, McCarley, and Yang [51] use a statistical machine translation approach
that learns to split words into prefix, stem, and suffix by training on a small
hand-annotated training set and using a parallel corpus. These approaches work
well considering how automated they are, but they are not as effective in an IR
evaluation as a good light stemmer.
12.1.3.4 Morphological Analysis
It is often assumed that stemming is just a quick and dirty way to approximate
morphological analysis, and that the best way to stem would be to perform a cor-
rect morphological analysis and then use some valid morphological unit for index-
ing and retrieval. For Arabic, this unit has often been thought to be the root. Sev-
eral morphological analyzers have been developed for Arabic [2][6][7][15]
[33][19], but few have received a standard IR evaluation. Most such morphological
analyzers find the root, or any number of possible roots, for each word. A morpho-
logical analyzer called Sebawai, developed by Kareem Darwish [14], was used by
some of the TREC participants in 2001 [15][33], but it was not directly compared
with light stemming.
In our earlier research we evaluated a simple morphological analyzer from
Khoja and Garside [33], which first peels away layers of prefixes and suffixes,
then checks a list of patterns and roots to determine whether the remainder could
be a known root with a known pattern applied. If so, it returns the root. Otherwise,
it returns the original word, unmodified. This system also removes terms that are
found on a list of 168 Arabic stop words. It was almost as effective as light stem-
ming, but tended to fail on foreign words, which it left unchanged rather than re-
moving definite articles and obvious affixes. Taghva, Elkhoury, and Coombs [53]
have developed a system that finds Arabic roots somewhat like Khoja’s approach,
but without using a root dictionary or lexicon, and which performs as well as a
light stemmer.
Tim Buckwalter’s morphological analyzer [10] is different from the others in
that it returns stems rather than roots. It is based on a set of lexicons of Arabic
stems, prefixes, and suffixes, with truth tables indicating their legal combinations.
The BBN group used this table-based stemmer in TREC-2001 [56], but did not
compare it with light stemming. The Buckwalter stemmer is now available from
the LDC [39], and is evaluated as a stemmer in the present study. Finally, Diab,
Hacioglu, and Jurafsky [20] developed a set of tools for Arabic morphological
analysis which learn tokenization, lemmatization, part-of-speech assignment, and
phrase chunking automatically using SVMs (support vector machines), a machine
learning categorization tool. Their tools are trained on a sample of the Arabic Tree
Bank [40], which is a portion of the AFP database that has been analyzed by the
Buckwalter morphological analyzer and hand-corrected. They claim above 99%
accuracy on tokenization, and 95.49% accuracy on POS tagging. We derive some
stemmers from these tools, as part of the present study.
Early published comparisons of stems versus roots for information retrieval
have claimed that roots are superior to stems, based on small, nonstandard test sets
[1][4]. Recent work at TREC has found no consistent differences between roots and
stems [15]. We found a small increase in effectiveness when we combined roots and
stems [38]. However, we feel that roots versus stems is not the most interesting ques-
tion to investigate. As this book makes clear, morphological analysis of Arabic is
now an active research area, and many systems are being developed to return more
complete analyses of Arabic words. A more interesting question is how to use
morphological analysis to aid information retrieval, and in particular, to aid stem-
ming. The new work in this chapter attempts to use morphological analysis to get
to something better than a root for indexing.
The present study expands upon work we published at SIGIR in 2002 [37]. At that
time, we developed several light stemmers, compared their effectiveness on an IR
task with each other and with that of a morphological analyzer available at that
time. We also experimented with a co-occurrence approach to improving stem-
ming. The present research expands that study in several ways. First, the light
stemmer we eventually settled upon (and used in TREC 2002) was slightly differ-
ent from the best one reported at SIGIR. We compare it with the other stemmers
here. Second, we use 75 queries from TREC 2001 and TREC 2002 to evaluate
stemmers here, providing results that are more reliable than the 25 queries from
TREC 2001 used in the previous study. Third, we evaluate stemming approaches
based on two morphological analyzers that were not available when the earlier
study was carried out.
12.2 Review of 2002 Stemming Experiments
In this section we review the light stemming experiments from the SIGIR article,
but include in addition the modified stemmer, light10. This set of experiments was
carried out using the TREC 2001 corpus and queries.
12.2.1 Experimental Method
The TREC-2001 Arabic corpus, also called the AFP_ARB corpus, consists of
383,872 newspaper articles in Arabic from Agence France Presse. This fills up
almost a gigabyte in UTF-8 encoding as distributed by the Linguistic Data Con-
sortium. There were 25 topics with relevance judgments, available in Arabic,
French, and English, with Title, Description, and Narrative fields [25]. We used the
Arabic titles and descriptions as queries in monolingual experiments, and the Eng-
lish titles and descriptions in cross-language experiments.
Corpus and queries were converted to CP1256 encoding and indexed using an
in-house version of the INQUERY retrieval engine [11]. Arabic strings were treated
as a simple string of bytes, regardless of how they would be rendered on the
screen. Text was broken up into words at any white space or punctuation charac-
ters, including Arabic punctuation. Stop words were removed, using a stop word
Light Stemming for Arabic Information Retrieval 227
list from Khoja [33]. Words of one-byte length (in CP1256 encoding) were not in-
dexed. The experiments reported here used INQUERY for retrieval.
Except for the raw condition, in which no normalization or stemming was used,
the corpus and queries were normalized according to the following steps:
• Remove punctuation
• Remove diacritics (primarily weak vowels). Some entries contained weak
vowels, in particular in the dictionaries used in cross-language experiments.
Removal made everything consistent.
• Remove non-letters
• Replace ΁, ·, and ΃ with ΍
• Replace final ϯ with ϱ
• Replace final Γ with ϩ
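A minimal sketch of these normalization steps, written here over Unicode Arabic code points rather than the CP1256 byte strings used in the original experiments:

import re

# Tanween, short vowels, shadda, sukun and dagger alif.
AR_DIACRITICS = re.compile('[\u064B-\u0652\u0670]')

def normalize(token):
    token = AR_DIACRITICS.sub('', token)                     # remove diacritics
    token = re.sub('[^\u0621-\u064A]', '', token)            # remove punctuation and non-letters
    token = re.sub('[\u0622\u0623\u0625]', '\u0627', token)  # alif variants -> bare alif
    if token.endswith('\u0649'):                             # final alif maqsura -> ya
        token = token[:-1] + '\u064A'
    if token.endswith('\u0629'):                             # final teh marbuta -> ha
        token = token[:-1] + '\u0647'
    return token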
For the normalized conditions and the stemming conditions, we normalized and
stemmed all tokens before indexing the corpus, and normalized and stemmed the
queries with the same stemmer for retrieval. Arabic queries were expanded using
the technique of local context analysis, adding 50 terms from the top-10 docu-
ments, as described in detail in [38]. Expansion was performed in order to show the
ultimate level of performance attainable using the stemmers in the context of our
whole system.
12.2.2 Light Stemmers
Our guiding principle in designing the light stemmers was heuristic. A light
stemmer is not dictionary driven, so it cannot apply a criterion that an affix can be
removed only if what remains is an existing Arabic word. In fact, we suspect that
part of the success of such stemmers is that they can blindly work on words even
if they are not found in a word list. The attempt was to try to remove strings that
would be found reliably as affixes far more often than they would be found as the
beginning or end of an Arabic stem without affixes. We also benefited from dis-
cussions with some colleagues at TREC-2001, particularly M. Aljlayl. We tried
several versions of light stemming, all of which followed the same steps:
1. Remove ϭ (“and”) for light2, light3, light8, and light10 if the remainder of
the word is 3 or more characters long. Although it is important to remove ϭ, it
is also problematic, because many common Arabic words begin with this
character, hence the stricter length criterion here than for the definite articles.
2. Remove any of the definite articles if this leaves 2 or more characters.
3. Go through the list of suffixes once in the (right to left) order indicated in
Table 12.1, removing any that are found at the end of the word, if this leaves
2 or more characters.
Table 12.1. Strings removed by light stemming

Stemmer | Remove prefixes | Remove suffixes
Light1  | ϝΎϓ ˬϝΎϛ ˬϝΎΑ ˬϝ΍ϭ ˬϝ΍ | none
Light2  | ϭ ˬϝΎϓ ˬϝΎϛ ˬϝΎΑ ˬϝ΍ϭ ˬϝ΍ | none
Light3  | ϭ ˬϝΎϓ ˬϝΎϛ ˬϝΎΑ ˬϝ΍ϭ ˬϝ΍ | ˬϩ Γ
Light8  | ϭ ˬϝΎϓ ˬϝΎϛ ˬϝΎΑ ˬϝ΍ϭ ˬϝ΍ | ϱ ˬΓ ˬϩ ˬΔϳ ˬϪϳ ˬϦϳ ˬϥϭ ˬΕ΍ ˬϥ΍ ˬΎϫ
Light10 | ϭ ˬϞϟ ˬϝΎϓ ˬϝΎϛ ˬϝΎΑ ˬϝ΍ϭ ˬϝ΍ | ϱ ˬΓ ˬϩ ˬΔϳ ˬϪϳ ˬϦϳ ˬϥϭ ˬΕ΍ ˬϥ΍ ˬΎϫ
The strings to be removed are listed in Table 12.1. The “prefixes” are actually
prepositions, definite articles and a conjunction. The light stemmers do not re-
move any strings that would be considered Arabic prefixes.
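Putting the steps and Table 12.1 together, light10 can be sketched as follows. We use Buckwalter transliteration for readability (prefixes Al, wAl, bAl, kAl, fAl, ll and suffixes hA, An, At, wn, yn, yh, yp, h, p, y); the original stemmer operated on CP1256-encoded strings, and the suffix ordering follows our reading of Table 12.1.

PREFIXES = ['Al', 'wAl', 'bAl', 'kAl', 'fAl', 'll']
SUFFIXES = ['hA', 'An', 'At', 'wn', 'yn', 'yh', 'yp', 'h', 'p', 'y']

def light10(word):
    # Step 1: strip initial w ("and") if at least 3 characters remain.
    if word.startswith('w') and len(word) >= 4:
        word = word[1:]
    # Step 2: strip one definite-article string if 2+ characters remain.
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= 2:
            word = word[len(prefix):]
            break
    # Step 3: one pass through the suffix list, stripping each suffix
    # found at the end of the word if 2+ characters remain.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            word = word[:-len(suffix)]
    return word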
12.2.3 Results of Monolingual Stemmer Comparisons
Figure 12.1 shows precision at 11 recall points for the primary stemmers tested.
Raw means no normalization or stemming. Norm refers to normalization with no
stemming. Light1, Light2, Light3, Light8, and Light10 refer to the light stemmers
described above.
Fig. 12.1. Monolingual 11 point precision for basic stemmers, unexpanded queries
Table 12.2 shows uninterpolated average precision for the basic stemmers. For
raw, normalized, and light stemming conditions performance is better with each
successive increment in degree of stemming. Each of these increments is statisti-
cally significant except light10 versus light8.2 As these results indicate, light
stemming is remarkably effective. Light10 has become widely used, and has been
included in the Lemur toolkit, a set of software tools for research in language
modeling and information retrieval [5].
2 All significance tests were conducted using the Wilcoxon test, with a criterion of
p<.05 for significance.
12.2.4 Comparison with Morphological Analysis
The Khoja stemmer described in Section 12.1.3.4 was used to find roots for indexing and re-
trieval. Average precision for the Khoja stemmer is .341, significantly worse than
light10 (p<.01). A comparison of this approach with the raw, normalized, light2,
and light10 conditions can be seen in Figure 12.2. A similar experiment with query
expansion showed similar results, seen in Figure 12.3. In Figure 12.3, Raw is the
raw condition with unexpanded queries, and RawExp refers to the raw condi-
tion with query expansion. As expected, average precision is higher with ex-
panded queries, but the same pattern of results holds. In particular, the light10
stemmer is significantly more effective than the Khoja stemmer.
12.2.5 Cross-language Retrieval
The Khoja morphological analyzer was also compared with the stemmers in a cross-
language retrieval experiment, for generality. The cross-language experiments
Fig. 12.2. Khoja morphological analyzer versus light stemming, unexpanded queries
Table 12.2. Monolingual average precision for basic stemmers, unexpanded queries

Stemmer | Average Precision | Percent Change
raw     | .196 |
norm    | .241 | +22.9
light1  | .273 | +39.3
light2  | .291 | +48.3
light3  | .317 | +61.8
light8  | .390 | +98.7
light10 | .413 | +100.1
Fig. 12.3. Khoja morphological analyzer versus light stemming, expanded queries
reported here were carried out using the 25 English TREC-2001 queries and the
same Arabic AFP_ARB corpus used for the monolingual experiments. Our ap-
proach was the common dictionary-based approach, in which each English query
word was looked up in a bilingual dictionary. All the Arabic translations for that
word were gathered inside an INQUERY #syn (synonym) operator. For an English-
Arabic dictionary, we used a lexicon collected from several online English-Arabic
and Arabic-English resources on the web, described more completely in [38].
Query expansion was carried out in conjunction with stemming. When English
queries were expanded, 5 terms were added from the top-10 documents. When
Arabic queries were expanded, 50 terms were added from the top-10 documents,
as described in [38].
Figure 12.4 shows precision on unexpanded queries for cross-language retrieval at
11 recall points for the raw, norm (normalization and stop word removal), light10
(light10 stemming with stop word removal), and Khoja stemmers. Figure 12.5 shows
the same information for retrieval with query expansion. Table 12.3 shows
uninterpolated average precision for unexpanded and expanded queries.
The cross-language results are somewhat different from the monolingual results
in comparing the light stemmers with the Khoja morphological analyzer. Raw re-
trieval without any normalization or stemming is far worse for cross-language
retrieval than for monolingual retrieval. This is probably because many of the Ara-
bic words occurred in vocalized form (with diacritics) in the online dictionary
we used for cross-language retrieval. Without normalization these dictionary en-
tries do not match their counterparts in the corpus. Other differences from the
monolingual case are that the light10 stemmer is far better than the root stem-
mer, Khoja, which is no better than normalization for cross-language retrieval. For
Fig. 12.4. Cross-Language 11 point precision for unexpanded queries
Fig. 12.5. Cross-language 11 point precision for expanded queries
Table 12.3. Cross-language average precision for different stemmers, unexpanded
and expanded queries

Stemmer           | Raw  | Norm | Khoja | Light10
Average Precision | .113 | .262 | .260  | .384
Percent Change    |      | +133 | +130  | +240
With English Query Expansion
Average Precision | .139 | .306 | .308  | .425
Percent Change    |      | +120 | +121  | +206
With English and Arabic Query Expansion
Average Precision | .163 | .336 | .321  | .447
Percent Change    |      | +106 | +97   | +174
cross-lingual retrieval, roots are probably even less appropriate as look-up units
than they are for monolingual retrieval. In cross-lingual retrieval based on diction-
ary look-up, if we look up the root for each query word, we get far too many trans-
lations, and most of them are incorrect.
12.2.6 Discussion
Although stemming is difficult in a language with complex morphology like Ara-
bic, it is particularly important. For monolingual retrieval, we saw around 100%
increase in average precision from raw retrieval to the best stemmer. The best
stemmer in our experiments, light10, was very simple and did not try to find roots
or take into account most of Arabic morphology. It is probably not essential for
the stemmer to yield the correct forms, whether stems or roots. It is sufficient for it
to group together most of the forms that belong together.
12.3 New Studies of Stemming via Morphological Analysis
Since 2002, more morphological analysis tools have become available. It is also
clear that there are probably better ways to use morphological analysis in stem-
ming than simply to use the roots for indexing. In this part of the chapter, we
rphological analyzer, and the Diab to-
kenizer and part of speech tagger to aid the stemming process.
12.3.1 Buckwalter Morphological Analyzer
Tim Buckwalter’s morphological analyzer has been made available through the Lin-
guistic Data Consortium (LDC) [39]. It takes as input Arabic words with or without
short vowels and performs morphological analysis and POS tagging using three dic-
tionaries and three compatibility tables. The three dictionaries list possible prefixes,
Arabic stems, and possible suffixes. The three compatibility tables indicate (1) com-
patible prefix/stem category pairs, (2) compatible prefix/suffix category pairs, and
(3) compatible stem/suffix category pairs. The analyzer performs tokenization, word
segmentation, dictionary lookup, compatibility checks, and lists all the possible
analyses of each word. For example, for the word ΔϴϟΎϤθϟ΍ (AlšmAlyƫ), we get the
following output according to Buckwalter’s transliteration:
INPUT STRING: ΔϴϟΎϤθϟ΍
LOOK-UP WORD: Al$mAlyp
SOLUTION 1: (Al$amAliy~ap) [$amAliy~_1] [$amAliy~]Al/DET+$amAliy~/ADJ+ap/NSUFF_FEM_SG
(GLOSS): the + north/northern + [fem.sg.]
SOLUTION 2: (Al$imAliy~ap) [$imAliy~_1] [$imAliy~] Al/DET+$imAliy~/ADJ+ap/NSUFF_FEM_SG
(GLOSS): the + leftist + [fem.sg.]
Note that the second field in square brackets in each solution line is one we added
to the morphological analyzer program, AraMorph.pl, to give the stem in a form
that is more useful to us. The example illustrates some interesting properties of the
analyzer. Although much of our corpus does not include short vowels, the analy-
ses have short vowels. In fact, the stems yielded by the two different solutions
above differ only in the short vowels.
It is straightforward to use this morphological analysis for stemming, because it
analyzes tokens into up to three parts: prefix, stem, and suffix. To stem, we simply
remove all prefixes and suffixes and use the remaining stems, normalized to be
comparable to our light stemmers. A potential problem is in dealing with multiple
different analyses for the same word. However, once short vowels were removed
and other normalization was performed, many of the different analyses actually
yielded the same stem. In the example above, the two different stems, šamAliy~
and šimAliy~, both become the same stem, šmAly, after normalization. Ulti-
mately, the vast majority of words had unique stems. In one sample of 18,035
words, 14,878 (82%) were found to have exactly one solution, 2,322 (13%) had
more than one solution, and 829 (5%) had no solution.
In particular, the following steps were performed on each file of the AFP_ARB
corpus:
1. Run the modified version of AraMorph.pl to find all the analyses for each
word.
2. Normalize each stem in each solution by removing short vowels, converting all
forms of alif to bare alif, and changing alif maqSura (ϯ) to y (ϱ).
3. If there is exactly one normalized stem, replace the word with the stem. If there
were no solutions found, or more than one distinct normalized stem, use the
normalized form of the original word.
To address the problem of what to do if the morphological analyzer gave more
than one possible stem, Xu, Fraser, and Weischedel [57] implemented a system that
used both analyses when a word had more than one solution, but found the results
were not significantly different from one that left words with multiple analyses
unstemmed. We decided to implement a second version of a Buckwalter-based
stemmer (Buckwalter+) that applied light10 to words if the analyzer found zero
analyses or more than one analysis.
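A sketch of this stemming policy (our reading of the procedure, not the authors' code), where analyze(word) stands in for the modified AraMorph.pl lookup returning the candidate stems of a word, and normalize and light10 are as sketched earlier:

def buckwalter_stem(word, analyze, plus=False):
    # Collapse the analyses to the set of distinct normalized stems.
    stems = {normalize(stem) for stem in analyze(word)}
    if len(stems) == 1:
        return stems.pop()               # a unique normalized stem: use it
    if plus:
        return light10(normalize(word))  # Buckwalter+: fall back to light10
    return normalize(word)               # otherwise keep the normalized word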
12.3.2 Diab Tokenizer, Lemmatizer and POS Tagger
The Diab morphological analysis tools are available for download on the internet.
Their distribution ArabicSVMTools [19] includes the models they trained, which we
used without training our own models. Their tokenization segments clitics (prepo-
sitions, conjunctions, and some pronouns) from stems, and the part of speech tag-
ger labels each segment with one of 24 parts of speech from a tag set collapsed
from the 135 tags created by Buckwalter’s AraMorph. We noted, first of all, that
the tokenization part of the process separates some of the same segments that a
stemmer should remove – it separates ϭ (w) from the beginnings of words, and
determiners like ϝ΍ (Al). It also separates some prepositions like Ώ (b) and ϝ (l),
which light10 does not do, unless they precede ϝ΍ (Al). It also separates some suf-
fixes, like possessive pronoun enclitics. The POS tagger then tags these segments
with a POS label, which lets us identify closed-class segments and remove them to
accomplish stopword removal. It also allows us to remove additional suffixes con-
tingent on part of speech.
To use the tagger for stemming, we first modified our query and corpus files to
contain one sentence per line, because the analyzer operates on sentences. We then
ran the tokenizer, lemmatizer, and POS tagger on the sentences. We removed
segments with the following tags: CC, DT, RP, PRP, PRP$, CD, IN, WP, WRB,
PUNC, NUMERIC_COMMA (conjunction, determiner, particle, personal pro-
noun, possessive personal pronoun, cardinal number, subordinating conjunction or
preposition, relative pronoun, wh-adverb, punctuation). This amounts to a much
weaker stemmer than light10, because it removes almost no suffixes. Therefore we
tested three other stemmers derived from this morphological analyzer. Our goal was
to remove possible plural and dual endings only from words identified as plural
nouns and adjectives. Unfortunately, while singular and plural nouns (and singular
and plural proper nouns) received distinct tags, adjectives all received the same tag,
JJ, so we could not easily determine which were plural or dual. Therefore, we tried
two versions of plural suffix removal, described below as Diab2 and Diab3. In an
analogue to the Buckwalter+ condition, in which we blindly performed light10
stemming if a word did not yield a unique stem, we also have a Diab+ condition, in
which we remove light10 suffixes regardless of part of speech.
To summarize the 4 stemmers derived from the morphological analysis tools:
Diab: tokenization, morphological analysis, remove closed class segments
Diab+: Diab, then remove light10 suffixes
Diab2: Diab, then remove possible plural and dual endings (At, wn, yn, w, An, y)
from segments marked as plural nouns or plural proper nouns
Diab3: Diab, then remove (At, wn, yn, w, An, y, yƫ, ƫ) from any segments
marked as nouns or adjectives.
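The four variants can also be summarized in code. This sketch assumes the tagger output is available as a list of (segment, tag) pairs; NN/NNS/NNP/NNPS stand in for the singular/plural (proper) noun tags, which is our assumption about the collapsed tag set, and the endings are given in Buckwalter transliteration.

CLOSED_CLASS = {'CC', 'DT', 'RP', 'PRP', 'PRP$', 'CD', 'IN', 'WP', 'WRB',
                'PUNC', 'NUMERIC_COMMA'}
PLURAL_ENDINGS = ('At', 'wn', 'yn', 'w', 'An', 'y')
LIGHT10_SUFFIXES = ('hA', 'An', 'At', 'wn', 'yn', 'yh', 'yp', 'h', 'p', 'y')

def strip_one(segment, endings):
    # Remove the first matching ending if at least 2 characters remain.
    for ending in endings:
        if segment.endswith(ending) and len(segment) - len(ending) >= 2:
            return segment[:-len(ending)]
    return segment

def diab_stem(tagged_segments, variant='Diab'):
    stems = []
    for segment, tag in tagged_segments:
        if tag in CLOSED_CLASS:
            continue  # POS-based stopword removal (all variants)
        if variant == 'Diab+':
            segment = strip_one(segment, LIGHT10_SUFFIXES)
        elif variant == 'Diab2' and tag in ('NNS', 'NNPS'):
            segment = strip_one(segment, PLURAL_ENDINGS)
        elif variant == 'Diab3' and tag in ('NN', 'NNS', 'NNP', 'NNPS', 'JJ'):
            segment = strip_one(segment, PLURAL_ENDINGS + ('yp', 'p'))
        stems.append(segment)
    return stems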
12.3.3 Comparison of New Morphological Stemmers with Light Stemmer
These experiments were carried out in much the same way as those in Section 12.2.3,
except for a larger query set. In addition to the 25 queries from TREC2001, there
were 50 queries from TREC2002 for a total of 75. The same Arabic corpus was
used for retrieval. Figure 12.6 and Figure 12.7 show monolingual retrieval for the
75 queries.
Table 12.4 shows average precision for all the stemming conditions tested, with
and without query expansion. The boldface type indicates that the average preci-
sion was significantly worse than the corresponding light10 condition.
Fig. 12.6. Monolingual 11 point precision for 75 unexpanded queries
Fig. 12.7. Monolingual 11 point precision for 75 expanded queries
Table 12.4. Average precision for 75 queries, comparison of morphological
stemmers with light10

Stemmer     | Unexpanded | Expanded
Light10     | .353 | .387
Buckwalter  | .330 | .386
Buckwalter+ | .334 | .390
Diab        | .247 | .322
Diab2       | .257 | .336
Diab3       | .302 | .354
Diab+       | .302 | .356
Without query expansion, all of the morphological analysis conditions are sig-
nificantly worse than light10. With query expansion, the Buckwalter stemmers
equal the performance of light10. For both expanded and unexpanded queries, all
the Diab stemmers are significantly worse than the Buckwalter stemmers. Diab+
and Diab3 were almost identical, that is, blindly removing the set of light10 suf-
fixes from all words gives the same performance as removing suffixes from nouns
and adjectives. Diab+ and Diab3 were significantly better than Diab2, which is
significantly better than Diab.
The poor performance of the basic Diab stemmer is not surprising – it is per-
forming stemming that is most comparable to light3 in Section 12.2.3. But we ex-
pected it to be more improved by POS-dependent suffix removal, and by its more
complete stop word removal. An examination of the query set and a sample AFP
article after tokenization and POS tagging showed more tokenization mistakes than
we expected. For example, the tokenizer sometimes did not separate the definite
article ϝ΍ (Al).
The queries were made up of a non-sentence (title) line, followed by one to
three sentences of description, as in Table 12.5 and Table 12.6. Because the
tokenizer was trained on complete sentences, it did not work well on titles. It of-
ten failed to segment Al when the first word of a title was a noun, as in Table 12.5.
Table 12.6 shows an example of incorrect segmentation of b from the front of a
word in a title, but not in the complete sentence (the description).
Table 12.5. Example of Tokenization, TREC 2002 query AR30

               | Title | Description
English        | Iraqi satellite television | What is the importance of satellite television in Iraq?
Arabic         | ϕ΍ήόϟ΍ ϲϓ ϲ΋Ύπϔϟ΍ ϥϮϳΰϔϠΘϟ΍ | ˮϕ΍ήόϟ΍ ϲϓ ϲ΋Ύπϔϟ΍ ϥϮϳΰϔϠΘϟ΍ ΔϴϤϫ΍ Ύϣ
Transliterated | Altlfzywn AlfDAǔy fy AlȢrAq | mA Ahmyƫ Altlfzywn AlfDAǔy fy AlȢrAq?
Tokenized      | Altlfzywn Al fDAǔy fy Al ȢrAq | mA Ahmyƫ Al tlfzywn Al fDAǔy fy Al ȢrAq?
Table 12.6. Example of Tokenization, TREC 2002 query AR32

Title (English): Caspian Beluga Conservation
Title (Arabic): صيانة بلوغا في بحرقزوين
Title (transliterated): SyAnƫ blwȖA fy bHrqzwyn
Title (tokenized): SyAnƫ b lwȖA fy bHrqzwyn
Description (English): What Beluga conservation projects are present in the Caspian region?
Description (Arabic): ما هي المشاريع لصيانة البلوغا في بحر قزوين؟
Description (transliterated): mA hy AlmšAryȢ lSyAnƫ AlblwȖA fy bHr qzwyn
Description (tokenized): mA hy Al mšAryȢ l SyAnƫ Al blwȖA fy bHr qzwyn
Performance was hurt by these tokenization errors. Note in the two examples that
the tokenization errors were serious - they occurred in important content words in
the query, but the same words were correctly tokenized in the description part of the
query. Performance would have been hurt even more on title-only queries.
12.4 Conclusions
Stemming has a large effect on Arabic information retrieval, far larger than the ef-
fect found for most other languages. For monolingual retrieval we have demon-
strated improvements of around 100% in average precision due to stemming and
related processes, and an even larger effect for dictionary-based cross-language re-
trieval. This stemming effect is very large, compared to that found in many other
stemming studies, but is consistent with the hypothesis of Popović and Willett [49]
and Pirkola [48] that stemming should be particularly effective for languages with
more complex morphology.
The best stemmer was a light stemmer that removed stop words, definite articles, and و ("and") from the beginning of words, and a small number of suffixes from the end of words (light10). With query expansion, light10 yielded results
comparable to that of the top performers at TREC, monolingual and cross-
language. We have now compared light stemming with several different stemming
approaches based on morphological analysis: indexing roots returned by morpho-
logical analysis, indexing stems returned by morphological analysis, and some-
what more intelligent stemming based on part-of-speech assignments. Although
we can equal the light stemmer with stemming based on Buckwalter’s morpho-
logical analysis, we have not been able to attain significantly better performance
using morphological analysis.
Given the morphological complexity of Arabic, why would a morphological
analyzer not perform better than such a simple stemmer? We hypothesize several
factors. First, morphological analyzers make mistakes, particularly on names. In the dictionary-driven Buckwalter approach, words are exempted from stemming if they are not found in the lexicon. With the Diab approach, we observed many mistakes of tokenization in morphological analysis, which prevented words from getting the correct part of speech, so that they did not undergo the correct POS-dependent modifications. If one took a sample of Arabic text with complete sentences, the tokenization and POS tagging would have fewer errors. Note, however, that Arabic text contains so many definite articles that one could obtain the claimed >99% tokenization accuracy simply by removing Al from the beginnings of words.
Second, models used in IR treat documents and queries as “bags of words,” or
at best, bags of unigrams, bigrams, and trigrams. Our current retrieval models may
not be able to use the information provided by morphological analysis.
Third, light stemming is robust. It does not require complete sentences. It does
not try to handle every single case. It is sufficient for information retrieval that
many of the most frequently occurring forms of a word be conflated. If an occa-
sional form is missed, it is likely that other forms of the same word occur with it in
the same documents, so the documents are likely to be retrieved anyway.
Fourth, it is still not clear what the correct level of conflation should be for IR.
Clearly, we do not want to represent Arabic words by their roots, and equate all
words derived from the same root. But we still believe that light stemmers are too
weak. None of the light stemmers correctly groups broken plurals with their singu-
lar forms.
These studies are only a beginning. We have not ruled out the possibility that a better morphological analyzer, and better use of morphological analysis to conflate words, could work better than a light stemmer. We have only tried a few obvious alternatives. Ultimately, one would like to be able to conflate all the inflected forms of a noun together, including broken plurals, and all the conjugations of a verb, which we cannot do today. Clearly, there is room for future work that makes intelligent use of morphological analysis in information retrieval.
Acknowledgments
We would like to thank Shereen Khoja for providing her stemmer, Nicholas J.
DuFresne for writing some of the stemming and dictionary code, Fang-fang Feng
for helping with dictionary collection over the web, Mohamed Taha Mohamed,
Mohamed Elgadi, and Nasreen Abdul-Jaleel for help with the Arabic language, and Victor Lavrenko for the use of his vector and language modeling code. This work
was supported in part by the Center for Intelligent Information Retrieval and in
part by SPAWARSYSCEN-SD grant numbers N66001-99-1-8912 and N66001-
02-1-8903. Any opinions, findings and conclusions or recommendations expressed
in this material are the authors’ and do not necessarily reflect those of the sponsor.
References
[1] Abu-Salem, H., Al-Omari, M., and Evens, M. Stemming methodologies over individual query words for Arabic information retrieval. JASIS, 50 (6), pp. 524–529, 1999.
[2] Al-Fedaghi, S. S. and Al-Anzi, F. S. A new algorithm to generate Arabic root-pattern forms. In Proceedings of the 11th national computer conference. King Fahd University of Petroleum & Minerals, Dhahran, Saudi Arabia, pp. 391–400, 1989.
[3] Aljlayl, M., Beitzel, S., Jensen, E., Chowdhury, A., Holmes, D., Lee, M., Grossman, D., and Frieder, O. IIT at TREC-10. In TREC 2001. Gaithersburg: NIST, pp. 265–275, 2001.
[4] Al-Kharashi, I. and Evens, M. W. Comparing words, stems, and roots as index terms in an Arabic information retrieval system. JASIS, 45 (8), pp. 548–560, 1994.
[5] Allan, J., Callan, J., Collins-Thompson, K., Croft, B., Feng, F., Fisher, D., Lafferty, J., Larkey, L., Truong, T. N., Ogilvie, P., Si, L., Strohman, T., Turtle, H., and Zhai, C. The Lemur toolkit for language modeling and information retrieval. http://www.lemurproject.org/~lemur
[6] Al-Shalabi, R. Design and implementation of an Arabic morphological system to
support natural language processing. PhD thesis, Computer Science, Illinois Insti-
tute of Technology, Chicago, 1996.
[7] Beesley, K. R. Arabic finite-state morphological analysis and generation. In
COLING-96: Proceedings of the 16th international conference on computational
linguistics, vol. 1, pp. 89–94, 1996.
[8] Berlian, V., Vega, S. N., and Bressan, S. Indexing the Indonesian web: Language
identification and miscellaneous issues. Presented at Tenth International World
Wide Web Conference, Hong Kong, 2001.
[9] Brent, M. R. Speech segmentation and word discovery: A computational perspec-
tive. Trends in Cognitive Science, 3 (8), pp. 294–301, 1999.
[10] Buckwalter, T. Qamus: Arabic lexicography. http://www.qamus.org/
[11] Callan, J. P., Croft, W. B., and Broglio, J. TREC and TIPSTER experiments with
INQUERY. Information Processing and Management, 31 (3), pp. 327–343, 1995.
[12] Carlberger, J., Dalianis, H., Hassel, M., and Knutsson, O. Improving
precision in information retrieval for Swedish using stemming. In
Proceedings of NODALIDA ‘01–13th Nordic conference on computational
linguistics. Uppsala, Sweden, 2001. http://www.nada.kth.se/~xmartin/papers/-
Stemming_NODALIDA01.pdf
[13] Chen, A. and Gey, F. Building an Arabic stemmer for information retrieval. In
TREC 2002. Gaithersburg: NIST, pp 631–639, 2002.
[14] Darwish, K. Building a shallow morphological analyzer in one day. ACL 2002
Workshop on Computational Approaches to Semitic languages, pp. 47–54, July 11,
2002.
[15] Darwish, K., Doermann, D., Jones, R., Oard, D., and Rautiainen, M. TREC-10 ex-
periments at Maryland: CLIR and video. In TREC 2001. Gaithersburg: NIST,
pp 549–562, 2001.
[16] Darwish, K. and Oard, D.W. CLIR Experiments at Maryland for TREC-2002: Evi-
dence combination for Arabic-English retrieval. In TREC 2002. Gaithersburg: NIST,
pp 703–710, 2002.
[17] de Marcken, C. Unsupervised language acquisition. PhD thesis, MIT, Cambridge,
1995.
[18] De Roeck, A. N. and Al-Fares, W. A morphologically sensitive clustering algorithm
for identifying Arabic roots. In Proceedings ACL-2000. Hong Kong, pp 199–206,
2000.
[19] Diab, M. ArabicSVMTools. http://www.stanford.edu/~mdiab/software/ArabicSVMTools.
tar.gz. 2004.
[20] Diab, M., Hacioglu, K., and Jurafsky, D. Automatic tagging of Arabic text: From
raw test to base phrase chunks. In Proceedings of HLT-NAACL, pp 149–152, 2004.
http://www.stanford.edu/~mdiab/papers/ArabicChunks.pdf.
[21] Ekmekcioglu, F. C., Lynch, M. F., and Willett, P. Stemming and n-gram matching
for term conflation in Turkish texts. Information Research News, 7 (1), pp. 2–6,
1996.
[22] Flenner, G. Ein quantitatives Morphsegmentierungssystem für spanische Wortformen. In Computatio linguae II, U. Klenk, Ed. Stuttgart: Steiner Verlag, pp. 31–62, 1994.
[23] Frakes, W. B. Stemming algorithms. In Information retrieval: Data structures and
algorithms, W. B. Frakes and R. Baeza-Yates, Eds. Englewood Cliffs, NJ: Prentice
Hall, Chapter 8, 1992.
[24] Freund, E. and Willett, P. Online identification of word variants and arbitrary
truncation searching using a string similarity measure. Information Technology:
Research and Development, 1, pp. 177–187, 1982.
[25] Gey, F. C. and Oard, D. W. The TREC-2001 cross-language information retrieval
track: Searching Arabic using English, French, or Arabic queries. In TREC 2001.
Gaithersburg: NIST, pp 16–26, 2002.
[26] Goldsmith, J. Unsupervised learning of the morphology of a natural language.
Computational Linguistics, 27 (2), pp. 153–198, 2000.
[27] Goldsmith, J., Higgins, D., and Soglasnova, S. Automatic language-specific
stemming in information retrieval. In Cross-language information retrieval and
evaluation: Proceedings of the CLEF 2000 workshop, C. Peters, Ed.: Springer
Verlag, pp. 273–283, 2001.
[28] Goweder, A. and De Roeck, A. Assessment of a significant Arabic corpus.
Presented at the Arabic NLP Workshop at ACL/EACL 2001, Toulouse, France,
2001. http://www.elsnet.org/arabic2001/goweder.pdf
[29] Greengrass, M., Robertson, A. M., Robyn, S., and Willett, P. Processing
morphological variants in searches of Latin text. Information Research News, 6 (4),
pp. 2–5, 1996.
[30] Hafer, M. A. and Weiss, S. F. Word segmentation by letter successor varieties.
Information Storage and Retrieval, 10, pp. 371–385, 1974.
[31] Hull, D. A. Stemming algorithms - a case study for detailed evaluation. JASIS, 47
(1), pp. 70–84, 1996.
[32] Janssen, A. Segmentierung französischer Wortformen in Morphe ohne Verwendung eines Lexikons. In Computatio linguae, U. Klenk, Ed. Stuttgart: Steiner Verlag, pp. 74–95, 1992.
[33] Khoja, S. and Garside, R. Stemming Arabic text. Computing Department, Lancaster
University, Lancaster, 1999. http://www.comp.lancs.ac.uk/computing/users/khoja/
stemmer.ps
[34] Klenk, U. Verfahren morphologischer Segmentierung und die Wortstruktur im
Spanischen. In Computatio Linguae, Aufsätze zur algorithmischen und quantitativen
Analyse der Sprache, U. Klenk, Ed. Stuttgart: Steiner Verlag, pp 110–124, 1992.
[35] Kraaij, W. and Pohlmann, R. Viewing stemming as recall enhancement. In
Proceedings of ACM SIGIR96. pp. 40–48, 1996.
[36] Krovetz, R. Viewing morphology as an inference process. In Proceedings of ACM
SIGIR93, pp. 191–203, 1993.
[37] Larkey, Leah S., Ballesteros, L., and Connell, M. (2002) Improving stemming for
Arabic information retrieval: Light stemming and co-occurrence analysis In
Proceedings of the 25th annual international conference on research and
development in information retrieval (SIGIR 2002), Tampere, Finland, August
11–15, 2002, pp. 275–282.
[38] Larkey, L. S. and Connell, M. E. Arabic information retrieval at UMass in TREC-
10. In TREC 2001. Gaithersburg: NIST, 2001.
[39] LDC, Linguistic Data Consortium. Buckwalter Morphological Analyzer Version
1.0, LDC2002L49, 2002. http://www.ldc.upenn.edu/Catalog/.
[40] LDC, Linguistic Data Consortium. Arabic Penn TreeBank 1, v2.0. LDC2003T06,
2003. http://www.ldc.upenn.edu/Catalog/
[41] Lovins, J. B. Development of a stemming algorithm. Mechanical Translation and
Computational Linguistics, 11, pp. 22–31, 1968.
[42] Mayfield, J., McNamee, P., Costello, C., Piatko, C., and Banerjee, A. JHU/APL at
TREC 2001: Experiments in filtering and in Arabic, video, and web retrieval. In
TREC 2001. Gaithersburg: NIST, pp 332–341, 2001.
[43] McNamee, P., Mayfield, J., and Piatko, C. A language-independent approach to
European text retrieval. In Cross-language information retrieval and evaluation:
Proceedings of the CLEF 2000 workshop, C. Peters, Ed.: Springer Verlag, pp. 129–
139, 2000.
[44] Monz, C. and de Rijke, M. Shallow morphological analysis in monolingual
information retrieval for German and Italian. In Cross-language information
retrieval and evaluation: Proceedings of the CLEF 2001 workshop, C. Peters,
Ed.: Springer Verlag, 2001. http://staff.science.uva.nl/~christof/Papers/clef-
2001-post.pdf
[45] Moulinier, I., McCulloh, A., and Lund, E. West group at CLEF 2000: Non-English
monolingual retrieval. In Cross-language information retrieval and evaluation:
Proceedings of the CLEF 2000 workshop, C. Peters, Ed.: Springer Verlag, pp. 176–
187, 2001.
[46] Oard, D. W., Levow, G. -A., and Cabezas, C. I. CLEF experiments at Maryland:
Statistical stemming and backoff translation. In Cross-language information
retrieval and evaluation: Proceedings of the CLEF 2000 workshop, C. Peters, Ed.:
Springer Verlag, pp. 176–187, 2001.
[47] NIST. Topic Detection and Tracking Resources. http://www.nist.gov/speech/tests/tdt/
resources.htm. Created 2000, updated 2002.
[48] Pirkola, A. Morphological typology of languages for IR. Journal of Documentation,
57 (3), pp. 330–348, 2001.
[49] Popović, M. and Willett, P. The effectiveness of stemming for natural-language
access to Slovene textual data. JASIS, 43 (5), pp. 384–390, 1992.
[50] Porter, M. F. An algorithm for suffix stripping. Program, 14 (3), pp. 130–137, 1980.
[51] Rogati, M., McCarley, S., and Yang, Y. Unsupervised learning of Arabic stemming
using a parallel corpus. In Proceedings ACL-2003, Sapporo, Japan, pp. 391–398,
July 2003. http://acl.ldc.upenn.edu/acl2003/main/pdf/Rogati.pdf
[52] Siegel, S. Nonparametric statistics for the behavioral sciences. New York:
McGraw-Hill, 1956.
[53] Taghva, K., Elkoury, R., and Coombs, J. Arabic Stemming without a root
dictionary. 2005. www.isri.unlv.edu/publications/isripub/Taghva2005b.pdf
[54] Tai, S. Y., Ong, C. S., and Abdullah, N. A. On designing an automated Malaysian
stemmer for the Malay language. (poster). In Proceedings of the fifth international
workshop on information retrieval with Asian languages, Hong Kong, pp. 207–208,
2000.
[55] Xu, J. and Croft, W. B. Corpus-based stemming using co-occurrence of
word variants. ACM Transactions on Information Systems, 16 (1), pp. 61–81,
1998.
[56] Xu, J., Fraser, A., and Weischedel, R. TREC 2001 cross-lingual retrieval at BBN. In
TREC 2001. Gaithersburg: NIST, pp. 68–78, 2001.
[57] Xu, J., Fraser, A., and Weischedel, R. Empirical studies in strategies for Arabic re-
trieval. In Sigir 2002. Tampere, Finland: ACM, pp. 269–274, 2002.
13

Adapting Morphology for Arabic Information Retrieval*

Kareem Darwish1 and Douglas W. Oard2

1 IBM Technology Development Center, P.O. Box 166, El-Ahram, Giza, Egypt
kareem@darwish.org
2 College of Information Studies & UMIACS, University of Maryland, College Park, MD 20742
oard@glue.umd.edu

* All the experiments for this work were performed while the first author was at the University of Maryland, College Park.
Abstract: This chapter presents an adaptation of existing techniques in Arabic morphology by leveraging corpus statistics to make them suitable for Information Retrieval (IR). The adaptation resulted in the development of Sebawai, a shallow Arabic morphological analyzer, and Al-Stem, an Arabic light stemmer. Both were used to produce Arabic index terms for Arabic retrieval experiments. Sebawai is concerned with generating possible roots and stems of a given Arabic word along with probability estimates of deriving the word from each of the possible roots. The probability estimates were used as a guide to determine which prefixes and suffixes should be used to build the light stemmer Al-Stem. The use of the Sebawai-generated roots and stems as index terms, along with the stems from Al-Stem, is evaluated in an information retrieval application and the results are compared.
13.1 Introduction
Due to the morphological complexity of the Arabic language, Arabic morphology
has become an integral part of many Arabic Information Retrieval (IR) and other
natural language processing applications. Arabic words are divided into three
types: noun, verb, and particle (Abdul-Al-Aal, 1987). Nouns and verbs are derived
from a closed set of around 10,000 roots (Ibn Manzour, 2006). The roots are
commonly three or four letters and are rarely five letters. Arabic nouns and verbs
are derived from roots by applying templates to the roots to generate stems and
then introducing prefixes and suffixes. Table 13.1 shows some templates for
3-letter roots. Tables 13.2 and 13.3 show some of the possible prefixes and suf-
fixes and their corresponding meaning. The number of unique Arabic words (or
surface forms) is estimated to be 6 × 10^10 words (Ahmed, 2000). Table 13.4 shows some of the words that may be generated from the root ktb – كتب.
Further, a word may be derived from several different roots. For example the
word AymAn – ايمان can be derived from five different roots. Table 13.5 shows possible roots for the word AymAn – ايمان and the meaning of the word based on
each. For the purposes of this chapter, a word is any Arabic surface form, a stem is
a word without any prefixes or suffixes, and a root is a linguistic unit of meaning,
which has no prefix, suffix, or infix. However, often irregular roots, which contain
double or weak letters, lead to stems and words that have letters from the root that
are deleted or replaced.
Table 13.1. Some templates to generate stems from roots, with examples from the root (ktb – كتب)

Template            Stem               Meaning
CCC – فعل           ktb – كتب          books, wrote, etc.
mCCwC – مفعول       mktwb – مكتوب      something written
CCAC – فعال         ktAb – كتاب        book
CCACyC – فعاعيل     ktAtyb – كتاتيب    Qur'an school
CACC – فاعل         kAtb – كاتب        writer
CCwC – فعول         ktwb – كتوب        skilled writer
Table 13.2. Some example prefixes and their meanings

Prefix    w – و    k – ك    f – ف    l – ل    Al – ال    wAl – وال
Meaning   and      like     then     to       the        and the
Table 13.3. Some example suffixes and their meanings

Suffix    h – ه    k – ك         hm – هم    km – كم       hA – ها     y – ي
Meaning   his      your (sg.)    their      your (pl.)    her, its    my
Table 13.4. Some words that can be derived from the root ktb – كتب

Word      ktAb – كتاب    wktAbh – وكتابه    yktb – يكتب    ktAbhm – كتابهم    mktbƫ – مكتبة    AlkAtb – الكاتب
Meaning   book           and his book       he writes      their book         library          the writer
Table 13.5. Possible roots for the word AymAn – ايمان along with meaning

Root           Meaning
Amn – أمن      peace or faith
Aym – أيم      two poor people
mAn – مأن      will he give support
ymn – يمن      covenants
ymA – يمأ      will they (fm.) point to
For Arabic IR, several early studies suggested that indexing Arabic text using
roots significantly increases retrieval effectiveness over the use of words or stems
(Abu-Salem et al., 1999; Al-Kharashi & Evens, 1994; Hmeidi et al., 1997). How-
ever, the studies used small test collections of only hundreds of documents and the
morphology in many of the studies was done manually. Performing morphological
analysis for Arabic IR using existing Arabic morphological analyzers, most of
which use finite state transducers (Antworth, 1990; Kiraz, 1998; Koskenniemi,
1983), is problematic for two reasons. First, they were designed to produce as
many analyses as possible without indicating which analysis is most likely. This
property of the analyzers complicates retrieval, because it introduces ambiguity in
the indexing phase as well as the search phase of retrieval. Second, the use of fi-
nite state transducers inherently limits coverage, which is the number of words
that the analyzer can analyze, to the cases programmed into the transducers. A
later study by Aljlayl et al. (2001) on a large Arabic collection of 383,872 documents
suggested that lightly stemmed words, where only common prefixes and suffixes
are stripped from words, were perhaps better index terms for Arabic. Aljlayl et al.
did not report the list of prefixes and suffixes used in the light stemmer. If indeed
lightly stemmed words work well, then determining the list of prefixes and suf-
fixes to be removed by the stemmer is desirable. This chapter will focus on two
aspects of Arabic morphology. The first is adapting existing Arabic morphological
analysis using corpus statistics to attempt to produce the most likely analysis of a
word and to improve coverage. The resulting analyzer is called Sebawai. The sec-
ond is to construct a set of common prefixes and suffixes suitable for light stem-
ming. The resulting light stemmer is called Al-Stem.
Section 13.2 will provide some background on Arabic morphology and Arabic IR. Section 13.3 will provide a system description of Sebawai along with an evaluation of its correctness and coverage. Section 13.4 describes the development
of Al-Stem. Section 13.5 compares Sebawai and Al-Stem in the context of an IR
application. Section 13.6 addresses some of the shortcomings of Sebawai and con-
cludes the chapter.
13.2 Background
13.2.1 Arabic Morphology
Significant work has been done in the area of Arabic morphological analysis.
Some of the approaches include:
1. The Symbolic Approach: In this approach, morphotactic (rules governing the
combination of morphemes, which are meaning bearing units in the language)
and orthographic (spelling rules) rules are programmed into a Finite State
Transducer (FST). Koskenniemi proposed a two-level system for language
morphology, which led to Antworth’s two-level morphology system
PC-KIMMO (Antworth, 1990; Koskenniemi, 1983). Later, Beesley and
Buckwalter developed an Arabic morphology system, ALPNET, which uses a
slightly enhanced implementation of PC-KIMMO (Beesley et al., 1989;
Kiraz, 1998). However, this approach was criticized by Ahmed (2000) for requiring excessive manual processing to state rules in an FST and for being able to analyze only words that appear in Arabic dictionaries. Kiraz (1998)
summarized many variations of the FST approach.
2. Unsupervised Machine Learning Approach: Goldsmith (2000) developed an unsupervised morphology-learning tool called AutoMorphology. This system is advantageous because it can automatically learn the most common prefixes and suffixes from just a word list. However, such a system cannot detect infixes or uncommon prefixes and suffixes.
3. Statistical Rule-Based Approach: This approach uses rules in conjunction with statistics: it employs a list of prefixes, a list of suffixes, and templates to extract a stem from a word and a root from a stem. Possible pre-
fix-suffix-template combinations are constructed for a word. Hand-crafted
rules are used to eliminate impossible combinations and the remaining com-
binations are then statistically ranked. RDI’s system called MORPHO3 util-
izes such a model (Ahmed, 2000). Such an approach achieves broad morpho-
logical coverage of the Arabic language.
4. Light Stemming Based Approach: In this approach, leading and trailing letters
in a word are removed if they match entries in lists of common prefixes and
suffixes respectively. The advantage of this approach is that it requires no
morphological processing and is hence very efficient. However, incorrect pre-
fixes and suffixes are routinely removed. This approach was used to develop Arabic stemmers by Aljlayl et al. (2001), Darwish & Oard (2002a), and Larkey, Ballesteros & Connell (2002).
13.2.2 Arabic Information Retrieval
Most early studies of character-coded Arabic text retrieval relied on relatively
small test collections (Abu-Salem et al., 1999; Al-Kharashi & Evens, 1994;
Darwish & Oard, 2002a; Darwish & Oard, 2002b). The early studies suggested
that roots, followed by stems, were the best index terms for Arabic text. Recent
studies are based on a single large collection (from TREC-2001/2002), (Darwish
& Oard, 2002b; Gey & Oard, 2001) and suggest that perhaps light stemming and
character n-grams are the best index terms. The studies examined indexing using
words, word clusters (Xu et al., 2001), terms obtained through morphological
analysis (e.g., stems and roots (Al-Kharashi & Evens, 1994; Aljlayl et al., 2001)),
light stemming, and character n-grams of various lengths (Darwish & Oard, 2002a;
Mayfield et al., 2001). The effects of normalizing alternative characters, removal
of diacritics and stop-word removal have also been explored (Chen & Gey, 2001;
Mayfield et al., 2001; Xu et al., 2001).
13.3 System Description
This section describes the development of an Arabic morphological analyzer
called Sebawai, which adapts a commercial Arabic morphological analyzer called
ALPNET (Beesley, 1996; Beesley et al., 1989) to find the most likely analysis and
to improve coverage. ALPNET is based on a finite state transducer that uses a set
of 4,500 roots. Sebawai uses corpus based statistics to estimate the occurrence
probabilities of templates, prefixes, and suffixes. Sebawai trains on a list of word-
root pairs to:
1. Derive templates that produce stems from roots,
2. Construct a list of prefixes and suffixes, and
3. Estimate the occurrence probabilities of templates, stems, and roots.
The words in the word-root pairs were extracted from an Arabic corpus.
13.3.1 Acquiring Word-Root Pairs
The list of word-root pairs may be constructed either manually, using a dictionary,
or by using a preexisting morphological analyzer such as ALPNET (Ahmed, 2000;
Beesley, 1996; Beesley et al., 1989) as follows:
1. Manual construction of word-root pair list: Building the list of several thousand
pairs manually is time consuming, but feasible. Assuming that a person who
knows Arabic can generate a root for a word every 5 seconds, the manual proc-
ess would require about 14 hours of work to produce 10,000 word-root pairs.
2. Automatic construction of a list using dictionary parsing: Extracting word-
root pairs from an electronic dictionary is feasible. Since Arabic words are
looked up in a dictionary using their root form, an electronic dictionary can be
parsed to generate the desired list. Figure 13.1 shows an example of a dictionary
entry for a root and words in the entry that are derived from the root. How-
ever, some care should be given to throw away dictionary examples and
words unrelated to the root. Further, the distribution of words extracted from
a dictionary might not be representative of the distribution of words in a text
corpus.
3. Automatic construction using a pre-existing morphological analyzer: This
process is simple, but requires the availability of an analyzer and a corpus. It
has the advantage of producing analyses for large numbers of words extracted
from a corpus.
Fig. 13.1. An example of a dictionary entry where the root is inside a rectangle and words derived from the root are circled
For the purposes of this research, the third method was used to construct the list of
word-root pairs. Two lists of Arabic words were analyzed by ALPNET and then
the output was parsed to generate the word-root pairs. One list was extracted from
a small corpus of Arabic text, called Zad. The Zad corpus is comprised of topics
from a 14th century religious book called Zad Al-Me’ad. The list contained 9,606
words that ALPNET was able to analyze successfully. The original list was larger,
but the words that ALPNET was unable to analyze were excluded. This list will be
referred to as the ZAD list. The other list was extracted from the Linguistic Data
Consortium (LDC) Arabic corpus containing AFP newswire stories (Graff &
Walker, 2001). This list contained 562,684 words. Of these words, ALPNET was
able to analyze 270,468 words successfully and failed to analyze 292,216 words.
The subsets which ALPNET was able to analyze or failed to analyze will be re-
ferred to as the LDC-Pass and LDC-Fail lists respectively. ALPNET failed to ana-
lyze words belonging to the following subsets:
a. Named entities and Arabized words, which are words that are adopted from
other languages: examples include the words krdtš – كردتش (Karadish) and dymwqrATyƫ – ديموقراطية (democracy).
b. Misspelled words.
c. Words with roots not in the root list: an example is the word jwAĆ – جواظ (a seldom used word meaning pompous).
d. Words with templates not programmed into the finite state transducer. ALPNET uses a separate list of allowed templates for each root. These lists are not comprehensive. An example is the word msylmƫ – مسيلمة (diminutive of mslmƫ – مسلمة, a person who is safe or submitting).
e. Words with complex prefixes or suffixes: an example is the word bAlxrAfAt – بالخرافات (with the superstitions). ALPNET was able to analyze xrAfAt – خرافات (superstitions) and AlxrAfAt – الخرافات (the superstitions).
It is noteworthy that whenever ALPNET successfully analyzed a word, the gener-
ated analyses were always correct. In the cases when ALPNET provided more
than one analysis for a word (thus more than one possible root), all combinations
of the word and each of the possible roots were used for training. Although using
all the combinations might have distorted the statistics, there was no automatic
method for determining which word-root combinations to pick from the possible
ones for a word. Also, using all the pairs had the effect of expanding the number
of training examples on which the analyzer can be trained.
13.3.2 Training
Sebawai aligns characters in a word and the corresponding root, from the word-
root pairs, using regular expressions to determine the prefix, suffix, and stem tem-
plate. As in Table 13.6, given the pair (wktAbhm – وكتابهم, ktb – كتب), the training module would generate w – و as the prefix, hm – هم as the suffix, and CCAC as the stem template (C's represent the letters in the root). The module increases the observed number of occurrences of the prefix w – و, the suffix hm – هم, and the
template “CCAC” by one. The module takes into account the cases where there
are no prefixes or suffixes and denotes either of them with the symbol “#”.
Table 13.6. The decomposition of the word wktAbhm – وكتابهم with root ktb – كتب

Parts    Prefix    Stem – CCAC – فعال          Suffix
Root               k ك    t ت           b ب
Word     w و       k ك    t ت    A ا    b ب    hm هم
After that, the lists of prefixes, suffixes, and templates are read through to esti-
mate probabilities by dividing the occurrence of each by the total number of word-
root pairs. The probabilities being calculated are given for character strings S1 and
S2 and template T as:
P(S1 begins a word, S1 is a prefix) = No. of words with prefix S1 / Total no. of training words

P(S2 ends a word, S2 is a suffix) = No. of words with suffix S2 / Total no. of training words

P(T is a template) = No. of words with template T / Total no. of training words
A random set of 100 words for which ALPNET produced analyses was manually examined to verify their correctness. ALPNET produced correct analyses for every word in the set.
Notice that Sebawai’s stems are slightly different from standard stems, in that stan-
dard stems may have letters added in the beginning. Linguistic stems are often referred to in Arabic as tfȢylAt – تفعيلات or standard stem templates. For example, the
template mCCwC has “m” placed before the root making “m” a part of the stem
template. However, the training module of Sebawai has no prior knowledge of
standard stem templates. Therefore, for the template “mCCwC,” the training module
treated “m” as a part of the prefix list and the extracted template is “CCwC.”
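The training step can be illustrated with a short Python sketch (ours, not the distributed Sebawai code): a lazy regular-expression match recovers one (prefix, template, suffix) decomposition per word-root pair, and the counts feed the probability estimates above. Real training must also handle alternative alignments and the smoothing discussed later.

import re
from collections import Counter

prefix_counts, suffix_counts, template_counts = Counter(), Counter(), Counter()

def align(word, root):
    """Return (prefix, template, suffix) for a word-root pair, or None."""
    # Match the root letters in order; text before/after them becomes the
    # prefix/suffix, and letters between them fill the template slots.
    pattern = "(.*?)" + "(.*?)".join(map(re.escape, root)) + "(.*)"
    m = re.fullmatch(pattern, word)
    if m is None:
        return None
    groups = m.groups()
    prefix, suffix, infixes = groups[0], groups[-1], groups[1:-1]
    template = "C" + "".join(infix + "C" for infix in infixes)
    return prefix, template, suffix

def train(pairs):
    """Count prefixes, suffixes, and templates over word-root pairs."""
    for word, root in pairs:
        analysis = align(word, root)
        if analysis is None:
            continue
        prefix, template, suffix = analysis
        prefix_counts[prefix or "#"] += 1      # "#" marks an empty affix
        suffix_counts[suffix or "#"] += 1
        template_counts[template] += 1
    total = float(len(pairs))
    return ({a: c / total for a, c in prefix_counts.items()},
            {a: c / total for a, c in suffix_counts.items()},
            {t: c / total for t, c in template_counts.items()})

For the pair (wktAbhm, ktb) this recovers prefix w, suffix hm, and template CCAC, as in Table 13.6.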
13.3.3 Root Detection
Sebawai detects roots by reading in an Arabic word and attempting to generate feasible prefix-suffix-template combinations. The combinations are produced by gen-
erating all possible ways to break the word into three parts provided that the initial
part is on the list of prefixes, the final part is in the list of suffixes, and the middle
part is at least 2 letters long. The initial and final parts can also be null. Sebawai
then attempts to fit the middle part into one of the templates. If the middle part fits
into a template, the generated root is checked against a list of possible Arabic
roots. The list of Arabic roots, which was extracted from a large Arabic dictionary,
contained approximately 10,000 roots (Ibn Manzour, 2006). For example, the
Arabic word AymAn – ايمان has the possible prefixes "#", A – ا, and Ay – اي, and the possible suffixes "#", n – ن, and An – ان. Table 13.7 shows the possible
analyses of the word.
The stems that Sebawai deemed not feasible were AymA – ايما and ym – يم. Although AymA – ايما is indeed not feasible, ym – يم actually is (it comes from the root ymm – يمم). This problem will be addressed in the following subsection. The
possible roots are ordered according to the product of the probability that a prefix
S1 would be observed, the probability that a suffix S2 would be observed, and the
probability that a template T would be used as follows:
P(root) = P(S1 begins a word, S1 is a prefix) * P(S2 ends a word,
S2 is a suffix) * P(T is a template)
Table 13.7. Possible analyses for the word AymAn – ايمان

Stem             Prefix     Template           Suffix     Root
AymAn – ايمان    #          CyCAC – فيعال      #          Amn – أمن
ymAn – يمأن      A – ا      CCAC – فعال        #          ymn – يمن
mAn – مأن        Ay – اي    CCC – فعل          #          mAn – مأن
Aym – أيم        #          CCC – فعل          An – ان    Aym – أيم
ymA – يمأ        A – ا      CCC – فعل          n – ن      ymA – يمأ
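The root detection procedure just described can be sketched as follows (an illustrative reimplementation, not the released Sebawai code), with the "#" convention for empty affixes carried over from training: enumerate all prefix/stem/suffix splits with a stem of at least two letters, fit the stem against the templates, filter against the root list, and rank by the probability product.

def fit_template(stem, template):
    """Return the root letters if stem instantiates template (C = root slot), else None."""
    if len(stem) != len(template):
        return None
    root = []
    for s_char, t_char in zip(stem, template):
        if t_char == "C":
            root.append(s_char)
        elif s_char != t_char:
            return None
    return "".join(root)

def detect_roots(word, p_prefix, p_suffix, p_template, valid_roots):
    """Rank candidate roots for a word as in Section 13.3.3."""
    candidates = []
    for i in range(len(word) - 1):                # end of the (possibly empty) prefix
        for j in range(i + 2, len(word) + 1):     # stem word[i:j] is at least 2 letters
            prefix = word[:i] or "#"
            suffix = word[j:] or "#"
            if prefix not in p_prefix or suffix not in p_suffix:
                continue
            for template, p_t in p_template.items():
                root = fit_template(word[i:j], template)
                if root and root in valid_roots:
                    score = p_prefix[prefix] * p_suffix[suffix] * p_t
                    candidates.append((score, root, prefix, template, suffix))
    return sorted(candidates, reverse=True)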
13.3.4 Extensions

• Handling Cases Where Sebawai Failed

As seen above, Sebawai deemed the stem ym – يم not feasible, while in actuality the stem maps to the root ymm – يمم. The stem ym – يم has only two letters because the third letter was dropped; the third letter is often dropped when it is identical to the second letter of the root. Sebawai initially failed to analyze all 2-letter stems. Two-letter stems can be derived from roots that contain weak letters (long vowels) and from roots whose second and third letters are identical.

The Arabic long vowels are A – ا, y – ي, and w – و. The weak letters are frequently substituted for each other in stem templates or dropped altogether. For example, the word qAl – قال has the root qwl – قول or qyl – قيل, which would make the word mean 'he said' or 'he napped' respectively. Also, the stem f – ف has the root wfy – وفي, where the letters w – و and y – ي are missing. To compensate for these problems, two-letter stems were corrected by introducing new stems that are generated by doubling the last letter (to produce ymm – يمم from ym – يم) and by adding weak letters before or after the stem. As for stems with a weak middle letter, new stems are introduced by substituting the middle letter with the other weak letters. For example, for qAl – قال, the system would introduce the stems qwl – قول and qyl – قيل. This process over-generates potential roots. Of the three potential roots qAl – قال, qwl – قول, and qyl – قيل, qAl – قال is not a valid root and is thus removed (by comparison with the list of valid roots). To account for these changes, the following probabilities were calculated: (a) the probability that a weak letter would be transformed into another weak letter, (b) the probability that a two-letter word would have a root with the second letter doubled (such as ymm – يمم), and (c) the probability that a two-letter word was derived from a root by dropping an initial or trailing weak letter. The new probability of the root becomes:

P(root) = P(S1 begins a word, S1 is a prefix) * P(S2 ends a word, S2 is a suffix) * P(T is a template) * P(letter substitution or letter addition)
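The expansion step can be sketched as follows; the exact generation conditions in Sebawai may differ, so this is an approximation of the behavior just described, written over the chapter's transliteration.

WEAK_LETTERS = ("A", "w", "y")  # transliterated alif, waw, ya

def expand_stem(stem):
    """Generate extra candidate stems for weak and geminate roots."""
    candidates = set()
    if len(stem) == 2:
        candidates.add(stem + stem[-1])        # ym -> ymm (doubled second radical)
        for weak in WEAK_LETTERS:              # restore a dropped weak radical
            candidates.add(weak + stem)
            candidates.add(stem + weak)
    for i in range(1, len(stem) - 1):          # substitute a weak middle letter
        if stem[i] in WEAK_LETTERS:
            for weak in WEAK_LETTERS:
                if weak != stem[i]:
                    candidates.add(stem[:i] + weak + stem[i + 1:])
    candidates.discard(stem)
    return candidates

For example, expand_stem("qAl") yields qwl and qyl, and expand_stem("ym") includes ymm; invalid candidates are removed by the root-list check.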
• Simplifying Assumptions and Smoothing

The probabilities of prefixes, suffixes, and templates are assumed to be independent.
The independence assumption is made to simplify the ranking, but is not necessar-
ily a correct assumption because certain prefix-suffix combinations are not al-
lowed. As for smoothing the prefix and suffix probabilities, Witten-Bell discount-
ing was used (Jurafsky & Martin, 2000). The smoothing is necessary because
many prefixes and suffixes were erroneously produced. This is a result of word-
root pair letter alignment errors. Using this smoothing strategy, if a prefix or a suf-
fix is observed only once, then it is removed from the respective list.
• Manual Processing

The list of templates was manually reviewed by an Arabic speaker (the first author) to ensure the correctness of the templates. If a template was deemed not correct, it was removed from the list. Checking the list required approximately an hour.

• Particles
To account for particles, which are function words (equivalent to prepositions and pronouns in English) that are typically removed in retrieval applications, a list of Arabic particles was constructed with the aid of An-Nahw Ash-Shamil, an Arabic
grammar book (Abdul-Al-Aal, 1987). If the system matched a potential stem to
one of the words on the particle list, the system indicated that the word is a parti-
cle. Note that particles are allowed to have suffixes and prefixes. A complete list
of the particles is included in the distribution of Sebawai (Darwish, 2002).
• Letter Normalizations

The system employs a letter normalization strategy in order to account for spelling variations and to ease analysis. The first normalization deals with the letters ي (y – ya) and ى (ý – alif maqSura); both are normalized to y – ي. The reason behind this normalization is that there is no single convention for using y – ي or ý – ى when either appears at the end of a word (note that ý – ى only appears at the end of words). In the Othmani script of the Holy Qur'an, for example, any y – ي is written as ý – ى when it appears at the end of a word. The second normalization is that ء (' – hamza), ا (A – alif), آ (Ɩ – alif mamduuda), أ (alif with hamza on top), ؤ (ǒ – hamza on w), إ (Ӽ – alif with hamza on the bottom), and ئ (ǔ – hamza on ya) are all normalized to A – ا (alif). The reason for this normalization is that all forms of hamza are represented in dictionaries as one in root form, and people often misspell the different forms of alif. Lastly, words are stripped of all diacritics.
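One way to implement these normalizations, sketched here over Unicode Arabic codepoints rather than the chapter's transliteration (the exact diacritic inventory stripped by Sebawai is an assumption):

import re

# ya <- alif maqSura; all alif/hamza forms <- bare alif, as described above.
NORMALIZATION_TABLE = str.maketrans({
    "\u0649": "\u064A",                                          # alif maqSura -> ya
    "\u0621": "\u0627", "\u0622": "\u0627", "\u0623": "\u0627",  # hamza, alif mamduuda, alif-hamza -> alif
    "\u0624": "\u0627", "\u0625": "\u0627", "\u0626": "\u0627",  # hamza on w, alif-hamza below, hamza on ya -> alif
})
DIACRITICS = re.compile("[\u064B-\u0652]")  # tanween, short vowels, shadda, sukun

def normalize(word):
    """Apply the letter normalizations and strip all diacritics."""
    return DIACRITICS.sub("", word).translate(NORMALIZATION_TABLE)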
13.3.5 Evaluation
To evaluate Sebawai, it was compared to ALPNET. The two analyzers were com-
pared to test Sebawai’s highest ranked analysis and the coverage of Sebawai and
ALPNET. The experiments performed involved training Sebawai using the LDC-
Pass word-root list and testing on the ZAD list. Of the 9,606 words in the ZAD
list, Sebawai analyzed 9,497 words and failed on 112. For the roots generated by
Sebawai, the top generated root was automatically compared to the roots gener-
ated by ALPNET. If the root is on the list, it is considered correct. Using this
method, 8,206 roots were considered correct. However, this automatic evaluation
had two flaws:
1. The number of Arabic roots in ALPNET’s inventory is only 4,500 roots while
the number of roots used by Sebawai is more than 10,000. This could lead to
false negatives in the evaluation.
2. ALPNET often under-analyzes. For example, the word fy – في could be the particle fy – في or could be a stem with the root fyy – فيي. ALPNET only generates the particle fy – في. This also could lead to false negatives.
Therefore manual examination of rejected analyses was necessary. However, due
to the large number of rejected analyses, 100 rejected analyses from the automatic
evaluation were randomly selected for examination to estimate the shortfall of the
automatic evaluation. Of the 100 rejected roots, 46 were correct and 54 were in-
correct. Figure 13.2 and Table 13.8 present a summary of the results.
For the LDC-Fail list, Sebawai analyzed 128,169 (43.9%) words out of 292,216
words. To verify the correctness of Sebawai’s analyses, 100 analyses were taken
at random from the list for manual examination. Of the 100 analyses, 47 were ac-
tually analyzed correctly. Extrapolating from the results of the manual examina-
tion, Sebawai successfully analyzed approximately 21% of the words in the LDC-
Fail list. Table 13.9 presents a summary of the results.
Fig. 13.2. A summary of the results for the correctness of Sebawai's analysis (% correct under automatic and manual evaluation)
Table 13.8. A summary of the results for the correctness of Sebawai's analysis

No. of Words    No. of Failures    No. Correct – Automatic Evaluation    No. Correct – Manual Evaluation
9,606           112 (1.2%)         8,206 (86.4%)                         8,800 (92.7%)
Table 13.9. Summary of the results for the LDC-Fail list

No. of Words    No. of Analyzed Words    Estimate of Correctly Analyzed
292,216         128,169 (43.9%)          58,000 (21%)
The evaluation clearly shows the effectiveness of Sebawai in achieving its two
primary goals. Namely, Sebawai provides a ranking of the possible analyses and ex-
pands the coverage beyond ALPNET. For words that ALPNET was able to ana-
lyze, Sebawai ranked the correct analysis first for nearly 93% of the words. Fur-
ther, the analyzer was able to correctly analyze 21% of the words that ALPNET
was unable to analyze. Also, due to the fact that prefixes, suffixes, and templates
were derived automatically, Sebawai was developed very rapidly.
13.3.6 Shortcomings of Sebawai
Since analysis is restricted to a choice from a fixed set of roots, Sebawai does not
stem Arabized words and named entities. For example, the English word Britain is
transliterated as bryTAnyA – بريطانيا. From bryTAnyA – بريطانيا, some of the words that can be generated are bryTAny – بريطاني (British), AlbryTAny – البريطاني (the British), and AlbryTAnyyn – البريطانيين (Englishmen). Sebawai is unable to analyze 1-letter Arabic words, which have 3-letter roots. For example, the word q – ق means "protect" (in the form of a command). Since such words are very rare, they may not appear in the training set. Sebawai is also unable to analyze individual Arabic words that constitute complete sentences. For example, the word AnlzmkmwhA – أنلزمكموها means "will we forcefully bind you to it?" These also are rare and may
not appear in a training set.
13.4 Light Stemming
To build the light stemmer, Al-Stem, the lists of prefixes and suffixes generated in
the process of training Sebawai and their corresponding probabilities were exam-
ined. If a prefix or a suffix had a probability of being an affix above 0.5, it was
considered a candidate for building Al-Stem. The list of prefix and suffix can-
didates was manually examined in consultation with Leah Larkey from the Uni-
versity of Massachusetts at Amherst and some of the affixes were removed based
on intuition and knowledge of Arabic.
The final lists of prefixes and suffixes are as follows:
• Prefixes: wAl – وال, fAl – فال, bAl – بال, bt – بت, yt – يت, lt – لت, mt – مت, wt – وت, st – ست, nt – نت, bm – بم, lm – لم, wm – وم, km – كم, fm – فم, Al – ال, ll – لل, wy – وي, ly – لي, sy – سي, fy – في, wA – وا, fA – فا, lA – لا, and bA – با.
• Suffixes: At – ات, wA – وا, wn – ون, wh – وه, An – ان, ty – تي, th – ته, tm – تم, km – كم, hm – هم, hn – هن, hA – ها, yƫ – ية, tk – تك, nA – نا, yn – ين, yh – يه, ƫ – ة, h – ه, y – ي, and A – ا.
Since the result of light stemming may or may not be a real stem, no effort was made to evaluate the correctness of the stemming.
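These lists translate directly into a stemmer. The sketch below (an illustration, not the released Al-Stem code) removes at most one prefix and one suffix, longest match first; the minimum residual stem length is an assumption.

AL_STEM_PREFIXES = sorted(
    ["wAl", "fAl", "bAl", "bt", "yt", "lt", "mt", "wt", "st", "nt", "bm", "lm",
     "wm", "km", "fm", "Al", "ll", "wy", "ly", "sy", "fy", "wA", "fA", "lA", "bA"],
    key=len, reverse=True)
AL_STEM_SUFFIXES = sorted(
    ["At", "wA", "wn", "wh", "An", "ty", "th", "tm", "km", "hm", "hn", "hA",
     "yƫ", "tk", "nA", "yn", "yh", "ƫ", "h", "y", "A"],
    key=len, reverse=True)

def al_stem(word, min_length=3):
    """Light-stem a transliterated word: strip one prefix, then one suffix."""
    for prefix in AL_STEM_PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= min_length:
            word = word[len(prefix):]
            break
    for suffix in AL_STEM_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_length:
            word = word[:-len(suffix)]
            break
    return word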
13.5 Evaluating Sebawai and Al-Stem in IR
IR experiments were done on the LDC LDC2001T55 collection, which was used
in the Text REtrieval Conference (TREC) 2002 cross-language track. For brevity,
the collection is referred to as the TREC collection. The collection contains
383,872 articles from the Agence France Press (AFP) Arabic newswire. Fifty top-
ics were developed cooperatively by the LDC and the National Institute of Stan-
dards and Technology (NIST), and relevance judgments were developed at the
LDC by manually judging a pool of documents obtained from combining the top
100 documents from all the runs submitted by the participating teams to TREC’s
cross-language track in 2002. The number of known relevant documents ranges
from 10 to 523, with an average of 118 relevant documents per topic (Oard &
Gey, 2002). This is presently the best available large Arabic information retrieval
test collection. The TREC topic descriptions include a title field that briefly names
the topic, a description field that usually consists of a single sentence description,
and a narrative field that is intended to contain any information that would be
needed by a human judge to accurately assess the relevance of a document (Gey &
Oard, 2001). Two types of queries were formed from the TREC topics:
a. The title and description fields (td). This is intended to model the sort of
statement that a searcher might initially make when asking an intermediary
such as a librarian for help.
b. The title field only (t). The title field in recent TREC collections is typically
designed as a model for Web queries, which typically contain only 2 or 3
words.
Experiments were performed for each query length with the following index
terms:
• w: words.
• ls: lightly stemmed words, obtained using Al-Stem.
• Two ways of obtaining stems:
  o s: top stem found by the Sebawai morphological analyzer (Darwish, 2002).
  o s-ma: top-ranking stem found by Sebawai that was also produced by ALPNET (Beesley, 1996; Beesley et al., 1989), if ALPNET produced an analysis; otherwise the top stem found by Sebawai. Recall that for words that it can analyze, ALPNET produces an unranked set of analyses, but it fails to produce an analysis more often than Sebawai.
• Two ways of obtaining roots:
  o r: top root found by the Sebawai morphological analyzer (Darwish, 2002).
  o r-ma: top-ranking root found by Sebawai that was also produced by ALPNET, if ALPNET produced an analysis; otherwise the top root found by Sebawai.
For the experiments, a vector space model information retrieval system called PSE
that uses the Okapi BM-25 term weights was used (Robertson & Sparck Jones,
1997). To observe the effect of alternate indexing terms mean uninterpolated aver-
age precision was used as the measure of retrieval effectiveness. To determine if
the difference between results was statistically significant, a Wilcoxon signed-rank
test, which is a nonparametric significance test for correlated samples, was used
with p values less than or equal to 0.05 to claim significance.
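For reference, one common form of the Okapi BM25 term weight is shown below; the chapter does not specify which BM25 variant or parameter settings PSE used, so this should be read as illustrative only:

w(t, d) = ((k1 + 1) * tf(t, d)) / (k1 * ((1 - b) + b * dl/avdl) + tf(t, d)) * log((N - df(t) + 0.5) / (df(t) + 0.5))

where tf(t, d) is the frequency of term t in document d, df(t) the number of documents containing t, N the collection size, dl and avdl the document and average document lengths, and k1 and b tuning constants.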
Figure 13.3 summarizes the results of these runs, and Tables 13.10 and 13.11
present the statistical significance test results for the t and td queries respectively.
These results indicate that light stems not only yielded statistically significantly better results than words, but also better results than stems and roots. Also, filtering Sebawai's candidates using ALPNET hurt retrieval consid-
erably. This could be due to ALPNET frequently producing analyses that did not
include Sebawai’s top ranked analysis and thus randomizing the choice of analysis
to the detriment of retrieval effectiveness. Recall that ALPNET produces analysis
in random order. As indicated earlier, some early work with small test collections
(Al-Kharashi & Evens, 1994; Hmeidi et al., 1997) suggested that roots were a bet-
ter choice than stems, but the experiments presented here found just the opposite.
One possible explanation for this is that earlier test collections contained at most a few
hundred documents, and scaling up the size of the collection by several orders of mag-
nitude might reward the choice of less ambiguous terms. An alternative explanation is
that Sebawai’s morphological analysis might not be sufficiently accurate.
Fig. 13.3. Comparing index terms (w, s, s-ma, r, r-ma, ls) by mean average precision on the TREC collection for title only (t) and title+description (td) queries
Table 13.10. Comparing index terms using title-only queries on the TREC collection using the p values of the Wilcoxon signed-rank test (differences with p ≤ 0.05 are statistically significant)

      s       r       ls
w     0.24    0.39    0
s             0.01    0
r                     0
Table 13.11. Comparing index terms using title+description queries on the TREC collection using the p values of the Wilcoxon signed-rank test (differences with p ≤ 0.05 are statistically significant)

      s       r       ls
w     0.92    0.38    0
s             0.02    0
r                     0
13.6 Conclusion
This chapter presented an adaptation of existing Arabic morphological analysis
techniques to make them suitable for the requirements of IR applications by lever-
aging corpus statistics. As a result of the adaptation, Sebawai, a new freely distri-
butable Arabic morphological analyzer, and Al-Stem, an Arabic light stemmer
were constructed. Al-Stem was used extensively by many research groups for
comparative Arabic IR evaluations (Chen & Gey, 2002; Darwish & Oard 2002a;
Fraser et al., 2002; Larkey
et al., 2002; Oard & Gey, 2002).
The morphological analysis techniques described here could benefit from better
word models that incorporate statistics on legal prefix-suffix combinations (currently Sebawai assumes independence between prefixes and suffixes) and sentence-level models that incorporate context to improve ranking. All these tech-
niques can potentially be used in developing morphological analyzers for other
morphologically similar languages such as Hebrew.
References
Abdul-Al-Aal, A. (1987). An-Nahw Ash-Shamil. Cairo, Egypt: Maktabat Annahda Al-Masriya.
Abu-Salem, H., Al-Omari, M. & Evens, M. (1999). Stemming Methodologies Over Indi-
vidual Query Words for Arabic Information Retrieval. Journal of the American Society
for Information Science and Technology, 50(6), 524–529.
Ahmed, M. (2000). A Large-Scale Computational Processor of Arabic Morphology and
Applications. Faculty of Engineering, Cairo University, Cairo, Egypt.
Aljlayl, M., Beitzel, S., Jensen, E., Chowdhury, A., Holmes, D., Lee, M., Grossman, D. &
Frieder, O. (2001). IIT at TREC-10. In Proceedings of the Tenth Text REtrieval Con-
ference (pp.265–274), Gaithersburg, MD. http://trec.nist.gov/pubs/trec10/papers/IIT-
TREC10.pdf
Al-Kharashi, I. & Evens, M. (1994). Comparing Words, Stems, and Roots as Index Terms
in an Arabic Information Retrieval System. Journal of the American Society for Infor-
mation Science and Technology, 45(8), 548–560.
Antworth, E. (1990). PC-KIMMO: a two-level processor for morphological analysis. In
Occasional Publications in Academic Computing. Dallas, TX: Summer Institute of
Linguistics.
Beesley, K. (1996). Arabic Finite-State Morphological Analysis and Generation. In Pro-
ceedings of the International Conference on Computational Linguistics (COLING-96,
vol. 1, pp. 89–94).
Beesley, K., Buckwalter, T. & Newton, S. (1989). Two-Level Finite-State Analysis of
Arabic Morphology. In Proceedings of the Seminar on Bilingual Computing in Arabic
and English, Cambridge, England.
Chen, A. & Gey, F. (2001). Translation Term Weighting and Combining Translation Re-
sources in Cross-Language Retrieval. In Proceedings of the Tenth Text REtrieval Con-
ference (pp. 529–533), Gaithersburg, MD. http://trec.nist.gov/pubs/trec10/papers/
berkeley_trec10.pdf
Chen, A. & Gey, F. (2002). Building an Arabic Stemmer for Information Retrieval. In Pro-
ceedings of the Eleventh Text REtrieval Conference, Gaithersburg, MD.
http://trec.nist.gov/pubs/trec11/papers/ucalberkeley.chen.pdf
Darwish, K. (2002). Building a Shallow Arabic Morphological Analyzer in One Day. In
Proceedings of the ACL Workshop on Computational Approaches to Semitic Lan-
guages (pp. 1–8). Philadelphia, PA.
Darwish, K. & Oard, D. (2002a). CLIR Experiments at Maryland for TREC 2002: Evi-
dence Combination for Arabic-English Retrieval. In Proceedings of the Eleventh Text
REtrieval Conference, Gaithersburg, MD. http://trec.nist.gov/pubs/trec11/papers/
umd.darwish.pdf
Darwish, K. & Oard, D. (2002b). Term Selection for Searching Printed Arabic. In Proceed-
ings of the Special Interest Group on Information Retrieval Conference (SIGIR, pp.
261–268), Tampere, Finland.
Fraser, A., Xu, J. & Weischedel, R. (2002). TREC 2002 Cross-lingual Retrieval at BBN. In
Proceedings of the Eleventh Text REtrieval Conference, Gaithersburg, MD.
http://trec.nist.gov/pubs/trec11/papers/bbn.xu.cross.pdf
Gey, F. & Oard, D. (2001). The TREC-2001 Cross-Language Information Retrieval Track:
Searching Arabic Using English, French or Arabic Queries. In Proceedings of the
Tenth Text REtrieval Conference, Gaithersburg, MD. http://trec.nist.gov/pubs/trec10/
papers/clirtrack.pdf
Goldsmith, J. (2000). Unsupervised Learning of the Morphology of a Natural Language.
Retrieved from http://humanities.uchicago.edu/faculty/goldsmith/
Graff, D. & Walker, K. (2001). Arabic Newswire Part 1. Linguistic Data Consortium,
Philadelphia. LDC catalog number LDC2001T55 and ISBN 1-58563-190-6.
Hmeidi, I., Kanaan, G. & Evens, M. (1997). Design and Implementation of Automatic In-
dexing for Information Retrieval with Arabic Documents. Journal of the American So-
ciety for Information Science and Technology, 48(10), 867–881.
Ibn Manzour (2006). Lisan Al-Arab. Retrieved from http://www.muhaddith.org/
Jurafsky, D. & Martin, J. (2000). Speech and Language Processing. Saddle River, NJ:
Prentice Hall.
Kiraz, G. (1998). Arabic Computational Morphology in the West. In Proceedings of The
6th International Conference and Exhibition on Multi-lingual Computing, Cambridge.
Koskenniemi, K. (1983). Two Level Morphology: A General Computational Model for
Word-form Recognition and Production. Department of General Linguistics, Univer-
sity of Helsinki.
Larkey, L., Allen, J., Connell, M. E., Bolivar, A. & Wade, C. (2002). UMass at TREC-
2002: Cross Language and Novelty Tracks. In Proceedings of the Eleventh Text RE-
trieval Conference, Gaithersburg, MD. http://trec.nist.gov/pubs/trec11/papers/
umass.wade.pdf
Larkey, L., Ballesteros, L. & Connell, M. (2002). Improving Stemming for Arabic Informa-
tion Retrieval: Light Stemming and Co-occurrence Analysis. In Proceedings of the
Special Interest Group on Information Retrieval (SIGIR, pp. 275–282), Tampere,
Finland.
Mayfield, J., McNamee, P., Costello, C., Piatko, C. & Banerjee, A. (2001). JHU/APL at
TREC 2001: Experiments in Filtering and in Arabic, Video, and Web Retrieval. In
Proceedings of the Tenth Text REtrieval Conference, Gaithersburg, MD.
http://trec.nist.gov/pubs/trec10/papers/jhuapl01.pdf
Oard, D. & Gey, F. (2002). The TREC 2002 Arabic/English CLIR Track. In Proceedings of
the Eleventh Text REtrieval Conference, Gaithersburg, MD. http://trec.nist.gov/
pubs/trec11/papers/OVERVIEW.gey.ps.gz
Robertson, S. & Sparck Jones, K. (1997). Simple Proven Approaches to Text Retrieval.
Cambridge University Computer Laboratory.
Xu, J., Fraser, A. & Weischedel, R. (2001). Cross-Lingual Retrieval at BBN. In Proceed-
ings of the Tenth Text REtrieval Conference (pp. 68–75), Gaithersburg, MD.
http://trec.nist.gov/pubs/trec10/papers/BBNTREC2001.pdf
14
Arabic Morphological Representations
for Machine Translation
Nizar Habash
Center for Computational Learning Systems, Columbia University
habash@cs.columbia.edu
Abstract: Arabic has a very rich morphology characterized by a combination of templatic and
affixational morphemes, complex morphological rules, and a rich feature system. This
complexity makes working with Arabic as a source or target language in machine translation (MT) a challenge for two reasons. First, it is not clear what the right representation is for Arabic words given a specific MT approach or system. Secondly, there are many MT-relevant resources for
Arabic morphology, lexicography and syntax (e.g., morphological analyzers, dictionaries
and treebanks) that adopt various representations that are not necessarily compatible with
each other. The result is that for MT researchers, there is a need to experiment with and
to relate multiple representations used by different resources or components to each other
within a single system. In this chapter, we describe different Arabic morphological repre-
sentations used by MT-relevant natural language processing resources and tools and we
discuss their usability in different MT approaches. We also present a common framework
for relating different levels of representation to each other.
14.1 Introduction
Arabic has a very rich morphology characterized by a combination of templatic and
affixational morphemes, complex morphological rules, and a rich feature system.
This complexity makes working with Arabic as a source or target language in
Machine Translation (MT) a challenge for two reasons. First, it is not clear what the
right representation is for Arabic words given a specific MT approach or system. It is
not even clear whether the same representation is optimal for every component in an
MT system, e.g., word alignment versus decoding in statistical MT or parsing versus
structural transfer in symbolic MT. Secondly, there are many MT-relevant resources
for Arabic morphology, lexicography and syntax (e.g., morphological analyzers,
dictionaries and treebanks) that adopt various representations that are not necessarily
compatible with each other. For example, dictionaries use the notion of a lexeme
that is different from the root/pattern/vocalism and stem/affix representations used
by many morphological analyzers. And statistical parsers can be content with a
minimally tokenized inflected undiacritized word as the proper level of represen-
tation for Arabic, which is different from input text and also potentially different
from later processing steps. The result is that for MT researchers, there is a need to
experiment with and to relate multiple representations used by different resources or
components to each other within a single system. This challenge has different impli-
cations for research in statistical MT, symbolic MT or hybrid approaches to MT.
In this chapter, we describe different Arabic morphological representations used by
MT-relevant Natural Language Processing (NLP) resources and tools and we discuss
their usability in different MT approaches. We also present a common framework
for relating different levels of representation to each other. We motivate the lexeme-
and-feature level of representation as a common representation to analyze to. From
that representation, we can regenerate other desirable shallower representations.
This framework allows for easy navigation between representations used by different
resources. It also allows for exploring the effect of using different representations in
MT. The interaction between analysis and generation makes this framework direction-
independent, i.e., useful for working with Arabic as a source or target MT language.
Finally, we describe and evaluate ALMORGEANA, a large-scale system for analysis
and generation from/to the lexeme-and-feature representation. We also discuss how to
use it to relate different morphological representations for Arabic.
Section 14.2 introduces different representations in Arabic morphology.1
Section 14.3 discusses the role of morphological representations in different
approaches to MT. Section 14.4 and Section 14.5 describe ALMORGEANA and
how it can be used for navigating among different morphological representations,
respectively.
14.2 Representations of Arabic Morphology
In discussing representations of Arabic morphology, it is important to separate two
different aspects of morphemes: type versus function. Morpheme type refers to the
different formal kinds of morphemes and their interactions with each other. A distin-
guishing feature of Arabic (in fact, Semitic) morphology is the presence of templatic
morphemes in addition to affixational morphemes. Morpheme function refers to the
distinction between derivational morphology and inflectional morphology. These
two aspects, type and function, are independent, i.e., a morpheme type does not
determine its function and vice versa. This independence complicates the task of
deciding on the proper representation of morphology in different NLP resources
and tools. This section introduces these two aspects and their interactions in more
detail.
1Additional discussions of Arabic morphological phenomena are presented in Chapter 3
and in the four chapters in Part 2 of this book. See Chapter 15 for a discussion of Arabic
generation in the context of MT.
14.2.1 Morpheme Type: Templatic vs. Affixational
Arabic has three categories of morphemes: templatic morphemes, affixational
morphemes, and Non-Templatic Word Stems (NTWS). Templatic morphemes come
in three types that are equally needed to create a templatic word stem: roots, patterns
and vocalisms. The root morpheme is a sequence of three, four or five consonants
(termed radicals) that signifies some abstract meaning shared by all its derivations.
For example, the words katab ‘to write’, kaAtib ‘writer’, and maktuwb ‘written’ all share the root morpheme ktb ‘writing-related’. The
pattern morpheme is an abstract template in which roots and vocalisms are inserted.2
For example, the verbal pattern tV1V22V3 indicates that a non-root consonant (t) is
added and that the second root radical is doubled. The vocalism morpheme specifies
which vowels to use with a pattern. A word stem is constructed by interleaving a
root, a pattern and a vocalism. For example, the word stem katab ‘to write’ is constructed from the root ktb, the pattern 1V2V3 and the vocalism aa.
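To make the interleaving concrete, it can be sketched as a small function. This is an illustrative sketch based on the notation above; the function name and the handling of the digit/V pattern symbols are assumptions, not ALMORGEANA code:

    def build_stem(root, pattern, vocalism):
        """Interleave a root, a pattern and a vocalism into a word stem.

        Digits 1-5 in the pattern select root radicals, 'V' slots are
        filled left to right from the vocalism, and any other letter
        (e.g. the t of tV1V22V3) is copied as is. Assumes the vocalism
        supplies exactly one vowel per V slot.
        """
        vowels = iter(vocalism)
        stem = []
        for symbol in pattern:
            if symbol.isdigit():
                stem.append(root[int(symbol) - 1])
            elif symbol == "V":
                stem.append(next(vowels))
            else:
                stem.append(symbol)
        return "".join(stem)

    # build_stem("ktb", "1V2V3", "aa") returns "katab"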
Arabic affixes can be prefixes such as sa+ ‘will/[future]’, suffixes such as +uwna ‘[masculine plural]’ or circumfixes such as ta++na ‘[subject imperfective 2nd person feminine plural]’. Some of the affixes are clitics, such as the conjunction wa+ ‘and’, the preposition li+ ‘to/for’, and the pronominal object/possessive clitics (e.g., +haA ‘her/it/its’). Others are bound morphemes. Finally, NTWS are word stems that are not derivable from templatic morphemes. They tend to be foreign names (e.g., waAšinTun ‘Washington’).
An Arabic word is constructed by first creating a word stem from templatic
morphemes or using an NTWS, to which affixational morphemes are then added. For
example, the word wasayaktubuwnahaA has two prefixes, one circumfix and one suffix in addition to a root, a pattern and a vocalism:

(1) wa+    sa+    y+     [ktb+V12V3+au]    +uwna      +haA
    and+   will+  3rd+   write             +plural    +it
    ‘And they will write it’
The process of combining morphemes can involve a number of phonological, morphological and orthographic rules that modify the form of the created word; it is not always a simple interleaving and concatenation of its morphemic components. One example is the feminine morpheme +a (ta marbuta), which is turned into +at when followed by a possessive clitic: Âamiyra+hum ‘princess+their’ is realized as Âamiyratuhum ‘their princess’. Another
example is the deletion of the Alif of the definite article Al+ when preceded
2In this chapter, numbers (1, 2, 3, 4, or 5) are used in a pattern to indicate radical position
as opposed to the common practice in the literature of using the symbol C. The symbol V
is used to indicate a vocalism position.
by the preposition l+ ‘for’. For example, li+Al+bayt ‘for+the+house’ is realized as lilbayt ‘for the house’. These rules clearly complicate the process of analyzing and generating Arabic words.
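As a rough illustration, the two adjustments just mentioned can be applied to a ‘+’-delimited morpheme sequence as below. The rule inventory and spellings are simplified assumptions (real systems need many more rules, and case vowels are ignored here):

    import re

    ADJUSTMENTS = [
        # ta marbuta surfaces as t before a possessive clitic (case vowels ignored)
        (re.compile(r"a\+(?=h)"), "at+"),
        # Alif of the article Al+ is deleted after the preposition li+
        (re.compile(r"li\+Al"), "lil"),
    ]

    def realize(morphemes):
        """Concatenate '+'-delimited morphemes after applying the rules."""
        word = "+".join(morphemes)
        for pattern, replacement in ADJUSTMENTS:
            word = pattern.sub(replacement, word)
        return word.replace("+", "")

    # realize(["li", "Al", "bayt"]) returns "lilbayt" 'for the house'
    # realize(["Âamiyra", "hum"]) returns "Âamiyrathum" 'their princess'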
14.2.2 Morpheme Function: Derivational vs. Inflectional
The distinction between derivational and inflectional morphology in Arabic is similar
to that in other languages. Derivational morphology is concerned with creating new
words from other words, a process in which the core meaning of the word is modified.
For example, the Arabic kaAtib ‘writer’ can be seen as derived from the root ktb the same way the English writer can be seen as a derivation from write.
Although compositional aspects of derivations do exist, the derived meaning is often
idiosyncratic. For example, the masculine noun maktab ‘office/bureau/agency’ and the feminine noun maktaba ‘library/bookstore’ are derived from the root ktb ‘writing-related’ with the pattern+vocalism ma12a3, which indicates location. The exact type of the location is thus idiosyncratic, and it is not clear how
the nominal gender difference can account for the semantic difference.
On the other hand, in inflectional morphology, the core meaning of the word
remains intact and the extensions are always predictable. For example, the semantic
relationship between kaAtib ‘writer’ and kutaAb ‘writers’ maintains the
sense of the kind of person described, but only varies the number. The change in
number in this example is accomplished using templatic morphemes (pattern and
vocalism change). This form of plural construction in Arabic is often called “broken
plural” to distinguish it from the strictly affixational “sound plural” (e.g., kaAtib+aAt ‘writers [fem]’).
Broken plurals are one example highlighting the independence of morpheme type
from morpheme function: templatic morphemes can be derivational or inflectional,
with the exception of the roots, which are always derivational. Similarly, the majority
of affixational morphemes are inflectional but there are some affixational derivational morphemes: the adjective kutubiy ‘book-related’ is derived from the noun kutub ‘books’ using the affix +iy.
14.2.3 Arabic Morphological Representations
Given the variability in the relationship between morpheme type and function
in addition to the presence of phonological, morphological, and orthographic
adjustment phenomena, there are many ways to represent Arabic words in terms of
their morphological units. Table 14.1 illustrates some of these possible representations using the example walikatabatihim? ‘and for their writers?’.
There are many variations among these different representations: (a.) whether
they address inflectional/derivational phenomena, templatic/affixational phenomena
Table 14.1. Morphological representations of Arabic words

Representation           Example                              Found where?
Natural Token            wlktbthm?                            naturally occurring text
Simple Token             wlktbthm ?                           common preprocessing for NLP [29]
Segmentation             wl+ ktb +thm ?                       [11, 12]
                         w+ l+ ktbt +hm ?                     [51, 21]
                         w+ l+ ktb +t +hm ?                   [40, 39]
Normalized Segmentation  w+ l+ ktb +hm ?                      Penn Arabic Treebank [41, 29]
                         w+ l+ ktb ++hm ?                     [59, 29]
Templatic Segmentation   w+ l+ ktb+1V2V3a+aa +hm ?            [33]
Morphemes and Features   w+/CONJ l+/PREP kataba+hm/P:3MP ?    [6, 11, 12, 29]
                         ktb&CaCaCa w+ l+ +P:3MP ?
                         ktb +PL w+ l+ +GEN +P:3MP ?
Lexeme and Features      [kAtib w+ l+ PL P:3MP] [?]           ALMORGEANA, [27], dictionaries (lexeme only)
or both, (b.) whether they preserve or resolve ambiguity,3 and (c.) which degree
of abstraction from allomorphs (actual form of morpheme after applying various
adjustment rules) they use. And since any subset (or all) of the morphemes can be
separated from the word and/or be normalized, there is a very large space of possible
specific representations to select from.
The natural token refers to the way Arabic words appear in actual text where they
are undiacritized and segmented only using white space. Punctuation marks, for example, may be attached to the word string in this representation. All naturally occurring
Arabic text is in this representation. Simple tokenization separates punctuation but
maintains the morphological complexity of the Arabic word tokens. There is no
change in ambiguity compared to the natural token.
Segmentation is the simplest way to dissect an Arabic word. It is strictly defined
here to exclude any form of orthographic, morphological or phonological normal-
ization. Segmentation splits up the letters into segments that correspond to clusters
of a stem plus one or more affixational morphemes. There are many ways to segment
an Arabic word as Table 14.1 shows. Segmentation can select a subset of analyses of
a word. For example, segmenting lljn into l+l+jn (li+l+jana ‘to Paradise’
3This discussion does not address the issue of morphological disambiguation, which is
outside the scope of this chapter [26, 54, 30].
or li+l+jina ‘to insanity/mania’) is selecting a subset of analyses excluding l+ljn (li+lajna ‘to a committee’ or li+lajna ‘to the committee’).
Normalized segmentation abstracts away from some of the adjustment phenomena
discussed earlier. In the example in Table 14.1, the form of the segmented word
stem is ktb not ktbt. Normalization disambiguates the unnormalized segmented
form ktbt (‘he/she/you[sg.] wrote’ or ‘writers’). The Penn Arabic Treebank [41] uses
a normalized segmentation that breaks up a word into four regions: conjunction,
particle, normalized word stem and pronominal clitic.
Templatic segmentation is a deeper level of segmentation that involves normal-
ization by definition. Here, the root, pattern and vocalism are separated. Up to
this level of representation, the tokens are driven by a templatic/affixational view
of morphology rather than a derivational/inflectional view. The introduction of
features at the next level of representation, morphemes and features, abstracts away
from different morphemes that at an underlying level signify the same feature.
For example, the affixational morphemes y++uwna, y++uwA and +uwA all realize the third person masculine plural subject for different verb
aspect/mood combinations. There are many different degrees to the transition from
morphemes to features. A combination of both is often used.
The final representation is lexeme and features. The lexeme can be defined as an
abstraction over a set of word forms differing only in inflectional morphology. The
lexeme itself captures a specific meaning that does not change with inflectional varia-
tions. The traditional citation form of a lexeme used in dictionaries is the perfective
third person masculine singular for verbs and the singular masculine form for nouns
and adjectives. If there is no masculine form, the feminine singular is used. As such,
the lexeme [kaAtib] ‘writer’ normalizes over all the different inflectional forms of kaAtib such as kaAtibaAn ‘two writers’, kataba ‘writers’, and kaAtiba ‘female writer’. Lexemes as opposed to stems provide a desirable
level of abstraction that is to a certain degree language independent for applications
such as MT. Lexemes are also less abstract than roots and patterns which tend to be
too vague semantically and derivationally unpredictable, making them less useful in
practice for MT.
The next section discusses how these different levels of representation interact
with different MT approaches.
14.3 Morphological Representations for Machine Translation
In statistical approaches to MT, a translation model is trained on word-aligned [46]
parallel text of source and target languages [9, 10, 35, 37, 36]. The translation
model is then used to generate multiple target language hypotheses from the source
language input. The target hypotheses are typically ranked using a log-linear combi-
nation of a variety of features [45]. Statistical MT has been quite successful in
producing good-quality4 MT on the genre it is trained on in much less time than symbolic approaches. For statistical MT, in principle, it does not matter what level
of morphological representation is used as long as the input is on the same level as
the data used in training. Practically, however, there are certain concerns with issues
such as sparsity, ambiguity, language-pair differences in morphological complexity,
and training-data size. Shallower representations such as simple tokenization tend
to maintain distinctions among morphological forms that might not be relevant for
translation, thus increasing the sparsity of the data. This point interacts with the
MT language pair: for example, normalizing subject inflections of Arabic verbs
when translating to a morphologically poor language like English might be desirable since it reduces sparsity without necessarily affecting translation quality. If the target
language is morphologically rich, such as French, that would not be the case. This,
of course, may not be a problem when large amounts of training data are available.
Additionally, transforming the training text to deeper representations comes at a
cost since selecting a deeper representation involves some degree of morphological
disambiguation, a task that is typically neither cheap nor foolproof [26].
The anecdotal intuition in the field of statistical MT is that reduction of morpho-
logical sparsity often improves translation quality. This reduction can be achieved
by increasing training data or via morphologically-driven preprocessing [22]. Recent
investigations of the effect of morphology on MT quality focused on morphologi-
cally rich languages such as Catalan [49], Czech [22], German [43], Serbian [49] and
Spanish [34, 49]. These studies examined the effects of various kinds of tokenization,
lemmatization and part-of-speech (POS) tagging and showed a positive effect on MT
quality.
Specifically for Arabic, Lee [39] investigated the use of automatic alignment of POS-tagged English and affix-stem segmented Arabic to determine appropriate tokenizations of Arabic. Her results show that morphological preprocessing helps, but only for the smaller corpus sizes she investigated; as size increases, the benefits diminish. Habash and Sadat [29, 52] reached similar conclusions on a much larger set of experiments including multiple preprocessing schemes reflecting different levels of morphological representation and multiple techniques for disambiguation/tokenization.
Two of their techniques used the ALMORGEANA system described later in this
chapter. They showed that specific preprocessing decisions can have a positive
effect when decoding text with a different genre than that of the training data (in
essence another form of data sparsity). They also demonstrated gains in MT quality
through combination of different preprocessing schemes. Additional similar results
were reported using specific preprocessing schemes and techniques [59, 51, 21, 44].
Research in the use of different morphological representations of Arabic in
Example-based MT, a corpus-based approach related to statistical MT [55, 14], is
promising, at least in terms of improved coverage of training examples [48].
4The question of how to judge the quality of MT, i.e., MT Evaluation, is outside the scope of
this chapter. Currently, the most accepted yet still controversial approaches are automatic,
e.g., BLEU [47, 13] and METEOR [5].
Finally, the newest addition to research on morphology within phrase-based statistical MT is Moses, a decoder for factored [8] phrase-based translation models. Moses allows using a mix of different levels of morphological representation.5 At the time of writing this chapter, no work on Arabic factored translation models has been done.
In symbolic approaches to MT, such as transfer-based or interlingual MT, linguis-
tically motivated rules (morphological, syntactic and/or semantic) are manually
or semi-automatically constructed to create a system that translates the source
language into the target language [20]. Symbolic MT approaches tend to capture
more abstract generalities about the languages they translate between compared
to statistical MT. This comes at a cost of being more complex than statistical
MT, involving more human effort, and depending on already existing resources for
morphological analysis and parsing. This dependence on already existing resources
highlights the problem of variation in morphological representations for Arabic. In
a typical situation, the input/output text of an MT system is in natural or simplified
tokenization. But, a statistical parser (such as [16] or [7]) trained out-of-the-box on
the Penn Arabic Treebank assumes the same kind of tokenization (4-way normalized
segments) used by the treebank. This means that a separate tokenizer is needed
to convert input text to this representation [19, 26]. Moreover, the output of such
a parser, being in normalized segmentation, will not contain morphological infor-
mation such as features or lexemes that are important for translation: Arabic-English
dictionaries use lexemes and proper translation of features, such as number and
tense, requires access to these features in both source and target languages. As a
result, additional conversion is needed to relate the normalized segmentation to the
lexeme and feature levels. Of course, in principle, the treebank and parser could be
modified to be at the desired level of representation (i.e., lexeme and features). But
this can be a rather involved task for researchers interested in MT. We are aware of
the following published research on Arabic symbolic MT: [4, 53] (within the transfer
approach) and [58, 56, 1] (within the interlingua approach). Given the inhibiting
costs of building large-scale symbolic MT systems, they tend to be developed by
commercial institutions, which are less inclined to publicize their trade secrets.6
Finally, the current hybridization direction in the field of MT is interested in
exploring statistical and symbolic combinations of resources and tools within and
beyond the level of morphology. Some hybrids rooted in statistical MT include
syntactic information as part of the preprocessing phase [17], the decoding phase [50]
or the n-best rescoring phase [45]. Such approaches will share challenges relevant to
both statistical and symbolic MT when extended to Arabic. A detailed discussion of
such challenges is presented in the context of extending a Generation Heavy MT
system, a hybrid approach rooted in symbolic MT [23], to Arabic [25].
5Moses was developed during the 2006 summer workshop at Johns Hopkins University as
an enhancement to Pharaoh [36]. See http://www.clsp.jhu.edu/ws2006/groups/ossmt/ and
http://www.statmt.org/moses/ for more details.
6Two of the top Arabic MT companies using rule-based MT systems are Apptek
(http://www.apptek.com/) and Sakhr (http://www.sakhr.com/).
In the next section, we describe ALMORGEANA (Arabic Lexeme-based MORphological GEnerator/ANAlyzer). ALMORGEANA is a morphological analysis and
generation system built on top of the Buckwalter analyzer databases, which are
at a different level of representation (3-way segmentation). Being an analysis and
generation system, it can be used with MT systems analyzing or generating Arabic.
ALMORGEANA relates the deepest level of representation (lexeme and features)
to the shallowest (simple tokenization).7This wide range together with bidirec-
tionality (analysis/generation) allows using ALMORGEANA to navigate between
different levels of representations as will be discussed in Section 14.5. Morphological
disambiguation, or the selection of an analysis from a list of possible analyses, is a
different task that is out of the scope of this chapter although it is quite relevant to
MT [26, 54, 30].
14.4 ALMORGEANA
ALMORGEANA is a large-scale lexeme-based Arabic morphological analysis and
generation system.8ALMORGEANA uses the databases of the Buckwalter Arabic
morphological analysis system with a different engine focused on generation
from and analysis to the lexeme-and-feature level of representation. The building
of ALMORGEANA involved not just reversing the Buckwalter analyzer engine, which focuses only on analysis, but also extending the engine and its databases to support the lexeme-and-feature level of representation for both analysis and generation.
The next section reviews other efforts on morphological analysis and generation
in Arabic. Section 14.4.2 introduces the Buckwalter analyzer’s database and engine.
Section 14.4.3 describes the different components of ALMORGEANA. An evaluation
of ALMORGEANA is discussed in Section 14.4.4.
14.4.1 Morphological Analysis and Generation
Arabic morphological analysis has been the focus of researchers in natural language
processing for a long time. This is due to features of Arabic Semitic morphology
such as optional diacritization and templatic morphology. Numerous forms of
morphological analyzers have been built for a wide range of application areas from
Information Retrieval (IR) to MT in a variety of linguistic theoretical contexts
[3, 2, 6, 11, 12, 18, 33, 27].
Arabic morphological generation, by comparison, has received little attention
although the types of problems in generation can be as complex as in analysis.
7Going to natural tokenization is a trivial step where, for example, punctuation marks are
attached to preceding words.
8A previous publication about ALMORGEANA focused on the generation component of the
system which was named Aragen [24].
Finite-State Transducer (FST) approaches to morphology [38] and their exten-
sions for Arabic such as the Xerox Arabic analyzer [6] are attractive for being
generative models. However, a major hurdle to their usability is that lexical and
surface levels are very close [32]. Thus, generation from the lexical level is not
useful to many applications such as symbolic MT where the input to a generation
component is typically a lexeme with a feature list. A solution to this problem
was proposed by [32], which involved composition of multiple FSTs that convert
input from a deep level of representation to the lexical level. However, there
are still many restrictions on the order of elements presented as input and their
compatibility.9 The MAGEAD (Morphological Analysis and Generation for Arabic
and its Dialects) system attempts to design an end-to-end lexeme-and-features to
surface FST-based system for Arabic [28]. As of the writing of this
chapter, MAGEAD’s coverage is limited to verbs in Modern Standard Arabic
and Levantine Arabic. The only work on Arabic morphological generation that
focuses on generation issues within a lexeme-based approach is done by [15, 57].
Their work uses transformational rules to address the issue of stem change in
various prefix/suffix contexts. Their system is a prototype that lacks large-scale coverage.
There are certain desiderata that are expected from a morphological
analysis/generation system for any language. These include (1) coverage of the
language of interest in terms of both lexical coverage (large scale) and coverage of
morphological and orthographic phenomena (robustness); (2) the surface forms are
mapped to/from a deep level of representation that abstracts over language-specific
morphological and orthographic features; (3) full reversibility of the system so it
can be used as an analyzer or a generator; (4) usability in a wide range of natural
language processing applications such as MT or IR; and finally, (5) availability for
the research community. These issues are essential in the design of ALMORGEANA
for Arabic morphological analysis and generation. ALMORGEANA10 is a lexeme-
based system built on top of a publicly available large-scale database, Buckwalter’s
lexicon for morphological analysis.
14.4.2 Buckwalter Morphological Analyzer
The Buckwalter morphological analyzer uses a concatenative lexicon-driven
approach where morphotactics and orthographic rules are built directly into the
lexicon itself instead of being specified in terms of general rules that interact to
realize the output [11, 12]. The system has three components: the lexicon, the
compatibility tables and the analysis engine. An Arabic word is viewed as a concate-
nation of three regions, a prefix region, a stem region and a suffix region. The prefix
9Other work on using FSTs designed for analysis in generation is discussed in [42].
10 The ALMORGEANA engine can be freely downloaded under an OpenSource license for
research purposes from http://www.ccls.columbia.edu/cadim/resources.html. The lexical
databases need to be acquired independently from the Linguistic Data Consortium (LDC)
as part of the Buckwalter Arabic Morphological Analyzer [11, 12].
wa       Pref-Wa      and
bi       NPref-Bi     by/with
wabi     NPref-Bi     and + by/with
Al       NPref-Al     the
biAl     NPref-BiAl   with/by + the
wabiAl   NPref-BiAl   and + with/by + the
ap       NSuff-ap     [fem.sg.]
atAni    NSuff-atAn   two
atayoni  NSuff-tayn   two
atAhu    NSuff-atAh   his/its two
At       NSuff-At     [fem.pl.]

;; katab-u_1
katab    PV           write
kotub    IV           write
kutib    PV_Pass      be written
kotab    IV_Pass_yu   be written
;; kitAb_1
kitAb    Ndu          book
kutub    N            books
;; kitAbap_1
kitAb    Nap          writing

Fig. 14.1. Some Buckwalter lexical entries
and suffix regions can be null. Prefix and suffix lexicon entries cover all possible
concatenations of Arabic prefixes and suffixes, respectively. For every lexicon entry,
a morphological compatibility category, an English gloss and occasional Part-Of-
Speech (POS) data are specified. Stem lexicon entries are clustered around their
specific lexeme, which is not used in the analysis process. Figure 14.111 shows sample entries: the first six in the left column are prefixes; the rest in that column are suffixes; the right column contains seven stems belonging to three lexemes. The stem entries also include English glosses, which allows the lexicon to function as a dictionary. However, the presence of inflected forms, such as passives and plurals, among these glosses makes them less usable as lexemic translations.
Compatibility tables specify which morphological categories are allowed to co-
occur. For example, the morphological category for the prefix conjunction wa+ ‘and’, Pref-Wa, is compatible with all noun stem categories and perfect verb stem categories. However, Pref-Wa is not compatible with imperfective verb stems because they must contain a subject prefix. Similarly, the stem kitAb kitaAb of the lexeme kitAb_1 kitaAb ‘book’ has the category Ndu, which is not compatible with the category of the feminine marker ap a (NSuff-ap). The same stem, kitAb kitaAb, appears as one of the stems of the lexeme kitAbap_1 kitaAba ‘writing’ with a category that requires a suffix with the feminine marker. Cases such as these are quite common and pose a challenge to the use of stems as tokens since they add unnecessary ambiguity.
The analysis algorithm is rather simple since all of the hard decisions are coded in
the lexicon and the compatibility tables: Arabic words are segmented into all possible
sets of prefix, stem and suffix strings. In a valid segmentation, the three strings exist
in the lexicon and are three-way compatible (prefix-stem, stem-suffix and prefix-
suffix).
11 The Buckwalter transliteration is preserved in examples of Buckwalter lexicon entries (see
Chapter 2).
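A compact sketch of the segment-and-check loop just described is given below; the data structures and names here are illustrative assumptions, not the analyzer's actual internals:

    def analyze(word, prefixes, stems, suffixes, AB, BC, AC):
        """Enumerate prefix/stem/suffix splits of a word and keep the
        three-way compatible ones. prefixes, stems and suffixes map a
        string (the empty string stands for a null prefix or suffix) to
        a set of morphological categories; AB, BC and AC are the
        prefix-stem, stem-suffix and prefix-suffix compatibility
        tables, modeled as sets of category pairs."""
        analyses = []
        for i in range(len(word) + 1):          # end of the prefix region
            for j in range(i, len(word) + 1):   # end of the stem region
                pre, stem, suf = word[:i], word[i:j], word[j:]
                for a in prefixes.get(pre, ()):
                    for b in stems.get(stem, ()):
                        for c in suffixes.get(suf, ()):
                            # a valid segmentation is three-way compatible
                            if (a, b) in AB and (b, c) in BC and (a, c) in AC:
                                analyses.append((pre, stem, suf, a, b, c))
        return analyses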
14.4.3 ALMORGEANA Components
14.4.3.1 Input/Output
In generation mode, the input to ALMORGEANA is a feature-set, a lexeme together with features from a closed class of inflectional phenomena. The output of generation is one or more word strings in simple tokenization. In analysis mode, the
input is the string and the output a set of possible feature-sets. The features in a
feature-set include number, gender and case inflections, which do appear in other
languages, but also prefix conjunctionsand prepositions that are written as part of the
word in Arabic orthography. Table 14.2 lists the different features and their possible
values.
The first column includes the names of the features. The second and third columns
list the possible values they can have and their definitions, respectively. The last
column lists the default value assigned during generation in case a feature is unspec-
ified based on its type. There are two types of features: obligatory and optional.
Obligatory features, such as verb subject or noun number, require a value to be
specified. Therefore, in case of under-specification, all possible values are generated.
Optional features, such as conjunction, preposition or pronominal object/possessive
clitics, on the other hand can be absent. The pronominal features, subject, object
and possessive, are defined in terms of sub-features specifying person, gender and
number. In case any of these sub-features is under-specified, they are expanded to
all their possible values. For example, the subject feature S:2, as in the case of
the English pronoun ‘you’ (which is under-specified for gender and number), is
expanded to (S:2MS S:2FS S:2D S:2MP S:2FP). If no POS is specified, it
is automatically determined by the lexeme and/or features. For example, the presence
of a definite article implies the lexeme is a noun or an adjective; whereas a verbal
particle or a subject/object implies the lexeme is a verb.12
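The expansion of an underspecified pronominal feature can be sketched as follows; the dual's collapsing of gender follows the example above, while everything else in this toy version is an assumption:

    def expand_subject(feature):
        """Expand an underspecified subject feature such as 'S:2' into
        fully specified person/gender/number values; the dual collapses
        gender, following the example in the text."""
        spec = feature.split(":")[1]
        persons = [p for p in "123" if p in spec] or list("123")
        genders = [g for g in "MF" if g in spec] or list("MF")
        numbers = [n for n in "SDP" if n in spec] or list("SDP")
        values = []
        for p in persons:
            for n in numbers:
                if n == "D":
                    values.append("S:%sD" % p)
                else:
                    values.extend("S:%s%s%s" % (p, g, n) for g in genders)
        return values

    # expand_subject("S:2") returns ['S:2MS', 'S:2FS', 'S:2D', 'S:2MP', 'S:2FP']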
The following is an example of an Arabic word and its lexeme-and-feature repre-
sentation in ALMORGEANA.
(2) [kitAb_1 POS:N PL Al+ l+]
    lilkutubi
    ‘for the books’
The feature-set in this example consists of the nominal lexeme kitAb_1 ‘book’
with the feature PL ‘plural’, the definite article Al+ ‘the’ and the prefix preposition
l+ ‘to/for’.
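Such a feature-set is naturally held as a lexeme plus an unordered set of feature strings. A minimal sketch follows; the record layout is an illustration, not ALMORGEANA's actual interface:

    from dataclasses import dataclass

    @dataclass
    class FeatureSet:
        """A lexeme-and-feature analysis such as example (2)."""
        lexeme: str                        # e.g. "kitAb_1"
        features: frozenset = frozenset()  # e.g. {"POS:N", "PL", "Al+", "l+"}

    fs = FeatureSet("kitAb_1", frozenset({"POS:N", "PL", "Al+", "l+"}))
    # generation maps fs to the surface word lilkutubi 'for the books'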
14.4.3.2 Preprocessing Buckwalter Lexicons
ALMORGEANA uses the Buckwalter lexicon described in Section 14.4.2 as is. The lexicon is processed in ALMORGEANA to index entries based on inferred sets of feature values (or feature-keys) that are used to map features in the input feature-sets to proper lexicon entries.
12 Other POS not included in Table 14.2 are D Determiner, C Conjunction, NEG Negative particle, NUM Number, AB Abbreviation, IJ Interjection, and PX Punctuation.
Table 14.2. ALMORGEANA features

Feature           Value          Definition             Default
Part-of-Speech    POS:N          Noun                   automatically
                  POS:PN         Proper Noun            determined
                  POS:V          Verb
                  POS:AJ         Adjective
                  POS:AV         Adverb
                  POS:PRO        Pronoun
                  POS:P          Preposition
                  (and others)
Conjunction       w+             ‘and’                  none
                  f+             ‘and, so’
Preposition       b+             ‘by, with’             none
                  k+             ‘like’
                  l+             ‘for, to’
Verbal Particle   s+             ‘will’                 none
                  l+             ‘so as to’
Definite Article  Al+            ‘the’                  none
Verb Aspect       PV             Perfective             all
                  IV             Imperfective
                  CV             Imperative
Voice             PASS           Passive                all
Gender            FEM            Feminine               all
                  MASC           Masculine
Subject           S:PerGenNum    Person = {1,2,3}       all
Object            O:PerGenNum    Gender = {M,F}         none
Possessive        P:PerGenNum    Number = {S,D,P}       none
Mood              MOOD:I         Indicative             all
                  MOOD:S         Subjunctive
                  MOOD:J         Jussive
Number            SG             Singular               all
                  DU             Dual
                  PL             Plural
Case              NOM            Nominative             all
                  ACC            Accusative
                  GEN            Genitive
Definiteness      INDEF          Indefinite             all
Possession        POSS           Possessed              all
This task is trivial for cases where the lexicon entry
provides all necessary information. For example, verb voice and aspect are always
part of the stem: the feature-key for kutib, the stem of the passive perfective form of
the verb katab is katab+PV+PASS.
Many lexicon entries, however, lack feature specifications. One example is broken
plurals, which appear under their lexeme cluster, but are not marked in any way for
plurality (see the entry for kutub in Figure 14.1). Detecting when a stem is
plural is necessary to include the feature plural in the feature-key for that stem. Using
the English gloss to detect the presence of a broken plural is a possible solution.
However, it fails for adjectival entries since English adjectives do not inflect for
plurality, e.g., kabiyr (SG) and kibAr (PL) are both glossed as ‘big’.
Additionally, some sound plural stems in the lexicon are glossed as plurals. The
Buckwalter categories are not helpful on their own for this task. For example, the
presence of a stem with morphological category N is ambiguous as to being a broken
plural or a singular nominalization of a form I verb [11]. The solution for this
problem stems from the observation that a singular verbal nominalization is its own
lexeme, whereas a broken plural is always listed under a lexeme that is in a singular
base form. A broken plural is by definition a major change in the form of the lexeme.
Therefore, if a stem under a lexeme has the morphological category N, Ndip, or Nap (all of which can mark a broken plural) AND it is not a substring of the
lexeme, it is considered a broken plural. This technique works for entries considered
part of the same lexeme in the Buckwalter lexicon. Entries that treat a broken plural
as a separate lexeme will not be processed correctly, e.g., the lexeme Aixwa ‘brothers’.
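The heuristic itself fits in a few lines. This is a direct sketch of the rule above; transliteration details and the handling of lexeme citation forms are simplified:

    def is_broken_plural(stem, lexeme, category):
        """A stem under a lexeme is flagged as a broken plural when its
        category can mark one (N, Ndip or Nap) and the stem is not a
        substring of the lexeme's citation form."""
        return category in {"N", "Ndip", "Nap"} and stem not in lexeme

    # is_broken_plural("kutub", "kitAb_1", "N") returns True ('books')
    # is_broken_plural("kitAb", "kitAb_1", "Ndu") returns False (singular stem)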
14.4.3.3 Analysis and Generation
Analysis in ALMORGEANA is similar to Buckwalter’s analyzer (Section 14.4.2). The
difference lies in an extra step that uses feature-keys associated with stem, prefix
and suffix to construct a feature-set for the lexeme-and-feature output. In the case
of failed analysis, a back-off step is explored where prefix and suffix substrings are
sought. If a compatible pair is found, the stem is used as a degenerate lexeme and
the features are constructed from the feature-keys associated with the prefix and
suffix.
The process of generating from feature-sets is also similar to Buckwalter analysis
except that feature-keys are used instead of string sequences. First, the feature-
set is expanded to include all forms of underspecified obligatory features, such
as case, gender, number, etc. Next, all feature-keys in the ALMORGEANA lexicon
that fully match any subset of the expanded feature-set are selected. All combina-
tions of feature-keys that completely cover the features in the expanded feature-set
are matched up in prefix-stem-suffix triples. Then, each feature-key is converted
to its corresponding prefix, stem or suffix. The same compatibility tables used in
Buckwalter analysis are used to accept or reject prefix-stem-suffix triples. Finally,
all unique accepted triples are concatenated and output. In the case that no surface
form is found, a back-off solution that attempts to regenerate after discarding one of
the input features is explored. If the back-off fails, typically due to a missing lexical
entry, a baseline Arabic morphological generator is used.
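The covering-and-compatibility search at the heart of generation can be sketched as follows; the mapping from feature-key sets to surface strings and the one-key-per-region simplification are assumptions:

    from itertools import product

    def generate(feature_set, prefix_keys, stem_keys, suffix_keys, compatible):
        """Pick one prefix, stem and suffix feature-key whose features
        together exactly cover the expanded feature-set, then keep the
        compatible triples. Each *_keys argument maps a frozenset of
        features to a surface string; compatible() stands in for the
        Buckwalter compatibility tables."""
        target = frozenset(feature_set)
        words = set()
        for (pf, pre), (sf, stm), (xf, suf) in product(
                prefix_keys.items(), stem_keys.items(), suffix_keys.items()):
            if pf | sf | xf == target and compatible(pre, stm, suf):
                words.add(pre + stm + suf)
        return words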
The baseline generator uses a simple concatenative word structure rule and a
small lexicon. The lexicon contains 70 entries that map all features to most common
surface realizations. For example, FEM maps to (a, at, and φ) and PL maps to (At, iyna, iy, uwna and uw). Subtleties of feature
interaction are generally ignored except for the case of subject and verb aspect since
the circumfix realization of subjects in the imperfective/imperative form is rather
complex to model concatenatively. The only word structure rule used in the baseline
generator is the following:
<WORD> ::= (w|f) (s|l|b|k) Al <SubjectAspect> <Lexeme> <AspectSubject> <Gender> <Number> <Object> <Possessive>
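Read as code, the rule amounts to filling a fixed sequence of slots and concatenating. In the sketch below the slot names paraphrase the rule's symbols, and the mapping from features to strings (e.g. FEM to a) is assumed to have been applied already:

    SLOT_ORDER = ("conjunction", "particle", "article", "subject_aspect",
                  "lexeme", "aspect_subject", "gender", "number",
                  "object", "possessive")

    def baseline_generate(slots):
        """Concatenate whatever surface string the caller supplies for
        each slot of the word structure rule, in order; missing slots
        are simply skipped."""
        return "".join(slots.get(name, "") for name in SLOT_ORDER)

    # baseline_generate({"conjunction": "w", "particle": "s",
    #                    "lexeme": "yaktub", "object": "haA"})
    # returns "wsyaktubhaA"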
14.4.4 Evaluation
ALMORGEANA uses the databases of the Buckwalter analyzer; therefore, its
coverage is equivalent to the coverage of these lexicons. In this section, we evaluate
ALMORGEANA engine for analysis and generation only.13
A sample text of over one million Arabic words from the UN Arabic-English
corpus [31] was used in this evaluation. For each unique word in the text,
ALMORGEANA is used in analysis mode to produce feature-sets. The resulting
feature-sets are then input to two systems: the complete ALMORGEANA as described
earlier and the baseline generator used as back-off to ALMORGEANA generation. For
each feature-set, there are two sets of words: (a) words that analyze into the feature-
set (A words) and (b) words that are generated from the feature-set (G words) (see
Figure 14.2). The bigger the intersection between the two sets (C words), the better
the performance of a system. Generated words that are not part of the intersection (C
words) are Overgenerated words (O words). Words that analyze into the feature-set
but are not generated are Undergenerated words (U words). In principle, U words are
definite signs of problems in the generation system, whereas O words can be correct
but unseen in the analyzed text.
A system’s Undergeneration Error (UnderErr) is defined as the ratio of U words
to A words. Overgeneration Error (OverErr) is defined as the ratio of O words to G
words. These two measures are equivalent to (1 - Recall) and (1 - Precision) respec-
tively, if the set of A words paired with a feature-set is considered a gold standard to
[Figure: two overlapping sets for a feature-set: A, the words that analyze into it, and G, the words generated from it, with intersection C, undergenerated region U and overgenerated region O.]
Fig. 14.2. ALMORGEANA evaluation
13 The evaluation described here was run over the Buckwalter lexicons (version 1) [11].
be replicated in reverse by a generation system. The Combined Undergeneration and
Overgeneration Error (CombErr) is calculated as (1 - the corresponding F-score):14
UnderErr = U/A = (A - C)/A,    OverErr = O/G = (G - C)/G,

CombErr = 1 - (2 × (1 - UnderErr) × (1 - OverErr)) / ((1 - UnderErr) + (1 - OverErr))
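In code, the three measures reduce to set arithmetic over the A and G word sets. A sketch, assuming both sets are non-empty and share at least one word (otherwise the F-score denominator is zero):

    def evaluation_errors(A, G):
        """Compute UnderErr, OverErr and CombErr for one feature-set
        from its analyzed (A) and generated (G) word sets."""
        C = A & G
        under = (len(A) - len(C)) / len(A)    # U / A
        over = (len(G) - len(C)) / len(G)     # O / G
        recall, precision = 1 - under, 1 - over
        combined = 1 - (2 * precision * recall) / (precision + recall)
        return under, over, combined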
The evaluation text contained 63,066 undiacritized unique words, which were
analyzed into 118,835 unique feature-sets corresponding to 14,883 unique lexemes.
The number of unique diacritized words corresponding to the text words is 104,117.
The evaluation was run in two modes controlling for the type of matching between A
words and G words: diacritized (or diacritization-sensitive) and undiacritized. Evalu-
ation results comparing ALMORGEANA to the baseline are presented in Table 14.3.
The baseline system is almost six times faster than ALMORGEANA15, but it had high
undergeneration and overgeneration error rates. Both error rates were reduced in the
undiacritized mode, where some erroneous output became ambiguous with correct
output. ALMORGEANA, by comparison, reduced the combined error rate from the
baseline by over 84%.
Many of the overgeneration errors are false alarms. They include cases of
overgeneration of broken plurals, some of which are archaic or genre-specific
but correct. For example, the word for ‘sheik’, šayx, has three uncommon broken plurals in addition to the common šuyuwx: ÂšyaAx, mašaAyix, and mašaAˆyix. Another very common overgeneration error resulted
from the underspecification of some mood-specific vocalic verbal suffixes in the
Buckwalter lexicon. Arabic hollow verbs, for example, undergo a stem change in
the jussive mood (from yaquwl to yaqul), which is indistinguishable in the analysis.
Table 14.3. Evaluation results
System UnderErr OverErr CombErr Time (secs)
ALMORGEANA diacritized 0.39% 12.22% 6.68% 1,769
ALMORGEANA undiacritized 0.38% 12.42% 6.79% 1,745
Baseline diacritized 43.90% 60.99% 53.98% 281
Baseline undiacritized 32.84% 47.93% 41.34% 293
14 I would like to thank Christian Monson for suggesting this formula for computing CombErr.
A previously published formula was biased toward underestimating the combined error
[24].
15 The experiments were run on a Dell Inspiron machine with a 2.66 GHz Pentium 4 CPU and 512 MB RAM.
Undergeneration errors stem exclusively from lexicon errors. These are few and can be expected in a manually created database. One example is caused by
a missing lexeme comment in the Buckwalter lexicon which resulted in pairing all
the forms of the verb 9
$
/raÂaý ‘to see’ to the lexeme that appears just before it,
:

/raAwand ‘rhubarb’. Such cases suggest a valuable use of ALMORGEANA as a
debugging tool for the Buckwalter lexicon.16
14.5 Interoperability of Morphological Representations
This section describes how ALMORGEANA can be used to navigate between different
levels of morphological representation. An Arabic word in simple tokenization can
be analyzed using ALMORGEANA to multiple possible lexeme-and-feature analyses.
This automatically gives us access to the lexeme-and-feature level and also the three-
way segmentation used by Buckwalter’s lexicons. To generate an intermediate repre-
sentation such as the normalized segmentation used by the Penn Arabic Treebank
[41], the features for conjunction, preposition and pronominal object/possessive
can be stripped from the lexeme-and-feature analyses. The remaining features
and lexeme are then used to generate the word stem using ALMORGEANA to
guarantee a normalized form. The stripped features are also trivially generated
and positioned relative to the word stem: [conjunction] [preposition] [word-stem]
[pronoun]. Table 14.4 shows the different analyses for each word in the sentence wqdkAtbthftHylmdsntyn. ‘and Fathia continued to correspond with him for two years’. The correct Penn Arabic Treebank tokenization for this example is w+ qd kAtbt +h ftHy l+ md sntyn .
The ambiguity inherent in both the analysis and generation processes results in
multiple possibilities (column 3 in Table 14.4). To select a specific segmentation, any
of a set of possible techniques can be used such as rule-based heuristics or language
models trained on text in the correct tokenization. For example, in the case of the
Penn Arabic Treebank, the already tokenized text of the treebank can be used to build
a language model for ranking/selecting among options produced by this technique
(similar to [40]). Alternatively, machine learning over the features of the annotated
words in the Penn Arabic Treebank can be used to select among the different analyses
(similar to [26, 54, 30]).17
We developed a general tokenizer, TOKAN, as an implementation of this analyze-
then-regenerate approach to tokenization. TOKAN is built on top of ALMORGEANA.
TOKAN takes as input (a.) disambiguated ALMORGEANA analyses and (b.) a token definition sequence that specifies which features are to be extracted from the word and where they should be placed.
16 All of the errors described here are for version 1 of the Buckwalter analyzer only [11]. We
did not conduct a similar study on version 2 of the Buckwalter analyzer [12].
17 The Morphological Analysis and Disambiguation for Arabic (MADA) tool [26] is a disam-
biguation system fully integrated with ALMORGEANA. More information on MADA is
available at http://www.ccls.columbia.edu/cadim/resources.html.
Table 14.4. Normalized segmentation example

Word     Analysis                                                     Segments
wqd      [qad_1 POS:N w+ +SG +MASC gloss:size/physique]               w+ qd
         [qad_2 POS:F w+ gloss:may/might]
         [qad_1 POS:F w+ gloss:has/have]
         [qid_1 POS:N w+ +SG +MASC gloss:thong/strap]
         [waqad_1 POS:V +PV +S:3MS gloss:kindle/ignite]               wqd
         [waqod_1 POS:N +SG +MASC gloss:fuel/burning]
         [waqadi_1 POS:V +PV +S:3MS gloss:ignite/burn]
kAtbth   [kAtib_1 POS:N +FEM +SG +P:3MS gloss:author/writer/clerk]    kAtb +h
         [kAtib_2 POS:AJ +FEM +SG +P:3MS gloss:writing]
         [kAtab_1 POS:V +PV +S:3FS +O:3MS gloss:correspond_with]      kAtbt +h
         [kAtab_1 POS:V +PV +S:1S +O:3MS gloss:correspond_with]
         [kAtab_1 POS:V +PV +S:2FS +O:3MS gloss:correspond_with]
         [kAtab_1 POS:V +PV +S:2MS +O:3MS gloss:correspond_with]
ftHy     [taHiyap_1 POS:N +FEM +SG f+ gloss:greeting/salute]          f+ tHy
         [fatHiyap_1 POS:PN gloss:Fathia]                             ftHy
lmd      [mudap_1 POS:N +FEM +SG l+ gloss:interval/period]            l+ md
sntyn    [sinot_1 POS:N +MASC +DU +ACCGEN gloss:cent]                 sntyn
         [sanap_1 POS:N +FEM +DU +ACCGEN gloss:year]
.        [. POS:PX gloss:.]                                           .
For example, the token definition for splitting off
the conjunction w+ only is "w+ REST". This token definition specifies that the
conjunction w+ is split from the word and whatever is left (REST) is regenerated
after the conjunction w+. Similarly, the token definition for the Penn Arabic Treebank tokenization is "w+ f+ l+ k+ b+ REST +O: +P:".18 ALMORGEANA and
TOKAN have been used in both statistical and symbolic MT systems [29, 25].
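A toy rendering of how a token definition drives the output is given below; the analysis dict layout and this drastic simplification of TOKAN's behavior are assumptions:

    def apply_token_definition(analysis, token_definition):
        """Emit, in definition order, the surface form of each symbol
        (w+, l+, +O:, +P:, ..., or REST for the regenerated remainder)
        that the disambiguated analysis actually contains."""
        return " ".join(analysis[sym]
                        for sym in token_definition.split()
                        if sym in analysis)

    # With analysis = {"w+": "w+", "REST": "qd"}, the definition
    # "w+ REST" yields the two tokens "w+ qd".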
14.6 Conclusions
In this chapter, we described obstacles facing MT researchers when working with
Arabic resources in differing morphological representations. The lexeme-and-feature level of representation has been motivated, and ALMORGEANA, a large-scale system
for analysis and generation from/to that level has been described and evaluated.
We presented a framework using ALMORGEANA for navigating between Arabic
18 More information on TOKAN is available at http://www.ccls.columbia.edu/cadim/
resources.html.
morphological representations. This framework is useful for research exploring the
effects of using different Arabic representations in MT.
Acknowledgments
This work has been supported, in part, by Army Research Lab Cooper-
ative Agreement DAAD190320020, NSF CISE Research Infrastructure Award
EIA0130422, Office of Naval Research MURI Contract FCPO.810548265, NSF
Award #0329163 and Defense Advanced Research Projects Agency Contract No.
HR0011-06-C-0023. I would like to thank Owen Rambow, Mona Diab, Bonnie Dorr,
Tim Buckwalter, Michael Subotin and Christian Monson for helpful discussions.
References
[1] Azza Abdel-Monem, Khaled Shaalan, Ahmed Rafea, and Hoda Baraka. A Proposed
Approach for Generating Arabic from Interlingua in a Multilingual Machine Translation
System. In Proceedings of the 4th Conference on Language Engineering, pp. 197–206,
2003. Cairo, Egypt.
[2] Imad Al-Sughaiyer and Ibrahim Al-Kharashi. Arabic Morphological Analysis Tech-
niques: A Comprehensive Survey. Journal of the American Society for Information
Science and Technology, 55(3):189–213, 2004.
[3] Muhammed Aljlayl and Ophir Frieder. On Arabic Search: Improving the Retrieval
Effectiveness via a Light Stemming Approach. In Proceedings of ACM Eleventh
Conference on Information and Knowledge Management, Mclean, VA, pp. 340–347,
2002.
[4] Haytham Alsharaf, Sylviane Cardey, Peter Greenfield, and Yihui Shen. Problems and
Solutions in Machine Translation Involving Arabic, Chinese and French. In Proceedings
of the International Conference on Information Technology, pp. 293–297, Las Vegas,
Nevada, 2004.
[5] Satanjeev Banerjee and Alon Lavie. METEOR: An Automatic Metric for MT Evalu-
ation with Improved Correlation with Human Judgments. In Proceedings of the ACL
Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation
and/or Summarization, pp. 65–72, Ann Arbor, Michigan, 2005. Association for Compu-
tational Linguistics.
[6] Kenneth Beesley. Arabic Finite-State Morphological Analysis and Generation. In
Proceedings of the 16th International Conference on Computational Linguistics
(COLING-96), pp. 89–94, Copenhagen, Denmark, 1996.
[7] Daniel Bikel. Design of a Multi-lingual, Parallel-processing Statistical Parsing Engine.
In Proceedings of International Conference on Human Language Technology Research
(HLT), pp. 24–27, 2002.
[8] Jeff A. Bilmes and Katrin Kirchhoff. Factored Language Models and Generalized
Parallel Backoff. In Proceedings of the Human Language Technology Conference/North
American Chapter of Association for Computational Linguistics (HLT/NAACL-03),
pp. 4–6, Edmonton, Canada, 2003.
[9] Peter Brown, John Cocke, Stephen Della-Pietra, Vincent Della-Pietra, Fredrick Jelinek,
John Lafferty, Robert Mercer, and Paul Roossin. A Statistical Approach to Machine
Translation. Computational Linguistics, 16:79–85, June 1990.
[10] Peter Brown, Stephen Della-Pietra, Vincent Della-Pietra, and Robert Mercer.
The Mathematics of Machine Translation: Parameter Estimation. Computational
Linguistics, 19(2):263–311, 1993.
[11] Tim Buckwalter. Buckwalter Arabic Morphological Analyzer Version 1.0, 2002.
Linguistic Data Consortium, University of Pennsylvania, 2002. LDC Catalog No.:
LDC2002L49.
[12] Tim Buckwalter. Buckwalter Arabic Morphological Analyzer Version 2.0, 2004. Linguistic Data Consortium, University of Pennsylvania, 2004. LDC Catalog No.: LDC2004L02, ISBN 1-58563-324-0.
[13] Chris Callison-Burch, Miles Osborne, and Philipp Koehn. Re-evaluating the Role of
BLEU in Machine Translation Research. In Proceedings of the 11th conference of
the European Chapter of the Association for Computational Linguistics (EACL’06),
pp. 249–256, Trento, Italy, 2006.
[14] Michael Carl and Andy Way. Recent Advances in Example-Based Machine Translation.
Kluwer Academic Publishers, Dordrecht, Holland, 2003.
[15] Violetta Cavalli-Sforza, Abdelhadi Soudi, and Teruko Mitamura. Arabic Morphology
Generation Using a Concatenative Strategy. In Proceedings of the 6th Applied Natural
Language Processing Conference (ANLP 2000), pp. 86–93, Seattle, Washington, USA,
2000.
[16] Michael Collins. Three Generative, Lexicalised Models for Statistical Parsing. In
Proceedings of the 35th Annual Meeting of the ACL (jointly with the 8th Conference of
the EACL), pp. 16–23, Madrid, Spain, 1997.
[17] Michael Collins, Philipp Koehn, and Ivona Kucerova. Clause Restructuring for Statis-
tical Machine Translation. In Proceedings of the 43rd Annual Meeting of the Associ-
ation for Computational Linguistics (ACL’05), pp. 531–540, Ann Arbor, Michigan,
2005.
[18] Kareem Darwish. Building a Shallow Morphological Analyzer in One Day. In
Proceedings of the workshop on Computational Approaches to Semitic Languages in
the 40th Annual Meeting of the Association for Computational Linguistics (ACL-02),
pp. 47–54, Philadelphia, PA, USA, 2002.
[19] Mona Diab, Kadri Hacioglu, and Daniel Jurafsky. Automatic Tagging of Arabic Text:
From Raw Text to Base Phrase Chunks. In Proceedings of the 5th Meeting of the North
American Chapter of the Association for Computational Linguistics/Human Language
Technologies Conference (HLT-NAACL04), pp. 149–152, Boston, MA, 2004.
[20] Bonnie J. Dorr, Pamela W. Jordan, and John W. Benoit. A Survey of Current Research
in Machine Translation. In M. Zelkowitz, editor, Advances in Computers, Vol. 49,
pp. 1–68. Academic Press, London, 1999.
[21] Anas El Isbihani, Shahram Khadivi, Oliver Bender, and Hermann Ney. Morpho-syntactic Arabic Preprocessing for Arabic to English Statistical Machine Translation. In
Proceedings on the Workshop on Statistical Machine Translation, pp. 15–22, New York
City, June 2006. Association for Computational Linguistics.
[22] Sharon Goldwater and David McClosky. Improving Statistical MT Through Morpho-
logical Analysis. In Proceedings of the Conference on Empirical Methods in Natural
Language Processing, pp. 676–683, Vancouver, Canada, 2005.
[23] Nizar Habash. Generation Heavy Hybrid Machine Translation. PhD thesis, University
of Maryland College Park, 2003.
[24] Nizar Habash. Large Scale Lexeme Based Arabic Morphological Generation.
In Proceedings of Traitement Automatique des Langues Naturelles (TALN-04),
pp. 271–276, 2004. Fez, Morocco.
[25] Nizar Habash, Bonnie Dorr, and Christof Monz. Challenges in Building an Arabic-
English GHMT System with SMT Components. In Proceedings of the 7th Conference
of the Association for Machine Translation in the Americas (AMTA06), pp. 56–65,
Cambridge, MA, 2006.
[26] Nizar Habash and Owen Rambow. Arabic Tokenization, Part-of-Speech Tagging and
Morphological Disambiguation in One Fell Swoop. In Proceedings of the 43rd Annual
Meeting of the Association for Computational Linguistics (ACL’05), pp. 573–580, Ann
Arbor, Michigan, June 2005. Association for Computational Linguistics.
[27] Nizar Habash and Owen Rambow. MAGEAD: A Morphological Analyzer and
Generator for the Arabic Dialects. In Proceedings of the 21st International Conference
on Computational Linguistics and 44th Annual Meeting of the Association for Computa-
tional Linguistics, pp. 681–688, Sydney, Australia, July 2006. Association for Compu-
tational Linguistics.
[28] Nizar Habash, Owen Rambow, and George Kiraz. Morphological Analysis and
Generation for Arabic Dialects. In Proceedings of the Workshop on Computational
Approaches to Semitic Languages at 43rd Meeting of the Association for Computational
Linguistics (ACL’05), pp. 17–24, Ann Arbor, Michigan, 2005.
[29] Nizar Habash and Fatiha Sadat. Arabic Preprocessing Schemes for Statistical Machine
Translation. In Proceedings of the 7th Meeting of the North American Chapter of the
Association for Computational Linguistics/Human Language Technologies Conference
(HLT-NAACL06), pp. 49–52, New York, NY, 2006.
[30] Jan Hajič, Otakar Smrž, Tim Buckwalter, and Hubert Jin. Feature-based Tagger of Approximations of Functional Arabic Morphology. In Montserrat Civit, Sandra Kübler, and Ma. Antònia Martí, editors, Proceedings of Treebanks and Linguistic Theories (TLT), pp. 53–64, Barcelona, Spain, 2005.
[31] Xu Jinxi. UN Parallel Text (Arabic-English), LDC Catalog No.: LDC2002E15, 2002.
Linguistic Data Consortium, University of Pennsylvania.
[32] Lauri Karttunen, Ronald Kaplan, and Annie Zaenen. Two-level Morphology with
Composition. In Proceedings of Fourteenth International Conference on Computational
Linguistics (COLING-92), pp. 141–148, Nantes, France, July 20–28 1992.
[33] George Kiraz. Multi-tape Two-level Morphology: A Case study in Semitic Non-Linear
Morphology. In Proceedings of Fifteenth International Conference on Computational
Linguistics (COLING-94), pp. 180–186, Kyoto, Japan, 1994.
[34] Katrin Kirchhoff, Mei Yang, and Kevin Duh. Statistical Machine Translation of Parlia-
mentary Proceedings Using Morpho-Syntactic Knowledge. In TC-STAR Workshop on
Speech-to-Speech Translation, pp. 57–62, Barcelona, Spain, 2006.
[35] Kevin Knight. A Statistical MT Tutorial Workbook, April 30 1999. http://www.clsp.
jhu.edu/ws99/projects/mt/mt-workbook.htm.
[36] Philipp Koehn. Pharaoh: a Beam Search Decoder for Phrase-based Statistical Machine
Translation Models. In Proceedings of the Association for Machine Translation in the
Americas, pp. 115–124, 2004.
[37] Philipp Koehn, Franz Josef Och, and Daniel Marcu. Statistical Phrase-based Translation.
In Proceedings of the Human Language Technology and North American Association
for Computational Linguistics Conference (HLT/NAACL), pp. 127–133, Edmonton,
Canada, 2003.
[38] Kimmo Koskenniemi. Two-Level Model for Morphological Analysis. In Proceedings
of the 8th International Joint Conference on Artificial Intelligence, pp. 683–685, 1983.
[39] Young-Suk Lee. Morphological Analysis for Statistical Machine Translation.
In Proceedings of the 5th Meeting of the North American Chapter of the
284 Habash
Association for Computational Linguistics/Human Language Technologies Conference
(HLT-NAACL04), pp. 57–60, Boston, MA, 2004.
[40] Young-Suk Lee, Kishore Papineni, Salim Roukos, Ossama Emam, and Hany Hassan.
Language Model Based Arabic Word Segmentation. In Proceedings of the 41st Meeting
of the Association for Computational Linguistics (ACL’03), pp. 399–406, Sapporo,
Japan, 2003.
[41] Mohamed Maamouri, Ann Bies, and Tim Buckwalter. The Penn Arabic Treebank:
Building a Large-Scale Annotated Arabic Corpus. In NEMLAR Conference on Arabic
Language Resources and Tools, Cairo, Egypt, 2004.
[42] Guido Minnen, John Carroll, and Darren Pearce. Robust, Applied Morphological Gener-
ation. In Proceedings of the 1st International Conference on Natural Language Gener-
ation (INLG 2000), pp. 201–208, Mitzpe Ramon, Israel, 2000.
[43] Sonja Nieıen and Hermann Ney. Statistical Machine Translation with Scarce Resources
Using Morpho-syntactic Information. Computational Linguistics, 30(2), 2004.
[44] Franz Josef Och. Google System Description for the 2005 NIST MT Evaluation. In MT
Eval Workshop (unpublished talk), 2005.
[45] Franz Josef Och, Daniel Gildea, Sanjeev Khudanpur, Anoop Sarkar, Kenji Yamada,
Alex Fraser, Shankar Kumar, Libin Shen, David Smith, Katherine Eng, Viren Jain, Zhen
Jin, and Dragomir Radev. A Smorgasbord of Features for Statistical Machine Trans-
lation. In Proceedings of the Human Language Technology / North American Associ-
ation of Computational Linguistics Conference, pp. 161–168, Boston, Massachusetts,
2004.
[46] Franz Josef Och and Hermann Ney. A Systematic Comparison of Various Statistical
Alignment Models. Computational Linguistics, 29(1):19–52, 2003.
[47] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A Method
for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual
Meeting of the Association for Computational Linguistics, pp. 311–318, Philadelphia,
PA, 2002.
[48] Aaron Phillips and Violetta Cavalli-Sforza. Arabic-to-English Example Based Machine
Translation Using Context-Insensitive Morphological Analysis. In Journées dŠEtudes
sur le Traitement Automatique de la Langue Arabe (JETALA), Rabat, Morocco, 2006.
[49] Maja Popovi´
c and Hermann Ney. Towards the Use of Word Stems and Suffixes for
Statistical Machine Translation. In Proceedings of the 4th International Conference on
Language Resources and Evaluation (LREC), pp. 1585–1588, Lisbon, Portugal, May
2004.
[50] Chris Quirk, Arul Menezes, and Colin Cherry. Dependency Treelet Translation: Syntac-
tically Informed Phrasal SMT. In Proceedings of the 43rd Annual Meeting of the Associ-
ation for Computational Linguistics, pp. 271–279, Ann Arbor, Michigan, 2005.
[51] Jason Riesa and David Yarowsky. Minimally Supervised Morphological Segmentation
with Applications to Machine Translation. In Proceedings of the 7th Conference of
the Association for Machine Translation in the Americas (AMTA06), pp. 185–192,
Cambridge, MA, 2006.
[52] Fatiha Sadat and Nizar Habash. Combination of Arabic Preprocessing Schemes for
Statistical Machine Translation. In Proceedings of the 21st International Conference on
Computational Linguistics and 44th Annual Meeting of the Association for Computa-
tional Linguistics, pp. 1–8, Sydney, Australia, July 2006. Association for Computational
Linguistics.
[53] Mohammed Sharaf. Implications of the Agreement Features in (English to Arabic)
Machine Translation. Master’s thesis, Al-Azhar University, 2002.
Arabic Morphological Representations for Machine Translation 285
[54] Noah Smith, David Smith, and Roy Tromble. Context-Based Morphological Disam-
biguation with Random Fields. In Proceedings of the 2005 Conference on Empirical
Methods in Natural Language Processing (EMNLP05), pp. 475–482, Vancouver,
Canada, 2005.
[55] Harold Somers. Review Article: Example-based Machine Translation. Machine Trans-
lation, 14(2):113–157, 1999.
[56] Abdelhadi Soudi. Challenges in the Generation of Arabic from Interlingua.
In Proceedings of Traitement Automatique des Langues Naturelles (TALN-04),
pp. 343–350, 2004. Fez, Morocco.
[57] Abdelhadi Soudi, Violetta Cavalli-Sforza, and Abderrahim Jamari. A Computational
Lexeme-Based Treatment of Arabic Morphology. In Proceedings of the Arabic Natural
Language Processing Workshop, Conference of the Association for Computational
Linguistics (ACL 2001), pp. 50–57, Toulouse, France, 2001.
[58] Abdelhadi Soudi, Violetta Cavalli-Sforza, and Abderrahim Jamari. A Prototype English-
to-Arabic Interlingua-based MT system. In Proceedings of the Third International
Conference on Language Resources and Evaluation: Workshop on Arabic language
resources and evaluation, Las Palmas, Spain, 2002.
[59] Andreas Zollmann, Ashish Venugopal, and Stephan Vogel. Bridging the inflection
morphology gap for arabic statistical machine translation. In Proceedings of the Human
Language Technology Conference of the NAACL, Companion Volume: Short Papers,
pp. 201–204, New York City, USA, 2006. Association for Computational Linguistics.
15
Arabic Morphological Generation and its Impact
on the Quality of Machine Translation to Arabic
Ahmed Guessoum and Rached Zantout
Department of Computer Science, The University of Sharjah, P.O. Box 27272, Sharjah, UAE
guessoum@sharjah.ac.ae
Department of Computer & Communications Engineering, Faculty of Engineering,
Hariri Canadian University, Mechref, P.O. Box 10, Damour, Chouf 2010, Lebanon
rached@cyberia.net.lb
Abstract: The aim of this chapter is to highlight the complexity and importance of Arabic morphological information in an Arabic Machine Translation (AMT) system, i.e. a system that translates to or from Arabic. We summarize Arabic morphology and introduce the main morphological information that we have found relevant to machine translation to Arabic, and we categorize it into various types of features. In order to show the impact of these morphological features on machine translation quality, we have adopted an approach whereby we relate each of them to the quality of the translation. This leads us, through a statistical analysis of the test data, to a characterization of which features are more important in terms of their impact on the quality of the translation of a given AMT system (AMTS). The approach has been implemented and applied to evaluating an English-to-Arabic web-based MT system. The results of the evaluation of this system are presented, conclusions are drawn, and recommendations for improving its output are made.
15.1 Introduction
Translating between different languages is a very important discipline. The esti-
mated value of the world market for translation was U.S. $20 billion according to
the Gartner Group (Stamford, CT, USA) with an annual growth rate of 14.6%
(Van der Meer, 2003). In 2004, the human translation market was estimated to be
$1 billion (Oren, 2004), while the machine translation market was forecast to be in
the $100 million range. MT software was reported to be responsible for the completion
of between 30 and 50 percent of a machine translation task automatically
(ECL, 1996; Hedberg, 1995). MT software was also estimated to cut the cost
of translation by two thirds.
Evaluation of Natural Language Processing (NLP) systems is currently a field
of research on its own. Various researchers have stressed the importance of com-
ponent-based evaluation and detailed error analyses (Arnold et al., 1993; Hedberg,
1995; Nyberg et al., 1994). Since MT systems combine lexical analyzers, morpho-
logical analyzers, parsers, semantic disambiguation modules, generators, and
pragmatic analysis modules, it is important to be able to evaluate these various
components individually as well as to evaluate the overall system. The main diffi-
culty here is that, in most of the cases, evaluators do not have access to the indi-
vidual components of the system under evaluation and are therefore forced into
black-box evaluation. This means that an error in the output of the system cannot
be attributed to one of the components since it can be due to one or many errors in
one or more of the components of an NLP system.
In (Van Slype, 1979) evaluation is subdivided into two main categories: macro-
evaluation and micro-evaluation. Among the macro-evaluation assessment com-
ponents that are affected by morphological errors are the cognitive level compo-
nents such as intelligibility, fidelity, coherence, usability, and acceptability of a
translation. Among the micro-evaluation methods affected by morphological er-
rors are the grammatical symptomatic components such as the analysis of gram-
matical errors found in the target output. (Chaumier et al., 1977) suggest an even
finer scrutiny of the grammatical (sub-) constructs in the source and target texts,
e.g., noun phrases, adjectival and verb phrases, object complements, adverbial
complements, etc.
In general, there seems to be an agreement as to the following aspects that
should be evaluated in any MT system: adequacy, which is the extent to which the
meaning of the source text is rendered in the translated text; fluency, which is the
extent to which the target text appeals to a native speaker in terms of well-
formedness of the target text, grammatical correctness, absence of misspellings,
adherence to common language usage of terms, and meaningfulness within the
context (White et al., 1994); informativeness, which assesses the extent to which
the translated text conveys enough information from the source text as to enable
evaluators to answer various questions about the latter based on the translated text;
and intelligibility (Arnold et al., 1993), which is strongly related to informativeness,
though directly affected by grammatical errors and mistranslated or missing
words. In (Nyberg et al., 1994) the authors from the KANT (Nyberg et al., 1992)
team introduce a methodology based on evaluation metrics for knowledge-based
MT. The evaluation criteria they consider are: completeness, which measures the
ability of a system to produce an output for every input; correctness, which meas-
ures the ability of a system to produce a correct output for every input; and stylis-
tics, which measures the appropriateness of the lexical, grammatical, and other
choices made during the translation process. Based on the completeness, correctness,
and stylistics criteria, the authors then defined four evaluation criteria, which test,
as percentages, the Analysis Coverage, Analysis Correctness, Generation Cover-
age, and Generation Correctness. These four percentages then get multiplied,
yielding the Translation Correctness, which measures the overall quality of the
system. (Guessoum & Zantout, 2001) and (Guessoum & Zantout, 2005) introduce
a semi-automatic methodology for component evaluation of Arabic MT systems
(AMTSs) using a black-box approach. The methodology tests the correctness of
each component of an MT system by analyzing carefully selected sentences trans-
lated by an MT system. Weighted averages are then computed and scores are de-
rived for each component of a system under evaluation. An overall score for the
system is also calculated. The weighted averages were shown to be indicators of
what components of the system are the faultiest and therefore would need imme-
diate attention by the developers. The difficulty in this approach is the ability to
come up with a large number of test sentences that would test each component of
the MT system.
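By way of illustration, the arithmetic behind these two scoring schemes is simple enough to state in a few lines. The following Python sketch is our own illustrative rendering, not code from (Nyberg et al., 1994) or (Guessoum & Zantout, 2001); the component names, weights, and numeric values are invented placeholders.

def translation_correctness(analysis_cov, analysis_corr,
                            generation_cov, generation_corr):
    """Nyberg et al. (1994)-style overall score: the product of the four
    coverage/correctness percentages, each expressed here in [0, 1]."""
    return analysis_cov * analysis_corr * generation_cov * generation_corr

def weighted_system_score(component_scores, weights):
    """Guessoum & Zantout (2001, 2005)-style overall score: a weighted
    average of per-component scores obtained through black-box testing."""
    total = sum(weights[c] for c in component_scores)
    return sum(component_scores[c] * weights[c] for c in component_scores) / total

# Invented example values:
print(translation_correctness(0.95, 0.90, 0.92, 0.85))  # ~0.67

scores = {"lexicon": 4.1, "parser": 3.2, "generator": 2.7}
weights = {"lexicon": 1.0, "parser": 2.0, "generator": 2.0}
print(weighted_system_score(scores, weights))           # 3.18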
The aim of this chapter is to highlight the complexity and importance of Arabic
morphological information in an Arabic Machine Translation (AMT) system. In
Section 15.2 we summarize Arabic morphology and in Section 15.3 we introduce
the main morphological information that we have found relevant to machine trans-
lation to Arabic. We show how various aspects of Arabic morphology reflect im-
portant lexical, syntactic, semantic, and pragmatic aspects of a sentence in transla-
tion. We also categorize this information into various types of features. In Section
15.4 we show how, through a statistical analysis of a bilingual corpus consisting
of English source text and machine-produced Arabic target text, we could suggest
a characterization of which morphological features are more important in terms of
their impact on the quality of the translation of a given AMT system (AMTS). In
Section 15.5, conclusions are drawn and recommendations for improving the outputs of AMT systems are made.
15.2 Basics of Arabic Morphology
The information in this section was derived through readings in (Dahdah, 1995),
(Hamalawy, 1996), and (Rajhi, 1979). Arabic words are grouped into three main
categories: Nouns (˯ΎϤγϷ΍), Verbs (ϝΎόϓϷ΍) and Prepositions (ϑήΣϷ΍). The Noun and
Verb categories consist of subcategories that affect how the word is used in the
sentence and how it changes within the context of the sentence and the other
words in the same sentence. Each subcategory obeys certain rules of morphology
that detail whether a word can be used in a certain context and how its form
changes in that context. The difference between words in subcategories can be as
subtle as the presence (or absence) of a vowel or as clear as the addition (or re-
moval) of letters to the word when it moves from one subcategory to another. Al-
though the Preposition category contains subcategories, prepositions in Arabic do
not, in general, change forms.
Morphology for Arabic is a tool that enables the language to grow and develop.
Morphology, in general, is defined as producing a word from another by changing
it so that it fits a certain new meaning. In Arabic, morphology is divided into four categories of derivation: the small (ήϐλϷ΍ ϕΎϘΘηϹ΍), the large (ήϴΒϜϟ΍ ϕΎϘΘηϹ΍), the larger (ήΒϛϷ΍ ϕΎϘΘηϹ΍) and the largest (έΎΒϜϟ΍ ϕΎϘΘηϹ΍). The small morphology produces one word from another but keeps similarities between the two words in their pronunciation and meaning (e.g. ϢϠϋ (Ȣilm, science)1 → ϢϟΎϋ (ȢAlim, scientist)). The large morphology produces a word from another by exchanging the letters in the root of the word (ϢϠϋ (Ȣilm, science) → ϞϤϋ (Ȣamal, work)). The larger morphology produces a word from another by changing one (or more) letters while keeping the same meaning (ϥ΍ϮϨϋ (ȢnwAn, address) → ϥ΍ϮϠϋ (ȢulwAn, address)). The largest morphology produces a word from a group of words, such as the contraction ΔϠϤδΑ (basmalƫ) from ϢϴΣήϟ΍ ϦϤΣήϟ΍ Ϳ΍ ϢδΑ (bismi All~Ah Alr~aHmAn Ar~aHym, "In the name of Allah, The Compassionate, The Merciful"). By far the most frequently used type of morphology in Arabic is the small morphology.

1 In the rest of the chapter, any Arabic sentence will be followed by its transliteration using the scheme followed throughout the book.
Small morphology can act on a Noun or a Verb. Any Noun or Verb in Arabic
consists of a root and added letters. Arabic roots have traditionally been considered to consist of three letters (the most common type of root in Arabic)
or four letters. Like English and French, in Arabic, letters can be added to the be-
ginning and/or to the end of the root. However, unlike English and French, in
Arabic, letters can be added inside (between the letters of) a root. This is one of
the complexities that make Arabic a harder language to analyze or generate mor-
phologically.
Fortunately, for computational linguists interested in developing Arabic lan-
guage tools on computers, Arabic is a structured language. Basically, verbs and nouns do not accept the addition of arbitrary letters at arbitrary positions inside, at the beginning of, or at the end of a root. In this chapter we will explain some of
the rules pertaining to verbs in the Arabic language. The reader should bear in
mind that Nouns obey similar rules. The reader should also bear in mind that the
rules described below will not be exhaustive even for Arabic verbs as the purpose
behind the explanation is to give the reader an appreciation of the complexity of
Arabic morphology rather than to enumerate all the rules governing Arabic mor-
phology.
In Arabic, there are certain letters that can come at the beginning of the root
(prefixes); these letters are grouped in the Arabic word ϲϨΘϟ΄γ (s, A, l, t, n, y). The
letters that can be added to the end of the root (suffixes) are grouped in the Arabic
word ϲϨΘϤϫϭ΃ (A, w, h, m, t, n, y). There is an upper limit on the number of prefixes
(four letters) that can be added to a root. In some cases, a letter can be repeated as
a prefix or a suffix. Letters that can be added to the inside of a root (infixes) have
a more complicated set of rules. First, a group of letters ΍ (A), ϭ (w), and ϱ (y)
(called ΔϠόϟ΍ ϑήΣ΃), while being part of the root, can disappear from an Arabic verb if
the verb is in the imperative form. Second, only one or two letters can be added in-
side the root. Third, infixes are grouped in the Arabic word ϲϧϮΗ΍ (A, t, w, n, y).
Fourth, certain verb forms can be produced by repeating the same letter of the root.
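The letter-group constraints just listed lend themselves to a simple membership test. The sketch below is our own illustration over the book's transliteration (s, A, l, t, n, y for the prefix letters, and so on); it checks only the letter inventories and the stated length limits, and deliberately ignores the finer conditions on weak letters and letter repetition.

# Letter inventories for Arabic verb affixes, as described above,
# written in Buckwalter-style transliteration. Illustrative only.
PREFIX_LETTERS = set("sAltny")   # letters of the word sAltny (s, A, l, t, n, y)
SUFFIX_LETTERS = set("Awhmtny")  # letters of the word Awhmtny (A, w, h, m, t, n, y)
INFIX_LETTERS  = set("Atwny")    # letters of the word Atwny (A, t, w, n, y)
MAX_PREFIX_LEN = 4               # at most four prefix letters per root

def valid_prefix(p):
    """Prefixes may use only the sAltny letters (repetition allowed),
    up to four of them."""
    return len(p) <= MAX_PREFIX_LEN and set(p) <= PREFIX_LETTERS

def valid_suffix(s):
    """Suffixes may use only the Awhmtny letters (repetition allowed)."""
    return set(s) <= SUFFIX_LETTERS

def valid_infix(i):
    """Only one or two letters, drawn from Atwny, may go inside a root."""
    return 1 <= len(i) <= 2 and set(i) <= INFIX_LETTERS

print(valid_prefix("y"))   # True: the present-tense prefix
print(valid_suffix("wn"))  # True: the non-dual plural suffix
print(valid_infix("A"))    # True: e.g. the reciprocal infix
print(valid_prefix("b"))   # False under these verb-affix rules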
Arabic morphology is a very structured process. For example, a verb can un-
dergo morphology based on moulds that take any root and transform it into the
corresponding verb by adding prefixes, suffixes or infixes in order to convey the
meaning of the verb. For example, in order to specify that more than two people
are writing to each other, the root ΐΘϛ (kataba, (he) wrote) is used. The suffix ϥϭ
(wn) is added to it to indicate that the verb is being done by more than two peo-
ple2; the prefix ϱ (y) is added to indicate that the verb is in the present tense; and
the infix ΍ (A) is added to indicate that they are writing to each other. Thus the
verb obtained is ϥϮΒΗΎϜϳ (yukAtibwn). If the same meaning is to be conveyed but,
now, instead of the group of people writing to each other we want to say that they
play with each other, it is only necessary to replace the letters of the verb root
write ΐΘϛ (kataba) with those of the verb root play ΐόϟ (laȢiba) to obtain the verb
ϥϮΒϋϼϳ (yulAȢibwn). In a similar manner, if the group consisted of two instead of
more than two members then the only change needed is to use the suffix ϥ΍ (An)
instead of the suffix ϥϭ (wn). The different forms that can be used with a root to
produce an Arabic verb have been classified differently in the literature. One such
classification (Fowzan et al., 2000) enumerates 129 Morphological Patterns which
can be used to generate verbs from roots.
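This mould-based derivation can be pictured as filling the root consonants into numbered slots of a pattern string. The fragment below is a schematic sketch of that idea; the pattern notation (digits marking root-consonant slots) and the pattern constant are our own invention, not one of the 129 patterns of (Fowzan et al., 2000), and E stands for Ȣ in Buckwalter-style transliteration.

# Schematic interdigitation of a triliteral root into a verb mould.
# Digits 1, 2, 3 mark the slots for the three root consonants;
# everything else is prefix, vowel, infix, or suffix material.

def apply_pattern(root, pattern):
    """Fill the consonants of a three-letter root into a mould."""
    assert len(root) == 3, "this sketch handles triliteral roots only"
    return "".join(root[int(c) - 1] if c in "123" else c for c in pattern)

# Present-tense prefix y, reciprocal infix A, non-dual plural suffix wn:
RECIPROCAL_PLURAL = "yu1A2i3wn"

print(apply_pattern("ktb", RECIPROCAL_PLURAL))  # yukAtibwn  'they write to each other'
print(apply_pattern("lEb", RECIPROCAL_PLURAL))  # yulAEibwn  (the chapter's yulAȢibwn)

# The dual only swaps the suffix wn for An:
print(apply_pattern("ktb", "yu1A2i3An"))        # yukAtibAn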
In Arabic, the Noun or Verb will have different forms if the subject or object is
masculine or feminine. Forms can also differ according to whether the sentence refers to one person, a group of two, or a group of more than two.
15.3 Arabic Morphological Generation as a Repository
of Morphological, Syntactic, Semantic, and Pragmatic
Information (in an AMTS)
It is well-known that Machine Translation is a complex process. It is ideally the
result of analyzing the source text morphologically, lexically, syntactically, se-
mantically — and even pragmatically and stylistically if needed, and producing its
equivalent target text using all of these linguistic dimensions. The target language
and the complexity of its morphology and grammar can make this machine trans-
lation process even more complex. This is indeed the case for Arabic, where, due
to the complexity of the morphology, the generation of correct Arabic words must
take into account and reproduce all the linguistic information acquired from the
analysis of the source text, be the source language English or any other.3

Consider for instance a simple sentence like "The girls wrote the beautiful essays". To translate it to Arabic, the morphological generator needs to:

1. add the prefix ϝ΍ (Al, the) to the word ΕΎϨΑ (banAt, girls), which is itself the result of adding the infix ΍ (A) to obtain the plural of ΖϨΑ (bint, girl);
2. add the suffix Ε (t) to the basic verb form ΐΘϛ (kataba, he wrote) to produce ΖΒΘϛ (katabat, she wrote) for the past tense, feminine form;
3. add the prefix ϝ΍ (Al, the) and the suffix Ε΍ (At, feminine plural) to the word ϝΎϘϣ (maqAl, an article/essay) because, in Arabic, the masculine word for article/essay has a feminine plural form;
4. add the prefix ϝ΍ (Al, the) and the suffix Γ (ƫ, for feminine) to the adjective ϞϴϤΟ (jamyl, beautiful (masculine singular)) to obtain the feminine form of the adjective, for gender concordance with the plural word for articles/essays.

The generated Arabic sentence would therefore be

ϟ΍ ΕΎϨΒϟ΍ ΖΒΘϛΔϠϴϤΠϟ΍ ΕϻΎϘϤ.
katabat AlbnAtu AlmqAlAt Aljmylƫ
Wrote (fem.) the-girls (fem. plural) the-articles (fem. plural) the-beautiful (fem.).

In the above example, the verb and the subject have to match in gender; the noun and the adjective match in gender but also with respect to definiteness. It is clear that the morphological generator needs to take into account lexical and syntactic information, in addition to the fact that a word re-ordering needs to be introduced by the "transfer" module.4 As such, by reading the Arabic sentence, we can immediately tell (i.e. from the output of the morphological generator) whether the words are lexically and syntactically correct. In fact, even the meaning can be affected, as will be explained in the coming sections, if the morphological generator does not receive this information from other modules (such as the parser) in the machine translation system or does not correctly reproduce it.

In more complex examples, pragmatic knowledge could be used, such as reference resolution, so that the proper form of a word is generated. Consider, for example, the following sentence, its translation, and gloss:

This is your room - it's rather small    ϚΗήΠΣ ϲϫ ϩάϫΎϣ ˷Ϊ Σ ϰϟ· Γήϴϐλ ϲϫ
haðihi hiya Hujratuka – hiya SaȖyraƫNJ Ӽilý Had~ƭ mA
This it (is5) your-room - it (feminine) (is) small (feminine) to limit some

2 Recall that Arabic has two forms for the plural: the dual and the non-dual plural (more than two people).
3 Despite the fact that the discussion in this chapter is about Arabic morphological generation, independently of the source language in any machine translation process, all the examples and the implementation will be about machine translation from English to Arabic.
4 We call it the "transfer" module no matter what actual approach is adopted in the machine translation system.
5 The auxiliary "is" is implicit in Arabic.
In this case, the demonstrative pronoun “this” should be translated as ϩάϫ
(haðihi, (feminine) this) and not ΍άϫ (haðA, (masculine) this) since ΓήΠΣ (Hujraƫ,
room) is feminine. The gender needs also to be conveyed in the second ϲϫ (hiya,
(feminine) it) and Γήϴϐλ (SaȖyraƫ, (feminine) small). Again, an error in the resolu-
tion of the references would be confusing or misleading. For instance, if the previ-
ous source sentence gets incorrectly translated as
ϚΗήΠΣ ϲϫ ϩάϫΎϣ ˷Ϊ Σ ϰϟ· ήϴϐλ Ϯϫ (*)
haðihi hiya Hujratuka – huwa SaȖyrNJ Ӽilý Had~ƭ mA
This it (is) your-room - it (masc.) (is) small (masc.) to limit some
This translation would convey a completely different meaning. Indeed, the reader
would believe that the speaker mentions the room of the listener but goes on/back
to talking about some other person describing him as small to some extent. If a
male person happens to be mentioned in the context of the sentence, the confusion
would become complete!
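To make concrete the kind of feature passing whose failure produces such errors, the sketch below shows, over transliterated forms from the room example, how a generator could select pronoun and adjective forms from the features of the antecedent. The tiny lexicon and the function are an invented illustration, not a component of any actual AMT system; p renders the feminine marker ƫ in Buckwalter-style transliteration.

# Invented toy illustration of feature-driven form selection for the
# room example. Keys are (person, number, gender) feature bundles.

PRONOUN = {("3", "sg", "fem"): "hiya",    # 'it/she'
           ("3", "sg", "masc"): "huwa"}   # 'it/he'

SMALL = {("sg", "fem"): "SaGyrap",        # 'small (fem. sg.)'
         ("sg", "masc"): "SaGyr"}         # 'small (masc. sg.)'

def refer_back(antecedent):
    """Choose the pronoun and predicate adjective that agree with the
    antecedent noun (here Hujrap 'room', which is feminine singular)."""
    person, number, gender = antecedent
    return PRONOUN[(person, number, gender)], SMALL[(number, gender)]

room = ("3", "sg", "fem")                  # features of Hujratuka 'your room'
print(refer_back(room))                    # ('hiya', 'SaGyrap'): correct
print(refer_back(("3", "sg", "masc")))     # ('huwa', 'SaGyr'): the (*) output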
In the context of a black-box evaluation, it is not possible to find out the faulty
component(s) of a machine translation system that produces an output like the one
in the last example. We cannot be sure whether the morphological generator re-
ceived all the needed information to produce the correct words, and hence it would
be the faulty component, or if it did not receive enough information from the other
components in the MT system (parser, transfer, etc.) as to generate the correct
words. In all cases, what we argue is that the presence or absence of morphological information can quite seriously affect the quality of an MT system's output.
From an analysis of a large number of sentences, as will be explained in the
next section, we have singled out and categorized various types of morphological
features that are important in Machine Translation and which can affect its quality.
In fact, this is exactly the criterion for singling them out: we selected a feature if
its improper handling affects the sentence quality.
These features are now presented, with clarifying examples, and will be further
analyzed in the coming sections on an actual Arabic MT system.
1. Definite / Indefinite Nouns (A)
As explained earlier, an indefinite Arabic noun can be made definite by adding the
article ϝ΍ to it. From our analysis of the outputs of various AMT systems, this is
quite frequently not done correctly. The result can be objectionable or even un-
clear.
Monkeys are very agile climbers. .΍̒ΪΟ ϥϮϘϴηέ ϥϮϘ˷ϠδΘϣ ΩϭήϘϟ΍
… afflicts women more than men. (*).ϝΎΟέ Ϧϣ ήΜϛ΃ ˯Ύδ˷Ϩϟ΍ ϲϠΘΒϳ ...
In the first example the prefix definite article ϝ΍ (Al) is correctly added to the noun
Ωϭήϗ (quruwd, monkeys). In the second example, both words for men and women
should be definite in Arabic. The translation correctly places the definite article
for “(the) women” (˯ΎδϨϟ΍ , Aln~isA') but not for “men” (ϝΎΟέ , rijAl).
2. Case Ending (B)
One of the complexities of Arabic grammar is that words get inflected depending
on the case (nominative, accusative, or genitive), the number (singular, dual, or
plural), and the gender (masculine or feminine). The generation of the correct
form of the word depending on these features must obviously be done very care-
fully and is in fact a common mistake among native speakers nowadays! The im-
proper handling of the case ending results in unpleasant sentences and sometimes
it modifies the sentence meaning entirely, especially when the word order is
changed.6
… the massive aerial bombardment of military targets continued unabated.
Δ˷ϳήϜδόϟ΍ ϑ΍Ϊϫϸϟ ϢΨ˷πϟ΍ ˷ϱ ˷Ϯ Π ϟ ΍ ϒμϘϟ΍ .Ύ˷ϳϮϗ ˷ήϤΘγ΍

He abided in the wilderness for forty days.
.ϡϮϳ ϥϮόΑέ΃ Γ˷ΪϤϟ ˯΍ήΤ˷μϟ΍ ϲϓ ϡΎϗ΃ (*)

In the first sentence, the Arabic adjective Ύ˷ϳϮϗ (qawiy~Aã, massive/strong) correctly appears in the accusative form. However, the numeral ϥϮόΑέ΃ (ÂrbaȢwn, forty) should be in the genitive form, while it appears in the nominative.
3. Imperative Mood (C)
A common error found in Arabic MT systems is the incorrect translation of verbs
in the imperative mood into verbs in the present tense of the indicative mode, of-
ten with the wrong pronoun (he) being used. This is the case with the system we
have evaluated for the purposes of this work.
Please contact… ϞμΘϳ ϚϠπϓ Ϧϣ (*)
Here the verb contact is incorrectly translated as Ϟ˷μΘϳ (yat~aSil, he contacts). It
should rather be the imperative form of the verb (i.e., Ӽt~aSil, contact).
4. Verb Tenses (D)
This is another error commonly found in AMT systems. It is often coupled with the
incorrect use of the pronoun (he). Obviously, enough morphological information
needs to be made available to the morphological generator not to fall into this trap.
Sometimes, this mistake shows up when the present and past tenses have the same
form for a given verb in the source language (English in our case). However, the er-
ror would probably reflect an improper syntactic parsing of the source sentence.
6 Word order modification is a fairly common thing to do in Arabic; it is usually used to give a different emphasis in the sentence.
Why don’t we put the bed … ... ήϳή˷δϟ΍ ϊπϧ ϻ ΍ΫΎϤϟ
The anti-war agitation has begun ... ...΃ΪΒΗ ΏήΤϠϟ ΔοέΎόϤϟ΍ ΓέϮ˷Μϟ΍ (*)
In the first example, “we put” is correctly translated as ϊπϧ (naDaȢu, we put) in
the present tense whereas, in the second example, “has begun” somehow gets
translated to ΃ΪΒΗ (tabdaÂu, (it) begins).
5. Expressions (E)
Handling common expressions often requires the application of prepositions, the
definite article, pronouns, etc., to various categories of words. If this is not care-
fully done, the result will be incorrect words (in the context) or even morphologi-
cally ill-constructed words as explained in item 8 below. Of course, most of the
time a word-to-word translation of these expressions gives appalling results mor-
phologically, syntactically, and semantically.
For a man of 80… 80ϝ΍ ϲϓ ϞΟήϟ ΔΒδϨϟΎΑ
… in the 25 to 40 age group. ... ϞϴΟ 40 ϰϟ· 25 ϲϓ (*)
The first sentence is correctly translated with the proper Arabic preposition ϲϓ
(fy, in). The second sentence is badly translated, which results in the meaning “in
25 to 40 generations”!
6. Pronouns (F)
Another problem is the proper handling of pronouns. Pronouns can appear either
suffixed to a word or separated from it, based on various morphological and syn-
tactic rules. The pronouns may take one of several forms depending on the fea-
tures mentioned earlier, namely the case, gender, and number.
6.1. Pronoun-Related Concordance (Case Ending)
The morphology of a word can get affected by a pronoun in a sentence. For in-
stance, in the first example below, “as an afterthought” is translated as ΔϛέΪΘδϣ
(mustadrikaƫã, as an afterthought of hers) the last letter of this word reflecting the
fact that the subject is feminine. If the subject was a plural one like “They”, the
word ΔϛέΪΘδϣ would become ΕΎϛέΪΘδϣ (mustadrikAt, for the feminine plural) or
ϦϴϛέΪΘδϣ (mustadrikyn, for the non-feminine plural). Similar changes would occur
if the subject was dual (two people), etc. In the second example, ΎϴϓΎσ (TAfiyAã,
afloat) appears incorrectly in the masculine singular form; it should be ΔϴϓΎσ
(TAfiyaƫã, afloat (feminine singular)).
All such morphological information needs to be available and generated for the
sentence to be correct.
She only asked me … as an afterthought. . ΔϛέΪΘδϣ ... ϲϨΘϋΩ
She spent seven days afloat on a raft. .ΏέΎϗ ϰϠϋ Ύ˱ϴϓΎσ Ύ˱ϣΎ˷ϳ΃ ΔόΒγ Ζπϗ (*)
6.2. Gender Concordance (and Pronoun Resolution)
Also a common source of error is gender concordance, where a pronoun needs to
reflect the gender, or other more general aspects of pronoun resolution such as
number. In the example below, ϚΗήΠΣ (Hujratuk, your room) is feminine but is
incorrectly translated using the pronoun Ϯϫ (huwa, it (masculine)) and ήϴϐλ
(SaȖyr, small (masculine)).
This is your room - it’s rather small... ...Ύϣ ˷Ϊ Σ ϰϟ· ήϴϐλ Ϯϫ - ϚΗήΠΣ ϲϫ ϩάϫ (*)
6.3. Unnecessary Generation of Pronouns
From our experience with AMT systems, this error is quite a common one. A pro-
noun is very frequently generated when it is not needed. Indeed, in Arabic a verb
implicitly indicates the person (singular, dual or plural; 1st, 2nd, or 3rd; etc.). As
such the addition of a pronoun before the verb is often a redundancy (unless there
is an emphasis to be conveyed) and sounds heavy. In the example below, ˴ϥϮ˵ό˴Ϥ˸Π˵ϴγ
(sayujmaȢwun, (they) will be reunited) has the suffix ϥϭ which indicates that the
subject is they (non-feminine plural taken by default here). Adding the pronoun to
such a verb would convey a meaning like “It is they who will be reunited in the af-
terlife”, which in the case of this particular sentence and its religious connotation,
is probably quite unacceptable.
They’ll be reunited in the afterlife. . ΓήΧϵ΍ ϲϓ ˴ϥϮ˵ό˴Ϥ˸Π˵ϴγ (Ϣϫ)
6.4. Incorrect Pronoun
Sometimes, the pronoun is not obvious to guess from the sentence; it is understood
from the context or by commonsense. In the example below, “It’s advisable to
book seats…” would probably mean “It’s advisable for you to book seats…” or
“It’s advisable for one to book his/her seats…”. A common mistake in AMT sys-
tems is to simply translate the verb “to book” as ΰΠΤϳ (yaHjizu, he books)
conveying the following meaning for the sentence “It’s advisable for him to book
his seats…”. This level of morphological information detail is indeed needed to
render the semantics of the sentence. A human translator would most probably
translate the sentence into:
˱Ύ ϣ ˷Ϊ Ϙ ϣ ωϮΒγ΃ ϞϗϷ΍ ϰϠϋ ΪϋΎϘϤϟ΍ ΰΠΣ ϦδΤΘδϤϟ΍ Ϧϣ
It’s advisable to book seats at least a week in advance.
(*) ˷Ϟ ϗ Ϸ ΍ ϰϠϋ ΪϋΎϘϤϟ΍ ΰΠΤϳ ϥ΃ ϦδΤΘδϣ Ϯϫ Ύ˱ϣ˷ΪϘϣ ωϮΒγ΃.
7. Number Concordance (G)
This feature should be clear from the above explanations. Nouns, verbs, adjec-
tives, and pronouns match with respect to number. In the examples below,
ϦϴΗέΎπΤϟ (liHaDAratayni, of two civilizations, genitive form) is in the dual form
which is different from ΓέΎπΤϟ (liHaDArƫ, of one civilization). As such, the adjec-
tive must be in the dual genitive form ϦϴΘϤϴψϋ (ȢĆymatayni, (two) great). In the
second example, ϒ˷ϬϠΘϣ (mutalah~if, agog) incorrectly appears in the (default)
masculine singular form, which would reflect that the subject is “he” (instead of we).
… of two great civilizations. ...ϦϴΘϤϴψϋ ϦϴΗέΎπΤϟ .
We waited agog for news. . έΎΒΧϸϟ ϒ˷ϬϠΘϣ ΎϧήψΘϧ΍ (*)
8. Constructed Words (H)
In a number of cases that we have come across, ill-constructed words were gener-
ated. This clearly reflects errors that are intrinsic to the morphological generator.
For instance, the output in the first example below, an invalid word ΍˱ΪϴόΒϛ
(kabaȢydAã, as far afield) is generated. This word has probably been constructed
by adding the prefix ϙ (k, as) to the word ΍˱ΪϴόΑ (baȢydAã, far) giving a form which
is not correct in Arabic. A similar process was followed in the second example
where the suffix ϲϧ (ny, me) instead of ϱ (y, me/my) was incorrectly combined
with the word ΏΎπϏ· (ӼȖDAb, aggravating).
… as far afield as Japan ... ... ϥΎΑΎϴϟΎϛ ΍˱ΪϴόΒϛ ...
Stop aggravating me… ...ϲϨΑΎπϏ· Ϧϋ ϒ˷ϗϮΗ
9. Inadequate Prepositions (I)
A preposition can prefix a word (e.g. prepositions Ώ , ϙ , and ϝ ) or can be
separated from it (e.g. prepositions Ϧϣ , Ϧϋ, and ϰϠϋ ). If one is not careful,
prepositions may be incorrectly translated. For example, with the AMT system we
have evaluated, “at” in the first example below, was translated as ϲϓ (fy, in)
instead of Ώ (bi, with/at). Likewise, the incorrect preposition ϝ (li, for) was
selected instead of Ϧϋ (Ȣn, of) to prefix the word ϕϮϘΣ (Huquwq, rights).
... at affordable prices. .ΔΣΎΘϤϟ΍ έΎόγϷ΍ ϲϓ ... (*)
She is renowned for her advocacy of
human rights.
. ϥΎδϧϹ΍ ϕϮϘΤϟ ΎϬϋΎϓΪΑ ΓέϮϬθϣ ϲϫ (*)
10. Use of the Improper Grammatical Category (J)
This type of error, as we will see below, does occur fairly commonly with AMT
systems. It confirms what we have concluded in previous work (Guessoum &
Zantout, 2001) and (Guessoum & Zantout, 2005) that the AMT systems we have
evaluated follow an improved form of direct MT, although some of them claim
that they use transfer-based MT. In the first example below, the AMT system we
have evaluated, translates “forty-three” textually as ϥϮόΑέ΃ϭ ΔΛϼΛ (șalAșaƫNJ wa
ÂrbaȢwn) instead of producing the correct form ϦϴόΑέϷ΍ ϭ ΔΜϟΎΜϟ΍ ϲϓ (fy AlșAlișaƫi
wa AlÂrbaȢyn, in the forty third (year)).
… and at forty-three, somehow ageless. ˬΎϫήϤϋ Ϧϣ ϥϮόΑέ΃ ϭ ΔΛϼΛ ϲϓ ϭ ... (*)
˷θ ϟ ΍ ΔϤ΋΍Ω Ύϣ ΔϘϳήτΑΏΎΒ.
She asked the question expecting an af-
firmative.
. ΔϘϓ΍Ϯϣ ήψΘϨΗ ϝ΍Ά˷δϟ΍ Ζϟ΄γ (*)
The above features are those we single out to be examined when evaluating the Arabic morphological generation module of an AMT system. The approach adopted in our work is presented next.
15.4 Analysis of Arabic Morphological Generation Features
in an Arabic MT System
In order to study the impact of Arabic morphological generation features on the
quality of MT to Arabic in an AMTS, we collected, in Phase 1, English sentences by looking up 1056 English words online using the Cambridge Dictionaries site at http://dictionary.cambridge.org. Out of the 1056 words, 781 were found to
have sample sentences in the online dictionary. Out of these, 756 could be trans-
lated to Arabic using the web-based AMTS Ajeeb (http://www.ajeeb.com). These
Arabic sentences have then gone through Phase 2. In Phase 2, we analyzed all the
pairs of English and Arabic sentences looking for the various types of morpho-
logical features that we could single out as important in AMT. These are the vari-
ous types of features that were presented in Section 15.3. Having a classification
of the morphological information that is relevant to AMT, we needed to see how
frequent the types are and how much they affected the quality of the output of a
given AMTS.
One aspect we have mentioned earlier is that the errors found in the translation
may be due to the morphological generator, syntactic parser, or any other module
of the AMT system. However, what is relevant to our work is that whatever the
source of the error, it is reflected at the morphological generation level as ex-
plained in the previous section. As we wanted to find a correlation between the
type of the error and the quality of the translation, we had to be very careful in our
evaluation approach. In fact, we have followed a number of steps.
First, we kept only the sentences or sentence chunks which contained at least
one error related to the morphological features mentioned in Section 15.3. Errors
like wrong word order have not been considered as they are purely syntactic (and
therefore most probably not related to the morphological generator).
In the second step, we discarded all the sentences that contained more than two
errors of the morphological types of interest. This is to avoid confusion as to which type affects the quality of the translation most, and by how much.
We ended up with 177 sentences that contained one or two errors related to the cate-
gorized morphological features. We then tagged each sentence with A, B, …, or J,
where A stands for the feature type “Definite/Indefinite”, B for “Case Ending”, etc.,
up to J for “Use of Improper Grammatical Category”, as defined in Section 15.3.
As some of the sentences may have two errors, not just one, we decided to assign only one tag/letter, considering the error which most affects the sentence and, for simplicity, (mentally) correcting the other one.7 As a result, each sentence received exactly one tag.

7 Note that we could refine our evaluation by considering combinations of errors as further affecting the meaning/intelligibility of the target sentence.
At this point we were ready for evaluating the selected pairs of sentences for
adequacy and correctness of the machine translation. This has been done by hu-
man experts who were asked to assign a value between 0 and 5, where 0 means
completely unacceptable and 5 perfectly clear at first reading while being faithful
to the source sentence and sounding correct.
Once a translation quality value was assigned to each tagged sentence, we
computed statistics giving the number of sentences afflicted with each type of er-
ror (having been assigned a specific tag) as well as the average quality measure
(value between 0 and 5, inclusive) for each one of these types. This measure tells
us how much an error for that morphological feature affects the quality of the
translation. The closer the value to 0 for a particular type of morphological infor-
mation, the more serious an impact the type has on the quality of the translation.
Obviously, the closer the value to 5, the less impact the type has on the quality of
the translation.
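The statistics just described amount to a count, a percentage, and a mean score per tag. A minimal sketch of this computation over (tag, quality) pairs, with invented sample data standing in for the 177 evaluated sentences, could look as follows.

from collections import defaultdict

def error_statistics(tagged_scores):
    """tagged_scores: one (tag, quality) pair per sentence, where tag is
    one of the letters A..J and quality is the expert score in [0, 5]."""
    counts, totals = defaultdict(int), defaultdict(float)
    for tag, quality in tagged_scores:
        counts[tag] += 1
        totals[tag] += quality
    n = len(tagged_scores)
    return {tag: {"cases": counts[tag],
                  "percent": round(100.0 * counts[tag] / n, 2),
                  "average": round(totals[tag] / counts[tag], 2)}
            for tag in counts}

# Invented sample, not our actual data:
sample = [("F", 3), ("F", 2), ("C", 1), ("A", 4), ("A", 5), ("J", 1)]
for tag, row in sorted(error_statistics(sample).items()):
    print(tag, row)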
15.5 Results and Data Analysis
Table 15.1 shows the results of the analysis of the 177 pairs of (tagged) sentences translated using the AMT system mentioned in Section 15.4.

Table 15.1. Quantitative breakdown of morphological errors affecting AMT system quality

Type of Error        A      B      C     D     E     F      G     H     I      J
Number of Cases      20     22     7     9     8     36     8     9     28     30
% of Cases           11.3   12.43  3.9   5     4.5   20.34  4.5   5     15.82  16.95
Average (out of 5)   4.25   4.41   1     2.56  2.5   2.9    3.12  2.22  2.61   1.43

The frequencies of the sentences of types A, B, …, J are given in the second row. We clearly see that the largest number of errors concerns the handling of pronouns (20.34% of the cases), followed by the use of an improper grammatical category (16.95% of the cases), and then by inadequate prepositions (15.82% of the cases). On the other hand, the least frequent error type is the handling of the imperative mood (3.9% of the cases).
The frequencies just presented are not meaningful enough on their own if we do
not know how serious each of the error types is and how much it affects the qual-
ity of the translation. This is what the last row of Table 15.1 tells us. In particular,
it shows that the type of morphological error which most affects the quality of the
translation (in the case of the AMTS under evaluation) is the handling of the im-
perative mood. The results tell us that despite the fact that this error occurs in only
3.9% of the cases, whenever it occurs, it drastically affects the meaning of the sen-
tence with an average score of 1 out of 5. The rest of the table can be read in the
same way. In particular, only two types of errors are "mild" enough to produce
sentences that are still reasonably adequately translated, with a quality score of
more than 4 out of 5. These types of errors are the incorrect handling of Defi-
nite/Indefinite nouns and the incorrect case endings. Both of these errors are fairly
frequent (>11% of the cases for each one) but do not seriously affect the meaning
and correctness of the target sentence.
Table 15.1 is concise enough and, in our opinion, quite useful for a better
evaluation of an AMT system and, hopefully, for its improvement. Indeed, the averages we have computed also tell us how much the quality of a sentence would improve if that particular type of morphological information error were corrected. For instance, correcting errors in the handling of
the imperative mood should make the sentence noticeably more comprehensible.
15.6 Conclusion and Recommendations
In this chapter, the importance of Morphological Generation to the clarity of the
output of an MT system was emphasized. The complexity of morphology in Ara-
bic was presented through the description of some of the rules that govern Arabic
morphology. An analysis of the common types of errors related to Arabic morpho-
logical generation was then made and several important types of errors were de-
tailed. An existing commercial AMTS was then evaluated to determine which
type(s) of error affected the translation to Arabic using that AMTS. The approach
used in this chapter for evaluation identified several types of errors as affecting the
output of the AMTS most drastically.
It is expected that the developers of the AMTS under evaluation would benefit
from this evaluation by concentrating their research on treating the most common
errors and those which affect the output of the AMTS most drastically. The cate-
gories of errors that were identified in this chapter can also help developers of
AMT systems look for such errors in their output and treat them inside the AMTS.
This will lead to a better AMT system that will produce outputs of better quality.
The evaluation of one AMT system in this contribution has revealed important
information about output errors and their types. It is recommended that other
AMT systems be evaluated in the same manner. This will help determine whether the types of errors identified in this chapter are indeed general across AMT systems, and are therefore general features of Arabic that deserve particular attention in machine translation to Arabic.
As we happen to have collected a much larger corpus of pairs of translated sen-
tences, we intend to do the evaluation for a much larger part of this corpus so as to
reach statistically more conclusive results. Research could also be done on how to
automate the evaluation above by using language tools that would be able to ana-
lyze the Arabic sentences and identify the types of errors automatically. For ex-
ample, using an Arabic morphological analyzer, the output words could be auto-
matically analyzed into a set of roots and associated morphological information.
Then, a module could be developed that would check the types of errors by ana-
lyzing the information for all words in a sentence. Automating parts of the evalua-
tion would allow the treatment of large corpora in shorter times.
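As a rough indication of what one such automated check might look like, the fragment below sketches a gender-concordance test (simplified to adjacent noun-adjective pairs) over the output of a morphological analyzer. The analyze() interface and its toy lexicon are assumptions made for this illustration only; they do not correspond to any particular Arabic analyzer.

# Sketch of one automated check from the pipeline proposed above:
# gender concordance between a noun and a following adjective.

def analyze(word):
    """Stand-in for a real Arabic morphological analyzer: it returns a
    (root, part-of-speech, features) triple from a hard-wired toy lexicon."""
    toy_lexicon = {
        "Hujrap":  ("Hjr", "noun", {"gender": "fem",  "number": "sg"}),
        "SaGyr":   ("SGr", "adj",  {"gender": "masc", "number": "sg"}),
        "SaGyrap": ("SGr", "adj",  {"gender": "fem",  "number": "sg"}),
    }
    return toy_lexicon[word]

def gender_concordance_errors(sentence):
    """Flag adjacent noun-adjective pairs whose gender features differ."""
    analyses = [analyze(w) for w in sentence]
    errors = []
    for (r1, pos1, f1), (r2, pos2, f2) in zip(analyses, analyses[1:]):
        if pos1 == "noun" and pos2 == "adj" and f1["gender"] != f2["gender"]:
            errors.append((r1, r2, "gender mismatch"))
    return errors

print(gender_concordance_errors(["Hujrap", "SaGyr"]))    # flags the error
print(gender_concordance_errors(["Hujrap", "SaGyrap"]))  # no errors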
Acknowledgements
We would like to thank M. A. Hussein (BSc student at the University of Sharjah)
for having collected the sample English sentences from the web site
http://dictionary.cambridge.org as well as their translations to Arabic using the
web-based Machine Translation system Ajeeb (http://www.ajeeb.com).
References
Arnold, D.J., Sadler, L.G. & Humphreys, R.L. (1993). Evaluation: An Assessment. Machine Translation, Special Issue on Evaluation, 8(1–2), 1–24.
Chaumier, J., Mallen, M.C. & Van Slype, G. (1977). Evaluation du Système de Traduction Automatique SYSTRAN; Evaluation de la Qualité de la Traduction. CEC Report Nr. 4. Luxembourg.
Dahdah, A. (1995). ΔϴΑήόϟ΍ ϝΎόϓϷ΍ ϒϳήμΗ ϢΠόϣ [Dictionary of the Conjugation of Arabic Verbs]. Beirut, Lebanon: Librairie du Liban.
ECL (Equipe Consortium Limited) (1996). Survey of Machine Translation: Products and Services. Summary of a report to the European Commission. Retrieved from http://www2.echo.lu/langeng/reps/mtsurvey/mtsurvey.html
Fowzan, M., Al-Harbi, S., Kazdar, S. & Al-Qahtani, H. (2000). ϝΎόϓϸϟ ϲϓήμϟ΍ ϞϠΤϤϟ΍ [Morphological Analysis of Verbs]. B.Sc. project report, King Saoud University, College of Computer and Information Sciences, Computer Sciences Department, Riyadh, Saudi Arabia.
Guessoum, A. & Zantout, R. (2001). A Methodology for a Semi-Automatic Evaluation of the Language Coverage of Machine Translation System Lexicons. Machine Translation, 16(2), 127–149.
Guessoum, A. & Zantout, R. (2005). A Methodology for Evaluating Arabic Machine Translation Systems. Machine Translation, 18(4), 299–335.
Hamalawy, A. (1996). ϑήμϟ΍ Ϧϓ ϲϓ ϑήόϟ΍ ΍άη [The Art of Morphology]. Beirut, Lebanon: ΔϴϓΎϘΜϟ΍ ΐΘϜϟ΍ ΔδγΆϣ.
Hedberg, S. (1995). Machine Translation Comes of Age. Computer Select, September.
Nyberg, E.H. III & Mitamura, T. (1992). The KANT System: Fast, Accurate, High-Quality Translation in Practical Domains. In Proceedings of the 14th International Conference on Computational Linguistics (COLING-92), pp. 1069–1073.
Nyberg, E.H. III, Mitamura, T. & Carbonell, J.G. (1994). Evaluation Metrics for Knowledge-Based Machine Translation. In Proceedings of COLING-94.
Oren, T. (2004, December). Machine Translation and the Global Blogosphere. Retrieved from http://windsofchange.net
Rajhi, A. (1979). ϲϓήμϟ΍ ϖϴΒτΘϟ΍ [Applied Morphology]. Lebanon: ήθϨϟ΍ ϭ ΔϋΎΒτϠϟ ΔϴΑήόϟ΍ ΔπϬϨϟ΍ έ΍Ω.
Van der Meer, J. (2003). At Last Translation Automation Becomes a Reality: An Anthology of the Translation Market. In Proceedings of the International Workshop of the European Association for Machine Translation/Controlled Language Applications Workshop (pp. 180–184), Dublin, Ireland.
Van Slype, G. (1979). Critical Study of Methods for Evaluating the Quality of Machine Translation (Final Report). Prepared for the Commission of the European Communities, Directorate-General for Scientific and Technical Information Management (DG XIII), BR 19142, Brussels.
White, J., O’Connell, T. & O’Mara, F. (1994). Machine Translation Program: 3Q94 Evaluation. In Proceedings of the November 1994 Meeting of the Advanced Research Projects Agency. Retrieved from http://ursula.georgetown.edu/mt_web/3Q94FR.htm
Index
Ablaut, 48, 53
Al-Stem, 225, 247–48, 257, 258
Alif maqsura (Âlif maqSuwrah), 21, 30,
31, 167
Alignment, 193, 195, 254
Allomorphy, 102, 103, 167, 170
ALMORGEANA, 269, 271–2, 274,
276–80
Alpnet, 36, 248–252, 255–59
Apophony, 92, 94
APT, 163
Arabic
character set, 24–28
dialects, 37–38
Egyptian, 30, 31, 37
Levantine, 37, 272
Modern Standard, 15–16, 19, 24,
34–35, 161–77
pronunciation, 16, 18–20
script, 15ff
transliteration, 15ff
Arabic Treebank (ATB), 10, 35, 39,
172, 203, 268, 270, 279
ASCII, 16, 164, 191
ASMO 449 code page, 25
AutoMorphology, 248
Base phrase chunking, 160, 175–177
Bayes rule, 149
Binyan, 5, 54–60
Black box evaluation, 288, 293
BLARK, 12–13
Broken plural, 102–103, 105, 107–108,
110, 185, 266, 276
Chunking, 175–176
ChunkLink, 176
Clitic tokenization, 165–66, 171, 177
Coda, 46ff
Common expressions, 295
Component-based evaluation, 288
Concordance, 295, 296
Consonant, 5ff, 18ff, 49ff, 68, 76, 93
Constructed words, 297
Co-occurrence analysis, 224
Cross-language retrieval, 222, 230–32
Cross-lingual, 11, 222, 233
DATR, 47, 48, 75
Diacritics, 33, 228
Dialects, 37
DIINAR, 8, 116
Diphthongs, 21, 46
Electronic dictionary, 250
Enclitics, 112, 165, 167
Finite state transducer, 3ff, 183, 247,
248, 272
Fisher kernel, 188–89
F-measure, see F-score
F-score, 148, 210
Feminine marker, 170
Gloss, 18
Hamza, 16, 24, 29
Hebrew, 145, 146, 150–52
Hidden Markov Model, 184, 185
303
Idafa (AiDAfah), 21
Imperative, 70, 74, 134, 294
Imperfective, 54, 70–1, 101, 134
Information geometry, 188–89
Information retrieval, 11, 12, 221ff,
249ff
Inheritance, 53ff
Interdigitation, 5, 36
IOB coding, 165, 175
IR, 3, 4, 11, 221, 222, 224–27, 238–39
Kashida (kašiydah), 18
Lattice, 10, 208
Lemmatization, 120, 164, 167, 170, 173
Lemur toolkit, 229
Letter normalization strategy, 255
Lexeme, 7, 12, 50, 90, 94, 111, 120,
144, 268, 276
Lexeme-based morphology, 120
Lexicon
Alpnet, 36, 248ff
Xerox, 36
Light10 stemmer, 225, 230, 231
Light stemming, 224–229
Linguistic Data Consortium (LDC), 10,
16, 221, 233, 251
Local context analysis, 228
Machine learning, 9, 10, 121, 144, 146,
160, 163, 202, 215, 226, 279
Machine translation, 263ff
quality, 269, 287, 298
statistical, 263, 268ff
symbolic, 263, 270ff
MADA, 163
MAGEAD, 272
Measure, verbal, 68, 70, 77
Memory-based learning, 202, 203, 206
Moon letters, 21
MORPHE, 120
Morpheme
affixational, 264–68
function, 264
inflectional, 264, 266, 268
templatic, 160, 264–66, 271
type, 172, 206
Morphological analysis, 10, 24
generation, 272
representations, 264, 266, 268–71
Morphological interoperability,
279–280
Morphology
autosegmental, 5
lexeme-based, 90, 94, 120, 271
memory-based, 205, 215
root-and-pattern, 116, 144
syllable-based, 4, 46
two-level, 7, 29, 36, 123, 182
N-grams, 223, 225, 249
Non-determinism, 184
Normalization, 23, 255, 268
Nunation, 20
Onset, 46
Orthography, 29, 31, 34, 36, 144, 222
variation, 29–32
Overgeneration error, 277–78
Part-of-speech
tag set, 165
tagging, 160, 172–174, 211, 213
Pattern, 3ff, 67ff
PC-KIMMO, 248
Peak, 46
Perfective, 68, 70, 94, 101
Precision, 146, 209
Prefixes and suffixes, list of, 162,
233, 247
Prefix-suffix-template combination,
248, 253
Proclitics, 124, 161, 170
Pronoun
enclitics, 168, 235
resolution, 292, 296
Proper nouns, 172, 211
Radicals, 145, 148, 151, 155
Recall, 77, 117, 210, 259
Referral, 7, 48, 56, 94, 99, 107
Rhyme, 46, 52
Root, 5, 51, 60, 70, 96, 119
Root-and-pattern morphology, 5, 67ff
Run-on words, 33
Sebawai, 226, 247, 249, 252, 254–259
Segmental analysis, 5
Segmentation, 10, 11, 127, 162, 163,
167, 169, 203, 207, 267
Semi-supervised learning, 192–93
Semitic languages, 144, 145
Shadda (šadah), 18
SNoW, 147, 151
Stem, 48ff
Stemming
light, 229
statistical, 223–24
strong, 224
Stochastic transducer, 10, 183, 184, 192
Sun letters, 21
Support Vector Machines, 10, 160, 163,
215, 226
Syncretism, 75–77
SYSTRAN, 8, 118, 124
ta marbuta (tA marbuwTah), 21, 30,
170, 265
Tatweel (taTwiyl), 18
Templates, 246, 248, 255, 257
Tier, 5, 77
TiMBL, 203
TOKAN, 279, 280
Tokenization, 4, 32–35, 160, 162, 163,
165, 168
Transcription, 16, 20, 37, 191, 192
Transliteration, 15–21
scheme, 17–18
TREC conferences, 226–228
Two-level morphology, 7, 36, 39, 123,
182, 248
Undergeneration error, 277, 279
Unicode, 24, 28, 30
Unknown words, 205–07, 210, 212,
213, 214, 215
Unsupervised learning, 121, 181–198
Vocabulary mismatch problem, 222–223
Vocalization, 5, 26, 39, 172, 205
Vowel, 5, 6, 18, 26, 48ff
Windowing, 27, 28, 30, 164–165,
206–207
Word formatives grammar, 117, 126,
130–32
Xerox
two-level morphology system, 36,
39, 248
ZAD
corpus, 251
list, 251, 255
... Moreover, BPs constitute 10% of any Arabic text [25]. Several works in Arabic NLP have been proposed to identify BP in Arabic text [25]- [27]. However, previous Arabic taggers do not identify BP as independent tag. ...
... Most of the time, BPs are tagged as singular noun which leads to lose a lot of information such as Mansour et al [19], Diab et al [20] and Habash et al [21]. The main word formation process in Arabic languages is inherently non-concatenative; the BP is the best example of this non-concatenative morphology [27]. We can measure the performance of our algorithms on handling non-concatenative unknown words by measuring its performance on handling unknown words which are BP. ...
Article
Full-text available
Part Of Speech (POS) tagger is an essential preprocessing step in many natural language applications. In this paper, we investigate the best configuration of trigram Hidden Markov Model (HMM) Arabic POS tagger when small tagged corpus is available. With small training data, unknown word POS guessing is the main problem. This problem becomes more serious in languages which have huge size of vocabulary and rich and complex morphology like Arabic. In order to handle this problem in Arabic POS tagger, we have studied the effect of integrating a lexicon based morphological analyzer to improve the performance of the tagger. Moreover, in this work, several lexical models have been empirically defined, implemented and evaluated. These models are based essentially on the internal structure and the formation process of Arabic words. Furthermore, several combinations of these models have been presented. The POS tagger has been trained with a training corpus of 29300 words and it uses a tagset of 24 different POS tags. Our system achieves state-of-the-art overall accuracy in Arabic part of speech tagging and outperforms other Arabic taggers in unknown word POS tagging accuracy.
... For templatic morphology, there is work on designing large-scale computational resources for industrial use (Kiraz 2001, Beesley and Karttunen 2003, Soudi et al. 2007, Farghaly and Shaalan 2009, Attia et al. 2011). There is also progress on learning Semitic morphology (Daya et al. 2007, Clark 2007, Fullwood and O'Donnell 2013, Dawdy-Hesterberg and Pierrehumbert 2014). ...
Article
This paper discusses the generative capacity required for Semitic root-and-pattern morphology. Finite-state methods effectively compute concatenative morpho-phonology, and can be restricted to Strictly Local functions. We extend these methods to consider non-concatenative morphology. We show that over such multi-input functions, Strict Locality is necessary and sufficient. We discuss some consequences of this generalization for linguistic theories of the morphological template.
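As a schematic gloss of the key notion (my rendering, not the paper's formal definition): a function over n input strings is strictly k-local if each output position depends only on a bounded window of each input tape,

\[ f(x_1,\dots,x_n)[i] \;=\; g\big(x_1[i-k+1..i],\ \dots,\ x_n[i-k+1..i]\big) \]

so computing position i never requires unbounded memory of earlier material; for root-and-pattern morphology the input tapes would carry, e.g., the root and the template.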
... For templatic morphology, there is work on designing large-scale computational resources for industrial use (Kiraz 2001, Beesley and Karttunen 2003, Soudi et al. 2007, Farghaly and Shaalan 2009, Attia et al. 2011). There is also progress on learning Semitic morphology (Daya et al. 2007, Clark 2007, Fullwood and O'Donnell 2013, Dawdy-Hesterberg and Pierrehumbert 2014). ...
Conference Paper
Full-text available
A common goal in mathematical phonology and morphology is determining necessary and sufficient conditions on the computational power needed for a given linguistic process. Crosslinguistically, most morphological processes are local across different domains (Embick 2010, Marantz 2013), even in Semitic (Arad 2003, Kastner 2016), e.g. concatenative morphology or suffixation (Chandlee 2017). However, Semitic languages use non-concatenative templatic morphology, or root-and-pattern morphology (RPM), whose generative power is superficially unclear (McCarthy 1981, 1993, Hudson 1986, Hoberman 1988, Yip 1988, McCarthy and Prince 1990a,b). We show that these templates are nevertheless locally computed, with the locality window operating over tiers of multiple inputs. This locality is computationally modeled with multi-tape finite-state transducers.
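A minimal Python sketch of templatic interdigitation under this multi-input view (an illustration, not the authors' multi-tape transducer): each output segment is read off the current slot of the template tape and, for consonant slots, the next symbol of the root tape.

def interdigitate(root, template, vocalism):
    # Fill C slots with root consonants and V slots with pattern vowels,
    # consuming each tape strictly left to right.
    r, v, out = iter(root), iter(vocalism), []
    for slot in template:
        if slot == "C":
            out.append(next(r))
        elif slot == "V":
            out.append(next(v))
        else:
            out.append(slot)  # literal material in the template
    return "".join(out)

print(interdigitate("ktb", "CVCVC", "aa"))  # -> "katab" (transliterated)

The per-slot decision uses only the current symbols of the tapes, which is what makes the computation local in the sense sketched above.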
... Ces schèmes transmettent les caractéristiques morphologiques et sémantiques aux mots dérivés. Au cours de la procédure de dérivation, des changements peuvent se produire aux lettres de la racine, telles que l'assimilation, l'élision et la gémination (Clark, 2007 ...
Thesis
Aujourd’hui, l’apprentissage des langues assisté par ordinateur est de plus en plus répandu, dans les institutions publiques et privées. Cependant, il est encore loin des attentes des enseignants et des apprenants et ne répond pas encore à leurs besoins. Les systèmes d’apprentissage des langues assisté par ordinateur (ALAO) actuels sont plutôt des environnements de tests des connaissances de l'apprenant et ressemblent plus à un support d’apprentissage traditionnel. De plus, le feedback proposé par ces systèmes reste basique et ne peut pas être adapté pour un apprentissage autonome, car, il devrait être en mesure de diagnostiquer les problèmes d'un apprenant avec l’orthographe, la grammaire, la conjugaison,etc., puis générer intelligemment un feedback adéquat selon la situation de l’apprentissage.Cette recherche expose les capacités des outils TAL à apporter des solutions aux limitations des systèmes d’ALAO dans le but d’élaborer un système d’ALAO complet et autonome. Nous présentons une architecture complète d'un système multilingue pour l’apprentissage des langues assisté par ordinateur destiné aux apprenants des langues étrangères, français et arabe. Ce système pourrait être utilisé pour l’apprentissage des langues par les apprenants de la langue en tant que langue seconde ou étrangère.La première partie de nos travaux porte sur l’adaptation des outils et des ressources issues du TAL pour qu’ils soient utilisés dans un environnement d’apprentissage des langues assisté par ordinateur. Parmi ces outils et ressources, il y a les analyseurs morphologiques pour l’arabe et le français, corpus, dictionnaires électroniques, etc. Ensuite, dans la deuxième section, nous présentons la reconnaissance de l’écriture manuscrite en ligne. Dans cette optique, nous exposons une approche statistique basée sur le réseau de neurones, puis, nous présentons la conception de l’architecture du système de reconnaissance ainsi que l’implémentation de l’algorithme de la reconnaissance.La deuxième partie de notre exposé porte sur l’élaboration, l’intégration et l’exploitation des outils TAL utilisés (analyseurs morphologiques, système de reconnaissance de l’écriture, dictionnaires, etc.) dans notre système d’apprentissage des langues assisté par ordinateur. Nous y présentons aussi les modules ajoutés à la plate-forme pour avoir une architecture complète d’un système d’ALAO. Parmi ces modules, figure le générateur de feedback qui permet de corriger les fautes des apprenants et générer un feedback pédagogique pertinent qui permet à l’apprenant de cerner et ses fautes. Enfin, nous décrivons l’outil de génération automatique des activités pédagogiques variées et automatisées.
... Aside from the early work of De Roeck and Al-Fares [2000], who seek to find clusters of morphologically related words, identifying the root of a word is the problem that all approaches reviewed here attempt to solve. Some of the proposed algorithms [Bati 2002, Rodrigues and Ćavar 2007, 2005, Clark 2007, Xanthos 2008] aim to leverage this result and provide paradigmatic accounts of root-and-pattern morphology. ...
Article
Full-text available
This article reviews research on the unsupervised learning of morphology, that is, the induction of morphological knowledge with no prior knowledge of the language beyond the training texts. This area has seen considerable activity from the mid-1990s to the present. It is of particular interest to linguists because it provides a good example of a domain in which complex structures must be induced by the language learner, and successes in this area have all relied on quantitative models that in various ways focus on model complexity and on goodness of fit to the data.
... Supervised, unsupervised and semi-supervised methods have been used for Arabic morphology (Clark, 2007). The emergence of empirical methods in Arabic morphology is due to the appearance of Arabic corpora (e.g. ...
Article
Full-text available
In this paper we describe a new version of the Arabic morphological analyzer MORPH2 (a morphological analyzer for Arabic texts). The aim of this work is to address the problems observed when evaluating the old version of MORPH2. We then present the morphological analysis method used to build the analyzer, focusing on the new step (vocalization and validation) added to our method. This step allows the analyzer to provide fully vocalized output together with a validation process that filters out noisy solutions. We also present the structure of the lexicon used by our method: an XML lexicon that makes morphological analysis more efficient and is organized so that it can be used not only for morphological analysis but also for other levels of linguistic analysis. The new version of MORPH2 has been evaluated on an Arabic corpus; the results in terms of recall and precision are 89.77% and 82.51%, respectively. We noticed that the major causes of failure are the non-detection of relation nouns and primitive nouns.
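For reference, assuming the standard definitions of these measures over the analyzer's outputs (TP = correct analyses produced, FP = spurious analyses, FN = missed analyses):

\[ \text{recall} = \frac{TP}{TP+FN}, \qquad \text{precision} = \frac{TP}{TP+FP} \]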
Article
There are few studies on the influence of native Arabic speakers in enhancing understanding among subjects residing in a non-Arabic-speaking country when the deductive method is used to mediate teaching. The study aimed at evaluating the influence of the native speaker on the effectiveness of the "Hud-Hud" deductive teaching style among non-Arabic speakers in Malaysia. The study was conducted on a selected group of urbanized Malaysian subjects (n=30), who were dichotomized into a control group (n=15) and a test group (n=15). All subjects underwent a 60-hour elementary Arabic teaching programme using the "Hud-Hud" deductive method taught by a local (Malay) speaker. Two native Arabic speakers provided additional teaching guidance to the test group before a series of test performances was conducted. The results showed that the younger subjects (age < 40 years) had a significantly higher score than the older age group (mean 2.04±0.404 vs. 1.59±0.97; p<0.05). The test group had significantly higher rubric scores than the control group (mean 0.62±0.16 vs. 0.59±0.50; p<0.05). Moreover, significantly higher rubric scores were achieved by subjects with previous exposure to Arabic teaching (mean 2.04±0.404 vs. 1.59±0.97; p<0.05). The native speaker is a potential predictive marker of the effectiveness of deductive "Hud-Hud" teaching for foreign learners in a non-Arabic-speaking country.
Article
Reports on the influence of native Arabic speakers in enhancing learning skills among Malaysian subjects are scarce, so we sought to determine the native speaker's influence on learning skills among selected urbanized Malaysians. The study was conducted on a randomly selected group of urbanized Malaysian individuals (n=30), divided into two groups: a control group (n=15) and a test group (n=15). All subjects underwent 60 hours of elementary Arabic teaching sessions using the deductive teaching method taught by a local speaker, and two native Arabic speakers were appointed to supervise the test group. The results showed that younger subjects (under 40 years) had a significantly higher test score than older subjects (mean 2.04±0.404 vs. 1.59±0.97; p<0.05), and that the test group performed better on the test than the control group (mean 0.62±0.16 vs. 0.59±0.50; p<0.05). The study also showed better performance among subjects previously exposed to Arabic instruction (mean 2.04±0.404 vs. 1.59±0.97; p<0.05). Native speakers were important predictive factors in enhancing the effectiveness of Arabic teaching among non-Arabic speakers.
Article
Full-text available
Maltese is a morphologically rich language with a hybrid morphological system featuring both concatenative and non-concatenative processes. This paper analyses the impact of this hybridity on the performance of machine learning techniques for morphological labelling and clustering. In particular, we analyse a dataset of morphologically related word clusters to evaluate the difference in results between concatenative and non-concatenative clusters. We also describe research carried out in morphological labelling, with a particular focus on the verb category. Two evaluations were carried out: one using an unseen dataset, and another using a manually labelled gold standard dataset. The gold standard dataset was split into concatenative and non-concatenative subsets to analyse the difference in results between the two morphological systems.
Article
Full-text available
This study reports the results of using minimum description length (MDL) analysis to model the unsupervised learning of the morphological segmentation of European languages, using corpora ranging in size from 5,000 to 500,000 words. We develop a set of heuristics that rapidly construct a probabilistic morphological grammar, and use MDL as our primary tool to determine whether the modifications proposed by the heuristics should be adopted. The resulting grammar matches well the analysis that would be produced by a human morphologist. In the final section, we discuss the relationship of this style of MDL grammatical analysis to the notion of an evaluation metric in early generative grammar.
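Schematically, the MDL criterion referred to here selects the grammar that minimizes the total description length (my notation, not the paper's):

\[ \hat{G} = \arg\min_G \big[\, L(G) + L(D \mid G) \,\big] \]

where L(G) is the length in bits of the morphological grammar and L(D | G) is the length of the corpus encoded with that grammar; a heuristic's proposed modification is adopted only if it lowers this sum.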
Article
ABSTRACT : The contents and methods of medieval Jewish linguistics owed much to their Arabic model. In Muslim Spain, in the tenth and eleventh centuries, the Hebrew linguistic tradition went through a highly creative phase, during which its basic notions, e.g. the principle of the triliteral root, were established through a series of great polemics. Grammatical works dating from this period were often written in Arabic. In the thirteenth century, the centre of linguistic study shifted to Provence and Italy, where various works were translated into Hebrew. However, this geographical spread went together with a loss of creativity and a paralysing standstill, which we witness e.g. in David Qimhi's Mikhlol. From the eleventh century onward, several lexicographic studies appeared, some of them dealing with post-biblical Hebrew. Nevertheless, most grammarians considered that only pure biblical Hebrew deserved morphological analysis, and either rejected lexical comparison with Mishnaic usage and the cognate Arabic language, or applied such comparison with the utmost restriction.
Article
Morphological analysis, dealing with the analysis and generation of different types of word forms, plays an important role in applications such as spell checking, electronic dictionary interfacing and information retrieval systems. This paper discusses two-level morphological processing of Oriya. It proposes a model for designing a morphological analyzer for Oriya that can provide lexical, morphological and syntactic information for each lexical unit in the analyzed word form.
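A minimal Python sketch of the two-level idea (a toy illustration, not the paper's Oriya analyzer): lexical and surface forms are related symbol by symbol, with "0" marking deletion, and an analysis is accepted only if every aligned pair is licensed. The pair inventory below is hypothetical.

# Licensed lexical:surface correspondence pairs; '+' is a morpheme
# boundary that is deleted ('0') on the surface.
PAIRS = {("k", "k"), ("a", "a"), ("t", "t"), ("+", "0"), ("A", "a")}

def licensed(lexical, surface):
    # Both strings must be pre-aligned to equal length.
    return len(lexical) == len(surface) and all(
        (l, s) in PAIRS for l, s in zip(lexical, surface)
    )

print(licensed("kat+A", "kat0a"))  # -> True

A full two-level system adds contextual rules restricting where each pair may occur; here every pair is licensed unconditionally.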
Article
Résumé-Abstract Nous décrivons un ensemble,de problèmes cruciaux que nous avons rencontrés au niveau,de