Content uploaded by Saber Jahanpour
Author content
All content in this area was uploaded by Saber Jahanpour on Nov 01, 2015
Content may be subject to copyright.
Introduction to a New Farsi Stemmer
Alireza Mokhtaripour
Department of Electrical and Computer Engineering
Shahid Beheshti University
Tehran, Iran
a_mokhtaripour@yahoo.com
Saber Jahanpour
Department of Electrical and Computer Engineering
Shahid Beheshti University
Tehran, Iran
jahanpour.saber@gmail.com
ABSTRACT
In this poster, a new Farsi (also called Persian) stemmer that
works without a dictionary is introduced. Evaluation results show a
significant improvement in the performance (precision/recall) of an
Information Retrieval (IR) system that uses this stemmer.
Categories and Subject Descriptors
H.3.1 [Information Storage and Retrieval]: Content Analysis
and Indexing.
General Terms
Design, Languages, Performance.
Keywords
Farsi, Persian, information retrieval, stemmer, language.
1. INTRODUCTION
Farsi is an Indo-European language spoken and written
primarily in Iran, Tajikistan and parts of Afghanistan. The Farsi
alphabet contains 32 letters, and Farsi is written from right to left.
Some other languages, such as Arabic, Kurdish, and Urdu, are written in
the same script but have their own conventions. Farsi also
has its own characteristics, such as not writing accents (except in
special cases) and allowing several written forms of the same word.
One popular technique for improving the performance of
IR systems is to give searchers ways of finding
morphological variants of search terms. This can be done by a
stemmer. A stemmer recovers the stem, or base form, of a word
either by stripping off affixes or by a static lookup in a word list
(such as a dictionary) for irregular forms. For example, consider the two
words پسری /pesri:/ ("a boy") and پسرها /pesrhɑ/ ("boys"). The
former is the indefinite form and the latter is the plural form of پسر
/pesr/ ("boy"), and both can be stemmed to the word "پسر".
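The affix-stripping half of this definition can be illustrated with a minimal Python sketch. It is not the stemmer described in this poster: the names `SUFFIXES` and `strip_suffix` are hypothetical, the suffix list holds only the two suffixes from the example above, and the length guard anticipates the 2-letter minimum given in Section 3.

```python
# Illustrative suffix list (an assumption): one plural marker and the
# indefinite marker from the example in the text.
SUFFIXES = ("ها", "ی")


def strip_suffix(word: str, min_len: int = 2) -> str:
    """Strip the first matching suffix, keeping at least min_len letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            return word[: -len(suffix)]
    return word


# Both surface forms of "boy" reduce to the same stem.
for form in ("پسر" + SUFFIXES[1], "پسر" + SUFFIXES[0]):
    print(form, "->", strip_suffix(form))
```

Indexing and querying with the stripped form is what lets a search for one variant retrieve documents that contain the other.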
In this poster we introduce a new Farsi stemmer that works
without a dictionary. Evaluation results show a significant
improvement in the performance of the IR system using this
stemmer. In addition to this introduction, this poster contains four
more sections. Section two introduces the advantages of this
stemmer. Section three is dedicated to the design of the stemmer.
Section four presents the evaluation results of the stemmer in terms of
precision/recall.
Copyright is held by the author/owner(s).
CIKM’06, November 5–11, 2006, Arlington, Virginia, USA.
ACM 1-59593-433-2/06/0011.
2. The Farsi Stemmer
Removing affixes is the main task of a stemmer, and it is the most
direct approach to designing one. Some existing Farsi
stemmers, such as [3], work in this manner: they look for affixes and
simply remove them. For many of them this is the whole idea. However,
Farsi has irregularities and exceptions, and
ignoring them reduces the performance of those stemmers
dramatically. We therefore studied the Farsi language carefully and
set three objectives for the stemmer. These
objectives make the stemmer more accurate and sophisticated:
• Looking for a vast variety of affixes
• Considering the changes in the original word after adding
affixes
• Taking care of loan words and their imported rules
The stemmer removes many affixes. However, removing some
affixes from a word produces a root whose meaning is far
from that of the original word. For example, if the suffix
"ستان" /stɑn/ (indicating a place) is removed from the word
پاکستان /pakestɑn/ (Pakistan, the country), the stem پاک /pak/
("clear") is left, which is far from the original word in meaning. The
affixes to remove must therefore be selected carefully to avoid this problem.
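The effect of suffix selection can be seen in a small Python sketch. The two suffix lists and the function name `strip_suffixes` are hypothetical and serve only to contrast a naive list with a curated one; the poster does not publish the stemmer's actual affix list.

```python
def strip_suffixes(word: str, suffixes, min_len: int = 2) -> str:
    """Strip the first suffix in `suffixes` that matches the end of `word`."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            return word[: -len(suffix)]
    return word


PLACE_SUFFIX = "ستان"   # the place suffix discussed in the text
PLURAL_SUFFIX = "ها"    # a plural suffix, kept in both lists

NAIVE_SUFFIXES = (PLACE_SUFFIX, PLURAL_SUFFIX)
CURATED_SUFFIXES = (PLURAL_SUFFIX,)  # meaning-changing suffix left out

word = "پاک" + PLACE_SUFFIX  # "Pakistan"

print(strip_suffixes(word, NAIVE_SUFFIXES))    # over-stems to the word for "clear"
print(strip_suffixes(word, CURATED_SUFFIXES))  # leaves the country name intact
```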
In Farsi there are some affix-sensitive words that react
to a few particular affixes. When one of those affixes is added to an
affix-sensitive word, the word changes. For example, adding the
plural suffix "ان" /ɑn/ to the affix-sensitive word دانا /dɑnɑ/
(savant) generates دانایان /dɑnɑjɑn/ (savants), which has the new
letter "ی" /j/ just before the suffix. These affixes may add, remove or even
replace one or more letters of the original word
when they attach to affix-sensitive words. If the stemmer
just removes the affixes and ignores these changes, it will
fail to match the root with its variants. In the previous
example, if the plural suffix "ان" is removed without any additional
processing, the word دانای /dɑnɑj/ remains,
which is not equal to the original word دانا.
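A hedged Python sketch of this repair step follows. The condition used here (drop a trailing "ی" when it follows "ا" after the plural suffix has been removed) is an illustrative simplification, not the rule set of the stemmer, and the names `strip_plural_an`, `ALEF`, `LINKING_YE` and `PLURAL_AN` are hypothetical.

```python
ALEF = "ا"
LINKING_YE = "ی"        # letter inserted between the word and the suffix
PLURAL_AN = ALEF + "ن"  # the plural suffix from the example


def strip_plural_an(word: str, min_len: int = 2) -> str:
    """Remove the plural suffix and undo the linking letter it introduced."""
    if not word.endswith(PLURAL_AN) or len(word) - len(PLURAL_AN) < min_len:
        return word
    stem = word[: -len(PLURAL_AN)]
    # Simplifying assumption: a final linking letter after alef was added
    # by the suffix, so it is removed together with the suffix.
    if stem.endswith(ALEF + LINKING_YE) and len(stem) - 1 >= min_len:
        stem = stem[:-1]
    return stem


singular = "دان" + ALEF                      # "savant"
plural = singular + LINKING_YE + PLURAL_AN   # "savants"
print(strip_plural_an(plural) == singular)   # the repaired stem matches the root
```

Without the second step the function would return the singular plus the linking letter, which is exactly the mismatch described above.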
Over time, some words have been imported into Farsi from
other languages such as Arabic, English and French. Arabic has had the
greatest influence, so some Arabic grammatical rules have been imported
as well and are applied to the imported (loan) Arabic words. Fortunately,
these rules are mostly used only for those Arabic words
(although they are sometimes applied to non-Arabic loan words
too). So, in addition to Farsi grammar, we studied the imported Arabic
grammatical rules and took them into account in the stemmer.
Instead of looking a word up in a dictionary, the stemmer tries to locate
the four discriminator letters "پ" /p/, "ژ" /ʒ/, "گ" /g/ and "چ" /tʃ/ in
the word to determine its original language. These letters are not part
of the Arabic alphabet, so a word that contains one of them cannot be a
native Arabic word. When the stemmer
determines that a given word is Arabic, it proceeds through some
Arabic stemming tasks.
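The discriminator-letter test can be sketched in a few lines of Python. The function name `may_be_arabic` is hypothetical. Note the asymmetry in the check: finding one of the four letters rules Arabic out, but their absence alone does not prove Arabic origin, and the poster does not state which further checks the stemmer applies.

```python
# The four letters used in Farsi writing that the Arabic alphabet lacks.
NON_ARABIC_LETTERS = frozenset("پژگچ")


def may_be_arabic(word: str) -> bool:
    """Return False when the word contains a letter that Arabic does not have."""
    return NON_ARABIC_LETTERS.isdisjoint(word)


print(may_be_arabic("پسر"))   # contains one of the four letters
print(may_be_arabic("کتاب"))  # contains none of them, so Arabic is not ruled out
```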
3. Design of the Farsi stemmer
The stemmer is completely rule-based. Each rule is
activated by an affix, and each rule results in one or more
actions. Stemming starts by removing noun
suffixes and verbal suffixes, and then moves on to removing
prefixes. In all phases the stemmer looks for the changes that have been
made to affix-sensitive words.
The stemmer has ten phases. Each phase is dedicated to one or
more particular grammatical rules. One general condition applies:
after a rule's action(s) have been carried out, the length of the resulting
root must not fall below a minimum, which we set to 2
letters. The ten phases of
the stemmer are:
1- Removing¹ "ی" /j/ (indefinite article / possessive suffix)
2- Removing auxiliary suffix "ند" /nd/
3- Removing possessive and auxiliary suffixes
4- Removing possessive suffixes: "ت" /t/ and "تان" /tɑn/
5- Removing plural suffixes
6- Removing comparative suffixes
7- Removing other suffixes
8- Removing "ن" /n/ (sign of infinitive)
9- Removing special end letters
10- Removing prefixes
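The phase structure and the minimum-length condition can be sketched as a small rule pipeline in Python. This is an illustration under stated assumptions, not the published rule set: only suffixes quoted in this poster are used, each rule's action is reduced to a pass-through named `keep`, and the names `PHASES`, `apply_phase` and `stem` are hypothetical. The real phases carry conditions that are not listed here, so this sketch will over-stem many words.

```python
from typing import Callable, List, Tuple

MIN_STEM_LENGTH = 2  # minimum root length given in the text

# A rule is (triggering suffix, action applied to the word without the suffix).
Rule = Tuple[str, Callable[[str], str]]


def keep(remainder: str) -> str:
    """Simplest possible action: accept the remainder unchanged."""
    return remainder


# Hypothetical subset of the ten phases, limited to suffixes quoted in the text.
PHASES: List[List[Rule]] = [
    [("ی", keep)],                 # phase 1
    [("ند", keep)],                # phase 2
    [("تان", keep), ("ت", keep)],  # phase 4, longer suffix first
    [("ها", keep), ("ان", keep)],  # phase 5
    [("ن", keep)],                 # phase 8
]


def apply_phase(word: str, rules: List[Rule]) -> str:
    """Fire the first rule whose result respects the minimum length."""
    for suffix, action in rules:
        if word.endswith(suffix):
            candidate = action(word[: -len(suffix)])
            if len(candidate) >= MIN_STEM_LENGTH:
                return candidate
    return word


def stem(word: str) -> str:
    """Run the phases in order, feeding each result to the next phase."""
    for rules in PHASES:
        word = apply_phase(word, rules)
    return word
```

Keeping each phase as data makes the order of application explicit, and a repair such as undoing a linking letter would slot in as a different action in place of `keep`.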
4. Evaluation
To evaluate the stemmer, a 250 MB collection containing
43,680 Farsi documents was used. These documents cover several
subjects, such as sports, economics, politics and history. For more
information about this collection, the reader is referred to [2].
The evaluation used 25 queries. The relevant
documents for each query were selected by a native Farsi speaker.
First, the system (a classic vector-based system) was run
without the stemmer in the indexer and the searcher. The queries
were fed to the system and its performance was
evaluated (Table 5). Then the system was restarted with the
stemmer, and it was evaluated again with the same queries (Table
6). Comparing the two tables, the system that used the
stemmer performed better: average interpolated precision rose by 0.151
(from 0.243 to 0.394), and average uninterpolated precision rose by
about 46% (from 0.120 to 0.176).
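The arithmetic behind these two figures can be checked from the averages in Tables 5 and 6. In the Python sketch below, the function name `improvement` is illustrative. With the rounded table values, 0.151 is the absolute gain in average interpolated precision, while the 46% figure is close to the relative gain in average uninterpolated precision.

```python
def improvement(before: float, after: float):
    """Return (absolute gain, relative gain) of `after` over `before`."""
    absolute = after - before
    return absolute, absolute / before


# Averages taken from Table 5 (no stemmer) and Table 6 (with stemmer).
averages = {
    "precision": (0.120, 0.176),
    "interpolated precision": (0.243, 0.394),
}

for label, (before, after) in averages.items():
    absolute, relative = improvement(before, after)
    print(f"{label}: {absolute:+.3f} absolute, {relative:.1%} relative")

# precision: +0.056 absolute, 46.7% relative
# interpolated precision: +0.151 absolute, 62.1% relative
```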
5. Future Work
The results of our stemming test indicate that the Farsi stemmer
improves retrieval. Our tests were done on a small collection, so
the effect of the stemmer on larger collections is not yet known.
The stemmer can be improved in several ways. For
example, a list of present tense roots and their
corresponding past tense roots would improve the retrieval of verbs.
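One way such a list could be used is sketched below in Python. The three verb pairs are common Farsi verbs chosen for illustration, and the mapping direction (past tense root to present tense root) and the names `PAST_TO_PRESENT` and `normalize_verb_root` are assumptions. None of this is part of the stemmer evaluated here.

```python
# Hypothetical lookup list: past tense root -> present tense root.
PAST_TO_PRESENT = {
    "رفت": "رو",   # "to go"
    "گفت": "گو",   # "to say"
    "دید": "بین",  # "to see"
}


def normalize_verb_root(root: str) -> str:
    """Map a past tense root to its present tense root; leave others as is."""
    return PAST_TO_PRESENT.get(root, root)


for past in PAST_TO_PRESENT:
    print(past, "->", normalize_verb_root(past))
```

Applying the same mapping at indexing time and at query time would let both tenses of a verb share one index term.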
¹ "Removing" is shorthand. Each phase may perform
other stemming tasks in addition to removing the affixes.
Table 5. Average precision/recall results using no stemmer
recall (%)   precision   interpolated precision
0            0.153       0.581
10           0.225       0.468
20           0.293       0.432
30           0.224       0.333
40           0.130       0.246
50           0.124       0.191
60           0.062       0.122
70           0.032       0.081
80           0.007       0.078
90           0.000       0.070
100          0.070       0.070
Average      0.120       0.243
Table 6. Average precision/recall results using stemmer
recall (%)   precision   interpolated precision
0            0.140       0.742
10           0.236       0.685
20           0.372       0.633
30           0.276       0.516
40           0.245       0.463
50           0.264       0.408
60           0.207       0.314
70           0.050       0.166
80           0.014       0.143
90           0.000       0.134
100          0.134       0.134
Average      0.176       0.394
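The poster does not spell out how the interpolated column is computed. The Python sketch below assumes the standard 11-point definition, in which the interpolated precision at a recall level is the highest precision observed at that recall level or any higher one. The function name `eleven_point_interpolated` and the toy ranking are illustrative. Per-query curves of this kind are then averaged over the queries, which is why an averaged interpolated value in the tables can exceed every averaged raw precision value.

```python
def eleven_point_interpolated(ranking, relevant):
    """Interpolated precision at recall 0.0, 0.1, ..., 1.0 for one query.

    ranking: document ids in ranked order; relevant: non-empty set of ids.
    """
    points = []  # (recall, precision) measured at each relevant document
    hits = 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    levels = [i / 10 for i in range(11)]
    # Highest precision at this recall level or beyond, 0.0 if never reached.
    return [
        max((p for r, p in points if r >= level), default=0.0)
        for level in levels
    ]


# Toy query: two relevant documents, found at ranks 1 and 3.
curve = eleven_point_interpolated(["d1", "d2", "d3", "d4"], {"d1", "d3"})
print([round(p, 3) for p in curve])
```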
6. REFERENCES
[1] Shariat, M. J. Simple Farsi Grammar (Second impression).
Asaatir, 2000, Iran.
[2] Darrudi, E., Hejazi, M. R., Oroumchian, F. Assessment of a
Modern Farsi Corpus. In Proceedings of the 2nd Workshop
on Information Technology & its Disciplines (WITID) 2004,
ITRC, Iran.
[3] Taghva, K., Beckley, R., Sadeh, M. A Stemming Algorithm
for the Farsi Language. In Proceedings of the International
Conference on Information Technology: Coding and
Computing (ITCC'05), Volume I, pp. 158-162.
[4] Samiei (Gilani), A. Writing and Editing (Third impression),
The Organization for Researching and Composing
University Textbooks in the Humanities (SAMT), 2001, Iran