Introduction to a new Farsi stemmer

Alireza Mokhtaripour
Department of Electrical and Computer Engineering
Shahid Beheshti University
Tehran, Iran
a_mokhtaripour@yahoo.com
Saber Jahanpour
Department of Electrical and Computer Engineering
Shahid Beheshti University
Tehran, Iran
jahanpour.saber@gmail.com
ABSTRACT
In this poster, a new Farsi (also called Persian) stemmer that
works without a dictionary is introduced. Evaluation results show
a significant improvement in the performance (precision/recall) of
an Information Retrieval (IR) system using this stemmer.
Categories and Subject Descriptors
H.3.1 [Information Storage and Retrieval]: Content Analysis
and Indexing.
General Terms
Design, Languages, Performance.
Keywords
Farsi, Persian, information retrieval, stemmer, language.
1. INTRODUCTION
Farsi is an Indo-European language spoken and written
primarily in Iran, Tajikistan and parts of Afghanistan. The Farsi
alphabet contains 32 letters, and Farsi is written from right to left.
Some other languages, such as Arabic, Kurdish and Urdu, use the
same form of script but have their own characteristics. Farsi also
has characteristics of its own, such as not using accents (except in
special cases) and polymorphism in writing.
One popular technique for improving the performance of IR
systems is to give searchers a way of finding morphological
variants of search terms. This can be done by a stemmer. A
stemmer recovers the stem, or base form, of a word either by
stripping off affixes or by a static lookup in a word list (such as a
dictionary) for irregular forms. For example, consider the two
words پسری /pesari/ ("a boy") and پسرها /pesarhâ/ ("boys"). The
former is the indefinite form and the latter is the plural form of
پسر /pesar/ ("boy"), and both can be stemmed to the word "پسر".
In this poster we introduce a new Farsi stemmer that works
without a dictionary. Evaluation results show a significant
improvement in the performance of the IR system using this
stemmer. In addition to this introduction, this poster contains four
more sections. Section two introduces the advantages of this
stemmer. Section three is dedicated to the design of the stemmer.
Section four shows the evaluation results of the stemmer in terms
of precision/recall. Section five outlines future work.
Copyright is held by the author/owner(s).
CIKM’06, November 5–11, 2006, Arlington, Virginia, USA.
ACM 1-59593-433-2/06/0011.
2. THE FARSI STEMMER
Removing affixes is the main task of a stemmer, and it is the first
solution that comes to mind when designing one. Some current
Farsi stemmers, such as [3], work in this manner: they look for
affixes and simply remove them. This is the whole idea behind
many of them. However, the Farsi language has problems and
exceptions which, if ignored, decrease the performance of those
stemmers dramatically. We therefore studied the Farsi language
carefully and set three objectives for the stemmer. These
objectives made the stemmer more accurate and sophisticated:
- Looking for a wide variety of affixes
- Considering the changes made to the original word when
  affixes are added
- Taking care of loan words and their imported rules
Many affixes are removed by the stemmer. However, removing
some affixes from a word generates a root whose meaning is far
from that of the original word. For example, if the suffix
"ستان" /stân/ (indicating a place) is removed from the word
پاکستان /pâkestân/ ("Pakistan", the country), the stem پاک /pâk/
("clean") is left, which is far from the original word in meaning.
The affixes must therefore be selected carefully to avoid this
problem.
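The effect of affix selection can be illustrated with a small sketch. The function name `strip_suffix_once` and both suffix lists are hypothetical and chosen only for illustration; the poster does not publish the stemmer's actual affix lists.

```python
# Illustrative sketch only: the suffix lists used here are assumptions,
# not the stemmer's real affix lists.
PLACE_SUFFIX = "\u0633\u062a\u0627\u0646"  # ستان /stân/ (indicates a place)
PLURAL_HA = "\u0647\u0627"                 # ها (plural suffix, as in "boys")


def strip_suffix_once(word: str, suffixes: list[str]) -> str:
    """Remove the longest suffix from `suffixes` that ends `word`, if any."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)]
    return word


pakistan = "\u067e\u0627\u06a9\u0633\u062a\u0627\u0646"  # پاکستان /pâkestân/

# A careless list that includes the place suffix over-stems the country name.
over_stemmed = strip_suffix_once(pakistan, [PLACE_SUFFIX, PLURAL_HA])
# A list that leaves the place suffix out keeps the word intact.
kept = strip_suffix_once(pakistan, [PLURAL_HA])
```

With the careless list the country name collapses to the unrelated stem /pâk/, while the second list leaves it unchanged.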
Farsi has some affix-sensitive words, which are sensitive to a few
particular affixes. When one of those affixes is added to an
affix-sensitive word, some changes occur. For example, adding
the plural suffix "ان" /ân/ to the suffix-sensitive word دانا /dânâ/
("savant") generates دانایان /dânâyân/ ("savants"), which has the
new letter "یـ" /y/ just before the suffix. When these affixes meet
affix-sensitive words, they may add, remove or even replace one
or more letters of the original word. If the stemmer just removes
the affixes and takes no account of these changes, it will fail to
match the root with its variants. In the previous example, if the
plural suffix "ان" is removed without any additional processing,
the string دانای /dânây/ remains, which is not equal to the original
word دانا.
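The /dânâyân/ example can be sketched as follows. This is a minimal illustration, not the stemmer's actual rule: the function name `strip_plural_an` is hypothetical, and the test "the stem ends in alef followed by yeh" is an assumed way of detecting the inserted linking letter without a dictionary. The minimum length of 2 is the value the poster gives for its length condition.

```python
PLURAL_AN = "\u0627\u0646"  # ان /ân/ (plural suffix)
LINKING_YEH = "\u06cc"      # ی /y/ (letter inserted before the suffix)
ALEF = "\u0627"             # ا


def strip_plural_an(word: str, min_len: int = 2) -> str:
    """Remove the plural suffix and undo the linking letter it introduced."""
    if not word.endswith(PLURAL_AN):
        return word
    stem = word[: -len(PLURAL_AN)]
    # Assumed rule: a stem that now ends in alef + yeh received the yeh as a
    # linking letter when the suffix was attached, so the yeh is dropped too.
    if stem.endswith(ALEF + LINKING_YEH):
        stem = stem[:-1]
    # Keep the original word if the result would be shorter than min_len.
    return stem if len(stem) >= min_len else word
```

Removing only the suffix would leave /dânây/; dropping the linking letter as well recovers /dânâ/, so the plural and the base form map to the same stem.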
Over time, words have been imported into Farsi from other
languages such as Arabic, English and French. Arabic has had the
greatest effect, so some Arabic grammatical rules have been
imported as well and are applied to the Arabic loan words.
Fortunately, these rules are mostly used only for those Arabic
words (although they are sometimes applied to non-Arabic loan
words too). So, in addition to Farsi grammar, we studied the
imported Arabic grammatical rules and took them into account in
the stemmer. Instead of looking words up in a dictionary, the
stemmer tries to locate the four discriminator letters "پ" /p/,
"ژ" /ž/, "گ" /g/ and "چ" /č/ in a word to determine its original
language. These letters do not occur in the Arabic alphabet, so a
word containing any of them is not Arabic. When the stemmer is
satisfied that a given word is Arabic, it proceeds through some
Arabic stemming tasks.
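The discriminator-letter check can be sketched as below. The function names are hypothetical. The poster does not say what further evidence the stemmer uses before it treats a word as Arabic, so this sketch only rules Arabic out; the absence of the four letters is a necessary condition, not proof.

```python
# The four letters named in the text: پ /p/, ژ /ž/, گ /g/, چ /č/.
DISCRIMINATOR_LETTERS = frozenset("\u067e\u0698\u06af\u0686")


def has_discriminator_letter(word: str) -> bool:
    """True if the word contains a letter that Arabic does not have."""
    return any(ch in DISCRIMINATOR_LETTERS for ch in word)


def may_be_arabic(word: str) -> bool:
    """A word with none of the four letters is only a candidate Arabic word."""
    return not has_discriminator_letter(word)
```

For instance, پسر /pesar/ contains پ and is ruled out, while a word such as کتاب contains none of the four letters and remains a candidate for the Arabic stemming tasks.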
3. DESIGN OF THE FARSI STEMMER
The stemmer is completely rule-based. Each rule is activated by
an affix, and the result of each rule is one or more actions.
Stemming starts by removing noun suffixes and verbal suffixes;
the stemmer then removes prefixes. In all phases, the stemmer
looks for the changes that have been made to suffix-sensitive
words.
The stemmer has ten phases, each dedicated to one or more
particular grammatical rules. One general condition applies: after
a rule's action(s) have been carried out, the length of the resulting
root must not be less than a certain number. We take 2 letters as
the minimum length of a word. The ten phases of the stemmer
are:
1- Removing¹ "ی" /y/ (indefinite article / possessive suffix)
2- Removing the auxiliary suffix "ند" /nd/
3- Removing possessive and auxiliary suffixes
4- Removing the possessive suffixes "ت" /t/ and "تان" /tân/
5- Removing plural suffixes
6- Removing comparative suffixes
7- Removing other suffixes
8- Removing "ن" /n/ (sign of the infinitive)
9- Removing special end letters
10- Removing prefixes
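The phase structure and the minimum-length condition can be sketched as a small pipeline. This is an assumption-laden illustration rather than the stemmer itself: `make_suffix_phase`, `run_phases` and `DEMO_PHASES` are hypothetical names, each demo phase only strips one suffix, and only two plural suffixes mentioned in the text are included instead of the full ten phases.

```python
from typing import Callable, Sequence

MIN_STEM_LENGTH = 2  # minimum root length stated in the text

Phase = Callable[[str], str]


def make_suffix_phase(suffix: str) -> Phase:
    """Build a phase that strips one suffix when the word ends with it."""
    def phase(word: str) -> str:
        return word[: -len(suffix)] if word.endswith(suffix) else word
    return phase


def run_phases(word: str, phases: Sequence[Phase],
               min_len: int = MIN_STEM_LENGTH) -> str:
    """Apply the phases in order, rejecting any result that is too short."""
    for phase in phases:
        candidate = phase(word)
        # General condition: keep a phase's result only if the root is
        # still at least min_len letters long.
        if len(candidate) >= min_len:
            word = candidate
    return word


# Demo phases (assumed): two plural suffixes that appear in the text.
DEMO_PHASES = [
    make_suffix_phase("\u0647\u0627"),  # ها
    make_suffix_phase("\u0627\u0646"),  # ان
]
```

With these demo phases, پسرها is reduced to پسر, whereas the three-letter word نان is left untouched because stripping ان would leave a single letter.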
4. EVALUATION
To evaluate the stemmer, a 250 MB collection containing
43,680 Farsi documents was used. These documents cover several
subjects, such as sport, economics, politics and history. For more
information about this collection, the reader is referred to [2].
The evaluation used 25 queries. The relevant documents for
each query were selected by a native Farsi speaker. First, the
system (a classic vector-based system) was started without the
stemmer in the indexer and the searcher. The queries were fed to
the system and its performance was evaluated (Table 5). Then the
system was restarted with the stemmer and evaluated again with
the same queries (Table 6). Comparing the two tables, the system
that used the stemmer was better by 0.151 in average interpolated
precision (0.394 versus 0.243) and by about 46% in average
precision (0.176 versus 0.120).
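The reported gain can be checked against the figures in Tables 5 and 6. The sketch below copies the per-recall-level values from the two tables and recomputes the averages; the variable and function names are illustrative only.

```python
from statistics import fmean

# Values at recall levels 0, 10, ..., 100, copied from Tables 5 and 6.
PRECISION_NO_STEMMER = [0.153, 0.225, 0.293, 0.224, 0.130, 0.124,
                        0.062, 0.032, 0.007, 0.000, 0.070]
INTERPOLATED_NO_STEMMER = [0.581, 0.468, 0.432, 0.333, 0.246, 0.191,
                           0.122, 0.081, 0.078, 0.070, 0.070]
PRECISION_STEMMER = [0.140, 0.236, 0.372, 0.276, 0.245, 0.264,
                     0.207, 0.050, 0.014, 0.000, 0.134]
INTERPOLATED_STEMMER = [0.742, 0.685, 0.633, 0.516, 0.463, 0.408,
                        0.314, 0.166, 0.143, 0.134, 0.134]


def relative_gain(before: float, after: float) -> float:
    """Improvement of `after` over `before` as a fraction of `before`."""
    return (after - before) / before


avg_precision_before = fmean(PRECISION_NO_STEMMER)
avg_precision_after = fmean(PRECISION_STEMMER)
avg_interp_before = fmean(INTERPOLATED_NO_STEMMER)
avg_interp_after = fmean(INTERPOLATED_STEMMER)

# Absolute gain in average interpolated precision.
interp_absolute_gain = avg_interp_after - avg_interp_before
# Relative gain in average (non-interpolated) precision.
precision_relative_gain = relative_gain(avg_precision_before, avg_precision_after)
```

The recomputed averages match the "Average" rows of the tables. The 0.151 figure corresponds to the absolute gain in average interpolated precision, and the figure of about 46% corresponds to the relative gain in average precision; measured in relative terms, the gain in average interpolated precision is about 62%.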
5. FUTURE WORK
The results of our stemming test indicate that the Farsi stemmer
improves retrieval. Our tests were done on a small collection, so
the effect of the stemmer on bigger collections is not yet known.
The stemmer can be improved in several ways; for example, a list
of present-tense roots and their corresponding past-tense roots
would help in retrieving verbs better.
¹ "Removing" is shorthand: each phase may perform other
stemming tasks in addition to removing the affixes.
Table 5. Average precision/recall results using no stemmer
Recall (%)   Precision   Interpolated precision
0            0.153       0.581
10           0.225       0.468
20           0.293       0.432
30           0.224       0.333
40           0.130       0.246
50           0.124       0.191
60           0.062       0.122
70           0.032       0.081
80           0.007       0.078
90           0.000       0.070
100          0.070       0.070
Average      0.120       0.243
Table 6. Average precision/recall results using stemmer
Recall (%)   Precision   Interpolated precision
0            0.140       0.742
10           0.236       0.685
20           0.372       0.633
30           0.276       0.516
40           0.245       0.463
50           0.264       0.408
60           0.207       0.314
70           0.050       0.166
80           0.014       0.143
90           0.000       0.134
100          0.134       0.134
Average      0.176       0.394
6. REFERENCES
[1] Shariat, M. J. Simple Farsi Grammar (Second impression).
Asaatir, 2000, Iran.
[2] Darrudi, E., Hejazi, M. R., Oroumchian, F. Assessment of a
Modern Farsi Corpus. In Proceedings of the 2nd Workshop
on Information Technology & its Disciplines (WITID) 2004,
ITRC, Iran.
[3] Taghva, K., Beckley, R., Sadeh, M. A Stemming Algorithm
for the Farsi Language. In Proceedings of the International
Conference on Information Technology: Coding and
Computing (ITCC'05), Volume I, pp. 158-162.
[4] Samiei (Gilani), A. Writing and Editing (Third impression).
The Organization for Researching and Composing
University Textbooks in the Humanities (SAMT), 2001, Iran.