Introduction to a new Farsi stemmer

January 2006

DOI:10.1145/1183614.1183750

Conference: Proceedings of the 2006 ACM CIKM International Conference on Information and Knowledge Management, Arlington, Virginia, USA, November 6-11, 2006

Authors:

Saber Jahanpour

Worcester Polytechnic Institute

Introduction to a New Farsi Stemmer

Alireza Mokhtaripour

Department of Electrical and Computer Engineering

Shahid Beheshti University

Tehran, Iran

a_mokhtaripour@yahoo.com

Saber Jahanpour

Department of Electrical and Computer Engineering

Shahid Beheshti University

Tehran, Iran

jahanpour.saber@gmail.com

ABSTRACT

In this poster, we introduce a new Farsi (also called Persian) stemmer that works without a dictionary. Evaluation results show a significant improvement in the performance (precision/recall) of an Information Retrieval (IR) system that uses this stemmer.

Categories and Subject Descriptors

H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing.

General Terms

Design, Languages, Performance.

Keywords

Farsi, Persian, information retrieval, stemmer, language.

1. INTRODUCTION

Farsi is an Indo-European language spoken and written primarily in Iran, Tajikistan, and parts of Afghanistan. The Farsi alphabet contains 32 letters, and Farsi is written from right to left. Some other languages, such as Arabic, Kurdish, and Urdu, use the same form of script as Farsi but have their own characteristics. Farsi also has characteristics of its own, such as not using accent marks (except in special cases) and polymorphism in writing.

One popular technique for improving the performance of IR systems is to give searchers ways of finding morphological variants of search terms. This can be done with a stemmer. A stemmer recovers the stem, or base form, of a word either by stripping off affixes or by a static lookup in a word list (such as a dictionary) for irregular forms. For example, consider the two words پسری /pesri:/ ("a boy") and پسرها /pesrhâ/ ("boys"). The former is the indefinite form and the latter is the plural form of پسر /pesr/ ("boy"), and both can be stemmed to the word "پسر".

In this poster we introduce a new Farsi stemmer that works without a dictionary. Evaluation results show a significant improvement in the performance of the IR system using this stemmer. In addition to this introduction, this poster contains four more sections. Section two introduces the new advantages of this stemmer. Section three is dedicated to the design of the stemmer. Section four presents the evaluation results of the stemmer in terms of precision/recall. Section five outlines future work.

CIKM’06, November 5–11, 2006, Arlington, Virginia, USA.

ACM 1-59593-433-2/06/0011.

2. THE FARSI STEMMER

Removing affixes is the main task of a stemmer, and it is the first solution that comes to mind when designing one. Some existing Farsi stemmers, such as [3], work in this manner: they look for affixes and simply remove them. This is the whole idea behind many of them. However, Farsi has some problems and exceptions, and ignoring them decreases the performance of those stemmers dramatically. We therefore studied the Farsi language carefully and set three objectives for the stemmer. These objectives made the stemmer more accurate and sophisticated:

• Looking for a vast variety of affixes
• Considering the changes in the original word after adding affixes
• Taking care of loan words and their imported rules

Many affixes are removed by the stemmer. However, removing some affixes from a word generates a root that is far from the original word in meaning. For example, if the suffix "ستان" /stân/ (indicating a place) is removed from the word پاکستان /pakestân/ (Pakistan, the country), the stem پاک /pak/ ("clear") is left, which is far from the original word in meaning. The affixes to remove should therefore be selected carefully to avoid this problem.

In Farsi there are some affix-sensitive words, which are sensitive to a few particular affixes. When one of those affixes is added to an affix-sensitive word, some changes occur. For example, adding the plural suffix "ان" /ân/ to the affix-sensitive word دانا /dânâ/ ("savant") generates دانایان /dânâyân/ ("savants"), which has the new letter "ی" /y/ just before the suffix. When these affixes meet affix-sensitive words, they can add, remove, or even replace one or more letters in the original word. If the stemmer just removes the affixes and ignores these changes, it will fail to match the root with its variants. In the previous example, if the plural suffix "ان" is removed without any additional processing, the word دانای /dânây/ remains, which is not equal to the original word دانا.
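A minimal Python sketch of this repair step follows. The condition used here (undo a linking "ی" only when it follows a stem-final "ا") is a simplification assumed for illustration, and the names `strip_plural`, `PLURAL_SUFFIX`, and `LINKING_LETTER` are hypothetical; the stemmer's real rules for affix-sensitive words are not spelled out in this poster.

```python
PLURAL_SUFFIX = "ان"
LINKING_LETTER = "ی"
FINAL_ALEF = "ا"


def strip_plural(word: str) -> str:
    """Remove the plural suffix and undo the letter it inserted, if any."""
    if not word.endswith(PLURAL_SUFFIX):
        return word
    stem = word[: -len(PLURAL_SUFFIX)]
    # Assumed rule: a stem ending in alef + linking letter had the linking
    # letter inserted when the suffix was attached, so remove it as well.
    if stem.endswith(FINAL_ALEF + LINKING_LETTER):
        stem = stem[: -len(LINKING_LETTER)]
    return stem


savant = "دانا"
savants = savant + LINKING_LETTER + PLURAL_SUFFIX
print(strip_plural(savants) == savant)  # True
```

Without the second step the function would return the stem with the extra linking letter attached, and the plural would no longer match its singular in the index.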

Over time, some words have been imported into Farsi from other languages such as Arabic, English, and French. Arabic has had the greatest effect, so some Arabic grammatical rules were imported as well and are applied to the imported (loan) Arabic words. Fortunately, these rules are mostly used only for those Arabic words (although they are sometimes applied to non-Arabic loan words too). So, in addition to Farsi grammar, we studied the imported Arabic grammatical rules and took them into account in the stemmer.

Instead of looking words up in a dictionary, the stemmer tries to locate the four discriminator letters "پ" /p/, "ژ" /ʒ/, "گ" /g/ and "چ" /tʃ/ in a word to determine its original language. When the stemmer is sure that a given word is Arabic, it proceeds through some Arabic stemming tasks.
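The letter test can be sketched as follows. The four letters do not occur in the Arabic alphabet, so finding one of them rules out an Arabic origin; their absence only leaves that origin possible. How the stemmer goes from "possible" to "sure" is not described here, so the function names and the two-way split below are assumptions for illustration.

```python
# The four discriminator letters, written as escapes to avoid look-alike
# code points: pe, zhe, gaf, che.
DISCRIMINATOR_LETTERS = frozenset("\u067e\u0698\u06af\u0686")


def has_discriminator_letter(word: str) -> bool:
    """True if the word contains a letter that Arabic spelling does not use."""
    return any(ch in DISCRIMINATOR_LETTERS for ch in word)


def may_be_arabic(word: str) -> bool:
    """A word stays a candidate for Arabic rules only if no such letter occurs."""
    return not has_discriminator_letter(word)


print(may_be_arabic("\u067e\u0633\u0631"))  # pe-sin-re ("boy"): False
```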

3. DESIGN OF THE FARSI STEMMER

The stemmer is completely rule-based. Each rule is activated by an affix, and the result of each rule is one or more actions. Stemming starts by removing noun suffixes and verbal suffixes; the stemmer then removes prefixes. In all phases, the stemmer looks for the changes that have been made to suffix-sensitive words.

The stemmer has ten phases. Each phase is dedicated to one or more particular grammatical rules. One general condition applies: after a rule's action(s) are performed, the length of the resulting root must not be less than a certain number. We use 2 letters as the minimum length of a word. The ten phases of the stemmer are:

1- Removing¹ "ی" /y/ (indefinite article / possessive suffix)
2- Removing auxiliary suffix "ند" /nd/
3- Removing possessive and auxiliary suffixes
4- Removing possessive suffixes: "ت" /t/ and "تان" /tân/
5- Removing plural suffixes
6- Removing comparative suffixes
7- Removing other suffixes
8- Removing "ن" /n/ (sign of infinitive)
9- Removing special end letters
10- Removing prefixes
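The phase structure and the minimum-length condition can be sketched as a small pipeline. Only the suffixes that the list above names explicitly are included, so this covers phases 1, 2, 4, and 8 and nothing else; the helper `make_suffix_phase` and the names `PHASES` and `stem` are hypothetical, and each real phase may do more than strip a suffix.

```python
from typing import Callable, List

MIN_STEM_LENGTH = 2  # minimum root length given in the text

Phase = Callable[[str], str]


def make_suffix_phase(suffixes: List[str]) -> Phase:
    """Build a phase that strips the first matching suffix, longest listed first."""

    def phase(word: str) -> str:
        for suffix in suffixes:
            remaining = len(word) - len(suffix)
            # General condition: never shorten a root below the minimum length.
            if word.endswith(suffix) and remaining >= MIN_STEM_LENGTH:
                return word[: -len(suffix)]
        return word

    return phase


# Phases run in the listed order; the unnamed phases are omitted in this sketch.
PHASES: List[Phase] = [
    make_suffix_phase(["ی"]),          # phase 1: indefinite / possessive
    make_suffix_phase(["ند"]),         # phase 2: auxiliary suffix
    make_suffix_phase(["تان", "ت"]),   # phase 4: possessive suffixes
    make_suffix_phase(["ن"]),          # phase 8: sign of infinitive
]


def stem(word: str) -> str:
    """Apply every phase once, in order."""
    for phase in PHASES:
        word = phase(word)
    return word
```

Keeping each phase as a separate function makes the ordering explicit, which matters because an earlier phase changes what later phases see.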

4. EVALUATION

To evaluate the stemmer, a 250 MB collection containing 43,680 Farsi documents was used. These documents cover several subjects, such as sports, economics, politics, and history. For more information about this collection, the reader is referred to [2].

Twenty-five queries were used in the evaluation. The relevant documents for each query were selected by a native Farsi speaker. First, the system (a classic vector-based system) was run without the stemmer in the indexer and the searcher. The queries were fed to the system and its performance was evaluated (Table 5). Then the system was restarted with the stemmer, and it was evaluated again with the same queries (Table 6). Comparing these two tables, the system that used the stemmer was 0.151 or 46% better.
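One reading that reproduces both figures from the "Average" rows of Tables 5 and 6 is sketched below: the 0.151 matches the absolute gain in average interpolated precision, and the 46% matches the relative gain in average precision. This pairing is an inference from the arithmetic, not something the text states, and the variable names are chosen here for illustration.

```python
# "Average" rows of Table 5 (no stemmer) and Table 6 (stemmer).
avg_precision_without, avg_precision_with = 0.120, 0.176
avg_interpolated_without, avg_interpolated_with = 0.243, 0.394

# Absolute gain in average interpolated precision.
absolute_gain = avg_interpolated_with - avg_interpolated_without

# Relative gain in average (non-interpolated) precision.
relative_gain = (avg_precision_with - avg_precision_without) / avg_precision_without

print(f"{absolute_gain:.3f}")  # 0.151
print(f"{relative_gain:.1%}")  # 46.7%
```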

5. FUTURE WORK

The results of our stemming test indicate that the Farsi stemmer improves retrieval. Our tests were done on a small collection, so the effect of the stemmer on bigger collections is not known at this time. There are some ways to improve the stemmer; for example, a list of present-tense roots and their corresponding past-tense roots would help improve the retrieval of verbs.

¹ "Removing" is a short notation. Each phase may have some other stemming tasks in addition to just removing the affixes.

Table 5. Average precision/recall results using no stemmer

recall        100
precision     0.153  0.225  0.293  0.224  0.130  0.124  0.062  0.032  0.007  0.000  0.070
interpolated  0.581  0.468  0.432  0.333  0.246  0.191  0.122  0.081  0.078  0.070
Average       0.120 (precision)  0.243 (interpolated)

Table 6. Average precision/recall results using stemmer

recall        100
precision     0.140  0.236  0.372  0.276  0.245  0.264  0.207  0.050  0.014  0.000  0.134
interpolated  0.742  0.685  0.633  0.516  0.463  0.408  0.314  0.166  0.143  0.134
Average       0.176 (precision)  0.394 (interpolated)
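For readers unfamiliar with the "interpolated" column, the sketch below shows the usual IR definition for a single query: the interpolated precision at a recall level is the highest precision observed at that level or any higher one. Whether the tables were produced with exactly this definition is an assumption; the tables also average over 25 queries, so their rows need not satisfy this relation directly. The function name and the toy input are hypothetical.

```python
from typing import List


def interpolate(precisions: List[float]) -> List[float]:
    """Precisions are ordered by increasing recall level; return the
    interpolated values in the same order."""
    interpolated = []
    best = 0.0
    # Walk from the highest recall level down, carrying the running maximum.
    for precision in reversed(precisions):
        best = max(best, precision)
        interpolated.append(best)
    return interpolated[::-1]


# Toy single-query values, not taken from the tables.
print(interpolate([0.5, 0.2, 0.4, 0.1]))  # [0.5, 0.4, 0.4, 0.1]
```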

6. REFERENCES

[1] Shariat, M. J. Simple Farsi Grammar (Second impression). Asaatir, 2000, Iran.

[2] Darrudi, E., Hejazi, M. R., Oroumchian, F. Assessment of a Modern Farsi Corpus. In Proceedings of the 2nd Workshop on Information Technology & its Disciplines (WITID) 2004, ITRC, Iran.

[3] Taghva, K., Beckley, R., Sadeh, M. A Stemming Algorithm for the Farsi Language. In Proceedings of the International Conference on Information Technology: Coding and Computing (ITCC'05), Volume I, pp. 158-162.

[4] Samiei (Gilani), A. Writing and Editing (Third impression). The Organization for Researching and Composing University Textbooks in the Humanities (SAMT), 2001, Iran.
